ClinGen ClinVar The Seqr Platform and the Matchmaker Exchange
Heidi L. Rehm, PhD, FACMG
Director, Partners HealthCare Laboratory for Molecular Medicine
Medical Director, Broad Institute Clinical Research Sequencing Platform
Associate Professor of Pathology, Brigham and Women's Hospital and Harvard Medical School
Discovery
• Center for Mendelian Genomics
• Matchmaker Exchange
Standards & Knowledgebases
• ClinGen
• ClinVar
• GA4GH
Clinical Implementation
• Clinical Diagnostics (Partners LMM, Broad CRSP)
• MedSeq (CSER)
• BabySeq (NSIGHT)
• eMERGE
Data resources to support genomics
• Patient data stores (standardized phenotype and genotype)
  dbGaP, EGA, and many other databases
• Platforms for genomic data analysis for causality
  Commercial or academic, +/- open-source – Seqr Platform
  Matchmaker Exchange
• Database for sharing interpreted variants according to evidence and impact – ClinVar
• Database for reporting gene-disease relationships – OMIM
• Database for defining the strength of evidence and actionability for gene-disease relationships – ClinGen gene resource
The Clinical Genome Resource
Purpose: Create an authoritative central resource that defines the clinical relevance of genes and variants for use in precision medicine and research
Rehm et al. ClinGen - The Clinical Genome Resource. N Engl J Med 2015;372:2235-2242. www.clinicalgenome.org
>400 people from >90 institutions
ClinGen Acknowledgements
ClinGen Steering Committee: Jonathan Berg (UNC), Lisa Brooks (NHGRI), Carlos Bustamante (Stanford), Mike Cherry (Stanford), James Evans (UNC), Andy Faucett (Geisinger), Katrina Goddard (Kaiser Permanente), Danuta Krotoski (NICHD), Melissa Landrum (NCBI), David Ledbetter (Geisinger), Christa Lese Martin (Geisinger), Aleks Milosavljevic (Baylor), Robert Nussbaum (UCSF), Kelly Ormond (Stanford), Sharon Plon (Baylor), Erin Ramos (NHGRI), Heidi Rehm (Harvard), Sheri Schully (NCI), Steve Sherry (NCBI), Michael Watson (ACMG), Kirk Wilhelmsen (UNC), Marc Williams (Geisinger)
Program Coordinators: Danielle Azzariti, Brianne Kirkpatrick, Kristy Lee, Laura Milko, Annie Niehaus, Misha Rashkin, Erin Riggs, Andy Rivera, Cody Sam, Yekaterina Vaydylevich, Meredith Weaver
ClinGen Working Groups (WG)
Genomic Variant WG
Chairs: Christa Martin, Sharon Plon, Heidi Rehm
Sequence Variant Interpretation WG
Chairs: Les Biesecker, Marc Greenblatt
Phenotyping WG
Chair: David Miller
ClinVar IT Standards and Data Submission WG
Chairs: Karen Eilbeck, Melissa Landrum
Data Model WG
Chairs: Larry Babb, Chris Bizon
Informatics WG
Chair: Carlos Bustamante
Clinical Domain WGs
Hereditary Cancer: Matthew Ferber, Ken Offit, Sharon Plon
Somatic Cancer: Shashi Kulkarni, Subha Madhavan
Cardiovascular: Euan Ashley, Birgit Funke, Ray Hershberger
Metabolic: Rong Mao, Robert Steiner, Bill Craigen
Pharmacogenomic: Teri Klein, Howard McLeod
Education, Engagement, Access WG
Chairs: Andy Faucett, Erin Riggs
Consent and Disclosure Recommendations (CADRe) WG
Chairs: Andy Faucett, Kelly Ormond
Gene Curation WG
Chairs: Jonathan Berg, Christa Martin
Actionability WG
Chairs: Jim Evans, Katrina Goddard
EHR WG
Chair: Marc Williams
ClinGen Gene-Disease Validity Classification
http://www.clinicalgenome.org/knowledge-curation/gene-curation
ClinGen Gene-Disease Scoring Matrix
Proposed Gene Inclusion for Clinical Tests
Evidence levels: Definitive evidence; Strong evidence; Moderate evidence; Limited/Disputed/No evidence
Test types: Predictive Tests & SFs; Diagnostic Panels; Exome/Genome
Many ClinGen Clinical Domain WGs are initially focused on Gene Curation
Define genes appropriate for clinical testing and genes where additional evidence is needed
clinicalgenome.org
Available ClinGen Tools & Resources
Listed By Gene
Aggregating Variant Interpretations in ClinVar
[Figure: variant-level data flowing into ClinVar. Sources shown: researchers, clinics, and patients (Sharing Clinical Reports Project, GenomeConnect, Free-the-Data); patient registries; labs; expert groups; clinical groups; unpublished or literature citations; linked databases (InSiGHT, CFTR2, OMIM, BIC, PharmGKB)]
505 ClinVar submitters; 181,841 variants submitted; 126,974 unique interpreted variants
ClinVar as of April 26 2016
ClinVar Variant View
Assertion Levels in ClinVar
Practice Guideline (ACMG, CPIC)
Expert Panel (CFTR2, InSiGHT, PharmGKB, ENIGMA)
Multi-Source Consistency
Single Submitter – Criteria Provided
Single Submitter – No Criteria Provided (no stars)
No Assertion (not applicable)
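The assertion levels above form an ordered scale, which can be sketched as a lookup table. The star counts below follow ClinVar's public review-status convention and are an assumption added here (the slide itself only labels "No stars"); the tuple layout and the example submitters' interpretations are hypothetical.

```python
# Review-status levels mapped to star counts. The numbers are an assumption
# based on ClinVar's public star convention, not taken from the slide.
REVIEW_STATUS_STARS = {
    "Practice Guideline": 4,
    "Expert Panel": 3,
    "Multi-Source Consistency": 2,
    "Single Submitter - Criteria Provided": 1,
    "Single Submitter - No Criteria Provided": 0,
}


def best_supported(assertions):
    """Return the assertion with the highest review status.

    `assertions` is a list of (submitter, review_status, interpretation)
    tuples; this layout is hypothetical and only for illustration.
    """
    return max(assertions, key=lambda a: REVIEW_STATUS_STARS[a[1]])


example = [
    ("Lab A", "Single Submitter - Criteria Provided", "Likely pathogenic"),
    ("ENIGMA", "Expert Panel", "Pathogenic"),
    ("Lab B", "Single Submitter - No Criteria Provided", "Uncertain significance"),
]
top = best_supported(example)
print(top)
```

Ranking by review status is only one possible way to prioritize conflicting submissions; it does not replace discrepancy resolution by curators.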
Supporting a Curation Environment for both Crowd-Sourcing and Expert Consensus
Curated Variants
ClinVar
Variants
Gene and Variant Curation Interfaces
Case-level data store Machine-learning algorithms
Data resources
ClinGenKB
ClinGen Clinical WGs amp Expert Panels
Outside Expert Panels
Discrepancy Resolution
Primary Curators
Convene experts and implement methods for gene and variant curation
Resolve variant interpretation differences in ClinVar
Enable rule-guided variant interpretation and export (to user and ClinVar)
Conditions and phenotypes in ClinVar
Variant assertions are made on "conditions"
Phenotypic features are seen in patients (supporting observations)
Standardization of disease terminologies
Parkinson's disease subtypes
yellow = Orphanet, brown = OMIM, blue = Disease Ontology, pink = Monarch, gray = MeSH
Diseases need to be hierarchically related
https://github.com/monarch-initiative/monarch-ontology Courtesy of Melissa Haendel
Human Phenotype Ontology: 117,348 annotations for ~7,000 mainly monogenic diseases
Used to define phenotypic elements of a disease or patient
Courtesy of Melissa Haendel
Centers for Mendelian Genomics Phenotyping Standards Survey
[Figure: four bar charts, response counts on a 0-4 axis]
• Do you use any standard terminologies or ontologies when collecting/tracking phenotypes? (Yes, Sometimes, No, Other)
• Which systems are used to track disease name? (OMIM, Disease Ontology, ORDO (Orphanet), MeSH, MedGen, Other (please specify))
• Do you use Human Phenotype Ontology terms to capture more detailed phenotypic features on subjects? (Yes, Sometimes, Rarely)
• Which tools are used to collect phenotype? (PhenoDB, Patient Archive, PhenoTips)
Joint Center for Mendelian Genomics
Steering Committee: Daniel MacArthur, Heidi Rehm, Joe Gleeson, Eric Pierce, Alan Beggs, Mark Daly, Christine Seidman, David Sweetser, Vamsi Mootha
Coordination Team Project manager Hayley Brooks
Clinical project manager Sam Baxter Monkol Lek
Methods
Analysis Software
Brain Heart
Muscle
Hearing
Retinal
Mito
Other
Clinical Analysis
Monkol Lek Elise Valkanas Tom Mullen
Ben Weisburd Harindra Arachchi
Sarah Calvo Laura Gauthier Laurent Francioli
MacArthur Lab's seqr: Online platform for collaborative analysis
Platform allows collaborative analyses between central sequencing site and thousands of collaborators
enter structured phenotype data
seqr Online platform for collaborative analysis
https://seqr.broadinstitute.org
Rare Disease Analysis Platform seqr Online platform for collaborative analysis
Genomic Matchmaking
Patient 1 (Clinical Geneticist 1)
  Genotypic Data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
  Phenotypic Data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5
Patient 2 (Clinical Geneticist 2)
  Genotypic Data: Gene D, Gene G, Gene H
  Phenotypic Data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6
Genomic Matchmaker: Notification of Match
Courtesy of Joel Krier
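The two-patient example above can be sketched in a few lines: a match is suggested when the patients share a candidate gene, and phenotype overlap gives a rough similarity score. The `Patient` class and the use of Jaccard similarity are assumptions for illustration only; actual matchmakers use their own scoring methods.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patient:
    """Hypothetical minimal record: candidate genes plus phenotypic features."""
    label: str
    genes: frozenset
    features: frozenset


def match(a, b):
    """Return (shared candidate genes, Jaccard similarity of features)."""
    shared_genes = a.genes & b.genes
    union = a.features | b.features
    similarity = len(a.features & b.features) / len(union) if union else 0.0
    return shared_genes, similarity


# The two patients from the slide: genes A-F vs. D, G, H; features 1-5 vs. 1, 3-6.
patient1 = Patient("Patient 1", frozenset("ABCDEF"), frozenset({1, 2, 3, 4, 5}))
patient2 = Patient("Patient 2", frozenset("DGH"), frozenset({1, 3, 4, 5, 6}))

shared_genes, similarity = match(patient1, patient2)
print(sorted(shared_genes), round(similarity, 2))
```

Here the only shared candidate is Gene D, and four of the six distinct features overlap, which is the kind of result that would trigger a notification to both clinical geneticists.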
Matchmaker Exchange Collaboration and Support from GA4GH and IRDiRC
Needs span multiple GA4GH workgroups
• Data Work Group (data format and interfaces)
• Regulatory and Ethics (patient consent)
• Security (patient privacy and user authentication)
Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat 2015;36(10):922-7.
Human Mutation Special Issue
• The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
• The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
• GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
• PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
• Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
• Innovative genomic collaboration using the GENESIS (GEM.app) platform
• Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
• Participant-led matchmaking
• GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
• Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
• Data sharing in the Undiagnosed Disease Network
• The Genomic Birthday Paradox: How Much is Enough?
• Quantifying and mitigating false-positive disease associations in rare disease matchmaking
• Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
• GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue caused by de novo variants in HNRNPK
• Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies
The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery
www.matchmakerexchange.org
MME Use Cases
• Supported today
  – Match on Gene/Phenotype
      1 strong candidate gene (e.g. de novo variant in GUS)
      10 candidate genes, each with rare variant
  – Match on Phenotype Only
• Coming soon
  – Match case to non-human models
  – Match using phenotype and VCF
  – Patient-initiated matching
Connected and Soon to be Connected Matchmakers
[Figure: network of matchmakers around the Matchmaker Exchange, with a legend marking live connections. Labels: GENESIS, GeneMatcher, DECIPHER, RD-Connect, ClinGen GenomeConnect, Monarch, PhenomeCentral, Broad Institute RDAP, patient-initiated matching, model organisms (mouse, zebrafish), Orphanet, ClinVar, OMIM]
Connecting Data in the Big Data World
Centralized Database: Everyone submits data to a single central database
  Examples: ClinVar, dbGaP, EGA
Centralized Hub: APIs connect each database to a central hub
  Example: Many commercial platforms
Federated Network: All databases connected through multiple APIs
  Example: Matchmaker Exchange
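One practical difference between the hub and federated designs above is how the number of API connections grows. A minimal sketch, assuming a federated network in which every pair of databases is connected directly (the slide does not say the network is fully meshed, so this is an upper bound):

```python
def hub_links(n_databases):
    """Centralized hub: one API connection per database to the hub."""
    return n_databases


def federated_links(n_databases):
    """Federated network, assuming every pair is directly connected."""
    return n_databases * (n_databases - 1) // 2


for n in (3, 6, 10):
    print(n, hub_links(n), federated_links(n))
```

With six matchmakers, a hub would need 6 connections while a full mesh needs 15, which is one reason a shared API specification matters for a federated network.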
Matchmaker Exchange Acknowledgements
S Balasubramanian, Mike Bamshad, Sergio Beltran Agullo, Jonathan Berg, Kym Boycott, Anthony Brookes, Michael Brudno, Han Brunner, Orion Buske, Deanna Church, Raymond Dalgliesh, Andrew Devereau, Johan den Dunnen, Helen Firth, Paul Flicek, Jan Friedman, Richard Gibbs, Marta Girdea, Robert Green, Matt Hurles, Ada Hamosh, Ekta Khurana, Sebastian Kohler, Joel Krier, Owen Lancaster, Melissa Landrum, Paul Lasko, Rick Lifton, Daniel MacArthur, Alex MacKenzie, Danielle Metterville, Debbie Nickerson, Woong-Yang Park, Justin Paschall, Anthony Philippakis, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner
NIAGADS
The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP
April 26, 2016
Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)
• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia
Long-Term Plan for the ADSP
• Discovery Phase 2014-2018
  • Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC'd data released 7/13/2015
  • Case-control
    • Whole exome sequencing (WES): 5,000 cases and 5,000 controls
    • 1,000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC'd data released 11/2015
• Follow-Up 2016-2021
  WGS in case-control sample sets, emphasizing ethnically diverse cohorts
ADSP Discovery Phase Analysis
• NIA funding for analysis of sequence data, June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing
Participation
• PARTICIPANTS
  Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  NIH staff: NHGRI and NIA
  External Consultants
• INFRASTRUCTURE
  Executive Committee, Analysis Coordination Committee, and 8 Work Groups
Genomics Center
Organization of the ADSP
ADSP Infrastructure and Support
NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]
National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]
National Alzheimer's Coordinating Center, NACC [U01 AG016976]
Alzheimer's Disease Centers
NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]
NIAGADS History and Function
Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
NIA's repository for genetic data on late-onset Alzheimer's disease
Datasets, Genomics Database, Analysis Resource
Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy
Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)
NIAGADS ADSP Data Coordinating Center
• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and Office of Science Policy on "Trusted Partners"
Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP
Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and eRA Commons accounts for authentication; no password is stored at the ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as meta-data
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
Data production: Define data formatting; Collect and curate phenotypes; Track data production; Additional quality metrics; Coordinate public data release
Secondary/derived data management: Documentation (README/Manifest), packaging, and release; Notification and tracking
IT support: Website; Members area; Face-to-face meeting support; dbGaP exchange area
Facilitate interactions with other AD investigators
Rapid response to unforeseen ADSP and NIH requests
Interactions
[Figure: interactions among NIAGADS, NIA/NHGRI, the Baylor/Broad/WashU LSACs, NCBI dbGaP/SRA, NCRAD, ADGC/CHARGE, the Genomics Center, and the ADSP Data Flow and other Work Groups]
ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site
Management of Work Group Data/Work Flow
Exchange Area
Data Deposition and Data Sharing
ADSP specific datasets: 68 Tb
21 Reference datasets with 5560 files in use by the ADSP
>20 Reference datasets planned
1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes. A 1 TB hard drive has the capacity to hold a trillion bytes.
dbGaP/NIAGADS release for the research community at large
11,555 BAM files released 12/2014
Pedigree and phenotype information
Sequencing Quality Control Metrics
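The unit conversion quoted above can be checked directly. The sketch also shows that the binary terabyte used in that chain is slightly larger than the "trillion bytes" of a decimal terabyte; the variable names are illustrative only.

```python
# Binary units, as used in the conversion chain on the slide.
KB = 1024
MB = 1024 ** 2
GB = 1024 ** 3
TB_BINARY = 1024 ** 4

# Decimal terabyte: the "trillion bytes" figure.
TB_DECIMAL = 10 ** 12

print(TB_BINARY // GB, TB_BINARY // MB, TB_BINARY // KB, TB_BINARY)

# How much larger the binary terabyte is than a trillion bytes, in percent.
excess_percent = (TB_BINARY - TB_DECIMAL) / TB_DECIMAL * 100
print(round(excess_percent, 2))
```

The roughly 10% gap matters when budgeting storage at the scale of tens of terabytes.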
Size of the ADSP
Reference: "The real cost of sequencing: scaling computation to keep pace with data generation." Muir et al. Genome Biology 2016;17:53
ADSP Members Area (Dashboard)
Calendar, Documents, Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list: 153 records
352 meeting minutes
>600 other documents
>100 consent files
10 analysis plans
85TB WGS, 1075TB WES data (dbGaP/SRA)
42 TB Files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
SNP/Gene reports, Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer's Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
GWAS results combined with
SNPs below
Gene annotations transformed to SNPs
Genomic features from identified SNPs
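The combined search described above (GWAS results filtered by significance, intersected with gene annotations mapped to SNPs) can be sketched with plain sets. All SNP identifiers, gene names, annotation terms, and the 5e-8 threshold below are made-up illustrations, not data from the NIAGADS Genomics Database.

```python
# Conventional genome-wide significance level; an assumption, since the slide
# lets the user choose the significance level.
GWAS_THRESHOLD = 5e-8

# Hypothetical GWAS p-values per SNP.
gwas_p_values = {"rs0001": 3e-9, "rs0002": 0.02, "rs0003": 1e-10, "rs0004": 4e-8}

# Hypothetical gene-to-SNP mapping and gene annotations.
gene_to_snps = {
    "GENE_A": {"rs0001", "rs0002"},
    "GENE_B": {"rs0003"},
    "GENE_C": {"rs0004"},
}
gene_annotations = {
    "GENE_A": {"pathway_x"},
    "GENE_B": {"pathway_y"},
    "GENE_C": {"pathway_x"},
}


def significant_snps(p_values, threshold):
    """SNPs whose GWAS p-value falls below the chosen significance level."""
    return {snp for snp, p in p_values.items() if p < threshold}


def snps_for_annotation(term):
    """Transform a gene annotation into the set of SNPs in annotated genes."""
    snps = set()
    for gene, terms in gene_annotations.items():
        if term in terms:
            snps |= gene_to_snps[gene]
    return snps


hits = significant_snps(gwas_p_values, GWAS_THRESHOLD) & snps_for_annotation("pathway_x")
print(sorted(hits))
```

The intersection keeps only SNPs that are both significant and located in a gene carrying the annotation of interest.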
Find a genomic location of interest by viewing tracks on the genome browser
SNPs
genes
GWAS results
functional genomics data
View Genomic Information by the Genome Browser
search result
Data available at the NIAGADS Genomics Database
Gene models, SNP information
Pathway annotations: Gene Ontology, KEGG Pathway
Functional genomics data: ENCODE, FANTOM5, GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics
Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  • Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  • Big data management
  • IT infrastructure and web development
• Administrative support and collaboration
  • Familiarity with human subjects research regulations and NIH policy
  • Leadership and coordinating skills
  • Flexibility; close collaboration with program officers
• Domain expertise
  • Knowledge of the disease and genetic research
  • Familiarity with the community and cohorts
  • Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
What the Data Coordinating Center Should Know
Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss them
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements
Patients and families of those with AD
NIA ADRC program and centers
NACC
NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD
NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly
NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon
NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao
GDC Overview: GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence, both from existing (e.g. TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high level data
GDC Overview: Infrastructure
[Figure: architecture diagram in three groups: GDC Users, GDC System Components, GDC Interfaces. Labels: Data Sources (DCC, CGHub), Data Submitters, Open Access Data Users, Controlled Access Data Users, eRA Commons & dbGaP, 3rd Party Applications, Data Access Tools, Data Submission Tools, Alignment & Processing Tools, APIs, Data Import System, Metadata & Data Storage, Reporting System, Harmonization & Generation Pipelines, Data Security System, Digital ID System]
GDC Overview: Resources
GDC Data Portal
GDC Data Center
GDC Data Transfer Tool
GDC Reports
GDC Data Model
GDC Data Submission Portal
GDC Application Programming Interface (API)
GDC Bioinformatics Pipeline
GDC Documentation and Support
GDC Organization and Collaborators
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification.
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores.
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning, quality control, and submission of revised data before release.
    • The pre-processing period may generally last up to six months from the date of first submission.
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication.
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data.
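The release timing in the policy above can be illustrated with a small date calculation. This sketch assumes that release happens at whichever comes first, six months after submission or first publication; the slide says "either ... or" without stating which takes precedence, so that reading is an assumption, as are the function names.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Add calendar months, clamping the day to the end of the target month."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def release_date(first_submission, first_publication=None):
    """Latest release date under the assumed 'whichever comes first' reading."""
    deadline = add_months(first_submission, 6)
    if first_publication is not None and first_publication < deadline:
        return first_publication
    return deadline


print(release_date(date(2016, 4, 26)))
print(release_date(date(2016, 4, 26), first_publication=date(2016, 7, 1)))
```

A submission on August 31 illustrates the clamping: six months later lands on the last day of February rather than on an invalid date.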
GDC Data Submission Data Submission Process
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample; Portion; Analyte; Aliquot | TSV, JSON for each
Clinical | Clinical Data | TSV, JSON | Demographic; Diagnosis; Exposure; Family History; Treatment | TSV, JSON for each
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
[Figure: a Case links to Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)), Biospecimen Data, and Experiment Data]
Data Types and File Formats: Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FASTQC
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
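A QC report like the one above is easier to act on when reduced to one status per read group. The sketch below takes the rows from the example table that are not uniformly PASS and reports the worst status for each read group; the data structure and the severity ordering are illustrative assumptions, not the GDC's reporting format.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Rows from the example table that contain at least one non-PASS status.
QC_MODULES = {
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}

# Assumed ordering: a FAIL outranks a WARNING, which outranks a PASS.
SEVERITY = {"PASS": 0, "WARNING": 1, "FAIL": 2}


def worst_status(modules, read_groups):
    """Map each read group to the most severe status across all modules."""
    summary = {}
    for index, read_group in enumerate(read_groups):
        statuses = [row[index] for row in modules.values()]
        summary[read_group] = max(statuses, key=SEVERITY.get)
    return summary


summary = worst_status(QC_MODULES, READ_GROUPS)
print(summary)
```

In the example, RG_1 and RG_2 each contain a FAIL, RG_3 has only warnings, and RG_4 passes every module.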
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
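A request against the create-entities endpoint might be assembled as follows. This is a minimal sketch that only builds the request and does not send it: the host name is a placeholder, the `X-Auth-Token` header name and the JSON entity shape are assumptions, and only the `POST /v0/submission/<program>/<project>` path comes from the slide.

```python
import json
import urllib.request

# Placeholder host for illustration; not the real GDC address.
BASE_URL = "https://gdc.example.org"


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST request for the create-entities endpoint."""
    url = f"{BASE_URL}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    request = urllib.request.Request(url, data=body, method="POST")
    request.add_header("Content-Type", "application/json")
    # Header name is an assumption; the slide only says a token is used.
    request.add_header("X-Auth-Token", token)
    return request


# Hypothetical entity payload; real field names come from the data dictionary.
req = build_create_request(
    "PROGRAM",
    "PROJECT",
    [{"type": "case", "submitter_id": "case-001"}],
    token="dummy-token",
)
print(req.get_method(), req.full_url)
```

Sending the request would be a call to `urllib.request.urlopen(req)` against a real host with a valid token.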
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure panels: DNA Sequence Pipeline; RNA Sequence Pipeline]
GDC Processing: GDC High Level Data Generation
• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Retrieval
• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
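A toy version of such a graph may make the idea concrete. The sketch below is not the GDC schema: the node labels, identifiers, and the `descendants` helper are hypothetical, chosen only to show how unique identifiers and edges can link a project to its cases, clinical data, and files.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An entity in the graph, addressed by a unique identifier."""
    node_id: str
    label: str  # e.g. "project", "case", "clinical", "file"
    edges: list = field(default_factory=list)  # ids of linked child nodes


class Graph:
    def __init__(self):
        self.nodes = {}

    def add_node(self, node_id, label):
        self.nodes[node_id] = Node(node_id, label)

    def add_edge(self, parent_id, child_id):
        self.nodes[parent_id].edges.append(child_id)

    def descendants(self, node_id, label):
        """Return ids of all nodes with `label` reachable from `node_id`."""
        found, stack = [], list(self.nodes[node_id].edges)
        while stack:
            current = self.nodes[stack.pop()]
            if current.label == label:
                found.append(current.node_id)
            stack.extend(current.edges)
        return found


# Hypothetical identifiers, for illustration only.
g = Graph()
g.add_node("proj-1", "project")
g.add_node("case-1", "case")
g.add_node("clin-1", "clinical")
g.add_node("file-1", "file")
g.add_edge("proj-1", "case-1")
g.add_edge("case-1", "clin-1")
g.add_edge("case-1", "file-1")
print(g.descendants("proj-1", "file"))
```

Because every file is reached by walking edges from its project and case, a query for "all files in a project" never depends on file names, only on identifiers.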
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
Data browsing by project, file, case, or annotation
Visualization allowing users to perform fine-grained filtering of search results
Data search using advanced smart search technology
Data selection into a personalized cart
Data download from cart or a high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
Command line interface to specify desired transfer protocol and multiple files for download
Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
Supports the secure transfer of controlled access data using an authentication key (token)
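To show what working with a manifest might look like, here is a small sketch that parses one and totals the bytes to transfer. It assumes the manifest is a tab-separated table with `id`, `filename`, `md5`, and `size` columns; that layout and the rows below are assumptions for illustration, not taken from the slides.

```python
import csv
import io

# Assumption: a manifest is a tab-separated table with one row per file.
# The rows below are made up for illustration.
EXAMPLE_MANIFEST = (
    "id\tfilename\tmd5\tsize\n"
    "uuid-0001\tsample_a.bam\t0123456789abcdef0123456789abcdef\t1024\n"
    "uuid-0002\tsample_b.bam\tfedcba9876543210fedcba9876543210\t2048\n"
)


def read_manifest(text):
    """Parse manifest text into a list of dicts, one per file."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    for row in rows:
        row["size"] = int(row["size"])  # sizes arrive as strings
    return rows


files = read_manifest(EXAMPLE_MANIFEST)
total_bytes = sum(row["size"] for row in files)
print(len(files), "files,", total_bytes, "bytes to transfer")
```

Checking the total size before starting helps confirm that local storage and network time are adequate for a multi-file download.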
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
Search for projects, files, cases, annotations and retrieve associated details in JSON format
Securely retrieve biospecimen, clinical, and molecular files
Perform BAM slicing
"data": {
  "hits": [
    {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
    {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
    {"project_id": "TCGA-LAML", "primary_site": "Blood"},
    {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
    {"project_id": "TCGA-UVM", "primary_site": "Eye"},
    {"project_id": "TARGET-AML", "primary_site": "Blood"},
    {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
    {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
    {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
    {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
  ]
}
(Portion removed for readability)
Labeled parts of the request: API URL, Endpoint, URL parameters, Query parameters
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
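The request shown on this slide can be reproduced in a short sketch that builds the query URL and reads a response of the same shape. No network call is made: the response is a canned string containing two of the hits from the slide, and the assumption that hits sit under `data` then `hits` is inferred from the slide's excerpt. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode

API_URL = "https://gdc-api.nci.nih.gov"  # API URL shown on the slide


def build_query_url(endpoint, fields, pretty=True):
    """Join the API URL, endpoint, and query parameters into one URL."""
    params = {"fields": ",".join(fields), "pretty": str(pretty).lower()}
    # safe="," keeps the comma-separated field list unescaped in the URL.
    return f"{API_URL}/{endpoint}?{urlencode(params, safe=',')}"


def project_sites(response_text):
    """Map project_id to primary_site for each hit in the response."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


url = build_query_url("projects", ["project_id", "primary_site"])

# Canned response (assumed shape); two hits copied from the slide.
sample = (
    '{"data": {"hits": ['
    '{"project_id": "TCGA-SKCM", "primary_site": "Skin"}, '
    '{"project_id": "TCGA-UVM", "primary_site": "Eye"}]}}'
)
print(url)
print(project_sites(sample))
```

Requesting only the fields that are needed keeps responses small, which matters when paging through thousands of files or cases.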
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
Targets data consumers, providers, developers, and general users
Provides access to information about GDC and contributed cancer genomic data sets
Instructs users on the use of GDC data access and submission tools
Provides descriptions of GDC bioinformatics pipelines
Documents supported GDC data types and file formats
Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site - https://gdc.nci.nih.gov
• GDC Documentation Site - https://gdc-docs.nci.nih.gov
• GDC Data Portal - https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal - https://gdc-portal.nci.nih.gov/submission - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool - https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API) - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api - https://gdc-api.nci.nih.gov
• GDC Support - GDC Assistance: support@nci-gdc.datacommons.io - GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data
  - Aggregation
  - Access
  - Sharing
  - Analysis
Whole genome sequence + Phenotype
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Figure: data flow. Labels: Birth defect or childhood cancer cohorts; DNA Sequencing center; Birth defect BAM/VCF and Cancer BAM/VCF; dbGaP; NCI GDC; Birth defect VCF and Cancer VCF; Gabriella Miller Kids First Data Resource (Index of datasets, Phenotype, Variant summaries); Users]
GMKF Data Resource
dbGaP
NCI GDC
TOPMed
The Monarch Initiative
ClinGen
Matchmaker Exchange
Center for Mendelian Genomics
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
Discovery
• Center for Mendelian Genomics
• Matchmaker Exchange
Standards & Knowledgebases
• ClinGen
• ClinVar
• GA4GH
Clinical Implementation
• Clinical Diagnostics (Partners LMM, Broad CRSP)
• MedSeq (CSER)
• BabySeq (NSIGHT)
• eMERGE
Data resources to support genomics
• Patient data stores (standardized phenotype and genotype)
dbGaP, EGA, and many other databases
• Platforms for genomic data analysis for causality
Commercial or academic, +/- open-source - Seqr Platform
Matchmaker Exchange
• Database for sharing interpreted variants according to evidence and impact - ClinVar
• Database for reporting gene-disease relationships - OMIM
• Database for defining the strength of evidence and actionability for gene-disease relationships - ClinGen gene resource
The Clinical Genome Resource
Purpose: Create authoritative central resource that defines the clinical relevance of genes and variants for use in precision medicine and research
Rehm et al. ClinGen - The Clinical Genome Resource. N Engl J Med 2015;372:2235-2242
www.clinicalgenome.org
>400 people from >90 institutions
ClinGen Acknowledgements
ClinGen Steering Committee: Jonathan Berg (UNC), Lisa Brooks (NHGRI), Carlos Bustamante (Stanford), Mike Cherry (Stanford), James Evans (UNC), Andy Faucett (Geisinger), Katrina Goddard (Kaiser Permanente), Danuta Krotoski (NICHD), Melissa Landrum (NCBI), David Ledbetter (Geisinger), Christa Lese Martin (Geisinger), Aleks Milosavljevic (Baylor), Robert Nussbaum (UCSF), Kelly Ormond (Stanford), Sharon Plon (Baylor), Erin Ramos (NHGRI), Heidi Rehm (Harvard), Sheri Schully (NCI), Steve Sherry (NCBI), Michael Watson (ACMG), Kirk Wilhelmsen (UNC), Marc Williams (Geisinger)
Program Coordinators: Danielle Azzariti, Brianne Kirkpatrick, Kristy Lee, Laura Milko, Annie Niehaus, Misha Rashkin, Erin Riggs, Andy Rivera, Cody Sam, Yekaterina Vaydylevich, Meredith Weaver
ClinGen Working Groups (WG)
Genomic Variant WG
Chairs: Christa Martin, Sharon Plon, Heidi Rehm
Sequence Variant Interpretation WG
Chairs: Les Beisecker, Marc Greenblat
Phenotyping WG
Chair: David Miller
ClinVar IT Standards and Data Submission WG
Chairs: Karen Eilbeck, Melissa Landrum
Data Model WG
Chairs: Larry Babb, Chris Bizon
Informatics WG
Chair: Carlos Bustamante
Clinical Domain WGs
Hereditary Cancer: Matthew Ferber, Ken Offit, Sharon Plon
Somatic Cancer: Shashi Kulkarni, Subha Madhavan
Cardiovascular: Euan Ashley, Birgit Funke, Ray Hershberger
Metabolic: Rong Mao, Robert Steiner, Bill Craigen
Pharmacogenomic: Teri Klein, Howard McLeod
Education, Engagement, Access WG
Chairs: Andy Faucett, Erin Riggs
Consent and Disclosure Recommendations (CADRe) WG
Chairs: Andy Faucett, Kelly Ormond
Gene Curation WG
Chairs: Jonathan Berg, Christa Martin
Actionability WG
Chairs: Jim Evans, Katrina Goddard
EHR WG
Chair: Marc Williams
ClinGen Gene-Disease Validity Classification
http://www.clinicalgenome.org/knowledge-curation/gene-curation
ClinGen Gene-Disease Scoring Matrix
Proposed Gene Inclusion for Clinical Tests
Definitive evidence
Strong evidence
Moderate evidence
Limited/Disputed/No evidence
Predictive Tests & SFs
Diagnostic Panels
Exome/Genome
Many ClinGen Clinical Domain WGs are initially focused on Gene Curation
Define genes appropriate for clinical testing and genes where additional evidence is needed
clinicalgenome.org
Available ClinGen Tools & Resources
Listed By Gene
Aggregating Variant Interpretations in ClinVar
[Figure: sources of variant-level data aggregated in ClinVar. Labels: Sharing Clinical Reports Project; Genome Connect; Free-the-Data; Researchers; Clinics; Patients; Patient Registries; Clinical Labs; Expert Groups; Unpublished or Literature Citations; Linked Databases; InSiGHT; CFTR2; OMIM; BIC; PharmGKB]
505 ClinVar submitters; 181,841 variants submitted; 126,974 unique interpreted variants
ClinVar as of April 26, 2016
ClinVar Variant View
Assertion Levels in ClinVar
Expert Panel
Single Submitter - Criteria Provided
Single Submitter - No Criteria Provided
Multi-Source Consistency
Practice Guideline
No stars
No Assertion Not applicable
ACMG CPIC
CFTR2 InSiGHT PharmGKB ENIGMA
Supporting a Curation Environment for both Crowd-Sourcing and Expert Consensus
Curated Variants
ClinVar
Variants
Gene and Variant Curation Interfaces
Case-level data store Machine-learning algorithms
Data resources
ClinGenKB
ClinGen Clinical WGs amp Expert Panels
Outside Expert Panels
Discrepancy Resolution
Primary Curators
Convene experts and implement methods for gene and variant curation
Resolve variant interpretation differences in ClinVar
Enable rule-guided variant interpretation and export (to user and ClinVar)
Conditions and phenotypes in ClinVar
Variant assertions are made on "conditions"
Phenotypic features are seen in patients (supporting observations)
Standardization of disease terminologies
Parkinson's disease subtypes
yellow = Orphanet, brown = OMIM, blue = Disease Ontology, pink = Monarch, gray = MeSH
Diseases need to be hierarchically related
https://github.com/monarch-initiative/monarch-ontology
Courtesy of Melissa Haendel
Human Phenotype Ontology: 117,348 annotations for ~7,000 mainly monogenic diseases
Used to define phenotypic elements of a disease or patient
Courtesy of Melissa Haendel
Centers for Mendelian Genomics Phenotyping Standards Survey
[Four bar charts; response counts are not recoverable from the extracted text]
Do you use any standard terminologies or ontologies when collecting/tracking phenotypes? (Yes, Sometimes, No, Other)
Which systems are used to track disease name? (OMIM, Disease ontology, ORDO (Orphanet), MeSH, MedGen, Other (please specify))
Do you use Human Phenotype Ontology terms to capture more detailed phenotypic features on subjects? (Yes, Sometimes, Rarely)
Which tools are used to collect phenotype? (PhenoDB, Patient Archive, PhenoTips)
Joint Center for Mendelian Genomics
Steering Committee: Daniel MacArthur, Heidi Rehm, Joe Gleeson, Eric Pierce, Alan Beggs, Mark Daly, Christine Seidman, David Sweetser, Vamsi Mootha
Coordination Team Project manager Hayley Brooks
Clinical project manager Sam Baxter Monkol Lek
Methods
Analysis Software
Brain Heart
Muscle
Hearing
Retinal
Mito
Other
Clinical Analysis
Monkol Lek Elise Valkanas Tom Mullen
Ben Weisburd Harindra Arachchi
Sarah Calvo Laura Gauthier Laurent Francioli
MacArthur Lab's seqr: Online platform for collaborative analysis
Platform allows collaborative analyses between central sequencing site and thousands of collaborators
enter structured phenotype data
https://seqr.broadinstitute.org
Rare Disease Analysis Platform
Genomic Matchmaking
Patient 1 (Clinical Geneticist 1)
Genotypic Data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
Phenotypic Data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5
Patient 2 (Clinical Geneticist 2)
Genotypic Data: Gene D, Gene G, Gene H
Phenotypic Data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6
Genomic Matchmaker: Notification of Match
Courtesy of Joel Krier
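The matching idea in this figure can be sketched with plain set intersections. The two profiles below use the gene and feature labels from the slide; the assignment of each list to a patient follows the order on the slide, and the rule of "at least one shared gene and three shared features" is a hypothetical threshold for illustration, not the scoring used by any Matchmaker Exchange service.

```python
def match_profiles(a, b):
    """Return the candidate genes and phenotypic features two profiles share."""
    return {
        "genes": a["genes"] & b["genes"],
        "features": a["features"] & b["features"],
    }


patient_1 = {
    "genes": {"A", "B", "C", "D", "E", "F"},
    "features": {1, 2, 3, 4, 5},
}
patient_2 = {
    "genes": {"D", "G", "H"},
    "features": {1, 3, 4, 5, 6},
}

shared = match_profiles(patient_1, patient_2)

# Hypothetical rule: notify both geneticists when the profiles share at
# least one candidate gene and at least three phenotypic features.
is_match = bool(shared["genes"]) and len(shared["features"]) >= 3
print(sorted(shared["genes"]), sorted(shared["features"]), is_match)
```

Here the only shared candidate is Gene D, supported by four overlapping features, which is the kind of result that would trigger a notification to both clinical geneticists.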
Matchmaker Exchange: Collaboration and Support from GA4GH and IRDiRC
Needs span multiple GA4GH workgroups
• Data Work Group (data format and interfaces)
• Regulatory and Ethics (patient consent)
• Security (patient privacy and user authentication)
Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7
Human Mutation Special Issue
The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
Innovative genomic collaboration using the GENESIS (GEM.app) platform
Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
Participant-led matchmaking
GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
Data sharing in the Undiagnosed Disease Network
The Genomic Birthday Paradox: How Much is Enough?
Quantifying and mitigating false-positive disease associations in rare disease matchmaking
Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue caused by de novo variants in HNRNPK
Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies
The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery
www.matchmakerexchange.org
MME Use Cases
• Supported today
  Match on Gene/Phenotype
    1 strong candidate gene (e.g., de novo variant in GUS)
    10 candidate genes, each with rare variant
  Match on Phenotype Only
• Coming soon
  Match case to non-human models
  Match using phenotype and VCF
  Patient-initiated matching
Connected and Soon to be Connected Matchmakers
[Figure: Matchmaker Exchange network. Labels: GENESIS; GeneMatcher; DECIPHER; RD Connect; ClinGen GenomeConnect; Monarch; PhenomeCentral; Broad Institute RDAP; Patient-initiated matching; Model organisms (mouse, zebrafish); Orphanet; ClinVar; OMIM; Live]
Connecting Data in the Big Data World
Centralized Database: Everyone submits data to a single central database
Examples: ClinVar, dbGaP, EGA
Centralized Hub: APIs connect each database to a central hub
Example: Many commercial platforms
Federated Network: All databases connected through multiple APIs
Example: Matchmaker Exchange
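One practical difference between the hub and federated models is how the number of API connections grows. The sketch below counts links under a simplifying assumption that is not stated on the slide: in the federated case every pair of databases is connected directly. The function names are hypothetical.

```python
def hub_links(n_databases):
    """Centralized hub: each database keeps one API link to the hub."""
    return n_databases


def federated_links(n_databases):
    """Federated network: one link per pair of databases.

    Assumption: every pair is connected directly (a fully connected network).
    """
    return n_databases * (n_databases - 1) // 2


for n in (3, 7, 20):
    print(n, "databases:", hub_links(n), "hub links,",
          federated_links(n), "federated links")
```

The quadratic growth in pairwise links is one reason a federated network benefits from a single shared API specification that every node implements.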
Matchmaker Exchange Acknowledgements
S Balasubramanian, Mike Bamshad, Sergio Beltran Agullo, Jonathan Berg, Kym Boycott, Anthony Brookes, Michael Brudno, Han Brunner, Oriean Buske, Deanna Church, Raymond Dalgliesh, Andrew Devereau, Johan den Dunnen, Helen Firth, Paul Flicek, Jan Friedman, Richard Gibbs, Marta Girdea, Robert Green, Matt Hurles, Ada Hamosh, Ekta Khurana, Sebastian Kohler, Joel Krier, Owen Lancaster, Melissa Landrum, Paul Lasko, Rick Lifton, Daniel MacArthur, Alex MacKenzie, Danielle Metterville, Debbie Nickerson, Woong-Yan Park, Justin Paschall, Anthony Philippakis, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner
NIAGADS
The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP
April 26 2016
Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)
• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia
Long-Term Plan for the ADSP
• Discovery Phase 2014-2018
  • Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC'd data released 7/13/2015
  • Case-control
    • Whole exome sequencing (WES): 5,000 cases and 5,000 controls
    • 1,000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC'd data released 11/2015
• Follow-Up 2016-2021: WGS in case-control sample sets emphasizing ethnically diverse cohorts
ADSP Discovery Phase Analysis
• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing
Participation
• PARTICIPANTS
Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
NIH staff: NHGRI and NIA
External Consultants
• INFRASTRUCTURE
Executive Committee, Analysis Coordination Committee, and 8 Work Groups
Genomics Center
Organization of the ADSP
ADSP Infrastructure and Support
NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]
National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]
National Alzheimer's Coordinating Center, NACC [U01 AG016976]
Alzheimer's Disease Centers
NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]
NIAGADS History and Function
Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
NIA's repository for the genetics of late-onset Alzheimer's disease data
Datasets, Genomics Database, Analysis Resource
Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy
Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)
NIAGADS ADSP Data Coordinating Center
• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and Office of Science Policy on "Trusted Partners"
Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP
Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as meta-data
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
Data production: Define data formatting; Collect and curate phenotypes; Track data production; Additional quality metrics; Coordinate public data release
Secondary/derived data management: Documentation (README/Manifest), packaging, and release; Notification and tracking
IT support: Website; Members area; Face-to-face meeting support; dbGaP exchange area
Facilitate interactions with other AD investigators
Rapid response to unforeseen ADSP and NIH requests
Interactions
[Figure labels: NIAGADS; NIA/NHGRI; Baylor/Broad/WashU LSACs; NCBI dbGaP/SRA; ADSP Data Flow and other Work Groups; Genomics Center; NCRAD; ADGC/CHARGE]
ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site
Management of Work Group Data/Work Flow
Exchange Area
Data Deposition and Data Sharing
ADSP-specific datasets: 68 TB
21 reference datasets with 5,560 files in use by the ADSP
>20 reference datasets planned
1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes
A 1 TB hard drive has the capacity to hold a trillion bytes
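The unit arithmetic above can be checked in a few lines. The sketch uses the binary convention from the slide (each step is a factor of 1024) and, as an added comparison that is not on the slide, the decimal convention in which a "trillion bytes" is exactly 10^12. The function name is hypothetical.

```python
BYTES_PER_STEP = 1024  # binary convention used on the slide


def binary_terabytes_to_bytes(tb):
    """Convert terabytes to bytes using 1 TB = 1024^4 bytes."""
    return tb * BYTES_PER_STEP ** 4


one_tb = binary_terabytes_to_bytes(1)
decimal_tb = 10 ** 12  # one trillion bytes

print(one_tb)               # 1099511627776
print(one_tb - decimal_tb)  # 99511627776
print(binary_terabytes_to_bytes(68), "bytes in 68 TB of ADSP-specific datasets")
```

The two conventions differ by roughly 10 percent at this scale, which is worth keeping in mind when comparing quoted dataset sizes against drive capacities.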
dbGaP/NIAGADS release for the research community at large
11,555 BAM files released 12/2014
Pedigree and phenotype information
Sequencing Quality Control Metrics
Size of the ADSP
Reference: "The real cost of sequencing: scaling computation to keep pace with data generation." Muir et al. Genome Biology 2016;17:53
ADSP Members Area (Dashboard)
Calendar, Documents, Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list 153 records
352 meeting minutes
>600 other documents
>100 consent files
10 analysis plans
85TB WGS 1075TB WES data (dbGaP/SRA)
42 TB Files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
SNP/Gene reports, Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer's Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
GWAS results combined with
SNPs below
Gene annotations transformed to SNPs
Genomic features from identified SNPs
Find a genomic location of interest by viewing tracks on the genome browser
SNPs
genes
GWAS results
functional genomics data
View Genomic Information by the Genome Browser
search result
Data available at the NIAGADS Genomics Database
Gene models SNP information
Pathway annotations Gene Ontology
KEGG Pathway
Functional genomics data ENCODE
FANTOM5
GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics
Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  • Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  • Big data management
  • IT infrastructure and web development
• Administrative support and collaboration
  • Familiarity with human subjects research regulations and NIH policy
  • Leadership and coordinating skills
  • Flexibility, close collaboration with program officers
• Domain expertise
  • Knowledge of the disease and genetic research
  • Familiarity with the community and cohorts
  • Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements
Patients and families of those with AD
NIA ADRC program and centers
NACC
NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD
NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly
NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon
NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao
GDC Overview: GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
  - Enables the identification of both high- and low-frequency cancer drivers
  - Assists in defining genomic determinants of response to therapy
  - Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  - Harmonization of raw sequence, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  - Application of state-of-the-art methods of generating high-level data
GDC Overview: Infrastructure
[Figure: GDC infrastructure diagram, grouped under GDC Users, GDC System Components, and GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; APIs; 3rd Party Applications; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System]
GDC Overview: Resources
GDC Data Portal
GDC Data Center
GDC Data Transfer Tool
GDC Reports
GDC Data Model
GDC Data Submission Portal
GDC Application Programming Interface (API)
GDC Bioinformatics Pipeline
GDC Documentation and Support
GDC Organization and Collaborators
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
  - Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  - Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool that use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  - GDC data submitters must first apply for data submission authorization through dbGaP
  - Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  - Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  - Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning, quality checking, and submission of revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  - Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  - Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
GDC Data Submission: Data Submission Process
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter's credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | TSV, JSON (one per dictionary entry)
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | TSV, JSON (one per dictionary entry)
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
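As a rough illustration of how a TSV template and a data dictionary fit together, the sketch below parses a small TSV submission and checks each row against a dictionary entry. The field names (`submitter_id`, `gender`, `year_of_birth`) and the dictionary layout are hypothetical stand-ins chosen for the example; they are not taken from the actual GDC data dictionary.

```python
import csv
import io
import json

# Hypothetical dictionary entry for a "demographic" entity (not the real GDC schema).
DEMOGRAPHIC_DICTIONARY = {
    "required": ["submitter_id", "gender"],
    "enums": {"gender": {"female", "male", "unknown"}},
}


def validate_rows(tsv_text, dictionary):
    """Return (records, errors) for a TSV template filled in by a submitter."""
    records, errors = [], []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for field in dictionary["required"]:
            if not row.get(field):
                errors.append(f"line {line_no}: missing required field '{field}'")
        for field, allowed in dictionary["enums"].items():
            value = row.get(field)
            if value and value not in allowed:
                errors.append(f"line {line_no}: '{value}' is not allowed for '{field}'")
        records.append(dict(row))
    return records, errors


SAMPLE_TSV = (
    "submitter_id\tgender\tyear_of_birth\n"
    "case-001\tfemale\t1950\n"
    "case-002\tm\t1962\n"
    "\tmale\t1971\n"
)

records, errors = validate_rows(SAMPLE_TSV, DEMOGRAPHIC_DICTIONARY)
print(json.dumps(records[0]))  # the first TSV row expressed as a JSON object
for message in errors:
    print(message)
```

The same record can be carried as a TSV row or a JSON object, which is why the dictionary check is written against plain dictionaries rather than against either file format.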
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
– GDC clinical data elements were reviewed with members of the research community
– Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
[Diagram: a Case links to Clinical Data, Biospecimen Data, and Experiment Data. Clinical Data comprises Demographics, Diagnosis, Family History (optional), Exposure (optional), and Treatment (optional).]
Data Types and File Formats: Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs Quality Control (QC) and reports the metrics generated by FastQC
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
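One way to read a per-read-group QC report like the one above is to collect, for each read group, the modules that did not pass. The sketch below does that for the non-PASS rows of the example report. The triage rule (treating any FAIL as grounds for review) is an assumption made for illustration; the slides do not say how the GDC acts on these statuses.

```python
# Module statuses per read group, transcribed from the example report above.
# Modules that are PASS for every read group are omitted for brevity.
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]
QC_REPORT = {
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flag_read_groups(report, read_groups):
    """Map each read group to the (module, status) pairs that did not PASS."""
    flagged = {rg: [] for rg in read_groups}
    for module, statuses in report.items():
        for rg, status in zip(read_groups, statuses):
            if status != "PASS":
                flagged[rg].append((module, status))
    return flagged


flagged = flag_read_groups(QC_REPORT, READ_GROUPS)

# Hypothetical triage rule: any FAIL sends the read group for manual review.
needs_review = [
    rg for rg, issues in flagged.items()
    if any(status == "FAIL" for _, status in issues)
]
print(needs_review)
```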
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
– Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
– Validate data against GDC standard data types defined in the project data dictionary
– Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
– Command line interface to specify desired transfer protocol
– Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
– Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
– Securely submit biospecimen, clinical, and experiment files using a token
– Submit data to the GDC by performing create, update, and retrieve actions on entities
– Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
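A minimal sketch of what a call to the create-entities endpoint might look like is given below. It only builds the request and never sends it. The base URL is the one listed on the References slide and may no longer be current. The `X-Auth-Token` header name, the program and project names, and the entity payload are assumptions for illustration and should be checked against the GDC API documentation.

```python
import json
import urllib.request

# Assumption: base URL as listed in this deck's references; it may have changed.
API_BASE = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST request for the create-entities endpoint."""
    url = f"{API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Auth-Token": token,  # assumed header name for the authentication token
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Hypothetical program, project, and entity payload, for illustration only.
request = build_create_request(
    "PROGRAM",
    "PROJECT",
    [{"type": "case", "submitter_id": "case-001"}],
    token="<token-from-file>",
)
print(request.get_method(), request.full_url)
```

Keeping request construction separate from sending makes the sketch safe to run without credentials; passing `request` to `urllib.request.urlopen` would perform the actual submission.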
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access, per dbGaP authorization policies associated with the data set
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]
GDC Processing: GDC High Level Data Generation
• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
GDC Data Retrieval
• Once data is submitted into the GDC and the data submitter signs off on the data to request data release, the GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
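To make the graph idea concrete, the sketch below stores a toy version of such a model as nodes with typed identifiers and edges to child nodes, then walks the edges to find every file linked to a case or project. The node types, identifiers, and adjacency layout are hypothetical; the real GDC data model has many more node types and is not reproduced here.

```python
from collections import deque

# Hypothetical miniature graph: node id -> (node type, ids of child nodes).
GRAPH = {
    "project-1": ("project", ["case-1", "case-2"]),
    "case-1": ("case", ["clinical-1", "sample-1"]),
    "case-2": ("case", ["sample-2"]),
    "clinical-1": ("clinical", []),
    "sample-1": ("sample", ["file-1", "file-2"]),
    "sample-2": ("sample", ["file-3"]),
    "file-1": ("file", []),
    "file-2": ("file", []),
    "file-3": ("file", []),
}


def files_under(graph, start):
    """Breadth-first walk along edges, collecting every file node reachable from start."""
    found, queue, seen = [], deque([start]), {start}
    while queue:
        node_id = queue.popleft()
        node_type, children = graph[node_id]
        if node_type == "file":
            found.append(node_id)
        for child in children:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return found


print(files_under(GRAPH, "case-1"))
print(files_under(GRAPH, "project-1"))
```

Because every node carries a unique identifier, a query at any level (project, case, sample) resolves to the same file objects, which is the linkage property the slide describes.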
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
– Data browsing by project, file, case, or annotation
– Visualization allowing users to perform fine-grained filtering of search results
– Data search using advanced smart search technology
– Data selection into a personalized cart
– Data download from cart or a high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
– Command line interface to specify desired transfer protocol and multiple files for download
– Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
– Supports the secure transfer of controlled access data using an authentication key (token)
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
– Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
– Securely retrieve biospecimen, clinical, and molecular files
– Perform BAM slicing
Example request (callouts on the slide: API URL, Endpoint, URL parameters, Query parameters):
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Example response (portion removed for readability):
{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
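The request above can be assembled from its parts (API URL, endpoint, query parameters), and the response can be reduced to a simple lookup. The following sketch shows both steps without making a network call. The base URL is the one shown on the slide and may have changed since. The sample response is an abbreviated copy of the hits listed above.

```python
import json
from urllib.parse import urlencode

API_BASE = "https://gdc-api.nci.nih.gov"  # as shown on the slide; may have changed since


def build_query_url(endpoint, fields, pretty=True):
    """Assemble API URL + endpoint + query parameters into one request URL."""
    params = {"fields": ",".join(fields), "pretty": str(pretty).lower()}
    # safe="," keeps the comma-separated field list readable instead of %2C
    return f"{API_BASE}/{endpoint}?{urlencode(params, safe=',')}"


url = build_query_url("projects", ["project_id", "primary_site"])
print(url)

# Abbreviated response in the shape shown above (two hits only).
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"}
]}}
"""

sites = {
    hit["project_id"]: hit["primary_site"]
    for hit in json.loads(SAMPLE_RESPONSE)["data"]["hits"]
}
print(sites)
```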
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
– Targets data consumers, providers, developers, and general users
– Provides access to information about GDC and contributed cancer genomic data sets
– Instructs users on the use of GDC data access and submission tools
– Provides descriptions of GDC bioinformatics pipelines
– Documents supported GDC data types and file formats
– Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site – https://gdc.nci.nih.gov
• GDC Documentation Site – https://gdc-docs.nci.nih.gov
• GDC Data Portal – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
– https://gdc-portal.nci.nih.gov/submission
– https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
– https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
– https://gdc-api.nci.nih.gov
• GDC Support
– GDC Assistance: support@nci-gdc.datacommons.io
– GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype):
– Aggregation
– Access
– Sharing
– Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Diagram: data flow among birth defect or childhood cancer cohorts, the DNA sequencing center, dbGaP (birth defect BAM/VCF), NCI GDC (cancer BAM/VCF), the Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries; birth defect and cancer VCF), and users]
[Diagram: the GMKF Data Resource connected to dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants is associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
Data resources to support genomics
• Patient data stores (standardized phenotype and genotype)
– dbGaP, EGA, and many other databases
• Platforms for genomic data analysis for causality
– Commercial or academic, +/- open-source: Seqr Platform
– Matchmaker Exchange
• Database for sharing interpreted variants according to evidence and impact: ClinVar
• Database for reporting gene-disease relationships: OMIM
• Database for defining the strength of evidence and actionability for gene-disease relationships: ClinGen gene resource
The Clinical Genome Resource
Purpose: Create an authoritative central resource that defines the clinical relevance of genes and variants for use in precision medicine and research
Rehm et al. ClinGen - The Clinical Genome Resource. N Engl J Med. 2015;372:2235-2242
www.clinicalgenome.org
>400 people from >90 institutions
ClinGen Acknowledgements
ClinGen Steering Committee: Jonathan Berg (UNC), Lisa Brooks (NHGRI), Carlos Bustamante (Stanford), Mike Cherry (Stanford), James Evans (UNC), Andy Faucett (Geisinger), Katrina Goddard (Kaiser Permanente), Danuta Krotoski (NICHD), Melissa Landrum (NCBI), David Ledbetter (Geisinger), Christa Lese Martin (Geisinger), Aleks Milosavljevic (Baylor), Robert Nussbaum (UCSF), Kelly Ormond (Stanford), Sharon Plon (Baylor), Erin Ramos (NHGRI), Heidi Rehm (Harvard), Sheri Schully (NCI), Steve Sherry (NCBI), Michael Watson (ACMG), Kirk Wilhelmsen (UNC), Marc Williams (Geisinger)
Program Coordinators: Danielle Azzariti, Brianne Kirkpatrick, Kristy Lee, Laura Milko, Annie Niehaus, Misha Rashkin, Erin Riggs, Andy Rivera, Cody Sam, Yekaterina Vaydylevich, Meredith Weaver
ClinGen Working Groups (WG)
• Genomic Variant WG. Chairs: Christa Martin, Sharon Plon, Heidi Rehm
• Sequence Variant Interpretation WG. Chairs: Les Biesecker, Marc Greenblatt
• Phenotyping WG. Chair: David Miller
• ClinVar IT Standards and Data Submission WG. Chairs: Karen Eilbeck, Melissa Landrum
• Data Model WG. Chairs: Larry Babb, Chris Bizon
• Informatics WG. Chair: Carlos Bustamante
• Clinical Domain WGs
– Hereditary Cancer: Matthew Ferber, Ken Offit, Sharon Plon
– Somatic Cancer: Shashi Kulkarni, Subha Madhavan
– Cardiovascular: Euan Ashley, Birgit Funke, Ray Hershberger
– Metabolic: Rong Mao, Robert Steiner, Bill Craigen
– Pharmacogenomic: Teri Klein, Howard McLeod
• Education, Engagement, Access WG. Chairs: Andy Faucett, Erin Riggs
• Consent and Disclosure Recommendations (CADRe) WG. Chairs: Andy Faucett, Kelly Ormond
• Gene Curation WG. Chairs: Jonathan Berg, Christa Martin
• Actionability WG. Chairs: Jim Evans, Katrina Goddard
• EHR WG. Chair: Marc Williams
ClinGen Gene-Disease Validity Classification
http://www.clinicalgenome.org/knowledge-curation/gene-curation
ClinGen Gene-Disease Scoring Matrix
Proposed Gene Inclusion for Clinical Tests
[Figure labels: Definitive evidence; Strong evidence; Moderate evidence; Limited/Disputed/No evidence; Predictive Tests & SFs; Diagnostic Panels; Exome/Genome]
Many ClinGen Clinical Domain WGs are initially focused on Gene Curation
Define genes appropriate for clinical testing and genes where additional evidence is needed
clinicalgenome.org
Available ClinGen Tools & Resources
Listed By Gene
Aggregating Variant Interpretations in ClinVar
[Diagram: variant-level data reaches ClinVar from researchers, clinics, patients, patient registries, clinical labs, expert groups, unpublished or literature citations, and linked databases (InSiGHT, CFTR2, OMIM, BIC, PharmGKB). Also labeled: Sharing Clinical Reports Project, Genome Connect, Free-the-Data.]
505 ClinVar submitters; 181,841 variants submitted; 126,974 unique interpreted variants
ClinVar as of April 26, 2016
ClinVar Variant View
Assertion Levels in ClinVar
• Practice Guideline (ACMG, CPIC)
• Expert Panel (CFTR2, InSiGHT, PharmGKB, ENIGMA)
• Multi-Source Consistency
• Single Submitter – Criteria Provided
• Single Submitter – No Criteria Provided (no stars)
• No Assertion (not applicable)
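The assertion levels form an ordered scale, so a tool that aggregates several submissions for one variant can pick out the best-supported one. The sketch below is an illustrative assumption, not ClinVar's own aggregation logic: the labels follow the slide, and the star counts follow ClinVar's public review-status scale (four stars for practice guidelines down to none for submissions without criteria).

```python
# Labels follow the slide; star counts follow ClinVar's review-status scale.
ASSERTION_STARS = {
    "Practice Guideline": 4,
    "Expert Panel": 3,
    "Multi-Source Consistency": 2,
    "Single Submitter - Criteria Provided": 1,
    "Single Submitter - No Criteria Provided": 0,
    "No Assertion": None,  # star rating not applicable
}


def strongest(levels):
    """Return the assertion level with the most stars, ignoring 'not applicable'."""
    rated = [level for level in levels if ASSERTION_STARS[level] is not None]
    return max(rated, key=ASSERTION_STARS.get) if rated else None


submissions = [
    "Single Submitter - Criteria Provided",
    "Expert Panel",
    "No Assertion",
]
print(strongest(submissions))
```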
Supporting a Curation Environment for both Crowd-Sourcing and Expert Consensus
[Diagram labels: ClinVar Variants; Curated Variants; Gene and Variant Curation Interfaces; Case-level data store; Machine-learning algorithms; Data resources; ClinGenKB; ClinGen Clinical WGs & Expert Panels; Outside Expert Panels; Discrepancy Resolution; Primary Curators]
• Convene experts and implement methods for gene and variant curation
• Resolve variant interpretation differences in ClinVar
• Enable rule-guided variant interpretation and export (to user and ClinVar)
Conditions and phenotypes in ClinVar
Variant assertions are made on "conditions"
Phenotypic features are seen in patients (supporting observations)
Standardization of disease terminologies
Parkinson's disease subtypes
yellow = Orphanet; brown = OMIM; blue = Disease Ontology; pink = Monarch; gray = MeSH
Diseases need to be hierarchically related
https://github.com/monarch-initiative/monarch-ontology
Courtesy of Melissa Haendel
Human Phenotype Ontology
117,348 annotations for ~7,000 mainly monogenic diseases
Used to define phenotypic elements of a disease or patient
Courtesy of Melissa Haendel
Centers for Mendelian Genomics Phenotyping Standards Survey
[Bar charts of response counts for four questions]
• Do you use any standard terminologies or ontologies when collecting/tracking phenotypes? (Yes, Sometimes, No, Other)
• Which systems are used to track disease name? (OMIM, Disease Ontology, ORDO (Orphanet), MeSH, MedGen, Other (please specify))
• Do you use Human Phenotype Ontology terms to capture more detailed phenotypic features on subjects? (Yes, Sometimes, Rarely)
• Which tools are used to collect phenotype? (PhenoDB, Patient Archive, PhenoTips)
Joint Center for Mendelian Genomics
Steering Committee: Daniel MacArthur, Heidi Rehm, Joe Gleeson, Eric Pierce, Alan Beggs, Mark Daly, Christine Seidman, David Sweetser, Vamsi Mootha
Coordination Team Project manager Hayley Brooks
Clinical project manager Sam Baxter Monkol Lek
Methods
Analysis Software
Brain Heart
Muscle
Hearing
Retinal
Mito
Other
Clinical Analysis
Monkol Lek Elise Valkanas Tom Mullen
Ben Weisburd Harindra Arachchi
Sarah Calvo Laura Gauthier Laurent Francioli
MacArthur Lab's seqr: Online platform for collaborative analysis
Platform allows collaborative analyses between central sequencing site and thousands of collaborators
Enter structured phenotype data
https://seqr.broadinstitute.org
Rare Disease Analysis Platform
Genomic Matchmaking
Patient 1 (Clinical Geneticist 1)
– Genotypic Data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
– Phenotypic Data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5
Patient 2 (Clinical Geneticist 2)
– Genotypic Data: Gene D, Gene G, Gene H
– Phenotypic Data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6
Genomic Matchmaker: Notification of Match
Courtesy of Joel Krier
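The two patients above can be compared with plain set operations: intersect the candidate genes, intersect the phenotypic features, and notify both clinicians when both overlaps are non-empty. The sketch below uses the example data from the slide. The Jaccard similarity and the match rule are assumptions chosen for illustration; they are not the scoring used by any Matchmaker Exchange node.

```python
PATIENT_1 = {
    "genes": {"Gene A", "Gene B", "Gene C", "Gene D", "Gene E", "Gene F"},
    "features": {"Feature 1", "Feature 2", "Feature 3", "Feature 4", "Feature 5"},
}
PATIENT_2 = {
    "genes": {"Gene D", "Gene G", "Gene H"},
    "features": {"Feature 1", "Feature 3", "Feature 4", "Feature 5", "Feature 6"},
}


def match(a, b):
    """Compare two patient profiles on shared candidate genes and shared features."""
    shared_genes = a["genes"] & b["genes"]
    shared_features = a["features"] & b["features"]
    all_features = a["features"] | b["features"]
    return {
        "shared_genes": sorted(shared_genes),
        "shared_features": sorted(shared_features),
        # Jaccard index: shared features divided by all distinct features
        "phenotype_similarity": len(shared_features) / len(all_features),
        # Assumed rule: notify when at least one gene and one feature overlap
        "is_match": bool(shared_genes) and bool(shared_features),
    }


result = match(PATIENT_1, PATIENT_2)
print(result["shared_genes"], result["shared_features"], result["is_match"])
```

Here the only shared candidate is Gene D, supported by four shared features out of six distinct ones, which is the kind of overlap that would prompt the two clinical geneticists to compare cases.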
Matchmaker Exchange: Collaboration and Support from GA4GH and IRDiRC
Needs span multiple GA4GH workgroups
• Data Work Group (data format and interfaces)
• Regulatory and Ethics (patient consent)
• Security (patient privacy and user authentication)
Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7
Human Mutation Special Issue
• The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
• The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
• GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
• PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
• Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
• Innovative genomic collaboration using the GENESIS (GEM.app) platform
• Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
• Participant-led matchmaking
• GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
• Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
• Data sharing in the Undiagnosed Disease Network
• The Genomic Birthday Paradox: How Much is Enough?
• Quantifying and mitigating false-positive disease associations in rare disease matchmaking
• Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
• GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue abnormalities caused by de novo variants in HNRNPK
• Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies
The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery
www.matchmakerexchange.org
MME Use Cases
• Supported today:
– Match on Gene/Phenotype
  • 1 strong candidate gene (e.g., de novo variant in GUS)
  • 10 candidate genes, each with rare variant
– Match on Phenotype Only
• Coming soon:
– Match case to non-human models
– Match using phenotype and VCF
– Patient-initiated matching
GENESIS
Connected and Soon to be Connected Matchmakers
Matchmaker Exchange
Gene Matcher
DECIPHER
RD Connect
ClinGen Genome Connect
Monarch
Phenome Central
Patient initiated matching
Model organisms (mouse zebrafish) Orphanet ClinVar OMIM)
Live
Broad Institute
RDAP
Connecting Data in the Big Data World
• Centralized Database: everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA
• Centralized Hub: APIs connect each database to a central hub. Example: many commercial platforms
• Federated Network: all databases connected through multiple APIs. Example: Matchmaker Exchange
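One practical difference between the hub and federated designs is how the number of API connections grows with the number of databases. The sketch below counts connections under a simplifying assumption that is not stated on the slide: in the federated case, every pair of databases is connected directly.

```python
def hub_links(n_databases):
    """Centralized hub: one connection from each database to the hub."""
    return n_databases


def federated_links(n_databases):
    """Federated network, assuming every pair of databases connects directly."""
    return n_databases * (n_databases - 1) // 2


for n in (3, 7, 20):
    print(n, hub_links(n), federated_links(n))
```

Under that assumption the federated count grows quadratically, which is one reason a shared API specification matters: each new node implements one interface instead of negotiating a custom connection with every existing node.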
Matchmaker Exchange Acknowledgements
S Balasubramanian, Mike Bamshad, Sergio Beltran Agullo, Jonathan Berg, Kym Boycott, Anthony Brookes, Michael Brudno, Han Brunner, Orion Buske, Deanna Church, Raymond Dalgleish, Andrew Devereau, Johan den Dunnen, Helen Firth, Paul Flicek, Jan Friedman, Richard Gibbs, Marta Girdea, Robert Green, Matt Hurles, Ada Hamosh, Ekta Khurana, Sebastian Kohler, Joel Krier, Owen Lancaster, Melissa Landrum, Paul Lasko, Rick Lifton, Daniel MacArthur, Alex MacKenzie, Danielle Metterville, Debbie Nickerson, Woong-Yang Park, Justin Paschall, Anthony Philippakis, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner
NIAGADS
The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP
April 26, 2016
Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)
• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia
Long-Term Plan for the ADSP
• Discovery Phase, 2014-2018
– Family-based
  • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
  • Included Caribbean Hispanic families
  • Fully QC'd data released 7/13/2015
– Case-control
  • Whole exome sequencing (WES): 5000 cases and 5000 controls
  • 1000 additional cases from families multiply affected by AD
  • Included Caribbean Hispanics
  • Fully QC'd data released 112015
• Follow-Up, 2016-2021
– WGS in case-control sample sets, emphasizing ethnically diverse cohorts
ADSP Discovery Phase Analysis
• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• In February 2016, ADSP consultants recommended proceeding with WGS but not WES or targeted sequencing
Participation
• PARTICIPANTS
– Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
– Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
– NIH staff: NHGRI and NIA
– External Consultants
• INFRASTRUCTURE
– Executive Committee, Analysis Coordination Committee, and 8 Work Groups
Genomics Center
Organization of the ADSP
ADSP Infrastructure and Support
• NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]
• National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]
• National Alzheimer's Coordinating Center, NACC [U01 AG016976]
• Alzheimer's Disease Centers
• NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]
NIAGADS History and Function
Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
NIA's repository for the genetics of late-onset Alzheimer's disease data
Datasets, Genomics Database, Analysis Resource
Compliance with the NIA Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy
Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)
NIAGADS ADSP Data Coordinating Center
• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and Office of Science Policy on "Trusted Partners"
Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP
Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as meta-data
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
Data production
• Define data formatting
• Collect and curate phenotypes
• Track data production
• Additional quality metrics
• Coordinate public data release
Secondary/derived data management
• Documentation (README/Manifest), packaging, and release
• Notification and tracking
IT support
• Website
• Members area
• Face-to-face meeting support
• dbGaP exchange area
Facilitate interactions with other AD investigators
Rapid response to unforeseen ADSP and NIH requests
Interactions
[Diagram: NIAGADS interacts with NIA/NHGRI, the Baylor/Broad/WashU LSACs, NCBI dbGaP/SRA, the ADSP Data Flow and other Work Groups, the Genomics Center, NCRAD, and ADGC/CHARGE]
ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site
Management of Work Group DataWork Flow
Exchange Area
Data Deposition and Data Sharing
ADSP specific datasets- 68 Tb
21 Reference datasets with 5560 files in use by the ADSP
>20 Reference datasets planned
1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes (binary units)
A hard drive sold as 1 TB has the capacity to hold a trillion (10^12) bytes
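The two statements about a terabyte use different conventions: the chain of powers of 1024 is the binary definition, while "a trillion bytes" is the decimal one. A short check of the arithmetic, written as a sketch:

```python
KIB = 1024

binary_tb_in_gb = KIB          # 1 TB = 1024 GB
binary_tb_in_mb = KIB ** 2     # = 1,048,576 MB
binary_tb_in_kb = KIB ** 3     # = 1,073,741,824 KB
binary_tb_in_bytes = KIB ** 4  # = 1,099,511,627,776 bytes

decimal_tb_in_bytes = 10 ** 12  # "a trillion bytes"

# The binary terabyte is roughly 10% larger than the decimal one.
ratio = binary_tb_in_bytes / decimal_tb_in_bytes
print(binary_tb_in_bytes, decimal_tb_in_bytes, round(ratio, 4))
```

The gap matters when sizing storage: a volume quoted in decimal terabytes holds about 9% fewer binary terabytes than the label suggests.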
dbGaP/NIAGADS release for the research community at large
11,555 BAM files released 122014
Pedigree and phenotype information
Sequencing Quality Control Metrics
Size of the ADSP
Reference: "The real cost of sequencing: scaling computation to keep pace with data generation." Muir et al. Genome Biology 2016;17:53
ADSP Members Area (Dashboard)
Calendar Documents Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list 153 records
352 meeting minutes
gt600 other documents
gt 100 consent files
10 analysis plans
85TB WGS 1075TB WES data (dbGaPSRA)
42 TB Files (dbGaP Exchange Area)
niagadsorggenomics Web interface for query and analysis
SNPGene reports Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimerrsquos Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
GWAS results combined with
SNPs below
Gene annotations transformed to SNPs
Genomic features from identified SNPs
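The combined search described on this slide (GWAS results filtered by significance, intersected with SNPs derived from gene annotations) can be sketched as set operations. This is an illustration only, not the NIAGADS Genomics Database implementation; the SNP and gene identifiers, the threshold, and all function names are hypothetical.

```python
from typing import Dict, Iterable, Set


def significant_snps(gwas_pvalues: Dict[str, float], threshold: float) -> Set[str]:
    """SNPs whose GWAS p-value falls below the chosen significance threshold."""
    return {snp for snp, p in gwas_pvalues.items() if p < threshold}


def snps_for_genes(gene_to_snps: Dict[str, Set[str]], genes: Iterable[str]) -> Set[str]:
    """Transform a list of annotated genes into the SNPs that map to them."""
    found: Set[str] = set()
    for gene in genes:
        found |= gene_to_snps.get(gene, set())
    return found


def combine(gwas_pvalues, threshold, gene_to_snps, genes) -> Set[str]:
    """Keep only SNPs that are both significant and located in a gene of interest."""
    return significant_snps(gwas_pvalues, threshold) & snps_for_genes(gene_to_snps, genes)


# Hypothetical toy data, not real GWAS results.
gwas = {"snp_a": 1e-9, "snp_b": 0.03, "snp_c": 2e-8, "snp_d": 4e-10}
gene_map = {"GENE1": {"snp_a", "snp_b"}, "GENE2": {"snp_c"}, "GENE3": {"snp_d"}}
hits = combine(gwas, 5e-8, gene_map, ["GENE1", "GENE2"])
print(sorted(hits))
```

Keeping each search as a set makes it easy to chain further filters (for example, functional genomics tracks) with additional intersections.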
Find a genomic location of interest by viewing tracks on the genome browser
SNPs
genes
GWAS results
functional genomics data
View Genomic Information by the Genome Browser
search result
Data available at the NIAGADS Genomics Database
Gene models SNP information
Pathway annotations Gene Ontology
KEGG Pathway
Functional genomics data ENCODE
FANTOM5
GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics
Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  - Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  - Big data management
  - IT infrastructure and web development
• Administrative support and collaboration
  - Familiarity with human subjects research regulations and NIH policy
  - Leadership and coordinating skills
  - Flexibility, close collaboration with program officers
• Domain expertise
  - Knowledge of the disease and genetic research
  - Familiarity with the community and cohorts
  - Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements
Patients and families of those with AD
NIA ADRC program and centers
NACC
NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD
NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly
NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon, Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao
NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
GDC Overview GDC Data Submission Processing and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
  - Enables the identification of both high- and low-frequency cancer drivers
  - Assists in defining genomic determinants of response to therapy
  - Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  - Harmonization of raw sequence both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  - Application of state-of-the-art methods of generating high level data
GDC Overview Infrastructure
Data Sources (DCC CGHub)
Data Submitters
Open Access Data Users
Controlled Access Data Users
eRACommons amp dbGaP Data
Access Tools
Data Import System
Metadata amp Data Storage
Reporting System
Harmonization amp Generation
Pipelines
3rd Party Applications
GDC Users GDC System Components GDC Interfaces
Alignment amp Processing Tools Data
Submission Tools
Data Security System
APIs
Digital ID System
GDC Overview Resources
GDC Data Portal
GDC Data Center
GDC Data Transfer Tool GDC Reports
GDC Data Model
GDC Data Submission
Portal
GDC Application Programming Interface (API)
GDC Bioinformatics
Pipeline
GDC Documentation
and Support
GDC Organization
and Collaborators
GDC Organization and Collaborators
bull GDC Government Sponsor
NCI Center for Cancer Genomics (CCG)
University of Chicago Team
bull Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
bull GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
bull Contracting organization supporting GDC management and execution
Other Government External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Submission Data Submitter Types
bull The GDC supports two types of data submitters
  - Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  - Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API
GDC Data Submission Data Submission Policies
• dbGaP Data Submission Policies
  - GDC data submitters must first apply for data submission authorization through dbGaP
  - Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  - Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification.
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores.
  - Data Pre-processing Period
    • For each project the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality checks and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  - Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication.
  - Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data.
GDC Data Submission Data Submission Process
GDC Data Submission dbGaP Registration
bull Data submitters register their studies in dbGaP submit the Subject Identifiers associated with the study in dbGaP and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP.
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission Upload and Validate Data
bull Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal GDC Data Transfer Tool or GDC Application Programming Interface (API)
bull Uploaded data must follow GDC supported data types and file formats which leverage and extend existing data standards for biospecimen clinical and experiment data
Data Types and File Formats (1 of 2)
bull The GDC provides a data dictionary and template files for each data type
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Portion | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Analyte | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Aliquot | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Diagnosis | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Exposure | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Family History | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Treatment | TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
Data Types and File Formats Biospecimen Data
bull Biospecimen data types may include samples aliquots analytes and portions
bull GDC supports the submission of biospecimen data in XML JSON or TSV file format
Data Types and File Formats Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  - GDC clinical data elements were reviewed with members of the research community
  - Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
[Diagram: Case]
Clinical Data: Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)
Biospecimen Data
Experiment Data
Data Types and File Formats Experiment Data
bull GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
bull GDC performs and reports on Quality Control (QC) metrics generated by FASTQC
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
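A per-read-group QC report like the one on this slide is easy to scan programmatically. The sketch below encodes a subset of the example table and lists which modules carry a given flag for each read group. It is an illustration of the idea, not the GDC's reporting code, and the function name is an assumption.

```python
from typing import Dict, List, Tuple

READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Subset of the example QC table on the slide: (module, status per read group).
QC_TABLE: List[Tuple[str, List[str]]] = [
    ("Basic Statistics", ["PASS", "PASS", "PASS", "PASS"]),
    ("Per tile sequence quality", ["PASS", "WARNING", "PASS", "PASS"]),
    ("Per sequence quality scores", ["FAIL", "PASS", "PASS", "PASS"]),
    ("Per base sequence content", ["PASS", "PASS", "WARNING", "PASS"]),
    ("Sequence Duplication Levels", ["WARNING", "WARNING", "PASS", "PASS"]),
    ("Adapter Content", ["PASS", "FAIL", "PASS", "PASS"]),
    ("Kmer Content", ["PASS", "PASS", "WARNING", "PASS"]),
]


def modules_with_status(status: str) -> Dict[str, List[str]]:
    """Map each read group to the QC modules reporting the given status."""
    flagged: Dict[str, List[str]] = {}
    for module, statuses in QC_TABLE:
        for read_group, value in zip(READ_GROUPS, statuses):
            if value == status:
                flagged.setdefault(read_group, []).append(module)
    return flagged


failures = modules_with_status("FAIL")
warnings = modules_with_status("WARNING")
print(failures)
print(warnings)
```

In this example only RG_1 and RG_2 have failing modules, while RG_4 passes every check shown.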
Data Submission Tools GDC Data Submission Portal
bull The GDC Data Submission Portal is a web-based data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
Upload biospecimen and clinical data and experiment metadata using user friendly web-based tools
Validate data against GDC standard data types defined in the project data dictionary
Obtain information on the status of data submission and processing by project
Data Submission Tools GDC Data Transfer Tool
bull The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
Command line interface to specify desired transfer protocol
Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools GDC Submission API
bull The GDC Submission API is a REST based programmatic interface that supports the programmatic submission of cancer data sets for analysis
Securely submit biospecimen clinical and experiment files using a token
Submit data to GDC by performing create update and retrieve actions on entities
Search for submitted data using GraphQL an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
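A request to the Create Entities endpoint could be assembled as sketched below. The sketch only builds the request object and does not send it. The host is the API address listed on the deck's References slide; the `X-Auth-Token` header name and the fields of the example entity are assumptions to be checked against the GDC API documentation, and the program and project names are placeholders.

```python
import json
import urllib.request
from typing import Any, Dict, List

API_ROOT = "https://gdc-api.nci.nih.gov"  # host from the References slide


def build_create_request(program: str, project: str,
                         entities: List[Dict[str, Any]],
                         token: str) -> urllib.request.Request:
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the token
        },
    )


# Hypothetical entity; the field names are illustrative only.
example_entities = [{"type": "case", "submitter_id": "EXAMPLE-CASE-1"}]
request = build_create_request("PROGRAM", "PROJECT", example_entities, "TOKEN")
print(request.get_method(), request.full_url)
```

Sending the request would be a matter of passing it to `urllib.request.urlopen`, which is left out here so the sketch runs without network access or a real token.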
GDC Data Submission Submit and Release Data
bull After validation data is submitted to the GDC for processing including data harmonization and high level data generation for applicable data
bull After data processing has been completed the user can release their data to the GDC which must occur within six (6) months of submission per GDC Data Submission Policies
bull Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Processing Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing DNA and RNA Sequence Harmonization Pipelines
DNA Sequence Pipeline RNA Sequence Pipeline
GDC Processing GDC High Level Data Generation
• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing GDC Variant Calling Pipelines
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Retrieval
bull Once data is submitted into GDC and the data submitter signs-off on the data to request data release GDC performs Quality Control (QC) and data processing
bull Upon successful GDC QC and processing the data is released based on the appropriate dbGaP data restrictions
bull Released data is made available for query and download to authorized users via the GDC Data Portal the GDC Data Transfer Tool and the GDC API
bull Data queries are based on the GDC Data Model
GDC Data Retrieval GDC Data Model
bull The GDC data model is represented as a graph with nodes and edges and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
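A graph of nodes and edges keyed by unique identifiers can be illustrated with a very small in-memory structure. This is a toy sketch, not the GDC's actual schema or storage engine; the node labels, identifiers, and method names are hypothetical.

```python
from collections import deque
from typing import Dict, List


class Graph:
    """Toy directed graph: each node has a unique id and a label (its type)."""

    def __init__(self) -> None:
        self.labels: Dict[str, str] = {}
        self.children: Dict[str, List[str]] = {}

    def add_node(self, node_id: str, label: str) -> None:
        self.labels[node_id] = label
        self.children.setdefault(node_id, [])

    def add_edge(self, parent: str, child: str) -> None:
        self.children[parent].append(child)

    def descendants_with_label(self, start: str, label: str) -> List[str]:
        """Breadth-first walk from start, collecting nodes of the given label."""
        found, queue, seen = [], deque([start]), {start}
        while queue:
            node = queue.popleft()
            for child in self.children[node]:
                if child in seen:
                    continue
                seen.add(child)
                if self.labels[child] == label:
                    found.append(child)
                queue.append(child)
        return found


g = Graph()
for node_id, label in [("project-1", "project"), ("case-1", "case"),
                       ("case-2", "case"), ("clinical-1", "clinical"),
                       ("sample-1", "sample"), ("file-1", "file"),
                       ("file-2", "file")]:
    g.add_node(node_id, label)
for parent, child in [("project-1", "case-1"), ("project-1", "case-2"),
                      ("case-1", "clinical-1"), ("case-1", "sample-1"),
                      ("sample-1", "file-1"), ("case-2", "file-2")]:
    g.add_edge(parent, child)

print(g.descendants_with_label("case-1", "file"))
```

Because files are reached by walking edges from a case or project, a query such as "all files for this case" follows the stored relationships rather than relying on file naming.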
GDC Data Retrieval GDC Data Portal
bull The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
Data browsing by project file case or annotation
Visualization allowing users to perform fine-grained filtering of search results
Data search using advanced smart search technology
Data selection into a personalized cart
Data download from cart or a high-performance Data Transfer Tool
GDC Data Retrieval GDC Data Transfer Tool
bull The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
Command line interface to specify desired transfer protocol and multiple files for download
Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
Supports the secure transfer of controlled access data using an authentication key (token)
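A manifest-driven download can be checked after transfer by comparing each file against the manifest entry. The sketch below assumes a tab-separated manifest with `id`, `filename`, `md5`, `size`, and `state` columns; that layout and the identifiers are assumptions for illustration and should be confirmed against a manifest exported from the GDC Data Portal.

```python
import csv
import hashlib
import io
from typing import Dict, List


def read_manifest(text: str) -> List[Dict[str, str]]:
    """Parse a tab-separated manifest into one dict per file entry."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def matches_manifest(entry: Dict[str, str], payload: bytes) -> bool:
    """True when the downloaded bytes agree with the entry's size and MD5."""
    return (len(payload) == int(entry["size"])
            and hashlib.md5(payload).hexdigest() == entry["md5"])


# Build a hypothetical one-file manifest so the example is self-consistent.
payload = b"ACGT\n"
manifest_text = (
    "id\tfilename\tmd5\tsize\tstate\n"
    f"example-id-1\texample.txt\t{hashlib.md5(payload).hexdigest()}\t{len(payload)}\tsubmitted\n"
)

entries = read_manifest(manifest_text)
print(entries[0]["filename"], matches_manifest(entries[0], payload))
```

A truncated or corrupted transfer changes the size or checksum, so the same check flags files that need to be downloaded again.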
Data Retrieval Tools GDC API
bull The GDC API is a REST based programmatic interface that allows users to search and download cancer data sets for analysis
Search for projects files cases annotations and retrieve associated details in JSON format
Securely retrieve biospecimen clinical and molecular files
Perform BAM slicing
Example request: https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Labelled URL parts: API URL, Endpoint, URL parameters, Query parameters
Example response:
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  (portion removed for readability)
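The example query on this slide can be rebuilt and its response parsed with the standard library. The sketch works offline on a canned response so it runs without network access; the helper names are illustrative, and the response shape (`data` then `hits`) follows the slide's example.

```python
import json
from typing import Dict, List
from urllib.parse import urlencode

API_ROOT = "https://gdc-api.nci.nih.gov"  # host shown on the slide


def build_query_url(endpoint: str, fields: List[str], pretty: bool = True) -> str:
    """Compose an endpoint URL with a comma-separated fields parameter."""
    params = {"fields": ",".join(fields), "pretty": "true" if pretty else "false"}
    return f"{API_ROOT}/{endpoint}?{urlencode(params, safe=',')}"


def sites_by_project(response_text: str) -> Dict[str, str]:
    """Map project_id to primary_site from a projects-endpoint response."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


url = build_query_url("projects", ["project_id", "primary_site"])

# Canned response using two of the hits shown on the slide.
sample_response = json.dumps({"data": {"hits": [
    {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
    {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
]}})
sites = sites_by_project(sample_response)
print(url)
print(sites)
```

Requesting only the needed fields keeps responses small, which matters when paging through many projects, files, or cases.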
GDC Web Site
bull The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
Targets data consumers providers developers and general users
Provides access to information about GDC and contributed cancer genomic data sets
Instructs users on the use of GDC data access and submission tools
Provides descriptions of GDC bioinformatics pipelines
Documents supported GDC data types and file formats
Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site
  - https://gdc.nci.nih.gov
• GDC Documentation Site
  - https://gdc-docs.nci.nih.gov
• GDC Data Portal
  - https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  - https://gdc-portal.nci.nih.gov/submission
  - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  - https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  - https://gdc-api.nci.nih.gov
• GDC Support
  - GDC Assistance: support@nci-gdc.datacommons.io
  - GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data
  - Aggregation
  - Access
  - Sharing
  - Analysis
Whole genome sequence + Phenotype
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based public facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
Birth defect or childhood cancer cohorts
DNA Sequencing center
Birth defect BAM/VCF; Cancer BAM/VCF
dbGaP; NCI GDC
Birth defect VCF; Cancer VCF
Gabriella Miller Kids First Data Resource
Birth defect BAM/VCF
Index of datasets; Phenotype; Variant summaries
Cancer BAM/VCF
Users
GMKF Data
Resource
dbGaP
NCI GDC
TOPMed
The Monarch Initiative
ClinGen
Matchmaker Exchange
Center for Mendelian Genomics
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
The Clinical Genome Resource
Purpose: Create authoritative central resource that defines the clinical relevance of genes and variants for use in precision medicine and research
Rehm et al. ClinGen - The Clinical Genome Resource. N Engl J Med 2015;372:2235-2242
www.clinicalgenome.org
>400 people from >90 institutions
ClinGen Acknowledgements
ClinGen Steering Committee: Jonathan Berg (UNC), Lisa Brooks (NHGRI), Carlos Bustamante (Stanford), Mike Cherry (Stanford), James Evans (UNC), Andy Faucett (Geisinger), Katrina Goddard (Kaiser Permanente), Danuta Krotoski (NICHD), Melissa Landrum (NCBI), David Ledbetter (Geisinger), Christa Lese Martin (Geisinger), Aleks Milosavljevic (Baylor), Robert Nussbaum (UCSF), Kelly Ormond (Stanford), Sharon Plon (Baylor), Erin Ramos (NHGRI), Heidi Rehm (Harvard), Sheri Schully (NCI), Steve Sherry (NCBI), Michael Watson (ACMG), Kirk Wilhelmsen (UNC), Marc Williams (Geisinger)
Program Coordinators: Danielle Azzariti, Brianne Kirkpatrick, Kristy Lee, Laura Milko, Annie Niehaus, Misha Rashkin, Erin Riggs, Andy Rivera, Cody Sam, Yekaterina Vaydylevich, Meredith Weaver
ClinGen Working Groups (WG)
Genomic Variant WG
Chairs: Christa Martin, Sharon Plon, Heidi Rehm
Sequence Variant Interpretation WG
Chairs: Les Beisecker, Marc Greenblat
Phenotyping WG
Chair: David Miller
ClinVar IT Standards and Data Submission WG
Chairs: Karen Eilbeck, Melissa Landrum
Data Model WG
Chairs: Larry Babb, Chris Bizon
Informatics WG
Chair: Carlos Bustamante
Clinical Domain WGs
Hereditary Cancer: Matthew Ferber, Ken Offit, Sharon Plon
Somatic Cancer: Shashi Kulkarni, Subha Madhavan
Cardiovascular: Euan Ashley, Birgit Funke, Ray Hershberger
Metabolic: Rong Mao, Robert Steiner, Bill Craigen
Pharmacogenomic: Teri Klein, Howard McLeod
Education Engagement Access WG
Chairs: Andy Faucett, Erin Riggs
Consent and Disclosure Recommendations (CADRe) WG
Chairs: Andy Faucett, Kelly Ormond
Gene Curation WG
Chairs: Jonathan Berg, Christa Martin
Actionability WG
Chairs: Jim Evans, Katrina Goddard
EHR WG
Chair: Marc Williams
ClinGen Gene-Disease Validity Classification
http://www.clinicalgenome.org/knowledge-curation/gene-curation
ClinGen Gene-Disease Scoring Matrix
Proposed Gene Inclusion for Clinical Tests
Definitive evidence Strong evidence
Moderate evidence
Limited/Disputed/No evidence
Predictive Tests & SFs
Diagnostic Panels
Exome/Genome
Many ClinGen Clinical Domain WGs are initially focused on Gene Curation
Define genes appropriate for clinical testing and genes where additional evidence is needed
clinicalgenome.org
Available ClinGen Tools & Resources
Listed By Gene
Aggregating Variant Interpretations in ClinVar
[Diagram: sources of variant-level data in ClinVar]
Sharing Clinical Reports Project; Genome Connect; Free-the-Data
Researchers; Clinics; Patients; Patient Registries; Clinical Labs; Expert Groups
Unpublished or Literature Citations
Linked Databases: InSiGHT, CFTR2, OMIM, BIC, PharmGKB
Variant-level Data: ClinVar
505 ClinVar submitters; 181,841 variants submitted; 126,974 unique interpreted variants
ClinVar as of April 26, 2016
ClinVar Variant View
Assertion Levels in ClinVar
Expert Panel
Single Submitter ndash Criteria Provided
Single Submitter ndash No Criteria Provided
Multi-Source Consistency
Practice Guideline
No stars
No Assertion Not applicable
ACMG CPIC
CFTR2 InSiGHT PharmGKB ENIGMA
Supporting a Curation Environment for both Crowd-Sourcing and Expert Consensus
Curated Variants
ClinVar
Variants
Gene and Variant Curation Interfaces
Case-level data store Machine-learning algorithms
Data resources
ClinGenKB
ClinGen Clinical WGs amp Expert Panels
Outside Expert Panels
Discrepancy Resolution
Primary Curators
Convene experts and implement methods for gene and variant curation
Resolve variant interpretation differences in ClinVar
Enable rule-guided variant interpretation and export (to user and ClinVar)
Conditions and phenotypes in ClinVar
Variant assertions are made on "conditions"
Phenotypic features are seen in patients (supporting observations)
Standardization of disease terminologies
Parkinson's disease subtypes
Legend: yellow = Orphanet, brown = OMIM, blue = Disease Ontology, pink = Monarch, gray = MeSH
Diseases need to be hierarchically related
https://github.com/monarch-initiative/monarch-ontology (Courtesy of Melissa Haendel)
Human Phenotype Ontology: 117,348 annotations for ~7000 mainly monogenic diseases
Used to define phenotypic elements of a disease or patient
Courtesy of Melissa Haendel
Centers for Mendelian Genomics Phenotyping Standards Survey
[Survey bar charts; response counts on a 0-4 axis are not recoverable from the extracted text]
• Do you use any standard terminologies or ontologies when collecting/tracking phenotypes? (Yes, Sometimes, No, Other)
• Which systems are used to track disease name? (OMIM, Disease ontology, ORDO (Orphanet), MeSH, MedGen, Other (please specify))
• Do you use Human Phenotype Ontology terms to capture more detailed phenotypic features on subjects? (Yes, Sometimes, Rarely)
• Which tools are used to collect phenotype? (PhenoDB, Patient Archive, PhenoTips)
Joint Center for Mendelian Genomics
Daniel MacArthur, Heidi Rehm, Joe Gleeson, Eric Pierce, Alan Beggs, Mark Daly, Christine Seidman, David Sweetser, Vamsi Mootha
Steering Committee
Coordination Team Project manager Hayley Brooks
Clinical project manager Sam Baxter Monkol Lek
Methods
Analysis Software
Brain Heart
Muscle
Hearing
Retinal
Mito
Other
Clinical Analysis
Monkol Lek Elise Valkanas Tom Mullen
Ben Weisburd Harindra Arachchi
Sarah Calvo Laura Gauthier Laurent Francioli
MacArthur Lab's seqr: Online platform for collaborative analysis
Platform allows collaborative analyses between central sequencing site and thousands of collaborators
enter structured phenotype data
https://seqr.broadinstitute.org
Rare Disease Analysis Platform
Genomic Matchmaking
Patient 1 (Clinical Geneticist 1)
Genotypic Data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
Phenotypic Data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5
Patient 2 (Clinical Geneticist 2)
Genotypic Data: Gene D, Gene G, Gene H
Phenotypic Data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6
Genomic Matchmaker: Notification of Match
Courtesy of Joel Krier
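The matchmaking idea in this figure, two cases sharing a candidate gene and overlapping phenotypic features, can be sketched with set intersections. Real matchmakers use more sophisticated scoring (for example, ontology-aware phenotype similarity), so treat this only as an illustration; the class, function, and threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Patient:
    name: str
    genes: FrozenSet[str]
    features: FrozenSet[str]


def match(a: Patient, b: Patient,
          min_shared_features: int = 1) -> Tuple[bool, FrozenSet[str], FrozenSet[str]]:
    """Report a match when two patients share a gene and enough features."""
    shared_genes = a.genes & b.genes
    shared_features = a.features & b.features
    is_match = bool(shared_genes) and len(shared_features) >= min_shared_features
    return is_match, shared_genes, shared_features


# The two patients from the figure.
patient_1 = Patient(
    "Patient 1",
    frozenset(f"Gene {g}" for g in "ABCDEF"),
    frozenset(f"Feature {n}" for n in (1, 2, 3, 4, 5)),
)
patient_2 = Patient(
    "Patient 2",
    frozenset({"Gene D", "Gene G", "Gene H"}),
    frozenset(f"Feature {n}" for n in (1, 3, 4, 5, 6)),
)

is_match, genes, features = match(patient_1, patient_2)
print(is_match, sorted(genes), sorted(features))
```

Here the only shared candidate is Gene D, supported by four shared features, which is the situation that would trigger a notification to both clinical geneticists.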
Matchmaker Exchange Collaboration and Support from GA4GH and IRDiRC
Needs span multiple GA4GH workgroups
bull Data Work Group (data format and interfaces)
bull Regulatory and Ethics (patient consent)
bull Security (patient privacy and user authentication)
Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7
Human Mutation Special Issue
• The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
• The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
• GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
• PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
• Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
• Innovative genomic collaboration using the GENESIS (GEM.app) platform
• Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
• Participant-led matchmaking
• GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
• Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
• Data sharing in the Undiagnosed Disease Network
• The Genomic Birthday Paradox: How Much is Enough?
• Quantifying and mitigating false-positive disease associations in rare disease matchmaking
• Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
• GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue caused by de novo variants in HNRNPK
• Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies
The Matchmaker Exchange Connecting Matchmakers to Accelerate Gene Discovery
www.matchmakerexchange.org
MME Use Cases
• Supported today
  - Match on Gene/Phenotype
    • 1 strong candidate gene (e.g., de novo variant in GUS)
    • 10 candidate genes, each with rare variant
  - Match on Phenotype Only
• Coming soon
  - Match case to non-human models
  - Match using phenotype and VCF
  - Patient-initiated matching
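The supported use cases (gene plus phenotype, or phenotype only) correspond to which parts of a patient profile are sent to a matchmaker. The sketch below builds such a profile as JSON. The field names (`patient`, `contact`, `features`, `genomicFeatures`) are modeled on the published Matchmaker Exchange API and should be treated as assumptions to verify against the specification; the identifiers and contact details are placeholders.

```python
import json
from typing import Any, Dict, List


def build_match_request(patient_id: str, contact_name: str, contact_href: str,
                        hpo_ids: List[str], gene_ids: List[str]) -> Dict[str, Any]:
    """Build a match request carrying phenotype terms, candidate genes, or both."""
    if not hpo_ids and not gene_ids:
        raise ValueError("a match request needs phenotype terms or candidate genes")
    patient: Dict[str, Any] = {
        "id": patient_id,
        "contact": {"name": contact_name, "href": contact_href},
    }
    if hpo_ids:
        patient["features"] = [{"id": term} for term in hpo_ids]
    if gene_ids:
        patient["genomicFeatures"] = [{"gene": {"id": gene}} for gene in gene_ids]
    return {"patient": patient}


# Gene/phenotype use case: one strong candidate gene plus phenotype terms.
gene_and_phenotype = build_match_request(
    "example-patient-1", "Example Clinician", "mailto:clinician@example.org",
    hpo_ids=["HP:0000252"], gene_ids=["GUS"],
)
# Phenotype-only use case.
phenotype_only = build_match_request(
    "example-patient-2", "Example Clinician", "mailto:clinician@example.org",
    hpo_ids=["HP:0000252"], gene_ids=[],
)
print(json.dumps(gene_and_phenotype, indent=2))
```

Omitting the gene list rather than sending an empty one keeps the phenotype-only request distinct from a gene-based one on the receiving side.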
GENESIS
Connected and Soon to be Connected Matchmakers
Matchmaker Exchange
Gene Matcher
DECIPHER
RD Connect
ClinGen Genome Connect
Monarch
Phenome Central
Patient initiated matching
Model organisms (mouse zebrafish) Orphanet ClinVar OMIM)
Live
Broad Institute
RDAP
Connecting Data in the Big Data World
Centralized Database: Everyone submits data to a single central database
Examples: ClinVar, dbGaP, EGA
Centralized Hub: APIs connect each database to a central hub
Example: Many commercial platforms
Federated Network: All databases connected through multiple APIs
Example: Matchmaker Exchange
Matchmaker Exchange Acknowledgements
S Balasubramanian, Mike Bamshad, Sergio Beltran Agullo, Jonathan Berg, Kym Boycott, Anthony Brookes, Michael Brudno, Han Brunner, Oriean Buske, Deanna Church, Raymond Dalgliesh, Andrew Devereau, Johan den Dunnen, Helen Firth, Paul Flicek, Jan Friedman, Richard Gibbs, Marta Girdea, Robert Green, Matt Hurles, Ada Hamosh, Ekta Khurana, Sebastian Kohler, Joel Krier, Owen Lancaster, Melissa Landrum, Paul Lasko, Rick Lifton, Daniel MacArthur, Alex MacKenzie, Danielle Metterville, Debbie Nickerson, Woong-Yan Park, Justin Paschall, Anthony Philippakis, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner
NIAGADS
The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP
April 26, 2016
Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)
• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia
Long-Term Plan for the ADSP
• Discovery Phase 2014-2018
  - Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC'd data released 7/13/2015
  - Case-control
    • Whole exome sequencing (WES): 5000 cases and 5000 controls
    • 1000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC'd data released 11/2015
• Follow-Up 2016-2021
  - WGS in case-control sample sets, emphasizing ethnically diverse cohorts
ADSP Discovery Phase Analysis
• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing
Participation
• PARTICIPANTS
Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
NIH staff: NHGRI and NIA
External Consultants
• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups
Genomics Center
Organization of the ADSP
ADSP Infrastructure and Support
NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]
National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]
National Alzheimer's Coordinating Center, NACC [U01 AG016976]
Alzheimer's Disease Centers
NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]
NIAGADS History and Function
Planning initiated 2010 FOA released 2011 U24 AG041689 funded under PAR 11-175 April 2012
NIA's repository for the genetics of late-onset Alzheimer's disease data
Datasets, Genomics Database, Analysis Resource
Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomic Data Sharing Policy
Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)
NIAGADS ADSP Data Coordinating Center
• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and Office of Science Policy on "Trusted Partners"
Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP
Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25 2015 NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and eRA Commons accounts for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as meta-data
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
Data production: Define data formatting; Collect and curate phenotypes; Track data production; Additional quality metrics; Coordinate public data release
Secondary/derived data management: Documentation (README/Manifest), packaging, and release; Notification and tracking
IT support: Website; Members area; Face-to-face meeting support; dbGaP exchange area
Facilitate interactions with other AD investigators
Rapid response to unforeseen ADSP and NIH requests
Interactions
NIAGADS
NIA/NHGRI
Baylor/Broad/WashU LSACs
NCBI dbGaP/SRA
ADSP Data Flow and other Work Groups
Genomics Center
NCRAD
ADGC/CHARGE
ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site
Management of Work Group Data/Work Flow
Exchange Area
Data Deposition and Data Sharing
ADSP-specific datasets: 68 TB
21 reference datasets with 5560 files in use by the ADSP
>20 reference datasets planned
1 TB = 1024 GB = 1048576 MB = 1073741824 KB = 1099511627776 bytes
A 1 TB hard drive holds roughly a trillion bytes (10^12 bytes by the decimal definition, about 1.1 x 10^12 by the binary definition above)
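The unit chain on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name and the choice to expose both the binary (1024-based) and decimal (1000-based) conventions are assumptions, not anything the ADSP tooling is known to use.

```python
# Binary (1024-based) and decimal (1000-based) readings of "1 TB".
BINARY_STEP = 1024
DECIMAL_STEP = 1000


def terabytes_to_bytes(tb: int, step: int = BINARY_STEP) -> int:
    """Convert terabytes to bytes; TB -> GB -> MB -> KB -> bytes is four steps."""
    return tb * step ** 4


binary_tb = terabytes_to_bytes(1)                 # binary definition used on the slide
decimal_tb = terabytes_to_bytes(1, DECIMAL_STEP)  # "a trillion bytes"
adsp_bytes = terabytes_to_bytes(68)               # the 68 TB of ADSP-specific datasets

print(binary_tb, decimal_tb, adsp_bytes)
```

The two conventions differ by about 10% at the terabyte scale, which is why a drive sold as 1 TB is usually described as holding a trillion bytes.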
dbGaP/NIAGADS release for the research community at large
11555 BAM files released 122014
Pedigree and phenotype information
Sequencing Quality Control Metrics
Size of the ADSP
Reference: "The real cost of sequencing: scaling computation to keep pace with data generation", Muir et al., Genome Biology 2016;17:53
ADSP Members Area (Dashboard)
Calendar Documents Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list 153 records
352 meeting minutes
>600 other documents
>100 consent files
10 analysis plans
85TB WGS, 1075TB WES data (dbGaP/SRA)
42 TB Files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
SNP/Gene reports, Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer's Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
GWAS results combined with
SNPs below
Gene annotations transformed to SNPs
Genomic features from identified SNPs
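The combined search described above (GWAS results filtered by significance, gene annotations transformed to SNPs, then intersected) can be sketched with plain sets. Everything in this sketch is hypothetical: the SNP identifiers, p-values, gene names, and the 5e-8 threshold are placeholders, and this is not the NIAGADS Genomics Database query interface.

```python
# Hypothetical GWAS summary rows: SNP id -> p-value (made-up values).
gwas_pvalues = {"rs0001": 3e-9, "rs0002": 0.04, "rs0003": 1e-8, "rs0004": 2e-6}

# Hypothetical gene annotation, already transformed to the SNPs inside each gene.
gene_to_snps = {"GENE_A": {"rs0001", "rs0002"}, "GENE_B": {"rs0004"}}


def significant_snps(pvalues, threshold=5e-8):
    """Step 1: keep SNPs whose GWAS p-value is below the chosen threshold."""
    return {snp for snp, p in pvalues.items() if p < threshold}


def genes_with_hits(annotation, hits):
    """Step 2: intersect each gene's SNP set with the significant SNPs."""
    return {gene: sorted(snps & hits) for gene, snps in annotation.items() if snps & hits}


hits = significant_snps(gwas_pvalues)
features = genes_with_hits(gene_to_snps, hits)
print(features)
```

Representing both searches as SNP sets is what makes them combinable: any annotation that can be mapped to SNPs can be intersected with the GWAS result in the same way.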
Find a genomic location of interest by viewing tracks on the genome browser
SNPs
genes
GWAS results
functional genomics data
View Genomic Information by the Genome Browser
search result
Data available at the NIAGADS Genomics Database
Gene models SNP information
Pathway annotations Gene Ontology
KEGG Pathway
Functional genomics data ENCODE
FANTOM5
GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics
Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  • Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  • Big data management
  • IT infrastructure and web development
• Administrative support and collaboration
  • Familiarity with human subjects research regulations and NIH policy
  • Leadership and coordinating skills
  • Flexibility, close collaboration with program officers
• Domain expertise
  • Knowledge of the disease and genetic research
  • Familiarity with the community and cohorts
  • Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
What the Data Coordinating Center Should Know
Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss them
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements NIAGADS EAB Matthew Farrer
Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober
NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD
NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon
NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao
GDC Overview GDC Data Submission Processing and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
  - Enables the identification of both high- and low-frequency cancer drivers
  - Assists in defining genomic determinants of response to therapy
  - Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  - Harmonization of raw sequence data both from existing (e.g. TCGA, TARGET, CGCI) and new cancer research programs
  - Application of state-of-the-art methods of generating high level data
GDC Overview Infrastructure
Diagram columns: GDC Users, GDC System Components, GDC Interfaces
Diagram labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; 3rd Party Applications; Alignment & Processing Tools; Data Submission Tools; Data Security System; APIs; Digital ID System
GDC Overview Resources
GDC Data Portal
GDC Data Center
GDC Data Transfer Tool
GDC Reports
GDC Data Model
GDC Data Submission Portal
GDC Application Programming Interface (API)
GDC Bioinformatics Pipeline
GDC Documentation and Support
GDC Organization and Collaborators
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Submission Data Submitter Types
• The GDC supports two types of data submitters
  - Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  - Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API
GDC Data Submission Data Submission Policies
• dbGaP Data Submission Policies
  - GDC data submitters must first apply for data submission authorization through dbGaP
  - Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  - Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  - Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  - Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  - Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
GDC Data Submission Data Submission Process
GDC Data Submission dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample: TSV, JSON; Portion: TSV, JSON; Analyte: TSV, JSON; Aliquot: TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic: TSV, JSON; Diagnosis: TSV, JSON; Exposure: TSV, JSON; Family History: TSV, JSON; Treatment: TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
Data Types and File Formats Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
Data Types and File Formats Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  - GDC clinical data elements were reviewed with members of the research community
  - Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
Case
Clinical Data: Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)
Biospecimen Data
Experiment Data
Data Types and File Formats Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
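A per-read-group QC table like the one above is easier to act on when reduced to the modules that did not pass. The following is a minimal sketch of that reduction; the dictionary layout and the function name are assumptions for illustration, and only the non-PASS rows from the slide are transcribed. It is not how the GDC itself reports FastQC results.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Rows from the slide that contain at least one WARNING or FAIL.
QC_RESULTS = {
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged_modules(results, read_groups):
    """Return, for each read group, the (module, status) pairs that are not PASS."""
    flagged = {rg: [] for rg in read_groups}
    for module, statuses in results.items():
        for rg, status in zip(read_groups, statuses):
            if status != "PASS":
                flagged[rg].append((module, status))
    return flagged


summary = flagged_modules(QC_RESULTS, READ_GROUPS)
for read_group, issues in summary.items():
    print(read_group, issues)
```

In this example RG_4 has no flagged modules, while RG_1 and RG_2 each carry one FAIL that a submitter would likely want to review before release.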
Data Submission Tools GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
Validate data against GDC standard data types defined in the project data dictionary
Obtain information on the status of data submission and processing by project
Data Submission Tools GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
Command line interface to specify desired transfer protocol
Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools GDC Submission API
• The GDC Submission API is a REST based programmatic interface that supports the programmatic submission of cancer data sets for analysis
Securely submit biospecimen, clinical, and experiment files using a token
Submit data to GDC by performing create, update, and retrieve actions on entities
Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
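The shape of a create-entities call can be sketched with the standard library alone. This sketch only builds the request object and never sends it. It assumes the endpoint path reads `/v0/submission/<program>/<project>`, takes the API root from the References slide, and uses an `X-Auth-Token` header for the token; the header name, the program and project names, and the entity fields (`type`, `submitter_id`) are illustrative assumptions rather than confirmed details of the GDC API.

```python
import json
import urllib.request

# Assumed API root (listed on the References slide); not contacted by this sketch.
API_ROOT = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST to the create-entities endpoint."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the authentication token
        },
    )


# Hypothetical program, project, and entity used only to show the request shape.
request = build_create_request(
    "PROGRAM",
    "PROJECT",
    [{"type": "case", "submitter_id": "example-case-1"}],
    token="PLACEHOLDER-TOKEN",
)
print(request.get_method(), request.full_url)
```

Keeping request construction separate from sending makes it possible to inspect or log the exact URL and payload before any controlled-access data leaves the submitter's machine.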
GDC Data Submission Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Processing Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing DNA and RNA Sequence Harmonization Pipelines
DNA Sequence Pipeline RNA Sequence Pipeline
GDC Processing GDC High Level Data Generation
• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP Array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing GDC Variant Calling Pipelines
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Retrieval
• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
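A graph of nodes and edges keyed by unique identifiers can be illustrated with a very small in-memory structure. The sketch below is a toy model under stated assumptions: the class name, the node labels (`project`, `case`, `sample`, `file`), and the identifiers are hypothetical, and it says nothing about how the GDC actually stores its graph.

```python
from collections import defaultdict


class DataGraph:
    """Toy directed graph: node id -> label, plus parent -> children edges."""

    def __init__(self):
        self.nodes = {}
        self.edges = defaultdict(set)

    def add_node(self, node_id, label):
        self.nodes[node_id] = label

    def add_edge(self, parent, child):
        if parent not in self.nodes or child not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges[parent].add(child)

    def reachable(self, start, label):
        """Depth-first walk from start, collecting ids of nodes with the given label."""
        seen, stack, found = set(), [start], []
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            if self.nodes[node] == label:
                found.append(node)
            stack.extend(self.edges.get(node, ()))
        return sorted(found)


graph = DataGraph()
for node_id, label in [
    ("project-1", "project"), ("case-1", "case"), ("case-2", "case"),
    ("sample-1", "sample"), ("sample-2", "sample"),
    ("file-1", "file"), ("file-2", "file"),
]:
    graph.add_node(node_id, label)
for parent, child in [
    ("project-1", "case-1"), ("project-1", "case-2"),
    ("case-1", "sample-1"), ("case-2", "sample-2"),
    ("sample-1", "file-1"), ("sample-2", "file-2"),
]:
    graph.add_edge(parent, child)

print(graph.reachable("project-1", "file"))
print(graph.reachable("case-2", "file"))
```

Because every file is reached only by following edges from its project and case, a query for one case cannot return another case's files, which is the linkage property the slide describes.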
GDC Data Retrieval GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
Data browsing by project, file, case, or annotation
Visualization allowing users to perform fine-grained filtering of search results
Data search using advanced smart search technology
Data selection into a personalized cart
Data download from cart or a high-performance Data Transfer Tool
GDC Data Retrieval GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
Command line interface to specify desired transfer protocol and multiple files for download
Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
Supports the secure transfer of controlled access data using an authentication key (token)
Data Retrieval Tools GDC API
• The GDC API is a REST based programmatic interface that allows users to search and download cancer data sets for analysis
Search for projects, files, cases, annotations and retrieve associated details in JSON format
Securely retrieve biospecimen, clinical, and molecular files
Perform BAM slicing
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Parts of the request: API URL, Endpoint, URL parameters, Query parameters
{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
Portion removed for readability
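The projects query on this slide can be assembled and its response parsed with the standard library. This is a sketch under assumptions: the API root is the one listed on the References slide, the helper names are made up, and the response is parsed from an inline two-hit sample in the slide's shape rather than fetched from the live service.

```python
import json
from urllib.parse import urlencode

# Assumed API root from the References slide; no network call is made here.
API_ROOT = "https://gdc-api.nci.nih.gov"


def projects_url(fields, pretty=True):
    """Build the projects query URL with a comma-separated fields list."""
    query = urlencode(
        {"fields": ",".join(fields), "pretty": str(pretty).lower()},
        safe=",",  # keep the commas between field names readable
    )
    return f"{API_ROOT}/projects?{query}"


def primary_sites(response_text):
    """Map project_id to primary_site from a response shaped like the slide's example."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


# Inline sample with two of the hits shown on the slide.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}
"""

url = projects_url(["project_id", "primary_site"])
sites = primary_sites(SAMPLE_RESPONSE)
print(url)
print(sites)
```

Requesting only the fields needed keeps responses small, which matters when paging through many projects or files.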
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
Targets data consumers, providers, developers, and general users
Provides access to information about GDC and contributed cancer genomic data sets
Instructs users on the use of GDC data access and submission tools
Provides descriptions of GDC bioinformatics pipelines
Documents supported GDC data types and file formats
Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site
  - https://gdc.nci.nih.gov
• GDC Documentation Site
  - https://gdc-docs.nci.nih.gov
• GDC Data Portal
  - https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  - https://gdc-portal.nci.nih.gov/submission
  - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  - https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  - https://gdc-api.nci.nih.gov
• GDC Support
  - GDC Assistance: support@nci-gdc.datacommons.io
  - GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype)
  - Aggregation
  - Access
  - Sharing
  - Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based public facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
Diagram labels: Birth defect or childhood cancer cohorts; DNA Sequencing center; Birth defect BAM/VCF; Cancer BAM/VCF; dbGaP; NCI GDC; Birth defect VCF; Cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); Users
GMKF Data Resource
dbGaP
NCI GDC
TOPMed
The Monarch Initiative
ClinGen
Matchmaker Exchange
Center for Mendelian Genomics
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
ClinGen Acknowledgements ClinGen Steering Committee
Jonathan Berg (UNC), Lisa Brooks (NHGRI), Carlos Bustamante (Stanford), Mike Cherry (Stanford), James Evans (UNC), Andy Faucett (Geisinger), Katrina Goddard (Kaiser Permanente), Danuta Krotoski (NICHD), Melissa Landrum (NCBI), David Ledbetter (Geisinger), Christa Lese Martin (Geisinger), Aleks Milosavljevic (Baylor), Robert Nussbaum (UCSF), Kelly Ormond (Stanford), Sharon Plon (Baylor), Erin Ramos (NHGRI), Heidi Rehm (Harvard), Sheri Schully (NCI), Steve Sherry (NCBI), Michael Watson (ACMG), Kirk Wilhelmsen (UNC), Marc Williams (Geisinger)
Program Coordinators: Danielle Azzariti, Brianne Kirkpatrick, Kristy Lee, Laura Milko, Annie Niehaus, Misha Rashkin, Erin Riggs, Andy Rivera, Cody Sam, Yekaterina Vaydylevich, Meredith Weaver
ClinGen Working Groups (WG)
Genomic Variant WG
Chairs: Christa Martin, Sharon Plon, Heidi Rehm
Sequence Variant Interpretation WG
Chairs: Les Biesecker, Marc Greenblatt
Phenotyping WG
Chair: David Miller
ClinVar IT Standards and Data Submission WG
Chairs: Karen Eilbeck, Melissa Landrum
Data Model WG
Chairs: Larry Babb, Chris Bizon
Informatics WG
Chair: Carlos Bustamante
Clinical Domain WGs
Hereditary Cancer: Matthew Ferber, Ken Offit, Sharon Plon
Somatic Cancer: Shashi Kulkarni, Subha Madhavan
Cardiovascular: Euan Ashley, Birgit Funke, Ray Hershberger
Metabolic: Rong Mao, Robert Steiner, Bill Craigen
Pharmacogenomic: Teri Klein, Howard McLeod
Education, Engagement, Access WG
Chairs: Andy Faucett, Erin Riggs
Consent and Disclosure Recommendations (CADRe) WG
Chairs: Andy Faucett, Kelly Ormond
Gene Curation WG
Chairs: Jonathan Berg, Christa Martin
Actionability WG
Chairs: Jim Evans, Katrina Goddard
EHR WG
Chair: Marc Williams
ClinGen Gene-Disease Validity Classification
http://www.clinicalgenome.org/knowledge-curation/gene-curation
ClinGen Gene-Disease Scoring Matrix
Proposed Gene Inclusion for Clinical Tests
Evidence levels: Definitive evidence, Strong evidence, Moderate evidence, Limited/Disputed/No evidence
Test types: Predictive Tests & SFs, Diagnostic Panels, Exome/Genome
Many ClinGen Clinical Domain WGs are initially focused on Gene Curation
Define genes appropriate for clinical testing and genes where additional evidence is needed
clinicalgenome.org
Available ClinGen Tools amp Resources
Listed By Gene
Aggregating Variant Interpretations in ClinVar
Sharing Clinical Reports Project, GenomeConnect, Free-the-Data
Variant-level Data ClinVar
Linked Databases
Researchers Clinics Patients
Patient Registries
Labs
Unpublished or
Literature Citations
InSiGHT
CFTR2 OMIM
Groups
BIC
PharmGKB
Expert Clinical
505 ClinVar submitters
181,841 variants submitted
126,974 unique interpreted variants
ClinVar as of April 26 2016
ClinVar Variant View
Assertion Levels in ClinVar
Expert Panel
Single Submitter - Criteria Provided
Single Submitter - No Criteria Provided
Multi-Source Consistency
Practice Guideline
No stars
No Assertion Not applicable
ACMG CPIC
CFTR2 InSiGHT PharmGKB ENIGMA
Supporting a Curation Environment for both Crowd-Sourcing and Expert Consensus
Curated Variants
ClinVar
Variants
Gene and Variant Curation Interfaces
Case-level data store Machine-learning algorithms
Data resources
ClinGenKB
ClinGen Clinical WGs amp Expert Panels
Outside Expert Panels
Discrepancy Resolution
Primary Curators
Convene experts and implement methods for gene and variant curation
Resolve variant interpretation differences in ClinVar
Enable rule-guided variant interpretation and export (to user and ClinVar)
Conditions and phenotypes in ClinVar
Variant assertions are made on "conditions"
Phenotypic features are seen in patients
(supporting observations)
Standardization of disease terminologies
Parkinson's disease subtypes
yellow = Orphanet, brown = OMIM, blue = Disease Ontology, pink = Monarch, gray = MeSH
Diseases need to be hierarchically related
https://github.com/monarch-initiative/monarch-ontology
Courtesy of Melissa Haendel
Human Phenotype Ontology
117,348 annotations for ~7,000 mainly monogenic diseases
Used to define phenotypic elements of a disease or patient
Courtesy of Melissa Haendel
Centers for Mendelian Genomics Phenotyping Standards Survey
Survey results (four bar charts; vertical axes count responses from 0 to 4):
Do you use any standard terminologies or ontologies when collecting/tracking phenotypes? (Yes, Sometimes, No, Other)
Which systems are used to track disease name? (OMIM, Disease ontology, ORDO (Orphanet), MeSH, MedGen, Other (please specify))
Do you use Human Phenotype Ontology terms to capture more detailed phenotypic features on subjects? (Yes, Sometimes, Rarely)
Which tools are used to collect phenotype? (PhenoDB, Patient Archive, PhenoTips)
Joint Center for Mendelian Genomics
Daniel MacArthur, Heidi Rehm, Joe Gleeson, Eric Pierce, Alan Beggs, Mark Daly, Christine Seidman, David Sweetser, Vamsi Mootha
Steering Committee
Coordination Team Project manager Hayley Brooks
Clinical project manager Sam Baxter Monkol Lek
Methods
Analysis Software
Brain Heart
Muscle
Hearing
Retinal
Mito
Other
Clinical Analysis
Monkol Lek Elise Valkanas Tom Mullen
Ben Weisburd Harindra Arachchi
Sarah Calvo Laura Gauthier Laurent Francioli
MacArthur Lab's seqr: Online platform for collaborative analysis
Platform allows collaborative analyses between central sequencing site and thousands of collaborators
enter structured phenotype data
seqr: Online platform for collaborative analysis
https://seqr.broadinstitute.org
Rare Disease Analysis Platform
Genomic Matchmaking
Patient 1 (Clinical Geneticist 1)
Genotypic Data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
Phenotypic Data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5
Patient 2 (Clinical Geneticist 2)
Genotypic Data: Gene D, Gene G, Gene H
Phenotypic Data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6
Genomic Matchmaker: Notification of Match
Courtesy of Joel Krier
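The matching idea in this diagram can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `PatientProfile` class, the `match_score` function, the use of Jaccard similarity, and the 0.5 threshold are hypothetical choices made for this sketch and are not how any Matchmaker Exchange service is documented to score matches. The gene and feature labels are the placeholders from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatientProfile:
    """Hypothetical, simplified record: candidate genes plus phenotypic features."""
    label: str
    genes: frozenset
    features: frozenset


def match_score(a, b):
    """Return shared genes, shared features, and Jaccard similarity of the features."""
    shared_genes = a.genes & b.genes
    shared_features = a.features & b.features
    all_features = a.features | b.features
    jaccard = len(shared_features) / len(all_features) if all_features else 0.0
    return shared_genes, shared_features, jaccard


patient_1 = PatientProfile(
    "Patient 1",
    frozenset({"Gene A", "Gene B", "Gene C", "Gene D", "Gene E", "Gene F"}),
    frozenset({"Feature 1", "Feature 2", "Feature 3", "Feature 4", "Feature 5"}),
)
patient_2 = PatientProfile(
    "Patient 2",
    frozenset({"Gene D", "Gene G", "Gene H"}),
    frozenset({"Feature 1", "Feature 3", "Feature 4", "Feature 5", "Feature 6"}),
)

genes, features, jaccard = match_score(patient_1, patient_2)

# Assumed rule for this sketch: notify both clinicians when at least one
# candidate gene is shared and the phenotype overlap is at least 0.5.
is_match = bool(genes) and jaccard >= 0.5
print(sorted(genes), sorted(features), round(jaccard, 2), is_match)
```

With the slide's example, the two patients share Gene D and four of the six distinct features, so the sketch reports a match.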
Matchmaker Exchange Collaboration and Support from GA4GH and IRDiRC
Needs span multiple GA4GH workgroups
• Data Work Group (data format and interfaces)
• Regulatory and Ethics (patient consent)
• Security (patient privacy and user authentication)
Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat 2015;36(10):922-7.
Human Mutation Special Issue
The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
Innovative genomic collaboration using the GENESIS (GEM.app) platform
Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
Participant-led matchmaking
GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
Data sharing in the Undiagnosed Disease Network
The Genomic Birthday Paradox: How Much is Enough?
Quantifying and mitigating false-positive disease associations in rare disease matchmaking
Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue abnormalities caused by de novo variants in HNRNPK
Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies
The Matchmaker Exchange Connecting Matchmakers to Accelerate Gene Discovery
www.matchmakerexchange.org
MME Use Cases
• Supported today
  – Match on Gene/Phenotype
    1 strong candidate gene (e.g., de novo variant in GUS)
    10 candidate genes, each with rare variant
  – Match on Phenotype Only
• Coming soon
  – Match case to non-human models
  – Match using phenotype and VCF
  – Patient-initiated matching
GENESIS
Connected and Soon to be Connected Matchmakers
Matchmaker Exchange
Gene Matcher
DECIPHER
RD Connect
ClinGen Genome Connect
Monarch
Phenome Central
Patient initiated matching
Model organisms (mouse zebrafish) Orphanet ClinVar OMIM)
Live
Broad Institute
RDAP
Connecting Data in the Big Data World
Centralized Database: Everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA
Centralized Hub: APIs connect each database to a central hub. Example: Many commercial platforms
Federated Network: All databases connected through multiple APIs. Example: Matchmaker Exchange
Matchmaker Exchange Acknowledgements
S Balasubramanian, Mike Bamshad, Sergio Beltran Agullo, Jonathan Berg, Kym Boycott, Anthony Brookes, Michael Brudno, Han Brunner, Orion Buske, Deanna Church, Raymond Dalgleish, Andrew Devereau, Johan den Dunnen, Helen Firth, Paul Flicek, Jan Friedman, Richard Gibbs, Marta Girdea, Robert Green, Matt Hurles, Ada Hamosh, Ekta Khurana, Sebastian Köhler, Joel Krier, Owen Lancaster, Melissa Landrum, Paul Lasko, Rick Lifton, Daniel MacArthur, Alex MacKenzie, Danielle Metterville, Debbie Nickerson, Woong-Yang Park, Justin Paschall, Anthony Philippakis, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner
NIAGADS
The NIA Genetics of Alzheimer’s Disease Data Storage Site: A Partnership with dbGaP
April 26, 2016
Background: Initiation of the Alzheimer’s Disease Sequencing Project (ADSP)
• February 7, 2012: Presidential Initiative announced to fight Alzheimer’s Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia
Long-Term Plan for the ADSP
• Discovery Phase 2014-2018
  – Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC’d data released 7/13/2015
  – Case-control
    • Whole exome sequencing (WES): 5,000 cases and 5,000 controls
    • 1,000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC’d data released 11/2015
• Follow-Up 2016-2021: WGS in case-control sample sets, emphasizing ethnically diverse cohorts
ADSP Discovery Phase Analysis
• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing
Participation
• PARTICIPANTS
  – Two AD Genetics Consortia: Alzheimer’s Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  – Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  – NIH staff: NHGRI and NIA
  – External Consultants
• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups
Genomics Center
Organization of the ADSP
ADSP Infrastructure and Support
NIA Genetics of Alzheimer’s Disease Data Storage Site, NIAGADS [U24 AG041689]
National Cell Repository for Alzheimer’s Disease, NCRAD [U24 AG021886]
National Alzheimer’s Coordinating Center, NACC [U01 AG016976]
Alzheimer’s Disease Centers
NIA Center for Genetics and Genomics of Alzheimer’s Disease, CGAD [U54 AG052427]
NIAGADS History and Function
Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
NIA’s repository for the genetics of late-onset Alzheimer’s disease data
Datasets, Genomics Database, Analysis Resource
Compliance with the NIA AD Genetics of Alzheimer’s Disease Sharing Policy and the NIH Genomics Data Sharing Policy
Data Coordinating Center for the Alzheimer’s Disease Sequencing Project (ADSP)
NIAGADS ADSP Data Coordinating Center
• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and Office of Science Policy on “Trusted Partners”
Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP
Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and eRA Commons accounts for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
Data production: Define data formatting; Collect and curate phenotypes; Track data production; Additional quality metrics; Coordinate public data release
Secondary/derived data management: Documentation (README/Manifest), packaging, and release; Notification and tracking
IT support: Website; Members area; Face-to-face meeting support; dbGaP exchange area
Facilitate interactions with other AD investigators
Rapid response to unforeseen ADSP and NIH requests
Interactions
NIAGADS
NIA/NHGRI
Baylor/Broad/WashU LSACs
NCBI dbGaP/SRA
ADSP Data Flow and other Work Groups
Genomics Center
NCRAD
ADGC/CHARGE
ADGC: Alzheimer’s Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer’s Disease
NIAGADS: NIA Genetics of Alzheimer’s Disease Data Storage Site
Management of Work Group Data/Work Flow
Exchange Area
Data Deposition and Data Sharing
ADSP-specific datasets: 68 TB
21 reference datasets with 5,560 files in use by the ADSP
>20 reference datasets planned
1 TB = 1,024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes
A 1 TB hard drive has the capacity to hold a trillion bytes
dbGaP/NIAGADS release for the research community at large
11,555 BAM files released 12/2014
Pedigree and phenotype information
Sequencing Quality Control Metrics
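The terabyte arithmetic on this slide can be checked with a short Python sketch. It assumes the slide's binary convention (each unit is 1,024 of the next smaller one); the function name `terabyte_breakdown` is hypothetical. It also makes visible that a binary terabyte is somewhat larger than the "trillion bytes" (10^12) of the drive-capacity remark, which follows the decimal convention.

```python
BINARY_STEP = 1024  # the slide's convention: 1 TB = 1,024 GB, and so on down


def terabyte_breakdown(terabytes=1):
    """Convert terabytes to GB, MB, KB, and bytes using 1,024-based steps."""
    gigabytes = terabytes * BINARY_STEP
    megabytes = gigabytes * BINARY_STEP
    kilobytes = megabytes * BINARY_STEP
    total_bytes = kilobytes * BINARY_STEP
    return {"GB": gigabytes, "MB": megabytes, "KB": kilobytes, "bytes": total_bytes}


one_tb = terabyte_breakdown(1)

# Decimal "trillion bytes", as in the hard-drive remark on the slide.
decimal_trillion = 10 ** 12
excess_fraction = one_tb["bytes"] / decimal_trillion - 1

# The same conversion applied to the 68 TB quoted for ADSP-specific datasets.
adsp_bytes = terabyte_breakdown(68)["bytes"]

print(one_tb)
print(f"binary TB exceeds 10^12 bytes by {excess_fraction:.1%}")
print(f"68 TB = {adsp_bytes:,} bytes")
```

The gap between the two conventions is just under 10 percent, which matters when sizing storage at the tens-of-terabytes scale described here.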
Size of the ADSP
Reference: “The real cost of sequencing: scaling computation to keep pace with data generation,” Muir et al., Genome Biology 2016; 17:53
ADSP Members Area (Dashboard)
Calendar Documents Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list 153 records
352 meeting minutes
>600 other documents
>100 consent files
10 analysis plans
85TB WGS 1075TB WES data (dbGaP/SRA)
42 TB Files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
SNP/Gene reports, Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer’s Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
GWAS results combined with
SNPs below
Gene annotations transformed to SNPs
Genomic features from identified SNPs
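The combined search described on this slide (GWAS results filtered by significance, gene annotations transformed to SNPs, then the two combined) can be sketched as a set intersection. Everything in the sketch is hypothetical: the SNP and gene identifiers, the p-values, the annotation term, and the 5e-8 threshold (a conventional genome-wide cutoff, not one stated in the slides). It does not use or describe the actual NIAGADS Genomics Database interface.

```python
# Hypothetical GWAS summary statistics: SNP id -> p-value.
GWAS_RESULTS = {"rs0001": 3e-9, "rs0002": 0.02, "rs0003": 4e-8, "rs0004": 1e-10}

# Hypothetical mapping from genes to the SNPs located in them.
GENE_TO_SNPS = {
    "GENE_X": {"rs0001", "rs0002"},
    "GENE_Y": {"rs0003"},
    "GENE_Z": {"rs0005"},
}

# Hypothetical annotation search result: genes carrying a term of interest.
GENES_WITH_ANNOTATION = {"GENE_X", "GENE_Z"}


def significant_snps(results, threshold=5e-8):
    """SNPs whose p-value is at or below the chosen significance level."""
    return {snp for snp, p_value in results.items() if p_value <= threshold}


def snps_for_genes(genes, gene_to_snps):
    """Transform a gene-level search result into the set of SNPs in those genes."""
    snps = set()
    for gene in genes:
        snps |= gene_to_snps.get(gene, set())
    return snps


gwas_hits = significant_snps(GWAS_RESULTS)
annotated_snps = snps_for_genes(GENES_WITH_ANNOTATION, GENE_TO_SNPS)

# Combining the two searches is an intersection of the SNP sets.
features_of_interest = gwas_hits & annotated_snps
print(sorted(features_of_interest))
```

Expressing both searches as SNP sets is what makes them combinable: gene-level results have to be converted to SNPs first, as the slide's flow indicates.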
Find a genomic location of interest by viewing tracks on the genome browser
SNPs
genes
GWAS results
functional genomics data
View Genomic Information by the Genome Browser
search result
Data available at the NIAGADS Genomics Database
Gene models SNP information
Pathway annotations Gene Ontology
KEGG Pathway
Functional genomics data ENCODE
FANTOM5
GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics
Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  – Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  – Big data management
  – IT infrastructure and web development
• Administrative support and collaboration
  – Familiarity with human subjects research regulations and NIH policy
  – Leadership and coordinating skills
  – Flexibility; close collaboration with program officers
• Domain expertise
  – Knowledge of the disease and genetic research
  – Familiarity with the community and cohorts
  – Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
What the Data Coordinating Center Should Know
Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements NIAGADS EAB Matthew Farrer
Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober
NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly bull Data flow WG
bull Annotation WG IGAPADGCCHARGEEADIGERAD
NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon
NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao
GDC Overview: GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high level data
GDC Overview: Infrastructure
[Architecture diagram with three column headings: GDC Users, GDC System Components, GDC Interfaces. Labels recovered from the diagram: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; 3rd Party Applications; Data Access Tools; Data Submission Tools; APIs; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System]
GDC Overview: Resources
GDC Data Portal
GDC Data Center
GDC Data Transfer Tool
GDC Reports
GDC Data Model
GDC Data Submission Portal
GDC Application Programming Interface (API)
GDC Bioinformatics Pipeline
GDC Documentation and Support
GDC Organization and Collaborators
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH’s Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter’s NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset’s “data use limitations” either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
GDC Data Submission: Data Submission Process
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | Case (TSV, JSON)
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample, Portion, Analyte, Aliquot (each TSV, JSON)
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic, Diagnosis, Exposure, Family History, Treatment (each TSV, JSON)
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | (not recoverable) | TSV, JSON
Data File | Slide Image | SVS | (not recoverable) | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI’s Cancer Data Standards Repository (caDSR)
[Diagram: Case links to Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)), Biospecimen Data, and Experiment Data]
Data Types and File Formats: Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
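As a rough illustration of the create-entities call named on this slide, the sketch below builds (but does not send) a POST request with Python's standard library. The path `/v0/submission/<program>/<project>` comes from the slide, and the base URL is the 2016 API address listed in this deck's references, written with its punctuation restored. The `X-Auth-Token` header name, the JSON body shape, and the placeholder program, project, and token values are assumptions made for the example; the GDC documentation is the authority on the actual request format.

```python
import json
import urllib.request

# Base URL as listed in the deck's reference slide (2016 address).
GDC_API_BASE = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build a POST request for the create-entities endpoint without sending it."""
    url = f"{GDC_API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Auth-Token": token,  # assumed header name for the authentication token
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Illustrative entity only; real submissions must follow the GDC data dictionary.
example_entity = {"type": "case", "submitter_id": "EXAMPLE-CASE-001"}

request = build_create_request(
    program="PROGRAM",          # placeholder for <program>
    project="PROJECT",          # placeholder for <project>
    entities=[example_entity],
    token="PLACEHOLDER-TOKEN",  # never hard-code a real token
)

print(request.get_method(), request.full_url)
# Sending would be: urllib.request.urlopen(request), which needs network access
# and valid credentials.
```

Building the request separately from sending it keeps the token handling and URL construction easy to inspect before any data leaves the submitter's environment.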
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Pipeline diagrams: DNA Sequence Pipeline; RNA Sequence Pipeline]
GDC Processing: GDC High Level Data Generation
• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP Array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
GDC Data Retrieval
• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
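A toy version of such a graph can make the idea concrete. The sketch below is a minimal illustration only: the `DataGraph` class, the node types, and the identifiers are hypothetical and do not reproduce the GDC's actual schema or storage technology. It shows typed nodes joined by edges and a traversal that finds the file objects linked to a case or project through their identifiers.

```python
from collections import defaultdict, deque


class DataGraph:
    """Toy graph: typed nodes plus directed parent -> child edges."""

    def __init__(self):
        self.node_types = {}               # node id -> node type
        self.children = defaultdict(list)  # node id -> ids of child nodes

    def add_node(self, node_id, node_type):
        self.node_types[node_id] = node_type

    def add_edge(self, parent_id, child_id):
        # Refuse edges to unknown nodes so every link resolves to a real identifier.
        for node_id in (parent_id, child_id):
            if node_id not in self.node_types:
                raise KeyError(f"unknown node: {node_id}")
        self.children[parent_id].append(child_id)

    def descendants_of_type(self, start_id, node_type):
        """Breadth-first walk from start_id; return sorted ids of the given type."""
        found, seen, queue = [], {start_id}, deque([start_id])
        while queue:
            current = queue.popleft()
            for child in self.children.get(current, []):
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
                    if self.node_types[child] == node_type:
                        found.append(child)
        return sorted(found)


graph = DataGraph()
for node_id, node_type in [
    ("project-1", "project"), ("case-1", "case"), ("case-2", "case"),
    ("clinical-1", "clinical"), ("sample-1", "sample"),
    ("file-1", "file"), ("file-2", "file"),
]:
    graph.add_node(node_id, node_type)

for parent, child in [
    ("project-1", "case-1"), ("project-1", "case-2"),
    ("case-1", "clinical-1"), ("case-1", "sample-1"),
    ("sample-1", "file-1"), ("case-2", "file-2"),
]:
    graph.add_edge(parent, child)

files_for_case_1 = graph.descendants_of_type("case-1", "file")
files_for_project = graph.descendants_of_type("project-1", "file")
print(files_for_case_1, files_for_project)
```

Because every edge is checked against known node identifiers, a file can only be reached through a chain of valid links, which is the property the slide attributes to the unique identifiers.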
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from cart or a high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing
Example request: https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
API URL: https://gdc-api.nci.nih.gov/; Endpoint: projects; Query parameters: fields=project_id,primary_site and pretty=true
Example response (portion removed for readability):
{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
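The projects query on this slide can be reproduced offline with Python's standard library. The sketch below assembles the same URL from its parts and parses a response of the shape shown on the slide. The base URL is the 2016 address from the deck's references with punctuation restored, the helper names `build_projects_url` and `sites_by_project` are hypothetical, and the sample response is a two-entry excerpt of the slide's output rather than live data.

```python
import json
from urllib.parse import urlencode

# Base URL as listed in the deck's reference slide (2016 address).
GDC_API_BASE = "https://gdc-api.nci.nih.gov"


def build_projects_url(fields, pretty=True):
    """Assemble API URL + endpoint + query parameters for a projects search."""
    params = {"fields": ",".join(fields), "pretty": str(pretty).lower()}
    # safe="," keeps the comma between field names unescaped, as on the slide.
    return f"{GDC_API_BASE}/projects?{urlencode(params, safe=',')}"


def sites_by_project(response_text):
    """Map each project_id to its primary_site from a projects response body."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


url = build_projects_url(["project_id", "primary_site"])

# Excerpt of the response shown on the slide, used here instead of a live call.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LUSC", "primary_site": "Lung"}
]}}
"""

sites = sites_by_project(SAMPLE_RESPONSE)
print(url)
print(sites)
```

Requesting only the needed fields keeps responses small, which is the purpose of the `fields` parameter in the slide's example.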
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User’s Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site – https://gdc.nci.nih.gov
• GDC Documentation Site – https://gdc-docs.nci.nih.gov
• GDC Data Portal – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype)
  – Aggregation
  – Access
  – Sharing
  – Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
Birth defect or childhood cancer cohorts
DNA Sequencing center
Birth defect BAM/VCF; Cancer BAM/VCF
dbGaP; NCI GDC
Birth defect VCF; Cancer VCF
Gabriella Miller Kids First Data Resource
Index of datasets, Phenotype, Variant summaries
Birth defect BAM/VCF; Cancer BAM/VCF
Users
GMKF Data
Resource
dbGaP
NCI GDC
TOPMed
The Monarch Initiative
ClinGen
Matchmaker Exchange
Center for Mendelian Genomics
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
ClinGen Gene-Disease Validity Classification
http://www.clinicalgenome.org/knowledge-curation/gene-curation
ClinGen Gene-Disease Scoring Matrix
Proposed Gene Inclusion for Clinical Tests
Definitive evidence
Strong evidence
Moderate evidence
Limited/Disputed/No evidence
Predictive Tests & SFs
Diagnostic Panels
Exome/Genome
Many ClinGen Clinical Domain WGs are initially focused on Gene Curation
Define genes appropriate for clinical testing and genes where additional evidence is needed
clinicalgenome.org
Available ClinGen Tools amp Resources
Listed By Gene
Aggregating Variant Interpretations in ClinVar
[Figure: variant-level data flowing into ClinVar. Labeled sources: Sharing Clinical Reports Project, GenomeConnect, Free-the-Data; researchers, clinics, patients, patient registries; clinical labs; expert groups; unpublished or literature citations; linked databases (InSiGHT, CFTR2, OMIM, BIC, PharmGKB)]
ClinVar as of April 26, 2016: 505 ClinVar submitters; 181,841 variants submitted; 126,974 unique interpreted variants
ClinVar Variant View
Assertion Levels in ClinVar
Practice Guideline (ACMG, CPIC)
Expert Panel (CFTR2, InSiGHT, PharmGKB, ENIGMA)
Multi-Source Consistency
Single Submitter – Criteria Provided
Single Submitter – No Criteria Provided (no stars)
No Assertion (not applicable)
Supporting a Curation Environment for both Crowd-Sourcing and Expert Consensus
Curated Variants
ClinVar
Variants
Gene and Variant Curation Interfaces
Case-level data store Machine-learning algorithms
Data resources
ClinGenKB
ClinGen Clinical WGs & Expert Panels
Outside Expert Panels
Discrepancy Resolution
Primary Curators
• Convene experts and implement methods for gene and variant curation
• Resolve variant interpretation differences in ClinVar
• Enable rule-guided variant interpretation and export (to user and ClinVar)
Conditions and phenotypes in ClinVar
Variant assertions are made on “conditions”
Phenotypic features are seen in patients (supporting observations)
Standardization of disease terminologies
Parkinson’s disease subtypes
yellow = Orphanet, brown = OMIM, blue = Disease Ontology, pink = Monarch, gray = MeSH
Diseases need to be hierarchically related
https://github.com/monarch-initiative/monarch-ontology (Courtesy of Melissa Haendel)
Human Phenotype Ontology: 117,348 annotations for ~7,000 mainly monogenic diseases
Used to define phenotypic elements of a disease or patient
Courtesy of Melissa Haendel
Centers for Mendelian Genomics Phenotyping Standards Survey
[Survey bar charts; response counts are not recoverable from the extracted text]
• Do you use any standard terminologies or ontologies when collecting/tracking phenotypes? (Yes / Sometimes / No / Other)
• Which systems are used to track disease name? (OMIM / Disease Ontology / ORDO (Orphanet) / MeSH / MedGen / Other, please specify)
• Do you use Human Phenotype Ontology terms to capture more detailed phenotypic features on subjects? (Yes / Sometimes / Rarely)
• Which tools are used to collect phenotype? (PhenoDB / Patient Archive / PhenoTips)
Joint Center for Mendelian Genomics
Daniel MacArthur, Heidi Rehm, Joe Gleeson, Eric Pierce, Alan Beggs, Mark Daly, Christine Seidman, David Sweetser, Vamsi Mootha
Steering Committee
Coordination Team Project manager Hayley Brooks
Clinical project manager Sam Baxter Monkol Lek
Methods
Analysis Software
Brain Heart
Muscle
Hearing
Retinal
Mito
Other
Clinical Analysis
Monkol Lek Elise Valkanas Tom Mullen
Ben Weisburd Harindra Arachchi
Sarah Calvo Laura Gauthier Laurent Francioli
MacArthur Lab’s seqr: Online platform for collaborative analysis
Platform allows collaborative analyses between central sequencing site and thousands of collaborators
enter structured phenotype data
seqr Online platform for collaborative analysis
https://seqr.broadinstitute.org
Rare Disease Analysis Platform seqr Online platform for collaborative analysis
Genomic Matchmaking
Patient 1 (Clinical Geneticist 1)
  Genotypic Data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
  Phenotypic Data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5
Patient 2 (Clinical Geneticist 2)
  Genotypic Data: Gene D, Gene G, Gene H
  Phenotypic Data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6
Genomic Matchmaker: Notification of Match
Courtesy of Joel Krier
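The matching idea in the diagram can be sketched in a few lines of Python. The two profiles below are copied from the slide; the function name `match_profiles` and the overlap score (shared features divided by all features) are illustrative assumptions, not the scoring used by any Matchmaker Exchange service.

```python
# Minimal sketch: compare two patient profiles by shared candidate genes
# and shared phenotypic features. The scoring rule is an assumption.

def match_profiles(a, b):
    """Return shared genes, shared features, and a simple overlap score."""
    shared_genes = a["genes"] & b["genes"]
    shared_features = a["features"] & b["features"]
    all_features = a["features"] | b["features"]
    overlap = len(shared_features) / len(all_features) if all_features else 0.0
    return {
        "shared_genes": shared_genes,
        "shared_features": shared_features,
        "phenotype_overlap": overlap,
    }

# Profiles as shown on the slide.
patient_1 = {
    "genes": {"Gene A", "Gene B", "Gene C", "Gene D", "Gene E", "Gene F"},
    "features": {"Feature 1", "Feature 2", "Feature 3", "Feature 4", "Feature 5"},
}
patient_2 = {
    "genes": {"Gene D", "Gene G", "Gene H"},
    "features": {"Feature 1", "Feature 3", "Feature 4", "Feature 5", "Feature 6"},
}

result = match_profiles(patient_1, patient_2)
print(sorted(result["shared_genes"]))
print(sorted(result["shared_features"]))
```

A shared candidate gene plus a high phenotype overlap is what would trigger the notification to both clinical geneticists.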
Matchmaker Exchange Collaboration and Support from GA4GH and IRDiRC
Needs span multiple GA4GH workgroups
• Data Work Group (data format and interfaces)
• Regulatory and Ethics (patient consent)
• Security (patient privacy and user authentication)
Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.
Human Mutation Special Issue
• The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
• The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
• GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
• PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
• Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
• Innovative genomic collaboration using the GENESIS (GEM.app) platform
• Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
• Participant-led matchmaking
• GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
• Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
• Data sharing in the Undiagnosed Disease Network
• The Genomic Birthday Paradox: How Much is Enough?
• Quantifying and mitigating false-positive disease associations in rare disease matchmaking
• Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
• GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue caused by de novo variants in HNRNPK
• Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies
The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery
www.matchmakerexchange.org
MME Use Cases
• Supported today
  – Match on Gene/Phenotype: 1 strong candidate gene (e.g., de novo variant in GUS); 10 candidate genes, each with rare variant
  – Match on Phenotype Only
• Coming soon
  – Match case to non-human models
  – Match using phenotype and VCF
  – Patient-initiated matching
GENESIS
Connected and Soon to be Connected Matchmakers
Matchmaker Exchange
Gene Matcher
DECIPHER
RD Connect
ClinGen Genome Connect
Monarch
Phenome Central
Patient initiated matching
Model organisms (mouse zebrafish) Orphanet ClinVar OMIM)
Live
Broad Institute
RDAP
Connecting Data in the Big Data World
Centralized Database: Everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA
Centralized Hub: APIs connect each database to a central hub. Example: many commercial platforms
Federated Network: All databases connected through multiple APIs. Example: Matchmaker Exchange
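As a rough illustration of the federated model, the sketch below sends the same gene query to several independent nodes and merges the answers, with no central copy of the data. Everything here is hypothetical: the node names, the in-memory case records, and the `make_node` and `federated_query` helpers stand in for real services that would be reached over their own APIs.

```python
# Minimal sketch of a federated query: each node keeps its own data and
# answers queries itself; the caller only merges the responses.

def make_node(name, cases):
    """Build a stand-in for a remote database exposing a query function."""
    def query(gene):
        return [f"{name}:{case_id}" for case_id, genes in cases.items() if gene in genes]
    return query

def federated_query(nodes, gene):
    """Ask every node for matches and merge the results."""
    hits = []
    for query in nodes:
        hits.extend(query(gene))
    return sorted(hits)

# Hypothetical nodes and records, for illustration only.
nodes = [
    make_node("node_a", {"case1": {"GENE_X", "GENE_Y"}, "case2": {"GENE_Z"}}),
    make_node("node_b", {"case7": {"GENE_X"}}),
    make_node("node_c", {"case9": {"GENE_W"}}),
]

matches = federated_query(nodes, "GENE_X")
print(matches)
```

In a centralized design the three record sets would instead be submitted to one database and queried once.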
Matchmaker Exchange Acknowledgements
S Balasubramanian, Robert Green, Mike Bamshad, Matt Hurles, Sergio Beltran Agullo, Ada Hamosh, Jonathan Berg, Ekta Khurana, Kym Boycott, Sebastian Kohler, Anthony Brookes, Joel Krier, Michael Brudno, Owen Lancaster, Han Brunner, Melissa Landrum, Oriean Buske, Paul Lasko, Deanna Church, Rick Lifton, Raymond Dalgliesh, Daniel MacArthur, Andrew Devereau, Alex MacKenzie, Johan den Dunnen, Danielle Metterville, Helen Firth, Debbie Nickerson, Paul Flicek, Woong-Yan Park, Jan Friedman, Justin Paschall, Richard Gibbs, Anthony Philippakis, Marta Girdea, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner
NIAGADS
The NIA Genetics of Alzheimer’s Disease Data Storage Site: A Partnership with dbGaP
April 26, 2016
Background: Initiation of the Alzheimer’s Disease Sequencing Project (ADSP)
• February 7, 2012: Presidential Initiative announced to fight Alzheimer’s Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia
Long-Term Plan for the ADSP
• Discovery Phase 2014-2018
  – Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC’d data released 7/13/2015
  – Case-control
    • Whole exome sequencing (WES): 5,000 cases and 5,000 controls
    • 1,000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC’d data released 11/2015
• Follow-Up 2016-2021: WGS in case-control sample sets, emphasizing ethnically diverse cohorts
ADSP Discovery Phase Analysis
• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing
Participation
• PARTICIPANTS
  – Two AD Genetics Consortia: Alzheimer’s Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  – Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  – NIH staff: NHGRI and NIA
  – External Consultants
• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups
Genomics Center
Organization of the ADSP
ADSP Infrastructure and Support
NIA Genetics of Alzheimer’s Disease Data Storage Site, NIAGADS [U24 AG041689]
National Cell Repository for Alzheimer’s Disease, NCRAD [U24 AG021886]
National Alzheimer’s Coordinating Center, NACC [U01 AG016976]
Alzheimer’s Disease Centers
NIA Center for Genetics and Genomics of Alzheimer’s Disease, CGAD [U54 AG052427]
NIAGADS History and Function
Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
NIA’s repository for the genetics of late-onset Alzheimer’s disease data
Datasets, Genomics Database, Analysis Resource
Compliance with the NIA AD Genetics of Alzheimer’s Disease Sharing Policy and the NIH Genomics Data Sharing Policy
Data Coordinating Center for the Alzheimer’s Disease Sequencing Project (ADSP)
NIAGADS ADSP Data Coordinating Center
• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and Office of Science Policy on “Trusted Partners”
Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP
Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and eRA Commons accounts for authentication; no password is stored at the ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
Data production: Define data formatting; Collect and curate phenotypes; Track data production; Additional quality metrics; Coordinate public data release
Secondary/derived data management: Documentation (README/Manifest), packaging, and release; Notification and tracking
IT support: Website; Members area; Face-to-face meeting support; dbGaP exchange area
Facilitate interactions with other AD investigators
Rapid response to unforeseen ADSP and NIH requests
Interactions
NIAGADS
NIA/NHGRI
Baylor/Broad/WashU LSACs
NCBI dbGaP/SRA
ADSP Data Flow and other Work Groups
Genomics Center
NCRAD
ADGC/CHARGE
ADGC: Alzheimer’s Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer’s Disease
NIAGADS: NIA Genetics of Alzheimer’s Disease Data Storage Site
Management of Work Group Data/Work Flow
Exchange Area
Data Deposition and Data Sharing
ADSP-specific datasets: 68 TB
21 reference datasets with 5,560 files in use by the ADSP
>20 reference datasets planned
1 TB = 1,024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes. A 1 TB hard drive has the capacity to hold roughly a trillion bytes
dbGaP/NIAGADS release for the research community at large
11,555 BAM files released 12/2014
Pedigree and phenotype information
Sequencing Quality Control Metrics
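The unit conversion quoted on this slide can be checked with a short calculation. The sketch below assumes the binary convention used in the slide (1 TB = 1024 GB); the helper name `tb_to_bytes` is illustrative. Under that convention a terabyte is about 10% larger than the "trillion bytes" of the decimal convention.

```python
# Minimal sketch: convert terabytes to bytes using the binary convention
# from the slide (1 TB = 1024 GB = 1024**4 bytes).

BYTES_PER_TB = 1024 ** 4

def tb_to_bytes(terabytes):
    """Convert terabytes (binary convention) to bytes."""
    return int(terabytes * BYTES_PER_TB)

one_tb = tb_to_bytes(1)
adsp_specific = tb_to_bytes(68)  # the 68 TB of ADSP-specific datasets

# Ratio between the binary terabyte and a decimal trillion bytes.
ratio_to_trillion = one_tb / 10 ** 12

print(one_tb)
print(adsp_specific)
print(round(ratio_to_trillion, 4))
```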
Size of the ADSP
Reference: “The real cost of sequencing: scaling computation to keep pace with data generation.” Muir et al. Genome Biology 2016, 17:53
ADSP Members Area (Dashboard)
Calendar Documents Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list 153 records
352 meeting minutes
>600 other documents
>100 consent files
10 analysis plans
85TB WGS, 1075TB WES data (dbGaP/SRA)
42 TB Files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
SNP/Gene reports, Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer’s Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
GWAS results combined with
SNPs below
Gene annotations transformed to SNPs
Genomic features from identified SNPs
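The combined search described here (GWAS results filtered by significance, intersected with SNPs derived from gene annotations) can be sketched as a set operation. All data below are hypothetical placeholders, and the function names and the 5e-8 default threshold are assumptions for illustration; this is not the NIAGADS Genomics Database query interface.

```python
# Minimal sketch: combine a GWAS significance filter with a gene
# annotation filter by intersecting the two SNP sets.

def significant_snps(gwas_results, threshold=5e-8):
    """SNPs whose GWAS p-value is at or below the threshold."""
    return {row["snp"] for row in gwas_results if row["p_value"] <= threshold}

def snps_for_genes(gene_to_snps, genes):
    """Transform a gene-level selection into the set of SNPs in those genes."""
    selected = set()
    for gene in genes:
        selected |= gene_to_snps.get(gene, set())
    return selected

# Hypothetical GWAS summary statistics and gene annotations.
gwas_results = [
    {"snp": "snp_1", "p_value": 3e-9},
    {"snp": "snp_2", "p_value": 2e-4},
    {"snp": "snp_3", "p_value": 1e-8},
    {"snp": "snp_4", "p_value": 4e-10},
]
gene_to_snps = {
    "GENE_X": {"snp_1", "snp_2"},
    "GENE_Y": {"snp_3"},
    "GENE_Z": {"snp_4"},
}

# Genes picked by some annotation, e.g. membership in a pathway of interest.
annotated_genes = ["GENE_X", "GENE_Y"]

features = significant_snps(gwas_results) & snps_for_genes(gene_to_snps, annotated_genes)
print(sorted(features))
```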
Find a genomic location of interest by viewing tracks on the genome browser
SNPs
genes
GWAS results
functional genomics data
View Genomic Information by the Genome Browser
search result
Data available at the NIAGADS Genomics Database
Gene models SNP information
Pathway annotations Gene Ontology
KEGG Pathway
Functional genomics data ENCODE
FANTOM5
GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics
Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  – Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  – Big data management
  – IT infrastructure and web development
• Administrative support and collaboration
  – Familiarity with human subjects research regulations and NIH policy
  – Leadership and coordinating skills
  – Flexibility; close collaboration with program officers
• Domain expertise
  – Knowledge of the disease and genetic research
  – Familiarity with the community and cohorts
  – Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements
Patients and families of those with AD
NIA ADRC program and centers
NACC, NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD
NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly
NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon, Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao
NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
GDC Overview: GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high level data
GDC Overview Infrastructure
[Figure: GDC infrastructure diagram relating GDC Users, GDC System Components, and GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; 3rd Party Applications; Alignment & Processing Tools; Data Submission Tools; Data Security System; APIs; Digital ID System]
GDC Overview Resources
GDC Data Portal
GDC Data Center
GDC Data Transfer Tool
GDC Reports
GDC Data Model
GDC Data Submission Portal
GDC Application Programming Interface (API)
GDC Bioinformatics Pipeline
GDC Documentation and Support
GDC Organization and Collaborators
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
GDC Data Submission Data Submitter Types
• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API
GDC Data Submission Data Submission Policies
• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH’s Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter’s NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset’s “data use limitations” either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
GDC Data Submission Data Submission Process
GDC Data Submission dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Portion | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Analyte | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Aliquot | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Diagnosis | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Exposure | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Family History | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Treatment | TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
Data Types and File Formats Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
Data Types and File Formats Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI’s Cancer Data Standards Repository (caDSR)
[Figure: a Case links to Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)), Biospecimen Data, and Experiment Data]
Data Types and File Formats Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FASTQC
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
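A QC report like the one above is easier to act on when the non-passing modules are listed per read group. The sketch below encodes the module statuses from the example table and collects the FAIL and WARNING entries; the data structure and the `flagged` helper are illustrative assumptions, not part of FASTQC or the GDC tooling.

```python
# Minimal sketch: summarize per-read-group QC statuses from the example table.

READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# One status per read group, in READ_GROUPS order, copied from the table.
QC_RESULTS = {
    "Basic Statistics": "PASS PASS PASS PASS",
    "Per base sequence quality": "PASS PASS PASS PASS",
    "Per tile sequence quality": "PASS WARNING PASS PASS",
    "Per sequence quality scores": "FAIL PASS PASS PASS",
    "Per base sequence content": "PASS PASS WARNING PASS",
    "Per sequence GC content": "PASS PASS PASS PASS",
    "Per base N content": "PASS PASS PASS PASS",
    "Sequence Length Distribution": "PASS PASS PASS PASS",
    "Sequence Duplication Levels": "WARNING WARNING PASS PASS",
    "Overrepresented sequences": "PASS PASS PASS PASS",
    "Adapter Content": "PASS FAIL PASS PASS",
    "Kmer Content": "PASS PASS WARNING PASS",
}

def flagged(results, read_groups, status):
    """Map each read group to the modules that reported the given status."""
    out = {}
    for module, row in results.items():
        for read_group, value in zip(read_groups, row.split()):
            if value == status:
                out.setdefault(read_group, []).append(module)
    return out

failures = flagged(QC_RESULTS, READ_GROUPS, "FAIL")
warned = flagged(QC_RESULTS, READ_GROUPS, "WARNING")

for read_group in READ_GROUPS:
    print(read_group, "FAIL:", failures.get(read_group, []), "WARNING:", warned.get(read_group, []))
```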
Data Submission Tools GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project
Data Submission Tools GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
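A request against the create-entities endpoint might be assembled as in the sketch below. The path comes from the slide; everything else is an assumption made for illustration: the base URL is the 2016 API host listed in the references, the `X-Auth-Token` header is assumed to carry the token, and the entity payload (fields `type` and `submitter_id`) is a hypothetical placeholder rather than a record validated against the GDC data dictionary. The request is built but deliberately not sent.

```python
# Minimal sketch: build (but do not send) a POST request for the
# create-entities endpoint /v0/submission/<program>/<project>.
import json
import urllib.request

def build_create_request(base_url, program, project, entities, token):
    """Return a urllib Request for submitting entities to a project."""
    url = f"{base_url.rstrip('/')}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the token
        },
    )

# Hypothetical program, project, entity, and token values.
request = build_create_request(
    "https://gdc-api.nci.nih.gov",
    "PROGRAM",
    "PROJECT",
    [{"type": "case", "submitter_id": "example-case-1"}],
    token="PLACEHOLDER_TOKEN",
)

print(request.get_method(), request.full_url)
```

Sending the request would be a call to `urllib.request.urlopen(request)`, which requires a valid token and submission authorization for the project.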
GDC Data Submission Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
GDC Processing Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing DNA and RNA Sequence Harmonization Pipelines
DNA Sequence Pipeline RNA Sequence Pipeline
GDC Processing GDC High Level Data Generation
• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing GDC Variant Calling Pipelines
GDC Data Retrieval
• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
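A graph of this kind can be sketched with labeled nodes, directed edges, and a traversal that follows links from a project down to its data files. The class name `DataGraph`, the node identifiers, the labels, and the edge direction are all illustrative assumptions; the actual GDC graph schema is defined by the GDC data dictionary.

```python
# Minimal sketch: a labeled graph linking a project to cases, clinical data,
# experiment data, and data files through unique identifiers.
from collections import defaultdict, deque

class DataGraph:
    def __init__(self):
        self.labels = {}               # node id -> node label
        self.edges = defaultdict(set)  # node id -> ids of linked nodes

    def add_node(self, node_id, label):
        self.labels[node_id] = label

    def add_edge(self, src, dst):
        if src not in self.labels or dst not in self.labels:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges[src].add(dst)

    def descendants(self, start, label):
        """Breadth-first search for nodes with the given label below start."""
        seen, queue, found = {start}, deque([start]), []
        while queue:
            node = queue.popleft()
            for nxt in sorted(self.edges[node]):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
                    if self.labels[nxt] == label:
                        found.append(nxt)
        return found

# Hypothetical identifiers, for illustration only.
graph = DataGraph()
for node_id, label in [
    ("project-1", "project"), ("case-1", "case"), ("case-2", "case"),
    ("clinical-1", "clinical"), ("reads-1", "experiment"),
    ("reads-2", "experiment"), ("file-1", "file"), ("file-2", "file"),
]:
    graph.add_node(node_id, label)
for src, dst in [
    ("project-1", "case-1"), ("project-1", "case-2"),
    ("case-1", "clinical-1"), ("case-1", "reads-1"), ("case-2", "reads-2"),
    ("reads-1", "file-1"), ("reads-2", "file-2"),
]:
    graph.add_edge(src, dst)

project_files = graph.descendants("project-1", "file")
print(project_files)
```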
GDC Data Retrieval GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from cart or a high-performance Data Transfer Tool
GDC Data Retrieval GDC Data Transfer Tool
bull The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
Command line interface to specify desired transfer protocol and multiple files for download
Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
Supports the secure transfer of controlled access data using an authentication key (token)
Data Retrieval Tools GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
– Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
– Securely retrieve biospecimen, clinical, and molecular files
– Perform BAM slicing
Example response (portion removed for readability):
{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
Request (API URL, endpoint, URL parameters, query parameters):
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
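The request shown on the slide can be assembled programmatically. The sketch below builds the same URL and parses a response of the shape shown; it makes no network call. The host is the one printed on the slide and may have changed since; the function names and the shortened sample response are illustrative assumptions.

```python
import json
from urllib.parse import urlencode

API_ROOT = "https://gdc-api.nci.nih.gov"  # host as shown on the slide (2016)


def build_query_url(endpoint, fields, pretty=True):
    """Compose API root + endpoint + query parameters into one URL."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # safe="," keeps the comma-separated field list readable in the URL.
    return f"{API_ROOT}/{endpoint}?{urlencode(params, safe=',')}"


def primary_sites(response_text):
    """Map project_id to primary_site from a /projects response body."""
    payload = json.loads(response_text)
    return {hit["project_id"]: hit["primary_site"] for hit in payload["data"]["hits"]}


url = build_query_url("projects", ["project_id", "primary_site"])

# Shortened copy of the response shown on the slide.
sample_response = (
    '{"data": {"hits": ['
    '{"project_id": "TCGA-SKCM", "primary_site": "Skin"}, '
    '{"project_id": "TCGA-LUSC", "primary_site": "Lung"}]}}'
)
sites = primary_sites(sample_response)
```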
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
– Targets data consumers, providers, developers, and general users
– Provides access to information about GDC and contributed cancer genomic data sets
– Instructs users on the use of GDC data access and submission tools
– Provides descriptions of GDC bioinformatics pipelines
– Documents supported GDC data types and file formats
– Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site – https://gdc.nci.nih.gov
• GDC Documentation Site – https://gdc-docs.nci.nih.gov
• GDC Data Portal – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal – https://gdc-portal.nci.nih.gov/submission – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API) – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api – https://gdc-api.nci.nih.gov
• GDC Support – GDC Assistance: support@nci-gdc.datacommons.io – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype)
– Aggregation
– Access
– Sharing
– Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based public facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Figure: data flow from birth defect or childhood cancer cohorts to the DNA sequencing center; birth defect BAM/VCF and cancer BAM/VCF go to dbGaP and the NCI GDC; birth defect and cancer VCF feed the Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries), which serves users]
[Figure: the GMKF Data Resource linked to dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
ClinGen Gene-Disease Scoring Matrix
Proposed Gene Inclusion for Clinical Tests
Definitive evidence Strong evidence
Moderate evidence
Limited/Disputed/No evidence
Predictive Tests & SFs
Diagnostic Panels
Exome/Genome
Many ClinGen Clinical Domain WGs are initially focused on Gene Curation
Define genes appropriate for clinical testing and genes where additional evidence is needed
clinicalgenomeorg
Available ClinGen Tools amp Resources
Listed By Gene
Aggregating Variant Interpretations in ClinVar Sharing Clinical Genome Connect and Reports Project Free-the-Data
Variant-level Data ClinVar
Linked Databases
Researchers Clinics Patients
Patient Registries
Labs
Unpublished or
Literature Citations
InSiGHT
CFTR2 OMIM
Groups
BIC
PharmGKB
Expert Clinical
505 ClinVar submitters; 181,841 variants submitted; 126,974 unique interpreted variants
ClinVar as of April 26, 2016
ClinVar Variant View
Assertion Levels in ClinVar
Expert Panel
Single Submitter – Criteria Provided
Single Submitter – No Criteria Provided
Multi-Source Consistency
Practice Guideline
No stars
No Assertion Not applicable
ACMG CPIC
CFTR2 InSiGHT PharmGKB ENIGMA
Supporting a Curation Environment for both Crowd-Sourcing and Expert Consensus
Curated Variants
ClinVar
Variants
Gene and Variant Curation Interfaces
Case-level data store Machine-learning algorithms
Data resources
ClinGenKB
ClinGen Clinical WGs amp Expert Panels
Outside Expert Panels
Discrepancy Resolution
Primary Curators
Convene experts and implement methods for gene and variant curation
Resolve variant interpretation differences in ClinVar
Enable rule-guided variant interpretation and export (to user and ClinVar)
Conditions and phenotypes in ClinVar
Variant assertions are made on "conditions"
Phenotypic features are seen in patients
(supporting observations)
Standardization of disease terminologies
Parkinson's disease subtypes
yellow = Orphanet, brown = OMIM, blue = Disease Ontology, pink = Monarch, gray = MeSH
Diseases need to be hierarchically related
https://github.com/monarch-initiative/monarch-ontology Courtesy of Melissa Haendel
Human Phenotype Ontology: 117,348 annotations for ~7,000 mainly monogenic diseases
Used to define phenotypic elements of a disease or patient
Courtesy of Melissa Haendel
Centers for Mendelian Genomics Phenotyping Standards Survey
[Figure: four bar charts of survey responses (counts from 0 to 4)]
• Do you use any standard terminologies or ontologies when collecting/tracking phenotypes? (Yes, Sometimes, No, Other)
• Which systems are used to track disease name? (OMIM, Disease ontology, ORDO (Orphanet), MeSH, MedGen, Other (please specify))
• Do you use Human Phenotype Ontology terms to capture more detailed phenotypic features on subjects? (Yes, Sometimes, Rarely)
• Which tools are used to collect phenotype? (PhenoDB, Patient Archive, PhenoTips)
Joint Center for Mendelian Genomics
Daniel MacArthur, Heidi Rehm, Joe Gleeson, Eric Pierce, Alan Beggs, Mark Daly, Christine Seidman, David Sweetser, Vamsi Mootha
Steering Committee
Coordination Team Project manager Hayley Brooks
Clinical project manager Sam Baxter Monkol Lek
Methods
Analysis Software
Brain Heart
Muscle
Hearing
Retinal
Mito
Other
Clinical Analysis
Monkol Lek Elise Valkanas Tom Mullen
Ben Weisburd Harindra Arachchi
Sarah Calvo Laura Gauthier Laurent Francioli
MacArthur Lab's seqr: Online platform for collaborative analysis
Platform allows collaborative analyses between central sequencing site and thousands of collaborators
enter structured phenotype data
seqr: Online platform for collaborative analysis
https://seqr.broadinstitute.org
Rare Disease Analysis Platform seqr Online platform for collaborative analysis
Genomic Matchmaking
Patient 1 (Clinical Geneticist 1)
Genotypic Data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
Phenotypic Data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5
Patient 2 (Clinical Geneticist 2)
Genotypic Data: Gene D, Gene G, Gene H
Phenotypic Data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6
Genomic Matchmaker: Notification of Match
Courtesy of Joel Krier
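The matchmaking diagram can be read as a set-overlap problem. The sketch below uses the two patients from the slide and reports the shared candidate genes and phenotypic features. The matching rule (at least one shared gene plus a minimum number of shared features) is an assumption made for illustration; real matchmakers may score matches quite differently.

```python
# Gene and feature labels are taken from the slide's diagram.
patient_1 = {
    "genes": {"Gene A", "Gene B", "Gene C", "Gene D", "Gene E", "Gene F"},
    "features": {"Feature 1", "Feature 2", "Feature 3", "Feature 4", "Feature 5"},
}
patient_2 = {
    "genes": {"Gene D", "Gene G", "Gene H"},
    "features": {"Feature 1", "Feature 3", "Feature 4", "Feature 5", "Feature 6"},
}


def find_match(a, b, min_shared_features=1):
    """Hypothetical rule: a shared candidate gene plus overlapping phenotype."""
    shared_genes = a["genes"] & b["genes"]
    shared_features = a["features"] & b["features"]
    is_match = bool(shared_genes) and len(shared_features) >= min_shared_features
    return {
        "match": is_match,
        "genes": sorted(shared_genes),
        "features": sorted(shared_features),
    }


result = find_match(patient_1, patient_2)
```

Here the only gene in common is Gene D, which is the point of the slide: neither clinician could single it out from one patient alone.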
Matchmaker Exchange Collaboration and Support from GA4GH and IRDiRC
Needs span multiple GA4GH workgroups
• Data Work Group (data format and interfaces)
• Regulatory and Ethics (patient consent)
• Security (patient privacy and user authentication)
Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.
Human Mutation Special Issue
• The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
• The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
• GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
• PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
• Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
• Innovative genomic collaboration using the GENESIS (GEM.app) platform
• Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
• Participant-led matchmaking
• GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
• Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
• Data sharing in the Undiagnosed Disease Network
• The Genomic Birthday Paradox: How Much is Enough?
• Quantifying and mitigating false-positive disease associations in rare disease matchmaking
• Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
• GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue caused by de novo variants in HNRNPK
• Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies
The Matchmaker Exchange Connecting Matchmakers to Accelerate Gene Discovery
www.matchmakerexchange.org
MME Use Cases
• Supported today
– Match on Gene/Phenotype
  • 1 strong candidate gene (e.g., de novo variant in GUS)
  • 10 candidate genes, each with rare variant
– Match on Phenotype Only
• Coming soon
– Match case to non-human models
– Match using phenotype and VCF
– Patient-initiated matching
GENESIS
Connected and Soon to be Connected Matchmakers
Matchmaker Exchange
Gene Matcher
DECIPHER
RD Connect
ClinGen Genome Connect
Monarch
Phenome Central
Patient initiated matching
Model organisms (mouse zebrafish) Orphanet ClinVar OMIM)
Live
Broad Institute
RDAP
Connecting Data in the Big Data World
Centralized Database: Everyone submits data to a single central database
Examples: ClinVar, dbGaP, EGA
Centralized Hub: APIs connect each database to a central hub
Example: Many commercial platforms
Federated Network: All databases connected through multiple APIs
Example: Matchmaker Exchange
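A small sketch can make the difference between the hub and federated designs concrete. It counts the API connections each design needs and shows a federated query that asks every node for its local matches without pooling the data. The pairwise-connection formula assumes every database links to every other, which is one reading of the slide; the node names, case IDs, and gene labels are hypothetical.

```python
def hub_links(n):
    """Centralized hub: each of n databases keeps one connection to the hub."""
    return n


def federated_links(n):
    """Federated network, assuming every pair of databases is connected."""
    return n * (n - 1) // 2


def federated_query(nodes, gene):
    """Ask each node for matching cases; the data stays where it lives."""
    hits = {}
    for name, cases in nodes.items():
        local = sorted(case for case, genes in cases.items() if gene in genes)
        if local:
            hits[name] = local
    return hits


# Hypothetical nodes, each mapping case IDs to candidate genes.
nodes = {
    "matchmaker_a": {"case-1": {"GENE1", "GENE2"}},
    "matchmaker_b": {"case-7": {"GENE2"}, "case-8": {"GENE3"}},
    "matchmaker_c": {"case-9": {"GENE4"}},
}
hits = federated_query(nodes, "GENE2")
```

With seven databases, a hub needs 7 connections while a fully connected federation needs 21, which is why a shared API standard matters more as a federated network grows.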
Matchmaker Exchange Acknowledgements S Balasubramanian Robert Green Mike Bamshad Matt Hurles Sergio Beltran Agullo Ada Hamosh Jonathan Berg Ekta Khurana Kym Boycott Sebastian Kohler Anthony Brookes Joel Krier Michael Brudno Owen Lancaster Han Brunner Melissa Landrum Oriean Buske Paul Lasko Deanna Church Rick Lifton Raymond Dalgliesh Daniel MacArthur Andrew Devereau Alex MacKenzie Johan den Dunnen Danielle Metterville Helen Firth Debbie Nickerson Paul Flicek Woong-Yan Park Jan Friedman Justin Paschall Richard Gibbs Anthony Philippakis Marta Girdea Heidi Rehm
Peter Robinson Francois Schiettecatte Rolf Sijmons Nara Sobreira Jawahar Swaminathan Morris Swertz Rachel Thompson Stephan Zuchner
NIAGADS
The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP
April 26 2016
Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)
• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia
Long-Term Plan for the ADSP
• Discovery Phase 2014-2018
– Family-based
  • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
  • Included Caribbean Hispanic families
  • Fully QC'd data released 7/13/2015
– Case-control
  • Whole exome sequencing (WES): 5000 cases and 5000 controls
  • 1000 additional cases from families multiply affected by AD
  • Included Caribbean Hispanics
  • Fully QC'd data released 112015
• Follow-Up 2016-2021: WGS in case-control sample sets emphasizing ethnically diverse cohorts
ADSP Discovery Phase Analysis
• NIA funding for analysis of sequence data, June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing
Participation
• PARTICIPANTS
– Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
– Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
– NIH staff: NHGRI and NIA
– External Consultants
• INFRASTRUCTURE
– Executive Committee, Analysis Coordination Committee, and 8 Work Groups
Genomics Center
Organization of the ADSP
ADSP Infrastructure and Support
NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]
National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]
National Alzheimer's Coordinating Center, NACC [U01 AG016976]
Alzheimer's Disease Centers
NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]
NIAGADS History and Function
Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
NIA's repository for the genetics of late-onset Alzheimer's disease data
Datasets, Genomics Database, Analysis Resource
Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy
Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)
NIAGADS ADSP Data Coordinating Center
• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and Office of Science Policy on "Trusted Partners"
Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP
Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as meta-data
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
Data production: Define data formatting; Collect and curate phenotypes; Track data production; Additional quality metrics; Coordinate public data release
Secondary/derived data management: Documentation (README/Manifest), packaging, and release; Notification and tracking
IT support: Website; Members area; Face-to-face meeting support; dbGaP exchange area
Facilitate interactions with other AD investigators
Rapid response to unforeseen ADSP and NIH requests
Interactions
NIAGADS
NIANHGRI
BaylorBroadWashU LSACs
NCBI dbGaPSRA
ADSP Data Flow and other Work Groups
Genomics Center
ADGC- Alzheimerrsquos Disease Genetics Consortium CHARGE- Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD ADGCCHARGE NCRAD- National Cell Repository for Alzheimerrsquos Disease NIAGADS- NIA Genetics of Alzheimerrsquos Disease Data Storage Site
Management of Work Group DataWork Flow
Exchange Area
Data Deposition and Data Sharing
ADSP specific datasets: 68 Tb
21 Reference datasets with 5560 files in use by the ADSP
>20 Reference datasets planned
1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes. A 1 TB hard drive has the capacity to hold a trillion bytes
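The unit conversion on the slide mixes two conventions: the chain 1024 GB = ... = 1,099,511,627,776 bytes is the binary definition (1024^4), while "a trillion bytes" is the decimal one (10^12). The sketch below works through both, and sizes the 68 TB of ADSP-specific data under the binary reading. Treating the slide's "68 Tb" as terabytes is an assumption.

```python
BINARY_TB = 1024 ** 4    # bytes per TB under the slide's 1024-based chain
DECIMAL_TB = 10 ** 12    # "a trillion bytes", as hard drive vendors count

# Each step of the slide's chain multiplies by 1024.
gb_per_tb = 1024
mb_per_tb = gb_per_tb * 1024
kb_per_tb = mb_per_tb * 1024
bytes_per_tb = kb_per_tb * 1024

# Assumption: the slide's "68 Tb" means 68 terabytes.
adsp_bytes = 68 * BINARY_TB

# How much larger a binary TB is than a decimal TB, in percent.
overhead_percent = (BINARY_TB - DECIMAL_TB) / DECIMAL_TB * 100
```

The two definitions differ by about 10 percent, so a drive sold as 1 TB holds slightly less than one binary terabyte.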
dbGaPNIAGADS release for the research community at large
11555 BAM files released 122014
Pedigree and phenotype information
Sequencing Quality Control Metrics
Size of the ADSP
Reference: "The real cost of sequencing: scaling computation to keep pace with data generation." Muir et al. Genome Biology 2016, 17:53
ADSP Members Area (Dashboard)
Calendar Documents Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list 153 records
352 meeting minutes
gt600 other documents
gt 100 consent files
10 analysis plans
85TB WGS 1075TB WES data (dbGaPSRA)
42 TB Files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
SNPGene reports Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimerrsquos Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
GWAS results combined with
SNPs below
Gene annotations transformed to SNPs
Genomic features from identified SNPs
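The combined search described above amounts to intersecting two SNP sets: SNPs below a GWAS significance level, and SNPs obtained by transforming a gene annotation result into SNPs. The sketch below shows that logic on made-up data. All SNP IDs, gene names, and p-values are hypothetical, the 5e-8 threshold is a common convention rather than something the slide states, and none of this reflects the actual NIAGADS Genomics Database interface.

```python
# Hypothetical GWAS p-values per SNP.
gwas_p_values = {"rs0001": 3e-9, "rs0002": 0.02, "rs0003": 4e-8, "rs0004": 1e-10}

# Hypothetical gene-to-SNP mapping and gene annotation search result.
snps_by_gene = {
    "GENE1": {"rs0001", "rs0002"},
    "GENE2": {"rs0003"},
    "GENE3": {"rs0004"},
}
annotated_genes = {"GENE1", "GENE3"}


def significant_snps(p_values, threshold=5e-8):
    """SNPs whose GWAS p-value falls below the significance level."""
    return {snp for snp, p in p_values.items() if p < threshold}


def snps_for_genes(genes, mapping):
    """Transform a set of genes into the set of SNPs they contain."""
    return set().union(*(mapping[gene] for gene in genes))


combined = significant_snps(gwas_p_values) & snps_for_genes(annotated_genes, snps_by_gene)
```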
Find a genomic location of interest by viewing tracks on the genome browser
SNPs
genes
GWAS results
functional genomics data
View Genomic Information by the Genome Browser
search result
Data available at the NIAGADS Genomics Database
Gene models SNP information
Pathway annotations Gene Ontology
KEGG Pathway
Functional genomics data ENCODE
FANTOM5
GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics Ongoing effort to deposit annotations curated by ADSP AnnotationWG into Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
– Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
– Big data management
– IT infrastructure and web development
• Administrative support and collaboration
– Familiarity with human subjects research regulations and NIH policy
– Leadership and coordinating skills
– Flexibility, close collaboration with program officers
• Domain expertise
– Knowledge of the disease and genetic research
– Familiarity with the community and cohorts
– Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
What the Data Coordinating Center Should Know
Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements NIAGADS EAB Matthew Farrer
Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober
NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly bull Data flow WG
bull Annotation WG IGAPADGCCHARGEEADIGERAD
NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon
NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao
GDC Overview GDC Data Submission Processing and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
– Enables the identification of both high- and low-frequency cancer drivers
– Assists in defining genomic determinants of response to therapy
– Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
– Harmonization of raw sequence data from both existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
– Application of state-of-the-art methods of generating high level data
GDC Overview Infrastructure
Data Sources (DCC CGHub)
Data Submitters
Open Access Data Users
Controlled Access Data Users
eRACommons amp dbGaP Data
Access Tools
Data Import System
Metadata amp Data Storage
Reporting System
Harmonization amp Generation
Pipelines
3rd Party Applications
GDC Users GDC System Components GDC Interfaces
Alignment amp Processing Tools Data
Submission Tools
Data Security System
APIs
Digital ID System
GDC Overview Resources
GDC Data Portal
GDC Data Center
GDC Data Transfer Tool GDC Reports
GDC Data Model
GDC Data Submission
Portal
GDC Application Programming Interface (API)
GDC Bioinformatics Pipeline
GDC Documentation
and Support
GDC Organization
and Collaborators
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Submission Data Submitter Types
• The GDC supports two types of data submitters
– Type 1: Large Organizations
  • Users associated with an institution or group with significant informatics resources
  • Large one-time submission or long-term ongoing data submission
  • Primarily use the GDC Application Programming Interfaces (API) or the GDC Data Transfer Tool (command line interface) for data submission
– Type 2: Researchers or Individual Laboratories
  • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
  • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
  • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API
GDC Data Submission Data Submission Policies
• dbGaP Data Submission Policies
– GDC data submitters must first apply for data submission authorization through dbGaP
– Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
– Data Sharing Requirement
  • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
  • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
– Data Pre-processing Period
  • For each project, the GDC will afford an exclusive pre-processing period that allows submitters to perform data cleaning, quality checks, and submission of revised data before release
  • The pre-processing period may generally last up to six months from the date of first submission
– Data Submission Period and Release
  • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
– Data Redaction
  • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
GDC Data Submission Data Submission Process
GDC Data Submission dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | Case: TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample: TSV, JSON; Portion: TSV, JSON; Analyte: TSV, JSON; Aliquot: TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic: TSV, JSON; Diagnosis: TSV, JSON; Exposure: TSV, JSON; Family History: TSV, JSON; Treatment: TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type Data Subtype Format Data Dictionary Template
Data File
Experimental Metadata Pathology Report Run Metadata Slide Image
Submitted Unaligned Reads
SRA XML
PDF SRA XML SVS
FASTQ BAM
Experimental Metadata Pathology Report
Submitted Unaligned Reads
TSV JSON
TSV JSON TSV JSON TSV JSON
TSV JSON
Submitted Aligned Reads BAM
Submitted Aligned Reads TSV JSON
Data Bundle Read Group
Slide
Read Group
Slide
TSV JSON
TSV JSON
Data Types and File Formats Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
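Because the same records can be submitted as TSV or JSON, converting between the two is mostly mechanical. The sketch below turns one TSV row into a JSON record. The column names and values are hypothetical placeholders; the real required fields are defined by the GDC data dictionary and templates.

```python
import csv
import io
import json

# Hypothetical biospecimen TSV with one sample row.
TSV_TEXT = (
    "type\tsubmitter_id\tsample_type\n"
    "sample\tsample-001\tPrimary Tumor\n"
)


def tsv_to_json(text):
    """Convert TSV text (header row plus records) into a JSON array string."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    return json.dumps(rows, indent=2)


json_text = tsv_to_json(TSV_TEXT)
records = json.loads(json_text)
```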
Data Types and File Formats Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
– GDC clinical data elements were reviewed with members of the research community
– Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
[Figure: a Case links to Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)), Biospecimen Data, and Experiment Data]
Data Types and File Formats Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FASTQC
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
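A QC report like the one on the slide is easier to act on when summarized per read group. The sketch below transcribes the module results from the slide and lists which modules raised FAIL or WARNING for each read group. The pipe-separated layout and the function name are illustrative choices, not a FASTQC or GDC output format.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Module results transcribed from the slide (one status per read group).
QC_TABLE = """\
Basic Statistics|PASS PASS PASS PASS
Per base sequence quality|PASS PASS PASS PASS
Per tile sequence quality|PASS WARNING PASS PASS
Per sequence quality scores|FAIL PASS PASS PASS
Per base sequence content|PASS PASS WARNING PASS
Per sequence GC content|PASS PASS PASS PASS
Per base N content|PASS PASS PASS PASS
Sequence Length Distribution|PASS PASS PASS PASS
Sequence Duplication Levels|WARNING WARNING PASS PASS
Overrepresented sequences|PASS PASS PASS PASS
Adapter Content|PASS FAIL PASS PASS
Kmer Content|PASS PASS WARNING PASS
"""


def modules_with_status(table, status):
    """Return, per read group, the QC modules that reported the given status."""
    flagged = {rg: [] for rg in READ_GROUPS}
    for line in table.strip().splitlines():
        module, results = line.split("|")
        for rg, result in zip(READ_GROUPS, results.split()):
            if result == status:
                flagged[rg].append(module)
    return flagged


failed = modules_with_status(QC_TABLE, "FAIL")
warned = modules_with_status(QC_TABLE, "WARNING")
```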
Data Submission Tools GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
– Upload biospecimen and clinical data and experiment metadata using user friendly web-based tools
– Validate data against GDC standard data types defined in the project data dictionary
– Obtain information on the status of data submission and processing by project
Data Submission Tools GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
– Command line interface to specify desired transfer protocol
– Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
– Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
– Securely submit biospecimen, clinical, and experiment files using a token
– Submit data to GDC by performing create, update, and retrieve actions on entities
– Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
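The create-entities call can be sketched as follows. This is a hedged illustration that only builds the request and never sends it. The host is the one printed elsewhere on these slides, the `X-Auth-Token` header name and the JSON body layout are assumptions not stated on the slide, and the entity fields, program, and project names are hypothetical placeholders.

```python
import json
import urllib.request

# Host as printed on the slides; the live GDC host may differ.
GDC_API = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{GDC_API}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: the token is passed in an X-Auth-Token header.
            "X-Auth-Token": token,
        },
    )


# Hypothetical entity; real field names come from the GDC data dictionary.
entity = {"type": "case", "submitter_id": "EXAMPLE-CASE-1"}
request = build_create_request("PROGRAM", "PROJECT", [entity], token="<token>")
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the submission; that step is left out here because it needs a valid token and an authorized project.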
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]
GDC Processing: GDC High-Level Data Generation
• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
GDC Data Retrieval
• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
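The idea of a graph that links projects, cases, clinical data, experiment data, and files by unique identifiers can be illustrated with a small sketch. This is not the GDC schema: the `Graph` class, the node labels, and the identifiers below are hypothetical and only show how following edges from a case recovers its files.

```python
from collections import defaultdict


class Graph:
    """Tiny directed graph: nodes keyed by unique ID, edges point parent -> child."""

    def __init__(self):
        self.nodes = {}
        self.children = defaultdict(set)

    def add_node(self, node_id, label):
        self.nodes[node_id] = label

    def add_edge(self, parent, child):
        if parent not in self.nodes or child not in self.nodes:
            raise KeyError("both endpoints must be registered before linking")
        self.children[parent].add(child)

    def descendants(self, start, label):
        """Return sorted IDs of nodes with `label` reachable from `start`."""
        seen, stack, found = {start}, [start], []
        while stack:
            current = stack.pop()
            for child in self.children.get(current, ()):
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
                    if self.nodes[child] == label:
                        found.append(child)
        return sorted(found)


g = Graph()
# Hypothetical identifiers, for illustration only.
for node_id, label in [
    ("project-1", "project"), ("case-1", "case"), ("case-2", "case"),
    ("clinical-1", "clinical"), ("experiment-1", "experiment"),
    ("experiment-2", "experiment"), ("file-1", "file"), ("file-2", "file"),
]:
    g.add_node(node_id, label)
for parent, child in [
    ("project-1", "case-1"), ("project-1", "case-2"),
    ("case-1", "clinical-1"), ("case-1", "experiment-1"),
    ("case-2", "experiment-2"),
    ("experiment-1", "file-1"), ("experiment-2", "file-2"),
]:
    g.add_edge(parent, child)

print(g.descendants("case-1", "file"))
print(g.descendants("project-1", "file"))
```

Directed parent-to-child edges are used so that a query starting at one case cannot wander through the shared project node into another case's files.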
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
– Data browsing by project, file, case, or annotation
– Visualization allowing users to perform fine-grained filtering of search results
– Data search using advanced smart search technology
– Data selection into a personalized cart
– Data download from cart or a high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
– Command line interface to specify desired transfer protocol and multiple files for download
– Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
– Supports the secure transfer of controlled access data using an authentication key (token)
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
– Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
– Securely retrieve biospecimen, clinical, and molecular files
– Perform BAM slicing
Example query: https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
URL parts labeled on the slide: API URL, Endpoint, URL parameters, Query parameters
Example response (portion removed for readability):
  "data": { "hits": [
    { "project_id": "TCGA-SKCM", "primary_site": "Skin" },
    { "project_id": "TCGA-PCPG", "primary_site": "Nervous System" },
    { "project_id": "TCGA-LAML", "primary_site": "Blood" },
    { "project_id": "TCGA-CNTL", "primary_site": "Not Applicable" },
    { "project_id": "TCGA-UVM", "primary_site": "Eye" },
    { "project_id": "TARGET-AML", "primary_site": "Blood" },
    { "project_id": "TCGA-SARC", "primary_site": "Mesenchymal" },
    { "project_id": "TCGA-LUSC", "primary_site": "Lung" },
    { "project_id": "TARGET-NBL", "primary_site": "Nervous System" },
    { "project_id": "TCGA-PAAD", "primary_site": "Pancreas" }
  ] }
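The example query above can be assembled and its response read with the Python standard library. This is a minimal sketch: the host, endpoint, and `fields`/`pretty` parameters come from the slide, while `build_query_url` and the abbreviated `SAMPLE_RESPONSE` string are illustrative stand-ins, and no network call is made.

```python
import json
from urllib.parse import urlencode

# Host as printed on the slide; the live GDC host may differ.
API_URL = "https://gdc-api.nci.nih.gov"


def build_query_url(endpoint, fields, pretty=True):
    """Compose API URL + endpoint + query parameters."""
    params = {"fields": ",".join(fields), "pretty": str(pretty).lower()}
    # safe="," keeps the comma-separated field list readable in the URL.
    return f"{API_URL}/{endpoint}?{urlencode(params, safe=',')}"


url = build_query_url("projects", ["project_id", "primary_site"])
print(url)

# Two hits copied from the slide's example response, standing in for a live reply.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}
"""
hits = json.loads(SAMPLE_RESPONSE)["data"]["hits"]
sites = {hit["project_id"]: hit["primary_site"] for hit in hits}
print(sites)
```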
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
– Targets data consumers, providers, developers, and general users
– Provides access to information about GDC and contributed cancer genomic data sets
– Instructs users on the use of GDC data access and submission tools
– Provides descriptions of GDC bioinformatics pipelines
– Documents supported GDC data types and file formats
– Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site – https://gdc.nci.nih.gov
• GDC Documentation Site – https://gdc-docs.nci.nih.gov
• GDC Data Portal – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
– https://gdc-portal.nci.nih.gov/submission
– https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
– https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
– https://gdc-api.nci.nih.gov
• GDC Support
– GDC Assistance: support@nci-gdc.datacommons.io
– GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype)
– Aggregation
– Access
– Sharing
– Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Diagram: Kids First data flow. Elements shown: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF and cancer BAM/VCF; dbGaP and NCI GDC; birth defect VCF and cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users]
GMKF Data Resource
dbGaP
NCI GDC
TOPMed
The Monarch Initiative
ClinGen
Matchmaker Exchange
Center for Mendelian Genomics
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
Proposed Gene Inclusion for Clinical Tests
Definitive evidence Strong evidence
Moderate evidence
Limited/Disputed/No evidence
Predictive Tests & SFs
Diagnostic Panels
Exome/Genome
Many ClinGen Clinical Domain WGs are initially focused on Gene Curation
Define genes appropriate for clinical testing and genes where additional evidence is needed
clinicalgenomeorg
Available ClinGen Tools amp Resources
Listed By Gene
Aggregating Variant Interpretations in ClinVar Sharing Clinical Genome Connect and Reports Project Free-the-Data
Variant-level Data ClinVar
Linked Databases
Researchers Clinics Patients
Patient Registries
Labs
Unpublished or
Literature Citations
InSiGHT
CFTR2 OMIM
Groups
BIC
PharmGKB
Expert Clinical
505 ClinVar submitters; 181,841 variants submitted; 126,974 unique interpreted variants
ClinVar as of April 26, 2016
ClinVar Variant View
Assertion Levels in ClinVar
Expert Panel
Single Submitter ndash Criteria Provided
Single Submitter ndash No Criteria Provided
Multi-Source Consistency
Practice Guideline
No stars
No Assertion Not applicable
ACMG CPIC
CFTR2 InSiGHT PharmGKB ENIGMA
Supporting a Curation Environment for both Crowd-Sourcing and Expert Consensus
Curated Variants
ClinVar
Variants
Gene and Variant Curation Interfaces
Case-level data store Machine-learning algorithms
Data resources
ClinGenKB
ClinGen Clinical WGs amp Expert Panels
Outside Expert Panels
Discrepancy Resolution
Primary Curators
Convene experts and implement methods for gene and variant curation
Resolve variant interpretation differences in ClinVar
Enable rule-guided variant interpretation and export (to user and ClinVar)
Conditions and phenotypes in ClinVar
Variant assertions are made on "conditions"
Phenotypic features are seen in patients
(supporting observations)
Standardization of disease terminologies
Parkinson's disease subtypes
Yellow = Orphanet, brown = OMIM, blue = Disease Ontology, pink = Monarch, gray = MeSH
Diseases need to be hierarchically related
https://github.com/monarch-initiative/monarch-ontology (courtesy of Melissa Haendel)
Human Phenotype Ontology: 117,348 annotations for ~7,000 mainly monogenic diseases
Used to define phenotypic elements of a disease or patient
Courtesy of Melissa Haendel
Centers for Mendelian Genomics Phenotyping Standards Survey
[Four bar charts; bar heights are not recoverable from the extracted text]
– Do you use any standard terminologies or ontologies when collecting/tracking phenotypes? (Yes, Sometimes, No, Other)
– Which systems are used to track disease name? (OMIM, Disease Ontology, ORDO (Orphanet), MeSH, MedGen, Other (please specify))
– Do you use Human Phenotype Ontology terms to capture more detailed phenotypic features on subjects? (Yes, Sometimes, Rarely)
– Which tools are used to collect phenotype? (PhenoDB, Patient Archive, PhenoTips)
Joint Center for Mendelian Genomics
Daniel MacArthur, Heidi Rehm, Joe Gleeson, Eric Pierce, Alan Beggs, Mark Daly, Christine Seidman, David Sweetser, Vamsi Mootha
Steering Committee
Coordination Team Project manager Hayley Brooks
Clinical project manager Sam Baxter Monkol Lek
Methods
Analysis Software
Brain Heart
Muscle
Hearing
Retinal
Mito
Other
Clinical Analysis
Monkol Lek Elise Valkanas Tom Mullen
Ben Weisburd Harindra Arachchi
Sarah Calvo Laura Gauthier Laurent Francioli
MacArthur Lab's seqr: Online platform for collaborative analysis
Platform allows collaborative analyses between central sequencing site and thousands of collaborators
enter structured phenotype data
seqr: Online platform for collaborative analysis
https://seqr.broadinstitute.org
Rare Disease Analysis Platform seqr Online platform for collaborative analysis
Genomic Matchmaking
Patient 1 (Clinical Geneticist 1)
– Genotypic Data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
– Phenotypic Data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5
Patient 2 (Clinical Geneticist 2)
– Genotypic Data: Gene D, Gene G, Gene H
– Phenotypic Data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6
Genomic Matchmaker: Notification of Match
Courtesy of Joel Krier
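The matching idea in this diagram can be shown with the two example patients. This is a simplified sketch based on plain set intersection; it is not the algorithm used by any Matchmaker Exchange node, and `find_overlap` is a hypothetical helper name.

```python
# Candidate genes and phenotypic features for the two patients in the diagram.
patient_1 = {
    "genes": {"Gene A", "Gene B", "Gene C", "Gene D", "Gene E", "Gene F"},
    "features": {"Feature 1", "Feature 2", "Feature 3", "Feature 4", "Feature 5"},
}
patient_2 = {
    "genes": {"Gene D", "Gene G", "Gene H"},
    "features": {"Feature 1", "Feature 3", "Feature 4", "Feature 5", "Feature 6"},
}


def find_overlap(a, b):
    """Return the candidate genes and features that two patient records share."""
    return {
        "genes": sorted(a["genes"] & b["genes"]),
        "features": sorted(a["features"] & b["features"]),
    }


overlap = find_overlap(patient_1, patient_2)
if overlap["genes"]:
    # A shared candidate gene is what triggers notification of both clinicians.
    print("Match on", overlap["genes"], "with shared features", overlap["features"])
```

Neither clinician could single out Gene D from one patient's candidate list alone; the shared gene plus four shared features is what makes it a strong candidate.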
Matchmaker Exchange Collaboration and Support from GA4GH and IRDiRC
Needs span multiple GA4GH workgroups
bull Data Work Group (data format and interfaces)
bull Regulatory and Ethics (patient consent)
bull Security (patient privacy and user authentication)
Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.
Human Mutation Special Issue
– The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
– The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
– GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
– PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
– Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
– Innovative genomic collaboration using the GENESIS (GEM.app) platform
– Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
– Participant-led matchmaking
– GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
– Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
– Data sharing in the Undiagnosed Disease Network
– The Genomic Birthday Paradox: How Much is Enough?
– Quantifying and mitigating false-positive disease associations in rare disease matchmaking
– Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
– GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue caused by de novo variants in HNRNPK
– Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies
The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery
www.matchmakerexchange.org
MME Use Cases
• Supported today
– Match on Gene/Phenotype
  • 1 strong candidate gene (e.g., de novo variant in GUS)
  • 10 candidate genes, each with rare variant
– Match on Phenotype Only
• Coming soon
– Match case to non-human models
– Match using phenotype and VCF
– Patient-initiated matching
GENESIS
Connected and Soon to be Connected Matchmakers
Matchmaker Exchange
Gene Matcher
DECIPHER
RD Connect
ClinGen Genome Connect
Monarch
Phenome Central
Patient initiated matching
Model organisms (mouse zebrafish) Orphanet ClinVar OMIM)
Live
Broad Institute
RDAP
Connecting Data in the Big Data World
Centralized Database: Everyone submits data to a single central database
– Examples: ClinVar, dbGaP, EGA
Centralized Hub: APIs connect each database to a central hub
– Example: Many commercial platforms
Federated Network: All databases connected through multiple APIs
– Example: Matchmaker Exchange
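One way to compare the hub and federated models is to count the connections each needs as databases join. The sketch below assumes, for illustration only, that a hub needs one link per database and that a fully federated network links every pair of databases directly; real networks may connect only some pairs.

```python
def hub_links(n_databases):
    """Centralized hub: one API connection from each database to the hub."""
    return n_databases


def federated_links(n_databases):
    """Federated network, assuming every pair of databases connects directly."""
    return n_databases * (n_databases - 1) // 2


for n in (3, 7, 20):
    print(f"{n} databases: hub={hub_links(n)} links, federated={federated_links(n)} links")
```

Under that assumption the federated model trades a growing number of pairwise connections for not needing any single party to hold or route everyone's data.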
Matchmaker Exchange Acknowledgements
S Balasubramanian, Mike Bamshad, Sergio Beltran Agullo, Jonathan Berg, Kym Boycott, Anthony Brookes, Michael Brudno, Han Brunner, Orion Buske, Deanna Church, Raymond Dalgliesh, Andrew Devereau, Johan den Dunnen, Helen Firth, Paul Flicek, Jan Friedman, Richard Gibbs, Marta Girdea, Robert Green, Matt Hurles, Ada Hamosh, Ekta Khurana, Sebastian Kohler, Joel Krier, Owen Lancaster, Melissa Landrum, Paul Lasko, Rick Lifton, Daniel MacArthur, Alex MacKenzie, Danielle Metterville, Debbie Nickerson, Woong-Yang Park, Justin Paschall, Anthony Philippakis, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner
NIAGADS
The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP
April 26, 2016
Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)
• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia
Long-Term Plan for the ADSP
• Discovery Phase 2014-2018
– Family-based
  • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
  • Included Caribbean Hispanic families
  • Fully QC'd data released 7/13/2015
– Case-control
  • Whole exome sequencing (WES): 5000 cases and 5000 controls
  • 1000 additional cases from families multiply affected by AD
  • Included Caribbean Hispanics
  • Fully QC'd data released 112015
• Follow-Up 2016-2021: WGS in case-control sample sets, emphasizing ethnically diverse cohorts
ADSP Discovery Phase Analysis
• NIA funding for analysis of sequence data: June 2014
• Major analysis projects
– Family-Based Analysis
– Case-Control Analysis
– Structural Variants
– Protective Variants
– Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing
Participation
• PARTICIPANTS
– Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
– Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
– NIH staff: NHGRI and NIA
– External Consultants
• INFRASTRUCTURE
– Executive Committee, Analysis Coordination Committee, and 8 Work Groups
Genomics Center
Organization of the ADSP
ADSP Infrastructure and Support
NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]
National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]
National Alzheimer's Coordinating Center, NACC [U01 AG016976]
Alzheimer's Disease Centers
NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]
NIAGADS History and Function
Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
NIA's repository for the genetics of late-onset Alzheimer's disease data
Datasets, Genomics Database, Analysis Resource
Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy
Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)
NIAGADS ADSP Data Coordinating Center
bull Host the sample information and study plan
bull Track progress of sequencing
bull Track samples
bull Prepare amp maintain data
bull Schedule data releases for the Study at dbGaP
bull Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
bull Host ADSP website and ADSP data portal
bull Manage files and datasets for ADSP work groups
bull Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and Office of Science Policy on "Trusted Partners": interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP
Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
Data production
– Define data formatting
– Collect and curate phenotypes
– Track data production
– Additional quality metrics
– Coordinate public data release
Secondary/derived data management
– Documentation (README/Manifest), packaging, and release
– Notification and tracking
IT support
– Website
– Members area
– Face-to-face meeting support
– dbGaP exchange area
Facilitate interactions with other AD investigators
Rapid response to unforeseen ADSP and NIH requests
Interactions
NIAGADS
NIANHGRI
BaylorBroadWashU LSACs
NCBI dbGaPSRA
ADSP Data Flow and other Work Groups
Genomics Center
NCRAD, ADGC/CHARGE
ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site
Management of Work Group DataWork Flow
Exchange Area
Data Deposition and Data Sharing
ADSP-specific datasets: 68 TB
21 reference datasets with 5560 files in use by the ADSP
>20 reference datasets planned
1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes
A 1 TB hard drive has the capacity to hold a trillion bytes
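The unit chain on this slide can be checked with a few lines of arithmetic. The sketch below uses the slide's binary convention (each step is a factor of 1024) and, for comparison, the decimal "trillion bytes" figure quoted for hard drives; the variable names are illustrative.

```python
STEP = 1024  # binary convention used on the slide: 1 TB = 1024 GB, and so on

gb_per_tb = STEP
mb_per_tb = STEP ** 2
kb_per_tb = STEP ** 3
bytes_per_tb = STEP ** 4

# Decimal convention: "a trillion bytes".
trillion_bytes = 10 ** 12

# Relative gap between the two definitions of a terabyte.
gap = (bytes_per_tb - trillion_bytes) / trillion_bytes

# ADSP-specific datasets (68 TB) expressed in bytes under the binary convention.
adsp_bytes = 68 * bytes_per_tb

print(f"1 TB (binary)  = {bytes_per_tb:,} bytes")
print(f"1 TB (decimal) = {trillion_bytes:,} bytes")
print(f"binary is {gap:.1%} larger")
print(f"68 TB (binary) = {adsp_bytes:,} bytes")
```

The two statements on the slide therefore use slightly different definitions: the binary terabyte is roughly a tenth larger than a trillion bytes.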
dbGaPNIAGADS release for the research community at large
11555 BAM files released 122014
Pedigree and phenotype information
Sequencing Quality Control Metrics
Size of the ADSP
Reference: "The real cost of sequencing: scaling computation to keep pace with data generation." Muir et al. Genome Biology 2016;17:53
ADSP Members Area (Dashboard)
Calendar Documents Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list 153 records
352 meeting minutes
>600 other documents
>100 consent files
10 analysis plans
85TB WGS 1075TB WES data (dbGaP/SRA)
42 TB Files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
SNP/Gene reports, Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer's Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
– GWAS results combined with SNPs below
– Gene annotations transformed to SNPs
– Genomic features from identified SNPs
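Combining a GWAS significance search with a gene-annotation search amounts to a set intersection once the gene annotations have been turned into SNP lists. The sketch below is illustrative only: it is not the NIAGADS Genomics Database interface, all SNP IDs, gene names, and p-values are invented, and the 5e-8 cutoff is the conventional genome-wide significance level rather than a value taken from the slide.

```python
# Invented GWAS summary statistics: SNP ID -> p-value.
gwas_p_values = {"rs0001": 2e-9, "rs0002": 0.03, "rs0003": 4e-8, "rs0004": 1e-10}

# Invented gene-to-SNP mapping (the "gene annotations transformed to SNPs" step).
gene_snps = {
    "GENE_X": {"rs0001", "rs0002"},
    "GENE_Y": {"rs0004"},
    "GENE_Z": {"rs0005"},
}

# Genes selected by an annotation search, e.g. membership in a pathway of interest.
annotated_genes = {"GENE_X", "GENE_Z"}

SIGNIFICANCE = 5e-8  # conventional genome-wide threshold (assumption)

# Search 1: SNPs passing the GWAS significance level.
significant_snps = {snp for snp, p in gwas_p_values.items() if p < SIGNIFICANCE}

# Search 2: SNPs belonging to the annotated genes.
annotation_snps = set().union(*(gene_snps[gene] for gene in annotated_genes))

# Combined result: SNPs satisfying both searches.
combined = sorted(significant_snps & annotation_snps)
print(combined)
```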
Find a genomic location of interest by viewing tracks on the genome browser
SNPs
genes
GWAS results
functional genomics data
View Genomic Information by the Genome Browser
search result
Data available at the NIAGADS Genomics Database
Gene models SNP information
Pathway annotations Gene Ontology
KEGG Pathway
Functional genomics data ENCODE
FANTOM5
GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics
Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
– Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
– Big data management
– IT infrastructure and web development
• Administrative support and collaboration
– Familiarity with human subjects research regulations and NIH policy
– Leadership and coordinating skills
– Flexibility, close collaboration with program officers
• Domain expertise
– Knowledge of the disease and genetic research
– Familiarity with the community and cohorts
– Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements NIAGADS EAB Matthew Farrer
Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober
NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly bull Data flow WG
• Annotation WG IGAP/ADGC/CHARGE/EADI/GERAD
NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon
NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao
GDC Overview: GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
– Enables the identification of both high- and low-frequency cancer drivers
– Assists in defining genomic determinants of response to therapy
– Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
– Harmonization of raw sequence both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
– Application of state-of-the-art methods of generating high-level data
GDC Overview: Infrastructure
[Diagram. Legend: GDC Users, GDC System Components, GDC Interfaces]
Data Sources (DCC, CGHub)
Data Submitters
Open Access Data Users
Controlled Access Data Users
eRA Commons & dbGaP
Data Access Tools
Data Submission Tools
APIs
3rd Party Applications
Data Import System
Metadata & Data Storage
Reporting System
Harmonization & Generation Pipelines
Alignment & Processing Tools
Data Security System
Digital ID System
GDC Overview: Resources
GDC Data Portal
GDC Data Center
GDC Data Transfer Tool
GDC Reports
GDC Data Model
GDC Data Submission Portal
GDC Application Programming Interface (API)
GDC Bioinformatics Pipeline
GDC Documentation and Support
GDC Organization and Collaborators
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
– Type 1: Large Organizations
  • Users associated with an institution or group with significant informatics resources
  • Large one-time submission or long-term ongoing data submission
  • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
– Type 2: Researchers or Individual Laboratories
  • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
  • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
  • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
– GDC data submitters must first apply for data submission authorization through dbGaP
– Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
– Data Sharing Requirement
  • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
  • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
– Data Pre-processing Period
  • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
  • The pre-processing period may generally last up to six months from the date of first submission
– Data Submission Period and Release
  • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
– Data Redaction
  • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
GDC Data Submission: Data Submission Process
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type.
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | TSV, JSON for each
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | TSV, JSON for each
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions.
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format.
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment.
– GDC clinical data elements were reviewed with members of the research community.
– Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR).
[Figure: a Case is linked to Clinical Data (Demographics; Diagnosis; Family History, optional; Exposure, optional; Treatment, optional), Biospecimen Data, and Experiment Data.]
Data Types and File Formats: Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format.
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC.
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
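As an illustration of how a QC report like the one above might be triaged, the following Python sketch lists the modules that were flagged for each read group. The statuses are transcribed from the slide; the function name `flagged_modules` and the data layout are hypothetical and are not part of FastQC or any GDC tool.

```python
# Read groups and per-module statuses, transcribed from the QC table above.
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]
ALL_PASS = ["PASS", "PASS", "PASS", "PASS"]

QC_TABLE = {
    "Basic Statistics": list(ALL_PASS),
    "Per base sequence quality": list(ALL_PASS),
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": list(ALL_PASS),
    "Per base N content": list(ALL_PASS),
    "Sequence Length Distribution": list(ALL_PASS),
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": list(ALL_PASS),
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged_modules(table, read_groups, statuses=("FAIL",)):
    """Map each read group to the QC modules whose status is in `statuses`."""
    flagged = {rg: [] for rg in read_groups}
    for module, results in table.items():
        for rg, status in zip(read_groups, results):
            if status in statuses:
                flagged[rg].append(module)
    return flagged


if __name__ == "__main__":
    for rg, modules in flagged_modules(QC_TABLE, READ_GROUPS).items():
        print(rg, "failed:", ", ".join(modules) or "none")
```

Passing `statuses=("FAIL", "WARNING")` widens the report to include warnings, which may be a reasonable choice when deciding whether to revise data during the pre-processing period.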
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata.
– Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
– Validate data against GDC standard data types defined in the project data dictionary
– Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks.
– Command line interface to specify the desired transfer protocol
– Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
– Supports the secure upload of controlled-access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis.
– Securely submit biospecimen, clinical, and experiment files using a token
– Submit data to GDC by performing create, update, and retrieve actions on entities
– Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
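A minimal Python sketch of how a request to the Create Entities endpoint could be assembled is shown below. It only builds the request object and does not send it. The host is the 2016 API address given elsewhere in these slides; the `X-Auth-Token` header name, the JSON content type, and the example entity fields are assumptions for illustration and should be checked against the GDC API User's Guide.

```python
import json
import urllib.request
from urllib.parse import quote

# API host as listed in the slides' references (2016); it may have changed.
API_ROOT = "https://gdc-api.nci.nih.gov"

# Hypothetical entity payload: field names are illustrative only.
EXAMPLE_CASE = {
    "type": "case",
    "submitter_id": "EXAMPLE-CASE-1",
    "projects": {"code": "PROJECT"},
}


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{API_ROOT}/v0/submission/{quote(program, safe='')}/{quote(project, safe='')}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumed header name for the authentication token.
            "X-Auth-Token": token,
        },
    )


if __name__ == "__main__":
    req = build_create_request("PROGRAM", "PROJECT", [EXAMPLE_CASE], "my-token")
    print(req.get_method(), req.full_url)
```

Keeping request construction separate from sending makes it possible to inspect the URL and payload before any controlled-access credentials leave the machine.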
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data.
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies.
• Data is then made available through GDC Data Access Tools as open or controlled access per the dbGaP authorization policies associated with the data set.
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38).
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]
GDC Processing: GDC High Level Data Generation
• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations.
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines.
GDC Processing: GDC Variant Calling Pipelines
GDC Data Retrieval
• Once data is submitted into the GDC and the data submitter signs off on the data to request data release, the GDC performs Quality Control (QC) and data processing.
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions.
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API.
• Data queries are based on the GDC Data Model.
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC.
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects by means of unique identifiers.
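To make the graph idea concrete, here is a small Python sketch of a graph with typed nodes and parent-to-child edges, in which files are reached from a project or case by following links between unique identifiers. The class `DataGraph`, the node types, and the identifiers are hypothetical; this is not the GDC's actual schema or storage engine.

```python
from collections import deque


class DataGraph:
    """Toy graph store: typed nodes keyed by unique id, directed parent->child edges."""

    def __init__(self):
        self.nodes = {}      # node id -> node type
        self.children = {}   # node id -> list of child node ids

    def add_node(self, node_id, node_type):
        self.nodes[node_id] = node_type
        self.children.setdefault(node_id, [])

    def add_edge(self, parent_id, child_id):
        if parent_id not in self.nodes or child_id not in self.nodes:
            raise KeyError("both nodes must exist before linking them")
        self.children[parent_id].append(child_id)

    def descendants(self, start_id, node_type):
        """Breadth-first search for all nodes of `node_type` below `start_id`."""
        found, seen, queue = [], {start_id}, deque([start_id])
        while queue:
            current = queue.popleft()
            for child in self.children[current]:
                if child in seen:
                    continue
                seen.add(child)
                if self.nodes[child] == node_type:
                    found.append(child)
                queue.append(child)
        return found


def build_example():
    """One project with two cases; case-1 has a diagnosis and a file, case-2 a file."""
    graph = DataGraph()
    for node_id, node_type in [
        ("proj-1", "project"), ("case-1", "case"), ("case-2", "case"),
        ("dx-1", "diagnosis"), ("file-1", "file"), ("file-2", "file"),
    ]:
        graph.add_node(node_id, node_type)
    for parent, child in [
        ("proj-1", "case-1"), ("proj-1", "case-2"),
        ("case-1", "dx-1"), ("case-1", "file-1"), ("case-2", "file-2"),
    ]:
        graph.add_edge(parent, child)
    return graph


if __name__ == "__main__":
    g = build_example()
    print(g.descendants("proj-1", "file"))
    print(g.descendants("case-1", "file"))
```

Because every query walks edges from a known identifier, a file can only be returned for a case it is explicitly linked to, which is the property the slide emphasizes.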
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics.
– Data browsing by project, file, case, or annotation
– Visualization allowing users to perform fine-grained filtering of search results
– Data search using advanced smart search technology
– Data selection into a personalized cart
– Data download from the cart or via a high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks.
– Command line interface to specify the desired transfer protocol and multiple files for download
– Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
– Supports the secure transfer of controlled-access data using an authentication key (token)
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis.
– Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
– Securely retrieve biospecimen, clinical, and molecular files
– Perform BAM slicing
Example response (portion removed for readability):
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"},
  {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
  {"project_id": "TCGA-UVM", "primary_site": "Eye"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"},
  {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
  {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
  {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
  {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
]}}
Parts of the request labeled on the slide: API URL, Endpoint, URL parameters, Query parameters
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
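The request and response on this slide can be sketched in Python as follows. The code only builds the query URL and parses a sample response of the shape shown above; it makes no network call. The host is the 2016 address from the slide and may no longer be current, and the helper names `build_query_url` and `primary_sites` are hypothetical.

```python
import json
from urllib.parse import urlencode

# API host as shown on the slide (2016); it may have changed since.
API_ROOT = "https://gdc-api.nci.nih.gov"

# Two hits copied from the slide's example response.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"}
]}}
"""


def build_query_url(endpoint, fields, pretty=True):
    """Return API_ROOT/<endpoint>?fields=a,b[&pretty=true]."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # safe="," keeps the comma-separated field list readable in the URL.
    return f"{API_ROOT}/{endpoint}?{urlencode(params, safe=',')}"


def primary_sites(response_text):
    """Map project_id -> primary_site from a projects response body."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


if __name__ == "__main__":
    print(build_query_url("projects", ["project_id", "primary_site"]))
    print(primary_sites(SAMPLE_RESPONSE))
```

Requesting only the needed fields, as the slide's URL does, keeps the response small when an endpoint returns many projects.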
GDC Web Site
• The GDC Web Site provides access to GDC resources that support and communicate the GDC mission.
– Targets data consumers, providers, developers, and general users
– Provides access to information about the GDC and contributed cancer genomic data sets
– Instructs users on the use of GDC data access and submission tools
– Provides descriptions of GDC bioinformatics pipelines
– Documents supported GDC data types and file formats
– Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary.
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site – https://gdc.nci.nih.gov
• GDC Documentation Site – https://gdc-docs.nci.nih.gov
• GDC Data Portal – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
– https://gdc-portal.nci.nih.gov/submission
– https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
– https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
– https://gdc-api.nci.nih.gov
• GDC Support
– GDC Assistance: support@nci-gdc.datacommons.io
– GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype):
– Aggregation
– Access
– Sharing
– Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Figure: data flow diagram. Labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF and cancer BAM/VCF; dbGaP; NCI GDC; birth defect VCF and cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users.]
[Figure: the GMKF Data Resource shown with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics.]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
clinicalgenomeorg
Available ClinGen Tools amp Resources
Listed By Gene
Aggregating Variant Interpretations in ClinVar Sharing Clinical Genome Connect and Reports Project Free-the-Data
Variant-level Data ClinVar
Linked Databases
Researchers Clinics Patients
Patient Registries
Labs
Unpublished or
Literature Citations
InSiGHT
CFTR2 OMIM
Groups
BIC
PharmGKB
Expert Clinical
505 ClinVar submitters; 181,841 variants submitted; 126,974 unique interpreted variants
ClinVar as of April 26, 2016
ClinVar Variant View
Assertion Levels in ClinVar
Practice Guideline: ACMG, CPIC
Expert Panel: CFTR2, InSiGHT, PharmGKB, ENIGMA
Multi-Source Consistency
Single Submitter – Criteria Provided
Single Submitter – No Criteria Provided: no stars
No Assertion: not applicable
Supporting a Curation Environment for both Crowd-Sourcing and Expert Consensus
Curated Variants
ClinVar
Variants
Gene and Variant Curation Interfaces
Case-level data store Machine-learning algorithms
Data resources
ClinGenKB
ClinGen Clinical WGs amp Expert Panels
Outside Expert Panels
Discrepancy Resolution
Primary Curators
• Convene experts and implement methods for gene and variant curation
• Resolve variant interpretation differences in ClinVar
• Enable rule-guided variant interpretation and export (to user and ClinVar)
Conditions and phenotypes in ClinVar
Variant assertions are made on "conditions"
Phenotypic features are seen in patients (supporting observations)
Standardization of disease terminologies
Parkinson's disease subtypes
Yellow = Orphanet, brown = OMIM, blue = Disease Ontology, pink = Monarch, gray = MeSH
Diseases need to be hierarchically related
https://github.com/monarch-initiative/monarch-ontology
Courtesy of Melissa Haendel
Human Phenotype Ontology: 117,348 annotations for ~7,000 mainly monogenic diseases
Used to define phenotypic elements of a disease or patient
Courtesy of Melissa Haendel
Centers for Mendelian Genomics Phenotyping Standards Survey
[Figure: four bar charts of survey responses (counts from 0 to 4).]
• Do you use any standard terminologies or ontologies when collecting/tracking phenotypes? (Yes, Sometimes, No, Other)
• Which systems are used to track disease name? (OMIM, Disease Ontology, ORDO (Orphanet), MeSH, MedGen, Other (please specify))
• Do you use Human Phenotype Ontology terms to capture more detailed phenotypic features on subjects? (Yes, Sometimes, Rarely)
• Which tools are used to collect phenotype? (PhenoDB, Patient Archive, PhenoTips)
Joint Center for Mendelian Genomics
Daniel MacArthur, Heidi Rehm, Joe Gleeson, Eric Pierce, Alan Beggs, Mark Daly, Christine Seidman, David Sweetser, Vamsi Mootha
Steering Committee
Coordination Team Project manager Hayley Brooks
Clinical project manager Sam Baxter Monkol Lek
Methods
Analysis Software
Brain Heart
Muscle
Hearing
Retinal
Mito
Other
Clinical Analysis
Monkol Lek Elise Valkanas Tom Mullen
Ben Weisburd Harindra Arachchi
Sarah Calvo Laura Gauthier Laurent Francioli
MacArthur Lab's seqr: Online platform for collaborative analysis
The platform allows collaborative analyses between the central sequencing site and thousands of collaborators
Enter structured phenotype data
https://seqr.broadinstitute.org
Rare Disease Analysis Platform
Genomic Matchmaking
Patient 1 (Clinical Geneticist 1)
– Genotypic Data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
– Phenotypic Data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5
Patient 2 (Clinical Geneticist 2)
– Genotypic Data: Gene D, Gene G, Gene H
– Phenotypic Data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6
Genomic Matchmaker: Notification of Match
Courtesy of Joel Krier
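The matching idea in this diagram can be sketched in a few lines of Python: two patient profiles match when they share a candidate gene, and the overlap in phenotypic features indicates how similar the presentations are. The Jaccard score and the function names are illustrative assumptions; they are not the scoring method of the Matchmaker Exchange or of any specific matchmaker.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match(patient_a, patient_b):
    """Summarize shared candidate genes and phenotype overlap for two profiles."""
    shared_genes = sorted(set(patient_a["genes"]) & set(patient_b["genes"]))
    shared_features = sorted(set(patient_a["features"]) & set(patient_b["features"]))
    return {
        "shared_genes": shared_genes,
        "shared_features": shared_features,
        "feature_similarity": jaccard(patient_a["features"], patient_b["features"]),
        "is_match": bool(shared_genes),
    }


# Profiles transcribed from the diagram above.
PATIENT_1 = {"genes": ["A", "B", "C", "D", "E", "F"], "features": [1, 2, 3, 4, 5]}
PATIENT_2 = {"genes": ["D", "G", "H"], "features": [1, 3, 4, 5, 6]}

if __name__ == "__main__":
    print(match(PATIENT_1, PATIENT_2))
```

With the two profiles from the diagram, the only shared candidate is Gene D and four of the six distinct features overlap, which is the situation that would trigger a notification to both clinical geneticists.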
Matchmaker Exchange: Collaboration and Support from GA4GH and IRDiRC
Needs span multiple GA4GH workgroups
• Data Work Group (data format and interfaces)
• Regulatory and Ethics (patient consent)
• Security (patient privacy and user authentication)
Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.
Human Mutation Special Issue
• The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
• The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
• GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
• PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
• Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
• Innovative genomic collaboration using the GENESIS (GEM.app) platform
• Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
• Participant-led matchmaking
• GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
• Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
• Data sharing in the Undiagnosed Disease Network
• The Genomic Birthday Paradox: How Much is Enough?
• Quantifying and mitigating false-positive disease associations in rare disease matchmaking
• Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
• GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue caused by de novo variants in HNRNPK
• Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies
The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery
www.matchmakerexchange.org
MME Use Cases
• Supported today
– Match on Gene/Phenotype: 1 strong candidate gene (e.g., de novo variant in GUS), or 10 candidate genes, each with rare variant
– Match on Phenotype Only
• Coming soon
– Match case to non-human models
– Match using phenotype and VCF
– Patient-initiated matching
GENESIS
Connected and Soon to be Connected Matchmakers
Matchmaker Exchange
Gene Matcher
DECIPHER
RD Connect
ClinGen Genome Connect
Monarch
Phenome Central
Patient initiated matching
Model organisms (mouse zebrafish) Orphanet ClinVar OMIM)
Live
Broad Institute
RDAP
Connecting Data in the Big Data World
• Centralized Database: Everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA
• Centralized Hub: APIs connect each database to a central hub. Example: many commercial platforms
• Federated Network: All databases connected through multiple APIs. Example: Matchmaker Exchange
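One way to see the trade-off between the hub and federated designs is to count the API connections each needs. The Python sketch below assumes that a hub needs one connection per database and that a fully federated network connects every pair of databases directly; the full-mesh assumption is mine, since the slide only says the databases are connected through multiple APIs.

```python
def link_counts(n_databases):
    """Number of API connections needed under each (assumed) topology."""
    if n_databases < 0:
        raise ValueError("n_databases must be non-negative")
    return {
        # Each database talks only to the central hub.
        "centralized_hub": n_databases,
        # Assumed full mesh: every pair of databases is connected directly.
        "federated_full_mesh": n_databases * (n_databases - 1) // 2,
    }


if __name__ == "__main__":
    for n in (3, 7, 20):
        print(n, link_counts(n))
```

Under these assumptions the federated count grows quadratically, which is one reason a shared, standardized API matters more as the number of connected matchmakers increases.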
Matchmaker Exchange Acknowledgements
S Balasubramanian, Mike Bamshad, Sergio Beltran Agullo, Jonathan Berg, Kym Boycott, Anthony Brookes, Michael Brudno, Han Brunner, Orion Buske, Deanna Church, Raymond Dalgliesh, Andrew Devereau, Johan den Dunnen, Helen Firth, Paul Flicek, Jan Friedman, Richard Gibbs, Marta Girdea,
Robert Green, Matt Hurles, Ada Hamosh, Ekta Khurana, Sebastian Kohler, Joel Krier, Owen Lancaster, Melissa Landrum, Paul Lasko, Rick Lifton, Daniel MacArthur, Alex MacKenzie, Danielle Metterville, Debbie Nickerson, Woong-Yan Park, Justin Paschall, Anthony Philippakis, Heidi Rehm,
Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner
NIAGADS
The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP
April 26, 2016
Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)
• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia
Long-Term Plan for the ADSP
• Discovery Phase, 2014-2018
– Family-based
  • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
  • Included Caribbean Hispanic families
  • Fully QC'd data released 7/13/2015
– Case-control
  • Whole exome sequencing (WES): 5,000 cases and 5,000 controls
  • 1,000 additional cases from families multiply affected by AD
  • Included Caribbean Hispanics
  • Fully QC'd data released 112015
• Follow-Up, 2016-2021: WGS in case-control sample sets, emphasizing ethnically diverse cohorts
ADSP Discovery Phase Analysis
• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing
Participation
• PARTICIPANTS
– Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) and Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
– Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
– NIH staff: NHGRI and NIA
– External Consultants
• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups
Genomics Center
Organization of the ADSP
ADSP Infrastructure and Support
• NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]
• National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]
• National Alzheimer's Coordinating Center, NACC [U01 AG016976]
• Alzheimer's Disease Centers
• NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]
NIAGADS History and Function
• Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
• NIA's repository for the genetics of late-onset Alzheimer's disease data
• Datasets, Genomics Database, Analysis Resource
• Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomic Data Sharing Policy
• Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)
NIAGADS ADSP Data Coordinating Center
• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare and maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and the Office of Science Policy on "Trusted Partners": interface between dbGaP and the NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP: Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomic Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS review by the Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and an eRA Commons account for authentication; no password is stored at the ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data are stored at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH side: After ADSP DAC approval, the investigator uses NIH authentication
• Community side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
• Data production: define data formatting; collect and curate phenotypes; track data production; additional quality metrics; coordinate public data release
• Secondary/derived data management: documentation (README/Manifest), packaging, and release; notification and tracking
• IT support: website; members area; face-to-face meeting support; dbGaP exchange area
• Facilitate interactions with other AD investigators
• Rapid response to unforeseen ADSP and NIH requests
Interactions
[Figure: interaction diagram. Labels: NIAGADS; NIA/NHGRI; Baylor/Broad/WashU LSACs; NCBI dbGaP/SRA; ADSP Data Flow and other Work Groups; Genomics Center; NCRAD; ADGC/CHARGE.]
ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site
Management of Work Group DataWork Flow
Exchange Area
Data Deposition and Data Sharing
• ADSP-specific datasets: 68 TB
• 21 reference datasets with 5,560 files in use by the ADSP
• >20 reference datasets planned
1 TB = 1,024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes. A 1 TB hard drive has the capacity to hold a trillion bytes.
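The unit arithmetic on this slide can be checked with a short Python sketch. It also makes visible that the slide mixes two conventions: the chain of powers of 1,024 gives the binary terabyte, while "a trillion bytes" is the decimal terabyte used by drive vendors. The constant and function names are hypothetical.

```python
BYTES_PER_TB_BINARY = 1024 ** 4    # 1,024 GB x 1,024 MB x 1,024 KB x 1,024 bytes
BYTES_PER_TB_DECIMAL = 10 ** 12    # "a trillion bytes"


def tb_to_bytes(terabytes, binary=True):
    """Convert terabytes to bytes under the binary or decimal convention."""
    per_tb = BYTES_PER_TB_BINARY if binary else BYTES_PER_TB_DECIMAL
    return terabytes * per_tb


if __name__ == "__main__":
    print("1 TB (binary): ", tb_to_bytes(1))
    print("1 TB (decimal):", tb_to_bytes(1, binary=False))
    print("68 TB (binary):", tb_to_bytes(68))
    print("binary/decimal ratio:", BYTES_PER_TB_BINARY / BYTES_PER_TB_DECIMAL)
```

The two conventions differ by roughly ten percent at the terabyte scale, which is worth keeping in mind when comparing a dataset size such as 68 TB against advertised disk capacity.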
dbGaP/NIAGADS release for the research community at large
• 11,555 BAM files released 122014
• Pedigree and phenotype information
• Sequencing Quality Control Metrics
Size of the ADSP
Reference: "The real cost of sequencing: scaling computation to keep pace with data generation." Muir et al. Genome Biology 2016;17:53.
ADSP Members Area (Dashboard)
• Calendar, documents, conference call minutes
• Reference dataset catalog
• Information on funded cooperative agreements
• Bulletin board for Work Groups
• Notification of new datasets
ADSP by the numbers
• Member list: 153 records
• 352 meeting minutes
• >600 other documents
• >100 consent files
• 10 analysis plans
• 85TB WGS, 1075TB WES data (dbGaP/SRA)
• 42 TB files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
• SNP/Gene reports
• Genome Browser interface
• Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer's Disease GWAS significance level
Combine searches on GWAS results and gene annotation to find genomic features of interest
[Figure: search workflow. Labels: GWAS results combined with SNPs below; gene annotations transformed to SNPs; genomic features from identified SNPs.]
Find a genomic location of interest by viewing tracks on the genome browser
• SNPs
• Genes
• GWAS results
• Functional genomics data
View genomic information by the Genome Browser
[Figure: search result]
Data available at the NIAGADS Genomics Database
• Gene models, SNP information
• Pathway annotations: Gene Ontology, KEGG Pathway
• Functional genomics data: ENCODE, FANTOM5
• GTEx results
• NHGRI GWAS catalog
• Published genome-wide AD GWAS summary statistics
• Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
– Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
– Big data management
– IT infrastructure and web development
• Administrative support and collaboration
– Familiarity with human subjects research regulations and NIH policy
– Leadership and coordinating skills
– Flexibility; close collaboration with program officers
• Domain expertise
– Knowledge of the disease and genetic research
– Familiarity with the community and cohorts
– Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements
• Patients and families of those with AD
• NIA ADRC program and centers
• NACC
• NCRAD
• ADSP: Data flow WG, Annotation WG
• IGAP/ADGC/CHARGE/EADI/GERAD
NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly
NIAGADS Team: Li-San Wang, Gerard Schellenberg, Chris Stoeckert, Adam Naj, Emily Greenfest-Allen, Amanda Partch, Fanny Leung, Otto Valladares, Prabhakaran K, John Malamon, Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao
NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
GDC Overview, GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.
• Provide a cancer knowledge network that
– Enables the identification of both high- and low-frequency cancer drivers
– Assists in defining genomic determinants of response to therapy
– Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
– Harmonization of raw sequence both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
– Application of state-of-the-art methods of generating high-level data
GDC Overview: Infrastructure
[Figure: infrastructure diagram organized into GDC Users, GDC System Components, and GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons and dbGaP; Data Access Tools; Data Import System; Metadata and Data Storage; Reporting System; Harmonization and Generation Pipelines; 3rd Party Applications; Alignment and Processing Tools; Data Submission Tools; Data Security System; APIs; Digital ID System.]
GDC Overview: Resources
• GDC Data Portal
• GDC Data Center
• GDC Data Transfer Tool
• GDC Reports
• GDC Data Model
• GDC Data Submission Portal
• GDC Application Programming Interface (API)
• GDC Bioinformatics Pipeline
• GDC Documentation and Support
• GDC Organization and Collaborators
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testing (UAT) Testers
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, both of which use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
GDC Data Submission: Data Submission Process
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample; Portion; Analyte; Aliquot | Sample: TSV, JSON; Portion: TSV, JSON; Analyte: TSV, JSON; Aliquot: TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic; Diagnosis; Exposure; Family History; Treatment | Demographic: TSV, JSON; Diagnosis: TSV, JSON; Exposure: TSV, JSON; Family History: TSV, JSON; Treatment: TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type Data Subtype Format Data Dictionary Template
Data File
Experimental Metadata Pathology Report Run Metadata Slide Image
Submitted Unaligned Reads
SRA XML
PDF SRA XML SVS
FASTQ BAM
Experimental Metadata Pathology Report
Submitted Unaligned Reads
TSV JSON
TSV JSON TSV JSON TSV JSON
TSV JSON
Submitted Aligned Reads BAM
Submitted Aligned Reads TSV JSON
Data Bundle Read Group
Slide
Read Group
Slide
TSV JSON
TSV JSON
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
[Figure: a Case links to Clinical Data (Demographics, Diagnosis, Family History (optional), Exposure (optional), Treatment (optional)), Biospecimen Data, and Experiment Data]
Data Types and File Formats: Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
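The per-read-group QC summary above lends itself to a simple programmatic check. The following is a minimal sketch, not the GDC's own tooling: it transcribes the example statuses from the slide into a dictionary and lists, for each read group, the FastQC modules that reported a given status. The names `QC_RESULTS` and `modules_with_status` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List

READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Module statuses transcribed from the example table, one value per read group.
QC_RESULTS: Dict[str, List[str]] = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def modules_with_status(status: str) -> Dict[str, List[str]]:
    """Map each read group to the modules that reported `status`."""
    flagged: Dict[str, List[str]] = {rg: [] for rg in READ_GROUPS}
    for module, statuses in QC_RESULTS.items():
        for read_group, value in zip(READ_GROUPS, statuses):
            if value == status:
                flagged[read_group].append(module)
    return flagged


if __name__ == "__main__":
    for read_group, modules in modules_with_status("FAIL").items():
        print(read_group, modules)
```

A submitter could use a check of this kind to decide which read groups need attention before requesting release; whether a FAIL blocks submission is a policy question the slides do not address.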
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
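As an illustration of the Create Entities endpoint, the sketch below builds (but does not send) a POST request against `/v0/submission/<program>/<project>`. Several details are assumptions not stated on the slide: the host is the API address listed in this deck's references, the token is carried in an `X-Auth-Token` header, and the program name, project name, and entity payload are hypothetical placeholders rather than a documented schema.

```python
import json
import urllib.request
from typing import Any, Dict

# Assumption: API host as listed in this deck's reference slide.
API_ROOT = "https://gdc-api.nci.nih.gov"


def build_create_request(
    program: str,
    project: str,
    entity: Dict[str, Any],
    token: str,
    api_root: str = API_ROOT,
) -> urllib.request.Request:
    """Build a POST request for the Create Entities endpoint without sending it."""
    url = f"{api_root}/v0/submission/{program}/{project}"
    body = json.dumps(entity).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: the authentication token travels in this header.
            "X-Auth-Token": token,
        },
    )


if __name__ == "__main__":
    # Hypothetical program, project, and payload, for illustration only.
    request = build_create_request(
        "DEMO",
        "PROJ1",
        {"type": "case", "submitter_id": "CASE-001"},
        token="not-a-real-token",
    )
    print(request.get_method(), request.full_url)
    # Sending would be: urllib.request.urlopen(request)
```

Building the request separately from sending it keeps the example testable offline and makes it easy to inspect the URL and headers before any data leaves the submitter's machine.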
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]
GDC Processing: GDC High-Level Data Generation
• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
GDC Data Retrieval
• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
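To make the graph idea concrete, here is a minimal sketch of a node-and-edge store in which every entity carries a unique identifier and a data file can be traced back to its case and project. It is an illustration under stated assumptions, not the GDC schema: the class names, node labels, and the short identifiers (stand-ins for real UUIDs) are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    node_id: str  # unique identifier (a UUID in a real system)
    label: str    # entity type, e.g. "case" or "file"
    properties: dict = field(default_factory=dict)


class DataGraph:
    """Tiny directed graph: each edge points from a child node to its parent."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}
        self.parents: Dict[str, List[str]] = {}

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node
        self.parents.setdefault(node.node_id, [])

    def add_edge(self, child_id: str, parent_id: str) -> None:
        if child_id not in self.nodes or parent_id not in self.nodes:
            raise KeyError("both endpoints must exist before linking")
        self.parents[child_id].append(parent_id)

    def lineage(self, node_id: str) -> List[Node]:
        """Return all ancestors of a node, nearest first (breadth-first)."""
        seen, order, queue = {node_id}, [], deque([node_id])
        while queue:
            current = queue.popleft()
            for parent_id in self.parents[current]:
                if parent_id not in seen:
                    seen.add(parent_id)
                    order.append(self.nodes[parent_id])
                    queue.append(parent_id)
        return order


def build_example() -> DataGraph:
    """Hypothetical project -> case -> sample -> file chain plus clinical data."""
    graph = DataGraph()
    for node_id, label in [
        ("project-1", "project"),
        ("case-1", "case"),
        ("clinical-1", "clinical"),
        ("sample-1", "sample"),
        ("file-1", "file"),
    ]:
        graph.add_node(Node(node_id, label))
    graph.add_edge("case-1", "project-1")
    graph.add_edge("clinical-1", "case-1")
    graph.add_edge("sample-1", "case-1")
    graph.add_edge("file-1", "sample-1")
    return graph


if __name__ == "__main__":
    print([n.label for n in build_example().lineage("file-1")])
```

Walking the edges from a file node recovers the sample, case, and project it belongs to, which is the kind of linkage the data model is described as guaranteeing.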
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from cart or a high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing
Example request:
  https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
  URL components: API URL (https://gdc-api.nci.nih.gov), endpoint (projects), and URL/query parameters (fields=project_id,primary_site and pretty=true)
Example response (portion removed for readability):
  {
    "data": {
      "hits": [
        {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
        {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
        {"project_id": "TCGA-LAML", "primary_site": "Blood"},
        {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
        {"project_id": "TCGA-UVM", "primary_site": "Eye"},
        {"project_id": "TARGET-AML", "primary_site": "Blood"},
        {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
        {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
        {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
        {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
      ]
    }
  }
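The request shown on this slide can be assembled and its response parsed with a few lines of Python. This is a minimal sketch that works offline: it builds the same URL from its parts and extracts project/site pairs from a small sample payload shaped like the response on the slide. The helper names and the two-hit sample string are hypothetical, and the host is the 2016 address given in this deck, which may since have changed.

```python
import json
from typing import Dict, Iterable
from urllib.parse import urlencode

# Assumption: API host as shown in this deck.
API_ROOT = "https://gdc-api.nci.nih.gov"

# Hypothetical, shortened sample shaped like the response on the slide.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LUSC", "primary_site": "Lung"}
]}}
"""


def build_projects_url(
    fields: Iterable[str], pretty: bool = True, api_root: str = API_ROOT
) -> str:
    """Compose API URL + endpoint + query parameters for the projects endpoint."""
    query = urlencode(
        {"fields": ",".join(fields), "pretty": str(pretty).lower()},
        safe=",",  # keep the comma-separated field list readable
    )
    return f"{api_root}/projects?{query}"


def primary_sites(response_text: str) -> Dict[str, str]:
    """Map project_id to primary_site for every hit in a projects response."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


if __name__ == "__main__":
    print(build_projects_url(["project_id", "primary_site"]))
    print(primary_sites(SAMPLE_RESPONSE))
    # A live call would be: urllib.request.urlopen(build_projects_url(...)).read()
```

Restricting the `fields` parameter to the attributes actually needed keeps responses small, which matters when paging through many projects or files.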
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site
  – https://gdc.nci.nih.gov
• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov
• GDC Data Portal
  – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype)
  – Aggregation
  – Access
  – Sharing
  – Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Figure: data flow from birth defect and childhood cancer cohorts through the DNA sequencing center (birth defect BAM/VCF, cancer BAM/VCF) to dbGaP and the NCI GDC, and on to the Gabriella Miller Kids First Data Resource (index of datasets, phenotypes, variant summaries) and its users]
[Figure: the GMKF Data Resource shown alongside dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
Available ClinGen Tools amp Resources
Listed By Gene
Aggregating Variant Interpretations in ClinVar Sharing Clinical Genome Connect and Reports Project Free-the-Data
Variant-level Data ClinVar
Linked Databases
Researchers Clinics Patients
Patient Registries
Labs
Unpublished or
Literature Citations
InSiGHT
CFTR2 OMIM
Groups
BIC
PharmGKB
Expert Clinical
505 ClinVar submitters; 181,841 variants submitted; 126,974 unique interpreted variants
ClinVar as of April 26, 2016
ClinVar Variant View
Assertion Levels in ClinVar
Expert Panel
Single Submitter – Criteria Provided
Single Submitter – No Criteria Provided
Multi-Source Consistency
Practice Guideline
No stars
No Assertion Not applicable
ACMG CPIC
CFTR2 InSiGHT PharmGKB ENIGMA
Supporting a Curation Environment for both Crowd-Sourcing and Expert Consensus
Curated Variants
ClinVar
Variants
Gene and Variant Curation Interfaces
Case-level data store Machine-learning algorithms
Data resources
ClinGenKB
ClinGen Clinical WGs amp Expert Panels
Outside Expert Panels
Discrepancy Resolution
Primary Curators
Convene experts and implement methods for gene and variant curation
Resolve variant interpretation differences in ClinVar
Enable rule-guided variant interpretation and export (to user and ClinVar)
Conditions and phenotypes in ClinVar
Variant assertions are made on "conditions"
Phenotypic features are seen in patients (supporting observations)
Standardization of disease terminologies
Parkinson's disease subtypes
[Figure legend: yellow = Orphanet, brown = OMIM, blue = Disease Ontology, pink = Monarch, gray = MeSH]
Diseases need to be hierarchically related
https://github.com/monarch-initiative/monarch-ontology (Courtesy of Melissa Haendel)
Human Phenotype Ontology: 117,348 annotations for ~7,000 mainly monogenic diseases
Used to define phenotypic elements of a disease or patient
Courtesy of Melissa Haendel
Centers for Mendelian Genomics Phenotyping Standards Survey
[Figure: four bar charts of survey responses; counts are not recoverable from the extracted text]
• Do you use any standard terminologies or ontologies when collecting/tracking phenotypes? (Yes, Sometimes, No, Other)
• Which systems are used to track disease name? (OMIM, Disease Ontology, ORDO (Orphanet), MeSH, MedGen, Other (please specify))
• Do you use Human Phenotype Ontology terms to capture more detailed phenotypic features on subjects? (Yes, Sometimes, Rarely)
• Which tools are used to collect phenotype? (PhenoDB, Patient Archive, PhenoTips)
Joint Center for Mendelian Genomics
Steering Committee: Daniel MacArthur, Heidi Rehm, Joe Gleeson, Eric Pierce, Alan Beggs, Mark Daly, Christine Seidman, David Sweetser, Vamsi Mootha
Coordination Team Project manager Hayley Brooks
Clinical project manager Sam Baxter Monkol Lek
Methods
Analysis Software
Brain Heart
Muscle
Hearing
Retinal
Mito
Other
Clinical Analysis
Monkol Lek Elise Valkanas Tom Mullen
Ben Weisburd Harindra Arachchi
Sarah Calvo Laura Gauthier Laurent Francioli
MacArthur Lab's seqr: Online platform for collaborative analysis
Platform allows collaborative analyses between central sequencing site and thousands of collaborators
enter structured phenotype data
https://seqr.broadinstitute.org
Rare Disease Analysis Platform
Genomic Matchmaking
Patient 1 (Clinical Geneticist 1)
  Genotypic Data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
  Phenotypic Data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5
Patient 2 (Clinical Geneticist 2)
  Genotypic Data: Gene D, Gene G, Gene H
  Phenotypic Data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6
Genomic Matchmaker: Notification of Match
Courtesy of Joel Krier
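The matching idea in this diagram can be sketched as a set intersection over candidate genes and phenotypic features. The code below is a toy illustration using the two example patients from the slide; it is not the Matchmaker Exchange API or any connected matchmaker's scoring method, and the `Patient` class, `find_match` function, and `min_shared_features` threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional


@dataclass(frozen=True)
class Patient:
    name: str
    genes: FrozenSet[str]     # candidate genes from genotypic data
    features: FrozenSet[str]  # phenotypic features


def find_match(
    a: Patient, b: Patient, min_shared_features: int = 1
) -> Optional[Dict[str, List[str]]]:
    """Report shared genes and features if both overlaps are non-trivial."""
    shared_genes = a.genes & b.genes
    shared_features = a.features & b.features
    if shared_genes and len(shared_features) >= min_shared_features:
        return {
            "genes": sorted(shared_genes),
            "features": sorted(shared_features),
        }
    return None


PATIENT_1 = Patient(
    "Patient 1",
    frozenset(f"Gene {g}" for g in "ABCDEF"),
    frozenset(f"Feature {i}" for i in (1, 2, 3, 4, 5)),
)
PATIENT_2 = Patient(
    "Patient 2",
    frozenset(f"Gene {g}" for g in "DGH"),
    frozenset(f"Feature {i}" for i in (1, 3, 4, 5, 6)),
)

if __name__ == "__main__":
    print(find_match(PATIENT_1, PATIENT_2))
```

For the slide's example the only shared candidate is Gene D, supported by four shared features, which is the kind of coincidence that would trigger a notification to both clinical geneticists.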
Matchmaker Exchange: Collaboration and Support from GA4GH and IRDiRC
Needs span multiple GA4GH workgroups
• Data Work Group (data format and interfaces)
• Regulatory and Ethics (patient consent)
• Security (patient privacy and user authentication)
Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.
Human Mutation Special Issue: The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
• The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
• GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
• PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
• Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
• Innovative genomic collaboration using the GENESIS (GEM.app) platform
• Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
• Participant-led matchmaking
• GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
• Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
• Data sharing in the Undiagnosed Disease Network
• The Genomic Birthday Paradox: How Much is Enough?
• Quantifying and mitigating false-positive disease associations in rare disease matchmaking
• Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
• GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue abnormalities caused by de novo variants in HNRNPK
• Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies
The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery
www.matchmakerexchange.org
MME Use Cases
• Supported today
  – Match on Gene/Phenotype
    • 1 strong candidate gene (e.g., de novo variant in GUS)
    • 10 candidate genes, each with rare variant
  – Match on Phenotype Only
• Coming soon
  – Match case to non-human models
  – Match using phenotype and VCF
  – Patient-initiated matching
GENESIS
Connected and Soon to be Connected Matchmakers
Matchmaker Exchange
Gene Matcher
DECIPHER
RD Connect
ClinGen Genome Connect
Monarch
Phenome Central
Patient initiated matching
Model organisms (mouse zebrafish) Orphanet ClinVar OMIM)
Live
Broad Institute
RDAP
Connecting Data in the Big Data World
• Centralized Database: Everyone submits data to a single central database
  – Examples: ClinVar, dbGaP, EGA
• Centralized Hub: APIs connect each database to a central hub
  – Example: Many commercial platforms
• Federated Network: All databases connected through multiple APIs
  – Example: Matchmaker Exchange
Matchmaker Exchange Acknowledgements
S Balasubramanian, Mike Bamshad, Sergio Beltran Agullo, Jonathan Berg, Kym Boycott, Anthony Brookes, Michael Brudno, Han Brunner, Oriean Buske, Deanna Church, Raymond Dalgliesh, Andrew Devereau, Johan den Dunnen, Helen Firth, Paul Flicek, Jan Friedman, Richard Gibbs, Marta Girdea, Robert Green, Matt Hurles, Ada Hamosh, Ekta Khurana, Sebastian Kohler, Joel Krier, Owen Lancaster, Melissa Landrum, Paul Lasko, Rick Lifton, Daniel MacArthur, Alex MacKenzie, Danielle Metterville, Debbie Nickerson, Woong-Yan Park, Justin Paschall, Anthony Philippakis, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner
NIAGADS
The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP
April 26, 2016
Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)
• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia
Long-Term Plan for the ADSP
• Discovery Phase, 2014-2018
  – Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC'd data released 7/13/2015
  – Case-control
    • Whole exome sequencing (WES): 5000 cases and 5000 controls
    • 1000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC'd data released 112015
• Follow-Up, 2016-2021: WGS in case-control sample sets, emphasizing ethnically diverse cohorts
ADSP Discovery Phase Analysis
• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing
Participation
• PARTICIPANTS
  – Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  – Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  – NIH staff: NHGRI and NIA
  – External Consultants
• INFRASTRUCTURE
  – Executive Committee, Analysis Coordination Committee, and 8 Work Groups
Genomics Center
Organization of the ADSP
ADSP Infrastructure and Support
NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]
National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]
National Alzheimer's Coordinating Center, NACC [U01 AG016976]
Alzheimer's Disease Centers
NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]
NIAGADS History and Function
Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
NIA's repository for the genetics of late-onset Alzheimer's disease data
Datasets, Genomics Database, Analysis Resource
Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomic Data Sharing Policy
Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)
NIAGADS ADSP Data Coordinating Center
• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and the Office of Science Policy on "Trusted Partners"
Interface between dbGaP and the NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee
Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomic Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
• Data production: define data formatting; collect and curate phenotypes; track data production; additional quality metrics; coordinate public data release
• Secondary/derived data management: documentation (README/Manifest), packaging, and release; notification and tracking
• IT support: website; members area; face-to-face meeting support; dbGaP exchange area
• Facilitate interactions with other AD investigators
• Rapid response to unforeseen ADSP and NIH requests
Interactions
NIAGADS
NIANHGRI
BaylorBroadWashU LSACs
NCBI dbGaPSRA
ADSP Data Flow and other Work Groups
Genomics Center
NCRAD, ADGC/CHARGE
ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site
Management of Work Group DataWork Flow
Exchange Area
Data Deposition and Data Sharing
ADSP-specific datasets: 68 TB
21 reference datasets with 5560 files in use by the ADSP
>20 reference datasets planned
1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes. A 1 TB hard drive has the capacity to hold a trillion bytes.
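The unit arithmetic on this slide can be checked in a few lines. The sketch below reproduces the chain of binary (power-of-1024) conversions and, for comparison, the decimal value behind the "trillion bytes" remark; the function names are hypothetical and exist only for this illustration.

```python
from typing import Dict

# Units in ascending order; each step up multiplies by 1024 in the binary convention.
BINARY_UNITS = ["bytes", "KB", "MB", "GB", "TB"]


def binary_tb_breakdown() -> Dict[str, int]:
    """Express 1 TB in each unit, using powers of 1024."""
    tb_index = BINARY_UNITS.index("TB")
    return {
        unit: 1024 ** (tb_index - index)
        for index, unit in enumerate(BINARY_UNITS)
    }


def decimal_tb_bytes() -> int:
    """One terabyte in the decimal convention: a trillion bytes."""
    return 10 ** 12


if __name__ == "__main__":
    for unit, amount in binary_tb_breakdown().items():
        print(f"1 TB = {amount:,} {unit}")
    print(f"decimal TB = {decimal_tb_bytes():,} bytes")
```

The two conventions differ by roughly 10 percent at this scale, which is worth keeping in mind when comparing dataset sizes against advertised drive capacities.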
dbGaPNIAGADS release for the research community at large
11555 BAM files released 122014
Pedigree and phenotype information
Sequencing Quality Control Metrics
Size of the ADSP
Reference: "The real cost of sequencing: scaling computation to keep pace with data generation." Muir et al., Genome Biology 2016, 17:53
ADSP Members Area (Dashboard)
Calendar Documents Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
• Member list: 153 records
• 352 meeting minutes
• >600 other documents
• >100 consent files
• 10 analysis plans
• 85TB WGS 1075TB WES data (dbGaP/SRA)
• 42 TB Files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
SNP/Gene reports, Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimerrsquos Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
GWAS results combined with
SNPs below
Gene annotations transformed to SNPs
Genomic features from identified SNPs
Find a genomic location of interest by viewing tracks on the genome browser
SNPs
genes
GWAS results
functional genomics data
View Genomic Information by the Genome Browser
search result
Data available at the NIAGADS Genomics Database
Gene models SNP information
Pathway annotations Gene Ontology
KEGG Pathway
Functional genomics data ENCODE
FANTOM5
GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics Ongoing effort to deposit annotations curated by ADSP AnnotationWG into Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  – Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  – Big data management
  – IT infrastructure and web development
• Administrative support and collaboration
  – Familiarity with human subjects research regulations and NIH policy
  – Leadership and coordinating skills
  – Flexibility, close collaboration with program officers
• Domain expertise
  – Knowledge of the disease and genetic research
  – Familiarity with the community and cohorts
  – Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
What the Data Coordinating Center Should Know
Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss them
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements
Patients and families of those with AD
NIA ADRC program and centers
NACC
NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD
NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly
NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon, Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao
NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
GDC Overview: GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high level data
GDC Overview: Infrastructure
[Figure: GDC infrastructure diagram grouped into GDC Users, GDC System Components, and GDC Interfaces. Labels shown: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; 3rd Party Applications; Data Access Tools; Data Submission Tools; APIs; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System]
GDC Overview: Resources
[Figure: GDC resources. Labels shown: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators]
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
GDC Data Submission: Data Submission Process
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Portion | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Analyte | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Aliquot | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Diagnosis | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Exposure | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Family History | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Treatment | TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
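Since templates are offered in both TSV and JSON, the same records can be moved between the two formats. The sketch below uses only the Python standard library; the column names (`submitter_id`, `sample_type`) and values are placeholders chosen for illustration and are not taken from the GDC data dictionary.

```python
import csv
import io
import json

# Hypothetical two-record TSV; column names are placeholders, not GDC dictionary fields.
TSV_TEXT = "submitter_id\tsample_type\ncase-001\ttumor\ncase-002\tnormal\n"


def tsv_to_records(tsv_text):
    """Parse tab-separated text into a list of dicts, one per data row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [dict(row) for row in reader]


records = tsv_to_records(TSV_TEXT)
print(json.dumps(records, indent=2))
```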
Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
[Figure: a Case is associated with Clinical Data, Biospecimen Data, and Experiment Data. Clinical Data comprises Demographics, Diagnosis, Family History (Optional), Exposure (Optional), and Treatment (Optional)]
Data Types and File Formats: Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
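A QC report like the one on this slide is easiest to act on once it is reduced to the modules that did not pass. The sketch below hard-codes the statuses shown for the four read groups and lists, per read group, the modules with a given status. It is an illustration only, not the GDC's reporting code, and the function name is hypothetical.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Statuses copied from the slide, one letter per read group in order
# (P = PASS, W = WARNING, F = FAIL).
MODULES = {
    "Basic Statistics": "PPPP",
    "Per base sequence quality": "PPPP",
    "Per tile sequence quality": "PWPP",
    "Per sequence quality scores": "FPPP",
    "Per base sequence content": "PPWP",
    "Per sequence GC content": "PPPP",
    "Per base N content": "PPPP",
    "Sequence Length Distribution": "PPPP",
    "Sequence Duplication Levels": "WWPP",
    "Overrepresented sequences": "PPPP",
    "Adapter Content": "PFPP",
    "Kmer Content": "PPWP",
}


def modules_with_status(code):
    """Map each read group to the QC modules reporting the given status code."""
    flagged = {}
    for module, statuses in MODULES.items():
        for read_group, status in zip(READ_GROUPS, statuses):
            if status == code:
                flagged.setdefault(read_group, []).append(module)
    return flagged


failed = modules_with_status("F")
warned = modules_with_status("W")
for read_group in READ_GROUPS:
    print(read_group, "FAIL:", failed.get(read_group, []), "WARNING:", warned.get(read_group, []))
```

In this example only RG_4 is clean across all modules; RG_1 and RG_2 each have one failing module.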
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
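The create-entities endpoint on this slide can be exercised by building a POST request whose path carries the program and project. The sketch below only constructs the request and does not send it. Several details are assumptions made for illustration: the host is the API address given in the deck's reference list, the `X-Auth-Token` header name is assumed, and the program, project, and entity values are placeholders.

```python
import json
import urllib.request

API_BASE = "https://gdc-api.nci.nih.gov"  # assumed host, from the deck's reference list


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST request to the create-entities endpoint."""
    url = f"{API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Auth-Token": token,  # assumed header name for the authentication token
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


request = build_create_request(
    "EXAMPLE_PROGRAM",
    "EXAMPLE_PROJECT",
    [{"type": "case", "submitter_id": "case-001"}],  # hypothetical entity
    token="not-a-real-token",
)
print(request.get_method(), request.full_url)
```

Keeping request construction separate from sending makes it possible to inspect or log the payload before any controlled-access credentials leave the machine.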
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]
GDC Processing: GDC High Level Data Generation
• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
GDC Data Retrieval
• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
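A graph of nodes and edges keyed by unique identifiers can be sketched in a few lines. The example below is a toy model under stated assumptions: the node types (project, case, sample, file), the identifiers, and the `Graph` class are all hypothetical and do not reproduce the GDC's actual schema or storage engine. It shows how a data file can be traced back to its case and project by following edges.

```python
from collections import defaultdict


class Graph:
    """Toy directed graph: nodes carry a type, edges point from child to parent."""

    def __init__(self):
        self.node_types = {}
        self.parents = defaultdict(list)

    def add_node(self, node_id, node_type):
        self.node_types[node_id] = node_type

    def add_edge(self, child_id, parent_id):
        self.parents[child_id].append(parent_id)

    def ancestors_of_type(self, node_id, node_type):
        """Follow edges upward and collect every ancestor of the given type."""
        found, stack, seen = [], [node_id], set()
        while stack:
            current = stack.pop()
            for parent in self.parents.get(current, []):
                if parent in seen:
                    continue
                seen.add(parent)
                if self.node_types[parent] == node_type:
                    found.append(parent)
                stack.append(parent)
        return found


graph = Graph()
for node_id, node_type in [("P1", "project"), ("C1", "case"), ("S1", "sample"), ("F1", "file")]:
    graph.add_node(node_id, node_type)
graph.add_edge("C1", "P1")  # case belongs to project
graph.add_edge("S1", "C1")  # sample derives from case
graph.add_edge("F1", "S1")  # data file derives from sample

print(graph.ancestors_of_type("F1", "case"))
print(graph.ancestors_of_type("F1", "project"))
```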
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from cart or a high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)
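A manifest plus a token file are the two inputs the slide describes for a transfer. The sketch below only assembles a command line and does not run it. It assumes the tool is installed as `gdc-client` and accepts `-m` (manifest) and `-t` (token file) for its `download` and `upload` subcommands; the file names are placeholders, and the tool's own help output should be treated as authoritative.

```python
import shlex


def build_transfer_command(action, manifest_path, token_path, executable="gdc-client"):
    """Assemble the argument list for a manifest-driven transfer (not executed here)."""
    if action not in ("download", "upload"):
        raise ValueError(f"unsupported action: {action!r}")
    # Assumed flags: -m for the manifest file, -t for the token file.
    return [executable, action, "-m", manifest_path, "-t", token_path]


command = build_transfer_command("download", "gdc_manifest.txt", "gdc-user-token.txt")
print(shlex.join(command))
```

Passing the result to `subprocess.run(command, check=True)` would execute the transfer; keeping the token in a file rather than on the command line avoids exposing it in shell history.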
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing
Example request (API URL, endpoint, and query parameters):
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Example response (portion removed for readability):
{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
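The request on this slide is an endpoint plus a `fields` list and a `pretty` flag, and the response nests results under `data.hits`. The sketch below builds that URL and groups a sample response by primary site. It makes no network call: the host is taken from the deck's reference list, the sample response is a hand-copied subset of the hits shown on the slide, and the helper names are hypothetical.

```python
import json
from collections import defaultdict
from urllib.parse import urlencode

API_BASE = "https://gdc-api.nci.nih.gov"  # host as listed in the deck's references


def build_projects_url(fields, pretty=True):
    """Compose the projects endpoint URL with a comma-separated fields list."""
    query = urlencode({"fields": ",".join(fields), "pretty": str(pretty).lower()}, safe=",")
    return f"{API_BASE}/projects?{query}"


# Subset of the hits shown on the slide, used as offline sample data.
SAMPLE_RESPONSE = json.loads("""
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"},
  {"project_id": "TCGA-LUSC", "primary_site": "Lung"}
]}}
""")


def projects_by_site(response):
    """Group project IDs by their primary site."""
    grouped = defaultdict(list)
    for hit in response["data"]["hits"]:
        grouped[hit["primary_site"]].append(hit["project_id"])
    return dict(grouped)


print(build_projects_url(["project_id", "primary_site"]))
print(projects_by_site(SAMPLE_RESPONSE))
```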
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site
  – https://gdc.nci.nih.gov
• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov
• GDC Data Portal
  – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype)
  – Aggregation
  – Access
  – Sharing
  – Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Figure: data flow. Labels shown: Birth defect or childhood cancer cohorts; DNA; Sequencing center; Birth defect BAM/VCF; Cancer BAM/VCF; dbGaP; NCI GDC; Birth defect VCF; Cancer VCF; Gabriella Miller Kids First Data Resource (Index of datasets, Phenotype, Variant summaries); Users]
[Figure: the GMKF Data Resource shown alongside dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
Aggregating Variant Interpretations in ClinVar
[Figure: sources of variant-level data in ClinVar. Labels shown: Sharing Clinical Reports Project; Genome Connect; Free-the-Data; Researchers; Clinics; Patients; Patient Registries; Clinical Labs; Expert Groups; Linked Databases (InSiGHT, CFTR2, OMIM, BIC, PharmGKB); Unpublished or Literature Citations]
505 ClinVar submitters
181,841 variants submitted
126,974 unique interpreted variants
ClinVar as of April 26, 2016
ClinVar Variant View
Assertion Levels in ClinVar
Practice Guideline (ACMG, CPIC)
Expert Panel (CFTR2, InSiGHT, PharmGKB, ENIGMA)
Multi-Source Consistency
Single Submitter – Criteria Provided
Single Submitter – No Criteria Provided (no stars)
No Assertion (not applicable)
Supporting a Curation Environment for both Crowd-Sourcing and Expert Consensus
[Figure: curation workflow. Labels shown: ClinVar Variants; Curated Variants; Gene and Variant Curation Interfaces; Case-level data store; Machine-learning algorithms; Data resources; ClinGenKB; ClinGen Clinical WGs & Expert Panels; Outside Expert Panels; Discrepancy Resolution; Primary Curators]
• Convene experts and implement methods for gene and variant curation
• Resolve variant interpretation differences in ClinVar
• Enable rule-guided variant interpretation and export (to user and ClinVar)
Conditions and phenotypes in ClinVar
Variant assertions are made on "conditions"
Phenotypic features are seen in patients (supporting observations)
Standardization of disease terminologies
Parkinson's disease subtypes
yellow = Orphanet, brown = OMIM, blue = Disease Ontology, pink = Monarch, gray = MeSH
Diseases need to be hierarchically related
https://github.com/monarch-initiative/monarch-ontology
Courtesy of Melissa Haendel
Human Phenotype Ontology: 117,348 annotations for ~7,000 mainly monogenic diseases
Used to define phenotypic elements of a disease or patient
Courtesy of Melissa Haendel
Centers for Mendelian Genomics Phenotyping Standards Survey
[Figure: four bar charts of survey responses (counts from 0 to 4)]
• Do you use any standard terminologies or ontologies when collecting/tracking phenotypes? (Yes, Sometimes, No)
• Which systems are used to track disease name? (Other, OMIM, Disease ontology, ORDO (Orphanet), MeSH, MedGen, Other (please specify))
• Do you use Human Phenotype Ontology terms to capture more detailed phenotypic features on subjects? (Yes, Sometimes, Rarely)
• Which tools are used to collect phenotype? (PhenoDB, Patient Archive, PhenoTips)
Joint Center for Mendelian Genomics
Steering Committee: Daniel MacArthur, Heidi Rehm, Joe Gleeson, Eric Pierce, Alan Beggs, Mark Daly, Christine Seidman, David Sweetser, Vamsi Mootha
Coordination Team
Project manager: Hayley Brooks
Clinical project manager: Sam Baxter
[Figure: organization chart. Labels shown: Methods; Analysis; Software; Clinical Analysis; Brain; Heart; Muscle; Hearing; Retinal; Mito; Other. Names shown: Monkol Lek, Elise Valkanas, Tom Mullen, Ben Weisburd, Harindra Arachchi, Sarah Calvo, Laura Gauthier, Laurent Francioli]
MacArthur Lab's seqr: Online platform for collaborative analysis
Platform allows collaborative analyses between central sequencing site and thousands of collaborators
Enter structured phenotype data
https://seqr.broadinstitute.org
Rare Disease Analysis Platform
Genomic Matchmaking
Patient 1 (Clinical Geneticist 1)
• Genotypic Data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
• Phenotypic Data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5
Patient 2 (Clinical Geneticist 2)
• Genotypic Data: Gene D, Gene G, Gene H
• Phenotypic Data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6
Genomic Matchmaker: Notification of Match
Courtesy of Joel Krier
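The two-patient example on this slide can be worked through directly: the patients share one candidate gene (Gene D) and four of six distinct phenotypic features. The sketch below is a simplified illustration and not the Matchmaker Exchange matching algorithm; the Jaccard similarity, the 0.5 threshold, and all class and function names are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatientProfile:
    label: str
    genes: frozenset
    features: frozenset


def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match(p1, p2, min_phenotype_similarity=0.5):
    """Report shared candidate genes and phenotype overlap; the threshold is illustrative."""
    shared_genes = p1.genes & p2.genes
    similarity = jaccard(p1.features, p2.features)
    return {
        "shared_genes": shared_genes,
        "phenotype_similarity": similarity,
        "notify": bool(shared_genes) and similarity >= min_phenotype_similarity,
    }


patient_1 = PatientProfile(
    "Patient 1",
    frozenset(f"Gene {g}" for g in "ABCDEF"),
    frozenset({1, 2, 3, 4, 5}),
)
patient_2 = PatientProfile(
    "Patient 2",
    frozenset({"Gene D", "Gene G", "Gene H"}),
    frozenset({1, 3, 4, 5, 6}),
)

result = match(patient_1, patient_2)
print(sorted(result["shared_genes"]), result["phenotype_similarity"], result["notify"])
```

Requiring both a shared gene and phenotype overlap mirrors the slide's point that a match rests on genotypic and phenotypic data together; real matchmakers weight and score these differently.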
Matchmaker Exchange: Collaboration and Support from GA4GH and IRDiRC
Needs span multiple GA4GH workgroups
• Data Work Group (data format and interfaces)
• Regulatory and Ethics (patient consent)
• Security (patient privacy and user authentication)
Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.
Human Mutation Special Issue
• The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
• The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
• GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
• PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
• Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
• Innovative genomic collaboration using the GENESIS (GEM.app) platform
• Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
• Participant-led matchmaking
• GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
• Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
• Data sharing in the Undiagnosed Disease Network
• The Genomic Birthday Paradox: How Much is Enough?
• Quantifying and mitigating false-positive disease associations in rare disease matchmaking
• Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
• GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue abnormalities caused by de novo variants in HNRNPK
• Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies
The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery
www.matchmakerexchange.org
MME Use Cases
• Supported today
  – Match on Gene/Phenotype
    • 1 strong candidate gene (e.g., de novo variant in GUS)
    • 10 candidate genes, each with rare variant
  – Match on Phenotype Only
• Coming soon
  – Match case to non-human models
  – Match using phenotype and VCF
  – Patient-initiated matching
Connected and Soon to be Connected Matchmakers
[Figure: Matchmaker Exchange network. Labels shown: GeneMatcher; DECIPHER; PhenomeCentral; GENESIS; RD Connect; ClinGen Genome Connect; Monarch; Broad Institute RDAP; Patient initiated matching; Model organisms (mouse, zebrafish); Orphanet, ClinVar, OMIM; Live]
Connecting Data in the Big Data World
• Centralized Database: Everyone submits data to a single central database
  – Examples: ClinVar, dbGaP, EGA
• Centralized Hub: APIs connect each database to a central hub
  – Example: Many commercial platforms
• Federated Network: All databases connected through multiple APIs
  – Example: Matchmaker Exchange
Matchmaker Exchange Acknowledgements
S Balasubramanian, Mike Bamshad, Sergio Beltran Agullo, Jonathan Berg, Kym Boycott, Anthony Brookes, Michael Brudno, Han Brunner, Oriean Buske, Deanna Church, Raymond Dalgliesh, Andrew Devereau, Johan den Dunnen, Helen Firth, Paul Flicek, Jan Friedman, Richard Gibbs, Marta Girdea, Robert Green, Matt Hurles, Ada Hamosh, Ekta Khurana, Sebastian Kohler, Joel Krier, Owen Lancaster, Melissa Landrum, Paul Lasko, Rick Lifton, Daniel MacArthur, Alex MacKenzie, Danielle Metterville, Debbie Nickerson, Woong-Yan Park, Justin Paschall, Anthony Philippakis, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner
NIAGADS
The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP
April 26, 2016
Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)
• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia
Long-Term Plan for the ADSP
• Discovery Phase 2014-2018
  – Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC'd data released 7/13/2015
  – Case-control
    • Whole exome sequencing (WES): 5,000 cases and 5,000 controls
    • 1,000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC'd data released 11/2015
• Follow-Up 2016-2021
  – WGS in case-control sample sets emphasizing ethnically diverse cohorts
ADSP Discovery Phase Analysis
• NIA funding for analysis of sequence data, June 2014
• Major analysis projects
  – Family-Based Analysis
  – Case-Control Analysis
  – Structural Variants
  – Protective Variants
  – Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing
Participation
• PARTICIPANTS
  – Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  – Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  – NIH staff: NHGRI and NIA
  – External Consultants
• INFRASTRUCTURE
  – Executive Committee, Analysis Coordination Committee, and 8 Work Groups
Genomics Center
Organization of the ADSP
ADSP Infrastructure and Support
NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]
National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]
National Alzheimer's Coordinating Center, NACC [U01 AG016976]
Alzheimer's Disease Centers
NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]
NIAGADS History and Function
Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
NIA's repository for the genetics of late-onset Alzheimer's disease data
Datasets, Genomics Database, Analysis Resource
Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy
Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)
NIAGADS ADSP Data Coordinating Center
• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and Office of Science Policy on "Trusted Partners"
Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP: Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and eRA Commons accounts for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data are stored at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH side: After ADSP DAC approval, the investigator uses NIH authentication
• Community side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
Data production: define data formatting; collect and curate phenotypes; track data production; additional quality metrics; coordinate public data release
Secondary/derived data management: documentation (README/Manifest), packaging, and release; notification and tracking
IT support: website; members area; face-to-face meeting support; dbGaP exchange area
Facilitate interactions with other AD investigators
Rapid response to unforeseen ADSP and NIH requests
Interactions
[Diagram of NIAGADS interactions. Labeled parties: NIA/NHGRI; Baylor/Broad/WashU LSACs; NCBI dbGaP/SRA; ADSP Data Flow and other Work Groups; Genomics Center; NCRAD; ADGC/CHARGE. Labeled activities: management of work group data/work flow; exchange area; data deposition and data sharing.]
ADGC: Alzheimer’s Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer’s Disease
NIAGADS: NIA Genetics of Alzheimer’s Disease Data Storage Site
ADSP specific datasets: 68 TB
21 reference datasets with 5,560 files in use by the ADSP
>20 reference datasets planned
1 TB = 1,024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes (binary convention). A 1 TB hard drive has the capacity to hold roughly a trillion bytes
dbGaP/NIAGADS release for the research community at large
11,555 BAM files released 122014
Pedigree and phenotype information
Sequencing Quality Control Metrics
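The unit chain quoted on this slide follows the binary convention, in which each step is a factor of 1,024. The short Python sketch below checks those figures; the function and constant names are illustrative and do not come from the slides.

```python
# Binary-convention size units: each step up is a factor of 1024.
UNITS = ["bytes", "KB", "MB", "GB", "TB"]


def from_terabytes(unit, terabytes=1):
    """Express a size given in (binary) terabytes in a smaller unit."""
    steps = UNITS.index("TB") - UNITS.index(unit)
    return terabytes * 1024 ** steps


for unit in ("GB", "MB", "KB", "bytes"):
    print(f"1 TB = {from_terabytes(unit):,} {unit}")

# Size of the ADSP-specific datasets quoted on the slide (68 TB), in bytes.
adsp_bytes = from_terabytes("bytes", 68)
print(f"68 TB = {adsp_bytes:,} bytes")
```

Under this convention 1 TB is about 1.1 trillion bytes, so the "trillion bytes" figure for a hard drive corresponds to the decimal definition (10^12 bytes) rather than the binary one.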
Size of the ADSP
Reference: “The real cost of sequencing: scaling computation to keep pace with data generation”, Muir et al., Genome Biology 2016, 17:53
ADSP Members Area (Dashboard)
Calendar Documents Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list 153 records
352 meeting minutes
>600 other documents
>100 consent files
10 analysis plans
85TB WGS 1075TB WES data (dbGaP/SRA)
42 TB Files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
SNP/Gene reports; Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer’s Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
GWAS results combined with
SNPs below
Gene annotations transformed to SNPs
Genomic features from identified SNPs
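The combined search described above (GWAS significance filter intersected with gene annotation) can be sketched in a few lines of Python. This is only an illustration of the idea, not the NIAGADS Genomics Database implementation: the SNP identifiers, gene names, p-values, and threshold are all made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GwasHit:
    snp: str        # hypothetical SNP identifier
    p_value: float  # association p-value from a GWAS summary table


def snps_below(hits, threshold):
    """SNPs whose GWAS p-value is below the chosen significance level."""
    return {h.snp for h in hits if h.p_value < threshold}


def snps_in_genes(snp_to_gene, genes):
    """Transform a gene selection into the set of SNPs annotated to those genes."""
    return {snp for snp, gene in snp_to_gene.items() if gene in genes}


def combined_search(hits, threshold, snp_to_gene, genes):
    """Intersect the two searches to find SNPs satisfying both criteria."""
    return snps_below(hits, threshold) & snps_in_genes(snp_to_gene, genes)


# Toy data, invented for illustration only.
hits = [GwasHit("rs0001", 3e-9), GwasHit("rs0002", 0.02), GwasHit("rs0003", 4e-8)]
snp_to_gene = {"rs0001": "GENE_A", "rs0002": "GENE_A", "rs0003": "GENE_B"}

result = combined_search(hits, 5e-8, snp_to_gene, {"GENE_A"})
print(sorted(result))  # ['rs0001']
```

Only rs0001 is returned: it is both below the threshold and annotated to the selected gene, whereas rs0002 fails the significance filter and rs0003 fails the gene filter.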
Find a genomic location of interest by viewing tracks on the genome browser
SNPs
genes
GWAS results
functional genomics data
View Genomic Information by the Genome Browser
search result
Data available at the NIAGADS Genomics Database
Gene models; SNP information
Pathway annotations: Gene Ontology, KEGG Pathway
Functional genomics data: ENCODE, FANTOM5, GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics
Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  • Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  • Big data management
  • IT infrastructure and web development
• Administrative support and collaboration
  • Familiarity with human subjects research regulations and NIH policy
  • Leadership and coordinating skills
  • Flexibility, close collaboration with program officers
• Domain expertise
  • Knowledge of the disease and genetic research
  • Familiarity with the community and cohorts
  • Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
What the Data Coordinating Center Should Know
Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements
Patients and families of those with AD
NIA ADRC program and centers
NACC
NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD
NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly
NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon, Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao
NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
GDC Overview: GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence data both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high level data
GDC Overview: Infrastructure
[Architecture diagram with three groups: GDC Users, GDC System Components, and GDC Interfaces. Labeled elements: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; 3rd Party Applications; Alignment & Processing Tools; Data Submission Tools; Data Security System; APIs; Digital ID System.]
GDC Overview: Resources
GDC Data Portal
GDC Data Center
GDC Data Transfer Tool
GDC Reports
GDC Data Model
GDC Data Submission Portal
GDC Application Programming Interface (API)
GDC Bioinformatics Pipeline
GDC Documentation and Support
GDC Organization and Collaborators
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc.
• Contracting organization supporting GDC management and execution
Other Government/External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH’s Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter’s NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality checks and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset’s “data use limitations” either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
GDC Data Submission: Data Submission Process
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | TSV, JSON for each of Sample, Portion, Analyte, Aliquot
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | TSV, JSON for each of Demographic, Diagnosis, Exposure, Family History, Treatment
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI’s Cancer Data Standards Repository (caDSR)
[Diagram: a Case links to Clinical Data, Biospecimen Data, and Experiment Data. Clinical Data comprises Demographics, Diagnosis, Family History (Optional), Exposure (Optional), and Treatment (Optional).]
Data Types and File Formats: Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
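As a reading aid for the QC table above, the following Python sketch reduces the per-module statuses to a single worst status per read group. The status values are copied from the slide; the summarizing rule (FAIL outranks WARNING, which outranks PASS) is an assumption made for illustration and is not described in the slides as GDC behavior.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Module statuses in read-group order, transcribed from the example table.
QC_TABLE = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}

SEVERITY = {"PASS": 0, "WARNING": 1, "FAIL": 2}


def worst_status_per_read_group(table, read_groups):
    """Return the most severe status seen for each read group."""
    summary = {}
    for column, read_group in enumerate(read_groups):
        statuses = [row[column] for row in table.values()]
        summary[read_group] = max(statuses, key=SEVERITY.__getitem__)
    return summary


qc_summary = worst_status_per_read_group(QC_TABLE, READ_GROUPS)
print(qc_summary)
```

For the example table this flags RG_1 and RG_2 as FAIL, RG_3 as WARNING, and RG_4 as PASS.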
Data Submission Tools GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
Validate data against GDC standard data types defined in the project data dictionary
Obtain information on the status of data submission and processing by project
Data Submission Tools GDC Data Transfer Tool
bull The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
Command line interface to specify desired transfer protocol
Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
Securely submit biospecimen, clinical, and experiment files using a token
Submit data to GDC by performing create, update, and retrieve actions on entities
Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
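To make the create-entities endpoint above concrete, here is a minimal Python sketch that builds (but does not send) such a POST request with the standard library. Several details are assumptions rather than content from the slides: the host is taken from the API URL on the references slide, the token is passed in an `X-Auth-Token` header, and the program name, project name, and entity payload are placeholders.

```python
import json
import urllib.request

API_ROOT = "https://gdc-api.nci.nih.gov"  # host as listed on the references slide


def build_create_request(program, project, entities, token):
    """Build a POST request for /v0/submission/<program>/<project>.

    The header name used for the token is an assumption for this sketch.
    """
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "X-Auth-Token": token},
    )


# Placeholder program, project, entity, and token values for illustration.
example_entities = [{"type": "case", "submitter_id": "CASE-001"}]
request = build_create_request("PROGRAM", "PROJECT", example_entities, "example-token")

print(request.get_method(), request.full_url)
# Sending would be: urllib.request.urlopen(request), which needs a valid token.
```

Keeping request construction separate from sending makes it easy to inspect the URL and payload before any controlled-access credentials are used.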
GDC Data Submission Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]
GDC Processing: GDC High Level Data Generation
• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Retrieval
• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
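A graph of this kind can be illustrated with a very small Python sketch. It is a toy model under stated assumptions, not the GDC schema: the node types, identifiers, and the `files_under` traversal are invented to show how unique identifiers and edges let a query walk from a project down to its data files.

```python
from collections import defaultdict


class DataGraph:
    """Toy directed graph: typed nodes keyed by unique id, parent-to-child edges."""

    def __init__(self):
        self.node_type = {}
        self.children = defaultdict(set)

    def add_node(self, node_id, node_type):
        self.node_type[node_id] = node_type

    def add_edge(self, parent_id, child_id):
        if parent_id not in self.node_type or child_id not in self.node_type:
            raise KeyError("both endpoints must be added as nodes first")
        self.children[parent_id].add(child_id)

    def files_under(self, node_id):
        """Collect ids of all 'file' nodes reachable from node_id."""
        found, seen, stack = set(), set(), [node_id]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            if self.node_type[current] == "file":
                found.add(current)
            stack.extend(self.children.get(current, ()))
        return found


# Hypothetical project -> case -> clinical/experiment -> file structure.
graph = DataGraph()
for node_id, node_type in [
    ("project-1", "project"), ("case-1", "case"), ("case-2", "case"),
    ("clinical-1", "clinical"), ("experiment-1", "experiment"),
    ("experiment-2", "experiment"), ("file-1", "file"), ("file-2", "file"),
]:
    graph.add_node(node_id, node_type)
for parent, child in [
    ("project-1", "case-1"), ("project-1", "case-2"),
    ("case-1", "clinical-1"), ("case-1", "experiment-1"),
    ("case-2", "experiment-2"),
    ("experiment-1", "file-1"), ("experiment-2", "file-2"),
]:
    graph.add_edge(parent, child)

print(sorted(graph.files_under("project-1")))  # ['file-1', 'file-2']
```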
GDC Data Retrieval GDC Data Portal
bull The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
Data browsing by project, file, case, or annotation
Visualization allowing users to perform fine-grained filtering of search results
Data search using advanced smart search technology
Data selection into a personalized cart
Data download from cart or a high-performance Data Transfer Tool
GDC Data Retrieval GDC Data Transfer Tool
bull The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
Command line interface to specify desired transfer protocol and multiple files for download
Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
Supports the secure transfer of controlled access data using an authentication key (token)
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
Securely retrieve biospecimen, clinical, and molecular files
Perform BAM slicing
Example request (API URL, endpoint, and URL/query parameters):
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Example response (portion removed for readability):
{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
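The example query on this slide can be reproduced with a short Python sketch that builds the request URL and parses a response of the shape shown. No network call is made here; the helper names are hypothetical, and the sample response is a two-item excerpt of the slide's output rather than live API data.

```python
import json
from urllib.parse import urlencode

API_ROOT = "https://gdc-api.nci.nih.gov"  # API host as shown on the slide


def projects_url(fields, pretty=True):
    """Build the /projects query URL with a comma-separated field list."""
    query = urlencode(
        {"fields": ",".join(fields), "pretty": str(pretty).lower()},
        safe=",",  # keep commas readable instead of percent-encoding them
    )
    return f"{API_ROOT}/projects?{query}"


def primary_site_by_project(response_text):
    """Map project_id to primary_site from a response with data.hits."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


url = projects_url(["project_id", "primary_site"])
print(url)

# Excerpt of the response shown on the slide, used here in place of a live call.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}
"""
sites = primary_site_by_project(SAMPLE_RESPONSE)
print(sites)
```

To run the query for real, the URL can be passed to `urllib.request.urlopen`; this endpoint returns open-access project metadata, so no token appears in the slide's example.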
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
Targets data consumers, providers, developers, and general users
Provides access to information about GDC and contributed cancer genomic data sets
Instructs users on the use of GDC data access and submission tools
Provides descriptions of GDC bioinformatics pipelines
Documents supported GDC data types and file formats
Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User’s Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site
  – https://gdc.nci.nih.gov
• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov
• GDC Data Portal
  – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data
  – Aggregation
  – Access
  – Sharing
  – Analysis
Whole genome sequence + Phenotype
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Data flow diagram. Labels: birth defect or childhood cancer cohorts; DNA; sequencing center; birth defect BAM/VCF; cancer BAM/VCF; dbGaP; NCI GDC; birth defect VCF; cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users.]
[Diagram: the GMKF Data Resource shown with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics.]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
ClinVar Variant View
Assertion Levels in ClinVar
Expert Panel
Single Submitter – Criteria Provided
Single Submitter – No Criteria Provided
Multi-Source Consistency
Practice Guideline
No stars
No Assertion Not applicable
ACMG CPIC
CFTR2 InSiGHT PharmGKB ENIGMA
Supporting a Curation Environment for both Crowd-Sourcing and Expert Consensus
Curated Variants
ClinVar
Variants
Gene and Variant Curation Interfaces
Case-level data store Machine-learning algorithms
Data resources
ClinGenKB
ClinGen Clinical WGs amp Expert Panels
Outside Expert Panels
Discrepancy Resolution
Primary Curators
Convene experts and implement methods for gene and variant curation
Resolve variant interpretation differences in ClinVar
Enable rule-guided variant interpretation and export (to user and ClinVar)
Conditions and phenotypes in ClinVar
Variant assertions are made on “conditions”
Phenotypic features are seen in patients (supporting observations)
Standardization of disease terminologies
Parkinson’s disease subtypes
yellow = Orphanet, brown = OMIM, blue = Disease Ontology, pink = Monarch, gray = MeSH
Diseases need to be hierarchically related
https://github.com/monarch-initiative/monarch-ontology Courtesy of Melissa Haendel
Human Phenotype Ontology: 117,348 annotations for ~7,000 mainly monogenic diseases
Used to define phenotypic elements of a disease or patient
Courtesy of Melissa Haendel
Centers for Mendelian Genomics Phenotyping Standards Survey
[Four bar charts of survey responses; bar heights are not recoverable from the extracted text. Questions and answer categories:]
• Do you use any standard terminologies or ontologies when collecting/tracking phenotypes? (Yes, Sometimes, No, Other)
• Which systems are used to track disease name? (OMIM, Disease Ontology, ORDO (Orphanet), MeSH, MedGen, Other (please specify))
• Do you use Human Phenotype Ontology terms to capture more detailed phenotypic features on subjects? (Yes, Sometimes, Rarely)
• Which tools are used to collect phenotype? (PhenoDB, Patient Archive, PhenoTips)
Joint Center for Mendelian Genomics
Steering Committee: Daniel MacArthur, Heidi Rehm, Joe Gleeson, Eric Pierce, Alan Beggs, Mark Daly, Christine Seidman, David Sweetser, Vamsi Mootha
Coordination Team Project manager Hayley Brooks
Clinical project manager Sam Baxter Monkol Lek
Methods
Analysis Software
Brain Heart
Muscle
Hearing
Retinal
Mito
Other
Clinical Analysis
Monkol Lek Elise Valkanas Tom Mullen
Ben Weisburd Harindra Arachchi
Sarah Calvo Laura Gauthier Laurent Francioli
MacArthur Lab’s seqr: Online platform for collaborative analysis
Platform allows collaborative analyses between central sequencing site and thousands of collaborators
Enter structured phenotype data
https://seqr.broadinstitute.org
Rare Disease Analysis Platform
Genomic Matchmaking
Patient 1 (Clinical Geneticist 1)
Genotypic Data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
Phenotypic Data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5
Patient 2 (Clinical Geneticist 2)
Genotypic Data: Gene D, Gene G, Gene H
Phenotypic Data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6
Genomic Matchmaker: Notification of Match
Courtesy of Joel Krier
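The matchmaking idea on this slide can be expressed as set intersections over candidate genes and phenotypic features. The Python sketch below uses the slide's own example patients; the overlap score (shared features divided by all features seen in either patient) is an illustrative choice, not the scoring used by any Matchmaker Exchange service.

```python
def find_match(patient_a, patient_b):
    """Return shared candidate genes, shared features, and a phenotype overlap score."""
    shared_genes = patient_a["genes"] & patient_b["genes"]
    shared_features = patient_a["features"] & patient_b["features"]
    all_features = patient_a["features"] | patient_b["features"]
    overlap = len(shared_features) / len(all_features) if all_features else 0.0
    return shared_genes, shared_features, overlap


# The two example patients from the slide.
patient_1 = {
    "genes": {"Gene A", "Gene B", "Gene C", "Gene D", "Gene E", "Gene F"},
    "features": {"Feature 1", "Feature 2", "Feature 3", "Feature 4", "Feature 5"},
}
patient_2 = {
    "genes": {"Gene D", "Gene G", "Gene H"},
    "features": {"Feature 1", "Feature 3", "Feature 4", "Feature 5", "Feature 6"},
}

genes, features, overlap = find_match(patient_1, patient_2)
print(sorted(genes))      # ['Gene D']
print(sorted(features))   # ['Feature 1', 'Feature 3', 'Feature 4', 'Feature 5']
print(round(overlap, 2))  # 0.67
```

A single shared candidate gene combined with a largely overlapping phenotype is the situation in which both clinical geneticists would be notified of a match.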
Matchmaker Exchange Collaboration and Support from GA4GH and IRDiRC
Needs span multiple GA4GH workgroups
bull Data Work Group (data format and interfaces)
bull Regulatory and Ethics (patient consent)
bull Security (patient privacy and user authentication)
Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.
Human Mutation Special Issue
The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
Innovative genomic collaboration using the GENESIS (GEM.app) platform
Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
Participant-led matchmaking
GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
Data sharing in the Undiagnosed Disease Network
The Genomic Birthday Paradox: How Much is Enough?
Quantifying and mitigating false-positive disease associations in rare disease matchmaking
Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue abnormalities caused by de novo variants in HNRNPK
Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies
The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery
www.matchmakerexchange.org
MME Use Cases
• Supported today
  – Match on Gene/Phenotype
    • 1 strong candidate gene (e.g., de novo variant in GUS)
    • 10 candidate genes, each with rare variant
  – Match on Phenotype Only
• Coming soon
  – Match case to non-human models
  – Match using phenotype and VCF
  – Patient-initiated matching
Connected and Soon to be Connected Matchmakers
[Network diagram centered on the Matchmaker Exchange. Labeled matchmakers: GENESIS, GeneMatcher, DECIPHER, RD-Connect, ClinGen GenomeConnect, Monarch, PhenomeCentral, Broad Institute RDAP. Other labels: patient-initiated matching; model organisms (mouse, zebrafish), Orphanet, ClinVar, OMIM; Live.]
Connecting Data in the Big Data World
Centralized Database: Everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA
Centralized Hub: APIs connect each database to a central hub. Example: Many commercial platforms
Federated Network: All databases connected through multiple APIs. Example: Matchmaker Exchange
Matchmaker Exchange Acknowledgements
S Balasubramanian, Mike Bamshad, Sergio Beltran Agullo, Jonathan Berg, Kym Boycott, Anthony Brookes, Michael Brudno, Han Brunner, Oriean Buske, Deanna Church, Raymond Dalgliesh, Andrew Devereau, Johan den Dunnen, Helen Firth, Paul Flicek, Jan Friedman, Richard Gibbs, Marta Girdea, Robert Green, Matt Hurles, Ada Hamosh, Ekta Khurana, Sebastian Kohler, Joel Krier, Owen Lancaster, Melissa Landrum, Paul Lasko, Rick Lifton, Daniel MacArthur, Alex MacKenzie, Danielle Metterville, Debbie Nickerson, Woong-Yan Park, Justin Paschall, Anthony Philippakis, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner
NIAGADS
The NIA Genetics of Alzheimer’s Disease Data Storage Site: A Partnership with dbGaP
April 26, 2016
Background: Initiation of the Alzheimer’s Disease Sequencing Project (ADSP)
• February 7, 2012: Presidential Initiative announced to fight Alzheimer’s Disease (AD)
• NIA and NHGRI to develop and execute a large scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 1212 between 3 LSSC and 2 AD genetics consortia
Long-Term Plan for the ADSP
• Discovery Phase 2014-2018
  • Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC’d data released 7/13/2015
  • Case-control
    • Whole exome sequencing (WES): 5,000 cases and 5,000 controls
    • 1,000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC’d data released 112015
• Follow-Up 2016-2021
  WGS in case-control sample sets emphasizing ethnically diverse cohorts
ADSP Discovery Phase Analysis
bull NIA funding for analysis of sequence data June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
bull ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing
Participation
• PARTICIPANTS
  - Two AD genetics consortia: Alzheimer's Disease Genetics Consortium (ADGC) and Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  - Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  - NIH staff: NHGRI and NIA
  - External consultants
• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups
Genomics Center
Organization of the ADSP
ADSP Infrastructure and Support
NIA Genetics of Alzheimer's Disease Data Storage Site (NIAGADS) [U24 AG041689]
National Cell Repository for Alzheimer's Disease (NCRAD) [U24 AG021886]
National Alzheimer's Coordinating Center (NACC) [U01 AG016976]
Alzheimer's Disease Centers
NIA Center for Genetics and Genomics of Alzheimer's Disease (CGAD) [U54 AG052427]
NIAGADS History and Function
Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
NIA's repository for genetic data on late-onset Alzheimer's disease
Datasets, Genomics Database, Analysis Resource
Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy
Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)
NIAGADS ADSP Data Coordinating Center
• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare and maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and the Office of Science Policy on "Trusted Partners"
Interface between dbGaP and the NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP: Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS, reviewed by the Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and an eRA Commons account for authentication; no password is stored at the ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH side: after ADSP DAC approval, the investigator uses NIH authentication
• Community side: fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
Data production: define data formatting; collect and curate phenotypes; track data production; additional quality metrics; coordinate public data release
Secondary/derived data management: documentation (README/Manifest), packaging, and release; notification and tracking
IT support: website; members area; face-to-face meeting support; dbGaP exchange area
Facilitate interactions with other AD investigators
Rapid response to unforeseen ADSP and NIH requests
Interactions
[Diagram labels: NIAGADS; NIA/NHGRI; Baylor/Broad/WashU LSACs; NCBI dbGaP/SRA; ADSP Data Flow and other Work Groups; Genomics Center; NCRAD; ADGC/CHARGE]
ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site
Management of Work Group Data/Work Flow
Exchange Area
Data Deposition and Data Sharing
ADSP-specific datasets: 68 TB
21 reference datasets with 5560 files in use by the ADSP
>20 reference datasets planned
1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes. A 1 TB hard drive has the capacity to hold roughly a trillion bytes.
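The unit chain on this slide uses binary multiples (steps of 1024), while "a trillion bytes" is the decimal figure. A minimal Python sketch of that arithmetic follows; the function and variable names are illustrative only and do not come from the slides.

```python
# Binary storage units: each step down multiplies by 1024.
BINARY_STEP = 1024


def binary_units(terabytes: int = 1) -> dict:
    """Expand a size in binary terabytes into GB, MB, KB and bytes."""
    gigabytes = terabytes * BINARY_STEP
    megabytes = gigabytes * BINARY_STEP
    kilobytes = megabytes * BINARY_STEP
    n_bytes = kilobytes * BINARY_STEP
    return {"GB": gigabytes, "MB": megabytes, "KB": kilobytes, "bytes": n_bytes}


units = binary_units(1)

# A decimal terabyte is exactly one trillion bytes.
DECIMAL_TB_BYTES = 10**12

# Ratio of the binary figure to the decimal figure (just under 1.1).
ratio = units["bytes"] / DECIMAL_TB_BYTES

print(units)
print(f"binary TB / decimal TB = {ratio:.4f}")
```

The two conventions differ by about ten percent at the terabyte scale, which is worth keeping in mind when comparing the dataset sizes quoted in this deck.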
dbGaP/NIAGADS release for the research community at large
11555 BAM files released 12/2014
Pedigree and phenotype information
Sequencing Quality Control Metrics
Size of the ADSP
Reference: "The real cost of sequencing: scaling computation to keep pace with data generation." Muir et al., Genome Biology 2016;17:53
ADSP Members Area (Dashboard)
Calendar Documents Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list 153 records
352 meeting minutes
>600 other documents
>100 consent files
10 analysis plans
85TB WGS, 1075TB WES data (dbGaP/SRA)
42 TB files (dbGaP Exchange Area)
niagads.org/genomics: web interface for query and analysis
SNP/Gene reports, Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer's Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
GWAS results combined with
SNPs below
Gene annotations transformed to SNPs
Genomic features from identified SNPs
Find a genomic location of interest by viewing tracks on the genome browser
SNPs
genes
GWAS results
functional genomics data
View Genomic Information by the Genome Browser
search result
Data available at the NIAGADS Genomics Database
Gene models SNP information
Pathway annotations Gene Ontology
KEGG Pathway
Functional genomics data ENCODE
FANTOM5
GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics
Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  - Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  - Big data management
  - IT infrastructure and web development
• Administrative support and collaboration
  - Familiarity with human subjects research regulations and NIH policy
  - Leadership and coordinating skills
  - Flexibility; close collaboration with program officers
• Domain expertise
  - Knowledge of the disease and genetic research
  - Familiarity with the community and cohorts
  - Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
What the Data Coordinating Center Should Know: Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements
Patients and families of those with AD
NIA ADRC program and centers
NACC, NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD
NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly
NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon, Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao
NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
GDC Overview GDC Data Submission Processing and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.
• Provide a cancer knowledge network that
  - Enables the identification of both high- and low-frequency cancer drivers
  - Assists in defining genomic determinants of response to therapy
  - Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  - Harmonization of raw sequence data, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  - Application of state-of-the-art methods of generating high-level data
GDC Overview: Infrastructure
[Architecture diagram with three groups: GDC Users, GDC System Components, GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; 3rd Party Applications; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; APIs; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System]
GDC Overview: Resources
[Diagram of GDC resources: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators]
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution
Other Government, External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
  - Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  - Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  - GDC data submitters must first apply for data submission authorization through dbGaP
  - Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  - Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  - Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  - Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  - Data Redaction
    • The GDC, in general, will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
GDC Data Submission Data Submission Process
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload and validate their data in a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Portion | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Analyte | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Aliquot | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Diagnosis | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Exposure | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Family History | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Treatment | TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | (not shown) | TSV, JSON
Data File | Slide Image | SVS | (not shown) | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  - GDC clinical data elements were reviewed with members of the research community
  - Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
[Diagram: a Case is associated with Clinical Data (Demographics, Diagnosis, Family History (optional), Exposure (optional), Treatment (optional)), Biospecimen Data, and Experiment Data]
Data Types and File Formats: Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
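As an illustration of how a per-read-group QC report like the one on this slide might be triaged, the following Python sketch lists the modules that did not pass for each read group. The statuses are copied from the slide's example table; the data structure and function name are assumptions for illustration and are not part of any GDC or FastQC tool.

```python
# Statuses transcribed from the example QC table on the slide.
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

QC_RESULTS = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged_modules(results, read_groups):
    """Return, per read group, the (module, status) pairs that are not PASS."""
    flagged = {read_group: [] for read_group in read_groups}
    for module, statuses in results.items():
        for read_group, status in zip(read_groups, statuses):
            if status != "PASS":
                flagged[read_group].append((module, status))
    return flagged


flagged = flagged_modules(QC_RESULTS, READ_GROUPS)
for read_group, issues in flagged.items():
    print(read_group, issues if issues else "all modules passed")
```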
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  - Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  - Validate data against GDC standard data types defined in the project data dictionary
  - Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  - Command line interface to specify the desired transfer protocol
  - Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  - Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  - Securely submit biospecimen, clinical, and experiment files using a token
  - Submit data to the GDC by performing create, update, and retrieve actions on entities
  - Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
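A rough Python sketch of how a client might assemble a request for the Create Entities endpoint named above. It only builds the request object and does not send anything. The host is the API address given in this deck's References; the `X-Auth-Token` header name, the program and project names, and the entity payload are all assumptions for illustration and are not taken from the slides.

```python
import json
import urllib.request

# API host as listed in this deck's References slide.
API_BASE = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumed header name for the authentication token.
            "X-Auth-Token": token,
        },
    )


# Hypothetical program, project, entity and token values.
request = build_create_request(
    "EXAMPLE_PROGRAM",
    "EXAMPLE_PROJECT",
    [{"type": "case", "submitter_id": "EXAMPLE-CASE-001"}],
    token="PLACEHOLDER_TOKEN",
)

print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the submission; that step is left out here because it requires a valid token and dbGaP authorization.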
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access, per the dbGaP authorization policies associated with the data set
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline diagrams]
GDC Processing: GDC High Level Data Generation
• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing GDC Variant Calling Pipelines
GDC Data Retrieval
• Once data is submitted into the GDC and the data submitter signs off on the data to request data release, the GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
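To make the graph idea concrete, here is a minimal Python sketch of nodes keyed by unique identifiers and linked by edges, with a traversal from a project down to its data files. This is a toy illustration under assumed names (`DataGraph`, `add_node`, `descendants`, and the node labels); it is not the GDC's actual schema or storage engine.

```python
import uuid
from collections import defaultdict


class DataGraph:
    """Toy graph: nodes are keyed by a unique ID, edges point from parent to child."""

    def __init__(self):
        self.nodes = {}
        self.edges = defaultdict(list)

    def add_node(self, label, **properties):
        node_id = str(uuid.uuid4())  # unique identifier for this node
        self.nodes[node_id] = {"label": label, **properties}
        return node_id

    def add_edge(self, source_id, target_id):
        self.edges[source_id].append(target_id)

    def descendants(self, start_id, label):
        """Depth-first search for all reachable nodes with the given label."""
        found, stack, seen = [], [start_id], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            if self.nodes[current]["label"] == label:
                found.append(current)
            stack.extend(self.edges[current])
        return found


# Hypothetical example: one project, one case, clinical data and a file.
graph = DataGraph()
project = graph.add_node("project", code="EXAMPLE")
case = graph.add_node("case", submitter_id="CASE-1")
clinical = graph.add_node("clinical", kind="diagnosis")
data_file = graph.add_node("file", file_name="example.bam")
graph.add_edge(project, case)
graph.add_edge(case, clinical)
graph.add_edge(case, data_file)

files = graph.descendants(project, "file")
print([graph.nodes[file_id]["file_name"] for file_id in files])
```

Because every node carries its own identifier, a file can be resolved from a project or case without relying on file names or paths.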
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  - Data browsing by project, file, case, or annotation
  - Visualization allowing users to perform fine-grained filtering of search results
  - Data search using advanced smart search technology
  - Data selection into a personalized cart
  - Data download from the cart or a high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  - Command line interface to specify the desired transfer protocol and multiple files for download
  - Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  - Supports the secure transfer of controlled access data using an authentication key (token)
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  - Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  - Securely retrieve biospecimen, clinical, and molecular files
  - Perform BAM slicing
Example request (the slide labels its parts as API URL, endpoint, URL parameters, and query parameters):
  https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Example response (portion removed for readability):
  {
    "data": {
      "hits": [
        {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
        {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
        {"project_id": "TCGA-LAML", "primary_site": "Blood"},
        {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
        {"project_id": "TCGA-UVM", "primary_site": "Eye"},
        {"project_id": "TARGET-AML", "primary_site": "Blood"},
        {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
        {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
        {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
        {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
      ]
    }
  }
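The request and response on this slide can be sketched in Python as follows. The example composes the projects query URL from its parts and groups a response by primary site. It parses a small hard-coded sample (three of the hits shown on the slide) rather than calling the service, and the helper names `projects_url` and `projects_by_site` are illustrative, not part of the GDC API.

```python
import json
from collections import defaultdict
from urllib.parse import urlencode

# API host as written on the slide.
API_BASE = "https://gdc-api.nci.nih.gov"


def projects_url(fields, pretty=True):
    """Compose the /projects query URL from a list of field names."""
    params = {
        "fields": ",".join(fields),
        "pretty": "true" if pretty else "false",
    }
    # Keep commas literal so the field list stays readable.
    return f"{API_BASE}/projects?{urlencode(params, safe=',')}"


# Sample response body using three of the hits shown on the slide.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}
"""


def projects_by_site(response_text):
    """Group project IDs by primary site from a /projects response."""
    grouped = defaultdict(list)
    for hit in json.loads(response_text)["data"]["hits"]:
        grouped[hit["primary_site"]].append(hit["project_id"])
    return dict(grouped)


url = projects_url(["project_id", "primary_site"])
by_site = projects_by_site(SAMPLE_RESPONSE)
print(url)
print(by_site)
```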
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  - Targets data consumers, providers, developers, and general users
  - Provides access to information about the GDC and contributed cancer genomic data sets
  - Instructs users on the use of GDC data access and submission tools
  - Provides descriptions of GDC bioinformatics pipelines
  - Documents supported GDC data types and file formats
  - Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site
  - https://gdc.nci.nih.gov
• GDC Documentation Site
  - https://gdc-docs.nci.nih.gov
• GDC Data Portal
  - https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  - https://gdc-portal.nci.nih.gov/submission
  - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  - https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  - https://gdc-api.nci.nih.gov
• GDC Support
  - GDC Assistance: support@nci-gdc.datacommons.io
  - GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype)
  - Aggregation
  - Access
  - Sharing
  - Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Data flow diagram labels: Birth defect or childhood cancer cohorts; DNA sequencing center; Birth defect BAM/VCF; Cancer BAM/VCF; dbGaP; NCI GDC; Birth defect VCF; Cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); Users]
GMKF Data
Resource
dbGaP
NCI GDC
TOPMed
The Monarch Initiative
ClinGen
Matchmaker Exchange
Center for Mendelian Genomics
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
Assertion Levels in ClinVar
Expert Panel
Single Submitter ndash Criteria Provided
Single Submitter ndash No Criteria Provided
Multi-Source Consistency
Practice Guideline
No stars
No Assertion Not applicable
ACMG CPIC
CFTR2 InSiGHT PharmGKB ENIGMA
Supporting a Curation Environment for both Crowd-Sourcing and Expert Consensus
Curated Variants
ClinVar
Variants
Gene and Variant Curation Interfaces
Case-level data store Machine-learning algorithms
Data resources
ClinGenKB
ClinGen Clinical WGs amp Expert Panels
Outside Expert Panels
Discrepancy Resolution
Primary Curators
• Convene experts and implement methods for gene and variant curation
• Resolve variant interpretation differences in ClinVar
• Enable rule-guided variant interpretation and export (to user and ClinVar)
Conditions and phenotypes in ClinVar
Variant assertions are made on "conditions"
Phenotypic features are seen in patients
(supporting observations)
Standardization of disease terminologies
Parkinson's disease subtypes
Color key: yellow = Orphanet, brown = OMIM, blue = Disease Ontology, pink = Monarch, gray = MeSH
Diseases need to be hierarchically related
https://github.com/monarch-initiative/monarch-ontology (courtesy of Melissa Haendel)
Human Phenotype Ontology
117,348 annotations for ~7,000 mainly monogenic diseases
Used to define phenotypic elements of a disease or patient
Courtesy of Melissa Haendel
Centers for Mendelian Genomics Phenotyping Standards Survey
[Four bar charts with a count axis from 0 to 4; bar heights are not recoverable from the extracted text]
• Do you use any standard terminologies or ontologies when collecting/tracking phenotypes? (Yes, Sometimes, No, Other)
• Which systems are used to track disease name? (OMIM, Disease Ontology, ORDO (Orphanet), MeSH, MedGen, Other (please specify))
• Do you use Human Phenotype Ontology terms to capture more detailed phenotypic features on subjects? (Yes, Sometimes, Rarely)
• Which tools are used to collect phenotype? (PhenoDB, Patient Archive, PhenoTips)
Joint Center for Mendelian Genomics
Daniel MacArthur, Heidi Rehm, Joe Gleeson, Eric Pierce, Alan Beggs, Mark Daly, Christine Seidman, David Sweetser, Vamsi Mootha
Steering Committee
Coordination Team Project manager Hayley Brooks
Clinical project manager Sam Baxter Monkol Lek
Methods
Analysis Software
Brain Heart
Muscle
Hearing
Retinal
Mito
Other
Clinical Analysis
Monkol Lek Elise Valkanas Tom Mullen
Ben Weisburd Harindra Arachchi
Sarah Calvo Laura Gauthier Laurent Francioli
MacArthur Lab's seqr: Online platform for collaborative analysis
Platform allows collaborative analyses between central sequencing site and thousands of collaborators
enter structured phenotype data
https://seqr.broadinstitute.org
Rare Disease Analysis Platform
Genomic Matchmaking
Patient 1 (Clinical Geneticist 1)
Genotypic Data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
Phenotypic Data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5
Patient 2 (Clinical Geneticist 2)
Genotypic Data: Gene D, Gene G, Gene H
Phenotypic Data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6
Genomic Matchmaker: Notification of Match
Courtesy of Joel Krier
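The two-patient example on this slide can be sketched in Python with set intersections: a match is suggested when the profiles share a candidate gene and overlap in phenotype. The Jaccard overlap score and all names below are assumptions chosen for illustration; the slide does not specify how a matchmaker scores similarity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatientProfile:
    label: str
    genes: frozenset
    features: frozenset


def compare(first, second):
    """Return shared candidate genes, shared features, and a Jaccard feature score."""
    shared_genes = first.genes & second.genes
    shared_features = first.features & second.features
    all_features = first.features | second.features
    score = len(shared_features) / len(all_features) if all_features else 0.0
    return {
        "shared_genes": shared_genes,
        "shared_features": shared_features,
        "feature_similarity": score,
        "is_match": bool(shared_genes),
    }


# Profiles as drawn on the slide.
patient_1 = PatientProfile(
    "Patient 1",
    genes=frozenset({"Gene A", "Gene B", "Gene C", "Gene D", "Gene E", "Gene F"}),
    features=frozenset({"Feature 1", "Feature 2", "Feature 3", "Feature 4", "Feature 5"}),
)
patient_2 = PatientProfile(
    "Patient 2",
    genes=frozenset({"Gene D", "Gene G", "Gene H"}),
    features=frozenset({"Feature 1", "Feature 3", "Feature 4", "Feature 5", "Feature 6"}),
)

result = compare(patient_1, patient_2)
print(sorted(result["shared_genes"]), sorted(result["shared_features"]))
print(f"feature similarity = {result['feature_similarity']:.2f}")
```

In this example the only shared candidate gene is Gene D, and four of the six distinct features overlap, which is the situation in which both clinical geneticists would be notified.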
Matchmaker Exchange Collaboration and Support from GA4GH and IRDiRC
Needs span multiple GA4GH workgroups
bull Data Work Group (data format and interfaces)
bull Regulatory and Ethics (patient consent)
bull Security (patient privacy and user authentication)
Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat 2015;36(10):915-21
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat 2015;36(10):922-7
Human Mutation Special Issue
• The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
• The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
• GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
• PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
• Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
• Innovative genomic collaboration using the GENESIS (GEM.app) platform
• Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
• Participant-led matchmaking
• GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
• Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
• Data sharing in the Undiagnosed Disease Network
• The Genomic Birthday Paradox: How Much is Enough?
• Quantifying and mitigating false-positive disease associations in rare disease matchmaking
• Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
• GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue caused by de novo variants in HNRNPK
• Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies
The Matchmaker Exchange Connecting Matchmakers to Accelerate Gene Discovery
www.matchmakerexchange.org
MME Use Cases
• Supported today
  - Match on Gene/Phenotype
    • 1 strong candidate gene (e.g., de novo variant in GUS)
    • 10 candidate genes, each with rare variant
  - Match on Phenotype Only
• Coming soon
  - Match case to non-human models
  - Match using phenotype and VCF
  - Patient-initiated matching
Connected and Soon to be Connected Matchmakers
[Figure: network diagram centered on the Matchmaker Exchange. Labeled nodes: GENESIS, GeneMatcher, DECIPHER, RD-Connect, ClinGen GenomeConnect, Monarch, PhenomeCentral, Broad Institute RDAP, patient-initiated matching, and model organism and disease resources (mouse, zebrafish, Orphanet, ClinVar, OMIM). Legend: Live.]
Connecting Data in the Big Data World
• Centralized Database: Everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA
• Centralized Hub: APIs connect each database to a central hub. Example: many commercial platforms
• Federated Network: All databases connected through multiple APIs. Example: Matchmaker Exchange
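One way to see the trade-off between the hub and federated models is to count API connections. The sketch below is only illustrative: it assumes a hub needs one connection per database and that a federated network connects every pair of databases directly, which is a simplification the slide does not spell out. The function names are hypothetical.

```python
def hub_links(n_databases: int) -> int:
    """Centralized hub: each database keeps one API connection to the hub."""
    return n_databases


def federated_links(n_databases: int) -> int:
    """Fully connected federation (assumed): one connection per pair of databases."""
    return n_databases * (n_databases - 1) // 2


if __name__ == "__main__":
    # Compare how the two models grow as more databases join.
    for n in (3, 7, 20):
        print(f"{n} databases: hub={hub_links(n)}, federated={federated_links(n)}")
```

Under these assumptions the federated model needs more connections as nodes are added, which is the cost paid for not depending on a single central database.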
Matchmaker Exchange Acknowledgements S Balasubramanian Robert Green Mike Bamshad Matt Hurles Sergio Beltran Agullo Ada Hamosh Jonathan Berg Ekta Khurana Kym Boycott Sebastian Kohler Anthony Brookes Joel Krier Michael Brudno Owen Lancaster Han Brunner Melissa Landrum Oriean Buske Paul Lasko Deanna Church Rick Lifton Raymond Dalgliesh Daniel MacArthur Andrew Devereau Alex MacKenzie Johan den Dunnen Danielle Metterville Helen Firth Debbie Nickerson Paul Flicek Woong-Yan Park Jan Friedman Justin Paschall Richard Gibbs Anthony Philippakis Marta Girdea Heidi Rehm
Peter Robinson Francois Schiettecatte Rolf Sijmons Nara Sobreira Jawahar Swaminathan Morris Swertz Rachel Thompson Stephan Zuchner
NIAGADS
The NIA Genetics of Alzheimer’s Disease Data Storage Site: A Partnership with dbGaP
April 26 2016
Background: Initiation of the Alzheimer’s Disease Sequencing Project (ADSP)
• February 7, 2012: Presidential Initiative announced to fight Alzheimer’s Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia
Long-Term Plan for the ADSP
• Discovery Phase, 2014-2018
  – Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC’d data released 7/13/2015
  – Case-control
    • Whole exome sequencing (WES): 5000 cases and 5000 controls
    • 1000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC’d data released 112015
• Follow-Up, 2016-2021
  – WGS in case-control sample sets, emphasizing ethnically diverse cohorts
ADSP Discovery Phase Analysis
bull NIA funding for analysis of sequence data June 2014
bull Major analyses projects Family-Based Analysis Case-Control Analysis Structural Variants Protective Variants Annotation of the genome
bull ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing
Participation
• PARTICIPANTS
  – Two AD Genetics Consortia: Alzheimer’s Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  – Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  – NIH staff: NHGRI and NIA
  – External Consultants
• INFRASTRUCTURE
  – Executive Committee, Analysis Coordination Committee, and 8 Work Groups
Genomics Center
Organization of the ADSP
ADSP Infrastructure and Support
NIA Genetics of Alzheimer’s Disease Data Storage Site, NIAGADS [U24 AG041689]
National Cell Repository for Alzheimer’s Disease, NCRAD [U24 AG021886]
National Alzheimer’s Coordinating Center, NACC [U01 AG016976]
Alzheimer’s Disease Centers
NIA Center for Genetics and Genomics of Alzheimer’s Disease, CGAD [U54 AG052427]
NIAGADS History and Function
Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
NIA’s repository for the genetics of late-onset Alzheimer’s disease data
Datasets, Genomics Database, Analysis Resource
Compliance with the NIA AD Genetics of Alzheimer’s Disease Sharing Policy and the NIH Genomics Data Sharing Policy
Data Coordinating Center for the Alzheimer’s Disease Sequencing Project (ADSP)
NIAGADS ADSP Data Coordinating Center
• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and Office of Science Policy on “Trusted Partners”
Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP
Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and eRA Commons accounts for authentication; no password is stored at the ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
Data production Define data formatting Collect and curate phenotypes Track data production Additional quality metrics Coordinate public data release
Secondaryderived data management Documentation (READMEManifest) packaging and release Notification and tracking
IT support Website Members area Face-to-face meeting support dbGaP exchange area
Facilitate interactions with other AD investigators
Rapid response to unforeseen ADSP and NIH requests
Interactions
NIAGADS
NIANHGRI
BaylorBroadWashU LSACs
NCBI dbGaPSRA
ADSP Data Flow and other Work Groups
Genomics Center
ADGC- Alzheimerrsquos Disease Genetics Consortium CHARGE- Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD ADGCCHARGE NCRAD- National Cell Repository for Alzheimerrsquos Disease NIAGADS- NIA Genetics of Alzheimerrsquos Disease Data Storage Site
Management of Work Group DataWork Flow
Exchange Area
Data Deposition and Data Sharing
ADSP-specific datasets: 68 TB
21 reference datasets with 5560 files in use by the ADSP
>20 reference datasets planned
1 TB = 1,024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes. By this binary definition, a 1 TB hard drive has the capacity to hold about 1.1 trillion bytes
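The unit chain on this slide can be checked with a few lines of arithmetic. This is a minimal sketch using the binary (1024-based) convention the slide uses; the helper name `tb_to_bytes` is made up for illustration, and the 68 figure is read here as terabytes.

```python
# Binary unit multipliers, matching the slide's 1 TB = 1024 GB convention.
KB = 1024
MB = KB ** 2
GB = KB ** 3
TB = KB ** 4


def tb_to_bytes(terabytes: float) -> int:
    """Convert terabytes (binary convention) to bytes."""
    return int(terabytes * TB)


if __name__ == "__main__":
    print(tb_to_bytes(1))             # 1099511627776
    print(tb_to_bytes(1) // KB)       # size of 1 TB expressed in KB
    print(f"{tb_to_bytes(68):,} bytes in the ADSP-specific datasets")
```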
dbGaPNIAGADS release for the research community at large
11555 BAM files released 122014
Pedigree and phenotype information
Sequencing Quality Control Metrics
Size of the ADSP
Reference: “The real cost of sequencing: scaling computation to keep pace with data generation.” Muir et al. Genome Biology 2016;17:53
ADSP Members Area (Dashboard)
Calendar Documents Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list 153 records
352 meeting minutes
gt600 other documents
gt 100 consent files
10 analysis plans
85TB WGS 1075TB WES data (dbGaPSRA)
42 TB Files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
SNPGene reports Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimerrsquos Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
GWAS results combined with
SNPs below
Gene annotations transformed to SNPs
Genomic features from identified SNPs
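The combined search described above (GWAS results below a significance level, intersected with gene annotations) can be sketched as follows. This is not the NIAGADS Genomics Database implementation; the record types, the `5e-8` threshold, and the SNP and gene names are all hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snp:
    rsid: str
    chrom: str
    pos: int
    p_value: float


@dataclass(frozen=True)
class Gene:
    symbol: str
    chrom: str
    start: int
    end: int


def significant_snps(snps, threshold=5e-8):
    """Step 1: keep SNPs whose GWAS p-value is at or below the threshold."""
    return [s for s in snps if s.p_value <= threshold]


def snps_in_genes(snps, genes):
    """Step 2: map each annotated gene to the SNPs that fall inside its span."""
    hits = {}
    for snp in snps:
        for gene in genes:
            if snp.chrom == gene.chrom and gene.start <= snp.pos <= gene.end:
                hits.setdefault(gene.symbol, []).append(snp.rsid)
    return hits


# Made-up records, only to exercise the two steps.
SNPS = [
    Snp("rs_example_1", "19", 1_050, 1e-12),
    Snp("rs_example_2", "19", 9_000, 3e-9),
    Snp("rs_example_3", "2", 500, 0.04),
]
GENES = [Gene("GENE_A", "19", 1_000, 2_000), Gene("GENE_B", "2", 100, 900)]

if __name__ == "__main__":
    print(snps_in_genes(significant_snps(SNPS), GENES))
```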
Find a genomic location of interest by viewing tracks on the genome browser
SNPs
genes
GWAS results
functional genomics data
View Genomic Information by the Genome Browser
search result
Data available at the NIAGADS Genomics Database
Gene models SNP information
Pathway annotations Gene Ontology
KEGG Pathway
Functional genomics data ENCODE
FANTOM5
GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics Ongoing effort to deposit annotations curated by ADSP AnnotationWG into Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  – Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  – Big data management
  – IT infrastructure and web development
• Administrative support and collaboration
  – Familiarity with human subjects research regulations and NIH policy
  – Leadership and coordinating skills
  – Flexibility, close collaboration with program officers
• Domain expertise
  – Knowledge of the disease and genetic research
  – Familiarity with the community and cohorts
  – Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements NIAGADS EAB Matthew Farrer
Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober
NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly bull Data flow WG
bull Annotation WG IGAPADGCCHARGEEADIGERAD
NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon
NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao
GDC Overview GDC Data Submission Processing and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high level data
GDC Overview: Infrastructure
[Figure: GDC infrastructure diagram organized into GDC Users, GDC System Components, and GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; 3rd Party Applications; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; APIs; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System.]
GDC Overview: Resources
• GDC Data Portal
• GDC Data Center
• GDC Data Transfer Tool
• GDC Reports
• GDC Data Model
• GDC Data Submission Portal
• GDC Application Programming Interface (API)
• GDC Bioinformatics Pipeline
• GDC Documentation and Support
• GDC Organization and Collaborators
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Submission Data Submitter Types
• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, both of which use the GDC API
GDC Data Submission Data Submission Policies
• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH’s Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter’s NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford an exclusive pre-processing period that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset’s “data use limitations” either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
GDC Data Submission Data Submission Process
GDC Data Submission dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | TSV, JSON for each
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | TSV, JSON for each
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type Data Subtype Format Data Dictionary Template
Data File
Experimental Metadata Pathology Report Run Metadata Slide Image
Submitted Unaligned Reads
SRA XML
PDF SRA XML SVS
FASTQ BAM
Experimental Metadata Pathology Report
Submitted Unaligned Reads
TSV JSON
TSV JSON TSV JSON TSV JSON
TSV JSON
Submitted Aligned Reads BAM
Submitted Aligned Reads TSV JSON
Data Bundle Read Group
Slide
Read Group
Slide
TSV JSON
TSV JSON
Data Types and File Formats Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
Data Types and File Formats Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI’s Cancer Data Standards Repository (caDSR)
[Figure: a Case links to Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)), Biospecimen Data, and Experiment Data.]
Data Types and File Formats Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FASTQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
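A QC table like the one above is easier to act on when reduced to the modules that did not pass for each read group. The sketch below does that with the statuses copied from the slide; it is an illustration only and does not reflect how the GDC itself summarizes FastQC output. The function name `flagged_modules` is hypothetical.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# FastQC module statuses from the slide, one status per read group in order.
QC_STATUS = {
    "Basic Statistics": "PASS PASS PASS PASS",
    "Per base sequence quality": "PASS PASS PASS PASS",
    "Per tile sequence quality": "PASS WARNING PASS PASS",
    "Per sequence quality scores": "FAIL PASS PASS PASS",
    "Per base sequence content": "PASS PASS WARNING PASS",
    "Per sequence GC content": "PASS PASS PASS PASS",
    "Per base N content": "PASS PASS PASS PASS",
    "Sequence Length Distribution": "PASS PASS PASS PASS",
    "Sequence Duplication Levels": "WARNING WARNING PASS PASS",
    "Overrepresented sequences": "PASS PASS PASS PASS",
    "Adapter Content": "PASS FAIL PASS PASS",
    "Kmer Content": "PASS PASS WARNING PASS",
}


def flagged_modules(qc_status, read_groups):
    """Return, per read group, the (module, status) pairs that are not PASS."""
    flagged = {rg: [] for rg in read_groups}
    for module, statuses in qc_status.items():
        for rg, status in zip(read_groups, statuses.split()):
            if status != "PASS":
                flagged[rg].append((module, status))
    return flagged


if __name__ == "__main__":
    for rg, issues in flagged_modules(QC_STATUS, READ_GROUPS).items():
        print(rg, issues if issues else "all modules passed")
```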
Data Submission Tools GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project
Data Submission Tools GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
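To make the create-entities call concrete, the sketch below builds (but does not send) a POST request for the endpoint on this slide, read as `/v0/submission/<program>/<project>`. Several details are assumptions rather than facts from the slide: the host is taken from the References slide, the token is passed in an `X-Auth-Token` header, and the program, project, and entity payload are placeholders.

```python
import json
import urllib.request

# Assumption: API host as written on the References slide of this deck.
API_ROOT = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build a POST request for the create-entities endpoint without sending it."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the authentication token
        },
    )


if __name__ == "__main__":
    # Placeholder entity; real field names come from the GDC data dictionary.
    example_entity = {"type": "case", "submitter_id": "EXAMPLE-CASE-1"}
    request = build_create_request(
        "EXAMPLE_PROGRAM", "EXAMPLE_PROJECT", [example_entity], "example-token"
    )
    print(request.get_method(), request.full_url)
    # Sending would be: urllib.request.urlopen(request)
```

Building the request separately from sending it keeps the example runnable without network access or credentials.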
GDC Data Submission Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Processing Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing DNA and RNA Sequence Harmonization Pipelines
DNA Sequence Pipeline RNA Sequence Pipeline
GDC Processing GDC High Level Data Generation
• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing GDC Variant Calling Pipelines
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Retrieval
• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
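A graph of nodes and edges keyed by unique identifiers can be sketched in a few lines. The class below is a toy stand-in, not the GDC data model: the node labels, identifiers, and the `linked` traversal are hypothetical and only show how a case can be followed through to its data file objects.

```python
from collections import defaultdict


class Graph:
    """Toy directed graph: nodes carry a label, edges link unique identifiers."""

    def __init__(self):
        self.nodes = {}
        self.edges = defaultdict(list)

    def add_node(self, node_id, label):
        self.nodes[node_id] = label

    def add_edge(self, src, dst):
        self.edges[src].append(dst)

    def linked(self, start, label):
        """Return ids of all nodes with the given label reachable from start."""
        found, stack, seen = [], [start], {start}
        while stack:
            for nxt in self.edges.get(stack.pop(), []):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
                    if self.nodes[nxt] == label:
                        found.append(nxt)
        return sorted(found)


# Hypothetical identifiers: project -> case -> clinical data / sample -> files.
g = Graph()
for node_id, label in [
    ("proj-1", "project"), ("case-1", "case"), ("clin-1", "clinical"),
    ("samp-1", "sample"), ("file-1", "file"), ("file-2", "file"),
]:
    g.add_node(node_id, label)
for src, dst in [
    ("proj-1", "case-1"), ("case-1", "clin-1"), ("case-1", "samp-1"),
    ("samp-1", "file-1"), ("samp-1", "file-2"),
]:
    g.add_edge(src, dst)

if __name__ == "__main__":
    print(g.linked("case-1", "file"))
```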
GDC Data Retrieval GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from cart or a high-performance Data Transfer Tool
GDC Data Retrieval GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)
Data Retrieval Tools GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing
Example request: https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
(the slide labels the parts of this URL: API URL, endpoint, URL parameters, query parameters)
Example response (portion removed for readability):
{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
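The request and response on this slide can be reproduced offline as a small sketch: one function assembles the query URL from a field list, another groups project IDs by primary site from a response shaped like the slide's example. The host is the one shown in this deck; the helper names and the shortened sample response are illustrative assumptions, and no network call is made.

```python
import json
from urllib.parse import urlencode

API_ROOT = "https://gdc-api.nci.nih.gov"  # host as shown on the slide


def projects_url(fields, pretty=True):
    """Assemble the /projects query URL for the requested fields."""
    query = urlencode(
        {"fields": ",".join(fields), "pretty": str(pretty).lower()},
        safe=",",  # keep the commas in the field list readable
    )
    return f"{API_ROOT}/projects?{query}"


def projects_by_site(response_text):
    """Group project IDs by primary site from a response with data.hits."""
    grouped = {}
    for hit in json.loads(response_text)["data"]["hits"]:
        grouped.setdefault(hit["primary_site"], []).append(hit["project_id"])
    return grouped


# Shortened copy of the slide's example response.
SAMPLE_RESPONSE = json.dumps({"data": {"hits": [
    {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
    {"project_id": "TCGA-LAML", "primary_site": "Blood"},
    {"project_id": "TARGET-AML", "primary_site": "Blood"},
]}})

if __name__ == "__main__":
    print(projects_url(["project_id", "primary_site"]))
    print(projects_by_site(SAMPLE_RESPONSE))
```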
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User’s Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site
  – https://gdc.nci.nih.gov
• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov
• GDC Data Portal
  – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype)
  – Aggregation
  – Access
  – Sharing
  – Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
Birth defect or childhood cancer
DNA Sequencing center
cohorts
Birth defect BAMVCF Cancer BAMVCF
dbGaP NCI GDC
Birth defect VCF Cancer VCF
Gabriella Miller Kids First Data
Resource Birth defect BAMVCF
Index of datasets Phenotype Variant summaries
Cancer BAMVCF
Users
GMKF Data
Resource
dbGaP
NCI GDC
TOPMed
The Monarch Initiative
ClinGen
Matchmaker Exchange
Center for Mendelian Genomics
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
Supporting a Curation Environment for both Crowd-Sourcing and Expert Consensus
Curated Variants
ClinVar
Variants
Gene and Variant Curation Interfaces
Case-level data store Machine-learning algorithms
Data resources
ClinGenKB
ClinGen Clinical WGs amp Expert Panels
Outside Expert Panels
Discrepancy Resolution
Primary Curators
• Convene experts and implement methods for gene and variant curation
• Resolve variant interpretation differences in ClinVar
• Enable rule-guided variant interpretation and export (to user and ClinVar)
Conditions and phenotypes in ClinVar Variant assertions are made on ldquoconditionsrdquo
Phenotypic features are seen in patients
(supporting observations)
Standardization of disease terminologies
Parkinsonrsquos disease subtypes
yellow = Orphanet brown= OMIM blue = Disease Ontology pink = Monarch Gray- MESH
Diseases need to be hierarchically related
https://github.com/monarch-initiative/monarch-ontology Courtesy of Melissa Haendel
Human Phenotype Ontology: 117,348 annotations for ~7,000 mainly monogenic diseases
Used to define phenotypic elements of a disease or patient
Courtesy of Melissa Haendel
Centers for Mendelian Genomics Phenotyping Standards Survey
[Figure: four bar charts of survey responses, counts on a 0 to 4 axis. Panels: (1) Do you use any standard terminologies or ontologies when collecting/tracking phenotypes? (Yes, Sometimes, No, Other); (2) Which systems are used to track disease name? (OMIM, Disease Ontology, ORDO (Orphanet), MeSH, MedGen, Other (please specify)); (3) Do you use Human Phenotype Ontology terms to capture more detailed phenotypic features on subjects? (Yes, Sometimes, Rarely); (4) Which tools are used to collect phenotype? (PhenoDB, Patient Archive, PhenoTips).]
Joint Center for Mendelian Genomics
Daniel MacArthur, Heidi Rehm, Joe Gleeson, Eric Pierce, Alan Beggs, Mark Daly, Christine Seidman, David Sweetser, Vamsi Mootha
Steering Committee
Coordination Team Project manager Hayley Brooks
Clinical project manager Sam Baxter Monkol Lek
Methods
Analysis Software
Brain Heart
Muscle
Hearing
Retinal
Mito
Other
Clinical Analysis
Monkol Lek Elise Valkanas Tom Mullen
Ben Weisburd Harindra Arachchi
Sarah Calvo Laura Gauthier Laurent Francioli
MacArthur Lab’s seqr: Online platform for collaborative analysis
Platform allows collaborative analyses between central sequencing site and thousands of collaborators
Enter structured phenotype data
https://seqr.broadinstitute.org
Rare Disease Analysis Platform
Genomic Matchmaking
Patient 1 (Clinical Geneticist 1)
  Genotypic data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
  Phenotypic data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5
Patient 2 (Clinical Geneticist 2)
  Genotypic data: Gene D, Gene G, Gene H
  Phenotypic data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6
Genomic Matchmaker: Notification of Match
Courtesy of Joel Krier
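The matchmaking diagram can be read as a set-overlap problem. The sketch below is only an illustration of that reading, not the algorithm used by any real matchmaker: the `Case` class, the `match` function, and the choice of a Jaccard score for phenotype overlap are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """A de-identified case: candidate genes plus observed phenotypic features."""
    case_id: str
    genes: frozenset
    features: frozenset


def match(a: Case, b: Case) -> dict:
    """Return the shared genes and a Jaccard overlap of phenotypic features."""
    shared_genes = a.genes & b.genes
    shared_features = a.features & b.features
    union = a.features | b.features
    jaccard = len(shared_features) / len(union) if union else 0.0
    return {
        "shared_genes": sorted(shared_genes),
        "shared_features": sorted(shared_features),
        "feature_jaccard": jaccard,
        # Assumption: a shared candidate gene is what triggers a notification.
        "is_match": bool(shared_genes),
    }


# The two patients from the slide (genes A-F vs. D, G, H; features 1-5 vs. 1, 3-6).
patient_1 = Case("Patient 1", frozenset("ABCDEF"), frozenset({1, 2, 3, 4, 5}))
patient_2 = Case("Patient 2", frozenset("DGH"), frozenset({1, 3, 4, 5, 6}))
result = match(patient_1, patient_2)
print(result)
```

With the slide's example data the only shared gene is Gene D, and four of the six distinct features overlap.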
Matchmaker Exchange Collaboration and Support from GA4GH and IRDiRC
Needs span multiple GA4GH workgroups
• Data Work Group (data format and interfaces)
• Regulatory and Ethics (patient consent)
• Security (patient privacy and user authentication)
Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.
Human Mutation Special Issue
• The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
• The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
• GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
• PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
• Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
• Innovative genomic collaboration using the GENESIS (GEM.app) platform
• Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
• Participant-led matchmaking
• GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
• Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
• Data sharing in the Undiagnosed Disease Network
• The Genomic Birthday Paradox: How Much is Enough?
• Quantifying and mitigating false-positive disease associations in rare disease matchmaking
• Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
• GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue abnormalities caused by de novo variants in HNRNPK
• Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies
The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery
www.matchmakerexchange.org
MME Use Cases
• Supported today
  – Match on Gene/Phenotype
    1 strong candidate gene (e.g., de novo variant in GUS)
    10 candidate genes, each with rare variant
  – Match on Phenotype Only
• Coming soon
  – Match case to non-human models
  – Match using phenotype and VCF
  – Patient-initiated matching
GENESIS
Connected and Soon to be Connected Matchmakers
Matchmaker Exchange
Gene Matcher
DECIPHER
RD Connect
ClinGen Genome Connect
Monarch
Phenome Central
Patient initiated matching
Model organisms (mouse zebrafish) Orphanet ClinVar OMIM)
Live
Broad Institute
RDAP
Connecting Data in the Big Data World
• Centralized Database: Everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA
• Centralized Hub: APIs connect each database to a central hub. Example: Many commercial platforms
• Federated Network: All databases connected through multiple APIs. Example: Matchmaker Exchange
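To make the federated model concrete, here is a minimal sketch in which a query is sent to every node and the records never leave the node that hosts them. The node names, the `make_node` and `federated_query` helpers, and the gene labels are hypothetical; real Matchmaker Exchange nodes communicate over an HTTP API rather than in-process function calls.

```python
from typing import Callable, Dict, List

# Each node is modeled as a function that takes a gene label and returns the
# IDs of its local cases that list that gene as a candidate.
Node = Callable[[str], List[str]]


def make_node(local_cases: Dict[str, set]) -> Node:
    """Wrap a node's private case table behind a query function."""
    def query(gene: str) -> List[str]:
        return sorted(cid for cid, genes in local_cases.items() if gene in genes)
    return query


def federated_query(nodes: Dict[str, Node], gene: str) -> Dict[str, List[str]]:
    """Send the same query to every node; only matching case IDs come back."""
    hits = {}
    for name, node in nodes.items():
        matches = node(gene)
        if matches:
            hits[name] = matches
    return hits


network = {
    "node_a": make_node({"a1": {"GENE_D", "GENE_E"}, "a2": {"GENE_X"}}),
    "node_b": make_node({"b1": {"GENE_D"}}),
    "node_c": make_node({"c1": {"GENE_Y"}}),
}
federated_hits = federated_query(network, "GENE_D")
print(federated_hits)
```

In a centralized design the three case tables would instead be copied into one database; here each table stays private to its node and only the hits are shared.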
Matchmaker Exchange Acknowledgements
S Balasubramanian, Mike Bamshad, Sergio Beltran Agullo, Jonathan Berg, Kym Boycott, Anthony Brookes, Michael Brudno, Han Brunner, Orion Buske, Deanna Church, Raymond Dalgleish, Andrew Devereau, Johan den Dunnen, Helen Firth, Paul Flicek, Jan Friedman, Richard Gibbs, Marta Girdea, Robert Green, Matt Hurles, Ada Hamosh, Ekta Khurana, Sebastian Kohler, Joel Krier, Owen Lancaster, Melissa Landrum, Paul Lasko, Rick Lifton, Daniel MacArthur, Alex MacKenzie, Danielle Metterville, Debbie Nickerson, Woong-Yang Park, Justin Paschall, Anthony Philippakis, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner
NIAGADS
The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP
April 26 2016
Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)
• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia
Long-Term Plan for the ADSP
• Discovery Phase 2014-2018
  – Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC'd data released 7/13/2015
  – Case-control
    • Whole exome sequencing (WES): 5000 cases and 5000 controls
    • 1000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC'd data released 11/2015
• Follow-Up 2016-2021: WGS in case-control sample sets emphasizing ethnically diverse cohorts
ADSP Discovery Phase Analysis
• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing
Participation
• PARTICIPANTS
  – Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  – Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  – NIH staff: NHGRI and NIA
  – External Consultants
• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups
Genomics Center
Organization of the ADSP
ADSP Infrastructure and Support
NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]
National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]
National Alzheimer's Coordinating Center, NACC [U01 AG016976]
Alzheimer's Disease Centers
NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]
NIAGADS History and Function
Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
NIA's repository for the genetics of late-onset Alzheimer's disease data
Datasets, Genomics Database, Analysis Resource
Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy
Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)
NIAGADS ADSP Data Coordinating Center
• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and Office of Science Policy on "Trusted Partners"
Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP: Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomic Data Sharing (GDS) policy
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as meta-data
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
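One way to picture the nightly update is as a local cache that is replaced wholesale each night and consulted before a file is listed for a user. This is a sketch under stated assumptions: the `PortalAccessCache` class, its method names, and the record layout are hypothetical and do not describe the actual ADSP Portal code.

```python
from dataclasses import dataclass, field


@dataclass
class PortalAccessCache:
    """Hypothetical local cache refreshed from a nightly dbGaP export."""
    approved_users: set = field(default_factory=set)
    file_index: dict = field(default_factory=dict)  # file name -> metadata

    def nightly_refresh(self, user_list, file_list):
        # Replace rather than merge, so that revoked approvals disappear.
        self.approved_users = set(user_list)
        self.file_index = {f["name"]: f for f in file_list}

    def can_list(self, username: str, file_name: str) -> bool:
        # No password is checked here: identity is assumed to have been
        # established upstream (iTrust / eRA Commons on the slide).
        return username in self.approved_users and file_name in self.file_index


cache = PortalAccessCache()
cache.nightly_refresh(
    ["investigator_a"],
    [{"name": "example_file_1.bam", "size_gb": 40}],
)
print(cache.can_list("investigator_a", "example_file_1.bam"))
print(cache.can_list("investigator_b", "example_file_1.bam"))
```

Replacing the lists on every refresh keeps the portal's view consistent with dbGaP, at the cost of being up to a day out of date.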
ADSP Collaboration with dbGaP
• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
• Data production: Define data formatting; Collect and curate phenotypes; Track data production; Additional quality metrics; Coordinate public data release
• Secondary/derived data management: Documentation (README/Manifest), packaging, and release; Notification and tracking
• IT support: Website; Members area; Face-to-face meeting support; dbGaP exchange area
• Facilitate interactions with other AD investigators
• Rapid response to unforeseen ADSP and NIH requests
Interactions
[Diagram of interactions among NIAGADS, NIA/NHGRI, Baylor/Broad/WashU LSACs, NCBI dbGaP/SRA, ADSP Data Flow and other Work Groups, Genomics Center, NCRAD, and ADGC/CHARGE]
ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site
Management of Work Group DataWork Flow
Exchange Area
Data Deposition and Data Sharing
ADSP-specific datasets: 68 TB
21 reference datasets with 5,560 files in use by the ADSP
>20 reference datasets planned
1 TB = 1,024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes (binary units). A 1 TB hard drive, as marketed in decimal units, has the capacity to hold a trillion bytes
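The unit expansion above mixes two conventions: the 1,024-based chain gives about 1.1 trillion bytes, while "a trillion bytes" is the decimal definition. A short worked check, with the constant and function names chosen for this example only:

```python
BINARY_TB = 1024 ** 4    # bytes in 1 TB when each step is a factor of 1,024
DECIMAL_TB = 10 ** 12    # bytes in 1 TB when each step is a factor of 1,000


def to_bytes(n_tb: int, binary: bool = True) -> int:
    """Convert terabytes to bytes under either convention."""
    return n_tb * (BINARY_TB if binary else DECIMAL_TB)


print(f"1 TB (binary):   {to_bytes(1):,} bytes")
print(f"1 TB (decimal):  {to_bytes(1, binary=False):,} bytes")
print(f"68 TB (binary):  {to_bytes(68):,} bytes")
print(f"binary/decimal ratio: {BINARY_TB / DECIMAL_TB:.4f}")
```

The two definitions differ by roughly 10 percent at the terabyte scale, which matters when sizing storage for tens of terabytes.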
dbGaP/NIAGADS release for the research community at large
11,555 BAM files released 12/2014
Pedigree and phenotype information
Sequencing Quality Control Metrics
Size of the ADSP
Reference: "The real cost of sequencing: scaling computation to keep pace with data generation." Muir et al. Genome Biology 2016;17:53
ADSP Members Area (Dashboard)
Calendar Documents Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list: 153 records
352 meeting minutes
>600 other documents
>100 consent files
10 analysis plans
85TB WGS 1075TB WES data (dbGaP/SRA)
42 TB files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
SNP/Gene reports, Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer's Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
GWAS results combined with
SNPs below
Gene annotations transformed to SNPs
Genomic features from identified SNPs
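The combined search described above (GWAS hits intersected with gene annotation) amounts to a threshold filter followed by an interval lookup. The sketch below illustrates that idea only; the record types, the `snps_in_genes` function, the default threshold, and all identifiers and coordinates are made up for the example and are not the NIAGADS Genomics Database schema.

```python
from typing import List, NamedTuple, Tuple


class Snp(NamedTuple):
    rsid: str
    chrom: str
    pos: int
    p_value: float


class Gene(NamedTuple):
    symbol: str
    chrom: str
    start: int
    end: int


def snps_in_genes(snps: List[Snp], genes: List[Gene],
                  p_threshold: float = 5e-8) -> List[Tuple[str, str]]:
    """Keep SNPs below the p-value threshold that fall inside a gene interval."""
    hits = []
    for snp in snps:
        if snp.p_value >= p_threshold:
            continue  # not significant at the chosen level
        for gene in genes:
            if snp.chrom == gene.chrom and gene.start <= snp.pos <= gene.end:
                hits.append((snp.rsid, gene.symbol))
    return hits


# Placeholder records, not real variants or genes.
example_snps = [
    Snp("snp_1", "chr1", 150, 1e-9),   # significant, inside GENE_A
    Snp("snp_2", "chr1", 160, 0.03),   # inside GENE_A but not significant
    Snp("snp_3", "chr2", 150, 1e-10),  # significant, but outside any gene
]
example_genes = [Gene("GENE_A", "chr1", 100, 200), Gene("GENE_B", "chr2", 500, 900)]
combined = snps_in_genes(example_snps, example_genes)
print(combined)
```

A production database would use an indexed interval query rather than the nested loop shown here.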
Find a genomic location of interest by viewing tracks on the genome browser
SNPs
genes
GWAS results
functional genomics data
View Genomic Information by the Genome Browser
search result
Data available at the NIAGADS Genomics Database
Gene models SNP information
Pathway annotations Gene Ontology
KEGG Pathway
Functional genomics data ENCODE
FANTOM5
GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics Ongoing effort to deposit annotations curated by ADSP AnnotationWG into Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  – Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  – Big data management
  – IT infrastructure and web development
• Administrative support and collaboration
  – Familiarity with human subjects research regulations and NIH policy
  – Leadership and coordinating skills
  – Flexibility, close collaboration with program officers
• Domain expertise
  – Knowledge of the disease and genetic research
  – Familiarity with the community and cohorts
  – Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements NIAGADS EAB Matthew Farrer
Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober
NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD
NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon
NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao
GDC Overview: GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high level data
GDC Overview: Infrastructure
[Architecture diagram with three groupings: GDC Users, GDC Interfaces, GDC System Components]
Labeled elements: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; 3rd Party Applications; eRA Commons & dbGaP; Data Submission Tools; Data Access Tools; APIs; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System
GDC Overview: Resources
• GDC Data Portal
• GDC Data Center
• GDC Data Transfer Tool
• GDC Reports
• GDC Data Model
• GDC Data Submission Portal
• GDC Application Programming Interface (API)
• GDC Bioinformatics Pipeline
• GDC Documentation and Support
• GDC Organization and Collaborators
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
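The release rule above can be expressed as a small date calculation. This is a sketch under one loudly stated assumption: the slide says "either six months after data submission or at the time of first publication" without saying which governs, and the code assumes whichever comes first. The function names are hypothetical.

```python
import calendar
from datetime import date
from typing import Optional


def add_months(d: date, months: int) -> date:
    """Add calendar months, clamping the day to the end of the target month."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def release_date(first_submission: date,
                 first_publication: Optional[date] = None) -> date:
    """Assumed rule: the earlier of submission + 6 months and first publication."""
    deadline = add_months(first_submission, 6)
    if first_publication is not None and first_publication < deadline:
        return first_publication
    return deadline


print(release_date(date(2016, 4, 1)))
print(release_date(date(2016, 4, 1), first_publication=date(2016, 6, 15)))
```

Clamping the day handles submissions made on the 29th to 31st, where the month six months later may be shorter.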
GDC Data Submission Data Submission Process
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample; Portion; Analyte; Aliquot | TSV, JSON for each
Clinical | Clinical Data | TSV, JSON | Demographic; Diagnosis; Exposure; Family History; Treatment | TSV, JSON for each
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format
Data File | Experimental Metadata | SRA XML
Data File | Pathology Report | PDF
Data File | Run Metadata | SRA XML
Data File | Slide Image | SVS
Data File | Submitted Unaligned Reads | FASTQ, BAM
Data File | Submitted Aligned Reads | BAM
Data Bundle | Read Group |
Data Bundle | Slide |
Data Dictionary entries listed: Experimental Metadata, Pathology Report, Submitted Unaligned Reads, Submitted Aligned Reads, Read Group, Slide
Templates: TSV, JSON
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
[Diagram: a Case links to Clinical Data, Biospecimen Data, and Experiment Data. Clinical Data comprises Demographics, Diagnosis, Family History (Optional), Exposure (Optional), and Treatment (Optional)]
Data Types and File Formats: Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
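A per-read-group QC table like this one is often reduced to a single worst status per read group when deciding what needs attention. The sketch below does that for the non-PASS entries in the slide's example table. The dictionary layout and the `worst_status` helper are assumptions for this example and are not part of FastQC or the GDC.

```python
SEVERITY = {"PASS": 0, "WARNING": 1, "FAIL": 2}

# Only the non-PASS modules from the example table; all others were PASS.
qc_report = {
    "RG_1": {"Per sequence quality scores": "FAIL",
             "Sequence Duplication Levels": "WARNING"},
    "RG_2": {"Per tile sequence quality": "WARNING",
             "Sequence Duplication Levels": "WARNING",
             "Adapter Content": "FAIL"},
    "RG_3": {"Per base sequence content": "WARNING",
             "Kmer Content": "WARNING"},
    "RG_4": {},
}


def worst_status(module_results: dict) -> str:
    """Return the most severe status; an empty dict means everything passed."""
    return max(module_results.values(), key=SEVERITY.__getitem__, default="PASS")


summary = {rg: worst_status(mods) for rg, mods in qc_report.items()}
failed_modules = {
    rg: sorted(name for name, status in mods.items() if status == "FAIL")
    for rg, mods in qc_report.items()
}
print(summary)
print(failed_modules)
```

In the example, RG_1 and RG_2 each have one failing module, RG_3 has warnings only, and RG_4 passes every module.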
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
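To show how the create-entities endpoint might be called, the sketch below builds (but does not send) a request with Python's standard library. Several parts are assumptions rather than facts from the slides: the `X-Auth-Token` header name, the JSON content type, the shape of the example entity, and the program and project names. The host is the API address listed in the references of this deck.

```python
import json
import urllib.request

API_ROOT = "https://gdc-api.nci.nih.gov"  # API host as listed in this deck


def build_create_request(program: str, project: str, entity: dict,
                         token: str) -> urllib.request.Request:
    """Build a POST to /v0/submission/<program>/<project> without sending it."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entity).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the token
        },
    )


# Hypothetical entity; the real required fields come from the GDC data dictionary.
example_entity = {"type": "case", "submitter_id": "EXAMPLE-CASE-001"}
request = build_create_request("EXAMPLE_PROGRAM", "EXAMPLE_PROJECT",
                               example_entity, token="<token from GDC>")
print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send it; that step is left out so the example runs without network access or credentials.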
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing DNA and RNA Sequence Harmonization Pipelines
DNA Sequence Pipeline RNA Sequence Pipeline
GDC Processing: GDC High Level Data Generation
• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing GDC Variant Calling Pipelines
GDC Data Retrieval
• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
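A graph with nodes and edges, keyed by unique identifiers, can be sketched in a few lines. This toy `Graph` class, its node labels, and the example identifiers are assumptions made for illustration and do not reproduce the actual GDC data model or its storage engine.

```python
from collections import defaultdict


class Graph:
    """Toy store of labeled nodes linked by parent -> child edges."""

    def __init__(self):
        self.nodes = {}                # node id -> {"label": ..., other properties}
        self.edges = defaultdict(set)  # parent id -> set of child ids

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, **props}

    def add_edge(self, parent_id, child_id):
        if parent_id not in self.nodes or child_id not in self.nodes:
            raise KeyError("both endpoints must exist before linking")
        self.edges[parent_id].add(child_id)

    def descendants(self, node_id, label=None):
        """Walk edges depth-first; optionally keep only nodes with a given label."""
        seen, stack, out = set(), [node_id], []
        while stack:
            current = stack.pop()
            for child in self.edges.get(current, ()):
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
                    if label is None or self.nodes[child]["label"] == label:
                        out.append(child)
        return sorted(out)


g = Graph()
g.add_node("project-1", "project")
g.add_node("case-1", "case")
g.add_node("clinical-1", "clinical")
g.add_node("sample-1", "sample")
g.add_node("file-1", "file", file_name="example.bam")
g.add_edge("project-1", "case-1")
g.add_edge("case-1", "clinical-1")
g.add_edge("case-1", "sample-1")
g.add_edge("sample-1", "file-1")
print(g.descendants("project-1", label="file"))
```

Because every file is reachable from its project through a chain of identifiers, a query such as "all files for this project" is a graph walk rather than a join across separate tables.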
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from cart or a high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, annotations and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing
Example request (API URL, endpoint, and query parameters):
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Example response (portion removed for readability):
{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
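The example request can be assembled and its response parsed with the standard library alone. The sketch below builds the URL shown on the slide and parses a shortened copy of the slide's response; the helper name `projects_url` is hypothetical, and no network call is made.

```python
import json
from urllib.parse import urlencode

API_ROOT = "https://gdc-api.nci.nih.gov"  # API host as listed in this deck


def projects_url(fields, pretty=True) -> str:
    """Build the /projects query URL for a list of field names."""
    query = urlencode(
        {"fields": ",".join(fields), "pretty": str(pretty).lower()},
        safe=",",  # keep the comma between field names unescaped
    )
    return f"{API_ROOT}/projects?{query}"


url = projects_url(["project_id", "primary_site"])
print(url)

# Two hits copied from the slide's example response.
sample_response = """
{"data": {"hits": [
    {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
    {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}
"""
hits = json.loads(sample_response)["data"]["hits"]
site_by_project = {hit["project_id"]: hit["primary_site"] for hit in hits}
print(site_by_project)
```

Fetching `url` with an HTTP client would return JSON of the same shape, with one object per project under `data.hits`.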
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site
  – https://gdc.nci.nih.gov
• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov
• GDC Data Portal
  – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype)
  – Aggregation
  – Access
  – Sharing
  – Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Data flow diagram; labels in reading order]
Birth defect or childhood cancer cohorts
DNA Sequencing center
Birth defect BAM/VCF; Cancer BAM/VCF
dbGaP; NCI GDC
Birth defect VCF; Cancer VCF
Gabriella Miller Kids First Data Resource: Birth defect BAM/VCF; index of datasets, phenotype, variant summaries; Cancer BAM/VCF
Users
GMKF Data
Resource
dbGaP
NCI GDC
TOPMed
The Monarch Initiative
ClinGen
Matchmaker Exchange
Center for Mendelian Genomics
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
Conditions and phenotypes in ClinVar
Variant assertions are made on "conditions"
Phenotypic features are seen in patients (supporting observations)
Standardization of disease terminologies
Parkinson's disease subtypes
Yellow = Orphanet, brown = OMIM, blue = Disease Ontology, pink = Monarch, gray = MeSH
Diseases need to be hierarchically related
https://github.com/monarch-initiative/monarch-ontology
Courtesy of Melissa Haendel
Human Phenotype Ontology
117,348 annotations for ~7,000 mainly monogenic diseases
Used to define phenotypic elements of a disease or patient
Courtesy of Melissa Haendel
Centers for Mendelian Genomics Phenotyping Standards Survey
[Figure: four bar charts of survey responses]
• Do you use any standard terminologies or ontologies when collecting/tracking phenotypes? (Yes, Sometimes, No, Other)
• Which systems are used to track disease name? (OMIM, Disease Ontology, ORDO (Orphanet), MeSH, MedGen, Other (please specify))
• Do you use Human Phenotype Ontology terms to capture more detailed phenotypic features on subjects? (Yes, Sometimes, Rarely)
• Which tools are used to collect phenotype? (PhenoDB, Patient Archive, PhenoTips)
Joint Center for Mendelian Genomics
Steering Committee: Daniel MacArthur, Heidi Rehm, Joe Gleeson, Eric Pierce, Alan Beggs, Mark Daly, Christine Seidman, David Sweetser, Vamsi Mootha
Coordination Team Project manager Hayley Brooks
Clinical project manager Sam Baxter Monkol Lek
Methods
Analysis Software
Brain Heart
Muscle
Hearing
Retinal
Mito
Other
Clinical Analysis
Monkol Lek Elise Valkanas Tom Mullen
Ben Weisburd Harindra Arachchi
Sarah Calvo Laura Gauthier Laurent Francioli
MacArthur Lab's seqr: Online platform for collaborative analysis
Platform allows collaborative analyses between central sequencing site and thousands of collaborators
enter structured phenotype data
seqr Online platform for collaborative analysis
https://seqr.broadinstitute.org
Rare Disease Analysis Platform seqr Online platform for collaborative analysis
Genomic Matchmaking
Patient 1 (Clinical Geneticist 1)
Genotypic data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
Phenotypic data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5
Patient 2 (Clinical Geneticist 2)
Genotypic data: Gene D, Gene G, Gene H
Phenotypic data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6
Genomic Matchmaker: notification of match
Courtesy of Joel Krier
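The matchmaking idea on this slide can be sketched in a few lines of Python. This is only an illustration: the `PatientProfile` class, the `match_score` function, the Jaccard overlap measure, and the rule that any shared candidate gene counts as a match are all assumptions made here, not the scoring used by any actual matchmaker service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatientProfile:
    """Hypothetical container for one submitted case."""
    label: str
    genes: frozenset
    features: frozenset


def match_score(a: PatientProfile, b: PatientProfile) -> dict:
    """Report shared candidate genes and a Jaccard overlap of phenotypic features."""
    shared_genes = a.genes & b.genes
    shared_features = a.features & b.features
    all_features = a.features | b.features
    overlap = len(shared_features) / len(all_features) if all_features else 0.0
    return {
        "shared_genes": sorted(shared_genes),
        "shared_features": sorted(shared_features),
        "phenotype_overlap": overlap,
        # Assumption: one shared candidate gene is enough to notify both clinicians.
        "is_match": bool(shared_genes),
    }


# The two profiles from the slide.
patient_1 = PatientProfile(
    label="Patient 1",
    genes=frozenset({"Gene A", "Gene B", "Gene C", "Gene D", "Gene E", "Gene F"}),
    features=frozenset({"Feature 1", "Feature 2", "Feature 3", "Feature 4", "Feature 5"}),
)
patient_2 = PatientProfile(
    label="Patient 2",
    genes=frozenset({"Gene D", "Gene G", "Gene H"}),
    features=frozenset({"Feature 1", "Feature 3", "Feature 4", "Feature 5", "Feature 6"}),
)

result = match_score(patient_1, patient_2)
print(result)
```

With the slide's data, the only shared candidate gene is Gene D, and four of the six distinct features are shared.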
Matchmaker Exchange Collaboration and Support from GA4GH and IRDiRC
Needs span multiple GA4GH workgroups
• Data Work Group (data format and interfaces)
• Regulatory and Ethics (patient consent)
• Security (patient privacy and user authentication)
Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.
Human Mutation Special Issue
• The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
• The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
• GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
• PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
• Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
• Innovative genomic collaboration using the GENESIS (GEM.app) platform
• Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
• Participant-led matchmaking
• GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
• Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
• Data sharing in the Undiagnosed Disease Network
• The Genomic Birthday Paradox: How Much is Enough?
• Quantifying and mitigating false-positive disease associations in rare disease matchmaking
• Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
• GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue caused by de novo variants in HNRNPK
• Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies
The Matchmaker Exchange Connecting Matchmakers to Accelerate Gene Discovery
www.matchmakerexchange.org
MME Use Cases
• Supported today
– Match on Gene/Phenotype: 1 strong candidate gene (e.g., de novo variant in GUS); 10 candidate genes, each with rare variant
– Match on Phenotype Only
• Coming soon
– Match case to non-human models
– Match using phenotype and VCF
– Patient-initiated matching
Connected and Soon to be Connected Matchmakers
[Figure: Matchmaker Exchange network. Labels: GENESIS; GeneMatcher; DECIPHER; RD-Connect; ClinGen GenomeConnect; Monarch; PhenomeCentral; Broad Institute RDAP; patient-initiated matching; model organisms (mouse, zebrafish), Orphanet, ClinVar, OMIM; "Live"]
Connecting Data in the Big Data World
Centralized Database: Everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA
Centralized Hub: APIs connect each database to a central hub. Example: Many commercial platforms
Federated Network: All databases connected through multiple APIs. Example: Matchmaker Exchange
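One practical difference between the hub and federated models is how many API connections must be maintained. The sketch below is a back-of-the-envelope count under a stated assumption: in the federated case every pair of databases is connected directly, which is the densest possible layout and not necessarily how any real network is wired. The function names and the database count of 7 are hypothetical.

```python
def hub_links(n_databases: int) -> int:
    """Centralized hub: each database keeps one connection to the hub."""
    return n_databases


def federated_links(n_databases: int) -> int:
    """Federated network, assuming every pair of databases is connected directly."""
    return n_databases * (n_databases - 1) // 2


n = 7  # hypothetical number of participating databases
print(f"hub: {hub_links(n)} links, federated: {federated_links(n)} links")
```

Under this assumption the federated layout needs more connections as the network grows, which is one reason a shared API specification matters in that model.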
Matchmaker Exchange Acknowledgements
S Balasubramanian, Robert Green, Mike Bamshad, Matt Hurles, Sergio Beltran Agullo, Ada Hamosh, Jonathan Berg, Ekta Khurana, Kym Boycott, Sebastian Kohler, Anthony Brookes, Joel Krier, Michael Brudno, Owen Lancaster, Han Brunner, Melissa Landrum, Orion Buske, Paul Lasko, Deanna Church, Rick Lifton, Raymond Dalgleish, Daniel MacArthur, Andrew Devereau, Alex MacKenzie, Johan den Dunnen, Danielle Metterville, Helen Firth, Debbie Nickerson, Paul Flicek, Woong-Yang Park, Jan Friedman, Justin Paschall, Richard Gibbs, Anthony Philippakis, Marta Girdea, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner
NIAGADS
The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP
April 26 2016
Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)
• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 1212 between 3 LSSC and 2 AD genetics consortia
Long-Term Plan for the ADSP
• Discovery Phase 2014-2018
– Family-based
  • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
  • Included Caribbean Hispanic families
  • Fully QC'd data released 7/13/2015
– Case-control
  • Whole exome sequencing (WES): 5000 cases and 5000 controls
  • 1000 additional cases from families multiply affected by AD
  • Included Caribbean Hispanics
  • Fully QC'd data released 112015
• Follow-Up 2016-2021
– WGS in case-control sample sets, emphasizing ethnically diverse cohorts
ADSP Discovery Phase Analysis
• NIA funding for analysis of sequence data, June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing
Participation
• PARTICIPANTS
– Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
– Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
– NIH staff: NHGRI and NIA
– External Consultants
• INFRASTRUCTURE
– Executive Committee, Analysis Coordination Committee, and 8 Work Groups
– Genomics Center
Organization of the ADSP
ADSP Infrastructure and Support
NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]
National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]
National Alzheimer's Coordinating Center, NACC [U01 AG016976]
Alzheimer's Disease Centers
NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]
NIAGADS History and Function
Planning initiated 2010 FOA released 2011 U24 AG041689 funded under PAR 11-175 April 2012
NIA's repository for the genetics of late-onset Alzheimer's disease data
Datasets, Genomics Database, Analysis Resource
Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy
Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)
NIAGADS ADSP Data Coordinating Center
• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and Office of Science Policy on "Trusted Partners"
Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP
Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
Data production: Define data formatting; Collect and curate phenotypes; Track data production; Additional quality metrics; Coordinate public data release
Secondary/derived data management: Documentation (README/Manifest), packaging, and release; Notification and tracking
IT support: Website; Members area; Face-to-face meeting support; dbGaP exchange area
Facilitate interactions with other AD investigators
Rapid response to unforeseen ADSP and NIH requests
Interactions
NIAGADS
NIANHGRI
BaylorBroadWashU LSACs
NCBI dbGaPSRA
ADSP Data Flow and other Work Groups
Genomics Center
ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site
Management of Work Group DataWork Flow
Exchange Area
Data Deposition and Data Sharing
ADSP specific datasets- 68 Tb
21 Reference datasets with 5560 files in use by the ADSP
>20 Reference datasets planned
1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes
A 1 TB hard drive has the capacity to hold a trillion bytes
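The unit chain on this slide follows the binary convention (each step is a factor of 1024), while the "trillion bytes" figure for a hard drive follows the decimal convention (10^12). A small sketch makes the gap explicit; the function name `binary_units` is an illustrative choice, not something from the slides.

```python
STEP = 1024  # factor between adjacent units in the binary convention


def binary_units(terabytes: int) -> dict:
    """Express a size given in binary terabytes in GB, MB, KB, and bytes."""
    return {
        "GB": terabytes * STEP,
        "MB": terabytes * STEP**2,
        "KB": terabytes * STEP**3,
        "bytes": terabytes * STEP**4,
    }


one_tb = binary_units(1)
decimal_tb_bytes = 10**12  # "a trillion bytes", the decimal convention
ratio = one_tb["bytes"] / decimal_tb_bytes

print(one_tb)
print(f"binary TB / decimal TB = {ratio:.4f}")
```

The two conventions differ by roughly 10% at the terabyte scale, which is worth keeping in mind when comparing storage figures from different sources.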
dbGaPNIAGADS release for the research community at large
11555 BAM files released 122014
Pedigree and phenotype information
Sequencing Quality Control Metrics
Size of the ADSP
Reference: "The real cost of sequencing: scaling computation to keep pace with data generation," Muir et al., Genome Biology 2016;17:53
ADSP Members Area (Dashboard)
Calendar Documents Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list 153 records
352 meeting minutes
>600 other documents
>100 consent files
10 analysis plans
85TB WGS 1075TB WES data (dbGaPSRA)
42 TB Files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
SNPGene reports Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer's Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
GWAS results combined with
SNPs below
Gene annotations transformed to SNPs
Genomic features from identified SNPs
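The combined search described here (GWAS hits below a significance level, intersected with SNPs derived from gene annotations) amounts to a set intersection. The sketch below illustrates that logic only; the SNP identifiers, gene names, p-values, and function names are all made up for the example, and the 5e-8 cutoff is the conventional genome-wide significance threshold rather than a value taken from the slides. It does not use the NIAGADS Genomics Database interface.

```python
# Hypothetical GWAS summary statistics (SNP identifier and p-value).
GWAS_RESULTS = [
    {"snp": "rs0001", "p_value": 3e-9},
    {"snp": "rs0002", "p_value": 2e-4},
    {"snp": "rs0003", "p_value": 1e-8},
    {"snp": "rs0004", "p_value": 4e-10},
]

# Hypothetical gene annotation, already transformed to SNPs per gene.
GENE_TO_SNPS = {
    "GENE_X": {"rs0001", "rs0002"},
    "GENE_Y": {"rs0003"},
    "GENE_Z": {"rs0005"},
}


def snps_below(results, threshold):
    """SNPs whose GWAS p-value is below the chosen significance level."""
    return {row["snp"] for row in results if row["p_value"] < threshold}


def snps_for_genes(gene_to_snps, genes):
    """Union of SNPs annotated to any of the genes of interest."""
    snps = set()
    for gene in genes:
        snps |= gene_to_snps.get(gene, set())
    return snps


def combine(results, gene_to_snps, genes, threshold=5e-8):
    """Intersect the GWAS search with the gene-annotation search."""
    return sorted(snps_below(results, threshold) & snps_for_genes(gene_to_snps, genes))


hits = combine(GWAS_RESULTS, GENE_TO_SNPS, ["GENE_X", "GENE_Y"])
print(hits)
```

Here `rs0004` is significant but falls outside the chosen genes, and `rs0002` is in a chosen gene but not significant, so neither appears in the combined result.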
Find a genomic location of interest by viewing tracks on the genome browser
SNPs
genes
GWAS results
functional genomics data
View Genomic Information by the Genome Browser
search result
Data available at the NIAGADS Genomics Database
Gene models SNP information
Pathway annotations Gene Ontology
KEGG Pathway
Functional genomics data ENCODE
FANTOM5
GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics Ongoing effort to deposit annotations curated by ADSP AnnotationWG into Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
– Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
– Big data management
– IT infrastructure and web development
• Administrative support and collaboration
– Familiarity with human subjects research regulations and NIH policy
– Leadership and coordinating skills
– Flexibility, close collaboration with program officers
• Domain expertise
– Knowledge of the disease and genetic research
– Familiarity with the community and cohorts
– Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
What the Data Coordinating Center Should Know
Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements NIAGADS EAB Matthew Farrer
Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober
NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly bull Data flow WG
bull Annotation WG IGAPADGCCHARGEEADIGERAD
NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon
NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao
GDC Overview GDC Data Submission Processing and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
– Enables the identification of both high- and low-frequency cancer drivers
– Assists in defining genomic determinants of response to therapy
– Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
– Harmonization of raw sequence, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
– Application of state-of-the-art methods of generating high level data
GDC Overview Infrastructure
[Figure: GDC infrastructure diagram in three groups: GDC Users, GDC System Components, GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; 3rd Party Applications; eRA Commons & dbGaP; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System; Data Access Tools; Data Submission Tools; APIs]
GDC Overview Resources
GDC Data Portal
GDC Data Center
GDC Data Transfer Tool
GDC Reports
GDC Data Model
GDC Data Submission Portal
GDC Application Programming Interface (API)
GDC Bioinformatics Pipeline
GDC Documentation and Support
GDC Organization and Collaborators
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
• Contracting organization supporting GDC management and execution
Other Government/External
• Includes eRA, dbGaP, NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
GDC Data Submission Data Submitter Types
• The GDC supports two types of data submitters
– Type 1: Large Organizations
  • Users associated with an institution or group with significant informatics resources
  • Large one-time submission or long-term ongoing data submission
  • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
– Type 2: Researchers or Individual Laboratories
  • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
  • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
  • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API
GDC Data Submission Data Submission Policies
• dbGaP Data Submission Policies
– GDC data submitters must first apply for data submission authorization through dbGaP
– Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
– Data Sharing Requirement
  • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
  • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
– Data Pre-processing Period
  • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
  • The pre-processing period may generally last up to six months from the date of first submission
– Data Submission Period and Release
  • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
– Data Redaction
  • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
GDC Data Submission Data Submission Process
GDC Data Submission dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Portion | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Analyte | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Aliquot | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Diagnosis | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Exposure | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Family History | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Treatment | TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
Data Types and File Formats Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
Data Types and File Formats Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
– GDC clinical data elements were reviewed with members of the research community
– Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
[Figure: a Case with its Clinical Data (Demographics, Diagnosis, Family History (optional), Exposure (optional), Treatment (optional)), Biospecimen Data, and Experiment Data]
Data Types and File Formats Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
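A QC report like the one above is easier to act on once it is reduced to a per-read-group list of problems. The sketch below does that for the example values on this slide. The data structure and the function `flag_read_groups` are illustrative assumptions; this is not how the GDC itself stores or processes FastQC output.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Module statuses copied from the slide, one status per read group.
QC_SUMMARY = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flag_read_groups(read_groups, summary):
    """Collect, per read group, the modules that reported FAIL or WARNING."""
    flags = {rg: {"FAIL": [], "WARNING": []} for rg in read_groups}
    for module, statuses in summary.items():
        for rg, status in zip(read_groups, statuses):
            if status in ("FAIL", "WARNING"):
                flags[rg][status].append(module)
    return flags


flags = flag_read_groups(READ_GROUPS, QC_SUMMARY)
failed = [rg for rg in READ_GROUPS if flags[rg]["FAIL"]]

for rg in READ_GROUPS:
    print(rg, flags[rg])
print("Read groups with at least one FAIL:", failed)
```

Whether a FAIL should block submission or only prompt review is a policy decision that the slide does not state, so the sketch only reports the flags.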
Data Submission Tools GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
Validate data against GDC standard data types defined in the project data dictionary
Obtain information on the status of data submission and processing by project
Data Submission Tools GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
Command line interface to specify desired transfer protocol
Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
Securely submit biospecimen, clinical, and experiment files using a token
Submit data to GDC by performing create, update, and retrieve actions on entities
Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
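To make the shape of a create-entities call concrete, the sketch below builds (but does not send) a POST request to the endpoint named on this slide, using only the Python standard library. Several details are assumptions for illustration: the API root `https://gdc-api.nci.nih.gov` is taken from the references slide, the `X-Auth-Token` header name is assumed for passing the token, and the program, project, token, and entity payload are hypothetical placeholders. Consult the GDC API documentation for the actual entity schema.

```python
import json
import urllib.request

API_ROOT = "https://gdc-api.nci.nih.gov"  # assumed from the references slide


def build_create_request(program, project, entities, token):
    """Build a POST request for /v0/submission/<program>/<project> without sending it."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the authentication token
        },
    )


# Hypothetical payload: a single case entity with a made-up submitter identifier.
example_entities = [{"type": "case", "submitter_id": "EXAMPLE-CASE-1"}]

request = build_create_request("PROGRAM", "PROJECT", example_entities, "EXAMPLE-TOKEN")
print(request.get_method(), request.full_url)
```

Sending the request would be a call to `urllib.request.urlopen(request)`, which is left out here because it needs a real token and an authorized project.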
GDC Data Submission Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
GDC Processing Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]
GDC Processing GDC High Level Data Generation
• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP Array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing GDC Variant Calling Pipelines
GDC Data Retrieval
• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
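The graph idea can be illustrated with a toy in-memory store: nodes carry a label and a unique identifier, and edges link each node to its parent so that a data file can be traced back to its case and project. Everything in this sketch (the `GraphStore` class, the node labels, the properties, the use of UUIDs) is an assumption made for illustration and is not the GDC's actual schema or storage engine.

```python
import uuid
from collections import defaultdict


class GraphStore:
    """Toy graph: nodes keyed by a unique identifier, edges pointing child -> parent."""

    def __init__(self):
        self.nodes = {}
        self.edges = defaultdict(set)

    def add_node(self, label, **props):
        node_id = str(uuid.uuid4())  # unique identifier for this node
        self.nodes[node_id] = {"label": label, **props}
        return node_id

    def add_edge(self, child_id, parent_id):
        if child_id not in self.nodes or parent_id not in self.nodes:
            raise KeyError("both nodes must exist before they can be linked")
        self.edges[child_id].add(parent_id)

    def ancestors(self, node_id):
        """Labels of all nodes reachable by following child -> parent edges."""
        seen = []
        stack = [node_id]
        while stack:
            current = stack.pop()
            for parent in self.edges.get(current, ()):
                if parent not in seen:
                    seen.append(parent)
                    stack.append(parent)
        return [self.nodes[n]["label"] for n in seen]


store = GraphStore()
project_id = store.add_node("project", name="EXAMPLE-PROJECT")  # hypothetical values
case_id = store.add_node("case", submitter_id="EXAMPLE-CASE-1")
diagnosis_id = store.add_node("diagnosis")
file_id = store.add_node("data_file", file_name="example.bam")

store.add_edge(case_id, project_id)
store.add_edge(diagnosis_id, case_id)
store.add_edge(file_id, case_id)

lineage = store.ancestors(file_id)
print(lineage)
```

Walking the edges from the file node recovers the case and the project it belongs to, which is the linkage the slide describes.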
GDC Data Retrieval GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
Data browsing by project, file, case, or annotation
Visualization allowing users to perform fine-grained filtering of search results
Data search using advanced smart search technology
Data selection into a personalized cart
Data download from cart or a high-performance Data Transfer Tool
GDC Data Retrieval GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
Command line interface to specify desired transfer protocol and multiple files for download
Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
Supports the secure transfer of controlled access data using an authentication key (token)
Data Retrieval Tools GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
Securely retrieve biospecimen, clinical, and molecular files
Perform BAM slicing

Example request (API URL, endpoint, URL parameters, query parameters):
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):
  {
    "data": {
      "hits": [
        {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
        {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
        {"project_id": "TCGA-LAML", "primary_site": "Blood"},
        {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
        {"project_id": "TCGA-UVM", "primary_site": "Eye"},
        {"project_id": "TARGET-AML", "primary_site": "Blood"},
        {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
        {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
        {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
        {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
      ]
    }
  }
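The request on this slide can be assembled from its parts (API root, endpoint, and query parameters), and the JSON response can be reduced to a simple lookup table. The sketch below uses only the standard library and works offline against a shortened copy of the slide's example response. The helper names are illustrative, and the API root is assumed to read `https://gdc-api.nci.nih.gov` as on the references slide.

```python
import json
from urllib.parse import urlencode

API_ROOT = "https://gdc-api.nci.nih.gov"  # assumed from the references slide


def build_query_url(endpoint, fields, pretty=True):
    """Compose the API root, an endpoint, and the fields/pretty query parameters."""
    params = {"fields": ",".join(fields), "pretty": str(pretty).lower()}
    # safe="," keeps the comma-separated field list readable in the URL.
    return f"{API_ROOT}/{endpoint}?{urlencode(params, safe=',')}"


def primary_sites(response_text):
    """Map project_id to primary_site from a /projects response body."""
    payload = json.loads(response_text)
    return {hit["project_id"]: hit["primary_site"] for hit in payload["data"]["hits"]}


url = build_query_url("projects", ["project_id", "primary_site"])

# Shortened copy of the example response shown on the slide.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"},
  {"project_id": "TARGET-NBL", "primary_site": "Nervous System"}
]}}
"""

sites = primary_sites(SAMPLE_RESPONSE)
print(url)
print(sites)
```

Against a live service, the response text would come from an HTTP GET on `url`; that call is omitted so the example runs without network access.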
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
Targets data consumers, providers, developers, and general users
Provides access to information about GDC and contributed cancer genomic data sets
Instructs users on the use of GDC data access and submission tools
Provides descriptions of GDC bioinformatics pipelines
Documents supported GDC data types and file formats
Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site
– https://gdc.nci.nih.gov
• GDC Documentation Site
– https://gdc-docs.nci.nih.gov
• GDC Data Portal
– https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
– https://gdc-portal.nci.nih.gov/submission
– https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
– https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
– https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
– https://gdc-api.nci.nih.gov
• GDC Support
– GDC Assistance: support@nci-gdc.datacommons.io
– GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype)
– Aggregation
– Access
– Sharing
– Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
Figure (data flow). Labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF (dbGaP); cancer BAM/VCF (NCI GDC); Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users
Figure (connected resources). Labels: GMKF Data Resource; dbGaP; NCI GDC; TOPMed; The Monarch Initiative; ClinGen; Matchmaker Exchange; Center for Mendelian Genomics
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist and, if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
Standardization of disease terminologies
Parkinson’s disease subtypes
Yellow = Orphanet, brown = OMIM, blue = Disease Ontology, pink = Monarch, gray = MeSH
Diseases need to be hierarchically related
https://github.com/monarch-initiative/monarch-ontology
Courtesy of Melissa Haendel
Human Phenotype Ontology: 117,348 annotations for ~7,000 mainly monogenic diseases
Used to define phenotypic elements of a disease or patient
Courtesy of Melissa Haendel
Centers for Mendelian Genomics Phenotyping Standards Survey
Figure (four bar charts, response counts from 0 to 4):
– Do you use any standard terminologies or ontologies when collecting/tracking phenotypes? (Yes, Sometimes, No, Other)
– Which systems are used to track disease name? (OMIM, Disease Ontology, ORDO (Orphanet), MeSH, MedGen, Other (please specify))
– Do you use Human Phenotype Ontology terms to capture more detailed phenotypic features on subjects? (Yes, Sometimes, Rarely)
– Which tools are used to collect phenotype? (PhenoDB, Patient Archive, PhenoTips)
Joint Center for Mendelian Genomics
Steering Committee: Daniel MacArthur, Heidi Rehm, Joe Gleeson, Eric Pierce, Alan Beggs, Mark Daly, Christine Seidman, David Sweetser, Vamsi Mootha
Coordination Team
Project manager: Hayley Brooks
Clinical project manager: Sam Baxter
Monkol Lek
Figure (organization chart). Group labels: Methods; Analysis Software; Brain; Heart; Muscle; Hearing; Retinal; Mito; Other; Clinical Analysis
Monkol Lek, Elise Valkanas, Tom Mullen
Ben Weisburd, Harindra Arachchi
Sarah Calvo, Laura Gauthier, Laurent Francioli
MacArthur Lab’s seqr: Online platform for collaborative analysis
• Platform allows collaborative analyses between a central sequencing site and thousands of collaborators
• Enter structured phenotype data
https://seqr.broadinstitute.org
Rare Disease Analysis Platform
Genomic Matchmaking
Patient 1 (Clinical Geneticist 1)
Genotypic data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
Phenotypic data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5
Patient 2 (Clinical Geneticist 2)
Genotypic data: Gene D, Gene G, Gene H
Phenotypic data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6
Genomic Matchmaker: notification of match
Courtesy of Joel Krier
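The matching idea on this slide can be sketched with simple set intersection. This is an illustration only: real matchmakers typically score phenotype similarity rather than require exact overlap, and the PatientProfile class, the find_match function, and the min_shared_features threshold are hypothetical names introduced here, not part of any Matchmaker Exchange software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatientProfile:
    """A case submitted to a matchmaker: candidate genes plus phenotypic features."""
    patient_id: str
    genes: frozenset
    features: frozenset


def find_match(a, b, min_shared_features=1):
    """Return the shared genes and features if both overlap, otherwise None."""
    shared_genes = a.genes & b.genes
    shared_features = a.features & b.features
    if shared_genes and len(shared_features) >= min_shared_features:
        return {"genes": sorted(shared_genes), "features": sorted(shared_features)}
    return None


# The two profiles drawn on the slide.
patient_1 = PatientProfile(
    "Patient 1",
    frozenset({"Gene A", "Gene B", "Gene C", "Gene D", "Gene E", "Gene F"}),
    frozenset({"Feature 1", "Feature 2", "Feature 3", "Feature 4", "Feature 5"}),
)
patient_2 = PatientProfile(
    "Patient 2",
    frozenset({"Gene D", "Gene G", "Gene H"}),
    frozenset({"Feature 1", "Feature 3", "Feature 4", "Feature 5", "Feature 6"}),
)

match = find_match(patient_1, patient_2)
print(match)
```

Here the only shared candidate is Gene D, supported by four shared features, which is the situation in which the two clinical geneticists would be notified.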
Matchmaker Exchange: Collaboration and Support from GA4GH and IRDiRC
Needs span multiple GA4GH workgroups
• Data Work Group (data format and interfaces)
• Regulatory and Ethics (patient consent)
• Security (patient privacy and user authentication)
Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.
Human Mutation Special Issue
The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
Innovative genomic collaboration using the GENESIS (GEM.app) platform
Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
Participant-led matchmaking
GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
Data sharing in the Undiagnosed Disease Network
The Genomic Birthday Paradox: How Much is Enough?
Quantifying and mitigating false-positive disease associations in rare disease matchmaking
Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue abnormalities caused by de novo variants in HNRNPK
Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies
The Matchmaker Exchange Connecting Matchmakers to Accelerate Gene Discovery
www.matchmakerexchange.org
MME Use Cases
• Supported today
– Match on gene/phenotype
  1 strong candidate gene (e.g., de novo variant in GUS)
  10 candidate genes, each with a rare variant
– Match on phenotype only
• Coming soon
– Match case to non-human models
– Match using phenotype and VCF
– Patient-initiated matching
Connected and Soon to be Connected Matchmakers
Figure (Matchmaker Exchange network, with live connections marked). Labels: GENESIS; GeneMatcher; DECIPHER; RD-Connect; ClinGen GenomeConnect (patient-initiated matching); Monarch (model organisms: mouse, zebrafish; Orphanet, ClinVar, OMIM); PhenomeCentral; Broad Institute RDAP
Connecting Data in the Big Data World
Centralized Database: everyone submits data to a single central database
Examples: ClinVar, dbGaP, EGA
Centralized Hub: APIs connect each database to a central hub
Example: many commercial platforms
Federated Network: all databases connected through multiple APIs
Example: Matchmaker Exchange
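One way to compare the hub and federated models is to count the API connections each needs as databases are added. The sketch below assumes one link per database in the hub model and one link per pair of databases in a fully connected federation. The slide does not state that every pair is directly linked, so treat the federated count as an upper bound. Both function names are illustrative.

```python
def hub_connections(n_databases):
    """Centralized hub: one API link from each database to the hub."""
    return n_databases


def federated_connections(n_databases):
    """Fully connected federation: one API link per pair of databases."""
    return n_databases * (n_databases - 1) // 2


for n in (3, 7, 20):
    print(f"{n} databases: hub={hub_connections(n)}, federated={federated_connections(n)}")
```

The pairwise count grows quadratically, which is one reason a federated network benefits from a single shared API specification rather than bespoke integrations.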
Matchmaker Exchange Acknowledgements
S Balasubramanian, Mike Bamshad, Sergio Beltran Agullo, Jonathan Berg, Kym Boycott, Anthony Brookes, Michael Brudno, Han Brunner, Orion Buske, Deanna Church, Raymond Dalgleish, Andrew Devereau, Johan den Dunnen, Helen Firth, Paul Flicek, Jan Friedman, Richard Gibbs, Marta Girdea, Robert Green, Matt Hurles, Ada Hamosh, Ekta Khurana, Sebastian Kohler, Joel Krier, Owen Lancaster, Melissa Landrum, Paul Lasko, Rick Lifton, Daniel MacArthur, Alex MacKenzie, Danielle Metterville, Debbie Nickerson, Woong-Yang Park, Justin Paschall, Anthony Philippakis, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner
NIAGADS
The NIA Genetics of Alzheimer’s Disease Data Storage Site: A Partnership with dbGaP
April 26, 2016
Background: Initiation of the Alzheimer’s Disease Sequencing Project (ADSP)
• February 7, 2012: Presidential Initiative announced to fight Alzheimer’s Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia
Long-Term Plan for the ADSP
• Discovery Phase 2014-2018
  • Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC’d data released 7/13/2015
  • Case-control
    • Whole exome sequencing (WES): 5000 cases and 5000 controls
    • 1000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC’d data released 112015
• Follow-Up 2016-2021: WGS in case-control sample sets, emphasizing ethnically diverse cohorts
ADSP Discovery Phase Analysis
• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• In February 2016, ADSP consultants recommended proceeding with WGS but not WES or targeted sequencing
Participation
• PARTICIPANTS
– Two AD Genetics Consortia: Alzheimer’s Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
– Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
– NIH staff: NHGRI and NIA
– External Consultants
• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups
Genomics Center
Organization of the ADSP
ADSP Infrastructure and Support
NIA Genetics of Alzheimer’s Disease Data Storage Site, NIAGADS [U24 AG041689]
National Cell Repository for Alzheimer’s Disease, NCRAD [U24 AG021886]
National Alzheimer’s Coordinating Center, NACC [U01 AG016976]
Alzheimer’s Disease Centers
NIA Center for Genetics and Genomics of Alzheimer’s Disease, CGAD [U54 AG052427]
NIAGADS History and Function
Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
NIA’s repository for genetics data on late-onset Alzheimer’s disease
Datasets, Genomics Database, Analysis Resource
Compliance with the NIA AD Genetics of Alzheimer’s Disease Sharing Policy and the NIH Genomics Data Sharing Policy
Data Coordinating Center for the Alzheimer’s Disease Sequencing Project (ADSP)
NIAGADS ADSP Data Coordinating Center
• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and Office of Science Policy on “Trusted Partners”
Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP
Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and an eRA Commons account for authentication; no password is stored at the ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
Data production
• Define data formatting
• Collect and curate phenotypes
• Track data production
• Additional quality metrics
• Coordinate public data release
Secondary/derived data management
• Documentation (README/Manifest), packaging, and release
• Notification and tracking
IT support
• Website
• Members area
• Face-to-face meeting support
• dbGaP exchange area
Facilitate interactions with other AD investigators
Rapid response to unforeseen ADSP and NIH requests
Interactions
Figure (NIAGADS interactions). Labels: NIAGADS; NIA/NHGRI; Baylor/Broad/WashU LSACs; NCBI dbGaP/SRA; ADSP Data Flow and other Work Groups; Genomics Center; NCRAD; ADGC/CHARGE
ADGC: Alzheimer’s Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer’s Disease
NIAGADS: NIA Genetics of Alzheimer’s Disease Data Storage Site
Management of Work Group Data/Work Flow
Exchange Area
Data Deposition and Data Sharing
ADSP specific datasets: 68 Tb
21 reference datasets with 5560 files in use by the ADSP
>20 reference datasets planned
1 TB = 1,024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes
A 1 TB hard drive has the capacity to hold a trillion bytes
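The unit conversions on this slide can be checked with a few lines of arithmetic. The slide's chain uses the binary convention (powers of 1,024), whereas "a trillion bytes" is the decimal convention (10^12), so the two figures differ by roughly 10 percent. The variable names are illustrative.

```python
BINARY_STEP = 1024  # each binary unit holds 1,024 of the next smaller unit

tb_in_gb = BINARY_STEP
tb_in_mb = BINARY_STEP ** 2
tb_in_kb = BINARY_STEP ** 3
tb_in_bytes = BINARY_STEP ** 4

decimal_tb_in_bytes = 10 ** 12  # "a trillion bytes", the decimal convention

print(f"1 TB = {tb_in_gb:,} GB = {tb_in_mb:,} MB = {tb_in_kb:,} KB = {tb_in_bytes:,} bytes")
print(f"binary / decimal = {tb_in_bytes / decimal_tb_in_bytes:.4f}")
```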
dbGaP/NIAGADS release for the research community at large
11555 BAM files released 122014
Pedigree and phenotype information
Sequencing Quality Control Metrics
Size of the ADSP
Reference: “The real cost of sequencing: scaling computation to keep pace with data generation.” Muir et al. Genome Biology 2016;17:53
ADSP Members Area (Dashboard)
Calendar Documents Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list 153 records
352 meeting minutes
>600 other documents
>100 consent files
10 analysis plans
85TB WGS, 1075TB WES data (dbGaP/SRA)
42 TB Files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
SNP/Gene reports, Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer’s Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
GWAS results combined with
SNPs below
Gene annotations transformed to SNPs
Genomic features from identified SNPs
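The combined search described on this slide (filter SNPs by GWAS significance, map gene annotations to SNPs, then intersect) can be sketched as follows. This is not the NIAGADS implementation. The SNP identifiers, gene names, p-values, and function names are all made up for illustration, and the 5e-8 threshold is the conventional genome-wide significance level, used here as an assumption.

```python
GENOME_WIDE_SIGNIFICANCE = 5e-8  # conventional threshold, assumed for illustration

# Hypothetical GWAS summary statistics.
gwas_results = [
    {"snp": "rs0001", "p_value": 3e-10},
    {"snp": "rs0002", "p_value": 2e-5},
    {"snp": "rs0003", "p_value": 1e-9},
]

# Hypothetical gene annotations, expressed as the SNPs that fall within each gene.
gene_annotations = {
    "GENE_X": {"rs0001", "rs0002"},
    "GENE_Y": {"rs0004"},
}


def significant_snps(results, threshold):
    """Step 1: keep SNPs whose p-value is at or below the threshold."""
    return {row["snp"] for row in results if row["p_value"] <= threshold}


def snps_for_genes(annotations, genes):
    """Step 2: transform a list of genes of interest into the SNPs they contain."""
    snps = set()
    for gene in genes:
        snps |= annotations.get(gene, set())
    return snps


# Step 3: combine the two searches by intersecting their SNP sets.
combined = significant_snps(gwas_results, GENOME_WIDE_SIGNIFICANCE) & snps_for_genes(
    gene_annotations, ["GENE_X"]
)
print(sorted(combined))
```

Converting both searches to SNP sets first is what lets otherwise unrelated result types be combined with a single set operation.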
Find a genomic location of interest by viewing tracks on the genome browser
SNPs
genes
GWAS results
functional genomics data
View Genomic Information by the Genome Browser
search result
Data available at the NIAGADS Genomics Database
Gene models, SNP information
Pathway annotations: Gene Ontology, KEGG Pathway
Functional genomics data: ENCODE, FANTOM5, GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics
Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  • Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  • Big data management
  • IT infrastructure and web development
• Administrative support and collaboration
  • Familiarity with human subjects research regulations and NIH policy
  • Leadership and coordinating skills
  • Flexibility, close collaboration with program officers
• Domain expertise
  • Knowledge of the disease and genetic research
  • Familiarity with the community and cohorts
  • Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements
Patients and families of those with AD
NIA ADRC program and centers
NACC
NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD
NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly
NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon, Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao
NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
GDC Overview: GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
– Enables the identification of both high- and low-frequency cancer drivers
– Assists in defining genomic determinants of response to therapy
– Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
– Harmonization of raw sequence data both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
– Application of state-of-the-art methods of generating high level data
GDC Overview: Infrastructure
Figure (GDC infrastructure). Column headings: GDC Users; GDC System Components; GDC Interfaces
Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; 3rd Party Applications; Alignment & Processing Tools; Data Submission Tools; Data Security System; APIs; Digital ID System
GDC Overview: Resources
GDC Data Portal
GDC Data Center
GDC Data Transfer Tool
GDC Reports
GDC Data Model
GDC Data Submission Portal
GDC Application Programming Interface (API)
GDC Bioinformatics Pipeline
GDC Documentation and Support
GDC Organization and Collaborators
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testing (UAT) Testers
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
– Type 1: Large Organizations
  • Users associated with an institution or group with significant informatics resources
  • Large one-time submission or long-term ongoing data submission
  • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
– Type 2: Researchers or Individual Laboratories
  • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
  • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
  • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, both of which use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
– GDC data submitters must first apply for data submission authorization through dbGaP
– Data submission through dbGaP requires institutional certification under NIH’s Genomic Data Sharing Policy
• GDC Data Sharing Policies
– Data Sharing Requirement
  • Data submitted to the GDC will be made available to the scientific community according to the data submitter’s NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
  • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
– Data Pre-processing Period
  • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning, quality control, and submission of revised data before release
  • The pre-processing period may generally last up to six months from the date of first submission
– Data Submission Period and Release
  • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset’s “data use limitations” either six months after data submission or at the time of first publication
– Data Redaction
  • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
GDC Data Submission: Data Submission Process
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter’s credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample: TSV, JSON; Portion: TSV, JSON; Analyte: TSV, JSON; Aliquot: TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic: TSV, JSON; Diagnosis: TSV, JSON; Exposure: TSV, JSON; Family History: TSV, JSON; Treatment: TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
– GDC clinical data elements were reviewed with members of the research community
– Clinical data elements are defined in the GDC dictionary and registered in the NCI’s Cancer Data Standards Repository (caDSR)
Figure (case structure). Case: Clinical Data (Demographics, Diagnosis, Family History (optional), Exposure (optional), Treatment (optional)); Biospecimen Data; Experiment Data
Data Types and File Formats: Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
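A per-read-group QC report like the one above is easier to act on once it is summarized by status. The sketch below transcribes the module statuses from the slide's table and lists, for each read group, the modules that failed or raised a warning. The data structure and the flagged_modules function are illustrative. This is not how the GDC itself processes FastQC output.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Module statuses transcribed from the slide, one status per read group.
QC_TABLE = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged_modules(table, read_groups, status):
    """Map each read group to the QC modules that reported the given status."""
    flagged = {read_group: [] for read_group in read_groups}
    for module, statuses in table.items():
        for read_group, module_status in zip(read_groups, statuses):
            if module_status == status:
                flagged[read_group].append(module)
    return flagged


failures = flagged_modules(QC_TABLE, READ_GROUPS, "FAIL")
warnings = flagged_modules(QC_TABLE, READ_GROUPS, "WARNING")
for read_group in READ_GROUPS:
    print(read_group, "FAIL:", failures[read_group], "WARNING:", warnings[read_group])
```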
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
– Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
– Validate data against GDC standard data types defined in the project data dictionary
– Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
– Command line interface to specify the desired transfer protocol
– Uses a manifest file generated from the GDC Data Submission Portal for multiple file uploads
– Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
– Securely submit biospecimen, clinical, and experiment files using a token
– Submit data to the GDC by performing create, update, and retrieve actions on entities
– Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
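A create-entities call against the endpoint above might be assembled as follows. This sketch only builds the request object and does not send it. The base URL is the 2016 address used elsewhere in these slides, the X-Auth-Token header name is an assumption about how the token is passed, and the entity payload (a single "case" with a submitter_id) is a hypothetical example rather than a validated GDC record. The function name build_create_request is illustrative.

```python
import json
from urllib.request import Request

# Address taken from the slides (assumption: it may have changed since 2016).
GDC_API_BASE = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST request that creates entities in a project."""
    url = f"{GDC_API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Auth-Token": token,  # assumed header name for the authentication token
    }
    return Request(url, data=body, headers=headers, method="POST")


# Hypothetical program, project, and entity values, for illustration only.
request = build_create_request(
    "PROGRAM",
    "PROJECT",
    [{"type": "case", "submitter_id": "example-case-001"}],
    token="<token>",
)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would actually submit it. That step is left out so the sketch has no side effects.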
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
Figure (pipeline diagrams): DNA Sequence Pipeline; RNA Sequence Pipeline
GDC Processing: GDC High Level Data Generation
• The GDC generates high level data including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
GDC Data Retrieval
• Once data is submitted into the GDC and the data submitter signs off on the data to request data release, the GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
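The graph idea can be sketched with a small in-memory structure in which every node has a unique identifier and edges link projects to cases, cases to clinical and sample records, and samples to data files. The node types, identifiers, and the DataModelGraph class are hypothetical illustrations and do not reproduce the actual GDC schema or storage layer.

```python
from collections import defaultdict


class DataModelGraph:
    """A toy graph store: nodes keyed by unique ID, directed parent-to-child edges."""

    def __init__(self):
        self.node_types = {}             # node ID -> node type
        self.children = defaultdict(set)  # parent ID -> set of child IDs

    def add_node(self, node_id, node_type):
        self.node_types[node_id] = node_type

    def add_edge(self, parent_id, child_id):
        # Refuse edges to unknown IDs so every link resolves to a stored node.
        for node_id in (parent_id, child_id):
            if node_id not in self.node_types:
                raise KeyError(f"unknown node: {node_id}")
        self.children[parent_id].add(child_id)

    def files_for(self, node_id):
        """Return the sorted IDs of all file nodes reachable from node_id."""
        found, stack, seen = [], [node_id], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            if self.node_types[current] == "file":
                found.append(current)
            stack.extend(self.children[current])
        return sorted(found)


graph = DataModelGraph()
for node_id, node_type in [
    ("P1", "project"), ("C1", "case"), ("C2", "case"),
    ("D1", "clinical"), ("S1", "sample"), ("F1", "file"), ("F2", "file"),
]:
    graph.add_node(node_id, node_type)
for parent, child in [
    ("P1", "C1"), ("P1", "C2"), ("C1", "D1"), ("C1", "S1"), ("S1", "F1"), ("C2", "F2"),
]:
    graph.add_edge(parent, child)

print(graph.files_for("P1"))
print(graph.files_for("C1"))
```

Because files are reached only by following edges from a project or case, a query for a case's files cannot return a file that was never linked to it.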
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
– Data browsing by project, file, case, or annotation
– Visualization allowing users to perform fine-grained filtering of search results
– Data search using advanced smart search technology
– Data selection into a personalized cart
– Data download from the cart or a high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
– Command line interface to specify the desired transfer protocol and multiple files for download
– Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
– Supports the secure transfer of controlled access data using an authentication key (token)
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
– Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
– Securely retrieve biospecimen, clinical, and molecular files
– Perform BAM slicing
Example request (API URL, endpoint, and URL/query parameters):
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Example response (portion removed for readability):
{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
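The request on this slide has three parts: the API URL, the endpoint, and the query parameters. The sketch below composes such a URL and reads a response of the shape shown, using only the Python standard library. The helper names `build_query_url` and `primary_sites` are illustrative, and the sample response is an abbreviated copy of the slide's output rather than a live result.

```python
import json
from urllib.parse import urlencode

API_ROOT = "https://gdc-api.nci.nih.gov"  # base URL as listed on the References slide


def build_query_url(endpoint, fields, pretty=True):
    """Compose <API URL>/<endpoint>?<query parameters>."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # safe="," keeps the comma-separated field list unescaped
    return f"{API_ROOT}/{endpoint}?{urlencode(params, safe=',')}"


def primary_sites(response_text):
    """Map project_id -> primary_site from a JSON response body."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


url = build_query_url("projects", ["project_id", "primary_site"])

# Abbreviated response in the shape shown on the slide (not fetched live).
sample = '{"data": {"hits": [{"project_id": "TCGA-SKCM", "primary_site": "Skin"}]}}'
sites = primary_sites(sample)
```

To issue the request, the resulting `url` could be passed to any HTTP client. It is kept offline here so the example does not depend on the service being reachable.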
GDC Web Site
• The GDC Web Site provides access to GDC resources that support and communicate the GDC mission
– Targets data consumers, providers, developers, and general users
– Provides access to information about the GDC and contributed cancer genomic data sets
– Instructs users on the use of GDC data access and submission tools
– Provides descriptions of GDC bioinformatics pipelines
– Documents supported GDC data types and file formats
– Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site
– https://gdc.nci.nih.gov
• GDC Documentation Site
– https://gdc-docs.nci.nih.gov
• GDC Data Portal
– https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
– https://gdc-portal.nci.nih.gov/submission
– https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
– https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
– https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
– https://gdc-api.nci.nih.gov
• GDC Support
– GDC Assistance: support@nci-gdc.datacommons.io
– GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype)
– Aggregation
– Access
– Sharing
– Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Figure: data flow among birth defect or childhood cancer cohorts, the DNA sequencing center, dbGaP (birth defect BAM/VCF), NCI GDC (cancer BAM/VCF), the Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries), and users]
[Figure: the GMKF Data Resource shown alongside dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist and, if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
Human Phenotype Ontology: 117,348 annotations for ~7,000 mainly monogenic diseases
Used to define phenotypic elements of a disease or patient
Courtesy of Melissa Haendel
Centers for Mendelian Genomics Phenotyping Standards Survey
[Figure: four bar charts of response counts (axis 0 to 4)]
– Do you use any standard terminologies or ontologies when collecting/tracking phenotypes? (Yes, Sometimes, No, Other)
– Which systems are used to track disease name? (OMIM, Disease ontology, ORDO (Orphanet), MeSH, MedGen, Other (please specify))
– Do you use Human Phenotype Ontology terms to capture more detailed phenotypic features on subjects? (Yes, Sometimes, Rarely)
– Which tools are used to collect phenotype? (PhenoDB, Patient Archive, PhenoTips)
Joint Center for Mendelian Genomics
Steering Committee: Daniel MacArthur, Heidi Rehm, Joe Gleeson, Eric Pierce, Alan Beggs, Mark Daly, Christine Seidman, David Sweetser, Vamsi Mootha
Coordination Team
– Project manager: Hayley Brooks
– Clinical project manager: Sam Baxter
[Figure: organization chart with Methods, Analysis Software, and Clinical Analysis groups; clinical areas Brain, Heart, Muscle, Hearing, Retinal, Mito, Other; names shown: Monkol Lek, Elise Valkanas, Tom Mullen, Ben Weisburd, Harindra Arachchi, Sarah Calvo, Laura Gauthier, Laurent Francioli]
MacArthur Lab's seqr: Online platform for collaborative analysis
Platform allows collaborative analyses between the central sequencing site and thousands of collaborators
Enter structured phenotype data
https://seqr.broadinstitute.org
Rare Disease Analysis Platform
Genomic Matchmaking
Patient 1 (Clinical Geneticist 1)
– Genotypic data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
– Phenotypic data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5
Patient 2 (Clinical Geneticist 2)
– Genotypic data: Gene D, Gene G, Gene H
– Phenotypic data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6
Genomic Matchmaker: notification of match
Courtesy of Joel Krier
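The two-patient example can be worked through with set operations. This is a minimal sketch, not the Matchmaker Exchange scoring method: the Jaccard overlap of features and the 0.5 threshold are arbitrary choices made for illustration, and `match_score` is a hypothetical helper.

```python
def match_score(patient_a, patient_b):
    """Return shared genes, shared features, and a Jaccard phenotype score."""
    shared_genes = patient_a["genes"] & patient_b["genes"]
    shared_features = patient_a["features"] & patient_b["features"]
    all_features = patient_a["features"] | patient_b["features"]
    jaccard = len(shared_features) / len(all_features) if all_features else 0.0
    return shared_genes, shared_features, jaccard


# Gene and feature labels follow the slide's two patients.
patient_1 = {
    "genes": {"A", "B", "C", "D", "E", "F"},
    "features": {1, 2, 3, 4, 5},
}
patient_2 = {
    "genes": {"D", "G", "H"},
    "features": {1, 3, 4, 5, 6},
}

genes, features, score = match_score(patient_1, patient_2)
# Illustrative rule: at least one shared candidate gene and half the features shared.
is_match = bool(genes) and score >= 0.5
```

For these inputs the only shared candidate is Gene D, and four of the six distinct features overlap, so under this rule both clinical geneticists would be notified.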
Matchmaker Exchange: Collaboration and Support from GA4GH and IRDiRC
Needs span multiple GA4GH workgroups
• Data Work Group (data format and interfaces)
• Regulatory and Ethics (patient consent)
• Security (patient privacy and user authentication)
Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.
Human Mutation Special Issue
The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
Innovative genomic collaboration using the GENESIS (GEM.app) platform
Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
Participant-led matchmaking
GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
Data sharing in the Undiagnosed Disease Network
The Genomic Birthday Paradox: How Much is Enough?
Quantifying and mitigating false-positive disease associations in rare disease matchmaking
Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue abnormalities caused by de novo variants in HNRNPK
Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies
The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery
www.matchmakerexchange.org
MME Use Cases
• Supported today
– Match on Gene/Phenotype
  – 1 strong candidate gene (e.g., de novo variant in GUS)
  – 10 candidate genes, each with rare variant
– Match on Phenotype Only
• Coming soon
– Match case to non-human models
– Match using phenotype and VCF
– Patient-initiated matching
Connected and Soon to be Connected Matchmakers
[Figure: Matchmaker Exchange network. Nodes shown: GENESIS, GeneMatcher, DECIPHER, RD-Connect, ClinGen GenomeConnect, Monarch, PhenomeCentral, Broad Institute RDAP. Annotations: patient-initiated matching; model organisms (mouse, zebrafish), Orphanet, ClinVar, OMIM. Legend: Live]
Connecting Data in the Big Data World
• Centralized Database: everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA
• Centralized Hub: APIs connect each database to a central hub. Example: many commercial platforms
• Federated Network: all databases connected through multiple APIs. Example: Matchmaker Exchange
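One way to compare the three models is to count the API links each needs as databases are added. The sketch below assumes that a federated network links every pair of databases directly. That is the densest case and may overstate what a real federation maintains. The function name `api_links` and the topology labels are illustrative.

```python
def api_links(n_databases, topology):
    """Number of API links needed to connect n databases under each model."""
    if topology == "centralized_database":
        return 0  # data is copied into one store; no inter-database APIs
    if topology == "centralized_hub":
        return n_databases  # one link from each database to the hub
    if topology == "federated_network":
        # assumption: every pair of databases is linked directly
        return n_databases * (n_databases - 1) // 2
    raise ValueError(f"unknown topology: {topology}")


counts = {
    topology: api_links(7, topology)
    for topology in ("centralized_database", "centralized_hub", "federated_network")
}
```

Under this assumption, pairwise links grow quadratically with the number of databases, which is one reason a shared API specification matters in a federated design.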
Matchmaker Exchange Acknowledgements
S Balasubramanian, Mike Bamshad, Sergio Beltran Agullo, Jonathan Berg, Kym Boycott, Anthony Brookes, Michael Brudno, Han Brunner, Orion Buske, Deanna Church, Raymond Dalgleish, Andrew Devereau, Johan den Dunnen, Helen Firth, Paul Flicek, Jan Friedman, Richard Gibbs, Marta Girdea, Robert Green, Matt Hurles, Ada Hamosh, Ekta Khurana, Sebastian Kohler, Joel Krier, Owen Lancaster, Melissa Landrum, Paul Lasko, Rick Lifton, Daniel MacArthur, Alex MacKenzie, Danielle Metterville, Debbie Nickerson, Woong-Yang Park, Justin Paschall, Anthony Philippakis, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner
NIAGADS
The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP
April 26, 2016
Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)
• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia
Long-Term Plan for the ADSP
• Discovery Phase 2014-2018
– Family-based
  • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
  • Included Caribbean Hispanic families
  • Fully QC'd data released 7/13/2015
– Case-control
  • Whole exome sequencing (WES): 5,000 cases and 5,000 controls
  • 1,000 additional cases from families multiply affected by AD
  • Included Caribbean Hispanics
  • Fully QC'd data released 11/2015
• Follow-Up 2016-2021: WGS in case-control sample sets, emphasizing ethnically diverse cohorts
ADSP Discovery Phase Analysis
• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing
Participation
• PARTICIPANTS
– Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
– Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
– NIH staff: NHGRI and NIA
– External Consultants
• INFRASTRUCTURE
– Executive Committee, Analysis Coordination Committee, and 8 Work Groups
Organization of the ADSP
[Figure: ADSP organization chart, including the Genomics Center]
ADSP Infrastructure and Support
NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]
National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]
National Alzheimer's Coordinating Center, NACC [U01 AG016976]
Alzheimer's Disease Centers
NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]
NIAGADS History and Function
Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
NIA's repository for the genetics of late-onset Alzheimer's disease data
Datasets, Genomics Database, Analysis Resource
Compliance with the NIA Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy
Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)
NIAGADS ADSP Data Coordinating Center
• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and Office of Science Policy on "Trusted Partners"
Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP: Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH side: after ADSP DAC approval, the investigator uses NIH authentication
• Community side: fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
Data production: define data formatting; collect and curate phenotypes; track data production; additional quality metrics; coordinate public data release
Secondary/derived data management: documentation (README/Manifest), packaging, and release; notification and tracking
IT support: website; members area; face-to-face meeting support; dbGaP exchange area
Facilitate interactions with other AD investigators
Rapid response to unforeseen ADSP and NIH requests
Interactions
[Figure: interactions among NIAGADS, NIA/NHGRI, Baylor/Broad/WashU LSACs, NCBI dbGaP/SRA, ADSP Data Flow and other Work Groups, the Genomics Center, NCRAD, and ADGC/CHARGE, with management of Work Group data/work flow through the Exchange Area]
ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site
Data Deposition and Data Sharing
ADSP specific datasets: 68 Tb
21 reference datasets with 5,560 files in use by the ADSP
>20 reference datasets planned
1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes. A 1 TB hard drive has the capacity to hold about a trillion bytes
dbGaP/NIAGADS release for the research community at large
11,555 BAM files released 12/2014
Pedigree and phenotype information
Sequencing Quality Control Metrics
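The unit conversions on this slide use powers of 1024 and can be checked directly. The short sketch below reproduces the figures and contrasts them with the decimal "trillion bytes" often used for hard drive labels. The variable names are illustrative.

```python
STEP = 1024  # binary convention used on the slide

tb_in_gb = STEP
tb_in_mb = STEP ** 2
tb_in_kb = STEP ** 3
tb_in_bytes = STEP ** 4

decimal_tb_in_bytes = 10 ** 12  # "a trillion bytes"

# How much larger the binary terabyte is than the decimal one
ratio = tb_in_bytes / decimal_tb_in_bytes
```

The two conventions differ by roughly ten percent at the terabyte scale, which is worth keeping in mind when comparing storage figures from different sources.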
Size of the ADSP
Reference: "The real cost of sequencing: scaling computation to keep pace with data generation", Muir et al., Genome Biology 2016, 17:53
ADSP Members Area (Dashboard)
Calendar, documents, conference call minutes
Reference dataset catalog
Information on funded cooperative agreements
Bulletin board for Work Groups
Notification of new datasets
ADSP by the numbers
Member list: 153 records
352 meeting minutes
>600 other documents
>100 consent files
10 analysis plans
85TB WGS, 1075TB WES data (dbGaP/SRA)
42 TB files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
SNP/Gene reports, Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer's Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
– GWAS results combined with SNPs below
– Gene annotations transformed to SNPs
– Genomic features from identified SNPs
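The combined search described here can be sketched as a filter followed by an interval lookup. The records below are made-up placeholders, not NIAGADS data, and the 5e-8 cutoff is the conventional genome-wide significance threshold rather than a value stated on the slide. `significant_snps` and `snps_in_genes` are hypothetical helper names.

```python
# Hypothetical SNP records: (snp_id, chromosome, position, p_value)
GWAS_HITS = [
    ("snp_1", "19", 1500, 1e-30),
    ("snp_2", "19", 1800, 0.2),
    ("snp_3", "11", 9000, 3e-9),
]
# Hypothetical gene models: (symbol, chromosome, start, end)
GENES = [
    ("GENE_A", "19", 1000, 2000),
    ("GENE_B", "11", 3000, 4000),
]


def significant_snps(hits, threshold=5e-8):
    """Keep SNPs whose GWAS p-value is below the significance level."""
    return [hit for hit in hits if hit[3] < threshold]


def snps_in_genes(snps, genes):
    """Pair each SNP with every gene whose interval contains it."""
    pairs = []
    for snp_id, chrom, pos, _p in snps:
        for symbol, gene_chrom, start, end in genes:
            if chrom == gene_chrom and start <= pos <= end:
                pairs.append((snp_id, symbol))
    return pairs


result = snps_in_genes(significant_snps(GWAS_HITS), GENES)
```

Here `snp_2` is dropped by the significance filter and `snp_3` falls outside any gene interval, leaving a single SNP-gene pair as the feature of interest.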
Find a genomic location of interest by viewing tracks on the genome browser
[Figure: genome browser tracks for SNPs, genes, GWAS results, and functional genomics data]
View Genomic Information by the Genome Browser
[Figure: search result]
Data available at the NIAGADS Genomics Database
Gene models, SNP information
Pathway annotations: Gene Ontology, KEGG Pathway
Functional genomics data: ENCODE, FANTOM5, GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics
Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
– Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
– Big data management
– IT infrastructure and web development
• Administrative support and collaboration
– Familiarity with human subjects research regulations and NIH policy
– Leadership and coordinating skills
– Flexibility, close collaboration with program officers
• Domain expertise
– Knowledge of the disease and genetic research
– Familiarity with the community and cohorts
– Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
What the Data Coordinating Center Should Know: Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss them
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements
Patients and families of those with AD
NIA ADRC program and centers
NACC, NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD
NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly
NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon
Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao
NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
GDC Overview: GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
– Enables the identification of both high- and low-frequency cancer drivers
– Assists in defining genomic determinants of response to therapy
– Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
– Harmonization of raw sequence data, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
– Application of state-of-the-art methods of generating high level data
GDC Overview: Infrastructure
[Figure: GDC infrastructure diagram in three groups (GDC Users, GDC System Components, GDC Interfaces). Labels shown: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; 3rd Party Applications; Alignment & Processing Tools; Data Submission Tools; Data Security System; APIs; Digital ID System]
GDC Overview: Resources
[Figure: GDC resources. GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators]
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
• Contracting organization supporting GDC management and execution
Other Government
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
External
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
– Type 1: Large Organizations
  • Users associated with an institution or group with significant informatics resources
  • Large one-time submission or long-term ongoing data submission
  • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
– Type 2: Researchers or Individual Laboratories
  • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
  • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
  • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
– GDC data submitters must first apply for data submission authorization through dbGaP
– Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
– Data Sharing Requirement
  • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
  • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
– Data Pre-processing Period
  • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
  • The pre-processing period may generally last up to six months from the date of first submission
– Data Submission Period and Release
  • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
– Data Redaction
  • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
GDC Data Submission: Data Submission Process
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
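Validation against a data dictionary can be pictured as checking each uploaded record for the fields the dictionary requires. The sketch below is a simplified illustration only: the `DICTIONARY` contents, the field names, and the `validate_tsv` helper are hypothetical and are not taken from the GDC Data Dictionary.

```python
import csv
import io

# Hypothetical dictionary: required fields per entity type
DICTIONARY = {
    "sample": {"submitter_id", "sample_type"},
}


def validate_tsv(entity_type, tsv_text):
    """Return (line_number, missing_fields) for each incomplete row."""
    required = DICTIONARY[entity_type]
    errors = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for line_no, row in enumerate(reader, start=2):  # header is line 1
        missing = sorted(name for name in required if not row.get(name))
        if missing:
            errors.append((line_no, missing))
    return errors


tsv = "submitter_id\tsample_type\nS1\tPrimary Tumor\nS2\t\n"
errors = validate_tsv("sample", tsv)
```

A real dictionary would also constrain value types and permitted values. This sketch covers only the presence of required fields.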
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample; Portion; Analyte; Aliquot | Sample: TSV, JSON; Portion: TSV, JSON; Analyte: TSV, JSON; Aliquot: TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic; Diagnosis; Exposure; Family History; Treatment | Demographic: TSV, JSON; Diagnosis: TSV, JSON; Exposure: TSV, JSON; Family History: TSV, JSON; Treatment: TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
– GDC clinical data elements were reviewed with members of the research community
– Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
[Figure: a Case linked to Clinical Data (Demographics, Diagnosis, Family History (optional), Exposure (optional), Treatment (optional)), Biospecimen Data, and Experiment Data]
Data Types and File Formats: Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FASTQC
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
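A QC table like this is easier to act on when reduced to the modules that did not pass for each read group. The sketch below transcribes the statuses from the table and lists the flagged modules. The helper `flagged_modules` is illustrative and is not part of FASTQC or any GDC tool.

```python
P, W, F = "PASS", "WARNING", "FAIL"
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Module results transcribed from the table, one status per read group.
QC_RESULTS = {
    "Basic Statistics": [P, P, P, P],
    "Per base sequence quality": [P, P, P, P],
    "Per tile sequence quality": [P, W, P, P],
    "Per sequence quality scores": [F, P, P, P],
    "Per base sequence content": [P, P, W, P],
    "Per sequence GC content": [P, P, P, P],
    "Per base N content": [P, P, P, P],
    "Sequence Length Distribution": [P, P, P, P],
    "Sequence Duplication Levels": [W, W, P, P],
    "Overrepresented sequences": [P, P, P, P],
    "Adapter Content": [P, F, P, P],
    "Kmer Content": [P, P, W, P],
}


def flagged_modules(results, read_groups):
    """Collect (module, status) pairs that are not PASS, per read group."""
    flagged = {rg: [] for rg in read_groups}
    for module, statuses in results.items():
        for rg, status in zip(read_groups, statuses):
            if status != P:
                flagged[rg].append((module, status))
    return flagged


flagged = flagged_modules(QC_RESULTS, READ_GROUPS)
failed_groups = [rg for rg, items in flagged.items()
                 if any(status == F for _, status in items)]
```

In this table only RG_4 is clean, while RG_1 and RG_2 each carry one FAIL that a submitter would likely want to review before release.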
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
– Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
– Validate data against GDC standard data types defined in the project data dictionary
– Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
– Command line interface to specify the desired transfer protocol
– Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
– Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
– Securely submit biospecimen, clinical, and experiment files using a token
– Submit data to the GDC by performing create, update, and retrieve actions on entities
– Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
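A create-entities call could be assembled as shown below. The endpoint path comes from the slide and the base URL from the References slide. The `X-Auth-Token` header name, the JSON payload fields, and the program and project values are assumptions made for illustration. The request is only constructed, not sent.

```python
import json
import urllib.request

API_ROOT = "https://gdc-api.nci.nih.gov"  # base URL as listed on the References slide


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the authentication token
        },
    )


# Hypothetical entity payload and placeholder token
request = build_create_request(
    "TCGA", "LUSC",
    [{"type": "case", "submitter_id": "case-001"}],
    "example-token",
)
```

Sending the request would be a matter of passing it to `urllib.request.urlopen`. Building it separately keeps the token handling and payload easy to inspect first.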
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per the dbGaP authorization policies associated with the data set
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Processing: Harmonization
bull GDC pipelines supporting the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
175
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]
GDC Processing: GDC High Level Data Generation
• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing GDC Variant Calling Pipelines
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Retrieval
• Once data is submitted to the GDC and the data submitter signs off on the data to request release, the GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects by means of unique identifiers
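A graph of nodes and edges keyed by unique identifiers can be sketched in a few lines. The node labels, relation names (`has_case`, `has_file`), and identifiers below are hypothetical and chosen only for illustration; the real GDC data model defines its own node types and links.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)  # node id -> properties
    edges: list = field(default_factory=list)  # (source id, relation, target id)

    def add_node(self, node_id: str, label: str, **props) -> None:
        self.nodes[node_id] = {"label": label, **props}

    def add_edge(self, src: str, relation: str, dst: str) -> None:
        # Edges may only connect identifiers that already exist as nodes.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be existing node ids")
        self.edges.append((src, relation, dst))

    def targets(self, src: str, relation: str) -> list:
        return [d for s, r, d in self.edges if s == src and r == relation]


def files_for_project(graph: Graph, project_id: str) -> list:
    """Follow project -> case -> file links and collect the file ids."""
    files = []
    for case_id in graph.targets(project_id, "has_case"):
        files.extend(graph.targets(case_id, "has_file"))
    return files


g = Graph()
g.add_node("TCGA-SKCM", "project")
g.add_node("case-0001", "case")
g.add_node("file-aaaa", "file", data_format="BAM")
g.add_edge("TCGA-SKCM", "has_case", "case-0001")
g.add_edge("case-0001", "has_file", "file-aaaa")
print(files_for_project(g, "TCGA-SKCM"))
```

Refusing edges to unknown identifiers is one simple way to keep metadata and file objects correctly linked.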
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  - Data browsing by project, file, case, or annotation
  - Visualization allowing users to perform fine-grained filtering of search results
  - Data search using advanced smart search technology
  - Data selection into a personalized cart
  - Data download from cart or a high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  - Command line interface to specify desired transfer protocol and multiple files for download
  - Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  - Supports the secure transfer of controlled access data using an authentication key (token)
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  - Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  - Securely retrieve biospecimen, clinical, and molecular files
  - Perform BAM slicing
Example request (API URL, endpoint, URL parameters, query parameters):
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Example response (portion removed for readability):
{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
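The projects query on this slide can be reproduced programmatically. The sketch below only builds the URL and parses a response of the shape shown on the slide; it does not contact the server. The helper names are hypothetical, and the sample response is a two-entry excerpt of the slide's output.

```python
import json
from urllib.parse import urlencode

API_ROOT = "https://gdc-api.nci.nih.gov"  # API root from the references slide


def build_projects_url(fields: list, pretty: bool = True) -> str:
    """Compose the /projects query; commas are kept literal in the field list."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    return f"{API_ROOT}/projects?{urlencode(params, safe=',')}"


def sites_by_project(response_text: str) -> dict:
    """Map project_id -> primary_site from a response shaped like the slide's."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


# Two-entry excerpt of the response shown on the slide.
SAMPLE_RESPONSE = json.dumps({
    "data": {"hits": [
        {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
        {"project_id": "TARGET-AML", "primary_site": "Blood"},
    ]}
})

url = build_projects_url(["project_id", "primary_site"])
print(url)
print(sites_by_project(SAMPLE_RESPONSE))
```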
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  - Targets data consumers, providers, developers, and general users
  - Provides access to information about GDC and contributed cancer genomic data sets
  - Instructs users on the use of GDC data access and submission tools
  - Provides descriptions of GDC bioinformatics pipelines
  - Documents supported GDC data types and file formats
  - Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site - https://gdc.nci.nih.gov
• GDC Documentation Site - https://gdc-docs.nci.nih.gov
• GDC Data Portal - https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  - https://gdc-portal.nci.nih.gov/submission
  - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool - https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  - https://gdc-api.nci.nih.gov
• GDC Support
  - GDC Assistance: support@nci-gdc.datacommons.io
  - GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype)
  - Aggregation
  - Access
  - Sharing
  - Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Diagram labels, data flow: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF and cancer BAM/VCF; dbGaP and NCI GDC; birth defect VCF and cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users]
[Diagram labels, related resources: GMKF Data Resource; dbGaP; NCI GDC; TOPMed; The Monarch Initiative; ClinGen; Matchmaker Exchange; Center for Mendelian Genomics]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist and, if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
Centers for Mendelian Genomics Phenotyping Standards Survey
[Bar charts; vertical axes show response counts from 0 to 4]
• Do you use any standard terminologies or ontologies when collecting/tracking phenotypes? (Yes, Sometimes, No, Other)
• Which systems are used to track disease name? (OMIM, Disease ontology, ORDO (Orphanet), MeSH, MedGen, Other (please specify))
• Do you use Human Phenotype Ontology terms to capture more detailed phenotypic features on subjects? (Yes, Sometimes, Rarely)
• Which tools are used to collect phenotype? (PhenoDB, Patient Archive, PhenoTips)
Joint Center for Mendelian Genomics
Steering Committee: Daniel MacArthur, Heidi Rehm, Joe Gleeson, Eric Pierce, Alan Beggs, Mark Daly, Christine Seidman, David Sweetser, Vamsi Mootha
Coordination Team Project manager Hayley Brooks
Clinical project manager Sam Baxter Monkol Lek
Methods
Analysis Software
Brain Heart
Muscle
Hearing
Retinal
Mito
Other
Clinical Analysis
Monkol Lek Elise Valkanas Tom Mullen
Ben Weisburd Harindra Arachchi
Sarah Calvo Laura Gauthier Laurent Francioli
MacArthur Lab's seqr: Online platform for collaborative analysis
Platform allows collaborative analyses between central sequencing site and thousands of collaborators
enter structured phenotype data
seqr Online platform for collaborative analysis
https://seqr.broadinstitute.org
Rare Disease Analysis Platform seqr Online platform for collaborative analysis
Genomic Matchmaking
[Diagram, courtesy of Joel Krier]
Patient 1 (Clinical Geneticist 1)
  Genotypic Data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
  Phenotypic Data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5
Patient 2 (Clinical Geneticist 2)
  Genotypic Data: Gene D, Gene G, Gene H
  Phenotypic Data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6
Genomic Matchmaker: Notification of Match
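The two-patient example on this slide can be worked through in code. This is a minimal sketch of one possible comparison, shared candidate genes plus a Jaccard overlap of phenotype features; the scoring choice and the `compare_profiles` helper are assumptions made for illustration and are not the algorithm used by any Matchmaker Exchange service.

```python
def compare_profiles(a: dict, b: dict) -> dict:
    """Compare two patient profiles given as sets of genes and features."""
    shared_genes = a["genes"] & b["genes"]
    shared_features = a["features"] & b["features"]
    all_features = a["features"] | b["features"]
    # Jaccard index: shared features divided by all features seen in either patient.
    jaccard = len(shared_features) / len(all_features) if all_features else 0.0
    return {
        "shared_genes": sorted(shared_genes),
        "shared_features": sorted(shared_features),
        "phenotype_jaccard": jaccard,
    }


patient_1 = {
    "genes": {f"Gene {g}" for g in "ABCDEF"},
    "features": {f"Feature {n}" for n in (1, 2, 3, 4, 5)},
}
patient_2 = {
    "genes": {"Gene D", "Gene G", "Gene H"},
    "features": {f"Feature {n}" for n in (1, 3, 4, 5, 6)},
}

result = compare_profiles(patient_1, patient_2)
print(result)
```

With the slide's data, the only shared candidate gene is Gene D, and four of the six distinct features are common to both patients.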
Matchmaker Exchange: Collaboration and Support from GA4GH and IRDiRC
Needs span multiple GA4GH workgroups
• Data Work Group (data format and interfaces)
• Regulatory and Ethics (patient consent)
• Security (patient privacy and user authentication)
Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.
Human Mutation Special Issue
• The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
• The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
• GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
• PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
• Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
• Innovative genomic collaboration using the GENESIS (GEM.app) platform
• Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
• Participant-led matchmaking
• GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
• Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
• Data sharing in the Undiagnosed Disease Network
• The Genomic Birthday Paradox: How Much is Enough?
• Quantifying and mitigating false-positive disease associations in rare disease matchmaking
• Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia Stanescu type
• GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue caused by de novo variants in HNRNPK
• Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies
The Matchmaker Exchange Connecting Matchmakers to Accelerate Gene Discovery
www.matchmakerexchange.org
MME Use Cases
• Supported today
  - Match on Gene/Phenotype
    - 1 strong candidate gene (e.g., de novo variant in GUS)
    - 10 candidate genes, each with rare variant
  - Match on Phenotype Only
• Coming soon
  - Match case to non-human models
  - Match using phenotype and VCF
  - Patient-initiated matching
Connected and Soon to be Connected Matchmakers
[Diagram labels: Matchmaker Exchange; GENESIS; GeneMatcher; DECIPHER; RD-Connect; ClinGen GenomeConnect; Monarch; PhenomeCentral; Broad Institute RDAP; patient-initiated matching; model organisms (mouse, zebrafish), Orphanet, ClinVar, OMIM; "Live"]
Connecting Data in the Big Data World
• Centralized Database: Everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA
• Centralized Hub: APIs connect each database to a central hub. Example: many commercial platforms
• Federated Network: All databases connected through multiple APIs. Example: Matchmaker Exchange
Matchmaker Exchange Acknowledgements S Balasubramanian Robert Green Mike Bamshad Matt Hurles Sergio Beltran Agullo Ada Hamosh Jonathan Berg Ekta Khurana Kym Boycott Sebastian Kohler Anthony Brookes Joel Krier Michael Brudno Owen Lancaster Han Brunner Melissa Landrum Oriean Buske Paul Lasko Deanna Church Rick Lifton Raymond Dalgliesh Daniel MacArthur Andrew Devereau Alex MacKenzie Johan den Dunnen Danielle Metterville Helen Firth Debbie Nickerson Paul Flicek Woong-Yan Park Jan Friedman Justin Paschall Richard Gibbs Anthony Philippakis Marta Girdea Heidi Rehm
Peter Robinson Francois Schiettecatte Rolf Sijmons Nara Sobreira Jawahar Swaminathan Morris Swertz Rachel Thompson Stephan Zuchner
NIAGADS
The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP
April 26, 2016
Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)
• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia
Long-Term Plan for the ADSP
• Discovery Phase 2014-2018
  - Family-based
    - Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    - Included Caribbean Hispanic families
    - Fully QC'd data released 7/13/2015
  - Case-control
    - Whole exome sequencing (WES): 5000 cases and 5000 controls
    - 1000 additional cases from families multiply affected by AD
    - Included Caribbean Hispanics
    - Fully QC'd data released 11/2015
• Follow-Up 2016-2021
  - WGS in case-control sample sets, emphasizing ethnically diverse cohorts
ADSP Discovery Phase Analysis
• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS, but not WES or targeted sequencing
Participation
• PARTICIPANTS
  - Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  - Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  - NIH staff: NHGRI and NIA
  - External Consultants
• INFRASTRUCTURE
  - Executive Committee, Analysis Coordination Committee, and 8 Work Groups
Genomics Center
Organization of the ADSP
ADSP Infrastructure and Support
NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]
National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]
National Alzheimer's Coordinating Center, NACC [U01 AG016976]
Alzheimer's Disease Centers
NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]
NIAGADS History and Function
Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
NIA's repository for the genetics of late-onset Alzheimer's disease data
Datasets, Genomics Database, Analysis Resource
Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy
Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)
NIAGADS ADSP Data Coordinating Center
bull Host the sample information and study plan
bull Track progress of sequencing
bull Track samples
bull Prepare & maintain data
bull Schedule data releases for the Study at dbGaP
bull Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
bull Host ADSP website and ADSP data portal
bull Manage files and datasets for ADSP work groups
bull Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and Office of Science Policy on "Trusted Partners"
Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP
Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
• Data production: define data formatting; collect and curate phenotypes; track data production; additional quality metrics; coordinate public data release
• Secondary/derived data management: documentation (README/Manifest); packaging and release; notification and tracking
• IT support: website; members area; face-to-face meeting support; dbGaP exchange area
• Facilitate interactions with other AD investigators
• Rapid response to unforeseen ADSP and NIH requests
Interactions
[Diagram labels: NIAGADS; NIA/NHGRI; Baylor/Broad/WashU LSACs; NCBI dbGaP/SRA; ADSP Data Flow and other Work Groups; Genomics Center; NCRAD; ADGC/CHARGE]
ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site
Management of Work Group Data/Work Flow
Exchange Area
Data Deposition and Data Sharing
ADSP specific datasets - 68 Tb
21 Reference datasets with 5560 files in use by the ADSP
>20 Reference datasets planned
Note: 1 TB = 1024 GB = 1048576 MB = 1073741824 KB = 1099511627776 bytes. A 1 TB hard drive has the capacity to hold a trillion bytes
dbGaP/NIAGADS release for the research community at large
11555 BAM files released 12/2014
Pedigree and phenotype information
Sequencing Quality Control Metrics
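The unit conversion quoted on this slide can be checked with a few lines of arithmetic. The sketch assumes the binary convention used in the slide (each step is a factor of 1024) and also computes how far that figure sits from a decimal trillion bytes; the `tb_to_units` helper name is hypothetical.

```python
def tb_to_units(terabytes: int) -> dict:
    """Convert binary terabytes into GB, MB, KB, and bytes (factor 1024 per step)."""
    gb = terabytes * 1024
    mb = gb * 1024
    kb = mb * 1024
    return {"GB": gb, "MB": mb, "KB": kb, "bytes": kb * 1024}


units = tb_to_units(1)
decimal_trillion = 10**12
difference = units["bytes"] - decimal_trillion
print(units)
print(difference)
```

The difference shows that the binary terabyte in the slide's conversion is about 10% larger than a decimal trillion bytes.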
Size of the ADSP
Reference: "The real cost of sequencing: scaling computation to keep pace with data generation", Muir et al., Genome Biology 2016, 17:53
ADSP Members Area (Dashboard)
Calendar Documents Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list 153 records
352 meeting minutes
>600 other documents
>100 consent files
10 analysis plans
85TB WGS 1075TB WES data (dbGaP/SRA)
42 TB Files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
SNP/Gene reports, Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer's Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
GWAS results combined with
SNPs below
Gene annotations transformed to SNPs
Genomic features from identified SNPs
Find a genomic location of interest by viewing tracks on the genome browser
SNPs
genes
GWAS results
functional genomics data
View Genomic Information by the Genome Browser
search result
Data available at the NIAGADS Genomics Database
Gene models SNP information
Pathway annotations Gene Ontology
KEGG Pathway
Functional genomics data ENCODE
FANTOM5
GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics
Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  - Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  - Big data management
  - IT infrastructure and web development
• Administrative support and collaboration
  - Familiarity with human subjects research regulations and NIH policy
  - Leadership and coordinating skills
  - Flexibility, close collaboration with program officers
• Domain expertise
  - Knowledge of the disease and genetic research
  - Familiarity with the community and cohorts
  - Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
What the Data Coordinating Center Should Know: Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements NIAGADS EAB Matthew Farrer
Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober
NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly bull Data flow WG
bull Annotation WG IGAPADGCCHARGEEADIGERAD
NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon
NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao
GDC Overview: GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
  - Enables the identification of both high- and low-frequency cancer drivers
  - Assists in defining genomic determinants of response to therapy
  - Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  - Harmonization of raw sequence, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  - Application of state-of-the-art methods of generating high level data
GDC Overview: Infrastructure
[Diagram organized under "GDC Users", "GDC System Components", and "GDC Interfaces". Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; 3rd Party Applications; Alignment & Processing Tools; Data Submission Tools; Data Security System; APIs; Digital ID System]
GDC Overview: Resources
[Diagram labels: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators]
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
  - Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  - Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  - GDC data submitters must first apply for data submission authorization through dbGaP
  - Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  - Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  - Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  - Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  - Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
GDC Data Submission Data Submission Process
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Portion | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Analyte | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Aliquot | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Diagnosis | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Exposure | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Family History | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Treatment | TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type Data Subtype Format Data Dictionary Template
Data File
Experimental Metadata Pathology Report Run Metadata Slide Image
Submitted Unaligned Reads
SRA XML
PDF SRA XML SVS
FASTQ BAM
Experimental Metadata Pathology Report
Submitted Unaligned Reads
TSV JSON
TSV JSON TSV JSON TSV JSON
TSV JSON
Submitted Aligned Reads BAM
Submitted Aligned Reads TSV JSON
Data Bundle Read Group
Slide
Read Group
Slide
TSV JSON
TSV JSON
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  - GDC clinical data elements were reviewed with members of the research community
  - Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
[Diagram: Case, with Clinical Data (Demographics, Diagnosis, Family History (optional), Exposure (optional), Treatment (optional)), Biospecimen Data, and Experiment Data]
Data Types and File Formats: Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FASTQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
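A per-read-group QC report like the one on this slide lends itself to a simple programmatic summary. The sketch below encodes the slide's status table and lists the modules each read group did not pass; the `flagged_modules` helper and the choice of which statuses to flag are assumptions for illustration, not part of FASTQC or the GDC.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Module -> status per read group, in READ_GROUPS order (values from the slide).
QC_TABLE = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged_modules(table: dict, read_groups: list, levels=("FAIL",)) -> dict:
    """Return, for each read group, the modules whose status is in `levels`."""
    flagged = {rg: [] for rg in read_groups}
    for module, statuses in table.items():
        for rg, status in zip(read_groups, statuses):
            if status in levels:
                flagged[rg].append(module)
    return flagged


failures = flagged_modules(QC_TABLE, READ_GROUPS)
attention = flagged_modules(QC_TABLE, READ_GROUPS, levels=("FAIL", "WARNING"))
print(failures)
print(attention)
```

Passing `levels=("FAIL", "WARNING")` widens the report to anything short of a clean pass, which is often the more useful view when triaging a submission.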
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project
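Validation against a data dictionary can be pictured as checking each submitted record for the fields its entity type requires. This is a minimal sketch under stated assumptions: the dictionary fragment and field names below are hypothetical and are not taken from the actual GDC data dictionary, and `validate` is an illustrative helper rather than part of any GDC tool.

```python
# Hypothetical dictionary fragment: entity type -> required field names.
# The real GDC dictionary defines its own entity types and fields.
DICTIONARY = {
    "case": {"submitter_id"},
    "sample": {"submitter_id", "sample_type"},
}


def validate(record, dictionary=DICTIONARY):
    """Return a list of problems found in `record`; an empty list means it passed."""
    entity_type = record.get("type")
    if entity_type not in dictionary:
        return [f"unknown entity type: {entity_type!r}"]
    missing = sorted(dictionary[entity_type] - set(record))
    return [f"missing required field: {name}" for name in missing]


ok = validate({"type": "case", "submitter_id": "example-case-001"})
problems = validate({"type": "sample", "submitter_id": "example-sample-001"})
```

Returning a list of problems rather than raising on the first one lets a submitter fix every issue in a record in a single pass.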
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
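The Create Entities endpoint can be exercised by building an authenticated POST request. The sketch below only constructs the request and does not send it. It assumes the API root listed on the References slide and the `X-Auth-Token` header for the token; the program and project names, the token value, and the entity payload are placeholders chosen for illustration, not a documented GDC record.

```python
import json
import urllib.request

# API root as listed on the References slide (2016); treat as an assumption today.
API_ROOT = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # authentication key (token) from the slide
        },
    )


# Hypothetical payload: one entity with placeholder field values.
request = build_create_request(
    "TCGA",
    "SKCM",
    [{"type": "case", "submitter_id": "example-case-001"}],
    token="example-token",
)
```

Sending the request would be a call to `urllib.request.urlopen(request)`; it is left out here so the sketch runs without network access or credentials.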
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC; release must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access, per the dbGaP authorization policies associated with the data set
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]
GDC Processing: GDC High Level Data Generation
• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
[Figure: GDC variant calling pipelines]
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Retrieval
• Once data is submitted to the GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
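The idea of a graph that links projects, cases, and data to file objects through unique identifiers can be sketched in a few lines. This is an illustrative toy, not the GDC's implementation: the class `DataModelGraph`, its method names, the node labels, and the example names are all assumptions made for the example.

```python
import uuid
from collections import defaultdict


class DataModelGraph:
    """Toy graph: nodes keyed by a unique identifier, directed edges between them."""

    def __init__(self):
        self.nodes = {}
        self.edges = defaultdict(set)

    def add_node(self, label, name):
        node_id = str(uuid.uuid4())  # unique identifier for this node
        self.nodes[node_id] = {"label": label, "name": name}
        return node_id

    def link(self, parent_id, child_id):
        self.edges[parent_id].add(child_id)

    def reachable(self, start_id, label):
        """Names of all nodes with `label` reachable from `start_id`."""
        found, stack, seen = [], [start_id], set()
        while stack:
            node_id = stack.pop()
            if node_id in seen:
                continue
            seen.add(node_id)
            if self.nodes[node_id]["label"] == label:
                found.append(self.nodes[node_id]["name"])
            stack.extend(self.edges[node_id])
        return sorted(found)


graph = DataModelGraph()
project = graph.add_node("project", "example-project")
case = graph.add_node("case", "example-case")
clinical = graph.add_node("clinical", "diagnosis")
experiment = graph.add_node("experiment", "read-group-1")
data_file = graph.add_node("file", "reads.bam")
graph.link(project, case)
graph.link(case, clinical)
graph.link(case, experiment)
graph.link(experiment, data_file)

files_in_project = graph.reachable(project, "file")
```

Because every node carries its own identifier, a query that starts at a project can walk the edges down to the exact file objects that belong to it.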
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from cart or a high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing
Example request (API URL, endpoint, URL parameters, query parameters):
    https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Example response (portion removed for readability):
    {
      "data": {
        "hits": [
          {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
          {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
          {"project_id": "TCGA-LAML", "primary_site": "Blood"},
          {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
          {"project_id": "TCGA-UVM", "primary_site": "Eye"},
          {"project_id": "TARGET-AML", "primary_site": "Blood"},
          {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
          {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
          {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
          {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
        ]
      }
    }
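The projects query on this slide can be assembled and its response read with the standard library alone. This is a minimal sketch that makes no network call: the API root is the 2016 address from the slide, the sample response is a shortened copy of the slide's example, and the helper names `projects_url` and `sites_by_project` are hypothetical.

```python
import json
from urllib.parse import urlencode

# API root as shown on the slide (2016); treat as an assumption today.
API_ROOT = "https://gdc-api.nci.nih.gov"


def projects_url(fields, pretty=True):
    """Build the /projects query URL with a comma-separated `fields` list."""
    query = urlencode(
        {"fields": ",".join(fields), "pretty": str(pretty).lower()},
        safe=",",  # keep the commas readable instead of percent-encoding them
    )
    return f"{API_ROOT}/projects?{query}"


# Shortened copy of the example response from the slide.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}
"""


def sites_by_project(response_text):
    """Map project_id -> primary_site from a /projects response body."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


url = projects_url(["project_id", "primary_site"])
sites = sites_by_project(SAMPLE_RESPONSE)
```

Requesting only the fields that are needed keeps the response small, which matters when a query matches many projects or files.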
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site – https://gdc.nci.nih.gov
• GDC Documentation Site – https://gdc-docs.nci.nih.gov
• GDC Data Portal – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype)
  – Aggregation
  – Access
  – Sharing
  – Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Figure: data flow diagram. Labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF and cancer BAM/VCF; dbGaP and NCI GDC; birth defect VCF and cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users]
[Figure: the GMKF Data Resource shown with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist and, if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
Joint Center for Mendelian Genomics
Steering Committee: Daniel MacArthur, Heidi Rehm, Joe Gleeson, Eric Pierce, Alan Beggs, Mark Daly, Christine Seidman, David Sweetser, Vamsi Mootha
Coordination Team Project manager Hayley Brooks
Clinical project manager Sam Baxter Monkol Lek
Methods
Analysis Software
Brain Heart
Muscle
Hearing
Retinal
Mito
Other
Clinical Analysis
Monkol Lek Elise Valkanas Tom Mullen
Ben Weisburd Harindra Arachchi
Sarah Calvo Laura Gauthier Laurent Francioli
MacArthur Lab's seqr: Online platform for collaborative analysis
Platform allows collaborative analyses between central sequencing site and thousands of collaborators
enter structured phenotype data
https://seqr.broadinstitute.org
Rare Disease Analysis Platform
Genomic Matchmaking
Patient 1 (Clinical Geneticist 1)
  Genotypic Data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
  Phenotypic Data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5
Patient 2 (Clinical Geneticist 2)
  Genotypic Data: Gene D, Gene G, Gene H
  Phenotypic Data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6
Genomic Matchmaker: Notification of Match
Courtesy of Joel Krier
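The matching idea in this figure can be sketched as a set comparison between two patient profiles. The slide does not say how a matchmaker scores a match, so the Jaccard overlap of phenotype features used here is an assumption chosen for illustration, and `PatientProfile` and `match` are hypothetical names. The gene and feature labels are the ones from the figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatientProfile:
    label: str
    genes: frozenset
    features: frozenset


def match(a, b):
    """Compare two profiles: shared candidate genes and shared phenotype features."""
    shared_genes = a.genes & b.genes
    shared_features = a.features & b.features
    all_features = a.features | b.features
    # Jaccard overlap of phenotype features (an illustrative scoring choice).
    similarity = len(shared_features) / len(all_features) if all_features else 0.0
    return {
        "genes": sorted(shared_genes),
        "features": sorted(shared_features),
        "phenotype_similarity": similarity,
    }


patient_1 = PatientProfile(
    "Patient 1",
    frozenset({"Gene A", "Gene B", "Gene C", "Gene D", "Gene E", "Gene F"}),
    frozenset({"Feature 1", "Feature 2", "Feature 3", "Feature 4", "Feature 5"}),
)
patient_2 = PatientProfile(
    "Patient 2",
    frozenset({"Gene D", "Gene G", "Gene H"}),
    frozenset({"Feature 1", "Feature 3", "Feature 4", "Feature 5", "Feature 6"}),
)

result = match(patient_1, patient_2)
```

For the two patients in the figure, the only shared candidate gene is Gene D, and four of the six distinct features overlap, which is what would prompt a notification to both clinical geneticists.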
Matchmaker Exchange: Collaboration and Support from GA4GH and IRDiRC
Needs span multiple GA4GH workgroups
• Data Work Group (data format and interfaces)
• Regulatory and Ethics (patient consent)
• Security (patient privacy and user authentication)
Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.
Human Mutation Special Issue
The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
Innovative genomic collaboration using the GENESIS (GEM.app) platform
Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
Participant-led matchmaking
GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
Data sharing in the Undiagnosed Disease Network
The Genomic Birthday Paradox: How Much is Enough?
Quantifying and mitigating false-positive disease associations in rare disease matchmaking
Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue caused by de novo variants in HNRNPK
Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies
The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery
www.matchmakerexchange.org
MME Use Cases
• Supported today
  – Match on Gene/Phenotype
    • 1 strong candidate gene (e.g., de novo variant in GUS)
    • 10 candidate genes, each with rare variant
  – Match on Phenotype Only
• Coming soon
  – Match case to non-human models
  – Match using phenotype and VCF
  – Patient-initiated matching
Connected and Soon to be Connected Matchmakers
[Figure: Matchmaker Exchange network. Labels: GENESIS, GeneMatcher, DECIPHER, RD-Connect, ClinGen GenomeConnect, Monarch, PhenomeCentral, Broad Institute RDAP; patient-initiated matching; model organisms (mouse, zebrafish), Orphanet, ClinVar, OMIM; legend: Live]
Connecting Data in the Big Data World
Centralized Database: Everyone submits data to a single central database
  Examples: ClinVar, dbGaP, EGA
Centralized Hub: APIs connect each database to a central hub
  Example: Many commercial platforms
Federated Network: All databases connected through multiple APIs
  Example: Matchmaker Exchange
Matchmaker Exchange Acknowledgements S Balasubramanian Robert Green Mike Bamshad Matt Hurles Sergio Beltran Agullo Ada Hamosh Jonathan Berg Ekta Khurana Kym Boycott Sebastian Kohler Anthony Brookes Joel Krier Michael Brudno Owen Lancaster Han Brunner Melissa Landrum Oriean Buske Paul Lasko Deanna Church Rick Lifton Raymond Dalgliesh Daniel MacArthur Andrew Devereau Alex MacKenzie Johan den Dunnen Danielle Metterville Helen Firth Debbie Nickerson Paul Flicek Woong-Yan Park Jan Friedman Justin Paschall Richard Gibbs Anthony Philippakis Marta Girdea Heidi Rehm
Peter Robinson Francois Schiettecatte Rolf Sijmons Nara Sobreira Jawahar Swaminathan Morris Swertz Rachel Thompson Stephan Zuchner
NIAGADS
The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP
April 26, 2016
Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)
• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia
Long-Term Plan for the ADSP
• Discovery Phase, 2014-2018
  • Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC'd data released 7/13/2015
  • Case-control
    • Whole exome sequencing (WES): 5000 cases and 5000 controls
    • 1000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC'd data released 11/2015
• Follow-Up, 2016-2021: WGS in case-control sample sets, emphasizing ethnically diverse cohorts
ADSP Discovery Phase Analysis
• NIA funding for analysis of sequence data, June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing
Participation
• PARTICIPANTS
  – Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  – Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  – NIH staff: NHGRI and NIA
  – External Consultants
• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups
Genomics Center
Organization of the ADSP
ADSP Infrastructure and Support
NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]
National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]
National Alzheimer's Coordinating Center, NACC [U01 AG016976]
Alzheimer's Disease Centers
NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]
NIAGADS History and Function
Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
NIA's repository for the genetics of late-onset Alzheimer's disease data
Datasets, Genomics Database, Analysis Resource
Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy
Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)
NIAGADS ADSP Data Coordinating Center
• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and Office of Science Policy on "Trusted Partners"
Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP: Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
Data production: define data formatting; collect and curate phenotypes; track data production; additional quality metrics; coordinate public data release
Secondary/derived data management: documentation (README/Manifest), packaging, and release; notification and tracking
IT support: website; members area; face-to-face meeting support; dbGaP exchange area
Facilitate interactions with other AD investigators
Rapid response to unforeseen ADSP and NIH requests
Interactions
[Figure: interactions among NIAGADS, NIA/NHGRI, the Baylor/Broad/WashU LSACs, NCBI dbGaP/SRA, the ADSP Data Flow and other Work Groups, the Genomics Center, NCRAD, and ADGC/CHARGE]
ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site
Management of Work Group Data/Work Flow
Exchange Area
Data Deposition and Data Sharing
ADSP specific datasets: 68 Tb
21 reference datasets with 5560 files in use by the ADSP
>20 reference datasets planned
1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes
A 1 TB hard drive has the capacity to hold a trillion bytes
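The conversion chain on this slide uses binary multiples of 1024, while "a trillion bytes" is the decimal figure drive vendors quote. A short arithmetic check, offered as a sketch with constant names chosen for this example, shows how far apart the two are:

```python
# Binary multiples, as in the slide's conversion chain.
KB_PER_TB = 1024 ** 3
BYTES_PER_BINARY_TB = 1024 ** 4

# Decimal definition: "a trillion bytes".
BYTES_PER_DECIMAL_TB = 10 ** 12

# How much larger the binary terabyte is than the decimal one.
ratio = BYTES_PER_BINARY_TB / BYTES_PER_DECIMAL_TB
```

The binary terabyte is roughly 10% larger than a trillion bytes, so the distinction matters when sizing storage for tens of terabytes of sequence data.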
dbGaP/NIAGADS release for the research community at large
11555 BAM files released 12/2014
Pedigree and phenotype information
Sequencing Quality Control Metrics
Size of the ADSP
Reference: "The real cost of sequencing: scaling computation to keep pace with data generation," Muir et al., Genome Biology 2016, 17:53
ADSP Members Area (Dashboard)
Calendar Documents Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list: 153 records
352 meeting minutes
>600 other documents
>100 consent files
10 analysis plans
85TB WGS, 1075TB WES data (dbGaP/SRA)
42 TB files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
SNP/Gene reports, Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer's Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
[Figure: GWAS results combined with SNPs below; gene annotations transformed to SNPs; genomic features from identified SNPs]
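Combining a GWAS significance search with gene annotation amounts to a filter followed by a positional join. The sketch below is illustrative only: the SNP identifiers, gene names, coordinates, and p-values are made up, the function names are hypothetical, and the 5e-8 cutoff is the conventional genome-wide significance threshold rather than a value stated on the slide.

```python
# Made-up GWAS results and gene models, for illustration only.
GWAS_HITS = [
    {"snp": "rs_example_1", "chrom": "19", "pos": 1050, "p_value": 3e-9},
    {"snp": "rs_example_2", "chrom": "19", "pos": 5000, "p_value": 2e-3},
    {"snp": "rs_example_3", "chrom": "11", "pos": 700, "p_value": 1e-10},
]
GENES = [
    {"gene": "GENE_A", "chrom": "19", "start": 1000, "end": 2000},
    {"gene": "GENE_B", "chrom": "11", "start": 900, "end": 1500},
]


def significant_snps(hits, threshold=5e-8):
    """Keep SNPs whose association p-value is below the threshold."""
    return [hit for hit in hits if hit["p_value"] < threshold]


def snps_in_genes(snps, genes):
    """Pair each SNP with every gene whose interval contains its position."""
    pairs = []
    for snp in snps:
        for gene in genes:
            same_chrom = snp["chrom"] == gene["chrom"]
            if same_chrom and gene["start"] <= snp["pos"] <= gene["end"]:
                pairs.append((snp["snp"], gene["gene"]))
    return pairs


significant = significant_snps(GWAS_HITS)
features = snps_in_genes(significant, GENES)
```

Here two SNPs pass the significance filter, but only one falls inside an annotated gene, so the combined search returns a single SNP and gene pair.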
Find a genomic location of interest by viewing tracks on the genome browser
[Figure: genome browser tracks for SNPs, genes, GWAS results, and functional genomics data]
View Genomic Information by the Genome Browser
[Figure: search result]
Data available at the NIAGADS Genomics Database
Gene models, SNP information
Pathway annotations: Gene Ontology, KEGG Pathway
Functional genomics data: ENCODE, FANTOM5, GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics
Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  • Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  • Big data management
  • IT infrastructure and web development
• Administrative support and collaboration
  • Familiarity with human subjects research regulations and NIH policy
  • Leadership and coordinating skills
  • Flexibility, close collaboration with program officers
• Domain expertise
  • Knowledge of the disease and genetic research
  • Familiarity with the community and cohorts
  • Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements NIAGADS EAB Matthew Farrer
Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober
NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly bull Data flow WG
bull Annotation WG IGAPADGCCHARGEEADIGERAD
NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon
NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao
GDC Overview: GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high-level data
GDC Overview: Infrastructure
[Figure: GDC infrastructure diagram. Column headings: GDC Users, GDC System Components, GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; 3rd Party Applications; Alignment & Processing Tools; Data Submission Tools; Data Security System; APIs; Digital ID System]
GDC Overview: Resources
[Figure: GDC resources. Labels: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators]
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
• Contracting organization supporting GDC management and execution
Other Government/External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford an exclusive pre-processing period that allows submitters to perform data cleaning and quality checks and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
GDC Data Submission: Data Submission Process
[Figure: data submission process]
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type

Data Type       Data Subtype           Format                              Data Dictionary        Template
Administrative  Administrative Data    TSV, JSON                           Case                   TSV, JSON
Biospecimen     Biospecimen Data       TSV, JSON                           Sample                 TSV, JSON
                                                                           Portion                TSV, JSON
                                                                           Analyte                TSV, JSON
                                                                           Aliquot                TSV, JSON
Clinical        Clinical Data          TSV, JSON                           Demographic            TSV, JSON
                                                                           Diagnosis              TSV, JSON
                                                                           Exposure               TSV, JSON
                                                                           Family History         TSV, JSON
                                                                           Treatment              TSV, JSON
Data File       Analysis Metadata      SRA XML, MAGE-TAB (SDRF, IDF)       Analysis Metadata      TSV, JSON
                Biospecimen Metadata   BCR XML, GDC-approved spreadsheet   Biospecimen Metadata   TSV, JSON
                Clinical Metadata      BCR XML, GDC-approved spreadsheet   Clinical Metadata      TSV, JSON
Data Types and File Formats (2 of 2)
Data Type Data Subtype Format Data Dictionary Template
Data File
Experimental Metadata Pathology Report Run Metadata Slide Image
Submitted Unaligned Reads
SRA XML
PDF SRA XML SVS
FASTQ BAM
Experimental Metadata Pathology Report
Submitted Unaligned Reads
TSV JSON
TSV JSON TSV JSON TSV JSON
TSV JSON
Submitted Aligned Reads BAM
Submitted Aligned Reads TSV JSON
Data Bundle Read Group
Slide
Read Group
Slide
TSV JSON
TSV JSON
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
167
Data Types and File Formats Clinical Data
bull GDC clinical data types are associated with each case and include required preferred and optional clinical data elements for demographics diagnosis family history exposure and treatment ndash GDC clinical data elements were reviewed with members of the research community ndash Clinical data elements are defined in the GDC dictionary and registered in the NCIrsquos
Cancer Data Standards Repository (caDSR)
Case
Clinical Data
Demographics Diagnosis Family History(Optional) Exposure (Optional) Treatment
(Optional)
Biospecimen Data Experiment Data
168
Data Types and File Formats: Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FASTQC
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
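A QC report like the one above can be screened per read group. The sketch below is illustrative only: it assumes the statuses have been transcribed by hand into a dictionary, and the names `QC_REPORT` and `read_groups_with_status` are hypothetical, not part of FASTQC or any GDC tool.

```python
# Statuses transcribed from the example QC table, one list entry per read group.
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]
QC_REPORT = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def read_groups_with_status(report, status):
    """Map each read group to the QC modules that reported `status`."""
    flagged = {}
    for module, results in report.items():
        for group, result in zip(READ_GROUPS, results):
            if result == status:
                flagged.setdefault(group, []).append(module)
    return flagged


failures = read_groups_with_status(QC_REPORT, "FAIL")
warnings = read_groups_with_status(QC_REPORT, "WARNING")
for group in READ_GROUPS:
    print(group, "FAIL:", failures.get(group, []), "WARNING:", warnings.get(group, []))
```

In this example RG_1 and RG_2 each have one failing module, RG_3 has warnings only, and RG_4 passes every module.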
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  - Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  - Validate data against GDC standard data types defined in the project data dictionary
  - Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  - Command line interface to specify desired transfer protocol
  - Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  - Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based interface that supports the programmatic submission of cancer data sets for analysis
  - Securely submit biospecimen, clinical, and experiment files using a token
  - Submit data to GDC by performing create, update, and retrieve actions on entities
  - Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
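The create-entities call can be sketched with the Python standard library. This is a sketch under stated assumptions, not the GDC client: the host is the one listed in the deck's references, the `X-Auth-Token` header name is an assumption, and the program, project, and entity values are hypothetical placeholders. The request is built but never sent.

```python
import json
import urllib.request

# Host taken from the deck's reference list; treat it as historical.
API_ROOT = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumed header name for the authentication token.
            "X-Auth-Token": token,
        },
    )


# Hypothetical program, project, and entity values for illustration only.
request = build_create_request(
    "EXAMPLE_PROGRAM",
    "EXAMPLE_PROJECT",
    [{"type": "case", "submitter_id": "example-case-001"}],
    token="REPLACE_WITH_TOKEN",
)
print(request.get_method(), request.full_url)
```

Keeping request construction separate from sending makes it easy to inspect the URL and payload before any controlled-access token leaves the machine.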
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]
GDC Processing: GDC High Level Data Generation
• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
GDC Data Retrieval
• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects by means of unique identifiers
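The idea of a graph whose nodes are linked by unique identifiers can be illustrated with a toy structure. This is a minimal sketch, not the GDC schema: the `DataGraph` class, the node kinds, and every identifier below are hypothetical.

```python
from collections import defaultdict


class DataGraph:
    """Toy graph: nodes keyed by unique id, directed edges from parent to child."""

    def __init__(self):
        self.nodes = {}                 # node id -> node kind
        self.edges = defaultdict(list)  # parent id -> list of child ids

    def add_node(self, node_id, kind):
        self.nodes[node_id] = kind

    def add_edge(self, parent_id, child_id):
        self.edges[parent_id].append(child_id)

    def reachable(self, start_id, kind):
        """Return sorted ids of nodes of `kind` reachable from `start_id`."""
        found, stack, seen = [], [start_id], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            if current != start_id and self.nodes[current] == kind:
                found.append(current)
            stack.extend(self.edges.get(current, []))
        return sorted(found)


graph = DataGraph()
for node_id, kind in [
    ("project-1", "project"), ("case-1", "case"), ("case-2", "case"),
    ("clinical-1", "clinical"), ("experiment-1", "experiment"),
    ("experiment-2", "experiment"), ("file-1", "file"), ("file-2", "file"),
]:
    graph.add_node(node_id, kind)
for parent, child in [
    ("project-1", "case-1"), ("project-1", "case-2"),
    ("case-1", "clinical-1"), ("case-1", "experiment-1"),
    ("case-2", "experiment-2"),
    ("experiment-1", "file-1"), ("experiment-2", "file-2"),
]:
    graph.add_edge(parent, child)

print(graph.reachable("project-1", "file"))
print(graph.reachable("case-1", "file"))
```

Walking edges from a project or case to its files is the kind of traversal that keeps clinical and experiment records tied to the right file objects.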
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  - Data browsing by project, file, case, or annotation
  - Visualization allowing users to perform fine-grained filtering of search results
  - Data search using advanced smart search technology
  - Data selection into a personalized cart
  - Data download from cart or a high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  - Command line interface to specify desired transfer protocol and multiple files for download
  - Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  - Supports the secure transfer of controlled access data using an authentication key (token)
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  - Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  - Securely retrieve biospecimen, clinical, and molecular files
  - Perform BAM slicing
Example request: https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
URL parts labeled on the slide: API URL, endpoint, URL parameters, query parameters
Example response (portion removed for readability):
"data": { "hits": [
  { "project_id": "TCGA-SKCM", "primary_site": "Skin" },
  { "project_id": "TCGA-PCPG", "primary_site": "Nervous System" },
  { "project_id": "TCGA-LAML", "primary_site": "Blood" },
  { "project_id": "TCGA-CNTL", "primary_site": "Not Applicable" },
  { "project_id": "TCGA-UVM", "primary_site": "Eye" },
  { "project_id": "TARGET-AML", "primary_site": "Blood" },
  { "project_id": "TCGA-SARC", "primary_site": "Mesenchymal" },
  { "project_id": "TCGA-LUSC", "primary_site": "Lung" },
  { "project_id": "TARGET-NBL", "primary_site": "Nervous System" },
  { "project_id": "TCGA-PAAD", "primary_site": "Pancreas" }
] }
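The example request and response can be reproduced offline. The sketch below builds the query URL and parses a shortened copy of the response shown on the slide. The helper names `build_query_url` and `sites_by_project` are hypothetical, and no network call is made.

```python
import json
from urllib.parse import urlencode

# Host as shown in the example request on the slide.
API_ROOT = "https://gdc-api.nci.nih.gov"


def build_query_url(endpoint, fields, pretty=True):
    """Assemble <API URL>/<endpoint>?fields=...&pretty=... as on the slide."""
    params = {"fields": ",".join(fields), "pretty": str(pretty).lower()}
    # safe="," keeps the comma between field names unescaped.
    return f"{API_ROOT}/{endpoint}?{urlencode(params, safe=',')}"


# Shortened copy of the example response; only three hits are kept.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"},
  {"project_id": "TARGET-NBL", "primary_site": "Nervous System"}
]}}
"""


def sites_by_project(response_text):
    """Map each project_id in the response to its primary_site."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


url = build_query_url("projects", ["project_id", "primary_site"])
sites = sites_by_project(SAMPLE_RESPONSE)
print(url)
print(sites)
```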
GDC Web Site
• The GDC Web Site provides access to GDC resources that support and communicate the GDC mission
  - Targets data consumers, providers, developers, and general users
  - Provides access to information about GDC and contributed cancer genomic data sets
  - Instructs users on the use of GDC data access and submission tools
  - Provides descriptions of GDC bioinformatics pipelines
  - Documents supported GDC data types and file formats
  - Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site - https://gdc.nci.nih.gov
• GDC Documentation Site - https://gdc-docs.nci.nih.gov
• GDC Data Portal - https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  - https://gdc-portal.nci.nih.gov/submission
  - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool - https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  - https://gdc-api.nci.nih.gov
• GDC Support
  - GDC Assistance: support@nci-gdc.datacommons.io
  - GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype):
  - Aggregation
  - Access
  - Sharing
  - Analysis
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Figure: data flow among birth defect or childhood cancer cohorts, the DNA sequencing center, dbGaP (birth defect BAM/VCF), the NCI GDC (cancer BAM/VCF), the Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries), and users]
[Figure: the GMKF Data Resource shown with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
MacArthur Lab's seqr: Online platform for collaborative analysis
Platform allows collaborative analyses between central sequencing site and thousands of collaborators
Enter structured phenotype data
https://seqr.broadinstitute.org
Rare Disease Analysis Platform
Genomic Matchmaking
Patient 1 (Clinical Geneticist 1)
  Genotypic Data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
  Phenotypic Data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5
Patient 2 (Clinical Geneticist 2)
  Genotypic Data: Gene D, Gene G, Gene H
  Phenotypic Data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6
Genomic Matchmaker: Notification of Match
Courtesy of Joel Krier
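The two-patient example above can be worked through with set operations. This is a minimal sketch, not the Matchmaker Exchange scoring method: the Jaccard overlap used for phenotypes is a stand-in chosen for illustration, and the function name `compare_patients` is hypothetical.

```python
# Patient profiles transcribed from the matchmaking example.
PATIENT_1 = {
    "genes": {"Gene A", "Gene B", "Gene C", "Gene D", "Gene E", "Gene F"},
    "features": {"Feature 1", "Feature 2", "Feature 3", "Feature 4", "Feature 5"},
}
PATIENT_2 = {
    "genes": {"Gene D", "Gene G", "Gene H"},
    "features": {"Feature 1", "Feature 3", "Feature 4", "Feature 5", "Feature 6"},
}


def compare_patients(first, second):
    """Return shared candidate genes, shared features, and a Jaccard phenotype score."""
    shared_genes = sorted(first["genes"] & second["genes"])
    shared_features = sorted(first["features"] & second["features"])
    all_features = first["features"] | second["features"]
    # Jaccard index: size of the intersection divided by size of the union.
    score = len(shared_features) / len(all_features) if all_features else 0.0
    return shared_genes, shared_features, score


shared_genes, shared_features, phenotype_score = compare_patients(PATIENT_1, PATIENT_2)
print(shared_genes, shared_features, round(phenotype_score, 2))
```

Here the only shared candidate is Gene D, and four of the six distinct features overlap, which is the situation the slide labels a match.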
Matchmaker Exchange: Collaboration and Support from GA4GH and IRDiRC
Needs span multiple GA4GH workgroups
• Data Work Group (data format and interfaces)
• Regulatory and Ethics (patient consent)
• Security (patient privacy and user authentication)
Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.
Human Mutation Special Issue
• The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
• The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
• GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
• PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
• Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
• Innovative genomic collaboration using the GENESIS (GEM.app) platform
• Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
• Participant-led matchmaking
• GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
• Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
• Data sharing in the Undiagnosed Disease Network
• The Genomic Birthday Paradox: How Much is Enough?
• Quantifying and mitigating false-positive disease associations in rare disease matchmaking
• Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
• GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue caused by de novo variants in HNRNPK
• Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies
The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery
www.matchmakerexchange.org
MME Use Cases
• Supported today
  - Match on Gene/Phenotype: 1 strong candidate gene (e.g., de novo variant in GUS), or 10 candidate genes, each with rare variant
  - Match on Phenotype Only
• Coming soon
  - Match case to non-human models
  - Match using phenotype and VCF
  - Patient-initiated matching
Connected and Soon to be Connected Matchmakers
[Figure: the Matchmaker Exchange network showing GENESIS, GeneMatcher, DECIPHER, RD-Connect, ClinGen GenomeConnect, Monarch, PhenomeCentral, and the Broad Institute RDAP, with labels for live connections, patient-initiated matching, and other resources (model organisms (mouse, zebrafish), Orphanet, ClinVar, OMIM)]
Connecting Data in the Big Data World
• Centralized Database: everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA
• Centralized Hub: APIs connect each database to a central hub. Example: many commercial platforms
• Federated Network: all databases connected through multiple APIs. Example: Matchmaker Exchange
Matchmaker Exchange Acknowledgements
S Balasubramanian, Robert Green, Mike Bamshad, Matt Hurles, Sergio Beltran Agullo, Ada Hamosh, Jonathan Berg, Ekta Khurana, Kym Boycott, Sebastian Kohler, Anthony Brookes, Joel Krier, Michael Brudno, Owen Lancaster, Han Brunner, Melissa Landrum, Oriean Buske, Paul Lasko, Deanna Church, Rick Lifton, Raymond Dalgliesh, Daniel MacArthur, Andrew Devereau, Alex MacKenzie, Johan den Dunnen, Danielle Metterville, Helen Firth, Debbie Nickerson, Paul Flicek, Woong-Yan Park, Jan Friedman, Justin Paschall, Richard Gibbs, Anthony Philippakis, Marta Girdea, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner
NIAGADS
The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP
April 26, 2016
Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)
• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia
Long-Term Plan for the ADSP
• Discovery Phase 2014-2018
  - Family-based
    - Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    - Included Caribbean Hispanic families
    - Fully QC'd data released 7/13/2015
  - Case-control
    - Whole exome sequencing (WES): 5000 cases and 5000 controls
    - 1000 additional cases from families multiply affected by AD
    - Included Caribbean Hispanics
    - Fully QC'd data released 112015
• Follow-Up 2016-2021: WGS in case-control sample sets, emphasizing ethnically diverse cohorts
ADSP Discovery Phase Analysis
• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing
Participation
• PARTICIPANTS
  - Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) and Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  - Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  - NIH staff: NHGRI and NIA
  - External Consultants
• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups
Organization of the ADSP
[Figure: ADSP organization chart, including the Genomics Center]
ADSP Infrastructure and Support
• NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]
• National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]
• National Alzheimer's Coordinating Center, NACC [U01 AG016976]
• Alzheimer's Disease Centers
• NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]
NIAGADS History and Function
• Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
• NIA's repository for the genetics of late-onset Alzheimer's disease data
• Datasets, Genomics Database, Analysis Resource
• Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy
• Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)
NIAGADS: ADSP Data Coordinating Center
• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare and maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and Office of Science Policy on "Trusted Partners"
Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee
Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH side: after ADSP DAC approval, the investigator uses NIH authentication
• Community side: fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS: ADSP Data Coordinating Center
• Data production: define data formatting; collect and curate phenotypes; track data production; additional quality metrics; coordinate public data release
• Secondary/derived data management: documentation (README/manifest), packaging, and release; notification and tracking
• IT support: website; members area; face-to-face meeting support; dbGaP exchange area
• Facilitate interactions with other AD investigators
• Rapid response to unforeseen ADSP and NIH requests
Interactions
[Figure: interactions among NIAGADS, NIA/NHGRI, the Baylor/Broad/WashU LSACs, NCBI dbGaP/SRA, the ADSP Data Flow and other Work Groups, the Genomics Center, NCRAD, and ADGC/CHARGE]
ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site
Management of Work Group Data/Work Flow
Exchange Area
Data Deposition and Data Sharing
• ADSP specific datasets: 68 Tb
• 21 Reference datasets with 5560 files in use by the ADSP
• >20 Reference datasets planned
1 TB = 1024 GB = 1048576 MB = 1073741824 KB = 1099511627776 bytes
A 1 TB hard drive has the capacity to hold roughly a trillion bytes
dbGaP/NIAGADS release for the research community at large
• 11555 BAM files released 12/2014
• Pedigree and phenotype information
• Sequencing Quality Control Metrics
Size of the ADSP
Reference: Muir et al. The real cost of sequencing: scaling computation to keep pace with data generation. Genome Biology 2016;17:53
ADSP Members Area (Dashboard)
• Calendar, documents, conference call minutes
• Reference dataset catalog
• Information on funded cooperative agreements
• Bulletin board for Work Groups
• Notification of new datasets
ADSP by the numbers
• Member list: 153 records
• 352 meeting minutes
• >600 other documents
• >100 consent files
• 10 analysis plans
• 85TB WGS, 1075TB WES data (dbGaP/SRA)
• 42 TB files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
• SNP/Gene reports, Genome Browser interface
• Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer's Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
[Figure: GWAS results combined with SNPs below; gene annotations transformed to SNPs; genomic features from identified SNPs]
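Combining a GWAS significance search with gene annotation can be sketched as a filter followed by an interval lookup. This is an illustration only, not the NIAGADS Genomics Database implementation: the `Snp` and `Gene` records, the 5e-8 default threshold, and all identifiers and coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snp:
    rsid: str
    chrom: str
    pos: int
    p_value: float


@dataclass(frozen=True)
class Gene:
    symbol: str
    chrom: str
    start: int
    end: int


def significant_snps(snps, threshold=5e-8):
    """Keep SNPs whose GWAS p-value is at or below the chosen significance level."""
    return [snp for snp in snps if snp.p_value <= threshold]


def genes_by_snp(snps, genes):
    """Map each SNP id to the symbols of genes whose interval contains it."""
    hits = {}
    for snp in snps:
        overlapping = [
            gene.symbol for gene in genes
            if gene.chrom == snp.chrom and gene.start <= snp.pos <= gene.end
        ]
        if overlapping:
            hits[snp.rsid] = overlapping
    return hits


# Hypothetical records for illustration only.
SNPS = [
    Snp("rs_example_1", "chr19", 1000, 1e-10),
    Snp("rs_example_2", "chr19", 5000, 1e-3),
    Snp("rs_example_3", "chr1", 200, 2e-9),
]
GENES = [Gene("GENE_X", "chr19", 900, 1100), Gene("GENE_Y", "chr1", 1000, 2000)]

selected = significant_snps(SNPS)
features = genes_by_snp(selected, GENES)
print([snp.rsid for snp in selected], features)
```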
Find a genomic location of interest by viewing tracks on the genome browser
• SNPs
• Genes
• GWAS results
• Functional genomics data
View Genomic Information by the Genome Browser
[Figure: genome browser view of a search result]
Data available at the NIAGADS Genomics Database
• Gene models, SNP information
• Pathway annotations: Gene Ontology, KEGG Pathway
• Functional genomics data: ENCODE, FANTOM5
• GTEx results
• NHGRI GWAS catalog
• Published genome-wide AD GWAS summary statistics
• Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  - Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  - Big data management
  - IT infrastructure and web development
• Administrative support and collaboration
  - Familiarity with human subjects research regulations and NIH policy
  - Leadership and coordinating skills
  - Flexibility, close collaboration with program officers
• Domain expertise
  - Knowledge of the disease and genetic research
  - Familiarity with the community and cohorts
  - Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements
• Patients and families of those with AD
• NIA ADRC program and centers
• NACC, NCRAD
• ADSP: Data flow WG, Annotation WG
• IGAP/ADGC/CHARGE/EADI/GERAD
• NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly
• NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon, Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao
• NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
GDC Overview: GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
  - Enables the identification of both high- and low-frequency cancer drivers
  - Assists in defining genomic determinants of response to therapy
  - Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  - Harmonization of raw sequence, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  - Application of state-of-the-art methods of generating high level data
GDC Overview: Infrastructure
[Figure: GDC infrastructure in three groups. GDC Users: Data Sources (DCC, CGHub), Data Submitters, Open Access Data Users, Controlled Access Data Users, 3rd Party Applications. GDC System Components and GDC Interfaces: eRA Commons and dbGaP, Data Access Tools, Data Import System, Metadata and Data Storage, Reporting System, Harmonization and Generation Pipelines, Alignment and Processing Tools, Data Submission Tools, Data Security System, APIs, Digital ID System]
GDC Overview: Resources
[Figure: GDC resources: GDC Data Portal, GDC Data Center, GDC Data Transfer Tool, GDC Reports, GDC Data Model, GDC Data Submission Portal, GDC Application Programming Interface (API), GDC Bioinformatics Pipeline, GDC Documentation and Support, GDC Organization and Collaborators]
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
• Contracting organization supporting GDC management and execution
Other: Government, External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
  - Type 1: Large Organizations
    - Users associated with an institution or group with significant informatics resources
    - Large one-time submission or long-term ongoing data submission
    - Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  - Type 2: Researchers or Individual Laboratories
    - Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    - One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    - Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  - GDC data submitters must first apply for data submission authorization through dbGaP
  - Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  - Data Sharing Requirement
    - Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    - The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  - Data Pre-processing Period
    - For each project, the GDC will afford an exclusive pre-processing period that allows submitters to perform data cleaning, quality checks, and submission of revised data before release
    - The pre-processing period may generally last up to six months from the date of first submission
  - Data Submission Period and Release
    - Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  - Data Redaction
    - The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
GDC Data Submission: Data Submission Process
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type
Data Type Data Subtype Format Data Dictionary Template
Administrative Administrative Data TSV JSON Case TSV JSON
Biospecimen Biospecimen Data TSV JSON
Sample Portion Analyte Aliquot
Sample TSV JSON Portion TSV JSON Analyte TSV JSON Aliquot TSV JSON
Clinical Clinical Data TSV JSON
Demographic Diagnosis Exposure Family History Treatment
Demographic TSV JSON Diagnosis TSV JSON Exposure TSV JSON Family History TSV JSON Treatment TSV JSON
Data File
Analysis Metadata SRA XML MAGE-TAB (SDRF IDF) Analysis Metadata TSV JSON
Biospecimen Metadata BCR XML GDC-approved spreadsheet
Biospecimen Metadata TSV JSON
Clinical Metadata BCR XML GDC-approved spreadsheet Clinical Metadata TSV JSON
Data Types and File Formats (2 of 2)
Data Type Data Subtype Format Data Dictionary Template
Data File
Experimental Metadata Pathology Report Run Metadata Slide Image
Submitted Unaligned Reads
SRA XML
PDF SRA XML SVS
FASTQ BAM
Experimental Metadata Pathology Report
Submitted Unaligned Reads
TSV JSON
TSV JSON TSV JSON TSV JSON
TSV JSON
Submitted Aligned Reads BAM
Submitted Aligned Reads TSV JSON
Data Bundle Read Group
Slide
Read Group
Slide
TSV JSON
TSV JSON
166
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions.
• The GDC supports the submission of biospecimen data in XML, JSON, or TSV file format.
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment.
  - GDC clinical data elements were reviewed with members of the research community.
  - Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR).
[Figure: a Case with its Clinical Data, Biospecimen Data, and Experiment Data. Clinical Data comprises Demographics, Diagnosis, Family History (optional), Exposure (optional), and Treatment (optional).]
Data Types and File Formats: Experiment Data
• The GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format.
• The GDC performs and reports on Quality Control (QC) metrics generated by FastQC.
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
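A QC report like the one above is easier to act on once the statuses are tallied per read group. The following is a minimal sketch, assuming the module statuses have already been transcribed into a dictionary; the names `QC_REPORT`, `summarize`, and `flagged_modules` are illustrative and are not part of FastQC or any GDC tool.

```python
from collections import Counter

READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Module statuses transcribed from the example report on the slide,
# one status per read group, in READ_GROUPS order.
QC_REPORT = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def summarize(report, read_groups):
    """Count PASS / WARNING / FAIL statuses for each read group."""
    summary = {rg: Counter() for rg in read_groups}
    for statuses in report.values():
        for rg, status in zip(read_groups, statuses):
            summary[rg][status] += 1
    return summary


def flagged_modules(report, read_groups, level="FAIL"):
    """List, per read group, the modules whose status equals `level`."""
    return {
        rg: [module for module, statuses in report.items() if statuses[i] == level]
        for i, rg in enumerate(read_groups)
    }


if __name__ == "__main__":
    for rg, counts in summarize(QC_REPORT, READ_GROUPS).items():
        print(rg, dict(counts))
    print(flagged_modules(QC_REPORT, READ_GROUPS))
```

Listing the failing modules per read group, rather than only counting them, shows at a glance which read groups need attention before submission.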
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata.
  - Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  - Validate data against GDC standard data types defined in the project data dictionary
  - Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks.
  - Command line interface to specify the desired transfer protocol
  - Uses a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  - Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis.
  - Securely submit biospecimen, clinical, and experiment files using a token
  - Submit data to the GDC by performing create, update, and retrieve actions on entities
  - Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
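The Create Entities endpoint can be illustrated by building the request without sending it. This is a sketch under stated assumptions: the base URL is the one listed on the deck's References slide and may have changed since, the `X-Auth-Token` header name and the entity fields (`type`, `submitter_id`) are assumptions to check against the GDC API documentation, and `build_create_request` is a hypothetical helper.

```python
import json
import urllib.request

# Base URL as listed on the References slide of this deck (2016); treat as an assumption.
GDC_API_BASE = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{GDC_API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Assumed header name for the authentication token.
        "X-Auth-Token": token,
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


if __name__ == "__main__":
    # Hypothetical entity payload, for illustration only.
    example_entities = [{"type": "case", "submitter_id": "EXAMPLE-CASE-001"}]
    request = build_create_request("PROGRAM", "PROJECT", example_entities, "my-token")
    print(request.get_method(), request.full_url)
    # Sending would be urllib.request.urlopen(request), which needs a valid token.
```

Keeping construction separate from sending makes it possible to inspect the URL and payload before any controlled-access credentials are used.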
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data.
• After data processing has been completed, the user can release their data to the GDC. Release must occur within six (6) months of submission, per GDC Data Submission Policies.
• Data is then made available through GDC Data Access Tools as open or controlled access, per the dbGaP authorization policies associated with the data set.
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38).
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]
GDC Processing: GDC High-Level Data Generation
• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations.
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines.
GDC Processing: GDC Variant Calling Pipelines
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Retrieval
• Once data is submitted to the GDC and the data submitter signs off on the data to request release, the GDC performs Quality Control (QC) and data processing.
• Upon successful GDC QC and processing, the data is released subject to the appropriate dbGaP data restrictions.
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API.
• Data queries are based on the GDC Data Model.
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC.
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects by means of unique identifiers.
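The graph idea described on this slide can be sketched with a small in-memory structure. This is an illustrative toy, not the GDC's actual schema or storage engine: the `Graph` class, the node labels, the parent-to-child edge direction, and the identifiers are all assumptions made for the example.

```python
from collections import defaultdict


class Graph:
    """Toy graph: nodes keyed by unique identifier, edges from parent to child."""

    def __init__(self):
        self.nodes = {}                # node_id -> {"label": ..., plus properties}
        self.edges = defaultdict(set)  # parent_id -> set of child_ids

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, **props}

    def add_edge(self, parent_id, child_id):
        if parent_id not in self.nodes or child_id not in self.nodes:
            raise KeyError("both endpoints must exist before linking them")
        self.edges[parent_id].add(child_id)

    def descendants(self, node_id, label=None):
        """Return sorted ids reachable from node_id, optionally filtered by label."""
        seen, stack, found = set(), [node_id], []
        while stack:
            current = stack.pop()
            for child in self.edges.get(current, ()):
                if child in seen:
                    continue
                seen.add(child)
                stack.append(child)
                if label is None or self.nodes[child]["label"] == label:
                    found.append(child)
        return sorted(found)


def build_example():
    """Hypothetical project with two cases, one diagnosis, and two data files."""
    g = Graph()
    g.add_node("P1", "project")
    g.add_node("C1", "case")
    g.add_node("C2", "case")
    g.add_node("D1", "diagnosis")
    g.add_node("F1", "file", file_name="c1.bam")
    g.add_node("F2", "file", file_name="c2.bam")
    for parent, child in [("P1", "C1"), ("P1", "C2"), ("C1", "D1"),
                          ("C1", "F1"), ("C2", "F2")]:
        g.add_edge(parent, child)
    return g


if __name__ == "__main__":
    g = build_example()
    print(g.descendants("C1", label="file"))
    print(g.descendants("P1", label="file"))
```

Because every file is reached only by following edges from its case, a query for a case's files cannot return files that belong to another case, which is the linkage guarantee the slide describes.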
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics.
  - Data browsing by project, file, case, or annotation
  - Visualization allowing users to perform fine-grained filtering of search results
  - Data search using advanced smart search technology
  - Data selection into a personalized cart
  - Data download from the cart or with the high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks.
  - Command line interface to specify the desired transfer protocol and multiple files for download
  - Uses a manifest file generated from the GDC Data Portal for multiple downloads
  - Supports the secure transfer of controlled access data using an authentication key (token)
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis.
  - Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  - Securely retrieve biospecimen, clinical, and molecular files
  - Perform BAM slicing
Example response (portion removed for readability):
  "data": {"hits": [
    {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
    {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
    {"project_id": "TCGA-LAML", "primary_site": "Blood"},
    {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
    {"project_id": "TCGA-UVM", "primary_site": "Eye"},
    {"project_id": "TARGET-AML", "primary_site": "Blood"},
    {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
    {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
    {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
    {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
  ]}
Example request (API URL, endpoint, URL parameters, query parameters):
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
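The request and response on this slide can be reproduced offline as a sketch. The base URL, the `projects` endpoint, the `fields` and `pretty` parameters, and the `data.hits` layout are taken from the slide; the helper names and the shortened sample response are illustrative assumptions, and no network call is made.

```python
import json
from urllib.parse import urlencode

# API URL as shown on the slide (2016); it may differ today.
GDC_API_BASE = "https://gdc-api.nci.nih.gov"


def projects_url(fields, pretty=True):
    """Compose the projects query URL with a comma-separated field list."""
    params = {"fields": ",".join(fields), "pretty": str(pretty).lower()}
    # safe="," keeps the comma between field names readable, as on the slide.
    return f"{GDC_API_BASE}/projects?{urlencode(params, safe=',')}"


def sites_by_project(response_text):
    """Map project_id to primary_site from a response shaped like the slide's."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


# Shortened sample in the shape shown on the slide.
SAMPLE_RESPONSE = json.dumps({
    "data": {
        "hits": [
            {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
            {"project_id": "TCGA-LAML", "primary_site": "Blood"},
            {"project_id": "TARGET-AML", "primary_site": "Blood"},
        ]
    }
})

if __name__ == "__main__":
    print(projects_url(["project_id", "primary_site"]))
    print(sites_by_project(SAMPLE_RESPONSE))
```

Requesting only the fields that are needed keeps responses small, which matters when a query spans every project.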
GDC Web Site
• The GDC Web Site provides access to GDC resources that support and communicate the GDC mission.
  - Targets data consumers, providers, developers, and general users
  - Provides access to information about the GDC and contributed cancer genomic data sets
  - Instructs users on the use of GDC data access and submission tools
  - Provides descriptions of GDC bioinformatics pipelines
  - Documents supported GDC data types and file formats
  - Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary.
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site
  - https://gdc.nci.nih.gov
• GDC Documentation Site
  - https://gdc-docs.nci.nih.gov
• GDC Data Portal
  - https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  - https://gdc-portal.nci.nih.gov/submission
  - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  - https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  - https://gdc-api.nci.nih.gov
• GDC Support
  - GDC Assistance: support@nci-gdc.datacommons.io
  - GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype)
  - Aggregation
  - Access
  - Sharing
  - Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Figure: data flow. Labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF and cancer BAM/VCF; dbGaP and NCI GDC; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries; birth defect VCF, cancer VCF); users.]
[Figure: the GMKF Data Resource shown with related resources: dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics.]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist and, if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
seqr: Online platform for collaborative analysis
https://seqr.broadinstitute.org
Rare Disease Analysis Platform
Genomic Matchmaking
Patient 1 (Clinical Geneticist 1)
  Genotypic data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
  Phenotypic data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5
Patient 2 (Clinical Geneticist 2)
  Genotypic data: Gene D, Gene G, Gene H
  Phenotypic data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6
Genomic Matchmaker: Notification of Match
Courtesy of Joel Krier
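The matchmaking figure reduces to comparing two patient profiles for shared candidate genes and shared phenotypic features. Below is a minimal sketch using set intersection; the `match_profiles` function and its rule that a match requires at least one shared gene are illustrative assumptions, not the scoring used by any Matchmaker Exchange service.

```python
def match_profiles(profile_a, profile_b):
    """Report genes and features shared by two patient profiles.

    Assumed rule for this sketch: a match needs at least one shared
    candidate gene; shared features are reported as supporting evidence.
    """
    shared_genes = sorted(set(profile_a["genes"]) & set(profile_b["genes"]))
    shared_features = sorted(set(profile_a["features"]) & set(profile_b["features"]))
    return {
        "shared_genes": shared_genes,
        "shared_features": shared_features,
        "is_match": bool(shared_genes),
    }


# Profiles transcribed from the figure.
PATIENT_1 = {
    "genes": ["Gene A", "Gene B", "Gene C", "Gene D", "Gene E", "Gene F"],
    "features": ["Feature 1", "Feature 2", "Feature 3", "Feature 4", "Feature 5"],
}
PATIENT_2 = {
    "genes": ["Gene D", "Gene G", "Gene H"],
    "features": ["Feature 1", "Feature 3", "Feature 4", "Feature 5", "Feature 6"],
}

if __name__ == "__main__":
    print(match_profiles(PATIENT_1, PATIENT_2))
```

In the figure's example the two patients share Gene D and four of their features, which is the kind of overlap that would trigger a notification to both clinical geneticists.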
Matchmaker Exchange: Collaboration and Support from GA4GH and IRDiRC
Needs span multiple GA4GH workgroups
• Data Work Group (data format and interfaces)
• Regulatory and Ethics (patient consent)
• Security (patient privacy and user authentication)
Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.
Human Mutation Special Issue
• The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
• The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
• GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
• PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
• Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
• Innovative genomic collaboration using the GENESIS (GEM.app) platform
• Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
• Participant-led matchmaking
• GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
• Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
• Data sharing in the Undiagnosed Disease Network
• The Genomic Birthday Paradox: How Much is Enough?
• Quantifying and mitigating false-positive disease associations in rare disease matchmaking
• Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
• GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue caused by de novo variants in HNRNPK
• Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies
The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery
www.matchmakerexchange.org
MME Use Cases
• Supported today
  - Match on gene/phenotype
    - 1 strong candidate gene (e.g., de novo variant in GUS)
    - 10 candidate genes, each with a rare variant
  - Match on phenotype only
• Coming soon
  - Match case to non-human models
  - Match using phenotype and VCF
  - Patient-initiated matching
Connected and Soon to be Connected Matchmakers
[Figure: network diagram centered on the Matchmaker Exchange. Node labels: GENESIS; GeneMatcher; DECIPHER; RD-Connect; ClinGen GenomeConnect; Monarch; PhenomeCentral; Broad Institute RDAP; patient-initiated matching; model organisms (mouse, zebrafish); Orphanet, ClinVar, OMIM. Legend: Live.]
Connecting Data in the Big Data World
Centralized Database: everyone submits data to a single central database
  Examples: ClinVar, dbGaP, EGA
Centralized Hub: APIs connect each database to a central hub
  Example: many commercial platforms
Federated Network: all databases connected through multiple APIs
  Example: Matchmaker Exchange
Matchmaker Exchange Acknowledgements
S Balasubramanian, Robert Green, Mike Bamshad, Matt Hurles, Sergio Beltran Agullo, Ada Hamosh, Jonathan Berg, Ekta Khurana, Kym Boycott, Sebastian Kohler, Anthony Brookes, Joel Krier, Michael Brudno, Owen Lancaster, Han Brunner, Melissa Landrum, Orion Buske, Paul Lasko, Deanna Church, Rick Lifton, Raymond Dalgliesh, Daniel MacArthur, Andrew Devereau, Alex MacKenzie, Johan den Dunnen, Danielle Metterville, Helen Firth, Debbie Nickerson, Paul Flicek, Woong-Yan Park, Jan Friedman, Justin Paschall, Richard Gibbs, Anthony Philippakis, Marta Girdea, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner
NIAGADS
The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP
April 26, 2016
Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)
• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia
Long-Term Plan for the ADSP
• Discovery Phase, 2014-2018
  - Family-based
    - Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    - Included Caribbean Hispanic families
    - Fully QC'd data released 7/13/2015
  - Case-control
    - Whole exome sequencing (WES): 5000 cases and 5000 controls
    - 1000 additional cases from families multiply affected by AD
    - Included Caribbean Hispanics
    - Fully QC'd data released 112015
• Follow-Up, 2016-2021
  - WGS in case-control sample sets, emphasizing ethnically diverse cohorts
ADSP Discovery Phase Analysis
• NIA funding for analysis of sequence data: June 2014
• Major analysis projects
  - Family-Based Analysis
  - Case-Control Analysis
  - Structural Variants
  - Protective Variants
  - Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing
Participation
• PARTICIPANTS
  - Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  - Three Large-Scale Sequencing Centers: Baylor, Broad, Washington University
  - NIH staff: NHGRI and NIA
  - External Consultants
• INFRASTRUCTURE
  - Executive Committee, Analysis Coordination Committee, and 8 Work Groups
  - Genomics Center
Organization of the ADSP
ADSP Infrastructure and Support
• NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]
• National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]
• National Alzheimer's Coordinating Center, NACC [U01 AG016976]
• Alzheimer's Disease Centers
• NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]
NIAGADS History and Function
• Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
• NIA's repository for genetics data on late-onset Alzheimer's disease
• Datasets, Genomics Database, Analysis Resource
• Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomic Data Sharing Policy
• Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)
NIAGADS ADSP Data Coordinating Center
• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and the Office of Science Policy on "Trusted Partners"
Interface between dbGaP and the NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP
Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomic Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and the eRA Commons account for authentication; no password is stored at the ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data are stored at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH side: after ADSP DAC approval, the investigator uses NIH authentication
• Community side: fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
• Data production: define data formatting; collect and curate phenotypes; track data production; additional quality metrics; coordinate public data release
• Secondary/derived data management: documentation (README/Manifest), packaging, and release; notification and tracking
• IT support: website; members area; face-to-face meeting support; dbGaP exchange area
• Facilitate interactions with other AD investigators
• Rapid response to unforeseen ADSP and NIH requests
Interactions
[Figure: diagram of ADSP interactions among NIAGADS, NIA/NHGRI, the Baylor/Broad/WashU LSACs, NCBI dbGaP/SRA, NCRAD, ADGC/CHARGE, the Genomics Center, and the ADSP Data Flow and other Work Groups, with an Exchange Area for management of Work Group data/work flow.]
ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site
Data Deposition and Data Sharing
• ADSP-specific datasets: 68 TB
• 21 reference datasets with 5560 files in use by the ADSP
• >20 reference datasets planned
Note: 1 TB = 1024 GB = 1048576 MB = 1073741824 KB = 1099511627776 bytes (binary units). A hard drive sold as 1 TB has the capacity to hold a trillion bytes (decimal units).
• dbGaP/NIAGADS release for the research community at large
  - 11555 BAM files released 122014
  - Pedigree and phenotype information
  - Sequencing Quality Control Metrics
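The terabyte note on this slide mixes two conventions: the chain of 1024s is the binary definition, while "a trillion bytes" is the decimal one. The arithmetic can be checked with a short sketch; the function and constant names are illustrative.

```python
BINARY_STEP = 1024
DECIMAL_TB_BYTES = 10 ** 12  # "a trillion bytes", the decimal terabyte


def binary_breakdown(terabytes=1):
    """Express a binary terabyte count in GB, MB, KB, and bytes."""
    gigabytes = terabytes * BINARY_STEP
    megabytes = gigabytes * BINARY_STEP
    kilobytes = megabytes * BINARY_STEP
    n_bytes = kilobytes * BINARY_STEP
    return {"GB": gigabytes, "MB": megabytes, "KB": kilobytes, "bytes": n_bytes}


def binary_excess_percent():
    """Percent by which a binary terabyte exceeds a decimal terabyte."""
    binary_bytes = binary_breakdown(1)["bytes"]
    return (binary_bytes / DECIMAL_TB_BYTES - 1) * 100


if __name__ == "__main__":
    print(binary_breakdown(1))
    print(round(binary_excess_percent(), 2))
```

The difference is just under 10 percent, which becomes tens of terabytes at the dataset sizes reported for the ADSP.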
Size of the ADSP
Reference: Muir et al. "The real cost of sequencing: scaling computation to keep pace with data generation." Genome Biology 2016;17:53.
ADSP Members Area (Dashboard)
Calendar Documents Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
• Member list: 153 records
• 352 meeting minutes
• >600 other documents
• >100 consent files
• 10 analysis plans
• 85TB WGS, 1075TB WES data (dbGaP/SRA)
• 42 TB files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
• SNP/Gene reports
• Genome Browser interface
• Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer's Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
[Figure: search workflow. Labels: GWAS results combined with SNPs below; gene annotations transformed to SNPs; genomic features from identified SNPs.]
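The combined search described here, SNPs that pass a GWAS significance level intersected with SNPs that fall inside annotated genes, can be sketched on toy data. Everything in this example is hypothetical: the SNP identifiers, positions, p-values, gene intervals, and function names are invented for illustration, and the 5e-8 threshold is the conventional genome-wide level rather than a value taken from the NIAGADS site.

```python
GENOME_WIDE_THRESHOLD = 5e-8  # conventional level; an assumption here

# Hypothetical GWAS p-values keyed by SNP identifier.
GWAS_PVALUES = {"snp_1": 3e-10, "snp_2": 0.02, "snp_3": 1e-9, "snp_4": 4e-8}

# Hypothetical SNP coordinates: (chromosome, position).
SNP_POSITIONS = {
    "snp_1": ("chr1", 150),
    "snp_2": ("chr1", 180),
    "snp_3": ("chr2", 900),
    "snp_4": ("chr2", 120),
}

# Hypothetical annotated genes: (chromosome, start, end), inclusive.
GENES_OF_INTEREST = {"GENE_X": ("chr1", 100, 200), "GENE_Y": ("chr2", 100, 300)}


def significant_snps(pvalues, threshold=GENOME_WIDE_THRESHOLD):
    """SNPs whose GWAS p-value is at or below the threshold."""
    return {snp for snp, p in pvalues.items() if p <= threshold}


def snps_in_genes(positions, genes):
    """Map each SNP to the genes whose interval contains it."""
    hits = {}
    for snp, (chrom, pos) in positions.items():
        for gene, (g_chrom, start, end) in genes.items():
            if chrom == g_chrom and start <= pos <= end:
                hits.setdefault(snp, []).append(gene)
    return hits


def combined_search(pvalues, positions, genes, threshold=GENOME_WIDE_THRESHOLD):
    """Significant SNPs that also lie within an annotated gene."""
    in_genes = snps_in_genes(positions, genes)
    return {snp: in_genes[snp]
            for snp in sorted(significant_snps(pvalues, threshold))
            if snp in in_genes}


if __name__ == "__main__":
    print(combined_search(GWAS_PVALUES, SNP_POSITIONS, GENES_OF_INTEREST))
```

Converting gene annotations to the SNPs they contain first, then intersecting with the GWAS hits, mirrors the order of steps shown in the figure labels.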
Find a genomic location of interest by viewing tracks on the genome browser
[Figure: genome browser tracks: SNPs, genes, GWAS results, functional genomics data.]
View Genomic Information by the Genome Browser
[Figure: genome browser search result.]
Data available at the NIAGADS Genomics Database
• Gene models
• SNP information
• Pathway annotations: Gene Ontology, KEGG Pathway
• Functional genomics data: ENCODE, FANTOM5
• GTEx results
• NHGRI GWAS catalog
• Published genome-wide AD GWAS summary statistics
Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  - Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  - Big data management
  - IT infrastructure and web development
• Administrative support and collaboration
  - Familiarity with human subjects research regulations and NIH policy
  - Leadership and coordinating skills
  - Flexibility, close collaboration with program officers
• Domain expertise
  - Knowledge of the disease and genetic research
  - Familiarity with the community and cohorts
  - Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
What the Data Coordinating Center Should Know
Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements NIAGADS EAB Matthew Farrer
Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober
NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly bull Data flow WG
• Annotation WG IGAP/ADGC/CHARGE/EADI/GERAD
NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon
NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao
GDC Overview: GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.
• Provide a cancer knowledge network that
  - Enables the identification of both high- and low-frequency cancer drivers
  - Assists in defining genomic determinants of response to therapy
  - Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  - Harmonization of raw sequence, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  - Application of state-of-the-art methods of generating high-level data
GDC Overview: Infrastructure
[Figure: GDC infrastructure diagram with three legend groups (GDC Users, GDC System Components, GDC Interfaces). Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; 3rd Party Applications; Alignment & Processing Tools; Data Submission Tools; Data Security System; APIs; Digital ID System.]
GDC Overview: Resources
[Figure: GDC resources. Labels: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators.]
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC government sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testing (UAT) testers
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
  - Type 1: Large Organizations
    - Users associated with an institution or group with significant informatics resources
    - Large one-time submission or long-term ongoing data submission
    - Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  - Type 2: Researchers or Individual Laboratories
    - Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    - One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    - Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  - GDC data submitters must first apply for data submission authorization through dbGaP
  - Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  - Data Sharing Requirement
    - Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification.
    - The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores.
  - Data Pre-processing Period
    - For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release.
    - The pre-processing period may generally last up to six months from the date of first submission.
  - Data Submission Period and Release
    - Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication.
  - Data Redaction
    - The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data.
GDC Data Submission: Data Submission Process
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC.
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter's credentials. Data submitters must have an eRA Commons account and authorization for the study in dbGaP.
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
163
GDC Data Submission Upload and Validate Data
bull Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal GDC Data Transfer Tool or GDC Application Programming Interface (API)
bull Uploaded data must follow GDC supported data types and file formats which leverage and extend existing data standards for biospecimen clinical and experiment data
164
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample: TSV, JSON; Portion: TSV, JSON; Analyte: TSV, JSON; Aliquot: TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic: TSV, JSON; Diagnosis: TSV, JSON; Exposure: TSV, JSON; Family History: TSV, JSON; Treatment: TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
Data Types and File Formats Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
Data Types and File Formats Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
– GDC clinical data elements were reviewed with members of the research community
– Clinical data elements are defined in the GDC dictionary and registered in the NCI’s Cancer Data Standards Repository (caDSR)
[Figure: a Case links to Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)), Biospecimen Data, and Experiment Data]
Data Types and File Formats Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
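A QC report like the one above is easiest to act on once the non-passing modules are pulled out per read group. The following is a minimal sketch in Python, not the GDC's own reporting code: the dictionary simply transcribes the rows of the example table that contain a WARNING or FAIL, and the function name `flagged_modules` is a hypothetical helper introduced for illustration.

```python
# Rows of the example QC table that are not PASS for every read group.
# Values are listed in read-group order (RG_1 .. RG_4).
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]
QC_STATUS = {
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged_modules(status_table, read_groups, level):
    """Map each read group to the QC modules that report the given status."""
    flagged = {rg: [] for rg in read_groups}
    for module, statuses in status_table.items():
        for rg, status in zip(read_groups, statuses):
            if status == level:
                flagged[rg].append(module)
    return flagged


failures = flagged_modules(QC_STATUS, READ_GROUPS, "FAIL")
warnings = flagged_modules(QC_STATUS, READ_GROUPS, "WARNING")

for rg in READ_GROUPS:
    print(rg, "FAIL:", failures[rg], "WARNING:", warnings[rg])
```

In this example only RG_1 and RG_2 carry a FAIL, so a submitter would likely review those two read groups first.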
Data Submission Tools GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
– Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
– Validate data against GDC standard data types defined in the project data dictionary
– Obtain information on the status of data submission and processing by project
Data Submission Tools GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
– Command line interface to specify desired transfer protocol
– Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
– Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
– Securely submit biospecimen, clinical, and experiment files using a token
– Submit data to GDC by performing create, update, and retrieve actions on entities
– Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
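To make the create-entities call concrete, here is a hedged Python sketch that only builds the HTTP request and does not send it. Several details are assumptions rather than facts from the slides: the path separators in `/v0/submission/<program>/<project>` are reconstructed, the base URL is the API address listed in the deck's references, the `X-Auth-Token` header name is assumed to be how the token is passed, and the entity payload is a made-up placeholder rather than a valid GDC dictionary record.

```python
import json
import urllib.request

# Base URL as listed in the deck's references (assumed to serve the submission API).
API_BASE = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST request that would create entities."""
    url = f"{API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the authentication token
        },
    )


# Hypothetical program, project, payload, and token, for illustration only.
request = build_create_request(
    "PROGRAM",
    "PROJECT",
    [{"type": "case", "submitter_id": "example-case-1"}],
    "dummy-token",
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the submission; it is left out here so the sketch runs without network access or credentials.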
GDC Data Submission Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Processing Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]
GDC Processing GDC High Level Data Generation
• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array-based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing GDC Variant Calling Pipelines
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Retrieval
• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
GDC Data Retrieval GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
– Data browsing by project, file, case, or annotation
– Visualization allowing users to perform fine-grained filtering of search results
– Data search using advanced smart search technology
– Data selection into a personalized cart
– Data download from cart or a high-performance Data Transfer Tool
GDC Data Retrieval GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
– Command line interface to specify desired transfer protocol and multiple files for download
– Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
– Supports the secure transfer of controlled access data using an authentication key (token)
Data Retrieval Tools GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
– Search for projects, files, cases, and annotations and retrieve associated details in JSON format
– Securely retrieve biospecimen, clinical, and molecular files
– Perform BAM slicing
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Callouts on the URL: API URL, Endpoint, URL parameters, Query parameters
{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
(Portion removed for readability)
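The query shown on this slide can be assembled and its response summarized with a few lines of Python. This is a sketch under stated assumptions, not official client code: the base URL and the `fields` and `pretty` parameters come from the slide, the punctuation of the URL is reconstructed, and `build_query_url` and `projects_by_site` are hypothetical helper names. The sample response is a shortened copy of the hits shown on the slide, so nothing is fetched over the network.

```python
from urllib.parse import urlencode

API_BASE = "https://gdc-api.nci.nih.gov"  # API address as given in the deck


def build_query_url(endpoint, fields, pretty=True):
    """Build a search URL that asks an endpoint for a comma-separated field list."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # Keep commas unescaped so the URL matches the form shown on the slide.
    return f"{API_BASE}/{endpoint}?{urlencode(params, safe=',')}"


def projects_by_site(response):
    """Group project IDs by primary site from a parsed JSON response."""
    grouped = {}
    for hit in response["data"]["hits"]:
        grouped.setdefault(hit["primary_site"], []).append(hit["project_id"])
    return grouped


url = build_query_url("projects", ["project_id", "primary_site"])

# Shortened copy of the example response from the slide.
sample_response = {
    "data": {
        "hits": [
            {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
            {"project_id": "TCGA-LAML", "primary_site": "Blood"},
            {"project_id": "TARGET-AML", "primary_site": "Blood"},
        ]
    }
}
grouped = projects_by_site(sample_response)
print(url)
print(grouped)
```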
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
– Targets data consumers, providers, developers, and general users
– Provides access to information about GDC and contributed cancer genomic data sets
– Instructs users on the use of GDC data access and submission tools
– Provides descriptions of GDC bioinformatics pipelines
– Documents supported GDC data types and file formats
– Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User’s Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site – https://gdc.nci.nih.gov
• GDC Documentation Site – https://gdc-docs.nci.nih.gov
• GDC Data Portal – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal – https://gdc-portal.nci.nih.gov/submission – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API) – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api – https://gdc-api.nci.nih.gov
• GDC Support – GDC Assistance: support@nci-gdc.datacommons.io – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data
– Aggregation
– Access
– Sharing
– Analysis
Whole genome sequence + Phenotype
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Figure: data flow. Labels: Birth defect or childhood cancer cohorts; DNA Sequencing center; Birth defect BAM/VCF; Cancer BAM/VCF; dbGaP; NCI GDC; Birth defect VCF; Cancer VCF; Gabriella Miller Kids First Data Resource (Index of datasets, Phenotype, Variant summaries); Users]
[Figure: the GMKF Data Resource shown alongside dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and Center for Mendelian Genomics]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
seqr Online platform for collaborative analysis
Rare Disease Analysis Platform seqr Online platform for collaborative analysis
Genomic Matchmaking
Patient 1 (Clinical Geneticist 1)
Genotypic Data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
Phenotypic Data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5
Patient 2 (Clinical Geneticist 2)
Genotypic Data: Gene D, Gene G, Gene H
Phenotypic Data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6
Genomic Matchmaker: Notification of Match
Courtesy of Joel Krier
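The two-patient example above can be expressed as a small matching routine. The sketch below is illustrative only and is not the Matchmaker Exchange scoring method: it assumes a match is reported when two profiles share at least one candidate gene, and it uses Jaccard overlap of phenotype features as a simple similarity measure. `PatientProfile` and `match` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatientProfile:
    label: str
    genes: frozenset      # candidate genes
    features: frozenset   # phenotypic features


def match(a, b):
    """Return the shared candidate genes and the Jaccard overlap of features."""
    shared_genes = a.genes & b.genes
    shared_features = a.features & b.features
    all_features = a.features | b.features
    similarity = len(shared_features) / len(all_features) if all_features else 0.0
    return shared_genes, similarity


# Profiles transcribed from the slide: genes A-F vs. D, G, H; features 1-5 vs. 1, 3, 4, 5, 6.
patient_1 = PatientProfile("Patient 1", frozenset("ABCDEF"), frozenset({1, 2, 3, 4, 5}))
patient_2 = PatientProfile("Patient 2", frozenset("DGH"), frozenset({1, 3, 4, 5, 6}))

shared_genes, similarity = match(patient_1, patient_2)
if shared_genes:
    print(f"Match on gene(s) {sorted(shared_genes)}; phenotype overlap {similarity:.2f}")
```

Here the only shared candidate is Gene D, and four of the six distinct features overlap, which is the situation in which both clinical geneticists would be notified.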
Matchmaker Exchange Collaboration and Support from GA4GH and IRDiRC
Needs span multiple GA4GH workgroups
• Data Work Group (data format and interfaces)
• Regulatory and Ethics (patient consent)
• Security (patient privacy and user authentication)
Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.
Human Mutation Special Issue
The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
Innovative genomic collaboration using the GENESIS (GEM.app) platform
Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
Participant-led matchmaking
GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
Data sharing in the Undiagnosed Disease Network
The Genomic Birthday Paradox: How Much is Enough?
Quantifying and mitigating false-positive disease associations in rare disease matchmaking
Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue caused by de novo variants in HNRNPK
Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies
The Matchmaker Exchange Connecting Matchmakers to Accelerate Gene Discovery
www.matchmakerexchange.org
MME Use Cases
• Supported today
– Match on Gene/Phenotype
  1 strong candidate gene (e.g., de novo variant in GUS)
  10 candidate genes, each with rare variant
– Match on Phenotype Only
• Coming soon
– Match case to non-human models
– Match using phenotype and VCF
– Patient-initiated matching
Connected and Soon to be Connected Matchmakers
[Figure: Matchmaker Exchange network. Labels: GENESIS; GeneMatcher; DECIPHER; RD-Connect; ClinGen GenomeConnect; Monarch; PhenomeCentral; Broad Institute RDAP; Patient-initiated matching; Model organisms (mouse, zebrafish), Orphanet, ClinVar, OMIM. Legend: Live]
Connecting Data in the Big Data World
Centralized Database: Everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA
Centralized Hub: APIs connect each database to a central hub. Example: Many commercial platforms
Federated Network: All databases connected through multiple APIs. Example: Matchmaker Exchange
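One way to see the trade-off between the hub and federated designs is to count the API connections each needs. The Python sketch below is a simplification under an explicit assumption: the federated case is modeled as every database connecting directly to every other one, which the slide does not state and which a real network such as the Matchmaker Exchange need not satisfy. The function names and the node count are hypothetical.

```python
def hub_links(n_databases: int) -> int:
    """Centralized hub: each database keeps one connection to the hub."""
    return n_databases


def federated_links(n_databases: int) -> int:
    """Fully connected federation (assumed): one connection per pair of databases."""
    return n_databases * (n_databases - 1) // 2


n = 7  # hypothetical number of participating databases
print(f"{n} databases: hub needs {hub_links(n)} links, "
      f"full federation needs {federated_links(n)} links")
```

The federated count grows quadratically, which is one reason a shared, standardized API matters more as participants are added.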
Matchmaker Exchange Acknowledgements
S Balasubramanian, Mike Bamshad, Sergio Beltran Agullo, Jonathan Berg, Kym Boycott, Anthony Brookes, Michael Brudno, Han Brunner, Orion Buske, Deanna Church, Raymond Dalgleish, Andrew Devereau, Johan den Dunnen, Helen Firth, Paul Flicek, Jan Friedman, Richard Gibbs, Marta Girdea, Robert Green, Matt Hurles, Ada Hamosh, Ekta Khurana, Sebastian Köhler, Joel Krier, Owen Lancaster, Melissa Landrum, Paul Lasko, Rick Lifton, Daniel MacArthur, Alex MacKenzie, Danielle Metterville, Debbie Nickerson, Woong-Yang Park, Justin Paschall, Anthony Philippakis, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner
NIAGADS
The NIA Genetics of Alzheimer’s Disease Data Storage Site: A Partnership with dbGaP
April 26, 2016
Background: Initiation of the Alzheimer’s Disease Sequencing Project (ADSP)
• February 7, 2012: Presidential Initiative announced to fight Alzheimer’s Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia
Long-Term Plan for the ADSP
• Discovery Phase 2014-2018
– Family-based
  • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
  • Included Caribbean Hispanic families
  • Fully QC’d data released 7/13/2015
– Case-control
  • Whole exome sequencing (WES): 5,000 cases and 5,000 controls
  • 1,000 additional cases from families multiply affected by AD
  • Included Caribbean Hispanics
  • Fully QC’d data released 11/2015
• Follow-Up 2016-2021: WGS in case-control sample sets, emphasizing ethnically diverse cohorts
ADSP Discovery Phase Analysis
• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing
Participation
• PARTICIPANTS
– Two AD Genetics Consortia: Alzheimer’s Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
– Three Large-Scale Sequencing Centers: Baylor, Broad, Washington University
– NIH staff: NHGRI and NIA
– External Consultants
• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups
Genomics Center
Organization of the ADSP
ADSP Infrastructure and Support
NIA Genetics of Alzheimer’s Disease Data Storage Site, NIAGADS [U24 AG041689]
National Cell Repository for Alzheimer’s Disease, NCRAD [U24 AG021886]
National Alzheimer’s Coordinating Center, NACC [U01 AG016976]
Alzheimer’s Disease Centers
NIA Center for Genetics and Genomics of Alzheimer’s Disease, CGAD [U54 AG052427]
NIAGADS History and Function
Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
NIA’s repository for the genetics of late-onset Alzheimer’s disease data
Datasets, Genomics Database, Analysis Resource
Compliance with the NIA AD Genetics of Alzheimer’s Disease Sharing Policy and the NIH Genomics Data Sharing Policy
Data Coordinating Center for the Alzheimer’s Disease Sequencing Project (ADSP)
NIAGADS ADSP Data Coordinating Center
• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and Office of Science Policy on “Trusted Partners”
Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP
Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
Data production: Define data formatting; Collect and curate phenotypes; Track data production; Additional quality metrics; Coordinate public data release
Secondary/derived data management: Documentation (README/Manifest), packaging, and release; Notification and tracking
IT support: Website; Members area; Face-to-face meeting support; dbGaP exchange area
Facilitate interactions with other AD investigators
Rapid response to unforeseen ADSP and NIH requests
Interactions
[Figure: interactions among NIAGADS, NIA/NHGRI, Baylor/Broad/WashU LSACs, NCBI dbGaP/SRA, ADSP Data Flow and other Work Groups, Genomics Center, NCRAD, and ADGC/CHARGE]
ADGC: Alzheimer’s Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer’s Disease
NIAGADS: NIA Genetics of Alzheimer’s Disease Data Storage Site
Management of Work Group Data/Work Flow
Exchange Area
Data Deposition and Data Sharing
ADSP specific datasets- 68 Tb
21 Reference datasets with 5560 files in use by the ADSP
>20 Reference datasets planned
1 TB = 1,024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes. A 1 TB hard drive has the capacity to hold about a trillion bytes
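The unit chain above uses binary multiples (factors of 1,024), while "a trillion bytes" is the decimal figure drive vendors typically quote. A short Python check, offered as a sketch, reproduces the chain and shows the size of the gap between the two conventions; the variable names are illustrative.

```python
# Rebuild the conversion chain: each step down multiplies by 1,024.
chain = []
value = 1  # 1 TB in binary units
for unit in ["GB", "MB", "KB", "bytes"]:
    value *= 1024
    chain.append((unit, value))

binary_tb_bytes = chain[-1][1]   # 1,024^4 bytes
decimal_tb_bytes = 10 ** 12      # one trillion bytes

for unit, amount in chain:
    print(f"1 TB = {amount:,} {unit}")
print(f"Difference from one trillion bytes: {binary_tb_bytes - decimal_tb_bytes:,} bytes "
      f"({binary_tb_bytes / decimal_tb_bytes - 1:.1%})")
```

The roughly 10% gap is worth keeping in mind when comparing reported dataset sizes against advertised disk capacity.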
dbGaP/NIAGADS release for the research community at large
11,555 BAM files released 12/2014
Pedigree and phenotype information
Sequencing Quality Control Metrics
Size of the ADSP
Reference: “The real cost of sequencing: scaling computation to keep pace with data generation,” Muir et al., Genome Biology 2016, 17:53
ADSP Members Area (Dashboard)
Calendar Documents Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list 153 records
352 meeting minutes
>600 other documents
>100 consent files
10 analysis plans
85TB WGS 1075TB WES data (dbGaPSRA)
42 TB Files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
SNP/Gene reports, Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer’s Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
GWAS results combined with
SNPs below
Gene annotations transformed to SNPs
Genomic features from identified SNPs
Find a genomic location of interest by viewing tracks on the genome browser
SNPs
genes
GWAS results
functional genomics data
View Genomic Information by the Genome Browser
search result
Data available at the NIAGADS Genomics Database
Gene models SNP information
Pathway annotations Gene Ontology
KEGG Pathway
Functional genomics data ENCODE
FANTOM5
GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics Ongoing effort to deposit annotations curated by ADSP AnnotationWG into Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
– Expertise in genomics, next-generation sequencing, method and software tool development, high-performance computing
– Big data management
– IT infrastructure and web development
• Administrative support and collaboration
– Familiarity with human subjects research regulations and NIH policy
– Leadership and coordinating skills
– Flexibility, close collaboration with program officers
• Domain expertise
– Knowledge of the disease and genetic research
– Familiarity with the community and cohorts
– Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements NIAGADS EAB Matthew Farrer
Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober
NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly bull Data flow WG
bull Annotation WG IGAPADGCCHARGEEADIGERAD
NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon
NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao
GDC Overview GDC Data Submission Processing and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
– Enables the identification of both high- and low-frequency cancer drivers
– Assists in defining genomic determinants of response to therapy
– Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
– Harmonization of raw sequence both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
– Application of state-of-the-art methods of generating high-level data
GDC Overview Infrastructure
[Figure: GDC infrastructure, organized as GDC Users, GDC System Components, and GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; 3rd Party Applications; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; APIs; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System]
GDC Overview Resources
[Figure: GDC resources: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators]
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Submission Data Submitter Types
• The GDC supports two types of data submitters
– Type 1: Large Organizations
• Users associated with an institution or group with significant informatics resources
• Large one-time submission or long-term ongoing data submission
• Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
– Type 2: Researchers or Individual Laboratories
• Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
• One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
• Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool that use the GDC API
GDC Data Submission Data Submission Policies
• dbGaP Data Submission Policies
– GDC data submitters must first apply for data submission authorization through dbGaP
– Data submission through dbGaP requires institutional certification under NIH’s Genomic Data Sharing Policy
• GDC Data Sharing Policies
– Data Sharing Requirement
• Data submitted to the GDC will be made available to the scientific community according to the data submitter’s NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
• The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
– Data Pre-processing Period
• For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
• The pre-processing period may generally last up to six months from the date of first submission
– Data Submission Period and Release
• Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset’s “data use limitations,” either six months after data submission or at the time of first publication
ndash Data Redaction bull The GDC in general will not remove data access in response to submitter requests GDC will
remove data access in the following events Data Management Incident Human Subjects 161
Compliance Issue Erroneous Data
GDC Data Submission Data Submission Process
162
GDC Data Submission dbGaP Registration
bull Data submitters register their studies in dbGaP submit the Subject Identifiers associated with the study in dbGaP and contact the GDC
bull The GDC verifies that the study has been registered in dbGaP verifies the data submitter credentials Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process httpwwwncbinlmnihgovprojectsgapcgi-binGetPdfcgidocument_name=HowToSubmitpdf
163
GDC Data Submission Upload and Validate Data
bull Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal GDC Data Transfer Tool or GDC Application Programming Interface (API)
bull Uploaded data must follow GDC supported data types and file formats which leverage and extend existing data standards for biospecimen clinical and experiment data
164
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample: TSV, JSON; Portion: TSV, JSON; Analyte: TSV, JSON; Aliquot: TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic: TSV, JSON; Diagnosis: TSV, JSON; Exposure: TSV, JSON; Family History: TSV, JSON; Treatment: TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
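The TSV and JSON templates named on these slides carry the same records in two encodings. The following is a minimal sketch of that equivalence, not the GDC's own tooling: the column names (`type`, `submitter_id`, `sample_type`) and the row values are hypothetical placeholders, and real field names would come from the GDC data dictionary.

```python
import csv
import io
import json

# Hypothetical TSV template content. The column names and values here are
# illustrative assumptions, not fields taken from the GDC data dictionary.
TSV_TEMPLATE = (
    "type\tsubmitter_id\tsample_type\n"
    "sample\tsample-001\tPrimary Tumor\n"
    "sample\tsample-002\tBlood Derived Normal\n"
)


def tsv_to_records(tsv_text):
    """Parse tab-separated text into a list of dicts, one per data row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [dict(row) for row in reader]


records = tsv_to_records(TSV_TEMPLATE)
as_json = json.dumps(records, indent=2)
print(as_json)
```

Either encoding could then be validated against the dictionary entry for the record type before upload.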
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  - GDC clinical data elements were reviewed with members of the research community
  - Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
[Figure: a Case with its Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)), Biospecimen Data, and Experiment Data]
Data Types and File Formats: Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FASTQC
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
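A report like the one above is easiest to act on when collapsed to one status count per read group. The sketch below shows one way to do that with the slide's example values; it is an illustration only, and the function names are hypothetical rather than part of FASTQC or the GDC.

```python
from collections import Counter

READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Module statuses copied from the example report on the slide,
# one status per read group in the order of READ_GROUPS.
QC_REPORT = {
    "Basic Statistics": "PASS PASS PASS PASS",
    "Per base sequence quality": "PASS PASS PASS PASS",
    "Per tile sequence quality": "PASS WARNING PASS PASS",
    "Per sequence quality scores": "FAIL PASS PASS PASS",
    "Per base sequence content": "PASS PASS WARNING PASS",
    "Per sequence GC content": "PASS PASS PASS PASS",
    "Per base N content": "PASS PASS PASS PASS",
    "Sequence Length Distribution": "PASS PASS PASS PASS",
    "Sequence Duplication Levels": "WARNING WARNING PASS PASS",
    "Overrepresented sequences": "PASS PASS PASS PASS",
    "Adapter Content": "PASS FAIL PASS PASS",
    "Kmer Content": "PASS PASS WARNING PASS",
}


def summarize(report, read_groups):
    """Count PASS/WARNING/FAIL statuses for each read group."""
    summary = {rg: Counter() for rg in read_groups}
    for statuses in report.values():
        for rg, status in zip(read_groups, statuses.split()):
            summary[rg][status] += 1
    return summary


def failing_read_groups(summary):
    """Return read groups that have at least one FAIL module."""
    return [rg for rg, counts in summary.items() if counts["FAIL"] > 0]


summary = summarize(QC_REPORT, READ_GROUPS)
failed = failing_read_groups(summary)
print(failed)
```

In this example RG_1 and RG_2 each have one failing module, RG_3 has only warnings, and RG_4 passes every module.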
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  - Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  - Validate data against GDC standard data types defined in the project data dictionary
  - Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  - Command line interface to specify desired transfer protocol
  - Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  - Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  - Securely submit biospecimen, clinical, and experiment files using a token
  - Submit data to GDC by performing create, update, and retrieve actions on entities
  - Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
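As a rough illustration of the create-entities call, the sketch below builds (but does not send) a POST request for the endpoint shown on the slide. Several details are assumptions rather than facts from the deck: the API root is taken from the deck's reference list, the `X-Auth-Token` header name, the payload fields, and the program and project names are placeholders.

```python
import json
from urllib.request import Request

# Assumption: API root as listed in the deck's references.
API_ROOT = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build a POST request for /v0/submission/<program>/<project>.

    The request is only constructed here, never sent.
    """
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Auth-Token": token,  # assumed header name for the token
    }
    return Request(url, data=body, headers=headers, method="POST")


# Hypothetical program, project, and entity values for illustration.
request = build_create_request(
    "MYPROGRAM",
    "MYPROJECT",
    [{"type": "case", "submitter_id": "case-001"}],
    token="PLACEHOLDER-TOKEN",
)
print(request.get_method(), request.full_url)
```

Keeping request construction separate from sending makes it possible to inspect the URL and payload before any controlled-access token leaves the machine.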
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]
GDC Processing: GDC High Level Data Generation
• The GDC generates high level data including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP Array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
GDC Data Retrieval
• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
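The idea of a graph whose nodes are linked by unique identifiers can be sketched in a few lines. This is a toy in-memory model, not the GDC's implementation: the class name, node labels, edge labels, and identifiers are all hypothetical.

```python
from collections import defaultdict


class GraphStore:
    """Toy graph: nodes keyed by unique ID, edges pointing child -> parent."""

    def __init__(self):
        self.nodes = {}
        self.incoming = defaultdict(list)  # parent ID -> [(edge label, child ID)]

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, **props}

    def add_edge(self, src, edge_label, dst):
        # Refuse dangling links so every edge joins two known identifiers.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must exist before linking")
        self.incoming[dst].append((edge_label, src))

    def children(self, node_id, label):
        return [src for _, src in self.incoming[node_id]
                if self.nodes[src]["label"] == label]

    def files_for_project(self, project_id):
        """Walk project -> cases -> files."""
        files = []
        for case_id in self.children(project_id, "case"):
            files.extend(self.children(case_id, "file"))
        return files


# Hypothetical identifiers and edge labels for illustration.
graph = GraphStore()
graph.add_node("project-1", "project")
for case_id in ("case-1", "case-2"):
    graph.add_node(case_id, "case")
    graph.add_edge(case_id, "member_of", "project-1")
for file_id, case_id in (("file-a", "case-1"), ("file-b", "case-1"), ("file-c", "case-2")):
    graph.add_node(file_id, "file")
    graph.add_edge(file_id, "data_from", case_id)

project_files = graph.files_for_project("project-1")
print(project_files)
```

Rejecting edges to unknown identifiers is the small-scale analogue of the slide's point that data must stay correctly linked to the file objects.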
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  - Data browsing by project, file, case, or annotation
  - Visualization allowing users to perform fine-grained filtering of search results
  - Data search using advanced smart search technology
  - Data selection into a personalized cart
  - Data download from cart or a high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  - Command line interface to specify desired transfer protocol and multiple files for download
  - Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  - Supports the secure transfer of controlled access data using an authentication key (token)
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  - Search for projects, files, cases, and annotations and retrieve associated details in JSON format
  - Securely retrieve biospecimen, clinical, and molecular files
  - Perform BAM slicing
Example request (API URL, endpoint, URL parameters, query parameters): https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Example response (portion removed for readability):
  "data": { "hits": [
    { "project_id": "TCGA-SKCM", "primary_site": "Skin" },
    { "project_id": "TCGA-PCPG", "primary_site": "Nervous System" },
    { "project_id": "TCGA-LAML", "primary_site": "Blood" },
    { "project_id": "TCGA-CNTL", "primary_site": "Not Applicable" },
    { "project_id": "TCGA-UVM", "primary_site": "Eye" },
    { "project_id": "TARGET-AML", "primary_site": "Blood" },
    { "project_id": "TCGA-SARC", "primary_site": "Mesenchymal" },
    { "project_id": "TCGA-LUSC", "primary_site": "Lung" },
    { "project_id": "TARGET-NBL", "primary_site": "Nervous System" },
    { "project_id": "TCGA-PAAD", "primary_site": "Pancreas" }
  ] }
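The example request and response can be reproduced offline as a small sketch. It assembles the same query URL and reads a trimmed copy of the response from the slide; no network call is made, the helper names are hypothetical, and the API root is the one listed in the deck's references, which may differ from any current address.

```python
import json
from urllib.parse import urlencode

# Assumption: API root as listed in the deck's references.
API_ROOT = "https://gdc-api.nci.nih.gov"


def build_query_url(endpoint, fields, pretty=True):
    """Assemble an endpoint URL with a comma-separated field list."""
    params = {"fields": ",".join(fields), "pretty": str(pretty).lower()}
    # safe="," keeps the field separator readable instead of encoding it.
    return f"{API_ROOT}/{endpoint}?{urlencode(params, safe=',')}"


def sites_by_project(response_text):
    """Map project_id to primary_site from a response shaped like the slide's."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


# Two of the hits shown on the slide, used here as a canned response.
SAMPLE_RESPONSE = json.dumps({
    "data": {"hits": [
        {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
        {"project_id": "TCGA-UVM", "primary_site": "Eye"},
    ]}
})

url = build_query_url("projects", ["project_id", "primary_site"])
sites = sites_by_project(SAMPLE_RESPONSE)
print(url)
print(sites)
```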
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  - Targets data consumers, providers, developers, and general users
  - Provides access to information about GDC and contributed cancer genomic data sets
  - Instructs users on the use of GDC data access and submission tools
  - Provides descriptions of GDC bioinformatics pipelines
  - Documents supported GDC data types and file formats
  - Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site - https://gdc.nci.nih.gov
• GDC Documentation Site - https://gdc-docs.nci.nih.gov
• GDC Data Portal - https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal - https://gdc-portal.nci.nih.gov/submission - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool - https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API) - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api - https://gdc-api.nci.nih.gov
• GDC Support - GDC Assistance: support@nci-gdc.datacommons.io - GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype):
  - Aggregation
  - Access
  - Sharing
  - Analysis
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Figure: Kids First data flow. Labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF and cancer BAM/VCF; dbGaP; NCI GDC; birth defect VCF and cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users]
[Figure: the GMKF Data Resource shown alongside dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and Center for Mendelian Genomics]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
seqr Online platform for collaborative analysis
Rare Disease Analysis Platform seqr Online platform for collaborative analysis
Genomic Matchmaking
Patient 1 (Clinical Geneticist 1)
  Genotypic Data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
  Phenotypic Data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5
Patient 2 (Clinical Geneticist 2)
  Genotypic Data: Gene D, Gene G, Gene H
  Phenotypic Data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6
Genomic Matchmaker: Notification of Match
Courtesy of Joel Krier
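The match in this figure comes down to overlap between two profiles: both patients share Gene D and four phenotypic features. The sketch below expresses that as plain set intersection. It is a deliberate simplification under stated assumptions; real matchmakers use structured terms and scoring, and the class and function names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatientProfile:
    label: str
    genes: frozenset
    features: frozenset


def find_match(first, second):
    """Report shared candidate genes and shared phenotypic features.

    Assumption for illustration: a match requires at least one shared
    gene and at least one shared feature.
    """
    shared_genes = first.genes & second.genes
    shared_features = first.features & second.features
    return {
        "shared_genes": sorted(shared_genes),
        "shared_features": sorted(shared_features),
        "is_match": bool(shared_genes) and bool(shared_features),
    }


# Profiles copied from the figure.
patient_1 = PatientProfile(
    "Patient 1",
    frozenset(f"Gene {g}" for g in "ABCDEF"),
    frozenset(f"Feature {n}" for n in (1, 2, 3, 4, 5)),
)
patient_2 = PatientProfile(
    "Patient 2",
    frozenset(f"Gene {g}" for g in "DGH"),
    frozenset(f"Feature {n}" for n in (1, 3, 4, 5, 6)),
)

result = find_match(patient_1, patient_2)
print(result)
```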
Matchmaker Exchange: Collaboration and Support from GA4GH and IRDiRC
Needs span multiple GA4GH workgroups
• Data Work Group (data format and interfaces)
• Regulatory and Ethics (patient consent)
• Security (patient privacy and user authentication)
Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.
Human Mutation Special Issue
• The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
• The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
• GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
• PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
• Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
• Innovative genomic collaboration using the GENESIS (GEM.app) platform
• Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
• Participant-led matchmaking
• GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
• Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
• Data sharing in the Undiagnosed Disease Network
• The Genomic Birthday Paradox: How Much is Enough?
• Quantifying and mitigating false-positive disease associations in rare disease matchmaking
• Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
• GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue caused by de novo variants in HNRNPK
• Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies
The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery
www.matchmakerexchange.org
MME Use Cases
• Supported today
  - Match on Gene/Phenotype: 1 strong candidate gene (e.g., de novo variant in GUS); 10 candidate genes, each with rare variant
  - Match on Phenotype Only
• Coming soon
  - Match case to non-human models
  - Match using phenotype and VCF
  - Patient-initiated matching
Connected and Soon to be Connected Matchmakers
[Figure: Matchmaker Exchange network. Labels: GENESIS; GeneMatcher; DECIPHER; RD-Connect; ClinGen GenomeConnect; Monarch; PhenomeCentral; Broad Institute RDAP; patient-initiated matching; model organisms (mouse, zebrafish); Orphanet; ClinVar; OMIM; legend: Live]
Connecting Data in the Big Data World
• Centralized Database: everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA
• Centralized Hub: APIs connect each database to a central hub. Example: many commercial platforms
• Federated Network: all databases connected through multiple APIs. Example: Matchmaker Exchange
Matchmaker Exchange Acknowledgements S Balasubramanian Robert Green Mike Bamshad Matt Hurles Sergio Beltran Agullo Ada Hamosh Jonathan Berg Ekta Khurana Kym Boycott Sebastian Kohler Anthony Brookes Joel Krier Michael Brudno Owen Lancaster Han Brunner Melissa Landrum Oriean Buske Paul Lasko Deanna Church Rick Lifton Raymond Dalgliesh Daniel MacArthur Andrew Devereau Alex MacKenzie Johan den Dunnen Danielle Metterville Helen Firth Debbie Nickerson Paul Flicek Woong-Yan Park Jan Friedman Justin Paschall Richard Gibbs Anthony Philippakis Marta Girdea Heidi Rehm
Peter Robinson Francois Schiettecatte Rolf Sijmons Nara Sobreira Jawahar Swaminathan Morris Swertz Rachel Thompson Stephan Zuchner
NIAGADS
The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP
April 26, 2016
Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)
• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 1212 between 3 LSSC and 2 AD genetics consortia
Long-Term Plan for the ADSP
• Discovery Phase 2014-2018
  - Family-based
    · Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    · Included Caribbean Hispanic families
    · Fully QC'd data released 7/13/2015
  - Case-control
    · Whole exome sequencing (WES): 5000 cases and 5000 controls
    · 1000 additional cases from families multiply affected by AD
    · Included Caribbean Hispanics
    · Fully QC'd data released 112015
• Follow-Up 2016-2021: WGS in case-control sample sets, emphasizing ethnically diverse cohorts
ADSP Discovery Phase Analysis
• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing
Participation
• PARTICIPANTS
  - Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  - Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  - NIH staff: NHGRI and NIA
  - External Consultants
• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups
Organization of the ADSP
[Figure: ADSP organization chart; recoverable label: Genomics Center]
ADSP Infrastructure and Support
NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]
National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]
National Alzheimer's Coordinating Center, NACC [U01 AG016976]
Alzheimer's Disease Centers
NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]
NIAGADS History and Function
Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
NIA's repository for the genetics of late-onset Alzheimer's disease data
Datasets, Genomics Database, Analysis Resource
Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy
Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)
NIAGADS ADSP Data Coordinating Center
• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and Office of Science Policy on "Trusted Partners"
Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee
Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as meta-data
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
• Data production: define data formatting; collect and curate phenotypes; track data production; additional quality metrics; coordinate public data release
• Secondary/derived data management: documentation (README/Manifest), packaging, and release; notification and tracking
• IT support: website; members area; face-to-face meeting support; dbGaP exchange area
• Facilitate interactions with other AD investigators
• Rapid response to unforeseen ADSP and NIH requests
Interactions
[Figure: NIAGADS interactions. Labels: NIAGADS; NIA/NHGRI; Baylor/Broad/WashU LSACs; NCBI dbGaP/SRA; ADSP Data Flow and other Work Groups; Genomics Center; NCRAD; ADGC/CHARGE]
ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site
Management of Work Group Data/Work Flow
[Figure: Exchange Area]
Data Deposition and Data Sharing
ADSP specific datasets- 68 Tb
21 Reference datasets with 5560 files in use by the ADSP
>20 Reference datasets planned
1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes (binary convention). A 1 TB hard drive has the capacity to hold about a trillion bytes
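The two figures on this slide use different conventions: the chain of 1024s is the binary terabyte, while "a trillion bytes" is the decimal one. A short check of the arithmetic, using only the numbers given above:

```python
# Binary convention: each step multiplies by 1024.
KB = 1024
MB = 1024 * KB
GB = 1024 * MB
TB_BINARY = 1024 * GB

# Decimal convention: a trillion bytes.
TB_DECIMAL = 10 ** 12

kb_per_tb = TB_BINARY // KB   # kilobytes in one binary terabyte
mb_per_tb = TB_BINARY // MB   # megabytes in one binary terabyte
difference = TB_BINARY - TB_DECIMAL
percent_larger = 100 * difference / TB_DECIMAL

print(TB_BINARY, kb_per_tb, mb_per_tb)
print(difference, round(percent_larger, 2))
```

The binary terabyte is roughly 10% larger than a trillion bytes, which matters when storage totals on the scale quoted here are compared across sources.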
dbGaP/NIAGADS release for the research community at large
11555 BAM files released 122014
Pedigree and phenotype information
Sequencing Quality Control Metrics
Size of the ADSP
Reference: "The real cost of sequencing: scaling computation to keep pace with data generation", Muir et al. Genome Biology 2016;17:53
ADSP Members Area (Dashboard)
Calendar Documents Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list 153 records
352 meeting minutes
>600 other documents
>100 consent files
10 analysis plans
85TB WGS 1075TB WES data (dbGaP/SRA)
42 TB Files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
SNP/Gene reports, Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer's Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
[Figure: GWAS results combined with the SNPs below; gene annotations transformed to SNPs; genomic features from identified SNPs]
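Combining a GWAS significance search with gene annotation amounts to intersecting two SNP sets. The sketch below illustrates that logic only; it does not reflect how the NIAGADS Genomics Database is implemented. The SNP identifiers, gene names, and p-values are invented placeholders, and the 5e-8 cutoff is the conventional genome-wide significance threshold, which the slide does not state.

```python
# Conventional genome-wide significance threshold (an assumption here;
# the slide does not give a cutoff).
GENOME_WIDE_SIGNIFICANCE = 5e-8

# Hypothetical GWAS summary rows and gene-to-SNP annotation.
GWAS_RESULTS = [
    {"snp": "rs0000001", "p_value": 3e-12},
    {"snp": "rs0000002", "p_value": 4e-3},
    {"snp": "rs0000003", "p_value": 1e-9},
]
GENE_ANNOTATION = {
    "GENE_X": {"rs0000001", "rs0000002"},
    "GENE_Y": {"rs0000003"},
}


def significant_snps(results, threshold):
    """SNPs whose p-value is below the threshold."""
    return {row["snp"] for row in results if row["p_value"] < threshold}


def snps_for_genes(annotation, genes):
    """Transform a list of genes into the set of SNPs annotated to them."""
    snps = set()
    for gene in genes:
        snps |= annotation.get(gene, set())
    return snps


def combine(results, annotation, genes, threshold=GENOME_WIDE_SIGNIFICANCE):
    """Significant SNPs that also fall in the genes of interest."""
    return sorted(significant_snps(results, threshold) & snps_for_genes(annotation, genes))


hits_in_gene_x = combine(GWAS_RESULTS, GENE_ANNOTATION, ["GENE_X"])
print(hits_in_gene_x)
```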
Find a genomic location of interest by viewing tracks on the genome browser: SNPs, genes, GWAS results, functional genomics data
View Genomic Information by the Genome Browser
[Figure: genome browser view of a search result]
Data available at the NIAGADS Genomics Database
• Gene models
• SNP information
• Pathway annotations: Gene Ontology, KEGG Pathway
• Functional genomics data: ENCODE, FANTOM5, GTEx results
• NHGRI GWAS catalog
• Published genome-wide AD GWAS summary statistics
Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  - Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  - Big data management
  - IT infrastructure and web development
• Administrative support and collaboration
  - Familiarity with human subjects research regulations and NIH policy
  - Leadership and coordinating skills
  - Flexibility, close collaboration with program officers
• Domain expertise
  - Knowledge of the disease and genetic research
  - Familiarity with the community and cohorts
  - Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements NIAGADS EAB Matthew Farrer
Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober
NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly bull Data flow WG
bull Annotation WG IGAPADGCCHARGEEADIGERAD
NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon
NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao
GDC Overview: GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
  - Enables the identification of both high- and low-frequency cancer drivers
  - Assists in defining genomic determinants of response to therapy
  - Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  - Harmonization of raw sequence both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  - Application of state-of-the-art methods of generating high level data
GDC Overview: Infrastructure
[Figure: GDC infrastructure, grouped into GDC Users, GDC System Components, and GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; 3rd Party Applications; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; APIs; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System]
GDC Overview: Resources
[Figure: GDC resources. Labels: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators]
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testing (UAT) Testers
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
  - Type 1: Large Organizations
    · Users associated with an institution or group with significant informatics resources
    · Large one-time submission or long-term ongoing data submission
    · Primarily use the GDC Application Programming Interfaces (API) or the GDC Data Transfer Tool (command line interface) for data submission
  - Type 2: Researchers or Individual Laboratories
    · Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    · One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    · Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool that use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  - GDC data submitters must first apply for data submission authorization through dbGaP
  - Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  - Data Sharing Requirement
    · Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    · The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  - Data Pre-processing Period
    · For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    · The pre-processing period may generally last up to six months from the date of first submission
  - Data Submission Period and Release
    · Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  - Data Redaction
    · The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
GDC Data Submission Data Submission Process
162
GDC Data Submission dbGaP Registration
bull Data submitters register their studies in dbGaP submit the Subject Identifiers associated with the study in dbGaP and contact the GDC
bull The GDC verifies that the study has been registered in dbGaP verifies the data submitter credentials Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process httpwwwncbinlmnihgovprojectsgapcgi-binGetPdfcgidocument_name=HowToSubmitpdf
163
GDC Data Submission: Upload and Validate Data
• Data submitters upload their data to a workspace within the GDC and validate it there, using Data Submission Tools such as the GDC Data Submission Portal, the GDC Data Transfer Tool, or the GDC Application Programming Interface (API).
• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data.
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type.
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Portion | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Analyte | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Aliquot | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Diagnosis | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Exposure | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Family History | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Treatment | TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions.
• The GDC supports the submission of biospecimen data in XML, JSON, or TSV file format.
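Because templates are offered in both TSV and JSON, the same records can be expressed in either form. The following is a minimal sketch of that equivalence, not GDC tooling: the column names (`type`, `submitter_id`, `sample_type`) and the example values are hypothetical placeholders, and the real field names come from the GDC data dictionary.

```python
import csv
import io
import json


def tsv_to_json_records(tsv_text):
    """Turn TSV text (header row + data rows) into a list of JSON-ready dicts."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [dict(row) for row in reader]


# Hypothetical biospecimen template content; column names are assumptions.
example_tsv = (
    "type\tsubmitter_id\tsample_type\n"
    "sample\tsample-001\tPrimary Tumor\n"
)

records = tsv_to_json_records(example_tsv)
print(json.dumps(records, indent=2))
```

Each TSV row becomes one JSON object keyed by the header row, so a submitter could prepare data in a spreadsheet and still produce JSON for programmatic submission.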
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment.
  - GDC clinical data elements were reviewed with members of the research community.
  - Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR).
[Figure: a Case with its Clinical Data, Biospecimen Data, and Experiment Data. Clinical Data comprises Demographics, Diagnosis, Family History (optional), Exposure (optional), and Treatment (optional).]
Data Types and File Formats: Experiment Data
• The GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format.
• The GDC performs and reports on Quality Control (QC) metrics generated by FastQC.
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
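A per-read-group report like the one above is easier to act on once the statuses are tallied. The sketch below transcribes the module statuses from the example table and counts PASS, WARNING, and FAIL per read group; the function names are illustrative and are not part of FastQC or the GDC.

```python
from collections import Counter

READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Module statuses in read-group order, transcribed from the example report.
MODULES = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def summarize(read_groups, modules):
    """Count how many modules ended in each status for every read group."""
    summary = {rg: Counter() for rg in read_groups}
    for statuses in modules.values():
        for rg, status in zip(read_groups, statuses):
            summary[rg][status] += 1
    return summary


def failing_read_groups(summary):
    """Return read groups with at least one FAIL, in their original order."""
    return [rg for rg, counts in summary.items() if counts["FAIL"] > 0]


summary = summarize(READ_GROUPS, MODULES)
flagged = failing_read_groups(summary)
print(flagged)
```

In this example, RG_1 and RG_2 each carry one FAIL, so they would be the read groups to inspect before relying on the data.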
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata.
  - Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  - Validate data against GDC standard data types defined in the project data dictionary
  - Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks.
  - Command line interface to specify the desired transfer protocol
  - Uses a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  - Supports the secure upload of controlled-access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis.
  - Securely submit biospecimen, clinical, and experiment files using a token
  - Submit data to the GDC by performing create, update, and retrieve actions on entities
  - Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
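As an illustration of the create-entities endpoint, the sketch below only constructs the HTTP request and never sends it. Several details are assumptions rather than facts from the slide: the `X-Auth-Token` header name, the shape of the entity payload, and the program/project values `TCGA`/`SKCM`. The host is the API address listed on the References slide.

```python
import json
from urllib.request import Request

API_ROOT = "https://gdc-api.nci.nih.gov"  # host as listed on the References slide


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Auth-Token": token,  # assumed header name for the submitter token
    }
    return Request(url, data=body, headers=headers, method="POST")


# Hypothetical entity; real field names are defined by the GDC data dictionary.
example_entities = [{"type": "case", "submitter_id": "example-case-1"}]
request = build_create_request("TCGA", "SKCM", example_entities, token="<token>")
print(request.get_method(), request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would perform the submission; that step is left out here because it requires valid credentials and project authorization.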
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data.
• After data processing has been completed, the user can release their data to the GDC; release must occur within six (6) months of submission per the GDC Data Submission Policies.
• Data is then made available through the GDC Data Access Tools as open or controlled access, per the dbGaP authorization policies associated with the data set.
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38).
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]
GDC Processing: GDC High Level Data Generation
• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations.
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines.
GDC Processing: GDC Variant Calling Pipelines
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Retrieval
• Once data is submitted to the GDC and the data submitter signs off on the data to request data release, the GDC performs Quality Control (QC) and data processing.
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions.
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API.
• Data queries are based on the GDC Data Model.
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC.
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects by means of unique identifiers.
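To make the graph idea concrete, here is a toy sketch of a node-and-edge store in which every node gets a unique identifier and files are found by walking edges from a project. It is an illustration only: the `Graph` class, the node labels, the edge direction (parent to child), and the property names are assumptions and do not describe the GDC's actual schema.

```python
import uuid
from collections import defaultdict


class Graph:
    """Toy node/edge store; every node is keyed by a unique identifier."""

    def __init__(self):
        self.nodes = {}
        self.edges = defaultdict(list)

    def add_node(self, label, **properties):
        node_id = str(uuid.uuid4())
        self.nodes[node_id] = {"label": label, **properties}
        return node_id

    def add_edge(self, parent_id, child_id):
        self.edges[parent_id].append(child_id)

    def descendants(self, start_id, label):
        """Depth-first walk returning ids of reachable nodes with this label."""
        found, stack, seen = [], [start_id], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            if self.nodes[current]["label"] == label:
                found.append(current)
            stack.extend(self.edges[current])
        return found


graph = Graph()
project = graph.add_node("project", name="example-project")
case = graph.add_node("case", submitter_id="example-case-1")
clinical = graph.add_node("clinical", diagnosis="example")
sample = graph.add_node("sample", submitter_id="sample-001")
data_file = graph.add_node("file", file_name="reads.bam")

graph.add_edge(project, case)
graph.add_edge(case, clinical)
graph.add_edge(case, sample)
graph.add_edge(sample, data_file)

files_for_project = graph.descendants(project, "file")
print(files_for_project == [data_file])
```

Because the file is reachable only through its sample and case, the link between a data file and the clinical context of its case is preserved by the graph itself rather than by naming conventions.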
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics.
  - Data browsing by project, file, case, or annotation
  - Visualization allowing users to perform fine-grained filtering of search results
  - Data search using advanced smart search technology
  - Data selection into a personalized cart
  - Data download from the cart or via the high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks.
  - Command line interface to specify the desired transfer protocol and multiple files for download
  - Uses a manifest file generated from the GDC Data Portal for multiple downloads
  - Supports the secure transfer of controlled-access data using an authentication key (token)
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis.
  - Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  - Securely retrieve biospecimen, clinical, and molecular files
  - Perform BAM slicing
Example query (annotated on the slide as API URL, Endpoint, URL parameters, Query parameters):
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Example response (portion removed for readability):
  {
    "data": {
      "hits": [
        {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
        {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
        {"project_id": "TCGA-LAML", "primary_site": "Blood"},
        {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
        {"project_id": "TCGA-UVM", "primary_site": "Eye"},
        {"project_id": "TARGET-AML", "primary_site": "Blood"},
        {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
        {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
        {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
        {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
      ]
    }
  }
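The projects query above can be assembled and its response consumed with a few lines of standard-library Python. This is a sketch under assumptions: the helper names are illustrative, no network call is made, and the sample response is a three-entry subset of the hits shown on the slide.

```python
from urllib.parse import urlencode

API_ROOT = "https://gdc-api.nci.nih.gov"  # API host as shown on the slide


def build_query_url(endpoint, fields, pretty=True):
    """Assemble a search URL such as /projects?fields=...&pretty=true."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # safe="," keeps the comma-separated field list readable in the URL.
    return f"{API_ROOT}/{endpoint}?{urlencode(params, safe=',')}"


def projects_by_site(response):
    """Group project ids by primary site from a parsed JSON response."""
    grouped = {}
    for hit in response["data"]["hits"]:
        grouped.setdefault(hit["primary_site"], []).append(hit["project_id"])
    return grouped


url = build_query_url("projects", ["project_id", "primary_site"])

# Subset of the example response, already parsed from JSON.
sample_response = {
    "data": {
        "hits": [
            {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
            {"project_id": "TCGA-LAML", "primary_site": "Blood"},
            {"project_id": "TARGET-AML", "primary_site": "Blood"},
        ]
    }
}
by_site = projects_by_site(sample_response)
print(url)
print(by_site)
```

Requesting only the needed fields keeps responses small, which matters when the same pattern is applied to the much larger files and cases endpoints.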
GDC Web Site
• The GDC Web Site provides access to GDC resources that support and communicate the GDC mission.
  - Targets data consumers, providers, developers, and general users
  - Provides access to information about the GDC and contributed cancer genomic data sets
  - Instructs users on the use of GDC data access and submission tools
  - Provides descriptions of GDC bioinformatics pipelines
  - Documents supported GDC data types and file formats
  - Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to the GDC User's Guides and the GDC Data Dictionary.
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site
  - https://gdc.nci.nih.gov
• GDC Documentation Site
  - https://gdc-docs.nci.nih.gov
• GDC Data Portal
  - https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  - https://gdc-portal.nci.nih.gov/submission
  - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  - https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  - https://gdc-api.nci.nih.gov
• GDC Support
  - GDC Assistance: support@nci-gdc.datacommons.io
  - GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype)
  - Aggregation
  - Access
  - Sharing
  - Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Figure: data flow diagram. Labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF and cancer BAM/VCF; dbGaP; NCI GDC; birth defect VCF and cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users.]
[Figure: the GMKF Data Resource shown together with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics.]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist and, if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
Rare Disease Analysis Platform
seqr: Online platform for collaborative analysis
Genomic Matchmaking
Patient 1 (Clinical Geneticist 1)
  Genotypic Data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
  Phenotypic Data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5
Patient 2 (Clinical Geneticist 2)
  Genotypic Data: Gene D, Gene G, Gene H
  Phenotypic Data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6
Genomic Matchmaker: Notification of Match
Courtesy of Joel Krier
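The matchmaking idea in this example reduces to comparing two patient profiles. The sketch below finds the shared candidate genes and phenotypic features for the two patients on the slide and scores phenotype overlap with a Jaccard index. The scoring choice is an assumption made for illustration; it is not the scoring used by any Matchmaker Exchange service.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections: |intersection| / |union|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match(patient_a, patient_b):
    """Report shared genes and features plus a simple phenotype similarity."""
    return {
        "shared_genes": sorted(set(patient_a["genes"]) & set(patient_b["genes"])),
        "shared_features": sorted(
            set(patient_a["features"]) & set(patient_b["features"])
        ),
        "feature_similarity": jaccard(patient_a["features"], patient_b["features"]),
    }


patient_1 = {
    "genes": ["Gene A", "Gene B", "Gene C", "Gene D", "Gene E", "Gene F"],
    "features": ["Feature 1", "Feature 2", "Feature 3", "Feature 4", "Feature 5"],
}
patient_2 = {
    "genes": ["Gene D", "Gene G", "Gene H"],
    "features": ["Feature 1", "Feature 3", "Feature 4", "Feature 5", "Feature 6"],
}

result = match(patient_1, patient_2)
print(result)
```

Here the only shared candidate is Gene D, and four of the six distinct features overlap, which is the kind of evidence that would prompt the two clinical geneticists to be notified of a match.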
Matchmaker Exchange: Collaboration and Support from GA4GH and IRDiRC
Needs span multiple GA4GH workgroups
• Data Work Group (data format and interfaces)
• Regulatory and Ethics (patient consent)
• Security (patient privacy and user authentication)
Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.
Human Mutation Special Issue
• The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
• The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
• GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
• PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
• Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
• Innovative genomic collaboration using the GENESIS (GEM.app) platform
• Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
• Participant-led matchmaking
• GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
• Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
• Data sharing in the Undiagnosed Disease Network
• The Genomic Birthday Paradox: How Much is Enough?
• Quantifying and mitigating false-positive disease associations in rare disease matchmaking
• Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
• GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue abnormalities caused by de novo variants in HNRNPK
• Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies
The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery
www.matchmakerexchange.org
MME Use Cases
• Supported today
  - Match on Gene/Phenotype
    - 1 strong candidate gene (e.g., de novo variant in GUS)
    - 10 candidate genes, each with rare variant
  - Match on Phenotype Only
• Coming soon
  - Match case to non-human models
  - Match using phenotype and VCF
  - Patient-initiated matching
Connected and Soon to be Connected Matchmakers
[Figure: network diagram centered on the Matchmaker Exchange. Labels: GENESIS, GeneMatcher, DECIPHER, RD-Connect, ClinGen GenomeConnect, Monarch, PhenomeCentral, Broad Institute RDAP, patient-initiated matching, model organisms (mouse, zebrafish), Orphanet, ClinVar, OMIM; a "Live" marker distinguishes connected services.]
Connecting Data in the Big Data World
Centralized Database: Everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA
Centralized Hub: APIs connect each database to a central hub. Example: many commercial platforms
Federated Network: All databases connected through multiple APIs. Example: Matchmaker Exchange
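One practical difference between the hub and federated designs is how many connections must be built and maintained. The sketch below counts links for both, under the assumption (not stated on the slide) that a federated network connects every pair of databases directly; the database names are taken from the matchmaker list purely as labels.

```python
from itertools import combinations


def hub_links(databases, hub="central hub"):
    """Centralized hub: one API connection from each database to the hub."""
    return [(database, hub) for database in databases]


def federated_links(databases):
    """Federated network, assuming every pair of databases connects directly."""
    return list(combinations(databases, 2))


databases = ["GeneMatcher", "DECIPHER", "PhenomeCentral", "Monarch"]
hub = hub_links(databases)
federated = federated_links(databases)
print(len(hub), len(federated))
```

With n databases the hub needs n connections while a fully connected federation needs n(n-1)/2, which is why a shared API specification matters so much in the federated model: each new member implements one interface rather than a custom integration per partner.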
Matchmaker Exchange Acknowledgements
S Balasubramanian, Mike Bamshad, Sergio Beltran Agullo, Jonathan Berg, Kym Boycott, Anthony Brookes, Michael Brudno, Han Brunner, Orion Buske, Deanna Church, Raymond Dalgliesh, Andrew Devereau, Johan den Dunnen, Helen Firth, Paul Flicek, Jan Friedman, Richard Gibbs, Marta Girdea, Robert Green, Ada Hamosh, Matt Hurles, Ekta Khurana, Sebastian Kohler, Joel Krier, Owen Lancaster, Melissa Landrum, Paul Lasko, Rick Lifton, Daniel MacArthur, Alex MacKenzie, Danielle Metterville, Debbie Nickerson, Woong-Yang Park, Justin Paschall, Anthony Philippakis, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner
NIAGADS
The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP
April 26, 2016
Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)
• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia
Long-Term Plan for the ADSP
• Discovery Phase 2014-2018
  - Family-based
    - Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    - Included Caribbean Hispanic families
    - Fully QC'd data released 7/13/2015
  - Case-control
    - Whole exome sequencing (WES): 5000 cases and 5000 controls
    - 1000 additional cases from families multiply affected by AD
    - Included Caribbean Hispanics
    - Fully QC'd data released 11/2015
• Follow-Up 2016-2021
  - WGS in case-control sample sets, emphasizing ethnically diverse cohorts
ADSP Discovery Phase Analysis
• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing
Participation
• PARTICIPANTS
  - Two AD genetics consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  - Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  - NIH staff: NHGRI and NIA
  - External consultants
• INFRASTRUCTURE
  - Executive Committee, Analysis Coordination Committee, and 8 Work Groups
Genomics Center
Organization of the ADSP
ADSP Infrastructure and Support
NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]
National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]
National Alzheimer's Coordinating Center, NACC [U01 AG016976]
Alzheimer's Disease Centers
NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]
NIAGADS History and Function
Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
NIA's repository for late-onset Alzheimer's disease genetics data
Datasets, Genomics Database, Analysis Resource
Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy
Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)
NIAGADS ADSP Data Coordinating Center
• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and the Office of Science Policy on "Trusted Partners"
Interface between dbGaP and the NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP: Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and an eRA Commons account for authentication; no password is stored at the ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data are stored at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH side: after ADSP DAC approval, the investigator uses NIH authentication
• Community side: fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
Data production: define data formatting; collect and curate phenotypes; track data production; additional quality metrics; coordinate public data release
Secondary/derived data management: documentation (README/Manifest); packaging and release; notification and tracking
IT support: website; members area; face-to-face meeting support; dbGaP exchange area
Facilitate interactions with other AD investigators
Rapid response to unforeseen ADSP and NIH requests
Interactions
[Figure: NIAGADS interactions diagram. Labels: NIAGADS; NIA/NHGRI; Baylor/Broad/WashU LSACs; NCBI dbGaP/SRA; ADSP Data Flow and other Work Groups; Genomics Center; NCRAD; ADGC/CHARGE.]
ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site
Management of Work Group Data/Work Flow
Exchange Area
Data Deposition and Data Sharing
ADSP-specific datasets: 68 Tb
21 reference datasets with 5560 files in use by the ADSP
>20 reference datasets planned
Note: 1 TB = 1,024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes. A 1 TB hard drive has the capacity to hold a trillion bytes.
dbGaP/NIAGADS release for the research community at large
11555 BAM files released 12/2014
Pedigree and phenotype information
Sequencing Quality Control Metrics
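The unit conversion quoted on this slide can be checked directly. The short sketch below reproduces the 1,024-based chain and compares it with the decimal definition of a terabyte (10^12 bytes); note that the two conventions differ by roughly 10%, so "a trillion bytes" matches the decimal definition rather than the 1,024-based figure. The function name is illustrative.

```python
BINARY_STEP = 1024


def binary_terabyte_breakdown():
    """Express one 1,024-based terabyte in smaller 1,024-based units."""
    return {
        "GB": BINARY_STEP,
        "MB": BINARY_STEP**2,
        "KB": BINARY_STEP**3,
        "bytes": BINARY_STEP**4,
    }


DECIMAL_TERABYTE_BYTES = 10**12  # "a trillion bytes"

breakdown = binary_terabyte_breakdown()
difference = breakdown["bytes"] - DECIMAL_TERABYTE_BYTES
print(breakdown)
print(f"binary minus decimal: {difference:,} bytes")
```

When storage totals are reported in terabytes, stating which convention is used avoids a discrepancy of almost 100 GB per terabyte.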
Size of the ADSP
Reference: Muir et al. "The real cost of sequencing: scaling computation to keep pace with data generation". Genome Biology 2016; 17:53
ADSP Members Area (Dashboard)
Calendar, documents, conference call minutes
Reference dataset catalog
Information on funded cooperative agreements
Bulletin board for Work Groups
Notification of new datasets
ADSP by the numbers
Member list: 153 records
352 meeting minutes
>600 other documents
>100 consent files
10 analysis plans
85TB WGS, 1075TB WES data (dbGaP/SRA)
42 TB files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
SNP/Gene reports, Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer's Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
[Figure: search workflow. Labels: GWAS results combined with SNPs below; gene annotations transformed to SNPs; genomic features from identified SNPs.]
Find a genomic location of interest by viewing tracks on the genome browser
[Figure: genome browser tracks. Labels: SNPs, genes, GWAS results, functional genomics data.]
View Genomic Information by the Genome Browser
[Figure: search result]
Data available at the NIAGADS Genomics Database
Gene models, SNP information
Pathway annotations: Gene Ontology, KEGG Pathway
Functional genomics data: ENCODE, FANTOM5, GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics
Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  - Expertise in genomics, next generation sequencing, method and software tool development, and high performance computing
  - Big data management
  - IT infrastructure and web development
• Administrative support and collaboration
  - Familiarity with human subjects research regulations and NIH policy
  - Leadership and coordinating skills
  - Flexibility; close collaboration with program officers
• Domain expertise
  - Knowledge of the disease and genetic research
  - Familiarity with the community and cohorts
  - Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
What the Data Coordinating Center Should Know: Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements
Patients and families of those with AD
NIA ADRC program and centers
NACC
NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD
NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly
NIAGADS Team: Li-San Wang, Gerard Schellenberg, Chris Stoeckert, Adam Naj, Emily Greenfest-Allen, Amanda Partch, Fanny Leung, Otto Valladares, Prabhakaran K, John Malamon, Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao
NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
GDC Overview: GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.
• Provide a cancer knowledge network that
  - Enables the identification of both high- and low-frequency cancer drivers
  - Assists in defining genomic determinants of response to therapy
  - Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  - Harmonization of raw sequence data from both existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  - Application of state-of-the-art methods of generating high level data
GDC Overview: Infrastructure
[Figure: GDC infrastructure diagram. Column headings: GDC Users, GDC System Components, GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; 3rd Party Applications; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; APIs; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System.]
GDC Overview: Resources
[Figure: GDC resources. Labels: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators.]
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testing (UAT) Testers
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
  - Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  - Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, both of which use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  - GDC data submitters must first apply for data submission authorization through dbGaP.
  - Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy.
• GDC Data Sharing Policies
  - Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled-access data will be made available to members of the community who hold the appropriate dbGaP Data Use Certification.
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores.
  - Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release.
    • The pre-processing period may generally last up to six months from the date of first submission.
  - Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and made available via controlled access for research that is consistent with the dataset's "data use limitations", either six months after data submission or at the time of first publication.
  - Data Redaction
    • In general, the GDC will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data.
GDC Data Submission: Data Submission Process
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload and validate their data in a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, the GDC Data Transfer Tool, or the GDC Application Programming Interface (API)
• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | TSV, JSON for each of Sample, Portion, Analyte, Aliquot
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | TSV, JSON for each of Demographic, Diagnosis, Exposure, Family History, Treatment
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
Figure: each Case links to Clinical Data (Demographics, Diagnosis, Family History (optional), Exposure (optional), Treatment (optional)), Biospecimen Data, and Experiment Data
Data Types and File Formats: Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
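The QC summary above lends itself to a quick programmatic scan. The following is a minimal sketch, assuming the flags have been transcribed by hand into a dictionary; the names `QC_FLAGS` and `non_passing` are illustrative and not part of FastQC or any GDC tool. Modules that pass for every read group are omitted.

```python
# Per-module flags for each read group, transcribed from the example table.
# Modules where all four read groups PASS are left out for brevity.
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]
QC_FLAGS = {
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def non_passing(flags, read_groups):
    """Map each read group to the (module, status) pairs that are not PASS."""
    report = {rg: [] for rg in read_groups}
    for module, statuses in flags.items():
        for rg, status in zip(read_groups, statuses):
            if status != "PASS":
                report[rg].append((module, status))
    return report


report = non_passing(QC_FLAGS, READ_GROUPS)
# Read groups with at least one FAIL would typically be reviewed first.
failed = [rg for rg, items in report.items()
          if any(status == "FAIL" for _, status in items)]
print(failed)
```

Whether a WARNING or FAIL should block submission is a policy decision the slides do not state; the sketch only surfaces the flags.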
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify the desired transfer protocol
  – Uses a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to the GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
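To make the Create Entities endpoint concrete, here is a minimal sketch that builds (but does not send) such a request with the Python standard library. The base URL is the one listed on the References slide; the `X-Auth-Token` header name, the JSON body shape, and the program, project, and entity values are assumptions for illustration and should be checked against the GDC API User's Guide.

```python
import json
import urllib.request

API_ROOT = "https://gdc-api.nci.nih.gov"  # base URL as given on the References slide


def build_create_request(program, project, entities, token):
    """Build a POST request for /v0/submission/<program>/<project>.

    The header name and body layout are assumptions, not confirmed by the slides.
    """
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "X-Auth-Token": token},
    )


# Hypothetical entity and identifiers, for illustration only.
example_entity = {"type": "case", "submitter_id": "example-case-1"}
request = build_create_request("PROGRAM", "PROJECT", example_entity, "TOKEN")
print(request.get_method(), request.full_url)
```

Building the request separately from sending it keeps the token handling and URL construction easy to inspect before any data leaves the submitter's machine.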
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access, per the dbGaP authorization policies associated with the data set
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
Figure: DNA Sequence Pipeline; RNA Sequence Pipeline
GDC Processing: GDC High Level Data Generation
• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Retrieval
• Once data is submitted to the GDC and the data submitter signs off on the data to request data release, the GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
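The idea of a graph that links projects, cases, and files by identifier can be illustrated with a toy structure. This is a minimal sketch under stated assumptions: the `Graph` class, the node labels, the edge direction, and all identifiers are hypothetical and do not reflect the actual GDC schema or storage engine.

```python
class Graph:
    """Toy node/edge store; every node has a unique id and a label."""

    def __init__(self):
        self.nodes = {}   # node id -> label (e.g. "project", "case", "file")
        self.edges = {}   # node id -> set of linked child node ids

    def add_node(self, node_id, label):
        self.nodes[node_id] = label
        self.edges.setdefault(node_id, set())

    def add_edge(self, parent, child):
        self.edges[parent].add(child)

    def reachable(self, start, label):
        """Return the sorted ids of all nodes with `label` reachable from `start`."""
        found, stack, seen = [], [start], {start}
        while stack:
            for child in self.edges[stack.pop()]:
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
                    if self.nodes[child] == label:
                        found.append(child)
        return sorted(found)


graph = Graph()
for node_id, label in [("TCGA-SKCM", "project"), ("case-1", "case"),
                       ("clinical-1", "clinical"), ("sample-1", "sample"),
                       ("file-uuid-1", "file")]:
    graph.add_node(node_id, label)
for parent, child in [("TCGA-SKCM", "case-1"), ("case-1", "clinical-1"),
                      ("case-1", "sample-1"), ("sample-1", "file-uuid-1")]:
    graph.add_edge(parent, child)

print(graph.reachable("TCGA-SKCM", "file"))
```

Because every file is reached by walking edges from its project and case, a query such as "all files for this project" never depends on file naming conventions, only on the stored links.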
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from the cart or via the high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify the desired transfer protocol and multiple files for download
  – Uses a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing
Example request (API URL, endpoint, and query parameters):
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Example response (portion removed for readability):
{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
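The request and response shown on this slide can be reproduced offline. Below is a minimal sketch that builds the query URL and parses a response of the same shape; the host is the one printed on the slide and may since have changed, and the helper names and the abbreviated sample response are illustrative assumptions.

```python
import json
from urllib.parse import urlencode

API_ROOT = "https://gdc-api.nci.nih.gov"  # host as printed on the slide


def projects_url(fields, pretty=True):
    """Build the /projects query URL with a comma-separated field list."""
    query = urlencode(
        {"fields": ",".join(fields), "pretty": str(pretty).lower()},
        safe=",",  # keep commas readable instead of percent-encoding them
    )
    return f"{API_ROOT}/projects?{query}"


# Abbreviated copy of the response shown on the slide.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-UVM", "primary_site": "Eye"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}
"""


def sites_by_project(response_text):
    """Return a {project_id: primary_site} mapping from a /projects response."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


url = projects_url(["project_id", "primary_site"])
sites = sites_by_project(SAMPLE_RESPONSE)
print(url)
print(sites["TCGA-UVM"])
```

Requesting only the fields needed keeps responses small, which matters when paging through many projects or files.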
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about the GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site
  – https://gdc.nci.nih.gov
• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov
• GDC Data Portal
  – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype):
  – Aggregation
  – Access
  – Sharing
  – Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
Figure: data flow. Birth defect or childhood cancer cohorts go to a DNA sequencing center. Birth defect BAM/VCF files are deposited in dbGaP and cancer BAM/VCF files in the NCI GDC. Birth defect and cancer VCFs feed the Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries), which serves users.
Figure: the GMKF Data Resource in relation to dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics.
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
Rare Disease Analysis Platform: seqr
Online platform for collaborative analysis
Genomic Matchmaking
Figure: two clinical geneticists each submit a patient to the Genomic Matchmaker, which returns a notification of a match.
Patient 1 (Clinical Geneticist 1)
  Genotypic data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
  Phenotypic data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5
Patient 2 (Clinical Geneticist 2)
  Genotypic data: Gene D, Gene G, Gene H
  Phenotypic data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6
Courtesy of Joel Krier
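The matchmaking figure can be expressed as a set intersection. The following is a minimal sketch, assuming a deliberately simple rule (at least one shared candidate gene plus some phenotype overlap); real matchmakers use more elaborate scoring, and the `Patient` class and `match` function are illustrative names only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patient:
    name: str
    genes: frozenset
    features: frozenset


def match(a, b, min_features=1):
    """Return shared genes and features if both patients overlap, else None."""
    shared_genes = a.genes & b.genes
    shared_features = a.features & b.features
    if shared_genes and len(shared_features) >= min_features:
        return {"genes": sorted(shared_genes), "features": sorted(shared_features)}
    return None


# Candidate genes and features exactly as drawn in the figure.
patient_1 = Patient("Patient 1", frozenset("ABCDEF"), frozenset({1, 2, 3, 4, 5}))
patient_2 = Patient("Patient 2", frozenset("DGH"), frozenset({1, 3, 4, 5, 6}))

result = match(patient_1, patient_2)
print(result)
```

In the figure's example the two cases share Gene D and four of their phenotypic features, which is what would trigger the notification to both clinical geneticists.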
Matchmaker Exchange: Collaboration and Support from GA4GH and IRDiRC
Needs span multiple GA4GH workgroups
• Data Work Group (data format and interfaces)
• Regulatory and Ethics (patient consent)
• Security (patient privacy and user authentication)
Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.
Human Mutation Special Issue
• The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
• The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
• GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
• PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
• Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
• Innovative genomic collaboration using the GENESIS (GEM.app) platform
• Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
• Participant-led matchmaking
• GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
• Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
• Data sharing in the Undiagnosed Disease Network
• The Genomic Birthday Paradox: How Much is Enough?
• Quantifying and mitigating false-positive disease associations in rare disease matchmaking
• Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
• GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue caused by de novo variants in HNRNPK
• Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies
The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery
www.matchmakerexchange.org
MME Use Cases
• Supported today
  – Match on Gene/Phenotype
    • 1 strong candidate gene (e.g., de novo variant in GUS)
    • 10 candidate genes, each with rare variant
  – Match on Phenotype Only
• Coming soon
  – Match case to non-human models
  – Match using phenotype and VCF
  – Patient-initiated matching
Connected and Soon to be Connected Matchmakers
Figure: matchmakers linked through the Matchmaker Exchange: GENESIS, GeneMatcher, DECIPHER, RD-Connect, ClinGen GenomeConnect, Monarch, PhenomeCentral, and Broad Institute RDAP. Other labels in the figure: patient-initiated matching; model organisms (mouse, zebrafish); Orphanet, ClinVar, OMIM; Live.
Connecting Data in the Big Data World
• Centralized Database: everyone submits data to a single central database
  – Examples: ClinVar, dbGaP, EGA
• Centralized Hub: APIs connect each database to a central hub
  – Example: many commercial platforms
• Federated Network: all databases connected through multiple APIs
  – Example: Matchmaker Exchange
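The federated model can be sketched in a few lines: each database keeps its own records and exposes only a query function, and a client fans one question out to every node. This is a minimal in-memory illustration; the node names, case identifiers, and gene symbols are made up, and nothing here reproduces the actual Matchmaker Exchange API.

```python
def make_node(name, records):
    """Return a query function over one database's privately held records."""
    def query(gene):
        # Only matching case ids leave the node; the records stay local.
        return [(name, case_id) for case_id, genes in sorted(records.items())
                if gene in genes]
    return query


def federated_match(nodes, gene):
    """Send the same gene query to every node and pool the answers."""
    results = []
    for query in nodes:
        results.extend(query(gene))
    return results


# Hypothetical nodes and records, for illustration only.
nodes = [
    make_node("node_a", {"case-1": {"GENE_X", "GENE_Y"}, "case-2": {"GENE_Z"}}),
    make_node("node_b", {"case-9": {"GENE_X"}}),
    make_node("node_c", {"case-5": {"GENE_W"}}),
]

matches = federated_match(nodes, "GENE_X")
print(matches)
```

In contrast with the centralized database model, no node ever hands over its full contents; adding a new database means adding one more query function to the list.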
Matchmaker Exchange Acknowledgements
S Balasubramanian, Mike Bamshad, Sergio Beltran Agullo, Jonathan Berg, Kym Boycott, Anthony Brookes, Michael Brudno, Han Brunner, Oriean Buske, Deanna Church, Raymond Dalgliesh, Andrew Devereau, Johan den Dunnen, Helen Firth, Paul Flicek, Jan Friedman, Richard Gibbs, Marta Girdea, Robert Green, Matt Hurles, Ada Hamosh, Ekta Khurana, Sebastian Kohler, Joel Krier, Owen Lancaster, Melissa Landrum, Paul Lasko, Rick Lifton, Daniel MacArthur, Alex MacKenzie, Danielle Metterville, Debbie Nickerson, Woong-Yan Park, Justin Paschall, Anthony Philippakis, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner
NIAGADS
The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP
April 26, 2016
Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)
• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia
Long-Term Plan for the ADSP
• Discovery Phase 2014-2018
  – Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC'd data released 7/13/2015
  – Case-control
    • Whole exome sequencing (WES): 5000 cases and 5000 controls
    • 1000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC'd data released 11/2015
• Follow-Up 2016-2021
  – WGS in case-control sample sets, emphasizing ethnically diverse cohorts
ADSP Discovery Phase Analysis
• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing
Participation
• PARTICIPANTS
  – Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  – Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  – NIH staff: NHGRI and NIA
  – External Consultants
• INFRASTRUCTURE
  – Executive Committee, Analysis Coordination Committee, and 8 Work Groups
Organization of the ADSP
Figure: ADSP organization chart (labels include Genomics Center)
ADSP Infrastructure and Support
• NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]
• National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]
• National Alzheimer's Coordinating Center, NACC [U01 AG016976]
• Alzheimer's Disease Centers
• NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]
NIAGADS History and Function
• Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
• NIA's repository for the genetics of late-onset Alzheimer's disease data
• Datasets, Genomics Database, Analysis Resource
• Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy
• Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)
NIAGADS ADSP Data Coordinating Center
• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and Office of Science Policy on "Trusted Partners"
Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP: Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as meta-data
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH-side: after ADSP DAC approval, the investigator uses NIH authentication
• Community-side: fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
• Data production: define data formatting; collect and curate phenotypes; track data production; additional quality metrics; coordinate public data release
• Secondary/derived data management: documentation (README/Manifest), packaging, and release; notification and tracking
• IT support: website; members area; face-to-face meeting support; dbGaP exchange area
• Facilitate interactions with other AD investigators
• Rapid response to unforeseen ADSP and NIH requests
Interactions
Figure: NIAGADS interacts with NIA/NHGRI, the Baylor/Broad/WashU LSACs, NCBI dbGaP/SRA, the ADSP Data Flow and other Work Groups, the Genomics Center, NCRAD, and ADGC/CHARGE.
ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site
Management of Work Group Data/Work Flow
Exchange Area
ADSP specific datasets- 68 Tb
21 Reference datasets with 5560 files in use by the ADSP
gt20 Reference datasets planned 1 TB = 1024 GB = 1048576 MB = 1073741824 KB =
1099511627776 bytes A 1 TB hard drive has the capacity to
hold a trillion bytes
dbGaPNIAGADS release for the research community at large
11555 BAM files released 122014
Pedigree and phenotype information
Sequencing Quality Control Metrics
Size of the ADSP
Reference ldquoThe real cost of sequencing scaling computation to keep pace with data generationrdquo Muir et al Genome Biology 2016 1753
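The terabyte conversion quoted on the data deposition slide mixes two conventions: the chain of 1024s is the binary terabyte, while "a trillion bytes" is the decimal one. A short check of the arithmetic, with variable names chosen only for this illustration:

```python
# Binary convention used in the slide's conversion chain (powers of 1024).
gb_per_tb = 1024
mb_per_tb = 1024 ** 2
kb_per_tb = 1024 ** 3
bytes_per_tb_binary = 1024 ** 4

# Decimal convention implied by "a trillion bytes".
bytes_per_tb_decimal = 10 ** 12

difference = bytes_per_tb_binary - bytes_per_tb_decimal
percent_larger = 100 * difference / bytes_per_tb_decimal

print(mb_per_tb, kb_per_tb, bytes_per_tb_binary)
print(f"binary TB is {percent_larger:.2f}% larger than a trillion bytes")
```

The two figures differ by roughly ten percent, which is worth keeping in mind when comparing dataset sizes reported by different storage systems.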
ADSP Members Area (Dashboard)
Calendar Documents Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list 153 records
352 meeting minutes
>600 other documents
>100 consent files
10 analysis plans
85TB WGS, 1075TB WES data (dbGaP/SRA)
42 TB Files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
• SNP/Gene reports
• Genome Browser interface
• Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer's Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
Figure: GWAS results (SNPs below the chosen significance level) are combined with gene annotations transformed to SNPs, yielding genomic features from the identified SNPs
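The combined search described above, filtering GWAS results by significance and then intersecting with gene annotation, can be sketched as two set operations. All SNP identifiers, p-values, and gene names below are made up for illustration, and the 5e-8 default is the conventional genome-wide significance threshold rather than a value taken from the NIAGADS site.

```python
# Hypothetical GWAS summary statistics: SNP id -> p-value.
GWAS_PVALUES = {"rs1": 3e-9, "rs2": 0.02, "rs3": 1e-8, "rs4": 4e-10}

# Hypothetical gene annotation expressed as SNP sets ("transformed to SNPs").
GENE_SNPS = {
    "GENE_A": {"rs1", "rs2"},
    "GENE_B": {"rs3"},
    "GENE_C": {"rs5"},
}


def significant_snps(pvalues, threshold=5e-8):
    """Return the SNPs whose p-value is at or below the threshold."""
    return {snp for snp, p in pvalues.items() if p <= threshold}


def genes_with_hits(gene_snps, hits):
    """Return each gene that contains at least one significant SNP."""
    return {gene: sorted(snps & hits)
            for gene, snps in gene_snps.items() if snps & hits}


hits = significant_snps(GWAS_PVALUES)
features = genes_with_hits(GENE_SNPS, hits)
print(sorted(hits))
print(features)
```

Representing both inputs as SNP sets is what makes the searches composable: any further annotation layer can be intersected the same way.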
Find a genomic location of interest by viewing tracks on the genome browser
• SNPs
• Genes
• GWAS results
• Functional genomics data
View Genomic Information by the Genome Browser
Figure: genome browser view of a search result
Data available at the NIAGADS Genomics Database
• Gene models
• SNP information
• Pathway annotations: Gene Ontology, KEGG Pathway
• Functional genomics data: ENCODE, FANTOM5
• GTEx results
• NHGRI GWAS catalog
• Published genome-wide AD GWAS summary statistics
• Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  – Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  – Big data management
  – IT infrastructure and web development
• Administrative support and collaboration
  – Familiarity with human subjects research regulations and NIH policy
  – Leadership and coordinating skills
  – Flexibility, close collaboration with program officers
• Domain expertise
  – Knowledge of the disease and genetic research
  – Familiarity with the community and cohorts
  – Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
What the Data Coordinating Center Should Know: Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements
Patients and families of those with AD
NIA ADRC program and centers
NACC, NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD
NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly
NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon, Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao
NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
GDC Overview: GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence data, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high level data
GDC Overview: Infrastructure
Figure: GDC infrastructure, grouped into GDC Users, GDC System Components, and GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; 3rd Party Applications; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; APIs; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System
GDC Overview: Resources
Figure: GDC resources. Labels: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
GDC Data Submission Data Submission Process
GDC Data Submission dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter's credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample | TSV, JSON
 | | | Portion | TSV, JSON
 | | | Analyte | TSV, JSON
 | | | Aliquot | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic | TSV, JSON
 | | | Diagnosis | TSV, JSON
 | | | Exposure | TSV, JSON
 | | | Family History | TSV, JSON
 | | | Treatment | TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
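Since each dictionary entry above is offered as both a TSV and a JSON template, the two encodings should carry the same fields. The sketch below illustrates that equivalence for a single record. It is only an illustration: the field names `type` and `submitter_id` and the value `EXAMPLE-CASE-001` are assumptions made for the example, not values taken from the GDC data dictionary.

```python
import csv
import io
import json

# Hypothetical TSV content for one "case" record.
# Real column names come from the GDC data dictionary templates.
tsv_text = "type\tsubmitter_id\ncase\tEXAMPLE-CASE-001\n"


def tsv_to_records(text):
    """Parse tab-separated text into a list of dicts, one per data row."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [dict(row) for row in reader]


records = tsv_to_records(tsv_text)

# The same records serialized as JSON.
json_text = json.dumps(records, indent=2)
print(json_text)
```

Either form could then be checked against the dictionary before upload; the validation step itself is not shown here.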
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
Data Types and File Formats Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
Data Types and File Formats Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  - GDC clinical data elements were reviewed with members of the research community
  - Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
[Figure: a Case with its Clinical Data (Demographics, Diagnosis, Family History (optional), Exposure (optional), Treatment (optional)), Biospecimen Data, and Experiment Data]
Data Types and File Formats Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FASTQC
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
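A QC report like the one above is easiest to act on when reduced to the modules that did not pass for each read group. The following is a minimal sketch of that reduction, using the statuses transcribed from the table; the function name `flagged_modules` is hypothetical and is not part of FASTQC or the GDC tooling.

```python
# FASTQC module statuses transcribed from the table above,
# in read-group order RG_1, RG_2, RG_3, RG_4.
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]
QC_RESULTS = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged_modules(results, read_groups):
    """Return, per read group, the (module, status) pairs that are not PASS."""
    flagged = {rg: [] for rg in read_groups}
    for module, statuses in results.items():
        for rg, status in zip(read_groups, statuses):
            if status != "PASS":
                flagged[rg].append((module, status))
    return flagged


flagged = flagged_modules(QC_RESULTS, READ_GROUPS)
for rg in READ_GROUPS:
    print(rg, flagged[rg])
```

In this table RG_4 is the only read group with no WARNING or FAIL entries.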
Data Submission Tools GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  - Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  - Validate data against GDC standard data types defined in the project data dictionary
  - Obtain information on the status of data submission and processing by project
Data Submission Tools GDC Data Transfer Tool
bull The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
Command line interface to specify desired transfer protocol
Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  - Securely submit biospecimen, clinical, and experiment files using a token
  - Submit data to GDC by performing create, update, and retrieve actions on entities
  - Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
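To make the create-entities call concrete, the sketch below builds (but does not send) a POST request against the endpoint pattern on this slide, using only the Python standard library. Several details are assumptions rather than facts from the deck: the `X-Auth-Token` header name, the JSON body shape, and the placeholder program, project, and token values. The host is the API URL listed in the deck's references.

```python
import json
import urllib.request

API_ROOT = "https://gdc-api.nci.nih.gov"  # API URL as listed in this deck


def build_create_request(program, project, entities, token):
    """Build, without sending, a POST to /v0/submission/<program>/<project>."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the token
        },
    )


# Placeholder values for illustration only.
request = build_create_request(
    "PROGRAM",
    "PROJECT",
    [{"type": "case", "submitter_id": "EXAMPLE-CASE-001"}],
    token="EXAMPLE-TOKEN",
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would actually send it; that step is left out so the example has no network dependency.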
GDC Data Submission Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Processing Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]
GDC Processing GDC High Level Data Generation
• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array-based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing GDC Variant Calling Pipelines
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Retrieval
• Once data is submitted to the GDC and the data submitter signs off on the data to request data release, the GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
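A graph of nodes and edges keyed by unique identifiers can be sketched in a few lines. The example below is an illustration of the idea only, not the GDC implementation: the `Graph` class, the node labels, the relation names `has_case` and `has_file`, and all property values are hypothetical.

```python
import uuid


class Graph:
    """A tiny in-memory property graph keyed by UUID strings."""

    def __init__(self):
        self.nodes = {}  # node id -> {"label": str, "props": dict}
        self.edges = []  # (source id, relation, target id)

    def add_node(self, label, **props):
        node_id = str(uuid.uuid4())
        self.nodes[node_id] = {"label": label, "props": props}
        return node_id

    def add_edge(self, src, relation, dst):
        self.edges.append((src, relation, dst))

    def neighbors(self, node_id, relation):
        """Ids of nodes reached from node_id over edges with this relation."""
        return [d for s, r, d in self.edges if s == node_id and r == relation]


def files_for_project(graph, project_id):
    """Walk project -> case -> file and collect the file node ids."""
    files = []
    for case_id in graph.neighbors(project_id, "has_case"):
        files.extend(graph.neighbors(case_id, "has_file"))
    return files


graph = Graph()
project = graph.add_node("project", name="EXAMPLE-PROJECT")
case = graph.add_node("case", submitter_id="EXAMPLE-CASE-001")
data_file = graph.add_node("file", file_name="example.bam")
graph.add_edge(project, "has_case", case)
graph.add_edge(case, "has_file", data_file)

print(files_for_project(graph, project))
```

Because every lookup goes through an identifier rather than a name, a file stays linked to its case and project even if descriptive properties change.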
GDC Data Retrieval GDC Data Portal
bull The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
Data browsing by project file case or annotation
Visualization allowing users to perform fine-grained filtering of search results
Data search using advanced smart search technology
Data selection into a personalized cart
Data download from cart or a high-performance Data Transfer Tool
GDC Data Retrieval GDC Data Transfer Tool
bull The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
Command line interface to specify desired transfer protocol and multiple files for download
Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
Supports the secure transfer of controlled access data using an authentication key (token)
Data Retrieval Tools GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  - Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  - Securely retrieve biospecimen, clinical, and molecular files
  - Perform BAM slicing
Example request: https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
(Slide callouts on this URL: API URL, Endpoint, URL parameters, Query parameters)
Example response (portion removed for readability):
  {
    "data": {
      "hits": [
        {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
        {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
        {"project_id": "TCGA-LAML", "primary_site": "Blood"},
        {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
        {"project_id": "TCGA-UVM", "primary_site": "Eye"},
        {"project_id": "TARGET-AML", "primary_site": "Blood"},
        {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
        {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
        {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
        {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
      ]
    }
  }
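The request and response on this slide can be handled with the standard library alone. The sketch below assembles the same `projects` URL from its parts and reduces a response of the shape shown to a project-to-site mapping. The helper names are hypothetical, and the sample response is a two-hit excerpt of the slide's output rather than a live API call.

```python
import json
from urllib.parse import urlencode

API_ROOT = "https://gdc-api.nci.nih.gov"  # API URL as listed in this deck


def projects_url(fields, pretty=True):
    """Build the /projects query URL for the requested fields."""
    query = urlencode(
        {"fields": ",".join(fields), "pretty": str(pretty).lower()},
        safe=",",  # keep the comma-separated field list readable
    )
    return f"{API_ROOT}/projects?{query}"


def sites_by_project(response_text):
    """Map project_id to primary_site from a response of the shape shown."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


url = projects_url(["project_id", "primary_site"])

# Two of the hits shown on the slide, used here as offline sample input.
sample_response = json.dumps({
    "data": {
        "hits": [
            {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
            {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
        ]
    }
})

print(url)
print(sites_by_project(sample_response))
```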
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  - Targets data consumers, providers, developers, and general users
  - Provides access to information about GDC and contributed cancer genomic data sets
  - Instructs users on the use of GDC data access and submission tools
  - Provides descriptions of GDC bioinformatics pipelines
  - Documents supported GDC data types and file formats
  - Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site
  - https://gdc.nci.nih.gov
• GDC Documentation Site
  - https://gdc-docs.nci.nih.gov
• GDC Data Portal
  - https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  - https://gdc-portal.nci.nih.gov/submission
  - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  - https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  - https://gdc-api.nci.nih.gov
• GDC Support
  - GDC Assistance: support@nci-gdc.datacommons.io
  - GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data
  - Aggregation
  - Access
  - Sharing
  - Analysis
Whole genome sequence + Phenotype
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Figure: data flow among birth defect or childhood cancer cohorts, the DNA sequencing center, dbGaP (birth defect BAM/VCF), the NCI GDC (cancer BAM/VCF), the Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries), and users]
[Figure: the GMKF Data Resource shown with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and Center for Mendelian Genomics]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
Genomic Matchmaking
Patient 1 (Clinical Geneticist 1)
  Genotypic Data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
  Phenotypic Data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5
Patient 2 (Clinical Geneticist 2)
  Genotypic Data: Gene D, Gene G, Gene H
  Phenotypic Data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6
Genomic Matchmaker: Notification of Match
Courtesy of Joel Krier
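The match in this diagram comes down to set overlap: the two patients share one candidate gene and most of their phenotypic features. The sketch below computes that overlap for the two patients on the slide. The Jaccard ratio used as a similarity score is an assumption chosen for illustration and is not presented in the deck as the scoring method of any matchmaker.

```python
# Patient profiles transcribed from the slide.
patient_1 = {
    "genes": {"Gene A", "Gene B", "Gene C", "Gene D", "Gene E", "Gene F"},
    "features": {"Feature 1", "Feature 2", "Feature 3", "Feature 4", "Feature 5"},
}
patient_2 = {
    "genes": {"Gene D", "Gene G", "Gene H"},
    "features": {"Feature 1", "Feature 3", "Feature 4", "Feature 5", "Feature 6"},
}


def match(a, b):
    """Report shared genes and features, plus a Jaccard phenotype score."""
    shared_genes = a["genes"] & b["genes"]
    shared_features = a["features"] & b["features"]
    all_features = a["features"] | b["features"]
    return {
        "shared_genes": sorted(shared_genes),
        "shared_features": sorted(shared_features),
        # Illustrative score: shared features over all features seen.
        "phenotype_similarity": len(shared_features) / len(all_features),
    }


result = match(patient_1, patient_2)
print(result)
```

Here the only shared candidate is Gene D, with four of the six distinct features in common, which is the kind of overlap that would trigger a notification to both clinicians.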
Matchmaker Exchange Collaboration and Support from GA4GH and IRDiRC
Needs span multiple GA4GH workgroups
• Data Work Group (data format and interfaces)
• Regulatory and Ethics (patient consent)
• Security (patient privacy and user authentication)
Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.
Human Mutation Special Issue
The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
Innovative genomic collaboration using the GENESIS (GEM.app) platform
Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
Participant-led matchmaking
GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
Data sharing in the Undiagnosed Disease Network
The Genomic Birthday Paradox: How Much is Enough?
Quantifying and mitigating false-positive disease associations in rare disease matchmaking
Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue caused by de novo variants in HNRNPK
Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies
The Matchmaker Exchange Connecting Matchmakers to Accelerate Gene Discovery
www.matchmakerexchange.org
MME Use Cases
• Supported today
  - Match on Gene/Phenotype
    1 strong candidate gene (e.g., de novo variant in GUS)
    10 candidate genes, each with rare variant
  - Match on Phenotype Only
• Coming soon
  - Match case to non-human models
  - Match using phenotype and VCF
  - Patient-initiated matching
Connected and Soon to be Connected Matchmakers
[Figure: the Matchmaker Exchange and its live and planned connections. Labels shown: GENESIS, GeneMatcher, DECIPHER, RD-Connect, ClinGen GenomeConnect, Monarch, PhenomeCentral, Broad Institute, RDAP, patient-initiated matching, model organisms (mouse, zebrafish), Orphanet, ClinVar, OMIM]
Connecting Data in the Big Data World
Centralized Database: Everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA
Centralized Hub: APIs connect each database to a central hub. Example: Many commercial platforms
Federated Network: All databases connected through multiple APIs. Example: Matchmaker Exchange
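The federated model differs from the other two in that a query fans out from one node to its peers while each node keeps its own data. The sketch below is a toy illustration of that pattern under stated assumptions: the `Matchmaker` class, node names, case ids, and gene lists are all hypothetical, and real nodes would exchange requests over a network API rather than direct method calls.

```python
class Matchmaker:
    """One node in a hypothetical federated network of databases."""

    def __init__(self, name, cases):
        self.name = name
        self.cases = cases  # case id -> set of candidate genes
        self.peers = []

    def connect(self, other):
        """Create a two-way link between this node and another."""
        if other not in self.peers:
            self.peers.append(other)
        if self not in other.peers:
            other.peers.append(self)

    def local_matches(self, gene):
        """Cases held by this node that list the gene as a candidate."""
        return [
            (self.name, case_id)
            for case_id, genes in sorted(self.cases.items())
            if gene in genes
        ]

    def federated_matches(self, gene):
        """Query this node, then each directly connected peer."""
        results = self.local_matches(gene)
        for peer in self.peers:
            results.extend(peer.local_matches(gene))
        return results


# Hypothetical nodes and cases for illustration.
node_a = Matchmaker("Node A", {"a1": {"Gene D"}})
node_b = Matchmaker("Node B", {"b1": {"Gene D", "Gene G"}, "b2": {"Gene H"}})
node_c = Matchmaker("Node C", {"c1": {"Gene A"}})
node_a.connect(node_b)
node_a.connect(node_c)
node_b.connect(node_c)

print(node_a.federated_matches("Gene D"))
```

No node ever holds a copy of another node's cases; only the match results travel, which is the property that distinguishes this layout from a central database.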
Matchmaker Exchange Acknowledgements
S Balasubramanian, Mike Bamshad, Sergio Beltran Agullo, Jonathan Berg, Kym Boycott, Anthony Brookes, Michael Brudno, Han Brunner, Orion Buske, Deanna Church, Raymond Dalgliesh, Andrew Devereau, Johan den Dunnen, Helen Firth, Paul Flicek, Jan Friedman, Richard Gibbs, Marta Girdea, Robert Green, Matt Hurles, Ada Hamosh, Ekta Khurana, Sebastian Kohler, Joel Krier, Owen Lancaster, Melissa Landrum, Paul Lasko, Rick Lifton, Daniel MacArthur, Alex MacKenzie, Danielle Metterville, Debbie Nickerson, Woong-Yan Park, Justin Paschall, Anthony Philippakis, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner
NIAGADS
The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP
April 26, 2016
Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)
• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia
Long-Term Plan for the ADSP
• Discovery Phase 2014-2018
  • Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC'd data released 7/13/2015
  • Case-control
    • Whole exome sequencing (WES): 5000 cases and 5000 controls
    • 1000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC'd data released 11/2015
• Follow-Up 2016-2021: WGS in case-control sample sets emphasizing ethnically diverse cohorts
ADSP Discovery Phase Analysis
• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing
Participation
• PARTICIPANTS
  - Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  - Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  - NIH staff: NHGRI and NIA
  - External Consultants
• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups
Genomics Center
Organization of the ADSP
ADSP Infrastructure and Support
NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]
National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]
National Alzheimer's Coordinating Center, NACC [U01 AG016976]
Alzheimer's Disease Centers
NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]
NIAGADS History and Function
Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
NIA's repository for the genetics of late-onset Alzheimer's disease data
Datasets, Genomics Database, Analysis Resource
Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy
Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)
NIAGADS ADSP Data Coordinating Center
• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and Office of Science Policy on "Trusted Partners"
Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP
Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
Data production: Define data formatting; Collect and curate phenotypes; Track data production; Additional quality metrics; Coordinate public data release
Secondary/derived data management: Documentation (README/Manifest), packaging, and release; Notification and tracking
IT support: Website; Members area; Face-to-face meeting support; dbGaP exchange area
Facilitate interactions with other AD investigators
Rapid response to unforeseen ADSP and NIH requests
Interactions
[Figure: interactions among NIAGADS, NIA/NHGRI, the Baylor/Broad/WashU LSACs, NCBI dbGaP/SRA, the ADSP Data Flow and other Work Groups, the Genomics Center, NCRAD, and ADGC/CHARGE]
ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site
Management of Work Group Data/Work Flow
Exchange Area
Data Deposition and Data Sharing
ADSP-specific datasets: 68 Tb
21 Reference datasets with 5560 files in use by the ADSP
>20 Reference datasets planned
1 TB = 1024 GB = 1048576 MB = 1073741824 KB = 1099511627776 bytes
A 1 TB hard drive has the capacity to hold a trillion bytes
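The unit chain on this slide uses powers of 1024, while "a trillion bytes" is the decimal figure of 10^12. A short check of the arithmetic shows both the slide's numbers and the gap between the two conventions. The variable names are introduced here for illustration only.

```python
# Binary convention used in the slide's chain: each step is a factor of 1024.
gb_per_tb = 1024
mb_per_tb = 1024 ** 2
kb_per_tb = 1024 ** 3
bytes_per_binary_tb = 1024 ** 4

# Decimal convention behind "a trillion bytes".
bytes_per_decimal_tb = 10 ** 12

# How much larger the binary terabyte is than the decimal one.
ratio = bytes_per_binary_tb / bytes_per_decimal_tb

print(gb_per_tb, mb_per_tb, kb_per_tb, bytes_per_binary_tb)
print(round(ratio, 4))
```

The binary terabyte is roughly 10% larger than a trillion bytes, which matters when storage totals of tens of terabytes are compared across systems that use different conventions.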
dbGaP/NIAGADS release for the research community at large
11555 BAM files released 12/2014
Pedigree and phenotype information
Sequencing Quality Control Metrics
Size of the ADSP
Reference: Muir et al. "The real cost of sequencing: scaling computation to keep pace with data generation." Genome Biology 2016;17:53
ADSP Members Area (Dashboard)
Calendar Documents Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list: 153 records
352 meeting minutes
>600 other documents
>100 consent files
10 analysis plans
85TB WGS, 1075TB WES data (dbGaP/SRA)
42 TB Files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
SNP/Gene reports, Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimerrsquos Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
GWAS results combined with
SNPs below
Gene annotations transformed to SNPs
Genomic features from identified SNPs
Find a genomic location of interest by viewing tracks on the genome browser
SNPs
genes
GWAS results
functional genomics data
View Genomic Information by the Genome Browser
search result
Data available at the NIAGADS Genomics Database
Gene models SNP information
Pathway annotations Gene Ontology
KEGG Pathway
Functional genomics data ENCODE
FANTOM5
GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics Ongoing effort to deposit annotations curated by ADSP AnnotationWG into Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  • Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  • Big data management
  • IT infrastructure and web development
• Administrative support and collaboration
  • Familiarity with human subjects research regulations and NIH policy
  • Leadership and coordinating skills
  • Flexibility; close collaboration with program officers
• Domain expertise
  • Knowledge of the disease and genetic research
  • Familiarity with the community and cohorts
  • Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
What the Data Coordinating Center Should Know
Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements NIAGADS EAB Matthew Farrer
Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober
NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly bull Data flow WG
bull Annotation WG IGAPADGCCHARGEEADIGERAD
NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon
NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao
GDC Overview, GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
  - Enables the identification of both high- and low-frequency cancer drivers
  - Assists in defining genomic determinants of response to therapy
  - Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  - Harmonization of raw sequence both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  - Application of state-of-the-art methods of generating high-level data
GDC Overview Infrastructure
[Figure: GDC infrastructure, grouped as GDC Users, GDC System Components, and GDC Interfaces. Labels shown: Data Sources (DCC, CGHub), Data Submitters, Open Access Data Users, Controlled Access Data Users, eRA Commons & dbGaP, 3rd Party Applications, Data Access Tools, Data Submission Tools, APIs, Data Import System, Metadata & Data Storage, Reporting System, Harmonization & Generation Pipelines, Alignment & Processing Tools, Data Security System, Digital ID System]
GDC Overview Resources
[Figure: GDC resources: GDC Data Portal, GDC Data Center, GDC Data Transfer Tool, GDC Reports, GDC Data Model, GDC Data Submission Portal, GDC Application Programming Interface (API), GDC Bioinformatics Pipeline, GDC Documentation and Support, GDC Organization and Collaborators]
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Submission Data Submitter Types
• The GDC supports two types of data submitters
  - Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  - Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, both of which use the GDC API
GDC Data Submission Data Submission Policies
• dbGaP Data Submission Policies
  - GDC data submitters must first apply for data submission authorization through dbGaP
  - Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  - Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  - Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality checks and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  - Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  - Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
GDC Data Submission Data Submission Process
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter's credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | TSV, JSON for each
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | TSV, JSON for each
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  - GDC clinical data elements were reviewed with members of the research community
  - Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
[Figure: a Case with its Clinical Data, Biospecimen Data, and Experiment Data; Clinical Data comprises Demographics, Diagnosis, Family History (Optional), Exposure (Optional), and Treatment (Optional)]
Data Types and File Formats: Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
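As a small illustration of how a per-read-group QC report like the one above might be consumed, the Python sketch below lists the modules that did not pass for each read group. The dictionary transcribes only the rows of the example table that contain a FAIL or WARNING; the helper name `summarize_qc` is hypothetical and is not part of FastQC or the GDC.

```python
# Statuses transcribed from the example QC table (columns RG_1..RG_4).
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]
QC_TABLE = {
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def summarize_qc(table, read_groups):
    """Return {read_group: {"FAIL": [modules], "WARNING": [modules]}}."""
    summary = {rg: {"FAIL": [], "WARNING": []} for rg in read_groups}
    for module, statuses in table.items():
        for rg, status in zip(read_groups, statuses):
            if status in ("FAIL", "WARNING"):
                summary[rg][status].append(module)
    return summary


if __name__ == "__main__":
    for rg, issues in summarize_qc(QC_TABLE, READ_GROUPS).items():
        print(rg, issues)
```

A submitter could use a summary of this kind to decide which read groups need attention before requesting release.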
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  Validate data against GDC standard data types defined in the project data dictionary
  Obtain information on the status of data submission and processing by project
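To make the idea of validating data against a data dictionary concrete, here is a minimal Python sketch that checks a TSV submission for required fields and allowed values. The schema below is invented for illustration; the field names, the allowed values, and the function `validate_tsv` are assumptions and do not reproduce the actual GDC data dictionary or the portal's validation logic.

```python
import csv
import io

# Hypothetical, simplified dictionary entry; the real GDC data dictionary
# defines its own properties, required fields, and permitted values.
DEMOGRAPHIC_SCHEMA = {
    "required": ["submitter_id", "gender"],
    "allowed": {"gender": {"female", "male", "unknown"}},
}


def validate_tsv(text, schema):
    """Return a list of (row_number, message) problems found in a TSV string."""
    problems = []
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    for row_number, row in enumerate(reader, start=1):
        # Every required field must be present and non-empty.
        for field in schema["required"]:
            if not (row.get(field) or "").strip():
                problems.append((row_number, f"missing required field: {field}"))
        # Fields with a controlled vocabulary must use one of its values.
        for field, allowed in schema["allowed"].items():
            value = (row.get(field) or "").strip()
            if value and value not in allowed:
                problems.append((row_number, f"invalid {field}: {value}"))
    return problems


SAMPLE_TSV = "submitter_id\tgender\ncase-1\tfemale\ncase-2\t\n\tother\n"

if __name__ == "__main__":
    for row_number, message in validate_tsv(SAMPLE_TSV, DEMOGRAPHIC_SCHEMA):
        print(f"row {row_number}: {message}")
```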
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  Command line interface to specify desired transfer protocol
  Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  Securely submit biospecimen, clinical, and experiment files using a token
  Submit data to GDC by performing create, update, and retrieve actions on entities
  Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
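The create-entities endpoint above can be sketched in Python as follows. This only builds the request object and does not send it. The host is the API address listed in this deck's references; the entity payload, the placeholder program and project names, and the use of an `X-Auth-Token` header for the token are assumptions made for illustration.

```python
import json
import urllib.request

API_ROOT = "https://gdc-api.nci.nih.gov"  # API host listed in this deck's references


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: the authentication token travels in this header.
            "X-Auth-Token": token,
        },
    )


request = build_create_request(
    "PROGRAM",
    "PROJECT",
    [{"type": "case", "submitter_id": "example-case-1"}],  # hypothetical entity
    token="PLACEHOLDER_TOKEN",
)

if __name__ == "__main__":
    print(request.get_method(), request.full_url)
```

Sending the request (for example with `urllib.request.urlopen`) would require a valid token and an authorized project, so that step is left out here.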
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]
GDC Processing: GDC High Level Data Generation
• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Retrieval
• Once data is submitted into the GDC and the data submitter signs off on the data to request data release, the GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
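A graph-based model of this kind can be sketched in a few lines of Python. The example below is a toy illustration only: the class `Graph`, the node labels, and the identifiers are assumptions and do not reflect the GDC's actual schema or storage technology. It shows how unique identifiers and edges let a query walk from a project to the data files linked to it.

```python
from collections import defaultdict, deque


class Graph:
    """Tiny directed graph: nodes keyed by unique ID, edges as adjacency lists."""

    def __init__(self):
        self.nodes = {}
        self.edges = defaultdict(list)

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, **props}

    def add_edge(self, parent_id, child_id):
        self.edges[parent_id].append(child_id)

    def descendants(self, start_id, label):
        """Breadth-first search for all reachable nodes with the given label."""
        found, queue, seen = [], deque([start_id]), {start_id}
        while queue:
            current = queue.popleft()
            for child in self.edges[current]:
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
                    if self.nodes[child]["label"] == label:
                        found.append(child)
        return found


# Hypothetical identifiers: one project, one case, clinical data, two files.
g = Graph()
g.add_node("proj-1", "project")
g.add_node("case-1", "case")
g.add_node("clin-1", "clinical")
g.add_node("file-1", "file", file_name="reads.bam")
g.add_node("file-2", "file", file_name="clinical.tsv")
g.add_edge("proj-1", "case-1")
g.add_edge("case-1", "clin-1")
g.add_edge("case-1", "file-1")
g.add_edge("clin-1", "file-2")

if __name__ == "__main__":
    print(g.descendants("proj-1", "file"))
```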
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  Data browsing by project, file, case, or annotation
  Visualization allowing users to perform fine-grained filtering of search results
  Data search using advanced smart search technology
  Data selection into a personalized cart
  Data download from cart or a high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  Command line interface to specify desired transfer protocol and multiple files for download
  Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  Supports the secure transfer of controlled access data using an authentication key (token)
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  Securely retrieve biospecimen, clinical, and molecular files
  Perform BAM slicing
Example query: https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
The slide labels the parts of this URL as API URL, Endpoint, URL parameters, and Query parameters.
Example response (portion removed for readability):
  {
    "data": {
      "hits": [
        {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
        {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
        {"project_id": "TCGA-LAML", "primary_site": "Blood"},
        {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
        {"project_id": "TCGA-UVM", "primary_site": "Eye"},
        {"project_id": "TARGET-AML", "primary_site": "Blood"},
        {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
        {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
        {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
        {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
      ]
    }
  }
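The query above can be assembled and its response read with the Python standard library, as in the sketch below. It does not contact the network: `SAMPLE_RESPONSE` copies the first three hits from the slide, and the helper names `projects_url` and `sites_by_project` are hypothetical. The host is the one printed on the slide and may differ from the address the service uses today.

```python
import json
from urllib.parse import urlencode

API_ROOT = "https://gdc-api.nci.nih.gov"  # host as given on the slide


def projects_url(fields, pretty=True):
    """Build the /projects query URL with a comma-separated field list."""
    query = urlencode(
        {"fields": ",".join(fields), "pretty": str(pretty).lower()},
        safe=",",  # keep commas readable instead of percent-encoding them
    )
    return f"{API_ROOT}/projects?{query}"


# Response excerpt in the shape shown on the slide (first three hits only).
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"}
]}}
"""


def sites_by_project(response_text):
    """Map each project_id in a /projects response to its primary_site."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


if __name__ == "__main__":
    print(projects_url(["project_id", "primary_site"]))
    print(sites_by_project(SAMPLE_RESPONSE))
```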
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  Targets data consumers, providers, developers, and general users
  Provides access to information about GDC and contributed cancer genomic data sets
  Instructs users on the use of GDC data access and submission tools
  Provides descriptions of GDC bioinformatics pipelines
  Documents supported GDC data types and file formats
  Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site
  - https://gdc.nci.nih.gov
• GDC Documentation Site
  - https://gdc-docs.nci.nih.gov
• GDC Data Portal
  - https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  - https://gdc-portal.nci.nih.gov/submission
  - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  - https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  - https://gdc-api.nci.nih.gov
• GDC Support
  - GDC Assistance: support@nci-gdc.datacommons.io
  - GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype)
  - Aggregation
  - Access
  - Sharing
  - Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Figure: data flow. Elements shown: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF and cancer BAM/VCF; dbGaP and NCI GDC; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users]
[Figure: the GMKF Data Resource shown alongside dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist and, if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
Matchmaker Exchange: Collaboration and Support from GA4GH and IRDiRC
Needs span multiple GA4GH workgroups
• Data Work Group (data format and interfaces)
• Regulatory and Ethics (patient consent)
• Security (patient privacy and user authentication)
Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.
Human Mutation Special Issue
The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
Innovative genomic collaboration using the GENESIS (GEM.app) platform
Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
Participant-led matchmaking
GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
Data sharing in the Undiagnosed Disease Network
The Genomic Birthday Paradox: How Much is Enough?
Quantifying and mitigating false-positive disease associations in rare disease matchmaking
Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue caused by de novo variants in HNRNPK
Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies
The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery
www.matchmakerexchange.org
MME Use Cases
• Supported today
  - Match on Gene/Phenotype
    1 strong candidate gene (e.g., de novo variant in GUS)
    10 candidate genes, each with rare variant
  - Match on Phenotype Only
• Coming soon
  - Match case to non-human models
  - Match using phenotype and VCF
  - Patient-initiated matching
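The two supported use cases can be illustrated with a small Python sketch. This is not the Matchmaker Exchange API or any matchmaker's scoring method: the record layout, the placeholder gene and phenotype labels, the function names, and the use of Jaccard overlap for phenotypes are all assumptions chosen to keep the example short.

```python
def phenotype_overlap(a, b):
    """Jaccard overlap between two collections of phenotype terms."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def find_matches(query, patients, mode="gene_phenotype"):
    """Return (patient_id, shared_genes, overlap) tuples, best matches first.

    mode="gene_phenotype" requires at least one shared candidate gene;
    mode="phenotype_only" requires only some phenotype overlap.
    """
    results = []
    for patient in patients:
        shared = sorted(set(query["genes"]) & set(patient["genes"]))
        overlap = phenotype_overlap(query["phenotypes"], patient["phenotypes"])
        if mode == "gene_phenotype" and not shared:
            continue
        if mode == "phenotype_only" and overlap == 0:
            continue
        results.append((patient["id"], shared, round(overlap, 2)))
    return sorted(results, key=lambda r: (-len(r[1]), -r[2]))


# Placeholder records; real profiles would use gene symbols and HPO terms.
QUERY = {"genes": ["GENE_A"], "phenotypes": ["PHENO_1", "PHENO_2", "PHENO_3"]}
PATIENTS = [
    {"id": "P1", "genes": ["GENE_A", "GENE_B"], "phenotypes": ["PHENO_1", "PHENO_2"]},
    {"id": "P2", "genes": ["GENE_C"], "phenotypes": ["PHENO_2", "PHENO_3", "PHENO_4"]},
    {"id": "P3", "genes": ["GENE_D"], "phenotypes": ["PHENO_9"]},
]

if __name__ == "__main__":
    print(find_matches(QUERY, PATIENTS))
    print(find_matches(QUERY, PATIENTS, mode="phenotype_only"))
```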
Connected and Soon to be Connected Matchmakers
[Figure: Matchmaker Exchange network. Labels shown: GENESIS, GeneMatcher, DECIPHER, RD-Connect, ClinGen GenomeConnect, Monarch, PhenomeCentral, Broad Institute RDAP, patient-initiated matching, model organisms (mouse, zebrafish), Orphanet, ClinVar, OMIM; legend entry "Live"]
Connecting Data in the Big Data World
Centralized Database: Everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA
Centralized Hub: APIs connect each database to a central hub. Example: many commercial platforms
Federated Network: All databases connected through multiple APIs. Example: Matchmaker Exchange
Matchmaker Exchange Acknowledgements
S Balasubramanian, Robert Green, Mike Bamshad, Matt Hurles, Sergio Beltran Agullo, Ada Hamosh, Jonathan Berg, Ekta Khurana, Kym Boycott, Sebastian Kohler, Anthony Brookes, Joel Krier, Michael Brudno, Owen Lancaster, Han Brunner, Melissa Landrum, Orion Buske, Paul Lasko, Deanna Church, Rick Lifton, Raymond Dalgliesh, Daniel MacArthur, Andrew Devereau, Alex MacKenzie, Johan den Dunnen, Danielle Metterville, Helen Firth, Debbie Nickerson, Paul Flicek, Woong-Yan Park, Jan Friedman, Justin Paschall, Richard Gibbs, Anthony Philippakis, Marta Girdea, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner
NIAGADS
The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP
April 26, 2016
Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)
• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 1212 between 3 LSSC and 2 AD genetics consortia
Long-Term Plan for the ADSP
• Discovery Phase 2014-2018
  - Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC'd data released 7/13/2015
  - Case-control
    • Whole exome sequencing (WES): 5000 cases and 5000 controls
    • 1000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC'd data released 112015
• Follow-Up 2016-2021
  - WGS in case-control sample sets emphasizing ethnically diverse cohorts
ADSP Discovery Phase Analysis
• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing
Participation
• PARTICIPANTS
  - Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  - Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  - NIH staff: NHGRI and NIA
  - External Consultants
• INFRASTRUCTURE
  - Executive Committee, Analysis Coordination Committee, and 8 Work Groups
Genomics Center
Organization of the ADSP
ADSP Infrastructure and Support
NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]
National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]
National Alzheimer's Coordinating Center, NACC [U01 AG016976]
Alzheimer's Disease Centers
NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]
NIAGADS History and Function
Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
NIA's repository for the genetics of late-onset Alzheimer's disease data
Datasets, Genomics Database, Analysis Resource
Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy
Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)
NIAGADS ADSP Data Coordinating Center
• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and Office of Science Policy on "Trusted Partners"
Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP
Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
Data production: define data formatting; collect and curate phenotypes; track data production; additional quality metrics; coordinate public data release
Secondary/derived data management: documentation (README/Manifest); packaging and release; notification and tracking
IT support: website; members area; face-to-face meeting support; dbGaP exchange area
Facilitate interactions with other AD investigators
Rapid response to unforeseen ADSP and NIH requests
Interactions
[Figure: NIAGADS shown with NIA/NHGRI, the Baylor/Broad/WashU LSACs, NCBI dbGaP/SRA, the ADSP Data Flow and other Work Groups, the Genomics Center, NCRAD, and ADGC/CHARGE]
ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site
Management of Work Group Data/Work Flow
Exchange Area
Data Deposition and Data Sharing
ADSP-specific datasets: 68 TB
21 reference datasets with 5560 files in use by the ADSP
>20 reference datasets planned
1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes. A 1 TB hard drive has the capacity to hold roughly a trillion bytes
dbGaP/NIAGADS release for the research community at large
11555 BAM files released 122014
Pedigree and phenotype information
Sequencing Quality Control Metrics
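The unit conversion quoted on this slide can be checked with a few lines of Python. The sketch uses the binary convention from the slide (1 TB = 1024^4 bytes) and compares it with the decimal convention (10^12 bytes, the "trillion bytes" figure); the function name `tb_to_bytes` is only illustrative.

```python
KB = 1024
BYTES_PER_TB_BINARY = KB ** 4    # 1,099,511,627,776 bytes, as on the slide
BYTES_PER_TB_DECIMAL = 10 ** 12  # one trillion bytes


def tb_to_bytes(terabytes, binary=True):
    """Convert terabytes to bytes under the binary or decimal convention."""
    per_tb = BYTES_PER_TB_BINARY if binary else BYTES_PER_TB_DECIMAL
    return terabytes * per_tb


if __name__ == "__main__":
    print(f"1 TB (binary)  = {BYTES_PER_TB_BINARY:,} bytes")
    print(f"1 TB (decimal) = {BYTES_PER_TB_DECIMAL:,} bytes")
    print(f"ratio = {BYTES_PER_TB_BINARY / BYTES_PER_TB_DECIMAL:.3f}")
```

The two conventions differ by about 10% at the terabyte scale, which is why the slide's exact byte count and its "trillion bytes" remark do not match exactly.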
Size of the ADSP
Reference: "The real cost of sequencing: scaling computation to keep pace with data generation", Muir et al., Genome Biology 2016, 17:53
ADSP Members Area (Dashboard)
Calendar, Documents, Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list: 153 records
352 meeting minutes
>600 other documents
>100 consent files
10 analysis plans
85TB WGS 1075TB WES data (dbGaP/SRA)
42 TB Files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
SNP/Gene reports, Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer's Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
GWAS results combined with
SNPs below
Gene annotations transformed to SNPs
Genomic features from identified SNPs
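The combined search described above (GWAS results filtered by significance, intersected with gene annotations transformed to SNPs) can be sketched in Python as a pair of set operations. The SNP identifiers, gene names, p-values, and function names are placeholders, and the 5e-8 cutoff is the conventional genome-wide threshold rather than a value taken from the NIAGADS interface.

```python
GENOME_WIDE_SIGNIFICANCE = 5e-8  # conventional threshold; an assumption here


def significant_snps(gwas_rows, threshold=GENOME_WIDE_SIGNIFICANCE):
    """Return the set of SNPs whose GWAS p-value is at or below the threshold."""
    return {row["snp"] for row in gwas_rows if row["p_value"] <= threshold}


def snps_in_genes(annotation, genes):
    """Transform gene annotations into the set of SNPs mapped to those genes."""
    return {snp for gene in genes for snp in annotation.get(gene, [])}


# Placeholder data for illustration only.
GWAS = [
    {"snp": "rs_example_1", "p_value": 1e-12},
    {"snp": "rs_example_2", "p_value": 3e-4},
    {"snp": "rs_example_3", "p_value": 2e-9},
]
ANNOTATION = {
    "GENE_X": ["rs_example_1", "rs_example_2"],
    "GENE_Y": ["rs_example_3"],
}

# SNPs that are both significant and located in the genes of interest.
hits = significant_snps(GWAS) & snps_in_genes(ANNOTATION, ["GENE_X"])

if __name__ == "__main__":
    print(sorted(hits))
```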
Find a genomic location of interest by viewing tracks on the genome browser
SNPs
genes
GWAS results
functional genomics data
View Genomic Information by the Genome Browser
search result
Data available at the NIAGADS Genomics Database
Gene models
SNP information
Pathway annotations: Gene Ontology, KEGG Pathway
Functional genomics data: ENCODE, FANTOM5, GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics
Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  • Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  • Big data management
  • IT infrastructure and web development
• Administrative support and collaboration
  • Familiarity with human subjects research regulations and NIH policy
  • Leadership and coordinating skills
  • Flexibility, close collaboration with program officers
• Domain expertise
  • Knowledge of the disease and genetic research
  • Familiarity with the community and cohorts
  • Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
What the Data Coordinating Center Should Know
Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements
Patients and families of those with AD
NIA ADRC program and centers
NACC, NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD
NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly
NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon
NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao
GDC Overview: GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
  - Enables the identification of both high- and low-frequency cancer drivers
  - Assists in defining genomic determinants of response to therapy
  - Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  - Harmonization of raw sequence, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  - Application of state-of-the-art methods of generating high level data
GDC Overview: Infrastructure
[Figure: GDC infrastructure diagram organized into GDC Users, GDC System Components, and GDC Interfaces. Labels shown: Data Sources (DCC, CGHub), Data Submitters, Open Access Data Users, Controlled Access Data Users, eRA Commons & dbGaP, Data Access Tools, Data Import System, Metadata & Data Storage, Reporting System, Harmonization & Generation Pipelines, 3rd Party Applications, Alignment & Processing Tools, Data Submission Tools, Data Security System, APIs, Digital ID System]
GDC Overview: Resources
[Figure: GDC resources. Labels shown: GDC Data Portal, GDC Data Center, GDC Data Transfer Tool, GDC Reports, GDC Data Model, GDC Data Submission Portal, GDC Application Programming Interface (API), GDC Bioinformatics Pipeline, GDC Documentation and Support, GDC Organization and Collaborators]
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
  - Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  - Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory investigator or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  - GDC data submitters must first apply for data submission authorization through dbGaP
  - Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  - Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  - Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  - Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  - Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
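The release timing in the policy above can be illustrated with a small Python sketch. It assumes that "either six months after data submission or at the time of first publication" means whichever comes first, which the slide does not state explicitly. The function names are hypothetical and this is not GDC code.

```python
import calendar
from datetime import date
from typing import Optional


def add_months(d: date, months: int) -> date:
    """Return the date `months` calendar months after `d`, clamping the day
    to the length of the target month (e.g. Aug 31 + 6 months -> end of Feb)."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def release_date(first_submission: date,
                 first_publication: Optional[date] = None) -> date:
    """Assumed rule: release at first publication or six months after the
    first submission, whichever is earlier."""
    deadline = add_months(first_submission, 6)
    if first_publication is not None and first_publication < deadline:
        return first_publication
    return deadline


if __name__ == "__main__":
    print(release_date(date(2016, 4, 26)))
    print(release_date(date(2016, 4, 26), first_publication=date(2016, 6, 1)))
```

If the intended rule is instead "whichever comes later", only the comparison in `release_date` would change.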
GDC Data Submission Data Submission Process
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter's credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Portion | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Analyte | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Aliquot | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Diagnosis | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Exposure | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Family History | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Treatment | TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  - GDC clinical data elements were reviewed with members of the research community
  - Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
[Figure: a Case links to Clinical Data (Demographics, Diagnosis, Family History (optional), Exposure (optional), Treatment (optional)), Biospecimen Data, and Experiment Data]
Data Types and File Formats: Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
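As an illustration of how a submitter might read the QC report above, the following Python sketch lists the FastQC modules that report a given status for each read group. The statuses are copied from the example table; the data layout and the function name are hypothetical and do not reflect how the GDC stores or reports these metrics.

```python
from typing import Dict, List

READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Module statuses copied from the example table, one entry per read group.
QC_STATUS: Dict[str, List[str]] = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def modules_with_status(status: str) -> Dict[str, List[str]]:
    """Map each read group to the modules that report `status`."""
    result: Dict[str, List[str]] = {rg: [] for rg in READ_GROUPS}
    for module, statuses in QC_STATUS.items():
        for read_group, module_status in zip(READ_GROUPS, statuses):
            if module_status == status:
                result[read_group].append(module)
    return result


if __name__ == "__main__":
    for read_group, modules in modules_with_status("FAIL").items():
        print(read_group, modules)
```

Calling `modules_with_status("WARNING")` gives the softer flags in the same shape.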
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  - Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  - Validate data against GDC standard data types defined in the project data dictionary
  - Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  - Command line interface to specify desired transfer protocol
  - Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  - Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  - Securely submit biospecimen, clinical, and experiment files using a token
  - Submit data to GDC by performing create, update, and retrieve actions on entities
  - Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
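A minimal Python sketch of a call to the Create Entities endpoint above follows. It only builds the request and does not send it. The host name comes from the References slide; the `X-Auth-Token` header name, the payload fields, and the program and project names are assumptions for illustration and should be checked against the GDC API documentation.

```python
import json
import urllib.request
from typing import Any, Dict, List

API_ROOT = "https://gdc-api.nci.nih.gov"  # API host listed on the References slide


def build_create_request(program: str, project: str,
                         entities: List[Dict[str, Any]],
                         token: str) -> urllib.request.Request:
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the token
        },
    )


if __name__ == "__main__":
    # Hypothetical program, project, and entity values.
    request = build_create_request(
        "DEMO", "PILOT",
        [{"type": "case", "submitter_id": "CASE-001"}],
        token="<token from the GDC Data Submission Portal>",
    )
    print(request.get_method(), request.full_url)
    # urllib.request.urlopen(request) would send it; omitted here on purpose.
```

Keeping request construction separate from sending makes the payload easy to inspect before anything is submitted.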
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]
GDC Processing: GDC High Level Data Generation
• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing GDC Variant Calling Pipelines
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Retrieval
• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
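The graph idea described above can be sketched in a few lines of Python. This is a toy model only: the node labels and identifiers are hypothetical, and the real GDC data model has many more node types and properties.

```python
from collections import defaultdict, deque
from typing import Dict, List


class Graph:
    """Toy graph: nodes carry a label, edges point from parent to child."""

    def __init__(self) -> None:
        self.labels: Dict[str, str] = {}
        self.children: Dict[str, List[str]] = defaultdict(list)

    def add_node(self, node_id: str, label: str) -> None:
        self.labels[node_id] = label

    def add_edge(self, parent_id: str, child_id: str) -> None:
        if parent_id not in self.labels or child_id not in self.labels:
            raise KeyError("both endpoints must exist before they are linked")
        self.children[parent_id].append(child_id)

    def descendants(self, start_id: str, label: str) -> List[str]:
        """Breadth-first search for all nodes with `label` below `start_id`."""
        found: List[str] = []
        seen = {start_id}
        queue = deque([start_id])
        while queue:
            current = queue.popleft()
            for child in self.children[current]:
                if child in seen:
                    continue
                seen.add(child)
                if self.labels[child] == label:
                    found.append(child)
                queue.append(child)
        return found


def example_graph() -> Graph:
    """Hypothetical project with two cases, linked by unique identifiers."""
    g = Graph()
    for node_id, label in [
        ("proj-1", "project"), ("case-1", "case"), ("case-2", "case"),
        ("clin-1", "clinical"), ("samp-1", "sample"),
        ("file-1", "file"), ("file-2", "file"),
    ]:
        g.add_node(node_id, label)
    for parent, child in [
        ("proj-1", "case-1"), ("proj-1", "case-2"), ("case-1", "clin-1"),
        ("case-1", "samp-1"), ("samp-1", "file-1"), ("case-2", "file-2"),
    ]:
        g.add_edge(parent, child)
    return g


if __name__ == "__main__":
    print(example_graph().descendants("case-1", "file"))
```

Because every file is reachable only through its case, a query for a case's files cannot return files from another case, which is the linkage property the slide describes.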
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  - Data browsing by project, file, case, or annotation
  - Visualization allowing users to perform fine-grained filtering of search results
  - Data search using advanced smart search technology
  - Data selection into a personalized cart
  - Data download from cart or a high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  - Command line interface to specify desired transfer protocol and multiple files for download
  - Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  - Supports the secure transfer of controlled access data using an authentication key (token)
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  - Search for projects, files, cases, and annotations and retrieve associated details in JSON format
  - Securely retrieve biospecimen, clinical, and molecular files
  - Perform BAM slicing
Example request (parts labeled on the slide: API URL, endpoint, URL parameters, query parameters):
  https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Example response (portion removed for readability):
  {
    "data": {
      "hits": [
        {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
        {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
        {"project_id": "TCGA-LAML", "primary_site": "Blood"},
        {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
        {"project_id": "TCGA-UVM", "primary_site": "Eye"},
        {"project_id": "TARGET-AML", "primary_site": "Blood"},
        {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
        {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
        {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
        {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
      ]
    }
  }
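The example request and response above can be reproduced offline with the Python standard library. The sketch below builds the query URL and parses a shortened copy of the example response; it does not contact the API, and the helper names are hypothetical.

```python
import json
from typing import Dict, List
from urllib.parse import urlencode

API_ROOT = "https://gdc-api.nci.nih.gov"  # host taken from the example request


def projects_url(fields: List[str], pretty: bool = True) -> str:
    """Build the /projects query URL with a comma-separated field list."""
    query = urlencode(
        {"fields": ",".join(fields), "pretty": str(pretty).lower()},
        safe=",",  # keep the commas between field names unescaped
    )
    return f"{API_ROOT}/projects?{query}"


# Shortened copy of the example response shown on the slide.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}
"""


def sites_by_project(response_text: str) -> Dict[str, str]:
    """Map project_id to primary_site for every hit in the response."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


if __name__ == "__main__":
    print(projects_url(["project_id", "primary_site"]))
    print(sites_by_project(SAMPLE_RESPONSE))
```

Fetching the URL with `urllib.request.urlopen` and passing the decoded body to `sites_by_project` would be the natural next step, assuming the endpoint is reachable from the user's network.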
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  - Targets data consumers, providers, developers, and general users
  - Provides access to information about GDC and contributed cancer genomic data sets
  - Instructs users on the use of GDC data access and submission tools
  - Provides descriptions of GDC bioinformatics pipelines
  - Documents supported GDC data types and file formats
  - Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site - https://gdc.nci.nih.gov
• GDC Documentation Site - https://gdc-docs.nci.nih.gov
• GDC Data Portal - https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  - https://gdc-portal.nci.nih.gov/submission
  - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool - https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  - https://gdc-api.nci.nih.gov
• GDC Support
  - GDC Assistance: support@nci-gdc.datacommons.io
  - GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype)
  - Aggregation
  - Access
  - Sharing
  - Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Figure: data flow from birth defect and childhood cancer cohorts through the DNA sequencing center (BAM/VCF files) to dbGaP and the NCI GDC, and on to the Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries) and its users]
[Figure: the GMKF Data Resource connected to dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
Human Mutation Special Issue: The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
• The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
• GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
• PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
• Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
• Innovative genomic collaboration using the GENESIS (GEM.app) platform
• Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
• Participant-led matchmaking
• GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
• Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
• Data sharing in the Undiagnosed Disease Network
• The Genomic Birthday Paradox: How Much is Enough?
• Quantifying and mitigating false-positive disease associations in rare disease matchmaking
• Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
• GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue abnormalities caused by de novo variants in HNRNPK
• Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies
The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery
www.matchmakerexchange.org
MME Use Cases
• Supported today
  - Match on Gene/Phenotype
    • 1 strong candidate gene (e.g., de novo variant in GUS)
    • 10 candidate genes, each with rare variant
  - Match on Phenotype Only
• Coming soon
  - Match case to non-human models
  - Match using phenotype and VCF
  - Patient-initiated matching
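The two supported match modes above can be illustrated with a toy scoring function in Python. This is a sketch only: the profile structure, the scoring rule (one point for any shared candidate gene plus the Jaccard overlap of phenotype terms), and the example patients are assumptions for illustration and are not how any Matchmaker Exchange node scores matches.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Profile:
    patient_id: str
    genes: FrozenSet[str]        # candidate genes; empty for phenotype-only
    phenotypes: FrozenSet[str]   # phenotype terms


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_score(query: Profile, candidate: Profile) -> float:
    """Assumed rule: 1.0 for any shared candidate gene, plus phenotype overlap."""
    gene_score = 1.0 if query.genes & candidate.genes else 0.0
    return gene_score + jaccard(query.phenotypes, candidate.phenotypes)


def rank(query: Profile, candidates: List[Profile]) -> List[Tuple[str, float]]:
    """Return (patient_id, score) pairs, best match first."""
    scored = [(c.patient_id, match_score(query, c)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical patients; the gene name GUS is the placeholder used on the slide.
    query = Profile("Q", frozenset({"GUS"}),
                    frozenset({"congenital facial palsy", "cleft palate", "syndactyly"}))
    candidates = [
        Profile("A", frozenset({"GUS"}), frozenset({"cleft palate", "syndactyly"})),
        Profile("B", frozenset({"OTHER"}), query.phenotypes),
        Profile("C", frozenset(), frozenset({"syndactyly"})),
    ]
    for patient_id, score in rank(query, candidates):
        print(patient_id, round(score, 3))
```

A phenotype-only query is simply a profile with an empty gene set, in which case the score reduces to the phenotype overlap.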
Connected and Soon to be Connected Matchmakers
[Figure: the Matchmaker Exchange network. Labels shown: GENESIS, GeneMatcher, DECIPHER, RD-Connect, ClinGen GenomeConnect, Monarch, PhenomeCentral, Broad Institute, RDAP; patient-initiated matching; model organisms (mouse, zebrafish), Orphanet, ClinVar, OMIM; "Live"]
Connecting Data in the Big Data World
• Centralized Database: Everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA
• Centralized Hub: APIs connect each database to a central hub. Example: many commercial platforms
• Federated Network: All databases connected through multiple APIs. Example: Matchmaker Exchange
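One practical difference between the hub and federated models above is the number of connections that must be maintained. The Python sketch below counts them under a simplifying assumption that is not stated on the slide: in the federated network every pair of databases connects directly.

```python
def hub_links(n_databases: int) -> int:
    """Connections needed when each database links only to a central hub."""
    return n_databases


def federated_links(n_databases: int) -> int:
    """Connections needed when every pair of databases links directly
    (assumed full mesh)."""
    return n_databases * (n_databases - 1) // 2


if __name__ == "__main__":
    for n in (3, 7, 10):
        print(n, hub_links(n), federated_links(n))
```

The pairwise count grows quadratically, which is one reason a shared API specification matters in a federated design: each node implements one interface rather than one per partner.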
Matchmaker Exchange Acknowledgements
S Balasubramanian, Robert Green, Mike Bamshad, Matt Hurles, Sergio Beltran Agullo, Ada Hamosh, Jonathan Berg, Ekta Khurana, Kym Boycott, Sebastian Kohler, Anthony Brookes, Joel Krier, Michael Brudno, Owen Lancaster, Han Brunner, Melissa Landrum, Oriean Buske, Paul Lasko, Deanna Church, Rick Lifton, Raymond Dalgliesh, Daniel MacArthur, Andrew Devereau, Alex MacKenzie, Johan den Dunnen, Danielle Metterville, Helen Firth, Debbie Nickerson, Paul Flicek, Woong-Yan Park, Jan Friedman, Justin Paschall, Richard Gibbs, Anthony Philippakis, Marta Girdea, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner
NIAGADS
The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP
April 26, 2016
Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)
• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 1212 between 3 LSSC and 2 AD genetics consortia
Long-Term Plan for the ADSP
• Discovery Phase 2014-2018
  - Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC'd data released 7/13/2015
  - Case-control
    • Whole exome sequencing (WES): 5000 cases and 5000 controls
    • 1000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC'd data released 112015
• Follow-Up 2016-2021
  - WGS in case-control sample sets, emphasizing ethnically diverse cohorts
ADSP Discovery Phase Analysis
• NIA funding for analysis of sequence data, June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing
Participation
• PARTICIPANTS
  - Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  - Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  - NIH staff: NHGRI and NIA
  - External Consultants
• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups
Genomics Center
Organization of the ADSP
ADSP Infrastructure and Support
NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]
National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]
National Alzheimer's Coordinating Center, NACC [U01 AG016976]
Alzheimer's Disease Centers
NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]
NIAGADS History and Function
Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
NIA's repository for the genetics of late-onset Alzheimer's disease data
Datasets, Genomics Database, Analysis Resource
Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy
Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)
NIAGADS ADSP Data Coordinating Center
• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and Office of Science Policy on "Trusted Partners"
Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP
Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and eRA Commons accounts for authentication; no password is stored at the ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data are stored at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
Data production: define data formatting; collect and curate phenotypes; track data production; additional quality metrics; coordinate public data release
Secondary/derived data management: documentation (README/Manifest), packaging, and release; notification and tracking
IT support: website; members area; face-to-face meeting support; dbGaP exchange area
Facilitate interactions with other AD investigators
Rapid response to unforeseen ADSP and NIH requests
Interactions
[Figure: interactions among NIAGADS, NIA/NHGRI, the Baylor/Broad/WashU LSACs, NCBI dbGaP/SRA, the ADSP Data Flow and other Work Groups, the Genomics Center, NCRAD, and ADGC/CHARGE]
ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site
Management of Work Group Data/Work Flow
Exchange Area
Data Deposition and Data Sharing
ADSP-specific datasets: 68 TB
21 reference datasets with 5560 files in use by the ADSP
>20 reference datasets planned
1 TB = 1,024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes. A 1 TB hard drive has the capacity to hold about a trillion bytes
dbGaP/NIAGADS release for the research community at large
11555 BAM files released 122014
Pedigree and phenotype information
Sequencing Quality Control Metrics
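The terabyte conversion quoted on this slide uses binary units (powers of 1024), while "a trillion bytes" is the decimal definition. The short Python sketch below shows both; the function name is hypothetical, and applying the binary unit to the 68 TB figure is an assumption about how that number was measured.

```python
BINARY_TB = 1024 ** 4    # bytes in 1 TB using powers of 1024, as on the slide
DECIMAL_TB = 10 ** 12    # bytes in 1 TB using powers of 1000 ("a trillion bytes")


def terabytes_to_bytes(terabytes: float, binary: bool = True) -> int:
    """Convert terabytes to bytes under the chosen definition of a terabyte."""
    unit = BINARY_TB if binary else DECIMAL_TB
    return int(terabytes * unit)


if __name__ == "__main__":
    print(terabytes_to_bytes(1))                 # binary definition
    print(terabytes_to_bytes(1, binary=False))   # decimal definition
    print(terabytes_to_bytes(68))                # ADSP-specific datasets, binary
```

The two definitions differ by roughly 10 percent at the terabyte scale, which is worth keeping in mind when comparing storage figures from different sources.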
Size of the ADSP
Reference: "The real cost of sequencing: scaling computation to keep pace with data generation", Muir et al., Genome Biology 2016, 17:53
ADSP Members Area (Dashboard)
Calendar Documents Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list 153 records
352 meeting minutes
>600 other documents
>100 consent files
10 analysis plans
85TB WGS, 1075TB WES data (dbGaP/SRA)
42 TB Files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
SNP/Gene reports, Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer's Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
[Figure: GWAS results (SNPs below a significance level) combined with gene annotations transformed to SNPs, giving genomic features from the identified SNPs]
Find a genomic location of interest by viewing tracks on the genome browser: SNPs, genes, GWAS results, functional genomics data
View Genomic Information by the Genome Browser
[Figure: genome browser view of a search result]
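The combined search described above (GWAS significance plus gene annotation) amounts to filtering SNPs by p-value and intersecting them with the SNPs that fall in annotated genes. The Python sketch below shows that logic on made-up data. The SNP and gene identifiers are hypothetical, the default threshold of 5e-8 is the common genome-wide significance convention rather than a value taken from the slides, and none of this reflects the NIAGADS implementation.

```python
from typing import Dict, Iterable, Set


def significant_snps(pvalues: Dict[str, float], threshold: float = 5e-8) -> Set[str]:
    """SNPs whose GWAS p-value is below the threshold."""
    return {snp for snp, p in pvalues.items() if p < threshold}


def snps_in_genes(gene_to_snps: Dict[str, Set[str]], genes: Iterable[str]) -> Set[str]:
    """Transform a set of annotated genes into the SNPs located in them."""
    result: Set[str] = set()
    for gene in genes:
        result |= gene_to_snps.get(gene, set())
    return result


def combined_search(pvalues: Dict[str, float],
                    gene_to_snps: Dict[str, Set[str]],
                    genes: Iterable[str],
                    threshold: float = 5e-8) -> Set[str]:
    """SNPs that are both significant and located in the selected genes."""
    return significant_snps(pvalues, threshold) & snps_in_genes(gene_to_snps, genes)


if __name__ == "__main__":
    # Hypothetical GWAS results and gene-to-SNP annotation.
    pvalues = {"snp_1": 1e-12, "snp_2": 3e-5, "snp_3": 2e-9, "snp_4": 4e-8}
    gene_to_snps = {"GENE_A": {"snp_1", "snp_2"}, "GENE_B": {"snp_3"}, "GENE_C": {"snp_4"}}
    print(sorted(combined_search(pvalues, gene_to_snps, ["GENE_A", "GENE_B"])))
```

Expressing each search as a set makes it easy to add further filters, for example a pathway annotation, by intersecting one more set.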
Data available at the NIAGADS Genomics Database
Gene models, SNP information
Pathway annotations: Gene Ontology, KEGG Pathway
Functional genomics data: ENCODE, FANTOM5, GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics
Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  - Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  - Big data management
  - IT infrastructure and web development
• Administrative support and collaboration
  - Familiarity with human subjects research regulations and NIH policy
  - Leadership and coordinating skills
  - Flexibility, close collaboration with program officers
• Domain expertise
  - Knowledge of the disease and genetic research
  - Familiarity with the community and cohorts
  - Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
What the Data Coordinating Center Should Know
Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements
NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly
Patients and families of those with AD
NIA ADRC program and centers
NACC
NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD
NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon
NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao
GDC Overview: GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
  - Enables the identification of both high- and low-frequency cancer drivers
  - Assists in defining genomic determinants of response to therapy
  - Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  - Harmonization of raw sequence both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  - Application of state-of-the-art methods of generating high level data
GDC Overview Infrastructure
[Figure: GDC infrastructure diagram with three groups (GDC Users, GDC System Components, GDC Interfaces). Labels shown: Data Sources (DCC, CGHub), Data Submitters, Open Access Data Users, Controlled Access Data Users, eRA Commons & dbGaP, Data Access Tools, Data Import System, Metadata & Data Storage, Reporting System, Harmonization & Generation Pipelines, 3rd Party Applications, Alignment & Processing Tools, Data Submission Tools, Data Security System, APIs, Digital ID System]
GDC Overview Resources
[Figure: GDC resources: GDC Data Portal, GDC Data Center, GDC Data Transfer Tool, GDC Reports, GDC Data Model, GDC Data Submission Portal, GDC Application Programming Interface (API), GDC Bioinformatics Pipeline, GDC Documentation and Support, GDC Organization and Collaborators]
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
• Contracting organization supporting GDC management and execution
Other Government, External
• Includes eRA, dbGaP, NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testing (UAT) Testers
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Submission Data Submitter Types
• The GDC supports two types of data submitters
  - Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  - Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory investigator or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API
160
GDC Data Submission Data Submission Policies
bull dbGaP Data Submission Policies ndash GDC data submitters must first apply for data submission authorization through dbGaP ndash Data submission through dbGaP requires institutional certification under NIHrsquos Genomi c
Data Sharing Policy bull GDC Data Sharing Policies
ndash Data Sharing Requirement bull Data submitted to the GDC will be made available to the scientific community according to the
data submitterrsquos NCI Genomic Data Sharing Plan Controlled access data will be made availabl e to members of the community having the appropriate dbGaP Data Use Certification
bull The GDC will produce harmonized data (raw and derived) based on the originally submitted data The GDC will not preserve an exact copy of the originally submitted data however the GDC will preserve the original reads and quality scores
ndash Data Pre-processing Period bull For each project the GDC will afford a pre-processing period of exclusive that allows for
submitters to perform data cleaning and quality and submission of revised data before release bull The pre-processing period may generally last up to six months from the date of first submission
ndash Data Submission Period and Release bull Once submitted data will be processed and validated by the GDC Submitted data will be
released and available via controlled access for research that is consistent with the datasetrsquos ldquodata use limitationsrdquo either six months after data submission or at the time of first publication
ndash Data Redaction bull The GDC in general will not remove data access in response to submitter requests GDC will
remove data access in the following events Data Management Incident Human Subjects 161
Compliance Issue Erroneous Data
GDC Data Submission: Data Submission Process
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC.
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter's credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP.
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API).
• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data.
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Portion | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Analyte | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Aliquot | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Diagnosis | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Exposure | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Family History | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Treatment | TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | - | TSV, JSON
Data File | Slide Image | SVS | - | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | - | Read Group | TSV, JSON
Data Bundle | Slide | - | Slide | TSV, JSON
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
[Figure: a Case linked to Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)), Biospecimen Data, and Experiment Data]
Data Types and File Formats: Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
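As a rough illustration of how a QC report like the one above might be summarized, the following Python sketch reduces the per-module statuses to a single worst status per read group. The dictionary layout and the function name `worst_status` are assumptions made for this example and are not part of FastQC or the GDC; only the rows that contain a non-PASS value in the example table are copied in.

```python
# Severity ordering is an assumption for this sketch: FAIL outranks WARNING outranks PASS.
SEVERITY = {"PASS": 0, "WARNING": 1, "FAIL": 2}

READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Statuses copied from the example table (all-PASS rows omitted for brevity).
QC_TABLE = {
    "Per tile sequence quality":   ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content":   ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content":             ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content":                ["PASS", "PASS", "WARNING", "PASS"],
}


def worst_status(table, read_groups):
    """Return the most severe status seen for each read group."""
    worst = {rg: "PASS" for rg in read_groups}
    for statuses in table.values():
        for rg, status in zip(read_groups, statuses):
            if SEVERITY[status] > SEVERITY[worst[rg]]:
                worst[rg] = status
    return worst


if __name__ == "__main__":
    for rg, status in worst_status(QC_TABLE, READ_GROUPS).items():
        print(rg, status)
```

A summary of this kind only flags which read groups deserve a closer look; whether a WARNING or FAIL matters depends on the library type and the specific module.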
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project
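To make the idea of validating data against a data dictionary concrete, here is a minimal Python sketch. The dictionary contents, the field names (`submitter_id`, `sample_type`), and the function `validate_record` are hypothetical and chosen only for illustration; the actual GDC dictionary is far richer and is not reproduced here.

```python
# Hypothetical, heavily simplified "data dictionary": required fields per entity type.
REQUIRED_FIELDS = {
    "case": {"submitter_id"},
    "sample": {"submitter_id", "sample_type"},
}


def validate_record(record):
    """Return a list of problems found in the record; an empty list means it passed."""
    entity_type = record.get("type")
    if entity_type not in REQUIRED_FIELDS:
        return [f"unknown entity type: {entity_type!r}"]
    errors = []
    for field_name in sorted(REQUIRED_FIELDS[entity_type]):
        if not record.get(field_name):
            errors.append(f"missing required field: {field_name}")
    return errors


if __name__ == "__main__":
    print(validate_record({"type": "case", "submitter_id": "case-001"}))
    print(validate_record({"type": "sample", "submitter_id": "sample-001"}))
```

Returning a list of problems rather than raising on the first one lets a submitter fix every issue in a file in one pass.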
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
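A hedged Python sketch of what a call to the create-entities endpoint above could look like. It only builds the request object and does not send it. The base URL is the GDC API address given in the References slide; the `X-Auth-Token` header name, the payload shape, and the function `build_create_request` are assumptions for illustration and should be checked against the GDC API documentation.

```python
import json
import urllib.request

# Base URL as listed in the References slide of this deck.
API_BASE = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST request for the create-entities endpoint."""
    url = f"{API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the authentication token
        },
    )


if __name__ == "__main__":
    # "PROG", "PROJ", and the entity fields are placeholders, not real identifiers.
    request = build_create_request(
        "PROG", "PROJ", [{"type": "case", "submitter_id": "case-001"}], "my-token"
    )
    print(request.get_method(), request.full_url)
```

Sending the request would be a separate step (for example with `urllib.request.urlopen`), kept out of the sketch so that nothing is transmitted by accident.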
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]
GDC Processing: GDC High Level Data Generation
• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP Array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
GDC Data Retrieval
• Once data is submitted into the GDC and the data submitter signs off on the data to request data release, the GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
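The graph idea behind the data model can be sketched in a few lines of Python. This is an illustrative toy, not the GDC implementation: the `Graph` class, the node labels, and the identifiers are all hypothetical, and edges are treated as directed from a project down to its cases, samples, and files.

```python
from collections import defaultdict


class Graph:
    """Toy directed graph: nodes carry a label, edges point from parent to child."""

    def __init__(self):
        self.nodes = {}                 # node id -> label
        self.edges = defaultdict(set)   # node id -> ids of child nodes

    def add_node(self, node_id, label):
        self.nodes[node_id] = label

    def add_edge(self, parent, child):
        self.edges[parent].add(child)

    def reachable(self, start, label):
        """Return sorted ids of nodes with the given label reachable from start."""
        seen, stack, found = {start}, [start], []
        while stack:
            current = stack.pop()
            if current != start and self.nodes[current] == label:
                found.append(current)
            for child in self.edges[current]:
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        return sorted(found)


def build_example():
    """Build a tiny hypothetical project with two cases and two files."""
    graph = Graph()
    for node_id, label in [
        ("project-1", "project"), ("case-1", "case"), ("case-2", "case"),
        ("sample-1", "sample"), ("sample-2", "sample"),
        ("file-1", "file"), ("file-2", "file"),
    ]:
        graph.add_node(node_id, label)
    for parent, child in [
        ("project-1", "case-1"), ("project-1", "case-2"),
        ("case-1", "sample-1"), ("case-2", "sample-2"),
        ("sample-1", "file-1"), ("sample-2", "file-2"),
    ]:
        graph.add_edge(parent, child)
    return graph


if __name__ == "__main__":
    g = build_example()
    print(g.reachable("case-1", "file"))
    print(g.reachable("project-1", "file"))
```

Because every file is reached through the case and sample it belongs to, a query such as "all files for this case" is a graph traversal keyed on unique identifiers rather than a lookup by file name.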
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from cart or a high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing
Example request (labeled on the slide as API URL, Endpoint, URL parameters, Query parameters):
    https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Example response (portion removed for readability):
    {
      "data": {
        "hits": [
          {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
          {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
          {"project_id": "TCGA-LAML", "primary_site": "Blood"},
          {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
          {"project_id": "TCGA-UVM", "primary_site": "Eye"},
          {"project_id": "TARGET-AML", "primary_site": "Blood"},
          {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
          {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
          {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
          {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
        ]
      }
    }
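The example query above can be assembled and its response parsed with the Python standard library alone. The sketch below builds the URL and parses a two-entry excerpt of the sample response; it makes no network call. The helper names `projects_url` and `sites_by_project` are made up for this example.

```python
import json
from urllib.parse import urlencode

# API address as shown in the example request on the slide.
API_BASE = "https://gdc-api.nci.nih.gov"


def projects_url(fields, pretty=True):
    """Build the URL for the projects endpoint with a comma-separated fields list."""
    query = {"fields": ",".join(fields)}
    if pretty:
        query["pretty"] = "true"
    # safe="," keeps the commas in the fields list readable instead of %2C.
    return f"{API_BASE}/projects?{urlencode(query, safe=',')}"


# Two entries excerpted from the sample response shown on the slide.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}
"""


def sites_by_project(response_text):
    """Map each project_id in a projects response to its primary_site."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


if __name__ == "__main__":
    print(projects_url(["project_id", "primary_site"]))
    print(sites_by_project(SAMPLE_RESPONSE))
```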
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site – https://gdc.nci.nih.gov
• GDC Documentation Site – https://gdc-docs.nci.nih.gov
• GDC Data Portal – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype)
  – Aggregation
  – Access
  – Sharing
  – Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Figure: data flow diagram. Labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF; cancer BAM/VCF; dbGaP; NCI GDC; birth defect VCF; cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users]
[Figure: the GMKF Data Resource shown with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery
www.matchmakerexchange.org
MME Use Cases
• Supported today
  – Match on Gene/Phenotype
    • 1 strong candidate gene (e.g., de novo variant in GUS)
    • 10 candidate genes, each with rare variant
  – Match on Phenotype Only
• Coming soon
  – Match case to non-human models
  – Match using phenotype and VCF
  – Patient-initiated matching
Connected and Soon to be Connected Matchmakers
[Figure: the Matchmaker Exchange network. Labels: GENESIS, GeneMatcher, DECIPHER, RD-Connect, ClinGen GenomeConnect, Monarch, PhenomeCentral, Broad Institute, RDAP; patient-initiated matching; model organisms (mouse, zebrafish), Orphanet, ClinVar, OMIM; a marker for connections that are live]
Connecting Data in the Big Data World
• Centralized Database: Everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA
• Centralized Hub: APIs connect each database to a central hub. Example: many commercial platforms
• Federated Network: All databases connected through multiple APIs. Example: Matchmaker Exchange
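The federated model can be pictured with a small Python sketch: each node keeps its own cases and answers queries locally, and only the matches travel back to the requester. Everything here (the `Case` and `MatchmakerNode` classes, the matching rule, the node and case names) is hypothetical and far simpler than the real Matchmaker Exchange API; it mirrors only the two supported use cases named earlier, matching on gene when candidate genes are given and on phenotype otherwise.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Case:
    case_id: str
    genes: frozenset       # candidate gene symbols
    phenotypes: frozenset  # phenotype terms


@dataclass
class MatchmakerNode:
    """One database in the federation; its cases never leave the node."""
    name: str
    cases: list = field(default_factory=list)

    def match(self, query):
        # Match on shared candidate genes when the query lists any,
        # otherwise fall back to shared phenotype terms.
        if query.genes:
            return [c for c in self.cases if c.genes & query.genes]
        return [c for c in self.cases if c.phenotypes & query.phenotypes]


def federated_match(query, nodes):
    """Send the query to every node and collect (node name, case id) pairs."""
    results = []
    for node in nodes:
        for case in node.match(query):
            results.append((node.name, case.case_id))
    return results


def build_example_nodes():
    node_a = MatchmakerNode("node_a", [
        Case("a1", frozenset({"GENE1"}), frozenset({"seizures"})),
        Case("a2", frozenset({"GENE2"}), frozenset({"cleft palate"})),
    ])
    node_b = MatchmakerNode("node_b", [
        Case("b1", frozenset({"GENE1"}), frozenset({"syndactyly"})),
    ])
    return [node_a, node_b]


if __name__ == "__main__":
    nodes = build_example_nodes()
    query = Case("q1", frozenset({"GENE1"}), frozenset({"seizures"}))
    print(federated_match(query, nodes))
```

In contrast to the centralized models, no node here ever holds another node's full case list; adding a database means adding one more entry to `nodes`.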
Matchmaker Exchange Acknowledgements
S Balasubramanian, Mike Bamshad, Sergio Beltran Agullo, Jonathan Berg, Kym Boycott, Anthony Brookes, Michael Brudno, Han Brunner, Oriean Buske, Deanna Church, Raymond Dalgliesh, Andrew Devereau, Johan den Dunnen, Helen Firth, Paul Flicek, Jan Friedman, Richard Gibbs, Marta Girdea, Robert Green, Matt Hurles, Ada Hamosh, Ekta Khurana, Sebastian Kohler, Joel Krier, Owen Lancaster, Melissa Landrum, Paul Lasko, Rick Lifton, Daniel MacArthur, Alex MacKenzie, Danielle Metterville, Debbie Nickerson, Woong-Yan Park, Justin Paschall, Anthony Philippakis, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner
NIAGADS
The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP
April 26, 2016
Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)
• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia
Long-Term Plan for the ADSP
• Discovery Phase 2014-2018
  – Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC'd data released 7/13/2015
  – Case-control
    • Whole exome sequencing (WES): 5000 cases and 5000 controls
    • 1000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC'd data released 11/2015
• Follow-Up 2016-2021
  – WGS in case-control sample sets emphasizing ethnically diverse cohorts
ADSP Discovery Phase Analysis
• NIA funding for analysis of sequence data, June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing
Participation
• PARTICIPANTS
  – Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  – Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  – NIH staff: NHGRI and NIA
  – External Consultants
• INFRASTRUCTURE
  – Executive Committee, Analysis Coordination Committee, and 8 Work Groups
Organization of the ADSP
[Figure: ADSP organization chart, including the Genomics Center]
ADSP Infrastructure and Support
NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]
National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]
National Alzheimer's Coordinating Center, NACC [U01 AG016976]
Alzheimer's Disease Centers
NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]
NIAGADS History and Function
Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
NIA's repository for the genetics of late-onset Alzheimer's disease data
Datasets, Genomics Database, Analysis Resource
Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy
Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)
NIAGADS ADSP Data Coordinating Center
• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and Office of Science Policy on "Trusted Partners"
Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP
Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
Data production: Define data formatting; Collect and curate phenotypes; Track data production; Additional quality metrics; Coordinate public data release
Secondary/derived data management: Documentation (README/Manifest), packaging and release; Notification and tracking
IT support: Website; Members area; Face-to-face meeting support; dbGaP exchange area
Facilitate interactions with other AD investigators
Rapid response to unforeseen ADSP and NIH requests
Interactions
[Figure: interactions among NIAGADS, NIA/NHGRI, the Baylor/Broad/WashU LSACs, NCBI dbGaP/SRA, NCRAD, ADGC/CHARGE, the Genomics Center, and the ADSP Data Flow and other Work Groups; management of Work Group data/work flow; Exchange Area]
ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site
Data Deposition and Data Sharing
ADSP-specific datasets: 68 TB
21 reference datasets with 5560 files in use by the ADSP
>20 reference datasets planned
(1 TB = 1024 GB = 1048576 MB = 1073741824 KB = 1099511627776 bytes. A 1 TB hard drive has the capacity to hold about a trillion bytes.)
dbGaP/NIAGADS release for the research community at large
11555 BAM files released 12/2014
Pedigree and phenotype information
Sequencing Quality Control Metrics
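The unit conversion quoted on this slide (1 TB = 1024 GB and so on) uses binary multiples, which is why the byte count is somewhat larger than a round trillion. A short Python check of that arithmetic, with helper names chosen for this example:

```python
# Binary multiples, as used in the slide's conversion (1 TB = 1024 GB, ...).
KB = 1024
MB = 1024 * KB
GB = 1024 * MB
TB = 1024 * GB


def tb_to_bytes(terabytes):
    """Convert binary terabytes to bytes."""
    return terabytes * TB


def excess_over_decimal(terabytes):
    """Bytes by which binary terabytes exceed the same number of decimal (10**12) terabytes."""
    return tb_to_bytes(terabytes) - terabytes * 10**12


if __name__ == "__main__":
    print(tb_to_bytes(1))          # matches the figure quoted on the slide
    print(excess_over_decimal(1))  # the gap between the binary and decimal definitions
```

The gap is about 10 percent per terabyte, so storage estimates for multi-terabyte datasets differ noticeably depending on which convention a vendor or repository uses.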
Size of the ADSP
Reference: "The real cost of sequencing: scaling computation to keep pace with data generation", Muir et al., Genome Biology 2016, 17:53
ADSP Members Area (Dashboard)
Calendar Documents Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list: 153 records
352 meeting minutes
>600 other documents
>100 consent files
10 analysis plans
85TB WGS, 1075TB WES data (dbGaP/SRA)
42 TB files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
SNP/Gene reports; Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer's Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
[Figure: GWAS results combined with gene annotations transformed to SNPs, giving genomic features from the identified SNPs]
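The combined search described here, filtering SNPs by GWAS significance and then intersecting them with gene annotations, can be sketched as follows. This is not the NIAGADS implementation: the `Snp` and `Gene` classes, the identifiers, and the coordinates are invented for illustration, and the default threshold of 5e-8 is the conventional genome-wide significance level rather than a value taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snp:
    snp_id: str
    chrom: str
    pos: int
    p_value: float  # GWAS association p-value


@dataclass(frozen=True)
class Gene:
    symbol: str
    chrom: str
    start: int
    end: int


def significant_snps_in_genes(snps, genes, p_threshold=5e-8):
    """Return (snp_id, gene symbol) pairs for significant SNPs located inside a gene."""
    hits = []
    for snp in snps:
        if snp.p_value > p_threshold:
            continue  # not significant at the chosen level
        for gene in genes:
            if gene.chrom == snp.chrom and gene.start <= snp.pos <= gene.end:
                hits.append((snp.snp_id, gene.symbol))
    return hits


# Invented example data: identifiers and coordinates are placeholders.
EXAMPLE_SNPS = [
    Snp("snp_a", "chr1", 1500, 1e-10),
    Snp("snp_b", "chr1", 1600, 0.03),
    Snp("snp_c", "chr2", 500, 2e-9),
]
EXAMPLE_GENES = [
    Gene("GENE_X", "chr1", 1000, 2000),
    Gene("GENE_Y", "chr2", 5000, 6000),
]

if __name__ == "__main__":
    print(significant_snps_in_genes(EXAMPLE_SNPS, EXAMPLE_GENES))
```

The nested loop is fine for a handful of records; a real genomics database would use indexed interval queries for the same join.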
Find a genomic location of interest by viewing tracks on the genome browser
[Figure: genome browser tracks for SNPs, genes, GWAS results, and functional genomics data]
View Genomic Information by the Genome Browser
[Figure: genome browser view of a search result]
Data available at the NIAGADS Genomics Database
Gene models, SNP information
Pathway annotations: Gene Ontology, KEGG Pathway
Functional genomics data: ENCODE, FANTOM5, GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics
Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  – Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  – Big data management
  – IT infrastructure and web development
• Administrative support and collaboration
  – Familiarity with human subjects research regulations and NIH policy
  – Leadership and coordinating skills
  – Flexibility, close collaboration with program officers
• Domain expertise
  – Knowledge of the disease and genetic research
  – Familiarity with the community and cohorts
  – Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements
Patients and families of those with AD; NIA ADRC program and centers; NACC; NCRAD; ADSP (Data flow WG, Annotation WG); IGAP/ADGC/CHARGE/EADI/GERAD
NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly
NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon
NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao
GDC Overview, GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence data, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high level data
GDC Overview: Infrastructure
[Figure: GDC infrastructure diagram organized into GDC Users, GDC System Components, and GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; 3rd Party Applications; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; APIs; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System]
GDC Overview: Resources
[Figure: GDC resources. Labels: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators]
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testing (UAT) Testers
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interfaces (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
ndash Data Sharing Requirement bull Data submitted to the GDC will be made available to the scientific community according to the
data submitterrsquos NCI Genomic Data Sharing Plan Controlled access data will be made availabl e to members of the community having the appropriate dbGaP Data Use Certification
bull The GDC will produce harmonized data (raw and derived) based on the originally submitted data The GDC will not preserve an exact copy of the originally submitted data however the GDC will preserve the original reads and quality scores
ndash Data Pre-processing Period bull For each project the GDC will afford a pre-processing period of exclusive that allows for
submitters to perform data cleaning and quality and submission of revised data before release bull The pre-processing period may generally last up to six months from the date of first submission
ndash Data Submission Period and Release bull Once submitted data will be processed and validated by the GDC Submitted data will be
released and available via controlled access for research that is consistent with the datasetrsquos ldquodata use limitationsrdquo either six months after data submission or at the time of first publication
ndash Data Redaction bull The GDC in general will not remove data access in response to submitter requests GDC will
remove data access in the following events Data Management Incident Human Subjects 161
Compliance Issue Erroneous Data
GDC Data Submission Data Submission Process
162
GDC Data Submission dbGaP Registration
bull Data submitters register their studies in dbGaP submit the Subject Identifiers associated with the study in dbGaP and contact the GDC
bull The GDC verifies that the study has been registered in dbGaP verifies the data submitter credentials Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process httpwwwncbinlmnihgovprojectsgapcgi-binGetPdfcgidocument_name=HowToSubmitpdf
163
GDC Data Submission Upload and Validate Data
bull Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal GDC Data Transfer Tool or GDC Application Programming Interface (API)
bull Uploaded data must follow GDC supported data types and file formats which leverage and extend existing data standards for biospecimen clinical and experiment data
164
Data Types and File Formats (1 of 2)
bull The GDC provides a data dictionary and template files for each data type
165
Data Type Data Subtype Format Data Dictionary Template
Administrative Administrative Data TSV JSON Case TSV JSON
Biospecimen Biospecimen Data TSV JSON
Sample Portion Analyte Aliquot
Sample TSV JSON Portion TSV JSON Analyte TSV JSON Aliquot TSV JSON
Clinical Clinical Data TSV JSON
Demographic Diagnosis Exposure Family History Treatment
Demographic TSV JSON Diagnosis TSV JSON Exposure TSV JSON Family History TSV JSON Treatment TSV JSON
Data File
Analysis Metadata SRA XML MAGE-TAB (SDRF IDF) Analysis Metadata TSV JSON
Biospecimen Metadata BCR XML GDC-approved spreadsheet
Biospecimen Metadata TSV JSON
Clinical Metadata BCR XML GDC-approved spreadsheet Clinical Metadata TSV JSON
Data Types and File Formats (2 of 2)
Data Type Data Subtype Format Data Dictionary Template
Data File
Experimental Metadata Pathology Report Run Metadata Slide Image
Submitted Unaligned Reads
SRA XML
PDF SRA XML SVS
FASTQ BAM
Experimental Metadata Pathology Report
Submitted Unaligned Reads
TSV JSON
TSV JSON TSV JSON TSV JSON
TSV JSON
Submitted Aligned Reads BAM
Submitted Aligned Reads TSV JSON
Data Bundle Read Group
Slide
Read Group
Slide
TSV JSON
TSV JSON
166
Data Types and File Formats Biospecimen Data
bull Biospecimen data types may include samples aliquots analytes and portions
bull GDC supports the submission of biospecimen data in XML JSON or TSV file format
167
Data Types and File Formats Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
[Diagram: a Case links to Clinical Data (Demographics, Diagnosis, Family History (optional), Exposure (optional), Treatment (optional)), Biospecimen Data, and Experiment Data]
Data Types and File Formats Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
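As an illustration of how such a QC report might be read, the sketch below lists, for each read group, the FastQC modules that did not pass. The statuses are transcribed from the example report above; the function and variable names are hypothetical and are not part of any GDC tool.

```python
# Per-module FastQC statuses for RG_1..RG_4, transcribed from the example report.
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]
QC_RESULTS = {
    "Basic Statistics":             ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality":    ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality":    ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores":  ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content":    ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content":      ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content":           ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels":  ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences":    ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content":              ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content":                 ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged_modules(results, read_groups):
    """Map each read group to the (module, status) pairs that are not PASS."""
    flagged = {rg: [] for rg in read_groups}
    for module, statuses in results.items():
        for rg, status in zip(read_groups, statuses):
            if status != "PASS":
                flagged[rg].append((module, status))
    return flagged


summary = flagged_modules(QC_RESULTS, READ_GROUPS)
for rg in READ_GROUPS:
    print(rg, summary[rg] or "all modules PASS")
```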
Data Submission Tools GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project
Data Submission Tools GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
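A minimal sketch of how a client might prepare a request to the Create Entities endpoint. It only builds the request and never sends it. The program and project names, the JSON payload shape, and the `X-Auth-Token` header name are assumptions for illustration; the host is the API address listed in the deck's references.

```python
import json
import urllib.request

API_BASE = "https://gdc-api.nci.nih.gov"  # API host from the references slide


def build_create_request(program, project, entities, token):
    """Prepare (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{API_BASE}/v0/submission/{program}/{project}"
    return urllib.request.Request(
        url,
        data=json.dumps(entities).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the authentication token
        },
        method="POST",
    )


# Hypothetical program, project, entity, and token values.
request = build_create_request(
    "MYPROGRAM",
    "MYPROJECT",
    [{"type": "case", "submitter_id": "example-case-001"}],
    "PLACEHOLDER_TOKEN",
)
print(request.get_method(), request.full_url)
```

Sending the request (for example with `urllib.request.urlopen`) is deliberately left out, since it would require a valid token and submission authorization.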
GDC Data Submission Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Processing Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]
GDC Processing GDC High Level Data Generation
• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing GDC Variant Calling Pipelines
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Retrieval
• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
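To make the graph idea concrete, here is a toy sketch of nodes and edges keyed by unique identifiers. It is an illustration only: the node labels, properties, and the `Graph` class are hypothetical and do not reflect the actual GDC schema or implementation.

```python
import uuid
from collections import defaultdict


class Graph:
    """Tiny in-memory graph: nodes keyed by UUID, undirected edges between them."""

    def __init__(self):
        self.nodes = {}                # node id -> properties (including "label")
        self.edges = defaultdict(set)  # node id -> ids of linked nodes

    def add_node(self, label, **props):
        node_id = str(uuid.uuid4())    # unique identifier for this node
        self.nodes[node_id] = {"label": label, **props}
        return node_id

    def link(self, a, b):
        self.edges[a].add(b)
        self.edges[b].add(a)

    def neighbors(self, node_id, label):
        """Return ids of linked nodes that carry the given label."""
        return [n for n in self.edges[node_id] if self.nodes[n]["label"] == label]


g = Graph()
project = g.add_node("project", name="EXAMPLE-PROJECT")
case = g.add_node("case", submitter_id="case-001")
diagnosis = g.add_node("diagnosis", primary_site="Lung")
data_file = g.add_node("file", file_name="case-001.bam")
g.link(case, project)
g.link(diagnosis, case)
g.link(data_file, case)

# Walk from a case to its data files via the edges.
file_names = [g.nodes[f]["file_name"] for f in g.neighbors(case, "file")]
print(file_names)
```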
GDC Data Retrieval GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from cart or a high-performance Data Transfer Tool
GDC Data Retrieval GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)
Data Retrieval Tools GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing
Example request (API URL, endpoint, and query parameters):
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Example response (portion removed for readability):
{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
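The request above can be assembled programmatically. The sketch below builds the same query URL and reads project IDs from a response of the shape shown on the slide. No network call is made. The helper name is hypothetical, and the `data`/`hits` envelope is taken from the slide's sample rather than from an API specification.

```python
import json
from urllib.parse import urlencode

API_BASE = "https://gdc-api.nci.nih.gov"  # API host from the references slide


def build_query_url(endpoint, fields, pretty=True):
    """Compose <API URL>/<endpoint>?fields=...&pretty=... as on the slide."""
    params = {"fields": ",".join(fields), "pretty": str(pretty).lower()}
    # safe="," keeps the comma-separated field list readable in the URL
    return f"{API_BASE}/{endpoint}?{urlencode(params, safe=',')}"


url = build_query_url("projects", ["project_id", "primary_site"])
print(url)

# Two hits copied from the sample response, used to show parsing.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LUSC", "primary_site": "Lung"}
]}}
"""
hits = json.loads(SAMPLE_RESPONSE)["data"]["hits"]
sites = {hit["project_id"]: hit["primary_site"] for hit in hits}
print(sites)
```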
GDC Web Site
• The GDC Web Site provides access to GDC resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site – https://gdc.nci.nih.gov
• GDC Documentation Site – https://gdc-docs.nci.nih.gov
• GDC Data Portal – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype)
  – Aggregation
  – Access
  – Sharing
  – Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Diagram: data flow among birth defect or childhood cancer cohorts, the DNA sequencing center, dbGaP (birth defect BAM/VCF), the NCI GDC (cancer BAM/VCF), the Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries), and users]
[Diagram: GMKF Data Resource shown alongside dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and Center for Mendelian Genomics]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
The Matchmaker Exchange Connecting Matchmakers to Accelerate Gene Discovery
www.matchmakerexchange.org
MME Use Cases
• Supported today
  – Match on Gene/Phenotype
    – 1 strong candidate gene (e.g., de novo variant in GUS)
    – 10 candidate genes, each with rare variant
  – Match on Phenotype Only
• Coming soon
  – Match case to non-human models
  – Match using phenotype and VCF
  – Patient-initiated matching
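The two supported use cases can be pictured with a toy matcher. This is a simplified sketch, not the Matchmaker Exchange API or its scoring method: the case records, the placeholder gene names (GUS comes from the slide; GENE2 and GENE3 are invented), and the Jaccard overlap on phenotype terms are all assumptions.

```python
def jaccard(a, b):
    """Overlap between two sets of phenotype terms (0.0 to 1.0)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def find_matches(query, cases, require_gene=True):
    """Rank cases by phenotype overlap; optionally require a shared candidate gene."""
    results = []
    for case in cases:
        shared = sorted(set(query["genes"]) & set(case["genes"]))
        if require_gene and not shared:
            continue
        overlap = jaccard(query["phenotypes"], case["phenotypes"])
        results.append((case["id"], shared, overlap))
    return sorted(results, key=lambda r: r[2], reverse=True)


# Hypothetical cases; phenotype strings stand in for ontology term identifiers.
query = {"id": "Q", "genes": ["GUS"], "phenotypes": ["HP:0001250", "HP:0001263"]}
cases = [
    {"id": "A", "genes": ["GUS", "GENE2"], "phenotypes": ["HP:0001250"]},
    {"id": "B", "genes": ["GENE3"], "phenotypes": ["HP:0001250", "HP:0001263"]},
]

gene_matches = find_matches(query, cases, require_gene=True)        # match on gene/phenotype
phenotype_matches = find_matches(query, cases, require_gene=False)  # match on phenotype only
print(gene_matches)
print(phenotype_matches)
```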
Connected and Soon to be Connected Matchmakers
[Diagram: Matchmaker Exchange network with GENESIS, GeneMatcher, DECIPHER, RD-Connect, ClinGen GenomeConnect, Monarch, PhenomeCentral, Broad Institute, and RDAP; annotations: patient-initiated matching; model organisms (mouse, zebrafish); Orphanet, ClinVar, OMIM; legend: Live]
Connecting Data in the Big Data World
Centralized Database: Everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA
Centralized Hub: APIs connect each database to a central hub. Example: many commercial platforms
Federated Network: All databases connected through multiple APIs. Example: Matchmaker Exchange
Matchmaker Exchange Acknowledgements S Balasubramanian Robert Green Mike Bamshad Matt Hurles Sergio Beltran Agullo Ada Hamosh Jonathan Berg Ekta Khurana Kym Boycott Sebastian Kohler Anthony Brookes Joel Krier Michael Brudno Owen Lancaster Han Brunner Melissa Landrum Oriean Buske Paul Lasko Deanna Church Rick Lifton Raymond Dalgliesh Daniel MacArthur Andrew Devereau Alex MacKenzie Johan den Dunnen Danielle Metterville Helen Firth Debbie Nickerson Paul Flicek Woong-Yan Park Jan Friedman Justin Paschall Richard Gibbs Anthony Philippakis Marta Girdea Heidi Rehm
Peter Robinson Francois Schiettecatte Rolf Sijmons Nara Sobreira Jawahar Swaminathan Morris Swertz Rachel Thompson Stephan Zuchner
NIAGADS
The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP
April 26 2016
Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)
• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia
Long-Term Plan for the ADSP
• Discovery Phase 2014-2018
  – Family-based
    – Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    – Included Caribbean Hispanic families
    – Fully QC'd data released 7/13/2015
  – Case-control
    – Whole exome sequencing (WES): 5000 cases and 5000 controls
    – 1000 additional cases from families multiply affected by AD
    – Included Caribbean Hispanics
    – Fully QC'd data released 112015
• Follow-Up 2016-2021: WGS in case-control sample sets, emphasizing ethnically diverse cohorts
ADSP Discovery Phase Analysis
• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing
Participation
• PARTICIPANTS
  – Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  – Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  – NIH staff: NHGRI and NIA
  – External Consultants
• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups
Genomics Center
Organization of the ADSP
ADSP Infrastructure and Support
NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]
National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]
National Alzheimer's Coordinating Center, NACC [U01 AG016976]
Alzheimer's Disease Centers
NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]
NIAGADS History and Function
Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
NIA's repository for the genetics of late-onset Alzheimer's disease data
Datasets, Genomics Database, Analysis Resource
Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy
Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)
NIAGADS ADSP Data Coordinating Center
• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and Office of Science Policy on "Trusted Partners"
Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP
Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
• Data production: define data formatting; collect and curate phenotypes; track data production; additional quality metrics; coordinate public data release
• Secondary/derived data management: documentation (README/Manifest); packaging and release; notification and tracking
• IT support: website; members area; face-to-face meeting support; dbGaP exchange area
• Facilitate interactions with other AD investigators
• Rapid response to unforeseen ADSP and NIH requests
Interactions
[Diagram: NIAGADS interactions with NIA/NHGRI, the Baylor/Broad/WashU LSACs, NCBI dbGaP/SRA, the ADSP Data Flow and other Work Groups, the Genomics Center, NCRAD, and ADGC/CHARGE]
ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site
Management of Work Group Data/Work Flow
Exchange Area
Data Deposition and Data Sharing
ADSP specific datasets- 68 Tb
21 Reference datasets with 5560 files in use by the ADSP
>20 Reference datasets planned
1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes
A 1 TB hard drive has the capacity to hold a trillion bytes
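The unit chain above uses binary steps of 1024, while "a trillion bytes" is the decimal figure. A short sketch of the arithmetic, with hypothetical helper names:

```python
BINARY_STEP = 1024  # each unit is 1024 of the next smaller unit


def tb_to_units(tb):
    """Expand a size in TB into GB, MB, KB, and bytes using binary steps."""
    gb = tb * BINARY_STEP
    mb = gb * BINARY_STEP
    kb = mb * BINARY_STEP
    n_bytes = kb * BINARY_STEP
    return {"GB": gb, "MB": mb, "KB": kb, "bytes": n_bytes}


one_tb = tb_to_units(1)
decimal_tb = 10**12                      # "a trillion bytes"
excess = one_tb["bytes"] - decimal_tb    # how much larger the binary figure is

print(one_tb)
print(excess)
```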
dbGaP/NIAGADS release for the research community at large
11555 BAM files released 122014
Pedigree and phenotype information
Sequencing Quality Control Metrics
Size of the ADSP
Reference: "The real cost of sequencing: scaling computation to keep pace with data generation", Muir et al., Genome Biology 2016, 17:53
ADSP Members Area (Dashboard)
Calendar Documents Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list 153 records
352 meeting minutes
>600 other documents
>100 consent files
10 analysis plans
85TB WGS, 1075TB WES data (dbGaP/SRA)
42 TB Files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
SNP/Gene reports, Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer's Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
[Figure: GWAS results (SNPs below a significance level) combined with gene annotations transformed to SNPs, yielding genomic features from identified SNPs]
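The combined search can be sketched as a filter followed by an intersection. This is an illustration with made-up records, not the NIAGADS Genomics Database interface: the SNP and gene names, coordinates, and the 5e-8 threshold are assumptions.

```python
GWAS_THRESHOLD = 5e-8  # assumed genome-wide significance level

# Hypothetical GWAS summary rows and gene annotations.
gwas_results = [
    {"snp": "snp_1", "chrom": "1", "pos": 1500, "p": 2e-9},
    {"snp": "snp_2", "chrom": "1", "pos": 9000, "p": 1e-10},
    {"snp": "snp_3", "chrom": "2", "pos": 1200, "p": 0.03},
]
gene_annotations = [
    {"gene": "GENE_A", "chrom": "1", "start": 1000, "end": 2000},
    {"gene": "GENE_B", "chrom": "2", "start": 1000, "end": 2000},
]


def significant_snps(results, threshold=GWAS_THRESHOLD):
    """Keep SNPs whose p-value is below the significance level."""
    return [r for r in results if r["p"] < threshold]


def genes_to_snps(genes, results):
    """Transform gene annotations into the SNPs located inside those genes."""
    hits = {}
    for r in results:
        for g in genes:
            if r["chrom"] == g["chrom"] and g["start"] <= r["pos"] <= g["end"]:
                hits[r["snp"]] = g["gene"]
    return hits


significant = {r["snp"] for r in significant_snps(gwas_results)}
in_genes = genes_to_snps(gene_annotations, gwas_results)
# Combine both searches: significant SNPs that also fall in an annotated gene.
combined = {snp: gene for snp, gene in in_genes.items() if snp in significant}
print(combined)
```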
Find a genomic location of interest by viewing tracks on the genome browser: SNPs, genes, GWAS results, functional genomics data
[Figure: View Genomic Information by the Genome Browser, showing a search result]
Data available at the NIAGADS Genomics Database
• Gene models
• SNP information
• Pathway annotations: Gene Ontology, KEGG Pathway
• Functional genomics data: ENCODE, FANTOM5
• GTEx results
• NHGRI GWAS catalog
• Published genome-wide AD GWAS summary statistics
Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  – Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  – Big data management
  – IT infrastructure and web development
• Administrative support and collaboration
  – Familiarity with human subjects research regulations and NIH policy
  – Leadership and coordinating skills
  – Flexibility; close collaboration with program officers
• Domain expertise
  – Knowledge of the disease and genetic research
  – Familiarity with the community and cohorts
  – Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
What the Data Coordinating Center Should Know
Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements NIAGADS EAB Matthew Farrer
Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober
NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly bull Data flow WG
bull Annotation WG IGAPADGCCHARGEEADIGERAD
NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon
NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao
GDC Overview GDC Data Submission Processing and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high-level data
GDC Overview Infrastructure
[Diagram: GDC infrastructure, organized into GDC Users, GDC System Components, and GDC Interfaces. Labels: Data Sources (DCC, CGHub), Data Submitters, Open Access Data Users, Controlled Access Data Users, eRA Commons & dbGaP, Data Access Tools, Data Import System, Metadata & Data Storage, Reporting System, Harmonization & Generation Pipelines, 3rd Party Applications, Alignment & Processing Tools, Data Submission Tools, Data Security System, APIs, Digital ID System]
GDC Overview Resources
[Diagram: GDC resources: GDC Data Portal, GDC Data Center, GDC Data Transfer Tool, GDC Reports, GDC Data Model, GDC Data Submission Portal, GDC Application Programming Interface (API), GDC Bioinformatics Pipeline, GDC Documentation and Support, GDC Organization and Collaborators]
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Submission Data Submitter Types
• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API
GDC Data Submission Data Submission Policies
• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning, quality control, and submission of revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
GDC Data Submission Data Submission Process
GDC Data Submission dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | TSV, JSON (each)
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | TSV, JSON (each)
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format
Data File | Experimental Metadata | SRA XML
Data File | Pathology Report | PDF
Data File | Run Metadata | SRA XML
Data File | Slide Image | SVS
Data File | Submitted Unaligned Reads | FASTQ, BAM
Data File | Submitted Aligned Reads | BAM
Data Bundle | Read Group |
Data Bundle | Slide |
Data Dictionary entries (templates in TSV, JSON): Experimental Metadata, Pathology Report, Submitted Unaligned Reads, Submitted Aligned Reads, Read Group, Slide
Data Types and File Formats Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
Data Types and File Formats Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
[Diagram: a Case links to Clinical Data (Demographics, Diagnosis, Family History (optional), Exposure (optional), Treatment (optional)), Biospecimen Data, and Experiment Data]
Data Types and File Formats Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
Data Submission Tools GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project
Data Submission Tools GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
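As an illustration of the Create Entities endpoint, here is a minimal sketch that only builds the POST request and does not send it. The host is the one listed on the References slide and may have changed since 2016. The `X-Auth-Token` header name follows the GDC API documentation rather than this slide. The entity payload and the program/project names are hypothetical placeholders.

```python
import json
import urllib.request

# Host as given in the deck's References slide (assumption: may be outdated).
GDC_API_BASE = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{GDC_API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Auth-Token": token,  # authentication token, sent as a header
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Hypothetical entity and identifiers, for illustration only.
example_entity = {"type": "case", "submitter_id": "example-case-001"}
request = build_create_request("PROGRAM", "PROJECT", example_entity, "my-token")
```

Passing `request` to `urllib.request.urlopen` would perform the submission; it is left out here so the sketch has no network side effects.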
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC; release must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access, per the dbGaP authorization policies associated with the data set
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]
GDC Processing: GDC High Level Data Generation
• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
GDC Data Retrieval
• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
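The graph idea can be sketched in a few lines. This is a toy illustration only, assuming UUIDs as the unique identifiers; the class, the node types, and the edge labels (`member_of`, `derived_from`) are hypothetical and do not reproduce the actual GDC schema.

```python
import uuid
from collections import defaultdict


class Graph:
    """Toy node/edge store: every node gets a unique identifier."""

    def __init__(self):
        self.nodes = {}                 # node id -> properties
        self.edges = defaultdict(list)  # child id -> [(label, parent id)]

    def add_node(self, node_type, **props):
        node_id = str(uuid.uuid4())
        self.nodes[node_id] = {"type": node_type, **props}
        return node_id

    def link(self, child_id, label, parent_id):
        self.edges[child_id].append((label, parent_id))

    def ancestors(self, node_id):
        """Follow edges upward and return the types of all reachable nodes."""
        found, stack = [], [node_id]
        while stack:
            for _label, parent in self.edges[stack.pop()]:
                found.append(self.nodes[parent]["type"])
                stack.append(parent)
        return found


g = Graph()
project = g.add_node("project", project_id="TCGA-SKCM")
case = g.add_node("case", submitter_id="example-case-001")
data_file = g.add_node("file", file_name="reads.bam")
g.link(case, "member_of", project)
g.link(data_file, "derived_from", case)
lineage = g.ancestors(data_file)  # the file resolves to its case and project
```

Because every file node is reachable from its case and project through identifiers alone, a query can start from any level and walk to the others.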
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from cart or a high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)
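A manifest-based download can be scripted around the command line client. The sketch below only assembles the argument list. It assumes the client is installed as `gdc-client` and accepts `-m` (manifest) and `-t` (token file), as described in the GDC Data Transfer Tool documentation; the file names are placeholders.

```python
def build_download_command(manifest_path, token_path=None):
    """Assemble the argument list for a manifest-based download."""
    command = ["gdc-client", "download", "-m", manifest_path]
    if token_path is not None:
        # A token file is only needed for controlled access data.
        command += ["-t", token_path]
    return command


open_access = build_download_command("manifest.txt")
controlled = build_download_command("manifest.txt", "token.txt")
```

The resulting list can be handed to `subprocess.run(command, check=True)` on a machine where the client is available.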
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing
Example request (API URL, endpoint, and query parameters):
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Example response (portion removed for readability):
{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
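The projects query shown on this slide can be assembled and its response parsed with the standard library alone. This is a sketch under assumptions: the host is the 2016 address from the slide and may no longer resolve, the helper names are illustrative, and the sample response is a two-entry excerpt of the slide's output rather than live data.

```python
import json
from urllib.parse import urlencode

# Host as shown on the slide (assumption: may be outdated).
GDC_API_BASE = "https://gdc-api.nci.nih.gov"


def build_projects_url(fields, pretty=True):
    """Compose the /projects query with a comma-separated field list."""
    query = urlencode(
        {"fields": ",".join(fields), "pretty": str(pretty).lower()},
        safe=",",  # keep commas readable, as on the slide
    )
    return f"{GDC_API_BASE}/projects?{query}"


def primary_sites(response_text):
    """Extract {project_id: primary_site} from a /projects response."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"}
]}}
"""

url = build_projects_url(["project_id", "primary_site"])
sites = primary_sites(SAMPLE_RESPONSE)
```

Keeping URL construction separate from parsing makes it easy to test the parsing step against saved responses without network access.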
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site – https://gdc.nci.nih.gov
• GDC Documentation Site – https://gdc-docs.nci.nih.gov
• GDC Data Portal – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype)
  – Aggregation
  – Access
  – Sharing
  – Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Figure: Kids First data flow. Labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF; cancer BAM/VCF; dbGaP; NCI GDC; birth defect VCF; cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users]
[Figure: GMKF Data Resource and related resources. Labels: dbGaP; NCI GDC; TOPMed; The Monarch Initiative; ClinGen; Matchmaker Exchange; Center for Mendelian Genomics]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist and, if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
MME Use Cases
• Supported today
  – Match on Gene/Phenotype
    1 strong candidate gene (e.g., de novo variant in GUS)
    10 candidate genes, each with rare variant
  – Match on Phenotype Only
• Coming soon
  – Match case to non-human models
  – Match using phenotype and VCF
  – Patient-initiated matching
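The two supported use cases can be illustrated with a toy matcher. This is a hypothetical sketch, not the Matchmaker Exchange API or its scoring method: records are plain dictionaries, phenotypes are free-text labels borrowed from the earlier example question, and similarity is a simple Jaccard overlap chosen for illustration.

```python
def jaccard(a, b):
    """Overlap of two phenotype sets, from 0.0 (disjoint) to 1.0 (identical)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def match(query, candidates, require_shared_gene=True):
    """Rank candidate cases by phenotype overlap, optionally gated on a shared gene."""
    results = []
    for other in candidates:
        shared_genes = set(query["genes"]) & set(other["genes"])
        if require_shared_gene and not shared_genes:
            continue
        score = jaccard(query["phenotypes"], other["phenotypes"])
        results.append((other["id"], sorted(shared_genes), score))
    return sorted(results, key=lambda r: r[2], reverse=True)


# Hypothetical cases; gene symbols GUS and ABC are the placeholders used on the slides.
query = {"id": "P1", "genes": ["GUS"],
         "phenotypes": ["congenital facial palsy", "cleft palate", "syndactyly"]}
candidates = [
    {"id": "P2", "genes": ["GUS", "ABC"],
     "phenotypes": ["cleft palate", "syndactyly"]},
    {"id": "P3", "genes": ["ABC"],
     "phenotypes": ["congenital facial palsy", "cleft palate", "syndactyly"]},
]

gene_phenotype_matches = match(query, candidates)                        # gene + phenotype
phenotype_only_matches = match(query, candidates, require_shared_gene=False)
```

With the gene gate on, only P2 is returned; with it off, P3 ranks first because its phenotype set is identical to the query.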
Connected and Soon to be Connected Matchmakers
[Figure: Matchmaker Exchange network. Labels: GENESIS; Gene Matcher; DECIPHER; RD Connect; ClinGen Genome Connect; Monarch; Phenome Central; Broad Institute; RDAP; patient-initiated matching; model organisms (mouse, zebrafish); Orphanet; ClinVar; OMIM; Live]
Connecting Data in the Big Data World
Centralized Database: everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA
Centralized Hub: APIs connect each database to a central hub. Example: many commercial platforms
Federated Network: all databases connected through multiple APIs. Example: Matchmaker Exchange
Matchmaker Exchange Acknowledgements
S Balasubramanian, Robert Green, Mike Bamshad, Matt Hurles, Sergio Beltran Agullo, Ada Hamosh, Jonathan Berg, Ekta Khurana, Kym Boycott, Sebastian Kohler, Anthony Brookes, Joel Krier, Michael Brudno, Owen Lancaster, Han Brunner, Melissa Landrum, Oriean Buske, Paul Lasko, Deanna Church, Rick Lifton, Raymond Dalgliesh, Daniel MacArthur, Andrew Devereau, Alex MacKenzie, Johan den Dunnen, Danielle Metterville, Helen Firth, Debbie Nickerson, Paul Flicek, Woong-Yan Park, Jan Friedman, Justin Paschall, Richard Gibbs, Anthony Philippakis, Marta Girdea, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner
NIAGADS
The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP
April 26, 2016
Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)
• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 1212 between 3 LSSC and 2 AD genetics consortia
Long-Term Plan for the ADSP
• Discovery Phase, 2014-2018
  – Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC'd data released 7/13/2015
  – Case-control
    • Whole exome sequencing (WES): 5000 cases and 5000 controls
    • 1000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC'd data released 112015
• Follow-Up, 2016-2021: WGS in case-control sample sets emphasizing ethnically diverse cohorts
ADSP Discovery Phase Analysis
• NIA funding for analysis of sequence data, June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing
Participation
• PARTICIPANTS
  – Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  – Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  – NIH staff: NHGRI and NIA
  – External Consultants
• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups
Genomics Center
Organization of the ADSP
ADSP Infrastructure and Support
NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]
National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]
National Alzheimer's Coordinating Center, NACC [U01 AG016976]
Alzheimer's Disease Centers
NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]
NIAGADS History and Function
Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
NIA's repository for the genetics of late-onset Alzheimer's disease data
Datasets, Genomics Database, Analysis Resource
Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy
Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)
NIAGADS ADSP Data Coordinating Center
• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and Office of Science Policy on "Trusted Partners"
Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP
Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH-side: after ADSP DAC approval, the investigator uses NIH authentication
• Community-side: fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
Data production: define data formatting; collect and curate phenotypes; track data production; additional quality metrics; coordinate public data release
Secondary/derived data management: documentation (README/Manifest), packaging, and release; notification and tracking
IT support: website; members area; face-to-face meeting support; dbGaP exchange area
Facilitate interactions with other AD investigators
Rapid response to unforeseen ADSP and NIH requests
Interactions
[Figure: interactions diagram. Labels: NIAGADS; NIA/NHGRI; Baylor/Broad/WashU LSACs; NCBI dbGaP/SRA; ADSP Data Flow and other Work Groups; Genomics Center; NCRAD; ADGC/CHARGE]
ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site
Management of Work Group Data/Work Flow
Exchange Area
Data Deposition and Data Sharing
ADSP-specific datasets: 68 Tb
21 reference datasets with 5560 files in use by the ADSP
>20 reference datasets planned
1 TB = 1024 GB = 1048576 MB = 1073741824 KB = 1099511627776 bytes. A 1 TB hard drive has the capacity to hold about a trillion bytes
dbGaP/NIAGADS release for the research community at large
11555 BAM files released 122014
Pedigree and phenotype information
Sequencing Quality Control Metrics
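The unit conversion quoted on this slide can be checked with a few lines of arithmetic. The sketch assumes that the slide's "68 Tb" means 68 terabytes in the binary convention it spells out (1 TB = 1024^4 bytes); the variable names are illustrative.

```python
# Binary convention used on the slide: each step up is a factor of 1024.
BYTES_PER_KB = 1024
BYTES_PER_TB = BYTES_PER_KB ** 4

# Reproduce the slide's chain 1 TB = 1024 GB = 1048576 MB = 1073741824 KB.
gb_per_tb = BYTES_PER_TB // BYTES_PER_KB ** 3
mb_per_tb = BYTES_PER_TB // BYTES_PER_KB ** 2
kb_per_tb = BYTES_PER_TB // BYTES_PER_KB

# Assumption: "68 Tb" on the slide is read as 68 binary terabytes.
adsp_dataset_bytes = 68 * BYTES_PER_TB

# The "trillion bytes" figure is the decimal convention (10**12),
# which is about 9% smaller than the binary terabyte.
ratio_binary_to_decimal = BYTES_PER_TB / 10 ** 12
```

The two conventions differ by roughly 10 percent at the terabyte scale, which is why the slide's "trillion bytes" is an approximation of the exact figure it lists.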
Size of the ADSP
Reference: "The real cost of sequencing: scaling computation to keep pace with data generation", Muir et al., Genome Biology 2016, 17:53
ADSP Members Area (Dashboard)
Calendar, Documents, Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list: 153 records
352 meeting minutes
>600 other documents
>100 consent files
10 analysis plans
85TB WGS, 1075TB WES data (dbGaP/SRA)
42 TB files (dbGaP Exchange Area)
niagads.org/genomics: web interface for query and analysis
SNP/Gene reports, Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer's Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
GWAS results combined with
SNPs below
Gene annotations transformed to SNPs
Genomic features from identified SNPs
Find a genomic location of interest by viewing tracks on the genome browser
SNPs
genes
GWAS results
functional genomics data
View Genomic Information by the Genome Browser
search result
Data available at the NIAGADS Genomics Database
Gene models, SNP information
Pathway annotations: Gene Ontology, KEGG Pathway
Functional genomics data: ENCODE, FANTOM5, GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics
Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  – Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  – Big data management
  – IT infrastructure and web development
• Administrative support and collaboration
  – Familiarity with human subjects research regulations and NIH policy
  – Leadership and coordinating skills
  – Flexibility, close collaboration with program officers
• Domain expertise
  – Knowledge of the disease and genetic research
  – Familiarity with the community and cohorts
  – Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
What the Data Coordinating Center Should Know
Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements
Patients and families of those with AD
NIA ADRC program and centers
NACC, NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD
NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly
NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon
Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao
NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
GDC Overview GDC Data Submission Processing and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high level data
GDC Overview: Infrastructure
[Figure: GDC infrastructure diagram with three groups (GDC Users, GDC System Components, GDC Interfaces). Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; 3rd Party Applications; Alignment & Processing Tools; Data Submission Tools; Data Security System; APIs; Digital ID System]
GDC Overview: Resources
[Figure: GDC resources. Labels: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators]
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning, quality checks, and submission of revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
GDC Data Submission: Data Submission Process
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | TSV, JSON (each)
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | TSV, JSON (each)
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
[Figure: Case linked to Clinical Data (Demographics, Diagnosis, Family History (optional), Exposure (optional), Treatment (optional)), Biospecimen Data, and Experiment Data]
168
Data Types and File Formats Experiment Data
bull GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
bull GDC performs and reports on Quality Control (QC) metrics generated by FASTQC
Read Group Name RG_1 RG_2 RG_3 RG_4 Total Sequences 502123 123456 658101 788965 Read Length 75 75 75 75 GC Content () 49 50 48 50 Basic Statistics PASS PASS PASS PASS
Per base sequence quality PASS PASS PASS PASS
Per tile sequence quality PASS WARNING PASS PASS
Per sequence quality scores FAIL PASS PASS PASS
Per base sequence content PASS PASS WARNING PASS
Per sequence GC content PASS PASS PASS PASS
Per base N content PASS PASS PASS PASS Sequence Length
Distribution PASS PASS PASS PASS Sequence
Duplication Levels WARNING WARNING PASS PASS Overrepresented
sequences PASS PASS PASS PASS Adapter Content PASS FAIL PASS PASS 169Kmer Content PASS PASS WARNING PASS
Data Submission Tools GDC Data Submission Portal
bull The GDC Data Submission Portal is a web-based data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
Upload biospecimen and clinical data and experiment metadata using user friendly web-based tools
Validate data against GDC standard data types defined in the project data dictionary
Obtain information on the status of data submission and processing by project
170
Data Submission Tools GDC Data Transfer Tool
bull The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
Command line interface to specify desired transfer protocol
Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
Supports the secure upload of controlled access data using an authentication key (token)
171
Data Submission Tools GDC Submission API
bull The GDC Submission API is a REST based programmatic interface that supports the programmatic submission of cancer data sets for analysis
Securely submit biospecimen clinical and experiment files using a token
Submit data to GDC by performing create update and retrieve actions on entities
Search for submitted data using GraphQL an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint POST v0submissionltprogramgtltprojectgt 172
GDC Data Submission Submit and Release Data
bull After validation data is submitted to the GDC for processing including data harmonization and high level data generation for applicable data
bull After data processing has been completed the user can release their data to the GDC which must occur within six (6) months of submission per GDC Data Submission Policies
bull Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
173
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
174
GDC Processing Harmonization
bull GDC pipelines supporting the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
175
GDC Processing DNA and RNA Sequence Harmonization Pipelines
176
DNA Sequence Pipeline RNA Sequence Pipeline
GDC Processing GDC High Level Data Generation
bull The GDC generates high level data including DNA-seq derived germline variants and somatic mutations RNA-seq and miRNA-seq derived gene and miRNA quatifications and SNP Array based copy number segmentations
bull The GDC implements multiple pipelines for generating somatic variants such as the Baylor Broad and WashU pipelines
177
GDC Processing GDC Variant Calling Pipelines
178
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
179
GDC Data Retrieval
bull Once data is submitted into GDC and the data submitter signs-off on the data to request data release GDC performs Quality Control (QC) and data processing
bull Upon successful GDC QC and processing the data is released based on the appropriate dbGaP data restrictions
bull Released data is made available for query and download to authorized users via the GDC Data Portal the GDC Data Transfer Tool and the GDC API
bull Data queries are based on the GDC Data Model
180
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects by means of unique identifiers
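The graph description above can be made concrete with a small sketch. This is only an illustration of the idea of nodes, edges, and unique identifiers, not the GDC's actual implementation: the `Graph` class, the edge labels `has_case` and `has_file`, and the sample case and file names are all hypothetical.

```python
import uuid
from collections import defaultdict


class Graph:
    """Tiny in-memory graph: nodes keyed by UUID, directed labeled edges."""

    def __init__(self):
        self.nodes = {}                  # node_id -> {"type": ..., "props": {...}}
        self.edges = defaultdict(list)   # src_id -> [(label, dst_id), ...]

    def add_node(self, node_type, **props):
        node_id = str(uuid.uuid4())      # unique identifier for this entity
        self.nodes[node_id] = {"type": node_type, "props": props}
        return node_id

    def add_edge(self, src, label, dst):
        self.edges[src].append((label, dst))

    def neighbors(self, src, label):
        return [dst for (lbl, dst) in self.edges[src] if lbl == label]


def files_for_project(graph, project_node):
    """Walk project -> case -> file and return the linked file names."""
    names = []
    for case_id in graph.neighbors(project_node, "has_case"):
        for file_id in graph.neighbors(case_id, "has_file"):
            names.append(graph.nodes[file_id]["props"]["file_name"])
    return names


g = Graph()
project = g.add_node("project", project_id="TCGA-LUSC")
case = g.add_node("case", submitter_id="case-001")        # hypothetical case
data_file = g.add_node("file", file_name="example.bam")   # hypothetical file
g.add_edge(project, "has_case", case)
g.add_edge(case, "has_file", data_file)

print(files_for_project(g, project))
```

Because every entity carries its own identifier, a query only needs to follow edges to get from a project to the file objects that belong to it.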
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from the cart or via the high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify the desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing
Example response (portion removed for readability):
{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
Example request (parts labeled on the slide: API URL, endpoint, URL parameters, query parameters):
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
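As a minimal sketch of how the example request on this slide is put together, the snippet below builds the same URL from its parts and reads a trimmed copy of the response. It makes no network call. The helper name `build_query_url` is hypothetical, the host is the one shown in the deck (it may have changed since 2016), and the sample response string is an abbreviated stand-in.

```python
import json
from urllib.parse import urlencode, urlunsplit

API_HOST = "gdc-api.nci.nih.gov"  # host as shown on the slide; may be outdated


def build_query_url(endpoint, fields, pretty=True):
    """Assemble API URL + endpoint + query parameters into one request URL."""
    query = urlencode(
        {"fields": ",".join(fields), "pretty": str(pretty).lower()},
        safe=",",  # keep the comma-separated field list readable
    )
    return urlunsplit(("https", API_HOST, "/" + endpoint, query, ""))


url = build_query_url("projects", ["project_id", "primary_site"])
print(url)

# Abbreviated stand-in for the JSON body returned by the request above.
sample = '{"data": {"hits": [{"project_id": "TCGA-SKCM", "primary_site": "Skin"}]}}'
sites = {hit["project_id"]: hit["primary_site"]
         for hit in json.loads(sample)["data"]["hits"]}
print(sites)
```

Splitting the URL into host, endpoint, and parameters mirrors the labels on the slide and makes it easy to query a different endpoint or field list.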
GDC Web Site
• The GDC Web Site provides access to GDC resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about the GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to the GDC User’s Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site – https://gdc.nci.nih.gov
• GDC Documentation Site – https://gdc-docs.nci.nih.gov
• GDC Data Portal – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype)
  – Aggregation
  – Access
  – Sharing
  – Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Figure: data flow from birth defect or childhood cancer cohorts through the DNA sequencing center (birth defect BAM/VCF, cancer BAM/VCF) to dbGaP and the NCI GDC, and on to the Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries) and its users]
[Figure: the GMKF Data Resource shown alongside dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
Connected and Soon to be Connected Matchmakers
[Figure: Matchmaker Exchange network. Labels: GENESIS; Gene Matcher; DECIPHER; RD Connect; ClinGen Genome Connect; Monarch; Phenome Central; Broad Institute; RDAP; patient-initiated matching; (model organisms (mouse, zebrafish), Orphanet, ClinVar, OMIM); legend: Live]
Connecting Data in the Big Data World
Centralized Database: Everyone submits data to a single central database
Examples: ClinVar, dbGaP, EGA
Centralized Hub: APIs connect each database to a central hub
Example: Many commercial platforms
Federated Network: All databases connected through multiple APIs
Example: Matchmaker Exchange
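One way to compare the three topologies above is to count the connections each needs as the number of databases grows. The sketch below assumes the simplest case, which the slide does not state: a hub needs one link per database, and a fully connected federated network needs one API link per pair of databases. Real federated networks, including the Matchmaker Exchange, need not connect every pair.

```python
def hub_connections(n_databases):
    """Centralized hub: each database keeps one connection, to the hub."""
    return n_databases


def federated_connections(n_databases):
    """Fully connected federation (an assumption): one link per pair."""
    return n_databases * (n_databases - 1) // 2


for n in (3, 7, 10):
    print(n, hub_connections(n), federated_connections(n))
```

The pairwise count grows quadratically, which is one reason a federated network benefits from a single shared API specification rather than a custom integration per partner.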
Matchmaker Exchange Acknowledgements
S Balasubramanian, Robert Green, Mike Bamshad, Matt Hurles, Sergio Beltran Agullo, Ada Hamosh, Jonathan Berg, Ekta Khurana, Kym Boycott, Sebastian Kohler, Anthony Brookes, Joel Krier, Michael Brudno, Owen Lancaster, Han Brunner, Melissa Landrum, Oriean Buske, Paul Lasko, Deanna Church, Rick Lifton, Raymond Dalgliesh, Daniel MacArthur, Andrew Devereau, Alex MacKenzie, Johan den Dunnen, Danielle Metterville, Helen Firth, Debbie Nickerson, Paul Flicek, Woong-Yan Park, Jan Friedman, Justin Paschall, Richard Gibbs, Anthony Philippakis, Marta Girdea, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner
NIAGADS
The NIA Genetics of Alzheimer’s Disease Data Storage Site: A Partnership with dbGaP
April 26, 2016
Background: Initiation of the Alzheimer’s Disease Sequencing Project (ADSP)
• February 7, 2012: Presidential Initiative announced to fight Alzheimer’s Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 1212 between 3 LSSC and 2 AD genetics consortia
Long-Term Plan for the ADSP
• Discovery Phase, 2014-2018
  • Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC’d data released 7/13/2015
  • Case-control
    • Whole exome sequencing (WES): 5000 cases and 5000 controls
    • 1000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC’d data released 112015
• Follow-Up, 2016-2021: WGS in case-control sample sets, emphasizing ethnically diverse cohorts
ADSP Discovery Phase Analysis
• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS, but not WES or targeted sequencing
Participation
• PARTICIPANTS
  – Two AD Genetics Consortia: Alzheimer’s Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  – Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  – NIH staff: NHGRI and NIA
  – External Consultants
• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups
Genomics Center
Organization of the ADSP
ADSP Infrastructure and Support
NIA Genetics of Alzheimer’s Disease Data Storage Site, NIAGADS [U24 AG041689]
National Cell Repository for Alzheimer’s Disease, NCRAD [U24 AG021886]
National Alzheimer’s Coordinating Center, NACC [U01 AG016976]
Alzheimer’s Disease Centers
NIA Center for Genetics and Genomics of Alzheimer’s Disease, CGAD [U54 AG052427]
NIAGADS History and Function
Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
NIA’s repository for late-onset Alzheimer’s disease genetics data
Datasets, Genomics Database, Analysis Resource
Compliance with the NIA AD Genetics of Alzheimer’s Disease Sharing Policy and the NIH Genomic Data Sharing Policy
Data Coordinating Center for the Alzheimer’s Disease Sequencing Project (ADSP)
NIAGADS: ADSP Data Coordinating Center
• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and the Office of Science Policy on “Trusted Partners”: interface between dbGaP and the NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP
Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomic Data Sharing (GDS) policy
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and eRA Commons accounts for authentication; no password is stored at the ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH side: after ADSP DAC approval, the investigator uses NIH authentication
• Community side: fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS: ADSP Data Coordinating Center
Data production: define data formatting; collect and curate phenotypes; track data production; additional quality metrics; coordinate public data release
Secondary/derived data management: documentation (README/manifest); packaging and release; notification and tracking
IT support: website; members area; face-to-face meeting support; dbGaP exchange area
Facilitate interactions with other AD investigators
Rapid response to unforeseen ADSP and NIH requests
Interactions
[Figure: interactions among NIAGADS, NIA/NHGRI, the Baylor/Broad/WashU LSACs, NCBI dbGaP/SRA, the ADSP Data Flow and other Work Groups, the Genomics Center, NCRAD, and ADGC/CHARGE]
ADGC: Alzheimer’s Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer’s Disease
NIAGADS: NIA Genetics of Alzheimer’s Disease Data Storage Site
Management of Work Group Data/Work Flow
Exchange Area
Data Deposition and Data Sharing
ADSP-specific datasets: 68 Tb
21 reference datasets with 5560 files in use by the ADSP
>20 reference datasets planned
Note: 1 TB = 1024 GB = 1048576 MB = 1073741824 KB = 1099511627776 bytes; a 1 TB hard drive has the capacity to hold about a trillion bytes
dbGaP/NIAGADS release for the research community at large
11555 BAM files released 122014
Pedigree and phenotype information
Sequencing Quality Control Metrics
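The terabyte conversion quoted on this slide can be checked with a few lines of arithmetic. The sketch below reproduces the chain of 1024 multiples and compares it with the decimal "trillion bytes" figure that drive vendors typically use. The constant names are chosen here for illustration.

```python
KB = 1024                 # bytes per kilobyte in the binary convention
TB_BINARY = KB ** 4       # 1 TB = 1024 GB = 1024**2 MB = 1024**3 KB = 1024**4 bytes
TB_DECIMAL = 10 ** 12     # "a trillion bytes"

gb_per_tb = TB_BINARY // KB ** 3
mb_per_tb = TB_BINARY // KB ** 2
kb_per_tb = TB_BINARY // KB

print(gb_per_tb, mb_per_tb, kb_per_tb, TB_BINARY)
print("binary TB exceeds decimal TB by", TB_BINARY - TB_DECIMAL, "bytes")
```

The two conventions differ by just under 10 percent at the terabyte scale, which matters when storage is budgeted in tens of terabytes.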
Size of the ADSP
Reference: “The real cost of sequencing: scaling computation to keep pace with data generation,” Muir et al., Genome Biology 2016, 17:53
ADSP Members Area (Dashboard)
Calendar, Documents, Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list 153 records
352 meeting minutes
>600 other documents
>100 consent files
10 analysis plans
85TB WGS, 1075TB WES data (dbGaP/SRA)
42 TB Files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
SNP/Gene reports, Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer’s Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
[Figure: GWAS results combined with the SNPs below; gene annotations transformed to SNPs; genomic features from identified SNPs]
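The combined search described above (filter SNPs by GWAS significance, transform gene annotations to SNPs, then intersect) can be sketched as set operations. Everything in this example is hypothetical: the SNP identifiers, positions, p-values, gene intervals, and the 5e-8 threshold are made up for illustration and are not taken from the NIAGADS Genomics Database.

```python
# Hypothetical GWAS results: SNP id -> (chromosome, position, p-value)
gwas = {
    "rs0001": ("chr1", 1500, 3e-9),
    "rs0002": ("chr1", 9000, 0.2),
    "rs0003": ("chr2", 500, 1e-8),
}
# Hypothetical gene annotations: gene -> (chromosome, start, end)
genes = {
    "GENE_A": ("chr1", 1000, 2000),
    "GENE_B": ("chr2", 4000, 5000),
}
THRESHOLD = 5e-8  # assumed genome-wide significance level


def significant_snps(results, threshold):
    """Step 1: keep SNPs whose GWAS p-value is below the threshold."""
    return {rs for rs, (_, _, p) in results.items() if p < threshold}


def snps_in_genes(results, annotations, selected):
    """Step 2: transform a gene list into the SNPs that fall inside those genes."""
    hits = set()
    for rs, (chrom, pos, _) in results.items():
        for name in selected:
            g_chrom, start, end = annotations[name]
            if chrom == g_chrom and start <= pos <= end:
                hits.add(rs)
    return hits


significant = significant_snps(gwas, THRESHOLD)
annotated = snps_in_genes(gwas, genes, ["GENE_A", "GENE_B"])
combined = significant & annotated   # Step 3: intersect the two searches
print(sorted(combined))
```

Expressing both searches as sets of SNPs is what lets results from different sources be combined with a plain intersection or union.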
Find a genomic location of interest by viewing tracks on the genome browser
[Figure: genome browser tracks for SNPs, genes, GWAS results, and functional genomics data]
View Genomic Information by the Genome Browser
[Figure: search result]
Data available at the NIAGADS Genomics Database
Gene models, SNP information
Pathway annotations: Gene Ontology, KEGG Pathway
Functional genomics data: ENCODE, FANTOM5
GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics
Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  • Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  • Big data management
  • IT infrastructure and web development
• Administrative support and collaboration
  • Familiarity with human subjects research regulations and NIH policy
  • Leadership and coordinating skills
  • Flexibility, close collaboration with program officers
• Domain expertise
  • Knowledge of the disease and genetic research
  • Familiarity with the community and cohorts
  • Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
What the Data Coordinating Center Should Know: Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements
Patients and families of those with AD
NIA ADRC program and centers
NACC, NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD
NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly
NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon, Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao
NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
GDC Overview: GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high level data
GDC Overview: Infrastructure
[Figure: GDC infrastructure, organized as GDC Users, GDC System Components, and GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; 3rd Party Applications; Data Access Tools; Data Submission Tools; APIs; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System]
GDC Overview: Resources
[Figure: GDC resources. Labels: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators]
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution
Other Government/External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH’s Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter’s NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset’s “data use limitations” either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
GDC Data Submission: Data Submission Process
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Portion | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Analyte | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Aliquot | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Diagnosis | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Exposure | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Family History | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Treatment | TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI’s Cancer Data Standards Repository (caDSR)
[Figure: a Case with its Clinical Data (Demographics, Diagnosis, Family History (optional), Exposure (optional), Treatment (optional)), Biospecimen Data, and Experiment Data]
Data Types and File Formats: Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
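A QC table like the one above is easier to act on when reduced to one status per read group. The sketch below takes the PASS/WARNING/FAIL values from the slide and reports the worst status for each read group. The rule "worst module wins" and the function name `worst_status` are assumptions made for illustration, not a documented GDC acceptance policy.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]
P, W, F = "PASS", "WARNING", "FAIL"

# FastQC module results, one list per module, ordered as READ_GROUPS.
RESULTS = {
    "Basic Statistics":             [P, P, P, P],
    "Per base sequence quality":    [P, P, P, P],
    "Per tile sequence quality":    [P, W, P, P],
    "Per sequence quality scores":  [F, P, P, P],
    "Per base sequence content":    [P, P, W, P],
    "Per sequence GC content":      [P, P, P, P],
    "Per base N content":           [P, P, P, P],
    "Sequence Length Distribution": [P, P, P, P],
    "Sequence Duplication Levels":  [W, W, P, P],
    "Overrepresented sequences":    [P, P, P, P],
    "Adapter Content":              [P, F, P, P],
    "Kmer Content":                 [P, P, W, P],
}
SEVERITY = {P: 0, W: 1, F: 2}


def worst_status(results, read_groups):
    """Return the most severe module status seen for each read group."""
    summary = {}
    for index, read_group in enumerate(read_groups):
        statuses = [row[index] for row in results.values()]
        summary[read_group] = max(statuses, key=SEVERITY.get)
    return summary


summary = worst_status(RESULTS, READ_GROUPS)
print(summary)
```

With the slide's values, RG_1 and RG_2 each contain a failing module, RG_3 has only warnings, and RG_4 passes every module.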
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify the desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to the GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
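To show how the create-entities endpoint above might be called, here is a minimal sketch that builds the POST request without sending it. Several details are assumptions: the host is taken from the deck's reference slide, the `X-Auth-Token` header name is assumed to be how the token is passed, and the program, project, payload, and token values are placeholders.

```python
import json
from urllib.request import Request

API_BASE = "https://gdc-api.nci.nih.gov"  # host from the reference slide; may be outdated


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Auth-Token": token,  # assumed header name for the authentication token
    }
    return Request(url, data=body, headers=headers, method="POST")


# Hypothetical payload: one case entity for an illustrative program and project.
req = build_create_request(
    "TCGA", "LUSC",
    [{"type": "case", "submitter_id": "case-001"}],
    "PLACEHOLDER_TOKEN",
)
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the submission. That step is left out here because it requires a valid token and submission authorization.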
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
bull After data processing has been completed the user can release their data to the GDC which must occur within six (6) months of submission per GDC Data Submission Policies
bull Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
173
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
174
GDC Processing Harmonization
bull GDC pipelines supporting the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
175
GDC Processing DNA and RNA Sequence Harmonization Pipelines
176
DNA Sequence Pipeline RNA Sequence Pipeline
GDC Processing GDC High Level Data Generation
bull The GDC generates high level data including DNA-seq derived germline variants and somatic mutations RNA-seq and miRNA-seq derived gene and miRNA quatifications and SNP Array based copy number segmentations
bull The GDC implements multiple pipelines for generating somatic variants such as the Baylor Broad and WashU pipelines
177
GDC Processing GDC Variant Calling Pipelines
178
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
179
GDC Data Retrieval
bull Once data is submitted into GDC and the data submitter signs-off on the data to request data release GDC performs Quality Control (QC) and data processing
bull Upon successful GDC QC and processing the data is released based on the appropriate dbGaP data restrictions
bull Released data is made available for query and download to authorized users via the GDC Data Portal the GDC Data Transfer Tool and the GDC API
bull Data queries are based on the GDC Data Model
180
GDC Data Retrieval GDC Data Model
bull The GDC data model is represented as a graph with nodes and edges and this graph is the store of record for the GDC
bull The GDC data model maintains the critical relationship between projects cases clinical data and experiment data and insures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
181
GDC Data Retrieval GDC Data Portal
bull The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
Data browsing by project file case or annotation
Visualization allowing users to perform fine-grained filtering of search results
Data search using advanced smart search technology
Data selection into a personalized cart
Data download from cart or a high-performance Data Transfer Tool
182
GDC Data Retrieval GDC Data Transfer Tool
bull The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
Command line interface to specify desired transfer protocol and multiple files for download
Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
Supports the secure transfer of controlled access data using an authentication key (token)
183
Data Retrieval Tools GDC API
bull The GDC API is a REST based programmatic interface that allows users to search and download cancer data sets for analysis
Search for projects files cases annotations and retrieve associated details in JSON format
Securely retrieve biospecimen clinical and molecular files
Perform BAM slicing
data hits [
project_id TCGA-SKCMrdquoprimary_site Skinrdquo project_id TCGA-PCPGrdquoprimary_site Nervous Systemrdquo project_id TCGA-LAMLrdquoprimary_site Bloodrdquo project_id TCGA-CNTLrdquoprimary_site Not Applicablerdquo project_id TCGA-UVMrdquoprimary_site Eyerdquo project_id TARGET-AMLrdquoprimary_site Bloodrdquo project_id TCGA-SARCrdquoprimary_site Mesenchymalrdquo project_id TCGA-LUSCrdquoprimary_site Lungrdquo project_id TARGET-NBLrdquoprimary_site Nervous Systemrdquo project_id TCGA-PAADrdquoprimary_site Pancreasrdquo
] Portion remove for readability
API URL Endpoint URL parameters Query parameters
184httpsgdc-apincinihgovprojectsfields=project_idprimary_siteamppretty=true
GDC Web Site
bull The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
Targets data consumers providers developers and general users
Provides access to information about GDC and contributed cancer genomic data sets
Instructs users on the use of GDC data access and submission tools
Provides descriptions of GDC bioinformatics pipelines
Documents supported GDC data types and file formats
Provides access to GDC support resources
185
GDC Documentation Site
bull The GDC Documentation Site provides access to GDC Userrsquos Guides and the GDC Data Dictionary
186
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site - https://gdc.nci.nih.gov
• GDC Documentation Site - https://gdc-docs.nci.nih.gov
• GDC Data Portal - https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  - https://gdc-portal.nci.nih.gov/submission
  - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool - https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  - https://gdc-api.nci.nih.gov
• GDC Support
  - GDC Assistance: support@nci-gdc.datacommons.io
  - GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data
  - Aggregation
  - Access
  - Sharing
  - Analysis
Whole genome sequence + Phenotype
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Figure: data flow diagram. Labels: Birth defect or childhood cancer cohorts; DNA sequencing center; Birth defect BAM/VCF; Cancer BAM/VCF; dbGaP; NCI GDC; Birth defect VCF; Cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); Users]
[Figure: GMKF Data Resource shown with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist and, if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
Connecting Data in the Big Data World
Centralized Database: Everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA
Centralized Hub: APIs connect each database to a central hub. Example: many commercial platforms
Federated Network: All databases connected through multiple APIs. Example: Matchmaker Exchange
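One way to compare the hub and federated models is to count the API links each needs. The sketch below assumes, purely for illustration, that a hub needs one link per database and that a federated network links every pair of databases directly. Real federations, including the Matchmaker Exchange, need not be fully connected, and the function names are made up for this example.

```python
def hub_links(n_databases: int) -> int:
    """Links needed when every database connects only to a central hub."""
    return n_databases


def federated_links(n_databases: int) -> int:
    """Links needed when every pair of databases is connected directly.

    Assumes one bidirectional link per pair: n * (n - 1) / 2.
    """
    return n_databases * (n_databases - 1) // 2


for n in (3, 5, 10):
    print(n, hub_links(n), federated_links(n))
```

Under these assumptions the federated count grows quadratically, which is one reason a shared API specification matters more as members are added.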
Matchmaker Exchange Acknowledgements
S Balasubramanian, Robert Green, Mike Bamshad, Matt Hurles, Sergio Beltran Agullo, Ada Hamosh, Jonathan Berg, Ekta Khurana, Kym Boycott, Sebastian Kohler, Anthony Brookes, Joel Krier, Michael Brudno, Owen Lancaster, Han Brunner, Melissa Landrum, Oriean Buske, Paul Lasko, Deanna Church, Rick Lifton, Raymond Dalgliesh, Daniel MacArthur, Andrew Devereau, Alex MacKenzie, Johan den Dunnen, Danielle Metterville, Helen Firth, Debbie Nickerson, Paul Flicek, Woong-Yan Park, Jan Friedman, Justin Paschall, Richard Gibbs, Anthony Philippakis, Marta Girdea, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner
NIAGADS
The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP
April 26, 2016
Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)
• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 1212 between 3 LSSC and 2 AD genetics consortia
Long-Term Plan for the ADSP
• Discovery Phase 2014-2018
  - Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC'd data released 7/13/2015
  - Case-control
    • Whole exome sequencing (WES): 5000 cases and 5000 controls
    • 1000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC'd data released 112015
• Follow-Up 2016-2021: WGS in case-control sample sets emphasizing ethnically diverse cohorts
ADSP Discovery Phase Analysis
• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing
Participation
• PARTICIPANTS
  - Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  - Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  - NIH staff: NHGRI and NIA
  - External Consultants
• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups
Genomics Center
Organization of the ADSP
ADSP Infrastructure and Support
NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]
National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]
National Alzheimer's Coordinating Center, NACC [U01 AG016976]
Alzheimer's Disease Centers
NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]
NIAGADS History and Function
Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
NIA's repository for the genetics of late-onset Alzheimer's disease data
Datasets, Genomics Database, Analysis Resource
Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy
Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)
NIAGADS ADSP Data Coordinating Center
• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and Office of Science Policy on "Trusted Partners"
Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP
Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomic Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as meta-data
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
Data production: define data formatting; collect and curate phenotypes; track data production; additional quality metrics; coordinate public data release
Secondary/derived data management: documentation (README/Manifest); packaging and release; notification and tracking
IT support: website; members area; face-to-face meeting support; dbGaP exchange area
Facilitate interactions with other AD investigators
Rapid response to unforeseen ADSP and NIH requests
Interactions
[Figure: NIAGADS interactions. Labels: NIAGADS; NIA/NHGRI; Baylor/Broad/WashU LSACs; NCBI dbGaP/SRA; ADSP Data Flow and other Work Groups; Genomics Center; NCRAD; ADGC/CHARGE]
ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site
Management of Work Group Data/Work Flow
[Figure label: Exchange Area]
Data Deposition and Data Sharing
ADSP specific datasets: 68 Tb
21 Reference datasets with 5560 files in use by the ADSP
>20 Reference datasets planned
1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes. A 1 TB hard drive has the capacity to hold a trillion bytes
dbGaP/NIAGADS release for the research community at large
11555 BAM files released 122014
Pedigree and phenotype information
Sequencing Quality Control Metrics
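The unit chain on this slide uses the binary convention (each step is a factor of 1024), while "a trillion bytes" corresponds to the decimal convention (10^12). The short sketch below reproduces the slide's figures and shows the gap between the two; the function name is made up for this example.

```python
BINARY_STEP = 1024  # each unit is 1024 of the next smaller unit


def terabyte_breakdown(terabytes=1):
    """Express a size in TB as GB, MB, KB, and bytes (binary convention)."""
    gigabytes = terabytes * BINARY_STEP
    megabytes = gigabytes * BINARY_STEP
    kilobytes = megabytes * BINARY_STEP
    return {
        "GB": gigabytes,
        "MB": megabytes,
        "KB": kilobytes,
        "bytes": kilobytes * BINARY_STEP,
    }


one_tb = terabyte_breakdown(1)
decimal_tb_bytes = 10 ** 12  # "a trillion bytes", the decimal convention
extra_bytes = one_tb["bytes"] - decimal_tb_bytes

print(one_tb)
print(extra_bytes)
```

A binary terabyte is about 10% larger than a decimal one, which matters when storage is sized in the tens of terabytes.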
Size of the ADSP
Reference: "The real cost of sequencing: scaling computation to keep pace with data generation", Muir et al., Genome Biology 2016, 17:53
ADSP Members Area (Dashboard)
Calendar, Documents, Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list: 153 records
352 meeting minutes
>600 other documents
>100 consent files
10 analysis plans
85TB WGS, 1075TB WES data (dbGaP/SRA)
42 TB Files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
SNP/Gene reports, Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer's Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
[Figure labels: GWAS results combined with SNPs below; Gene annotations transformed to SNPs; Genomic features from identified SNPs]
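The combined search described here (filter SNPs by GWAS significance, then intersect with gene annotations) can be sketched in a few lines. This is not the NIAGADS implementation: the record types, function names, and data below are made up for illustration, and 5e-8 is used only because it is the conventional genome-wide significance threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snp:
    snp_id: str
    chrom: str
    pos: int
    p_value: float


@dataclass(frozen=True)
class Gene:
    symbol: str
    chrom: str
    start: int
    end: int


def significant_snps(snps, threshold=5e-8):
    """Keep SNPs whose GWAS p-value is at or below the threshold."""
    return [s for s in snps if s.p_value <= threshold]


def snps_by_gene(snps, genes):
    """Map each gene symbol to the IDs of SNPs that fall inside its span."""
    result = {}
    for gene in genes:
        inside = [
            s.snp_id
            for s in snps
            if s.chrom == gene.chrom and gene.start <= s.pos <= gene.end
        ]
        if inside:
            result[gene.symbol] = inside
    return result


# Made-up records for illustration only; not real GWAS results.
SNPS = [
    Snp("snp_a", "19", 1_000, 1e-12),
    Snp("snp_b", "19", 5_000, 0.03),
    Snp("snp_c", "2", 2_500, 3e-9),
]
GENES = [Gene("GENE_X", "19", 500, 1_500), Gene("GENE_Y", "2", 9_000, 9_900)]

hits = snps_by_gene(significant_snps(SNPS), GENES)
print(hits)
```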
Find a genomic location of interest by viewing tracks on the genome browser
[Figure labels: SNPs; genes; GWAS results; functional genomics data]
View Genomic Information by the Genome Browser
[Figure label: search result]
Data available at the NIAGADS Genomics Database
Gene models, SNP information
Pathway annotations: Gene Ontology, KEGG Pathway
Functional genomics data: ENCODE, FANTOM5, GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics
Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  - Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  - Big data management
  - IT infrastructure and web development
• Administrative support and collaboration
  - Familiarity with human subjects research regulations and NIH policy
  - Leadership and coordinating skills
  - Flexibility, close collaboration with program officers
• Domain expertise
  - Knowledge of the disease and genetic research
  - Familiarity with the community and cohorts
  - Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
What the Data Coordinating Center Should Know
Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements
Patients and families of those with AD
NIA ADRC program and centers
NACC
NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD
NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly
NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon
NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao
GDC Overview: GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
  - Enables the identification of both high- and low-frequency cancer drivers
  - Assists in defining genomic determinants of response to therapy
  - Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  - Harmonization of raw sequence, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  - Application of state-of-the-art methods of generating high level data
GDC Overview: Infrastructure
[Figure: GDC infrastructure diagram with three groups: GDC Users, GDC System Components, GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; 3rd Party Applications; Alignment & Processing Tools; Data Submission Tools; Data Security System; APIs; Digital ID System]
GDC Overview: Resources
[Figure: GDC resources. Labels: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators]
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
  - Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interfaces (API) or the GDC Data Transfer Tool (command line interface) for data submission
  - Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool that use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  - GDC data submitters must first apply for data submission authorization through dbGaP
  - Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  - Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  - Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  - Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  - Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
GDC Data Submission: Data Submission Process
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample (TSV, JSON); Portion (TSV, JSON); Analyte (TSV, JSON); Aliquot (TSV, JSON)
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic (TSV, JSON); Diagnosis (TSV, JSON); Exposure (TSV, JSON); Family History (TSV, JSON); Treatment (TSV, JSON)
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  - GDC clinical data elements were reviewed with members of the research community
  - Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
[Figure: Case linked to Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)), Biospecimen Data, and Experiment Data]
Data Types and File Formats: Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
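A QC table like the one on this slide is easier to act on when reduced to the modules that did not pass for each read group. The sketch below does that for the statuses shown in the table. It is only an illustration of the summary step, not how the GDC reports QC, and the function name is made up for this example.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Module statuses copied from the example table, one entry per read group.
QC_TABLE = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged_modules(table, read_groups):
    """Return, per read group, the modules whose status is not PASS."""
    flagged = {rg: {} for rg in read_groups}
    for module, statuses in table.items():
        for rg, status in zip(read_groups, statuses):
            if status != "PASS":
                flagged[rg][module] = status
    return flagged


summary = flagged_modules(QC_TABLE, READ_GROUPS)
for read_group, issues in summary.items():
    print(read_group, issues or "all modules PASS")
```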
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  - Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  - Validate data against GDC standard data types defined in the project data dictionary
  - Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  - Command line interface to specify desired transfer protocol
  - Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  - Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  - Securely submit biospecimen, clinical, and experiment files using a token
  - Submit data to GDC by performing create, update, and retrieve actions on entities
  - Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
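The create-entities call can be sketched as an HTTP request object. The example below only builds the request and does not send it. The base URL is the one printed in these slides, while the `X-Auth-Token` header name, the entity fields, and the program and project names are assumptions made for illustration and should be checked against the GDC API documentation.

```python
import json
from urllib.request import Request

# Base URL as printed in the slides; the service address may have changed.
GDC_API_BASE = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entity, token):
    """Build (but do not send) a POST to the create-entities endpoint."""
    url = f"{GDC_API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entity).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Assumed header name for the authentication token.
        "X-Auth-Token": token,
    }
    return Request(url, data=body, headers=headers, method="POST")


# Hypothetical entity and identifiers, for illustration only.
example_entity = {"type": "case", "submitter_id": "EXAMPLE-CASE-1"}
request = build_create_request(
    "EXAMPLE_PROGRAM", "EXAMPLE_PROJECT", example_entity, "example-token"
)
print(request.get_method(), request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would perform the submission; it is left out here so the sketch runs without credentials.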
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline; RNA Sequence Pipeline]
GDC Processing: GDC High Level Data Generation
• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP Array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
GDC Data Retrieval
• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
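To make the graph idea concrete, the toy sketch below stores nodes keyed by unique identifiers and follows edges from a project or case down to its data files. It is an illustration of the concept only, not the GDC's schema or storage engine; the class, node labels, and identifiers are all hypothetical.

```python
from collections import defaultdict


class DataGraph:
    """Toy graph: nodes keyed by unique ID, directed parent-to-child edges."""

    def __init__(self):
        self.labels = {}                   # node ID -> node label
        self.children = defaultdict(list)  # parent ID -> child IDs

    def add_node(self, node_id, label):
        self.labels[node_id] = label

    def add_edge(self, parent_id, child_id):
        if parent_id not in self.labels or child_id not in self.labels:
            raise KeyError("both nodes must exist before linking them")
        self.children[parent_id].append(child_id)

    def descendants(self, node_id, label):
        """Return sorted IDs of all nodes below node_id with the given label."""
        found, stack = [], list(self.children[node_id])
        while stack:
            current = stack.pop()
            if self.labels[current] == label:
                found.append(current)
            stack.extend(self.children[current])
        return sorted(found)


# Hypothetical identifiers, for illustration only.
graph = DataGraph()
for node_id, label in [
    ("project-1", "project"), ("case-1", "case"), ("case-2", "case"),
    ("clinical-1", "clinical"), ("sample-1", "sample"),
    ("file-1", "file"), ("file-2", "file"),
]:
    graph.add_node(node_id, label)
for parent, child in [
    ("project-1", "case-1"), ("project-1", "case-2"),
    ("case-1", "clinical-1"), ("case-1", "sample-1"),
    ("sample-1", "file-1"), ("case-2", "file-2"),
]:
    graph.add_edge(parent, child)

project_files = graph.descendants("project-1", "file")
case_files = graph.descendants("case-1", "file")
print(project_files, case_files)
```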
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  - Data browsing by project, file, case, or annotation
  - Visualization allowing users to perform fine-grained filtering of search results
  - Data search using advanced smart search technology
  - Data selection into a personalized cart
  - Data download from cart or a high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
Matchmaker Exchange Acknowledgements
S Balasubramanian, Mike Bamshad, Sergio Beltran Agullo, Jonathan Berg, Kym Boycott, Anthony Brookes, Michael Brudno, Han Brunner, Oriean Buske, Deanna Church, Raymond Dalgliesh, Andrew Devereau, Johan den Dunnen, Helen Firth, Paul Flicek, Jan Friedman, Richard Gibbs, Marta Girdea, Robert Green, Matt Hurles, Ada Hamosh, Ekta Khurana, Sebastian Kohler, Joel Krier, Owen Lancaster, Melissa Landrum, Paul Lasko, Rick Lifton, Daniel MacArthur, Alex MacKenzie, Danielle Metterville, Debbie Nickerson, Woong-Yan Park, Justin Paschall, Anthony Philippakis, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner
NIAGADS
The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP
April 26, 2016
Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)
• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia
Long-Term Plan for the ADSP
• Discovery Phase, 2014-2018
  • Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC'd data released 7/13/2015
  • Case-control
    • Whole exome sequencing (WES): 5,000 cases and 5,000 controls
    • 1,000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC'd data released 11/2015
• Follow-Up, 2016-2021: WGS in case-control sample sets, emphasizing ethnically diverse cohorts
ADSP Discovery Phase Analysis
• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing
Participation
• PARTICIPANTS
  • Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  • Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  • NIH staff: NHGRI and NIA
  • External Consultants
• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups
Organization of the ADSP
[Figure: ADSP organization chart; the only surviving label is "Genomics Center"]
ADSP Infrastructure and Support
NIA Genetics of Alzheimer's Disease Data Storage Site (NIAGADS) [U24 AG041689]
National Cell Repository for Alzheimer's Disease (NCRAD) [U24 AG021886]
National Alzheimer's Coordinating Center (NACC) [U01 AG016976]
Alzheimer's Disease Centers
NIA Center for Genetics and Genomics of Alzheimer's Disease (CGAD) [U54 AG052427]
NIAGADS History and Function
Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
NIA's repository for the genetics of late-onset Alzheimer's disease data
Datasets, Genomics Database, Analysis Resource
Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomic Data Sharing Policy
Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)
NIAGADS ADSP Data Coordinating Center
• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and Office of Science Policy on "Trusted Partners"
Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP
Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomic Data Sharing (GDS) Policy
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and eRA Commons accounts for authentication; no password is stored at the ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data are stored at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
Data production: define data formatting; collect and curate phenotypes; track data production; additional quality metrics; coordinate public data release
Secondary/derived data management: documentation (README/Manifest), packaging, and release; notification and tracking
IT support: website; members area; face-to-face meeting support; dbGaP exchange area
Facilitate interactions with other AD investigators
Rapid response to unforeseen ADSP and NIH requests
Interactions
[Figure: NIAGADS interactions with NIA/NHGRI, the Baylor/Broad/WashU LSACs, NCBI dbGaP/SRA, the ADSP Data Flow and other Work Groups, the Genomics Center, NCRAD, and ADGC/CHARGE]
ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site
Management of Work Group Data/Work Flow
[Figure: work group data and work flow; surviving label: Exchange Area]
Data Deposition and Data Sharing
ADSP-specific datasets: 68 TB
21 reference datasets with 5,560 files in use by the ADSP
>20 reference datasets planned
1 TB = 1,024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes
A 1 TB hard drive has the capacity to hold roughly a trillion bytes
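The unit arithmetic on this slide can be checked directly. The following is a minimal sketch, assuming the slide's binary convention (each step is a factor of 1024); the constant and function names are illustrative and are not part of any ADSP tooling.

```python
# Binary storage units as written on the slide: each step is a factor of 1024.
BYTES_PER_KB = 1024
BYTES_PER_MB = 1024 ** 2
BYTES_PER_GB = 1024 ** 3
BYTES_PER_TB = 1024 ** 4


def tb_to_bytes(terabytes: float) -> int:
    """Convert binary terabytes to bytes (illustrative helper)."""
    return int(terabytes * BYTES_PER_TB)


def binary_vs_decimal_ratio() -> float:
    """How much larger a binary TB is than a decimal TB (10**12 bytes)."""
    return BYTES_PER_TB / 10 ** 12


if __name__ == "__main__":
    print(tb_to_bytes(1))    # bytes in one binary TB
    print(tb_to_bytes(68))   # bytes in the 68 TB of ADSP-specific datasets
    print(round(binary_vs_decimal_ratio(), 4))
```

A binary terabyte is about 10% larger than a trillion bytes, which is why "a trillion bytes" is only an approximation of the figure above.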
dbGaP/NIAGADS release for the research community at large
11,555 BAM files released 12/2014
Pedigree and phenotype information
Sequencing Quality Control Metrics
Size of the ADSP
Reference: "The real cost of sequencing: scaling computation to keep pace with data generation", Muir et al., Genome Biology 2016, 17:53
ADSP Members Area (Dashboard)
Calendar, documents, conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list 153 records
352 meeting minutes
>600 other documents
>100 consent files
10 analysis plans
85TB WGS, 1075TB WES data (dbGaP/SRA)
42 TB files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
SNP/Gene reports, Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer's Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
[Figure: search strategy. Labels: GWAS results combined with SNPs below; gene annotations transformed to SNPs; genomic features from identified SNPs]
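The combined search described on this slide (filter SNPs by GWAS significance, then intersect with gene annotation) can be sketched as follows. This is a toy illustration, not the NIAGADS Genomics Database implementation: the record types, the function names, the rsIDs, the gene symbols, and the coordinates are all hypothetical, and the 5e-8 cutoff is only the conventional genome-wide significance level used here as an assumed default.

```python
from typing import Dict, List, NamedTuple


class Snp(NamedTuple):
    rsid: str
    chrom: str
    pos: int
    p_value: float


class Gene(NamedTuple):
    symbol: str
    chrom: str
    start: int
    end: int


def significant_snps(snps: List[Snp], threshold: float = 5e-8) -> List[Snp]:
    """Keep SNPs whose GWAS p-value is at or below the chosen threshold."""
    return [s for s in snps if s.p_value <= threshold]


def genes_hit(snps: List[Snp], genes: List[Gene]) -> Dict[str, List[str]]:
    """Map each gene symbol to the rsIDs of SNPs that fall inside the gene."""
    hits: Dict[str, List[str]] = {}
    for gene in genes:
        inside = [
            s.rsid
            for s in snps
            if s.chrom == gene.chrom and gene.start <= s.pos <= gene.end
        ]
        if inside:
            hits[gene.symbol] = inside
    return hits


# Hypothetical toy data, for illustration only.
SNPS = [
    Snp("rs_demo_1", "19", 1050, 1e-12),
    Snp("rs_demo_2", "19", 5000, 3e-4),
    Snp("rs_demo_3", "11", 200, 2e-9),
]
GENES = [Gene("GENE_A", "19", 1000, 2000), Gene("GENE_B", "11", 500, 900)]

if __name__ == "__main__":
    print(genes_hit(significant_snps(SNPS), GENES))
```

Filtering before intersecting keeps the gene lookup small; a production system would use an interval index rather than the nested scan shown here.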
Find a genomic location of interest by viewing tracks on the genome browser
[Figure: genome browser tracks: SNPs, genes, GWAS results, functional genomics data]
View Genomic Information by the Genome Browser
[Figure: genome browser view of a search result]
Data available at the NIAGADS Genomics Database
Gene models, SNP information
Pathway annotations: Gene Ontology, KEGG Pathway
Functional genomics data: ENCODE, FANTOM5, GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics
Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  • Expertise in genomics, next-generation sequencing, method and software tool development, high-performance computing
  • Big data management
  • IT infrastructure and web development
• Administrative support and collaboration
  • Familiarity with human subjects research regulations and NIH policy
  • Leadership and coordinating skills
  • Flexibility, close collaboration with program officers
• Domain expertise
  • Knowledge of the disease and genetic research
  • Familiarity with the community and cohorts
  • Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
What the Data Coordinating Center Should Know
Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss them
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements
Patients and families of those with AD
NIA ADRC program and centers
NACC, NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD
NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly
NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon
NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao
GDC Overview: GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence data, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high-level data
GDC Overview: Infrastructure
[Figure: GDC infrastructure in three groups (GDC Users, GDC System Components, GDC Interfaces). Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; 3rd Party Applications; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; APIs; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System]
GDC Overview: Resources
[Figure: GDC resources: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators]
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
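The release timing in the policy above can be illustrated with a small date calculation. This is a sketch under one stated assumption: the slide says "either six months after data submission or at the time of first publication", and the code treats that as "whichever comes first", which the slide itself does not spell out. The function names are illustrative and are not part of any GDC tool.

```python
import calendar
from datetime import date
from typing import Optional


def add_months(d: date, months: int) -> date:
    """Add calendar months, clamping the day to the end of the target month."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def release_date(submitted: date, first_publication: Optional[date] = None) -> date:
    """Release date under the assumed 'whichever comes first' reading."""
    deadline = add_months(submitted, 6)
    if first_publication is not None and first_publication < deadline:
        return first_publication
    return deadline


if __name__ == "__main__":
    print(release_date(date(2016, 4, 1)))
    print(release_date(date(2016, 4, 1), first_publication=date(2016, 6, 15)))
```

The day is clamped so that a submission on the 31st of a month still maps to a valid date six months later.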
GDC Data Submission: Data Submission Process
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload their data to a workspace within the GDC and validate it there, using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Portion | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Analyte | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Aliquot | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Diagnosis | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Exposure | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Family History | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Treatment | TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
[Figure: a Case with its Clinical Data (Demographics, Diagnosis, Family History (optional), Exposure (optional), Treatment (optional)), Biospecimen Data, and Experiment Data]
Data Types and File Formats: Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
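A per-read-group QC table like the one on this slide is easier to act on once it is reduced to the modules that did not pass. The sketch below hard-codes the example statuses shown on the slide; the data structure and the function name `flagged_modules` are illustrative only and do not reflect how the GDC or FastQC store results.

```python
from typing import Dict, List, Sequence

READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Example module statuses copied from the slide, one entry per read group.
QC_RESULTS: Dict[str, List[str]] = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged_modules(
    results: Dict[str, List[str]],
    groups: Sequence[str],
    levels: Sequence[str] = ("FAIL",),
) -> Dict[str, List[str]]:
    """Return, per read group, the QC modules whose status is in `levels`."""
    flagged: Dict[str, List[str]] = {g: [] for g in groups}
    for module, statuses in results.items():
        for group, status in zip(groups, statuses):
            if status in levels:
                flagged[group].append(module)
    # Drop read groups with nothing to report.
    return {g: mods for g, mods in flagged.items() if mods}


if __name__ == "__main__":
    print(flagged_modules(QC_RESULTS, READ_GROUPS))
    print(flagged_modules(QC_RESULTS, READ_GROUPS, levels=("FAIL", "WARNING")))
```

Passing `levels=("FAIL", "WARNING")` widens the report to include warnings, which is useful when deciding whether a read group needs review rather than rejection.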
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project
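Validation against a data dictionary can be pictured as checking each submitted row for the fields the dictionary requires. The sketch below is a deliberately simplified stand-in, not the portal's validator: the required field names (`submitter_id`, `sample_type`) and the function `validate_tsv` are hypothetical, chosen only to show the shape of the check on a TSV template.

```python
import csv
import io
from typing import List, Set

# Hypothetical required fields for one entity type; the real GDC data
# dictionary defines its own fields and rules.
REQUIRED_FIELDS: Set[str] = {"submitter_id", "sample_type"}


def validate_tsv(text: str, required: Set[str]) -> List[str]:
    """Return a list of human-readable problems found in a TSV submission."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    errors: List[str] = []
    missing_columns = required - set(reader.fieldnames or [])
    if missing_columns:
        errors.append(f"missing columns: {sorted(missing_columns)}")
        return errors
    # Line 1 is the header, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        for field in sorted(required):
            if not (row[field] or "").strip():
                errors.append(f"line {line_no}: empty {field}")
    return errors


if __name__ == "__main__":
    good = "submitter_id\tsample_type\nS1\tPrimary Tumor\n"
    bad = "submitter_id\tsample_type\nS1\tPrimary Tumor\nS2\t\n"
    print(validate_tsv(good, REQUIRED_FIELDS))
    print(validate_tsv(bad, REQUIRED_FIELDS))
```

Returning a list of messages rather than raising on the first problem lets a submitter fix all issues in one pass.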
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
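A request to the create-entities endpoint can be assembled as shown below. This is a hedged sketch that only builds the request object and never sends it. Assumptions: the host is the API address listed on the References slide, the token is passed in an `X-Auth-Token` header, and the program name, project name, and entity fields are placeholders rather than a real GDC schema.

```python
import json
from typing import Any, Dict, List
from urllib.request import Request

# Host taken from the References slide; treat it as an assumption here.
API_ROOT = "https://gdc-api.nci.nih.gov"


def build_create_request(
    program: str, project: str, entities: List[Dict[str, Any]], token: str
) -> Request:
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the token
        },
    )


if __name__ == "__main__":
    # Placeholder program, project, entity, and token, for illustration only.
    request = build_create_request(
        "DEMO", "PROJ1",
        [{"type": "case", "submitter_id": "demo-case-1"}],
        "not-a-real-token",
    )
    print(request.get_method(), request.full_url)
```

Keeping request construction separate from sending makes the URL and payload easy to inspect before any controlled-access credentials are used.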
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]
GDC Processing: GDC High-Level Data Generation
• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array-based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
GDC Data Retrieval
• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
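The idea of a graph of nodes and edges linked by unique identifiers can be shown with a few lines of code. This is a toy model, not the GDC's actual schema or storage engine: the `Graph` class, the node labels, and the property names are all hypothetical, and each node simply gets a UUID as its identifier.

```python
import uuid
from collections import defaultdict
from typing import Any, Dict, List, Optional


class Graph:
    """Toy property graph: nodes keyed by UUID, edges from child to parent."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Dict[str, Any]] = {}
        self.parents: Dict[str, List[str]] = defaultdict(list)

    def add_node(self, label: str, parent: Optional[str] = None, **props: Any) -> str:
        node_id = str(uuid.uuid4())
        self.nodes[node_id] = {"label": label, **props}
        if parent is not None:
            self.parents[node_id].append(parent)
        return node_id

    def lineage(self, node_id: str) -> List[str]:
        """Labels from a node up to its root, following the first parent edge."""
        labels = [self.nodes[node_id]["label"]]
        while self.parents.get(node_id):
            node_id = self.parents[node_id][0]
            labels.append(self.nodes[node_id]["label"])
        return labels


if __name__ == "__main__":
    g = Graph()
    project = g.add_node("project", name="DEMO-PROJ")           # hypothetical
    case = g.add_node("case", parent=project, submitter_id="c1")
    sample = g.add_node("sample", parent=case)
    data_file = g.add_node("file", parent=sample, file_name="reads.bam")
    print(g.lineage(data_file))
```

Because every data file node can be walked back to its case and project, a query that starts from clinical attributes can still resolve to concrete file objects.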
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from cart or a high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)
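A manifest-driven transfer boils down to reading a list of files and confirming that each downloaded file matches its recorded checksum and size. The sketch below assumes a tab-delimited manifest with `id`, `filename`, `md5`, and `size` columns; that layout, the sample entry, and the function names are assumptions made for illustration and should be checked against a manifest actually generated by the portal.

```python
import csv
import hashlib
import io
from typing import Dict, List


def parse_manifest(text: str) -> List[Dict[str, str]]:
    """Parse a tab-delimited manifest into one dict per file entry."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def payload_matches(entry: Dict[str, str], payload: bytes) -> bool:
    """Check a downloaded payload against the manifest's md5 and size."""
    same_size = len(payload) == int(entry["size"])
    same_md5 = hashlib.md5(payload).hexdigest() == entry["md5"]
    return same_size and same_md5


if __name__ == "__main__":
    payload = b"ACGT\n"  # stand-in for a downloaded file's bytes
    manifest = "id\tfilename\tmd5\tsize\n" + "\t".join(
        ["demo-id", "reads.txt", hashlib.md5(payload).hexdigest(), str(len(payload))]
    ) + "\n"
    for entry in parse_manifest(manifest):
        print(entry["filename"], payload_matches(entry, payload))
```

Checking size before (or alongside) the checksum catches truncated transfers cheaply.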
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, annotations and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing
Example request (slide labels: API URL, Endpoint, URL parameters, Query parameters):
  https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Example response (portion removed for readability):
  {
    "data": {
      "hits": [
        {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
        {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
        {"project_id": "TCGA-LAML", "primary_site": "Blood"},
        {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
        {"project_id": "TCGA-UVM", "primary_site": "Eye"},
        {"project_id": "TARGET-AML", "primary_site": "Blood"},
        {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
        {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
        {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
        {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
      ]
    }
  }
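The request on this slide is an ordinary URL with an endpoint and a `fields` query parameter, and the response is JSON with a `data.hits` list. The sketch below builds the same URL and parses a two-entry excerpt of the response shown on the slide. It makes no network call; the helper names are illustrative, and the host is the one printed on the slide, which may since have changed.

```python
import json
from typing import Dict, Sequence
from urllib.parse import urlencode

API_ROOT = "https://gdc-api.nci.nih.gov"  # host as printed on the slide


def build_query_url(endpoint: str, fields: Sequence[str], pretty: bool = True) -> str:
    """Build an API URL such as .../projects?fields=a,b&pretty=true."""
    params = {"fields": ",".join(fields), "pretty": "true" if pretty else "false"}
    # safe="," keeps the comma-separated field list readable in the URL.
    return f"{API_ROOT}/{endpoint}?{urlencode(params, safe=',')}"


def sites_by_project(response_text: str) -> Dict[str, str]:
    """Map project_id to primary_site from a response with data.hits."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


# Two hits excerpted from the example response on the slide.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"}
]}}
"""

if __name__ == "__main__":
    print(build_query_url("projects", ["project_id", "primary_site"]))
    print(sites_by_project(SAMPLE_RESPONSE))
```

Requesting only the fields you need keeps responses small, which matters when paging through many projects or files.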
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site
  – https://gdc.nci.nih.gov
• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov
• GDC Data Portal
  – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
NIAGADS
The NIA Genetics of Alzheimerrsquos Disease Data Storage Site A Partnership with dbGaP
April 26 2016
Background-Initiation of the ADSP Background Initiation of the Alzheimerrsquos Disease Sequencing Project (ADSP) bull February 7 2012 Presidential Initiative announced to fight
Alzheimerrsquos Disease (AD) bull NIA and NHGRI to develop and execute a large scale
sequencing project to identify AD risk and protective gene variants
bull Long-term objective to facilitate identification of new pathways for therapeutic approaches and prevention
bull $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
bull Memorandum of Understanding in place 1212 between 3 LSSC and 2 AD genetics consortia
Long-Term Plan for the ADSPLong-Term Plan for the ADSP
bullDiscovery Phase 2014-2018 bull Family-based
bull Whole genome sequencing (WGS) on 111 multiplex families at least two members per family
bull Included Caribbean Hispanic families
bull Fully QCrsquod data released 7132015
bull Case-control bull Whole exome sequencing (WES) 5000 cases and 5000 controls bull 1000 additional cases from families multiply affected by AD bull Included Caribbean Hispanics bull Fully QCrsquod data released 112015
bullFollow-Up 2016-2021 WGS in case-control sample sets emphasizing ethnically
diverse cohorts
ADSP Discovery Phase Analysis
bull NIA funding for analysis of sequence data June 2014
bull Major analyses projects Family-Based Analysis Case-Control Analysis Structural Variants Protective Variants Annotation of the genome
bull ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing
Participation
Participation bull PARTICIPANTS
Two AD Genetics Consortia Alzheimerrsquos Disease Genetics Consortium (ADGC) amp Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
Three Large Scale Sequencing Centers Baylor Broad Washington University
NIH staff NHGRI and NIA
External Consultants
bull INFRASTRUCTURE Executive Committee Analysis Coordination Committee and 8 Work Groups
Genomics Center
Organization of the ADSP
ADSP Infrastructure and Support
NIA Genetics of Alzheimerrsquos Disease Data Storage Site NIAGADS [U24 AG041689]
National Cell Repository for Alzheimerrsquos Disease NCRAD [U24 AG021886]
National Alzheimerrsquos Coordinating Center NACC [U01 AG016976]
Alzheimerrsquos Disease Centers
NIA Center for Genetics and Genomics of Alzheimerrsquos Disease CGAD [U54 AG052427]
NIAGADS History and Function
Planning initiated 2010 FOA released 2011 U24 AG041689 funded under PAR 11-175 April 2012
NIArsquos repository for the genetics of late-onset Alzheimers disease data
Datasets Genomics Database Analysis Resource
Compliance with the NIA AD Genetics of Alzheimerrsquos Disease Sharing Policy and the NIH Genomics Data Sharing Policy
Data Coordinating Center for the Alzheimerrsquos Disease Sequencing Project (ADSP)
NIAGADS ADSP Data Coordinating Center
bull Host the sample information and study plan
bull Track progress of sequencing
bull Track samples
bull Prepare amp maintain data
bull Schedule data releases for the Study at dbGaP
bull Coordinate the flow of sequence data among
sequencing centers the consortia and dbGaP
bull Host ADSP website and ADSP data portal
bull Manage files and datasets for ADSP work groups
bull Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and Office of Science Policy on ldquoTrusted Partnersrdquo Interface between dbGaP and NIAGADS ADSP
portal bull ADSP data are deposited at dbGaP bull Investigator submits application for ADSP data to
dbGaP bull dbGaP implements user authentication via NIH iTrust
and data access control bull ADSP Data Access Committee review bull Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP
Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH side: after ADSP DAC approval, the investigator uses NIH authentication
• Community side: fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
Data production: define data formatting, collect and curate phenotypes, track data production, additional quality metrics, coordinate public data release
Secondary/derived data management: documentation (README/Manifest), packaging and release, notification and tracking
IT support: website, members area, face-to-face meeting support, dbGaP exchange area
Facilitate interactions with other AD investigators
Rapid response to unforeseen ADSP and NIH requests
Interactions
[Diagram labels: NIAGADS; NIA/NHGRI; Baylor/Broad/WashU LSACs; NCBI dbGaP/SRA; ADSP Data Flow and other Work Groups; Genomics Center; NCRAD; ADGC/CHARGE; Management of Work Group Data/Work Flow; Exchange Area]
ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site
Data Deposition and Data Sharing
ADSP-specific datasets: 68 TB
21 reference datasets with 5560 files in use by the ADSP
>20 reference datasets planned
1 TB = 1024 GB = 1048576 MB = 1073741824 KB = 1099511627776 bytes (binary units). A hard drive sold as 1 TB has the capacity to hold a trillion bytes (decimal units).
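The unit arithmetic above can be checked directly. This is only a small sketch: it assumes binary steps of 1024 for the conversion chain and the decimal convention for "a trillion bytes", and the variable names are illustrative.

```python
# Binary convention used in the chain above: each step is a factor of 1024.
KB = 1024
MB = 1024 * KB
GB = 1024 * MB
TB = 1024 * GB

# Decimal convention: "a trillion bytes".
TRILLION = 10**12

print(TB // GB, TB // MB, TB // KB, TB)  # 1024 1048576 1073741824 1099511627776
print(TB - TRILLION)  # bytes by which a binary TB exceeds a trillion
```

The two conventions differ by roughly 10 percent at this scale, which matters when storage is budgeted in tens of terabytes.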
dbGaP/NIAGADS release for the research community at large
11555 BAM files released 122014
Pedigree and phenotype information
Sequencing Quality Control Metrics
Size of the ADSP
Reference: "The real cost of sequencing: scaling computation to keep pace with data generation", Muir et al., Genome Biology 2016, 17:53
ADSP Members Area (Dashboard)
Calendar, documents, conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list 153 records
352 meeting minutes
>600 other documents
>100 consent files
10 analysis plans
85TB WGS, 1075TB WES data (dbGaP/SRA)
42 TB files (dbGaP Exchange Area)
niagads.org/genomics: web interface for query and analysis
SNP/Gene reports, Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer's Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
GWAS results combined with
SNPs below
Gene annotations transformed to SNPs
Genomic features from identified SNPs
Find a genomic location of interest by viewing tracks on the genome browser
SNPs
genes
GWAS results
functional genomics data
View Genomic Information by the Genome Browser
search result
Data available at the NIAGADS Genomics Database
Gene models, SNP information
Pathway annotations: Gene Ontology, KEGG Pathway
Functional genomics data: ENCODE, FANTOM5, GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics
Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  • Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  • Big data management
  • IT infrastructure and web development
• Administrative support and collaboration
  • Familiarity with human subjects research regulations and NIH policy
  • Leadership and coordinating skills
  • Flexibility, close collaboration with program officers
• Domain expertise
  • Knowledge of the disease and genetic research
  • Familiarity with the community and cohorts
  • Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
What the Data Coordinating Center Should Know
Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements
Patients and families of those with AD
NIA ADRC program and centers
NACC
NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD
NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly
NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon
NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao
GDC Overview: GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.
• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high level data
GDC Overview: Infrastructure
[Diagram with three groups: GDC Users, GDC System Components, GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; APIs; 3rd Party Applications; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System]
GDC Overview: Resources
[Diagram labels: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators]
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning, quality checks, and submission of revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
GDC Data Submission: Data Submission Process
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter's credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Portion | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Analyte | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Aliquot | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Diagnosis | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Exposure | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Family History | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Treatment | TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
[Diagram: Case, with Clinical Data (Demographics, Diagnosis, Family History (optional), Exposure (optional), Treatment (optional)), Biospecimen Data, and Experiment Data]
Data Types and File Formats: Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
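As an illustration of how a QC report like the one above might be screened, the sketch below copies the per-module statuses from the example table and lists, for each read group, the modules that reported FAIL or WARNING. The function name and data layout are hypothetical and are not part of any GDC tool.

```python
# FastQC module results per read group, copied from the example table above.
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]
QC_RESULTS = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flag_read_groups(results, read_groups, level="FAIL"):
    """Return {read_group: [modules reporting `level`]} for affected read groups."""
    flagged = {}
    for module, statuses in results.items():
        for read_group, status in zip(read_groups, statuses):
            if status == level:
                flagged.setdefault(read_group, []).append(module)
    return flagged


failures = flag_read_groups(QC_RESULTS, READ_GROUPS)
warnings = flag_read_groups(QC_RESULTS, READ_GROUPS, level="WARNING")
print(failures)
```

In this example only RG_4 is clean on every module; RG_1 and RG_2 each have one failing module.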
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project
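To make the idea of validating against a data dictionary concrete, here is a minimal sketch. The dictionary below is a hypothetical, much-simplified stand-in (required fields plus allowed values); the entity type, field names, and rules are assumptions for illustration and are not taken from the actual GDC dictionary.

```python
# Hypothetical, simplified dictionary: required fields and allowed values per entity type.
# Field names and permitted values are illustrative only.
DICTIONARY = {
    "demographic": {
        "required": ["submitter_id", "gender"],
        "enums": {"gender": {"female", "male", "unknown"}},
    },
}


def validate(record, dictionary=DICTIONARY):
    """Return a list of human-readable errors (empty if the record passes)."""
    schema = dictionary.get(record.get("type"))
    if schema is None:
        return [f"unknown entity type: {record.get('type')!r}"]
    errors = []
    for field in schema["required"]:
        if field not in record:
            errors.append(f"missing required field: {field}")
    for field, allowed in schema["enums"].items():
        if field in record and record[field] not in allowed:
            errors.append(f"invalid value for {field}: {record[field]!r}")
    return errors


print(validate({"type": "demographic", "submitter_id": "d1", "gender": "female"}))  # []
```

Returning a list of errors rather than raising on the first problem lets a submitter fix a whole file in one pass.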
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
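A rough sketch of how a client might assemble a call to the create-entities endpoint. The request is built but never sent. The host is the one written elsewhere in this deck; the `X-Auth-Token` header name, the example program and project, and the entity fields are assumptions for illustration, not confirmed by the slides.

```python
import json
from urllib.request import Request

API_ROOT = "https://gdc-api.nci.nih.gov"  # host as written in this deck; may have changed since


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST request for the create-entities endpoint."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Auth-Token": token,  # assumed header name for the authentication token
    }
    return Request(url, data=body, headers=headers, method="POST")


req = build_create_request(
    "TCGA",
    "SKCM",
    [{"type": "case", "submitter_id": "example-case-001"}],  # hypothetical entity
    token="<token>",
)
print(req.get_method(), req.full_url)
```

Keeping request construction separate from sending makes the URL and payload easy to inspect before any controlled-access credentials are used.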
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Diagrams: DNA Sequence Pipeline; RNA Sequence Pipeline]
GDC Processing: GDC High Level Data Generation
• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
GDC Data Retrieval
• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
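A toy sketch of a graph-shaped data model of the kind described above: nodes keyed by unique identifier, labeled edges, and a traversal from a project to its files. The class, node types, edge labels, and edge direction are illustrative assumptions, not the GDC's actual schema or storage engine.

```python
from collections import defaultdict


class Graph:
    """Tiny property graph: nodes keyed by unique ID, directed labeled edges."""

    def __init__(self):
        self.nodes = {}
        self.edges = defaultdict(list)  # src_id -> [(label, dst_id)]

    def add_node(self, node_id, node_type, **props):
        self.nodes[node_id] = {"type": node_type, **props}

    def add_edge(self, src, label, dst):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must exist before linking")
        self.edges[src].append((label, dst))

    def descendants(self, start, node_type):
        """IDs of nodes of `node_type` reachable from `start` (depth-first)."""
        found, stack, seen = [], [start], {start}
        while stack:
            current = stack.pop()
            for _, nxt in self.edges[current]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
                    if self.nodes[nxt]["type"] == node_type:
                        found.append(nxt)
        return found


g = Graph()
g.add_node("TCGA-SKCM", "project")
g.add_node("case-1", "case")
g.add_node("diag-1", "diagnosis")
g.add_node("file-1", "file", file_name="example.bam")
g.add_edge("TCGA-SKCM", "has_case", "case-1")
g.add_edge("case-1", "has_diagnosis", "diag-1")
g.add_edge("case-1", "has_file", "file-1")
print(g.descendants("TCGA-SKCM", "file"))  # ['file-1']
```

Because every node is addressed by a unique identifier, a file can be traced back to its case and project without relying on file names or directory layout.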
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from cart or a high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)
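Before starting a large transfer it can help to inspect the manifest. The sketch below parses a tab-separated manifest and totals the file sizes. The column names (`id`, `filename`, `md5`, `size`) and the sample rows are assumptions for illustration; a real manifest from the portal may differ.

```python
import csv
import io

# Hypothetical manifest contents; column names and values are illustrative only.
MANIFEST_TEXT = (
    "id\tfilename\tmd5\tsize\n"
    "uuid-0001\tsample_a.bam\td41d8cd98f00b204e9800998ecf8427e\t1048576\n"
    "uuid-0002\tsample_b.bam\t9e107d9d372bb6826bd81d3542a419d6\t2097152\n"
)


def read_manifest(text):
    """Parse a tab-separated manifest into a list of row dicts with integer sizes."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    for row in rows:
        row["size"] = int(row["size"])
    return rows


entries = read_manifest(MANIFEST_TEXT)
total_bytes = sum(entry["size"] for entry in entries)
print(len(entries), "files,", total_bytes, "bytes")  # 2 files, 3145728 bytes
```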
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing
Example request (parts labeled on the slide: API URL, endpoint, URL parameters, query parameters):
  https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Example response (portion removed for readability):
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
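The example request can be assembled programmatically. The sketch below builds the same query URL from its parts and reads an abbreviated response of the shape shown in the example. It makes no network call; the host is the one written in this deck and may since have changed, and the helper name is hypothetical.

```python
import json
from urllib.parse import urlencode

API_ROOT = "https://gdc-api.nci.nih.gov"  # host as written in this deck


def build_query_url(endpoint, fields, pretty=True):
    """Join the API root, endpoint, and query parameters into one URL."""
    params = {"fields": ",".join(fields), "pretty": str(pretty).lower()}
    # safe="," keeps the comma-separated field list unescaped in the URL
    return f"{API_ROOT}/{endpoint}?{urlencode(params, safe=',')}"


url = build_query_url("projects", ["project_id", "primary_site"])
print(url)

# Abbreviated response in the shape of the example (two hits only).
SAMPLE = (
    '{"data": {"hits": ['
    '{"project_id": "TCGA-SKCM", "primary_site": "Skin"}, '
    '{"project_id": "TCGA-LUSC", "primary_site": "Lung"}]}}'
)
sites = {hit["project_id"]: hit["primary_site"] for hit in json.loads(SAMPLE)["data"]["hits"]}
print(sites)
```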
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site
  – https://gdc.nci.nih.gov
• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov
• GDC Data Portal
  – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data
  – Aggregation
  – Access
  – Sharing
  – Analysis
Whole genome sequence + Phenotype
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Diagram labels, data flow: Birth defect or childhood cancer cohorts; DNA Sequencing center; Birth defect BAM/VCF; Cancer BAM/VCF; dbGaP; NCI GDC; Birth defect VCF; Cancer VCF; Gabriella Miller Kids First Data Resource; Index of datasets; Phenotype; Variant summaries; Users]
[Diagram labels, related resources: GMKF Data Resource; dbGaP; NCI GDC; TOPMed; The Monarch Initiative; ClinGen; Matchmaker Exchange; Center for Mendelian Genomics]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)
• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 1212 between 3 LSSC and 2 AD genetics consortia
Long-Term Plan for the ADSP
• Discovery Phase 2014-2018
  • Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC'd data released 7/13/2015
  • Case-control
    • Whole exome sequencing (WES): 5000 cases and 5000 controls
    • 1000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC'd data released 112015
• Follow-Up 2016-2021: WGS in case-control sample sets, emphasizing ethnically diverse cohorts
ADSP Discovery Phase Analysis
• NIA funding for analysis of sequence data, June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing
Participation
• PARTICIPANTS
  Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  NIH staff: NHGRI and NIA
  External Consultants
• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups
Genomics Center
Organization of the ADSP
ADSP Infrastructure and Support
NIA Genetics of Alzheimerrsquos Disease Data Storage Site NIAGADS [U24 AG041689]
National Cell Repository for Alzheimerrsquos Disease NCRAD [U24 AG021886]
National Alzheimerrsquos Coordinating Center NACC [U01 AG016976]
Alzheimerrsquos Disease Centers
NIA Center for Genetics and Genomics of Alzheimerrsquos Disease CGAD [U54 AG052427]
NIAGADS History and Function
Planning initiated 2010 FOA released 2011 U24 AG041689 funded under PAR 11-175 April 2012
NIArsquos repository for the genetics of late-onset Alzheimers disease data
Datasets Genomics Database Analysis Resource
Compliance with the NIA AD Genetics of Alzheimerrsquos Disease Sharing Policy and the NIH Genomics Data Sharing Policy
Data Coordinating Center for the Alzheimerrsquos Disease Sequencing Project (ADSP)
NIAGADS ADSP Data Coordinating Center
bull Host the sample information and study plan
bull Track progress of sequencing
bull Track samples
bull Prepare amp maintain data
bull Schedule data releases for the Study at dbGaP
bull Coordinate the flow of sequence data among
sequencing centers the consortia and dbGaP
bull Host ADSP website and ADSP data portal
bull Manage files and datasets for ADSP work groups
bull Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and Office of Science Policy on ldquoTrusted Partnersrdquo Interface between dbGaP and NIAGADS ADSP
portal bull ADSP data are deposited at dbGaP bull Investigator submits application for ADSP data to
dbGaP bull dbGaP implements user authentication via NIH iTrust
and data access control bull ADSP Data Access Committee review bull Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP
Benefits bull Secondary analysis data returned to NIAGADS bull dbGaP has access to secondary analysis data
via the ADSP portal bull AD research community can customize data
presentation to allow data browsing and display through NIAGADS
bull Augments the capacity of dbGaP to work with specific user communities
bull Example for other user communities
Access to ADSP Data
January 25 2015 NIH Genomics Data Sharing policy (GDS)
bull ADSP Data Access Committee initiated with the launch of the GDS
bull dbGaP application process bull IRB and Institutional Certification
documents
Information specific to NIAGADS review by Data Use Committee bull NIA Genomic Data Sharing Plan bull NIAGADS Data Distribution Plan bull DerivedSecondary Return Plan Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
bull ADSP Portal uses iTrust and ERA Commons account for authentication no password is stored at ADSP Portal
bull All ADSP sequence and genotype data are stored atdbGaP deep phenotypic data at NIAGADS
bull ADSP Portal lists the dbGaP ADSP files as well as meta-data
bull ADSP Portal receives nightly updates of approved userlists and file lists from dbGaP
ADSP Collaboration with dbGaP
bull NIH-side After ADSP DAC approval the investigator uses NIH authentication bull Community-side Fully customized web interface bull Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
Data production Define data formatting Collect and curate phenotypes Track data production Additional quality metrics Coordinate public data release
Secondaryderived data management Documentation (READMEManifest) packaging and release Notification and tracking
IT support Website Members area Face-to-face meeting support dbGaP exchange area
Facilitate interactions with other AD investigators
Rapid response to unforeseen ADSP and NIH requests
Interactions
NIAGADS
NIANHGRI
BaylorBroadWashU LSACs
NCBI dbGaPSRA
ADSP Data Flow and other Work Groups
Genomics Center
ADGC- Alzheimerrsquos Disease Genetics Consortium CHARGE- Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD ADGCCHARGE NCRAD- National Cell Repository for Alzheimerrsquos Disease NIAGADS- NIA Genetics of Alzheimerrsquos Disease Data Storage Site
Management of Work Group DataWork Flow
Exchange Area
Data Deposition and Data Sharing
ADSP specific datasets- 68 Tb
21 Reference datasets with 5560 files in use by the ADSP
gt20 Reference datasets planned 1 TB = 1024 GB = 1048576 MB = 1073741824 KB =
1099511627776 bytes A 1 TB hard drive has the capacity to
hold a trillion bytes
dbGaPNIAGADS release for the research community at large
11555 BAM files released 122014
Pedigree and phenotype information
Sequencing Quality Control Metrics
Size of the ADSP
Reference ldquoThe real cost of sequencing scaling computation to keep pace with data generationrdquo Muir et al Genome Biology 2016 1753
ADSP Members Area (Dashboard)
Calendar Documents Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list 153 records
352 meeting minutes
gt600 other documents
gt 100 consent files
10 analysis plans
85TB WGS 1075TB WES data (dbGaPSRA)
42 TB Files (dbGaP Exchange Area)
niagadsorggenomics Web interface for query and analysis
SNPGene reports Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimerrsquos Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
GWAS results combined with
SNPs below
Gene annotations transformed to SNPs
Genomic features from identified SNPs
Find a genomic location of interest by viewing tracks on the genome browser
SNPs
genes
GWAS results
functional genomics data
View Genomic Information by the Genome Browser
search result
Data available at the NIAGADS Genomics Database
Gene models SNP information
Pathway annotations Gene Ontology
KEGG Pathway
Functional genomics data ENCODE
FANTOM5
GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics Ongoing effort to deposit annotations curated by ADSP AnnotationWG into Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  • Expertise in genomics, next-generation sequencing, method and software tool development, high-performance computing
  • Big data management
  • IT infrastructure and web development
• Administrative support and collaboration
  • Familiarity with human subjects research regulations and NIH policy
  • Leadership and coordinating skills
  • Flexibility, close collaboration with program officers
• Domain expertise
  • Knowledge of the disease and genetic research
  • Familiarity with the community and cohorts
  • Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
What the Data Coordinating Center Should Know
Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss them
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements NIAGADS EAB Matthew Farrer
Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober
NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD
NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon
NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao
GDC Overview: GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
  - Enables the identification of both high- and low-frequency cancer drivers
  - Assists in defining genomic determinants of response to therapy
  - Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  - Harmonization of raw sequence, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  - Application of state-of-the-art methods of generating high-level data
GDC Overview: Infrastructure
[Diagram: GDC infrastructure, grouped as GDC Users, GDC Interfaces, and GDC System Components. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; 3rd Party Applications; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; APIs; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System]
GDC Overview: Resources
GDC Data Portal
GDC Data Center
GDC Data Transfer Tool
GDC Reports
GDC Data Model
GDC Data Submission Portal
GDC Application Programming Interface (API)
GDC Bioinformatics Pipeline
GDC Documentation and Support
GDC Organization and Collaborators
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC government sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
  - Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  - Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  - GDC data submitters must first apply for data submission authorization through dbGaP
  - Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  - Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  - Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  - Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  - Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
GDC Data Submission: Data Submission Process
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample: TSV, JSON; Portion: TSV, JSON; Analyte: TSV, JSON; Aliquot: TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic: TSV, JSON; Diagnosis: TSV, JSON; Exposure: TSV, JSON; Family History: TSV, JSON; Treatment: TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type Data Subtype Format Data Dictionary Template
Data File
Experimental Metadata Pathology Report Run Metadata Slide Image
Submitted Unaligned Reads
SRA XML
PDF SRA XML SVS
FASTQ BAM
Experimental Metadata Pathology Report
Submitted Unaligned Reads
TSV JSON
TSV JSON TSV JSON TSV JSON
TSV JSON
Submitted Aligned Reads BAM
Submitted Aligned Reads TSV JSON
Data Bundle Read Group
Slide
Read Group
Slide
TSV JSON
TSV JSON
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
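Because the same record can be supplied as TSV or JSON, converting between the two is mostly a matter of mapping column headers to keys. The sketch below assumes a hypothetical three-column sample template; the column names and values are placeholders, not the actual GDC data dictionary fields.

```python
import csv
import io
import json

# Hypothetical TSV template content. Column names are illustrative assumptions.
tsv_text = (
    "type\tsubmitter_id\tsample_type\n"
    "sample\tsample-001\tPrimary Tumor\n"
    "sample\tsample-002\tBlood Derived Normal\n"
)

# Each TSV row becomes one dictionary keyed by the header line.
reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
records = list(reader)

# The same records serialized as a JSON array.
json_text = json.dumps(records, indent=2)
print(json_text)
```

Validation against a data dictionary would follow this step; it is omitted here because the dictionary contents are not given in the slides.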
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  - GDC clinical data elements were reviewed with members of the research community
  - Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
[Diagram: Case, linked to Clinical Data (Demographics, Diagnosis, Family History (optional), Exposure (optional), Treatment (optional)), Biospecimen Data, and Experiment Data]
Data Types and File Formats: Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
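A per-read-group QC report like the one above is easier to act on when reduced to the modules that did not pass. The following is a minimal sketch that hard-codes the statuses from the example table; the data structure and function name are illustrative and are not part of FastQC or the GDC.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Module statuses copied from the example table, one entry per read group.
QC_STATUS = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged_modules(status_table, read_groups):
    """Map each read group to its (module, status) pairs that are not PASS."""
    flagged = {rg: [] for rg in read_groups}
    for module, statuses in status_table.items():
        for rg, status in zip(read_groups, statuses):
            if status != "PASS":
                flagged[rg].append((module, status))
    return flagged


flagged = flagged_modules(QC_STATUS, READ_GROUPS)
for rg in READ_GROUPS:
    print(rg, flagged[rg] or "all modules passed")
```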
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
Validate data against GDC standard data types defined in the project data dictionary
Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
Command line interface to specify desired transfer protocol
Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
Securely submit biospecimen, clinical, and experiment files using a token
Submit data to GDC by performing create, update, and retrieve actions on entities
Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
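A request to the create-entities endpoint can be assembled as follows. This is a sketch under stated assumptions: the host is the API address listed in the deck's references, the `X-Auth-Token` header name and the entity fields (`type`, `submitter_id`) are assumptions for illustration, and the program, project, and token values are placeholders. The request is only built, never sent.

```python
import json
import urllib.request

# API address as listed in the deck's references (2016). Treat as an assumption.
API_ROOT = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Auth-Token": token,  # assumed header name for the authentication token
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Placeholder program, project, entity, and token values.
request = build_create_request(
    program="PROG",
    project="PROJ",
    entities=[{"type": "case", "submitter_id": "case-001"}],
    token="PLACEHOLDER-TOKEN",
)
print(request.get_method(), request.full_url)
```

Building the request separately from sending it makes the URL and payload easy to inspect before any controlled-access credentials leave the machine.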
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure panels: DNA Sequence Pipeline; RNA Sequence Pipeline]
GDC Processing: GDC High-Level Data Generation
• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
GDC Data Retrieval
• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
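The idea of a graph store in which unique identifiers tie cases to their clinical records and data files can be illustrated with a very small in-memory model. This is a sketch only: the node labels, relation names, and identifiers are hypothetical and do not reproduce the actual GDC schema.

```python
class Graph:
    """Tiny property graph: nodes keyed by unique id, edges as triples."""

    def __init__(self):
        self.nodes = {}  # node id -> {"label": ..., plus any properties}
        self.edges = []  # (source id, relation, destination id)

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, **props}

    def add_edge(self, src, relation, dst):
        # Refuse dangling links so every edge joins two known identifiers.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must exist before linking")
        self.edges.append((src, relation, dst))

    def linked_to(self, dst, label):
        """Ids of nodes with the given label that point at `dst`."""
        return [s for s, _, d in self.edges
                if d == dst and self.nodes[s]["label"] == label]


graph = Graph()
graph.add_node("proj-1", "project")
graph.add_node("case-1", "case")
graph.add_node("dx-1", "diagnosis")
graph.add_node("file-1", "data_file", file_name="reads.bam")
graph.add_edge("case-1", "member_of", "proj-1")
graph.add_edge("dx-1", "describes", "case-1")
graph.add_edge("file-1", "derived_from", "case-1")

files_for_case = graph.linked_to("case-1", "data_file")
print(files_for_case)
```

Queries such as "all files for the cases in a project" then become short walks along these edges rather than joins across separate tables.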
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
Data browsing by project, file, case, or annotation
Visualization allowing users to perform fine-grained filtering of search results
Data search using advanced smart search technology
Data selection into a personalized cart
Data download from cart or a high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
Command line interface to specify desired transfer protocol and multiple files for download
Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
Supports the secure transfer of controlled access data using an authentication key (token)
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
Securely retrieve biospecimen, clinical, and molecular files
Perform BAM slicing
Example request (the slide labels the API URL, endpoint, URL parameters, and query parameters):
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Example response:
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  (portion removed for readability)
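The example query can be reproduced programmatically by composing the API address, the endpoint, and the query string, then reading `data.hits` from the JSON reply. A minimal sketch, assuming the 2016 API address given in the deck; the sample response is a shortened copy of the slide's output, and no network call is made.

```python
import json
from urllib.parse import urlencode

# API address as given in the deck (2016). Treat as an assumption.
API_ROOT = "https://gdc-api.nci.nih.gov"


def build_query_url(endpoint, fields, pretty=True):
    """Compose <API root>/<endpoint>?fields=a,b&pretty=true."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # safe="," keeps the comma-separated field list readable in the URL.
    return f"{API_ROOT}/{endpoint}?{urlencode(params, safe=',')}"


url = build_query_url("projects", ["project_id", "primary_site"])
print(url)

# Shortened copy of the response shown on the slide.
sample_response = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LUSC", "primary_site": "Lung"}
]}}
"""
hits = json.loads(sample_response)["data"]["hits"]
site_by_project = {hit["project_id"]: hit["primary_site"] for hit in hits}
print(site_by_project)
```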
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
Targets data consumers, providers, developers, and general users
Provides access to information about GDC and contributed cancer genomic data sets
Instructs users on the use of GDC data access and submission tools
Provides descriptions of GDC bioinformatics pipelines
Documents supported GDC data types and file formats
Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site
  - https://gdc.nci.nih.gov
• GDC Documentation Site
  - https://gdc-docs.nci.nih.gov
• GDC Data Portal
  - https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  - https://gdc-portal.nci.nih.gov/submission
  - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  - https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  - https://gdc-api.nci.nih.gov
• GDC Support
  - GDC Assistance: support@nci-gdc.datacommons.io
  - GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data
  - Aggregation
  - Access
  - Sharing
  - Analysis
Whole genome sequence + Phenotype
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Diagram: data flow. Labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF and cancer BAM/VCF; dbGaP; NCI GDC; birth defect VCF and cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users]
[Diagram: GMKF Data Resource shown with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
Long-Term Plan for the ADSP
• Discovery Phase 2014-2018
  • Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC'd data released 7/13/2015
  • Case-control
    • Whole exome sequencing (WES): 5000 cases and 5000 controls
    • 1000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC'd data released 112015
• Follow-Up 2016-2021
  WGS in case-control sample sets, emphasizing ethnically diverse cohorts
ADSP Discovery Phase Analysis
• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing
Participation
• PARTICIPANTS
  Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  NIH staff: NHGRI and NIA
  External Consultants
• INFRASTRUCTURE
  Executive Committee, Analysis Coordination Committee, and 8 Work Groups
Genomics Center
Organization of the ADSP
ADSP Infrastructure and Support
NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]
National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]
National Alzheimer's Coordinating Center, NACC [U01 AG016976]
Alzheimer's Disease Centers
NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]
NIAGADS History and Function
Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
NIA's repository for the genetics of late-onset Alzheimer's disease data
Datasets, Genomics Database, Analysis Resource
Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy
Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)
NIAGADS ADSP Data Coordinating Center
• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and the Office of Science Policy on "Trusted Partners"
Interface between dbGaP and the NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP
Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and an eRA Commons account for authentication; no password is stored at the ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data are stored at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
Data production
• Define data formatting
• Collect and curate phenotypes
• Track data production
• Additional quality metrics
• Coordinate public data release
Secondary/derived data management
• Documentation (README/Manifest), packaging, and release
• Notification and tracking
IT support
• Website
• Members area
• Face-to-face meeting support
• dbGaP exchange area
Facilitate interactions with other AD investigators
Rapid response to unforeseen ADSP and NIH requests
Interactions
[Diagram: NIAGADS interactions. Labels: NIAGADS; NIA/NHGRI; Baylor/Broad/WashU LSACs; NCBI dbGaP/SRA; ADSP Data Flow and other Work Groups; Genomics Center]
ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD ADGCCHARGE NCRAD- National Cell Repository for Alzheimerrsquos Disease NIAGADS- NIA Genetics of Alzheimerrsquos Disease Data Storage Site
Management of Work Group DataWork Flow
Exchange Area
Data Deposition and Data Sharing
ADSP specific datasets- 68 Tb
21 Reference datasets with 5560 files in use by the ADSP
gt20 Reference datasets planned 1 TB = 1024 GB = 1048576 MB = 1073741824 KB =
1099511627776 bytes A 1 TB hard drive has the capacity to
hold a trillion bytes
dbGaPNIAGADS release for the research community at large
11555 BAM files released 122014
Pedigree and phenotype information
Sequencing Quality Control Metrics
Size of the ADSP
Reference ldquoThe real cost of sequencing scaling computation to keep pace with data generationrdquo Muir et al Genome Biology 2016 1753
ADSP Members Area (Dashboard)
Calendar Documents Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list 153 records
352 meeting minutes
gt600 other documents
gt 100 consent files
10 analysis plans
85TB WGS 1075TB WES data (dbGaPSRA)
42 TB Files (dbGaP Exchange Area)
niagadsorggenomics Web interface for query and analysis
SNPGene reports Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimerrsquos Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
GWAS results combined with
SNPs below
Gene annotations transformed to SNPs
Genomic features from identified SNPs
Find a genomic location of interest by viewing tracks on the genome browser
SNPs
genes
GWAS results
functional genomics data
View Genomic Information by the Genome Browser
search result
Data available at the NIAGADS Genomics Database
Gene models SNP information
Pathway annotations Gene Ontology
KEGG Pathway
Functional genomics data ENCODE
FANTOM5
GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics Ongoing effort to deposit annotations curated by ADSP AnnotationWG into Genomics DB
What the Data Coordinating Center Should Know
bull Bioinformatics bull Expertise in genomics next generation sequencing method and
software tool development high performance computing bull Big data management bull IT infrastructure and web development
bull Administrative support and collaboration bull Familiarity with human subjects research regulations and NIH policy bull Leadership and coordinating skills bull Flexibility close collaboration with program officers
bull Domain expertise bull Knowledge of the disease and genetic research bull Familiarity with the community and cohorts bull Familiarity with infrastructure (sequencing centers dbGaP DACs NIA
policy and infrastructure)
What the Data Coordinating Center Should Know
Lessons Learned
bull Set up the data coordinating center early bull Develop study designs and analysis plans early bull Set milestones and timelines early and carefully but
be ready to miss bull Leverage existing projects and infrastructure bull Engage a wide range of expertise bull Build on existing community resources bull Use open-source and academic solutions instead of
commercialproprietary solutions bull Engage external advisors
Acknowledgements NIAGADS EAB Matthew Farrer
Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober
NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly bull Data flow WG
bull Annotation WG IGAPADGCCHARGEEADIGERAD
NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon
NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao
GDC Overview GDC Data Submission Processing and Retrieval
April 2016
Agenda
n
GDC Overview
GDC Data Submissio
GDC Data Processing
GDC Data Retrieval
154
GDC Overview Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables
data sharing across cancer genomic studies in support of precision medicine
bull Provide a cancer knowledge network that ndash Enables the identification of both high- and low-frequency cancer drivers ndash Assists in defining genomic determinants of response to therapy ndash Informs the composition of clinical trial cohorts sharing targeted genetic lesions
bull Support the receipt quality control integration storage and redistribution of standardized genomic data sets derived from cancer research studies ndash Harmonization of raw sequence both from existing (eg TCGA TARGET
CGCI) and new cancer research programs
ndash Application of state-of-the-art methods of generating high level data 155
GDC Overview Infrastructure
156
Data Sources (DCC CGHub)
Data Submitters
Open Access Data Users
Controlled Access Data Users
eRACommons amp dbGaP Data
Access Tools
Data Import System
Metadata amp Data Storage
Reporting System
Harmonization amp Generation
Pipelines
3rd Party Applications
GDC Users GDC System Components GDC Interfaces
Alignment amp Processing Tools Data
Submission Tools
Data Security System
APIs
Digital ID System
GDC Overview Resources
157
GDC Data Portal
GDC Data Center
GDC Data Transfer Tool GDC Reports
GDC Data Model
GDC Data Submission
Portal
GDC ApplicatioProgrammingInterface (API)
n GDC Bioinformatics
Pipeline
GDC Documentation
and Support
GDC Organization
and Collaborators
GDC Organization and Collaborators
bull GDC Government Sponsor
NCI Center for Cancer Genomics (CCG)
University of Chicago Team
bull Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
bull GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
bull Contracting organization supporting GDC management and execution
Other Government External
bull Includes eRA dbGaP NCI CBIIT for data access and security compliance bull Includes the GDC Steering Committee Bioinformatics Advisory Group Data
Submitters and User Acceptance Testers (UAT) Testers 158
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, both of which use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH’s Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter’s NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification.
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores.
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality checks and to submit revised data before release.
    • The pre-processing period may generally last up to six months from the date of first submission.
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset’s “data use limitations” either six months after data submission or at the time of first publication.
  – Data Redaction
    • The GDC, in general, will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data.
GDC Data Submission: Data Submission Process
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter's credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP.
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | Case: TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample: TSV, JSON; Portion: TSV, JSON; Analyte: TSV, JSON; Aliquot: TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic: TSV, JSON; Diagnosis: TSV, JSON; Exposure: TSV, JSON; Family History: TSV, JSON; Treatment: TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | (not shown) | TSV, JSON
Data File | Slide Image | SVS | (not shown) | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | (none) | Read Group | TSV, JSON
Data Bundle | Slide | (none) | Slide | TSV, JSON
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
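Because the same biospecimen records can be supplied as TSV or JSON, a submitter could keep one tabular source and derive the JSON form from it. The following is a minimal sketch using only the Python standard library. The column names (`type`, `submitter_id`, `sample_type`) and the sample values are illustrative assumptions, not the actual GDC data dictionary, which should be consulted for the real field names.

```python
import csv
import io
import json


def tsv_to_records(tsv_text):
    """Parse tab-separated text with a header row into a list of dicts."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [dict(row) for row in reader]


# Hypothetical rows for illustration; real templates come from the GDC data dictionary.
EXAMPLE_TSV = (
    "type\tsubmitter_id\tsample_type\n"
    "sample\tSAMPLE-001\tPrimary Tumor\n"
    "sample\tSAMPLE-002\tBlood Derived Normal\n"
)

records = tsv_to_records(EXAMPLE_TSV)
print(json.dumps(records, indent=2))
```

Each TSV row becomes one JSON object keyed by the header names, so the two representations stay in step.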
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI’s Cancer Data Standards Repository (caDSR)
[Figure: a Case is associated with Clinical Data (Demographics, Diagnosis, Family History (optional), Exposure (optional), Treatment (optional)), Biospecimen Data, and Experiment Data.]
Data Types and File Formats: Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file formats and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
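A QC report like the one above is easier to act on when reduced to one status per read group. The sketch below transcribes only the non-PASS cells of the table and reports the worst status per read group. Treating FAIL as worse than WARNING is an assumption made for illustration; the slides do not say how the GDC weighs these results.

```python
# Non-PASS results transcribed from the QC table above; every other cell is PASS.
NON_PASS = {
    "Per tile sequence quality": {"RG_2": "WARNING"},
    "Per sequence quality scores": {"RG_1": "FAIL"},
    "Per base sequence content": {"RG_3": "WARNING"},
    "Sequence Duplication Levels": {"RG_1": "WARNING", "RG_2": "WARNING"},
    "Adapter Content": {"RG_2": "FAIL"},
    "Kmer Content": {"RG_3": "WARNING"},
}
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]


def worst_status(read_group):
    """Return FAIL, WARNING, or PASS for a read group (assumed severity order)."""
    statuses = [module.get(read_group, "PASS") for module in NON_PASS.values()]
    for level in ("FAIL", "WARNING"):
        if level in statuses:
            return level
    return "PASS"


summary = {rg: worst_status(rg) for rg in READ_GROUPS}
for rg, status in summary.items():
    print(rg, status)
```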
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify the desired transfer protocol
  – Uses a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to the GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
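To make the Create Entities endpoint concrete, the sketch below builds (but does not send) a POST request for it with the Python standard library. The host is the API address listed on the References slide. The `X-Auth-Token` header name, the JSON body shape, and the `PROGRAM`/`PROJECT` placeholders are assumptions for illustration and should be checked against the GDC API documentation.

```python
import json
import urllib.request

API_ROOT = "https://gdc-api.nci.nih.gov"  # host as listed on the References slide


def build_create_request(program, project, entities, token):
    """Build a POST request for /v0/submission/<program>/<project> without sending it."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the authentication token
        },
    )


# Hypothetical entity payload; real fields are defined by the GDC data dictionary.
req = build_create_request(
    "PROGRAM",
    "PROJECT",
    [{"type": "case", "submitter_id": "CASE-001"}],
    token="REPLACE_WITH_TOKEN",
)
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the submission; it is left out here so the sketch runs without network access or credentials.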
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per the dbGaP authorization policies associated with the data set
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline diagrams]
GDC Processing: GDC High Level Data Generation
• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Retrieval
• Once data is submitted to the GDC and the data submitter signs off on the data to request data release, the GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects by means of unique identifiers
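The graph idea can be illustrated with a toy model in which projects, cases, samples, and files are nodes and the relationships between them are edges. This is a sketch of the general technique only: the node labels, identifiers, and traversal helper are hypothetical and do not reflect the GDC's actual schema or storage engine.

```python
from collections import defaultdict


class Graph:
    """Tiny directed graph: nodes carry a label, edges point from parent to child."""

    def __init__(self):
        self.nodes = {}
        self.edges = defaultdict(list)

    def add_node(self, node_id, label):
        self.nodes[node_id] = {"label": label}

    def add_edge(self, parent, child):
        self.edges[parent].append(child)

    def descendants(self, node_id, label):
        """Return sorted IDs of all nodes with `label` reachable from `node_id`."""
        found, stack = [], list(self.edges[node_id])
        while stack:
            current = stack.pop()
            if self.nodes[current]["label"] == label:
                found.append(current)
            stack.extend(self.edges[current])
        return sorted(found)


# Hypothetical identifiers standing in for the unique identifiers the slide mentions.
g = Graph()
for node_id, label in [("P1", "project"), ("C1", "case"), ("C2", "case"),
                       ("D1", "clinical"), ("S1", "sample"), ("S2", "sample"),
                       ("F1", "file"), ("F2", "file")]:
    g.add_node(node_id, label)
for parent, child in [("P1", "C1"), ("P1", "C2"), ("C1", "D1"),
                      ("C1", "S1"), ("C2", "S2"), ("S1", "F1"), ("S2", "F2")]:
    g.add_edge(parent, child)

print(g.descendants("P1", "file"))
print(g.descendants("C1", "file"))
```

Walking edges from a project or case down to file nodes is what lets a query such as "all files for this case" resolve to concrete file objects.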
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from the cart or through the high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify the desired transfer protocol and multiple files for download
  – Uses a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing
Example request (API URL, endpoint, URL parameters, query parameters):
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Example response (portion removed for readability):
"data": { "hits": [
  { "project_id": "TCGA-SKCM", "primary_site": "Skin" },
  { "project_id": "TCGA-PCPG", "primary_site": "Nervous System" },
  { "project_id": "TCGA-LAML", "primary_site": "Blood" },
  { "project_id": "TCGA-CNTL", "primary_site": "Not Applicable" },
  { "project_id": "TCGA-UVM", "primary_site": "Eye" },
  { "project_id": "TARGET-AML", "primary_site": "Blood" },
  { "project_id": "TCGA-SARC", "primary_site": "Mesenchymal" },
  { "project_id": "TCGA-LUSC", "primary_site": "Lung" },
  { "project_id": "TARGET-NBL", "primary_site": "Nervous System" },
  { "project_id": "TCGA-PAAD", "primary_site": "Pancreas" }
]
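The example request above can be assembled programmatically. The sketch below builds the same query URL and parses a response of the shape shown on the slide. The two-hit response string is an abbreviated stand-in for a real reply, and the helper names are hypothetical; no network call is made.

```python
import json
from urllib.parse import urlencode

API_ROOT = "https://gdc-api.nci.nih.gov"  # host as listed on the References slide


def projects_url(fields, pretty=True):
    """Build the /projects query URL, keeping commas between field names unescaped."""
    query = urlencode(
        {"fields": ",".join(fields), "pretty": str(pretty).lower()},
        safe=",",
    )
    return f"{API_ROOT}/projects?{query}"


def sites_by_project(response_text):
    """Map project_id to primary_site from a response with a data.hits list."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


url = projects_url(["project_id", "primary_site"])

# Abbreviated stand-in for a real response, using two hits from the slide.
SAMPLE_RESPONSE = json.dumps({"data": {"hits": [
    {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
    {"project_id": "TARGET-AML", "primary_site": "Blood"},
]}})
sites = sites_by_project(SAMPLE_RESPONSE)
print(url)
print(sites)
```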
GDC Web Site
• The GDC Web Site provides access to GDC resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about the GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User’s Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site – https://gdc.nci.nih.gov
• GDC Documentation Site – https://gdc-docs.nci.nih.gov
• GDC Data Portal – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype)
  – Aggregation
  – Access
  – Sharing
  – Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Figure: data flow diagram. Labels: birth defect or childhood cancer cohorts; DNA; sequencing center; birth defect BAM/VCF; cancer BAM/VCF; dbGaP; NCI GDC; birth defect VCF; cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users.]
[Figure: the GMKF Data Resource shown with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics.]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of the ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
ADSP Discovery Phase Analysis
• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing
Participation
• PARTICIPANTS
  – Two AD Genetics Consortia: Alzheimer’s Disease Genetics Consortium (ADGC) and Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  – Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  – NIH staff: NHGRI and NIA
  – External Consultants
• INFRASTRUCTURE
  – Executive Committee, Analysis Coordination Committee, and 8 Work Groups
Genomics Center
Organization of the ADSP
ADSP Infrastructure and Support
NIA Genetics of Alzheimer’s Disease Data Storage Site, NIAGADS [U24 AG041689]
National Cell Repository for Alzheimer’s Disease, NCRAD [U24 AG021886]
National Alzheimer’s Coordinating Center, NACC [U01 AG016976]
Alzheimer’s Disease Centers
NIA Center for Genetics and Genomics of Alzheimer’s Disease, CGAD [U54 AG052427]
NIAGADS History and Function
Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
NIA’s repository for the genetics of late-onset Alzheimer’s disease data
Datasets, Genomics Database, Analysis Resource
Compliance with the NIA AD Genetics of Alzheimer’s Disease Sharing Policy and the NIH Genomics Data Sharing Policy
Data Coordinating Center for the Alzheimer’s Disease Sequencing Project (ADSP)
NIAGADS ADSP Data Coordinating Center
• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and the Office of Science Policy on “Trusted Partners”
Interface between dbGaP and the NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP
Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and eRA Commons accounts for authentication; no password is stored at the ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data are stored at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH side: after ADSP DAC approval, the investigator uses NIH authentication
• Community side: fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
• Data production: define data formatting; collect and curate phenotypes; track data production; additional quality metrics; coordinate public data release
• Secondary/derived data management: documentation (README/Manifest), packaging, and release; notification and tracking
• IT support: website; members area; face-to-face meeting support; dbGaP exchange area
• Facilitate interactions with other AD investigators
• Rapid response to unforeseen ADSP and NIH requests
Interactions
[Figure: interactions among NIAGADS, NIA/NHGRI, the Baylor/Broad/WashU LSACs, NCBI dbGaP/SRA, the ADSP Data Flow and other Work Groups, the Genomics Center, NCRAD, and ADGC/CHARGE.]
ADGC: Alzheimer’s Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer’s Disease
NIAGADS: NIA Genetics of Alzheimer’s Disease Data Storage Site
Management of Work Group Data/Work Flow
Exchange Area
Data Deposition and Data Sharing
ADSP-specific datasets: 68 TB
21 reference datasets with 5560 files in use by the ADSP
>20 reference datasets planned
1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes
A 1 TB hard drive has the capacity to hold a trillion bytes
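The conversion chain above uses binary (power-of-1024) units, whereas "a trillion bytes" is the decimal convention often used for drive capacities. A short sketch makes the two figures and the gap between them explicit; the function names are hypothetical.

```python
def binary_tb_to_bytes(terabytes):
    """Convert terabytes to bytes using 1 TB = 1024**4 bytes, as in the slide's chain."""
    return terabytes * 1024 ** 4


def decimal_tb_to_bytes(terabytes):
    """Convert terabytes to bytes using 1 TB = 10**12 bytes (a trillion)."""
    return terabytes * 10 ** 12


binary_one_tb = binary_tb_to_bytes(1)
decimal_one_tb = decimal_tb_to_bytes(1)
ratio = binary_one_tb / decimal_one_tb

print(binary_one_tb)    # 1099511627776
print(decimal_one_tb)   # 1000000000000
print(round(ratio, 4))  # binary TB is roughly 10% larger than decimal TB
```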
dbGaP/NIAGADS release for the research community at large
11555 BAM files released 12/2014
Pedigree and phenotype information
Sequencing Quality Control Metrics
Size of the ADSP
Reference: “The real cost of sequencing: scaling computation to keep pace with data generation,” Muir et al., Genome Biology 2016, 17:53
ADSP Members Area (Dashboard)
Calendar Documents Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list 153 records
352 meeting minutes
>600 other documents
>100 consent files
10 analysis plans
85TB WGS, 1075TB WES data (dbGaP/SRA)
42 TB Files (dbGaP Exchange Area)
niagads.org/genomics: web interface for query and analysis
SNP/Gene reports; Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer’s Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
[Figure: GWAS results (SNPs below a significance level) combined with gene annotations transformed to SNPs, yielding genomic features from the identified SNPs.]
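The combined search described above amounts to a set intersection: SNPs passing a GWAS significance threshold, intersected with SNPs derived from gene annotations. The sketch below shows that logic on made-up data. The SNP IDs, gene names, p-values, and the 5e-8 threshold are all illustrative assumptions, not NIAGADS data or its actual query interface.

```python
# Hypothetical GWAS p-values keyed by SNP ID.
GWAS_PVALUES = {"rs1": 3e-9, "rs2": 0.04, "rs3": 1e-8, "rs4": 5e-12}

# Hypothetical gene annotation "transformed to SNPs": gene name -> SNPs in that gene.
GENE_SNPS = {"GENE_A": {"rs1", "rs2"}, "GENE_B": {"rs4"}, "GENE_C": {"rs5"}}


def significant_snps(pvalues, threshold):
    """SNPs whose p-value is below the chosen significance level."""
    return {snp for snp, p in pvalues.items() if p < threshold}


def snps_in_genes(gene_snps, genes):
    """Union of SNPs annotated to any of the selected genes."""
    selected = set()
    for gene in genes:
        selected |= gene_snps.get(gene, set())
    return selected


hits = significant_snps(GWAS_PVALUES, threshold=5e-8)
annotated = snps_in_genes(GENE_SNPS, ["GENE_A", "GENE_B"])
features = sorted(hits & annotated)
print(features)
```

Here `rs3` is significant but falls outside the selected genes, and `rs2` lies in a selected gene but is not significant, so neither appears in the combined result.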
Find a genomic location of interest by viewing tracks on the genome browser
  – SNPs
  – genes
  – GWAS results
  – functional genomics data
View Genomic Information by the Genome Browser
[Figure: genome browser search result]
Data available at the NIAGADS Genomics Database
• Gene models
• SNP information
• Pathway annotations: Gene Ontology, KEGG Pathway
• Functional genomics data: ENCODE, FANTOM5
• GTEx results
• NHGRI GWAS catalog
• Published genome-wide AD GWAS summary statistics
Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  – Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  – Big data management
  – IT infrastructure and web development
• Administrative support and collaboration
  – Familiarity with human subjects research regulations and NIH policy
  – Leadership and coordinating skills
  – Flexibility; close collaboration with program officers
• Domain expertise
  – Knowledge of the disease and genetic research
  – Familiarity with the community and cohorts
  – Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
What the Data Coordinating Center Should Know: Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements
• Patients and families of those with AD
• NIA ADRC program and centers
• NACC
• NCRAD
• ADSP
  – Data flow WG
  – Annotation WG
• IGAP/ADGC/CHARGE/EADI/GERAD
NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly
NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon
NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao
GDC Overview: GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence data both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high-level data
GDC Overview Infrastructure
[Figure: GDC infrastructure diagram. Column headings: GDC Users, GDC System Components, GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; 3rd Party Applications; eRA Commons & dbGaP; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System; Data Access Tools; Data Submission Tools; APIs.]
GDC Overview Resources
157
GDC Data Portal
GDC Data Center
GDC Data Transfer Tool GDC Reports
GDC Data Model
GDC Data Submission
Portal
GDC ApplicatioProgrammingInterface (API)
n GDC Bioinformatics
Pipeline
GDC Documentation
and Support
GDC Organization
and Collaborators
GDC Organization and Collaborators
bull GDC Government Sponsor
NCI Center for Cancer Genomics (CCG)
University of Chicago Team
bull Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
bull GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
bull Contracting organization supporting GDC management and execution
Other Government External
bull Includes eRA dbGaP NCI CBIIT for data access and security compliance bull Includes the GDC Steering Committee Bioinformatics Advisory Group Data
Submitters and User Acceptance Testers (UAT) Testers 158
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
159
GDC Data Submission Data Submitter Types
bull The GDC supports two types of data submitters
ndash Type 1 Large Organizations bull Users associated with an institution or group with significant informatics resources bull Large one-time submission or long-term ongoing data submission bull Primarily use the GDC Application Programming Interfaces (API) or the GDC
Data Transfer Tool (command line interface) for data submission
ndash Type 2 Researchers or Individual Laboratories bull Users associated with a single group (such as a laboratory investigator or
researcher) with limited informatics resources bull One-time or sporadic uploads of low volumes of patient and analysis data with
varying levels of completeness bull Submit via the web-based GDC Data Submission Portal and the GDC Data
Transfer Tool that use the GDC API
160
GDC Data Submission Data Submission Policies
bull dbGaP Data Submission Policies ndash GDC data submitters must first apply for data submission authorization through dbGaP ndash Data submission through dbGaP requires institutional certification under NIHrsquos Genomi c
Data Sharing Policy bull GDC Data Sharing Policies
ndash Data Sharing Requirement bull Data submitted to the GDC will be made available to the scientific community according to the
data submitterrsquos NCI Genomic Data Sharing Plan Controlled access data will be made availabl e to members of the community having the appropriate dbGaP Data Use Certification
bull The GDC will produce harmonized data (raw and derived) based on the originally submitted data The GDC will not preserve an exact copy of the originally submitted data however the GDC will preserve the original reads and quality scores
ndash Data Pre-processing Period bull For each project the GDC will afford a pre-processing period of exclusive that allows for
submitters to perform data cleaning and quality and submission of revised data before release bull The pre-processing period may generally last up to six months from the date of first submission
ndash Data Submission Period and Release bull Once submitted data will be processed and validated by the GDC Submitted data will be
released and available via controlled access for research that is consistent with the datasetrsquos ldquodata use limitationsrdquo either six months after data submission or at the time of first publication
ndash Data Redaction bull The GDC in general will not remove data access in response to submitter requests GDC will
remove data access in the following events Data Management Incident Human Subjects 161
Compliance Issue Erroneous Data
GDC Data Submission Data Submission Process
162
GDC Data Submission dbGaP Registration
bull Data submitters register their studies in dbGaP submit the Subject Identifiers associated with the study in dbGaP and contact the GDC
bull The GDC verifies that the study has been registered in dbGaP verifies the data submitter credentials Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process httpwwwncbinlmnihgovprojectsgapcgi-binGetPdfcgidocument_name=HowToSubmitpdf
163
GDC Data Submission Upload and Validate Data
bull Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal GDC Data Transfer Tool or GDC Application Programming Interface (API)
bull Uploaded data must follow GDC supported data types and file formats which leverage and extend existing data standards for biospecimen clinical and experiment data
164
Data Types and File Formats (1 of 2)
bull The GDC provides a data dictionary and template files for each data type
165
Data Type Data Subtype Format Data Dictionary Template
Administrative Administrative Data TSV JSON Case TSV JSON
Biospecimen Biospecimen Data TSV JSON
Sample Portion Analyte Aliquot
Sample TSV JSON Portion TSV JSON Analyte TSV JSON Aliquot TSV JSON
Clinical Clinical Data TSV JSON
Demographic Diagnosis Exposure Family History Treatment
Demographic TSV JSON Diagnosis TSV JSON Exposure TSV JSON Family History TSV JSON Treatment TSV JSON
Data File
Analysis Metadata SRA XML MAGE-TAB (SDRF IDF) Analysis Metadata TSV JSON
Biospecimen Metadata BCR XML GDC-approved spreadsheet
Biospecimen Metadata TSV JSON
Clinical Metadata BCR XML GDC-approved spreadsheet Clinical Metadata TSV JSON
Data Types and File Formats (2 of 2)
Data Type Data Subtype Format Data Dictionary Template
Data File
Experimental Metadata Pathology Report Run Metadata Slide Image
Submitted Unaligned Reads
SRA XML
PDF SRA XML SVS
FASTQ BAM
Experimental Metadata Pathology Report
Submitted Unaligned Reads
TSV JSON
TSV JSON TSV JSON TSV JSON
TSV JSON
Submitted Aligned Reads BAM
Submitted Aligned Reads TSV JSON
Data Bundle Read Group
Slide
Read Group
Slide
TSV JSON
TSV JSON
166
Data Types and File Formats Biospecimen Data
bull Biospecimen data types may include samples aliquots analytes and portions
bull GDC supports the submission of biospecimen data in XML JSON or TSV file format
167
Data Types and File Formats Clinical Data
bull GDC clinical data types are associated with each case and include required preferred and optional clinical data elements for demographics diagnosis family history exposure and treatment ndash GDC clinical data elements were reviewed with members of the research community ndash Clinical data elements are defined in the GDC dictionary and registered in the NCIrsquos
Cancer Data Standards Repository (caDSR)
Case
Clinical Data
Demographics Diagnosis Family History(Optional) Exposure (Optional) Treatment
(Optional)
Biospecimen Data Experiment Data
168
Data Types and File Formats: Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC
Read Group Name               RG_1     RG_2     RG_3     RG_4
Total Sequences               502123   123456   658101   788965
Read Length                   75       75       75       75
GC Content (%)                49       50       48       50
Basic Statistics              PASS     PASS     PASS     PASS
Per base sequence quality     PASS     PASS     PASS     PASS
Per tile sequence quality     PASS     WARNING  PASS     PASS
Per sequence quality scores   FAIL     PASS     PASS     PASS
Per base sequence content     PASS     PASS     WARNING  PASS
Per sequence GC content       PASS     PASS     PASS     PASS
Per base N content            PASS     PASS     PASS     PASS
Sequence Length Distribution  PASS     PASS     PASS     PASS
Sequence Duplication Levels   WARNING  WARNING  PASS     PASS
Overrepresented sequences     PASS     PASS     PASS     PASS
Adapter Content               PASS     FAIL     PASS     PASS
Kmer Content                  PASS     PASS     WARNING  PASS
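The per-read-group QC table above can be summarized programmatically. The following is a minimal sketch, assuming the module statuses have already been transcribed into a plain Python dictionary; the `flagged_modules` helper is a hypothetical name and is not part of FastQC or the GDC tooling.

```python
# Statuses transcribed from the example table above.
# Only modules with at least one non-PASS entry are listed.
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]
QC_TABLE = {
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged_modules(table, read_groups):
    """Map each read group to the modules whose status is not PASS."""
    flagged = {rg: {} for rg in read_groups}
    for module, statuses in table.items():
        for rg, status in zip(read_groups, statuses):
            if status != "PASS":
                flagged[rg][module] = status
    return flagged


summary = flagged_modules(QC_TABLE, READ_GROUPS)
for rg in READ_GROUPS:
    print(rg, summary[rg] or "all listed modules PASS")
```

A summary of this kind makes it easy to see which read groups need attention before submission, for example the read groups with a FAIL entry.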
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
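As an illustration of the Create Entities endpoint, the sketch below builds (but does not send) a POST request with Python's standard library. The base URL is the API address listed on the References slide; the `X-Auth-Token` header name, the program and project names, and the entity fields are assumptions made for illustration and should be checked against the GDC API User's Guide.

```python
import json
import urllib.request

# Assumption: API base URL as listed on the References slide.
API_BASE = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST to the Create Entities endpoint."""
    url = f"{API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: the authentication token is passed in this header.
            "X-Auth-Token": token,
        },
    )


# Hypothetical program, project, and entity values, for illustration only.
request = build_create_request(
    "EXAMPLE",
    "PROJ1",
    [{"type": "case", "submitter_id": "EXAMPLE-PROJ1-0001"}],
    token="token-downloaded-from-the-gdc",
)
print(request.get_method(), request.full_url)
```

Sending the request would be a matter of passing it to `urllib.request.urlopen`; it is left out here so the sketch runs without network access or credentials.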
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure panels: DNA Sequence Pipeline; RNA Sequence Pipeline]
GDC Processing: GDC High Level Data Generation
• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
GDC Data Retrieval
• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
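To make the node-and-edge description concrete, here is a toy sketch of a graph that links a project to cases, clinical data, experiment data, and files by identifier. It is not the GDC implementation: the `DataGraph` class, the edge direction (parent to child), and all identifiers are hypothetical.

```python
from collections import defaultdict


class DataGraph:
    """Tiny directed graph: nodes carry a type, edges point from parent to child."""

    def __init__(self):
        self.node_type = {}
        self.children = defaultdict(list)

    def add_node(self, node_id, node_type):
        self.node_type[node_id] = node_type

    def add_edge(self, parent, child):
        self.children[parent].append(child)

    def descendants_of_type(self, start, wanted_type):
        """Walk edges from `start` and collect ids of nodes with `wanted_type`."""
        found, stack, seen = [], [start], {start}
        while stack:
            current = stack.pop()
            for child in self.children[current]:
                if child in seen:
                    continue
                seen.add(child)
                if self.node_type[child] == wanted_type:
                    found.append(child)
                stack.append(child)
        return sorted(found)


graph = DataGraph()
# Hypothetical identifiers; a real system would use unique identifiers such as UUIDs.
for node_id, node_type in [
    ("project-1", "project"), ("case-1", "case"), ("case-2", "case"),
    ("demographic-1", "clinical"), ("reads-1", "experiment"),
    ("reads-2", "experiment"), ("file-1", "file"), ("file-2", "file"),
]:
    graph.add_node(node_id, node_type)
for parent, child in [
    ("project-1", "case-1"), ("project-1", "case-2"),
    ("case-1", "demographic-1"), ("case-1", "reads-1"), ("reads-1", "file-1"),
    ("case-2", "reads-2"), ("reads-2", "file-2"),
]:
    graph.add_edge(parent, child)

print(graph.descendants_of_type("case-1", "file"))
print(graph.descendants_of_type("project-1", "file"))
```

Because every file is reached through the case it belongs to, a query that starts at a case returns only that case's files, while a query that starts at the project returns all of them.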
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from cart or a high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing
Example request (API URL, endpoint, and query parameters):
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Example response (portion removed for readability):
{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
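The request and response shown on this slide can be reproduced offline with a short sketch. It only assembles the query URL and parses a trimmed copy of the example response; no network call is made. The helper names `projects_url` and `sites_by_project` are hypothetical, and the sample response is cut down to three of the hits from the slide.

```python
import json
from urllib.parse import urlencode

API_BASE = "https://gdc-api.nci.nih.gov"  # API URL as shown on the slide


def projects_url(fields, pretty=True):
    """Assemble the projects endpoint URL with its query parameters."""
    query = urlencode(
        {"fields": ",".join(fields), "pretty": str(pretty).lower()},
        safe=",",  # keep the comma-separated field list readable
    )
    return f"{API_BASE}/projects?{query}"


def sites_by_project(response_text):
    """Map project_id to primary_site from a projects response."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


# Trimmed copy of the example response from the slide.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}
"""

url = projects_url(["project_id", "primary_site"])
sites = sites_by_project(SAMPLE_RESPONSE)
print(url)
print(sites)
```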
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site
  – https://gdc.nci.nih.gov
• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov
• GDC Data Portal
  – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data
  – Aggregation
  – Access
  – Sharing
  – Analysis
Whole genome sequence + Phenotype
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Figure: data flow diagram. Labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF; cancer BAM/VCF; dbGaP; NCI GDC; birth defect VCF; cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users]
[Figure: GMKF Data Resource shown with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and Center for Mendelian Genomics]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
Participation
• PARTICIPANTS
  – Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  – Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  – NIH staff: NHGRI and NIA
  – External Consultants
• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups
Genomics Center
Organization of the ADSP
ADSP Infrastructure and Support
NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]
National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]
National Alzheimer's Coordinating Center, NACC [U01 AG016976]
Alzheimer's Disease Centers
NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]
NIAGADS History and Function
Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
NIA's repository for the genetics of late-onset Alzheimer's disease data
Datasets, Genomics Database, Analysis Resource
Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy
Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)
NIAGADS ADSP Data Coordinating Center
• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and Office of Science Policy on "Trusted Partners"
Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP
Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
Data production: Define data formatting; Collect and curate phenotypes; Track data production; Additional quality metrics; Coordinate public data release
Secondary/derived data management: Documentation (README/Manifest), packaging and release; Notification and tracking
IT support: Website; Members area; Face-to-face meeting support; dbGaP exchange area
Facilitate interactions with other AD investigators
Rapid response to unforeseen ADSP and NIH requests
Interactions
[Figure: ADSP Data Flow and other Work Groups, with NIAGADS, NIA/NHGRI, Baylor/Broad/WashU LSACs, NCBI dbGaP/SRA, NCRAD, and ADGC/CHARGE]
ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site
Management of Work Group Data/Work Flow
[Figure: Exchange Area]
Data Deposition and Data Sharing
ADSP specific datasets: 68 Tb
21 Reference datasets with 5560 files in use by the ADSP
>20 Reference datasets planned
1 TB = 1024 GB = 1048576 MB = 1073741824 KB = 1099511627776 bytes. A 1 TB hard drive has the capacity to hold a trillion bytes.
dbGaP/NIAGADS release for the research community at large
11555 BAM files released 122014
Pedigree and phenotype information
Sequencing Quality Control Metrics
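The unit chain quoted on this slide (1 TB = 1024 GB = ... = 1099511627776 bytes) uses multiples of 1024, while "a trillion bytes" corresponds to the decimal convention of multiples of 1000. The short check below works through both conventions; the `bytes_in` helper is a hypothetical name used only for this illustration.

```python
UNITS = ["KB", "MB", "GB", "TB"]


def bytes_in(unit, base=1024):
    """Number of bytes in one `unit`, using multiples of `base`."""
    return base ** (UNITS.index(unit) + 1)


binary_tb = bytes_in("TB")               # 1024 ** 4
decimal_tb = bytes_in("TB", base=1000)   # 1000 ** 4, one trillion

print("1 TB (multiples of 1024):", binary_tb, "bytes")
print("  in GB:", binary_tb // bytes_in("GB"))
print("  in MB:", binary_tb // bytes_in("MB"))
print("  in KB:", binary_tb // bytes_in("KB"))
print("1 TB (multiples of 1000):", decimal_tb, "bytes")
print("ratio:", round(binary_tb / decimal_tb, 4))
```

The two conventions differ by roughly 10 percent at the terabyte scale, which is worth keeping in mind when comparing dataset sizes against advertised storage capacity.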
Size of the ADSP
Reference: "The real cost of sequencing: scaling computation to keep pace with data generation", Muir et al., Genome Biology 2016, 17:53
ADSP Members Area (Dashboard)
Calendar, Documents, Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list: 153 records
352 meeting minutes
>600 other documents
>100 consent files
10 analysis plans
85TB WGS, 1075TB WES data (dbGaP/SRA)
42 TB Files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
SNP/Gene reports; Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer's Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
[Figure: GWAS results combined with SNPs below; gene annotations transformed to SNPs; genomic features from identified SNPs]
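The idea of combining a GWAS significance search with gene annotation can be sketched in a few lines. This is a toy illustration and not the NIAGADS Genomics Database implementation: the record types, the `genes_with_significant_snps` function, the SNP and gene names, the coordinates, and the p-values are all made up, and the default threshold of 5e-8 is the conventional genome-wide significance level rather than a value taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snp:
    snp_id: str
    chrom: str
    pos: int
    p_value: float


@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    start: int
    end: int


def genes_with_significant_snps(snps, genes, threshold=5e-8):
    """Return {gene name: [ids of significant SNPs inside the gene]}."""
    hits = {}
    for snp in snps:
        if snp.p_value >= threshold:
            continue  # keep only SNPs below the significance threshold
        for gene in genes:
            if gene.chrom == snp.chrom and gene.start <= snp.pos <= gene.end:
                hits.setdefault(gene.name, []).append(snp.snp_id)
    return hits


# Toy data, invented for illustration only.
snps = [
    Snp("snp_a", "chr1", 150, 1e-9),
    Snp("snp_b", "chr1", 900, 0.2),
    Snp("snp_c", "chr2", 40, 3e-10),
]
genes = [
    Gene("GENE_X", "chr1", 100, 200),
    Gene("GENE_Y", "chr1", 800, 1000),
    Gene("GENE_Z", "chr2", 10, 50),
]

result = genes_with_significant_snps(snps, genes)
print(result)
```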
Find a genomic location of interest by viewing tracks on the genome browser
[Figure: genome browser tracks for SNPs, genes, GWAS results, and functional genomics data]
View Genomic Information by the Genome Browser
[Figure: search result]
Data available at the NIAGADS Genomics Database
Gene models, SNP information
Pathway annotations: Gene Ontology, KEGG Pathway
Functional genomics data: ENCODE, FANTOM5, GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics
Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  – Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  – Big data management
  – IT infrastructure and web development
• Administrative support and collaboration
  – Familiarity with human subjects research regulations and NIH policy
  – Leadership and coordinating skills
  – Flexibility, close collaboration with program officers
• Domain expertise
  – Knowledge of the disease and genetic research
  – Familiarity with the community and cohorts
  – Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
What the Data Coordinating Center Should Know
Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements
Patients and families of those with AD
NIA ADRC program and centers
NACC, NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD
NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly
NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon
NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao
GDC Overview: GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence data from both existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high level data
GDC Overview: Infrastructure
[Figure: GDC infrastructure diagram with three groupings: GDC Users, GDC System Components, and GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; 3rd Party Applications; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; APIs; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System]
GDC Overview: Resources
[Figure: GDC resources. Labels: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators]
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford an exclusive pre-processing period that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
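The release timing described above (six months after data submission or at the time of first publication) can be worked through with a small date calculation. The sketch below assumes that whichever event comes first triggers release, which the slide does not state explicitly, and it treats "six months" as six calendar months; the function names are hypothetical.

```python
import calendar
from datetime import date
from typing import Optional


def add_months(start: date, months: int) -> date:
    """Return the date `months` calendar months after `start`, clamping the day."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def release_date(submitted: date, first_publication: Optional[date] = None) -> date:
    """Earlier of six months after submission and first publication (assumed rule)."""
    deadline = add_months(submitted, 6)
    if first_publication is not None and first_publication < deadline:
        return first_publication
    return deadline


# Hypothetical dates for illustration.
print(release_date(date(2016, 4, 15)))
print(release_date(date(2016, 4, 15), first_publication=date(2016, 6, 1)))
print(release_date(date(2016, 8, 31)))
```

The day is clamped to the end of the month so that a submission on August 31 maps to the last day of February rather than raising an error.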
GDC Data Submission: Data Submission Process
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type
Columns: Data Type; Data Subtype; Format; Data Dictionary; Template
Administrative – Data Subtype: Administrative Data; Format: TSV, JSON; Data Dictionary: Case; Template: TSV, JSON
Biospecimen – Data Subtype: Biospecimen Data; Format: TSV, JSON; Data Dictionary: Sample, Portion, Analyte, Aliquot; Template: TSV, JSON for each
Clinical – Data Subtype: Clinical Data; Format: TSV, JSON; Data Dictionary: Demographic, Diagnosis, Exposure, Family History, Treatment
Demographic TSV JSON Diagnosis TSV JSON Exposure TSV JSON Family History TSV JSON Treatment TSV JSON
Data File
Analysis Metadata SRA XML MAGE-TAB (SDRF IDF) Analysis Metadata TSV JSON
Biospecimen Metadata BCR XML GDC-approved spreadsheet
Biospecimen Metadata TSV JSON
Clinical Metadata BCR XML GDC-approved spreadsheet Clinical Metadata TSV JSON
Data Types and File Formats (2 of 2)
Data Type Data Subtype Format Data Dictionary Template
Data File
Experimental Metadata Pathology Report Run Metadata Slide Image
Submitted Unaligned Reads
SRA XML
PDF SRA XML SVS
FASTQ BAM
Experimental Metadata Pathology Report
Submitted Unaligned Reads
TSV JSON
TSV JSON TSV JSON TSV JSON
TSV JSON
Submitted Aligned Reads BAM
Submitted Aligned Reads TSV JSON
Data Bundle Read Group
Slide
Read Group
Slide
TSV JSON
TSV JSON
166
Data Types and File Formats Biospecimen Data
bull Biospecimen data types may include samples aliquots analytes and portions
bull GDC supports the submission of biospecimen data in XML JSON or TSV file format
167
Data Types and File Formats Clinical Data
bull GDC clinical data types are associated with each case and include required preferred and optional clinical data elements for demographics diagnosis family history exposure and treatment ndash GDC clinical data elements were reviewed with members of the research community ndash Clinical data elements are defined in the GDC dictionary and registered in the NCIrsquos
Cancer Data Standards Repository (caDSR)
Case
Clinical Data
Demographics Diagnosis Family History(Optional) Exposure (Optional) Treatment
(Optional)
Biospecimen Data Experiment Data
168
Data Types and File Formats Experiment Data
bull GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
bull GDC performs and reports on Quality Control (QC) metrics generated by FASTQC
Read Group Name RG_1 RG_2 RG_3 RG_4 Total Sequences 502123 123456 658101 788965 Read Length 75 75 75 75 GC Content () 49 50 48 50 Basic Statistics PASS PASS PASS PASS
Per base sequence quality PASS PASS PASS PASS
Per tile sequence quality PASS WARNING PASS PASS
Per sequence quality scores FAIL PASS PASS PASS
Per base sequence content PASS PASS WARNING PASS
Per sequence GC content PASS PASS PASS PASS
Per base N content PASS PASS PASS PASS Sequence Length
Distribution PASS PASS PASS PASS Sequence
Duplication Levels WARNING WARNING PASS PASS Overrepresented
sequences PASS PASS PASS PASS Adapter Content PASS FAIL PASS PASS 169Kmer Content PASS PASS WARNING PASS
Data Submission Tools GDC Data Submission Portal
bull The GDC Data Submission Portal is a web-based data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
Upload biospecimen and clinical data and experiment metadata using user friendly web-based tools
Validate data against GDC standard data types defined in the project data dictionary
Obtain information on the status of data submission and processing by project
170
Data Submission Tools GDC Data Transfer Tool
bull The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
Command line interface to specify desired transfer protocol
Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
Supports the secure upload of controlled access data using an authentication key (token)
171
Data Submission Tools GDC Submission API
bull The GDC Submission API is a REST based programmatic interface that supports the programmatic submission of cancer data sets for analysis
Securely submit biospecimen clinical and experiment files using a token
Submit data to GDC by performing create update and retrieve actions on entities
Search for submitted data using GraphQL an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint POST v0submissionltprogramgtltprojectgt 172
GDC Data Submission Submit and Release Data
bull After validation data is submitted to the GDC for processing including data harmonization and high level data generation for applicable data
bull After data processing has been completed the user can release their data to the GDC which must occur within six (6) months of submission per GDC Data Submission Policies
bull Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
173
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
174
GDC Processing Harmonization
bull GDC pipelines supporting the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
175
GDC Processing DNA and RNA Sequence Harmonization Pipelines
176
DNA Sequence Pipeline RNA Sequence Pipeline
GDC Processing GDC High Level Data Generation
bull The GDC generates high level data including DNA-seq derived germline variants and somatic mutations RNA-seq and miRNA-seq derived gene and miRNA quatifications and SNP Array based copy number segmentations
bull The GDC implements multiple pipelines for generating somatic variants such as the Baylor Broad and WashU pipelines
177
GDC Processing GDC Variant Calling Pipelines
178
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
179
GDC Data Retrieval
bull Once data is submitted into GDC and the data submitter signs-off on the data to request data release GDC performs Quality Control (QC) and data processing
bull Upon successful GDC QC and processing the data is released based on the appropriate dbGaP data restrictions
bull Released data is made available for query and download to authorized users via the GDC Data Portal the GDC Data Transfer Tool and the GDC API
bull Data queries are based on the GDC Data Model
180
GDC Data Retrieval GDC Data Model
bull The GDC data model is represented as a graph with nodes and edges and this graph is the store of record for the GDC
bull The GDC data model maintains the critical relationship between projects cases clinical data and experiment data and insures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
181
GDC Data Retrieval GDC Data Portal
bull The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
Data browsing by project file case or annotation
Visualization allowing users to perform fine-grained filtering of search results
Data search using advanced smart search technology
Data selection into a personalized cart
Data download from cart or a high-performance Data Transfer Tool
182
GDC Data Retrieval GDC Data Transfer Tool
bull The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
Command line interface to specify desired transfer protocol and multiple files for download
Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
Supports the secure transfer of controlled access data using an authentication key (token)
183
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  - Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  - Securely retrieve biospecimen, clinical, and molecular files
  - Perform BAM slicing
Example request (API URL, endpoint, URL parameters, query parameters):
  https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Example response (portion removed for readability):
  "data": { "hits": [
    { "project_id": "TCGA-SKCM", "primary_site": "Skin" },
    { "project_id": "TCGA-PCPG", "primary_site": "Nervous System" },
    { "project_id": "TCGA-LAML", "primary_site": "Blood" },
    { "project_id": "TCGA-CNTL", "primary_site": "Not Applicable" },
    { "project_id": "TCGA-UVM", "primary_site": "Eye" },
    { "project_id": "TARGET-AML", "primary_site": "Blood" },
    { "project_id": "TCGA-SARC", "primary_site": "Mesenchymal" },
    { "project_id": "TCGA-LUSC", "primary_site": "Lung" },
    { "project_id": "TARGET-NBL", "primary_site": "Nervous System" },
    { "project_id": "TCGA-PAAD", "primary_site": "Pancreas" }
  ] }
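The request and response on this slide can be sketched in Python as follows. This is a minimal, offline example: the base URL is the one printed on the slide (it may have changed since April 2016), the helper names are hypothetical, and the sample response is a trimmed copy of the slide's output rather than a live result.

```python
import json
from urllib.parse import urlencode

# Base URL as shown on the slide; treat it as an assumption for current use.
API_BASE = "https://gdc-api.nci.nih.gov"


def build_query_url(endpoint, fields, pretty=True):
    """Build a URL such as .../projects?fields=project_id,primary_site&pretty=true."""
    params = {"fields": ",".join(fields), "pretty": "true" if pretty else "false"}
    # safe="," keeps the comma-separated field list unescaped.
    return f"{API_BASE}/{endpoint}?{urlencode(params, safe=',')}"


def extract_hits(response_text):
    """Return the list stored under data -> hits in a JSON response body."""
    return json.loads(response_text)["data"]["hits"]


# Trimmed copy of the response shown on the slide.
SAMPLE_RESPONSE = json.dumps({
    "data": {"hits": [
        {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
        {"project_id": "TCGA-LAML", "primary_site": "Blood"},
    ]}
})

if __name__ == "__main__":
    print(build_query_url("projects", ["project_id", "primary_site"]))
    for hit in extract_hits(SAMPLE_RESPONSE):
        print(hit["project_id"], hit["primary_site"])
```

To run it against a live service, the URL could be passed to `urllib.request.urlopen` and the decoded body handed to `extract_hits`.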
GDC Web Site
• The GDC Web Site provides access to GDC resources that support and communicate the GDC mission
  - Targets data consumers, providers, developers, and general users
  - Provides access to information about the GDC and contributed cancer genomic data sets
  - Instructs users on the use of GDC data access and submission tools
  - Provides descriptions of GDC bioinformatics pipelines
  - Documents supported GDC data types and file formats
  - Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to the GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site - https://gdc.nci.nih.gov
• GDC Documentation Site - https://gdc-docs.nci.nih.gov
• GDC Data Portal - https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  - https://gdc-portal.nci.nih.gov/submission
  - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool - https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  - https://gdc-api.nci.nih.gov
• GDC Support
  - GDC Assistance: support@nci-gdc.datacommons.io
  - GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype):
  - Aggregation
  - Access
  - Sharing
  - Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Figure: data flow among birth defect and childhood cancer cohorts, the DNA sequencing center, dbGaP and the NCI GDC (birth defect and cancer BAM/VCF files), the Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries), and users]
[Figure: the GMKF Data Resource and related resources: dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, Center for Mendelian Genomics]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
Genomics Center
Organization of the ADSP
ADSP Infrastructure and Support
NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]
National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]
National Alzheimer's Coordinating Center, NACC [U01 AG016976]
Alzheimer's Disease Centers
NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]
NIAGADS History and Function
Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
NIA's repository for genetic data on late-onset Alzheimer's disease
Datasets, Genomics Database, Analysis Resource
Compliance with the NIA Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy
Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)
NIAGADS ADSP Data Coordinating Center
• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and the Office of Science Policy on "Trusted Partners"
Interface between dbGaP and the NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP
Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and an eRA Commons account for authentication; no password is stored at the ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
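One way to picture the nightly synchronization described above is a local index that is replaced wholesale on each update and consulted when deciding what a user may see. The class and method names below are hypothetical and the slides do not describe the portal's implementation; this is only a sketch of the idea.

```python
from dataclasses import dataclass, field


@dataclass
class PortalIndex:
    """Local copy of the lists received from dbGaP (hypothetical structure)."""
    approved_users: set = field(default_factory=set)
    file_ids: set = field(default_factory=set)

    def nightly_update(self, users, files):
        """Replace both lists wholesale so that revoked users drop out."""
        self.approved_users = set(users)
        self.file_ids = set(files)

    def may_see(self, user, file_id):
        """A user may see a file only if both appear in the latest lists."""
        return user in self.approved_users and file_id in self.file_ids


if __name__ == "__main__":
    index = PortalIndex()
    index.nightly_update(users=["alice"], files=["file-1"])
    print(index.may_see("alice", "file-1"), index.may_see("bob", "file-1"))
```

Replacing the lists rather than merging them means an approval that dbGaP has withdrawn disappears from the portal at the next update. The portal stores no passwords in this picture, since authentication is handled elsewhere.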
ADSP Collaboration with dbGaP
• NIH side: after ADSP DAC approval, the investigator uses NIH authentication
• Community side: fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
Data production
  - Define data formatting
  - Collect and curate phenotypes
  - Track data production
  - Additional quality metrics
  - Coordinate public data release
Secondary/derived data management
  - Documentation (README/Manifest), packaging, and release
  - Notification and tracking
IT support
  - Website
  - Members area
  - Face-to-face meeting support
  - dbGaP exchange area
Facilitate interactions with other AD investigators
Rapid response to unforeseen ADSP and NIH requests
Interactions
[Figure: interactions among NIAGADS, NIA/NHGRI, the Baylor/Broad/WashU LSACs, NCBI dbGaP/SRA, the ADSP Data Flow and other Work Groups, the Genomics Center, NCRAD, and ADGC/CHARGE]
ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site
Management of Work Group Data/Work Flow
Exchange Area
Data Deposition and Data Sharing
ADSP-specific datasets: 68 TB
21 reference datasets with 5560 files in use by the ADSP
>20 reference datasets planned
1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes. A 1 TB hard drive has the capacity to hold a trillion bytes
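The unit arithmetic on this slide mixes two conventions: the chain of 1024s gives the binary terabyte, while "a trillion bytes" is the decimal one. A short sketch makes the difference explicit; the function names are illustrative only.

```python
BINARY_TB = 1024 ** 4   # 1,099,511,627,776 bytes, as in the slide's chain of 1024s
DECIMAL_TB = 10 ** 12   # "a trillion bytes", the figure drive vendors usually quote


def tb_to_bytes(tb, binary=True):
    """Convert terabytes to bytes under the chosen convention."""
    return tb * (BINARY_TB if binary else DECIMAL_TB)


def binary_vs_decimal_gap():
    """Fractional amount by which the binary terabyte exceeds the decimal one."""
    return (BINARY_TB - DECIMAL_TB) / DECIMAL_TB


if __name__ == "__main__":
    print(tb_to_bytes(1), tb_to_bytes(1, binary=False))
    print(f"{binary_vs_decimal_gap():.2%}")
```

The two definitions differ by just under 10 percent, which becomes noticeable when planning storage for tens of terabytes.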
dbGaP/NIAGADS release for the research community at large
11,555 BAM files released 12/2014
Pedigree and phenotype information
Sequencing Quality Control Metrics
Size of the ADSP
Reference: "The real cost of sequencing: scaling computation to keep pace with data generation", Muir et al., Genome Biology 2016, 17:53
ADSP Members Area (Dashboard)
Calendar Documents Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list 153 records
352 meeting minutes
>600 other documents
>100 consent files
10 analysis plans
85TB WGS, 1075TB WES data (dbGaP/SRA)
42 TB files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
SNP/Gene reports, Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer's Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
GWAS results combined with
SNPs below
Gene annotations transformed to SNPs
Genomic features from identified SNPs
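The combined search described above (GWAS results filtered by significance, gene annotations transformed to SNPs, then intersected) can be sketched with plain sets. The SNP identifiers, gene name, p-values, and the default threshold of 5e-8 are made-up placeholders, and the functions do not reflect how the NIAGADS Genomics Database is implemented.

```python
def significant_snps(gwas, threshold=5e-8):
    """SNP IDs whose GWAS p-value is at or below the threshold."""
    return {snp for snp, p_value in gwas.items() if p_value <= threshold}


def snps_in_genes(gene_to_snps, genes):
    """Transform a gene selection into the set of SNPs annotated to those genes."""
    result = set()
    for gene in genes:
        result |= set(gene_to_snps.get(gene, ()))
    return result


def combine(gwas, gene_to_snps, genes, threshold=5e-8):
    """SNPs that are both GWAS-significant and annotated to a selected gene."""
    return significant_snps(gwas, threshold) & snps_in_genes(gene_to_snps, genes)


# Placeholder data for illustration only.
GWAS = {"rsA": 1e-9, "rsB": 0.02, "rsC": 3e-8}
GENE_TO_SNPS = {"GENE1": ["rsA", "rsB"], "GENE2": ["rsC"]}

if __name__ == "__main__":
    print(sorted(combine(GWAS, GENE_TO_SNPS, ["GENE1"])))
```

Expressing each search as a set keeps the "combine" step a single intersection, which mirrors the way the slide describes results from one search feeding another.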
Find a genomic location of interest by viewing tracks on the genome browser
SNPs
genes
GWAS results
functional genomics data
View Genomic Information by the Genome Browser
search result
Data available at the NIAGADS Genomics Database
Gene models, SNP information
Pathway annotations: Gene Ontology, KEGG Pathway
Functional genomics data: ENCODE, FANTOM5, GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics
Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  - Expertise in genomics, next-generation sequencing, method and software tool development, and high-performance computing
  - Big data management
  - IT infrastructure and web development
• Administrative support and collaboration
  - Familiarity with human subjects research regulations and NIH policy
  - Leadership and coordinating skills
  - Flexibility; close collaboration with program officers
• Domain expertise
  - Knowledge of the disease and genetic research
  - Familiarity with the community and cohorts
  - Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
What the Data Coordinating Center Should Know
Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements
Patients and families of those with AD
NIA ADRC program and centers
NACC
NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD
NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly
NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon, Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao
NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
GDC Overview: GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
  - Enables the identification of both high- and low-frequency cancer drivers
  - Assists in defining genomic determinants of response to therapy
  - Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  - Harmonization of raw sequence data, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  - Application of state-of-the-art methods of generating high level data
GDC Overview: Infrastructure
[Figure: GDC infrastructure diagram with three column headings (GDC Users, GDC System Components, GDC Interfaces). Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; APIs; 3rd Party Applications; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System]
GDC Overview: Resources
[Figure: GDC resources: GDC Data Center, GDC Data Portal, GDC Data Transfer Tool, GDC Reports, GDC Data Model, GDC Data Submission Portal, GDC Application Programming Interface (API), GDC Bioinformatics Pipeline, GDC Documentation and Support, GDC Organization and Collaborators]
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testing (UAT) testers
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
  - Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  - Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  - GDC data submitters must first apply for data submission authorization through dbGaP
  - Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  - Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  - Data Pre-processing Period
    • For each project, the GDC will afford an exclusive pre-processing period that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  - Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  - Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
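The release timing in the policy above ("either six months after data submission or at the time of first publication") can be worked through with a small date calculation. The sketch assumes that release happens at whichever of the two comes first, which the slide does not state explicitly, and the function names are illustrative.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Add calendar months, clamping the day to the end of the target month."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def release_date(first_submission, first_publication=None):
    """Earlier of six months after submission and first publication (assumed reading)."""
    deadline = add_months(first_submission, 6)
    if first_publication is not None and first_publication < deadline:
        return first_publication
    return deadline


if __name__ == "__main__":
    print(release_date(date(2016, 4, 1)))
    print(release_date(date(2016, 4, 1), first_publication=date(2016, 6, 15)))
```

Clamping the day matters for submissions late in a month: six months after August 31 falls on the last day of February.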
GDC Data Submission: Data Submission Process
[Figure: data submission process]
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter's credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload and validate their data in a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | TSV, JSON for each
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | TSV, JSON for each
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• The GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  - GDC clinical data elements were reviewed with members of the research community
  - Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
[Figure: a Case with its Clinical Data (Demographics, Diagnosis, Family History (optional), Exposure (optional), Treatment (optional)), Biospecimen Data, and Experiment Data]
Data Types and File Formats: Experiment Data
• The GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• The GDC performs and reports on Quality Control (QC) metrics generated by FastQC
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
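A QC report like the one on this slide is easier to act on when the non-passing modules are grouped by read group. The sketch below uses only the rows of the slide's example that contain a WARNING or FAIL; the function names are illustrative, and nothing here reflects how the GDC itself post-processes FastQC output.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Rows from the slide's example that contain at least one non-PASS status.
QC = {
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged(qc, groups):
    """Map each read group to its (module, status) pairs that are not PASS."""
    result = {group: [] for group in groups}
    for module, statuses in qc.items():
        for group, status in zip(groups, statuses):
            if status != "PASS":
                result[group].append((module, status))
    return result


def groups_with_failures(qc, groups):
    """Read groups that have at least one FAIL."""
    report = flagged(qc, groups)
    return [g for g in groups if any(s == "FAIL" for _, s in report[g])]


if __name__ == "__main__":
    for group, issues in flagged(QC, READ_GROUPS).items():
        print(group, issues)
    print(groups_with_failures(QC, READ_GROUPS))
```

Separating failures from warnings lets a submitter decide which read groups need to be regenerated and which only need a closer look.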
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  - Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  - Validate data against GDC standard data types defined in the project data dictionary
  - Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  - Command line interface to specify the desired transfer protocol
  - Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  - Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  - Securely submit biospecimen, clinical, and experiment files using a token
  - Submit data to the GDC by performing create, update, and retrieve actions on entities
  - Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
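The create-entities endpoint above can be illustrated by constructing, without sending, a POST request. This is a sketch under several assumptions: the base URL is the one listed on the References slide, the `X-Auth-Token` header name and the entity payload fields are not given in the slides and are used here only for illustration, and the program and project names are placeholders.

```python
import json
from urllib.request import Request

# Base URL from the References slide; it may have changed since April 2016.
API_BASE = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entity, token):
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{API_BASE}/v0/submission/{program}/{project}"
    return Request(
        url,
        data=json.dumps(entity).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the token
        },
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical entity; real payloads must follow the GDC Data Dictionary.
    entity = {"type": "case", "submitter_id": "EXAMPLE-CASE-1"}
    request = build_create_request("PROGRAM", "PROJECT", entity, token="not-a-real-token")
    print(request.get_method(), request.full_url)
```

Building the request separately from sending it makes it easy to inspect the URL and body before any data leaves the submitter's machine.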
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access, per the dbGaP authorization policies associated with the data set
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]
GDC Processing: GDC High Level Data Generation
• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
[Figure: GDC variant calling pipelines]
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Retrieval
• Once data is submitted to the GDC and the data submitter signs off on the data to request data release, the GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationships among projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects by means of unique identifiers
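To make the graph description concrete, the sketch below models projects, cases, and files as nodes keyed by unique identifiers, with edges linking them. It is a toy in-memory structure with hypothetical class and method names, not the GDC's actual data store or schema.

```python
import uuid
from collections import defaultdict


class Graph:
    """Tiny undirected graph whose nodes are keyed by generated UUIDs."""

    def __init__(self):
        self.nodes = {}                # node id -> (label, properties)
        self.edges = defaultdict(set)  # node id -> ids of linked nodes

    def add_node(self, label, **properties):
        node_id = str(uuid.uuid4())
        self.nodes[node_id] = (label, properties)
        return node_id

    def link(self, first, second):
        self.edges[first].add(second)
        self.edges[second].add(first)

    def reachable(self, start, label):
        """IDs of nodes with the given label that can be reached from start."""
        seen, stack, found = {start}, [start], []
        while stack:
            current = stack.pop()
            if self.nodes[current][0] == label:
                found.append(current)
            for neighbor in self.edges[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        return found


if __name__ == "__main__":
    graph = Graph()
    project = graph.add_node("project", name="EXAMPLE-PROJECT")
    case = graph.add_node("case", submitter_id="EXAMPLE-CASE-1")
    data_file = graph.add_node("file", file_name="example.bam")
    graph.link(project, case)
    graph.link(case, data_file)
    print(graph.reachable(project, "file") == [data_file])
```

Because every node carries its own identifier, a file can be traced back to its case and project, and a project can be expanded to its files, by walking edges rather than by matching names.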
ADSP Infrastructure and Support
NIA Genetics of Alzheimerrsquos Disease Data Storage Site NIAGADS [U24 AG041689]
National Cell Repository for Alzheimerrsquos Disease NCRAD [U24 AG021886]
National Alzheimerrsquos Coordinating Center NACC [U01 AG016976]
Alzheimerrsquos Disease Centers
NIA Center for Genetics and Genomics of Alzheimerrsquos Disease CGAD [U54 AG052427]
NIAGADS History and Function
Planning initiated 2010 FOA released 2011 U24 AG041689 funded under PAR 11-175 April 2012
NIArsquos repository for the genetics of late-onset Alzheimers disease data
Datasets Genomics Database Analysis Resource
Compliance with the NIA AD Genetics of Alzheimerrsquos Disease Sharing Policy and the NIH Genomics Data Sharing Policy
Data Coordinating Center for the Alzheimerrsquos Disease Sequencing Project (ADSP)
NIAGADS ADSP Data Coordinating Center
bull Host the sample information and study plan
bull Track progress of sequencing
bull Track samples
bull Prepare amp maintain data
bull Schedule data releases for the Study at dbGaP
bull Coordinate the flow of sequence data among
sequencing centers the consortia and dbGaP
bull Host ADSP website and ADSP data portal
bull Manage files and datasets for ADSP work groups
bull Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and Office of Science Policy on ldquoTrusted Partnersrdquo Interface between dbGaP and NIAGADS ADSP
portal bull ADSP data are deposited at dbGaP bull Investigator submits application for ADSP data to
dbGaP bull dbGaP implements user authentication via NIH iTrust
and data access control bull ADSP Data Access Committee review bull Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP
Benefits bull Secondary analysis data returned to NIAGADS bull dbGaP has access to secondary analysis data
via the ADSP portal bull AD research community can customize data
presentation to allow data browsing and display through NIAGADS
bull Augments the capacity of dbGaP to work with specific user communities
bull Example for other user communities
Access to ADSP Data
January 25 2015 NIH Genomics Data Sharing policy (GDS)
bull ADSP Data Access Committee initiated with the launch of the GDS
bull dbGaP application process bull IRB and Institutional Certification
documents
Information specific to NIAGADS review by Data Use Committee bull NIA Genomic Data Sharing Plan bull NIAGADS Data Distribution Plan bull DerivedSecondary Return Plan Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as meta-data
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
Data production
• Define data formatting
• Collect and curate phenotypes
• Track data production
• Additional quality metrics
• Coordinate public data release
Secondary/derived data management
• Documentation (README/Manifest), packaging, and release
• Notification and tracking
IT support
• Website
• Members area
• Face-to-face meeting support
• dbGaP exchange area
Facilitate interactions with other AD investigators
Rapid response to unforeseen ADSP and NIH requests
Interactions
[Diagram labels: NIAGADS; NIA/NHGRI; Baylor/Broad/WashU LSACs; NCBI dbGaP/SRA; ADSP Data Flow and other Work Groups; Genomics Center; NCRAD; ADGC/CHARGE]
ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site
Management of Work Group Data/Work Flow
[Diagram label: Exchange Area]
Data Deposition and Data Sharing
ADSP-specific datasets: 68 Tb
21 reference datasets with 5560 files in use by the ADSP
>20 reference datasets planned
1 TB = 1024 GB = 1048576 MB = 1073741824 KB = 1099511627776 bytes. A 1 TB hard drive has the capacity to hold roughly a trillion bytes
dbGaP/NIAGADS release for the research community at large
11555 BAM files released 12/2014
Pedigree and phenotype information
Sequencing Quality Control Metrics
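The unit footnote on the slide above can be checked with a few lines of Python. This is a minimal sketch that assumes the binary convention the footnote uses (1 TB = 1024 GB); the function name `expand_binary_tb` is illustrative and not part of any NIAGADS tool. It also computes how far the binary figure sits above a decimal trillion bytes, which is a little under 10%.

```python
def expand_binary_tb(terabytes: int = 1) -> dict:
    """Expand a size in binary terabytes into smaller binary units."""
    return {
        "GB": terabytes * 1024,
        "MB": terabytes * 1024 ** 2,
        "KB": terabytes * 1024 ** 3,
        "bytes": terabytes * 1024 ** 4,
    }


units = expand_binary_tb(1)
decimal_trillion = 10 ** 12

# Fraction by which a binary terabyte exceeds a decimal trillion bytes.
excess = units["bytes"] / decimal_trillion - 1

print(units)
print(f"excess over 10^12 bytes: {excess:.2%}")
```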
Size of the ADSP
Reference: "The real cost of sequencing: scaling computation to keep pace with data generation", Muir et al., Genome Biology 2016, 17:53
ADSP Members Area (Dashboard)
Calendar, Documents, Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list: 153 records
352 meeting minutes
>600 other documents
>100 consent files
10 analysis plans
85TB WGS, 1075TB WES data (dbGaP/SRA)
42 TB Files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
SNP/Gene reports, Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer's Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
[Figure: search workflow. Panel labels: GWAS results combined with SNPs below; Gene annotations transformed to SNPs; Genomic features from identified SNPs]
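The combined search described above (filter SNPs by GWAS significance, then intersect with gene annotations) can be sketched as follows. This is only an illustration of the idea, not the NIAGADS Genomics Database implementation: the record types, the default threshold of 5e-8, and all SNP and gene identifiers are made up for the example.

```python
from typing import NamedTuple


class Snp(NamedTuple):
    rsid: str
    chrom: str
    pos: int
    pvalue: float


class Gene(NamedTuple):
    symbol: str
    chrom: str
    start: int
    end: int


def significant_snps(snps, threshold=5e-8):
    """Keep SNPs whose GWAS p-value is below the chosen significance level."""
    return [s for s in snps if s.pvalue < threshold]


def genes_with_hits(snps, genes, threshold=5e-8):
    """Map each gene symbol to the significant SNPs located inside the gene."""
    hits = {}
    for snp in significant_snps(snps, threshold):
        for gene in genes:
            if gene.chrom == snp.chrom and gene.start <= snp.pos <= gene.end:
                hits.setdefault(gene.symbol, []).append(snp.rsid)
    return hits


# Hypothetical records, for illustration only.
demo_snps = [
    Snp("rs_demo1", "19", 150, 1e-9),   # significant, inside GENE_A
    Snp("rs_demo2", "19", 900, 1e-9),   # significant, outside any gene
    Snp("rs_demo3", "19", 160, 0.01),   # inside GENE_A, not significant
]
demo_genes = [Gene("GENE_A", "19", 100, 200), Gene("GENE_B", "1", 100, 200)]

demo_hits = genes_with_hits(demo_snps, demo_genes)
print(demo_hits)
```

A real query would run against indexed annotation tracks rather than nested loops; the nested loop is kept here only because it makes the intersection explicit.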
Find a genomic location of interest by viewing tracks on the genome browser
• SNPs
• Genes
• GWAS results
• Functional genomics data
View Genomic Information by the Genome Browser
[Figure: genome browser view of a search result]
Data available at the NIAGADS Genomics Database
• Gene models, SNP information
• Pathway annotations: Gene Ontology, KEGG Pathway
• Functional genomics data: ENCODE, FANTOM5, GTEx results
• NHGRI GWAS catalog
• Published genome-wide AD GWAS summary statistics
Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  – Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  – Big data management
  – IT infrastructure and web development
• Administrative support and collaboration
  – Familiarity with human subjects research regulations and NIH policy
  – Leadership and coordinating skills
  – Flexibility, close collaboration with program officers
• Domain expertise
  – Knowledge of the disease and genetic research
  – Familiarity with the community and cohorts
  – Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
What the Data Coordinating Center Should Know
Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss them
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements
Patients and families of those with AD
NIA ADRC program and centers
NACC
NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD
NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly
NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon
NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao
GDC Overview: GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high level data
GDC Overview: Infrastructure
[Figure: GDC infrastructure diagram. Column headings: GDC Users; GDC System Components; GDC Interfaces. Box labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; 3rd Party Applications; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; APIs; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System]
GDC Overview: Resources
• GDC Data Portal
• GDC Data Center
• GDC Data Transfer Tool
• GDC Reports
• GDC Data Model
• GDC Data Submission Portal
• GDC Application Programming Interface (API)
• GDC Bioinformatics Pipeline
• GDC Documentation and Support
• GDC Organization and Collaborators
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interfaces (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool that use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality checks and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
GDC Data Submission: Data Submission Process
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Portion | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Analyte | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Aliquot | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Diagnosis | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Exposure | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Family History | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Treatment | TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
[Diagram: Case linked to Clinical Data, Biospecimen Data, and Experiment Data. Clinical Data: Demographics; Diagnosis; Family History (Optional); Exposure (Optional); Treatment (Optional)]
Data Types and File Formats: Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
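A QC table like the one above is easier to act on when reduced to the modules that did not pass for each read group. The sketch below is an illustration, not a GDC tool: it hard-codes the non-PASS rows from the example table (rows where every read group passed are omitted), and the function name `flagged_modules` is made up.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Status per module, in read-group order, copied from the example table.
QC_STATUS = {
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged_modules(status_table, read_groups):
    """Return, per read group, the (module, status) pairs that are not PASS."""
    flagged = {rg: [] for rg in read_groups}
    for module, statuses in status_table.items():
        for rg, status in zip(read_groups, statuses):
            if status != "PASS":
                flagged[rg].append((module, status))
    return flagged


flagged = flagged_modules(QC_STATUS, READ_GROUPS)
for read_group, issues in flagged.items():
    print(read_group, issues or "all listed modules passed")
```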
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
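As a rough sketch of what a call to the create-entities endpoint above might look like, the Python below only builds the request object and does not send it. Several details are assumptions rather than facts from the slides: the host is taken from the API URL shown elsewhere in this deck, the `X-Auth-Token` header name is assumed, the program and project names are placeholders, and the entity payload is illustrative and not validated against the GDC data dictionary.

```python
import json
import urllib.request

# Assumption: API host as shown on the GDC API slide of this deck.
API_ROOT = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: the token is passed in an X-Auth-Token header.
            "X-Auth-Token": token,
        },
    )


# Placeholder program, project, entity, and token values.
demo_request = build_create_request(
    "DEMO",
    "PROJ1",
    [{"type": "case", "submitter_id": "demo-case-1"}],
    token="not-a-real-token",
)
print(demo_request.get_method(), demo_request.full_url)
```

Sending the request would be a matter of passing `demo_request` to `urllib.request.urlopen`, which is left out here because it needs a valid token and an authorized project.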
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure: pipeline diagrams. Panel titles: DNA Sequence Pipeline; RNA Sequence Pipeline]
GDC Processing: GDC High Level Data Generation
• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP Array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
GDC Data Retrieval
• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
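To make the graph description above concrete, here is a small sketch of nodes keyed by unique identifiers and edges that link a data file back to its case and project. It is a toy model under stated assumptions, not the GDC's actual schema or storage layer: the class, the node labels, and every identifier are invented for illustration.

```python
from collections import defaultdict


class DataGraph:
    """Toy graph: nodes keyed by unique ID, edges pointing from child to parent."""

    def __init__(self):
        self.nodes = {}
        self.edges = defaultdict(set)

    def add_node(self, node_id, label):
        self.nodes[node_id] = {"label": label}

    def add_edge(self, child_id, parent_id):
        self.edges[child_id].add(parent_id)

    def ancestors(self, node_id):
        """Collect every node reachable by following child-to-parent edges."""
        seen, stack = set(), [node_id]
        while stack:
            for parent in self.edges.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


graph = DataGraph()
for node_id, label in [
    ("project-1", "project"),
    ("case-1", "case"),
    ("clinical-1", "clinical data"),
    ("reads-1", "experiment data"),
    ("file-1", "data file"),
]:
    graph.add_node(node_id, label)

graph.add_edge("case-1", "project-1")
graph.add_edge("clinical-1", "case-1")
graph.add_edge("reads-1", "case-1")
graph.add_edge("file-1", "reads-1")

# Which case and project does this file belong to?
file_lineage = graph.ancestors("file-1")
print(sorted(file_lineage))
```

Keeping the links as edges rather than as columns copied onto each file record is what lets a query start from any node (a file, a case, a project) and walk to the others.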
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from cart or a high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, annotations and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing
Example request (API URL, endpoint, URL parameters, query parameters):
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Example response (portion removed for readability):
{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
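The projects query above can be assembled and its response parsed with the Python standard library. This is a minimal sketch: the host and parameters are the ones shown on the slide, the helper names are made up, and the response text is a shortened stand-in for a live reply, so no network call is made.

```python
import json
from urllib.parse import urlencode

# Host as shown on the slide; it may differ in later GDC releases.
API_ROOT = "https://gdc-api.nci.nih.gov"


def projects_url(fields, pretty=True):
    """Build the /projects query URL with a comma-separated field list."""
    query = urlencode(
        {"fields": ",".join(fields), "pretty": str(pretty).lower()},
        safe=",",  # keep commas readable instead of percent-encoding them
    )
    return f"{API_ROOT}/projects?{query}"


def sites_by_project(response_text):
    """Map project_id to primary_site from a /projects JSON response."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


# Shortened stand-in for a real response, using two hits from the slide.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}
"""

demo_url = projects_url(["project_id", "primary_site"])
demo_sites = sites_by_project(SAMPLE_RESPONSE)
print(demo_url)
print(demo_sites)
```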
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site
  – https://gdc.nci.nih.gov
• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov
• GDC Data Portal
  – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype)
  – Aggregation
  – Access
  – Sharing
  – Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Diagram: data flow. Labels: Birth defect or childhood cancer cohorts; DNA Sequencing center; Birth defect BAM/VCF; Cancer BAM/VCF; dbGaP; NCI GDC; Birth defect VCF; Cancer VCF; Gabriella Miller Kids First Data Resource (Index of datasets, Phenotype, Variant summaries); Users]
[Diagram: GMKF Data Resource and related resources. Labels: dbGaP; NCI GDC; TOPMed; The Monarch Initiative; ClinGen; Matchmaker Exchange; Center for Mendelian Genomics]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
NIAGADS History and Function
Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
NIA's repository for the genetics of late-onset Alzheimer's disease data
Datasets Genomics Database Analysis Resource
Compliance with the NIA AD Genetics of Alzheimerrsquos Disease Sharing Policy and the NIH Genomics Data Sharing Policy
Data Coordinating Center for the Alzheimerrsquos Disease Sequencing Project (ADSP)
NIAGADS ADSP Data Coordinating Center
bull Host the sample information and study plan
bull Track progress of sequencing
bull Track samples
bull Prepare amp maintain data
bull Schedule data releases for the Study at dbGaP
bull Coordinate the flow of sequence data among
sequencing centers the consortia and dbGaP
bull Host ADSP website and ADSP data portal
bull Manage files and datasets for ADSP work groups
bull Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and Office of Science Policy on ldquoTrusted Partnersrdquo Interface between dbGaP and NIAGADS ADSP
portal bull ADSP data are deposited at dbGaP bull Investigator submits application for ADSP data to
dbGaP bull dbGaP implements user authentication via NIH iTrust
and data access control bull ADSP Data Access Committee review bull Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP
Benefits bull Secondary analysis data returned to NIAGADS bull dbGaP has access to secondary analysis data
via the ADSP portal bull AD research community can customize data
presentation to allow data browsing and display through NIAGADS
bull Augments the capacity of dbGaP to work with specific user communities
bull Example for other user communities
Access to ADSP Data
January 25 2015 NIH Genomics Data Sharing policy (GDS)
bull ADSP Data Access Committee initiated with the launch of the GDS
bull dbGaP application process bull IRB and Institutional Certification
documents
Information specific to NIAGADS review by Data Use Committee bull NIA Genomic Data Sharing Plan bull NIAGADS Data Distribution Plan bull DerivedSecondary Return Plan Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
bull ADSP Portal uses iTrust and ERA Commons account for authentication no password is stored at ADSP Portal
bull All ADSP sequence and genotype data are stored atdbGaP deep phenotypic data at NIAGADS
bull ADSP Portal lists the dbGaP ADSP files as well as meta-data
bull ADSP Portal receives nightly updates of approved userlists and file lists from dbGaP
ADSP Collaboration with dbGaP
bull NIH-side After ADSP DAC approval the investigator uses NIH authentication bull Community-side Fully customized web interface bull Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
Data production Define data formatting Collect and curate phenotypes Track data production Additional quality metrics Coordinate public data release
Secondaryderived data management Documentation (READMEManifest) packaging and release Notification and tracking
IT support Website Members area Face-to-face meeting support dbGaP exchange area
Facilitate interactions with other AD investigators
Rapid response to unforeseen ADSP and NIH requests
Interactions
NIAGADS
NIANHGRI
BaylorBroadWashU LSACs
NCBI dbGaPSRA
ADSP Data Flow and other Work Groups
Genomics Center
ADGC- Alzheimerrsquos Disease Genetics Consortium CHARGE- Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD ADGCCHARGE NCRAD- National Cell Repository for Alzheimerrsquos Disease NIAGADS- NIA Genetics of Alzheimerrsquos Disease Data Storage Site
Management of Work Group DataWork Flow
Exchange Area
Data Deposition and Data Sharing
ADSP specific datasets- 68 Tb
21 Reference datasets with 5560 files in use by the ADSP
gt20 Reference datasets planned 1 TB = 1024 GB = 1048576 MB = 1073741824 KB =
1099511627776 bytes A 1 TB hard drive has the capacity to
hold a trillion bytes
dbGaPNIAGADS release for the research community at large
11555 BAM files released 122014
Pedigree and phenotype information
Sequencing Quality Control Metrics
Size of the ADSP
Reference ldquoThe real cost of sequencing scaling computation to keep pace with data generationrdquo Muir et al Genome Biology 2016 1753
ADSP Members Area (Dashboard)
Calendar Documents Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list 153 records
352 meeting minutes
gt600 other documents
gt 100 consent files
10 analysis plans
85TB WGS 1075TB WES data (dbGaPSRA)
42 TB Files (dbGaP Exchange Area)
niagadsorggenomics Web interface for query and analysis
SNPGene reports Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimerrsquos Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
GWAS results combined with
SNPs below
Gene annotations transformed to SNPs
Genomic features from identified SNPs
Find a genomic location of interest by viewing tracks on the genome browser
SNPs
genes
GWAS results
functional genomics data
View Genomic Information by the Genome Browser
search result
Data available at the NIAGADS Genomics Database
Gene models SNP information
Pathway annotations Gene Ontology
KEGG Pathway
Functional genomics data ENCODE
FANTOM5
GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics Ongoing effort to deposit annotations curated by ADSP AnnotationWG into Genomics DB
What the Data Coordinating Center Should Know
bull Bioinformatics bull Expertise in genomics next generation sequencing method and
software tool development high performance computing bull Big data management bull IT infrastructure and web development
bull Administrative support and collaboration bull Familiarity with human subjects research regulations and NIH policy bull Leadership and coordinating skills bull Flexibility close collaboration with program officers
bull Domain expertise bull Knowledge of the disease and genetic research bull Familiarity with the community and cohorts bull Familiarity with infrastructure (sequencing centers dbGaP DACs NIA
policy and infrastructure)
What the Data Coordinating Center Should Know
Lessons Learned
bull Set up the data coordinating center early bull Develop study designs and analysis plans early bull Set milestones and timelines early and carefully but
be ready to miss bull Leverage existing projects and infrastructure bull Engage a wide range of expertise bull Build on existing community resources bull Use open-source and academic solutions instead of
commercialproprietary solutions bull Engage external advisors
Acknowledgements NIAGADS EAB Matthew Farrer
Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober
NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly bull Data flow WG
bull Annotation WG IGAPADGCCHARGEEADIGERAD
NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon
NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao
GDC Overview GDC Data Submission Processing and Retrieval
April 2016
Agenda
n
GDC Overview
GDC Data Submissio
GDC Data Processing
GDC Data Retrieval
154
GDC Overview Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables
data sharing across cancer genomic studies in support of precision medicine
bull Provide a cancer knowledge network that ndash Enables the identification of both high- and low-frequency cancer drivers ndash Assists in defining genomic determinants of response to therapy ndash Informs the composition of clinical trial cohorts sharing targeted genetic lesions
bull Support the receipt quality control integration storage and redistribution of standardized genomic data sets derived from cancer research studies ndash Harmonization of raw sequence both from existing (eg TCGA TARGET
CGCI) and new cancer research programs
ndash Application of state-of-the-art methods of generating high level data 155
GDC Overview: Infrastructure
[Figure: GDC infrastructure diagram with three groups: GDC Users, GDC System Components, and GDC Interfaces. Labels shown: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; 3rd Party Applications; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; APIs; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System.]
GDC Overview: Resources
[Figure: GDC resources. Labels shown: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators.]
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford an exclusive pre-processing period that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
GDC Data Submission: Data Submission Process
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter's credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload their data to a workspace within the GDC and validate it there using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample: TSV, JSON; Portion: TSV, JSON; Analyte: TSV, JSON; Aliquot: TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic: TSV, JSON; Diagnosis: TSV, JSON; Exposure: TSV, JSON; Family History: TSV, JSON; Treatment: TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
[Figure: a Case linked to Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)), Biospecimen Data, and Experiment Data.]
Data Types and File Formats: Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
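A per-read-group QC table like the one above can be triaged with a few lines of code. The following is a minimal sketch, not GDC tooling: it hard-codes the example flags from the slide and reports which read groups have at least one FAIL. The names `QC_FLAGS`, `summarize`, and `needs_review` are hypothetical, and the rule "any FAIL means review" is an assumption made for illustration.

```python
# Example flags copied from the slide's FastQC table; columns are RG_1..RG_4.
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]
QC_FLAGS = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def summarize(flags, read_groups):
    """Collect, per read group, the modules flagged FAIL or WARNING."""
    summary = {rg: {"FAIL": [], "WARNING": []} for rg in read_groups}
    for module, statuses in flags.items():
        for rg, status in zip(read_groups, statuses):
            if status in summary[rg]:  # PASS is not a key, so it is skipped
                summary[rg][status].append(module)
    return summary


def needs_review(summary):
    """Read groups with at least one FAIL (hypothetical triage rule)."""
    return sorted(rg for rg, found in summary.items() if found["FAIL"])


summary = summarize(QC_FLAGS, READ_GROUPS)
print(needs_review(summary))
```

With the slide's example values, the read groups flagged are the two that carry a FAIL in the table (RG_1 and RG_2).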
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
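The create-entities call can be sketched as follows. This is an illustration under stated assumptions rather than the official client: the host is the one listed on this deck's References slide, the `X-Auth-Token` header name and the JSON entity fields are assumptions to be checked against the GDC API User's Guide, and `MYPROGRAM`, `MYPROJECT`, and `build_create_request` are hypothetical names. The request is only constructed, never sent.

```python
import json
import urllib.request

# Host as given on the References slide of this deck (2016); treat as an assumption.
API_BASE = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the submission token
        },
    )


# Hypothetical entity payload; real field names come from the GDC data dictionary.
entities = [{"type": "case", "submitter_id": "case-001"}]
request = build_create_request("MYPROGRAM", "MYPROJECT", entities, token="REPLACE_ME")
print(request.get_method(), request.full_url)
```

Sending the request would be a separate step (for example with `urllib.request.urlopen(request)`), which needs a valid token and authorization for the project.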
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access, per dbGaP authorization policies associated with the data set
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline.]
GDC Processing: GDC High Level Data Generation
• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Retrieval
• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
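The graph idea can be illustrated with a toy in-memory structure. This is a minimal sketch and not the GDC schema: the node labels (`project`, `case`, `file`), the edge labels (`member_of`, `data_from`), and all helper names are hypothetical. It shows only how unique identifiers let a query walk from a project to the files attached to its cases.

```python
import uuid


class Graph:
    """Tiny property graph: nodes keyed by UUID, directed labeled edges."""

    def __init__(self):
        self.nodes = {}  # node_id -> {"label": str, "props": dict}
        self.edges = []  # (src_id, edge_label, dst_id)

    def add_node(self, label, **props):
        node_id = str(uuid.uuid4())  # unique identifier for the node
        self.nodes[node_id] = {"label": label, "props": props}
        return node_id

    def add_edge(self, src_id, edge_label, dst_id):
        if src_id not in self.nodes or dst_id not in self.nodes:
            raise KeyError("both endpoints must exist before linking")
        self.edges.append((src_id, edge_label, dst_id))

    def incoming(self, dst_id, edge_label):
        """IDs of nodes that have an edge with this label pointing at dst_id."""
        return [s for s, lbl, d in self.edges if d == dst_id and lbl == edge_label]


def files_for_project(graph, project_id):
    """Walk project <- case <- file and return the file names, sorted."""
    names = []
    for case_id in graph.incoming(project_id, "member_of"):
        for file_id in graph.incoming(case_id, "data_from"):
            names.append(graph.nodes[file_id]["props"]["file_name"])
    return sorted(names)


g = Graph()
project = g.add_node("project", code="EXAMPLE")
case = g.add_node("case", submitter_id="case-001")
bam = g.add_node("file", file_name="case-001.bam")
g.add_edge(case, "member_of", project)
g.add_edge(bam, "data_from", case)
print(files_for_project(g, project))
```

Keeping files as nodes with their own identifiers, rather than as attributes of a case, is what allows the same file object to be linked, queried, and released independently of the clinical records around it.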
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from cart or a high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)
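A manifest-driven download can be prepared along these lines. This is a sketch under assumptions, none of which come from the slide: the manifest is taken to be a tab-separated file with `id`, `filename`, `md5`, `size`, and `state` columns, and the command line is assumed to be `gdc-client download -m <manifest> -t <token file>`; both should be checked against the Data Transfer Tool User's Guide. The sample rows are made up, and the command is only assembled, not executed.

```python
import csv
import io

# Made-up manifest content; the column layout is an assumption.
SAMPLE_MANIFEST = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "00000000-0000-0000-0000-000000000001\tcase-001.bam\t"
    "d41d8cd98f00b204e9800998ecf8427e\t1048576\tsubmitted\n"
    "00000000-0000-0000-0000-000000000002\tcase-001.bam.bai\t"
    "d41d8cd98f00b204e9800998ecf8427e\t2048\tsubmitted\n"
)


def read_manifest(text):
    """Parse tab-separated manifest text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def total_bytes(rows):
    """Sum the declared file sizes, e.g. to check free disk space first."""
    return sum(int(row["size"]) for row in rows)


def download_command(manifest_path, token_path=None):
    """Assemble the (assumed) Data Transfer Tool command as an argument list."""
    command = ["gdc-client", "download", "-m", manifest_path]
    if token_path is not None:  # a token is only needed for controlled access data
        command += ["-t", token_path]
    return command


rows = read_manifest(SAMPLE_MANIFEST)
print(len(rows), "files,", total_bytes(rows), "bytes")
print(" ".join(download_command("manifest.txt", "token.txt")))
```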
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing
Example request (parts labeled on the slide: API URL, endpoint, URL parameters, query parameters):
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Example response (portion removed for readability):
"data": { "hits": [
  { "project_id": "TCGA-SKCM", "primary_site": "Skin" },
  { "project_id": "TCGA-PCPG", "primary_site": "Nervous System" },
  { "project_id": "TCGA-LAML", "primary_site": "Blood" },
  { "project_id": "TCGA-CNTL", "primary_site": "Not Applicable" },
  { "project_id": "TCGA-UVM", "primary_site": "Eye" },
  { "project_id": "TARGET-AML", "primary_site": "Blood" },
  { "project_id": "TCGA-SARC", "primary_site": "Mesenchymal" },
  { "project_id": "TCGA-LUSC", "primary_site": "Lung" },
  { "project_id": "TARGET-NBL", "primary_site": "Nervous System" },
  { "project_id": "TCGA-PAAD", "primary_site": "Pancreas" }
] }
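The example query can be reproduced in code. The sketch below builds the same URL from its parts and parses a shortened copy of the response; it makes no network call. The host is the one shown on the slide (2016), the response shape is assumed to be `{"data": {"hits": [...]}}` as the slide's excerpt suggests, and `build_query_url` and `sites_by_project` are hypothetical helper names.

```python
import json
from urllib.parse import urlencode

API_URL = "https://gdc-api.nci.nih.gov"  # host as shown on the slide


def build_query_url(endpoint, fields, pretty=True):
    """Join API URL, endpoint, and query parameters into one request URL."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # safe="," keeps the comma-separated field list readable in the URL
    return f"{API_URL}/{endpoint}?{urlencode(params, safe=',')}"


def sites_by_project(response_text):
    """Map project_id to primary_site from a projects response."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


# Shortened copy of the slide's example response.
SAMPLE_RESPONSE = json.dumps({
    "data": {"hits": [
        {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
        {"project_id": "TARGET-AML", "primary_site": "Blood"},
    ]}
})

url = build_query_url("projects", ["project_id", "primary_site"])
print(url)
print(sites_by_project(SAMPLE_RESPONSE))
```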
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site
  – https://gdc.nci.nih.gov
• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov
• GDC Data Portal
  – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data
  – Aggregation
  – Access
  – Sharing
  – Analysis
Whole genome sequence + Phenotype
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Figure: data flow. Birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF and cancer BAM/VCF; dbGaP and NCI GDC; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users.]
[Figure: GMKF Data Resource shown with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and Center for Mendelian Genomics.]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
NIAGADS ADSP Data Coordinating Center
• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and Office of Science Policy on "Trusted Partners"
Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP
Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomic Data Sharing (GDS) policy
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
Data production
– Define data formatting
– Collect and curate phenotypes
– Track data production
– Additional quality metrics
– Coordinate public data release
Secondary/derived data management
– Documentation (README/Manifest), packaging, and release
– Notification and tracking
IT support
– Website
– Members area
– Face-to-face meeting support
– dbGaP exchange area
Facilitate interactions with other AD investigators
Rapid response to unforeseen ADSP and NIH requests
Interactions
[Figure: interactions among NIAGADS, NIA/NHGRI, Baylor/Broad/WashU LSACs, NCBI dbGaP/SRA, ADSP Data Flow and other Work Groups, Genomics Center, NCRAD, and ADGC/CHARGE.]
ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site
Management of Work Group Data/Work Flow
Exchange Area
Data Deposition and Data Sharing
ADSP-specific datasets: 68 Tb
21 reference datasets with 5560 files in use by the ADSP
>20 reference datasets planned
1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes. A 1 TB hard drive has the capacity to hold a trillion bytes
dbGaP/NIAGADS release for the research community at large
11,555 BAM files released 12/2014
Pedigree and phenotype information
Sequencing Quality Control Metrics
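The unit conversion quoted on this slide uses binary multiples (each step is a factor of 1024). The short sketch below reproduces the arithmetic and shows that 1024^4 bytes is roughly 10% more than a decimal trillion (10^12) bytes, which is the figure drive vendors usually mean by "1 TB". The function name `binary_units` is hypothetical.

```python
STEP = 1024  # binary multiple used on the slide


def binary_units(terabytes=1):
    """Express a size given in (binary) terabytes in smaller units."""
    return {
        "GB": terabytes * STEP,
        "MB": terabytes * STEP ** 2,
        "KB": terabytes * STEP ** 3,
        "bytes": terabytes * STEP ** 4,
    }


sizes = binary_units(1)
ratio_to_trillion = sizes["bytes"] / 10 ** 12  # binary TB relative to 10^12 bytes

for unit, value in sizes.items():
    print(f"1 TB = {value:,} {unit}")
print(f"ratio to a decimal trillion bytes: {ratio_to_trillion:.4f}")
```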
Size of the ADSP
Reference: "The real cost of sequencing: scaling computation to keep pace with data generation", Muir et al., Genome Biology 2016, 17:53
ADSP Members Area (Dashboard)
Calendar, Documents, Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list: 153 records
352 meeting minutes
>600 other documents
>100 consent files
10 analysis plans
85TB WGS, 1075TB WES data (dbGaP/SRA)
42 TB files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
SNP/Gene reports
Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer's Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
[Figure: search workflow. GWAS results combined with SNPs below; gene annotations transformed to SNPs; genomic features from identified SNPs.]
Find a genomic location of interest by viewing tracks on the genome browser
[Figure: genome browser tracks for SNPs, genes, GWAS results, and functional genomics data.]
View Genomic Information by the Genome Browser
[Figure: genome browser view of a search result.]
Data available at the NIAGADS Genomics Database
Gene models
SNP information
Pathway annotations
– Gene Ontology
– KEGG Pathway
Functional genomics data
– ENCODE
– FANTOM5
GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics
Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  – Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  – Big data management
  – IT infrastructure and web development
• Administrative support and collaboration
  – Familiarity with human subjects research regulations and NIH policy
  – Leadership and coordinating skills
  – Flexibility, close collaboration with program officers
• Domain expertise
  – Knowledge of the disease and genetic research
  – Familiarity with the community and cohorts
  – Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
What the Data Coordinating Center Should Know
Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Create Entities Endpoint POST v0submissionltprogramgtltprojectgt 172
GDC Data Submission Submit and Release Data
bull After validation data is submitted to the GDC for processing including data harmonization and high level data generation for applicable data
bull After data processing has been completed the user can release their data to the GDC which must occur within six (6) months of submission per GDC Data Submission Policies
bull Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
173
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
174
GDC Processing Harmonization
bull GDC pipelines supporting the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
175
GDC Processing DNA and RNA Sequence Harmonization Pipelines
176
DNA Sequence Pipeline RNA Sequence Pipeline
GDC Processing GDC High Level Data Generation
bull The GDC generates high level data including DNA-seq derived germline variants and somatic mutations RNA-seq and miRNA-seq derived gene and miRNA quatifications and SNP Array based copy number segmentations
bull The GDC implements multiple pipelines for generating somatic variants such as the Baylor Broad and WashU pipelines
177
GDC Processing GDC Variant Calling Pipelines
178
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
179
GDC Data Retrieval
bull Once data is submitted into GDC and the data submitter signs-off on the data to request data release GDC performs Quality Control (QC) and data processing
bull Upon successful GDC QC and processing the data is released based on the appropriate dbGaP data restrictions
bull Released data is made available for query and download to authorized users via the GDC Data Portal the GDC Data Transfer Tool and the GDC API
bull Data queries are based on the GDC Data Model
180
GDC Data Retrieval GDC Data Model
bull The GDC data model is represented as a graph with nodes and edges and this graph is the store of record for the GDC
bull The GDC data model maintains the critical relationship between projects cases clinical data and experiment data and insures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
181
GDC Data Retrieval GDC Data Portal
bull The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
Data browsing by project file case or annotation
Visualization allowing users to perform fine-grained filtering of search results
Data search using advanced smart search technology
Data selection into a personalized cart
Data download from cart or a high-performance Data Transfer Tool
182
GDC Data Retrieval GDC Data Transfer Tool
bull The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
Command line interface to specify desired transfer protocol and multiple files for download
Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
Supports the secure transfer of controlled access data using an authentication key (token)
183
Data Retrieval Tools GDC API
bull The GDC API is a REST based programmatic interface that allows users to search and download cancer data sets for analysis
Search for projects files cases annotations and retrieve associated details in JSON format
Securely retrieve biospecimen clinical and molecular files
Perform BAM slicing
data hits [
project_id TCGA-SKCMrdquoprimary_site Skinrdquo project_id TCGA-PCPGrdquoprimary_site Nervous Systemrdquo project_id TCGA-LAMLrdquoprimary_site Bloodrdquo project_id TCGA-CNTLrdquoprimary_site Not Applicablerdquo project_id TCGA-UVMrdquoprimary_site Eyerdquo project_id TARGET-AMLrdquoprimary_site Bloodrdquo project_id TCGA-SARCrdquoprimary_site Mesenchymalrdquo project_id TCGA-LUSCrdquoprimary_site Lungrdquo project_id TARGET-NBLrdquoprimary_site Nervous Systemrdquo project_id TCGA-PAADrdquoprimary_site Pancreasrdquo
] Portion remove for readability
API URL Endpoint URL parameters Query parameters
184httpsgdc-apincinihgovprojectsfields=project_idprimary_siteamppretty=true
GDC Web Site
bull The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
Targets data consumers providers developers and general users
Provides access to information about GDC and contributed cancer genomic data sets
Instructs users on the use of GDC data access and submission tools
Provides descriptions of GDC bioinformatics pipelines
Documents supported GDC data types and file formats
Provides access to GDC support resources
185
GDC Documentation Site
bull The GDC Documentation Site provides access to GDC Userrsquos Guides and the GDC Data Dictionary
186
References
Note Requires access to the University of Chicago
Virtual Private Network
bull GDC Web Site ndash httpsgdcncinihgov
bull GDC Documentation Site ndash httpsgdc-docsncinihgov
bull GDC Data Portal ndash httpsgdc-portalncinihgov
bull GDC Data Submission Portal ndash httpsgdc-portalncinihgovsubmission ndash httpsgdcncinihgovsubmit-datagdc-data-submission-portal
bull GDC Data Transfer Tool ndash httpsgdcncinihgovaccess-datagdc-data-transfer-tool
bull GDC Application Programming Interface (API) ndash httpsgdcncinihgovdevelopersgdc-application-programming-interface-api ndash httpsgdc-apincinihgov
bull GDC Support ndash GDC Assistance supportnci-gdcdatacommonsio ndash GDC User List Serv GDC-USERS-LLISTNIHGOV 187
Gabriella Miller Kids First Pediatric Data Resource Center
bull Goal accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
bull By enabling data ndash Aggregation ndash Access
Whole genome sequence + Phenotype ndash Sharing ndash Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
bull Web-based public facing platform bull House organize index and display data and
analytic tools
Data Coordinating
Center
bull Facilitate deposition of sequence and phenotype data into relevant repositories
bull Harmonize phenotypes
Administrative and Outreach
Core
bull Develop policies and procedures bull Facilitate meetings and communication bull Educate and seek feedback from users
Birth defect or childhood cancer
DNA Sequencing center
cohorts
Birth defect BAMVCF Cancer BAMVCF
dbGaP NCI GDC
Birth defect VCF Cancer VCF
Gabriella Miller Kids First Data
Resource Birth defect BAMVCF
Index of datasets Phenotype Variant summaries
Cancer BAMVCF
Users
GMKF Data
Resource
dbGaP
NCI GDC
TOPMed
The Monarch Initiative
ClinGen
Matchmaker Exchange
Center for Mendelian Genomics
Functionality bull Where can I find a group of patients with
diaphragmatic hernia bull What range of variants are associated with total
anomalous pulmonary venous return bull I have a knock-out of ABC gene in the mouse What
human phenotypes are associated with variants in ABC
bull What is the frequency of de novo variants in patients with Ewing sarcoma
bull I have a patient with congenital facial palsy cleftpalate and syndactyly Do similar patients exist and ifso what genetic variants have been discovered in association with this constellation of findings
Questions
bull Do we have the right model for data managementsharing for the childhood cancer and birth defects communities
bull Have we included all of the right elements in the proposal
bull Do we have the right use cases bull How do we maximize use of the Data Resource bull Have we targeted the rightall of the audiences
Collaboration between NIAGADS and dbGaP
Outcome of discussions with dbGaP and Office of Science Policy on "Trusted Partners"
Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee
Collaboration between NIAGADS and dbGaP
Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomic Data Sharing (GDS) Policy
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and eRA Commons accounts for authentication; no password is stored at the ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data are stored at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
Data production
  - Define data formatting
  - Collect and curate phenotypes
  - Track data production
  - Additional quality metrics
  - Coordinate public data release
Secondary/derived data management
  - Documentation (README/Manifest), packaging, and release
  - Notification and tracking
IT support
  - Website
  - Members area
  - Face-to-face meeting support
  - dbGaP exchange area
Facilitate interactions with other AD investigators
Rapid response to unforeseen ADSP and NIH requests
Interactions
[Diagram labels: NIAGADS; NIA/NHGRI; Baylor/Broad/WashU LSACs; NCBI dbGaP/SRA; ADSP Data Flow and other Work Groups; Genomics Center; NCRAD; ADGC/CHARGE; Management of Work Group Data/Work Flow; Exchange Area]
ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site
Data Deposition and Data Sharing
ADSP specific datasets: 68 Tb
21 reference datasets with 5560 files in use by the ADSP
>20 reference datasets planned
1 TB = 1024 GB = 1048576 MB = 1073741824 KB = 1099511627776 bytes
A 1 TB hard drive has the capacity to hold a trillion bytes
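The unit chain on this slide uses binary steps (each unit is 1024 of the next smaller one), while "a trillion bytes" is the decimal figure drive vendors usually quote. The following minimal Python sketch only reproduces that arithmetic; the function and constant names are illustrative and are not part of any ADSP or NIAGADS tooling.

```python
# Binary (powers of 1024) versus decimal (powers of 1000) storage units.
BINARY_STEP = 1024
DECIMAL_STEP = 1000


def terabyte_in_bytes(step: int) -> int:
    """Bytes in one terabyte when each unit step (KB, MB, GB, TB) multiplies by `step`."""
    return step ** 4


binary_tb = terabyte_in_bytes(BINARY_STEP)    # the figure quoted on the slide
decimal_tb = terabyte_in_bytes(DECIMAL_STEP)  # "a trillion bytes"

print(binary_tb)               # 1099511627776
print(decimal_tb)              # 1000000000000
print(binary_tb - decimal_tb)  # 99511627776
```

The two conventions differ by roughly 10% at the terabyte scale, which is worth keeping in mind when comparing quoted dataset sizes with disk capacity.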
dbGaPNIAGADS release for the research community at large
11555 BAM files released 12/2014
Pedigree and phenotype information
Sequencing Quality Control Metrics
Size of the ADSP
Reference: "The real cost of sequencing: scaling computation to keep pace with data generation", Muir et al., Genome Biology 2016, 17:53
ADSP Members Area (Dashboard)
Calendar Documents Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list 153 records
352 meeting minutes
>600 other documents
>100 consent files
10 analysis plans
85TB WGS 1075TB WES data (dbGaPSRA)
42 TB Files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
SNP/Gene reports, Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer's Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
GWAS results combined with
SNPs below
Gene annotations transformed to SNPs
Genomic features from identified SNPs
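The combined search described above (GWAS results filtered by significance, then intersected with gene annotation) can be sketched in a few lines of Python. This is an illustration of the idea only, not the NIAGADS Genomics Database implementation: the record types, the threshold value, and all SNP and gene records below are made-up assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snp:
    rsid: str
    chrom: str
    pos: int
    p_value: float


@dataclass(frozen=True)
class Gene:
    symbol: str
    chrom: str
    start: int
    end: int


def significant_snps(snps, threshold):
    """Keep SNPs whose GWAS p-value is at or below the significance level."""
    return [s for s in snps if s.p_value <= threshold]


def snps_in_genes(snps, genes):
    """Map each gene symbol to the rsids of SNPs that fall inside its interval."""
    hits = {}
    for gene in genes:
        inside = [
            s.rsid
            for s in snps
            if s.chrom == gene.chrom and gene.start <= s.pos <= gene.end
        ]
        if inside:
            hits[gene.symbol] = inside
    return hits


# Hypothetical records, for illustration only.
snps = [
    Snp("rs_demo1", "1", 150, 1e-9),
    Snp("rs_demo2", "1", 900, 0.2),
    Snp("rs_demo3", "2", 420, 3e-8),
]
genes = [
    Gene("GENE_A", "1", 100, 200),
    Gene("GENE_B", "2", 400, 500),
    Gene("GENE_C", "2", 900, 950),
]

result = snps_in_genes(significant_snps(snps, threshold=5e-8), genes)
print(result)  # {'GENE_A': ['rs_demo1'], 'GENE_B': ['rs_demo3']}
```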
Find a genomic location of interest by viewing tracks on the genome browser
SNPs
genes
GWAS results
functional genomics data
View Genomic Information by the Genome Browser
search result
Data available at the NIAGADS Genomics Database
Gene models SNP information
Pathway annotations Gene Ontology
KEGG Pathway
Functional genomics data ENCODE
FANTOM5
GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics
Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  - Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  - Big data management
  - IT infrastructure and web development
• Administrative support and collaboration
  - Familiarity with human subjects research regulations and NIH policy
  - Leadership and coordinating skills
  - Flexibility, close collaboration with program officers
• Domain expertise
  - Knowledge of the disease and genetic research
  - Familiarity with the community and cohorts
  - Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
What the Data Coordinating Center Should Know
Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements
Patients and families of those with AD
NIA ADRC program and centers
NACC
NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD
NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly
NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon, Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao
NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
GDC Overview GDC Data Submission Processing and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
  - Enables the identification of both high- and low-frequency cancer drivers
  - Assists in defining genomic determinants of response to therapy
  - Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  - Harmonization of raw sequence data from both existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  - Application of state-of-the-art methods of generating high level data
GDC Overview Infrastructure
[Figure: GDC infrastructure diagram with three groups: GDC Users, GDC System Components, and GDC Interfaces. Labels shown: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; 3rd Party Applications; Alignment & Processing Tools; Data Submission Tools; Data Security System; APIs; Digital ID System]
GDC Overview Resources
GDC Data Portal
GDC Data Center
GDC Data Transfer Tool
GDC Reports
GDC Data Model
GDC Data Submission Portal
GDC Application Programming Interface (API)
GDC Bioinformatics Pipeline
GDC Documentation and Support
GDC Organization and Collaborators
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC government sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
GDC Data Submission Data Submitter Types
• The GDC supports two types of data submitters
  - Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  - Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, both of which use the GDC API
GDC Data Submission Data Submission Policies
• dbGaP Data Submission Policies
  - GDC data submitters must first apply for data submission authorization through dbGaP
  - Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  - Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  - Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning, quality control, and submission of revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  - Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  - Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
GDC Data Submission Data Submission Process
GDC Data Submission dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission Upload and Validate Data
• Data submitters upload and validate their data in a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample; Portion; Analyte; Aliquot | TSV, JSON for each
Clinical | Clinical Data | TSV, JSON | Demographic; Diagnosis; Exposure; Family History; Treatment | TSV, JSON for each
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type Data Subtype Format Data Dictionary Template
Data File
Experimental Metadata Pathology Report Run Metadata Slide Image
Submitted Unaligned Reads
SRA XML
PDF SRA XML SVS
FASTQ BAM
Experimental Metadata Pathology Report
Submitted Unaligned Reads
TSV JSON
TSV JSON TSV JSON TSV JSON
TSV JSON
Submitted Aligned Reads BAM
Submitted Aligned Reads TSV JSON
Data Bundle Read Group
Slide
Read Group
Slide
TSV JSON
TSV JSON
Data Types and File Formats Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
Data Types and File Formats Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  - GDC clinical data elements were reviewed with members of the research community
  - Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
[Figure: a Case links to Clinical Data, Biospecimen Data, and Experiment Data; Clinical Data comprises Demographics, Diagnosis, Family History (Optional), Exposure (Optional), and Treatment (Optional)]
Data Types and File Formats Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC
Read Group Name               RG_1     RG_2     RG_3     RG_4
Total Sequences               502123   123456   658101   788965
Read Length                   75       75       75       75
GC Content (%)                49       50       48       50
Basic Statistics              PASS     PASS     PASS     PASS
Per base sequence quality     PASS     PASS     PASS     PASS
Per tile sequence quality     PASS     WARNING  PASS     PASS
Per sequence quality scores   FAIL     PASS     PASS     PASS
Per base sequence content     PASS     PASS     WARNING  PASS
Per sequence GC content       PASS     PASS     PASS     PASS
Per base N content            PASS     PASS     PASS     PASS
Sequence Length Distribution  PASS     PASS     PASS     PASS
Sequence Duplication Levels   WARNING  WARNING  PASS     PASS
Overrepresented sequences     PASS     PASS     PASS     PASS
Adapter Content               PASS     FAIL     PASS     PASS
Kmer Content                  PASS     PASS     WARNING  PASS
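A per-read-group QC summary like the one on this slide is easy to screen programmatically. The sketch below assumes the summary has been exported as tab-separated text with one module per row; that layout, the function name, and the subset of rows used are assumptions for illustration and do not describe how the GDC itself stores or reports FastQC results.

```python
# A few rows from the slide's QC table, as hypothetical tab-separated text.
QC_SUMMARY = """\
Read Group Name\tRG_1\tRG_2\tRG_3\tRG_4
Per tile sequence quality\tPASS\tWARNING\tPASS\tPASS
Per sequence quality scores\tFAIL\tPASS\tPASS\tPASS
Adapter Content\tPASS\tFAIL\tPASS\tPASS
Kmer Content\tPASS\tPASS\tWARNING\tPASS
"""


def flagged_modules(summary, statuses=("FAIL",)):
    """Return {read group: [QC modules whose status is in `statuses`]}."""
    lines = [line.split("\t") for line in summary.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    read_groups = header[1:]
    flagged = {rg: [] for rg in read_groups}
    for module, *values in rows:
        for rg, status in zip(read_groups, values):
            if status in statuses:
                flagged[rg].append(module)
    return flagged


failures = flagged_modules(QC_SUMMARY)
print(failures)
# {'RG_1': ['Per sequence quality scores'], 'RG_2': ['Adapter Content'], 'RG_3': [], 'RG_4': []}
```

Passing `statuses=("FAIL", "WARNING")` widens the screen to warnings as well, which would additionally flag RG_2 for per tile sequence quality and RG_3 for Kmer content in this example.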
Data Submission Tools GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  - Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  - Validate data against GDC standard data types defined in the project data dictionary
  - Obtain information on the status of data submission and processing by project
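Validating an upload against a data dictionary amounts to checking each record for required fields and permitted values. The following Python sketch shows that idea on a small TSV. The dictionary structure, the field names (`submitter_id`, `gender`), and the allowed values are hypothetical placeholders, not the actual GDC data dictionary, and the real portal's validation rules may differ.

```python
import csv
import io

# Hypothetical, simplified dictionary for one data type (not the real GDC dictionary).
DEMOGRAPHIC_DICTIONARY = {
    "required": ["submitter_id", "gender"],
    "allowed_values": {"gender": {"female", "male", "unknown"}},
}


def validate_tsv(tsv_text, dictionary):
    """Return a list of human-readable validation errors for a TSV upload."""
    errors = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row_number, row in enumerate(reader, start=1):
        # Required fields must be present and non-empty.
        for field in dictionary["required"]:
            if not row.get(field):
                errors.append(f"row {row_number}: missing required field '{field}'")
        # Enumerated fields must use one of the permitted values.
        for field, allowed in dictionary["allowed_values"].items():
            value = row.get(field)
            if value and value not in allowed:
                errors.append(f"row {row_number}: '{value}' is not allowed for '{field}'")
    return errors


sample = "submitter_id\tgender\ncase-1\tfemale\ncase-2\t\ncase-3\tF\n"
errors = validate_tsv(sample, DEMOGRAPHIC_DICTIONARY)
for message in errors:
    print(message)
```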
Data Submission Tools GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  - Command line interface to specify the desired transfer protocol
  - Use of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  - Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  - Securely submit biospecimen, clinical, and experiment files using a token
  - Submit data to GDC by performing create, update, and retrieve actions on entities
  - Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
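To make the create-entities endpoint concrete, the sketch below assembles (but does not send) a POST request for a given program and project. Only the path pattern comes from the slide; the `X-Auth-Token` header name, the JSON body shape, the entity fields, and the program and project names are assumptions for illustration and should be checked against the GDC API documentation before use.

```python
import json
from urllib.parse import quote


def create_entities_path(program, project):
    """Build the create-entities path shown on the slide for one program/project."""
    return f"/v0/submission/{quote(program, safe='')}/{quote(project, safe='')}"


def build_create_request(program, project, entities, token):
    """Describe the POST request as a plain dict; nothing is sent over the network."""
    return {
        "method": "POST",
        "path": create_entities_path(program, project),
        # Assumed header name for the authentication token.
        "headers": {"X-Auth-Token": token, "Content-Type": "application/json"},
        "body": json.dumps(entities),
    }


# Hypothetical program, project, entity, and token.
request = build_create_request(
    program="DEMO",
    project="PROJECT1",
    entities=[{"type": "case", "submitter_id": "demo-case-1"}],
    token="not-a-real-token",
)
print(request["method"], request["path"])  # POST /v0/submission/DEMO/PROJECT1
```

Keeping request construction separate from sending makes it straightforward to inspect or log a submission before handing it to whichever HTTP client is in use.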
GDC Data Submission Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
GDC Processing Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline diagrams]
GDC Processing GDC High Level Data Generation
• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing GDC Variant Calling Pipelines
GDC Data Retrieval
• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
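A graph of nodes and edges keyed by unique identifiers can be illustrated with a very small in-memory structure. The sketch below is a toy model only: the class, the node labels, and the property names are assumptions chosen for illustration, and the GDC's actual graph store and schema are not shown in the slides.

```python
import uuid


class Graph:
    """Toy property graph: nodes keyed by UUID, directed edges from parent to child."""

    def __init__(self):
        self.nodes = {}
        self.edges = {}

    def add_node(self, label, **properties):
        node_id = str(uuid.uuid4())  # unique identifier for this node
        self.nodes[node_id] = {"label": label, **properties}
        return node_id

    def add_edge(self, parent_id, child_id):
        self.edges.setdefault(parent_id, []).append(child_id)

    def descendants(self, start_id, label):
        """Return IDs of all nodes with `label` reachable from `start_id`."""
        found, stack, seen = [], [start_id], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            if current != start_id and self.nodes[current]["label"] == label:
                found.append(current)
            stack.extend(self.edges.get(current, []))
        return found


# Hypothetical project -> case -> (clinical data, read group -> data file) chain.
graph = Graph()
project = graph.add_node("project", name="DEMO-PROJECT")
case = graph.add_node("case", submitter_id="demo-case-1")
diagnosis = graph.add_node("diagnosis", primary_site="Skin")
read_group = graph.add_node("read_group", name="RG_1")
data_file = graph.add_node("file", file_name="demo.bam")
for parent, child in [(project, case), (case, diagnosis), (case, read_group), (read_group, data_file)]:
    graph.add_edge(parent, child)

files_for_project = graph.descendants(project, "file")
print(graph.nodes[files_for_project[0]]["file_name"])  # demo.bam
```

Because every node carries its own identifier, a file object can always be traced back through its read group and case to the owning project, which is the linkage the slide describes.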
GDC Data Retrieval GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  - Data browsing by project, file, case, or annotation
  - Visualization allowing users to perform fine-grained filtering of search results
  - Data search using advanced smart search technology
  - Data selection into a personalized cart
  - Data download from the cart or via the high-performance Data Transfer Tool
GDC Data Retrieval GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  - Command line interface to specify the desired transfer protocol and multiple files for download
  - Use of a manifest file generated from the GDC Data Portal for multiple downloads
  - Supports the secure transfer of controlled access data using an authentication key (token)
Data Retrieval Tools GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  - Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  - Securely retrieve biospecimen, clinical, and molecular files
  - Perform BAM slicing
Example request (API URL, endpoint, URL parameters, query parameters):
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Example response (portion removed for readability):
{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
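The request on this slide is a base URL, an endpoint (`projects`), and two query parameters (`fields` and `pretty`). As a minimal sketch, the Python below rebuilds that URL and groups a response of the same shape by primary site. It makes no network call: the base URL is the one quoted in the slides and may since have changed, and the sample response is a shortened, hand-written copy of the slide's output rather than live data.

```python
import json
from urllib.parse import urlencode

API_BASE = "https://gdc-api.nci.nih.gov"  # as quoted in the slides; may be outdated


def projects_url(fields, pretty=True):
    """Build the /projects query URL for the requested fields."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # Keep commas readable instead of percent-encoding them.
    return f"{API_BASE}/projects?{urlencode(params, safe=',')}"


def projects_by_site(response_text):
    """Group project IDs by primary site from a /projects-style JSON response."""
    hits = json.loads(response_text)["data"]["hits"]
    by_site = {}
    for hit in hits:
        by_site.setdefault(hit["primary_site"], []).append(hit["project_id"])
    return by_site


# Shortened copy of the response shown on the slide.
SAMPLE_RESPONSE = json.dumps({
    "data": {
        "hits": [
            {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
            {"project_id": "TCGA-LAML", "primary_site": "Blood"},
            {"project_id": "TARGET-AML", "primary_site": "Blood"},
        ]
    }
})

url = projects_url(["project_id", "primary_site"])
grouped = projects_by_site(SAMPLE_RESPONSE)
print(url)
print(grouped)  # {'Skin': ['TCGA-SKCM'], 'Blood': ['TCGA-LAML', 'TARGET-AML']}
```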
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  - Targets data consumers, providers, developers, and general users
  - Provides access to information about GDC and contributed cancer genomic data sets
  - Instructs users on the use of GDC data access and submission tools
  - Provides descriptions of GDC bioinformatics pipelines
  - Documents supported GDC data types and file formats
  - Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site
  - https://gdc.nci.nih.gov
• GDC Documentation Site
  - https://gdc-docs.nci.nih.gov
• GDC Data Portal
  - https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  - https://gdc-portal.nci.nih.gov/submission
  - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  - https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  - https://gdc-api.nci.nih.gov
• GDC Support
  - GDC Assistance: support@nci-gdc.datacommons.io
  - GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype)
- Aggregation
- Access
- Sharing
- Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Data flow diagram labels: Birth defect or childhood cancer cohorts; DNA Sequencing center; Birth defect BAM/VCF; Cancer BAM/VCF; dbGaP; NCI GDC; Birth defect VCF; Cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); Users]
[Diagram labels: GMKF Data Resource; dbGaP; NCI GDC; TOPMed; The Monarch Initiative; ClinGen; Matchmaker Exchange; Center for Mendelian Genomics]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
Collaboration between NIAGADS and dbGaP
Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities
Access to ADSP Data
January 25, 2015: NIH Genomic Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents
Information specific to NIAGADS: review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
• ADSP Portal uses iTrust and eRA Commons accounts for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as meta-data
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
Data production: define data formatting; collect and curate phenotypes; track data production; additional quality metrics; coordinate public data release
Secondary/derived data management: documentation (README/Manifest), packaging, and release; notification and tracking
IT support: website, members area, face-to-face meeting support, dbGaP exchange area
Facilitate interactions with other AD investigators
Rapid response to unforeseen ADSP and NIH requests
Interactions
[Diagram labels: NIAGADS; NIA/NHGRI; Baylor/Broad/WashU LSACs; NCBI dbGaP/SRA; ADSP Data Flow and other Work Groups; Genomics Center; NCRAD; ADGC/CHARGE; Management of Work Group Data/Work Flow; Exchange Area]
ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site
Data Deposition and Data Sharing
ADSP specific datasets: 68 Tb
21 reference datasets with 5560 files in use by the ADSP
>20 reference datasets planned
1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes
A 1 TB hard drive has the capacity to hold a trillion bytes
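The two statements above use different definitions of a terabyte: the conversion chain is binary (powers of 1024), while "a trillion bytes" is decimal. A short sketch of the arithmetic, with variable names chosen here for illustration only:

```python
# Binary terabyte, following the slide's chain of 1024-fold steps.
BYTES_PER_TB_BINARY = 1024 ** 4    # 1,099,511,627,776 bytes
# Decimal terabyte, i.e. "a trillion bytes" as drive capacities are quoted.
BYTES_PER_TB_DECIMAL = 10 ** 12

chain = {
    "GB": 1024,
    "MB": 1024 ** 2,
    "KB": 1024 ** 3,
    "bytes": 1024 ** 4,
}

# Fraction of a binary terabyte that a trillion-byte drive falls short by.
shortfall = 1 - BYTES_PER_TB_DECIMAL / BYTES_PER_TB_BINARY

print(chain)
print(f"A trillion-byte drive is about {shortfall:.1%} smaller than a binary TB")
```

The gap of roughly nine percent matters when sizing storage for datasets quoted in tens of terabytes.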
dbGaP/NIAGADS release for the research community at large
11,555 BAM files released 12/2014
Pedigree and phenotype information
Sequencing Quality Control Metrics
Size of the ADSP
Reference: "The real cost of sequencing: scaling computation to keep pace with data generation," Muir et al., Genome Biology 2016, 17:53
ADSP Members Area (Dashboard)
Calendar Documents Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list 153 records
352 meeting minutes
>600 other documents
>100 consent files
10 analysis plans
85TB WGS, 1075TB WES data (dbGaP/SRA)
42 TB files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
SNP/Gene reports, Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer's Disease GWAS significance level
Combine searches on GWAS results and gene annotation to find genomic features of interest
[Workflow figure labels: GWAS results combined with; SNPs below; Gene annotations transformed to SNPs; Genomic features from identified SNPs]
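The combination step described above (SNPs passing a GWAS significance cutoff, intersected with SNPs that fall inside annotated genes) can be sketched as a set intersection. Everything in this example is hypothetical: the SNP identifiers, positions, gene intervals, and the 5e-8 cutoff are placeholders for illustration and are not taken from the NIAGADS Genomics Database or its interface.

```python
# Hypothetical cutoff; the slide does not state the threshold used.
GWAS_P_CUTOFF = 5e-8

# Made-up GWAS p-values, SNP positions, and gene intervals.
gwas_pvalues = {"rs_demo_1": 3e-9, "rs_demo_2": 2e-2, "rs_demo_3": 1e-10}
snp_positions = {
    "rs_demo_1": ("chr1", 150),
    "rs_demo_2": ("chr1", 900),
    "rs_demo_3": ("chr2", 40),
}
gene_intervals = {"GENE_A": ("chr1", 100, 200), "GENE_B": ("chr2", 500, 600)}


def snps_below(pvalues, cutoff):
    """SNPs whose GWAS p-value is below the cutoff."""
    return {snp for snp, p in pvalues.items() if p < cutoff}


def genes_to_snps(intervals, positions):
    """Transform gene annotations into a SNP -> gene mapping by position."""
    mapping = {}
    for snp, (chrom, pos) in positions.items():
        for gene, (g_chrom, start, end) in intervals.items():
            if chrom == g_chrom and start <= pos <= end:
                mapping[snp] = gene
    return mapping


significant = snps_below(gwas_pvalues, GWAS_P_CUTOFF)
annotated = genes_to_snps(gene_intervals, snp_positions)
# Genomic features of interest: significant SNPs that also sit in a gene.
features = {snp: annotated[snp] for snp in sorted(significant & set(annotated))}
print(features)
```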
Find a genomic location of interest by viewing tracks on the genome browser: SNPs, genes, GWAS results, functional genomics data
View Genomic Information by the Genome Browser
Data available at the NIAGADS Genomics Database
Gene models, SNP information
Pathway annotations: Gene Ontology, KEGG Pathway
Functional genomics data: ENCODE, FANTOM5, GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics
Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
- Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
- Big data management
- IT infrastructure and web development
• Administrative support and collaboration
- Familiarity with human subjects research regulations and NIH policy
- Leadership and coordinating skills
- Flexibility, close collaboration with program officers
• Domain expertise
- Knowledge of the disease and genetic research
- Familiarity with the community and cohorts
- Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
What the Data Coordinating Center Should Know
Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss them
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements
Patients and families of those with AD
NIA ADRC program and centers
NACC, NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD
NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly
NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon
NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao
GDC Overview: GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
- Enables the identification of both high- and low-frequency cancer drivers
- Assists in defining genomic determinants of response to therapy
- Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
- Harmonization of raw sequence data, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
- Application of state-of-the-art methods of generating high level data
GDC Overview: Infrastructure
[Architecture diagram. Column headings: GDC Users; GDC System Components; GDC Interfaces. Labeled elements: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; APIs; 3rd Party Applications; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System]
GDC Overview: Resources
[Diagram labels: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators]
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
- Type 1: Large Organizations
  • Users associated with an institution or group with significant informatics resources
  • Large one-time submission or long-term ongoing data submission
  • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
- Type 2: Researchers or Individual Laboratories
  • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
  • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
  • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, both of which use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
- GDC data submitters must first apply for data submission authorization through dbGaP
- Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
- Data Sharing Requirement
  • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
  • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
- Data Pre-processing Period
  • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
  • The pre-processing period may generally last up to six months from the date of first submission
- Data Submission Period and Release
  • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
- Data Redaction
  • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
GDC Data Submission: Data Submission Process
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample (TSV, JSON); Portion (TSV, JSON); Analyte (TSV, JSON); Aliquot (TSV, JSON)
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic (TSV, JSON); Diagnosis (TSV, JSON); Exposure (TSV, JSON); Family History (TSV, JSON); Treatment (TSV, JSON)
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
- GDC clinical data elements were reviewed with members of the research community
- Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
[Diagram labels: Case; Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)); Biospecimen Data; Experiment Data]
Data Types and File Formats: Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FASTQC
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
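One way to read the QC table above is to reduce each read group to its worst module status. The sketch below does that for the example values on the slide. It illustrates the idea only and is not the GDC's reporting code; the function name and the PASS < WARNING < FAIL severity ordering are assumptions made here.

```python
# Rows with at least one non-PASS status, transcribed from the table above;
# every module not listed was PASS for all four read groups.
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]
FLAGGED_MODULES = {
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}
# Assumed severity ordering, used only to pick the worst status.
SEVERITY = {"PASS": 0, "WARNING": 1, "FAIL": 2}


def worst_status_per_read_group(modules, read_groups):
    """Return the most severe module status seen for each read group."""
    worst = {rg: "PASS" for rg in read_groups}
    for statuses in modules.values():
        for rg, status in zip(read_groups, statuses):
            if SEVERITY[status] > SEVERITY[worst[rg]]:
                worst[rg] = status
    return worst


summary = worst_status_per_read_group(FLAGGED_MODULES, READ_GROUPS)
print(summary)
```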
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
Validate data against GDC standard data types defined in the project data dictionary
Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
Command line interface to specify desired transfer protocol
Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST based programmatic interface that supports the programmatic submission of cancer data sets for analysis
Securely submit biospecimen, clinical, and experiment files using a token
Submit data to GDC by performing create, update, and retrieve actions on entities
Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
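A request to the create-entities endpoint above could be assembled as follows. This is a sketch under stated assumptions rather than the GDC's documented client: the host is the one listed in this deck, the `X-Auth-Token` header name and the entity payload fields are assumptions to check against the GDC API User's Guide, and `PROGRAM`, `PROJECT`, and the token value are placeholders. The request is only built, never sent.

```python
import json
from urllib.request import Request

# Host as listed in this deck (April 2016); treat it as an assumption.
GDC_API_BASE = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{GDC_API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Assumed header name for the authentication token.
        "X-Auth-Token": token,
    }
    return Request(url, data=body, headers=headers, method="POST")


# Hypothetical entity payload; real field names come from the data dictionary.
example_entities = [{"type": "case", "submitter_id": "example-case-001"}]
request = build_create_request("PROGRAM", "PROJECT", example_entities, "PLACEHOLDER")
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the submission; that step is left out so the sketch has no side effects.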
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure panels: DNA Sequence Pipeline; RNA Sequence Pipeline]
GDC Processing: GDC High Level Data Generation
• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
GDC Data Retrieval
• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
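The graph idea can be illustrated with a toy structure in which each node has a unique identifier and edges point from a child entity to its parent. The node types follow the bullet above, but the identifiers, the edge direction, and the `lineage` helper are assumptions for illustration and do not reflect the GDC's actual schema or storage.

```python
# Toy nodes keyed by unique identifier (placeholder IDs, not real UUIDs).
NODES = {
    "project-1": "project",
    "case-1": "case",
    "diagnosis-1": "diagnosis",
    "file-1": "file",
}
# Edges point from a child entity to the entity it belongs to.
EDGES = [
    ("case-1", "project-1"),
    ("diagnosis-1", "case-1"),
    ("file-1", "case-1"),
]


def parents(node_id):
    """Identifiers of the nodes that node_id links to directly."""
    return [dst for src, dst in EDGES if src == node_id]


def lineage(node_id):
    """Walk the edges upward and list (type, id) for every ancestor."""
    path, frontier = [], [node_id]
    while frontier:
        current = frontier.pop(0)
        for parent in parents(current):
            path.append((NODES[parent], parent))
            frontier.append(parent)
    return path


file_lineage = lineage("file-1")
print(file_lineage)
```

Starting from a data file and following identifiers upward recovers the case and project it belongs to, which is the linkage the slide describes.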
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
Data browsing by project, file, case, or annotation
Visualization allowing users to perform fine-grained filtering of search results
Data search using advanced smart search technology
Data selection into a personalized cart
Data download from cart or a high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
Command line interface to specify desired transfer protocol and multiple files for download
Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
Supports the secure transfer of controlled access data using an authentication key (token)
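A manifest-driven download might be prepared as sketched below. The manifest layout (tab-separated columns `id`, `filename`, `md5`, `size`, `state`) and the `gdc-client download -m ... -t ...` invocation are assumptions based on how the tool is commonly used and should be checked against the Data Transfer Tool User's Guide; the file entries are placeholders, and the command is only constructed, not executed.

```python
import csv
import io

# Placeholder manifest in the assumed tab-separated layout.
SAMPLE_MANIFEST = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "file-id-1\tsample_a.bam\tplaceholder-md5-1\t1000\tsubmitted\n"
    "file-id-2\tsample_b.bam\tplaceholder-md5-2\t2500\tsubmitted\n"
)


def read_manifest(text):
    """Parse manifest text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def total_bytes(rows):
    """Sum the declared file sizes to estimate the transfer volume."""
    return sum(int(row["size"]) for row in rows)


def download_command(manifest_path, token_path):
    """Argument list for a manifest-based download (assumed CLI flags)."""
    return ["gdc-client", "download", "-m", manifest_path, "-t", token_path]


rows = read_manifest(SAMPLE_MANIFEST)
volume = total_bytes(rows)
command = download_command("manifest.txt", "token.txt")
print(len(rows), "files,", volume, "bytes")
print(" ".join(command))
```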
Data Retrieval Tools: GDC API
• The GDC API is a REST based programmatic interface that allows users to search and download cancer data sets for analysis
Access to ADSP Data
January 25 2015 NIH Genomics Data Sharing policy (GDS)
bull ADSP Data Access Committee initiated with the launch of the GDS
bull dbGaP application process bull IRB and Institutional Certification
documents
Information specific to NIAGADS review by Data Use Committee bull NIA Genomic Data Sharing Plan bull NIAGADS Data Distribution Plan bull DerivedSecondary Return Plan Streamlined parallel review process
All restricted data stay at dbGaP
ADSP Portal and dbGaP
bull ADSP Portal uses iTrust and ERA Commons account for authentication no password is stored at ADSP Portal
bull All ADSP sequence and genotype data are stored atdbGaP deep phenotypic data at NIAGADS
bull ADSP Portal lists the dbGaP ADSP files as well as meta-data
bull ADSP Portal receives nightly updates of approved userlists and file lists from dbGaP
ADSP Collaboration with dbGaP
bull NIH-side After ADSP DAC approval the investigator uses NIH authentication bull Community-side Fully customized web interface bull Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
Data production Define data formatting Collect and curate phenotypes Track data production Additional quality metrics Coordinate public data release
Secondaryderived data management Documentation (READMEManifest) packaging and release Notification and tracking
IT support Website Members area Face-to-face meeting support dbGaP exchange area
Facilitate interactions with other AD investigators
Rapid response to unforeseen ADSP and NIH requests
Interactions
NIAGADS
NIANHGRI
BaylorBroadWashU LSACs
NCBI dbGaPSRA
ADSP Data Flow and other Work Groups
Genomics Center
ADGC- Alzheimerrsquos Disease Genetics Consortium CHARGE- Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD ADGCCHARGE NCRAD- National Cell Repository for Alzheimerrsquos Disease NIAGADS- NIA Genetics of Alzheimerrsquos Disease Data Storage Site
Management of Work Group DataWork Flow
Exchange Area
Data Deposition and Data Sharing
ADSP specific datasets- 68 Tb
21 Reference datasets with 5560 files in use by the ADSP
gt20 Reference datasets planned 1 TB = 1024 GB = 1048576 MB = 1073741824 KB =
1099511627776 bytes A 1 TB hard drive has the capacity to
hold a trillion bytes
dbGaPNIAGADS release for the research community at large
11555 BAM files released 122014
Pedigree and phenotype information
Sequencing Quality Control Metrics
Size of the ADSP
Reference ldquoThe real cost of sequencing scaling computation to keep pace with data generationrdquo Muir et al Genome Biology 2016 1753
ADSP Members Area (Dashboard)
Calendar Documents Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list 153 records
352 meeting minutes
gt600 other documents
gt 100 consent files
10 analysis plans
85TB WGS 1075TB WES data (dbGaPSRA)
42 TB Files (dbGaP Exchange Area)
niagadsorggenomics Web interface for query and analysis
SNPGene reports Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimerrsquos Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
GWAS results combined with
SNPs below
Gene annotations transformed to SNPs
Genomic features from identified SNPs
Find a genomic location of interest by viewing tracks on the genome browser
SNPs
genes
GWAS results
functional genomics data
View Genomic Information by the Genome Browser
search result
Data available at the NIAGADS Genomics Database
Gene models SNP information
Pathway annotations Gene Ontology
KEGG Pathway
Functional genomics data ENCODE
FANTOM5
GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics Ongoing effort to deposit annotations curated by ADSP AnnotationWG into Genomics DB
What the Data Coordinating Center Should Know
bull Bioinformatics bull Expertise in genomics next generation sequencing method and
software tool development high performance computing bull Big data management bull IT infrastructure and web development
bull Administrative support and collaboration bull Familiarity with human subjects research regulations and NIH policy bull Leadership and coordinating skills bull Flexibility close collaboration with program officers
bull Domain expertise bull Knowledge of the disease and genetic research bull Familiarity with the community and cohorts bull Familiarity with infrastructure (sequencing centers dbGaP DACs NIA
policy and infrastructure)
What the Data Coordinating Center Should Know
Lessons Learned
bull Set up the data coordinating center early bull Develop study designs and analysis plans early bull Set milestones and timelines early and carefully but
be ready to miss bull Leverage existing projects and infrastructure bull Engage a wide range of expertise bull Build on existing community resources bull Use open-source and academic solutions instead of
commercialproprietary solutions bull Engage external advisors
Acknowledgements NIAGADS EAB Matthew Farrer
Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober
NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly bull Data flow WG
bull Annotation WG IGAPADGCCHARGEEADIGERAD
NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon
NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao
GDC Overview GDC Data Submission Processing and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.
• Provide a cancer knowledge network that
  - Enables the identification of both high- and low-frequency cancer drivers
  - Assists in defining genomic determinants of response to therapy
  - Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  - Harmonization of raw sequence data from both existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  - Application of state-of-the-art methods of generating high-level data
GDC Overview Infrastructure
Figure: GDC infrastructure diagram with three groups (GDC Users, GDC System Components, GDC Interfaces). Labeled elements: Data Sources (DCC, CGHub), Data Submitters, Open Access Data Users, Controlled Access Data Users, 3rd Party Applications, eRA Commons & dbGaP, Data Access Tools, Data Submission Tools, APIs, Data Import System, Metadata & Data Storage, Reporting System, Harmonization & Generation Pipelines, Alignment & Processing Tools, Data Security System, Digital ID System.
GDC Overview Resources
• GDC Data Portal
• GDC Data Center
• GDC Data Transfer Tool
• GDC Reports
• GDC Data Model
• GDC Data Submission Portal
• GDC Application Programming Interface (API)
• GDC Bioinformatics Pipeline
• GDC Documentation and Support
• GDC Organization and Collaborators
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
GDC Data Submission Data Submitter Types
• The GDC supports two types of data submitters
  - Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  - Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API
GDC Data Submission Data Submission Policies
• dbGaP Data Submission Policies
  - GDC data submitters must first apply for data submission authorization through dbGaP
  - Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  - Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification.
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores.
  - Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  - Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication.
  - Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data.
GDC Data Submission Data Submission Process
GDC Data Submission dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter's credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP.
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | Case: TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample: TSV, JSON; Portion: TSV, JSON; Analyte: TSV, JSON; Aliquot: TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic: TSV, JSON; Diagnosis: TSV, JSON; Exposure: TSV, JSON; Family History: TSV, JSON; Treatment: TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
Data Types and File Formats Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
Data Types and File Formats Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  - GDC clinical data elements were reviewed with members of the research community
  - Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
Figure: each Case is linked to Clinical Data (Demographics, Diagnosis, Family History (optional), Exposure (optional), Treatment (optional)), Biospecimen Data, and Experiment Data.
Data Types and File Formats Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
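A QC report like the one on this slide can be scanned per read group to see which modules need attention. The sketch below is illustrative only: the statuses are copied from the example table, while the dictionary layout and the function name `flag_read_groups` are assumptions, not part of any GDC tool.

```python
# Statuses copied from the example FastQC table on this slide (illustrative data).
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]
qc_report = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flag_read_groups(report, read_groups):
    """Group every non-PASS module by read group and by status."""
    flagged = {}
    for module, statuses in report.items():
        for read_group, status in zip(read_groups, statuses):
            if status != "PASS":
                flagged.setdefault(read_group, {}).setdefault(status, []).append(module)
    return flagged


flags = flag_read_groups(qc_report, READ_GROUPS)
for read_group in READ_GROUPS:
    # Read groups with no entry passed every module.
    print(read_group, flags.get(read_group, {}))
```

Grouping by status keeps FAIL results separate from WARNING results, so a submitter can triage failures first.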
Data Submission Tools GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  - Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  - Validate data against GDC standard data types defined in the project data dictionary
  - Obtain information on the status of data submission and processing by project
Data Submission Tools GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  - Command line interface to specify the desired transfer protocol
  - Uses a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  - Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  - Securely submit biospecimen, clinical, and experiment files using a token
  - Submit data to GDC by performing create, update, and retrieve actions on entities
  - Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
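As a rough illustration of the Create Entities endpoint, the sketch below builds (but does not send) a POST request with Python's standard library. The host is the API address listed on the References slide; the `X-Auth-Token` header name, the program and project codes, and the entity fields are assumptions made for illustration and should be checked against the GDC API documentation.

```python
import json
import urllib.request

API_ROOT = "https://gdc-api.nci.nih.gov"  # API host listed on the References slide


def build_create_request(program, project, entities, token):
    """Build, without sending, a POST request for /v0/submission/<program>/<project>."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Auth-Token": token,  # assumption: header that carries the authentication token
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Hypothetical program, project, and entity values, for illustration only.
example_entity = {"type": "case", "submitter_id": "EXAMPLE-CASE-001"}
request = build_create_request("EXAMPLE", "PROJ1", [example_entity], token="<your-token>")
print(request.get_method(), request.full_url)
```

Sending the request (for example with `urllib.request.urlopen(request)`) is left out on purpose, since it requires a valid token and an authorized project.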
GDC Data Submission Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per the dbGaP authorization policies associated with the data set
GDC Processing Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing DNA and RNA Sequence Harmonization Pipelines
Figure: DNA Sequence Pipeline and RNA Sequence Pipeline diagrams.
GDC Processing GDC High Level Data Generation
• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing GDC Variant Calling Pipelines
GDC Data Retrieval
• Once data is submitted into the GDC and the data submitter signs off on the data to request data release, the GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
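To make the graph idea concrete, here is a minimal sketch of nodes and edges linking a project to cases, clinical data, experiment data, and file objects by unique identifiers. The `Graph` class, the node types, and every identifier other than the project code are hypothetical; this is not the GDC's actual schema or implementation.

```python
class Graph:
    """Tiny directed graph: nodes carry a type, edges point from parent to child."""

    def __init__(self):
        self.nodes = {}  # node_id -> {"type": ...}
        self.edges = {}  # parent_id -> [child_id, ...]

    def add_node(self, node_id, node_type):
        self.nodes[node_id] = {"type": node_type}

    def add_edge(self, parent_id, child_id):
        self.edges.setdefault(parent_id, []).append(child_id)

    def descendants(self, node_id, node_type):
        """Return sorted ids of all nodes of node_type reachable from node_id."""
        found, stack, seen = [], [node_id], {node_id}
        while stack:
            current = stack.pop()
            for child in self.edges.get(current, []):
                if child in seen:
                    continue
                seen.add(child)
                if self.nodes[child]["type"] == node_type:
                    found.append(child)
                stack.append(child)
        return sorted(found)


# Hypothetical identifiers; only the project code appears elsewhere in this deck.
g = Graph()
g.add_node("TCGA-SKCM", "project")
g.add_node("case-1", "case")
g.add_node("demographic-1", "clinical")
g.add_node("reads-1", "experiment")
g.add_node("file-uuid-1", "file")
g.add_edge("TCGA-SKCM", "case-1")
g.add_edge("case-1", "demographic-1")
g.add_edge("case-1", "reads-1")
g.add_edge("reads-1", "file-uuid-1")

files = g.descendants("TCGA-SKCM", "file")
print(files)
```

Walking edges from a project down to its files is the kind of traversal that lets a query on projects or cases resolve to the underlying data file objects.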
GDC Data Retrieval GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  - Data browsing by project, file, case, or annotation
  - Visualization allowing users to perform fine-grained filtering of search results
  - Data search using advanced smart search technology
  - Data selection into a personalized cart
  - Data download from the cart or a high-performance Data Transfer Tool
GDC Data Retrieval GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  - Command line interface to specify the desired transfer protocol and multiple files for download
  - Uses a manifest file generated from the GDC Data Portal for multiple downloads
  - Supports the secure transfer of controlled access data using an authentication key (token)
Data Retrieval Tools GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  - Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  - Securely retrieve biospecimen, clinical, and molecular files
  - Perform BAM slicing
Example request (API URL, endpoint, and query parameters):
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Example response (portion removed for readability):
{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
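The request on this slide can be assembled from its parts (API URL, endpoint, query parameters), and a response of the shape shown can be reduced to a simple lookup. The sketch below does both without touching the network; the helper names are assumptions, and the sample response is a shortened copy of the hits listed on the slide.

```python
import json
from urllib.parse import urlencode

API_ROOT = "https://gdc-api.nci.nih.gov"  # API URL from the slide


def build_query_url(endpoint, fields, pretty=True):
    """Join the API URL, the endpoint, and the query parameters into one URL."""
    params = {"fields": ",".join(fields), "pretty": str(pretty).lower()}
    # safe="," keeps the comma-separated field list readable instead of %2C.
    return f"{API_ROOT}/{endpoint}?{urlencode(params, safe=',')}"


def primary_sites(response_text):
    """Map project_id -> primary_site from a response shaped like the slide's example."""
    payload = json.loads(response_text)
    return {hit["project_id"]: hit["primary_site"] for hit in payload["data"]["hits"]}


url = build_query_url("projects", ["project_id", "primary_site"])
print(url)

# Shortened copy of the example response; a live call would return more hits.
sample_response = """{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LUSC", "primary_site": "Lung"}
]}}"""
sites = primary_sites(sample_response)
print(sites)
```

Keeping URL construction separate from response parsing makes it easy to swap in other endpoints or field lists.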
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  - Targets data consumers, providers, developers, and general users
  - Provides access to information about GDC and contributed cancer genomic data sets
  - Instructs users on the use of GDC data access and submission tools
  - Provides descriptions of GDC bioinformatics pipelines
  - Documents supported GDC data types and file formats
  - Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site
  - https://gdc.nci.nih.gov
• GDC Documentation Site
  - https://gdc-docs.nci.nih.gov
• GDC Data Portal
  - https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  - https://gdc-portal.nci.nih.gov/submission
  - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  - https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  - https://gdc-api.nci.nih.gov
• GDC Support
  - GDC Assistance: support@nci-gdc.datacommons.io
  - GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype)
  - Aggregation
  - Access
  - Sharing
  - Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
Figure: data flow diagram. Labeled elements: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF (dbGaP); cancer BAM/VCF (NCI GDC); birth defect VCF and cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users.
Figure: the GMKF Data Resource shown with related resources: dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics.
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
ADSP Portal and dbGaP
• The ADSP Portal uses iTrust and eRA Commons accounts for authentication; no password is stored at the ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data are stored at NIAGADS
• The ADSP Portal lists the dbGaP ADSP files as well as metadata
• The ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
ADSP Collaboration with dbGaP
• NIH side: after ADSP DAC approval, the investigator uses NIH authentication
• Community side: fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS ADSP Data Coordinating Center
Data production
• Define data formatting
• Collect and curate phenotypes
• Track data production
• Additional quality metrics
• Coordinate public data release
Secondary/derived data management
• Documentation (README/Manifest), packaging, and release
• Notification and tracking
IT support
• Website
• Members area
• Face-to-face meeting support
• dbGaP exchange area
Facilitate interactions with other AD investigators
Rapid response to unforeseen ADSP and NIH requests
Interactions
Figure: interaction diagram centered on NIAGADS. Labeled elements: NIA/NHGRI; Baylor/Broad/WashU LSACs (genomics centers); NCBI dbGaP/SRA; ADSP Data Flow and other Work Groups; NCRAD; ADGC/CHARGE; management of work group data/work flow; Exchange Area.
ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site
Data Deposition and Data Sharing
ADSP-specific datasets: 68 TB
21 reference datasets with 5,560 files in use by the ADSP
>20 reference datasets planned
1 TB = 1,024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes
A 1 TB hard drive has the capacity to hold a trillion bytes
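The unit chain above uses binary multiples (factors of 1,024), which is why the byte count is slightly more than a round trillion. A short sketch of the arithmetic follows; the function name is an assumption, and the 68 TB figure is the ADSP dataset size quoted on this slide.

```python
def binary_units(terabytes):
    """Convert terabytes to smaller units using factors of 1,024, as on the slide."""
    gb = terabytes * 1024
    mb = gb * 1024
    kb = mb * 1024
    return {"GB": gb, "MB": mb, "KB": kb, "bytes": kb * 1024}


one_tb = binary_units(1)
decimal_tb_bytes = 10 ** 12  # "a trillion bytes", the decimal definition

# Ratio of the binary terabyte to the decimal terabyte (just under 1.1).
ratio = one_tb["bytes"] / decimal_tb_bytes

# Size of the ADSP-specific datasets in bytes, using the same binary convention.
adsp_bytes = binary_units(68)["bytes"]

print(one_tb)
print(f"binary/decimal ratio: {ratio:.4f}")
print(f"68 TB = {adsp_bytes:,} bytes")
```

The roughly 10% gap between the two conventions matters when budgeting storage at the scale of tens of terabytes.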
dbGaPNIAGADS release for the research community at large
11,555 BAM files released 12/2014
Pedigree and phenotype information
Sequencing Quality Control Metrics
Size of the ADSP
Reference: "The real cost of sequencing: scaling computation to keep pace with data generation", Muir et al., Genome Biology 2016, 17:53
ADSP Members Area (Dashboard)
Calendar Documents Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list 153 records
352 meeting minutes
>600 other documents
>100 consent files
10 analysis plans
85TB WGS 1075TB WES data (dbGaP/SRA)
42 TB Files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
SNP/Gene reports, Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer's Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
GWAS results combined with
SNPs below
Gene annotations transformed to SNPs
Genomic features from identified SNPs
Find a genomic location of interest by viewing tracks on the genome browser
SNPs
genes
GWAS results
functional genomics data
View Genomic Information by the Genome Browser
search result
Data available at the NIAGADS Genomics Database
Gene models, SNP information
Pathway annotations: Gene Ontology, KEGG Pathway
Functional genomics data: ENCODE, FANTOM5, GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics
Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB
What the Data Coordinating Center Should Know
bull Bioinformatics bull Expertise in genomics next generation sequencing method and
software tool development high performance computing bull Big data management bull IT infrastructure and web development
bull Administrative support and collaboration bull Familiarity with human subjects research regulations and NIH policy bull Leadership and coordinating skills bull Flexibility close collaboration with program officers
bull Domain expertise bull Knowledge of the disease and genetic research bull Familiarity with the community and cohorts bull Familiarity with infrastructure (sequencing centers dbGaP DACs NIA
policy and infrastructure)
What the Data Coordinating Center Should Know
Lessons Learned
bull Set up the data coordinating center early bull Develop study designs and analysis plans early bull Set milestones and timelines early and carefully but
be ready to miss bull Leverage existing projects and infrastructure bull Engage a wide range of expertise bull Build on existing community resources bull Use open-source and academic solutions instead of
commercialproprietary solutions bull Engage external advisors
Acknowledgements NIAGADS EAB Matthew Farrer
Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober
NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly bull Data flow WG
bull Annotation WG IGAPADGCCHARGEEADIGERAD
NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon
NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao
GDC Overview GDC Data Submission Processing and Retrieval
April 2016
Agenda
n
GDC Overview
GDC Data Submissio
GDC Data Processing
GDC Data Retrieval
154
GDC Overview Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables
data sharing across cancer genomic studies in support of precision medicine
bull Provide a cancer knowledge network that ndash Enables the identification of both high- and low-frequency cancer drivers ndash Assists in defining genomic determinants of response to therapy ndash Informs the composition of clinical trial cohorts sharing targeted genetic lesions
bull Support the receipt quality control integration storage and redistribution of standardized genomic data sets derived from cancer research studies ndash Harmonization of raw sequence both from existing (eg TCGA TARGET
CGCI) and new cancer research programs
ndash Application of state-of-the-art methods of generating high level data 155
GDC Overview Infrastructure
156
Data Sources (DCC CGHub)
Data Submitters
Open Access Data Users
Controlled Access Data Users
eRACommons amp dbGaP Data
Access Tools
Data Import System
Metadata amp Data Storage
Reporting System
Harmonization amp Generation
Pipelines
3rd Party Applications
GDC Users GDC System Components GDC Interfaces
Alignment amp Processing Tools Data
Submission Tools
Data Security System
APIs
Digital ID System
GDC Overview Resources
157
GDC Data Portal
GDC Data Center
GDC Data Transfer Tool GDC Reports
GDC Data Model
GDC Data Submission
Portal
GDC ApplicatioProgrammingInterface (API)
n GDC Bioinformatics
Pipeline
GDC Documentation
and Support
GDC Organization
and Collaborators
GDC Organization and Collaborators
bull GDC Government Sponsor
NCI Center for Cancer Genomics (CCG)
University of Chicago Team
bull Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
bull GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
bull Contracting organization supporting GDC management and execution
Other Government External
bull Includes eRA dbGaP NCI CBIIT for data access and security compliance bull Includes the GDC Steering Committee Bioinformatics Advisory Group Data
Submitters and User Acceptance Testers (UAT) Testers 158
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
159
GDC Data Submission Data Submitter Types
bull The GDC supports two types of data submitters
ndash Type 1 Large Organizations bull Users associated with an institution or group with significant informatics resources bull Large one-time submission or long-term ongoing data submission bull Primarily use the GDC Application Programming Interfaces (API) or the GDC
Data Transfer Tool (command line interface) for data submission
ndash Type 2 Researchers or Individual Laboratories bull Users associated with a single group (such as a laboratory investigator or
researcher) with limited informatics resources bull One-time or sporadic uploads of low volumes of patient and analysis data with
varying levels of completeness bull Submit via the web-based GDC Data Submission Portal and the GDC Data
Transfer Tool that use the GDC API
160
GDC Data Submission Data Submission Policies
bull dbGaP Data Submission Policies ndash GDC data submitters must first apply for data submission authorization through dbGaP ndash Data submission through dbGaP requires institutional certification under NIHrsquos Genomi c
Data Sharing Policy bull GDC Data Sharing Policies
ndash Data Sharing Requirement bull Data submitted to the GDC will be made available to the scientific community according to the
data submitterrsquos NCI Genomic Data Sharing Plan Controlled access data will be made availabl e to members of the community having the appropriate dbGaP Data Use Certification
bull The GDC will produce harmonized data (raw and derived) based on the originally submitted data The GDC will not preserve an exact copy of the originally submitted data however the GDC will preserve the original reads and quality scores
ndash Data Pre-processing Period bull For each project the GDC will afford a pre-processing period of exclusive that allows for
submitters to perform data cleaning and quality and submission of revised data before release bull The pre-processing period may generally last up to six months from the date of first submission
ndash Data Submission Period and Release bull Once submitted data will be processed and validated by the GDC Submitted data will be
released and available via controlled access for research that is consistent with the datasetrsquos ldquodata use limitationsrdquo either six months after data submission or at the time of first publication
ndash Data Redaction bull The GDC in general will not remove data access in response to submitter requests GDC will
remove data access in the following events Data Management Incident Human Subjects 161
Compliance Issue Erroneous Data
GDC Data Submission Data Submission Process
162
GDC Data Submission dbGaP Registration
bull Data submitters register their studies in dbGaP submit the Subject Identifiers associated with the study in dbGaP and contact the GDC
bull The GDC verifies that the study has been registered in dbGaP verifies the data submitter credentials Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process httpwwwncbinlmnihgovprojectsgapcgi-binGetPdfcgidocument_name=HowToSubmitpdf
163
GDC Data Submission Upload and Validate Data
bull Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal GDC Data Transfer Tool or GDC Application Programming Interface (API)
bull Uploaded data must follow GDC supported data types and file formats which leverage and extend existing data standards for biospecimen clinical and experiment data
164
Data Types and File Formats (1 of 2)
bull The GDC provides a data dictionary and template files for each data type
165
Data Type Data Subtype Format Data Dictionary Template
Administrative Administrative Data TSV JSON Case TSV JSON
Biospecimen Biospecimen Data TSV JSON
Sample Portion Analyte Aliquot
Sample TSV JSON Portion TSV JSON Analyte TSV JSON Aliquot TSV JSON
Clinical Clinical Data TSV JSON
Demographic Diagnosis Exposure Family History Treatment
Demographic TSV JSON Diagnosis TSV JSON Exposure TSV JSON Family History TSV JSON Treatment TSV JSON
Data File
Analysis Metadata SRA XML MAGE-TAB (SDRF IDF) Analysis Metadata TSV JSON
Biospecimen Metadata BCR XML GDC-approved spreadsheet
Biospecimen Metadata TSV JSON
Clinical Metadata BCR XML GDC-approved spreadsheet Clinical Metadata TSV JSON
Data Types and File Formats (2 of 2)
Data Type Data Subtype Format Data Dictionary Template
Data File
Experimental Metadata Pathology Report Run Metadata Slide Image
Submitted Unaligned Reads
SRA XML
PDF SRA XML SVS
FASTQ BAM
Experimental Metadata Pathology Report
Submitted Unaligned Reads
TSV JSON
TSV JSON TSV JSON TSV JSON
TSV JSON
Submitted Aligned Reads BAM
Submitted Aligned Reads TSV JSON
Data Bundle Read Group
Slide
Read Group
Slide
TSV JSON
TSV JSON
166
Data Types and File Formats Biospecimen Data
bull Biospecimen data types may include samples aliquots analytes and portions
bull GDC supports the submission of biospecimen data in XML JSON or TSV file format
167
Data Types and File Formats Clinical Data
bull GDC clinical data types are associated with each case and include required preferred and optional clinical data elements for demographics diagnosis family history exposure and treatment ndash GDC clinical data elements were reviewed with members of the research community ndash Clinical data elements are defined in the GDC dictionary and registered in the NCIrsquos
Cancer Data Standards Repository (caDSR)
Case
Clinical Data
Demographics Diagnosis Family History(Optional) Exposure (Optional) Treatment
(Optional)
Biospecimen Data Experiment Data
168
Data Types and File Formats: Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name                 RG_1      RG_2      RG_3      RG_4
Total Sequences                 502123    123456    658101    788965
Read Length                     75        75        75        75
GC Content (%)                  49        50        48        50
Basic Statistics                PASS      PASS      PASS      PASS
Per base sequence quality       PASS      PASS      PASS      PASS
Per tile sequence quality       PASS      WARNING   PASS      PASS
Per sequence quality scores     FAIL      PASS      PASS      PASS
Per base sequence content       PASS      PASS      WARNING   PASS
Per sequence GC content         PASS      PASS      PASS      PASS
Per base N content              PASS      PASS      PASS      PASS
Sequence Length Distribution    PASS      PASS      PASS      PASS
Sequence Duplication Levels     WARNING   WARNING   PASS      PASS
Overrepresented sequences       PASS      PASS      PASS      PASS
Adapter Content                 PASS      FAIL      PASS      PASS
Kmer Content                    PASS      PASS      WARNING   PASS
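The per-read-group QC results in the table above lend themselves to a simple programmatic summary. The following is a minimal Python sketch, not GDC code: it hard-codes the module statuses from the example table and reports which read groups have a given status. The dictionary layout and the function name `read_groups_with_status` are assumptions made for illustration.

```python
# Read groups in the column order used by the example table.
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Module statuses copied from the example table, one entry per read group.
QC_RESULTS = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def read_groups_with_status(results, status):
    """Map each read group to the QC modules that reported `status`."""
    flagged = {}
    for module, statuses in results.items():
        for group, value in zip(READ_GROUPS, statuses):
            if value == status:
                flagged.setdefault(group, []).append(module)
    return flagged


if __name__ == "__main__":
    for group, modules in read_groups_with_status(QC_RESULTS, "FAIL").items():
        print(group, "failed:", ", ".join(modules))
```

A submitter could run a check of this kind before upload to decide which read groups need attention; how the GDC itself acts on FAIL or WARNING statuses is not stated in the slides.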
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify the desired transfer protocol
  – Uses a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
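To make the create-entities endpoint above more concrete, here is a hedged Python sketch that builds (but does not send) such a request using only the standard library. The host is the API address given in this deck's reference list; the `X-Auth-Token` header name, the example entity fields, and the function name `build_create_request` are assumptions for illustration and should be checked against the GDC API documentation.

```python
import json
import urllib.request

# API host as written in this deck's references; treat it as an assumption.
GDC_API = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build, without sending, a POST to /v0/submission/<program>/<project>."""
    url = f"{GDC_API}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Auth-Token": token,  # assumed header name carrying the token
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


if __name__ == "__main__":
    # Hypothetical entity; real fields are defined by the GDC data dictionary.
    case = {"type": "case", "submitter_id": "EXAMPLE-CASE-1"}
    request = build_create_request("PROGRAM", "PROJECT", [case], token="<token>")
    print(request.get_method(), request.full_url)
```

Keeping request construction separate from sending makes it easy to inspect the URL and payload before any controlled-access token leaves the machine.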
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access, per dbGaP authorization policies associated with the data set
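The six-month release window can be illustrated with a small date calculation. This is a sketch under stated assumptions: the slides do not say how the GDC counts "six months", so the code simply adds six calendar months and clamps to the last day of shorter months. The function names `add_months` and `release_deadline` are hypothetical.

```python
import calendar
from datetime import date


def add_months(start, months):
    """Return the date `months` calendar months after `start`, clamping the day."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def release_deadline(submission_date, window_months=6):
    """Latest release date, assuming a calendar-month window (an assumption)."""
    return add_months(submission_date, window_months)


if __name__ == "__main__":
    print(release_deadline(date(2016, 4, 15)))
```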
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]
GDC Processing: GDC High Level Data Generation
• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
GDC Data Retrieval
• Once data is submitted to the GDC and the data submitter signs off on the data to request data release, the GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects by means of unique identifiers
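The graph idea described above can be sketched in a few lines of Python. This is an illustrative toy, not the GDC schema: the node labels, the identifiers, and the parent-to-child edge direction are all assumptions chosen to show how unique identifiers link projects and cases to their files.

```python
from collections import defaultdict


class DataModelGraph:
    """Tiny directed graph: nodes carry a label, edges go parent -> child."""

    def __init__(self):
        self.labels = {}                   # node id -> label
        self.children = defaultdict(set)   # node id -> ids of child nodes

    def add_node(self, node_id, label):
        self.labels[node_id] = label

    def add_edge(self, parent_id, child_id):
        self.children[parent_id].add(child_id)

    def descendants(self, start_id, label):
        """Return sorted ids of nodes with `label` reachable from `start_id`."""
        found, stack, seen = [], [start_id], {start_id}
        while stack:
            current = stack.pop()
            for child in self.children[current]:
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
                    if self.labels[child] == label:
                        found.append(child)
        return sorted(found)


def build_example():
    """Build a hypothetical project with two cases and their linked records."""
    graph = DataModelGraph()
    nodes = [
        ("project-1", "project"), ("case-1", "case"), ("case-2", "case"),
        ("clinical-1", "clinical"), ("file-1", "file"), ("file-2", "file"),
    ]
    edges = [
        ("project-1", "case-1"), ("project-1", "case-2"),
        ("case-1", "clinical-1"), ("case-1", "file-1"), ("case-2", "file-2"),
    ]
    for node_id, label in nodes:
        graph.add_node(node_id, label)
    for parent_id, child_id in edges:
        graph.add_edge(parent_id, child_id)
    return graph


if __name__ == "__main__":
    print(build_example().descendants("project-1", "file"))
```

Because every record is reached by following edges from a case or project, a query such as "all files for this case" reduces to a graph traversal keyed on identifiers.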
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from the cart or with the high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify the desired transfer protocol and multiple files for download
  – Uses a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)
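A manifest-driven download as described above might be scripted along these lines. This is a sketch only: it assembles the command line without running it, and it assumes the tool is installed as `gdc-client` with a `download` subcommand that takes `-m` for the manifest and `-t` for the token file. Confirm the flags with the tool's own help output before relying on them; the function name `download_command` is hypothetical.

```python
import shlex


def download_command(manifest_path, token_path=None, client="gdc-client"):
    """Assemble the argument list for a manifest-based download (not executed)."""
    args = [client, "download", "-m", manifest_path]
    if token_path is not None:
        # A token is only needed for controlled access data.
        args += ["-t", token_path]
    return args


if __name__ == "__main__":
    command = download_command("gdc_manifest.txt", token_path="gdc_token.txt")
    print(shlex.join(command))
```

Returning a list rather than a single string means the arguments can be passed directly to `subprocess.run` without shell quoting problems.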
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing
Example request, made up of the API URL, the endpoint, URL parameters, and query parameters:
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Example response:
{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
    (portion removed for readability)
  }
}
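The projects query shown on this slide can be assembled and its response parsed with the Python standard library. The sketch below makes no network call: the host is the one written in the deck, the sample response is a shortened copy of the slide's output, and the function names `projects_url` and `sites_by_project` are hypothetical.

```python
import json
from urllib.parse import urlencode

# API host as written in the slides; treat it as an assumption.
GDC_API = "https://gdc-api.nci.nih.gov"

# Shortened copy of the example response, used instead of a live request.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}
"""


def projects_url(fields, pretty=True):
    """Build the /projects query URL for the requested fields."""
    params = {"fields": ",".join(fields), "pretty": str(pretty).lower()}
    # Keep commas literal so the field list reads as on the slide.
    return f"{GDC_API}/projects?{urlencode(params, safe=',')}"


def sites_by_project(response_text):
    """Map project_id to primary_site from a /projects JSON response."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


if __name__ == "__main__":
    print(projects_url(["project_id", "primary_site"]))
    print(sites_by_project(SAMPLE_RESPONSE))
```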
GDC Web Site
• The GDC Web Site provides access to GDC resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about the GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site – https://gdc.nci.nih.gov
• GDC Documentation Site – https://gdc-docs.nci.nih.gov
• GDC Data Portal – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data
  – Aggregation
  – Access
  – Sharing
  – Analysis
Whole genome sequence + Phenotype
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Figure: data flow diagram. Labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF and cancer BAM/VCF; dbGaP and NCI GDC; birth defect VCF and cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users]
[Figure: the GMKF Data Resource shown alongside dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
ADSP Collaboration with dbGaP
• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure
NIAGADS: ADSP Data Coordinating Center
• Data production
  – Define data formatting
  – Collect and curate phenotypes
  – Track data production
  – Additional quality metrics
  – Coordinate public data release
• Secondary/derived data management
  – Documentation (README/Manifest), packaging, and release
  – Notification and tracking
• IT support
  – Website
  – Members area
  – Face-to-face meeting support
  – dbGaP exchange area
• Facilitate interactions with other AD investigators
• Rapid response to unforeseen ADSP and NIH requests
Interactions
[Figure: interaction diagram. Labels: NIAGADS; NIA/NHGRI; Baylor/Broad/WashU LSACs; NCBI dbGaP/SRA; ADSP Data Flow and other Work Groups; Genomics Center; NCRAD; ADGC/CHARGE; Management of Work Group Data/Work Flow; Exchange Area]
ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site
ADSP specific datasets- 68 Tb
21 Reference datasets with 5560 files in use by the ADSP
gt20 Reference datasets planned 1 TB = 1024 GB = 1048576 MB = 1073741824 KB =
1099511627776 bytes A 1 TB hard drive has the capacity to
hold a trillion bytes
dbGaPNIAGADS release for the research community at large
11555 BAM files released 122014
Pedigree and phenotype information
Sequencing Quality Control Metrics
Size of the ADSP
Reference ldquoThe real cost of sequencing scaling computation to keep pace with data generationrdquo Muir et al Genome Biology 2016 1753
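The unit conversion quoted on this slide (1 TB = 1024 GB = 1048576 MB = 1073741824 KB = 1099511627776 bytes) uses binary multiples of 1024. A short Python check reproduces those figures; the function name `tb_to` is hypothetical.

```python
# Binary convention used on the slide: each step down multiplies by 1024.
STEP = 1024
EXPONENTS = {"GB": 1, "MB": 2, "KB": 3, "bytes": 4}


def tb_to(unit, terabytes=1):
    """Convert terabytes to the given smaller unit using 1024-based steps."""
    return terabytes * STEP ** EXPONENTS[unit]


if __name__ == "__main__":
    for unit in EXPONENTS:
        print(f"1 TB = {tb_to(unit)} {unit}")
    # Ratio of a binary terabyte to a decimal trillion (10**12) bytes.
    print(tb_to("bytes") / 10**12)
```

The last line shows that a binary terabyte is roughly 10% larger than a decimal trillion bytes, which is why the two figures on the slide do not match exactly.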
ADSP Members Area (Dashboard)
• Calendar, documents, conference call minutes
• Reference dataset catalog
• Information on funded cooperative agreements
• Bulletin board for Work Groups
• Notification of new datasets
ADSP by the numbers
• Member list: 153 records
• 352 meeting minutes
• >600 other documents
• >100 consent files
• 10 analysis plans
• 85TB WGS, 1075TB WES data (dbGaP/SRA)
• 42 TB files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
• SNP/Gene reports
• Genome Browser interface
• Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer's Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
[Figure labels: GWAS results combined with; SNPs below; Gene annotations transformed to SNPs; Genomic features from identified SNPs]
Find a genomic location of interest by viewing tracks on the genome browser
• SNPs
• Genes
• GWAS results
• Functional genomics data
View Genomic Information by the Genome Browser
[Figure: genome browser view of a search result]
Data available at the NIAGADS Genomics Database
• Gene models
• SNP information
• Pathway annotations: Gene Ontology, KEGG Pathway
• Functional genomics data: ENCODE, FANTOM5, GTEx results
• NHGRI GWAS catalog
• Published genome-wide AD GWAS summary statistics
• Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB
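The search strategy described in the slides above (filter GWAS results by significance, turn gene annotations into SNP sets, then combine the two) amounts to a set intersection. The Python sketch below illustrates that idea only; it is not the NIAGADS implementation, and every SNP id, gene name, p-value, and the default threshold of 5e-8 are made-up placeholders.

```python
# Hypothetical GWAS summary rows: (SNP id, p-value). Not real results.
GWAS = [("rs0001", 3e-9), ("rs0002", 0.02), ("rs0003", 4e-8), ("rs0004", 1e-10)]

# Hypothetical annotation: gene name -> SNPs located in that gene.
GENE_TO_SNPS = {
    "GENE_A": {"rs0001", "rs0002"},
    "GENE_B": {"rs0003"},
    "GENE_C": {"rs0005"},
}


def significant_snps(gwas, threshold):
    """SNPs whose p-value is below the chosen significance level."""
    return {snp for snp, p_value in gwas if p_value < threshold}


def snps_for_genes(gene_to_snps, genes):
    """Transform a list of annotated genes into the SNPs they contain."""
    snps = set()
    for gene in genes:
        snps |= gene_to_snps.get(gene, set())
    return snps


def combine(gwas, gene_to_snps, genes, threshold=5e-8):
    """Intersect significant GWAS SNPs with SNPs from the selected genes."""
    return sorted(significant_snps(gwas, threshold) & snps_for_genes(gene_to_snps, genes))


if __name__ == "__main__":
    print(combine(GWAS, GENE_TO_SNPS, ["GENE_A", "GENE_B"]))
```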
What the Data Coordinating Center Should Know
• Bioinformatics
  – Expertise in genomics, next generation sequencing, method and software tool development, and high performance computing
  – Big data management
  – IT infrastructure and web development
• Administrative support and collaboration
  – Familiarity with human subjects research regulations and NIH policy
  – Leadership and coordinating skills
  – Flexibility and close collaboration with program officers
• Domain expertise
  – Knowledge of the disease and genetic research
  – Familiarity with the community and cohorts
  – Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
What the Data Coordinating Center Should Know: Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements
• Patients and families of those with AD
• NIA ADRC program and centers
• NACC, NCRAD
• ADSP: Data flow WG, Annotation WG
• IGAP/ADGC/CHARGE/EADI/GERAD
• NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly
• NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon
• NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao
GDC Overview: GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence data, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high-level data
GDC Overview: Infrastructure
[Figure: GDC infrastructure diagram organized into GDC Users, GDC System Components, and GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; 3rd Party Applications; Alignment & Processing Tools; Data Submission Tools; Data Security System; APIs; Digital ID System]
GDC Overview: Resources
• GDC Data Portal
• GDC Data Center
• GDC Data Transfer Tool
• GDC Reports
• GDC Data Model
• GDC Data Submission Portal
• GDC Application Programming Interface (API)
• GDC Bioinformatics Pipeline
• GDC Documentation and Support
• GDC Organization and Collaborators
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning, quality assurance, and submission of revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations", either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
GDC Data Submission: Data Submission Process
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample | TSV, JSON
 | | | Portion | TSV, JSON
 | | | Analyte | TSV, JSON
 | | | Aliquot | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic | TSV, JSON
 | | | Diagnosis | TSV, JSON
 | | | Exposure | TSV, JSON
 | | | Family History | TSV, JSON
 | | | Treatment | TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
 | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
 | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
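Since templates are offered in both TSV and JSON, it can help to see how one form maps to the other. The following Python sketch converts a tab-separated template into a JSON array of records using the standard library. The column names (`type`, `submitter_id`) and the sample values are hypothetical placeholders, not fields taken from the GDC data dictionary.

```python
import csv
import io
import json

# Hypothetical two-column template; real columns come from the data dictionary.
SAMPLE_TSV = "type\tsubmitter_id\ncase\tEXAMPLE-CASE-1\ncase\tEXAMPLE-CASE-2\n"


def tsv_to_json(tsv_text):
    """Convert tab-separated rows with a header line into a JSON array."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return json.dumps([dict(row) for row in reader], indent=2)


if __name__ == "__main__":
    print(tsv_to_json(SAMPLE_TSV))
```

Each TSV row becomes one JSON object keyed by the header names, so the same records can be prepared in a spreadsheet and submitted in either format.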
NIAGADS ADSP Data Coordinating Center
Data production: define data formatting; collect and curate phenotypes; track data production; additional quality metrics; coordinate public data release
Secondary/derived data management: documentation (README/Manifest), packaging, and release; notification and tracking
IT support: website; members area; face-to-face meeting support; dbGaP exchange area
Facilitate interactions with other AD investigators
Rapid response to unforeseen ADSP and NIH requests
Interactions
[Figure: interaction diagram. Labels: NIAGADS; NIA/NHGRI; Baylor/Broad/WashU LSACs; NCBI dbGaP/SRA; ADSP Data Flow and other Work Groups; Genomics Center; NCRAD; ADGC/CHARGE]
Abbreviations:
ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site
Management of Work Group Data/Work Flow
Exchange Area
Data Deposition and Data Sharing
ADSP-specific datasets: 68 TB
21 reference datasets with 5560 files in use by the ADSP
>20 reference datasets planned
1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes
A 1 TB hard drive has the capacity to hold a trillion bytes
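The unit arithmetic on this slide can be checked with a short sketch. It is only an illustration: it assumes the chain of equalities uses binary multiples (powers of 1024), while the "trillion bytes" figure quoted for a hard drive is the decimal convention (10^12). The variable names are made up for this example.

```python
# Binary convention, matching the chain of equalities on the slide:
# each step (bytes -> KB -> MB -> GB -> TB) multiplies by 1024.
STEP = 1024
binary_tb_in_gb = STEP
binary_tb_in_mb = STEP ** 2
binary_tb_in_kb = STEP ** 3
binary_tb_in_bytes = STEP ** 4

# Decimal convention, matching "a trillion bytes".
decimal_tb_in_bytes = 10 ** 12

# Fraction by which a decimal terabyte falls short of a binary one.
shortfall = 1 - decimal_tb_in_bytes / binary_tb_in_bytes

print(binary_tb_in_bytes)
print(decimal_tb_in_bytes)
print(f"{shortfall:.2%}")
```

The two conventions differ by roughly nine percent, which is worth keeping in mind when comparing quoted dataset sizes against disk capacity.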
dbGaP/NIAGADS release for the research community at large
11,555 BAM files released December 2014
Pedigree and phenotype information
Sequencing Quality Control Metrics
Size of the ADSP
Reference: "The real cost of sequencing: scaling computation to keep pace with data generation", Muir et al., Genome Biology 2016, 17:53
ADSP Members Area (Dashboard)
Calendar, documents, conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list: 153 records
352 meeting minutes
>600 other documents
>100 consent files
10 analysis plans
85TB WGS, 1075TB WES data (dbGaP/SRA)
42 TB files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
SNP/Gene reports, Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer's Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
GWAS results combined with
SNPs below
Gene annotations transformed to SNPs
Genomic features from identified SNPs
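The combined search described above (GWAS hits below a significance level, intersected with SNPs derived from gene annotations) can be sketched as a set intersection. This is not the NIAGADS implementation; the function names, SNP identifiers, gene names, p-values, and threshold below are all hypothetical and exist only to show the shape of the operation.

```python
def snps_below(gwas_results, threshold):
    """Return the SNPs whose GWAS p-value is below the threshold."""
    return {snp for snp, p_value in gwas_results.items() if p_value < threshold}


def snps_for_genes(gene_to_snps, genes):
    """Transform a gene selection into the set of SNPs annotated to those genes."""
    selected = set()
    for gene in genes:
        selected |= set(gene_to_snps.get(gene, ()))
    return selected


# Hypothetical toy inputs (not real GWAS data).
gwas_results = {"snp_a": 1e-9, "snp_b": 0.2, "snp_c": 3e-8, "snp_d": 1e-10}
gene_to_snps = {"GENE1": ["snp_a", "snp_b"], "GENE2": ["snp_c"], "GENE3": ["snp_d"]}

significant = snps_below(gwas_results, threshold=5e-8)
annotated = snps_for_genes(gene_to_snps, ["GENE1", "GENE2"])

# SNPs that satisfy both searches.
combined = significant & annotated
print(sorted(combined))
```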
Find a genomic location of interest by viewing tracks on the genome browser
SNPs
genes
GWAS results
functional genomics data
View Genomic Information by the Genome Browser
search result
Data available at the NIAGADS Genomics Database
Gene models, SNP information
Pathway annotations: Gene Ontology, KEGG Pathway
Functional genomics data: ENCODE, FANTOM5, GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics
Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  • Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  • Big data management
  • IT infrastructure and web development
• Administrative support and collaboration
  • Familiarity with human subjects research regulations and NIH policy
  • Leadership and coordinating skills
  • Flexibility, close collaboration with program officers
• Domain expertise
  • Knowledge of the disease and genetic research
  • Familiarity with the community and cohorts
  • Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
What the Data Coordinating Center Should Know
Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss them
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements
Patients and families of those with AD
NIA ADRC program and centers
NACC, NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD
NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly
NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon
NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao
GDC Overview, GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.
• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence data, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high level data
GDC Overview: Infrastructure
[Figure: infrastructure diagram organized under three headings: GDC Users, GDC System Components, GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; 3rd Party Applications; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; APIs; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System]
GDC Overview: Resources
[Figure: GDC resources. Labels: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators]
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc.
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testing (UAT) Testers
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
GDC Data Submission: Data Submission Process
[Figure: data submission process diagram]
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | TSV, JSON for each (Sample, Portion, Analyte, Aliquot)
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | TSV, JSON for each (Demographic, Diagnosis, Exposure, Family History, Treatment)
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
[Figure: Case diagram. Labels: Case; Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)); Biospecimen Data; Experiment Data]
Data Types and File Formats: Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
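One way to read a QC report like the example above is to list, for each read group, the modules that did not pass. The sketch below does this for the rows of the slide's table that contain a WARNING or FAIL. It is an illustration only: the function name and data layout are assumptions made for this example, not part of the GDC or FastQC tooling.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Rows transcribed from the example table; rows that are PASS for every
# read group are omitted because they cannot produce a flag.
MODULE_RESULTS = {
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flag_read_groups(module_results, read_groups):
    """Map each read group to its (module, status) pairs that are not PASS."""
    flagged = {read_group: [] for read_group in read_groups}
    for module, statuses in module_results.items():
        for read_group, status in zip(read_groups, statuses):
            if status != "PASS":
                flagged[read_group].append((module, status))
    return flagged


flagged = flag_read_groups(MODULE_RESULTS, READ_GROUPS)
for read_group, issues in flagged.items():
    print(read_group, issues)
```

In this example RG_4 has no flags, while RG_1 and RG_2 each have one FAIL that a submitter would likely want to review before release.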
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
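A request against the create-entities endpoint might be assembled as in the sketch below. It only builds the request and does not send it. Several details are assumptions rather than facts from the slides: the base address is the API address listed in this deck's references, the token header name `X-Auth-Token` is assumed, and the program name, project name, and entity payload are hypothetical placeholders.

```python
import json
import urllib.request

# API address as listed in this deck's references (it may have changed since).
API_BASE = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Auth-Token": token,  # assumed header name for the submitter's token
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Hypothetical program, project, and entity values, for illustration only.
example_entities = [{"type": "case", "submitter_id": "example-case-1"}]
request = build_create_request("PROGRAM", "PROJECT", example_entities, "dummy-token")

print(request.get_method(), request.full_url)
```

Sending the request would be a separate step (for instance with `urllib.request.urlopen(request)`), and it would need a valid token plus the dbGaP authorization described earlier.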
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]
GDC Processing: GDC High Level Data Generation
• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
[Figure: variant calling pipelines]
GDC Data Retrieval
• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
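The idea of a graph of linked entities with unique identifiers can be illustrated with a toy sketch. This is not the GDC's actual schema or storage layer: the class, the node labels, the identifiers, and the direction of the edges are all hypothetical choices made for this example.

```python
class EntityGraph:
    """A minimal directed graph: each node has a unique id and links to its parents."""

    def __init__(self):
        self.nodes = {}    # node id -> label
        self.parents = {}  # node id -> list of parent node ids

    def add_node(self, node_id, label):
        self.nodes[node_id] = label
        self.parents.setdefault(node_id, [])

    def add_edge(self, child_id, parent_id):
        self.parents[child_id].append(parent_id)

    def ancestors(self, node_id):
        """Return the ids of every node reachable by following parent edges."""
        found = []
        stack = list(self.parents.get(node_id, []))
        while stack:
            current = stack.pop()
            if current not in found:
                found.append(current)
                stack.extend(self.parents.get(current, []))
        return found


# Hypothetical identifiers, for illustration only.
graph = EntityGraph()
graph.add_node("project-1", "project")
graph.add_node("case-1", "case")
graph.add_node("clinical-1", "clinical data")
graph.add_node("file-1", "data file")
graph.add_edge("case-1", "project-1")
graph.add_edge("clinical-1", "case-1")
graph.add_edge("file-1", "case-1")

# Walking up from a file recovers the case and project it belongs to.
print([graph.nodes[node_id] for node_id in graph.ancestors("file-1")])
```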
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from the cart or via the high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing
Example request (parts labeled on the slide: API URL, endpoint, URL parameters, query parameters):
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Example response (portion removed for readability):
{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
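The request and response shown above can be handled programmatically along the following lines. The sketch builds the query URL and parses a response of the same shape, without making a network call. The base address is the one shown on the slide and may no longer be current, the helper names are made up for this example, and the sample response is a shortened copy of the slide's output.

```python
import json
from urllib.parse import urlencode

# API address as shown on the slide.
API_BASE = "https://gdc-api.nci.nih.gov"


def build_query_url(endpoint, fields, pretty=True):
    """Build a URL such as .../projects?fields=a,b&pretty=true."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # safe="," keeps the field separator readable instead of percent-encoding it.
    return f"{API_BASE}/{endpoint}?{urlencode(params, safe=',')}"


def primary_sites(response_text):
    """Map each project_id in a response to its primary_site."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


# Shortened copy of the example response on the slide.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LUSC", "primary_site": "Lung"}
]}}
"""

url = build_query_url("projects", ["project_id", "primary_site"])
sites = primary_sites(SAMPLE_RESPONSE)
print(url)
print(sites)
```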
GDC Web Site
• The GDC Web Site provides access to GDC resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site
  – https://gdc.nci.nih.gov
• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov
• GDC Data Portal
  – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
GDC Overview GDC Data Submission Processing and Retrieval
April 2016
Agenda
n
GDC Overview
GDC Data Submissio
GDC Data Processing
GDC Data Retrieval
154
GDC Overview Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables
data sharing across cancer genomic studies in support of precision medicine
bull Provide a cancer knowledge network that ndash Enables the identification of both high- and low-frequency cancer drivers ndash Assists in defining genomic determinants of response to therapy ndash Informs the composition of clinical trial cohorts sharing targeted genetic lesions
bull Support the receipt quality control integration storage and redistribution of standardized genomic data sets derived from cancer research studies ndash Harmonization of raw sequence both from existing (eg TCGA TARGET
CGCI) and new cancer research programs
ndash Application of state-of-the-art methods of generating high level data 155
GDC Overview Infrastructure
156
Data Sources (DCC CGHub)
Data Submitters
Open Access Data Users
Controlled Access Data Users
eRACommons amp dbGaP Data
Access Tools
Data Import System
Metadata amp Data Storage
Reporting System
Harmonization amp Generation
Pipelines
3rd Party Applications
GDC Users GDC System Components GDC Interfaces
Alignment amp Processing Tools Data
Submission Tools
Data Security System
APIs
Digital ID System
GDC Overview Resources
157
GDC Data Portal
GDC Data Center
GDC Data Transfer Tool GDC Reports
GDC Data Model
GDC Data Submission
Portal
GDC ApplicatioProgrammingInterface (API)
n GDC Bioinformatics
Pipeline
GDC Documentation
and Support
GDC Organization
and Collaborators
GDC Organization and Collaborators
bull GDC Government Sponsor
NCI Center for Cancer Genomics (CCG)
University of Chicago Team
bull Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
bull GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
bull Contracting organization supporting GDC management and execution
Other Government External
bull Includes eRA dbGaP NCI CBIIT for data access and security compliance bull Includes the GDC Steering Committee Bioinformatics Advisory Group Data
Submitters and User Acceptance Testers (UAT) Testers 158
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
159
GDC Data Submission Data Submitter Types
bull The GDC supports two types of data submitters
ndash Type 1 Large Organizations bull Users associated with an institution or group with significant informatics resources bull Large one-time submission or long-term ongoing data submission bull Primarily use the GDC Application Programming Interfaces (API) or the GDC
Data Transfer Tool (command line interface) for data submission
ndash Type 2 Researchers or Individual Laboratories bull Users associated with a single group (such as a laboratory investigator or
researcher) with limited informatics resources bull One-time or sporadic uploads of low volumes of patient and analysis data with
varying levels of completeness bull Submit via the web-based GDC Data Submission Portal and the GDC Data
Transfer Tool that use the GDC API
160
GDC Data Submission Data Submission Policies
bull dbGaP Data Submission Policies ndash GDC data submitters must first apply for data submission authorization through dbGaP ndash Data submission through dbGaP requires institutional certification under NIHrsquos Genomi c
Data Sharing Policy bull GDC Data Sharing Policies
ndash Data Sharing Requirement bull Data submitted to the GDC will be made available to the scientific community according to the
data submitterrsquos NCI Genomic Data Sharing Plan Controlled access data will be made availabl e to members of the community having the appropriate dbGaP Data Use Certification
bull The GDC will produce harmonized data (raw and derived) based on the originally submitted data The GDC will not preserve an exact copy of the originally submitted data however the GDC will preserve the original reads and quality scores
ndash Data Pre-processing Period bull For each project the GDC will afford a pre-processing period of exclusive that allows for
submitters to perform data cleaning and quality and submission of revised data before release bull The pre-processing period may generally last up to six months from the date of first submission
ndash Data Submission Period and Release bull Once submitted data will be processed and validated by the GDC Submitted data will be
released and available via controlled access for research that is consistent with the datasetrsquos ldquodata use limitationsrdquo either six months after data submission or at the time of first publication
ndash Data Redaction bull The GDC in general will not remove data access in response to submitter requests GDC will
remove data access in the following events Data Management Incident Human Subjects 161
Compliance Issue Erroneous Data
GDC Data Submission Data Submission Process
162
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter's credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample (TSV, JSON); Portion (TSV, JSON); Analyte (TSV, JSON); Aliquot (TSV, JSON)
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic (TSV, JSON); Diagnosis (TSV, JSON); Exposure (TSV, JSON); Family History (TSV, JSON); Treatment (TSV, JSON)
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  - GDC clinical data elements were reviewed with members of the research community
  - Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
[Diagram: Case, with Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)), Biospecimen Data, and Experiment Data]
Data Types and File Formats: Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
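As a rough illustration of how a submitter might act on a QC report like the one on this slide, the sketch below transcribes the PASS/WARNING/FAIL statuses and lists the non-passing modules per read group. The function and variable names are hypothetical and are not part of any GDC tool; only the status values come from the slide.

```python
# Statuses transcribed from the QC table on this slide, one entry per read group.
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]
QC_RESULTS = {
    "Basic Statistics":             ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality":    ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality":    ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores":  ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content":    ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content":      ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content":           ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels":  ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences":    ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content":              ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content":                 ["PASS", "PASS", "WARNING", "PASS"],
}


def flag_read_groups(results, read_groups):
    """Return {read_group: {status: [module, ...]}} for every non-PASS module."""
    flagged = {}
    for module, statuses in results.items():
        for read_group, status in zip(read_groups, statuses):
            if status != "PASS":
                flagged.setdefault(read_group, {}).setdefault(status, []).append(module)
    return flagged


flagged = flag_read_groups(QC_RESULTS, READ_GROUPS)
for read_group in READ_GROUPS:
    print(read_group, flagged.get(read_group, "all modules passed"))
```

Grouping by read group rather than by module makes it easier to decide which read groups may need to be re-examined before release.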
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  - Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  - Validate data against GDC standard data types defined in the project data dictionary
  - Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  - Command line interface to specify desired transfer protocol
  - Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  - Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  - Securely submit biospecimen, clinical, and experiment files using a token
  - Submit data to GDC by performing create, update, and retrieve actions on entities
  - Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
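The create-entities endpoint above could be exercised along the following lines. This is a minimal sketch that only builds the request and does not send it. The base URL is taken from the References slide; the `X-Auth-Token` header name reflects my understanding of the GDC API documentation and should be verified there; the program, project, and entity payload are placeholders invented for illustration.

```python
import json
import urllib.request

# Assumption: API base URL as listed on the References slide of this deck.
API_BASE = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: header name for the authentication token.
            "X-Auth-Token": token,
        },
    )


# Hypothetical program/project names and payload, for illustration only.
example_entities = [{"type": "case", "submitter_id": "EXAMPLE-CASE-001"}]
req = build_create_request("MYPROGRAM", "MYPROJECT", example_entities, token="<token>")
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen(req)` would perform the submission; that step is left out here because it needs a valid token and an authorized project.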
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]
GDC Processing: GDC High Level Data Generation
• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
GDC Data Retrieval
• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
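To make the graph idea concrete, here is a minimal in-memory sketch of nodes keyed by unique identifiers and directed edges linking a project to cases, and cases to clinical data and files. It is illustrative only: the class, node types, relation names, and identifiers are all hypothetical and do not reflect the GDC's actual schema or storage engine.

```python
from collections import defaultdict


class DataGraph:
    """Tiny property graph: nodes keyed by unique ID, directed labeled edges."""

    def __init__(self):
        self.nodes = {}                 # node_id -> {"type": ..., plus properties}
        self.edges = defaultdict(list)  # src_id -> [(relation, dst_id), ...]

    def add_node(self, node_id, node_type, **properties):
        self.nodes[node_id] = {"type": node_type, **properties}

    def add_edge(self, src_id, relation, dst_id):
        if src_id not in self.nodes or dst_id not in self.nodes:
            raise KeyError("both endpoints must exist before they can be linked")
        self.edges[src_id].append((relation, dst_id))

    def descendants(self, start_id, node_type):
        """Return sorted IDs of all nodes of node_type reachable from start_id."""
        found, stack, seen = [], [start_id], {start_id}
        while stack:
            current = stack.pop()
            for _relation, dst_id in self.edges[current]:
                if dst_id not in seen:
                    seen.add(dst_id)
                    stack.append(dst_id)
                    if self.nodes[dst_id]["type"] == node_type:
                        found.append(dst_id)
        return sorted(found)


# Hypothetical identifiers, for illustration only.
graph = DataGraph()
graph.add_node("proj-1", "project")
graph.add_node("case-1", "case")
graph.add_node("case-2", "case")
graph.add_node("clin-1", "clinical")
graph.add_node("file-1", "file", file_name="reads_1.bam")
graph.add_node("file-2", "file", file_name="reads_2.bam")
graph.add_edge("proj-1", "has_case", "case-1")
graph.add_edge("proj-1", "has_case", "case-2")
graph.add_edge("case-1", "has_clinical", "clin-1")
graph.add_edge("case-1", "has_file", "file-1")
graph.add_edge("case-2", "has_file", "file-2")

files_for_case_1 = graph.descendants("case-1", "file")
files_for_project = graph.descendants("proj-1", "file")
print(files_for_case_1, files_for_project)
```

Because every link is expressed through identifiers rather than file paths, a query such as "all files for this project" becomes a traversal from the project node.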
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  - Data browsing by project, file, case, or annotation
  - Visualization allowing users to perform fine-grained filtering of search results
  - Data search using advanced smart search technology
  - Data selection into a personalized cart
  - Data download from cart or a high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  - Command line interface to specify desired transfer protocol and multiple files for download
  - Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  - Supports the secure transfer of controlled access data using an authentication key (token)
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  - Search for projects, files, cases, and annotations and retrieve associated details in JSON format
  - Securely retrieve biospecimen, clinical, and molecular files
  - Perform BAM slicing
Example request (annotated on the slide as API URL, endpoint, URL parameters, and query parameters):
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Example response (portion removed for readability):
{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
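The request on this slide can be assembled and its response consumed with the standard library alone. The sketch below builds the same projects query and groups project IDs by primary site from a small sample response; it makes no network call. The helper names are hypothetical, and the sample text is a three-hit subset of the response shown on the slide.

```python
import json
from collections import defaultdict
from urllib.parse import urlencode

# Assumption: API base URL as shown on this slide.
API_BASE = "https://gdc-api.nci.nih.gov"


def build_projects_url(fields, pretty=True):
    """Build the /projects query URL with a comma-separated field list."""
    query = urlencode(
        {"fields": ",".join(fields), "pretty": "true" if pretty else "false"},
        safe=",",  # keep the commas in the field list unescaped
    )
    return f"{API_BASE}/projects?{query}"


def group_by_primary_site(response_text):
    """Map each primary_site to the list of project_ids reported for it."""
    hits = json.loads(response_text)["data"]["hits"]
    grouped = defaultdict(list)
    for hit in hits:
        grouped[hit["primary_site"]].append(hit["project_id"])
    return dict(grouped)


# Subset of the response shown on the slide.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}
"""

url = build_projects_url(["project_id", "primary_site"])
sites = group_by_primary_site(SAMPLE_RESPONSE)
print(url)
print(sites)
```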
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  - Targets data consumers, providers, developers, and general users
  - Provides access to information about GDC and contributed cancer genomic data sets
  - Instructs users on the use of GDC data access and submission tools
  - Provides descriptions of GDC bioinformatics pipelines
  - Documents supported GDC data types and file formats
  - Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site
  - https://gdc.nci.nih.gov
• GDC Documentation Site
  - https://gdc-docs.nci.nih.gov
• GDC Data Portal
  - https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  - https://gdc-portal.nci.nih.gov/submission
  - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  - https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  - https://gdc-api.nci.nih.gov
• GDC Support
  - GDC Assistance: support@nci-gdc.datacommons.io
  - GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype)
  - Aggregation
  - Access
  - Sharing
  - Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Diagram labels: Birth defect or childhood cancer cohorts; DNA Sequencing center; Birth defect BAM/VCF and Cancer BAM/VCF; dbGaP and NCI GDC; Birth defect VCF and Cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); Users]
[Diagram labels: GMKF Data Resource shown with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
Management of Work Group Data/Work Flow
Exchange Area
Data Deposition and Data Sharing
ADSP-specific datasets: 68 TB
21 reference datasets with 5560 files in use by the ADSP
>20 reference datasets planned
Note: 1 TB = 1024 GB = 1048576 MB = 1073741824 KB = 1099511627776 bytes. A 1 TB hard drive has the capacity to hold roughly a trillion bytes
dbGaP/NIAGADS release for the research community at large
11555 BAM files released 12/2014
Pedigree and phenotype information
Sequencing Quality Control Metrics
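The unit conversion quoted on this slide uses binary multiples (factors of 1024), while "a trillion bytes" is the decimal figure. The short check below reproduces the slide's numbers and shows both readings for the 68 TB of ADSP-specific data; the constant and function names are my own.

```python
# Binary (1024-based) multiples, matching the conversion quoted on the slide.
TB_IN_GB = 1024
TB_IN_MB = 1024 ** 2
TB_IN_KB = 1024 ** 3
TB_IN_BYTES = 1024 ** 4

# Decimal terabyte, i.e. exactly one trillion bytes.
DECIMAL_TB_IN_BYTES = 10 ** 12


def tb_to_bytes(terabytes, binary=True):
    """Convert terabytes to bytes using binary or decimal multiples."""
    return terabytes * (TB_IN_BYTES if binary else DECIMAL_TB_IN_BYTES)


adsp_tb = 68  # ADSP-specific datasets, from the slide
print("binary :", tb_to_bytes(adsp_tb))
print("decimal:", tb_to_bytes(adsp_tb, binary=False))
print("ratio  :", TB_IN_BYTES / DECIMAL_TB_IN_BYTES)
```

The two conventions differ by just under 10 percent at the terabyte scale, which is worth keeping in mind when comparing reported dataset sizes with raw disk capacity.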
Size of the ADSP
Reference: "The real cost of sequencing: scaling computation to keep pace with data generation", Muir et al., Genome Biology 2016, 17:53
ADSP Members Area (Dashboard)
Calendar, documents, conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list: 153 records
352 meeting minutes
>600 other documents
>100 consent files
10 analysis plans
85TB WGS, 1075TB WES data (dbGaP/SRA)
42 TB files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
SNP/Gene reports, Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer's Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
GWAS results combined with
SNPs below
Gene annotations transformed to SNPs
Genomic features from identified SNPs
Find a genomic location of interest by viewing tracks on the genome browser
SNPs
genes
GWAS results
functional genomics data
View Genomic Information by the Genome Browser
search result
Data available at the NIAGADS Genomics Database
Gene models SNP information
Pathway annotations Gene Ontology
KEGG Pathway
Functional genomics data ENCODE
FANTOM5
GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics
Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  - Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  - Big data management
  - IT infrastructure and web development
• Administrative support and collaboration
  - Familiarity with human subjects research regulations and NIH policy
  - Leadership and coordinating skills
  - Flexibility, close collaboration with program officers
• Domain expertise
  - Knowledge of the disease and genetic research
  - Familiarity with the community and cohorts
  - Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
What the Data Coordinating Center Should Know: Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss them
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements
Patients and families of those with AD
NIA ADRC program and centers
NACC, NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD
NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly
NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon
NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao
GDC Overview: GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
  - Enables the identification of both high- and low-frequency cancer drivers
  - Assists in defining genomic determinants of response to therapy
  - Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  - Harmonization of raw sequence both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  - Application of state-of-the-art methods of generating high level data
GDC Overview: Infrastructure
[Diagram with three groupings: GDC Users, GDC System Components, GDC Interfaces. Labels shown: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; 3rd Party Applications; Data Access Tools; Data Submission Tools; APIs; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System]
GDC Overview: Resources
[Diagram labels: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators]
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testing (UAT) Testers
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
  - Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  - Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, both of which use the GDC API
160
GDC Data Submission Data Submission Policies
bull dbGaP Data Submission Policies ndash GDC data submitters must first apply for data submission authorization through dbGaP ndash Data submission through dbGaP requires institutional certification under NIHrsquos Genomi c
Data Sharing Policy bull GDC Data Sharing Policies
ndash Data Sharing Requirement bull Data submitted to the GDC will be made available to the scientific community according to the
data submitterrsquos NCI Genomic Data Sharing Plan Controlled access data will be made availabl e to members of the community having the appropriate dbGaP Data Use Certification
bull The GDC will produce harmonized data (raw and derived) based on the originally submitted data The GDC will not preserve an exact copy of the originally submitted data however the GDC will preserve the original reads and quality scores
ndash Data Pre-processing Period bull For each project the GDC will afford a pre-processing period of exclusive that allows for
submitters to perform data cleaning and quality and submission of revised data before release bull The pre-processing period may generally last up to six months from the date of first submission
ndash Data Submission Period and Release bull Once submitted data will be processed and validated by the GDC Submitted data will be
released and available via controlled access for research that is consistent with the datasetrsquos ldquodata use limitationsrdquo either six months after data submission or at the time of first publication
ndash Data Redaction bull The GDC in general will not remove data access in response to submitter requests GDC will
remove data access in the following events Data Management Incident Human Subjects 161
Compliance Issue Erroneous Data
GDC Data Submission Data Submission Process
162
GDC Data Submission dbGaP Registration
bull Data submitters register their studies in dbGaP submit the Subject Identifiers associated with the study in dbGaP and contact the GDC
bull The GDC verifies that the study has been registered in dbGaP verifies the data submitter credentials Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process httpwwwncbinlmnihgovprojectsgapcgi-binGetPdfcgidocument_name=HowToSubmitpdf
163
GDC Data Submission Upload and Validate Data
bull Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal GDC Data Transfer Tool or GDC Application Programming Interface (API)
bull Uploaded data must follow GDC supported data types and file formats which leverage and extend existing data standards for biospecimen clinical and experiment data
164
Data Types and File Formats (1 of 2)
bull The GDC provides a data dictionary and template files for each data type
165
Data Type Data Subtype Format Data Dictionary Template
Administrative Administrative Data TSV JSON Case TSV JSON
Biospecimen Biospecimen Data TSV JSON
Sample Portion Analyte Aliquot
Sample TSV JSON Portion TSV JSON Analyte TSV JSON Aliquot TSV JSON
Clinical Clinical Data TSV JSON
Demographic Diagnosis Exposure Family History Treatment
Demographic TSV JSON Diagnosis TSV JSON Exposure TSV JSON Family History TSV JSON Treatment TSV JSON
Data File
Analysis Metadata SRA XML MAGE-TAB (SDRF IDF) Analysis Metadata TSV JSON
Biospecimen Metadata BCR XML GDC-approved spreadsheet
Biospecimen Metadata TSV JSON
Clinical Metadata BCR XML GDC-approved spreadsheet Clinical Metadata TSV JSON
Data Types and File Formats (2 of 2)
Data Type Data Subtype Format Data Dictionary Template
Data File
Experimental Metadata Pathology Report Run Metadata Slide Image
Submitted Unaligned Reads
SRA XML
PDF SRA XML SVS
FASTQ BAM
Experimental Metadata Pathology Report
Submitted Unaligned Reads
TSV JSON
TSV JSON TSV JSON TSV JSON
TSV JSON
Submitted Aligned Reads BAM
Submitted Aligned Reads TSV JSON
Data Bundle Read Group
Slide
Read Group
Slide
TSV JSON
TSV JSON
166
Data Types and File Formats Biospecimen Data
bull Biospecimen data types may include samples aliquots analytes and portions
bull GDC supports the submission of biospecimen data in XML JSON or TSV file format
167
Data Types and File Formats Clinical Data
bull GDC clinical data types are associated with each case and include required preferred and optional clinical data elements for demographics diagnosis family history exposure and treatment ndash GDC clinical data elements were reviewed with members of the research community ndash Clinical data elements are defined in the GDC dictionary and registered in the NCIrsquos
Cancer Data Standards Repository (caDSR)
Case
Clinical Data
Demographics Diagnosis Family History(Optional) Exposure (Optional) Treatment
(Optional)
Biospecimen Data Experiment Data
168
Data Types and File Formats Experiment Data
bull GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
bull GDC performs and reports on Quality Control (QC) metrics generated by FASTQC
Read Group Name RG_1 RG_2 RG_3 RG_4 Total Sequences 502123 123456 658101 788965 Read Length 75 75 75 75 GC Content () 49 50 48 50 Basic Statistics PASS PASS PASS PASS
Per base sequence quality PASS PASS PASS PASS
Per tile sequence quality PASS WARNING PASS PASS
Per sequence quality scores FAIL PASS PASS PASS
Per base sequence content PASS PASS WARNING PASS
Per sequence GC content PASS PASS PASS PASS
Per base N content PASS PASS PASS PASS Sequence Length
Distribution PASS PASS PASS PASS Sequence
Duplication Levels WARNING WARNING PASS PASS Overrepresented
sequences PASS PASS PASS PASS Adapter Content PASS FAIL PASS PASS 169Kmer Content PASS PASS WARNING PASS
Data Submission Tools GDC Data Submission Portal
bull The GDC Data Submission Portal is a web-based data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
Upload biospecimen and clinical data and experiment metadata using user friendly web-based tools
Validate data against GDC standard data types defined in the project data dictionary
Obtain information on the status of data submission and processing by project
170
Data Submission Tools GDC Data Transfer Tool
bull The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
Command line interface to specify desired transfer protocol
Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
Supports the secure upload of controlled access data using an authentication key (token)
171
Data Submission Tools GDC Submission API
bull The GDC Submission API is a REST based programmatic interface that supports the programmatic submission of cancer data sets for analysis
Securely submit biospecimen clinical and experiment files using a token
Submit data to GDC by performing create update and retrieve actions on entities
Search for submitted data using GraphQL an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint POST v0submissionltprogramgtltprojectgt 172
GDC Data Submission Submit and Release Data
bull After validation data is submitted to the GDC for processing including data harmonization and high level data generation for applicable data
bull After data processing has been completed the user can release their data to the GDC which must occur within six (6) months of submission per GDC Data Submission Policies
bull Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
173
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
174
GDC Processing Harmonization
bull GDC pipelines supporting the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
175
GDC Processing DNA and RNA Sequence Harmonization Pipelines
176
DNA Sequence Pipeline RNA Sequence Pipeline
GDC Processing GDC High Level Data Generation
bull The GDC generates high level data including DNA-seq derived germline variants and somatic mutations RNA-seq and miRNA-seq derived gene and miRNA quatifications and SNP Array based copy number segmentations
bull The GDC implements multiple pipelines for generating somatic variants such as the Baylor Broad and WashU pipelines
177
GDC Processing GDC Variant Calling Pipelines
178
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
179
GDC Data Retrieval
bull Once data is submitted into GDC and the data submitter signs-off on the data to request data release GDC performs Quality Control (QC) and data processing
bull Upon successful GDC QC and processing the data is released based on the appropriate dbGaP data restrictions
bull Released data is made available for query and download to authorized users via the GDC Data Portal the GDC Data Transfer Tool and the GDC API
bull Data queries are based on the GDC Data Model
180
GDC Data Retrieval GDC Data Model
bull The GDC data model is represented as a graph with nodes and edges and this graph is the store of record for the GDC
bull The GDC data model maintains the critical relationship between projects cases clinical data and experiment data and insures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
181
GDC Data Retrieval GDC Data Portal
bull The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
Data browsing by project file case or annotation
Visualization allowing users to perform fine-grained filtering of search results
Data search using advanced smart search technology
Data selection into a personalized cart
Data download from cart or a high-performance Data Transfer Tool
182
GDC Data Retrieval GDC Data Transfer Tool
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype):
  - Aggregation
  - Access
  - Sharing
  - Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Figure: data flow. Labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF and cancer BAM/VCF; dbGaP and NCI GDC; birth defect VCF and cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users]
[Figure: the GMKF Data Resource shown with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
Data Deposition and Data Sharing
• ADSP-specific datasets: 68 TB
• 21 reference datasets with 5,560 files in use by the ADSP
• >20 reference datasets planned
• 1 TB = 1,024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes (binary convention). A 1 TB hard drive has the capacity to hold about a trillion bytes.
• dbGaP/NIAGADS release for the research community at large
• 11,555 BAM files released 12/2014
• Pedigree and phenotype information
• Sequencing quality control metrics
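The terabyte conversion on this slide uses powers of 1,024. The short Python check below is only an illustration: it reproduces those figures and shows how far the binary unit sits from a decimal terabyte of exactly one trillion bytes. The function and constant names are made up for this sketch.

```python
# Binary (power-of-1024) units, matching the conversion quoted on the slide.
KB = 1024
MB = KB * 1024
GB = MB * 1024
TB = GB * 1024  # strictly speaking a tebibyte (TiB)


def tb_to_units(n_tb: int) -> dict:
    """Express a size given in binary terabytes in smaller units."""
    total_bytes = n_tb * TB
    return {
        "GB": total_bytes // GB,
        "MB": total_bytes // MB,
        "KB": total_bytes // KB,
        "bytes": total_bytes,
    }


one_tb = tb_to_units(1)
print(one_tb)

# A decimal terabyte is exactly one trillion bytes.
DECIMAL_TB = 10**12
difference = TB - DECIMAL_TB
print(difference)  # bytes by which the binary unit exceeds the decimal one
```

The gap is roughly 10 percent, which is why the same dataset can be quoted with different sizes depending on the convention in use.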
Size of the ADSP
Reference: "The real cost of sequencing: scaling computation to keep pace with data generation," Muir et al., Genome Biology 2016, 17:53
ADSP Members Area (Dashboard)
• Calendar
• Documents
• Conference call minutes
• Reference Dataset catalog
• Information on funded cooperative agreements
• Bulletin Board for Work Groups
• Notification of New Datasets
ADSP by the numbers
• Member list: 153 records
• 352 meeting minutes
• >600 other documents
• >100 consent files
• 10 analysis plans
• 85TB WGS, 1075TB WES data (dbGaP/SRA)
• 42 TB Files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
• SNP/Gene reports
• Genome Browser interface
• Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer's Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
[Figure: combined search. Labels: GWAS results combined with; SNPs below; gene annotations transformed to SNPs; genomic features from identified SNPs]
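The combined search described above can be pictured as two steps: keep the SNPs that pass a GWAS significance threshold, then map them onto annotated genes. The sketch below is a minimal illustration under stated assumptions, not the NIAGADS implementation. The record types, the helper names, the threshold of 5e-8, and every data value are hypothetical.

```python
from typing import NamedTuple


class Snp(NamedTuple):
    rsid: str
    chrom: str
    pos: int
    p_value: float


class Gene(NamedTuple):
    symbol: str
    chrom: str
    start: int
    end: int


def significant_snps(snps, threshold):
    """Keep SNPs whose GWAS p-value is below the threshold."""
    return [s for s in snps if s.p_value < threshold]


def snps_in_genes(snps, genes):
    """Group SNP identifiers by the gene interval that contains them."""
    hits = {}
    for s in snps:
        for g in genes:
            if s.chrom == g.chrom and g.start <= s.pos <= g.end:
                hits.setdefault(g.symbol, []).append(s.rsid)
    return hits


# Made-up records for illustration only; these are not real GWAS results.
snps = [
    Snp("rs_demo1", "chr1", 150, 1e-9),
    Snp("rs_demo2", "chr1", 900, 0.2),
    Snp("rs_demo3", "chr2", 480, 3e-8),
    Snp("rs_demo4", "chr2", 5000, 1e-10),
]
genes = [Gene("GENE_A", "chr1", 100, 1000), Gene("GENE_B", "chr2", 400, 600)]

result = snps_in_genes(significant_snps(snps, 5e-8), genes)
print(result)
```

A production database would use an interval index rather than the nested loop shown here; the loop is kept for readability.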
Find a genomic location of interest by viewing tracks on the genome browser
• SNPs
• Genes
• GWAS results
• Functional genomics data
View Genomic Information by the Genome Browser
[Figure: genome browser view of a search result]
Data available at the NIAGADS Genomics Database
• Gene models
• SNP information
• Pathway annotations: Gene Ontology, KEGG Pathway
• Functional genomics data: ENCODE, FANTOM5
• GTEx results
• NHGRI GWAS catalog
• Published genome-wide AD GWAS summary statistics
• Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  - Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  - Big data management
  - IT infrastructure and web development
• Administrative support and collaboration
  - Familiarity with human subjects research regulations and NIH policy
  - Leadership and coordinating skills
  - Flexibility, close collaboration with program officers
• Domain expertise
  - Knowledge of the disease and genetic research
  - Familiarity with the community and cohorts
  - Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
What the Data Coordinating Center Should Know: Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss them
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements
• Patients and families of those with AD
• NIA ADRC program and centers
• NACC
• NCRAD
• ADSP
  - Data flow WG
  - Annotation WG
• IGAP/ADGC/CHARGE/EADI/GERAD
• NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly
• NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon
• NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao
GDC Overview: GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
• GDC Overview
• GDC Data Submission
• GDC Data Processing
• GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.
• Provide a cancer knowledge network that
  - Enables the identification of both high- and low-frequency cancer drivers
  - Assists in defining genomic determinants of response to therapy
  - Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  - Harmonization of raw sequence both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  - Application of state-of-the-art methods of generating high level data
GDC Overview: Infrastructure
[Figure: GDC infrastructure diagram in three groups: GDC Users, GDC System Components, and GDC Interfaces. Labels shown: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; 3rd Party Applications; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; APIs; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System]
GDC Overview: Resources
• GDC Data Portal
• GDC Data Center
• GDC Data Transfer Tool
• GDC Reports
• GDC Data Model
• GDC Data Submission Portal
• GDC Application Programming Interface (API)
• GDC Bioinformatics Pipeline
• GDC Documentation and Support
• GDC Organization and Collaborators
GDC Organization and Collaborators
• NCI Center for Cancer Genomics (CCG): GDC government sponsor
• University of Chicago Team: primary GDC developing organization
• Ontario Institute for Cancer Research (OICR): GDC developing organization supporting the University of Chicago
• Leidos Biomedical Research, Inc.: contracting organization supporting GDC management and execution
• Other Government / External
  - Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
  - Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
  - Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  - Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  - GDC data submitters must first apply for data submission authorization through dbGaP
  - Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  - Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification.
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores.
  - Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release.
    • The pre-processing period may generally last up to six months from the date of first submission.
  - Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication.
  - Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data.
GDC Data Submission: Data Submission Process
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP.
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | TSV, JSON (one per dictionary)
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | TSV, JSON (one per dictionary)
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  - GDC clinical data elements were reviewed with members of the research community
  - Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
[Figure: a Case linked to Clinical Data (Demographics, Diagnosis, Family History (optional), Exposure (optional), Treatment (optional)), Biospecimen Data, and Experiment Data]
Data Types and File Formats: Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FASTQC
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
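A QC report like the one above is easier to act on when it is reduced to the read groups that need attention. The Python sketch below re-enters the slide's example table by hand and lists, per read group, the modules with a given status. It is an illustration only: the GDC's actual reporting code is not described in the slides, and the function name is hypothetical.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Per-module statuses, in read-group order, copied from the example table.
QC_RESULTS = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def modules_with_status(results, read_groups, status):
    """Map each read group to the QC modules that reported `status`."""
    flagged = {}
    for module, statuses in results.items():
        for read_group, value in zip(read_groups, statuses):
            if value == status:
                flagged.setdefault(read_group, []).append(module)
    return flagged


failures = modules_with_status(QC_RESULTS, READ_GROUPS, "FAIL")
warnings = modules_with_status(QC_RESULTS, READ_GROUPS, "WARNING")
print("FAIL:", failures)
print("WARNING:", warnings)
```

In this example only RG_4 passes every module, so it is the one read group that would not appear in either summary.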
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  - Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  - Validate data against GDC standard data types defined in the project data dictionary
  - Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  - Command line interface to specify desired transfer protocol
  - Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  - Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  - Securely submit biospecimen, clinical, and experiment files using a token
  - Submit data to GDC by performing create, update, and retrieve actions on entities
  - Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
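To make the Create Entities endpoint concrete, the sketch below assembles (but does not send) a POST request for it using only the Python standard library. Several details are assumptions not stated on the slide: the `X-Auth-Token` header name, the JSON content type, the entity fields, and the program and project names are all illustrative. The base URL is the API address listed in the deck's References.

```python
import json
import urllib.request

API_BASE = "https://gdc-api.nci.nih.gov"  # API URL as listed in the References


def submission_url(program: str, project: str) -> str:
    """Build the Create Entities endpoint for a program/project pair."""
    return f"{API_BASE}/v0/submission/{program}/{project}"


def build_create_request(program, project, entity, token):
    """Prepare a POST request; nothing is sent until urlopen() is called."""
    body = json.dumps(entity).encode("utf-8")
    return urllib.request.Request(
        submission_url(program, project),
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the token
        },
    )


# Hypothetical entity; real fields are defined by the GDC data dictionary.
entity = {"type": "case", "submitter_id": "DEMO-CASE-001"}
req = build_create_request("DEMO", "PROJECT1", entity, token="not-a-real-token")
print(req.get_method(), req.full_url)
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any controlled-access data leaves the submitter's machine.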
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]
GDC Processing: GDC High Level Data Generation
• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
GDC Data Retrieval
• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
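The graph idea can be sketched in a few lines of Python: nodes carry a label and a unique identifier, edges link a project to its cases and a case to its clinical records and files. This is a toy in-memory model written for illustration. The class, method names, and sample properties are hypothetical, and the real GDC graph store is not described at this level in the slides.

```python
import uuid
from collections import defaultdict


class Graph:
    """Toy node/edge store keyed by unique identifiers."""

    def __init__(self):
        self.nodes = {}                 # node_id -> {"label": ..., "props": ...}
        self.edges = defaultdict(list)  # parent node_id -> [child node_id, ...]

    def add_node(self, label, **props):
        node_id = str(uuid.uuid4())
        self.nodes[node_id] = {"label": label, "props": props}
        return node_id

    def link(self, parent_id, child_id):
        self.edges[parent_id].append(child_id)

    def descendants(self, node_id, label):
        """Return ids of all nodes with `label` reachable from `node_id`."""
        found, stack = [], [node_id]
        while stack:
            current = stack.pop()
            for child in self.edges.get(current, []):
                if self.nodes[child]["label"] == label:
                    found.append(child)
                stack.append(child)
        return found


# Hypothetical example: one project, one case, one clinical record, one file.
g = Graph()
project = g.add_node("project", name="DEMO-PROJECT")
case = g.add_node("case", submitter_id="DEMO-CASE-001")
diagnosis = g.add_node("diagnosis", primary_site="Skin")
data_file = g.add_node("file", file_name="demo.bam")
g.link(project, case)
g.link(case, diagnosis)
g.link(case, data_file)

files_in_project = g.descendants(project, "file")
print(files_in_project == [data_file])
```

Because every file is reached through its case and project, a query that starts from a project can recover both the data files and the clinical context attached to them.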
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  - Data browsing by project, file, case, or annotation
  - Visualization allowing users to perform fine-grained filtering of search results
  - Data search using advanced smart search technology
  - Data selection into a personalized cart
  - Data download from cart or a high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  - Command line interface to specify desired transfer protocol and multiple files for download
  - Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  - Supports the secure transfer of controlled access data using an authentication key (token)
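A manifest-driven download can be sketched as two small steps: read the manifest to see what will be fetched, then build the command line for the transfer tool. The Python sketch below rests on assumptions that the slide does not state: that the manifest is a tab-separated file with `id`, `filename`, `md5`, `size`, and `state` columns, and that the tool is invoked as `gdc-client download` with `-m` for the manifest and `-t` for the token file. The sample rows are made up, and the command is only assembled, never executed.

```python
import csv
import io


def read_manifest(text):
    """Parse tab-separated manifest text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def download_command(manifest_path, token_path=None):
    """Assemble (but do not run) a transfer-tool command line."""
    cmd = ["gdc-client", "download", "-m", manifest_path]  # assumed CLI shape
    if token_path:
        cmd += ["-t", token_path]  # token is needed for controlled access data
    return cmd


# Made-up manifest content for illustration; the column layout is an assumption.
SAMPLE_MANIFEST = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "00000000-0000-0000-0000-000000000001\tdemo_1.bam\t" + "0" * 32 + "\t1024\tsubmitted\n"
    "00000000-0000-0000-0000-000000000002\tdemo_2.bam\t" + "0" * 32 + "\t2048\tsubmitted\n"
)

rows = read_manifest(SAMPLE_MANIFEST)
total_bytes = sum(int(row["size"]) for row in rows)
command = download_command("manifest.txt", token_path="token.txt")
print(len(rows), total_bytes, " ".join(command))
```

Summing the `size` column before starting is a cheap way to confirm that local storage is large enough for the transfer.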
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  - Search for projects, files, cases, and annotations and retrieve associated details in JSON format
  - Securely retrieve biospecimen, clinical, and molecular files
  - Perform BAM slicing
Example request (annotated on the slide as API URL, endpoint, URL parameters, and query parameters):
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Example response (portion removed for readability):
"data": {
  "hits": [
    {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
    {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
    {"project_id": "TCGA-LAML", "primary_site": "Blood"},
    {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
    {"project_id": "TCGA-UVM", "primary_site": "Eye"},
    {"project_id": "TARGET-AML", "primary_site": "Blood"},
    {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
    {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
    {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
    {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
  ]
}
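The request above can be assembled programmatically and the response reduced to a simple lookup table. The Python sketch below uses only the standard library and works on a hard-coded sample, so it runs without network access. The helper names are hypothetical, and the response shape is taken from the slide's excerpt rather than from a full API specification.

```python
import json
from urllib.parse import urlencode

API_BASE = "https://gdc-api.nci.nih.gov"  # API URL as listed in the References


def projects_url(fields, pretty=True):
    """Build a /projects query that asks for the given fields."""
    query = urlencode(
        {"fields": ",".join(fields), "pretty": str(pretty).lower()},
        safe=",",  # keep the comma-separated field list readable
    )
    return f"{API_BASE}/projects?{query}"


def sites_by_project(response_text):
    """Map project_id to primary_site from a JSON response body."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


url = projects_url(["project_id", "primary_site"])
print(url)

# Two hits copied from the slide's example response, wrapped as valid JSON.
SAMPLE_RESPONSE = json.dumps({
    "data": {
        "hits": [
            {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
            {"project_id": "TCGA-LAML", "primary_site": "Blood"},
        ]
    }
})
lookup = sites_by_project(SAMPLE_RESPONSE)
print(lookup)
```

To query a live server, the URL could be passed to `urllib.request.urlopen` and the decoded body handed to `sites_by_project`.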
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  - Targets data consumers, providers, developers, and general users
  - Provides access to information about GDC and contributed cancer genomic data sets
  - Instructs users on the use of GDC data access and submission tools
  - Provides descriptions of GDC bioinformatics pipelines
  - Documents supported GDC data types and file formats
  - Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site: https://gdc.nci.nih.gov
• GDC Documentation Site: https://gdc-docs.nci.nih.gov
• GDC Data Portal: https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  - https://gdc-portal.nci.nih.gov/submission
  - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool: https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  - https://gdc-api.nci.nih.gov
• GDC Support
  - GDC Assistance: support@nci-gdc.datacommons.io
  - GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
GDC Overview GDC Data Submission Processing and Retrieval
April 2016
Agenda
n
GDC Overview
GDC Data Submissio
GDC Data Processing
GDC Data Retrieval
154
GDC Overview Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables
data sharing across cancer genomic studies in support of precision medicine
bull Provide a cancer knowledge network that ndash Enables the identification of both high- and low-frequency cancer drivers ndash Assists in defining genomic determinants of response to therapy ndash Informs the composition of clinical trial cohorts sharing targeted genetic lesions
bull Support the receipt quality control integration storage and redistribution of standardized genomic data sets derived from cancer research studies ndash Harmonization of raw sequence both from existing (eg TCGA TARGET
CGCI) and new cancer research programs
ndash Application of state-of-the-art methods of generating high level data 155
GDC Overview Infrastructure
156
Data Sources (DCC CGHub)
Data Submitters
Open Access Data Users
Controlled Access Data Users
eRACommons amp dbGaP Data
Access Tools
Data Import System
Metadata amp Data Storage
Reporting System
Harmonization amp Generation
Pipelines
3rd Party Applications
GDC Users GDC System Components GDC Interfaces
Alignment amp Processing Tools Data
Submission Tools
Data Security System
APIs
Digital ID System
GDC Overview Resources
157
GDC Data Portal
GDC Data Center
GDC Data Transfer Tool GDC Reports
GDC Data Model
GDC Data Submission
Portal
GDC ApplicatioProgrammingInterface (API)
n GDC Bioinformatics
Pipeline
GDC Documentation
and Support
GDC Organization
and Collaborators
GDC Organization and Collaborators
bull GDC Government Sponsor
NCI Center for Cancer Genomics (CCG)
University of Chicago Team
bull Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
bull GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
bull Contracting organization supporting GDC management and execution
Other Government External
bull Includes eRA dbGaP NCI CBIIT for data access and security compliance bull Includes the GDC Steering Committee Bioinformatics Advisory Group Data
Submitters and User Acceptance Testers (UAT) Testers 158
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
159
GDC Data Submission Data Submitter Types
bull The GDC supports two types of data submitters
ndash Type 1 Large Organizations bull Users associated with an institution or group with significant informatics resources bull Large one-time submission or long-term ongoing data submission bull Primarily use the GDC Application Programming Interfaces (API) or the GDC
Data Transfer Tool (command line interface) for data submission
ndash Type 2 Researchers or Individual Laboratories bull Users associated with a single group (such as a laboratory investigator or
researcher) with limited informatics resources bull One-time or sporadic uploads of low volumes of patient and analysis data with
varying levels of completeness bull Submit via the web-based GDC Data Submission Portal and the GDC Data
Transfer Tool that use the GDC API
160
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  - GDC data submitters must first apply for data submission authorization through dbGaP
  - Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  - Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification.
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores.
  - Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning, quality control, and submission of revised data before release.
    • The pre-processing period may generally last up to six months from the date of first submission.
  - Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication.
  - Data Redaction
    • The GDC, in general, will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data.
GDC Data Submission: Data Submission Process
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter's credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP.
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample (TSV, JSON), Portion (TSV, JSON), Analyte (TSV, JSON), Aliquot (TSV, JSON)
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic (TSV, JSON), Diagnosis (TSV, JSON), Exposure (TSV, JSON), Family History (TSV, JSON), Treatment (TSV, JSON)
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
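Since templates are offered in both TSV and JSON, the same entity can be written in either form. The following is a minimal Python sketch of that equivalence. The column names (type, submitter_id, project_code) and the sample values are hypothetical and are not taken from the GDC data dictionary.

```python
import csv
import io
import json

# Hypothetical TSV template content for one "case" entity. The column
# names and values are illustrative assumptions, not GDC dictionary fields.
TSV_TEMPLATE = "type\tsubmitter_id\tproject_code\ncase\tCASE-001\tDEMO\n"


def tsv_to_records(tsv_text):
    """Read tab-separated template text into a list of dicts, one per row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [dict(row) for row in reader]


records = tsv_to_records(TSV_TEMPLATE)
# The JSON form carries the same fields as the TSV row.
print(json.dumps(records, indent=2))
```

Each TSV row becomes one JSON object, so a submitter could prepare data in a spreadsheet and convert it before upload.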
Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  - GDC clinical data elements were reviewed with members of the research community
  - Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
[Figure: a Case with its Clinical Data (Demographics, Diagnosis, Family History (optional), Exposure (optional), Treatment (optional)), Biospecimen Data, and Experiment Data]
Data Types and File Formats: Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
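A QC report like the one above is easiest to act on when the non-passing modules are listed per read group. The Python sketch below does that for the example values on this slide. The function name and data layout are illustrative assumptions and do not describe how the GDC itself processes FastQC output.

```python
# Module statuses copied from the example report, one entry per read group.
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]
QC_REPORT = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged_modules(report, read_groups, statuses=("WARNING", "FAIL")):
    """Map each read group to the (module, status) pairs that did not pass."""
    flagged = {rg: [] for rg in read_groups}
    for module, results in report.items():
        for rg, status in zip(read_groups, results):
            if status in statuses:
                flagged[rg].append((module, status))
    return flagged


summary = flagged_modules(QC_REPORT, READ_GROUPS)
for rg in READ_GROUPS:
    print(rg, summary[rg])
```

In this example RG_4 passes every module, while RG_1 and RG_2 each have one FAIL that a submitter would likely want to review before release.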
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  - Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  - Validate data against GDC standard data types defined in the project data dictionary
  - Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  - Command line interface to specify desired transfer protocol
  - Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  - Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  - Securely submit biospecimen, clinical, and experiment files using a token
  - Submit data to GDC by performing create, update, and retrieve actions on entities
  - Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
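The create-entities call can be sketched in Python as follows. The sketch only builds the request and does not send it. The host name is the API address listed in this deck; the X-Auth-Token header name, the program and project names, the entity fields, and the token value are assumptions made for illustration.

```python
import json
import urllib.request

API_ROOT = "https://gdc-api.nci.nih.gov"  # API host as listed in this deck


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Assumption: the authentication token travels in this header.
        "X-Auth-Token": token,
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Hypothetical program, project, entity, and token values.
request = build_create_request(
    "DEMO", "PROJECT1", [{"type": "case", "submitter_id": "CASE-001"}], "my-token"
)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would perform the submission; keeping construction separate makes the URL and payload easy to inspect first.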
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
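The six-month release window can be worked out from the submission date. The Python sketch below assumes the window is counted in calendar months and that a day past the end of the target month is clamped to that month's last day; both readings are assumptions, since the slide does not define how the six months are counted.

```python
import calendar
from datetime import date


def add_months(start, months):
    """Return the date `months` calendar months after `start`, clamping the day."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def release_deadline(submitted_on):
    """Latest release date under a six-calendar-month reading of the policy."""
    return add_months(submitted_on, 6)


print(release_deadline(date(2016, 4, 15)))  # 2016-10-15
print(release_deadline(date(2016, 8, 31)))  # 2017-02-28
```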
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]
GDC Processing: GDC High Level Data Generation
• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Retrieval
• Once data is submitted into the GDC and the data submitter signs off on the data to request data release, the GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
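A graph of nodes and edges keyed by unique identifiers can be illustrated with a few lines of Python. This is a toy sketch under stated assumptions: the node labels, relation names, and properties are hypothetical and do not reproduce the actual GDC schema or its storage layer.

```python
import uuid

nodes = {}  # node_id -> {"label": ..., "props": {...}}
edges = []  # (source_id, relation, target_id)


def add_node(label, **props):
    """Create a node with a unique identifier and return that identifier."""
    node_id = str(uuid.uuid4())
    nodes[node_id] = {"label": label, "props": props}
    return node_id


def add_edge(source_id, relation, target_id):
    """Record a directed, named relationship between two nodes."""
    edges.append((source_id, relation, target_id))


def neighbors(node_id, relation):
    """Return ids of nodes reachable from node_id over the given relation."""
    return [dst for src, rel, dst in edges if src == node_id and rel == relation]


# Hypothetical project -> case -> (clinical data, data file) chain.
project = add_node("project", code="DEMO")
case = add_node("case", submitter_id="CASE-001")
diagnosis = add_node("diagnosis", primary_site="Skin")
data_file = add_node("file", file_name="reads.bam")
add_edge(project, "has_case", case)
add_edge(case, "has_clinical", diagnosis)
add_edge(case, "has_file", data_file)

# Walk from the project to the file objects linked to its cases.
file_names = [
    nodes[f]["props"]["file_name"]
    for c in neighbors(project, "has_case")
    for f in neighbors(c, "has_file")
]
print(file_names)
```

Because every relationship is stored as an edge between identifiers, a query for a project's files is a traversal rather than a lookup in a separate file index.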
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  - Data browsing by project, file, case, or annotation
  - Visualization allowing users to perform fine-grained filtering of search results
  - Data search using advanced smart search technology
  - Data selection into a personalized cart
  - Data download from the cart or with the high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  - Command line interface to specify desired transfer protocol and multiple files for download
  - Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  - Supports the secure transfer of controlled access data using an authentication key (token)
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  - Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  - Securely retrieve biospecimen, clinical, and molecular files
  - Perform BAM slicing
Example query: https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
(API URL: https://gdc-api.nci.nih.gov/; endpoint: projects; query parameters: fields=project_id,primary_site and pretty=true)
Example response (portion removed for readability):
  "data": { "hits": [
    { "project_id": "TCGA-SKCM", "primary_site": "Skin" },
    { "project_id": "TCGA-PCPG", "primary_site": "Nervous System" },
    { "project_id": "TCGA-LAML", "primary_site": "Blood" },
    { "project_id": "TCGA-CNTL", "primary_site": "Not Applicable" },
    { "project_id": "TCGA-UVM", "primary_site": "Eye" },
    { "project_id": "TARGET-AML", "primary_site": "Blood" },
    { "project_id": "TCGA-SARC", "primary_site": "Mesenchymal" },
    { "project_id": "TCGA-LUSC", "primary_site": "Lung" },
    { "project_id": "TARGET-NBL", "primary_site": "Nervous System" },
    { "project_id": "TCGA-PAAD", "primary_site": "Pancreas" }
  ] }
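The example query on this slide can be assembled and its response read with the Python standard library. The sketch below builds the URL from its parts and parses a shortened copy of the slide's response instead of calling the service. The helper name is an assumption, and the host is the API address given in this deck.

```python
import json
from urllib.parse import urlencode

API_ROOT = "https://gdc-api.nci.nih.gov"  # API host as listed in this deck


def build_query_url(endpoint, fields, pretty=True):
    """Join the API root, endpoint, and query parameters into one URL."""
    params = {"fields": ",".join(fields), "pretty": str(pretty).lower()}
    # safe="," keeps the comma-separated field list readable in the URL.
    return f"{API_ROOT}/{endpoint}?{urlencode(params, safe=',')}"


url = build_query_url("projects", ["project_id", "primary_site"])
print(url)

# Shortened copy of the response shown on the slide (first three hits only).
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"}
]}}
"""

hits = json.loads(SAMPLE_RESPONSE)["data"]["hits"]
sites = {hit["project_id"]: hit["primary_site"] for hit in hits}
print(sites)
```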
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  - Targets data consumers, providers, developers, and general users
  - Provides access to information about GDC and contributed cancer genomic data sets
  - Instructs users on the use of GDC data access and submission tools
  - Provides descriptions of GDC bioinformatics pipelines
  - Documents supported GDC data types and file formats
  - Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site
  - https://gdc.nci.nih.gov
• GDC Documentation Site
  - https://gdc-docs.nci.nih.gov
• GDC Data Portal
  - https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  - https://gdc-portal.nci.nih.gov/submission
  - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  - https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  - https://gdc-api.nci.nih.gov
• GDC Support
  - GDC Assistance: support@nci-gdc.datacommons.io
  - GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype):
  - Aggregation
  - Access
  - Sharing
  - Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Figure: data flow diagram with the labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF and cancer BAM/VCF; dbGaP and NCI GDC; birth defect VCF and cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users]
[Figure: the GMKF Data Resource shown with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants is associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist and, if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
ADSP Members Area (Dashboard)
Calendar, Documents, Conference call minutes
Reference Dataset catalog
Information on funded cooperative agreements
Bulletin Board for Work Groups
Notification of New Datasets
ADSP by the numbers
Member list: 153 records
352 meeting minutes
>600 other documents
>100 consent files
10 analysis plans
85TB WGS, 1075TB WES data (dbGaP/SRA)
42 TB Files (dbGaP Exchange Area)
niagads.org/genomics: Web interface for query and analysis
SNP/Gene reports, Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer's Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
[Figure: GWAS results (SNPs below a significance level) combined with gene annotations transformed to SNPs, giving genomic features from the identified SNPs]
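Combining a GWAS significance search with a gene annotation search amounts to intersecting two SNP sets. The Python sketch below shows that idea on made-up data. The SNP identifiers, p-values, gene names, and the 5e-8 threshold are hypothetical and are not NIAGADS results or its query interface.

```python
# Hypothetical GWAS summary rows (SNP id, p-value); not real results.
GWAS_RESULTS = [
    ("rs0001", 3e-9),
    ("rs0002", 0.02),
    ("rs0003", 4e-8),
    ("rs0004", 1e-10),
]

# Hypothetical gene annotation already transformed to SNP sets.
GENE_TO_SNPS = {
    "GENE_A": {"rs0001", "rs0002"},
    "GENE_B": {"rs0003"},
}


def significant_snps(results, threshold):
    """SNPs whose p-value falls below the chosen significance level."""
    return {snp for snp, p_value in results if p_value < threshold}


def snps_for_genes(gene_to_snps, genes):
    """Union of the SNP sets annotated to the requested genes."""
    selected = set()
    for gene in genes:
        selected |= gene_to_snps.get(gene, set())
    return selected


def combine_searches(results, threshold, gene_to_snps, genes):
    """SNPs that are both significant and annotated to a gene of interest."""
    return significant_snps(results, threshold) & snps_for_genes(gene_to_snps, genes)


combined = combine_searches(GWAS_RESULTS, 5e-8, GENE_TO_SNPS, ["GENE_A"])
print(sorted(combined))
```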
Find a genomic location of interest by viewing tracks on the genome browser
[Figure: genome browser tracks for SNPs, genes, GWAS results, and functional genomics data]
View Genomic Information by the Genome Browser
[Figure: search result]
Data available at the NIAGADS Genomics Database
• Gene models, SNP information
• Pathway annotations: Gene Ontology, KEGG Pathway
• Functional genomics data: ENCODE, FANTOM5, GTEx results
• NHGRI GWAS catalog
• Published genome-wide AD GWAS summary statistics
• Ongoing effort to deposit annotations curated by the ADSP Annotation WG into Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  - Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  - Big data management
  - IT infrastructure and web development
• Administrative support and collaboration
  - Familiarity with human subjects research regulations and NIH policy
  - Leadership and coordinating skills
  - Flexibility, close collaboration with program officers
• Domain expertise
  - Knowledge of the disease and genetic research
  - Familiarity with the community and cohorts
  - Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
What the Data Coordinating Center Should Know
Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements
Patients and families of those with AD
NIA ADRC program and centers
NACC, NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD
NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly
NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon
NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao
GDC Overview: GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
  - Enables the identification of both high- and low-frequency cancer drivers
  - Assists in defining genomic determinants of response to therapy
  - Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  - Harmonization of raw sequence both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  - Application of state-of-the-art methods of generating high level data
GDC Overview: Infrastructure
[Figure: GDC infrastructure diagram in three groups (GDC Users, GDC System Components, GDC Interfaces) with the labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; 3rd Party Applications; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; APIs; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System]
GDC Overview: Resources
[Figure: GDC resources: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators]
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc.
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testing (UAT) Testers
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype):
  - Aggregation
  - Access
  - Sharing
  - Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Figure: data flow diagram. Labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF and cancer BAM/VCF; dbGaP and NCI GDC; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users.]
[Figure: the GMKF Data Resource shown alongside dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics.]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
niagads.org/genomics: Web interface for query and analysis
SNP/Gene reports, Genome Browser interface
Integrating AD genetics with genomic knowledge
Search for SNPs using Alzheimer's Disease GWAS Significance Level
Combine searches on GWAS results and gene annotation to find genomic features of interest
[Figure: search strategy. Labels: GWAS results combined with SNPs below; gene annotations transformed to SNPs; genomic features from identified SNPs.]
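The combined search described above (filter SNPs by GWAS significance, transform gene annotations to SNPs, then intersect) can be sketched as set operations. This is not the NIAGADS implementation; the threshold of 5e-8 is the conventional genome-wide significance level and is an assumption here, and all identifiers, p-values, and function names below are made up for illustration.

```python
# Conventional genome-wide significance level; an assumption, not from the slides.
GENOME_WIDE_P = 5e-8


def significant_snps(gwas_results, threshold=GENOME_WIDE_P):
    """Return the SNP ids whose GWAS p-value is at or below the threshold."""
    return {snp for snp, p_value in gwas_results.items() if p_value <= threshold}


def snps_in_genes(gene_to_snps, genes):
    """Transform a list of genes into the set of SNPs annotated to them."""
    found = set()
    for gene in genes:
        found |= set(gene_to_snps.get(gene, ()))
    return found


# Hypothetical toy data: SNP id -> p-value, and gene -> annotated SNP ids.
gwas = {"rs0001": 1e-12, "rs0002": 3e-4, "rs0003": 2e-9}
annotation = {"GENE_A": ["rs0001", "rs0002"], "GENE_B": ["rs0003"]}

# Combine the two searches: significant SNPs that also fall in GENE_A.
hits = significant_snps(gwas) & snps_in_genes(annotation, ["GENE_A"])
print(sorted(hits))
```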
Find a genomic location of interest by viewing tracks on the genome browser
[Figure: genome browser tracks. Labels: SNPs; genes; GWAS results; functional genomics data.]
View Genomic Information by the Genome Browser
[Figure: search result.]
Data available at the NIAGADS Genomics Database
• Gene models, SNP information
• Pathway annotations: Gene Ontology, KEGG Pathway
• Functional genomics data: ENCODE, FANTOM5
• GTEx results
• NHGRI GWAS catalog
• Published genome-wide AD GWAS summary statistics
• Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  - Expertise in genomics, next generation sequencing, method and software tool development, and high performance computing
  - Big data management
  - IT infrastructure and web development
• Administrative support and collaboration
  - Familiarity with human subjects research regulations and NIH policy
  - Leadership and coordinating skills
  - Flexibility, close collaboration with program officers
• Domain expertise
  - Knowledge of the disease and genetic research
  - Familiarity with the community and cohorts
  - Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
What the Data Coordinating Center Should Know: Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements
• Patients and families of those with AD
• NIA ADRC program and centers
• NACC
• NCRAD
• ADSP
  - Data flow WG
  - Annotation WG
• IGAP/ADGC/CHARGE/EADI/GERAD
NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly
NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon
NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao
GDC Overview: GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
• GDC Overview
• GDC Data Submission
• GDC Data Processing
• GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.
• Provide a cancer knowledge network that
  - Enables the identification of both high- and low-frequency cancer drivers
  - Assists in defining genomic determinants of response to therapy
  - Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  - Harmonization of raw sequence data from both existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  - Application of state-of-the-art methods of generating high level data
GDC Overview: Infrastructure
[Figure: GDC infrastructure diagram in three groups (GDC Users, GDC System Components, GDC Interfaces). Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; 3rd Party Applications; Alignment & Processing Tools; Data Submission Tools; Data Security System; APIs; Digital ID System.]
GDC Overview: Resources
[Figure: GDC resources. Labels: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators.]
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testing (UAT) Testers
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
  - Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  - Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  - GDC data submitters must first apply for data submission authorization through dbGaP
  - Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  - Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification.
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores.
  - Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality assurance and to submit revised data before release.
    • The pre-processing period may generally last up to six months from the date of first submission.
  - Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication.
  - Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data.
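The release timing in the policy above comes down to simple date arithmetic, sketched here. The slide says data are released "either six months after data submission or at the time of first publication"; this sketch assumes the earlier of the two applies, which the slide does not state explicitly. `add_months` and `release_date` are hypothetical helper names.

```python
import calendar
from datetime import date


def add_months(start, months):
    """Add calendar months, clamping the day to the length of the target month."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def release_date(submitted, first_publication=None, window_months=6):
    """Latest release date under the six-month window.

    Assumption: if a first publication date is known, the earlier of the
    two dates applies.
    """
    deadline = add_months(submitted, window_months)
    if first_publication is None:
        return deadline
    return min(deadline, first_publication)


print(release_date(date(2016, 4, 15)))
print(release_date(date(2016, 4, 15), first_publication=date(2016, 7, 1)))
```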
GDC Data Submission: Data Submission Process
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC.
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter's credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP.
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API).
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data.
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample | TSV, JSON
 | | | Portion | TSV, JSON
 | | | Analyte | TSV, JSON
 | | | Aliquot | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic | TSV, JSON
 | | | Diagnosis | TSV, JSON
 | | | Exposure | TSV, JSON
 | | | Family History | TSV, JSON
 | | | Treatment | TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
 | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
 | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
 | Pathology Report | PDF | Pathology Report | TSV, JSON
 | Run Metadata | SRA XML | | TSV, JSON
 | Slide Image | SVS | | TSV, JSON
 | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
 | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
 | Slide | | Slide | TSV, JSON
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  - GDC clinical data elements were reviewed with members of the research community
  - Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
[Figure: a Case links to Clinical Data (Demographics, Diagnosis, Family History (optional), Exposure (optional), Treatment (optional)), Biospecimen Data, and Experiment Data.]
Data Types and File Formats: Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
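A QC report like the example above is easier to act on once the non-PASS modules are grouped per read group. The sketch below transcribes the module statuses from the slide's example table and lists what needs attention; it is only an illustration of reading such a table, not the GDC's reporting code, and `flagged_modules` is a hypothetical helper name.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Module statuses transcribed from the example table, one value per read group.
QC_TABLE = {
    "Basic Statistics": "PASS PASS PASS PASS",
    "Per base sequence quality": "PASS PASS PASS PASS",
    "Per tile sequence quality": "PASS WARNING PASS PASS",
    "Per sequence quality scores": "FAIL PASS PASS PASS",
    "Per base sequence content": "PASS PASS WARNING PASS",
    "Per sequence GC content": "PASS PASS PASS PASS",
    "Per base N content": "PASS PASS PASS PASS",
    "Sequence Length Distribution": "PASS PASS PASS PASS",
    "Sequence Duplication Levels": "WARNING WARNING PASS PASS",
    "Overrepresented sequences": "PASS PASS PASS PASS",
    "Adapter Content": "PASS FAIL PASS PASS",
    "Kmer Content": "PASS PASS WARNING PASS",
}


def flagged_modules(table, read_groups):
    """Return {read_group: [(module, status), ...]} for every non-PASS status."""
    flagged = {read_group: [] for read_group in read_groups}
    for module, statuses in table.items():
        for read_group, status in zip(read_groups, statuses.split()):
            if status != "PASS":
                flagged[read_group].append((module, status))
    return flagged


flags = flagged_modules(QC_TABLE, READ_GROUPS)
failed = [rg for rg, items in flags.items() if any(s == "FAIL" for _, s in items)]
for read_group in READ_GROUPS:
    print(read_group, flags[read_group])
print("Read groups with at least one FAIL:", failed)
```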
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata.
  - Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  - Validate data against GDC standard data types defined in the project data dictionary
  - Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks.
  - Command line interface to specify the desired transfer protocol
  - Use of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  - Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis.
  - Securely submit biospecimen, clinical, and experiment files using a token
  - Submit data to the GDC by performing create, update, and retrieve actions on entities
  - Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
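A sketch of how a request to the create-entities endpoint might be prepared. Only the endpoint path comes from the slide. The base URL is the one given in the deck's references, passing the token in an `X-Auth-Token` header is an assumption not stated in the slides, and the program, project, and entity payload are hypothetical placeholders. The request is built but never sent.

```python
import json
from urllib.request import Request

# Base URL from the deck's reference slide (April 2016); it may have changed.
GDC_API_BASE = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token, base=GDC_API_BASE):
    """Prepare (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{base}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Auth-Token": token,  # assumed header name for the authentication token
    }
    return Request(url, data=body, headers=headers, method="POST")


# Hypothetical entity; real field names come from the project data dictionary.
example_entities = [{"type": "case", "submitter_id": "EXAMPLE-CASE-001"}]

request = build_create_request("PROGRAM", "PROJECT", example_entities, token="<token>")
print(request.get_method(), request.full_url)
```

Sending the request (for example with `urllib.request.urlopen`) is left out on purpose, since it requires a valid token and dbGaP authorization for the project.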
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data.
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies.
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set.
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline.]
GDC Processing: GDC High Level Data Generation
• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
GDC Data Retrieval
• Once data is submitted into the GDC and the data submitter signs off on the data to request data release, the GDC performs Quality Control (QC) and data processing.
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions.
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API.
• Data queries are based on the GDC Data Model.
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC.
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers.
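The idea of a graph whose nodes carry unique identifiers and whose edges link projects, cases, clinical data, and files can be shown with a toy structure. This is an illustrative sketch only: the `Graph` class, node labels, and identifiers are hypothetical and do not reproduce the GDC's actual schema or storage.

```python
from collections import defaultdict


class Graph:
    """Toy directed graph: nodes keyed by unique id, edges from parent to child."""

    def __init__(self):
        self.nodes = {}                 # node_id -> {"label": ..., other properties}
        self.edges = defaultdict(set)   # parent_id -> set of child ids

    def add_node(self, node_id, label, **properties):
        self.nodes[node_id] = {"label": label, **properties}

    def add_edge(self, parent_id, child_id):
        if parent_id not in self.nodes or child_id not in self.nodes:
            raise KeyError("both nodes must exist before they are linked")
        self.edges[parent_id].add(child_id)

    def descendants(self, start_id, label=None):
        """Ids reachable from start_id, optionally restricted to one label."""
        seen, stack, found = set(), [start_id], []
        while stack:
            node_id = stack.pop()
            for child_id in self.edges[node_id]:
                if child_id not in seen:
                    seen.add(child_id)
                    stack.append(child_id)
                    if label is None or self.nodes[child_id]["label"] == label:
                        found.append(child_id)
        return sorted(found)


# Hypothetical identifiers; a real system would use generated unique ids.
graph = Graph()
graph.add_node("project-1", "project")
graph.add_node("case-1", "case")
graph.add_node("diagnosis-1", "clinical")
graph.add_node("sample-1", "sample")
graph.add_node("file-1", "file", file_name="reads_1.bam")
graph.add_node("file-2", "file", file_name="reads_2.bam")
graph.add_edge("project-1", "case-1")
graph.add_edge("case-1", "diagnosis-1")
graph.add_edge("case-1", "sample-1")
graph.add_edge("sample-1", "file-1")
graph.add_edge("sample-1", "file-2")

print(graph.descendants("case-1", label="file"))
```

Walking the edges from a case to its files is what keeps clinical records and data file objects tied together: a query can start at any node and follow identifiers rather than matching on names.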
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics.
  - Data browsing by project, file, case, or annotation
  - Visualization allowing users to perform fine-grained filtering of search results
  - Data search using advanced smart search technology
  - Data selection into a personalized cart
  - Data download from the cart or a high-performance Data Transfer Tool
GDC Overview GDC Data Submission Processing and Retrieval
April 2016
Agenda
n
GDC Overview
GDC Data Submissio
GDC Data Processing
GDC Data Retrieval
154
GDC Overview Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables
data sharing across cancer genomic studies in support of precision medicine
bull Provide a cancer knowledge network that ndash Enables the identification of both high- and low-frequency cancer drivers ndash Assists in defining genomic determinants of response to therapy ndash Informs the composition of clinical trial cohorts sharing targeted genetic lesions
bull Support the receipt quality control integration storage and redistribution of standardized genomic data sets derived from cancer research studies ndash Harmonization of raw sequence both from existing (eg TCGA TARGET
CGCI) and new cancer research programs
ndash Application of state-of-the-art methods of generating high level data 155
GDC Overview Infrastructure
156
Data Sources (DCC CGHub)
Data Submitters
Open Access Data Users
Controlled Access Data Users
eRACommons amp dbGaP Data
Access Tools
Data Import System
Metadata amp Data Storage
Reporting System
Harmonization amp Generation
Pipelines
3rd Party Applications
GDC Users GDC System Components GDC Interfaces
Alignment amp Processing Tools Data
Submission Tools
Data Security System
APIs
Digital ID System
GDC Overview Resources
157
GDC Data Portal
GDC Data Center
GDC Data Transfer Tool GDC Reports
GDC Data Model
GDC Data Submission
Portal
GDC ApplicatioProgrammingInterface (API)
n GDC Bioinformatics
Pipeline
GDC Documentation
and Support
GDC Organization
and Collaborators
GDC Organization and Collaborators
bull GDC Government Sponsor
NCI Center for Cancer Genomics (CCG)
University of Chicago Team
bull Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
bull GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
bull Contracting organization supporting GDC management and execution
Other Government External
bull Includes eRA dbGaP NCI CBIIT for data access and security compliance bull Includes the GDC Steering Committee Bioinformatics Advisory Group Data
Submitters and User Acceptance Testers (UAT) Testers 158
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
159
GDC Data Submission Data Submitter Types
bull The GDC supports two types of data submitters
ndash Type 1 Large Organizations bull Users associated with an institution or group with significant informatics resources bull Large one-time submission or long-term ongoing data submission bull Primarily use the GDC Application Programming Interfaces (API) or the GDC
Data Transfer Tool (command line interface) for data submission
ndash Type 2 Researchers or Individual Laboratories bull Users associated with a single group (such as a laboratory investigator or
researcher) with limited informatics resources bull One-time or sporadic uploads of low volumes of patient and analysis data with
varying levels of completeness bull Submit via the web-based GDC Data Submission Portal and the GDC Data
Transfer Tool that use the GDC API
160
GDC Data Submission Data Submission Policies
bull dbGaP Data Submission Policies ndash GDC data submitters must first apply for data submission authorization through dbGaP ndash Data submission through dbGaP requires institutional certification under NIHrsquos Genomi c
Data Sharing Policy bull GDC Data Sharing Policies
ndash Data Sharing Requirement bull Data submitted to the GDC will be made available to the scientific community according to the
data submitterrsquos NCI Genomic Data Sharing Plan Controlled access data will be made availabl e to members of the community having the appropriate dbGaP Data Use Certification
bull The GDC will produce harmonized data (raw and derived) based on the originally submitted data The GDC will not preserve an exact copy of the originally submitted data however the GDC will preserve the original reads and quality scores
ndash Data Pre-processing Period bull For each project the GDC will afford a pre-processing period of exclusive that allows for
submitters to perform data cleaning and quality and submission of revised data before release bull The pre-processing period may generally last up to six months from the date of first submission
ndash Data Submission Period and Release bull Once submitted data will be processed and validated by the GDC Submitted data will be
released and available via controlled access for research that is consistent with the datasetrsquos ldquodata use limitationsrdquo either six months after data submission or at the time of first publication
ndash Data Redaction bull The GDC in general will not remove data access in response to submitter requests GDC will
remove data access in the following events Data Management Incident Human Subjects 161
Compliance Issue Erroneous Data
GDC Data Submission: Data Submission Process
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC.
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter's credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP.
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload their data to a workspace within the GDC and validate it there using Data Submission Tools such as the GDC Data Submission Portal, the GDC Data Transfer Tool, or the GDC Application Programming Interface (API).
• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data.
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type.
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample: TSV, JSON; Portion: TSV, JSON; Analyte: TSV, JSON; Aliquot: TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic: TSV, JSON; Diagnosis: TSV, JSON; Exposure: TSV, JSON; Family History: TSV, JSON; Treatment: TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions.
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format.
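To illustrate how the TSV and JSON formats relate, the sketch below converts tab-separated rows into JSON records using only the Python standard library. The column names (`type`, `submitter_id`, `sample_type`) and the row values are placeholders chosen for this example, not fields taken from an actual GDC template; the GDC data dictionary defines the real ones.

```python
import csv
import io
import json


def tsv_to_records(tsv_text: str) -> list:
    """Parse tab-separated text with a header row into a list of dicts."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [dict(row) for row in reader]


# Hypothetical two-row sample sheet; column names are illustrative only.
example_tsv = (
    "type\tsubmitter_id\tsample_type\n"
    "sample\tSAMPLE-001\tPrimary Tumor\n"
    "sample\tSAMPLE-002\tBlood Derived Normal\n"
)

records = tsv_to_records(example_tsv)
print(json.dumps(records, indent=2))
```

Each TSV row becomes one JSON object keyed by the header row, which is why the same template can be offered in either format.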
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment.
  – GDC clinical data elements were reviewed with members of the research community.
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR).
[Figure: a Case linked to Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)), Biospecimen Data, and Experiment Data]
Data Types and File Formats: Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format.
• GDC performs and reports on Quality Control (QC) metrics generated by FASTQC.
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
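A QC report like the one above is often summarized per read group. The sketch below is one possible way to do that, using the statuses shown on the slide; the data structure and the helper name `flagged_modules` are assumptions made for illustration and do not describe how the GDC stores or evaluates these metrics.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Only the modules with at least one non-PASS result on the slide are listed.
QC_RESULTS = {
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged_modules(results, read_groups, status):
    """Map each read group to the modules that reported the given status."""
    flagged = {rg: [] for rg in read_groups}
    for module, statuses in results.items():
        for rg, module_status in zip(read_groups, statuses):
            if module_status == status:
                flagged[rg].append(module)
    return flagged


failed = flagged_modules(QC_RESULTS, READ_GROUPS, "FAIL")
warned = flagged_modules(QC_RESULTS, READ_GROUPS, "WARNING")

for rg in READ_GROUPS:
    print(rg, "FAIL:", failed[rg], "WARNING:", warned[rg])
```

In this example only RG_4 passes every module; whether a FAIL should block submission is a policy question the slide does not address.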
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata.
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks.
  – Command line interface to specify the desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the submission of cancer data sets for analysis.
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
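The create-entities call can be sketched with the Python standard library. The example below only builds the request and does not send it. The host comes from the References slide; the `X-Auth-Token` header name, the JSON body shape, and the program, project, and token values are assumptions for illustration and should be checked against the GDC API documentation.

```python
import json
import urllib.request

API_ROOT = "https://gdc-api.nci.nih.gov"  # host listed on the References slide


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST request for the create-entities endpoint."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the token
        },
    )


# Placeholder program, project, entity, and token values.
request = build_create_request(
    "PROGRAM",
    "PROJECT",
    [{"type": "case", "submitter_id": "CASE-001"}],
    token="example-token",
)
print(request.get_method(), request.full_url)
```

Sending the request would be a call to `urllib.request.urlopen(request)`, which requires a valid token and submission authorization.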
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data.
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies.
• Data is then made available through GDC Data Access Tools as open or controlled access, per the dbGaP authorization policies associated with the data set.
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38).
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]
GDC Processing: GDC High Level Data Generation
• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations.
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines.
GDC Processing: GDC Variant Calling Pipelines
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Retrieval
• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing.
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions.
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API.
• Data queries are based on the GDC Data Model.
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC.
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers.
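The idea of a graph of nodes and edges keyed by unique identifiers can be sketched in a few lines. This is a toy model written for illustration, not the GDC implementation: the `Graph` class, the node labels, and the identifiers are all hypothetical.

```python
from collections import defaultdict


class Graph:
    """Toy directed graph: nodes carry a label, edges link node ids."""

    def __init__(self):
        self.nodes = {}                # node id -> {"label": ..., other properties}
        self.edges = defaultdict(set)  # node id -> ids of linked nodes

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, **props}

    def add_edge(self, src, dst):
        # Refuse edges to unknown ids so every link resolves to a real node.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must exist before linking")
        self.edges[src].add(dst)

    def reachable(self, start, label):
        """Return sorted ids of nodes with the given label reachable from start."""
        seen, stack, found = {start}, [start], []
        while stack:
            current = stack.pop()
            for nxt in self.edges.get(current, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
                    if self.nodes[nxt]["label"] == label:
                        found.append(nxt)
        return sorted(found)


g = Graph()
g.add_node("project-1", "project")
g.add_node("case-1", "case")
g.add_node("file-1", "file", data_format="BAM")
g.add_node("file-2", "file", data_format="FASTQ")
g.add_edge("project-1", "case-1")
g.add_edge("case-1", "file-1")
g.add_edge("case-1", "file-2")

print(g.reachable("project-1", "file"))
```

Walking the edges from a project to its files is the kind of query such a graph makes straightforward, and rejecting edges to unknown identifiers is one simple way to keep links consistent.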
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics.
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from the cart or a high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks.
  – Command line interface to specify the desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis.
  – Search for projects, files, cases, and annotations and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing
Request URL (API URL, endpoint, and query parameters): https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Response (portion removed for readability):
"data": {
  "hits": [
    {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
    {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
    {"project_id": "TCGA-LAML", "primary_site": "Blood"},
    {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
    {"project_id": "TCGA-UVM", "primary_site": "Eye"},
    {"project_id": "TARGET-AML", "primary_site": "Blood"},
    {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
    {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
    {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
    {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
  ]
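The request URL and the shape of the response can be worked through in code. The sketch below builds the same projects query with `urllib.parse.urlencode` and groups a small response by primary site. The sample response is a hand-written subset of the hits shown on the slide, and the helper names are hypothetical; no network call is made.

```python
import json
from urllib.parse import urlencode

API_ROOT = "https://gdc-api.nci.nih.gov"  # host shown on the slide


def projects_url(fields, pretty=True):
    """Build the /projects query URL with a comma-separated fields list."""
    query = urlencode(
        {"fields": ",".join(fields), "pretty": str(pretty).lower()},
        safe=",",  # keep commas readable instead of percent-encoding them
    )
    return f"{API_ROOT}/projects?{query}"


def projects_by_site(response):
    """Group project ids by primary site from a parsed response."""
    by_site = {}
    for hit in response["data"]["hits"]:
        by_site.setdefault(hit["primary_site"], []).append(hit["project_id"])
    return by_site


# Hand-written subset of the hits on the slide, for illustration only.
sample_response = json.loads("""
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}
""")

print(projects_url(["project_id", "primary_site"]))
print(projects_by_site(sample_response))
```

The `fields` parameter limits each hit to the listed attributes, which is why every hit on the slide carries only `project_id` and `primary_site`.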
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission.
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary.
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site
  – https://gdc.nci.nih.gov
• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov
• GDC Data Portal
  – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype):
  – Aggregation
  – Access
  – Sharing
  – Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Figure: data flow diagram. Labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF; cancer BAM/VCF; dbGaP; NCI GDC; birth defect VCF; cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users]
[Figure: the GMKF Data Resource shown with related resources: dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, Center for Mendelian Genomics]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
Combine searches on GWAS results and gene annotation to find genomic features of interest
• GWAS results combined with SNPs below
• Gene annotations transformed to SNPs
• Genomic features from identified SNPs
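The combined search described above amounts to intersecting two SNP sets. The sketch below shows that idea with plain Python sets. It is not the NIAGADS implementation: the SNP identifiers, gene names, and p-values are made up, and the default cutoff of 5e-8 is the conventional genome-wide significance threshold rather than a value taken from the slides.

```python
def significant_snps(gwas_results, p_threshold=5e-8):
    """Return the SNP ids whose p-value is at or below the threshold."""
    return {snp for snp, p_value in gwas_results.items() if p_value <= p_threshold}


def snps_in_genes(gene_to_snps, genes):
    """Transform a list of genes into the set of SNPs annotated to them."""
    snps = set()
    for gene in genes:
        snps |= gene_to_snps.get(gene, set())
    return snps


# Made-up GWAS p-values and gene annotations, for illustration only.
gwas = {"rs0001": 1e-10, "rs0002": 3e-4, "rs0003": 2e-9}
gene_snps = {"GENE_A": {"rs0001", "rs0002"}, "GENE_B": {"rs0004"}}

hits = significant_snps(gwas) & snps_in_genes(gene_snps, ["GENE_A", "GENE_B"])
print(sorted(hits))
```

Here `rs0003` is significant but lies outside the selected genes, and `rs0002` lies in a selected gene but is not significant, so only `rs0001` survives the intersection.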
Find a genomic location of interest by viewing tracks on the genome browser
[Figure: View Genomic Information by the Genome Browser. Labels: SNPs, genes, GWAS results, functional genomics data, search result]
Data available at the NIAGADS Genomics Database
• Gene models
• SNP information
• Pathway annotations: Gene Ontology, KEGG Pathway
• Functional genomics data: ENCODE, FANTOM5
• GTEx results
• NHGRI GWAS catalog
• Published genome-wide AD GWAS summary statistics
• Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  – Expertise in genomics, next-generation sequencing, method and software tool development, high-performance computing
  – Big data management
  – IT infrastructure and web development
• Administrative support and collaboration
  – Familiarity with human subjects research regulations and NIH policy
  – Leadership and coordinating skills
  – Flexibility; close collaboration with program officers
• Domain expertise
  – Knowledge of the disease and genetic research
  – Familiarity with the community and cohorts
  – Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
What the Data Coordinating Center Should Know: Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements
Patients and families of those with AD
NIA ADRC program and centers
NACC, NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD
NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly
NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon
Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao
NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
GDC Overview: GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.
• Provide a cancer knowledge network that:
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high-level data
GDC Overview: Infrastructure
[Figure: GDC infrastructure diagram with three groups (GDC Users, GDC System Components, GDC Interfaces). Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; 3rd Party Applications; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; APIs; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System]
GDC Overview: Resources
[Figure: GDC resources. Labels: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators]
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters:
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification.
Gabriella Miller Kids First Pediatric Data Resource Center
bull Goal accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
bull By enabling data ndash Aggregation ndash Access
Whole genome sequence + Phenotype ndash Sharing ndash Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
bull Web-based public facing platform bull House organize index and display data and
analytic tools
Data Coordinating
Center
bull Facilitate deposition of sequence and phenotype data into relevant repositories
bull Harmonize phenotypes
Administrative and Outreach
Core
bull Develop policies and procedures bull Facilitate meetings and communication bull Educate and seek feedback from users
Birth defect or childhood cancer
DNA Sequencing center
cohorts
Birth defect BAMVCF Cancer BAMVCF
dbGaP NCI GDC
Birth defect VCF Cancer VCF
Gabriella Miller Kids First Data
Resource Birth defect BAMVCF
Index of datasets Phenotype Variant summaries
Cancer BAMVCF
Users
GMKF Data
Resource
dbGaP
NCI GDC
TOPMed
The Monarch Initiative
ClinGen
Matchmaker Exchange
Center for Mendelian Genomics
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
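A question such as "do similar patients exist?" amounts to a set-containment query over an index of cases and their phenotypes. The sketch below is only an illustration of that idea: the case identifiers, the record layout, and the `find_cases` helper are all hypothetical and do not describe how the Data Resource is implemented.

```python
from typing import Dict, List, Set

# Hypothetical index records; identifiers and phenotype terms are illustrative only.
CASES: List[Dict] = [
    {"case_id": "case-001",
     "phenotypes": {"congenital facial palsy", "cleft palate", "syndactyly"}},
    {"case_id": "case-002", "phenotypes": {"diaphragmatic hernia"}},
    {"case_id": "case-003", "phenotypes": {"cleft palate", "syndactyly"}},
]


def find_cases(cases: List[Dict], required: Set[str]) -> List[str]:
    """Return IDs of cases whose phenotype set contains every required term."""
    return [c["case_id"] for c in cases if required <= c["phenotypes"]]


matches = find_cases(CASES, {"cleft palate", "syndactyly"})
print(matches)
```

A production index would more plausibly match on ontology terms rather than free-text strings, so that related phenotypes can be found as well as exact ones.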
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
Find a genomic location of interest by viewing tracks on the genome browser
[Screenshot labels: SNPs; genes; GWAS results; functional genomics data]
View Genomic Information by the Genome Browser
[Screenshot label: search result]
Data available at the NIAGADS Genomics Database
• Gene models, SNP information
• Pathway annotations: Gene Ontology, KEGG Pathway
• Functional genomics data: ENCODE, FANTOM5
• GTEx results
• NHGRI GWAS catalog
• Published genome-wide AD GWAS summary statistics
• Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
– Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
– Big data management
– IT infrastructure and web development
• Administrative support and collaboration
– Familiarity with human subjects research regulations and NIH policy
– Leadership and coordinating skills
– Flexibility, close collaboration with program officers
• Domain expertise
– Knowledge of the disease and genetic research
– Familiarity with the community and cohorts
– Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

What the Data Coordinating Center Should Know: Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss them
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements
• Patients and families of those with AD
• NIA ADRC program and centers
• NACC
• NCRAD
• ADSP
– Data flow WG
– Annotation WG
• IGAP/ADGC/CHARGE/EADI/GERAD
NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly
NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon
NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao
GDC Overview: GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.
• Provide a cancer knowledge network that
– Enables the identification of both high- and low-frequency cancer drivers
– Assists in defining genomic determinants of response to therapy
– Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
– Harmonization of raw sequence data, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
– Application of state-of-the-art methods of generating high level data
GDC Overview: Infrastructure
[Diagram column headings: GDC Users; GDC System Components; GDC Interfaces]
[Diagram labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; 3rd Party Applications; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; APIs; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System]
GDC Overview: Resources
[Diagram labels: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators]
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
– Type 1: Large Organizations
  • Users associated with an institution or group with significant informatics resources
  • Large one-time submission or long-term ongoing data submission
  • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
– Type 2: Researchers or Individual Laboratories
  • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
  • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
  • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, both of which use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
– GDC data submitters must first apply for data submission authorization through dbGaP
– Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
– Data Sharing Requirement
  • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
  • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
– Data Pre-processing Period
  • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
  • The pre-processing period may generally last up to six months from the date of first submission
– Data Submission Period and Release
  • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
– Data Redaction
  • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
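The release rule above can be worked through as a small date calculation. This is a sketch under one stated assumption: that "either six months after data submission or at the time of first publication" means whichever comes first, which the slide does not say explicitly. The helper names are illustrative.

```python
import calendar
from datetime import date
from typing import Optional


def add_months(d: date, months: int) -> date:
    """Add calendar months, clamping the day to the length of the target month."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def release_date(submitted: date, first_publication: Optional[date] = None) -> date:
    """Earlier of (submission + 6 months) and first publication (assumed reading)."""
    deadline = add_months(submitted, 6)
    if first_publication is not None and first_publication < deadline:
        return first_publication
    return deadline


print(release_date(date(2016, 4, 15)))
print(release_date(date(2016, 4, 15), first_publication=date(2016, 6, 1)))
```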
GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter's credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload their data to a workspace within the GDC and validate it there, using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | TSV, JSON for each
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | TSV, JSON for each
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
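Because TSV and JSON are both accepted, a tabular template can be turned into JSON records with the standard library alone. The following is a minimal sketch: the column names (`type`, `submitter_id`, `sample_type`) and the row values are assumptions chosen for illustration, and the authoritative field list is whatever the GDC data dictionary defines for each entity.

```python
import csv
import io
import json
from typing import Dict, List

# Illustrative TSV content; column names and values are assumptions, not the official template.
TSV_TEXT = (
    "type\tsubmitter_id\tsample_type\n"
    "sample\tsample-001\tPrimary Tumor\n"
    "sample\tsample-002\tBlood Derived Normal\n"
)


def tsv_to_records(text: str) -> List[Dict[str, str]]:
    """Parse tab-separated text with a header row into a list of dictionaries."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [dict(row) for row in reader]


records = tsv_to_records(TSV_TEXT)
print(json.dumps(records, indent=2))
```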
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
– GDC clinical data elements were reviewed with members of the research community
– Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
[Diagram labels: Case; Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)); Biospecimen Data; Experiment Data]
Data Types and File Formats: Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
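A per-read-group report like the one above is easiest to act on when the non-passing modules are collected by read group. The sketch below copies only the rows of the example table that contain a WARNING or FAIL; the `flagged` helper is hypothetical and is not part of FastQC or the GDC.

```python
from typing import Dict, List

READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Rows from the example table that contain at least one non-PASS status.
QC_FLAGS: Dict[str, List[str]] = {
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged(flags: Dict[str, List[str]], read_groups: List[str],
            status: str) -> Dict[str, List[str]]:
    """Map each read group to the QC modules that reported the given status."""
    result: Dict[str, List[str]] = {rg: [] for rg in read_groups}
    for module, statuses in flags.items():
        for rg, value in zip(read_groups, statuses):
            if value == status:
                result[rg].append(module)
    return result


failures = flagged(QC_FLAGS, READ_GROUPS, "FAIL")
warnings = flagged(QC_FLAGS, READ_GROUPS, "WARNING")
print(failures)
print(warnings)
```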
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
– Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
– Validate data against GDC standard data types defined in the project data dictionary
– Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
– Command line interface to specify the desired transfer protocol
– Uses a manifest file generated from the GDC Data Submission Portal for multiple file uploads
– Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the submission of cancer data sets for analysis
– Securely submit biospecimen, clinical, and experiment files using a token
– Submit data to GDC by performing create, update, and retrieve actions on entities
– Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
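The create-entities call can be sketched by building the HTTP request without sending it. Several details here are assumptions rather than facts from the slides: the base URL is the API address listed in the References slide and may since have changed, the `X-Auth-Token` header name is assumed, and the program, project, and entity values are placeholders.

```python
import json
from typing import Dict, List
from urllib.request import Request

BASE_URL = "https://gdc-api.nci.nih.gov"  # taken from the slides; may no longer be current


def build_create_request(program: str, project: str,
                         entities: List[Dict], token: str) -> Request:
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{BASE_URL}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Auth-Token": token,  # assumed header name for the authentication token
    }
    return Request(url, data=body, headers=headers, method="POST")


# Placeholder program, project, and entity; real values come from the submitter's project.
req = build_create_request(
    "TCGA", "SKCM",
    [{"type": "case", "submitter_id": "case-001"}],
    token="<token>",
)
print(req.get_method(), req.full_url)
```

Sending the request (for example with `urllib.request.urlopen(req)`) is left out on purpose, since it requires a valid token and submission authorization.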
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Diagrams: DNA Sequence Pipeline; RNA Sequence Pipeline]
GDC Processing: GDC High Level Data Generation
• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines
GDC Data Retrieval
• Once data is submitted into the GDC and the data submitter signs off on the data to request data release, the GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
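The idea of a graph that links cases to their data files through unique identifiers can be illustrated with a small adjacency list and a breadth-first traversal. This is a toy sketch only: the node types, identifiers, and edge layout are invented for illustration and do not reproduce the actual GDC schema.

```python
from collections import deque
from typing import Dict, List

# Toy graph: each key is a node ID, each value lists the nodes it links to.
EDGES: Dict[str, List[str]] = {
    "project:P1": ["case:C1"],
    "case:C1": ["clinical:D1", "sample:S1"],
    "sample:S1": ["aliquot:A1"],
    "aliquot:A1": ["file:F1", "file:F2"],
}


def reachable_files(edges: Dict[str, List[str]], start: str) -> List[str]:
    """Breadth-first walk from `start`, collecting every file node reached."""
    seen = {start}
    queue = deque([start])
    files: List[str] = []
    while queue:
        node = queue.popleft()
        for neighbor in edges.get(node, []):
            if neighbor in seen:
                continue
            seen.add(neighbor)
            if neighbor.startswith("file:"):
                files.append(neighbor)
            queue.append(neighbor)
    return files


print(reachable_files(EDGES, "case:C1"))
```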
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
– Data browsing by project, file, case, or annotation
– Visualization allowing users to perform fine-grained filtering of search results
– Data search using advanced smart search technology
– Data selection into a personalized cart
– Data download from the cart or through the high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
– Command line interface to specify the desired transfer protocol and multiple files for download
– Uses a manifest file generated from the GDC Data Portal for multiple downloads
– Supports the secure transfer of controlled access data using an authentication key (token)
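Before starting a large manifest-driven download, it can help to check how many files and bytes the manifest describes. The sketch below assumes the manifest is tab-separated with `id`, `filename`, `md5`, `size`, and `state` columns; that layout, and every value in the sample text, is an assumption for illustration and should be checked against a manifest exported from the portal.

```python
import csv
import io
from typing import Dict, List

# Illustrative manifest text; column layout and values are assumptions.
MANIFEST_TEXT = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "uuid-0001\tsample_a.bam\tmd5-placeholder-a\t1048576\tlive\n"
    "uuid-0002\tsample_b.bam\tmd5-placeholder-b\t2097152\tlive\n"
)


def read_manifest(text: str) -> List[Dict[str, str]]:
    """Parse a tab-separated manifest with a header row."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text), delimiter="\t")]


def total_bytes(entries: List[Dict[str, str]]) -> int:
    """Sum the declared file sizes so the transfer volume is known up front."""
    return sum(int(entry["size"]) for entry in entries)


entries = read_manifest(MANIFEST_TEXT)
print(len(entries), "files,", total_bytes(entries), "bytes")
```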
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
– Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
– Securely retrieve biospecimen, clinical, and molecular files
– Perform BAM slicing
Example request (API URL, endpoint, URL parameters, query parameters):
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Example response (portion removed for readability):
{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
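The request URL shown on this slide can be assembled from its parts, and a response of the same shape can be summarized offline. This sketch makes no network call: the base URL is the one printed on the slide (which may no longer be current), and `SAMPLE_RESPONSE` is a three-entry excerpt of the slide's example rather than live data.

```python
import json
from collections import Counter
from typing import List
from urllib.parse import urlencode

BASE_URL = "https://gdc-api.nci.nih.gov"  # as printed on the slide; may have changed


def build_query_url(endpoint: str, fields: List[str], pretty: bool = True) -> str:
    """Join base URL, endpoint, and query parameters into one request URL."""
    params = {"fields": ",".join(fields), "pretty": str(pretty).lower()}
    # safe="," keeps the comma-separated field list readable instead of %2C.
    return f"{BASE_URL}/{endpoint}?{urlencode(params, safe=',')}"


# Excerpt of the example response from the slide, not a live API result.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}
"""

url = build_query_url("projects", ["project_id", "primary_site"])
hits = json.loads(SAMPLE_RESPONSE)["data"]["hits"]
projects_per_site = Counter(hit["primary_site"] for hit in hits)
print(url)
print(dict(projects_per_site))
```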
GDC Web Site
• The GDC Web Site provides access to GDC resources that support and communicate the GDC mission
– Targets data consumers, providers, developers, and general users
– Provides access to information about the GDC and contributed cancer genomic data sets
– Instructs users on the use of GDC data access and submission tools
– Provides descriptions of GDC bioinformatics pipelines
– Documents supported GDC data types and file formats
– Provides access to GDC support resources

GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site
– https://gdc.nci.nih.gov
• GDC Documentation Site
– https://gdc-docs.nci.nih.gov
• GDC Data Portal
– https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
– https://gdc-portal.nci.nih.gov/submission
– https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
– https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
– https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
– https://gdc-api.nci.nih.gov
• GDC Support
– GDC Assistance: support@nci-gdc.datacommons.io
– GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
bull Goal accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
bull By enabling data ndash Aggregation ndash Access
Whole genome sequence + Phenotype ndash Sharing ndash Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
bull Web-based public facing platform bull House organize index and display data and
analytic tools
Data Coordinating
Center
bull Facilitate deposition of sequence and phenotype data into relevant repositories
bull Harmonize phenotypes
Administrative and Outreach
Core
bull Develop policies and procedures bull Facilitate meetings and communication bull Educate and seek feedback from users
Birth defect or childhood cancer
DNA Sequencing center
cohorts
Birth defect BAMVCF Cancer BAMVCF
dbGaP NCI GDC
Birth defect VCF Cancer VCF
Gabriella Miller Kids First Data
Resource Birth defect BAMVCF
Index of datasets Phenotype Variant summaries
Cancer BAMVCF
Users
GMKF Data
Resource
dbGaP
NCI GDC
TOPMed
The Monarch Initiative
ClinGen
Matchmaker Exchange
Center for Mendelian Genomics
Functionality bull Where can I find a group of patients with
diaphragmatic hernia bull What range of variants are associated with total
anomalous pulmonary venous return bull I have a knock-out of ABC gene in the mouse What
human phenotypes are associated with variants in ABC
bull What is the frequency of de novo variants in patients with Ewing sarcoma
bull I have a patient with congenital facial palsy cleftpalate and syndactyly Do similar patients exist and ifso what genetic variants have been discovered in association with this constellation of findings
Questions
bull Do we have the right model for data managementsharing for the childhood cancer and birth defects communities
bull Have we included all of the right elements in the proposal
bull Do we have the right use cases bull How do we maximize use of the Data Resource bull Have we targeted the rightall of the audiences
View Genomic Information by the Genome Browser
search result
Data available at the NIAGADS Genomics Database
Gene models SNP information
Pathway annotations Gene Ontology
KEGG Pathway
Functional genomics data ENCODE
FANTOM5
GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics Ongoing effort to deposit annotations curated by ADSP AnnotationWG into Genomics DB
What the Data Coordinating Center Should Know
bull Bioinformatics bull Expertise in genomics next generation sequencing method and
software tool development high performance computing bull Big data management bull IT infrastructure and web development
bull Administrative support and collaboration bull Familiarity with human subjects research regulations and NIH policy bull Leadership and coordinating skills bull Flexibility close collaboration with program officers
bull Domain expertise bull Knowledge of the disease and genetic research bull Familiarity with the community and cohorts bull Familiarity with infrastructure (sequencing centers dbGaP DACs NIA
policy and infrastructure)
What the Data Coordinating Center Should Know
Lessons Learned
bull Set up the data coordinating center early bull Develop study designs and analysis plans early bull Set milestones and timelines early and carefully but
be ready to miss bull Leverage existing projects and infrastructure bull Engage a wide range of expertise bull Build on existing community resources bull Use open-source and academic solutions instead of
commercialproprietary solutions bull Engage external advisors
Acknowledgements NIAGADS EAB Matthew Farrer
Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober
NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly bull Data flow WG
bull Annotation WG IGAPADGCCHARGEEADIGERAD
NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon
NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao
GDC Overview GDC Data Submission Processing and Retrieval
April 2016
Agenda
n
GDC Overview
GDC Data Submissio
GDC Data Processing
GDC Data Retrieval
154
GDC Overview Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables
data sharing across cancer genomic studies in support of precision medicine
bull Provide a cancer knowledge network that ndash Enables the identification of both high- and low-frequency cancer drivers ndash Assists in defining genomic determinants of response to therapy ndash Informs the composition of clinical trial cohorts sharing targeted genetic lesions
bull Support the receipt quality control integration storage and redistribution of standardized genomic data sets derived from cancer research studies ndash Harmonization of raw sequence both from existing (eg TCGA TARGET
CGCI) and new cancer research programs
ndash Application of state-of-the-art methods of generating high level data 155
GDC Overview Infrastructure
156
Data Sources (DCC CGHub)
Data Submitters
Open Access Data Users
Controlled Access Data Users
eRACommons amp dbGaP Data
Access Tools
Data Import System
Metadata amp Data Storage
Reporting System
Harmonization amp Generation
Pipelines
3rd Party Applications
GDC Users GDC System Components GDC Interfaces
Alignment amp Processing Tools Data
Submission Tools
Data Security System
APIs
Digital ID System
GDC Overview Resources
157
GDC Data Portal
GDC Data Center
GDC Data Transfer Tool GDC Reports
GDC Data Model
GDC Data Submission
Portal
GDC ApplicatioProgrammingInterface (API)
n GDC Bioinformatics
Pipeline
GDC Documentation
and Support
GDC Organization
and Collaborators
GDC Organization and Collaborators
bull GDC Government Sponsor
NCI Center for Cancer Genomics (CCG)
University of Chicago Team
bull Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
bull GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
bull Contracting organization supporting GDC management and execution
Other Government External
bull Includes eRA dbGaP NCI CBIIT for data access and security compliance bull Includes the GDC Steering Committee Bioinformatics Advisory Group Data
Submitters and User Acceptance Testers (UAT) Testers 158
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
159
GDC Data Submission Data Submitter Types
bull The GDC supports two types of data submitters
ndash Type 1 Large Organizations bull Users associated with an institution or group with significant informatics resources bull Large one-time submission or long-term ongoing data submission bull Primarily use the GDC Application Programming Interfaces (API) or the GDC
Data Transfer Tool (command line interface) for data submission
ndash Type 2 Researchers or Individual Laboratories bull Users associated with a single group (such as a laboratory investigator or
researcher) with limited informatics resources bull One-time or sporadic uploads of low volumes of patient and analysis data with
varying levels of completeness bull Submit via the web-based GDC Data Submission Portal and the GDC Data
Transfer Tool that use the GDC API
160
GDC Data Submission Data Submission Policies
bull dbGaP Data Submission Policies ndash GDC data submitters must first apply for data submission authorization through dbGaP ndash Data submission through dbGaP requires institutional certification under NIHrsquos Genomi c
Data Sharing Policy bull GDC Data Sharing Policies
ndash Data Sharing Requirement bull Data submitted to the GDC will be made available to the scientific community according to the
data submitterrsquos NCI Genomic Data Sharing Plan Controlled access data will be made availabl e to members of the community having the appropriate dbGaP Data Use Certification
bull The GDC will produce harmonized data (raw and derived) based on the originally submitted data The GDC will not preserve an exact copy of the originally submitted data however the GDC will preserve the original reads and quality scores
ndash Data Pre-processing Period bull For each project the GDC will afford a pre-processing period of exclusive that allows for
submitters to perform data cleaning and quality and submission of revised data before release bull The pre-processing period may generally last up to six months from the date of first submission
ndash Data Submission Period and Release bull Once submitted data will be processed and validated by the GDC Submitted data will be
released and available via controlled access for research that is consistent with the datasetrsquos ldquodata use limitationsrdquo either six months after data submission or at the time of first publication
ndash Data Redaction bull The GDC in general will not remove data access in response to submitter requests GDC will
remove data access in the following events Data Management Incident Human Subjects 161
Compliance Issue Erroneous Data
GDC Data Submission Data Submission Process
162
GDC Data Submission dbGaP Registration
bull Data submitters register their studies in dbGaP submit the Subject Identifiers associated with the study in dbGaP and contact the GDC
bull The GDC verifies that the study has been registered in dbGaP verifies the data submitter credentials Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process httpwwwncbinlmnihgovprojectsgapcgi-binGetPdfcgidocument_name=HowToSubmitpdf
163
GDC Data Submission Upload and Validate Data
bull Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal GDC Data Transfer Tool or GDC Application Programming Interface (API)
bull Uploaded data must follow GDC supported data types and file formats which leverage and extend existing data standards for biospecimen clinical and experiment data
164
Data Types and File Formats (1 of 2)
bull The GDC provides a data dictionary and template files for each data type
165
Data Type Data Subtype Format Data Dictionary Template
Administrative Administrative Data TSV JSON Case TSV JSON
Biospecimen Biospecimen Data TSV JSON
Sample Portion Analyte Aliquot
Sample TSV JSON Portion TSV JSON Analyte TSV JSON Aliquot TSV JSON
Clinical Clinical Data TSV JSON
Demographic Diagnosis Exposure Family History Treatment
Demographic TSV JSON Diagnosis TSV JSON Exposure TSV JSON Family History TSV JSON Treatment TSV JSON
Data File
Analysis Metadata SRA XML MAGE-TAB (SDRF IDF) Analysis Metadata TSV JSON
Biospecimen Metadata BCR XML GDC-approved spreadsheet
Biospecimen Metadata TSV JSON
Clinical Metadata BCR XML GDC-approved spreadsheet Clinical Metadata TSV JSON
Data Types and File Formats (2 of 2)
Data Type Data Subtype Format Data Dictionary Template
Data File
Experimental Metadata Pathology Report Run Metadata Slide Image
Submitted Unaligned Reads
SRA XML
PDF SRA XML SVS
FASTQ BAM
Experimental Metadata Pathology Report
Submitted Unaligned Reads
TSV JSON
TSV JSON TSV JSON TSV JSON
TSV JSON
Submitted Aligned Reads BAM
Submitted Aligned Reads TSV JSON
Data Bundle Read Group
Slide
Read Group
Slide
TSV JSON
TSV JSON
166
Data Types and File Formats Biospecimen Data
bull Biospecimen data types may include samples aliquots analytes and portions
bull GDC supports the submission of biospecimen data in XML JSON or TSV file format
167
Data Types and File Formats Clinical Data
bull GDC clinical data types are associated with each case and include required preferred and optional clinical data elements for demographics diagnosis family history exposure and treatment ndash GDC clinical data elements were reviewed with members of the research community ndash Clinical data elements are defined in the GDC dictionary and registered in the NCIrsquos
Cancer Data Standards Repository (caDSR)
Case
Clinical Data
Demographics Diagnosis Family History(Optional) Exposure (Optional) Treatment
(Optional)
Biospecimen Data Experiment Data
168
Data Types and File Formats Experiment Data
bull GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
bull GDC performs and reports on Quality Control (QC) metrics generated by FASTQC
Read Group Name RG_1 RG_2 RG_3 RG_4 Total Sequences 502123 123456 658101 788965 Read Length 75 75 75 75 GC Content () 49 50 48 50 Basic Statistics PASS PASS PASS PASS
Per base sequence quality PASS PASS PASS PASS
Per tile sequence quality PASS WARNING PASS PASS
Per sequence quality scores FAIL PASS PASS PASS
Per base sequence content PASS PASS WARNING PASS
Per sequence GC content PASS PASS PASS PASS
Per base N content PASS PASS PASS PASS Sequence Length
Distribution PASS PASS PASS PASS Sequence
Duplication Levels WARNING WARNING PASS PASS Overrepresented
sequences PASS PASS PASS PASS Adapter Content PASS FAIL PASS PASS 169Kmer Content PASS PASS WARNING PASS
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
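A minimal sketch of how a client might assemble a request for the create-entities endpoint is shown here. It only builds the request and never sends it. The base URL is the API address listed in this deck's references; the `X-Auth-Token` header name, the entity fields, and the program and project names are assumptions for illustration and should be checked against the GDC documentation.

```python
import json
import urllib.request

# Assumption: API host as listed in this deck's references (it may have changed since).
API_BASE = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entity, token):
    """Build (but do not send) a POST request for /v0/submission/<program>/<project>."""
    url = f"{API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entity).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Assumption: the authentication token travels in this header.
        "X-Auth-Token": token,
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


if __name__ == "__main__":
    # Hypothetical entity, program, and project values, for illustration only.
    entity = {"type": "case", "submitter_id": "EXAMPLE-CASE-001"}
    req = build_create_request("PROGRAM", "PROJECT", entity, token="<token>")
    print(req.get_method(), req.full_url)
```

Keeping request construction separate from sending makes the payload easy to inspect or validate locally before any data leaves the submitter's machine.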
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]
GDC Processing: GDC High Level Data Generation
• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
GDC Data Retrieval
• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
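To make the idea of a graph of nodes and edges joined by unique identifiers concrete, here is a small Python sketch. It is not the GDC's implementation: the node labels, relation names, and helper functions are hypothetical, chosen only to show how a project, its cases, and their files can be linked and traversed by ID.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str   # e.g. "project", "case", "clinical", "file" (hypothetical labels)
    props: dict
    node_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class Graph:
    def __init__(self):
        self.nodes = {}   # node_id -> Node
        self.edges = []   # (source_id, relation, destination_id)

    def add(self, label, **props):
        node = Node(label, props)
        self.nodes[node.node_id] = node
        return node

    def link(self, src, relation, dst):
        self.edges.append((src.node_id, relation, dst.node_id))

    def sources(self, dst, relation):
        """Nodes that point at `dst` through `relation`."""
        return [self.nodes[s] for s, r, d in self.edges
                if d == dst.node_id and r == relation]


def files_for_project(graph, project):
    """Walk project <- case <- file using only the stored identifiers."""
    files = []
    for case in graph.sources(project, "member_of"):
        files.extend(graph.sources(case, "data_from"))
    return files


if __name__ == "__main__":
    g = Graph()
    project = g.add("project", name="EXAMPLE-PROJECT")
    case = g.add("case", submitter_id="EXAMPLE-CASE-001")
    bam = g.add("file", file_name="example.bam")
    g.link(case, "member_of", project)
    g.link(bam, "data_from", case)
    print([f.props["file_name"] for f in files_for_project(g, project)])
```

Because every edge stores only identifiers, a file stays attached to the right case and project even if descriptive properties change.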
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from cart or a high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)
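Before starting a large transfer it can help to inspect the manifest. The sketch below assumes the manifest is a tab-separated file with `id`, `filename`, `md5`, `size`, and `state` columns; that layout, and the sample rows, are assumptions for illustration and should be confirmed against a manifest actually generated by the portal.

```python
import csv
import io

# Hypothetical manifest content; real manifests are generated by the GDC Data Portal.
SAMPLE_MANIFEST = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "00000000-0000-0000-0000-000000000001\tsample_a.bam\tabc123\t1048576\tsubmitted\n"
    "00000000-0000-0000-0000-000000000002\tsample_b.bam\tdef456\t2097152\tsubmitted\n"
)


def read_manifest(text):
    """Parse tab-separated manifest text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def total_bytes(rows):
    """Sum the declared file sizes to estimate the transfer volume."""
    return sum(int(row["size"]) for row in rows)


if __name__ == "__main__":
    rows = read_manifest(SAMPLE_MANIFEST)
    print(len(rows), "files,", total_bytes(rows), "bytes")
```

Summing the declared sizes gives a quick estimate of the required disk space and transfer time before the download begins.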
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, annotations and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing

Example request (URL parts: API URL, endpoint, URL parameters, query parameters):
  https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):
  {
    "data": {
      "hits": [
        {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
        {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
        {"project_id": "TCGA-LAML", "primary_site": "Blood"},
        {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
        {"project_id": "TCGA-UVM", "primary_site": "Eye"},
        {"project_id": "TARGET-AML", "primary_site": "Blood"},
        {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
        {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
        {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
        {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
      ]
    }
  }
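The projects query on this slide can be assembled and its response read with the Python standard library alone. The sketch below builds the URL without contacting the server and parses a shortened copy of the slide's sample response; the host name is the one used in this 2016 deck and may have changed, and the helper names are illustrative assumptions.

```python
import json
from urllib.parse import urlencode

# Assumption: API host as shown in this deck (it may differ today).
API_BASE = "https://gdc-api.nci.nih.gov"

# Shortened copy of the sample response shown on the slide.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"},
  {"project_id": "TARGET-NBL", "primary_site": "Nervous System"}
]}}
"""


def projects_url(fields, pretty=True):
    """Build the /projects query URL for the requested fields."""
    query = {"fields": ",".join(fields)}
    if pretty:
        query["pretty"] = "true"
    # safe="," keeps the comma-separated field list readable in the URL.
    return f"{API_BASE}/projects?{urlencode(query, safe=',')}"


def sites_by_project(response_text):
    """Map each project_id in the response to its primary_site."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


if __name__ == "__main__":
    print(projects_url(["project_id", "primary_site"]))
    print(sites_by_project(SAMPLE_RESPONSE))
```

Requesting only the fields that are needed keeps responses small, which matters when paging through many projects, files, or cases.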
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site – https://gdc.nci.nih.gov
• GDC Documentation Site – https://gdc-docs.nci.nih.gov
• GDC Data Portal – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data
  – Aggregation
  – Access
  – Sharing
  – Analysis
Whole genome sequence + Phenotype
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Diagram: data flow among birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF and cancer BAM/VCF; dbGaP and NCI GDC; birth defect VCF and cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users]
[Diagram: GMKF Data Resource shown with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and Center for Mendelian Genomics]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
Data available at the NIAGADS Genomics Database
• Gene models
• SNP information
• Pathway annotations
  – Gene Ontology
  – KEGG Pathway
• Functional genomics data
  – ENCODE
  – FANTOM5
• GTEx results
• NHGRI GWAS catalog
• Published genome-wide AD GWAS summary statistics
Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB
What the Data Coordinating Center Should Know
• Bioinformatics
  • Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  • Big data management
  • IT infrastructure and web development
• Administrative support and collaboration
  • Familiarity with human subjects research regulations and NIH policy
  • Leadership and coordinating skills
  • Flexibility, close collaboration with program officers
• Domain expertise
  • Knowledge of the disease and genetic research
  • Familiarity with the community and cohorts
  • Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)
What the Data Coordinating Center Should Know: Lessons Learned
• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss them
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors
Acknowledgements
Patients and families of those with AD
NIA ADRC program and centers
NACC, NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD
NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly
NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon
NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao
GDC Overview: GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high level data
GDC Overview: Infrastructure
[Diagram with three groupings (GDC Users, GDC System Components, GDC Interfaces) showing: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; 3rd Party Applications; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; APIs; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System]
GDC Overview: Resources
[Diagram of GDC resources: GDC Data Center; GDC Data Portal; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators]
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interfaces (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool that use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
GDC Data Submission: Data Submission Process
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Portion | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Analyte | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Aliquot | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Diagnosis | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Exposure | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Family History | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Treatment | TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
Data Types and File Formats Biospecimen Data
bull Biospecimen data types may include samples aliquots analytes and portions
bull GDC supports the submission of biospecimen data in XML JSON or TSV file format
167
Data Types and File Formats Clinical Data
bull GDC clinical data types are associated with each case and include required preferred and optional clinical data elements for demographics diagnosis family history exposure and treatment ndash GDC clinical data elements were reviewed with members of the research community ndash Clinical data elements are defined in the GDC dictionary and registered in the NCIrsquos
Cancer Data Standards Repository (caDSR)
Case
Clinical Data
Demographics Diagnosis Family History(Optional) Exposure (Optional) Treatment
(Optional)
Biospecimen Data Experiment Data
168
Data Types and File Formats Experiment Data
bull GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
bull GDC performs and reports on Quality Control (QC) metrics generated by FASTQC
Read Group Name RG_1 RG_2 RG_3 RG_4 Total Sequences 502123 123456 658101 788965 Read Length 75 75 75 75 GC Content () 49 50 48 50 Basic Statistics PASS PASS PASS PASS
Per base sequence quality PASS PASS PASS PASS
Per tile sequence quality PASS WARNING PASS PASS
Per sequence quality scores FAIL PASS PASS PASS
Per base sequence content PASS PASS WARNING PASS
Per sequence GC content PASS PASS PASS PASS
Per base N content PASS PASS PASS PASS Sequence Length
Distribution PASS PASS PASS PASS Sequence
Duplication Levels WARNING WARNING PASS PASS Overrepresented
sequences PASS PASS PASS PASS Adapter Content PASS FAIL PASS PASS 169Kmer Content PASS PASS WARNING PASS
Data Submission Tools GDC Data Submission Portal
bull The GDC Data Submission Portal is a web-based data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
Upload biospecimen and clinical data and experiment metadata using user friendly web-based tools
Validate data against GDC standard data types defined in the project data dictionary
Obtain information on the status of data submission and processing by project
170
Data Submission Tools GDC Data Transfer Tool
bull The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
Command line interface to specify desired transfer protocol
Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
Supports the secure upload of controlled access data using an authentication key (token)
171
Data Submission Tools GDC Submission API
bull The GDC Submission API is a REST based programmatic interface that supports the programmatic submission of cancer data sets for analysis
Securely submit biospecimen clinical and experiment files using a token
Submit data to GDC by performing create update and retrieve actions on entities
Search for submitted data using GraphQL an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint POST v0submissionltprogramgtltprojectgt 172
GDC Data Submission Submit and Release Data
bull After validation data is submitted to the GDC for processing including data harmonization and high level data generation for applicable data
bull After data processing has been completed the user can release their data to the GDC which must occur within six (6) months of submission per GDC Data Submission Policies
bull Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
173
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
174
GDC Processing Harmonization
bull GDC pipelines supporting the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
175
GDC Processing DNA and RNA Sequence Harmonization Pipelines
176
DNA Sequence Pipeline RNA Sequence Pipeline
GDC Processing GDC High Level Data Generation
bull The GDC generates high level data including DNA-seq derived germline variants and somatic mutations RNA-seq and miRNA-seq derived gene and miRNA quatifications and SNP Array based copy number segmentations
bull The GDC implements multiple pipelines for generating somatic variants such as the Baylor Broad and WashU pipelines
177
GDC Processing GDC Variant Calling Pipelines
178
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
179
GDC Data Retrieval
bull Once data is submitted into GDC and the data submitter signs-off on the data to request data release GDC performs Quality Control (QC) and data processing
bull Upon successful GDC QC and processing the data is released based on the appropriate dbGaP data restrictions
bull Released data is made available for query and download to authorized users via the GDC Data Portal the GDC Data Transfer Tool and the GDC API
bull Data queries are based on the GDC Data Model
180
GDC Data Retrieval GDC Data Model
bull The GDC data model is represented as a graph with nodes and edges and this graph is the store of record for the GDC
bull The GDC data model maintains the critical relationship between projects cases clinical data and experiment data and insures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
181
GDC Data Retrieval GDC Data Portal
bull The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
Data browsing by project file case or annotation
Visualization allowing users to perform fine-grained filtering of search results
Data search using advanced smart search technology
Data selection into a personalized cart
Data download from cart or a high-performance Data Transfer Tool
182
GDC Data Retrieval GDC Data Transfer Tool
bull The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
Command line interface to specify desired transfer protocol and multiple files for download
Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
Supports the secure transfer of controlled access data using an authentication key (token)
183
Data Retrieval Tools GDC API
bull The GDC API is a REST based programmatic interface that allows users to search and download cancer data sets for analysis
Search for projects files cases annotations and retrieve associated details in JSON format
Securely retrieve biospecimen clinical and molecular files
Perform BAM slicing
data hits [
project_id TCGA-SKCMrdquoprimary_site Skinrdquo project_id TCGA-PCPGrdquoprimary_site Nervous Systemrdquo project_id TCGA-LAMLrdquoprimary_site Bloodrdquo project_id TCGA-CNTLrdquoprimary_site Not Applicablerdquo project_id TCGA-UVMrdquoprimary_site Eyerdquo project_id TARGET-AMLrdquoprimary_site Bloodrdquo project_id TCGA-SARCrdquoprimary_site Mesenchymalrdquo project_id TCGA-LUSCrdquoprimary_site Lungrdquo project_id TARGET-NBLrdquoprimary_site Nervous Systemrdquo project_id TCGA-PAADrdquoprimary_site Pancreasrdquo
] Portion remove for readability
API URL Endpoint URL parameters Query parameters
184httpsgdc-apincinihgovprojectsfields=project_idprimary_siteamppretty=true
GDC Web Site
bull The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
Targets data consumers providers developers and general users
Provides access to information about GDC and contributed cancer genomic data sets
Instructs users on the use of GDC data access and submission tools
Provides descriptions of GDC bioinformatics pipelines
Documents supported GDC data types and file formats
Provides access to GDC support resources
185
GDC Documentation Site
bull The GDC Documentation Site provides access to GDC Userrsquos Guides and the GDC Data Dictionary
186
References
Note Requires access to the University of Chicago
Virtual Private Network
bull GDC Web Site ndash httpsgdcncinihgov
bull GDC Documentation Site ndash httpsgdc-docsncinihgov
bull GDC Data Portal ndash httpsgdc-portalncinihgov
bull GDC Data Submission Portal ndash httpsgdc-portalncinihgovsubmission ndash httpsgdcncinihgovsubmit-datagdc-data-submission-portal
bull GDC Data Transfer Tool ndash httpsgdcncinihgovaccess-datagdc-data-transfer-tool
bull GDC Application Programming Interface (API) ndash httpsgdcncinihgovdevelopersgdc-application-programming-interface-api ndash httpsgdc-apincinihgov
bull GDC Support ndash GDC Assistance supportnci-gdcdatacommonsio ndash GDC User List Serv GDC-USERS-LLISTNIHGOV 187
Gabriella Miller Kids First Pediatric Data Resource Center
bull Goal accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
bull By enabling data ndash Aggregation ndash Access
Whole genome sequence + Phenotype ndash Sharing ndash Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
bull Web-based public facing platform bull House organize index and display data and
analytic tools
Data Coordinating
Center
bull Facilitate deposition of sequence and phenotype data into relevant repositories
bull Harmonize phenotypes
Administrative and Outreach
Core
bull Develop policies and procedures bull Facilitate meetings and communication bull Educate and seek feedback from users
Birth defect or childhood cancer
DNA Sequencing center
cohorts
Birth defect BAMVCF Cancer BAMVCF
dbGaP NCI GDC
Birth defect VCF Cancer VCF
Gabriella Miller Kids First Data
Resource Birth defect BAMVCF
Index of datasets Phenotype Variant summaries
Cancer BAMVCF
Users
GMKF Data
Resource
dbGaP
NCI GDC
TOPMed
The Monarch Initiative
ClinGen
Matchmaker Exchange
Center for Mendelian Genomics
Functionality bull Where can I find a group of patients with
diaphragmatic hernia bull What range of variants are associated with total
anomalous pulmonary venous return bull I have a knock-out of ABC gene in the mouse What
human phenotypes are associated with variants in ABC
bull What is the frequency of de novo variants in patients with Ewing sarcoma
bull I have a patient with congenital facial palsy cleftpalate and syndactyly Do similar patients exist and ifso what genetic variants have been discovered in association with this constellation of findings
Questions
bull Do we have the right model for data managementsharing for the childhood cancer and birth defects communities
bull Have we included all of the right elements in the proposal
bull Do we have the right use cases bull How do we maximize use of the Data Resource bull Have we targeted the rightall of the audiences
What the Data Coordinating Center Should Know
bull Bioinformatics bull Expertise in genomics next generation sequencing method and
software tool development high performance computing bull Big data management bull IT infrastructure and web development
bull Administrative support and collaboration bull Familiarity with human subjects research regulations and NIH policy bull Leadership and coordinating skills bull Flexibility close collaboration with program officers
bull Domain expertise bull Knowledge of the disease and genetic research bull Familiarity with the community and cohorts bull Familiarity with infrastructure (sequencing centers dbGaP DACs NIA
policy and infrastructure)
What the Data Coordinating Center Should Know
Lessons Learned
bull Set up the data coordinating center early bull Develop study designs and analysis plans early bull Set milestones and timelines early and carefully but
be ready to miss bull Leverage existing projects and infrastructure bull Engage a wide range of expertise bull Build on existing community resources bull Use open-source and academic solutions instead of
commercialproprietary solutions bull Engage external advisors
Acknowledgements NIAGADS EAB Matthew Farrer
Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober
NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly bull Data flow WG
bull Annotation WG IGAPADGCCHARGEEADIGERAD
NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon
NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao
GDC Overview GDC Data Submission Processing and Retrieval
April 2016
Agenda
n
GDC Overview
GDC Data Submissio
GDC Data Processing
GDC Data Retrieval
154
GDC Overview Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables
data sharing across cancer genomic studies in support of precision medicine
bull Provide a cancer knowledge network that ndash Enables the identification of both high- and low-frequency cancer drivers ndash Assists in defining genomic determinants of response to therapy ndash Informs the composition of clinical trial cohorts sharing targeted genetic lesions
bull Support the receipt quality control integration storage and redistribution of standardized genomic data sets derived from cancer research studies ndash Harmonization of raw sequence both from existing (eg TCGA TARGET
CGCI) and new cancer research programs
ndash Application of state-of-the-art methods of generating high level data 155
GDC Overview: Infrastructure
[Figure: GDC infrastructure diagram. Column headings: GDC Users, GDC System Components, GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; 3rd Party Applications; Data Access Tools; Data Submission Tools; APIs; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System]
GDC Overview: Resources
• GDC Data Portal
• GDC Data Center
• GDC Data Transfer Tool
• GDC Reports
• GDC Data Model
• GDC Data Submission Portal
• GDC Application Programming Interface (API)
• GDC Bioinformatics Pipeline
• GDC Documentation and Support
• GDC Organization and Collaborators
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC government sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
  - Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  - Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, both of which use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  - GDC data submitters must first apply for data submission authorization through dbGaP
  - Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  - Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification.
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores.
  - Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality checks and to submit revised data before release.
    • The pre-processing period may generally last up to six months from the date of first submission.
  - Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication.
  - Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data.
GDC Data Submission: Data Submission Process
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter's credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload their data to a workspace within the GDC and validate it there, using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | TSV, JSON for each of Sample, Portion, Analyte, Aliquot
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | TSV, JSON for each of Demographic, Diagnosis, Exposure, Family History, Treatment
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• The GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
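Because the same record can be submitted as TSV or JSON, converting between the two is a common preparation step. The sketch below shows one way to turn tab-separated rows into JSON using only the Python standard library. The column names (`type`, `submitter_id`, `sample_type`) and the sample values are illustrative assumptions, not the headers of the actual GDC templates; the GDC data dictionary defines the real field names.

```python
import csv
import io
import json

# Hypothetical TSV content: the column names and values below are
# illustrative assumptions, not the actual GDC template headers.
SAMPLE_TSV = (
    "type\tsubmitter_id\tsample_type\n"
    "sample\tSAMPLE-001\tPrimary Tumor\n"
    "sample\tSAMPLE-002\tBlood Derived Normal\n"
)


def tsv_to_records(tsv_text):
    """Parse tab-separated text into a list of dicts, one per data row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [dict(row) for row in reader]


records = tsv_to_records(SAMPLE_TSV)

# Serialize the same records as JSON, the other accepted text format.
print(json.dumps(records, indent=2))
```

Each TSV row becomes one JSON object whose keys are the column headers, so a file prepared in a spreadsheet can be checked or submitted in either form.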
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  - GDC clinical data elements were reviewed with members of the research community
  - Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
[Figure: a Case linked to Clinical Data (Demographics, Diagnosis, Family History (optional), Exposure (optional), Treatment (optional)), Biospecimen Data, and Experiment Data]
Data Types and File Formats: Experiment Data
• The GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• The GDC performs and reports on Quality Control (QC) metrics generated by FastQC
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
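A QC summary like the one above is easier to act on when the non-passing modules are listed per read group. The following is a minimal sketch, assuming the statuses have already been copied from the report into a dictionary; it does not parse real FastQC output files, and the function name `flagged` is made up for illustration.

```python
# Read groups and module statuses transcribed from the example QC table.
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

QC_RESULTS = {
    "Basic Statistics":             ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality":    ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality":    ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores":  ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content":    ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content":      ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content":           ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels":  ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences":    ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content":              ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content":                 ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged(results, read_groups, status):
    """Map each read group to the QC modules that reported `status`."""
    by_group = {}
    for module, statuses in results.items():
        for read_group, value in zip(read_groups, statuses):
            if value == status:
                by_group.setdefault(read_group, []).append(module)
    return by_group


failures = flagged(QC_RESULTS, READ_GROUPS, "FAIL")
warnings = flagged(QC_RESULTS, READ_GROUPS, "WARNING")

for read_group in READ_GROUPS:
    print(read_group,
          "FAIL:", failures.get(read_group, []),
          "WARNING:", warnings.get(read_group, []))
```

In this example only RG_4 passes every module; RG_1 and RG_2 each have one failing module that a submitter would likely want to review before release.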
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  - Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  - Validate data against GDC standard data types defined in the project data dictionary
  - Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  - Command line interface to specify the desired transfer protocol
  - Uses a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  - Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based interface that supports the programmatic submission of cancer data sets for analysis
  - Securely submit biospecimen, clinical, and experiment files using a token
  - Submit data to the GDC by performing create, update, and retrieve actions on entities
  - Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
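To make the create-entities endpoint concrete, the sketch below assembles (but does not send) a POST request with the Python standard library. Several details are assumptions rather than statements from the slides: the base URL is the one given in the deck's reference list, the token is passed in an `X-Auth-Token` header, the program and project values are placeholders, and the entity fields (`type`, `submitter_id`) are hypothetical. The GDC API documentation is the authority on the actual payload.

```python
import json
import urllib.request

# Base URL taken from the deck's reference list; treat it as an assumption
# about the deployment rather than a guaranteed current address.
API_BASE = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build a POST request for /v0/submission/<program>/<project>.

    The request is only constructed here; nothing is sent over the network.
    """
    url = f"{API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    request = urllib.request.Request(url, data=body, method="POST")
    request.add_header("Content-Type", "application/json")
    # Assumed header name for the authentication token.
    request.add_header("X-Auth-Token", token)
    return request


# Hypothetical program, project, entity, and token values for illustration.
example_entities = [{"type": "case", "submitter_id": "CASE-001"}]
req = build_create_request("PROGRAM", "PROJECT", example_entities, "example-token")

print(req.get_method(), req.full_url)
```

Keeping request construction separate from sending makes it possible to inspect the URL, headers, and body before any controlled-access data leaves the submitter's machine.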
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access, per the dbGaP authorization policies associated with the data set
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure: two pipeline diagrams, titled DNA Sequence Pipeline and RNA Sequence Pipeline]
GDC Processing: GDC High-Level Data Generation
• The GDC generates high-level data, including DNA-seq-derived germline variants and somatic mutations, RNA-seq- and miRNA-seq-derived gene and miRNA quantifications, and SNP-array-based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
GDC Data Retrieval
• Once data is submitted to the GDC and the data submitter signs off on the data to request data release, the GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects by means of unique identifiers
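The idea of a graph that links projects, cases, clinical data, experiment data, and files through unique identifiers can be illustrated with a small in-memory structure. This is a toy sketch only: the class name `DataGraph`, the node labels, and the identifiers are invented for illustration and do not reflect the GDC's actual schema or storage technology.

```python
from collections import defaultdict, deque


class DataGraph:
    """Toy directed graph: nodes keyed by unique ID, edges parent -> child."""

    def __init__(self):
        self.nodes = {}                 # node_id -> {"label": ..., properties}
        self.edges = defaultdict(list)  # parent node_id -> child node_ids

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, **props}

    def add_edge(self, parent_id, child_id):
        if parent_id not in self.nodes or child_id not in self.nodes:
            raise KeyError("both endpoints must exist before they are linked")
        self.edges[parent_id].append(child_id)

    def reachable(self, start_id, label):
        """Breadth-first search for nodes with `label` below `start_id`."""
        found, seen, queue = [], {start_id}, deque([start_id])
        while queue:
            current = queue.popleft()
            for child in self.edges[current]:
                if child in seen:
                    continue
                seen.add(child)
                if self.nodes[child]["label"] == label:
                    found.append(child)
                queue.append(child)
        return found


# Hypothetical identifiers, for illustration only.
graph = DataGraph()
graph.add_node("proj-1", "project")
graph.add_node("case-1", "case")
graph.add_node("clin-1", "clinical")
graph.add_node("exp-1", "experiment")
graph.add_node("file-1", "file", file_name="reads.bam")
graph.add_node("file-2", "file", file_name="clinical.tsv")
graph.add_edge("proj-1", "case-1")
graph.add_edge("case-1", "clin-1")
graph.add_edge("case-1", "exp-1")
graph.add_edge("exp-1", "file-1")
graph.add_edge("clin-1", "file-2")

print(sorted(graph.reachable("proj-1", "file")))
```

Because every file is reached by following edges from its case, a query such as "all files for this project" is a traversal rather than a join across separate tables.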
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  - Data browsing by project, file, case, or annotation
  - Visualization allowing users to perform fine-grained filtering of search results
  - Data search using advanced smart search technology
  - Data selection into a personalized cart
  - Data download from the cart or via the high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  - Command line interface to specify the desired transfer protocol and multiple files for download
  - Uses a manifest file generated from the GDC Data Portal for multiple downloads
  - Supports the secure transfer of controlled access data using an authentication key (token)
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  - Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  - Securely retrieve biospecimen, clinical, and molecular files
  - Perform BAM slicing
Example request (the slide labels its parts as API URL, endpoint, URL parameters, and query parameters):
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Example response (portion removed for readability):
{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
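The example request and response can be reproduced offline to show how the pieces fit together. The sketch below builds the query URL from its components and then reads a response of the shape shown on the slide. It makes no network call: the response text is a shortened copy of the slide's example, and the helper names `projects_url` and `sites_by_project` are invented for illustration.

```python
import json
from urllib.parse import urlencode

# API base URL as listed in the deck's references (an assumption about
# the deployment at the time of the slides).
API_BASE = "https://gdc-api.nci.nih.gov"


def projects_url(fields, pretty=True):
    """Build the /projects query URL with a comma-separated field list."""
    params = {"fields": ",".join(fields), "pretty": str(pretty).lower()}
    # safe="," keeps the commas in the field list readable in the URL.
    return f"{API_BASE}/projects?{urlencode(params, safe=',')}"


def sites_by_project(response_text):
    """Map project_id to primary_site from a response shaped like the slide's."""
    payload = json.loads(response_text)
    return {hit["project_id"]: hit["primary_site"]
            for hit in payload["data"]["hits"]}


# Shortened copy of the example response; not fetched from the API.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"},
  {"project_id": "TARGET-NBL", "primary_site": "Nervous System"}
]}}
"""

url = projects_url(["project_id", "primary_site"])
sites = sites_by_project(SAMPLE_RESPONSE)

print(url)
print(sites)
```

The `fields` parameter limits the response to the named attributes, which keeps payloads small when only a few columns are needed.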
GDC Web Site
• The GDC Web Site provides access to GDC resources that support and communicate the GDC mission
  - Targets data consumers, providers, developers, and general users
  - Provides access to information about the GDC and contributed cancer genomic data sets
  - Instructs users on the use of GDC data access and submission tools
  - Provides descriptions of GDC bioinformatics pipelines
  - Documents supported GDC data types and file formats
  - Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site
  - https://gdc.nci.nih.gov
• GDC Documentation Site
  - https://gdc-docs.nci.nih.gov
• GDC Data Portal
  - https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  - https://gdc-portal.nci.nih.gov/submission
  - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  - https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  - https://gdc-api.nci.nih.gov
• GDC Support
  - GDC Assistance: support@nci-gdc.datacommons.io
  - GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype)
  - Aggregation
  - Access
  - Sharing
  - Analysis
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Figure: data flow diagram. Labels: Birth defect or childhood cancer cohorts; DNA Sequencing center; Birth defect BAM/VCF; Cancer BAM/VCF; dbGaP; NCI GDC; Birth defect VCF; Cancer VCF; Gabriella Miller Kids First Data Resource (Index of datasets, Phenotype, Variant summaries); Users]
[Figure: GMKF Data Resource shown with related resources: dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, Center for Mendelian Genomics]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
What the Data Coordinating Center Should Know
Lessons Learned
bull Set up the data coordinating center early bull Develop study designs and analysis plans early bull Set milestones and timelines early and carefully but
be ready to miss bull Leverage existing projects and infrastructure bull Engage a wide range of expertise bull Build on existing community resources bull Use open-source and academic solutions instead of
commercialproprietary solutions bull Engage external advisors
Acknowledgements NIAGADS EAB Matthew Farrer
Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober
NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly bull Data flow WG
bull Annotation WG IGAPADGCCHARGEEADIGERAD
NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon
NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao
GDC Overview GDC Data Submission Processing and Retrieval
April 2016
Agenda
n
GDC Overview
GDC Data Submissio
GDC Data Processing
GDC Data Retrieval
154
GDC Overview Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables
data sharing across cancer genomic studies in support of precision medicine
bull Provide a cancer knowledge network that ndash Enables the identification of both high- and low-frequency cancer drivers ndash Assists in defining genomic determinants of response to therapy ndash Informs the composition of clinical trial cohorts sharing targeted genetic lesions
bull Support the receipt quality control integration storage and redistribution of standardized genomic data sets derived from cancer research studies ndash Harmonization of raw sequence both from existing (eg TCGA TARGET
CGCI) and new cancer research programs
ndash Application of state-of-the-art methods of generating high level data 155
GDC Overview Infrastructure
156
Data Sources (DCC CGHub)
Data Submitters
Open Access Data Users
Controlled Access Data Users
eRACommons amp dbGaP Data
Access Tools
Data Import System
Metadata amp Data Storage
Reporting System
Harmonization amp Generation
Pipelines
3rd Party Applications
GDC Users GDC System Components GDC Interfaces
Alignment amp Processing Tools Data
Submission Tools
Data Security System
APIs
Digital ID System
GDC Overview Resources
157
GDC Data Portal
GDC Data Center
GDC Data Transfer Tool GDC Reports
GDC Data Model
GDC Data Submission
Portal
GDC ApplicatioProgrammingInterface (API)
n GDC Bioinformatics
Pipeline
GDC Documentation
and Support
GDC Organization
and Collaborators
GDC Organization and Collaborators
bull GDC Government Sponsor
NCI Center for Cancer Genomics (CCG)
University of Chicago Team
bull Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
bull GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
bull Contracting organization supporting GDC management and execution
Other Government External
bull Includes eRA dbGaP NCI CBIIT for data access and security compliance bull Includes the GDC Steering Committee Bioinformatics Advisory Group Data
Submitters and User Acceptance Testers (UAT) Testers 158
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
159
GDC Data Submission Data Submitter Types
bull The GDC supports two types of data submitters
ndash Type 1 Large Organizations bull Users associated with an institution or group with significant informatics resources bull Large one-time submission or long-term ongoing data submission bull Primarily use the GDC Application Programming Interfaces (API) or the GDC
Data Transfer Tool (command line interface) for data submission
ndash Type 2 Researchers or Individual Laboratories bull Users associated with a single group (such as a laboratory investigator or
researcher) with limited informatics resources bull One-time or sporadic uploads of low volumes of patient and analysis data with
varying levels of completeness bull Submit via the web-based GDC Data Submission Portal and the GDC Data
Transfer Tool that use the GDC API
160
GDC Data Submission Data Submission Policies
bull dbGaP Data Submission Policies ndash GDC data submitters must first apply for data submission authorization through dbGaP ndash Data submission through dbGaP requires institutional certification under NIHrsquos Genomi c
Data Sharing Policy bull GDC Data Sharing Policies
ndash Data Sharing Requirement bull Data submitted to the GDC will be made available to the scientific community according to the
data submitterrsquos NCI Genomic Data Sharing Plan Controlled access data will be made availabl e to members of the community having the appropriate dbGaP Data Use Certification
bull The GDC will produce harmonized data (raw and derived) based on the originally submitted data The GDC will not preserve an exact copy of the originally submitted data however the GDC will preserve the original reads and quality scores
ndash Data Pre-processing Period bull For each project the GDC will afford a pre-processing period of exclusive that allows for
submitters to perform data cleaning and quality and submission of revised data before release bull The pre-processing period may generally last up to six months from the date of first submission
ndash Data Submission Period and Release bull Once submitted data will be processed and validated by the GDC Submitted data will be
released and available via controlled access for research that is consistent with the datasetrsquos ldquodata use limitationsrdquo either six months after data submission or at the time of first publication
ndash Data Redaction bull The GDC in general will not remove data access in response to submitter requests GDC will
remove data access in the following events Data Management Incident Human Subjects 161
Compliance Issue Erroneous Data
GDC Data Submission Data Submission Process
162
GDC Data Submission dbGaP Registration
bull Data submitters register their studies in dbGaP submit the Subject Identifiers associated with the study in dbGaP and contact the GDC
bull The GDC verifies that the study has been registered in dbGaP verifies the data submitter credentials Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process httpwwwncbinlmnihgovprojectsgapcgi-binGetPdfcgidocument_name=HowToSubmitpdf
163
GDC Data Submission Upload and Validate Data
bull Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal GDC Data Transfer Tool or GDC Application Programming Interface (API)
bull Uploaded data must follow GDC supported data types and file formats which leverage and extend existing data standards for biospecimen clinical and experiment data
164
Data Types and File Formats (1 of 2)
bull The GDC provides a data dictionary and template files for each data type
165
Data Type Data Subtype Format Data Dictionary Template
Administrative Administrative Data TSV JSON Case TSV JSON
Biospecimen Biospecimen Data TSV JSON
Sample Portion Analyte Aliquot
Sample TSV JSON Portion TSV JSON Analyte TSV JSON Aliquot TSV JSON
Clinical Clinical Data TSV JSON
Demographic Diagnosis Exposure Family History Treatment
Demographic TSV JSON Diagnosis TSV JSON Exposure TSV JSON Family History TSV JSON Treatment TSV JSON
Data File
Analysis Metadata SRA XML MAGE-TAB (SDRF IDF) Analysis Metadata TSV JSON
Biospecimen Metadata BCR XML GDC-approved spreadsheet
Biospecimen Metadata TSV JSON
Clinical Metadata BCR XML GDC-approved spreadsheet Clinical Metadata TSV JSON
Data Types and File Formats (2 of 2)
Data Type Data Subtype Format Data Dictionary Template
Data File
Experimental Metadata Pathology Report Run Metadata Slide Image
Submitted Unaligned Reads
SRA XML
PDF SRA XML SVS
FASTQ BAM
Experimental Metadata Pathology Report
Submitted Unaligned Reads
TSV JSON
TSV JSON TSV JSON TSV JSON
TSV JSON
Submitted Aligned Reads BAM
Submitted Aligned Reads TSV JSON
Data Bundle Read Group
Slide
Read Group
Slide
TSV JSON
TSV JSON
166
Data Types and File Formats Biospecimen Data
bull Biospecimen data types may include samples aliquots analytes and portions
bull GDC supports the submission of biospecimen data in XML JSON or TSV file format
167
Data Types and File Formats Clinical Data
bull GDC clinical data types are associated with each case and include required preferred and optional clinical data elements for demographics diagnosis family history exposure and treatment ndash GDC clinical data elements were reviewed with members of the research community ndash Clinical data elements are defined in the GDC dictionary and registered in the NCIrsquos
Cancer Data Standards Repository (caDSR)
Case
Clinical Data
Demographics Diagnosis Family History(Optional) Exposure (Optional) Treatment
(Optional)
Biospecimen Data Experiment Data
168
Data Types and File Formats Experiment Data
bull GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
bull GDC performs and reports on Quality Control (QC) metrics generated by FASTQC
Read Group Name RG_1 RG_2 RG_3 RG_4 Total Sequences 502123 123456 658101 788965 Read Length 75 75 75 75 GC Content () 49 50 48 50 Basic Statistics PASS PASS PASS PASS
Per base sequence quality PASS PASS PASS PASS
Per tile sequence quality PASS WARNING PASS PASS
Per sequence quality scores FAIL PASS PASS PASS
Per base sequence content PASS PASS WARNING PASS
Per sequence GC content PASS PASS PASS PASS
Per base N content PASS PASS PASS PASS Sequence Length
Distribution PASS PASS PASS PASS Sequence
Duplication Levels WARNING WARNING PASS PASS Overrepresented
sequences PASS PASS PASS PASS Adapter Content PASS FAIL PASS PASS 169Kmer Content PASS PASS WARNING PASS
Data Submission Tools GDC Data Submission Portal
bull The GDC Data Submission Portal is a web-based data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
Upload biospecimen and clinical data and experiment metadata using user friendly web-based tools
Validate data against GDC standard data types defined in the project data dictionary
Obtain information on the status of data submission and processing by project
170
Data Submission Tools GDC Data Transfer Tool
bull The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
Command line interface to specify desired transfer protocol
Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
Supports the secure upload of controlled access data using an authentication key (token)
171
Data Submission Tools GDC Submission API
bull The GDC Submission API is a REST based programmatic interface that supports the programmatic submission of cancer data sets for analysis
Securely submit biospecimen clinical and experiment files using a token
Submit data to GDC by performing create update and retrieve actions on entities
Search for submitted data using GraphQL an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint POST v0submissionltprogramgtltprojectgt 172
GDC Data Submission Submit and Release Data
bull After validation data is submitted to the GDC for processing including data harmonization and high level data generation for applicable data
bull After data processing has been completed the user can release their data to the GDC which must occur within six (6) months of submission per GDC Data Submission Policies
bull Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
173
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
174
GDC Processing Harmonization
bull GDC pipelines supporting the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
175
GDC Processing DNA and RNA Sequence Harmonization Pipelines
176
DNA Sequence Pipeline RNA Sequence Pipeline
GDC Processing GDC High Level Data Generation
bull The GDC generates high level data including DNA-seq derived germline variants and somatic mutations RNA-seq and miRNA-seq derived gene and miRNA quatifications and SNP Array based copy number segmentations
bull The GDC implements multiple pipelines for generating somatic variants such as the Baylor Broad and WashU pipelines
177
GDC Processing GDC Variant Calling Pipelines
178
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
179
GDC Data Retrieval
bull Once data is submitted into GDC and the data submitter signs-off on the data to request data release GDC performs Quality Control (QC) and data processing
bull Upon successful GDC QC and processing the data is released based on the appropriate dbGaP data restrictions
bull Released data is made available for query and download to authorized users via the GDC Data Portal the GDC Data Transfer Tool and the GDC API
bull Data queries are based on the GDC Data Model
180
GDC Data Retrieval GDC Data Model
bull The GDC data model is represented as a graph with nodes and edges and this graph is the store of record for the GDC
bull The GDC data model maintains the critical relationship between projects cases clinical data and experiment data and insures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
181
GDC Data Retrieval GDC Data Portal
bull The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
Data browsing by project file case or annotation
Visualization allowing users to perform fine-grained filtering of search results
Data search using advanced smart search technology
Data selection into a personalized cart
Data download from cart or a high-performance Data Transfer Tool
182
GDC Data Retrieval GDC Data Transfer Tool
bull The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
Command line interface to specify desired transfer protocol and multiple files for download
Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
Supports the secure transfer of controlled access data using an authentication key (token)
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing
Example request (API URL, endpoint, URL parameters, query parameters):
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Example response (portion removed for readability):
"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"},
  {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
  {"project_id": "TCGA-UVM", "primary_site": "Eye"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"},
  {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
  {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
  {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
  {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
]}
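The request and response on this slide can be sketched in a few lines of Python. The sketch assumes the slide's URL reads https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true and that the response nests its records under data.hits; the helper names are hypothetical, and no network call is made (the sample response reuses three hits from the slide).

```python
import json
from urllib.parse import urlencode

API_URL = "https://gdc-api.nci.nih.gov"  # API URL as read from the slide


def build_query_url(endpoint, fields, pretty=True):
    """Compose API URL + endpoint + comma-separated fields parameter."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # safe="," keeps the field separator readable instead of encoding it as %2C
    return f"{API_URL}/{endpoint}?{urlencode(params, safe=',')}"


def primary_sites(response_text):
    """Map project_id -> primary_site from a response shaped like the slide's."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


url = build_query_url("projects", ["project_id", "primary_site"])

# Three of the hits shown on the slide, wrapped in the assumed envelope.
SAMPLE_RESPONSE = json.dumps({"data": {"hits": [
    {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
    {"project_id": "TCGA-LAML", "primary_site": "Blood"},
    {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
]}})

sites = primary_sites(SAMPLE_RESPONSE)
print(url)
print(sites["TCGA-SKCM"])
```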
GDC Web Site
• The GDC Web Site provides access to GDC resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about the GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site – https://gdc.nci.nih.gov
• GDC Documentation Site – https://gdc-docs.nci.nih.gov
• GDC Data Portal – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype)
  – Aggregation
  – Access
  – Sharing
  – Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Figure: data flow. Birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF and cancer BAM/VCF; dbGaP and NCI GDC; birth defect VCF and cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users]
[Figure: the GMKF Data Resource alongside dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants is associated with total anomalous pulmonary venous return?
• I have a knock-out of the ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist and, if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
Acknowledgements
Patients and families of those with AD
NIA ADRC program and centers
NACC, NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD
NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly
NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon
NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli
NIAGADS is funded by NIA U24-AG041689
Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao
GDC Overview: GDC Data Submission, Processing, and Retrieval
April 2016
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence data from both existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods for generating high-level data
GDC Overview: Infrastructure
[Figure: GDC infrastructure, organized as GDC Users, GDC System Components, and GDC Interfaces. Labeled elements: Data Sources (DCC, CGHub), Data Submitters, Open Access Data Users, Controlled Access Data Users, eRA Commons & dbGaP, Data Access Tools, Data Submission Tools, APIs, 3rd Party Applications, Data Import System, Metadata & Data Storage, Reporting System, Harmonization & Generation Pipelines, Alignment & Processing Tools, Data Security System, Digital ID System]
GDC Overview: Resources
[Figure: GDC resources: GDC Data Portal, GDC Data Center, GDC Data Transfer Tool, GDC Reports, GDC Data Model, GDC Data Submission Portal, GDC Application Programming Interface (API), GDC Bioinformatics Pipeline, GDC Documentation and Support, GDC Organization and Collaborators]
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC government sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, both of which use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
GDC Data Submission: Data Submission Process
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the subject identifiers associated with the study to dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter's credentials. Data submitters must have an eRA Commons account and authorization for the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload their data to a workspace within the GDC and validate it there using Data Submission Tools such as the GDC Data Submission Portal, the GDC Data Transfer Tool, or the GDC Application Programming Interface (API)
• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample (TSV, JSON); Portion (TSV, JSON); Analyte (TSV, JSON); Aliquot (TSV, JSON)
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic (TSV, JSON); Diagnosis (TSV, JSON); Exposure (TSV, JSON); Family History (TSV, JSON); Treatment (TSV, JSON)
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
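To make the format requirements concrete, here is a minimal sketch of a pre-upload check. The accepted formats per data subtype are taken from the table (Submitted Unaligned Reads: FASTQ or BAM; Submitted Aligned Reads: BAM; Pathology Report: PDF; Slide Image: SVS). The file-extension mapping and the function name are assumptions made for illustration, and this is not how the GDC itself validates submissions.

```python
from pathlib import Path

# Accepted formats per data subtype, as listed in the table.
ACCEPTED_FORMATS = {
    "Submitted Unaligned Reads": {"FASTQ", "BAM"},
    "Submitted Aligned Reads": {"BAM"},
    "Pathology Report": {"PDF"},
    "Slide Image": {"SVS"},
}

# Assumed mapping from file extension to format name (not from the slides).
EXTENSION_FORMATS = {
    ".fastq": "FASTQ",
    ".fq": "FASTQ",
    ".bam": "BAM",
    ".pdf": "PDF",
    ".svs": "SVS",
}


def check_file(filename, subtype):
    """Return (ok, message) for a file name proposed under a data subtype."""
    accepted = ACCEPTED_FORMATS.get(subtype)
    if accepted is None:
        return False, f"unknown data subtype: {subtype}"
    file_format = EXTENSION_FORMATS.get(Path(filename).suffix.lower())
    if file_format is None:
        return False, f"unrecognized extension: {filename}"
    if file_format not in accepted:
        return False, f"{file_format} is not accepted for {subtype}"
    return True, "ok"


print(check_file("tumor.bam", "Submitted Aligned Reads"))
print(check_file("reads.fastq", "Submitted Aligned Reads"))
```

Only the final extension is inspected, so a compressed file such as reads.fastq.gz is reported as unrecognized; handling compression would need an extra rule.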
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• The GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
[Figure: a Case is linked to Clinical Data (Demographics, Diagnosis, Family History (optional), Exposure (optional), Treatment (optional)), Biospecimen Data, and Experiment Data]
Data Types and File Formats: Experiment Data
• The GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• The GDC performs and reports on Quality Control (QC) metrics generated by FastQC
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
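A QC table like the one on this slide is easier to act on once it is reduced to one verdict per read group. The sketch below does that for the example values shown; the severity ordering (PASS, then WARNING, then FAIL) and the helper names are assumptions for illustration and are not part of FastQC or the GDC pipeline.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Module flags copied from the example table, one entry per read group.
QC_FLAGS = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}

SEVERITY = {"PASS": 0, "WARNING": 1, "FAIL": 2}  # assumed ordering


def worst_status(flags_by_module, read_groups):
    """Return the most severe flag seen for each read group."""
    summary = {}
    for index, read_group in enumerate(read_groups):
        statuses = [flags[index] for flags in flags_by_module.values()]
        summary[read_group] = max(statuses, key=SEVERITY.__getitem__)
    return summary


def flagged_modules(flags_by_module, read_groups, read_group):
    """List the modules that did not PASS for one read group, in table order."""
    index = read_groups.index(read_group)
    return [name for name, flags in flags_by_module.items() if flags[index] != "PASS"]


summary = worst_status(QC_FLAGS, READ_GROUPS)
print(summary)
print(flagged_modules(QC_FLAGS, READ_GROUPS, "RG_2"))
```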
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify the desired transfer protocol
  – Use of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Support for the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to the GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
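The create-entities call can be sketched as follows. The request is only constructed, never sent. The path follows the endpoint on the slide, read as POST /v0/submission/<program>/<project>; the host is the API URL listed in the references; the X-Auth-Token header name, the program and project values, and the shape of the entity record are assumptions for illustration and should be checked against the GDC API documentation.

```python
import json
import urllib.request

API_ROOT = "https://gdc-api.nci.nih.gov"  # API URL listed in the references


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST to the create-entities endpoint."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Auth-Token": token,  # assumed header name for the authentication token
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Hypothetical entity record; the field names are illustrative only.
example_entities = [{"type": "case", "submitter_id": "example-case-001"}]

request = build_create_request("TCGA", "SKCM", example_entities, token="EXAMPLE-TOKEN")
print(request.get_method(), request.full_url)
```

Sending the request would be a call to urllib.request.urlopen(request) with a real token; it is left out here so the sketch stays free of side effects.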
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access, per the dbGaP authorization policies associated with the data set
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]
GDC Processing: GDC High Level Data Generation
• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
GDC Data Retrieval
bull Once data is submitted into GDC and the data submitter signs-off on the data to request data release GDC performs Quality Control (QC) and data processing
bull Upon successful GDC QC and processing the data is released based on the appropriate dbGaP data restrictions
bull Released data is made available for query and download to authorized users via the GDC Data Portal the GDC Data Transfer Tool and the GDC API
bull Data queries are based on the GDC Data Model
180
GDC Data Retrieval GDC Data Model
bull The GDC data model is represented as a graph with nodes and edges and this graph is the store of record for the GDC
bull The GDC data model maintains the critical relationship between projects cases clinical data and experiment data and insures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
181
GDC Data Retrieval GDC Data Portal
bull The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
Data browsing by project file case or annotation
Visualization allowing users to perform fine-grained filtering of search results
Data search using advanced smart search technology
Data selection into a personalized cart
Data download from cart or a high-performance Data Transfer Tool
182
GDC Data Retrieval GDC Data Transfer Tool
bull The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
Command line interface to specify desired transfer protocol and multiple files for download
Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
Supports the secure transfer of controlled access data using an authentication key (token)
183
Data Retrieval Tools GDC API
bull The GDC API is a REST based programmatic interface that allows users to search and download cancer data sets for analysis
Search for projects files cases annotations and retrieve associated details in JSON format
Securely retrieve biospecimen clinical and molecular files
Perform BAM slicing
data hits [
project_id TCGA-SKCMrdquoprimary_site Skinrdquo project_id TCGA-PCPGrdquoprimary_site Nervous Systemrdquo project_id TCGA-LAMLrdquoprimary_site Bloodrdquo project_id TCGA-CNTLrdquoprimary_site Not Applicablerdquo project_id TCGA-UVMrdquoprimary_site Eyerdquo project_id TARGET-AMLrdquoprimary_site Bloodrdquo project_id TCGA-SARCrdquoprimary_site Mesenchymalrdquo project_id TCGA-LUSCrdquoprimary_site Lungrdquo project_id TARGET-NBLrdquoprimary_site Nervous Systemrdquo project_id TCGA-PAADrdquoprimary_site Pancreasrdquo
] Portion remove for readability
API URL Endpoint URL parameters Query parameters
184httpsgdc-apincinihgovprojectsfields=project_idprimary_siteamppretty=true
GDC Web Site
bull The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
Targets data consumers providers developers and general users
Provides access to information about GDC and contributed cancer genomic data sets
Instructs users on the use of GDC data access and submission tools
Provides descriptions of GDC bioinformatics pipelines
Documents supported GDC data types and file formats
Provides access to GDC support resources
185
GDC Documentation Site
bull The GDC Documentation Site provides access to GDC Userrsquos Guides and the GDC Data Dictionary
186
References
Note Requires access to the University of Chicago
Virtual Private Network
bull GDC Web Site ndash httpsgdcncinihgov
bull GDC Documentation Site ndash httpsgdc-docsncinihgov
bull GDC Data Portal ndash httpsgdc-portalncinihgov
bull GDC Data Submission Portal ndash httpsgdc-portalncinihgovsubmission ndash httpsgdcncinihgovsubmit-datagdc-data-submission-portal
bull GDC Data Transfer Tool ndash httpsgdcncinihgovaccess-datagdc-data-transfer-tool
bull GDC Application Programming Interface (API) ndash httpsgdcncinihgovdevelopersgdc-application-programming-interface-api ndash httpsgdc-apincinihgov
bull GDC Support ndash GDC Assistance supportnci-gdcdatacommonsio ndash GDC User List Serv GDC-USERS-LLISTNIHGOV 187
Gabriella Miller Kids First Pediatric Data Resource Center
bull Goal accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
bull By enabling data ndash Aggregation ndash Access
Whole genome sequence + Phenotype ndash Sharing ndash Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
bull Web-based public facing platform bull House organize index and display data and
analytic tools
Data Coordinating
Center
bull Facilitate deposition of sequence and phenotype data into relevant repositories
bull Harmonize phenotypes
Administrative and Outreach
Core
bull Develop policies and procedures bull Facilitate meetings and communication bull Educate and seek feedback from users
Birth defect or childhood cancer
DNA Sequencing center
cohorts
Birth defect BAMVCF Cancer BAMVCF
dbGaP NCI GDC
Birth defect VCF Cancer VCF
Gabriella Miller Kids First Data
Resource Birth defect BAMVCF
Index of datasets Phenotype Variant summaries
Cancer BAMVCF
Users
GMKF Data
Resource
dbGaP
NCI GDC
TOPMed
The Monarch Initiative
ClinGen
Matchmaker Exchange
Center for Mendelian Genomics
Functionality bull Where can I find a group of patients with
diaphragmatic hernia bull What range of variants are associated with total
anomalous pulmonary venous return bull I have a knock-out of ABC gene in the mouse What
human phenotypes are associated with variants in ABC
bull What is the frequency of de novo variants in patients with Ewing sarcoma
bull I have a patient with congenital facial palsy cleftpalate and syndactyly Do similar patients exist and ifso what genetic variants have been discovered in association with this constellation of findings
Questions
bull Do we have the right model for data managementsharing for the childhood cancer and birth defects communities
bull Have we included all of the right elements in the proposal
bull Do we have the right use cases bull How do we maximize use of the Data Resource bull Have we targeted the rightall of the audiences
GDC Overview GDC Data Submission Processing and Retrieval
April 2016
Agenda
n
GDC Overview
GDC Data Submissio
GDC Data Processing
GDC Data Retrieval
154
GDC Overview Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables
data sharing across cancer genomic studies in support of precision medicine
bull Provide a cancer knowledge network that ndash Enables the identification of both high- and low-frequency cancer drivers ndash Assists in defining genomic determinants of response to therapy ndash Informs the composition of clinical trial cohorts sharing targeted genetic lesions
bull Support the receipt quality control integration storage and redistribution of standardized genomic data sets derived from cancer research studies ndash Harmonization of raw sequence both from existing (eg TCGA TARGET
CGCI) and new cancer research programs
ndash Application of state-of-the-art methods of generating high level data 155
GDC Overview Infrastructure
156
Data Sources (DCC CGHub)
Data Submitters
Open Access Data Users
Controlled Access Data Users
eRACommons amp dbGaP Data
Access Tools
Data Import System
Metadata amp Data Storage
Reporting System
Harmonization amp Generation
Pipelines
3rd Party Applications
GDC Users GDC System Components GDC Interfaces
Alignment amp Processing Tools Data
Submission Tools
Data Security System
APIs
Digital ID System
GDC Overview Resources
157
GDC Data Portal
GDC Data Center
GDC Data Transfer Tool GDC Reports
GDC Data Model
GDC Data Submission
Portal
GDC ApplicatioProgrammingInterface (API)
n GDC Bioinformatics
Pipeline
GDC Documentation
and Support
GDC Organization
and Collaborators
GDC Organization and Collaborators
bull GDC Government Sponsor
NCI Center for Cancer Genomics (CCG)
University of Chicago Team
bull Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
bull GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
bull Contracting organization supporting GDC management and execution
Other Government External
bull Includes eRA dbGaP NCI CBIIT for data access and security compliance bull Includes the GDC Steering Committee Bioinformatics Advisory Group Data
Submitters and User Acceptance Testers (UAT) Testers 158
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
159
GDC Data Submission Data Submitter Types
bull The GDC supports two types of data submitters
ndash Type 1 Large Organizations bull Users associated with an institution or group with significant informatics resources bull Large one-time submission or long-term ongoing data submission bull Primarily use the GDC Application Programming Interfaces (API) or the GDC
Data Transfer Tool (command line interface) for data submission
ndash Type 2 Researchers or Individual Laboratories bull Users associated with a single group (such as a laboratory investigator or
researcher) with limited informatics resources bull One-time or sporadic uploads of low volumes of patient and analysis data with
varying levels of completeness bull Submit via the web-based GDC Data Submission Portal and the GDC Data
Transfer Tool that use the GDC API
160
GDC Data Submission Data Submission Policies
bull dbGaP Data Submission Policies ndash GDC data submitters must first apply for data submission authorization through dbGaP ndash Data submission through dbGaP requires institutional certification under NIHrsquos Genomi c
Data Sharing Policy bull GDC Data Sharing Policies
ndash Data Sharing Requirement bull Data submitted to the GDC will be made available to the scientific community according to the
data submitterrsquos NCI Genomic Data Sharing Plan Controlled access data will be made availabl e to members of the community having the appropriate dbGaP Data Use Certification
bull The GDC will produce harmonized data (raw and derived) based on the originally submitted data The GDC will not preserve an exact copy of the originally submitted data however the GDC will preserve the original reads and quality scores
ndash Data Pre-processing Period bull For each project the GDC will afford a pre-processing period of exclusive that allows for
submitters to perform data cleaning and quality and submission of revised data before release bull The pre-processing period may generally last up to six months from the date of first submission
ndash Data Submission Period and Release bull Once submitted data will be processed and validated by the GDC Submitted data will be
released and available via controlled access for research that is consistent with the datasetrsquos ldquodata use limitationsrdquo either six months after data submission or at the time of first publication
ndash Data Redaction bull The GDC in general will not remove data access in response to submitter requests GDC will
remove data access in the following events Data Management Incident Human Subjects 161
Compliance Issue Erroneous Data
GDC Data Submission Data Submission Process
162
GDC Data Submission dbGaP Registration
bull Data submitters register their studies in dbGaP submit the Subject Identifiers associated with the study in dbGaP and contact the GDC
bull The GDC verifies that the study has been registered in dbGaP verifies the data submitter credentials Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process httpwwwncbinlmnihgovprojectsgapcgi-binGetPdfcgidocument_name=HowToSubmitpdf
163
GDC Data Submission Upload and Validate Data
bull Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal GDC Data Transfer Tool or GDC Application Programming Interface (API)
bull Uploaded data must follow GDC supported data types and file formats which leverage and extend existing data standards for biospecimen clinical and experiment data
164
Data Types and File Formats (1 of 2)
bull The GDC provides a data dictionary and template files for each data type
165
Data Type Data Subtype Format Data Dictionary Template
Administrative Administrative Data TSV JSON Case TSV JSON
Biospecimen Biospecimen Data TSV JSON
Sample Portion Analyte Aliquot
Sample TSV JSON Portion TSV JSON Analyte TSV JSON Aliquot TSV JSON
Clinical Clinical Data TSV JSON
Demographic Diagnosis Exposure Family History Treatment
Demographic TSV JSON Diagnosis TSV JSON Exposure TSV JSON Family History TSV JSON Treatment TSV JSON
Data File
Analysis Metadata SRA XML MAGE-TAB (SDRF IDF) Analysis Metadata TSV JSON
Biospecimen Metadata BCR XML GDC-approved spreadsheet
Biospecimen Metadata TSV JSON
Clinical Metadata BCR XML GDC-approved spreadsheet Clinical Metadata TSV JSON
Data Types and File Formats (2 of 2)
Data Type Data Subtype Format Data Dictionary Template
Data File
Experimental Metadata Pathology Report Run Metadata Slide Image
Submitted Unaligned Reads
SRA XML
PDF SRA XML SVS
FASTQ BAM
Experimental Metadata Pathology Report
Submitted Unaligned Reads
TSV JSON
TSV JSON TSV JSON TSV JSON
TSV JSON
Submitted Aligned Reads BAM
Submitted Aligned Reads TSV JSON
Data Bundle Read Group
Slide
Read Group
Slide
TSV JSON
TSV JSON
166
Data Types and File Formats Biospecimen Data
bull Biospecimen data types may include samples aliquots analytes and portions
bull GDC supports the submission of biospecimen data in XML JSON or TSV file format
167
Data Types and File Formats Clinical Data
bull GDC clinical data types are associated with each case and include required preferred and optional clinical data elements for demographics diagnosis family history exposure and treatment ndash GDC clinical data elements were reviewed with members of the research community ndash Clinical data elements are defined in the GDC dictionary and registered in the NCIrsquos
Cancer Data Standards Repository (caDSR)
Case
Clinical Data
Demographics Diagnosis Family History(Optional) Exposure (Optional) Treatment
(Optional)
Biospecimen Data Experiment Data
168
Data Types and File Formats Experiment Data
bull GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
bull GDC performs and reports on Quality Control (QC) metrics generated by FASTQC
Read Group Name RG_1 RG_2 RG_3 RG_4 Total Sequences 502123 123456 658101 788965 Read Length 75 75 75 75 GC Content () 49 50 48 50 Basic Statistics PASS PASS PASS PASS
Per base sequence quality PASS PASS PASS PASS
Per tile sequence quality PASS WARNING PASS PASS
Per sequence quality scores FAIL PASS PASS PASS
Per base sequence content PASS PASS WARNING PASS
Per sequence GC content PASS PASS PASS PASS
Per base N content PASS PASS PASS PASS Sequence Length
Distribution PASS PASS PASS PASS Sequence
Duplication Levels WARNING WARNING PASS PASS Overrepresented
sequences PASS PASS PASS PASS Adapter Content PASS FAIL PASS PASS 169Kmer Content PASS PASS WARNING PASS
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  - Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  - Validate data against GDC standard data types defined in the project data dictionary
  - Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  - Command line interface to specify desired transfer protocol
  - Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  - Supports the secure upload of controlled access data using an authentication key (token)
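A manifest-driven transfer with a token could be scripted roughly as follows. This is a sketch under stated assumptions: the executable name and flags are modelled on the public gdc-client command line tool and should be checked against its help output, and the file names are placeholders. The code only assembles the command; it does not run it.

```python
import shlex


def build_transfer_command(action, manifest_path, token_path):
    """Assemble an argument list for a manifest-driven transfer.

    The executable name and flags are assumptions modelled on the public
    gdc-client; check the tool's help output before relying on them.
    """
    if action not in ("upload", "download"):
        raise ValueError(f"unsupported action: {action}")
    return [
        "gdc-client", action,
        "--manifest", manifest_path,   # manifest generated by the portal
        "--token-file", token_path,    # authentication token for controlled data
    ]


cmd = build_transfer_command("upload", "manifest.yml", "token.txt")
print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids quoting problems if it is later passed to subprocess.run.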
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  - Securely submit biospecimen, clinical, and experiment files using a token
  - Submit data to GDC by performing create, update, and retrieve actions on entities
  - Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
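The create-entities call can be sketched with the standard library. The host is the API address listed on the References slide; the X-Auth-Token header name and the payload shape are assumptions for illustration, and the request is built but never sent.

```python
import json
import urllib.request

API_ROOT = "https://gdc-api.nci.nih.gov"  # host as listed on the References slide


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST to the create-entities endpoint."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the token
        },
    )


request = build_create_request(
    "TCGA", "SKCM",
    [{"type": "case", "submitter_id": "CASE-001"}],  # hypothetical payload
    token="example-token",
)
print(request.get_method(), request.full_url)
```

Passing the returned object to urllib.request.urlopen would perform the submission; separating construction from sending keeps the sketch testable offline.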
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC; release must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access, per the dbGaP authorization policies associated with the data set
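The six-month release window can be worked out with simple date arithmetic. The sketch below assumes the window is counted in calendar months from the submission date, which the slides do not specify; the function name is hypothetical.

```python
import calendar
from datetime import date


def release_deadline(submitted, months=6):
    """Return the date `months` calendar months after `submitted`.

    The day is clamped to the last day of the target month, so a
    submission on 31 August yields a deadline at the end of February.
    """
    month_index = submitted.month - 1 + months
    year = submitted.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(submitted.day, last_day))


print(release_deadline(date(2016, 8, 31)))
```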
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]
GDC Processing: GDC High Level Data Generation
• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Retrieval
• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects by means of unique identifiers
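A graph of nodes and edges keyed by unique identifiers can be illustrated with a few lines of code. This is a toy in-memory sketch, not the GDC's storage layer; the class name, node labels, and identifiers are all hypothetical.

```python
from collections import defaultdict


class DataGraph:
    """Tiny in-memory graph: nodes keyed by unique ID, directed edges."""

    def __init__(self):
        self.nodes = {}                  # node_id -> {"label": ..., **props}
        self.edges = defaultdict(list)   # child_id -> [parent_id, ...]

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, **props}

    def link(self, child_id, parent_id):
        # Refuse edges to unknown nodes so links cannot dangle.
        if child_id not in self.nodes or parent_id not in self.nodes:
            raise KeyError("both nodes must exist before linking")
        self.edges[child_id].append(parent_id)

    def ancestors(self, node_id):
        """Walk edges upward and return every reachable node ID."""
        seen, stack = [], list(self.edges[node_id])
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.append(current)
                stack.extend(self.edges[current])
        return seen


graph = DataGraph()
graph.add_node("proj-1", "project", name="TCGA-SKCM")
graph.add_node("case-1", "case")
graph.add_node("file-1", "data_file", file_name="reads.bam")
graph.link("case-1", "proj-1")
graph.link("file-1", "case-1")
print(graph.ancestors("file-1"))
```

Walking upward from a file node recovers the case and project it belongs to, which is the kind of linkage the slide describes.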
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  - Data browsing by project, file, case, or annotation
  - Visualization allowing users to perform fine-grained filtering of search results
  - Data search using advanced smart search technology
  - Data selection into a personalized cart
  - Data download from cart or a high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  - Command line interface to specify desired transfer protocol and multiple files for download
  - Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  - Supports the secure transfer of controlled access data using an authentication key (token)
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  - Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  - Securely retrieve biospecimen, clinical, and molecular files
  - Perform BAM slicing
Example request (labelled parts: API URL, endpoint, URL parameters, query parameters):
  https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Example response (portion removed for readability):
  {
    "data": {
      "hits": [
        {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
        {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
        {"project_id": "TCGA-LAML", "primary_site": "Blood"},
        {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
        {"project_id": "TCGA-UVM", "primary_site": "Eye"},
        {"project_id": "TARGET-AML", "primary_site": "Blood"},
        {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
        {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
        {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
        {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
      ]
    }
  }
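The request and response on this slide can be reproduced offline. The sketch below composes the query URL from its parts and extracts fields from a response of the shape shown; the helper names are hypothetical, and the sample response is a one-entry excerpt rather than live API output.

```python
import json
from urllib.parse import urlencode

API_ROOT = "https://gdc-api.nci.nih.gov"


def build_query_url(endpoint, fields, pretty=True):
    """Compose API root + endpoint + query parameters."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # safe="," keeps the field list readable instead of encoding it as %2C
    return f"{API_ROOT}/{endpoint}?{urlencode(params, safe=',')}"


def primary_sites(response_text):
    """Map project_id -> primary_site from a projects response."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


url = build_query_url("projects", ["project_id", "primary_site"])
sample = '{"data": {"hits": [{"project_id": "TCGA-SKCM", "primary_site": "Skin"}]}}'
print(url)
print(primary_sites(sample))
```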
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  - Targets data consumers, providers, developers, and general users
  - Provides access to information about GDC and contributed cancer genomic data sets
  - Instructs users on the use of GDC data access and submission tools
  - Provides descriptions of GDC bioinformatics pipelines
  - Documents supported GDC data types and file formats
  - Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site - https://gdc.nci.nih.gov
• GDC Documentation Site - https://gdc-docs.nci.nih.gov
• GDC Data Portal - https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  - https://gdc-portal.nci.nih.gov/submission
  - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool - https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  - https://gdc-api.nci.nih.gov
• GDC Support
  - GDC Assistance: support@nci-gdc.datacommons.io
  - GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data
  - Aggregation
  - Access
  - Sharing
  - Analysis
Whole genome sequence + Phenotype
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Data flow diagram. Labels: Birth defect or childhood cancer cohorts; DNA sequencing center; Birth defect BAM/VCF; Cancer BAM/VCF; dbGaP; NCI GDC; Gabriella Miller Kids First Data Resource (Index of datasets, Phenotype, Variant summaries); Users]
[Diagram: GMKF Data Resource shown with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and Center for Mendelian Genomics]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Overview: Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine
• Provide a cancer knowledge network that
  - Enables the identification of both high- and low-frequency cancer drivers
  - Assists in defining genomic determinants of response to therapy
  - Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  - Harmonization of raw sequence, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  - Application of state-of-the-art methods of generating high-level data
GDC Overview: Infrastructure
[Architecture diagram with three groups: GDC Users, GDC System Components, GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; 3rd Party Applications; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System; Data Access Tools; Data Submission Tools; APIs]
GDC Overview: Resources
• GDC Data Portal
• GDC Data Center
• GDC Data Transfer Tool
• GDC Reports
• GDC Data Model
• GDC Data Submission Portal
• GDC Application Programming Interface (API)
• GDC Bioinformatics Pipeline
• GDC Documentation and Support
• GDC Organization and Collaborators
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testing (UAT) testers
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
  - Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  - Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  - GDC data submitters must first apply for data submission authorization through dbGaP
  - Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  - Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  - Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  - Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  - Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
GDC Data Submission: Data Submission Process
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter's credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Portion | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Analyte | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Aliquot | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Diagnosis | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Exposure | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Family History | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Treatment | TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
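The format column of these tables lends itself to a simple pre-upload check. The sketch below maps a few data subtypes to the formats listed for them and tests a file name's extension; the function name is hypothetical, and accepting compressed names such as reads.fastq.gz is an assumption rather than something the slides state.

```python
from pathlib import Path

# Accepted formats per data subtype, transcribed from the tables above.
ACCEPTED_FORMATS = {
    "Pathology Report": {"pdf"},
    "Slide Image": {"svs"},
    "Submitted Unaligned Reads": {"fastq", "bam"},
    "Submitted Aligned Reads": {"bam"},
}


def is_accepted(subtype, filename):
    """Check a file's extension against the formats listed for its subtype."""
    allowed = ACCEPTED_FORMATS.get(subtype)
    if allowed is None:
        raise KeyError(f"unknown data subtype: {subtype}")
    suffixes = [s.lstrip(".").lower() for s in Path(filename).suffixes]
    # Tolerate compressed names such as reads.fastq.gz by checking every suffix.
    return any(s in allowed for s in suffixes)


print(is_accepted("Submitted Aligned Reads", "tumor.bam"))
print(is_accepted("Submitted Aligned Reads", "tumor.fastq"))
```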
GDC Overview Mission and Goals
The mission of the GDC is to provide the cancer research community with a unified data repository that enables
data sharing across cancer genomic studies in support of precision medicine
bull Provide a cancer knowledge network that ndash Enables the identification of both high- and low-frequency cancer drivers ndash Assists in defining genomic determinants of response to therapy ndash Informs the composition of clinical trial cohorts sharing targeted genetic lesions
bull Support the receipt quality control integration storage and redistribution of standardized genomic data sets derived from cancer research studies ndash Harmonization of raw sequence both from existing (eg TCGA TARGET
CGCI) and new cancer research programs
ndash Application of state-of-the-art methods of generating high level data 155
GDC Overview Infrastructure
156
Data Sources (DCC CGHub)
Data Submitters
Open Access Data Users
Controlled Access Data Users
eRACommons amp dbGaP Data
Access Tools
Data Import System
Metadata amp Data Storage
Reporting System
Harmonization amp Generation
Pipelines
3rd Party Applications
GDC Users GDC System Components GDC Interfaces
Alignment amp Processing Tools Data
Submission Tools
Data Security System
APIs
Digital ID System
GDC Overview Resources
157
GDC Data Portal
GDC Data Center
GDC Data Transfer Tool GDC Reports
GDC Data Model
GDC Data Submission
Portal
GDC ApplicatioProgrammingInterface (API)
n GDC Bioinformatics
Pipeline
GDC Documentation
and Support
GDC Organization
and Collaborators
GDC Organization and Collaborators
bull GDC Government Sponsor
NCI Center for Cancer Genomics (CCG)
University of Chicago Team
bull Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
bull GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
bull Contracting organization supporting GDC management and execution
Other Government External
bull Includes eRA dbGaP NCI CBIIT for data access and security compliance bull Includes the GDC Steering Committee Bioinformatics Advisory Group Data
Submitters and User Acceptance Testers (UAT) Testers 158
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
159
GDC Data Submission Data Submitter Types
bull The GDC supports two types of data submitters
ndash Type 1 Large Organizations bull Users associated with an institution or group with significant informatics resources bull Large one-time submission or long-term ongoing data submission bull Primarily use the GDC Application Programming Interfaces (API) or the GDC
Data Transfer Tool (command line interface) for data submission
ndash Type 2 Researchers or Individual Laboratories bull Users associated with a single group (such as a laboratory investigator or
researcher) with limited informatics resources bull One-time or sporadic uploads of low volumes of patient and analysis data with
varying levels of completeness bull Submit via the web-based GDC Data Submission Portal and the GDC Data
Transfer Tool that use the GDC API
160
GDC Data Submission Data Submission Policies
bull dbGaP Data Submission Policies ndash GDC data submitters must first apply for data submission authorization through dbGaP ndash Data submission through dbGaP requires institutional certification under NIHrsquos Genomi c
Data Sharing Policy bull GDC Data Sharing Policies
ndash Data Sharing Requirement bull Data submitted to the GDC will be made available to the scientific community according to the
data submitterrsquos NCI Genomic Data Sharing Plan Controlled access data will be made availabl e to members of the community having the appropriate dbGaP Data Use Certification
bull The GDC will produce harmonized data (raw and derived) based on the originally submitted data The GDC will not preserve an exact copy of the originally submitted data however the GDC will preserve the original reads and quality scores
ndash Data Pre-processing Period bull For each project the GDC will afford a pre-processing period of exclusive that allows for
submitters to perform data cleaning and quality and submission of revised data before release bull The pre-processing period may generally last up to six months from the date of first submission
ndash Data Submission Period and Release bull Once submitted data will be processed and validated by the GDC Submitted data will be
released and available via controlled access for research that is consistent with the datasetrsquos ldquodata use limitationsrdquo either six months after data submission or at the time of first publication
ndash Data Redaction bull The GDC in general will not remove data access in response to submitter requests GDC will
remove data access in the following events Data Management Incident Human Subjects 161
Compliance Issue Erroneous Data
GDC Data Submission Data Submission Process
162
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC.
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter's credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP.
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload their data to a workspace within the GDC and validate it there, using Data Submission Tools such as the GDC Data Submission Portal, the GDC Data Transfer Tool, or the GDC Application Programming Interface (API).
• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data.
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type.
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | TSV, JSON (one template per dictionary entry)
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | TSV, JSON (one template per dictionary entry)
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions.
• The GDC supports the submission of biospecimen data in XML, JSON, or TSV file format.
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment.
  - GDC clinical data elements were reviewed with members of the research community.
  - Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR).
[Figure: a Case linked to Clinical Data (Demographics, Diagnosis, Family History (optional), Exposure (optional), Treatment (optional)), Biospecimen Data, and Experiment Data]
Data Types and File Formats: Experiment Data
• The GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format.
• The GDC performs and reports on Quality Control (QC) metrics generated by FastQC.
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
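A QC report like the one on this slide is usually screened per read group. The following Python sketch shows one way that screening might look; it copies only the rows of the slide's table that contain a non-PASS value, and the dictionary layout and the function name `modules_with_status` are assumptions made for illustration, not part of FastQC or the GDC.

```python
# Read group names and the non-PASS rows, copied from the slide's QC table.
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

QC_REPORT = {
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def modules_with_status(report, read_groups, status):
    """Map each read group to the QC modules that reported `status`."""
    result = {}
    for module, statuses in report.items():
        for read_group, value in zip(read_groups, statuses):
            if value == status:
                result.setdefault(read_group, []).append(module)
    return result


failures = modules_with_status(QC_REPORT, READ_GROUPS, "FAIL")
warnings = modules_with_status(QC_REPORT, READ_GROUPS, "WARNING")
print("FAIL:", failures)
print("WARNING:", warnings)
```

In this example RG_1 and RG_2 each have one failing module, while RG_4 passes every module shown.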
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata.
  - Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  - Validate data against GDC standard data types defined in the project data dictionary
  - Obtain information on the status of data submission and processing by project
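Validating an upload against a data dictionary can be pictured as checking each row of a TSV template against required fields and permitted values. The Python sketch below is only an illustration of that idea: the dictionary entry, the field names, the permitted values, and the function `validate_tsv` are all hypothetical and much simpler than the actual GDC data dictionary.

```python
import csv
import io

# Hypothetical, much-reduced dictionary entry for illustration only.
SAMPLE_DICTIONARY = {
    "type": "sample",
    "required": ["submitter_id", "sample_type"],
    "enums": {"sample_type": {"Primary Tumor", "Blood Derived Normal"}},
}


def validate_tsv(tsv_text, dictionary):
    """Return a list of (row_number, message) problems found in a TSV upload."""
    problems = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row_number, row in enumerate(reader, start=1):
        # Every required field must be present and non-empty.
        for field in dictionary["required"]:
            if not (row.get(field) or "").strip():
                problems.append((row_number, f"missing required field '{field}'"))
        # Enumerated fields must use one of the permitted values.
        for field, allowed in dictionary["enums"].items():
            value = (row.get(field) or "").strip()
            if value and value not in allowed:
                problems.append((row_number, f"'{value}' is not allowed for '{field}'"))
    return problems


EXAMPLE_TSV = (
    "submitter_id\tsample_type\n"
    "sample-001\tPrimary Tumor\n"
    "sample-002\tUnknown Tissue\n"
    "\tBlood Derived Normal\n"
)

problems = validate_tsv(EXAMPLE_TSV, SAMPLE_DICTIONARY)
for row_number, message in problems:
    print(f"row {row_number}: {message}")
```

Collecting every problem before reporting, rather than stopping at the first one, lets a submitter fix a whole file in one pass.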
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks.
  - Command line interface to specify the desired transfer protocol
  - Uses a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  - Supports the secure upload of controlled-access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis.
  - Securely submit biospecimen, clinical, and experiment files using a token
  - Submit data to the GDC by performing create, update, and retrieve actions on entities
  - Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
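The Create Entities endpoint named on this slide can be sketched as a request description built in Python. This is a minimal illustration and sends nothing over the network: the API host is taken from the deck's References slide, while the `X-Auth-Token` header name, the program and project codes, and the entity fields are assumptions chosen for the example.

```python
import json

# API host as listed on the deck's References slide.
API_ROOT = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Describe a Create Entities call: POST /v0/submission/<program>/<project>."""
    return {
        "method": "POST",
        "url": f"{API_ROOT}/v0/submission/{program}/{project}",
        "headers": {
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the token
        },
        "body": json.dumps(entities),
    }


# Hypothetical program, project, and entity values.
request = build_create_request(
    program="EXAMPLE",
    project="DEMO",
    entities=[{"type": "case", "submitter_id": "case-001"}],
    token="<token read from a local file>",
)
print(request["method"], request["url"])
```

Keeping the request as plain data makes it easy to inspect or log before handing it to whichever HTTP client is in use.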
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data.
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies.
• Data is then made available through GDC Data Access Tools as open or controlled access, per the dbGaP authorization policies associated with the data set.
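The six-month release window can be turned into a concrete date with a few lines of Python. The sketch below assumes that "six months" means six calendar months and that a day which does not exist in the target month is clamped to that month's last day; the slides do not spell out either detail, and the function names are hypothetical.

```python
import calendar
from datetime import date


def add_months(start, months):
    """Return the date `months` calendar months after `start`, clamping the day."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def release_deadline(submitted_on):
    """Latest release date under the six-month rule, as interpreted above."""
    return add_months(submitted_on, 6)


print(release_deadline(date(2016, 3, 15)))  # 2016-09-15
print(release_deadline(date(2016, 8, 31)))  # 2017-02-28
```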
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38).
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]
GDC Processing: GDC High-Level Data Generation
• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations.
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines.
GDC Processing: GDC Variant Calling Pipelines
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Retrieval
• Once data is submitted into the GDC and the data submitter signs off on the data to request data release, the GDC performs Quality Control (QC) and data processing.
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions.
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API.
• Data queries are based on the GDC Data Model.
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC.
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects by means of unique identifiers.
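A graph of nodes and edges keyed by unique identifiers can be illustrated with a small in-memory example. The Python sketch below is not the GDC's implementation: the node labels, identifiers, and edge directions are hypothetical, and it only shows how following edges links a data file back to its case and project.

```python
from collections import deque

# Hypothetical nodes keyed by unique identifier; labels are illustrative.
NODES = {
    "proj-1": {"label": "project"},
    "case-1": {"label": "case"},
    "clin-1": {"label": "clinical"},
    "samp-1": {"label": "sample"},
    "file-1": {"label": "file", "file_name": "reads.bam"},
}

# Directed edges from child to parent, as (source, destination) pairs.
EDGES = [
    ("case-1", "proj-1"),
    ("clin-1", "case-1"),
    ("samp-1", "case-1"),
    ("file-1", "samp-1"),
]


def ancestors(node_id, edges):
    """Breadth-first walk along edges; returns nodes reachable from node_id."""
    parents = {}
    for source, destination in edges:
        parents.setdefault(source, []).append(destination)
    seen, queue = [], deque([node_id])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.append(parent)
                queue.append(parent)
    return seen


def find_case(file_id):
    """Return the identifier of the case a file belongs to, or None."""
    for node_id in ancestors(file_id, EDGES):
        if NODES[node_id]["label"] == "case":
            return node_id
    return None


print(ancestors("file-1", EDGES))
print(find_case("file-1"))
```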
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics.
  - Data browsing by project, file, case, or annotation
  - Visualization allowing users to perform fine-grained filtering of search results
  - Data search using advanced smart search technology
  - Data selection into a personalized cart
  - Data download from the cart or through a high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks.
  - Command line interface to specify the desired transfer protocol and multiple files for download
  - Uses a manifest file generated from the GDC Data Portal for multiple downloads
  - Supports the secure transfer of controlled-access data using an authentication key (token)
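A manifest is simply a list of files to transfer, so it can be inspected before a large download starts. The Python sketch below assumes a tab-separated manifest with `id`, `filename`, `md5`, `size`, and `state` columns; that layout, the example rows, and the function name `read_manifest` are assumptions for illustration and should be checked against a manifest actually produced by the portal.

```python
import csv
import io

# Assumed manifest layout: tab-separated, one file per row (hypothetical rows).
EXAMPLE_MANIFEST = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "00000000-0000-0000-0000-000000000001\tsample_a.bam\t"
    "d41d8cd98f00b204e9800998ecf8427e\t1048576\tsubmitted\n"
    "00000000-0000-0000-0000-000000000002\tsample_b.bam\t"
    "d41d8cd98f00b204e9800998ecf8427e\t2097152\tsubmitted\n"
)


def read_manifest(text):
    """Parse manifest text into a list of dicts with integer sizes."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    for row in rows:
        row["size"] = int(row["size"])
    return rows


entries = read_manifest(EXAMPLE_MANIFEST)
total_bytes = sum(entry["size"] for entry in entries)
print(len(entries), "files,", total_bytes, "bytes to transfer")
```

Summing the sizes up front gives a quick estimate of the disk space and transfer time a manifest will need.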
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis.
  - Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  - Securely retrieve biospecimen, clinical, and molecular files
  - Perform BAM slicing
Example request (the slide labels its parts as API URL, endpoint, URL parameters, and query parameters):
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Example response (portion removed for readability):
{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
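The request and response on this slide can be reproduced offline with Python's standard library. The sketch below builds the same query URL and reads project identifiers out of a response shaped like the slide's example; the helper names are hypothetical, the sample response is abbreviated to three of the hits shown, and no request is actually sent.

```python
import json
from urllib.parse import urlencode

# API host as it appears in the slide's example request.
API_ROOT = "https://gdc-api.nci.nih.gov"


def build_query_url(endpoint, fields, pretty=True):
    """Build an endpoint URL with a comma-separated `fields` parameter."""
    params = {"fields": ",".join(fields), "pretty": str(pretty).lower()}
    # safe="," keeps the commas readable instead of percent-encoding them.
    return f"{API_ROOT}/{endpoint}?{urlencode(params, safe=',')}"


def primary_sites(response_text):
    """Map project_id to primary_site from a data.hits style response."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


# Abbreviated copy of the response shown on the slide.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"},
  {"project_id": "TARGET-NBL", "primary_site": "Nervous System"}
]}}
"""

url = build_query_url("projects", ["project_id", "primary_site"])
sites = primary_sites(SAMPLE_RESPONSE)
print(url)
print(sites["TCGA-SKCM"])
```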
GDC Web Site
• The GDC Web Site provides access to GDC resources that support and communicate the GDC mission.
  - Targets data consumers, providers, developers, and general users
  - Provides access to information about the GDC and contributed cancer genomic data sets
  - Instructs users on the use of GDC data access and submission tools
  - Provides descriptions of GDC bioinformatics pipelines
  - Documents supported GDC data types and file formats
  - Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary.
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site
  - https://gdc.nci.nih.gov
• GDC Documentation Site
  - https://gdc-docs.nci.nih.gov
• GDC Data Portal
  - https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  - https://gdc-portal.nci.nih.gov/submission
  - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  - https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  - https://gdc-api.nci.nih.gov
• GDC Support
  - GDC Assistance: support@nci-gdc.datacommons.io
  - GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype):
  - Aggregation
  - Access
  - Sharing
  - Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Figure: data flow. Birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF and cancer BAM/VCF; dbGaP and NCI GDC; birth defect VCF and cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users]
[Figure: the GMKF Data Resource shown with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
GDC Overview: Infrastructure
[Figure: GDC infrastructure, grouped into GDC Users, GDC System Components, and GDC Interfaces. Labels shown: Data Sources (DCC, CGHub), Data Submitters, Open Access Data Users, Controlled Access Data Users, 3rd Party Applications, eRA Commons & dbGaP, Data Access Tools, Data Submission Tools, APIs, Data Import System, Metadata & Data Storage, Reporting System, Harmonization & Generation Pipelines, Alignment & Processing Tools, Data Security System, Digital ID System]
GDC Overview: Resources
[Figure: GDC resources. GDC Data Portal, GDC Data Center, GDC Data Transfer Tool, GDC Reports, GDC Data Model, GDC Data Submission Portal, GDC Application Programming Interface (API), GDC Bioinformatics Pipeline, GDC Documentation and Support, GDC Organization and Collaborators]
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution
Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Submission: Data Submitter Types
• The GDC supports two types of data submitters
  - Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  - Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory investigator or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, both of which use the GDC API
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  - GDC data submitters must first apply for data submission authorization through dbGaP.
  - Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy.
• GDC Data Sharing Policies
  - Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled-access data will be made available to members of the community who hold the appropriate dbGaP Data Use Certification.
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, it will preserve the original reads and quality scores.
  - Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning, quality control, and submission of revised data before release.
    • The pre-processing period may generally last up to six months from the date of first submission.
  - Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication.
  - Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data.
GDC Overview Resources
157
GDC Data Portal
GDC Data Center
GDC Data Transfer Tool GDC Reports
GDC Data Model
GDC Data Submission
Portal
GDC ApplicatioProgrammingInterface (API)
n GDC Bioinformatics
Pipeline
GDC Documentation
and Support
GDC Organization
and Collaborators
GDC Organization and Collaborators
bull GDC Government Sponsor
NCI Center for Cancer Genomics (CCG)
University of Chicago Team
bull Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
bull GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
bull Contracting organization supporting GDC management and execution
Other Government External
bull Includes eRA dbGaP NCI CBIIT for data access and security compliance bull Includes the GDC Steering Committee Bioinformatics Advisory Group Data
Submitters and User Acceptance Testers (UAT) Testers 158
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
159
GDC Data Submission Data Submitter Types
bull The GDC supports two types of data submitters
ndash Type 1 Large Organizations bull Users associated with an institution or group with significant informatics resources bull Large one-time submission or long-term ongoing data submission bull Primarily use the GDC Application Programming Interfaces (API) or the GDC
Data Transfer Tool (command line interface) for data submission
ndash Type 2 Researchers or Individual Laboratories bull Users associated with a single group (such as a laboratory investigator or
researcher) with limited informatics resources bull One-time or sporadic uploads of low volumes of patient and analysis data with
varying levels of completeness bull Submit via the web-based GDC Data Submission Portal and the GDC Data
Transfer Tool that use the GDC API
160
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  - GDC data submitters must first apply for data submission authorization through dbGaP
  - Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  - Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification.
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores.
  - Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  - Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication.
  - Data Redaction
    • The GDC, in general, will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data.
GDC Data Submission: Data Submission Process
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP.
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload their data to a workspace within the GDC and validate it there, using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample: TSV, JSON; Portion: TSV, JSON; Analyte: TSV, JSON; Aliquot: TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic: TSV, JSON; Diagnosis: TSV, JSON; Exposure: TSV, JSON; Family History: TSV, JSON; Treatment: TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
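Because the templates are listed in both TSV and JSON form, a submitter may want to convert one into the other before upload. The sketch below does this with the Python standard library only. The column names (type, submitter_id) and the sample values are hypothetical placeholders, not fields taken from the GDC data dictionary.

```python
import csv
import io
import json

# Hypothetical TSV content; real column names come from the GDC data dictionary.
SAMPLE_TSV = "type\tsubmitter_id\ncase\tCASE-001\ncase\tCASE-002\n"


def tsv_to_records(tsv_text):
    """Parse tab-separated text into a list of dicts, one per data row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [dict(row) for row in reader]


records = tsv_to_records(SAMPLE_TSV)
print(json.dumps(records, indent=2))
```

Each TSV row becomes one JSON object keyed by the header row, so the same records can be prepared in whichever of the two formats a given tool expects.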
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  - GDC clinical data elements were reviewed with members of the research community
  - Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
[Figure: a Case with Clinical Data (Demographics, Diagnosis, Family History (optional), Exposure (optional), Treatment (optional)), Biospecimen Data, and Experiment Data]
Data Types and File Formats: Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file formats and the submission of experiment metadata in NCBI SRA XML format
• GDC performs Quality Control (QC) and reports on the metrics generated by FASTQC
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
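A report like the one above is easier to act on when it is reduced to the read groups that need attention. The following is a minimal sketch, assuming the module results have already been transcribed into a dictionary; the values are copied from the example table, while the function name and data layout are illustrative choices and not part of any GDC tool.

```python
# Module results per read group, transcribed from the example QC table.
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]
QC_RESULTS = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def modules_with_status(results, read_groups, status):
    """Map each read group to the QC modules that reported `status`."""
    flagged = {}
    for module, statuses in results.items():
        for read_group, value in zip(read_groups, statuses):
            if value == status:
                flagged.setdefault(read_group, []).append(module)
    return flagged


failures = modules_with_status(QC_RESULTS, READ_GROUPS, "FAIL")
warnings = modules_with_status(QC_RESULTS, READ_GROUPS, "WARNING")
print("FAIL:", failures)
print("WARNING:", warnings)
```

Read groups that appear in neither dictionary (RG_4 in this example) passed every module.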
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  - Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  - Validate data against GDC standard data types defined in the project data dictionary
  - Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  - Command line interface to specify the desired transfer protocol
  - Uses a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  - Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  - Securely submit biospecimen, clinical, and experiment files using a token
  - Submit data to GDC by performing create, update, and retrieve actions on entities
  - Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
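To make the Create Entities endpoint concrete, the sketch below builds (but does not send) a POST request with the Python standard library. Several details are assumptions rather than statements from the slides: the host is the API URL listed in the References slide, the slashes in the path are restored from the flattened slide text, the X-Auth-Token header name is assumed for passing the token, and the program, project, token, and entity values are hypothetical placeholders.

```python
import json
import urllib.request

# API URL as listed in the References slide (assumed to serve submissions too).
API_ROOT = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build, without sending, a POST request for the Create Entities endpoint."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Auth-Token": token,  # assumed header name for the authentication token
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


request = build_create_request(
    "PROGRAM",  # hypothetical program name
    "PROJECT",  # hypothetical project code
    [{"type": "case", "submitter_id": "CASE-001"}],  # hypothetical entity
    token="EXAMPLE-TOKEN",
)
print(request.get_method(), request.full_url)
```

Sending the request would be a matter of passing it to `urllib.request.urlopen`; that step is left out so the sketch can be run without credentials or network access.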
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per the dbGaP authorization policies associated with the data set
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline; RNA Sequence Pipeline]
GDC Processing: GDC High Level Data Generation
• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP Array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
GDC Data Retrieval
• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
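The idea of a graph whose nodes are linked by unique identifiers can be illustrated with a small in-memory sketch. This is not the GDC schema: the node labels, the relation names (has_case, has_file), and the identifiers below are hypothetical and chosen only to show how a project can be traced to its data files through edges.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str  # unique identifier used to link nodes
    label: str    # kind of node, e.g. "project", "case", "file"
    properties: dict = field(default_factory=dict)


class Graph:
    def __init__(self):
        self.nodes = {}
        self.edges = []  # (source_id, relation, target_id) triples

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_edge(self, source_id, relation, target_id):
        # Refuse dangling edges so every link resolves to a known node.
        if source_id not in self.nodes or target_id not in self.nodes:
            raise KeyError("both endpoints must exist before linking")
        self.edges.append((source_id, relation, target_id))

    def neighbors(self, source_id, relation):
        return [
            self.nodes[target]
            for source, rel, target in self.edges
            if source == source_id and rel == relation
        ]


graph = Graph()
graph.add_node(Node("proj-1", "project"))
graph.add_node(Node("case-1", "case"))
graph.add_node(Node("file-1", "file", {"file_name": "reads.bam"}))
graph.add_edge("proj-1", "has_case", "case-1")
graph.add_edge("case-1", "has_file", "file-1")

# Walk project -> cases -> files using only identifiers and edges.
files = [
    file_node.node_id
    for case_node in graph.neighbors("proj-1", "has_case")
    for file_node in graph.neighbors(case_node.node_id, "has_file")
]
print(files)
```

Rejecting edges whose endpoints are unknown mirrors the slide's point that data must stay correctly linked to the file objects it describes.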
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  - Data browsing by project, file, case, or annotation
  - Visualization allowing users to perform fine-grained filtering of search results
  - Data search using advanced smart search technology
  - Data selection into a personalized cart
  - Data download from the cart or a high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  - Command line interface to specify the desired transfer protocol and multiple files for download
  - Uses a manifest file generated from the GDC Data Portal for multiple downloads
  - Supports the secure transfer of controlled access data using an authentication key (token)
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  - Search for projects, files, cases, and annotations and retrieve associated details in JSON format
  - Securely retrieve biospecimen, clinical, and molecular files
  - Perform BAM slicing
Example request (annotated on the slide as API URL, endpoint, URL parameters, and query parameters):
    https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Example response (portion removed for readability):
    {
      "data": {
        "hits": [
          {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
          {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
          {"project_id": "TCGA-LAML", "primary_site": "Blood"},
          {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
          {"project_id": "TCGA-UVM", "primary_site": "Eye"},
          {"project_id": "TARGET-AML", "primary_site": "Blood"},
          {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
          {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
          {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
          {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
        ]
      }
    }
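The projects query shown on this slide can be assembled and its response read with a few lines of Python. The sketch below is offline: it builds the query URL from the endpoint and field names on the slide, then parses a shortened copy of the example response instead of calling the service. The host is the one given in the slides and may have changed since; the helper name and the two-hit sample are illustrative assumptions.

```python
import json
from urllib.parse import urlencode

# API URL as given in the slides; it may differ from the current service address.
API_ROOT = "https://gdc-api.nci.nih.gov"


def build_query_url(endpoint, fields, pretty=True):
    """Join the API URL, endpoint, and query parameters into one request URL."""
    params = {"fields": ",".join(fields), "pretty": str(pretty).lower()}
    # Keep commas literal so the field list reads as on the slide.
    return f"{API_ROOT}/{endpoint}?{urlencode(params, safe=',')}"


url = build_query_url("projects", ["project_id", "primary_site"])
print(url)

# Shortened copy of the example response, used here in place of a live call.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LUSC", "primary_site": "Lung"}
]}}
"""

hits = json.loads(SAMPLE_RESPONSE)["data"]["hits"]
sites = {hit["project_id"]: hit["primary_site"] for hit in hits}
print(sites)
```

The `fields` parameter limits each hit to the named properties, which is why the example response carries only `project_id` and `primary_site`.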
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  - Targets data consumers, providers, developers, and general users
  - Provides access to information about GDC and contributed cancer genomic data sets
  - Instructs users on the use of GDC data access and submission tools
  - Provides descriptions of GDC bioinformatics pipelines
  - Documents supported GDC data types and file formats
  - Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site
  - https://gdc.nci.nih.gov
• GDC Documentation Site
  - https://gdc-docs.nci.nih.gov
• GDC Data Portal
  - https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  - https://gdc-portal.nci.nih.gov/submission
  - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  - https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  - https://gdc-api.nci.nih.gov
• GDC Support
  - GDC Assistance: support@nci-gdc.datacommons.io
  - GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype)
  - Aggregation
  - Access
  - Sharing
  - Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Figure: data flow. Labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF and cancer BAM/VCF; dbGaP and NCI GDC; birth defect VCF and cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users]
[Figure: the GMKF Data Resource shown with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
GDC Organization and Collaborators
NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor
University of Chicago Team
• Primary GDC developing organization
Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago
Leidos Biomedical Research Inc
• Contracting organization supporting GDC management and execution
Other Government External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
159
GDC Data Submission Data Submitter Types
bull The GDC supports two types of data submitters
ndash Type 1 Large Organizations bull Users associated with an institution or group with significant informatics resources bull Large one-time submission or long-term ongoing data submission bull Primarily use the GDC Application Programming Interfaces (API) or the GDC
Data Transfer Tool (command line interface) for data submission
ndash Type 2 Researchers or Individual Laboratories bull Users associated with a single group (such as a laboratory investigator or
researcher) with limited informatics resources bull One-time or sporadic uploads of low volumes of patient and analysis data with
varying levels of completeness bull Submit via the web-based GDC Data Submission Portal and the GDC Data
Transfer Tool that use the GDC API
160
GDC Data Submission: Data Submission Policies
• dbGaP Data Submission Policies
  - GDC data submitters must first apply for data submission authorization through dbGaP
  - Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  - Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification.
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores.
  - Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  - Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication.
  - Data Redaction
    • The GDC, in general, will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data.
GDC Data Submission: Data Submission Process
GDC Data Submission: dbGaP Registration
• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP.
dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
GDC Data Submission: Upload and Validate Data
• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | TSV, JSON (each)
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | TSV, JSON (each)
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
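Since TSV and JSON are both accepted formats, the relationship between the two can be illustrated with a minimal Python sketch that turns tab-separated rows into JSON records. The column names (type, submitter_id, sample_type) and the values are hypothetical stand-ins, not taken from the GDC data dictionary or its templates.

```python
import csv
import io
import json

# Hypothetical TSV content in the spirit of a biospecimen template.
# Column names and values are illustrative assumptions only.
TSV_TEXT = (
    "type\tsubmitter_id\tsample_type\n"
    "sample\tSAMPLE-001\tPrimary Tumor\n"
    "sample\tSAMPLE-002\tBlood Derived Normal\n"
)


def tsv_to_records(tsv_text):
    """Parse tab-separated text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [dict(row) for row in reader]


records = tsv_to_records(TSV_TEXT)
json_text = json.dumps(records, indent=2)
print(json_text)
```

Each TSV row becomes one JSON object, so a submitter could keep records in a spreadsheet-friendly form and still produce JSON when needed.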
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  - GDC clinical data elements were reviewed with members of the research community
  - Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
[Diagram: a Case linked to Clinical Data (Demographics, Diagnosis, Family History (optional), Exposure (optional), Treatment (optional)), Biospecimen Data, and Experiment Data]
Data Types and File Formats: Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
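A QC table like the one above is easier to act on when reduced to the modules that did not pass for each read group. The Python sketch below does that for the flags transcribed from the table; the function name and data layout are illustrative assumptions, not part of any GDC tool.

```python
# QC flags transcribed from the table, in read-group order RG_1..RG_4.
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]
QC_FLAGS = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged_modules(flags, read_groups):
    """Map each read group to its (module, status) pairs that are not PASS."""
    result = {rg: [] for rg in read_groups}
    for module, statuses in flags.items():
        for rg, status in zip(read_groups, statuses):
            if status != "PASS":
                result[rg].append((module, status))
    return result


summary = flagged_modules(QC_FLAGS, READ_GROUPS)
for rg in READ_GROUPS:
    print(rg, summary[rg])
```

In this example RG_4 passes every module, while RG_1 and RG_2 each carry one FAIL that a submitter would likely want to review before release.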
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  - Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  - Validate data against GDC standard data types defined in the project data dictionary
  - Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  - Command line interface to specify desired transfer protocol
  - Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  - Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  - Securely submit biospecimen, clinical, and experiment files using a token
  - Submit data to GDC by performing create, update, and retrieve actions on entities
  - Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
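The create-entities call can be sketched in Python by assembling the request without sending it. This is a minimal illustration under stated assumptions: the base address is the API address from the deck's reference list, the X-Auth-Token header name is assumed, and the program, project, token, and entity fields are placeholders rather than values defined by the GDC dictionary.

```python
import json
import urllib.request

# API address as listed in the deck's references.
API_BASE = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Assemble (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Assumption: the token is passed in an X-Auth-Token header.
        "X-Auth-Token": token,
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Placeholder program, project, entity, and token values.
request = build_create_request(
    "PROGRAM",
    "PROJECT",
    [{"type": "case", "submitter_id": "CASE-001"}],
    token="example-token",
)
print(request.get_method(), request.full_url)
```

Passing the assembled request to urllib.request.urlopen would perform the submission; that step is left out here because it needs a valid token and project authorization.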
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
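The six-month release window can be made concrete with a small date calculation. The Python sketch below assumes that "six months" means six calendar months from the submission date, with the day clamped to the length of the target month; the policy text does not spell out this detail, so treat the rule as an illustration only.

```python
import calendar
from datetime import date


def add_months(start, months):
    """Return the date `months` calendar months after `start`.

    The day is clamped to the last day of the target month, so
    31 August plus six months lands on the last day of February.
    """
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def release_deadline(submission_date):
    """Latest release date under the assumed six-calendar-month rule."""
    return add_months(submission_date, 6)


print(release_deadline(date(2016, 1, 15)))
print(release_deadline(date(2016, 8, 31)))
```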
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline diagrams]
GDC Processing: GDC High Level Data Generation
• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
GDC Data Retrieval
• Once data is submitted into the GDC and the data submitter signs off on the data to request data release, the GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
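The idea of a graph whose nodes are tied together by unique identifiers can be shown with a toy Python sketch. Everything here is hypothetical: the node labels, the identifiers, and the parent-to-child edge direction are chosen for illustration and do not reproduce the GDC schema.

```python
from collections import defaultdict


class Graph:
    """Toy node-and-edge store keyed by unique identifiers."""

    def __init__(self):
        self.nodes = {}                 # node id -> properties
        self.edges = defaultdict(list)  # node id -> linked node ids

    def add_node(self, node_id, label):
        self.nodes[node_id] = {"label": label}

    def add_edge(self, src, dst):
        self.edges[src].append(dst)

    def reachable(self, start, label):
        """Return sorted ids of nodes with `label` reachable from `start`."""
        found, stack, seen = [], [start], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            if current != start and self.nodes[current]["label"] == label:
                found.append(current)
            stack.extend(self.edges[current])
        return sorted(found)


# Hypothetical identifiers: one project, two cases, clinical data, two files.
graph = Graph()
for node_id, label in [
    ("P1", "project"), ("C1", "case"), ("C2", "case"),
    ("D1", "clinical"), ("F1", "file"), ("F2", "file"),
]:
    graph.add_node(node_id, label)
for src, dst in [("P1", "C1"), ("P1", "C2"), ("C1", "D1"), ("C1", "F1"), ("C2", "F2")]:
    graph.add_edge(src, dst)

print(graph.reachable("P1", "file"))
print(graph.reachable("C1", "file"))
```

Walking the edges from a project reaches every file attached to its cases, which is the kind of link from metadata to file objects that the data model is described as preserving.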
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  - Data browsing by project, file, case, or annotation
  - Visualization allowing users to perform fine-grained filtering of search results
  - Data search using advanced smart search technology
  - Data selection into a personalized cart
  - Data download from cart or a high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  - Command line interface to specify desired transfer protocol and multiple files for download
  - Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  - Supports the secure transfer of controlled access data using an authentication key (token)
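A manifest-driven transfer can be sketched in Python as reading a tab-separated manifest and checking each transferred file against its recorded checksum and size. The column names (id, filename, md5, size) are an assumption about the manifest layout, and the file contents are in-memory stand-ins rather than real downloads.

```python
import csv
import hashlib
import io

# In-memory stand-ins for downloaded files (hypothetical names and contents).
PAYLOADS = {"sample1.bam": b"hello", "sample2.bam": b"world"}


def make_manifest(payloads):
    """Build manifest text with the assumed columns: id, filename, md5, size."""
    lines = ["id\tfilename\tmd5\tsize"]
    for index, (name, data) in enumerate(payloads.items(), start=1):
        digest = hashlib.md5(data).hexdigest()
        lines.append(f"file-{index}\t{name}\t{digest}\t{len(data)}")
    return "\n".join(lines) + "\n"


def parse_manifest(text):
    """Parse manifest text into a list of dicts, one per file."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def verify(entry, data):
    """Check that the data matches the size and MD5 recorded in the manifest."""
    return (
        len(data) == int(entry["size"])
        and hashlib.md5(data).hexdigest() == entry["md5"]
    )


MANIFEST_TEXT = make_manifest(PAYLOADS)
entries = parse_manifest(MANIFEST_TEXT)
for entry in entries:
    print(entry["filename"], verify(entry, PAYLOADS[entry["filename"]]))
```

Checking size and checksum after each transfer is a common safeguard for large files; whether the Data Transfer Tool performs an equivalent check internally is not stated in the slides.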
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  - Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  - Securely retrieve biospecimen, clinical, and molecular files
  - Perform BAM slicing

Example request, made up of the API URL, an endpoint, URL parameters, and query parameters:
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):
"data": {
  "hits": [
    { "project_id": "TCGA-SKCM", "primary_site": "Skin" },
    { "project_id": "TCGA-PCPG", "primary_site": "Nervous System" },
    { "project_id": "TCGA-LAML", "primary_site": "Blood" },
    { "project_id": "TCGA-CNTL", "primary_site": "Not Applicable" },
    { "project_id": "TCGA-UVM", "primary_site": "Eye" },
    { "project_id": "TARGET-AML", "primary_site": "Blood" },
    { "project_id": "TCGA-SARC", "primary_site": "Mesenchymal" },
    { "project_id": "TCGA-LUSC", "primary_site": "Lung" },
    { "project_id": "TARGET-NBL", "primary_site": "Nervous System" },
    { "project_id": "TCGA-PAAD", "primary_site": "Pancreas" }
  ]
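The projects query can be sketched in Python by building the request URL and reading the hits out of a response. To stay runnable offline, the sketch parses a small sample response modeled on the slide rather than calling the service; the function names are illustrative, and the sample is truncated to three of the projects shown.

```python
import json
from urllib.parse import urlencode

# API address as listed in the deck's references.
API_BASE = "https://gdc-api.nci.nih.gov"


def build_projects_url(fields, pretty=True):
    """Build the projects endpoint URL with a comma-separated fields list."""
    query = urlencode(
        {"fields": ",".join(fields), "pretty": str(pretty).lower()},
        safe=",",  # keep commas readable instead of percent-encoding them
    )
    return f"{API_BASE}/projects?{query}"


def primary_sites(response_text):
    """Map project_id to primary_site from a projects response body."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


# Sample response modeled on the slide, truncated to three hits.
SAMPLE_RESPONSE = json.dumps({
    "data": {
        "hits": [
            {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
            {"project_id": "TCGA-LAML", "primary_site": "Blood"},
            {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
        ]
    }
})

url = build_projects_url(["project_id", "primary_site"])
sites = primary_sites(SAMPLE_RESPONSE)
print(url)
print(sites)
```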
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  - Targets data consumers, providers, developers, and general users
  - Provides access to information about GDC and contributed cancer genomic data sets
  - Instructs users on the use of GDC data access and submission tools
  - Provides descriptions of GDC bioinformatics pipelines
  - Documents supported GDC data types and file formats
  - Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site - https://gdc.nci.nih.gov
• GDC Documentation Site - https://gdc-docs.nci.nih.gov
• GDC Data Portal - https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  - https://gdc-portal.nci.nih.gov/submission
  - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool - https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  - https://gdc-api.nci.nih.gov
• GDC Support
  - GDC Assistance: support@nci-gdc.datacommons.io
  - GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data
  - Aggregation
  - Access
  - Sharing
  - Analysis
Whole genome sequence + Phenotype
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Diagram: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF and cancer BAM/VCF; dbGaP and NCI GDC; birth defect VCF and cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users]
[Diagram: GMKF Data Resource shown with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and Center for Mendelian Genomics]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
GDC Data Submission Data Submission Policies
bull dbGaP Data Submission Policies ndash GDC data submitters must first apply for data submission authorization through dbGaP ndash Data submission through dbGaP requires institutional certification under NIHrsquos Genomi c
Data Sharing Policy bull GDC Data Sharing Policies
ndash Data Sharing Requirement bull Data submitted to the GDC will be made available to the scientific community according to the
data submitterrsquos NCI Genomic Data Sharing Plan Controlled access data will be made availabl e to members of the community having the appropriate dbGaP Data Use Certification
bull The GDC will produce harmonized data (raw and derived) based on the originally submitted data The GDC will not preserve an exact copy of the originally submitted data however the GDC will preserve the original reads and quality scores
ndash Data Pre-processing Period bull For each project the GDC will afford a pre-processing period of exclusive that allows for
submitters to perform data cleaning and quality and submission of revised data before release bull The pre-processing period may generally last up to six months from the date of first submission
ndash Data Submission Period and Release bull Once submitted data will be processed and validated by the GDC Submitted data will be
released and available via controlled access for research that is consistent with the datasetrsquos ldquodata use limitationsrdquo either six months after data submission or at the time of first publication
ndash Data Redaction bull The GDC in general will not remove data access in response to submitter requests GDC will
remove data access in the following events Data Management Incident Human Subjects 161
Compliance Issue Erroneous Data
GDC Data Submission Data Submission Process
162
GDC Data Submission dbGaP Registration
bull Data submitters register their studies in dbGaP submit the Subject Identifiers associated with the study in dbGaP and contact the GDC
bull The GDC verifies that the study has been registered in dbGaP verifies the data submitter credentials Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process httpwwwncbinlmnihgovprojectsgapcgi-binGetPdfcgidocument_name=HowToSubmitpdf
163
GDC Data Submission Upload and Validate Data
• Data submitters upload and validate their data in a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, the GDC Data Transfer Tool, or the GDC Application Programming Interface (API).
• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data.
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type.

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample: TSV, JSON; Portion: TSV, JSON; Analyte: TSV, JSON; Aliquot: TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic: TSV, JSON; Diagnosis: TSV, JSON; Exposure: TSV, JSON; Family History: TSV, JSON; Treatment: TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
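Because the templates come in both TSV and JSON, a submitter may want to move records between the two. The following is a minimal sketch of that conversion using only the Python standard library. The column names and values in `SAMPLE_TSV` are hypothetical and are not taken from the GDC data dictionary.

```python
import csv
import io
import json

# Hypothetical template content: these column names and values are
# illustrative assumptions, not fields from the GDC data dictionary.
SAMPLE_TSV = (
    "type\tsubmitter_id\tsample_type\n"
    "sample\tSAMPLE-001\tPrimary Tumor\n"
    "sample\tSAMPLE-002\tBlood Derived Normal\n"
)


def tsv_to_records(tsv_text):
    """Parse tab-separated text with a header row into a list of dicts."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [dict(row) for row in reader]


records = tsv_to_records(SAMPLE_TSV)

# Serialize the same records as JSON, one object per TSV row.
print(json.dumps(records, indent=2))
```

Each TSV row becomes one JSON object keyed by the header row, so the two representations carry the same information.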
Data Types and File Formats Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions.
• The GDC supports the submission of biospecimen data in XML, JSON, or TSV file format.
Data Types and File Formats Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment.
  – GDC clinical data elements were reviewed with members of the research community.
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR).

[Figure: a Case is associated with Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)), Biospecimen Data, and Experiment Data.]
Data Types and File Formats Experiment Data
• The GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format.
• The GDC performs and reports on Quality Control (QC) metrics generated by FastQC.
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
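A report like the one above is easier to act on when it is summarized per read group. The sketch below transcribes the module statuses from the example table and lists which modules raised a FAIL or WARNING for each read group. The function name and data layout are illustrative choices, not part of any GDC or FastQC tool.

```python
# Module statuses transcribed from the example QC table (one per read group).
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]
QC_RESULTS = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged_modules(results, read_groups, status):
    """Map each read group to the QC modules that reported `status`."""
    flagged = {rg: [] for rg in read_groups}
    for module, statuses in results.items():
        for rg, value in zip(read_groups, statuses):
            if value == status:
                flagged[rg].append(module)
    return flagged


failures = flagged_modules(QC_RESULTS, READ_GROUPS, "FAIL")
warnings = flagged_modules(QC_RESULTS, READ_GROUPS, "WARNING")

for rg in READ_GROUPS:
    print(rg, "FAIL:", failures[rg], "WARNING:", warnings[rg])
```

In this example only RG_4 passes every module; RG_1 and RG_2 each have one failing module.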
Data Submission Tools GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata.
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project
Data Submission Tools GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks.
  – Command line interface to specify the desired transfer protocol
  – Uses a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis.
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to the GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
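To make the Create Entities call concrete, here is a minimal sketch that builds (but does not send) a request to `POST /v0/submission/<program>/<project>`. The API address is the one listed in this deck's references. The `X-Auth-Token` header name, the JSON payload shape, and the program, project, and token values are assumptions for illustration and should be checked against the GDC API documentation.

```python
import json
import urllib.request

# API address as listed in the References slide of this deck.
API_ROOT = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build a POST request for the Create Entities endpoint without sending it."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumed header name for the authentication token.
            "X-Auth-Token": token,
        },
    )


# Hypothetical program, project, entity, and token values.
request = build_create_request(
    "PROGRAM",
    "PROJECT",
    [{"type": "case", "submitter_id": "CASE-001"}],
    "MY-TOKEN",
)
print(request.get_method(), request.full_url)
```

Sending the request would be a matter of passing it to `urllib.request.urlopen`; the sketch stops short of that so it can be run without network access or credentials.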
GDC Data Submission Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data.
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies.
• Data is then made available through GDC Data Access Tools as open or controlled access, per the dbGaP authorization policies associated with the data set.
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Processing Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38).
GDC Processing DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline diagrams]
GDC Processing GDC High Level Data Generation
• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations.
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines.
GDC Processing GDC Variant Calling Pipelines
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Retrieval
• Once data is submitted into the GDC and the data submitter signs off on the data to request data release, the GDC performs Quality Control (QC) and data processing.
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions.
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API.
• Data queries are based on the GDC Data Model.
GDC Data Retrieval GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC.
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers.
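The idea of a graph that links projects, cases, and files through unique identifiers can be illustrated with a small sketch. This is a toy model for explanation only: the class, node labels, and identifiers are hypothetical and do not reflect the GDC's actual schema or storage engine.

```python
from collections import defaultdict


class Graph:
    """Toy directed graph: nodes carry a label, edges point from parent to child."""

    def __init__(self):
        self.nodes = {}
        self.children = defaultdict(list)

    def add_node(self, node_id, label):
        self.nodes[node_id] = label

    def add_edge(self, parent_id, child_id):
        self.children[parent_id].append(child_id)

    def descendants(self, start_id, label):
        """Return sorted IDs of nodes with `label` reachable from `start_id`."""
        seen = {start_id}
        stack = [start_id]
        found = []
        while stack:
            current = stack.pop()
            for child in self.children[current]:
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
                    if self.nodes[child] == label:
                        found.append(child)
        return sorted(found)


# Hypothetical identifiers, chosen only to show the linkage.
g = Graph()
g.add_node("project-1", "project")
g.add_node("case-1", "case")
g.add_node("case-2", "case")
g.add_node("clinical-1", "clinical")
g.add_node("file-1", "file")
g.add_node("file-2", "file")
g.add_edge("project-1", "case-1")
g.add_edge("project-1", "case-2")
g.add_edge("case-1", "clinical-1")
g.add_edge("case-1", "file-1")
g.add_edge("case-2", "file-2")

print(g.descendants("case-1", "file"))
print(g.descendants("project-1", "file"))
```

Walking the edges from a case or a project yields exactly the files linked to it, which is the property the data model is described as guaranteeing.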
GDC Data Retrieval GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics.
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from the cart or through the high-performance Data Transfer Tool
GDC Data Retrieval GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks.
  – Command line interface to specify the desired transfer protocol and multiple files for download
  – Uses a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)
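Before starting a large transfer it can help to inspect the manifest. The sketch below parses a tab-separated manifest and totals the bytes to be downloaded. The column names (`id`, `filename`, `size`) and the rows are assumptions for illustration; a real manifest generated by the GDC Data Portal may carry different or additional columns.

```python
import csv
import io

# Hypothetical manifest content; column names and values are assumptions.
MANIFEST_TEXT = (
    "id\tfilename\tsize\n"
    "aaaa-1111\tsample1.bam\t1048576\n"
    "bbbb-2222\tsample2.bam\t2097152\n"
)


def read_manifest(text):
    """Parse a tab-separated manifest into a list of dicts with integer sizes."""
    rows = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        row["size"] = int(row["size"])
        rows.append(row)
    return rows


entries = read_manifest(MANIFEST_TEXT)
total_bytes = sum(entry["size"] for entry in entries)

print(f"{len(entries)} files, {total_bytes} bytes to transfer")
```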
Data Retrieval Tools GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis.
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing
Example request (the slide labels its parts as API URL, endpoint, URL parameters, and query parameters):
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):
{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
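The example query on this slide can be assembled and its response consumed in a few lines. This sketch builds the same projects URL from its parts and extracts the primary site for each project from a response of the shape shown on the slide. The two-entry `SAMPLE_RESPONSE` is a shortened stand-in for a live response, and the helper names are illustrative.

```python
import json
from urllib.parse import urlencode

# API address as listed in the References slide of this deck.
API_ROOT = "https://gdc-api.nci.nih.gov"


def build_query_url(endpoint, fields, pretty=True):
    """Combine the API root, an endpoint, and query parameters into one URL."""
    query = {"fields": ",".join(fields), "pretty": str(pretty).lower()}
    # Keep commas unescaped so the URL matches the form shown on the slide.
    return f"{API_ROOT}/{endpoint}?{urlencode(query, safe=',')}"


def sites_by_project(response_text):
    """Map project_id to primary_site from a projects response."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


url = build_query_url("projects", ["project_id", "primary_site"])

# Shortened stand-in for a live response, using two entries from the slide.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"}
]}}
"""
sites = sites_by_project(SAMPLE_RESPONSE)

print(url)
print(sites)
```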
GDC Web Site
• The GDC Web Site provides access to GDC resources that support and communicate the GDC mission.
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about the GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary.
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site
  – https://gdc.nci.nih.gov
• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov
• GDC Data Portal
  – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype):
  – Aggregation
  – Access
  – Sharing
  – Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Figure: data flow diagram. Labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF and cancer BAM/VCF; dbGaP and NCI GDC; birth defect VCF and cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users.]
[Figure: the GMKF Data Resource shown with related resources: dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics.]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
GDC Data Submission dbGaP Registration
bull Data submitters register their studies in dbGaP submit the Subject Identifiers associated with the study in dbGaP and contact the GDC
bull The GDC verifies that the study has been registered in dbGaP verifies the data submitter credentials Data submitters must have an eRA Commons account and authorization to the study in dbGaP
dbGaP Submission Process httpwwwncbinlmnihgovprojectsgapcgi-binGetPdfcgidocument_name=HowToSubmitpdf
163
GDC Data Submission Upload and Validate Data
bull Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal GDC Data Transfer Tool or GDC Application Programming Interface (API)
bull Uploaded data must follow GDC supported data types and file formats which leverage and extend existing data standards for biospecimen clinical and experiment data
164
Data Types and File Formats (1 of 2)
bull The GDC provides a data dictionary and template files for each data type
165
Data Type Data Subtype Format Data Dictionary Template
Administrative Administrative Data TSV JSON Case TSV JSON
Biospecimen Biospecimen Data TSV JSON
Sample Portion Analyte Aliquot
Sample TSV JSON Portion TSV JSON Analyte TSV JSON Aliquot TSV JSON
Clinical Clinical Data TSV JSON
Demographic Diagnosis Exposure Family History Treatment
Demographic TSV JSON Diagnosis TSV JSON Exposure TSV JSON Family History TSV JSON Treatment TSV JSON
Data File
Analysis Metadata SRA XML MAGE-TAB (SDRF IDF) Analysis Metadata TSV JSON
Biospecimen Metadata BCR XML GDC-approved spreadsheet
Biospecimen Metadata TSV JSON
Clinical Metadata BCR XML GDC-approved spreadsheet Clinical Metadata TSV JSON
Data Types and File Formats (2 of 2)
Data Type Data Subtype Format Data Dictionary Template
Data File
Experimental Metadata Pathology Report Run Metadata Slide Image
Submitted Unaligned Reads
SRA XML
PDF SRA XML SVS
FASTQ BAM
Experimental Metadata Pathology Report
Submitted Unaligned Reads
TSV JSON
TSV JSON TSV JSON TSV JSON
TSV JSON
Submitted Aligned Reads BAM
Submitted Aligned Reads TSV JSON
Data Bundle Read Group
Slide
Read Group
Slide
TSV JSON
TSV JSON
166
Data Types and File Formats Biospecimen Data
bull Biospecimen data types may include samples aliquots analytes and portions
bull GDC supports the submission of biospecimen data in XML JSON or TSV file format
167
Data Types and File Formats Clinical Data
bull GDC clinical data types are associated with each case and include required preferred and optional clinical data elements for demographics diagnosis family history exposure and treatment ndash GDC clinical data elements were reviewed with members of the research community ndash Clinical data elements are defined in the GDC dictionary and registered in the NCIrsquos
Cancer Data Standards Repository (caDSR)
Case
Clinical Data
Demographics Diagnosis Family History(Optional) Exposure (Optional) Treatment
(Optional)
Biospecimen Data Experiment Data
168
Data Types and File Formats Experiment Data
bull GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
bull GDC performs and reports on Quality Control (QC) metrics generated by FASTQC
Read Group Name RG_1 RG_2 RG_3 RG_4 Total Sequences 502123 123456 658101 788965 Read Length 75 75 75 75 GC Content () 49 50 48 50 Basic Statistics PASS PASS PASS PASS
Per base sequence quality PASS PASS PASS PASS
Per tile sequence quality PASS WARNING PASS PASS
Per sequence quality scores FAIL PASS PASS PASS
Per base sequence content PASS PASS WARNING PASS
Per sequence GC content PASS PASS PASS PASS
Per base N content PASS PASS PASS PASS Sequence Length
Distribution PASS PASS PASS PASS Sequence
Duplication Levels WARNING WARNING PASS PASS Overrepresented
sequences PASS PASS PASS PASS Adapter Content PASS FAIL PASS PASS 169Kmer Content PASS PASS WARNING PASS
Data Submission Tools GDC Data Submission Portal
bull The GDC Data Submission Portal is a web-based data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
Upload biospecimen and clinical data and experiment metadata using user friendly web-based tools
Validate data against GDC standard data types defined in the project data dictionary
Obtain information on the status of data submission and processing by project
170
Data Submission Tools GDC Data Transfer Tool
bull The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
Command line interface to specify desired transfer protocol
Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
Supports the secure upload of controlled access data using an authentication key (token)
171
Data Submission Tools GDC Submission API
bull The GDC Submission API is a REST based programmatic interface that supports the programmatic submission of cancer data sets for analysis
Securely submit biospecimen clinical and experiment files using a token
Submit data to GDC by performing create update and retrieve actions on entities
Search for submitted data using GraphQL an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint POST v0submissionltprogramgtltprojectgt 172
GDC Data Submission Submit and Release Data
bull After validation data is submitted to the GDC for processing including data harmonization and high level data generation for applicable data
bull After data processing has been completed the user can release their data to the GDC which must occur within six (6) months of submission per GDC Data Submission Policies
bull Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
173
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
174
GDC Processing Harmonization
bull GDC pipelines supporting the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
175
GDC Processing DNA and RNA Sequence Harmonization Pipelines
176
DNA Sequence Pipeline RNA Sequence Pipeline
GDC Processing GDC High Level Data Generation
bull The GDC generates high level data including DNA-seq derived germline variants and somatic mutations RNA-seq and miRNA-seq derived gene and miRNA quatifications and SNP Array based copy number segmentations
bull The GDC implements multiple pipelines for generating somatic variants such as the Baylor Broad and WashU pipelines
177
GDC Processing GDC Variant Calling Pipelines
178
GDC Data Retrieval
• Once data is submitted to the GDC and the data submitter signs off on the data to request release, the GDC performs quality control (QC) and data processing
• Upon successful GDC QC and processing, the data is released in accordance with the applicable dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationships among projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects by means of unique identifiers
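To make the graph idea concrete, here is a minimal Python sketch of nodes linked by unique identifiers. It is not the GDC's implementation: the `Node` and `Graph` classes, the identifiers, and the three-level project/case/file chain are all hypothetical and only show how a file object can be traced back to its case and project.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str                      # unique identifier
    label: str                        # e.g. "project", "case", "file"
    props: dict = field(default_factory=dict)


class Graph:
    def __init__(self):
        self.nodes = {}
        self.edges = {}               # child id -> list of parent ids

    def add_node(self, node):
        self.nodes[node.node_id] = node
        self.edges.setdefault(node.node_id, [])

    def link(self, child_id, parent_id):
        # Refuse edges to unknown nodes so every link resolves by identifier.
        if child_id not in self.nodes or parent_id not in self.nodes:
            raise KeyError("both endpoints must exist before linking")
        self.edges[child_id].append(parent_id)

    def ancestors(self, node_id):
        """Walk edges upward breadth-first; return parent ids, nearest first."""
        found, queue = [], list(self.edges[node_id])
        while queue:
            current = queue.pop(0)
            if current not in found:
                found.append(current)
                queue.extend(self.edges[current])
        return found


# Hypothetical identifiers, for illustration only.
graph = Graph()
graph.add_node(Node("P1", "project"))
graph.add_node(Node("C1", "case"))
graph.add_node(Node("F1", "file", {"file_name": "reads.bam"}))
graph.link("C1", "P1")
graph.link("F1", "C1")

print([graph.nodes[n].label for n in graph.ancestors("F1")])
```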
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  - Data browsing by project, file, case, or annotation
  - Visualization that lets users perform fine-grained filtering of search results
  - Data search using advanced smart search technology
  - Data selection into a personalized cart
  - Data download from the cart or with the high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  - Command line interface for specifying the desired transfer protocol and multiple files for download
  - Uses a manifest file generated from the GDC Data Portal for multiple downloads
  - Supports the secure transfer of controlled access data using an authentication key (token)
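A manifest-driven download can be sketched as follows. The manifest text, its column names (`id`, `filename`, `md5`, `size`, `state`), and the `gdc-client download -m ... -t ...` invocation are assumptions for illustration; the helper only parses the manifest and assembles the command, and does not run it.

```python
import csv
import io

# Hypothetical manifest content; the column names are an assumption.
MANIFEST = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "uuid-0001\tsample_a.bam\tmd5-a\t1024\tvalidated\n"
    "uuid-0002\tsample_b.bam\tmd5-b\t2048\tvalidated\n"
)


def read_manifest(text):
    """Parse a tab-separated manifest into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def total_bytes(rows):
    """Sum the declared file sizes so a transfer can be sized in advance."""
    return sum(int(row["size"]) for row in rows)


def download_command(manifest_path, token_path):
    # Assumed command-line form: manifest via -m, token file via -t.
    return ["gdc-client", "download", "-m", manifest_path, "-t", token_path]


rows = read_manifest(MANIFEST)
print(len(rows), "files,", total_bytes(rows), "bytes")
print(" ".join(download_command("manifest.txt", "token.txt")))
```

The token is only needed for controlled access data, so a real wrapper would likely make the `-t` argument optional.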
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  - Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  - Securely retrieve biospecimen, clinical, and molecular files
  - Perform BAM slicing
Example request (the slide labels its parts as API URL, endpoint, URL parameters, and query parameters):
  https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Example response (portion removed for readability):
  "data": { "hits": [
    { "project_id": "TCGA-SKCM", "primary_site": "Skin" },
    { "project_id": "TCGA-PCPG", "primary_site": "Nervous System" },
    { "project_id": "TCGA-LAML", "primary_site": "Blood" },
    { "project_id": "TCGA-CNTL", "primary_site": "Not Applicable" },
    { "project_id": "TCGA-UVM", "primary_site": "Eye" },
    { "project_id": "TARGET-AML", "primary_site": "Blood" },
    { "project_id": "TCGA-SARC", "primary_site": "Mesenchymal" },
    { "project_id": "TCGA-LUSC", "primary_site": "Lung" },
    { "project_id": "TARGET-NBL", "primary_site": "Nervous System" },
    { "project_id": "TCGA-PAAD", "primary_site": "Pancreas" }
  ] }
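The request on this slide can be assembled and its response read with the standard library alone. In this sketch the base URL, endpoint, and field names come from the slide; the helper names and the two-hit sample response are illustrative, and no network call is made.

```python
import json
from urllib.parse import urlencode

# API URL as shown on the slide.
API_URL = "https://gdc-api.nci.nih.gov"


def build_query_url(endpoint, fields, pretty=True):
    """Join the API URL, endpoint, and query parameters into one request URL."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # Keep commas literal so the URL matches the form shown on the slide.
    return f"{API_URL}/{endpoint}?{urlencode(params, safe=',')}"


# Shortened sample response using two of the hits listed on the slide.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}
"""


def sites_by_project(response_text):
    """Map each project_id in the response to its primary_site."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


url = build_query_url("projects", ["project_id", "primary_site"])
print(url)
print(sites_by_project(SAMPLE_RESPONSE))
```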
GDC Web Site
• The GDC Web Site provides access to GDC resources that support and communicate the GDC mission
  - Targets data consumers, providers, developers, and general users
  - Provides access to information about the GDC and contributed cancer genomic data sets
  - Instructs users on the use of GDC data access and submission tools
  - Provides descriptions of GDC bioinformatics pipelines
  - Documents supported GDC data types and file formats
  - Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to the GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site: https://gdc.nci.nih.gov
• GDC Documentation Site: https://gdc-docs.nci.nih.gov
• GDC Data Portal: https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal: https://gdc-portal.nci.nih.gov/submission and https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool: https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API): https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api and https://gdc-api.nci.nih.gov
• GDC Support: GDC Assistance, support@nci-gdc.datacommons.io; GDC User List Serv, GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data
  - Aggregation
  - Access
  - Sharing
  - Analysis
Whole genome sequence + Phenotype
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Figure: data flow diagram. Labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF; cancer BAM/VCF; dbGaP; NCI GDC; birth defect VCF; cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users]
[Figure: the GMKF Data Resource shown with related resources: dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, Center for Mendelian Genomics]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants is associated with total anomalous pulmonary venous return?
• I have a knock-out of the ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
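The last use case, finding patients with a similar constellation of findings, can be illustrated with a simple set comparison. This is not how the Data Resource is stated to work: Jaccard similarity is a stand-in chosen for illustration, and the cohort records, patient identifiers, and threshold are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity: shared terms divided by all distinct terms."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_patients(query, cohort, threshold=0.5):
    """Return (patient_id, score) pairs at or above threshold, best first."""
    scored = [(pid, jaccard(query, terms)) for pid, terms in cohort.items()]
    kept = [(pid, score) for pid, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: -pair[1])


# Hypothetical cohort; phenotype terms are free text here for simplicity.
COHORT = {
    "patient-1": {"congenital facial palsy", "cleft palate", "syndactyly"},
    "patient-2": {"cleft palate", "syndactyly"},
    "patient-3": {"diaphragmatic hernia"},
}

QUERY = {"congenital facial palsy", "cleft palate", "syndactyly"}
matches = similar_patients(QUERY, COHORT)
print([pid for pid, _ in matches])
```

A production system would more plausibly compare standardized phenotype terms and account for how closely related two terms are, rather than requiring exact string matches.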
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
GDC Data Submission: Upload and Validate Data
• Data submitters upload and validate their data in a workspace within the GDC using data submission tools such as the GDC Data Submission Portal, the GDC Data Transfer Tool, or the GDC Application Programming Interface (API)
• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data
Data Types and File Formats (1 of 2)
• The GDC provides a data dictionary and template files for each data type
Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Portion | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Analyte | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Aliquot | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Diagnosis | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Exposure | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Family History | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Treatment | TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data Types and File Formats (2 of 2)
Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
Data Types and File Formats: Biospecimen Data
• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
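Because the same biospecimen records can be supplied as TSV or JSON, converting between the two is mostly a matter of reading rows and serializing them. The sketch below assumes a hypothetical sample template with `type`, `submitter_id`, and `sample_type` columns; the real column set is defined by the GDC templates and data dictionary.

```python
import csv
import io
import json

# Hypothetical TSV content; column names and values are assumptions.
SAMPLE_TSV = (
    "type\tsubmitter_id\tsample_type\n"
    "sample\tsample-001\tPrimary Tumor\n"
    "sample\tsample-002\tBlood Derived Normal\n"
)


def tsv_to_records(text):
    """Read tab-separated rows into a list of dictionaries keyed by header."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def tsv_to_json(text):
    """Serialize the same records as a JSON array."""
    return json.dumps(tsv_to_records(text), indent=2)


records = tsv_to_records(SAMPLE_TSV)
print(tsv_to_json(SAMPLE_TSV))
```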
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  - GDC clinical data elements were reviewed with members of the research community
  - Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
[Figure: a Case shown with Clinical Data (Demographics, Diagnosis, Family History (optional), Exposure (optional), Treatment (optional)), Biospecimen Data, and Experiment Data]
Data Types and File Formats: Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file formats and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on quality control (QC) metrics generated by FastQC
Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
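A QC report like the one above is easiest to act on when it is reduced to the modules that did not pass for each read group. The sketch below copies the rows from the slide that contain a WARNING or FAIL (rows that pass for every read group are omitted for brevity); the `flagged` helper is hypothetical and is not part of FastQC or the GDC.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Rows taken from the slide; only modules with a non-PASS status are listed.
QC_STATUS = {
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged(qc, read_groups, statuses=("FAIL",)):
    """List, per read group, the modules whose status is in `statuses`."""
    result = {rg: [] for rg in read_groups}
    for module, row in qc.items():
        for rg, status in zip(read_groups, row):
            if status in statuses:
                result[rg].append(module)
    return result


failures = flagged(QC_STATUS, READ_GROUPS)
warnings_and_failures = flagged(QC_STATUS, READ_GROUPS, ("WARNING", "FAIL"))
print(failures)
```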
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  - Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  - Validate data against GDC standard data types defined in the project data dictionary
  - Obtain information on the status of data submission and processing by project
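Validation against a data dictionary amounts to checking each record for required fields and permitted values. The following sketch uses a tiny hypothetical dictionary for a `sample` entity; the field names, permitted values, and error messages are assumptions and do not reproduce the GDC Data Dictionary or the portal's validation logic.

```python
# Hypothetical dictionary: required fields and permitted values per type.
DICTIONARY = {
    "sample": {
        "required": ["submitter_id", "sample_type"],
        "enums": {"sample_type": ["Primary Tumor", "Blood Derived Normal"]},
    }
}


def validate(record, dictionary):
    """Return a list of human-readable problems; empty means the record passed."""
    schema = dictionary.get(record.get("type"))
    if schema is None:
        return [f"unknown type: {record.get('type')!r}"]
    errors = []
    for name in schema["required"]:
        if not record.get(name):
            errors.append(f"missing required field: {name}")
    for name, allowed in schema["enums"].items():
        value = record.get(name)
        if value and value not in allowed:
            errors.append(f"invalid value for {name}: {value!r}")
    return errors


good = {"type": "sample", "submitter_id": "sample-001", "sample_type": "Primary Tumor"}
bad = {"type": "sample", "sample_type": "Unknown Tissue"}
print(validate(good, DICTIONARY))
print(validate(bad, DICTIONARY))
```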
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  - Command line interface for specifying the desired transfer protocol
  - Uses a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  - Supports the secure upload of controlled access data using an authentication key (token)
Data Types and File Formats (2 of 2)
Data Type Data Subtype Format Data Dictionary Template
Data File
Experimental Metadata Pathology Report Run Metadata Slide Image
Submitted Unaligned Reads
SRA XML
PDF SRA XML SVS
FASTQ BAM
Experimental Metadata Pathology Report
Submitted Unaligned Reads
TSV JSON
TSV JSON TSV JSON TSV JSON
TSV JSON
Submitted Aligned Reads BAM
Submitted Aligned Reads TSV JSON
Data Bundle Read Group
Slide
Read Group
Slide
TSV JSON
TSV JSON
166
Data Types and File Formats Biospecimen Data
bull Biospecimen data types may include samples aliquots analytes and portions
bull GDC supports the submission of biospecimen data in XML JSON or TSV file format
167
Data Types and File Formats: Clinical Data
• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  - GDC clinical data elements were reviewed with members of the research community
  - Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)
[Diagram: Case, with Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)), Biospecimen Data, and Experiment Data]
Data Types and File Formats: Experiment Data
• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name                RG_1      RG_2      RG_3      RG_4
Total Sequences                502123    123456    658101    788965
Read Length                    75        75        75        75
GC Content (%)                 49        50        48        50
Basic Statistics               PASS      PASS      PASS      PASS
Per base sequence quality      PASS      PASS      PASS      PASS
Per tile sequence quality      PASS      WARNING   PASS      PASS
Per sequence quality scores    FAIL      PASS      PASS      PASS
Per base sequence content      PASS      PASS      WARNING   PASS
Per sequence GC content        PASS      PASS      PASS      PASS
Per base N content             PASS      PASS      PASS      PASS
Sequence Length Distribution   PASS      PASS      PASS      PASS
Sequence Duplication Levels    WARNING   WARNING   PASS      PASS
Overrepresented sequences      PASS      PASS      PASS      PASS
Adapter Content                PASS      FAIL      PASS      PASS
Kmer Content                   PASS      PASS      WARNING   PASS
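A QC report like the one above is easier to act on once it is tallied per read group. The following is a minimal Python sketch, not part of the GDC tooling: the statuses are transcribed by hand from the slide's table, and the helper names `summarize` and `flagged_modules` are hypothetical.

```python
from collections import Counter

# Read groups and per-module statuses transcribed from the slide's QC table.
# Each list is ordered RG_1, RG_2, RG_3, RG_4.
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]
MODULE_STATUS = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def summarize(read_groups, module_status):
    """Count PASS/WARNING/FAIL per read group across all QC modules."""
    summary = {rg: Counter() for rg in read_groups}
    for statuses in module_status.values():
        for rg, status in zip(read_groups, statuses):
            summary[rg][status] += 1
    return summary


def flagged_modules(read_groups, module_status, level="FAIL"):
    """List (read_group, module) pairs whose status equals `level`."""
    return [
        (rg, module)
        for module, statuses in module_status.items()
        for rg, status in zip(read_groups, statuses)
        if status == level
    ]


if __name__ == "__main__":
    for rg, counts in summarize(READ_GROUPS, MODULE_STATUS).items():
        print(rg, dict(counts))
    print(flagged_modules(READ_GROUPS, MODULE_STATUS))
```

Keeping the table as a plain dictionary keyed by module name makes it straightforward to swap in statuses parsed from real QC output later.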
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  - Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  - Validate data against GDC standard data types defined in the project data dictionary
  - Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  - Command line interface to specify desired transfer protocol
  - Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  - Supports the secure upload of controlled access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  - Securely submit biospecimen, clinical, and experiment files using a token
  - Submit data to GDC by performing create, update, and retrieve actions on entities
  - Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
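To make the create-entities endpoint concrete, here is a rough Python sketch that builds (but does not send) such a request with the standard library. Several details are assumptions rather than facts from the slides: the base URL is taken from the References slide and may since have changed, the `X-Auth-Token` header name is assumed, and the helper `build_create_request` and the example entity are hypothetical and not validated against the GDC data dictionary.

```python
import json
import urllib.request

# Base URL as listed on the References slide; treat it as an assumption.
GDC_API_BASE = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build a POST request for /v0/submission/<program>/<project>.

    The request is only constructed here; nothing is sent over the network.
    """
    url = f"{GDC_API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the token
        },
    )


if __name__ == "__main__":
    # Hypothetical entity, for illustration only.
    example_entities = [{"type": "case", "submitter_id": "example-case-1"}]
    request = build_create_request(
        "PROGRAM", "PROJECT", example_entities, "MY-TOKEN"
    )
    print(request.get_method(), request.full_url)
```

Separating request construction from sending keeps the token handling and URL layout easy to inspect before any controlled-access data leaves the machine.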
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Processing: Harmonization
• GDC pipelines supporting the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]
GDC Processing: GDC High Level Data Generation
• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Retrieval
• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
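The idea of a graph that links projects, cases, clinical data, experiment data, and files through unique identifiers can be illustrated with a small Python sketch. This is an assumption-laden toy, not the GDC's actual schema or storage engine: the `Node` and `Graph` classes, the node labels, and all identifiers are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str  # unique identifier used to link records
    label: str    # e.g. "project", "case", "clinical", "experiment", "file"
    properties: dict = field(default_factory=dict)


class Graph:
    """Toy directed graph: nodes keyed by unique id, edges parent -> children."""

    def __init__(self):
        self.nodes = {}
        self.edges = defaultdict(set)

    def add_node(self, node_id, label, **properties):
        if node_id in self.nodes:
            raise ValueError(f"duplicate identifier: {node_id}")
        self.nodes[node_id] = Node(node_id, label, properties)

    def add_edge(self, parent_id, child_id):
        for node_id in (parent_id, child_id):
            if node_id not in self.nodes:
                raise KeyError(f"unknown identifier: {node_id}")
        self.edges[parent_id].add(child_id)

    def descendants(self, start_id, label=None):
        """Return sorted ids reachable from start_id, optionally by label."""
        seen, stack = set(), [start_id]
        while stack:
            current = stack.pop()
            for child in self.edges.get(current, ()):
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        return sorted(
            n for n in seen if label is None or self.nodes[n].label == label
        )


def build_example():
    """Build a hypothetical project with two cases and three data files."""
    graph = Graph()
    graph.add_node("project-1", "project")
    for case, clinical, experiment, files in [
        ("case-1", "clinical-1", "experiment-1", ["file-1", "file-2"]),
        ("case-2", "clinical-2", "experiment-2", ["file-3"]),
    ]:
        graph.add_node(case, "case")
        graph.add_node(clinical, "clinical")
        graph.add_node(experiment, "experiment")
        graph.add_edge("project-1", case)
        graph.add_edge(case, clinical)
        graph.add_edge(case, experiment)
        for file_id in files:
            graph.add_node(file_id, "file")
            graph.add_edge(experiment, file_id)
    return graph


if __name__ == "__main__":
    example = build_example()
    print(example.descendants("case-1", label="file"))
```

Because every edge is checked against known identifiers, a file can never be attached to a case or experiment that does not exist, which is the kind of linkage guarantee the slide describes.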
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  - Data browsing by project, file, case, or annotation
  - Visualization allowing users to perform fine-grained filtering of search results
  - Data search using advanced smart search technology
  - Data selection into a personalized cart
  - Data download from cart or a high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  - Command line interface to specify desired transfer protocol and multiple files for download
  - Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  - Supports the secure transfer of controlled access data using an authentication key (token)
Data Retrieval Tools: GDC API
• The GDC API is a REST based programmatic interface that allows users to search and download cancer data sets for analysis
  - Search for projects, files, cases, annotations and retrieve associated details in JSON format
  - Securely retrieve biospecimen, clinical, and molecular files
  - Perform BAM slicing
Example query: https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
URL parts labeled on the slide: API URL, Endpoint, URL parameters, Query parameters
Example response:
  {
    "data": {
      "hits": [
        {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
        {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
        {"project_id": "TCGA-LAML", "primary_site": "Blood"},
        {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
        {"project_id": "TCGA-UVM", "primary_site": "Eye"},
        {"project_id": "TARGET-AML", "primary_site": "Blood"},
        {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
        {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
        {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
        {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
      ]
    }
  }
  (Portion removed for readability)
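The example query and response above can be reproduced offline with a short Python sketch that assembles the URL and reads the `hits` list. The base URL comes from the slide and may no longer be current; the helper names `build_query_url` and `sites_by_project` are hypothetical, and the sample response is a shortened copy of the slide's output rather than live data.

```python
import json
from urllib.parse import urlencode

# Base URL as shown on the slide; treat it as an assumption.
GDC_API_BASE = "https://gdc-api.nci.nih.gov"


def build_query_url(endpoint, fields, pretty=True):
    """Assemble <base>/<endpoint>?fields=a,b&pretty=true."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # safe="," keeps the comma-separated field list readable in the URL.
    return f"{GDC_API_BASE}/{endpoint}?{urlencode(params, safe=',')}"


def sites_by_project(response_text):
    """Map project_id to primary_site from a projects response body."""
    payload = json.loads(response_text)
    return {
        hit["project_id"]: hit["primary_site"]
        for hit in payload["data"]["hits"]
    }


# Shortened copy of the response shown on the slide.
SAMPLE_RESPONSE = """
{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"}
    ]
  }
}
"""

if __name__ == "__main__":
    print(build_query_url("projects", ["project_id", "primary_site"]))
    print(sites_by_project(SAMPLE_RESPONSE))
```

Requesting only the fields that are needed, as the slide's query does, keeps responses small when browsing many projects.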
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  - Targets data consumers, providers, developers, and general users
  - Provides access to information about GDC and contributed cancer genomic data sets
  - Instructs users on the use of GDC data access and submission tools
  - Provides descriptions of GDC bioinformatics pipelines
  - Documents supported GDC data types and file formats
  - Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site
  - https://gdc.nci.nih.gov
• GDC Documentation Site
  - https://gdc-docs.nci.nih.gov
• GDC Data Portal
  - https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  - https://gdc-portal.nci.nih.gov/submission
  - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  - https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  - https://gdc-api.nci.nih.gov
• GDC Support
  - GDC Assistance: support@nci-gdc.datacommons.io
  - GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data
  - Aggregation
  - Access
  - Sharing
  - Analysis
Whole genome sequence + Phenotype
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Data flow diagram: Birth defect or childhood cancer cohorts; DNA Sequencing center; Birth defect BAM/VCF and Cancer BAM/VCF; dbGaP and NCI GDC; Birth defect VCF and Cancer VCF; Gabriella Miller Kids First Data Resource (Index of datasets, Phenotype, Variant summaries); Users]
[Diagram: GMKF Data Resource shown with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and Center for Mendelian Genomics]
Data Types and File Formats Experiment Data
bull GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
bull GDC performs and reports on Quality Control (QC) metrics generated by FASTQC
Read Group Name                 RG_1      RG_2      RG_3      RG_4
Total Sequences                 502123    123456    658101    788965
Read Length                     75        75        75        75
GC Content (%)                  49        50        48        50
Basic Statistics                PASS      PASS      PASS      PASS
Per base sequence quality       PASS      PASS      PASS      PASS
Per tile sequence quality       PASS      WARNING   PASS      PASS
Per sequence quality scores     FAIL      PASS      PASS      PASS
Per base sequence content       PASS      PASS      WARNING   PASS
Per sequence GC content         PASS      PASS      PASS      PASS
Per base N content              PASS      PASS      PASS      PASS
Sequence Length Distribution    PASS      PASS      PASS      PASS
Sequence Duplication Levels     WARNING   WARNING   PASS      PASS
Overrepresented sequences       PASS      PASS      PASS      PASS
Adapter Content                 PASS      FAIL      PASS      PASS
Kmer Content                    PASS      PASS      WARNING   PASS
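The QC results above could be screened programmatically to find read groups that need attention. The following is a minimal sketch, assuming the per-module statuses have been transcribed by hand into a dictionary; the `QC_STATUS` structure and the `flag_read_groups` helper are illustrative names and are not part of the GDC or FastQC.

```python
# Statuses transcribed from the slide for the modules that are not all-PASS.
# Column order is RG_1, RG_2, RG_3, RG_4.
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]
QC_STATUS = {
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flag_read_groups(status_table, read_groups, level="FAIL"):
    """Return {read_group: [modules]} for every module reporting `level`."""
    flagged = {}
    for module, statuses in status_table.items():
        for read_group, status in zip(read_groups, statuses):
            if status == level:
                flagged.setdefault(read_group, []).append(module)
    return flagged


failures = flag_read_groups(QC_STATUS, READ_GROUPS, "FAIL")
warnings = flag_read_groups(QC_STATUS, READ_GROUPS, "WARNING")
```

Here `failures` singles out RG_1 and RG_2, while RG_4 appears in neither dictionary because it passes every module.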
Data Submission Tools: GDC Data Submission Portal
• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  - Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  - Validate data against GDC standard data types defined in the project data dictionary
  - Obtain information on the status of data submission and processing by project
Data Submission Tools: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  - Command line interface to specify the desired transfer protocol
  - Uses a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  - Supports the secure upload of controlled-access data using an authentication key (token)
Data Submission Tools: GDC Submission API
• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  - Securely submit biospecimen, clinical, and experiment files using a token
  - Submit data to GDC by performing create, update, and retrieve actions on entities
  - Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint: POST /v0/submission/<program>/<project>
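The Create Entities endpoint can be illustrated with a request that is built but never sent. This is a sketch under stated assumptions: the host is the API address listed on the References slide, the `X-Auth-Token` header name is an assumption to confirm against the GDC documentation, and the program, project, and entity values are hypothetical placeholders.

```python
import json
import urllib.request

API_ROOT = "https://gdc-api.nci.nih.gov"  # API host listed on the References slide


def build_create_request(program, project, entity, token):
    """Build (but do not send) a POST to the Create Entities endpoint."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    return urllib.request.Request(
        url,
        data=json.dumps(entity).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumed header name for the authentication token.
            "X-Auth-Token": token,
        },
    )


# Hypothetical program, project, and entity values, for illustration only.
request = build_create_request(
    "PROGRAM",
    "PROJECT",
    {"type": "case", "submitter_id": "example-case-1"},
    token="contents-of-token-file",
)
# urllib.request.urlopen(request) would send it; deliberately omitted here.
```

Building the request separately from sending it makes it easy to inspect the URL, method, and body before any data leaves the machine.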
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC; release must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access, per the dbGaP authorization policies associated with the data set
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure: two pipeline diagrams, titled "DNA Sequence Pipeline" and "RNA Sequence Pipeline"]
GDC Processing: GDC High Level Data Generation
• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
GDC Data Retrieval
• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects by means of unique identifiers
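The graph idea can be sketched with a toy in-memory structure. Everything here is hypothetical and far simpler than the real GDC data model: the `Graph` class, the node types, the edge labels, and the identifiers are invented for illustration only.

```python
from collections import defaultdict


class Graph:
    """Toy property graph: nodes keyed by unique ID, directed labelled edges."""

    def __init__(self):
        self.nodes = {}                  # node_id -> {"type": ..., plus properties}
        self.edges = defaultdict(list)   # src_id -> [(label, dst_id), ...]

    def add_node(self, node_id, node_type, **props):
        # Identifiers must be unique so that links stay unambiguous.
        if node_id in self.nodes:
            raise ValueError(f"duplicate identifier: {node_id}")
        self.nodes[node_id] = {"type": node_type, **props}

    def add_edge(self, src, label, dst):
        # Refuse edges to unknown nodes so every link resolves to a real entity.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must exist before linking")
        self.edges[src].append((label, dst))

    def descendants(self, start, node_type):
        """Sorted IDs of all nodes of `node_type` reachable from `start`."""
        found, stack, seen = [], [start], {start}
        while stack:
            current = stack.pop()
            for _, dst in self.edges[current]:
                if dst in seen:
                    continue
                seen.add(dst)
                stack.append(dst)
                if self.nodes[dst]["type"] == node_type:
                    found.append(dst)
        return sorted(found)


g = Graph()
g.add_node("proj-1", "project")
g.add_node("case-1", "case")
g.add_node("clin-1", "clinical")
g.add_node("file-1", "file", file_name="sample.bam")
g.add_edge("proj-1", "has_case", "case-1")
g.add_edge("case-1", "has_clinical", "clin-1")
g.add_edge("case-1", "has_file", "file-1")

files_for_project = g.descendants("proj-1", "file")
```

Walking the edges from a project node reaches its file objects through the case node, which is the kind of linkage the slide describes between projects, cases, and data files.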
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  - Data browsing by project, file, case, or annotation
  - Visualization allowing users to perform fine-grained filtering of search results
  - Data search using advanced smart search technology
  - Data selection into a personalized cart
  - Data download from the cart or via the high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  - Command line interface to specify the desired transfer protocol and multiple files for download
  - Uses a manifest file generated from the GDC Data Portal for multiple downloads
  - Supports the secure transfer of controlled-access data using an authentication key (token)
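A manifest-driven download might be prepared as sketched below. The assumptions are stated plainly: the executable name `gdc-client`, its `download` subcommand, and the `-m` and `-t` options should be checked against the Data Transfer Tool documentation, and the manifest columns and sample rows are made-up placeholders rather than real GDC identifiers.

```python
import csv
import io
import shlex


def read_manifest(text):
    """Parse a tab-separated manifest into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def download_command(manifest_path, token_path=None):
    """Build the argument list for a manifest-driven download."""
    cmd = ["gdc-client", "download", "-m", manifest_path]
    if token_path:  # a token is only needed for controlled-access data
        cmd += ["-t", token_path]
    return cmd


# Made-up manifest content; identifiers and checksums are placeholders.
SAMPLE = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "uuid-0001\ta.bam\tabc123\t1024\tlive\n"
    "uuid-0002\tb.bam\tdef456\t2048\tlive\n"
)

rows = read_manifest(SAMPLE)
total_bytes = sum(int(row["size"]) for row in rows)
command_line = shlex.join(download_command("manifest.txt", "token.txt"))
```

Summing the `size` column before starting gives a quick estimate of how much data the transfer will move.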
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  - Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  - Securely retrieve biospecimen, clinical, and molecular files
  - Perform BAM slicing
Example response:
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  (Portion removed for readability)
Request (parts labelled on the slide: API URL, endpoint, URL parameters, query parameters):
  https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
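The request and response on this slide can be reproduced offline. This sketch assembles the same `projects` query string and parses a hand-copied sample containing two of the hits shown; the `projects_url` and `site_by_project` helpers are illustrative names, and no network call is made.

```python
import json
from urllib.parse import urlencode

API_ROOT = "https://gdc-api.nci.nih.gov"


def projects_url(fields, pretty=True):
    """Assemble the /projects query shown on the slide."""
    query = urlencode(
        {"fields": ",".join(fields), "pretty": str(pretty).lower()},
        safe=",",  # keep the comma-separated field list readable
    )
    return f"{API_ROOT}/projects?{query}"


url = projects_url(["project_id", "primary_site"])

# Two of the hits from the slide, wrapped in the data/hits envelope.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}
"""


def site_by_project(response_text):
    """Map each project_id in a response to its primary_site."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


sites = site_by_project(SAMPLE_RESPONSE)
```

Requesting only the fields that are needed keeps each hit small, which is why the example response carries just `project_id` and `primary_site`.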
GDC Web Site
• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  - Targets data consumers, providers, developers, and general users
  - Provides access to information about GDC and contributed cancer genomic data sets
  - Instructs users on the use of GDC data access and submission tools
  - Provides descriptions of GDC bioinformatics pipelines
  - Documents supported GDC data types and file formats
  - Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site: https://gdc.nci.nih.gov
• GDC Documentation Site: https://gdc-docs.nci.nih.gov
• GDC Data Portal: https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  - https://gdc-portal.nci.nih.gov/submission
  - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool: https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  - https://gdc-api.nci.nih.gov
• GDC Support
  - GDC Assistance: support@nci-gdc.datacommons.io
  - GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype):
  - Aggregation
  - Access
  - Sharing
  - Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Submission Tools GDC Submission API
bull The GDC Submission API is a REST based programmatic interface that supports the programmatic submission of cancer data sets for analysis
Securely submit biospecimen clinical and experiment files using a token
Submit data to GDC by performing create update and retrieve actions on entities
Search for submitted data using GraphQL an intuitive and flexible query language that describes data requirements and interactions
Create Entities Endpoint POST v0submissionltprogramgtltprojectgt 172
GDC Data Submission Submit and Release Data
bull After validation data is submitted to the GDC for processing including data harmonization and high level data generation for applicable data
bull After data processing has been completed the user can release their data to the GDC which must occur within six (6) months of submission per GDC Data Submission Policies
bull Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set
173
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
174
GDC Processing Harmonization
bull GDC pipelines supporting the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
175
GDC Processing DNA and RNA Sequence Harmonization Pipelines
176
DNA Sequence Pipeline RNA Sequence Pipeline
GDC Processing GDC High Level Data Generation
bull The GDC generates high level data including DNA-seq derived germline variants and somatic mutations RNA-seq and miRNA-seq derived gene and miRNA quatifications and SNP Array based copy number segmentations
bull The GDC implements multiple pipelines for generating somatic variants such as the Baylor Broad and WashU pipelines
177
GDC Processing GDC Variant Calling Pipelines
178
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
179
GDC Data Retrieval
bull Once data is submitted into GDC and the data submitter signs-off on the data to request data release GDC performs Quality Control (QC) and data processing
bull Upon successful GDC QC and processing the data is released based on the appropriate dbGaP data restrictions
bull Released data is made available for query and download to authorized users via the GDC Data Portal the GDC Data Transfer Tool and the GDC API
bull Data queries are based on the GDC Data Model
180
GDC Data Retrieval GDC Data Model
bull The GDC data model is represented as a graph with nodes and edges and this graph is the store of record for the GDC
bull The GDC data model maintains the critical relationship between projects cases clinical data and experiment data and insures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
181
GDC Data Retrieval GDC Data Portal
bull The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
Data browsing by project file case or annotation
Visualization allowing users to perform fine-grained filtering of search results
Data search using advanced smart search technology
Data selection into a personalized cart
Data download from cart or a high-performance Data Transfer Tool
182
GDC Data Retrieval GDC Data Transfer Tool
bull The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
Command line interface to specify desired transfer protocol and multiple files for download
Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
Supports the secure transfer of controlled access data using an authentication key (token)
183
Data Retrieval Tools GDC API
bull The GDC API is a REST based programmatic interface that allows users to search and download cancer data sets for analysis
Search for projects files cases annotations and retrieve associated details in JSON format
Securely retrieve biospecimen clinical and molecular files
Perform BAM slicing
data hits [
project_id TCGA-SKCMrdquoprimary_site Skinrdquo project_id TCGA-PCPGrdquoprimary_site Nervous Systemrdquo project_id TCGA-LAMLrdquoprimary_site Bloodrdquo project_id TCGA-CNTLrdquoprimary_site Not Applicablerdquo project_id TCGA-UVMrdquoprimary_site Eyerdquo project_id TARGET-AMLrdquoprimary_site Bloodrdquo project_id TCGA-SARCrdquoprimary_site Mesenchymalrdquo project_id TCGA-LUSCrdquoprimary_site Lungrdquo project_id TARGET-NBLrdquoprimary_site Nervous Systemrdquo project_id TCGA-PAADrdquoprimary_site Pancreasrdquo
] Portion remove for readability
API URL Endpoint URL parameters Query parameters
184httpsgdc-apincinihgovprojectsfields=project_idprimary_siteamppretty=true
GDC Web Site
• The GDC Web Site provides access to GDC resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about the GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to the GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site – https://gdc.nci.nih.gov
• GDC Documentation Site – https://gdc-docs.nci.nih.gov
• GDC Data Portal – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data
  – Aggregation
  – Access
  – Sharing
  – Analysis
Whole genome sequence + phenotype
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Figure: data flow diagram. Labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF and cancer BAM/VCF; dbGaP; NCI GDC; birth defect VCF and cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users.]
[Figure: the GMKF Data Resource shown with related resources: dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, Center for Mendelian Genomics.]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants is associated with total anomalous pulmonary venous return?
• I have a knock-out of the ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
GDC Data Submission: Submit and Release Data
• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data
• After data processing is complete, the user can release their data to the GDC; release must occur within six (6) months of submission, per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access, per the dbGaP authorization policies associated with the data set
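The six-month release window can be illustrated with a short date calculation. This is only a sketch: it assumes "six months" means six calendar months from the submission date, clamped to the last day of the target month, which the slide does not specify. The function names are hypothetical, and the GDC Data Submission Policies remain the authority on how the deadline is counted.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Return the date `months` calendar months after `start`.

    The day is clamped to the last day of the target month, so
    31 August plus six months gives 28 (or 29) February.
    """
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def release_deadline(submitted: date) -> date:
    """Latest release date under the assumed six-calendar-month rule."""
    return add_months(submitted, 6)


if __name__ == "__main__":
    print(release_deadline(date(2016, 1, 15)))
```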
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Processing: Harmonization
• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)
GDC Processing: DNA and RNA Sequence Harmonization Pipelines
[Figure: two panels, DNA Sequence Pipeline and RNA Sequence Pipeline.]
GDC Processing: GDC High-Level Data Generation
• The GDC generates high-level data, including DNA-seq-derived germline variants and somatic mutations, RNA-seq- and miRNA-seq-derived gene and miRNA quantifications, and SNP array-based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
GDC Data Retrieval
• Once data is submitted to the GDC and the data submitter signs off on the data to request release, the GDC performs quality control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects by means of unique identifiers
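The idea of a graph whose nodes are linked to data files by unique identifiers can be sketched in a few lines of Python. This is a toy illustration, not the GDC schema: the node kinds (`project`, `case`, `file`), the edge labels (`member_of`, `derived_from`), and the identifiers `case-001` and `file-abc` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Graph:
    """A minimal node-and-edge store keyed by unique identifiers."""
    nodes: Dict[str, dict] = field(default_factory=dict)              # node_id -> properties
    edges: List[Tuple[str, str, str]] = field(default_factory=list)   # (source, label, target)

    def add_node(self, node_id: str, kind: str, **props) -> None:
        self.nodes[node_id] = {"kind": kind, **props}

    def add_edge(self, source: str, label: str, target: str) -> None:
        # Refuse dangling links so every edge joins two known identifiers.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before linking")
        self.edges.append((source, label, target))

    def files_for_project(self, project_id: str) -> List[str]:
        """Follow file -> case -> project links and return the file identifiers."""
        cases = {s for s, label, t in self.edges
                 if label == "member_of" and t == project_id}
        return sorted(s for s, label, t in self.edges
                      if label == "derived_from" and t in cases)


if __name__ == "__main__":
    g = Graph()
    g.add_node("TCGA-SKCM", "project", primary_site="Skin")
    g.add_node("case-001", "case")
    g.add_node("file-abc", "file", data_format="BAM")
    g.add_edge("case-001", "member_of", "TCGA-SKCM")
    g.add_edge("file-abc", "derived_from", "case-001")
    print(g.files_for_project("TCGA-SKCM"))
```

Because every link is stored as a pair of identifiers, a query such as "all files for a project" is a walk over edges rather than a lookup in a single flat table.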
GDC Processing: GDC High-Level Data Generation
• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines
GDC Processing: GDC Variant Calling Pipelines
Agenda
GDC Overview
GDC Data Submission
GDC Data Processing
GDC Data Retrieval
GDC Data Retrieval
• Once data is submitted to the GDC and the data submitter signs off on the data to request release, the GDC performs quality control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model
GDC Data Retrieval: GDC Data Model
• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects by means of unique identifiers
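To make the graph idea concrete, here is a minimal sketch; it is not the GDC's actual schema. Nodes carry a unique identifier and a type, and edges link each case to its project and each data file to its case. The class names, the `member_of` and `data_from` edge labels, and all identifiers are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    node_id: str  # unique identifier (made up for this sketch)
    kind: str     # e.g. "project", "case", "file"


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)  # node_id -> Node
    edges: list = field(default_factory=list)  # (src_id, label, dst_id)

    def add_node(self, node_id, kind):
        self.nodes[node_id] = Node(node_id, kind)

    def add_edge(self, src_id, label, dst_id):
        # Refuse dangling edges so every link resolves to a known node.
        if src_id not in self.nodes or dst_id not in self.nodes:
            raise KeyError("both endpoints must exist before linking")
        self.edges.append((src_id, label, dst_id))

    def sources(self, dst_id, label):
        """Return ids of nodes that have an edge with this label pointing at dst_id."""
        return [src for src, lbl, dst in self.edges if dst == dst_id and lbl == label]

    def files_for_project(self, project_id):
        """Walk file -> case -> project links backwards to collect file ids."""
        files = []
        for case_id in self.sources(project_id, "member_of"):
            files.extend(self.sources(case_id, "data_from"))
        return files


g = Graph()
g.add_node("proj-1", "project")
g.add_node("case-1", "case")
g.add_node("file-1", "file")
g.add_node("file-2", "file")
g.add_edge("case-1", "member_of", "proj-1")
g.add_edge("file-1", "data_from", "case-1")
g.add_edge("file-2", "data_from", "case-1")
project_files = g.files_for_project("proj-1")
```

Because every edge is checked against the node table, a file can only be attached to a case that already exists, which is one simple way to keep the links between files and cases consistent.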
GDC Data Retrieval: GDC Data Portal
• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
- Data browsing by project, file, case, or annotation
- Visualization allowing users to perform fine-grained filtering of search results
- Data search using advanced smart search technology
- Data selection into a personalized cart
- Data download from the cart or via the high-performance Data Transfer Tool
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
- Command-line interface to specify the desired transfer protocol and multiple files for download
- Use of a manifest file generated from the GDC Data Portal for multiple downloads
- Secure transfer of controlled-access data using an authentication key (token)
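A manifest-driven download might be prepared as sketched below. The sketch assumes the manifest is a tab-separated file with `id`, `filename`, `md5`, `size`, and `state` columns, and that the client is invoked as `gdc-client download -m <manifest> -t <token file>`; both assumptions should be checked against the current Data Transfer Tool documentation. The manifest rows, file names, and helper functions are made up for illustration.

```python
import csv
import io

# Made-up manifest content; a real manifest is exported from the GDC Data Portal.
MANIFEST_TEXT = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "uuid-0001\tsample_a.bam\tabc123\t1024\treleased\n"
    "uuid-0002\tsample_b.vcf\tdef456\t2048\treleased\n"
)


def read_manifest(text):
    """Parse tab-separated manifest text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def total_bytes(rows):
    """Sum the declared file sizes to estimate the transfer volume."""
    return sum(int(row["size"]) for row in rows)


def download_command(manifest_path, token_path=None):
    """Build the argument list for a manifest download.

    The token argument is added only when a token file is given, as is
    needed for controlled-access data.
    """
    cmd = ["gdc-client", "download", "-m", manifest_path]
    if token_path is not None:
        cmd += ["-t", token_path]
    return cmd


rows = read_manifest(MANIFEST_TEXT)
size = total_bytes(rows)
cmd = download_command("manifest.txt", token_path="token.txt")
```

The command is built as a list rather than a single string so that it can be handed to `subprocess.run` without shell quoting issues.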
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
- Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
- Securely retrieve biospecimen, clinical, and molecular files
- Perform BAM slicing
Example request (API URL, endpoint, and query parameters):
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
Example response (portion removed for readability):
"data": { "hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"},
  {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
  {"project_id": "TCGA-UVM", "primary_site": "Eye"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"},
  {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
  {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
  {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
  {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
]
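As a rough illustration of the request on this slide, the following Python sketch builds the same `projects` URL and pulls the two requested fields out of a response with the shape shown. The host is the one given on the slide and may have changed since; the helper names `build_projects_url` and `extract_sites` and the inline sample payload are assumptions for illustration, and no network call is made.

```python
import json
from urllib.parse import urlencode

# Base URL as given on the slide; the live service may have moved since.
API_BASE = "https://gdc-api.nci.nih.gov"


def build_projects_url(fields, pretty=True):
    """Build a 'projects' endpoint URL requesting the given fields (hypothetical helper)."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # safe="," keeps the comma-separated field list unescaped, as on the slide.
    return f"{API_BASE}/projects?{urlencode(params, safe=',')}"


def extract_sites(response_text):
    """Map project_id -> primary_site from a response shaped like the slide's example."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


# Sample payload copied from the first two hits on the slide (not fetched live).
SAMPLE = json.dumps({
    "data": {
        "hits": [
            {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
            {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
        ]
    }
})

url = build_projects_url(["project_id", "primary_site"])
sites = extract_sites(SAMPLE)
```

In practice the URL would be passed to an HTTP client and the response body to `extract_sites`; the parsing step is kept separate so it can be exercised without network access.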
GDC Web Site
• The GDC Web Site provides access to GDC resources that support and communicate the GDC mission
- Targets data consumers, providers, developers, and general users
- Provides access to information about the GDC and contributed cancer genomic data sets
- Instructs users on the use of GDC data access and submission tools
- Provides descriptions of GDC bioinformatics pipelines
- Documents supported GDC data types and file formats
- Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to the GDC User's Guides and the GDC Data Dictionary
References
Note: Requires access to the University of Chicago Virtual Private Network
• GDC Web Site
- https://gdc.nci.nih.gov
• GDC Documentation Site
- https://gdc-docs.nci.nih.gov
• GDC Data Portal
- https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
- https://gdc-portal.nci.nih.gov/submission
- https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
- https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
- https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
- https://gdc-api.nci.nih.gov
• GDC Support
- GDC Assistance: support@nci-gdc.datacommons.io
- GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype):
- Aggregation
- Access
- Sharing
- Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
bull Goal accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
bull By enabling data ndash Aggregation ndash Access
Whole genome sequence + Phenotype ndash Sharing ndash Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
bull Web-based public facing platform bull House organize index and display data and
analytic tools
Data Coordinating
Center
bull Facilitate deposition of sequence and phenotype data into relevant repositories
bull Harmonize phenotypes
Administrative and Outreach
Core
bull Develop policies and procedures bull Facilitate meetings and communication bull Educate and seek feedback from users
Birth defect or childhood cancer
DNA Sequencing center
cohorts
Birth defect BAMVCF Cancer BAMVCF
dbGaP NCI GDC
Birth defect VCF Cancer VCF
Gabriella Miller Kids First Data
Resource Birth defect BAMVCF
Index of datasets Phenotype Variant summaries
Cancer BAMVCF
Users
GMKF Data
Resource
dbGaP
NCI GDC
TOPMed
The Monarch Initiative
ClinGen
Matchmaker Exchange
Center for Mendelian Genomics
Functionality bull Where can I find a group of patients with
diaphragmatic hernia bull What range of variants are associated with total
anomalous pulmonary venous return bull I have a knock-out of ABC gene in the mouse What
human phenotypes are associated with variants in ABC
bull What is the frequency of de novo variants in patients with Ewing sarcoma
bull I have a patient with congenital facial palsy cleftpalate and syndactyly Do similar patients exist and ifso what genetic variants have been discovered in association with this constellation of findings
Questions
bull Do we have the right model for data managementsharing for the childhood cancer and birth defects communities
bull Have we included all of the right elements in the proposal
bull Do we have the right use cases bull How do we maximize use of the Data Resource bull Have we targeted the rightall of the audiences
GDC Data Retrieval GDC Data Portal
bull The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
Data browsing by project file case or annotation
Visualization allowing users to perform fine-grained filtering of search results
Data search using advanced smart search technology
Data selection into a personalized cart
Data download from cart or a high-performance Data Transfer Tool
182
GDC Data Retrieval: GDC Data Transfer Tool
• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks.
  - Command-line interface to specify the desired transfer protocol and multiple files for download
  - Uses a manifest file generated from the GDC Data Portal for multiple downloads
  - Supports the secure transfer of controlled-access data using an authentication key (token)
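The manifest-plus-token workflow described on this slide can be sketched in Python. This is a minimal illustration, not the tool's own code: it assumes the manifest is a tab-separated file with `id`, `filename`, `md5`, `size`, and `state` columns, and that the command-line client is invoked as `gdc-client download -m <manifest> -t <token>`. The sample rows, file names, and helper functions are hypothetical, and nothing is actually downloaded.

```python
import csv
import io

# Hypothetical manifest content; real manifests are generated by the GDC Data Portal.
# The column layout (id, filename, md5, size, state) is an assumption of this sketch.
SAMPLE_MANIFEST = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "00000000-0000-0000-0000-000000000001\tsample_a.vcf.gz\tabc123\t1024\tlive\n"
    "00000000-0000-0000-0000-000000000002\tsample_b.bam\tdef456\t4096\tlive\n"
)


def read_manifest(text):
    """Parse tab-separated manifest text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def total_size(rows):
    """Sum the declared file sizes (in bytes) so a transfer can be planned."""
    return sum(int(row["size"]) for row in rows)


def download_command(manifest_path, token_path=None):
    """Build (but do not run) a transfer command for the files in a manifest.

    The token file is only needed for controlled-access data.
    """
    cmd = ["gdc-client", "download", "-m", manifest_path]
    if token_path:
        cmd += ["-t", token_path]
    return cmd


rows = read_manifest(SAMPLE_MANIFEST)
print(len(rows), "files,", total_size(rows), "bytes")
print(" ".join(download_command("gdc_manifest.txt", "gdc_token.txt")))
```

Building the command as a list keeps it safe to pass to `subprocess.run` later without shell quoting issues; whether to run it is left to the reader.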
Data Retrieval Tools: GDC API
• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis.
  - Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  - Securely retrieve biospecimen, clinical, and molecular files
  - Perform BAM slicing

Example request (parts: API URL, endpoint, URL parameters, query parameters):
  https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):
  {
    "data": {
      "hits": [
        {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
        {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
        {"project_id": "TCGA-LAML", "primary_site": "Blood"},
        {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
        {"project_id": "TCGA-UVM", "primary_site": "Eye"},
        {"project_id": "TARGET-AML", "primary_site": "Blood"},
        {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
        {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
        {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
        {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
      ]
    }
  }
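The request and response shown on this slide can be reproduced offline with a short Python sketch. It assembles the query URL from its parts and extracts the project-to-site mapping from a response of the same shape. The host is the one shown on the slide and may have changed since; the helper names and the two-hit sample are assumptions of this sketch, and no network call is made.

```python
import json
from urllib.parse import urlencode

# Host as shown on the slide; treat it as an assumption that may be out of date.
API_URL = "https://gdc-api.nci.nih.gov"


def build_query(endpoint, fields, pretty=True):
    """Assemble API URL + endpoint + query parameters into one request URL."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # Keep commas unescaped so the URL matches the form used on the slide.
    return f"{API_URL}/{endpoint}?{urlencode(params, safe=',')}"


def primary_sites(response_text):
    """Map each project_id to its primary_site from a response like the slide's."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


# Two hits copied from the slide's example response.
SAMPLE_RESPONSE = json.dumps({
    "data": {
        "hits": [
            {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
            {"project_id": "TARGET-AML", "primary_site": "Blood"},
        ]
    }
})

print(build_query("projects", ["project_id", "primary_site"]))
print(primary_sites(SAMPLE_RESPONSE))
```

To query a live service, the URL returned by `build_query` could be passed to any HTTP client; that step is omitted here because the endpoint's current location is not confirmed by the slides.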
GDC Web Site
• The GDC Web Site provides access to GDC resources that support and communicate the GDC mission.
  - Targets data consumers, providers, developers, and general users
  - Provides access to information about the GDC and contributed cancer genomic data sets
  - Instructs users on the use of GDC data access and submission tools
  - Provides descriptions of GDC bioinformatics pipelines
  - Documents supported GDC data types and file formats
  - Provides access to GDC support resources
GDC Documentation Site
• The GDC Documentation Site provides access to the GDC User's Guides and the GDC Data Dictionary.
References
Note: Requires access to the University of Chicago Virtual Private Network.
• GDC Web Site: https://gdc.nci.nih.gov
• GDC Documentation Site: https://gdc-docs.nci.nih.gov
• GDC Data Portal: https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal:
  - https://gdc-portal.nci.nih.gov/submission
  - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool: https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API):
  - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  - https://gdc-api.nci.nih.gov
• GDC Support:
  - GDC Assistance: support@nci-gdc.datacommons.io
  - GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV
Gabriella Miller Kids First Pediatric Data Resource Center
• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling, for whole genome sequence + phenotype data:
  - Aggregation
  - Access
  - Sharing
  - Analysis
Gabriella Miller Kids First Pediatric Data Resource Center
Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools
Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes
Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users
[Figure: data-flow diagram for the Gabriella Miller Kids First Data Resource. Labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF (dbGaP); cancer BAM/VCF (NCI GDC); birth defect VCF and cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users.]
[Figure: the GMKF Data Resource shown with related resources: dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics.]
Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist and, if so, what genetic variants have been discovered in association with this constellation of findings?
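Two of the use cases above (finding patients with a given phenotype, and finding patients similar to a given constellation of findings) can be illustrated with a toy in-memory index. This is only a sketch under loud assumptions: the patient records, the use of free-text phenotype labels, the Jaccard overlap score, and the 0.5 threshold are all hypothetical choices made here, not features described in the slides.

```python
# Hypothetical patient records: identifier -> set of phenotype labels.
PATIENTS = {
    "P1": {"congenital facial palsy", "cleft palate", "syndactyly"},
    "P2": {"cleft palate", "syndactyly"},
    "P3": {"diaphragmatic hernia"},
}


def jaccard(a, b):
    """Overlap between two phenotype sets: |intersection| / |union|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def patients_with(term, patients=PATIENTS):
    """Return the identifiers of patients annotated with a single phenotype."""
    return sorted(pid for pid, terms in patients.items() if term in terms)


def similar_patients(query, patients=PATIENTS, threshold=0.5):
    """Rank patients whose phenotype overlap with the query meets the threshold."""
    scored = [(pid, jaccard(query, terms)) for pid, terms in patients.items()]
    kept = [(pid, score) for pid, score in scored if score >= threshold]
    # Highest score first; ties broken by identifier for a stable order.
    return sorted(kept, key=lambda item: (-item[1], item[0]))


query = {"congenital facial palsy", "cleft palate", "syndactyly"}
print(patients_with("diaphragmatic hernia"))
print(similar_patients(query))
```

A real resource would more plausibly match on harmonized ontology terms than on raw strings, which is one reason phenotype harmonization appears among the Data Coordinating Center's tasks.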
Questions
• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?