GA4GH Schemas Documentation

Release 0.0.1

Jeltje van Baren

Sep 07, 2017

Contents

1 Contents
    1.1 Introduction
    1.2 Stack
    1.3 Science
    1.4 Data Sources
    1.5 API Library
    1.6 App Library
    1.7 Schemas
    1.8 Development
    1.9 Changelog


The OmicMD platform aggregates and interprets genomic data in order to save people’s lives and improve quality of life through more precise, patient-driven medicine. Because both the science and technology are complex, we need to be diligent about documenting our understanding so all information and data are easily discovered, shared, and replicated.


http://www.omicmd.com



CHAPTER 1

Contents

Introduction

This is the core product documentation for OmicMD. Here you will find the information you need to get up to speed on the architecture, product features, and current code, as well as where you will document code you write in a user-friendly way.

If you need background

For more details on web APIs, see this Wikipedia page. REST protocols for data transfer are described here. Wikipedia also has good overviews of bioinformatics and DNA sequencing.

General Overview

The product contains an organized collection of research and medical databases, hosted in a secure, HIPAA-compliant location. The key data are integrated and exposed through services via an API-led architecture including system, process, and experience APIs. As such, we are not constructing a data lake but rather a highly available distributed network of data sources and applications, including analysis “tracks” layered on for things like machine learning predictions of disease risk or drug efficacy.

Definitions

• rsID:

• SNP: A single-nucleotide polymorphism, often abbreviated to SNP, is a variation in a single nucleotide that occurs at a specific position in the genome, where each variation is present to some appreciable degree within a population (e.g. > 1%).


https://en.wikipedia.org/wiki/Web_API
https://en.wikipedia.org/wiki/Representational_state_transfer
https://en.wikipedia.org/wiki/Bioinformatics
https://en.wikipedia.org/wiki/DNA_sequencing


Stack

This section covers the technology stack.

Infrastructure

Describes our AWS-based infrastructure and DevOps. Our infrastructure is primarily AWS-based, combined with best-in-class tools where they are substantially better than AWS tooling or where AWS has gaps. For example, we might use Kubernetes for Docker container orchestration as opposed to AWS EC2 Container Service.

AWS Services

We use the following AWS services:

• EC2: Compute

• OS: Amazon EC2 Linux

• RDS: Storage

• SQL Database: PostgreSQL (default structured), MySQL (when required)

• NoSQL Database: MongoDB (TBD)

Middleware

Middleware tools, architecture, guidelines.

Structure

We use a three-tier classification structure for APIs.

• System - APIs that directly access databases and organize the data into functional domains (e.g. the “Person” data element)

• Process - APIs that deal with processes and orchestration (e.g. the “Order Test” process)

• Experience - APIs that combine Process and/or System APIs to expose data directly to an app developer (e.g. a web portal experience API that abstracts a FHIR system API and an order test process API with a light-weight RESTful interface). The Experience API can be used to create a front end experience without understanding complex system domains or business logic encapsulated within process APIs.
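The layering above can be sketched in a few lines of Python. This is only an illustration of how the three tiers depend on each other: the function names, the in-memory "database", and the test code `PGX-PANEL` are all hypothetical and do not correspond to any existing OmicMD service.

```python
# Hypothetical in-memory stand-ins for the databases behind the System tier.
_PERSONS = {"p1": {"id": "p1", "name": "Ada"}}
_ORDERS = []


# System tier: direct data access, organized by functional domain ("Person").
def system_get_person(person_id):
    return _PERSONS[person_id]  # raises KeyError for unknown persons


def system_create_order(person_id, test_code):
    order = {"order_id": len(_ORDERS) + 1, "person_id": person_id, "test_code": test_code}
    _ORDERS.append(order)
    return order


# Process tier: orchestration ("Order Test"), built only on System calls.
def process_order_test(person_id, test_code):
    person = system_get_person(person_id)
    return system_create_order(person["id"], test_code)


# Experience tier: a light-weight shape for an app developer, hiding the
# System and Process details behind a single call.
def experience_portal_order(person_id, test_code):
    order = process_order_test(person_id, test_code)
    person = system_get_person(person_id)
    return {"patient": person["name"], "test": order["test_code"], "status": "ordered"}
```

A front end would call only `experience_portal_order`; it never needs to know how persons or orders are stored.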



Example

This is an example published by MuleSoft of a three-tier architecture for health care. In it, data are abstracted from complex EHR systems like Epic into a canonical model that is represented via a set of FHIR REST APIs. Then, experience APIs are layered on top to provide better experiences for both patients and clinicians.

1. System Layer

System APIs abstract away the complexity of EHRs and other core systems of record from the data’s end user, while providing downstream insulation from any interface changes or rationalization of those systems.

Assets Included:

• CRM FHIR System API | Salesforce Implementation Template

• CRM FHIR System API | RAML Definition

• EHR FHIR System API | EHR Implementation Template*

• EHR FHIR System API | RAML Definition

• Fitness FHIR System API | Fitbit Implementation Template

• Fitness FHIR System API | RAML Definition

2. Process Layer

Process APIs decouple business processes that interact with and shape data from the source systems where the data originated. For example, the “schedule appointment” process contains logic that is common across multiple entities, which can be called by product, geography, or channel-specific parent services.

Assets Included:

• Onboarding Process API | Implementation Template

• Appointments Process API | Implementation Template

• Appointments Process API | RAML Specification



• EHR to CRM Sync Process API | Implementation Template

• Fitness Data Sync Process API | Implementation Template

• HL7 Event Handler | Implementation Template (Coming soon)

3. Experience Layer

Experience APIs are the means by which data can be reconfigured so that it is most easily consumed by its intended audience, all from a common data source, rather than setting up separate point-to-point integrations.

Assets included:

• Web Portal Experience API | Implementation Template

• Web Portal Experience API | RAML Specification

• Salesforce Experience API | Implementation Template

Supporting Assets:

• HL7 Connector

• 12 FHIR APIs | RAML Specification

Applications and Data

Frameworks, languages, databases, and other supporting applications.

Technologies

1. Languages

• App dev: Java, JS

2. Databases

• Databases: PostgreSQL (primary), MySQL (secondary), Glacier (patient genomics, other static data @~$0.05/gb/yr)

• Database as a Service: Amazon RDS

• Cloud Storage: Amazon EBS/Aurora? (TBD)

• NoSQL: MongoDB

• Managed Memcache: Amazon Elasticache

• Protocols: Blockchain for patient genomics? TBD

3. Database Tools

• Database Tools: Airpal? (TBD)

• Big Data Tools: Druid, Presto (TBD)???

4. Libraries

• Web App Dev: Node.js, React.js

Utilities

Tools we use to support business operations.



Technologies

• Content Delivery Network: Amazon CloudFront

• DNS Management: Amazon Route 53

• Payment Services: TBD

• Prototyping: TBD

• A/B Testing: TBD

• Load Balancer: NGINX? (TBD)

DevOps


Technologies

• VM Manager: Vagrant? (TBD)

• Build Tools: Webpack for JS? (TBD)

• Testing: Mocha for JS (TBD), Enzyme for JS (TBD), Selenium? (TBD)

• Monitoring: Kibana (TBD)

• Server Configuration: Chef (TBD)

• Log Management: Logstash/Splunk (TBD)

• Cluster Management: Apache Mesos (TBD)

• Performance Monitoring: Scout (TBD)

• Continuous Integration: Solano CI (TBD)

• Exception Monitoring: Sentry (TBD)

• Performance Monitoring: Datadog (TBD), New Relic (TBD)

• Code Collaboration: GitHub (use gitlab??) TBD

Business Tools


Vendors

Intro

We use the latest, proven SaaS-based technologies and services to power our business.



Details

• Mail: Gmail

• Calendar: Google Calendar

• Office: Google Docs

• Storage and Knowledge Management: Google Drive

• Productivity Suite: Google Apps

• Group Chat: Slack

• Project Management: Asana

• CRM: Salesforce

• ERP: TBD

• HR: TBD

• Finance: TBD

• Product Mgmt: Jira/Rally TBD

• Roadmaps: Aha! https://omic.aha.io

• Documentation: Read The Docs

• Bug Management: Bugzilla? TBD

• Support Ticketing: Freshdesk

• Business Tool Integration: IF/Zapier depending on pre-built integrations

• Email Marketing: ActiveCampaign

• Business Intelligence: Superset? (TBD)

• Visual Collaboration: Assemblage? (TBD)

• Interactive Mockups: Marvel (primary), Invision (secondary)

Science

This section covers some of the science and clinical operations used in the product. The term “omics” aims at the collective characterization and quantification of pools of biological molecules that translate into the structure, function, and dynamics of an organism or organisms.

Medicine

Not intended to be a tome of medical information, this is only meant to cover the relevant clinical practices and guidelines related to genomic medicine software.

Preventive Care

• Genetic risks

• Population health outreach




• When to test

Diagnostics

• Test and lab recommendations

• Genetic test findings

Treatment

• First medication decisions

• Medication reconciliation

• Dosage and drug interactions

Followup Management

• Adverse event

• Corollary orders

Efficiency & Costs

• Care plans

• Order sets

Patient Experience

• Duplicate testing alerts

Machine and Deep Learning

Deep Learning

Can start with something similar to SPIDEX - predicting phenotype from splicing

https://www.deepgenomics.com/spidex-noncommercial-download

Genomics

Genetics is the study of single genes and their role in the way traits or conditions are passed from one generation to the next. Genomics is a term that describes the study of all parts of an organism’s genes. We typically refer to genomics in all our documentation.



Genomics to Clinical Guidelines

Genomics can be connected to clinical practice through the diagram shown below:

The process starts with acquiring patient genomic data. Next Generation Sequencing (NGS) data usually enter the system as BAM files (analyzed, mapped and annotated) with BAI index files, dependent on genome assembly versions. Microarray gene expression data format depends on the manufacturer (e.g. Affy, Illumina) and chip versions. Most other data file formats are flat files (VCF, tab-delimited), often coded.

From these data, clinically relevant alleles (alternative forms of the same gene) and haplotypes (groups of alleles) are identified.

Alleles and haplotypes can quickly become complex, and new discoveries are constantly made. Fortunately, we can make use of star alleles. Star alleles are haplotype patterns that have been defined at the gene level and, in many cases, associated with protein activity levels. Genetic variants within a haplotype can include single nucleotide polymorphisms (SNPs), insertions/deletions (InDels), and copy number variants (CNVs).



Once we have applied the right nomenclature, we can start to correlate the specific genomic characteristics of the individual with phenotypes (diseases, health conditions, or reactions to medications). These correlations are combined with drugs and clinical trials to form clinical guidelines for doctors.

Guidelines are the implications of the gene-phenotype, gene-drug, or drug-drug interaction. For example, “increased risk of morphine formation following codeine administration, leading to higher risk of toxicity.” These are then converted to recommendations, which actually tell the health care provider what to do. For example, “Avoid codeine use due to potential for toxicity.”
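The split between a guideline (the implication) and a recommendation (the action) can be sketched as a small lookup structure. This is an illustration only: the `Guideline` class and `recommend` function are hypothetical, and the lookup key (gene `CYP2D6`, phenotype label `ultrarapid metabolizer`) is an assumption added for the example; only the two quoted sentences come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guideline:
    implication: str     # what the interaction means
    recommendation: str  # what the health care provider should do


# Keyed by (gene, phenotype, drug). The gene and phenotype label are
# illustrative assumptions; the two strings are the examples quoted above.
GUIDELINES = {
    ("CYP2D6", "ultrarapid metabolizer", "codeine"): Guideline(
        implication=(
            "Increased risk of morphine formation following codeine "
            "administration, leading to higher risk of toxicity."
        ),
        recommendation="Avoid codeine use due to potential for toxicity.",
    ),
}


def recommend(gene, phenotype, drug):
    """Return the recommendation text, or None when no guideline is recorded."""
    guideline = GUIDELINES.get((gene, phenotype, drug))
    return guideline.recommendation if guideline else None
```

Keeping the implication and the recommendation as separate fields lets a user interface show the reasoning behind an alert, not just the instruction.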



Reads

Reads are genetic data generated by a DNA sequencing instrument, including nucleotides and quality scores. Reads may optionally be aligned to a reference sequence. (The data model for reads is similar to SAM/BAM.)

Reads API

See Reads schema for a detailed reference.

Reads Data Model

The Reads data model, although based on the SAM format, allows for more versatile interaction with the data. Instead of sending whole chromosome or whole genome files, the server can send information on specific genomic regions.

The model has the following data types:

Record | Description | SAM/BAM rough equivalent
ReadAlignment | One alignment for one read | A single line in a file
ReadGroup | A group of read alignments | A single RG tag
ReadGroupSet | Collection of ReadGroups that map to the same genome | Single SAM/BAM file
Program | Software version and parameters that were used to align reads to the genome | PN, CL tags in SAM header
ReadStats | Counts of aligned and unaligned reads for a ReadGroup or ReadGroupSet | Samtools flagstats on a file

The relationships are mostly one to many (e.g. each ReadAlignment is part of exactly one ReadGroup), with the exception that a ReadGroup is allowed to be part of more than one ReadGroupSet.

Dataset –< ReadGroupSet >–< ReadGroup –< ReadAlignment

• A Dataset is a general-purpose container, defined in metadata.avdl.

• A ReadGroupSet is a logical collection of ReadGroups, as determined by the data owner. Typically one ReadGroupSet represents all the Reads from one experimental sample, which traditionally would be stored in a single BAM file.

• A ReadGroup is all the data that’s processed the same way by the sequencer. There are typically 1-10 ReadGroups in a ReadGroupSet.

• A ReadAlignment object is a flattened representation of several layers of bioinformatics hierarchy, including fragments, reads, and alignments, stored in one object for easy access.
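The containment relationships above can be sketched with plain dataclasses. This is a simplified illustration, not the schema itself: the field names are assumptions chosen for readability, and `count_alignments` is a hypothetical helper. It shows why the many-to-many link between ReadGroupSet and ReadGroup matters when aggregating.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReadAlignment:
    id: str
    read_group_id: str  # each ReadAlignment is part of exactly one ReadGroup
    aligned_sequence: str


@dataclass
class ReadGroup:
    id: str
    alignments: List[ReadAlignment] = field(default_factory=list)


@dataclass
class ReadGroupSet:
    id: str
    read_groups: List[ReadGroup] = field(default_factory=list)


@dataclass
class Dataset:
    id: str
    read_group_sets: List[ReadGroupSet] = field(default_factory=list)


def count_alignments(dataset):
    """Count alignments, visiting each ReadGroup once even if it is shared."""
    per_group = {}
    for read_group_set in dataset.read_group_sets:
        for read_group in read_group_set.read_groups:
            per_group[read_group.id] = len(read_group.alignments)
    return sum(per_group.values())


# One ReadGroup that belongs to two ReadGroupSets.
rg = ReadGroup("rg1", [ReadAlignment("a1", "rg1", "ACGT"), ReadAlignment("a2", "rg1", "TTGA")])
ds = Dataset("ds1", [ReadGroupSet("set1", [rg]), ReadGroupSet("set2", [rg])])
```

Because `rg` appears in both sets, a naive nested sum would report four alignments; keying by ReadGroup id gives the correct count of two.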

ReadAlignment: detailed discussion

One ReadAlignment object represents the following logical hierarchy. See the field definitions in the ReadAlignment object for more details.


https://samtools.github.io/hts-specs/SAMv1.pdf


• A fragment is a single stretch of a DNA molecule. There are typically at least millions of fragments in a ReadGroup. A fragment has a name (QNAME in BAM spec), a length (TLEN in BAM spec), and one or more reads.

• A read is a contiguous sequence of bases. There are typically only one or two reads in a fragment. If there are two reads, they’re known as a mate pair. A read has an array of base values, an array of base qualities, and optional alignment information.

• An alignment is the way alignment software maps a read to a reference. There’s one primary alignment, and there can be one or more secondary alignments. Secondary alignments represent alternate possible mappings.

• A linear alignment maps a string of bases to a reference using a single CIGAR string. There’s one representative alignment, and there can be one or more supplementary alignments. Supplementary alignments represent linear alignments that are subsets of a chimeric alignment.
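To make the role of the CIGAR string concrete, here is a minimal sketch that computes how many reference bases a linear alignment covers. It follows the SAM specification, where the operations M, D, N, = and X consume reference bases; the function name `reference_span` is hypothetical and is not part of the API.

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM spec.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# Operations that advance the position on the reference sequence.
_CONSUMES_REFERENCE = set("MDN=X")


def reference_span(cigar):
    """Return the number of reference bases covered by a CIGAR string."""
    ops = _CIGAR_OP.findall(cigar)
    # Reject strings containing anything other than well-formed operations.
    if "".join(length + op for length, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return sum(int(length) for length, op in ops if op in _CONSUMES_REFERENCE)
```

For example, `reference_span("10M2I5M3D4S")` counts the two match blocks and the deletion (10 + 5 + 3) but not the insertion or the soft clip, giving 18.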

The image below shows which Reads records contain other records (represented by green triangles), and which contain IDs that can be used to get information from other records (arrows). The arrow points from the record that lists the ID to the record that can be identified by that ID. Records are represented by blue rectangles; dotted lines indicate records defined in other schemas.



Variants

Variants are genetic differences between an experimental sample and a reference sequence. (The data model for variants is similar to VCF.)

Variants API

See Variants schema for a detailed reference.

Variants Data Model

The Variants data model, although based on the VCF format, allows for more versatile interaction with the data. Instead of sending whole VCF files, the server can send information on specific variants or genomic regions. And instead of getting the whole genotype matrix, it’s possible to get details for just one or more specified individuals.

The API uses four main entities to represent variants. The following diagram illustrates how these entities relate to each other to constitute the genotype matrix.

The lowest-level entity is a Call:


https://samtools.github.io/hts-specs/VCFv4.2.pdf


• a Call encodes the genotype of an individual with respect to a variant, as determined by some analysis of experimental data.

The other entities can be thought of as collections of Calls that have something in common:

• a VariantSet supports working with a collection of Calls intended to be analyzed together.

• a Variant supports working with the subset of Calls in a VariantSet that are at the same site and are described using the same set of alleles. The Variant entity contains:

– a variant description: a potential difference between experimental DNA and a reference sequence, including the site (position of the difference) and alleles (how the bases differ)

– variant observations: a collection of Calls describing evidence for actual instances of that difference, as seen in analyses of experimental data

• a CallSet supports working with the subset of Calls in a VariantSet that were generated by the same analysis of the same sample. The CallSet includes information about which sample was analyzed and how it was analyzed, and is linked to information about what differences were found.
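One way to picture the genotype matrix is as Calls grouped two ways: by Variant (rows) and by CallSet (columns). The sketch below assumes a simplified `Call` record with hypothetical field names; it is not the schema definition, only an illustration of the grouping.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Call:
    variant_id: str            # which Variant (site + alleles) this Call observes
    call_set_id: str           # which CallSet (sample + analysis) produced it
    genotype: Tuple[int, ...]  # allele indexes; 0 is the reference allele


def genotype_matrix(calls):
    """Arrange Calls as {variant_id: {call_set_id: genotype}}."""
    matrix = defaultdict(dict)
    for call in calls:
        matrix[call.variant_id][call.call_set_id] = call.genotype
    return {variant_id: dict(row) for variant_id, row in matrix.items()}


def calls_for_call_set(calls, call_set_id):
    """Select one column of the matrix: all Calls from a single CallSet."""
    return [call for call in calls if call.call_set_id == call_set_id]


# A VariantSet is represented here simply as the list of Calls analyzed together.
calls = [
    Call("v1", "sampleA", (0, 1)),
    Call("v1", "sampleB", (1, 1)),
    Call("v2", "sampleA", (0, 0)),
]
```

Selecting by `call_set_id` corresponds to asking for the details of one specified individual rather than the whole matrix.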

The following diagram shows the relationship of these four entities to each other and to other GA4GH API entities. It shows which entities contain other entities (such as VariantSetMetadata), and which contain IDs that can be used to get information from other entities (such as Variant’s variantSetId). The arrow points from the entity that contains the ID to the entity that can be identified by that ID.

FIXME: remove the Sample object from the graphic; that object isn’t (yet) defined in the API.



References

References are standard genome sequences, used to provide a coordinate system for reads and variants.

References API

See References schema for a detailed reference.

References Data Model

A genome assembly is a digital representation of a genome. It is typically composed of contigs, each an uninterrupted string representing a DNA sequence, arranged into scaffolds, each of which orders and orients a set of contigs. Scaffolds are typically represented as a string, with runs of wildcard characters (N or n) used to represent interstitial regions of uncertain DNA between contigs.

A reference genome is a genome assembly that other genomes are compared to and described with respect to. For example, sequencing reads are mapped to and described with respect to a reference genome in the API, and genetic variations are described as edits to reference scaffolds/contigs. In the API a reference genome is described by a ReferenceSet. In turn a ReferenceSet is composed of a set of Reference objects, each of which represents a scaffold or contig in the assembly. Reference sequences are expected to have unique names within a ReferenceSet.
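The relationship between a scaffold string and its contigs can be illustrated with a short sketch that splits a scaffold on runs of N or n. The function name is hypothetical, and the 0-based, half-open coordinates are an assumption made for the example.

```python
import re


def contigs_in_scaffold(scaffold):
    """Return (start, end, sequence) for each contig in a scaffold string.

    Coordinates are 0-based and half-open (an assumption for this sketch).
    Runs of N or n are treated as gaps between contigs.
    """
    return [
        (match.start(), match.end(), match.group())
        for match in re.finditer(r"[^Nn]+", scaffold)
    ]
```

For the scaffold `ACGTNNNNGGCCnnT` this yields three contigs: `ACGT` at 0-4, `GGCC` at 8-12, and `T` at 14-15.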

Sequence Annotations

Sequence annotations describe genomic features such as genes and exons, using terms from an established sequence ontology.

Sequence Annotations API

For the Sequence Annotation schema definitions, see Sequence Annotation schema.

The Sequence Annotation Schema consists of ‘Features’ for discontinuous data and ‘Continuous’ for continuous data.

Feature Based Hierarchy

A Feature describes an interval of interest on some reference(s). It has a span from a start position to a stop position as well as descriptive data. A Feature can have a parent Feature, and can have an ordered array of child Features, which enables the construction of more complex representations in a hierarchical way.

For example, a single gene Feature may be parent to several different transcript Features. The specific exons for each transcript would have that transcript Feature as parent. The same physical exon may occur as part of two different transcript Features, but in our notation, it would be encoded as two separate exon Features, each with a different parent, both occupying the same genomic coordinates. This structure can also extend to annotating CDS, binding sites or any other sub-gene level features.

The Feature Sequence Annotation Schema

This model is similar to that used by the standard GFF3 file format.

The main differences concern the deprecation and replacement of discontinuous features, the replacing of multi-parent features with multiple copies of that feature, and the ability to impose an explicit order on child features.


http://sequenceontology.org/resources/gff3.html


In the first case, a CDS composed of multiple regions is sometimes encoded as multiple rows of a GFF3 file, each with the same feature ID. This is translated in our hierarchy into a single CDS Feature with an ordered set of CDS_region Feature children, each corresponding to a single row of the original record.

In the second case, as explained above, features with multiple parents in a GFF3 record are simply replicated and assigned a new identifier as many times as needed to ensure a unique parent for every feature.

In the final case, an explicit mechanism is provided for ordering child Features. Most of the time this ordering is trivially derived from the genomic coordinate ordering of the children, but in some biologically important cases this order can differ, such as in non-canonical splicing of exons into transcripts (also known as back splicing - see below).

A FeatureSet is simply a collection of features from the same source. An implementer may, for example, choose to gather all Features from the same GFF3 file into a common FeatureSet.
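The second case (replicating multi-parent features) can be sketched as follows. The `Feature` class is a simplified stand-in for the schema record, and the `<id>.<n>` identifier scheme is purely hypothetical; the schema only requires that each copy receives a new identifier and a single parent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Feature:
    id: str
    type: str
    start: int
    end: int
    parent_id: Optional[str] = None


def split_multi_parent(feature_id, feature_type, start, end, parent_ids):
    """Return one Feature per parent so that every Feature has a unique parent.

    Features with zero or one parent are returned unchanged. The numbering
    suffix used for the copies is an illustrative choice, not a convention
    defined by the schema.
    """
    if len(parent_ids) <= 1:
        parent = parent_ids[0] if parent_ids else None
        return [Feature(feature_id, feature_type, start, end, parent)]
    return [
        Feature(f"{feature_id}.{n}", feature_type, start, end, parent)
        for n, parent in enumerate(parent_ids, start=1)
    ]
```

An exon shared by transcripts `tx1` and `tx2` becomes two exon Features with identical coordinates and different parents, matching the gene/transcript/exon example given earlier.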

The Continuous Sequence Annotation Schema

‘Continuous’ defines a format for exchanging continuous valued signal data, such as those produced experimentally (e.g. ChIP-Seq data) or through calculations (e.g. conservation scores). ‘Continuous’ represents numerical data in which a real value (or NaN) is associated with each base position. This data is often stored in BigWig, Wiggle or BedGraph formats.

Each Continuous message consists of a start position, defined on a reference, and a list of real values. The first list value applies to the start base position, the second list value to the base position after the start base, and so forth. The list of values can include NaN values to represent unsampled/unknown base positions. Alternatively, a set of Continuous messages (ContinuousSet), representing non-overlapping base positions, can be used, skipping all or some of the NaN values.
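The two encodings described above (one message with NaN placeholders, or several non-overlapping messages that skip them) can be related with a small sketch. A message is modelled here as a plain `(start, values)` pair, and `split_on_nan` is a hypothetical helper, not part of the API.

```python
import math


def split_on_nan(start, values):
    """Split one (start, values) message into NaN-free, non-overlapping runs.

    Each returned pair is (run_start, run_values), where run_start is the
    base position of the first value in that run.
    """
    runs = []
    current = []
    run_start = None
    for offset, value in enumerate(values):
        if math.isnan(value):
            # A NaN closes the run in progress, if any.
            if current:
                runs.append((run_start, current))
                current = []
        else:
            if not current:
                run_start = start + offset
            current.append(value)
    if current:
        runs.append((run_start, current))
    return runs
```

A message starting at position 100 with values `[1.0, 2.0, NaN, NaN, 3.5]` splits into `(100, [1.0, 2.0])` and `(104, [3.5])`.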

Annotation Design - RNA Considerations

Read data derived from RNA samples can differ from genomic read data due to the presence of non-genomic sequences. An example would be a read that spans a splice junction. It describes a contiguous sequence of reads, but a discontinuous genomic region due to the missing intron. Feature level read assignment is further complicated by the existence of multiple splice isoforms. A read that can be definitely assigned to a particular feature (an exon in this case) may still not be definitely assigned to a particular transcript if multiple transcripts share that exon. The annotation API needs to be able to report assignment at the feature level as well as aggregate assignment at the transcript or even the whole gene level if assignment is not more specific than that.

Splicing (other post-transcriptional modifications?) can occur with degrees of complexity. A ‘typical’ splice will result in a mature transcript with exons in positional (numerical) order in a head-to-tail orientation. Back splicing (tail-to-head) can result in transcripts with the exon order reversed (1-3-2-4 instead of 1-2-3-4) and even circular RNA. The exon order in a transcript as well as the orientation of the splice should be discoverable via the API. In a more general case, the API should allow child features to have an ordered relationship.

Allele Annotations

Allele annotations are additional pieces of data often generated by algorithms which help to describe, classify, and understand variants.

Allele Annotation API

See Allele Annotation schema for a detailed reference.



Introduction

Variant alleles can be annotated by comparing them to gene annotation data using a variety of algorithms. A standard form of annotation is to compare alleles to a transcript set and calculate the expected functional consequence of the change (e.g. a variant within a protein coding transcript may change the amino acid sequence of the resulting protein).

This API supports the mining of variant annotations by region and the filtering of the results by predicted functional effect.

Allele Annotation Schema Entities

The VariantAnnotation data model is based on the results provided by variant annotation programs such as VEP, SnpEff, and Annovar, as well as the VCF ANN format.

Record | Description
VariantAnnotationSet | A VariantAnnotationSet record groups VariantAnnotation records. It represents the comparison of a VariantSet to specified gene annotation data using specified algorithms. It holds information describing the software and annotation data versions used.
VariantAnnotation | A VariantAnnotation record represents the result of comparing a single variant to the set of annotation data. It contains structured sub-records and a flexible key-value pair ‘info’ field.
TranscriptEffect | A TranscriptEffect record describes the effect of an allele on a transcript.
AlleleLocation | An AlleleLocation record holds the location of an allele relative to a non-genomic coordinate system such as a CDS or protein. It holds the reference and alternate sequence where appropriate.
HGVSAnnotation | A HGVSAnnotation record holds Human Genome Variation Society (HGVS) descriptions of the sequence change at genomic, transcript and protein level where relevant.
AnalysisResult | An AnalysisResult record holds the output of a prediction package such as SIFT on a specific allele.

The schema is shown in the diagram below.

TranscriptEffect attributes

A VariantAnnotation record may have many TranscriptEffect records as one is reported for each possible combination of alternate alleles and overlapping transcripts. The record includes:

• The identifier of the transcript feature the variant was analysed against.

• The alternate allele of the variant analysed. This is necessary as the current variant model supports multiple alternate alleles.

• The predicted effects of the allele on the transcript, which should be described using Sequence Ontology terms.

• A HGVSAnnotation record containing variant descriptions at all relevant levels.

• AlleleLocation records describing the changes at cDNA, CDS and protein level.

• A set of results from prediction packages analyzing the allele impact.


http://snpeff.sourceforge.net/VCFannotationformat_v1.0.pdf
http://www.hgvs.org/mutnomen/recs.html
http://www.sequenceontology.org


Search Options

VariantAnnotationSets can be extracted by Dataset or VariantSet, or retrieved by id.

A VariantAnnotationSet can be searched for VariantAnnotations by region and filters can be applied.

• A region to search must be specified. This can be done by providing a reference sequence (identified by name or id) with start and end coordinates.

• Results can be filtered by the predicted effect of the variant using a Sequence Ontology OntologyTerm.
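The two search options above (a mandatory region and an optional effect filter) can be sketched against an in-memory list. This is not the API's request or response format: the `Annotation` class, its field names, the half-open coordinate convention, and the `search_annotations` function are all assumptions made for illustration. Effects are given as Sequence Ontology term names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Annotation:
    reference_name: str
    start: int          # 0-based, inclusive (assumption for this sketch)
    end: int            # exclusive
    effects: List[str]  # Sequence Ontology term names


def search_annotations(annotations, reference_name, start, end, effects=None):
    """Return annotations overlapping the region, optionally filtered by effect."""
    hits = []
    for annotation in annotations:
        if annotation.reference_name != reference_name:
            continue
        # Half-open intervals overlap unless one ends before the other starts.
        if annotation.end <= start or annotation.start >= end:
            continue
        # With a filter, keep only annotations sharing at least one effect term.
        if effects and not set(effects) & set(annotation.effects):
            continue
        hits.append(annotation)
    return hits


annotations = [
    Annotation("1", 100, 101, ["missense_variant"]),
    Annotation("1", 150, 151, ["synonymous_variant"]),
    Annotation("2", 100, 101, ["missense_variant"]),
]
```

Calling `search_annotations(annotations, "1", 0, 200, effects=["missense_variant"])` returns only the first record; dropping the `effects` argument returns both records on reference `1`.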

RNA Quantification

RNA quantification provides a means of obtaining feature level quantifications derived from a set of RNA reads.

RNA Quantification API

For the RNA Quantification schema definitions, see the RNA Quantification schema

RNA Quantification

The RNA Quantification provides a means of obtaining feature level quantifications derived from a set of RNA reads.

Case 1: Obtain quantification data for one or more features (genes) in an RNASeq experiment

User desires: Feature quantification data (numeric) for one or more features identified in an RNASeq experiment result. User will provide a list of one or more features for which results should be returned. If a feature list is not provided, all quantification results for the selected RNASeq experiment should be returned. Numeric quantification should be provided as raw read count to allow for user conversion to desired units. If desired, TPM, RPKM or other enumerated units of measure can also be reported.

Additional Considerations: An RNASeq experiment result is the output of running an analysis pipeline on a set of read data. The result should have metadata that describes the pipeline that was used in detail. It should include the identity of the input reads in enough detail to retrieve the read data. All software used should include version, parameters and command line. Any genome or transcriptome annotations used should be described in enough detail to retrieve the exact version used. Any changes in software, parameters, annotations or other analysis details should result in a new RNASeq experiment result associated with that input read data.
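As an illustration of the "user conversion to desired units" mentioned in Case 1, the sketch below converts raw read counts to RPKM and TPM using their standard definitions. The function names and the dictionary-based inputs (feature name to count, feature name to length in bases) are assumptions for the example; both functions assume at least one non-zero count and non-zero feature lengths.

```python
def rpkm(counts, lengths):
    """Reads per kilobase of feature per million mapped reads."""
    total_reads = sum(counts.values())
    return {
        feature: counts[feature] * 1e9 / (lengths[feature] * total_reads)
        for feature in counts
    }


def tpm(counts, lengths):
    """Transcripts per million: length-normalized rates scaled to sum to 1e6."""
    rates = {feature: counts[feature] / lengths[feature] for feature in counts}
    scale = sum(rates.values())
    return {feature: rates[feature] / scale * 1e6 for feature in counts}


# Hypothetical raw counts and feature lengths (in bases).
counts = {"geneA": 100, "geneB": 300}
lengths = {"geneA": 1000, "geneB": 3000}
```

In this example both genes have the same reads-per-base rate, so they receive equal TPM values even though geneB has three times the raw count. This is why raw counts plus feature lengths are sufficient for a user to derive either unit.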

Case 2: Obtain quantification data for one or more features (genes) for comparison between multiple RNASeq experiments

User desires: Feature quantification data (numeric) for one or more features identified in one or more RNASeq experiment results. User will provide a list of one or more features for which results should be returned. If a feature list is not provided, all quantification results for the selected RNASeq experiment(s) should be returned. User will provide a list of one or more RNASeq experiments to obtain quantification results from.

Additional Considerations: In order to request quantifications for comparison with either repository datasets or a local dataset, the user needs to determine if the repository RNASeq experiment is comparable to other datasets. Pipeline metadata needs to be provided so that this can be determined:

• Sample-level: bio sample data such as tissue type, collection methods, sample preparation protocol, library generation protocol, spikes used

• Read-level: sequencer type and protocol, sequence data generation software version, parameters and command line

• Quantification-level: genome annotation, transcriptome annotation if any, software pipeline including versions, parameters and command line

• Batch-level: adjustments or normalization done at the batch level

Case 3: Obtain input data to use in Assembly activities



User desires: Sequence level read data for both mapped and unmapped reads in the associated RNA experiment. For typical read data this is contained in the FASTQ file(s) produced by the sequencer pipeline. The API should either provide the original FASTQ or the read data necessary and sufficient to generate it. It is desirable to be able to easily retrieve all the related reads at the fragment level for downstream analysis. At this time, these would be either single or paired reads but for future-proofing the API should be able to handle the delivery of an arbitrary number of reads for a specific fragment.

Case 4: Obtain input data for DESeq Differential Expression analysis

User desires: Feature quantification array for two or more comparable RNASeq experimental results. This is similar to the case where the user requests feature level data. In this case, it is critical that the user be able to identify comparable datasets.

Case 5: Obtain input data for RNASeq analysis by Kallisto software

User desires: Calculate feature quantification by a new method. In the Kallisto example here, the software does not utilize read alignments. Repository needs to be able to supply raw read sequence (FASTQ format or convertible to FASTQ) and optionally annotation for the user.

Case 6: Obtain quantification data for non-read-based RNA experiments (MicroArrays)

User desires: Discover and retrieve feature level quantification data that is derived from non-read-based sources such as the large sources of microarray-based expression data. The quantification API needs to be source agnostic and allow for a general linking of quantity to feature. It must be flexible and not lock the results to a reads/sequencer data collection model. There should be no required data source or metadata fields that are specific for a given data collection method.

Annotation Design - RNA Considerations

Read data derived from RNA samples can differ from genomic read data due to the presence of non-genomic sequences. An example would be a read that spans a splice junction. It describes a contiguous read sequence but a discontinuous genomic region, due to the missing intron. Feature level read assignment is further complicated by the existence of multiple splice isoforms. A read that can be definitely assigned to a particular feature (an exon in this case) may still not be definitely assigned to a particular transcript if multiple transcripts share that exon. The annotation API needs to be able to report assignment at the feature level as well as aggregate assignment at the transcript or even the whole gene level if assignment is not more specific than that.

Splicing (other post-transcriptional modifications?) can occur with degrees of complexity. A ‘typical’ splice will result in a mature transcript with exons in positional (numerical) order in a head-to-tail orientation. Back splicing (tail-to-head) can result in transcripts with the exon order reversed (1-3-2-4 instead of 1-2-3-4) and even circular RNA. The exon order in a transcript as well as the orientation of the splice should be discoverable via the API. In a more general case, the API should allow child features to have an ordered relationship.

The annotation API needs to also be flexible enough to handle multiple references in the same gene or transcript. This is needed to cover the cases of fusion genes or inter-chromosomal translocations.
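The ordered relationship between child features discussed above can be illustrated with exon numbers held in transcript order. The helper names are hypothetical; the sketch only shows how a client could tell a head-to-tail transcript from one containing a back splice once exon order is exposed.

```python
def is_head_to_tail(exon_order):
    """True when exons appear in strictly increasing positional order."""
    return all(a < b for a, b in zip(exon_order, exon_order[1:]))


def back_splice_junctions(exon_order):
    """Return (upstream, downstream) exon pairs joined in reversed order."""
    return [(a, b) for a, b in zip(exon_order, exon_order[1:]) if a > b]


print(is_head_to_tail([1, 2, 3, 4]))
print(back_splice_junctions([1, 3, 2, 4]))
```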

RNA Quantification Schema

The RNA Quantification Schema is designed around a quantification analysis. Each set of feature quantifications describes the results of running an analysis on a set of input data.

The RnaQuantificationSet collects a group of related RnaQuantifications. These are most likely associated by being part of a multi-sample experiment. For example, a time course experiment would be described by an RnaQuantificationSet, with the individual RNASeq experiments of each time point represented as the member RnaQuantifications.

The RnaQuantification describes the analysis pipeline used as well as the input reads dataset and which sequence annotations, if any, were used.



ExpressionLevel contains the identity of the specific feature measured as well as the final resulting quantification from the pipeline.
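The relationship between these three records can be sketched with plain Python data classes. This is a minimal illustration rather than the protocol definition: all field names (`feature_id`, `expression`, `pipeline`, `read_group_set_id`, `feature_set_ids`) and the helper `build_time_course` are assumptions chosen for readability.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExpressionLevel:
    # Identity of the measured feature and its final quantification value.
    feature_id: str
    expression: float


@dataclass
class RnaQuantification:
    # One analysis run: pipeline, input reads, and optional annotations.
    id: str
    pipeline: str
    read_group_set_id: str
    feature_set_ids: List[str] = field(default_factory=list)
    expression_levels: List[ExpressionLevel] = field(default_factory=list)


@dataclass
class RnaQuantificationSet:
    # Groups related quantifications, e.g. the time points of one experiment.
    id: str
    quantifications: List[RnaQuantification] = field(default_factory=list)


def build_time_course(set_id, time_points):
    """Create a set holding one RnaQuantification per time point label."""
    members = [
        RnaQuantification(
            id=f"{set_id}-{label}",
            pipeline="example-pipeline",
            read_group_set_id=f"reads-{label}",
        )
        for label in time_points
    ]
    return RnaQuantificationSet(id=set_id, quantifications=members)


time_course = build_time_course("tc1", ["0h", "6h"])
time_course.quantifications[0].expression_levels.append(
    ExpressionLevel(feature_id="exon-1", expression=12.5)
)
```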

Genotype to Phenotype

Genotype and phenotype data can be linked via evidence. This protocol provides methods for describing phenotypes and associating them with genomic features.

Genotype To Phenotype API

Summary

This API allows users to search for genotype-phenotype associations in a GA4GH datastore. The user can search for associations by building queries composed of features, phenotypes, and/or evidence terms. The API is designed to accommodate search terms specified as either a string, external identifier, ontology identifier, or as an ‘entity’ (see the Data Model section). These terms are combined as an AND of (feature && phenotype && evidence). This flexibility in the schema allows a variety of data to be stored in the database and allows users to express a wide range of queries.

Users will receive an array of associations as a response. Associations contain description and environment fields in addition to the relevant feature, phenotype, and evidence fields for that instance of association.

Multiple server collation - Background

G2P servers are planned to be implemented in three different contexts:

• As a wrapper around standalone local G2P “knowledge bases” (e.g. Monarch, CIViC, etc.). An important consideration is that the API needs to function independently of other parts of the API and separately from any specific omics dataset. Often, these databases are not curated with complete Feature fields (referenceName, start, end, strand).

• Coupled with sequence annotation and GA4GH datasets. Clients will want implementation specific featureId/genotypeId to match and integrate with the rest of the APIs.

• Operating in concert with other instances of G2P servers, where the client’s loosely federated query is supported by heterogeneous servers. Challenges: normalizing API behavior across implementations (the featureId for a given region differs per implementation).

Approach

We based our original work on the model captured in the ga4gh/ga4gh-schemas commit of Jul 30, 2015. This version of the schema predates the separation of the genotype to phenotype files from the baseline. After review of the schemas and code, the team had feedback about separation of responsibility in the original API. The API was refactored to separate the searches for genotype, phenotype, feature and associations.


https://github.com/ga4gh/ga4gh-schemas/tree/be171b00a5f164836dfd40ea5ae75ea56924d316

https://github.com/ga4gh/ga4gh-schemas/commit/846b711fdcf544bf889cc7dbab19c6c48e9a9428


Data Model

The cancer genome database Clinical Genomics Knowledge Base published by the Monarch project was the source of Evidence.

Intent: The GA4GH Ontology schema provides structures for unambiguous references to ontological concepts and/or controlled vocabularies within Protocol Buffers. The structures provided are not intended for de novo modeling of ontologies, or representing complete ontologies within Protocol Buffers. References to e.g. classes from external ontologies or controlled vocabularies should be interpreted only in their original context, i.e. the source ontology.

Due to the flexibility of the data model, users have a number of options for specifying each query term: feature, phenotype, and evidence.

API

The G2P schemas define several endpoints broken into two entity searches and an association search.

A feature or phenotype can potentially be represented in increasing specificity as either [a string, an ontology identifier, an external identifier, or as a feature ‘entity’]. One criticism of the previous API is that it was overloaded, violating the design goal of separation of concerns. Specifically, it combined the search for evidence with the search for features and the search for genotypes.

The refactored API moves search, alias matching and external identifier lookup to dedicated endpoints. To separate concerns, a client performs the queries for evidence in two steps: first find the desired entities, and then use those entity identifiers to narrow the search for evidence.

Additionally the API supports two implementation styles: integrated and standalone.

science/genomics/../_static/g2p-sequence-diagram.png

Entity Searches

• /features/search

– Given a SearchFeaturesRequest, return matching features in the current 'omics dataset. Intended for sequence annotation and GA4GH datasets.

• /phenotypes/search

– Given a SearchPhenotypesRequest, return matching phenotypes in the current g2p dataset.

Association Search

• /featurephenotypeassociations/search

– Given a SearchGenotypePhenotypeRequest, return matching evidence associations in the current g2p dataset.


http://nif-crawler.neuinfo.org/monarch/ttl/cgd.ttl

https://github.com/ga4gh/ga4gh-schemas/blob/be171b00a5f164836dfd40ea5ae75ea56924d316/src/main/resources/avro/genotypephenotypemethods.avdl#L105

https://github.com/ga4gh/ga4gh-schemas/blob/be171b00a5f164836dfd40ea5ae75ea56924d316/src/main/resources/avro/genotypephenotypemethods.avdl#L108

https://github.com/ga4gh/ga4gh-schemas/blob/be171b00a5f164836dfd40ea5ae75ea56924d316/src/main/resources/avro/genotypephenotypemethods.avdl#L111


Usage

1. As a GA4GH client, use entity queries for the genotypes and phenotypes you are interested in.

2. Create an association search using the entity identifiers from step 1.

3. Repeat 1-2 as necessary, collating responses on the client.

Many types rely heavily on the concept of an OntologyTerm (see end of document for discussion on usage of OntologyTerms).
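The three usage steps can be sketched against in-memory stand-ins for G2P servers. Nothing here calls a real GA4GH client library: the dictionaries, identifiers and function names are hypothetical, and the point is only the order of operations (entity search, then association search, then client-side collation).

```python
def search_entities(records, **criteria):
    """Step 1: return ids of records matching every supplied field value."""
    return [
        record["id"]
        for record in records
        if all(record.get(key) == value for key, value in criteria.items())
    ]


def search_associations(associations, feature_ids, phenotype_ids):
    """Step 2: keep associations that link a found feature and phenotype."""
    return [
        association
        for association in associations
        if association["featureId"] in feature_ids
        and association["phenotypeId"] in phenotype_ids
    ]


def collate(servers, feature_name, phenotype_description):
    """Step 3: repeat steps 1-2 per server and merge results on the client."""
    results = []
    for server in servers:
        feature_ids = search_entities(server["features"], name=feature_name)
        phenotype_ids = search_entities(
            server["phenotypes"], description=phenotype_description
        )
        results.extend(
            search_associations(server["associations"], feature_ids, phenotype_ids)
        )
    return results


# Two hypothetical servers that use different identifiers for the same entities.
server_a = {
    "features": [{"id": "a-f1", "name": "KIT"}],
    "phenotypes": [{"id": "a-p1", "description": "GIST"}],
    "associations": [{"id": "a-1", "featureId": "a-f1", "phenotypeId": "a-p1"}],
}
server_b = {
    "features": [{"id": "b-f9", "name": "KIT"}, {"id": "b-f2", "name": "CD20"}],
    "phenotypes": [{"id": "b-p3", "description": "GIST"}],
    "associations": [
        {"id": "b-7", "featureId": "b-f9", "phenotypeId": "b-p3"},
        {"id": "b-8", "featureId": "b-f2", "phenotypeId": "b-p3"},
    ],
}

found = collate([server_a, server_b], "KIT", "GIST")
print([association["id"] for association in found])
```

Because each server assigns its own identifiers, the entity search has to be repeated per server before the association search; only the final list is merged.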

Implementation

Source Code

• Front End ‘/features/search’, ‘/datasets//features/search’, ‘/phenotypes/search’, ‘/featurephenotypeassociations/search’

• Back End ‘runSearchFeatures’, ‘runSearchGenotypePhenotypes’, ‘runSearchPhenotypes’, ‘runSearchGenotypes’

• Datamodel ‘getAssociations’ Datamodel ‘getAssociations’ (Features)

Tests

• End to End

Help Wanted: Any or all use cases and scenarios

Acceptance

• Submittal of 3 simultaneous pull-requests for server, schema and compliance repositories

• 2 +1s for each repository from outside the development team

• Additional 3 day review for schemas

API Details and Examples

/phenotypes/search

science/genomics/../_static/search_phenotypes_request.png

Terms within a query are combined via AND e.g


https://github.com/ga4gh/ga4gh-schemas/blob/be171b00a5f164836dfd40ea5ae75ea56924d316/src/main/resources/avro/ontologies.avdl#L10

https://github.com/ga4gh/ga4gh-server/blob/g2p/ga4gh/frontend.py

https://github.com/ga4gh/ga4gh-server/blob/g2p/ga4gh/backend.py

https://github.com/ga4gh/ga4gh-server/blob/g2p/ga4gh/datamodel/genotype_phenotype.py

https://github.com/ga4gh/ga4gh-server/blob/g2p/ga4gh/datamodel/genotype_phenotype_featureset.py

https://github.com/ga4gh/ga4gh-server/blob/g2p/tests/end_to_end/test_g2p.py


request = { "phenotype": { "description": "AML", "ageOfOnset": {"id": "http://purl.obolibrary.org/obo/HP_0003581"} } }

is transformed by the server to:

query = (description="AML" and ageOfOnset="http://purl.obolibrary.org/obo/HP_0003581")

Items in the qualifiers array are OR’d together. For example, severe or abnormal:

request = ... "phenotype": { "description": "AML", "qualifiers": [{"id": "http://purl.obolibrary.org/obo/PATO_0000396"}, {"id": "http://purl.obolibrary.org/obo/PATO_0000460"}] } ....

is transformed by the server to:

query = (description="AML" and (qualifier = "http://purl.obolibrary.org/obo/PATO_0000396" or qualifier = "http://purl.obolibrary.org/obo/PATO_0000460"))

The service returns a list of matching PhenotypeInstances.
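The combination rules above (terms joined with AND, qualifiers joined with OR) can be sketched as a small function that renders a phenotype request as a query expression. This is an illustration of the rules only; `phenotype_query` is a hypothetical helper, and the real server may build its datastore query quite differently.

```python
def phenotype_query(phenotype):
    """Render a phenotype request dict as an AND/OR query expression string."""
    clauses = []
    for key, value in phenotype.items():
        if key == "qualifiers":
            # Items in the qualifiers array are OR'd together.
            alternatives = " or ".join(
                f'qualifier="{term["id"]}"' for term in value
            )
            clauses.append(f"({alternatives})")
        elif isinstance(value, dict):
            # Ontology-term valued fields are matched on their id.
            clauses.append(f'{key}="{value["id"]}"')
        else:
            clauses.append(f'{key}="{value}"')
    # Top-level terms are AND'd together.
    return "(" + " and ".join(clauses) + ")"


severe_or_abnormal = {
    "description": "AML",
    "qualifiers": [
        {"id": "http://purl.obolibrary.org/obo/PATO_0000396"},
        {"id": "http://purl.obolibrary.org/obo/PATO_0000460"},
    ],
}
print(phenotype_query(severe_or_abnormal))
```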

Examples: Phenotype Lookup

Q: I have a Disease ontology id (“OBO:OMIM_606764”).

Use an OntologyTerm.

request = { ... "type": {"id": "http://purl.obolibrary.org/obo/OMIM_606764"} .... }

The system will respond with phenotypes that match on OntologyTerm.id

Q: I have a phenotype id (“p12345”).

Create a PhenotypeQuery using the id field.

request = ... { "id": "p12345" } ....

The system will respond with phenotypes that match on PhenotypeInstance.id

Q: I have an ontology term for a phenotype (HP:0001507, ‘Growth abnormality’ )

Use an OntologyTerm.

request = ... { "type": {"id": "http://purl.obolibrary.org/obo/HP_0001507"} } ....

The system will respond with phenotypes that match on OntologyTerm.id

Q: I am only interested in phenotypes qualified with (PATO_0001899, decreased circumference).

Create a PhenotypeQuery.

request = ... { "qualifiers": [{"id": "http://purl.obolibrary.org/obo/PATO_0001899"}] } ....

The system will respond with phenotypes whose qualifiers match that ontology term (‘is_a’).

Q: I have a disease name “inflammatory bowel disease”.

Create a PhenotypeQuery using the description field.

{"description": "inflammatory bowel disease",...}

The system responds with Phenotypes that match on OntologyTerm.description.

Note that you can wildcard description.

{"description": ".*bowel.*",...}

Supported regex:


https://www.w3.org/TR/xpath-functions/#regex-syntax
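The effect of a wildcard description can be previewed locally with Python's `re` module. This is an approximation under stated assumptions: the server follows the XPath regex syntax linked above, which differs from Python's in some details, and the descriptions below are invented for the example. For a simple pattern such as `.*bowel.*` the two behave the same.

```python
import re


def match_descriptions(pattern, descriptions):
    """Return the descriptions in which the pattern is found."""
    compiled = re.compile(pattern)
    return [text for text in descriptions if compiled.search(text)]


# Hypothetical phenotype descriptions held by a server.
descriptions = [
    "inflammatory bowel disease",
    "irritable bowel syndrome",
    "asthma",
]
print(match_descriptions(".*bowel.*", descriptions))
```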


/features/search

This endpoint is provided to serve features/variants/etc hosted by a g2p dataset when it is deployed independently of the sequenceAnnotations API. The request and response payloads are identical to /datasets//features/search.

Terms within a query are combined via AND e.g:

request = { "name":"KIT", "referenceName": "hg38" }

becomes

query = (name="KIT" and referenceName ="hg38")

The service returns a list of matching Features.

Examples: Genotype Lookup

Note: since we have switched to relying on the features/search API, external identifier queries have been deprecated. Refer to the features/search documentation.

Q: I have a SNPid (“rs6920220”).

Create an External Identifier Query.

{... {"ids": [{"identifier": "rs6920220", "version": "*", "database":"dbSNP"}]}, ... }

The endpoint will respond with features that match on external identifier. Multiple identifiers are OR’d together.

Q: I have an identifier for BRCA1 (GO:0070531). How do I query for the feature?

Create an OntologyTerm query:

{... {"type": {"id": "http://purl.obolibrary.org/obo/GO_0070531"}}, ... }

The endpoint will respond with features that match on that term.

Q: I only want somatic variant features (SO:0001777). How do I limit results?

Specify featureType:

{... {"featureType": "http://purl.obolibrary.org/obo/SO_0001777"}, ... }

The endpoint will respond with features that match on that type.

/features/search

See sequence annotations documentation.

/featurephenotypeassociations/search

The endpoint accepts a SearchGenotypePhenotypeRequest POST. The request may contain a feature, phenotype, and/or evidence, which are combined as a logical AND to query the underlying datastore. Missing types are treated as a wildcard, returning all data. The genotype and phenotype fields are either null or a list of identifiers returned from the entity queries. The evidence query object allows filtering by evidence type.

The SearchGenotypePhenotype search is simplified. Features and Phenotypes are expressed as a simple array of strings. Evidence can be queried via the new EvidenceQuery.

The response is returned as a list of associations.
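The matching rule described above (supplied terms are AND'd, missing terms act as wildcards) can be sketched as a filter over association records. The record layout and the function name `filter_associations` are assumptions for illustration, not the server implementation.

```python
def filter_associations(associations, feature_ids=None, phenotype_ids=None,
                        evidence_type=None):
    """Keep associations matching every supplied term; None means wildcard."""
    def matches(association):
        return (
            (feature_ids is None or association["featureId"] in feature_ids)
            and (phenotype_ids is None
                 or association["phenotypeId"] in phenotype_ids)
            and (evidence_type is None
                 or association["evidenceType"] == evidence_type)
        )
    return [association for association in associations if matches(association)]


# Hypothetical association records.
associations = [
    {"id": "1", "featureId": "f-kit", "phenotypeId": "p-gist",
     "evidenceType": "clinical study"},
    {"id": "2", "featureId": "f-cd20", "phenotypeId": "p-gist",
     "evidenceType": "publication"},
    {"id": "3", "featureId": "f-kit", "phenotypeId": "p-aml",
     "evidenceType": "publication"},
]

# Leaving feature and evidence as null returns every association for the phenotype.
gist_only = filter_associations(associations, phenotype_ids=["p-gist"])
print([association["id"] for association in gist_only])
```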



science/genomics/../_static/search_genotype_phenotype_request.png

Fig. 1.1: http://yuml.me/edit/024cf70f

Implementation Guidance: Results

Q: I need a place to store publication identifiers or model machine learning and statistical data.

Use the “info” key-value pair added to Evidence.

[
  {
    "evidenceType": {
      "sourceName": "IAO",
      "id": "http://purl.obolibrary.org/obo/IAO_0000311",
      "sourceVersion": null,
      "term": "publication"
    },
    "info": {"source": ["PMID:21470995"]},
    "description": "Associated publication"
  },
  {
    "evidenceType": {
      "sourceName": "OBI",
      "id": "http://purl.obolibrary.org/obo/OBI_0000175",
      "sourceVersion": null,
      "term": "p-value"
    },
    "info": {"p-value": ["1.00e-21"]},
    "description": "Associated p-value"
  },
  {
    "evidenceType": {
      "sourceName": "OBCS",
      "id": "http://purl.obolibrary.org/obo/OBCS_0000054",
      "sourceVersion": null,
      "term": "odds ratio"
    },
    "description": "1.102"
  }
]
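A client could read the “info” values back out of such evidence entries as sketched below. The sketch assumes the entries arrive as a JSON array with the keys shown in the example above; `info_by_term` is a hypothetical helper, and entries without an “info” key simply yield an empty mapping.

```python
import json

# Two evidence entries in the shape shown above (an assumed JSON array).
EVIDENCE_JSON = """
[
  {"evidenceType": {"sourceName": "IAO",
                    "id": "http://purl.obolibrary.org/obo/IAO_0000311",
                    "sourceVersion": null, "term": "publication"},
   "info": {"source": ["PMID:21470995"]},
   "description": "Associated publication"},
  {"evidenceType": {"sourceName": "OBCS",
                    "id": "http://purl.obolibrary.org/obo/OBCS_0000054",
                    "sourceVersion": null, "term": "odds ratio"},
   "description": "1.102"}
]
"""


def info_by_term(evidence_list):
    """Map each evidence term to its info dict (empty when info is absent)."""
    return {
        evidence["evidenceType"]["term"]: evidence.get("info", {})
        for evidence in evidence_list
    }


evidence = json.loads(EVIDENCE_JSON)
print(info_by_term(evidence))
```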

Use cases

1. As a clinician or a genomics researcher, I may have a patient with Gastrointestinal stromal tumor (GIST) and a proposed drug for treatment, imatinib. In order to identify whether the patient would respond well to treatment with the drug, I need a list of features (e.g. genes) which are associated with the sensitivity of GIST to imatinib. Suppose I am specifically interested in a gene, KIT, which is implicated in the pathogenesis of several cancer types. I could submit a query to /featurephenotypeassociations/search with GIST as the phenotype, KIT as the feature, and clinical study evidence as the evidence.




In response, I will receive back a list of associations involving GIST and KIT, which I can filter for instances where imatinib is mentioned. URIs in the associations field could, hypothetically, be followed to discover that GIST patients with wild-type KIT have decreased sensitivity to therapy with imatinib.

If I left both the feature and evidence fields as null, I would receive back all associations which involve GIST as a phenotype.

2. As a non-Hodgkin’s lymphoma researcher, I may know that the gene CD20 has an abnormal expression in Hodgkin's lymphoma. I might be interested in knowing whether CD20 also has an abnormal expression in non-Hodgkin’s lymphoma. Therefore I could perform a query with CD20 as a feature, non-Hodgkin’s lymphoma as a phenotype, and RNA sequencing as the evidence type.

3. As a genetic counselor, I may be wondering if a mutation in one of my clients’ genes has ever been associated with a disease. I could then do a query based on the gene name as the feature and disease as the phenotype.

For specifics of the json representations, please see the server and compliance repositories.

Ontologies

Usage: Multiple ontology terms can be supplied, e.g. to describe a series of phenotypes for a specific sample. The OntologyTerm message is not intended to model relationships between terms, or to provide mappings between ontologies for the same concept. Should an OntologyTerm be unavailable, or terms unmapped, then an ‘annotation’ can be provided, which can later be mapped to an ontology term using a service designed for this. Using OntologyTerm is preferred to using Annotation, though annotations can be supplied with related ontology terms if desired. A use case could be when a free text annotation is very specific and a more general OntologyTerm is supplied.

Read more about Ontology Terms

Directions for future capabilities.

Flexible representation of Feature

• Q: I need to look up a Feature by proteinName or other external id. How do I look them up? Currently, sequence annotation’s features/search supports search by name or location. Future versions should implement lookup by alias/

• Q: I have results from multiple G2P Servers. How do I collate them across datasets and implementations? This is a subject for investigation as we create a federation of G2P servers, including where the responsibility for collating features and associations across servers should lie. One strategy might be to use HGVS’ DNA annotation as a neutral identifier for a feature.

Expanding scope to entities other than Feature

Consider instead a PhenotypeAssociation which has a wider scope; the objects it connects and the evidence type determine the meaning of the association.



Genomics Workflows

Genomics can be connected to clinical practice through the diagram shown below:

The process starts with acquiring patient genomic data. Next Generation Sequencing (NGS) data usually enter the system as BAM files (analyzed, mapped and annotated) with BAI index files, dependent on genome assembly versions. Microarray gene expression data format depends on the manufacturer (e.g. Affy, Illumina) and chip versions. Most other data file formats are flat files, VCF, tab-delimited, often coded.

From these data, clinically relevant alleles (alternative forms of the same gene) and haplotypes (groups of alleles) are identified.

Alleles and haplotypes can quickly become complex, and new discoveries are constantly made. Fortunately, we can make use of star alleles. Star alleles are haplotype patterns that have been defined at the gene level and, in many cases, associated with protein activity levels. Genetic variants within a haplotype can include single nucleotide polymorphisms (SNPs), Insertion/Deletions (InDels), and copy number variants (CNVs).



Once we have applied the right nomenclature, we can then start to correlate the specific genomic characteristics of the individual with phenotypes (diseases, health conditions, or reactions to medications), and combine these with drugs and clinical trials to form clinical guidelines for doctors.

Guidelines are the implications of the gene-phenotype, gene-drug, or drug-drug interaction. For example, “increased risk of morphine formation following codeine administration, leading to higher risk of toxicity.” These are then converted to recommendations, which actually tell the health care provider what to do. For example, “Avoid codeine use due to potential for toxicity.”



Reads

Reads are genetic data generated by a DNA sequencing instrument, including nucleotides and quality scores. Reads may optionally be aligned to a reference sequence. (The data model for reads is similar to SAM/BAM.)

Variants

Variants are genetic differences between an experimental sample and a reference sequence. (The data model for variants is similar to VCF.)

References

References are standard genome sequences, used to provide a coordinate system for reads and variants.

Sequence Annotations


Allele Annotations

Allele annotations are additional pieces of data, often generated by algorithms, which help to describe, classify, and understand variants.

RNA Quantification

RNA quantification provides a means of obtaining feature level quantifications derived from a set of RNA reads.

Genotype to Phenotype

Genotype and phenotype data can be linked via evidence. This protocol provides methods for describing phenotypes and associating them with genomic features.

Proteomics

Metabolomics

Microbiomics

Exosomics

Data Sources

This section covers the data sources we use in the platform and how they are organized.


https://samtools.github.io/hts-specs/SAMv1.pdf

https://samtools.github.io/hts-specs/VCFv4.2.pdf


Current Data Network

—————REPRESENT AS IMAGE—————-

Omics

Omics covers genomics, proteomics, and metabolomics. Omics aims at the collective characterization and quantification of pools of biological molecules that translate into the structure, function, and dynamics of an organism or organisms.

Omics data sources are primarily research knowledgebases and/or curated by individuals wiki-style. As such, while they serve a specific function, they will not be error-free, may be conflicting, and are often incomplete or cross functional lines.

Our list and organization are not complete and will remain a work in progress. As these sources evolve we will track them here. Additional database details can be found in this Google sheet.

Biorepositories

Biorepositories are exactly as they sound - repositories of biological data, especially genetic test data. See below for a common list of biorepositories around the globe.

Linked open data and biorepositories:

1. IN USE

2. TARGETED


https://docs.google.com/spreadsheets/d/1LCwOyLwVUT04TpTZyrzyKP96RnambC2YqRkmlQwPK0I/edit#gid=619233452


• Danish National Biobank - 5.6M records

• eMERGE Network - 350K records. DNA repositories with linked electronic medical records. Data generated from network partners - eMERGE PGx (sequencing of target patients), IGNITE (research), electronic medical records (EMRs), ENCODE, CSER. ?epidemio and statistical analysis

• UK Biobank - 500K records. Population-based repo of biological samples with linked medical, lifestyle, and family history data. Also have genetic and phenotypic data. Researchers oriented, ? epidemiological & statistical analysis not done.

• Estonian Genome Center - 52K records

• China Kadoorie Biobank - 500K records

• Kaiser Permanente Division of Research - 200K records

• MVP - 250K records

• Qatar Biobank - 300K records

• Personal Genome Project (Harvard)

3. UNDER INVESTIGATION

• Beacon Network

• DailyMed - Structured Product Label (SPL) - pdts & ingredients, title of drug use

• PGx labeled FDA approved drugs - gene & functions but not all mention alleles

• CGN (shut down for now) - Clinical database of patients from 14 cancer centers

• NCI Specimen Resource Locator - biological samples from cancer patients. Biobank resource, researchers oriented

• BioLINCC - biological samples and clinical data from NHLBI studies. Biobank resource, researchers oriented, ? data format available - ie raw and need analysis

• Rare Disease-HUB - biological samples from patients with rare diseases. Biobank resources, researchers oriented, samples info & clinical data linked.

• PGP - genotype data with linked medical and personal information. Genome (WGS, exomes), trait, profiles, microbiome. Depends on willing participants. ? no. of participants so far, Likely that data analysis required

• iSAEC - clinical and genotyping data from studies of adverse drug events. Data dissemination web site has been set up to provide access to the iSAEC data by qualified researchers exclusively for the purpose of conducting biomedical research

4. REJECTED

Variants

1. IN USE

• Galaxy

2. TARGETED

• dbSNP - Single Nucleotide Polymorphism database

• UCSC genome browser - annotated genomic reference sequences. Similar to NCBI resources but usually slower with update information, as new research data submitted to repositories appear in NCBI first. Leave out for now, or use as secondary verification source


http://www.personalgenomes.org/harvard/data/

https://beacon-network.org/#/about


• Ensembl - annotated genomic reference sequences. Similar to NCBI resources, slower with updated data, and uses own IDs but longer string of numbers. Tendency to remove untranslated region from mRNA sequences (ie only Start to Stop codon). Also the basis for other EMBL-EBI resources.

• ANNOVAR - Gene-, region- and filter-based annotation of genomic data.


• AceView

• NCBI Gene - Integrative resource for genes, variants and phenotypes. Annotated genes, variants, phenotypes, linked to specific external resources. Also linked to NCBI Genome

• NCBI Genome - Integrative resource for genes, variants and phenotypes. Chromosomal assemblies, sequences, maps and annotations. Less useful than NCBI Gene at this point

• Genome Reference Consortium - Curators for genome assemblies, patches, fixes and updates. Genome data sources (for mapping ie if patients’ genomic data require analysis)

• HVP - Human Variome Project. Work to consolidate existing gene/disease specific databases. Refers back to Locus Specific Mutation Databases at Leiden (repeated below)

• LOVD Leiden Open Variation Database - Nomenclature and linking variants under different names in research

• COSMIC - Catalogue of Somatic Mutations in Cancer. Resources include Cell Lines Project (104), COSMIC,Cancer Gene Census, Drug Sensitivity, Mutational Signatures, GRCh37 Cancer Archive.

• NIEHS SNPs - SNP discovery resource focused on environmental exposures, inter-individual sequence variation, disease risk. Resequencing, data available for download. New variations deposited back to dbSNP, so they have rs IDs

• UCSC Genome assembly for positional mapping

• Seattle SNPs - SNP discovery resource for inflammatory responses. Re-sequencing methodology advice & guidance to researchers, dbSNP, RefSeq/NCBI genome assembly

• 1000 Genomes - whole genome sequencing and variants & haplotype blocks, completed project. Incorporating data into International Genome Sample Resource (IGSR) - useful

• SPSmart - large-scale genomic variants browsing (visualisation). Databases 1K Genome Project, HapMap, Perlegen, CEPH, dbSNPs

• SNAP - Retrieval of SNP LD data from 1000 genomes and HapMap. May be irrelevant at this point, researchers oriented

• Database of Genomic Variants (DGV) - structural genetic variants found in healthy individuals; CNVs, Inv

• Innate Immunity Programs for Genomic Applications - genotypes and haplotypes from genes related to innate immunity. Haplotypes tools, SNPs & InDels

• JSNP database (article downloaded into shared folder) - repo of common SNPs in the Japanese population - weblink from article error - server not found ?ETHNOS

• dbVar - NCBI repo of CNVs - not really CNVs. Structural variations (INDEL, INV, translocations & rearrangements). Use in conjunction with dbSNPs for more complete genomic mutations

• FINDbase - Pharmacogenomic markers in different populations. OMIM, PharmGKB, Allele Freq Worldwide in ETHNOS databases

• HGDP (Human Genome Diversity Project)

4. REJECTED

• Hapmap - Note: deprecated, ignore. Subsumed by 1,000 genomes project - SNP genotypes from 11 different populations



Expression

1. IN USE

2. TARGETED

• UCSC Genome Browser - DNA sequences annotated with gene expression data from a wide range of sources


• ICGC - Copy number, rearrangement, expression, mutation data. Cancer projects, data repo. Drug infor from DrugBank & ChEMBL (part of EMBL-EBI)

• GEO - Gene expression data from >2500 studies. Currently 4348 datasets, also a repo for RNAseq datasets. ? analysed or raw data or both - to double check

• European Nucleotide Archive (ENA) - WGS repo (raw, mapped & annotated)

• Oncomine - Gene expression data from GEO, TCGA and other projects. Thermo Fisher platform, need to register to see what they can really do. Claims 700+ independent datasets but ?? exact resources.

• Cell Miner - Gene expression and GI50 drug concentration data from NCI-60 cell lines. Datasets are raw or normalised. Facilitates systems biology research. miRBase, dbSNPs, GEO

• SBM DB - Hu mRNA (Affy U133A), HUVEC mRNA & protein (IHC) expression from healthy and tumoral tissues, normal and cancer cell lines. Provide base for new drug development

• GENT - Microarray-derived mRNA expression data from >34000 tissue samples

• Cancer Genome Anatomy Project - Gene expression from normal, precancer and cancer cells. CGAP/MGC, ORESTES, EST libraries. Downloadable data in tab-delimited ASCII

• miRBase - miRNA database, nomenclature (a bit like dbSNP but only for miRNAs). Need to find out where miRNA microarrays vs conditions data are. Can be found in ENCODE

4. REJECTED

• ArrayExpress - Microarray-derived mRNA expression data from >25000 studies (EMBL-EBI). Currently 69786 studies of raw datasets for research community. Not helpful as raw data are unmapped, without statistical significance to indicate relevance and require analysis

Drug-Gene

1. IN USE

2. TARGETED

• DrugBank - DrugBank (drugs, formulations, targets, interactions)

• FDA Pharmacogenomics

• PharmGKB - industry standard for PGx

• CPIC - Clinical Pharmacogenomics Implementation Consortium guidelines. CPIC guidelines are designed to help clinicians understand HOW available genetic test results should be used to optimize drug therapy, rather than WHETHER tests should be ordered.


• CTD - Comparative Toxicogenomics Database (genes, chemicals, diseases, phenotypes). Integrated with: BioGRID (3.4.144 release), ChemIDplus® (as of 27 January 2017), DrugBank (as of 27 January 2017), GO (as of 27 January 2017), KEGG (as of 27 January 2017), MeSH® (2016 MeSH release), NCBI Gene (as of 27 January 2017), NCBI Taxonomy (as of 27 January 2017), PubMed® (as of 27 January 2017), Reactome (as of 27 January 2017)



• FINDBase - allele frequencies of PGx markers in different populations

• PHARMACO-GENDIA. offer genetic testing services, uses PharmGKB, Karolinska’s CytP450 db

• NCBI GTR - voluntary genetic testing information. Resources: ClinVar, MedGen, NCBI’s mol med db & tools, OMIM, GeneReviews, Orphanet, NHGRI, Genetics Home Reference, Office of Rare Diseases Res, Genetic Alliance (mostly NIH resources)

• Warfarin Dosing. PharmCAT (PharmGKB, P-STAR, ClinGen & ClinVar, CPIC)

4. REJECTED

Gene-Phenotype

1. IN USE

2. TARGETED

• GWAS Central - genome-wide association studies for wide range of conditions (>1844 studies). dbSNPs, DBGV

• OMIM - Online Mendelian Inheritance in Man

• NCBI ClinVar - clinically relevant variants

• PheGenI - Phenotype-Genotype Integrator (PheGenI), merges NHGRI genome-wide association study (GWAS) catalog data with Gene, dbGaP, OMIM, GTEx and dbSNP

• ClinGen - database of clinical genomics


• PhenX - GWAS & epi tool kit, research oriented. ? downloaded sources, ignore for now.

• dbGaP - Database of Genotypes and Phenotypes. Controlled access mechanisms

• PhenCode - human genotype and phenotype, link locus specific dbs, UniProt with UCSC Genome browser

• Model Organisms (which hyperlink?) - mouse, rat, worm, fly, yeast, slime mold

• Wellcome Trust Case Control Consortium - identifies genome sequence variants including major causes of morbidity and mortality through large-scale GWAS. WT funded projects, mostly UK. Application for data access, ie controlled access through Eur Genotype Archive (EBI). Researchers oriented

• EMBL-EBI GWAS

• HuGE Navigator - web tools enabling mining of literature and genetic association studies. Human Genome Epidemiology encyclopedia. CDC dbs, MeSH, NCBI Gene, HUGO, Pubmed

• ENCODE Project - genome browser, non-coding functional elements, new site for project at https://www.encodeproject.org/. Data viewable in GENCODE on BioDalliance Browser specifically for human and mouse genome

• SNP Function Portal - annotation of SNPs at the genome, transcript, protein, pathway, disease, and population levels. Research oriented for own data. Use infor from GB, UniGene LocusLink

• pfSNP - SNP function and results from GWAS

• Pupasuite - bioinformatic analyses of SNP functions - toolkits; research oriented

• FuncPred - bioinformatic analyses of SNP functions. dbSNP, miRanda

• F-SNP - bioinformatic analyses of SNP functions. dbSNP, Ensembl, SIFT, PolyPhen

4. REJECTED

• NHGRI catalog - GWAS results, catalog moved to EMBL-EBI GWAS




• SIFT - effects of non-synonymous SNPs; predictions. dbSNP, Ensembl gene. Not useful at the moment

• PolyPhen - effects of non-synonymous SNPs; prediction. Not useful at the moment

Drug-Phenotype - may be unnecessary/duplicative

1. IN USE

2. TARGETED


• SIDER - Drug side effects + frequency. Pubchem, MedDRA, for research & education

• RxNORM - normalized names for clinical drugs; links its names to many of the drug vocabularies commonly used in pharmacy management and drug interaction software

4. REJECTED

Drug-Drug

1. IN USE

2. TARGETED

• DrugBank - seems to be the most common


• FDA AERS signal detection (FDA Adverse Event Reporting System)

• First Databank (FDB). Drug formulation, information & use guidelines - business oriented. RxNORM

• Micromedex. Similar to FDB - business oriented

• MediSpan - Wolters Kluwer drug information. Many databases used: http://www.wolterskluwercdi.com/drug-data/all-databases/

• Gold Standard Drug Database - Elsevier

• Multum - tools

• National Drug File - Reference Terminology. Sources: RxNORM, NDF-RT, RxTerms, RxImageAccess API, DailyMed API

4. REJECTED

Somatic

1. IN USE

2. TARGETED

• COSMIC - Repository of somatic mutation, genotype and whole genome sequencing data from cancer studies (above)

• cBio Cancer Genomics Portal - Searchable Web-tool which integrates tumor and somatic mutation data from TCGA and the Memorial Sloan-Kettering Cancer Center

• Tumorscape - Repository of somatic CNV data from multiple cancer types





• UCSC Xena - replaced Cancer Genomics Browser

• ICGC - DNA sequencing and somatic mutation data from 50 different tumor types (89 projects; drugs, gene targets & mutations)

• TCGA - DNA sequencing and somatic mutation data from 20 different tumor types

• Variant GPS - May have been deprecated and converted to DCEG. Repository of genotyping data and genetic variants identified from targeted next-generation sequencing in cancer studies. Links to DCEG - epidemiology of cancers

• Oncomine - Searchable Web-tool which integrates somatic CNV data from TCGA

• Genomics Data Commons (GDC) Data Portal - Seq, transcriptome, SNPs, CNVs, clinical. Includes data from TCGA (33 projects) & TARGET (6 projects) - cancer programs

4. REJECTED

• Cancer Genomics Browser - Web-tool which integrates somatic mutation data from TCGA and other cancer genomic studies. Browser no longer under development. Genome data originally from Genome Reference Consortium. Uses ENCODE data. New tool available - UCSC Xena

eQTL - may not be useful

1. IN USE

2. TARGETED


• SCAN - SNP and CNV eQTLs identified in LCL. Classify SNPs & CNVs to functional types (eg eQTL, methyl-SNPs) and physical location (annotated to a gene if in LD). GTEx, ENDGAMe, PAAR

• GTEx - eQTL data from different tissues. ? normal tissues for reference. Tissue and data biobank (>30,000 samples). Analysis completion ~June/July 2017. Information released through dbGaP

4. REJECTED

• Eqtl - Genome browser with eQTL annotations. ? dead project, site last updated in June 2012

Ontologies

1. IN USE

2. TARGETED

• SNOMED CT - clinical ontologies

• GA4GH - genomic data. Not using the API but ours will have similar concepts.


• Bio2RDF: PharmGKB, OMIM, DrugBank, NCBI Gene, PubMed & many more. Multiple sources, but uses downloaded datasets, the most recent dated Sept 2014. Some datasets mentioned are research-oriented (e.g. Affy microarray probesets but not datasets). Outdated; a method for linking and creating compatible data from multiple sources. Code method

• HGNC - HUGO Gene Nomenclature Committee, gene nomenclature

• Variation Ontology (VariO)

• LUMC Mutalyzer - variant nomenclature

• Mutation Impact Ontology (MIO)



• Sequence Ontology (SO)

• Semantic science Integrated Ontology (SIO)

• Clinical Decision Support (CDS) ontology

4. REJECTED

Proteomics - TBC

1. IN USE

2. TARGETED


• PeptideAtlas

• PRIDE

• ProteomeXchange

• neXtProt

• UniProt

4. REJECTED

Glycomics - TBC

Metabolomics - TBC

1. IN USE

2. TARGETED

• MetaboLights


4. REJECTED

Transcriptomics - TBC

Epigenomics - TBC

Phenomics - TBC

Microbiome - TBC

Methylome - TBC

Immunome - TBC

Clinical Data

Clinical data are any data from hospitals, labs, insurance companies, or other direct health care sources. These can include patient medical records, lab tests, and medical images.



EHR Sources - TBC

• AllScripts

• McKesson

• eClinicalWorks

• Infor Cloverleaf

• Athena Health

• Meditech

• Athena

• Nextgen

• GE Healthcare

• Greenway

• QRS Healthcare

• Dr Chrono

Lab Sources - TBC (sorted by % of revenue)

• Quest (29.5%)

• LabCorp (18.8%) - https://www.labcorp.com/drug-testing/it-solutions/web-based-solutions#h2-web-services-interface

• Spectra (4.9%)

• DaVita (3.8%)

• Sonic Healthcare (1.8%)

• Bio Reference Laboratories (1.0%)

• Renalab (0.8%)

• Berkeley Heart Lab (0.8%)

• Apax Partners (0.8%)

• Nationwide Laboratory Services (0.6%)

Insurance - TBC

Insurance companies from which we can get data

Exogenous Data

Exogenous data are related to the patient’s lifestyle and environment. They include socioeconomics, environmental factors such as air quality, and lifestyle factors like diet and activity.




Socioeconomic - TBC

• educational

• insurance

• community and support

• ethnicity and ancestry

Lifestyle - TBC

• diet

• activity

• mental well-being

Environmental - TBC

• CO2/air

• water

• soil

• temp

• radiation

Remote Care Data


Advisory Support - TBC

• mhealth

• telehealth

Chronic Disease - TBC

• pain

• cardiac disease

• insulin

• mental



Sensors/Wearables/Apps - TBC

• clinical

• consumer

• medication adherence

• personal diet

• cessation

API Library

The API library

API Goals

The OmicMD API platform will allow the interoperable exchange of genomic information across multiple organizations and on multiple platforms. It overcomes the barriers of incompatible infrastructure between organizations and institutions to enable DNA data providers and consumers to better share genomic data and work together on a global scale, advancing genome research and clinical application.

1. The API must allow flexibility in server implementation, including:

• choice of persistent backend (e.g. files, SQL, NoSQL)

• choice of implementation language (e.g. Java, Python, Go)

• choice of authorization model (e.g. all public, all private, fine-grain ACLs)

• choice of import mechanism (e.g. self-service vs. centrally managed)

• choice of scale (from a single researcher working with dozens of sequences and homogeneous tools, to government-funded studies with over a million sequences and multiple tool chains)

2. The API must allow full-fidelity representation of data that was prepared using today’s common methods and stored using today’s common file formats.

• Note that real-world data files sometimes use invalid or ambiguous syntax, making it hard to understand the semantics of the contained data. If a server can’t figure out what those semantics are, it can throw an error on import. But whenever the semantics are clear, including when they’re specified in valid data files, the API must allow preserving them.

3. The API should allow adding more structure to data beyond today’s common practices (e.g. formal provenance, versioning).

4. The API should allow data owners to organize their data in ways that make sense to them, which implies there often isn’t One True Taxonomy. For example, a ReadGroupSet is defined as “a set of ReadGroups that are intended to be analyzed together” – different researchers or clinicians might choose different sets for different purposes.

5. At the same time, the API should encourage reusable organization, anticipating a future that supports cross-researcher and cross-repository data federation.
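One way to picture goals 1 and 2 together is a small backend-neutral storage interface plus an import step that rejects records whose meaning cannot be determined. The following is only a sketch under stated assumptions: the names `VariantStore`, `InMemoryVariantStore`, `AmbiguousRecordError`, `import_records`, and the minimal record keys are all hypothetical and are not part of the OmicMD or GA4GH APIs.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List

# Hypothetical minimal record shape; the real schema is not defined here.
REQUIRED_KEYS = ("chrom", "start", "end")


class AmbiguousRecordError(ValueError):
    """Raised on import when a record's semantics cannot be determined."""


class VariantStore(ABC):
    """Backend-neutral interface: files, SQL, or NoSQL could sit behind it."""

    @abstractmethod
    def put(self, record: Dict) -> None:
        ...

    @abstractmethod
    def find(self, chrom: str, start: int, end: int) -> List[Dict]:
        ...


class InMemoryVariantStore(VariantStore):
    """Simplest possible backend, useful for a single researcher or for tests."""

    def __init__(self) -> None:
        self._records: List[Dict] = []

    def put(self, record: Dict) -> None:
        # Copy the whole record so that fields beyond the required ones survive.
        self._records.append(dict(record))

    def find(self, chrom: str, start: int, end: int) -> List[Dict]:
        # Return records on the same chromosome that overlap [start, end].
        return [
            r for r in self._records
            if r["chrom"] == chrom and r["start"] <= end and r["end"] >= start
        ]


def import_records(store: VariantStore, records: Iterable[Dict]) -> int:
    """Import records, failing loudly on any record that lacks required keys."""
    count = 0
    for record in records:
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise AmbiguousRecordError(f"record is missing {missing}: {record!r}")
        store.put(record)
        count += 1
    return count
```

Because callers depend only on `VariantStore`, swapping the in-memory class for a database-backed one would not change the import code.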

Unresolved Issues

• What are the performance goals of the API in various configurations?



API Design

We use a three-tier classification structure for APIs.

• System - APIs that directly access databases and organize the data into functional domains (e.g. the “Person” data element)

• Process - APIs that deal with processes and orchestration (e.g. the “Order Test” process)

• Experience - APIs that combine Process and/or System APIs to expose data directly to an app developer (e.g. a web portal experience API that abstracts a FHIR system API and an order test process API with a lightweight RESTful interface). The Experience API can be used to create a front-end experience without understanding complex system domains or business logic encapsulated within process APIs.
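The three tiers described above can be sketched as plain classes, each depending only on the tier beneath it. This is an illustration rather than the platform's code: the class names, method names, and record fields are hypothetical, and a dictionary stands in for a real database.

```python
from typing import Dict, List


class PersonSystemAPI:
    """System tier: wraps a data store and exposes the "Person" domain."""

    def __init__(self, rows: Dict[str, Dict]) -> None:
        self._rows = rows  # stand-in for a real database

    def get_person(self, person_id: str) -> Dict:
        return self._rows[person_id]  # raises KeyError for unknown ids


class OrderTestProcessAPI:
    """Process tier: orchestrates the "Order Test" process."""

    def __init__(self, people: PersonSystemAPI) -> None:
        self._people = people
        self._orders: List[Dict] = []

    def order_test(self, person_id: str, test_code: str) -> Dict:
        person = self._people.get_person(person_id)
        order = {
            "order_id": len(self._orders) + 1,
            "person_id": person_id,
            "person_name": person["name"],
            "test_code": test_code,
            "status": "ordered",
        }
        self._orders.append(order)
        return order


class PortalExperienceAPI:
    """Experience tier: a thin, app-friendly view over the process tier."""

    def __init__(self, process: OrderTestProcessAPI) -> None:
        self._process = process

    def request_test(self, person_id: str, test_code: str) -> Dict:
        order = self._process.order_test(person_id, test_code)
        return {
            "order_id": order["order_id"],
            "message": f"Test {order['test_code']} ordered for {order['person_name']}",
        }
```

An app developer would call only `PortalExperienceAPI.request_test` and never touch the person store or the ordering logic directly.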

Example

This is an example published by MuleSoft of a three-tier architecture for health care. In it, data are abstracted from complex EHR systems like Epic into a canonical model that is represented via a set of FHIR REST APIs. Then, experience APIs are layered on top to provide better experiences for both patients and clinicians.

1. System Layer

System APIs abstract away the complexity of EHRs and other core systems of record from the data’s end user, while providing downstream insulation from any interface changes or rationalization of those systems.

Assets Included:

• CRM FHIR System API | Salesforce Implementation Template

• CRM FHIR System API | RAML Definition

• EHR FHIR System API | EHR Implementation Template*

• EHR FHIR System API | RAML Definition

• Fitness FHIR System API | Fitbit Implementation Template

• Fitness FHIR System API | RAML Definition



2. Process Layer

Process APIs decouple business processes that interact with and shape data from the source systems where the data originated. For example, the “schedule appointment” process contains logic that is common across multiple entities, which can be called by product, geography, or channel-specific parent services.

Assets Included:

• Onboarding Process API | Implementation Template

• Appointments Process API | Implementation Template

• Appointments Process API | RAML Specification

• EHR to CRM Sync Process API | Implementation Template

• Fitness Data Sync Process API | Implementation Template

• HL7 Event Handler | Implementation Template (Coming soon)

3. Experience Layer

Experience APIs are the means by which data can be reconfigured so that it is most easily consumed by its intended audience, all from a common data source, rather than setting up separate point-to-point integrations.

Assets included:

• Web Portal Experience API | Implementation Template

• Web Portal Experience API | RAML Specification

• Salesforce Experience API | Implementation Template

Supporting Assets:

• HL7 Connector

• 12 FHIR APIs | RAML Specification

API List

System APIs

System APIs will contain the connectors to each knowledge base or other endpoint and transform select data elements into a schema/format similar to GA4GH, though we will use our own data standards where it makes sense. We will not use the GA4GH API directly.

The general format for building system APIs is as follows:

1. The API ingests data from inputs into the flow - file(s) or another API

2. After annotating the data or processing it, the flow returns the output

3. The API transforms (maps) the output data to the GA4GH or similar standardized schema

4. The API exposes the output objects or entities we need
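The four steps above can be sketched as a single function that takes the annotation and mapping steps as parameters. Everything here is an assumption made for illustration: `run_system_api`, `example_annotate`, and `example_to_schema` are hypothetical names, the output keys are GA4GH-like placeholders that have not been checked against the GA4GH schema, and no coordinate conversion is attempted.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]


def run_system_api(
    source: Iterable[Record],
    annotate: Callable[[Record], Record],
    to_schema: Callable[[Record], Record],
) -> List[Record]:
    """Ingest rows, annotate each one, map it to the target schema, return all."""
    results: List[Record] = []
    for row in source:                         # step 1: ingest from a file or API
        annotated = annotate(row)              # step 2: annotate or process
        results.append(to_schema(annotated))   # step 3: map to the standard schema
    return results                             # step 4: expose the output objects


def example_annotate(row: Record) -> Record:
    """Toy annotation: add the length of the reference span."""
    annotated = dict(row)
    annotated["length"] = int(row["End"]) - int(row["Start"]) + 1
    return annotated


def example_to_schema(row: Record) -> Record:
    """Toy mapping to placeholder, GA4GH-like key names."""
    return {
        "reference_name": row["Chr"],
        "start": int(row["Start"]),
        "end": int(row["End"]),
        "reference_bases": row["Ref"],
        "alternate_bases": [row["Alt"]],
        "info": {"length": row["length"]},
    }
```

Passing the two steps in as functions keeps the connector for each knowledge base separate from the shared mapping logic.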

A graphical version of the system API inputs and outputs can be seen below.

SYSTEM API DATA FLOW INTERACTIVE DIAGRAM: Production

Development



ANNOVAR System API

ANNOVAR is an efficient software tool that uses up-to-date information to functionally annotate genetic variants detected from diverse genomes (including human genome hg18, hg19, hg38, as well as mouse, worm, fly, yeast and many others). Given a list of variants with chromosome, start position, end position, reference nucleotide and observed nucleotides, ANNOVAR can perform:

• Gene-based annotation: identify whether SNPs or CNVs cause protein coding changes and the amino acids that are affected. Users can flexibly use RefSeq genes, UCSC genes, ENSEMBL genes, GENCODE genes, AceView genes, or many other gene definition systems.

• Region-based annotation: identify variants in specific genomic regions, for example, conserved regions among 44 species, predicted transcription factor binding sites, segmental duplication regions, GWAS hits, database of genomic variants, DNAse I hypersensitivity sites, ENCODE H3K4Me1/H3K4Me3/H3K27Ac/CTCF sites, ChIP-Seq peaks, RNA-Seq peaks, or many other annotations on genomic intervals.

• Filter-based annotation: identify variants that are documented in specific databases, for example, whether a variant is reported in dbSNP, what the allele frequency is in the 1000 Genomes Project, NHLBI-ESP 6500 exomes or Exome Aggregation Consortium, calculate the SIFT/PolyPhen/LRT/MutationTaster/MutationAssessor/FATHMM/MetaSVM/MetaLR scores, find intergenic variants with GERP++ score < 2, or many other annotations on specific mutations.

Other functionalities: retrieve the nucleotide sequence at any user-specified genomic positions in batch, identify a candidate gene list for Mendelian diseases from exome data, and other utilities.
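As a concrete reading of the filter-based example "intergenic variants with GERP++ score < 2", the sketch below applies that rule to rows that have already been annotated. It does not call ANNOVAR. The function name is hypothetical, the column names `Func.refGene` and `GERP++_RS` are taken from the output field lists in this section, and the use of "." as the missing-value placeholder is an assumption.

```python
from typing import Dict, Iterable, List

MISSING = "."  # assumption: placeholder written when no score is available


def low_gerp_intergenic(rows: Iterable[Dict[str, str]],
                        threshold: float = 2.0) -> List[Dict[str, str]]:
    """Keep intergenic rows whose GERP++ score is strictly below the threshold."""
    kept: List[Dict[str, str]] = []
    for row in rows:
        if row.get("Func.refGene") != "intergenic":
            continue
        raw = row.get("GERP++_RS", MISSING)
        if raw == MISSING:
            continue  # no score, so the threshold cannot be evaluated
        if float(raw) < threshold:
            kept.append(row)
    return kept
```

Rows without a score are dropped rather than treated as passing, which is a design choice a real pipeline might reverse.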

The figure below shows a summary of ANNOVAR annotations:

The API should access ANNOVAR data directly, but the web version of ANNOVAR converts a .vcf input to a .vcf tabulated output file. Here’s a good article on common issues with .vcf files:

http://annovar.openbioinformatics.org/en/latest/articles/VCF/




Data Inputs

Required

• Chr

• Start

• End

• Ref

• Alt
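The five required inputs can be held in a small record type. As a sketch, the code below also derives them from a VCF data line for the simplest case only, a single-nucleotide substitution, where Start and End both equal the VCF POS value. The names `VariantInput` and `from_vcf_snv` are hypothetical; insertions, deletions, and multi-allelic records need extra handling that is deliberately left out.

```python
from dataclasses import dataclass

BASES = "ACGTN"


@dataclass(frozen=True)
class VariantInput:
    """The five required input fields listed above."""
    chr: str
    start: int
    end: int
    ref: str
    alt: str


def from_vcf_snv(line: str) -> VariantInput:
    """Build a VariantInput from one tab-separated VCF data line (SNVs only)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError("expected at least 5 tab-separated VCF columns")
    # The first five VCF columns are CHROM, POS, ID, REF, ALT.
    chrom, pos, _variant_id, ref, alt = fields[:5]
    is_snv = len(ref) == 1 and len(alt) == 1 and ref in BASES and alt in BASES
    if not is_snv:
        raise ValueError("only single-nucleotide substitutions are handled in this sketch")
    position = int(pos)
    # For a substitution of one base, the start and end positions coincide.
    return VariantInput(chrom, position, position, ref, alt)
```

Rejecting anything that is not a plain SNV keeps the sketch from silently producing wrong coordinates for indels.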

Data Outputs

We should add data structure (GA4GH / JSON / whatever) here.

Required

• Func.refGene

• Gene.refGene

• GeneDetail.refGene

• ExonicFunc.refGene

• AAChange.refGene

• Xref.refGene

• cytoBand
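Until the output data structure is decided, one simple option is to read a tab-delimited annotation table and keep only the required columns. The sketch below assumes the table has a header row whose column names match the list above; the function name `read_required_outputs` is hypothetical.

```python
import csv
import io
from typing import Dict, List, TextIO

REQUIRED_OUTPUTS = [
    "Func.refGene",
    "Gene.refGene",
    "GeneDetail.refGene",
    "ExonicFunc.refGene",
    "AAChange.refGene",
    "Xref.refGene",
    "cytoBand",
]


def read_required_outputs(handle: TextIO) -> List[Dict[str, str]]:
    """Read a tab-delimited table and keep only the required output columns."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_OUTPUTS if name not in header]
    if missing:
        raise ValueError(f"annotation table is missing columns: {missing}")
    # Columns outside REQUIRED_OUTPUTS (for example SIFT_score) are dropped.
    return [{name: row[name] for name in REQUIRED_OUTPUTS} for row in reader]
```

The function accepts any text handle, so it works equally with an open file or an `io.StringIO` buffer in tests.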

Available but not used

• SIFT_score

• SIFT_pred

• Polyphen2_HDIV_score

• Polyphen2_HDIV_pred

• Polyphen2_HVAR_score

• Polyphen2_HVAR_pred

• LRT_score

• LRT_pred

• MutationTaster_score

• MutationTaster_pred

• MutationAssessor_score

• MutationAssessor_pred

• FATHMM_score

• FATHMM_pred

• PROVEAN_score

• PROVEAN_pred



• VEST3_score

• CADD_raw

• CADD_phred

• DANN_score

• fathmm-MKL_coding_score

• fathmm-MKL_coding_pred

• MetaSVM_score

• MetaSVM_pred

• MetaLR_score

• MetaLR_pred

• integrated_fitCons_score

• integrated_confidence_value

• GERP++_RS

• phyloP7way_vertebrate

• phyloP20way_mammalian

• phastCons7way_vertebrate

• phastCons20way_mammalian

• SiPhy_29way_logOdds

Cerner FHIR System API

ClinVar System API

ClinVar is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence. ClinVar thus facilitates access to and communication about the relationships asserted between human variation and observed health status, and the history of that interpretation. ClinVar processes submissions reporting variants found in patient samples, assertions made regarding their clinical significance, information about the submitter, and other supporting data. The alleles described in submissions are mapped to reference sequences, and reported according to the HGVS standard. ClinVar then presents the data for interactive users as well as those wishing to use ClinVar in daily workflows and other local applications. Cli

Documents
