gabe-rudy
Background
Golden Helix
- Founded in 1998
- Genetic association software
- Analytic services
- Hundreds of users worldwide
- Over 900 customer citations in scientific journals
Products I Build with My Team
- SNP & Variation Suite (SVS)
  - SNP, CNV, NGS tertiary analysis
  - Import and deal with all flavors of upstream data
- VarSeq
  - Annotate and filter variants in gene panels, exomes and genomes for clinical labs and researchers
- GenomeBrowse (Free!)
  - Visualization of everything with genomic coordinates. All standardized file formats.
Database Trends
VarSeq

VSReports
- Tertiary analysis to report in one click
- Focused and actionable data
- Modeled on ACMG guidelines
- Hereditary and cancer templates
- OMIM included

VSPipeline
- Command line runner
- Integrate with your current bioinformatics pipeline
- Create repeatable clinical workflows for CLIA- and CAP-certified analysis
- Supports high-throughput scenarios
90s - SQL
- Transactions
- Disk-structure optimized
- Fixed schema
- SQL matures
- Small memory footprints
- Master <-> Slaves
- Threaded / Locking
- Expensive large mainframes/servers
10s - NewSQL
- Scale out
- First-class sharding
- Utilize cheap memory
- Don't let disk be the bottleneck
- Support stream / distributed analytics
00s - NoSQL
- "Web Scale" - distributed
- Eventually consistent
- Schema-less, key-based
- Avoid joins
- Peer-to-peer
- Memory cheap
- Many cheap commodity servers in datacenter configurations
> SELECT * FROM trends GROUP BY decade;
The “Database” Market in Thirds
OLTP
- ACID / Transactions; "traditional" row-based: MySQL, Postgres, Oracle, MSSQL
- NewSQL: VoltDB (scale-out), Google Spanner/F1, MemSQL, Clustrix

Key and Hierarchical Based
- Wide columnar stores: BigTable / HBase, Cassandra
- Hierarchical/Document: MongoDB, Couchbase
- Key-value stores: Redis, Memcached, FoundationDB
- Other: Tuple/Triple-stores

Data Warehousing
- Query optimized: Amazon Redshift, HP Vertica, Infobright, Google BigQuery, Teradata, Cloudera Impala, Hadoop+Hive
http://www.se-radio.net/2013/12/episode-199-michael-stonebraker/
Mike Stonebraker
- Illustra (c Postgres), acquired by Informix (1996)
- StreamBase (c Aurora), acquired by TIBCO (2013)
- Vertica (c C-Store), acquired by HP (2011)
- VoltDB (c H-Store), $23M in funding over 4 rounds
- Paradigm4 (c SciDB)
INGRES – 73 -> 90
Postgres – 84 -> 92
Mariposa – 92 -> 97
Aurora – 01 -> 08
C-Store – 05 -> 09
H-Store – 07 -> Present
SciDB – 08 -> Present
Data Warehouse Solutions
Big Data, Small Analytics => Don’t use MapReduce
http://www.slideshare.net/Hapyrus/amazon-redshift-is-10x-faster-and-cheaper-than-hadoop-hive
Data Warehousing / Scientific Analysis => Columnar
You’ve got to know what regression means, what Naïve Bayes means, what k-Nearest Neighbors means. It’s all statistics.
All of that stuff turns out to be defined on arrays. It’s not defined on tables. The tools of future data scientists are going to be array-based tools. Those may live on top of relational database systems. They may live on top of an array database system, or perhaps something else. It’s completely open.
• Columns -> faster queries
• Divide columns into chunks
• Compress chunks (better ratios than rows)
• Pre-compute chunk-level attributes (min/max etc.)
• Flexible storage layer
  • Distributed
  • Encodings (Parquet, ORC/Hive, custom)
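The chunking scheme above can be sketched in a few lines (illustrative Python, not VarSeq/TSF internals): each chunk carries precomputed min/max attributes, so a range query can skip whole chunks that cannot match.

```python
# Illustrative sketch of chunked columnar storage with chunk-level
# min/max metadata used to skip chunks during a range scan.
CHUNK_SIZE = 4

def make_chunks(values, chunk_size=CHUNK_SIZE):
    chunks = []
    for i in range(0, len(values), chunk_size):
        chunk = values[i:i + chunk_size]
        # Pre-compute chunk-level attributes (min/max) at write time.
        chunks.append({"min": min(chunk), "max": max(chunk), "data": chunk})
    return chunks

def range_query(chunks, lo, hi):
    hits = []
    for chunk in chunks:
        # Skip whole chunks that cannot contain a match.
        if chunk["max"] < lo or chunk["min"] > hi:
            continue
        hits.extend(v for v in chunk["data"] if lo <= v <= hi)
    return hits

positions = [100, 150, 200, 250, 900, 950, 1000, 1100, 300, 310, 320, 330]
chunks = make_chunks(positions)
print(range_query(chunks, 290, 340))  # only the last chunk is scanned
```

The same skip test is what Parquet row-group statistics and similar chunk attributes buy you in real columnar stores.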
Extract, Transform, Load (ETL)
"Dimensional Modeling"
- Fact tables & dimension tables
- Fact tables are often measurements over time
- Dimension tables go into item details
- Denormalized data, complexity hidden
- Often many sources loaded into the same warehouse
  - Logs
  - One or more relational databases (sales, customer-facing, etc.)
  - Vendor / payment information

Example
- "Like table": datetime, user_id, post_id, client_data
- "User table": user_id, subscription_type, last_paid, has_android_app
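The slide's hypothetical like/user tables make a compact worked example: the fact table ("likes") joins to the dimension table ("users"), and a typical warehouse query aggregates fact rows by a dimension attribute. A sketch using Python's sqlite3; the schema names come from the slide, the rows are invented.

```python
import sqlite3

# Star-schema sketch: "likes" is the fact table, "users" the dimension table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE likes (datetime TEXT, user_id INTEGER, post_id INTEGER, client_data TEXT);
CREATE TABLE users (user_id INTEGER PRIMARY KEY, subscription_type TEXT,
                    last_paid TEXT, has_android_app INTEGER);
INSERT INTO users VALUES (1, 'free', NULL, 0), (2, 'premium', '2014-01-01', 1);
INSERT INTO likes VALUES
  ('2014-03-01T10:00', 1, 101, 'web'),
  ('2014-03-01T10:05', 2, 101, 'android'),
  ('2014-03-01T10:09', 2, 102, 'android');
""")
# Typical warehouse query: fact measurements grouped by a dimension attribute.
rows = conn.execute("""
  SELECT u.subscription_type, COUNT(*) AS n_likes
  FROM likes l JOIN users u ON u.user_id = l.user_id
  GROUP BY u.subscription_type ORDER BY u.subscription_type
""").fetchall()
print(rows)  # [('free', 1), ('premium', 2)]
```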
Genomics (and Other Life Science) Data: Data-Warehouse-Like
Gabe’s Adjusted “Moore’s Law” NGS Cost Graph
Sequencers: Versatile tools for science
Genomics is Big Data
5,000 public data repositories

Broad Institute:
- Process 40K samples/year
- 1,000 people
- 51 high-throughput sequencers
- 10+ PB of storage
1 Genome in Data
- ~300GB compressed sequence data
- ~150MB compressed variant data
- Seq data went through 5-6 steps
We Want Variants
Differences between your DNA and a reference come in many sizes:
- Single-letter substitutions are called Single Nucleotide Polymorphisms (SNPs)
- Small "length polymorphisms" are called Insertions/Deletions (InDels)
- Large duplications/deletions are called Copy Number Variations (CNVs)

The average European has ~3 million small variations relative to the reference; ~100K of those fall in the ~30K "gene coding" regions (~2% of the genome).
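A toy classifier shows how the size classes above fall out of REF/ALT allele lengths (illustrative only; real CNV detection uses read depth and other signals rather than allele strings):

```python
# Classify a variant by comparing reference and alternate allele lengths.
# Simplified: large CNVs are not representable this way in practice.
def classify(ref, alt):
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) != len(alt):
        return "insertion" if len(alt) > len(ref) else "deletion"
    return "MNP"  # multi-nucleotide substitution

print(classify("A", "G"))     # SNP
print(classify("A", "ACT"))   # insertion
print(classify("TGGC", "T"))  # deletion
```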
Next Generation Sequencing Analysis
Primary Analysis
- Analysis of hardware-generated data; software built by vendors
- Uses FPGAs and GPUs to handle real-time optical or electrical signals from the sequencing hardware

Secondary Analysis
- Filtering/clipping of "reads" and their qualities
- Alignment/assembly of reads
- Recalibration, de-duplication, variant calling on aligned reads

Tertiary Analysis ("Sense Making")
- QA and filtering of variant calls
- Annotation (querying) of variants against databases, filtering on results
- Merging/comparing multiple samples (multiple files)
- Visualization of variants in genomic context
- Statistics on matrices
Applications of NGS Data in the Clinic
Carrier screening – prenatal and standard
Lifetime risk prediction
Genetic disorder diagnostics
Oncology care
PGx – dosage and care
Public Annotations – Left Joins
Exact Matching "Variants"
- "Population Catalogs"
  - 1000 Genomes (84M variants)
  - NHLBI 6,500 Exomes (2M variants)
  - ExAC 61,486 exomes (10M variants)
- Clinical classifications
- Precomputed predictions / scores
  - dbNSFP - 89.6M predictions

Algorithmic Classification
- How a variant interacts with genes (85K transcripts)

Region Based
- Disease regions
- Gene lists
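Exact-match annotation is literally a left join on the (chrom, pos, ref, alt) key: every called variant is kept, and catalog fields attach only when the key matches. A minimal sketch with an invented two-entry catalog:

```python
# "Annotation as a left join": calls are the left side, the population
# catalog the right side. The catalog and its allele_freq field are
# invented here for illustration.
catalog = {
    ("1", 55505, "A", "T"): {"allele_freq": 0.012},
    ("2", 21229, "G", "C"): {"allele_freq": 0.304},
}

calls = [("1", 55505, "A", "T"), ("1", 55510, "C", "G")]

annotated = [
    {"variant": v, **catalog.get(v, {"allele_freq": None})}
    for v in calls
]
print(annotated)
# the first call matches the catalog; the second keeps allele_freq=None,
# exactly the semantics of a SQL LEFT JOIN
```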
Annotations are Hard!
HGVS is a standard that is not standard
- Tries to serve different goals
- Many representations of the same variant
- Should not be used as IDs, but there are not many good alternatives

Transcripts
- Transcript set choice is extremely important, yet hard to curate with meaningful attributes as well

Public Data Curation
- ClinVar: multi-record lines
- NHLBI: MAF vs AAF, splitting "glob" fields
- 1kG: no genotype counts
- ExAC: multi-allelic splitting, left-align
- COSMIC: no Ref/Alt, only HGVS
- dbNSFP: abbreviations and aggregate scores

Versioning and Issues
- ClinVar: missing variants in VCF
- dbSNP: patches without version changes
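Two of the curation steps named above, multi-allelic splitting and trimming toward one canonical representation, can be sketched as follows. Note that full left-alignment also requires shifting indels through the reference sequence, which this simplified trim omits.

```python
# Split a multi-allelic VCF record into one record per alternate allele.
def split_multiallelic(chrom, pos, ref, alt_field):
    return [(chrom, pos, ref, alt) for alt in alt_field.split(",")]

# Trim shared bases so equivalent representations collapse to one
# (simplified; real left-alignment consults the reference sequence).
def trim(pos, ref, alt):
    # drop shared trailing bases, then shared leading bases
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt

print(split_multiallelic("1", 100, "A", "T,G"))  # [('1', 100, 'A', 'T'), ('1', 100, 'A', 'G')]
print(trim(100, "CTCC", "CCC"))                  # (100, 'CT', 'C')
```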
Splice Mutation
N-Glycanase Deficiency
http://www.ngly1.org/

Matthew Might and Matt Wilsey. The shifting model in clinical diagnostics: how next-generation sequencing and families are altering the way rare diseases are discovered, studied, and treated. Genetics in Medicine. March 2014.
Personalized Medicine
- Cancer is a disease of the genome
- "Molecular targeted" drugs are effective and usually side-effect free
- The genetic testing required to direct cancer treatment is becoming affordable
Tabular Storage Format
Postgres FDW
TSF
- Uses SQLite as a container
- SQLite has great caching, multi-threaded, and read/write properties
- Specialized genomic index, plus lexicographical indexes (LevelDB used for string sorting)
- GZIP / BLOSC chunk compression
- Primitive, Enum and List types
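A hypothetical mini-TSF illustrates the container idea: column chunks are packed, compressed, and stored as blobs in SQLite, keyed by field name and chunk index. This is a sketch of the approach, not the actual TSF layout.

```python
import sqlite3
import struct
import zlib

# Hypothetical mini-TSF: each row holds one compressed chunk of one column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunks (field TEXT, idx INTEGER, data BLOB,"
    " PRIMARY KEY (field, idx))"
)

def write_chunk(field, idx, values):
    # Pack int32 values and compress the chunk before storing it as a blob.
    raw = struct.pack(f"{len(values)}i", *values)
    conn.execute("INSERT INTO chunks VALUES (?, ?, ?)",
                 (field, idx, zlib.compress(raw)))

def read_chunk(field, idx):
    (blob,) = conn.execute(
        "SELECT data FROM chunks WHERE field = ? AND idx = ?",
        (field, idx)).fetchone()
    raw = zlib.decompress(blob)
    return list(struct.unpack(f"{len(raw) // 4}i", raw))

write_chunk("pos", 0, [100, 105, 230])
print(read_chunk("pos", 0))  # [100, 105, 230]
```

SQLite handles locking, caching, and atomic writes for free, while the chunked blobs keep the columnar layout and compression under the format's own control.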
TSF in Practice - VarSeq
TSF Backed Relational Data Store
- More efficient conditional queries
- Invisible joins (i.e. row_id => array offset)
- Size on disk
- "NULL [NA, missing] values are part of the domain space, which avoids auxiliary bit masks at the expense of 'losing' a single value from the domain."
- SQL front-end allows use as a back-end to existing analytic and web stacks