Scalable and Interoperable Data Services Applied to Genomics
by Data Fellas, Data Enthusiasts, v 4.0 (July 13th, ‘15)
Young Belgian Startup
The Data Fellas Startup
Data Science
Xavier Tordoir, @xtordoir
Andy Petrella, @noootsab
Data Processing
Scalable Machine Learning
Microservices-oriented
Data Fellas: Evangelizing
Training
● Scala
● Apache Spark (BE, in September): http://spark4devs.data-fellas.guru/
● Distributed Machine Learning Pipeline (Oakland, August): http://bigdatascala.bythebay.io/training.html
● Apache Spark (SFO with BoldRadius, August)
Talks
● Scala IO, Devoxx Belgium, Devoxx France, Scala Days, KTH, KUL, Spark Meetup London, …
● more to come (Italy, …)
PMC
● PMC Member at Strata NY
● PMC Member at Devoxx
● PMC Member at Foss4G
First: Data Science
Analysis
Production
Distribution
Rendering
Discovery
Catalog
Spark Notebook
using Services too
Share Analyses
Share Results
Share Datasets
Next: Applied TO Genomics
Genomics data is pretty big
● 100,000’s genomes in 2015
● 1,000,000’s …
● 100,000,000’s …
● …
Next: Applied TO Genomics
Genomics data is pretty big and of high dimensionality
One genome:
○ a sequence of 3 billion bases (the basic DNA component)
○ 30-60x coverage for quality
○ 10’s to 100’s of millions of variants (bases that vary from one individual to the next)
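To get a feel for these numbers, the figures above can be combined into a rough count of sequenced bases per genome. This is only a back-of-the-envelope sketch: the function name is hypothetical, and the only inputs taken from the slide are the 3 billion bases and the 30-60x coverage.

```python
# Back-of-the-envelope estimate of the raw bases read for one genome.
GENOME_BASES = 3_000_000_000  # ~3 billion bases per genome (figure from the slide)


def sequenced_bases(coverage: int, genome_bases: int = GENOME_BASES) -> int:
    """Total bases read when each position is covered `coverage` times on average."""
    return genome_bases * coverage


low = sequenced_bases(30)   # lower bound of the 30-60x range
high = sequenced_bases(60)  # upper bound of the 30-60x range
print(f"{low:.1e} to {high:.1e} bases read per genome")
```

Multiplying this by the number of genomes in a cohort gives the order of magnitude a storage and processing layer has to handle, before any compression.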
Next: Applied TO Genomics
e.g. the 1000 Genomes Project:
● 200 TB of compressed data
● organised in files/directories
● data formatted following specs in a … PDF
Data and services schemas are required
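To illustrate what an explicit data schema buys over a specification that lives in a PDF, here is a minimal sketch of a typed variant record. The field names and the tab-separated layout are hypothetical; they are not taken from the 1000 Genomes formats, bdg-formats, or GA4GH. The point is only that a machine-readable schema can reject malformed records automatically.

```python
from dataclasses import dataclass

VALID_BASES = set("ACGT")


@dataclass(frozen=True)
class Variant:
    """Hypothetical, simplified variant record (illustrative fields only)."""
    sample_id: str
    chromosome: str
    position: int   # 1-based position on the chromosome
    reference: str  # reference allele
    alternate: str  # observed allele

    def __post_init__(self):
        # Validate on construction so bad records never enter a pipeline.
        if self.position < 1:
            raise ValueError("position must be >= 1")
        for allele in (self.reference, self.alternate):
            if not allele or set(allele) - VALID_BASES:
                raise ValueError(f"invalid allele: {allele!r}")


def parse_variant(line: str) -> Variant:
    """Parse one tab-separated line (assumed layout) into a validated Variant."""
    sample_id, chromosome, position, reference, alternate = line.rstrip("\n").split("\t")
    return Variant(sample_id, chromosome, int(position), reference, alternate)
```

A real schema would cover far more (genotypes, quality scores, ambiguous bases), but even this small version turns the format rules into something a program can check.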
What do we do with genomics data?
Lots of Querying and Learning:
E.g.
● Population structure is a fundamental basis
● Querying relationships between genomes and other biological features
Hey… no one has all the data!
Metadata
What do we do with genomics data?
Lots of Querying and Learning:
E.g.
● We do some specific Modelling on some data…
Hey… no two serve the same computations!
Service Discovery
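The idea of service discovery can be sketched as a registry in which each endpoint advertises the computations it serves, so that a client can look up who offers what. This is a hypothetical in-memory illustration, not the Shar3 or GA4GH API; the class, the endpoint URLs, and the computation names are all made up for the example.

```python
from collections import defaultdict
from typing import Iterable, List


class ServiceRegistry:
    """Hypothetical in-memory registry mapping computations to endpoints."""

    def __init__(self) -> None:
        self._by_computation = defaultdict(set)

    def register(self, endpoint: str, computations: Iterable[str]) -> None:
        """Record that `endpoint` serves each of the given computations."""
        for name in computations:
            self._by_computation[name].add(endpoint)

    def discover(self, computation: str) -> List[str]:
        """Return the endpoints serving `computation`, sorted for stable output."""
        return sorted(self._by_computation.get(computation, set()))


registry = ServiceRegistry()
# Placeholder endpoints and computation names, for illustration only.
registry.register("https://lab-a.example/api", ["population-structure", "variant-query"])
registry.register("https://lab-b.example/api", ["variant-query"])
print(registry.discover("variant-query"))
```

Because different sites serve different computations, the lookup is keyed by computation rather than by site.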
Interoperable…
Analysis
Production
Distribution
Rendering
Discovery
Share Analyses
Share Results
Share Datasets
Interoperable & scalable…
GA4GH + Shar3 = Med@Scale
+ ADAM & Spark
+ In-memory optimization (Tachyon)
+ Deployment (e.g. DCOS)
Wrap-up
Follow us @DataFellas and get notified about our
+ sharing platform at scale: Shar3
+ Google Genomics At Home (^.^): Med@Scale
+ future plans: modules for Trading, Geospatial, other medical data, …
References
ADAM: https://github.com/bigdatagenomics/adam
Bdg-Formats: https://github.com/bigdatagenomics/bdg-formats
GA4GH website: http://genomicsandhealth.org/
GA4GH data working group: http://ga4gh.org/
Spark Notebook: https://github.com/andypetrella/spark-notebook/
Med-At-Scale: https://github.com/med-at-scale/high-health
Data Fellas: http://data-fellas.guru/
Training: http://spark4devs.data-fellas.guru/