17
Analyze Genomes: A Federated In-Memory Computing Platform for Life Sciences Dr. Matthieu-P. Schapranow Festival of Genomics, London, U.K. Jan 19, 2016

Festival of Genomics 2016 London: Analyze Genomes: A Federated In-Memory Computing Platform for Life Sciences

Embed Size (px)

Citation preview

Page 1: Festival of Genomics 2016 London: Analyze Genomes: A Federated In-Memory Computing Platform for Life Sciences

Analyze Genomes: A Federated In-Memory Computing Platform for Life Sciences

Dr. Matthieu-P. Schapranow Festival of Genomics, London, U.K.

Jan 19, 2016

Page 2: Festival of Genomics 2016 London: Analyze Genomes: A Federated In-Memory Computing Platform for Life Sciences

Schapranow, Festival of Genomics, Jan 19, 2016

Recap: we.analyzegenomes.com Real-time Analysis of Big Medical Data

2

In-Memory Database

Extensions for Life Sciences

Data Exchange, App Store

Access Control, Data Protection

Fair Use

Statistical Tools

Real-time Analysis

App-spanning User Profiles

Combined and Linked Data

Genome Data

Cellular Pathways

Genome Metadata

Research Publications

Pipeline and Analysis Models

Drugs and Interactions

Challenges of Big Medical Data

Drug Response Analysis

Pathway Topology Analysis

Medical Knowledge Cockpit Oncolyzer

Clinical Trial Recruitment

Cohort Analysis

...

Indexed Sources

Page 3: Festival of Genomics 2016 London: Analyze Genomes: A Federated In-Memory Computing Platform for Life Sciences

Combined column and row store

Map/Reduce Single and multi-tenancy

Lightweight compression

Insert only for time travel

Real-time replication

Working on integers

SQL interface on columns and rows

Active/passive data store

Minimal projections

Group key Reduction of software layers

Dynamic multi-threading

Bulk load of data

Object-relational mapping

Text retrieval and extraction engine

No aggregate tables

Data partitioning Any attribute as index

No disk

On-the-fly extensibility

Analytics on historical data

Multi-core/ parallelization

Our Technology In-Memory Database Technology

+

+++

+

P

v

+++t

SQL

xx

T

disk

3

Schapranow, Festival of Genomics, Jan 19, 2016

Challenges of Big Medical Data

Page 4: Festival of Genomics 2016 London: Analyze Genomes: A Federated In-Memory Computing Platform for Life Sciences

In-Memory Data Management Overview

Schapranow, Festival of Genomics, Jan 19, 2016

Analyze Genomes: Real-world Examples

4

Advances in Hardware

64 bit address space – 4 TB in current server boards

4 MB/ms/core data throughput

Cost-performance ratio rapidly declining

Multi-core architecture (6 x 12 core CPU per blade)

Parallel scaling across blades

1 blade ≈50k USD = 1 enterprise class server

Advances in Software

Row and Column Store Compression Partitioning Insert Only

A

Parallelization

+++

++

P

Active & Passive Data Stores

Page 5: Festival of Genomics 2016 London: Analyze Genomes: A Federated In-Memory Computing Platform for Life Sciences

■  Main memory access is the new bottleneck

■  Lightweight compression can reduce this bottleneck, i.e.

□  Lossless

□  Improved usage of data bus capacity

□  Work directly on compressed data

Lightweight Compression

Schapranow, Festival of Genomics, Jan 19, 2016

Analyze Genomes: Real-world Examples

5

Attribute Vector

RecId ValueId 1  C18.0 2  C32.0 3  C00.9 4  C18.0 5 C20.0 6 C20.0 7 C50.9 8 C18.0

Inverted Index

ValueId RecIdList 1  2 2  3 3  5,6 4  1,4,8 5  7

Data Dictionary

ValueId Value 1 Larynx 2 Lip 3 Rectum 4 Colon 5 Mama Table

… … … C18.0 Colon 646470 C50.9 Mama 167898 C20.0 Rectum 647912 C20.0 Rectum 215678 C18.0 Colon 998711 C00.9 Lip 123489 C32.0 Larynx 357982 C18.0 Colon 091487

RecId 1 RecId 2 RecId 3 RecId 4 RecId 5 RecId 6 RecId 7 RecId 8 …

•  Typical compression factor of 10:1 for enterprise software

•  In financial applications up to 50:1

Page 6: Festival of Genomics 2016 London: Analyze Genomes: A Federated In-Memory Computing Platform for Life Sciences

■  Row stores are designed for operative workload, e.g.

□  Create and maintain meta data for tests

□  Access a complete record of a trial or test series

■  Column stores are designed for analytical work, e.g.

□  Evaluate the number of positive test results

□  Identification of correlations or test candidates

■  In-Memory approach: Combination of both stores

□  Increased performance for analytical work

□  Operative performance remains interactively

Combined Column and Row Store

Schapranow, Festival of Genomics, Jan 19, 2016

Analyze Genomes: Real-world Examples

6

+

Page 7: Festival of Genomics 2016 London: Analyze Genomes: A Federated In-Memory Computing Platform for Life Sciences

■  Traditional databases allow four data operations:

□  INSERT, SELECT and

□  DELETE, UPDATE

■  Insert-only requires only INSERT, SELECT to maintain a complete history (bookkeeping systems)

■  Insert-only enables time travelling, e.g. to

□  Trace changes and reconstruct decisions

□  Document complete history of changes, therapies, etc.

□  Enable statistical observations

Insert-Only / Append-Only

Schapranow, Festival of Genomics, Jan 19, 2016

Analyze Genomes: Real-world Examples

7

+++

+

Page 8: Festival of Genomics 2016 London: Analyze Genomes: A Federated In-Memory Computing Platform for Life Sciences

■  Horizontal Partitioning

□  Cut long tables into shorter segments

□  E.g. to group samples with same relevance

■  Vertical Partitioning

□  Split off columns to individual resources

□  E.g. to separate personalized data from experiment data

■  Partitioning is the basis for

□  Parallel execution of database queries

□  Implementation of data aging and data retention management

Partitioning

Schapranow, Festival of Genomics, Jan 19, 2016

Analyze Genomes: Real-world Examples

8

Page 9: Festival of Genomics 2016 London: Analyze Genomes: A Federated In-Memory Computing Platform for Life Sciences

■  Modern server systems consist of x CPUs, e.g.

■  Each CPU consists of y CPU cores, e.g. 12

■  Consider each of the x*y CPU core as individual workers, e.g. 6x12

■  Each worker can perform one task at the same time in parallel

■  Full table scan of database table w/ 1M entries results in 1/x*1/y search time when traversing in parallel

□  Reduced response time

□  No need for pre-aggregated totals and redundant data

□  Improved usage of hardware

□  Instant analysis of data

Multi-core and Parallelization

Schapranow, Festival of Genomics, Jan 19, 2016

Analyze Genomes: Real-world Examples

9

Page 10: Festival of Genomics 2016 London: Analyze Genomes: A Federated In-Memory Computing Platform for Life Sciences

■  Consider two categories of data stores: active and passive

■  Active data are accessed frequently & updates are expected, e.g.

□  Most recent experiment results, e.g. last two weeks

□  Samples that have not been processed, yet

■  Passive data are used for analytical & statistical purposes, e.g.

□  Samples that were processed 5 years ago

□  Meta data about seeds that are not longer produced

■  Moving passive data on slower storages

□  Reduces main memory demands

□  Improves performance for active data

Active and Passive Data Store

Schapranow, Festival of Genomics, Jan 19, 2016

Analyze Genomes: Real-world Examples

10

P

Page 11: Festival of Genomics 2016 London: Analyze Genomes: A Federated In-Memory Computing Platform for Life Sciences

■  Layers are introduced to abstract from complexity

■  Each layer offers complete functionality, e.g. meta data of samples

■  Less layer result in

□  Less code to maintain

□  More specific code

□  Reduced resource demands

□  Improves performance of applications due to eliminating obsolete processing steps

Reduction of Application Layers

Schapranow, Festival of Genomics, Jan 19, 2016

Analyze Genomes: Real-world Examples

11 xx

Page 12: Festival of Genomics 2016 London: Analyze Genomes: A Federated In-Memory Computing Platform for Life Sciences

■  1,000 core cluster at Hasso Plattner Institute with 25 TB main memory

■  25 nodes, each consists of:

□  40 cores

□  1 TB main memory

□  Intel® Xeon® E7- 4870

□  2.40GHz

□  30 MB Cache

In-Memory Database Technology Hardware Characteristics at HPI FSOC Lab

Schapranow, Festival of Genomics, Jan 19, 2016

Challenges of Big Medical Data

12

Page 13: Festival of Genomics 2016 London: Analyze Genomes: A Federated In-Memory Computing Platform for Life Sciences

In-Memory Database Technology Use Case: Analysis of Genomic Data

Schapranow, Festival of Genomics, Jan 19, 2016

Analyze Genomes: Real-world Examples

13

Analysis of Genomic Data

Alignment and Variant Calling Analysis of Annotations in World-wide DBs

Bound To CPU Performance Memory Capacity

Duration Hours – Days Weeks

HPI Minutes Real-time

In-Memory Technology

Multi-Core

Partitioning & Compression

Page 14: Festival of Genomics 2016 London: Analyze Genomes: A Federated In-Memory Computing Platform for Life Sciences

Federated In-Memory Database (FIMDB) Incorporating Local Compute Resources

Schapranow, Festival of Genomics, Jan 19, 2016

Challenges of Big Medical Data

14 S ite B

Federated In -M em oryD atabase Instance ,

A lgorithm s, andApp lications M anaged

by Service P rovider

Clou

d Ser

vice

Prov

ider

S ite A

FIMD

BA.

1

FIMD

BA.

2

FIMD

BA.

3

FIMD

BA.

4

FIMD

BA.

5

FIMD

BB.

1

FIMD

BB.

2

FIMD

BB.

3

FIMD

BC.

1

Federated In -M em oryD atabase Instances

M aster D ataM anaged by

Service P rovider

Sensitive D atareside a t S ite

Page 15: Festival of Genomics 2016 London: Analyze Genomes: A Federated In-Memory Computing Platform for Life Sciences

Multiple Sites Forming the Federated In-Memory Database System

Schapranow, Festival of Genomics, Jan 19, 2016

Challenges of Big Medical Data

15

Federated In-M em ory D atabase System

M aster D ata andS hared A lgorithm s

S ite A S ite BC loud Provider

C loud IM D BInstance

Local IM D BInstance

S ensitive D ata,e.g . P atient D ata

R

Local IM D BInstance

Sensitive D ata,e .g. P atien t D ata

R

Page 16: Festival of Genomics 2016 London: Analyze Genomes: A Federated In-Memory Computing Platform for Life Sciences

■  Front-End

□  Any user interface supported, e.g. UI5, JavaScript, iOS

□  Communication protocol: Ajax/JSON

■  API Server

□  Ruby on Rails

□  Translates into database queries

■  Platform

□  IMDB with extensions for Life Sciences, e.g. worker framework, scheduler, third-party tools, etc.

Software Architecture

Schapranow, Festival of Genomics, Jan 19, 2016

Challenges of Big Medical Data

16

H igh-Perform ance In-M em ory G enom e C loud P la tfo rm

R esearcherw ith W ebBrow ser

C lin ic ianw ith M ob ileD evice

Ana lytica l D ataP rocessing

Mas

ter D

ata,

e.g

.R

efer

ence

Gen

omes

,A

nnot

atio

ns, a

ndC

linic

al T

rials

R

G raph P rocessing

Specific C loudApp lication

Specific C loudApp lica tion s

Specific M obileApplication

RR

Tran

sact

iona

l Dat

a,e.

g. G

enom

e R

eads

,Pat

ient

EM

R D

ata,

and

Wor

ker Sta

tus

In -M em ory D atabase InstancesIn-M em ory D atabase Instances

RR

Scheduling andW orker F ram ew ork

Annota tion andU pdater Fram ew ork

Inte

rnet

Intra

net

Dat

a La

yer

Pla

tform

Lay

erA

pplic

atio

n La

yer

Accounting Security Extensions

Page 17: Festival of Genomics 2016 London: Analyze Genomes: A Federated In-Memory Computing Platform for Life Sciences

Keep in contact with us!

Hasso Plattner Institute Enterprise Platform & Integration Concepts (EPIC)

Program Manager E-Health Dr. Matthieu-P. Schapranow

August-Bebel-Str. 88 14482 Potsdam, Germany

Dr. Matthieu-P. Schapranow [email protected] http://we.analyzegenomes.com/

Schapranow, Festival of Genomics, Jan 19, 2016

Challenges of Big Medical Data

17