MongoDB and the Connectivity Map: Making Connections Between Genetics and Disease

DESCRIPTION

The Broad Institute has developed a novel high-throughput gene-expression profiling technology and has used it to build an open-source catalog of over a million profiles that captures the functional states of cells treated with drugs and other types of perturbations. Referred to as the Connectivity Map (or CMap), these data, when paired with pattern-matching algorithms, facilitate the discovery of connections between drugs, genes, and diseases. We wished to expose this resource to scientists around the world via an API that is easily accessible to programmers and biologists alike. We required a database solution that could handle a variety of data types and frequent changes to the schema. We realized that a relational database did not fit our needs, and gravitated towards MongoDB for its ease of use, support for dynamic schemas and complex data structures, and expressive query syntax. In this talk, we’ll walk through how we built the CMap library. We’ll discuss why we chose MongoDB, the various schema design iterations and tradeoffs we’ve made, how people are using the API, and what we’re planning for the next generation of biomedical data.

Page 1: MongoDB and the Connectivity Map: Making Connections Between Genetics and Disease

MongoDB and the Connectivity Map: Making connections between genetics and disease

Page 6: MongoDB and the Connectivity Map: Making Connections Between Genetics and Disease

Corey Rajiv

Page 7: MongoDB and the Connectivity Map: Making Connections Between Genetics and Disease

Gene Expression: a common language

Page 13: MongoDB and the Connectivity Map: Making Connections Between Genetics and Disease

~7,000 experiments

Over 19,000 registered users

Cited by over 1,200 scientific reports

Page 14: MongoDB and the Connectivity Map: Making Connections Between Genetics and Disease

2006

Page 15: MongoDB and the Connectivity Map: Making Connections Between Genetics and Disease

2014

Page 17: MongoDB and the Connectivity Map: Making Connections Between Genetics and Disease

CMap-LINCS dataset: 1.4 million gene expression profiles

3,800 Genes (shRNA & cDNA)
• Targets/pathways of approved drugs
• Candidate disease genes
• Community nominations

15 Cell types
• Banked primary cell types
• Cancer cell lines
• Primary hTERT-immortalized
• Patient-derived iPS cells
• Community nominated

12,488 Compounds
• FDA-approved drugs
• Bioactive tool compounds
• Screening hits

Page 18: MongoDB and the Connectivity Map: Making Connections Between Genetics and Disease

CMap Data: Easy to describe, tough to model

• Diverse use-cases

• Users with varying technical expertise

• Annotations are complex and incomplete

• Frequent updates

Page 19: MongoDB and the Connectivity Map: Making Connections Between Genetics and Disease

Data Model: An agile philosophy keeps the model tractable

• Store just what’s needed

• Refactor frequently

• Test and use daily

Page 20: MongoDB and the Connectivity Map: Making Connections Between Genetics and Disease

Data Model: An inventory of signatures

siginfo

Page 21: MongoDB and the Connectivity Map: Making Connections Between Genetics and Disease

Data Model: Shared fields as separate collections

siginfo

cellinfo

pertinfo

Page 22: MongoDB and the Connectivity Map: Making Connections Between Genetics and Disease

Data Model: Add computed fields and external metadata

siginfo cellinfo

Page 23: MongoDB and the Connectivity Map: Making Connections Between Genetics and Disease

Data Model: Duplicate data to optimize lookups

siginfo pertinfo
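
The trade-off named on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides name the collections `siginfo`, `cellinfo`, and `pertinfo`, but every field name and value below (`sig_id`, `pert_name`, and so on) is hypothetical, and plain dictionaries stand in for MongoDB documents.

```python
# Hypothetical documents: all field names and values are illustrative assumptions,
# not the actual CMap schema.
pertinfo = {
    "P1": {"pert_id": "P1", "pert_name": "example-compound", "pert_type": "compound"},
}

# Normalized signature: holds only a reference into the shared pertinfo collection.
sig_normalized = {"sig_id": "S1", "cell_id": "A", "pert_id": "P1"}


def denormalize(sig, perts, fields=("pert_name", "pert_type")):
    """Return a copy of a signature with selected perturbagen fields duplicated into it."""
    out = dict(sig)
    pert = perts[sig["pert_id"]]
    for field in fields:
        out[field] = pert[field]
    return out


sig_denormalized = denormalize(sig_normalized, pertinfo)

# A lookup by perturbagen name now touches siginfo only. The cost is that the
# duplicated fields must be refreshed whenever pertinfo changes.
siginfo = [sig_denormalized]
matches = [s["sig_id"] for s in siginfo if s["pert_name"] == "example-compound"]
print(matches)
```

The duplicated fields make the common query a single-collection read, which is the kind of lookup optimization the slide title describes.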

Page 24: MongoDB and the Connectivity Map: Making Connections Between Genetics and Disease

APIs: Are awesome, we need more of them

Picked functionality over convention: /siginfo?q={"cell":"A"}  vs  /siginfo/cell/A

Page 25: MongoDB and the Connectivity Map: Making Connections Between Genetics and Disease

API: MongoDB inspired a rich query syntax

Function          Example
Query             /siginfo?q={"cell":"A","name":"B"}
Field selection   /siginfo?q={}&f={"name":1}
Document count    /siginfo?q={}&c=true
Document limit    /siginfo?q={}&l=10
Skip documents    /siginfo?q={}&l=10&sk=10
Sort order        /siginfo?q={}&s={"name":-1,"cell":1}
Distinct values   /siginfo?q={}&d=name
Aggregation       /siginfo?q={}&g=name
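
As a rough client-side sketch, the query parameters listed in the table could be assembled with the Python standard library. The parameter names (`q`, `f`, `s`, `l`, `sk`, `c`) come from the slide; the host name `api.example.org`, the helper `build_query_url`, and its keyword arguments are hypothetical, and the slide does not say whether the real service expects percent-encoded JSON as produced here.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://api.example.org"  # hypothetical host, not the real service


def build_query_url(collection, q=None, fields=None, sort=None,
                    limit=None, skip=None, count=False):
    """Build a URL using the q/f/s/l/sk/c parameters shown in the table."""
    compact = {"separators": (",", ":")}  # compact JSON, as in the slide examples
    params = {"q": json.dumps(q or {}, **compact)}
    if fields is not None:
        params["f"] = json.dumps(fields, **compact)
    if sort is not None:
        params["s"] = json.dumps(sort, **compact)
    if limit is not None:
        params["l"] = str(limit)
    if skip is not None:
        params["sk"] = str(skip)
    if count:
        params["c"] = "true"
    # urlencode percent-encodes the JSON so braces and quotes survive transport.
    return f"{BASE_URL}/{collection}?{urlencode(params)}"


url = build_query_url("siginfo", q={"cell": "A", "name": "B"},
                      fields={"name": 1}, limit=10, skip=10)

# Round-trip check: decoding the query string recovers the original filter.
decoded = parse_qs(urlsplit(url).query)
print(json.loads(decoded["q"][0]))
```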

Page 26: MongoDB and the Connectivity Map: Making Connections Between Genetics and Disease

API: Node and Mongoose enable easy API creation

Page 27: MongoDB and the Connectivity Map: Making Connections Between Genetics and Disease

Language Bindings: JSON as a universal format

JavaScript

Python

R

Page 31: MongoDB and the Connectivity Map: Making Connections Between Genetics and Disease

Analytic Tools: A compute API liberates command line scripts

Page 32: MongoDB and the Connectivity Map: Making Connections Between Genetics and Disease

Compute API: Messaging handled via a capped collection
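
The slide gives no detail on the message format, so the following is only an in-memory analogy for how a capped collection behaves as a message queue: fixed capacity, insertion order preserved, oldest entries overwritten once full. It uses `collections.deque` rather than MongoDB itself; the `CappedQueue` class, the `seq` field, and the job documents are all hypothetical. Note that MongoDB caps collections by byte size (optionally also by document count), whereas this sketch caps by document count for simplicity.

```python
from collections import deque


class CappedQueue:
    """In-memory stand-in for a capped collection (an analogy, not MongoDB)."""

    def __init__(self, max_docs):
        # deque(maxlen=...) silently drops the oldest item when full.
        self._docs = deque(maxlen=max_docs)
        self._next_seq = 0

    def insert(self, doc):
        """Append a document, stamping it with a monotonically increasing sequence number."""
        record = {"seq": self._next_seq, **doc}
        self._docs.append(record)
        self._next_seq += 1
        return record["seq"]

    def read_after(self, last_seq):
        """Return documents newer than last_seq in insertion order, like a consumer tailing the queue."""
        return [d for d in self._docs if d["seq"] > last_seq]


jobs = CappedQueue(max_docs=3)
for name in ["job-a", "job-b", "job-c", "job-d"]:
    jobs.insert({"job": name, "status": "queued"})

# The first job was overwritten because capacity is three documents.
pending = jobs.read_after(last_seq=-1)
print([d["job"] for d in pending])
```

A consumer that remembers the last sequence number it processed can poll `read_after` repeatedly, which loosely mirrors how a worker would follow new messages in a capped collection.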

Page 33: MongoDB and the Connectivity Map: Making Connections Between Genetics and Disease

Input Validation: JSON Schema simplifies validation
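
To illustrate the idea of declarative validation, here is a hand-rolled check that supports only three JSON Schema keywords (`type`, `required`, `properties`). It is a minimal sketch, not a conforming JSON Schema validator and not the validator used by the CMap team; the `JOB_SCHEMA` document and its field names are hypothetical.

```python
# Maps the JSON Schema type names handled here to Python types.
TYPE_MAP = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "object": dict,
    "array": list,
}


def validate(instance, schema, path="$"):
    """Return a list of error strings; an empty list means the instance passed."""
    expected = schema.get("type")
    if expected is not None:
        # bool is a subclass of int in Python, so reject it for non-boolean types.
        is_bool = isinstance(instance, bool)
        ok = isinstance(instance, TYPE_MAP[expected]) and (expected == "boolean" or not is_bool)
        if not ok:
            return [f"{path}: expected {expected}"]
    errors = []
    if isinstance(instance, dict):
        for key in schema.get("required", []):
            if key not in instance:
                errors.append(f"{path}.{key}: required property missing")
        for key, sub_schema in schema.get("properties", {}).items():
            if key in instance:
                errors.extend(validate(instance[key], sub_schema, f"{path}.{key}"))
    return errors


# Hypothetical schema for a compute-job request.
JOB_SCHEMA = {
    "type": "object",
    "required": ["tool", "params"],
    "properties": {
        "tool": {"type": "string"},
        "params": {"type": "object", "properties": {"n": {"type": "integer"}}},
    },
}

good = validate({"tool": "query", "params": {"n": 10}}, JOB_SCHEMA)
bad = validate({"params": {"n": "ten"}}, JOB_SCHEMA)
print(good, bad)
```

Because the rules live in a data structure rather than in code, the same schema document can in principle be shared across the JavaScript, Python, and R clients.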

Page 34: MongoDB and the Connectivity Map: Making Connections Between Genetics and Disease

Numeric Matrix Data: HDF5 offers efficient storage for large matrices

GCTX: A binary format based on HDF5

• Cross platform
• Multi-language support
• Efficient I/O

Storage size for 30 billion data points is 110 GB
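
A quick back-of-the-envelope check puts the storage figure in context. The slide does not state the numeric type or whether compression is used; the sketch below assumes uncompressed 32-bit floats, which is purely an assumption made for the arithmetic.

```python
n_points = 30_000_000_000   # 30 billion data points, from the slide
bytes_per_value = 4         # assumption: 32-bit floating point, no compression

total_bytes = n_points * bytes_per_value
size_gb = total_bytes / 10**9    # decimal gigabytes
size_gib = total_bytes / 2**30   # binary gibibytes

print(f"{size_gb:.1f} GB, {size_gib:.2f} GiB")
```

Under that assumption the raw matrix would occupy about 120 GB (roughly 112 GiB), which is in the same range as the figure of about 110 quoted on the slide.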

Page 35: MongoDB and the Connectivity Map: Making Connections Between Genetics and Disease

Lincscloud: A platform for easy access to perturbational data

Sign up at lincscloud.org

Free for academic use

Page 37: MongoDB and the Connectivity Map: Making Connections Between Genetics and Disease

Predicting Drug Function: Diverse structures, common activities

VEGFR inhibitor

PPARG agonist

PI3K/MTOR inhibitor

ROCK inhibitor

Estrogen agonist

Page 39: MongoDB and the Connectivity Map: Making Connections Between Genetics and Disease

Finding Novel Drug Targets: Repurposing failed drugs

Original target

Failed in Phase 2 clinical trial due to lack of efficacy

Page 40: MongoDB and the Connectivity Map: Making Connections Between Genetics and Disease

Finding Novel Drug Targets: Repurposing failed drugs

Original target

Novel Target A

Novel Target B

Novel Target C

Novel Target D

Page 44: MongoDB and the Connectivity Map: Making Connections Between Genetics and Disease

Acknowledgements

Todd Golub

Core Team: Analysis & Software
Arvind Subramanian, Jacob Asiedu, Larson Hogstrom, Ian Smith, David Lahr, Aravind Subramanian, Josh Gould, Ted Natoli, David Wadden

Core Team: Lab
John Davis, David Peck, Xiaodong Lu, Melanie Donahue, Daniel Lam, Jackie Rosains (Project Manager)

Collaborators
Bang Wong, Steven Corsello (Golub lab), Jake Jaffe (Proteomics), David Takeda (Hahn lab), Pablo Tamayo

Chemistry & Therapeutics
Lucienne Ronco, Josh Bittker, Arthur Liberzon, Mathias Wawer, Paul Clemons

Genetic Perturbation Platform
John Doench, Federica Piccioni, David Root