Upload
sage-base
View
380
Download
2
Tags:
Embed Size (px)
DESCRIPTION
Stephen Friend, Sept 20, 2011. SciLife, Stockholm, Sweden
Citation preview
Integrating layers of omics data models and compute spaces needed to build a “Convivial Knowledge Expert”
Use of Bionetworks to Build Maps of Diseases Moving beyond the linear
Stephen Friend MD PhD
Sage Bionetworks (Non-Profit Organization) Seattle/ Beijing/ Amsterdam
SciLife September 21st, 2011
why consider the fourth paradigm- data intensive science
thinking beyond the narrative, beyond pathways
advantages of an open innovation compute space
it is more about how than what
Alzheimer’s Diabetes
Cancer Obesity Treating Symptoms v.s. Modifying Diseases
Will it work for me? Biomarkers?
Familiar but Incomplete
Reality: Overlapping Pathways
WHY NOT USE “DATA INTENSIVE” SCIENCE
TO BUILD BETTER DISEASE MAPS?
Equipment capable of generating massive amounts of data
“Data Intensive Science”- “Fourth Scientific Paradigm” For building: “Better Maps of Human Disease”
Open Information System
IT Interoperability
Evolving Models hosted in a Compute Space- Knowledge Expert
It is now possible to carry out comprehensive monitoring of many traits at the population level
Monitor disease and molecular traits in populations
Putative causal gene
Disease trait
what will it take to understand disease?
DNA RNA PROTEIN (dark matter)
MOVING BEYOND ALTERED COMPONENT LISTS
2002 Can one build a “causal” model?
trait
How is genomic data used to understand biology?
“Standard” GWAS Approaches Profiling Approaches
“Integrated” Genetics Approaches
Genome scale profiling provide correlates of disease Many examples BUT what is cause and effect?
Identifies Causative DNA Variation but provides NO mechanism
Provide unbiased view of molecular physiology as it
relates to disease phenotypes
Insights on mechanism
Provide causal relationships and allows predictions
RNA amplification Microarray hybirdization
Gene Index
Tum
ors
Tum
ors
Integration of Genotypic, Gene Expression & Trait Data
Causal Inference
Schadt et al. Nature Genetics 37: 710 (2005) Millstein et al. BMC Genetics 10: 23 (2009)
Chen et al. Nature 452:429 (2008) Zhang & Horvath. Stat.Appl.Genet.Mol.Biol. 4: article 17 (2005)
Zhu et al. Cytogenet Genome Res. 105:363 (2004) Zhu et al. PLoS Comput. Biol. 3: e69 (2007)
“Global Coherent Datasets” • population based
• 100s-1000s individuals
Constructing Co-expression Networks
Start with expression measures for genes most variant genes across 100s ++ samples
Note: NOT a gene expression heatmap
1 -0.1 -0.6 -0.8
-0.1 1 0.1 0.2
-0.6 0.1 1 0.8
-0.8 0.2 0.8 1 1
2
3
4
1 2 3 4
Correlation Matrix Brain sample
expr
essi
on
1 0 1 1 0 1 0 0 1 0 1 1 1 0 1 1 1
2
3
4
1 2 3 4
Connection Matrix
1 0 0 0 0 1 1 1 0 1 1 1 0 1 1 1 1
2
4
3
1 2 4 3
4 1
3 2
Establish a 2D correlation matrix for all gene pairs
Define Threshold eg >0.6 for edge
Clustered Connection Matrix
Hierarchically cluster
sets of genes for which many pairs interact (relative to the total number of pairs in that
set)
Network Module
Identify modules
******BYRM
Yeast segregants
Synthetic complete medium
Logorithm growth
Gene expression
Yeas
t seg
rega
nts
genotypes
Public databases
Protein-protein
interations
Transcription factor binding
sites
Bayesian network
Protein Metabolite interations
Data integration via Bayesian Network
Courtesy of Dr. Jun Zhu
Preliminary Probabalistic Models- Rosetta /Schadt
Gene symbol Gene name Variance of OFPM explained by gene expression*
Mouse model
Source
Zfp90 Zinc finger protein 90 68% tg Constructed using BAC transgenics Gas7 Growth arrest specific 7 68% tg Constructed using BAC transgenics Gpx3 Glutathione peroxidase 3 61% tg Provided by Prof. Oleg
Mirochnitchenko (University of Medicine and Dentistry at New Jersey, NJ) [12]
Lactb Lactamase beta 52% tg Constructed using BAC transgenics Me1 Malic enzyme 1 52% ko Naturally occurring KO Gyk Glycerol kinase 46% ko Provided by Dr. Katrina Dipple
(UCLA) [13] Lpl Lipoprotein lipase 46% ko Provided by Dr. Ira Goldberg
(Columbia University, NY) [11] C3ar1 Complement component
3a receptor 1 46% ko Purchased from Deltagen, CA
Tgfbr2 Transforming growth factor beta receptor 2
39% ko Purchased from Deltagen, CA
Networks facilitate direct identification of genes that are
causal for disease Evolutionarily tolerated weak spots
Nat Genet (2005) 205:370
50 network papers http://sagebase.org/research/resources.php
List of Influential Papers in Network Modeling
(Eric Schadt)
Recognition that the benefits of bionetwork based molecular models of diseases are powerful but that they require significant resources
Appreciation that it will require decades of evolving representations as real complexity emerges and needs to be integrated with therapeutic interventions
Sage Mission
Sage Bionetworks is a non-profit organization with a vision to create a “commons” where integrative bionetworks are evolved by
contributor scientists with a shared vision to accelerate the elimination of human disease
Sagebase.org
Data Repository
Discovery Platform
Building Disease Maps
Commons Pilots
Sage Bionetworks Collaborators
Pharma Partners Merck, Pfizer, Takeda, Astra Zeneca, Amgen, Johnson &Johnson
22
Foundations Kauffman CHDI, Gates Foundation
Government NIH, LSDF
Academic Levy (Framingham) Rosengren (Lund) Krauss (CHORI)
Federation Ideker, Califarno, Butte, Schadt
RULES GOVERN
Engaging Communities of Interest
PLAT
FORM
NEW
MAP
S NEW MAPS
Disease Map and Tool Users- ( Scientists, Industry, Foundations, Regulators...)
PLATFORM Sage Platform and Infrastructure Builders-
( Academic Biotech and Industry IT Partners...)
RULES AND GOVERNANCE Data Sharing Barrier Breakers-
(Patients Advocates, Governance and Policy Makers, Funders...)
NEW TOOLS Data Tool and Disease Map Generators- (Global coherent data sets, Cytoscape,
Clinical Trialists, Industrial Trialists, CROs…)
PILOTS= PROJECTS FOR COMMONS Data Sharing Commons Pilots-
(Federation, CCSB, Inspire2Live....)
24
Example 1: Breast Cancer
Zhang B et al., manuscript
Bayesian Network
Survival Analysis
Coexpression Networks Module combination
Partition BN
4 Public Breast Cancer Datasets
NKI: van de Vijver et al. A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med. 2002 Dec 19;347(25):1999-2009.
Wang Y et al. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet. 2005 Feb 19-25;365(9460):671-9.
Miller: Pawitan Y et al. Gene expression profiling spares early breast cancer patients from adjuvant therapy: derived and validated in two population-based cohorts. Breast Cancer Res. 2005;7(6):R953-64.
Christos: Sotiriou C et al.. Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis. J Natl Cancer Inst. 2006 Feb 15;98(4):262-72.
25
295 samples
286 samples
159 samples
189 samples
Generation of Co-expression & Bayesian Networks from published Breast Cancer Studies
Recovery of EGFR and Her2 oncoproteins downstream pathways by super modules
Comparison of Super-‐modules with EGFR and Her2 signaling and resistance pathways
28
Key Driver Analysis • IdenDfy key regulators for a list of genes h and a network N • Check the enrichment of h in the downstream of each node in N • The nodes significantly enriched for h are the candidate drivers
29
A) Cell Cycle (blue)
C) Pre-mRNA Processing (brown)
B) Chromatin modification (black)
D) mRNA Processing (red)
Global driver
Global driver & RNAi validation
Signaling between Super Modules
(View Poster presented by Bin Zhang)
Example: The Sage Non-Responder Project in Cancer
Sage Bionetworks • Non-Responder Project
• To identify Non-Responders to approved drug regimens so we can improve outcomes, spare patients unnecessary toxicities from treatments that have no benefit to them, and reduce healthcare costs
• Co-Chairs Stephen Friend, Todd Golub, Charles Sawyers Gary Nolan & Rich Schilsky
• AML (at first relapse)- Jerry Radich • Non-Small Cell Lung Cancer- Roy Herbst
• Ovarian Cancer (at first relapse)- Beth Karlan
• Breast Cancer- Dan Hayes • Renal Cell- Rob Moetzer
Purpose:
Leadership:
Initial Studies:
Blue module: 3000 genes Associated with Type 2 diabetes Elevated HbA1c Reduced insulin secretion
Global expression data from 64 human islet donors
340 genes in islet-specific open chromatin regions
168 overlapping genes, which have
• Higher connectivity • Markedly stronger association with
• Type 2 diabetes • Elevated HbA1c • Reduced insulin secretion
• Enrichment for beta-cell transcription factors and exocytotic proteins
New Type II Diabetes Disease Models Anders Rosengren
• Search across 1300 datasets in MetaGEO at Sage for similar expression profiles Top hit: Islet dedifferentiation study where the 168 genes were upregulated in mature islets and downregulated in dedifferentiated islets (Kutlu et al., Phys Gen 2009)
• Analyses of expression-SNPs and clinical SNPs as well as Causal Inference Test
• Identification of candidate key genes affecting beta-cell differentiation and chromatin
Working hypothesis:
Normal beta-cell: open chromatin in islet-specific regions, high expression of beta-cell transcription factors, differentiated beta-cells and normal insulin secretion
Diabetic beta-cell: lower expression of beta-cell transcription factors affecting the identified module, dedifferentiation, reduced insulin secretion and hyperglycemia
Next steps: Validation of hypothesis and suggested key genes in human islets
Anders Rosengren
New Type II Diabetes Disease Models
Probing complex biology
• The more we learn about it, the more complicated it becomes!
• Cancers' geneDc fingerprints are highly diverse; most mutaDons were unique to individual
• How to piece everything together?
observa)ons to models
diseases
perturba
)on
s
perturba
)on
s
A framework for data integraDon �
probabilistic graphic models
Microarray data
Proteomic data
Genomics
Genetics
Medline Biocarta/Biopathway Biologists
Database
GUI Hypothesis, test
High throughput data
knowledge
Metabolomic data
******BYRM
Yeast segregants
Synthe)c complete medium Logorithm growth
Gene expression metabolites Yeast segregants
genotypes
Public databases
Protein-‐protein intera)ons
Transcrip)on factor binding sites
Bayesian network
Protein Metabolite intera)ons
Bayesian network: Incorpora)ng TFBS and PPI data as a scale-‐free network prior
• DNA-‐protein binding data – Knowledge of what proteins regulate transcripDon of a given gene
PPI: Can we find informaDon overlapped with gene expressions?
3-‐clique
4-‐clique 4-‐clique
3-‐clique
Clique community (par)al clique)
Zhu J et al, Nature GeneDcs, 2008
IntegraDng transcripDon factor (TF) binding data and PPI
• Introducing scale-‐free priors for TF and large PPI complex
• Fixed prior for small PPI complex
Zhu J et al, Nature GeneDcs, 2008
Integra)on improves network quali)es
BN KO data GO terms TF data
w/o any priors 125 55 26
w/ genetics priors 139 59 34
w/ genetics, TF and PPI priors 152 66 52
Zhu et al., Cytogene)cs, 2004 Zhu et al. PLoS Comp Biol. 2007 Zhu J et al., Nature Gene)cs, 2008
• The pair is independent
• The pair is causa/reacDve
LEU2 GCN4
ILV6
GCN4
LEU2 KO gives rise to small expression signature
• LEU2 KO sig enriched (p~10E-‐18) • GCN4 downregulated in LEU2 KO small signature
ILV6 gives rise to large expression signature
• ILV6 KO sig enriched (p~10E-‐52) • GCN4 upregulated in ILV6 KO large signature
ProspecDve validaDon is the gold standard
Zhu J et al., Nature Gene)cs, 2008
Lung Cancer Bayesian network
• Built from 240 lung cancer samples • 7785 genes • 10642 links
EMT’s proteomic signatures
epithelial mesenchymal
Cell medium signature Cell extract signature Cell surface signature
Direct overlaps of EMT proteomic signatures
extract surface media
extract 267 75 57
surface 31% 240 52
media 27% 25% 208
Analysis of EMT signatures through lung cancer network
Cell medium signature
Cell extract signature
Cell surface signature
De novo constructed lung cancer regulatory network signatures subnetworks
overlaps
Hallmark features of EMT
• Decrease of E-cadherin and increase of Vimentin (criteria used in defining signatures) – CTNNA1 is in all three proteomic signatures;
• There are 6 nodes connected to CTNNA1 in the lung cancer network including ARF1
– VIM is in all three proteomic signatures • There are 7 nodes connected to VIM in the lung
cancer network including NOTCH2, EMP3
Lung Network Conclusions
• All proteomic signatures are coherently co-‐regulated at transcripDon level;
• GOBP annotaDons for signatures in cell surface, condiDoned media fracDons are expected;
• Lipid metabolism for signatures in the total cell extract is also expected.
• Subnetworks for EMT proteomic signatures contain all known hallmark features of EMT;
• These subnetworks can provide beger context to understand EMT and to idenDfy key regulators of EMT.
2008 2009 2010 2011
Can we accelerate the pace of scientific discovery?
How is the Federation different from a “traditional” collaboration?
collaboration 2.0
Shared data tools models and prepublications Conflict of interests Intellectual property Authorship
Rules of the game: transparency & trust
Watch What I Do, Not What I Say Reduce, Reuse, Recycle
Most of the People You Need to Work with Don’t Work with You
My Other Computer is Amazon
sage bionetworks synapse project
Type-II diabetes Warburg effect in cancer Human aging
sage federation: scientific pilot projects
Type-II diabetes Warburg effect in cancer Human aging
Cellular mortality (aging)
Cellular immortality (cancer)
sage federation: warburg effect project
warburg effect project interconnected discovery
bioinformatic bioinformatic confirmation that metabolic genes implicated in the warburg effect are differentially expressed across a wide range of cancer types!
butte lab
coherent prostate dataset
taylor, sawyers, et al., mskcc!
network modeling genes associated with poor prognosis in prostate cancer are disproportionately found amongst networks regulating glycolysis genes!
sage bionetworks
network dynamics generate an aerobic glycolysis signature and ʻreverse engineerʼ master transcription factor regulators of the warburg effect transcriptional program !
califano lab
prostate
breast
lung
colon
renal
brain
b cell
layer in!additional !tissue types!
from sawyers, pcctc presentation!
warburg effect project so what? discovery integration to translation, a validation path
lists of nodes and key drivers!
from thompson, science, 2009!
target identification! clinical validation!
peter nelson!
pre-clinical target validation!
jim olson!
sage federation: human aging project
JusDn Guinney Stephen Friend*
Greg Hannum Januz Dutkowski Trey Ideker* Kang Zhang*
Mariano Alvarez Celine Lefebrev Andrea Califano*
sage federation: what is the impact of disease/environment on “biological age” ?
Chronological Age
Biological Age
2001
2009
sage federation: model of biological age
Faster Aging
Slower Aging
Clinical Association - Gender - BMI - Disease Genotype Association Gene Pathway Expression Pr
edicted Age (liver expression)
Chronological Age (years)
Age Differential
human aging: clinical associations with differential aging – gene expression in human liver
Bioage Difference (years)
Faster Aging
Slower Aging
Fit Overweight Underweight
!
!
!
!
FALSE TRUE
−10
010
20
Male Female
human aging: predicting bioage using whole blood methylation
!
!
!!!
!
!!!
!!
!
!
!
!
!!
!
!
!
!
!
!
!
!
!
!
! !!
!
!
!
!
!
!!!!
!
!
!
!!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!!
!
!
!
!
!
!
!
!
!
!
!
!!
!
!
!
!
!
!
!
!!
!
!!
!
!
!
!
!
!
!!
!!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!!
!
!
!
!
!
!!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!!
!
!
!
!
!
!!!
!
!
!
!
!
40 50 60 70 80 90 100
40
60
80
100
Training Cohort: San Diego (n=170)
Chronological Age
Bio
logic
al A
ge
RMSE=3.35
!
!!
!
!
!
!
!
!
!!
!
!!
!
!
!
!
!
!
!
!!
!
!
!
!
!
!
!
! !
!
!
!!
!
!
!
!
!
!!
!
!
!!
!
!!
!!
!
!
!
!!!
!!
!
!
!
!
!
!
!!
!!
!!
!
!!
!
!!
!
!!
!
!
!
!!!
!
!
!
! !
!
!
!!
!
!
!
!!
!
!
!!
! !
!!!
!
!
!
!!
!
!
!!
!
!
!!
40 50 60 70 80 90
40
60
80
100
Validation Cohort: Utah (n=123)
Chronological Age
Bio
logic
al A
ge
RMSE=5.44
• Independent training (n=170) and validation (n=123) Caucasian cohorts • 450k Illumina methylation array • Exom sequencing • Clinical phenotypes: Type II diabetes, BMI, gender…
human aging: clinical associations with differential bioage
!
!
!!
!!
!!
!
!!
!
!
!
!
!
! !
!
!
!
!
!!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!!
!!
!
!
!!
!
!
!
!
!
!
!
!
! !
!!
!!
!!
!
!
!!
!!
!!
!!
!
!!
!
!!
! !
!
!
!
!
!!
!
!
!
!
!
!
!!
!
!!
!
! !
!!
!
!!
!
!
!!
!
!
!
!
!
!
!
!!
!
!
!
!
!
!
!!
!!
!!
!!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!!!
!
!!
!
!
!
!
!
!
!
!!
!
!
!
!
!!
!
20 25 30 35 40 45
!10
!5
05
10
BMI vs Diff Bioage
BMI
Diff B
ioage
p=0.0362
!
!
N Type II
!10
!5
05
10
Diabetes vs Diff Bioage
Diff B
ioage
p=0.000466
Train: San Diego
!!
!!
!!
!!
!
!!
!
!
!
!
!!
!
! !
!
!
!
!!! !
!
!
!
!
!
! !
!
!
!!
!
!
!
!!
!
!
!
!
!
!
!
!
!
!
!!!
!!
!
!
!
!
!
!
!!
!!
!!
!
!
!
!
!
!
!
!!
!
!
!!
!!
!
!
!
! !
!!
!
!
! !
!
!
!
!
!
!
!
!!
!
!
!!
!!!
!
!
!
!!
!
20 25 30 35 40 45 50
!1
5!
50
51
0
BMI vs Diff Bioage
BMI
Diff
Bio
ag
e
p=0.00173
!!
N Type II
!1
5!
50
51
0
Diabetes vs Diff Bioage
Diff
Bio
ag
e
p=1.34e!10
ValidaDon: Utah
human aging: clinical associations with combined cohorts
Univariate Analysis
Multiple Regression Gender BMI Diabetes Smoker
p=.45 p=.619 p=4.82e-‐07 p=.02
!
!
!
!
!
!
!!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!!
!
!
!
!!
!
!
!
!
!!
!
!
!!
!
!
!
!
!
!
!!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!!
!
! !
!
!
!
!
!
!
!
!
! !
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!!
!
!!
!
!
!
!
!
!
!
!
!
!
!!
!
!
!
!
!
!
!
!
!
!
!
!
!
!!
!
!
!
!
!
!
!
!
!
!
!
!
!
!! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!!
!
!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!!
!
!
!
!
!
!
!
!
!
!
!
!
!! !
!
! !
!
!!
!!!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!!
!
!!
!
!
!
!
!
!
! !
!
!
!!
!
!
!
!!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!!
20 30 40 50
!1
0!
50
51
0
BMI vs Diff Bioage
BMI
Diff
Bio
ag
e
p=0.00406
!
!
N Type II
!1
0!
50
51
0
Diabetes vs Diff Bioage
Diff
Bio
ag
e
p=9.7e!11!
!
!
!!
FALSE TRUE
!1
0!
50
51
0
Smoker vs Diff Bioage
Diff
Bio
ag
e
p=0.00764
human aging: mechanism of biological aging
stochasDc vs mechanisDc
deterioraDon by random “hits” systemaDc process
human aging: mechanism of aging – higher entropy?
40 50 60 70 80 90
2
4
6
8
10
12
14
Primary Cohort
Mean age per window
En
tro
py
p=6.616e!16
40 45 50 55 60 65
!56
!54
!52
!50
!48
Secondary Cohort
Mean age per window
En
tro
py
p=3.528e!05
• Created predictive model of biological age • Type-II diabetes strongly correlates with accelerated aging • Identified genetic variant that may induce methyl-driven
acceleration of aging • General mechanism of aging may involve loss of signal and
increased disorder within the methylome
human aging: research summary
sage federation alternative model of scientific collaboration?
Federated Aging Project : Combining analysis + narraDve
=Sweave Vignette Sage Lab
Califano Lab Ideker Lab
Shared Data Repository
JIRA: Source code repository & wiki
R code + narrative
PDF(plots + text + code snippets)
Data objects
HTML
Submitted Paper
Why not share clinical /genomic data and model building in the ways currently used by the software industry (power of tracking workflows and versioning
Synapse as a Github for building models of disease
Evolution of a Software Project
Biology Tools Support Collaboration
Potential Supporting Technologies
Taverna
Addama
tranSMART
Platform for Modeling
SYNAPSE
INTEROPERABILITY
INTEROPERABILITY (tranSMART)
TENURE FEUDAL STATES
!
Group D LEGAL STACK-ENABLING PAIENTS: John Wilbanks
Arch2POCM
Restructuring Drug Discovery
why consider the fourth paradigm- data intensive science
thinking beyond the narrative, beyond pathways
advantages of an open innovation compute space
it is more about how than what
Moving beyond the linear
linear pathways
linear ways of building models
linear ways of working together
“What Technology Wants” pp 264 by Kevin Kelly
Convivial ManifestaDons of the Sage Synapse Commons
OPPORTUNITIES FOR THE SCILIFE COMMUNITY
Data sets, Tools and Models
Joining Synapse Communities
Joining Federation Projects
Joining Arch2POCM
Change reward structures for sharing data (patients and academics)