Bioinformatics
Dealing with expression data
Kristel Van Steen, PhD, ScD
([email protected])
Université de Liege - Institut Montefiore
2008-2009
Acknowledgements
Material based on:
  Slides from Patrik D’haeseleer, Shoudan Liang and Roland Somogyi (genetic network inference)
  Slides from Steve Horvath and Jun Dong (co-expression networks)
  Slides from Sargur Srihari (bagging and boosting)
Class Outline
  Genetic networks
  A primer to co-expression network analysis
  Bagging and boosting (as promised …)
  Consensus microarray data analysis
    Theory
    Application
Genetic networks
Outline
  Introduction
  A conceptual approach to complex network dynamics
  Inference of regulation through clustering of gene expression data
  Modeling methodologies
  Gene network inference: reverse engineering
Introduction
  Genes encode proteins, some of which in turn regulate other genes
  Determine the structure of this intricate network of genetic regulatory interactions
  Traditional approach: local
    Examining and collecting data on a single gene, a single protein or a single reaction at a time
  Functional genomics
Functional Genomics
  Specifically, functional genomics refers to the development and application of global experimental approaches to assess gene function by making use of the information and reagents provided by structural genomics.
  High-throughput, large-scale experimental methodologies combined with statistical and computational analysis of the results.
Functional Genomics (Cont.)
  We need to define the mapping from sequence space to functional space.
Intermediate representation
  Focus at the level of single cells
  A biological system can be considered to be a state machine, where the change in internal state of the system depends on both its current internal state and any external inputs.
The goal
  Observe the state of a cell and how it changes under different circumstances, and from this derive a model of how these state changes are generated
  The state of a cell
    All those variables determining its behavior
Example
  A simple, 6-node regulatory network
A conceptual approach to complex network dynamics
  The global gene expression pattern is the result of the collective behavior of individual regulatory pathways
  Gene function depends on its cellular context; thus understanding the network as a whole is essential.
Boolean Networks
  Each gene is considered as a binary variable, either ON or OFF, regulated by other genes through logical or Boolean functions.
  Even with this simplification, the network behavior is already extremely rich.
Boolean Networks (Cont.)
  Cell differentiation corresponds to transitions from one global gene expression pattern to another.
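As a minimal sketch of the Boolean network idea, the Python code below simulates a hypothetical 3-gene network with synchronous updates and collects the attractors (steady states or cycles) reached from every initial state. The genes A, B, C and their logical rules are illustrative assumptions, not taken from the slides.

```python
from itertools import product

# Hypothetical wiring, for illustration only.
RULES = {
    "A": lambda s: s["C"],                 # A is switched on by C
    "B": lambda s: s["A"] and not s["C"],  # B needs A and is repressed by C
    "C": lambda s: not s["B"],             # C is repressed by B
}
GENES = sorted(RULES)

def step(state):
    """Synchronously update every gene from the current global state."""
    s = dict(zip(GENES, state))
    return tuple(bool(RULES[g](s)) for g in GENES)

def attractor(state):
    """Iterate until a state repeats; return the cycle that was entered."""
    seen = []
    while state not in seen:
        seen.append(state)
        state = step(state)
    return seen[seen.index(state):]

# Distinct attractors reached from all 2^3 initial expression patterns.
attractors = {frozenset(attractor(s))
              for s in product([False, True], repeat=len(GENES))}
for cycle in attractors:
    print(sorted(cycle))
```

Each attractor can be read as one stable global expression pattern; a perturbation that moves the system from one basin to another is the toy analogue of a differentiation event.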
Inference of regulation through clustering of gene expression data
Scoring methods
  Whether there has been a significant change at any one condition
  Whether there has been a significant aggregate change over all conditions
  Whether the fluctuation pattern shows high diversity according to Shannon entropy
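The entropy criterion can be sketched as follows. The slides do not say how expression values are discretized, so the equal-width binning and the default of three bins are assumptions made for this example.

```python
import math
from collections import Counter

def shannon_entropy(profile, n_bins=3):
    """Entropy (in bits) of an expression profile after equal-width binning."""
    lo, hi = min(profile), max(profile)
    if hi == lo:
        return 0.0  # a flat profile shows no diversity
    width = (hi - lo) / n_bins
    # Clamp the maximum value into the last bin.
    bins = [min(int((v - lo) / width), n_bins - 1) for v in profile]
    n = len(profile)
    return -sum((c / n) * math.log2(c / n) for c in Counter(bins).values())

flat = [2.0, 2.0, 2.0, 2.0, 2.0, 2.0]
varied = [0.0, 5.0, 10.0, 0.0, 5.0, 10.0]
print(shannon_entropy(flat), shannon_entropy(varied))
```

A gene whose values spread evenly over the bins scores close to log2(n_bins), while a gene that stays in one bin scores 0.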
Guilt By Association
  Select a gene
  Determine its nearest neighbors in expression space within a certain user-defined distance cut-off
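A minimal sketch of guilt by association, assuming Euclidean distance as the measure in expression space. The gene names and profile values are made up for illustration.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two expression profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def neighbours(profiles, query, cutoff):
    """Genes whose profile lies within `cutoff` of the query gene, nearest first."""
    ref = profiles[query]
    hits = [(euclidean(ref, prof), gene)
            for gene, prof in profiles.items() if gene != query]
    return [gene for dist, gene in sorted(hits) if dist <= cutoff]

# Hypothetical scaled expression profiles over three samples.
profiles = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [1.1, 2.1, 2.9],
    "geneC": [3.0, 1.0, 0.5],
}
print(neighbours(profiles, "geneA", cutoff=0.5))
```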
Clustering
  Extract groups of genes that are tightly co-expressed over a range of different experiments.
Caution
  Different clustering methods can have very different results
  It’s not yet clear which clustering methods are most useful for gene expression analysis.
Definition: Gene Expression Profile
  An expression profile $e_j$ of an ordered list of N samples (k = 1 to N) for a particular gene j is a vector of scaled expression values $v_{jk}$
  The expression profile is:
  $$e_j = (v_{j1}, v_{j2}, v_{j3}, \ldots, v_{jN})$$
Definition: Gene Expression Profile (Cont.)
  A difference between two genes p and q may be estimated as an N-dimensional metric “distance” between $e_p$ and $e_q$.
  Euclidean distance:
  $$d_{pq} = \sqrt{\sum_{k=1}^{N} (v_{pk} - v_{qk})^2}$$
Clustering algorithms
  Non-hierarchical methods
    Cluster N objects into K groups in an iterative process until certain goodness criteria are optimized
    E.g. K-means
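As an illustration of the iterative process, here is a small sketch of K-means using Lloyd's algorithm. The random initialisation from the data points, the fixed seed and the iteration cap are choices made for this example, not prescribed by the slides.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, n_iter=100, seed=0):
    """Alternate assignment and centre update until the labels stop changing."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centres[c]))
                      for p in points]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centre if a cluster is empty
                centres[c] = [sum(x) / len(members) for x in zip(*members)]
    return labels, centres

pts = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centres = kmeans(pts, k=2)
print(labels, centres)
```

The goodness criterion being minimised here is the within-cluster sum of squared distances; other initialisations can converge to different local optima.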
Clustering algorithms
  Hierarchical methods
    Return a hierarchy of nested clusters, where each cluster typically consists of the union of two or more smaller clusters.
    Agglomerative methods
      Start with single-object clusters and recursively merge them into larger clusters
    Divisive methods
      Start with the cluster containing all objects and recursively divide it into smaller clusters
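The agglomerative variant can be sketched in a few lines. Average linkage and Euclidean distance are assumptions made for this example (the slides do not fix a linkage rule), and the quadratic search over cluster pairs is only suitable for small inputs.

```python
import math

def agglomerate(profiles, n_clusters):
    """Average-linkage agglomerative clustering; returns lists of object indices."""
    clusters = [[i] for i in range(len(profiles))]  # start with singletons

    def linkage(a, b):
        # Mean pairwise Euclidean distance between members of two clusters.
        total = sum(math.dist(profiles[i], profiles[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters

data = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
print(agglomerate(data, n_clusters=2))
```

Recording the sequence of merges, rather than stopping at a fixed number of clusters, would give the full nested hierarchy.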
Other applications of co-expression clusters
  Extraction of regulatory motifs
    Genes in the same expression cluster share biological functions
  Inference of functional annotation
    Functions of unknown genes may be hypothesized from genes with known function within the same cluster
  As a molecular signature in distinguishing cell or tissue types
    mRNA expression
Which clustering method to use?
  There is no single best criterion for obtaining a partition because no precise and workable definition of ‘cluster’ exists.
  Clusters can be of any arbitrary shapes and sizes in a multidimensional pattern space.
Challenge in cluster analysis
  A gene could be a member of several clusters, each reflecting a particular aspect of its function and control
  Solutions
    Clustering methods that partition genes into non-exclusive clusters
    Several clustering methods could be used simultaneously
Modeling methodologies
Level of biochemical detail
  Abstract
    Boolean networks
  Concrete
    Full biochemical interaction models with stochastic kinetics in Arkin et al. (1998)
Forward and inverse modeling
  Forward modeling approach
  Inverse modeling, or reverse engineering
    Given an amount of data, what can we deduce about the unknown underlying regulatory network?
    Requires the use of a parametric model, the parameters of which are then fit to the real-world data.
Gene network inference: reverse engineering
Goal of network inference
  Construct a coarse-scale model of the network of regulatory interactions between the genes
  It’s possible to reverse engineer a network from its activity profiles
Data requirements
  To infer how a gene is regulated, we need to observe its expression under many different combinations of expression levels of its regulatory inputs
  Use data from different sources
  Deal with different data types
Estimates for network models
  A sparse network model of N genes, where each gene is only affected by K other genes on average.
  A sparsely connected, directed graph with N nodes and NK edges.
Co-expression network analysis
Outline
  Network and network concepts
  Approximately factorizable networks
  Gene Co-expression Network
    Eigengene Factorizability, Eigengene Conformity
    Eigengene-based network concepts
  What can we learn from the geometric interpretation?
Network = Adjacency Matrix
  A network can be represented by an adjacency matrix, $A = [a_{ij}]$, that encodes whether/how a pair of nodes is connected.
    A is a symmetric matrix with entries in [0,1]
    For unweighted networks, entries are 1 or 0 depending on whether or not 2 nodes are adjacent (connected)
    For weighted networks, the adjacency matrix reports the connection strength between node pairs
  Our convention: diagonal elements of A are all 1.
Motivational example I: Pair-wise relationships between genes across different mouse tissues and genders
  Challenge: Develop simple descriptive measures that describe the patterns.
  Solution: The following network concepts are useful: density, centralization, clustering coefficient, heterogeneity
Motivational example (continued)
  Challenge: Find a simple measure for describing the relationship between gene significance and connectivity
  Solution: network concept called hub gene significance
Background
  Network concepts are also known as network statistics or network indices
    Examples: connectivity (degree), clustering coefficient, topological overlap, etc.
  Network concepts underlie network language and systems biological modeling.
  Dozens of potentially useful network concepts are known from graph theory.
Review of some fundamental network concepts which are defined for all networks (not just co-expression networks)
Connectivity
  Node connectivity = row sum of the adjacency matrix
    For unweighted networks = number of direct neighbors
    For weighted networks = sum of connection strengths to other nodes
  $$\text{Connectivity}_i = k_i = \sum_{j \neq i} a_{ij}$$
  $$\text{Scaled connectivity}_i = K_i = \frac{k_i}{\max(k)}$$
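A short sketch of both quantities on a plain list-of-lists adjacency matrix. The 4-node star network is a made-up example, with the diagonal set to 1 following the convention stated earlier and excluded from the row sums.

```python
def connectivity(A):
    """k_i: row sum of the adjacency matrix, excluding the diagonal."""
    n = len(A)
    return [sum(A[i][j] for j in range(n) if j != i) for i in range(n)]

def scaled_connectivity(A):
    """K_i: connectivity divided by the maximum connectivity."""
    k = connectivity(A)
    return [ki / max(k) for ki in k]

# Hypothetical unweighted star network: node 0 is connected to all others.
A = [
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
]
print(connectivity(A))         # node 0 has three neighbours, the others one
print(scaled_connectivity(A))
```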
Density
  Density = mean adjacency
  Highly related to mean connectivity
  $$\text{Density} = \frac{\sum_i \sum_{j \neq i} a_{ij}}{n(n-1)} = \frac{\text{mean}(k)}{n-1}$$
  where n is the number of network nodes.
Centralization
  [Figure: two example networks. Centralization = 1 because the network has a star topology; Centralization = 0 because all nodes have the same connectivity of 2]
  $$\text{Centralization} = \frac{n}{n-2}\left(\frac{\max(k)}{n-1} - \text{Density}\right) \approx \frac{\max(k)}{n} - \text{Density}$$
  = 1 if the network has a star topology
  = 0 if all nodes have the same connectivity
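The two limiting cases can be checked numerically. The sketch below computes density as the mean off-diagonal adjacency and centralization as n/(n-2) times (max(k)/(n-1) minus density); the helper constructors `star` and `ring` are hypothetical names introduced for this example.

```python
def connectivity(A):
    """Row sums of the adjacency matrix, excluding the diagonal."""
    n = len(A)
    return [sum(A[i][j] for j in range(n) if j != i) for i in range(n)]

def density(A):
    """Mean off-diagonal adjacency, equal to mean(k) / (n - 1)."""
    n = len(A)
    return sum(connectivity(A)) / (n * (n - 1))

def centralization(A):
    n = len(A)
    k = connectivity(A)
    return n / (n - 2) * (max(k) / (n - 1) - density(A))

def identity(n):
    """Network without edges; the diagonal is 1 by convention."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def star(n):
    """Node 0 is connected to every other node."""
    A = identity(n)
    for j in range(1, n):
        A[0][j] = A[j][0] = 1
    return A

def ring(n):
    """Every node is connected to its two neighbours on a cycle."""
    A = identity(n)
    for i in range(n):
        j = (i + 1) % n
        A[i][j] = A[j][i] = 1
    return A

print(centralization(star(6)), centralization(ring(6)))
```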
Heterogeneity
  Heterogeneity: coefficient of variation of the connectivity
  Highly heterogeneous networks exhibit hubs
  $$\text{Heterogeneity} = \frac{\sqrt{\text{variance}(k)}}{\text{mean}(k)}$$
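A minimal sketch of the coefficient of variation of the connectivity. The slide does not say whether the variance is the sample or the population variance; the population variance is assumed here.

```python
import math
from statistics import mean, pvariance

def heterogeneity(k):
    """Coefficient of variation of the connectivity vector k.

    Assumption: population variance (divide by n), not sample variance.
    """
    return math.sqrt(pvariance(k)) / mean(k)

star_k = [4, 1, 1, 1, 1]   # one hub and four leaves
ring_k = [2, 2, 2, 2, 2]   # every node has the same connectivity
print(heterogeneity(star_k), heterogeneity(ring_k))
```

The hub-dominated star scores well above the ring, which has no variation in connectivity at all.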
Clustering Coefficient
  Measures the cliquishness of a particular node
  « A node is cliquish if its neighbors know each other »
  [Figure: two example networks. Clustering coefficient of the white node = 0; Clustering coefficient = 1]
  $$\text{ClusterCoef}_i = \frac{\sum_{l \neq i} \sum_{m \neq i,l} a_{il} a_{lm} a_{mi}}{\left(\sum_{l \neq i} a_{il}\right)^2 - \sum_{l \neq i} a_{il}^2}$$
  This generalizes directly to weighted networks (Zhang and Horvath 2005)
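The clustering coefficient can be sketched directly from its definition: the sum of a_il * a_lm * a_mi over pairs of distinct neighbours l and m, divided by (sum of a_il) squared minus the sum of a_il squared. Returning 0 when the denominator vanishes (a node with fewer than two neighbours) is a convention chosen for this example.

```python
def cluster_coef(A, i):
    """Clustering coefficient of node i for a weighted or unweighted network."""
    others = [l for l in range(len(A)) if l != i]
    num = sum(A[i][l] * A[l][m] * A[m][i]
              for l in others for m in others if m != l)
    den = sum(A[i][l] for l in others) ** 2 - sum(A[i][l] ** 2 for l in others)
    return num / den if den else 0.0  # assumption: 0 when undefined

# Hypothetical examples: a triangle (all neighbours know each other)
# and a star (the centre's neighbours do not know each other).
triangle = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
star = [[1, 1, 1, 1], [1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 1]]
print(cluster_coef(triangle, 0), cluster_coef(star, 0))
```

Because the function only multiplies and sums adjacencies, the same code applies unchanged to weighted networks with entries in [0, 1].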
The topological overlap dissimilarity is used as input of hierarchical clustering
  $$\text{TOM}_{ij} = \frac{\sum_{u \neq i,j} a_{iu} a_{uj} + a_{ij}}{\min(k_i, k_j) + 1 - a_{ij}}$$
  $$\text{DistTOM}_{ij} = 1 - \text{TOM}_{ij}$$
  Generalized in Zhang and Horvath (2005) to the case of weighted networks
  Generalized in Li and Horvath (2006) to multiple nodes
  Generalized in Yip and Horvath (2007) to higher order interactions
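A small sketch of the topological overlap matrix and the derived dissimilarity. For each pair of nodes it adds the shared-neighbour term (sum over u of a_iu * a_uj) to a_ij and divides by min(k_i, k_j) + 1 - a_ij. Setting the diagonal of TOM to 1 is an assumption made here, mirroring the convention used for the adjacency matrix.

```python
def tom(A):
    """Topological overlap matrix of an adjacency matrix A."""
    n = len(A)
    k = [sum(A[i][j] for j in range(n) if j != i) for i in range(n)]
    T = [[1.0] * n for _ in range(n)]  # assumption: TOM_ii = 1
    for i in range(n):
        for j in range(n):
            if i != j:
                shared = sum(A[i][u] * A[u][j]
                             for u in range(n) if u not in (i, j))
                T[i][j] = (shared + A[i][j]) / (min(k[i], k[j]) + 1 - A[i][j])
    return T

def dist_tom(A):
    """Dissimilarity 1 - TOM, usable as input to hierarchical clustering."""
    return [[1.0 - t for t in row] for row in tom(A)]

# Hypothetical path network 0 - 1 - 2: nodes 0 and 2 are not adjacent
# but share the neighbour 1.
path = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
print(tom(path)[0][2], dist_tom(path)[0][2])
```

Nodes 0 and 2 receive a non-zero overlap even though they are not directly connected, which is the point of using TOM rather than raw adjacency for module detection.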
Network Significance
  Defined as average gene significance
  We often refer to the network significance of a module network as module significance.
  $$\text{NetworkSignif} = \frac{\sum_i GS_i}{n}$$
Hub Gene Significance = slope of the regression line (intercept = 0)
  $$\text{HubGeneSignif} = \frac{\sum_i GS_i K_i}{\sum_i K_i^2}$$
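Both significance measures reduce to a few sums. In the sketch below, hub gene significance is the least-squares slope of gene significance GS regressed on scaled connectivity K with the intercept fixed at 0; the GS and K values are invented so that GS is exactly 0.8 times K.

```python
def network_significance(gs):
    """Average gene significance over the nodes of the network."""
    return sum(gs) / len(gs)

def hub_gene_significance(gs, K):
    """Slope of the regression of GS on scaled connectivity K through the origin."""
    return sum(g * k for g, k in zip(gs, K)) / sum(k * k for k in K)

# Hypothetical values: GS is proportional to K with slope 0.8.
K = [1.0, 0.5, 0.25]
GS = [0.8, 0.4, 0.2]
print(network_significance(GS), hub_gene_significance(GS, K))
```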
Q: What do all of these fundamental network concepts have in common?
  They are functions of the adjacency matrix A and/or a gene significance measure GS.
CHALLENGE
  Find relationships between these and other seemingly disparate network concepts.
    For general networks, this is a difficult problem.
    But a solution exists for a special subclass of networks: approximately factorizable networks
Definition of an approximately factorizable network
  The adjacency matrix A is approximately factorizable if there exists a vector CF with non-negative elements such that
  $$a_{ij} \approx CF_i \, CF_j \quad \text{for all } i \neq j$$
  $CF_i$ is referred to as the conformity of the i-th node
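To make the definition concrete, the sketch below builds an exactly factorizable adjacency matrix from a conformity vector and measures how far a given matrix is from the product CF_i * CF_j off the diagonal. The conformity values and both function names are hypothetical, chosen for this example.

```python
def factorizable_adjacency(cf):
    """Adjacency with a_ij = CF_i * CF_j for i != j and 1 on the diagonal."""
    n = len(cf)
    return [[1.0 if i == j else cf[i] * cf[j] for j in range(n)]
            for i in range(n)]

def max_factorization_error(A, cf):
    """Largest off-diagonal deviation |a_ij - CF_i * CF_j|."""
    n = len(A)
    return max(abs(A[i][j] - cf[i] * cf[j])
               for i in range(n) for j in range(n) if i != j)

cf = [0.9, 0.5, 0.2]            # hypothetical conformities in [0, 1]
A = factorizable_adjacency(cf)
print(A)
print(max_factorization_error(A, cf))
```

A network would count as approximately factorizable when some non-negative vector CF keeps this error small; finding the best CF for an observed matrix is a separate fitting problem not covered here.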
Why is this relevant?
  Answer: Because modules are often approximately factorizable
Observation: Approximate relationships among network concepts in approximately factorizable networks
  $$\text{mean}(\text{ClusterCoef}) \approx \left(1 + \text{Heterogeneity}^2\right)^2 \times \text{Density}$$
  $$\text{TopOverlap}_{ij} \approx \frac{\max(k_i, k_j)}{n} \left(1 + \text{Heterogeneity}^2\right)$$
  $$\text{TopOverlap}_{[1]j} \approx (\text{Centralization} + \text{Density}) \left(1 + \text{Heterogeneity}^2\right)$$
  where [1] denotes the index of the most highly connected hub
Weighted Gene Co-expression Network
  $$A = [a_{ij}] = \left[\,|\text{cor}(x_i, x_j)|^{\beta}\,\right]$$
  where $x_i$ is the expression profile for gene i, mathematically a vector of expression values across multiple samples, and β is the exponent of the power adjacency function.
  Note: the unweighted network is
  $$A = [a_{ij}] = \left[\,I\left(|\text{cor}(x_i, x_j)| > \tau\right)\,\right]$$
  where I(.) is an indicator function and τ is the dichotomization threshold.
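A minimal sketch of both constructions from an expression matrix with one row per gene. The weighted version raises the absolute Pearson correlation to a power; the unweighted version thresholds it. The exponent name `beta`, the threshold name `tau`, their values and the expression data are all assumptions made for this example, and constant profiles (zero variance) are not handled.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def weighted_adjacency(X, beta):
    """a_ij = |cor(x_i, x_j)| ** beta (power adjacency function)."""
    return [[abs(pearson(xi, xj)) ** beta for xj in X] for xi in X]

def unweighted_adjacency(X, tau):
    """a_ij = 1 if |cor(x_i, x_j)| > tau, else 0."""
    return [[1 if abs(pearson(xi, xj)) > tau else 0 for xj in X] for xi in X]

# Hypothetical expression profiles: 4 genes measured in 4 samples.
X = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with gene 0
    [4.0, 3.0, 2.0, 1.0],   # perfectly anti-correlated with gene 0
    [1.0, 3.0, 2.0, 4.0],   # correlation 0.8 with gene 0
]
W = weighted_adjacency(X, beta=2)
U = unweighted_adjacency(X, tau=0.9)
print(W[0])
print(U[0])
```

Taking the absolute value treats positively and negatively correlated gene pairs alike, and the power transform shrinks moderate correlations towards 0 without discarding them the way the hard threshold does.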
Steps for constructing a co-expression network
  A) Microarray gene expression data
  B) Measure concordance of gene expression with a Pearson correlation
  C) The Pearson correlation matrix is either dichotomized to arrive at an adjacency matrix (unweighted network) or transformed continuously with the power adjacency function (weighted network)
Definition of module (cluster)
  Module = cluster of highly connected nodes
    Any clustering method that results in such sets is suitable
  We define modules as branches of a hierarchical clustering tree using the topological overlap matrix
[Figure: heat map of the brown module (rows = genes, columns = microarrays) above a bar plot of the brown module eigengene across samples]
Module eigengene = measure of over-expression = average redness
  The module eigengene is highly correlated with the most highly connected hub gene.
Some insights
  Intramodular hub gene = a gene that is highly correlated with the module eigengene, i.e. it is a good representative of a module
  Gene screening strategies that use intramodular connectivity amount to pathway-based gene screening methods
  Intramodular connectivity is a highly reproducible “fuzzy” measure of module membership.
  Network concepts are useful for describing pairwise interaction patterns.
Bagging and Boosting
Bagging
Boosting
  Creating a classifier sequence
  Creating a 2nd training set
  Creating a 3rd data set
Boosting vs Bagging
Consensus microarray analysis
  Theory (Allison et al 2006 !!!)
  Practical: IBD application