Bleker, Clif, Garcia 1
Carissa Bleker, Ashley Cliff, Sergio Garcia
Test Questions
1. Which thresholding method did we try to use?
2. Name a type of calculated edge.
3. Name one type of graph that can be used to represent metabolic networks.
Ashley Cliff
● Bredesen Center Student (DSE)
○ Advisor: Dan Jacobson (ORNL)
● Central College, Pella, IA
○ BA: Physics & Computer Science
● From: Decorah, IA
○ Population: ~8,000
Carissa Bleker
Bredesen Center Student (DSE)
Advisor: Dr Langston
Stellenbosch University:
● BScHons Mathematics
From Cape Town, South Africa
● Population of 3.7 million
● Not the tip of Africa…
[Images: Mojo & Amper; Cape Town and Cape L’Agulhas]
Sergio Garcia
● From: Murcia (Population 439K), Spain.
● Pursuing PhD in Chemical and Biomolecular Engineering.
● I play the piano.
Presentation Outline
1. Background
2. Building Graphs from Real Data
3. Time Varying Graphs
4. Network Analysis Tools
5. Common Networks in Molecular Biology
6. Higher Level Biological Examples
7. Yeast Life Cycle Time Varying Graph
8. Issues
1. Background
The Discipline of Systems Biology
● Systems biology is the computational and mathematical modeling of complex biological systems.
● Focus on complex interactions within biological systems, using a holistic approach (holism instead of the more traditional reductionism) to biological research.
● Model and discover emergent properties of cells, tissues and organisms functioning as a system.
● Applications: disease, biocatalysis, waste management, ...
Cell
Communities
High-throughput Collection of Biological Data
2. Building Graphs from Real Data
Building Graphs from Real (Dirty) Data
● Formatting
○ File format: tab vs. space vs. comma vs. no format
○ Remove corrupted lines and odd characters (#, %)
○ Create ‘simple’ DIMACS files or adjacency matrices
● Data cleaning
○ Duplicate data
○ Are zeros actually zeros?
○ Missing values (remove, ignore, impute)
○ Standardizing/normalizing
These steps could take longer than the graph analysis. Never assume your data is clean or formatted.
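The formatting and cleaning steps above can be sketched with the Python standard library. This is only an illustration under stated assumptions: the function name `clean_rows`, the choice to drop (rather than impute) rows with missing values, and the sample input are hypothetical and not the presenters' pipeline.

```python
import re

def clean_rows(lines, comment_chars="#%"):
    """Parse tab/space/comma separated numeric rows, dropping bad and duplicate lines."""
    seen = set()
    rows = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and lines that start with an odd/comment character.
        if not line or line[0] in comment_chars:
            continue
        # Accept tabs, commas, or runs of spaces as delimiters.
        fields = [f for f in re.split(r"[,\t ]+", line) if f]
        try:
            values = tuple(float(f) for f in fields)
        except ValueError:
            continue  # corrupted or missing value (e.g. "NA"): assumption is to remove the row
        if values in seen:
            continue  # duplicate data
        seen.add(values)
        rows.append(values)
    return rows

# Hypothetical dirty input mixing delimiters, comments, a duplicate and a missing value.
raw = ["# header", "1.0\t2.0", "1.0,2.0", "3.0 NA", "4.0  5.0", "%junk"]
cleaned = clean_rows(raw)
print(cleaned)
```

Whether to remove, ignore, or impute missing values depends on the data set; removal is used here only because it is the simplest to show.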
Determining Vertices and Edges
● What are the vertices?
○ Genes, locations, molecules, ...
● And edges?
○ Interactions, covariation, proximity, ...
○ Roads, wired connections
○ Calculated edges: correlation
○ Directed, weighted
● Multigraphs
○ Ex: vertices are cities; edges are roads (blue), direct flights (red), etc.
● What different insights are gained by changing or swapping how we identify vertices and edges?
Calculated Edges
● Pearson correlation coefficient
○ Pearson p-value
● Spearman (rank-based Pearson)
○ Spearman p-value
● Cosine
● Euclidean
● Mutual information
● ...
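As a rough illustration of the first two calculated edge weights, the sketch below computes Pearson and Spearman correlation in plain Python. The helper names (`pearson`, `ranks`, `spearman`) and the sample vectors are assumptions made for this example; p-values are not computed here.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Average ranks (1-based), so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical monotone but non-linear relationship.
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 4.0, 9.0, 16.0]
print(pearson(x, y), spearman(x, y))
```

Because y increases monotonically with x, the rank-based measure treats the pair as perfectly associated while Pearson does not.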
Calculate Edges
Thresholding
A correlation analysis or matrix will generate a complete graph. Thresholding aims to separate signal from noise.
Thresholding
After: we only have significant edges/associations between vertices.
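A minimal sketch of hard thresholding, turning a complete correlation matrix into an edge list. The function name `threshold_edges`, the use of the absolute correlation value, and the sample matrix are assumptions for illustration only.

```python
def threshold_edges(corr, labels, t):
    """Keep undirected edges (i, j) whose |correlation| is at least t."""
    edges = []
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only: no self-loops, no duplicates
            if abs(corr[i][j]) >= t:
                edges.append((labels[i], labels[j], corr[i][j]))
    return edges

# Hypothetical 3 x 3 correlation matrix for three genes.
corr = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, -0.85],
    [0.2, -0.85, 1.0],
]
labels = ["g1", "g2", "g3"]
edges = threshold_edges(corr, labels, 0.8)
print(edges)
```

Taking the absolute value keeps strong negative associations as edges; dropping them instead is an equally valid choice depending on the question being asked.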
Thresholding: Spectral methods
• Weaker (smaller weighted) edges connect dissimilar clusters of the graph
• As t (the threshold) is increased:
→ weaker edges are removed
→ dissimilar clusters are less connected
→ the number of “nearly-disconnected” clusters increases
Perkins, A. D., & Langston, M. A. (2009). Threshold selection in gene co-expression networks using spectral graph theory techniques. BMC Bioinformatics, 10(Suppl 11), S4.
Thresholding: Spectral methods
Finding “nearly-disconnected” clusters:
● Extract the largest connected component
● Laplacian of G: L = D − A, where D is the diagonal degree matrix and A is the adjacency matrix
● Sort the values of the eigenvector of the second smallest eigenvalue
● The result is an ascending step-like function; each step corresponds to a transition from one cluster to another
Perkins, A. D., & Langston, M. A. (2009). Threshold selection in gene co-expression networks using spectral graph theory techniques. BMC Bioinformatics, 10(Suppl 11), S4.
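The eigenvector step can be sketched in plain Python using the standard graph Laplacian L = D − A. Power iteration on a shifted matrix is swapped in here for a proper eigensolver so the example stays dependency-free; a real analysis would use a linear algebra library. The function names and the toy graph (two triangles joined by one edge) are assumptions for this example, and the input graph is assumed to be connected.

```python
import math

def laplacian(adj):
    """L = D - A for a symmetric adjacency matrix without self-loops."""
    n = len(adj)
    return [[(sum(adj[i]) if i == j else 0.0) - adj[i][j] for j in range(n)]
            for i in range(n)]

def fiedler_vector(adj, iters=2000):
    """Approximate the eigenvector of the second smallest Laplacian eigenvalue.

    Power iteration on (c*I - L), removing the constant component each step.
    Assumes the graph is connected (e.g. its largest connected component).
    """
    n = len(adj)
    lap = laplacian(adj)
    c = 2.0 * max(lap[i][i] for i in range(n))  # upper bound on the eigenvalues of L
    v = [float(i) for i in range(n)]            # arbitrary non-constant start vector
    for _ in range(iters):
        mean = sum(v) / n
        v = [x - mean for x in v]               # project out the all-ones eigenvector
        w = [c * v[i] - sum(lap[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Toy graph: triangle {0, 1, 2} and triangle {3, 4, 5} joined by the edge 2-3.
adj = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
fv = fiedler_vector(adj)
print(sorted(fv))  # sorted entries form the step-like profile
```

In the sorted output, entries belonging to the same triangle sit close together and the jump between the two groups marks the transition from one cluster to the other.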
Thresholding: Spectral methods
• Select the t that maximises the number of "nearly-disconnected" components, and therefore minimises the number of edges connecting dissimilar parts of the network.
Perkins, A. D., & Langston, M. A. (2009). Threshold selection in gene co-expression networks using spectral graph theory techniques. BMC Bioinformatics, 10(Suppl 11), S4.
Thresholding: Random Matrix Theory
Based on the nearest neighbor spacing distribution (NNSD) of eigenvalues from the adjacency/correlation matrix.
NNSD: differences between subsequent (ordered) eigenvalues
Jalan, S., & Bandyopadhyay, J. N. (2007). Random matrix analysis of complex networks. Physical Review E, 76(4), 046107.
[Figure panels: random network, scale-free network, small-world network]
Thresholding: Random Matrix Theory
The NNSD of the eigenvalues of a random matrix can be approximated by the Wigner surmise.
The NNSD of a non-random matrix appears Poisson.
Iterate over t to find the threshold at which the NNSD transitions between the two.
Gibson, S. M., Ficklin, S. P., Isaacson, S., Luo, F., Feltus, F. A., & Smith, M. C. (2013). Massive-scale gene co-expression network construction and robustness testing using random matrix theory. PLoS One, 8(2), e55871.
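A small sketch of the quantities involved, assuming the eigenvalues have already been computed. Spacings are rescaled to mean 1 as a simplification; published pipelines use a more careful spectrum "unfolding" step, which is omitted here. The function names and sample eigenvalues are hypothetical.

```python
import math

def nnsd(eigenvalues):
    """Nearest neighbor spacings of the sorted eigenvalues, scaled to mean 1."""
    ev = sorted(eigenvalues)
    spacings = [b - a for a, b in zip(ev, ev[1:])]
    mean = sum(spacings) / len(spacings)
    return [s / mean for s in spacings]

def wigner_surmise(s):
    """Wigner surmise density (GOE): the random-matrix reference curve."""
    return (math.pi / 2.0) * s * math.exp(-math.pi * s * s / 4.0)

def poisson_density(s):
    """Exponential spacing density: the Poisson reference curve."""
    return math.exp(-s)

# Hypothetical eigenvalues of a thresholded correlation matrix.
spacings = nnsd([0.1, 0.4, 1.0, 1.1, 2.0])
print(spacings)
print(wigner_surmise(1.0), poisson_density(1.0))
```

In practice the empirical spacing histogram at each candidate t would be compared against both reference curves (for example with a goodness-of-fit test) to locate the transition.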
3. Time-varying Graphs
Time-varying Graphs
● Dynamic undirected graphs with a fixed underlying vertex set
○ Edges change over time, vertices do not
● Also known as time-evolving graphs (TEG)
● Useful for time-based data
○ Gene expression over time
○ Congestion at intersections, etc.
● Use simple (or more complicated) metrics to ‘label’ the differences between time steps
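One possible "simple metric" for labelling the difference between consecutive time steps is the Jaccard similarity of the edge sets. This metric is an assumption chosen for illustration, not necessarily the one used in the presentation, and the snapshot data are hypothetical.

```python
def jaccard(edges_a, edges_b):
    """Jaccard similarity of two undirected edge sets (1.0 means identical)."""
    a = {frozenset(e) for e in edges_a}  # frozenset makes (u, v) equal to (v, u)
    b = {frozenset(e) for e in edges_b}
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical snapshots over a fixed vertex set {g1, g2, g3, g4}.
snapshots = [
    [("g1", "g2"), ("g2", "g3")],
    [("g2", "g1"), ("g3", "g4")],
    [("g3", "g4")],
]
labels = [jaccard(s, t) for s, t in zip(snapshots, snapshots[1:])]
print(labels)
```

Each entry of `labels` describes one transition between time steps; low values flag time points where the edge structure changes most.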
4. Network Analysis Tools
Network Properties
● Density: Some biological networks are sparse
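For a simple undirected graph, density is commonly defined as 2|E| / (|V|(|V| − 1)), the fraction of possible edges that are present. The sketch below applies that standard definition; the function name and example graph are assumptions.

```python
def density(num_vertices, edges):
    """Fraction of possible undirected edges that are present."""
    # Ignore self-loops and count (u, v) and (v, u) once.
    unique = {frozenset(e) for e in edges if e[0] != e[1]}
    possible = num_vertices * (num_vertices - 1) / 2
    return len(unique) / possible if possible else 0.0

# Hypothetical graph: 5 vertices, 2 distinct edges (one listed twice).
d = density(5, [(0, 1), (1, 2), (2, 1)])
print(d)
```

A value close to 0 indicates a sparse network; a complete graph has density 1.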
Network Properties
● Clustering coefficient: $C_i = \frac{2E_i}{k_i(k_i - 1)}$, averaged over all nodes as $C = \frac{1}{N}\sum_i C_i$
$N = |V|$; $E_i$ = |edges between neighbors of i|; $k_i$ = degree of i
[Figure: average clustering coefficient for the metabolic networks of 43 organisms (colors represent different taxonomic domains). N is the number of nodes. The diamonds correspond to a scale-free network with the same number of nodes and edges.]
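Using the symbols defined on this slide, the standard local clustering coefficient is C_i = 2E_i / (k_i(k_i − 1)). A minimal sketch follows; the adjacency-dictionary representation, the convention C_i = 0 for nodes with fewer than two neighbors, and the toy graph are assumptions.

```python
from itertools import combinations

def clustering_coefficient(adj, i):
    """C_i = 2 * E_i / (k_i * (k_i - 1)); taken as 0 when k_i < 2."""
    nbrs = adj[i]
    k = len(nbrs)
    if k < 2:
        return 0.0
    # E_i: number of edges between pairs of neighbors of i.
    e = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return 2.0 * e / (k * (k - 1))

def average_clustering(adj):
    """Mean of C_i over all N nodes."""
    return sum(clustering_coefficient(adj, i) for i in adj) / len(adj)

# Toy graph: triangle 0-1-2 plus a pendant node 3 attached to node 0.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(clustering_coefficient(adj, 0), average_clustering(adj))
```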
Network Properties
● Other:
○ Diameter: shortest distance between the two most distant nodes in the network. The diameter of metabolic networks is conserved even across distant organisms.
○ Average path length: average length of the shortest paths between node pairs.
○ Degree distribution: number of nodes with a certain degree. Biological networks tend to follow a power law.
○ ...
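These three properties can be computed for a small unweighted graph with breadth-first search. The sketch assumes a connected graph stored as an adjacency dictionary; the function names and the 4-node path graph are illustrative only.

```python
from collections import Counter, deque

def bfs_distances(adj, source):
    """Shortest path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def path_metrics(adj):
    """Return (diameter, average shortest path length) of a connected graph."""
    lengths = [d for s in adj for t, d in bfs_distances(adj, s).items() if t != s]
    return max(lengths), sum(lengths) / len(lengths)

def degree_distribution(adj):
    """Map each degree to the number of nodes having that degree."""
    return Counter(len(nbrs) for nbrs in adj.values())

# Toy path graph 0 - 1 - 2 - 3.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
diameter, avg_len = path_metrics(path)
print(diameter, avg_len, degree_distribution(path))
```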
Complex Network Models
Node Centralities and Ranking
● Degree centrality: Nodes with high degree centrality are hubs. While biological networks are robust against perturbation, the removal of hubs often leads to system failure.
Node Centralities and Ranking
● Other:
○ Closeness centrality: indicates important nodes that can communicate quickly with other nodes. Used to identify key central metabolites and extract the core metabolic network.
○ Betweenness centrality: nodes that appear in many shortest paths rank higher, e.g. metabolites controlling flux between two modules. In telecommunication networks such a node would have greater control.
○ Eigenvector centrality: ranks higher the nodes that are connected to important neighbors. Used to identify pairs of genes that cause sickness/death.
○ ...
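As a sketch of two of these rankings, the code below computes degree centrality (degree divided by N − 1) and closeness centrality (number of other reachable nodes divided by the sum of shortest path lengths to them) on a small star graph. The normalizations follow common conventions and, like the function names and example, are assumptions here.

```python
from collections import deque

def degree_centrality(adj):
    """Degree of each node divided by the maximum possible degree, N - 1."""
    n = len(adj)
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}

def closeness_centrality(adj):
    """(Reachable nodes) / (sum of shortest path lengths) for each node."""
    result = {}
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:  # breadth-first search from s
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total = sum(dist.values())
        result[s] = (len(dist) - 1) / total if total else 0.0
    return result

# Toy star graph: one hub connected to three leaves.
star = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
deg = degree_centrality(star)
clo = closeness_centrality(star)
print(deg["hub"], clo["hub"], clo["a"])
```

Removing the hub from this toy graph disconnects every remaining node, which is the kind of fragility the degree centrality slide describes.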
Clustering
● Clusters are parts of a graph that are highly associated
● In biology, clusters are of interest for a number of reasons:
○ Finding co-regulated vertices
○ Finding vertices that are part of the same process
○ Hypothesising functionality of unannotated vertices
Clustering Algorithms: Markov Clustering
Underlying idea: a random walk on a transition graph that starts within a cluster is more likely to stay within that cluster than to leave it.
Clustering Algorithms: Markov Clustering
Two steps on a transition matrix:
1. Matrix squaring: simulates random walks through the graph
2. Elementwise matrix squaring: strengthens strong transition probabilities and weakens low probabilities
Clustering Algorithms: Markov Clustering
Transition matrix
$P_{i,j} = P(i \mid j)$ = probability of walking from j to i
Each column consists of the probabilities of each way you can leave that node, and sums to 1.
For a graph with nodes 1 to 5:
$$
M = \begin{pmatrix}
P_{1,1} & P_{1,2} & \cdots & P_{1,5} \\
P_{2,1} & \ddots & & \vdots \\
\vdots & & \ddots & \vdots \\
P_{5,1} & \cdots & \cdots & P_{5,5}
\end{pmatrix}
$$
Clustering Algorithms: Markov Clustering
Transition matrix multiplication
$(M^2)_{i,j} = \sum_k P_{i,k} P_{k,j} = \sum_k P(\text{walk to } i \text{ from } j \text{ through } k)$
$(M^2)_{2,3}$ = probability of walking to 2 from 3, over all 2-step paths
$M^2 = M \times M$, where $M$ is the 5 × 5 transition matrix with entries $P_{i,j}$.
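A short sketch of the column-stochastic transition matrix and its square, in plain Python. The function names and the 3-node example (a path with self-loops) are assumptions for illustration.

```python
def transition_matrix(adj):
    """Column-stochastic M with M[i][j] = P(i | j), built from an adjacency matrix."""
    n = len(adj)
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    return [[adj[i][j] / col_sums[j] for j in range(n)] for i in range(n)]

def matmul(a, b):
    """Plain matrix product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Toy graph: path 0 - 1 - 2 with a self-loop on every node.
adj = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
]
M = transition_matrix(adj)
M2 = matmul(M, M)
# M[2][0] is 0 (no direct step from node 0 to node 2), but M2[2][0] is positive:
# node 2 can be reached from node 0 in two steps through node 1.
print(M[2][0], M2[2][0])
```

The columns of `M2` still sum to 1, so it is again a transition matrix, now describing two-step walks.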
Clustering Algorithms: Markov Clustering
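G is a graph
add self-loops to G
set inflation parameter I
set M to the stochastic matrix of G
change = ∞
while (change > ε) {
    M’ = M x M              // expansion
    M’ = Г_I(M’)            // inflation: raise each entry to the power I
    make M’ stochastic      // rescale each column to sum to 1
    change = ‖M − M’‖       // e.g. largest absolute entry difference
    M = M’
}
The clusters are the components of M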
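The loop above can be turned into a small runnable sketch. This is a dependency-free illustration rather than a reference implementation: the function names, the default inflation of 2, the convergence measure (largest absolute entry change), the cutoff used when reading clusters, and the toy graph are all assumptions.

```python
def normalize_columns(m):
    """Rescale every column to sum to 1 (make the matrix column-stochastic)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]

def mcl(adj, inflation=2.0, eps=1e-9, max_iter=100):
    """Markov clustering sketch on a symmetric adjacency matrix."""
    n = len(adj)
    # Add self-loops, then build the column-stochastic matrix.
    m = normalize_columns(
        [[1.0 if i == j else float(adj[i][j]) for j in range(n)] for i in range(n)]
    )
    for _ in range(max_iter):
        # Expansion: M x M simulates two-step random walks.
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
        # Inflation: elementwise power, then rescale the columns.
        new = normalize_columns([[x ** inflation for x in row] for row in expanded])
        change = max(abs(new[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = new
        if change < eps:
            break
    # Each row that still carries mass collects the columns (nodes) it attracts.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return clusters

# Toy graph: triangle {0, 1, 2} and triangle {3, 4, 5} joined by the edge 2-3.
bridged = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
clusters = mcl(bridged)
print(clusters)
```

Larger inflation values generally produce more, smaller clusters, so this parameter is usually tuned per data set.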
Clustering Algorithms: Paraclique
g = 3
5. Common Networks in Molecular Biology
Metabolic Networks
● Network of chemical reactions enabling the conversion of substrates into energy and biomass. Can be represented by:
○ Simple graphs
○ (Directed) bipartite graphs
○ (Directed) hypergraphs
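To make the representations concrete, the sketch below encodes one well-known reaction (hexokinase: glucose + ATP → glucose-6-phosphate + ADP) as a directed bipartite graph and as a simple metabolite graph. The data structure and function names are assumptions chosen for this example.

```python
# Each reaction maps to (substrates, products); the dictionary itself acts as a
# directed hypergraph, with one hyperedge per reaction.
reactions = {
    "hexokinase": (["glucose", "ATP"], ["glucose-6-phosphate", "ADP"]),
}

def bipartite_edges(reactions):
    """Directed bipartite graph: substrate -> reaction node -> product."""
    edges = []
    for rxn, (substrates, products) in reactions.items():
        edges += [(s, rxn) for s in substrates]
        edges += [(rxn, p) for p in products]
    return edges

def metabolite_projection(reactions):
    """Simple directed graph: substrate -> product for every reaction."""
    return {(s, p)
            for subs, prods in reactions.values()
            for s in subs
            for p in prods}

bip = bipartite_edges(reactions)
proj = metabolite_projection(reactions)
print(bip)
print(sorted(proj))
```

The projection is easier to analyze with standard graph tools but loses the information about which metabolites take part in the same reaction; the bipartite and hypergraph forms keep it.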
Hierarchical Modularity in Metabolic Networks
Protein-protein Interaction Networks
● Represent how different proteins operate in coordination with others to enable biological processes within the cell.
Gene Co-expression Networks
6. Higher Level Biological Examples
Epilepsy Seizure Prediction
fMRI in Schizophrenia
Health Disparities
● WHO child cause of death (COD) data for 194 countries in 2013
● Method:
○ Pearson correlation between countries, across COD categories
● Graph:
○ Vertices - Countries
○ Edges - Similarities in COD
● Threshold at 0.95
● Markov clustering
Health Disparities Graph
[Figure legend: Africa, Americas, East Mediterranean, Europe, South East Asia, Western Pacific]
Health Disparities Clustering
[Figure legend: Africa, Americas, East Mediterranean, Europe, South East Asia, Western Pacific]
7. Yeast Life Cycle Time Varying Graph
Yeast Life Cycle
We want to know how gene-gene associations change over the life cycle of a yeast cell.
Yeast Life Cycle Data
● Yeast gene expression data collected from synchronised cultures
● 24 time points at 10-minute increments
● 6,178 genes
[Figure axes: Time points, Genes]
Expression Over Time
[Figure: normalized expression value vs. time (s)]
Yeast Example Process
1. Removed genes with:
○ More than 4 missing values
○ Low variance over all time points (<0.6)
2. Calculated all-to-all pairwise Spearman correlations for time steps
3. Spectral thresholding - did not work
○ Used a hard cutoff of 0.8 for all graphs instead
4. Metric calculations
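Step 1 of this process can be sketched as a simple filter. The thresholds (more than 4 missing values, variance below 0.6) come from the slide; everything else is an assumption for illustration, including the use of the sample variance over observed values, `None` as the missing-value marker, and the hypothetical gene names and data.

```python
from statistics import variance

def keep_gene(values, max_missing=4, min_variance=0.6):
    """Decide whether a gene's expression profile passes the filter.

    values: expression per time point, with None marking a missing value.
    """
    observed = [v for v in values if v is not None]
    if len(values) - len(observed) > max_missing:
        return False  # too many missing values
    if len(observed) < 2:
        return False  # variance is undefined for fewer than two observations
    return variance(observed) >= min_variance  # drop low-variance genes

# Hypothetical profiles (shortened to 6 time points for readability).
expression = {
    "geneA": [0.0, 2.0, 0.0, 2.0, 0.0, 2.0],   # high variance: kept
    "geneB": [1.0, 1.1, 1.0, 0.9, 1.0, 1.0],   # nearly flat: removed
    "geneC": [None, None, None, None, None, 3.0],  # 5 missing values: removed
}
kept = [gene for gene, values in expression.items() if keep_gene(values)]
print(kept)
```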
Variance
[Figure: number of genes vs. variance]
Time-varying Co-expression Network
[Figure: matrix with columns t-1, t-2, ..., t-M and rows G-1, G-2, ...]
Time Graphs - Time 1
Time Graphs - Time 2
Time Graphs - Time 3
Time Graphs - Time 4
Time Graphs - Time 5
8. Issues
● Figure out how to threshold
● Better metrics to pinpoint differences in time based graphs
● Network validation, particularly for less studied systems
● Noise in high-throughput data
Test Questions
1. Which thresholding method did we try to use?
2. Name a type of calculated edge.
3. Name one type of graph that can be used to represent metabolic networks.