Carissa Bleker, Ashley Cliff, Sergio Garcia

web.eecs.utk.edu/~cphill25/cs594_spring2017/...


Bleker, Clif, Garcia 1

Carissa Bleker, Ashley Cliff, Sergio Garcia


Test Questions

1. Which thresholding method did we try to use?

2. Name a type of calculated edge.

3. Name one type of graph that can be used to represent metabolic networks.

Ashley Cliff

● Bredesen Center Student (DSE)
  ○ Advisor: Dan Jacobson (ORNL)
● Central College, Pella, IA
  ○ BA: Physics & Computer Science
● From: Decorah, IA
  ○ Population: ~8,000

Carissa Bleker

● Bredesen Center Student (DSE)
  ○ Advisor: Dr Langston
● Stellenbosch University
  ○ BScHons Mathematics
● From Cape Town, South Africa
  ○ Population of 3.7 million
  ○ Not the tip of Africa…

[Photo: Mojo & Amper]

[Map: Cape Town and Cape L’Agulhas]

Sergio Garcia

● From: Murcia (Population 439K), Spain.

● Pursuing PhD in Chemical and Biomolecular Engineering.

● I play the piano.


Presentation Outline

1. Background

2. Building Graphs from Real Data

3. Time Varying Graphs

4. Network Analysis Tools

5. Common Networks in Molecular Biology

6. Higher Level Biological Examples

7. Yeast Life Cycle Time Varying Graph

8. Issues


1. Background


The Discipline of Systems Biology

● Systems biology is the computational and mathematical modeling of complex biological systems.

● Focus on complex interactions within biological systems, using a holistic approach (holism instead of the more traditional reductionism) to biological research.

● Model and discover emergent properties of cells, tissues and organisms functioning as a system.

● Applications: disease, biocatalysis, waste management, …

[Figure labels: Cell, Communities]

High-throughput Collection of Biological Data


2. Building Graphs from Real Data


Building Graphs from Real (Dirty) Data

● Formatting
  ○ File format: tab vs space vs comma vs no format
  ○ Remove corrupted lines, odd characters (#, %)
  ○ Create ‘simple’ DIMACS files or adjacency matrices
● Data cleaning
  ○ Duplicate data
  ○ Are zeros actually zeros?
  ○ Missing values (remove, ignore, impute)
  ○ Standardizing/normalizing

These steps could take longer than the graph analysis itself. Never assume your data is clean or formatted.
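The formatting and cleaning steps can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the helper names `parse_dirty_table` and `impute_row_mean`, the toy input, and the choice to impute missing values with the row mean (removing or ignoring them are equally valid options).

```python
import re
from statistics import mean

def parse_dirty_table(text):
    """Parse tab/space/comma-delimited rows into {label: [float or None, ...]}."""
    rows = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and lines that start with an odd character (#, %).
        if not line or line[0] in "#%":
            continue
        fields = [f for f in re.split(r"[,\t ]+", line) if f]
        label, raw = fields[0], fields[1:]
        values = []
        for f in raw:
            try:
                values.append(float(f))
            except ValueError:
                values.append(None)  # unparseable entry (e.g. "NA") is treated as missing
        if label not in rows:        # duplicate data: keep only the first copy of a row
            rows[label] = values
    return rows

def impute_row_mean(values):
    """Replace missing entries with the mean of the observed entries in the same row.
    Assumes at least one observed value per row."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical input mixing delimiters, a duplicate row, and a corrupted line.
raw = """# gene  t1  t2  t3
geneA\t1.0\t2.0\t3.0
geneB, 4.0, NA, 6.0
geneA 1.0 2.0 3.0
% corrupted line
geneC 0.5  0.5   0.5
"""
table = parse_dirty_table(raw)
cleaned = {gene: impute_row_mean(values) for gene, values in table.items()}
print(cleaned)
```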


Determining Vertices and Edges

● What are the vertices?
  ○ Genes, locations, molecules, …
● And edges?
  ○ Interactions, covariation, proximity, …
  ○ Roads, wired connections
  ○ Calculated edges: correlation
  ○ Directed, weighted
● Multigraphs
  ○ Ex: vertices are cities; edges are roads (blue), direct flights (red), etc.
● What different insights are gained by changing or swapping how we identify vertices and edges?


Calculated Edges

● Pearson correlation coefficient

○ Pearson p-value

● Spearman (rank based Pearson)

○ Spearman p-value

● Cosine

● Euclidean

● Mutual information

● ...
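As a rough sketch of how calculated edges might be produced, the block below computes Pearson and Spearman correlations for every pair of vertices, using only the standard library. The profile values and the helper name `correlation_edges` are hypothetical; p-values and the other measures in the list are not covered.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient; undefined (division by zero) for constant vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def correlation_edges(profiles, measure=pearson):
    """Weight for every unordered vertex pair, i.e. a complete weighted graph."""
    names = sorted(profiles)
    return {(a, b): measure(profiles[a], profiles[b])
            for i, a in enumerate(names) for b in names[i + 1:]}

# Hypothetical expression profiles for three genes.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [1.0, 4.0, 9.0, 16.0],   # monotone but nonlinear in g1
    "g3": [4.0, 3.0, 2.0, 1.0],
}
weights = correlation_edges(profiles, spearman)
print(weights)
```

Because Spearman works on ranks, g1 and g2 are perfectly rank-correlated even though their Pearson correlation is below 1.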


Calculate Edges


Thresholding

A correlation analysis yields a value for every pair of vertices, so the resulting graph is complete. Thresholding aims to separate signal from noise.

After thresholding, only the significant edges (associations) between vertices remain.
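A hard threshold is the simplest way to do this. The sketch below keeps an edge only when the absolute value of its weight reaches t. The weights are made up, and using the absolute value (so that strong negative correlations are kept) is an assumption rather than something the slides specify.

```python
def threshold_graph(weights, t):
    """Keep an undirected edge (a, b) only when |weight| >= t."""
    return {edge for edge, w in weights.items() if abs(w) >= t}

# Hypothetical pairwise correlations.
weights = {
    ("g1", "g2"): 0.97,
    ("g1", "g3"): -0.91,
    ("g2", "g3"): 0.12,
    ("g3", "g4"): 0.40,
}

for t in (0.0, 0.3, 0.9):
    kept = threshold_graph(weights, t)
    print(f"t = {t}: {len(kept)} edges kept")
```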


Thresholding: Spectral methods

• Weaker (smaller-weight) edges connect dissimilar clusters of the graph

• As t (the threshold) is increased:
  → weaker edges are removed
  → dissimilar clusters are less connected
  → the number of “nearly-disconnected” clusters increases

Perkins, A. D., & Langston, M. A. (2009). Threshold selection in gene co-expression networks using spectral graph theory techniques. BMC Bioinformatics, 10(11), S4.


Finding “nearly-disconnected” clusters:

● Extract the largest connected component
● Compute the Laplacian of G
● Sort the entries of the eigenvector belonging to the second-smallest eigenvalue
● The sorted values form an ascending step-like function; each step corresponds to a transition from one cluster to another
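The steps above can be sketched with NumPy. This is only an illustration under stated assumptions: the Laplacian formula on the slide did not survive, so the unnormalized form L = D - A is assumed here; the input graph is assumed to be connected already (the largest component has been extracted); and the toy graph and the "largest gap" rule for locating a step are hypothetical simplifications, not the procedure of Perkins & Langston.

```python
import numpy as np

def fiedler_steps(adjacency):
    """Return vertex order and sorted entries of the eigenvector belonging
    to the second-smallest Laplacian eigenvalue."""
    A = np.asarray(adjacency, dtype=float)
    L = np.diag(A.sum(axis=1)) - A                  # assumed: unnormalized Laplacian D - A
    eigenvalues, eigenvectors = np.linalg.eigh(L)   # eigenvalues in ascending order
    fiedler = eigenvectors[:, 1]
    order = np.argsort(fiedler)
    return order, fiedler[order]

# Toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by one weak edge (2, 3).
A = np.zeros((6, 6))
for i, j, w in [(0, 1, 1), (0, 2, 1), (1, 2, 1),
                (3, 4, 1), (3, 5, 1), (4, 5, 1),
                (2, 3, 0.1)]:
    A[i, j] = A[j, i] = w

order, values = fiedler_steps(A)
gaps = np.diff(values)                 # a large gap marks a step between clusters
split = int(np.argmax(gaps)) + 1
clusters = (sorted(order[:split].tolist()), sorted(order[split:].tolist()))
print(values)
print(clusters)
```

The sign of an eigenvector is arbitrary, so which triangle appears first can differ between runs or library versions; the split itself does not depend on it.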


• Select the t that maximises the number of “nearly-disconnected” components, and therefore minimises the number of edges connecting dissimilar parts of the network.


Thresholding: Random Matrix Theory

Based on the nearest neighbor spacing distribution (NNSD) of eigenvalues from the adjacency/correlation matrix.

NNSD: differences between subsequent (ordered) eigenvalues

Jalan, S., & Bandyopadhyay, J. N. (2007). Random matrix analysis of complex networks. Physical Review E, 76(4), 046107.

[Figure panels: random network, scale-free network, small-world network]

Bleker, Clif, Garcia 23

Thresholding: Random Matrix Theory

Based on the nearest neighbor spacing distribution (NNSD) of eigenvalues from the adjacency/correlation matrix.

NNSD: differences between subsequent (ordered) eigenvalues

The NNSD of the eigenvalues of a random matrix can be approximated by the Wigner surmise.

The NNSD of a non-random matrix appears Poisson.

Iterate over t until we find the point where the NNSD transitions from one form to the other.

Gibson, S. M., Ficklin, S. P., Isaacson, S., Luo, F., Feltus, F. A., & Smith, M. C. (2013). Massive-scale gene co-expression network construction and robustness testing using random matrix theory. PLoS One, 8(2), e55871.
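To make the NNSD concrete, here is a small NumPy sketch that computes eigenvalue spacings for a random symmetric matrix and defines the two reference curves. It is a simplification under stated assumptions: spacings are rescaled by their global mean, whereas a full analysis would first "unfold" the spectrum, and the function names and the test matrix are hypothetical. It does not implement the threshold search of Gibson et al.

```python
import numpy as np

def nnsd(matrix):
    """Spacings between consecutive ordered eigenvalues, rescaled to mean 1.
    Assumption: a crude global rescaling is used instead of proper unfolding."""
    eigenvalues = np.linalg.eigvalsh(np.asarray(matrix, dtype=float))  # ascending
    spacings = np.diff(eigenvalues)
    return spacings / spacings.mean()

def wigner_surmise(s):
    """Reference density for random (GOE-like) matrices."""
    return (np.pi / 2) * s * np.exp(-np.pi * s ** 2 / 4)

def poisson(s):
    """Reference density for uncorrelated eigenvalues."""
    return np.exp(-s)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 200))
random_symmetric = (X + X.T) / 2

s = nnsd(random_symmetric)
# Level repulsion: under the Wigner surmise very small spacings are rare,
# whereas a Poisson distribution puts most of its mass near zero.
small_fraction = float(np.mean(s < 0.1))
print(f"fraction of spacings below 0.1: {small_fraction:.3f}")
```

In a thresholding loop one would recompute `nnsd` on the thresholded matrix for each candidate t and compare the histogram of spacings with the two reference curves.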


3. Time-varying Graphs


Time-varying Graphs

● Dynamic undirected graphs with a fixed underlying vertex set
  ○ Edges change over time, vertices do not
● AKA time-evolving graphs (TEG)
● Useful for time-based data
  ○ Gene expression over time
  ○ Congestion at intersections, etc.
● Use simple (or more complicated) metrics to ‘label’ the differences between time steps
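One simple way to label the differences between time steps is to compare the edge sets of consecutive snapshots. The sketch below counts gained and lost edges and computes the Jaccard similarity; the snapshots are hypothetical and these particular metrics are just one possible choice.

```python
def edge(u, v):
    """Undirected edge as an order-independent pair."""
    return frozenset((u, v))

def jaccard(edges_a, edges_b):
    """Shared edges divided by all edges seen in either snapshot."""
    union = edges_a | edges_b
    return len(edges_a & edges_b) / len(union) if union else 1.0

# Hypothetical snapshots over a fixed vertex set {g1, g2, g3, g4}.
snapshots = [
    {edge("g1", "g2"), edge("g2", "g3")},
    {edge("g1", "g2"), edge("g2", "g3"), edge("g3", "g4")},
    {edge("g3", "g4")},
]

labels = []
for before, after in zip(snapshots, snapshots[1:]):
    labels.append({
        "gained": len(after - before),
        "lost": len(before - after),
        "jaccard": jaccard(before, after),
    })
print(labels)
```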


4. Network Analysis Tools


Network Properties

● Density: Some biological networks are sparse


● Clustering coefficient

N = |V|; E_i = |edges between neighbors of i|; k_i = degree of i

[Figure: Average clustering coefficient for the metabolic networks of 43 organisms (colors represent different taxonomic domains). N is the number of nodes. The diamonds correspond to a scale-free network with the same number of nodes and edges.]
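The formulas for density and clustering coefficient were images on the slides, so the sketch below assumes the standard definitions for a simple undirected graph: density 2|E| / (N(N - 1)), local clustering coefficient C_i = 2E_i / (k_i(k_i - 1)), and the average of C_i over all N vertices. The adjacency-dictionary representation and the toy graph are illustrative choices.

```python
def density(adj):
    """Fraction of possible edges that are present: 2|E| / (N (N - 1))."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) / 2   # each edge is stored twice
    return 2 * m / (n * (n - 1))

def local_clustering(adj, i):
    """C_i = 2 E_i / (k_i (k_i - 1)); defined as 0 when k_i < 2."""
    nbrs = adj[i]
    k = len(nbrs)
    if k < 2:
        return 0.0
    # E_i: edges between neighbors of i (u < v counts each pair once).
    e = sum(1 for u in nbrs for v in nbrs if u < v and v in adj[u])
    return 2 * e / (k * (k - 1))

def average_clustering(adj):
    return sum(local_clustering(adj, i) for i in adj) / len(adj)

# Toy graph: triangle 1-2-3 with a pendant vertex 4 attached to 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}

print(density(adj))
print({i: local_clustering(adj, i) for i in adj})
print(average_clustering(adj))
```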


● Other:
  ○ Diameter: shortest distance between the two most distant nodes in the network. The diameter of metabolic networks is conserved even across distant organisms.
  ○ Average path length: average of the shortest-path lengths over all pairs of nodes.
  ○ Degree distribution: number of nodes with a certain degree. Biological networks tend to follow a power law.
  ○ ...
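For an unweighted, connected graph these properties can be computed with breadth-first search. The block below is a minimal sketch; the function names and the toy path graph are assumptions for illustration, and disconnected graphs would need extra handling.

```python
from collections import Counter, deque

def bfs_distances(adj, source):
    """Shortest-path length (in edges) from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def path_statistics(adj):
    """Diameter and average shortest-path length of a connected graph."""
    lengths = [d for s in adj
               for t, d in bfs_distances(adj, s).items() if t != s]
    return max(lengths), sum(lengths) / len(lengths)

def degree_distribution(adj):
    """Map each degree to the number of vertices that have it."""
    return Counter(len(nbrs) for nbrs in adj.values())

# Toy graph: the path 1 - 2 - 3 - 4.
adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}

diameter, average_length = path_statistics(adj)
print(diameter, average_length, dict(degree_distribution(adj)))
```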


Complex Network Models


Node Centralities and Ranking

● Degree centrality: Nodes with high degree centrality are hubs. While biological networks are robust against perturbation, the removal of hubs often leads to system failure.


● Other:
  ○ Closeness centrality: indicates important nodes that can communicate quickly with other nodes. Used to identify key central metabolites and extract the core metabolic network.
  ○ Betweenness centrality: nodes that appear in many shortest paths rank higher. Example: metabolites controlling the flux between two modules. In telecommunication networks such a node would have higher control.
  ○ Eigenvector centrality: ranks nodes higher when they are connected to important neighbors. Used to identify pairs of genes that cause sickness/death.
  ○ ...
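The sketch below computes degree, closeness, and betweenness centrality for a small unweighted, connected graph. Betweenness uses Brandes' accumulation scheme; closeness is taken as (n - 1) divided by the sum of shortest-path distances, which is one common convention. The toy graph (two triangles linked through a single vertex) is hypothetical and chosen so that the low-degree linking vertex has the highest betweenness.

```python
from collections import deque

def degree_centrality(adj):
    return {v: len(nbrs) for v, nbrs in adj.items()}

def closeness_centrality(adj):
    """(n - 1) / (sum of shortest-path distances); assumes a connected graph."""
    result = {}
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        result[s] = (len(adj) - 1) / sum(dist.values())
    return result

def betweenness_centrality(adj):
    """Brandes' algorithm for unweighted, undirected graphs (unnormalized)."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        stack, preds = [], {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)   # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while stack:                     # back-propagate path dependencies
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: b / 2 for v, b in bc.items()}   # each pair was counted twice

# Toy graph: triangles {1, 2, 3} and {5, 6, 7} linked through vertex 4.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5},
       5: {4, 6, 7}, 6: {5, 7}, 7: {5, 6}}

degree = degree_centrality(adj)
closeness = closeness_centrality(adj)
betweenness = betweenness_centrality(adj)
print(degree, closeness, betweenness, sep="\n")
```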


Clustering

● Clusters are parts of a graph that are highly associated

● In biology, clusters are of interest for a number of reasons:
  ○ Finding co-regulated vertices
  ○ Finding vertices that are part of the same process
  ○ Hypothesising functionality of unannotated vertices

Clustering Algorithms: Markov Clustering

Underlying idea: a random walk on a transition graph that starts within a cluster is more likely to stay within that cluster than to leave it.

Two steps on a transition matrix:

1. Matrix squaring (expansion): simulates random walks through the graph

2. Elementwise squaring (inflation): strengthens strong transition probabilities and weakens weak ones

Transition matrix

$P_{i,j} = P(i \mid j)$ = probability of walking from j to i

Each column holds the probabilities of every way of leaving that node, and sums to 1.

$$
M = \begin{pmatrix}
P_{1,1} & P_{1,2} & \cdots & P_{1,5} \\
P_{2,1} & \ddots & & \vdots \\
\vdots & & \ddots & \vdots \\
P_{5,1} & \cdots & \cdots & P_{5,5}
\end{pmatrix}
$$

Transition matrix multiplication

$$
(M^2)_{i,j} = \sum_k P_{i,k} P_{k,j} = \sum_k P(\text{walk to } i \text{ from } j \text{ through } k)
$$

$(M^2)_{2,3}$ = probability of walking to 2 from 3, over all 2-step paths

$$
M^2 = M \times M, \qquad M = \left(P_{i,j}\right)_{i,j = 1}^{5}
$$

Clustering Algorithms: Markov Clustering

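G is a graph
add self-loops to G
set inflation parameter I
set M to the stochastic matrix of G
change = ∞
while (change > ε) {
  M’ = M x M            (expansion)
  M’ = Г_I(M’)          (inflation: raise each entry to the power I)
  make M’ stochastic    (rescale each column to sum to 1)
  change = max |M’ - M|
  M = M’
}
Clustering is given by the components of M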
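The pseudocode can be turned into a short runnable sketch. The version below is a simplified illustration, not a reference implementation of MCL: the function names, the default values for `inflation`, `eps` and `max_iter`, the toy graph, and the way clusters are read off the converged matrix (the nonzero columns of each nonzero row) are all assumptions.

```python
def normalize_columns(M):
    """Rescale every column to sum to 1 (make the matrix column-stochastic)."""
    n = len(M)
    sums = [sum(M[i][j] for i in range(n)) for j in range(n)]
    return [[M[i][j] / sums[j] for j in range(n)] for i in range(n)]

def mcl(adjacency, inflation=2.0, eps=1e-9, max_iter=100):
    n = len(adjacency)
    # Add self-loops, then convert to a column-stochastic matrix.
    M = normalize_columns(
        [[adjacency[i][j] + (1 if i == j else 0) for j in range(n)]
         for i in range(n)])
    for _ in range(max_iter):
        # Expansion: M x M simulates two-step random walks.
        expanded = [[sum(M[i][k] * M[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
        # Inflation: elementwise power, followed by column rescaling.
        inflated = normalize_columns(
            [[x ** inflation for x in row] for row in expanded])
        change = max(abs(inflated[i][j] - M[i][j])
                     for i in range(n) for j in range(n))
        M = inflated
        if change <= eps:
            break
    # Rows that keep mass belong to attractors; the columns they attract
    # form one cluster. Identical member sets collapse into a single cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if M[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return clusters

# Toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adjacency = [[0] * 6 for _ in range(6)]
for u, v in edges:
    adjacency[u][v] = adjacency[v][u] = 1

clusters = mcl(adjacency)
print(sorted(sorted(c) for c in clusters))
```

Larger values of `inflation` tend to give finer clusters, smaller values coarser ones.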

Bleker, Clif, Garcia 40

Clustering Algorithms: Markov Clustering

Bleker, Clif, Garcia 41

Clustering Algorithms: Markov Clustering

Bleker, Clif, Garcia 42

Clustering Algorithms: Markov Clustering

Bleker, Clif, Garcia 43

Clustering Algorithms: Markov Clustering

Bleker, Clif, Garcia 44

Clustering Algorithms: Paraclique

g = 3

Bleker, Clif, Garcia 45

Clustering Algorithms: Paraclique

g = 3

Bleker, Clif, Garcia 46

Clustering Algorithms: Paraclique

g = 3

Bleker, Clif, Garcia 47

Clustering Algorithms: Paraclique

g = 3

Bleker, Clif, Garcia 48

Clustering Algorithms: Paraclique

g = 3

Bleker, Clif, Garcia 49

Clustering Algorithms: Paraclique

g = 3

Bleker, Clif, Garcia 50

Clustering Algorithms: Paraclique

g = 3

Bleker, Clif, Garcia 51

Clustering Algorithms: Paraclique

g = 3

Bleker, Clif, Garcia 52

Clustering Algorithms: Paraclique

g = 3

Bleker, Clif, Garcia 53

Clustering Algorithms: Paraclique

g = 3

Bleker, Clif, Garcia 54

Clustering Algorithms: Paraclique

g = 3

Bleker, Clif, Garcia 55

5. Common Networks in Molecular Biology

Bleker, Clif, Garcia 56

Metabolic Networks

● Network of chemical reactions enabling the conversion of substrates into energy and biomass. Can be represented by:
  ○ Simple graphs
  ○ (Directed) bipartite graphs
  ○ (Directed) hypergraphs
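The three representations can be compared on a tiny example. The sketch below stores two glycolysis reactions as directed hyperedges (a set of substrates pointing to a set of products) and derives the bipartite and simple-graph views from them. The data layout and helper names are assumptions for illustration; real metabolic reconstructions also carry stoichiometry and reversibility, which are ignored here.

```python
# Directed hypergraph: each reaction maps a set of substrates to a set of products.
reactions = {
    "R1": ({"glucose", "ATP"}, {"G6P", "ADP"}),   # hexokinase step
    "R2": ({"G6P"}, {"F6P"}),                     # isomerase step
}

def to_bipartite(reactions):
    """Directed bipartite graph: metabolite -> reaction -> metabolite."""
    edges = set()
    for name, (substrates, products) in reactions.items():
        edges |= {(s, name) for s in substrates}
        edges |= {(name, p) for p in products}
    return edges

def to_simple_graph(reactions):
    """Undirected metabolite graph: link every substrate to every product."""
    return {frozenset((s, p))
            for substrates, products in reactions.values()
            for s in substrates for p in products}

bipartite = to_bipartite(reactions)
simple = to_simple_graph(reactions)
print(len(bipartite), len(simple))
```

The simple graph loses the information about which metabolites react together, which the bipartite and hypergraph forms retain.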

Bleker, Clif, Garcia 57

Hierarchical Modularity in Metabolic Networks

Bleker, Clif, Garcia 58

Protein-protein Interaction Networks

● Represent how different proteins operate in coordination with others to enable biological processes within the cell.

Gene Co-expression Networks

Bleker, Clif, Garcia 60

6. Higher Level Biological Examples

Bleker, Clif, Garcia 61

Epilepsy Seizure Prediction

Bleker, Clif, Garcia 62

fMRI in Schizophrenia

Bleker, Clif, Garcia 63

Health Disparities

● Data: WHO child cause of death (COD) for 194 countries in 2013
● Method:
  ○ Pearson correlation between countries, across COD categories
● Graph:
  ○ Vertices: countries
  ○ Edges: similarities in COD
● Threshold at 0.95
● Markov clustering

Health Disparities Graph

[Figure: graph of countries; legend by region: Africa, Americas, East Mediterranean, Europe, South East Asia, Western Pacific]

Health Disparities Clustering

[Figure: clustering of the countries; legend by region: Africa, Americas, East Mediterranean, Europe, South East Asia, Western Pacific]

7. Yeast Life Cycle Time Varying Graph

Bleker, Clif, Garcia 67

Yeast Life Cycle

We want to know how gene-gene associations change over the life cycle of a yeast cell.

Yeast Life Cycle Data

● Yeast gene expression data collected from synchronised cultures
● 24 time points at 10-minute increments
● 6,178 genes

[Figure: data matrix with axes Genes and Time points]

Expression Over Time

[Figure: normalized expression value vs. time (s)]

Yeast Example Process

1. Removed genes with:
  ○ More than 4 missing values
  ○ Low variance over all time points (<0.6)
2. Calculated all-to-all pairwise Spearman correlations for time steps
3. Spectral thresholding: did not work
  ○ Used a hard cutoff of 0.8 for all graphs instead
4. Metric calculations
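The gene filter in step 1 might look like the sketch below. The two cutoffs come from the slide; everything else is an assumption: missing values are represented as `None`, the sample variance is computed over the observed values only, and the expression values are invented for illustration.

```python
from statistics import variance

MAX_MISSING = 4      # from the slide: remove genes with more than 4 missing values
MIN_VARIANCE = 0.6   # from the slide: remove genes with variance below 0.6

def keep_gene(values):
    """True if a gene passes both filters (variance over observed values only)."""
    observed = [v for v in values if v is not None]
    missing = len(values) - len(observed)
    return (missing <= MAX_MISSING
            and len(observed) >= 2
            and variance(observed) >= MIN_VARIANCE)

# Hypothetical expression profiles over six time points.
expression = {
    "geneA": [0.0, 1.0, 2.0, 3.0, 2.0, 1.0],
    "geneB": [1.0, 1.1, 0.9, 1.0, 1.0, 1.1],        # low variance
    "geneC": [None, None, None, None, None, 2.0],   # too many missing values
    "geneD": [3.0, 2.0, None, 0.0, 1.0, 2.0],
}

kept = {gene: values for gene, values in expression.items() if keep_gene(values)}
print(sorted(kept))
```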

Bleker, Clif, Garcia 71

Variance

[Figure: number of genes vs. variance]

Time-varying Co-expression Network

[Figure: expression matrix with genes G-1, G-2, … as rows and time points t-1, t-2, …, t-M as columns]

Time Graphs - Time 1

Time Graphs - Time 2

Time Graphs - Time 3

Time Graphs - Time 4

Time Graphs - Time 5

8. Issues

● Figure out how to threshold

● Better metrics to pinpoint differences in time based graphs

● Network validation, particularly for less studied systems

● Noise in high-throughput data

Bleker, Clif, Garcia 93

Test Questions

1. Which thresholding method did we try to use?

2. Name a type of calculated edge.

3. Name one type of graph that can be used to represent metabolic networks.