Lecture 8. Topics in Biological Networks (Basics)
The Chinese University of Hong Kong
CSCI5050 Bioinformatics and Computational Biology
CSCI5050 Bioinformatics and Computational Biology | Kevin Yip-cse-cuhk | Fall 2015 2
Lecture outline
1. Definition and different types of biological networks
2. Some high-throughput experimental methods for probing biological networks
  – Important databases
3. Some computational methods for reconstructing biological networks
4. Data analysis
  – Analyzing the networks
  – Using the networks to analyze other data
  – Visualization and analysis tools
Last update: 22-Oct-2015
DEFINITION AND TYPES
Part 1
Biological networks
• A biological network is represented by a graph G = (V, E)
  – V: a set of nodes (vertices). Each node v_i ∈ V represents an object
    • A gene, protein, metabolite, drug, ...
  – E: a set of edges. Each edge e_ij ∈ E connects two nodes v_i and v_j, and represents a relationship between the two objects
    • Protein-protein interaction (PPI), gene regulation, ...
    • Undirected (e_ij ∈ E ⇔ e_ji ∈ E, e.g., PPI) or directed (e_ij ∈ E does not imply e_ji ∈ E, e.g., gene regulation)
  – May have additional node and edge attributes, such as confidence of interaction
[Figure: an example graph with four nodes, v_1 to v_4]
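The graph definition above can be sketched in code. The following is a minimal illustration, not a reference implementation: the class name `Network`, its methods, and the example node names and attributes are all hypothetical. It stores G = (V, E) as an adjacency list, supports both directed and undirected edges, and keeps optional node and edge attributes.

```python
from collections import defaultdict

class Network:
    """A graph G = (V, E) stored as an adjacency list (hypothetical helper class)."""

    def __init__(self, directed=False):
        self.directed = directed
        self.nodes = {}               # node id -> dict of node attributes
        self.adj = defaultdict(dict)  # node id -> {neighbor id: dict of edge attributes}

    def add_node(self, v, **attrs):
        self.nodes.setdefault(v, {}).update(attrs)

    def add_edge(self, vi, vj, **attrs):
        self.add_node(vi)
        self.add_node(vj)
        self.adj[vi][vj] = dict(attrs)
        if not self.directed:
            # Undirected: e_ij in E if and only if e_ji in E.
            self.adj[vj][vi] = dict(attrs)

    def has_edge(self, vi, vj):
        return vj in self.adj.get(vi, {})

# An undirected PPI edge with a confidence attribute (made-up values).
ppi = Network(directed=False)
ppi.add_edge("proteinA", "proteinB", confidence=0.9)

# A directed, signed regulatory edge (made-up values).
grn = Network(directed=True)
grn.add_edge("TF1", "geneX", sign="activation")
```

In the undirected network the edge is visible from both endpoints, while in the directed network `grn.has_edge("geneX", "TF1")` is false.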
Network types
• Gene regulatory networks [project]
  – Transcription factor binding
    • Promoters
    • Distal regulatory elements
  – Micro-RNA [project]
• Co-expression networks
• Protein-protein interaction networks (lecture)
• Genetic interaction networks [project]
• Metabolic networks [project]
• Gene-drug interaction networks [project]
• Signaling networks
• Neural networks
• Disease transmission networks
• Phylogenetic networks
• Food web
• ...
[Slide annotation: the network types are grouped by level: molecular (DNA, RNA, proteins), cellular (pathways), multi-cellular, inter-organism, and inter-species]
TF regulatory networks
• Each node represents a gene and the protein(s) that it encodes
• An edge e_ij exists if v_i represents a transcription factor (TF) and it regulates the gene represented by v_j
  – Edges are directed
  – Edges should be signed (activation vs. repression), although this information is usually unavailable
  – May have edge weights to indicate confidence
  – Should record only direct regulation
  – The network itself does not provide information about the relationships between different edges
• Other types of gene regulatory networks (e.g., miRNA networks) are defined in similar ways
Image credit: Deneris and Wyler, Nature Neuroscience, published online 26 February 2012
Co-expression networks
• Each node represents a gene
• An edge e_ij exists if the genes represented by v_i and v_j co-express
  – Co-expression could be measured by correlation across multiple samples/conditions
    • May have edge weights to represent degree of co-expression
  – Edges are usually undirected
    • Unless measures like expression ranks are used
  – Usually more meaningful to measure protein abundance, but easier to measure RNA level
  – Co-expression may suggest functional relationships
Image credit: Prieto et al., PLoS One 3(12):e3911, (2008)
Node color indicates some network statistics to be explained later.
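As a rough illustration of the correlation-based definition above, the sketch below builds a weighted, undirected co-expression network. It assumes Pearson correlation as the measure and an arbitrary cutoff of 0.8 on its absolute value; the function names, the gene names, and the expression values are made up for the example.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile has no defined correlation; treat as 0
    return sxy / sqrt(sxx * syy)

def coexpression_edges(expr, threshold=0.8):
    """Return {(gene1, gene2): r} for gene pairs with |r| >= threshold."""
    edges = {}
    for g1, g2 in combinations(sorted(expr), 2):
        r = pearson(expr[g1], expr[g2])
        if abs(r) >= threshold:
            edges[(g1, g2)] = r  # the correlation serves as the edge weight
    return edges

# Toy expression levels across four samples/conditions (made-up numbers).
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [5.0, 1.0, 4.0, 2.0],
}
edges = coexpression_edges(expr)
```

With these toy values only geneA and geneB pass the cutoff. The choice of threshold strongly affects the resulting network and would need to be justified for real data.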
Protein-protein interaction (PPI) networks
• Each node represents a protein
• An edge e_ij exists if the proteins represented by v_i and v_j physically interact
  – Edges are undirected
  – Usually not distinguishing between permanent and transient interactions
  – In some datasets/databases, e_ij simply indicates that the proteins represented by v_i and v_j both participate in a complex, but they may not physically interact directly
  – Usually not considering whether it is possible for the different interactions to happen simultaneously
  – There are networks for specific types of interactions, e.g., phosphorylation networks
[Figure: human calcineurin heterodimer (1AUI)]
Image source: RCSB Protein Data Bank
Genetic interaction networks
• The term “genetic interaction” in general means any type of relationship between genes
• Specifically, it has been used to describe some particular types of scenarios:
  – Each node represents a gene
  – An edge e_ij exists if the growth rate of the cell is affected by the knockout/knockdown/overdose of the genes as shown in the table
  – Depending on the type, the edges can be directed or undirected

Type                      Definition
Synthetic lethality       0 = ij < i, j
Synthetic sick            0 < ij < i, j
Synthetic rescue          0 ?= i < ij
Dosage lethality          0 = ij* < i
Dosage sick               0 < ij* < i
Dosage rescue             0 ?= i < ij*
Phenotypic enhancement    ij < E[ij]
Phenotypic suppression    E[ij] < ij

*: overdose
In the table, i and j denote the growth rate when gene i or gene j alone is perturbed, ij the growth rate when both are perturbed, and E[ij] the expected value of ij.
Image credit: Drees et al., Genome Biology 6(4):R38, (2005)
Metabolic pathways
• Each node is a metabolite
• An edge e_ij exists if there is a reaction that turns the metabolite represented by v_i into the metabolite represented by v_j
  – Edges are directed
  – Both e_ij and e_ji exist if the reaction is reversible
  – Each edge is labeled by the enzyme that catalyzes the reaction in the cell
• There is a dual representation, in which each node is a reaction, and an edge e_ij exists if the reaction represented by v_i produces a product that is a substrate of the reaction represented by v_j
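The dual representation can be derived mechanically from a list of reactions. The sketch below is one possible way to do it, assuming each reaction is given as a pair of substrate and product sets. The function name, the reaction identifiers R1 to R3, and the three-step toy pathway are illustrative only.

```python
def reaction_graph(reactions):
    """Build the dual (reaction-centric) network.

    reactions: dict mapping reaction id -> (set of substrates, set of products).
    Returns the set of directed edges (ri, rj) such that some product of ri
    is a substrate of rj.
    """
    edges = set()
    for ri, (_, products_i) in reactions.items():
        for rj, (substrates_j, _) in reactions.items():
            if ri != rj and products_i & substrates_j:
                edges.add((ri, rj))
    return edges

# Toy pathway: glucose -> G6P -> F6P -> F1,6BP (first steps of glycolysis).
reactions = {
    "R1": ({"glucose"}, {"G6P"}),
    "R2": ({"G6P"}, {"F6P"}),
    "R3": ({"F6P"}, {"F1,6BP"}),
}
dual_edges = reaction_graph(reactions)
```

The pairwise loop is quadratic in the number of reactions, which is acceptable for a toy example. For a genome-scale network, indexing reactions by substrate would avoid comparing every pair.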
Metabolic pathways: an example
Image source: Kyoto Encyclopedia of Genes and Genomes
Signaling pathways
• Describing the events that happen in a cell in response to an external signal
• A heterogeneous network involving different types of data
  – Protein-protein interaction
    • Phosphorylation
  – Gene regulation
  – ...
Image source: Kyoto Encyclopedia of Genes and Genomes
Handling many types of relationship
• Need a systematic way to represent the many different types of relationship
Image credit: Lu et al., Trends in Biochemical Sciences 32(7):320-331, (2007)
Phylogenetic networks
• Generalization of phylogenetic trees, allowing non-tree structures (i.e., cycles, due to, for example, horizontal gene transfers)
• Each node is a species/clade
• An edge e_ij exists if the species represented by v_j diverged from, or received genetic material from, the species represented by v_i
  – Network based on a single gene vs. network based on the whole genome of a species
[Figure panels: Phylogenetic tree; Phylogenetic network]
Image credit: Wikipedia; Smets and Barkay, Nature Reviews Microbiology 3(9):675-678, (2005)
HIGH-THROUGHPUT EXPERIMENTAL METHODS
Part 2
Probing gene regulatory networks
• Transcription factor binding targets
  – Chromatin immunoprecipitation followed by
    • Microarray (ChIP-chip)
    • Sequencing (ChIP-seq)
• miRNA targets
  – Over-expression/silencing of miRNA, followed by profiling of changes in mRNA/protein levels
    • Including direct and indirect targets
  – Cross-linking immunoprecipitation-high-throughput sequencing (CLIP-seq)
PPI: Yeast two-hybrid (Y2H)
• To test whether two proteins physically interact
• Fuse one protein with a DNA binding domain (BD)
• Fuse the other with an activation domain (AD)
• If the two proteins physically interact, a reporter gene is expressed
• Can fix the first protein (the “bait”), and try many different second proteins (the “preys”)
Image source: Wikipedia
Protein complex: TAP-MS
• Tandem affinity purification followed by mass spectrometry
  – Adding a TAP tag to a bait protein
  – The protein and other proteins that bind to it (directly or indirectly) bind to IgG beads, while other proteins are washed away
  – The identity of the proteins pulled down in this way can be determined by mass spectrometry
Image source: Wikipedia
Synthetic lethality
• There are different methods
• One of them is Synthetic Genetic Array (SGA)
  – Create single mutation strains of different mating types
  – Mate and select for double mutation
  – Growth rate measured by visual inspection or image analysis of colony size
Image credit: Tong et al., Science 294(5550):2364-2368, (2001)
Databases
• There are many databases for biological networks
• BioGRID is a general database for various types of interactions in multiple species
• Gene Expression Omnibus (GEO) contains a lot of gene expression data
• The Kyoto Encyclopedia of Genes and Genomes (KEGG) contains information about pathways
• The Protein Data Bank (PDB) contains some crystal structures of interacting biological objects
• There are species-specific databases
  – Human Protein Reference Database (HPRD)
  – Saccharomyces Genome Database (SGD)
  – ...
• There are also databases that integrate other databases
  – Biological Networks database (IntegromeDB)
File formats
• Two main ways to store networks:
  – Adjacency matrix
  – Adjacency list
• Since most biological networks are sparse, adjacency list is more commonly used
• Simplest formats:
  – <Object 1><Tab><Object 2>
  – Simple interaction file (SIF): <Object 1><Tab><Type><Tab><Object 2>
  – XML
  – Formats with visualization information (e.g., GML)
(See http://wiki.cytoscape.org/Cytoscape_User_Manual/Network_Formats for some commonly used formats)
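The three-column SIF layout above maps directly onto an adjacency list. The sketch below is a minimal reader that assumes exactly three tab-separated fields per line, as described on the slide. The function name `parse_sif`, the interaction type labels, and the example records are made up for illustration.

```python
from collections import defaultdict

def parse_sif(lines):
    """Read <Object 1><Tab><Type><Tab><Object 2> records into an adjacency list.

    Returns a dict mapping each source object to a list of (type, target) pairs.
    """
    adj = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        source, edge_type, target = line.split("\t")
        adj[source].append((edge_type, target))
    return adj

# Three made-up interaction records in SIF layout.
sif_text = "TF1\tregulates\tgeneX\nP1\tpp\tP2\nTF1\tregulates\tgeneY\n"
network = parse_sif(sif_text.splitlines())
```

Only interacting pairs are stored, so memory use grows with the number of edges and not with the square of the number of nodes, which is the reason adjacency lists suit sparse networks.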
COMPUTATIONAL NETWORK RECONSTRUCTION METHODS
Part 3
Problem definition
• Network reconstruction as a machine learning problem
  – As a low-cost supplement to experimental methods
• Inputs
  – A set of nodes V, each node v_i described by a vector of features x_i
  – Each node pair (v_i, v_j) described by a (potentially empty) vector of features z_ij
  – A (potentially empty) set of positive example edges E+ ⊆ V×V (ideally E+ ⊆ E)
  – A (potentially empty) set of negative example edges E- ⊆ V×V
• Goal: For each node pair (v_i, v_j), determine whether the edge (v_i, v_j) is in the unknown set of edges, E
• Evaluating accuracy of predictions:
  – Cross-validation (using some examples for training and some for testing; repeat for different training/testing splits)
  – Functional enrichment analysis
  – Experimental validation
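The cross-validation step can be sketched as follows. This is only an illustration of splitting example edges into training and testing sets. The function name `k_fold_splits`, the number of folds, the random seed, and the example edges are assumptions, and no particular learning method is implied.

```python
import random

def k_fold_splits(examples, k=5, seed=0):
    """Yield (train, test) lists for k-fold cross-validation over example edges."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]  # k roughly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]
        yield train, test

# Made-up positive example edges (TF, target gene).
positive_edges = [
    ("TF1", "g1"), ("TF1", "g2"), ("TF2", "g3"),
    ("TF2", "g4"), ("TF3", "g5"), ("TF3", "g1"),
]
splits = list(k_fold_splits(positive_edges, k=3))
```

Each example edge is used for testing exactly once. Because edges that share a node are not independent, splitting by node instead of by edge may give a more honest estimate in some settings.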
Example: TF regulation
• Inputs
  – V: the set of all genes (TFs and non-TFs), each node v_i described by a vector x_i of node features:
    • Expression level of the gene at different time points
    • Sequence at the promoter region of the gene
    • ...
  – Each node pair (v_i, v_j) is described by a vector z_ij of features:
    • (If v_i represents a TF) Binding signal of the TF represented by v_i at the promoter region of the gene represented by v_j
    • (If v_i represents a TF) Expression of the gene represented by v_j when the gene represented by v_i is knocked out/down
    • ...
  – In some settings, there are no input positive examples
  – Usually there are no negative examples
• Goal: Determine which genes each TF regulates (and how, i.e., activation vs. repression, coefficients, etc.)
Some common difficulties
• Big data size (O(n^2) node pairs for n nodes)
  – Long computational time
  – Large memory consumption
• Small number of positive examples
• Noisy positive examples (false positives)
• Lack of negative examples
• How node features should be used to predict edges is not trivial
• Weak features
• Non-linear relationship between features and class (interaction/no interaction)
DATA ANALYSIS
Part 4
How to interpret these hair balls?
[Figure panels: Transcription factor binding; Protein-protein interactions; Phosphorylation; Metabolic; Genetic interactions]
Image credit: Zhu et al., Genes & Development 21(9):1010-1024, (2007)
Interpreting biological networks
• Network statistics
  – Identifying important nodes/edges
• Network generation process
  – Understanding the formation/evolution of networks
• Network modules
  – Identifying functional object groups
• Network motifs
  – Understanding working principles
Network statistics
• Some statistics about a network:
  – Degree of a node: number of edges incident on the node
    • In-degree and out-degree for a directed graph
  – Clustering coefficient of a node: the fraction of pairs of neighbors of the node that are connected to each other
  – Shortest path length between two nodes
  – Eccentricity of a node: the maximum of its shortest path lengths to all other nodes
  – Betweenness of a node: number of shortest paths that involve the node
    • Similar definition for the betweenness of an edge
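Several of these statistics can be computed with a few lines of code. The sketch below assumes an undirected, unweighted, connected network stored as a dictionary of neighbor sets. The function names and the five-node toy network are illustrative.

```python
from collections import deque
from itertools import combinations

def degree(adj, v):
    """Number of edges incident on v."""
    return len(adj[v])

def clustering_coefficient(adj, v):
    """Fraction of pairs of neighbors of v that are themselves connected."""
    nbrs = adj[v]
    if len(nbrs) < 2:
        return 0.0  # convention used here for nodes with fewer than two neighbors
    pairs = list(combinations(nbrs, 2))
    linked = sum(1 for a, b in pairs if b in adj[a])
    return linked / len(pairs)

def shortest_path_lengths(adj, source):
    """Breadth-first search: shortest path length from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def eccentricity(adj, v):
    """Maximum shortest path length from v (over reachable nodes only)."""
    return max(shortest_path_lengths(adj, v).values())

# Toy network: a triangle a-b-c with a tail a-d-e.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a", "e"},
    "e": {"d"},
}
```

In this toy network, node a has degree 3 and clustering coefficient 1/3, because only one of its three neighbor pairs (b and c) is connected.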
Identifying important objects
• A hub is an object with a large degree
  – It is likely important because, if it is disrupted, many interactions could be affected
• A bottleneck is an object with a large betweenness
  – It is likely important because, if it is disrupted, the information flow between many node pairs could be affected
Image credit: Yu et al., PLoS Computational Biology 3(4):e59, (2007)
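Betweenness can be computed for all nodes with Brandes' algorithm, which is a standard choice and not necessarily the method used in the cited study. The sketch below assumes an undirected, unweighted network stored as a dictionary of neighbor sets. When several shortest paths connect a node pair, each intermediate node is credited with the fraction of those paths that pass through it, a common refinement of the simple path count. The function name and toy network are illustrative.

```python
from collections import deque

def betweenness(adj):
    """Node betweenness of an undirected, unweighted graph (Brandes' algorithm)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        order = []                     # nodes in non-decreasing distance from s
        preds = {v: [] for v in adj}   # predecessors on shortest paths from s
        sigma = {v: 0 for v in adj}    # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in adj[u]:
                if dist[w] < 0:
                    dist[w] = dist[u] + 1
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        # Accumulate path dependencies from the farthest nodes back toward s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered node pair was counted once from each endpoint.
    return {v: b / 2 for v, b in score.items()}

# Toy network: a triangle a-b-c with a tail a-d-e.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a", "e"},
    "e": {"d"},
}
bc = betweenness(adj)
hub = max(adj, key=lambda v: len(adj[v]))
bottleneck = max(bc, key=bc.get)
```

In this toy network node a is both the hub and the bottleneck, but the two notions can disagree: node d has only two neighbors, yet every shortest path to e passes through it.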
Degree distribution
• It has been found that in many biological (and non-biological) networks, the degree distribution has a long tail
  – Most nodes have few interactions
  – A few nodes have many interactions
  – It has been proposed that these networks are “scale-free”, where the degree distribution follows a power law: P(k) ~ c k^{-γ} (usually 2 < γ < 3)
• Preferential attachment is one way to produce a scale-free network
  – The rich get richer, the poor get poorer
[Figure panels: An Erdős-Rényi random network; A scale-free network]
Image source: Wikipedia
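Preferential attachment is easy to simulate. The sketch below is a simplified growth model in the spirit of the idea described above, not a faithful reproduction of any published generator. The function name, the network size, the number of edges per new node (`m`), and the seed are arbitrary assumptions.

```python
import random
from collections import Counter

def preferential_attachment(n, m=2, seed=0):
    """Grow a network to n nodes; each new node attaches to m distinct existing
    nodes chosen with probability proportional to their current degree."""
    rng = random.Random(seed)
    # Start from a complete graph on m + 1 nodes.
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # Every node appears in `endpoints` once per incident edge, so a uniform
    # draw from this list is a degree-proportional draw over nodes.
    endpoints = [v for e in edges for v in e]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for t in targets:
            edges.append((new, t))
            endpoints.extend((new, t))
    return edges

edges = preferential_attachment(n=2000, m=2)
degree = Counter(v for e in edges for v in e)
distribution = Counter(degree.values())  # degree k -> number of nodes with degree k
```

Plotting `distribution` on log-log axes should show the long tail: most nodes stay near the minimum degree `m`, while a few early nodes accumulate many edges.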
Identifying important pathways
• In functional enrichment analysis, we check if an unexpectedly large fraction of genes in a target set share a common annotation
• This idea can be generalized: whether the genes in a target set are unexpectedly similar to each other
• A biological network provides a natural way to compute similarity
  – Finding clusters of genes with many direct connections (similar to finding protein complexes from PPI)
    • Alternatively, finding such highly-connected modules could suggest gene sets for performing standard functional enrichment analysis
  – Finding clusters of genes that are close to each other in the network
  – Finding genes (in the target set or not) that are close to the genes in the target set
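One common way to judge whether a fraction is "unexpectedly large" is a one-sided hypergeometric test. The slide does not name a specific test, so this choice is an assumption. The function name and all the numbers in the example are made up.

```python
from math import comb

def enrichment_p_value(N, K, n, k):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    of which K carry the annotation (hypergeometric upper tail)."""
    upper = min(K, n)
    favorable = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return favorable / comb(N, n)

# Made-up example: 6000 genes in total, 100 annotated with some function,
# a target set of 50 genes, 8 of which carry the annotation.
p = enrichment_p_value(N=6000, K=100, n=50, k=8)
```

Here fewer than one annotated gene would be expected by chance (50 × 100 / 6000 ≈ 0.83), so observing 8 gives a very small p-value. When many annotations are tested, the p-values need to be corrected for multiple testing.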
Network modules
Image credit: Palla et al., Nature 435(7043):814-818, (2005)
Image credit: Costanzo et al., Science 327(5964):425-431, (2010)
Genetic interaction network
• Between-pathway vs. within-pathway explanations for negative interactions (phenotype of the double knock-out worse than expected based on the two single knock-outs)
Image credit: Dixon et al., Annual Review of Genetics 43(1):601-625, (2009)
Phenotype-associating sub-networks
• A biological network can also be used to find consistent signals in sub-networks (and average out noise)
Image credit: Chuang et al., Molecular Systems Biology 3:140, (2007)
Biological networks and network motifs
Image credit: Milo et al., Science 298(5594):824-827, (2002)
Statistical significance of a motif
• To evaluate whether a pattern is over-represented, we want to know how many such patterns would be found in a “random” network
• How to form a random network?
  – Erdős-Rényi random graphs: define the nodes, then each edge appears with a certain probability
    • Not close to reality in many cases
  – Price/Barabási-Albert model: add the nodes one by one, where the chance for the new node to connect to an old node is proportional to the number of edges the old node already has
    • Closer to reality
  – Permuting the graph by reconnecting edges
    • Preserving the total number of nodes
    • Preserving the total number of edges
    • Preserving the number of edges of each node
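The third option, permuting the graph while preserving each node's number of edges, can be sketched with repeated edge swaps. The code below is an illustrative outline for directed networks without self-loops. It uses the feed-forward loop as the example motif and a z-score as the summary statistic. The function names, the number of swaps, the number of random networks, and the seed are all assumptions.

```python
import random
from statistics import mean, pstdev

def count_ffl(edges):
    """Count feed-forward loops: X->Y, Y->Z and X->Z (graph without self-loops)."""
    edge_set = set(edges)
    out = {}
    for u, v in edge_set:
        out.setdefault(u, set()).add(v)
    return sum(
        1
        for x, y in edge_set
        for z in out.get(y, ())
        if z != x and (x, z) in edge_set
    )

def rewire(edges, n_swaps, rng):
    """Swap the targets of random edge pairs: (a,b),(c,d) -> (a,d),(c,b).
    Every node keeps its in-degree and out-degree."""
    edges = list(edges)
    present = set(edges)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if a == d or c == b or (a, d) in present or (c, b) in present:
            continue  # would create a self-loop or a duplicate edge
        present -= {(a, b), (c, d)}
        present |= {(a, d), (c, b)}
        edges[i], edges[j] = (a, d), (c, b)
    return edges

def motif_z_score(edges, n_random=200, seed=0):
    """Compare the observed motif count with counts in rewired networks."""
    rng = random.Random(seed)
    observed = count_ffl(edges)
    counts = [count_ffl(rewire(edges, 10 * len(edges), rng)) for _ in range(n_random)]
    mu, sd = mean(counts), pstdev(counts)
    z = (observed - mu) / sd if sd > 0 else float("nan")
    return observed, mu, z

# Toy directed network containing one feed-forward loop (X, Y, Z).
toy = [("X", "Y"), ("Y", "Z"), ("X", "Z"), ("Z", "W"), ("W", "V"), ("V", "X")]
observed = count_ffl(toy)
shuffled = rewire(toy, 100, random.Random(1))
```

A large positive z-score from `motif_z_score` would suggest that the motif is over-represented relative to networks with the same degree sequence. Very small networks such as `toy` admit few valid swaps, so the comparison is only meaningful on larger networks.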
Statistical significance
Image credit: Milo et al., Science 298(5594):824-827, (2002)
Actual numbers observed
Image credit: Milo et al., Science 298(5594):824-827, (2002)
Possible functions of network motifs
• A coherent feed-forward loop can reject rapid variations in the input, so that output is produced only when there is a persistent input
• A single input motif (SIM) can turn on and turn off several downstream devices at different times according to their activation thresholds
[Figure labels: X, Y, Z]
Image credit: Shen-Orr et al., Nature Genetics 31(1):64-68, (2002)
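The filtering behavior of a coherent feed-forward loop can be illustrated with a toy discrete-time simulation. The model below is a deliberately crude sketch and is not taken from the cited paper. It assumes X activates Y, Y accumulates slowly while X is on, and Z requires both X and enough Y (AND logic). The function name, the rise and decay rates, and the threshold are arbitrary.

```python
def simulate_ffl(x_signal, rise=0.2, decay=0.2, threshold=0.5):
    """Coherent feed-forward loop X -> Y -> Z with X -> Z and AND logic at Z.

    x_signal: sequence of 0/1 values for the input X at each time step.
    Returns the 0/1 output of Z at each time step.
    """
    y = 0.0
    z_trace = []
    for x in x_signal:
        # Y builds up while X is on and decays while X is off.
        y = min(1.0, y + rise) if x else max(0.0, y - decay)
        # Z is produced only if X is on AND Y has passed its threshold.
        z_trace.append(1 if x and y >= threshold else 0)
    return z_trace

brief_pulse = [1, 1, 0, 0, 0, 0, 0, 0]
persistent = [1, 1, 1, 1, 1, 1, 0, 0]
z_brief = simulate_ffl(brief_pulse)
z_persistent = simulate_ffl(persistent)
```

With these settings the brief pulse never turns Z on, because Y does not reach the threshold before X disappears. The persistent input turns Z on after a delay, and Z switches off as soon as X does.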
Visualization tools
• Aisee – Tool for generating network figures in vector format
• Cytoscape – One of the most popular tools; a visualization tool and a platform with many open-source plugins for various types of analysis
• JUNG
• N-Browse
• Osprey
• Pajek
• tYNA
• ...
Analysis tools
• Some of the tools listed on the last slide
• GeneSpring – Popular tool for pathway analysis (commonly used for microarray data)
• GraphWeb
• HCE, Weka, ... (for clustering and other types of data mining/machine learning tasks)
• NetBox
• Pandora
(See http://wiki.reactome.org/index.php/Reactome_Resource_Guide for a long list of tools)
Summary
• There are many types of biological networks
  – Gene regulatory
  – Protein-protein interaction
  – Metabolic
  – ...
• There are high-throughput experimental methods for identifying the interactions
• There are also many computational methods for supplementing the noisy networks from experimental data
• Networks can be used to study object relationships, identify important objects and modules, and find associations with a phenotype