CSE891-001 Open Problems in Bioinformatics - Computational Network Biology




1

CSE891-001 Open Problems in Bioinformatics - Computational Network Biology

Jin Chen

232 Plant Biology Bld.

2012 Fall

113 Ernst Bessey Hall

2

About me…

• Jin Chen, Assistant Professor in CSE and PRL since 2009

• Office: 232 Plant Biology Lab. Tel: (517) 355-5015. Email: jinchen@msu.edu

3

Outline

• Course Description

• Introduction to Computational Network Biology

4

Course Description

• Course objectives: study interesting computational network biology problems and their algorithms, with a focus on the principles used to design those algorithms. (3 credits)

• Instructor: Jin Chen, Office: 232 Plant Biology Bld. Email: jinchen@msu.edu

• Office hours: Thursday 2PM-3PM. If you cannot attend office hours, email me about scheduling a different time.

• Web page: http://www.msu.edu/~jinchen/cse891-2012

5

Course Description

• Course work: one 80-minute lecture and 80 minutes of discussion and student presentations each week

• Grading policies: The course will be graded on attendance (10%), participation (20%), and presentation (70%).

• No Final Exam

Term project vs. presentation

6

Course Description

• Prerequisites: Graduate students in science or engineering.

Note: an override is necessary for non-CSE graduate students; please send your PID & NetID to me.

• No prior knowledge of biology is required. Computationally inclined biology graduate students are encouraged to take the class as well.

7

Suggested books

• A.-L. Barabási, Linked: The new science of networks

• U. Alon, An Introduction to Systems Biology

• B. Palsson. Systems Biology: Properties of Reconstructed Networks

• K. Kaneko, Life: An Introduction to Complex Systems Biology

8

Course Description

Graph Mining: graph model, graph clustering, subgraph mining

Network Biology: protein-protein interaction network, gene regulatory network, metabolic network, integrative study

9

Course Description

• Select 3 papers for presentation from the online paper list

• Each presentation is 45 min, including 15 min Q&A, followed by a discussion

• Your grade will be largely determined by the presentation (70%)

• Presentations start on Sep 11

Or 1 term project + 1 presentation

10

Why Bioinformatics

• Recent advances in biotechnology underline the need for new computational tools in modern biology; such tools are essential for analyzing, understanding, and manipulating the detailed information on life now at our disposal

• Problems in computational biology vary from understanding sequence data to the analysis of protein shapes, prediction of biological function, study of gene networks, and cell-wide computations

11

Different Views to Study Biological Problems

• Combinatorial Algorithms

– Transcription factors, protein interactions

• Statistical Algorithms

– Gene expressions

• Imaging Algorithms

– Sub-cellular localization, feature extraction

• Graph Algorithms

– Biological networks, everything is related!

12

Science 14 January 2011: Vol. 331 no. 6014 pp. 183-185 DOI: 10.1126/science.1193210

13

Biological Solution to a Fundamental Distributed Computing Problem

• Computational methods are extensively used to analyze and model biological systems

• But this paper provides an example of the reverse of this strategy, in which a biological process is used to derive a solution to a long-standing computational problem

• Distributed computing: a large number of processors jointly and distributively solve a task, without any of the processors getting all of the inputs or observing all of the outputs

• Biological processes are also distributed

14

Maximal Independent Set (MIS)

• A long-standing distributed computing problem is that of electing a set of local leaders (a maximal independent set) in a network of connected processors. Formally, an MIS is a set of nodes A such that every node in the network is either in A or directly connected to a node in A, and no two nodes in A are connected. An MIS is necessary for the deployment of large, ad hoc sensor networks

• Distributively electing an MIS has been considered a challenging problem for three decades. Luby and Alon et al. presented fast probabilistic algorithms for electing an MIS. But to date, no method has been able to efficiently reduce message complexity without assuming knowledge of the number of neighbors.
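The two conditions in the MIS definition can be checked directly. The following is a minimal, non-distributed sketch in Python; the function names and the toy path graph are hypothetical illustrations, and the graph is assumed to be undirected with no self-loops.

```python
def is_maximal_independent_set(adj, candidate):
    """Check the two MIS conditions for an undirected simple graph.

    adj: dict mapping each node to the set of its neighbors.
    candidate: iterable of nodes proposed as the set A.
    """
    a = set(candidate)
    # Independence: no two nodes in A are connected.
    independent = all(adj[u].isdisjoint(a) for u in a)
    # Maximality: every node is in A or directly connected to a node in A.
    dominating = all(v in a or not adj[v].isdisjoint(a) for v in adj)
    return independent and dominating


def greedy_mis(adj):
    """Sequential baseline: scan nodes in sorted order, keep any node with no chosen neighbor."""
    chosen = set()
    for v in sorted(adj):
        if adj[v].isdisjoint(chosen):
            chosen.add(v)
    return chosen


# Hypothetical example: the path 0 - 1 - 2 - 3 - 4.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
mis = greedy_mis(path)
```

The greedy scan is only a centralized reference point; the difficulty described on the slide is reaching the same kind of set when each processor sees only its own neighborhood.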

Blue = MIS

15

Selection of Neural Precursors

• The selection of neural precursors during the development of the nervous system resembles the MIS election problem. The sensory organ precursors (SOPs) are selected during larval and pupal development from clusters of equivalent cells.

• A cell that is selected as a SOP inhibits its neighbors by expressing high levels of the membrane-bound protein Delta, which binds and activates the transmembrane receptor protein Notch on adjacent cells. This lateral-inhibition process is highly accurate, resulting in a regularly spaced pattern in which each cell is either selected as SOP or is inhibited by a neighboring SOP.

Blue = SOP

16

Inspiration from Biology

• Although similar, the biological solution differs from computational algorithms

– SOP selection is probably performed without relying on knowledge of the number of neighbors that are not yet selected

– SOP selection requires nonlinear inhibition that in effect reduces communication to the simplest set of possible messages

• The authors thus asked whether they can develop an algorithm for MIS selection on the basis of a stochastic rate change model that would not require knowledge about the number of active neighbors and would only use threshold communication

17

Algorithm

• In an arbitrary synchronous communication network, nodes can only broadcast one-bit messages.

• A message broadcast by a node reaches all of its neighbors that are still active in the algorithm.

• In each round, a processor can only tell whether or not a message was sent to it. When a processor receives a message, it cannot tell which of its neighboring processors sent it, and it cannot count the number of messages received in a round.
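The communication model above can be simulated in a few lines. This is a simplified sketch under stated assumptions, not the authors' exact algorithm: it uses a fixed beep probability `p` (the published method varies the probability across nested loops), and the function name, parameters, and seed are hypothetical.

```python
import random


def beeping_mis(adj, p=0.5, seed=0, max_rounds=10_000):
    """Simulate a one-bit ("beep") MIS election on an undirected simple graph.

    adj: dict mapping each node to the set of its neighbors.
    p: fixed per-round beep probability (an assumption of this sketch).
    """
    rng = random.Random(seed)
    active = set(adj)
    mis = set()
    for _ in range(max_rounds):
        if not active:
            break
        # Exchange 1: each active node beeps with probability p.
        beeped = {v for v in sorted(active) if rng.random() < p}
        # A node learns only whether at least one neighbor beeped,
        # not which neighbor and not how many.
        heard = {v for v in active if adj[v] & beeped}
        # Exchange 2: a node that beeped into silence joins the MIS and announces it.
        joined = beeped - heard
        mis |= joined
        # Joined nodes and their neighbors leave the algorithm.
        silenced = {u for v in joined for u in adj[v]}
        active -= joined | silenced
    return mis
```

Two adjacent nodes that beep in the same round both hear a beep and neither joins, which keeps the result independent; a node leaves only when it or one of its neighbors has joined, which gives maximality once no active nodes remain.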

18

Algorithm Complexity

• The running time of the algorithm is O(log n * log D), which is the number of rounds required to execute the two nested loops

• The worst-case running time is O(log^2 n)

• By studying a developmental process in flies, the authors devised a solution to an important distributed computing problem. The new algorithm does not require knowledge of the degree of individual processors, uses one-bit messages, and has an optimal message complexity.

19

Conclusion

• Using insights from biology to advance computational systems has mainly focused on optimization techniques inspired by biological observations.

• Areas of computer science that require strict, provable guarantees can also benefit from knowledge regarding how biological systems operate.

• Better understanding of these biological systems can lead to further improvement in the design of complex distributed computing systems.

20

Introduction to Computational Network Biology

• Network biology belongs to systems biology, which belongs to genomics

• Interested in the relations between entities rather than the entities themselves

http://bionet.bioapps.biozentrum.uni-wuerzburg.de/

21

Networks everywhere

• Internet, social network, anti-terrorism network

• Biological networks

– Protein-protein interaction (PPI) network

– Protein-DNA interaction network

– Gene correlation network

– Gene regulatory network

– Metabolic network

– Signaling network…

• A network is a tool for understanding complex systems

• Network models explain network properties and support the study of network behavior

• Network measures provide quantitative analysis for complex systems

22

Definition of network (graph)

Node (vertex)

G(V,E)

Self-loop

Edge

Multi-set of edges

Simple graph: a graph with no loops (self-edges) and no multi-edges.
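As a small illustration of the simple-graph condition, the sketch below checks an undirected edge list for self-loops and multi-edges. The function name and the edge-list representation are assumptions made for this example.

```python
from collections import Counter


def is_simple(edges):
    """Return True if an undirected edge list has no self-loops and no multi-edges."""
    # Normalise each undirected edge so that (u, v) and (v, u) compare equal.
    counts = Counter(frozenset(e) for e in edges)
    has_self_loop = any(u == v for u, v in edges)
    has_multi_edge = any(c > 1 for c in counts.values())
    return not has_self_loop and not has_multi_edge
```

For a directed graph the normalisation step would be dropped, since (u, v) and (v, u) are then distinct edges.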

23

Definition of network (graph)

Directed graph vs. undirected graph

Labeled graph vs. unlabeled graph

Symmetric graph vs. asymmetric graph

24

Webpage layout

M. Newman and M. Girvan. Finding and evaluating community structure in networks. Phys. Rev. E 69, 026113 (2004)

Pages on a web site and the hyperlinks between them

25

Adapted from R. Albert’s slides

26

Biological networks

27

Hawoong Jeong

Yeast Protein-Protein Interaction network

28

Eric Davidson

Gene regulation network of sea urchin

29

Abhishek Murarka

Metabolic flux analysis of E. coli

30

Why study networks?

• Complex systems cannot be described in a reductionist view

• Behavior study of complex systems starts with understanding the network topology

• Network-related questions:

– How do we reconstruct a network?

– How can we quantitatively describe large networks?

– How did networks get to be the way they are?

31

Simple measures

• Node degree: the number of edges connected to the node

– In-degree and out-degree

– Total in-degree == total out-degree

• Average degree: the average of the node degrees over all nodes in the network, denoted as $\langle k \rangle = \frac{1}{N} \sum_{i=1}^{N} k_i$, where N is the number of nodes in the network and $k_i$ is the degree of node i

• Degree distribution: the degree distribution P(k) gives the fraction of nodes that have k edges
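The degree measures above can be sketched as follows. The helper names and the toy graph are hypothetical; the undirected helpers assume a simple graph given as an edge list.

```python
from collections import Counter


def degrees(edges, nodes):
    """Degree of every node in an undirected simple graph given as an edge list."""
    deg = {v: 0 for v in nodes}
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def average_degree(deg):
    """Mean of the node degrees over all N nodes."""
    return sum(deg.values()) / len(deg)


def degree_distribution(deg):
    """P(k): fraction of nodes that have degree k."""
    n = len(deg)
    return {k: c / n for k, c in sorted(Counter(deg.values()).items())}


def in_out_degrees(arcs, nodes):
    """In- and out-degree for a directed graph given as (source, target) pairs."""
    indeg = {v: 0 for v in nodes}
    outdeg = {v: 0 for v in nodes}
    for s, t in arcs:
        outdeg[s] += 1
        indeg[t] += 1
    return indeg, outdeg


# Hypothetical toy graph: node 0 linked to 1, 2, 3, plus the edge 1-2.
nodes = [0, 1, 2, 3]
edges = [(0, 1), (0, 2), (0, 3), (1, 2)]
deg = degrees(edges, nodes)
```

Every directed edge contributes one unit to some out-degree and one unit to some in-degree, which is why the two totals always match.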

32

Simple measures

• Shortest path: a path between two nodes such that the sum of the weights of its constituent edges is minimized

• Graph diameter: the longest shortest path between any pair of nodes in the graph.

• Connected graph: any two vertices can be joined by a path

• Bridge: an edge whose removal disconnects the graph
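These four measures can be illustrated with breadth-first search. The sketch below assumes an undirected, unweighted graph (every edge has weight 1) stored as a dict of neighbor sets; the function names are hypothetical, and weighted graphs would need Dijkstra's algorithm or similar instead.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop-count shortest-path lengths from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def is_connected(adj):
    """True if every node is reachable from an arbitrary start node."""
    if not adj:
        return True
    start = next(iter(adj))
    return len(bfs_distances(adj, start)) == len(adj)


def diameter(adj):
    """Longest shortest path; this sketch defines it only for connected graphs."""
    if not is_connected(adj):
        raise ValueError("diameter is undefined for a disconnected graph in this sketch")
    return max(max(bfs_distances(adj, s).values()) for s in adj)


def is_bridge(adj, u, v):
    """True if removing the existing edge u-v leaves u and v disconnected."""
    pruned = {x: set(nbrs) for x, nbrs in adj.items()}
    pruned[u].discard(v)
    pruned[v].discard(u)
    return v not in bfs_distances(pruned, u)


# Hypothetical example: triangle 0-1-2 with a pendant node 3 attached to 2.
g = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
```

In this example the edge 2-3 is a bridge, while each triangle edge is not, because the other two sides of the triangle provide an alternative route.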

33

Simple measures

• Betweenness centrality: for all node pairs (i, j), let C(i,j) be the number of shortest paths between nodes i and j, and let $C_k(i,j)$ be the number of these that pass through node k. Betweenness centrality of node k is $b_k = \sum_{i \neq j \neq k} \frac{C_k(i,j)}{C(i,j)}$

• Calculating the betweenness involves calculating the shortest paths between all pairs of vertices on a graph: O(V^2 log V + VE) for a sparse graph with Johnson’s algorithm.

L. C. Freeman, Sociometry 40, 35 (1977); P. E. Black, Dictionary of Algorithms and Data Structures (2004)
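A direct, unoptimized reading of the betweenness definition is sketched below for an undirected, unweighted graph. It counts shortest paths with breadth-first search rather than Johnson's algorithm, and it counts each unordered pair (i, j) once; both choices, and the function names, are assumptions of this example.

```python
from collections import deque
from itertools import combinations


def count_shortest_paths(adj, source):
    """BFS from source; returns (dist, sigma), where sigma[v] counts shortest paths to v."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                # u is a predecessor of v on some shortest path.
                sigma[v] += sigma[u]
    return dist, sigma


def betweenness(adj):
    """Sum over unordered pairs (i, j) of C_k(i, j) / C(i, j) for every node k."""
    info = {s: count_shortest_paths(adj, s) for s in adj}
    score = {k: 0.0 for k in adj}
    for i, j in combinations(sorted(adj), 2):
        dist_i, sigma_i = info[i]
        if j not in dist_i:
            continue  # i and j are disconnected
        for k in adj:
            if k in (i, j) or k not in dist_i:
                continue
            dist_k, sigma_k = info[k]
            # k lies on a shortest i-j path exactly when the distances add up.
            if j in dist_k and dist_i[k] + dist_k[j] == dist_i[j]:
                score[k] += sigma_i[k] * sigma_k[j] / sigma_i[j]
    return score
```

On a 4-cycle, each pair of opposite nodes has two shortest paths, one through each remaining node, so every node receives a score of 0.5.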

34

Simple measures

• Clustering coefficient: a measure of the degree to which nodes in a graph tend to cluster together. It is based on triplets of nodes.

• The neighborhood $N_i$ of a vertex $v_i$ is defined as its immediately connected neighbors: $N_i = \{ v_j : e_{ij} \in E \}$

• The local clustering coefficient $C_i$ for a vertex $v_i$ is then given by the number of links between the vertices within its neighborhood divided by the number of links that could possibly exist between them. For an undirected graph with $k_i = |N_i|$: $C_i = \frac{2\,|\{ e_{jk} : v_j, v_k \in N_i,\ e_{jk} \in E \}|}{k_i (k_i - 1)}$

L. C. Freeman, Sociometry 40, 35 (1977)
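The local clustering coefficient can be sketched as below for an undirected simple graph stored as a dict of neighbor sets. The function name is hypothetical, and returning 0.0 for nodes with fewer than two neighbors is a convention assumed here, since the ratio is otherwise undefined.

```python
from itertools import combinations


def local_clustering(adj, v):
    """Fraction of neighbor pairs of v that are themselves linked."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumed convention: no neighbor pair exists
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2 * links / (k * (k - 1))


# Hypothetical example: triangle 0-1-2 with a pendant node 3 attached to 2.
g = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
```

Node 0 has two neighbors that are linked to each other, so its coefficient is 1; node 2 has three neighbors with only one link among the three possible pairs, giving 1/3.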

35

Complex measures

• Frequent subgraph mining

• Graph comparison & classification

• Graph isomorphism testing

36

Useful software

• Visualization & Topological Analysis

– Cytoscape (www.cytoscape.org)

– Pajek (vlado.fmf.uni-lj.si/pub/networks/pajek)

• Graph-related programming

– LEDA (www.algorithmic-solutions.com)

– Nauty (www.cs.sunysb.edu/~algorith/implement/nauty/implement.shtml)


Real networks are much more complex

• Transcription regulatory networks of yeast and E. coli show an interesting example of mixed characteristics

– How many genes a TF interacts with: scale-free

– How many TFs interact with a given gene: exponential

Modularity and network motif

• Cellular functions are likely to be carried out in a highly modular manner

• Module: a group of genes/proteins that work together to achieve distinct functions

• Biology is full of examples of modularity

Remaining challenges

• Discovery of network motifs is closely related to the generation of random networks

• Structure of network motifs does not necessarily determine function

• Relation between higher-level organizational, functional states and networks has not yet been studied
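The link between motif discovery and random networks can be illustrated with a small sketch: a subgraph pattern is counted in the observed network and in degree-preserving randomizations of it, and the two are compared with a z-score. Everything here is an assumption for illustration, including the use of undirected triangles as a stand-in pattern, the double-edge-swap null model, the number of swaps, and the function names; real motif studies typically use directed patterns such as feed-forward loops.

```python
import random
from statistics import mean, pstdev


def count_triangles(edges):
    """Number of triangles in an undirected simple graph given as an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Each triangle is seen once per edge, i.e. three times in total.
    return sum(len(adj[u] & adj[v]) for u, v in edges) // 3


def rewire(edges, n_swaps, rng):
    """Degree-preserving randomization by repeated double-edge swaps."""
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        # Proposed swap: a-b, c-d  ->  a-d, c-b.
        if len({a, b, c, d}) < 4:
            continue  # would create a self-loop or leave the graph unchanged
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue  # would create a multi-edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edges[i], edges[j] = (a, d), (c, b)
    return edges


def motif_z_score(edges, n_random=200, seed=0):
    """Observed triangle count, null-model mean, and z-score against rewired graphs."""
    rng = random.Random(seed)
    observed = count_triangles(edges)
    null = [count_triangles(rewire(edges, 10 * len(edges), rng)) for _ in range(n_random)]
    spread = pstdev(null)
    z = (observed - mean(null)) / spread if spread > 0 else float("nan")
    return observed, mean(null), z
```

The result depends strongly on the null model: a pattern that looks over-represented against one family of random networks may look ordinary against another, which is one reason the slide lists this as an open challenge.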

Voigt, W. et al. Genetics 2005

Ingram, P.J. et al. BMC Genomics 2006

Eric Werner. Nature 2007

45

Next class

• PPI network construction

• False-positive detection
