CSE891-001 Open Problems in Bioinformatics - Computational Network Biology. Jin Chen, 232 Plant Biology Bld. 2012 Fall, 113 Ernst Bessey Hall


Page 1: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

1

CSE891-001 Open Problems in Bioinformatics - Computational Network Biology

Jin Chen

232 Plant Biology Bld.

2012 Fall

113 Ernst Bessey Hall

Page 2: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

2

About me…

• Jin Chen, Assistant Professor in CSE and PRL since 2009

• Office: 232 Plant Biology Lab. Tel: (517) 355-5015. Email: [email protected]

Page 3: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

3

Outline

• Course Description

• Introduction to Computational Network Biology

Page 4: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

4

Course Description

• Course objectives: study interesting computational network biology problems and their algorithms, with a focus on the principles used to design those algorithms. (3 credits)

• Instructor: Jin Chen, Office: 232 Plant Biology Bld. Email: [email protected]

• Office hours: Thursday 2PM-3PM. If you cannot attend office hours, email me about scheduling a different time.

• Web page: http://www.msu.edu/~jinchen/cse891-2012

Page 5: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

5

Course Description

• Course work: one 80-minute lecture and 80 minutes of discussion & student presentations each week

• Grading policies: The course will be graded on attendance (10%), participation (20%), and presentation (70%).

• No Final Exam

Term project vs. presentation

Page 6: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

6

Course Description

• Prerequisites: Graduate students in science or engineering.

Note: an override is necessary for non-CSE graduate students; please send your PID & NetID to me.

• No prior knowledge of biology is required. Computationally inclined biology graduate students are encouraged to take the class as well.

Page 7: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

7

Suggested books

• A.-L. Barabási, Linked: The new science of networks

• U. Alon, An Introduction to Systems Biology

• B. Palsson. Systems Biology: Properties of Reconstructed Networks

• K. Kaneko, Life: An Introduction to Complex Systems Biology

Page 8: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

8

Course Description

• Graph Mining: graph model, graph clustering, subgraph mining

• Network Biology: protein-protein interaction network, gene regulatory network, metabolic network, integrative study

Page 9: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

9

Course Description

• Select 3 papers for presentation from the online paper list

• Each presentation is 45 min, including 15 min of Q&A, followed by a discussion

• Your grade will be largely determined by the presentation (70%)

• Presentations start on Sep 11

• Or: 1 term project + 1 presentation

Page 10: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

10

Why Bioinformatics

• Recent advances in biotechnology underline the need for new computational tools in modern biology. These tools are essential for analyzing, understanding and manipulating the detailed information on life we now have at our disposal

• Problems in computational biology vary from understanding sequence data to the analysis of protein shapes, prediction of biological function, study of gene networks, and cell-wide computations

Page 11: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

11

Different Views to Study Biological Problems

• Combinatorial Algorithms
  – Transcription factors, protein interactions

• Statistical Algorithms
  – Gene expressions

• Imaging Algorithms
  – Sub-cellular localization, feature extraction

• Graph Algorithms
  – Biological networks, everything is related!

Page 12: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

12

Science 14 January 2011: Vol. 331 no. 6014 pp. 183-185 DOI: 10.1126/science.1193210

Page 13: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

13

Biological Solution to a Fundamental Distributed Computing Problem

• Computational methods are extensively used to analyze and model biological systems

• But this paper provides an example of the reverse of this strategy, in which a biological process is used to derive a solution to a long-standing computational problem

• Distributed computing: a large number of processors jointly and distributively solve a task, without any of the processors getting all of the inputs or observing all of the outputs

• Biological processes are also distributed

Page 14: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

14

Maximal Independent Set (MIS)

• A long-standing distributed computing problem is that of electing a set of local leaders (a maximal independent set) in a network of connected processors. Formally, an MIS is defined as a set of nodes A such that every node in the network is either in A or directly connected to a node in A, and no two nodes in A are connected. An MIS is necessary for the deployment of large, ad hoc sensor networks

• Distributively electing an MIS has been considered a challenging problem for three decades. Luby and Alon et al. presented fast probabilistic algorithms for electing an MIS. But to date, no method has been able to efficiently reduce message complexity without assuming knowledge of the number of neighbors.

Blue = MIS
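
The two conditions in the MIS definition above can be checked directly. The following is a minimal sketch, assuming an undirected simple graph stored as a dictionary of neighbor sets; the names `is_maximal_independent_set` and `greedy_mis` are illustrative and do not come from the slides. The greedy routine is a sequential baseline, not a distributed algorithm.

```python
from typing import Dict, Set

# Assumed representation: node -> set of neighboring nodes (undirected, no self-loops).
Graph = Dict[int, Set[int]]


def is_maximal_independent_set(graph: Graph, chosen: Set[int]) -> bool:
    """Check both conditions from the MIS definition."""
    # Independence: no two nodes in the set are connected.
    for u in chosen:
        if graph[u] & chosen:
            return False
    # Maximality: every node is in the set or directly connected to a node in it.
    return all(u in chosen or bool(graph[u] & chosen) for u in graph)


def greedy_mis(graph: Graph) -> Set[int]:
    """Sequential baseline: scan nodes in order, keep each node with no chosen neighbor."""
    chosen: Set[int] = set()
    for u in sorted(graph):
        if not (graph[u] & chosen):
            chosen.add(u)
    return chosen


# Path graph 0 - 1 - 2 - 3
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(greedy_mis(path))                              # {0, 2}
print(is_maximal_independent_set(path, {0}))         # False: node 2 is not covered
```

Note that an MIS is not unique: on the same path graph, {1, 3} also satisfies both conditions.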

Page 15: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

15

Selection of Neural Precursors

• The selection of neural precursors during the development of the nervous system resembles the MIS election problem. The sensory organ precursors (SOPs) are selected during larval and pupal development from clusters of equivalent cells.

• A cell that is selected as an SOP inhibits its neighbors by expressing high levels of the membrane-bound protein Delta, which binds and activates the transmembrane receptor protein Notch on adjacent cells. This lateral-inhibition process is highly accurate, resulting in a regularly spaced pattern in which each cell is either selected as an SOP or is inhibited by a neighboring SOP.

Blue = SOP

Page 16: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

16

Inspiration from Biology

• Although similar, the biological solution differs from computational algorithms
  – SOP selection is probably performed without relying on knowledge of the number of neighbors that are not yet selected
  – SOP selection requires nonlinear inhibition that in effect reduces communication to the simplest set of possible messages

• The authors thus asked whether they can develop an algorithm for MIS selection on the basis of a stochastic rate change model that would not require knowledge about the number of active neighbors and would only use threshold communication

Page 17: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

17

Algorithm

• In an arbitrary synchronous communication network, nodes can only broadcast one-bit messages.

• A message broadcast by a node reaches all of its neighbors that are still active in the algorithm.

• In each round, a processor can only tell whether or not a message was sent to it. When a processor receives a message, it cannot tell which of its neighboring processors sent it, and it cannot count the number of messages received in a round.
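
The communication model above can be illustrated with a small centralized simulation. This is a simplified sketch in the spirit of the algorithm, not the authors' exact procedure: the probability schedule (starting near 1/D, doubling up to a cap of 1/2), the `max_rounds` guard, and the function name `one_bit_mis` are all assumptions made for illustration. What it does preserve is the constraint that a node only learns whether at least one active neighbor broadcast, never which one or how many.

```python
import math
import random
from typing import Dict, Optional, Set

Graph = Dict[int, Set[int]]  # assumed: undirected adjacency sets, no self-loops


def one_bit_mis(graph: Graph, seed: Optional[int] = None,
                max_rounds: int = 100_000) -> Set[int]:
    """Simulate synchronous rounds of one-bit broadcasts until every node is decided."""
    rng = random.Random(seed)
    active = set(graph)
    mis: Set[int] = set()
    max_degree = max((len(nbrs) for nbrs in graph.values()), default=0)
    # Assumed schedule: broadcast probability 2**-exponent, raised step by step to 1/2.
    exponent = max(1, math.ceil(math.log2(max_degree + 1)))
    steps_per_phase = max(1, math.ceil(math.log2(len(graph) + 1)))
    rounds = 0
    while active:
        if rounds >= max_rounds:
            raise RuntimeError("simulation did not finish within max_rounds")
        p = 2.0 ** -exponent
        # First exchange: each active node broadcasts one bit with probability p.
        beeping = {u for u in sorted(active) if rng.random() < p}
        # A broadcaster that hears no other broadcast wins. The test below uses only
        # "did any active neighbor broadcast?", i.e. a single bit of information.
        winners = {u for u in beeping if not (graph[u] & beeping)}
        # Second exchange: winners join the MIS and their neighbors become inactive.
        mis |= winners
        active -= winners
        for u in winners:
            active -= graph[u]
        rounds += 1
        if rounds % steps_per_phase == 0 and exponent > 1:
            exponent -= 1
    return mis


# Ring of six processors, each connected to its two neighbors.
ring = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
print(one_bit_mis(ring, seed=1))
```

Two adjacent nodes can never win in the same round, because each would hear the other's broadcast. A node leaves the active set only by winning or by being next to a winner, so the returned set is independent and maximal whenever the loop finishes.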

Page 18: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

18

Algorithm Complexity

• The running time of the algorithm is $O(\log n \cdot \log D)$, which is the number of rounds required to execute the two nested loops

• The worst-case running time is $O(\log^2 n)$

• By studying a developmental process in flies, the authors devised a solution to an important distributed computing problem. The new algorithm does not require knowledge of the degree of individual processors, uses one-bit messages, and has an optimal message complexity.
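
The step from the general bound to the worst-case bound can be spelled out. This assumes that D denotes (an upper bound on) the maximum node degree and n the number of nodes, which the notation suggests but the slide does not state explicitly:

```latex
D \le n - 1
\;\Longrightarrow\;
\log D < \log n
\;\Longrightarrow\;
O(\log n \cdot \log D) \subseteq O(\log^2 n)
```

Under that assumption, the bound is tightest for sparse, low-degree networks and reaches $O(\log^2 n)$ only when some node is connected to a constant fraction of the network.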

Page 19: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

19

Conclusion

• Using insights from biology to advance computational systems has mainly focused on optimization techniques inspired by biological observations.

• Areas of computer science that require strict, provable guarantees can also benefit from knowledge regarding how biological systems operate.

• Better understanding of these biological systems can lead to further improvement in the design of complex distributed computing systems.

Page 20: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

20

Introduction to Computational Network Biology

• Network biology belongs to systems biology, which belongs to genomics

• Interested in the relations between entities rather than the entities themselves

http://bionet.bioapps.biozentrum.uni-wuerzburg.de/

Page 21: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

21

Networks are everywhere

• Internet, social network, anti-terrorism network

• Biological networks
  – Protein-protein interaction (PPI) network
  – Protein-DNA interaction network
  – Gene correlation network
  – Gene regulatory network
  – Metabolic network
  – Signaling network…

• A network is a tool for understanding complex systems

• Network models explain network properties and support the study of network behavior

• Network measures provide quantitative analysis for complex systems

Page 22: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

22

Definition of network (graph)

Node (vertex)

G(V,E)

Self-loop

Edge

Multi-set of edges

Simple graph: does not have loops (self-edges) and does not have multi-edges.
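
As a small illustration of the definition, the check below tests whether an undirected edge list describes a simple graph. It is a sketch under assumptions: edges are given as pairs of node labels, direction is ignored, and the name `is_simple` is hypothetical.

```python
from collections import Counter
from typing import Hashable, Iterable, Tuple

Edge = Tuple[Hashable, Hashable]


def is_simple(edges: Iterable[Edge]) -> bool:
    """Return True if the undirected edge list has no self-loops and no multi-edges."""
    counts: Counter = Counter()
    for u, v in edges:
        if u == v:
            return False  # self-loop
        counts[frozenset((u, v))] += 1  # (u, v) and (v, u) are the same undirected edge
    return all(c == 1 for c in counts.values())


print(is_simple([("a", "b"), ("b", "c")]))   # True
print(is_simple([("a", "b"), ("b", "a")]))   # False: the edge a-b appears twice
```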

Page 23: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

23

Definition of network (graph)

Directed graph vs. undirected graph

Labeled graph vs. unlabeled graph

Symmetric graph vs. asymmetric graph

Page 24: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

24

Webpage layout

M. Newman and M. Girvan. Finding and evaluating community structure in networks. Phys. Rev. E 69, 026113 (2004)

Pages on a web site and the hyperlinks between them

Page 25: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

25

Adapted from R. Albert’s slides

Page 26: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

26

Biological networks

Page 27: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

27

Hawoong Jeong

Yeast Protein-Protein Interaction network

Page 28: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

28

Eric Davidson

Gene regulation network of sea urchin

Page 29: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

29

Abhishek Murarka

Metabolic flux analysis of E. coli

Page 30: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

30

Why study networks?

• Complex systems cannot be described in a reductionist view

• Behavior study of complex systems starts with understanding the network topology

• Network-related questions:
  – How do we reconstruct a network?
  – How can we quantitatively describe large networks?
  – How did networks get to be the way they are?

Page 31: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

31

Simple measures

• Node degree: the number of edges connected to the node
  – In-degree & out-degree
  – Total in-degree == total out-degree

• Average degree: the average of the node degrees over all nodes in the network, denoted as:

  $\langle k \rangle = \frac{1}{N} \sum_{i=1}^{N} k_i$

  where $N$ is the number of nodes in the network and $k_i$ is the degree of node $i$

• Degree distribution: the degree distribution $P(k)$ gives the fraction of nodes that have $k$ edges
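
These degree measures are straightforward to compute. The sketch below assumes an undirected graph stored as a dictionary of neighbor sets and a directed graph given as a list of (source, target) pairs; the function names are illustrative only.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

Graph = Dict[int, Set[int]]  # assumed: undirected adjacency sets


def degrees(graph: Graph) -> Dict[int, int]:
    """Node degree: number of edges attached to each node."""
    return {u: len(nbrs) for u, nbrs in graph.items()}


def average_degree(graph: Graph) -> float:
    """Mean of the node degrees over all N nodes."""
    k = degrees(graph)
    return sum(k.values()) / len(k)


def degree_distribution(graph: Graph) -> Dict[int, float]:
    """P(k): fraction of nodes that have exactly k edges."""
    counts = Counter(degrees(graph).values())
    n = len(graph)
    return {k: c / n for k, c in sorted(counts.items())}


def in_out_degrees(edges: Iterable[Tuple[int, int]]) -> Tuple[Counter, Counter]:
    """In-degree and out-degree counts for a directed edge list."""
    indeg: Counter = Counter()
    outdeg: Counter = Counter()
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
    return indeg, outdeg


# Star graph: hub 0 connected to leaves 1, 2, 3.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
print(average_degree(star))        # 1.5
print(degree_distribution(star))   # {1: 0.75, 3: 0.25}
```

In the directed case every edge contributes one unit of out-degree and one unit of in-degree, which is why the two totals always match.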

Page 32: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

32

Simple measures

• Shortest path: a path between two nodes such that the sum of the weights of its constituent edges is minimized

• Graph diameter: the longest shortest path between any pair of nodes in the graph.

• Connected graph: any two vertices can be joined by a path

• Bridge: an edge whose removal disconnects the graph
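
The four measures above can be sketched for the unweighted case, where every edge has weight 1 (an assumption; weighted graphs would need Dijkstra's algorithm or similar instead of breadth-first search). The bridge search below is a brute-force version chosen for clarity, not efficiency, and all function names are illustrative.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Graph = Dict[int, Set[int]]  # assumed: undirected adjacency sets, unit edge weights


def bfs_distances(graph: Graph, source: int) -> Dict[int, int]:
    """Shortest-path length (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def is_connected(graph: Graph) -> bool:
    """True if any two vertices can be joined by a path."""
    if not graph:
        return True
    start = next(iter(graph))
    return len(bfs_distances(graph, start)) == len(graph)


def diameter(graph: Graph) -> int:
    """Longest shortest path over all node pairs (assumes a connected graph)."""
    return max(max(bfs_distances(graph, s).values()) for s in graph)


def bridges(graph: Graph) -> List[Tuple[int, int]]:
    """Edges whose removal disconnects a connected graph (brute force)."""
    found = []
    for u in graph:
        for v in graph[u]:
            if u < v:  # visit each undirected edge once
                pruned = {w: set(nbrs) for w, nbrs in graph.items()}
                pruned[u].discard(v)
                pruned[v].discard(u)
                if not is_connected(pruned):
                    found.append((u, v))
    return sorted(found)


# Triangle 0-1-2 with a tail edge 2-3.
g = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(diameter(g))   # 2
print(bridges(g))    # [(2, 3)]
```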

Page 33: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

33

Simple measures

• Betweenness centrality: for all node pairs (i, j), find all the shortest paths between nodes i and j, denoted as $C(i,j)$, and determine how many of these pass through node k, denoted as $C_k(i,j)$. The betweenness centrality of node k is

  $B(k) = \sum_{i \neq j \neq k} \frac{C_k(i,j)}{C(i,j)}$

• Calculating the betweenness involves calculating the shortest paths between all pairs of vertices on a graph: $O(V^2 \log V + VE)$ for sparse graphs with Johnson’s algorithm.

L. C. Freeman, Sociometry 40, 35 (1977); P. E. Black, Dictionary of Algorithms and Data Structures (2004)
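
For an unweighted, undirected graph the definition above can be evaluated with one breadth-first search per source node. The sketch below uses a Brandes-style accumulation rather than Johnson's algorithm, which is a substitution made here for brevity. It counts each unordered pair (i, j) once, which is an assumed convention, and the function name is illustrative.

```python
from collections import deque
from typing import Dict, List, Set

Graph = Dict[int, Set[int]]  # assumed: undirected adjacency sets, unit edge weights


def betweenness(graph: Graph) -> Dict[int, float]:
    """Sum over node pairs (i, j) of the fraction of shortest i-j paths through k."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        sigma = {v: 0 for v in graph}   # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        preds: Dict[int, List[int]] = {v: [] for v in graph}
        order = []                      # nodes in order of non-decreasing distance
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:  # u lies on a shortest path to v
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        # Accumulate pair dependencies from the farthest nodes back toward s.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was visited from both endpoints, so halve the totals.
    return {v: b / 2.0 for v, b in score.items()}


# Path graph 0 - 1 - 2 - 3: the two inner nodes each lie on two shortest paths.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(betweenness(path))   # {0: 0.0, 1: 2.0, 2: 2.0, 3: 0.0}
```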

Page 34: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

34

Simple measures

• Clustering coefficient: a measure of the degree to which nodes in a graph tend to cluster together. It is based on triplets of nodes.

• The neighborhood $N_i$ of a vertex $v_i$ is defined as its immediately connected neighbors:

  $N_i = \{ v_j : e_{ij} \in E \}$

• The local clustering coefficient $C_i$ of a vertex $v_i$ is then given by the number of links between the vertices within its neighborhood divided by the number of links that could possibly exist between them. For an undirected graph with $k_i = |N_i|$:

  $C_i = \frac{2\,|\{ e_{jk} : v_j, v_k \in N_i,\ e_{jk} \in E \}|}{k_i (k_i - 1)}$

L. C. Freeman, Sociometry 40, 35 (1977)
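
The local clustering coefficient can be computed by counting links among a node's neighbors. This sketch assumes an undirected simple graph stored as neighbor sets and returns 0.0 for nodes with fewer than two neighbors, a common convention that the slides do not specify.

```python
from typing import Dict, Set

Graph = Dict[int, Set[int]]  # assumed: undirected adjacency sets, no self-loops


def local_clustering(graph: Graph, v: int) -> float:
    """Links among the neighbors of v, divided by the number of possible links."""
    nbrs = graph[v]
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumed convention: undefined cases are reported as 0
    # Count each neighbor pair (a, b) once and test whether it is an edge.
    links = sum(1 for a in nbrs for b in nbrs if a < b and b in graph[a])
    return 2.0 * links / (k * (k - 1))


# Triangle 0-1-2 with a tail edge 2-3.
g = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(local_clustering(g, 0))   # 1.0: both neighbors of node 0 are linked
print(local_clustering(g, 3))   # 0.0: node 3 has a single neighbor
```

Node 2 has three neighbors (0, 1, 3) with only one link among them (0-1), so its coefficient is 1 of 3 possible links.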

Page 35: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

35

Complex measures

• Frequent subgraph mining

• Graph comparison & classification

• Graph isomorphism testing

Page 36: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

36

Useful software

• Visualization & Topological Analysis
  – Cytoscape (www.cytoscape.org)
  – Pajek (vlado.fmf.uni-lj.si/pub/networks/pajek)

• Graph related programming
  – LEDA (www.algorithmic-solutions.com)
  – Nauty (www.cs.sunysb.edu/~algorith/implement/nauty/implement.shtml)

Page 37: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

[Figure: image not preserved; labels 1960, 1999, 2002]

Page 38: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

Real networks are much more complex

• Transcription regulatory networks of yeast and E. coli show an interesting example of mixed characteristics
  – How many genes a TF interacts with: scale-free
  – How many TFs interact with a given gene: exponential

Page 39: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

Modularity and network motif

• Cellular functions are likely to be carried out in a highly modular manner

• Module: a group of genes/proteins that work together to achieve distinct functions

• Biology is full of examples of modularity

Page 40: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology
Page 41: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology
Page 42: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology
Page 43: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

Remaining challenges

• Discovery of network motifs is closely related to the generation of random networks

• The structure of network motifs does not necessarily determine function

• Relation between higher-level organizational, functional states and networks has not yet been studied

Voigt, W. et al. Genetics 2005

Ingram, P. J. et al. BMC Genomics 2006

Eric Werner. Nature 2007

Page 44: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology
Page 45: CSE891-001 Open Problems in Bioinformatics -  Computational Network  Biology

45

Next class

• PPI network construction

• False-positive detection