CSE891-001 Open Problems in Bioinformatics - Computational Network Biology

Jin Chen, 232 Plant Biology Bld.

2012 Fall

113 Ernst Bessey Hall

About me

• Jin Chen, Assistant Professor in CSE and PRL since 2009

• Office: 232 Plant Biology Lab. Tel: (517) 355-5015. Email: jinchen@msu.edu

Outline

• Course Description

• Introduction to Computational Network Biology

Course Description

• Course objectives: study interesting computational network biology problems and their algorithms, with a focus on the principles used to design those algorithms. (3 credits)

• Instructor: Jin Chen, Office: 232 Plant Biology Bld. Email: jinchen@msu.edu

• Office hours: Thursday 2PM-3PM. If you cannot attend office hours, email me about scheduling a different time.

• Web page: http://www.msu.edu/~jinchen/cse891-2012

Course Description

• Course work: one 80-minute lecture and 80 minutes of discussion and student presentations each week

• Grading policies: The course will be graded on attendance (10%), participation (20%), and presentation (70%).

• No Final Exam

Term project vs. presentation

Course Description

• Prerequisites: graduate students in science or engineering. Note: an override is necessary for non-CSE graduate students; please send your PID & NetID to me.

• No prior knowledge of biology is required. Computationally inclined biology graduate students are encouraged to take the class as well.

Suggested books

• A.-L. Barabási, Linked: The New Science of Networks

• U. Alon, An Introduction to Systems Biology

• B. Palsson, Systems Biology: Properties of Reconstructed Networks

• K. Kaneko, Life: An Introduction to Complex Systems Biology

Course Description

[Diagram: course topics. Graph Mining: graph model, graph clustering, subgraph mining. Network Biology: protein-protein interaction network, gene regulatory network, metabolic network, integrative study.]

Course Description

• Select 3 papers for presentation from the online paper list

• Each presentation is 45 min, including 15 min Q&A, followed by a discussion

• Your grade will be largely determined by the presentation (70%)

• Presentations start on Sep 11

Or 1 term project + 1 presentation

Why Bioinformatics

• Recent advances in biotechnology underline the need for new computational tools in modern biology, which are essential for analyzing, understanding and manipulating the detailed information on life we now have at our disposal

• Problems in computational biology vary from understanding sequence data to the analysis of protein shapes, prediction of biological function, study of gene networks, and cell-wide computations

Different Views to Study Biological Problems

• Combinatorial Algorithms
– Transcription factors, protein interactions

• Statistical Algorithms
– Gene expressions

• Imaging Algorithms
– Sub-cellular localization, feature extraction

• Graph Algorithms
– Biological networks, everything is related!

Science 14 January 2011: Vol. 331 no. 6014 pp. 183-185 DOI: 10.1126/science.1193210

Biological Solution to a Fundamental Distributed Computing Problem

• Computational methods are extensively used to analyze and model biological systems

• But this paper provides an example of the reverse of this strategy, in which a biological process is used to derive a solution to a long-standing computational problem

• Distributed computing: a large number of processors jointly and distributively solve a task, without any of the processors getting all of the inputs or observing all of the outputs

• Biological processes are also distributed

Maximal Independent Set (MIS)

• A long-standing distributed computing problem is that of electing a set of local leaders (a maximal independent set) in a network of connected processors. Formally, an MIS is a set of nodes A such that every node in the network is either in A or directly connected to a node in A, and no two nodes in A are connected. An MIS is necessary for the deployment of large, ad hoc sensor networks

• Distributively electing an MIS has been considered a challenging problem for three decades. Luby and Alon et al. presented fast probabilistic algorithms for electing an MIS. But to date, no method has been able to efficiently reduce message complexity without assuming knowledge of the number of neighbors.
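The two conditions in the MIS definition (independence and coverage of every other node) can be checked directly. The following is a minimal sketch, assuming the network is stored as a dictionary mapping each node to the set of its neighbors; the function name and the example graph are illustrative and do not come from the paper.

```python
def is_maximal_independent_set(adj, candidate):
    """Check the MIS definition on an undirected graph.

    adj maps each node to the set of its neighbours (assumed representation).
    """
    chosen = set(candidate)
    # Independence: no two chosen nodes are connected.
    for v in chosen:
        if adj[v] & chosen:
            return False
    # Coverage: every node outside the set has at least one chosen neighbour.
    for v in adj:
        if v not in chosen and not (adj[v] & chosen):
            return False
    return True


# Illustrative path graph a - b - c - d.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
```

On this path graph, {a, c}, {a, d} and {b, d} all pass the check, which shows that an MIS is generally not unique.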

Blue = MIS

Selection of Neural Precursors

• The selection of neural precursors during the development of the nervous system resembles the MIS election problem. The sensory organ precursors (SOPs) are selected during larval and pupal development from clusters of equivalent cells.

• A cell that is selected as an SOP inhibits its neighbors by expressing high levels of the membrane-bound protein Delta, which binds and activates the transmembrane receptor protein Notch on adjacent cells. This lateral-inhibition process is highly accurate, resulting in a regularly spaced pattern in which each cell is either selected as an SOP or is inhibited by a neighboring SOP.

Blue = SOP

Inspiration from Biology

• Although similar, the biological solution differs from computational algorithms
– SOP selection is probably performed without relying on knowledge of the number of neighbors that are not yet selected
– SOP selection requires nonlinear inhibition that in effect reduces communication to the simplest set of possible messages

• The authors thus asked whether they could develop an algorithm for MIS selection on the basis of a stochastic rate change model that would not require knowledge about the number of active neighbors and would only use threshold communication

Algorithm

• In an arbitrary synchronous communication network, nodes can only broadcast one-bit messages.

• A message broadcast by a node reaches all of its neighbors that are still active in the algorithm.

• In each round, a processor can only tell whether or not a message was sent to it. When a processor receives a message, it cannot tell which of its neighboring processors sent it, and it cannot count the number of messages received in a round.
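The communication model above can be simulated in a few lines. This is a simplified sketch and not the authors' exact algorithm: the starting probability, the doubling schedule capped at 1/2, the number of rounds per phase, and the name `beep_mis` are all assumptions made for illustration. What it does preserve is the constraint from the slide: a node only learns whether at least one active neighbor broadcast, never which one or how many.

```python
import math
import random


def beep_mis(adj, seed=0):
    """Elect an MIS using anonymous one-bit broadcasts (simplified sketch).

    adj maps each node to the set of its neighbours; nodes must be sortable.
    """
    rng = random.Random(seed)
    active = set(adj)           # nodes still taking part
    mis = set()                 # elected local leaders
    max_deg = max((len(nb) for nb in adj.values()), default=0)
    p = 1.0 / max(2, max_deg)   # assumed starting broadcast probability
    rounds_per_phase = max(1, math.ceil(math.log2(len(adj) + 1)))
    while active:
        for _ in range(rounds_per_phase):
            # Exchange 1: each active node broadcasts with probability p.
            beeped = {v for v in sorted(active) if rng.random() < p}
            # A broadcaster wins only if it heard no broadcast at all.
            winners = {v for v in beeped if not (adj[v] & beeped)}
            # Exchange 2: winners join the MIS and broadcast again;
            # active neighbours that hear this become inactive.
            mis |= winners
            retired = set()
            for v in winners:
                retired |= adj[v] & active
            active -= winners | retired
        p = min(0.5, 2 * p)     # assumed schedule: double the rate, capped at 1/2
    return mis
```

Because two adjacent nodes that broadcast in the same round both hear a message and both back off, the winners of any round are never neighbors, so the result is independent; the loop runs until no active node remains, so the result is also maximal.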

Algorithm Complexity

• The running time of the algorithm is $O(\log n \cdot \log D)$ rounds, which is the number of rounds required to execute the two nested loops

• The worst-case running time is $O(\log^2 n)$

• By studying a developmental process in flies, the authors devised a solution to an important distributed computing problem. The new algorithm does not require knowledge of the degree of individual processors, uses one-bit messages, and has an optimal message complexity.

Conclusion

• Using insights from biology to advance computational systems has mainly focused on optimization techniques inspired by biological observations.

• Areas of computer science that require strict, provable guarantees can also benefit from knowledge regarding how biological systems operate.

• Better understanding of these biological systems can lead to further improvement in the design of complex distributed computing systems.

Introduction to Computational Network Biology

• Network biology belongs to systems biology, which belongs to genomics

• Interested in the relations between entities rather than the entities themselves

http://bionet.bioapps.biozentrum.uni-wuerzburg.de/

Networks everywhere

• Internet, social network, anti-terrorism network

• Biological networks
– Protein-protein interaction (PPI) network
– Protein-DNA interaction network
– Gene correlation network
– Gene regulatory network
– Metabolic network
– Signaling network…

• A network is a tool for understanding complex systems

• Network models explain network properties and support the study of network behavior

• Network measures provide quantitative analysis for complex systems

Definition of network (graph)

[Figure: a graph G(V, E) with its parts labeled: node (vertex), edge, self-loop, multi-set of edges]

Simple graph: does not have loops (self-edges) and does not have multi-edges.
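The simple-graph condition is easy to test on an edge list. A minimal sketch, assuming an undirected graph given as a list of (u, v) pairs; the function name is illustrative.

```python
from collections import Counter


def is_simple(edges):
    """Return True if an undirected edge list has no self-loops or multi-edges."""
    # A self-loop joins a node to itself.
    if any(u == v for u, v in edges):
        return False
    # A multi-edge is the same unordered pair listed more than once.
    counts = Counter(frozenset(edge) for edge in edges)
    return all(c == 1 for c in counts.values())
```

Using `frozenset` makes (1, 2) and (2, 1) count as the same undirected edge.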

Definition of network (graph)

Directed graph vs. undirected graph

Labeled graph vs. unlabeled graph

Symmetric graph vs. asymmetric graph

Webpage layout

M. Newman and M. Girvan. Finding and evaluating community structure in networks. Phys. Rev. E 69, 026113 (2004)

Pages on a web site and the hyperlinks between them

Adapted from R. Albert’s slides

Biological networks

Hawoong Jeong

Yeast Protein-Protein Interaction network

Eric Davidson

Gene regulation network of sea urchin

Abhishek Murarka

Metabolic flux analysis of E. coli

Why study networks?

• Complex systems cannot be described in a reductionist view

• Behavior study of complex systems starts with understanding the network topology

• Network-related questions:
– How do we reconstruct a network?
– How can we quantitatively describe large networks?
– How did networks get to be the way they are?

Simple measures

• Node degree: the number of edges connected to the node
– In-degree & out-degree (directed graphs)
– Total in-degree == total out-degree

• Average degree: the average of the node degrees over all nodes in the network, denoted as $\langle k \rangle = \frac{1}{N} \sum_{i=1}^{N} k_i$, where $N$ is the number of nodes in the network and $k_i$ is the degree of node $i$

• Degree distribution: the degree distribution $P(k)$ gives the fraction of nodes that have $k$ edges
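These three measures can be computed from an edge list in one pass. A minimal sketch, assuming an undirected graph given as (u, v) pairs in which every node appears in at least one edge (isolated nodes would have to be added separately); the function names are illustrative.

```python
from collections import Counter


def degrees(edges):
    """Node degree: number of edges touching each node."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)


def average_degree(deg):
    """Average degree: sum of all node degrees divided by N."""
    return sum(deg.values()) / len(deg)


def degree_distribution(deg):
    """P(k): fraction of nodes that have exactly k edges."""
    n = len(deg)
    return {k: count / n for k, count in Counter(deg.values()).items()}


# Illustrative star graph: hub 0 connected to leaves 1, 2, 3.
star = [(0, 1), (0, 2), (0, 3)]
```

For the star graph, the hub has degree 3 and each leaf has degree 1, so the average degree is 6/4 = 1.5, with P(1) = 0.75 and P(3) = 0.25.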

Simple measures

• Shortest path: a path between two nodes such that the sum of the weights of its constituent edges is minimized

• Graph diameter: the length of the longest shortest path between any pair of nodes in the graph

• Connected graph: any two vertices can be joined by a path

• Bridge: an edge whose removal disconnects the graph
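For an unweighted graph (every edge has weight 1), breadth-first search gives shortest path lengths, and the other three measures follow from it. The sketch below assumes an undirected, connected graph stored as a dictionary of neighbor sets with sortable node labels. The bridge test simply removes each edge and rechecks connectivity, which is easy to follow but far slower than the linear-time methods used in practice.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest path length (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def is_connected(adj):
    """True if every node can be reached from an arbitrary start node."""
    if not adj:
        return True
    start = next(iter(adj))
    return len(bfs_distances(adj, start)) == len(adj)


def diameter(adj):
    """Longest shortest path over all node pairs (graph assumed connected)."""
    return max(max(bfs_distances(adj, s).values()) for s in adj)


def bridges(adj):
    """Edges whose removal disconnects the graph (brute-force check)."""
    found = []
    for u in adj:
        for v in adj[u]:
            if u < v:  # visit each undirected edge once
                trimmed = {x: set(nb) for x, nb in adj.items()}
                trimmed[u].discard(v)
                trimmed[v].discard(u)
                if not is_connected(trimmed):
                    found.append((u, v))
    return found


# Illustrative graph: triangle 0-1-2 with a tail edge 2-3.
tailed = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
```

In the example graph, the tail edge (2, 3) is the only bridge: removing any triangle edge still leaves a path around the triangle.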

Simple measures

• Betweenness centrality: for all node pairs (i, j), find all the shortest paths between nodes i and j, denoted as C(i,j), and determine how many of these pass through node k, denoted as Ck(i,j). The betweenness centrality of node k is $b_k = \sum_{i \neq j \neq k} \frac{C_k(i,j)}{C(i,j)}$

• Calculating the betweenness involves calculating the shortest paths between all pairs of vertices on a graph, which takes $O(V^2 \log V + VE)$ time on sparse graphs with Johnson’s algorithm.
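The definition can be turned into code by counting shortest paths with breadth-first search. This is a direct sketch for small unweighted, undirected graphs, not the Johnson-based approach mentioned above; it counts each unordered pair (i, j) once, and the function names are illustrative. It relies on the fact that the number of shortest i-j paths through k equals (paths from i to k) times (paths from k to j) whenever k lies at the right distance between them.

```python
from collections import deque


def shortest_path_counts(adj, source):
    """BFS distances from source and the number of shortest paths to each node."""
    dist = {source: 0}
    count = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                count[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                count[w] += count[u]  # every shortest path to u extends to w
    return dist, count


def betweenness(adj):
    """Sum over unordered pairs (i, j) of C_k(i,j) / C(i,j) for each node k."""
    info = {s: shortest_path_counts(adj, s) for s in adj}
    score = {k: 0.0 for k in adj}
    nodes = list(adj)
    for a, i in enumerate(nodes):
        dist_i, count_i = info[i]
        for j in nodes[a + 1:]:
            if j not in dist_i:
                continue  # i and j are not connected
            for k in nodes:
                if k == i or k == j:
                    continue
                dist_k, count_k = info[k]
                on_path = (k in dist_i and j in dist_k
                           and dist_i[k] + dist_k[j] == dist_i[j])
                if on_path:
                    score[k] += count_i[k] * count_k[j] / count_i[j]
    return score
```

On a 4-cycle, each pair of opposite nodes has two shortest paths, one through each of the other two nodes, so every node ends up with a score of 0.5.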

L. C. Freeman, Sociometry 40, 35 (1977); P. E. Black, Dictionary of Algorithms and Data Structures (2004)

Simple measures

• Clustering coefficient: a measure of the degree to which nodes in a graph tend to cluster together. It is based on triplets of nodes.

• The neighborhood $N_i$ of a vertex $v_i$ is defined as its immediately connected neighbors: $N_i = \{ v_j : e_{ij} \in E \}$

• The local clustering coefficient $C_i$ of a vertex $v_i$ is the number of links between the vertices within its neighborhood divided by the number of links that could possibly exist between them. For an undirected graph with $k_i = |N_i|$: $C_i = \frac{2\,|\{ e_{jk} : v_j, v_k \in N_i,\ e_{jk} \in E \}|}{k_i (k_i - 1)}$
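A small sketch of the local clustering coefficient for an undirected graph stored as a dictionary of neighbor sets. Returning 0 for nodes with fewer than two neighbors is a common convention assumed here, since no pair of neighbors exists in that case; node labels are assumed to be sortable.

```python
def local_clustering(adj, v):
    """Fraction of possible links among v's neighbours that actually exist."""
    neighbours = adj[v]
    k = len(neighbours)
    if k < 2:
        return 0.0  # assumed convention: no neighbour pairs to connect
    # Count each link between two neighbours once (a < b).
    links = sum(1 for a in neighbours for b in neighbours
                if a < b and b in adj[a])
    return 2.0 * links / (k * (k - 1))


# Illustrative graph: triangle 0-1-2 with a tail edge 2-3.
tailed_triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
```

Node 0 has two neighbors that are linked to each other, giving a coefficient of 1. Node 2 has three neighbors with only one link among the three possible pairs, giving 1/3.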

L. C. Freeman, Sociometry 40, 35 (1977)

Complex measures

• Frequent subgraph mining

• Graph comparison & classification

• Graph isomorphic testing

Useful software

• Visualization & Topological Analysis
– Cytoscape (www.cytoscape.org)
– Pajek (vlado.fmf.uni-lj.si/pub/networks/pajek)

• Graph related programming
– LEDA (www.algorithmic-solutions.com)
– Nauty (www.cs.sunysb.edu/~algorith/implement/nauty/implement.shtml)


Real networks are much more complex

• Transcription regulatory networks of yeast and E. coli show an interesting example of mixed characteristics
– how many genes a TF interacts with: scale-free
– how many TFs interact with a given gene: exponential

Modularity and network motif

• Cellular functions are likely to be carried out in a highly modular manner

• Module: a group of genes/proteins that work together to achieve distinct functions

• Biology is full of examples of modularity

Remaining challenges

• Discovery of network motifs is closely related to the generation of random networks

• The structure of a network motif does not necessarily determine its function

• The relation between higher-level organizational, functional states and networks has not yet been studied
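The first point can be illustrated with a sketch: a subgraph is usually called a motif only if it occurs more often in the real network than in randomized networks with the same degrees. The code below counts one common pattern, the feed-forward loop (a→b, b→c, a→c), and generates a randomized network by degree-preserving edge swaps. The choice of pattern, the swap procedure, and the function names are assumptions for illustration and are not taken from the cited papers.

```python
import random


def count_ffl(edges):
    """Count feed-forward loops a->b, b->c, a->c in a directed edge set."""
    succ = {}
    for a, b in edges:
        succ.setdefault(a, set()).add(b)
    total = 0
    for a, outs in succ.items():
        for b in outs:
            for c in succ.get(b, ()):
                if c != a and c in outs:
                    total += 1
    return total


def rewire(edges, n_swaps, seed=0):
    """Randomize a directed graph while keeping every in- and out-degree."""
    rng = random.Random(seed)
    edge_list = sorted(edges)
    edge_set = set(edge_list)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        # Proposed swap of targets: a->b, c->d becomes a->d, c->b.
        if a == d or c == b:
            continue  # would create a self-loop
        if (a, d) in edge_set or (c, b) in edge_set:
            continue  # would create a multi-edge
        edge_set -= {(a, b), (c, d)}
        edge_set |= {(a, d), (c, b)}
        edge_list[i], edge_list[j] = (a, d), (c, b)
    return edge_set


# Illustrative network containing exactly one feed-forward loop (1->2->3, 1->3).
net = {(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)}
```

Comparing `count_ffl(net)` with the counts from many calls to `rewire` using different seeds gives a rough sense of whether the pattern is over-represented. The result depends on how the random networks are generated, which is exactly why the two problems are linked.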

Voigt, W. et al. Genetics 2005 Ingram P.J.et al. BMC Genomics 2006

Eric Werner. Nature 2007

Next class

• PPI network construction

• False-positive detection
