Integrating Genomes
D. R. Zerbino, B. Paten, D. Haussler
Science 336, 179 (2012)
Teacher: Professor Chao, Kun-Mao
Speaker: Ho, Bin-Shenq
June 4, 2012
Outline
Overview
Obtaining Genomic Sequences
Modeling Evolution of Genotype
From Genotype to Phenotype
Looking Ahead to Applications
Conclusion
Overview
Specialization in computational genomics
Integration of genetic, molecular, and phenotypic information
Impact on diverse fields of science
New window into the story of life
Population genetics, phylogenetics, human disease genetics
+ graph theory, signal processing, statistics, computer science
Milestones
First genome sequences (1970s)
Bacteriophage MS2 RNA: 3,569 nucleotides long (1976)
Computational genomics (1980)
Smith and Waterman
Stormo et al.
In the past 8 years: a 16-fold improvement in computational power under Moore’s law, against a 10,000-fold improvement in sequencing performance
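The two growth figures can be compared with a little arithmetic. The sketch below assumes Moore's law means one doubling every two years (that period is an assumption, not stated on the slide) and derives the doubling time implied by a 10,000-fold gain over the same 8 years.

```python
import math

YEARS = 8
MOORE_DOUBLING_YEARS = 2.0  # assumption: computing power doubles every two years


def fold_change(years, doubling_years):
    """Fold improvement after `years` given a fixed doubling period."""
    return 2 ** (years / doubling_years)


def doubling_time(years, fold):
    """Doubling period (in years) implied by a `fold` improvement over `years`."""
    return years * math.log(2) / math.log(fold)


moore_fold = fold_change(YEARS, MOORE_DOUBLING_YEARS)   # 2**4 = 16
seq_doubling_years = doubling_time(YEARS, 10_000)       # roughly 0.6 years
print(moore_fold, round(seq_doubling_years * 12, 1), "months")
```

Under these assumptions, sequencing performance doubled roughly every seven months, more than three times faster than computing power.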
Computational Genomics
Genomic data
Evolution
Molecular phenotype
Organismal phenotype
DNA sequence evolving in time (history)
Chromatin piece interacting with other molecules (mechanism)
Gene product acting in cellular pathways affecting organisms (function)
Obtaining Genomic Sequences
Genome assembly, given sufficient read redundancy
Large redundant regions (repeats)
→ complex networks of read-to-read overlaps, not all reflecting actual overlaps
→ need to determine which overlaps are legitimate and which are spurious
→ NP-hard problem
→ undetermined, error-prone, costly-to-finish regions
Newer sequencing technologies with longer reads
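A toy illustration of how a repeat produces spurious overlaps. This is a minimal sketch, not a real assembler: the 20-base genome, the four reads, and the helper names `overlap` and `overlap_graph` are all made up for illustration, and the true read positions are known only because the data are synthetic.

```python
from itertools import permutations


def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` equal to a prefix of `b` (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def overlap_graph(reads, min_len=3):
    """Directed read-to-read overlap graph: {(a, b): overlap length}."""
    graph = {}
    for a, b in permutations(reads, 2):
        k = overlap(a, b, min_len)
        if k:
            graph[(a, b)] = k
    return graph


# Hypothetical genome containing the repeat CCGAT twice.
genome = "AAGTCCGATTGACCGATCTA"
# Reads mapped to their true start positions (known only in this toy setting).
starts = {"AAGTCCGAT": 0, "CCGATTGA": 4, "TGACCGAT": 9, "CCGATCTA": 12}

graph = overlap_graph(list(starts))
# An overlap is spurious when the offset it implies contradicts the true positions.
spurious = [(a, b) for (a, b), k in graph.items()
            if starts[a] + len(a) - k != starts[b]]
print(len(graph), "overlaps,", len(spurious), "spurious")
```

In a real assembly the true positions are unknown, which is why telling the legitimate overlaps from the repeat-induced ones becomes the hard part.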
Obtaining Genomic Sequences
Reference-based assembly
Tends to be biased toward the reference genome
Newer sequencing technologies with longer reads
Diversity of Genomes
Every genome is the result of a 3.8-billion-year evolutionary journey from the origin of life
Mostly shared and partly unique
Single-base change: substitution, SNP
Indel: insertion, deletion
Tandem duplication
Recombination
Transposition
Rearrangement: inversion, segmental deletion, segmental duplication, fusion, fission, translocation
Whole genome duplication
Alignment
Alignment assumes derivation from a suitably recent common ancestor
What was conserved or changed during evolution from the common ancestor
Substitution, indel, segment order, copy number
Local alignment for conserved functional regions of more distantly related genomes
Global / genome alignment for genomes from closely related species
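Local alignment can be sketched with the Smith-Waterman dynamic program. The version below is a minimal illustration with a linear gap penalty; the scoring values (match 2, mismatch -1, gap -2) and the two example sequences are assumptions chosen for the demonstration, not parameters from the paper.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b) for a local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the score matrix; a cell never drops below 0, which makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        score = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + score:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# A conserved core (ACGTACG) embedded in otherwise unrelated flanks.
score, top, bottom = smith_waterman("TTTACGTACGTTT", "GGGACGTACGGGG")
print(score, top, bottom)
```

Only the conserved core is reported and the divergent flanks are ignored, which is the property that makes local alignment suitable for distantly related genomes.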
Phylogenetic Analysis
Single tree providing an explicit order of gene descent through shared ancestry
Finding the optimal phylogeny under probabilistic or parsimony models of substitutions and indels is NP-hard
Complicated by homologous recombination
Goal: a tractable unified theory of genome evolution, with stochastic processes jointly describing the diversification events of genomes
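Scoring one fixed tree under parsimony is easy; the NP-hard part is searching over all tree topologies. The sketch below uses Fitch's algorithm to count the minimum number of substitutions on a given rooted binary tree. The species names, the four-column alignment, and the nested-tuple tree encoding are illustrative assumptions, and indels are not modeled.

```python
def fitch_site(tree, states):
    """Fitch's algorithm for one alignment column: return (candidate state set, substitutions)."""
    if isinstance(tree, str):            # leaf: the observed base
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:                           # children agree: no extra substitution
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum of per-column Fitch costs for a rooted binary tree given as nested tuples."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


# Hypothetical alignment of four taxa.
alignment = {"human": "ACGT", "chimp": "ACGT", "mouse": "ATGA", "rat": "ATGA"}
tree_a = (("human", "chimp"), ("mouse", "rat"))
tree_b = (("human", "mouse"), ("chimp", "rat"))
print(parsimony_score(tree_a, alignment), parsimony_score(tree_b, alignment))
```

Here the tree grouping human with chimp needs fewer substitutions than the alternative, so parsimony prefers it. With many taxa the number of candidate trees grows too fast to enumerate, which is where heuristics come in.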
From Genotype to Phenotype
Fig. 2. The dynamic processes that affect and are affected by the genome.
Genomes, Mechanisms, Functions
Active molecules of the cell, including proteins, messenger RNAs, and other functional RNAs
Epigenetic mechanisms regulating RNA and protein production and function
Gene regulatory networks
Protein signaling cascades
Metabolic pathways
Regulatory network motifs
From Genotype to Phenotype
Exploring the unfolding history and diversity of life
Deriving experimental data from an expansion of cell culture resources for diverse species / tissues and newer single-cell assay methodologies
Correlating specific segregating variants with phenotypic traits or diseases
Identifying causal variants by complete genome analysis in related as well as unrelated cases and controls, in combination with better prediction of the possible effects of genome variants
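One simple way to correlate a variant with a disease is an allelic case-control test. The sketch below computes a chi-square statistic on a 2x2 table of allele counts; the counts are invented for illustration, and this test is a stand-in example rather than the method used in the paper.

```python
import math


def allelic_chi_square(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Chi-square test (1 degree of freedom) on a 2x2 table of allele counts."""
    a, b, c, d = case_alt, case_ref, ctrl_alt, ctrl_ref
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, the upper tail equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: the alternate allele is more frequent in cases.
chi2, p_value = allelic_chi_square(case_alt=60, case_ref=40, ctrl_alt=40, ctrl_ref=60)
print(round(chi2, 2), round(p_value, 4))
```

A small p-value only flags an association. Separating a causal variant from one that is merely inherited alongside it takes the additional genome analysis and effect prediction described above.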
From Genotype to Phenotype
Constructing models of molecular phenotypes involving epigenetic state, RNA expression, and (inferred) protein levels through hidden Markov models, factor graphs, Bayesian networks, and Markov random fields
Incorporating biological knowledge into classification and regression methods (e.g., general linear models, neural networks, and support vector machines)
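As a concrete instance of the hidden Markov models mentioned above, the sketch below decodes a hidden chromatin state from a sequence of observed signal levels with the Viterbi algorithm. The two states ("active", "silent"), the observation symbols, and all probabilities are hypothetical values chosen for illustration.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable hidden state path for `obs`, computed in log space."""
    scores = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for o in obs[1:]:
        new_scores, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda r: scores[r] + math.log(trans_p[r][s]))
            new_scores[s] = scores[prev] + math.log(trans_p[prev][s]) + math.log(emit_p[s][o])
            ptr[s] = prev
        scores = new_scores
        back.append(ptr)

    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: scores[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


states = ("active", "silent")
start_p = {"active": 0.5, "silent": 0.5}
trans_p = {"active": {"active": 0.9, "silent": 0.1},
           "silent": {"active": 0.1, "silent": 0.9}}
emit_p = {"active": {"high": 0.8, "low": 0.2},
          "silent": {"high": 0.1, "low": 0.9}}

signal = ["high", "high", "low", "high", "low", "low", "low"]
path = viterbi(signal, states, start_p, trans_p, emit_p)
print(path)
```

The isolated "low" reading at the third position is absorbed into the active segment, showing how the model smooths noisy measurements instead of switching state at every observation.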
Looking Ahead to Applications
Genome data growth collectively from petabytes (10^15 bytes) today to exabytes (10^18 bytes) tomorrow
Cancer diagnosis and treatment
Immunology
Stem cell therapy
Agriculture
Human prehistory study
Conclusion
Facing the challenge of obtaining maximum information from every sequencing experiment
To borrow and tie together advances from a spectrum of different research fields into foundational mathematical models
Trade-off between model comprehensiveness and computational efficiency
To be shaped by increasing knowledge of biology