What is Systems Biology
27041 Introduction to Systems Biology 2 CBS, Department of Systems Biology
Data integration
• In the “Big Data” era
• Combine different types of data, describing different things or the same thing with different error
• City guide analogy (Google Maps):
  – Road maps
  – Aerial pictures of buildings
  – Street-level pictures
  – Restaurant reviews
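For the "same thing with different error" case, one common way to combine measurements is an inverse-variance weighted mean. The sketch below is only an illustration under that assumption: the function name, the independence of the measurements, and the example numbers are hypothetical and not taken from the slides.

```python
def combine_measurements(values, std_errors):
    """Inverse-variance weighted mean of independent measurements of one quantity.

    Assumption (not from the slides): errors are independent and given
    as standard errors, so more precise measurements get larger weights.
    """
    if not values or len(values) != len(std_errors):
        raise ValueError("need one standard error per value")
    weights = [1.0 / (se ** 2) for se in std_errors]
    total_weight = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / total_weight
    combined_se = (1.0 / total_weight) ** 0.5
    return mean, combined_se


# Hypothetical example: two assays report the same quantity,
# the second with twice the standard error of the first.
mean, se = combine_measurements([10.0, 12.0], [1.0, 2.0])
print(mean, se)
```

The combined estimate sits closer to the more precise measurement, and its standard error is smaller than that of either input.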
Reductionist vs Holistic
• Reductionism seeks to find individual factors that explain/cause a phenomenon
  – Typically study one factor at a time
  – Which cells -> which organelle -> which molecules -> which sites on these molecules -> which atoms and H-bonds involved?
• Holistic approach looks at many or all components of the system and the interplay between them
  – E.g. map of whole city (as opposed to map of one road)
  – Understanding cancer (requires understanding of many different biological processes)
Increasing Interest in Systems Biology
[Figure: PubMed publications per year, 1995 to 2010, for the terms Bioinformatics, Systems Biology and Cell Cycle. Annotated events: Yeast genome completed, Human genome completed, Nobel Prize for Cell Cycle.]
[Figure: Systems Biology (integration) contrasted with “Normal” Biology (reductionist)]
Also, Systems Biology is a “top-down” science
Systems biology and emerging properties
Integration of “whole x-ome” to understand life in health and disease
genome
transcriptome
proteome
metabolome
lipidome
etceterome
LIFE
From components to models
Transcriptional regulation of the Cell Cycle
Simon et al. Cell 2001
Tyson JJ, Novak B, J. Theor. Biol. 2001
Carbohydrate metabolic map
Mathematical abstraction of biochemistry
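The figure for this slide is not reproduced here. As a minimal illustration of what such an abstraction can look like, the sketch below writes a reversible reaction A <-> B as mass-action rate equations and integrates them with a simple Euler scheme. The reaction, rate constants, and function name are assumptions chosen for illustration and are not taken from the slides.

```python
def simulate_reversible(a0, b0, k_forward, k_reverse, dt=0.001, steps=20000):
    """Euler integration of mass-action kinetics for A <-> B.

    dA/dt = -k_forward * A + k_reverse * B
    dB/dt = +k_forward * A - k_reverse * B
    All parameter values are hypothetical.
    """
    a, b = a0, b0
    for _ in range(steps):
        net_flux = k_forward * a - k_reverse * b  # net conversion of A into B
        a -= net_flux * dt
        b += net_flux * dt
    return a, b


# Start with only A present and run long enough to approach equilibrium.
a_end, b_end = simulate_reversible(1.0, 0.0, k_forward=2.0, k_reverse=1.0)
print(a_end, b_end)
```

At equilibrium the ratio B/A approaches k_forward/k_reverse while the total A + B stays constant, which is the kind of system-level statement the equations make explicit.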
The hierarchy of models
One framework for Systems Biology (part 1)
1. The components. Discover all of the genes in the genome and the subset of genes, proteins, and other small molecules constituting the pathway of interest. If possible, define an initial model of the molecular interactions governing pathway function (how?).
2. Pathway perturbation. Perturb each pathway component through a series of genetic or environmental manipulations. Detect and quantify the corresponding global cellular response to each perturbation.
One framework for Systems Biology (part 2)
3. Model Reconciliation. Integrate the observed mRNA and protein responses with the current, pathway-specific model and with the global network of protein-protein, protein-DNA, and other known physical interactions.
4. Model verification/expansion. Formulate new hypotheses to explain observations not predicted by the model. Design additional perturbation experiments to test these and iteratively repeat steps (2), (3), and (4).
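The four steps above describe a loop, which can be sketched as a toy program. Everything here is an assumption made for illustration: the model is a plain dictionary of regulator-to-target edges, `measure_response` stands in for a real perturbation experiment, and the "new hypothesis" is simply a direct edge to every unexplained responder (so indirect effects are not distinguished from direct ones).

```python
def predicted_response(model, gene):
    """Genes the model expects to change when `gene` is perturbed."""
    seen, stack = set(), [gene]
    while stack:
        current = stack.pop()
        for target in model.get(current, set()):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def refine_model(initial_model, components, measure_response, max_rounds=10):
    """Repeat perturbation, reconciliation and expansion until nothing is unexplained."""
    model = {gene: set(targets) for gene, targets in initial_model.items()}
    for _ in range(max_rounds):
        changed = False
        for gene in components:                     # step 2: perturb each component
            observed = measure_response(gene)       # global response to the perturbation
            unexplained = observed - predicted_response(model, gene)  # step 3: reconcile
            if unexplained:                         # step 4: add hypotheses, then repeat
                model.setdefault(gene, set()).update(unexplained)
                changed = True
        if not changed:
            break
    return model


# Hypothetical hidden network playing the role of the cell: A -> B -> C.
hidden_network = {"A": {"B"}, "B": {"C"}}
refined = refine_model(
    initial_model={"A": {"B"}},                     # step 1: initial pathway model
    components=["A", "B", "C"],
    measure_response=lambda g: predicted_response(hidden_network, g),
)
print(refined)
```

In this toy run the second round finds no unexplained responses, so the loop stops; a real study would replace the simulated measurement with experimental data and a far richer model.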
From model to experiment and back again
Systems Biology to address the knowledge/data problem
Genome sequencing
The human genome sequencing project (HGP) - 2000
Sequencing costs over time
27626 - Next Generation Sequence Analysis
Sequencing costs
Drop in costs is faster than Moore’s Law
(Computer power doubles every 2 years)
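To make the comparison concrete, the sketch below computes what a cost would be if it merely halved every 2 years, mirroring the doubling rule quoted above. The starting cost is a hypothetical number, not a figure from the slides; any real cost curve that falls below this line is dropping faster than Moore's Law.

```python
def moores_law_cost(start_cost, years, doubling_time_years=2.0):
    """Cost after `years` if it halves once per doubling period."""
    return start_cost / 2 ** (years / doubling_time_years)


# Hypothetical starting cost of 1,000,000 (arbitrary currency units).
cost_after_10_years = moores_law_cost(1_000_000.0, 10)
print(cost_after_10_years)  # 31250.0, a 32-fold reduction in 10 years
```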
Sequencing capability: throughput per machine
1st generation to NGS
cancer genome can be obtained, including all point mutations, rearrangements and copy number changes. Mutations in the accompanying mitochondrial genomes of the cancer will also be collected. With further adaptation this could be extended to include epigenetic alterations and could be applied to the transcriptomes of cancers to investigate the first phenotypic effects of all these changes. This catalogue will include all the driver mutations and hence all the cancer genes operating in that cancer, whether they are protein-coding genes, non-coding RNA genes or more cryptic functional elements of the genome. Indeed, if known or unknown DNA viruses have contributed to oncogenesis these will also be discovered. The catalogue will also include all the passenger mutations that incorporate the signatures of previous exposures, DNA repair defects and other mutational processes the cancer has experienced over the decades during which it was evolving.
Until recently, this was an unattainable fantasy. However, the arrival of second-generation sequencing technologies promises a new era for cancer genomics. These platforms currently generate billions of bases of DNA sequence per week, yields that are predicted to increase rapidly over the next couple of years (Fig. 3). Several proof-of-principle studies have recently been published applying these technologies to cancer samples. These have demonstrated that the current generation of massively parallel sequencing platforms can identify the full range of somatically acquired genetic alteration in cancer, including point mutations on a genome-wide basis [57], insertions and deletions [57], copy number changes [56] and genomic rearrangements [56], as well as characterizing the cancer cell transcriptome [40,41]. Furthermore, these approaches have the potential to identify subclonal genetic diversity within the population of cancer cells [58], with particular relevance to the detection of subclones carrying drug-resistance mutations [59]. Indeed, one high-coverage cancer genome sequence has recently been reported [57] and several others will emerge during the course of 2009.
Even with the remarkable technological advances in sequencing, however, the parameters of experiments to catalogue all somatically acquired variants in a cancer genome are sobering. To obtain a complete catalogue of somatic mutations from an individual human cancer may require 20-fold sequence coverage of the cancer genome, and possibly more. Somatic mutations then have to be distinguished from inherited DNA variants. Although most inherited variants that are common in human populations (>5% allele frequency) have been discovered and are registered in databases, there are myriad rare inherited single nucleotide polymorphisms and structural variants that are not. In most cancer genomes these rare germline variants far outnumber the somatic mutations present. Therefore, for the foreseeable future at least, a high-coverage sequence of the normal genome from the same individual as the cancer will be an inescapable extra burden to allow identification of the somatic changes. Thus, more than 100,000,000,000 base pairs of DNA sequence will probably be required to identify the catalogue of somatic mutations in a single cancer genome.
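The arithmetic behind the "more than 100,000,000,000 base pairs" estimate can be checked with a short calculation. This is a sketch under stated assumptions: the genome size of 3.2 Gbp is the value quoted elsewhere in these slides, 20-fold coverage is applied to both the cancer and the matched normal genome, and the function name is hypothetical.

```python
GENOME_SIZE_BP = 3_200_000_000  # human genome size as quoted in the slides (3.2 Gbp)


def bases_required(coverage, genomes=1, genome_size_bp=GENOME_SIZE_BP):
    """Total bases to sequence `genomes` genomes at the given fold coverage."""
    return coverage * genomes * genome_size_bp


# Cancer genome plus matched normal genome, each at 20-fold coverage.
total_bases = bases_required(20, genomes=2)
print(f"{total_bases:,}")  # 128,000,000,000
```

Under these assumptions the total is 128 billion base pairs, consistent with the excerpt's lower bound.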
Subsequently, it will be necessary to distinguish driver mutations from passengers (see Box 1). The power to distinguish clusters of driver mutations in cancer genes from chance clusters of randomly distributed passenger mutations will depend on how frequently a cancer gene is mutated and the prevalence of passenger mutations. To be confident of identifying a cancer gene that is mutated in <5% of a particular type of cancer will require hundreds of cases to be sequenced. Each of the >100 cancer types will probably require similar sample sizes.
Coordinating the sequencing of cancer genomes
There is, therefore, much work to be done over the next few years. Ideally, it should be organized to maximize use of resources and harmonize the product. This is the mission of the International Cancer Genome Consortium (ICGC, see http://www.icgc.org/home). Building on the success of previous multinational, collaborative initiatives such as the Human Genome Project and the HapMap consortium, the aim of ICGC is to comprehensively characterize somatically acquired genetic events in at least fifty classes of cancer, including those with the highest global incidence and mortality, requiring high-coverage sequencing of 20,000 cancer genomes or more. The full catalogues of somatic mutation from each of these cancers will be integrated with expression and epigenetic profiles of the same cases and correlated with clinical features.
Projects under the ICGC imprimatur will adhere to predetermined standards and procedures for ethical approval, data release, intellectual property, sample quality, clinical annotation, data quality, data storage and sequencing completion. Most importantly, given the demanding nature of the task, the ICGC will coordinate studies to minimize duplication of effort and enable the most parsimonious deployment of resources.
The proposal to sequence large numbers of cancer genomes has generated controversy reminiscent of the debate before sequencing of the reference human genome almost 20 years ago. The experiments will be expensive and, to some extent, we cannot predict what will be found. However, the human genome is finite. Therefore, with further technological advances in DNA sequencing that are already in sight, this is a deliverable project that will comprehensively elucidate central questions relating to the nature of human cancer. The clinical and translational implications of such a body of work are profound. Beyond the identification of further potentially druggable cancer genes, a comprehensive catalogue of somatic mutations in carefully characterized clinical samples will generate new insights into the genetic patterns that underpin disease phenotype, prognosis, drug response and chemotherapy resistance. As the costs of sequencing whole cancer genomes drop towards US$1,000, routine sequencing in a clinical, diagnostic setting will become feasible. Such data may drive individualized therapeutic decision-making through the ability to predict prognosis, to choose therapeutic regimens known to have efficacy for the particular genetic subtype of cancer, to sensitively monitor response to therapy and to identify rare subclones harbouring drug-resistance mutations before therapy is even initiated. Individualized therapeutics will require individualized diagnostics.
The discussion is therefore not about whether to do the experiment, but when and how. In a manner similar to the Human Genome Project we have to coordinate the work internationally to maximize use of resources and minimize duplication of effort to generate a resource of high quality so that we only have to do it once, empowering cancer research with a lasting legacy for the future.
Forward look
Approximately 100,000 somatic mutations from cancer genomes have been reported in the quarter of a century since the first somatic
[Figure: kilobases per day per machine (log scale, 10 to 1,000,000,000) versus year (1980 to 2010 and future). Gel-based systems: manual slab gel, automated slab gel. Capillary sequencing: first-generation capillary, second-generation capillary sequencer. Massively parallel sequencing: microwell pyrosequencing, short-read sequencers, single molecule?]
Figure 3 | Improvements in the rate of DNA sequencing over the past 30 years and into the future. From slab gels to capillary sequencing and second-generation sequencing technologies, there has been a more than a million-fold improvement in the rate of sequence generation over this time scale.
Source: Nature, Vol 458, 9 April 2009, Reviews, p. 723. ©2009 Macmillan Publishers Limited. All rights reserved.
Stratton et al., Nature 2009
1977 - Sanger chain-termination method
Illumina
454
SOLiD
Ion Torrent
Pacific Biosciences
Human genome
~3X coverage of a human genome
Right now, upstairs in DMAC
Output: ~30 Gbp/day
Human genome is 3.2 Gbp
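The two numbers above translate directly into coverage per day. The sketch below does that arithmetic; the 30-fold target used in the second call is a hypothetical choice for illustration, not a figure from the slides, and the function names are likewise made up.

```python
import math


def coverage_per_day(output_gbp_per_day, genome_size_gbp):
    """Fold coverage of one genome produced per day of sequencing."""
    return output_gbp_per_day / genome_size_gbp


def days_for_coverage(target_coverage, output_gbp_per_day, genome_size_gbp):
    """Whole days of sequencing needed to reach the target fold coverage."""
    return math.ceil(target_coverage * genome_size_gbp / output_gbp_per_day)


daily_coverage = coverage_per_day(30.0, 3.2)          # slide values: ~30 Gbp/day, 3.2 Gbp
days_needed = days_for_coverage(30, 30.0, 3.2)        # hypothetical 30-fold target
print(round(daily_coverage, 2), days_needed)
```

At roughly 9-fold coverage of a human genome per day, a single machine reaches the assumed 30-fold target in about four days.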
Completely sequenced genomes by year
http://www.genomesonline.org/cgi-bin/GOLD/index.cgi?page_requested=Statistics
Bioinformatics
And the “wealth” of information
TCCAAACCCAGGCTCTCTCCCAAACCAGTTTGCGGCAGATGGCCAGTGGAACCTCACTCTCCTCATCAGTAAAAAGGGGGCAGAGTGAGGGTCCTGAGAGCTAGTACAGGGACTGTGTGAAGTAGACAATGCCCAGTGTTTAGCGTAAGAATCAGGGTCCAGCTGGTGCTCCCTAAACAGCAGCTGCTGTTCACTGTTGAAAGGCGCTCTGGAAGGCCAGGCGCGGTGGCTCATGCTTGTAATCCCAGCACTGTGGGAGGCCGAGGTGGGCGGATCACCTGAGGTAGGGAGTTCGAGACCAGCCTGACCAACGTGGAGAAACCCCATCTCTCCTAAAAATACAAAATTAGCCAGGCGTGGTAGCACATACCTGTAATCCCAGCGACTCGGGAGGCTGAGGCAAGAGAATTGCTTGAAACCAGCAGGGGAGGTTGTGGTGAGCCAAGATCGAGCCATTGCACTCCAGCCAGGGCAACAAGAGGCAAAATGGCGAAACTCCATCTCCGAGAAAAAAAAAAAAAAAGAATACTTTCTGAAAGTATTTATTCATACAAATAAAGACTTGACCCATAAGGTAGGAACGCAAATGGGCCACGGAATCACTCATTCCACAGTATACACCGAGTGCCCTTGAAGTGCTGGGCACTGCTCCAGGATTGGGGGCATATTGGTGAAAAGAGAAGCAAGCCTGCCTGCTCAGATGGCAGGGAATGGGGAAAAACAGGGAGACAGTTTCCTGTTTGAGATGTTGGGAGTCTGCTTCGAGTAGTATATTTACTGGAAATAGACCACTAACTTGGATGTCCCTTTTTGGAAATGTGCCTGCGTCCAGGGCTGGGTTGGGGCCCCAATGAACTTTGGCTCTGACATAGCTGTTGCCACACTCAGTGGAACTGAATCCATGTTTGCCTTCACCCGGCATCCTTCACCCCAACTCTCCCCGCCACAACATACATCCCATGCCAGCCTGGGGACCCTCAAAGGTGCTTCATCATTAGGTTTGTGGCTGGGTCCTACTGAAGTAAGTCTTGGCACTCAGAGGGATAGGAATTGAATGAAGACATGAGATTCCTCTGCGGGAGGCCTCTCTAGGAAATCTGTGGACTCACACGTTTACTAATGTTGCTGCAGCCCCGCACCCACCTTGGCCTTGGGCAGCCATACTCTAGGGCTTTTGTAACCTCTCCATGTGAGGAACTCAAATTAGACCTGGGTTTGGAGGCGGTGCTCCGAGCTGGCCTTTGGGGGAGGTTTTGTGCGAGGCATTTCCCAAGTGCTGGCAGGATTGTGTCACAGACACAGAGTAAACTTTTGCTGGGCTCCAAGTGACCGCCCATAGTTTATTATAAAGGTGACTGCACCCTGCAGCCACCAGCACTGCCTGGCTCCACGTGCCTCCTGGTCTCAGTATGGCGCTGTCCTGGGTTCTTACAGTCCTGAGCCTCCTACCTCTGCTGGAAGCCCAGATCCCATTGTGTGCCAACCTAGTACCGGTGCCCATCACCAACGCCACCCTGGACCGGGTGAGTGCCTGGGCTAGCCCTGTCCTGAGCACATGGGCAGCTGCCTCCCTTCTCTGGGCTTCCCTTTACCTGCTGGCTGTGGTCGCACCCCCACTCCCAGCTCTGCCTTTTTCTCTTCTGGGTCCCCAGGGTGAAATTCTCACCAGCCCAGGGGACTCTGGAGGCACCCCCTGCCTCCAAACACAGAAGCCTCACTGCAGAGTCCTTCACGGAGGACGGTTCTGTGCTGGGCCTGGAGGGGCTGCCTGGGGGGCAATGACTGATCCTCAGGGTGAGCTCCTGCATGCGCACTGCCCACCAGGGGCCTCATCTCCCCATCTGCAAAATCAGGGAGAGATCTGCCTGAGTCTCCTCCCAGCTGACAGTCAAAGATTCAGCATCAAGCCCCCATCACCAGCTCCCCCCTTCTCCCCAGATCACTGGCAAGTGGTTTTATATCGCATCGGCCTTTCGAAACGAGGAGTACAATAAGTCGGT
TCAGGAGATCCAAGCAACCTTCTTTTACTTTACCCCCAACAAGACAGAGGACACGATCTTTCTCAGAGAGTACCAGACCCGGTGAGAGCCCCCATTCCAATGCACCCCCGATCTCAGCTGTCTGGCCAGAAGACCTGAGCAAGTCCCTCCTTCTTCCTGGCCTTGGCCTTCCCATGGGTGGAACCGGGAGGGTTGGCTTTAATCTCCACCAGAACTCTTGCCCCGGGACTGTGATGGGCGATTGGCCACTTCTCCTCGATAACATTACTGTTTTTCTTCCGCCTTCTGGTTGACTTTAGCCAGAACCAGTGCTTCTATAACTCCAGTTACCTGAATGTCCAGCGGGAGAATGGGACCGTCTCCAGATACGGTGAGGGCCAGCCCTCAGGCAGGAGGGTTCACCGTGGGAACAGGGCAGGCCAGCATAAGGTGGGGGCTGGATGTAGAGCCCTGGAGGCTTTGGGCACAGAGAAATAACCACTAACATTTTTGAGCTCTTACCACGTGCTCAGAAAAAATCCCTAAGAAGACACTGAGAGAATTAGATGAGGAAACATAAGAACAGAGACCTCAAATAGTTTCCCCAAGGTCACACAGCTTATAATTAGAACTAGAATTGGAACTCCAGGCTGGCTTCAGATCTGCCTCTCTCTCACGCCCTCTTTAAGATCCTTTGCAAACCAATGGTAGAAGCCTGTATGTTGGAGAGGTGGTACCTTCAACTATGTCCCCCATCACCGCAGAGGTGGCACATGGCAGGGATCTGATGGAGCTGAACTGACATCATTTAGCATCCCGAGCCTCCTCTCTGGGCCTCATTTTCCTCCTCTGTAAAACGGGGAGAAAGGCCCTGACAGCCACAGTCTGTGTGAGGCTCCTGAGATCTCATGTACAGAAAGTGCTTGGCGTGGAGCTGGGCACGCAGCAGGGGCTGGGCACACGGTGGCCCAAAGGAGACCCGGGCCTTCACTGATGGGCTTTGTGGCCCCGGACACATTTCTCTTCCAGAGGGAGGCCGAGAACATGTTGCTCACCTGCGTTCCTTAGGGACACCCCTAGGACTCCTCACCTGTAAGACAGGCACCATTGTGCCATCCCATGTTCTCACCCAGAGGCTCTTAAGACCTTGATGTTTGGTTCCTACCTGGACGATGAGAAGAACTGGGGGCTGTCTTTCTATGGTAGGCATGCTTAGCAGCCCCAAACTCATGCCCCTCTCAGGCCTCACCCCCCATTCACCCACCCCTGGGCTGGCCCCTAGAACCCCAGCCCTCCCTGGCCTCCGCCGGGCCCCACCATGTCCCCAGTCAGTCTCCTTGCTCCCCCTGCAGCTGACAAGCCAGAGACGACCAAGGAGCAACTGGGAGAGTTCTACGAAGCTCTCGACTGCTTGTGCATTCCCAGGTCAGATGTCATGTACACCGACTGGAAAAAGGTAAACGCAAGGGATTGGACATTGCCCACCTTGTCCATGGCCCAACTTGGGCAGCCCCAGAGGCCCAGAGCAGGAAAGCTGCCAGGCAAGGCTGCACAGCTAGGCAGATCTTCTGCTTTTAGGCACCTGCCTCACTGTAGGGACAGCTGAGCTCTACAGAGGCCCAGGGGTGGTGGATGAGAGCCCAGGAGGGAGAAGTCCCTGTGAAACCAGGGAGGACCTGAAAGCTAACAGGAGGGAACAGCGTGAGCCACGGGGTTGGGGGATTGGCAATTGGAGGGGACGTAATGCGGGGAGTTACCACCTACAGACGCGTCCCAAACCCCAGGCTTTCACCCCAACCTCCACTCCCCGCTCATTTTTAATACCCGTGCAGTGGGGAATTGATACTGTGGTTTTCAATGTCACCCACACTGCAGCACGGCCACAGTCACCATCCCGATTTTTGCTACAAATGAAAATTACTGTATAATGAGCTCCTTAACACTTTTCTTTAAACCTGTGTTTGGAAGACTTGTGTTGGTGTGGCCCTGTGCCCTAATACCTGTGAAATCACAGCACCGAT
GAGCTGGTTCCAATTTTTAAAATATATACATGCAGTACTTCCATGACTATTCAAAGAAAAACAATTCCTTCCATTTGCCACCTGAGATGACCACCAGGGATGTGAACTACCTCCTGCCCCATCCCCAGCCCCAGGATCCTGGGACAGGGCTTATGAACGCAACCACTGTAGTCAGCTCACTTGATCCACAGCCTGGCACCTCCACTGTCTGGCTAGGGAGCCTCGAATGGGTCCCAAGGCCACCCTGCTCCTCAGTTACATCATCTGCATAGTAGTGGTGGTTGTGAGGAATTCAGGAGCTGCAGCATAAGGGCCCTGCAGGTACTATGTGCTCAGTAAATGCCAGTGGTTCTTAAGGGTCTGAGCTCCCATTGTAGAGGCAAGTAAGCTGAGGTTCAGAGAAGAAAATGACTTGCCCAAGATCACCCAGCTGGGAAGTGACAGTGCCAGGGTTGGAGCCCTGGTTGAGCTGGTTCCACAGGCCAGAGCTCATTCTGCCCTCTCCCCGGAAGACCTCCCACCCTGTCCCCATGCCTCTGCTTCTCCCTCACCCCAATTCCCCGCTGCCTTCTAGGATAAGTGTGAGCCACTGGAGAAGCAGCACGAGAAGGAGAGGAAACAGGAGGAGGGGGAATCCTAGCAGGACACAGCCTTGGATCAGGACAGAGACTTGGGGGCCATCCTGCCCCTCCAACCCGACATGTGTACCTCAGCTTTTTCCCTCACTTGCATCAATAAAGCTTCGCATCGGCCTTTCGAAACGAGGAGTACAATAAGTCGGTTCAGGAGCCCTCAGGCAGGAGGGTTCACCGTGGGAACAGGGCAGGCCAGCATAAGGTGGGGGCTGGATGTAGAGCCCTGGAGGCTTTGGGCACAGAGGCCACCCTGGACCGGGTGAGTGCCTGGGCTAGCCCTGTCCTGAGCACATGGGCAGCTGCCTCCCTTCTCTGGGCTTCCCTTTACCTGCTGGCTGTGGTCGCACCCCCACTCCCAGCCCCCAACTCTCCCCGCCACAACATACATCCCATGCCCAGGAGGGTTCACCGTGGGAACAGGGCAGGCCAGCATAAGGTGGGGGCTGGATGTAGAGCCCTGGAGGCTTTGGGCACAGAGGCCACCCTGGACCGGGTGAGTGCCTGGGCTAGCCC
Source: http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
Current state of the art:
- 1024 cores
- 8 TB RAM
Fastest commercial computer