
What is Systems Biology

27041 Introduction to Systems Biology, CBS, Department of Systems Biology



Data integration
•  In the “Big Data” era
•  Combine different types of data, describing different things or the same thing with different error
•  City guide analogy, as combined in Google Maps:
– Road maps
– Aerial pictures of buildings
– Street-level pictures
– Restaurant reviews
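One simple way to combine measurements of "the same thing with different error" is an inverse-variance weighted mean. The sketch below is only an illustration under assumptions not stated on the slide: the measurements are independent, their standard errors are known, and the function name and example numbers are hypothetical.

```python
import math

def combine_measurements(values, std_errors):
    """Inverse-variance weighted mean of independent measurements of one quantity.

    Returns the combined estimate and its standard error.
    """
    if len(values) != len(std_errors) or not values:
        raise ValueError("need equally many values and standard errors")
    # More precise measurements (smaller error) get larger weights.
    weights = [1.0 / (se ** 2) for se in std_errors]
    total_weight = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / total_weight
    combined_se = math.sqrt(1.0 / total_weight)
    return mean, combined_se

# Hypothetical example: one quantity measured by two assays with different error.
mean, se = combine_measurements([10.0, 12.0], [1.0, 2.0])
print(f"combined estimate: {mean:.2f} +/- {se:.2f}")
```

The combined estimate lies closer to the more precise measurement, and its standard error is smaller than either input error, which is the basic payoff of integrating data sources.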


Reductionist vs Holistic
•  Reductionism seeks to find individual factors that explain/cause a phenomenon
– Typically study one factor at a time
– Which cells -> which organelle -> which molecules -> which sites on these molecules -> which atoms and H-bonds involved?
•  Holistic approach looks at many or all components of the system and the interplay between them
– E.g. map of whole city (as opposed to map of one road)
– Understanding cancer (requires understanding of many different biological processes)


Increasing Interest in Systems Biology

[Figure: PubMed publications per year (x-axis: Year, 1995 to 2010; y-axis: Publications) for the terms Bioinformatics, Systems Biology and Cell Cycle, annotated with “Yeast genome completed”, “Human genome completed” and “Nobel Prize for Cell Cycle”.]


[Figure: Systems Biology (integration) contrasted with “normal” biology (reductionist).]

Also, systems biology is a “top-down” science


Systems biology and emergent properties


Integration of “whole x-ome” to understand life in health and disease

genome, transcriptome, proteome, metabolome, lipidome, etceterome -> LIFE


From components to models


Transcriptional regulation of the Cell Cycle

Simon et al. Cell 2001


Tyson JJ, Novak B, J. Theor. Biol. 2001


Carbohydrate metabolic map


Mathematical abstraction of biochemistry


The hierarchy of models




One framework for Systems Biology (part 1)

1.  The components. Discover all of the genes in the genome and the subset of genes, proteins, and other small molecules constituting the pathway of interest. If possible, define an initial model of the molecular interactions governing pathway function (how?).

2.  Pathway perturbation. Perturb each pathway component through a series of genetic or environmental manipulations. Detect and quantify the corresponding global cellular response to each perturbation.


One framework for Systems Biology (part 2)

3.  Model Reconciliation. Integrate the observed mRNA and protein responses with the current, pathway-specific model and with the global network of protein-protein, protein-DNA, and other known physical interactions.

4.  Model verification/expansion. Formulate new hypotheses to explain observations not predicted by the model. Design additional perturbation experiments to test these and iteratively repeat steps (2), (3), and (4).
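The four steps form a loop, which can be sketched as a toy program. Everything in this sketch is hypothetical: the gene names, the hidden `TRUE_NETWORK` that stands in for real biology, and the `perturb` function that stands in for an experiment. Step 4 is also reduced to adopting the observed responses, whereas in practice it means formulating hypotheses and designing new experiments.

```python
# Toy "true" biology, hidden from the modeller: perturbing a gene changes
# the expression of the genes listed for it. All names are hypothetical.
TRUE_NETWORK = {
    "geneA": {"geneB", "geneC"},
    "geneB": {"geneD"},
    "geneC": set(),
    "geneD": set(),
}

def perturb(gene):
    """Step 2 stand-in: run the 'experiment' and return the responding genes."""
    return set(TRUE_NETWORK[gene])

def refine_model(initial_model, max_rounds=10):
    """Repeat steps 2-4 until the model predicts every observed response."""
    model = {gene: set(targets) for gene, targets in initial_model.items()}
    for round_number in range(1, max_rounds + 1):
        discrepancies = 0
        for gene in model:
            observed = perturb(gene)             # step 2: perturb and measure
            predicted = model[gene]
            unexplained = observed - predicted   # step 3: responses the model missed
            unsupported = predicted - observed   # step 3: predictions not observed
            if unexplained or unsupported:
                discrepancies += 1
                model[gene] = observed           # step 4: revise the model
        if discrepancies == 0:
            return model, round_number
    return model, max_rounds

# Step 1: an incomplete initial model of the pathway.
initial = {"geneA": {"geneB"}, "geneB": set(), "geneC": set(), "geneD": set()}
final_model, rounds = refine_model(initial)
print(f"model agreed with all perturbations in round {rounds}")
```

The loop stops only when a full round of perturbations produces no discrepancy, mirroring the instruction to repeat steps (2), (3) and (4) iteratively.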


From model to experiment and back again

Systems Biology to address the knowledge/data problem

Genome sequencing


The human genome sequencing project (HGP) - 2000


Sequencing costs over time

27626 - Next Generation Sequence Analysis

Sequencing costs

Drop in costs is faster than Moore’s Law

(Computer power doubles every 2 years)
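To make the Moore's Law baseline concrete, the following sketch computes the fold change implied by a doubling every 2 years. The function names and the starting cost of 1000 units are illustrative assumptions, not figures from the slides.

```python
def moores_law_fold(years, doubling_time_years=2.0):
    """Fold increase after `years` if capacity doubles every `doubling_time_years`."""
    return 2.0 ** (years / doubling_time_years)

def cost_after(initial_cost, years, doubling_time_years=2.0):
    """Cost of a fixed task if cost only fell at the Moore's Law pace."""
    return initial_cost / moores_law_fold(years, doubling_time_years)

print(moores_law_fold(10))     # 5 doublings in 10 years -> 32.0
print(cost_after(1000.0, 10))  # hypothetical 1000-unit task -> 31.25
```

A cost curve that falls by more than 32-fold in a decade is therefore dropping faster than this baseline.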


Sequencing capability: throughput per machine


1st generation to NGS

cancer genome can be obtained, including all point mutations, rearrangements and copy number changes. Mutations in the accompanying mitochondrial genomes of the cancer will also be collected. With further adaptation this could be extended to include epigenetic alterations and could be applied to the transcriptomes of cancers to investigate the first phenotypic effects of all these changes. This catalogue will include all the driver mutations and hence all the cancer genes operating in that cancer, whether they are protein-coding genes, non-coding RNA genes or more cryptic functional elements of the genome. Indeed, if known or unknown DNA viruses have contributed to oncogenesis these will also be discovered. The catalogue will also include all the passenger mutations that incorporate the signatures of previous exposures, DNA repair defects and other mutational processes the cancer has experienced over the decades during which it was evolving.

Until recently, this was an unattainable fantasy. However, the arrival of second-generation sequencing technologies promises a new era for cancer genomics. These platforms currently generate billions of bases of DNA sequence per week, yields that are predicted to increase rapidly over the next couple of years (Fig. 3). Several proof-of-principle studies have recently been published applying these technologies to cancer samples. These have demonstrated that the current generation of massively parallel sequencing platforms can identify the full range of somatically acquired genetic alteration in cancer, including point mutations on a genome-wide basis [57], insertions and deletions [57], copy number changes [56] and genomic rearrangements [56], as well as characterizing the cancer cell transcriptome [40,41]. Furthermore, these approaches have the potential to identify subclonal genetic diversity within the population of cancer cells [58], with particular relevance to the detection of subclones carrying drug-resistance mutations [59]. Indeed, one high-coverage cancer genome sequence has recently been reported [57] and several others will emerge during the course of 2009.

Even with the remarkable technological advances in sequencing, however, the parameters of experiments to catalogue all somatically acquired variants in a cancer genome are sobering. To obtain a complete catalogue of somatic mutations from an individual human cancer may require 20-fold sequence coverage of the cancer genome, and possibly more. Somatic mutations then have to be distinguished from inherited DNA variants. Although most inherited variants that are common in human populations (>5% allele frequency) have been discovered and are registered in databases, there are myriad rare inherited single nucleotide polymorphisms and structural variants that are not. In most cancer genomes these rare germline variants far outnumber the somatic mutations present. Therefore, for the foreseeable future at least, a high-coverage sequence of the normal genome from the same individual as the cancer will be an inescapable extra burden to allow identification of the somatic changes. Thus, more than 100,000,000,000 base pairs of DNA sequence will probably be required to identify the catalogue of somatic mutations in a single cancer genome.
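The 100,000,000,000 bp figure can be checked with rough arithmetic. The sketch below assumes the 3.2 Gbp human genome size quoted elsewhere in these slides and assumes the matched normal genome is sequenced at the same 20-fold coverage as the cancer genome; the text itself only says "high-coverage", so that equality is an assumption.

```python
HUMAN_GENOME_BP = 3.2e9  # genome size quoted elsewhere in these slides

def bases_required(coverage, genomes=2, genome_size_bp=HUMAN_GENOME_BP):
    """Total bases to sequence `genomes` genomes at the given fold coverage.

    genomes=2 stands for the cancer genome plus the matched normal genome.
    """
    return coverage * genomes * genome_size_bp

total = bases_required(coverage=20)
print(f"{total:.3g} bp needed at 20-fold coverage of cancer and normal genomes")
```

Under these assumptions the total is about 1.28 x 10^11 bp, consistent with "more than 100,000,000,000 base pairs".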

Subsequently, it will be necessary to distinguish driver mutations from passengers (see Box 1). The power to distinguish clusters of driver mutations in cancer genes from chance clusters of randomly distributed passenger mutations will depend on how frequently a cancer gene is mutated and the prevalence of passenger mutations. To be confident of identifying a cancer gene that is mutated in <5% of a particular type of cancer will require hundreds of cases to be sequenced. Each of the >100 cancer types will probably require similar sample sizes.

Coordinating the sequencing of cancer genomes

There is, therefore, much work to be done over the next few years. Ideally, it should be organized to maximize use of resources and harmonize the product. This is the mission of the International Cancer Genome Consortium (ICGC, see http://www.icgc.org/home). Building on the success of previous multinational, collaborative initiatives such as the Human Genome Project and the HapMap consortium, the aim of ICGC is to comprehensively characterize somatically acquired genetic events in at least fifty classes of cancer, including those with the highest global incidence and mortality, requiring high-coverage sequencing of 20,000 cancer genomes or more. The full catalogues of somatic mutation from each of these cancers will be integrated with expression and epigenetic profiles of the same cases and correlated with clinical features.

Projects under the ICGC imprimatur will adhere to predetermined standards and procedures for ethical approval, data release, intellectual property, sample quality, clinical annotation, data quality, data storage and sequencing completion. Most importantly, given the demanding nature of the task, the ICGC will coordinate studies to minimize duplication of effort and enable the most parsimonious deployment of resources.

The proposal to sequence large numbers of cancer genomes has generated controversy reminiscent of the debate before sequencing of the reference human genome almost 20 years ago. The experiments will be expensive and, to some extent, we cannot predict what will be found. However, the human genome is finite. Therefore, with further technological advances in DNA sequencing that are already in sight, this is a deliverable project that will comprehensively elucidate central questions relating to the nature of human cancer. The clinical and translational implications of such a body of work are profound. Beyond the identification of further potentially druggable cancer genes, a comprehensive catalogue of somatic mutations in carefully characterized clinical samples will generate new insights into the genetic patterns that underpin disease phenotype, prognosis, drug response and chemotherapy resistance. As the costs of sequencing whole cancer genomes drop towards US$1,000, routine sequencing in a clinical, diagnostic setting will become feasible. Such data may drive individualized therapeutic decision-making through the ability to predict prognosis, to choose therapeutic regimens known to have efficacy for the particular genetic subtype of cancer, to sensitively monitor response to therapy and to identify rare subclones harbouring drug-resistance mutations before therapy is even initiated. Individualized therapeutics will require individualized diagnostics.

The discussion is therefore not about whether to do the experiment, but when and how. In a manner similar to the Human Genome Project we have to coordinate the work internationally to maximize use of resources and minimize duplication of effort to generate a resource of high quality so that we only have to do it once, empowering cancer research with a lasting legacy for the future.

Forward look

Approximately 100,000 somatic mutations from cancer genomes have been reported in the quarter of a century since the first somatic

[Figure 3 plot: kilobases per day per machine (log scale) versus year (1980 to 2010 and future). Gel-based systems: manual slab gel, automated slab gel. Capillary sequencing: first-generation capillary, second-generation capillary sequencer. Massively parallel sequencing: microwell pyrosequencing, short-read sequencers, single molecule?]

Figure 3 | Improvements in the rate of DNA sequencing over the past 30 years and into the future. From slab gels to capillary sequencing and second-generation sequencing technologies, there has been a more than a million-fold improvement in the rate of sequence generation over this time scale.

Nature, Vol 458, 9 April 2009, Reviews, p. 723. ©2009 Macmillan Publishers Limited. All rights reserved.

Stratton et al., Nature 2009
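The Figure 3 caption (a more than million-fold improvement over about 30 years) implies an average doubling time for sequencing throughput. The sketch below assumes steady exponential growth over the whole period, which is a simplification; the function name is illustrative.

```python
import math

def doubling_time(fold_improvement, years):
    """Average doubling time, assuming steady exponential growth."""
    return years / math.log2(fold_improvement)

t = doubling_time(1_000_000, 30)
print(f"about {t:.1f} years per doubling")
```

A million-fold gain is roughly 20 doublings, so the average doubling time is about 1.5 years, and shorter still if the improvement exceeds a million-fold.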

1977 - Sanger chain-termination method

Illumina

454

SOLiD

Ion Torrent

Pacific Biosciences

Human genome

~3X coverage of a human genome


Right now, upstairs in DMAC

Output: ~30 Gbp/day

Human genome is 3.2 Gbp
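These two numbers translate directly into genome-equivalents per day. The sketch below uses only the slide's figures; the 30-fold coverage target in the usage example is a hypothetical choice, and the function names are illustrative.

```python
OUTPUT_GBP_PER_DAY = 30.0  # "~30 Gbp/day" from the slide
HUMAN_GENOME_GBP = 3.2     # human genome size from the slide

def genome_equivalents_per_day(coverage=1.0):
    """Human genomes sequenced per day at the given fold coverage."""
    return OUTPUT_GBP_PER_DAY / (HUMAN_GENOME_GBP * coverage)

def days_for_coverage(coverage):
    """Days of machine output needed for one human genome at `coverage`-fold."""
    return HUMAN_GENOME_GBP * coverage / OUTPUT_GBP_PER_DAY

print(f"{genome_equivalents_per_day():.1f} genome-equivalents per day at 1X")
print(f"{days_for_coverage(30):.1f} days for one genome at a hypothetical 30X")
```

At roughly 30 Gbp/day this is a little over nine 1X genome-equivalents per day.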


Completely sequenced genomes by year

http://www.genomesonline.org/cgi-bin/GOLD/index.cgi?page_requested=Statistics

Bioinformatics and the “wealth” of information


TCCAAACCCAGGCTCTCTCCCAAACCAGTTTGCGGCAGATGGCCAGTGGAACCTCACTCTCCTCATCAGTAAAAAGGGGGCAGAGTGAGGGTCCTGAGAGCTAGTACAGGGACTGTGTGAAGTAGACAATGCCCAGTGTTTAGCGTAAGAATCAGGGTCCAGCTGGTGCTCCCTAAACAGCAGCTGCTGTTCACTGTTGAAAGGCGCTCTGGAAGGCCAGGCGCGGTGGCTCATGCTTGTAATCCCAGCACTGTGGGAGGCCGAGGTGGGCGGATCACCTGAGGTAGGGAGTTCGAGACCAGCCTGACCAACGTGGAGAAACCCCATCTCTCCTAAAAATACAAAATTAGCCAGGCGTGGTAGCACATACCTGTAATCCCAGCGACTCGGGAGGCTGAGGCAAGAGAATTGCTTGAAACCAGCAGGGGAGGTTGTGGTGAGCCAAGATCGAGCCATTGCACTCCAGCCAGGGCAACAAGAGGCAAAATGGCGAAACTCCATCTCCGAGAAAAAAAAAAAAAAAGAATACTTTCTGAAAGTATTTATTCATACAAATAAAGACTTGACCCATAAGGTAGGAACGCAAATGGGCCACGGAATCACTCATTCCACAGTATACACCGAGTGCCCTTGAAGTGCTGGGCACTGCTCCAGGATTGGGGGCATATTGGTGAAAAGAGAAGCAAGCCTGCCTGCTCAGATGGCAGGGAATGGGGAAAAACAGGGAGACAGTTTCCTGTTTGAGATGTTGGGAGTCTGCTTCGAGTAGTATATTTACTGGAAATAGACCACTAACTTGGATGTCCCTTTTTGGAAATGTGCCTGCGTCCAGGGCTGGGTTGGGGCCCCAATGAACTTTGGCTCTGACATAGCTGTTGCCACACTCAGTGGAACTGAATCCATGTTTGCCTTCACCCGGCATCCTTCACCCCAACTCTCCCCGCCACAACATACATCCCATGCCAGCCTGGGGACCCTCAAAGGTGCTTCATCATTAGGTTTGTGGCTGGGTCCTACTGAAGTAAGTCTTGGCACTCAGAGGGATAGGAATTGAATGAAGACATGAGATTCCTCTGCGGGAGGCCTCTCTAGGAAATCTGTGGACTCACACGTTTACTAATGTTGCTGCAGCCCCGCACCCACCTTGGCCTTGGGCAGCCATACTCTAGGGCTTTTGTAACCTCTCCATGTGAGGAACTCAAATTAGACCTGGGTTTGGAGGCGGTGCTCCGAGCTGGCCTTTGGGGGAGGTTTTGTGCGAGGCATTTCCCAAGTGCTGGCAGGATTGTGTCACAGACACAGAGTAAACTTTTGCTGGGCTCCAAGTGACCGCCCATAGTTTATTATAAAGGTGACTGCACCCTGCAGCCACCAGCACTGCCTGGCTCCACGTGCCTCCTGGTCTCAGTATGGCGCTGTCCTGGGTTCTTACAGTCCTGAGCCTCCTACCTCTGCTGGAAGCCCAGATCCCATTGTGTGCCAACCTAGTACCGGTGCCCATCACCAACGCCACCCTGGACCGGGTGAGTGCCTGGGCTAGCCCTGTCCTGAGCACATGGGCAGCTGCCTCCCTTCTCTGGGCTTCCCTTTACCTGCTGGCTGTGGTCGCACCCCCACTCCCAGCTCTGCCTTTTTCTCTTCTGGGTCCCCAGGGTGAAATTCTCACCAGCCCAGGGGACTCTGGAGGCACCCCCTGCCTCCAAACACAGAAGCCTCACTGCAGAGTCCTTCACGGAGGACGGTTCTGTGCTGGGCCTGGAGGGGCTGCCTGGGGGGCAATGACTGATCCTCAGGGTGAGCTCCTGCATGCGCACTGCCCACCAGGGGCCTCATCTCCCCATCTGCAAAATCAGGGAGAGATCTGCCTGAGTCTCCTCCCAGCTGACAGTCAAAGATTCAGCATCAAGCCCCCATCACCAGCTCCCCCCTTCTCCCCAGATCACTGGCAAGTGGTTTTATATCGCATCGGCCTTTCGAAACGAGGAGTACAATAAGTCGGT
TCAGGAGATCCAAGCAACCTTCTTTTACTTTACCCCCAACAAGACAGAGGACACGATCTTTCTCAGAGAGTACCAGACCCGGTGAGAGCCCCCATTCCAATGCACCCCCGATCTCAGCTGTCTGGCCAGAAGACCTGAGCAAGTCCCTCCTTCTTCCTGGCCTTGGCCTTCCCATGGGTGGAACCGGGAGGGTTGGCTTTAATCTCCACCAGAACTCTTGCCCCGGGACTGTGATGGGCGATTGGCCACTTCTCCTCGATAACATTACTGTTTTTCTTCCGCCTTCTGGTTGACTTTAGCCAGAACCAGTGCTTCTATAACTCCAGTTACCTGAATGTCCAGCGGGAGAATGGGACCGTCTCCAGATACGGTGAGGGCCAGCCCTCAGGCAGGAGGGTTCACCGTGGGAACAGGGCAGGCCAGCATAAGGTGGGGGCTGGATGTAGAGCCCTGGAGGCTTTGGGCACAGAGAAATAACCACTAACATTTTTGAGCTCTTACCACGTGCTCAGAAAAAATCCCTAAGAAGACACTGAGAGAATTAGATGAGGAAACATAAGAACAGAGACCTCAAATAGTTTCCCCAAGGTCACACAGCTTATAATTAGAACTAGAATTGGAACTCCAGGCTGGCTTCAGATCTGCCTCTCTCTCACGCCCTCTTTAAGATCCTTTGCAAACCAATGGTAGAAGCCTGTATGTTGGAGAGGTGGTACCTTCAACTATGTCCCCCATCACCGCAGAGGTGGCACATGGCAGGGATCTGATGGAGCTGAACTGACATCATTTAGCATCCCGAGCCTCCTCTCTGGGCCTCATTTTCCTCCTCTGTAAAACGGGGAGAAAGGCCCTGACAGCCACAGTCTGTGTGAGGCTCCTGAGATCTCATGTACAGAAAGTGCTTGGCGTGGAGCTGGGCACGCAGCAGGGGCTGGGCACACGGTGGCCCAAAGGAGACCCGGGCCTTCACTGATGGGCTTTGTGGCCCCGGACACATTTCTCTTCCAGAGGGAGGCCGAGAACATGTTGCTCACCTGCGTTCCTTAGGGACACCCCTAGGACTCCTCACCTGTAAGACAGGCACCATTGTGCCATCCCATGTTCTCACCCAGAGGCTCTTAAGACCTTGATGTTTGGTTCCTACCTGGACGATGAGAAGAACTGGGGGCTGTCTTTCTATGGTAGGCATGCTTAGCAGCCCCAAACTCATGCCCCTCTCAGGCCTCACCCCCCATTCACCCACCCCTGGGCTGGCCCCTAGAACCCCAGCCCTCCCTGGCCTCCGCCGGGCCCCACCATGTCCCCAGTCAGTCTCCTTGCTCCCCCTGCAGCTGACAAGCCAGAGACGACCAAGGAGCAACTGGGAGAGTTCTACGAAGCTCTCGACTGCTTGTGCATTCCCAGGTCAGATGTCATGTACACCGACTGGAAAAAGGTAAACGCAAGGGATTGGACATTGCCCACCTTGTCCATGGCCCAACTTGGGCAGCCCCAGAGGCCCAGAGCAGGAAAGCTGCCAGGCAAGGCTGCACAGCTAGGCAGATCTTCTGCTTTTAGGCACCTGCCTCACTGTAGGGACAGCTGAGCTCTACAGAGGCCCAGGGGTGGTGGATGAGAGCCCAGGAGGGAGAAGTCCCTGTGAAACCAGGGAGGACCTGAAAGCTAACAGGAGGGAACAGCGTGAGCCACGGGGTTGGGGGATTGGCAATTGGAGGGGACGTAATGCGGGGAGTTACCACCTACAGACGCGTCCCAAACCCCAGGCTTTCACCCCAACCTCCACTCCCCGCTCATTTTTAATACCCGTGCAGTGGGGAATTGATACTGTGGTTTTCAATGTCACCCACACTGCAGCACGGCCACAGTCACCATCCCGATTTTTGCTACAAATGAAAATTACTGTATAATGAGCTCCTTAACACTTTTCTTTAAACCTGTGTTTGGAAGACTTGTGTTGGTGTGGCCCTGTGCCCTAATACCTGTGAAATCACAGCACCGAT
GAGCTGGTTCCAATTTTTAAAATATATACATGCAGTACTTCCATGACTATTCAAAGAAAAACAATTCCTTCCATTTGCCACCTGAGATGACCACCAGGGATGTGAACTACCTCCTGCCCCATCCCCAGCCCCAGGATCCTGGGACAGGGCTTATGAACGCAACCACTGTAGTCAGCTCACTTGATCCACAGCCTGGCACCTCCACTGTCTGGCTAGGGAGCCTCGAATGGGTCCCAAGGCCACCCTGCTCCTCAGTTACATCATCTGCATAGTAGTGGTGGTTGTGAGGAATTCAGGAGCTGCAGCATAAGGGCCCTGCAGGTACTATGTGCTCAGTAAATGCCAGTGGTTCTTAAGGGTCTGAGCTCCCATTGTAGAGGCAAGTAAGCTGAGGTTCAGAGAAGAAAATGACTTGCCCAAGATCACCCAGCTGGGAAGTGACAGTGCCAGGGTTGGAGCCCTGGTTGAGCTGGTTCCACAGGCCAGAGCTCATTCTGCCCTCTCCCCGGAAGACCTCCCACCCTGTCCCCATGCCTCTGCTTCTCCCTCACCCCAATTCCCCGCTGCCTTCTAGGATAAGTGTGAGCCACTGGAGAAGCAGCACGAGAAGGAGAGGAAACAGGAGGAGGGGGAATCCTAGCAGGACACAGCCTTGGATCAGGACAGAGACTTGGGGGCCATCCTGCCCCTCCAACCCGACATGTGTACCTCAGCTTTTTCCCTCACTTGCATCAATAAAGCTTCGCATCGGCCTTTCGAAACGAGGAGTACAATAAGTCGGTTCAGGAGCCCTCAGGCAGGAGGGTTCACCGTGGGAACAGGGCAGGCCAGCATAAGGTGGGGGCTGGATGTAGAGCCCTGGAGGCTTTGGGCACAGAGGCCACCCTGGACCGGGTGAGTGCCTGGGCTAGCCCTGTCCTGAGCACATGGGCAGCTGCCTCCCTTCTCTGGGCTTCCCTTTACCTGCTGGCTGTGGTCGCACCCCCACTCCCAGCCCCCAACTCTCCCCGCCACAACATACATCCCATGCCCAGGAGGGTTCACCGTGGGAACAGGGCAGGCCAGCATAAGGTGGGGGCTGGATGTAGAGCCCTGGAGGCTTTGGGCACAGAGGCCACCCTGGACCGGGTGAGTGCCTGGGCTAGCCC


Source: http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html



Current state of the art:

- 1024 cores

- 8 TB RAM

Fastest commercial computer
