Human Cancer Genome Project Computational Systems Biology of Cancer: (II)

Slide 2: Human Cancer Genome Project
Bud Mishra
Professor of Computer Science, Mathematics and Cell Biology
Courant Institute, NYU School of Medicine, Tata Institute of Fundamental Research, and Mt. Sinai School of Medicine

Slide 3: The New Synthesis
Genome Evolution; Selection; perturbed pathways; micro-environment; epigenomics; transcriptomics; proteomics; metabolomics; signaling; genetic instability
Part-lists, Annotation, Ontologies
Diagram labels: DNA, RNA, Protein; Transcription, Translation; Genotype, Phenotype

Slide 4: Is the Genomic View of Cancer Necessarily Accurate?
"If I said yes, that would then suggest that that might be the only place where it might be done which would not be accurate, necessarily accurate. It might also not be inaccurate, but I'm disinclined to mislead anyone."
US Secretary of Defense, Mr. Donald Rumsfeld, once again quoted completely out of context.

Slide 5: Cancer Initiation and Progression
Genomics (Mutations, Translocations, Amplifications, Deletions)
Epigenomics (Hyper- & Hypo-Methylation)
Transcriptomics (Alternate Splicing, RNA)
Proteomics (Synthesis, Post-Translational Modification, Degradation)
Signaling
Proliferation, Motility, Immortality, Metastasis, Signaling

Slide 6: Mishra's Mystical 3Ms
Rapid and accurate solutions
Bioinformatic, statistical, systems, and computational approaches
Approaches that are scalable, agnostic to technologies, and widely applicable
Promises, challenges and obstacles
Measure, Mine, Model

Slide 7: Measure
What we can quantify and what we cannot

Slide 8: Microarray Analysis of Cancer Genome
Representations are reproducible samplings of DNA populations in which the resulting DNA has a new format and reduced complexity.
We array probes derived from low-complexity representations (LCR) of the normal genome.
We measure differences in gene copy number between samples ratiometrically.
Since representations have a lower nucleotide complexity than total genomic DNA, we obtain a stronger specific hybridization signal relative to non-specific signal and noise.
Diagram labels: Normal DNA, Normal LCR, Tumor DNA, Tumor LCR, Label, Hybridize

Slide 9: Minimizing Cross Hybridization (Complexity Reduction)

Slide 10: Copy Number Fluctuation
Panel labels: A1, A2, A3, B1, B2, B3, C1, C2, C3

Slide 11: Critical Innovations
Data Normalization and Background Correction for Affy-Chips
10K, 100K, 500K (Affy); Generalized RMA
Multi-Experiment-Based Probe-Characterization (Kalman + EM)
A novel genome segmenter algorithm
Empirical Bayes Approach; Maximum A Posteriori (MAP)
Generative Model (Hierarchical, Heteroskedastic)
Dynamic Programming Solution
Cubic-Time; Linear-time Approximation using Beam-Search Heuristic
Single Molecule Technologies
Optical and Nanotechnologies
Sequencing: SMASH
Epigenomics
Transcriptomics

Slide 12: Background Correction & Normalization

Slide 13: Oligo Arrays: SNP genotyping
Given 500K human SNPs to be measured, select 10 25-mers that overlap each SNP location for Allele A. Select another 10 25-mers corresponding to SNP Allele B.
Problem: Cross Hybridization
Diagram labels: DNA, 25-mers

Slide 14: Using SNP arrays to detect Genomic Aberrations
Each SNP probeset measures the absence/presence of one of two alleles.
If a region of DNA is deleted by cancer, one or both alleles will be missing!
If a region of DNA is duplicated/amplified by cancer, one or both alleles will be amplified.
Problem: Oligo arrays are noisy.
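The ratiometric idea on slides 8 and 14 can be illustrated with a small Python sketch. It assumes, for illustration only, that the allele A and allele B intensities are already background-corrected and proportional to copy number, and that the matched normal sample is diploid; the function names `total_copy_number` and `log2_ratio` are hypothetical and do not come from the slides.

```python
import math


def total_copy_number(tumor_a, tumor_b, normal_a, normal_b, normal_copies=2.0):
    """Estimate total copy number (allele A + allele B) at one SNP.

    Assumption: intensities are background-corrected and linear in copy
    number, and the normal sample carries `normal_copies` copies in total.
    """
    normal_total = normal_a + normal_b
    if normal_total <= 0:
        raise ValueError("normal signal must be positive")
    return normal_copies * (tumor_a + tumor_b) / normal_total


def log2_ratio(tumor_a, tumor_b, normal_a, normal_b):
    """Log2 of the tumor/normal total-signal ratio (0 means no change)."""
    normal_total = normal_a + normal_b
    if normal_total <= 0:
        raise ValueError("normal signal must be positive")
    tumor_total = tumor_a + tumor_b
    if tumor_total <= 0:
        return float("-inf")  # both alleles missing: homozygous deletion
    return math.log2(tumor_total / normal_total)
```

Under these assumptions, losing one allele halves the total signal (copy number 1, log2 ratio of -1), while a duplication of both alleles doubles it. Real oligo-array data are far noisier than this, which is the problem the normalization slides address.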
Slide 15: 90 humans, 1 SNP (A=0.48)
Axis labels: Allele A, Allele B

Slide 16: 90 humans, 1 SNP (A=0.24)
Axis labels: Allele B, Allele A

Slide 17: 90 humans, 1 SNP (A=0.96)
Axis labels: Allele B, Allele A

Slide 18: Background Correction & Normalization
Consider a genomic location L and two similar nucleotide sequences s_{L,x} and s_{L,y} starting at that location in the two copies of a diploid genome. E.g., they may differ in one SNP.
Let x and y be their respective copy numbers in the whole genome, and assume all copies are selected in the reduced complexity representation.
The gene chip contains four probes: p_x ∈ s_{L,x}; p_y ∈ s_{L,y}; p'_x, p'_y ∉ G.
After PCR amplification, we have some K_x x amount of DNA that is complementary to the probe p_x, etc., and K' ( K_x ) amount of DNA that is additionally approximately complementary to the probe p_x.

Slide 19: Normalize using a Generalized RMA
I = U - n [ n 2 - N(0,1) (a/b)/ N(0,1) (a/b)] {(1 + B n / N(0,1) (a/b)} -1 + [b n /B n ] )] {(1 + N(0,1) (a/b)/( B n )} -1,
Where a = U- n - n 2 ; b = n, and
b n = [I i,j U + n ] N(0,1) ([I i,j U + n ] )
B n = N(0,1) ([I i,j U + n ] )

Slide 20: Background Correction & Normalization
If the probe has an affinity φ_x, then the measured intensity can be expressed as
[K_x x + K'] φ_x + noise = [x + K'/K_x] φ'_x + noise,
with e^{ε_1} a multiplicative log-normal noise, ε_2 an additive Gaussian noise, and φ'_x = K_x φ_x an amplified affinity.
A more general model:
I_x = [x + K'/K_x] φ'_x e^{ε_1} + ε_2

Slide 21: Mathematical Model
In particular, we have four values of measured intensities:
I_x = [φ'_x x + N_x] e^{ε_1} + ε_2
I'_x = [N_x] e^{ε_1} + ε_2
I_y = [φ'_y y + N_y] e^{ε_1} + ε_2
I'_y = [N_y] e^{ε_1} + ε_2

Slide 22: Bioinformatics: Data modeling
Good news: For each 25-bp probe, the fluorescent signal increases linearly with the amount of complementary DNA in the sample (up to some limit where it saturates).
Bad news: The linear scaling and offset differ for each 25-bp probe. Scaling varies by factors of more than 10x.
Noise: Due to PCR, cross hybridization, and measurement noise.

Slide 23: Scaling & Offset differ
Scaling varies across probes: Each 25-bp sequence has different thermodynamic properties.
Scaling varies across samples: The scanning laser for different samples may have different levels. The starting DNA concentrations may differ; PCR may amplify differently.
Offset varies across probes: Different levels of cross hybridization with the rest of the genome.
Offset varies across samples: Different sample genomes may differ slightly (sample degradation, impurities, etc.)

Slide 24: Linear Model + Noise

Slide 25: Noise minimization

Slide 26: Final Data Model

Slide 27: MLE using gradients

Slide 28: Data Outliers
Our data model fails for a few data points (bad probes).
Soln (1): Improve the model
Soln (2): Discard the outliers
Soln (3): Alternate model for the outliers
Weight the data appropriately.
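The per-probe linear model (slides 22 and 23) and the suggestion to weight outliers (slide 28) can be sketched together in Python. This is a minimal sketch under stated assumptions, not the method used in the talk: it assumes calibration samples with known copy number for a single probe, fits intensity = scale * copy_number + offset by weighted least squares, and down-weights outliers with Huber-style weights (an illustrative choice). The names `weighted_line_fit` and `robust_probe_fit` are hypothetical.

```python
from statistics import median


def weighted_line_fit(xs, ys, ws):
    """Weighted least-squares fit of y = scale * x + offset."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    if sxx == 0:
        raise ValueError("need at least two distinct copy-number values")
    scale = sxy / sxx
    return scale, my - scale * mx


def robust_probe_fit(copy_numbers, intensities, n_iter=20, k=1.5):
    """Fit one probe's scale and offset, down-weighting outlying points.

    Assumption: Huber-style weights with a MAD-based residual scale stand in
    for the outlier model, which the slides do not spell out.
    """
    ws = [1.0] * len(copy_numbers)
    for _ in range(n_iter):
        scale, offset = weighted_line_fit(copy_numbers, intensities, ws)
        res = [y - (scale * x + offset) for x, y in zip(copy_numbers, intensities)]
        # Robust spread estimate; the guard avoids a zero threshold on exact fits.
        s = 1.4826 * median(abs(r) for r in res) or 1e-12
        ws = [1.0 if abs(r) <= k * s else k * s / abs(r) for r in res]
    return scale, offset, ws
```

For example, `robust_probe_fit([0, 1, 2, 3, 4, 2], [10, 30, 50, 70, 90, 500])` recovers a scale close to 20 and an offset close to 10, with the weight of the last (bad) point driven toward zero. This corresponds to Soln (3) rather than discarding the outlier outright.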
Slide 29: Outlier Model

Slide 30: Problem with MLE: No unique maxima

Slide 31: Scaling of MLE estimate

Slide 32: Segmentation to reduce noise
The true copy number (Allele A+B) is normally 2 and does not vary across the genome, except at a few locations (breakpoints).
Segmentation can be used to estimate the location of breakpoints; we can then average all estimated copy number values between each pair of breakpoints to reduce noise.

Slide 33: Allelic Frequencies: Cancer & Normal

Slide 34: Allelic Frequencies: Cancer & Normal

Slide 35: Segmentation & Break-Point Detection

Slide 36: Algorithmic Approaches
Local Approach
Change-point Detection (QSum, KS-Test, Permutation Test)
Global Approach
HMM models
Wavelet Decomposition
Bayesian & Empirical Bayes Approach
Generative Models (One- or Multi-level Hierarchical)
Maximum A Posteriori

Slide 37: HMM
State labels: 2, 3, 4, 5, 6, 0, 1
Model with a very high degree of freedom, but not enough data points.
Small sample statistics ⇒ overfitting, convergence to local maxima, etc.

Slide 38: HMM, finally
State labels: 2, 3, 1
Model with a very high degree of freedom, but not enough data points.
Small sample statistics ⇒ overfitting, convergence to local maxima, etc.

Slide 39: HMM, last time
We will simply model the number of break-points by a Poisson process, and the lengths of the aberrational segments by an exponential process.
Two-parameter model: p_b & p_e
Diagram labels: =2, 2, p_b, 1-p_b, 1-p_e, p_e
Advantages:
1. Small number of parameters. Can be optimized by a MAP estimator. (EM has difficulties.)
2. Easy to model deviation from Markovian properties (e.g., polymorphisms, power-law, Polya's urn-like process, local properties of chromosomes, etc.)
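The noise-reduction step described on slide 32 (average the estimated copy numbers between consecutive breakpoints) can be sketched in Python as follows. The sketch assumes the breakpoints have already been estimated and are given as indices into the probe-ordered profile; the function name `segment_means` is hypothetical.

```python
from statistics import fmean


def segment_means(values, breakpoints):
    """Replace each value by the mean of its segment.

    `breakpoints` are indices where a new segment starts, so breakpoints=[3]
    splits the profile into values[0:3] and values[3:].
    """
    edges = [0] + sorted(breakpoints) + [len(values)]
    smoothed = []
    for start, end in zip(edges, edges[1:]):
        if end <= start:
            raise ValueError("breakpoints must be distinct indices strictly inside the profile")
        level = fmean(values[start:end])  # one copy-number level per segment
        smoothed.extend([level] * (end - start))
    return smoothed
```

With independent noise of equal variance, averaging m probes in a segment reduces the noise variance of the segment level by a factor of m, which is why accurate breakpoint locations matter: a misplaced breakpoint mixes two different copy-number levels into one average.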
Slide 40: Generative Model
Diagram labels: Amplification, c=4; Amplification, c=3; Deletion, c=0; Deletion, c=1
Breakpoints: Poisson, p_b
Segmental Length: Exponential, p_e
Copy number: Empirical Distribution
Noise: Gaussian, μ, σ

Slide 41: A reasonable choice of priors yields good segmentation.

Slide 42: A reasonable choice of priors yields good segmentation.

Slide 43: MAP (Maximum A Posteriori) Estimators
Priors: Deletion + Amplification
Data: Priors + Noise
Goal: Find the most plausible hypothesis of regional changes and their associated copy numbers.
Generalizes HMM: The prior depends on two parameters, p_e and p_b.
p_e is the probability of a particular probe being normal.
p_b is the average number of intervals per unit length.
(p_e, p_b) max at (0.55, 0.01)

Slide 44: Likelihood Function
The likelihood function for the first n probes:
L(⟨i_1, μ_1, ..., i_k, μ_k⟩) = exp(-p_b n) (p_b n)^k × (2πσ²)^(-n/2) ∏_{i=1..n} exp[-(v_i - μ_j)² / (2σ²)] × p_e^(#global) (1 - p_e)^(#local)
where i_k = n and i belongs to the j-th interval.
The Maximum A Posteriori algorithm (implemented as a Dynamic Programming Solution) optimizes L to get the best segmentation
L(⟨i*_1, μ*_1, ..., i*_k, μ*_k⟩)

Slide 45: Dynamic Programming Algorithm
Generalizes Viterbi and extends it.
Uses the optimal parameters for the generative model.
Adds a new interval to the end:
⟨i_1, μ_1, ..., i_k, μ_k⟩ ∘ ⟨i_{k+1}, μ_{k+1}⟩ = ⟨i_1, μ_1, ..., i_k, μ_k, i_{k+1}, μ_{k+1}⟩
Incremental computation of the likelihood function:
-Log L(⟨i_1, μ_1, ..., i_k, μ_k, i_{k+1}, μ_{k+1}⟩) = -Log L(⟨i_1, μ_1, ..., i_k, μ_k⟩) + new-res./(2σ²) - Log(p_b n) + (i_{k+1} - i_k) Log √(2πσ²) - (i_{k+1} - i_k) [I_global Log p_e + I_local Log(1 - p_e)]

Slide 46: Prior Selection: F criterion
For each break we have a T² statistic and the appropriate tail probability (p-value) calculated from the distribution of the statistic. In this case, this is an F distribution.
The best (p_e, p_b) is the one that leads to the maximum min p-value.
(p_e, p_b) max at (0.55, 0.01)

Slide 47: Segmentation Analysis

Slide 48: Comparison of chromosome 13 tumor using 4 different segmentation algorithms
Region labels: 13q13.1, 13q31.3
CGH Explorer v.2.43: Lingjaerde, OC et al. Bioinformatics 21: 821-2
DNAcopy: Olshen, AB et al. Biostatistics 5: 557-72
GLAD: Hupe, P et al. Bioinformatics 20: 3413-22
vMAP: Daruwala et al. Proc Natl Acad Sci U S A. 2004

Slide 49: Comparative Analysis: BAC Array

Slide 50: Comparative Analysis: Nimblegen

Slide 51: Comparative Analysis: Affy 10K

Slide 52: Simulated Data
Array CGH simulations and an ROC analysis, using the same scheme as Lai et al.:
Weil R. Lai, Mark D. Johnson, Raju Kucherlapati, and Peter J. Park (2005), Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data, Bioinformatics, 21(19): 3763-3770.
Segmented by Vmap and DNAcopy.
The Vmap algorithm was tested at 11 segmentation P-values: 0.1, 5×10^-2, 10^-2, 10^-3, 10^-4, ..., 10^-10.
The DNAcopy algorithm was tested at 9 segmentation alpha values: 0.9, 0.5, 0.1, 10^-2, 10^-3, 10^-4, ..., 10^-7.
Analysis by Alex Pearlman et al. (2006)

Slide 53: VMAP

Slide 54: DNACopy

Slide 55

Slide 56: Prostate Tumor Gains and Losses
Genome view of 19K BAC CGH
Axis label: Log ratio

Slide 57: Segmentation of Multi-BAC Events On Chromosome 13
Track labels: Normal 1,2,3; Tumor1; Tumor2; Tumor3
Proximal breakpoints were identical for T1 and T3. Distal breakpoints overlapped for T1, T2, and T3.

Slide 58: Further Improvement
We employed a hierarchical Bayesian model in which global false discovery rates can be calculated using the different levels of the model.
Noise processes are also estimated using the appropriate global parameters.

Slide 59: Specific Features of the Model
We build a model in which, given the region segmentations I_j = region j (1 ≤ j ≤ k), we assume that the copy numbers in those regions are mutually independent Gaussian random variables X_{i,j} ~ N(μ_j, σ_j²) (1 ≤ i ≤ n_j) with mean μ_j and variance σ_j².
We further assume that each region mean parameter μ_j is in one of a small number of states s ∈ {1, ..., S}, each state s having its own probability.
μ_j is in state s (with the probability of that state) if it has a Gaussian distribution with the state mean and state variance of s.
States serve to characterize regions. The state means and variances are the hyperparameters of the model.

Slide 60: Implementation: Dynamic Programming
Given the hyperparameters, we segment regions using a dynamic programming approach. This consists of constructing probe regions as follows. After the (j-1)-st region has been constructed:
A) We choose the next two contiguous regions to the right of those already constructed by optimizing the corresponding log likelihood, subject to the condition that the p-value of the t-statistic distinguishing between these two regions is above a given threshold.
B) Having chosen these regions, the probe regions already constructed, contiguous to them, may also need to be altered.
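To make the dynamic-programming idea concrete, here is a deliberately simplified Python sketch. It is not the vMAP algorithm or the hierarchical procedure described above: it assumes a single known noise level `sigma`, replaces the breakpoint prior with a fixed per-segment `penalty` (playing a role similar to the -Log(p_b n) term), omits the p_e term and the t-statistic condition, and runs in quadratic rather than linear time. The function name `segment` and both parameters are hypothetical.

```python
def segment(values, sigma=0.25, penalty=5.0):
    """Optimal piecewise-constant segmentation by dynamic programming.

    Minimizes the sum over segments of RSS / (2 * sigma**2) + penalty,
    i.e. a Gaussian negative log-likelihood plus a cost per segment.
    Returns a list of (start, end, mean) with end exclusive.
    """
    n = len(values)
    # Prefix sums give each segment's residual sum of squares in O(1).
    s1, s2 = [0.0], [0.0]
    for v in values:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def rss(i, j):
        total = s1[j] - s1[i]
        return max(0.0, (s2[j] - s2[i]) - total * total / (j - i))

    best = [0.0] + [float("inf")] * n  # best[j]: optimal cost of values[:j]
    prev = [0] * (n + 1)               # prev[j]: start of the last segment
    for j in range(1, n + 1):
        for i in range(j):
            cost = best[i] + rss(i, j) / (2.0 * sigma ** 2) + penalty
            if cost < best[j]:
                best[j], prev[j] = cost, i

    # Walk back through the stored segment starts to recover the segments.
    segments, j = [], n
    while j > 0:
        i = prev[j]
        segments.append((i, j, (s1[j] - s1[i]) / (j - i)))
        j = i
    segments.reverse()
    return segments
```

As with the Viterbi-style recursion on slide 45, the cost of a segmentation ending at probe j is built incrementally from the best segmentation ending at an earlier probe i plus the cost of one new interval. A larger `penalty` corresponds to a prior that expects fewer breakpoints.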
Slide 61: Segmentation (ROMA, chr3)

Slide 62: S*M*A*S*H
Single Molecule Approaches to Sequencing by Hybridization
~Extensions to Optical Mapping~

Slide 63: S*M*A*S*H
Genomic DNA is carefully extracted from a small number of cells of an organism (e.g., human) in normal or diseased states. (Fig 1 shows a cancer cell to be studied for its oncogenomic characterization.)
Fig 1

Slide 64: S*M*A*S*H
LNA probes of length 6-8 nucleotides are hybridized to dsDNA (double-stranded genomic DNA) in a test tube (Fig 2), and the modified DNA is stretched on a 1 x 1 chip that has microfluidic channels manufactured on its surface. These surfaces have been chemically treated to create a positive charge.
Fig 2: DNA samples are prepared for analysis with LNA probes and restriction enzymes.

Slide 65: S*M*A*S*H
Since DNA is slightly negatively charged, it adheres to the surface as it flows along these channels and stretches out. Individual molecules range in size from 0.3 to 3 million base pairs in length.
Next, bright emitters are attached to the probes on the surface and the molecules are imaged (Fig 3).
Fig 3

Slide 66: S*M*A*S*H
A restriction enzyme (1) is added to break the DNA at specific sites. Since DNA molecules are under slight tension, the cut fragments of DNA relax like entropic springs, leaving small visible gaps corresponding to the positions of the restriction site (Fig 4).
(1) A restriction enzyme is a highly specific molecular scissor that recognizes short nucleotide sequences and cuts the DNA at only those recognition sites.
Fig 4

Slide 67: S*M*A*S*H
The DNA is then stained with a fluorogen (Fig 5) and reimaged. The two images are combined to create a composite image suggesting the locations of a specific short word (e.g., probes) within the context of a pattern of restriction sites.
Fig 5

Slide 68: S*M*A*S*H
The intensity of the light emitted by the dye at one frequency provides a measure of the length of the DNA fragments. The intensity of the light emitted by the bright emitters on probes provides an intensity profile for locations of the probes.
Images of each DNA molecule are then converted into ideograms, where the restriction sites are represented by a tall rectangle and probe sites by small circles (Fig 6).
Fig 6

Slide 69: S*M*A*S*H
The steps above are repeated for all possible probe compositions (modulo reverse complementarity).
Sutta software then uses the data from all such individual ideograms to create an assembly of the haplotypic ordered restriction maps with approximate probe locations superimposed on the map.
Fig 7 labels: ATAT, TATC, ATCA, TCAT, CATA; ATATCATAT

Slide 70: S*M*A*S*H
Local clusters of overlapping words are combined by Sutta's PSBH (positional sequencing by hybridization) algorithm to overlay the inferred haplotypic sequence on top of the restriction map (Fig 7).
Fig 7 labels: ATAT, TATC, ATCA, TCAT, CATA; ATATCATAT

Slide 71: Gapped Probes
Mixing solid bases with "wild-card" bases: e.g., xx*x**x*xx (10-4-mers) or xx*x****x*xx (12-6-mers)
A wild-card base
Universal: in terms of its ability to form base pairs with the other natural DNA/RNA bases.
Applications in primers and in probes for hybridization
Examples:
The naturally occurring base hypoxanthine, as its ribo- or 2'-deoxyribonucleoside
2'-deoxyisoinosine
7-deaza-2'-deoxyinosine
2-aza-2'-deoxyinosine

Slide 72: Simulation Results
Probe Map Assumptions, for single DNA molecules: probe location standard deviation = 240 bases; data coverage per probe map = 50x; probe hybridization rate = 30%; false positive rate of 10 probes per megabase, uniformly distributed.
Analytical estimation of the average error rate in the probe consensus map: probe location SD = 60 bases; false positive rate < 2.4%; false negative rate < 2.0%.

Slide 73: Simulation Results
Panel labels: UNGAPPED, GAPPED

Slide 74: Simulation Results
Simulation based on non-random sequences from the human genome: 96 blocks of 1 Kb (from chromosome 1) concatenated together, along with its in silico restriction map.
Error summary for the gapped probe pattern xx*x****x*xx:
Error count excluding repeats or near repeats: 0.32 bp / 10 Kb
There is no error due to incorrect rearrangements.
There is no loss of information at the haplotypic level.
Assembly failed in 2 of 96 blocks of 1 Kb = 2.1% failure rate (out of memory).

Slide 75: GENomic conTIG
Gentig uses a purely Bayesian approach. It models all the error processes in the prior.
FAST: It initially starts with a conservative but fast pairwise overlap configuration, computed efficiently using Geometric Hashing.
ACCURATE: It iteratively combines pairs of maps or map contigs, while optimizing the likelihood score subject to a false-positive constraint. It has special heuristics to handle non-local errors.

Slide 76: HAPTIG: HAPlotypic conTIG
Candida albicans
The left end of chromosome 1 of the common fungus Candida albicans (being sequenced by Stanford). You can clearly see 3 polymorphisms:
(A) Fragment 2 is of size 41.19 kb (top) vs 38.73 kb (bottom).
(B) The 3rd fragment, of size 7.76 kb, is missing from the top haplotype.
(C) The large fragment in the middle is of size 61.78 kb vs 59.66 kb.
FAST & ACCURATE BAYESIAN ALGORITHM

Slide 77: Lambda DNA with probes
Scale bar: 10 µm

Slide 78
Scale bar: 500 nm
Fig. A: Four AFM images of lambda DNA with PNA probes hybridized to the distal recognition site, located 6,900 bp or 2.28 microns from the end (green arrow). Non-specifically bound probes are indicated by the red arrows. Z-scale is +/- 1.5 nm.

Slide 79: E. coli
Figure 3. Two optical images of E. coli K12 genomic DNA after restriction digestion with the 6-cutter restriction enzyme XhoI and hybridization with an 8-mer PNA probe. Bound probes are indicated by blue arrows and non-specifically bound probes by the red arrows. Scale bar shown is 10 microns.

Slide 80: Discussions
Q&A

Slide 81: Answer to Cancer
"If I know the answer I'll tell you the answer, and if I don't, I'll just respond, cleverly."
US Secretary of Defense, Mr. Donald Rumsfeld.

Slide 82: To be continued
Break
