Some slides adapted from J. Fridlyand BioSys course: DNA
Microarray Analysis Lecture, 2007 Analysis of Array CGH Data by
Hanni Willenbrock
Slide 2
1 Outline
Introduction to comparative genomic hybridization (CGH) and array CGH
Data analysis approaches
-Breakpoint detection
-Loss and gain analysis
-Application of segmentation to testing
Real data example 1: Application to a primary tumor dataset
Real data example 2: Comparative genomic profiling of bacterial strains
Slide 3
PhD defense, October 27th 2006
2 Comparative Genomic Hybridization
Study types:
-Gain or loss of genetic material
-To find variations in the genetic material
Purposes:
-Study of chromosomal aberrations often found in cancer and
developmental abnormalities.
-Study of variations in the baseline sequence in a microbial
population (microbial comparative genomics).
Slide 4
3 A Variety of Genetic Alterations Underlie Developmental
Abnormalities and Disease
Inappropriate gene activation or inactivation can be caused by:
-Mutation
-Epigenetic gene silencing (e.g. addition of methyl groups)
-Reciprocal translocation (exchange of fragments between two
non-homologous chromosomes)
-Gain or loss of genetic material
Any of the above may lead to activation of an oncogene or to
inactivation of a tumor suppressor.
Slide 5
4 Existing techniques for detecting structural abnormalities
Albertson and Pinkel, Human Molecular Genetics, 2003
Slide 6
5 Some microarray platforms for copy number analysis
-BAC arrays
-Affymetrix SNP chip (500 K)
-Representational oligonucleotide microarray analysis (ROMA)
-Whole genome tiling arrays
-Own design (NimbleGen/NimbleExpress)
Slide 7
6 Array CGH: BAC arrays
HumArray3.1: 2464 human BAC clones spotted in triplicate
[Figure labels: 12 mm; 164-196 kbp]
Slide 8
7 Array CGH Maps DNA Copy Number Alterations to Positions in
the Genome
[Figure: array CGH schematic. Labels: Test Genomic DNA, Reference
Genomic DNA, Cot-1 DNA; plot of Ratio against Position on Sequence,
showing loss of DNA copies in tumor and gain of DNA copies in tumor]
Slide 9
8 Example: Detection of DiGeorge region (A) Detection of
deletion in the DiGeorge region by FISH. A chromosome 22
subtelomere probe (green) and the TUPLE1 probe for the DiGeorge
region (red) were hybridized to metaphase chromosomes from a normal
individual and an individual with the deletion. The arrow indicates
the missing red FISH signal on the deleted chromosome. (B) Array
CGH copy number profile of chromosome 22 showing deletion in the
DiGeorge region (arrow). Albertson and Pinkel, Human Molecular
Genetics, 2003
Slide 10
9 Structural abnormalities
Albertson and Pinkel, Human Molecular Genetics, 2003
*HSR: homogeneously staining region
Slide 11
10 Tumor Genomes are Stable Copy Number Profiles of a Tumor
& Recurrence
Slide 12
11 Analysis of array CGH
Goal: To partition the clones into sets with the same copy number
and to characterize the genomic segments in terms of copy number.
Biological model: genomic rearrangements lead to gains or losses of
sizable contiguous parts of the genome, possibly spanning entire
chromosomes, or, alternatively, to focal high-level amplifications.
Slide 13
12 Varying genomic complexity Breakpoints
Slide 14
13 Exercise Part I: Plot and view array CGH data
DNA Microarray Analysis Course, 2007
Slide 15
DNA Microarray Analysis Course, 2006
14 Observed clone value and spatial coherence
[Figure: observed clone values with distributions N(-0.3, 0.08^2)
and N(0.6, 0.1^2); two clones are marked "?"]
It is useful to exploit the physical dependence of nearby clones,
which translates into copy number dependence.
Slide 16
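15 Expected log2 ratio as a function of copy number change,
normal cell contamination and ploidy
[Figure: expected log2 ratio curves labeled 100%, 10% and 50%, in
panels for reference ploidy=2 (marked values 0.58, 0.07, 0.58, 2.58)
and reference ploidy=3 (marked values 2.0, 0.38, 0.0, 0.42)]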
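The dependence of the expected log2 ratio on contamination and ploidy can be sketched with a simple mixture model. This is an illustration under stated assumptions, not the formula used to draw the slide: normal cells are assumed to carry 2 copies, and "reference ploidy" is read as the average tumor copy number on which the ratio is centered. The function and parameter names are hypothetical.

```python
import math

def expected_log2_ratio(copy_number, tumor_fraction, ploidy=2.0):
    """Expected log2 ratio of a clone under an assumed mixture model.

    Assumption: normal cells contribute 2 copies, and the ratio is
    centered on the average copy number of the mixed sample.
    Undefined when the mixed copy number is 0 (homozygous deletion
    in a pure tumor sample).
    """
    mixed_copies = tumor_fraction * copy_number + (1.0 - tumor_fraction) * 2.0
    mixed_ploidy = tumor_fraction * ploidy + (1.0 - tumor_fraction) * 2.0
    return math.log2(mixed_copies / mixed_ploidy)

# Single-copy gain (3 copies) with ploidy 2: pure tumor versus 10% tumor cells
pure = expected_log2_ratio(3, tumor_fraction=1.0)
contaminated = expected_log2_ratio(3, tumor_fraction=0.1)
print(round(pure, 2), round(contaminated, 2))
```

Under this model a single-copy gain gives about 0.58 in a pure sample and about 0.07 at 10% tumor cells, which agrees with two of the values marked on the slide. The point is that contamination compresses the signal toward zero.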
Slide 17
16 Simulation Study
Many algorithms to choose from
Mainly evaluated only on limited examples
Few comparisons between algorithm performance
Choice of evaluation criteria:
-False breakpoint detection vs. missed breakpoints
-Sample type preferences (size of segments, noise, etc.)
Slide 18
17 Methods for Segmentation
HMM: Hidden Markov Model (aCGH package)
Fits HMMs in which any state is reachable from any other state
(Fridlyand et al., JMVA, 2004).
CBS: Circular binary segmentation (DNAcopy package)
Recursively splits the chromosomes into contiguous regions of equal
copy number (each split yields two or three segments) and assesses
the significance of the proposed splits using a permutation
reference distribution (Olshen et al., Biostatistics, 2004).
GLAD: Gain and Loss Analysis of DNA (GLAD package)
Detects chromosomal breakpoints by estimating a piecewise constant
function that is based on adaptive weights smoothing (Hupe et al.,
Bioinformatics, 2004).
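To make the idea of splitting plus a permutation reference distribution concrete, here is a minimal sketch of plain recursive binary segmentation on one chromosome. It is a simplified stand-in, not the CBS algorithm of the DNAcopy package: it tests only single change points, and the statistic, `min_size`, `n_perm` and `alpha` are illustrative assumptions.

```python
import random

def split_score(values, i):
    """Scaled absolute difference in means between values[:i] and values[i:]."""
    left, right = values[:i], values[i:]
    diff = sum(left) / len(left) - sum(right) / len(right)
    return abs(diff) * (len(left) * len(right) / len(values)) ** 0.5

def best_split(values, min_size):
    """Return (score, index) of the strongest candidate breakpoint."""
    candidates = range(min_size, len(values) - min_size + 1)
    return max((split_score(values, i), i) for i in candidates)

def segment(values, offset=0, min_size=3, n_perm=200, alpha=0.05, rng=None):
    """Recursively collect breakpoints that pass a permutation test."""
    rng = rng or random.Random(0)
    if len(values) < 2 * min_size:
        return []
    observed, index = best_split(values, min_size)
    # Reference distribution: best split score after shuffling clone order
    shuffled = list(values)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if best_split(shuffled, min_size)[0] >= observed:
            exceed += 1
    if (exceed + 1) / (n_perm + 1) > alpha:
        return []
    left = segment(values[:index], offset, min_size, n_perm, alpha, rng)
    right = segment(values[index:], offset + index, min_size, n_perm, alpha, rng)
    return left + [offset + index] + right

# Simulated chromosome: three segments of 20 clones with small Gaussian noise
noise = random.Random(1)
levels = [0.0] * 20 + [0.6] * 20 + [-0.4] * 20
log2_ratios = [level + noise.gauss(0, 0.05) for level in levels]
breakpoints = segment(log2_ratios)
print(breakpoints)
```

Each returned index is the position of the first clone of a new segment. Because every accepted split is tested at level `alpha`, occasional false breakpoints inside constant regions are expected.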
Slide 19
18 Comparison Scheme
Use of simulated data, where the truth is known
The noise is controlled (see later slide)
[Figure labels: true breakpoint, false predicted breakpoint, one
segment]
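With simulated data the predicted breakpoints can be scored directly against the truth. The sketch below counts detected, false and missed breakpoints. The matching rule (a prediction counts if it lies within `tolerance` clones of a not yet matched true breakpoint) is an assumption for illustration, not the criterion used in the lecture.

```python
def score_breakpoints(true_bps, predicted_bps, tolerance=1):
    """Count detected, false and missed breakpoints.

    Each true breakpoint can be matched by at most one prediction.
    """
    unmatched = sorted(true_bps)
    detected = 0
    false_predictions = 0
    for bp in sorted(predicted_bps):
        match = next((t for t in unmatched if abs(t - bp) <= tolerance), None)
        if match is None:
            false_predictions += 1
        else:
            unmatched.remove(match)
            detected += 1
    return {"detected": detected, "false": false_predictions, "missed": len(unmatched)}

# Two true breakpoints; the method predicts three
result = score_breakpoints(true_bps=[20, 40], predicted_bps=[21, 33, 40])
print(result)
```

Reporting false and missed breakpoints separately matters because methods trade one against the other.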
Slide 20
19 Breakpoint Detection Accuracy
Slide 21
20 Exercise Part II: Segmentation and breakpoint prediction
Slide 22
21 Merging segments
Note: all procedures operate on individual chromosomes and
therefore produce a large number of segments with mean values close
to each other.
Additional challenge: reduce the number of segments by merging
those that are likely to correspond to the same copy number. This
will facilitate inference of altered regions.
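One simple way to picture merging is to pool segments from all chromosomes whose mean values lie close together and give each pool a common level. The greedy rule below is only an illustration, not the merging procedure used in the lecture; `max_gap` is a hypothetical tuning parameter.

```python
def merge_levels(segment_means, segment_sizes, max_gap=0.1):
    """Greedily pool segments with similar means into shared levels.

    Segments are visited in order of increasing mean; a segment joins the
    current pool if its mean is within max_gap of the pool's size-weighted
    mean. Returns one merged level per input segment.
    """
    if not segment_means:
        return []

    def pooled_mean(group):
        total = sum(segment_sizes[j] for j in group)
        return sum(segment_means[j] * segment_sizes[j] for j in group) / total

    order = sorted(range(len(segment_means)), key=lambda i: segment_means[i])
    groups = [[order[0]]]
    for i in order[1:]:
        if segment_means[i] - pooled_mean(groups[-1]) <= max_gap:
            groups[-1].append(i)
        else:
            groups.append([i])

    merged = [0.0] * len(segment_means)
    for group in groups:
        level = pooled_mean(group)
        for j in group:
            merged[j] = level
    return merged

# Five segments that really represent three copy number levels
means = [0.02, 0.61, -0.01, 0.58, -0.45]
sizes = [10, 10, 10, 10, 10]
merged = merge_levels(means, sizes)
print(merged)
```

After merging, segments with the same level can be treated as having the same copy number, which simplifies calling gains and losses.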
Slide 23
22 Merging
For estimating actual copy number levels from segmentations
Slide 24
23 Segmentation and Merging
Slide 25
24 ROC Curve: Identification of copy number alterations for
varying thresholds
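A ROC curve of this kind can be traced by calling a clone altered when its absolute (segmented) value reaches a threshold and comparing the calls with the known truth. The sketch below is a hypothetical illustration with made-up values; it assumes the data contain both altered and unaltered clones.

```python
def roc_points(values, is_altered, thresholds):
    """Return (threshold, true positive rate, false positive rate) triples."""
    positives = sum(is_altered)
    negatives = len(is_altered) - positives
    points = []
    for t in thresholds:
        called = [abs(v) >= t for v in values]
        tp = sum(c and a for c, a in zip(called, is_altered))
        fp = sum(c and not a for c, a in zip(called, is_altered))
        points.append((t, tp / positives, fp / negatives))
    return points

# Toy clone values with known alteration status
values = [0.05, -0.4, 0.5, 0.1, -0.02, 0.3]
truth = [False, True, True, False, False, True]
curve = roc_points(values, truth, thresholds=[0.0, 0.2, 1.0])
print(curve)
```

Sweeping the threshold from low to high moves along the curve from calling everything altered to calling nothing altered.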
Slide 26
25 Exercise Part III: Estimate copy number gains and losses
Slide 27
26 Using segmentation for testing (phenotype association studies)
Example case: Find clones (or whole segments) that differ
significantly in copy number between two cancer subtypes.
Task: Investigate whether incorporating spatial information
(segmentation) into testing for differential copy number increases
detection power.
Data type: Samples with either of 2 different phenotypes (e.g. 2
different cancer subtypes)
How: Comparison of sensitivity and specificity using:
1. Original test statistic (no use of spatial information)
2. Segmented T-statistic derived from original log2 ratios
3. T-statistic computed from segmented log2 ratios
Slide 28
27 Simulation of Array CGH Data
Real biological variation considered: breast cancer data used as
model data
-Segment lengths and copy numbers are taken from the empirical
distribution observed in breast cancer data (DNAcopy segmentation).
Mixture of cells (sample is not pure)
-Each sample was assigned a value, P_t (proportion of tumor cells),
drawn from a uniform distribution between 0.3 and 0.7.
Experimental noise is Gaussian
-Standard deviations are drawn from a uniform distribution between
0.1 and 0.2 to imitate real data, where the noise may vary between
experiments.
Cancer subtypes are heterogeneous
-Certain aberrations characteristic of a cancer subtype may exist in
only a percentage of the patients with that subtype. Thus, in each
sample, segments with copy number alterations (copy number not 2)
were removed at random with probability 30%.
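The simulation steps above can be sketched for a single sample as follows. This is a rough illustration, not the code used for the study: the segment template is a made-up stand-in for the empirical breast cancer distribution, and the expected log2 ratio assumes a diploid reference without any further centering.

```python
import math
import random

def simulate_sample(template, rng, drop_prob=0.3):
    """Simulate log2 ratios for one sample.

    template: list of (segment length in clones, tumor copy number) pairs,
    a hypothetical stand-in for the empirical segment distribution.
    """
    p_tumor = rng.uniform(0.3, 0.7)   # proportion of tumor cells
    noise_sd = rng.uniform(0.1, 0.2)  # experiment-specific noise level
    true_cn, observed = [], []
    for length, copy_number in template:
        # Heterogeneity: an altered segment is absent in some samples
        if copy_number != 2 and rng.random() < drop_prob:
            copy_number = 2
        # Assumed mixture of tumor cells and normal (2-copy) cells
        mixed = p_tumor * copy_number + (1.0 - p_tumor) * 2.0
        expected = math.log2(mixed / 2.0)
        for _ in range(length):
            true_cn.append(copy_number)
            observed.append(rng.gauss(expected, noise_sd))
    return true_cn, observed, p_tumor, noise_sd

rng = random.Random(7)
template = [(30, 2), (15, 3), (25, 2), (10, 1)]
true_cn, observed, p_tumor, noise_sd = simulate_sample(template, rng)
print(len(observed), round(p_tumor, 2), round(noise_sd, 2))
```

Calling `simulate_sample` repeatedly with the same template gives the samples of one class; a second template with different altered segments would give the other class.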
Slide 29
28 Testing samples (original values)
[Figure: 20 samples from either of 2 classes; red is the true copy
number, black dots are simulated values, and circles mark an example
of heterogeneity. Panel labels: x9, x11, 37.5%, 57.0%]
Slide 30
29 Testing samples (original values) Red: True different
clones
Slide 31
30 Testing: why is multiple testing necessary? standard p-value
cutoff for alpha=0.05 => Many false positives
Slide 32
31 Testing: why is multiple testing necessary?
Significance with random class assignments?
By chance, many test statistics fall below or above the standard
significance thresholds.
[Figure labels: -3.99 (maximum deviating value), 2.93, 5.29]
Slide 33
32 The maxT Multiple Testing Correction
By repeating random class assignment and testing, e.g. 100 times,
a permutation reference distribution of the maximum absolute test
statistic is obtained (maxT distribution).
We wish to control the family-wise error rate (FWER) at alpha=0.05
(5% chance of at least 1 false positive). Therefore, the cut-off
should be such that in only 5% of the random cases do we get a false
positive (95th percentile): cutoff = 5
[Figure labels: standard significance threshold; maxT multiple
testing corrected threshold]
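The maxT procedure can be sketched in a few lines. This is a minimal illustration rather than the implementation used for the slides: the clone-wise statistic is assumed to be a Welch t-statistic, the data are simulated without any true difference, and all function names are hypothetical.

```python
import math
import random
from statistics import mean, variance

def t_stat(a, b):
    """Welch t-statistic for two groups of values."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def clone_stats(data, labels):
    """One t-statistic per clone (row), comparing class 0 with class 1."""
    stats = []
    for row in data:
        a = [v for v, g in zip(row, labels) if g == 0]
        b = [v for v, g in zip(row, labels) if g == 1]
        stats.append(t_stat(a, b))
    return stats

def maxt_cutoff(data, labels, n_perm=100, alpha=0.05, rng=None):
    """(1 - alpha) quantile of the maximum |t| over random class assignments."""
    rng = rng or random.Random(0)
    shuffled = list(labels)
    max_stats = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        max_stats.append(max(abs(t) for t in clone_stats(data, shuffled)))
    max_stats.sort()
    index = min(n_perm - 1, math.ceil((1 - alpha) * n_perm) - 1)
    return max_stats[index]

# 50 clones, 10 samples per class, no true difference between classes
rng = random.Random(42)
labels = [0] * 10 + [1] * 10
data = [[rng.gauss(0, 0.15) for _ in labels] for _ in range(50)]
cutoff = maxt_cutoff(data, labels, n_perm=100, alpha=0.05, rng=rng)
significant = [i for i, t in enumerate(clone_stats(data, labels)) if abs(t) > cutoff]
print(round(cutoff, 2), significant)
```

Because the cut-off comes from the maximum over all clones, it is noticeably larger than the single-test threshold, which is what keeps the family-wise error rate near alpha.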
Slide 34
33 Testing samples (original values)
[Figure labels: standard p-value cutoff for alpha=0.05; maxT p-value
cutoff for alpha=0.05]
Slide 35
34 Testing: Segmenting test statistics Reference
Slide 36
35 Testing segmented samples
1. Segmentation of individual samples
37 Detecting regions with differential copy number Willenbrock
and Fridlyand. Bioinformatics 2005; 21(22): 4084-91
Slide 39
38 Variation of Simulation Parameters
Signal-to-noise ratio
-CBS consistently has the best performance
-HMM has the highest FDR
-GLAD is the least sensitive
Alternative empirical distributions of segment lengths
-HMM has the highest sensitivity for segment sizes below 10
-CBS has the highest sensitivity for segment sizes of 10 or larger
-GLAD consistently performs the worst
Outlier detection
Slide 40
39 Real Data Example 1: Primary Tumor Data
75 oral squamous cell carcinomas (SCCs)
TP53 mutational status of all samples was determined using sequence
information (Snijders et al., 2005)
Tasks:
-Characterize wild-type and mutant samples with respect to their
genomic alterations
-Build a classifier to predict TP53 mutational status
Slide 41
40 Frequency of Gain/Loss Comparisons
Threshold-based: 5% altered
Merge-based: 33% altered
Slide 42
41 Why such a difference in alteration frequency?
The high threshold-based cut-off is due to the high experimental
noise of the paraffin-embedded tumors.
[Figure: thresholds at +2.5x MAD and -2.5x MAD]
Willenbrock and Fridlyand. Bioinformatics 2005; 21(22): 4084-91
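A threshold-based call with a cut-off of 2.5 x MAD can be sketched as below. The details are assumptions for illustration: the noise is estimated as the unscaled median absolute deviation of the residuals between observed and segmented values, and the threshold is applied to the segmented values. The published procedure may differ.

```python
from statistics import median

def mad_threshold_calls(observed, segmented, k=2.5):
    """Call gain/loss/normal per clone using a cut-off of k times the MAD."""
    residuals = [o - s for o, s in zip(observed, segmented)]
    center = median(residuals)
    mad = median(abs(r - center) for r in residuals)
    cutoff = k * mad
    calls = []
    for level in segmented:
        if level > cutoff:
            calls.append("gain")
        elif level < -cutoff:
            calls.append("loss")
        else:
            calls.append("normal")
    return cutoff, calls

# Toy example: four clones at baseline followed by four gained clones
observed = [0.1, -0.1, 0.05, -0.05, 0.7, 0.5, 0.65, 0.55]
segmented = [0.0, 0.0, 0.0, 0.0, 0.6, 0.6, 0.6, 0.6]
cutoff, calls = mad_threshold_calls(observed, segmented)
print(round(cutoff, 4), calls)
```

With noisy samples the MAD grows, the cut-off widens, and fewer segments are called altered, which is the effect described on this slide.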
Slide 43
42 Classification results Willenbrock and Fridlyand.
Bioinformatics 2005; 21(22): 4084-91
Slide 44
43 Real Data Example 2: Comparative genomic profiling of
several Escherichia coli strains
The microarray design included probes for:
-7 known E. coli strains
-39 known E. coli bacteriophages
-104 known E. coli virulence genes
Experimentally:
-2 sequenced control strains (W3110 and EDL933), 3 replicates
-2 non-sequenced strains (D1 and 3538), 3 replicates
-Bacteriophage: 3538 ( stx2::cat), 2 replicates
Slide 45
44 Comparative Genomic Profiling: challenges
Ratio problems: some genes might be present in the query strain but
not in the known reference strain.
Single channel microarrays or dual channel microarrays?
-In this case, we used a custom-made Affymetrix single channel
array (NimbleExpress)
Partly present genes versus similar but different genes.
Slide 46
45 Homology between the 7 E. coli strains included on the
microarray
Very high similarity between the two K-12 strains and between the
two O157:H7 strains.
Percentage of homologues for E. coli genomes in columns found in E.
coli genomes in rows.
Willenbrock et al. Journal of Bacteriology. 2006
Nov;188(22):7713-21.
Slide 47
46 BLAST Atlas Willenbrock et al. Journal of Bacteriology. 2006
Nov;188(22):7713-21.
Slide 48
47 Hybridization Atlases Probe hybridizations for experiments
(samples) result in a similar pattern as expected from the BLAST
atlas. Willenbrock et al. Journal of Bacteriology. 2006
Nov;188(22):7713-21.
Slide 49
48 Mapping the phage 3538 ( stx2::cat) Willenbrock et al.
Journal of Bacteriology. 2006 Nov;188(22):7713-21.
Slide 50
49 Zoom of phage 3538 ( stx2::cat) The hybridization pattern is
very similar for the phage, strain 3538 and strain D1. Willenbrock
et al. Journal of Bacteriology. 2006 Nov;188(22):7713-21.
Slide 51
50 Hierarchical Cluster Analysis
D1 is very similar to the K-12 type strains (W3110 + MG1655).
[Figure label: K-12]
Slide 52
51 E. coli virulence genes
D1 is probably still a commensal strain (an organism that
participates in a symbiotic relationship from which it benefits
while the other organism is unaffected).
Willenbrock et al. Journal of Bacteriology. 2006
Nov;188(22):7713-21.
Slide 53
52 Summary
Comparative genomic profiling of two E. coli strains
-O175:H16 D1
-O157:H7 3538
Identification of virulence genes and phage elements
Conclusions:
D1 is similar to the K-12 type strains
Characterization of D1 and 3538 genes:
-Identification of a number of genes involved in DNA transfer and
recombination
Slide 54
53 Advantages over Conventional Expression Arrays
1. Hybridization of DNA to the microarray (DNA is much more stable)
2. Little normalization is necessary
3. Use of spatial coherence in the analysis
4. Only 1 sample is necessary to draw conclusions (biological
replicates are still needed to draw general conclusions about a
certain biological subtype)
5. Results may be easier to interpret and to correlate with sample
phenotypes (e.g. loss of oncogene repressor -> certain cancer
subtype)
Slide 55
54 Summary
Numerous methods have been introduced for segmentation of DNA copy
number data and breakpoint identification.
It is important to benchmark new methods against existing ones
(however, this is only feasible if the software is publicly
available).
Currently, CBS (DNAcopy package) has the best overall performance.
Use of spatial dependency in the analysis improves testing power on
a clone-by-clone basis.
Merging of segmentation results improves copy number phenotype
characterization.
Study types:
-Study of copy number in cancer samples
-Study of samples from patients with mental diseases
-Comparison of bacterial strains
Slide 56
Questions?
55 Exercise Part IV + Bonus exercise: Real data analysis