Multi-Sample Analysis of Microarray-Based Copy-Number Aberration Data
Gregory R. [email protected]
Mitchell [email protected]
March 6, 2006
Copy Number Detection Meeting
Motivating Framework
The ability to map the location and magnitude of aberrations is
important.
Aberration regions can be small.
We are interested in regions of copy number aberration (CNA) that
are recurrent across a class of samples.
  Myc-N amplification in high-risk neuroblastoma.
  ErbB2 amplification in higher-risk breast cancer.
  Both of these are highly correlated with prognosis.
Single Slide Methods
There are numerous single slide methods for determining aberration
within an array.
These methods use multiple elements in a region as replicates for
determining aberration in the region.
  With a single slide this is the best one can do.
  The resolution of detection is lower than the resolution of the
  array.
With multiple slides we can take a different strategy.
  Be more liberal on the single slide calls.
  Only believe the calls when we see them replicated across samples
  significantly often.
Finally, while there may be aberration present within a single
array that is not present across samples, this aberration is
unlikely to be due to a population effect.
Multiple Sample Analysis (MSA)
By using multiple samples as replicates, we are able to
characterize the genomic aberrations at a higher resolution (at the
resolution of the array). This also allows us to identify regions
of importance to
Use information from multiple samples to find aberrations
characteristic of the class of samples.
Rather than looking across the genome, we look across experiments
at each location.
This allows us to pick up small regions of tight concordance
regardless of their small size within a single experiment.
STAC Statistical Algorithm
Given a set of calls, STAC finds aberrations which are
significantly concordant across samples.
STAC provides two statistical tests of significance: the footprint
and the frequency.
Frequency measures the number of samples that overlap a
Footprint measures how tight the overlap is.
[Figure: two example stacks of calls. Frequency = 5 in both cases;
footprint = 7 in one and footprint = 4 in the other.]
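The two statistics can be illustrated with a minimal sketch. It assumes each sample contributes one call, given as a closed interval of integer positions, and it takes the footprint of a stack to be the number of positions covered by the union of its intervals. These definitions, the function names, and the example intervals are assumptions made for illustration; STAC's own definitions and its significance testing are not reproduced here.

```python
def frequency(intervals, position):
    """Count the sample intervals (start, end inclusive) covering a position."""
    return sum(start <= position <= end for start, end in intervals)


def footprint(intervals):
    """Count the distinct positions covered by the union of the intervals."""
    covered = set()
    for start, end in intervals:
        covered.update(range(start, end + 1))
    return len(covered)


# Hypothetical stacks: five samples each, all covering position 4.
loose = [(1, 4), (2, 5), (3, 6), (4, 7), (4, 6)]
tight = [(3, 5), (4, 6), (3, 6), (4, 5), (4, 6)]

print(frequency(loose, 4), footprint(loose))
print(frequency(tight, 4), footprint(tight))
```

Both stacks have the same frequency at position 4, but the second stack spans fewer positions, which is the sense in which a smaller footprint indicates a tighter overlap.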
Motivating Dataset (Mies Lab)
Formalin-Fixed Paraffin-Embedded (FFPE)
Laser capture microdissected samples from FFPE, archived (10+
years), degraded tissue, with no exact normal analog.
Indirectly labeled samples due to small quantity of DNA. Due to
a need for sufficient amplification
Amplification based on human specific degenerate oligo
2-channel BAC arrays made by the Penn Microarray Core, based on
the Weber library.
Making Calls and Processing Data
Ratios are formed for each clone with the reference (normal)
intensity in the denominator and the experimental sample in the
numerator.
If a segment of DNA containing a clone is not altered, then ideally
the ratio for that clone should be 1.
If (in one chromosome) a segment of DNA containing a clone is
missing, then ideally the ratio should be 1/2.
If (in one chromosome) a segment of DNA containing a clone is
duplicated, then ideally the ratio should be 3/2.
If the segment is tripled, then ideally the ratio should be 2.
Of course, data are noisy and subject to bias and artifacts.
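The ideal ratios above follow from counting copies against a two-copy reference. A minimal sketch, assuming a diploid reference and an event on only one of the two chromosomes (the function name and the case labels are illustrative):

```python
def expected_ratio(sample_copies, reference_copies=2):
    """Ideal sample/reference ratio for a clone, ignoring noise and bias."""
    return sample_copies / reference_copies


# Total copies in the sample = copies on the altered chromosome + 1 normal copy.
cases = {
    "unaltered": 1 + 1,
    "missing on one chromosome": 0 + 1,
    "duplicated on one chromosome": 2 + 1,
    "tripled on one chromosome": 3 + 1,
}

for name, copies in cases.items():
    print(f"{name}: {expected_ratio(copies)}")
```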
Processing Issues
Clone/array quality issues
Clone mapping issues
  Overlaps and inconsistencies
  Unequally spaced clones
  How to infer behavior at locations between clones
  Tiling
Clone-to-clone variation
  Differing clone hybridization affinities and clone/dye
  interaction effects, etc.
Normalization
  Removing dye bias, etc.
  Within-array normalization
  Between-array normalization
Nature of clone coverage
  Inconsistent spacing due to both technical considerations and
  biological reality.
First Step: Develop a parameterized protocol for single slide
calls.
Make calls per clone
  Use normal/normal distribution
Make calls for each nucleotide covered by at least 1 clone
  How to deal with overlapping clones.
  How to deal with replicate (and potentially inconsistent) clones.
Extend the calls to regions with no coverage.
  Develop method for extension from neighboring clones.
  Determine how to divide regions flanked by inconsistent clones.
Standardize genome spacing for analysis.
  Merging continuous genome into discrete regions.
  How to deal with overlapping
Making clone-wise calls from raw data
Absolute threshold
Using Normal Controls
  Using normal samples as controls.
  A distribution of sample normals analogous to the test channel of
  interest, hybridized to the same reference channel as used for
  the experimental hybridizations.
Possible cutoff parameters using normal samples
  Percentiles
  Standard deviations
  Z-scores
  User
Given a fixed scheme (above), how can we find an optimal
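One way to picture a standard-deviation cutoff based on normal controls is the sketch below. It assumes that each clone has several values from normal/normal hybridizations, that a clone is called when the test value lies more than `sd_cutoff` standard deviations from the clone's normal mean, and that calls are coded +1 (gain), -1 (loss), and 0 (no call). The function name, the coding, and the example numbers are hypothetical and are not taken from the MSA software.

```python
import statistics


def call_clones(sample_values, normal_values, sd_cutoff):
    """Return +1 (gain), -1 (loss) or 0 (no call) for each clone.

    sample_values: one value per clone from the test array.
    normal_values: per clone, a list of values from normal/normal arrays
                   (at least two values, not all identical).
    """
    calls = []
    for value, normals in zip(sample_values, normal_values):
        mean = statistics.mean(normals)
        sd = statistics.stdev(normals)
        z = (value - mean) / sd
        if z > sd_cutoff:
            calls.append(1)
        elif z < -sd_cutoff:
            calls.append(-1)
        else:
            calls.append(0)
    return calls


# Hypothetical data: three clones, three normal arrays each.
normals = [[-0.1, 0.0, 0.1], [-0.2, 0.0, 0.2], [-0.1, 0.0, 0.1]]
sample = [0.5, 0.1, -0.45]

print(call_clones(sample, normals, sd_cutoff=3))
print(call_clones(sample, normals, sd_cutoff=6))
```

With the looser cutoff the first and third clones are called; with the stricter cutoff nothing is called, which shows how strongly the calls depend on the chosen cutoff.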
Extending calls to regions with no coverage
Note: We don't extend over all lengths, only small spans. We cut
out regions longer than a
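The extension step can be sketched as follows. The sketch assumes clones are sorted and non-overlapping, that a gap no longer than a hypothetical `max_gap` is filled from its neighbors, that a gap flanked by inconsistent calls is split at its midpoint, and that longer gaps are left uncalled (`None`). The midpoint rule and the parameter name are illustrative assumptions; the slides do not state the rule MSA actually uses.

```python
def extend_calls(clones, max_gap):
    """Fill short gaps between adjacent clones; leave long gaps uncalled.

    clones: list of (start, end, call), sorted by start, non-overlapping,
            with inclusive integer coordinates.
    Returns a list of (start, end, call) segments; call is None for
    gaps longer than max_gap.
    """
    segments = []
    for i, (start, end, call) in enumerate(clones):
        segments.append((start, end, call))
        if i + 1 == len(clones):
            break
        next_start, _, next_call = clones[i + 1]
        gap_start, gap_end = end + 1, next_start - 1
        if gap_end < gap_start:
            continue  # clones are adjacent, nothing to fill
        if gap_end - gap_start + 1 > max_gap:
            segments.append((gap_start, gap_end, None))  # too long: cut out
        elif call == next_call:
            segments.append((gap_start, gap_end, call))  # consistent neighbors
        else:
            # Inconsistent neighbors: split the gap at its midpoint (assumption).
            mid = (gap_start + gap_end) // 2
            segments.append((gap_start, mid, call))
            if mid < gap_end:
                segments.append((mid + 1, gap_end, next_call))
    return segments


# Hypothetical clones: (start, end, call) with call 1 = aberrant, 0 = normal.
clones = [(0, 99, 1), (150, 249, 1), (300, 399, 0), (1000, 1099, 1)]
for segment in extend_calls(clones, max_gap=100):
    print(segment)
```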
Standardizing Genome Spacing
Analysis
In an ideal situation we would believe every aberration
We would then ask the question: which aberrations occur
concordantly across samples?
This is where the STAC statistic helps us out.
Finding a reasonable cutoff
For cutoff SD=1, we are definitely picking up false signal.
For cutoff SD=6, we are likely missing true signal.
Looking one slide at a time, it is hard to tell what is a
reasonable cutoff.
A single array with calls made at 11 different
6 normals, 15 tumor samples, in parallel for 11 values of the SD
[Figure panels: High Cutoff, Middle Cutoff, Low Cutoff]
Methodology
Avoid making a decision on cutoffs.
Calculate significance at a range of cutoff values, using STAC at
each cutoff.
Combine results using multiple testing correction.
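As an illustration of combining results across cutoffs, the sketch below assumes that a p-value is available for each location at each cutoff and applies a Bonferroni adjustment to the smallest p-value per location. Bonferroni is an assumption chosen for simplicity; the slides do not say which multiple testing correction MSA applies, and the function name and example p-values are hypothetical.

```python
def combine_across_cutoffs(p_values_by_cutoff):
    """Bonferroni-adjusted minimum p-value for each location.

    p_values_by_cutoff: dict mapping an SD cutoff to a list of p-values,
                        one per location, all lists the same length.
    """
    n_cutoffs = len(p_values_by_cutoff)
    per_location = zip(*p_values_by_cutoff.values())
    return [min(1.0, n_cutoffs * min(ps)) for ps in per_location]


# Hypothetical p-values for three locations at three cutoffs.
p_values = {
    1: [0.5, 0.001, 0.2],
    3: [0.4, 0.002, 0.01],
    6: [0.9, 0.03, 0.5],
}
print(combine_across_cutoffs(p_values))
```

A location is then reported as significant if its adjusted value falls below the chosen level, without having committed to a single cutoff in advance.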
[Figure labels: End Point, Start Point, Percent Aberration, SD
Cutoff Values, Less]
Results
Chromosome 8 important in breast cancer.
Provides fine resolution of aberration, rather than simply
providing gross changes.
  Able to characterize aberration at the resolution of the array.
Able to characterize important regions.
  Myc, FGFR, etc.
  Other regions previously uncharacterized.
ChARM: Chromosome 8
CBS: Chromosome 8
MSA: Chromosome 8
Able to characterize a 1 Mb amplification of the FGFR oncogene.
  All single slide methods missed this.
Able to pick up the Myc oncogene amplification.
  Single slide methods missed it despite its presence in every
  sample.
Also characterizes other regions.
  Some of these regions the single slide methods were able to
  detect.
  Detected other smaller regions of aberration.
Allows finer resolution mapping
Smaller regions are either missed or clumped together into larger
regions of aberration.
MYC
Note: We are working on adding the CBS algorithm implementation to
MSA to allow the use of its single slide approach to our Multiple
Sample
Discussion
To our knowledge, there are no methods that combine preprocessing
and analysis harnessing the power of multiple samples.
Because most methods are single array methods, integration between
experiments is difficult to define.
MSA provides statistical analysis at higher resolution.
MSA works with difficult data: based on the Pinkel and Albertson
scale of difficulty, our method has been tested, and works well,
with 5/6 criteria.
Future Plans
Handle Affymetrix SNP Chip data.
  Many of the ideas for leveraging multiple samples should also
  apply to the analysis of Affy SNP data. We are currently working
  on this extension.
Release stand-alone GUI software package (CGH-MSA).
  To be released this month.
  www.cbil.upenn.edu/MSA
Incorporate single slide methods.
Extend the STAC algorithm beyond binary data to account for levels
of change.
Estimate bias in non-controlled experiments.
Our method works really well on other cases as well.
Downloadable