8
BIOINFORMATICS Vol. 00 no. 00 2006 Pages 1–8 Transcript mapping with oligonucleotide high-density tiling arrays Wolfgang Huber a* , Joern Toedling a , Lars M. Steinmetz b a European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge CB10 1SD, UK, b European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany. ABSTRACT Motivation: High-density DNA tiling microarrays are a powerful tool for the characterization of complete transcriptomes. The two major analytical challenges are the segmentation of the hybridization signal along genomic coordinates to accurately determine transcript boun- daries and the adjustment of the sequence-dependent response of the oligonucleotide probes to achieve quantitative comparability of the signal between different probes. Results: We describe a dynamic programming algorithm for finding a globally optimal fit of a piecewise constant expression profile along genomic coordinates. We developed a probe-specific background correction and scaling method that employs empirical probe response parameters determined from reference hybridizations with no need for paired mismatch (MM) probes. This combined analysis approach allows the accurate determination of dynamical changes in trans- cription architectures from hybridization data and will help to study the biological significance of complex transcriptional phenomena in eukaryotic genomes. Availability: R package tilingArray at www.bioconductor.org. Contact: [email protected] 1 INTRODUCTION High-density genomic tiling microarrays cover a complete genome or a large fraction of it with densely tiled oligonucleotide pro- bes. The major applications of these arrays are for transcriptome analysis, DNA-protein binding and chromatin modification assays (ChIP-chip) and DNA variation detection (3; 5; 6; 10; 12; 19; 20; 23; 28; 29; 30; 31; 32; 33; 34; 37). The current highest-density tiling microarrays contain 6.5 million distinct features on a single chip and are produced by the com- pany Affymetrix. Each feature measures 5μm × 5μm in size and typically contains olignucleotide probes 25 bases in length. In this paper, we focus on the specific analytical challenges posed by the application of short oligonucleotide tiling arrays to transcriptome analysis (28). The first task is the detection of transcript boundaries, i. e., trans- cript start and stop sites. The challenge is to obtain optimal estimates of the genomic coordinates of transcript boundaries from the tiling array data. The hybridization signal corresponds to the sum of target molecules at each probe position. The maximal precision is deter- mined by the offset between the tiling features and can be as fine as a few bases, however in practice, it is often limited by noise and the transcriptional activity of the genomic region surrounding the * to whom correspondence should be addressed transcripts. In addition, we want to quantify the relative level of transcript abundance and have a statistical measure of uncertainty for each estimated transcript boundary. Second, we would like to detect the presence of architectural fea- tures within genes such as alternative transcription start and stop sites, alternative splicing and alternative partial degradation. For this, we need to compare the signal from probes that target different parts of the same gene. To achieve this, we must address the problem of differential probe response: different oligonucleotide probes may report consistently different intensities even if the abundance of their target molecules is the same. Such sequence dependent variation can extend over several orders of magnitude. If not accounted for, this variation leaves a great deal of noise in the data and will obstruct the reliable detection of transcript boundaries, levels and architectures. Here we describe a segmentation method that addresses the first of the above challenges and a DNA reference normalization method that addresses the second. These methods were successfully used in (6) and promise to advance the extraction of biological meaning from tiling array data. 2 METHODS 2.1 Example data We employed an Affymetrix oligonucleotide array that contains 6,553,600 probes and interrogates both strands of the complete genomic sequence of Saccharomyces cerevisiae with 25-mer probes tiled at intervals of eight nucleotides on each strand (17 nucleotides overlap) and a four nucleotides offset of the tile between strands. This design enables a four base pair (bp) resolution for hybridi- zation of double stranded targets and an eight base resolution for strand-specific targets. RNA was isolated from yeast cells during the exponential growth phase in rich medium (YPD) and was doubly enriched for poly- adenylated molecules. First-strand cDNA was synthesized using random primers and labeled. Genomic DNA was isolated from the same yeast strain. RNA and DNA samples were hybridized in three replicates each. The data is available at ArrayExpress (accession number E-TABM-14) and in the Bioconductor data package david- Tiling. The biological findings from this study are described in (6). 2.2 Structural change model (SCM) segmentation Transcript boundaries can be identified from sudden changes of the hybridization signal plotted along a linear genomic coordinate axis (Figure 1). The signal is affected by noise, which suggests that smoothing or probabilistic modeling of the signal is beneficial. The c Oxford University Press 2006. 1

BIOINFORMATICS Pages 1–8users2.unimi.it/marray/2006/material/Lec8/tilingMethods.pdf · same yeast strain. RNA and DNA samples were hybridized in three replicates each. The data

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: BIOINFORMATICS Pages 1–8users2.unimi.it/marray/2006/material/Lec8/tilingMethods.pdf · same yeast strain. RNA and DNA samples were hybridized in three replicates each. The data

BIOINFORMATICS Vol. 00 no. 00 2006Pages 1–8

Transcript mapping with oligonucleotide high-densitytiling arraysWolfgang Huber a∗, Joern Toedling a, Lars M. Steinmetz b

aEuropean Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge CB10 1SD,UK, bEuropean Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany.

ABSTRACTMotivation: High-density DNA tiling microarrays are a powerful toolfor the characterization of complete transcriptomes. The two majoranalytical challenges are the segmentation of the hybridization signalalong genomic coordinates to accurately determine transcript boun-daries and the adjustment of the sequence-dependent response ofthe oligonucleotide probes to achieve quantitative comparability of thesignal between different probes.Results: We describe a dynamic programming algorithm for finding aglobally optimal fit of a piecewise constant expression profile alonggenomic coordinates. We developed a probe-specific backgroundcorrection and scaling method that employs empirical probe responseparameters determined from reference hybridizations with no needfor paired mismatch (MM) probes. This combined analysis approachallows the accurate determination of dynamical changes in trans-cription architectures from hybridization data and will help to studythe biological significance of complex transcriptional phenomena ineukaryotic genomes.Availability: R package tilingArray at www.bioconductor.org.Contact: [email protected]

1 INTRODUCTIONHigh-density genomic tiling microarrays cover a complete genomeor a large fraction of it with densely tiled oligonucleotide pro-bes. The major applications of these arrays are for transcriptomeanalysis, DNA-protein binding and chromatin modification assays(ChIP-chip) and DNA variation detection (3; 5; 6; 10; 12; 19; 20;23; 28; 29; 30; 31; 32; 33; 34; 37).

The current highest-density tiling microarrays contain 6.5 milliondistinct features on a single chip and are produced by the com-pany Affymetrix. Each feature measures 5µm × 5µm in size andtypically contains olignucleotide probes 25 bases in length. In thispaper, we focus on the specific analytical challenges posed by theapplication of short oligonucleotide tiling arrays to transcriptomeanalysis (28).

The first task is the detection of transcript boundaries, i. e., trans-cript start and stop sites. The challenge is to obtain optimal estimatesof the genomic coordinates of transcript boundaries from the tilingarray data. The hybridization signal corresponds to the sum of targetmolecules at each probe position. The maximal precision is deter-mined by the offset between the tiling features and can be as fineas a few bases, however in practice, it is often limited by noise andthe transcriptional activity of the genomic region surrounding the

∗to whom correspondence should be addressed

transcripts. In addition, we want to quantify the relative level oftranscript abundance and have a statistical measure of uncertaintyfor each estimated transcript boundary.

Second, we would like to detect the presence of architectural fea-tures within genes such as alternative transcription start and stopsites, alternative splicing and alternative partial degradation. Forthis, we need to compare the signal from probes that target differentparts of the same gene. To achieve this, we must address the problemof differential probe response: different oligonucleotide probes mayreport consistently different intensities even if the abundance of theirtarget molecules is the same. Such sequence dependent variation canextend over several orders of magnitude. If not accounted for, thisvariation leaves a great deal of noise in the data and will obstruct thereliable detection of transcript boundaries, levels and architectures.

Here we describe a segmentation method that addresses the firstof the above challenges and a DNA reference normalization methodthat addresses the second. These methods were successfully usedin (6) and promise to advance the extraction of biological meaningfrom tiling array data.

2 METHODS2.1 Example dataWe employed an Affymetrix oligonucleotide array that contains6,553,600 probes and interrogates both strands of the completegenomic sequence of Saccharomyces cerevisiae with 25-mer probestiled at intervals of eight nucleotides on each strand (17 nucleotidesoverlap) and a four nucleotides offset of the tile between strands.This design enables a four base pair (bp) resolution for hybridi-zation of double stranded targets and an eight base resolution forstrand-specific targets.

RNA was isolated from yeast cells during the exponential growthphase in rich medium (YPD) and was doubly enriched for poly-adenylated molecules. First-strand cDNA was synthesized usingrandom primers and labeled. Genomic DNA was isolated from thesame yeast strain. RNA and DNA samples were hybridized in threereplicates each. The data is available at ArrayExpress (accessionnumber E-TABM-14) and in the Bioconductor data package david-Tiling. The biological findings from this study are described in(6).

2.2 Structural change model (SCM) segmentationTranscript boundaries can be identified from sudden changes of thehybridization signal plotted along a linear genomic coordinate axis(Figure 1). The signal is affected by noise, which suggests thatsmoothing or probabilistic modeling of the signal is beneficial. The

c© Oxford University Press 2006. 1

Page 2: BIOINFORMATICS Pages 1–8users2.unimi.it/marray/2006/material/Lec8/tilingMethods.pdf · same yeast strain. RNA and DNA samples were hybridized in three replicates each. The data

Huber et al

−6

−4

−2

0

2

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●●

●●

●●

●●

●●●

●●

●●

●●●

●●●●

●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●●

●●

●●

●●

●●●

●●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●●

●●

●●

●●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●

●●●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●●

●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●

●●●●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

GDH3YAL061W

BDH1ECM1

CNE1 GPB2PEX22

YAL053W OAF1 YAL047W−AYAL044W−A

ERV46CDC24

CDC19YAL037W

FUN12 MTW1 POP5YAL031W−A

SNC1 FRT2YAL027W

YAL019W−AFUN30

PSK1 TPD3 DEP1

−6

−4

−2

0

2

●●

●●●

●●

●●●●

●●

●●●

●●

●●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●●●●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●●

●●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●●●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●●

●●

●●

●●

●●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●●

●●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●

YAL059C−AYAL056C−A

ACS1 YAL049CGEM1

SPC72YAL046C

YAL045CGCV3

PTA1YAL042C−A CLN3 CYC3 YAL037C−B

YAL037C−A

RBG1YAL034C−B

FUN19PRP45 GIP4 MYO4 YAL026C−A

DRS2MAK16 LTE1 PMT2 FUN26 CCR4 ATS1 YAL018C YAL016C−B

YAL016C−ANTG1

SYN8

30000 40000 50000 60000 70000 80000 90000 100000 110000 120000 130000

Chr 1

Fig. 1. Visualization of yeast tiling array intensities along 100 kb of chromosome 1, corresponding to about 1% of the genome. The plot shows the normalizedlog2 hybridization intensities (y-axis) along genomic coordinates (x-axis in bp). Each dot corresponds to a unique probe, Watson (+) strand in green and Crick(-) strand in blue. Annotated open reading frames (ORFs) are shown as blue boxes, dubious ORFs as light blue boxes, transcription factor binding sites as greybars. Vertical lines are segment boundaries. The use of the data to map the boundaries and levels of all transcripts, including the untranslated regions (UTRs)of protein-coding genes, antisense transcripts, and currently uncharacterized non-coding RNAs was described by (6). A browsable on-line database of suchplots for the whole yeast genome is available at http://www.ebi.ac.uk/huber-srv/queryGene.

signal could be either the hybridization intensities from a singleRNA sample or it could be a per-probe summary statistic for thecomparison of multiple conditions or time points.

Perhaps the most obvious approach is to move a sliding windowalong the coordinate axis and to measure the evidence for the pre-sence of transcripts by computing a scan statistic at each windowstep. As we discuss in Section 3.1, such approaches are not well-suited to precisely determining the boundaries of transcripts. Animprovement is provided by Hidden Markov models. Discrete stateHidden Markov models are powerful and popular tools in biologi-cal sequence analysis, and there are efficient and elegant dynamicprogramming algorithms for fitting them to data (8; 26).

Tjaden et al (35) applied a two-state Hidden Markov model to thedetection of untranslated region (UTR) boundaries, where the twostates corresponded to presence or absence of transcription. Howe-ver, transcript abundance is a continuous-valued quantity, and thereare biological effects such as alternative transcription start and endsites and partial transcript degradation that result in complex signalpatterns (Figure 2). These patterns are richer than can be detectedby a simple “on/off” model. We propose a continuous-state modelwhose hidden state can be any real number.

2.2.1 The model. The SCM model is well known in econome-trics (2; 39) for the modeling of sudden jumps in financial timeseries and has been applied to the segmentation of array-CGHdata (25). It models the data as a piecewise constant function ofchromosomal coordinates,

zki = µs + εki for ts ≤ k < ts+1, (1)

where k = 1, 2, . . . , n indexes the probes in ascending order alongthe chromosome, i indexes replicate experiments, zki is the signalfrom the k-th probe in the i-th replicate, t2, . . . , tS parameterize the

segment boundaries, t1 = 1 and tS+1 = n+1, S is the total numberof segments, µs is the mean signal level of the s-th segment, andεki are the residuals. µ1, . . . , µS can be any set of real numbers.t2, . . . , tS are also called the change-points. Model (1) is appliedseparately to each chromosome and, if the signal is strand-specific,to each of its two strands.

2.2.2 Parameter estimation. Fitting the model (1) can be accom-plished by minimizing the sum of squared residuals

G(t1, . . . , tS) =

SXs=1

IXi=1

ts+1−1Xk=ts

(zki − µs)2 , (2)

where S is the number of segments, I is the number of replicatearrays, and µs is the arithmetic mean of the zki in segment s.

There is a dynamic programming algorithm that allows the glo-bally optimal set of parameters t1, . . . , tS , for all values of Sbetween 1 and Smax, to be obtained in quadratic time O(n2). Recentpresentations of the algorithm include (2; 25). If one bounds themaximal length of individual segments to a fixed size (for example,l = 20 kb), then the complexity of the algorithm can be reducedto O(nl). With this approach, which we have taken, sequencesof several hundred thousand probes can be processed in a singlerun. We provide a C implementation in the function segment in thetilingArray package of the Bioconductor project (11).

2.2.3 Confidence Intervals. Bai and Perron (1) present an asym-ptotic theory for inference on the SCM model (1), and in acompanion paper (2) they provide a comprehensive and detaileddiscussion of the associated computational aspects. The calculationof the confidence intervals on estimated change-points ts involvesthe distribution of the argmax functional of a process composed oftwo independent Brownian motions. The drift and scale parameters

2

Page 3: BIOINFORMATICS Pages 1–8users2.unimi.it/marray/2006/material/Lec8/tilingMethods.pdf · same yeast strain. RNA and DNA samples were hybridized in three replicates each. The data

Transcript mapping with tiling arrays

−6−4−2

02

●●

●● ●

●●

● ●●●

●●

●● ●

●●●

●●

●●

● ●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●●

●●●

●●●

●●

●●●

●●

●●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●●

●●●

●●

●●

●●

YMR141W−A RPS16A

−6−4−2

02 ●

●●

●●

●● ●

●●

●●

● ●●

●●

●●

●●●●●●

●●

●●

●●

●●

●● ●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●

●●●●

●●●

●●●●

●●

●●

●●

●●●

●●●

●●●

●●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●●●

●●●

●●

●●

●●

●●

●●

●●

RPL13B

551000 552000 553000

Chr 13a

−6−4−2

02

●●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●●

●●

●●

●●●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●●●●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●●●

●●

●●

●●

●●●●

●●

●●

●●●

ABP140 MET7

−6−4−2

02

●●

●●

●●

●●●

●●

●●●

●●●

●●●●

●●●

●●●

●●●●

●●●

● ●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●

●●●

●●

●●

●●●●

●●

●●●

●●

●●●●

●●

●●

●●

●●●

●●

●●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

SSP2

785000 786000 787000 788000 789000 790000

Chr 15b

−6−4−2

02

●●

●●

● ●

●●

●●●

●●

●●

●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●●●

●●●

●●●

●●

●●

●●●●

●●

●●

●●

●●

●●●●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●●●●●●

●●

●●●

●●

−6−4−2

02

●●

●●

●●

●●

●●

●●

●●●●

●●●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●●

●●

●●

●●

●●

●●●●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●●

●●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●●●

●●

●●

●●

●●●●●●

●●

●●●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●●

●●●

●●●●

●●

●●●●

●●●

●●●

●●

●●

●●

●●●

●●●

●●●

●●●●

●●●

●●

●●

●●

●●●

●●

●●

●●●

SER33 SPO22

221000 222000 223000 224000 225000 226000

Chr 9c

CDS

CDS dubious

TF binding site

Watson strand probe

Crick strand probe

Non−unique probe

Segment boundary

Fig. 2. Detailed views on segmentation results. 95%-confidence intervals are shown by vertical dashed lines. Note that the confidence intervals are calculatedin terms of data sampling points, not genomic coordinates. Hence, in cases where the data is unequivocal, the interval boundaries coincide with the change-point estimate itself, for example, see the 5’ end of SER33 in panel c). a) Spliced transcripts RPS16A and RPL13B. b) Complex transcript architecture ofMET7. c) Transcript antisense to SPO22. CDS refers to coding sequence; TF, transcription factor.

of these processes depend on the difference between the segmentmeans and on the standard deviation and serial correlation of theresiduals. Due to the limitations of floating-point arithmetic, the cor-rect numerical evaluation of this distribution function is not trivial.Zeileis et al (40) point out some of the caveats. The package tilin-gArray makes use of their implementation of the confidence intervalestimation in the R package strucchange (38; 39).

2.2.4 Model selection. The only user-defined parameter of theSCM model (1) is the number of segments S (1 ≤ S ≤ Smax). Inprinciple, it can be chosen by a penalized likelihood approach (25),which we now discuss. Assuming that the residuals εki in Equa-tion (1) are independent and identically Normal, the log-likelihoodis

log L = −N

2

„1 + log

2πG

N

«, (3)

where G is the sum of squared residuals from Equation (2), andN = nI is the number of data points. Since the class of modelswith parameter S − 1 is contained in that with parameter S, log Lis a monotonically increasing function of S.

To penalize model complexity, we can consider the Akaikeinformation criterion (AIC) and the Bayesian information criterion(BIC). They are defined, e. g. in (13), as AIC = −2 log L + 2p andBIC = −2 log L + p log N , where p is the number of parametersof the model. In our case, p = 2S, since for a segmentation withS segments, we estimate S − 1 change-points, S mean values andthe standard deviation of the εki. Hence the penalized likelihoodfunctions are

log LAIC = log L− 2S, (4)

log LBIC = log L− S log N. (5)

Since the probes on the array can overlap and some sourcesof noise are correlated for successive probes (for example, cross-hybridization or random fluctuations in the abundance of specifictarget fragments), the data will usually be serially correlated. Serialcorrelation is not a substantial problem for the point estimates ts andµs, and it is explicitly taken into account in Bai’s and Perron’s con-fidence intervals. However, it needs to be considered when makinginference based on the log-likelihoods (3-5).

2.3 DNA reference normalization2.3.1 The model. The fluorescence intensity values obtainedfrom an oligonucleotide microarray hybridization do not directlycorrespond to interpretable physical units. The same abundance of atarget transcript can result in systematically different values whenmeasured with different oligonucleotide probes. This is due to avariety of reasons, among them the different thermodynamic pro-perties of different polynucleotide sequences and biases in labelingefficiency.

The fluorescence intensity response of a probe k to the abun-dance xk of its target molecule in the sample can be modeled asyk = βk + αkxk, where yk is the intensity of probe k, βk isa term that represents unspecific (background) intensity, and αk

is a proportionality factor for the specific part of the signal (27).The specific part of the signal is (to reasonable approximation) pro-portional to the abundance xk of the target molecule, while theunspecific part corresponds to the background fluorescence that isobserved even in the absence of the intended target. Here we are notconcerned with stochastic measurement error (“noise”), for whichwe refer to (16). Furthermore, we assume that non-linear saturationeffects are negligible.

αk and βk are not usually known, and they can be differentfor each probe. This explains why even for xk = xj , in generalyk 6= yj . The goal of DNA reference normalization is to estimateparameters ak and bk such that

y′k =yk − bk

ak(6)

quantifies the target abundance in a way that is to sufficient appro-ximation independent of k, the probe identity.

2.3.2 Parameter estimation. We estimate ak by the geometricmean of the intensities from three replicate array hybridizations ofgenomic DNA. This procedure is motivated by the fact that the abun-dance of the target is the same for all probes that have a uniquematch to the genome. Note that we are excluding probes with mul-tiple matches to the genome. We are also not considering probeswithout any perfect matches in the genome and in particular, we areignoring the so-called mismatch (MM) probes.

3

Page 4: BIOINFORMATICS Pages 1–8users2.unimi.it/marray/2006/material/Lec8/tilingMethods.pdf · same yeast strain. RNA and DNA samples were hybridized in three replicates each. The data

Huber et al

We have no direct way to obtain a detailed estimate of bk for everyprobe, but we can assume that some of its probe to probe variabilitycan be explained through a functional dependence on ak. We use asan estimate bk = f(ak) with a smooth function f , which we obtainas follows. Probes are grouped into 10 strata corresponding to the10%, 20%, . . ., 100% quantiles of ak. Within each stratum we cal-culate the midpoint of the shorth of the intensities of those probeswhose target sequence is not annotated to be within a transcribedregion on either strand. The shorth of a univariate distribution isdefined as the shortest interval that contains at least half of the data,and its midpoint is a robust estimator of the location of the distri-bution. An estimate of the function f can be obtained from thesevalues by linear interpolation or smoothing.

2.3.3 Between array normalization. In order to deal with datafrom multiple arrays, we need to adjust for systematic variationsin the intensities between different arrays, which can be causedfor example by varying amounts of sample material. If we nowdenote y′k from Equation (6) by y′ki, with the index i countingthe different arrays, then we are looking for an affine transforma-tion y′′ki = (y′ki − di) /ci, where di is an array-specific offsetand ci a scaling factor. Microarray intensities are usually trans-formed to a logarithmic scale in order to make the distributionsof the stochastic noise components in the data more symmetricand more homogeneous (7; 9; 15; 17). Because for probes withweakly or unexpressed targets y′′ki can be close to zero or evennon-positive, we apply the so-called generalized logarithmic trans-formation zki = glog∆(y′′ki) = log2

“y′′ki +

py′′2ki + ∆2

”/2.

This transformation depends on a parameter ∆ which is related tothe size of the background noise.

The parameters ci, di, ∆ can be estimated from the data, and forthis we use the robustified maximum-likelihood method providedby the Bioconductor package vsn (15).

2.3.4 Exclusion of non-responding probes. We observed that acertain fraction of probes respond poorly to their target and are notinformative. We allow for the exclusion of such data. In particular,we discard the probes whose estimated ak is smaller than a user-defined quantile of the ak–distribution.

The DNA reference normalization method described in this sec-tion is implemented in the function normalizeByReference of thetilingArray package.

3 RESULTS AND DISCUSSION3.1 SCM segmentationThe main results of this paper are visualized in Figures 1 and 2,which show the application of SCM segmentation and DNA nor-malization to the yeast tiling array data. The segmentation clearlypicks up the major change-points in the data, many of which cor-respond to the beginning or end of annotated genes. In addition tothe change-point estimates, Figure 2 also includes 95%-confidenceintervals as described in Section 2.2.3.

3.1.1 Comparison to sliding windows. Previous high-densitytiling array studies used sliding window methods in combinationwith a thresholding criterion for the identification of transcripts(3; 19; 28; 30). In contrast to SCM, which optimizes a clearlydefined objective function, the sum of squares (2), sliding windowmethods are defined algorithmically. One of the main problems with

0 10 20 30 40

−1

01

23

4

Sig

nal

a

● ● ●●

● ●

●● ● ● ● ●

●● ●

● ●● ●

● ● ● ● ● ● ●●

●●

● ● ● ● ●●

●●

● ● ● ● ● ●●

● ●

0 10 20 30 40

−1

01

23

4

Sig

nal

b

●● ● ● ●

● ●● ●

●●

● ● ● ● ● ● ●●

● ● ● ● ● ● ● ● ● ● ●● ●

●●

●● ●

● ● ● ●●

● ●● ●

SCM segmentationsliding average thr.

0 10 20 30 40−

10

12

34

Position

Sig

nal

c

● ● ● ●●

● ● ●● ● ● ●

●● ● ●

● ●●

● ● ●●

● ● ● ● ●●

● ●● ● ●

●●

● ● ●●

●● ● ● ● ● ● ●

Fig. 3. Comparison between SCM segmentation and sliding average thres-holding (SAT). a) The dots correspond to simulated data for a weaklyexpressed transcript starting at position 13 and ending at 36. The verticalsolid red lines show the change-points found by SCM segmentation. Theblue line shows a sliding average (window width 11). The vertical dashedblue lines show the change-points found by thresholding the sliding averageat a threshold of y = 1 (horizontal dotted line). The size of the transcript isunderestimated. b) As in panel a), for a moderately expressed transcript. Thechange-points from SCM segmentation and SAT coincide. c) As in panel a),for a strongly expressed transcript. The size of the transcript is overestimatedby SAT. SCM segmentation produces unbiased estimates in all cases.

sliding window approaches is shown in Figure 3. Such methods tendto produce biased estimates of the start and end points of transcribedregions, depending on the level of signal above background (13).

3.1.2 Model selection. SCM segmentation has one parameter,the number of segments S, which controls the model complexity.The data can be fit better by increasing S, and this will decreasethe number of missed, real change-points (false negatives) for thecost of increasing the number biologically irrelevant change-points(false positives). In Section 2.2.4 we have described a standard pena-lized likelihood approach. A potential method of choosing S wouldbe to use the value that maximizes a suitably penalized likelihoodfunction. Figure 4 shows a plot of the log-likelihood as a functionof the parameter S, together with two possible choices of penali-zed log-likelihoods according to the Akaike information criterion(AIC) and the Bayes information criterion (BIC). While in parti-cular LBIC works well on the simulated data, both LAIC and LBICwould choose substantially higher values for S than that which wedecided upon for the analysis presented in (6) based on comparisonof the segmentation with biological expectations.

4

Page 5: BIOINFORMATICS Pages 1–8users2.unimi.it/marray/2006/material/Lec8/tilingMethods.pdf · same yeast strain. RNA and DNA samples were hybridized in three replicates each. The data

Transcript mapping with tiling arrays

a b

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

0 10 20 30 40 50

−10

00−

900

−80

0−

700

S

(pen

aliz

ed)

log

likel

ihoo

d

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●

●●

●●●●

●●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

log Llog L~AIC

log L~BIC

0 50 100 150 200 250 300

−59

000

−57

000

−55

000

S

(pen

aliz

ed)

log

likel

ihoo

d

log Llog L~AIC

log L~BIC

Fig. 4. Model selection. The plots show the log-likelihood (red) and pena-lized log-likelihoods according to the Akaike information criterion (AIC;blue) and the Bayes information criterion (BIC; green) as functions of theparameter S. a) Simulated data according to model (1) with independentNormal εki and S = 22. The vertical dashed green line is at the maximumof log LBIC and correctly identifies the true value of S. b) Tiling array datafrom the Watson (+) strand of chromosome 1. The vertical dashed grey lineat S = 153 corresponds to the parameter value that was used for Figures 1and 2. The vertical dashed green line at S = 235 is at the maximum oflog LBIC.

We hypothesize that this discrepancy may be the consequence ofthe model of Equation (1) being too simple, in two ways. First,there are biological phenomena that lead to more complex hybri-dization profiles than the piecewise constant shape assumed by themodel. Second, the residuals εki are in practice not independent, asdiscussed in Section 2.2.4. While the model is evidently useful toestimate meaningful change-points and confidence intervals, whenS is given, it might not be powerful enough to also let us infer S.

We recommend the following strategy. Since the algorithmproduces not just the optimal segmentation for a given numberSmax of segments, but also all optimal segmentations with S =2, 3, . . . , Smax − 1, a practical approach is to do the computa-tion with a choice of Smax that is comfortably too large. Theresults can be visualized for different values of S using the visua-lization tool provided in the tilingArray package. This tool wasalso used to create Figures 1 and 2 and the online supplementof (6) (http://www.ebi.ac.uk/huber-srv/queryGene). By examiningthe results in control regions where one has clear expectations aboutthe transcript structures, it is possible to identify an S that has adesirable trade-off between sensitivity and specificity, and to gainconfidence in the algorithm’s results in lesser known regions of thetranscriptome. Often it is reasonable to expect that the segmentlength distribution should be approximately the same on differentchromosomes, hence the choice of S is equivalent to the choice ofthe average segment length LS, with S being the integer closest toLC/LS and LC the length of the region to be segmented, typicallya chromosome. In (6), this procedure let us choose LS = 1500uniformly for all chromosomes.

3.2 Normalization3.2.1 Visual assessment. Figure 5 shows scatterplots of differenttypes of signal along genomic coordinates. Each dot corresponds

to a microarray feature. The intensities from a hybridization ofgenomic DNA are shown in Figure 5a. The y–axis is on the loga-rithmic scale to base 2. Ideally, all features should show the sameintensity, since the copy number of genomic DNA is the samethroughout. Some of the variation in Figure 5a can be explainedby stochastic noise, but the larger part of it is systematic and isdue to sequence-specific properties of either the probes or the tar-get DNA (24; 36; 41). The y-coordinate of each dot is also encodedusing a pseudo-color scheme. Red corresponds to features that havea weak response, blue to those with the strongest response. Thesame coloring for each feature is also used in panels 5b-f.

Figure 5b shows the intensites resulting from hybridization ofRNA, again on a logarithmic scale to base 2. One can clearly distin-guish between transcribed regions, corresponding to the annotatedgenes RPN2 and SER33 and background intergenic regions. Howe-ver, the signal appears rather noisy, with many individual featuresthat map into the transcribed region showing weak signal, and alarge spread of values even in the background region. Notably, thisvariation is not random, as can be seen from the coloring of the dots:to a large part, it can be explained by the probe response as enco-ded by the color. This motivates the use of the DNA intensities foradjusting the probe sequence related signal variation.

Figure 5c shows the result of dividing the RNA-signal by theDNA-signal, then taking the logarithm to base 2. Since the overallscaling is arbitrary, we have shifted the data in panels 5c-f such thatthe 5% quantile is at 0. While the distribution of the data within thetranscribed regions is now much tighter, there is still considerablevariability in the background region. Remarkably, this backgroundvariability is not random, one can see a pattern that correlates withthe coloring of the dots. This motivates a probe specific backgroundcorrection that again employs the DNA intensities from panel 5a.

Figure 5d shows the zki values resulting from the DNA referencenormalization. While the spread of the data in the background regionis not substantially different compared to panel 5c, we note twoimportant aspects: the distribution of the noise in the backgroundis now more symmetric, and, more importantly, the difference bet-ween the mean signal in the background regions and the transcribedregions is increased. Background correction does not reduce thevariance, but it increases the dynamic range and hence the sensitivityto detect weak signals (17).

Figure 5e shows the same data as in panel 5d, but with the 5%of features that had the weakest signal in the DNA hybridizationremoved, as described in Section 2.3.4. This removes many of theoutliers at little cost of good data.

For comparison, Figure 5f shows the result of a normalization thatis similar to 5e, with the only difference that for the estimation ofthe background parameters bk the intensity of the mismatch (MM)probe that is paired with each perfect match (PM) probe is usedinstead of the interpolated values from background-level PM probes.Using paired MM probes for background correction can increase thedynamic range of the signal, but one of the main limitations of thisapproach is that it also greatly increases the signal’s variance, whichcan lead to a net increase of mean squared error.

3.2.2 Quantitative assessment. In order to assess quantitativelythe results of the different procedures 5b–f, we consider asignal/noise ratio. We look at a set of control regions, two positivecontrol regions within the ORFs of RPN2 and SER33 at coor-dinates 217860—220697 and 221078—222487 and two negative

5

Page 6: BIOINFORMATICS Pages 1–8users2.unimi.it/marray/2006/material/Lec8/tilingMethods.pdf · same yeast strain. RNA and DNA samples were hybridized in three replicates each. The data

Huber et al

a

6

8

10

12

14

b

6

8

10

12

c

−2

0

2

4

6

8

d

−2

0

2

4

6

8

e

−2

0

2

4

6

8

f

−2

0

2

4

6

8

RPN2 SER33217kB 223kB 224kB 225kB 226kB

Fig. 5. Along-chromosome plot of array intensities from hybridizationof DNA (a), RNA (b), RNA values divided by DNA values (c), withbackground subtraction with (d) and without non-responding probes (e),alternative background correction by paired MM probes (f).

control regions in the background at coordinates 216800—217700and 222800—227000. The assumption is that the signal within aregion should be constant and deviations from that are noise, whilethe difference between positive and negative controls should belarge and is counted as signal. Noise σ is calculated as the averageof the differences between 97.5% and 2.5% quantiles of the datawithin each of the control regions. The range between the 97.5%and 2.5% quantiles contains 95% of the data.

σ =1

Q0.975N −Q0.025

N

·

Xr∈pos,neg

Q0.975r −Q0.025

r

|pos|+ |neg| . (7)

Here, the symbol r counts over the different regions. The constant inthe denominator is the differences between 97.5% and 2.5% quanti-les for the standard Normal distribution, hence σ is equal to 1 if thedata come from the standard Normal distribution.

Signal ∆µ is calculated as the difference between the averagesof positive and negative control regions, ∆µ =

Pr∈pos µr/|pos| −P

r∈neg µr/|neg|.The result is shown in Table 1. We have explored many variations

of this calculation, using different definitions of σ, ∆µ, and of the

method b c d e f∆µ/σ 3.22 3.47 4.04 4.58 4.36

Table 1. Signal to noise ratio ∆µ/σ of different data processing methodsb–f as described in Section 3.2 and Figure 5.

control regions. The ranking of the methods was always the same asshown in Table 1. The data and the code for the calculations that pro-duce Figure 5 and Table 1 are available in the online documentationof the package tilingArray.

4 CONCLUSIONThe adjustment of probe sequence related signal variation is a fun-damental problem in the analysis of oligo-nucleotide microarraydata (14; 17; 21; 24; 36; 41). Current methods rely on pre-specifiedgroupings of probes each matching the same transcript, so-calledprobe sets, and their focus has been on the accurate detection ofdifferential expression between different conditions rather than thedetection of internal structure within the probe set. The advent ofhigh-density tiling arrays enables us to observe transcript archi-tecture, including transcription start and stop sites, splicing anddegradation and possibly their differential regulation. A prerequisiteis the quantitative comparability, at least approximately, of microar-ray signals from different regions of one transcript. The DNAreference normalization of Section 2.3 responds to this requirement.

Our method does not employ the data from the paired mismatch(MM) probes. The manufacturer’s intention for these features is toserve as controls for unspecific background hybridization. However,as we have shown in Figure 5 and Table 1, a much smaller set ofcontrol probes is sufficient and even produces slightly better results.In particular, we show that the background component of a probe’ssignal can be estimated from a control population of similar probesthat may perfectly match to the genome, but whose target does notappear to be transcribed. Since we use a robust estimation technique,it does not matter if this distribution contains a minority of probeswith specific, non-background signal. We can save half of the realestate and about half of the cost of a chip at practically no loss forthe analysis.

For the estimation of the probe-specific scaling and backgroundparameters, we have opted to characterize each probe by a singlevalue empirically obtained from the DNA reference hybridization.In addition, or indeed alternatively, one could try to use the probesequence information to build a regression model on sequence-related variables for the background and scaling factor of eachprobe (22; 24; 36; 41). Such a model can collect strength by smoo-thing across probes that are similar in sequence space and couldalso employ information on unspecific hybridization that is providedby so-called “antigenomic” probes on recent Affymetrix GeneChipdesigns. Our attempts at such a model with the present data have notproduced results that were better than the DNA reference normali-zation described in Section 2.3, but clearly there is room for moreresearch.

We have described a simple structural change model (SCM) forRNA hybridization profiles along the genome, namely a piecewise-constant function. This model lends itself to an efficient dynamicprogramming algorithm for optimally estimating the change-point

6

Page 7: BIOINFORMATICS Pages 1–8users2.unimi.it/marray/2006/material/Lec8/tilingMethods.pdf · same yeast strain. RNA and DNA samples were hybridized in three replicates each. The data

Transcript mapping with tiling arrays

positions, which together with the segment levels are of primary bio-logical interest. In addition to the change-point positions, the theoryof SCMs also provides estimators for their confidence intervals.Figure 2 shows how the calculated confidence intervals adequatelyreflect the uncertainty in the change-point position depending on thesteepness and height of the change and the noise level. The confi-dence intervals are useful for the ranking and interpretation of thefitted change-points.

SCMs also allow the modeling of general linear relationships bet-ween a possibly vector-valued regressor along a linear coordinateand a possibly vector-valued dependent variable (2; 38; 39). Amongthe uses for such a generalized approach could be, for example,the modeling of decaying flanks (4). Remember that a linear decayon the logarithmic data scale corresponds to an exponential decayon the fluorescence scale, which is often a good approximation formany naturally occurring length distributions.

The methods that we have presented here were successfully usedin (6) to identify the boundary, structure and level of coding andnon-coding transcripts of yeast. Apart from expected transcripts,this study found operon-like transcripts, transcripts from neigh-boring genes not separated by intergenic regions and genes withcomplex transcriptional architecture. It mapped the positions of 3’and 5’ UTRs of coding genes and identified hundreds of RNA tran-scripts both antisense to, and distinct from, annotated genes. Themethods presented here, DNA reference normalization and SCMsegmentation, were instrumental for the analysis, by providing aclean normalized signal and accurate transcript boundary identifi-cations. We expect that the methods will also be useful in the studyof transcriptional complexity under dynamic conditions and in otherorganisms. With suitable adaption they should also be valuable forthe application of tiling arrays in the detection of genomic regi-ons purified in chromatin-immunoprecipitation experiments. Dueto their ability to interrogate the entire genome we expect tilingmicroarrays soon to become a widely used tool.

ACKNOWLEDGMENTSWe thank Lior David for fruitful discussions regarding the DNAreference normalization, Achim Zeileis for insightful commentsabout the estimation of confidence intervals and his softwarepackage strucchange and Richard Bourgon for helpful commentson the manuscript. We thank the contributors to the Bioconduc-tor (www.bioconductor.org) and R (www.R-project.org) projects fortheir software. This work has been supported by the Deutsche For-schungsgemeinschaft and the National Institutes of Health (L.M.S.)and by the European Community’s Sixth Framework Programmecontract (HeartRepair) LSHM-CT- 2005-018630 (J.T.).

REFERENCES[1]J. Bai and P. Perron. Estimating and testing linear models with

multiple structural changes. Econometrica, 66:47–78, 1998.[2]J. Bai and P. Perron. Computation and analysis of multiple

structural change models. J. of Applied Econometrics, 18:1–22,2003.

[3]P. Bertone, V. Stolc, T.E. Royce et al. Global identifica-tion of human transcribed sequences with genome tiling arrays.Science, 306:2242–2246, 2004.

[4]R. Bourgon and T.P. Speed. A model for chromatin immuno-precipitation / high density tiling array experiments: implicati-ons for data analysis. In Profiling Transcriptional Activity withPromoter and CpG Microarrays., DNA Press, Eagleville, PA,2006.

[5]J.S. Carroll, X.S. Liu, A.S. Brodsky et al. Chromosome-widemapping of estrogen receptor binding reveals long-range regu-lation requiring the forkhead protein FoxA1. Cell, 122:33–43,2005.

[6]L. David, W. Huber, M. Granovskaia et al. A high-resolutionmap of transcription in the yeast genome. PNAS, 103:5320–5325, 2006.

[7]S. Dudoit, Y.H. Yang, T.P. Speed, and M.J. Callow. Statisti-cal methods for identifying differentially expressed genes inreplicated cDNA microarray experiments. Statistica Sinica,12:111–139, 2002.

[8]R. Durbin, S.R. Eddy, A. Krogh, and G. Mitchison. BiologicalSequence Analysis. Cambridge University Press, 1998.

[9]B.P. Durbin, J.S. Hardin, D.M. Hawkins, and D.M. Rocke.A variance-stabilizing transformation for gene-expressionmicroarray data. Bioinformatics, 18:S105–S110, 2002.

[10]A.-V. Gendrel, Z. Lippman, R. Martienssen, and V. Colot. Profi-ling histone modification patterns in plants using genomic tilingmicroarrays. Nature Methods, 2:213–218, 2005.

[11]R.C. Gentleman, V.J. Carey, D.J. Bates et al. Bioconduc-tor: Open software development for computational biology andbioinformatics. Genome Biology, 5:R80, 2004.

[12]D. Gresham, D.M. Ruderfer, S.C. Pratt et al Genome-widedetection of polymorphisms at nucleotide resolution with asingle DNA microarray. Science, 311:1932–1936, 2006.

[13]T. Hastie, R. Tibshirani, and J. Friedman. The Elements ofStatistical Learning: Data Mining, Inference, and Prediction.Springer-Verlag, New York, 2001.

[14]E. Hubbell. PLIER White Paper. Affymetrix, 2005.[15]W. Huber, A. von Heydebreck, H. Sultmann et al. Variance

stabilization applied to microarray data calibration and tothe quantification of differential expression. Bioinformatics,18:S96–S104, 2002.

[16]W. Huber, A. von Heydebreck, and M. Vingron. Error modelsfor microarray intensities. In Encyclopedia of Genetics, Geno-mics, Proteomics and Bioinformatics., New York, 2004. Wiley.

[17]R.A. Irizarry, B. Hobbs, F. Collin et al. Exploration, normaliza-tion, and summaries of high density oligonucleotide array probelevel data. Biostatistics, 4:249–64, 2003.

[18]J.M. Johnson, S. Edwards, D. Shoemaker, and E.E. Schadt.Dark matter in the genome: evidence of widespread trans-cription detected by microarray tiling experiments. Trends inGenetics, 21:93–102, 2005.

[19]D. Kampa, J. Cheng, P. Kapranov et al. Novel RNAs identi-fied from an in-depth analysis of the transcriptome of humanchromosomes 21 and 22. Genome Res, 14:331–342, 2004.

[20]P. Kapranov, S.E. Cawley, J. Drenkow et al. Large-scaletranscriptional activity in chromosomes 21 and 22. Science,296:916–919, 2002.

[21]C. Li and W.H. Wong. Model-based analysis of oligonucleo-tide arrays: Expression index computation and outlier detection.PNAS, 98:31–36, 2001.

[22]W. Li and X.S. Liu. MAT: Model-based Analysis of Tiling-array. Liu Lab, Harvard School of Public Health, Dana-Farber

7

Page 8: BIOINFORMATICS Pages 1–8users2.unimi.it/marray/2006/material/Lec8/tilingMethods.pdf · same yeast strain. RNA and DNA samples were hybridized in three replicates each. The data

Huber et al

Cancer Institute, 2006. http://chip.dfci.harvard.edu/∼wli/MAT.[23]T.C. Mockler, S. Chan, A. Sundaresan et al. Applications of

DNA tiling arrays for whole-genome analysis. Genomics, 85:1–15, 2005.

[24]F. Naef and M.O. Magnasco. Solving the riddle of the brightmismatches: labeling and effective binding in oligonucleotidearrays. Phys. Rev. E, 68(1 Pt 1):011906, 2003.

[25]F. Picard, S. Robin, M. Lavielle et al. A statistical approachfor array CGH data analysis. BMC Bioinformatics, 6(1):27, Feb2005.

[26]L.R. Rabiner. A tutorial on hidden Markov models and selectedapplications in speech recognition. Proceedings of the IEEE,77:257–286, 1989.

[27]D.M. Rocke and B.P. Durbin. A model for measurement errorfor gene expression arrays. Journal of Computational Biology,8:557–569, 2001.

[28]T.E. Royce, J.S. Rozowsky, P. Bertone et al. Issues in the analy-sis of oligonucleotide tiling microarrays for transcript mapping.Trends Genet, 21:466–75, 2005.

[29]M.P. Samanta, W. Tongprasit, H. Sethi et al. Global identi-fication of noncoding RNAs in Saccharomyces cerevisiae bymodulating an essential RNA processing pathway. PNAS,103:4192–4197, 2006.

[30]E.E. Schadt, S.W. Edwards, D. GuhaThakurta et al. A com-prehensive transcript index of the human genome generatedusing microarrays and computational approaches. Genome Biol,5(10):R73, 2004.

[31]D.W. Selinger, K.J. Cheung, R. Mei et al. RNA expressionanalysis using a 30 base pair resolution Escherichia coli genomearray. Nature Biotechnology, 18:1262–1268, 2000.

[32]D.D. Shoemaker, E.E. Schadt et al. Experimental annotationof the human genome using microarray technology. Nature,

409:922–927, 2001.[33]V. Stolc, Z. Gauhar et al. A gene expression map for the

euchromatic genome of drosophila melanogaster. Science,306:655–60, 2004.

[34]L.V. Sun, L. Chen, F. Greil et al. Protein-DNA interactionmapping using genomic tiling path microarrays in Drosophila.PNAS, 100:9428–9433, 2003.

[35]B. Tjaden, D.R. Haynor, S. Stolyarl et al. Identifying operonsand untranslated regions of transcripts using Escherichia coliRNA expression analysis. Bioinformatics, 18 Suppl 1:337–344,2002.

[36]Z. Wu, R.A. Irizarry, R.C.Gentleman et al. A model basedbackground adjustment for oligonucleotide expression arrays.Journal of the American Statistical Association, 99:909–917,2004.

[37]K. Yamada, J. Lim, J.M. Dale et al. Empirical analysis oftranscriptional activity in the Arabidopsis genome. Science,302:842–846, 2003.

[38]A. Zeileis, F. Leisch, K. Hornik, and C. Kleiber. struc-change: An R package for testing for structural change in linearregression models. Journal of Statistical Software, 7:1–38,2002.

[39]A. Zeileis, C. Kleiber, W. Kramer, and K. Hornik. Testingand dating of structural changes in practice. ComputationalStatistics & Data Analysis, 44:109–123, 2003.

[40]A. Zeileis and C. Kleiber. Validating multiple structural changemodels – a case study. Journal of Applied Econometrics,20:685–690, 2005.

[41]L. Zhang, M.F. Miles, and K.D. Aldape. A model of molecu-lar interactions on short oligonucleotide microarrays. NatureBiotechnology, 21:818–821, 2003.

8