
Transcription factors Transcription Factor Analysis


Gene regulation: transcription factor analysis

Hector Corrada Bravo
CMSC858B, Spring 2012
University of Maryland

Slides courtesy of: Rafael A. Irizarry, Benjamin Langmead, Curtis Huttenhower

Gene regulation

• We have studied how to measure and analyze gene expression
    • differential expression, clustering, prediction
• Now we concentrate on analysis of the regulation of gene expression
    • Transcription factors: proteins that interact with DNA to regulate expression
    • Epigenetics: heritable DNA modifications involved in expression regulation

Transcription factors

[Figure: transcription factor example]

Transcription Factor Analysis

• We want to know where in the genome transcription factors are bound in a given sample (say human liver)
    • ChIP-chip and ChIP-seq
• We want to know what kinds of sequences transcription factors may bind to
    • Motif analysis


Transcription Factor Analysis

The challenges:
1. Complexity: hundreds of cooperating factors
2. Cis-regulation can extend over long stretches
3. Trans-regulation involves long-distance interactions
4. Many-to-many relationship between TFs and genes

Let’s start at the beginning

Physical binding of transcription factors: given a sample and a target protein (TF), find the genomic locations where the protein is bound to DNA

Chromatin Immunoprecipitation (ChIP)

Supplementary Figure 1. ChIP-chip and ChIP-seq. Both technologies start by preparing a ChIP sample enriched in protein bound DNAs. The ChIP sample will either be hybridized to microarrays that contain probes interrogating the whole genome (ChIP-chip), or be sequenced from both ends to generate millions of short reads using ultra high throughput sequencing (ChIP-seq). To eliminate unknown bias that may arise during sample preparation, hybridization or sequencing procedures, people often also include one or more control samples (e.g., Input or mock IP) in the experiments.

Nature Biotechnology: doi:10.1038/nbt.1505

ChIP-seq

[Figure: ChIP-seq workflow in three steps. Align: short reads (for example GTCGCAGTATCTGTCT) are aligned to the reference sequence, tolerating mismatches, gaps, and N bases. Aggregate: aligned reads are piled up along the reference. Statistics: a test on the pile-up identifies where binding occurs (example p-value: 0.0023).]
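The align and aggregate steps in the figure can be sketched in a few lines. This is a minimal illustration, not how production aligners work: it scans every reference position and counts mismatches (a brute-force Hamming scan), and the function names `best_alignment` and `coverage` are hypothetical. The reference string is a short excerpt of the sequence shown on the slide.

```python
def best_alignment(read, reference, max_mismatches):
    """Return (position, mismatches) of the best ungapped placement, or None.

    Brute-force scan: tries every start position and keeps the one with the
    fewest mismatches (ties go to the leftmost position).
    """
    best = None
    for start in range(len(reference) - len(read) + 1):
        segment = reference[start:start + len(read)]
        n_mismatch = sum(1 for a, b in zip(read, segment) if a != b)
        if n_mismatch <= max_mismatches and (best is None or n_mismatch < best[1]):
            best = (start, n_mismatch)
    return best


def coverage(starts, read_length, reference_length):
    """Aggregate step: number of aligned reads covering each reference position."""
    depth = [0] * reference_length
    for start in starts:
        for pos in range(start, start + read_length):
            depth[pos] += 1
    return depth


# Excerpt of the reference and two reads taken from the slide.
reference = "GCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCC"
reads = ["GTCGCAGTATCTGTCT", "TATATCGCAGTATCTG"]

hits = [best_alignment(r, reference, max_mismatches=1) for r in reads]
print(hits)  # [(15, 0), (12, 1)]
depth = coverage([h[0] for h in hits if h is not None], 16, len(reference))
```

The second read aligns with one mismatch, which is the kind of difference the alignment step has to tolerate.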


ChIP-seq analysis

• The goal here is peak detection:
    • Treat the number of mapped reads as a smooth function along the genome
    • Find where there are peaks of mapped reads
    • These peaks indicate binding locations for the target transcription factor
• Two questions:
    • 1) When is a peak big enough?
    • 2) Does this depend on properties of the genomic region? This is important!
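To make the peak-detection idea concrete, here is a small sketch that bins mapped read positions into windows, smooths the counts with a moving average, and reports local maxima above a threshold. The smoothing method, the threshold value, and all function names are assumptions made for illustration. Choosing the threshold is exactly the first question above, and this sketch does not answer it.

```python
def window_counts(read_starts, genome_length, width):
    """Count mapped read start positions in non-overlapping windows."""
    n_windows = (genome_length + width - 1) // width
    counts = [0] * n_windows
    for pos in read_starts:
        counts[pos // width] += 1
    return counts


def moving_average(values, half_span):
    """Smooth counts with a centered moving average, truncated at the ends."""
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half_span)
        hi = min(len(values), i + half_span + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def local_maxima(values, threshold):
    """Indices whose value reaches `threshold` and is a local maximum."""
    peaks = []
    for i, v in enumerate(values):
        left = values[i - 1] if i > 0 else float("-inf")
        right = values[i + 1] if i + 1 < len(values) else float("-inf")
        if v >= threshold and v >= left and v > right:
            peaks.append(i)
    return peaks


# Toy window counts with one clear pile-up of reads around window 4.
counts = [1, 0, 2, 6, 9, 5, 1, 0, 1, 0]
print(local_maxima(moving_average(counts, 1), threshold=3.0))  # [4]
```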

Genomic bias


Mappability: how repetitive is a genomic region (given alignment policy and read length)

GC content: the proportion of C’s and G’s in a given genomic region (depends on read length)

[Chung et al., PLOS CompBio, 2011]
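Both quantities are easy to compute on a toy sequence. The sketch below assumes the simplest possible alignment policy (exact matches on the forward strand only) and uses hypothetical function names; it is meant to show why both measures depend on read length, not to reproduce the computation in Chung et al.

```python
from collections import Counter


def gc_content(seq):
    """Proportion of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def kmer_uniqueness(genome, k):
    """For each start position, 1 if the k-mer there occurs once in `genome`, else 0.

    Assumption: exact matching on the forward strand only.
    """
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    occurrences = Counter(kmers)
    return [1 if occurrences[kmer] == 1 else 0 for kmer in kmers]


genome = "ACGTACGTTT"
print(kmer_uniqueness(genome, 4))  # [0, 1, 1, 1, 0, 1, 1]
print(kmer_uniqueness(genome, 6))  # [1, 1, 1, 1, 1]
print(gc_content("GGCC"))          # 1.0
```

With 4-mers, the repeated ACGT makes two positions unmappable; with 6-mers every position becomes unique, which is why mappability is always stated for a given read length.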

Peak-finding (not accounting for bias)

• Segment the genome into non-overlapping windows of width $w$ (100 bp).
• $k_{1i}$: number of reads from the ChIP channel in window $i$
• $k_{2i}$: number of reads from the control channel in window $i$
• $n_i = k_{1i} + k_{2i}$: total number of reads in window $i$
• The quantity of interest is the sampling ratio between the ChIP sample and the negative control in non-bound regions.
• Idea: use regions with a small number of reads to estimate it as

$$r_0 = \frac{\sum_i k_{1i}}{\sum_i k_{2i}}$$

• Group windows according to total number of reads ($n_i$)
• For a given total number of reads $n = 0, 1, 2, \ldots$, the number of reads in the ChIP channel in unbound regions is assumed to follow a binomial distribution:

$$\mathrm{Bin}\left(n,\; p_0 = \frac{r_0}{1 + r_0}\right)$$

• The distribution of observed counts $k_{1i}$ can then be compared to the expected distribution, and a false discovery rate (FDR) can be computed.
• Windows where $n_i$ is large enough and the FDR is small enough are then determined to be peaks.
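A minimal sketch of this null model follows. It estimates the sampling ratio from low-count windows and then scores each window with a binomial upper-tail probability. Two things here are assumptions rather than part of the slides: the cutoff `max_total` that defines a low-count window, and the use of a per-window p-value as a simpler stand-in for the FDR comparison of observed and expected count distributions.

```python
from math import comb


def estimate_r0(chip, control, max_total):
    """Ratio of summed ChIP reads to summed control reads over low-count windows.

    Assumes the selected windows contain at least one control read.
    """
    low = [(k1, k2) for k1, k2 in zip(chip, control) if k1 + k2 <= max_total]
    return sum(k1 for k1, _ in low) / sum(k2 for _, k2 in low)


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Bin(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def window_p_values(chip, control, r0):
    """Upper-tail probability of each window's ChIP count under Bin(n_i, p0)."""
    p0 = r0 / (1 + r0)
    return [binomial_upper_tail(k1, k1 + k2, p0) for k1, k2 in zip(chip, control)]


# Toy data: window 4 has many more ChIP reads than control reads.
chip = [1, 2, 1, 0, 12, 1]
control = [1, 2, 1, 2, 2, 1]
r0 = estimate_r0(chip, control, max_total=4)
p_values = window_p_values(chip, control, r0)
most_enriched = min(range(len(p_values)), key=lambda i: p_values[i])
print(most_enriched)  # 4
```

Because $p_0$ is a single genome-wide value, every window is judged against the same expected ChIP fraction, whatever its sequence properties.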


Peak-finding

• But with mappability and GC-content bias, this model fails. Where?
    • The model for unbound bins assumes the sampling ratio is uniform across the genome...
• One proposal (MOSAiCS [Chung et al., PLOS CompBio, 2011]):
    • Use a model for unbound bins that includes GC content and mappability

for a fixed number of iterations (200 in this study). The fraction of a read allocated for each alignment in the final iteration was used as the probability of that alignment being correct. In this allocation procedure, the choice of $w$ controls the degree at which multi-read allocation is affected by uni-reads. Therefore, setting $2w + 1 \approx L$, where $L$ is the fragment size (200 bp for both STAT1 and GATA1 datasets), ensures that uni-reads and multi-reads within a given bin correspond to the same binding event. We considered setting $w$ to 25, 50, and 100 bp, respectively, and observed that, although there is high overlap among the peak sets obtained with different $w$ (>70% overlap), $w = 100$ captures the largest number of true positives and smallest number of false positives in our validation set.

For efficient storage and retrieval of the number of reads mapped to each position in the genome, a binary tree structure, similar to that of a Fenwick tree [58] was used. This structure allows for $O(\log n)$ time updates of read counts and $O(\log n)$ time computations of the cumulative sum of counts within any interval, where $n$ is the total number of genomic positions for which counts are recorded. For space efficiency, only genomic positions at which at least one read alignment started are stored in the tree. The multi-read allocation algorithm is available at http://www.stat.wisc.edu/~keles/Software/multi-reads/ in the form of a C++ program.
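For readers unfamiliar with the data structure, a textbook Fenwick tree (binary indexed tree) gives the two logarithmic-time operations described above. This sketch is not the authors' C++ implementation: it stores a dense array over all positions rather than only positions where a read starts, and the class and method names are hypothetical. Counts are floats because a multi-read may contribute a fraction of a read to a position.

```python
class FenwickTree:
    """Dense Fenwick tree over positions 0..size-1 holding (fractional) read counts."""

    def __init__(self, size):
        self.size = size
        self.tree = [0.0] * (size + 1)  # internal indexing is 1-based

    def add(self, index, amount):
        """Add `amount` to the count at 0-based position `index` in O(log n)."""
        i = index + 1
        while i <= self.size:
            self.tree[i] += amount
            i += i & -i  # move to the next node that covers this position

    def prefix_sum(self, index):
        """Sum of counts at positions 0..index inclusive in O(log n)."""
        i = index + 1
        total = 0.0
        while i > 0:
            total += self.tree[i]
            i -= i & -i  # strip the lowest set bit
        return total

    def range_sum(self, start, end):
        """Sum of counts at positions start..end inclusive."""
        if start > end:
            return 0.0
        return self.prefix_sum(end) - self.prefix_sum(start - 1)


counts = FenwickTree(10)
counts.add(2, 1.0)   # a uni-read starting at position 2
counts.add(2, 0.5)   # half of a multi-read allocated to the same position
counts.add(7, 1.0)
print(counts.range_sum(0, 9))  # 2.5
```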

Mappability for 36 mer and 75 mer SETs

Consider a tag length of $k$ and a fragment length of $L$. Let $x_{(i):(i+k-1)}$ denote the k-mer starting at position $i$ and ending at position $i+k-1$ from 5′ to 3′. Let $x^{c}_{(i):(i-k+1)}$ denote the k-mer starting at position $i$ and ending in $i-k+1$ in the other strand. Nucleotide level mappability is defined as in [10]:

$$\delta_i = \begin{cases} 1 & \text{if } x_{(i):(i+k-1)} \text{ is unique,} \\ 0 & \text{o.w.} \end{cases}$$

We can similarly define mappability for a position in the reverse strand as:

$$\delta^{c}_i = \begin{cases} 1 & \text{if } x^{c}_{(i):(i-k+1)} \text{ is unique,} \\ 0 & \text{o.w.} \end{cases}$$

Note that $\delta^{c}_i = \delta_{i-k+1}$. In the pre-processing of ChIP-seq data, as a result of fragment extension, the total number of observed counts at position $i$ could be contributed by forward strand tags that originate between positions $i-L+1$ and $i$ or reverse strand tags that originate between positions $i$ and $i+L-1$. Therefore, we modify the definition of mappability at position $i$ as follows for single-end tags:

$$\delta^{*}_i = \frac{1}{2L}\left(\sum_{j=i-L+1}^{i} \delta_j + \sum_{j=i}^{i+L-1} \delta^{c}_j\right) \qquad (1)$$

$$\phantom{\delta^{*}_i} = \frac{1}{2L}\left(\sum_{j=i-L+1}^{i} \delta_j + \sum_{j=i-k+1}^{i+L-k} \delta_j\right) \quad \text{(Def 1)}. \qquad (2)$$

Mappability plots in Figure 1A, Figure S1a, and Figure S2a (middle panels) are based on the above definition of mappability to highlight the effect of multi-reads on low mappable regions and allow direct comparison with uni-reads.

When we are utilizing multi-reads in the actual statistical analysis, the definition of nucleotide level mappability is modified to take into account the fact that k-mers mapping to less than 100 locations can generate non-zero counts. This can be achieved by defining $\delta_i = I(R_i \le 100)/R_i$ where $R_i$ is the number of times k-mer $x_{(i):(i+k-1)}$ occurs in the genome. We will refer to bin-level mappability utilizing this $\delta_i$ as Def 2. Figures S1a and S2a (right panels) illustrate that the mappability bias is also apparent for the MR samples with this modification on mappability to take multi-reads into account.

Finally, the mappability score $M_j$ for bin $j$ is the average of mappabilities ($\delta^{*}_i$) of positions that are within this bin. The GC content at the bin level is calculated similarly by changing $\delta_i$ to be GC content of the k-mer starting at position $i$.
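The fragment-extended mappability of Def 1 and the bin-level score can be sketched directly from equation (2). The code below uses 0-based positions and treats positions outside the sequence as unmappable; both conventions, and all function names, are assumptions for illustration. The multi-read variant follows the formula $I(R_i \le 100)/R_i$ and assumes $R_i \ge 1$.

```python
def extended_mappability(delta, i, k, L):
    """Equation (2): average nucleotide mappability over the forward and reverse
    tag start positions that can contribute counts at position i."""

    def total(lo, hi):
        # Sum delta[lo..hi] inclusive, ignoring positions outside the sequence.
        return sum(delta[j] for j in range(max(lo, 0), min(hi, len(delta) - 1) + 1))

    forward = total(i - L + 1, i)
    reverse = total(i - k + 1, i + L - k)
    return (forward + reverse) / (2 * L)


def bin_mappability(delta, start, end, k, L):
    """Score for one bin: mean extended mappability over positions start..end-1."""
    scores = [extended_mappability(delta, i, k, L) for i in range(start, end)]
    return sum(scores) / len(scores)


def multi_read_delta(occurrences, max_locations=100):
    """Def 2 nucleotide mappability for a k-mer occurring `occurrences` (>= 1) times."""
    return 1.0 / occurrences if occurrences <= max_locations else 0.0


# Toy example: 20 positions, all unique except positions 8 and 9.
delta = [1] * 20
delta[8] = delta[9] = 0
print(extended_mappability(delta, 10, k=3, L=4))  # 0.5
```

In the example, two of the four forward start positions (7 to 10) and two of the four reverse start positions (8 to 11) are unmappable, so the score at position 10 is 0.5.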

MOSAiCS

MOSAiCS is based on a two-component mixture model where data from unbound bins, i.e., background bins, are characterized with a negative binomial regression model that accounts for mappability, GC content, and input counts. Data from bound bins are modeled with a mixture of two negative binomials. Let $Y_j$ denote ChIP tag counts, $X_j$ input tag counts, $M_j$ and $GC_j$ mappability and GC content, respectively. Define $Z_j$ to be an unobserved random variable representing whether the bin is bound or not. Then, MOSAiCS assumes that

$$Y_j \mid Z_j = 0, M_j, GC_j, X_j \sim \mathrm{NegBin}(a, a/\mu_j),$$

$$Y_j \mid Z_j = 1, M_j, GC_j, X_j \sim \mathrm{NegBin}(a, a/\mu_j) + S_j,$$

where $S_j$ is a mixture of two negative binomial distributions, i.e.,

$$S_j \sim p_1\,\mathrm{NegBin}(b_1, c_1) + (1 - p_1)\,\mathrm{NegBin}(b_2, c_2) + k,$$

where $k = 3$ is a constant that represents the minimum tag count observable in a bound region;

$$\log(\mu_j) = \beta_0 + I(X_j \le s)\left[\beta_M \log_2(M_j + 1) + \beta_{GC}\,Sp(GC_j) + \beta_{X_1} X_j^{d}\right] + I(X_j > s)\,\beta_{X_2} X_j^{d}$$

and $Sp$ is a piecewise linear B-spline model with knots at the first and third quartiles of GC content. We had previously shown that this piecewise linear B-spline model characterizes the dependence of background tag counts on GC content well [44]. $s$ and $d$ are tuning parameters. For all the datasets we have used MOSAiCS on, $d = 0.25$ works best. Optimal $s$ was chosen among $\{2, 3, 4\}$ for each chromosome based on BIC scores. The R package mosaics implements this model (available from Bioconductor [59]) and provides parameter estimates and posterior probabilities depicting the probability that a given bin is bound. False discovery rate (FDR) is then controlled at the desired level utilizing these posterior probabilities. Contiguous bins declared as bound are merged as peaks.
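The last two steps (FDR control from posterior probabilities, then merging contiguous bound bins) can be illustrated as follows. The excerpt does not spell out the FDR rule, so this sketch assumes one common choice: rank bins by posterior probability of being bound and keep adding bins while the average posterior probability of being unbound among the called bins stays at or below the target level. It is not the mosaics package API, and the function names are hypothetical.

```python
def bins_at_fdr(posterior_bound, level):
    """Indices of bins called bound with estimated FDR <= level.

    Assumed rule: the estimated FDR of a call set is the mean of
    (1 - posterior probability of being bound) over the called bins.
    """
    order = sorted(range(len(posterior_bound)),
                   key=lambda j: posterior_bound[j], reverse=True)
    called, error_sum = [], 0.0
    for j in order:
        error_sum += 1.0 - posterior_bound[j]
        if error_sum / (len(called) + 1) > level:
            break  # the running mean only grows from here on
        called.append(j)
    return sorted(called)


def merge_contiguous(bin_indices, bin_width):
    """Merge runs of adjacent bins into (start, end) peaks; end is exclusive."""
    peaks = []
    for j in sorted(bin_indices):
        start = j * bin_width
        if peaks and peaks[-1][1] == start:
            peaks[-1] = (peaks[-1][0], start + bin_width)
        else:
            peaks.append((start, start + bin_width))
    return peaks


# Toy posterior probabilities for seven 100 bp bins.
posterior = [0.01, 0.99, 0.98, 0.40, 0.02, 0.97, 0.10]
bound = bins_at_fdr(posterior, level=0.05)
print(bound)                         # [1, 2, 5]
print(merge_contiguous(bound, 100))  # [(100, 300), (500, 600)]
```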

Classification of MR-only peaks based on their shared multi-reads

We classified the MR-only peaks into three classes, Type-I, II, and III, based on their shared multi-reads with the following procedure in a high throughput fashion.

1. For each peak, we identified the set of reads that map to its peak region. As expected, no uni-reads were shared between any two peaks. There were cases where the same multi-read was

Multi-Read Analysis of ChIP-Seq Data

PLoS Computational Biology | www.ploscompbiol.org 13 July 2011 | Volume 7 | Issue 7 | e1002111

[Slide annotations on the excerpt above: count in ChIP channel ($Y_j$); count in control channel ($X_j$); bound vs. background ($Z_j$); signal ($S_j$)]

Peak-finding (with bias)

• Signal model


[Figure labels: + strand, - strand. Source: Supplementary Figure 1, Nature Biotechnology: doi:10.1038/nbt.1505]

Peak-finding (with bias)


• Background model: GC-content and mappability

Transcription Factor Analysis

Using sequence (motif analysis)


An aside: representing sequence motifs
