
Narang, V. et al. 1

Supplementary material

For the paper “Localized Motif Discovery in Gene Regulatory Sequences”

Vipin Narang, Ankush Mittal and Wing Kin Sung

Section A. Scoring functions used in LocalMotif

LocalMotif uses a combination of three different scoring functions which individually describe three different characteristics of a motif. The relative entropy score (RES) measures the degree of surprise in the motif nucleotide pattern with respect to the background distribution of nucleotides. The over-representation score (ORS) measures the overabundance of the number of instances of the motif. The spatial confinement score (SCS) measures the disproportionate confinement of motif instances in a certain sequence interval. All three scores are expressed as entropies and brought to a normalized form. The scoring functions and their normalization are described below.

Spatial Confinement Score (SCS):

Consider an $(l,d)$ motif $M$ with its instances observed in a large set of sequences, $\mathcal{S}$, of length $L$ each, aligned relative to an anchor point $A$. Spatial confinement of $M$ within a position interval $(p_1, p_2)$ is defined as the difference between the fraction of binding sites actually observed within the interval $(p_1, p_2)$ and the fraction that would be expected to lie in $(p_1, p_2)$ if the binding sites were uniformly distributed across the entire sequence length. For instance a length $L/2$ interval $(p, p + L/2)$ is expected to contain 50% of the observed binding sites if they were uniformly distributed. But if this interval contains 65% of the total binding sites, then it has +0.15 spatial confinement of $M$.

Spatial confinement always lies in the range $[-1, 1]$. Its positive value in an interval signifies higher than expected binding site concentration in that interval. Figure S1 shows the spatial confinement of the motif TTGACA in E. coli promoter sequences for various intervals. The interval length $p_2 - p_1$ is shown on the x-axis and the interval beginning position $p_1$ is shown on the y-axis. Spatial confinement is shown as a surface in the z-axis. Maximum spatial confinement is observed for the interval (30,50) indicating that the motif is confined within this interval. The interval is indeed biologically accurate (Harley and Reynolds, 1987).

Thus spatial confinement gives a picture of the relative concentration of binding sites for a motif in different position intervals, and can be used to identify the position interval where the motif is maximally confined. However in practice it is difficult to accurately compute it because the number of input sequences provided to the algorithm is mostly limited. The limited information can be utilized most effectively using statistical procedures. A statistical measure for spatial confinement is therefore derived as the spatial confinement score.
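The raw definition of spatial confinement (observed fraction minus the fraction expected under uniform density) can be illustrated with a minimal Python sketch. The function name `spatial_confinement` and the half-open interval convention `[p1, p2)` are assumptions made for illustration; the text does not specify how interval endpoints are treated.

```python
def spatial_confinement(positions, p1, p2, seq_len):
    """Observed minus expected fraction of binding sites in [p1, p2).

    positions: binding-site start positions relative to the anchor point.
    The half-open interval convention is an assumption of this sketch.
    """
    if not positions:
        raise ValueError("need at least one binding-site position")
    inside = sum(1 for p in positions if p1 <= p < p2)
    observed = inside / len(positions)   # fraction actually in the interval
    expected = (p2 - p1) / seq_len       # fraction expected if uniform
    return observed - expected


# Worked example from the text: a half-length interval holding 65% of sites.
example_positions = [10] * 65 + [80] * 35
print(round(spatial_confinement(example_positions, 0, 50, 100), 2))  # 0.15
```

Scanning this function over a grid of `(p1, p2)` pairs would reproduce the kind of surface shown in Figure S1, although with few input sequences the raw value is noisy, which is what motivates the statistical score.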


[Figure S1: surface plot and top view of spatial confinement (scale -1 to 1) against position and width (each 0 to 100).]

Figure S1. Spatial confinement of the motif TTGACA in different intervals $(p_1, p_2)$ in a set of 471 E. coli promoter sequences of length 101 each. The x-axis denotes position $p_1$ and the y-axis denotes the interval width $p_2 - p_1$. Maximum is observed at $p_1 = 30$ and width = 20, indicating that the motif is confined within the interval (30,50), which agrees with the literature (Harley and Reynolds, 1987).

Instead of the large sequences set $\mathcal{S}$, let only a subset $S \subset \mathcal{S}$ be available as input to the algorithm. Thus $S$ is a sample data from the population $\mathcal{S}$. Let $c$ be the concentration of binding sites for the motif $M$ in position interval $(p_1, p_2)$ within the population $\mathcal{S}$. An estimate $\hat{c}$ of $c$ may be obtained from the sample $S$. Let $n$ denote the total number of binding sites for $M$ in the sequence set $S$, of which $n_1$ lie within the interval $(p_1, p_2)$ and $n_0 = n - n_1$ lie outside this interval. The maximum likelihood estimate is given as $\hat{c} = n_1 / (n_0 + n_1)$.

The spatial confinement of $M$ in the interval $(p_1, p_2)$ is measured as the difference $c - c_0$, where $c_0$ is the concentration of binding sites expected in $(p_1, p_2)$ according to uniform density, given by $c_0 = (p_2 - p_1)/L$. Since the exact value of $c$ is unknown, the problem is to assess from the sample estimate $\hat{c}$ whether or not $c > c_0$ in the interval $(p_1, p_2)$ and to what degree $c$ exceeds $c_0$. This would be a statistical measure of the spatial confinement of $M$ in $(p_1, p_2)$.

A statistical hypothesis test is defined to assess whether $c > c_0$ with the following elements: the null hypothesis, the alternate hypothesis, the test statistic and the rejection region. The two hypotheses are:

$$H_0: c = c_0$$

$$H_1: c > c_0 \quad \text{(one tailed)}$$

The test statistic is derived via likelihood ratio procedure. Considering that the population distribution is uniform (hypothesis $H_0$), in a randomly chosen sample, the likelihood of observing $n_1$ binding sites within the interval $(p_1, p_2)$ and $n_0$ outside this interval is given by the binomial formula:

$$\Pr\big(n_1 \text{ sites in } (p_1, p_2) \mid c_0\big) = L(c_0 \mid n_0, n_1) = \binom{n}{n_1} c_0^{n_1} (1 - c_0)^{n_0}$$

Considering that the population distribution is non-uniform (hypothesis $H_1$), let the concentration of binding sites in the interval $(p_1, p_2)$ be $c$. The binding site observations are outcomes of a binomial experiment where a binding site lies within the interval $(p_1, p_2)$ with probability $c$ and outside it with probability $1 - c$. If the total number of observed binding sites is $n$, of which $n_1$ lie within $(p_1, p_2)$ and $n_0 = n - n_1$ lie outside, then the likelihood of observing $n_1$ binding sites within the interval $(p_1, p_2)$ and $n_0$ outside this interval is again given by

$$\Pr\big(n_1 \text{ sites in } (p_1, p_2) \mid c\big) = L(c \mid n_0, n_1) = \binom{n}{n_1} c^{n_1} (1 - c)^{n_0}$$


The maximum likelihood estimate $\hat{c}$ of $c$ is thus obtained as

$$\frac{dL}{dc} = 0 \;\Rightarrow\; \hat{c} = \frac{n_1}{n_0 + n_1}$$

The likelihood ratio test statistic $\lambda_1$ for the hypothesis test is then obtained as

$$\lambda_1 = \frac{L(c_0)}{L(\hat{c})} = \frac{c_0^{n_1} (1 - c_0)^{n_0}}{\hat{c}^{n_1} (1 - \hat{c})^{n_0}}$$

and the rejection region is determined by

$$RR: \lambda_1 \le k$$

where $k$ is chosen according to the desired level of significance of the test. According to the Wilks' theorem [Rice (1995)], $-2 \ln \lambda_1$ is approximately $\chi^2$ distributed with one degree of freedom. This information can be used to derive the value of $k$ given a fixed level of significance $\alpha$. If $\lambda_1$ lies in the rejection region then there is sufficient evidence to conclude that the concentration of binding sites for the motif $M$ in the interval $(p_1, p_2)$ is greater than what would be expected from uniform density. As the value of $\lambda_1$ approaches zero, the hypothesis $H_1$ is favoured increasingly over $H_0$. The likelihood ratio test statistic $\lambda_1$ is related to the Kullback-Leibler distance between $\hat{c}$ and $c_0$ as

$$D(\hat{c} \,\|\, c_0) = -\frac{1}{n} \ln \lambda_1$$

which can be shown to be equal to

$$-\frac{1}{n} \ln \lambda_1 = \hat{c} \ln \frac{\hat{c}}{c_0} + (1 - \hat{c}) \ln \frac{1 - \hat{c}}{1 - c_0}$$

The above expression is used as the statistical measure for the spatial confinement score. It is already in a normalized form, being independent of motif length etc.

The significance or the P-value of the SCS score can be determined based on Wilks' theorem. As shown above, the likelihood ratio test statistic that compares the hypotheses $H_0$ and $H_1$ is given by $\Lambda = -2 \ln \lambda_1$, and is $\chi^2$ distributed with one degree of freedom. It can be easily shown that $\Lambda = 2 \times n \times SCS$, where $n = n_0 + n_1$. Thus the P-value of SCS can be calculated from the area under the right tail of the $\chi^2$ distribution with 1 df by plugging in the value of $\Lambda$.
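As a rough illustration of the SCS and its P-value, the following Python sketch evaluates the Kullback-Leibler form of the score and the right tail of the chi-square distribution with one degree of freedom. The function names are hypothetical, and the tail is computed with the standard identity that the upper tail of a 1-df chi-square at x equals erfc(sqrt(x/2)); this is a sketch under those assumptions, not the LocalMotif implementation.

```python
import math


def scs(n1, n0, p1, p2, seq_len):
    """KL divergence between the observed concentration and the uniform one."""
    n = n0 + n1
    c_hat = n1 / n                    # maximum likelihood estimate of c
    c0 = (p2 - p1) / seq_len          # concentration under uniform density

    def term(a, b):
        # a * ln(a / b), with the usual convention 0 * ln(0) = 0
        return 0.0 if a == 0 else a * math.log(a / b)

    return term(c_hat, c0) + term(1.0 - c_hat, 1.0 - c0)


def scs_p_value(n1, n0, p1, p2, seq_len):
    """Right-tail area of chi-square (1 df) at the statistic 2 * n * SCS."""
    stat = 2.0 * (n0 + n1) * scs(n1, n0, p1, p2, seq_len)
    stat = max(stat, 0.0)             # guard against tiny negative round-off
    return math.erfc(math.sqrt(stat / 2.0))


# 65 of 100 sites fall in an interval covering half of the sequence length.
print(scs(65, 35, 0, 50, 100))
print(scs_p_value(65, 35, 0, 50, 100))
```

The sketch requires 0 < c0 < 1, i.e. an interval strictly shorter than the sequence.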

Relative entropy score (RES):

The relative entropy of the motif is the Kullback-Leibler divergence between the motif $M$ and the background $B$:

$$D(M \,\|\, B) = \sum_{i=1}^{l} \sum_{b} f_{b,i} \ln \frac{f_{b,i}}{p_b},$$

which can be decomposed as

$$D(M \,\|\, B) = \sum_{i=1}^{l} \sum_{b} f_{b,i} \ln f_{b,i} - \sum_{i=1}^{l} \sum_{b} f_{b,i} \ln p_b,$$


i.e.,

$$D(M \,\|\, B) = \sum_{i=1}^{l} \sum_{b} f_{b,i} \ln f_{b,i} - \sum_{b} \sum_{i=1}^{l} f_{b,i} \ln p_b$$

If $x_1 + x_2 + \dots + x_n = 1$ then the maximum of the entropy function $-\sum_{i=1}^{n} x_i \ln x_i$ occurs for $x_1 = x_2 = \dots = x_n = 1/n$ and the maximum value is $\ln n$. Therefore the first term can be normalized by the factor $1/(l \ln 4)$. Normalizing the second term by the same factor $1/(l \ln 4)$, it appears as $-\frac{1}{\ln 4} \sum_{b} \bar{f}_b \ln p_b$, where $\bar{f}_b = \frac{1}{l} \sum_{i=1}^{l} f_{b,i}$. For a uniform background, where $p_b = 0.25$, this term reduces to 1 since $\sum_b \bar{f}_b = 1$. Another special case is when $\forall b,\ \bar{f}_b = p_b$. Then after normalization the term becomes $\le 1$, since it is the entropy of the background divided by $\ln 4$. As the difference between $\bar{f}_b$ and $p_b$ increases, the term can become >1.

The statistical significance or P-value of the RES can be calculated as follows. Consider each column of the PWM separately. Let

$$D_i = \sum_{b} f_{b,i} \ln \frac{f_{b,i}}{p_b}$$

be the entropy of the ith column in the motif PWM. The RES is the total entropy, which is the sum of $D_i$ over all columns of the PWM, i.e.

$$RES = \sum_{i=1}^{l} D_i.$$

We view the calculation of $D_i$ as a hypothesis testing problem. The null hypothesis is that the frequencies of the nucleotides in the ith column are generated from a multinomial distribution specified by the background frequencies $p_b$ where $b \in \{A, C, G, T\}$, i.e., for the ith column in the PWM:

$$H_0^i: n_{b,i} \text{ is a sample from } p_b$$

The multinomial distribution is given by:

$$L_0^i \sim \prod_{b} p_b^{\,n_{b,i}}$$

The alternate hypothesis H1 is that the column of the PWM originated from the multinomial distribution given by the observed frequencies:

$$H_1^i: n_{b,i} \text{ is a sample from the observed } f_{b,i}$$

The multinomial distribution is given by:

$$L_1^i \sim \prod_{b} f_{b,i}^{\,n_{b,i}}$$

The likelihood ratio for the hypothesis test is given by:

$$\lambda_i = \frac{\prod_{b} p_b^{\,n_{b,i}}}{\prod_{b} f_{b,i}^{\,n_{b,i}}} = \prod_{b} \left( \frac{p_b}{f_{b,i}} \right)^{n_{b,i}}$$

Thus the likelihood ratio test statistic is given by $\Lambda_i = -2 \ln \lambda_i$.


Plugging in the expressions,

$$\Lambda_i = 2 \sum_{b} n_{b,i} \ln \frac{f_{b,i}}{p_b} = 2 n_i D_i,$$

where $n_i = \sum_b n_{b,i}$. By Wilks' theorem, $\Lambda_i$ is $\chi^2$ distributed with 3 degrees of freedom. In general, if the order of the multinomial distribution is $m$, then $\Lambda$ is $\chi^2$ distributed with $m - 1$ degrees of freedom.

Since the sum of two independent chi-square distributed variables with degrees of freedom df1 and df2 respectively is also chi-square distributed with df1+df2 degrees of freedom, 2×n×RES is chi-square distributed with 3l degrees of freedom. Using this information we can calculate the P-value of RES.
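The per-column statistics and the pooled chi-square test can be sketched in Python as follows. The function names and the input layout (one dictionary of nucleotide counts per PWM column) are assumptions for illustration. The chi-square tail is evaluated with the standard recurrence for the regularized upper incomplete gamma function, Q(a+1, z) = Q(a, z) + z^a e^(-z) / Gamma(a+1), starting from Q(1/2, z) = erfc(sqrt(z)) or Q(1, z) = e^(-z).

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    while a < df / 2.0 - 1e-9:
        # Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
        q += math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
        a += 1.0
    return q


def res_test(columns, background):
    """Return (RES, pooled statistic, degrees of freedom, P-value).

    columns: list of dicts mapping nucleotide -> count, one per PWM column.
    background: dict mapping nucleotide -> background frequency p_b.
    """
    res, stat = 0.0, 0.0
    for col in columns:
        n_i = sum(col.values())
        d_i = sum((c / n_i) * math.log((c / n_i) / background[b])
                  for b, c in col.items() if c > 0)
        res += d_i
        stat += 2.0 * n_i * d_i      # per-column statistic 2 * n_i * D_i
    df = 3 * len(columns)
    return res, stat, df, chi2_sf(stat, df)


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
pwm_counts = [{"A": 10, "C": 0, "G": 0, "T": 0},
              {"A": 5, "C": 5, "G": 5, "T": 5}]
print(res_test(pwm_counts, uniform))
```

When every column has the same number of aligned sites n, the pooled statistic equals 2×n×RES as stated in the text.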

Over-representation score (ORS):

Searching for motif instances (TFBS) in a set of sequences can be considered as a binomial experiment where patterns of length l are drawn from the sequences and each pattern is classified as either a motif instance or a non-instance. The probability of observing k instances of the motif among a total of t samples is given by:

$$P(k, t) = \Pr(k \text{ "true" in } t) = \binom{t}{k} e^{k} (1 - e)^{t - k},$$

where $e$ is the proportion of TFBS in the sequences. For example, under the (l,d) motif representation, the chance proportion $e_0$ of the TFBS of a motif according to uniform background is computed theoretically as follows:

$$t_0 = 4^l, \quad k_0 = \sum_{i=0}^{d} \binom{l}{i} 3^i, \quad \text{and} \quad e_0 = k_0 / t_0.$$

The background probability distribution of the TFBS is then

$$P_0(k_0, t_0) = \binom{t_0}{k_0} e_0^{k_0} (1 - e_0)^{t_0 - k_0}.$$

If the background distribution is not uniform, the expression will be modified in a suitable manner to incorporate the individual probabilities of each of the 0k patterns that match the (l,d) motif.

An entropy measure for over-representation derived directly (without Gaussian approximation) from the binomial distribution in a normalized form is used in LocalMotif. It is obtained as the Kullback-Leibler divergence between the two binomial distributions:

$$D(P_0 \,\|\, P_1) = \sum_{k=0}^{n} P_0(k, t) \ln \frac{P_0(k, t)}{P_1(k, t)}$$

Upon expanding the above expression, it is simplified as:

$$D(P_0 \,\|\, P_1) = e_0 \ln \frac{e_0}{e_1} + (1 - e_0) \ln \frac{1 - e_0}{1 - e_1},$$

which is independent of the number of samples t. This is used as the measure for over-representation in LocalMotif. Since $e_0$ still depends upon $(k_0, t_0)$, the expression needs to be further normalized with respect to these. Thus an additional factor $\gamma(l,d)$ is included:

$$D(P_0 \,\|\, P_1) = \gamma(l,d) \left[ e_0 \ln \frac{e_0}{e_1} + (1 - e_0) \ln \frac{1 - e_0}{1 - e_1} \right],$$


where

$$\gamma(l,d) = \frac{1}{\dfrac{1}{4^l} \sum_{i=0}^{d} \binom{l}{i} 3^i}.$$

The denominator of $\gamma(l,d)$ is the fraction of length l patterns that have up to d mismatches from a given pattern.
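The chance proportion and the normalized over-representation measure can be sketched in Python as below, for the uniform-background case only. The function names are hypothetical, the observed proportion `e1` is assumed to be supplied by the caller, and both proportions must lie strictly between 0 and 1.

```python
import math


def chance_proportion(l, d):
    """Fraction of the 4**l patterns within d mismatches of a given l-mer."""
    k0 = sum(math.comb(l, i) * 3 ** i for i in range(d + 1))
    return k0 / 4 ** l


def binomial_kl(e0, e1):
    """Per-sample KL divergence between binomials with proportions e0 and e1."""
    return (e0 * math.log(e0 / e1)
            + (1.0 - e0) * math.log((1.0 - e0) / (1.0 - e1)))


def ors_uniform(e1, l, d):
    """Normalized over-representation for an observed proportion e1."""
    e0 = chance_proportion(l, d)
    # The normalizing factor is the reciprocal of the chance proportion.
    return binomial_kl(e0, e1) / e0


print(chance_proportion(6, 1))       # 19 of the 4096 possible 6-mers
print(ors_uniform(0.02, 6, 1))
```

For a non-uniform background, `chance_proportion` would have to sum the individual probabilities of the matching patterns instead, as the text notes.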

Since the ORS is calculated from the binomial distribution same as the SCS, the likelihood ratio test statistic $\Lambda = 2 \times t \times (e_0 + e_1) \times ORS$ is $\chi^2$ distributed with one degree of freedom. The p-value for ORS is hence calculated from this $\chi^2$ distribution in the same way as for SCS.

A note on the Gaussian approximation for over-representation (Z-score):

As $t$ grows to be large, specifically if both $te \ge 5$ and $t(1 - e) \ge 5$, the binomial distribution may be approximated by the Gaussian distribution $N\big(te,\ te(1 - e)\big)$. Thus,

$$P_0(k_0, t_0) = \binom{t_0}{k_0} e_0^{k_0} (1 - e_0)^{t_0 - k_0} \approx P_0(x) = N\big(t_0 e_0,\ t_0 e_0 (1 - e_0)\big).$$

Let the observed number of matching patterns in $t_1$ actual trials (i.e., upon searching the set of sequences consisting of $t_1$ oligonucleotides of length $l$) be $k_1$. This represents an observed proportion $e_1 = k_1 / t_1$. Hence the observed probability distribution of the TFBS is:

$$P_1(k_1, t_1) = \binom{t_1}{k_1} e_1^{k_1} (1 - e_1)^{t_1 - k_1} \approx P_1(x) = N\big(t_1 e_1,\ t_1 e_1 (1 - e_1)\big).$$

The Z-score (Tompa, 1999) for computing the over-representation is based on the Gaussian approximation:

$$z = \frac{t_1 e_1 - t_1 e_0}{\sqrt{t_1 e_0 (1 - e_0)}}.$$

The Z-score is not a normalized measure as it depends upon the number of samples $t_1$.

For a Gaussian approximation of the binomial distribution, the KL divergence between two Gaussians is given by

$$D(N_0 \,\|\, N_1) = \frac{1}{2} \left[ \frac{\sigma_0^2}{\sigma_1^2} + \frac{(\mu_0 - \mu_1)^2}{\sigma_1^2} - 1 - \ln \frac{\sigma_0^2}{\sigma_1^2} \right],$$

and thus

$$D(P_0 \,\|\, P_1) = \frac{1}{2} \left[ \frac{e_0 (1 - e_0)}{e_1 (1 - e_1)} + \frac{(e_0 - e_1)^2}{e_1 (1 - e_1)} - 1 - \ln \frac{e_0 (1 - e_0)}{e_1 (1 - e_1)} \right]$$

which is approximately identical with the previous expression for most cases, except when $e_0$ or $e_1$ have extreme values that are close to 1 or 0, in which case the Gaussian approximation has significant error.


Section B. The LocalMotif Algorithm

The LocalMotif algorithm has two modules – a core module which finds a non-redundant set of best scoring (l,d) motifs, and a refinement module which finds an optimal PWM corresponding to each (l,d) motif found by the core module. The two modules are described here.

The Core Module:

The core module scores candidate (l,d) motifs in different sequence intervals and reports the best scoring ones,

while weeding out similar motifs with overlapping intervals. An exhaustive enumeration strategy would require scoring all possible $4^l$ candidate patterns in all possible sequence intervals, leading to a complexity of $O(4^l \cdot L^2)$. Therefore a greedy search approach is used. An efficient algorithm is developed as illustrated in Figure S2. It includes various speed-ups and memory optimizations as follows.

1. Creating a positional dictionary

The positional dictionary is a data structure for quickly computing the number of instances of a candidate motif in a given sequence interval. All unique length l patterns (l-mers) occurring in the input sequences form the different entries of a dictionary. The position of every single occurrence of an l-mer is recorded in its entry. Overlapping occurrences of the same l-mer are excluded. The dictionary is cross-referenced so that entries whose l-mers have a Hamming distance of d or less from each other are interlinked. Interlinking facilitates quick enumeration of all binding site occurrences for every l-mer candidate.
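A minimal Python sketch of such a positional dictionary is given below. The function names are hypothetical, and the cross-referencing uses a simple all-pairs Hamming comparison, which is quadratic in the number of distinct l-mers; it is a simplification chosen for clarity, and the actual LocalMotif data structure may be organized differently.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def build_positional_dictionary(sequences, l, d):
    """Map each l-mer to its (sequence index, start) occurrences and neighbours."""
    positions = defaultdict(list)
    for seq_id, seq in enumerate(sequences):
        last_end = {}                       # end of last accepted occurrence
        for start in range(len(seq) - l + 1):
            lmer = seq[start:start + l]
            if start < last_end.get(lmer, 0):
                continue                    # overlaps the same l-mer: skip
            positions[lmer].append((seq_id, start))
            last_end[lmer] = start + l
    # Interlink entries whose l-mers differ in at most d positions.
    neighbours = {lmer: set() for lmer in positions}
    for a, b in combinations(positions, 2):
        if hamming(a, b) <= d:
            neighbours[a].add(b)
            neighbours[b].add(a)
    return dict(positions), neighbours


pos, nbr = build_positional_dictionary(["TTGACATT", "TTGACTAA"], 6, 1)
print(pos["TTGACA"], sorted(nbr["TTGACA"]))
```

With this layout, the binding sites of a candidate are the occurrences of the candidate itself plus those of all its linked neighbours.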

2. Speed-up for score computation 

Scoring each candidate l-mer in all possible position intervals $\{(p_1, p_2) : 0 \le p_1 < p_2 \le L\}$ would be formidable. Only the intervals $\{(p_1, p_2) : p_1 < p_2;\ p_1, p_2 \in \{0, s, 2s, 3s, \dots, L\}\}$ are considered, where s, called step size, is a small integer value set by default to 5 in the current implementation, and can be varied as a user-supplied parameter. Interestingly the score for a longer interval can be computed directly from the scores for shorter constituent intervals. The relations are derived as follows. Let $p_1 < p_2 < p_3$, and let quantities for the interval $(p_x, p_y)$ be denoted with superscript xy. Then,

$$n_1^{13} = n_1^{12} + n_1^{23} \;\Rightarrow\; \hat{c}^{13} = \hat{c}^{12} + \hat{c}^{23}$$

$$n_0^{13} = n_0^{12} + n_0^{23} \;\Rightarrow\; c_0^{13} = c_0^{12} + c_0^{23}$$

$$e_0^{13} = e_0^{12} + e_0^{23}$$

$$e_1^{13} = e_1^{12} + e_1^{23}$$

$$n_1^{13} f_{i,j}^{13} = n_1^{12} f_{i,j}^{12} + n_1^{23} f_{i,j}^{23} \;\Rightarrow\; \hat{c}^{13} f_{i,j}^{13} = \hat{c}^{12} f_{i,j}^{12} + \hat{c}^{23} f_{i,j}^{23} \;\Rightarrow\; f_{i,j}^{13} = \frac{\hat{c}^{12}}{\hat{c}^{13}} f_{i,j}^{12} + \frac{\hat{c}^{23}}{\hat{c}^{13}} f_{i,j}^{23}$$

Computations are thus performed over two passes – scores for all length s intervals are computed in the first pass, and scores for longer intervals are calculated from the scores for shorter constituent intervals in the second pass. The bottleneck in score computation is the first pass, so the second pass speeds up the computation.
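The two passes can be sketched in Python for the binding-site counts alone. The function name `interval_counts` is hypothetical, site positions are assumed to satisfy 0 <= p < L, and the sketch enumerates all grid intervals rather than limiting them by a maximum window size.

```python
def interval_counts(site_positions, seq_len, step):
    """Count sites in every interval whose end points lie on the step grid."""
    n_bins = (seq_len + step - 1) // step
    # First pass: counts in each elementary interval of length `step`.
    base = [0] * n_bins
    for p in site_positions:
        base[p // step] += 1
    # Second pass: n(p1, p2) = n(p1, p2 - s) + n(p2 - s, p2).
    counts = {}
    for i in range(n_bins):
        running = 0
        for j in range(i, n_bins):
            running += base[j]
            p1, p2 = i * step, min((j + 1) * step, seq_len)
            counts[(p1, p2)] = running
    return counts


counts = interval_counts([3, 7, 12, 31], seq_len=40, step=10)
print(counts[(0, 20)], counts[(10, 40)])   # 3 2
```

Each longer interval costs one addition in the second pass, whereas the first pass touches every binding site, which is why the first pass dominates the running time.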

3. Early discarding of similar patterns

As the candidate l-mers are being scored, a list of scores is maintained sorted in a descending order. To limit the memory requirement, only the top $\eta$ candidates are maintained in the list, where $\eta$ is a user-defined percentage of the total number of candidates. Moreover, if two candidates have similar pattern (similarity > 65%) and overlapping position intervals, the lower scoring candidate is discarded. Similarity between two l-mers is evaluated using the Needleman-Wunsch global alignment algorithm.


4. Extending the motif search

The LocalMotif algorithm does not perform an exhaustive search over all possible $4^l$ l-mer patterns to seek the best motifs. Initially only the l-mers occurring directly within the sequences are considered as candidates. It is possible that l-mers not occurring directly within the sequences may be the best motifs. A heuristic algorithm similar to SP-STAR (Pevzner and Sze, 2000) extends the search over other probable patterns. The best scoring l-mers are clustered according to the goodness of their alignment, so that each cluster $M_{clus} = \{M_1, M_2, \dots, M_m\}$ contains similar patterns of length l. A majority pattern is computed for each cluster, whose ith letter is the most frequent ith letter in $M_{clus}$, with ties broken arbitrarily (Pevzner and Sze, 2000). The majority pattern of each cluster is a new candidate motif. The new generation of candidate motifs is added to the cross-referenced positional dictionary and scored in all sequence intervals. Best scoring candidate motifs are again selected and the clustering and majority pattern procedure is repeated until scores of a new generation do not show any improvement over previous generations.
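The majority-pattern step can be illustrated with a short Python sketch. The function name is hypothetical, the cluster is assumed to be already aligned with all members of equal length, and ties are resolved by whichever letter `Counter` encounters first, which is one arbitrary tie-breaking choice.

```python
from collections import Counter


def majority_pattern(cluster):
    """Column-wise most frequent letter of equal-length patterns."""
    if not cluster:
        raise ValueError("cluster must contain at least one pattern")
    length = len(cluster[0])
    if any(len(m) != length for m in cluster):
        raise ValueError("all patterns in a cluster must have the same length")
    # zip(*cluster) iterates over the columns of the aligned patterns.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*cluster))


print(majority_pattern(["TTGACA", "TTGACT", "TAGACA"]))  # TTGACA
```

The returned pattern need not occur in the input sequences, which is how the search is extended beyond the l-mers observed directly.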

5. Combining motif candidates with different (l,d) combinations

In each run, the LocalMotif algorithm finds motifs for a fixed value of l and d as described in Figure S2. The results of separate runs with varying $(l,d)$ are combined by a short script before passing the output to the refinement module. Since the LocalMotif scoring function does not depend upon l and d, motifs with different l and d can be directly compared in their scores. Motifs with similar pattern and overlapping intervals are again identified by alignment and among a pair of motifs with greater than 65% similarity (measured relative to the shorter motif), the one with lower score is discarded.

The Refinement Module:

The (l,d) motifs discovered by the core module are passed to the refinement module. The refinement module processes each of these motifs separately to produce their corresponding PWMs. For a given (l,d) motif, first an initial PWM of length l is constructed from the alignment of all its d-mismatch instances occurring within the entire set of sequences. The Fitness Expectation Maximization (FEM) algorithm (Wierstra, et al., 2008) is then used to refine the PWM so as to maximize the LocalMotif scoring function. FEM is tuned here to search the local landscape of the scoring function so as to converge to a local maximum near the initial PWM. The objective is to optimize the 3l-dimensional continuous vector $\mathbf{x}$ of the parameters of the PWM for the fitness function $f(\mathbf{x})$, which is the LocalMotif score. The function $f(\mathbf{x})$ is unknown or 'black box' to the algorithm, in that the only information accessible to the algorithm consists of function measurements selected by the algorithm. The goal is to optimize $f(\mathbf{x})$, while keeping the number of function evaluations, which are considered costly, as low as possible.

The FEM algorithm performs this optimization as follows. It evaluates a batch of N separate search points $\mathbf{z}_1, \mathbf{z}_2, \dots, \mathbf{z}_N$ on the fitness function, which are chosen according to a search policy $p(\mathbf{z} \mid \mathbf{x}, \Sigma)$. The search policy is a Gaussian with mean as the current parameter values $\mathbf{x}$ and the covariance matrix $\Sigma$. This means that the points $\mathbf{z}$ (which are 3l-dimensional in this case) are selected according to the probability density:

$$N(\mathbf{z} \mid \mathbf{x}, \Sigma) = \frac{1}{(2\pi)^{3l/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{z} - \mathbf{x})^T \Sigma^{-1} (\mathbf{z} - \mathbf{x}) \right)$$

The information extracted from fitness evaluations $f(\mathbf{z}_1), \dots, f(\mathbf{z}_N)$ is used to adjust both the current candidate solution $\mathbf{x}$ and the search policy $p(\mathbf{z} \mid \mathbf{x}, \Sigma)$.

At every point in time in the algorithm, we want to optimize the expected fitness $J = E_{\mathbf{z}}[f(\mathbf{z})]$ of the next batch, given the current batch of search samples. The fitness $J_t$ of the current batch, t, can be defined as:

$$J_t = \sum_{i=1}^{N} p(\mathbf{z}_i \mid \mathbf{x}, \Sigma)\, f(\mathbf{z}_i)$$


In order to adjust parameters $(\mathbf{x}, \Sigma)$ towards solutions with higher associated fitness, the search distribution is matched to the actual sample points $\mathbf{z}_1, \mathbf{z}_2, \dots, \mathbf{z}_N$, but weighted by their 'utilities'. Though the utility could be taken as $f(\mathbf{z}_i)$ for the point $\mathbf{z}_i$, this in practice does not work well when the sample size N is small. A good choice of utility $u(f(\mathbf{z}))$, as described in (Wierstra, et al., 2008), is a simple rank-based utility transformation function, the piecewise linear $u_k = u\big(f(\mathbf{z}_k) \mid f(\mathbf{z}_{k-1}), f(\mathbf{z}_{k-2}), \dots, f(\mathbf{z}_{k-N})\big)$, which first ranks all samples $k-N, \dots, k$ based on fitness value, and then assigns zero to the N − m worst ones and assigns values linearly from 0 . . . 1 to the m best samples. The fitness of the batch is then evaluated based on the utility, i.e.,

$$J_u = \sum_{i=1}^{N} p(\mathbf{z}_i \mid \mathbf{x}, \Sigma)\, u\big(f(\mathbf{z}_i)\big)$$

Thus the EM algorithm is summarized as:

/* INITIALIZATION */
♦ Build a dictionary of all l-mers found within the sequences and their occurrence positions, and link l-mers having a Hamming distance ≤ d from each other.

/* FIRST PASS */
FOR M = all l-mers in the dictionary:
    FOR p = 0 to L with step s:
        ♦ Compute the number of binding sites n(M, p, p+s) of M in the interval (p, p+s).

/* SECOND PASS */
Initialize the list of stored motifs, T, to empty.
FOR M = all l-mers in the dictionary:
    FOR p1 = 0 to (L – s) with step s:
        FOR p2 = (p1+s) to (p1+max_window_size) with step s:
            ♦ Compute the number of binding sites of M in the interval (p1,p2) using the values in constituent intervals (p1,p2–s) and (p2–s,p2), i.e., n(M,p1,p2) = n(M,p1,p2–s) + n(M,p2–s,p2).
            ♦ Using n(M,p1,p2), compute for the interval (p1,p2) the variables e0, e1, c0, ĉ, fi,j.
            ♦ Thus compute the total score of M in the interval (p1,p2).
            ♦ Insert the entry for {M;(p1,p2)} in the list of scores T using the insertion sort algorithm.
            ♦ IF size(T) > η THEN discard the lowest scoring entry.

            /* DISCARD SIMILAR PATTERNS */
            FOR all stored entries {M’;(p1’,p2’)} in the list T:
                IF M is similar to M’ AND (p1,p2) overlaps (p1’,p2’):
                    IF score of M > score of M’ THEN discard M’ from T.

/* EXTEND MOTIF SEARCH */
♦ Perform clustering and majority pattern generation.
♦ Add majority pattern to the dictionary and score it in all intervals as in the above.
♦ Repeat the extension till the average score stops increasing.

/* OUTPUT */
♦ Output the top scoring motifs and their position intervals.

Figure S2. The LocalMotif core module algorithm for a fixed (l,d).


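E step: at the iteration t, weight each point $\mathbf{z}$ by its utility for the next iteration t+1:

$$q_{t+1}(\mathbf{z}_i) = \frac{p(\mathbf{z}_i \mid \mathbf{x}, \Sigma)\, u\big(f(\mathbf{z}_i)\big)}{\sum_{i=1}^{N} p(\mathbf{z}_i \mid \mathbf{x}, \Sigma)\, u\big(f(\mathbf{z}_i)\big)}$$

M step: update the parameters $\mathbf{x}$ and $\Sigma$ towards maximizing the fitness:

$$(\mathbf{x}_{t+1}, \Sigma_{t+1}) = \arg\max_{\mathbf{x}, \Sigma} \sum_{i=1}^{N} q_{t+1}(\mathbf{z}_i) \log p(\mathbf{z}_i \mid \mathbf{x}, \Sigma)$$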

In order to speed up convergence, the algorithm is executed online, that is, sample by sample, instead of batch by batch. The online version of the algorithm includes a crucial forget factor, $\alpha$, which modulates the speed at which the search policy adapts to the current sample. Batch size N is now only used for utility ranking function, u, which ranks the current sample among the N last seen samples. The resulting FEM algorithm pseudocode is below.

Pseudocode for Fitness Expectation Maximization PWM refinement:

Use shaping function u, batch size N, forget factor $\alpha$
k ← 1
Initialize the PWM parameters $\mathbf{x}$ and the covariance matrix $\Sigma$
Repeat
    draw a sample PWM $\mathbf{z}_k \sim N(\mathbf{x}, \Sigma)$
    evaluate fitness $f(\mathbf{z}_k)$ using the LocalMotif scoring function for the PWM $\mathbf{z}_k$
    compute rank-based fitness $u_k = u\big(f(\mathbf{z}_k) \mid f(\mathbf{z}_{k-1}), f(\mathbf{z}_{k-2}), \dots, f(\mathbf{z}_{k-N})\big)$
    $\mathbf{x} \leftarrow (1 - \alpha u_k)\, \mathbf{x} + \alpha u_k\, \mathbf{z}_k$
    $\Sigma \leftarrow (1 - \alpha u_k)\, \Sigma + \alpha u_k\, (\mathbf{x} - \mathbf{z}_k)(\mathbf{x} - \mathbf{z}_k)^T$
    k ← k+1
Until fitness $f(\mathbf{x})$ of the current $\mathbf{x}$ does not increase for 5 consecutive iterations

We maintain a history of the PWMs x over all iterations and finally choose the PWM with the best fitness.
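The online loop can be sketched in Python under strong simplifying assumptions: the covariance is kept diagonal (one variance per parameter) instead of the full matrix described above, the rank-based utility is one possible reading of the piecewise linear shaping function, and PWM constraints such as column normalization are ignored. All names (`rank_utility`, `fem_refine`, `sigma0`, `patience`) are hypothetical, and the quadratic toy fitness merely stands in for the LocalMotif score.

```python
import random


def rank_utility(current, recent, m):
    """1 for the best sample, decreasing linearly to 1/m, 0 outside the top m."""
    better = sum(1 for f in recent if f > current)
    return 0.0 if better >= m else (m - better) / m


def fem_refine(fitness, x0, sigma0=0.1, alpha=0.1, batch=20, m=5,
               patience=5, max_iter=2000, seed=0):
    """Diagonal-covariance sketch of the online FEM update."""
    rng = random.Random(seed)
    x = list(x0)
    var = [sigma0 ** 2] * len(x)
    history = []
    best_x, best_f, stale = list(x), fitness(x), 0
    for _ in range(max_iter):
        z = [rng.gauss(xi, vi ** 0.5) for xi, vi in zip(x, var)]
        fz = fitness(z)
        u = rank_utility(fz, history[-batch:], m)
        history.append(fz)
        w = alpha * u
        # x <- (1 - alpha*u) x + alpha*u z, then the same blend for variances
        x = [(1 - w) * xi + w * zi for xi, zi in zip(x, z)]
        var = [(1 - w) * vi + w * (xi - zi) ** 2
               for vi, xi, zi in zip(var, x, z)]
        fx = fitness(x)
        if fx > best_f:
            best_x, best_f, stale = list(x), fx, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_x, best_f


def toy_fitness(v):
    return -sum((vi - 1.0) ** 2 for vi in v)


print(fem_refine(toy_fitness, [0.0, 0.0, 0.0]))
```

Keeping the best parameter vector seen so far mirrors the history of PWMs described above; a full-covariance version would need a multivariate normal sampler, for which a numerical library is the more practical choice.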


Section C. Supplementary Details for Results Comparison of motif finding tools: All experiments below were conducted on a desktop computer equipped with Pentium Dual Core 3.40 Ghz CPU and 4GB RAM running on Windows XP operating system. The standalone versions of the programs MEME, Weeder, Trawler, Amadeus and LocalMotif were run on this same machine. MEME version 3.5.4 and Weeder version 1.3 were compiled using Cygwin. Trawler standalone version 1.2 and Amadeus (Allegro version 1.0) were directly executable using Activestate Perl and Java virtual machine respectively. LocalMotif was compiled in Microsoft Visual C++ to produce a standalone executable, and a user interface was written in Python version 2.5. The parameters used while running each of these tools are listed individually in each section below. Trawler, Amadeus and LocalMotif required specification of background sequences. A standard human background sequences file, hg18_background.fa, was generated to be used for all three programs. It contained sequences 1kb upstream of the TSS and the first 1kb in the exon regions for all 26,514 human Refseq gene transcripts listed in the hg18 version of UCSC genome browser (Refgene table for hg18). This selection of both 1kb upstream promoter region and 1kb exon sequences in an equal mix was found to be optimal for motif finding performance. A background using only the upstream promoter region was tried using two settings: 3kb upstream of the TSS and 1kb upstream of the TSS. Under both settings this “promoter only” background resulted in poor performance with a bias towards reporting only AT rich motifs. For trawler, the whole set of 26,514 transcripts could not be used due to an increase in the running time. A randomly subsampled dataset containing 3000 sequences was used instead. The reported motifs were compared with known TRANSFAC ver. 11.3 motifs using the STAMP tool of Mahony et al. 
with the default parameters: comparison metric: Pearson Correlation Coefficient, Alignment method: ungapped Smith Waterman, Trim motif edges with information content of less than 0.4. A maximum E-value cutoff of 0.001 was imposed. Simulated Datasets:

Each simulated dataset contained N nucleotide sequences of the same length L selected randomly from the human genome. In about k percent of the sequences, a known binding site for a single TF obtained from TRANSFAC was implanted within a random position interval of length $p \cdot L$, where $0.01 \le p \le 0.5$, in both forward and reverse orientations. A total of 100 such datasets were generated while randomly varying the parameters N, L, k, p and the TF. The TFs were chosen from among 10 different vertebrate TFs, each of which has at least 60 binding sites in the TRANSFAC database. These TFs are as follows:

TRANSFAC matrix name    Number of sites
V$SP1_Q6_01             213
V$AP1_Q6_01             105
V$GATA_Q6               105
V$EBOX_Q6_01            100
V$CEBP_Q3               99
V$NF1_Q6_01             77
V$FOXP1_01              76
V$PPARG_01              72
V$AP2ALPHA_02           70
V$ZNF219_01             60
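The implanting procedure used for the simulated datasets can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' generator: the background here is uniform random nucleotides (the paper samples sequences from the human genome), "both orientations" is interpreted as choosing the forward site or its reverse complement with equal probability, and the function name `simulate_dataset`, its arguments and the seed are hypothetical.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(site):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(site))


def simulate_dataset(sites, n_seqs, seq_len, k_percent, p, seed=0):
    """Implant known sites into random sequences inside one confined interval.

    Assumption: background sequences are i.i.d. uniform nucleotides,
    whereas the paper draws them from the human genome.
    Returns (sequences, (interval_start, interval_end), implants), where the
    interval end is exclusive and implants holds (sequence index, position, site).
    """
    rng = random.Random(seed)
    win_len = max(1, int(round(p * seq_len)))
    if win_len < max(len(s) for s in sites):
        raise ValueError("interval of length p*L is shorter than the longest site")
    # One confinement interval of length p*L is drawn per dataset.
    win_start = rng.randrange(0, seq_len - win_len + 1)
    sequences, implants = [], []
    for i in range(n_seqs):
        seq = [rng.choice("ACGT") for _ in range(seq_len)]
        # Implant a site in roughly k percent of the sequences.
        if rng.random() < k_percent / 100.0:
            site = rng.choice(sites)
            if rng.random() < 0.5:
                site = reverse_complement(site)
            # Keep the whole site inside the chosen interval.
            pos = rng.randint(win_start, win_start + win_len - len(site))
            seq[pos:pos + len(site)] = site
            implants.append((i, pos, site))
        sequences.append("".join(seq))
    return sequences, (win_start, win_start + win_len), implants


if __name__ == "__main__":
    example_sites = ["TTGTTTTCTT", "TTGTTTACTT"]  # illustrative sites only
    seqs, interval, implants = simulate_dataset(
        example_sites, n_seqs=20, seq_len=200, k_percent=50, p=0.25)
    print(interval, len(implants))
```

Returning the implant positions alongside the sequences makes it possible to check afterwards whether a motif finder recovered both the site and the interval.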


The commands used for running different tools are as below:

MEME ./meme inputfile -dna -mod zoops -nmotifs 2 -minw 6 -maxw 15 -revcomp -maxsize 1000000

Weeder ./weederlauncher inputfile HS medium S

Trawler perl bin\trawler.pl -dir_id somename -sample my_experiments\inputfile -background my_experiments\hg18_background.fa

Amadeus Running mode: Normal, Motif length: 10, Enrichment weight=1, Strand bias weight=0, Localization weight=0, Background sequences= hg18_background.fa

LocalMotif Background Markov model order=2, Background file= hg18_background.fa, (l,d) = (6,1)+(7,1)+(8,2)+(9,2), stepsize=10, minwinsize=50, maxwinsize=-1, strand=double
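The MEME invocation listed above can also be assembled programmatically, which keeps the flags in one place when the same settings are reused with a different number of motifs. The following is a minimal sketch: the helper name `meme_command` and its keyword arguments are hypothetical, while the flags and values are those listed above. It only builds the argument list and does not run MEME.

```python
def meme_command(input_file, nmotifs, min_width=6, max_width=15, max_size=1000000):
    """Build the MEME argument list used above (ZOOPS model, both strands)."""
    return [
        "./meme", input_file,
        "-dna",
        "-mod", "zoops",
        "-nmotifs", str(nmotifs),
        "-minw", str(min_width),
        "-maxw", str(max_width),
        "-revcomp",
        "-maxsize", str(max_size),
    ]


if __name__ == "__main__":
    # Reproduces the command line used for the simulated datasets.
    print(" ".join(meme_command("inputfile", nmotifs=2)))
```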

Sequences flanking the TSS: Promoter sequences flanking the TSS were analyzed for motifs in the Drosophila melanogaster and human genomes. The Drosophila dataset consisted of 1941 Drosophila core promoter sequences compiled by Ohler et al. (2002). Each sequence is 300 bp long (-250 to +50 relative to the TSS). The results of the MEME run were taken directly from Ohler et al. (2002). The 300 bp dataset was examined with LocalMotif using the following parameters:

Background Markov model order=2, Background file=a set of 361 Drosophila intron sequences, (l,d) = (6,1)+(7,1)+(8,2), stepsize=5, minwinsize=5, maxwinsize=-1, strand=double

The human data included nine different ChIP-Chip datasets. These are promoters where binding sites for the TFs Oct4, Sox2, Nanog, HNF1A, HNF4A, HNF6, FOXA2, USF1 and CREB1 respectively have been recognized by ChIP-Chip experiments within the -8 kb to +2 kb region flanking the TSS (Boyer, et al., 2005; Odom, et al., 2006). Although the experiments of Boyer et al. and Odom et al. include more datasets, these nine were selected because Koudritsky and Domany (2008) recently reported that they have a sharp peak of the ChIP-Chip signal within 300 bp upstream of the TSS. The full 10 kb region was analyzed for motifs using LocalMotif, Trawler and Amadeus. The datasets contain from 370 to 4300 sequences. Because the larger datasets were too large to analyze in full, a maximum of 1000 sequences was analyzed per dataset, with the subset randomly sampled from the original set of sequences. The parameters used for the analysis are as follows:

Trawler perl bin\trawler.pl -dir_id somename -sample my_experiments\inputfile -background my_experiments\hg18_background.fa

Amadeus Two sets of parameters were tried:

(a) Running mode: Normal, Motif length: 10, Enrichment weight=1, Strand bias weight=0, Localization weight=0, Background sequences= hg18_background.fa

(b) Running mode: Normal, Motif length: 10, Enrichment weight=1, Strand bias weight=0, Localization weight=1 (TSS position provided), Background sequences= hg18_background.fa

LocalMotif Background Markov model order=2, Background file= hg18_background.fa, (l,d) = (6,1)+(7,1)+(8,2), stepsize=100, minwinsize=100, maxwinsize=1000, strand=double
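The random subsampling applied to the larger ChIP-Chip datasets (at most 1000 sequences per dataset) can be sketched as below. This is an illustration only: the function name `subsample_sequences`, the fixed seed and the toy records are assumptions, and the source does not state how the random subset was actually drawn.

```python
import random


def subsample_sequences(records, max_count=1000, seed=0):
    """Return at most max_count records, drawn uniformly without replacement.

    Datasets that already fit within max_count are returned unchanged.
    The seed is a hypothetical choice, used only to make the draw repeatable.
    """
    if len(records) <= max_count:
        return list(records)
    rng = random.Random(seed)
    return rng.sample(records, max_count)


if __name__ == "__main__":
    # Toy stand-ins for (identifier, sequence) FASTA records;
    # 4300 matches the largest dataset size quoted above.
    records = [("seq%d" % i, "ACGT") for i in range(4300)]
    print(len(subsample_sequences(records)))
```

Fixing the seed is a design choice for reproducibility, so that all three programs can be given the same subset.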

The top 10 motifs reported by LocalMotif in the Sox2 dataset are shown below. In this dataset, Trawler and Amadeus could not detect the Sox2 motif. The reason is that the sequences in this ChIP-Chip dataset are very long (10 kb), which weakens the signal of the ChIP TF motif over the full sequence length. However, the Sox2 motif is localized within about 1 kb of the TSS, so LocalMotif could detect it from its localization using the SCS.


Still LocalMotif reports the Sox2 motif at rank 8 in the list. This is because there are other motifs near the TSS having greater localization, to which LocalMotif assigns a higher SCS. These include binding motifs of the TFs Sp1, NRF2 (binds to ARE elements in the promoter region), NFY (CAAT box), HEB / E2F (E-box), and AP-2. These TFs ubiquitously bind to the promoters of many genes and thus the occurrence of their binding motifs is unsurprising in this dataset.

Reported consensus    Position       Nearest TRANSFAC motif
GGGCGGGG              [-200,+200]    Sp1
CCGGAAGG              [-200,+200]    NRF2
TGATTGGG              [-200,+100]    NFY
CCAGCTGG              [-200,+300]    HEB
CCCCAGGC              [-200,+200]    AP-2
GGGGGGGG              [-300,+200]    None
CCGCCAAC              [-200,+100]    E2F
AACAATGT              [-200,+100]    SOX
CGACGAGG              [-200,+200]    None
GGGGGAAA              [-200,+200]    None
AACCAAAA              [-200,+200]    MADS (also similar to Sox)

[Center distribution plots omitted]

Sequences flanking a known TFBS: The dataset consisted of 34 estrogen receptor (ER) target sequences from human chromosomes 21 and 22 discovered by ChIP analysis of in-vivo ER-chromatin complexes, all of which contain the full ERE motif (length 15 bp, consensus AGGTCACCNTGACCT). Almost all sequences lie distal from the TSS beyond the promoter region and have lengths ranging from 0.2 to 2.5 kbp. The ERE was selected as the anchor point, and its ±500 bp flanking region was analyzed for motifs. The positions of Forkhead binding sites relative to the ERE are shown in Figure S3. Most sites lie close to the ERE.

The commands used for running different tools are as below:

MEME ./meme inputfile -dna -mod zoops -nmotifs 10 -minw 6 -maxw 15 -revcomp -maxsize 1000000

Weeder ./weederlauncher inputfile HS medium S

Trawler perl bin\trawler.pl -dir_id somename -sample my_experiments\inputfile -background my_experiments\hg18_background.fa

Amadeus Two sets of parameters were tried:

(a) Running mode: Normal, Motif length: 10, Enrichment weight=1, Strand bias weight=0, Localization weight=0, Background sequences= hg18_background.fa

(b) Running mode: Normal, Motif length: 10, Enrichment weight=1, Strand bias weight=0, Localization weight=1 (TSS position provided as middle of the sequences), Background sequences= hg18_background.fa

LocalMotif Background Markov model order=2, Background file= hg18_background.fa, (l,d) = (6,1)+(7,1)+(8,2)+(9,2), stepsize=10, minwinsize=50, maxwinsize=-1, strand=double


Figure S3. Distribution of forkhead sites relative to the ER sites.

Only Amadeus and LocalMotif discovered the Forkhead motif. Amadeus discovered the motif when run with Localization weight=1. The table below shows the validation of the Forkhead motif consensus identified by LocalMotif. All Forkhead binding sites present within 200 bp of a known ER full or half binding site are listed with their locations in the original dataset of Carroll et al. (2005). Binding sites that contribute to the Forkhead consensus reported by LocalMotif are marked.

Sequence no.   Forkhead site   Location in sequence   Strand   Distance from ER site   Recognized by LocalMotif?
2              TTGTTTTCTT      30                     +        -28                     Yes
3              AAGTAAATAA      247                    -        197                     No
4              GTGTTTGCTT      209                    +        25                      No
4              TTGTTTACTT      521                    +        46                      Yes
5              AAAGAAACAA      1437                   -        -22                     Yes
6              TTGTTTCTTT      580                    +        47                      Yes
7              TTGTTTTTTT      1383                   +        48                      Yes
10             AAAGAAAGAA      428                    -        -98                     Yes
13             AAGGAAACAA      413                    -        22                      No
13             AAGGAAATAA      422                    -        13                      No
17             TTGTTTACAT      193                    +        -61                     No
21             AAGAAAATAA      1096                   -        -113                    Yes
23             TTGTTTATTT      197                    +        -181                    Yes
23             TTGTTTCCCT      247                    +        18                      Yes
23             AAACAAACAA      1062                   -        -11                     Yes
27             TTATTTGCTT      769                    +        78                      Yes
29             AAGGAAACAT      452                    -        -200                    No
32             CTGTTTGCTT      475                    +        153                     No
35             AAGCAAATAA      398                    -        185                     No
40             AAGCAAACAA      770                    -        -40                     No
42             TTGTTTGCTT      929                    +        -178                    No
42             TTGTTTTCTT      654                    +        97                      Yes
48             ATGTTTGCTT      231                    +        19                      No
52             TTATTTCCTT      331                    +        -169                    Yes
55             TTCTTTCTTT      356                    +        147                     Yes
55             TTGCTTGCTT      442                    +        61                      Yes
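As an illustration of the spatial confinement defined in Section A (observed fraction of sites in an interval minus the fraction expected under a uniform distribution), the distances tabulated above can be evaluated for a candidate interval. This is a sketch under assumptions: the region is taken as ±200 bp because only sites within that distance are listed, the ±50 bp interval is an arbitrary example, and the function name `spatial_confinement` is hypothetical. It computes the raw confinement, not the statistical SCS used by LocalMotif.

```python
def spatial_confinement(positions, interval, region):
    """Observed fraction of sites inside `interval` minus the fraction expected
    if the sites were spread uniformly over `region`.

    Both `interval` and `region` are (start, end) pairs in bp relative to the anchor.
    """
    lo, hi = interval
    region_lo, region_hi = region
    inside = sum(1 for p in positions if lo <= p <= hi)
    observed = inside / len(positions)
    expected = (hi - lo) / (region_hi - region_lo)
    return observed - expected


# Distances from the ER site, copied from the table above (26 sites).
distances = [-28, 197, 25, 46, -22, 47, 48, -98, 22, 13, -61, -113, -181,
             18, -11, 78, -200, 153, 185, -40, -178, 97, 19, -169, 147, 61]

if __name__ == "__main__":
    score = spatial_confinement(distances, interval=(-50, 50), region=(-200, 200))
    print(round(score, 3))
```

A positive value indicates that the interval holds more sites than a uniform spread over the region would predict, which is the sense in which the Forkhead sites lie close to the ERE.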

The Forkhead motif consensus derived from all 45 Forkhead binding sites reported in the original dataset of Carroll et al. (2005) is as follows:

Position   A    C    G    T    Consensus
1          2    2    1    40   T
2          2    0    0    43   T
3          3    15   20   7    G
4          5    9    3    28   T
5          0    0    0    45   T
6          0    0    0    45   T
7          6    8    7    24   T
8          7    21   9    8    C
9          2    1    0    42   T
10         2    0    0    43   T
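The consensus column of the table above follows from taking the most frequent nucleotide at each position. A minimal sketch of that step is given below; the function name `consensus` and the tie-breaking rule (earlier letter in A, C, G, T order) are assumptions, and no ties occur in this matrix.

```python
# Count matrix copied from the table above: one (A, C, G, T) row per motif position.
counts = [
    (2, 2, 1, 40),
    (2, 0, 0, 43),
    (3, 15, 20, 7),
    (5, 9, 3, 28),
    (0, 0, 0, 45),
    (0, 0, 0, 45),
    (6, 8, 7, 24),
    (7, 21, 9, 8),
    (2, 1, 0, 42),
    (2, 0, 0, 43),
]


def consensus(count_rows, alphabet="ACGT"):
    """Pick the most frequent nucleotide at each position.

    Ties go to the earlier letter of `alphabet` (an arbitrary choice).
    """
    return "".join(alphabet[row.index(max(row))] for row in count_rows)


if __name__ == "__main__":
    # Every position should account for all 45 binding sites.
    assert all(sum(row) == 45 for row in counts)
    print(consensus(counts))  # prints TTGTTTTCTT
```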

Sequences obtained from ChIP-Seq: Two ChIP-Seq datasets were studied. The first dataset was derived from the ChIP-Seq of 15 TFs in mouse embryonic stem cells (Chen, et al., 2008). For each of the 15 TFs, the ±200 bp sequences flanking the 1000 highest intensity peaks were analyzed using Trawler, Amadeus and LocalMotif. The parameters used to run the programs are as follows:

Trawler Two sets of parameters were tried:

(a) perl bin\trawler.pl -dir_id somename -sample my_experiments\inputfile -background my_experiments\hg18_background.fa

(b) perl bin\trawler.pl -dir_id somename -sample my_experiments\inputfile -background my_experiments\hg18_background.fa -clustering 0

Amadeus Two sets of parameters were tried:

(a) Running mode: Normal, Motif length: 10, Enrichment weight=1, Strand bias weight=0, Localization weight=0, Background sequences= hg18_background.fa

(b) Running mode: Normal, Motif length: 10, Enrichment weight=1, Strand bias weight=0, Localization weight=1 (TSS position provided as the peak center), Background sequences= hg18_background.fa

LocalMotif Background Markov model order=2, Background file= hg18_background.fa, (l,d) = (6,1)+(7,1)+(8,2)+(9,2), stepsize=50, minwinsize=100, maxwinsize=-1, strand=double

The following table shows the highest ranking motifs reported in the 15 sequence sets. The LocalMotif scores are also shown, each followed by its P-value in parentheses.


The following table gives the E-values of the similarity of the de novo motifs discovered by LocalMotif with the motifs published in the literature. The E-values are reported by the STAMP tool (Mahony et al. 2007):

Dataset     E-value
Cmyc        8.945E-06
Ctcf        4.828E-07
E2f1        1.402E-02
Esrrb       4.983E-07
Klf4        2.994E-04
Nanog       3.830E-02
Nmyc        6.570E-05
Oct-4       1.018E-07
P300        6.231E-01
Smad1       3.827E-02
Sox2        2.463E-04
Stat3       2.001E-10
Suz12       NA
Tcfcp2l1    4.844E-02
Zfx         1.198E-04
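To read the E-values above against a threshold, they can be filtered programmatically. The sketch below assumes the same maximum E-value cutoff of 0.001 that was imposed for the TRANSFAC comparisons; whether that cutoff was also applied to this table is not stated, and the names `stamp_evalues` and `passing` are hypothetical.

```python
# E-values copied from the table above; None marks the entry reported as NA.
stamp_evalues = {
    "Cmyc": 8.945e-06, "Ctcf": 4.828e-07, "E2f1": 1.402e-02,
    "Esrrb": 4.983e-07, "Klf4": 2.994e-04, "Nanog": 3.830e-02,
    "Nmyc": 6.570e-05, "Oct-4": 1.018e-07, "P300": 6.231e-01,
    "Smad1": 3.827e-02, "Sox2": 2.463e-04, "Stat3": 2.001e-10,
    "Suz12": None, "Tcfcp2l1": 4.844e-02, "Zfx": 1.198e-04,
}


def passing(evalues, cutoff=0.001):
    """Names whose E-value is available and does not exceed the cutoff."""
    return [name for name, e in evalues.items() if e is not None and e <= cutoff]


if __name__ == "__main__":
    hits = passing(stamp_evalues)
    print(len(hits), hits)
```

Under this assumed cutoff, nine of the fifteen datasets yield a de novo motif that matches the published motif.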

The second dataset consisted of the 1000 highest scoring peaks from the ChIP-Seq dataset of Welboren et al. (2009). LocalMotif was used to analyze the ±1 kb region around the peaks, while MEME, Weeder, Trawler and Amadeus were used to separately analyze the ±200 bp and ±500 bp regions around the peaks. The parameters used to run the programs are as follows:

MEME ./meme inputfile -dna -mod zoops -nmotifs 30 -minw 6 -maxw 15 -revcomp -maxsize 1000000

Weeder ./weederlauncher inputfile HS medium S

Trawler Two sets of parameters were tried:

(a) perl bin\trawler.pl -dir_id somename -sample my_experiments\inputfile -background my_experiments\hg18_background.fa

(b) perl bin\trawler.pl -dir_id somename -sample my_experiments\inputfile -background my_experiments\hg18_background.fa -clustering 0

Amadeus Two sets of parameters were tried:

(a) Running mode: Normal, Motif length: 10, Enrichment weight=1, Strand bias weight=0, Localization weight=0, Background sequences= hg18_background.fa

(b) Running mode: Normal, Motif length: 10, Enrichment weight=1, Strand bias weight=0, Localization weight=1 (TSS position provided as the peak center), Background sequences= hg18_background.fa

LocalMotif Background Markov model order=2, Background file= hg18_background.fa, (l,d) = (6,1)+(7,1)+(8,2)+(9,2), stepsize=50, minwinsize=100, maxwinsize=-1, strand=double

The distributions around the peaks of the motifs recognized by LocalMotif in the ERE ChIP-Seq dataset of Welboren et al. (2009) are given below.


The comparison of these motifs with the known TRANSFAC motifs as reported by the STAMP tool is shown below.

Motif      Nearest TRANSFAC motif        E-value
Motif1     PPARgamma-RXRalpha,_M00515    5.3848e-07
Motif4     GCNF_M00526                   1.0144e-06
Motif2     AR_M00481                     4.5944e-03
Motif8     Oct-1_M00162                  4.2964e-04
Motif13    T3R_M00963                    9.5568e-05
Motif14    p53_M00034                    2.3968e-04
Motif3     AP-1_M00173                   3.6780e-05
Motif12    NF-E2_M0003                   2.5965e-06
Motif5     SMAD4_M00733                  1.6375e-04
Motif6     Pax-6_M00979                  3.1682e-05
Motif16    AP-2alpha_M00469              3.0004e-05
Motif7     Croc_M00266                   6.9697e-05