

Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment

Charles E. Lawrence, Stephen F. Altschul, Mark S. Boguski, Jun S. Liu, Andrew F. Neuwald, John C. Wootton

A wealth of protein and DNA sequence data is being generated by genome projects and other sequencing efforts. A crucial barrier to deciphering these sequences and understanding the relations among them is the difficulty of detecting subtle local residue patterns common to multiple sequences. Such patterns frequently reflect similar molecular structures and biological properties. A mathematical definition of this "local multiple alignment" problem suitable for full computer automation has been used to develop a new and sensitive algorithm, based on the statistical method of iterative sampling. This algorithm finds an optimized local alignment model for N sequences in N-linear time, requiring only seconds on current workstations, and allows the simultaneous detection and optimization of multiple patterns and pattern repeats. The method is illustrated as applied to helix-turn-helix proteins, lipocalins, and prenyltransferases.

Patterns shared by multiple protein or nucleic acid sequences shed light on molecular structure, function, and evolution. The recognition of such patterns generally relies upon aligning many sequences, a complex, multifaceted research process whose difficulty has long been appreciated. This problem may be divided into "global multiple alignment" (1, 2), whose goal is to align complete sequences, and "local multiple alignment" (2-11), whose aim is to locate relatively short patterns shared by otherwise dissimilar sequences. We report a new algorithm for local multiple alignment that assumes no prior information on the patterns or their locations within the sequences; it determines these locations from only the information intrinsic to the sequences themselves. We focus on subtle amino acid sequence patterns that may vary greatly among different proteins.

C. E. Lawrence, S. F. Altschul, M. S. Boguski, A. F. Neuwald, and J. C. Wootton are with the National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894. J. S. Liu is with the Statistics Department, Harvard University, Cambridge, MA 02138. C. E. Lawrence is also affiliated with the Biometrics Laboratory, Wadsworth Center for Labs and Research, New York State Department of Health, Albany, NY 12201.

SCIENCE • VOL. 262 • 8 OCTOBER 1993

Much research on the alignment of such patterns uses additional information to supplement algorithmic analyses of the actual sequences, including data on three-dimensional structure, chemical interactions of residues, effects of mutations, and interpretation of sequence database search results. However, such research, which has led to many discoveries of sequence relations and structure and function predictions [see (12) for a recent example], is laborious and requires frequent input of expert knowledge. These approaches are becoming increasingly overwhelmed by the quantity of sequence data.

A number of automated local multiple alignment algorithms have been developed (2-11), and some have proved valuable as part of integrated software workbenches. Unfortunately, rigorous algorithms for finding optimal solutions have been so computationally expensive as to limit their applicability to a very small number of sequences, and heuristic approaches have gained speed by sacrificing sensitivity to [...] integrating some recent developments in statistics [...]

[...] small number of sequence elements [...] patterns, each consisting of one segment from each of the input [...] Second, a single pattern is [...] is described by a set of probabilistically inferred position variables. These features are derived from well-established principles of protein structure and knowledge of the sources of sequence pattern [...] recent common ancestry [...] easy to locate by various [...] including ours. In [...] concern is to locate the [...] that arise from the energetic interactions among residues or between residue and [...], irrespective of evolutionary history. The relation between a state's energy and its frequency forms the basis of statistical mechanics, and an analogous relation governs the frequencies of residues subject to random point mutations (14). Residue frequency models are therefore natural in the present context.

Third, genomic rearrangements, as well as insertions, deletions, and duplications of sequence segments, result in the occurrence of a common pattern at different positions within sequences. However, these mutational events are "unobserved" because no data directly specify their effects on the positions of the patterns (6). As recognized by statisticians since the 1970s (15), many problems with unobserved data are most easily addressed by pretending that critical missing data are available. The key "missing information principle" (15) is that the probabilities for the unobserved positions may be inferred through the application of Bayes theorem to the observed sequence data.

The optimization procedure we use is the predictive update version (16) of the Gibbs sampler (17). Strategies based on iterative sampling have been of great interest in statistics (18). The algorithm can be understood as a stochastic analog of expectation maximization (EM) methods previously used for local multiple alignment (6, 7). It yields a more robust optimization procedure and permits the integration of information from multiple patterns. In addition, a procedure for the automatic determination of pattern width has been developed. For clarity, we first describe the identification of a single pattern of fixed width within each input sequence and then generalize to variable widths and multiple patterns.

The basic algorithm. We assume that we are given a set of N sequences $S_1, \ldots, S_N$ and that we seek within each sequence mutually similar segments of specified width W. The algorithm maintains two evolving data structures. The first is the pattern description, in the form of a probabilistic model of residue frequencies for each position i from 1 to W, and consisting of the variables $q_{i,1}, \ldots, q_{i,20}$. This pattern description is accompanied by an analogous probabilistic description of the "background frequencies" $p_1, \ldots, p_{20}$ with which residues occur in sites not described by the pattern. The second data structure, constituting the alignment, is a set of positions $a_k$, for k from 1 to N, for the common pattern within the sequences. Our objective will be to identify the "best," defined as the most probable, common pattern. This pattern is obtained by locating the alignment that maximizes the ratio of the corresponding pattern probability to background probability.

The algorithm is initialized by choosing random starting positions within the various sequences. It then proceeds through many iterations to execute the following two steps of the Gibbs sampler:

1) Predictive update step. One of the N sequences, z, is chosen either at random or in specified order. The pattern description $q_{i,j}$ and background frequencies $p_j$ are then calculated, as described in Eq. 1 below, from the current positions $a_k$ in all sequences excluding z.

2) Sampling step. Every possible segment of width W within sequence z is considered as a possible instance of the pattern. The probabilities $Q_x$ of generating each segment x according to the current pattern probabilities $q_{i,j}$ are calculated, as are the probabilities $P_x$ of generating these segments by the background probabilities $p_j$. The weight $A_x = Q_x / P_x$ is assigned to segment x, and with each segment so weighted, a random one is selected (19). Its position then becomes the new $a_z$.
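The sampling step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `segment_weights` and `sample_position`, the dictionary layout of `q` and `p`, and the use of zero-based positions are all assumptions made for the example.

```python
import random

def segment_weights(seq, q, p, width):
    """Return the weight A_x = Q_x / P_x for every start position x in seq.

    q[i][r] is the pattern probability of residue r at pattern position i
    (hypothetical layout); p[r] is the background probability of residue r.
    """
    weights = []
    for x in range(len(seq) - width + 1):
        ratio = 1.0
        for i in range(width):
            residue = seq[x + i]
            # Q_x and P_x are products over positions, so their ratio is
            # the product of the per-position ratios.
            ratio *= q[i][residue] / p[residue]
        weights.append(ratio)
    return weights

def sample_position(seq, q, p, width, rng=random):
    """Draw a start position with probability proportional to A_x."""
    weights = segment_weights(seq, q, p, width)
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]
```

Note that the position is drawn at random in proportion to the weights rather than taken as the maximum; this is what distinguishes the sampler from a greedy update.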

This simple iterative procedure constitutes the basic algorithm. The central idea is that the more accurate the pattern description constructed in step 1, the more accurate the determination of its location in step 2, and vice versa. Given random positions $a_k$, in step 2 the pattern description $q_{i,j}$ will tend to favor no particular segment. Once some correct $a_k$ have been selected by chance, however, the $q_{i,j}$ begin to reflect, albeit imperfectly, a pattern extant within other sequences. This process tends to recruit further correct $a_k$, which in turn improve the discriminating power of the evolving pattern.

An aspect of the algorithm alluded to in step 1 above concerns the calculation of the $q_{i,j}$ from the current set of $a_k$. For the ith position of the pattern we have N - 1 observed amino acids, because sequence z has been excluded; let $c_{i,j}$ be the count of amino acid j in this position. Bayesian statistical analysis suggests that, for the purpose of pattern estimation, these $c_{i,j}$ should be supplemented with residue-dependent "pseudocounts" $b_j$ to yield pattern probabilities

$$q_{i,j} = \frac{c_{i,j} + b_j}{N - 1 + B} \qquad (1)$$

where B is the sum of the $b_j$. The $p_j$ are calculated analogously, with the corresponding counts taken over all nonpattern positions (20).
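As a small illustration of the predictive update, the following Python sketch computes pattern probabilities from counts plus pseudocounts. The function name `pattern_probabilities`, the input layout, and the pseudocount values used in any call are assumptions for the example; the choice of pseudocounts is not specified in this section.

```python
from collections import Counter

def pattern_probabilities(segments, pseudocounts):
    """Pattern probabilities q[i][j] = (c[i][j] + b[j]) / (N - 1 + B).

    segments: the N - 1 currently aligned segments (sequence z excluded),
              all of the same width W.
    pseudocounts: dict mapping each residue j to its pseudocount b[j];
                  B is the sum of these values. Every residue that can
                  occur is assumed to have an entry.
    """
    n_minus_1 = len(segments)
    total_b = sum(pseudocounts.values())
    width = len(segments[0])
    q = []
    for i in range(width):
        # c[i][j]: how often residue j occurs in column i of the alignment.
        counts = Counter(segment[i] for segment in segments)
        q.append({residue: (counts[residue] + b) / (n_minus_1 + total_b)
                  for residue, b in pseudocounts.items()})
    return q
```

Because of the pseudocounts, a residue never seen in a column still receives a small nonzero probability, so no segment gets a weight of exactly zero in the sampling step.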

After normalization, $A_x$ gives the probability that the pattern in sequence z belongs at position x. The algorithm finds the most probable alignment by selecting a set of $a_k$ that maximizes the product of these ratios. Equivalently, one may maximize F, the sum of the logarithms of these ratios. In the notation developed above, F is given by the formula

$$F = \sum_{i=1}^{W} \sum_{j=1}^{20} c_{i,j} \log \frac{q_{i,j}}{p_j} \qquad (2)$$

where the $c_{i,j}$ and $q_{i,j}$ are calculated from the complete alignment (Fig. 1).
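The score F can be evaluated for any candidate alignment, which is useful for comparing runs. The sketch below is one possible reading, with several assumptions: the name `alignment_score` is hypothetical, the natural logarithm is used (another base only rescales F), the pattern probabilities use all N sequences in the denominator because the alignment is complete, and the background probabilities are estimated with the same pseudocounts from all residues outside the aligned segments.

```python
import math
from collections import Counter

def alignment_score(sequences, starts, width, pseudocounts):
    """F = sum over i, j of c[i][j] * log(q[i][j] / p[j]) for a full alignment.

    Every residue in the sequences is assumed to have a pseudocount entry.
    """
    total_b = sum(pseudocounts.values())
    n = len(sequences)
    # Column-wise residue counts c[i][j] inside the aligned segments.
    columns = [Counter(seq[a + i] for seq, a in zip(sequences, starts))
               for i in range(width)]
    # Background counts over all positions outside the aligned segments.
    background = Counter()
    for seq, a in zip(sequences, starts):
        background.update(seq[:a] + seq[a + width:])
    n_background = sum(background.values())
    p = {r: (background[r] + b) / (n_background + total_b)
         for r, b in pseudocounts.items()}
    score = 0.0
    for counts in columns:
        for r, b in pseudocounts.items():
            q = (counts[r] + b) / (n + total_b)
            score += counts[r] * math.log(q / p[r])
    return score
```

On a toy data set in which every sequence contains the same short word, the alignment that lines up that word scores higher than a misaligned one.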


Phase shifts. One defect of the algorithm as just described is the "phase" problem. The strongest pattern may begin, for example, at positions 7, 19, 8, 23, and so forth within the various sequences. However, if the algorithm happens to choose $a_1 = 9$ and $a_2 = 21$ in an early iteration, it will then most likely proceed to choose $a_3 = 10$ and $a_4 = 25$. In other words, the algorithm can get locked into a nonoptimal "local maximum" that is a shifted form of the optimal pattern. This situation can be avoided by inserting another step into the algorithm (16). After every Mth iteration, for example, one may compare the current set of $a_k$ with sets shifted left and right by up to a certain number of letters. Probability ratios may be calculated, as above, for all possibilities, and a random selection is made among them with appropriate corresponding weights.
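The shift step can be sketched as follows. This is an illustration under stated assumptions rather than the published procedure: the names `shifted_alignments` and `sample_phase_shift` are hypothetical, `log_ratio` stands for any caller-supplied function that returns the logarithm of the probability ratio for a set of start positions, and shifts that would push a segment past either end of a sequence are simply skipped.

```python
import math
import random

def shifted_alignments(starts, lengths, width, max_shift):
    """Return (shift, shifted_starts) for each shift that keeps all segments in bounds."""
    candidates = []
    for shift in range(-max_shift, max_shift + 1):
        moved = [a + shift for a in starts]
        if all(0 <= a <= n - width for a, n in zip(moved, lengths)):
            candidates.append((shift, moved))
    return candidates

def sample_phase_shift(starts, lengths, width, max_shift, log_ratio, rng=random):
    """Pick one shifted alignment with probability proportional to its ratio."""
    candidates = shifted_alignments(starts, lengths, width, max_shift)
    scores = [log_ratio(moved) for _, moved in candidates]
    best = max(scores)
    # Subtract the maximum before exponentiating to avoid overflow.
    weights = [math.exp(s - best) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

With the example from the text, a sampler stuck at starts 9 and 21 can move to 7 and 19 in one draw if the alignment shifted left by two has a much larger ratio.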

Pattern width. The algorithm as so far described requires the pattern width to be input. It is possible, of course, to execute the algorithm with a range of plausible widths and then select the best result according to some criterion. One difficulty is that the function F is not immediately useful for this purpose, as its optimal value always increases with increasing width W.

The problem here corresponds to the well-known issue of model selection encountered in statistics. The difficulty stems from the change in the dimensionality and the additional freely adjustable parameters. Several criteria that incorporate the effects of variable dimension have been useful in other applications (21). Unfortunately, these criteria did not perform well at selecting those pattern widths that identified correct alignments in data sets with known solutions.

A superior criterion proved to be one based on the incomplete-data log-probability ratio G (22), which subtracts from the function F the information required to determine the location of the pattern in each of the input sequences. We found that dividing G by the number of free parameters needed to specify the pattern (19W in the case of proteins) produced a statistic useful for choosing pattern width. We call this quantity the information per parameter. The use of this empirical criterion is discussed in the examples section below and is illustrated in Figs. 2 and 3.
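The width criterion itself is simple arithmetic once G is available for each candidate width. The sketch below assumes that G has already been computed elsewhere (its formula is given in note 22, not here); the function names and the numeric values in the usage are hypothetical.

```python
def information_per_parameter(g, width, alphabet_size=20):
    """Divide G by the number of free parameters, (alphabet_size - 1) * width.

    For proteins (20 residues) this is 19 * W.
    """
    return g / ((alphabet_size - 1) * width)

def best_width(g_by_width, alphabet_size=20):
    """Return the width with the largest information per parameter.

    g_by_width maps each candidate width W to its converged G value.
    """
    return max(
        g_by_width,
        key=lambda w: information_per_parameter(g_by_width[w], w, alphabet_size),
    )
```

For example, with made-up G values of 456, 560, and 570 for widths 16, 18, and 20, the ratios are 1.5, about 1.64, and 1.5, so width 18 would be chosen.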

Multiple patterns. As described above, a pattern within a set of sequences can be described as consisting of several distinct elements separated by gaps. The Gibbs sampler may easily maintain several distinct patterns rather than a single one. Seeking several patterns simultaneously rather than sequentially allows information gained about one to aid the alignment of others. The relative positions of elements within the sequences can be used to improve their simultaneous alignment. Because only one element in sequence z is altered at a time, the combinatorial problem of joint positioning is circumvented. Nevertheless, because no element's position is permanently fixed, the best joint location of all elements may be identified.

Incorporating models of element location that favor consistent ordering (colinearity) and of element spacing that favor close packing accommodates insertions and deletions. Our implementation of a multielement version of the Gibbs sampler (23) includes ordering probabilities (24). As illustrated below, this joint information improves the prediction of the correct alignment of colinear elements. Constraints on loop length variation result in similarities in the spacing of the elements of homologous proteins. Thus, inclusion of an element spacing component in the model should improve alignment. However, we have not yet found it necessary to incorporate spacing effects into the algorithm (25).

Examples. To examine the algorithm, we have chosen three examples that present different classes of difficulties for automated multiple alignment. First is the helix-turn-helix (HTH) motif, which represents a large class of sequence-specific DNA binding structures involved in numerous cases of gene regulation. Such HTH motifs generally occur singly as local, isolated structures in different sequence contexts. Detection and alignment of HTH motifs is a well-recognized problem because of the great sequence variation compatible with the same basic structure. Second are the lipocalins, a family of proteins that bind small, hydrophobic ligands for a wide range of biological purposes. These proteins show widely spaced sequence motifs within highly variable sequences but share the same structural topology throughout the polypeptide chain. We chose the five most divergent lipocalins with known 3D structure for analysis because the correct alignment of their sequence motifs has previously depended on structural superposition. Third are isoprenyl-protein transferases, essential components of the cytoplasmic signal transduction network. The β subunits of these enzymes contain multiple copies of multiple motifs that have not previously been satisfactorily characterized by automated alignment methods.

Fig. 1. Alignment and probability ratio model for the helix-turn-helix pattern common to 30 proteins (45). (A) The alignment. Columns from left to right are: sequence name; locations $a_k$ of the left end of the common pattern in each sequence; aligned sequences, including residues flanking the 18-residue common pattern; right-end positions ($a_k + 17$) of the common pattern; NBRF/PIR accession number; and NBRF/PIR code name, if available. Asterisks (*) below the alignment indicate the 20-residue segment previously described on the basis of structural superpositions (26, 27). Almost equal values of information per parameter were given by pattern widths of 16 to 21 residues (Fig. 2); the longer widths extended to the right of the 18-residue pattern shown. (B) Probability ratios ($100 \times q_{i,j}/p_j$) for each amino acid at each position in the pattern model.

Single site: HTH proteins. The widespread DNA binding HTH structure comprises ~20 contiguous amino acids (26). In our test set of 30 proteins (Fig. 1), the correct location of the motif is known (26, 27) from x-ray and nuclear magnetic resonance structures, or from substitution mutation experiments, or both. The rest of the 3D structure of these proteins, apart from the HTH structure itself, is completely different in different subfamilies. Furthermore, the element is found at positions throughout the polypeptide chain. Our test set represents a typically diverse cross section of HTH sequences. Close homologs have been excluded. The difficulty of detection and alignment of the HTH motif from such sequences is well recognized. There have been several attempts to develop position-specific weight matrices and other empirical pattern discriminators diagnostic for this structure (28). These have achieved some success in making several predictions that were later confirmed and that have also aroused controversy (29).

We used this example to develop two important features of the algorithm. First, the empirical criterion of information per parameter allowed for the automated determination of element width (Fig. 2). Second, heuristic convergence criteria substantially shortened the time required to find the best model (Fig. 3, legend). These two features enabled the algorithm to identify and align all 30 HTH motifs quickly and consistently. Correct alignments were obtained with six pattern widths in the range from 17 to 22 residues (Fig. 2), of which 21 residues had the highest converged value of information per parameter. These results compare favorably with the 20-residue view based previously on structural superpositions (26, 27). The criteria developed empirically with the HTH example have worked consistently well in all of our subsequent applications.

Multiple sites: Lipocalins. The majority of protein sequence families contain multiple colinear elements separated by variable-length gaps (13). We have successfully aligned distantly related sequences for several problems in this class, including protein kinases, aspartyl proteinases, aminoacyl-tRNA ligases, and mammalian helix-loop-helix proteins. We report here on one of the most difficult of these test cases: in lipocalins (30, 31), two weak sequence motifs, centered on the generally conserved residues -Gly-X-Trp- and -Thr-Asp-, are recognized from structural comparisons (31, 32). The rest of the topologically conserved lipocalin folds have very different sequences. Conventional automated sequence alignment methods, although successful for selected subsets of the data [such as (33)], fail

Fig. 2. Information per parameter as the criterion of pattern width for helix-turn-helix (HTH) proteins. The points indicate the maximum values of information per parameter found by the algorithm. The upper points (▲ and +) used the complete sequences of the 30 HTH proteins listed in Fig. 1A. (▲) All of the sequences in the data set were aligned in the correct register (as in Fig. 1A). (+) One or more of the sequences in the data set were incorrectly aligned. All completely correct alignments in the width range from 17 to 22 residues gave greater values of information per parameter than any incorrect alignments outside this width range. (○) The "nonsites" sequence data of the 30 HTH proteins, constructed by deleting the 18 residues of the HTH pattern itself (Fig. 1A) from each of the sequences. (×) A shuffled data set (46) of the 30 HTH sequences. The alignments from the nonsites background of the HTH proteins give values slightly greater than random expectation. Axes: information per parameter against pattern width (15 to 30 residues).

Fig. 3. Convergence behavior of the Gibbs sampling algorithm. Because the Gibbs sampler, when run for finite time, is a heuristic rather than a rigorous optimization procedure, one cannot guarantee the optimality of the results it produces. Therefore, the best solution found in a series of runs will be called "maximal." A single pattern of width 18 residues was sought in the data set of 30 HTH proteins shown in Fig. 1A. Solid lines show the course of three independent runs with different random seeds. Evolving models in such runs rapidly reach intermediate "background" information values (1.0 to 1.2 bits per parameter) and then sample different models in this plateau region for a widely variable number of iterations before converging rapidly. Curve 1 is typical in showing a very short lag time on the plateau; longer lags as in curves 2 and 3 are less common. Curves 1 and 3 illustrate the stochastic behavior of the Gibbs sampler: once "converged," the model stays predominantly at the maximal value of 1.64 bits per parameter but is never permanently in this solution. In the infinite limit, the sampler will spend the plurality of its time on the pattern that maximizes F and therefore the information per parameter (22). Curve 2 demonstrates persistence (after escape from the background plateau) in a submaximal state (1.80 bits per parameter), which is a "phase-shifted" version of the best model. When sufficiently large stochastic phase shifts are allowed (see text), such states do not normally trap the evolving model for many iterations. Curve 3 reaches exactly the same maximal value as curve 1, suggesting one possible strategy for detecting convergence, namely, recurrence of exactly the same pattern with different seeds. The following heuristic approach was found to greatly reduce the time required to find the apparently optimal alignment: (i) For a given random seed, repeat the basic algorithm a fixed number of times (typically 10 times for each input sequence) beyond the last iteration in which the best pattern observed (with this seed) improved; and (ii) try at most some fixed number of seeds (usually 10 for a single element), but stop when the best pattern is reproduced by a specified number of different seeds (usually 2). The rationale underlying this approach is that it is unlikely for the identical suboptimal solution to be found on several independent trials before the optimal solution is found once. The dotted line represents a run on the nonsite data set (Fig. 2). Such runs never exceed 1.2 bits per parameter and thus are stuck permanently in states resembling the background plateaus from which the models of the HTH motif alignments eventually escape. Repeated runs in which shuffled sequences as input data are used can provide criteria for how strong a pattern is required to be considered significant (38). Horizontal axis: number of iterations (0 to 6000).

SCIENCE VOL. 262 8 OCTOBER 1993 211

to align these motifs for the full spectrum of lipocalin sequences. Challenged with five such diverse sequences of known crystal structure, our algorithm correctly aligned these two regions and extended the width of both to 16 residues (Fig. 4), in agreement with the structural evidence (31, 32).

Multiple copies of multiple sites: Prenyltransferases. Internal repeats in protein sequences underlie many important structures and functions and are more common than is generally recognized (34). These repeats are often obscured by sequence divergence following duplication, rendering their detection and characterization a challenging problem. The analysis of repeats is often labor-intensive, relying in part on visual inspection of "dot plots" (10, 34), a procedure that limits searches and surveys of large databases.

An example of recent interest involves sequence repeats in the subunits of the heterodimeric protein-isoprenyltransferases (10, 35). These enzymes are responsible for targeting and anchoring members of the ras superfamily of small guanosine triphosphatases to their sites of action on various cellular membranes (36). The β subunits of prenyltransferases contain a subtle internal repeat of possible functional significance (31). Although no direct structural information is yet available for these proteins, previous sequence analysis suggested that the subunit repeat consists of three motifs separated by variable-length gaps and that this entire tripartite structure is repeated three to five times in each of four proteins (10, 35).

The challenge here is therefore to identify a relatively large number of weak patterns covering up to 80 percent of the length of the sequences. The resulting crowding of elements increases interelement dependencies and the complexity of the joint probability surface over which the algorithm must find the most probable alignment.

The previous analysis was subjective and time-consuming, relying on the combined use of several different multiple alignment methods. In contrast, the Gibbs sampling algorithm quickly and objectively reproduced and extended the previous results (Fig. 5).

Evaluation and comparison. The main difficulties of automated local multiple alignment stem from the high dimensionality of the search space and the existence of many local optima. Here, the large search space is explored one dimension at a time by comparing each sequence to an evolving residue frequency model. Stochastic sampling permits the algorithm to escape local optima in which deterministic approaches may get trapped. Including a phase shift step expedites convergence by permitting the sampler to explore related local optima.
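The "one dimension at a time" exploration described above can be illustrated with a minimal site sampler for a single pattern with one site per sequence. This is a simplified sketch under stated assumptions rather than the authors' implementation: the function name `gibbs_site_sampler` is hypothetical, the background frequencies are taken from the whole data set with add-one smoothing, the pseudocount total is set to the square root of the number of sequences (the choice mentioned in note 20), and the phase shift step is omitted. Sequences are assumed to contain only the 20 standard one-letter residue codes and to be at least as long as the pattern width.

```python
import math
import random
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def gibbs_site_sampler(seqs, width, n_sweeps=200, seed=0):
    """Return one sampled start position per sequence for a pattern of given width."""
    rng = random.Random(seed)
    n = len(seqs)
    total = Counter("".join(seqs))
    n_res = sum(total.values())
    # Background frequencies p_j (add-one smoothing is an assumption of this sketch).
    p = {a: (total[a] + 1) / (n_res + len(ALPHABET)) for a in ALPHABET}
    big_b = math.sqrt(n)                       # total pseudocounts B
    b = {a: big_b * p[a] for a in ALPHABET}    # residue pseudocounts b_j = B * p_j
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(n_sweeps):
        for z in range(n):
            # Predictive update: pattern counts from every sequence except z.
            counts = [Counter() for _ in range(width)]
            for k in range(n):
                if k != z:
                    for i in range(width):
                        counts[i][seqs[k][starts[k] + i]] += 1
            q = [{a: (counts[i][a] + b[a]) / (n - 1 + big_b) for a in ALPHABET}
                 for i in range(width)]
            # Sampling step: weight each candidate segment by pattern/background odds.
            weights = []
            for x in range(len(seqs[z]) - width + 1):
                w = 1.0
                for i in range(width):
                    a = seqs[z][x + i]
                    w *= q[i][a] / p[a]
                weights.append(w)
            # Draw the new start with probability proportional to its weight.
            starts[z] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts
```

Because a new position is drawn at random in proportion to its weight, rather than always taking the highest-weight segment, the sketch can leave a poor alignment that a greedy update would keep.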


Tests showed the algorithm to be relatively insensitive to various numbers of negative examples included among the input sequences. To cope with large numbers of negative examples, we have extended the algorithm to seek a pattern in only a specified number of input sequences. The use of an appropriate model for interelement spacing would improve the algorithm's sensitivity, but this feature has

[Alignment panel: Motif A and Motif B segments of the five sequences.]

Fig. 4. Two motifs located automatically in five lipocalins of known crystal structure. The sequences, defined by SwissProt database codes, are, from top to bottom: Manduca sexta insecticyanin, bovine β-lactoglobulin, Pieris brassicae bilin-binding protein, human plasma retinol-binding protein, and mouse major urinary protein 2. Asterisks (*) below the alignment denote generally conserved residues recognized from structural comparisons (30, 31). The criterion of information per parameter (0.66 and 0.65 bits for motifs A and B, respectively) suggested an extended width of 16 residues for both motifs, in agreement with the superposable structures of the proteins in these regions (31, 32).

[Alignment panel: motifs A, L, and B for each prenyltransferase subunit.]

Fig. 5. Repeating motifs in prenyltransferase subunits. Ram1 (Swiss-Prot, accession number P22007) and FT-β (Swiss-Prot, Q02293) are the β subunits of farnesyltransferase from the yeast Saccharomyces cerevisiae and rat brain, respectively. Bet2 (PIR International, S22843) is the β subunit of type I geranylgeranyltransferase from S. cerevisiae. GGT-βI (GenBank, L10416) and Cdc43 (Swiss-Prot, P18898) are the β subunits of type II geranylgeranyltransferase from S. cerevisiae and rat brain, respectively. The primary structures of these proteins have been shown to contain a variable number of tripartite internal repeats, each of which is composed of "A" and "B" subdomains separated by a "linker region" containing multiple Gly and Pro residues (10, 35). When analyzed by the Gibbs sampler, these previously defined motifs were identified and additional copies were also observed [compare with figure 1 in (35)]. The information per parameter for motifs A, L, and B was 2.3, 2.3, and 2.4 bits, respectively. Dashes indicate the locations and extents of gaps between motifs; ellipses (. . .) accompanied by a number and the abbreviation "aa" indicate the locations and extents of larger gaps expressed as the number of amino acid residues. The spacing between motifs L and B is only two or three residues, whereas that between motifs A and L is greater and more variable.


not been needed to identify even the subtle patterns described above. The problem of highly correlated input sequences can be addressed by various weighting schemes (37), but we have yet to implement such a feature. Choosing an optimal number of elements requires further study. We have found that an additional element is not warranted when multiple random seeds lead to many different alignments and when the resulting information per parameter consistently fails to exceed that obtained from shuffled sequences (38). Prior knowledge concerning amino acid relationships (39) has been used profitably in pairwise protein sequence alignment as well as in pattern construction methods (8, 40). We have modified the Gibbs sampler to use such prior information, but in practice, for even moderate numbers of sequences (rj), we have not found it to yield any improvement. However, an interesting new approach to incorporating prior information has been described (41), and there is much room for further experimentation.
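The shuffled-sequence comparison mentioned above can be sketched as a simple Monte Carlo control. Everything named here is an assumption for illustration: `score_fn` stands for any hypothetical routine that runs the sampler on a set of sequences and returns the best information per parameter found, and the shuffle used is a plain within-sequence permutation, which may differ from the shuffling procedure of (46).

```python
import random

def shuffle_sequences(seqs, rng):
    """Permute residues within each sequence, preserving length and composition."""
    shuffled = []
    for s in seqs:
        residues = list(s)
        rng.shuffle(residues)
        shuffled.append("".join(residues))
    return shuffled

def shuffled_scores(seqs, score_fn, n_trials=20, seed=0):
    """Score n_trials independently shuffled copies of the data set."""
    rng = random.Random(seed)
    return [score_fn(shuffle_sequences(seqs, rng)) for _ in range(n_trials)]

def exceeds_shuffled(observed, null_scores):
    """True when the observed score is above every shuffled-data score."""
    return observed > max(null_scores)
```

Under this sketch, an additional element would be kept only if `exceeds_shuffled` holds for its information per parameter; the number of trials and the strict "above every shuffled score" rule are illustrative choices, not values taken from the text.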

Some basic similarities between our method and several earlier ones should be noted. Stormo and Hartzell and Hertz et al. (5) seek the pattern that maximizes a measure similar to F. Their method differs mainly in the heuristic optimization procedure used, which is an extension of an algorithm first proposed by Bacon and Anderson (4). We have implemented the method of (5) and tested it on a variety of examples. This approach uses only a small subset of the data for the early sequences examined, and thus is easily misled. As a result, the solution found was rarely as good as that produced by the sampler. Furthermore, the need to construct an alignment for each possible segment in the initial sequence requires on average more passes through the input data than does the sampler (see below), resulting in greater execution times.

Both EM methods (42) and the Gibbs sampler are built on a common statistical foundation. Two EM approaches for multiple alignment have been described, block-based methods (6, 7) and gap-based methods in the form of hidden Markov models (43). For multielement problems, the Gibbs sampler outperforms block-based EM methods. Because EM methods are forced to sum over all possibilities, the time complexity grows exponentially with additional elements. In contrast, the Gibbs sampler never needs to consider more than one element at a time. The speed of the sampler stems partly from the fact that it always deals with a specific model alignment rather than a weighted average. Also, because EM methods are deterministic, they tend to get trapped by local optima which are avoided by the sampler. Hidden Markov models, because they permit arbitrary gaps, have great flexibility in modeling patterns, but suffer the penalties of this added complexity discussed above.

Several other approaches to the local multiple alignment problem bear a brief review. Methods that seek a "consensus" word with the highest aggregate score against segments within the input sequences have been described (3). Their space requirements effectively limit them to protein patterns of six residues, and their time requirements effectively allow only closely related words to contribute to a consensus. These constraints greatly decrease the sensitivity of these methods to weak patterns.

Algorithms that compare all input sequences with one another and then coalesce consistent pairwise local alignments have been described (8-10). The MACAW algorithm (8) has comparable speed to the Gibbs sampler for a relatively small number of input sequences and can locate many distinct patterns in a single run. Its time complexity, however, is at least quadratic in the aggregate length of the input sequences, and it tends to be less sensitive to weak sequence patterns. The performance of methods that must compare all input sequences with one another may degrade as the number of sequences increases. In contrast, the power of the Gibbs sampler and EM methods increases with additional sequences because the pattern model is improved by more data. As illustrated above, the Gibbs sampler is successful even with a relatively small number of input sequences. A version of the Gibbs sampling algorithm has been added to the MACAW program (8), and the updated program is available upon request.

The memory requirements for the Gibbs sampler are negligible; storing the input sequences is usually the dominant space demand. When flexible halting criteria, such as those described in Fig. 3, are used, it is difficult to analyze the worst-case time complexity of the method. However, for typical protein sequence data sets, we have found that, for a single pattern width, each input sequence needs to be sampled on average fewer than T = 100 times before convergence. In the more time-consuming step 2 of the basic algorithm, approximately LW multiplications are performed, where L is the length of the sequence that has been removed from the model. Therefore, the total number of multiplications needed to execute the Gibbs sampler is approximately TNLW, where L is the average length of the N input sequences (44). The factor T is expected to grow with increasing L. However, experimentation suggests that T tends to decrease slowly with increasing N when the common pattern exists at roughly equal strength within the input sequences. Thus, linear time complexity has been observed in applications.
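As a worked check of the TNLW estimate, the short sketch below multiplies out the representative values quoted in note 44 for the HTH problem (T = 50, N = 30, W = 20). The average length L = 267 is an assumption chosen because it is consistent with the total of roughly 8,000,000 stated there; the function and parameter names are illustrative only.

```python
def gibbs_multiplications(samples_per_seq, n_seqs, avg_len, width):
    """Approximate multiplication count T * N * L * W for one pattern width."""
    return samples_per_seq * n_seqs * avg_len * width

# Representative HTH values from note 44; avg_len = 267 is assumed (see lead-in).
total = gibbs_multiplications(samples_per_seq=50, n_seqs=30, avg_len=267, width=20)
print(total)  # 8010000
```

Doubling the number of sequences at fixed T, L, and W doubles the estimate, which is the linear behavior described in the text.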


In conclusion, as illustrated by our examples, the Gibbs sampler objectively solves difficult multiple sequence alignment problems in a matter of seconds in the absence of any expert knowledge or ancillary information derived from three-dimensional structures or other sources. By adopting a randomized optimization procedure in the place of deterministic approaches, it is able to retain both speed and sensitivity to weak but biologically significant patterns.

REFERENCES AND NOTES

1. S. F. Altschul and D. J. Lipman, SIAM J. Appl. Math. 49, 197 (1989); W. Bains, Nucleic Acids Res. 14, 159 (1986); G. J. Barton and M. J. Sternberg, J. Mol. Biol. 198, 327 (1987); M. P. Berger and P. J. Munson, Comput. Appl. Biosci. 7, 479 (1991); H. Carrillo and D. Lipman, SIAM J. Appl. Math. 48, 1073 (1988); D. Feng and R. F. Doolittle, J. Mol. Evol. 25, 351 (1987); D. Gusfield, Bull. Math. Biol. 55, 141 (1993); C. M. Henneke, Comput. Appl. Biosci. 5, 141 (1989); D. G. Higgins, A. J. Bleasby, R. Fuchs, ibid. 8, 189 (1992); M. S. Johnson and R. F. Doolittle, J. Mol. Evol. 23, 267 (1986); D. J. Lipman, S. F. Altschul, J. D. Kececioglu, Proc. Natl. Acad. Sci. U.S.A. 86, 4412 (1989); M. Murata, J. S. Richardson, J. L. Sussman, ibid. 82, 3073 (1985); R. B. Russell and G. J. Barton, Proteins 14, 309 (1992); D. Sankoff, SIAM J. Appl. Math. 28, 35 (1975); S. Subbiah and S. C. Harrison, J. Mol. Biol. 209, 539 (1989); W. R. Taylor, Methods Enzymol. 183, 456 (1990); M. Vingron and P. Argos, Comput. Appl. Biosci. 5, 115 (1989); M. S. Waterman and M. D. Perlwitz, Bull. Math. Biol. 46, 567 (1984).

2. G. J. Barton and M. J. Sternberg, J. Mol. Biol. 212, 389 (1990); C. Chappey, A. Danckaert, P. Dessen, S. Hazout, Comput. Appl. Biosci. 7, 195 (1991); E. Depiereux and E. Feytmans, ibid. 8, 501 (1992); D. J. Parry-Smith and T. K. Attwood, ibid. 7, 233 (1991); ibid. 8, 451 (1992); E. Sobel and H. Martinez, Nucleic Acids Res. 14, 363 (1986).

3. J. Posfai, A. S. Bhagwat, G. Posfai, R. J. Roberts, Nucleic Acids Res. 17, 2421 (1989); C. Queen, M. N. Wegman, L. J. Korn, ibid. 10, 449 (1982); H. O. Smith, T. M. Annau, S. Chandrasegaran, Proc. Natl. Acad. Sci. U.S.A. 87, 826 (1990); R. Staden, Comput. Appl. Biosci. 5, 293 (1989); M. S. Waterman, R. Arratia, D. J. Galas, Bull. Math. Biol. 46, 515 (1984).

4. D. J. Bacon and W. F. Anderson, J. Mol. Biol. 191, 153 (1986).

5. G. D. Stormo and G. W. Hartzell III, Proc. Natl. Acad. Sci. U.S.A. 86, 1183 (1989); G. Z. Hertz, G. W. Hartzell III, G. D. Stormo, Comput. Appl. Biosci. 6, 81 (1990).

6. C. E. Lawrence and A. A. Reilly, Proteins 7, 41 (1990).

7. L. R. Cardon and G. D. Stormo, J. Mol. Biol. 223, 159 (1992).

8. G. D. Schuler, S. F. Altschul, D. J. Lipman, Proteins 9, 180 (1991).

9. M. Vingron and P. Argos, J. Mol. Biol. 218, 33 (1991).

10. M. S. Boguski, R. C. Hardison, S. Schwartz, W. Miller, New Biol. 4, 247 (1992).

11. N. N. Alexandrov, Comput. Appl. Biosci. 8, 339 (1992); S. Henikoff and J. G. Henikoff, Nucleic Acids Res. 19, 6565 (1991); S. Henikoff, New Biol. 3, 1148 (1991); M. Y. Leung, B. E. Blaisdell, C. Burge, S. Karlin, J. Mol. Biol. 221, 1367 (1991); M. A. Roytberg, Comput. Appl. Biosci. 8, 57 (1992).

12. P. Bork, C. Sander, A. Valencia, Proc. Natl. Acad. Sci. U.S.A. 89, 7290 (1992).

13. J. S. Richardson, Adv. Protein Chem. 34, 167 (1981); A. M. Lesk and C. Chothia, J. Mol. Biol. 136, 225 (1980); C. Chothia, Annu. Rev. Biochem. 53, 537 (1984); A. Sali and T. L. Blundell, J. Mol. Biol. 212, 403 (1990); R. F. Doolittle, Protein Sci. 1, 191 (1992).


14. F. M. Pohl, Nature New Biol. 234, 277 (1971); O. G. Berg and P. H. von Hippel, J. Mol. Biol. 193, 723 (1987); S. H. Bryant and C. E. Lawrence, Proteins 16, 92 (1993).

15. T. Orchard and M. A. Woodbury, Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability (Univ. of California Press, Berkeley, 1972), vol. 1, pp. 697-715; L. A. Goodman, Biometrika 61, 215 (1974).

16. J. Liu, Department of Statistics Research Report No. R-426 (Harvard Univ. Press, Cambridge, MA, 1992).

17. In the context of simulated annealing [N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, E. Teller, J. Chem. Phys. 21, 1087 (1953); S. Kirkpatrick, C. D. Gelatt, M. P. Vecchi, Science 220, 671 (1983)], S. Geman and D. Geman [IEEE Trans. Pattern Anal. Mach. Intell. 6, 721 (1984)] introduced Gibbs sampling as a particular stochastic relaxation step. They chose its name because a key theorem from statistical physics (the Hammersley-Clifford Theorem) uses Gibbs/Boltzmann potentials. Because we use residue frequencies based on Gibbs/Boltzmann-like free energies, the name is more appropriate here than in many other applications.

18. M. Tanner and W. H. Wong, J. Am. Stat. Assoc. 82, 528 (1987); A. E. Gelfand and A. F. M. Smith, ibid. 85, 398 (1990).

19. Segment x is chosen with probability $A_x / \sum_{x'} A_{x'}$, where the sum is taken over all possible segments.

20. One could choose $q_{i,j}$ simply proportional to $c_{i,j}$, but this would imply a zero probability for any amino acid not actually observed. This difficulty may be surmounted through the use of Bayesian predictive inference (18). Bayesian analysis makes use of subjective "prior probabilities" for the values of the parameters to be estimated. A common choice for such priors when multinomial models are involved is the Dirichlet distribution [J. Aitchison and I. R. Dunsmore, Statistical Prediction Analysis (Cambridge Univ. Press, New York, 1975)], which results in the simple addition of "pseudocounts" to the observed counts, as given in Eq. 1. These pseudocounts should capture a priori expectations concerning the occurrence of the various letters in different pattern positions and may vary from one application to another. We have found that letting $b_j = B p_j$, where $p_j$ is the frequency of residue j in the complete data set, is effective. We used this noninformative prior probability in all of our applications. The total number of pseudocounts B is also part of the prior specification, for which there is no "correct" choice. The smaller B is taken, the greater the reliance placed on current observation vis-a-vis a priori expectation. We have found that choosing $B = \sqrt{N}$ generally works well, yielding pseudocounts $b_j$ applicable to the calculation of both q and p.

21. H. Akaike, IEEE Trans. Autom. Control 19, 716 (1974); G. Schwarz, Ann. Stat. 6, 461 (1978).

22. G is equal to

$$G = F - \sum_{i=1}^{N} \log L'_i + \sum_{i=1}^{N} \sum_{j=1}^{L'_i} Y_{i,j} \log Y_{i,j},$$

where $L'_i$ is the number of possible positions for the pattern within sequence i, and $Y_{i,j}$ is the normalized weight of position j, that is, its weight $Q_x/P_x$ divided by the sum of these weights within sequence i. This adjustment accounts for the fact that the position of the pattern within each sequence is not known [see (6) and R. J. A. Little and D. B. Rubin, Statistical Analysis with Missing Data (Wiley, New York, 1987)]. It can be shown that G increases monotonically with increasing F, so an optimization algorithm for F remains appropriate.

23. A version of the Gibbs sampling procedure for locating patterns within multiple sequences has been implemented in the C programming language, and is available from the authors upon request.

24. During each iteration, the orders of all elements in N - 1 of the input sequences are available. Counts of observed orders are combined with prior probabilities (taken as uniform in our applications) to calculate "posterior" order probabilities, analogously to Eq. 1. When choosing an element's new position, the weights $A_x$ of all candidate segments may be adjusted naturally by multiplication with these posterior order probabilities. A probabilistic model of element location based jointly on the observed order and residue frequency is produced. In our applications we have set the total number of order pseudocounts to NST/k, where N is the number of sequences, S is the number of sites per sequence, T is the number of types of element, and k has been chosen as 20. Details of the ordering model will be described elsewhere (J. S. Liu et al., in preparation).

25. We have developed and initially tested a stochastic multiple alignment procedure which permits complete flexibility in gaps. It uses the same Markov characteristic that is used in dynamic programming to align two sequences. To date, we have found that the increase in gap flexibility permitted by this algorithm is not worth the cost of the increase in noise from chance local optima.

26. R. G. Brennan and B. W. Matthews, J. Biol. Chem. 264, 1903 (1989); C. O. Pabo and R. T. Sauer, Annu. Rev. Biochem. 61, 1053 (1992); J. Treisman, E. Harris, D. Wilson, C. Desplan, Bioessays 14, 145 (1992).

27. D. Kostrewa et al., J. Mol. Biol. 226, 209 (1992); R. M. Lamerichs et al., Proc. Natl. Acad. Sci. U.S.A. 86, 6863 (1989); M. A. Schell, P. H. Brown, S. Raju, J. Biol. Chem. 265, 3844 (1990); A. Contreras and M. Drummond, Nucleic Acids Res. 16, 4025 (1988); M. H. Drummond, A. Contreras, L. A. Mitchenall, Mol. Microbiol. 4, 29 (1990); L. M. Shewchuk et al., Biochemistry 28, 2340 (1989); H. S. Yuan et al., Proc. Natl. Acad. Sci. U.S.A. 88, 9558 (1991); A. Brunelle and R. Schleif, J. Mol. Biol. 209, 607 (1989); S. Spiro et al., Mol. Microbiol. 4, 1831 (1990); C. S. Barbier and S. A. Short, J. Bacteriol. 174, 2881 (1992); M. D. Yudkin, J. H. Millonig, L. Appleby, Mol. Microbiol. 3, 257 (1989); C. Berens, L. Altschmied, W. Hillen, J. Biol. Chem. 267, 1945 (1992); R. J. Rolfes and H. Zalkin, ibid. 263, 19653 (1988).

28. M. Drummond, P. Whitty, J. C. Wootton, EMBO J. 5, 441 (1986); I. B. Dodd and J. B. Egan, Nucleic Acids Res. 18, 5019 (1990); D. Akrigg et al., Comput. Appl. Biosci. 8, 295 (1992).

29. M. D. Yudkin, Protein Eng. 1, 371 (1987); I. B. Dodd and J. B. Egan, ibid. 2, 174 (1988).

30. A. Nagata et al., Proc. Natl. Acad. Sci. U.S.A. 88, 4020 (1991); M. C. Peitsch and M. S. Boguski, New Biol. 2, 197 (1990).

31. S. W. Cowan, M. E. Newcomer, T. A. Jones, Proteins 8, 44 (1990).

32. L. Sawyer, Nature 327, 659 (1987); A. C. T. North, Int. J. Biol. Macromol. 11, 56 (1989); Z. Bocskei et al., Nature 360, 186 (1992); H. M. Holden, W. R. Rypniewski, J. H. Law, I. Rayment, EMBO J. 6, 1565 (1987); R. Huber et al., J. Mol. Biol. 198, 499 (1987); H. L. Monaco and G. Zanotti, Biopolymers 32, 457 (1992); M. Z. Papiz et al., Nature 324, 383 (1986); D. R. Flower, A. C. T. North, T. K. Attwood, Protein Sci. 2, 753 (1993).

33. M. S. Boguski, J. Ostell, D. J. States, in Protein Engineering: A Practical Approach, A. R. Rees, M. J. E. Sternberg, R. Wetzel, Eds. (IRL Press, Oxford, 1992), pp. 57-88.

34. D. J. States and M. S. Boguski, in Sequence Analysis Primer, M. Gribskov and J. Devereux, Eds. (Freeman, New York, 1991), pp. 141-148.

35. M. S. Boguski, A. W. Murray, S. Powers, New Biol. 4, 408 (1992).

36. S. Clarke, Annu. Rev. Biochem. 61, 355 (1992); W. R. Schafer and J. Rine, Annu. Rev. Genet. 26, 209 (1992).

37. S. F. Altschul, R. J. Carroll, D. J. Lipman, J. Mol. Biol. 207, 647 (1989); M. Vingron and P. R. Sibbald, Proc. Natl. Acad. Sci. U.S.A. 90, 8777 (1993).

38. When sufficient data are available, it is possible to derive an asymptotic test for the statistical significance of any pattern found. In practice, sufficient data are not usually available for this asymptotic test, especially for protein sequences. Given the speed of the algorithm, Monte Carlo tests based on shuffled sequences [S. F. Altschul and B. W. Erickson, Mol. Biol. Evol. 2, 526 (1985)] provide a viable alternative. In any case, Monte Carlo tests are superior under many circumstances to asymptotic tests [P. Hall and D. M. Titterington, J. R. Stat. Soc. B 51, 459 (1989)].

39. M. O. Dayhoff, R. M. Schwartz, B. C. Orcutt, in Atlas of Protein Sequence and Structure, M. O. Dayhoff, Ed. (National Biomedical Research Foundation, Washington, DC, 1978), vol. 5, suppl. 3, pp. 345-352; R. M. Schwartz and M. O. Dayhoff, in ibid., pp. 353-358; S. F. Altschul, J. Mol. Biol. 219, 555 (1991); S. Henikoff and J. G. Henikoff, Proc. Natl. Acad. Sci. U.S.A. 89, 10915 (1992); D. F. Feng, M. S. Johnson, R. F. Doolittle, J. Mol. Evol. 21, 112 (1985).

40. M. Gribskov, A. D. McLachlan, D. Eisenberg, Proc. Natl. Acad. Sci. U.S.A. 84, 4355 (1987); R. F. Smith and T. F. Smith, ibid. 87, 118 (1990).

41. M. Brown et al., in Proceedings of the First International Conference on Intelligent Systems for Molecular Biology, L. Hunter, D. Searls, J. Shavlik, Eds. (AAAI Press, Menlo Park, CA, 1993), pp. 47-55.

42. A. P. Dempster, N. M. Laird, D. B. Rubin, J. R. Stat. Soc. B 39, 1 (1977).

43. D. Haussler, A. Krogh, S. Mian, K. Sjolander, CIS Tech. Rep. No. UCSC-CRL-92-23 (University of California, Santa Cruz, 1992).

44. For a representative data set, such as that from the HTH problem studied above, T = 50, N = 30, L = 267, and W = 20, yielding ~8,000,000 iterations of the innermost loop. The number of possible solutions for such a problem is >los. Executed on an Indigo R4000 workstation (Silicon Graphics Corp.), the HTH example (Figs. 1 to 3) required 2 CPU seconds on average, while the examples of Figs. 4 and 5 required 3 and 31 CPU seconds, respectively.

45. Abbreviations for the amino acid residues are: A, Ala; C, Cys; D, Asp; E, Glu; F, Phe; G, Gly; H, His; I, Ile; K, Lys; L, Leu; M, Met; N, Asn; P, Pro; Q, Gln; R, Arg; S, Ser; T, Thr; V, Val; W, Trp; and Y, Tyr.

46. J. C. Wootton and S. Federhen, Comput. Chem. 17, 149 (1993).

47. J.S.L. is supported in part by NIMH grant MH-31-154.

14 May 1993; accepted 3 September 1993