Multiple Alignments and Multivariate Analysis Clustal: 1988-2006


Page 1: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

Multiple Alignments and Multivariate Analysis

Clustal: 1988-2006

Page 2: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

Human beta       --------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST
Horse beta       --------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN
Human alpha      ---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-
Horse alpha      ---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-
Whale myoglobin  ---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
Lamprey globin   PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
Lupin globin     --------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE
                           *:  :   :   *  .           :  .:   * :   *  :   .

Human beta       PDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTFATLSELHCDKLHVDPENFRL
Horse beta       PGAVMGNPKVKAHGKKVLHSFGEGVHHLDN-----LKGTFAALSELHCDKLHVDPENFRL
Human alpha      ----HGSAQVKGHGKKVADALTNAVAHVDD-----MPNALSALSDLHAHKLRVDPVNFKL
Horse alpha      ----HGSAQVKAHGKKVGDALTLAVGHLDD-----LPGALSNLSDLHAHKLRVDPVNFKL
Whale myoglobin  EAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAELKPLAQSHATKHKIPIKYLEF
Lamprey globin   ADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKLRDLSGKHAKSFQVDPQYFKV
Lupin globin     VP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVAD-AHFPV
                       . .:: *.  :   .                  :  *.  *  .       : .

Human beta       LGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH------
Horse beta       LGNVLVVVLARHFGKDFTPELQASYQKVVAGVANALAHKYH------
Human alpha      LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR------
Horse alpha      LSHCLLSTLAVHLPNDFTPAVHASLDKFLSSVSTVLTSKYR------
Whale myoglobin  ISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG
Lamprey globin   LAAVIADTVAAG---D------AGFEKLMSMICILLRSAY-------
Lupin globin     VKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA---
                 :   :  .:      .      ..       .   :

Multiple Alignments

• Phylogenetic Analysis
• Secondary Str. Prediction
• Homology Detection
• Profile Analysis
• Homology Modeling

Page 3: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNP
-VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSA
  * *  *  * * ****     * * *** *     * *   *  * ***      *

KVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHF
QVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHL
 ** *****  *     ** *        ** **  ** *** ** **   *   ** *

GKEFTPPVQAAYQKVVAGVANALAHKYH
PAEFTPAVHASLDKFLASVSTVLTSKYR
  **** * *   *  * *   *  **

Dynamic Programming
• Needleman and Wunsch, 1970
• O(L^2) algorithm

Maximise score (or minimise distance)
• Gap penalties
• Amino acid weight matrix
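The dynamic programming idea on this slide can be sketched in a few lines. This is a minimal illustration, not the scoring used in the talk: it assumes a flat match/mismatch score in place of an amino acid weight matrix and a single linear gap penalty, and the function name and default values are made up for the example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming; returns (score, row_a, row_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]; O(L^2) cells.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align two residues
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("VHLTPEEK", "VLSPADK"))
```

Swapping the flat `match`/`mismatch` values for a lookup in a weight matrix would bring the sketch closer to what the slide describes.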

Page 4: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006


Weighted Sums of Pairs: WSP

$$\mathrm{WSP} = \sum_{i=2}^{N} \sum_{j=1}^{i-1} W_{ij} D_{ij}$$

Time O(L^N)
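As a rough sketch of how a weighted sum of pairs could be evaluated for a given alignment: the pairwise distance used here (one unit per mismatch, two per residue-against-gap column) is an assumption standing in for whatever D_ij the scoring scheme defines, and both function names are hypothetical.

```python
def pair_distance(row_a, row_b, mismatch=1.0, gap=2.0):
    """Illustrative D_ij: count mismatches and gap columns between two aligned rows."""
    d = 0.0
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            continue                      # columns gapped in both rows cost nothing
        if x == "-" or y == "-":
            d += gap
        elif x != y:
            d += mismatch
    return d


def weighted_sum_of_pairs(rows, weights=None):
    """Sum W_ij * D_ij over all pairs with j < i; weights default to 1."""
    total = 0.0
    for i in range(1, len(rows)):
        for j in range(i):
            w = 1.0 if weights is None else weights[i][j]
            total += w * pair_distance(rows[i], rows[j])
    return total


if __name__ == "__main__":
    print(weighted_sum_of_pairs(["AC-T", "ACGT", "TCGT"]))
```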

Page 5: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006


Weighted Sums of Pairs: WSP

$$\mathrm{WSP} = \sum_{i=2}^{N} \sum_{j=1}^{i-1} W_{ij} D_{ij}$$

Sequences   Time
2           1 second
3           150 seconds
4           6.25 hours
5           39 days
6           16 years
7           2404 years

Time O(L^N)
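The growth in the table is what O(L^N) predicts if each extra sequence multiplies the work by L. The sketch below assumes L = 150 (inferred from the jump from 1 second to 150 seconds) and a 1 second baseline for two sequences; both values and the function name are assumptions for illustration.

```python
def exact_alignment_seconds(n_sequences, length=150, seconds_for_two=1.0):
    """Estimated run time if work scales as length ** n_sequences."""
    return seconds_for_two * length ** (n_sequences - 2)


if __name__ == "__main__":
    for n in range(2, 8):
        seconds = exact_alignment_seconds(n)
        print(n, seconds, "seconds =", seconds / 86400, "days")
```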

Page 6: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

[Figure: the seven-globin alignment shown with a tree; leaves: Human beta, Horse beta, Human alpha, Horse alpha, Whale myoglobin, Lamprey cyanohaemoglobin, Lupin leghaemoglobin]

Progressive Alignment:
• Feng and Doolittle, 1987
• Barton and Sternberg, 1987
• Willie Taylor, 1987, 1988
• Hogeweg and Hesper, 1984
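Progressive alignment follows a tree built from pairwise distances, joining the closest sequences first. The sketch below derives such a joining order with average-linkage clustering; that choice, the function names, and the distance dictionary format are assumptions made for brevity, and the cited methods differ in their details.

```python
from itertools import combinations


def average_distance(members_a, members_b, dist):
    """Mean pairwise distance between two clusters of sequence names."""
    total = sum(dist[frozenset((a, b))] for a in members_a for b in members_b)
    return total / (len(members_a) * len(members_b))


def guide_tree(names, dist):
    """Return a nested tuple giving the order in which to align sequences."""
    clusters = [(name, [name]) for name in names]      # (subtree, leaf names)
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest average distance.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: average_distance(clusters[p[0]][1],
                                                  clusters[p[1]][1], dist))
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


if __name__ == "__main__":
    d = {frozenset(("A", "B")): 1, frozenset(("A", "C")): 4,
         frozenset(("B", "C")): 5}
    print(guide_tree(["A", "B", "C"], d))
```

Each inner tuple is a pair to align first; the outer levels are then aligned as groups, working towards the root.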

Page 7: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006


Page 8: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006


Page 9: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006


Page 10: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

Clustal
• 35000 citations
• Clustal1-Clustal4 1988
  – Paul Sharp, Dublin
• Clustal V 1992
  – EMBL Heidelberg
    • Rainer Fuchs
    • Alan Bleasby
• Clustal W 1994-2006, Clustal X 1997-2006
  – Toby Gibson, EMBL, Heidelberg
  – Julie Thompson, ICGEB, Strasbourg
• Clustal W and Clustal X 2.0 early 2007
  – University College Dublin

Page 11: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006
Page 12: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

Since 1994?

Benchmarks

Protein structure alignments and superpositions
• Barton and Sternberg; Fitch and McClure
• Dali
• BaliBase
• Homstrad
• Oxbench
• Prefab etc. etc.

Protein structure analysis
• APDB: O'Sullivan O, Zehnder M, Higgins D, Bucher P, Grosdidier A, Notredame C. (2003) APDB: a novel measure for benchmarking sequence alignment methods without reference alignments. Bioinformatics. 19 Suppl 1:i215-21.

RNA alignments
• Bralibase (Gardner PP, Wilm A & Washietl S (2005) NAR.)

Page 13: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

Which Method is Best?

• Clustal W????

• MSA (Lipman, Altschul, Kececioglu)

• DCA (Stoye), PRRP (Gotoh) , SAGA (Notredame)

• Probcons (Do, Brudno, Batzoglou)

• T-Coffee (Notredame)

• 3-D Coffee M-Coffee

• MAFFT (Katoh) and MUSCLE (Edgar)

For Global Protein alignments!!!

Page 14: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

Clustal W and X 2.0?

• Jan 2007
• Re-engineered in C++
• Aim to increase accuracy
  – Iteration (Wallace, I. M., O'Sullivan, O. and Higgins, D. G., 2005 Evaluation of iterative alignment algorithms for multiple alignment. Bioinformatics 21:1408.)
• Reduce run times

Page 15: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

Multivariate Analysis?

Page 16: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

ADE-4 http://pbil.univ-lyon1.fr/ADE-4/

Thioulouse J., Chessel D., Dolédec S., & Olivier J.M. (1997) ADE-4: a multivariate analysis and graphical display software. Statistics and Computing, 7, 1, 75-83.

Page 17: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

• MADE4 – Culhane, A., Thioulouse, J., Perriere, G., Higgins, D.G. (2005) MADE4: an R package for multivariate analysis of gene expression data. Bioinformatics. 21(11):2789-2790.

Between Group Analysis BGA Dolédec, S. & Chessel, D. (1987) Acta Oecologica, Oecologica Generalis, 8, 3, 403-426.

Supervised Correspondence Analysis or PCA

CO-Inertia Analysis CIA
Dolédec, S. & Chessel, D. (1994) Freshwater Biology, 31, 277-294.
Thioulouse, J. & Lobry, J.R. (1995) CABIOS, 11, 321-329.

2 datasets; Simultaneous CA or PCA

Page 18: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

Use CA, PCA for Sequences?

PCOORD on sequence distances: Higgins, D.G. (1992) Sequence ordinations: a multivariate analysis approach to analysing large sequence data sets. CABIOS, 8, 15-22.

PCA on dipeptide composition: Van Heel, M. (1991) A new family of powerful multivariate statistical sequence analysis techniques. J. Mol. Biol. 220(4): 877-887.

PCA on alignment columns: Casari G, Sander C, Valencia A. (1995) A method to predict functional residues in proteins. Nat Struct Biol. 2(2):171-8.

Page 19: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

Supervised PCA or CA?

Malate Dehydrogenases

Lactate Dehydrogenases

Page 20: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

Between Group Analysis

GSVD

samples

genes

N

Page 21: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

[Figure: Between Group Analysis of trypsin-like serine proteases (15 Chymotrypsins, 31 Trypsins, 10 Elastases). Panels: sequences (d = 0.05) with groups Chymotrypsin, Elastase and Trypsin; residues (d = 0.1); Eigenvalues.]

Page 22: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

[Figure: the BGA plot of the trypsin-like serine proteases repeated, with the label Trypsin. Panels: sequences (d = 0.05); residues (d = 0.1); Eigenvalues.]

Page 23: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006


Page 24: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006
Page 25: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

BGA With CA or PCA?

• CA:
  – Pretty pictures
  – Sequences/residues plots
  – Finds any clear/simple patterns
    • Binary aa variables
• PCA:
  – Use continuous variables
    • e.g. aa properties: size, charge, hydrophobicity etc.
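One way to picture the binary amino acid variables used for CA: each alignment position and residue observed there becomes a 0/1 column. The sketch below builds such a table from aligned rows; the label format (X, then position, then residue) and the function name are assumptions for illustration, and gaps are simply left uncoded.

```python
def binary_residue_table(rows):
    """Return (labels, table) with table[s][k] = 1 if sequence s has residue k."""
    variables = []
    for pos in range(len(rows[0])):
        # One variable per residue type observed at this position (gaps skipped).
        for aa in sorted({row[pos] for row in rows} - {"-"}):
            variables.append((pos, aa))
    labels = [f"X{pos + 1}{aa}" for pos, aa in variables]
    table = [[1 if row[pos] == aa else 0 for pos, aa in variables]
             for row in rows]
    return labels, table


if __name__ == "__main__":
    labels, table = binary_residue_table(["AC", "AG", "-C"])
    print(labels)
    for line in table:
        print(line)
```

For PCA with continuous variables, the 0/1 entries would be replaced by property values such as size or charge for the residue at each position.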

Page 26: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

[Figure: BGA with PCA using 5 amino acid properties (A-E), for 31 Trypsins, 15 Chymotrypsins and 10 Elastases. Panels: Sequences (d = 10) with groups Chymotrypsin, Elastase and Trypsin; Residue weights (d = 0.5); Eigenvalues.]

Page 27: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006
Page 28: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

BGA on Alignments

• Focus on any split in the data
• Binary or Property coding
  – CA or PCA
• Sequence Weighting
• Pseudocounts

Page 29: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

BGA, CIA, MADE4
• Aedín Culhane
• Guy Perriere
• Jean Thioulouse
• Ian Jeffery
• Ailís Fagan

Clustal
• Toby Gibson, EMBL
• Julie Thompson, ICGEB, Strasbourg

Iteration, Benchmarking, Clustal W 2.0
• Gordon Blackshields
• Mark Larkin
• Paul McGettigan
• Iain Wallace

Page 30: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006
Page 31: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

SeqA GARFIELD THE LAST FAT CAT

SeqB GARFIELD THE FAST CAT

SeqC GARFIELD THE VERY FAST CAT

SeqD THE FAT CAT

SeqA GARFIELD THE LAST FA-T CAT
SeqB GARFIELD THE FAST CA-T ---
SeqC GARFIELD THE VERY FAST CAT
SeqD -------- THE ---- FA-T CAT

Page 32: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

Weighted Sums of Pairs

$$\mathrm{WSP} = \sum_{i=2}^{N} \sum_{j=1}^{i-1} W_{ij} D_{ij}$$

• MSA: Branch and Bound. Lipman, Altschul and Kececioglu, 1989
• FastMSA: Tweaked MSA. Gupta, Kececioglu and Schaeffer, 1995
• DCA: Divide and Conquer. Stoye, Moulton and Dress, 1997
• SAGA: Genetic Algorithm. Notredame and Higgins, 1996
• PRRP: Iteration. Gotoh, 1996

Page 33: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

Genetic Algorithm

Mutation

Recombination (cross-overs)

Selection (WSP)

Page 34: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006


Page 35: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006


Page 36: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

SAGA

• Cedric Notredame
• Sequence Alignment by Genetic Algorithm
• Optimise any objective function
• Notredame, C. and Higgins, D.G. (1996) SAGA: Sequence alignment by genetic algorithm. Nucleic Acids Research, 24:1515-1524.

Page 37: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

Structure Test Cases

                                 MSA                                     SAGA
Test case    N seqs  Length   Score WSP  Structure match %  CPU-time   Score WSP  Structure match %  CPU-time
Cytc         6       129      1051257    74                 7          1051257    74                 960
GCR          8       60       371875     75                 3          371650     82                 75
Ac Protease  5       183      379997     80                 13         379997     80                 331
S Protease   6       280      574884     91                 184        574884     91                 3500
Chtp         6       247      111924     -                  4525       111579     -                  3542
Dfr secstr   4       189      171979     82.03              5          171975     82.50              411
Sbt          4       296      271747     80                 7          271747     80                 210
Globin       7       167      659036     94                 7          659036     94                 330
Plasto       5       132      236343     54.03              22         236195     54.05              510

Page 38: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006


Page 39: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

Which method is best?
• Best score?
• Empirical tests?
  – Sets of test cases
    • Fitch and McClure
    • BaliBase
    • Homstrad
    • Oxbench
    • Prefab etc. etc.
  – APDB: O'Sullivan O, Zehnder M, Higgins D, Bucher P, Grosdidier A, Notredame C. (2003) APDB: a novel measure for benchmarking sequence alignment methods without reference alignments. Bioinformatics. 19 Suppl 1:i215-21.

Page 40: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

COFFEE

• Consistency based Objective Function For Evaluation of Ehhhh things
• Maximum Weight Trace (John Kececioglu)
• Maximise similarity to a LIBRARY of residue pairs
• Notredame, C., Holm, L. and Higgins, D.G. (1998) COFFEE: An objective function for multiple sequence alignments. Bioinformatics 14: 407-422.

Page 41: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

Human beta   VHLTPEEKSAVTALWGKVN--VDEVGGEAL
Horse beta   VQLSGEEKAAVLALWDKVN--EEEVGGEAL
Human alpha  -VLSPADKTNVKAAWGKVGAHAGEYGAEAL
Horse alpha  -VLSAADKTNVKAAWSKVGGHAGEYGAEAL

Pairs of Residues, e.g.

Seq N, Residue I
Seq M, Residue J

Weight = w
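A library of weighted residue pairs can be used to score a multiple alignment by asking how much of the library's weight the alignment reproduces. The sketch below is a simplified version of that idea, not the published COFFEE function: it ignores sequence weights and normalises by total library weight, and the data layout (0-based `(sequence, residue)` indices, lower-numbered sequence first) is an assumption.

```python
def aligned_residue_pairs(rows):
    """Set of ((seq_i, res_i), (seq_j, res_j)) pairs that share a column."""
    next_residue = [0] * len(rows)        # running residue index per sequence
    pairs = set()
    for col in range(len(rows[0])):
        present = []
        for seq, row in enumerate(rows):
            if row[col] != "-":
                present.append((seq, next_residue[seq]))
                next_residue[seq] += 1
        for a in range(len(present)):
            for b in range(a + 1, len(present)):
                pairs.add((present[a], present[b]))
    return pairs


def library_agreement(rows, library):
    """Fraction of library weight carried by pairs aligned in `rows`."""
    pairs = aligned_residue_pairs(rows)
    total = sum(library.values())
    matched = sum(w for pair, w in library.items() if pair in pairs)
    return matched / total if total else 0.0


if __name__ == "__main__":
    library = {((0, 0), (1, 0)): 3.0, ((0, 1), (1, 1)): 1.0}
    print(library_agreement(["AC-", "A-G"], library))
```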

Page 42: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006


Page 43: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

% Match

Test Case    Avg % ID  Nseq  COFFEE SAGA  PRRP  MSA SAGA  ClustalW  PILEUP  SAM HMM
Ac prot      21        14    50.2         48.8  51.2      39.2      40.9    27.9
Binding      31        7     64.5         76.2  64.2      50.0      66.6    36.9
Cytc         42        6     90.7         89.4  67.3      89.1      94.6    67.3
Fniii        17        9     47.0         36.3  45.2      42.0      37.8    16.2
Gcr          36        8     83.1         92.8  80.8      80.8      80.8    85.7
Globin       24        17    85.2         87.0  78.0      86.4      72.6    67.8
Igb          24        37    78.1         74.9  70.1      74.8      52.4    67.2
Lzm          39        6     72.3         71.1  72.3      72.2      72.3    55.3
Phenyldiox   22        8     64.7         49.9  55.6      58.5      37.4    45.7
Sbt          61        7     96.9         96.7  96.0      96.9      97.4    90.6
sprot        27        15    66.6         64.3  68.5      62.5      57.9    61.7
Average                      72.6         71.5  68.1      65.5      64.5    56.4

Page 44: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

T-Coffee

• Heuristic approximation to COFFEE
  – Uses progressive alignment (Trees)
• Heterogeneous data
  – Sequences
  – Structures
  – Genomes
  – ESTs
• Notredame, C, Higgins, DG and Heringa, J. (2000) T-Coffee: A novel method for fast and accurate multiple sequence alignment. J.Mol.Biol., 302: 205-217.

Page 45: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

T-Coffee

• Mixed data sources
  – Primary library from
    • Lalign (SIM): 10 best local alignments
    • Clustalw: All pairwise alignments
    • SAP (Willie Taylor, Structure Superposition)
    • Multiple alignments
  – Default: Lalign and Clustalw
• Check library for CONSISTENCY
  – Upweight pairs of residues that agree with other pairs
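The consistency check can be sketched as a library extension step: a pair of residues x and y gains weight whenever both are also paired with the same residue z of a third sequence. The version below adds the smaller of the two supporting weights for each such z, which is a simplified reading of the scheme; the data layout (keys are `frozenset` pairs of `(sequence, residue)` tuples) and the function name are assumptions.

```python
def extend_library(library):
    """Upweight residue pairs that are supported through a third residue."""
    # neighbours[x][y] = primary weight of the pair (x, y), stored both ways.
    neighbours = {}
    for pair, w in library.items():
        x, y = tuple(pair)
        neighbours.setdefault(x, {})[y] = w
        neighbours.setdefault(y, {})[x] = w
    extended = dict(library)
    for x in neighbours:
        for z, w_xz in neighbours[x].items():
            for y, w_zy in neighbours[z].items():
                # Skip x itself, same-sequence residues, and the mirrored
                # ordering (y, x) so each supporting z counts once per pair.
                if y <= x or y[0] == x[0]:
                    continue
                key = frozenset((x, y))
                extended[key] = extended.get(key, 0) + min(w_xz, w_zy)
    return extended


if __name__ == "__main__":
    a, b, c = (0, 0), (1, 0), (2, 0)
    primary = {frozenset((a, b)): 10, frozenset((a, c)): 5, frozenset((b, c)): 8}
    for pair, weight in extend_library(primary).items():
        print(sorted(pair), weight)
```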

Page 46: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

Local Alignment Global Alignment

T-Coffee

Multiple Sequence Alignment

Mixing Heterogeneous Information

Multiple Alignment

StructuralSpecialist

Copyright Cédric Notredame, 2000, all rights reserved

Page 47: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

Mixing Heterogeneous Information

Structure Superposition

Weighted Residue Pairs

Copyright Cédric Notredame, 2000, all rights reserved

e.g. SAPTaylor and Orengo

Page 48: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

Increasing Structure Numbers

tRNA-synt_2b, 19% ID

[Figure: % accuracy (0 to 100) against no of Structures (0, 2, 3, 4, 5)]

Page 49: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

Including Structures in an Alignment

Method                          % accuracy
clustalw                        35.24
T_Coffee Default                38.39
T_Coffee plus all structures    66.49

3D-Coffee: O'Sullivan, O., Suhre, K., Abergel, C., Higgins, DG and Notredame, C (2004) J.Mol.Biol.

Page 50: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

Recent Developments
• 20-30 new programs in past 2 years
• MUSCLE
  – Bob Edgar, ISMB, 2004
  – Iteration/progressive alignment
    • FAST
    • Big Alignments
• PROBCONS
  – Tom Do, Michael Brudno, Serafim Batzoglou
  – ISMB 2004
  – “P-Coffee”
    • VERY accurate

Page 51: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

Iteration Revisited

--------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST
--------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN
---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-
---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-
---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
--------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE

Page 53: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

Iteration Revisited

--------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST
--------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN
---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-
---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-
---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
--------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE

--------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST
--------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN
---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-
---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
--------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE

---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-

Page 55: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

Strategy                  Label
Remove EACH sequence      RF
Remove BEST sequence      RB
Random                    Random
Tree based                Tree

Iterate
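The remove-and-realign idea can be sketched as follows. This is a toy illustration only, assuming a simple sum-of-pairs objective with made-up match, mismatch and gap scores; the function names are hypothetical and none of this is taken from the benchmarked programs. Each sequence in turn is taken out, realigned to the remaining sequences held as a fixed profile, and the new alignment is kept only if the score improves.

```python
def pair(x, y, match=1, mismatch=-1, gap=-2):
    """Score one pair of alignment symbols (toy scoring, an assumption)."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch

def sp_score(aln):
    """Sum-of-pairs score of an alignment given as equal-length strings."""
    return sum(pair(col[i], col[j])
               for col in zip(*aln)
               for i in range(len(col)) for j in range(i + 1, len(col)))

def align_to_profile(seq, profile):
    """Needleman-Wunsch of one ungapped sequence against a fixed profile."""
    cols = [c for c in zip(*profile) if any(ch != "-" for ch in c)]
    n, m, k = len(seq), len(cols), len(profile)
    ins = k * pair("A", "-")  # cost of a residue opposite a new all-gap column

    def col_score(r, c):
        return sum(pair(r, ch) for ch in c)

    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] + ins
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] + col_score("-", cols[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + col_score(seq[i - 1], cols[j - 1]),
                          F[i - 1][j] + ins,
                          F[i][j - 1] + col_score("-", cols[j - 1]))
    # Trace back from the bottom-right corner
    new_seq, new_cols, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + col_score(seq[i - 1], cols[j - 1]):
            new_seq.append(seq[i - 1]); new_cols.append(cols[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + ins:
            new_seq.append(seq[i - 1]); new_cols.append(("-",) * k); i -= 1
        else:
            new_seq.append("-"); new_cols.append(cols[j - 1]); j -= 1
    new_seq.reverse(); new_cols.reverse()
    return "".join(new_seq), ["".join(row) for row in zip(*new_cols)]

def iterate_remove_each(aln, max_rounds=10):
    """Remove each sequence in turn, realign it, keep strict improvements."""
    aln, best = list(aln), sp_score(aln)
    for _ in range(max_rounds):
        improved = False
        for idx in range(len(aln)):
            seq = aln[idx].replace("-", "")
            new_seq, rest = align_to_profile(seq, aln[:idx] + aln[idx + 1:])
            candidate = rest[:idx] + [new_seq] + rest[idx:]
            score = sp_score(candidate)
            if score > best:
                aln, best, improved = candidate, score, True
        if not improved:
            break
    return aln, best

refined, refined_score = iterate_remove_each(["ACGT-", "ACGT-", "-ACGT"])
```

In the toy input the third sequence is shifted by one column; removing and realigning it recovers the gap-free alignment, after which no further removal improves the score and the loop stops.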

Page 56: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

Iteration on HomStrad 184

Method             Default   Remove Each   Remove Best   Random
ProbCons           64.88     64.69         65.00         64.27
Muscle v3.3        63.12     63.77***      63.70**       63.39
T-Coffee           62.87     63.24         63.38         62.70
Muscle v3.2        61.76     63.58***      63.57***      62.75**
ClustalW           59.87     61.54***      61.44***      60.99**
FFT-NSI (Mafft)    59.65     62.10***      62.05***      61.55***
Average            59.32     60.70***      60.88***      60.57***

Tree Based

Quick Tree: 63.45**, 63.69***, 62.47**

Slow Tree: 63.10*, 63.27**, 61.74

Wallace, O’Sullivan and Higgins, 2004, Bioinformatics, 21:1408

Page 57: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

Combining Multiple Alignment Methods

[Diagram: alignments from ClustalW, T-Coffee, ProbCons, Muscle and specialist methods feed into T-Coffee, which produces a multiple sequence alignment.]

Copyright Cédric Notredame, 2000, all rights reserved

Page 58: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

Combining Multiple Alignment methods with T-Coffee

[Chart: accuracy (axis 65.80 to 67.80) as methods are added in the order probcons, +muscle v6, +tcoffee, +muscle v3.52, +finsi, +pcma, +ginsi, +fftnsi, +clustalw, +fftns2, +fftns1, +dialign-t, +dialign, +poa-global, +poa-local.]

Page 60: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

The Wisdom of Crowds

James Surowiecki

Crowds are surprisingly good at accurate decisions

Better than “experts”

Only if they do not form a “mob”

Page 61: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

[Chart: accuracy (axis 50.00 to 70.00) as methods are added one at a time.]

Methods         Combined   Default
Poa-global      51.96      51.90
+Dialign-T      58.32      57.92
+ClustalW       62.75      61.15
+PCMA           65.15      63.73
+FINSI          65.94      64.22
+T-Coffee       66.73      65.37
+Muscle v6      67.38      66.04
+ProbCons       67.75      66.41

M-Coffee combines 8 methods
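One way to picture the combination step is as a vote over aligned residue pairs. The sketch below is an assumption-laden illustration, not M-Coffee's implementation: the function names and the two toy alignments are hypothetical, and the weight of a pair is simply the number of input alignments that contain it.

```python
from collections import Counter
from itertools import combinations

def residue_pairs(aln):
    """Aligned residue pairs ((seq, pos), (seq, pos)) found in one alignment."""
    counters = [0] * len(aln)  # next residue index for each sequence
    pairs = set()
    for col in zip(*aln):
        present = []
        for s, ch in enumerate(col):
            if ch != "-":
                present.append((s, counters[s]))
                counters[s] += 1
        pairs.update(combinations(present, 2))
    return pairs

def combine_alignments(alignments):
    """Weight each residue pair by how many alignments agree on it."""
    library = Counter()
    for aln in alignments:
        library.update(residue_pairs(aln))
    return library

# Two hypothetical methods aligning the same two sequences (ACT and ACGT)
method_a = ["AC-T", "ACGT"]
method_b = ["ACT-", "ACGT"]
library = combine_alignments([method_a, method_b])
```

Pairs on which both methods agree receive weight 2, while the disputed placement of the final T receives weight 1 for each alternative. A library of this kind could then be passed to a consistency-based aligner.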

Page 62: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

BGA, CIA, MADE4
Aedín Culhane
Guy Perriere
Jean Thioulouse
Ian Jeffery
Ailís Fagan

Clustal
Toby Gibson, EMBL
Julie Thompson, IGBMC, Strasbourg

Iteration, Benchmarking, Clustal W 2.0
Gordon Blackshields
Mark Larkin
Paul McGettigan
Iain Wallace

Page 63: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

BAliBASE

Thompson, JD, Plewniak, F. and Poch, O. (1999) NAR and Bioinformatics

• IGBMC Strasbourg
• 141 manual alignments using structures
• 5 sections
• Core alignment regions marked

1. Equidistant (82)
2. Orphan (23)
3. Two groups (12)
4. Long internal gaps (13)
5. Long terminal gaps (11)

Page 64: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

Compare Methods

• Sam: HMM (Hughey and Krogh, 1996)

• Dialign: local multiple alignments (Morgenstern, 1999)

• ClustalW: progressive alignment (Thompson, Higgins and Gibson, 1994)

• Prrp: iterative WSP (Gotoh, 1996)

• T-Coffee: pairwise library (Notredame, Higgins and Heringa, 2000)

Page 65: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

BAliBASE

Method      1 (82)   2 (23)   3 (12)   4 (13)   5 (11)   Total
SAM         46.8     20.0     13.9     43.9     42.7     39.8
Dialign     71.0     25.2     35.1     74.7     80.4     61.5
ClustalW    78.5     32.2     42.5     65.7     74.3     66.4
PRRP        78.6     32.5     50.2     51.1     82.7     66.4

% alignment columns correct

Core alignment blocks only

Page 66: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

BAliBASE

Method      1 (82)   2 (23)   3 (12)   4 (13)   5 (11)   Total
SAM         46.8     20.0     13.9     43.9     42.7     39.8
Dialign     71.0     25.2     35.1     74.7     80.4     61.5
ClustalW    78.5     32.2     42.5     65.7     74.3     66.4
PRRP        78.6     32.5     50.2     51.1     82.7     66.4
T-Coffee    80.7     37.3     52.9     83.2     88.7     72.1

% alignment columns correct

Core alignment blocks only
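The "% alignment columns correct" measure can be sketched as below. This is an illustrative reading of the column score, not the benchmark's own scoring program: the function names, the `core` argument (a set of reference column indices standing in for the marked core blocks) and the toy alignments are assumptions.

```python
def column_ids(aln):
    """Describe each column by the residue index of every sequence (None = gap)."""
    counters = [0] * len(aln)
    cols = []
    for col in zip(*aln):
        ids = []
        for s, ch in enumerate(col):
            if ch == "-":
                ids.append(None)
            else:
                ids.append(counters[s])
                counters[s] += 1
        cols.append(tuple(ids))
    return cols

def column_score(test, reference, core=None):
    """Percentage of reference columns reproduced exactly in the test alignment.

    If `core` is given, only those reference column indices are scored.
    """
    ref_cols = column_ids(reference)
    if core is not None:
        ref_cols = [c for i, c in enumerate(ref_cols) if i in core]
    test_cols = set(column_ids(test))
    correct = sum(1 for c in ref_cols if c in test_cols)
    return 100.0 * correct / len(ref_cols)

reference = ["ACGT", "AC-T", "ACGT"]
test = ["ACGT", "A-CT", "ACGT"]
score_all = column_score(test, reference)
score_core = column_score(test, reference, core={0, 3})
```

Here the misplaced gap spoils the two middle columns, so half of all columns are correct, while both columns of the assumed core region are reproduced.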

Page 67: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

Clustal

• Clustal, Clustal1-4, TCD
– Higgins DG, Sharp PM. (1988) CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene. 73(1):237-44.
– Higgins DG, Sharp PM. (1989) Fast and sensitive multiple sequence alignments on a microcomputer. Comput Appl Biosci. 5(2):151-3.

• ClustalV, Heidelberg
– Higgins DG, Bleasby AJ, Fuchs R. (1992) CLUSTAL V: improved software for multiple sequence alignment. Comput Appl Biosci. 8(2):189-91.

• ClustalW, Hinxton
– Thompson JD, Higgins DG, Gibson TJ. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22(22):4673-80.

• ClustalX, UCC
– Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG. (1997) The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. 25(24):4876-82.

Page 68: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

Clustal re-engineering in C++

• Problems:
– Code has become very complex.
– 18 code files (up to 5229 lines).
– 400 global variables.
– 500 functions.

• Wish to:
– Simplify the code.
– Improve the structure of the code (modularisation).
– Make functional changes easier.
– Make the code easier to understand.
– Improve portability
  • Qt: cross-platform C++ GUI toolkit.

Page 69: Multiple Alignments and Multivariate Analysis Clustal: 1988-2006

The Local Minimum Problem: Clustal is “Greedy”

[Diagram: energy plotted against location, with a local minimum and the global minimum marked.]
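The picture of a greedy search settling in a local minimum can be reproduced with a one-dimensional toy landscape. The sketch below is only an analogy for the diagram, not a model of Clustal's progressive alignment; the `greedy_descent` function and the energy values are made up for the example.

```python
def greedy_descent(energy, start):
    """Step to the lowest neighbour until neither neighbour is lower."""
    pos = start
    while True:
        neighbours = [p for p in (pos - 1, pos + 1) if 0 <= p < len(energy)]
        if not neighbours:
            return pos
        best = min(neighbours, key=lambda p: energy[p])
        if energy[best] >= energy[pos]:
            return pos  # no downhill move left: a (possibly local) minimum
        pos = best

# Toy landscape: local minimum at index 1, global minimum at index 5
energy = [5, 3, 4, 6, 2, 0, 1, 7]
stuck = greedy_descent(energy, start=0)
lucky = greedy_descent(energy, start=3)
```

Starting from the left, the search stops at the local minimum because every available move goes uphill; only a start on the other side of the barrier reaches the global minimum. A greedy method never undoes an earlier decision, which is why iteration or consistency information is used to escape such traps.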