jakub-pas
Improving FFAS alignments using T-Coffee
Why do we improve alignments?
• Determining protein function is a nontrivial task. The best way to do it is to relate the sequence of an unknown protein to proteins with known properties.
• To explore the evolution of related sequences.
• Because experimental determination of protein structure is still time consuming, alignments are also used to create homology models, which can give us additional functional information.
What is a sequence profile?
A protein profile is a matrix that describes a particular domain or family. Each row of the matrix represents a position in a multiple alignment. Each row has 20 scores, one for each amino acid, reflecting the probabilities of the various amino acids occurring at that row's position in the profile. Thus, scores are position dependent.
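As an illustration of the definition above, here is a minimal sketch that turns a toy multiple alignment into a profile with one row per alignment position and 20 values per row. The function name `build_profile`, the toy sequences and the pseudocount of 1.0 are assumptions made for this example; real profile builders such as PSI-BLAST or FFAS use more elaborate sequence weighting and scoring.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """Return one row per alignment column; each row maps the 20 amino acids
    to an estimated probability (gaps are ignored, pseudocounts avoid zeros)."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        # Count only standard amino acids in this column.
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return profile

# Hypothetical three-sequence alignment, used only for illustration.
toy_alignment = ["MKV-L", "MRVAL", "MKIAL"]
profile = build_profile(toy_alignment)
```

Each row depends only on its own column, which is what makes the scores position dependent.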
Why do we use profiles?
To better understand the evolutionary, structural and functional relationships between related sequences.
For biological analysis we usually perform the following steps:
• Finding protein sequences related to our query with reasonable confidence, using database search algorithms such as BLAST or FASTA. (FFAS stops at this stage.)
• Creating a multiple sequence alignment of the related sequences (ClustalW, T-Coffee, POA, Dialign).
• Using additional information, e.g. predicted secondary structure (Orfeus) or prior knowledge (biological importance of given amino acids).
How does T-Coffee work?
It performs all possible pairwise alignments within the set of sequences, in two steps: first with ClustalW and second with the "lalign" program from the local FASTA package. The results from both methods are combined into a primary library. A library extension step determines how each residue pair aligns with respect to the other residues. The library is then used to assess how well sequences are aligned given the other sequences in the dataset, rather than looking at two sequences in isolation. The final alignment is then built progressively using the information in the library.
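The library extension step can be sketched with a toy example. The sketch below assumes the triplet scheme from the published T-Coffee description: the extended weight of a residue pair is its primary weight plus, for every residue in a third sequence that links the two, the minimum of the two linking weights. The helper names, the residue encoding `(sequence_name, position)` and all weights are hypothetical; this is not T-Coffee's actual code.

```python
def make_library(weighted_pairs):
    """Build a symmetric primary library from (residue, residue, weight) triples.
    A residue is (sequence_name, position); pairs must link different sequences."""
    lib = {}
    for x, y, weight in weighted_pairs:
        lib.setdefault(x, {})[y] = weight
        lib.setdefault(y, {})[x] = weight
    return lib

def extend_library(lib):
    """Triplet extension: add min(W(x,z), W(z,y)) for every intermediate residue z."""
    extended = {}
    for x, partners in lib.items():
        row = extended.setdefault(x, {})
        for z, w_xz in partners.items():
            # Direct evidence from the primary library.
            row[z] = row.get(z, 0) + w_xz
            # Indirect evidence through z, which belongs to another sequence.
            for y, w_zy in lib[z].items():
                if y[0] != x[0]:
                    row[y] = row.get(y, 0) + min(w_xz, w_zy)
    return extended

# Hypothetical weights for three sequences A, B and C.
primary = make_library([
    (("A", 0), ("B", 0), 88),
    (("A", 0), ("C", 0), 77),
    (("C", 0), ("B", 0), 100),
    (("A", 1), ("C", 1), 50),
    (("C", 1), ("B", 2), 60),
])
extended = extend_library(primary)
```

In this toy library, residues A:1 and B:2 are never aligned directly, yet they receive an extended weight through C:1. This is how the other sequences in the dataset inform each pairwise comparison.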
What was done?
Four algorithms were used in an attempt to obtain better alignments:
• Simple elongation of sequences in the BLAST profile.
• Aligning sequences to the BLAST profile using T-Coffee.
• Creating profiles using T-Coffee in multiple sequence alignment mode.
• Mixed method: T-Coffee + elongation.
Benchmark
• 1024 pairs of protein domains from SCOP
• Low sequence identity
• High structural similarity
• No redundant pairs
• Only single-domain structures
Benchmark
• In each method, profiles were created for both target and template.
• The benchmark was computationally expensive for some algorithms (e.g. 14 days on 15 CPUs for T-Coffee in multiple sequence alignment mode); only the original PSI-BLAST profile creation procedure was tested.
• For T-Coffee in multiple sequence alignment mode, there were no results for some pairs.
Alignment quality measure
It is common that only a fraction of the model is correct. After structural superposition, the most significant subset is found using the LG score measure.
This allows us to compare only the reasonable parts of the models.
T-Coffee to profile algorithm
[Figure: LG score FFAS (x axis, 0 to 15000) vs LG score T-Coffee (y axis, 0 to 14000)]
T-Coffee to profile algorithm
[Figure: LG score FFAS (x axis, 0 to 2000) vs LG score T-Coffee (y axis, 0 to 2000)]
Elongation
[Figure: LG score FFAS (x axis, 0 to 14000) vs LG score T-Coffee (y axis, 0 to 14000)]
T-Coffee
[Figure: LG score FFAS (x axis, 0 to 2000) vs LG score T-Coffee (y axis, 0 to 2000)]
T-Coffee + elongation
[Figure: LG score FFAS (x axis, 0 to 12000) vs LG score T-Coffee (y axis, 0 to 12000)]
• The best results are obtained using T-Coffee only. For FFAS LG score < 600, alignments are improved in 72% of all cases.
• Since the LG score is unknown in "real life", it is necessary to find a correlation between alignment improvement and known factors.
Note:
• Some of the results are missing.
• We cannot trust the benchmark in all cases.
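A statistic such as "improved in 72% of cases for FFAS LG score < 600" can be computed along the following lines. This is only a sketch: the function name `fraction_improved` is hypothetical, and the input is limited to the three example pairs quoted in these slides, not the 1024-pair benchmark, so the resulting number says nothing about the benchmark itself.

```python
def fraction_improved(score_pairs, ffas_cutoff=600.0):
    """score_pairs holds (LG score FFAS, LG score T-Coffee) tuples.
    Returns the fraction of pairs below the FFAS cutoff whose T-Coffee
    score is higher, or None if no pair falls below the cutoff."""
    selected = [(ffas, tcoffee) for ffas, tcoffee in score_pairs if ffas < ffas_cutoff]
    if not selected:
        return None
    improved = sum(1 for ffas, tcoffee in selected if tcoffee > ffas)
    return improved / len(selected)

# The three example pairs shown in the slides (FFAS, T-Coffee).
example_pairs = [(2457.6, 2545.6), (207.7, 251.1), (47.6, 113.8)]
fraction = fraction_improved(example_pairs)
```

Pairs with missing results would have to be dropped before calling the function, which is one reason the benchmark numbers should be read with care.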
d1ca1_1 a.124.1.1 d1ah7__ a.124.1.1
• Sequence identity 33%
• LG score FFAS = 2457.6
• LG score T-coffee = 2545.6
FFAS ALIGNMENT:
                    10        20        30        40        50        60
            ....|....|....|....|....|....|....|....|....|....|....|....|
d1ah7__   1 EDKHKEGVNSHLWIVNRAIDIMSRNTTL----VKQDRVAQLNEWRTELENGIYAADYENP 56
model     1 WDGKIDGTGTHAMIVTQGVSILENDLSKNEPESVRKNLEILKENMHELQLGSTYPDYDKN 60

                    70        80        90       100       110       120
            ....|....|....|....|....|....|....|....|....|....|....|....|
d1ah7__  57 YYDNSTFASHFYDPDNGKTYI---------PFAKQAKETGAKYFKLAGESYKNKDMKQAF 107
model    61 AYD--LYQDHFWDPDTDNNFSKDNSWYLAYSIPDTGESQIRKFSALARYEWQRGNYKQAT 118

                   130       140       150       160       170       180
            ....|....|....|....|....|....|....|....|....|....|....|....|
d1ah7__ 108 FYLGLSLHYLGDVNQPMHAANFTNLSYPQGFHSKYENFVDTIKDNYKVTDGNGYWNWKGT 167
model   119 FYLGEAMHYFGDIDTPYHPANVTAVD--SAGHVKFETFAEERKEQYKI-------NTVGC 169

                   190       200       210       220       230       240
            ....|....|....|....|....|....|....|....|....|....|....|....|
d1ah7__ 168 NPEEWIHGAAVVAKQDYSGIVNDN--------TKDWFVKAAVSQEYAD-KWRAEVTPMTG 218
model   170 KTNEDFYAD-ILKNKDFNAWSKEYARGFAKTGKSIYYSHASMSHSWDDW------DYAAK 222

                   250       260
            ....|....|....|....|...
d1ah7__ 219 KRLMDAQRVTAGYIQLWFDTYGD 241
model   223 VTLANSQKGTAGYIYRFLHDVSE 245
[Figure: d1ca1_1 and d1ah7__]
T-COFFEE ALIGNMENT:
                    10        20        30        40        50        60
            ....|....|....|....|....|....|....|....|....|....|....|....|
d1ah7__   1 EDKHKEGVNSHLWIVNRAIDIMSRNTT----LVKQDRVAQLNEWRTELENGIYAADYENP 56
model     1 WDGKIDGTGTHAMIVTQGVSILENDLSKNEPESVRKNLEILKENMHELQLGSTYPDYDKN 60

                    70        80        90       100       110       120
            ....|....|....|....|....|....|....|....|....|....|....|....|
d1ah7__  57 YYDNSTFASHFYDPDNGKTY---------IPFAKQAKETGAKYFKLAGESYKNKDMKQAF 107
model    61 AY--DLYQDHFWDPDTDNNFSKDNSWYLAYSIPDTGESQIRKFSALARYEWQRGNYKQAT 118

                   130       140       150       160       170       180
            ....|....|....|....|....|....|....|....|....|....|....|....|
d1ah7__ 108 FYLGLSLHYLGDVNQPMHAANFTNLSYPQGFHSKYENFVDTIKDNYKVTDGNGYWNWKGT 167
model   119 FYLGEAMHYFGDIDTPYHPANVTAVDSAG--HVKFETFAEERKEQYKINTVGCK-----T 171

                   190       200       210       220       230       240
            ....|....|....|....|....|....|....|....|....|....|....|....|
d1ah7__ 168 NPEEWIHGAAVVAKQDYSGIVNDNTKDWFVKAAVSQEYADKWRAEVTPMTGKRLMDAQRV 227
model   172 NEDFYADILKNKDFNAWSKEYARGFAKTGKSIYYSHASMSHSWDDWDYAAKVTLANSQKG 231

                   250
            ....|....|....|.
d1ah7__ 228 TAGYI-QLWFDTYGDR 242
model   232 TAGYIYRFLHDVSEGN 247
[Figure: d1ca1_1 and d1ah7__]
STRUCTURAL ALIGNMENT:
                    10        20        30        40        50        60
            ....|....|....|....|....|....|....|....|....|....|....|....|
d1ah7__   1 WSAEDKHKEGVNSHLWIVNRAIDIMSRNTTLVK----QDRVAQLNEWRTELENGIYAADY 56
model     1 WDGKIDG---TGTHAMIVTQGVSILENDLSKNEPESVRKNLEILKENMHELQLGSTYPDY 57

                    70        80        90       100       110       120
            ....|....|....|....|....|....|....|....|....|....|....|....|
d1ah7__  57 ENPYYDNSTFASHFYDPDNGKTYIP---------FAKQAKETGAKYFKLAGESYKNKDMK 107
model    58 DK-NAYD-LYQDHFWDPDTDNNFSKDNSWYLAYSIPDTGESQIRKFSALARYEWQRGNYK 115

                   130       140       150       160       170       180
            ....|....|....|....|....|....|....|....|....|....|....|....|
d1ah7__ 108 QAFFYLGLSLHYLGDVNQPMHAANFTNLSYPQGFHSKYENFVDTIKDNYKVTDGNGYWNW 167
model   116 QATFYLGEAMHYFGDIDTPYHPANVTAVDS--AGHVKFETFAEERKEQYKINTVGCKTNE 173

                   190       200       210       220       230       240
            ....|....|....|....|....|....|....|....|....|....|....|....|
d1ah7__ 167 -----------KGTNPEEWIHGAAVVAKQDYSG-IVNDNTKDWFVKAAVSQEYADKWRAE 215
model   174 DFYADILKNKDFNAWSKEYARGFAKTGKSIYYSHASMSH-----------------SWDD 216

                   250       260       270
            ....|....|....|....|....|....|...
d1ah7__ 216 VTPMTGKRLMDAQRVTAGYIQLWFDTYGDR--- 245
model   217 WDYAAKVTLANSQKGTAGYIYRFLHDVSEGNDP 249
[Figure: d1ca1_1 and d1ah7__]
d1mgta2 c.55.7.1 d1sfe_2 c.55.7.1
• Sequence identity 40%
• LG score FFAS = 207.7
• LG score T-coffee = 251.1
[Figure: d1fb1a_ and d1a8ra_]
d1mgta2 c.55.7.1 d1sfe_2 c.55.7.1
• Sequence identity 39%
• LG score FFAS = 47.6
• LG score T-coffee = 113.8
[Figure: d1mgta2 and d1sfe_2]
Sequence Identity
[Figure: Identity % (x axis, 0 to 120) vs LG score (y axis, -6000 to 4000)]
Profile identity vs LG score
[Figure: Identity % (x axis, 0 to 100) vs LG score (y axis, -6000 to 4000)]
FFAS score vs LG score
[Figure: FFAS score (x axis, -200 to 0) vs LG score (y axis, -1000 to 1000)]
Conclusions:
• FFAS alignments can still be improved.
• Using T-Coffee to create FFAS profiles can improve alignment quality.
• It is not yet known how to decide when to use T-Coffee to create FFAS profiles.
To do:
• Check the correlation between alignment diversity and alignment improvement.
• Try a different method for comparing sequence alignments (overlap score).
• Compare other multiple alignment methods to T-Coffee.
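One possible form of the overlap score mentioned in the list above is sketched below. The definition used here (the fraction of residue pairs in a reference alignment that are reproduced by the test alignment) is an assumption, since the slides do not define it; the function names and the toy sequences are hypothetical.

```python
def aligned_pairs(row_a, row_b, gap="-"):
    """Return the set of (index_in_a, index_in_b) residue pairs that share a column."""
    if len(row_a) != len(row_b):
        raise ValueError("both alignment rows must have the same length")
    pairs = set()
    i = j = 0
    for char_a, char_b in zip(row_a, row_b):
        if char_a != gap and char_b != gap:
            pairs.add((i, j))
        # Advance each residue index only when its row has a residue here.
        if char_a != gap:
            i += 1
        if char_b != gap:
            j += 1
    return pairs

def overlap_score(test, reference):
    """Fraction of reference residue pairs that also occur in the test alignment."""
    ref_pairs = aligned_pairs(*reference)
    if not ref_pairs:
        return 0.0
    return len(ref_pairs & aligned_pairs(*test)) / len(ref_pairs)

# Toy example: the reference could be a structural alignment,
# the test a sequence-based alignment of the same two sequences.
reference = ("ACD-E", "AC-GE")
test = ("ACD-E", "A-CGE")
score = overlap_score(test, reference)
```

Unlike the LG score, a measure of this kind compares alignments directly and needs no structural superposition of the resulting model, although it still requires a trusted reference alignment.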