Improving FFAS alignments using T-Coffee

Why do we improve alignments?

• Determining protein function is a nontrivial task. The best way to do it is to relate the sequence of an unknown protein to proteins with known properties.

• To explore the evolution of related sequences.

• Because experimental determination of protein structure is still time consuming, alignments are also used to create homology models, which can give us additional functional information.

What is a sequence profile?

A protein profile is a matrix that describes a particular domain or family. Each row of the matrix represents a position in a multiple alignment. Each row has 20 scores, one for each amino acid, reflecting the probabilities of the various amino acids occurring at that row's position in the profile. Thus, scores are position dependent.
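The matrix described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the toy alignment, the function name `build_profile`, and the flat pseudocount are hypothetical, and real profile builders such as PSI-BLAST additionally weight sequences and convert frequencies to log-odds scores.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def build_profile(alignment, pseudocount=1.0):
    """Return one row per alignment column; each row maps residue -> probability.

    Gaps and non-standard symbols are ignored when counting. The flat
    pseudocount is an assumption made to avoid zero probabilities.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return profile


# Hypothetical aligned fragments, used only to exercise the function.
toy_alignment = ["HLWIV", "HAMIV", "HLW-V"]
profile = build_profile(toy_alignment)
```

Each row of `profile` holds 20 probabilities that sum to 1, so the score for a residue depends on the column in which it appears.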

Why do we use profiles?

To better understand the evolutionary, structural, and functional relationships between related sequences.

For biological analysis we usually perform the following steps:

• Finding protein sequences related to our query with reasonable confidence, using database search algorithms (BLAST, FASTA). (FFAS stops at this stage.)

• Creating a multiple sequence alignment of the related sequences (ClustalW, T-Coffee, POA, Dialign).

• Using additional information, e.g. predicted secondary structure (Orfeus) or prior knowledge (the biological importance of given amino acids).

How does T-Coffee work?

It performs all possible pairwise alignments within the set of sequences, in two steps: first with ClustalW and second using the "lalign" program from the local FASTA package. The results of both methods are combined into a primary library. A library extension step determines how each residue pair aligns with respect to the other residues. The library is then used to assess how well sequences are aligned given the other sequences in the dataset, rather than looking at two sequences in isolation. The final alignment is then built progressively using the information in the library.
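The library extension step can be illustrated with a small sketch. It assumes the commonly described triplet rule: the extended weight of a residue pair is its primary weight plus, for every residue in a third sequence that is aligned to both, the smaller of the two connecting weights. The data layout, the function name `extend_library`, and the toy weights are hypothetical and are not T-Coffee's actual internals.

```python
from collections import defaultdict


def extend_library(primary):
    """Extend a primary library of residue-pair weights.

    primary maps ((seq_a, pos_a), (seq_b, pos_b)) -> weight.
    Returns a dict keyed by residue pairs ordered by sequence name.
    """
    # Store every pair in both directions so lookups are symmetric.
    neighbours = defaultdict(dict)
    for (x, y), weight in primary.items():
        neighbours[x][y] = weight
        neighbours[y][x] = weight

    extended = {}
    residues = list(neighbours)
    for x in residues:
        for y in residues:
            if x[0] >= y[0]:  # one orientation only; skip same-sequence pairs
                continue
            score = neighbours[x].get(y, 0.0)
            # Add support from residues z in any third sequence.
            for z, w_xz in neighbours[x].items():
                if z[0] in (x[0], y[0]):
                    continue
                w_zy = neighbours[z].get(y)
                if w_zy is not None:
                    score += min(w_xz, w_zy)
            if score > 0:
                extended[(x, y)] = score
    return extended


# Toy primary library for residue 0 of three hypothetical sequences A, B, C.
primary = {
    (("A", 0), ("B", 0)): 88.0,
    (("A", 0), ("C", 0)): 77.0,
    (("C", 0), ("B", 0)): 100.0,
}
extended = extend_library(primary)
```

In this toy case the A/B pair gains support from C, so its weight grows from 88 to 88 + min(77, 100); pairs that are consistent with many other sequences end up with the highest weights.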

What was done?

Four algorithms were used in an attempt to obtain better alignments:

• Simple elongation of sequences in the BLAST profile.

• Aligning sequences to the BLAST profile using T-Coffee.

• Creating profiles using T-Coffee in multiple sequence alignment mode.

• Mixed method: T-Coffee + elongation.

Benchmark

• 1024 pairs of protein domains from SCOP

• Low sequence identity

• High structural similarity

• No redundant pairs

• Only single-domain structures
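The sequence identity criterion depends on how identity is computed. The sketch below shows one common convention, assumed here for illustration only: identical residues divided by the number of columns in which both sequences have a residue. The source does not state which definition the benchmark used, and the function name `percent_identity` is hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


# Short C-terminal fragment taken from the T-Coffee alignment of
# d1ah7__ and the model shown later in these slides.
identity = percent_identity("TAGYI-QLWFDTYGDR", "TAGYIYRFLHDVSEGN")
```

Other conventions divide by the length of the shorter sequence or by the full alignment length, which gives lower values for gapped alignments.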

Benchmark

• In each method, profiles were created for both the target and the template.

• The benchmark was computationally expensive for some algorithms (e.g. 14 days on 15 CPUs for T-Coffee in multiple sequence alignment mode), so only the original PSI-BLAST profile creation procedure was tested.

• For T-Coffee in multiple sequence alignment mode, there were no results for some pairs.

Alignment quality measure

It is common that only a fraction of the model is correct. After structural superposition, the most significant subset is found using the LG score measure.

This allows only the reasonable parts of the models to be compared.

T-Coffee to profile algorithm

[Figure: LG score FFAS vs LG score T-Coffee. x axis: LG score FFAS (0 to 15000); y axis: LG score T-Coffee (0 to 14000).]

T-Coffee to profile algorithm

[Figure: LG score FFAS vs LG score T-Coffee, low-score range. x axis: LG score FFAS (0 to 2000); y axis: LG score T-Coffee (0 to 2000).]

Elongation

[Figure: LG score FFAS vs LG score T-Coffee. x axis: LG score FFAS (0 to 14000); y axis: LG score T-Coffee (0 to 14000).]

T-Coffee

[Figure: LG score FFAS vs LG score T-Coffee, low-score range. x axis: LG score FFAS (0 to 2000); y axis: LG score T-Coffee (0 to 2000).]

T-Coffee + elongation

[Figure: LG score FFAS vs LG score T-Coffee. x axis: LG score FFAS (0 to 12000); y axis: LG score T-Coffee (0 to 12000).]

• The best results are obtained using T-Coffee only. For FFAS LG scores below 600, alignments are improved in 72% of all cases.

• Because the LG score is unknown in "real life", it is necessary to find a correlation between alignment improvement and known factors.
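A figure such as the share of improved alignments below a score threshold can be computed as in the following sketch. It assumes that "improved" means the T-Coffee LG score is strictly higher than the FFAS LG score for the same pair, which the source does not state explicitly. The function name `fraction_improved` and the last data row are hypothetical; the first three rows are the example values quoted in these slides.

```python
def fraction_improved(pairs, ffas_max=600.0):
    """Fraction of pairs with FFAS LG score below ffas_max that T-Coffee improves.

    pairs is an iterable of (lg_ffas, lg_tcoffee) tuples.
    Returns None when no pair falls below the threshold.
    """
    selected = [(ffas, tcoffee) for ffas, tcoffee in pairs if ffas < ffas_max]
    if not selected:
        return None
    improved = sum(tcoffee > ffas for ffas, tcoffee in selected)
    return improved / len(selected)


scores = [
    (2457.6, 2545.6),  # example pair from the slides (above the threshold)
    (207.7, 251.1),    # example pair from the slides
    (47.6, 113.8),     # example pair from the slides
    (300.0, 250.0),    # hypothetical pair where T-Coffee is worse
]
share = fraction_improved(scores)
```

With these four rows, three pairs fall below the threshold and two of them are improved.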

Note:

• Some of the results are missing.

• We cannot trust the benchmark in all cases.

d1ca1_1 a.124.1.1 d1ah7__ a.124.1.1

• Sequence identity 33%

• LG score FFAS = 2457.6

• LG score T-coffee = 2545.6

FFAS ALIGNMENT:

                    10        20        30        40        50        60
            ....|....|....|....|....|....|....|....|....|....|....|....|
d1ah7__   1 EDKHKEGVNSHLWIVNRAIDIMSRNTTL----VKQDRVAQLNEWRTELENGIYAADYENP 56
model     1 WDGKIDGTGTHAMIVTQGVSILENDLSKNEPESVRKNLEILKENMHELQLGSTYPDYDKN 60

                    70        80        90       100       110       120
            ....|....|....|....|....|....|....|....|....|....|....|....|
d1ah7__  57 YYDNSTFASHFYDPDNGKTYI---------PFAKQAKETGAKYFKLAGESYKNKDMKQAF 107
model    61 AYD--LYQDHFWDPDTDNNFSKDNSWYLAYSIPDTGESQIRKFSALARYEWQRGNYKQAT 118

                   130       140       150       160       170       180
            ....|....|....|....|....|....|....|....|....|....|....|....|
d1ah7__ 108 FYLGLSLHYLGDVNQPMHAANFTNLSYPQGFHSKYENFVDTIKDNYKVTDGNGYWNWKGT 167
model   119 FYLGEAMHYFGDIDTPYHPANVTAVD--SAGHVKFETFAEERKEQYKI-------NTVGC 169

                   190       200       210       220       230       240
            ....|....|....|....|....|....|....|....|....|....|....|....|
d1ah7__ 168 NPEEWIHGAAVVAKQDYSGIVNDN--------TKDWFVKAAVSQEYAD-KWRAEVTPMTG 218
model   170 KTNEDFYAD-ILKNKDFNAWSKEYARGFAKTGKSIYYSHASMSHSWDDW------DYAAK 222

                   250       260
            ....|....|....|....|...
d1ah7__ 219 KRLMDAQRVTAGYIQLWFDTYGD 241
model   223 VTLANSQKGTAGYIYRFLHDVSE 245

T-COFFEE ALIGNMENT:

                    10        20        30        40        50        60
            ....|....|....|....|....|....|....|....|....|....|....|....|
d1ah7__   1 EDKHKEGVNSHLWIVNRAIDIMSRNTT----LVKQDRVAQLNEWRTELENGIYAADYENP 56
model     1 WDGKIDGTGTHAMIVTQGVSILENDLSKNEPESVRKNLEILKENMHELQLGSTYPDYDKN 60

                    70        80        90       100       110       120
            ....|....|....|....|....|....|....|....|....|....|....|....|
d1ah7__  57 YYDNSTFASHFYDPDNGKTY---------IPFAKQAKETGAKYFKLAGESYKNKDMKQAF 107
model    61 AY--DLYQDHFWDPDTDNNFSKDNSWYLAYSIPDTGESQIRKFSALARYEWQRGNYKQAT 118

                   130       140       150       160       170       180
            ....|....|....|....|....|....|....|....|....|....|....|....|
d1ah7__ 108 FYLGLSLHYLGDVNQPMHAANFTNLSYPQGFHSKYENFVDTIKDNYKVTDGNGYWNWKGT 167
model   119 FYLGEAMHYFGDIDTPYHPANVTAVDSAG--HVKFETFAEERKEQYKINTVGCK-----T 171

                   190       200       210       220       230       240
            ....|....|....|....|....|....|....|....|....|....|....|....|
d1ah7__ 168 NPEEWIHGAAVVAKQDYSGIVNDNTKDWFVKAAVSQEYADKWRAEVTPMTGKRLMDAQRV 227
model   172 NEDFYADILKNKDFNAWSKEYARGFAKTGKSIYYSHASMSHSWDDWDYAAKVTLANSQKG 231

                   250
            ....|....|....|.
d1ah7__ 228 TAGYI-QLWFDTYGDR 242
model   232 TAGYIYRFLHDVSEGN 247

STRUCTURAL ALIGNMENT:

                    10        20        30        40        50        60
            ....|....|....|....|....|....|....|....|....|....|....|....|
d1ah7__   1 WSAEDKHKEGVNSHLWIVNRAIDIMSRNTTLVK----QDRVAQLNEWRTELENGIYAADY 56
model     1 WDGKIDG---TGTHAMIVTQGVSILENDLSKNEPESVRKNLEILKENMHELQLGSTYPDY 57

                    70        80        90       100       110       120
            ....|....|....|....|....|....|....|....|....|....|....|....|
d1ah7__  57 ENPYYDNSTFASHFYDPDNGKTYIP---------FAKQAKETGAKYFKLAGESYKNKDMK 107
model    58 DK-NAYD-LYQDHFWDPDTDNNFSKDNSWYLAYSIPDTGESQIRKFSALARYEWQRGNYK 115

                   130       140       150       160       170       180
            ....|....|....|....|....|....|....|....|....|....|....|....|
d1ah7__ 108 QAFFYLGLSLHYLGDVNQPMHAANFTNLSYPQGFHSKYENFVDTIKDNYKVTDGNGYWNW 167
model   116 QATFYLGEAMHYFGDIDTPYHPANVTAVDS--AGHVKFETFAEERKEQYKINTVGCKTNE 173

                   190       200       210       220       230       240
            ....|....|....|....|....|....|....|....|....|....|....|....|
d1ah7__ 167 -----------KGTNPEEWIHGAAVVAKQDYSG-IVNDNTKDWFVKAAVSQEYADKWRAE 215
model   174 DFYADILKNKDFNAWSKEYARGFAKTGKSIYYSHASMSH-----------------SWDD 216

                   250       260       270
            ....|....|....|....|....|....|...
d1ah7__ 216 VTPMTGKRLMDAQRVTAGYIQLWFDTYGDR--- 245
model   217 WDYAAKVTLANSQKGTAGYIYRFLHDVSEGNDP 249

[Figure: d1ca1_1 and d1ah7__]

d1mgta2 c.55.7.1 d1sfe_2 c.55.7.1

• Sequence identity 40%

• LG score FFAS = 207.7

• LG score T-coffee = 251.1

[Figure: d1fb1a_ and d1a8ra_]

d1mgta2 c.55.7.1 d1sfe_2 c.55.7.1

• Sequence identity 39%

• LG score FFAS = 47.6

• LG score T-coffee = 113.8

[Figure: d1mgta2 and d1sfe_2]

Sequence Identity

[Figure: sequence identity vs LG score. x axis: Identity % (0 to 120); y axis: LG score (-6000 to 4000).]

Profile identity vs LG score

[Figure: profile identity vs LG score. x axis: Identity % (0 to 100); y axis: LG score (-6000 to 4000).]

FFAS score vs LG score

[Figure: FFAS score vs LG score. x axis: FFAS score (-200 to 0); y axis: LG score (-1000 to 1000).]

Conclusions:

• FFAS alignments can still be improved.

• Using T-Coffee to create FFAS profiles can improve alignment quality.

• It is not yet known how to decide when to use T-Coffee to create FFAS profiles.

To do:

• Check the correlation between alignment diversity and alignment improvement.

• Try a different method of comparing sequence alignments (overlap score).

• Compare other multiple alignment methods to T-Coffee.
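One way the overlap score mentioned in this list could be defined is sketched below. It assumes a common convention: the fraction of residue pairs aligned in a reference alignment (for example the structural one) that are also aligned in the tested alignment. The source does not define the score, so this definition, the function names, and the toy sequences are all assumptions.

```python
def aligned_pairs(row_a, row_b, gap="-"):
    """Return the set of (index_in_a, index_in_b) residue pairs that are aligned."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    pairs = set()
    i = j = 0  # running residue indices in the ungapped sequences
    for a, b in zip(row_a, row_b):
        if a != gap and b != gap:
            pairs.add((i, j))
        if a != gap:
            i += 1
        if b != gap:
            j += 1
    return pairs


def overlap_score(test, reference):
    """Fraction of reference residue pairs that the test alignment reproduces."""
    reference_pairs = aligned_pairs(*reference)
    if not reference_pairs:
        return 0.0
    return len(aligned_pairs(*test) & reference_pairs) / len(reference_pairs)


# Hypothetical toy alignments of the same two sequences.
reference = ("ACD-EF", "ACDGEF")
test = ("ACDEF-", "ACDGEF")
score = overlap_score(test, reference)
```

Unlike the LG score, a measure of this kind needs no structural superposition of the model, only a trusted reference alignment.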
