
Swiss Institute of Bioinformatics

Institut Suisse de Bioinformatique

CN+LF-2006.01

An introduction to multiple alignments

original version by Cédric Notredame, updated by Laurent Falquet


Overview

! Multiple alignments

! How-to, Goal, problems, use

! Patterns

! PROSITE database, syntax, use

! PSI-BLAST

! BLAST, matrices, use

! [ Profiles/HMMs ] …


Overview

! What are multiple alignments?

! How can I use my alignments?

! How does the computer align the sequences?

! The progressive alignment algorithm

! What are the difficulties?

! Pre-requisite?

! How can we compare sequences?

! How can we align sequences?


Sometimes two sequences are not enough

The man with TWO watches

NEVER knows the exact time


What is a multiple sequence alignment?

! What can it do for me?

! How can I produce one of these?

! How can I use it?

chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD

wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE

trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP

mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

***. ::: .: .. . : . . * . *: *

chite AATAKQNYIRALQEYERNGG-

wheat ANKLKGEYNKAIAAYNKGESA

trybr AEKDKERYKREM---------

mouse AKDDRIRYDNEMKSWEEQMAE

* : .* . :


! Structural/biochemical criteria: residues playing a similar role end up in the same column.

! Evolutionary criteria: residues having the same ancestor end up in the same column.


How can I use a multiple alignment?

chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD

wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE

trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP

unknown -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

***. ::: .: .. . : . . * . *: *

chite AATAKQNYIRALQEYERNGG-

wheat ANKLKGEYNKAIAAYNKGESA

trybr AEKDKERYKREM---------

unknown AKDDRIRYDNEMKSWEEQMAE

* : .* . :

Extrapolation

SwissProt

Unknown Sequence

Homology?

Less than 30% identity, BUT

conserved where it MATTERS


Prosite Patterns

P-K-R-[PA]-x(1)-[ST]…
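The pattern syntax can be tried out by translating it into a regular expression. The sketch below is a minimal illustration, not the PROSITE scanning tools: the function name `prosite_to_regex` is made up for this example, only the elements `x`, `[..]`, `{..}` and the repeat counts `(n)` / `(n,m)` are handled, and only the part of the pattern visible on the slide is used (the elided "…" is left out).

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regex (sketch)."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split each element into its core and an optional (n) or (n,m) repeat.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError(f"cannot parse element {element!r}")
        core, low, high = match.groups()
        if core == "x":                       # x = any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"    # {..} = any residue except these
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(core + repeat)
    return "".join(parts)

# Ungapped starts of the four sequences shown in the alignment.
sequences = {
    "chite": "ADKPKRPLSAYMLWLN",
    "wheat": "DPNKPKRAPSAFFVFM",
    "trybr": "KKDSNAPKRAMTSFMFF",
    "mouse": "KPKRPRSAYNIYVSES",
}

regex = re.compile(prosite_to_regex("P-K-R-[PA]-x(1)-[ST]"))
for name, seq in sequences.items():
    hit = regex.search(seq)
    print(name, hit.start() if hit else None)
```

All four sequences match because the pattern was read off the conserved columns of their alignment.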


chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD

wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE

trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP

mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-IQGKLKLVNEAWKNLSP

***. ::: .: .. . : . . * . *: *

chite AATAKQNYIRALQEYERNGG-

wheat ANKLKGEYNKAIAAYNKGESA

trybr AEKDKERYKREM---------

mouse AKDDRIRYDNEMKSWEEQMAE

* : .* . :

[Figure: profile illustration with the labels "L?K>R" and "AFDEFGHQIVLW"]

Prosite Profiles

-More Sensitive

-More Specific


PROSITE profile (see also HMMs)

A substitution cost for every amino acid, at every position
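The idea of one score per amino acid per column can be sketched as a small position-specific scoring matrix. This is only an illustration under simplifying assumptions: a uniform background frequency, a flat pseudocount, and no sequence weighting. Real PROSITE profiles are built with more elaborate methods, and the names `column_scores`, `build_profile` and `score` are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_scores(column, pseudocount=1.0):
    """Log-odds score of every amino acid for one alignment column."""
    counts = Counter(res for res in column if res in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    background = 1.0 / len(AMINO_ACIDS)   # assumption: uniform background
    return {
        aa: math.log2(((counts[aa] + pseudocount) / total) / background)
        for aa in AMINO_ACIDS
    }

def build_profile(rows):
    """One score table per column of an ungapped alignment block."""
    return [column_scores(column) for column in zip(*rows)]

def score(profile, segment):
    """Sum the position-specific scores of a segment of the same length."""
    return sum(col[res] for col, res in zip(profile, segment))

# Ungapped block taken from the chite/wheat/trybr/mouse alignment.
block = ["PKRPLSAY", "PKRAPSAF", "PKRAMTSF", "PKRPRSAY"]
profile = build_profile(block)
print(round(score(profile, "PKRPLSAY"), 2), round(score(profile, "GGGGGGGG"), 2))
```

A segment that resembles the block scores high, an unrelated one scores low, which is what makes a profile more sensitive than a fixed pattern.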


Phylogeny

[Figure: phylogenetic tree relating chite, wheat, trybr and mouse, derived from the alignment]

-Evolution

-Paralogy/Orthology


Struc. Prediction

Column Constraint -> Evolution Constraint -> Structure Constraint


PsiPred or PHD for secondary structure prediction: 75% accurate.

Threading is improving but is not yet as good.


Caution!

Automatic multiple sequence alignment methods are not always perfect…


The problem

! Why is it difficult to compute a multiple sequence alignment?

Computation: what is the good alignment?

Biology: what is a good alignment?


CIRCULAR PROBLEM....

Good Sequences <-> Good Alignment


! Same as pairwise alignment problem

! We do NOT know how sequences evolve.

! We do NOT understand the relation between structures and sequences.

! We would NOT recognize the “correct” alignment if we had it IN FRONT of our eyes…


The Charlie Chaplin paradox


What do I need to know to make a good multiple alignment?

! How do sequences evolve?

! How does the computer align the sequences?

! How can I choose my sequences?

! What is the best program?

! How can I use my alignment?


An alignment is a story

ADKPKRPLSAYMLWLN

ADKPKRPLSAYMLWLN ADKPKRPLSAYMLWLN

Mutations + Selection

ADKPRRPLS-YMLWLN ADKPKRPKPRLSAYMLWLN

ADKPRRP---LS-YMLWLN

ADKPKRPKPRLSAYMLWLN

Insertion, Deletion, Mutation


Homology

! Same sequences -> same origin? -> same function? -> same 3D fold?

[Figure: % sequence identity plotted against alignment length, with 30% and 100 marked on the axes and two regions labelled "Same 3D Fold" and "Twilight Zone"]
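Percent sequence identity is easy to compute from two aligned rows. The sketch below uses one common convention, counting only columns where neither sequence has a gap; other conventions divide by the alignment length or by the shorter sequence and give different numbers. The function name `percent_identity` is chosen for this example only.

```python
def percent_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """Percent identity over the columns where both rows have a residue."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)

# The two descendants from the "An alignment is a story" slide.
print(round(percent_identity("ADKPRRP---LS-YMLWLN", "ADKPKRPKPRLSAYMLWLN"), 1))
```

Here 14 of the 15 columns without a gap are identical, far above the 30% region where homology becomes hard to call from identity alone.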


Residues and mutations

! All residues are equal, but some more than others…

[Figure: Venn diagram of the amino acids grouped by property (aliphatic, aromatic, hydrophobic, polar, small)]

Accurate matrices are data driven rather than knowledge driven


Substitution matrices

Different flavors:

• PAM: 250, 350

• BLOSUM: 45, 62

• …


What is the best substitution matrix?

! Mutation rates depend on families

! Choosing the right matrix may be tricky

! Gonnet250 > BLOSUM62 > PAM250

! Depends on the family, the program used and its tuning

Family            S     N
Histone3          6.4   0
Insulin           4.0   0.1
Interleukin I     4.6   1.4
α-Globin          5.1   0.6
Apolipoprot. AI   4.5   1.6
Interferon G      8.6   2.8

Rates in substitutions/site/billion years, as measured on mouse vs human (0.08 billion years)


Insertions and deletions?

[Figure: indel cost plotted against indel length L, for different penalty schemes]

Affine Gap Penalty

Cost = GOP + GEP*L
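The affine formula can be written as a one-line function, where GOP is the gap opening penalty, GEP the gap extension penalty and L the gap length. The default values below are arbitrary placeholders for illustration, not the defaults of any particular program.

```python
def affine_gap_cost(length: int, gop: float = 10.0, gep: float = 0.5) -> float:
    """Cost of one gap of the given length: GOP + GEP * L (no gap costs nothing)."""
    if length <= 0:
        return 0.0
    return gop + gep * length

# One gap of length 4 versus two separate gaps of length 2.
print(affine_gap_cost(4), 2 * affine_gap_cost(2))
```

With these values a single long gap (12.0) is much cheaper than two short ones (22.0), which is the point of the affine scheme: opening a gap is expensive, extending it is not.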


How to align many sequences?

! Exact algorithms are very time consuming

! Needleman & Wunsch

! Smith & Waterman

! -> a heuristic is definitely required!

Time needed by an exact algorithm:

2 Globins => 1 sec

3 Globins => 2 min

4 Globins => 5 hours

5 Globins => 3 weeks

6 Globins => 9 years

7 Globins => 1000 years
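To see where these running times come from, here is a minimal sketch of the exact pairwise algorithm (Needleman & Wunsch, global alignment). It assumes a simple match/mismatch score and a linear gap penalty instead of a substitution matrix and affine gaps; the scoring values are arbitrary. The helper `dp_cells` is hypothetical and only counts the cells of the dynamic programming table when the same idea is extended to N sequences of equal length.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))

def dp_cells(length, n_sequences):
    """Cells in the table for n sequences of the same length."""
    return (length + 1) ** n_sequences

print(needleman_wunsch("ACGT", "AGT"))
print([dp_cells(150, n) for n in range(2, 5)])
```

Each extra sequence multiplies the table size by roughly the sequence length, which is why the exact approach stops being practical after a handful of sequences.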


Existing methods

1-Carrillo and Lipman:

-MSA, DCA.

-Few Small Closely Related Sequences.

2-Segment Based:

-DIALIGN, MACAW.

-May Align Too Few Residues

-Do Well When They Can Run.

3-Iterative:

-HMMs, HMMER, SAM.

-Slow, Sometimes Inaccurate

-Good Profile Generators

4-Progressive:

-ClustalW, Pileup, Multalin…

-Fast and Sensitive

5-Mixtures:

-T-Coffee, MAFFT, MUSCLE, ProbCons, Psi-Praline

-Very Sensitive


Progressive alignment

Feng and Doolittle, 1980; Taylor 1981

Dynamic Programming Using A Substitution Matrix


-Depends on the ORDER of the sequences (Tree).

-Depends on the CHOICE of the sequences.

-Depends on the PARAMETERS:

•Substitution Matrix.

•Penalties (Gop, Gep).

•Sequence Weight.

•Tree making Algorithm.
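The dependence on the order can be illustrated with a toy version of the first step of a progressive method: deciding in which order sequences join the alignment. The sketch below is not what ClustalW or any listed program does. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure and a greedy "closest first" order instead of a real guide tree; `similarity` and `joining_order` are names invented for this example.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a: str, b: str) -> float:
    """Crude similarity in [0, 1]; a stand-in for a real pairwise alignment score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()

def joining_order(seqs: dict) -> list:
    """Greedy order: the most similar pair first, then the closest remaining one."""
    names = list(seqs)
    if len(names) < 2:
        return names
    sim = {frozenset(pair): similarity(seqs[pair[0]], seqs[pair[1]])
           for pair in combinations(names, 2)}
    first_pair = max(sim, key=sim.get)
    order = sorted(first_pair)
    remaining = [n for n in names if n not in first_pair]
    while remaining:
        # Pick the sequence most similar to anything already in the alignment.
        best = max(remaining,
                   key=lambda n: max(sim[frozenset((n, m))] for m in order))
        order.append(best)
        remaining.remove(best)
    return order

toy = {"a": "ADKPKRPLSAYMLWLN", "b": "ADKPRRPLSYMLWLN", "c": "GGGGGGGGGG"}
print(joining_order(toy))
```

Changing the similarity measure or the set of sequences changes the order, and with it the final alignment, which is why the parameters listed above matter.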


! Works well when phylogeny is dense

! No outlier sequence

! Example: river crossing


Selecting sequences from a BLAST output


A common mistake

! Sequences too closely related

! Identical sequences bring no information

! Multiple sequence alignments thrive on diversity

PRVA_MACFU SMTDLLNAEDIKKAVGAFSAIDSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIEE

PRVA_HUMAN SMTDLLNAEDIKKAVGAFSATDSFDHKKFFQMVGLKKKSADDVKKVFHMLDKDKSGFIEE

PRVA_GERSP SMTDLLSAEDIKKAIGAFAAADSFDHKKFFQMVGLKKKTPDDVKKVFHILDKDKSGFIEE

PRVA_MOUSE SMTDVLSAEDIKKAIGAFAAADSFDHKKFFQMVGLKKKNPDEVKKVFHILDKDKSGFIEE

PRVA_RAT SMTDLLSAEDIKKAIGAFTAADSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIEE

PRVA_RABIT AMTELLNAEDIKKAIGAFAAAESFDHKKFFQMVGLKKKSTEDVKKVFHILDKDKSGFIEE

:**::*.*******:***:* :****************..::******:***********

PRVA_MACFU DELGFILKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES

PRVA_HUMAN DELGFILKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES

PRVA_GERSP DELGFILKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVSES

PRVA_MOUSE DELGSILKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVAES

PRVA_RAT DELGSILKGFSSDARDLSAKETKTLMAAGDKDGDGKIGVEEFSTLVAES

PRVA_RABIT EELGFILKGFSPDARDLSVKETKTLMAAGDKDGDGKIGADEFSTLVSES

:*** ******.******.**** *:************.:******:**
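One way to avoid this mistake is to drop near-duplicates before building the final alignment. The sketch below assumes the sequences are already aligned rows of equal length and uses an arbitrary 90% identity cut-off; both the cut-off and the function names are illustrative choices, not a recommendation from the slides. The toy sequences are made up.

```python
def identity(row_a: str, row_b: str) -> float:
    """Percent identity over columns where neither row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)

def drop_near_duplicates(aligned: dict, max_identity: float = 90.0) -> dict:
    """Keep a sequence only if it is not too similar to one already kept."""
    kept = {}
    for name, row in aligned.items():
        if all(identity(row, other) <= max_identity for other in kept.values()):
            kept[name] = row
    return kept

toy = {
    "s1": "ACDEFGHIKL",
    "s2": "ACDEFGHIKL",   # identical to s1, brings no information
    "s3": "ACDEFGHIVV",
    "s4": "WWWWWGHIKL",
}
print(list(drop_near_duplicates(toy)))
```

The result depends on the input order, since the first member of each group of near-duplicates is the one that is kept.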


Respect information!

-This alignment is not informative about the relation between TPCC_MOUSE and the rest of the sequences.

-A better spread of the sequences is needed

PRVA_MACFU ------------------------------------------SMTDLLN----AEDIKKA

PRVA_HUMAN ------------------------------------------SMTDLLN----AEDIKKA

PRVA_GERSP ------------------------------------------SMTDLLS----AEDIKKA

PRVA_MOUSE ------------------------------------------SMTDVLS----AEDIKKA

PRVA_RAT ------------------------------------------SMTDLLS----AEDIKKA

PRVA_RABIT ------------------------------------------AMTELLN----AEDIKKA

TPCC_MOUSE MDDIYKAAVEQLTEEQKNEFKAAFDIFVLGAEDGCISTKELGKVMRMLGQNPTPEELQEM

: :*. .*::::

PRVA_MACFU VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI

PRVA_HUMAN VGAFSATDS--FDHKKFFQMVG------LKKKSADDVKKVFHMLDKDKSGFIEEDELGFI

PRVA_GERSP IGAFAAADS--FDHKKFFQMVG------LKKKTPDDVKKVFHILDKDKSGFIEEDELGFI

PRVA_MOUSE IGAFAAADS--FDHKKFFQMVG------LKKKNPDEVKKVFHILDKDKSGFIEEDELGSI

PRVA_RAT IGAFTAADS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGSI

PRVA_RABIT IGAFAAAES--FDHKKFFQMVG------LKKKSTEDVKKVFHILDKDKSGFIEEEELGFI

TPCC_MOUSE IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM

:. . * .*..:*: *: * *. :::..:*:::**: .*:*: :** :

PRVA_MACFU LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES-

PRVA_HUMAN LKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES-

PRVA_GERSP LKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVSES-

PRVA_MOUSE LKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVAES-

PRVA_RAT LKGFSSDARDLSAKETKTLMAAGDKDGDGKIGVEEFSTLVAES-

PRVA_RABIT LKGFSPDARDLSVKETKTLMAAGDKDGDGKIGADEFSTLVSES-

TPCC_MOUSE LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE

*: . .. :: .: : *: ***:.**:*. :** ::


Selecting diverse sequences

-A REASONABLE model now exists.

-Going further: remote homologues.

PRVB_CYPCA -AFAGVLNDADIAAALEACKAADSFNHKAFFAKVGLTSKSADDVKKAFAIIDQDKSGFIE

PRVB_BOACO -AFAGILSDADIAAGLQSCQAADSFSCKTFFAKSGLHSKSKDQLTKVFGVIDRDKSGYIE

PRV1_SALSA MACAHLCKEADIKTALEACKAADTFSFKTFFHTIGFASKSADDVKKAFKVIDQDASGFIE

PRVB_LATCH -AVAKLLAAADVTAALEGCKADDSFNHKVFFQKTGLAKKSNEELEAIFKILDQDKSGFIE

PRVB_RANES -SITDIVSEKDIDAALESVKAAGSFNYKIFFQKVGLAGKSAADAKKVFEILDRDKSGFIE

PRVA_MACFU -SMTDLLNAEDIKKAVGAFSAIDSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIE

PRVA_ESOLU --AKDLLKADDIKKALDAVKAEGSFNHKKFFALVGLKAMSANDVKKVFKAIDADASGFIE

: *: .: . .* .:*. * ** *: * : * :* * **:**

PRVB_CYPCA EDELKLFLQNFKADARALTDGETKTFLKAGDSDGDGKIGVDEFTALVKA-

PRVB_BOACO EDELKKFLQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG

PRV1_SALSA VEELKLFLQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ-

PRVB_LATCH DEELELFLQNFSAGARTLTKTETETFLKAGDSDGDGKIGVDEFQKLVKA-

PRVB_RANES QDELGLFLQNFRASARVLSDAETSAFLKAGDSDGDGKIGVEEFQALVKA-

PRVA_MACFU EDELGFILKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES

PRVA_ESOLU EEELKFVLKSFAADGRDLTDAETKAFLKAADKDGDGKIGIDEFETLVHEA

:** .*:.* .* *: ** :: .* **** **::** **


Aligning remote homologues

PRVA_MACFU ------------------------------------------SMTDLLNA----EDIKKA

PRVA_ESOLU -------------------------------------------AKDLLKA----DDIKKA

PRVB_CYPCA ------------------------------------------AFAGVLND----ADIAAA

PRVB_BOACO ------------------------------------------AFAGILSD----ADIAAG

PRV1_SALSA -----------------------------------------MACAHLCKE----ADIKTA

PRVB_LATCH ------------------------------------------AVAKLLAA----ADVTAA

PRVB_RANES ------------------------------------------SITDIVSE----KDIDAA

TPCS_RABIT -TDQQAEARSYLSEEMIAEFKAAFDMFDADGG-GDISVKELGTVMRMLGQTPTKEELDAI

TPCS_PIG -TDQQAEARSYLSEEMIAEFKAAFDMFDADGG-GDISVKELGTVMRMLGQTPTKEELDAI

TPCC_MOUSE MDDIYKAAVEQLTEEQKNEFKAAFDIFVLGAEDGCISTKELGKVMRMLGQNPTPEELQEM

: ::

PRVA_MACFU VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI

PRVA_ESOLU LDAVKAEGS--FNHKKFFALVG------LKAMSANDVKKVFKAIDADASGFIEEEELKFV

PRVB_CYPCA LEACKAADS--FNHKAFFAKVG------LTSKSADDVKKAFAIIDQDKSGFIEEDELKLF

PRVB_BOACO LQSCQAADS--FSCKTFFAKSG------LHSKSKDQLTKVFGVIDRDKSGYIEEDELKKF

PRV1_SALSA LEACKAADT--FSFKTFFHTIG------FASKSADDVKKAFKVIDQDASGFIEVEELKLF

PRVB_LATCH LEGCKADDS--FNHKVFFQKTG------LAKKSNEELEAIFKILDQDKSGFIEDEELELF

PRVB_RANES LESVKAAGS--FNYKIFFQKVG------LAGKSAADAKKVFEILDRDKSGFIEQDELGLF

TPCS_RABIT IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNADGYIDAEELAEI

TPCS_PIG IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNMDGYIDAEELAEI

TPCC_MOUSE IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM

: . .: .. . *: * : * :* : .*:*: :** .

PRVA_MACFU LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES-

PRVA_ESOLU LKSFAADGRDLTDAETKAFLKAADKDGDGKIGIDEFETLVHEA-

PRVB_CYPCA LQNFKADARALTDGETKTFLKAGDSDGDGKIGVDEFTALVKA--

PRVB_BOACO LQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG-

PRV1_SALSA LQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ--

PRVB_LATCH LQNFSAGARTLTKTETETFLKAGDSDGDGKIGVDEFQKLVKA--

PRVB_RANES LQNFRASARVLSDAETSAFLKAGDSDGDGKIGVEEFQALVKA--

TPCS_RABIT FR---ASGEHVTDEEIESLMKDGDKNNDGRIDFDEFLKMMEGVQ

TPCS_PIG FR---ASGEHVTDEEIESIMKDGDKNNDGRIDFDEFLKMMEGVQ

TPCC_MOUSE LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE

:: .. :: : :: .* :.** *. :** ::


Going further…

PRVA_MACFU VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI

PRVB_BOACO LQSCQAADS--FSCKTFFAKSG------LHSKSKDQLTKVFGVIDRDKSGYIEEDELKKF

PRV1_SALSA LEACKAADT--FSFKTFFHTIG------FASKSADDVKKAFKVIDQDASGFIEVEELKLF

TPCS_RABIT IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNADGYIDAEELAEI

TPCS_PIG IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNMDGYIDAEELAEI

TPCC_MOUSE IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM

TPC_PATYE SDEMDEEATGRLNCDAWIQLFER---KLKEDLDERELKEAFRVLDKEKKGVIKVDVLRWI

. : .. . :: . : * :* : .* *. : * .

PRVA_MACFU LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES--

PRVB_BOACO LQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG--

PRV1_SALSA LQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ---

TPCS_RABIT FR---ASGEHVTDEEIESLMKDGDKNNDGRIDFDEFLKMMEGVQ-

TPCS_PIG FR---ASGEHVTDEEIESIMKDGDKNNDGRIDFDEFLKMMEGVQ-

TPCC_MOUSE LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE-

TPC_PATYE LS---SLGDELTEEEIENMIAETDTDGSGTVDYEEFKCLMMSSDA

: . :: : :: * :..* :. :** ::


What makes a good alignment…

! The more divergent the sequences, the better

! The fewer indels, the better

! Nice ungapped blocks separated with indels

! Different classes of residues within a block:

! Completely conserved (*)

! Size and hydropathy conserved (:)

! Size or hydropathy conserved (.)

! The ultimate evaluation is a matter of personal judgment and knowledge
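The conservation line under an alignment block can be approximated in a few lines. This sketch marks fully conserved columns with `*` and columns whose residues all fall in one user-supplied group with `:`. The groups passed in the example are illustrative only and are not the groups used by ClustalW; the `.` class is left out for brevity, and `conservation_line` is a name chosen for this example.

```python
def conservation_line(rows, groups=()):
    """Build a conservation line for an alignment block (sketch).

    '*' = identical residue in every row, ':' = all residues within one of
    the given groups, ' ' = anything else, including columns with gaps.
    """
    line = []
    for column in zip(*rows):
        residues = set(column)
        if "-" in residues:
            line.append(" ")
        elif len(residues) == 1:
            line.append("*")
        elif any(residues <= set(group) for group in groups):
            line.append(":")
        else:
            line.append(" ")
    return "".join(line)

# Ungapped block from the chite/wheat/trybr/mouse alignment.
block = ["PKRPLSAY", "PKRAPSAF", "PKRAMTSF", "PKRPRSAY"]
print(repr(conservation_line(block, groups=("ST", "FWY"))))
```

The three leading stars correspond to the conserved P-K-R columns; the exact placement of the other symbols depends entirely on which residue groups are considered similar.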


Avoiding pitfalls


Naming your sequences the right way

! Never use white spaces in your sequence names

! Never use special symbols. Stick to plain letters, numbers and the underscore sign (_) to replace spaces. Avoid ALL other signs, especially the most tempting ones like @, #, |, *, >, <…

! Never use names longer than 15 characters

! Never give the same name to 2 different sequences in your set. Some programs accept it, others like ClustalW don’t.
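These naming rules are easy to enforce automatically before running an alignment program. The sketch below is one possible way to do it: every character other than a letter, digit or underscore becomes an underscore, names are cut to 15 characters, and duplicates get a numeric suffix. The function name `safe_names` and the example names (including the accession-like string) are made up for illustration.

```python
import re

def safe_names(names, max_len=15):
    """Return sanitized, unique sequence names in the same order."""
    used = set()
    result = []
    for name in names:
        # Replace anything that is not a letter, digit or underscore.
        base = re.sub(r"[^A-Za-z0-9_]", "_", name.strip())[:max_len] or "seq"
        candidate = base
        counter = 1
        while candidate in used:
            suffix = f"_{counter}"
            candidate = base[:max_len - len(suffix)] + suffix
            counter += 1
        used.add(candidate)
        result.append(candidate)
    return result

print(safe_names(["sp|P00000|PRVA_RAT", "my seq", "my seq"]))
```

Keep a table mapping the new names back to the original ones, since truncation can make the short names hard to recognize.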


Do not use too many sequences!


Beware of Repeats

! There is a problem when two sequences do not contain the same number of repeats

! It is then better to manually extract the repeats and to align them separately. Individual repeats can be recognized using Dotlet or Dotter.


Keep a biological perspective

chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD

wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE

trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP

mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

***. ::: .: .. . : . . * . *: *

chite AATAKQNYIRALQEYERNGG-

wheat ANKLKGEYNKAIAAYNKGESA

trybr AEKDKERYKREM---------

mouse AKDDRIRYDNEMKSWEEQMAE

* : .* . :

chite AD--K----PKR-PLYMLWLNS-ARESIKRENPDFK-VT-EVAKKGGELWRGL-

wheat -DPNK----PKRAP-FFVFMGE-FREEFKQKNPKNKSVA-AVGKAAGERWKSLS

trybr -K--KDSNAPKR-AMT-MFFSSDFR-S-KH-S-DLS-IV-EMSKAAGAAWKELG

mouse ----K----PKR-PRYNIYVSESFQEA-K--D-D-S-AQGKL-KLVNEAWKNLS

* *** .:: ::... : * . . . : * . *: *

chite KSEWEAKAATAKQNY-I--RALQE-YERNG-G-

wheat KAPYVAKANKLKGEY-N--KAIAA-YNK-GESA

trybr RKVYEEMAEKDKERY----K--RE-M-------

mouse KQAYIQLAKDDRIRYDNEMKSWEEQMAE-----

: : * : .* :

DIFFERENT PARAMETERS


Do not overtune!!!

DO NOT PLAY WITH PARAMETERS!

IF YOU KNOW THE ALIGNMENT YOU WANT: MAKE IT YOURSELF!

chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD

wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE

trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP

mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

***. ::: .: .. . : . . * . *: *

chite AATAKQNYIRALQEYERNGG-

wheat ANKLKGEYNKAIAAYNKGESA

trybr AEKDKERYKREM---------

mouse AKDDRIRYDNEMKSWEEQMAE

* : .* . :

chite ---ADKPKRPL-SAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD

wheat --DPNKPKRAP-SAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE

trybr KKDSNAPKRAMTSFMFFSSDFRS-----KHSDLS-IVEMSKAAGAAWKELGP

mouse -----KPKRPR-SAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

***. * .: .. . : . . * . *: *

chite AATAKQNYIRALQEYERNGG-

wheat ANKLKGEYNKAIAAYNKGESA

trybr AEKDKERYKREM---------

mouse AKDDRIRYDNEMKSWEEQMAE

* : .* . :


BaliBase classification and benchmark

PROBLEM: Description

Even phylogenetic spread

One outlier sequence

Two distantly related groups

Long internal indel

Long terminal indel


Choosing the right method

Source: BaliBase, Thompson et al., NAR (1999); Do et al., Genome Res. (2005)

PROBLEM Program Strategy

ClustalW, T-Coffee, MUSCLE, ProbCons

ProbCons, MUSCLE, MAFFT

Dialign II, ProbCons, T-Coffee

T-Coffee, MUSCLE, ProbCons

Dialign II, ProbCons, MAFFT


Some interesting links


More links

! MUSCLE

! http://www.drive5.com/muscle

! MAFFT

! http://timpani.genome.ad.jp/~mafft/server

! PROBCONS

! http://probcons.stanford.edu

! PSI-PRALINE

! http://ibivu.cs.vu.nl/programs/pralinewww

! 3D-COFFEE

! http://www.igs.cnrs-mrs.fr/Tcoffee/tcoffee_cgi/index.cgi


Conclusion

! The best alignment method:

! Your brain

! The right data

! The best evaluation method:

! Your eyes

! Experimental information (SwissProt)

! Choosing the sequences well is important

! Beware of repeated elements

! What can I conclude?

! Homology -> information extrapolation

! How can I go further?

! Patterns

! Profiles

! HMMs

! …