Alignments and alignment reliability
The first critical step in sequence analysis – the know how
Eyal Privman and Osnat Penn
Tel Aviv University
COST Training School
Rehovot, 2010
What are alignments good for?
To compare sequences
• Find homology
• Similar sequence, similar function
To learn about sequence evolution
• Mismatch = point mutation
• Gap = indel (insertion or deletion)
• Reconstruct phylogenetic tree
• Infer selection forces, e.g., detecting positive selection
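Sequence evolution
[Figure: a sequence evolving along a tree; time points 30 MYA, 5 MYA, Today; tips labeled Human, Chimp, Mouse]
30 MYA: ATGAAATAA
5 MYA: ATGTTTTAA, ATGCCCAAATAA
Today: ATGTTTTAA, ATGTTT, ATGCCCAAATAA
Resulting alignment:
ATG---TTTTAA
ATG---TTT---
ATGCCCAAATAA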
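The reading "mismatch = point mutation, gap = indel" can be made concrete with a small sketch. It compares the first and third rows of the aligned example on this slide; the function name `compare_aligned` is hypothetical, and it counts gap columns rather than indel events (three adjacent gap columns may be a single insertion or deletion).

```python
def compare_aligned(row_a, row_b):
    """Count matches, mismatches, and gap columns between two aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    counts = {"match": 0, "mismatch": 0, "gap": 0}
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            continue  # gap in both rows: no pairwise information
        if x == "-" or y == "-":
            counts["gap"] += 1       # candidate indel position
        elif x == y:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1  # candidate point mutation
    return counts


print(compare_aligned("ATG---TTTTAA", "ATGCCCAAATAA"))
# {'match': 6, 'mismatch': 3, 'gap': 3}
```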
Alignment and phylogeny are mutually dependent
[Figure: diagram with the labels "Unaligned sequences", "Sequence alignment", "MSA", "Phylogeny reconstruction", and "Inaccurate tree building"; tree scale bar 0.4]
Alignment and phylogeny are both challenging
• Alignment: 25% of residues are aligned incorrectly (based on BAliBASE, a large representative set of proteins)
• Phylogeny: 5% of tree branches are wrong (based on simulations of 100 protein sequences)
Making an alignment
• For 2 sequences: use exact methods.
• For more sequences: exact methods are not feasible (too slow), so we use heuristic methods.
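As an illustration of an exact method for two sequences, here is a minimal sketch of global alignment by dynamic programming (Needleman-Wunsch). The slides do not name a specific algorithm or scoring scheme, so the choice of algorithm and the scores (match +1, mismatch -1, linear gap -2) are assumptions made for the example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (aligned_a, aligned_b, score) for an optimal global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b, i, j = [], [], n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b)), score[n][m]


aligned_a, aligned_b, best = needleman_wunsch("ATGTTTTAA", "ATGTTT")
print(aligned_a, aligned_b, best, sep="\n")
```

The table has (n+1) x (m+1) cells for two sequences; the same idea applied to k sequences needs a k-dimensional table, which is why exact methods become too slow for multiple alignment.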
Progressive alignment
First step: compute pairwise distances
Compute the pairwise alignments for all against all (10 pairwise alignments). The similarities are converted to distances and stored in a table:

     A    B    C    D    E
A
B    8
C   15   17
D   16   14   10
E   32   31   31   32
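The all-against-all step can be sketched as follows. Five sequences give 5 x 4 / 2 = 10 pairs. The toy sequences are hypothetical, and `difflib.SequenceMatcher.ratio()` is used only as a crude stand-in for a similarity derived from a real pairwise alignment; the conversion distance = 1 - similarity is likewise an assumption, since the slides do not give the formula.

```python
from difflib import SequenceMatcher
from itertools import combinations


def distance_table(seqs):
    """Map each unordered pair of sequence names to a distance in [0, 1]."""
    table = {}
    for a, b in combinations(sorted(seqs), 2):
        # Stand-in similarity; a real pipeline would score a pairwise alignment.
        similarity = SequenceMatcher(None, seqs[a], seqs[b], autojunk=False).ratio()
        table[(a, b)] = 1.0 - similarity
    return table


# Hypothetical toy sequences named like the slide's A to E.
toy = {
    "A": "ATGAAATAA",
    "B": "ATGAAGTAA",
    "C": "ATGTTTTAA",
    "D": "ATGTTT",
    "E": "ATGCCCAAATAA",
}
table = distance_table(toy)
print(len(table))  # 10 pairs for 5 sequences
```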
Second step: build a guide tree
Cluster the sequences to create a tree (guide tree):
• represents the order in which pairs of sequences are to be aligned
• similar sequences are neighbors in the tree
• distant sequences are distant from each other in the tree
[Figure: guide tree with leaves A, D, C, B, E, shown next to the pairwise distance table]
The guide tree is imprecise and is NOT the tree which truly describes the evolutionary relationship between the sequences!
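One way to turn the distance table into a guide tree is average-linkage clustering (UPGMA). The slides do not say which clustering method is used, so UPGMA here is an assumption chosen for brevity; the distances are the ones from the table on the slide, and the nested-tuple tree representation is hypothetical.

```python
from itertools import combinations

# Lower triangle of the slide's distance table.
rows = {"B": [8], "C": [15, 17], "D": [16, 14, 10], "E": [32, 31, 31, 32]}
names = ["A", "B", "C", "D", "E"]
leaf_dist = {
    frozenset((name, names[k])): d
    for name, values in rows.items()
    for k, d in enumerate(values)
}


def upgma(names, leaf_dist):
    """Build a guide tree as nested tuples by repeatedly merging the closest clusters."""
    clusters = [(name, [name]) for name in names]  # (subtree, member leaves)

    def average_distance(c1, c2):
        pairs = [(x, y) for x in c1[1] for y in c2[1]]
        return sum(leaf_dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


guide_tree = upgma(names, leaf_dist)
print(guide_tree)  # ('E', (('A', 'B'), ('C', 'D')))
```

With these distances A and B merge first (8), then C and D (10), then the two pairs (average 15.5), and E joins last (average 31.5). As the slide stresses, this tree only sets the order of alignment and should not be read as the evolutionary tree.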
Third step: align sequences in a bottom-up order
1. Align the most similar (neighboring) pairs
2. Align pairs of pairs
3. Align sequences clustered to pairs of pairs deeper in the tree
[Figure: guide tree with leaves A, D, C, B, E, next to Sequence A, Sequence B, Sequence C, Sequence D, Sequence E]
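The bottom-up order can be read off a guide tree with a post-order traversal: every internal node becomes one alignment step, performed only after both of its children are done. The sketch below assumes a hypothetical tree topology, (((A, B), (C, D)), E), written as nested tuples; it lists the steps and does not perform the alignments themselves.

```python
def alignment_order(tree, steps=None):
    """Return (leaves under this node, list of (left group, right group) steps)."""
    if steps is None:
        steps = []
    if isinstance(tree, str):  # a leaf: a single sequence, nothing to align yet
        return [tree], steps
    left, _ = alignment_order(tree[0], steps)
    right, _ = alignment_order(tree[1], steps)
    steps.append((left, right))  # align the two already-built groups
    return left + right, steps


# Hypothetical guide tree topology for five sequences.
tree = ((("A", "B"), ("C", "D")), "E")
_, steps = alignment_order(tree)
for number, (left, right) in enumerate(steps, start=1):
    print(f"{number}. align {'+'.join(left)} with {'+'.join(right)}")
```

For this topology the steps are: A with B, C with D, the two pairs with each other, and finally the four-sequence group with E.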
Multiple sequence alignment (MSA): progressive alignment
[Figure: workflow with the stages "Pairwise distance table", "Guide tree", and "MSA", and the label "Iterative"]
Multiple sequence alignment (MSA)
Several advanced MSA programs are available.
Today we will use two:
• MAFFT: fastest and one of the most accurate
• PRANK: distinct from all other MSA programs because of its correct treatment of insertions/deletions
MAFFT
Web server & download: http://align.bmr.kyushu-u.ac.jp/mafft/online/server/
Efficiency-tuned variants: quick & dirty or slow but accurate
Kazutaka Katoh, Kazuharu Misawa, Kei-ichi Kuma and Takashi Miyata. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research, 2002, Vol. 30, No. 14, 3059-3066. © 2002 Oxford University Press
Choosing a MAFFT strategy
L-INS-i
ooooooooooooooooooooooooooooooooXXXXXXXXXXX-XXXXXXXXXXXXXXX------------------
--------------------------------XX-XXXXXXXXXXXXXXX-XXXXXXXXooooooooooo-------
------------------ooooooooooooooXXXXX----XXXXXXXX---XXXXXXXooooooooooo-------
--------ooooooooooooooooooooooooXXXXX-XXXXXXXXXX----XXXXXXXoooooooooooooooooo
--------------------------------XXXXXXXXXXXXXXXX----XXXXXXX------------------
G-INS-i
XXXXXXXXXXX-XXXXXXXXXXXXXXX
XX-XXXXXXXXXXXXXXX-XXXXXXXX
XXXXX----XXXXXXXX---XXXXXXX
XXXXX-XXXXXXXXXX----XXXXXXX
XXXXXXXXXXXXXXXX----XXXXXXX
E-INS-i
oooooooooXXX------XXXX---------------------------------XXXXXXXXXXX-XXXXXXXXXXXXXXXooooooooooooo
---------XXXXXXXXXXXXXooo------------------------------XXXXXXXXXXXXXXXXXX-XXXXXXXX-------------
-----ooooXXXXXX---XXXXooooooooooo----------------------XXXXX----XXXXXXXXXXXXXXXXXXooooooooooooo
---------XXXXX----XXXXoooooooooooooooooooooooooooooooooXXXXX-XXXXXXXXXXXX--XXXXXXX-------------
---------XXXXX----XXXX---------------------------------XXXXX---XXXXXXXXXX--XXXXXXXooooo--------
[Scale: quick & dirty to slow but accurate]
MAFFT output
Saving the output:
• Choose a format: Clustal, Fasta, or click "Reformat" to convert to a selection of other formats
• Save the page as a text file, e.g., save as a "phylip" file and upload to PhyML for reconstructing the tree
• A colored view of the alignment
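If the alignment was saved as FASTA, it can also be converted to PHYLIP locally. The following is a minimal sketch, not the converter used by the MAFFT server: it writes sequential PHYLIP, keeps only the first word of each FASTA header, and truncates names to 10 characters (an assumption based on the strict PHYLIP convention; truncation can create duplicate names, so check them before uploading).

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = (line[1:].split() or ["unnamed"])[0], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def to_phylip(records, name_width=10):
    """Format aligned records as sequential PHYLIP text."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all sequences must be aligned to the same length")
    lines = [f"{len(records)} {lengths.pop()}"]
    for name, seq in records:
        lines.append(f"{name[:name_width]:<{name_width}} {seq}")
    return "\n".join(lines) + "\n"


fasta = ">seq1 example\nATG---\nTTTTAA\n>seq2\nATGCCCAAATAA\n"
print(to_phylip(read_fasta(fasta)), end="")
```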
PhyML: tree reconstructionThe most widely used maximum likelihood (ML) program Web server & download: http://www.atgc-montpellier.fr/phyml/
PRANK output
If you need a different format – copy the results to the READSEQ sequence converter: http://www-bimas.cit.nih.gov/molbio/readseq/
1. Download and save the sequences file from Osnat's homepage
(you can google "Osnat Penn" and look for the workshop
materials under "Teaching"). Save the file as "trim5a.AA.fas"
(File, "Save page as"). This file contains 20 protein sequences
in FASTA format.
2. Run the PRANK web server to create a protein alignment:
a. In the "Default alignment" section browse for
"trim5a.AA.fas".
b. Run (press the "Start alignment" button).
3. While you wait: copy the sequences into the MAFFT web server
and run the "automatic" "moderately accurate" strategy. Which
strategy did MAFFT choose for you? Click on the "Fasta
format" link, save as "trim5a.AA.mafft.aln" (File, "Save
page as"), and try the "Jalview" button.
4. When PRANK finishes, click on the "Show Fasta file" button
and save the MSA under the name "trim5a.AA.prank.aln".
Sources of alignment errors
Progressive alignment algorithms are greedy heuristics
• Co-optimal solutions: Heads-or-Tails (HoT) scores (Landan & Graur 2007)
• Guide-tree errors: GUIDANCE scores (Penn, Privman et al. MBE 2010)
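The heads-or-tails idea can be illustrated on two sequences: align them as given ("heads"), align their reversals and flip the result back ("tails"), and check how many aligned residue pairs agree. This is a simplified sketch and not the published HoT implementation; the small global aligner, its scores, and the pair-based agreement measure are all assumptions made for the example.

```python
def nw_align(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a fixed tie-breaking order (diagonal, up, left)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    row_a, row_b, i, j = [], [], n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b))


def residue_pairs(row_a, row_b):
    """Set of (index in a, index in b) for residues placed in the same column."""
    pairs, i, j = set(), 0, 0
    for x, y in zip(row_a, row_b):
        if x != "-" and y != "-":
            pairs.add((i, j))
        i += x != "-"
        j += y != "-"
    return pairs


def hot_score(a, b):
    """Fraction of 'heads' residue pairs that also appear in the 'tails' alignment."""
    heads = residue_pairs(*nw_align(a, b))
    tails_a, tails_b = nw_align(a[::-1], b[::-1])
    tails = residue_pairs(tails_a[::-1], tails_b[::-1])
    return len(heads & tails) / len(heads) if heads else 1.0


print(hot_score("ACGT", "ACGT"))  # 1.0: a single optimal alignment
print(hot_score("AAT", "AT"))     # 0.5: two co-optimal placements of the gap
```

For AAT versus AT the gap can sit opposite either A at the same score, and the aligner picks a different one depending on the direction, which is exactly the kind of arbitrary choice among co-optimal solutions that the slide refers to.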
GUIDANCE: Guide-tree based alignment confidence scores
[Figure: bootstrap sampling of NJ trees gives Tree 1, Tree 2, …, Tree 99, Tree 100; progressive alignment on each tree gives MSA 1, MSA 2, …, MSA 99, MSA 100; these are compared with the base MSA to produce GUIDANCE scores on a scale from 0 (uncertain) to 1 (confident)]
Penn, Privman et al. MBE. 2010
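The comparison between the base MSA and the alternative MSAs can be sketched as a per-column score: for every pair of residues aligned in a base column, count the fraction of alternative MSAs that align the same pair, then average over the pairs in the column. This is a simplified illustration in the spirit of GUIDANCE, not the published scoring code; the function names are hypothetical, and the alternative MSAs are passed in directly instead of being built from bootstrap guide trees.

```python
from itertools import combinations


def column_pairs(msa):
    """Per column, the set of aligned residue pairs ((seq, pos), (seq, pos))."""
    counters = [0] * len(msa)  # next residue index for each sequence
    columns = []
    for column in zip(*msa):
        residues = []
        for s, ch in enumerate(column):
            if ch != "-":
                residues.append((s, counters[s]))
                counters[s] += 1
        columns.append(set(combinations(residues, 2)))
    return columns


def column_confidence(base, alternatives):
    """Score each base column by how often its residue pairs recur (None if no pairs)."""
    if not alternatives:
        raise ValueError("need at least one alternative MSA")
    alt_sets = [set().union(*column_pairs(msa)) for msa in alternatives]
    scores = []
    for pairs in column_pairs(base):
        if not pairs:
            scores.append(None)  # fewer than two residues in this column
            continue
        support = sum(pair in alt for pair in pairs for alt in alt_sets)
        scores.append(support / (len(pairs) * len(alt_sets)))
    return scores


base = ["AC-GT", "ACAGT"]
alternatives = [["AC-GT", "ACAGT"], ["ACG-T", "ACAGT"]]
print(column_confidence(base, alternatives))  # [1.0, 1.0, None, 0.5, 1.0]
```

All sequences must appear in the same order in every MSA. The fourth column scores 0.5 because only one of the two alternatives aligns the same G residues.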
[Figure (a): GUIDANCE score per alignment column, colored from Confident to Uncertain. Labels: HIV1 group M, SIV chimp, HIV1 group O, HIV1 group N, SIV cerco, SIV gorilla; Transmembrane domain, Extracellular domain, Cytoplasmic domain]
[Figure (b): GUIDANCE score per alignment column, colored from Confident to Uncertain. Labels: HIV1 group M, SIV chimp, HIV1 group O; Transmembrane domain, Extracellular domain, Cytoplasmic domain]
1. Run the GUIDANCE web server to calculate confidence scores for
the MAFFT alignment:
a. In the "Upload your sequence file" window browse for
"trim5a.AA.fas".
b. Choose "Amino Acids" in the "Sequences Type" option.
c. To speed up the run, change the "Number of bootstrap
repeats" in the "Advanced options" section to 30. Note that
this is not recommended for real analyses.
d. Run (press the "Submit" button).
Empirical findings: variation among genes
"Important" proteins evolve slower than "unimportant" ones
Empirical findings: variation among sites
Functional sites evolve slower than nonfunctional sites
Silent and non-silent mutations
Silent: UUU -> UUC (both encode phenylalanine)
Non-silent: UUU -> CUU (phenylalanine to leucine)
For most proteins, the rate of silent substitutions is much higher than the non-silent rate
This is called purifying selection = conservation
There are rare cases where the non-silent rate is much higher than the silent rate
This is called positive selection
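Whether a single codon change is silent can be checked by translating both codons. The sketch below builds the standard genetic code from the usual compact TCAG ordering and classifies the two examples from the slide; the function names are hypothetical. It only labels individual changes and does not estimate substitution rates.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... (bases T, C, A, G).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_codon(codon):
    """Translate one DNA or RNA codon to a one-letter amino acid ('*' = stop)."""
    return CODON_TABLE[codon.upper().replace("U", "T")]


def classify_substitution(before, after):
    """Label a codon change as 'silent' or 'non-silent'."""
    same = translate_codon(before) == translate_codon(after)
    return "silent" if same else "non-silent"


print(classify_substitution("UUU", "UUC"))  # silent (F -> F)
print(classify_substitution("UUU", "CUU"))  # non-silent (F -> L)
```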
Positive Selection
Examples:
• Pathogen proteins evading the host immune system
• Proteins of the immune system detecting pathogen proteins
• Pathogen proteins that are drug targets
• Proteins that are products of gene duplication
• Proteins involved in the reproductive system
False positive predictions
• Selecton uses an MSA as input
• The MSA may contain unreliable regions
• Unreliable regions lead to errors in Selecton computations, and hence to errors in the positive selection inference
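One common way to limit such false positives is to drop low-confidence columns before the downstream analysis. The sketch below assumes per-column confidence scores between 0 and 1 are already available (for example from GUIDANCE) and removes columns below a cutoff. The function name, the toy scores, and the cutoff value are hypothetical; for codon alignments, columns would need to be removed in whole codons to keep the reading frame.

```python
def filter_columns(msa, column_scores, cutoff):
    """Keep only columns whose score is at least `cutoff`; return (new MSA, kept indices)."""
    if any(len(row) != len(column_scores) for row in msa):
        raise ValueError("need exactly one score per alignment column")
    keep = [
        i for i, score in enumerate(column_scores)
        if score is not None and score >= cutoff
    ]
    filtered = ["".join(row[i] for i in keep) for row in msa]
    return filtered, keep


# Toy protein-like alignment and made-up confidence scores.
msa = ["MKV-LT", "MKVALT"]
scores = [1.0, 1.0, 0.9, 0.4, 0.95, 1.0]
filtered, kept = filter_columns(msa, scores, cutoff=0.9)
print(filtered, kept)  # ['MKVLT', 'MKVLT'] [0, 1, 2, 4, 5]
```

Keeping the list of retained column indices makes it possible to map site-level results, such as positively selected positions, back to the original alignment.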
1. Go to the GUIDANCE results of the last exercise.
2. Which columns are not well aligned? Are these sites
also predicted to evolve under positive selection?
See Selecton results in:
http://selecton.tau.ac.il/results/1268662868/colors.html
Summary
Different alignment programs may produce different MSAs.
Alignment uncertainty may cause errors in downstream analyses such as positive selection analysis.
GUIDANCE can detect alignment errors.