Sequence alignment methods
ATGAAAAAGAAAACAACACTTAGCGAGGAGGACCAGGCTCTGTTTCGCCAGTTGATGGCG
GGGACTCGCAAGATTAAGCAGGACACGATTGTCCACCGACCGCAGCGTAAAAAAATCAGC
GAAGTGCCGGTGAAACGCTTGATCCAGGAGCAGGCTGATGCCAGCCATTATTTCTCCGAT
GAGTTTCAGCCGTTATTAAATACCGAAGGTCCGGTGAAATATGTTCGCCCGGATGTCAGC
CATTTTGAGGCGAAGAAACTGCGCCGTGGCGATTATTCGCCGGAGTTGTTTTTGGATTTA
CACGGTCTGACGCAGCTGCAGGCCAAGCAGGAACTGGGGGCGTTGATTGCCGCCTGCCGC   (PstI site: CTGCAG)
GAGTTGCCCTGATAAGGGTACTATTACGGACGAGTCATCTTATGCGGAGCGATTAGGGCG
CGGTTAGCGAGCTACTATCGGGGGGCGAGCTTATTGGGCGGGGCGGACTATGGGCTGGCG
AGGCGGAACGGGTACTGGACGTACTAGGCGAGGCGATCTAGCGAGGGCATGTTGATGGCG
GGAGCGGTTTTTAGGGCGTTTTTGGCGGCCCCCTATCTATGCAGCACGAGCGACTATGCC
Word/pattern recognition - identification of restriction enzyme cleavage sites
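A restriction site search is a plain pattern match. As a minimal sketch, assuming the PstI recognition sequence CTGCAG and using the sequence line that carries the PstI label on the slide, the hypothetical helper `find_sites` reports every position where the pattern occurs:

```python
def find_sites(sequence, site):
    """Return the 0-based start position of every occurrence of `site`.

    Overlapping occurrences are reported too, because the search resumes
    one base after each hit.
    """
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions


# Recognition sequence of PstI (palindromic, so one strand is enough).
PSTI_SITE = "CTGCAG"

# The sequence line labelled "PstI" on the slide.
labelled_line = "CACGGTCTGACGCAGCTGCAGGCCAAGCAGGAACTGGGGGCGTTGATTGCCGCCTGCCGC"

psti_positions = find_sites(labelled_line, PSTI_SITE)
print(psti_positions)
```

Because the PstI site reads the same on both strands, scanning one strand is sufficient here; a non-palindromic site would also need a search on the reverse complement.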
The universe of biological sequence analysis
CGCCGAGGATGGCCGTCATGGCGCCCCGAACCCTCCTCCTGCTACTCTTGGGGGCCCTGG
MetAlaProArgThrLeuLeuLeuLeuLeuLeuGlyAlaLeuAla
CCCTGACCCAGACCTGGGCGGGTGAGTGCGGGGTCGTGGGGAAACCGCCTCTGCGGGGAG
LeuThrGlnThrTrpAlaGly
AAGCAAGGGGCCCGCCCGGCGGGGACGCAGGACCCGGGTAGCCGCGCCGGGAGGAGGGTC
GGGTGGGTCTCAGCCACTCCTCGCCCCCAGGCTCCCACTCCATGAGGTATTTCACCACAT
SerHisSerMetArgTyrPheThrThrSer
CCGTGTCCCGGCCCGGCCGCGGGGAGCCCCGCTTCATCGCCGTGGGCTACGTGGACGACA
ValSerArgProGlyArgGlyGluProArgPheIleAlaValGlyTyrValAspAspThr
CGCAGTTCGTGCGGTTTGACAGCGACGCCGCGAGCCAGAGGATGGAGCCGCGGGCACCGT
GlnPheValArgPheAspSerAspAlaAlaSerGlnArgMetGluProArgAlaProTrp
GGATAGAGCAGGAGGGGCCGGAGTATTGGGACCTGCAGACACGGAATGTGAAGGCCCAGT
IleGluGlnGluGlyProGluTyrTrpAspLeuGlnThrArgAsnValLysAlaGlnSer
CACAGACTGACCGAGCGAACCTGGGGACCCTGCGCGGCTACTACAACCAGAGCGAGGCCG
GlnThrAspArgAlaAsnLeuGlyThrLeuArgGlyTyrTyrAsnGlnSerGluAla
GTGAGTGACCCCGGCCCGGGGCGCAGGTCACGACCTCTCATCCCCCACGGACGGGCCGGG
Exon 1
Exon 2
The universe of biological sequence analysis - prediction of exon structure
CGATAGCATGATGTCT
CGACAGCAT-ATGTCT
*** ***** ******
Pairwise alignment
Why sequence alignments?
• Prediction of function
• Protein family analysis
• Comparative genomics
• Phylogeny / evolutionary history
• Genome sequencing:
  • Assembly
  • Alignment to reference genome
Prediction of function
We have a ‘new’ sequence. Is it similar to a previously known sequence? We can test by alignment whether it is similar to a sequence with known function. If it is, we can assign a possible function to our new sequence.
[Figure: the sequence to be investigated is compared against a database of sequences, one of which is a sequence with known function]
Protein family analysis
Comparative genomics - reveals biologically significant regions of the genome
[Figure: dot plot of CGATAGCATGATGTCT against CGACAGCATATGTCT]

Pairwise alignment with per-column scores (match +2, mismatch -1, gap -2):

CGATAGCATGATGTCT
CGACAGCAT-ATGTCT
*** ***** ******
+2+2+2-1+2+2+2+2+2-2+2+2+2+2+2+2 = 25

Alternative alignment with more gaps:

CG-----ATAGCATGATGTCT
CGACAGCATA------TGTCT
**     ***      *****
+2+2-2-2-2-2-2+2+2+2-2-2-2-2-2-2+2+2+2+2+2 = -2
Pairwise alignment
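The two totals on the slide (25 and -2) follow from adding one score per alignment column. A minimal sketch, assuming the scheme that the per-column numbers suggest (match +2, mismatch -1, gap -2); the function name `score_alignment` is only illustrative:

```python
def score_alignment(top, bottom, match=2, mismatch=-1, gap=-2):
    """Sum the column scores of an already aligned pair of sequences.

    Both strings must have the same length; '-' marks a gap.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap          # one sequence has a gap in this column
        elif a == b:
            total += match        # identical residues
        else:
            total += mismatch     # different residues
    return total


good = score_alignment("CGATAGCATGATGTCT", "CGACAGCAT-ATGTCT")
poor = score_alignment("CG-----ATAGCATGATGTCT", "CGACAGCATA------TGTCT")
print(good, poor)
```

The first alignment has 14 matches, one mismatch and one gap; the second has only 10 matches and 11 gap columns, which is why its score drops below zero.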
More sophisticated scoring of protein sequence alignments
Each amino acid change has a characteristic probability, recorded in a substitution matrix.

A G L C E
| | | | |
A A L C D
4+0+4+9+2 = 19
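With a substitution matrix, each aligned pair of residues is looked up instead of being scored as a plain match or mismatch. The sketch below holds only the five entries needed for the slide's example (the values are the ones printed on the slide, and they coincide with the corresponding BLOSUM62 entries); the names `SCORES`, `pair_score` and `score_ungapped` are illustrative, and a real matrix covers all amino acid pairs.

```python
# Partial substitution matrix: only the pairs used in the slide's example.
# Keys are stored in alphabetical order so that (a, b) and (b, a) agree.
SCORES = {
    ("A", "A"): 4,
    ("A", "G"): 0,
    ("L", "L"): 4,
    ("C", "C"): 9,
    ("D", "E"): 2,
}


def pair_score(a, b):
    """Look up the score of one aligned residue pair (order-independent)."""
    return SCORES[tuple(sorted((a, b)))]


def score_ungapped(seq1, seq2):
    """Score two equally long sequences aligned without gaps."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


total = score_ungapped("AGLCE", "AALCD")
print(total)
```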
Local and global alignments
[Figure: two sequences, A and B, shown once as a local alignment and once as a global alignment]
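The difference between the two alignment types shows up directly in the dynamic programming recurrences. The following is a sketch only, assuming a linear gap penalty and the simple match/mismatch scores used earlier (+2, -1, -2): `global_score` follows the Needleman-Wunsch scheme, in which both sequences must be aligned end to end, and `local_score` follows the Smith-Waterman scheme, in which a cell may restart at 0 and the best cell anywhere is the result.

```python
def global_score(a, b, match=2, mismatch=-1, gap=-2):
    """Needleman-Wunsch score: both sequences are aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]   # row 0: leading gaps
    for i in range(1, len(a) + 1):
        cur = [i * gap]                           # column 0: leading gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman score: best-scoring pair of subsequences."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            value = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(value)
            best = max(best, value)
        prev = cur
    return best


# Hypothetical example: a shared core (ACGT) inside unrelated flanks.
x, y = "GGGGACGTGGGG", "TTACGTTT"
print(global_score(x, y), local_score(x, y))
```

For the hypothetical pair above the local score rewards only the shared ACGT core, whereas the global score is pulled down by the mismatches and gaps needed to cover the unrelated flanks.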
Frequently used methods in sequence analysis that are based on sequence alignment
BLAST - searches in databases for sequence similarity
ClustalW - multiple alignment of sequences
FASTA, 1988
BLAST, 1990
[Photos: William Pearson, David Lipman, Stephen Altschul]
Searching databases for sequence similarity - traditional alignment method too slow
BLAST - Basic Local Alignment Search Tool
A query sequence (DNA or protein) is tested against all sequences in a database (DNA or protein), i.e., the query is aligned to all the database sequences. The final output is a list of the best-matching database sequences.
Searching databases for sequence similarity - shortcuts of BLAST
[Figure: dot matrix of the query M A K I Q G L G K R Y against a database sequence; a short diagonal run of matches is labelled "word hit"]
Improvement of speed as compared to the local alignment algorithm:
* The initial search is for word hits.
* Word hits are then extended in either direction.
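The two shortcuts can be sketched in a few lines. This is a simplification, not the BLAST implementation: it looks only for exact words of a fixed size and extends them while residues are identical, whereas BLAST scores words and extensions with a substitution matrix and a drop-off threshold. The query is the one on the slide; the subject sequence `MAKLQGALGKRY` is an assumption read from the figure residue, and the word size of 3 is likewise assumed.

```python
from collections import defaultdict


def find_word_hits(query, subject, word_size=3):
    """Return (query_pos, subject_pos) for every exact shared word."""
    index = defaultdict(list)                 # word -> positions in subject
    for j in range(len(subject) - word_size + 1):
        index[subject[j:j + word_size]].append(j)
    hits = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            hits.append((i, j))
    return hits


def extend_hit(query, subject, i, j, word_size=3):
    """Extend a word hit left and right while residues stay identical.

    Returns (query_start, subject_start, length) of the ungapped segment.
    """
    q_start, s_start = i, j
    while q_start > 0 and s_start > 0 and query[q_start - 1] == subject[s_start - 1]:
        q_start -= 1
        s_start -= 1
    q_end, s_end = i + word_size, j + word_size
    while q_end < len(query) and s_end < len(subject) and query[q_end] == subject[s_end]:
        q_end += 1
        s_end += 1
    return (q_start, s_start, q_end - q_start)


query = "MAKIQGLGKRY"
subject = "MAKLQGALGKRY"    # assumed database sequence, for illustration only

hits = find_word_hits(query, subject)
segments = {extend_hit(query, subject, i, j) for i, j in hits}
print(hits)
print(sorted(segments))
```

Indexing the words of one sequence first means the other sequence is scanned only once, which is the source of the speed-up over filling a complete alignment matrix for every database entry.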
BLASTP 2.2.9 [May-01-2004]
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.
Query= lcl|SRP54_MOUSE (P14576) Signal recognition particle 54 kDa protein (SRP54)
         (504 letters)

Database: swissprot
           197,228 sequences; 71,501,181 total letters

Searching..................................................done

                                                                     Score     E
Sequences producing significant alignments:                          (bits)  Value

SRP54_MOUSE (P14576) Signal recognition particle 54 kDa protein ...   959   0.0
SRP54_PONPY (Q5R4R6) Signal recognition particle 54 kDa protein ...   958   0.0
SRP54_MACFA (Q4R965) Signal recognition particle 54 kDa protein ...   958   0.0
SRP54_HUMAN (P61011) Signal recognition particle 54 kDa protein ...   958   0.0
SRP54_CANFA (P61010) Signal recognition particle 54 kDa protein ...   958   0.0
SRP54_RAT (Q6AYB5) Signal recognition particle 54 kDa protein (S...   957   0.0
SRP54_GEOCY (Q8MZJ6) Signal recognition particle 54 kDa protein ...   794   0.0
SR542_LYCES (P49972) Signal recognition particle 54 kDa protein ...   565   e-161
SR543_ARATH (P49967) Signal recognition particle 54 kDa protein ...   560   e-159
SR542_HORVU (P49969) Signal recognition particle 54 kDa protein ...   558   e-158
......
SRPR_MOUSE (Q9DBG7) Signal recognition particle receptor alpha s...    99   3e-20
SRPR_HUMAN (P08240) Signal recognition particle receptor alpha s...    99   3e-20
SRPR_YEAST (P32916) Signal recognition particle receptor alpha s...    98   7e-20
BLAST output
BLAST output, cont.
sp|Q9I3P8.1|FLHF_PSEAE RecName: Full=Flagellar biosynthesis prot...    57   3e-07
sp|Q44758.1|FLHF_BORBU RecName: Full=Flagellar biosynthesis prot...    55   2e-06
sp|Q01960.1|FLHF_BACSU RecName: Full=Flagellar biosynthesis prot...    53   4e-06

sp|O28980.1|Y1289_ARCFU RecName: Full=Uncharacterized protein AF...    39   0.064
sp|B9LKC1.1|CYSC_CHLSY RecName: Full=Adenylyl-sulfate kinase; Al...    38   0.21
sp|Q12U80.1|RADB_METBU RecName: Full=DNA repair and recombinatio...    37   0.29
sp|A5D014.1|ACCD_PELTS RecName: Full=Acetyl-coenzyme A carboxyla...    35   0.93
sp|Q03T56.1|RSMA_LACBA RecName: Full=Ribosomal RNA small subunit...    35   1.2
sp|Q1I2K4.1|CYSC_PSEE4 RecName: Full=Adenylyl-sulfate kinase; Al...    35   1.6
sp|Q38V22.1|RSMA_LACSS RecName: Full=Ribosomal RNA small subunit...    34   1.8
sp|A1U3X8.1|CYSC_MARAV RecName: Full=Adenylyl-sulfate kinase; Al...    34   2.3
sp|A6TD42.1|CYSC_KLEP7 RecName: Full=Adenylyl-sulfate kinase; Al...    34   2.9
sp|P63890.2|CYSC_SALTI RecName: Full=Adenylyl-sulfate kinase; Al...    34   2.9
...
Expect value (E)
The expect value is a parameter that describes the number of hits one can "expect" to see just by chance when searching a database of a particular size. Essentially, the E value describes the random background noise that exists for matches between sequences. For example, an E value of 1 assigned to a hit means that, in a database of the current size, one might expect to see 1 match with a similar score simply by chance. The lower the E value, or the closer it is to 0, the more "significant" the match is.
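The dependence of E on score and database size can be made concrete with the commonly quoted Karlin-Altschul relation E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the total database length. The sketch below is an approximation under that assumption: BLAST itself uses length-corrected ("effective") values for m and n, so its reported E values differ somewhat. The function name `expect_value` is illustrative.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E value from a bit score: E = m * n * 2**(-S').

    Uses raw lengths; BLAST applies effective-length corrections,
    so this only reproduces the order of magnitude.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


# Numbers from the BLAST output: 504-letter query, 71,501,181 database letters.
e_receptor = expect_value(99, 504, 71_501_181)   # SRPR_MOUSE hit, 99 bits
e_weak = expect_value(34, 504, 71_501_181)       # a 34-bit hit
print(f"{e_receptor:.1e}  {e_weak:.1e}")
```

For the 99-bit hit this gives roughly 6e-20, the same order of magnitude as the 3e-20 in the output. Each additional bit of score halves E, and doubling the database size doubles it.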
Query: 1   MVLADLGRKITSALRSLSNATIINEEVLNAMLKEVCTALLEADVNIKLVKQLRENVKSAI 60
           MVLADLGRKITSALRSLSNATIINEEVLNAMLKEVCTALLEADVNIKLVKQLRENVKSAI
Sbjct: 1   MVLADLGRKITSALRSLSNATIINEEVLNAMLKEVCTALLEADVNIKLVKQLRENVKSAI 60

Query: 61  DLEEMASGLNKRKMIQHAVFKELVKLVDPGVKAWTPTKGKQNVIMFVGLQGSGKTTTCSK 120
           DLEEMASGLNKRKMIQHAVFKELVKLVDPGVKAWTPTKGKQNVIMFVGLQGSGKTTTCSK
Sbjct: 61  DLEEMASGLNKRKMIQHAVFKELVKLVDPGVKAWTPTKGKQNVIMFVGLQGSGKTTTCSK 120

Query: 121 LAYYYQRKGWKTCLICADTFRAGAFDQLKQNATKARIPFYGSYTEMDPVIIASEGVEKFK 180
           LAYYYQRKGWKTCLICADTFRAGAFDQLKQNATKARIPFYGSYTEMDPVIIASEGVEKFK
Sbjct: 121 LAYYYQRKGWKTCLICADTFRAGAFDQLKQNATKARIPFYGSYTEMDPVIIASEGVEKFK 180

Query: 181 NENFEIIIVDTSGRHKQEDSLFEEMLQVSNAIQPDNIVYVMDASIGQACEAQAKAFKDKV 240
           NENFEIIIVDTSGRHKQEDSLFEEMLQV+NAIQPDNIVYVMDASIGQACEAQAKAFKDKV
Sbjct: 181 NENFEIIIVDTSGRHKQEDSLFEEMLQVANAIQPDNIVYVMDASIGQACEAQAKAFKDKV 240

Query: 241 DVASVIVTKLDGHAKGGGALSAVAATKSPIIFIGTGEHIDDFEPFKTQPFISKLLGMGDI 300
           DVASVIVTKLDGHAKGGGALSAVAATKSPIIFIGTGEHIDDFEPFKTQPFISKLLGMGDI
Sbjct: 241 DVASVIVTKLDGHAKGGGALSAVAATKSPIIFIGTGEHIDDFEPFKTQPFISKLLGMGDI 300
High Scoring Pair (HSP)
>SRPR_MOUSE (Q9DBG7) Signal recognition particle receptor alpha subunit (SR-alpha) (Docking protein alpha) (DP-alpha)
Length = 636
Score = 99.0 bits (245), Expect = 3e-20
Identities = 68/313 (21%), Positives = 143/313 (45%), Gaps = 31/313 (9%)

Query: 14  LRSLSNATIINEEVLNAMLKEVCTALLEADVNIKLVKQLRENVKSAIDLEEMASGLNKRK 73
           L+ L  +  ++ E + ++L ++   L+  +V   +  QL E+V + ++ + M +
Sbjct: 322 LKGLVGSKSLSREDMESVLDKMRDHLIAKNVAADIAVQLCESVANKLEGKVMGTFSTVTS 381

Query: 74  MIQHAVFKELVKLVDPGVKAW-------TPTKGKQNVIMFVGLQGSGKTTTCSKLAYYYQ 126
            ++ A+ + LV+++ P  +            + +  V+ F G+ G GK+T  +K++++
Sbjct: 382 TVKQALQESLVQILQPQRRVDMLRDIMDAQRRQRPYVVTFCGVNGVGKSTNLAKISFWLL 441

Query: 127 RKGWKTCLICADTFRAGAFDQLK-------------QNATKARIPFYGSYTEMDPVIIAS 173
             G+   +   DTFRAGA +QL+             ++  +  +  +      D   IA
Sbjct: 442 ENGFSVLIAACDTFRAGAVEQLRTHTRRLTALHPPEKHGGRTMVQLFEKGYGKDAAGIAM 501
High Scoring Pair (HSP)
BLAST output – revealing orthologs and paralogs
orthologs
paralogs
Genes or proteins are homologous if they are related by divergence from a common ancestor.
Orthology: sequences that diverged after a speciation event. Orthologous genes often have the same function in different species.
Paralogy: sequences that diverged after a gene duplication event. Paralogous genes perform different but related functions within one organism.
The two kinds of protein evolutionary relationship
[Figure: Speciation. Gene X of an ancestral organism is inherited by organism A (as X1) and organism B (as X2); X1 and X2 are orthologs.]
[Figure: Gene duplication. Gene X is duplicated to give Xa and Xb; Xa and Xb are paralogs.]
Mouse trypsin       -- orthologs --  Human trypsin
      |                                    |
   paralogs                             paralogs
      |                                    |
Mouse chymotrypsin  -- orthologs --  Human chymotrypsin
Example of orthology / paralogy relationships
The variants of BLAST

Program   Query              Database
blastp    Protein            Protein
blastn    DNA                DNA
tblastn   Protein            DNA (translated)
blastx    DNA (translated)   Protein
tblastx   DNA (translated)   DNA (translated)
Cited 31,998 times since 1990!
When BLAST is too slow:
BLAT
Alignment software specialized for next-generation sequencing technology: BTW, Bowtie, SOAP2
Align reads to a reference genome
[Figure: reads aligned to a reference genome]
Further improvement of computational efficiency - BLAT
(http://genome.ucsc.edu/cgi-bin/hgBlat?command=start)
Cited 34,646 times!
ClustalW
• Construction of tree based on pairwise alignments
• Progressive alignment guided by tree
[Figure: guide tree with leaves A, B, C, D and E]
HIV
Introduction to the practical “Examining HIV genes and proteins”
EMBOSS programs in this practical
sixpack
plotorf
dottup - dotplot analysis
water - Smith-Waterman local alignment
needle - Needleman-Wunsch global alignment
M A K R K L K K N L K T F V A F S A I T F1
W Q R E S * K R T * K L L L H L V L L L F2
G K E K V K K E L K N F C C I * C Y Y C F3
  1 ATGGCAAAGAGAAAGTTAAAAAAGAACTTAAAAACTTTTGTTGCATTTAGTGCTATTACT 60
    ----:----|----:----|----:----|----:----|----:----|----:----|
  1 TACCGTTTCTCTTTCAATTTTTTCTTGAATTTTTGAAAACAACGTAAATCACGATAATGA 60
X A F L F N F F F K F V K T A N L A I V F6
X P L S F T L F S S L F K Q Q M * H * * F5
H C L S L * F L V * F S K N C K T S N S F4

A L L L T N G I P I S A L T Q S S N T T F1
L Y C * L M V F Q L V L * L S L P I Q L F2
F I V N * W Y S N * C F N S V F Q Y N * F3
 61 GCTTTATTGTTAACTAATGGTATTCCAATTAGTGCTTTAACTCAGTCTTCCAATACAACT 120
    ----:----|----:----|----:----|----:----|----:----|----:----|
 61 CGAAATAACAATTGATTACCATAAGGTTAATCACGAAATTGAGTCAGAAGGTTATGTTGA 120
A K N N V L P I G I L A K V * D E L V V F6
Q K I T L * H Y E L * H K L E T K W Y L F5
S * Q * S I T N W N T S * S L R G I C S F4

E I T S Q A T T G L R N V M Y Y G D W S F1
R L L H K L L Q G Y V M * C I M V T G L F2
D Y F T S Y Y R V T * C N V L W * L V Y F3
121 GAGATTACTTCACAAGCTACTACAGGGTTACGTAATGTAATGTATTATGGTGACTGGTCT 180
    ----:----|----:----|----:----|----:----|----:----|----:----|
121 CTCTAATGAAGTGTTCGATGATGTCCCAATGCATTACATTACATAATACCACTGACCAGA 180
S I V E C A V V P N R L T I Y * P S Q D F6
Q S * K V L * * L T V Y H L T N H H S T F5
L N S * L S S C P * T I Y H I I T V P R F4
Translation of a nucleotide sequence using ‘sixpack’
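The six-frame translation in the sixpack output can be reproduced with a short sketch. It assumes the standard genetic code and is not the EMBOSS implementation; in particular, the way sixpack numbers and positions the reverse frames (F4 to F6) is not reproduced here, so only the three forward frames are compared with the slide. The helper names are illustrative.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate complete codons from the start of `dna`; '*' marks a stop."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(dna):
    """Return (forward, reverse): three translations for each strand."""
    rc = reverse_complement(dna)
    forward = [translate(dna[offset:]) for offset in range(3)]
    reverse = [translate(rc[offset:]) for offset in range(3)]
    return forward, reverse


# First 60 nucleotides shown in the sixpack output.
first_line = "ATGGCAAAGAGAAAGTTAAAAAAGAACTTAAAAACTTTTGTTGCATTTAGTGCTATTACT"
forward, reverse = six_frames(first_line)
for label, protein in zip(("F1", "F2", "F3"), forward):
    print(label, protein)
```

Frames 2 and 3 come out one residue shorter than on the slide because their last codon spills over into the next 60-nucleotide line.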
plotorf shows open reading frames (in this case an ORF is defined as starting with an AUG codon)
Unnamed protein 416-1522
Ribosomal protein S16 1771-2019
tRNA methyltransferase 2617-3384
Ribosomal protein L19 3426-3773
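The ORF definition used here (start at an AUG codon, i.e. ATG in the DNA sequence, and run to the next in-frame stop codon) can be sketched as follows. This is an illustration only, restricted to the three forward frames; the name `find_orfs` and the `min_codons` cut-off are assumptions, not plotorf options.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=1):
    """Find ORFs on the forward strand that start with ATG.

    Returns a sorted list of (start, end) positions, 1-based and
    inclusive, where `end` is the last base of the stop codon.
    `min_codons` counts the codons before the stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start + 1, i + 3))
                start = None                    # look for the next ORF
    return sorted(orfs)


# Hypothetical toy sequence: ATG AAA TTT TAG in the third reading frame.
toy = "CCATGAAATTTTAGGG"
toy_orfs = find_orfs(toy)
print(toy_orfs)
```

Raising `min_codons` filters out the many short ATG-to-stop stretches that occur by chance, which is why ORF plots are usually drawn with a minimum length.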
Gag
Gag-Pol fusion (5%)
Global alignment of mRNA sequence to genomic DNA sequence
Effect of gap parameters
mature, spliced mRNA
genomic DNA
Dot plot analysis (dottup) reveals repeats
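A repeat shows up in a self-comparison dot plot as a run of dots off the main diagonal. The sketch below marks every position pair at which two sequences share an exact word, which is the general idea behind a word-match dot plot; the function names, the word size of 4 and the toy sequence are assumptions made for illustration.

```python
def dot_plot(seq_x, seq_y, word_size=4):
    """Return (i, j) for every exact word shared at seq_x[i:] and seq_y[j:]."""
    dots = []
    for i in range(len(seq_x) - word_size + 1):
        word = seq_x[i:i + word_size]
        for j in range(len(seq_y) - word_size + 1):
            if seq_y[j:j + word_size] == word:
                dots.append((i, j))
    return dots


def render(dots, width, height):
    """Draw the dots as text: columns follow seq_x, rows follow seq_y."""
    marked = set(dots)
    return "\n".join(
        "".join("*" if (i, j) in marked else "." for i in range(width))
        for j in range(height)
    )


# Hypothetical sequence with ACGT repeated at both ends.
toy = "ACGTTTACGT"
dots = dot_plot(toy, toy)
off_diagonal = [(i, j) for i, j in dots if i != j]
print(render(dots, len(toy) - 3, len(toy) - 3))
print(off_diagonal)
```

In a self-comparison the main diagonal is always filled; the off-diagonal dots are the informative ones, and each repeat appears twice, mirrored across the diagonal.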
Introduction to the "Exercises with biological sequences - examining HIV genes and proteins"
- biological questions addressed with BLAST and ClustalX.
BLAST - search databases for sequence similarity
* Identifying homologous proteins.
* Non-viral homologues to any HIV proteins?
* Are we able to identify a relationship between human HIV and the monkey SIV?
ClustalX - multiple sequence alignment
* Identifying amino acids involved in drug resistance.
* What is the relationship between HIV and monkey SIV?
* Using a multiple alignment to compute a phylogenetic tree.