Exploring Protein Sequences
Tutorial 5
• Multiple alignment
– ClustalW
• Motif discovery
– MEME
– JASPAR
• More than two sequences
– DNA
– Protein
• Evolutionary relation
– Homology
– Phylogenetic tree
– Detect motif
Multiple Sequence Alignment
[Figure: example sequences with and without gaps; tree labels A, B, D]
GTCGTAGTCG-GC-TCGACGTC-TAG-CGAGCGT-GATGC-GAAG-AG-GCG-AG-CGCCGTCG-CG-TCGTA-AC
CGTCGTAGTCGGCTCGACGTCTAGCGAGCGTGATGCGAAGAGGCGAGCGCCGTCGCGTCGTAAC
• Dynamic Programming
– Optimal alignment
– Exponential in the number of sequences
• Progressive
– Efficient
– Heuristic
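The dynamic programming approach is easiest to see for two sequences. Below is a minimal sketch of a global (Needleman-Wunsch style) alignment score. The scoring values (match +1, mismatch -1, gap -1) and the function name are illustrative assumptions, not ClustalW defaults. Each extra sequence adds one more dimension to the table, which is why the exact method is exponential in the number of sequences.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of two sequences by dynamic programming."""
    # prev[j] = best score for aligning the processed prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(nw_score("GTCG", "GCG"))  # 3 matches and 1 gap -> 2
```

The table has (len(a)+1) x (len(b)+1) cells, so two sequences cost quadratic time. Only one row is kept here because the sketch returns the score, not the alignment itself.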
ClustalW
“CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice”, J D Thompson et al
• Progressive
– At each step, align two existing alignments or sequences
– Gaps present in older alignments remain fixed
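The "gaps remain fixed" rule can be sketched as follows. This is not ClustalW's code; the helper names and the toy sequences are hypothetical. The point is only that merging a new row may open new gap columns in the older alignment, but never removes or moves the gaps that are already there.

```python
def insert_gap_column(rows, col):
    """Open a new gap column at index col in every row of an alignment."""
    return [row[:col] + "-" + row[col:] for row in rows]


def add_aligned_row(rows, new_gap_columns, new_row):
    """Merge one already-gapped row into an existing alignment.

    new_gap_columns: column indices (final coordinates) where the old
    alignment needs a gap so that it lines up with new_row.
    """
    for col in sorted(new_gap_columns):
        rows = insert_gap_column(rows, col)
    if any(len(row) != len(new_row) for row in rows):
        raise ValueError("rows must have equal length after gap insertion")
    return rows + [new_row]


old = ["GTCG-GC", "GTCGAGC"]          # existing alignment, one old gap
merged = add_aligned_row(old, [2], "GTTCGAGC")
for row in merged:
    print(row)
```

The old gap in the first row is still present after the merge; it has only shifted one column to the right because a new gap column was opened before it.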
ClustalW
GTCGTAGTCG-GC-TGTC-TAG-CGAGCGTGC-GAAG-AG-GCG-GCCGTCG-CG-TCGT
GTCGTAGTCGGCTCGACGTCTAGCGAGCGTGATGCGAAGAGGCGAGCGCCGTCGCGTCGTAAC
ClustalW - Input
Scoring matrix
Gap scoring
Input sequences
ClustalW - Output
Input sequences
Pairwise alignment scores
Building alignment
Final score
ClustalW - Output
Sequence names
Sequence positions
Match strength in decreasing order: * (identical) : (strongly similar) . (weakly similar)
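The symbol line under an alignment can be approximated with a short sketch. The residue groups below are the "strong" and "weak" groups commonly quoted from the Clustal documentation; treat them as an assumption and check them against the version you use. The function names are hypothetical, and the example rows are the six motif sites used later in this tutorial.

```python
# Assumed Clustal conservation groups (verify against your Clustal version)
STRONG = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
        "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def column_symbol(column):
    """Return '*', ':', '.' or ' ' for one alignment column."""
    residues = set(column)
    if "-" in residues:
        return " "
    if len(residues) == 1:
        return "*"  # fully conserved
    if any(residues <= set(group) for group in STRONG):
        return ":"  # all residues fall in one strong group
    if any(residues <= set(group) for group in WEAK):
        return "."  # all residues fall in one weak group
    return " "


def conservation_line(rows):
    """Build the symbol line for equal-length aligned rows."""
    return "".join(column_symbol(col) for col in zip(*rows))


SITES = ["YDEEGGDAEE", "YDEEGGDAEE", "YGEEGADYED",
         "YDEEGADYEE", "YNDEGDDYEE", "YHDEGAADEE"]
print(repr(conservation_line(SITES)))  # '* :**   *:'
```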
http://www.megasoftware.net/
Can we find motifs using multiple sequence alignment?
    1    2    3    4    5    6    7    8    9    10
A   0    0    0    0    0    1/2  1/6  1/3  0    0
D   0    1/2  1/3  0    0    1/6  5/6  1/6  0    1/6
E   0    0    2/3  1    0    0    0    0    1    5/6
G   0    1/6  0    0    1    1/3  0    0    0    0
H   0    1/6  0    0    0    0    0    0    0    0
N   0    1/6  0    0    0    0    0    0    0    0
Y   1    0    0    0    0    0    0    1/2  0    0
  1 3 5 7 9
..YDEEGGDAEE..
..YDEEGGDAEE..
..YGEEGADYED..
..YDEEGADYEE..
..YNDEGDDYEE..
..YHDEGAADEE..
  * :**   *:
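A position frequency matrix like the one above can be derived from the six aligned sites by counting residues per column. The sketch below uses only the Python standard library; the function name is an assumption, and residues that never occur in a column are simply left out of that column's dictionary.

```python
from collections import Counter
from fractions import Fraction

SITES = ["YDEEGGDAEE", "YDEEGGDAEE", "YGEEGADYED",
         "YDEEGADYEE", "YNDEGDDYEE", "YHDEGAADEE"]


def frequency_matrix(rows):
    """Return one {residue: frequency} dict per alignment column."""
    n = len(rows)
    return [
        {res: Fraction(count, n) for res, count in Counter(col).items()}
        for col in zip(*rows)
    ]


matrix = frequency_matrix(SITES)
# Column 2 of the motif (index 1): D, G, N and H are observed
for residue, freq in sorted(matrix[1].items()):
    print(residue, freq)
```

Exact fractions are used so the values can be compared directly with entries such as 1/6 and 5/6 in the table.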
Motif
A widespread pattern with biological significance
Can we find motifs using multiple sequence alignment?
YES! NO
MEME – Multiple EM for Motif finding
• http://meme.sdsc.edu/
• Motif discovery from unaligned sequences
– Genomic or protein sequences
• Flexible model of motif presence (a motif can be absent from some sequences or appear several times in one sequence)
MEME - Input
Email address
Multiple input sequences
How many times in each sequence?
How many motifs?
How many sites?
Range of motif lengths
MEME - Output
Motif length
Number of times
Like BLAST
MEME - Output
Probability * 10
‘a’=10, ‘:’=0
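The one-character encoding described here can be sketched as below. The rounding rule (nearest tenth) and the function name are assumptions for illustration; only the mapping of 10 to 'a' and 0 to ':' comes from the slide.

```python
def encode_probability(p):
    """Map a probability in [0, 1] to one character of the simplified matrix."""
    level = int(p * 10 + 0.5)  # assumption: round probability * 10 to nearest integer
    if level == 0:
        return ":"
    if level == 10:
        return "a"
    return str(level)


# Encode one row of a probability matrix as a compact string
row = [0.0, 0.5, 1 / 3, 1.0]
print("".join(encode_probability(p) for p in row))  # :53a
```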
MEME - Output
Low uncertainty = High information content
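The link between uncertainty and information content can be made concrete with a small calculation. The sketch below assumes a uniform background over the alphabet (20 for protein, 4 for DNA) and applies no small-sample correction, so its numbers may differ from what MEME reports.

```python
import math


def information_content(freqs, alphabet_size=20):
    """Information content in bits of one motif column.

    Computed as log2(alphabet_size) minus the Shannon entropy of the column.
    """
    entropy = -sum(p * math.log2(p) for p in freqs if p > 0)
    return math.log2(alphabet_size) - entropy


conserved = [1.0]            # a single residue in every site
mixed = [0.5, 1 / 3, 1 / 6]  # three residues share the column
print(round(information_content(conserved), 2))
print(round(information_content(mixed), 2))
```

A fully conserved column has zero entropy and therefore the maximum information content; a column where every residue is equally likely carries none.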
MEME - Output
Multilevel Consensus
Sequence names
Reverse complement (genomic input only)
Position in sequence
Strength of match
Motif within sequence
MEME - Output
Overall strength of motif matches
Sequence lengths
Motif instance
MEME - Output
‘-’=Other strand
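For genomic input, a motif instance on the other strand is read from the reverse complement of the given sequence. A minimal sketch, assuming plain A/C/G/T input without ambiguity codes (the function name is hypothetical):

```python
# Translation table for complementing DNA bases (upper and lower case)
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("GTCGTA"))  # TACGAC
```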
MAST
• Searches for motifs (one or more) in sequence databases:
– Like BLAST, but with motifs as input
– Similar to iterations of PSI-BLAST
• Profile defines strength of match
– Multiple motif matches per sequence
– Combined E value for all motifs
• MEME uses MAST to summarize results:
– Each MEME result is accompanied by the MAST result for searching the discovered motifs on the given sequences.
JASPAR
• Profiles
– Transcription factor binding sites
– Multicellular eukaryotes
– Derived from published collections of experiments
• Open data access
JASPAR
• Profiles
– Modeled as matrices
– Can be converted into PSSMs for scanning genomic sequences
    1    2    3    4    5    6    7    8    9    10
A   0    0    0    0    0    1/2  1/6  1/3  0    0
D   0    1/2  1/3  0    0    1/6  5/6  1/6  0    1/6
E   0    0    2/3  1    0    0    0    0    1    5/6
G   0    1/6  0    0    1    1/3  0    0    0    0
H   0    1/6  0    0    0    0    0    0    0    0
N   0    1/6  0    0    0    0    0    0    0    0
Y   1    0    0    0    0    0    0    1/2  0    0
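Converting a frequency matrix into a PSSM (position-specific scoring matrix) and scanning a sequence with it can be sketched as follows. This is not JASPAR's own code: the pseudocount of 0.01, the uniform background, and the function names are assumptions, and the toy DNA profile is made up for the example.

```python
import math


def to_pssm(freq_columns, alphabet, pseudo=0.01):
    """Turn per-column frequency dicts into log2-odds scores.

    Assumes a uniform background; the pseudocount avoids log(0).
    """
    background = 1.0 / len(alphabet)
    total = 1.0 + pseudo * len(alphabet)
    return [
        {a: math.log2((col.get(a, 0.0) + pseudo) / total / background)
         for a in alphabet}
        for col in freq_columns
    ]


def scan(pssm, seq):
    """Score every window of len(pssm) in seq; return (best_score, start)."""
    width = len(pssm)
    scores = [
        (sum(pssm[i][seq[start + i]] for i in range(width)), start)
        for start in range(len(seq) - width + 1)
    ]
    return max(scores)  # raises ValueError if seq is shorter than the motif


# Hypothetical 3-column DNA profile that prefers "ACG"
profile = [{"A": 1.0}, {"C": 1.0}, {"G": 1.0}]
pssm = to_pssm(profile, "ACGT")
best_score, best_start = scan(pssm, "TTACGTT")
print(best_start, round(best_score, 2))
```

Positive scores mean a window looks more like the motif than like background. The sketch expects every character of the scanned sequence to be in the alphabet.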
Search profile
http://jaspar.cgb.ki.se/