
Multiple Sequence Alignment

• What is it
• Why do we use it
• How to use it
• Tools
• ClustalW
• Exercise


Multiple Sequence Alignment

• Many genes are represented in highly conserved forms in a wide range of organisms

• Patterns of change in these gene sequences may be analyzed by aligning the sequences simultaneously, which identifies conserved regions

• This is known as multiple sequence alignment (MSA)

• A multiple alignment arranges a set of sequences in a scheme where positions believed to be homologous are written in a common column.
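
The column view described above can be made concrete with a minimal Python sketch. The three short sequences and their names are invented for illustration, and "conserved" is taken here to mean a gap-free column holding a single residue type, which is an assumption of this example rather than a definition from the slides.

```python
# Toy alignment: each row is one sequence, '-' marks a gap.
# The sequences and names are made up for illustration only.
alignment = {
    "seq_a": "MKT-LLVA",
    "seq_b": "MKTALLIA",
    "seq_c": "MRT-LLVA",
}


def columns(aln):
    """Return the alignment as a list of column strings."""
    rows = list(aln.values())
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    return ["".join(col) for col in zip(*rows)]


def conserved_positions(aln):
    """0-based indices of gap-free columns containing a single residue type."""
    return [
        i for i, col in enumerate(columns(aln))
        if "-" not in col and len(set(col)) == 1
    ]


cols = columns(alignment)
conserved = conserved_positions(alignment)
print(cols)
print(conserved)
```

Each string in `cols` holds the residues that the alignment proposes as homologous at one position.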


Applications of Multiple Sequence Alignment

• Predict protein function

• Predict protein structure (using structure superposition programs).

• Predict the evolutionary history of sequences (using phylogenetic analysis programs).

• Contig Assembly (Shotgun sequences & ESTs)

• Identify new family members

• Design PCR primers for amplification of related sequences

• Database searching with the consensus sequences to identify other sequences with a similar pattern.


Multiple Sequence Alignment Guidelines

• Select the sequences carefully. Make sure they are members of the same family and they all share a common ancestor

• Use protein sequences if possible. Translate if necessary and then convert back to DNA after the alignment.
  • Protein sequences are three times shorter and provide a more informative alphabet
  • If there is little signal at the amino acid (aa) level, there will be no signal at the nucleotide (nt) level

• If you are interested in non-coding sequences, you have no choice but to align DNA. Beware: DNA alignment is tricky and needs a very high level of conservation
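
The "convert back to DNA after the alignment" step for coding sequences can be sketched as follows. This is a minimal illustration, not the method of any particular tool: the function name `thread_codons` and the toy sequences are hypothetical, and the sketch assumes each DNA sequence is in frame, has no stop codon, and is exactly three times the length of its ungapped protein.

```python
def thread_codons(aligned_protein, dna):
    """Map a gapped protein row back onto its coding DNA, codon by codon."""
    n_residues = len(aligned_protein.replace("-", ""))
    if len(dna) != 3 * n_residues:
        raise ValueError("DNA length must be 3x the ungapped protein length")
    out, pos = [], 0
    for symbol in aligned_protein:
        if symbol == "-":
            out.append("---")          # one protein gap becomes a codon-sized gap
        else:
            out.append(dna[pos:pos + 3])
            pos += 3
    return "".join(out)


# Hypothetical protein alignment and the matching (unaligned) coding DNA.
protein_rows = {"seq_a": "MK-L", "seq_b": "MKTL"}
coding_dna = {"seq_a": "ATGAAACTG", "seq_b": "ATGAAGACCCTG"}

dna_rows = {
    name: thread_codons(row, coding_dna[name])
    for name, row in protein_rows.items()
}
for name, row in dna_rows.items():
    print(name, row)
```

Because gaps are inserted only between whole codons, the reading frame of every sequence is preserved in the resulting DNA alignment.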


Multiple Sequence Alignment Guidelines Cont.

• Ensure that at least half of the sequences share more than 30% identity and avoid sequences that have > 90% identity to another sequence

• An alignment that contains only very similar sequences is not very informative

• If you make sure that each sequence is between 30 and 70% identical to half of the sequences in the set, you will have made a reasonable compromise between new information and alignment quality
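
The identity guidelines above could be checked with a small script along these lines. It is a sketch under stated assumptions: the sequences are toy data that are already pairwise aligned to equal length, identity is computed over positions where neither sequence has a gap (other definitions exist), and the helper names `percent_identity` and `drop_redundant` are hypothetical.

```python
def percent_identity(a, b):
    """Percent identity over positions where neither aligned row has a gap."""
    if len(a) != len(b):
        raise ValueError("rows must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


def drop_redundant(seqs, max_identity=90.0):
    """Greedily keep sequences that are not > max_identity to one already kept."""
    kept = {}
    for name, seq in seqs.items():
        if all(percent_identity(seq, other) <= max_identity
               for other in kept.values()):
            kept[name] = seq
    return kept


# Toy, pre-aligned sequences (invented for illustration).
seqs = {
    "a": "MKTLLVAGGH",
    "b": "MKTLLVAGGH",   # identical to "a", so it adds no information
    "c": "MKSLIVSGDH",   # 6 of 10 positions match "a"
}
kept = drop_redundant(seqs)
print(percent_identity(seqs["a"], seqs["c"]))
print(list(kept))
```

In practice the pairwise identities would come from a pairwise alignment program rather than from pre-aligned strings.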


Multiple Sequence Alignment Guidelines Cont.

• Start with 10-15 sequences and avoid aligning more than 50 sequences (if you do, employ a high level of manual curation)

• Multiple alignment programs are not good at handling large sets of sequences.

• Visualizing a large alignment is difficult; if it spans more than one page, interpretation can become difficult if not impossible.

• Aligning a lot of sequences is computationally difficult and public servers have limited resources, so a run may take a long time, which makes it difficult to fine-tune alignment parameters or try alternative sequences.


Multiple Sequence Alignment Guidelines Cont.

• Tree building and structure prediction programs do not handle big alignments well

• Making accurate big alignments is difficult and the results are less reliable, so it is hard to be confident that the sequences you have included really belong to the family. It is best to start small and gradually increase the size of the multiple alignment.

• Before adding a sequence to a multiple alignment, you can figure out whether it is a good choice by doing a pairwise comparison.


Multiple Sequence Alignment Guidelines Cont.

• Use sequences of similar length. Programs have problems aligning partial and complete sequences.

• Repeated domains are problematic for the alignment programs, especially if the number of domains is different.

• Name sequences appropriately
  • Never use white space, as in "clone 2" (use clone2 or clone_2)
  • Do not use special symbols; stick to plain letters, numbers and the underscore
  • Do not use names longer than 15 characters
  • Use a unique name for each sequence
  • Use informative names (OSJLBa0001A01f compared to Main_Clone1)
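
The naming rules above are easy to check automatically before submitting a FASTA file. The following is a minimal sketch; the function name `check_names` and the example names are hypothetical, and only the rules listed on the slide (allowed characters, 15-character limit, uniqueness) are tested.

```python
import re

# Plain letters, digits and the underscore only (no spaces, no symbols).
ALLOWED = re.compile(r"[A-Za-z0-9_]+")
MAX_LENGTH = 15


def check_names(names):
    """Return a dict mapping each problematic name to a list of problems."""
    problems = {}
    seen = set()
    for name in names:
        issues = []
        if len(name) > MAX_LENGTH:
            issues.append("longer than 15 characters")
        if not ALLOWED.fullmatch(name):
            issues.append("empty, or uses characters other than "
                          "letters, digits and underscore")
        if name in seen:
            issues.append("duplicate name")
        seen.add(name)
        if issues:
            problems[name] = issues
    return problems


report = check_names([
    "OSJLBa0001A01f",             # fine: 14 plain characters
    "clone 2",                    # contains white space
    "clone_2",
    "clone_2",                    # repeated name
    "a_very_long_sequence_name",  # too long
])
for name, issues in report.items():
    print(repr(name), "->", "; ".join(issues))
```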


EXPASY INTEGRATED BLAST & MSA SERVER


EXPASY INTEGRATED BLAST & MSA SERVER (databases and options)


• Output of search displayed
• Links to Pfam


• View Alignments (helps inform selection)


• Make selections for inclusion in the MSA
• Use the "Send your selections" options

• Select your sequences in FASTA format
• Use the "Send your selections" options


ACC #    SwissProt #   Description [Organism]                            Score   EXP
P20472   PRVA_HUMAN    Parvalbumin alpha [PVALB] [Homo sapiens           186     9e-47
P80079   PRVA_FELCA    Parvalbumin alpha [PVALB] [Felis silvestris...    162     1e-39
P02626   PRVA_AMPME    Parvalbumin alpha [Amphiuma (Salamand...          109     1e-23
P02619   PRVB_ESOLU    Parvalbumin beta [Esox lucius (Northern pike)]    95      2e-19
P43305   PRVU_CHICK    Parvalbumin, thymic CPV3 (Parvalbumin 3) [G       92      2e-18
P32930   ONCO_HUMAN    Oncomodulin (OM) (Parvalbumin beta) [OCM]         89      2e-17
Q91482   PRV1_SALSA    Parvalbumin beta 1 (Major allergen Sal s 1)...    85      3e-16
P02620   PRVB_MERME    Parvalbumin beta [Merluccius merluccius (Eu...    80      7e-15
P02622   PRVB_GADCA    Parvalbumin beta (Allergen Gad c 1) (Gad c ...    74      5e-13

• Example selected sequences
• Note the range of scores and E values selected


Multiple Sequence Alignment Software

• ClustalW (Unix, Mac, PC, VMS)
• ClustalX (IGBMC, EBI) (graphical interface) (Unix, Mac, PC, VMS)
• Multalin
• MSA (Unix)
• DIALIGN (Unix)
• DCA (Unix)
• Multiple alignment by randomized iterative strategy (Unix)
• MACAW (Mac, PC)
• T-Coffee (Unix)
• MAFFT (Linux, Unix, Windows XP, Mac OS X)


Multiple Sequence Alignment Online Tools

• ClustalW at EBI (Hinxton, UK). Display and edit alignments with JalView.
• ClustalW, Multalin at PBIL (Lyon, France). Colored alignments and secondary structure predictions.
• ClustalW, MAP, PIMA at BCM
• MSA, ClustalW, ctree at IBC (St Louis, USA)
• Multalin at INRA (Toulouse, France). Colored alignments.
• ClustalW, DCA, DIALIGN2 at Pasteur (Paris, France)
• ClustalW at EMBL (Heidelberg, Germany). Performs multiple alignment on homologous sequences detected by BLAST.
• ClustalW at DDBJ (Mishima, Japan)
• MAP (Michigan Tech. Univ., USA)
• ProbModel at CBRG (Zurich, Switzerland)
• DIALIGN2 at BiBiServ (Bielefeld, Germany)
• DCA at BiBiServ (Bielefeld, Germany)
• ITERALIGN (Stanford, USA)
• T-COFFEE (Lausanne, Switzerland)
• MATCH-BOX (Namur, Belgium)
• BLOCK Maker at FHCRC (Washington, USA)
• MEME at SDSC (San Diego, USA)
• MEME at Pasteur (Paris, France)
• PIMA II at BMERC (Boston, USA)
• MAVID at UCB (Berkeley, USA)


Multiple Sequence Alignment Software: ClustalW

• First MSA program that could run on almost any platform

• Most widely used MSA program

• ClustalW is the latest version

• There are many Clustal servers around the world. Most run the same version, but their different interfaces provide access to different options.

• It is also available as a stand-alone package.


Multiple Sequence Alignment Software: ClustalW

• ClustalW uses a progressive method to build its alignments

• It compares two sequences at a time and clusters them by similarity.

• This clustering resembles a phylogenetic tree and is called a dendrogram (the .dnd file in the ClustalW output).

[Figure: dendrogram with a root and four leaves: A, B, C and D]

• The dendrogram reveals that A and B are more similar than C and D
• To make the progressive alignment, ClustalW follows the dendrogram: it starts by aligning A and B, and then C and D.
• It then treats the multiple alignments like single sequences and aligns them two by two.
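
The idea of clustering sequences pairwise into a guide tree can be illustrated with a short Python sketch. This is a simplified stand-in and does not reproduce ClustalW's own tree-building method: it uses average-linkage clustering on the fraction of mismatched positions, and the four toy sequences (named A to D to echo the figure) are invented and assumed to have equal length.

```python
from itertools import combinations


def distance(a, b):
    """Fraction of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def guide_tree(seqs):
    """Average-linkage clustering; returns a nested tuple of sequence names."""
    # Each cluster is (subtree, list of member names).
    clusters = [(name, [name]) for name in seqs]

    def cluster_distance(c1, c2):
        pairs = [(m1, m2) for m1 in c1[1] for m2 in c2[1]]
        return sum(distance(seqs[m1], seqs[m2]) for m1, m2 in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two closest clusters, as a progressive aligner would
        # align the two most similar sequences (or sub-alignments) first.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        clusters = [c for c in clusters if c is not c1 and c is not c2]
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]


# Toy sequences, invented for illustration.
toy = {
    "A": "MKTLLVAG",
    "B": "MKTLLVAA",
    "C": "MRSLIVSG",
    "D": "MRSVIVSA",
}
tree = guide_tree(toy)
print(tree)
```

Reading the nested tuple from the inside out gives the order of alignment: the innermost pairs are aligned first, and the resulting sub-alignments are then aligned to each other.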


Multiple Sequence Alignment Software: ClustalW

• Pairwise Scores
• These are the pairwise comparisons ClustalW uses to build its tree
• They can be ignored


Multiple Sequence Alignment Software: ClustalW

• Shows the alignment
• Can be saved as a text file
• Can be viewed in color


Multiple Sequence Alignment Software: ClustalW

• The Guide Tree
• Shows the tree that ClustalW uses to guide its progressive alignment
• It is displayed in PHYLIP tree format
• A cladogram is a branching diagram (tree) assumed to be an estimate of a phylogeny in which the branches are of equal length. Cladograms therefore show common ancestry but do not indicate the amount of evolutionary "time" separating taxa


Multiple Sequence Alignment Software: ClustalW

• The Phylogram Tree
• A phylogram is a branching diagram (tree) assumed to be an estimate of a phylogeny; its branch lengths are proportional to the amount of inferred evolutionary change


Interpreting Multiple Sequence Alignments

• Interpreting an alignment is more an art than a science!

• Unlike database searching, there are no E values to tell us how reliable the result is

• The best method of evaluation is based on knowledge of protein structures.

• Structures contain loops that evolve rapidly
• Loops are softer portions of the protein that connect its more rigid portions
• Protein structures also contain core regions inside the protein that act as support walls for the protein. These support walls evolve less rapidly than the loops on the surface

• In a good multiple alignment, you can expect to find nice gap-free blocks that correspond to core regions and gap-rich regions that correspond to the loops


Interpreting Multiple Sequence Alignments

• How can you tell whether a block is good?
• Take a look at the alignment symbols
  • * A star indicates an entirely conserved column
  • : A colon indicates columns where all the residues have roughly the same size and the same hydropathy
  • . A period indicates columns where the size or the hydropathy has been preserved in the course of evolution
• An average GOOD block is at least 10-30 aa long and exhibits at least 1 to 3 stars, five to seven colons and a few periods

• In a good multiple alignment, you can expect to find nice gap-free blocks that correspond to core regions and gap-rich regions that correspond to the loops
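
The rule of thumb for a good block can be turned into a quick check on the symbol line printed under a ClustalW alignment block. This is a rough sketch: the function names and the example symbol strings are hypothetical, the thresholds are the lower bounds quoted above, and "a few periods" is interpreted here as at least two, which is an assumption.

```python
def score_block(symbols):
    """Count conservation symbols in one segment of the consensus line."""
    return {
        "length": len(symbols),
        "stars": symbols.count("*"),
        "colons": symbols.count(":"),
        "periods": symbols.count("."),
    }


def looks_good(symbols, min_len=10, min_stars=1, min_colons=5, min_periods=2):
    """Apply the slide's lower bounds; min_periods=2 is an assumed reading
    of 'a few periods'."""
    counts = score_block(symbols)
    return (counts["length"] >= min_len
            and counts["stars"] >= min_stars
            and counts["colons"] >= min_colons
            and counts["periods"] >= min_periods)


# Invented symbol lines for two candidate blocks (a space = no conservation).
strong_block = "*::.*:: .:*:"
weak_block = "*. .  :   "

print(score_block(strong_block), looks_good(strong_block))
print(score_block(weak_block), looks_good(weak_block))
```

A check like this only flags candidate blocks; whether they correspond to core regions still has to be judged against what is known about the protein's structure.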


Multiple Sequence Alignment Tools

• BLAST servers with integrated MSA tools

• www.expasy.ch/cgi-bin/BLASTEMBnet-ch.pl
  • Extract entire sequences
  • Export sequences in FASTA format
  • Submit sequences to ClustalW
  • Submit sequences to T-Coffee

• http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=npsa_blast.html
  • Extract entire sequences
  • Extract sequence fragments
  • Export sequences in FASTA format
  • Submit sequences to ClustalW

• srs.ebi.ac.uk
  • Submit sequences to ClustalW


www.expasy.ch/cgi-bin/BLASTEMBnet-ch.pl


http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=npsa_blast.html


srs.ebi.ac.uk