Bioinformatics and Protein Sequence Analysis

Surabhi Agarwal

With the sequencing of a large number of proteins and the subsequent storage of the data, it has become easier for researchers to study proteins. These studies provide preliminary insights into the structural and functional aspects of proteins without conducting experiments.


Master Layout (Part 1)

This animation consists of 2 parts:

Part 1: Protein Sequence Alignment

Part 2: Alignment analysis and interpretations

Extract the newly determined amino acid sequence for your query peptide.

Assess the significance of the result with its alignment score.

[Figure: alignment of Seq 1, Seq 2 and Seq 3]

Definitions of the components: Part 1 – Protein sequence alignment

1. Query Peptide: The unknown protein or peptide that is provided as input to the sequence analysis server. The sequence of this protein is determined before it is analyzed for similarity matches with other proteins.

2. Relevant Algorithm: An algorithm is the sequence of logical steps used for comparing the query peptide with other given protein sequences. The nature of the query, such as “Local” or “Global” and “Pair-wise alignment” or “Multiple Sequence Alignment”, determines the algorithm that is used.

3. Local Alignment: “Local” alignment matches individual blocks of the protein sequences rather than the sequences as a whole; the alignment is not extended through regions where the similarity breaks down. The aim of such alignment studies is to find the longest possible blocks of similarity in the aligned protein sequences.

4. Global Alignment: “Global” alignment is an end-to-end alignment of two or more sequences, in which gaps are introduced where residues of one sequence have no counterpart in the other.

5. Pair-wise sequence alignment: This procedure compares and aligns two given sequences. The comparison can either be Global or Local with the quality of alignment being judged by the alignment score.

6. Multiple Sequence Alignment: The end-to-end alignment of several sequences that are provided to the alignment server. A multiple alignment tries to introduce as few gaps as possible while finding regions of similarity shared by all the given sequences.

7. Word length: The minimum length of an amino acid sequence that needs to match exactly in order to initiate an alignment, which is then extended in either direction. The sensitivity and speed of the alignment depend on the word length provided by the user.

8. Scoring Matrix: The matrix of values that is referred to for assigning a score to each aligned pair of residues. The matrix used for a BLAST search is selected depending on the type of sequences being searched. The two families are the PAM series and the BLOSUM series.

a) PAM: PAM stands for Point Accepted Mutation. It is a log-odds matrix scoring system constructed from the amino acid replacements observed in a set of closely related proteins. The PAM value defines the percentage of mutations that get accepted in a given set of proteins: 1 PAM corresponds to a change in an average of 1% of amino acid positions.

b) BLOSUM: This stands for “BLOcks SUbstitution Matrix” and is constructed from a set of distantly related proteins. BLOSUM provides a comprehensive biological insight into proteins when the evolutionary distance is not known beforehand. It is based on the relative frequencies of amino acid residues and the probabilities of their substitution in a set of highly conserved blocks of residues in proteins that are evolutionarily distant.

9. Threshold: The threshold is a measure of the statistical significance required of the results of an alignment study. It represents the expected number of matches occurring by chance.

10. Gap Penalty and Gap Extension: In an alignment of two or more protein sequences, a gap is introduced where residues in one sequence have no counterpart in the other. The “Gap penalty” is the deduction from the overall alignment score for introducing a gap, while the “Gap Extension” penalty is the deduction for extending an already existing gap.

11. Alignment Score: This is also referred to as the Bit Score and provides a comparative quantification of the quality of alignment. The score increases with a higher number of residue matches and a lower number of mismatches. The alignment with the higher bit score is the better match.

12. Percentage Identity: The percentage of amino acid residues that are identical when two sequences are compared.

13. E-value: The E-value quantifies the chance that an alignment between two or more sequences arose at random rather than being a biologically significant match. For a similarity search against a database, this value is dependent on the size of the database against which the sequence is compared. The closer the e-value is to zero, the higher the biological significance of the match.

14. Hit: The results of a search are called a ‘Hit’ and the term ‘best Hit’ would refer to the best result for that particular query.
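The difference between the “Global” and “Local” definitions above can be made concrete with a minimal Python sketch. It is not the code behind BLAST or any server named in this animation: it uses the textbook Needleman-Wunsch and Smith-Waterman recurrences with an assumed toy scoring scheme (match +1, mismatch -1, a flat gap cost of -2) in place of a PAM or BLOSUM matrix and separate existence/extension penalties. The function names are hypothetical.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch: score of the best end-to-end alignment of a and b."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + pair,   # align the two residues
                            prev[j] + gap,        # gap in b
                            curr[j - 1] + gap))   # gap in a
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman: score of the best-matching pair of sub-regions."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment restart instead of going negative.
            cell = max(0, prev[j - 1] + pair, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# Toy sequences sharing only the block "ACD".
print(global_score("XXACDYY", "ZZACDWW"))  # penalized by the flanking mismatches
print(local_score("XXACDYY", "ZZACDWW"))   # only the shared block counts
```

With these toy inputs the global score is pulled down by the unrelated flanks, while the local score reflects only the shared block, which is the distinction the two definitions describe.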


Step 1: Pair-wise sequence alignment for two given sequences - INPUT

Action: Schematic of the process of pair-wise alignment

Description of the action: Follow the animation steps. Re-draw all figures. Show all definitions first by highlighting the parameter. Follow it with the input of 2 sequences and the parameter values one by one. The drop-down after the scoring matrix should look like the drop-downs seen on web pages. Click on the drop-down and show the BLOSUM62 matrix getting selected. Click on the BLAST tool.

Audio narration: Alignment algorithms are computer algorithms which take 2 protein sequences and align them residue by residue. Here we depict an alignment between 2 given sequences. To align two sequences, enter them in the input boxes. We took the example of the CBR-COL-186 protein of Caenorhabditis briggsae and a collagen of Caenorhabditis elegans. The sequences are abridged for the purpose of animation. To carry out the exact study, users can download the sequences corresponding to the Gene ID. Enter the parameters as per the nature of the query and the purpose of the search, and finally click on the BLAST tool.

[Figure: input form of the ALIGNMENT ALGORITHM (BLAST), connected to the SEQUENCE DATABASE]

Enter sequence 1:

>gi|268576797|ref|XP_002643378.1| C. briggsae CBR-COL-186 protein [Caenorhabditis briggsae]
MKSTEKKSTELDLELEAQSLRRIAFFGVAMSTVATFVCIITVPLAYNKMQQMQSNMIDQYMASARGIRVA …

Enter sequence 2:

>gi|6682|emb|CAA35955.1| collagen [Caenorhabditis elegans]
MSEDLKQIAQETESLRKVAFFGIAVSTIATLTAIIAVPMLYNYMQHVQSSLQSEVEFCQHRSNGLWDEYK …

Word Size: 3 (length of the initial set of amino acids that needs to be matched before alignment begins)

Threshold: 10 (expected number of matches that are allowed to occur by chance)

Gap penalty: Existence 11, Extension 1 (values deducted from the overall alignment score on introduction and extension of gaps)

Scoring Matrix: BLOSUM62, selected from the options PAM30 and BLOSUM62 (the reference matrix used to assign scores to matches of residues)
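The role of the word size can be illustrated with a short Python sketch that lists every position where a word of the chosen length matches exactly between two sequences; each such hit is a point from which an alignment could be extended. This is a simplified illustration, not the BLAST implementation: BLAST-style programs typically also accept similar, non-identical words, which is not modelled here. The function name `exact_word_hits` is hypothetical, and the two fragments are taken from the abridged sequences above.

```python
from collections import defaultdict


def exact_word_hits(query, subject, word_size=3):
    """Return (query_pos, subject_pos) pairs where a word matches exactly."""
    # Index every word of the subject by its start position.
    index = defaultdict(list)
    for j in range(len(subject) - word_size + 1):
        index[subject[j:j + word_size]].append(j)
    # Look up each query word in the index.
    hits = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            hits.append((i, j))
    return hits


# Fragments of sequence 1 and sequence 2 from the input form above.
for q_pos, s_pos in exact_word_hits("SLRRIAFFGVAMST", "SLRKVAFFGIAVST"):
    print(q_pos, s_pos, "SLRRIAFFGVAMST"[q_pos:q_pos + 3])
```

A smaller word size yields more seed hits (higher sensitivity, slower search), while a larger one yields fewer (faster, less sensitive), which is the trade-off noted in the definition of word length.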

Step 2: Pair-wise sequence alignment for two given sequences - OUTPUT

Action: Shows the various output formats for pair-wise alignment

Description of the action: Show the smaller image of the server with every output and definition coming out of it one at a time, as shown in the PowerPoint animation.

Audio narration: Pair-wise alignment with the help of the BLOSUM62 matrix gives various kinds of results. These are the alignment, alignment score, dot-plot, percentage identity and e-value. The raw score from the BLOSUM62 matrix is 189 and from the PAM30 matrix is 178. The bit score for the exact same alignment is 77.4 using BLOSUM62 and 78.7 using PAM30. Therefore, bit scores give a uniform and normalized measure of the overall quality of alignment irrespective of the scoring system. The biological significance of this result is very high, as the e-value is very near to 0. For a more detailed study of the types of BLAST tools available, visit http://blast.ncbi.nlm.nih.gov/Blast.cgi
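The normalization of a raw score into a bit score can be sketched with the standard Karlin-Altschul relations, S' = (λS − ln K) / ln 2 and E = m · n · 2^(−S'). The statistical parameters λ and K are not given in this animation; the values used below (λ = 0.267, K = 0.041) are an assumption, namely the figures commonly reported for BLOSUM62 with gap costs of existence 11 and extension 1. The function names are hypothetical.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score S into bits: (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, query_len, search_space_len):
    """Expected number of chance alignments scoring at least `bits`."""
    return query_len * search_space_len * 2.0 ** (-bits)


# Raw score 189 from the narration; lam and k are ASSUMED, not from the slides.
print(round(bit_score(189, lam=0.267, k=0.041), 1))  # 77.4
```

Under these assumed parameters the raw score of 189 reproduces the 77.4 bits quoted in the narration. The second function shows why an E-value grows with the size of the search space: the same bit score is less surprising when more residues are searched.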

ALIGNMENT: Shows the match or mismatch between each of the residues. Gaps are introduced in sequence 2 where it has no residues corresponding to those in sequence 1.

Sequence 1  LELEAQSLRRIAFFGVAMSTVATFVCIITVPLAYNKMQQMQSNMIDQYMASARGIRVARR
            +  E +SLR++AFFG+A+ST+AT   II VP+ YN MQ +QS++  +
Sequence 2  IAQETESLRKVAFFGIAVSTIATLTAIIAVPMLYNYMQHVQSSLQSE----------VEF

DOT-PLOT: A dot-plot is a graphical visualization of the two given sequences, used to find approximate overlaps and to identify regions of close similarity.

BIT SCORE: 77.4 bits. Bit scores are the scores obtained by normalizing the raw scores according to the scoring matrix used in the algorithm.

PERCENTAGE IDENTITY: 34%. The percentage of residues which are identical in the two sequences.

E-VALUE: 6e-19. The statistical measure of the biological significance. The closer the e-value is to 0, the higher the biological significance.
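Percentage identity can be computed directly from two aligned rows, as in the minimal Python sketch below. It assumes one common convention, identical residues divided by the number of alignment columns (gaps included); other tools may divide by the shorter sequence length instead, so the figure is only comparable when the convention is the same. The function name is hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of alignment columns holding the same residue in both rows."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have the same length")
    if not aligned_a:
        return 0.0
    # A column counts only if both rows hold the same residue (never a gap).
    identical = sum(1 for x, y in zip(aligned_a, aligned_b)
                    if x == y and x != gap)
    return 100.0 * identical / len(aligned_a)


print(percent_identity("ACDE", "ACD-"))  # 75.0
```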

Step 3: Pair-wise alignment of sequences against database - INPUT

Action: Schematic of the process of pair-wise alignment

Description of the action: Follow the animation steps. Re-draw all figures. Show all definitions first by highlighting the parameter. Follow it with the input of 1 sequence. The drop-downs after “Select Database” and “Scoring Matrix” should look like the drop-downs seen on web pages. Select “Protein” under the “Select Database” options box as shown in the animation. Follow this by inputting the parameter values one by one. Click on the drop-down against “Scoring Matrix” and show the PAM30 matrix. Click on the BLAST tool.

Audio narration: Alignment can also be done by matching a sequence against a related database of sequences to identify it. Input the unknown sequence, and then select the database against which the sequence is to be matched. Fill in the parameter values as per the purpose of the search and the nature of the query sequence. In this case we study the hits using the PAM30 scoring matrix. Click on the BLAST tool once all parameters have been entered.

[Figure: input form of the ALIGNMENT ALGORITHM (BLAST), connected to the SEQUENCE DATABASE]

Enter sequence 1:

MSEDLKQIAQETESLRKVAFFGIAVSTIATLTAIIAVPMLYNYMQHVQSSLQSEVEFCQHRSNGLWDEYKRFQGVSGVEGRIKRDAYHRSLGVSGASRKARRQSYGNDAAVGGFGGSSGGSCCSCGSGAAGPAGSPGQDGAPGNDGAPGAPGNPGQDASEDQTAGPDSFCFDCPAGPPGPSGAPGQKGPSGAPGAPGQSGGAALPGPPGP

Select Database: PROTEIN (options: PROTEIN, NUCLEOTIDE, GENE, PROTEOME, GEO, EST, SNP)

Word Size: 3

Threshold: 10

Gap penalty: Existence 11, Extension 1

Scoring Matrix: PAM30 (options: PAM30, BLOSUM62)

Step 4: Pair-wise alignment of sequences against database - OUTPUT

Action: Shows the various output formats for pair-wise alignment

Description of the action: Show the smaller image of the server with every output and definition coming out of it one at a time, as shown in the PowerPoint animation.

Audio narration: Pair-wise alignment gives various kinds of results. These are alignment views, alignment score, dot-plot, e-value and percentage identity, amongst many others. When compared to the bit scores of the other hits in the result, the bit score turns out to be the highest for collagen proteins in Caenorhabditis elegans.

[Figure: smaller image of the ALIGNMENT ALGORITHM (BLAST) server and its input form, connected to the SEQUENCE DATABASE]

IDENTIFICATION: GENE ID: 179452 col-13 | Collagen [Caenorhabditis elegans]. Identifies the protein sequence and the source organism for the unknown sequence.

ALIGNMENT: The alignment shows 100% matching with the identified sequence.

Query     MSEDLKQIAQETESLRKVAFFGIAVSTIATLTAIIAVPMLYNYMQHVQSSLQSEVEFCQH
          MSEDLKQIAQETESLRKVAFFGIAVSTIATLTAIIAVPMLYNYMQHVQSSLQSEVEFCQH
Database  MSEDLKQIAQETESLRKVAFFGIAVSTIATLTAIIAVPMLYNYMQHVQSSLQSEVEFCQH

TOTAL SCORE: 624 bits. A measure of the quality of the alignment when compared to the bit scores of the other hits of the search.

E-VALUE: 1e-176. In the case of database searches, the E-value is found by multiplying the pair-wise e-value by the number of sequences in the database.

PERCENTAGE IDENTITY: 100%. The percentage of residues exactly matching in the query sequence and the selected hit.

DOMAIN IDENTIFIED (if any): The query is scanned to find domains from the Pfam database. If such a domain is identified, it is shown as part of the result. Pfam ID: pfam01484; Domain Name: Col_cuticle_N; Description: Nematode cuticle collagen N-terminal domain.

[Figure: position of the identified domain along the query sequence, on a residue scale from 1 to 300]

Sources: http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html; http://pfam.sanger.ac.uk/

Step 5: Multiple Sequence Alignment - INPUT

Action: Schematic of the process of multiple sequence alignment

Description of the action: Follow the animation steps. Enter the first 2 sequences. Click on “Add more sequences”. Open the 3rd input box for entering the 3rd sequence. Show the input of the 3rd sequence. Show the input of parameters. Select “Absolute” in the “Score Type” drop-down. The drop-down after the scoring matrix should look like the drop-downs seen on web pages.

Audio narration: Multiple Sequence Alignment tools are used to compare the amino acid sequences of more than two proteins. The word size is the length of the seed set of amino acids, which needs to match exactly for the alignment to be extended in both directions. Window length is the number of residues on either side up to which the alignment will be extended. The gap penalty and extension hold the same meaning as in pair-wise alignment. For the scores, users can choose to see either absolute scores or percentage values.

[Figure: input form of MULTIPLE SEQUENCE ALIGNMENT (CLUSTAL-W), connected to the SEQUENCE DATABASE, with an ADD MORE SEQUENCES button]

Enter sequence 1:

>gi|268574584|ref|XP_002642271.1| Hypothetical protein CBG18259 [Caenorhabditis briggsae]
MDEKQRLQAYRFVAYSAVTFSTVAVFSLCITLPLVYNYVDGIKTQINHEIKFCKHSARDIFAEVNHIRANPKNASRFARQAGYGTDEAVSGGS

Enter sequence 2:

>gi|32565788|ref|NP_871711.1| COLlagen family member (col-96) [Caenorhabditis elegans]
MDEITRRNAYRFVAYSAVTFSVVAVFSLCITLPMVYNYVHGIKSQINHQISFCKHSARDIFSEVNHIRASPNNATLREKRQAGDCSGCCL

Enter sequence 3:

>gi|17559060|ref|NP_505677.1| COLlagen family member (col-13) [Caenorhabditis elegans]
MSEDLKQIAQETESLRKVAFFGIAVSTIATLTAIIAVPMLYNYMQHVQSSLQSEVEFCQHRSNGLWDEYKRFQGVSGVEGRIKRDAYH

Word Size: 3 (the length of the initial seed set of amino acids, which needs to match exactly for the alignment to be extended in both directions)

Window length: 10 (the number of residues on either side of the initial matched sequence up to which the alignment will be extended)

Gap penalty: Existence 11, Extension 1

Score type: ABSOLUTE, selected from the options ABSOLUTE and PERCENTAGE (users can choose to see absolute scores or percentage values of the scores)

Step 6: Multiple Sequence Alignment - OUTPUT

Action: Shows the various output formats for multiple sequence alignment

Description of the action: Show the smaller image of the server with every output coming out of it one at a time.

Audio narration: Multiple sequence alignment gives various kinds of results. The alignment view in text format displays the residue-wise matching for the input sequences. The color coded alignment gives a better graphical picture, as the amino acid residues are assigned colors based on their physico-chemical properties. Here we depict one of the many color codings available. The alignment score is an absolute score, as selected previously. It can be compared with other scores to measure the quality of alignment. Users obtain an .output file with the summary of the result, .aln files which contain the text alignment and .dnd files which contain the distance-based information. For a detailed understanding of these outputs, kindly visit http://www.ebi.ac.uk/Tools/clustalw2/index.html

[Figure: smaller image of the MULTIPLE SEQUENCE ALIGNMENT (CLUSTAL-W) server and its input form, connected to the SEQUENCE DATABASE]

MULTIPLE SEQUENCE ALIGNMENT: Text alignment of query sequences.

sequence 1  MDE-----KQRLQAYRFVAYSAVTFSTVAVFSLCITLPLVYNYVDGIKTQ
sequence 2  MDE-----ITRRNAYRFVAYSAVTFSVVAVFSLCITLPMVYNYVHGIKSQ
sequence 3  MSEDLKQIAQETESLRKVAFFGIAVSTIATLTAIIAVPMLYNYMQHVQSS

COLOR CODED ALIGNMENT: Color coded alignment of the query sequences (Sequence 1, Sequence 2, Sequence 3), shown with a mapping of colors to amino acid groups.

ALIGNMENT SCORE: 5269. An alignment score which can be compared with other scores to measure the quality of alignment.

Source: http://www.ebi.ac.uk/Tools/es/cgi-bin/clustalw2/
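One simple way to read a text alignment such as the one above is to mark the columns in which every sequence carries the same residue. The Python sketch below does only that; it is an illustration rather than a reproduction of the ClustalW output, which may also flag groups of similar residues. The function name `conservation_line` is hypothetical.

```python
def conservation_line(aligned, gap="-"):
    """Return a string with '*' under every fully conserved column."""
    if not aligned:
        return ""
    width = len(aligned[0])
    if any(len(row) != width for row in aligned):
        raise ValueError("all aligned rows must have the same length")
    marks = []
    for column in zip(*aligned):
        residues = set(column)
        # Conserved means one distinct residue and no gap in the column.
        conserved = len(residues) == 1 and gap not in residues
        marks.append("*" if conserved else " ")
    return "".join(marks)


# The three aligned rows shown in the text alignment above.
rows = [
    "MDE-----KQRLQAYRFVAYSAVTFSTVAVFSLCITLPLVYNYVDGIKTQ",
    "MDE-----ITRRNAYRFVAYSAVTFSVVAVFSLCITLPMVYNYVHGIKSQ",
    "MSEDLKQIAQETESLRKVAFFGIAVSTIATLTAIIAVPMLYNYMQHVQSS",
]
for row in rows:
    print(row)
print(conservation_line(rows))
```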

Master Layout (Part 2)

This animation consists of 2 parts:

Part 1: Protein Sequence Alignment

Part 2: Alignment analysis and interpretations

Protein secondary structures

Structural features that decide function

Phylogram representing evolutionary relationships

Definitions of the components: Part 2 – Alignment analysis and interpretations

1. Computational Phylogenetic Predictions: Sequence alignment studies of proteins can reveal the conserved and variable residues between two sequences. Protein sequences derived from different organisms but having a high degree of similarity are assumed to come from the same ancestor. Such predictions, which can now be carried out computationally with the help of various algorithms, help in providing an insight into evolutionary processes.

2. Phylogram: A phylogram is a pictorial representation that provides a visualization of evolutionary relationships or phylogeny. In it, the lengths of the branches in the tree are considered to be proportional to the evolutionary distance.

3. Cladogram: A cladogram is another form of pictorial representation that also gives a visual insight into evolutionary relationships or phylogeny. Unlike the phylogram, the branches of a cladogram are of equal length irrespective of the evolutionary distance.

4. Maximum Parsimony: A method used for alignments which show very strong sequence similarity. It is usually applied to fewer than twelve sequences.

5. Distance methods: These predict the evolutionary distance when sequence variation is present and can be used on a large number of sequences. As the distance between two sequences increases, the uncertainty of the alignment also increases.

6. Maximum likelihood: This method is useful for prediction of evolutionary distance when sequence variability is high. It can be used for alignments with any amount of variability.

7. Protein structure prediction: The three dimensional structure of a protein is largely specified by its amino acid sequence. Protein structures can be predicted with an accuracy of 70-75% when provided with the sequence.

8. Functional annotation: Function(s) of proteins can be predicted for those proteins having a well-described homology. Gene Ontology terms (GO terms) provide a unique identification of the function that the gene is involved in. These functions are categorized at different levels of functional hierarchy.

9. Protein motif: A common pattern of residues in a set of protein sequences is known as a motif.


Step 1: Phylogenetic analysis from alignment- Input

Action: Schematic of the process of analysis of alignment

Description of the action: Follow the animation steps. Show the description of each of the methods as the mouse hovers over them. Finally select the “Maximum Parsimony” method. The drop-down after the scoring matrix should look like the drop-downs seen on web pages.

Audio narration: Multiple sequence alignment produces alignment files (.aln), which can be used to determine the evolutionary distances of a set of given protein sequences. This can be achieved by many server-based and stand-alone programs. The user needs to select the method for calculating the distance. Here we depict the usage of alignment files for phylogenetic analysis.

[Figure: input form of PHYLOGENETIC ANALYSIS (PHYLIP), connected to the SEQUENCE DATABASE]

Enter a sequence alignment for 2 or more sequences:

Seq1 -------------- LLFLFSSAYSRGVFRRDTHK
Seq2 MKWVTFISLLFLFSSAYSRGVFRRDAH
Seq3 MKWVTFLLLLFVSGSAFSRGVFRREA

Select a method: MAXIMUM PARSIMONY

MAXIMUM PARSIMONY: used for sequences with highly conserved residues

DISTANCE METHODS: used for sequences with moderately conserved residues

MAXIMUM LIKELIHOOD: used for sequences with highly variable residues
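As a hedged illustration of what a distance method starts from, the Python sketch below computes the p-distance, the fraction of differing residues among the columns where neither sequence has a gap, for every pair of aligned sequences. This is a deliberately simple stand-in: phylogenetic packages such as PHYLIP generally apply evolutionary models to correct such raw distances, which is not attempted here. The function names are hypothetical, and the toy alignment is made up for the example.

```python
def p_distance(a, b, gap="-"):
    """Fraction of differing residues over columns without a gap."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(1 for x, y in pairs if x != y) / len(pairs)


def distance_matrix(aligned):
    """Pairwise p-distances for a dict of name -> aligned sequence."""
    names = list(aligned)
    return {(p, q): p_distance(aligned[p], aligned[q])
            for p in names for q in names}


# Hypothetical toy alignment, not taken from the slides.
toy = {"Seq1": "ACDEFG", "Seq2": "ACDEFH", "Seq3": "AC-EYH"}
for (p, q), d in distance_matrix(toy).items():
    if p < q:
        print(p, q, d)
```

A tree-building step (for example neighbor joining) would take such a matrix as input; that step is left out to keep the sketch short.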

Step 2: Phylogenetic analysis from alignment - Output

Action: Schematic of the process of analysis of alignment

Description of the action: Follow the animation steps. The server on the previous slide gives the following outputs.

Audio narration: The outputs from the analysis are a distance file, known as the DND file, and a cladogram and a phylogram, which are evolutionary trees. In the DND file there is a common node, and the value against each sequence is its distance from that common node. DND files thus give the distance measure of the aligned sequences from their common ancestral node. Cladograms are graphical representations of the branching during the evolution of the proteins that were aligned. Cladograms do not represent the evolutionary distances or the common ancestral node. Phylograms also represent the evolutionary tree in a graphical format. In these, the branch lengths correspond to the evolutionary distance between the two proteins. All branches converge to a common ancestral root.

[Figure: smaller image of the PHYLOGENETIC ANALYSIS (PHYLIP) server, connected to the SEQUENCE DATABASE, with the method MAXIMUM PARSIMONY selected]

Input alignment:

PGFPPLVAPEPDALCAAFQDN
PNLPRLVRPEVDVMCTAFHDN
PKLK-PDPNTLCDEFKADEKKF

DND FILES: ( seq 1:0.13525, Seq 2:0.09868, seq 3:0.09868); DND files give the distance measure of the aligned sequences from their common ancestral node.

CLADOGRAM: A branching diagram depicting evolutionary relationships or phylogeny.

PHYLOGRAM: A phylogram is a branching diagram depicting evolutionary relationships or phylogeny. In it, the lengths of the branches in the tree are considered to be proportional to the evolutionary distance.
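The DND output shown above is a parenthesized tree description in which each name is followed by a colon and a branch length. A minimal Python sketch for reading such a line is given below. It assumes the simplest case, a single common node with all sequences attached directly to it (as in the example), and rejects nested trees rather than parsing them; a real analysis would more likely rely on an established tree-parsing library. The function names are hypothetical.

```python
def leaf_branch_lengths(tree_text):
    """Parse a one-level tree such as '(a:0.1, b:0.2);' into {name: length}."""
    body = tree_text.strip().rstrip(";").strip()
    if not (body.startswith("(") and body.endswith(")")):
        raise ValueError("expected a parenthesized tree")
    inner = body[1:-1]
    if "(" in inner or ")" in inner:
        raise ValueError("nested trees are not handled by this sketch")
    lengths = {}
    for item in inner.split(","):
        # Split on the last colon so names may contain spaces.
        name, _, length = item.rpartition(":")
        lengths[name.strip()] = float(length)
    return lengths


def leaf_distance(lengths, a, b):
    """Path length between two leaves that hang off the same common node."""
    return lengths[a] + lengths[b]


# The DND line shown in the output above.
lengths = leaf_branch_lengths("( seq 1:0.13525, Seq 2:0.09868, seq 3:0.09868);")
print(lengths)
print(leaf_distance(lengths, "seq 1", "Seq 2"))
```

In a phylogram these branch lengths would be drawn to scale, whereas a cladogram would keep only the branching pattern.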

Step 3: Structural and Functional prediction from alignment - Input

Action: Schematic of the process for structural and functional analysis

Description of the action: Follow the animation steps. Input the alignment. Input the parameters. Click on the server tool.

Audio narration: Alignment files can also be used for a variety of structural and functional analyses. Here we represent the functioning of such programs and servers by taking a simple example of protein motif prediction. The range of the width and the maximum number of motifs to be found are defined by the user.

[Figure: input form for Structural and Functional prediction (MEME server), connected to the SEQUENCE DATABASE]

Enter a sequence alignment for 2 or more sequences:

Seq 1 PGFPPLVAPEPDALCAAFQDN
Seq 2 PNLPRLVRPEVDVMCTAFHDN
Seq 3 PKLK-PDPNTLCDEFKADEKKF

Range for width of the motifs to be found: 6-50

Maximum number of motifs to be found: 3

Source: http://meme.sdsc.edu/meme4_4_0/intro.html
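The idea of a motif as a pattern common to a set of sequences can be sketched in Python by listing the exact words of a chosen width that occur in every input sequence. This is a much simpler technique than the one a motif server uses: tools such as MEME fit statistical models that tolerate substitutions, whereas this sketch finds only identical words and ignores gap characters. The function name `shared_words` is hypothetical.

```python
def shared_words(sequences, width):
    """Sorted list of exact words of `width` residues found in every sequence."""
    if not sequences:
        return []

    def words(seq):
        # Drop alignment gaps so words are read from the raw residues.
        seq = seq.replace("-", "")
        return {seq[i:i + width] for i in range(len(seq) - width + 1)}

    common = set.intersection(*(words(s) for s in sequences))
    return sorted(common)


# The three sequences from the input form above, with a short width.
seqs = ["PGFPPLVAPEPDALCAAFQDN", "PNLPRLVRPEVDVMCTAFHDN", "PKLK-PDPNTLCDEFKADEKKF"]
print(shared_words(seqs, 2))
```

The `width` argument plays the role of the motif width parameter; a range such as 6-50 would correspond to repeating the search for each width in that range.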

Step 4: Structural and Functional prediction from alignment - Output

Action: Schematic of the process for structural and functional analysis

Description of the action: Follow the animation steps. The server on the previous slide gives the following outputs.

Audio narration: The outputs obtained are:

1. Block diagram of protein motifs, which is the schematic used to visualize the positions and kinds of motifs in the alignment of two or more sequences. The color coding varies from server to server.

2. Sites of the blocks on a residue-by-residue basis.

[Figure: smaller image of the Structural and Functional prediction (MEME server) form with the input alignment, motif width range 6-50 and maximum number of motifs 3, connected to the SEQUENCE DATABASE]

Color coded block diagram for motifs: The block diagram of motif prediction is the schematic used to visualize the positions and kinds of motifs in the alignment of two or more sequences.

Residue-wise sites for motifs: The color coded diagram shows the positions of the motifs in the text alignment of the compared sequences.

Source: http://meme.sdsc.edu/meme4_4_0/intro.html

Step 5: Structural and Functional prediction from alignment - Further Analysis

Action: Functions that can be predicted from sequence data

Description of the action: Animator needs to re-draw all the images shown, as they have been retrieved from web resources. Show the pie chart. Highlight one quarter of it at a time and depict the diagram next to it while narrating it.

Audio narration: Once the protein motifs are detected, they can be used for further analysis, such as:

1. Epitope prediction

2. Active site determination

3. Determination of trans-membrane domains

4. Identification of DNA binding residues

[Figure: pie chart with the labels: Identify DNA binding residues; Subtilisin; Finding trans-membrane domains; Epitope prediction in antigens; Enzyme active sites]

Image sources:

http://qwickstep.com/search/the-active-site-of-an-enzyme.html
http://www.science.uva.nl/research/its/molsim/research/TMsignalling_lizhe/index.html
https://www.uzh.ch/oci/ssl-dir/group/files/14_roverview.jpg
http://medgadget.com/archives/2008/03/3d_imaging_of_bleomycindna_binding.html

Interactivity option 1: Find the evolutionary distance between insulin chain A of human and mouse

Interactivity type: Arrange the steps in the order to be performed.

Options: Remove the step number from the bottom of each tab. Show all the steps in mixed order. The user must click on the tabs in order. If the user clicks on a tab which is not in the right order, then flash a message saying “try again”.

Results: All the tabs must be arranged in the right order.

Steps, in the correct order:

1. Input the term “insulin chain A” in the protein database of your choice.

2. Choose the protein sequences corresponding to insulin A.

3. Check the source organism for the protein sequence.

4. Store the FASTA sequences mentioned against human and mouse in separate locations.

5. Input the two sequences in a multiple alignment server.

6. Run the server to obtain output.

7. Check for the .aln file and input it into programs for finding phylogenetic distances, such as PHYLIP.

8. Check the .dnd file to find the evolutionary distance.

Interactivity option 2.a: Match the following

Interactivity type: Match the left column to the right.

Options: Match the meaning of the parameter on the right to the name of the parameter on the left. If the matching is correct, turn the tab green, else flash “Try Again”.

Results: Results on next slide.

Left column:

PAM MATRIX
BLOSUM MATRIX
PHYLOGRAM
BIT SCORE
E-VALUE
DOMAIN IDENTIFICATION

Right column:

EVOLUTIONARY TREE
SIMILARITY BASED SCORING MATRIX
MEASURE OF BIOLOGICAL SIGNIFICANCE
DISTANCE BASED SCORING MATRIX
MEASURE OF QUALITY OF ALIGNMENT, NORMALIZED ACCORDING TO SCORING MATRIX
BLAST RESULT LINKED TO PFAM

Interactivity option 2.b: Match the following

Interactivity type: Match the left column to the right.

Options: Match the meaning of the parameter on the right to the name of the parameter on the left. If the matching is correct, turn the tab green, else flash “Try Again”.

Results: Correct Matching

[Figure: the left and right columns of option 2.a joined to show the correct matching]

Questionnaire

1. Which is a scoring matrix based on distantly related proteins?

Answers: a) PAM b) BLOSUM c) Both d) None

2. Which parameter signifies whether the match between two sequences is a chance alignment?

Answers: a) word-length b) e-value c) dot-plot d) none

3. Which evolutionary tree has branch lengths corresponding to the evolutionary distances?

Answers: a) Phylogram b) Cladogram c) both d) none

4. Which is NOT a ClustalW output file extension?

Answers: a) .dnd b) .txt c) .aln d) .output

5. The phylogenetic method for the most variable sequences is

Answers: a) Distance method b) Maximum Distance c) Maximum Parsimony d) Maximum Likelihood

Links for further reading

Reference websites:

http://blast.ncbi.nlm.nih.gov/Blast.cgi
http://www.ebi.ac.uk/Tools/clustalw2/index.html
http://www.pdb.org/pdb/home/home.do
http://expasy.org/sprot/
http://expasy.org/prosite/
http://pfam.sanger.ac.uk/
http://www.psc.edu/general/software/packages/phylip/

The following URLs are used for animations:

http://www.ncbi.nlm.nih.gov/
http://blast.ncbi.nlm.nih.gov/Blast.cgi
http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
http://pfam.sanger.ac.uk/
http://www.ebi.ac.uk/Tools/es/cgi-bin/clustalw2/
http://meme.sdsc.edu/meme4_4_0/intro.html
http://www.ebi.ac.uk/Tools/clustalw2/index.html
http://qwickstep.com/search/the-active-site-of-an-enzyme.html
http://www.science.uva.nl/research/its/molsim/research/TMsignalling_lizhe/index.html
https://www.uzh.ch/oci/ssl-dir/group/files/14_roverview.jpg
http://medgadget.com/archives/2008/03/3d_imaging_of_bleomycindna_binding.html

Books:

Bioinformatics: Sequence and Genome Analysis by David Mount