Bioinformatics
Spring 2013
Ayesha Masrur Khan
Lec-4 2
Protein Family and Domains
Once a protein sequence is obtained, there are many questions that can be asked, such as:
- What is the protein’s overall identity?
- What putative functions does it have?
- What biological motifs are present?
Different computational tools are needed to determine possible functional domains based on primary sequence data.
Protein Family and Domains (contd.)
Therefore, family and domain databases are used to address questions such as ‘What domains are contained within this sequence?’ or ‘What family does this protein belong to?’
But first: what are families and domains?
Protein Family and Domains (contd.)
Family---> A family of proteins was originally defined by Dayhoff et al. (1978) as a group of sequences with similar functions that share more than 50% identity when aligned. Families are often also characterized by the presence of one or more domains with high sequence similarity.
Domains---> Traditionally described as structurally independent folding units, domains are conserved functional units that may contain one or more motifs. A domain is characterized by the following:
1- It is a spatially separated unit of the protein 3D structure.
2- It may have sequence and/or structural resemblance to another protein structure or domain.
3- It may have a specific function associated with it.
Protein Family and Domains (contd.)
Motifs---> These include both short stretches of fixed length that act as sites for post-translational modifications and longer sequences that form secondary structures for protein-DNA, protein-ion or protein-lipid interactions.
Domain Example: Pyruvate kinase
Quaternary structure: 4 subunits
3 domains
Zinc finger motif: A sequence motif
Three zinc fingers bound spirally in the major groove of a DNA molecule.
The coordination of a zinc atom by characteristically spaced cysteine and histidine residues in a single zinc finger motif.
Sequence motif: A particular amino-acid sequence that is characteristic of a specific biochemical function
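A sequence motif of this kind can be written as a pattern and searched for in a primary sequence. The sketch below assumes the PROSITE-style pattern commonly quoted for the C2H2 zinc finger (two cysteines and two histidines at characteristic spacings); the pattern string, the helper names `prosite_to_regex` and `find_motif`, and the test sequence are illustrative assumptions, not part of the lecture.

```python
import re

# Assumed PROSITE-style pattern for a C2H2 zinc finger (illustrative only).
C2H2_PATTERN = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regular expression.

    Handles only the elements used above: literal residues, x (any residue),
    [..] residue classes, and (n) or (n,m) repeat counts.
    """
    parts = []
    for element in pattern.split("-"):
        # Split off an optional repeat count such as "(3)" or "(2,4)".
        match = re.fullmatch(r"([^()]+)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        regex = "." if residue == "x" else residue
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return "".join(parts)


def find_motif(sequence: str, pattern: str):
    """Return (start, matched_text) for each non-overlapping match (0-based start)."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


# Synthetic sequence: cysteines and histidines spaced to fit the pattern.
toy_sequence = "MKT" + "CAAC" + "AAA" + "F" + "AAAAAAAA" + "H" + "AAA" + "H" + "GG"
hits = find_motif(toy_sequence, C2H2_PATTERN)
```

Here `hits` holds one match starting at position 3 of the synthetic sequence. A regular expression only tests spacing and residue identity; it says nothing about whether the matched region actually folds into a zinc finger.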
Other examples: structural motifs & functional motifs
Another type is the functional motif, which is a sequence or structural motif that is always associated with a particular biochemical function.
Protein families
Members of a protein family are related to one another by sequence similarity, domain composition, or structure.
These include proteins found across species (orthologs) or within the same species (paralogs).
Family descriptors are derived from MSAs (multiple sequence alignments) that enable us to define traits that encompass all member sequences. Family descriptors have been based on sequence identity (>50% identical), common domains (e.g. catalytic binding domains, calcium-binding motifs, etc.), structure, or a combination of these characteristics.
Protein Domains
Domains represent discrete stretches within the protein, unlike protein families, which are commonly defined over the length of the sequence.
These units are conserved at the level of sequence and structure.
They can be described by:
- combinations of short regions of highly conserved amino acids within a domain
- all amino acids
- structural features
Domain descriptors are developed in the same way as family descriptors.
Family-Domain Databases
Because of the reuse of motifs and domains, similarities can be found within sequences that are otherwise unrelated evolutionarily.
Therefore, methods are needed to distinguish between similarities due to random variation and those of common origin or function.
Family-domain databases provide the following benefits:
1. Increased sensitivity, i.e. true matches are detected through MSA
2. Increased specificity, i.e. only related proteins are detected
3. Classification of protein sequences into appropriate families
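Sensitivity and specificity are usually quantified from counts of correct and incorrect matches. The sketch below uses the standard definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)); the counts are hypothetical numbers chosen only to show the arithmetic, not results from any real database search.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of true family members that the search actually reports."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of unrelated sequences that the search correctly rejects."""
    return true_neg / (true_neg + false_pos)


# Hypothetical search against 1000 sequences, 50 of which belong to the family.
sens = sensitivity(true_pos=45, false_neg=5)    # 45 of the 50 members found
spec = specificity(true_neg=940, false_pos=10)  # 10 unrelated sequences reported
```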
Family-Domain Databases
Some database references
Name      Web address                        Description
PROSITE   http://www.expasy.ch/prosite       Groups of proteins of similar biochemical function on the basis of amino acid patterns
Pfam      http://www.sanger.ac.uk/Pfam       Profiles derived from alignments of protein families, each one composed of similar sequences
SMART     http://smart.embl-heidelberg.de/   Genetically mobile domains
InterPro  http://www.ebi.ac.uk/interpro      Integrated resource of protein domains and functional sites: combination of Pfam, PRINTS, PROSITE, and current SwissProt/TrEMBL sequences
Searching sequence databases
Search methods perform a series of sequence alignments to determine the degree of similarity between the query and each database sequence, and then return a list of matched sequences to the user.
Alignment Algorithms
Manually, we examine two or more sequences for similar residue patterns, match up identical residues, decide qualitatively whether they are aligned well, and determine statistically how identical or similar the sequences are.
The automation of this process requires a computer-based method to line sequences up against one another and a scoring method for evaluating the success of the alignment in terms of similarity or identity.
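The scoring step can be sketched in a few lines. The example below assumes the simplest possible scheme (a fixed score for a match, a mismatch and a gap); the function name `score_alignment` and the values +1, -1 and -2 are arbitrary choices for illustration. Real tools typically use substitution matrices and more elaborate gap penalties.

```python
def score_alignment(row_a: str, row_b: str,
                    match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Score two already-aligned rows column by column ('-' marks a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score += gap        # a residue paired with a gap
        elif a == b:
            score += match      # identical residues
        else:
            score += mismatch   # different residues
    return score


# Two ways of lining up the same pair of sequences.
with_internal_gap = score_alignment("ATTGCGC", "AT-CCGC")  # 5 matches, 1 mismatch, 1 gap
with_end_gap = score_alignment("ATTGCGC", "ATCCGC-")       # 2 matches, 4 mismatches, 1 gap
```

Under these assumed values the first arrangement scores 2 and the second scores -4, so the scoring method prefers the alignment with the internal gap.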
DNA, RNA, Protein
Sequence comparison and alignment is a central problem in computational biology. The most basic task is: given two known sequences (DNA, RNA or protein) and a scoring model, determine whether they are related or not.
Sequence alignment
• When we align sequences, we assume that they share a common ancestor
  - They are then homologous
• Protein fold is much more conserved than protein sequence
• DNA sequences tend to be less informative than protein sequences
[Figure: the sequences ATTGCGC and ATCCGC, aligned with one gap]
ATTGCGC
AT-CCGC
An alignment is a hypothesis of positional homology between bases or amino acids.
Sequence alignment
The alignment of two sequences (DNA or protein) is a relatively straightforward computational problem.
- There are lots of possible alignments.
- Two sequences can always be aligned.
- Sequence alignments have to be scored.
- Often there is more than one solution with the same score.
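To get a feel for how many alignments are possible, one can count them. The sketch below assumes that an alignment is a sequence of columns, each being residue/residue, residue/gap or gap/residue, and that arrangements differing only in the order of adjacent gaps count as different alignments. Under that assumption the count follows a simple recurrence; the function name is made up for illustration.

```python
def count_alignments(len_a: int, len_b: int) -> int:
    """Count global alignments of two sequences with the given lengths.

    The last column is residue/residue, residue/gap or gap/residue, so
    N(i, j) = N(i-1, j-1) + N(i-1, j) + N(i, j-1), with N = 1 along the edges.
    """
    table = [[1] * (len_b + 1) for _ in range(len_a + 1)]
    for i in range(1, len_a + 1):
        for j in range(1, len_b + 1):
            table[i][j] = table[i - 1][j - 1] + table[i - 1][j] + table[i][j - 1]
    return table[len_a][len_b]


two_by_two = count_alignments(2, 2)      # 13 alignments for two sequences of length 2
short_example = count_alignments(7, 6)   # lengths of ATTGCGC and ATCCGC
```

The count grows very quickly with sequence length, which is why alignments are scored and searched systematically rather than enumerated.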
Identity vs. Similarity
Identity refers to an exact match between two nucleotides or amino acids.
Similarity refers to a resemblance between two residues that is greater than one would expect at random.
Percent Sequence Identity
• The extent to which two nucleotide or amino acid sequences are invariant.
70% identical
A C C T G A G - A G
A C G T G - G C A G
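The 70% figure can be checked by counting identical columns. The sketch below reads the example as two aligned rows of ten columns each (an assumption about how the flattened slide text splits) and divides the number of identical, non-gap columns by the alignment length. Other conventions exist, for example dividing by the length of the shorter sequence, so the denominator used here is an assumption.

```python
def percent_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """Percentage of alignment columns in which both rows hold the same residue."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    identical = sum(1 for a, b in zip(row_a, row_b) if a == b and a != gap)
    return 100.0 * identical / len(row_a)


# The two rows of the example, with '-' marking gaps.
identity = percent_identity("ACCTGAG-AG", "ACGTG-GCAG")
```

Seven of the ten columns are identical, so `identity` is 70.0.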
Alignment methods
- By hand: slide sequences on two lines of a word processor
- Dot plot with windows
- Rigorous mathematical approach: dynamic programming (slow, optimal)
- Heuristic methods (fast, approximate): BLAST and FASTA
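Of the methods listed, the windowed dot plot is the easiest to sketch. The example below marks position (i, j) when the windows starting at residue i of one sequence and residue j of the other agree in at least `min_matches` positions; the function names and the default window of 3 are assumptions made for illustration.

```python
def dot_plot(seq_a: str, seq_b: str, window: int = 3, min_matches: int = 3):
    """Return grid[i][j] = True when the windows starting at seq_a[i] and
    seq_b[j] have at least min_matches identical positions."""
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    grid = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            row.append(matches >= min_matches)
        grid.append(row)
    return grid


def render(grid) -> str:
    """Draw the grid as text: '*' for a dot, '.' for an empty cell."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in grid)


# Comparing a sequence with itself gives an unbroken main diagonal.
self_plot = dot_plot("ATTGCGC", "ATTGCGC")
picture = render(self_plot)
```

Similar regions show up as diagonal runs of dots; a larger window or a stricter threshold suppresses the scattered dots produced by chance matches.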
Global and Local Alignment
Global alignment algorithms start at the beginning of two sequences and add gaps to each until the end of one is reached.
- Used when an objective and optimal measure is needed to compare two sequences and it is valid to assume that the sequences are of similar length
Local alignment algorithms find the region (or regions) of highest similarity between two sequences and build the alignment outward from there.
Global alignment
The Needleman-Wunsch algorithm (1970) creates a global alignment over the length of both sequences.
Global algorithms are often not effective for highly diverged sequences: they do not reflect the biological reality that two sequences may share only limited regions of conserved sequence.
Sometimes two sequences may be derived from ancient recombination events where only a single functional domain is shared.
Global methods are useful when you want to force two sequences to align over their entire length.
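A minimal sketch of global alignment by dynamic programming follows. It uses the recurrence usually taught under the Needleman-Wunsch name with a linear gap penalty; the scoring values (+1, -1, -2) and the function name are assumptions for illustration, and the code returns just one of possibly several optimal alignments.

```python
def needleman_wunsch(seq_a: str, seq_b: str,
                     match: int = 1, mismatch: int = -1, gap: int = -2):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(seq_a), len(seq_b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,  # align the two residues
                              score[i - 1][j] + gap,       # gap in seq_b
                              score[i][j - 1] + gap)       # gap in seq_a
    # Trace back from the bottom-right corner to recover one alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and seq_a[i - 1] == seq_b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            row_a.append(seq_a[i - 1]); row_b.append(seq_b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(seq_a[i - 1]); row_b.append("-")
            i -= 1
        else:
            row_a.append("-"); row_b.append(seq_b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


best_score, aligned_a, aligned_b = needleman_wunsch("ATTGCGC", "ATCCGC")
```

Because every residue of both sequences must appear in the result, the shorter sequence is stretched with gaps to span the longer one, which is exactly the forced end-to-end behavior described above.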
Local Alignment
Smith-Waterman algorithm (1981)
This method identifies the most similar sub-region shared between two sequences.
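A minimal sketch of local alignment follows, using the recurrence usually taught under the Smith-Waterman name: the same dynamic programming as for global alignment, except that cell scores are never allowed to fall below zero and the traceback starts at the highest-scoring cell. The scoring values (+2, -1, -2), the function name and the test sequences are assumptions for illustration.

```python
def smith_waterman(seq_a: str, seq_b: str,
                   match: int = 2, mismatch: int = -1, gap: int = -2):
    """Return (score, aligned_a, aligned_b) for one best-scoring local alignment."""
    n, m = len(seq_a), len(seq_b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(0,                           # start a new local alignment
                              score[i - 1][j - 1] + pair,  # align the two residues
                              score[i - 1][j] + gap,       # gap in seq_b
                              score[i][j - 1] + gap)       # gap in seq_a
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    row_a, row_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + pair:
            row_a.append(seq_a[i - 1]); row_b.append(seq_b[j - 1])
            i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            row_a.append(seq_a[i - 1]); row_b.append("-")
            i -= 1
        else:
            row_a.append("-"); row_b.append(seq_b[j - 1])
            j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


# Two made-up sequences that share only the central stretch ACGTACG.
local_score, local_a, local_b = smith_waterman("TTTTACGTACGTTTTT", "GGGACGTACGGGG")
```

With these inputs the function recovers the shared stretch ACGTACG and ignores the unrelated flanking residues, which a global alignment would have been forced to pair up.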