©CMBI 2007
Exploring Protein Sequences
Prediction methods exist for all kinds of motifs, signals, etc. in newly discovered protein sequences. These are based either on the protein sequence itself or on its comparison to protein families (a multiple sequence alignment).
Combining these predictions with primary biochemical data can provide valuable insights into protein structure and function.
Let’s make a quick tour through:
– Patterns
– Domains and domain databases
– Signals in proteins
Celia van Gelder
CMBI, Radboud University
October 2007
Exploring Protein Sequences
Part 1:
– Patterns
– Profiles
– Protein Domains
– Protein Domain Databases
Part 2:
Signals in Proteins:
– Hydropathy Plots
– Transmembrane Helices
– Signal Peptides
– Repeats
– Coiled Coils
Patterns
•Homologous sequences in multiple alignments show conserved regions
•These conserved regions (patterns, motifs, segments, blocks, features) are typically around 10-20 aa in length
•They usually reflect the structural and/or functional elements of the protein
•New sequences can be searched against a library of patterns and assigned a function, or assigned to a family or sub-family.
Identifying patterns
--CYDEGGIS--
--CYEDGGIS--
--CYEEGGIT--
--CYRGDGNT--
C-Y-x(2)-[DG]-G-x-[ST]   (regular expression or pattern)
PROSITE syntax: A-[BC]-X-D(2,5)-{EFG}-H
Means:
A         A
[BC]      B or C
X         anything
D(2,5)    2-5 D’s
{EFG}     not E, F or G
H         H
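The syntax rules above map almost one-to-one onto ordinary regular expressions. The following is a minimal sketch of that translation, not PROSITE's own implementation: the function name `prosite_to_regex` is hypothetical, the slide's pattern is written with the `x(2)` repeat notation, and anchors such as `<` and `>` are deliberately not handled.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split off an optional repeat count such as (2) or (2,5).
        m = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if m is None:
            raise ValueError(f"cannot parse element: {element!r}")
        core, count = m.groups()
        if core in ("x", "X"):
            regex = "."                       # any residue
        elif core.startswith("[") and core.endswith("]"):
            regex = core                      # alternative residues
        elif core.startswith("{") and core.endswith("}"):
            regex = "[^" + core[1:-1] + "]"   # forbidden residues
        elif len(core) == 1 and core.isalpha():
            regex = core                      # one fixed residue
        else:
            raise ValueError(f"unsupported element: {element!r}")
        if count:
            regex += "{" + count + "}"        # repeat count or range
        parts.append(regex)
    return "".join(parts)

regex = prosite_to_regex("C-Y-x(2)-[DG]-G-x-[ST]")
print(regex)  # CY.{2}[DG]G.[ST]
for fragment in ["CYDEGGIS", "CYEDGGIS", "CYEEGGIT", "CYRGDGNT"]:
    print(fragment, bool(re.fullmatch(regex, fragment)))
```

All four aligned fragments from the slide match, while a single substitution at a fixed position (for example `CYDEAGIS`) does not, which illustrates the "exact match or no match at all" behaviour of patterns.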
Identifying patterns (2)
Patterns can contain:
- alternative residues
- flexible regions
Patterns cannot contain:
- mismatches (exact match or no match at all)
- gaps
PROSITE
–PROSITE - Database of protein domains, families and functional sites
–1319 patterns and 748 profiles/matrices (Oct 2007)
–For every pattern or profile there is documentation present
–Sequence search and Keyword search possible
–http://www.expasy.ch/prosite/
PROSITE example
PROSITE Patterns
Some patterns, such as post-translational modification sites, occur frequently in proteins; a match to such a pattern does not mean that the site is actually present (modified) in the protein.
ID ASN_GLYCOSYLATION; PATTERN.
DE N-glycosylation site.
PA N-{P}-[ST]-{P}.
For such patterns you will get a warning.
Notice also in the PROSITE record the number of false positives and false negatives.
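To see why such a short pattern matches so often, it helps to scan a sequence for it directly. This is a small sketch, assuming the pattern N-{P}-[ST]-{P} is written as the regular expression `N[^P][ST][^P]`; the function name and the test sequence are made up for illustration, and a hit is only a candidate site, not evidence of glycosylation.

```python
import re

# N-{P}-[ST]-{P} as a regular expression. The lookahead makes the scan
# report overlapping matches, which a plain finditer would skip.
N_GLYC = re.compile(r"(?=(N[^P][ST][^P]))")

def find_n_glyc_sites(sequence: str) -> list[tuple[int, str]]:
    """Return (1-based position of the Asn, matched tetrapeptide) pairs."""
    return [(m.start() + 1, m.group(1)) for m in N_GLYC.finditer(sequence.upper())]

# Hypothetical test sequence: two overlapping candidate sites, and one
# Asn that is excluded because it is followed by Pro.
print(find_n_glyc_sites("ANNTSANPTA"))  # [(2, 'NNTS'), (3, 'NTSA')]
```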
Identifying patterns – fingerprints
[Figure: four conserved regions (Pattern 1 to Pattern 4), each represented by a matrix; together they form a fingerprint or signature]
Databases: PRINTS, BLOCKS
Profiles
Many motifs cannot be easily defined using simple regular expressions.
Such motifs can be defined using a profile, which is a numerical representation of a multiple sequence alignment (MSA). For each position in the MSA, each of the 20 amino acids is given a score depending on how likely it is to occur.
Profiles provide a sensitive means of detecting distant sequence relationships.
The profile represents a specific pattern found for a set of proteins.
It is then used to search a target sequence for matches to the profile.
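A profile of this kind can be sketched in a few lines. The example below is an illustration under stated assumptions, not the scoring scheme of any particular database: it assumes an ungapped alignment of standard residues, a uniform background frequency of 1/20, and a pseudocount of 1 per residue; `build_profile` and `best_match` are hypothetical names.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)   # assumed uniform background
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile

def best_match(profile, sequence):
    """Slide the profile along the sequence; return (best score, 0-based start)."""
    width = len(profile)
    scores = [
        (sum(col[aa] for col, aa in zip(profile, sequence[i:i + width])), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scores)

# The four aligned fragments from the pattern example serve as the alignment.
profile = build_profile(["CYDEGGIS", "CYEDGGIS", "CYEEGGIT", "CYRGDGNT"])
score, start = best_match(profile, "MKACYEDGGISLLV")
print(start, round(score, 2))
```

Unlike a pattern, the profile still gives a (lower) score to a window with a mismatch, which is what makes profiles more sensitive for distant relationships.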
Identifying patterns – full domain alignment
[Figure: Pattern 1 to Pattern 4 (fingerprint or signature) + gaps and insertions = position-specific matrix with gaps and insertions]
Databases: Profiles (alignment manually corrected), Pfam (automatically aligned)
Protein domains - definitions
• A group of residues with high contact density: the number of contacts within a domain is higher than the number of contacts between domains.
• A stable unit of protein structure that can fold autonomously
• A rigid body linked to other domains by flexible linkers
• A portion of the protein that can be active on its own if you remove it from the rest of the protein.
Protein Domains
• Domains can be 25 to 500 amino acids long; most are less than 200 amino acids
• The average protein contains 2 or 3 domains
• The same or similar domains are found in different proteins.
“Nature is a ‘tinkerer’ and not an inventor” (Jacob, 1977).
“Nature is smart but lazy”
• Usually, each domain plays a specific role in the function of the protein.
Protein Domains - an alphabet of functional modules
WD40, WW, SH2, SH3, 14-3-3, ANK3, ARM, BH1, C1, C2, CARD, EH, EVH, FYVE, PDZ, Death, DED, EFH, PH, PTB, SAM
From: Bioinformatics.ca
Domain Linkers
Domain linkers link the protein domains together and have been found to contain an amino acid signature that is distinct from the structurally compact domains.
Average linker size 8-9 amino acids
Linkers are flexible and susceptible to protease attack. Amino acids such as Pro, Ser, Gly and Thr (and, less frequently, Ala, Asn and Asp) are often found in linker sequences.
Protein Domain Databases
Even though the structure of a domain is not always known, it is still possible to define the domain boundaries from sequence alone.
Many of the common domains have already been defined in domain databases.
Advantages:
• Pre-annotated domains
• Easy interpretation of domain structure
Problem:
• Not trivial to define domain boundaries unambiguously
The challenge of family analysis
T. Attwood
Domain databases
Database           Generation   #entries
PfamA              manual       7503 families
PfamB              automatic    >140,000 families
Prints             manual       11,435 motifs, 1900 fingerprints
Prosite Profiles   manual       577 profiles
Blocks             automatic    28,337 blocks, 5733 groups
SMART              manual       667 HMMs
ProDom             automatic    501,917 domain families
(December 2005)
PRINTS database
• Most protein families are characterised not by one motif but by several conserved motifs, which together form a so-called fingerprint.
• Use all motifs of a protein family to build a diagnostic signature for this family
• Fingerprints are the basis of the PRINTS database, and are stored in the form of aligned motifs
• Information about protein families is entered manually
• True members match all elements of the fingerprint in order; subfamily members may match only part of the fingerprint
http://ip30.eti.uva.nl/ember-demo/ch3
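The "all elements, in order" rule can be illustrated with a toy matcher. This is only a sketch: PRINTS stores aligned motifs and scores them, whereas the example below stands in regular expressions for the motifs and uses a simple greedy left-to-right scan. The three motifs and the test sequences are invented for illustration.

```python
import re

def match_fingerprint(sequence, motifs):
    """Return the indices of the motifs found, in order, along the sequence."""
    matched = []
    position = 0
    for index, motif in enumerate(motifs):
        hit = re.compile(motif).search(sequence, position)
        if hit:
            matched.append(index)
            position = hit.end()   # later motifs must lie downstream
    return matched

def classify(sequence, motifs):
    """Label a sequence as a full, partial or non-match to the fingerprint."""
    found = match_fingerprint(sequence, motifs)
    if len(found) == len(motifs):
        return "full match"
    return "partial match" if found else "no match"

# Hypothetical three-motif fingerprint.
fingerprint = ["C.HC", "W[ST]G", "DE.K"]
print(classify("AACAHCAAAWSGAAADEAKAA", fingerprint))  # full match
print(classify("AAWSGAAACAHCAA", fingerprint))         # partial match
```

In the second sequence two motifs are present but in the wrong order, so only one of them counts; this is the behaviour that separates a fingerprint from a set of independent pattern searches.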
PRINTS
BLOCKS database
Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins.
The blocks for the BLOCKS database are made automatically.
To ensure complete coverage it is recommended that both the PRINTS and the BLOCKS database be searched
Pfam
Pfam (Protein families) is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families.
For each family in Pfam you can:
•Look at multiple alignments
•View the domain organisation of proteins
•Examine species distribution
•Follow links to other databases
•View known protein structures
Pfam
Pfam-A entries are manually curated - 9318 families (July 2007)
Pfam-B entries are automatically generated clusters - >140,000 families (not covered by Pfam-A)
iPfam is a resource that describes domain-domain interactions that are observed in known structures - 3019 interactions
SMART
SMART - Simple Modular Architecture Research Tool
Specializes in:
1) signalling domains
2) nuclear domains
3) extracellular domains
Current version 5.0: Number of SMART HMMs: 669
Bacteriorhodopsin
Human serine protease
Structure Databases & Structural classification
PDB - Protein Data Bank (Brookhaven): http://www.rcsb.org/pdb/
CDD - Conserved Domain Database: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
MSD - Macromolecular Structure Database: http://www.ebi.ac.uk/msd/index.html
CATH - Protein Structure Classification: http://www.biochem.ucl.ac.uk/bsm/cath/
SCOP - Structural Classification of Proteins: http://scop.mrc-lmb.cam.ac.uk/scop/
Adapted from: Bioinformatics.ca
Limitations of domain databases
• Patterns are not available for all protein families
• The multiple sequence alignment used to define a pattern may be inaccurate when it was generated automatically
• A low number of sequences from different species can result in inaccurate patterns
Integrating Pattern databases
InterPro - Integrated Documentation Resource of Protein Families, Domains and Functional Sites.
InterPro is a database of protein families, domains and functional sites in which identifiable features found in known proteins can be applied to unknown protein sequences.
The aim is to provide a one-stop-shop for protein family diagnostics
InterPro
Member Databases
– Prosite (regular expressions and profiles)
– Pfam, SMART, TIGRFAMs, PIRSF, PANTHER, Gene3D and SUPERFAMILY (hidden Markov models, HMMs)
– PRINTS (groups of aligned, un-weighted motifs)
– ProDom (uses cluster analysis to group sequences)
Release 16.1: 14768 entries (Oct 2007)
Types of entries: Family, Domain, Repeat, PTM, Binding Site, Active Site
Summary patterns & domains
• Many different protein signature databases exist (from small patterns to alignments to complex HMMs)
• The databases have different strengths and weaknesses. Some databases can be better for your sequence than others
• Therefore: best to combine methods, preferably in an integrated database
• The quality of a database/server is best tested with a sequence you know very well
• Always do control experiments: never trust a server
Exploring Protein Sequences
Part 1:
– Patterns
– Profiles
– Protein Domains
– Protein Domain Databases
Part 2:
Signals in Proteins:
– Hydropathy Plots
– Transmembrane Helices
– Signal Peptides
– Repeats
– Coiled Coils
Hydropathy plots
Hydropathy plots are designed to display the distribution of polar and apolar residues along a protein sequence.
Hydrophobicity scales are based on experimental evidence indicating hydrophobic/hydrophilic properties of each amino acid
Hydropathy plots are generally most useful in predicting transmembrane segments and N-terminal secretion signal sequences.
Hydropathy scales
A positive value indicates local hydrophobicity and a negative value suggests a water-exposed region on the face of a protein (Kyte-Doolittle scale).
Sliding Window Approach
Sum the amino acid hydrophobicity values in a given window
Plot the average value in the middle of the window
I L I K E I R: 4.50 + 3.80 + 4.50 - 3.90 - 3.50 + 4.50 - 4.50 = 5.40 => 5.40/7 = 0.77
Move to the next position in the sequence
L I K E I R Q: 3.80 + 4.50 - 3.90 - 3.50 + 4.50 - 4.50 - 3.50 = -2.60 => -2.60/7 = -0.37
The window size can be changed.
J. Leunissen
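The sliding-window calculation can be written out as a short function. This is a minimal sketch using the Kyte-Doolittle values; the function name `hydropathy_profile` is hypothetical, and windows that would run past either end of the sequence are simply skipped.

```python
# Kyte-Doolittle hydropathy values per amino acid.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(sequence, window=7):
    """Return (1-based centre position, mean hydropathy) for every full window."""
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    half = window // 2
    profile = []
    for start in range(len(values) - window + 1):
        mean = sum(values[start:start + window]) / window
        profile.append((start + half + 1, mean))   # value is plotted at the window centre
    return profile

for centre, mean in hydropathy_profile("ILIKEIRQ", window=7):
    print(centre, round(mean, 2))
# 4 0.77
# 5 -0.37
```

The two printed values reproduce the worked example for the windows ILIKEIR and LIKEIRQ; changing `window` changes how strongly the curve is smoothed.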
[Figure: Hydrophobicity plot. Horizontal axis: protein sequence from NH2 to COOH; vertical axis: score, from hydrophobic (+, interior residues) to hydrophilic (-, exterior residues)]
From: Bioinformatics.ca
Transmembrane Helices
Transmembrane proteins are integral membrane proteins that interact extensively with the membrane lipids.
Nearly all known integral membrane proteins span the lipid bilayer
Hydropathy analysis can be used to locate possible transmembrane segments
The main signal is a stretch of hydrophobic and helix-loving amino acids
A window of about 19 residues is generally optimal for recognizing the long hydrophobic stretches that typify transmembrane segments.
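A crude transmembrane scan follows directly from the hydropathy idea. The sketch below averages Kyte-Doolittle values over a 19-residue window and reports stretches where the average exceeds a cutoff. The cutoff of 1.6 is an assumption that is not stated in the slides (it is a commonly quoted value for this scale), the function name is hypothetical, and real predictors use considerably more information than this.

```python
# Kyte-Doolittle hydropathy values per amino acid.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def predict_tm_segments(sequence, window=19, threshold=1.6):
    """Return 1-based (start, end) ranges covered by hydrophobic windows."""
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    segments = []
    for start in range(len(values) - window + 1):
        if sum(values[start:start + window]) / window >= threshold:
            first, last = start + 1, start + window
            if segments and first <= segments[-1][1] + 1:
                # Overlapping or adjacent windows belong to the same segment.
                segments[-1] = (segments[-1][0], last)
            else:
                segments.append((first, last))
    return segments

# Hypothetical test sequence: a 21-residue poly-Leu core between charged flanks.
flank = "DEKR" * 5
print(predict_tm_segments(flank + "L" * 21 + flank))
```

Because every window that merely overlaps the hydrophobic core can pass the cutoff, the reported range is usually wider than the true segment; the boundaries should be read as approximate.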
Transmembrane Helices (2)
In an α-helix the rotation is 100 degrees per amino acid
The rise per amino acid is 1.5 Å
To span a membrane of 30 Å, approx. 30/1.5 = 20 amino acids are needed
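The helix arithmetic can be checked with a few lines of code. This is only a worked calculation based on the two numbers given above (100 degrees and 1.5 Å per residue); the function names are hypothetical.

```python
import math

DEGREES_PER_RESIDUE = 100.0   # rotation per amino acid in an alpha-helix
RISE_PER_RESIDUE = 1.5        # rise per amino acid, in angstrom

def residues_per_turn():
    """Number of residues needed for one full 360-degree turn."""
    return 360.0 / DEGREES_PER_RESIDUE

def residues_to_span(thickness):
    """Smallest whole number of residues whose helix is at least `thickness` angstrom long."""
    return math.ceil(thickness / RISE_PER_RESIDUE)

print(residues_per_turn())     # 3.6
print(residues_to_span(30.0))  # 20
```

The result of about 20 residues is consistent with the window of about 19 that is generally used to detect transmembrane stretches.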
Transmembrane Helix Prediction - Rhodopsin
Signal Peptides
Proteins have intrinsic signals that govern their transport and localization in the cell (nucleus, ER, mitochondria, chloroplasts)
Specific amino acid sequences determine whether a protein will pass through a membrane into a particular organelle, become integrated into the membrane, or be exported out of the cell.
Signal Peptides (2)
The common structure of signal peptides from various proteins is described as:
• a positively charged (N-terminal) n-region
• followed by a hydrophobic h-region (which can adopt an α-helical conformation in a hydrophobic environment)
• and a neutral but polar c-region (cleavage region; the signal sequence is cleaved off here after delivering the protein at the right site).
Signal Peptides (3)
                         Eukaryotes                   Prokaryotes, Gram-negative          Prokaryotes, Gram-positive
Total length (average)   22.6 aa                      25.1 aa                             32.0 aa
n-regions                only slightly Arg-rich       Lys+Arg-rich                        Lys+Arg-rich
h-regions                short, very hydrophobic      slightly longer, less hydrophobic   very long, less hydrophobic
c-regions                short, no pattern            short, Ser+Ala-rich                 longer, Pro+Thr-rich
-3,-1 positions          small and neutral residues   almost exclusively Ala              almost exclusively Ala
+1 to +5 region          no pattern                   rich in Ala, Asp/Glu, and Ser/Thr   rich in Ala, Asp/Glu, and Ser/Thr
Marlinda Hupkes 2004
Repeats in proteins
• A repeat is any piece of protein sequence that appears multiple times within a single protein
• Length of the repeat can vary from 1 (single amino acid repeat) up to 240 amino acids
• Repeats are rarer in coding regions than in non-coding regions
• Repeats occur in 14 % of all proteins
• Eukaryotic proteins have three times more internal repeats than prokaryotic proteins
• The three kingdoms of life have very few repeats in common
Repeats, examples
• Gln repeat in huntingtin (Huntington’s disease): (CAG)n = a polyglutamine tract (polyQ). Up to 35 repeats is not pathological; more than 35 repeats is pathological
• Bacterial transferase hexapeptide (three repeats)
• Leucine-rich repeats (LRRs), 20-29 aa motif
• WD-repeat
• Ankyrin-repeat
• etc.
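Single amino acid repeats such as the polyQ tract are the easiest repeats to detect. The sketch below finds the longest uninterrupted run of one residue and applies the threshold of 35 quoted above; the function names and the test sequence are illustrative, and interrupted tracts are not handled.

```python
import re

def longest_run(sequence, residue="Q"):
    """Return (length, 1-based start) of the longest uninterrupted run of `residue`."""
    best = (0, 0)
    for m in re.finditer(re.escape(residue) + "+", sequence.upper()):
        length = m.end() - m.start()
        if length > best[0]:
            best = (length, m.start() + 1)
    return best

def is_pathological_polyq(sequence, threshold=35):
    """Apply the rule from the slide: more than 35 consecutive Gln is pathological."""
    return longest_run(sequence, "Q")[0] > threshold

# Illustrative sequence: 17 residues followed by a 23-residue polyQ tract.
test_sequence = "MATLEKLMKAFESLKSF" + "Q" * 23 + "PPPP"
print(longest_run(test_sequence))            # (23, 18)
print(is_pathological_polyq(test_sequence))  # False
```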
Coiled-Coils
The coiled-coil is a ubiquitous protein motif that is often used to control oligomerisation.
It is found in many types of proteins, including transcription factors, viral fusion peptides, and certain tRNA synthetases.
Examples:
– Very long coils in tropomyosin and intermediate filaments
– GCN4: gene regulation in yeast; leucine zipper
Coiled-Coils
Left-handed spiral of right-handed helices
May be parallel or anti-parallel
[Figure: helix termini labelled N and C for the parallel and the anti-parallel arrangement]
David Gossard
Coiled-Coils – Heptad repeat
A seven-residue pattern, abcdefg, in which the a and d residues (core positions) are generally hydrophobic.
[Figure: two helical wheels with positions a to g]
Residues at “d” and “a” form the hydrophobic core
Residues at “e” and “g” form ion pairs
David Gossard
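The heptad rule suggests a simple test: try all seven possible registers and see which one puts the most hydrophobic residues at the a and d positions. The following is a toy sketch of that idea, not a published coiled-coil predictor; the set of residues counted as hydrophobic is an assumption, the function name is hypothetical, and the test sequence is modelled on the GCN4 leucine zipper.

```python
# Residues counted as hydrophobic here; this choice is an assumption.
HYDROPHOBIC = set("AILMFVWY")

def best_heptad_frame(sequence):
    """Return (fraction of hydrophobic a/d residues, offset of the first 'a') for the best register."""
    seq = sequence.upper()
    best = (0.0, 0)
    for offset in range(7):
        # With the first 'a' at index `offset`, 'a' and 'd' fall where
        # (index - offset) modulo 7 equals 0 and 3, respectively.
        core = [aa for i, aa in enumerate(seq) if (i - offset) % 7 in (0, 3)]
        if not core:
            continue
        fraction = sum(aa in HYDROPHOBIC for aa in core) / len(core)
        if fraction > best[0]:
            best = (fraction, offset)
    return best

fraction, offset = best_heptad_frame("RMKQLEDKVEELLSKNYHLENEVARLKKLVGER")
print(offset, round(fraction, 2))  # 1 0.8
```

A high fraction in exactly one register, as in this example, is what a heptad repeat looks like; a sequence without a heptad pattern tends to score similarly in all seven registers.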
Assignment (see also paper version)
Make a report about the protein signal of your choice.
Questions which should be answered in this report are:
• Describe the protein signal you want to detect.
• Describe existing prediction method(s), their prediction quality and their underlying theory.
• Describe the available webservers for detecting this protein signal, the quality of their predictions, their pros and cons, and anything else you find relevant.
• Give example output for a (for your protein signal) relevant protein and explain this output.