
©CMBI 2007

Exploring Protein Sequences

Prediction methods exist for all kinds of motifs, signals etc. in newly discovered protein sequences. These are based on either the protein sequence itself or its comparison to protein families (a multiple sequence alignment)

Combining these predictions with primary biochemical data can provide valuable insights into protein structure and function

Let’s make a quick tour through:

– Patterns
– Domains and domain databases
– Signals in proteins

Celia van Gelder, CMBI

Radboud University, October 2007



Part 1:

– Patterns
– Profiles
– Protein Domains
– Protein Domain Databases

Part 2:

Signals in Proteins:
– Hydropathy Plots
– Transmembrane helices
– Signal Peptides
– Repeats
– Coiled Coils


Patterns

• Homologous sequences in multiple alignments show conserved regions

• These conserved regions (patterns, motifs, segments, blocks, features) are typically around 10-20 aa in length

• They usually reflect the structural and/or functional elements of the protein

• New sequences can be searched against a library of patterns and can then be assigned a function, or assigned to a family or sub-family.

Identifying patterns

--CYDEGGIS--
--CYEDGGIS--
--CYEEGGIT--
--CYRGDGNT--

C-Y-X(2)-[DG]-G-X-[ST]

regular expression or pattern

PROSITE syntax: A-[BC]-X-D(2,5)-{EFG}-H

Means:

A         A
[BC]      B or C
X         anything
D(2,5)    2-5 D's
{EFG}     not E, F or G
H         H
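The syntax rules above map almost directly onto ordinary regular expressions. The following is a minimal sketch in Python, not the PROSITE implementation: the function name `prosite_to_regex` is made up for this example, the repeat count is written in the PROSITE notation X(2), and only the elements shown on the slide are handled (no sequence anchors or other extensions).

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Separate the residue part from an optional count like (2) or (2,5).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, count = m.group(1), m.group(2)
        if core in ("x", "X"):
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # any residue except these
        # [BC]-style alternatives and single letters are valid regex already.
        parts.append(core + ("{" + count + "}" if count else ""))
    return "".join(parts)

# The four aligned segments from the slide.
sequences = ["CYDEGGIS", "CYEDGGIS", "CYEEGGIT", "CYRGDGNT"]
regex = re.compile(prosite_to_regex("C-Y-X(2)-[DG]-G-X-[ST]"))
for seq in sequences:
    print(seq, bool(regex.search(seq)))
```

All four segments match the derived pattern, and a single substitution at a fixed position (for example the leading C) would make the match fail completely.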

Identifying patterns (2)

Patterns can contain:

- alternative residues
- flexible regions

Patterns cannot contain:

- mismatches (exact match or no match at all)
- gaps


PROSITE

– PROSITE: database of protein domains, families and functional sites

– 1319 patterns and 748 profiles/matrices (Oct 2007)

– Documentation is available for every pattern and profile

– Sequence search and keyword search are possible

– http://www.expasy.ch/prosite/


PROSITE example


PROSITE Patterns

Some patterns match very frequently in proteins, but a match does not mean the site is actually used. Post-translational modification sites are a typical example.

– ID ASN_GLYCOSYLATION; PATTERN.
– DE N-glycosylation site.
– PA N-{P}-[ST]-{P}.

You will get a warning:

Notice also in the PROSITE record the number of false positives and false negatives
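A small scan shows why such a short pattern matches so often. The sketch below writes N-{P}-[ST]-{P} as a Python regular expression; the function name and the example sequence are made up for illustration, and every hit is only a candidate site, not evidence of glycosylation.

```python
import re

# N-{P}-[ST]-{P} as a regular expression. The lookahead lets overlapping
# sites be reported, which a plain finditer scan would skip.
N_GLYC = re.compile(r"(?=(N[^P][ST][^P]))")

def find_sites(sequence: str) -> list[tuple[int, str]]:
    """Return (1-based position, matched residues) for every candidate site."""
    return [(m.start() + 1, m.group(1)) for m in N_GLYC.finditer(sequence)]

# Hypothetical sequence, made up for illustration only.
example = "MKNNTSAPNGSLNPTQ"
print(find_sites(example))   # [(3, 'NNTS'), (4, 'NTSA'), (9, 'NGSL')]
```

Three candidate sites in sixteen residues, while the N at position 13 is rejected because it is followed by Pro.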


Identifying patterns – fingerprints

Pattern 1 Pattern 2 Pattern 3 Pattern 4

Fingerprint or signature

Matrix Matrix Matrix Matrix

Databases: PRINTS, BLOCKS


Profiles

Many motifs cannot be easily defined using simple regular expressions.

Such motifs can be defined using a profile, which is a numerical representation of a MSA. For each position in the MSA, each of the 20 amino acids is given a score depending on how likely it is to occur.

Profiles provide a sensitive means of detecting distant sequence relationships.


The profile represents a specific pattern found for a set of proteins.

It is then used to search a target sequence for matches to the profile.
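How a profile is built and then used for searching can be sketched as follows. This is a simplified illustration, not the scoring scheme of PROSITE or any other database: it assumes an ungapped alignment, a uniform background frequency and a pseudocount of 1, and the function names and the target sequence are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """Log-odds score per column and residue, against a uniform background."""
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for column in zip(*alignment):           # one tuple of residues per column
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        profile.append(scores)
    return profile

def best_match(profile, sequence):
    """Slide the profile along the sequence; return (score, 0-based start)."""
    width = len(profile)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(col[aa] for col, aa in zip(profile, window))
        hits.append((score, start))
    return max(hits)

# Aligned segments from the pattern slide; the target is made up.
alignment = ["CYDEGGIS", "CYEDGGIS", "CYEEGGIT", "CYRGDGNT"]
score, start = best_match(build_profile(alignment), "MKACYKDGGLTPW")
print(start, round(score, 2))
```

Unlike a pattern, the profile still finds the segment CYKDGGLT even though K and L never occur at those positions in the alignment; the mismatches only lower the score.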


Identifying patterns – full domain alignment

Pattern 1 Pattern 2 Pattern 3 Pattern 4

Fingerprint or signature + gaps and insertions

position-specific matrix + gaps and insertions

Databases:
– Profiles (alignment manually corrected)
– Pfam (automatically aligned)


Protein domains - definitions

• Group of residues with high contact density, number of contacts within domains is higher than the number of contacts between domains.

• A stable unit of protein structure that can fold autonomously

• A rigid body linked to other domains by flexible linkers

• A portion of the protein that can be active on its own if you remove it from the rest of the protein.


Protein Domains

• Domains can be 25 to 500 amino acids long; most are less than 200 amino acids

• The average protein contains 2 or 3 domains

• The same or similar domains are found in different proteins.
“Nature is a ‘tinkerer’ and not an inventor” (Jacob, 1977).
“Nature is smart but lazy”

• Usually, each domain plays a specific role in the function of the protein.


Protein Domains - an alphabet of functional modules

WD40, WW, SH2, SH3, 14-3-3, ANK3, ARM, BH1, C1, C2, CARD, EH, EVH, FYVE, PDZ, Death, DED, EFH, PH, PTB, SAM

From: Bioinformatics.ca


Domain Linkers

Domain linkers link the protein domains together and have been found to contain an amino acid signature that is distinct from the structurally compact domains.

Average linker size 8-9 amino acids

Linkers are flexible and susceptible to protease attack. Amino acids such as Pro, Ser, Gly and Thr (and, less frequently, Ala, Asn and Asp) are often found in linker sequences.


Protein Domain Databases

Even though the structure of a domain is not always known it is still possible to define the domain boundaries from sequence alone

Many of the common domains have already been defined in domain databases

Advantages:
• Pre-annotated domains
• Easy interpretation of domain structure

Problem:
• Not trivial to define domain boundaries unambiguously

The challenge of family analysis

T. Attwood


Domain databases

Database           Generation   Number of entries
PfamA              manual       7503 families
PfamB              automatic    >140,000 families
Prints             manual       11,435 motifs, 1900 fingerprints
Prosite Profiles   manual       577 profiles
Blocks             automatic    28,337 blocks, 5733 groups
SMART              manual       667 HMMs
ProDom             automatic    501,917 domain families

(December 2005)


PRINTS database

• Most protein families are characterised not by one motif but by several conserved motifs; together these form a so-called fingerprint.

• Use all motifs of a protein family to build a diagnostic signature for this family

• Fingerprints are the basis of the PRINTS database, and are stored in the form of aligned motifs

• Input about protein families is done manually

• True members match all elements of the fingerprint in order; subfamily members may match only part of the fingerprint
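The idea of matching fingerprint elements in order can be illustrated with a small sketch. PRINTS itself stores aligned motifs and scores sequences against them; here each motif is replaced by a regular expression purely as a stand-in, and the motifs, sequences and function name are all hypothetical.

```python
import re

def match_fingerprint(sequence, motifs):
    """Return the start of each motif found in order, or None where it is missing."""
    positions = []
    search_from = 0
    for motif in motifs:
        hit = re.compile(motif).search(sequence, search_from)
        if hit:
            positions.append(hit.start())
            search_from = hit.end()     # later motifs must lie downstream
        else:
            positions.append(None)
    return positions

# Hypothetical three-motif fingerprint and two made-up sequences.
fingerprint = ["CY..[DG]G", "W[ST]P", "HH.K"]
print(match_fingerprint("AACYEEGGLLLWSPAAAHHQK", fingerprint))  # [2, 11, 17]
print(match_fingerprint("AACYEEGGLLLAAAHHQK", fingerprint))     # [2, None, 14]
```

The first sequence matches all three elements in order, as expected for a true family member; the second misses the middle motif, the kind of partial match that may indicate a subfamily member.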

http://ip30.eti.uva.nl/ember-demo/ch3


PRINTS


BLOCKS database

Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins.

The blocks for the BLOCKS database are made automatically.

To ensure complete coverage it is recommended that both the PRINTS and the BLOCKS database be searched


Pfam

Pfam (Protein families) is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families.

For each family in Pfam you can:

• Look at multiple alignments
• View the domain organisation of proteins
• Examine species distribution
• Follow links to other databases
• View known protein structures


Pfam

Pfam-A entries are manually curated - 9318 families (July 2007)

Pfam-B entries are automatically generated clusters of sequences not covered by Pfam-A: >140,000 clusters

iPfam is a resource that describes domain-domain interactions that are observed in known structures - 3019 interactions


SMART

SMART - Simple Modular Architecture Research Tool

Specializes in:

1) signalling domains
2) nuclear domains
3) extracellular domains

Current version 5.0: Number of SMART HMMs: 669


Bacteriorhodopsin

Human serine protease


Structure Databases & Structural classification

PDB – Brookhaven Databank
http://www.rcsb.org/pdb/

CDD – Conserved Domain Database
http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml

MSD – Macromolecular Structure Database
http://www.ebi.ac.uk/msd/index.html

CATH – Protein Structure Classification
http://www.biochem.ucl.ac.uk/bsm/cath/

SCOP – Structural Classification of Proteins
http://scop.mrc-lmb.cam.ac.uk/scop/

Adapted from: Bioinformatics.ca


Limitations of domain databases

• Patterns not present for all families of proteins

• Multiple sequence alignment to define patterns could be inaccurate due to an automatic alignment

• Low number of sequences from different species could result in inaccurate patterns


Integrating Pattern databases

InterPro - Integrated Documentation Resource of Protein Families, Domains and Functional Sites.

InterPro is a database of protein families, domains and functional sites in which identifiable features found in known proteins can be applied to unknown protein sequences.

The aim is to provide a one-stop-shop for protein family diagnostics


InterPro

Member Databases

• Prosite (regular expressions and profiles)

• Pfam, SMART, TIGRFAMs, PIRSF, PANTHER, Gene3D and SUPERFAMILY (hidden Markov models, HMMs)

• PRINTS (groups of aligned, un-weighted motifs)

• ProDom (uses cluster analysis to group sequences)

Release 16.1: 14768 entries (Oct 2007)

Types of entries: Family, Domain, Repeat, PTM, Binding Site, Active Site


Summary patterns & domains

• Many different protein signature databases exist (from small patterns to alignments to complex HMMs)

• The databases have different strengths and weaknesses. Some databases can be better for your sequence than others

• Therefore: best to combine methods, preferably in an integrated database

• The quality of a database/server is best tested with a sequence you know very well

• Always do control experiments: never trust a server



Part 1:

– Patterns
– Profiles
– Protein Domains
– Protein Domain Databases

Part 2:

Signals in Proteins:
– Hydropathy Plots
– Transmembrane helices
– Signal Peptides
– Repeats
– Coiled Coils


Hydropathy plots

Hydropathy plots are designed to display the distribution of polar and apolar residues along a protein sequence.

Hydrophobicity scales are based on experimental evidence indicating hydrophobic/hydrophilic properties of each amino acid

Hydropathy plots are generally most useful in predicting transmembrane segments and N-terminal secretion signal sequences.


Hydropathy scales

A positive value indicates local hydrophobicity and a negative value suggests a water-exposed region on the face of a protein (Kyte-Doolittle scale).

Sliding Window Approach

Sum the amino acid hydrophobicity values in a given window

Plot the average value in the middle of the window

I L I K E I R
4.50 + 3.80 + 4.50 - 3.90 - 3.50 + 4.50 - 4.50 = 5.40 => 5.40/7 = 0.77

Move to the next position in the sequence

L I K E I R Q
3.80 + 4.50 - 3.90 - 3.50 + 4.50 - 4.50 - 3.50 = -2.60 => -2.60/7 = -0.37

The window size can be changed.
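The sliding window calculation can be reproduced with a few lines of Python. This is a minimal sketch using the Kyte-Doolittle values; the function name is made up for this example, and the profile is reported only for positions where a full window fits.

```python
# Kyte-Doolittle hydropathy values per amino acid.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(sequence, window=7):
    """Average hydropathy of each full window, paired with its 1-based centre."""
    half = window // 2
    profile = []
    for start in range(len(sequence) - window + 1):
        values = [KYTE_DOOLITTLE[aa] for aa in sequence[start:start + window]]
        profile.append((start + half + 1, sum(values) / window))
    return profile

# The two windows worked out on the slide: ILIKEIR and LIKEIRQ.
for centre, score in hydropathy_profile("ILIKEIRQ"):
    print(centre, round(score, 2))   # 4 0.77, then 5 -0.37
```

Changing the `window` argument gives smoother (larger window) or noisier (smaller window) plots.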

J. Leunissen


[Figure: Hydrophobicity plot. Vertical axis: score (hydrophobic +, hydrophilic -). Horizontal axis: protein sequence from NH2 to COOH. Annotations: interior residues, exterior residues.]

From: Bioinformatics.ca


Transmembrane Helices

Transmembrane proteins are integral membrane proteins that interact extensively with the membrane lipids.

Nearly all known integral membrane proteins span the lipid bilayer

Hydropathy analysis can be used to locate possible transmembrane segments

The main signal is a stretch of hydrophobic and helix-loving amino acids

A window of about 19 residues is generally optimal for recognizing the long hydrophobic stretches that typify transmembrane segments.
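A crude version of this analysis is sketched below, with a window of 19 residues on the Kyte-Doolittle scale. The cut-off of 1.6 is a commonly quoted value for this scale and is an assumption here, not something stated on the slides; the function name and the test sequence are made up, and real predictors use considerably more information.

```python
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def candidate_tm_segments(sequence, window=19, threshold=1.6):
    """Merge overlapping windows whose mean hydropathy exceeds the threshold.

    Returns a list of (start, end) pairs, 0-based with exclusive end.
    """
    segments = []
    for start in range(len(sequence) - window + 1):
        mean = sum(KD[aa] for aa in sequence[start:start + window]) / window
        if mean > threshold:
            end = start + window
            if segments and start <= segments[-1][1]:
                segments[-1] = (segments[-1][0], end)   # extend previous segment
            else:
                segments.append((start, end))
    return segments

# Made-up sequence: 20 hydrophobic residues flanked by charged residues.
example = "KRDEKRDEKR" + "LLIVALLAVILLAVGLLIAL" + "KRDEKRDEKR"
print(candidate_tm_segments(example))
```

The reported segment is wider than the hydrophobic stretch itself, because every window that passes the cut-off contributes all 19 of its residues; the boundaries are therefore only approximate.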


Transmembrane Helices (2)

In an α-helix the rotation is 100 degrees per amino acid

The rise per amino acid is 1.5 Å

To span a membrane of 30 Å, approx. 30/1.5 = 20 amino acids are needed
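The same arithmetic in a short sketch, using only the helix parameters given above (the function name is made up for this example):

```python
import math

RISE_PER_RESIDUE = 1.5       # Å per amino acid, as given on the slide
ROTATION_PER_RESIDUE = 100   # degrees per amino acid, as given on the slide

def residues_to_span(thickness_angstrom):
    """Smallest number of helical residues that spans the given thickness."""
    return math.ceil(thickness_angstrom / RISE_PER_RESIDUE)

n = residues_to_span(30)
turns = n * ROTATION_PER_RESIDUE / 360
print(n, round(turns, 1))   # 20 residues, about 5.6 helical turns
```

This agrees with the window of about 19 residues used in hydropathy analysis of transmembrane segments.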


Transmembrane Helix Prediction - Rhodopsin


Signal Peptides

Proteins have intrinsic signals that govern their transport and localization in the cell (nucleus, ER, mitochondria, chloroplasts)

Specific amino acid sequences determine whether a protein will pass through a membrane into a particular organelle, become integrated into the membrane, or be exported out of the cell.


Signal Peptides (2)

The common structure of signal peptides from various proteins is described as:

• a positively charged (N-terminal) n-region

• followed by a hydrophobic h-region (which can adopt an α-helical conformation in a hydrophobic environment)

• and a neutral but polar c-region (cleavage region; the signal sequence is cleaved off here after delivering the protein at the right site).

Signal Peptides (3)

                         Eukaryotes                   Prokaryotes
                                                      Gram-negative                       Gram-positive

Total length (average)   22.6 aa                      25.1 aa                             32.0 aa

n-regions                only slightly Arg-rich       Lys+Arg-rich (both)

h-regions                short, very hydrophobic      slightly longer, less hydrophobic   very long, less hydrophobic

c-regions                short, no pattern            short, Ser+Ala-rich                 longer, Pro+Thr-rich

-3,-1 positions          small and neutral residues   almost exclusively Ala (both)

+1 to +5 region          no pattern                   rich in Ala, Asp/Glu, and Ser/Thr (both)

Marlinda Hupkes 2004


Repeats in proteins

• A repeat is any piece of protein sequence that appears multiple times within a single protein

• Length of the repeat can vary from 1 (single amino acid repeat) up to 240 amino acids

• Repeats are rarer in coding regions than in non-coding regions

• Repeats occur in 14 % of all proteins

• Eukaryotic proteins have three times more internal repeats than prokaryotic proteins

• The three kingdoms of life have very few repeats in common

Repeats, examples

• Gln repeat in huntingtin (Huntington’s disease)
  (CAG)n = a polyglutamine tract (polyQ)
  Up to 35 repeats is not pathological, > 35 repeats is pathological

• Bacterial transferase hexapeptide (three repeats)
• Leucine-rich repeats (LRRs), 20-29 aa motif
• WD-repeat
• Ankyrin-repeat
• etc.
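The simplest kind of repeat, a run of one amino acid such as the polyQ tract, can be found with a back-referencing regular expression. The sketch below is illustrative only: the function name, the minimum run length of 5 and the example sequence are assumptions, and the threshold of 35 is the one quoted above.

```python
import re

PATHOLOGICAL_ABOVE = 35   # polyQ threshold quoted on the slide

def single_residue_repeats(sequence, min_length=5):
    """Return (residue, 0-based start, length) for each single-residue run."""
    pattern = re.compile(r"(.)\1{%d,}" % (min_length - 1))
    return [(m.group(1), m.start(), len(m.group(0)))
            for m in pattern.finditer(sequence)]

# Made-up sequence with a 40-residue glutamine tract.
example = "MAT" + "Q" * 40 + "PPG"
for residue, start, length in single_residue_repeats(example):
    expanded = residue == "Q" and length > PATHOLOGICAL_ABOVE
    print(residue, start, length, expanded)   # Q 3 40 True
```

Longer repeat units such as leucine-rich or ankyrin repeats are not exact copies, so they need profile-based methods rather than a simple regular expression.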


Coiled-Coils

The coiled-coil is a ubiquitous protein motif that is often used to control oligomerisation.

It is found in many types of proteins, including transcription factors, viral fusion peptides, and certain tRNA synthetases.

Examples:
– Very long coils in tropomyosin and intermediate filaments
– GCN4: gene regulation in yeast; leucine zipper


Coiled-Coils

Left-handed spiral of right-handed helices

May be parallel or anti-parallel

[Figure: two coiled-coil diagrams with N and C termini labelled, one parallel and one anti-parallel]

David Gossard


Coiled-Coils – Heptad repeat

A seven-residue pattern, abcdefg, in which the a and d residues (core positions) are generally hydrophobic.

[Figure: helical wheel of two helices with heptad positions a-g labelled]

Residues at “a” and “d” form the hydrophobic core

Residues at “e” and “g” form ion pairs
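The heptad rule suggests a very simple test: assign positions a-g along a sequence and check how often the a and d positions are hydrophobic. The sketch below does this for every possible register. It is a toy illustration, not a published coiled-coil predictor; the set of residues counted as hydrophobic, the function names and the idealised sequences are all assumptions.

```python
HYDROPHOBIC = set("AILMFVWY")   # assumption: one of several possible definitions
HEPTAD = "abcdefg"

def core_hydrophobic_fraction(sequence, offset=0):
    """Fraction of a/d positions that are hydrophobic.

    offset is the heptad index (0 = 'a') assigned to the first residue.
    """
    core = [aa for i, aa in enumerate(sequence)
            if HEPTAD[(i + offset) % 7] in "ad"]
    if not core:
        return 0.0
    return sum(aa in HYDROPHOBIC for aa in core) / len(core)

def best_register(sequence):
    """Offset (0-6) that places the most hydrophobic residues at a and d."""
    return max(range(7), key=lambda off: core_hydrophobic_fraction(sequence, off))

# Idealised, made-up heptad repeat: L at a and d, charged residues elsewhere.
ideal = "LKELEDK" * 4
print(best_register(ideal), core_hydrophobic_fraction(ideal))   # 0 1.0
```

For a sequence that starts at a different heptad position, `best_register` returns the corresponding offset, for example 6 when the first residue sits at position g.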

David Gossard


Assignment (see also paper version)

Make a report about the protein signal of your choice. Questions which should be answered in this report are:

• Describe the protein signal you want to detect.

• Describe existing prediction method(s), their prediction quality and their underlying theory.

• Describe the available webservers for detecting this protein signal, the quality of their predictions, their pros and cons, and anything else you find relevant.

• Give example output for a protein that is relevant to your protein signal, and explain this output.
