Protein functions prediction - EMBnet node Switzerland

Swiss Institute of Bioinformatics / Institut Suisse de Bioinformatique

LF-2004.08

Protein functions prediction

Introduction

Signal peptides
Transmembrane regions and topology
PTM (post-translational modifications)
Low complexity and biased regions
Repeats
Coils
Secondary structure
Antigenic peptides
Domain/Motifs
Tools
The EMBOSS package

Different techniques

Algorithms
  Sliding window, Nearest Neighbor
  Patterns, regular expressions
  Weight matrices
  HMM, profiles
  Neural Networks
  Rules

Sliding window

THISISATESTSEQVENCETHATDISPLAYSTHESLIDINGWINDQW

Each window position gives one score: Score 1, Score 2, ..., Score n

Width or Size=11, Step=5

Results are usually displayed as a graph.

[Figure: example graph of the scores]
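The sliding-window idea can be sketched in a few lines of Python. The slide does not say which score is computed per window, so the Kyte-Doolittle hydropathy scale and the function name `sliding_window_scores` are assumptions chosen for illustration; only the sequence, the width (11) and the step (5) come from the slide.

```python
# Kyte-Doolittle hydropathy scale. Using this scale is an assumption:
# the slide does not name the score computed in each window.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def sliding_window_scores(seq, width=11, step=5, scale=KYTE_DOOLITTLE):
    """Return (1-based start, window, mean score) for every full window."""
    results = []
    for start in range(0, len(seq) - width + 1, step):
        window = seq[start:start + width]
        mean = sum(scale[aa] for aa in window) / width
        results.append((start + 1, window, mean))
    return results


# Sequence, width and step as shown on the slide.
SEQ = "THISISATESTSEQVENCETHATDISPLAYSTHESLIDINGWINDQW"

for start, window, mean in sliding_window_scores(SEQ):
    print(f"{start:>3}  {window}  {mean:+.2f}")
```

Plotting the mean score against the window start position gives the kind of graph the slide refers to. Trailing residues that do not fill a complete window are ignored in this sketch.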

Patterns / regular expression

Pattern: <A-x-[ST](2)-x(0,1)-{V}
Regexp: ^A.[ST]{2}.?[^V]
Text: The sequence must start with an alanine, followed by any amino acid, followed by a serine or a threonine, two times, followed by any amino acid or nothing, followed by any amino acid except a valine.

Only the syntax differs…
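The correspondence between the two syntaxes can be made concrete with a small converter. This is a minimal sketch, not the PROSITE tooling: the function name `prosite_to_regex` is hypothetical, and only the elements used on the slide (plain residues, `x`, `[...]`, `{...}`, repeat counts, and the `<` / `>` anchors) are handled.

```python
import re

# One pattern element: x, a residue, [allowed] or {forbidden},
# optionally followed by a repeat count (n) or (n,m).
_ELEMENT = re.compile(
    r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?"
)


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":                  # any amino acid
            core = "."
        elif core.startswith("{"):       # any amino acid except these
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


regex = prosite_to_regex("<A-x-[ST](2)-x(0,1)-{V}")
print(regex)
print(bool(re.search(regex, "AGSTKL")), bool(re.search(regex, "AGSTV")))
```

The converter writes `x(0,1)` as `.{0,1}`, which is equivalent to the `.?` used on the slide.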

Weight matrices (PSSM)

HMM / profiles

Neural Networks

[Figures: general principle; example]

Signals found in proteins

N-ter
  exportation - secretion
  mitochondria
  chloroplast

internal
  NLS (nuclear localization signal)

C-ter
  GPI-anchor (Glycosyl Phosphatidyl Inositol)

other
  membrane anchors (see PTM)
  other unknown?

Signals detection tools

SignalP
MitoProt
ChloroP
Predotar
PSort
TargetP
Sigcleave (EMBOSS)
Phobius

Big-PI
DGPI

Transmembrane regions

Detection (signal peptide, hydropathy, helices)
Organisation (topology)

Transmembrane detection tools

TMHMM
TMPred
TopPred2
DAS
HMMTop
Tmap (EMBOSS)

Mixture of tools
  Phobius
  ConPred II

Post-translational modifications

Modification: target residues

Phosphorylation: S - T - Y
N-glycosylation: N
O-glycosylation: S - T - (HO)K
Acetylation, methylation: D - E - K
Sulfation: Y
Farnesylation, myristoylation, palmitoylation, geranylgeranylation, GPI-anchor: C - Nter - Cter
Ubiquitination and family: K - Nter
Inteins (protein splicing)

Pre-translational
  Selenoprotein: C

PTM detection

Pattern prediction (PROSITE)
  Short or weak signal
  Frequent hit producer

Best method is experimental
  MS/MS detection

Most methods use "rules" that join pattern detection and knowledge to predict sites.

NetOGlyc - Prediction of mucin-type O-glycosylation sites in mammalian proteins
DictyOGlyc - Prediction of GlcNAc O-glycosylation sites in Dictyostelium
YinOYang - O-beta-GlcNAc attachment sites in eukaryotic protein sequences
NetPhos - Prediction of Ser, Thr and Tyr phosphorylation sites in eukaryotic proteins
NMT - Prediction of N-terminal N-myristoylation
Sulfinator - Prediction of tyrosine sulfation sites
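Why short patterns are frequent hit producers can be seen with the PROSITE N-glycosylation pattern N-{P}-[ST]-{P}. The sketch below scans a sequence for it; the function name `find_n_glycosylation_sites` and the test sequence are made up for illustration, and a hit only marks a candidate site, not an experimentally confirmed one.

```python
import re

# PROSITE N-glycosylation pattern N-{P}-[ST]-{P} written as a regular
# expression. The lookahead lets overlapping candidate sites be reported.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_n_glycosylation_sites(seq):
    """Return (1-based position of N, matched 4-mer) for each candidate."""
    return [(m.start() + 1, m.group(1)) for m in N_GLYCOSYLATION.finditer(seq)]


# Hypothetical sequence, for illustration only.
print(find_n_glycosylation_sites("MNGTAANPSANVTQ"))
# prints [(2, 'NGTA'), (11, 'NVTQ')]
```

A four-residue pattern with two loosely constrained positions matches by chance in many proteins, which is why such predictions are usually combined with additional knowledge or checked experimentally.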

Low complexity regions

repeats
compositional bias
PEST

Low complexity / Repeats

Tool: approach

DUST (DNA) / SEG: de novo detection
RepeatMasker (DNA): search collection
REP: search collection
REPRO, Radar: de novo detection
PEST, PESTFind: de novo detection

EMBOSS (DNA)
  einverted
  equicktandem
  etandem
  palindrome

EMBOSS (protein)
  oddcomp
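De novo detection of compositional bias can be illustrated with a Shannon-entropy scan over sliding windows. This is a simplified sketch and not the algorithm implemented by SEG or DUST; the window width, the threshold and the function names are assumptions picked for the example.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    n = len(window)
    counts = Counter(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_windows(seq, width=12, threshold=2.2):
    """Return (1-based start, entropy) for windows at or below threshold."""
    hits = []
    for start in range(len(seq) - width + 1):
        entropy = shannon_entropy(seq[start:start + width])
        if entropy <= threshold:
            hits.append((start + 1, entropy))
    return hits


# Hypothetical sequence with a poly-Q stretch, for illustration only.
SEQ = "MKTLVAGHWDE" + "Q" * 12 + "RSTPFYCNILV"
for start, entropy in low_complexity_windows(SEQ):
    print(f"{start:>3}  {entropy:.2f}")
```

Windows dominated by one or two residue types have low entropy and are flagged; a window of twelve different residues reaches about 3.58 bits and is not.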

Coils

Helix of helix
  coiled-coil
  Leu-zipper

Coils detection

Tool: technique

COILS: Weight matrices
Paircoil, Multicoil: Pairwise correlation
Marcoil: HMM
Pepcoil (EMBOSS): Weight matrices

Secondary structure

Structure to predict
  Alpha-helices
  Beta-sheets
  Turns
  Random coil

Tools
  Garnier (EMBOSS)
  PHD
  DSC
  PREDATOR
  NNSSP
  Jpred
  Jnet
  Many others

Antigenic peptide

Peptides binding to MHC
  class I: 8, 9, 10 mers
  class II: 15 mers (3+9+3)
  Depend highly on MHC type

Use of experimental knowledge
  Databases of known peptides

Tools
  SYFPEITHI
  HLA_Bind (BIMAS)
  MAPPP (combined expert)
  Antigenic (EMBOSS)
  Many more

Prediction of proteasome cleavage sites
  NetChop
  PaProc

Domain / Motif

All the protein domain descriptors
  PROSITE
  PFAM
  SMART
  PRODOM
  BLOCKS
  PRINTS
  TIGRfam
  …

Federation: InterPro

Many techniques
  Patterns, Regexp
  PSSM (PSI-BLAST)
  Profiles
  HMM

Other Tools

You can find some of them on our servers
  www.ch.embnet.org

Or on the ExPASy server
  www.expasy.org/tools

Or ask Google!!
  www.google.com

EMBOSS: European Molecular Biology Open Software Suite

How to use EMBOSS/Jemboss at SIB

Free Open Source (for most Unix platforms)
GCG successor (compatible with GCG file format)
More than 150 programs (ver. 2.9.0)
Easy to install locally
  but no interface, requires local databases
Unix command-line only

Interfaces
  Jemboss, www2gcg, w2h, wemboss… (with account)
  Pise, EMBOSS-GUI, SRSWWW (no account)
  Staden, Kaptain, CoLiMate, Jemboss (local)

Access: www.emboss.org or emboss.sourceforge.net

Some details

Format USA
  'asis'::Sequence[start:end:reverse]
  Format::'@'ListFile[start:end:reverse]
  Format::'list':ListFile[start:end:reverse]
  Format::Database:Entry[start:end:reverse]
  Format::Database-SearchField:Word[start:end:reverse]
  Format::File:Entry[start:end:reverse]
  Format::File:SearchField:Word[start:end:reverse]
  Format::Program Program-parameters '|'[start:end:reverse]

  Example: fasta::Swissprot:UBP5_HUMAN[200:300]

Databases
  Any can be added, use showdb to display the available databases
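To show how the pieces of a USA fit together, here is a minimal Python sketch that splits the simplest form (optional format, a database or file name, an optional entry, and an optional range) into its parts. It is not the EMBOSS parser: the names `SimpleUSA` and `parse_simple_usa` are hypothetical, the list-file, search-field and program forms are not handled, and the reverse flag is assumed to be written as `:r`.

```python
import re
from typing import NamedTuple, Optional


class SimpleUSA(NamedTuple):
    fmt: Optional[str]      # e.g. "fasta"; None if not given
    source: str             # database or file name
    entry: Optional[str]    # entry name; None if not given
    start: Optional[int]
    end: Optional[int]
    reverse: bool


# Covers only: [format::]source[:entry][[start:end[:r]]]
_USA_RE = re.compile(
    r"^(?:(?P<fmt>\w+)::)?"
    r"(?P<source>[^:\[\]]+)"
    r"(?::(?P<entry>[^:\[\]]+))?"
    r"(?:\[(?P<start>-?\d+):(?P<end>-?\d+)(?P<rev>:r)?\])?$"
)


def parse_simple_usa(usa):
    """Split a simple USA string into its components."""
    m = _USA_RE.match(usa)
    if m is None:
        raise ValueError(f"not a simple USA: {usa!r}")
    start = int(m["start"]) if m["start"] is not None else None
    end = int(m["end"]) if m["end"] is not None else None
    return SimpleUSA(m["fmt"], m["source"], m["entry"], start, end,
                     m["rev"] is not None)


# The example from the slide.
print(parse_simple_usa("fasta::Swissprot:UBP5_HUMAN[200:300]"))
```

For the slide's example this yields format `fasta`, database `Swissprot`, entry `UBP5_HUMAN` and the range 200 to 300.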

Databases

showdb
Displays information on the currently available databases

# Name         Type ID  Qry All Comment
# ====         ==== ==  === === =======
ipr_fetch      P    OK  OK  OK  InterPro current by fetch
ipi_fetch      P    OK  OK  OK  IPI current by fetch
refseq_fetch   P    OK  OK  OK  refseq current by fetch
repbase_fetch  P    OK  OK  OK  repbase current by fetch
swiss_fetch    P    OK  OK  OK  SwissProt current by fetch
swissprot      P    OK  OK  OK  SWISSPROT sequences
trembl         P    OK  OK  OK  TREMBL sequences
trembl_fetch   P    OK  OK  OK  trembl current by fetch
tremblnew      P    OK  OK  OK  TREMBL New sequences
ug_fetch       P    OK  OK  OK  Unigene by fetch
embl           N    OK  OK  OK  EMBL release
emhum          N    OK  OK  OK  EMBL release, Human section by emboss index
emrod          N    OK  OK  OK  EMBL release, Rodent section by emboss index
emvrt          N    OK  OK  OK  EMBL release, Vertebrate (nonhuman, nonrodent)

seqret (seqretall, seqretset, seqretsplit)
entret (for complete untouched entry, e.g., for unigene, interpro, swissprot…)
Possible to define your own « .embossrc » file
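A database entry is typically fetched by passing a USA to seqret. The sketch below only assembles the command line from Python; actually running it assumes a local EMBOSS installation in which a database named swissprot is configured, as in the showdb listing. The helper names and the output file name are hypothetical.

```python
import shutil
import subprocess


def build_seqret_command(usa, out_usa):
    """Assemble a seqret call: read the sequence at `usa`, write to `out_usa`."""
    # -auto turns off interactive prompting.
    return ["seqret", "-sequence", usa, "-outseq", out_usa, "-auto"]


def run_seqret(usa, out_usa):
    """Run seqret if it is installed; raise otherwise."""
    if shutil.which("seqret") is None:
        raise RuntimeError("seqret not found: EMBOSS is not installed locally")
    subprocess.run(build_seqret_command(usa, out_usa), check=True)


# Residues 200-300 of UBP5_HUMAN, written in FASTA format.
# "ubp5_200_300.fasta" is a made-up output file name.
cmd = build_seqret_command("swissprot:UBP5_HUMAN[200:300]",
                           "fasta::ubp5_200_300.fasta")
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with the square brackets in the USA.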

Some tools for DNA

redata    Search REBASE for enzyme name, references, suppliers etc
remap     Display a sequence with restriction cut sites, translation etc
restover  Finds restriction enzymes that produce a specific overhang
restrict  Finds restriction enzyme cleavage sites
showseq   Display a sequence with features, translation etc
silent    Silent mutation restriction enzyme scan
cirdna    Draws circular maps of DNA constructs
lindna    Draws linear maps of DNA constructs
revseq    Reverse and complement a sequence
…
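As an illustration of what the simplest of these tools computes, the reverse-complement operation performed by revseq can be sketched in Python. This is a stand-in written for this example, not EMBOSS code; the function name is hypothetical and only the four unambiguous bases are handled.

```python
# Complement table for the four unambiguous bases, in both cases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the reverse complement of a DNA sequence."""
    return dna.translate(_COMPLEMENT)[::-1]


print(reverse_complement("GACACCATCG"))
# prints CGATGGTGTC
```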

Example: remap

ECLAC E.coli lactose operon with lacI, lacZ, lacY and lacA genes.

[Restriction map: cut sites for TaqI, Hin6I, HhaI, Bsc4I, Bsu6I, BssKI, AciI and BsiSI are marked above the forward strand and below the reverse strand]

GACACCATCGAATGGCGCAAAACCTTTCGCGGTATGGCATGATAGCGCCCGGAAGAGAGT
        10        20        30        40        50        60
----:----|----:----|----:----|----:----|----:----|----:----|
CTGTGGTAGCTTACCGCGTTTTGGAAAGCGCCATACCGTACTATCGCGGGCCTTCTCTCA

# Enzymes that cut  Frequency  Isoschizomers
AciI                1
Bsc4I               1
BsiSI               1
BssKI               1
Bsu6I               1
HhaI                2
Hin6I               2          HinP1I,HspAI
TaqI                1

# Enzymes that do not cut
AclI BamHI BceAI Bse1I BshI ClaI EcoRI EcoRII Hin4I HindII HindIII HpyCH4IV KpnI NotI
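The counts in the remap table can be checked for a few well-known enzymes with a simple string search. This sketch is not how remap works internally: it searches only the strand shown (sufficient for palindromic sites), reports where the recognition site starts rather than the exact cut position, and uses a small hand-written table instead of REBASE.

```python
# Recognition sites of a few common enzymes (all palindromic).
RECOGNITION_SITES = {
    "TaqI": "TCGA",
    "HhaI": "GCGC",
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
}

# The 60 bases of ECLAC shown in the remap example.
ECLAC_60 = "GACACCATCGAATGGCGCAAAACCTTTCGCGGTATGGCATGATAGCGCCCGGAAGAGAGT"


def find_sites(seq, site):
    """Return 1-based start positions of every occurrence of `site`."""
    positions = []
    pos = seq.find(site)
    while pos != -1:
        positions.append(pos + 1)
        pos = seq.find(site, pos + 1)   # allow overlapping occurrences
    return positions


for enzyme, site in RECOGNITION_SITES.items():
    print(f"{enzyme:<8}{site:<8}{find_sites(ECLAC_60, site)}")
```

TaqI is found once and HhaI twice, while EcoRI, BamHI and HindIII are not found, in agreement with the two tables above.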

Example: cirdna

File: ../../data/data.cirp

  Start 1001
  End 4270

  group
  label
  Block 1011 1362 3
  ex1
  endlabel
  label
  Tick 1610 8
  EcoR1
  endlabel
  label
  Block 1647 1815 1
  endlabel
  label
  Tick 2459 8
  BamH1
  endlabel
  label
  Block 4139 4258 3
  ex2
  endlabel
  endgroup

  group
  label
  Range 2541 2812 [ ] 5
  Alu
  endlabel
  label
  Range 3322 3497 > < 5
  MER13
  endlabel
  endgroup

Example: plotorf

EMBOSS format input/output

UFO (Universal Feature Object)
  gff, swissprot, embl, pir, nbrf (with or without sequence)

Alignments
  Multiple and pairwise, many flavors (FASTA, MSF, SRS…)

Reports
  Feature (UFO), SRS, motif, seqtable, excel, diffseq, listfile (USA), etc…

Sequences (compatible with USA)
  Many!!! E.g., fasta, clustal, gcg, paup, gff, embl, swissprot, acedb, abi, etc…

Web interfaces

PISE (Pasteur Institute Software Environment) http://www-alt.pasteur.fr/~letondal/Pise/

wEMBOSS (Belgium & Argentina) (not yet at SIB)

http://www.wemboss.org

Pise: a tool to generate Web interfaces for Molecular Biology programs

http://emboss.ch.embnet.org/Pise

http://www.wemboss.org

Launch Jemboss http://emboss.ch.embnet.org/Jemboss

Launch Jemboss

First time only…

Each time…

Jemboss windows

Jemboss windows other systems

Summary

Anonymous web access through Pise
Registered access through Jemboss
Registered access through command-line (requires UNIX skills)

Please report problems!

Exercises

DEA Exercises: web-based sequence analysis

The goal of this exercise is to use web-based tools for protein sequence analysis.

a) Take this TrEMBL sequence (Q9X252) and try a BLAST against swissprot with the complete protein or with the first 70 residues. Explain the difference. Use TMPred, SignalP, and COILS to help you.

b) Pass this sequence through PFSCAN and search all databases. Compare with this command on ludwig-sun1/2: hits -b "prf pat pfam" tr:Q9X252

c) Use the different profile, motif, and pattern databases to get more information about the domain(s) you found.

d) How do you evaluate the PRINTS tropomyosin annotation in this TrEMBL entry (Q9WZH0)?

List of useful links:

basic BLAST or advanced BLAST or PSI-BLAST

TMPred prediction tool for transmembrane regions (or TMHMM)

COILS prediction tool for coiled-coil regions

SignalP prediction tool for signal-peptide cleavage site

Profile, domain, motif databases and search sites:

PFSCAN

InterPro (Pfam, PRINTS, PROSITE, SMART)

HITS