PIR Seminar, University of Georgetown, Washington DC, 12.11.2003

Information integration for Swiss-Prot annotation

Anne-Lise Veuthey, Swiss Institute of Bioinformatics

The simplified story of a Swiss-Prot entry

cDNAs, genomes, …

EMBLnew EMBL

TrEMBLnew TrEMBL

SWISS-PROT

« Automated »
• Redundancy check (merge)
• Family attribution (InterPro)
• Annotation (computer)

« Manual »
• Redundancy (merge, conflicts)
• Annotation (manual)
• SWISS-PROT tools (macros…)
• SWISS-PROT documentation
• Medline
• Databases (MIM, MGD….)
• Brainstorming

Once in SWISS-PROT, the entry is no longer in TrEMBL, but it remains in EMBL (archive)

CDS

CDS: proposed and submitted at EMBL by authors or by genome projects (can be experimentally proven or derived from gene prediction programs). TrEMBL neither translates DNA sequences nor uses gene prediction programs: it only takes the CDS proposed by the submitting authors in the EMBL entry.

Some data are not submitted to the public databases!
Delayed or cancelled…


Annotation projects

• Human Proteomics Initiative (HPI)

• HAMAP: microbial annotation

• Plant Proteome Annotation Project (PPAP)

• Fungi annotation

• Tox-Prot: annotation of toxin proteins


HPI Goals

• Annotation of all known human proteins;

• Annotation of all mammalian orthologs.

With a particular emphasis on:
– Alternative splicing;
– Polymorphisms;
– PTMs;
– Structural information.


http://www.expasy.org/sprot/hpi/hpi_stat.html


Swiss-Prot / TrEMBL Chromosome by chromosome

http://www.ebi.ac.uk/proteome/HUMAN/

Total number of entries (SP+Tr) per chromosome:
1: 1'640; 2: 1'044; 3: 879; 4: 630; 5: 745; 6: 969; 7: 752; 8: 549; 9: 617; 10: 594; 11: 976; 12: 852; 13: 275; 14: 508; 15: 469; 16: 706; 17: 538; 18: 229; 19: 1'121; 20: 554; 21: 179; 22: 405; X: 645; Y: 61

[Figure: bar chart of % in Swiss-Prot (y-axis, 0 to 100) by chromosome (x-axis, 1 to 22, X, Y); label: 49%]


Problem of information overflow

• 136 356 entries in Swiss-Prot
• But over 1 million sequences are waiting in TrEMBL for manual curation before being incorporated into Swiss-Prot
• The number of TrEMBL entries is increasing exponentially

Solutions

• Development of quality automated annotation (HAMAP)
• Improvement of manual annotation efficiency by:
– Bioinformatic tools integration (Anabelle)
– Text mining tools to help literature screening (BioMinT)


HAMAP

High-quality Automated and Manual Annotation of microbial Proteomes.

More than 150 complete genomes are available in public databases. Collectively they encode more than 350’000 protein sequences. Such a large number of sequences makes classical manual annotation an intractable task.

HAMAP handles annotation of:
• Microbial proteomes (eubacteria and archaea)
• Plastid proteomes (cyanelle and chloroplast)
• Soon extended to mitochondrial and viral proteomes


HAMAP goals

• Annotate automatically, with a high throughput but no decrease in quality, proteins from complete microbial genomes which either belong to a family or have no similarities
• Improve human-machine interaction for the manual annotation of remaining proteins
• Provide integrated information at the level of a microbial genome


HAMAP pipeline

125 genomes

914 families

965 profiles

360'000 proteins

32'000 hits


HAMAP modules

• Downstream interface: External communication (Operational). Build an interactive web site to present HAMAP data and allow retrieval of complete proteomes.

• GAM: Genome Awareness Module, HERBS (In development). Warn about missing proteins, potential problems with paralogs and inconsistent pathways.

• CAM: Complex Annotation Module (Prototype). Reliably annotate complex families.

• SAM: Sequence Annotation Module (Operational). Given a family rule and a set of entries, produce annotated entries based on the rule. Also annotate ORFans.

• FAM: Family Assignment Module (Operational). Maintain HAMAP profiles, assign new family members. Detect ORFans.

• Upstream interface: Cleanup & Redundancy (Operational). Fix errors in TrEMBLnew entries. Merge redundancy with Swiss-Prot.

FAM: The family assignment module

Similarity searches
• Profile searches against db of families
• Searches in PROSITE, Pfam, TIGRFAM, …
• BLAST against the protein db

Belongs to a family
• High sequence similarity
• Correct length
• Characteristic features
=> Family attribution => SAM module

May belong to a family
• Sub-threshold score
• Incorrect length
• Problems with features
=> Manual correction => Family attribution => SAM module

Other similarities: no similarities to known families, but
• Significant BLAST matches
• Matches in PROSITE, Pfam, …
=> New family/function? => Manual annotation

No similarities (except to ORFans in close species)
=> ORFan => SAM module


Family Assignment Module

Profile generation
• Profiles are automatically generated from the manually curated family alignments

Profile calibration
• Check for profile accuracy
• Set cutoff = lowest score of members
• No false positives allowed in Swiss-Prot

Run against TrEMBL and report results
• Identify trusted / member / weak matches
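The calibration step above can be sketched as follows. This is a minimal illustration, not the HAMAP implementation: the function names, the score values and the two-way split into "at or above cutoff" and "sub-threshold" are assumptions made for the example (the slide does not define how trusted, member and weak matches are separated).

```python
def calibrate_cutoff(member_scores, non_member_scores):
    """Return the lowest score among curated members as the cutoff.

    Raises if a known non-member would reach the cutoff, mirroring the
    "no false positives allowed" requirement (hypothetical check).
    """
    if not member_scores:
        raise ValueError("a family needs at least one curated member")
    cutoff = min(member_scores)
    false_positives = [s for s in non_member_scores if s >= cutoff]
    if false_positives:
        raise ValueError(
            f"{len(false_positives)} non-member score(s) reach the cutoff {cutoff}"
        )
    return cutoff


def triage(hit_scores, cutoff):
    """Split profile hits into those reaching the cutoff and sub-threshold ones."""
    above = {name: s for name, s in hit_scores.items() if s >= cutoff}
    below = {name: s for name, s in hit_scores.items() if s < cutoff}
    return above, below


# Illustrative scores only; accessions are placeholders.
cutoff = calibrate_cutoff([12.5, 9.8, 15.0], [4.0, 7.2])
above, below = triage({"Q1": 10.0, "Q2": 8.0}, cutoff)
print(cutoff, sorted(above), sorted(below))
```

Sub-threshold hits would then go to a curator, in line with the "May belong to a family" branch of the FAM flow.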


SAM : The sequence annotation module

FAM module => Similarity searches
• Match to a defined family => Family rule-based annotation
• No matches => Prediction program-based annotation of hypothetical proteins (ORFans)


Sequence Annotation Module

SAM annotates a protein entry based on
• A family rule and a sequence, or
• An ORFan sequence

• Conditions in family rules are fully handled
• External programs are called for computed features: ProfileScan, TMHMM, SignalP, REP, COILS, as instructed in the family rule
• Computed features are combined using exclusion rules
• Warnings are generated in case of problems


HAMAP family database

HAMAP families have the following properties:

• Single function, or function that can be deduced from taxonomy or presence/absence of metabolic pathways
• Family members (except divergent sequences) have more similarity to one another than to other proteins


HAMAP family database

All the annotation can be inferred from:
• The rules given in the family (including conditions and dependencies)
• The alignment of the new entry with the existing alignment in the family (for feature propagation)
• The entry taxonomy
• Additional information about metabolic pathways and number of membranes
• Results of ad hoc analysis tools


Computed features

• Signal type 1: SignalP neural network (von Heijne)
• Signal type 2 (lipoprotein): PROSITE rule; later, Pedro Gonnet's patoseq program ?
• Signal type 4 (pilin): PROSITE pattern
• Transmembrane regions: TMHMM (Krogh)
• Coiled coils: Modified COILS (Lupas)
• ATP/GTP binding sites: Walker A profile (to do)
• LPxTG cell-wall anchor: PROSITE profile
• Inteins (upstream of identification step): 3 PROSITE profiles: C+N-terminal junctions + homing endonuclease
• Repeats (ANK, LRR, TPR, WD, Kelch): REP (Bork & Andrade) profiles + program


Complex Annotation Module

Extension of SAM to allow:
• Hierarchical assignment
– ABC transporter profile => generic annotation
– + CysA subfamily profile => specific annotation
• Modular annotation
– Domain rules
– Metamotifs to express domain arrangements

Is now part of the Anabelle project
• Not limited to bacteria
• Also usable for semi-automated annotation


Genome Awareness Module

• Runs after a complete proteome has been automatically annotated
• Warns about missing proteins, potential problems with paralogs and inconsistent pathways
• Implemented in an expert system: HERBS project

HERBS: Hamap Expert Rule-Based System

• A Sibelius project in collaboration with INRIA, Grenoble.

• Expert system using JESS, a rule engine written in Java

• Description of metabolic pathways using directed acyclic graphs (DAGs):

[Figure: DAG fragment for pyrimidine synthesis, with nodes step4 and step5 and HAMAP families MF_01208, MF_00224, MF_00225, MF_01211]
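A pathway described as a DAG lends itself to a simple "missing protein" check. The sketch below is an illustration in Python, not the JESS rules used by HERBS. The step names, the family identifiers (FAM_A to FAM_D) and the rule "a step is covered if at least one of its families is present" are all hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pathway fragment: each step maps to the steps it depends on.
PATHWAY = {"step4": set(), "step5": {"step4"}}

# Hypothetical family identifiers attached to each step (placeholders only).
STEP_FAMILIES = {
    "step4": {"FAM_A", "FAM_B"},
    "step5": {"FAM_C", "FAM_D"},
}


def missing_steps(pathway, step_families, proteome_families):
    """List pathway steps, in pathway order, with no matching family in the proteome."""
    order = TopologicalSorter(pathway).static_order()
    return [step for step in order if not (step_families[step] & proteome_families)]


# A proteome in which only FAM_A was assigned lacks an enzyme for step5.
print(missing_steps(PATHWAY, STEP_FAMILIES, {"FAM_A"}))
```

A non-empty result would translate into a warning for the curator rather than an automatic correction.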


http://www.expasy.org/sprot/hamap/


Annotation of eukaryotic proteomes

Increase in complexity:
• Multi-domain proteins
• Molecular complexes
• Multigenic families
• Multiple subcellular locations
• Alternative splicing
• Post-translational modifications
• Pathways complexity

In conclusion, automated procedures of annotation are far less easy to implement

Swiss-Prot Entry Creation Flowchart

Step 1: Selecting a new protein to annotate
Step 2: Searching PubMed
Step 3: Reading papers
Step 4: Analysing sequence with bioinformatic tools
Step 5: Data integration
Step 6: Creation of a new/updated Swiss-Prot entry



Where is the computer helping?

• Step 3: Reading papers
• Step 4: Analysing sequence with bioinformatic tools
• Step 6: Creation of a new/updated Swiss-Prot entry


Anabelle

• Rule-based system for manual, automated and automatic annotation

• Handles proteins with a complex domain architecture as well as simple ones.

• Unlike HAMAP, it also annotates proteins which are not yet characterized, but which share defined domains.

• It consists of 3 modules


Anabelle

• Module 1
– Runs protein sequence analysis tools (PC, internal server, external servers)
– General Feature Format (GFF): www.sanger.ac.uk/Software/formats/GFF

• Module 2

• Module 3

AEGP_RAT TMHMM|2.0 Transmembrane 1157 1179 . . . Level 0 ; NterLocation "O" ; Category "TOPOLOGY"

AEGP_RAT ps_scan|v1.5 PS50068 230 268 10.938 . . Name "LDLRA_2" ; Level 0 ; RawScore 993 ; FeatureFrom 1 ; FeatureTo -1 ; Sequence "RCPLGHHHCQNKACVEPHQLCDGEDNCGDSSDEdpLICS" ; KnownFalsePos 1 ; InterProID "IPR002172" ; Category "DOMAIN"

AEGP_RAT SignalP-NN|v2.0|euk Signal 1 21 . . . Level 0 ; C-max "0.761,22,Y" ; Y-max "0.783,22,Y" ; S-max "0.992,12,Y" ; S-mean "0.934,Y" ; Category "TOPOLOGY"
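GFF lines such as the three shown for AEGP_RAT can be read with a few lines of code. The sketch below is only an illustration: it assumes the eight fixed GFF columns (sequence name, source, feature, start, end, score, strand, frame) followed by `tag value` attributes separated by semicolons. It splits on whitespace because the tabs of the original file are not preserved on the slide, so it would not handle feature names containing spaces. `GffFeature` and `parse_gff_line` are names invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class GffFeature:
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]
    strand: Optional[str]
    frame: Optional[str]
    attributes: Dict[str, str] = field(default_factory=dict)


def _none_if_dot(value):
    """GFF uses '.' for an empty column."""
    return None if value == "." else value


def parse_gff_line(line):
    """Parse one GFF line into a GffFeature (whitespace-separated columns assumed)."""
    parts = line.strip().split(None, 8)
    if len(parts) < 8:
        raise ValueError("expected at least 8 GFF columns")
    seqname, source, feature, start, end, score, strand, frame = parts[:8]
    attributes = {}
    if len(parts) == 9:
        for chunk in parts[8].split(";"):
            key, _, value = chunk.strip().partition(" ")
            if key:
                attributes[key] = value.strip().strip('"')
    return GffFeature(
        seqname, source, feature, int(start), int(end),
        None if score == "." else float(score),
        _none_if_dot(strand), _none_if_dot(frame), attributes,
    )


EXAMPLE = ('AEGP_RAT TMHMM|2.0 Transmembrane 1157 1179 . . . '
           'Level 0 ; NterLocation "O" ; Category "TOPOLOGY"')
tm = parse_gff_line(EXAMPLE)
print(tm.feature, tm.start, tm.end, tm.attributes["Category"])
```

Having every tool report in the same record shape is what allows Module 2 to pre-select and visualize results from different programs side by side.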


Anabelle

• Module 1

• Module 2
– automatic pre-selection
– visualizer
– post-processing tool

• Module 3


Anabelle

• Module 1

• Module 2

• Module 3
– applies annotation rules


Selection of methods




Annotation section of an annotation rule for the serine protease domain:

CC -!- SIMILARITY: Belongs to peptidase family S1.
case <FTGroup:1>
KW Hydrolase
KW Serine protease
end case
FT From: PS50240
FT DOMAIN from to Serine protease #.
FT Group: 1
FT ACT_SITE 42 42 Charge relay system (Potential).
FT Group: 1; Condition: H
FT ACT_SITE 91 91 Charge relay system (Potential).
FT Group: 1; Condition: D
FT ACT_SITE 205 205 Charge relay system (Potential).
FT Group: 1; Condition: S
FT DISULFID 27 43 Potential.
FT DISULFID 111 192 Potential.
FT DISULFID 156 171 Potential.
FT DISULFID 182 210 Potential.
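One possible reading of the rule above is sketched here in Python. It is an interpretation, not the Anabelle rule engine: it assumes that "Condition: H" means the residue at that position must be H, that the keywords inside `case <FTGroup:1>` are emitted only when every group 1 condition holds, and that the positions have already been mapped from the profile onto the target sequence (1-based). The function names are invented for the example.

```python
# (position, required residue) pairs taken from the ACT_SITE lines of the rule.
GROUP_1_CONDITIONS = [(42, "H"), (91, "D"), (205, "S")]
GROUP_1_KEYWORDS = ["Hydrolase", "Serine protease"]


def group_applies(sequence, conditions):
    """True if every conditioned position exists and carries the required residue."""
    return all(
        pos <= len(sequence) and sequence[pos - 1] == residue
        for pos, residue in conditions
    )


def keywords_for(sequence):
    """Return the group 1 keywords only when the catalytic triad is intact."""
    if group_applies(sequence, GROUP_1_CONDITIONS):
        return list(GROUP_1_KEYWORDS)
    return []


# Toy sequence: poly-alanine with H, D and S placed at the conditioned positions.
residues = ["A"] * 210
residues[41], residues[90], residues[204] = "H", "D", "S"
print(keywords_for("".join(residues)))
```

Under this reading, a homolog that has lost one residue of the charge relay system would keep the similarity comment but not the Hydrolase and Serine protease keywords.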


UniRule format and repository

• We are currently developing a general rule format (UniRule format), which we propose that all partners use for annotation rules

• We are creating a central CVS repository accessible to all rule curators in which we will store all rules in the UniRule format.


Advantages (1/2)

Common types of rules …
• Protein family, Protein, Domain, Site

… for safe rule interactions
• hierarchy of rules (a rule can supersede another one)
• triggering of a rule from another rule
• etc…


Advantages (2/2)

Common tools, …
• rule creation, update and maintenance
• syntax checking
• non-redundancy checking
• etc…

… storage and access


Various groups of rules
• HAMAP rules (microbes and plastids)
• MitoRules: mitochondrial proteins
• ProRules (complex protein families)
• AnaRules (rules for domains from many programs such as SignalP, TMHMM)
• Rulebase
• PIR rules
• … and maybe groups of rules for plants, yeast, viruses, etc.


UniRule format and central CVS storage will be discussed during the 2nd AAM at the EBI in December






Medical Annotation

Annotation of genetic diseases and polymorphisms in human

Particularities:

• Annotation has to be as complete as possible, implying a large number of retrieved documents

• Only mutations that do not drastically modify the sequence are kept (no stop or frameshift mutations)


Medical Annotation Tool: Specifications

• Query interface adapted to search mutation-related articles

• Classification of retrieved documents

• Information extraction from documents

• Mutation position control and Swiss-Prot lines generation


Query interface


Classifier Training Corpus

Dataset:
• 2192 abstracts from 32 genes, in three categories:
– “Good”: relevant for annotation (14%)
– “Bad”: irrelevant to annotation (70%)
– “Unclear”: no decision could be made about the abstract’s relevance (16%)
• Used to train a hierarchical probabilistic classifier (in collaboration with XRCE)

[Figure: pie chart of the total distribution: Bad 70%, Good 14%, Unclear 16%, Unclassified 0%]


Classifier Architecture

[Figure: processing pipeline]
retrieved documents => morpho-syntactic analysis => normalization (mutation points, gene & protein synonyms) => term extraction => feature selection => cascade of categorizers => reordered documents


Classifier Performance

• “Good”: Precision = 49%, Recall = 84%

• “Bad”: Precision = 96%, Recall = 82%

[Figure: Classified list evaluation. Precision (0% to 100%) plotted against recall point (0% to 100%); legend: probabilistic, 2 stage classifier, pubmed]
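For readers less familiar with the two measures, precision is the fraction of documents assigned to a class that really belong to it, and recall is the fraction of documents of that class that were found. The sketch below computes both from confusion counts. The counts are made-up illustrative numbers, not the evaluation data behind the figures above.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) for one class from its confusion counts."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Hypothetical counts: 8 relevant abstracts found, 2 irrelevant ones wrongly
# labelled relevant, 2 relevant ones missed.
p, r = precision_recall(true_pos=8, false_pos=2, false_neg=2)
print(f"precision={p:.0%} recall={r:.0%}")
```

For a curator's filter, a high recall on the “Good” class matters most: a missed relevant paper is costlier than an extra abstract to skim.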


The project

• 3-year FP5 European Project, started January 2003

• Official web site: www.biomint.org

• 6 teams involved:
– University of Manchester (UK, coordinator)
– PharmaDM (Belgium)
– Austrian Research Institute for Artificial Intelligence (Austria)
– University of Geneva, AI Lab (Switzerland)
– University of Antwerp, CNTS (Belgium)
– Swiss Institute of Bioinformatics (Switzerland)


The goals of BioMinT

To develop a generic text mining tool that:
– interprets different types of queries
– retrieves relevant documents from the biological literature
– extracts the required information
– outputs the result as a database slot filler or as a structured report

The tool will thus provide two essential research support services:

1. A curator's assistant: it will accelerate, by partially automating, the annotation and update of bio-databases;

2. A researcher's assistant: it will generate readable reports in response to queries from biological researchers.


Architecture of the prototype


A curator’s assistant for proteomic databases

• Swiss-Prot protein sequence knowledgebase
• PRINTS protein « fingerprints » database

Both hand-annotated by trained biologists.


Information retrieval and query management

A semantic meta-query engine, built around legacy search engines of servers such as PubMed, that operates in two steps:

1) An expansion of the initial query with synonyms or related terms derived either from domain ontologies or from existing database entries.

2) A filtering and ranking of documents retrieved from these servers using task-specific heuristics.

Query interface prototype

Developed by Pavel Dobrokhotov in the framework of Swiss-Prot medical annotation: Bioinformatics 19(suppl. 1): i91-i94 (ISMB 2003)

BLAST

Species
• Human
• Mouse
• Rat
• Drosophila
• Yeast
• Escherichia coli
• Bacillus subtilis
• A. thaliana
• C. elegans

Keywords and their search expressions
• Acetoin catabolism: Acetoin & catabolism; Acetoin & degradation; Acetoin & breakdown
• Acetylation: Acetylat*
• Acetylcholine receptor inhibitor: Acetylcholine & receptor & inhibitor
• Actin-binding: Actin binding; Actin & bind*
• Acute phase: Acute-phase
• Acyltransferase: Acyl & transfer*
• ADP-ribosylation: ADP-ribosylat*; ADP & ribosylat*
• Alginate biosynthesis: Alginate & biosynthesis; Alginate & synthesis
• Alkaloid metabolism: Alkaloid & metabolism; caffeine & metabolism; nicotine & metabolism; morphine & metabolism
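The keyword table above amounts to a lookup from a Swiss-Prot keyword to a set of search expressions. A minimal sketch of how such a table could drive query expansion follows. The output syntax (`AND`, `OR`, parentheses) and the names `KEYWORD_EXPANSIONS` and `build_query` are assumptions for illustration, not the syntax of the actual prototype, and `ProtX` is a placeholder protein name.

```python
# A few entries copied from the table above; "&" marks a conjunction of terms.
KEYWORD_EXPANSIONS = {
    "Acetoin catabolism": [
        "Acetoin & catabolism",
        "Acetoin & degradation",
        "Acetoin & breakdown",
    ],
    "Acetylation": ["Acetylat*"],
    "Actin-binding": ["Actin binding", "Actin & bind*"],
}


def build_query(protein_name, keyword):
    """Combine a protein name with every search expression known for a keyword."""
    variants = KEYWORD_EXPANSIONS.get(keyword, [keyword])
    clauses = ["(" + v.replace(" & ", " AND ") + ")" for v in variants]
    return f"{protein_name} AND (" + " OR ".join(clauses) + ")"


print(build_query("ProtX", "Acetoin catabolism"))
```

Keywords missing from the table fall back to the keyword itself, so the expansion never narrows a query below what the curator typed.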


Query Result organisation

• Filtering, classification and clustering according to different rules:
- Selecting the articles with target protein as primary subject:
  Key phrases: cloned <X>, <X> was characterised, isolate(d) <X>, <X> (is/,) a new protein, identify(ied) <X>
  Frequency of gene/protein name and synonyms in the abstract
- Journal/Journal type/Publication date
- Same lab/authors
- Species
- Articles on key annotation topics: PTMs, mutations, function, diseases
• Results presented to the curator by information type
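The "primary subject" rule above can be approximated with regular expressions. The sketch below is one possible reading, not the BioMinT implementation: it turns the listed key phrases into patterns for a given protein name and counts both phrase hits and plain mentions. The function names and the example abstract are invented, and the patterns also accept the spelling "characterized", which is an addition.

```python
import re


def key_phrase_patterns(name):
    """Regular expressions for the key phrases, with <X> replaced by the name."""
    x = re.escape(name)
    return [
        rf"\bcloned {x}\b",
        rf"\b{x} was characteri[sz]ed\b",
        rf"\bisolated? {x}\b",
        rf"\b{x}(?: is|,) a new protein\b",
        rf"\bidentif(?:y|ied) {x}\b",
    ]


def primary_subject_evidence(abstract, names):
    """Return (number of key phrases matched, number of name mentions)."""
    phrase_hits = sum(
        1
        for name in names
        for pattern in key_phrase_patterns(name)
        if re.search(pattern, abstract, re.IGNORECASE)
    )
    mentions = sum(
        len(re.findall(rf"\b{re.escape(name)}\b", abstract, re.IGNORECASE))
        for name in names
    )
    return phrase_hits, mentions


text = "We cloned ABC1 and show that ABC1 is a new protein. ABC1 binds actin."
print(primary_subject_evidence(text, ["ABC1"]))
```

Passing the synonyms of the protein in `names` covers the second criterion, the frequency of the gene/protein name and its synonyms in the abstract.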


Information extraction

• Based on data mining software using Inductive Logic Programming (ILP).

• Using basic text analysis tools developed by CNTS and Tilburg University, namely a Memory-Based Shallow Parser (MBSP) based on a Machine Learning package (TiMBL)

• Using biomedical terminology resources, publicly available or provided by end users.


Topics for information extraction

• Function(s) and role(s); enzymes:
a. Catalytic activity (if EC number)
b. Cofactor
c. Enzyme regulation
d. Pathway
• Subunit (Protein/protein interactions)
• Subcellular location
• Alternative products (alt. splicing, alt. initiation, RNA editing)
• Tissue specificity (Northern and Western results)
• Developmental stage
• Induction
• Domain
• Post-translational modifications (PTM)
• Mass spectrometry
• Polymorphisms
• Disease
• Biotechnology
• Pharmaceutical
• Miscellaneous
• Similarities
• Caution
• Database (specialized cross-references)


Benchmark environment for training and evaluation

We need a corpus of supervised abstracts
• To train the text-mining tools
• To elaborate rules for specific information extraction

What do we need to tag?
• Fragments of, or whole sentences describing information useful for protein annotation
• Specific words describing a specific type of information






Textual annotation edition

• Currently CRISP: text editor enhanced with in-house macros

• Development of Spedit: an XML editor dedicated to Swiss-Prot annotation

• Complete integration of sequence analysis and text-mining tools in a single graphical user interface


Technical issues

• Conversion to a relational database on Oracle;
• Production of an XML format;
• Links to GO terms;
• Update of PDB links;
• Standardisation of a number of topics of CC lines;
• Reformatting of GN line and ‘ALTERNATIVE PRODUCTS’ CC topic;
• Controlled vocabulary for PTMs in FT lines;
• Ongoing conversion to mixed case lines;
• Enhancement of Swiss-Prot entry views

VARIANT 34 34 A->E (IN LDHB DEFICIENCY). /FTId=VAR_004174
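A feature line of this shape, with its stable /FTId identifier, is easy to process by program. The sketch below parses the example above and performs a simple position check of the kind listed in the medical annotation tool specifications ("mutation position control"). The regular expression covers only this single-line layout, and the function names and the toy sequence are hypothetical.

```python
import re

VARIANT_RE = re.compile(
    r"VARIANT\s+(?P<start>\d+)\s+(?P<end>\d+)\s+"
    r"(?P<orig>[A-Z]+)\s*->\s*(?P<new>[A-Z]+)"
    r"(?:\s+\((?P<note>[^)]*)\))?\.?"
    r"(?:\s+/FTId=(?P<ftid>VAR_\d+))?"
)


def parse_variant(line):
    """Extract positions, residues, note and FTId from a one-line VARIANT feature."""
    match = VARIANT_RE.search(line)
    if match is None:
        raise ValueError("not a VARIANT feature line")
    fields = match.groupdict()
    fields["start"] = int(fields["start"])
    fields["end"] = int(fields["end"])
    return fields


def position_is_consistent(sequence, variant):
    """Check that the sequence carries the original residue(s) at the stated position."""
    return sequence[variant["start"] - 1:variant["end"]] == variant["orig"]


LINE = "VARIANT 34 34 A->E (IN LDHB DEFICIENCY). /FTId=VAR_004174"
variant = parse_variant(LINE)
toy_sequence = "M" * 33 + "A" + "K" * 10  # placeholder, not the real protein sequence
print(variant["ftid"], position_is_consistent(toy_sequence, variant))
```

A failed check would typically point to a numbering difference between the paper and the Swiss-Prot sequence, for example a signal peptide or a different isoform.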


THE MODSNP DATABASE

Effects of Variations in Human Proteins on Local Substructure and Functionality

Workflow: Database Modelling; Populate Database; Assess, Evaluate & Analyse Effects of Point mutations on 3D-Structures

[Figure: entity-relationship diagram with 1:n links between the tables below]

• sp_entries: AC, ID, 3Dflag.-dis
• sp_sequences: isoid, AC (FK), Sequence, Checksum
• sp_variants: Ftid, Type, seqpos, Fromaa, Toaa, Dis_status, isoflag
• relseqft: Isoid (FK), Ftid (FK)
• relseqpdb: isoid (FK), chainid (FK), Alid, Evalue, Seqfrom, Seqto, Is_selected
• chsaw-chains: PdbCode, Residues, Amino acids, Heterogens, Solvent, sequence, organism, Compound, eccode, chainid
• models: mid, filename, Type, Isoid (FK), Ftid (FK), Chainid (FK), Builtversion, Method, Created_date
• pdb: Code, deposited, experiment, resolution, header, title, revdat
• struct_info: Tempid, Scop-id, Cath-id
• diseases: ftid, omimid, Full-dis, Abbrev-dis
• analysis: Local substructure analysis and many others!


Acknowledgements

HAMAP: Alexandre Gattiker, Karine Michoud, Virginie Lesaux, Corinne Lachaize, Anne Morgat, Isabelle Phan

ANABELLE: Brigitte Boeckmann, Alexandre Gattiker, Xavier Martin, Nicolas Hulo, Christian Sigrist, Silvia Braconi

TEXT MINING: Pavel Dobrokhotov, Eric Gaussier (XRCE), Cyril Goutte (XRCE), Violaine Pillet, Marc Zehnder

EDITOR: Alain Gateau, Stéphanie Federico, Brigitte Boeckmann

VARIANT MAPPING: Holger Scheib, Lina Yib

http://www.expasy.org/people/swissprot.html#people

[Photo: SWISS-PROT group at ISB and EBI, Barcelona 2002]

Amos Bairoch, ISB
