PIR Seminar , University of Georgetown, Washington DC, 12.11.2003 Information integration for Swiss-Prot annotation Anne-Lise Veuthey Swiss Institute of Bioinformatics


PIR Seminar, University of Georgetown, Washington DC, 12.11.2003

Information integration for Swiss-Prot annotation

Anne-Lise Veuthey
Swiss Institute of Bioinformatics


The simplified story of a Swiss-Prot entry

cDNAs, genomes, …

EMBLnew EMBL

TrEMBLnew TrEMBL

SWISS-PROT

« Automated »
• Redundancy check (merge)
• Family attribution (InterPro)
• Annotation (computer)

« Manual »
• Redundancy (merge, conflicts)
• Annotation (manual)
• SWISS-PROT tools (macros…)
• SWISS-PROT documentation
• Medline
• Databases (MIM, MGD….)
• Brainstorming

Once in SWISS-PROT, the entry is no longer in TrEMBL, but it remains in EMBL (archive)

CDS

CDS: proposed and submitted to EMBL by authors or by genome projects (can be experimentally proven or derived from gene prediction programs). TrEMBL neither translates DNA sequences nor uses gene prediction programs: it only takes the CDS proposed by the submitting authors in the EMBL entry.

Some data are not submitted to the public databases! Delayed or cancelled…


12/11/2003PIR seminar, Washington

Annotation projects

• Human Proteomics Initiative (HPI)

• HAMAP: microbial annotation

• Plant Proteome Annotation Project (PPAP)

• Fungi annotation

• Tox-Prot: annotation of toxin proteins


HPI Goals

• Annotation of all known human proteins;

• Annotation of all mammalian orthologs.

With a particular emphasis on:
– Alternative splicing;
– Polymorphisms;
– PTMs;
– Structural information.


http://www.expasy.org/sprot/hpi/hpi_stat.html


Swiss-Prot / TrEMBL Chromosome by chromosome

http://www.ebi.ac.uk/proteome/HUMAN/

[Figure: bar chart of the percentage of entries in Swiss-Prot for each chromosome (x-axis: chromosome 1 to 22, X, Y; y-axis: % in Swiss-Prot, 0 to 100). The chart carries the label 49%.]

Total number of entries (SP+Tr) per chromosome:
1: 1'640; 2: 1'044; 3: 879; 4: 630; 5: 745; 6: 969; 7: 752; 8: 549; 9: 617; 10: 594; 11: 976; 12: 852; 13: 275; 14: 508; 15: 469; 16: 706; 17: 538; 18: 229; 19: 1'121; 20: 554; 21: 179; 22: 405; X: 645; Y: 61


Problem of information overflow

• Almost 136 356 entries in Swiss-Prot

• But over 1 million sequences in TrEMBL are waiting for manual curation before being incorporated into Swiss-Prot

• The number of TrEMBL entries is increasing exponentially

Solutions

• Development of quality automated annotation (HAMAP)
• Improvement of manual annotation efficiency by:
– Bioinformatic tools integration (Anabelle)
– Text mining tools to help literature screening (BioMinT)


HAMAP

High-quality Automated and Manual Annotation of microbial Proteomes.

More than 150 complete genomes are available in public databases. Collectively they encode more than 350’000 protein sequences. Such a large number of sequences makes classical manual annotation an intractable task.

HAMAP handles annotation of:
• Microbial proteomes (eubacteria and archaea)
• Plastid proteomes (cyanelle and chloroplast)
• Soon extended to mitochondrial and viral proteomes


HAMAP goals

• Annotate automatically, with a high throughput but no decrease in quality, proteins from complete microbial genomes which either belong to a family or have no similarities
• Improve human-machine interaction for the manual annotation of remaining proteins
• Provide integrated information at the level of a microbial genome


HAMAP pipeline

125 genomes

914 families

965 profiles

360'000 proteins

32'000 hits


HAMAP modules

• Downstream interface (External communication; operational): Build an interactive web site to present HAMAP data and allow retrieval of complete proteomes.
• GAM (Genome Awareness Module, HERBS; in development): Warn about missing proteins, potential problems with paralogs and inconsistent pathways.
• CAM (Complex Annotation Module; prototype): Reliably annotate complex families.
• SAM (Sequence Annotation Module; operational): Given a family rule and a set of entries, produce annotated entries based on the rule. Also annotate ORFans.
• FAM (Family Assignment Module; operational): Maintain HAMAP profiles, assign new family members. Detect ORFans.
• Upstream interface (Cleanup & Redundancy; operational): Fix errors in TrEMBLnew entries. Merge redundancy with Swiss-Prot.


FAM : The family assignment module

Similarity searches
• Profile searches against db of families
• Searches in PROSITE, Pfam, TIGRFAM,…
• BLAST against the protein db

Belongs to a family
• High sequence similarity
• Correct length
• Characteristic features
=> Family attribution => SAM module

May belong to a family
• Sub-threshold score
• Incorrect length
• Problems with features
=> Manual correction => Family attribution => SAM module

Other similarities
No similarities to known families, but
• Significant BLAST matches
• Matches in PROSITE, Pfam,…
=> New family/function? => Manual annotation

No similarities (except to ORFans in close species)
=> ORFan => SAM module
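The four outcomes of the FAM flowchart can be read as a small decision procedure. The following Python sketch is only an illustration of that reading: the `SearchResult` fields, the numeric comparison against a cutoff and the returned labels are assumptions made for the example, not the actual HAMAP implementation.

```python
from dataclasses import dataclass


@dataclass
class SearchResult:
    """Hypothetical summary of the similarity searches for one protein."""
    profile_score: float   # best score against a family profile, 0.0 if no hit
    family_cutoff: float   # cutoff of that family profile
    length_ok: bool        # sequence length is compatible with the family
    features_ok: bool      # characteristic features are present
    other_hits: bool       # significant BLAST / PROSITE / Pfam matches


def triage(result: SearchResult) -> str:
    """Route a protein to one of the four outcomes shown on the slide."""
    above_cutoff = result.profile_score >= result.family_cutoff
    if above_cutoff and result.length_ok and result.features_ok:
        return "family attribution"      # belongs to a family, goes to SAM
    if result.profile_score > 0:
        return "manual correction"       # may belong to a family
    if result.other_hits:
        return "manual annotation"       # new family/function?
    return "ORFan"                       # no similarities, goes to SAM
```

A protein that scores above the cutoff but has an incorrect length falls into the "may belong to a family" branch, which matches the criteria listed on the slide.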


Family Assignment Module

Profile generation
• Profiles are automatically generated from the manually curated family alignments

Profile calibration
• Check for profile accuracy
• Set cutoff = lowest score of members
• No false positives allowed in Swiss-Prot

Run against TrEMBL and report results
• Identify trusted / member / weak matches
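The calibration rule (cutoff = lowest member score, no false positives allowed) can be sketched in a few lines of Python. The function name and the example scores are hypothetical; how HAMAP separates trusted, member and weak matches is not stated on the slide, so that step is left out.

```python
def calibrate_cutoff(member_scores, nonmember_scores):
    """Set the cutoff to the lowest member score and list any false positives.

    member_scores: profile scores of the curated family members.
    nonmember_scores: profile scores of Swiss-Prot entries outside the family.
    """
    cutoff = min(member_scores)
    # Any non-member scoring at or above the cutoff violates the
    # "no false positives allowed" requirement and needs curator attention.
    false_positives = [s for s in nonmember_scores if s >= cutoff]
    return cutoff, false_positives


cutoff, false_positives = calibrate_cutoff([52.1, 47.3, 60.0], [12.0, 30.5])
```

If `false_positives` is not empty, the profile or the family alignment would have to be revised before the profile is run against TrEMBL.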


SAM : The sequence annotation module

FAM module (similarity searches)
• Match to a defined family => Family rule-based annotation
• No matches => Prediction program-based annotation of hypothetical proteins (ORFans)


Sequence Annotation Module

SAM annotates a protein entry based on
• A family rule and a sequence, or
• An ORFan sequence

• Conditions in family rules are fully handled
• External programs are called for computed features: ProfileScan, TMHMM, SignalP, REP, COILS, as instructed in the family rule
• Computed features are combined using exclusion rules
• Warnings are generated in case of problems


HAMAP family database

HAMAP families have the following properties:

• Single function, or function that can be deduced from taxonomy or presence/absence of metabolic pathways
• Family members (except divergent sequences) have more similarity to one another than to other proteins


HAMAP family database

All the annotation can be inferred from:
• The rules given in the family (including conditions and dependencies)
• The alignment of the new entry with the existing alignment in the family (for feature propagation)
• The entry taxonomy
• Additional information about metabolic pathways and number of membranes
• Results of ad hoc analysis tools


Computed features

• Signal type 1: SignalP neural network (von Heijne)
• Signal type 2 (lipoprotein): PROSITE rule; later, Pedro Gonnet's patoseq program?
• Signal type 4 (pilin): PROSITE pattern
• Transmembrane regions: TMHMM (Krogh)
• Coiled coils: Modified COILS (Lupas)
• ATP/GTP binding sites: Walker A profile (to do)
• LPxTG cell-wall anchor: PROSITE profile
• Inteins (upstream of identification step): 3 PROSITE profiles: C+N-terminal junctions + homing endonuclease
• Repeats (ANK, LRR, TPR, WD, Kelch): REP (Bork & Andrade) profiles + program


Complex Annotation Module

Extension of SAM to allow:
• Hierarchical assignment
– ABC transporter profile => generic annotation
– + CysA subfamily profile => specific annotation
• Modular annotation
– Domain rules
– Metamotifs to express domain arrangements

Is now part of the Anabelle project
• Not limited to bacteria
• Also usable for semi-automated annotation


Genome Awareness Module

• Runs after a complete proteome has been automatically annotated
• Warns about missing proteins, potential problems with paralogs and inconsistent pathways

Implemented in an expert system: HERBS project


HERBS: Hamap Expert Rule-Based System

• A Sibelius project in collaboration with INRIA, Grenoble.

• Expert system using JESS, a rule engine written in Java

• Description of metabolic pathways using a DAG (directed acyclic graph):

[Figure: DAG fragment for pyrimidine synthesis showing step4 and step5 with the HAMAP families MF_01208, MF_00224, MF_00225 and MF_01211]


http://www.expasy.org/sprot/hamap/


Annotation of eukaryotic proteomes

Increase in complexity:
• Multi-domain proteins
• Molecular complexes
• Multigenic families
• Multiple subcellular locations
• Alternative splicing
• Post-translational modifications
• Pathways complexity

In conclusion, automated annotation procedures are far less easy to implement


Swiss-Prot Entry Creation Flowchart

Step 1: Selecting a new protein to annotate
Step 2: Searching PubMed
Step 3: Reading papers
Step 4: Analysing sequence with bioinformatic tools
Step 5: Data integration
Step 6: Creation of a new/updated Swiss-Prot entry


Where is the computer helping?

[Flowchart repeated: Step 1 to Step 6 of the Swiss-Prot entry creation flowchart]


Anabelle

• Rule-based system for manual, automated and automatic annotation

• Proteins with a complex domain architecture, and simple ones.

• Unlike HAMAP, it also annotates proteins which are not yet characterized, but which share defined domains.

• It consists of 3 modules


Anabelle

• Module 1
– Runs protein sequence analysis tools (PC, internal server, external servers)
– General feature format (gff): www.sanger.ac.uk/Software/formats/GFF

• Module 2

• Module 3

AEGP_RAT TMHMM|2.0 Transmembrane 1157 1179 . . . Level 0 ; NterLocation "O" ; Category "TOPOLOGY"

AEGP_RAT ps_scan|v1.5 PS50068 230 268 10.938 . . Name "LDLRA_2" ; Level 0 ; RawScore 993 ; FeatureFrom 1 ; FeatureTo -1 ; Sequence "RCPLGHHHCQNKACVEPHQLCDGEDNCGDSSDEdpLICS" ; KnownFalsePos 1 ; InterProID "IPR002172" ; Category "DOMAIN"

AEGP_RAT SignalP-NN|v2.0|euk Signal 1 21 . . . Level 0 ; C-max "0.761,22,Y" ; Y-max "0.783,22,Y" ; S-max "0.992,12,Y" ; S-mean "0.934,Y" ; Category "TOPOLOGY"
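The GFF lines above follow the usual layout of eight fixed columns (sequence name, source, feature, start, end, score, strand, frame) followed by a free attribute column. A minimal Python sketch of a parser for lines in this shape is given below. It assumes whitespace-separated columns, as in the slide text, and attributes written as `Key value` pairs separated by `;` with no semicolon inside a quoted value. The function name and the returned dictionary layout are choices made for the example, not part of Anabelle.

```python
def parse_gff_line(line):
    """Parse one GFF-style line into a dictionary."""
    fields = line.strip().split(None, 8)
    if len(fields) < 8:
        raise ValueError("expected at least 8 columns")
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    attributes = {}
    if len(fields) == 9:
        for chunk in fields[8].split(";"):
            chunk = chunk.strip()
            if not chunk:
                continue
            # First token is the attribute name, the rest is its value.
            key, _, value = chunk.partition(" ")
            attributes[key] = value.strip().strip('"')
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "score": None if score == "." else float(score),
        "strand": strand,
        "frame": frame,
        "attributes": attributes,
    }


record = parse_gff_line(
    'AEGP_RAT TMHMM|2.0 Transmembrane 1157 1179 . . . '
    'Level 0 ; NterLocation "O" ; Category "TOPOLOGY"'
)
```

Keeping the attributes in a dictionary makes it straightforward to filter predictions by `Category` or `Level` before they are shown to a curator.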


Anabelle

• Module 1

• Module 2
– automatic pre-selection
– visualizer
– post-processing tool

• Module 3


Anabelle

• Module 1

• Module 2

• Module 3
– applies annotation rules


Selection of methods


CC -!- SIMILARITY: Belongs to peptidase family S1.
case <FTGroup:1>
KW Hydrolase
KW Serine protease
end case
FT From: PS50240
FT DOMAIN from to Serine protease #.
FT Group: 1
FT ACT_SITE 42 42 Charge relay system (Potential).
FT Group: 1; Condition: H
FT ACT_SITE 91 91 Charge relay system (Potential).
FT Group: 1; Condition: D
FT ACT_SITE 205 205 Charge relay system (Potential).
FT Group: 1; Condition: S
FT DISULFID 27 43 Potential.
FT DISULFID 111 192 Potential.
FT DISULFID 156 171 Potential.
FT DISULFID 182 210 Potential.

Annotation section of an annotation rule for the serine protease domain
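One way to read this rule: the keywords inside `case <FTGroup:1>` are only added when the features of group 1 are all present, and each active site is only propagated when the residue at that position satisfies its `Condition`. The Python sketch below illustrates that reading. It assumes the positions are 1-based coordinates in the sequence passed in (a real system would first map rule positions onto the target through the profile alignment), and all names are hypothetical.

```python
# Active-site positions and required residues, copied from the rule above.
GROUP_1_SITES = [(42, "H"), (91, "D"), (205, "S")]
GROUP_1_KEYWORDS = ["Hydrolase", "Serine protease"]


def group_satisfied(sequence, sites=GROUP_1_SITES):
    """True if every site of the group carries the residue its Condition requires."""
    return all(
        pos <= len(sequence) and sequence[pos - 1] == residue
        for pos, residue in sites
    )


def keywords_for(sequence):
    """Keywords of the `case <FTGroup:1>` block, or nothing if the group fails."""
    return list(GROUP_1_KEYWORDS) if group_satisfied(sequence) else []
```

Under this reading, a sequence that matches the serine protease profile but has lost one residue of the charge relay system would still get the SIMILARITY comment, but not the Hydrolase and Serine protease keywords.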


UniRule format and repository

• We are currently developing a general rule format (UniRule format), which we propose that all partners use for annotation rules

• We are creating a central CVS repository accessible to all rule curators in which we will store all rules in the UniRule format.


Advantages (1/2)

Common types of rules …
• Protein family, Protein, Domain, Site

… for safe rule interactions
• hierarchy of rules (a rule can supersede another one)
• triggering of a rule from another rule
• etc…


Advantages (2/2)

Common tools, …
• rule creation, update and maintenance
• syntax checking
• non-redundancy checking
• etc…

… storage and access


Various groups of rules
• HAMAP rules (microbes and plastids)
• MitoRules: mitochondrial proteins
• ProRules (complex protein families)
• AnaRules (rules for domains from many programs such as SignalP, TMHMM)
• Rulebase
• PIR rules
• … and maybe groups of rules for plants, yeast, viruses, etc.


UniRule format and central CVS storage will be discussed during the 2nd AAM at the EBI in December


Where is the computer helping?

[Flowchart repeated: Step 1 to Step 6 of the Swiss-Prot entry creation flowchart]


Medical Annotation

Annotation of genetic diseases and polymorphisms in human

Particularities:

• Annotation has to be as complete as possible, implying a large number of retrieved documents

• Only mutations that do not drastically modify the sequence are kept (no stop or frameshift mutations)


Medical Annotation Tool: Specifications

• Query interface adapted to searching for mutation-related articles

• Classification of retrieved documents

• Information extraction from documents

• Mutation position control and Swiss-Prot lines generation


Query interface


Classifier Training Corpus

Dataset:
• 2192 abstracts from 32 genes, in three categories:
– “Good”: relevant for annotation (14%)
– “Bad”: irrelevant to annotation (70%)
– “Unclear”: no decision could be made about the abstract’s relevance (16%)
• Used to train a hierarchical probabilistic classifier (in collaboration with XRCE)

[Figure: pie chart "Total distribution": Bad 70%, Unclear 16%, Good 14%, Unclassified 0%]


Classifier Architecture

[Diagram: retrieved documents => morpho-syntactic analysis => normalization (mutation points, gene & protein synonyms) => term extraction => feature selection => cascade of categorizers => reordered documents]


Classifier Performance

• “Good”: precision = 49%, recall = 84%

• “Bad”: precision = 96%, recall = 82%

[Figure: "Classified list evaluation", precision (0% to 100%) plotted against recall point (0% to 100%); legend text: probabilistic 2 stage classifier, pubmed]


The project

• 3 year FP5 European Project, started January 2003

• Official web site: www.biomint.org

• 5 teams involved:
– University of Manchester (UK, coordinator)
– PharmaDM (Belgium)
– Austrian Research Institute for Artificial Intelligence (Austria)
– University of Geneva, AI Lab (Switzerland)
– University of Antwerp, CNTS (Belgium)
– Swiss Institute of Bioinformatics (Switzerland)


The goals of BioMinT

To develop a generic text mining tool that:
– interprets different types of queries
– retrieves relevant documents from the biological literature
– extracts the required information
– outputs the result as a database slot filler or as a structured report

The tool will thus provide two essential research support services:

1. A curator's assistant: it will accelerate, by partially automating, the annotation and update of bio-databases;

2. A researcher's assistant: it will generate readable reports in response to queries from biological researchers.


Architecture of the prototype


A curator’s assistant for proteomic databases

• Swiss-Prot protein sequence knowledgebase
• PRINTS protein « fingerprints » database

Both hand-annotated by trained biologists.


Information retrieval and query management

A semantic meta-query engine, built around the legacy search engines of servers such as PubMed, that operates in two steps:

1) An expansion of the initial query with synonyms or related terms derived either from domain ontologies or from existing database entries.

2) A filtering and ranking of documents retrieved from these servers using task-specific heuristics.


Query interface prototype

Developed by Pavel Dobrokhotov in the framework of Swiss-Prot medical annotation: Bioinformatics 19(suppl. 1): i91-i94 (ISMB 2003)

[Screenshot of the query interface; visible labels: BLAST, Species]

Species: Human, Mouse, Rat, Drosophila, Yeast, Escherichia coli, Bacillus subtilis, A. thaliana, C. elegans

Keyword query expansions:
• Acetoin catabolism: Acetoin & catabolism; Acetoin & degradation; Acetoin & breakdown
• Acetylation: Acetylat*
• Acetylcholine receptor inhibitor: Acetylcholine & receptor & inhibitor
• Actin-binding: Actin binding; Actin & bind*
• Acute phase: Acute-phase
• Acyltransferase: Acyl & transfer*
• ADP-ribosylation: ADP-ribosylat*; ADP & ribosylat*
• Alginate biosynthesis: Alginate & biosynthesis; Alginate & synthesis
• Alkaloid metabolism: Alkaloid & metabolism; caffeine & metabolism; nicotine & metabolism; morphine & metabolism
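The keyword expansions listed on this slide can be pictured as a lookup table that turns one annotation keyword into several alternative search expressions. The Python sketch below is a hedged illustration: the table holds three entries copied from the slide, while the function names, the fallback to the bare keyword and the translation of `&` into `AND` joined by `OR` are assumptions made for the example, not the prototype's actual query syntax.

```python
# Expansions copied from the slide; the first term of each group is read
# here as the keyword and the remaining terms as its search expressions.
EXPANSIONS = {
    "Acetylation": ["Acetylat*"],
    "Actin-binding": ["Actin binding", "Actin & bind*"],
    "Alginate biosynthesis": ["Alginate & biosynthesis", "Alginate & synthesis"],
}


def expand(keyword):
    """Return the search expressions for a keyword, or the keyword itself."""
    return EXPANSIONS.get(keyword, [keyword])


def to_boolean_query(keyword):
    """Combine all expansions of a keyword into one OR-ed query string."""
    parts = [expr.replace(" & ", " AND ") for expr in expand(keyword)]
    return " OR ".join(f"({part})" for part in parts)
```

For example, `to_boolean_query("Actin-binding")` yields `(Actin binding) OR (Actin AND bind*)`, while a keyword without an entry is passed through unchanged.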


Query Result organisation

• Filtering, classification and clustering according to different rules:
– Selecting the articles with target protein as primary subject:
   Key phrases: cloned <X>, <X> was characterised, isolate(d) <X>, <X> (is/,) a new protein, identify(ied) <X>
   Frequency of gene/protein name and synonyms in the abstract
– Journal/Journal type/Publication date
– Same lab/authors
– Species
– Articles on key annotation topics: PTMs, mutations, function, diseases
• Results presented to the curator by information type
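The two "primary subject" signals named above (key phrases around the protein name, and the frequency of the name in the abstract) can be approximated with regular expressions. The Python sketch below is one possible reading of the key phrase list: the exact patterns, the function names and the example abstract are assumptions, and synonyms are ignored for brevity.

```python
import re


def key_phrase_patterns(name):
    """Regular expressions for the key phrases listed on the slide."""
    x = re.escape(name)
    return [
        re.compile(rf"\bcloned\s+{x}\b", re.I),
        re.compile(rf"\b{x}\s+was\s+characteri[sz]ed\b", re.I),
        re.compile(rf"\bisolated?\s+{x}\b", re.I),
        re.compile(rf"\b{x}\s*(?:is|,)\s+a\s+new\s+protein\b", re.I),
        re.compile(rf"\bidentif(?:y|ied)\s+{x}\b", re.I),
    ]


def primary_subject_signals(abstract, name):
    """Return (number of key phrases matched, number of name mentions)."""
    phrase_hits = sum(1 for p in key_phrase_patterns(name) if p.search(abstract))
    mentions = len(re.findall(rf"\b{re.escape(name)}\b", abstract, re.I))
    return phrase_hits, mentions


signals = primary_subject_signals(
    "We cloned LDHB from rat liver. LDHB was characterised biochemically.",
    "LDHB",
)
```

How the two numbers are weighted to rank or filter articles is a separate design decision that the slide does not specify.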


Information extraction

• Based on data mining software using Inductive Logic Programming (ILP).

• Using basic text analysis tools developed by CNTS and Tilburg University, namely a Memory-Based Shallow Parser (MBSP) based on a Machine Learning package (TiMBL)

• Using biomedical terminology resources, publicly available or provided by end users.


Topics for information extraction

• Function(s) and role(s); enzymes:
a. Catalytic activity (if EC number)
b. Cofactor
c. Enzyme regulation
d. Pathway
• Subunit (Protein/protein interactions)
• Subcellular location
• Alternative products (alt. splicing, alt. initiation, RNA editing)
• Tissue specificity (Northern and Western results)
• Developmental stage
• Induction
• Domain
• Post-translational modifications (PTM)
• Mass spectrometry
• Polymorphisms
• Disease
• Biotechnology
• Pharmaceutical
• Miscellaneous
• Similarities
• Caution
• Database (specialized cross-references)


Benchmark environment for training and evaluation

We need a corpus of supervised abstracts
• To train the text-mining tools
• To elaborate rules for specific information extraction

What do we need to tag?
• Fragments of, or whole sentences describing information useful for protein annotation
• Specific words describing a specific type of information


Where is the computer helping?

[Flowchart repeated: Step 1 to Step 6 of the Swiss-Prot entry creation flowchart]


Textual annotation edition

• Currently CRISP: text editor enhanced with in-house macros

• Development of Spedit: an XML editor dedicated to Swiss-Prot annotation

• Complete integration of sequence analysis and text-mining tools in a single graphical user interface


Technical issues

• Conversion to a relational database on Oracle;
• Production of an XML format;
• Links to GO terms;
• Update of PDB links;
• Standardisation of a number of topics of CC lines;
• Reformatting of GN line and ‘ALTERNATIVE PRODUCTS’ CC topic;
• Controlled vocabulary for PTM modification in FT lines;
• Ongoing conversion to mixed case lines;
• Enhancement of Swiss-Prot entry views


VARIANT 34 34 A->E (IN LDHB DEFICIENCY). /FTId=VAR_004174
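A VARIANT feature such as the one above carries the position, the original and replacement residues, a free-text note and a stable feature identifier. As a rough illustration of how such a line could be read and checked against a sequence (the "mutation position control" mentioned for the medical annotation tool), here is a Python sketch. The regular expression covers only lines shaped like this example, and the function names are hypothetical.

```python
import re

VARIANT_RE = re.compile(
    r"VARIANT\s+(?P<start>\d+)\s+(?P<end>\d+)\s+"
    r"(?P<from_aa>[A-Z]+)\s*->\s*(?P<to_aa>[A-Z]+)"
    r"(?:\s*\((?P<note>[^)]*)\))?\.?"
    r"(?:\s*/FTId=(?P<ftid>\w+))?"
)


def parse_variant(text):
    """Parse a VARIANT feature into a dictionary, or return None if no match."""
    match = VARIANT_RE.search(text)
    if match is None:
        return None
    variant = match.groupdict()
    variant["start"] = int(variant["start"])
    variant["end"] = int(variant["end"])
    return variant


def position_is_consistent(sequence, variant):
    """Check that the sequence carries the original residue(s) at the stated position."""
    return sequence[variant["start"] - 1:variant["end"]] == variant["from_aa"]


variant = parse_variant("VARIANT 34 34 A->E (IN LDHB DEFICIENCY). /FTId=VAR_004174")
```

A mismatch between the reported residue and the sequence often points to a different numbering scheme in the source article, which is exactly what a position check is meant to catch.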


Effects of Variations in Human Proteins on Local Substructure and Functionality

[Diagram labels: Database, Modelling, Populate, Assess, Evaluate & Analyse Effects of Point mutations on 3D-Structures, Database]


THE MODSNP DATABASE

[Entity-relationship diagram; the 1-to-n relationship lines are not recoverable from the extracted text. Tables and their columns:]
• sp_entries: AC, ID, 3Dflag, .-dis
• sp_sequences: isoid, AC (FK), Sequence, Checksum
• sp_variants: Ftid, Type, seqpos, Fromaa, Toaa, Dis_status, isoflag
• relseqft: Isoid (FK), Ftid (FK)
• relseqpdb: isoid (FK), chainid (FK), Alid, Evalue, Seqfrom, Seqto, Is_selected
• chsaw-chains: PdbCode, Residues, Amino acids, Heterogens, Solvent, sequence, organism, Compound, eccode, chainid
• models: mid, filename, Type, Isoid (FK), Ftid (FK), Chainid (FK), Builtversion, Method, Created_date
• struct_info: Tempid, Scop-id, Cath-id
• diseases: ftid, omimid, Full-dis, Abbrev-dis
• pdb: Code, deposited, experiment, resolution, header, title, revdat
• analysis: Local substructure analysis and many others!


Acknowledgements

HAMAP:Alexandre Gattiker, Karine Michoud, Virginie Lesaux, Corinne Lachaize, Anne Morgat, Isabelle Phan

ANABELLE:Brigitte Boeckmann, Alexandre Gattiker, Xavier Martin, Nicolas Hulo, Christian Sigrist, Silvia Braconi

TEXT MINING: Pavel Dobrokhotov, Eric Gaussier (XRCE), Cyril Goutte (XRCE), Violaine Pillet, Marc Zehnder

EDITOR:Alain Gateau, Stéphanie Federico, Brigitte Boeckmann

VARIANT MAPPING:Holger Scheib, Lina Yib


http://www.expasy.org/people/swissprot.html#people

SWISS-PROT group at ISB and EBI

Barcelona 2002

Amos Bairoch, ISB