PIR Seminar, University of Georgetown, Washington DC, 12.11.2003
Information integration for Swiss-Prot annotation
Anne-Lise Veuthey, Swiss Institute of Bioinformatics
The simplified story of a Swiss-Prot entry
cDNAs, genomes, …
EMBLnew EMBL
TrEMBLnew TrEMBL
SWISS-PROT
« Automated »
• Redundancy check (merge)
• Family attribution (InterPro)
• Annotation (computer)
« Manual »
• Redundancy (merge, conflicts)
• Annotation (manual)
• SWISS-PROT tools (macros…)
• SWISS-PROT documentation
• Medline
• Databases (MIM, MGD…)
• Brainstorming
Once in SWISS-PROT, the entry is no longer in TrEMBL, but it remains in EMBL (archive).
CDS
CDS: proposed and submitted to EMBL by authors or by genome projects (can be experimentally proven or derived from gene prediction programs). TrEMBL neither translates DNA sequences nor uses gene prediction programs: it only takes the CDS proposed by the submitting authors in the EMBL entry.
Some data are not submitted to the public databases! Submission may be delayed or cancelled…
12/11/2003PIR seminar, Washington
Annotation projects
• Human Proteomics Initiative (HPI)
• HAMAP: microbial annotation
• Plant Proteome Annotation Project (PPAP)
• Fungi annotation
• Tox-Prot: annotation of toxin proteins
HPI Goals
• Annotation of all known human proteins;
• Annotation of all mammalian orthologs.
With a particular emphasis on:
– Alternative splicing;
– Polymorphisms;
– PTMs;
– Structural information.
http://www.expasy.org/sprot/hpi/hpi_stat.html
Swiss-Prot / TrEMBL Chromosome by chromosome
http://www.ebi.ac.uk/proteome/HUMAN/
[Figure: Swiss-Prot / TrEMBL coverage per human chromosome. x-axis: chromosome (1 to 22, X, Y); y-axis: % in Swiss-Prot (0 to 100); figure label: 49%]
Total number of entries (SP+Tr) per chromosome:
1: 1'640; 2: 1'044; 3: 879; 4: 630; 5: 745; 6: 969; 7: 752; 8: 549; 9: 617; 10: 594; 11: 976; 12: 852; 13: 275; 14: 508; 15: 469; 16: 706; 17: 538; 18: 229; 19: 1'121; 20: 554; 21: 179; 22: 405; X: 645; Y: 61
Problem of information overflow
• Almost 136 356 entries in Swiss-Prot
• But over 1 million sequences in TrEMBL are waiting for manual curation before being incorporated into Swiss-Prot
• The number of TrEMBL entries is increasing exponentially
Solutions
• Development of quality automated annotation (HAMAP)
• Improvement of manual annotation efficiency by:
– Bioinformatic tools integration (Anabelle)
– Text mining tools to help literature screening (BioMinT)
HAMAP
High-quality Automated and Manual Annotation of microbial Proteomes.
More than 150 complete genomes are available in public databases. Collectively they encode more than 350'000 protein sequences. Such a large number of sequences makes classical manual annotation an intractable task.
HAMAP handles annotation of:
• Microbial proteomes (eubacteria and archaea)
• Plastid proteomes (cyanelle and chloroplast)
• Soon extended to mitochondrial and viral proteomes
HAMAP goals
• Annotate automatically, with high throughput but no decrease in quality, proteins from complete microbial genomes that either belong to a family or have no similarities
• Improve human-machine interaction for the manual annotation of the remaining proteins
• Provide integrated information at the level of a microbial genome
HAMAP pipeline
125 genomes
914 families, 965 profiles
360'000 proteins
32'000 hits
HAMAP modules
Modules, listed from the downstream to the upstream interface (name, status, function):
• Downstream interface: External communication (operational). Build an interactive web site to present HAMAP data and allow retrieval of complete proteomes.
• GAM: Genome Awareness Module, HERBS (in development). Warn about missing proteins, potential problems with paralogs and inconsistent pathways.
• CAM: Complex Annotation Module (prototype). Reliably annotate complex families.
• SAM: Sequence Annotation Module (operational). Given a family rule and a set of entries, produce annotated entries based on the rule. Also annotate ORFans.
• FAM: Family Assignment Module (operational). Maintain HAMAP profiles, assign new family members. Detect ORFans.
• Upstream interface: Cleanup & Redundancy (operational). Fix errors in TrEMBLnew entries. Merge redundancy with Swiss-Prot.
FAM : The family assignment module
Similarity searches
• Profile searches against the database of families
• Searches in PROSITE, Pfam, TIGRFAM, …
• BLAST against the protein database
Belongs to a family
• High sequence similarity
• Correct length
• Characteristic features
May belong to a family
• Sub-threshold score
• Incorrect length
• Problems with features
Other similarities
No similarities to known families, but
• Significant BLAST matches
• Matches in PROSITE, Pfam, …
No similarities (except to ORFans in close species)
[Flowchart: each outcome leads to one of the following: family attribution followed by the SAM module; manual correction; a possible new family/function requiring manual annotation; or ORFan handling by the SAM module]
Family Assignment Module
Profile generation
• Profiles are automatically generated from the manually curated family alignments
Profile calibration
• Check for profile accuracy
• Set cutoff = lowest score of members
• No false positives allowed in Swiss-Prot
Run against TrEMBL and report results
• Identify trusted / member / weak matches
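The calibration step of the Family Assignment Module can be sketched as follows. This is a minimal illustration, not HAMAP's actual code: the function names and the score values are hypothetical, and only a member / sub-threshold distinction is modelled, because the slide does not define the boundaries between trusted, member and weak matches.

```python
def calibrate_cutoff(member_scores):
    """Return the family cutoff: the lowest profile score among curated members."""
    if not member_scores:
        raise ValueError("a family needs at least one curated member")
    return min(member_scores)


def classify_hit(score, cutoff):
    """Label a profile hit relative to the calibrated cutoff."""
    return "member" if score >= cutoff else "sub-threshold"


# Hypothetical scores of the manually curated family members.
cutoff = calibrate_cutoff([31.5, 28.2, 40.1])

# Hypothetical hits from a run against TrEMBL.
labels = {name: classify_hit(score, cutoff)
          for name, score in {"hit_a": 35.0, "hit_b": 27.9}.items()}
```

Taking the minimum member score as cutoff keeps every curated member above threshold; whether it also excludes all false positives has to be checked against Swiss-Prot, as the calibration bullet requires.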
SAM : The sequence annotation module
FAM module: similarity searches
• Match to a defined family: family rule-based annotation
• No matches: prediction program-based annotation of hypothetical proteins (ORFans)
Sequence Annotation Module
SAM annotates a protein entry based on:
• A family rule and a sequence, or
• An ORFan sequence
Properties:
• Conditions in family rules are fully handled
• External programs are called for computed features (ProfileScan, TMHMM, SignalP, REP, COILS), as instructed in the family rule
• Computed features are combined using exclusion rules
• Warnings are generated in case of problems
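How computed features might be combined with exclusion rules can be sketched as below. The slide does not list the actual rules, so the one used here (a signal peptide suppresses an overlapping transmembrane prediction) is an assumption, as are all names and coordinates in the code.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    kind: str   # e.g. "SIGNAL" or "TRANSMEM"
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def overlaps(a, b):
    """True when the two features share at least one residue."""
    return a.start <= b.end and b.start <= a.end


def combine_features(features, exclusions):
    """Drop every feature whose kind is excluded by an overlapping feature.

    `exclusions` maps a winning kind to the set of kinds it suppresses.
    """
    kept = []
    for feat in features:
        suppressed = any(
            feat.kind in exclusions.get(other.kind, set()) and overlaps(feat, other)
            for other in features
            if other is not feat
        )
        if not suppressed:
            kept.append(feat)
    return kept


# Hypothetical rule: a signal peptide suppresses an overlapping TM prediction.
EXCLUSIONS = {"SIGNAL": {"TRANSMEM"}}
predicted = [
    Feature("SIGNAL", 1, 21),
    Feature("TRANSMEM", 5, 23),
    Feature("TRANSMEM", 1157, 1179),
]
combined = combine_features(predicted, EXCLUSIONS)
```

Here the transmembrane segment at 5 to 23 is dropped because it overlaps the signal peptide, while the one at 1157 to 1179 is kept.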
HAMAP family database
HAMAP families have the following properties:
• Single function, or function that can be deduced from taxonomy or presence/absence of metabolic pathways
• Family members (except divergent sequences) have more similarity to one another than to other proteins
HAMAP family database
All the annotation can be inferred from:
• The rules given in the family (including conditions and dependencies)
• The alignment of the new entry with the existing alignment in the family (for feature propagation)
• The entry taxonomy
• Additional information about metabolic pathways and number of membranes
• Results of ad hoc analysis tools
Computed features
Feature: prediction method
• Signal type 1: SignalP neural network (von Heijne)
• Signal type 2 (lipoprotein): PROSITE rule; later, Pedro Gonnet's patoseq program?
• Signal type 4 (pilin): PROSITE pattern
• Transmembrane regions: TMHMM (Krogh)
• Coiled coils: modified COILS (Lupas)
• ATP/GTP binding sites: Walker A profile (to do)
• LPxTG cell-wall anchor: PROSITE profile
• Inteins (upstream of identification step): 3 PROSITE profiles (C+N-terminal junctions + homing endonuclease)
• Repeats (ANK, LRR, TPR, WD, Kelch): REP (Bork & Andrade) profiles + program
Complex Annotation Module
Extension of SAM to allow:
• Hierarchical assignment
– ABC transporter profile => generic annotation
– + CysA subfamily profile => specific annotation
• Modular annotation
– Domain rules
– Metamotifs to express domain arrangements
Is now part of the Anabelle project:
• Not limited to bacteria
• Also usable for semi-automated annotation
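Hierarchical assignment can be illustrated with a small sketch in which a generic rule supplies baseline annotation and a subfamily rule refines it. Only the two profile names come from the slide; the rule structure, the function name and the annotation strings are placeholders, not the CAM implementation.

```python
def annotate_hierarchically(matched_profiles, rules):
    """Collect annotation from the generic rule down to the most specific one.

    `rules` maps a profile name to (parent profile or None, annotation dict).
    More specific rules are applied last so that they override generic values.
    """
    def depth(name):
        level = 0
        while rules[name][0] is not None:
            name = rules[name][0]
            level += 1
        return level

    annotation = {}
    for name in sorted((p for p in matched_profiles if p in rules), key=depth):
        annotation.update(rules[name][1])
    return annotation


# Placeholder rule content; only the profile names come from the slide.
RULES = {
    "ABC_transporter": (None, {"family": "generic ABC transporter annotation"}),
    "CysA": ("ABC_transporter", {"subfamily": "specific CysA annotation"}),
}
generic = annotate_hierarchically({"ABC_transporter"}, RULES)
specific = annotate_hierarchically({"ABC_transporter", "CysA"}, RULES)
```

A protein matching only the ABC transporter profile receives the generic annotation, while one that also matches the CysA profile receives both levels.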
Genome Awareness Module
• Runs after a complete proteome has been automatically annotated
• Warns about missing proteins, potential problems with paralogs and inconsistent pathways
• Implemented in an expert system: HERBS project
HERBS: Hamap Expert Rule-Based System
• A Sibelius project in collaboration with INRIA, Grenoble.
• Expert system using JESS, a rule engine written in Java
• Description of metabolic pathways using a DAG (directed acyclic graph):
[Diagram: DAG fragment of pyrimidine synthesis showing step4 and step5, with HAMAP families MF_01208, MF_00224, MF_00225 and MF_01211 attached to the steps]
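The kind of check performed on a pathway DAG can be sketched in plain Python (HERBS itself uses the JESS rule engine, which is not reproduced here). The assignment of family accessions to steps below is illustrative only, since the extracted diagram does not show which family belongs to which step; the function and variable names are hypothetical.

```python
def missing_steps(pathway, step_families, proteome_families):
    """Return pathway steps for which no assigned family is present.

    `pathway` maps each step to the steps that directly follow it (a DAG);
    `step_families` maps each step to the family accessions that can fulfil it.
    """
    steps = set(pathway) | {s for following in pathway.values() for s in following}
    return sorted(
        step for step in steps
        if not step_families.get(step, set()) & proteome_families
    )


# Illustrative fragment: the step-to-family mapping is an assumption.
PATHWAY = {"step4": ["step5"], "step5": []}
STEP_FAMILIES = {
    "step4": {"MF_00224", "MF_00225"},
    "step5": {"MF_01208", "MF_01211"},
}
warnings = missing_steps(PATHWAY, STEP_FAMILIES, {"MF_00225"})
```

With a proteome in which only MF_00225 was assigned, step4 is covered and step5 is reported as missing, which is the type of warning the Genome Awareness Module is meant to raise.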
http://www.expasy.org/sprot/hamap/
Annotation of eukaryotic proteomes
Increase in complexity:
• Multi-domain proteins
• Molecular complexes
• Multigenic families
• Multiple subcellular locations
• Alternative splicing
• Post-translational modifications
• Pathway complexity
In conclusion, automated annotation procedures are far less easy to implement.
Swiss-Prot Entry Creation Flowchart
Step 1: Selecting a new protein to annotate
Step 2: Searching PubMed
Step 3: Reading papers
Step 4: Analysing sequence with bioinformatic tools
Step 5: Data integration
Step 6: Creation of a new/updated Swiss-Prot entry
Where does the computer help?
Anabelle
• Rule-based system for manual, automated and automatic annotation
• For proteins with a complex domain architecture, and simple ones.
• Unlike HAMAP, it also annotates proteins that are not yet characterized but that share defined domains.
• It consists of 3 modules
Anabelle
• Module 1
– Runs protein sequence analysis tools (PC, internal server, external servers)
– General feature format (GFF): www.sanger.ac.uk/Software/formats/GFF
• Module 2
• Module 3
GFF example:
AEGP_RAT TMHMM|2.0 Transmembrane 1157 1179 . . . Level 0 ; NterLocation "O" ; Category "TOPOLOGY"
AEGP_RAT ps_scan|v1.5 PS50068 230 268 10.938 . . Name "LDLRA_2" ; Level 0 ; RawScore 993 ; FeatureFrom 1 ; FeatureTo -1 ; Sequence "RCPLGHHHCQNKACVEPHQLCDGEDNCGDSSDEdpLICS" ; KnownFalsePos 1 ; InterProID "IPR002172" ; Category "DOMAIN"
AEGP_RAT SignalP-NN|v2.0|euk Signal 1 21 . . . Level 0 ; C-max "0.761,22,Y" ; Y-max "0.783,22,Y" ; S-max "0.992,12,Y" ; S-mean "0.934,Y" ; Category "TOPOLOGY"
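The GFF lines produced by the analysis tools can be read with a short parser such as the sketch below. It assumes whitespace-separated fields, because tabs are not preserved in the slide text (GFF files are normally tab-delimited), and it would not handle a quoted attribute value containing a semicolon. The function name and the returned dictionary layout are hypothetical.

```python
def parse_gff_line(line):
    """Parse one GFF line into its eight fixed fields plus an attribute dict."""
    fields = line.strip().split(None, 8)
    if len(fields) < 8:
        raise ValueError("expected at least 8 whitespace-separated fields")
    names = ("seqname", "source", "feature", "start", "end",
             "score", "strand", "frame")
    record = dict(zip(names, fields[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])

    # Attributes are 'Key value' pairs separated by ';'.
    attributes = {}
    if len(fields) == 9:
        for chunk in fields[8].split(";"):
            key, _, value = chunk.strip().partition(" ")
            if key:
                attributes[key] = value.strip().strip('"')
    record["attributes"] = attributes
    return record


line = ('AEGP_RAT TMHMM|2.0 Transmembrane 1157 1179 . . . '
        'Level 0 ; NterLocation "O" ; Category "TOPOLOGY"')
record = parse_gff_line(line)
```

Keeping the attributes in a dictionary lets later steps filter features, for example by `Category` or `Level`, without re-parsing the line.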
Anabelle
• Module 1
• Module 2
– automatic pre-selection
– visualizer
– post-processing tool
• Module 3
Anabelle
• Module 1
• Module 2
• Module 3
– applies annotation rules
Selection of methods
Annotation section of an annotation rule for the serine protease domain:
CC -!- SIMILARITY: Belongs to peptidase family S1.
case <FTGroup:1>
KW Hydrolase
KW Serine protease
end case
FT From: PS50240
FT DOMAIN from to Serine protease #.
FT Group: 1
FT ACT_SITE 42 42 Charge relay system (Potential).
FT Group: 1; Condition: H
FT ACT_SITE 91 91 Charge relay system (Potential).
FT Group: 1; Condition: D
FT ACT_SITE 205 205 Charge relay system (Potential).
FT Group: 1; Condition: S
FT DISULFID 27 43 Potential.
FT DISULFID 111 192 Potential.
FT DISULFID 156 171 Potential.
FT DISULFID 182 210 Potential.
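One possible reading of the serine protease rule is that the keywords inside `case <FTGroup:1>` are added only when every Group 1 feature satisfies its residue condition. The sketch below implements that reading; it is an interpretation, not the Anabelle rule engine. Positions are treated as 1-based offsets into the matched PS50240 domain, which is an assumption, and the test sequence is synthetic.

```python
# Group 1 sites from the rule: (position, required residue).
GROUP_1_SITES = [(42, "H"), (91, "D"), (205, "S")]
GROUP_1_KEYWORDS = ["Hydrolase", "Serine protease"]


def apply_group(domain_seq, sites, keywords):
    """Return (features, keywords) if every site condition holds, else ([], [])."""
    for pos, residue in sites:
        if pos > len(domain_seq) or domain_seq[pos - 1] != residue:
            return [], []
    features = [("ACT_SITE", pos, pos, "Charge relay system (Potential).")
                for pos, _ in sites]
    return features, list(keywords)


# Synthetic sequence built only to exercise the conditions.
residues = ["A"] * 210
for pos, residue in GROUP_1_SITES:
    residues[pos - 1] = residue
features, keywords = apply_group("".join(residues), GROUP_1_SITES, GROUP_1_KEYWORDS)

# A sequence lacking the catalytic residues gets neither features nor keywords.
rejected = apply_group("A" * 210, GROUP_1_SITES, GROUP_1_KEYWORDS)
```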
UniRule format and repository
• We are currently developing a general rule format (UniRule format), which we propose that all partners use for annotation rules
• We are creating a central CVS repository accessible to all rule curators in which we will store all rules in the UniRule format.
Advantages (1/2)
Common types of rules …
• Protein family, Protein, Domain, Site
… for safe rule interactions
• hierarchy of rules (a rule can supersede another one)
• triggering of a rule from another rule
• etc…
Advantages (2/2)
Common tools, …
• rule creation, update and maintenance
• syntax checking
• non-redundancy checking
• etc…
… storage and access
Various groups of rules
• HAMAP rules (microbes and plastids)
• MitoRules: mitochondrial proteins
• ProRules (complex protein families)
• AnaRules (rules for domains from many programs such as SignalP, TMHMM)
• Rulebase
• PIR rules
• … and maybe groups of rules for plants, yeast, viruses, etc.
UniRule format and central CVS storage will be discussed during the 2nd AAM at the EBI in December
Medical Annotation
Annotation of genetic diseases and polymorphisms in human
Particularities:
• Annotation has to be as complete as possible, implying a large number of retrieved documents
• Only mutations that do not drastically modify the sequence are kept (no stop or frameshift mutations)
Medical Annotation Tool: Specifications
• Query interface adapted to searching for mutation-related articles
• Classification of retrieved documents
• Information extraction from documents
• Mutation position control and Swiss-Prot line generation
Query interface
Classifier Training Corpus
Dataset:
• 2192 abstracts
• from 32 genes
• in three categories:
– “Good”: relevant for annotation (14%)
– “Bad”: irrelevant to annotation (70%)
– “Unclear”: no decision could be made about the abstract’s relevance (16%)
• Used to train a hierarchical probabilistic classifier (in collaboration with XRCE)
[Pie chart: total distribution. Bad 70%, Unclear 16%, Good 14%, Unclassified 0%]
Classifier Architecture
[Diagram: retrieved documents → morpho-syntactic analysis → normalization (mutation points, gene & protein synonyms) → term extraction → feature selection → cascade of categorizers → reordered documents]
Classifier Performance
• “Good”: precision = 49%, recall = 84%
• “Bad”: precision = 96%, recall = 82%
[Figure: classified list evaluation. x-axis: recall point (0% to 100%); y-axis: precision (0% to 100%); legend entries: probabilistic 2 stage classifier, pubmed]
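For reference, precision and recall for one category can be computed as in the sketch below. The labels are toy data chosen for illustration, not the training corpus from the talk, and the function name is hypothetical.

```python
def precision_recall(predicted, actual, label):
    """Compute precision and recall of `label` from parallel label lists."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == label and a == label for p, a in pairs)
    fp = sum(p == label and a != label for p, a in pairs)
    fn = sum(p != label and a == label for p, a in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels, not the corpus from the talk.
actual = ["good", "good", "bad", "bad", "bad", "bad"]
predicted = ["good", "bad", "good", "bad", "bad", "bad"]
good_precision, good_recall = precision_recall(predicted, actual, "good")
bad_precision, bad_recall = precision_recall(predicted, actual, "bad")
```

High recall with moderate precision for “Good”, as reported above, means that few relevant abstracts are missed at the cost of some irrelevant ones reaching the curator.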
The project
• 3-year FP5 European Project, started January 2003
• Official web site: www.biomint.org
• 5 teams involved:
– University of Manchester (UK, coordinator)
– PharmaDM (Belgium)
– Austrian Research Institute for Artificial Intelligence (Austria)
– University of Geneva, AI Lab (Switzerland)
– University of Antwerp, CNTS (Belgium)
– Swiss Institute of Bioinformatics (Switzerland)
The goals of BioMinT
To develop a generic text mining tool that:
– interprets different types of queries
– retrieves relevant documents from the biological literature
– extracts the required information
– outputs the result as a database slot filler or as a structured report
The tool will thus provide two essential research support services:
1. A curator's assistant: it will accelerate, by partially automating, the annotation and update of bio-databases;
2. A researcher's assistant: it will generate readable reports in response to queries from biological researchers.
Architecture of the prototype
A curator’s assistant for proteomic databases
• Swiss-Prot protein sequence knowledgebase
• PRINTS protein « fingerprints » database
Both hand-annotated by trained biologists.
Information retrieval and query management
A semantic meta-query engine, built around legacy search engines of servers such as PubMed, that operates in two steps:
1) An expansion of the initial query with synonyms or related terms derived either from domain ontologies or from existing database entries.
2) A filtering and ranking of documents retrieved from these servers using task-specific heuristics.
Query interface prototype
Developed by Pavel Dobrokhotov in the framework of Swiss-Prot medical annotation: Bioinformatics 19(suppl. 1): i91-i94 (ISMB 2003)
[Screenshot: query interface with a BLAST option and a species selector]
Species: Human, Mouse, Rat, Drosophila, Yeast, Escherichia coli, Bacillus subtilis, A. thaliana, C. elegans
Keyword: query expansions
• Acetoin catabolism: Acetoin & catabolism; Acetoin & degradation; Acetoin & breakdown
• Acetylation: Acetylat*
• Acetylcholine receptor inhibitor: Acetylcholine & receptor & inhibitor
• Actin-binding: Actin binding; Actin & bind*
• Acute phase: Acute-phase
• Acyltransferase: Acyl & transfer*
• ADP-ribosylation: ADP-ribosylat*; ADP & ribosylat*
• Alginate biosynthesis: Alginate & biosynthesis; Alginate & synthesis
• Alkaloid metabolism: Alkaloid & metabolism; caffeine & metabolism; nicotine & metabolism; morphine & metabolism
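Query expansion of the kind listed for "Acetoin catabolism" can be sketched as a product over term alternatives. The function name and the synonym table structure are hypothetical; in the prototype the synonyms are said to come from domain ontologies or existing database entries, which this sketch replaces with a hard-coded dictionary.

```python
from itertools import product


def expand_query(terms, synonyms):
    """Return boolean AND-queries, one per combination of term alternatives.

    `synonyms` maps a term to alternative terms; the original term comes first.
    """
    alternatives = [[term] + synonyms.get(term, []) for term in terms]
    return [" & ".join(combo) for combo in product(*alternatives)]


# Synonyms taken from the acetoin catabolism example.
queries = expand_query(
    ["Acetoin", "catabolism"],
    {"catabolism": ["degradation", "breakdown"]},
)
```

Each resulting string can then be submitted as a separate search, and the retrieved documents merged before filtering and ranking.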
Query result organisation
• Filtering, classification and clustering according to different rules:
– Selecting the articles with the target protein as primary subject:
  · Key phrases: cloned <X>, <X> was characterised, isolate(d) <X>, <X> (is/,) a new protein, identify(ied) <X>
  · Frequency of gene/protein name and synonyms in the abstract
– Journal / journal type / publication date
– Same lab/authors
– Species
– Articles on key annotation topics: PTMs, mutations, function, diseases
• Results presented to the curator by information type
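The "primary subject" heuristic can be sketched with regular expressions for the key phrases plus a count of name mentions. The weights (three points per key phrase, one per mention), the function name, the protein name `ABC1` and the abstracts are all invented for illustration; the talk does not specify how the criteria are combined.

```python
import re


def primary_subject_score(abstract, names):
    """Score how likely the abstract has the target protein as primary subject.

    `names` holds the gene/protein name and its synonyms. Key phrases count
    three points each and plain mentions one point (arbitrary weights).
    """
    score = 0
    for name in names:
        x = re.escape(name)
        patterns = [
            rf"\bcloned\s+{x}\b",
            rf"\b{x}\s+was\s+characteri[sz]ed\b",
            rf"\bisolated?\s+{x}\b",
            rf"\b{x},?\s+(?:is\s+)?a\s+new\s+protein\b",
            rf"\bidentif(?:y|ied)\s+{x}\b",
        ]
        score += 3 * sum(len(re.findall(p, abstract, re.IGNORECASE))
                         for p in patterns)
        score += len(re.findall(rf"\b{x}\b", abstract, re.IGNORECASE))
    return score


# Invented abstracts using a placeholder protein name.
abstracts = [
    "Several kinases were surveyed, including ABC1.",
    "We cloned ABC1 and show that ABC1 is a new protein.",
]
ranked = sorted(abstracts,
                key=lambda text: primary_subject_score(text, ["ABC1"]),
                reverse=True)
```

The second abstract ranks first because it contains two key phrases in addition to two mentions of the name.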
Information extraction
• Based on data mining software using Inductive Logic Programming (ILP).
• Using basic text analysis tools developed by CNTS and Tilburg University, namely a Memory-Based Shallow Parser (MBSP) based on a machine learning package (TiMBL)
• Using biomedical terminology resources, publicly available or provided by end users.
Topics for information extraction
• Function(s) and role(s); enzymes:
a. Catalytic activity (if EC number)
b. Cofactor
c. Enzyme regulation
d. Pathway
• Subunit (protein/protein interactions)
• Subcellular location
• Alternative products (alt. splicing, alt. initiation, RNA editing)
• Tissue specificity (Northern and Western results)
• Developmental stage
• Induction
• Domain
• Post-translational modifications (PTM)
• Mass spectrometry
• Polymorphisms
• Disease
• Biotechnology
• Pharmaceutical
• Miscellaneous
• Similarities
• Caution
• Database (specialized cross-references)
Benchmark environment for training and evaluation
We need a corpus of supervised abstracts:
• To train the text-mining tools
• To elaborate rules for specific information extraction
What do we need to tag?
• Fragments of sentences, or whole sentences, describing information useful for protein annotation
• Specific words describing a specific type of information
Textual annotation edition
• Currently CRISP: text editor enhanced with in-house macros
• Development of Spedit: an XML editor dedicated to Swiss-Prot annotation
• Complete integration of sequence analysis and text-mining tools in a single graphical user interface
Technical issues
• Conversion to a relational database on Oracle;
• Production of an XML format;
• Links to GO terms;
• Update of PDB links;
• Standardisation of a number of topics of CC lines;
• Reformatting of GN line and ‘ALTERNATIVE PRODUCTS’ CC topic;
• Controlled vocabulary for PTMs in FT lines;
• Ongoing conversion to mixed case lines;
• Enhancement of Swiss-Prot entry views.
VARIANT 34 34 A->E (IN LDHB DEFICIENCY). /FTId=VAR_004174
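A feature line such as the VARIANT example can be split into structured fields with a regular expression, which is one way to prepare such data for a relational database. The pattern and field names below are assumptions derived from this single line (without the leading FT line code); real Swiss-Prot FT lines have more variations than the sketch covers.

```python
import re

VARIANT_RE = re.compile(
    r"VARIANT\s+(?P<start>\d+)\s+(?P<end>\d+)\s+"
    r"(?P<from_aa>[A-Z]+)\s*->\s*(?P<to_aa>[A-Z]+)"
    r"(?:\s+\((?P<note>[^)]*)\))?\.?"
    r"(?:\s+/FTId=(?P<ftid>VAR_\d+))?"
)


def parse_variant(line):
    """Extract position, residue change, note and FTId from a VARIANT line."""
    match = VARIANT_RE.search(line)
    if match is None:
        raise ValueError("not a VARIANT feature line")
    fields = match.groupdict()
    fields["start"] = int(fields["start"])
    fields["end"] = int(fields["end"])
    return fields


variant = parse_variant(
    "VARIANT 34 34 A->E (IN LDHB DEFICIENCY). /FTId=VAR_004174"
)
```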
Database Modelling
THE MODSNP DATABASE
Effects of Variations in Human Proteins on Local Substructure and Functionality
Workflow: Populate, Assess, Evaluate & Analyse Effects of Point Mutations on 3D-Structures
[Entity-relationship diagram; tables and fields as far as they can be recovered, linked by one-to-many (1:n) relations]
• sp_entries: AC, ID, 3Dflag, .-dis
• sp_sequences: isoid, AC (FK), Sequence, Checksum
• sp_variants: Ftid, Type, seqpos, Fromaa, Toaa, Dis_status, isoflag
• relseqft: Isoid (FK), Ftid (FK)
• relseqpdb: isoid (FK), chainid (FK), Alid, Evalue, Seqfrom, Seqto, Is_selected
• chsaw-chains: PdbCode, Residues, Amino acids, Heterogens, Solvent, sequence, organism, Compound, eccode, chainid
• models: mid, filename, Type, Isoid (FK), Ftid (FK), Chainid (FK), Builtversion, Method, Created_date
• struct_info: Tempid, Scop-id, Cath-id
• diseases: ftid, omimid, Full-dis, Abbrev-dis
• pdb: Code, deposited, experiment, resolution, header, title, revdat
• analysis: local substructure analysis and many others
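A small part of such a schema can be sketched with Python's built-in sqlite3 module (the talk mentions Oracle; SQLite is swapped in here only to keep the example self-contained). Table and column names follow the diagram where they are legible, column types are assumptions, and the accession, entry name and isoform identifier are placeholders. The variant values are those of the VARIANT line shown on the technical issues slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sp_entries (ac TEXT PRIMARY KEY, id TEXT);
CREATE TABLE sp_sequences (
    isoid TEXT PRIMARY KEY,
    ac TEXT REFERENCES sp_entries(ac),
    sequence TEXT,
    checksum TEXT
);
CREATE TABLE sp_variants (
    ftid TEXT PRIMARY KEY,
    type TEXT,
    seqpos INTEGER,
    fromaa TEXT,
    toaa TEXT,
    dis_status TEXT
);
-- Link table: which variant applies to which sequence isoform.
CREATE TABLE relseqft (
    isoid TEXT REFERENCES sp_sequences(isoid),
    ftid TEXT REFERENCES sp_variants(ftid)
);
""")

# Placeholder entry and isoform identifiers.
conn.execute("INSERT INTO sp_entries VALUES ('X00001', 'EXAMPLE_HUMAN')")
conn.execute("INSERT INTO sp_sequences VALUES ('X00001-1', 'X00001', NULL, NULL)")
conn.execute(
    "INSERT INTO sp_variants VALUES ('VAR_004174', 'VARIANT', 34, 'A', 'E', NULL)"
)
conn.execute("INSERT INTO relseqft VALUES ('X00001-1', 'VAR_004174')")

rows = conn.execute("""
    SELECT e.ac, v.ftid, v.seqpos, v.fromaa, v.toaa
    FROM sp_entries e
    JOIN sp_sequences s ON s.ac = e.ac
    JOIN relseqft r ON r.isoid = s.isoid
    JOIN sp_variants v ON v.ftid = r.ftid
""").fetchall()
```

The join walks the 1:n relations from entry to sequence to variant, which is the path a structural analysis would follow to map a variant onto a given isoform.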
Acknowledgements
HAMAP: Alexandre Gattiker, Karine Michoud, Virginie Lesaux, Corinne Lachaize, Anne Morgat, Isabelle Phan
ANABELLE: Brigitte Boeckmann, Alexandre Gattiker, Xavier Martin, Nicolas Hulo, Christian Sigrist, Silvia Braconi
TEXT MINING: Pavel Dobrokhotov, Eric Gaussier (XRCE), Cyril Goutte (XRCE), Violaine Pillet, Marc Zehnder
EDITOR: Alain Gateau, Stéphanie Federico, Brigitte Boeckmann
VARIANT MAPPING: Holger Scheib, Lina Yib
http://www.expasy.org/people/swissprot.html#people
[Photo: SWISS-PROT group at ISB and EBI, Barcelona 2002]
Amos Bairoch, ISB