Improving Curation Efficiency: User Contributions and Textpresso-Based Semi-Automation

Page 1: Improving Curation Efficiency: User Contributions and Textpresso-Based Semi-Automation

Improving Curation Efficiency: User Contributions and Textpresso-Based Semi-Automation

SAB 2008

WormBase Literature Curators Textpresso

Page 2: How does data get into WormBase?

User submission (email, web forms)

First-pass curation

Institution: Sanger Institute

SUBMITTED FROM PAGE: http://www.wormbase.org/db/seq/gbrowse/elegans/

COMMENT TEXT: Dear WormBase, I think that WormBase may be missing a gene between Y50E8A.6 and Y50E8A.7......

Page 3: Current first-pass curation pipeline

Publication

Flagging/Triage

Curation

Page 4: User submissions: first-pass flagging/triage

Growing desire among biocurators for user submissions

The authors are the first people to know what data are in a paper

TAIR – partnered with Plant Physiology; web interface for data submission (February 2008); voluntary, link included in acceptance letter

Submitter

email

Paper identifier

Locus name

Term/descriptor, method
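
The fields named on this slide (submitter, email, paper identifier, locus name, term/descriptor, method) could be captured as a simple record with basic validation. The following is a minimal sketch only: the class name, the validation rules, and the example values other than the WBPaper identifier are assumptions for illustration, not the actual TAIR or WormBase form.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class FirstPassSubmission:
    """Hypothetical record; field names mirror the slide, not a real schema."""
    submitter: str
    email: str
    paper_id: str      # e.g. a WBPaper identifier
    locus_name: str
    term: str          # term/descriptor for the data being flagged
    method: str = ""   # optional in this sketch (an assumption)

    def problems(self) -> List[str]:
        """Return human-readable validation problems (empty list if none)."""
        issues = []
        for name in ("submitter", "email", "paper_id", "locus_name", "term"):
            if not getattr(self, name).strip():
                issues.append(f"missing {name}")
        # Deliberately loose check: exactly one '@' with text on both sides.
        email = self.email.strip()
        if email and not re.fullmatch(r"[^@\s]+@[^@\s]+", email):
            issues.append("email looks malformed")
        return issues


ok = FirstPassSubmission("A. Author", "author@example.org",
                         "WBPaper00031242", "eor-2", "genetic interaction")
print(ok.problems())
```

A curator-side script could reject or queue submissions based on the returned list rather than raising on the first error, so the submitter sees every problem at once.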

Page 5: User-submitted first-pass flags - WormBase

Page 6: User data-submission forms: Expression Pattern

Page 7: Data extraction: Textpresso

Full-text searching

Keywords and/or categories

Müller, Kenny, and Sternberg. PLoS Biology, November 2004.
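
To make "keywords and/or categories" concrete, here is a rough sketch of a sentence-level search in which a category is a named set of terms. It is not Textpresso's implementation: the tiny `CATEGORIES` lexicon, the naive sentence splitter, and the function names are all assumptions, and the second example sentence is invented.

```python
import re

# Hypothetical category lexicon; real categories would be far larger.
CATEGORIES = {
    "transgene": {"gqIs3", "gqIs35", "oxIs12"},
    "cell_membrane": {"membrane", "membranes"},
}


def split_sentences(text):
    """Naive splitter: break after '.', '?' or '!' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]


def tokens(sentence):
    """Words made of letters, digits, ':' and '-' (keeps names like GAR-3::YFP)."""
    return re.findall(r"[A-Za-z0-9:\-]+", sentence)


def search(text, keywords=(), categories=()):
    """Sentences containing every keyword AND a term from every category.

    Keywords match case-insensitively; category terms match exactly,
    because names such as oxIs12 are case-sensitive.
    """
    hits = []
    for sentence in split_sentences(text):
        words = tokens(sentence)
        lowered = {w.lower() for w in words}
        if not all(k.lower() in lowered for k in keywords):
            continue
        if not all(CATEGORIES[c] & set(words) for c in categories):
            continue
        hits.append(sentence)
    return hits


text = ("Localizations of GAR-3::YFP on the cell membranes are denoted by arrows. "
        "The strain carries oxIs12.")
print(search(text, keywords=["arrows"], categories=["cell_membrane"]))
print(search(text, categories=["transgene"]))
```

Passing only `keywords`, only `categories`, or both covers the "and/or" on the slide.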

Page 8: Textpresso: What data types?

Paper – entity association: pattern matching

Transgenes (Wen): WBPaper00031242 – gqIs3, gqIs35, oxIs12

Fact extraction: specialized categories

Genetic interactions (Andrei): eor-2(op166) suppresses HSN death in the strong tra-1(e1099) background, but not noticeably in the weaker tra-1(e1076) background.

GO cellular component curation (Kimberly): ...positions of these neurons are indicated with circles and localizations of GAR-3::YFP on the cell membranes are denoted by arrows.
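
Paper-to-transgene association by pattern matching might look roughly like the sketch below. The regular expression is an assumption modelled on the names shown on this slide (a short lowercase prefix, then "Is" or "Ex", then digits); it is not the pattern the curators actually use, and the paper text here is invented.

```python
import re

# Assumed naming pattern: 1-3 lowercase letters, "Is" or "Ex", then digits.
TRANSGENE_RE = re.compile(r"\b[a-z]{1,3}(?:Is|Ex)\d+\b")


def paper_transgene_links(papers):
    """Map each paper ID to the sorted, de-duplicated transgene-like names in its text."""
    links = {}
    for paper_id, text in papers.items():
        found = sorted(set(TRANSGENE_RE.findall(text)))
        if found:
            links[paper_id] = found
    return links


papers = {
    # Made-up sentence; only the ID and transgene names come from the slide.
    "WBPaper00031242": "Strains carrying gqIs3, gqIs35 and oxIs12 were crossed; "
                       "gqIs3 was outcrossed.",
    "no-transgene-paper": "This paper reports no transgenes.",
}
print(paper_transgene_links(papers))
```

Because a pattern like this can also match unrelated strings, a manual check of the proposed connections remains useful.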

Page 9: Textpresso-mediated CC curation: from sentences to annotations

Page 10: Textpresso: How much data?

Transgenes: 1,100 new paper-transgene connections; 250 new transgenes

checked manually – 95% accuracy; ultimately, connections will go directly into database

Genetic Interactions: 1,875 (1/2007 – 5/2008); ~5,600 total interactions; keeping current with new papers

GO Cellular Component Annotations: 515 (1/2007 – 5/2008); 2-3X rate prior to categories; nearly complete; keeping up with new data (1-2 hours/week)
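
For a sense of scale, the counts on this slide can be turned into approximate monthly rates. This is back-of-the-envelope arithmetic, assuming "1/2007 – 5/2008" means January 2007 through May 2008 inclusive (17 calendar months); the slide itself does not state monthly figures.

```python
def months_inclusive(start, end):
    """Calendar months from (year, month) start to end, counting both endpoints."""
    (y0, m0), (y1, m1) = start, end
    return (y1 - y0) * 12 + (m1 - m0) + 1


# Assumption: both January 2007 and May 2008 count as full months.
period = months_inclusive((2007, 1), (2008, 5))

genetic_interactions = 1875   # from the slide
go_cc_annotations = 515       # from the slide

print(f"period: {period} months")
print(f"genetic interactions per month: {genetic_interactions / period:.1f}")
print(f"GO CC annotations per month: {go_cc_annotations / period:.1f}")
```

Under that assumption the rates come to roughly 110 genetic interactions and 30 GO cellular component annotations per month.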

Page 11: Textpresso: Other data types

How else can we use Textpresso?

Other data types: Molecular Function Assays, Gene Product Interactions

Pilot: GO molecular function annotations for protein kinase activity

keyword: phosphorylate

category: C. elegans proteins

13 new GO annotations/hour

Extension of this: protein modifications – not yet captured in WB

Pilot: Gene product interactions for WB and BIND

keywords: physically interact

category: C. elegans proteins

310 matches in 237 documents

22 physical interactions – top 15 papers
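
The gene product interaction pilot reports 310 matches in 237 documents (about 1.3 matches per document) and 22 interactions from the "top 15 papers". The slide does not say how the top papers were chosen; one plausible approach, sketched here as an assumption, is to rank documents by their number of matching sentences and curate from the top. The function name and the document IDs are hypothetical.

```python
from collections import Counter


def rank_documents(matches, top_n=15):
    """Rank documents by number of matching sentences.

    matches: iterable of (document_id, sentence) pairs from a search.
    Returns up to top_n (document_id, count) pairs, highest count first;
    ties are broken by document ID so the order is reproducible.
    """
    counts = Counter(doc_id for doc_id, _ in matches)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Invented matches, for illustration only.
matches = [
    ("docB", "sentence 1"), ("docA", "sentence 2"), ("docB", "sentence 3"),
    ("docC", "sentence 4"), ("docC", "sentence 5"),
]
print(rank_documents(matches, top_n=2))
```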

Page 12: Textpresso for triage: Classifying text based on content

Multiple strategies (using existing first-pass papers as training set):

Organismal triage – C. elegans, Drosophila

Identify, prioritize information-rich papers

Flag for specific data types

Multiple levels:

Machine learning – SVM (Support Vector Machine)

Word frequency analysis

Hand-crafted categories

Combine SVM and categories

Supplement with word weighting, contextual analyses
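
As an illustration of the word-frequency strategy with word weighting, the sketch below learns a smoothed log-odds weight per word from already-flagged versus unflagged text and flags new text whose summed weight is positive. It is a deliberately simple stand-in, not an SVM and not the curators' method; the training sentences, function names, and threshold are assumptions.

```python
import math
import re
from collections import Counter


def words(text):
    """Lowercased tokens of letters, digits and hyphens (keeps names like tra-1)."""
    return re.findall(r"[a-z0-9\-]+", text.lower())


def train_word_weights(flagged, unflagged):
    """Smoothed log-odds per word; positive means more typical of flagged text."""
    pos = Counter(w for t in flagged for w in words(t))
    neg = Counter(w for t in unflagged for w in words(t))
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + len(vocab)   # add-one smoothing
    neg_total = sum(neg.values()) + len(vocab)
    return {
        w: math.log((pos[w] + 1) / pos_total) - math.log((neg[w] + 1) / neg_total)
        for w in vocab
    }


def score(text, weights):
    """Sum of weights over the text's words; unseen words contribute nothing."""
    return sum(weights.get(w, 0.0) for w in words(text))


def flag(text, weights, threshold=0.0):
    """True if the text scores above the (assumed) threshold."""
    return score(text, weights) > threshold


# Invented training set standing in for existing first-pass papers.
flagged = ["eor-2 suppresses HSN death in the tra-1 background",
           "the mutation enhances the phenotype of the double mutant"]
unflagged = ["the protein localizes to the cell membrane",
             "expression was observed in neurons"]
weights = train_word_weights(flagged, unflagged)
print(flag("tra-1 suppresses the phenotype", weights))
```

Scores like these could be combined with category hits or a trained classifier's output, in the spirit of the "combine SVM and categories" item.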

Page 13: Keeping better track of curation statistics.....

Page 14: .....and making curation statistics more transparent to users.

Users could search for curation status of any paper

Users could search for curation status of a given data type

Each database release would report newly curated papers

Each database release would document increases in data-type curation
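
The four proposals above amount to a small set of queries over curation records. A toy in-memory sketch follows; the class, its method names, the data-type labels, and the release labels "R1"/"R2" are all hypothetical and do not reflect WormBase's actual database.

```python
from collections import defaultdict


class CurationTracker:
    """Toy tracker: which data types were curated for which paper, and when."""

    def __init__(self):
        # paper_id -> {data_type: release in which it was curated}
        self._curated = defaultdict(dict)

    def mark_curated(self, paper_id, data_type, release):
        self._curated[paper_id][data_type] = release

    def paper_status(self, paper_id):
        """Curation status of one paper: data type -> release."""
        return dict(self._curated.get(paper_id, {}))

    def data_type_status(self, data_type):
        """Sorted IDs of papers curated for a given data type."""
        return sorted(p for p, types in self._curated.items() if data_type in types)

    def release_report(self, release):
        """Papers with new curation in a release, plus counts per data type."""
        papers = set()
        counts = defaultdict(int)
        for paper_id, types in self._curated.items():
            for data_type, rel in types.items():
                if rel == release:
                    papers.add(paper_id)
                    counts[data_type] += 1
        return {"papers": sorted(papers), "by_data_type": dict(counts)}


tracker = CurationTracker()
tracker.mark_curated("WBPaper00031242", "transgene", "R1")
tracker.mark_curated("WBPaper00031242", "genetic_interaction", "R2")
print(tracker.paper_status("WBPaper00031242"))
print(tracker.release_report("R2"))
```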

Page 15: WormBase Literature Curation

Gene Symbols, Alleles, Sequence Features, Mapping Data: Mary Ann Tuli, Sanger

Gene Function: Concise Descriptions, Gene Ontology: Ranjana Kishore, Caltech; Erich Schwarz, Caltech; Kimberly Van Auken, Caltech

Mutant Phenotypes (RNAi and Alleles): Igor Antoshechkin, Caltech; Jolene Fernandez, Caltech; Raymond Lee, Caltech; Gary Shindelman, Caltech; Karen Yook, Caltech

First Pass, Genetic Interactions: Andrei Petcherski, Caltech

Gene Regulation, PWMs: Xiaodong Wang, Caltech; Erich Schwarz, Caltech

Expression Patterns, Antibodies, Transgenes: Wen Chen, Caltech

Anatomy Ontology, Cell Function: Raymond Lee, Caltech

Microarrays, SAGE: Igor Antoshechkin, Caltech

Sequence, Gene Structures: Sanger, Wash U

Authors, Papers: Cecilia Nakamura, Daniel Wang

Curation Tools, Database: Juancarlos Chan, Caltech