
Developing a Curator Assistant for Functional Analysis of Genome Databases Requesting $1,451,005 from NSF BIO Advances in Biological Informatics, August 2009

PI: Bruce Schatz, Institute for Genomic Biology, University of Illinois at Urbana-Champaign
coPI: ChengXiang Zhai, Computer Science, University of Illinois, Language Technology
coPI: Susan Brown, Biology, Kansas State University, Arthropod Base Consortium (BeetleBase)
coPI: Donald Gilbert, Bioinformatics, Indiana University, Community Annotation (wFleaBase)

Intellectual Merit

The advent of next-generation sequencing is rapidly decreasing the cost of genomes. Projections are that costs will decrease from $1M to $100K to $10K in the next 5 to 10 years. As a result, the number of arthropod genomes will increase from 10 to 1000 to 10,000. This shifts the major limitation from sequencing to annotating. The current level of annotation is recognizing genes from sequences, rather than understanding the function of genes.

Traditionally, functional analysis has been performed by human curators who read the biological literature to provide evidence for a genome database of gene function such as FlyBase. To functionally analyze a genome, biologists instead develop an ortholog pipeline, in which orthologs are computed for the genes of their organism and used to find the most similar gene in a model organism database. This process is inexpensive but inaccurate compared to manual curation.

We propose to develop a Curator Assistant that will enable the communities that are generating genomes to analyze the function of their genes by themselves. While the model organism databases (MODs) have groups of curators, subsequent genome databases have struggled to find funding for even a single human curator. Such bases will have to be curated by the communities themselves, by community biologists using software infrastructure to help them extract functions from community literature. Within the Arthropod Base Consortium (ABC), for example, only FlyBase is a MOD with professional curators.

During the NSF-funded BeeSpace project, we developed prototype software for automatically extracting entities and relations from biological literature. The entities include genes, anatomy, and behavior, while the relations include interaction (gene-gene), expression (gene-anatomy), and function (gene-behavior). These entities and relations can be used to populate relational tables to build a genome database. Our prototype works on Drosophila literature and leverages FlyBase, the MOD for the ABC. Our techniques appear general enough for all arthropods.

We propose to develop a full-fledged Curator Assistant that fully utilizes machine learning technologies for natural language processing. These include community dictionaries, heuristic procedures, and training sets. Given a community collection of relevant literature, the assistant software suggests candidate relations from which the community biologists can select. Providing this additional knowledge is much easier than reading the biological literature, and mechanisms are provided to specify the desired level of quality and to revise the information itself.

Broader Impact

Our project has been organized via the annual Symposium of the Arthropod Base Consortium. Our investigators include the BeeSpace PI for informatics and the Symposium organizer for biology, representing arthropod genomes in particular and animal genomes in general. Our project will develop language technology for entity-relation semantics into usable infrastructure and distribute it through GMOD, which already provides the sequence support used by ABC. We will develop the standards for literature support for customized extraction and curation, including practical deployment to a distributed community of NSF-funded genome biologists.

Developing a Curator Assistant for Functional Analysis of Genome Databases

PI: Bruce Schatz, Institute for Genomic Biology, University of Illinois at Urbana-Champaign
coPI: ChengXiang Zhai, Computer Science, University of Illinois, Language Technology
coPI: Susan Brown, Biology, Kansas State University, Arthropod Base Consortium (BeetleBase)
coPI: Donald Gilbert, Bioinformatics, Indiana University, Community Annotation (wFleaBase)

1. GENOME SEQUENCING AND BIOCURATION

The advent of next-generation sequencing is rapidly decreasing the cost of genomes. Projections are that costs will decrease from $1M to $100K to $10K in the next 5 to 10 years. As a result, the number of arthropod genomes will increase from 10 to 1000 to 10,000. This shifts the major limitation from sequencing to annotating. The current level of annotation is recognizing genes from sequences, rather than understanding the function of genes.

Traditionally, functional analysis has been performed by human curators who read the biological literature to provide evidence for a genome database of gene function such as FlyBase. To functionally analyze a genome, biologists instead develop an ortholog pipeline, in which orthologs are computed for the genes of their organism and used to find the most similar gene in a model organism database. This process is inexpensive but inaccurate compared to manual curation.

We propose to develop a Curator Assistant that will enable the communities that are generating genomes to analyze the function of their genes by themselves. While the model organism databases (MODs) have groups of curators, subsequent genome databases have struggled to find funding for even a single human curator. Such bases will have to be curated by the communities themselves, by community biologists using software infrastructure to help them extract functions from community literature. Within the Arthropod Base Consortium (ABC), for example, FlyBase is the only MOD with professional curators.

During the NSF-funded BeeSpace project, we developed prototype software for functional analysis [33], automatically extracting entities and relations from biological literature. The entities include genes, anatomy, and behavior, while the relations include interaction (gene-gene), expression (gene-anatomy), and function (gene-behavior). These entities and relations can be used to populate relational tables to build a genome database. Our prototype currently works on Drosophila literature and leverages FlyBase, the MOD for the ABC. Our techniques are general enough for all arthropods, a taxon of central importance to NSF biologists.

We propose to develop a full-fledged Curator Assistant that fully utilizes machine learning technologies for natural language processing. These include community dictionaries, heuristic procedures, and training sets. Given a community collection of relevant literature, the assistant software suggests candidate relations from which the community biologists can select. Providing this additional knowledge is much easier than reading the biological literature, and mechanisms are provided to specify the desired level of quality and to revise the information itself.

We will debug the new system on the existing bases such as BeetleBase and wFleaBase, then deploy more widely to the full bases of the Arthropod Base Consortium as it grows. The software will be general enough to be widely applicable to genome databases. We will use the GMOD (Generic Model Organism Database) consortium as the distribution mechanism for our literature curation software, to complement their existing software for sequence curation.

2. GENOME DATABASES AND BIOCURATION

The Curator Assistant will initially focus on arthropod genomes, as organisms of central interest to NSF. At least half of the described species of living animals are arthropods (jointed legs, mostly insects), species of great scientific interest for molecular genetics and evolutionary synthesis. The Arthropod Base Consortium (ABC) has been meeting quarterly for the past 4 years to discuss their needs for genome databases and data analysis. The inner circle has about 40 scientists, who hold workshops at the major community sites. The outer circle has about 400 scientists, who attend the Annual Symposium [www.k-state.edu/agc/symposium.shtml]. This community consortium currently includes some 10 resource genomes, including insects important biologically (bee, beetle, butterfly, aphid), crustaceans important ecologically (water flea), and vectors important for human diseases (mosquito, tick, louse).

There is a reference genome for this community, the fruit fly Drosophila melanogaster, which has been a genetic model for 100 years. As the model insect, Drosophila is important enough to justify a 40-person staff at FlyBase, who manually curate this model organism database (MOD). Through a close collaboration, the FlyBase literature curation process is serving as the model for our semantic indexing of biological literature; see Figure 1 below.

The first wave of genomes was of the model genetic organisms, which already had Bases with human curators. For the arthropods, the only MOD is FlyBase for the insect Drosophila melanogaster. The second wave of genomes did not have decades of genetics behind them, but attempted to jumpstart with genome sequencing. For arthropods, these include the insects honey bee and flour beetle, both important scientifically and agriculturally. The corresponding bases, e.g. BeeBase and BeetleBase, were able to gain modest funding, but not for professional curators, only for postdocs and programmers. Such resources thus went into annotating genes of particular interest (small numbers) or support of automatic processing (large numbers).

With the third wave, the sequencing is still done at genome centers, but no attempts are made at manual curation. These Bases, e.g. wFleaBase for Daphnia, spend their limited resources on community annotation and computation. Beyond the third wave, the sequencing is being done at campus centers rather than national centers and any curation is done automatically with quality enhancement by the community itself. Within ABC, ButterflyBase and AphidBase are down this path and will be working with our group as their genomes mature. The 10,000 arthropod genomes expected in the next decade will all be in the post-curator era.

From a technology standpoint, this implies that the Curator Assistant must support variable levels of quality because different bases from different waves will do different amounts of post-assistant quality improvement. With many curators, the system should generate many candidates that can be manually checked by human experts. With few curators, the system should generate few candidates for manual checking, thus higher precision and lower recall. With no curators, the system should generate highest precision “correct” entries, which are annotated by the community itself using collaboration technology. In the preliminary work performed in the BeeSpace project described below, we developed prototype services tuned towards recall and towards precision, indicating feasibility of developing a fully tunable system for curation quality.

What’s in a Base: An Examination of FlyBase

For many reasons, several of the fields in FlyBase use structured controlled vocabularies (aka ontologies). This makes it much easier (and more robust) to make links within the database, as well as making it much easier to search the database for information. Moreover, several of these controlled vocabularies are shared with other databases, and this provides a degree of integration between them. The controlled vocabularies are only implemented in certain fields in FlyBase. The initial literature selection is done at FlyBase at Cambridge University, while the bulk of the literature curation is done at FlyBase at Harvard University, to populate the gene models in the database from highlighted facts in the literature articles [8].

Controlled vocabularies currently used by FlyBase are [www.flybase.org]:

The Gene Ontology (GO). This provides structured controlled vocabularies for the annotation of gene products (although FlyBase annotates genes with GO terms, as a surrogate for their products). The GO has three domains: the molecular function of gene products, the biological process in which they are involved, and their cellular component.

Anatomy. A structured controlled vocabulary of the anatomy of Drosophila melanogaster, used for the description of phenotypes and where a gene is expressed.

Development. A structured controlled vocabulary of the development of Drosophila melanogaster, used for the description of phenotypes and when a gene is expressed.

The Sequence Ontology (SO). A structured controlled vocabulary for sequence annotation, for the exchange of annotation data, and for the description of sequence objects in databases. With it, FlyBase describes the genome in a consistent and rigorous manner.

All of these structured controlled vocabularies are in the same format, that used by the Open Biomedical Ontology group. This format is called the OBO format [www.obo.org].

These controlled vocabularies focus on the most important types of data for genome databases, namely “gene”, “anatomy”, and types of “function” such as “development” [37]. The factoids in the official database are relations on these datatypes, such as Interaction (gene-gene), Expression (gene-anatomy), and Function (gene-development). When a FlyBase curator records a factoid, they also record the type of evidence that enables them to judge its correctness. The list for genes is given below. Note that this implies that even manual curation includes different factoids at different qualities: whether a relation is held to be true depends on the level of evidence chosen.

The Gene Ontology Guide to GO Evidence Codes contains comprehensive descriptions of the evidence codes used in GO annotation. FlyBase uses the following evidence codes when assigning GO data: inferred from mutant phenotype (IMP), inferred from genetic interaction (IGI), inferred from direct assay (IDA), inferred from physical interaction (IPI), inferred from expression pattern (IEP), inferred from sequence or structural similarity (ISS), inferred from electronic annotation (IEA), inferred from reviewed computational analysis (RCA), traceable author statement (TAS), non-traceable author statement (NAS), inferred by curator (IC), and no biological data available (ND). Note that some of these are observational and some computational.
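As a concrete illustration of factoids carrying evidence codes, here is a minimal sketch of how such evidence-coded relation tables might look; the table layout, column names, and sample row are our own invention for illustration, not the actual FlyBase or CHADO schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One table per relation type; every factoid row carries its GO evidence code.
cur.executescript("""
CREATE TABLE expression  (gene TEXT, anatomy TEXT, evidence TEXT, source TEXT);
CREATE TABLE interaction (gene_a TEXT, gene_b TEXT, evidence TEXT, source TEXT);
""")

# A hypothetical curated factoid: expression inferred from expression pattern (IEP).
cur.execute("INSERT INTO expression VALUES (?, ?, ?, ?)",
            ("wingless", "wing imaginal disc", "IEP", "placeholder-paper-id"))

# A curator tuning for quality might accept only the observational codes.
observational = ("IMP", "IGI", "IDA", "IPI", "IEP")
query = ("SELECT gene, anatomy, evidence FROM expression WHERE evidence IN (%s)"
         % ",".join("?" * len(observational)))
print(cur.execute(query, observational).fetchall())
```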

3. CURATOR ASSISTANT SYSTEM

Biocuration [17] is the process of extracting facts from the biological literature to populate a database about gene function. The curators at the Model Organism Databases (MODs) read papers from the scientific literature relevant to their organism and extract facts judged to be correct, which are then used to populate the structured fields of their genome database. There are currently 10 reference genomes, each with its own group of curators. These groups are falling behind with the current scale of literature, and new resource genomes are being denied custom curator support due to financial limitations.

In the 5-year BeeSpace project just ending with NSF BIO FIBR funding, we have been working closely with FlyBase curators to better understand what can be automated within the biocuration process. We are fortunate in collaborating with John MacMullen from the Graduate School of Library and Information Science, who specializes in studying the process of biocuration by analyzing the detailed activities of MOD curators. He is analyzing the curator annotations in FlyBase, among others, by examining which sentences are highlighted in the texts and which database entries are inferred from these. Through the BeeSpace project, we also work with the many curators at the FlyBase project under PI William Gelbart at Harvard University and the few curators at the BeeBase project under PI Christine Elsik at Georgetown University.

The Group Manager at FlyBase-Cambridge (England), Steven Marygold, provided the Figure below giving the steps in the FlyBase curation process. He spoke at the ABC working meeting in December 2007, hosted at our project home site in the Institute for Genomic Biology at the University of Illinois; slides at www.beespace.uiuc.edu/files/Marygold-ABC.ppt.

Figure 1. FlyBase Literature Curation Process Diagram [27].

The automatic process set up in the Curator Assistant is modeled after this manual process. The user could be a full biocurator or a community research biologist, each tuning the system differently to their needs. They search the literature to choose articles. The manual curator can only choose tens of articles to skim, but the assisted curator can choose thousands of articles to be automatically skimmed. The BeeSpace system that the Curator Assistant leverages contains powerful services for choosing collections well targeted to the particular purpose, including searching and clustering. The major strength of the automatic system is breadth: it can cover a much wider selection of the available literature than humans can. In demonstrating the prototype to many curators at the Arthropod Genomics Symposium, even the most professional curators spoke longingly of having an automatic system to filter candidates, in order to cope with the full range of biological literature.

The Curator Assistant will focus on the middle of the diagram, the central core of the curation process. This process highlights the curatable material and then performs curation: essentially, finding sentences with functional information and extracting the facts that those sentences describe. For example, two genes interact with each other (Interaction), a gene is expressed in a specific part of the anatomy (Expression), or a gene regulates a particular behavior (Function). Key information is usually contained within the abstract, which is why our current services are effective even though they cover only Medline and Biological Abstracts. The manual curators have the advantage of reading the fulltext, so we will also be gathering fulltext systematically for our community, through the collaboration technology described below.

For the bottom of the diagram, the Curator Assistant will also support error checking of different kinds by the community curators themselves and by the community biologists themselves, as described in the later section on Community Annotation and Curation. Finally, through an arrangement with the GMOD consortium (Generic Model Organism Database software), who support the GBrowse genome sequence displayer and the CHADO database schema format, we will be distributing our literature infrastructure software to the broader genome community to supplement the existing sequence infrastructure software. The concluding section below on Organization and Schedule contains further details on GMOD relations.

The underlying system uses natural language processing to extract relevant entities and relations automatically from relevant literature. An entity is a noun phrase representing a community datatype, e.g. a gene name or body part. A relation is a verb phrase representing the action performed by an entity, e.g. gene A regulates behavior B in organism C. Many projects extract entities and relations using template rules for a particular domain. The BeeSpace project pioneered trained adaptive entity recognition, where sample sentences are used to train the recognizer for particular entities with high accuracy, and software adapts the training to related domains automatically [18,19]. We will be leveraging this NSF BIO project, which ends in August 2009, before the proposed project would begin. We also leverage our previous NSF research in digital libraries on interactive support for literature curation [4,22].

The first prototype within the BeeSpace system has already become a production service, with a streamlined v4 interface available at www.beespace.uiuc.edu. The Gene Summarizer was the subject of an accepted plenary talk at the 2nd International Biocurator Meeting in San Jose in October 2007 [34]. The Gene Summarizer has two stages: the first highlights the curatable materials, while the second curates these materials into a usable interactive form [25,26]. The highlighting is tuned for recall, so that sentences containing gene names are automatically extracted from the literature abstracts, where the entity “gene” is broadly recognized, including genes, proteins, and gene-like descriptions. The curation is simpler than what is proposed for the Curator Assistant but is very effective for practicing biologists who use the interactive system, where each gene sentence is placed automatically into a functional category.
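As a rough sketch of this two-stage design (our illustration, not the BeeSpace implementation), the toy below highlights gene sentences with a recall-tuned filter and then files each into a functional category, with invented keyword cues standing in for the trained classifier.

```python
import re

# Keyword cues standing in for the trained classifier (invented for illustration).
CATEGORIES = {
    "EL": ["expressed", "expression"],   # Expression Location
    "GI": ["interacts", "suppresses"],   # Genetic Interaction
    "MP": ["mutant", "phenotype"],       # Mutant Phenotype
}

def highlight(abstracts, gene):
    """Stage 1, tuned for recall: keep every sentence that names the gene."""
    for text in abstracts:
        for sent in re.split(r"(?<=[.!?])\s+", text):
            if gene.lower() in sent.lower():
                yield sent

def categorize(sentence):
    """Stage 2: file the sentence into the first matching functional category."""
    low = sentence.lower()
    for cat, cues in CATEGORIES.items():
        if any(cue in low for cue in cues):
            return cat
    return "WFPI"   # default: Wild-type Function & Phenotypic Information

abstract = "Wingless is expressed in the wing disc. The wingless mutant phenotype is severe."
for s in highlight([abstract], "wingless"):
    print(categorize(s), "|", s)
```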

The first version of this service used a machine learning approach that was trained on the curator-generated sentences from FlyBase, which explain why the curator had entered a particular factoid into the FlyBase relational database. PI Schatz of BeeSpace then visited PI Gelbart of FlyBase at Harvard and observed the curator process at length. A reciprocal visit by a FlyBase curator, Sian Giametes, to BeeSpace refined the automatic process and the functional categories. We then also did specific training with new sentences judged by bee biologists at the University of Illinois and beetle biologists at Kansas State University. A subsequent version was developed using this training, with much higher accuracy than the previous dictionary-based versions.

Figures 2 and 3 give examples of using the Gene Summarizer with this insect training on a Drosophila fly gene and on a Tribolium beetle gene. There are more fly papers than beetle papers, so the number of highlighted sentences is naturally greater. The functional categories are: Gene Products (GP), Expression Location (EL), Sequence Information (SI), Wild-type Function & Phenotypic Information (WFPI), Mutant Phenotype (MP), and Genetic Interaction (GI).

Figure 2. Gene Summarization for Automatic Curation on FlyBase collection.

Figure 3. Gene Summarization for Automatic Curation on BeetleBase collection.

4. CURATOR ASSISTANT PROTOTYPE

After integrating the Gene Summarizer in BeeSpace v3, we developed a prototype BeeSpace v5 that specifically extracted entities and relations from literature. This performs deeper curation, recognizing within a highlighted sentence which entities and relations are mentioned. The extractors were tuned for precision to produce “correct” factoids, rather than the previous extractors that were tuned for recall to produce comprehensive coverage of all entities present. From this, it became clear that the level of precision and recall is a tunable feature of machine learning, and thus that it would be feasible to support varying qualities for different purposes.

The precision v5 system was an important prototype for the Curator Assistant, as it showed that accurate automatic extraction is technically possible. The first version leveraged the relations within FlyBase and was run on the Drosophila collection of standard articles that we obtained through collaboration from FlyBase at Indiana University, where the software development is done. The high precision used disambiguation algorithms that enabled identification of which gene was mentioned. For v3 recall, “wingless” was just a particular text phrase; for v5 precision, the same word resolved to a particular gene number. Thus accurate linkouts became possible: a recognized gene entity can jump directly to the FlyBase gene entry for that name, and an anatomy entity can jump directly to the FlyBase anatomical hierarchy.
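A minimal sketch of this linkout step follows; the gene-number mapping and the FBgn value are placeholders for illustration, while flybase.org/reports/ is the FlyBase report URL pattern.

```python
# Name -> FlyBase gene number; the FBgn value here is a placeholder, not a real ID.
GENE_IDS = {"wingless": "FBgn0000000"}

def linkout(surface_form):
    """Resolve a recognized gene mention to its FlyBase report URL, if known."""
    fbgn = GENE_IDS.get(surface_form.lower())
    return "http://flybase.org/reports/%s" % fbgn if fbgn else None

print(linkout("wingless"))   # -> http://flybase.org/reports/FBgn0000000
```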

Figure 4 contains a sample output from the v5 prototype on the Drosophila fly collection. Multiple-word phrases are recognized correctly for gene in green, for anatomy in orange, for behavior in blue, and for chemical in yellow. (The tags are distinguishable if this figure is displayed in color.) Anatomy is dictionary-based, just like gene, using the FlyBase anatomy terms as the base. The function terms in the categories of behavior and chemical were extracted using heuristics over certain key words. There was another set of function terms for development, the other category used in FlyBase, but not many terms were identified with our simple heuristics. Figure 5 shows that the recognized gene is linked to its corresponding gene database entry in FlyBase.

In the proposed project, for entities, we will focus on gene, anatomy, and function (combining behavior, anatomy, development). For relations, we will focus on different combinations of these, such as Interaction (Gene-Gene), Expression (Gene-Anatomy), and Function (Gene-Behavior, etc.). We will leverage existing resources for dictionary generation, such as gene names from NCBI Entrez Gene [www.ncbi.nlm.nih.gov/sites/entrez?db=gene] and anatomy names from FlyBase [http://flybase.org/static_pages/anatomy/glossary.html]. The relational indexes in Biological Abstracts include gene and anatomy, providing a rich source of entities tagged by human curators from phrases in biological literature. FlyMine [www.flymine.org] is a rich source of query relations, including multistep inferences extracted from FlyBase. We will also leverage available resources to obtain training data or pseudo training data. In particular, the BioCreative studies [16,29] have resulted in a valuable training set, which we have already used in gene recognition. Fixed template systems such as Textpresso [30] have hand-generated rules useful for constructing features in our learning-based framework.
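As one example of how such dictionary resources plug in, here is a minimal greedy longest-match tagger, assuming the dictionary is just a set of lowercase names drawn from sources like those above; it is our own sketch of the lookup step, not the project's code.

```python
def tag(tokens, dictionary, max_len=4):
    """Greedy longest-match tagging; returns (start, end, phrase) spans."""
    spans, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first, shrinking to single tokens.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower()
            if phrase in dictionary:
                spans.append((i, i + n, phrase))
                i += n
                break
        else:
            i += 1   # no dictionary entry starts here; move on
    return spans

anatomy = {"wing imaginal disc", "antennal lobe"}   # tiny illustrative dictionary
print(tag("expressed in the wing imaginal disc".split(), anatomy))
```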

For the proposed project, we plan to do extensive training to improve the precision of the dictionaries and of the heuristics, to automatically identify sentence slots for particular entities. This process greatly improved our previous efforts for entity summarization, as discussed above. To achieve better results, the community curators can supplement the dictionaries with local gene names or anatomy names. The next section is a technical discussion of the training procedures and how such tuning can be feasibly implemented.

Figure 4. Preliminary Work from BeeSpace Prototype v5. Interactive System for Entity Relations using FlyBase relational database for leverage, with live linkouts.

Figure 5. FlyBase Gene entry (manual) linked to from Curator Assistant (automatic).

We have tried running the Drosophila-trained v5 extractors on Tribolium literature, since few beetle genes have their own direct names; they commonly reuse the fly gene names. The anatomy is also not identical, but similar in many ways. This process sometimes produces good results, as shown in Figure 6. This version is the initial attempt at a general system for arthropods using prototype classification: the closer the organism is to the prototype fly, the more accurate the recognition.

Figure 6. Entity Relation v5 on Beetle Tribolium literature. This still uses the FlyBase training, so it is not as accurate as a trained system would be, but it still produces some useful outputs.

We are currently extracting from a large insect collection from the Biological Abstracts database. PI Schatz is giving an invited lecture in December 2009 at the annual meeting of the ESA Entomological Society of America on "Computer support for community knowledge: information technologies for insect biologists to automatically annotate their molecular information" and will demonstrate the evolved version of this prototype. coPI Gilbert is giving an invited talk in the same session on Integrative Physiological and Molecular Insect Systems. He works on the arthropod water flea, a good test of machine learning for anatomy entities.

PROJECT SCHEDULE FOR CURATOR ASSISTANT

Year 1. Develop v1 leveraging FlyBase (based on BeeSpace v5). Deploy to BeetleBase.
Year 2. Develop v2 with Trained Recognizers. Deploy to BeetleBase and wFleaBase.
Year 3. Develop v3 with Community Curation. Deploy to the entire ABC, including the Hymenoptera and Lepidoptera genome databases without curators and VectorBase with curators.

5. ENTITY RELATION EXTRACTION

This project proposes that it is feasible to apply advanced machine learning and natural language processing techniques to extract various biological entities and relations, with tunable extraction results, in a sustainable way, by leveraging the increasing amount of training data from annotations naturally accumulated over time. This sustainability is illustrated in Figure 7. The main technical component is the trainable and tunable extractor. This extractor can automatically process large amounts of literature and identify relevant entities and relations that can become candidate factoids for curation. The extracted results would then be validated by human curators or anyone with appropriate expertise. The validated results can be incorporated into structured databases for researchers to query or for analysis tools to process further. The growing amount of validated entities and relations naturally serves as additional training data for the extractor, leading to “organic” improvement of extraction performance over time.

The extractor is trainable due to the use of a machine learning approach to extraction, as opposed to the traditional rule-based approaches. This means that the extractor can learn over time from the human-validated extraction results to improve its extraction accuracy; the more training data we have, the better the accuracy of extraction will be. Thus, as we accumulate more entities and relations, the Curator Assistant becomes more intelligent and powerful, able to replace more of the human labor, and more scalable in handling large amounts of literature automatically.

The extractor is tunable due to a combination of high-precision techniques, such as dictionary lookup and rule-based recognition, with high-recall enhancement from statistical learning. Informally, our idea is that we can first use dictionary lookup and/or rule-based methods to obtain a small amount of highly accurate extraction results, and then feed these results as (pseudo) training data to a learning-based extractor to train it to extract more results, thus increasing recall. A learning-based extractor also generally has parameters to control the tradeoff of precision and recall, making it possible to tune the system to output either fewer results with higher precision or more results with higher recall but potentially lower precision.
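In the simplest form, this tunability is a score cutoff over ranked candidates; the sketch below, with invented candidate scores, shows the dial described above.

```python
def extract(scored_candidates, threshold):
    """Keep candidates scoring at or above the cutoff: raising the threshold
    trades recall for precision, as described in the text."""
    return [c for c, score in scored_candidates if score >= threshold]

scored = [("geneA-geneB interaction", 0.92),     # invented scores
          ("geneC-anatomyX expression", 0.55),
          ("geneD-behaviorY function", 0.31)]

print(extract(scored, 0.9))   # no curators: few, high-precision factoids
print(extract(scored, 0.3))   # many curators: broad, high-recall candidates
```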

Figure 7. Extraction Process for the Assistant, where the Curator tunes the Dictionaries and the Training.

This trainable and tunable extractor will be implemented based on a general learning framework for information extraction, in which all resources, including dictionaries, human-generated rules, and existing annotations, can be integrated in a principled way. The basic idea of using machine learning [1] for extraction is to cast the extraction problem as a classification problem. For example, for entity extraction, the task would be to classify a candidate phrase as either being a particular type of entity (e.g., gene) or not, while for relation extraction, the task can be to classify a sentence as either containing a particular relation (e.g., gene interaction) or not. The prediction is based on a function that combines, in a weighted manner, various features describing an instance (i.e., a phrase or a sentence). For example, for gene prediction, the features can include every possible clue that can potentially help make the prediction: local syntactic features, such as whether the phrase has capitalized letters, whether there are parentheses or Greek letters, or whether there is a hyphen, and contextual features, such as whether the word “gene” or “expressed” occurs in a small window around the phrase. These features are combined to generate a score as the basis for the prediction. The exact way to combine the features and make the decision varies from method to method [1].

For example, a commonly used effective classifier is based on logistic regression [1,18]. It works as follows. Let X be a candidate phrase and f1(X), f2(X), …, fk(X) be k feature values computed on X; e.g., f1(X)=1 (or 0) can indicate that the first letter of X is (or is not) capitalized. Let Y ∈ {0,1} be a binary variable indicating whether X is a gene. The logistic regression classifier assumes that Y and the features are related through the parameterized function:

p(Y=1|X) = 1 / (1 + exp(−(β0 + β1 f1(X) + β2 f2(X) + … + βk fk(X))))

where the β's are parameters controlling the weights on all the features, learned from training data. Given any instance X, we can use the formula above to compute p(Y=1|X), and thus predict X to be a gene if p(Y=1|X) > p(Y=0|X) (i.e., p(Y=1|X) > 0.5), and a non-gene otherwise. The training data will be pairs (Xj, Yj), where Xj is a phrase and Yj ∈ {0,1} is the correct prediction for Xj; thus a pair (Xj, Yj=1) means that phrase Xj should be predicted as a gene, while a pair (Xj, Yj=0) means that phrase Xj should be predicted as not a gene. In general, we will have many such training pairs, which tell us the expected predictions for various instances. With a set of such training data {(Xj, Yj)}, j=1,…,n, in the training phase we optimize the parameters (i.e., the β's) to minimize the prediction errors on the training data. Intuitively, this is to figure out the best settings for the β's so that, ideally, for all training pairs where Yj=1, p(Yj=1|Xj) is larger than 0.5, while for those where Yj=0, p(Yj=1|Xj) is smaller than 0.5.

Although we used gene prediction as an example to illustrate the idea of this kind of learning approach, it is clear that the same method can be used for recognizing other entities as well as relations if X is a candidate sentence and Y indicates whether a certain relation is expressed in X. There are many other classifiers [1] such as SVM and k-nearest neighbors that we can also use; they all work in a similar way – using training data to optimize a combination of features for making a prediction.
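For concreteness, here is a small runnable toy of this classification setup, using a few of the surface features mentioned above and invented training pairs; it is a sketch of the approach, not the project's extractor.

```python
import math

def features(phrase, context):
    """Compute k binary feature values f_i(X) for a candidate phrase."""
    return [1.0,                                    # constant bias feature
            1.0 if phrase[:1].isupper() else 0.0,   # first letter capitalized?
            1.0 if "-" in phrase else 0.0,          # contains a hyphen?
            1.0 if "gene" in context or "expressed" in context else 0.0]

def p_gene(beta, f):
    """p(Y=1|X) from the logistic formula above."""
    z = sum(b * x for b, x in zip(beta, f))
    return 1.0 / (1.0 + math.exp(-z))

# Invented training pairs (X_j, Y_j): a gene mention and a non-gene word.
train = [(features("Dpp", "the gene is expressed in"), 1),
         (features("slowly", "moves slowly toward"), 0)]

beta = [0.0] * 4
for _ in range(200):          # simple gradient ascent on the log-likelihood
    for f, y in train:
        err = y - p_gene(beta, f)
        beta = [b + 0.5 * err * x for b, x in zip(beta, f)]

test = features("Ubx", "this gene is expressed in the embryo")
print("gene" if p_gene(beta, test) > 0.5 else "non-gene")   # -> gene
```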

A significant advantage of such a learning-based approach over the traditional rule-based approach (as used in, e.g., the Textpresso system [30]) is that it can keep improving its performance by leveraging the naturally growing curated database as training data, thus gradually reducing the need for human effort over time. Indeed, such supervised learning methods have already been applied successfully for information extraction from biology literature (see, e.g., [3,9,12,28,35,36,43]) and for many other tasks such as text categorization and hand-written character recognition. Such a learning-based method relies on the availability of two critical resources: (1) training data; (2) computable effective features. The more training data and the more useful features we have, the higher the accuracy of extraction will be. Unfortunately, these two resources are not always readily available to us. Below we discuss how we can apply advanced machine learning and NLP techniques to address these two challenges.

Insufficient training data: All the human-generated annotations are naturally available, high-quality training data, but for a new genome we may not have many (or any) annotations available, creating a “cold start” problem. We solve this problem using three strategies:

1. “Borrow” training data from related model organisms that have already been well annotated, through the use of domain adaptation techniques [18,19,20]. For example, our previous work shows that cross-domain validation (placing more emphasis on features that work well across multiple domains) can improve the accuracy of extracting genes from a BioCreative test set [16] by up to 40% [18].

2. Bootstrap with a small number of manually created rules to generate pseudo training examples (e.g., by assuming that all the cases matched by a rule are correct predictions). This is a general and powerful idea for improving recall, so it can be expected to be very useful when we want to tune toward high recall starting from high-precision results. For example, a small set of human-generated rules can be used for extraction with high accuracy; the generated high-precision results can then be used to train a classifier, which can augment the extraction results to improve recall (see the sketch after this list). In our previous study, this technique was also shown to be very effective when combined with domain adaptation [20].

Figure 8 shows some sample results from using pseudo training data automatically generated from entries in a FlyBase table for genetic-interaction relation recognition. Different curves correspond to different combinations of features. The best-performing curve uses all the words in a sentence as features. Note that this top curve also shows that it is possible to tune the extractor to produce either high-precision low-recall results or low-precision high-recall results by applying a different cutoff threshold to a ranked list of predictions.

Figure 8. Relation Extractor with Tunable Precision-Recall depending on thresholds.

3. In the worst case, we will resort to human annotators to generate a small number of high-quality training examples, using active learning techniques to choose the most useful examples for a human annotator to work on and thus minimize human effort. The basic idea is to ask a human expert to judge a case about which our classifier is most uncertain; we can expect the classifier to learn most from the correct prediction for such uncertain cases. There are many active learning techniques that we can apply [7,10,40].
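A minimal sketch of the bootstrapping in strategy 2 above, with an invented rule and invented sentences: matches of the high-precision pattern are trusted as positive pseudo-labels, and rough negatives are sampled from sentences mentioning no gene at all.

```python
import re

# A single high-precision hand-written rule (illustrative, not a project rule).
RULE = re.compile(r"(\w+) interacts with (\w+)")

sentences = ["dpp interacts with tkv in the wing disc.",
             "A genetic screen links dpp and tkv.",
             "The larvae were reared at 25 degrees."]

# Rule matches are assumed correct: positive pseudo training examples.
pseudo_pos = [s for s in sentences if RULE.search(s)]

# Crude negatives: sentences mentioning neither gene (names hard-coded here).
pseudo_neg = [s for s in sentences if "dpp" not in s and "tkv" not in s]

# These (sentence, label) pairs would then train a broader learned extractor,
# which can recover interactions the rule itself misses (higher recall).
print([(s, 1) for s in pseudo_pos] + [(s, 0) for s in pseudo_neg])
```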

Insufficient effective features: Some entities and relations are easier to extract than others; for example, organisms are easier to extract than genes, because the former are usually restricted to a closed vocabulary while the latter are not. For most entities, we expect that the standard features defined on the surface forms of words and the contextual words around a phrase will be sufficiently effective for prediction. However, for difficult cases, we may need to extend the existing feature construction methods to define and extract additional effective features for a specific entity or relation. We will solve this problem using two strategies:

1. Systematically generate more sophisticated linguistic features based on syntactic and semantic structures (e.g., dependency relations between words determined by a parser). To improve the effectiveness of features, it is useful to consider more discriminative features than words. To this end, we will parse text to obtain syntactic and semantic structures of sentences and systematically generate a large space of linguistically meaningful features that can potentially capture more semantic relations and are more discriminative. In our previous study [21], we have proposed a graph representation that enables systematic enumeration of linguistic features, and our study has found that using a combination of features of different granularity can improve performance for relation extraction. In this project, we will apply this methodology to enable the classifier to work on a large space of features.

2. Involve human experts in the loop of learning, so that when the system makes a mistake, the expert can pinpoint the exact feature responsible for the error; this way, the system can effectively improve the quality of features through human feature supervision. For example, in some previous experiments, we discovered that dictionary-based approaches to gene name recognition are unable to distinguish a gene abbreviation such as “for” from the common preposition “for”. Thus, if we just add a feature to the classifier indicating whether the phrase occurs in a dictionary, we may misrecognize a preposition like “for” as a gene name. To solve this problem, we designed a special classifier targeted at disambiguating such cases based on the distribution patterns of words in the nearby text. The results in Figure 9 show that this technique can successfully distinguish all the occurrences of “foraging” and “for” (the numbers are the scores given by the classifier; a positive number indicates a gene, while a negative number indicates a non-gene). The output from such a disambiguation classifier can be regarded as a high-level feature that can be fed into a general gene recognizer to tune the classifier toward high precision.
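The sketch below illustrates this disambiguation idea on the “for” example; the cue-word lists and additive scoring are our own simplification of the distribution-based classifier.

```python
# Cue words hinting at the gene reading vs. the preposition reading (invented).
GENE_CUES = {"allele", "expression", "mutant", "foraging", "locus"}
PREP_CUES = {"reason", "waiting", "time", "thanks"}

def disambiguate(tokens, i, window=3):
    """Score tokens[i] ('for') by its context: positive -> gene, negative -> preposition."""
    context = tokens[max(0, i - window): i] + tokens[i + 1: i + 1 + window]
    return sum((w in GENE_CUES) - (w in PREP_CUES) for w in context)

s = "the for locus affects foraging behaviour in larvae".split()
print(disambiguate(s, s.index("for")))   # positive score: gene reading
```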

Note that we take a very broad view of features, which makes our framework quite general. Thus, in addition to leveraging all kinds of training data, we can also incorporate a variety of other useful resources such as dictionaries and human-generated rules through defining appropriate features (e.g., a feature can correspond to whether an instance matches a particular rule or an entry in a dictionary), effectively leveraging results from existing work. Extracting entities and relations from biomedical literature has been studied extensively in the literature (see, e.g., [3,5,6,9,11-15,23-24,28,30-31,35,36,38-39,41-43]), including our own previous work (e.g., [18-21]). Our framework would enable us to leverage and combine the findings and resources from all these previous studies to perform large-scale information extraction. For example, we can obtain a wide range of useful features from previous work and various strategies for optimizing extraction accuracy.

Figure 9. Sample gene name disambiguation results.

6. COMMUNITY ANNOTATION AND CURATION

The community itself will eventually have to take over the curator role, with interactive analysis enabling scientists to use the infrastructure to infer biological functions and semantic relationships. Today's new genome projects are efforts contributed by many experts and students, supported and enabled by distributed data sets, wiki project notebooks, genome maps, and annotation and search tools. These projects are not supported in a monolithic way, but via contributions by biologists at nearly as many institutions as there are participating labs, numbering in the hundreds.

For example, more than 400 biologists contributed gene annotations to the Daphnia genome [17]. As this matches the attendance of the Arthropod Genomics Symposium, yet is for a single arthropod, the number of potential contributors to Arthropod Base Consortium annotations clearly numbers in the tens of thousands. Each of these is a potential curator, given effective Curator Assistant infrastructure. See the collaboration wikis for the Daphnia Genomics Consortium [https://dgc.cgb.indiana.edu/display/DGC/] and the Aphid Genomics Consortium [https://dgc.cgb.indiana.edu/display/aphid/] for arthropod genome examples.

This is a new model of sustainable scientific activity, with cost-effective collaboration via widely adopted cyberinfrastructure. Experts and students in focus areas are actively involved, and contribute according to their means and interest. They join from disparate areas of basic and applied sciences, educational, governmental, and industry centers (e.g. Daphnia and Aphid genomes involve EPA and USDA agencies, agricultural and environmental businesses).

We will develop infrastructure to address collaboration support for community annotation. By providing tunable quality for biological factoids, we provide an automatic system to filter the literature for curatable knowledge. In current gene annotation systems, such as Apollo distributed by GMOD, the curator is presented with a blank form in which to write a gene description.  In the Curator Assistant, they are presented with candidate suggestions, thus greatly expanding the number of persons who can serve as effective curators. We will also provide mechanisms for the community to enter their own documents as published into the base collections for the system, yielding a rich source of full-text articles, and to directly provide their own factoids from their articles, without the inaccuracy of automatic entity-relation extraction.

Currently, the most popular collaboration tools are wikis. While a wiki excels at simplicity and flexibility, it lacks validation tools, rich indexing, and social instrumentation. We propose to develop structured social instrumentation for collaborative research environments, including collaborative curation. In particular, our systems will allow users to offer confidence ratings for human annotations and for the various automated metadata extracts presented to them. The users themselves will gain expert status when their annotations receive high confidence ratings. These ratings and rankings will allow researchers to share expertise and enhance the precision of automated annotation systems in a mutually beneficial way, with secure transactions.

A relevance rating system will be integrated into the basic functioning of the system itself. Every view of information (entities, relations, abstracts, document lists) will also include checkboxes to up-rate or reject/dismiss any listed elements. For example, community members can judge the quality of the factoids viewed during their usage of the system. Items which are selected and viewed receive increased relevance ratings. Data items which are dismissed/rejected are down-rated in relevance and/or validity. The rating system is not optional: it is transparently embedded within the user experience, which is key to its success. This model of relevance feedback and validity ratings embedded within the core system has proven effective in popular commercial social network systems such as YouTube and LastFM.

7. PROJECT ORGANIZATION AND SCHEDULE

Our project has been organized via the annual Symposium of the Arthropod Base Consortium (ABC). This is sponsored by the Arthropod Genomics Center at Kansas State University, with coPI Brown as Director. There have been 3 symposia held thus far in Kansas City, drawing 300-400 attendees, generally representatives of their research laboratories or genome projects [http://www.k-state.edu/agc/symp2009/]. The steering committee for the ABC meets after the workshop to plan community support; this proposal grew out of these planning meetings.

There have also been specific meetings of the inner circle, 30-40 attendees, once or twice a year at the main infrastructure sites such as FlyBase. The BeeSpace project hosted the one in December 2007 at the University of Illinois; the slides for this workshop are at http://www.beespace.uiuc.edu/groups_abc.php. The investigators for this proposal each spoke at this meeting, along with the Head of Literature Curation for FlyBase Cambridge. The proposed project will host a budgeted annual specialty workshop to plan the Curator Assistant.

The genome databases being used as test models in this project have already bypassed the use of professional curators. They come in later than the post-MOD wave, such as the honey bee, for which a case for a few curators was eventually successful after many grant attempts. So BeetleBase for Tribolium the flour beetle and wFleaBase for Daphnia the water flea employ a few biologists and programmers to help with sequencing support and computational pipelines. The coPIs who lead the bioinformatics for these, respectively Susan Brown and Donald Gilbert, are influential proponents of the new paradigm of community curation via annotation software.

This proposal is concerned with developing an effective Curator Assistant and testing it to evolve to full utility. The infrastructure investigators will develop the software infrastructure, Schatz leading the informatics system development and Zhai leading the computer science research. These were the same roles they played in the BIO FIBR BeeSpace project, which developed interactive services for functional analysis using computer science research. The bioinformatics investigators will serve as the initial users; each is the lead for the informatics of a major community of arthropod biologists with several hundred community members. Tribolium is an insect close to Drosophila, while Daphnia is a non-insect arthropod far from Drosophila. The close BeeSpace collaboration with FlyBase will be continued, with both the curator site at Harvard with PI Bill Gelbart and the software site at Indiana with PI Thom Kaufmann.

Deployment to the full ABC and beyond will begin towards the end of the project. The groups already identified coordinate multiple related databases. They will be the wave of deployment after the investigator organisms are effectively using the Curator Assistant. Their coordinators have expressed great interest while serving on the ABC steering committee. NIH-supported VectorBase has many curators for mosquitoes and ticks; USDA-supported HymenopteraBase has few curators for bees and wasps; LepidopteraBase has no curators for butterflies and moths. There is also an international collaboration for AphidBase, hosted at INRA in France.

The GMOD (Generic Model Organism Database) consortium is a bioinformatics group that provides common infrastructure for over 100 genome projects, including all the ABC genomes [www.gmod.org/wiki/GMOD_Users]. We have presented our preliminary software at GMOD meetings [32], using RESTful protocols for linking the Genome Browser to the Gene Summarizer, and have made arrangements with the coordinator Scott Cain, during extensive conversations at the GMOD and ABC meetings, to link our software into GMOD for mass distribution. So the Curator Assistant will become the literature infrastructure for ABC, just as GBrowse is the sequence infrastructure, and through GMOD it will be made available to the genome biology community.

References Cited

[1] Bishop C (2007) Pattern Recognition and Machine Learning, Springer.

[2] Buell J, Stone D, Naeger N, Fahrbach S, Bruce C, Schatz B (2009) Experiencing BeeSpace: Educational Explorations in Behavioral Genomics for High School and Beyond, AAAS Annual Symposium, Chicago, Feb 2009. Curricular materials at www.beespace.uiuc.edu/ebeespace

[3] Chang J, Schutze H, Altman R (2004) GAPSCORE: finding gene and protein names one word at a time, Bioinformatics, 20(2):216-25.

[4] Chung Y, Pottenger W, Schatz B (1998) Automatic Subject Indexing using an Associative Neural Network, 3rd Int ACM Conf on Digital Libraries, Pittsburgh, PA, Jun, pp 59-68. Nominated for Best Paper award.

[5] Cohen A (2005) Unsupervised gene/protein entity normalization using automatically extracted dictionaries, Proc BioLINK2005 Workshop Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics. Detroit, MI: Association for Computational Linguistics; 2005:17-24.

[6] Daraselia N, Yuryev A, Egorov S, Novichkova S, Nikitin A, Mazo I (2004) Extracting human protein interactions from MEDLINE using a full-sentence parser, Bioinformatics, 20(5):604-11, 2004.

[7] Dasgupta S, Tauman Kalai A, Monteleoni C (2005) Analysis of perceptron-based active learning, Proceedings of COLT 2005, 249-263, 2005.

[8] Drysdale R, Crosby M, FlyBase Consortium (2005) FlyBase: genes and gene models, Nucleic Acids Research, 33:D390-D395, Database Issue, doi:10.1093/nar/gki046.

[9] Finkel J, Dingare S, Manning C, Nissim M, Alex B, Grover C (2005) Exploring the boundaries: gene and protein identification in biomedical text, BMC Bioinformatics, 6(Suppl 1):S5.

[10] Freund Y, Seung H, Shamir E, Tishby N (1997) Selective sampling using the query by committee algorithm, Machine Learning, 28(2-3):133-168.

[11] Fukuda K, Tamura A, Tsunoda T, Takagi T (1998) Toward information extraction: identifying protein names from biological papers, Pac Symp Biocomput, pp 707-18.

[12] Hakenberg J, Bickel S, Plake C, Brefeld U, Zahn H, Faulstich L, Leser U, Scheffer T (2005) Systematic feature evaluation for gene name recognition, BMC Bioinformatics, 6(Suppl 1):S9.

[13] Hanisch D, Fundel K, Mevissen H, Zimmer R, Fluck J (2005) ProMiner: rule-based protein and gene entity recognition, BMC Bioinformatics, 6(Suppl 1):S14, doi:10.1186/1471-2105-6-S1-S14.

[14] Hatzivassiloglou V, Duboue P, Rzhetsky A (2001) Disambiguating proteins, genes, and RNA in text: a machine learning approach, Bioinformatics, 17(Suppl 1):S97-S106.

[15] Hirschman L, Park J, Tsujii J, Wong L, Wu C (2002) Accomplishments and challenges in literature data mining for biology, Bioinformatics, 18(12):1553-1561.

[16] Hirschman L, Yeh A, Blaschke C, Valencia A (2005) Overview of BioCreAtIvE: critical assessment of information extraction for biology, BMC Bioinformatics, 6(Suppl 1):S1, doi:10.1186/1471-2105-6-S1-S1.

[17] Howe D, Costanzo M, Fey P, et al. (2008) Big data: The future of biocuration, Nature 455:47-50, doi:10.1038/455047a.

[18] Jiang J, Zhai C (2006) Exploiting Domain Structure for Named Entity Recognition, Proceedings of HLT/NAACL 2006.

[19] Jiang J, Zhai C (2007) Instance weighting for domain adaptation in NLP, Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL'07), 264-271.

[20] Jiang J, Zhai C (2007) A Two-Stage Approach to Domain Adaptation for Statistical Classifiers , Proc 16th ACM International Conference on Information and Knowledge Management ( CIKM'07), pp 401-410.

[21] Jiang J, Zhai C (2007) A Systematic Exploration of The Feature Space for Relation Extraction, Proc Human Language Technologies: Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2007), pp 113-120.

[22] Johnson E, Schatz B, Cochrane P (1996) Interactive Term Suggestion for Users of Digital Libraries: Using Subject Thesauri and Co-occurrence Lists for Information Retrieval, Proc Digital Libraries '96: 1st ACM Intl Conf on Digital Libraries, March, Bethesda, MD.

[23] Kazama J, Makino T, Ohta Y, Tsujii J (2002) Tuning SVM for biomedical named entity recognition, Proc workshop on NLP in the biomedical domain, 2002.

[24] Kulick S and others (2004) Integrated Annotation for Biomedical Information Extraction, Proc HTL-NAACL 2004 Workshop on Linking Biological Literature, Ontologies and Databases, pp 61-68.

[25] Ling X, Jiang J, He X, Mei Q, Zhai C, Schatz B (2006) Automatically generating gene summaries from biomedical literature, Proc Pacific Symposium on Biocomputing, pp 40-51.

[26] Ling X, Jiang J, He X, Mei Q, Zhai C, Schatz B (2007) Generating gene summaries from biomedical literature: A study of semi-structured summarization, Information Processing and Management, 43: 1777-1791.

[27] Marygold S (2007) Genetic Literature Curation at FlyBase-Cambridge, presentation at ArthropodBaseConsortium working group meeting at University of Illinois, Dec 2007. www.beespace.uiuc.edu/files/Marygold-ABC.ppt

[28] Mika S, Rost B (2004) Protein names precisely peeled off free text, Bioinformatics, 20(Suppl 1):241-247.

[29] Morgan A, Hirschman L (2007) Overview of BioCreative II Gene Normalization, Proc of the Second BioCreative Challenge Evaluation Workshop. Madrid, Spain: CNIO; 2007:17-27.

[30] Muller H, Kenny E, Sternberg P (2004) Textpresso: an ontology-based information retrieval and extraction system for biological literature, PLoS Biology 2004 Nov; 2(11) e309. doi:10.1371/journal.pbio.0020309 pmid:15383839. www.textpresso.org

[31] Narayanaswamy M, Ravikumar K, Vijay-Shanker K (2003) A biological named entity recognizer, Proc Pacific Symposium on Biocomputing, pp 427-38.

[32] Sanders B, Arcoleo D, Schatz B (2008) BeeSpace Navigator Integration with GMOD GBrowse, 9th annual Bioinformatics Open Source Conference (BOSC 2008), Toronto, ON, Canada. www.beespace.uiuc.edu/files/BOSC2008_v3.ppt

[33] Schatz B (2002) Building Analysis Environments: Beyond the Genome and the Web, invited essay for Trends and Controversies section about Mining Information for Functional Genomics, IEEE Intelligent Systems 17: 70-73 (May/June 2002).

[34] Schatz B (2007) Gene Summarizer: Software for Automatically Generating Structured Summaries from Biomedical Literature, accepted plenary Presentation to 2nd International Biocurator Meeting, San Jose. www.canis.uiuc.edu/~schatz/Biocurator.GeneSummarizer.ppt

[35] Settles B (2005) ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text, Bioinformatics, 21(14):3191-3192, 2005.

[36] Skounakis M, Craven M, Ray S (2003) Hierarchical hidden markov models for information extraction, Proc of the 18th International Joint Conference on Artificial Intelligence, 2003.

[37] Sokolowski M (2001) Drosophila: genetics meets behaviour, Nature Reviews Genetics, 2(11):879-890.

[38] Srinivasan P, Libbus B (2004) Mining Medline for implicit links between dietary substances and diseases, Bioinformatics, 20(Suppl 1):290-296.

[39] Tanabe L, Wilbur W (2002) Tagging gene and protein names in biomedical text, Proceedings of the workshop on NLP in the biomedical domain, 2002.

[40] Tong S, Koller D (2001) Support vector machine active learning with applications to text classification, Journal of Machine Learning Research, 2:45-66, 2001.

[41] Tsuruoka Y, Tsujii J (2003) Boosting precision and recall of dictionary-based protein name recognition, Proc ACL 2003 workshop on Natural language processing in biomedicine, pp 41-48, Morristown, NJ.

[42] Tuason O, Chen L, Liu H, Blake J, Friedman C (2004) Biological nomenclatures: A source of lexical knowledge and ambiguity, Proc Pacific Symposium on Biocomputing 9, pp 238-249.

[43] Zhou G, Shen D, Zhang J, Su J, Tan S (2005) Recognition of protein/gene names from text using an ensemble of classifiers, BMC Bioinformatics, 6(Suppl 1):S7.
