Playing Biology’s Name Game: Identifying Protein Names In Scientific Text

Daniel Hanisch, Juliane Fluck, Heinz-Theodor Mevissen and Ralf Zimmer

Pac Symp Biocomput. 2003;:403-14.

Abstract

Construction of a comprehensive general-purpose name dictionary

An accompanying automatic curation procedure based on a simple token model of protein names

An efficient search algorithm to analyze all abstracts in MEDLINE

Parameters are optimized using machine learning techniques

Model for protein and gene names

Protein names are often composed of more than one word (token)

The “order” of these words is not very important – permutation of tokens may occur

General-purpose dictionaries of protein names must be automatically composed

Token classes (1/3)

Token classes (2/3)

Extract all words from the dictionary with frequency of occurrence > 100

Non-descriptive tokens: words that occur in databases but are rarely used in free text, or that have no influence on the significance of a match

Modifier tokens: words crucial for correct recognition

Token classes (3/3)

Specifier tokens: Arabic and Roman numbers and Greek letters

Delimiter tokens: used to gain specificity in the matching procedure – help identify name boundaries

Common words: obtained by comparison to a standard English dictionary

Standard tokens: gene identifiers, as they cannot easily be assigned to a separate class
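The token classes above can be illustrated with a small lookup-based classifier. This is a minimal sketch, not the authors' implementation: the word lists, the choice of delimiter characters, and the function name `classify_token` are assumptions made for illustration. Only "precursor" (non-descriptive) and "receptor" (modifier) are class assignments taken from these slides.

```python
# Hypothetical word lists; in the described system they would be derived
# from dictionary word frequencies and a standard English dictionary.
NON_DESCRIPTIVE = {"precursor", "protein", "fragment"}
MODIFIER = {"receptor", "inhibitor", "ligand"}
COMMON_ENGLISH = {"the", "of", "and", "for"}
GREEK = {"alpha", "beta", "gamma", "delta"}
ROMAN = {"i", "ii", "iii", "iv", "v", "vi", "vii", "viii", "ix", "x"}
DELIMITER = {",", ";", ".", "(", ")"}  # assumed delimiter set


def classify_token(token):
    """Return the token class of a single token (case-insensitive)."""
    t = token.lower()
    if t in DELIMITER:
        return "delimiter"
    if t.isdigit() or t in GREEK or t in ROMAN:
        return "specifier"
    if t in MODIFIER:
        return "modifier"
    if t in NON_DESCRIPTIVE:
        return "non-descriptive"
    if t in COMMON_ENGLISH:
        return "common"
    # Everything else, e.g. gene identifiers, falls into the standard class.
    return "standard"


classes = [classify_token(t) for t in "TNF receptor 2 precursor".split()]
```

Here `classes` holds one class label per token of the example name; the order of the checks is a design choice of this sketch, so a token listed in several word lists gets the first class that matches.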

Automatic generation of the dictionary

Extract gene symbols, alias names, and full names for all human genes from the HUGO Nomenclature database

Create an entry for each official gene symbol and add the corresponding names found in the OMIM database

Extract all synonyms in the SWISSPROT and TREMBL databases and match these to HUGO entries
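The three generation steps can be sketched as a merge over already-parsed records. This is an illustration under assumptions: the input shapes, the function name `build_dictionary`, and the matching rule (a SWISSPROT/TREMBL synonym set is attached when it shares a name with exactly one HUGO entry) are hypothetical, and the records shown are toy data rather than database extracts.

```python
from collections import defaultdict


def build_dictionary(hugo_records, omim_names, protein_db_synonyms):
    """Merge names from three sources into one entry per gene symbol.

    hugo_records: list of dicts with keys "symbol", "full_name", "aliases"
    omim_names: dict mapping gene symbol -> list of names
    protein_db_synonyms: list of synonym lists, one per protein entry
    """
    dictionary = {}
    for rec in hugo_records:
        names = {rec["symbol"], rec["full_name"], *rec["aliases"]}
        names.update(omim_names.get(rec["symbol"], []))
        dictionary[rec["symbol"]] = names

    # Index lower-cased names so protein synonyms can be matched to entries.
    index = defaultdict(set)
    for symbol, names in dictionary.items():
        for name in names:
            index[name.lower()].add(symbol)

    for synonyms in protein_db_synonyms:
        hits = set()
        for syn in synonyms:
            hits |= index.get(syn.lower(), set())
        if len(hits) == 1:  # assumed rule: attach only unambiguous matches
            dictionary[next(iter(hits))].update(synonyms)
    return dictionary


# Toy input for illustration only.
hugo = [{"symbol": "TNF", "full_name": "tumor necrosis factor", "aliases": ["TNFA"]}]
omim = {"TNF": ["cachectin"]}
proteins = [["Tumor necrosis factor", "TNF-alpha"]]
merged = build_dictionary(hugo, omim, proteins)
```

Synonym sets that match several entries are skipped here; a fuller version would hand them to the curation step instead.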

Curation of the dictionary (1/3)

To resolve ambiguities and to remove nonsensical names from the dictionary

The curation procedure consists of two phases – expansion and pruning

Expansion:

Curation of the dictionary (2/3)

Pruning: remove redundancies, ambiguities, and irrelevant synonyms

First, each synonym is converted into a sequence of token class identifiers

Use regular expressions to search for unspecific synonyms (e.g. only non-descriptive tokens, only specifier tokens, etc.)

Finally, a list of ambiguous names is stored separately with reference to their original records
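The pruning phase can be sketched as follows. The one-letter class codes, the small lookup table, and the pattern `^[NSC]+$` are assumptions for illustration; the slides only state that synonyms become sequences of token class identifiers and that regular expressions flag unspecific ones.

```python
import re
from collections import defaultdict

# Hypothetical one-letter class codes: N = non-descriptive, M = modifier,
# C = common word, S = specifier, T = standard (default).
CLASS_CODES = {"precursor": "N", "protein": "N", "receptor": "M", "the": "C", "of": "C"}
GREEK = {"alpha", "beta", "gamma"}

# A synonym made up only of non-descriptive, specifier or common tokens
# is treated as unspecific in this sketch.
UNSPECIFIC = re.compile(r"^[NSC]+$")


def class_sequence(synonym):
    """Encode a synonym as a string of token class identifiers."""
    codes = []
    for tok in synonym.lower().split():
        if tok.isdigit() or tok in GREEK:
            codes.append("S")
        else:
            codes.append(CLASS_CODES.get(tok, "T"))
    return "".join(codes)


def prune(dictionary):
    """Drop unspecific synonyms and collect names shared by several records."""
    kept = {}
    owners = defaultdict(set)
    for record_id, synonyms in dictionary.items():
        kept[record_id] = {s for s in synonyms if not UNSPECIFIC.match(class_sequence(s))}
        for s in kept[record_id]:
            owners[s.lower()].add(record_id)
    ambiguous = {name: ids for name, ids in owners.items() if len(ids) > 1}
    return kept, ambiguous


toy = {"A": {"precursor protein", "alpha 2", "foo receptor", "p40"},
       "B": {"p40", "bar kinase"}}
kept, ambiguous = prune(toy)
```

In this toy run the synonyms "precursor protein" and "alpha 2" are removed from record A, and "p40" ends up in the ambiguity list with references to both records.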

Curation of the dictionary (3/3)

The ambiguity list can be used to identify such entries and move them to the manual curation list based on their frequency of occurrence.

Efficient detection of names (1/3)

MEDLINE contains about 11 million abstracts

The search runs in linear time in the number of tokens of the parsed text

Sweep over the abstract, processing one token at a time, and keep a set of candidate solutions together with two associated scoring measures, a boundary score s and an acceptance score s, for the present position


Boundary score s: controls the end of the extension of a candidate match and is increased on a token mismatch. The candidate is pruned if this score exceeds the boundary threshold.

Acceptance score s: determines whether the candidate is reported as a match. It is a linear combination of token-class-specific match and mismatch terms. In other words, the significance of the token classes varies.
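The sweep described above can be sketched in a few lines. This is a simplified illustration, not the published algorithm: names are treated as unordered token sets, every mismatch adds 1.0 to the boundary score, and the acceptance score is reduced to the fraction of matched name tokens instead of a class-weighted linear combination. The function name `find_names`, the thresholds, and the toy dictionary are assumptions.

```python
from collections import defaultdict


def find_names(text_tokens, dictionary, boundary_threshold=1.0, acceptance_threshold=0.8):
    """Sweep once over text_tokens and return (name, start, end) matches."""
    # Inverted index: token -> names containing that token.
    index = defaultdict(set)
    for name, tokens in dictionary.items():
        for tok in tokens:
            index[tok].add(name)

    matches = []
    candidates = {}  # name -> state of the currently open candidate

    def close(name, cand):
        # Simplified acceptance score: fraction of the name's tokens matched.
        acceptance = len(cand["matched"]) / len(dictionary[name])
        if acceptance >= acceptance_threshold:
            matches.append((name, cand["start"], cand["end"]))

    for pos, tok in enumerate(text_tokens):
        hits = index.get(tok, set())
        # Extend or penalize every open candidate.
        for name in list(candidates):
            cand = candidates[name]
            if name in hits and tok not in cand["matched"]:
                cand["matched"].add(tok)
                cand["end"] = pos
            else:
                cand["boundary"] += 1.0  # token mismatch
                if cand["boundary"] > boundary_threshold:
                    close(name, cand)
                    del candidates[name]
        # Open new candidates for names that contain the current token.
        for name in hits:
            if name not in candidates:
                candidates[name] = {"start": pos, "end": pos,
                                    "matched": {tok}, "boundary": 0.0}

    for name, cand in candidates.items():
        close(name, cand)
    return matches


toy_dictionary = {"TNFR1": {"tnf", "receptor", "1"}, "TNF": {"tnf"}}
found = find_names("the tnf receptor 1 binds trimers strongly".split(), toy_dictionary)
```

Each text token is looked up once in the inverted index, so the work per token depends on the number of open candidates rather than on the dictionary size. Overlapping matches (here both `TNF` and `TNFR1` are reported) are left unresolved in this sketch.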


Example:

If only the non-descriptive token “precursor” is unmatched in the candidate, a nearly maximal match score is computed (provided non-descriptive tokens receive a small weight)

However, a mismatch on the semantically significant modifier token “receptor” leads to a substantial mismatch term (if weights are set appropriately)
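The effect of class-specific weights in this example can be reproduced with a small calculation. The weights below are invented for illustration (the optimized values are not given in these slides), and dividing by the maximal achievable score is an assumption of this sketch, as is the function name `acceptance_score`.

```python
# Hypothetical (match reward, mismatch penalty) per token class.
WEIGHTS = {
    "standard": (1.0, 1.0),
    "modifier": (1.0, 2.0),
    "specifier": (1.0, 2.0),
    "non-descriptive": (0.1, 0.1),
}


def acceptance_score(name_tokens, matched):
    """Linear combination of class-specific match and mismatch terms.

    name_tokens: list of (token, token_class) pairs of the dictionary name
    matched: set of name tokens that were found in the text
    """
    score = 0.0
    best = 0.0
    for tok, cls in name_tokens:
        reward, penalty = WEIGHTS[cls]
        best += reward
        score += reward if tok in matched else -penalty
    return score / best  # normalized so that a full match scores 1.0


name = [("tnf", "standard"), ("receptor", "modifier"), ("precursor", "non-descriptive")]
missing_precursor = acceptance_score(name, {"tnf", "receptor"})
missing_receptor = acceptance_score(name, {"tnf", "precursor"})
```

With these weights the candidate that lacks only "precursor" stays close to the maximal score, while the candidate that lacks "receptor" drops below zero.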

Parameter optimization

Robust linear programming (RLP) was used to compute a set of sensible weights

This supervised machine learning technique uses a set of positive samples, i.e. correctly identified protein names, and a set of negative ones.

The match and mismatch weighting parameters for delimiter, specifier, modifier, and standard tokens were tuned.

The optimized weightings penalize mismatches of modifier and number tokens and reward matches of the other token classes to varying extents
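For orientation, robust linear programming in the standard formulation of Bennett and Mangasarian finds a separating plane by minimizing the average violation in each sample set. The notation is an assumption of this note: a_i are the feature vectors of the m positive samples, b_j those of the k negative samples, w the weight vector, and γ the threshold. Whether the authors used exactly this form is not stated in these slides.

```latex
\begin{aligned}
\min_{w,\,\gamma,\,y,\,z}\quad & \frac{1}{m}\sum_{i=1}^{m} y_i + \frac{1}{k}\sum_{j=1}^{k} z_j \\
\text{subject to}\quad & w^{\top} a_i - \gamma + y_i \ge 1, \qquad i = 1,\dots,m, \\
& -w^{\top} b_j + \gamma + z_j \ge 1, \qquad j = 1,\dots,k, \\
& y_i \ge 0, \quad z_j \ge 0 .
\end{aligned}
```

The slack variables y_i and z_j are zero for samples on the correct side of the margin, so the objective only counts samples that are misclassified or too close to the plane.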

Evaluation

The test dataset is based on the TRANSPATH database on regulatory interactions.

Extracted all human proteins with SWISSPROT annotations

Discarded abstracts if no text was available or if a protein was described for the first time

The resulting benchmark set consists of 611 associations (141 objects in 470 abstracts)

Results – 5-fold c.v.
