Text Mining for Biocuration of Bacterial Infectious Diseases

GBCB Seminar

April 24, 2014

Biocuration in bacterial infectious diseases

Existing approaches to biocuration

Goals of current research

Sentence classification for virulence factor (VF) curation

Future Research

Large scale sequencing, transcriptomics, proteomics and metabolomics provide large volumes of data about structure and function

Valuable information about genes, proteins and other biological entities derived from interpretation of data

Publications capture information that researchers extract from data by aggregating, integrating, summarizing and analyzing experiment results and interpreting those results with respect to other published results

Gene annotation
◦ Virulence factors
◦ Antibiotic resistance
◦ Genomic metadata

Experiment metadata
◦ Transcriptomic metadata
◦ Metabolomic metadata

Literature
◦ Named entity recognition
◦ Metadata tagging

Automated annotation
◦ Example – RAST
◦ Transfer annotations based on similarity
◦ Metabolic reconstruction

Community curation
◦ Example – WikiGenes
◦ Collaborative manual curation

Model building
◦ Example – MetaFlux
◦ Predict missing components of pathways based on FBA models

Dedicated manual curation
◦ Example – PATRIC
◦ Curate entries with statements traceable to literature

◦ In 2009, half of biocurators were using text mining in support of biocuration [1]

◦ Common use cases:
  - Document prioritization
  - Linking entities and relations to biological resources such as GO or UniProt
  - Identification of evidence

◦ Identification of evidence:
  - Pattern recognition – genomic location information
  - Named entity recognition – T4SS components
  - Event extraction – positive/negative regulation

1. PMID: 23110974

Manual procedures are time consuming and costly

Volume of literature continues to grow

Commonly used search techniques, such as keyword search, similarity search and metadata filtering, can still yield volumes of literature that are difficult to analyze manually

Some success with popular tools but limitations

http://www.nature.com/nrmicro/journal/v8/n1/fig_tab/nrmicro2260_F2.html
http://stroke.nih.gov/materials/strokechallenges.htm


Potentially brittle methods, e.g. dictionary lookups

Questions of effort required to extend

Named entity recognition does not always disambiguate correctly

Prioritizing documents is still challenging

Textpresso Dictionary Entries

Adhesion to host
Adhesion to hosts
Adhesion to other organism during symbiotic interaction
Adhesion to other organism during symbiotic interactions
Adhesion to symbiont
Adhesion to symbionts
Agglutination during conjugation with cellular fusion
Agglutination during conjugation with cellular fusions
Agglutination during conjugation without cellular fusion
Agglutination during conjugation without cellular fusions

Generalized set of biocuration tools to:
◦ Filter and prioritize documents
◦ Identify relevant assertion sentences within documents
◦ Extract entities and events
◦ Require minimal manual intervention

Approach
◦ Address each objective separately
◦ Topic modeling and similarity measures for document classification
◦ Term Frequency-Inverse Document Frequency (TF-IDF) for sentence classification
◦ Shallow semantic parsing for entity and event extraction

Focus of this presentation is TF-IDF for sentence classification and its limitations

3 Key Components

◦ Data

◦ Representation scheme

◦ Algorithms

Data

◦ Positive examples – VF assertion sentences

◦ Negative examples – Randomly selected from same publications

Representation

◦ TF-IDF

◦ Vector space representation

◦ Cosine of the angle between vectors as the measure of similarity

Algorithms

◦ Supervised learning
  - SVMs
  - Ridge Classifier
  - Perceptrons
  - kNN
  - SGD Classifier
  - Naïve Bayes
  - Random Forest
  - AdaBoost

◦ Semi-supervised learning
  - Label Spreading

“Bacterial virulence factors enable a [pathogen] to replicate and disseminate within a host in part by subverting or eluding host defenses.” [1]

Example assertion sentences about virulence factors

Mutations in the fimH gene of Salmonella typhimurium result in a non-fimbriate, non-adhesive phenotype. [2]

Unexpectedly, here we find that nonacylated LprG retains TLR2 activity. [3]

The autolysin Ami contributes to the adhesion of Listeria. [4]

Negative examples are randomly selected non-VF assertion sentences from the same set of publications.

VF Sentence Set 1 - PATRIC team of biocurators identified 4,696 assertion sentences in 1,127 publications about virulence in 5 genera: Escherichia, Listeria, Mycobacterium, Salmonella, Shigella

VF Sentence Set 2 - Second round of curation over initial results yielded 3,716 VF assertion sentences from 787 publications across 6 genera: Bartonella, Escherichia, Listeria, Mycobacterium, Salmonella, Shigella

1. A. Cross, “What is a Virulence Factor” Crit Care. 2008; 12(6): 196.

2. Hancox, Yeh et al. 1997

3. Drage, Tsai et al. 2010

4. Milohanic, Jonquieres et al. 2001

Term Frequency (TF)
tf(t, d) = number of occurrences of t in d
t is a term
d is a document

Inverse Document Frequency (IDF)
idf(t, D) = log(N / |{d in D : t in d}|)
D is the set of documents
N is the number of documents in D

TF-IDF = tf(t,d) * idf(t,D)

TF-IDF is
◦ large when a term occurs frequently in a document but appears in few documents of the corpus
◦ small when the term appears in many documents
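The definitions above can be checked on a toy corpus. The following is a minimal sketch in plain Python, assuming raw counts for tf and the unsmoothed natural-log idf given on the slide; the three tokenized sentences and the function names are made up for illustration and are not taken from the VF sentence sets.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Raw term frequency: number of occurrences of term in one document."""
    return Counter(doc_tokens)[term]


def idf(term, corpus):
    """log(N / number of documents containing term).

    Assumes the term occurs in at least one document of the corpus.
    """
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)


def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)


# Hypothetical corpus: already tokenized and lower-cased.
corpus = [
    ["the", "autolysin", "contributes", "to", "the", "adhesion"],
    ["the", "data", "were", "log", "transformed"],
    ["the", "mutation", "results", "in", "a", "non", "adhesive", "phenotype"],
]

rare = tf_idf("autolysin", corpus[0], corpus)  # occurs in 1 of 3 documents
common = tf_idf("the", corpus[0], corpus)      # occurs in all 3 documents
print(rare, common)
```

The term "the" occurs twice in the first sentence but in every document, so its idf is log(1) = 0 and its weight vanishes, while "autolysin" keeps a weight of log(3).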

Bag-of-words model

Ignores structure (syntax) and meaning (semantics) of sentences

Representation vector length is the number of unique words in the corpus

Stemming used to remove morphological differences

Each word is assigned an index in the representation vector, V

The value V[i] is non-zero if word appears in sentence represented by vector

The non-zero value is a function of the frequency of the word in the sentence and the frequency of the term in the corpus
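A minimal sketch of this representation in plain Python follows. It assumes whitespace-tokenized, lower-cased sentences and omits stemming; the helper names (`build_vocabulary`, `vectorize`, `cosine`) and the three sentences are hypothetical.

```python
import math


def build_vocabulary(corpus):
    """Assign each unique word an index in the representation vector."""
    words = sorted({word for doc in corpus for word in doc})
    return {word: index for index, word in enumerate(words)}


def vectorize(doc, corpus, vocab):
    """TF-IDF vector for one sentence; entries stay zero for absent words."""
    vector = [0.0] * len(vocab)
    for word in set(doc):
        if word not in vocab:
            continue  # words unseen in the corpus have no index
        doc_freq = sum(1 for other in corpus if word in other)
        vector[vocab[word]] = doc.count(word) * math.log(len(corpus) / doc_freq)
    return vector


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical, pre-tokenized sentences.
corpus = [
    ["ami", "contributes", "to", "adhesion"],
    ["fimh", "mutations", "reduce", "adhesion"],
    ["data", "were", "log", "transformed"],
]
vocab = build_vocabulary(corpus)
v0, v1, v2 = (vectorize(doc, corpus, vocab) for doc in corpus)

print(cosine(v0, v1))  # positive: the sentences share "adhesion"
print(cosine(v0, v2))  # zero: no words in common
```

Because word order is discarded, two sentences are similar only to the extent that they share weighted terms, which is the limitation the bag-of-words slides describe.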

Support Vector Machine (SVM) is a large-margin classifier

Commonly used in text classification
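A sketch of what such a classifier can look like in scikit-learn (listed under Resources) is given here. It is an illustration and not the configuration used for the reported results: the pipeline layout, default parameters and the two hypothetical negative sentences are assumptions, and scikit-learn's `TfidfVectorizer` applies a smoothed idf that differs slightly from the formula on the TF-IDF slide.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Positive examples (label 1) are the VF assertion sentences quoted in the slides.
# The last two negative examples (label 0) are hypothetical.
sentences = [
    "Mutations in the fimH gene of Salmonella typhimurium result in a "
    "non-fimbriate, non-adhesive phenotype.",
    "Unexpectedly, here we find that nonacylated LprG retains TLR2 activity.",
    "The autolysin Ami contributes to the adhesion of Listeria.",
    "Data were log-transformed to correct for heterogeneity of the variances "
    "where necessary.",
    "Cultures were grown overnight at 37 degrees with shaking.",
    "Plasmids used in this study are listed in the supplementary table.",
]
labels = [1, 1, 1, 0, 0, 0]

# TF-IDF vectorization followed by a linear large-margin classifier.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(sentences, labels)

predictions = model.predict(["The adhesin FimH contributes to virulence."])
print(predictions)
```

Six sentences are far too few to learn anything meaningful; the point is only the shape of the workflow, which scales unchanged to thousands of labeled sentences.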

Initial results based on VF Sentence Set 1

Image Source: http://en.wikipedia.org/wiki/File:Svm_max_sep_hyperplane_with_margin.png

Non-VF, Predicted VF:
◦ “Collectively, these data suggest that EPEC 30-5-1(3) translocates reduced levels of EspB into the host cell.”
◦ “Data were log-transformed to correct for heterogeneity of the variances where necessary.”
◦ “Subsequently, the kanamycin resistance cassette from pVK4 was cloned into the PstI site of pMP3, and the resulting plasmid pMP4 was used to target a disruption in the cesF region of EHEC strain 85-170.”

VF, Predicted Non-VF:
◦ “Here, it is reported that the pO157-encoded Type V-secreted serine protease EspP influences the intestinal colonization of calves.”
◦ “Here, we report that intragastric inoculation of a Shiga toxin 2 (Stx2)-producing E. coli O157:H7 clinical isolate into infant rabbits led to severe diarrhea and intestinal inflammation but no signs of HUS.”
◦ “The DsbLI system also comprises a functional redox pair”

Adding more examples is unlikely to improve results substantially, as the error curve shows

[Figure: error curve, panel “All”: Training Error and Validation Error (y-axis, 0 to 0.5) plotted over an x-axis running from 0 to 10,000]
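An error curve of this kind can be produced with scikit-learn's `learning_curve`. The sketch below uses synthetic data from `make_classification` as a stand-in for the VF sentence vectors, so the numbers it prints say nothing about the actual results; the training-set fractions and the 5-fold split are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.svm import LinearSVC

# Synthetic stand-in for the TF-IDF sentence vectors and their labels.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Train on growing fractions of the data and score each with 5-fold CV.
sizes, train_scores, val_scores = learning_curve(
    LinearSVC(), X, y, train_sizes=[0.2, 0.5, 1.0], cv=5
)

# Error is one minus accuracy, averaged over the folds.
train_error = 1.0 - train_scores.mean(axis=1)
val_error = 1.0 - val_scores.mean(axis=1)

for n, tr, va in zip(sizes, train_error, val_error):
    print(n, round(tr, 3), round(va, 3))
```

When the training and validation errors have converged and flattened, as on the slide, more examples of the same kind are unlikely to help, which motivates the alternatives that follow.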

8 Alternative Algorithms

VF Sentence Set 2

Select 10,000 most important features using chi-square
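A sketch of chi-square feature selection with scikit-learn follows. The slide keeps 10,000 features; the toy corpus here has only a few dozen terms, so `k=5` is used purely for illustration, and the sentences are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical labeled sentences (1 = VF assertion, 0 = other).
sentences = [
    "The autolysin Ami contributes to the adhesion of Listeria.",
    "Mutations in the fimH gene result in a non-adhesive phenotype.",
    "EspP influences the intestinal colonization of calves.",
    "Data were log-transformed to correct for heterogeneity.",
    "Cultures were grown overnight with shaking.",
    "Plasmids used in this study are listed in the table.",
]
labels = [1, 1, 1, 0, 0, 0]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)

# Keep the k features with the highest chi-square score against the labels.
selector = SelectKBest(chi2, k=5)
X_selected = selector.fit_transform(X, labels)

# Map the surviving column indices back to their terms.
index_to_term = {index: term for term, index in vectorizer.vocabulary_.items()}
kept_terms = [index_to_term[i] for i in selector.get_support(indices=True)]
print(X_selected.shape, kept_terms)
```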

Machine learning technique that takes advantage of unlabeled data

Unlabeled data helps determine shape of underlying data distribution

Added randomly selected, unlabeled sentences from VF publications

Trained with 842 labeled and 4,346 unlabeled sentences

Label Spreading is a semi-supervised algorithm somewhat resilient to noise
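A minimal sketch of Label Spreading with scikit-learn is shown here, assuming the library's convention that unlabeled examples carry the label -1. The sentences, the RBF kernel and the parameter values are illustrative assumptions, not the settings used in the experiment.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.semi_supervised import LabelSpreading

# Hypothetical sentences: two labeled VF, two labeled non-VF, two unlabeled.
sentences = [
    "The autolysin Ami contributes to the adhesion of Listeria.",
    "Mutations in the fimH gene result in a non-adhesive phenotype.",
    "Data were log-transformed to correct for heterogeneity.",
    "Cultures were grown overnight with shaking.",
    "The adhesin contributes to adhesion and colonization.",
    "Samples were grown overnight and data were transformed.",
]
labels = np.array([1, 1, 0, 0, -1, -1])  # -1 marks an unlabeled sentence

# Dense TF-IDF vectors; fine for a toy corpus of this size.
X = TfidfVectorizer().fit_transform(sentences).toarray()

# alpha controls how much a point's label may be overridden by its neighbors.
model = LabelSpreading(kernel="rbf", gamma=20, alpha=0.2)
model.fit(X, labels)

inferred = model.transduction_  # a label for every sentence, labeled or not
print(inferred)
```

Because `alpha` lets labeled points be partly relabeled by their neighborhood, the method tolerates some noise in the curated labels, which is the resilience the slide refers to.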

Algorithms have parameters not learned from data

SVMs, for example:
◦ C – balances training error and over-fitting
◦ Kernel – function to map data to high-dimensional space, e.g. linear, polynomial
◦ Gamma – parameter in non-linear kernels, controls how far influence of a training instance reaches

Search over combinations of parameters

Optimal results with linear kernel and slightly smaller C than default
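A parameter search of this kind can be sketched with scikit-learn's `GridSearchCV`. The grid values, the 2-fold split and the toy sentences below are assumptions chosen to keep the example small; they are not the grid that produced the result above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical labeled sentences (1 = VF assertion, 0 = other).
sentences = [
    "The autolysin Ami contributes to the adhesion of Listeria.",
    "Mutations in the fimH gene result in a non-adhesive phenotype.",
    "EspP influences the intestinal colonization of calves.",
    "Nonacylated LprG retains TLR2 activity.",
    "Data were log-transformed to correct for heterogeneity.",
    "Cultures were grown overnight with shaking.",
    "Plasmids used in this study are listed in the table.",
    "Primers were designed with standard software.",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# make_pipeline names the steps after their classes: "tfidfvectorizer", "svc".
pipeline = make_pipeline(TfidfVectorizer(), SVC())
param_grid = {
    "svc__C": [0.1, 0.5, 1.0],
    "svc__kernel": ["linear", "rbf"],
    "svc__gamma": [0.1, 1.0],  # ignored by the linear kernel
}

# Every combination is scored by cross-validation; the best one is refit.
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(sentences, labels)

best = search.best_params_
print(best)
```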

Process of explicitly modeling relations between variables or explicitly representing information not already in a representation scheme, for example:
◦ Classify all numbers as NUMBER instead of numerals
◦ Replace gene/protein names with term GENE_Protein
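The two substitutions can be sketched with regular expressions. The gene name set below is a small hypothetical lookup; in practice it would come from a curated dictionary or a named entity recognizer, which is exactly the manual effort this kind of feature engineering requires.

```python
import re

# Standalone integers or decimals; digits inside names such as "TLR2" are untouched.
NUMBER_RE = re.compile(r"\b\d+(?:\.\d+)?\b")

# Hypothetical gene/protein dictionary, for illustration only.
GENE_NAMES = {"fimH", "LprG", "EspB"}
GENE_RE = re.compile(
    r"\b(?:%s)\b" % "|".join(re.escape(name) for name in sorted(GENE_NAMES))
)


def normalize(sentence):
    """Replace numerals with NUMBER and known gene names with GENE_Protein."""
    sentence = NUMBER_RE.sub("NUMBER", sentence)
    return GENE_RE.sub("GENE_Protein", sentence)


example = "Mutations in the fimH gene were found in 12 of 30 isolates."
normalized = normalize(example)
print(normalized)
```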

Used in text classification problems, e.g. phrase-based learning has improved some rule-based classifiers. [1]

Rule-based learners may not be generalizable to other domains

The taxonomic structure of the Unified Medical Language System (UMLS) has been used to create semantic similarity measures. [2]

Most informative features can be detected automatically, e.g. chi-square

Manual feature engineering is not a viable option if our goal is topic-independent support for biocuration

1. DOI:10.1.1.36.9770

2. PMID: 22580178

Improve quality of data (quantity not likely helpful)

Utilize multiple supervised algorithms, ensemble and non-ensemble

Use unlabeled data and semi-supervised techniques

Feature Selection

Parameter Tuning

Feature Engineering

Given:

◦ High quality data in sufficient quantity

◦ State of the art machine learning algorithms

How to improve results: Change Representation?

TF-IDF
◦ Loss of syntactic and semantic information
◦ No relation between term index and meaning
◦ No support for disambiguation
◦ Feature engineering extends the vector representation or substitutes more general terms for specific ones – a crude way to capture semantic properties

Ideal Representation
◦ Captures semantic similarity of words
◦ Does not require feature engineering
◦ Minimal pre-processing, e.g. no mapping to ontologies
◦ Improves precision and recall

Words represented as a set of weights in a vector

Useful properties
◦ Semantically similar words in close proximity
◦ Methods for capturing phrases, e.g. “Secretion system”
◦ Captures some semantic features

Trained with
◦ Skip-gram or CBOW algorithms
◦ Text, such as PubMed abstracts and open access papers

T. Mikolov, et al. “Efficient Estimation of Word Representations in Vector Space.” 2013. http://arxiv.org/pdf/1301.3781.pdf
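To make the skip-gram idea concrete: the model is trained to predict the words surrounding each word, so its training data are (center, context) pairs drawn from a sliding window. The sketch below only generates those pairs and does not train any vectors; the function name and window size are illustrative assumptions, and in practice a library such as Gensim handles both steps.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = ["type", "three", "secretion", "system"]
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
```

Words that occur in similar contexts generate similar pairs and therefore end up with nearby vectors, which is where the semantic proximity listed under useful properties comes from.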

Utilize distributed representations in classification algorithms

Compare SVM and multi-layered neural network for classification

Build on distributed word representation as basis for shallow semantic parsing and information extraction

Apply to other specialty gene sets

PATRIC Curators: Rebecca Wattam, Chunhong Mao, David Abraham, Meredith Wilson, Yan Zhang

Resources
◦ PATRIC www.patricbrc.org
◦ Python, NumPy, SciPy, scikit-learn
◦ IPython
◦ Gensim

Funding
◦ National Institute of Allergy and Infectious Diseases, National Institutes of Health
