
Page 1: Data day2017

Creating Knowledge Bases from Text in the Absence of Training Data

Sanghamitra Deb, Accenture Technology Laboratory

Phil Rogers, Jana Thompson, Hans Li

Page 2: Data day2017

Typical Business Process

Executive Summary

Business Decisions

hours of knowledge curation by experts

Page 3: Data day2017

The generalized approach to extracting text: parsing

Tokenization -> Normalization -> Parsing -> Lemmatization

Tokenization: separating sentences and words, removing special characters, detecting phrases

Normalization: lowercasing words, word-sense disambiguation

Parsing: detecting parts of speech: nouns, verbs, etc.

Lemmatization: reducing plurals and other word forms to a single word (as found in the dictionary)
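A minimal sketch of this pipeline using NLTK (the library listed later in requirements.txt); the example sentence is made up.

import nltk
from nltk.stem import WordNetLemmatizer

# One-time downloads of the resources the calls below rely on.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)
nltk.download("wordnet", quiet=True)

text = "Carbamazepine is indicated for treating seizures."

tokens = nltk.word_tokenize(text)                    # tokenization
tokens = [t.lower() for t in tokens if t.isalnum()]  # normalization: lowercase, drop punctuation
pos_tags = nltk.pos_tag(tokens)                      # parsing: part-of-speech tags
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t) for t in tokens]   # lemmatization: "seizures" -> "seizure"

print(pos_tags)
print(lemmas)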

Page 4: Data day2017

Extract sentences that contain the specific attribute.

POS-tag and extract unigrams, bigrams, and trigrams centered on nouns.

Extract features: words around the nouns (bag of words / word vectors), position of the noun, and length of the sentence.

Train a machine learning model to predict which unigrams, bigrams, or trigrams satisfy the specific relationship, for example the drug-disease treatment relationship.

Map training data to create a balanced positive and negative training set.

The generalized approach to extracting text: ML
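A minimal sketch of this supervised setup with scikit-learn; the candidate n-grams and labels below are made-up stand-ins for real training data.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical noun-centered candidates with hand labels: 1 = treatment relation, 0 = not.
candidates = [
    "indicated for treating",
    "relief of symptoms",
    "efficacy as an anticonvulsant",
    "history of the drug",
]
labels = [1, 1, 1, 0]

# Bag-of-words over unigrams, bigrams, and trigrams feeding a logistic regression.
model = make_pipeline(CountVectorizer(ngram_range=(1, 3)), LogisticRegression())
model.fit(candidates, labels)
print(model.predict(["indicated for the treatment of seizures"]))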

Page 5: Data day2017


How do we generate this training data?

Page 6: Data day2017

A different approach: Snorkel (Stanford)

Replaces training data by encoding domain knowledge

Page 7: Data day2017

The Snorkel approach to entity extraction

Extract sentences that contain the specific attribute.

POS-tag and extract unigrams, bigrams, and trigrams centered on nouns.

Write rules: encode your domain knowledge into rules.

Validate rules: coverage, conflicts, accuracy.

Run learning: logistic regression, LSTM, …

Examine a random set of candidates and create new rules.

Observe the lowest-accuracy (highest-conflict) rules and edit them.

Iterate.
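A rough sketch of what such rules look like as plain Python functions; the keyword lists are illustrative assumptions, and the actual Snorkel API wraps labeling functions differently.

# Each rule votes on a candidate sentence: 1 = treats, -1 = does not treat, 0 = abstain.
TREAT_WORDS = {"treatment", "treats", "indicated", "relief", "efficacy"}  # assumed keywords
NEGATION_WORDS = {"contraindicated", "should not be used"}                # assumed keywords

def lf_treat_keywords(sentence):
    """Positive vote if a treatment keyword appears in the sentence."""
    s = sentence.lower()
    return 1 if any(w in s for w in TREAT_WORDS) else 0

def lf_negation(sentence):
    """Negative vote if a negation phrase appears in the sentence."""
    s = sentence.lower()
    return -1 if any(w in s for w in NEGATION_WORDS) else 0

rules = [lf_treat_keywords, lf_negation]
print([rule("For the relief of symptoms of depression.") for rule in rules])  # [1, 0]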

Page 8: Data day2017

Training Data | Rules

[Figure: Planetary Orbits]

Page 9: Data day2017

How does Snorkel work without training data?

Write rules: encode your domain knowledge into rules.

The rules are modeled with a Naive Bayes model, which assumes that the rules are conditionally independent. Even though this is often not true, in practice it generates a pretty good training set, with a probability of belonging to either class for each candidate.

These probabilities are fed into a machine learning algorithm (logistic regression in the simplest case) to create a model that is used to make future predictions.

http://arxiv.org/pdf/1512.06474v2.pdf
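A toy sketch of that combination step: under the conditional-independence assumption, each non-abstaining rule shifts the log-odds by an amount set by its accuracy. The accuracies below are assumed; Snorkel itself learns them from the agreements and conflicts among rules.

import numpy as np

def probabilistic_labels(votes, acc, prior=0.5):
    """votes[i, k] in {-1, 0, +1}; acc[k] is the assumed accuracy of rule k."""
    probs = np.empty(votes.shape[0])
    for i, row in enumerate(votes):
        log_odds = np.log(prior / (1 - prior))
        for v, a in zip(row, acc):
            if v != 0:                              # abstentions carry no evidence
                log_odds += v * np.log(a / (1 - a))
        probs[i] = 1.0 / (1.0 + np.exp(-log_odds))
    return probs

votes = np.array([[1, 0, 1], [-1, 1, 0], [0, 0, 0]])   # made-up rule votes for three candidates
print(probabilistic_labels(votes, acc=np.array([0.8, 0.7, 0.9])))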

Page 10: Data day2017

Data Dive: FDA Drug Labels

Page 11: Data day2017

It is indicated for treating respiratory disorder caused due to allergy.

For the relief of symptoms of depression.

Evidence supporting efficacy of carbamazepine as an anticonvulsant was derived from active drug-controlled studies that enrolled patients with the following seizure types:

When oral therapy is not feasible and the strength, dosage form, and route of administration of the drug reasonably lend the preparation to the treatment of the condition

Data Dive: FDA Drug Labels

Page 12: Data day2017

Candidate Extraction

Using domain knowledge and language structure, collect a candidate set with high recall and low precision; typically this set should have around 80% recall and 20% precision.

60% accuracy: too specific, needs to be made more general

30% accuracy: this looks fine
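A minimal sketch of such a high-recall candidate extractor; the drug list and disease pattern below are illustrative assumptions, not the rules used in the talk.

import re

DRUGS = {"lithium carbonate", "carbamazepine"}  # assumed drug lexicon
DISEASE_PATTERN = re.compile(r"\b(bipolar disorder|manic episodes?|seizures?|depression)\b", re.I)

def extract_candidates(sentence):
    """Return (drug, disease) pairs co-occurring in a sentence: high recall, low precision."""
    found_drugs = [d for d in DRUGS if d in sentence.lower()]
    found_diseases = DISEASE_PATTERN.findall(sentence)
    return [(d, dis) for d in found_drugs for dis in found_diseases]

print(extract_candidates("Lithium Carbonate is indicated in the treatment of manic episodes."))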


Page 13: Data day2017

Automated features: POS tags, context, dependency tree, character offsets
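A small sketch approximating these automated features with NLTK (POS tags, a context window, character offsets); dependency-tree features would need a parser and are omitted. The sentence is made up.

import nltk

sentence = "Lithium Carbonate is indicated for the treatment of manic episodes."
tokens = nltk.word_tokenize(sentence)
pos_tags = nltk.pos_tag(tokens)                 # POS-tag feature for each token

# Character offsets of each token within the sentence.
offsets, cursor = [], 0
for tok in tokens:
    start = sentence.find(tok, cursor)
    offsets.append((tok, start, start + len(tok)))
    cursor = start + len(tok)

# Context window of +/- 2 tokens around a candidate mention, e.g. "treatment".
i = tokens.index("treatment")
context = tokens[max(0, i - 2): i + 3]
print(pos_tags[i], offsets[i], context)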

Page 14: Data day2017

Rule Functions

Page 15: Data day2017

Testing Rule Functions:
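A sketch of the diagnostics this step looks at: coverage, conflict, and accuracy of each rule against a small hand-labeled sample. All numbers below are made up.

import numpy as np

votes = np.array([        # rows = candidates, columns = rules; values in {-1, 0, 1}
    [ 1,  0,  1],
    [-1,  1,  0],
    [ 1,  1, -1],
    [ 0,  0,  1],
])
gold = np.array([1, -1, 1, 1])   # hypothetical hand labels for the same candidates

for k in range(votes.shape[1]):
    v = votes[:, k]
    voted = v != 0
    coverage = voted.mean()                          # how often the rule votes at all
    others = np.delete(votes, k, axis=1)
    conflict = np.mean([v[i] != 0 and np.any((others[i] != 0) & (others[i] != v[i]))
                        for i in range(len(v))])     # disagreement with other voting rules
    accuracy = (v[voted] == gold[voted]).mean() if voted.any() else float("nan")
    print(f"rule {k}: coverage={coverage:.2f} conflict={conflict:.2f} accuracy={accuracy:.2f}")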

Page 16: Data day2017

Generation of training data: one rule

[Histogram: distribution of training labels from -1 to 1 with one rule]

Page 17: Data day2017

Generation of training data: two rules

[Histogram: distribution of training labels from -1 to 1 with two rules]

Page 18: Data day2017

Generation of training data: three rules

[Histogram: distribution of training labels from -1 to 1 with three rules]

Page 19: Data day2017

Generation of training data: four rules

[Histogram: distribution of training labels from -1 to 1 with four rules]

Page 20: Data day2017

Generation of training data: 20 rules

[Histogram: distribution of training labels from -1 to 1 with 20 rules]

Page 21: Data day2017

Results and performance

drug name         | disease candidate | candidate | Snorkel
Lithium Carbonate | bipolar disorder  | 1         | 1
Lithium Carbonate | individual        | 1         | 0
Lithium Carbonate | maintenance       | 1         | 0
Lithium Carbonate | manic episode     | 1         | 1

Precision and recall ~90%

Page 22: Data day2017

Evolution of F1-score with sample size

Page 23: Data day2017

Relationship extraction

• Is person X married to person Y?
• Does drug X cure disease Y?
• Does software X (example: Snorkel) run on programming language Y (example: python3)?

Define filters for candidate extraction for a pair (X, Y), for example: (snorkel, python2.7), (snorkel, python3.1), …

Once you have the pairs, examine them using an annotation tool.

Write rules -> observe their performance against annotated data. Iterate.
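A small sketch of a candidate filter for the "software X runs on language Y" relation; the allow-lists are illustrative assumptions.

SOFTWARE = {"snorkel"}
LANGUAGES = {"python2.7", "python3", "python3.1"}

def candidate_pairs(sentence):
    """Return (software, language) pairs that co-occur in a sentence."""
    tokens = sentence.lower().split()
    return [(s, l) for s in SOFTWARE if s in tokens for l in LANGUAGES if l in tokens]

print(candidate_pairs("Snorkel runs on python2.7 but not yet on python3"))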

Page 24: Data day2017

Crowdsourced training data

In some cases training data is generated on the same dataset by multiple people.

In Snorkel, each source can be incorporated as a separate rule function.

The model for the rules figures out the relative weight of each annotator and creates cleaner training data.
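A small sketch of that idea: each crowd worker's annotations become one rule function that abstains on candidates the worker never saw, so the generative model can weight workers by reliability. The worker names and labels are made up.

annotations = {
    "worker_a": {"cand_1": 1, "cand_2": -1},
    "worker_b": {"cand_1": 1, "cand_3": 1},
}

def make_worker_rule(worker):
    labels = annotations[worker]
    def rule(candidate_id):
        return labels.get(candidate_id, 0)   # abstain (0) if this worker never labeled the candidate
    return rule

rules = [make_worker_rule(w) for w in annotations]
print([rule("cand_1") for rule in rules])    # [1, 1]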

Page 25: Data day2017

Why Docker?

• Portability: develop here, run there (internal clusters, AWS, Google Cloud, etc.); reusable by team and clients
• Isolation: OS and Docker isolated from bugs
• Fast
• Easy virtualization: hardware emulation, virtualized OS
• Lightweight

Python stack on Docker

Page 26: Data day2017

Dockerfile:

FROM ubuntu:latest
# MAINTAINER Sanghamitra Deb <[email protected]>
CMD echo Installing Accenture Tech Labs Scientific Python Environment
RUN apt-get install python -y
RUN apt-get update && apt-get upgrade -y
RUN apt-get install curl -y
RUN apt-get install emacs -y
RUN curl -O https://bootstrap.pypa.io/get-pip.py
RUN python get-pip.py
RUN rm get-pip.py
RUN echo "export PATH=~/.local/bin:$PATH" >> ~/.bashrc
RUN apt-get install python-setuptools build-essential python-dev -y
RUN apt-get install gfortran swig -y
RUN apt-get install libatlas-dev liblapack-dev -y
RUN apt-get install libfreetype6 libfreetype6-dev -y
RUN apt-get install libxft-dev -y
RUN apt-get install libxml2-dev libxslt-dev zlib1g-dev
RUN apt-get install python-numpy
ADD requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt -q

requirements.txt:

scipy
matplotlib
ipython
jupyter
pandas
Bottleneck
patsy
pymc
statsmodels
scikit-learn
BeautifulSoup
seaborn
gensim
fuzzywuzzy
xmltodict
untangle
nltk
flask
enum34

Building the Dockerfile:

docker build -t sangha/python .
docker run -it -p 1108:1108 -p 1106:1106 --name pharmaExtraction0.1 -v /location/in/hadoop/ sangha/python bash
docker exec -it pharmaExtraction0.1 bash
docker exec -d pharmaExtraction0.1 python /root/pycodes/rest_api.py

Page 27: Data day2017

Typical ML pipeline vs. Snorkel

(1) Candidate extraction

(2) Rule functions

(3) Hyperparameter tuning

Page 28: Data day2017

Snorkel:

Pros:
• Very little training data necessary
• Do not have to think about feature generation
• Do not need deep knowledge of machine learning
• Convenient UI for data annotation
• Creates structured databases from unstructured text

Cons:
• Code is new, so it may not be robust to all situations
• Doing online prediction is difficult
• Not much transparency in the internal workings

Page 29: Data day2017

Applicability across a variety of industries and use cases:

• Banks: loan approval
• Paleontology
• Design of clinical trials
• Legal investigation
• Market research reports
• Human trafficking
• Skills extraction from resumes
• Content marketing
• Product descriptions and reviews
• Pharmaceutical industry

Page 30: Data day2017

Where to get it?

https://github.com/HazyResearch/snorkel

http://arxiv.org/pdf/1512.06474v2.pdf