Text Mining - as Normal as Data Mining? Andrew Hinton, Application Specialist, IISDV 2016, Tuesday 19th April 2016, Nice



Agenda

Introduction to text mining

The challenge

Applications of specialised normalization solutions

− Maximising Source Normalization

− EASL (Extraction and Search Language)

    − Allows programmatic access to unstructured data similar to SQL over structured data.

− Numeric Normalization & Range search

    − Capturing weights between 60 and 80kg whether expressed in kilograms or pounds, for patient selection from EHRs.

− Gene Mutation Normalization

    − Use case where gene mutations have been linked to rare disease progression.

© 2016 Linguamatics Ltd

Answers to Our Questions are in Free Text

80% of information at companies is in free text

Most of the answers to our questions are there

Ever-increasing amounts of text data to examine

[Chart: PubMed Records, axis from 0 to 25,000,000]

− Different kinds of documents

    − External literature, patents, EHRs, internal reports, blogs, presentations

− Different formats

    − HTML, PDF, XML, Word, PPT, Wiki, TXT, HL7

Keyword Searching


OLED

Documents, Web Pages, Folders

All these documents contain the keyword ‘Additive’. You have to read ALL of each document to find the bit that is relevant to you.

Linguamatics in Healthcare

[Diagram; labels as extracted]

Electronic Health Record

Enterprise Data Warehouse

Pathology, radiology, initial assessment, discharge, check up

Structured data

Clinical Risk Monitor

Patient characteristics

Patient lists

ClinicalTrials.gov

Patient characteristics

Matching clinical trials

Patient narrative

Semantic search tags

Semantic enrichment

Clinical case histories and/or genomic interpretation

Patient characteristics

Scientific literature

I2E Transforms Text into Actionable Insights

Turn text into structured data using sophisticated queries

Accurate results: only retrieves relevant results

Complete results: comprehensive and systematic

Analytics

To drive analytics

Enterprise Warehouse

Search vs. Text Mining

Search Engine

− Filter to find most relevant documents, then read

Text Mining

− Natural Language Processing (NLP): understand meaning

− Use of ontologies and clustered results

− Efficient review, without reading every document

News Feeds, Literature, Patents, Internal Reports, Social Media

Challenges in Unstructured Data


Different word, same meaning

cyclosporine

ciclosporin

Neoral

Sandimmune

Different expression, same meaning

Non-smoker

Does not smoke

Does not drink or smoke

Denies tobacco use

Different grammar, same meaning

5mg/kg of cyclosporine per day

5mg/kg per day of cyclosporine

cyclosporine 5mg/kg per day

Same word, different context

Diagnosed with diabetes

Family history of diabetes

No family history of diabetes
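The two simplest challenges above, synonyms and context, can be illustrated with a minimal Python sketch. It is not how I2E works internally: the lookup table, the function names and the three context labels are assumptions made for illustration, and real NLP would rely on curated ontologies and linguistic analysis rather than a handful of string rules.

```python
import re
from typing import Optional

# Hypothetical mini-thesaurus built from the slide's examples:
# surface form (lower-cased) -> preferred term.
DRUG_SYNONYMS = {
    "cyclosporine": "cyclosporine",
    "ciclosporin": "cyclosporine",
    "neoral": "cyclosporine",
    "sandimmune": "cyclosporine",
}


def normalize_drug(mention: str) -> Optional[str]:
    """Map a drug mention to its preferred term, or None if unknown."""
    return DRUG_SYNONYMS.get(mention.strip().lower())


def diabetes_context(sentence: str) -> str:
    """Classify the context in which 'diabetes' is mentioned.

    Toy rules only: the negated pattern must be tested first because
    it contains the non-negated one.
    """
    s = sentence.lower()
    if "diabetes" not in s:
        return "not mentioned"
    if re.search(r"\bno family history of\b", s):
        return "family history negated"
    if re.search(r"\bfamily history of\b", s):
        return "family history"
    return "patient"


if __name__ == "__main__":
    for name in ["cyclosporine", "ciclosporin", "Neoral", "Sandimmune"]:
        print(name, "->", normalize_drug(name))
    for text in [
        "Diagnosed with diabetes",
        "Family history of diabetes",
        "No family history of diabetes",
    ]:
        print(text, "->", diabetes_context(text))
```

All four drug names collapse to one concept, while the three diabetes sentences receive three different labels even though they share the same keyword.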

NLP


Linguistic Processing Using NLP

Interprets meaning of the text

Groups words into meaningful units

Search for different forms of words


We find that p42mapk phosphorylates c-Myb on serine and threonine .

Purified recombinant p42 MAPK was found to phosphorylate Wee1 .

− sentences

− morphology: different forms

− noun groups: match entities

− verb groups: match actions

From Words to Meaning


“Among them, nimesulide, a selective COX2 inhibitor, …”

Entrez Gene ID: 5743

inhibits

Identifying entities and relations

Linguistics to establish relationships

Text Mining - as Normal as Data Mining?


CHALLENGE

How can we capture information from free text as conveniently as accessing a database?

One of the essential differences is the lack of normalization of terms and concepts in free text.

SOLUTION

NLP-based text mining provides the capability to look through unstructured text normalizing:

• Keywords to concepts

• Numerical data

• Range Search

• Gene Mutations

• Content source

BENEFIT

A set of structured facts, relationships or assertions, from different data sources that can be used for decision support

Providing tabular or visual analytics to fill data warehouses and support better patient care.

Literature

Patents

Reports

Clinical Trials

Examples of Normalization

Content Source Normalization

I2E: A Fully Federated Text Mining Platform

Merge into a single set of results

Content Server 1

Content Server 2

Content Server 3

Content Server 4

Federated Architecture

Normalizing Data from Different Sources

Single query

Differently structured data sources on different servers

− Journal articles (PubMed Central) on local Enterprise Server

− MEDLINE on remote cloud server

Single set of results


Using EASL

EASL: Extraction And Search Language

Representing a Query in EASL

EASL Example

query:
  document:
  - phrase:
    - class: {snid: nci.C1909, pt: Pharmacologic Substance}
    - treat
    - class: {snid: nlm.C04.588.180, pt: Breast Neoplasms}
output:
  outputSettings: {documentsPerAssertion: -1,
    hitsPerDocPerAssertion: 10, outputOrdering: frequency,
    resultType: standard}

Benefits of EASL

Automation

− Richer language for WSAPI applications

− Can build a completely new query vs. adapting smart query parameters

− Allows on-the-fly query production

Re-use

− Save, share and compare components of queries e.g.

− Save out Alternatives

− Load complex expressions in smart query parameters

Audit

− Human readable language for documenting the text mining strategy

− Using open mark-up language (YAML)

Conversion

− Enable scripts to convert from other query languages e.g. advanced search

Different interfaces

− Enables 3rd party applications to create I2E queries

− Developers can produce innovative specialized interfaces e.g. advanced search plus terminologies
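As an illustration of on-the-fly query production, the Python sketch below assembles a structure shaped like the EASL example from plain dictionaries and lists. The key names (`query`, `document`, `phrase`, `class`, `snid`, `pt`, `outputSettings` and its fields) are copied from the slide; the nesting is an assumption, because the indentation was lost in extraction, and the helper names `concept` and `relation_query` are hypothetical. The result is serialized as JSON, which YAML 1.2 parsers generally read as well; the slides do not say whether I2E accepts that form.

```python
import json


def concept(snid: str, preferred_term: str) -> dict:
    """Build a 'class' item like those in the slide's example."""
    return {"class": {"snid": snid, "pt": preferred_term}}


def relation_query(subject: dict, verb: str, obj: dict,
                   hits_per_doc: int = 10) -> dict:
    """Assemble a subject-verb-object phrase query.

    Assumption: 'output' is a top-level sibling of 'query'.
    """
    return {
        "query": {"document": [{"phrase": [subject, verb, obj]}]},
        "output": {
            "outputSettings": {
                "documentsPerAssertion": -1,
                "hitsPerDocPerAssertion": hits_per_doc,
                "outputOrdering": "frequency",
                "resultType": "standard",
            }
        },
    }


if __name__ == "__main__":
    q = relation_query(
        concept("nci.C1909", "Pharmacologic Substance"),
        "treat",
        concept("nlm.C04.588.180", "Breast Neoplasms"),
    )
    print(json.dumps(q, indent=2))
```

Because the query is ordinary data, a script can swap the disease class or the verb and generate a new query without hand-editing text, which is the kind of automation and re-use the list above describes.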


EASL: Enhancing the Value of Federated Search

Merge into a single set of results

Content Server 1

Content Server 2

Content Server 3

Content Server 4

Federated Architecture

translate2easl


Espacenet query, PubMed query

espacenet2easl, pubmed2easl

EASL keywords + index terms

EASL terminologies, linguistics …

Clinical Trials

OMIM

FDA Drug Labels

Patents

NIH Grants

MEDLINE

refine query

Range Search and Normalization

What Do We Want to Find?

Patients

− below 60 years old

− weight ≥ 80kg

− not having chemotherapy after 2010

− with a mutation C677T


Challenge: Variety Within the Text

Below 60 years old

− aged 58

− 35 years old

− 42-year-old

− 39 y/o

Weight ≥ 80kg

− 267 pounds

− 280 lbs

− 80.4kg

− 82 kilograms


After 2010

− January 21, 2011

− October of 2012

− 08/21/11

− 2012-05-04

Mutation C677T

− C677T

− 677C>T

− 677C/T

− 677C->T
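To make the age and weight criteria concrete, here is a minimal Python sketch that turns the surface forms listed above into comparable numbers. The regular expressions cover only the variants shown on the slide, and the function names are hypothetical; this is not a description of I2E's own normalization.

```python
import re
from typing import Optional

LB_TO_KG = 0.45359237  # international pound, exact by definition

_WEIGHT = re.compile(
    r"(\d+(?:\.\d+)?)\s*(kg|kilograms?|lbs?|pounds?)\b", re.IGNORECASE
)
_AGE = re.compile(
    r"\baged\s+(\d{1,3})\b|\b(\d{1,3})[\s-]*(?:years?[\s-]*old|y/o)",
    re.IGNORECASE,
)


def weight_in_kg(text: str) -> Optional[float]:
    """Return the first weight mentioned in text, converted to kilograms."""
    m = _WEIGHT.search(text)
    if m is None:
        return None
    value, unit = float(m.group(1)), m.group(2).lower()
    if unit.startswith(("lb", "pound")):
        value *= LB_TO_KG
    return value


def age_in_years(text: str) -> Optional[int]:
    """Return the first age mentioned in text, in whole years."""
    m = _AGE.search(text)
    if m is None:
        return None
    return int(m.group(1) or m.group(2))


def in_range(value, low=None, high=None) -> bool:
    """Inclusive range test; a missing value never matches."""
    if value is None:
        return False
    if low is not None and value < low:
        return False
    if high is not None and value > high:
        return False
    return True


if __name__ == "__main__":
    for mention in ["267 pounds", "280 lbs", "80.4kg", "82 kilograms"]:
        kg = weight_in_kg(mention)
        print(f"{mention!r}: {kg:.1f} kg, >= 80 kg: {in_range(kg, low=80)}")
    for mention in ["aged 58", "35 years old", "42-year-old", "39 y/o"]:
        age = age_in_years(mention)
        print(f"{mention!r}: {age}, below 60: {in_range(age, high=59)}")
```

Once every mention is reduced to a number in one unit, a criterion such as "weight ≥ 80kg" becomes a single comparison, regardless of how the author wrote it.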

Normalizing Gene Mutations

Different types of mutation description, including:

− positional e.g. +869(T>C)

− rsID e.g. rs100

Transform different syntax e.g.

− 1166A/C -> A1166C

− Asn to Ser substitution at codon 127 -> N127S

− +1196C/T -> C1196T

− g.655C/A>G -> C655G, A655G

− M567V/A -> M567V, M567A
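The syntax transformations above can be sketched in Python with two regular expressions. This covers only the single-letter positional patterns from the slide; prose descriptions such as "Asn to Ser substitution at codon 127" and rsIDs (which need a database lookup) are deliberately left out. The function name and the choice to return a list are assumptions for illustration, not I2E's mutation ontology.

```python
import re
from typing import List

# Position first: 677C>T, 677C/T, 677C->T, +869(T>C), g.655C/A>G
_POS_FIRST = re.compile(
    r"^(?:[cg]\.)?\+?(\d+)\(?([ACGT](?:/[ACGT])*)(?:->|>|/)([ACGT])\)?$"
)
# Reference first: C677T, M567V/A
_REF_FIRST = re.compile(r"^([A-Z])(\d+)([A-Z](?:/[A-Z])*)$")


def normalize_mutation(mention: str) -> List[str]:
    """Return the mention in <ref><position><alt> form.

    A mention that lists alternatives expands to several results;
    an unrecognized mention yields an empty list.
    """
    text = mention.strip().replace(" ", "")
    m = _POS_FIRST.match(text)
    if m:
        pos, refs, alt = m.groups()
        return [f"{ref}{pos}{alt}" for ref in refs.split("/")]
    m = _REF_FIRST.match(text)
    if m:
        ref, pos, alts = m.groups()
        return [f"{ref}{pos}{alt}" for alt in alts.split("/")]
    return []


if __name__ == "__main__":
    for mention in ["C677T", "677C>T", "677C/T", "677C->T", "1166A/C",
                    "+1196C/T", "+869(T>C)", "g.655C/A>G", "M567V/A"]:
        print(f"{mention:12} -> {', '.join(normalize_mutation(mention))}")
```

With this normalization, all four spellings of the C677T mutation from the earlier slide map to the same string, so they can be counted together or joined against a mutation database.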


Mutation Normalization Examples


Range Search

Allows search for values within a range

− in fixed fields e.g. publication date

− within free text e.g. dosages

Can directly ask for e.g.

− patients with diabetes under 60 with BMI under 30

Can find intervals within the text and match them when searching for a number or an overlapping range
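Matching an interval found in text against a number or an overlapping range can be sketched as follows. The `Interval` class and the extraction pattern are illustrative assumptions; the pattern is naive and would, for example, also fire on ISO dates such as 2012-05-04, which a real system would have to recognize as dates first.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Interval:
    low: float
    high: float

    def contains(self, value: float) -> bool:
        """True if value lies within the interval (inclusive)."""
        return self.low <= value <= self.high

    def overlaps(self, other: "Interval") -> bool:
        """True if the two closed intervals share at least one point."""
        return self.low <= other.high and other.low <= self.high


# Matches "5-10" and "25 to 30"; deliberately simplistic.
_RANGE = re.compile(r"(\d+(?:\.\d+)?)\s*(?:-|to)\s*(\d+(?:\.\d+)?)")


def find_intervals(text: str) -> List[Interval]:
    """Extract numeric intervals written as 'a-b' or 'a to b'."""
    return [
        Interval(min(float(a), float(b)), max(float(a), float(b)))
        for a, b in _RANGE.findall(text)
    ]


if __name__ == "__main__":
    found = find_intervals("cyclosporine 5-10 mg/kg per day")
    print(found)
    print(found[0].contains(7))                 # search for a number
    print(found[0].overlaps(Interval(8, 12)))   # search for a range
    print(found[0].overlaps(Interval(11, 12)))
```

The overlap test is the standard one for closed intervals: two ranges intersect exactly when each starts no later than the other ends.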


Range Search with Normalization

Range Search (Age, Date)

− Patients aged < 60yrs

− Date before 2010

Normalizing:

− Report Date, Age, Weight & BMI
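Date normalization for range search can be sketched with the Python standard library. The format list below covers only the four date styles shown earlier in the deck and is an assumption, as are the function names; `08/21/11` is read in US month/day order, month names are parsed under an English locale, and a date without a day, such as "October of 2012", defaults to the first of the month.

```python
from datetime import date, datetime
from typing import Optional

# Candidate formats, tried in order (illustrative, not exhaustive).
_FORMATS = ("%B %d, %Y", "%B of %Y", "%m/%d/%y", "%Y-%m-%d")


def parse_date(text: str) -> Optional[date]:
    """Return the date expressed by text, or None if no format fits."""
    cleaned = text.strip()
    for fmt in _FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    return None


def is_after_year(text: str, year: int) -> bool:
    """True if the date falls in a later calendar year than `year`."""
    parsed = parse_date(text)
    return parsed is not None and parsed.year > year


def is_before_year(text: str, year: int) -> bool:
    """True if the date falls in an earlier calendar year than `year`."""
    parsed = parse_date(text)
    return parsed is not None and parsed.year < year


if __name__ == "__main__":
    for mention in ["January 21, 2011", "October of 2012",
                    "08/21/11", "2012-05-04"]:
        print(mention, "->", parse_date(mention),
              "| after 2010:", is_after_year(mention, 2010))
```

After parsing, "Date before 2010" and "after 2010" are plain comparisons on a normalized value, which is the same idea as the age and weight ranges.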


Normalization Benefits

Ability to compare measurements with different units e.g. kg vs. lbs

Ability to perform range search for numerics, measurements, dates

Standardized representations to link to structured data e.g. mutation databases

Better clustering of results e.g. drug lab codes


Real World Example: Mutation Normalization

Mucopolysaccharidosis II: Hunter Syndrome

Rare X-linked recessive disorder

Deficiency of the lysosomal enzyme iduronate-2-sulfatase

Leads to progressive accumulation of glycosaminoglycans throughout the body

Signs & symptoms:

− Bone deformities with joint stiffness; frequent respiratory infections; cardiomyopathy; hepatosplenomegaly; neurocognitive impairment; reduced lifespan

− Some symptoms partially improved with enzyme replacement therapy

Spectrum of clinical severity (mild to severe); main difference is progressive development of neurodegeneration in the severe form

TEXT ANALYTICS FOR RARE DISEASES: GENOTYPE-PHENOTYPE ASSOCIATION IN HUNTER SYNDROME

CHALLENGE

• Scarcity of knowledge of natural history of disease

• Sparse data, needs high recall across full text papers

• Mutation patterns very variable

• Structured databases lack broad phenotypic association data


SOLUTION

• Developed workflow with Linguamatics I2E

• Abstracts identified in MEDLINE using broad vocabularies

• Full text PDFs processed for text analytics

• I2E mutation ontology and bespoke severity vocabs enabled extraction of genotype-phenotype associations

BENEFIT

• Extraction of patient mutations matched or bettered genetic databases

• Increased understanding of IDS mutational spectrum for provider diagnostics and patient awareness

• Enabled rational approach to immune response classification

Shire use case

In Summary

Better Normalization of

− Numbers, dates, drug codes, TNM cancer stage

− Subsequent range search

− Gene mutations

In combination with a human readable open query language EASL

− Maximises the ease and flexibility of asking complex questions simultaneously across different content sources

Ultimately agile NLP text mining provides

− High quality, structured, clustered & normalized results in the format you need

− Improves speed to insight for faster decision making