Finding Answers in the Data: The Future Role of Text and Data Mining
Kim Zwollo, General Manager, RightsDirect
Andrew Hinton, Linguamatics
Making Copyright Work – CCC and RightsDirect
Rightsholders
600+ million rights from:
• Publishers
• Authors
• Creators
Content Users
• 35,000 companies
• Employees worldwide
• Users in 180 countries
• Licensing Solutions
• Rights Management
• Content Delivery
• Copyright Education
10/15/2014
Overview
• What is Text and Data Mining
• Why text mining is useful
• Technology Trends
• Information Retrieval Challenges
• Publisher perspective
• Emerging solutions
• Use cases from Linguamatics
What is Text and Data Mining
Interpret Meaning, Identify & Extract:
• Facts
• Relationships
• Assertions
Linguamatics 2014
Application Areas for text mining
• Protein-Protein Interactions
• Vocabulary Development
• Target Identification & Prioritization
• Conference Abstract Mining
• Key Opinion Leader Identification
• Safety/Tox
• In-licensing Opportunities
• Gene Profiling
• Systems Biology
• Mining FDA Drug Labels
• Extracting Numerical and Experimental Data
• Mutations and Gene Expression
• Sentiment Analysis in Social Media
• Workflow Integration
• Mining Electronic Medical Records
• Clinical Trial Analysis
• Patent Analysis
• Biomarker Discovery
• Competitive Intelligence
• Drug Repositioning
“Drug Discovery” Process
• Goal: Develop new treatments for diseases through hypothesis formation.
• Methodology:
– Keyword/Database Searching
– Review Literature
– Find relationships
– Develop hypothesis
– Test
– Product development
Etc.
Analyzing Article Sets
Problem: Too Much Research
• 53M Records in Scopus
• 800,000 Journal Articles published per year
http://altmetrics.org/manifesto/ October 26, 2010
Even within one disease area…
• Angina
• Acute coronary syndrome
• Alexia
• Anomic aphasia
• Aortic dissection
• Aortic regurgitation
• Aortic stenosis
• Apoplexy
• Apraxia
• Arrhythmias
• Asymmetric septal hypertrophy (ASH)
• Atherosclerosis
• Atrial flutter
• Atrial septal defect
• Atrioventricular canal defect
• Atrioventricular septal defect
• Avascular necrosis
– Etc.
Lots of disorders …
Lots of documents…
• 35,000+ on Improve Circulation
• 7,000+ per disease area
Literature Based Discovery
Don Swanson (1924-2012)
[1986] Blood viscosity served as a bridge between the topics of Raynaud’s disease and dietary fish oil.
[Diagram: A, B, C, with B as the bridging topic]
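The bridging idea can be sketched in a few lines of Python. This is a minimal illustration, not Swanson's actual procedure: the function name `abc_candidates` and the term pairs are assumptions made for the example, and the pairs are hand-written rather than mined from literature.

```python
from collections import defaultdict

def abc_candidates(cooccurrences, start):
    """Return {C: [bridging B terms]} for terms linked to `start` only via some B."""
    neighbours = defaultdict(set)
    for left, right in cooccurrences:
        neighbours[left].add(right)
        neighbours[right].add(left)

    direct = set(neighbours[start])  # B terms: mentioned together with the start term
    candidates = defaultdict(set)
    for b in direct:
        for c in neighbours[b]:
            # Keep C only if it is never mentioned directly with the start term.
            if c != start and c not in direct:
                candidates[c].add(b)
    return {c: sorted(bs) for c, bs in candidates.items()}

# Toy, hand-written term pairs (illustrative only).
pairs = [
    ("dietary fish oil", "blood viscosity"),
    ("blood viscosity", "Raynaud's disease"),
    ("dietary fish oil", "triglycerides"),
]
print(abc_candidates(pairs, "dietary fish oil"))
```

With real data the pairs would come from term co-occurrence in abstracts or full text, and the candidates would need ranking and expert review.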
Information Retrieval and Discovery Process
*http://www.jisc.ac.uk/reports/value-and-benefits-of-text-mining
Software Platforms for TDM
Information Retrieval
Knowledge Discovery
Challenges for Text Mining Researchers
• Many sources of content
• Many formats
• Difficult to obtain full-text in XML
• Difficult to integrate content into TDM software.
• Hard to negotiate and manage licenses and feeds from all publishers.
STM Publisher Perspective
• Concern about disruptive nature of TDM to subscription business
• Access problem, more than a copyright problem
• Technical challenges with formats and authentication
• More industry education needed
• Top STM Publishers are making their content available for mining
Background: Timeline
• JISC paper May 2011
• First PDR-TDM meeting Nov 2011
• CCC TDM Event – March 2012
• CCC White Paper on TDM Issues and Solutions – May 2012
• CCC Pilot 2013
• Second PDR-TDM meeting Nov 2013
• Content acquisition 2014
• Launch CCC service for mining full text (2015)
Helping TDM Researchers
[Diagram: Publisher 1, Publisher 2, Publisher 3; <XML>]
Rightsholders provide CCC with a feed of their content in XML.
Helping TDM Researchers
[Diagram: Company A, Company B, Company C]
Companies provide CCC with information about their subscriptions and holdings, using our automated tools in DirectPath.
Helping TDM Researchers
[Diagram: Company A, Company B, Company C; Publisher 1, Publisher 2, Publisher 3; <XML>]
Companies request article sets for each TDM project.
CCC manages access based on subscription information.
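As a rough illustration of access managed by subscription information, the sketch below filters a requested article set against a company's holdings. It is not CCC's implementation: the `Article` type, the `entitled_articles` function, the journal names and the DOIs are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Article:
    doi: str
    publisher: str
    journal: str

def entitled_articles(requested, holdings):
    """Split a requested article set into (granted, denied).

    `holdings` maps a publisher name to the set of journal titles
    the company subscribes to (hypothetical data model).
    """
    granted, denied = [], []
    for article in requested:
        if article.journal in holdings.get(article.publisher, set()):
            granted.append(article)
        else:
            denied.append(article)
    return granted, denied

# Hypothetical holdings for "Company A" and a requested article set.
holdings_a = {"Publisher 1": {"Journal X"}, "Publisher 2": {"Journal Y"}}
request = [
    Article("10.0000/example.1", "Publisher 1", "Journal X"),
    Article("10.0000/example.2", "Publisher 2", "Journal Z"),
    Article("10.0000/example.3", "Publisher 3", "Journal W"),
]
granted, denied = entitled_articles(request, holdings_a)
print([a.doi for a in granted])
print([a.doi for a in denied])
```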
Looking Ahead: Emerging Solutions for Information Retrieval
• Open Access Content
• Publisher-specific capabilities for delivering content (Elsevier and others)
• Industry-wide content access solutions by intermediaries
– CrossRef
– CCC
– PLS
A Look at a Text Mining Application
A presentation by Linguamatics
Andrew Hinton, Linguamatics
About Linguamatics
Boston, Cambridge
I2E: agile, scalable, real-time NLP-based text mining
Fact extraction and knowledge synthesis
• Fortune 500
• Pharma/Biotech (including 17 of the top 20)
• Healthcare (including Kaiser Permanente)
• Government (including FDA)
Software, Consulting, Hosted Content
Linguistic Processing Using NLP
• Groups words into meaningful units
• Morphology allows search for different forms of words
Example sentences:
“We find that p42mapk phosphorylates c-Myb on serine and threonine.”
“Purified recombinant p42 MAPK was found to phosphorylate Wee1.”
Annotation layers:
• Sentences
• Morphology: different forms
• Noun groups: match entities
• Verb groups: match actions
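The effect of matching different word forms and entity spellings can be approximated with a small regex-based sketch. This is not how I2E works internally; the mini-lexicon `ENTITIES`, the `VERB` pattern and the function `extract_relations` are assumptions for illustration, and the sketch only handles active-voice sentences where the agent precedes the verb.

```python
import re

# Hypothetical mini-lexicon: surface forms mapped to one preferred label.
ENTITIES = {
    "p42mapk": "p42 MAPK",
    "p42 mapk": "p42 MAPK",
    "c-myb": "c-Myb",
    "wee1": "Wee1",
}
# One pattern covering several morphological forms of the verb.
VERB = re.compile(r"\bphosphorylat(?:es|ed|e|ing)\b", re.IGNORECASE)
# Longest surface forms first so longer names win over shorter ones.
ENTITY = re.compile(
    "|".join(re.escape(k) for k in sorted(ENTITIES, key=len, reverse=True)),
    re.IGNORECASE,
)

def extract_relations(text):
    """Return (agent, action, target) triples, one per matching sentence."""
    relations = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        verb = VERB.search(sentence)
        if not verb:
            continue
        mentions = list(ENTITY.finditer(sentence))
        before = [m for m in mentions if m.end() <= verb.start()]
        after = [m for m in mentions if m.start() >= verb.end()]
        if before and after:
            agent = ENTITIES[before[-1].group().lower()]
            target = ENTITIES[after[0].group().lower()]
            relations.append((agent, "phosphorylates", target))
    return relations

text = (
    "We find that p42mapk phosphorylates c-Myb on serine and threonine. "
    "Purified recombinant p42 MAPK was found to phosphorylate Wee1."
)
for relation in extract_relations(text):
    print(relation)
```

Both sentences yield the same agent label even though the kinase is spelled differently and the verb appears in two forms.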
Unique capabilities of Text Mining
Use-Cases
Biomarker Discovery - Genes
[Screenshot callouts: Gene (from Entrez); complex linguistic relationship; Disease (from MedDRA); relevant sentence extracted with terms highlighted; link to source document]
Categorizing Relationships
Use of NLP allows accurate and precise identification of biomarker relationships
Patent Applications and Grants: Companies vs. Diseases
[Bar chart: number of patent applications and grants (y-axis 0 to 25,000) for Abbott, AZ, Bayer, BMS, GSK, Roche, Merck, Novartis and Pfizer, broken down by disease category]
Disease categories in the legend:
• Virus Diseases
• Substance-Related Disorders
• Stomatognathic Diseases
• Skin and Connective Tissue Diseases
• Respiratory Tract Diseases
• Parasitic Diseases
• Otorhinolaryngologic Diseases
• Occupational Diseases
• Nutritional and Metabolic Diseases
• Nervous System Diseases
• Neoplasms
• Musculoskeletal Diseases
• Mental Disorders
• Male Urogenital Diseases
• Immune System Diseases
• Hemic and Lymphatic Diseases
• Female Urogenital Diseases and Pregnancy Complications
• Eye Diseases
• Endocrine System Diseases
Find properties
Melting Points for Exemplified Compounds
Output to e.g. Excel
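A minimal sketch of this kind of property extraction, assuming each exemplified compound is described on one line and that melting points are written as "m.p." or "melting point" followed by a value or range in °C. The pattern, the function names and the sample text are invented for illustration; real patents vary far more in wording.

```python
import csv
import io
import re

# Matches e.g. "Example 12 ... m.p. 141-143 °C" or "... melting point 98.5 °C" on one line.
MELTING_POINT = re.compile(
    r"Example\s+(?P<example>\d+)\b.*?"
    r"(?:\bm\.?p\.?|\bmelting point)\s*[:=]?\s*"
    r"(?P<low>\d+(?:\.\d+)?)"
    r"(?:\s*[-–]\s*(?P<high>\d+(?:\.\d+)?))?"
    r"\s*°\s*C",
    re.IGNORECASE,
)

def extract_melting_points(text):
    """Return one dict per example that reports a melting point."""
    rows = []
    for m in MELTING_POINT.finditer(text):
        rows.append({
            "example": int(m.group("example")),
            "mp_low_c": float(m.group("low")),
            "mp_high_c": float(m.group("high")) if m.group("high") else None,
        })
    return rows

def to_csv(rows):
    """Serialise rows as CSV text, which Excel can open directly."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["example", "mp_low_c", "mp_high_c"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Invented sample text, not taken from a real patent.
sample = (
    "Example 12: white solid, m.p. 141-143 °C.\n"
    "Example 13: off-white powder, melting point 98.5 °C.\n"
    "Example 14: colourless oil.\n"
)
rows = extract_melting_points(sample)
print(to_csv(rows))
```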
Linking from Definitions to Table Values
Connecting information found in different parts of the document, for example finding a compound as “Example 12” in a patent and linking it to a table where numerical data is reported
Patent document
…
Combined into a row of data in the structured results table
Patent Data from IFI Claims Direct
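The linking step can be pictured as a join on the example number. The sketch below assumes definitions of the form "Example N: name" in the description and a table already parsed into rows; the function name, field names and values are all made up for illustration.

```python
import re

def link_examples_to_table(description, table_rows):
    """Join compound names defined as 'Example N: name' with table rows keyed by example number."""
    names = {
        int(m.group("num")): m.group("name").strip()
        for m in re.finditer(r"Example\s+(?P<num>\d+)\s*:\s*(?P<name>[^\n]+)", description)
    }
    linked = []
    for row in table_rows:
        number = row["example"]
        if number in names:
            # Merge the definition and the table values into one structured row.
            merged = {"example": number, "compound": names[number]}
            merged.update({k: v for k, v in row.items() if k != "example"})
            linked.append(merged)
    return linked

# Invented description text and table values (not from a real patent).
description = "Example 11: Compound A\nExample 12: Compound B\n"
table = [
    {"example": 12, "ic50_nm": 35.0},
    {"example": 13, "ic50_nm": 120.0},
]
print(link_examples_to_table(description, table))
```

Example 13 is dropped because the description defines no compound for it, and Example 11 is dropped because the table reports no value.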
Claim Chain Information
• For information in claims, we often want to work back along the chain of claims to see what the current claim depends on
[Screenshot callouts: Compounds; Treats cervical cancer; Peptide Seq; Residues 33-176]
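Working back along a claim chain amounts to following "claim N" references until an independent claim is reached. A minimal sketch, assuming each claim depends on at most one earlier claim and uses one of a few common phrasings; the pattern, function names and claim texts are invented for illustration.

```python
import re

# A few common dependency phrasings; multiple dependencies ("claims 1 to 3") are not handled.
DEPENDS = re.compile(r"\b(?:according to|as claimed in|of)\s+claim\s+(\d+)", re.IGNORECASE)

def parent_claim(text):
    """Return the claim number this claim depends on, or None if it is independent."""
    match = DEPENDS.search(text)
    return int(match.group(1)) if match else None

def claim_chain(claims, number):
    """Walk from `number` back to the independent claim it ultimately depends on."""
    chain, seen = [], set()
    current = number
    while current is not None and current not in seen:  # `seen` guards against cycles
        seen.add(current)
        chain.append(current)
        current = parent_claim(claims.get(current, ""))
    return chain

# Invented claim texts for illustration.
claims = {
    1: "A compound of formula I.",
    2: "The compound of claim 1, wherein R1 is methyl.",
    3: "A pharmaceutical composition comprising the compound according to claim 2.",
    4: "A method of treating cervical cancer comprising administering the composition of claim 3.",
}
print(claim_chain(claims, 4))
```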
Benefits of Text Mining Using Full Text
• Analysis of PubMed Central records
• Look for mentions of analytical chemistry techniques
• Identify concepts in abstract vs. body
Many more mentions of experimental techniques in full text compared to abstract alone!
[Chart: Analytical Chemistry Techniques Section]
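The abstract-versus-body comparison can be sketched as simple term counting per section. The technique list, function names and article text below are assumptions for illustration only; the analysis on the slide used PubMed Central records and its own vocabulary.

```python
import re
from collections import Counter

# Illustrative technique vocabulary (an assumption, not the list used in the study).
TECHNIQUES = ["HPLC", "mass spectrometry", "NMR"]

def count_mentions(text, terms=TECHNIQUES):
    """Count whole-word, case-insensitive mentions of each term."""
    counts = Counter()
    for term in terms:
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        counts[term] = len(pattern.findall(text))
    return counts

def compare_sections(abstract, body):
    """Return per-term mention counts for the abstract and for the body."""
    in_abstract, in_body = count_mentions(abstract), count_mentions(body)
    return {t: {"abstract": in_abstract[t], "body": in_body[t]} for t in TECHNIQUES}

# Invented article text for illustration.
abstract = "We characterised the metabolite by mass spectrometry."
body = (
    "Samples were separated by HPLC. "
    "HPLC fractions were analysed by mass spectrometry and NMR."
)
for term, counts in compare_sections(abstract, body).items():
    print(term, counts)
```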
Thank You!