Life sciences is fast becoming a data problem. In this presentation we explore the challenges faced by scientists wishing to leverage life science and healthcare big data. We demonstrate Qiagram, a collaborative, visual, ad hoc query tool for exploring these large, complex data sets. Using examples from the Adverse Event Reporting System (AERS) database, MedDRA and SNOMED, we illustrate how scientists with little IT knowledge can mine these data sets and unlock their potential.
Copyright © 2009 - Proprietary & Confidential
Copyright © 2012 - Proprietary & Confidential
QIAGRAM: A NOVEL INTERFACE TO MINE AND UNDERSTAND LARGE DATA SETS WITH NATURAL LANGUAGE QUERIES
Making data useful by making data smarter
Matthew Clark, Ph.D., BioFortis
February 6, 2013
Life sciences is fast becoming a data problem
“Beyond the obvious issues of scale and reproducibility, the complexity and diversity of … data poses the greatest challenge to unlocking knowledge and scientific discovery.”*
*Higdon et al. (2012), Unraveling the Complexities of Life Sciences Data. DOI: 10.1089/big.2012.1505
Big Data Challenges
• Volume: large amounts of data
• Veracity: the credibility/quality of the data; how far it can be trusted
• Velocity: need for rapid analysis
• Value: actionable outcomes for an organization
• Variety: many disparate types
A multitude of ‘omics and more
Genomics, proteomics, cellomics, metabolomics, lipidomics, transcriptomics
High-throughput technologies: NGS, imaging, mass spectrometry, high-capacity flow, arrays
Collecting data at a prodigious rate; not always clear how to use it
Other data: healthcare data (EMR), demographics, adverse events, clinical trials
Potential is huge…
• Targeted trials
• Adverse events from pharmacy/clinical data
• Segment patients based on profiles/responses
• Biomarker discovery
• Observational/outcome studies
• “Virtual” clinical trials
• In-silico discovery
Enormous promise… Enormous challenges
Big data challenges - general
Source: The Economist, 2011, Big Data: Harnessing a Game-Changing Asset
Barriers to extracting value
Source: The Economist, 2011, Big Data: Harnessing a Game-Changing Asset
Big Data in life sciences/healthcare
• Multiple disparate data sources
• Lack of integration: patient-molecular-clinical-assay-payer
• “Swiss cheese” problem
• Data cleansing/verification/credibility
• Standards for data interchange
• Privacy concerns
• Lack of good tools for cross-domain analytics
Changing paradigms
• Hypothesis driven
  – Traditional: test the hypotheses, scientific method
    • What's the mechanism of action of this drug?
• Discovery driven
  – More open, questioning; enumerates elements to drive hypotheses
    • What data do I have? What's interesting?
• Hybrid
  – Discovery driven + hypotheses
    • Human Genome Project
Data Exploration
Often neglected, but now key to getting value from life science big data
What is Data Exploration?
• Occurs before in-depth statistics/analytics
• Explore and probe the data
• Determine “what's interesting” and “what's relevant”
• Generate hypotheses
• Ensure the data is there to support the hypotheses
Asking questions of the data
Very complex query that touches many of the 5 V’s of Big Data
Asking questions of the data
[Figure: query annotated with callouts: multiple data sources, domain expertise, more data sources]
Requires considerable IT resources to program this query
Data Exploration challenges
• Programming is hard
  – Mostly SQL, SAS
• Lack of a shared language to support collaboration
  – Multidisciplinary data requires domain experts
• Meaningful access to data
  – Sensitive to regulatory/compliance requirements
“The hands-on analytics time to write the SAS code and specify clearly what you need for each hypothesis is very time-consuming,” Felix Frueh, CEO, Medco*
*Miller, K. Big Data Analytics, Biomedical Computation Review, Winter 2011/2012
Clinical and molecular data
The Problem
Data managers, overwhelmed by researchers' questions on complex data sources
Researchers with many questions across disciplines
“Weeks to months to NEVER”
“Lost in Translation”
SELECT DISTINCT S.PATIENT_ID, S.SAMPLE_ID, S.SAMPLE_NAME
FROM SAMPLE_INVENTORY S
INNER JOIN PATIENTS P ON S.PATIENT_ID = P.PATIENT_ID
INNER JOIN DIAGNOSIS D ON S.PATIENT_ID = D.PATIENT_ID
INNER JOIN MEDICATIONS M ON S.PATIENT_ID = M.PATIENT_ID
INNER JOIN BIOMARKERS B ON S.PATIENT_ID = B.PATIENT_ID
WHERE D.DIAGNOSIS_NAME = 'LUNG CANCER'
  AND M.MEDICATION_GENERIC_NAME = 'CETUXIMAB'
  AND B.BIOMARKER_NAME = 'EGFR'
  AND B.OBSERVATION = 1
ORDER BY S.PATIENT_ID, S.SAMPLE_NAME
No common language for questions
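A query of this shape can be tried against a small in-memory SQLite database. The sketch below is illustrative only: the table and column names follow the SQL on this slide, but the column types and every row are invented, and the ambiguous PATIENT_ID is qualified with the SAMPLE_INVENTORY alias so that the statement runs.

```python
import sqlite3

# Hypothetical toy schema: names follow the slide, types and rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PATIENTS (PATIENT_ID INTEGER PRIMARY KEY);
CREATE TABLE SAMPLE_INVENTORY (SAMPLE_ID INTEGER, SAMPLE_NAME TEXT, PATIENT_ID INTEGER);
CREATE TABLE DIAGNOSIS (PATIENT_ID INTEGER, DIAGNOSIS_NAME TEXT);
CREATE TABLE MEDICATIONS (PATIENT_ID INTEGER, MEDICATION_GENERIC_NAME TEXT);
CREATE TABLE BIOMARKERS (PATIENT_ID INTEGER, BIOMARKER_NAME TEXT, OBSERVATION INTEGER);

INSERT INTO PATIENTS VALUES (1), (2), (3);
INSERT INTO SAMPLE_INVENTORY VALUES
    (10, 'serum-A', 1), (11, 'tissue-B', 1), (20, 'serum-C', 2), (30, 'serum-D', 3);
INSERT INTO DIAGNOSIS VALUES (1, 'LUNG CANCER'), (2, 'LUNG CANCER'), (3, 'MELANOMA');
INSERT INTO MEDICATIONS VALUES (1, 'CETUXIMAB'), (2, 'CETUXIMAB'), (3, 'CETUXIMAB');
INSERT INTO BIOMARKERS VALUES (1, 'EGFR', 1), (2, 'EGFR', 0), (3, 'EGFR', 1);
""")

QUERY = """
SELECT DISTINCT S.PATIENT_ID, S.SAMPLE_ID, S.SAMPLE_NAME
FROM SAMPLE_INVENTORY S
INNER JOIN PATIENTS P ON S.PATIENT_ID = P.PATIENT_ID
INNER JOIN DIAGNOSIS D ON S.PATIENT_ID = D.PATIENT_ID
INNER JOIN MEDICATIONS M ON S.PATIENT_ID = M.PATIENT_ID
INNER JOIN BIOMARKERS B ON S.PATIENT_ID = B.PATIENT_ID
WHERE D.DIAGNOSIS_NAME = 'LUNG CANCER'
  AND M.MEDICATION_GENERIC_NAME = 'CETUXIMAB'
  AND B.BIOMARKER_NAME = 'EGFR'
  AND B.OBSERVATION = 1
ORDER BY S.PATIENT_ID, S.SAMPLE_NAME
"""

# Only patient 1 satisfies all three conditions, so only that patient's
# samples come back.
rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Even on this toy data the question "which samples come from EGFR-positive lung cancer patients on cetuximab" needs five tables and four joins, which is the translation burden the slide describes.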
Overcoming the challenges
• Deep Collaboration
  – Easy access to dynamic data
  – Intuitive tools
  – Secure holistic view of data
  – Collaboration
Big Data - Deep Collaboration
A single researcher in a silo can often go deep into the data, but may be limited by their domain expertise
Small groups of researchers may be able to collaborate on asking questions but can’t go very deep with the tools they have today
QIAGRAM
Deep Collaboration is when multiple groups of researchers can collaborate in asking questions deeper into the layers of data. Shared domain knowledge allows deeper insights.
Clinical and molecular data
Qiagram – Collaborative Scientific Intelligence
Researchers and data managers can collaborate on creating queries
Qiagram acts as a shared, visual language for queries
More efficient and effective query creation
Transparent to all stakeholders
MINING AERS DATA
Examples
Small Data Can Become Big Data
1000 Drugs
1000 Drug Categories
1000 Adverse Events
10^9 Possible Combinations
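The count on this slide is the product of the three list sizes, assuming each combination picks one drug, one drug category and one adverse event. A short check of the arithmetic (variable names are illustrative):

```python
# Sizes taken from the slide.
n_drugs = 1000
n_drug_categories = 1000
n_adverse_events = 1000

# One item from each list per combination, so the counts multiply.
combinations = n_drugs * n_drug_categories * n_adverse_events
print(f"{combinations:,}")  # 1,000,000,000, i.e. 10**9
```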
Introduction to Qiagram
Answer the Question: Which Sources Have Data on Cholestasis?
Joining Data Sources is Simple
Combining Data Sources
• SNOMED contains a hierarchy of drug and medical terms
  – ~12M records
• AERS contains reports of adverse events
  – ~70M records
• MedDRA contains a hierarchy of adverse event terms
  – 150k records
SNOMED Ontology
MedDRA Hierarchy
low level term | pref term | hlt pref term | hlgt pref term | soc term
abdominal migraine | migraine | migraine headaches | headaches | nervous system disorders
acute migraine | migraine | migraine headaches | headaches | nervous system disorders
band-like headache | tension headache | headaches nec | headaches | nervous system disorders
basilar migraine | basilar migraine | migraine headaches | headaches | nervous system disorders
cephalalgia | headache | headaches nec | headaches | nervous system disorders
cephalalgia or cephalgia | headache | headaches nec | headaches | nervous system disorders
cephalgia | headache | headaches nec | headaches | nervous system disorders
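The hierarchy in this table can be read as a lookup from a low level term to its ancestor at any higher level. The sketch below is a minimal illustration in Python: the rows are copied from the slide, while the function name `roll_up`, the level codes and the list of reported terms are assumptions made for the example, not part of MedDRA or Qiagram.

```python
from collections import Counter

# Rows copied from the slide: (low level term, pref term, HLT, HLGT, SOC).
MEDDRA_ROWS = [
    ("abdominal migraine", "migraine", "migraine headaches", "headaches", "nervous system disorders"),
    ("acute migraine", "migraine", "migraine headaches", "headaches", "nervous system disorders"),
    ("band-like headache", "tension headache", "headaches nec", "headaches", "nervous system disorders"),
    ("basilar migraine", "basilar migraine", "migraine headaches", "headaches", "nervous system disorders"),
    ("cephalalgia", "headache", "headaches nec", "headaches", "nervous system disorders"),
    ("cephalalgia or cephalgia", "headache", "headaches nec", "headaches", "nervous system disorders"),
    ("cephalgia", "headache", "headaches nec", "headaches", "nervous system disorders"),
]

# Hypothetical short codes for the five columns, in table order.
LEVELS = ("llt", "pt", "hlt", "hlgt", "soc")


def roll_up(low_level_term, level):
    """Return the ancestor of a low level term at the requested level."""
    index = LEVELS.index(level)
    for row in MEDDRA_ROWS:
        if row[0] == low_level_term:
            return row[index]
    raise KeyError(low_level_term)


print(roll_up("band-like headache", "pt"))  # tension headache
print(roll_up("basilar migraine", "hlt"))   # migraine headaches

# Invented list of reported terms, counted at the high-level-term level.
reported = ["abdominal migraine", "cephalgia", "cephalalgia", "acute migraine"]
counts = Counter(roll_up(term, "hlt") for term in reported)
print(counts)
```

Rolling terms up this way is what lets several spellings of the same complaint (cephalalgia, cephalgia) be counted together.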
AERS
• All drug-related adverse events reported to FDA since 2000
• Tables for Drugs, demographics, indications, therapy, reactions, outcomes
Filter AERS Drugs by SNOMED Categories
AERS Drug list
Results in all analgesics in AERS, with associated case #s
Count the Various MedDRA High-Level Group Terms Reported for All Drugs in AERS from the SNOMED “antibiotic” Category
AERS Drug list
SNOMED Categories
MedDRA Hierarchy Mapping
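Expressed in SQL, a question like this joins the three sources and then groups by the MedDRA high-level group term. The following is a rough sketch using Python's built-in sqlite3 module; every table name, column name and row is hypothetical and chosen only to mirror the three boxes on the slide, and it is not how Qiagram itself builds or runs the query.

```python
import sqlite3

# Hypothetical miniature versions of the three sources; all rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE aers_drug (isr INTEGER, drug TEXT);
CREATE TABLE aers_reaction (isr INTEGER, pref_term TEXT);
CREATE TABLE snomed_category (drug TEXT, category TEXT);
CREATE TABLE meddra (pref_term TEXT, hlgt TEXT);

INSERT INTO aers_drug VALUES
    (1, 'levofloxacin'), (2, 'levofloxacin'), (3, 'naloxone'), (4, 'amoxicillin');
INSERT INTO aers_reaction VALUES
    (1, 'headache'), (2, 'migraine'), (3, 'headache'), (4, 'rash');
INSERT INTO snomed_category VALUES
    ('levofloxacin', 'antibiotic'), ('amoxicillin', 'antibiotic'),
    ('naloxone', 'opioid antagonist');
INSERT INTO meddra VALUES
    ('headache', 'headaches'), ('migraine', 'headaches'),
    ('rash', 'epidermal and dermal conditions');
""")

# Filter drugs by SNOMED category, map each reported reaction to its
# high-level group term, then count reports per group term.
rows = conn.execute("""
SELECT m.hlgt, COUNT(*) AS n
FROM aers_drug d
JOIN snomed_category s ON s.drug = d.drug
JOIN aers_reaction r ON r.isr = d.isr
JOIN meddra m ON m.pref_term = r.pref_term
WHERE s.category = 'antibiotic'
GROUP BY m.hlgt
ORDER BY n DESC, m.hlgt
""").fetchall()

for hlgt, n in rows:
    print(n, hlgt)
```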
Top Ten Antibiotic Adverse Event High-Level Group Terms
Count | hlgt pref term - MedDRA browser
692,058 | general system disorders nec
522,739 | epidermal and dermal conditions
491,409 | neurological disorders nec
385,375 | joint disorders
297,433 | gastrointestinal signs and symptoms
290,760 | respiratory disorders nec
278,723 | allergic conditions
230,134 | cardiac disorder signs and symptoms
208,985 | injuries nec
197,446 | gastrointestinal motility and defaecation conditions
Data Quality
• With more data, there is more chance for inconsistencies.
• Need easy ways to dynamically check the data, identify errant records
• Example: AERS data
Query to Locate Patients with Treatment Dates After Death Dates
Example Results
isr - drugs | drug | age | death_dt | start_dt - therapies | days after death - therapies + demographics
4016857 | darbepoetin | 59 | 9/5/2002 | 8/12/3003 | 365,583
4006065 | interferon I | 46 | 8/23/2002 | 1/27/2991 | 361,017
6013473 | naloxone | 54 | 6/15/1953 | 6/15/2008 | 20,089
6038344 | combivent | 50 | 12/10/1958 | 12/8/2008 | 18,261
6038344 | levofloxacin | 50 | 12/10/1958 | 12/8/2008 | 18,261
6105245 | dexamethasone | 49 | 1/27/1959 | 7/15/2008 | 18,067
6105245 | bortezomib | 49 | 1/27/1959 | 7/12/2008 | 18,064
6252126 | enfuvirtide | 50 | 8/29/1956 | 7/8/2005 | 17,845
6252126 | efavirenz | 50 | 8/29/1956 | 5/10/2002 | 16,690
6252126 | didanosine | 50 | 8/29/1956 | 5/10/2002 | 16,690
Over 2,000 results
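The check behind these results is a date subtraction: a record is suspect when the therapy start date falls after the recorded death date. Below is a minimal Python sketch of that rule. The first two records are taken from the results on this slide; the third record, the dictionary layout and the function name `days_after_death` are invented for illustration.

```python
from datetime import date

# Two rows from the slide plus one invented, consistent record (isr 9999999).
records = [
    {"isr": 6013473, "drug": "naloxone",
     "death_dt": date(1953, 6, 15), "start_dt": date(2008, 6, 15)},
    {"isr": 6038344, "drug": "combivent",
     "death_dt": date(1958, 12, 10), "start_dt": date(2008, 12, 8)},
    {"isr": 9999999, "drug": "aspirin",
     "death_dt": date(2008, 3, 1), "start_dt": date(2008, 1, 1)},
]


def days_after_death(record):
    """Days between the recorded death date and the therapy start date."""
    return (record["start_dt"] - record["death_dt"]).days


# Keep only records where treatment starts after death, largest gap first.
errant = [
    (r["isr"], r["drug"], days_after_death(r))
    for r in records
    if days_after_death(r) > 0
]
errant.sort(key=lambda item: item[2], reverse=True)

for isr, drug, days in errant:
    print(isr, drug, f"{days:,}")
```

The two gaps computed here (20,089 and 18,261 days) agree with the values shown for those rows on the slide.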
Answer Questions At the Speed of Thought
• Many “purpose built” systems answer pre-defined questions.
• However, in data exploration we need the ability to explore new questions
Collaborative Experience
• A team of physicians, informaticians and safety experts collaboratively explored questions based on large amounts of clinical (SDTM) data:
  – Did subjects who were pre-treated with certain drug classes have the most change in cardiac function?
  – What did the subjects who were outliers in cardiac function change have in common?
• The team defined baselines, changes, etc.
CPRD
• The Clinical Practice Research Datalink (CPRD) is the new English NHS observational data and interventional research service, jointly funded by the NHS National Institute for Health Research (NIHR) and the Medicines and Healthcare products Regulatory Agency (MHRA).
• 6 large fact tables with 1 B to 2 B rows
• Example query
  – Identify patients with coronary artery disease who have taken aspirin, then study readmission rates.
Thomson Reuters' MarketScan
• Several characteristics set MarketScan databases apart from other research databases. The core databases, Commercial, Medicare Supplemental, and Medicaid, are huge – over 170 million patients since 1995.
• Over 25 fact tables, from 100 M up to 1.5 B rows
• Example
  – Identify cancer patients, look at opiate treatment and study the duration of the escalation
Premier Research Services
• Patient level data is available from more than 600 hospitals, 45 million records and 310 million hospital visits
• 5 large Fact tables from 100 M to over 4 B rows
Cerner Health Facts Database
• 8 large fact tables, most in the tens of millions of records
• Example
  – Look for type II diabetes patients and study their infection rates by hospital type.
Launching in March 2013: cloud-based Qiagram offering with AERS and TCGA data