Semantic Data Normalization For Efficient Clinical Trial Research September 8th, 2016


Outline

• The specifics of clinical data
• What is RDF and how can we use it together with text analytics (TA)?
• Semantic annotations and their limitations
• What is semantic data normalization?
• Current state and next steps

Clinical Data

• Unstructured (semi-structured)
• Abundant
• Redundant
• Ambiguous
• Aggregated

To transform your clinical data into information, and even knowledge, you have to analyze it!

… but before that, you have to make it ready for analysis!

What is RDF?

• The RDF data model resolves all syntax-level ambiguities
• It helps you express all data in a common data model

Example source formats (a UniProt flat-file entry and a PubMed XML record):

ID   GRAA_HUMAN     STANDARD;      PRT;   262 AA.
AC   P12544;
DT   01-OCT-1989 (Rel. 12, Created)
DT   01-OCT-1989 (Rel. 12, Last sequence update)
DT   15-JUN-2002 (Rel. 41, Last annotation update)
DE   Granzyme A precursor (EC 3.4.21.78) (Cytotoxic T-lymphocyte proteinase
DE   1) (Hanukkah factor) (H factor) (HF) (Granzyme 1) (CTL tryptase)
DE   (Fragmentin 1).
GN   GZMA OR CTLA3 OR HFSP.
OS   Homo sapiens (Human).

<PubmedArticle>
  <MedlineCitation Owner="NLM" Status="In-Process">
    <PMID Version="1">21500419</PMID>
    <DateCreated>
      <Year>2011</Year>
      <Month>04</Month>
      <Day>15</Day>
    </DateCreated>
    <Article PubModel="Print">
      <Journal>
        <ISSN IssnType="Electronic">1520-6882</ISSN>
        <JournalIssue CitedMedium="Internet">
          <Volume>82</Volume>
          <Issue>20</Issue>
          <PubDate>
            <Year>2010</Year>
            <Month>Oct</Month>
            <Day>15</Day>
          </PubDate>
        </JournalIssue>
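To make the "common data model" point concrete, here is a minimal Python sketch that turns a few lines of a flat-file entry into subject/predicate/object triples. The `ex:` prefix, the predicate names, and the choice of line codes are assumptions made for illustration only; they are not taken from the slides or from any published vocabulary.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical mapping from flat-file line codes to predicate names.
FIELD_PREDICATES = {
    "AC": "ex:accession",
    "GN": "ex:geneName",
    "OS": "ex:organism",
}


def flat_record_to_triples(record: str) -> List[Triple]:
    """Turn 'CODE   value' lines into (subject, predicate, object) triples."""
    subject = None
    triples: List[Triple] = []
    for line in record.strip().splitlines():
        code, _, value = line.strip().partition(" ")
        value = value.strip().rstrip(".;")
        if code == "ID":
            # The first token of the ID line names the entry.
            subject = "ex:" + value.split()[0]
        elif code in FIELD_PREDICATES and subject is not None:
            triples.append((subject, FIELD_PREDICATES[code], value))
    return triples


record = """
ID   GRAA_HUMAN     STANDARD;      PRT;   262 AA.
AC   P12544;
GN   GZMA OR CTLA3 OR HFSP.
OS   Homo sapiens (Human).
"""

for triple in flat_record_to_triples(record):
    print(triple)
```

Once every source is expressed as triples like these, the original syntax (flat file, XML, and so on) no longer matters to the consumer of the data.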

Linked Data

How well interlinked is the linked data cloud?

• Many interesting queries are difficult to express in SPARQL
• String functions cannot be indexed
• Identifiers are often misplaced

[Figure: RDF graph fragment. Nodes: cpath:CPATH-94138, cpath:CPATH-LOCAL-8467065, cpath:CPATH-LOCAL-8749236, uniprot:P29965. Properties: biopax-2:PHYSICAL-ENTITY, biopax-2:XREF, biopax-2:SHORT-NAME, biopax-2:ID, biopax-2:DB, uniprot:mnemonic. Literals: P29965, UNIPROT, CD40L_HUMAN, TNF5_HUMAN, CD4L_HUMAN.]
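One way to read the "misplaced identifiers" point: in the figure, the accession P29965 appears to be stored as a plain value next to a database name (biopax-2:ID, biopax-2:DB) rather than as a link to uniprot:P29965. The Python sketch below shows how such a pair could be rewritten as a prefixed identifier. The `DB_PREFIXES` table and the function name are hypothetical.

```python
from typing import Optional

# Hypothetical mapping from database names found in literals to prefixes.
DB_PREFIXES = {"UNIPROT": "uniprot"}


def xref_to_curie(db: str, identifier: str) -> Optional[str]:
    """Combine a (database name, identifier) literal pair into prefix:id."""
    prefix = DB_PREFIXES.get(db.strip().upper())
    if prefix is None or not identifier.strip():
        # Unknown database or empty identifier: leave it unresolved.
        return None
    return f"{prefix}:{identifier.strip()}"


print(xref_to_curie("UNIPROT", "P29965"))
```

With the literal pair replaced by a shared identifier, a query can join the two graphs directly instead of comparing strings.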

Semantic Annotations

Asthma and chronic obstructive pulmonary disease (COPD) are chronic airway diseases characterized by airflow obstruction. The beta(2)-adrenoceptor mediates bronchodilatation in response to exogenous and endogenous beta-adrenoceptor agonists. Single nucleotide polymorphisms in the beta(2)-adrenoceptor gene (ADRB2) cause amino acid changes (e.g. Arg16Gly, Gln27Glu) that potentially alter receptor function. Recently, a large cohort study found no association between asthma susceptibility and beta(2)-adrenoceptor polymorphisms. In contrast, asthma phenotypes, such as asthma severity and bronchial hyperresponsiveness, have been associated with beta(2)-adrenoceptor polymorphisms.

[Figure: annotation graph for the abstract above. Document: pmid:17714090, with author “Ian A Yang” and journal “Clinical and experimental pharmacology …”. Concepts: Asthma (umls:C000496), COPD, Chronic Obstructive Airway Diseases, Bronchial Diseases, Respiration Disorders; other identifiers shown: umls:C0006261, umls:C0035204. Relations: mentions, broader, broaderTransitive, author, journal.]
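The value of the broader / broaderTransitive relations in the figure is that a document annotated with a narrow concept can be found through any of its ancestors. The Python sketch below illustrates this under assumptions: the exact `BROADER` edges are a guess at the hierarchy in the figure, and the concept labels stand in for their UMLS identifiers.

```python
from typing import Dict, List, Set

# Assumed direct "broader" links, loosely following the labels in the figure.
BROADER: Dict[str, List[str]] = {
    "Asthma": ["Bronchial Diseases"],
    "COPD": ["Chronic Obstructive Airway Diseases"],
    "Chronic Obstructive Airway Diseases": ["Bronchial Diseases"],
    "Bronchial Diseases": ["Respiration Disorders"],
}

# "mentions" links from the annotated abstract.
MENTIONS: Dict[str, List[str]] = {"pmid:17714090": ["Asthma", "COPD"]}


def broader_transitive(concept: str) -> Set[str]:
    """Return every ancestor reachable through one or more broader links."""
    seen: Set[str] = set()
    stack = list(BROADER.get(concept, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(BROADER.get(current, []))
    return seen


def documents_about(concept: str) -> List[str]:
    """Find documents that mention the concept or any of its descendants."""
    return sorted(
        doc
        for doc, mentioned in MENTIONS.items()
        if any(m == concept or concept in broader_transitive(m) for m in mentioned)
    )


print(documents_about("Respiration Disorders"))
```

The abstract never uses the words "respiration disorders", yet the query finds it through the concept hierarchy. This is the kind of background knowledge that annotations add.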

Semantic Annotations

• Good for:
  – Generation of machine-readable metadata
  – Semantic indexing of large sets of documents
  – Providing additional background knowledge
• Limitations:
  – Incomplete knowledge extraction
  – Does not fully capture the context

Semantic Data Normalization

• What is it?
  – A text analytics approach that aims to capture the full context of the information and to provide clear references to concepts/objects, so that machines can interpret it easily.
• How do we do it?
  – Work at the sentence level
  – Extract the key phrases from the sentence
  – Identify the main concept
  – Identify all the qualifiers and negations
  – Model the extracted data as RDF
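The middle steps of this list (key phrase, main concept, qualifiers) can be sketched in a few lines of Python. This is only an illustration under assumptions: the three-entry `LEXICON` is hypothetical and stands in for a real terminology lookup, the pairing of terms with concept identifiers is assumed, and negation handling is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Tiny hypothetical lexicon: term -> (concept id, role in the phrase).
LEXICON: Dict[str, Tuple[str, str]] = {
    "adenocarcinoma": ("C0001418", "main"),
    "advanced": ("C0205179", "qualifier"),
    "biliary tract": ("C0005423", "qualifier"),
}


@dataclass
class Annotation:
    phrase: str
    main_concept: str = ""
    qualifiers: List[str] = field(default_factory=list)


def normalize_phrase(phrase: str) -> Annotation:
    """Look up known terms in the phrase and sort them into roles."""
    text = phrase.lower()
    # Keep the terms in the order in which they occur in the phrase.
    hits = sorted(
        (text.find(term), concept, role)
        for term, (concept, role) in LEXICON.items()
        if term in text
    )
    annotation = Annotation(phrase=phrase)
    for _, concept, role in hits:
        if role == "main":
            annotation.main_concept = concept
        else:
            annotation.qualifiers.append(concept)
    return annotation


print(normalize_phrase("Advanced Biliary Tract Adenocarcinoma"))
```

A production pipeline would need tokenization, disambiguation, and scoring of competing matches; the sketch only shows the shape of the output object.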

Semantic Data Normalization

• Condition text:
  – “Advanced Biliary Tract Adenocarcinoma” (Study ID = NCT01506973)
• Text analysis:
  – One phrase is identified in the condition text
  – Advanced Biliary Tract Adenocarcinoma
• Data schema:
  – One annotation object is created
  – Main concept is “Adenocarcinoma”
  – Qualifier concepts are “Advanced” and “Biliary tract”

Semantic Data Normalization

NCT01506973
    rdf:type                ClinicalTrial
    ct:conditionText        “Advanced Biliary Tract Adenocarcinoma”
    ct:conditionAnnotation  ConditionAnnotationID

ConditionAnnotationID
    ca:hasDisease           C0001418
    ca:hasPhrase            “Advanced Biliary Tract Adenocarcinoma”
    ca:hasQualifiers        QualifierGroupID

QualifierGroupID
    cg:hasQualifiers        C0205179
    cg:hasQualifiers        C0005423
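The condition model above can be written out as plain triples and then queried by concept. The Python sketch below uses the property names from the slide; the identifier scheme for the annotation and qualifier-group nodes (`…_condition_1`, `…_qualifiers`) is an assumption, since the slide only shows placeholders.

```python
from typing import List, Sequence, Tuple

Triple = Tuple[str, str, str]


def condition_triples(
    trial_id: str, text: str, disease: str, qualifiers: Sequence[str]
) -> List[Triple]:
    """Build the triples for one condition annotation of one trial."""
    annotation = f"{trial_id}_condition_1"  # hypothetical identifier scheme
    group = f"{annotation}_qualifiers"      # hypothetical identifier scheme
    triples: List[Triple] = [
        (trial_id, "rdf:type", "ClinicalTrial"),
        (trial_id, "ct:conditionText", f'"{text}"'),
        (trial_id, "ct:conditionAnnotation", annotation),
        (annotation, "ca:hasDisease", disease),
        (annotation, "ca:hasPhrase", f'"{text}"'),
        (annotation, "ca:hasQualifiers", group),
    ]
    triples += [(group, "cg:hasQualifiers", q) for q in qualifiers]
    return triples


def trials_with(triples: Sequence[Triple], disease: str, qualifier: str) -> List[str]:
    """Find trials annotated with the given disease and qualifier concept."""
    facts = set(triples)
    found = []
    for subject, predicate, annotation in triples:
        if predicate != "ct:conditionAnnotation":
            continue
        groups = [o for s, p, o in triples if s == annotation and p == "ca:hasQualifiers"]
        has_disease = (annotation, "ca:hasDisease", disease) in facts
        has_qualifier = any((g, "cg:hasQualifiers", qualifier) in facts for g in groups)
        if has_disease and has_qualifier:
            found.append(subject)
    return found


triples = condition_triples(
    "NCT01506973",
    "Advanced Biliary Tract Adenocarcinoma",
    "C0001418",
    ["C0205179", "C0005423"],
)
print(trials_with(triples, "C0001418", "C0005423"))
```

In a triple store the same lookup would be a short graph query; the nested loops here only stand in for that.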

Demo Example

• Study Conditions
  – Multiple phrases in a text
  – Pre-coordinated vs. post-coordinated concepts
  – Scoring of matching concepts
• Study Interventions
  – Drug, route, form
  – Drug dosage
• Adverse Events
  – Normalization of AE
  – Post-coordinated concepts
• Eligibility Criteria
  – Semantic sectioning and categorization
  – Negations
  – Diseases, findings, treatments, age and gender

Intervention Annotation Model - Drugs

[Figure: RDF graph of the drug annotation model. Nodes: NCT01506973 (rdf:type ClinicalTrial), NCT01506973_1_2, DrugAnnotationID, DrugDosageID, SingleDoseID, PeriodID, FrequencyID, SCTID:111418, SCTID:121681. Properties: ct:hasIntervention, in:drugAnnotation, da:hasDrug (111418), da:hasAdministrationRoute, da:hasAdministrationForm, da:hasDosage, do:hasSingleDose, do:hasPeriod, do:hasFrequency. Value fields: Value, Unit, Denominator Value, Denominator Unit.]
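The figure's Value, Unit, Denominator Value and Denominator Unit labels suggest that each dose, period, or frequency is stored as a structured quantity rather than as free text. A minimal Python sketch of such a structure follows. The class name, the field names, and the accepted text patterns are assumptions for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Quantity:
    value: float
    unit: str
    denominator_value: float = 1.0
    denominator_unit: Optional[str] = None


# Accepts forms such as "500 mg", "10 mg/kg", "75 mg/m2", "1 g/8 h".
_PATTERN = re.compile(
    r"^\s*(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>[A-Za-z]+)"
    r"(?:\s*/\s*(?P<dvalue>\d+(?:\.\d+)?)?\s*(?P<dunit>[A-Za-z]+[0-9]?))?\s*$"
)


def parse_quantity(text: str) -> Optional[Quantity]:
    """Parse a simple dosage string; return None when it does not fit."""
    match = _PATTERN.match(text)
    if match is None:
        return None
    return Quantity(
        value=float(match["value"]),
        unit=match["unit"],
        # A missing denominator value means "per one unit".
        denominator_value=float(match["dvalue"] or 1),
        denominator_unit=match["dunit"],
    )


print(parse_quantity("10 mg/kg"))
```

Text that does not fit the pattern returns `None`, so that it can be routed to a fallback rather than stored with wrong values.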

Criteria Annotation Model

[Figure: RDF graph of the criteria annotation model. Nodes: NCT01506973 (rdf:type ClinicalTrial), CriteriaSection (rdf:type “Inclusion”/”Exclusion”/”Not defined”), Criterion, AnnotationId (rdf:type “Disease”/”Drug”/…). Properties: ct:hasCriteriaSection, cs:hasText, cs:hasCriterion, cr:hasText, cr:hasAnnotation, sa:Negation (“True”/”False”/…), Property 1, Property 2, …, Property N.]

Example section text (cs:hasText): “…No extensive intraductal components on core biopsy, defined as intraductal carcinoma. Patients must not have recurrent invasive breast cancer. …”

Example criterion text (cr:hasText): “Patients must not have recurrent invasive breast cancer.”
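The section / criterion / negation structure of this model can be sketched as follows. This is a deliberately simple Python illustration: the header strings, the three negation cues, and the arrangement of the sample text are assumptions, and a cue list like this is far weaker than a real negation detector.

```python
import re
from typing import Dict, List

# Hypothetical negation cues; a real system would need scope detection too.
NEGATION_CUES = re.compile(r"\b(no|not|without)\b", re.IGNORECASE)

# Hypothetical section headers mapped to the section types in the model.
SECTION_HEADERS = {
    "inclusion criteria": "Inclusion",
    "exclusion criteria": "Exclusion",
}


def annotate_criteria(text: str) -> List[Dict[str, object]]:
    """Split eligibility text into criteria with section type and negation flag."""
    section = "Not defined"
    criteria: List[Dict[str, object]] = []
    for raw_line in text.splitlines():
        line = raw_line.strip().lstrip("-•").strip()
        if not line:
            continue
        header = line.rstrip(":").lower()
        if header in SECTION_HEADERS:
            # A header line switches the section for the lines that follow.
            section = SECTION_HEADERS[header]
            continue
        criteria.append(
            {
                "section": section,
                "text": line,
                "negated": bool(NEGATION_CUES.search(line)),
            }
        )
    return criteria


# Sample text arranged for illustration; the first criterion is made up.
sample = """
Inclusion Criteria:
- Histologically confirmed diagnosis.
- Patients must not have recurrent invasive breast cancer.
"""

for criterion in annotate_criteria(sample):
    print(criterion)
```

Keeping the negation flag on the criterion, separate from the disease annotation, lets a query distinguish "must have X" from "must not have X" even when both mention the same concept.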

Current Status

• Work with ClinicalTrials.gov data as a public showcase
  – > 215K clinical studies
  – > 76 million RDF statements
• Coverage
  – Conditions (197,154 objects)
    – Diseases, Findings, Body locations, Qualifiers
  – Interventions (rdf:type = ‘Drug’ and rdf:type = ‘Biologics’) (381,590 objects)
    – Drugs, Dosages, Administration form, Administration route, Population group
  – Adverse Events (1,226,754 objects)
    – Diseases, Findings, Body locations, Qualifiers
  – Criteria (semantic sectioning and categorization, negations) (7,216,361 objects)
    – Diseases, Findings, Drugs, Population groups
• In total, more than 80 million RDF triples

How Can I Use This?

• Directly mine the public enhanced CT.gov version
• Apply the same approach to your internal clinical trials data
• Once the data is semantically normalized, you can “slice and dice” it as your use case requires
• Examples
  – Top-down data exploration
  – Linked data browsing

Next Steps

• Release an RDFized version of ClinicalTrials.gov
  – Pre-loaded in GraphDB Free
  – Pre-loaded on Ontotext S4 Cloud
  – As an RDF serialization distribution
• Release all semantically structured information under a license that is free for non-commercial use
• Extend the data schema to support not only concepts but also tokens that cannot be normalized to ontology instances


Thank You!

You can contact me by e-mail:

[email protected]