
Standardizing Phenotype Variables in the Database of Genotypes and Phenotypes (dbGaP) based on Information Models

Ko-Wei Lin, DVM, PhD; Alexander Hsieh; Seena Farzaneh, BS; Son Doan, PhD; Hyeoneui Kim, RN, MPH, PhD

Division of Biomedical Informatics, University of California, San Diego, La Jolla, CA

2013 AMIA Summit on Translational Bioinformatics

March 18, 2013

Overview

• Introduction and Background

- Challenges in dbGaP database

- Objective of the study

• Materials, Methods and Results

- Phase I: Test of CEM template models

- Phase II: Build information model

• Conclusions

• Current Status and Future Direction

dbGaP: database of Genotypes and Phenotypes

• Developed by the National Center for Biotechnology Information (NCBI)

• Data repository for studies such as Genome-Wide Association Studies (GWAS), allowing researchers to investigate associations between genotype and phenotype

• GWAS: examine common genetic variants in different individuals to see whether any variant is associated with a trait

• Phenotypes: diseases, signs and symptoms, clinical attributes, etc.

• Currently hosts 400+ studies, 2500+ datasets, 130,000+ phenotype variables

• Reuse of dbGaP data: promote research discovery, validate existing findings, reduce time and cost, advance translational medical research.

Challenges in current dbGaP

• Unstandardized representation of phenotype variables results in incomplete and inaccurate data retrieval.

- No specific naming convention

- No specific meaning in abbreviated codes

Phenotype variable “height”

Feasibility of Using Clinical Element Models (CEM) to Standardize Phenotype Variables in the database of Genotypes and Phenotypes (dbGaP)

Ko-Wei Lin, DVM, PhD; Melissa Tharp, BA; Mike Conway, PhD; Mindy Ross, MD; Jihoon Kim, MS; Wendy Chapman, PhD;

Lucila Ohno-Machado, MD, PhD; Hyeon-Eui Kim, RN, MPH, PhD

Division of Biomedical Informatics, School of Medicine, University of California San Diego, La Jolla, CA

Abstract

The database of Genotypes and Phenotypes (dbGaP) contains various types of data generated in Genome Wide Association Studies (GWAS). These data can potentially be used to facilitate novel scientific discovery and reduce cost and time for exploratory research. However, idiosyncrasies in variable names are a major barrier to reusing these data. We studied the problem of formalizing the phenotype variable descriptions using Clinical Element Models (CEM). Direct mapping of 379 phenotype names to existing CEMs yielded a low rate of exact matches (N=25). However, the flexible and expressive underlying information models of CEM provided a robust means to represent 115 phenotype variable descriptions, indicating that CEMs can be successfully applied to standardize a large portion of the clinical variables contained in dbGaP.

Introduction

With the advancements in genome-wide association studies (GWAS), public repositories of genotype and phenotype data, such as the database of Genotypes and Phenotypes (dbGaP), have become increasingly available online (1). The proper use or reuse of GWAS data could promote exploratory research, novel scientific discovery, validation of existing findings, and reduction of cost and time for research. However, data in such public repositories are not collected in a standardized or harmonized way, and hence it is challenging to reuse them. For example, as illustrated in Table 1, variables are often named without following a specific naming convention, or are labeled with abbreviated codes that do not convey specific meaning. Many of these variables are accompanied by variable descriptions that can help users understand what data the variable intends to represent. However, keyword searches applied to variable descriptions do not always provide accurate results due to many syntactic and lexical complexities associated with narrative text, such as use of negation and synonyms (3).

Table 1. Idiosyncratic ways of representing the height variable in dbGaP

Variable ID          Variable Name          Variable Description
phv00071000.v1       htcm                   Standing height at follow up visit
phv00165340.v1.p2    ESP_HEIGHT_BASELINE    Standing height in cm at baseline
phv00083471.v1.p2    lunghta4               HEIGHT (cm)
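To make the retrieval problem concrete, the following minimal Python sketch runs a naive substring search over the three rows of Table 1. The `search` helper is hypothetical and is not part of dbGaP or of the authors' tooling; it only illustrates why variable names alone are a poor search target.

```python
# Rows copied from Table 1: (variable ID, variable name, variable description).
TABLE_1 = [
    ("phv00071000.v1", "htcm", "Standing height at follow up visit"),
    ("phv00165340.v1.p2", "ESP_HEIGHT_BASELINE", "Standing height in cm at baseline"),
    ("phv00083471.v1.p2", "lunghta4", "HEIGHT (cm)"),
]


def search(rows, keyword, field):
    """Case-insensitive substring search over one field (hypothetical helper)."""
    index = {"name": 1, "description": 2}[field]
    return [row[0] for row in rows if keyword.lower() in row[index].lower()]


# Searching names finds only the variable whose name spells out the keyword.
name_hits = search(TABLE_1, "height", "name")
# Searching descriptions finds all three height variables in this small sample.
description_hits = search(TABLE_1, "height", "description")

print(name_hits)
print(description_hits)
```

In this sample the descriptions recover every height variable while the names recover one of three. Descriptions are still free text, so negation and synonyms can mislead a plain keyword match, which is the limitation noted in the Introduction.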

Idiosyncrasies in variable names are a major challenge to utilizing the data included in dbGaP. Standardizing phenotype variables in a way that supports accurate and complete searches against dbGaP data is one of the main purposes of the Phenotype Finder IN Data Resources (PFINDR) program funded by the National Heart, Lung, and Blood Institute (NHLBI).

As the first step towards standardizing the phenotype variables in dbGaP, we tested the feasibility of using an existing information model for clinical data, the Clinical Element Models (CEM) developed by the GE Healthcare/Intermountain Healthcare Data Modeling and Terminology Team, to formally represent the phenotype variable descriptions in dbGaP. We intend to use such formal representations as the basis for Natural Language Processing (NLP) algorithms that can systematically identify the key information on a phenotype variable from its variable description.

The overall purpose of this study was to explore the possibility of representing the key information that phenotype variables in dbGaP carry using CEM. Specifically, we aimed to evaluate (1) the content coverage of existing CEMs on a small set of phenotype variables and (2) the feasibility of formalizing phenotype variable descriptions using CEM template models.


Challenges in current dbGaP

http://www.ncbi.nlm.nih.gov/gap


- Idiosyncrasies of phenotype variables make it difficult to identify relevant data with a sufficient level of accuracy.

Standardizing phenotype variables is important

→ Focus on variable descriptions

PFINDR program (Phenotype Finder IN Data Resources)

• PhenDisco: Phenotype Discoverer

[Figure: PhenDisco data flow. Legend: PhenDisco system, new workflow, original workflow, PhenDisco data flow. Labels shown: data submitter; data user; Study Description Annotator; Phenotype Variable Annotator; demographics variables (DIVER); other variables; clustering; Query Parser; Structured Query Interface; Ranking Algorithms; free text search; structured (advanced) search; unsorted, flat list results; ranked results/relevance feedback; standardization & annotation; query support; feedback/confirmation (“semi-automated” standardization & annotation).]

Objective

Goal: Investigate an information-model-based approach to standardizing phenotype variables in dbGaP.

• Phase I: Test the feasibility of using CEM template models to formally represent the phenotype variable descriptions in dbGaP.

• Phase II: Develop our own information models and apply them to variable standardization.

The ultimate goal is to develop a Natural Language Processing (NLP) based system that algorithmically standardizes the phenotype variables in PhenDisco.

The Clinical Element Model (CEM)

• Developed by the GE Healthcare/Intermountain Healthcare Data Modeling and Terminology Team

• Supports sharing computable meaning during data exchange between different systems

• A logical structure for representing detailed clinical data models

CEM Template Models

• Serve as the basis for creating a CEM

• 6 domains: Diseases and Disorders, Procedures, Signs and Symptoms, Medications, Anatomical Sites, and Laboratory Tests

• Signs and Symptoms CEM Template Model:

Attribute                    Value set or source
Alleviating_factor           UMLS relations {manages, treats, prevents}
associatedCode               SNOMED CT, UMLS CUI
Body_laterality              {superior, inferior, medial, lateral, distal, proximal, dorsal, ventral}
Body_location                UMLS relation {location_of}
Body_side                    {left, right, bilateral, unmarked}
Conditional                  {true, false}
Course                       {unmarked, changed, increased, decreased, improved, worsened, resolved}
Duration                     Temporal Link
End_time                     Temporal Link
Exacerbating_factor          UMLS relations {complicates, disrupts}
Generic                      {true, false}
Negation_indicator           {negationAbsent, negationPresent}
Relative_temporal_context    Temporal Link
Severity                     UMLS relation {degree_of}
Start_time                   Temporal Link
Subject                      {patient, familyMember, donorFamilyMember, donorOther, other}
Uncertainty_indicator        {indicatorPresent, indicatorAbsent}
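As a rough illustration of what "populating attributes" of a template model could look like in code, the Python sketch below stores the enumerated value sets from the Signs and Symptoms template listed above and checks a candidate instance against them. The dictionary layout, the `invalid_attributes` helper, and the example instance are assumptions made for illustration; they are not the CEM specification or any official CEM tooling.

```python
# Enumerated value sets copied from the Signs and Symptoms template above.
# Attributes backed by UMLS relations, codes, or temporal links are left out
# of this sketch because they are not closed lists of values.
SIGNS_AND_SYMPTOMS_VALUE_SETS = {
    "Body_laterality": {"superior", "inferior", "medial", "lateral",
                        "distal", "proximal", "dorsal", "ventral"},
    "Body_side": {"left", "right", "bilateral", "unmarked"},
    "Conditional": {"true", "false"},
    "Course": {"unmarked", "changed", "increased", "decreased",
               "improved", "worsened", "resolved"},
    "Generic": {"true", "false"},
    "Negation_indicator": {"negationAbsent", "negationPresent"},
    "Subject": {"patient", "familyMember", "donorFamilyMember",
                "donorOther", "other"},
    "Uncertainty_indicator": {"indicatorPresent", "indicatorAbsent"},
}


def invalid_attributes(instance):
    """Return (attribute, value) pairs that fall outside the value sets."""
    problems = []
    for attribute, value in instance.items():
        allowed = SIGNS_AND_SYMPTOMS_VALUE_SETS.get(attribute)
        if allowed is not None and value not in allowed:
            problems.append((attribute, value))
    return problems


# Hypothetical instance, not taken from dbGaP.
example = {"Body_side": "left", "Course": "worsened", "Subject": "patient"}
print(invalid_attributes(example))
print(invalid_attributes({"Body_side": "both"}))
```

The first call reports no problems; the second flags `both` because the template only allows `left`, `right`, `bilateral`, or `unmarked` for `Body_side`.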

Phase I: Representing phenotype variable descriptions in dbGaP using CEM template models

• Materials and Methods

1. Randomly retrieve 200 non-demographic phenotype variable descriptions from two phenotype data dictionaries in dbGaP.

2. Manually conduct the modeling using the six CEM template models.

• Results

1. 115 unique variables

2. CEM template models represented 70% of the phenotype variable descriptions but are overly complex.

V. DISCUSSION

Direct mapping of phenotype names to an existing CEM yielded a very small number of exact matches. However, for CEMs whose level of match was deemed broad, we were able to further specify the chosen CEM to represent the specifics of the mapped phenotype variables by populating attributes and/or constraints of its underlying information model. Likewise, for CEMs deemed narrow matches, the underlying information model could be made to represent the mapped phenotype variables by removing certain attributes or constraints. Motivated by this finding, we proceeded to the second phase of the study, where we modeled phenotype variable descriptions using the CEM template models.

During the modeling process, however, we noticed slight differences between representing phenotype data as clinical data (i.e., as in CEMs) and as research data (i.e., as in dbGaP). The former was often aggregated and reformatted into the latter to meet data analysis and workflow management purposes in research. We expect that many such cases can be resolved by modeling with multiple template models, which can be integrated in a nested fashion, as illustrated in Fig 2.

We also encountered other challenging cases. For

TABLE 4. RESULTS OF MAPPING PHENOTYPE NAMES TO CEM

                              Mapped (N=240)
Phenotype categories      Exact  Broad  Narrow  Related  Not mapped  Total
Diseases and Disorders        0    116       0        5           7    128
Procedures                    0      0       0        0           0      0
Signs and Symptoms            2     19       2        2          56     81
Medications                   0      0       0        0           0      0
Anatomical Sites              0      0       0        0           0      0
Labs                         20      2      44       10          21     97
Other                         3      6       2        7          32     50
Unknown                       0      0       0        0          23     23
Total number                 25    143      48       24         139    379
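The totals in Table 4 can be re-derived from the per-category counts. The short Python sketch below is only an arithmetic check on the published numbers (the variable names are ours); it recomputes the column totals and the share of exact matches behind the "very small number of exact matches" remark.

```python
# Counts copied from Table 4: (exact, broad, narrow, related, not mapped).
TABLE_4 = {
    "Diseases and Disorders": (0, 116, 0, 5, 7),
    "Procedures": (0, 0, 0, 0, 0),
    "Signs and Symptoms": (2, 19, 2, 2, 56),
    "Medications": (0, 0, 0, 0, 0),
    "Anatomical Sites": (0, 0, 0, 0, 0),
    "Labs": (20, 2, 44, 10, 21),
    "Other": (3, 6, 2, 7, 32),
    "Unknown": (0, 0, 0, 0, 23),
}

# Column totals across all phenotype categories.
exact, broad, narrow, related, not_mapped = (
    sum(column) for column in zip(*TABLE_4.values())
)
mapped = exact + broad + narrow + related
total = mapped + not_mapped

print(f"mapped: {mapped} of {total} ({mapped / total:.1%})")
print(f"exact matches: {exact} of {total} ({exact / total:.1%})")
```

The recomputed totals agree with the table: 240 of 379 phenotype names were mapped at some level, and 25 of 379 (about 6.6%) were exact matches.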

TABLE 5. CATEGORIES OF THE PHENOTYPE VARIABLE AND RELEVANT CEM TEMPLATE MODELS USED

Topics                                      Number of variables (%)   CEM template models used
Diseases and Disorders                      1 (0.87)                  Diseases and Disorders
Findings (excluding Disease or Disorder)    70 (60.87)                Signs and Symptoms
Medications                                 2 (1.74)                  Medication, Signs and Symptoms
Laboratory tests                            8 (6.96)                  Laboratory Tests, Signs and Symptoms
Not applicable                              30 (26.09)                --
Unknown                                     4 (3.48)                  --
Total number                                115
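The roughly 70% coverage reported for Phase I appears to follow from Table 5 if the "Not applicable" and "Unknown" rows are treated as the variables that no template model represented. That reading is our assumption, not a statement from the authors; the Python sketch below just works through the arithmetic under it.

```python
# Counts copied from Table 5 (number of variables per category).
TABLE_5 = {
    "Diseases and Disorders": 1,
    "Findings (excluding Disease or Disorder)": 70,
    "Medications": 2,
    "Laboratory tests": 8,
    "Not applicable": 30,
    "Unknown": 4,
}

total = sum(TABLE_5.values())
# Assumption: only these two rows count as "not represented".
unrepresented = TABLE_5["Not applicable"] + TABLE_5["Unknown"]
represented = total - unrepresented
coverage = represented / total

print(f"{represented} of {total} variables represented ({coverage:.1%})")
```

Under this assumption, 81 of 115 variables (70.4%) were represented by at least one CEM template model.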

Figure 2. Nested modeling of "Corticosteroid dose at follow up"

Phase II Methods

[Workflow diagram. Labels shown: randomly select 300 variable descriptions; MetaMap; mapping; information model (semantic roles); eHost; develop rules; algorithmic process; test generalizability of the model.]

South BR et al. BioNLP 2012, pages 130-139. http://code.google.com/p/ehost/

Phase II Results

• Our information model was constructed with 10 semantic role classes.

• Our model fully represented the key concepts in the 600 phenotype variable descriptions.

No.   Semantic role class name   Examples
1     Topic                      Disease, Signs and symptoms
2     Subject of information     Patient, family members
3     Informer                   Doctor
4     Certainty                  Diagnosed, confirmed
5     Situational Context        While sleeping, after birth
6     Temporal modifier          Last month, since last visit
7     Extent modifier            Loudly, excessive
8     Health outcomes            Hospitalization
9     Body site                  Right leg, lower back
10    Quantity Qualifier         How many, count

Mapping Example 1: Mom has lung cancer diagnosed by doctor last year

Semantic role             Value
Topic                     lung cancer
Subject of information    Mom
Body site                 --
Health outcomes           --
Extent modifier           --
Temporal modifier         last year
Situational Context       --
Certainty                 diagnosed
Informer                  doctor
Quantity Qualifier        --

Mapping Example 2: Minor pain in lower back after running

Semantic role             Value
Topic                     pain
Subject of information    Subject
Body site                 lower back
Health outcomes           --
Extent modifier           minor
Temporal modifier         --
Situational Context       after running
Certainty                 --
Informer                  --
Quantity Qualifier        --
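One way to picture the Phase II information model is as a record with one optional slot per semantic role class. The Python sketch below encodes the two mapping examples that way. The `PhenotypeFrame` class and its field names are hypothetical and chosen for illustration; the slides do not describe how the authors store annotations.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class PhenotypeFrame:
    """One optional slot per semantic role class in the Phase II model."""
    topic: Optional[str] = None
    subject_of_information: Optional[str] = None
    informer: Optional[str] = None
    certainty: Optional[str] = None
    situational_context: Optional[str] = None
    temporal_modifier: Optional[str] = None
    extent_modifier: Optional[str] = None
    health_outcomes: Optional[str] = None
    body_site: Optional[str] = None
    quantity_qualifier: Optional[str] = None

    def filled(self):
        """Return only the roles that carry a value."""
        return {role: value for role, value in asdict(self).items()
                if value is not None}


# "Mom has lung cancer diagnosed by doctor last year"
example_1 = PhenotypeFrame(
    topic="lung cancer",
    subject_of_information="Mom",
    temporal_modifier="last year",
    certainty="diagnosed",
    informer="doctor",
)

# "Minor pain in lower back after running"
example_2 = PhenotypeFrame(
    topic="pain",
    subject_of_information="Subject",
    body_site="lower back",
    extent_modifier="minor",
    situational_context="after running",
)

print(example_1.filled())
print(example_2.filled())
```

Leaving unused roles as `None` mirrors the empty cells in the mapping examples: each description fills only the roles it actually mentions.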

Conclusions

• We developed an information model for a simple NLP algorithm to standardize phenotype variables

• Our experience showed that direct analysis of the phenotype variable descriptions in dbGaP is an important component for developing a workable information model

Current Status and Future Direction

• We have developed a system for tagging the phenotype variables with two main semantic roles, “topic” and “subject of information”; the system achieved 69% accuracy in semantic tagging.

• We plan to process all phenotype variables in dbGaP and add them to the pipeline. We will evaluate whether this improves the accuracy of phenotype queries in PhenDisco.


Acknowledgements

• University of California San Diego, Division of Biomedical Informatics: Lucila Ohno-Machado, MD, PhD; Wendy Chapman, PhD; Mike Conway, PhD; Jihoon Kim, MS; Mindy Ross, MD, MBA; Melissa Tharp, BS

Current and past PFINDR team members: Dr. Xiaoqian Jiang, Dr. Neda Alipanah, Stephanie Feudjio Feupe, Rebacca Walker, Asher Garland, Jing Zhang, Ustun Yildiz, Karen Truong, Vinay Venkatesh, Rafael Talavera

• Collaborator: Hua Xu, PhD (Vanderbilt University)

• NIH/NHLBI (National Heart, Lung, and Blood Institute) grant UH2HL108785