Standardizing Phenotype Variables in the Database of Genotypes and Phenotypes (dbGaP) based on Information Models
Ko-Wei Lin, DVM, PhD; Alexander Hsieh; Seena Farzaneh, BS; Son Doan, PhD; Hyeoneui Kim, RN, MPH, PhD
Division of Biomedical Informatics, University of California, San Diego, La Jolla, CA
2013 AMIA Summit on Translational Bioinformatics
March 18, 2013
Overview
• Introduction and Background
- Challenges in dbGaP database
- Objective of the study
• Materials, Methods and Results
- Phase I: Test of CEM template models
- Phase II: Build information model
• Conclusions
• Current Status and Future Direction
dbGaP: database of Genotypes and Phenotypes
• Developed by the National Center for Biotechnology Information (NCBI)
• Data repository of studies, such as Genome-Wide Association Studies (GWAS), that allows researchers to investigate associations between genotype and phenotype
• GWAS: examine common genetic variants in different individuals to see whether any variant is associated with a trait
Genotype Phenotype
• Phenotypes: diseases, signs and symptoms, clinical attributes, etc.
• Currently hosts 400+ studies, 2,500+ datasets, and 130,000+ phenotype variables
• Reuse of dbGaP data: promotes research discovery, validates existing findings, reduces time and cost, and advances translational medical research
Challenges in current dbGaP
• Unstandardized representation of phenotype variables results in incomplete and inaccurate data retrieval.
- No specific naming convention
- No specific meaning in abbreviated codes
Phenotype variable “height”
Feasibility of Using Clinical Element Models (CEM) to Standardize Phenotype Variables in the database of Genotypes and Phenotypes (dbGaP)
Ko-Wei Lin, DVM, PhD; Melissa Tharp, BA; Mike Conway, PhD; Mindy Ross, MD; Jihoon Kim, MS; Wendy Chapman, PhD;
Lucila Ohno-Machado, MD, PhD; Hyeon-Eui Kim, RN, MPH, PhD
Division of Biomedical Informatics, School of Medicine, University of California San Diego, La Jolla, CA
Abstract
The database of Genotypes and Phenotypes (dbGaP) contains various types of data generated in Genome Wide Association Studies (GWAS). These data can potentially be used to facilitate novel scientific discovery and to reduce the cost and time of exploratory research. However, idiosyncrasies in variable names are a major barrier to reusing these data. We studied the problem of formalizing the phenotype variable descriptions using Clinical Element Models (CEM). Direct mapping of 379 phenotype names to existing CEMs yielded a low rate of exact matches (N=25). However, the flexible and expressive underlying information models of CEM provided a robust means to represent 115 phenotype variable descriptions, indicating that CEMs can be successfully applied to standardize a large portion of the clinical variables contained in dbGaP.
Introduction
With the advancements in genome-wide association studies (GWAS), public repositories of genotype and phenotype data, such as the database of Genotypes and Phenotypes (dbGaP), have become increasingly available online (1). The proper use or reuse of GWAS data could promote exploratory research, novel scientific discovery, validation of existing findings, and reduction of cost and time for research. However, data in such public repositories are not collected in a standardized or harmonized way, and hence it is challenging to reuse them. For example, as illustrated in Table 1, variables are often named without following a specific naming convention, or are labeled with abbreviated codes that do not convey specific meaning. Many of these variables are accompanied by variable descriptions that can help users understand what data the variable intends to represent. However, keyword searches applied to variable descriptions do not always provide accurate results due to many syntactic and lexical complexities associated with narrative text, such as use of negation and synonyms (3).
Table 1. Idiosyncratic ways of representing the height variable in dbGaP
Variable ID          Variable Name          Variable Description
phv00071000.v1       htcm                   Standing height at follow up visit
phv00165340.v1.p2    ESP_HEIGHT_BASELINE    Standing height in cm at baseline
phv00083471.v1.p2    lunghta4               HEIGHT (cm)
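The effect of idiosyncratic naming can be illustrated with the three rows of Table 1. The following is a minimal sketch, not part of the study: the `search` helper and its `field` parameter are hypothetical names introduced only for illustration, and plain substring matching is assumed as the search strategy.

```python
# Rows copied from Table 1: (variable ID, variable name, variable description).
TABLE_1 = [
    ("phv00071000.v1", "htcm", "Standing height at follow up visit"),
    ("phv00165340.v1.p2", "ESP_HEIGHT_BASELINE", "Standing height in cm at baseline"),
    ("phv00083471.v1.p2", "lunghta4", "HEIGHT (cm)"),
]


def search(keyword, field):
    """Return IDs of variables whose chosen field contains the keyword.

    Hypothetical helper: case-insensitive substring match only.
    """
    index = {"name": 1, "description": 2}[field]
    return [row[0] for row in TABLE_1 if keyword.lower() in row[index].lower()]


# Searching variable names misses the abbreviated codes.
name_hits = search("height", "name")
# Searching variable descriptions finds all three height variables.
description_hits = search("height", "description")

print(name_hits)
print(description_hits)
```

In this small case the descriptions recover every height variable while the names recover only one, which is why the descriptions are the more promising target for standardization. Substring matching would still be affected by negation and synonyms in longer descriptions.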
Idiosyncrasies in variable names are a major challenge to utilizing the data included in dbGaP. Standardizing phenotype variables in such a way that supports an accurate and complete search against dbGaP data is one of the main purposes of the Phenotype Finder in Data Resource (PFINDR) program funded by the National Heart, Lung, and Blood Institute (NHLBI).
As the first step towards standardizing the phenotype variables in dbGaP, we tested the feasibility of using an existing information model for clinical data, the Clinical Element Models (CEM) developed by the GE Healthcare/Intermountain Healthcare Data Modeling and Terminology Team, to formally represent the phenotype variable descriptions in dbGaP. We intend to use such formal representations as the basis for Natural Language Processing (NLP) algorithms that can systematically identify the key information on a phenotype variable from its variable description.
The overall purpose of this study was to explore the possibility of representing the key information that phenotype variables in dbGaP carry using CEM. Specifically, we aimed to evaluate (1) the content coverage of existing CEMs on a small set of phenotype variables and (2) the feasibility of formalizing phenotype variable descriptions using CEM template models.
Challenges in current dbGaP
http://www.ncbi.nlm.nih.gov/gap
- Idiosyncrasies of phenotype variables make it difficult to identify relevant data with a sufficient level of accuracy.
Standardizing phenotype variables is important
→ Focus on variable descriptions
PFINDR program (Phenotype Finder IN Data Resources)
• PhenDisco: Phenotype Discoverer
[Figure: PhenDisco system and data flow. Legend: PhenDisco system, PhenDisco data flow, original workflow, new workflow. Labeled elements: data user; data submitter; free text search; structured (advanced) search; unsorted, flat list results; ranked results/relevance feedback; Study Description Annotator; Phenotype Variable Annotator; demographics variables (DIVER); other variables; Query Parser; Structured Query Interface; Ranking Algorithms; Clustering; standardization & annotation; query support; feedback/confirmation for "semi-automated" standardization & annotation.]
Objective
Goal: Investigate an information-model-based approach to standardizing phenotype variables in dbGaP.
• Phase I: Test the feasibility of CEM template models to formally represent the phenotype variable descriptions in dbGaP.
• Phase II: Develop our own information models and apply them to variable standardization.
The ultimate goal is to develop a Natural Language Processing (NLP) based system that algorithmically standardizes the phenotype variables in PhenDisco.
The Clinical Element Model (CEM)
• Developed by the GE Healthcare/Intermountain Healthcare Data Modeling and Terminology Team
• Supports sharing of computable meaning during data exchange between different systems
• A logical structure for representing detailed clinical data models
CEM Template Models
• Serve as the basis for creating a CEM
• 6 domains: Disease and Disorders, Procedures, Signs and Symptoms, Medications, Anatomical Sites, and Laboratory Test
• Signs and Symptoms CEM Template Model:

Attribute                    Value
Alleviating_factor           UMLS relations {manages, treats, prevents}
associatedCode               SNOMED CT, UMLS CUI
Body_laterality              {superior, inferior, medial, lateral, distal, proximal, dorsal, ventral}
Body_location                UMLS relation {location_of}
Body_side                    {left, right, bilateral, unmarked}
Conditional                  {true, false}
Course                       {unmarked, changed, increased, decreased, improved, worsened, resolved}
Duration                     Temporal Link
End_time                     Temporal Link
Exacerbating_factor          UMLS relations {complicates, disrupts}
Generic                      {true, false}
Negation_indicator           {negationAbsent, negationPresent}
Relative_temporal_context    Temporal Link
Severity                     UMLS relation {degree_of}
Start_time                   Temporal Link
Subject                      {patient, familyMember, donorFamilyMember, donorOther, other}
Uncertainty_indicator        {indicatorPresent, indicatorAbsent}
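To make the template concrete, the enumerated attributes listed above can be treated as a small schema and checked against a candidate instance. This is a sketch under stated assumptions, not the CEM tooling itself: the `validate` function, the dictionary layout, and the example instance (including its placeholder code string) are hypothetical, and attributes whose values are codes, UMLS relations, or temporal links are simply accepted as free-form.

```python
# Enumerated attributes of the Signs and Symptoms template, as listed above.
ENUMERATED = {
    "Body_laterality": {"superior", "inferior", "medial", "lateral",
                        "distal", "proximal", "dorsal", "ventral"},
    "Body_side": {"left", "right", "bilateral", "unmarked"},
    "Conditional": {True, False},
    "Course": {"unmarked", "changed", "increased", "decreased",
               "improved", "worsened", "resolved"},
    "Generic": {True, False},
    "Negation_indicator": {"negationAbsent", "negationPresent"},
    "Subject": {"patient", "familyMember", "donorFamilyMember",
                "donorOther", "other"},
    "Uncertainty_indicator": {"indicatorPresent", "indicatorAbsent"},
}

# Attributes filled with codes, UMLS relations, or temporal links.
# Assumption: their values are not checked in this sketch.
FREE_FORM = {
    "Alleviating_factor", "associatedCode", "Body_location", "Duration",
    "End_time", "Exacerbating_factor", "Relative_temporal_context",
    "Severity", "Start_time",
}


def validate(instance):
    """Return a list of problems found in a candidate template instance."""
    problems = []
    for attribute, value in instance.items():
        if attribute in ENUMERATED:
            if value not in ENUMERATED[attribute]:
                problems.append(f"{attribute}: value {value!r} is not allowed")
        elif attribute not in FREE_FORM:
            problems.append(f"{attribute}: unknown attribute")
    return problems


# Hypothetical instance for a description such as "pain in right leg".
good = {
    "associatedCode": "<code for pain>",  # placeholder, not a real code
    "Body_side": "right",
    "Subject": "patient",
    "Negation_indicator": "negationAbsent",
}
bad = {"Body_side": "both", "Mood": "calm"}

print(validate(good))
print(validate(bad))
```

The first instance passes, while the second is flagged twice: once for a value outside the enumerated set and once for an attribute the template does not define.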
Phase I: Representing phenotype variable descriptions in dbGaP using CEM template models
• Material and Methods
1. Randomly retrieve 200 non-demographic phenotype variable descriptions from two phenotype data dictionaries in dbGaP.
2. Manually conduct the modeling using the six CEM template models.
• Results
1. 115 unique variables
2. CEM template models represented 70% of the phenotype variable descriptions but are overly complex.
V. DISCUSSION
Direct mapping of phenotype names to an existing CEM yielded a very small number of exact matches. However, for CEMs whose level of match was deemed broad, we were able to further specify the chosen CEM to represent the specifics of the mapped phenotype variables by populating attributes and/or constraints of its underlying information model. Likewise, for CEMs deemed narrow matches, the underlying information model could be made to represent the mapped phenotype variables by removing certain attributes or constraints. Motivated by this finding, we proceeded to the second phase of the study, in which we modeled phenotype variable descriptions using the CEM template models.
During the modeling process, however, we noticed slight differences between representing phenotype data as clinical data (i.e., as in CEMs) and as research data (i.e., as in dbGaP). The former was often aggregated and reformatted into the latter to meet data analysis and workflow management purposes in research. We expect that many such cases can be resolved by modeling with multiple template models, which can be integrated in a nested fashion, as illustrated in Fig 2.
We also encountered other challenging cases. For
TABLE 4. RESULTS OF MAPPING PHENOTYPE NAMES TO CEM
                            Mapped (N=240)
Phenotype categories        Exact  Broad  Narrow  Related  Not mapped  Total
Diseases and Disorders          0    116       0        5           7    128
Procedures                      0      0       0        0           0      0
Signs and Symptoms              2     19       2        2          56     81
Medications                     0      0       0        0           0      0
Anatomical Sites                0      0       0        0           0      0
Labs                           20      2      44       10          21     97
Other                           3      6       2        7          32     50
Unknown                         0      0       0        0          23     23
Total number                   25    143      48       24         139    379
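The mapping rates implied by the totals row of Table 4 can be worked out directly. The sketch below only restates arithmetic on the reported counts; the variable names are illustrative.

```python
# Totals row of Table 4 (number of phenotype names per mapping result).
TABLE_4_TOTALS = {
    "Exact": 25,
    "Broad": 143,
    "Narrow": 48,
    "Related": 24,
    "Not mapped": 139,
}

total = sum(TABLE_4_TOTALS.values())
mapped = total - TABLE_4_TOTALS["Not mapped"]

# Share of all phenotype names with an exact CEM match.
exact_rate = 100 * TABLE_4_TOTALS["Exact"] / total
# Share of all phenotype names mapped at any level of match.
mapped_rate = 100 * mapped / total

print(f"total={total}, mapped={mapped}")
print(f"exact: {exact_rate:.1f}%, mapped at any level: {mapped_rate:.1f}%")
```

Exact matches make up roughly 6.6% of the 379 phenotype names, while about 63.3% map at some level, most of them as broad matches.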
TABLE 5. CATEGORIES OF THE PHENOTYPE VARIABLE AND RELEVANT CEM TEMPLATE MODELS USED
Topics                                      Number of variables (%)  CEM template models used
Diseases and Disorders                       1 (0.87)                Diseases and Disorders
Findings (excluding Disease or Disorder)    70 (60.87)               Signs and Symptoms
Medications                                  2 (1.74)                Medication, Signs and Symptoms
Laboratory tests                             8 (6.96)                Laboratory Tests, Signs and Symptoms
Not applicable                              30 (26.09)               --
Unknown                                      4 (3.48)                --
Total number                               115
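The share of variables that could be represented with a CEM template model follows from the counts in Table 5. The sketch below assumes that the rows with a template listed count as represented and that "Not applicable" and "Unknown" do not; that reading is an interpretation of the table, and it appears consistent with the figure of about 70% reported for Phase I.

```python
# Counts from Table 5; the second element is True when a CEM template
# model is listed for the category (assumed to mean "represented").
TABLE_5 = {
    "Diseases and Disorders": (1, True),
    "Findings (excluding Disease or Disorder)": (70, True),
    "Medications": (2, True),
    "Laboratory tests": (8, True),
    "Not applicable": (30, False),
    "Unknown": (4, False),
}

total = sum(count for count, _ in TABLE_5.values())
represented = sum(count for count, has_model in TABLE_5.values() if has_model)
coverage = 100 * represented / total

print(f"{represented} of {total} variables represented ({coverage:.2f}%)")
```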
Figure 2. Nested modeling of "Corticosteroid dose at follow up"
Phase II Methods
[Workflow diagram. Labeled steps and tools: randomly select 300 variable descriptions; MetaMap; eHost; mapping; information model (semantic roles); test generalizability of the model; develop rules; algorithmic process.]
South BR et al. BioNLP 2012, page 130-139. http://code.google.com/p/ehost/
Phase II Results
• Our information model was constructed with 10 semantic role classes.
• Our model fully represented the key concepts in the 600 phenotype variable descriptions.

No.  Semantic role class name  Examples
1    Topic                     Disease, Signs and symptoms
2    Subject of information    Patient, family members
3    Informer                  Doctor
4    Certainty                 Diagnosed, confirmed
5    Situational Context       While sleeping, after birth
6    Temporal modifier         Last month, since last visit
7    Extent modifier           Loudly, excessive
8    Health outcomes           Hospitalization
9    Body site                 Right leg, lower back
10   Quantity Qualifier        How many, count
Mapping Example 1
"Mom has lung cancer diagnosed by doctor last year"

Semantic role class       Value
Topic                     lung cancer
Subject of information    Mom
Body site                 --
Health outcomes           --
Extent modifier           --
Temporal modifier         last year
Situational Context       --
Certainty                 diagnosed
Informer                  doctor
Quantity Qualifier        --

Mapping Example 2
"Minor pain in lower back after running"

Semantic role class       Value
Topic                     pain
Subject of information    Subject
Body site                 lower back
Health outcomes           --
Extent modifier           minor
Temporal modifier         --
Situational Context       after running
Certainty                 --
Informer                  --
Quantity Qualifier        --
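The two mapping examples can be written down as filled frames over the 10 semantic role classes. This is a minimal sketch of one possible data structure, not the representation used in PhenDisco: the class name `SemanticRoleFrame`, the field names, and the `filled` method are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class SemanticRoleFrame:
    """One slot per semantic role class; None means the role is not expressed."""
    topic: Optional[str] = None
    subject_of_information: Optional[str] = None
    informer: Optional[str] = None
    certainty: Optional[str] = None
    situational_context: Optional[str] = None
    temporal_modifier: Optional[str] = None
    extent_modifier: Optional[str] = None
    health_outcomes: Optional[str] = None
    body_site: Optional[str] = None
    quantity_qualifier: Optional[str] = None

    def filled(self):
        """Return only the roles that carry a value."""
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not None
        }


# "Mom has lung cancer diagnosed by doctor last year"
example_1 = SemanticRoleFrame(
    topic="lung cancer",
    subject_of_information="Mom",
    informer="doctor",
    certainty="diagnosed",
    temporal_modifier="last year",
)

# "Minor pain in lower back after running"
example_2 = SemanticRoleFrame(
    topic="pain",
    subject_of_information="Subject",
    body_site="lower back",
    extent_modifier="minor",
    situational_context="after running",
)

print(example_1.filled())
print(example_2.filled())
```

Keeping every role as an optional slot mirrors the mapping tables, where unused roles are left empty, and makes two descriptions comparable role by role.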
Conclusions
• We developed an information model for a simple NLP algorithm to standardize phenotype variables.
• Our experience showed that direct analysis of the phenotype variable descriptions in dbGaP is an important component of developing a workable information model.
Current Status and Future Direction
• We have developed a system for tagging the phenotype variables with two main semantic roles, "topic" and "subject of information"; the system achieved 69% accuracy in semantic tagging.
• We plan to process all phenotype variables in dbGaP and add them to the pipeline. We will evaluate whether this improves the accuracy of phenotype queries in PhenDisco.
Acknowledgements
• University of California San Diego Division of Biomedical Informatics
Lucila Ohno-Machado, MD, PhD; Wendy Chapman, PhD; Mike Conway, PhD; Jihoon Kim, MS; Mindy Ross, MD, MBA; Melissa Tharp, BS
Current and past PFINDR team members: Dr. Xiaoqian Jiang, Dr. Neda Alipanah, Stephanie Feudjio Feupe, Rebacca Walker, Asher Garland, Jing Zhang, Ustun Yildiz, Karen Truong, Vinay Venkatesh, Rafael Talavera
• Collaborator: Hua Xu, PhD (Vanderbilt University)
• NIH/NHLBI (National Heart, Lung, and Blood Institute) grant UH2HL108785