SHARPn SUMMIT “SECONDARY USE” 3rd Annual Face-to-Face

University of Minnesota Rochester Center, 111 South Broadway

June 11-12, 2012


SHARPn Program Leadership

Christopher G. Chute, M.D., Dr.P.H. - PI, Mayo Clinic

Stan Huff, M.D. - Co-PI, University of Utah / Intermountain Healthcare

Hongfang Liu, Ph.D. - Data Normalization Lead; Applied NLP, Mayo Clinic

Guergana Savova, Ph.D. - NLP Research Lead, Harvard Medical School & Children's Hospital Boston

Jyotishman Pathak, Ph.D. - Phenotyping Lead, Mayo Clinic

Calvin Beebe - Chief Architect, Infrastructure Lead, Mayo Clinic

Marshall Schor - Scaling Capacity Lead, IBM T.J. Watson Research Center

Kent Bailey, Ph.D. - Data Quality Lead, Mayo Clinic

Lacey Hart, MBA, PMP® - Executive Officer, Mayo Clinic

Erin Martin - Administrative Assistant, Mayo Clinic

Project Advisory Committee

Suzanne Bakken, RN, DNSc - The Alumni Professor of Nursing and Professor of Biomedical Informatics, Columbia University

Barbara A. Koenig, Ph.D. - Professor, Dept. of Social & Behavioral Sciences, Institute for Health & Aging, University of California, San Francisco

Isaac Kohane, M.D., Ph.D. - Director, i2b2 National Center for Biomedical Computing, Harvard Medical School (HMS)

Marty LaVenture, MPH, Ph.D., FACMI - Director, Minnesota Office for Health Information Technology and Center for Health Informatics, Minnesota Dept. of Health

Dan Masys, M.D. - Affiliate Professor, Dept. of Medical Education and Biomedical Informatics, University of Washington

C. David Hardison, Ph.D. - VP, Chief Health Scientist, SAIC Public Health Operations

Mark A. Musen, M.D., Ph.D. - Professor of Medicine (Biomedical Informatics); Division Head (BMIR); Co-Director, Biomedical Informatics Training Program, Stanford University

Robert A. Rizza, M.D. - Executive Dean for Research, Mayo Clinic

Nina M. Schwenk, M.D. - Assistant Professor of Medicine, Mayo Clinic

Tevfik Bedirhan Üstün, M.D. - Coordinator, Classifications, Terminologies & Standards, Health Statistics & Informatics, World Health Organization

Kent A. Spackman, M.D., Ph.D. - Chief Terminologist, International Health Terminology Standards Development Organization (IHTSDO)

Federal Steering Subcommittee: Janet Woodcock (FDA); Ken Buetow (NCI/NIH); Milt Corn (NLM/NIH); Laura Conn (CDC); Ram Sriram (NIST)


*Lecture Hall is Webinar enabled and sessions will be recorded (866-365-4406/2933780#)

MONDAY, JUNE 11, 2012 - AGENDA AT-A-GLANCE

7:30 Check In and Breakfast - 4th Floor, Lobby, West Corridor

8:00 SHARPn PIs on Secondary Use of EHR Data - Lecture Hall, Rm 414
Christopher G. Chute, M.D., Dr.P.H., Principal Investigator, Mayo Clinic; Stanley Huff, M.D., Co-Principal Investigator, University of Utah, Intermountain Healthcare

8:30 TRACK SESSIONS (parallel)

Clinical Data Normalization / Practical Modeling Issues Representing Coded & Structured Patient Data - Lecture Hall, Rm 414*
Stanley Huff, M.D.

Cloud Resources Lab - Innovation Lab, Rm 415
Troy Bleeker, CBAP®

NLP Research Presentations - Classroom, Rm 417
Part I: NLP Fundamentals: Methods and Shared Lexical Resources - Guergana Savova, Ph.D.
9:00 Part II: Dependency Parsing and Dependency-based Semantic Role Labeling (SRL) - Steven Bethard, Ph.D.
Part III: Discovering Severity and Body Site Modifiers: a Relation Extraction Task - Dmitriy Dligach, Ph.D.
9:30 Part IV: Applying Dependency Parses and SRL: Subject and Generic Attribute Discovery - Stephen Wu, Ph.D.
Part V: Discovering Negation and Uncertainty Modifiers - Cheryl Clark, Ph.D.

10:00 BREAK / OPEN POSTER SESSION - Coffee & networking: Lobby; Poster Display: Rm 419

10:30 TRACK SESSIONS (parallel)

Standards, Data Integration & Semantic Interoperability - Lecture Hall, Rm 414*
Part I: Introduction to SHARPn Normalization - Tom Oniki, Ph.D.; Hongfang Liu, Ph.D.
11:00 Part II: Semantic Normalization and Interoperability Lessons Learned - Tom Oniki, Ph.D.; Kyle Marchant; Calvin Beebe; Hongfang Liu, Ph.D.
11:30 Part III: SHARPn Infrastructure and Normalization Pipeline Demonstration - Vinod Kaggal

NLP System Presentations - Classroom, Rm 417
Part I (30min): Comparative Study of Two NLP Framework Architectures - Yixian Bian; Gunes Koru; Hongfang Liu, Ph.D.
Part II (20min): MCORES: A System for Noun Phrase Coreference Resolution for Clinical Records - Andreea Bodnari; Peter Szolovits; Ozlem Uzuner, Ph.D.
Part III (20min): Multi-Scrubber: An Ensemble System for De-Identification of Protected Health Information - Anna Rumshisky, Ph.D.; Ken Buford; Ira Goldstein; Ozlem Uzuner, Ph.D.


AGENDA AT-A-GLANCE - MONDAY, JUNE 11, 2012

12:00 GROUP PHOTO / LUNCH / NETWORKING - Buffet Line: Lobby, West Corridor; Seating: Classroom, Rm 417

1:00 TRACK SESSIONS (parallel)

High-Throughput Phenotyping / Cohort Identification from EHRs for Clinical and Translational Research - Lecture Hall, Rm 414*
Part I (30min): SHARPn HTP Introduction & Applications - Jyotishman Pathak, Ph.D.
1:30 Part II (30min): Using EHR for Clinical Research - Vitaly Herasevich, M.D., Ph.D.
2:00 Part III (30min): Association Rule Mining and Type 2 Diabetes Risk Prediction - Gyorgy Simon, Ph.D.

NLP Software Demos: UIMA, cTAKES, CLEARtk - Innovation Lab, Rm 415
Part I: UIMA Introduction - James Masanz
Part II: cTAKES Tutorial and GUI Demo - Pei Chen
Part III: CLEARtk Tutorial - Steven Bethard, Ph.D.
Part IV: Evaluation Workbench - Lee Christensen

Clinical Element Model (CEM) Presentations - Classroom, Rm 417
Paper I: Harmonization of SHARPn Clinical Element Models with CDISC SHARE Clinical Study Data Standards - Julie Evans; Guoqian Jiang, Ph.D.
Poster: OpenCEM Wiki: A Semantic-Web-based Repository for Supporting Harmonization of Clinical Study Data Standards and Clinical Element Models - Guoqian Jiang, Ph.D.
Demo I: CEMs to CDISC SHARE Metadata Repository - Landen Bain
Paper II: Pharmacogenomics Data Standardization using Clinical Element Models - Qian Zhu, Ph.D.; Robert Freimuth, Ph.D.
Demo II: CEM-OWL - Cui Tao, Ph.D.

2:30 BREAK / OPEN POSTER SESSION - Refreshments & networking: Lobby; Poster Display: Rm 419

3:00-5:00 TRACK SESSIONS (parallel)

EHR Use Cases - Lecture Hall, Rm 414*
Poster (15min): Scenario-based Requirements Engineering for Developing Electronic Health Records Add-ons to Support Comparative Effectiveness Research in Patient Care Settings - Junfeng Gao, Ph.D.
Regenstrief Institute (60min): Data Normalization / Clinical Data Repositories - Daniel Vreeman, PT, DPT; Marc Rosenman, M.D.
(15min): Q&A

cTAKES Coding Sprint - Innovation Lab, Rm 415
James Masanz; Pei Chen

Data Normalization Deep-Dive - Classroom, Rm 417
Part I (30min): Mapping EHR Data to SHARPn Use Cases - Tom Oniki, Ph.D.; Hongfang Liu, Ph.D.; Susan Rea Welch, Ph.D.
Part II (15min): Standards Utilized for Source Materials - Calvin Beebe
Part III (15min): Terminology Services - Harold Solbrig
Part IV (15min): Persistence DB Structure - Kyle Marchant
Part V (30min): An End-To-End Evaluation Framework - Peter Haug, M.D.
(15min): Facilitated Discussion on Future Build

5:00 Day's Report Out - Lecture Hall, Rm 414

5:30 Reception - DoubleTree Hotel, Universal II Ballroom, Skyway Level / 2nd Floor

6:30 Dinner on your own - signup sheets available at the registration desk for reservation assistance


TUESDAY, JUNE 12, 2012 - AGENDA AT-A-GLANCE

7:30 Check In and Breakfast - 4th Floor, Lobby, West Corridor

8:00 SHARPn Project-Leaders on Domain Milestones - Lecture Hall, Rm 414
Hongfang Liu, Ph.D., Mayo Clinic; Guergana Savova, Ph.D., Harvard Medical School and Children's Hospital Boston; Jyotishman Pathak, Ph.D., Mayo Clinic; Kent Bailey, Ph.D., Mayo Clinic

9:00 TRACK SESSIONS (parallel)

NLP Methods Presentations - Lecture Hall, Rm 414*
Part I (30min): MedTagger: A Fast NLP Pipeline for Indexing Clinical Narratives - Siddhartha Jonnalagadda, Ph.D.
9:30 Paper (15min): Knowledge-Based vs. Bottom-up Methods for Word Sense Disambiguation in Clinical Notes - Anna Rumshisky, Ph.D.; Rachel Chasin
Poster (15min): Classification of Emergency Department CT Imaging Reports using Natural Language Processing and Machine Learning - Efsun Sarioglu; Kabir Yadav, M.D.; Hyeong-Ah Choi

SE MN Beacon - In the Field: Medication Reconciliation using PH-Doc - Innovation Lab, Rm 415
Dan Jensen; Deb Castellanos

HTP Presentations - Classroom, Rm 417
Paper (15min): Exploring Patient Data in Context to Support Clinical Research Studies: Research Data Explorer - Adam Wilcox, Ph.D.; Chunhua Weng; Sunmoo Yoon; Suzanne Bakken, RN, DNSc
Poster (15min): Utilizing Previous Result Sets as Criteria for New Queries within FURTHeR - Dustin Schultz; Richard Bradshaw; Joyce Mitchell
Poster (15min): Semantic Search Engine for Clinical Trials - Yugyung Lee, Ph.D.
(15min): Q&A

10:00 BREAK / OPEN POSTER SESSION - Coffee & networking: Lobby; Poster Display: Rm 419

10:30 TRACK SESSIONS (parallel)

Medication Reconciliation - Lecture Hall, Rm 414*
Part I (15min): cTAKES Drug NER Tool - Sean Murphy
Part II (30min): MedER Medication Extraction Tool - Sunghwan Sohn, Ph.D.
Part III (45min): Pan-SHARP Collaborative - Jorge Herskovic, M.D., Ph.D.

HTP Next Steps Planning Session (INVITATION ONLY) - Innovation Lab, Rm 415 (10:30-11:00am)

Applied NLP & Information Extraction Future Exploration - Classroom, Rm 417
Part I (30min): NLP for Clinical Decision Support - Kavishwar Wagholikar, MBBS, Ph.D.
11:00 Part II (15min): Biomedical Informatics and Clinical NLP in Translational Science Research - Piet de Groen, M.D.
Poster (15min): Enabling Medical Experts to Navigate Clinical Text for Cohort Identification - Stephen Wu, Ph.D.
(30min): Future Exploration of NLP Needs in Clinical Settings - facilitated session


AGENDA AT-A-GLANCE - TUESDAY, JUNE 12, 2012

12:00 LUNCH / NETWORKING - Buffet Line: Lobby, West Corridor; Seating: Classroom, Rm 417

1:00 TRACK SESSIONS (parallel)

Data Quality - Lecture Hall, Rm 414*
Data Quality and Comparability of Large Scale Health Data - Kent Bailey, Ph.D.; Susan Rea Welch, Ph.D.

Phenotyping Tools Demo & Code Sprint - Innovation Lab, Rm 415
Part I (30min): Phenoportal Demonstration - Jyoti Pathak, Ph.D.; Dingcheng Li, Ph.D.
1:30 Part II (30min): A Knowledge-Driven Workbench for Predictive Modeling - Peter Haug, M.D.; Xinzi Wu; John Holmen; Matthew Ebert; Robert Hausam; Jeffrey Ferarro
2:00 Part III (30min): Clinical Analytics Driven Care Coordination for 30-day Readmission / 360 Fresh Demonstration - Ramesh Sairamesh, Ph.D.

INVITATION ONLY Sessions:
Project Advisory / Fed. Steering Committee / ONC / PIs - Reflections - Classroom, Rm 417-A (1:00pm-2:30pm)
PanSHARP - Mercury Planning Session - Classroom, Rm 417-B (1:00pm-2:30pm)
NLP Next Steps Planning Session - Rm 419 (1:00pm-2:30pm; 3:00pm-5:00pm)
Beacon CDR Training - Siebens 4-05 (1:00pm-5:00pm)

2:30 Closing Remarks - Lecture Hall, Rm 414

SHARPn Partners:

• Agilex Technologies
• CDISC (Clinical Data Interchange Standards Consortium)
• Centerphase Solutions
• Deloitte
• Group Health, Seattle
• IBM Watson Research Labs
• University of Utah
• University of Pittsburgh
• Harvard Children's Hospital
• Intermountain Healthcare
• Mayo Clinic
• Mirth Corporation, Inc.
• MIT
• MITRE Corp.
• Regenstrief Institute, Inc.
• Rochester Epidemiology Project (REP)
• SE MN Beacon Community
• Substitutable Medical Apps, reusable technologies (SMART)
• SUNY
• University of Colorado
• University of Texas (SHARPc)
• University of Illinois Urbana-Champaign (SHARPs)

Monday, June 11, 2012

7:30-8:00AM CHECK-IN and Breakfast
U of M Rochester Center, 4th Floor, Lobby, West Corridor

8:00 – 8:30AM OPENING ADDRESS
SHARPn Thought-Leaders on Secondary Use of EHR Data
Room: Lecture Hall, Rm 414
Christopher G. Chute, M.D., Dr.P.H., Principal Investigator, Mayo Clinic; Stanley Huff, M.D., Co-Principal Investigator, University of Utah, Intermountain Healthcare

8:30-10:00AM TRACK SESSIONS

Clinical Data Normalization / Practical Modeling Issues Representing Coded & Structured Patient Data
Room: Lecture Hall, Rm 414
Stanley Huff, M.D.; Tom Oniki, Ph.D.
This session addresses the need for formal data models and how standard terminologies are used in those models. We will explore use cases we have encountered and discuss the basic name-value pair paradigm for flexible representation of patient data, the proper roles of standard terminologies, an approach to handling negative findings, and support for pre-coordinated data entry while storing data in a post-coordinated database.

Cloud Resources Lab
Room: Innovation Lab, Rm 415
Troy Bleeker, CBAP®
Resources are at your fingertips. Come find out how to get started with SHARP resources on our Eucalyptus Enterprise Cloud, and leave with practical experience. Intended audience: technical users of Area 4's SHARP cloud resources.

Natural Language Processing (NLP) Research Presentations
Room: Classroom, Rm 417

Part I: NLP Fundamentals: Methods and Shared Lexical Resources
Guergana Savova, Ph.D.
This tutorial will focus on NLP and text analytics for the clinical narrative: a brief history of NLP and methods evolution highlighting the state of the art, implementation considerations, and creating annotated corpora as training/testing sets.

Part II: Dependency parsing and dependency-based semantic role labeling (SRL) Steven Bethard, Ph.D. The presentation will describe these two key NLP technologies used as input features for the discovery of a number of modifiers – subject, severity, body location, negation, uncertainty. Dependency parsing and semantic role labeling provide a level of semantic formalization of the surface language form. We will describe our method of implementing these key NLP technologies Part III: Discovering Severity and Body Site Modifiers: a Relation Extraction Task Dmitriy Dligach, Ph.D. The severity and body site modifiers are critical in describing the overall clinical state of patients. As such they are core attributes in the Clinical Element Model. We cast their discovery as a relation extraction task of two types of UMLS relations – locationOf and degreeOf. The presentation will describe the machine learning approach and will present evaluation results. Part IV: Applying dependency parses and SRL: Subject and Generic Attribute Discovery Stephen Wu, PhD This session follows the overview of the Dependency parser and Semantic Role Labeler (SRL), illustrating how those components have been used in downstream components. We will review the definitions of two attributes of Named Entities, namely “Subject” and “Generic”, and we will show how they can be found via simple rules that utilize the Dependency Parser and SRL. The session should be of interest to those who are interested in discovery of Named Entity attributes, and those who are unsure how to make use of dependency parse/SRL features. Part V: Discovering Negation and Uncertainty Modifiers Cheryl Clark, Ph.D. We describe a methodology for identifying the negation and uncertainty modifiers associated with clinical events in clinical text. This system was among the top performing systems in the assertion subtask of the 2010 i2b2/VA community evaluation Challenges in natural language processing for clinical data, and has subsequently been packaged as a UIMA module called the MITRE Assertion Status Tool for Interpreting Facts (MASTIF), recently integrated with cTAKES. We describe the process of extending MASTIF, which uses a single multi-way classifier to select among a closed set of mutually exclusive assertion categories, to a system that uses individual, independent classifiers to assign values to independent negation and

uncertainty attributes associated with a variety of clinical concepts (e.g., medications, procedures, and relations) as specified by SHARPn requirements. We discuss the benefits that result from this new representation and the challenges associated with generating it automatically. We compare the accuracy of MASTIF on i2b2 data with accuracy on a subset of SHARPn clinical documents, and discuss the contribution of linguistic features to the accuracy and generalizability of the system. Finally, we discuss our plans for future development.

10:00-10:30AM BREAK / OPEN POSTER SESSION
Coffee & networking – Lobby
Poster Display – Rm 419

10:30AM-12:00PM TRACK/PAPER SESSIONS

Standards, Data Integration & Semantic Interoperability
Room: Lecture Hall, Rm 414

Part I: Introduction to SHARPn Normalization
Stanley Huff, M.D.; Tom Oniki, Ph.D.; Hongfang Liu, Ph.D.
In this session, we will trace the efforts so far to develop our normalization target in SHARPn. We will review the current status of the SHARP CEM, the approach we have developed to elaborate new models, the tools used to construct and manage these artifacts, and our goals for the next year.

Part II: Semantic Normalization and Interoperability: Building Blocks, Challenges, and Solutions
Stanley Huff, M.D.; Tom Oniki, Ph.D.; Hongfang Liu, Ph.D.
One of the necessary steps in data normalization is to normalize semantics. Depending on the data sources, semantic normalization can be an easy or a very challenging task. This is a collaborative session to brainstorm the challenges faced by the data normalization pipeline and potential solutions.

Part III: SHARPn Infrastructure and Normalization Pipeline Demonstration
Vinod Kaggal
Discussion of the design and implementation of the normalization pipeline. The discussion will cover the major modules used in the pipeline, followed by a demonstration of the normalization pipeline.
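The following is a minimal, illustrative sketch (not the SHARPn pipeline itself) of the kind of semantic normalization step discussed above: mapping a site-specific lab observation onto a simplified, CEM-like structure keyed to standard codes. The field names, local codes, and the LOINC lookup table are assumptions made for illustration; only the target model name echoes the SecondaryUseLabObs model mentioned elsewhere in this program.

```python
# Illustrative sketch only: this is not the SHARPn normalization pipeline.
# It shows the general idea of semantic normalization discussed above --
# mapping a source-specific lab observation onto a normalized, CEM-like
# structure keyed to standard codes. All field names, local codes, and the
# LOINC mapping below are hypothetical examples.

# Hypothetical mapping from a site's local lab codes to LOINC.
LOCAL_TO_LOINC = {
    "GLU_SER": ("2345-7", "Glucose [Mass/volume] in Serum or Plasma"),
    "HBA1C":   ("4548-4", "Hemoglobin A1c/Hemoglobin.total in Blood"),
}

def normalize_lab_observation(source_record: dict) -> dict:
    """Map one source lab record into a simplified, CEM-like dictionary."""
    loinc_code, loinc_name = LOCAL_TO_LOINC[source_record["local_code"]]
    return {
        "model": "SecondaryUseLabObs",  # target model name, echoing the CEMs above
        "code": {"system": "LOINC", "value": loinc_code, "display": loinc_name},
        "value": {"quantity": float(source_record["result"]),
                  "unit": source_record["units"]},
        "effectiveTime": source_record["collected_at"],
        "subjectId": source_record["patient_id"],
    }

if __name__ == "__main__":
    raw = {"patient_id": "12345", "local_code": "GLU_SER",
           "result": "105", "units": "mg/dL", "collected_at": "2012-06-11T08:30:00"}
    print(normalize_lab_observation(raw))
```

In the real pipeline the terminology mappings and target models are far richer; the point here is only the shape of the input-to-normalized-output transformation.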

NLP System Presentations
Room: Classroom, Rm 417

Part I: A Comparative Study of Two Natural Language Processing Framework Architectures
Yixian Bian; Gunes Koru; Hongfang Liu, Ph.D.
Two popular Natural Language Processing framework architectures are available: UIMA and GATE. Both are developed in the Java programming language. UIMA was produced by IBM, and GATE is a research achievement of the University of Sheffield. Although they share common goals, the two architectures differ in many aspects. Which one should be adopted? In this paper, we compare them from three perspectives: software design quality, user manuals, and bug repositories, using different kinds of empirical evidence. The comparison results show that UIMA is better than GATE in these perspectives.

Part II: MCORES: A System for Noun Phrase Coreference Resolution for Clinical Records
Andreea Bodnari; Peter Szolovits; Ozlem Uzuner, Ph.D.
Abundant patient information is stored within the narratives of electronic medical records. Because of their free-text structure, these narratives cannot be directly integrated into computerized clinical research. Language processing techniques are customarily built to help the analysis of clinical free text. Coreference resolution is an important language processing technique used for the analysis of free-text narratives. We present a clinical coreference resolution system (MCORES) for the noun phrases in electronic medical records. MCORES handles coreference of four frequently used clinical semantic categories: persons, problems, treatments, and tests. It resolves coreference in a two-step process. The first step is binary classification performed on pairs of noun phrases from the same semantic category. The second step is clustering of the true coreference pairs generated in the first step into chains. MCORES generates a rich set of lexical, syntactic, and semantic features for the binary classifier. The feature set includes clinical knowledge gathered from UMLS. We evaluate MCORES against an in-house baseline and two available third-party, open-domain coreference resolution systems (i.e., RECONCILE (ACL09) and the Beautiful Anaphora Resolution Toolkit (BART)). Overall, MCORES outperforms both the in-house and the third-party systems. MCORES' main advantage is the rich feature set enhanced with clinical knowledge.
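The two-step strategy described in the MCORES abstract, pairwise binary classification of same-category mentions followed by clustering of positive pairs into chains, can be sketched generically as below. The toy mentions, features, and classifier are assumptions for illustration; this is not the MCORES implementation or its feature set.

```python
# Generic sketch of the two-step coreference strategy described above:
# (1) classify mention pairs as coreferent or not, (2) cluster positive
# pairs into chains. Data, features, and classifier are toy assumptions;
# this is not the MCORES system.
from itertools import combinations
from sklearn.linear_model import LogisticRegression

# Toy mentions: (id, semantic category, head word); toy gold chains for training.
mentions = [(0, "problem", "pneumonia"), (1, "problem", "infection"),
            (2, "treatment", "levofloxacin"), (3, "problem", "pneumonia")]
gold_chains = [{0, 3}]  # mentions 0 and 3 corefer

def pair_features(m1, m2):
    # Minimal lexical/semantic features; MCORES uses a much richer set (incl. UMLS).
    return [float(m1[2] == m2[2]), float(m1[1] == m2[1])]

# Step 1: train and apply a binary pairwise classifier (same-category pairs only).
pairs = [(a, b) for a, b in combinations(mentions, 2) if a[1] == b[1]]
X = [pair_features(a, b) for a, b in pairs]
y = [int(any(a[0] in c and b[0] in c for c in gold_chains)) for a, b in pairs]
clf = LogisticRegression().fit(X, y)
positive = [p for p, label in zip(pairs, clf.predict(X)) if label == 1]

# Step 2: cluster positive pairs into chains with a simple union-find.
parent = {m[0]: m[0] for m in mentions}
def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i
for a, b in positive:
    parent[find(a[0])] = find(b[0])
chains = {}
for m in mentions:
    chains.setdefault(find(m[0]), []).append(m[0])
print([c for c in chains.values() if len(c) > 1])
```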

Part III: Multi-Scrubber: An Ensemble System for De-Identification of Protected Health Information
Anna Rumshisky, Ph.D.; Ira Goldstein; Ken Buford; Ozlem Uzuner, Ph.D.
Narrative clinical records are still largely an untapped source of information available for clinical research. Under the Health Insurance Portability and Accountability Act (HIPAA), protected health information (PHI), including patient names, addresses, dates, phone numbers, and other identifying information, needs to be removed from the record before it can be re-purposed for research. In this work, we present Multi-Scrubber, an ensemble system for de-identification of clinical records that combines predictions of existing state-of-the-art de-identification systems using a meta-classifier. Clinical records from different institutions often vary widely in terms of adopted formats. De-identification systems trained on records of a particular format tend to perform poorly on records from other institutions. Furthermore, sharing annotated data between institutions is typically problematic due to HIPAA regulations. We show that Multi-Scrubber can be used to jump-start annotation for a record format internal to a particular institution by using models trained on outside data and combining them with models trained on smaller sets of institution-internal annotated data. Multi-Scrubber maximizes the use of training data by swapping in larger base models after the meta-training has been completed. JCarafe/MIST, Stanford NER, and Illinois NET are used as base classifiers for the ensemble learner. We tested Multi-Scrubber on three corpora from the Informatics for Integrating Biology and the Bedside (i2b2) and MIMIC II projects, varying the data set sizes and the inclusion of institution-internal and outside data and models. The corpora were marked for the following PHI types: doctor, patient, and hospital names, locations, dates, IDs, telephone and fax numbers, and ages over 89 years old. We find that Multi-Scrubber outperforms the base classifier systems in multiple settings and in most categories, with the performance gain most pronounced on smaller training set sizes and sparse data.

12:00-1:00PM LUNCH / GROUP PHOTO
Group Photo – Lobby, West Corridor
Buffet Line – Lobby, West Corridor
Seating – Classroom, Rm 417

1:00-2:30PM TRACK/PAPER SESSIONS

High-Throughput Phenotyping / Cohort Identification from EHRs for Clinical and Translational Research
Room: Lecture Hall, Rm 414

Part I (30min): SHARPn HTP Introduction & Applications Jyotishman Pathak, Ph.D. While originally for application of research cohorts from EMR's, this project has obvious extensions to clinical trial eligibility, clinical decision support and has relevance to quality metrics (numerator and denominator constitute phenotypes). High-throughput Phenotyping (HTP) leverages high-throughput computational technologies to derive efficient use of health information data. The field currently has barriers in technological research and tool generation. Part II (30min): Using EHR for Clinical Research Herasevich, Vitaly, M.D., Ph.D. Effective care of critically ill patient in the Emergency Department (ED), operating room, and Intensive Care Unit (ICU) heavily depends on the ability of clinicians to quickly process large amounts of clinical and laboratory data while taking care of multiple patients and being interrupted frequently. Compared to the outpatient settings, the ICU is a data-rich environment with exponentially more data points generated in a short period of time. The advances in information technology (IT) have a natural application in the fast-paced environment of ED and ICU. Like any other intervention in clinical medicine, IT applications should be fundamentally designed to improve patient care. During this lecture, speaker will outline concepts and application of clinical informatics in intensive care and will present experience of ICU and OR datamarts development.

Part III (30min): Association Rule Mining and Type 2 Diabetes Risk Prediction Simon, Gyorgy, Ph.D. Early detection of patients with elevated risk of developing diabetes mellitus is very important to improve prevention and overall clinical management of these patients. We apply association rule mining to electronic medical records (EMR) to discover sets of risk factors and their corresponding subpopulations that represent particularly high risk of developing diabetes. We found association rule mining particularly beneficial as it is capable of making predictions on par with other state of the art machine learning methods but it constructs a model that is richer in information yet easy to interpret. In this presentation, we will discuss association rule mining and its application to diabetes risk assessment. We will also consider its adaptation towards automatic phenotyping.
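As a hedged illustration of the association rule mining approach described above, the sketch below mines rules from a handful of toy patient "transactions" using the mlxtend library. The risk factors, the support/confidence thresholds, and the choice of mlxtend are assumptions for illustration, not the study's actual data or tooling.

```python
# Illustrative sketch of association rule mining over patient risk factors,
# in the spirit of the talk above. The toy "transactions", thresholds, and
# the use of mlxtend are assumptions for illustration, not the study's setup.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Each row lists the risk factors / outcomes recorded for one (toy) patient.
patients = [
    ["obesity", "hypertension", "impaired_fasting_glucose", "T2DM"],
    ["obesity", "hypertension", "T2DM"],
    ["hypertension"],
    ["obesity", "impaired_fasting_glucose", "T2DM"],
    ["obesity", "hypertension", "impaired_fasting_glucose", "T2DM"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(patients).transform(patients), columns=te.columns_)

# Frequent itemsets, then rules; keep only rules that predict the T2DM outcome.
itemsets = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.8)
rules_to_t2dm = rules[rules["consequents"].apply(lambda s: s == frozenset({"T2DM"}))]
print(rules_to_t2dm[["antecedents", "consequents", "support", "confidence"]])
```

Each retained rule is a candidate risk-factor combination with its own support and confidence, which is what makes the resulting model easy to inspect, as the abstract notes.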

Breakfast, Lunch & Refreshments provided by:

Clinical Element Model (CEM) Presentations Room: Classroom, Rm 417 Paper I (15min): Harmonization of SHARPn Clinical Element Models with CDISC SHARE Clinical Study Data Standards Julie Evans; Tom Oniki, Ph.D.; Joey Coyle, Landen Bain, Guoqian Jiang, Ph.D.; Stanley Huff, M.D.; Rebecca Kush; Christopher Chute, M.D., Dr.P.H.

The Intermountain Healthcare/GE Healthcare Clinical Element Models (CEMs) have been adopted by the SHARPn project for normalizing patient data from the electronic medical records (EMRs). To maximize the reusability of the CEMs in a variety of use cases across both clinical study and secondary use, it is necessary to build interoperability between the CEMs and existing data standards (e.g. CDISC and ISO 11179 standards).

The objective of the paper is to harmonize the SHARPn CEMs with CDISC SHARE clinical study data standards. As the starting point, we were focused on three generic domains: Demographics, Lab Tests and Medications. CDISC contributed templates in the three domains in Excel spreadsheets and the SHARPn project provided three CEM models: SecondaryUsePatient, SecondaryUseLabObs and SecondaryUseNotedDrug in XML Schema.

We formed a CSHARE CEMs Harmonization Working Group with representatives from CDISC, Intermountain Healthcare and Mayo Clinic. We performed a panel review on each data element extracted from the CDISC templates and SHARPn CEMs. When a consensus is achieved, a data element is classified into one of the following three context categories: Common, Clinical Study or Secondary Use.

In total, we reviewed 127 data elements from the CDISC SHARE templates and 1130 data elements extracted from the SHARPn CEMs. We identified 4 common data elements (CDEs) from the Demographics domain, 20 CDEs from the Lab Test domain and 15 CDEs from the Medications domain. We have also identified a set of outstanding issues, including differences in implementation, data types and value set definition mechanism.

In conclusion, we have identified a set of data elements that are common to the context of both clinical study and secondary use. We consider that the outcomes produced by this Working Group would be useful for facilitating the semantic interoperability between systems for both clinical study and clinical practice.

Poster (15min): OpenCEM Wiki: A Semantic-Web-based Repository for Supporting Harmonization of Clinical Study Data Standards and Clinical Element Models Guoqian Jiang, Ph.D.

The objective is to develop and evaluate a semantic repository for supporting harmonization of clinical study data standards and clinical element models (CEMs) using Semantic Web technology. We collected the following standards: 1) the CDISC clinical study data standards: a) CDASH standards in CDISC Operational Data Model (ODM) XML format; b) SDTM standards in Excel spreadsheet; c) CDISC Terminologies in ODM XML format. 2) The Intermountain Healthcare Clinical Element Models (CEMs) in CEML XML format. The system architecture is comprised of 4 modules: 1) A Resource Description Framework (RDF) transformation module that supports converting the data standards and CEMs from a variety of formats into RDF format. 2) A semantic repository module that stores and integrates the data standards and CEMs in a centralized RDF store. 3) A semantic query interface module that uses a SPARQL endpoint against the RDF store. 4) A Semantic Wiki frontend module for supporting representation and harmonization efforts. We implemented a prototype of the system called OpenCEM Wiki. 1) We used an open XML2RDF transformation web service to convert those clinical data standards and CEMs in XML into RDF triples. 2) We used an open source RDF store called 4store to integrate the data standards and CEMs in the RDF model. 3) We established a SPARQL endpoint using built-in services from 4store. 4) In the frontend module, we implemented a Semantic MediaWiki platform with a number of semantic extensions. We will also discuss the potentials for enabling collaborative harmonization between clinical study data standards and CEMs. Demo I (30min): CEMs to CDISCs SHARE metadata repository Landen Bain A simulation showing a researcher designing a research data capture form by selecting C-SHARE elements and, leveraging the map between SHARE and the CEMs, pre-arrange the population of the eCRF by secondary use of EHR data. Paper II (15min): Pharmacogenomics Data Standardization using Clinical Element Models Qian Zhu, Ph.D.; Robert Freimuth, Ph.D.; Zonghui Lian; Scott Bauer; Matthew Durski; Jyotishman Pathak, Ph.D.; Christopher Chute, M.D., Dr.P.H. Pharmacogenomics is a multidisciplinary and data-intensive science that requires increasingly clear annotation and representation of phenotypes to support data integration. Studies are often conducted independently, however, and standardized representations for data are seldom used. This leads to data heterogeneity, which hinders data reuse and integration. The CEMs developed within SHARP form a library of common logical models that facilitate consistent data representation, interpretation, and exchange. In this study, 4483

variables from the Pharmacogenomics Research Network (PGRN) were represented using four CEMs that were adopted from SHARPn. This was accomplished by grouping PGRN variables into categories based on UMLS semantic type, then mapping each to one or more attributes within a CEM using a web-based tool that was developed to support curation activities. CEMs represented more than half of the variables: 18% mapped to LaboratoryObservation, 15% to NotedDrug, 14% to Disease/Disorder, and 7% to Patient. Of the remaining variables, 41% represented clinical findings, 33% represented quality of life or cognitive assessment measures, and 26% were concepts such as adverse events, clinical procedures, pharmacokinetics/pharmacodynamics, and genomics. This study demonstrates the successful application of SHARP CEMs to the pharmacogenomics domain. It also identified several categories of information that are not currently supported by SHARP CEMs, which represent opportunities for further development and collaboration. Demo II (15min): Semantic Web Representation of Clinical Element Model Cui Tao, Ph.D. The Clinical Element Model (CEM) is the common unified information model in the Strategic Health IT Advanced Research Project, secondary use of EHR (SHARPn) to ensure unambiguous data representation, interpretation, and exchange within and across heterogeneous sources and applications. The current representation of CEMs does not support formal semantic definitions and therefore it is not possible to perform reasoning and consistency checking on derived models. The aim of the CEM-OWL project is to represent the CEM specification using the Web Ontology Language (OWL). The CEM-OWL representation connects the CEM content with the Semantic Web environment, which provides authoring, reasoning, and querying tools. This work may also facilitate the harmonization of the CEMs with domain knowledge represented in terminology models as well as other clinical information models such as the openEHR Archetype Model. We envision a three-layer clinical data representation model in OWL: (1) a meta-level ontology that defines the abstract meta representation of the CEM, where the basic structures, the properties and their relationships, and the constraints are defined; (2) OWL ontologies for representing each individual detailed clinical element model; and (3) patient data represented in RDF triples with respect to the ontologies on layer 2. After representing the CEM as well as patient data in OWL/RDF, we can utilize Semantic-Web tools to check for semantic consistency on both the ontology and the instance levels and to reason over these data for extracting new knowledge.
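A minimal sketch of the "layer 3" idea from the CEM-OWL description above: patient data expressed as RDF triples against a model vocabulary and queried with SPARQL. The namespace, class, and property names below are hypothetical placeholders, not the actual CEM-OWL vocabulary, and rdflib is used only for illustration.

```python
# Minimal sketch of "layer 3" patient data as RDF plus a SPARQL query,
# in the spirit of the CEM-OWL description above. The namespace, class,
# and property names here are hypothetical placeholders, not the actual
# CEM-OWL vocabulary.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, XSD

CEM = Namespace("http://example.org/cem-owl#")   # hypothetical namespace
EX = Namespace("http://example.org/patients#")

g = Graph()
g.bind("cem", CEM)

# One toy lab observation instance for one patient.
obs = EX["obs1"]
g.add((obs, RDF.type, CEM.SecondaryUseLabObs))
g.add((obs, CEM.patientId, Literal("12345")))
g.add((obs, CEM.loincCode, Literal("4548-4")))
g.add((obs, CEM.value, Literal(7.2, datatype=XSD.decimal)))

# SPARQL over the instance data: find elevated HbA1c observations.
query = """
PREFIX cem: <http://example.org/cem-owl#>
SELECT ?patient ?value WHERE {
  ?obs a cem:SecondaryUseLabObs ;
       cem:patientId ?patient ;
       cem:loincCode "4548-4" ;
       cem:value ?value .
  FILTER (?value > 6.5)
}
"""
for row in g.query(query):
    print(row.patient, row.value)
```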

NLP Software Demos: UIMA, cTAKES, CLEARtk Room: Innovation Lab, Rm 415 Part I: UIMA introduction James Masanz The Unstructured Information Management Architecture (UIMA; http://uima.apache.org/) is a top level Apache Software Foundation project. It is an engineering framework enabling the processing of large volumes of unstructured information. This will be a high-level overview of the UIMA framework to contextualize the following cTAKES and CLEAR-TK presentations. Part II: cTAKES tutorial and GUI Demo Pei Chen cTAKES (clinical Text Analysis and Knowledge Extraction System) is an open-source natural language processing system for information extraction from electronic medical record clinical free-text. cTAKES is built on top of UIMA. This will be a high-level overview of cTAKES; where to download it; how to install it; and how to begin processing documents. There will also be a live demo of a cTAKES GUI which wraps common cTAKES configuration tasks and functionality in a simple user interface geared towards end-users and new developers who wish to have a jump start on using cTAKES. Part III: ClearTk tutorial Steven Bethard, Ph.D. ClearTK is an open source framework for developing statistical natural language processing (NLP) components in Java, built on top of the Apache Unstructured Information Management Architecture (UIMA). It provides a common interface and wrappers for popular machine learning libraries, a rich feature extraction library that can be used with any of the machine learning classifiers, and UIMA integration that allows classifiers to be easily trained and evaluated using UIMA pipelines. This tutorial will present an overview of the features of the ClearTK framework, with examples drawn from SHARP NLP applications. cTAKES and ClearTK have been integrated. Part IV: Evaluation workbench Lee Christensen The Evaluation Workbench is a graphical tool for comparing document annotations from different sources, including automated NLP systems and human annotators, and generating statistical outcome measures from those comparisons. Comparisons can be made based on classification (e.g. UMLS CUIs), attribute (e.g. polarity = present/absent), and annotation level (e.g. document-level,

snippet-level). The tool includes a specialized UIMA analysis engine for converting UIMA results into a Workbench-readable format.

2:30-3:00PM BREAK / OPEN POSTER SESSION
Refreshments & networking – Lobby
Poster Display – Rm 419

3:00-5:00PM FACILITATED SESSIONS

Data Normalization Deep-Dive
Room: Classroom, Rm 417

Part I: Mapping EHR Data to SHARPn Use Cases
Tom Oniki, Ph.D.; Hongfang Liu, Ph.D.; Susan Rea Welch, Ph.D.
The SHARPn components were designed to perform normalization, optionally NLP, optionally phenotyping, and other secondary uses such as data extraction for research databases and applications such as medication reconciliation. Technical documentation of the mappings between input and output is usually generated in order to communicate the expected inputs and outputs for particular use cases and to evaluate whether input data were processed completely and correctly. This session is intended to kick off a discussion among SHARPn participants and stakeholders concerning the mapping artifacts needed for general use of the various SHARPn pipelines.

Part II: Terminology Services
Harold Solbrig
CTS2 provides a formal RESTful API for terminology resources. This session will provide a roadmap to the standard content, will introduce the Mayo CTS2 Development Framework, and will provide demonstrations of running CTS2-based tools.

Part III: An End-To-End Evaluation Framework
Peter Haug, M.D.
The goal of this session is to expose the audience to ongoing efforts to evaluate the effectiveness of the SHARPn data transfer and translation infrastructure. Initial results will be provided and plans for continuing evaluation efforts will be discussed.

EHR Use Cases
Room: Lecture Hall, Rm 414

Poster (15min): Scenario-based Requirements Engineering for Developing Electronic Health Records Add-ons to Support Comparative Effectiveness Research in Patient Care Settings

Junfeng Gao, Ph.D.; Solomon Berhe, Ph.D.; Sunmoo Yoon, Ph.D.; J. Thomas Bigger, M.D.; Adam Wilcox, Ph.D.; Suzanne Bakken, RN, DNSc; Chunhua Weng, Ph.D.
Research visit scheduling is a complex multi-user constraint satisfaction problem. It must satisfy constraints of research participants, coordinators, research protocols, and sometimes investigators and clinicians. Objective: We aim to develop electronic health record add-ons that align research visits with clinical visits through secondary use of clinical appointments in research settings. This poster presents our method to iteratively deepen requirements elicitation from research personnel, who do not necessarily use technical terms to express their user needs. Methods: A participatory design team met twice a week to discuss user requirements and software prototype designs, and developed use case scenarios. Ten clinical research coordinators with different demographic characteristics and varying levels of experience (e.g., inpatient vs. outpatient, cardiology vs. neurology, and clinical trials vs. observational studies) were recruited to perform scenario-based evaluations of the formative software prototype using a think-aloud protocol. A structured interview was conducted to assess user satisfaction. Thematic analysis was performed on the evaluation interview transcripts to discover themes regarding user needs for efficiently scheduling research visits. Results: Seven major themes were identified: alignment of research and clinical visits, workflow flexibility support, data governance policies, constraints visualization and satisfaction, interoperability with other systems, reminders for upcoming visits, and privacy and regulation compliance. Diverging user opinions were revealed and harmonized. Conclusion: Our scenario-based software requirements engineering approach, which combined participatory design and iterative formative evaluations, was effective for requirements gathering from clinical researchers.

Regenstrief Institute (60min): Data Normalization / Clinical Data Repository
Larry Lemmon; Marc Rosenman, M.D.; Daniel Vreeman, PT, DPT

3:00-6:00PM DEVELOPERS TRACK

cTAKES Coding Sprint and Help Desk
Room: Innovation Lab, Rm 415
James Masanz

For those with technical questions about cTAKES, from install to design decisions, this session will be an opportunity to get your technical questions answered by cTAKES and ClearTK developers. For those interested in both improving cTAKES through testing and then learning more through discussion of issues, media will be provided to install cTAKES, with time to try to break cTAKES (find bugs), log the issues, and have open discussions addressing the issues/optimizations. 5:00PM CLOSING Day’s Report Out Closing Remarks Room: Lecture Hall, Rm 414 5:30-6:30PM RECEPTION DoubleTree - Skyway Level / 2nd Level University Hall II

SHARPn Mission

To enable the use of EHR data for secondary purposes, such as clinical research and public health.

Leverage health informatics to:
• generate new knowledge
• improve care
• address population needs

To support the community of EHR data consumers by developing:
• open-source tools
• services
• scalable software

Tuesday, June 12, 2012

7:30 – 8:00AM CHECK-IN and Breakfast U of M Rochester Center 4th Floor, Lobby, West Corridor 8:00 – 9:00AM OPENING ADDRESS SHARPn Project-Leaders on Domain Milestones Room: Lecture Hall, Rm 414 Hongfang Liu, Ph.D., Mayo Clinic

Guergana Savova, Ph.D., Harvard Children’s Hospital

Jyotishman Pathak, Ph.D., Mayo Clinic

Kent Bailey, Ph.D., Mayo Clinic

9:00-10:00AM TRACK SESSIONS

SE MN Beacon: Medication Reconciliation using PH-Doc
Room: Lecture Hall, Rm 414
Daniel Jensen; Deb Castellanos
Demonstration of local Public Health in SE Minnesota: requesting CCD documents from several regional clinics, parsing / consuming the documents, and performing medication reconciliation.

HTP - Presentations
Room: Classroom, Rm 417

Paper (15min): Exploring Patient Data in Context to Support Clinical Research Studies: Research Data Explorer
Adam Wilcox, Ph.D.; Chunhua Weng; Sunmoo Yoon; Suzanne Bakken, RN, DNSc
With the growth in use of electronic health records, there is an increase in the collection and availability of patient information in electronic clinical databases. These electronic health data may also support clinical research, but for many researchers the data remain inaccessible. In 2010, AHRQ funded multiple studies to develop infrastructure and improve methods for collecting and using data from electronic clinical databases for research. As part of this initiative, the Washington Heights/Inwood Informatics Infrastructure for Comparative Effectiveness Research (WICER) is creating a patient-centered data infrastructure in a diverse community in northern Manhattan. One component of WICER is the Research Data eXplorer (RedX), a tool assisting clinical researchers in accessing, navigating and querying electronic health data to identify patient cohorts. RedX has multiple properties that support cohort identification. First, it displays data in the context of an electronic health record, allowing clinician researchers to

navigate data integrated at a patient level. Second, it helps users build queries based on patient data, by querying other patients with similar characteristics. Queries can then be combined to further refine a research cohort. Third, RedX facilitates researcher access to data by using a de-identified database, so that researchers can safely browse data without risking patient confidentiality. In this paper, we describe the design, development and use of RedX. We report how a usability evaluation and experience with a user’s group have informed our system. We also discuss governance issues that were addressed, and how they have affected its design, development and implementation. Poster: Utilizing Previous Result Sets as Criteria for New Queries within FURTHeR Dustin Schultz; Richard Bradshaw; Joyce Mitchell The Biomedical Informatics Core at the University of Utah’s Center for Clinical and Translational Science (CCTS) is continuing the development of a platform for real-time federation of health information from heterogeneous data sources. The Federated Utah Research and Translational Health electronic Repository (FURTHeR) 1-2, utilizes standard terminologies, a logical federated query language, and translation components to translate queries and results from each heterogeneous data source. Given the heterogeneous nature of data sources, one data source may contain more or less information than another data source. Researchers with inclusion/exclusion criteria that are not available in a particular data source have previously not been able to search this data source. We’ve enhanced FURTHeR to support these types of queries. Once the federated query results have been returned we are able to use those results as criteria for a new federated query. Utilizing a patient identifier mapping table, patient identifiers from the results are translated to their data source specific identifiers depending on the target data sources of the query. This provides vast new capabilities for researchers by allowing them to associate results from one data source to form the basis for a query of another data source. Poster (15min): Semantic Search Engine for Clinical Trials Yugyung Lee; Saranya Krishnamoorthy; Feichen Shen; Deendayal Dinakarpandian; Dennis Owens Standardizing the representation and content of eligibility criteria is an important requirement for improving the efficiency of clinical trials. There have been several efforts to formalize eligibility criteria by the creation of ontologies and other structured representations. This poster presents a semantic approach for facilitating accurate matches between clinical trials and eligible subjects. For the purpose, clinical trial studies from the Clinicaltrials.gov have been clustered. Secondly, the clinical trial ontology called the MindTrial Eligibility ontology (MEO) was

modeled according to the inclusion and exclusion criteria clusters obtained from the clustering. Thirdly, an open-ended query is generated using the Semantic Web Rule Language (SWRL) to discover potential participants in clinical trials by facilitating partial matching through relaxation of eligibility criteria. Two kinds of search interfaces are designed for selecting patients or potential volunteers. One is based on a detailed fine-grained checklist view where fields identical to those in the query can be selected as inclusion or exclusion criteria. The second kind of query interface is based on summarized queries and reasoning that are expanded by the MEO ontology and computed on a subset of volunteer responses. The prototype system for the proposed model has been implemented to support customized searches for potential recruiters. This approach can aid in better characterization of both human volunteers and clinical study requirements, thus leading to accurate and efficient matching of subjects with clinical studies. The results obtained from the approach can readily be incorporated into intelligent search engines over databases of clinical trials and subjects.

NLP – Method Presentations
Room: Lecture Hall, Rm 414

Part I (30min): MedTagger: A Fast NLP Pipeline for Indexing Clinical Narratives
Siddhartha Jonnalagadda, Ph.D.
This consumer-driven pipeline is being deployed in Mayo Clinic's Enterprise Data Warehouse. Its consumers include the programmers of Mayo Clinic's clinical data warehouse, the Enterprise Data Trust (EDT), who ensure that Mayo Clinic's 80 million unstructured documents are processed, and the systems analysts in the Data Management Service, who work closely with clinicians to enable effective downstream querying across the diverse use cases in clinical practice and research. Our future goal is to enable clinicians and researchers with various levels of expertise to directly configure queries.

Paper (15min): Knowledge-Based vs. Bottom-up Methods for Word Sense Disambiguation in Clinical Notes
Anna Rumshisky, Ph.D.; Rachel Chasin
Extracting structured patient information from clinical narrative is highly dependent on accurate word sense disambiguation (WSD) of ambiguous words and acronyms found in text. The manually annotated data required by supervised methods for every target word is very costly to produce. This leaves the option of either (1) knowledge-based methods relying solely on the structure of existing knowledge sources such as the Unified Medical Language System (UMLS), or (2) bottom-up unsupervised techniques using unannotated data. Both types of methods have been used with some success in the general

domain. In this work, we investigate the relative value of such methods for clinical text. We compare the performance of (1) A method based on Latent Dirichlet Allocation (LDA), a topic modeling technique applied to the problem of WSD by re-conceptualizing target word senses as topics, and context features as terms; (2) Two knowledge-based methods: (i) A modified PageRank algorithm using the UMLS concept graph with non-zero initial vector weights for the context terms surrounding the target mention, and (ii) A method that uses path-based distances within the UMLS concept hierarchy to compute cumulative similarity between context terms and target word senses. We investigate the impact of enriching the knowledge-based methods with syntactic dependency and text proximity information. We use a subset of 15 targets from the Mayo WSD corpus that map to SNOMED terms in evaluation, with 100 instances and 2-6 senses per target. The average baseline accuracy obtained by choosing the most frequent sense for each target word is 56.5%. Similarly to such methods in the general domain, our results for the knowledge-based methods fall below the baseline. The best configuration of our PageRank algorithm attains 48.9% average accuracy and that of our UMLS path-based method attains 42.5%. The LDA-based technique outperforms these methods with the cross-validation accuracy of 66.0%. Poster I (15min): Classification of Emergency Department CT Imaging Reports using Natural Language Processing and Machine Learning Efsun Sarioglu; Kabir Yadav, M.D.; Hyeong-Ah Choi Our objective is to develop a hybrid natural language processing (NLP) and machine learning system for automated classification emergency department computed tomography imaging reports to support comparative effectiveness research (CER). Materials & Methods We performed secondary analysis of a previous diagnostic CER study on blunt facial trauma victims. Staff radiologists dictated each CT report as free text. A trained data abstractor extracted the reference standard outcome of acute orbital fracture, with a random subset checked by the study physician to confirm reliability. Patient data was de-identified during pre-processing. The system takes patient reports as input to the Medical Language Extraction and Encoding (MedLEE) NLP tool to tag patient reports with Unified Medical Language System codes and modifiers that show the probability and temporal status. During post-processing, findings are filtered for low certainty and past/future modifiers, and then combined with the manual reference standard for classification using the data mining tools WEKA 3.7.5 and Salford Systems CART 6.6. The dataset is randomly split 50-50 train-test for decision tree classification. Results Our results obtained using machine learning alone are comparable to prior NLP studies (precision=0.949, recall=0.932, f-score=0.941) and the combined use of NLP and machine learning shows further

improvement (precision=0.968, recall=0.965, f-score=0.966). Discussion: The high performance of machine learning without NLP may be due to certain words only being used to describe fractures. Conclusion: Combining NLP and machine learning shows promise in coding free-text electronic clinical data to support large-scale CER. Future work will refine NLP tools and investigate different classification algorithms using other data sets to demonstrate consistent performance.

10:00-10:30AM BREAK / OPEN POSTER SESSION
Coffee & networking – Lobby
Poster Display – Rm 419

10:30AM-12:00PM TRACK SESSIONS

Medication Reconciliation
Room: Lecture Hall, Rm 414

Part I: cTAKES Drug NER Tool
Sean Murphy
This presentation briefly explains the technique behind the existing cTAKES drug NER tool and areas for improvement.

Part II: MedER
Sunghwan Sohn, Ph.D.
Covers the method and functionality of the new medication extraction tool we are currently developing: 1) our approach to improving medication signature extraction performance and user customizability, and 2) how we normalize medication descriptions in clinical notes and map them to the most appropriate RxNorm concept (RxCUI).

Part III: Pan-SHARP Collaborative
Jorge Herskovic, M.D., Ph.D.
SHARP program overview; cross-program cooperative: medication reconciliation project.

Closed Session: HTP Next Steps Planning Session
Room: Innovation Lab, Rm 415

NLP Information Extraction Future Exploration
Room: Classroom, Rm 417

Part I (30min): Natural Language Processing for Clinical Decision Support
Wagholikar, Kavishwar, MBBS, Ph.D.; Chaudhry, Rajeev, MBBS

Patients in the United States receive only half of the recommended health care. Electronic Health Records (EHRs) have the potential to address this problem by providing reminders to physicians and patients. However, a primary obstacle is that most EHR data is in free-text form that is not readily amenable to computer processing. Although natural language processing (NLP) has been useful for research applications, it is largely unused for clinical applications due to a lack of sufficient accuracy. We have developed accurate rule-based NLP systems to provide decision support for cervical and colorectal cancer screening and surveillance. In this talk we describe the construction and evaluation of these systems.

Part II (15min): Biomedical Informatics and Clinical NLP in Translational Science Research
Piet de Groen, M.D.

Part III (15min): Enabling Medical Experts to Navigate Clinical Text for Cohort Identification (meTAKES)
Wu, Stephen, Ph.D.
Natural Language Processing (NLP) can extract information from clinical narratives, and thus plays a critical role in enabling the secondary use of electronic medical records (EMRs) for clinical and translational research. Several clinical NLP systems have been successfully applied to specific use cases, often as collaborative projects between interested medical experts and NLP practitioners. However, wider adoption of NLP tools for use in clinical and translational research is hampered by the fact that NLP tools are typically designed for informaticists rather than for end users (in this case, highly motivated medical experts who have a pre-existing interest in the results). We introduce the prototype of a user-centric NLP user interface: meTAKES (medical expert Text Analysis and Knowledge Extraction System). With meTAKES, domain-expert clinicians and researchers find relevant patients (and patient documents) in the EMR by performing user-in-the-loop information retrieval. The interactive nature of this tool allows medical experts to make use of understandably packaged NLP techniques, such as customized dictionaries and criteria, negation and context discovery, and query expansion based on terms' semantic characteristics. MeTAKES is currently implemented as a web-based interface using Google Web Toolkit, with a lightweight client GUI and server-side processing of queries and documents. Alongside Apache Lucene, alternate NLP systems (cTAKES and MedTagger) have been used as the workhorse for this server-side processing. MeTAKES will eventually be released as an open-source project that can be instantiated within any institution without compromising patient confidentiality. Ongoing work includes cohort management, integration with structured data, and NLP techniques like parsing and relationship extraction.
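To make the retrieval idea concrete, here is a toy sketch of dictionary-driven query expansion over a small inverted index, in the spirit of the meTAKES description above. The synonym dictionary and documents are invented examples; the actual meTAKES system builds on Apache Lucene, cTAKES, and MedTagger rather than anything shown here.

```python
# Toy sketch of dictionary-driven query expansion over an inverted index,
# illustrating the retrieval idea described for meTAKES above. The synonym
# dictionary and documents are invented examples; this is not meTAKES.
from collections import defaultdict

notes = {
    "note1": "patient admitted with myocardial infarction and started on aspirin",
    "note2": "history of heart attack in 2010 currently asymptomatic",
    "note3": "routine follow up for hypertension",
}

# Customized dictionary a medical expert might supply for query expansion.
synonyms = {"myocardial infarction": ["heart attack", "mi"]}

# Build a simple inverted index from unigrams and bigrams to note ids.
index = defaultdict(set)
for note_id, text in notes.items():
    tokens = text.lower().split()
    for n in (1, 2):
        for i in range(len(tokens) - n + 1):
            index[" ".join(tokens[i:i + n])].add(note_id)

def search(query: str) -> set:
    """Return note ids matching the query or any expert-supplied synonym."""
    terms = [query.lower()] + synonyms.get(query.lower(), [])
    hits = set()
    for term in terms:
        hits |= index.get(term, set())
    return hits

print(search("myocardial infarction"))  # {'note1', 'note2'}
```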

(30min) Future exploration of NLP needs in clinical settings – facilitated interactive session

12:00-1:00PM LUNCH
Buffet Line – Lobby, West Corridor
Seating – Classroom, Rm 417

1:00-2:30PM TRACK SESSIONS

Got BIG Data: got correct solutions? Data quality results of large-scale health data
Room: Lecture Hall, Rm 414
Kent Bailey, Ph.D.; Susan Rea Welch, Ph.D.
Comparisons of data from SHARPn development-phase providers and organizations suggest differences in the baseline data used for phenotyping. Data of suspicious quality were also exposed. In shared secondary usage, the accuracy and reliability of data take on a new dimension. We cannot readily know all contributing EHR usage conventions, consumer population differences, and/or data quality assurance methods. Can we accept the aggregated results at face value? We present early results that support data quality screening strategies in the SHARPn pipelines. We invite discussion and input on this topic from SHARPn collaborators and stakeholders. (An illustrative screening sketch appears after the Phenoportal feature list below.)

Phenotyping Tools Demo & Code Sprint
Room: Innovation Lab, Rm 415

Part I (30min): Phenoportal Demonstration
Jyotishman Pathak, Ph.D.; Dingcheng Li, Ph.D.
The Phenoportal project is a collaboration with the National Quality Forum (NQF), an investigation of the Measure Authoring Toolkit (MAT), and a contribution to the library of disease cohort algorithms from the Electronic Medical Records and Genomics (eMERGE) Network.

• Offers search and visualization
• Ability to do keyword-based searches for available algorithms
• Navigate a hierarchy of phenotypes
• Visualize the algorithm logic flow
• Download a human-readable version (MS Word)
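The specific SHARPn data quality screening strategies referenced in the "Got BIG Data" session above are not detailed in this program. Purely as a hypothetical sketch of the general idea (flagging implausible values and comparing a measurement's distribution across contributing sites before aggregation), in Python:

```python
# Hypothetical illustration of cross-site data quality screening; not the actual
# SHARPn pipeline. Values, site names, and thresholds are invented for the sketch.
from statistics import mean, stdev

# Hypothetical serum glucose results (mg/dL) from two contributing sites.
SITE_VALUES = {
    "site_a": [92, 105, 88, 110, 97, 101, 93],
    "site_b": [90, 98, 940, 102, 95, 0, 99],   # contains implausible entries
}

PLAUSIBLE_RANGE = (20, 600)  # hypothetical hard limits for a glucose result

def range_check(values, low, high):
    """Return values falling outside the plausible physiologic range."""
    return [v for v in values if not (low <= v <= high)]

def site_summary(values):
    """Simple distributional summary used to compare sites."""
    return {"n": len(values), "mean": round(mean(values), 1), "sd": round(stdev(values), 1)}

if __name__ == "__main__":
    for site, vals in SITE_VALUES.items():
        flagged = range_check(vals, *PLAUSIBLE_RANGE)
        print(site, site_summary(vals), "flagged:", flagged)
    # Large between-site differences in the summaries, or many flagged values,
    # would prompt review of that site's EHR conventions before aggregation.
```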

Part II (30min): A Knowledge-Driven Workbench for Predictive Modeling
Peter Haug, M.D.; Xinzi Wu; John Holmen; Matthew Ebert; Robert Hausam; Jeffrey Ferarro
The mission of the SHARP-Area 4 project is to enable the use of EHR data for secondary purposes, such as clinical research and public health. Data from multiple EHRs will be transmitted, translated, and aggregated in a repository dedicated to the support of clinical research. This research environment invites the development of novel technologies to exploit the large stores of aggregated clinical data that will result. In a related project, we have prototyped a predictive modeling environment that will support the development of computable disease models using data stored in data repositories and medical knowledge embedded in ontologies. The goal is to provide a flexible tool kit that will encourage the rapid development of useful diagnostic models. As part of the 2012 SHARPn face-to-face, we will demonstrate this system. The development of diagnostic screening systems is generally labor intensive. Clinicians and data analysts work together to define characteristics of affected patients and to oversee the extraction of the data necessary to diagnose the targeted conditions. We plan to reduce this effort with a modeling environment that features an ontology used to represent diagnostic knowledge. Within the ontology we have embedded the information necessary to extract data from specialized tables in an Enterprise Data Warehouse. These data are formatted and loaded into an application called the Analytic Workbench, which generates and evaluates diagnostic models. We will illustrate the use of this tool in the development of an application that screens patients for community-acquired pneumonia.

Part III (30min): Clinical Analytics-driven Care Coordination for 30-day Readmissions – 360Fresh Demonstration
Ramesh Sairamesh, Ph.D.
One in four patients in the US is readmitted within 30 days of a previous discharge, costing the nation (Medicare and Medicaid) billions of dollars per year in unplanned and likely preventable readmissions. This paper and corresponding demonstration of our commercial tools will illustrate approaches for early detection, real-time assessment, and care coordination processes among clinicians to enable a reduction in unplanned readmission rates while improving patient quality outcomes. We also present risk prediction methods and corresponding clinician tools to enable proactive tracking of high-risk patients, stratification, and targeted care based on information gleaned from de-identified electronic medical records, discharge notes, admission notes, and TeleHealth data. Our observational and prospective studies of the effectiveness of our risk prediction methods over several tens of millions of patient visit records across
health systems showed a reasonably high C-statistic (over 0.79) in identifying at-risk patient populations. The early identification methods and care coordination processes have the potential to transform care delivery services and enable healthcare practitioners, clinicians, quality specialists, and nursing teams to target care to the right patient at the right time, enabling better quality of care across clinical settings. The risk assessment and care tools have been implemented and deployed by 360Fresh in large clinical settings where identification and risk assessment have become critical to enabling post-discharge support and care. Our tools have been used by clinicians over the last four years for various patient conditions related to cardiovascular disease and cancer.

2:30PM CLOSING REMARKS
Room: Lecture Hall, Rm 414

1:00-5:00PM INVITATION ONLY SESSIONS
Project Advisory/Federal Steering Committee/ONC/PIs – Reflections
Room: Classroom, Rm 417 - A (1:00-2:30PM)
Pan-SHARP – Mercury Planning Session
Room: Classroom, Rm 417 - B (1:00-2:30PM)
NLP Next Steps Planning Session
Room: Rm 419 (1:00-2:30PM; 3:00-5:00PM)
Beacon CDR Training
Siebens 4-05 (1:00-5:00PM)
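The 360Fresh risk models themselves are proprietary and not described in this program. Purely as a reminder of what the C-statistic reported above measures (the area under the ROC curve of a risk score against observed 30-day readmissions), here is a minimal Python sketch on invented data, assuming scikit-learn is available:

```python
# Hypothetical illustration of evaluating a readmission risk score with the
# C-statistic (area under the ROC curve); not 360Fresh's method or data.
from sklearn.metrics import roc_auc_score

# Hypothetical predicted 30-day readmission risks and observed outcomes
# (1 = readmitted within 30 days, 0 = not readmitted).
risk_scores = [0.82, 0.15, 0.64, 0.09, 0.71, 0.33, 0.55, 0.12]
readmitted = [1, 0, 1, 0, 0, 0, 1, 0]

# The C-statistic is the probability that a randomly chosen readmitted patient
# receives a higher risk score than a randomly chosen non-readmitted patient.
c_statistic = roc_auc_score(readmitted, risk_scores)
print(f"C-statistic: {c_statistic:.2f}")
```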


SHARPn Partnerships, Alliances & Adopters
• Clinical Information Modeling Initiative (CIMI). International consensus group with Detailed Clinical Models.
• Consortium for Healthcare Informatics Research (CHIR)
• Informatics for Integrating Biology and the Bedside (i2b2)
• Electronic Medical Records and Genomics (eMERGE) Network
• Massachusetts Veterans Epidemiology Research and Information Center (MAVERIC)
• Multi-source Integrated Platform for Answering Clinical Questions (MiPACQ)
• National Library of Medicine, UMLS team
• National Quality Forum / MAT User Group
• Open Health Natural Language Processing (OHNLP)
• Pharmacogenomics Research Network (PGRN) – PGPop team
• popHealth
• Southeast Minnesota Beacon Community
• Standards & Interoperability (S&I) Framework - Query Health


Speaker Biographies

Kent Bailey, Ph.D. is a Professor of Biostatistics in the Department of Health Sciences Research, Division of Biomedical Statistics and Informatics, at the Mayo Clinic. Dr. Bailey is a Senior Statistician who has been primarily involved in cardiovascular and hypertension studies over the last 26 years. His statistical work and interests have included applying survival methods in novel contexts such as hypertension control, and recently, developing models for the local prevalence and trends in obesity, as well as age-recursive models for disease prevalence. More recently he has been involved in statistical genetics, specifically in the analysis of genome-wide association studies (GWAS), including dimension reduction, principal components, gene set enrichment analysis, index construction, and replication and validation.

Landen Bain is a freelance expert in data standards for healthcare and clinical research. As liaison between CDISC, the clinical research standards body, and the healthcare information community, he bridges these two worlds that have common interests and common subjects, but little interaction.

Calvin Beebe is a Technical Specialist II in the Clinical Integration and Exchange Unit within Mayo Integrated Clinical Systems (MICS Division) and Chief Architect on the SHARPn and SE MN Beacon projects. Calvin has over 31 years of experience in healthcare information systems design, development, and support. For the last 12 years, he has been active at Health Level Seven International (HL7), and he currently serves as a Co-chair of the Structured Documents Work Group, home of the CDA and CCD.

Steven Bethard, Ph.D. is a Research Associate at the University of Colorado's Center for Language and Education Research and a Visiting Professor at KULeuven's Language Intelligence & Information Retrieval research group in Belgium. He received his Ph.D. in Computer Science and Cognitive Science from the University of Colorado in 2007, and worked as a postdoctoral researcher at Stanford University's Natural Language Processing group and Johns Hopkins University's Human Language Technology Center of Excellence. His research covers a variety of topics in natural language processing, information retrieval, and machine learning.

Yixin Bian is a Ph.D. Student at the University of Maryland, Baltimore County.

Troy Bleeker, CBAP® is a business analyst in the Division of Biomedical Statistics and Informatics.

Andreea Bodnari, B.S., Massachusetts Institute of Technology (MIT)

Deb Castellanos has been a software developer with Xerox (previously ACS, previously BRC, previously CCSI) since 1980. During that time she has designed and programmed over a dozen applications. Her favorite software application is PH-Doc, the county-owned Public Health Documentation System. PH-Doc started in 1984 as a statistical reporting system for local public health in Minnesota. The original electronic patient chart included demographics, family data, and service data. Today PH-Doc supports a very robust electronic chart. Her education includes a BA from Clark University with a double major in Music and Psychology and an MS in Educational Psychology. She is a certified Project Manager and Rehabilitation Counselor.

Pei Chen is a lead application development specialist at the Informatics Program at Boston Children's Hospital/Harvard Medical School. His interests lie in building practical applications using machine learning techniques. He has a passion for the end-user experience, has a background in Computer Science/Economics, and is a firm believer in the open source community. Mr. Chen co-leads the NLP software development group within SHARPn.

Lee Christensen is a research associate in the Department of Epidemiology at the University of Utah. He has worked in software development for over 20 years, and was lead developer for several natural language processing systems, including MPLUS, ONYX, and TOPAZ.

Christopher Chute, M.D., Dr.P.H. received his undergraduate and medical training at Brown University, internal medicine residency at Dartmouth, and doctoral training in Epidemiology at Harvard. He is Board Certified in Internal Medicine, and a Fellow of the American College of Physicians, the American College of Epidemiology, and the American College of Medical Informatics. He became founding Chair of Biomedical Informatics at Mayo in 1988, and is PI on a large portfolio of research. He is presently Chair of the ISO Health Informatics Technical Committee (ISO TC215) and chairs the World Health Organization (WHO) ICD-11 Revision. He also serves on the Health Information Technology Standards Committee for the Office of the National Coordinator in the US DHHS, and the HL7 Advisory Board.

Cheryl Clark, Ph.D. has been a lead AI scientist in MITRE's Human Language Technology Department since 2007. Her research interests include text processing, natural language understanding, and medical informatics. She participates in a number of natural language processing projects at MITRE, and is the principal investigator of Automating Fact Extraction from Medical Records, a MITRE-funded research project. Dr. Clark currently leads the NLP SHARPn task on Negation and Uncertainty.


Piet de Groen, M.D. is a Professor of Medicine at the Mayo Clinic College of Medicine. His clinical research interests are in hepatocellular cancer and cholangiocarcinoma. Dr. de Groen has a strong interest in biomedical informatics and clinical NLP, which he has been using in his translational science research.

Dmitriy Dligach, Ph.D. is a research fellow at Children's Hospital Boston and Harvard Medical School. He holds a PhD in Computer Science from the University of Colorado. Dr. Dligach's research focuses on semantic tagging through supervised and unsupervised methods. Dr. Dligach leads the relation extraction task within the NLP SHARPn team.

Junfeng Gao, Ph.D. is a Postdoctoral Research Scientist at Columbia University.

Peter Haug, M.D. received his medical degree from the University of Wisconsin Medical School and trained in Internal Medicine at LDS Hospital and the University of Utah. He received fellowship training in Biomedical Informatics at the University of Utah. He currently serves as Director of the Homer Warner Center for Informatics Research at Intermountain Healthcare and as a Professor in the Department of Biomedical Informatics at the University of Utah. His active research interests include natural language processing, sharable approaches to representing decision support logic, models of disease capable of informing clinical care, and secondary uses of data collected in the course of care.

Vitaly Herasevich, M.D., Ph.D. is Assistant Professor of Anesthesiology and Medicine at Mayo Clinic College of Medicine.

Jorge Herskovic, M.D., Ph.D. graduated from the Universidad de Chile with an MD in 2002, and obtained his PhD in Health Informatics from The University of Texas School of Biomedical Informatics at Houston in 2008. He is currently an Assistant Professor at SBMI, where he focuses on Biomedical Information Retrieval and Natural Language Processing. He has published his research in JAMIA and JBI. He is a W.M. Keck Fellow and a Schull Scholar. He has also interned at St. Luke's Episcopal Hospital, and Google. In addition to his research activities, he serves as the instructor for the Foundations of Health Informatics I course (an introduction to health informatics).

Stan Huff, M.D. is Professor (Clinical) of Medical Informatics at the University of Utah, and the Chief Medical Informatics Officer at Intermountain Healthcare. Intermountain Healthcare is a charitable not-for-profit health care organization in the intermountain west that includes 24 hospitals, numerous primary care and specialty clinics, and a health plans (health insurance) division. He has worked in the area of medical vocabularies and medical database architecture for the past 20
years. He is currently a co-chair of the Logical Observation Identifier Names and Codes (LOINC) Committee, a member of the Board of Directors of HL7, and a member of the HIT Standards Committee. He teaches a course in medical vocabulary and data exchange standards at the University of Utah.

Guoqian Jiang, Ph.D. conducts research on biomedical terminologies and ontologies, data standards, and clinical study common data elements. He has specifically focused on these subareas: quality auditing of large-scale biomedical terminologies and ontologies, quality evaluation of clinical study common data elements, developing and implementing innovative collaborative terminology authoring platforms leveraging Semantic Web technology, and clinical phenotyping and clinical element models using electronic health records.

Daniel Jensen is the Associate Director of Olmsted County Public Health Services in Southeastern Minnesota. He oversees the areas of aged and disabled programs, WIC (Women, Infants & Children), and public health informatics. In the informatics role, Mr. Jensen serves as the lead public health coordinator for the Southeast Minnesota Beacon program, leads development strategy for PH-Doc, and is a strong advocate for building technical solutions through communities of practice.

Siddhartha Jonnalagadda, Ph.D. obtained his Bachelor of Technology (Honors) in Computer Science and Engineering from the Indian Institute of Technology, Kharagpur. He is the first student to have completed the requirements for the PhD degree in Biomedical Informatics from Arizona State University. His main research interest is Natural Language Processing (NLP). Siddhartha, currently a Research Associate at Mayo Clinic-Rochester, is focusing on using NLP for clinical knowledge gathering.

Vinod Kaggal is a Sr. Programmer Analyst who has been working on the normalization pipeline. Vinod played a key role in designing and implementing the normalization pipeline.

Yugyung Lee, Ph.D. is an Associate Professor at the University of Missouri - Kansas City School of Computing and Engineering.

Hongfang Liu, Ph.D. is an associate professor at Mayo Clinic and currently leads the Mayo clinical natural language processing (NLP) program. Dr. Liu has over 15 years' experience in biomedical informatics, specializing in clinical and biomedical NLP and terminology. Her research mission is to bring NLP to research and practice so that information in free text can be utilized for various automated systems and consumers.

Kyle Marchant is a senior Software Architect and Manager for Agilex Technologies (located in Chantilly, Virginia). Agilex is a leading Software Integrator with one of its sectors focused on
Healthcare. Kyle has worked closely with the VA, DoD, Intermountain Healthcare, Mayo Clinic, and other industry-leading clinical care systems to help further many Health Information Exchange, Data Normalization, and Clinical Modeling/Terminology programs. Most recently, Kyle has worked on the SHARP Secondary Use and Beacon HIE initiatives, providing architecture design, software development, and clinical data normalization/exchange expertise. He has a B.S. in Computer Engineering from the University of Utah.

James Masanz is a senior software developer in the Department of Biomedical Statistics and Informatics at Mayo Clinic. He has been a lead cTAKES developer for the last 5 years. He holds a Master's degree in Computer Science. He co-leads the NLP software development group within SHARPn.

Sean Murphy is a senior programmer analyst in the Clinical Informatics Systems unit at Mayo Clinic. He has a BS degree in Computer Science. He has experience in NLP for clinical informatics, is the primary developer of the drug profile pipeline utilized in cTAKES, and is responsible for implementing, porting, and testing several algorithms in cTAKES that target various phenotypes relating to NHGRI eMERGE, PGRN, and similar studies. He assisted in retooling efforts related to streamlining and automating the icTAKES/meTAKES environment.

Tom Oniki, Ph.D. is a medical informaticist at Intermountain Healthcare. He has responsibility for the development of detailed clinical models and accompanying controlled terminology for the Electronic Medical Record project that Intermountain is engaged in with GE Healthcare. Prior to joining Intermountain Healthcare, Dr. Oniki was a Principal Product Manager at Oracle Corporation, responsible for the development of a terminology server for their healthcare platform. Dr. Oniki received his PhD in Medical Informatics from the University of Utah in 2001. He was also employed by the Critical Care Department of LDS Hospital, participating in efforts to implement computer applications and continuous quality improvement in the care of critically ill patients. He received a Master's degree in Medical Informatics from the University of Utah and graduated Magna Cum Laude with a Bachelor's degree in Electrical Engineering from Brigham Young University.

Jyoti Pathak, Ph.D. is an Assistant Professor in Medical Informatics at the Mayo Clinic College of Medicine. He joined Mayo in 2007 with several years of research experience in biomedical knowledge representation and semantic information integration, and has been a key contributor in two major NIH/HHS funded initiatives, the Electronic Medical Records and Genomics (eMERGE) and Strategic Health IT Advanced Research Projects (SHARP) projects, which have pioneered techniques for high-throughput phenotyping from the electronic medical record. He is the recipient of the prestigious Iowa State University
Graduate Research Excellence Award and the Mayo Clinic Early Career Development Award, in 2007 and 2010, respectively.

Marc Rosenman, M.D. is Director of Operations for the Regenstrief Center for Healthcare Improvement and Research (RCHIR); Research Scientist, Regenstrief Institute, Inc.; Assistant Professor of Pediatrics, Children's Health Services Research, Indiana University School of Medicine; Director, Health Data and Epidemiology Section, Regenstrief Institute; and faculty supervisor for the Regenstrief Institute's data management group. His research focuses on clinical epidemiology, electronic medical record systems, and health information from multiple sources.

Anna Rumshisky, Ph.D. is a Postdoctoral Associate in the Clinical Decision-Making Group within the Computer Science and Artificial Intelligence Laboratory at MIT.

Ramesh Sairamesh, Ph.D. is the CEO and President of 360Fresh Inc., driving analytics for patient quality and safety. He collaborates with CITRIS-SSME at UC Berkeley, California and CMU, Silicon Valley on healthcare systems and services. Previously he was a Manager and Program Leader for Business Solutions, Early Warning Systems, and Manufacturing Quality Research at IBM Watson Research, New York. He was one of the functional architects for IBM's e-business and e-Marketplace products. At IBM, from 2001 to 2007, he helped drive the vision and strategy for advanced business solutions on value-chain management, early warning systems for reducing warranty, and enterprise quality in manufacturing (automotive). He has helped incubate and drive three commercial business solutions for IBM's customers in the areas of Dealer-CRM, Early Warning for Warranty, and Supply-Chain Quality. He has numerous US patents and over 50 research publications. He has won three IBM outstanding innovation awards for eCommerce and dealer collaboration and an IBM Research Division Award for Service Science, Management and Engineering (SSME).

Guergana Savova, Ph.D. is faculty at Harvard Medical School and Children's Hospital Boston. Her research interest is in natural language processing (NLP), especially as applied to the text generated by physicians (the clinical narrative), focusing on higher-level semantic and discourse processing, which includes topics such as named entity recognition, event recognition, and relation detection and classification, including co-reference and temporal relations. The methods are mostly machine learning, spanning supervised, lightly supervised, and completely unsupervised approaches. Her interest is also in the application of NLP methodologies to biomedical use cases. She holds a Master of Science in Computer Science and a PhD in Linguistics with a minor in Cognitive Science from the University of Minnesota.


Dustin Schultz is a Lead Software Engineer at University of Utah Health Sciences Information Technology. He is responsible for leading the design and development of all software and infrastructure for the FURTHeR project. Additionally, he provides direction, mentoring, and development tasks for other software engineers.

Gyorgy Simon, Ph.D. is an Instructor of Biostatistics at Mayo Clinic College of Medicine.

Sunghwan Sohn, Ph.D. is Assistant Professor of Medical Informatics at Mayo Clinic. His research focuses on information extraction and retrieval from clinical notes and biomedical literature using machine learning and natural language processing. Dr. Sohn leads the NLP SHARPn task on medication template population.

Harold Solbrig has been involved in the area of ontologies, terminologies, and classification systems since the early 1980s. He is the editor of the OMG Lexicon Query Services (LQS) specification, the HL7 Common Terminology Service (CTS) specification, and the recently adopted OMG CTS2 specification. He is an active participant in OMG's Ontology Platform SIG, the WHO ICD-11 development project, the IHTSDO Implementation and Innovation technical committee, and the Clinical Information Modeling Initiative (CIMI).

Cui Tao, Ph.D. is an Assistant Professor of Medical Informatics at Mayo Clinic College of Medicine. Her research focuses on ontology generation, conceptual modeling, and information extraction in the biomedical domain. She is also interested in the Semantic Web and its application to biomedical and clinical data. She received a doctorate and a master's degree in computer science from Brigham Young University in Provo, Utah. She also earned a Bachelor of Science degree from Beijing Normal University, where she majored in biology and minored in computer science.

Ozlem Uzuner, Ph.D. is an Assistant Professor at the University at Albany-SUNY College of Computing and Information. She is interested in digital information, intellectual property, and privacy; she studies both technology and policy issues related to the dissemination of information and the protection of intellectual property and privacy in the digital world. In particular, she applies Natural Language Processing techniques to policy problems and applications that can benefit from textual and linguistic processing. Information retrieval, medical language processing, information extraction, data anonymization, and text summarization are her current foci.

Daniel Vreeman, PT, DPT is an Assistant Research Professor at the Indiana University School of Medicine and a Research Scientist at the Regenstrief Institute, Inc. His primary research
focus is on the role of standardized clinical vocabularies in supporting electronic health information exchange. As Associate Director for Terminology Services at Regenstrief, he directs the development of LOINC, an international standard for laboratory and clinical observations, and provides leadership and oversight to the terminology services that undergird the Indiana Network for Patient Care, a regional health information exchange in central Indiana. He also teaches medical informatics at Indiana University.

Kavishwar Wagholikar, Ph.D. is a postdoctoral fellow in the Division of Biomedical Statistics and Informatics at Mayo Clinic Rochester. He completed medical school at the University of Mumbai, India, and obtained a PhD in Scientific Computing from the University of Pune, India. His expertise is in knowledge representation, clinical decision support (CDS), and clinical natural language processing (NLP). His previous research was on modeling uncertainty in patient information and clinical decision making. Currently his work is focused on developing NLP-based systems for decision support at the point of care.

Susan Rea Welch, Ph.D. is a post-doctoral fellow at the Homer Warner Informatics Research Center at Intermountain Healthcare. Her work has focused on data mining of EHR data for purposes of data quality and disease-related discovery.

Adam Wilcox, Ph.D. is an Associate Professor of Biomedical Informatics at Columbia University. His research and interests are focused on the application of health information technology to effectively transform care, and to transform research and discovery. His work has contributed to each of these areas (application of health information technology, collaborative care transformation, research and discovery) consistently over the last 15 years, with some very high-impact results.

Stephen Wu, Ph.D. is Assistant Professor in Medical Informatics at the Mayo Clinic College of Medicine. With a background in statistical Natural Language Processing focusing on computational semantics, Dr. Wu has worked in the clinical domain seeking to bridge the gap between expert knowledge and textual representations. Dr. Wu leads the NLP SHARPn task on discovering subject and generic attributes.

Kabir Yadav, M.D. is an Assistant Clinical Professor of Emergency Medicine at The George Washington University.

Qian Zhu, Ph.D. is a Research Associate at Mayo Clinic.


Strategic Health IT Advanced Research Projects (SHARP)

Stan Huff, MD
