


The Effects of Context on Data Quality in Biomedical Data Reuse

Leonard D’Avolio, Organizer, University of California, Los Angeles

Capturing the effects of contextual influences on surgical operative reports

Melissa Cragin, University of Illinois, Urbana-Champaign

Limitations introduced by unforeseen uses of shared neuroscience data collections

W. John MacMullen, University of North Carolina, Chapel Hill

Contextual influences on gene ontology annotations

Catherine Arnott Smith, University of Wisconsin-Madison

The effects of patient access on medical record accuracy

Introduction

The collection of huge stores of electronically formatted information and advances in information processing technologies have led to dramatic changes in the conduct of biomedical science. In biology, a paradigm shift is underway due to an unprecedented flood of data, the emergence of shared research repositories, and advances in the application of data mining algorithms. As a result, the traditional model of scientific discovery of “formulate hypothesis, conduct experiment, evaluate results” is being replaced with “collect and store data, mine for new hypotheses, confirm with data or supplemental experiment” (Han et al. 2002). In clinical medicine, similar developments have made possible a variety of secondary applications of extant clinical data, including physician decision support, outcomes assessment, document retrieval, and clinician performance evaluation. The quality, or usefulness, of existing data for secondary uses has thus far been approached with a focus on issues of technical access and the mathematical format of data. Largely ignored in these efforts are the effects on quality introduced when data captured for one purpose is reused for another. This panel brings together four researchers with ongoing studies in the biomedical domain focused on the effects of context on the quality of data for secondary uses. Drawing on empirical evidence, the panel will address: 1) the effects of context on data reuse, 2) anticipating, identifying, and accounting for these effects, and 3) the future implications of large-scale data reuse for information professionals in the biomedical domain.

Fitness for re-use: digital data curation for shared data collections (Cragin)

Shared digital data collections are essential to the conduct of 21st-century biology and discovery. If they are to be maintained for long-term use, these shared collections will require that scientists participate in curation and preservation activities. While it is hoped that these data scientists will contribute metadata that is not captured automatically, it is likely that this will be constrained by cost and the scope of the scientists’ own disciplinary considerations. This has serious implications for the biomedical sciences, where new discoveries will increasingly depend on the integration of data across disciplines, kind, technique, and scale (i.e., size, time, and orders of complexity) (Heidorn, Palmer & Wright, 2007). Because researchers will draw data from other disciplines into their own research, they will require information and/or tools to assess the value and pertinence of external data sets (points, collections, etc.) to their own work.

The study of scientific information work, which concerns the actual labor involved in the making, dissemination and use of scholarly and research products, is of particular import for the emergence of eScience. In information science research, one way to investigate the range of information needs and use across domains is a practice approach. The practice approach orients attention onto “materially mediated … arrays of human activity,” which are dependent on and “organized around shared practical understanding” (Schatzki, 2001, p. 2); and, “[b]ecause the practice approach is ‘materialist’ in nature, these understandings and activities involve, and are manifest in, the information artifacts used and created in the work situation” (Palmer & Cragin, forthcoming).

Drawing from ongoing research in data practices and the use of shared digital data collections in biomedicine, several examples are presented in which end users discovered, only after extended exploration and analysis, the lack of “fit” of data they did not generate but acquired for use in their own research. It appears from these examples that one aspect of “data utility” or fitness judgment will be tied to “non-use;” that is, there may be no way for those generating the data to comprehend or foresee the limitations of these data for wider use. Therefore, in addition to rich human-generated metadata, it seems that some kinds of automated tools will be critical for assessing available data for prescribed parameters based on the needs of secondary users. What are those needs? Will new categories of metadata be necessary? Whose responsibility will it be to develop the tools to support data assessment for re-use?

Palmer, C. L., & Cragin, M. H. (forthcoming). Scholarly information work and disciplinary practices. Annual Review of Information Science & Technology.

Heidorn, P. B., Palmer, C. L., & Wright, D. (2007). Biological information specialists for biological informatics. Journal of Biomedical Discovery and Collaboration, 2(1). Available: http://www.j-biomed-discovery.com/content/2/1/1

Schatzki, T. R. (2001). Introduction: Practice theory. In T. R. Schatzki, K. Knorr Cetina, & E. von Savigny (Eds.), The practice turn in contemporary theory (pp. 1-14). New York: Routledge.

Identifying and incorporating contextual influences on physicians’ dictations to facilitate surgical outcomes research (D’Avolio)

Currently, there is little information available to researchers describing what takes place in the operating room. As a result, most surgical outcomes research correlates outcomes to rather indirect measures such as hospital volume or the subspecialty of the surgeon. One potential source of information is the free-text surgical operative report dictated by a member of the surgical team. It describes the surgical tasks performed, as well as any observations of interest made during the surgery.

As part of a study on the use of surgical operative reports to discover process-based variables, natural language processing and data mining techniques were applied to a collection of reports to create models of surgical processes (Meng, D'Avolio et al., 2005). While recall and precision evaluations showed these techniques to be capable of mapping the anatomies and events described in the surgeries, trends within the reports and conversations with surgeons raised concerns. In interviews, surgeons expressed doubt about the success of a technique that relies on what surgeons say they did, due to their awareness of the potential legal implications of medical dictations. At the same time, these surgeons claimed to be capable of “reading between the lines” to identify what went wrong in a surgery. A follow-up analysis of report lengths showed that attending surgeons averaged 817.37 (±26.67, 95% C.I.) words per report versus 1594.54 (±129.64, 95% C.I.) for residents.
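The report-length comparison above can be illustrated with a minimal sketch. The abstract does not say how its confidence intervals were computed, so the normal approximation (1.96 standard errors) used here is an assumption, as are the function names and the toy word counts, which are invented for illustration and are not the study's data.

```python
import math
import statistics


def count_words(report_text):
    """Count whitespace-separated tokens in one dictated report."""
    return len(report_text.split())


def mean_with_ci95(word_counts):
    """Return (mean, half_width) of a 95% confidence interval for the mean.

    Assumption: a normal approximation (1.96 * standard error) is used;
    the original study may have used a different method.
    """
    n = len(word_counts)
    if n < 2:
        raise ValueError("need at least two reports to estimate an interval")
    mean = statistics.mean(word_counts)
    std_err = statistics.stdev(word_counts) / math.sqrt(n)
    return mean, 1.96 * std_err


# Hypothetical word counts per report, grouped by the role of the dictating author.
word_counts_by_role = {
    "attending": [790, 845, 810, 826, 802],
    "resident": [1480, 1710, 1555, 1640, 1590],
}

for role, counts in word_counts_by_role.items():
    mean, half_width = mean_with_ci95(counts)
    print(f"{role}: {mean:.2f} words (±{half_width:.2f}, 95% C.I.)")
```

Grouping by author role before summarizing is the point of the sketch: a length difference of this kind is a contextual signal that a purely format-oriented quality check would not surface.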

These findings support the long-held view in the information science community that an approach to data quality concerned solely with the poor mathematical format of the clinical data is limited. However, little has been done to meet the challenge faced by practitioners responsible for incorporating this reality in system design. More specifically, how does one account for the economic, legal, and cultural contexts that shape the utility of surgical dictations while avoiding the extreme positions that 1) the data is corrupt and useless or 2) quality concerns are simply symptoms of “noisy free text”? Toward this end, a method for identifying and incorporating both computational and contextual data quality obstacles is proposed. Preliminary results from this technique have been incorporated in the design of an automated system for detecting correlations to negative surgical outcomes in physicians’ dictations.

Meng, F., D'Avolio, L. W., Chen, A., Taira, R., & Kangarloo, H. (2005). Generating models of surgical procedures using UMLS concepts and multiple sequence alignment. Paper presented at the Annual Meeting of the American Medical Informatics Association, Washington D.C.

Gene ontology annotations as an example of the impact of curatorial variation on data reuse (MacMullen)

This presentation discusses general factors that may affect the creation and subsequent utility of manually-curated annotations in scientific data repositories. These include human information interaction behavior, such as personal annotation workflows, reading behavior, the educational backgrounds of curators, and the types and durations of their experience in the lab with different organisms and with the process of curation. Other factors include systemic or structural issues, ranging from the curation and data management policies and practices of the underlying repositories, to the prevailing paradigm of narrative text scientific articles, to the equipment and protocols used to obtain the data, to inherent features of the organisms under study.

To illustrate some impacts of these domain-independent factors on data reuse in curated biomedical repositories, this presentation provides examples from a large-scale prospective study of human-curated Gene Ontology annotations that included the collection and analysis of contextual information such as manually-annotated paper articles and interview data in addition to the formal GO annotations.

Gene Ontology (GO) annotations provide an approach to biomedical knowledge management that facilitates the integration of genetic information across organisms and species, as well as across the boundaries of data repositories such as model organism databases. However, several contextual factors involved in the creation of GO annotations may affect the reuse of those data for other applications, such as literature mining and data integration, or for use by scientists in specialties different from those in which the annotations were originally made.


Manually-curated GO annotations, like index terms, are often treated as black boxes: knowledge that exists in some context-free manner. They are frequently used as a ‘gold standard’ or reference against which to evaluate the performance of data mining and natural language processing tools. But a lack of understanding of how GO annotations are created, including contextual factors such as differences among human curators and policy decisions, may negatively affect the results of such evaluations.
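A minimal sketch of the concern, assuming a simple set-based precision and recall: the same tool output is scored against two curators' annotation sets, and the scores differ depending on which set is treated as the gold standard. The gene names, term labels, and annotation sets are hypothetical placeholders and do not come from the study or from the Gene Ontology itself.

```python
def precision_recall(predicted, gold):
    """Set-based precision and recall of predicted annotations against a gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical (gene, term) pairs; the labels are placeholders, not real GO identifiers.
curator_a = {("geneX", "term_1"), ("geneX", "term_2"), ("geneY", "term_3")}
curator_b = {
    ("geneX", "term_1"),
    ("geneY", "term_3"),
    ("geneY", "term_4"),
    ("geneY", "term_5"),
}
tool_output = {
    ("geneX", "term_1"),
    ("geneX", "term_2"),
    ("geneY", "term_4"),
    ("geneY", "term_5"),
}

# Score the same tool output against each curator in turn.
for name, gold in (("curator A", curator_a), ("curator B", curator_b)):
    precision, recall = precision_recall(tool_output, gold)
    print(f"gold = {name}: precision={precision:.2f}, recall={recall:.2f}")
```

In this toy case the tool itself is unchanged between the two runs; only the choice of reference annotations moves its measured performance.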

‘Don’t write this down’: The chimera of medical record accuracy (Arnott Smith)

Early in the patient access movement, a psychoanalyst commented that a problem “pretty much untouched in our literature [is:] our version of the clinical moment is the official one” (p. 371). Two realities of medical record content will make true knowledge sharing between health professionals and patients a difficult goal to achieve: (1) terminology presents a challenge to health literacy and health communication; (2) the very content of the record has been demonstrated to change during writing when a patient reader is expected in the future. This paper will focus on the second problem.

Narrative text makes up the majority of information in a typical record and has been shown to undergo a change when patients are known to be prospective readers. This has been called the “two sets of books” problem [1]. It challenges the idea that objective clinical accuracy really exists for digitization and knowledge sharing with patients.

Self-censorship by physicians comes in two principal forms. The first is the avoidance of a particular diagnostic label in favor of a less technical-sounding term. The second form is the omission of a term because the physician fears it will be misunderstood, a phenomenon that Jones and Hedley call “censoring by default” [2]. Most worrisome for information quality is that this may result in physicians failing to write something down in any terminology.

Healthcare professionals are also directed to censor by their patients. Gutheil and Hilliard (2001) formulated a taxonomy of psychiatric patients’ requests for noninclusion of specific items or entire sessions: (1) Don’t take notes. (2) Don’t keep records at all. (3) Leave this out. (4) I want you to change the record. (5) I want you to destroy my record. Rationales include legal difficulties, avoidance of stigma “for political and other reasons”, concern that a permanent record will be created, “embarrassing material”, and “overtly paranoid concerns” (p. 160) [3].

Data quality control was one of the original motivations of a movement for consumer empowerment that began in the 1970s. It was considered that the ability of patients to contribute to their records and correct inaccuracies would increase the clinical value of the record for all who needed it: not only patients, but their families, as well as the healthcare team. Truly increasing access to medical records will require that healthcare professionals and consumers (not only patients, but future patients) reach consensus on the objective accuracy of medical data in those records.

1 Britten et al., 1991

2 Jones and Hedley, 1987

3 Gutheil and Hilliard, 2001