The Effects of Context on Data Quality in Biomedical Data Reuse
Leonard D’Avolio, Organizer, University of California, Los Angeles [email protected]
Capturing the effects of contextual influences on surgical operative reports
Melissa Cragin, University of Illinois, Urbana-Champaign [email protected]
Limitations introduced by unforeseen uses of shared neuroscience data collections
W. John MacMullen, University of North Carolina, Chapel Hill [email protected]
Contextual influences on gene ontology annotations
Catherine Arnott Smith, University of Wisconsin-Madison [email protected]
The effects of patient access on medical record accuracy
Introduction
The collection of huge stores of electronically formatted information and advances in information processing technologies have led to dramatic changes in the conduct of biomedical science. In biology, a paradigm shift is underway due to an unprecedented flood of data, the emergence of shared research repositories, and advances in the application of data mining algorithms. As a result, the traditional model of scientific discovery of “formulate hypothesis, conduct experiment, evaluate results” is being replaced with “collect and store data, mine for new hypotheses, confirm with data or supplemental experiment” (Han et al. 2002). In clinical medicine, similar developments have made possible a variety of secondary applications of extant clinical data, including physician decision support, outcomes assessment, document retrieval, and clinician performance evaluation. The quality, or usefulness, of existing data for secondary uses has thus far been approached with a focus on issues of technical access and the mathematical format of data. Largely ignored in these efforts are the effects on quality introduced when data captured for one purpose is reused for another. This panel brings together four researchers with ongoing studies in the biomedical domain focused on the effects of context on the quality of data for secondary uses. Drawing on empirical evidence, the panel will address: 1) the effects of context on data reuse, 2) anticipating, identifying, and accounting for these effects, and 3) the future implications of large-scale data reuse for information professionals in the biomedical domain.
Fitness for re-use: digital data curation for shared data collections (Cragin)
Shared digital data collections are essential to the conduct of 21st Century biology and
discovery. If they are to be maintained for long-term use, these shared collections will
require that scientists participate in curation and preservation activities. While it is hoped
that these data scientists will contribute metadata that is not captured automatically, it is
likely that this will be constrained by cost and the scope of the scientists’ own disciplinary
considerations. This has serious implications for the biomedical sciences, where new
discoveries will increasingly depend on the integration of data across disciplines, kind,
technique, and scale (i.e. size, time, and orders of complexity) (Heidorn, Palmer & Wright,
2007). Because researchers will draw data from other disciplines into their own
research, they will require information and/or tools to assess the value and pertinence of
external data sets (points, collections, etc.) to their own work.
The study of scientific information work, which concerns the actual labor involved in the
making, dissemination and use of scholarly and research products, is of particular import
for the emergence of eScience. In information science research, a practice approach is
one way to investigate the range of information needs and uses across domains.
The practice approach orients attention onto “materially mediated … arrays of human
activity,” which are dependent on and “organized around shared practical understanding”
(Schatzki, 2001, p. 2); and, “[b]ecause the practice approach is ‘materialist’ in nature, these
understandings and activities involve, and are manifest in, the information artifacts used
and created in the work situation” (Palmer & Cragin, forthcoming).
Drawing from on-going research in data practices and the use of shared digital data
collections in biomedicine, several examples are presented in which the end-users
discovered only after extended exploration and analysis the lack of “fit” of data they did
not generate but acquired for use in their own research. It appears from these examples
that one aspect of “data utility” or fitness judgment will be tied to “non-use;” that is, there
may be no way for those generating the data to comprehend or foresee the limitations of
these data for wider use. Therefore, in addition to rich human-generated metadata, it
seems that some kinds of automated tools will be critical for assessing available data for
prescribed parameters based on the needs of secondary users. What are those needs? Will
new categories of metadata be necessary? Whose responsibility will it be to develop the
tools to support data assessment for re-use?
Palmer, C.L. & Cragin, M.H. (forthcoming). Scholarly Information Work and Disciplinary
Practices. Annual Review of Information Science & Technology.
Heidorn, P. B., Palmer, C. L., & Wright, D. (2007). Biological Information
Specialists for Biological Informatics.
Journal of Biomedical Discovery and Collaboration, 2(1). Available:
http://www.j-biomed-discovery.com/content/2/1/1
Schatzki, T. R. (2001). Introduction: Practice theory. In T.R. Schatzki, K. Knorr Cetina, & E.
Von Savigny (Eds.), The practice turn in contemporary theory (pp. 1-14). New York:
Routledge.
Identifying and incorporating contextual influences on physicians’ dictations to facilitate surgical outcomes research (D’Avolio)
Currently, there is little information available to researchers describing what takes place in
the operating room. As a result, most surgical outcomes research correlates outcomes to
rather indirect measures such as hospital volume or the subspecialty of the surgeon. One
potential source of information is the free text surgical operative report dictated by a
member of the surgical team. It describes the surgical tasks performed, as well as any
observations of interest made during the surgery.
As part of a study on the use of surgical operative reports to discover process-based
variables, natural language processing and data mining techniques were applied to a
collection of reports to create models of surgical processes (Meng, D'Avolio et al., 2005).
While these techniques proved capable of mapping the anatomies and events described in
the surgeries based on recall and precision evaluations, trends within the reports and
conversations with surgeons raised concerns. In interviews, surgeons expressed doubt about
the success of a technique that relied on what surgeons say they did, given their awareness
of the potential legal implications of medical dictations. At the same time, these surgeons
claimed to be capable of “reading between the lines” to identify what went wrong in a
surgery. A follow-up analysis of report lengths showed that attending surgeons
averaged 817.37 (±26.67, 95% C.I.) words for their reports versus 1594.54 (±129.64,
95% C.I.) for residents.
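The report-length comparison above can be sketched in a few lines of Python. This is a minimal illustration, not the study's actual procedure: the source does not say how its intervals were computed, so the normal-approximation interval (z = 1.96), the whitespace-based word count, and the sample word counts below are all assumptions made for the example.

```python
import math
import statistics


def word_count(report_text):
    """Count whitespace-separated tokens in a dictated report (an assumed definition of 'words')."""
    return len(report_text.split())


def mean_with_ci(word_counts, z=1.96):
    """Return (mean, half_width) for a normal-approximation 95% confidence interval.

    The half-width is z * s / sqrt(n), where s is the sample standard deviation.
    """
    n = len(word_counts)
    if n < 2:
        raise ValueError("need at least two reports to estimate an interval")
    mean = statistics.mean(word_counts)
    std_err = statistics.stdev(word_counts) / math.sqrt(n)
    return mean, z * std_err


def intervals_overlap(a, b):
    """Check whether two (mean, half_width) intervals overlap."""
    (mean_a, half_a), (mean_b, half_b) = a, b
    return (mean_a - half_a) <= (mean_b + half_b) and (mean_b - half_b) <= (mean_a + half_a)


# Hypothetical word counts, invented for illustration only.
attending_counts = [790, 805, 830, 812, 845, 798]
resident_counts = [1400, 1710, 1525, 1650, 1480, 1795]

attending_ci = mean_with_ci(attending_counts)
resident_ci = mean_with_ci(resident_counts)
print("attending: %.2f (±%.2f)" % attending_ci)
print("resident:  %.2f (±%.2f)" % resident_ci)
print("intervals overlap:", intervals_overlap(attending_ci, resident_ci))
```

Applied to the figures reported above, `intervals_overlap((817.37, 26.67), (1594.54, 129.64))` returns `False`, which is consistent with a systematic difference in report length between attending surgeons and residents.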
These findings support the long-held view in the information science community that an
approach to data quality concerned solely with the poor mathematical format of the
clinical data is limited. However, little has been done to meet the challenge faced by
practitioners responsible for incorporating this reality in system design. More specifically,
how does one account for the economic, legal, and cultural contexts that shape the utility
of surgical dictations while avoiding the extreme positions of 1) the data is corrupt and
useless or 2) quality concerns are simply symptoms of “noisy free text”? Toward this end, a
method for identifying and incorporating both computational and contextual data quality
obstacles is proposed. Preliminary results from this technique have been incorporated in
the design of an automated system for detecting correlations to negative surgical
outcomes in physicians’ dictations.
Meng, F., D'Avolio, L. W., Chen, A., Taira, R., & Kangarloo, H. (2005). Generating models of
surgical procedures using UMLS concepts and multiple sequence alignment. Paper
presented at the Annual Meeting of the American Medical Informatics Association,
Washington D.C.
Gene ontology annotations as an example of the impact of curatorial variation on data reuse (MacMullen)
This presentation discusses general factors that may affect the creation and subsequent
utility of manually-curated annotations in scientific data repositories. These include human
information interaction behavior, such as personal annotation workflows, reading
behavior, the educational backgrounds of curators and their types and durations of
experiences in the lab with different organisms, and with the process of curation. Other
factors include systemic or structural issues, ranging from the curation and data
management policies and practices of the underlying repositories, to the prevailing
paradigm of narrative text scientific articles, to the equipment and protocols used to
obtain the data, to inherent features of the organisms under study.
To illustrate some impacts of these domain-independent factors on data reuse in curated
biomedical repositories, this presentation provides examples from a large-scale
prospective study of human-curated Gene Ontology annotations that included the
collection and analysis of contextual information such as manually-annotated paper
articles and interview data in addition to the formal GO annotations.
Gene Ontology (GO) annotations provide an approach to biomedical knowledge
management that facilitates the integration of genetic information across organisms and
species, as well as across the boundaries of data repositories such as model organism
databases. However, several contextual factors involved in the creation of GO annotations
may affect the reuse of those data for other applications, such as literature mining and
data integration, or for use by scientists in specialties different from those in which the
annotations were originally made.
Manually-curated GO annotations, like index terms, are often treated as black boxes -
knowledge that exists in some context-free manner. They are frequently used as a ‘gold
standard’ or reference against which to evaluate the performance of data mining and
natural language processing tools. But a lack of understanding of how GO annotations are
created, including contextual factors such as human differences and policy decisions,
may negatively affect the results of such evaluations.
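The sensitivity of such evaluations to the reference set can be sketched as follows. This is a hedged illustration only: the gene names, the GO identifiers (used here purely as placeholders), the two curators, and the tool output are all hypothetical, and the function is a generic set-based precision and recall calculation rather than any specific evaluation protocol from the study.

```python
def precision_recall(predicted, reference):
    """Score predicted (gene, GO term) pairs against a reference annotation set."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Two hypothetical curators annotate the same papers but make different choices.
curator_a = {("gene1", "GO:0000001"), ("gene1", "GO:0000002"), ("gene2", "GO:0000003")}
curator_b = {("gene1", "GO:0000001"), ("gene2", "GO:0000003"), ("gene2", "GO:0000004")}

# Hypothetical output of a text-mining tool under evaluation.
tool_output = {("gene1", "GO:0000001"), ("gene1", "GO:0000002")}

score_vs_a = precision_recall(tool_output, curator_a)
score_vs_b = precision_recall(tool_output, curator_b)
print("against curator A: precision=%.2f recall=%.2f" % score_vs_a)
print("against curator B: precision=%.2f recall=%.2f" % score_vs_b)
```

The same tool output receives different scores depending on whose annotations serve as the reference, which is why the context in which a “gold standard” was created matters for interpreting the evaluation.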
‘Don’t write this down’: The chimera of medical record accuracy (Arnott Smith)
Early in the patient access movement, a psychoanalyst commented that a problem “pretty
much untouched in our literature [is:] our version of the clinical moment is the official one”
(p. 371). Two realities of medical record content will make true knowledge sharing
between health professionals and patients a difficult goal to achieve: (1) terminology
presents a challenge to health literacy and health communication; (2) the very content of
the record has been demonstrated to change during writing when a patient reader is
expected in the future. This paper will focus on the second problem.
Narrative text makes up the majority of information in a typical record and has been
shown to undergo a change when patients are known to be prospective readers. This has
been called the “two sets of books” problem [1]. It challenges the idea that objective
clinical accuracy really exists for digitization and knowledge sharing with patients.
Self-censorship by physicians comes in two principal forms. The first is the avoidance of a
particular diagnostic label in favor of a less technical-sounding term. The second form is
the omission of a term because the physician fears it will be misunderstood, a
phenomenon that Jones and Hedley call “censoring by default” [2]. Most worrisome for
information quality is that this may result in physicians failing to write something down in
any terminology.
Healthcare professionals are also directed to censor by their patients. Gutheil and Hilliard
(2001) formulated a taxonomy of psychiatric patients’ requests for noninclusion of specific
items or entire sessions: (1) Don’t take notes. (2) Don’t keep records at all. (3) Leave this
out. (4) I want you to change the record. (5) I want you to destroy my record. Rationales
include legal difficulties, avoidance of stigma “for political and other reasons”, concern
that a permanent record will be created, “embarrassing material”, and “overtly paranoid
concerns” (p. 160) [3].
Data quality control was one of the original motivations of a movement for consumer
empowerment that began in the 1970s. It was considered that the ability of patients to
contribute to their records and correct inaccuracies would increase the clinical value of the
record for all who needed it: not only patients, but also their families and the healthcare
team. Truly increasing access to medical records will require that healthcare professionals
and consumers (not only patients, but also future patients) reach consensus on the objective
accuracy of medical data in those records.
1. Britten et al., 1991
2. Hedley and Jones, 1987
3. Gutheil and Hilliard, 2001