“Big Data” in Laboratory Medicine

Moderators: Nicole V. Tolan1* and M. Laura Parnas2

Experts: Linnea M. Baudhuin,3 Mark A. Cervinski,4 Albert S. Chan,5 Daniel T. Holmes,6 Gary Horowitz,7 Eric W. Klee,8 Rajiv B. Kumar,9 and Stephen R. Master10

Depending upon whom you ask, you may get very different answers to the question “what does ‘big data’ mean to you?” Most obviously, the term “big data” applies to the high-resolution omics data on which we rely, with various bioinformatics tools, to draw conclusions about how to improve patient care. However, “big data” also readily refers to the data reported every day as part of the clinical laboratory testing environment, and more broadly to the information generated in electronic health records (EHRs).11 There are several practical IT solutions for handling day-to-day “big data” that enable millions of test results to be reported per year.

Informatics is changing the processes behind laboratory medicine. With ever-growing demands on laboratory medicine professionals not only to collect and interpret omics data in the era of the Precision Medicine Initiative, but also to ensure high-quality, low-cost patient management in the structure of accountable care organizations, we have invited several experts to discuss their take on “big data.”

Our experts highlight how to ensure that the data analyzed are high quality, so that the conclusions we make will translate to effective clinical management and optimal patient care. They review a number of IT solutions they rely on to gain efficiency in the clinical laboratory to benefit clinical practice. They also discuss querying the clinical laboratory database to improve test utilization, and how “big data” analytics enables more effective quality management. These 8 experts, with diverse backgrounds and interests, highlight various IT solutions to tackle our “big data.”

How do you define “big data” and what does it mean to you in your clinical practice?

Eric Klee: In my opinion, the term “big data” has different meanings depending on the context being considered. From an IT perspective, “big data” is anything that challenges an institution’s computational infrastructure and requires application-specific modifications. A clear example is providing sufficient compute nodes, memory, and storage to account for the demands of whole genome sequencing. The size of the data sets generated often requires high-performance computer clusters and specialized storage infrastructure. I think “big data” has a different meaning from the context of a cyto- or molecular geneticist. From that perspective, I would assert that “big data” refers to any data set that challenges or exceeds an individual’s ability to manually evaluate all data points for clinical relevance. This does not necessarily require the data to be whole genome sequencing; any targeted next-generation sequencing (NGS) panel of sufficient size to require informatics solutions for data reduction before clinical interpretation qualifies. For example, a molecular geneticist might be capable of reviewing all variants called on a 10-gene panel (approximately 30–40 variants per case) without additional informatics support; however, they would be overwhelmed trying to do this for a 60–100-gene panel.

1 Associate Director, Clinical Chemistry and Director, Point-of-Care Testing, Department of Pathology and Laboratory Medicine, Beth Israel Deaconess Medical Center and Harvard Medical School, Boston, MA; 2 Director, Clinical Science, Sutter Health Shared Laboratory, Livermore, CA; 3 Co-director, Personalized Genomics Laboratory and Clinical Genome Sequencing Laboratory, Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN; 4 Director, Clinical Chemistry and Point-of-Care Testing, Department of Pathology and Laboratory Medicine, Dartmouth-Hitchcock Medical Center and the Geisel School of Medicine at Dartmouth, Hanover, NH; 5 VP and Chief, Digital Patient Experience, Sutter Health Office of Patient Experience, Sacramento, CA; 6 Division Head, Clinical Chemistry, Department of Pathology and Laboratory Medicine, St. Paul’s Hospital and University of British Columbia, Vancouver, BC; 7 Director, Clinical Chemistry, Department of Pathology and Laboratory Medicine, Beth Israel Deaconess Medical Center and Harvard Medical School, Boston, MA; 8 Director of Bioinformatics, Clinical Genome Sequencing Laboratory, and Associate Director of Bioinformatics, Center for Individualized Medicine, Department of Health Science Research, Mayo Clinic, Rochester, MN; 9 Medical Director of Clinical Informatics, Departments of Pediatrics and Clinical Informatics, Stanford School of Medicine and Stanford Children’s Health, Palo Alto, CA; 10 Director, Central Lab and Chief, Clinical Chemistry Laboratory Service, Department of Pathology and Laboratory Medicine, New York Presbyterian Hospital and Weill Cornell Medical College, New York, NY.

* Address correspondence to this author at: Beth Israel Deaconess Medical Center, 330 Brookline Ave., Boston, MA 02215. Fax 617-667-4533; e-mail [email protected].

Received August 30, 2015; accepted September 10, 2015.
© 2015 American Association for Clinical Chemistry

11 Nonstandard abbreviations: EHRs, electronic health records; NGS, next-generation sequencing; LIS, laboratory information system; TAT, turnaround time; VUS, variant of uncertain significance; SNV, single-nucleotide variant; INDEL, insertion or deletion variant; CNV, copy number variant; CRAN, Comprehensive R Archive Network; QA, quality assurance; ICD-9, International Classification of Diseases, Ninth Revision; NHLBI, National Heart, Lung, and Blood Institute; ExAC, Exome Aggregation Consortium; PHI, protected health information; SYCL, Society for Young Clinical Laboratorians.

Clinical Chemistry 61:12 1433–1440 (2015)


Linnea Baudhuin: “Big data” is a broad term that relates to data sets so large, diverse, and/or complex that traditional data processing applications are inadequate to analyze, capture, curate, share, visualize, and store them. Thus, “big data” requires innovative bioinformatics solutions to process the data, make them meaningful, and derive usable information. In the arena of clinical molecular genetic testing, NGS has required us to develop complex bioinformatics solutions encompassing raw data mapping, alignment, filtering, and variant calling.

Stephen Master: There are several different definitions of “big data” that get thrown around. One school of thought says that we should only use the term “big data” when we have too much information to store or process on a single computer. Another way to define “big data” is by the characteristics of volume (amount of data), velocity (how quickly we acquire data), and variety (different kinds of data). We also talk about “big data” in the context of analyzing large, complex data sets such as those derived from omics experiments. The important thing to recognize is that many of the same analytical approaches can be applied to laboratory medicine data regardless of the precise definition that we choose. As a result, I think that it’s appropriate to use “big data” to describe the information that we get from our large numbers of patients, samples, and analytes in the clinical laboratory.

Dan Holmes: The term “big data” is frequently used by my laboratory medicine colleagues, clinicians, and health administrators in various settings. From the context of these conversations, I (somewhat jokingly) would be prepared to say that most of us use the term to mean “I can’t do the analysis in Excel.” Restricting my thinking to healthcare environments, I would define the term as follows:

Big data (in clinical medicine): (n) Extremely large data sets obtained from demographic, clinical (medical, nursing, paramedical, pharmacy), diagnostic, and public health records used to direct decisions about diagnostics, patient care, resource allocation, and epidemiological trends.

The connotation of the term suggests that the data itself might have to be pooled from disparate sources, usually databases of diverse structure, and that the data might require substantial “wrangling” to prepare it for analysis using preexisting or custom computational tools.

Mark Cervinski: All of the high-volume test result data produced by automated instruments in the clinical laboratory could qualify as “big data.” A medium-sized laboratory such as ours can generate 3 to 4 million patient test results a year, and each of those results has associated data that never make it to the chart. For every test performed, we track when and where a sample was drawn, transit time, processing time, analyzer time, resulting time, and specimen integrity (hemolysis, icterus, and lipemia indices). However, the only data to accompany the test result are typically the time the sample was drawn and when it was resulted. All of these nonresult data are valuable and mineable. With the proper tools and questions in hand, a laboratorian can dig into these data to query whether their reference ranges are appropriate, to look for test utilization patterns, to curb test overutilization, or to monitor preanalytic and analytic quality.

Gary Horowitz: Although my laboratory generates over 5 million patient test results each year, I don’t usually think of the work I do as involving “big data,” but maybe I should. To me, “big data” relates to the kind of analytics that Google does, e.g., using the frequency of search terms to track influenza epidemics almost in real time. The work I do with large amounts of patient data reflects practices at my institution, which is only a small piece of the “big data” our laboratory produces each day. My goal in analyzing these clinical data is to see whether I can uncover ways to improve not only laboratory practice, but overall clinical practice. That is, in addition to ensuring the accuracy, precision, and turnaround times (TATs) of laboratory results, I try to see whether the results themselves can be used to monitor and improve clinical care. For example, if many of the vancomycin levels we report are outside the therapeutic range, it’s not enough that our assays are accurate and TATs are good. I’d like to know what we can do, as an institution, to help ensure that patients’ vancomycin levels are therapeutic.

Rajiv Kumar: “Big data” is the assessment of massive amounts of information from multiple electronic sources in unison, by sophisticated analytic tools, to reveal otherwise unrecognized patterns. As a pediatric endocrinologist, to me “big data” means a method to enhance the care of human disease, such as insulin-dependent diabetes mellitus. Multiple and fluctuating factors affect blood sugar control, and patients/parents are asked to make real-time, multi-daily insulin dosing decisions without truly knowing whether a given dose will have the intended effect. They have the benefit of their own experience and healthcare provider guidance based on intermittent retrospective review of available data. However, each decision point represents a new combination of variables, with subtleties in patterns that may not be readily identified. As the diabetes research community moves closer to realization of an automated closed-loop artificial pancreas in clinical use, “big data” will be the backbone that facilitates optimal glycemic control.

Albert Chan: Our world today is geared to improve the lives of consumers. From Amazon to Google to our local grocery store, we have the same expectation: a product of utmost quality and convenience. This is made possible by “big data.” On the continuum from creepiness to utility, our expectations have shifted. We now readily provide the personalized data that can simplify our transactions or personalize our experiences for the better.

How do you tackle your “big data” and what do you make out of it in your practice?

Linnea Baudhuin: Most laboratories utilize multiple different software programs and home-brewed IT solutions to analyze extremely large NGS output files. There are 3 major types of NGS tests in current clinical use: (1) cancer genetic variant testing for diagnosis, prognosis, or therapeutic response; (2) gene panel testing for diagnosis of inherited disorders; and (3) whole exome (or genome) sequencing for rare inherited disorders. In the future, we can expect NGS tests for pharmacogenomic panels, as well as transcriptomic- and epigenomic-based tests. For all of these applications, we need the ability to filter out potentially large numbers of benign or nonreportable variants. Additionally, we frequently encounter variants of uncertain significance (VUSs); the more we sequence, the more VUSs we encounter. While testing of trios (e.g., parents and affected child) can help to create cleaner data, oftentimes complete trios are not available for testing. For these reasons, and more, we require specialists to help us tackle big NGS data. Most clinical genetics laboratories now employ at least one bioinformatics specialist, with the bigger laboratories requiring a team of such specialists.

Eric Klee: The “big data” sets that I generally work with are all NGS based, and most of what we attempt to make out of these data are high-quality variant call sets. The variant types range from simple single-nucleotide variants (SNVs) and small insertion or deletion variants (INDELs), to copy number variants (CNVs), to structural variants, including fusions, translocations, and inversions. The steps employed to make these calls include NGS read QC and filtering, realignment to the appropriate reference genome, and then a series of highly customized, application-specific variant-calling algorithms. These are integrated into a centralized relational database, merging annotations together with the variant call data, to provide the appropriate context to the underlying data at the time of interpretation.

Stephen Master: From the perspective of software platforms, the bulk of our analytic work is done in the R statistical programming language. There are a few other language platforms that also have good support for statistical analytics in a big data context, but we now have a growing international group of clinical chemists who are sharing approaches for analyzing laboratory data in R.

Daniel Holmes: I exclusively use the R statistical programming language for cleaning, processing, analyzing, and visualizing the large data sets I have to deal with in laboratory medicine. I use R because it has a very large following of people from many disparate academic fields.


This means that specific tools for almost any conceivable analysis are available in the Comprehensive R Archive Network (CRAN). Relevant to clinical chemists and pathologists, this includes tools for database creation/querying, data cleaning and reshaping, routine and sophisticated statistical analyses, mass spectrometry/chromatography, genetics, epidemiology, machine learning, and data visualization (including real-time interactive visualization). Additionally, R is freely available and open source, platform independent, and has a broad community that shares its knowledge and source code. Our efforts with R have been primarily directed towards automating quality management tasks, visualizing quality metrics in new ways, assessing utilization, and finding needles in haystacks. For example, Levey-Jennings charts for all tests on all chemistry analyzers for the previous month are autogenerated in the early hours of Monday mornings. These are processed and converted to PDFs by a single R script (coordinated with Linux bash) and autoemailed to all relevant staff for review when they arrive for work. As for finding needles in haystacks, we had an analyzer filing identical panel results on 2 consecutive patients on a very rare basis. With R, we were able to write a script that identified these occurrences among the 5 million analyses performed over the prior 3 years, find the 30 affected patient records, and make investigations and corrections.
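
As a rough illustration of the needle-in-a-haystack search described above, and emphatically not the authors' actual script, the following R sketch flags specimens whose entire panel is identical to the immediately preceding specimen on the same analyzer. The data frame `results` and its column names are hypothetical.

```r
# Hypothetical LIS extract: one row per specimen, one column per analyte.
library(dplyr)

panel_cols <- c("na", "k", "cl", "co2", "urea", "creat", "gluc")

suspect <- results %>%
  arrange(analyzer_id, result_time) %>%
  group_by(analyzer_id) %>%
  mutate(prev_specimen = lag(specimen_id),
         # TRUE only when every analyte equals the previous specimen's value
         same_as_prev  = if_all(all_of(panel_cols),
                                ~ !is.na(lag(.x)) & .x == lag(.x))) %>%
  ungroup() %>%
  filter(same_as_prev, specimen_id != prev_specimen)
```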

Mark Cervinski: For the last few years we’ve been using our “big data” to establish a moving averages program to monitor the mean patient value for a number of chemistry analytes in real time. Moving averages is not a new quality assurance (QA) concept, but it has been difficult to implement, partially due to the difficulty in acquiring and analyzing the data. We were only able to develop sensitive moving average protocols once we had accumulated nearly 2 million test results in a database. Using this data set, we were able to model the moving average process in a statistical modeling software package. This modeling allowed us to develop protocols to rapidly detect analytical shifts. Without the data, we would only be able to guess whether our protocols would be sensitive enough to detect a systematic error condition. We’ve only begun to mine these data in our laboratory, and thus far we’ve focused only on the results and when and where each sample came from. But there is great potential to influence the medical care that happens outside of the laboratory’s walls.
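
For readers unfamiliar with the technique, a minimal moving-averages sketch in R follows. The result vector, window size, target, and 3-SEM control limits are illustrative assumptions, not the validated protocols described above.

```r
# Assumes `sodium` is a numeric vector of consecutive patient results
# in reporting order; window and limits are illustrative only.
library(zoo)

window <- 50
ma <- zoo::rollmeanr(sodium, k = window, fill = NA)  # right-aligned moving mean

target <- 140                                  # assumed historical patient mean
sem    <- sd(sodium, na.rm = TRUE) / sqrt(window)

# Indices where the moving mean drifts beyond 3 SEM of the target
shift_flags <- which(abs(ma - target) > 3 * sem)
```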

Gary Horowitz: We make extensive use of Microsoft Access and Microsoft Excel. We are fortunate in that we have access to a database of laboratory data extending back to 2003, along with many other clinical parameters [admissions; International Classification of Diseases, Ninth Revision (ICD-9) and, more recently, Tenth Revision (ICD-10), codes; attending physicians; etc.]. We use Access to write queries to retrieve data of interest, and then we do the majority of our analyses in Excel. To accommodate the increasing number of variables in the future, more sophisticated analytical tools will be required, such as pulling data from this database with the R statistical software.
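
A minimal sketch of that next step, pulling results directly from a database into R with DBI; the DSN and the table and column names are hypothetical.

```r
# Query the laboratory database from R instead of Access/Excel.
library(DBI)
library(odbc)

con <- dbConnect(odbc::odbc(), dsn = "lab_warehouse")  # hypothetical DSN

folate <- dbGetQuery(con, "
  SELECT patient_id, result_value, collect_date, icd_code
  FROM   lab_results
  WHERE  test_code = 'FOLATE'
    AND  collect_date >= '2003-01-01'
")

dbDisconnect(con)
```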

Rajiv Kumar: Currently, when reviewing data for a patient with diabetes, we strive to recognize patterns that may benefit from modulation of insulin dosing parameters or other intervention. Variables assessed include blood sugar trends, characteristics of insulin dosing, physical activity, carbohydrate intake, and progression in growth and puberty. These data and metadata are challenging to access between quarterly clinic visits and are not readily organized in a manner conducive to rapid pattern recognition. In response, my group has pursued an infrastructure goal of conveying more patient data, more often, without increasing patient/parent effort or exacerbating provider resource strain. Using integration of Apple’s HealthKit platform with our Epic EHR patient portal, we are now able to receive up to 288 blood glucose readings per day for patients using Bluetooth-enabled continuous glucose monitors. Key aspects of this integration include data unified in the EHR (home of medical history variables, laboratory test results, growth data, prescription data, and provider work flow), passive data transfer via the patient’s/parent’s mobile device, and data security of this patient health information. With continued variable expansion using this approach, there will be important progress in clinical decision support toward more precise diabetes care.

Albert Chan: Historically, physicians have made decisions about patient care based on limited data. For example, a physician may change a hypertensive medication based on a blood pressure obtained in the exam room. Yet more than 99% of a patient’s life happens outside of the office. This is the promise of wearables and other remote monitoring solutions: providing our care teams with a more holistic view of the patient. For example, a home blood pressure cuff digitally connected to our electronic health record gives our care team a more complete view of blood pressure control on which to base clinical decisions. More importantly, empowering our patients with these data can facilitate new teachable moments.

Are there particular “big data” analytics you use to gain efficiencies, for quality management, or to provide clinical improvements?

Eric Klee: We use all the same types of standard NGS analytics that the broader community uses, including basic read-level QC (i.e., FastQC), alignment-level QC (% reads mapped, mapped on target, mapped at Q30, etc.), and variant-level QC [Ti/Tv (transition–transversion) ratio, number of variants per region, synonymous-to-nonsynonymous ratios, Q20-or-greater variant counts, variant frequency distributions]. In addition, we use a highly specialized relational database to enable complete variant profile storage with rapid retrieval, enabling us to quickly generate QC and interpretative reports.

Linnea Baudhuin: We utilize bioinformatics tools to analyze NGS data quality and filter out data that are of poor quality or require potential follow-up. Quality parameters assessed include number of overlapping reads, per-base depth of coverage, average depth of coverage within the total sequenced area, uniformity of coverage, variant frequency for heterozygotes and homozygotes, strand bias, and nonduplicate reads. Cutoffs for analytical performance parameters need to be established during test verification. Additionally, per-base quality scores are assessed for each test, and the bioinformatics pipeline removes bases with low quality scores before alignment. To gain efficiencies in variant classification, we utilize multiple sources to determine whether a variant has been previously detected and reported in the general population, in specific ethnic groups, and/or in individuals or families with relevant disease phenotypes. These sources include our own internal variant database, ClinVar, the Human Gene Mutation Database, the National Heart, Lung, and Blood Institute’s (NHLBI) Exome Variant Server, and the Exome Aggregation Consortium (ExAC) database. We use whatever information we gather from these sources, along with other in silico prediction and evolutionary conservation tools, to help us make decisions on variant classification. We also classify variants during test verification and store this information in a database to help streamline downstream classification of variants encountered once the test is live. This information is especially helpful for benign or likely benign variants.

Stephen Master: Right now we deal with our “big data” after it has been pulled from the laboratory information system (LIS) in batch mode. This is fine for things that are not time sensitive (simple things like TAT analysis, identification of long-term testing trends, or discovery of multivariable patterns), but it doesn’t yet provide a way for us to turn our big-data analysis into real-time diagnostic output. Our next step at an institutional level is making sure that we have much more rapid access to the raw results data. In terms of specific analytic applications, my group has demonstrated the use of high-throughput hematology analyzer data to identify myelodysplastic syndrome. However, we really need to solve the real-time data-access problem to take full advantage of these analytics in our clinical practice.

Daniel Holmes: At present, there is significant demand for quality metrics, both for monitoring traditional statistics (TAT, reporting of critical values, adverse event rates) and for developing novel metrics (identification of outlier behavior, trends in utilization, identification of testing of low clinical utility). Previously, all of this monitoring was done manually in spreadsheet programs, which is problematic for a number of reasons: it is not traceable (there is no record of the steps in the analysis), it is not automated (the same tedious steps are repeated each month to generate reports), the statistical tools in spreadsheet programs are limited, and there are no automated report generation or real-time data visualization tools. For these reasons, my colleagues and I are coding tools to automate the traditional metrics of quality monitoring. We hope to automate TAT and utilization monitoring using a pipeline of R (database query and analysis), R Markdown (http://rmarkdown.rstudio.com/), knitr (http://yihui.name/knitr/), and LaTeX to create PDF reports. We may opt to create a web “dashboard” using the Shiny package for R (http://shiny.rstudio.com/). At a minimum, the advantage will be that the source code shows exactly what has been done, and we can use the same code to produce laboratory quality metric reports across the 7 large hospitals in our region.
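
A minimal sketch of the rendering step of such a pipeline, assuming a parameterized R Markdown file; the file name "tat_report.Rmd" and its `month` parameter are hypothetical and would be declared in the Rmd's YAML header. Scheduling would be handled externally, e.g., by cron.

```r
# Render the monthly TAT report to PDF, stamped with the run date.
library(rmarkdown)

render(
  input         = "tat_report.Rmd",
  output_format = "pdf_document",
  output_file   = sprintf("tat_report_%s.pdf", format(Sys.Date(), "%Y_%m")),
  params        = list(month = format(Sys.Date() - 28, "%Y-%m"))
)
```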

Mark Cervinski: In addition to the moving averages protocols, we use tools available in our middleware software to analyze sample work flow and tests performed per hour, to adjust our instrument maintenance times, staffing, and test distribution mixture. We also routinely monitor autoverification rates, in-laboratory TAT, and specimen quality flags in real time, as deviations from the norm could indicate unnoticed instrumentation errors and predict instrument downtime. On a longer scale, the data we collect could be considered “big data,” but as we monitor these changes on a daily, weekly, and monthly basis, we tend to refer to them as “small data.” These “small data” fields are key to designing middleware rules that assist laboratory technologists and allow them to focus on those samples that need extra handling. A well-designed set of middleware and LIS rules can bump up the laboratory’s autoverification rate and lower the cost per test, metrics that become ever more important as reimbursement rates for laboratory testing continue to decline.

Gary Horowitz: We generate and test hypotheses with a goal of analyzing and improving clinical practice. As an example, we wondered why our clinicians were ordering so many serum folate tests. Were they screening for folate deficiency, or were they working up cases of macrocytic anemias? Theoretically, there should be very little folate deficiency in the US population, since all breads and cereals have been fortified since 1996. Our analysis indicated that the test was being ordered in huge numbers and almost always in the absence of anemia, let alone macrocytic anemia. Our data indicated that it had an exceptionally low clinical yield: 3 cases of folate deficiency out of 84 000 samples (0.06%) over an 11-year period. We argued that, although we generate excellent results in a timely fashion, we would prefer not to do the test at all. In a similar way, we can look at how often physicians order a single troponin in patients undergoing evaluation for acute coronary syndromes, how often therapeutic drug levels are within their target ranges, and how often physicians react appropriately to D-dimer levels. In all of these situations, our goal is to try to identify areas where, together with our clinical colleagues, we can improve patient care above and beyond offering accurate test results in a timely manner.

Rajiv Kumar: In anticipation of an exponential increase in glucose data from our patients, we built an analytic triage report and glucose data viewer embedded in the EHR to facilitate intervisit retrospective data review without overwhelming available resources. The automated report is generated at defined intervals to triage patients by glycemic control. This allows a diabetes provider to focus time and resources where they are needed most. For a patient whose home data meet flag criteria, the provider opens the patient’s chart and uses the glucose viewer to review and quickly identify actionable data trends. Any questions or recommendations are conveyed to the patient and/or parent using the EHR’s patient portal, permitting efficient communication while simultaneously documenting changes in the treatment plan. We are now translating this provider work flow to streamline analysis of patient-generated health data for additional chronic diseases.

What are some lessons learned from your experience with real-time “laboratory” data integration through Apple HealthKit?

Rajiv Kumar: A major component of managing data is managing patient/parent expectations about said data. Our current intention in using HealthKit is not to take over real-time decision-making, but rather to facilitate efficient identification of actionable trends between visits. In reviewing previous efforts to implement home data in the EHR, we learned that unless told otherwise explicitly, patients may think their provider is constantly watching their data and become frustrated when they are not contacted immediately about an aberrant value. At setup, we use verbal and written notification to establish appropriate expectations regarding only intermittent provider monitoring. To date, we have received only positive feedback, as we are meeting the expectations we defined. When a patient/parent contacts the diabetes provider between clinic visits with questions or concerns regarding glucose trends, quick provider access to the data with no additional effort is appreciated on both sides. There is a technical requirement for patients/parents to keep their mobile device operating system and relevant apps updated to maintain passive data communication. While this is not a major hurdle for most smartphone users, we found a tip sheet with mobile device screenshots to be helpful for some.

How do you see these devices impacting healthcare in the near term, and far in the future? What should laboratorians be investing in now, to have a seat at the table 20 years from now?

Rajiv Kumar: In the near term, passive communication of patient-generated health data to the EHR enhances access to this information in the context of other laboratory and relevant variables in the chart. This organization of puzzle pieces at the center of provider work flow is likely to improve care for a given patient’s condition. In the long term, this precision medicine initiative will continue to foster variable expansion and facilitate clinical decision support tools for providers and patients alike. Additionally, with respect to population health, unified organization of data sets in a given EHR will permit deidentified sharing across health systems, providing invaluable insight into epidemiology and optimized approaches to human disease. In anticipation of the short- and long-term power of complete health information, hospital systems and laboratorians should be investing in EHR data infrastructure today, including an interactive patient portal. We need to ensure that the data we generate and receive have longevity, are easy to access and share, and can be easily reformatted to answer questions that we have not yet thought to ask. Importantly, we also need to advocate for healthcare policy that supports the use of big data in evolving care models without straining provider resources.

Albert Chan: In our experience with a personalized healthcare program for hypertension, almost 69% of patients with previously uncontrolled hypertension who were provided their blood pressure data contextualized with behavioral factors, such as exercise activity, are currently at target control. Similarly, simple binary measures such as a serum test will be augmented with measures that reflect our increasingly nuanced understanding of health. With genomics and other tests providing quantitative probabilities of disease, our clinicians will have to become facile, with an ability to take increasingly complex data and explain the ramifications and the patient’s options for intervention. I am teaching my son and daughters the basics of computer science. This is not a bet that they will grow up to be programmers. Rather, it is based on a belief that all of us, including those of us who participate in clinical care, will need superior quantitative skills to serve as advocates for our patients. Our healthcare consumers will depend on and demand that we have these abilities, to better partner with them in making the critical decisions that influence their health.

What pitfalls must laboratorians be aware of to ensure we draw accurate conclusions from the analysis of our “big data”?

Eric Klee: It is important that laboratorians understand all of the assumptions that have been built into any “big data” analysis pipeline. Oftentimes, these consist of default configurations of the informatics solutions that will not necessarily meet assay-specific requirements. A simple example is the minimum read depth or frequency at which a variant will be called and reported. These configurations are often set with the assumption that the user is analyzing basic genomic data in a hereditary test application, and they fall short for somatic or mitochondrial assays. More complex are some of the assumptions made around complex variant situations, including INDEL events, SNVs in proximity to INDELs, etc. It is equally important that the laboratorian be familiar with the type of variant quality filtering being used. When one is dealing with extremely large data sets, automated filtering and data reduction are required for efficient interpretation. A laboratorian cannot be expected to review all possible variants for each case analyzed, but must take the time to establish a solid understanding of the data reduction and QC steps employed in a “big data” assay during test development, to ensure the proper methods are being used.
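
To make the read-depth and frequency example concrete, here is a minimal R sketch of the kind of default filter that must be reviewed; the table `variants`, its columns, and the thresholds are hypothetical illustrations, not recommended values.

```r
# Filter a variant call table by read depth (DP) and allele fraction (VAF).
library(dplyr)

# Defaults tuned for a germline assay (heterozygotes cluster near VAF 0.5)...
germline_calls <- variants %>% filter(DP >= 20, VAF >= 0.30)

# ...would silently discard low-fraction somatic or heteroplasmic
# mitochondrial variants, which need deeper coverage and a lower cutoff.
somatic_calls <- variants %>% filter(DP >= 250, VAF >= 0.05)
```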

Linnea Baudhuin: NGS has prompted us to move from targeted mutation or single-gene analysis to multigene panels, whole exome, and even whole genome testing. Along with this, our analysis of the data has moved from fairly simple software solutions to a stitched-together set of bioinformatics systems combining off-the-shelf and in-house-developed components. A high-quality bioinformatics pipeline enables us to perform testing with high sensitivity and specificity. In the world of NGS, this means that we can detect as many variants, and types of variants, as are present while ensuring that the data being reported meet quality standards. But we need to balance this with being careful to create tests that are clinically useful, keeping in mind that more is not necessarily better. In other words, the more we sequence, the more variants we find, and the more variant categorization needs to be done. This, in turn, translates to resources spent by the laboratory classifying variants, time spent by the clinician trying to understand and explain the results, a higher potential for incorrect interpretation of the report by clinician or patient, and potentially unnecessary follow-up testing on VUSs. Thus we, the laboratorians, have a responsibility to provide NGS tests that are analytically and clinically valid, as well as clinically useful. We also need to state carefully what the limitations of testing are (e.g., what is not detected by the test), and we need to classify variants in a conservative and standardized manner.

Stephen Master: I think that the biggest pitfall for the laboratory is not spending enough time thinking about data management. There have been several highly publicized cases within the past 10–15 years where complex data were incorrectly processed and led to possible harm to subjects or patients. The underlying problem is that once we’re talking about “big data,” it can be very difficult to spot problems “by eye” unless we have well-validated ways of reproducibly managing and processing data. Another important pitfall is the relative lack of people in our field with quantitative expertise. If we’re going to use “big data” approaches in clinical chemistry and laboratory medicine, we need to be able to effectively police ourselves and peer review each other’s laboratories through the inspection process. This has important implications not only for the way that we prioritize our use of big data for computational pathology but also for the way that we train the upcoming cohort of young clinical laboratorians.

Mark Cervinski: Whether it is intentional or not, data can be massaged to fit a predetermined outcome. The formulation of testable questions and disclosure of all analysis conditions, including which values or data elements were included or excluded from the analysis, are vital. Like all scientific experiments, the results of our analysis of “big data” must be replicable by our peers. While sharing our data sets may not be possible because of protected health information (PHI) disclosure, or simply because of the size of the database, I would support the notion of sharing the tools used so that they can be vetted and improved upon by other similarly skilled investigators.

Daniel Holmes: The data coming out of our LIS are not always as clean as we think. For example, in programming analytical tools for TAT, we have realized that the data were polluted with add-ons, duplicate analyses, duplicate requests, nonnumeric results, and even negative TATs. Thorough review of data quality and a strategy for removing extraneous data are necessary to ensure that the results are meaningful and accurate. We usually start with small, predictable analyses on specific cases to verify the generation of meaningful results. Then we embed the analyses into R functions and apply them across cases. In medicine we often say, “Don’t order a test if it does not change clinical management.” With “big data” we are effectively performing diagnostic tests on our data to see if we can help direct patient care, defray costs, allocate resources, and identify problems early. However, we are in danger of investing our resources in custom analytics only to end up with metrics that are uninformative or to which the appropriate response is unclear. If the analysis does not or cannot inform clinical or laboratory medical practice, then we produce reports that have no value. It is critical that the individuals performing the analysis have a thorough understanding of clinical and/or laboratory medicine processes. Otherwise, they are likely to discover phenomena that may appear significant to an outsider but don’t really matter from a practical standpoint. For this reason, separating quality management personnel and programmers from the clinical or clinical laboratory environment does not facilitate a team-based approach to improving patient care. A coordinated approach prevents medical professionals from making pie-in-the-sky requests and analysts from drawing naive conclusions.
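
A minimal sketch of the kind of cleanup described above, in R; the data frame `lis` and its columns are hypothetical stand-ins for an LIS extract.

```r
# Compute TAT and strip the pollution Holmes describes: nonnumeric
# results, negative TATs, and duplicate analyses.
library(dplyr)

tat_clean <- lis %>%
  mutate(tat_min = as.numeric(difftime(result_time, collect_time,
                                       units = "mins"))) %>%
  filter(!is.na(suppressWarnings(as.numeric(result))),  # drop nonnumeric results
         tat_min > 0) %>%                               # drop negative TATs
  distinct(specimen_id, test_code, .keep_all = TRUE)    # drop duplicate analyses
```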

Gary Horowitz: By far, the most important pitfall for us to consider relates to the effects of local practice on the data, limiting the generalizability of the findings. As an example, we recently did an analysis of extremely high ferritin values (>10 000 ng/mL) from our hospital. None of the traditional causes (hemochromatosis, Still’s disease, hemophagocytic lymphohistiocytosis) were among our cases; rather, we found liver failure and other hematologic diseases most commonly. Does this reflect our patient population, our doctors’ ordering habits, and/or something else? Should we “educate” our clinicians by telling them that a ferritin level >10 000 ng/mL is not seen in those other diseases? Another example relates to efforts to derive reference intervals from laboratory databases, which are attractive because they encompass massive amounts of information. Simple nonparametric techniques, as well as sophisticated statistical techniques, have been used in these efforts. But the results vary tremendously depending on whether one includes all patients, limits the analyses to just outpatients, or limits the analyses to just outpatients with ICD-9/ICD-10 codes indicating the absence of disease. And even the use of these codes to filter the data can be problematic. We once analyzed the distribution of hemoglobin A1c values among outpatients with diabetes at our institution, relying on ICD-9 codes in the database to establish the diagnosis. The prevalence of excellent control was very high, so high in fact that we felt obliged to dig deeper into the data, at which point we discovered that many of the patients were being screened for diabetes rather than having an established diagnosis of diabetes. In other words, an ICD-9 code may not reflect reality, and one must be careful in drawing conclusions, no matter how much data one has.
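
To illustrate how the inclusion filter drives the derived interval, here is a minimal R sketch using simple nonparametric percentiles; the data frame `labs`, its columns, and the ICD-9 filter are hypothetical.

```r
# The same 2.5th/97.5th percentile estimate applied to nested populations.
library(dplyr)

ri <- function(x) quantile(x, probs = c(0.025, 0.975), na.rm = TRUE)

ri_all   <- ri(labs$result)                                # all patients
ri_outpt <- ri(filter(labs, patient_class == "O")$result)  # outpatients only
ri_coded <- ri(filter(labs, patient_class == "O",
                      !grepl("^250", icd_codes))$result)   # also exclude, e.g.,
                                                           # ICD-9 diabetes codes
```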

Albert Chan: “Big data” alone does not guarantee better outcomes. Overwhelming clinicians with unbridled volumes of data makes it more difficult to separate the signal from the noise. To truly realize the benefits, we need to develop better algorithmic and computational approaches to convert “big data” into big insights and find novel opportunities for clinical treatment. With these new approaches, we will be able to empower our clinicians to be better diagnosticians in ways not possible without data.

Author Contributions: All authors confirmed they have contributed to the intellectual content of this paper and have met the following 3 requirements: (a) significant contributions to the conception and design, acquisition of data, or analysis and interpretation of data; (b) drafting or revising the article for intellectual content; and (c) final approval of the published article.

Authors’ Disclosures or Potential Conflicts of Interest: Upon manuscript submission, all authors completed the author disclosure form. Disclosures and/or potential conflicts of interest:

Employment or Leadership: E.W. Klee, Association for Molecular Pathology.
Consultant or Advisory Role: A.S. Chan, AnalyticsMD.
Stock Ownership: A.S. Chan, AnalyticsMD.
Honoraria: G.L. Horowitz, SYCL (presentation at AACC 2015 Meeting).
Research Funding: None declared.
Expert Testimony: None declared.
Patents: None declared.
Other Remuneration: Soft Genetics, receive royalties for joint software development.

Acknowledgments: This year’s AACC Society for Young Clinical Laboratorians (SYCL) Workshop, detailing a number of IT solutions in laboratory medicine, inspired this Q&A session. The development of the focus for the 2015 SYCL Workshop was collaborative, the result of efforts put forth by the 2015 SYCL Workshop and Mixer Planning Committee.

2015 SYCL Workshop Chair: Nicole V. Tolan, PhD, DABCC, Beth Israel Deaconess Medical Center and Harvard Medical School, Department of Pathology and Laboratory Medicine, Director of Clinical Chemistry and POCT.

SYCL AACC Staff Liaison: Michele Horwitz, AACC, Director ofMembership.

Lindsay Bazydlo, PhD, DABCC, FACB, University of Virginia, Department of Pathology, Associate Director of Clinical Chemistry and Toxicology, Scientific Director of Coagulation Laboratory.

Erin J. Kaleta, PhD, DABCC, Sonora Quest Laboratories, Clinical Director.

Mark Marzinke, PhD, DABCC, Johns Hopkins School of Medicine, Departments of Pathology and Medicine, Director of Preanalytics and General Chemistry, Pharmacology Analytical Laboratory.

Fred Strathmann, PhD, DABCC (CC, TC), University of Utah and ARUP Laboratories, Department of Pathology, Director of Toxicology and Mass Spectrometry.

Previously published online at DOI: 10.1373/clinchem.2015.248591
