www.ScienceTranslationalMedicine.org 19 December 2012 Vol 4 Issue 165 165cm15

COMMENTARY

DATA SHARING

To Share or Not To Share: That Is Not the Question

Lucila Ohno-Machado

School of Medicine, University of California San Diego, 9500 Gilman Drive, MC 0505, La Jolla, CA 92093, USA. E-mail: [email protected]

There is an increasing awareness of the power of integrating multiple sources of data to accelerate biomedical discoveries. Some even argue that it is unethical not to share data that could be used for the public good. However, the challenges involved in sharing clinical and biomedical data are seldom discussed. I briefly review some of these challenges and provide an overview of how they are being addressed by the scientific community.

THE DATA DELUGE

Just a few decades ago, the idea that health sciences researchers across the globe could share their latest data and knowledge electronically, within seconds, was just a futuristic vision. At that time, technology, computer models, and cultural differences were not sufficiently understood to advance data sharing in biomedical research. But this vision materialized when collaborations and broad sharing networks were formed during the Human Genome Project and in creating PubMed, for example. Today, there is little question that responsible data sharing is a necessity to advance science (1)—and also a moral obligation (2). Many patients are willing to share their data if they are properly consulted (3). The question is no longer whether or not to share data for research, but how to do it in a way that adds value to society and protects individual privacy and preferences.

Although much has been written about handling the data deluge in health care and biomedical science (4, 5), practical solutions are still maturing. The increasing adoption of electronic health records, integrated laboratory information systems, online social networks, and high-throughput technologies has spurred the interest of government, industry, and academia in “Big Data” (6). As their goals align, important questions emerge: Where will these data reside? How will they be organized? Who should have access? Who will pay for the infrastructure for storing, sharing, and analyzing data? It is not clear whether biomedical researchers and health-care providers can remain focused on their activities by outsourcing data storage, deidentification, annotation, curation, and distribution to reliable third parties that possess the appropriate expertise. Moving forward, it is desirable that these data be integrated so that they provide more value than if shared for only one purpose at a time.

CARE TO SHARE?

On the biomedical front, many researchers are faced with a common challenge. They carefully collect data and use them until results are published but do not have the means to properly maintain the data or software long term and/or prepare them for sharing with other scientists. Most research journals are not equipped to review and maintain large annotated databases or software, and small research groups may not have the resources to maintain data, metadata (information about the data that would help others reuse them), and software developed primarily for in-house use. Access to data generated elsewhere is also difficult. Although public repositories exist, many types of data are not represented, so there is no “home” for them.

Funding agencies and journals increasingly demand that researchers share their data and software with others to ensure reproducibility of results as well as to support new analyses, but the task of developing and maintaining infrastructure to accomplish this demand is hard, time-consuming, and risky because of its high cost and the difficulties in recruiting specialized personnel. The currently used model—peer-to-peer data exchange—quickly becomes intractable when there are a multitude of unknown requesters. Reproducing the results of others is difficult because the same software environment needs to be constructed, which often requires extensive installation and configuration of different versions of software components.

On the health-care front, there are intriguing parallels. Data collected in the process of care could prove extremely useful for quality improvement initiatives as well as clinical research beyond the source institutions (7). Besides addressing the lack of data standardization across different institutions, several steps need to be taken to make derivatives of these data available to others in a way that protects individual and institutional privacy and that ascertains data quality. Even when exchanged for care, electronic data from patient records require special protections and a corresponding policy framework that ensures proper consent and compliance with regulations that cut across institutional, state, federal, and international boundaries.

To responsibly exchange these data for research is even more daunting. The vision of a “learning health-care system” in which all these data can be used for quality improvement and for health services or patient-centered outcomes research is sensible, but enabling this system is not simple (8). Furthermore, it is not trivial to track data usage to address the public’s increasing interest in guarding their own records (9) and in understanding how data and specimens obtained for one purpose (for example, health care or a specific study) are being used for other purposes, such as secondary analyses in other studies (10). Because of the high stakes, not all health-care organizations have bought into the idea of data sharing; beyond the technical challenges, prerequisites such as a system of incentives and a clear business model have to be developed. As research becomes increasingly translational, it is important that these challenges start to be addressed in a systematic way.

Awareness of this data-sharing challenge has prompted different institutions, including the National Institutes of Health (NIH), the Institute of Medicine (IOM), the Agency for Healthcare Research and Quality (AHRQ), and the Patient-Centered Outcomes Research Institute (PCORI) in the United States, as well as international agencies, to assemble experts to discuss current best practices and new models for sharing diverse data, such as whole genomes, images, and structured data items commonly found in electronic health records. For example, the NIH Working Group on Data and Informatics (http://acd.od.nih.gov/diwg.htm) has made important recommendations to the Advisory Committee to the NIH Director. To summarize their recommendations, data and metadata should be shared, incentives should be offered to those who share

data, and investments in user training and infrastructure need to be coordinated to ensure efficient use of resources. On the training side, the number and size of programs to train informatics professionals and researchers need to increase. On the infrastructure side, a backbone for data and software sharing needs to be implemented—for example, through a network of biomedical computing centers. Building this network in a rapidly evolving technological landscape will require the development of new models for data sharing.

STATE OF TECHNOLOGY

Although initial technical setup may be complex, different solutions currently exist that allow researchers to share health and biomedical research data that involve human subjects in a privacy-preserving manner. “Cloud” computing has presented new ways in which to build and deliver software, and cloud storage has become mainstream in the digital world (such as the Amazon Cloud Drive and Apple iCloud). Cloud-based initiatives are part of an architectural solution that allows researchers to outsource infrastructure and use resources “on demand.” This power to scale a computational resource comes at a cost: Economies of scale are achieved by having multiple users use the resources of the cloud, which increases the complexity of managing the security and confidentiality of the data. In general, the requirements for human-subject data protection are not completely resolved by commercial, public cloud providers (11). To handle protected health information, these entities would need to sign business associate agreements with the data-contributing institutions; some cloud providers are not yet ready for this responsibility. Human genomes contain biometric information and hence can be considered protected health information, so this creates a problem when using public clouds.

Fortunately, many privacy technology algorithms and policy frameworks are being developed. Policies and technology can protect privacy in the cloud, particularly for specialized solutions that can be implemented in “private” and certain “community” clouds. These research clouds hosting protected health information must have a strong emphasis on privacy protection in which the advantages of elastic computing (the provision of on-demand computational resources for a large number of users) should still hold, but the environments handling protected health information are segregated, responsibilities are clearly spelled out, and additional access and quality assurance mechanisms are implemented.

Different models for data sharing in a research community cloud are currently being investigated—for example, in the Johns Hopkins Institute for NanoBioTechnology (http://releases.jhu.edu/2012/11/06/collecting-cancer-data-in-the-cloud/) and in the iDASH National Center for Biomedical Computing (12). iDASH stands for “integrating data for analysis, anonymization, and sharing” and is one of six centers funded

[Fig. 1 schematic: a data-sharing center brokers contributor and user DUAs, applies QA, access controls, and DUA management, and hosts data, tools, and VMs for the three sharing models (user download, remote desktop, and VM import/distributed computing).]

Fig. 1. Data-sharing models. To avoid multiple pairwise agreements among institutions, a broker for data can be created. Data contributors specify their requirements for data access by users and sign a contributor data use agreement (DUA). Completing a quality assurance (QA) process is required for data, tool, and VM contributions. Data users also sign a DUA that complies with the requirements of the contributor, so that contributors do not have to negotiate every data-sharing engagement with different institutions. Three models of sharing are displayed. Model 1 is the traditional model, in which users download data for use in their local computers. In model 2, the remote desktop model, users connect to a center but access and analyze data within the center using existing, or their own, algorithms. Model 3 involves virtualization and distributed computation, in which users import software environments (virtual machines, or VMs) to analyze their data using their local computational infrastructures.

CREDIT: Y. HAMMOND/SCIENCE TRANSLATIONAL MEDICINE


in 2010 by the roadmap initiative at the NIH. The focus of this initiative is on new models for data sharing that allow researchers and institutions to pass the responsibility of data sharing, computing, and storage of large amounts of protected health information to a third party. Through technology and policy innovation, centers such as these address different models of data sharing, as illustrated in Fig. 1 and briefly described below.

Users download data. In this traditional model (shown as model 1 in Fig. 1), data seekers identify relevant data sources in a distributed or centralized resource (such as a server) and download data to their local computers. However, as data become “big” (giga- to petabytes of data) and issues related to frequency of updates, available network bandwidth, and ascertainment of data provenance (if these data are further distributed) become more common, it is not always practical or desirable to have data downloaded to local computers. This model, although still highly prevalent in the scientific community, may not work ideally in the long term. The liabilities involved on the part of data donors and users are high, and gigabit networks are still limited to certain institutions. Although deidentification and privacy-protection algorithms can mitigate the confidentiality problem (13–17), once data are downloaded there is no way to track their use, and there is still some risk of reidentification (18).

More research in reidentification and quantification of risk for privacy breaches will help develop policies for this model of sharing, particularly if information about human genomes is going to be shared (19). In fact, the recent NIH Workshop “Establishing a Central Resource of Data from Genome Sequencing Projects” (http://www.genome.gov/27549169) recommended that “sequence/phenotype/exposure data sets [be] deposited in one or several central databases.” In addition to recommending a central location for such data, the meeting discussions stressed the development of governance methods and policies for central databases that support responsible access to individual data sets. The Presidential Commission for the Study of Bioethical Issues recently issued a report emphasizing the importance of protecting health information, particularly the data about an individual’s genome (20).

An important feature of centralizing data is the ability to keep harmonized collections for future use. Manipulation of data, such as harmonization across different data sets (preprocessing), may result in products that are as useful—if not more useful—than the original data. Because preprocessing is executed on local computers with no easy mechanism to upload preprocessed data back to the collective data resource, the user-download model usually provides only one-way resource sharing. Participants of the NIH Workshop recommended that retrospective harmonization of data should be captured in the central databases. Despite some limitations, this data-sharing mechanism is well understood by researchers and institutions and is still practical for small, nonsensitive (“sanitized”) data that are not being requested with very high frequency.

Users access and analyze data remotely. In this model (shown as model 2 in Fig. 1), no data are downloaded; they remain protected in centralized or distributed data sets. Users can perform analyses using preexisting software (located where the data reside) or submit their own software. Given the need to protect privacy, the software undergoes a specialized quality assurance process to ensure that no data are leaked with the results of the computation. Although this model requires users to be connected to the Internet, liabilities are reduced in the case of lost or stolen computers. The environment is admittedly a little less flexible than the data download model but can be privacy-protected and offer computational resources that may not be available to the data user at his or her institution. It also offers auditing capabilities to the data-hosting center that are not possible when data are downloaded by users.

A variety of operating systems, applications, and data sets are required to support this model because it is hard to predict what users will need. Besides protecting the data, this model is especially useful if the user is going to perform demanding operations on large, sensitive data sets, such as genome queries. For example, if a researcher wants to perform de novo assembly of a large genome, such as that of an individual patient, but does not have the computational infrastructure, she could use this model to compute “in the cloud.”

Users import whole software environments. This model (shown as model 3 in Fig. 1) is similar to the remote access model above, except that instead of having users use external computational resources, the users download virtual machines (VMs) to compute on their own hardware with their local data. The VM import model can also enable distributed computation, with each party installing the same VM and contributing the results of its local computation to a coordinating center. For example, we have shown that it is possible to create an accurate predictive model by exporting the computation to different centers and aggregating results only, without any individual patient data ever being transferred (21).

The advantage of this model is that the same software environment can be reproduced, and users do not need to spend time installing specific versions of operating systems and applications. Thus, results are more likely to be reproducible. This model is useful when data cannot be transferred outside of an institution, as is the case in several health-care organizations in the United States, or when legislation prevents transmission of data from an international collaborator outside their country.

This model also enables the creation of a network of collaborating centers, even if institutional policies disallow sharing of data at the individual level. For example, a researcher who wants to build a prognostic model for patients with a particular disease but has limited data at her own institution may need data from several centers. She would like to use a multivariate model to adjust for potential confounders, but she is not able to access such patient-level data at different institutions. With this model, she can combine coefficients and covariance matrices calculated locally at each institution (using the same VM) and transmitted to a central node. Another example in which this model is beneficial is when genomic data need to stay at one institution, but the phenotype data for the same patients are hosted at a different institution—and neither is able to transmit patient-level data to the other. Some algorithms can be decomposed so that multivariate models can be constructed across these “vertically” separated data. This may be one of the most effective ways to deal with international collaborations in which legislation against physical placement of data outside of the country may currently prevent some data-sharing initiatives.
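The combination of locally calculated coefficients and covariance matrices can be sketched in a few lines. This is not the GLORE protocol itself, which iterates until the distributed fit converges (21); it is a simpler one-shot, inverse-variance pooling of per-site estimates, and all site names and numbers below are hypothetical:

```python
# Illustrative sketch only: one-shot inverse-variance pooling of regression
# coefficients fitted locally at each site, so that only summary statistics
# (never patient-level data) travel to the coordinating center.
import numpy as np

def pool_site_estimates(coefs, covs):
    """Combine per-site coefficient vectors using their covariance matrices.

    coefs: list of (p,) coefficient arrays, one per site.
    covs:  list of (p, p) covariance arrays, one per site.
    Returns the pooled coefficients and their pooled covariance.
    """
    precisions = [np.linalg.inv(c) for c in covs]   # weight = inverse variance
    pooled_cov = np.linalg.inv(sum(precisions))
    pooled_coef = pooled_cov @ sum(p @ b for p, b in zip(precisions, coefs))
    return pooled_coef, pooled_cov

# Hypothetical example: two hospitals report similar effect estimates with
# different precision; the pooled estimate lies between the two and is more
# precise than either site alone.
b1, c1 = np.array([0.50, 1.10]), np.diag([0.04, 0.09])
b2, c2 = np.array([0.62, 0.98]), np.diag([0.08, 0.03])
pooled, pooled_cov = pool_site_estimates([b1, b2], [c1, c2])
```

Because each site transmits only its coefficients and covariance, the coordinating node never sees a patient record; the “vertically” separated case mentioned above requires more specialized decompositions than this horizontal pooling.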

STATE OF POLICY

Although technical solutions to data sharing are complex and varied, they may not be as challenging as the solutions for policy issues (22–25). The multiplicity of institutional policies, different types of consent, and different interpretations of what constitutes a small risk for reidentification point to the


need for solutions that will largely be based on proper enforcement of well-designed policies and regulations. For example, there has been discussion on whether access to a community cloud resource could be granted depending on user “certifications” that would require training in responsible conduct of research, among other things. Simplified data-use agreements (DUAs) could be codified in addition to state and federal requirements and enforced through a network composed of several data-hosting centers. For example, iDASH investigators have worked with legal counsel at the University of California to develop a simple system to facilitate data “donation” and data “utilization” by different parties through data contributor agreements and data user agreements. This way, there are no pairwise DUAs between institutions and those who want to access the data. Observing the terms of use specified by the data contributor agreement, iDASH becomes responsible for the distribution of the data. A DUA system covers some of the requirements of data-sharing models 1 and 2 described above.

Other items that need attention are the use of appropriate access controls, depending on the sensitivity of the data (such as two-factor authentication). Additionally, algorithms for deidentification, data obfuscation, and methods to evaluate the risk of reidentification incurred in the disclosure of “limited data sets” (data sets from which certain identifiers were removed) are also needed. With the whole genome constituting the ultimate identifier for an individual, special protection needs to be implemented when these kinds of data are linked to other sensitive information.
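One simple, widely used screen for a limited data set before disclosure is a k-anonymity check: count how many records share each combination of quasi-identifiers and flag those whose group falls below k. The sketch below is illustrative only; the field names and threshold are hypothetical, and production deidentification pipelines (13–17) involve far more than this single test:

```python
# Illustrative k-anonymity screen: records whose quasi-identifier
# combination occurs fewer than k times are the most exposed to
# reidentification and would need generalization or suppression.
from collections import Counter

def k_anonymity_screen(records, quasi_identifiers, k):
    """Return (smallest group size, records in groups smaller than k)."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    sizes = Counter(keys)
    exposed = [r for r, key in zip(records, keys) if sizes[key] < k]
    return min(sizes.values()), exposed

# Hypothetical limited data set: 3-digit ZIP, birth year, and sex remain
# as quasi-identifiers after direct identifiers have been removed.
records = [
    {"zip3": "920", "birth_year": 1955, "sex": "F"},
    {"zip3": "920", "birth_year": 1955, "sex": "F"},
    {"zip3": "921", "birth_year": 1980, "sex": "M"},  # singleton: exposed
]
smallest, exposed = k_anonymity_screen(records, ["zip3", "birth_year", "sex"], k=2)
```

A release would pass this screen only when the smallest group size reaches k; quantifying the residual risk for the records that fail it is precisely the kind of evaluation called for above.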

A SHARED FUTURE

The time has come to address the need to make better use of the avalanche of health care and biomedical research data that are currently being generated through both private and public funding. All of those involved in health sciences research have a keen interest in preventing and alleviating the burden of human disease. There are technical and policy solutions to support data sharing that respect individual and institutional privacy and at the same time provide a public good that can help accelerate research. Several models of data sharing exist. We are just beginning to understand the ecosystem of sharing and to build systems that support these models. The increasing engagement of the public with translational scientists who are at the forefront of the battle against disease is changing the way we collectively look at data sharing: It is not an option, it is a necessity. Turning data into a public good in a way that respects patient privacy will affect translational research and human health in unprecedented ways.

REFERENCES AND NOTES

1. S. F. Terry, P. F. Terry, Power to the people: Participant ownership of clinical trial data. Sci. Transl. Med. 3, 69cm3 (2011).

2. T. Lang, Advancing global health research through digital technology and sharing data. Science 331, 714–717 (2011).

3. A. L. McGuire, J. M. Oliver, M. J. Slashinski, J. L. Graves, T. Wang, P. A. Kelly, W. Fisher, C. C. Lau, J. Goss, M. Okcu, D. Treadwell-Deering, A. M. Goldman, J. L. Noebels, S. G. Hilsenbeck, To share or not to share: A randomized trial of consent for data sharing in genome research. Genet. Med. 13, 948–955 (2011).

4. G. Bell, T. Hey, A. Szalay, Beyond the data deluge. Science 323, 1297–1298 (2009).

5. D. Greenbaum, M. Gerstein, The role of cloud computing in managing the deluge of potentially private genetic data. Am. J. Bioeth. 11, 39–41 (2011).

6. J. Mervis, Agencies rally to tackle big data. Science 336, 22 (2012).

7. J. J. Nadler, G. J. Downing, Liberating health data for clinical research applications. Sci. Transl. Med. 2, 18cm6 (2010).

8. B. C. Delaney, K. A. Peterson, S. Speedie, A. Taweel, T. N. Arvanitis, F. D. Hobbs, Envisioning a learning health care system: The electronic primary care research network, a case study. Ann. Fam. Med. 10, 54–59 (2012).

9. L. Beard, R. Schein, D. Morra, K. Wilson, J. Keelan, The challenges in making electronic health records accessible to patients. J. Am. Med. Inform. Assoc. 19, 116–120 (2012).

10. S. B. Trinidad, S. M. Fullerton, E. J. Ludman, G. P. Jarvik, E. B. Larson, W. Burke, Research ethics. Research practice and participant preferences: The growing gulf. Science 331, 287–288 (2011).

11. E. J. Schweitzer, Reconciliation of the cloud computing model with US federal electronic health record regulations. J. Am. Med. Inform. Assoc. 19, 161–165 (2012).

12. L. Ohno-Machado, V. Bafna, A. A. Boxwala, B. E. Chapman, W. W. Chapman, K. Chaudhuri, M. E. Day, C. Farcas, N. D. Heintzman, X. Jiang, H. Kim, J. Kim, M. E. Matheny, F. S. Resnic, S. A. Vinterbo, iDASH team, iDASH: Integrating data for analysis, anonymization, and sharing. J. Am. Med. Inform. Assoc. 19, 196–201 (2012).

13. G. Loukides, A. Gkoulalas-Divanis, B. Malin, Anonymization of electronic medical records for validating genome-wide association studies. Proc. Natl. Acad. Sci. U.S.A. 107, 7898–7903 (2010).

14. A. Tamersoy, G. Loukides, M. E. Nergiz, Y. Saygin, B. Malin, Anonymization of longitudinal electronic medical records. IEEE Trans. Inf. Technol. Biomed. 16, 413–423 (2012).

15. B. Malin, K. Benitez, D. Masys, Never too old for anonymity: A statistical standard for demographic data sharing via the HIPAA Privacy Rule. J. Am. Med. Inform. Assoc. 18, 3–10 (2011).

16. D. McGraw, Building public trust in uses of Health Insurance Portability and Accountability Act de-identified data. J. Am. Med. Inform. Assoc. (2012). 10.1136/amiajnl-2012-000936

17. C. A. Kushida, D. A. Nichols, R. Jadrnicek, R. Miller, J. K. Walsh, K. Griffin, Strategies for de-identification and anonymization of electronic health record data for use in multicenter research studies. Med. Care 50 (suppl.), S82–S101 (2012).

18. G. Loukides, J. C. Denny, B. Malin, The disclosure of diagnosis codes can breach research participants’ privacy. J. Am. Med. Inform. Assoc. 17, 322–327 (2010).

19. C. Heeney, N. Hawkins, J. de Vries, P. Boddington, J. Kaye, Assessing the privacy risks of data sharing in genomics. Public Health Genomics 14, 17–25 (2011).

20. Presidential Commission for the Study of Bioethical Issues, Privacy and Progress in Whole Genome Sequencing. October 2012; http://bioethics.gov/cms/node/764.

21. Y. Wu, X. Jiang, J. Kim, L. Ohno-Machado, Grid Binary LOgistic REgression (GLORE): Building shared models without sharing data. J. Am. Med. Inform. Assoc. 19, 758–764 (2012).

22. S. O. Dyke, T. J. Hubbard, Developing and implementing an institute-wide data sharing policy. Genome Med. 3, 60 (2011).

23. S. H. Harmon, K. H. Chen, Medical research data-sharing: The ‘public good’ and vulnerable groups. Med. Law Rev. 20, 516–539 (2012).

24. J. Kaye, From single biobanks to international networks: Developing e-governance. Hum. Genet. 130, 377–382 (2011).

25. S. M. Fullerton, N. R. Anderson, G. Guzauskas, D. Free-man, K. Fryer-Edwards, Meeting the governance chal-lenges of next-generation biorepository research. Sci. Transl. Med. 2, 25cm3 (2010).

Acknowledgments: I thank the iDASH team for making this work possible. Funding: iDASH is supported by NIH through the NIH Roadmap for Medical Research, grant U54HL108460. Research and development in data sharing is funded through Agency for Healthcare Research and Quality grant R01HS19913 and NIH grants UH2HL108785 and UL1TR000100 (L.O.-M.). Competing interests: I am the principal investigator of iDASH.

Citation: L. Ohno-Machado, To share or not to share: That is not the question. Sci. Transl. Med. 4, 165cm15 (2012).

10.1126/scitranslmed.3004454
