20 RemoteNEPS: data dissemination in a collaborative workspace


Abstract:  The German National Educational Panel Study has been set up to collect longitudinal data for educational research. About 60,000 target persons will be questioned and tested within six starting cohorts. This will generate a very large amount of data. Accordingly, one of the most important challenges is to provide convenient and user-friendly data access that simultaneously allows a high level of data protection. A better infrastructure for research data has been under construction in Germany since the 1990s. Research Data Centers offer a broad range of ways to access data. The National Educational Panel Study is not only matching these common modes of access but also developing a remote access solution called RemoteNEPS. This Secure Data Access will enable users to work with the data from their own computer via a terminal server solution. The data itself will not leave the secure environment of the National Educational Panel Study. This setting makes it possible to provide detailed data to researchers while guaranteeing a high level of data security. As well as providing data security, the concept of RemoteNEPS can be expanded to higher levels of data utility. A community for educational research can be set up to support good scientific practice. Additionally, RemoteNEPS has the capability to handle the structure of the National Educational Panel Study data even after multiple waves or when matching new data from different sources.

Z Erziehungswiss (2011) 14:315–325
DOI 10.1007/s11618-011-0192-5


Ingo Barkow · Thomas Leopold · Marcel Raab · David Schiller · Knut Wenzig · Hans-Peter Blossfeld · Marc Rittberger

© VS Verlag für Sozialwissenschaften 2011

I. Barkow, M.A. · Prof. Dr. M. Rittberger
Information Center for Education, DIPF – German Institute for International Educational Research, 60486 Frankfurt a. M., Germany
e-mail: barkow@dipf.de

Prof. Dr. M. Rittberger
e-mail: rittberger@dipf.de

Dipl.-Soz. T. Leopold · Dipl.-Soz. M. Raab · D. Schiller, M.A. · Dipl.-Soz.-wirt K. Wenzig
National Educational Panel Study, University of Bamberg, 96045 Bamberg, Germany
e-mail: thomas.leopold@uni-bamberg.de

Dipl.-Soz. M. Raab
e-mail: marcel.raab@uni-bamberg.de

D. Schiller, M.A.
e-mail: david.schiller@uni-bamberg.de

Dipl.-Soz.-wirt K. Wenzig
e-mail: knut.wenzig@uni-bamberg.de

Prof. Dr. Dr. h.c. H.-P. Blossfeld
Chair of Sociology I, University of Bamberg, 96045 Bamberg, Germany
e-mail: soziologie1@uni-bamberg.de

316 I. Barkow et al.

Keywords:  Education · Panel study · Research data · Data protection · Remote access

RemoteNEPS: A new technology for remote access to research data

Summary: The National Educational Panel Study was launched to collect longitudinal data for educational research. In total, 60,000 persons within six starting cohorts will be surveyed and tested, which will produce a very large volume of data. One of the greatest challenges is therefore to offer convenient and user-friendly data access that simultaneously meets high data protection standards. A modern infrastructure for research data has been under construction in Germany since the 1990s. Research data centers offer a wide variety of ways to access data. The National Educational Panel Study will follow these standards and, in addition, develop the modern remote access solution RemoteNEPS. This service allows users to work with the data from their own computer by means of a terminal server connection. The research data remain in a secure environment on the servers of the National Educational Panel Study. This setup makes it possible to provide high-quality microdata while maintaining a high standard of security. The RemoteNEPS concept, however, not only guarantees the security of the data but also enables better use of the data. This includes promoting good scientific practice and supporting collaborative projects in educational research.


20.1   Introduction

Modern societies rely increasingly on information. Today, analyzing this information is greatly facilitated, because it is often stored as digital data. Enormous amounts of data collected from different sources can be merged, and new methods create an abundance of opportunities to analyze them. In 2004, the German Data Forum (http://www.ratswd.de) was established to promote the development of an infrastructure for research data (Solga and Wagner 2007). This infrastructure includes a wide range of different types of data such as administrative data from the Federal Statistical Office or the Federal Employment Agency; transaction data such as information about telephone calls, credit card purchases, or retail store scanning (Lane 2009); and survey data such as the SOEP (Socio-Economic Panel Study) or the Allbus (German General Social Survey).

In 2008, the National Educational Panel Study (NEPS) was set up to find out more about how education is acquired, to see how it affects individual biographies, and to describe and analyze the major educational processes and trajectories across the life span. The goal is to provide a rich source of potential analyses for the various disciplines concerned with educational and training processes, and to establish a basis for major improvements in educational reporting and the provision of expert advice for policymakers in Germany. To reach this goal, the NEPS is collecting longitudinal data on the development of competencies, learning environments, the effects of social inequality and migration, and returns to education throughout the life span (see Chap. 1, this volume). Six different starting cohorts comprising more than 60,000 target persons will yield an enormous amount of data (see Chap. 4, this volume). To make these data available to the national and international scientific community, modern means of data access must strive for an optimal solution at the intersection between data utility and data security (see Chap. 19, this volume). To ensure high-quality user support while complying with high standards of confidentiality protection, the NEPS will implement a portfolio approach of data access consisting of organizational, legal, statistical, educational, and technical aspects (Lane et al. 2008). This strategy aims to ensure compliance with the needs of the scientific community and the regulations of the German Data Protection Act.

This chapter outlines the development of a secure data access concept for the NEPS. We begin by reviewing current approaches to data access in Germany. Then we introduce “RemoteNEPS,” a secure environment aimed at providing user-friendly access to all NEPS data. Subsequently, we discuss how RemoteNEPS can be expanded and highlight some pointers for future developments. We conclude with an outlook on how the first wave of the NEPS data will be disseminated in 2011.

20.2   Current data dissemination approaches

Public use files. The most liberal way of disseminating data is to offer public use files. These files make the data accessible without restriction. Public use data can often simply be downloaded from the data producer’s website. The main advantage of this approach is that a wide range of users are able to obtain the data with minimal effort. Public use files, however, require strict means of disclosure protection before they can be released safely. This often drastically reduces the statistical information available. For example, categorical variables are collapsed; continuous variables are rounded, top-coded, or bottom-coded; and geographical information is deleted. As a result, the usefulness of public use data is limited, and it often fails to meet researchers’ needs.
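The protection steps named above (collapsing categories, rounding, top- and bottom-coding, and deleting geographical information) can be illustrated with a minimal sketch. All variable names, category groupings, and thresholds here are hypothetical and chosen only for illustration; they do not describe how any actual public use file is produced.

```python
# Illustrative sketch only: groupings, thresholds, and field names are
# hypothetical and are not taken from the NEPS or any real public use file.

# Hypothetical mapping that collapses detailed categories into broad ones.
EDUCATION_GROUPS = {
    "no degree": "low",
    "lower secondary": "low",
    "upper secondary": "medium",
    "vocational": "medium",
    "bachelor": "high",
    "master": "high",
}


def coarsen_record(record, income_floor=500, income_cap=8000, round_to=100):
    """Return a copy of one microdata record with reduced detail."""
    protected = dict(record)

    # Collapse a categorical variable into fewer categories.
    protected["education"] = EDUCATION_GROUPS[record["education"]]

    # Round a continuous variable, then bottom-code and top-code it.
    income = round(record["income"] / round_to) * round_to
    protected["income"] = min(max(income, income_floor), income_cap)

    # Delete geographical information entirely.
    protected.pop("municipality", None)
    return protected


original = {"id": 1, "education": "master", "income": 12345, "municipality": "Bamberg"}
public = coarsen_record(original)
print(public)
```

The sketch also shows the cost described in the text: after coarsening, an analyst can no longer distinguish a bachelor's from a master's degree, or any income above the cap.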

Scientific use files. Scientific use files address a more restricted audience. These data are available only to the scientific community and require contractual agreements between the data producer and external researchers. These agreements typically involve security arrangements at the researcher’s home institution and confidentiality pledges, as well as a specification of the purpose, duration, and termination of data usage. After an agreement has been signed, scientific use files are usually delivered on CD-ROM or through the Web using secure connections. Compared to public use files, the anonymization techniques used to generate scientific use files are less restrictive, more statistical information is preserved, and the data utility is higher. The application procedure required to access the data, however, limits the target audience and requires a higher investment by users. In addition, scientific use files are still restricted versions of the original survey data, and some important information, such as geographical data, is not released. From the data producer’s perspective, neither public nor scientific use files allow control over any further usage (and dissemination) of the data.

On-site access. Data access on site requires the researcher to go to the data producer, where data are available in a controlled physical environment. The secure site prevents any copying or removal of sensitive data from the data producer’s premises. In the analysis room, all input or output devices are typically locked down and the computers are not connected to the Internet or any local area network. The data producer’s staff is allowed to monitor all work with the data at all times. Any access to printers is controlled, and outputs are reviewed before they can be taken away. The main advantage of a controlled environment is that data are highly secure and researchers can access the full range of information, including sensitive items. Considering only the amount and quality of the statistical information available, data utility is maximized. However, if data utility is “defined as a function of both data quality and the number of researchers using the data” (Lane and Schur 2009, p. 11), the disadvantages of on-site access become obvious. The amount of research that can be done at secure sites is limited, and the costs in terms of time and money for accessing the data on site are substantial. These high demands can easily discourage many potential users of the data. Therefore, on-site access may lead to a considerable underutilization of data (Lane et al. 2008).

Remote execution. By providing access to sensitive data without requiring the user’s physical presence at the data producer’s site, remote execution addresses one of the main problems of on-site access. In contrast to scientific use files or on-site access, however, remote execution does not allow direct access to microdata. Instead, the user submits scripts via email or an online execution system. These scripts are developed on the basis of empty test files or synthetic data that have the same file structure as the original data but do not contain any sensitive information. After submission, the scripts are executed at the site of the data producer. Depending on the IT infrastructure, the scripts are executed either automatically by a remote execution system (e.g., LISSY, http://www.lisproject.org/data-access/lissy.htm, or JoSuA, http://idsc.iza.org/index.php?page=4) or manually by the staff at the research data center. Typically, outputs are checked for disclosure before being sent to the submitter. A thorough disclosure control allows the data producer to guarantee a high level of data confidentiality. These procedures, however, place a high burden on the reviewing staff and often lead to tedious delays for the data users (Lane et al. 2008). These delays can reduce the usability of the data significantly, particularly in cases of output-intensive data exploration or complex analyses (Grim et al. 2009). At the same time, data utility remains high because remote execution allows at least indirect access to sensitive microdata.
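As a rough illustration of the test files mentioned above, the following sketch builds a file that has the column layout of an original dataset but contains only placeholder values. The schema, the placeholder values, and the function name are assumptions made for this example; real remote execution systems may generate their test or synthetic data quite differently.

```python
import csv
import io

# Hypothetical placeholder values per column type; no original values are used.
PLACEHOLDERS = {"int": 0, "float": 0.0, "str": "x"}


def make_test_file(schema, n_rows=3):
    """Build CSV text with the original column layout but placeholder values.

    `schema` is a list of (column_name, type_name) pairs describing the
    original file; only this structure is passed in, never the data.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([name for name, _ in schema])
    for _ in range(n_rows):
        writer.writerow([PLACEHOLDERS[type_name] for _, type_name in schema])
    return buffer.getvalue()


# Hypothetical structure of an original dataset.
schema = [("id", "int"), ("birth_year", "int"), ("test_score", "float")]
test_csv = make_test_file(schema, n_rows=2)
print(test_csv)
```

A script that runs without syntax errors against such a file can then be submitted for execution on the original data, although its substantive results can only be judged once the checked output is returned.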

20.3   A highly secure environment to provide user-friendly data access for the National Educational Panel Study

The previous section described common ways to access data in Germany, which most research data centers now provide. Nevertheless, ongoing research is looking for a better and more convenient technique for accessing data. The most promising approach is summarized under the expression “remote access.”

Only a few institutions in Germany are working on remote access solutions. The German Data Forum established a special research project called infinitE to analyze the possibilities of remote access in the country (Brandt and Zwick 2009). The IQB (Institute for Educational Progress) in Berlin and the DIPF (Educational Research and Educational Information) in Frankfurt are also working on new ways of accessing research data remotely. The infrastructure for research data in Germany is quite new: the first research data centers (RDC) were established in 2001 (RDC of the German Statistical Office), 2002 (RDC of the Statistical Offices of the Federal States), and 2004 (RDC of the Federal Employment Agency at the Institute for Employment Research). Methods of data storage and data access are therefore still under development, and more research is needed in this area. Other countries already work with remote access systems, for example, the National Opinion Research Center (NORC) at the University of Chicago in the United States or Microdata Online Access (MONA) at Statistics Sweden (Lane and Shipp 2007; Söderberg 2005).

Remote access differs greatly from remote execution, which is already provided by most of the current German RDCs. With remote execution, researchers send syntax written for the statistical package of their choice (most RDCs support Stata, SPSS, or SAS) to the data provider. Staff members at this facility then execute the code on their server infrastructure to produce a result file that is sent back to the user. The researcher therefore sees only the results, not the original data. The idea behind remote access is to allow researchers to do the calculations themselves. As a result, a closer connection to the data can be established, which should lead to more detailed and interesting answers to research questions. Because the data stays within the secure environment of the data provider, more detailed data can be offered to researchers than in a scientific use file. In the following, the concept of remote access is described in more detail.

To work with data via remote access, a researcher has to open a secure connection to the data provider’s server from her or his own workspace. Most international data providers such as NORC and MONA now use the term “data enclave” to describe this environment. The enclave is a protected working area in which the researcher can only process the data. In other words, she or he is unable to copy data from the enclave and store it on her or his own computer. Only the information needed for the remote access client is delivered to the researcher’s computer via the Internet. The data itself never leaves the server and is displayed only as a video stream. The result is a highly secure environment that makes it possible to grant access to microdata at a very detailed level. Researchers can therefore work within their normal office environment and do not have to travel to a remote location to get a similar service level. In addition, all the necessary software, such as statistical packages or office programs, is provided in the enclave.

In short, remote access allows more and deeper analyses than normal scientific use files, while being almost as comfortable as working on one’s own desktop. As a consequence, remote execution or on-site access will be needed only in a small number of cases. The main benefit is less work for both the researcher and the staff of the data provider: the researcher can see the data during the process and accordingly build her or his syntax more effectively, whereas the staff of the data provider only have to check the final results before they leave the enclave.

The NEPS will provide huge amounts of data in a very complex structure. To serve the needs of the scientific community and to protect the privacy of respondents, the NEPS will not only provide the usual forms of data dissemination but is also working on a remote access and data enclave solution. RemoteNEPS will offer a forward-looking scientific research environment to the users of NEPS data.


To give a better view of the advantages of RemoteNEPS, the different data dissemination approaches are discussed below with a focus on the portfolio approach that is used for data protection.

Organizational approach. The data of the NEPS is collected only for use by the scientific community. In line with this premise, researchers first have to prove their identity as scholars, researchers, or academic staff. If they cannot do this, access to the data will be denied.

Legal approach. Researchers and data providing agencies in Germany are bound by the corresponding legal regulations for data protection (see also Chap. 19, this volume). To work with NEPS data, researchers have to sign an additional contract specifying the legal rules with which they have to comply.

After these first two formal steps, a user profile for the researcher is set up within the software system of the NEPS. Now the researcher is able to work with NEPS data. The first two steps have to be completed before any researcher can access the data, and they differ only slightly between the different approaches for data dissemination. The only real difference lies in the more sophisticated user management for remote access solutions.

Statistical approach. Because the researchers’ analyses are statistical, statistical disclosure control (SDC) is a central concept for data security in the social sciences. All the different ways of data access mentioned in Sect. 20.2 are subject to different degrees of anonymization (Wende 2004) and therefore different kinds of statistical disclosure control. Statistical data protection can be divided into two different techniques: changes to the data itself and control over the results of statistical calculations (Hundepool et al. 2010). Changes to the data may range from aggregating data (e.g., disclosing not somebody’s occupation but only the branch of the economy they work in; although this obviously makes it harder to identify a single person, it also restricts the possibilities for analysis) to adding noise (the true values are changed following a given model) or creating synthetic data while preserving the structure of the original data (Rubin 1993). The same procedures can be applied to the results of statistical analyses after checking them for their disclosure potential: results can also be aggregated, changed, or synthesized. Every data provider using techniques of statistical disclosure control has to bear in mind how far they weaken the possibilities for good analysis and what the main research interests of the scientific community are. Using the functionalities of a database system opens up a wide range of possibilities for automating statistical disclosure control.
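To make the idea of adding noise concrete, here is a minimal sketch in which each true value is perturbed by a random draw. The text only states that values are changed "following a given model"; the additive, zero-mean Gaussian model and the standard deviation used below are assumptions chosen for illustration.

```python
import random


def add_noise(values, sd, seed=None):
    """Return the values perturbed with zero-mean Gaussian noise.

    The additive Gaussian model is an assumption for this sketch; a data
    provider may use a different noise model.
    """
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, sd) for v in values]


# Hypothetical income values.
incomes = [2100.0, 2500.0, 3200.0, 4100.0, 9800.0]
noisy = add_noise(incomes, sd=100.0, seed=42)
print(noisy)
```

A larger standard deviation gives more protection for individual values but also distorts estimates more strongly, which is the trade-off between data protection and analytical potential discussed above.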

Within the portfolio approach, the level of statistical data protection depends on the level of security reached by the other approaches. The scientific use file needs the highest level of statistical protection, whereas on-site access, remote execution, and remote access need less protection because the data stays in the enclave. Only the output of calculations will be provided after it has been checked for confidentiality by staff members.
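One way such an output check could be supported by software is sketched below: a frequency table is computed and cells with very few observations are flagged for the reviewing staff. The minimum-count rule, the threshold of 5, and the variable names are assumptions for illustration only; the text does not specify which rules are applied when outputs are checked.

```python
from collections import Counter


def check_frequency_table(rows, by, min_count=5):
    """Tabulate `rows` by the keys in `by` and flag small cells.

    Returns (table, flagged): the full frequency table and the subset of
    cells whose count is below `min_count` (a hypothetical threshold).
    """
    table = Counter(tuple(row[key] for key in by) for row in rows)
    flagged = {cell: n for cell, n in table.items() if n < min_count}
    return table, flagged


# Hypothetical microdata with two categorical variables.
rows = (
    [{"school_type": "A", "migration": "no"}] * 12
    + [{"school_type": "A", "migration": "yes"}] * 7
    + [{"school_type": "B", "migration": "no"}] * 9
    + [{"school_type": "B", "migration": "yes"}] * 2
)
table, flagged = check_frequency_table(rows, by=["school_type", "migration"])
print(flagged)
```

A flagged cell would not necessarily be rejected automatically; in this sketch it only marks output that a staff member should inspect, aggregate, or suppress before release.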

Educational approach. One core feature of NORC’s enclave concept is to establish a community of well-trained and trusted users. As Bradburn et al. (2006, p. 8) note, “it is important to develop the human as well as the physical infrastructure for the data enclave.” Before accessing any data, external researchers will be invited to participate in training courses addressing several different aspects of the NEPS service. First, users of the NEPS data must be fully aware of their responsibilities when accessing sensitive microdata, a goal that can be achieved far more effectively through personal contact than just by signing contracts and ethical statements. Furthermore, the users are trained on disclosure risks and the principles of disclosure control. Gaining a better understanding of the disclosure control process prevents users from producing unnecessary or critical output, thereby reducing the number and volume of requests (Lane et al. 2008). The training modules will also provide an introduction to the study design and data structures, thereby ensuring a solid level of background knowledge that will help new users start their research projects. This includes topics such as how to work with a specific type of data, how to merge effectively, how to weight correctly, and so on. Such data training eases the use of the data and leads to better research quality. The courses will also train researchers in all technical aspects of using a data enclave. This part of the training illustrates how to establish a connection with the enclave and how to use the software tools and the rich metadata provided by the NEPS research data center. Finally, the courses introduce the benefits of collaborative work within the NEPS data enclave, thereby helping to establish research networks and develop new ideas.

Training will be offered onsite and online. The content of the training modules varies with regard to the user’s level of experience. For inexperienced users, training may be mandatory.

Technical approach. Because remote access is a fairly new concept, we shall also describe its technical background in this section. Additionally, although the term remote access is quite common for a data enclave solution in the social sciences and is also used regularly in conferences or scientific papers, the term itself is misleading for a computer scientist, because in this discipline, it describes only the technical means of accessing a computer remotely (e.g., virtual private networks or dial-in connections). To avoid the description of misleading technical concepts in this paragraph, the term remote access will be dropped and replaced by the project name “RemoteNEPS.”

From a technical point of view, RemoteNEPS is a terminal server solution. This means that data processing as well as computational power rely completely on a remote server system provided in the NEPS data center. The client acts only as a presentation layer. Except for some configuration files for displaying what is essentially a video stream, no data will actually be stored on the client system. The connection to the server system is established by a remote desktop client (e.g., RDP Client for Microsoft Terminal Services, ICA Client for Citrix XenApp, or VMware View for vSphere infrastructures). Because RemoteNEPS is modeled after the MONA system (Söderberg 2005) of Statistics Sweden, a Microsoft Terminal Server solution is likewise used. This system uses the Remote Desktop Protocol 7.0, which offers 128-bit encryption based on the RC4 algorithm to avoid man-in-the-middle attacks and can therefore be considered highly secure.

While connected to RemoteNEPS, the user is presented with a remote session with its own desktop containing several statistical packages (e.g., Stata, SPSS) for data analyses as well as a private folder to store results and programming code.

Because behavior within a terminal server solution in Microsoft environments is controlled by Group Policy Objects (GPOs), several restrictions are implemented, for example:

• The user will not be able to return to his or her own desktop and therefore cannot start external programs.
• The user is not allowed to save data onto his or her own desktop.
• Within the terminal session, all start menu options for configuring system options are switched off (e.g., computer management or control panel).
• Print functionality is switched off.
• All user interactions on the terminal server session can be logged.
• The terminal server session does not allow for an Internet connection.

All these restrictions prevent data exchange with the outer system and are designed to keep the data within the terminal server session. If a user wants to have results stored on an external computer, they have to be handed over to the user service for clearance.

Although these restrictions prevent the data from being stored physically on the outer system, they do not completely rule out the possibility of data theft. The user might still consider taking photos of the session or recording the data stream by connecting a device between the computer and the monitor. Nevertheless, the security approach of a data enclave relies heavily on how impractical such misuse is, because extracting data from a video stream can be considered highly problematic. Still, it might well be that some highly sensitive data cannot be offered via RemoteNEPS and can only be made available for direct processing within the NEPS facilities.

Another challenge lies in sharing data and results in research groups. Whereas a data enclave is able to control the security levels of individual users, sharing data between groups might be problematic. From a technical point of view, it is not difficult to create shared folders or document management systems with appropriate user rights. Nevertheless, problems arise when one member of a group who has access to more sensitive data than other members of the same research team puts results on the shared drive. Situations such as these can be avoided by extensive user training to promote awareness as well as by logging the contents of each shared folder.
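The shared-folder problem can be stated as a simple rule: a result may only be placed in a shared folder if every member of the group is cleared for its sensitivity level. The following sketch audits a logged folder listing against that rule. The numeric sensitivity levels, user names, and file names are hypothetical; the text does not describe how RemoteNEPS models access levels.

```python
# Hypothetical sensitivity levels; higher numbers mean more sensitive data.


def may_share(result_level, member_clearances):
    """Allow sharing only if every group member is cleared for the result."""
    return result_level <= min(member_clearances.values())


def audit_shared_folder(folder_contents, member_clearances):
    """Return the names of files that some group member is not cleared to see."""
    return [
        name
        for name, level in folder_contents.items()
        if not may_share(level, member_clearances)
    ]


# Hypothetical research team and logged folder contents.
team = {"alice": 3, "bob": 2, "carol": 2}
shared = {"descriptives.log": 1, "regional_model.log": 3}
print(audit_shared_folder(shared, team))
```

In this example only one team member is cleared for level 3, so the level-3 result is reported by the audit, which is the situation the logging of shared folders is meant to detect.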

20.4   Beyond data security: capabilities of RemoteNEPS

Having described the technical background of the remote access solution, we shall now point out the advantages for the user and the capabilities of this system. From a structural point of view, RemoteNEPS tries to maximize user experience by offering different shapes and formats for analyzing data. Metadata about the different items used in the study is generally freely available on a Web portal on the Internet (i.e., outside RemoteNEPS). This metadata portal can also be found within RemoteNEPS, but in a more sophisticated version offering additional functionality. Here the metadata are connected to the raw data coming from data producers in the form of a relational database as well as an analytical database. This means all NEPS data is fully represented in database structures and not only in flat file format. From the portal within RemoteNEPS, the user will be able to generate the format she or he needs to analyze the data with the statistical package of choice (e.g., Stata, SPSS, SAS, R). This decision was taken because the structure of NEPS with six different starting cohorts is much too complex to be represented in one matrix file structure. Furthermore, the user has much more control over the preselection of data to be analyzed. In the long run, the idea is to provide users with a shopping basket approach to select only the relevant variables from the database, thus creating, in contrast to scientific use files or public use files, individual files tailored to each research question, scientist, or research group.
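The shopping basket idea can be sketched with a small in-memory database: the user names the variables of interest, and only those columns are exported to a flat file. The table name, variable names, and values below are invented for illustration, and SQLite stands in for the database server purely so that the example is self-contained; none of this reflects the actual NEPS schema.

```python
import csv
import io
import sqlite3

# Hypothetical table and variable names; the NEPS database schema is not
# described in the text, and SQLite is used here only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cohort4 (id INTEGER, wave INTEGER, math_score REAL, "
    "reading_score REAL, school_type TEXT)"
)
conn.executemany(
    "INSERT INTO cohort4 VALUES (?, ?, ?, ?, ?)",
    [(1, 1, 0.42, -0.10, "A"), (2, 1, -0.35, 0.25, "B")],
)


def export_basket(conn, table, basket):
    """Export only the variables in `basket` from `table` as CSV text.

    `table` is assumed to come from trusted code, not from user input.
    """
    # Validate requested names against the table's real columns, because
    # identifiers cannot be bound as SQL parameters.
    available = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    unknown = [name for name in basket if name not in available]
    if unknown:
        raise ValueError(f"unknown variables: {unknown}")
    cursor = conn.execute(f"SELECT {', '.join(basket)} FROM {table}")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(basket)
    writer.writerows(cursor)
    return buffer.getvalue()


print(export_basket(conn, "cohort4", ["id", "wave", "math_score"]))
```

The same selection step could write other formats instead of CSV; the point of the sketch is only that the file is assembled per request from the database rather than cut from one fixed matrix file.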

An additional feature of the database approach is the representation of data in the form of analytical databases. As stated before, NEPS data will be available not only in relational databases but also in so-called OLAP cubes derived from a data warehouse structure. This data model gives users additional features for analyzing data, such as reporting, data mining, text mining, and predictive analytics. These tools derive from the area of business intelligence and are native to the SQL Server 2008 R2 used in NEPS. These features are not meant to replace any means of analyzing data by using statistical packages, but can be considered as an additional option for finding new research ideas and testing other kinds of algorithms on given data. An interesting research question in this respect could also be how algorithms included in analytical databases can be used to benefit the social sciences and educational research in general.

Another feature of a data warehouse structure is the possibility of summarizing information from different sources into one generic structure by means of extract, transform, and load processes (ETL processes). These techniques can be used to harmonize data between different waves and stages of the NEPS survey. In the long run, this native ability of data warehouses can be used to incorporate data from other sources as well (e.g., educational studies with similar research questions).
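A minimal sketch of such an ETL process is given below: records from two waves that use different variable names and codings are extracted, transformed into one common layout, and loaded into a single structure. The wave layouts, variable names, and codes are entirely hypothetical, and a plain Python list stands in for the data warehouse.

```python
# Hypothetical wave layouts: wave 1 stores gender as 1/2 under the name "sex",
# wave 2 uses "gender" with letter codes. Neither is taken from the NEPS.
WAVE_RULES = {
    1: {
        "rename": {"pid": "id", "sex": "gender"},
        "recode": {"gender": {1: "male", 2: "female"}},
    },
    2: {
        "rename": {"person_id": "id"},
        "recode": {"gender": {"m": "male", "f": "female"}},
    },
}


def transform(record, wave):
    """Map one source record onto the harmonized variable names and codes."""
    rules = WAVE_RULES[wave]
    out = {"wave": wave}
    for key, value in record.items():
        target = rules["rename"].get(key, key)
        mapping = rules["recode"].get(target)
        out[target] = mapping[value] if mapping else value
    return out


def etl(sources):
    """Extract records per wave, transform them, and load them into one list."""
    warehouse = []
    for wave, records in sources.items():  # extract
        for record in records:
            warehouse.append(transform(record, wave))  # transform and load
    return warehouse


sources = {
    1: [{"pid": 1, "sex": 2}],
    2: [{"person_id": 1, "gender": "f"}],
}
warehouse = etl(sources)
print(warehouse)
```

Keeping the renaming and recoding rules in one table per source makes it straightforward to add a further wave, or an external study, without changing the transformation logic itself.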

In addition to the capabilities of RemoteNEPS on the database side, the system also provides features to build up a collaborative workspace for researchers.

The advantages of a data enclave go beyond convenient access to sensitive microdata. It creates a collaborative space in which researchers can share their knowledge, thereby helping to attain the standard of “good scientific practice.” Research that complies with these standards as defined by the German Research Foundation (DFG 1998) ensures that all procedures are documented, that all results of numerical calculations are reproducible, and that original data is stored together with the published manuscript. The American Sociological Association calls for a full disclosure of methods and analyses in order to allow verification of findings (ASA 1999, p. 15). Similar guidelines have been set up by the German Sociological Association (DGS 1992) and the German Psychological Society (DGPs 2004). Currently, these standards are hardly ever met in the publication process.

Within a data enclave, however, all data and scripts are stored together at the data producer’s facilities. This enforces transparency and encourages collaboration and exchange (Bradburn et al. 2006). Team workspaces within the enclave will allow the collaborative annotation of analyses, the sharing of all results, and even the development of publications such as journal articles. Potential instruments are wikis, bulletin boards, blogs, and shared folders. Tools such as ZOHO or Google documents enable researchers to write documents and build spreadsheets jointly. Reference management and knowledge organization are facilitated, for example, by bibsonomy.org or scholarz.net. In addition, version control systems such as Apache Subversion or Git are useful for writing scripts together and for learning from efficient and freely distributed coding templates. The software supporting data analysis and the development of publications can be complemented by social communities such as ResearchGATE.net. Considering the wide array of available tools, the challenge is probably less to develop new solutions than to integrate open interfaces into a collaborative workspace that is clearly arranged and easy to use.


This kind of workspace greatly improves the generalizability and replicability of social research. It also strengthens the relationship between the data producer and the scientific community. In addition, data producers may capitalize on users’ work in the enclave that creates new metadata (Gregory et al. 2009). A further potential benefit is a shift from an individual to a community-based approach in knowledge production within the social sciences (Lane et al. 2008). A similar transformation has already taken place in the physical and life sciences, in which research communities have been developed to answer policy questions (Lane and Schur 2009).

20.5   Conclusion

The relationship between data dissemination and data protection is often described as “tension-filled” (Schaar 2009). Data providers offer several access modalities when trying to attain the right balance between data utility and disclosure risk.

Our review of different dissemination approaches has shown that each solution yields specific benefits but triggers specific costs. A scientific use file, for example, is easily accessible but contains no sensitive data. In turn, remote execution allows the researcher to run analyses using sensitive information but does not offer direct access to the data. On-site access permits direct access to sensitive microdata, but requires substantial investments in terms of time and money and may therefore lead to a considerable underutilization of the data.

Remote access is a promising approach to ease the tension between data utility and data protection. It makes it possible to combine organizational, legal, statistical, educational, and technical controls in a way that enables data providers such as the NEPS to offer secure and user-friendly access to sensitive microdata. In addition, a remote access solution supports the idea of a collaborative workspace in which researchers share their knowledge, create new metadata, and contribute to the generalizability and replicability of social research. Furthermore, a data enclave is well-suited to implement technical innovations such as extended search capabilities within a Web portal, automatic display of typical aggregated information, code generators, or reporting functionalities that will enhance the usability of the NEPS data and metadata.

Remote access, however, will not be the only way of disseminating NEPS data to the scientific community in 2011. Alternative means will be offered to support the full range of users’ interests. For example, the analysis of very sensitive information will still require on-site access to protect confidentiality. Moreover, because the NEPS will be the first large-scale provider of a remote access solution in Germany, researchers who are not yet familiar with modern means of data access may still prefer scientific use files. The NEPS will therefore test different data access approaches for disseminating data from the first wave with its six panel cohorts. Based on this experience, the NEPS will focus on a data access solution that is most user-friendly while maximizing data utility and complying with high standards of confidentiality protection. These ideas will hopefully help to set new standards of scientific user support, create synergies for future projects, and thus be of lasting benefit for the scientific community.

References

American Sociological Association (ASA) (Ed.). (1999). Code of ethics and policies and procedures of the ASA committee on professional ethics. http://www.asanet.org/images/asa/docs/pdf/Ethics%20Code.pdf. Accessed 13 Sep 2010.

Bradburn, N., Horton, R., Lane, J., & Tilkin, M. (2006). Developing a data enclave for sensitive microdata. NSF SBE/CISE workshop, March 15–17, 2005. Airlie House, Virginia.

Brandt, M., & Zwick, M. (2009). infinitE—Eine informationelle Infrastruktur für das E-Science Age. Wirtschaft und Statistik, 9, 670–676.

Deutsche Forschungsgemeinschaft (DFG) (Ed.). (1998). Recommendations of the commission on professional self regulation in science: Proposals for safeguarding good scientific practice. http://www.dfg.de/download/pdf/foerderung/rechtliche_rahmenbedingungen/gute_wissenschaftliche_praxis/self_regulation_98.pdf. Accessed 14 Sep 2010.

Deutsche Gesellschaft für Psychologie (DGPs) (Ed.). (2004). Ethischen Richtlinien der DGPs und des BDP. http://www.dgps.de/dgps/aufgaben/ethikrl2004.pdf. Accessed 14 Sep 2010.

Deutsche Gesellschaft für Soziologie (DGS) (Ed.). (1992). Ethik-Kodex der Deutschen Gesellschaft für Soziologie (DGS) und des Berufsverbandes Deutscher Soziologen (BDS). http://www.soziologie.de/index.php?id=19. Accessed 13 Sep 2010.

Gregory, A., Heus, P., & Ryssevik, J. (2009). Metadata (Working Papers No. 57). Berlin: German Council of Social and Economic Data.

Grim, R., Heus, P., Mulcahy, T., & Ryssevik, J. (2009). Secure remote access system for an upgraded CESSDA RI. http://www.cessda.org/project/doc/CESSDA_RI_SRA_FINAL.pdf. Accessed 26 Nov 2010.

Hundepool, A., Domingo-Ferrer, J., Franconi, L., Giessing, S., Lenz, R., Naylor, J., Schulte Nordholt, E., Seri, G., & De Wolf, P.-P. (2010). Handbook on statistical disclosure control. http://neon.vb.cbs.nl/casc/.%5CSDC_Handbook.pdf. Accessed 10 Nov 2010.

Lane, J. (2009). Administrative transaction data (Working Papers No. 52). Berlin: German Council of Social and Economic Data.

Lane, J., & Schur, C. (2009). Balancing access to data and privacy: A review of the issues and approaches for the future (Working Papers No. 113). Berlin: German Council of Social and Economic Data.

Lane, J., & Shipp, S. (2007). Using a remote access data enclave for data dissemination. The International Journal of Digital Curation, 1(2), 128–134.

Lane, J., Heus, P., & Mulcahy, T. (2008). Data access in a cyber world: Making use of cyberinfrastructure. Transactions on Data Privacy, 1(1), 2–16.

Rubin, D. B. (1993). Discussion: Statistical disclosure limitation. Journal of Official Statistics, 9(2), 461–468.

Schaar, P. (2009). Data protection and statistics—A dynamic and tension-filled relationship (Working Papers No. 82). Berlin: German Council of Social and Economic Data.

Solga, H., & Wagner, G. G. (2007). A modern statistical infrastructure for excellent research and policy advice: Report on the German council for social and economic data during its first period in office (2004–2006) (Working Papers No. 2). Berlin: German Council of Social and Economic Data.

Söderberg, L.-J. (2005). MONA—Microdata ON-line Access at Statistics Sweden. Joint UNECE/Eurostat work session on statistical data confidentiality. Geneva, Switzerland.

Wende, T. (2004). Different grades of statistical disclosure control correlated with German statistics law. In J. Domingo-Ferrer & V. Torra (Eds.), Privacy in statistical databases (pp. 336–342). Berlin: Springer.