Ada slide presentation rsc day_feb2017_v2

Introduction to ADA: The Australian Data Archive as a Trusted Repository for Research Data

Dr. Steve McEachern
Director, ADA

2017 Research Support Community Day
Colombo Theatre, UNSW
13 February, 2017

ADA in Brief

• The Social Science Data Archive (now ADA) was set up in 1981, housed in the Research School of Social Sciences, with a mission to collect and preserve Australian social science data on behalf of the social science research community.

• The Archive holds over 5000 datasets from around 1500 studies, including national election studies, public opinion polls, social attitudes surveys, censuses, aggregate statistics, administrative data and many other sources.

• Data holdings are sourced from academic, government and private sectors.

So what is a data archive?

• ‘A “trusted system” that provides... an accessible and comprehensive service empowering researchers to locate, request, retrieve and use data resources in a simple, seamless and cost effective way, while at the same time protecting the privacy, confidentiality and intellectual property rights of those involved.’

Social Sciences and Humanities Research Council of Canada. “National Data Archive Consultation Final Report: Building Infrastructure for Access to and Preservation of Research Data in Canada” URL: http://www.sshrc.ca/web/whatsnew/initiatives/da_finalreport_e.pdf [20 November 2003].

ADA Subarchives

• Social Science – predominantly survey or polling based quantitative social science data

• Historical – an archive of Australian census data tables from 1834 to the present day

• Indigenous – a thematic archive bringing together research data about Aboriginal and Torres Strait Islanders

• Longitudinal – major longitudinal cohort and panel surveys of the Australian population

• Qualitative – a new collection which provides specialist data archiving and access services to qualitative researchers

• Crime & Justice – major collections of data in crime, law and justice, including criminal justice administrative data

• International – a central point of access for links to international data sources around the world

ADA Data Holdings

ADA data holdings cover a wide variety of subject areas:

• Ageing
• Business and management
• Census data
• Culture
• Demography
• Drugs, alcohol and tobacco
• Economics
• Education, employment and work
• Environment, Conservation, Land use
• Family studies
• Foreign affairs
• Gambling
• Health
• Housing
• Law, Crime, Courts
• Mass media, communication and language
• Migration, immigration and multiculturalism
• Politics and elections
• Public opinion and social attitudes
• Psychology
• Quality of life
• Science, Technology
• Social welfare
• Sociology
• Tourism, recreation and leisure
• Travel and transport

Example studies

• Australian Survey of Social Attitudes (ANU, UWA, UQ, …)
• Longitudinal Surveys of Australian Youth (NCVER)
• Australian Election Studies (ANU, QUT)
• ANUPolls, Morgan Gallup Polls, Age Polls, Lowy polls (1947 – Present)
• Colonial census tables and images, 1838-1901 (ABS)
• Census tabulations, 1966 – Present (ABS)
• National Drug Strategy Household Survey, 1994 – Present (AIHW)
• Australian Workplace Relations Survey, 1990, 1995, 2014 (forthcoming) - Dept of Employment
• Negotiating the Life Course (ANU, AIFS, UQ)

Forthcoming

• Longitudinal studies
  – Department of Social Services
    • HILDA, LSAC, LSIC, BNLA
  – National Centre for Vocational Education Research
    • LSAY new wave
  – Department of Health
    • Australian Longitudinal Studies on Women's Health (ALSWH) and Men's Health (Ten to Men)
  – Bruce: Child Support study
• Exercise, Recreation and Sport Survey 2001-2010 (Australian Sports Commission)
• Giving Australia survey (DSS)

The ADA website

The ADA Study Page

Dataset study pages

Study information is based on the DDI-C (Data Documentation Initiative Codebook) standard, and includes:

• Study: information including the investigators, abstract, sample, data collection methods, and access requirements.

• Variables: a list of variables available in a quantitative dataset

• Related Materials: additional documentation (reports, questionnaires, technical information), links and other related studies (e.g. others in the series) that may interest you

Who uses ADA?

• 2016
  – 12000 online analyses (usually crosstabulations)
  – 1100 data file downloads
• Registrations:
  – Approx. 1000 new users each year
• User types:
  – Undergraduates: 41% of analysis, 16% of downloads
  – Postgraduates: 33% / 40%
  – Researchers: 11% / 40%
  – Others (media, government, NGO, etc.): 15% / 4%
• Institution types: (approx.)
  – Australian universities: 70%
  – International universities: 15%
  – Government departments and agencies: 10%
  – Other: 5%

Data dissemination options

The ADA study page

Study information is available through the tabs at the top of the study:

• Study: information including the investigators, abstract, sample, data collection methods, and access requirements.

• Variables: a list of variables available in a quantitative dataset

• Related Materials: additional documentation, links and other related studies (e.g. others in the series) that may interest you

The study page is also the access point for the ADA Nesstar system, for:

• Analysis of quantitative data online,

• Download of data to your own computer.

Note: you will need to log in to your ADA user account in order to access the Nesstar system.

Types of access

• Browse (viewing metadata):
  – Open access
• Analyse (Online analysis): free user registration
  – General access studies: Free access for registered users
  – Restricted studies: User still requires approval to access
• Data download:
  – For unrestricted data: submit a user request, and sign ADA general user undertaking (reviewed by ADA staff)
  – For restricted data: restricted access request form and specific user undertaking (reviewed by ADA and depositor of data)
  – Special access: depends on the particular access requirements
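The access tiers on this slide can be read as a small decision rule. The sketch below is illustrative only: the `User` class, the `can_access` function and the study identifiers are hypothetical names, not part of any ADA system, and special access arrangements are not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    """Hypothetical user record, for illustration only."""
    registered: bool = False
    # Studies for which an access request has been approved.
    approved_studies: frozenset = frozenset()


def can_access(user, action, study_id, restricted):
    """Return True if `user` may perform `action` on the given study.

    `action` is one of "browse", "analyse" or "download".
    """
    if action == "browse":
        return True  # viewing metadata is open access
    if not user.registered:
        return False  # analysis and download both need a user account
    if action == "analyse":
        # General access studies are open to any registered user;
        # restricted studies still need approval.
        return (not restricted) or study_id in user.approved_studies
    if action == "download":
        # Downloads need an approved request for restricted and
        # unrestricted data alike.
        return study_id in user.approved_studies
    raise ValueError(f"unknown action: {action!r}")
```

For example, a registered user with no approved requests can analyse a general access study online but cannot download it until a request has been approved.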

Browsing: The ADA Study Page

Exploring data in Nesstar

• The information about the study (from the ADA study page) is also available in Nesstar. Click on the Dataset icon to explore the study.

• For quantitative analysis, you can also view basic statistics and charts for individual variables in this section by exploring the Variables tab.

Exploring variables in Nesstar

Creating a cross-tabulation

Downloading data

• Nesstar is also used as the ADA data download system, to export the data files for the study to your own computer.

• To download data, you need to have been approved for download access for the study you are interested in.

• This can be done by submitting a Request for Data Access:
  – a) from the “Request Analysis and Download access” link from a study page, OR
  – b) from your personal User page (http://users.ada.edu.au)

• This request then goes to the ADA User Services team for approval.

• Once your download access has been approved, you will receive an email notification from ADA, and a link to the study will be added to your User Page.

Managing and Depositing Data: ADA and DDI

Data deposit: ADAPT

Archival processing

Manual system with some automation tools

1. Deposit:
  – Review of ADAPT submission
  – Storage via ADAPT to file store
2. Data processing:
  – File format conversion (usually to SPSS for processing)
  – Privacy/confidentiality review
  – Data cleaning (in consultation with depositor)
3. Metadata processing:
  – DDI-C metadata creation in Nesstar Publisher
4. Publishing:
  – Archival storage and access format creation
  – Data publication to Nesstar server
  – Metadata publication to Nesstar and ADA CMS

Future directions

Future trends

• Mandated rather than recommended data archiving
  – How do we scale?
  – Looking at self-deposit systems
• Open access to data as the default
  – Government: PM&C Open Data Policy, data.gov(.au/.uk)
  – Research: Horizon2020, ESRC, NSF, ARC/NHMRC??
• Broader range of data types available
  – Qualitative data: YES
  – Social media data:
    • Raw feed (firehose): NO
    • Processed data: ??? (how to support access)
  – Administrative data: ???
• Broader range of users of that data
  – Different disciplines: health, environment, comp. sci.
  – Different users: public/media/government
  – Different geographies: internationally

Core needs for social science data

• Collection
• Preservation
• Integration
• Analysis
• Dissemination

ADA trusted digital repository project

• Funded by ANDS 2016-17
• Aims:
  – Completion of the Data Seal of Approval self-assessment and certification process
    • http://www.datasealofapproval.org/en/
    • 16 requirements
    • Assessment on 0-4 scale
    • All requirements must be at least a 1
  – Implementation of improvements to ADA systems and procedures to improve certification assessment
  – Review of the DSA certification process and criteria to assess suitability for the Australian research data environment
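The scoring rule stated on this slide (16 requirements, each assessed on a 0-4 scale, all at least 1) can be checked mechanically. The sketch below assumes only that rule as written on the slide; the function name and the example scores are hypothetical, not an actual ADA assessment.

```python
def requirements_below_minimum(scores, minimum=1):
    """Return the requirement numbers whose score is below `minimum`.

    `scores` maps requirement numbers 1..16 to a level on the 0-4 scale.
    An empty result means every requirement meets the minimum.
    """
    if sorted(scores) != list(range(1, 17)):
        raise ValueError("expected scores for requirements 1 to 16")
    for number, level in scores.items():
        if not 0 <= level <= 4:
            raise ValueError(f"requirement {number}: level {level} is off the 0-4 scale")
    return sorted(n for n, level in scores.items() if level < minimum)


# Made-up self-assessment: everything at level 3 except requirement 7.
example = {n: 3 for n in range(1, 17)}
example[7] = 0
print(requirements_below_minimum(example))
```

Returning the list of failing requirements, rather than a single pass/fail flag, shows where improvements to systems and procedures would be needed.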

DSA requirements

• “Fundamental to the following guidelines are five criteria, that together determine whether or not the digital research data may be qualified as sustainably archived:
  – The research data can be found on the Internet.
  – The research data are accessible, while taking into account relevant legislation with regard to personal information and intellectual property of the data.
  – The research data are available in a usable format.
  – The research data are reliable.
  – The research data can be referred to.”
• http://www.datasealofapproval.org/media/filer_public/2013/09/27/dsa-booklet_1_june2010.pdf

The guidelines

• “The associated guidelines relate to the implementation of these criteria and focus on three stakeholders: the data producer, the data repository and the data consumer.
  1. The data producer is responsible for the quality of the digital research data.
  2. The data repository is responsible for the quality of storage and availability of the data and data management.
  3. The data consumer is responsible for the quality of use of the digital research data.”
  – http://www.datasealofapproval.org/media/filer_public/2013/09/27/dsa-booklet_1_june2010.pdf
• Guidelines: https://drive.google.com/file/d/0B4qnUFYMgSc-eDRSTE53bDUwd28/view

Repositories and archives project

• With UNSW Library (Maude Frances)
• Exploring mechanisms for deposit and preservation of data through repository to the data archive
• Questions we are exploring:
  – Where should we deposit the data?
  – Who should store the data?
  – What metadata should we collect?
  – Who should manage the metadata?
  – How to transfer content (data and metadata) between repository and archive?
  – How to determine the “source of truth”? (e.g. who should mint the DOI?)

ADA Dataverse

• Redevelopment of our database and website infrastructure
  – New website
  – New data catalogue
• New functionality:
  – Self-deposit of data
  – Open data access
  – API access (both for deposit and access, e.g. through R)
  – Shibboleth authentication
• Currently in early testing
  – For completion in 2017 (probably Q3)

• Functionality intended to support additional DSA requirements
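The slides do not say which API endpoints the new system will expose. As a rough sketch, assuming it follows the Search API of the open-source Dataverse software (`/api/search` with `q`, `type` and `per_page` parameters), a dataset search request could be built as below. The host name is a placeholder, not an ADA address, and Python is used here although the slide mentions R.

```python
from urllib.parse import urlencode


def search_url(base_url, query, per_page=10):
    """Build a Dataverse Search API URL that looks for datasets.

    `base_url` is the root of a Dataverse installation; the value used
    in the example below is a placeholder.
    """
    params = {"q": query, "type": "dataset", "per_page": per_page}
    return f"{base_url.rstrip('/')}/api/search?{urlencode(params)}"


print(search_url("https://dataverse.example.org", "election study"))
```

The function only builds the URL; sending the request and reading the JSON response is left to whichever HTTP client is preferred.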

Questions?
Steven McEachern
steven.mceachern@anu.edu.au
ada@anu.edu.au
http://ada.edu.au

Data documentation standards

DDI-Codebook

• Two flavours of DDI – Codebook and Lifecycle
• Focus on DDI-C, four sections:
  1. Document description: characteristics of the DDI XML document itself
  2. Study description: characteristics of the Study (project) that the DDI is describing (including Related Materials: documents associated with the project, such as questionnaires, codebooks, etc.)
  3. File description: characteristics of the physical data files
  4. Variable description: characteristics of the variables in the data file
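To make the four sections concrete, the sketch below builds a bare DDI-C skeleton with Python's standard library. It uses the DDI Codebook element names `codeBook`, `docDscr`, `stdyDscr`, `fileDscr` and `dataDscr`, but omits the XML namespace and most elements a valid instance would need; the function name, study title, file name and variable are made up for illustration.

```python
import xml.etree.ElementTree as ET


def _add_title(section, text):
    """Attach citation/titlStmt/titl with the given text to a section."""
    citation = ET.SubElement(section, "citation")
    title_statement = ET.SubElement(citation, "titlStmt")
    ET.SubElement(title_statement, "titl").text = text


def build_codebook(title, file_name, variables):
    """Build a minimal, non-validating DDI-C skeleton.

    `variables` maps variable names to their labels.
    """
    root = ET.Element("codeBook")

    # 1. Document description: about the DDI XML document itself
    _add_title(ET.SubElement(root, "docDscr"), f"DDI document for: {title}")

    # 2. Study description: about the study (project) being described
    _add_title(ET.SubElement(root, "stdyDscr"), title)

    # 3. File description: about the physical data file
    file_text = ET.SubElement(ET.SubElement(root, "fileDscr"), "fileTxt")
    ET.SubElement(file_text, "fileName").text = file_name

    # 4. Variable description: one <var> per variable in the data file
    data_description = ET.SubElement(root, "dataDscr")
    for name, label in variables.items():
        variable = ET.SubElement(data_description, "var", name=name)
        ET.SubElement(variable, "labl").text = label

    return root


codebook = build_codebook(
    "Example study", "example.sav", {"age": "Age of respondent"}
)
print(ET.tostring(codebook, encoding="unicode"))
```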

Dublin Core

• Title
• Creator
• Subject
• Description
• Publisher
• Contributor
• Date
• Type
• Format
• Identifier
• Source
• Language
• Relation
• Coverage
• Rights
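As an illustration of how small the Dublin Core element set is, the sketch below turns a plain dictionary into XML elements in the Dublin Core elements namespace, rejecting any key that is not one of the fifteen elements. The wrapper element name `metadata`, the function name and the example record are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# The fifteen Dublin Core elements, as listed on the slide.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def to_dublin_core(record):
    """Convert a dict of Dublin Core values into an XML element tree."""
    unknown = sorted(set(record) - set(DC_ELEMENTS))
    if unknown:
        raise ValueError(f"not Dublin Core elements: {unknown}")
    root = ET.Element("metadata")  # wrapper name chosen for this sketch
    for name in DC_ELEMENTS:
        if name in record:
            ET.SubElement(root, f"{{{DC_NS}}}{name}").text = record[name]
    return root


# Made-up record for illustration.
record = {"title": "Example survey", "creator": "A. Researcher", "date": "2017"}
print(ET.tostring(to_dublin_core(record), encoding="unicode"))
```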

DCAT (W3C)

The DCAT standard is relatively simple, and includes four basic objects:

• Dataset: “a collection of data, published or curated by a single agent, and available for access or download in one or more formats”
• Data catalog(ue): “a curated collection of metadata about datasets”
• Catalog(ue) record: “a record in a data catalog, describing a single dataset”
• Distribution: “represents a specific available form of a dataset”
• Key object for SRC is the Dataset
  – others are distribution-related
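The relationship between the four DCAT objects can be sketched with plain data classes. This is not an RDF serialisation of DCAT: the class fields are a simplified, assumed subset of the vocabulary, and the catalogue contents and URLs are made up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Distribution:
    """A specific available form of a dataset."""
    media_format: str
    access_url: str


@dataclass
class Dataset:
    """A collection of data, available in one or more formats."""
    title: str
    distributions: List[Distribution] = field(default_factory=list)


@dataclass
class CatalogRecord:
    """A record in a data catalogue, describing a single dataset."""
    dataset: Dataset
    modified: str


@dataclass
class Catalog:
    """A curated collection of metadata about datasets."""
    title: str
    records: List[CatalogRecord] = field(default_factory=list)


survey = Dataset(
    "Example survey",
    [
        Distribution("SPSS", "https://data.example.org/survey.sav"),
        Distribution("CSV", "https://data.example.org/survey.csv"),
    ],
)
catalog = Catalog("Example catalogue", [CatalogRecord(survey, "2017-02-13")])

for entry in catalog.records:
    formats = [d.media_format for d in entry.dataset.distributions]
    print(entry.dataset.title, formats)
```

The Dataset sits at the centre of the structure: the catalogue and its records point to it, and the distributions hang off it.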

ADA systems architecture

Approach

• Core archive website:
  – http://www.ada.edu.au
• Sub-archives focussed on specialised thematic or methodological areas
  – e.g. http://www.ada.edu.au/indigenous/home
• “Add-on” systems for complex analysis or visualisation tasks:
  – Nesstar
  – GIS: http://gis-test.ada.edu.au
  – Longitudinal visualisation: Panemalia
  – Historical census data: http://hccda.ada.edu.au

OAIS architecture

Data sharing policies in Australia

Policy trends in data access

• Mandated rather than recommended data archiving
• Open access to data as the default (NSF, Office of the President, data.gov(.au,.uk))
• Broader range of data types available
• Broader range of users of that data

Policy drivers

• Funders: Return on investment:
  – Government data: Treasury, PM&C
  – Research data: ARC/NHMRC, Horizon 2020
• Journal publishers: Reputation:
  – Open access journals (e.g. PLOS One) and
  – For-profit publishers (e.g. Nature, Science, Elsevier) concerned about loss of credibility from fraudulent research
• Learned societies and disciplines: Good science AND reputation:
  – American Political Science Association: DART initiative
  – American Economic Association:

Government data

• Australia: Australian Government Public Data Policy Statement
  – The Australian Government commits to optimise the use and reuse of public data; to release non-sensitive data as open by default; and to collaborate with the private and research sectors to extend the value of public data for the benefit of the Australian public.
  – Public data includes all data collected by government entities for any purposes including; government administration, research or service delivery.
  – Non-sensitive data is anonymised data that does not identify an individual or breach privacy or security requirements.
  – https://www.dpmc.gov.au/sites/default/files/publications/aust_govt_public_data_policy_statement_1.pdf

Research data

• Australian Code for the Responsible Conduct of Research

• https://www.nhmrc.gov.au/guidelines-publications/r39 (Joint ARC/NHMRC publication)

• Section 2: Management of research data and primary materials

• Then provides related links to ethics statements and similar

ACRCR Section 2: Responsibilities of Institutions

Section 2.1.1: In general, the minimum recommended period for retention of research data is 5 years from the date of publication. However, in any particular case, the period for which data should be retained should be determined by the specific type of research. For example:

• for short-term research projects that are for assessment purposes only, such as research projects completed by students, retaining research data for 12 months after the completion of the project may be sufficient

• for most clinical trials, retaining research data for 15 years or more may be necessary

• for areas such as gene therapy, research data must be retained permanently (e.g. patient records)

• if the work has community or heritage value, research data should be kept permanently at this stage, preferably within a national collection.
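The retention periods quoted above can be written down as a small lookup table. This is a simplification for illustration, not a statement of the Code: the category names and the function are hypothetical, and the Code leaves the actual period to be determined by the specific type of research.

```python
PERMANENT = None  # marker for "retain permanently"

# Minimum retention in years, counted from the reference date noted.
MIN_RETENTION_YEARS = {
    "default": 5,                # from the date of publication
    "student_assessment": 1,     # 12 months after project completion
    "clinical_trial": 15,        # "15 years or more"
    "gene_therapy": PERMANENT,
    "community_or_heritage": PERMANENT,
}


def earliest_disposal_year(category, reference_year):
    """Return the first year in which disposal could be considered.

    Returns None when the data should be retained permanently.
    """
    years = MIN_RETENTION_YEARS[category]
    if years is PERMANENT:
        return None
    return reference_year + years


print(earliest_disposal_year("default", 2017))
print(earliest_disposal_year("gene_therapy", 2017))
```

These are minimums, so a data management plan may well set a longer period than the table gives.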

ARC statement

"Researchers and institutions have an obligation to care for and maintain research data in accordance with the Australian Code for the Responsible Conduct of Research (2007). The ARC considers data management planning an important part of the responsible conduct of research and strongly encourages the depositing of data arising from a Project in an appropriate publicly accessible subject and/or institutional repository"

ANDS suggest three questions

1. Where will your research data be stored at completion of the project?

2. What access will you provide to the data set on completion of the project?

3. How will you enable others to reuse your research data?

Horizon 2020

• http://ec.europa.eu/research/participants/docs/h2020-funding-guide/cross-cutting-issues/open-access-data-management/open-access_en.htm

• (All grants): Develop a data management plan (DMP) within 6 months of commencement of project

• Pilot program (2014-17):
  – Deposit research data described in DMP, preferably in a research data repository
  – As far as possible, projects must then take measures to enable third parties to access, mine, exploit, reproduce and disseminate (free of charge for any user) this research data.
  – Guidelines recommend FAIR principles

FAIR principles

• Findable
• Accessible
• Interoperable
• Reusable

• Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3:160018 doi: 10.1038/sdata.2016.18 (2016).
