National Science Foundation Cooperative Agreement: OCI-0940841
Reagan Moore, PI
Mary Whitton, Project Manager
Policy Topics
• Policy-based Data Management
• Practical Policy Working Group outcomes
  – Data Center policies
• Applications
  – DataNet Federation Consortium analyzed 175 policies for
    • Data sharing (research collaborations)
    • SILS Digital library (personal collections)
    • RDA Practical Policy (data centers)
    • UNC-CH Protected data (secure medical workspace)
    • Odum/Dataverse (archive)
    • NSF data management plans (publication)
  – Science Observatory Network (real-time sensor data)
  – PECE/RPI (anthropology)
  – NOAA NCDC (archive)
Policy-based Data Management
Summary of the Problem
Practical Policy
Assertion or assurance that is enforced about a (data) collection (data set, digital object, file) by the creators of the collection
Computer actionable policies are used to:
– enforce data management
– automate administrative tasks
– validate compliance with assessment criteria
– automate scientific data processing and analyses
Users are motivated by issues of scale and distribution
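One way to picture a computer actionable policy is as a rule that fires on an event, checks a condition, and applies an action. The sketch below is illustrative Python only, not the rule language of any data grid; the `Policy` class, the `enforce` function, the event name, and the size threshold are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Policy:
    """Hypothetical policy: when `event` occurs and `condition` holds, run `action`."""
    name: str
    event: str
    condition: Callable[[dict], bool]
    action: Callable[[dict], None]


def enforce(policies: list[Policy], event: str, obj: dict) -> list[str]:
    """Apply every policy registered for `event`; return the names of those applied."""
    applied = []
    for policy in policies:
        if policy.event == event and policy.condition(obj):
            policy.action(obj)
            applied.append(policy.name)
    return applied


def _tag_large(obj: dict) -> None:
    # Action: record a tag on the object's metadata.
    obj["tags"] = obj.get("tags", []) + ["large"]


# Assumed example policy: tag files larger than 1 GB when they are ingested.
tag_large_files = Policy(
    name="tag-large-files",
    event="ingest",
    condition=lambda obj: obj["size_bytes"] > 1_000_000_000,
    action=_tag_large,
)

record = {"path": "/zone/home/alice/run42.dat", "size_bytes": 2_500_000_000}
print(enforce([tag_large_files], "ingest", record))  # ['tag-large-files']
print(record["tags"])  # ['large']
```

Keeping the condition and the action as separate callables means the same enforcement loop can serve administrative, compliance, and processing policies alike.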
Practical Policy Working Group
• Practical Policy members represented
  – 11 types of data management systems
  – 30 institutions
  – 2 testbeds
    • iRODS: Renaissance Computing Institute, DataNet Federation Consortium (DFC)
    • GPFS: Institute of Physics of the Academy of Sciences, CESNET; Garching Computing Centre (RZG)
• Published two documents
  – Moore, R., R. Stotzka, C. Cacciari, P. Benedikt, “Practical Policy Templates”, February 2015, http://dx.doi.org/10.15497/83E1B3F9-7E17-484A-A466-B3E5775121CC.
  – Moore, R., R. Stotzka, C. Cacciari, P. Benedikt, “Practical Policy Implementations”, February 2015, http://dx.doi.org/10.15497/83E1B3F9-7E17-484A-A466-B3E5775121CC.
Policy Templates
INLS 624
Data Center Policies
• Contextual metadata extraction
  – Automate extraction of metadata from files
• Data access control
  – Automate application of appropriate access controls
• Data backup
  – Automate creation of replicas
• Data format control
  – Automate identification of data format
• Data retention
  – Apply a retention period
• Disposition
  – Apply a disposition policy at end of retention period
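The retention and disposition policies above can be sketched as a date comparison. This is a minimal illustration, assuming the retention period is counted in days from the ingestion date; the function names and the "delete" default are hypothetical.

```python
from datetime import date, timedelta


def retention_expiry(ingested: date, retention_days: int) -> date:
    """Date on which the retention period ends."""
    return ingested + timedelta(days=retention_days)


def disposition(ingested: date, retention_days: int, today: date,
                action: str = "delete") -> str:
    """Return "retain" while the retention period is running,
    otherwise the configured disposition action."""
    if today < retention_expiry(ingested, retention_days):
        return "retain"
    return action


print(disposition(date(2015, 1, 1), 365, date(2015, 6, 1)))             # retain
print(disposition(date(2015, 1, 1), 365, date(2016, 1, 1), "archive"))  # archive
```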
Data Center Policies
• Integrity (including replication)
  – Verify integrity and replace bad copies
• Notification
  – Manage events about changes to the collection
• Restricted searching
  – Manage searches on collection
• Storage cost reports
  – Generate cost report
• Use agreements
  – Manage use agreements before data are retrieved
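The integrity policy ("verify integrity and replace bad copies") might look like the following sketch. It assumes replicas are plain files on local paths and that the expected SHA-256 checksum was recorded at ingestion; a data grid would use its own replica catalog and checksum mechanism, and the function names here are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum a file in 1 MiB chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_and_repair(replicas: list[Path], expected: str) -> list[Path]:
    """Overwrite every replica whose checksum differs from `expected`
    with a copy of a good replica; return the repaired paths."""
    good = [p for p in replicas if p.exists() and sha256_of(p) == expected]
    if not good:
        raise RuntimeError("no valid replica left to repair from")
    repaired = []
    for replica in replicas:
        if replica not in good:
            shutil.copyfile(good[0], replica)
            repaired.append(replica)
    return repaired
```

Running this periodically over each object's replica list covers both detection and repair; raising when no good copy remains keeps the failure visible instead of silently propagating a bad file.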
Digital Library Management
LifeTime Library Policies
• Requirements
  – Enable students to create a personal digital collection
  – Provide pedagogy mechanisms for experimenting with:
    • Naming - File names
    • Arrangement - Organization in collections
    • Description - Tags and metadata
    • Access controls - Sharing and publication
    • Ingestion - Controlled loading of data
    • Distribution - Storage locations
Student Experiences
• Students invariably:
  – Changed their minds about the purpose of the collection
  – Changed their minds about the description
    • Term definitions tended to drift over the semester
  – Changed their minds about the arrangement
    • Added new collections for additional types of data
• Resulting collections had:
  – 1,000 to 10,000 files
  – 2 to 150 gigabytes in size
  – 4 to 10 metadata attributes per file
Protected Data
Protected Data Management
• UNC-CH has published an administrator’s guide for the management of protected data. This includes:
  – PII: Personally Identifiable Information
  – PHI: Protected Health Information
  – PCI: Payment Card Industry information
• The question is whether each of the tasks specified in the guide can be automated as policies enforced by the data grid.
• See Chapter 6 of the Policy Examples Workbook
  – This specifies 51 tasks that should be managed by the administrator
Protected Data Tasks
1. Check for presence of PII on ingestion
2. Check for viruses on ingestion
3. Check passwords for required attributes
4. Encrypt data on ingestion
5. Encrypt data transfers
6. Federation - control data copies (access control)
7. Federation - manage remote data grid interactions (update rule base)
8. Federation - periodically copy data
9. Federation - manage data retrieval (update access controls)
10. Generate checksum on ingestion
11. Generate report of corrections to data sets or access controls
12. Generate report for cost (time) required to audit events
13. Generate report of types of protected assets present within a collection
14. Generate report of all security and corruption events
15. Generate report of the policies that are applied to the collections
16. List all storage systems being used
17. List persons who can access a collection
Protected Data Tasks
18. List staff by position and required training courses
19. List versions of technology that are being used
20. Maintain document on independent assessment of software
21. Maintain log of all software changes, OS upgrades
22. Maintain log of disclosures
23. Maintain password history on user name
24. Parse event trail for all accessed systems
25. Parse event trail for all persons accessing collection
26. Parse event trail for all unsuccessful attempts to access data
27. Parse event trail for changes to policies
28. Parse event trail for inactivity
29. Parse event trail for updates to rule bases
30. Parse event trail to correlate data accesses with client actions
31. Provide test environment to verify policies on new systems
32. Provide test system for evaluating a recovery procedure
33. Provide training courses for users
34. Replicate data sets on ingestion
Protected Data Tasks
35. Replicate iCAT periodically
36. Set access approval flag
37. Set access controls
38. Set access restriction until approval flag is set
39. Set approval flag per collection for enabling bulk download
40. Set asset protection classifier for data sets based on type of PII
41. Set flag for whether tickets can be used on files in a collection
42. Set lockout flag and period on user name - counting number of tries
43. Set password update flag on user name
44. Set retention period for data reviews
45. Set retention period on ingestion
46. Track systems by type (server, laptop, router, …)
47. Verify approval flags within a collection
48. Verify files have not been corrupted
49. Verify presence of required replicas
50. Verify that no controlled data collections have public or anonymous access
51. Verify that protected assets have been encrypted
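Tasks 1 and 40 (check for PII on ingestion, then classify the data set by type of PII) could be approximated by pattern matching on text content. The sketch below is a toy illustration: the two regular expressions and the labels are assumptions made for this example, and real PII detection needs far broader coverage than two patterns.

```python
import re

# Illustrative patterns only (assumed for this sketch).
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def classify_pii(text: str) -> set[str]:
    """Return the labels of all PII patterns found in `text`."""
    return {label for label, pattern in PII_PATTERNS.items()
            if pattern.search(text)}


print(classify_pii("SSN 123-45-6789"))      # {'ssn'}
print(classify_pii("no identifiers here"))  # set()
```

The returned set of labels can be stored as the asset protection classifier on the data set, so later policies (encryption, access restriction) can key off it.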
Task Automation
• There are some unifying requirements across tasks:
  – Checking material for PII, viruses
  – Management of passwords
  – Generation of log files for all actions done
  – Creation of state information to track processes
  – Management of encryption
  – Management of access controls
  – Generation of audit trails
  – Parsing of events to demonstrate compliance over time
  – Verification that processes were correctly applied
• Many of these requirements can also be applied to digital libraries and research collaborations
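Parsing an event trail and creating state information from it can be sketched as follows. The event format (a list of dictionaries with `user`, `action`, and `success` keys) is an assumption for this example, not the audit format of any particular system. The count is also simplified: it does not reset after a successful access or expire after a lockout period.

```python
from collections import Counter


def failed_attempts(events: list[dict]) -> Counter:
    """Count unsuccessful access attempts per user in an event trail."""
    return Counter(
        e["user"] for e in events
        if e["action"] == "access" and not e["success"]
    )


def users_to_lock(events: list[dict], max_tries: int = 3) -> set[str]:
    """Users whose failed-attempt count has reached `max_tries`."""
    return {user for user, n in failed_attempts(events).items()
            if n >= max_tries}


trail = [
    {"user": "alice", "action": "access", "success": False},
    {"user": "alice", "action": "access", "success": False},
    {"user": "bob", "action": "access", "success": True},
    {"user": "alice", "action": "access", "success": False},
]
print(users_to_lock(trail))  # {'alice'}
```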
Preservation
Cross-Disciplinary Data Discovery and Geographically Distributed Preservation
(DFC April 2013 NSF Review)
Archive Policies
• The Dataverse network has about 800 gigabytes of data that may contain protected information.
• An archive is needed with independent management of the material to ensure recovery in the case of a disaster.
  – Digital objects and provenance metadata must be re-loadable into Dataverse.
  – Assessment criteria need to be evaluated to verify integrity.
  – Access controls must be enforced on restricted data.
  – Dataverse naming convention must be retained.
• Approach is to replicate the data holdings into an iRODS data grid.
Policies
• See Chapter 5 of the Policy Examples Workbook
  – Odum preservation policies
• Preservation tasks include:
  – Staging files between Dataverse and iRODS
  – Checking data for presence of protected information
  – Periodic verification of integrity and replicas
  – Verification of access controls
  – Reports on usage statistics
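Verification of access controls can be illustrated as a scan over access control lists. The sketch assumes ACLs are available as a mapping from collection path to the set of principals granted access, and that open access is represented by principals named "public" or "anonymous"; both are assumptions of this example, and the paths are hypothetical.

```python
OPEN_PRINCIPALS = {"public", "anonymous"}


def openly_accessible(acls: dict[str, set[str]],
                      controlled: set[str]) -> list[str]:
    """Return the controlled collections whose ACL grants access
    to a public or anonymous principal."""
    return sorted(c for c in controlled
                  if acls.get(c, set()) & OPEN_PRINCIPALS)


acls = {
    "/archive/restricted": {"curator", "anonymous"},
    "/archive/embargoed": {"curator"},
    "/archive/open": {"public"},
}
controlled = {"/archive/restricted", "/archive/embargoed"}
print(openly_accessible(acls, controlled))  # ['/archive/restricted']
```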
NSF Data Management Plans
NSF Data Management Plans
• The National Science Foundation has mandated that every project provide a 2-page description of how data will be managed.
• Each NSF directorate published guidelines on what the data management should include.
• An analysis of 12 sets of requirements identified 38 data management tasks that could be automated.
• See Chapter 7 of the Policy Templates Workbook
NSF DMP Requirements
Science Observatory Network
Real-Time Sensor Data
• Harvest sensor data from the Antelope Real Time Sensor orb.
  – Manages environmental, oceanic, seismic data
  – More than 3,000 sensors across the US
• Register each sensor as an independent collection
  – Retrieve the most recent sensor data
  – Harvest sensor data periodically
  – Transform to JSON, netCDF
  – Provide access to archived data
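The "retrieve the most recent sensor data" and "transform to JSON" steps can be sketched with the standard library. This does not use the Antelope API; it assumes readings have already been harvested as `(epoch_seconds, value)` tuples, and the sensor identifier and field names are hypothetical.

```python
import json
from datetime import datetime, timezone


def latest_reading(readings: list[tuple[float, float]]) -> tuple[float, float]:
    """Return the reading with the largest timestamp."""
    return max(readings, key=lambda r: r[0])


def to_json(sensor_id: str, readings: list[tuple[float, float]]) -> str:
    """Serialize readings, oldest first, with ISO 8601 UTC timestamps."""
    records = [
        {"time": datetime.fromtimestamp(t, tz=timezone.utc).isoformat(),
         "value": v}
        for t, v in sorted(readings)
    ]
    return json.dumps({"sensor": sensor_id, "readings": records})


samples = [(60.0, 2.5), (0.0, 1.0)]
print(latest_reading(samples))  # (60.0, 2.5)
print(to_json("S1", samples))
```

Registering each sensor as its own collection maps naturally onto one JSON document per sensor, which is why the sensor identifier sits at the top level here.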
PECE / RPI
Collection Management Policies
• Contextual metadata extraction
• Data access control
• Data backup
• Data format control
• Data retention
• Disposition
• Integrity (including replication)
• Notification
• Restricted searching
• Storage cost reports
• Use agreements
NOAA NCDC
NOAA Climatic Data Center
• Manages an archive of climate data records received from multiple sources
  – Uses a staging area to
    • Check input data for viruses
    • Manage ingestion into a tape archive
• Challenges
  – Needed a way to improve security
    • Eliminate direct access to storage within the NOAA firewall
  – Needed a way to automate management of each file
    • Verify archival storage before file is deleted
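The last requirement, verifying archival storage before a staged file is deleted, can be sketched as copy, compare checksums, then delete. The example assumes both the staging area and the archive are reachable as local directories, which stands in for the real tape archive; the function names are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def _sha256(path: Path) -> str:
    # Reads the whole file at once; adequate for a sketch, not for huge files.
    return hashlib.sha256(path.read_bytes()).hexdigest()


def archive_then_delete(staged: Path, archive_dir: Path) -> Path:
    """Copy `staged` into `archive_dir`, verify the copy by checksum,
    and only then delete the staged file. Returns the archived path."""
    archived = archive_dir / staged.name
    shutil.copyfile(staged, archived)
    if _sha256(archived) != _sha256(staged):
        archived.unlink()  # discard the bad copy, keep the staged original
        raise OSError(f"archive verification failed for {staged}")
    staged.unlink()
    return archived
```

Ordering matters: the staged copy is removed only after the archived copy has been read back and matched, so a failed write never leaves the file with no good copy.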
[Figure: iRODS Secure Ingest. Diagram labels: External Providers; FTP/FTPS; NCDC External Firewall; FTP Load Balance; ftp1, ftp2, ftp3, ftp4, ftp5; DMZ Landing Zone: Open for data delivery; iRODS DMZ Grid (/DMZ/Archive, /NR2/NR3); DMZ Firewall; NCDC Internal Network; FTP PUSH/PULL; ftp; iRODS NCDC Grid (/NCDC, /NR2/Ingest, /NR3/NR2, /Archive, /NR3); ingest1, ingest2; Disk Cache; Tape; HDSS]
iRODS is:
• Secure authentication
• Security via Obscurity (one to bind them)
• Uses a pull mechanism to move data into NCDC grid
• A virtual management tool (clean-up)
• Scope is entire grid
www.datafed.org
www.irods.org
Policy Examples Workbook
Policy Templates Workbook