Long-term preservation of digital geospatial data: challenges for ensuring
access and encouraging reuse
Anne Robertson, EDINA & Steve Morris, NCSU Libraries
EDINA National Data Centre, University of Edinburgh
North Carolina State University Libraries, NCGDAP
Architecture Working Group, OGC TC/PC Meeting
Bonn, 9th November 2005
Objectives
Why we’re here…
• Introduce preservation and access use cases to OGC
• Find points of intersection with OGC initiatives
• Flesh out research agenda for preservation of
geospatial digital data
• “Permanent access and reuse” not just
preservation
North Carolina Preservation Partners
• North Carolina State University Libraries
– University-wide GIS services since 1992
– New focus on publishing WMS services for use by external clients or service aggregators
– Archiving local agency geospatial data since 2000
• NC Center for Geographic Information & Analysis
– State government GIS agency
– Maintains state’s Corporate Geographic Database
– Coordinates many SDI initiatives, including NC OneMap
• NC OneMap
– Seamless access to local, state, and federal data; component part of National Map
– WMS services available individually from sources or through aggregator viewer
– Focus on standards, best practices, data sharing agreements, inventories, and metadata outreach
NC Geospatial Data Archiving Project
• Cooperative project with Library of Congress under the National Digital Information Infrastructure and Preservation Program (NDIIPP)
– One of 8 NDIIPP partnership projects, others focusing on web pages, numeric data, video, business records, etc.
– Focus on developing a network of partners, identifying preservation issues in various domain areas
• NCGDAP: 3 year project focused on preservation of state and local agency digital geospatial data
– Identify and acquire data
– Develop digital repository; ingest and manage content
• Objective: engage existing spatial data infrastructures in process of data preservation
NCGDAP Project Phases
• Content Identification and Selection
– Work from existing inventory processes
– Select from among “early”, “middle”, and “late” stage information products
• Content Acquisition
– Acquire state and local agency content
– Investigate methods of automating archive development
• Partnership Building
– Work within NC OneMap framework (infrastructure)
– Several other emerging geo-preservation projects
• Content Retention and Transfer
– Metadata and ingest workflow
– Emphasis on repository-agnostic approach, avoid “imprinting” one environment
– Initially using DSpace open source software, re-ingest into a different environment later
Common Themes – Cartographic Representation
• The counterpart to the map is not just the dataset but also models, symbology, interpretation. These key elements give real meaning – how are these captured for reuse?
Common Themes – GML for archiving?
• Interest in alternative to proprietary vector file formats
• “Permanent access” requirements:
– profiles and application schemas widely understood and supported, avoid requiring “digital archaeology”
– Role of GML Simple Features Specification?
• Assessing formats for preservation: sustainability factors, quality & functionality factors
• Planned environmental scan of existing GML profiles and application schemas
– Collaboration with National Archives and Records Administration and FGDC Historical Data Working Group
– Vendor support? Official status? Stability over time?
• How to handle proprietary formats?
– UC Santa Barbara/Stanford NDIIPP project working on format registry
– Spatial databases pose special challenges
Common Themes – Content replication
• Need efficient means to replicate content to archive
– North Carolina: 100 counties and 140 municipalities
• Content replication also needed for:
– Disaster preparedness
– State and federal data improvement projects
– Aggregation by regional geospatial web service providers
• WFS, e.g.: efficiency in complete content transfer?
• Rsync-like function, plus: rights management, inventory processes, metadata management, informed by data update cycles
• Archiving delta files vs. complete replication – need to avoid requiring “digital archaeology” in the future
• Other models: LOCKSS (Lots of Copies Keeps Stuff Safe)
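The "rsync-like function" idea can be illustrated with a checksum manifest comparison. This is only a minimal sketch, not the NCGDAP workflow: it assumes both the source holdings and the archive copy are plain directory trees, and the function names `build_manifest` and `diff_manifests` are hypothetical. Rights management, inventory and metadata handling are out of scope here.

```python
import hashlib
from pathlib import Path


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def diff_manifests(source: dict[str, str], archive: dict[str, str]) -> dict[str, list[str]]:
    """Classify files as new, changed, or removed relative to the archive copy."""
    return {
        # Present at the source but never archived.
        "new": sorted(k for k in source if k not in archive),
        # Present in both places but with different content.
        "changed": sorted(k for k in source if k in archive and source[k] != archive[k]),
        # Archived earlier but no longer published at the source.
        "removed": sorted(k for k in archive if k not in source),
    }
```

Only the "new" and "changed" entries would need transferring. Whether the archive then stores just those deltas or a complete snapshot is the trade-off raised in the slide.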
Common Themes – Time versioning
• How to manage datasets that change over time?
– Versions will live in different repositories, must handle relationships outside of the individual repository
• Industry focus on most current data … but increased demand for temporal data
– e.g., land use change detection, business trends analysis
– Much older data lost -- “Digital dark age”
• Draft NCGDAP approach: manage information for “serial objects” separately, link to serial entity via persistent identifier (Handle)
– Support “get current data/metadata/DRM” operations
– Avoid managing volatile information (e.g., service connections) in individual static metadata records
– Other technologies: OpenURL for service connections?
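One way to picture the "serial object" idea is a small record, kept outside any single repository, that links a persistent identifier to the dated versions of a dataset. The sketch below is an assumption-laden illustration rather than the NCGDAP design: the class names, fields and example identifiers are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Version:
    issued: date      # date this snapshot of the dataset was issued
    repository: str   # name of the repository holding the snapshot
    location: str     # identifier of the snapshot within that repository


@dataclass
class SerialEntity:
    identifier: str   # persistent identifier of the series (hypothetical value)
    title: str
    versions: list[Version] = field(default_factory=list)

    def add_version(self, version: Version) -> None:
        self.versions.append(version)

    def current(self) -> Version:
        """Return the most recently issued version ("get current data")."""
        if not self.versions:
            raise LookupError(f"no versions registered for {self.identifier}")
        return max(self.versions, key=lambda v: v.issued)

    def as_of(self, when: date) -> Optional[Version]:
        """Return the latest version issued on or before the given date."""
        candidates = [v for v in self.versions if v.issued <= when]
        return max(candidates, key=lambda v: v.issued) if candidates else None
```

Because the series record sits apart from the static metadata of each snapshot, a new version can be registered without editing any archived record.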
EDINA
• A National Data Centre for Tertiary Education since 1995
– based at the University of Edinburgh Data Library
• Our mission... to enhance the productivity of research, learning and teaching in UK higher and further education
• GeoServices team - provide SDI components to UK academic sector
• Substantial experience in handling and delivering key geospatial data and geo-referenced information
• OGC members since 1999
• Strategic move toward interoperability & shared services role
– use of OGC interface specifications in our projects and services
GRADE project introduction
According to the OECD Follow up Group on Issues of Access to Publicly Funded Research Data [1] …
“More widespread and efficient access to and sharing of research data will have substantial benefits for most areas of scientific research.”
Evidence of re-use of data within UK data centres is low:
– “Level of re-use of data held in the AHDS and ESRC archives has been disappointingly low” (Alison Allden, 2003)
– “NERC spends about £5 million per annum on data management, but unclear what benefit it derives from this. More research is needed to establish benefits and value of data re-use” (Mark Thorley, 2003)
– Qualidata survey of qualitative data re-use (2000). 44% of respondents used a colleague's data rather than acquiring archived data via a dissemination service (33%)
[1] Interim Report, 20 October 2002
GRADE project introduction
• Within UK academia there is a focus on the potential use of digital repositories to assist with a variety of facets of digital asset management including encouraging reuse of research data
• GRADE will investigate and report on the technical and cultural issues around the reuse of geospatial data within the context of discipline-based repositories
• Particular focus on sharing and reuse of derived geospatial data
• EDINA leading GRADE with consortium partners:
– AHRC Research Centre for Studies in Intellectual Property and Technology Law, School of Law, Edinburgh University
– National Oceanography Centre, Southampton University
– Variety of other associate partners including NCGDAP, British Atmospheric Data Centre, Ordnance Survey
Common Themes – Digital Rights
• UK environment, a complex one
– dominant provider of base vector geospatial data
– array of spaceborne survey data available, much free for non-commercial use
– Stakeholder interest from research funders (research councils) and research hosts (institutions)
• When we consider the reuse of derived geospatial data, concerns over data ownership, IPR and copyright often suppress any initial enthusiasm
• We can offer the geoDRM discussion real scenarios of
– IPR issues for derived geospatial data and
– Geospatial data reuse/sharing use cases
Derived Data Example
[Figure: derived data workflow from input through processing to output]
• Input: OS Landline, historic OS maps, 2001 orthophotos, ground survey
• Processing: scan, geo-reference, accuracy assessment, planimetric correction, GPS survey, digitise coastline positions, calculation of cliff retreat
• Output: ESRI Shapefile and tables of retreat
Source: Use case provision of derived geospatial data as part of the GRADE project in scoping digital repositories (draft report)
Common Themes – Content Packaging
• Consider a geospatial data asset deposited into a repository; it’s more than one file:
– GML and associated schema!
– proprietary vector format plus cartographic representation detail
– geodatabase
– raster with header file
– Data set metadata and IPR info
• What is best method to package data?
• In the eLibrary world, the Metadata Encoding and Transmission Standard (METS), IMS content package (IMS CP) and MPEG-21 DIDL are used for repository objects
• “Interoperable repositories need to encode, exchange and describe complex objects in agreed ways”
• What direction is the GI industry taking with content packaging?
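To make the packaging question concrete, the sketch below builds a simple inventory for a multi-file asset, recording the role, size and checksum of each component. It is deliberately not METS, IMS CP or MPEG-21 DIDL: the JSON layout, the role names and the function `build_package_manifest` are all assumptions chosen for illustration, standing in for the file inventory those standards provide.

```python
import hashlib
import json


def build_package_manifest(asset_id: str, components: dict[str, tuple[str, bytes]]) -> str:
    """Describe a multi-file asset as JSON.

    components maps a role (e.g. "data", "schema", "metadata") to a
    (filename, content) pair. Roles and layout are illustrative only.
    """
    entries = []
    for role, (filename, content) in sorted(components.items()):
        entries.append(
            {
                "role": role,
                "filename": filename,
                "size": len(content),
                # Fixity information so the package can be verified after transfer.
                "sha256": hashlib.sha256(content).hexdigest(),
            }
        )
    return json.dumps({"asset": asset_id, "files": entries}, indent=2)
```

A real package would also need structural relationships between the files (for example, which schema validates which GML document), which is where a standard such as METS goes beyond a flat inventory.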
Common Themes – Persistent Identifiers
• Once a geospatial data asset is deposited within a repository, there is a need to be able to persistently identify this asset
• Particular repository software packages use particular schemes, e.g. Fedora uses the ‘info’ URI scheme
• Requirement to ensure identifier is actionable
• We are thinking about OpenURL Resolvers and perhaps Digital Object Identifier (DOI) for handle schemes
• What direction is GI industry taking with persistent identifiers?
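An "actionable" identifier is one that can be turned into something a client can dereference. As a rough sketch, the function below maps Handle and DOI identifiers onto the public proxy resolvers at hdl.handle.net and doi.org. The `hdl:` and `doi:` prefixes, the function name `to_actionable_url` and the example identifiers are assumptions made for illustration; schemes without a known resolver simply return `None`.

```python
from typing import Optional
from urllib.parse import quote

# Public HTTP proxies for the Handle System and for DOIs.
RESOLVERS = {
    "hdl": "https://hdl.handle.net/",
    "doi": "https://doi.org/",
}


def to_actionable_url(identifier: str) -> Optional[str]:
    """Return a resolvable URL for a 'scheme:value' identifier, or None."""
    scheme, sep, value = identifier.partition(":")
    if not sep or not value:
        return None
    base = RESOLVERS.get(scheme.lower())
    if base is None:
        # No resolver known for this scheme (e.g. an 'info' URI).
        return None
    # Keep "/" so the prefix/suffix structure of the identifier survives.
    return base + quote(value, safe="/")
```

For example, `to_actionable_url("hdl:1234/example")` yields a proxy URL for a made-up handle, while an identifier such as `info:fedora/demo:1` returns `None` because no resolver is registered for that scheme in this sketch.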
Common Themes – ‘data plus services’ model
National Library of New Zealand
http://wiki.tertiary.govt.nz/static/wikifarm/InstitutionalRepositories.uploads/Main/IR_report.pdf
Conclusions
• Aim is to flesh out research agenda
• Presented 7 common themes from our work
• Shift to web services consumption poses threat to secondary archive development … but can geospatial web services be put to use in preservation processes?
• Encourage GI community to connect with these issues or outcome may be that archive community will fail to take account of OGC work
• Where to from here?
Contact details
Anne Robertson
GRADE Project Manager
EDINA National Data Centre
[email protected]
Web site: http://edina.ac.uk/projects/grade
Steve Morris
Head of Digital Library Initiatives
North Carolina State University Libraries
[email protected]
NCGDAP web site: http://www.lib.ncsu.edu/ncgdap/
Questions?