State and Local Agency Digital Geospatial Data Preservation: The North Carolina Experience
Steve Morris, NCSU Libraries
Earth Sciences Information Partners (ESIP) Workshop, July 8, 2009
One of eight initial collection building projects in the Library of Congress NDIIPP (National Digital Information Infrastructure and Preservation Program)
Lead organizations: North Carolina State University Libraries and North Carolina Center for Geographic Information & Analysis (NCCGIA)
Focus: State and local government geospatial data in NC
Repository development as catalyst for discussion
Goal: Engage spatial data infrastructure in data archiving
Initial 3 year project extended to Dec. 2009
NC Geospatial Data Archiving Project (NCGDAP)
NCGDAP Data Types – Raster
• Digital orthophotography
• Satellite imagery
Static data
NCGDAP Data Types – Vector Data
• Point, line, and polygon
• Attached attribute data
Often updated
Note: Percentages based on the actual number of respondents to each question
Downtown Raleigh Near State Capitol
2005 Wake County Ortho
Imagery = Durable: static, simple structure, mostly open formats
Vector data = Volatile: frequent update, complex structure, mostly proprietary formats
Downtown Raleigh, NC Near State Capitol
2005 Wake County Ortho
Imagery = Durable: static, simple structure, mostly open formats
Vector data = Volatile: frequent update, complex structure, mostly commercial formats
NCGDAP Data Types – Spatial Databases
• Vector and raster data
• Relationships
• Behaviors
• Annotation
• Data Models
• Dynamic content
– Constantly updated information
– Data versioning
• Digital object complexity
– Spatially-enabled databases
– Complicated, multi-component formats
– Proprietary formats
Geospatial Data: Compelling Issues
Data consists of multi-file, multi-format objects
Ancillary data files can be shared by datasets
Some format conversions involve one-to-many relationships
Compressed archive files are common and behave unpredictably
And all the usual challenges: format validation, validity checking, threat scanning,…
Ingest Challenges: General
Where is the Dataset?
Here’s One!
Files
• Multi-file dataset
• Georeferencing
• Metadata file
• Symbolization file
• Additional documentation
• License
• Disclaimer
• More
Metadata
• FGDC
• Acquisition metadata
• Transfer metadata
• Ingest metadata
• Archive rights
• Archive processes
• Collection metadata
• Series metadata
• Metadata is encoded in a variety of ways
– The FGDC content standard for metadata lacked an encoding standard (it arrived pre-XML); this is addressed in the ISO 19115/19139 North American Profile implementation
– XML (varied schemas), TXT, HTML
• Metadata is missing
– Only about 25% of local agencies use FGDC
• Metadata is wrong
– Metadata is commonly asynchronous with the data
– Inconsistent use of dataset naming, etc., e.g., “Streets” vs. “Wake County Streets”
Ingest Challenges: Metadata
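The dataset naming inconsistency noted above ("Streets" vs. "Wake County Streets") can be illustrated with a minimal sketch. The function name `normalize_title` and the rule of prefixing the producer name are assumptions made for illustration; they are not the project's documented procedure.

```python
def normalize_title(title: str, producer: str) -> str:
    """Return a dataset title qualified by the producer name.

    Hypothetical rule: if the title does not already start with the
    producer name, prefix it so titles stay unique across producers.
    """
    cleaned = " ".join(title.split())  # collapse stray whitespace
    if cleaned.lower().startswith(producer.lower()):
        return cleaned
    return f"{producer} {cleaned}"


if __name__ == "__main__":
    print(normalize_title("Streets", "Wake County"))
    print(normalize_title("Wake County Streets", "Wake County"))
```

Both calls yield the same qualified title, which is the property a repository needs when datasets from many counties share generic names.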
• Existing geospatial metadata often needs:
– Remediation – to fix errors or omissions
– Normalization – to adhere to a standard structure
– Synchronization – so that the data at hand matches the metadata
• If no metadata then:
– Can build minimal metadata using templates and auto-extraction
– Lose key information such as data quality, lineage, data dictionaries
• Automating metadata for repository ingest
– Raster data is easy – large sets of consistently structured files
– Vector data is hard – each dataset is a different story
• Many additional administrative and technical metadata elements not accommodated by FGDC
NCGDAP Metadata Summary
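The idea of building minimal metadata from a template plus auto-extraction can be sketched as follows. This is an illustration under stated assumptions: the field names in `TEMPLATE` are hypothetical and are not FGDC element names, and only properties readable from the file system are filled in. Fields such as data quality and lineage stay empty, which mirrors the loss of key information noted above.

```python
import os
from datetime import datetime, timezone

# Hypothetical template; field names are illustrative, not FGDC elements.
TEMPLATE = {
    "title": None,
    "producer": None,
    "file_name": None,
    "size_bytes": None,
    "last_modified": None,
    # These cannot be recovered from the files themselves:
    "data_quality": None,
    "lineage": None,
    "data_dictionary": None,
}


def build_minimal_metadata(path, producer):
    """Fill the template with values extracted from the file system."""
    info = os.stat(path)
    base = os.path.basename(path)
    record = dict(TEMPLATE)  # copy so the template itself is never modified
    record["title"] = os.path.splitext(base)[0]
    record["producer"] = producer
    record["file_name"] = base
    record["size_bytes"] = info.st_size
    record["last_modified"] = datetime.fromtimestamp(
        info.st_mtime, tz=timezone.utc
    ).isoformat()
    return record
```

Calling `build_minimal_metadata("streets.shp", "Wake County")` on an existing file would return a record whose descriptive fields are populated and whose quality, lineage, and data dictionary fields remain `None`.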
Extended Curation: Feedback and Outreach
Data Receipt
Format Processing
Metadata Processing
Ingest Processes
Content Producers
Industry
Standards Organizations
• Metadata standards and outreach
– Metadata quality, best practices
• Inventories
– Reduce “contact fatigue”, shareable information store
• Content exchange networks
– Leverage more compelling business reasons to put data in motion
– Automate process, add technical & administrative metadata
• Framework data communities
– Snapshot frequency, schemas, format strategies
Spatial Data Infrastructure and Archiving
Geospatial datasets are typically complex, multi-file objects
Data are often accompanied by ancillary data, which must be associated with the data item
Rights information and licenses must be associated with the item
Various implementations in different domains (METS, IMS-CP, XFDU, etc.)
Simpler .zip-based packages also used (MEF, KMZ, etc.)
Content Packaging Issues
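A simple .zip-based package of the kind mentioned above can be sketched with the Python standard library. The manifest layout (`manifest.json` with a rights statement and per-file checksums) is a hypothetical structure chosen for illustration; it is not the MEF or KMZ layout, nor a METS or XFDU profile.

```python
import hashlib
import json
import os
import zipfile


def build_package(package_path, file_paths, rights):
    """Bundle dataset files, ancillary files, and a manifest into one zip.

    Files are stored flat under their base names, so two inputs with the
    same base name would collide; a real packager would need to handle that.
    """
    manifest = {"rights": rights, "files": []}
    with zipfile.ZipFile(package_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in file_paths:
            with open(path, "rb") as handle:
                data = handle.read()
            name = os.path.basename(path)
            zf.writestr(name, data)
            manifest["files"].append(
                {
                    "name": name,
                    "size": len(data),
                    "sha256": hashlib.sha256(data).hexdigest(),
                }
            )
        # The manifest keeps rights information attached to the item.
        zf.writestr("manifest.json", json.dumps(manifest, indent=2))
    return manifest
```

Keeping the rights statement and checksums inside the package means the association between data, ancillary files, and license survives when the package is moved between systems.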
Spatial Database Approaches
Manage database forward over time
Extract data layers to preservable form
Set aside archival snapshot of database
• Partners (NC, KY, UT, Library of Congress, NCSU):
– State geospatial organizations
– State Archives
• State-to-state and geo-to-Archives collaboration
– Organizational and technical diversity across states
• Archives as part of spatial data infrastructure
– Selection and appraisal processes
– Retention schedule development
– Data transfer to archives
– Development of enhanced business cases
GeoMAPP: Geospatial Multistate Archival and Preservation Partnership
NCGDAP Learning Outcomes
Preservation of GIS projects is needed to support re-creation of past work
Preservation of data representations is needed to document decision-making processes
Validation, remediation, and conversion of data and metadata is expensive: push for improvements upstream
Some repositories handle “items”: can result in “atomization” of data
For vendors, frame data preservation as a “customer problem” -- must build the business case
Thank You!
Steve Morris
Head, Digital Library Initiatives
North Carolina State University [email protected]
North Carolina Geospatial Data Archiving Project: http://www.lib.ncsu.edu/ncgdap
GeoMAPP: http://www.geomapp.net
AGRC exports data from SGID and splits out datasets by series. Metadata occasionally incomplete
Local governments supply GIS datasets on CD/DVD to AGRC. Metadata often missing
• All metadata is completed to FGDC standards
• AGRC creates geoPDF files of individual datasets, plus ZIP files of the native format.
• One ZIP file would contain all the pieces belonging to one shapefile or, alternatively, the file would contain a geodatabase.
• Geodatabases would not be just one big database with everything in it (multiple series and years).
• Instead, the native files would be composed of a single downloadable file per series per year.
AGRC copies these files to Archives’ FTP server.
Example FTP Site Structure: ftp.archives-agrc.utah.gov/Archives
Metadata harvested to populate Archive’s Finding Aids
o Biota (Dublin Core Metadata)
o Boundaries (Dublin Core Metadata)
    MunicipalityRecords-Series-26846 (Dublin Core Metadata)
        2000
            o MunicipalBoundaries.zip (FGDC Metadata)
            o MunicipalBoundaries.pdf (FGDC Metadata)
        2001
        2002
        2003
    CountyBoundaries-Series-26845 (Dublin Core Metadata)
        2003
        2004
Draft of Utah’s GIS to Archives Data Flow
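The "single downloadable file per series per year" layout in the draft FTP structure above can be expressed as a small path-building sketch. The function name and its parameters are hypothetical, and treating `Archives` as the root directory is an assumption read from the example URL; the draft itself may differ in detail.

```python
import posixpath


def archive_path(category, series_name, series_number, year, filename,
                 root="Archives"):
    """Build a category/series/year path in the style of the draft layout."""
    series_dir = f"{series_name}-Series-{series_number}"
    return posixpath.join(root, category, series_dir, str(year), filename)


if __name__ == "__main__":
    print(archive_path("Boundaries", "MunicipalityRecords", 26846, 2000,
                       "MunicipalBoundaries.zip"))
    # prints Archives/Boundaries/MunicipalityRecords-Series-26846/2000/MunicipalBoundaries.zip
```

Encoding the series number and year in the path means each level of the tree maps onto a level of archival description (category, series, year, item).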
Database with Dublin Core Descriptive and Administrative Metadata
iRODS
DSpace
Content Files
Distributed Storage Layer
Single item & batch ingest into DSpace by Archivist
Kentucky Metadata Workflow into DSpace and iRODS Environment
UNC, other, KDLA
Batch metadata extraction using iRODS rules
Database with Administrative & Preservation Metadata
Preservation metadata from iRODS rules
Metadata & content entered by agencies using template and modified by Archivist
Source Metadata Translation
• Hub-and-spoke model a la Echo DEPository
– repository agnostic
– modular conversion hub
– facilitate repository software migration & inter-archive exchange
• Lead organizations: North Carolina Center for Geographic Information & Analysis (NCCGIA), State Archives of NC, with Library of Congress
• Partners:
– State geospatial organizations of Kentucky and Utah
– State Archives of Kentucky and Utah
– NCSU Libraries in catalytic/advisory role
• State-to-state and geo-to-Archives collaboration
• 2 year project: Nov. 2007-Dec. 2009
• Archives as part of Spatial Data Infrastructure
GeoMAPP: Geospatial Multistate Archival and Preservation Partnership
Introduce GIS organizations and State Archives to each other
Archival selection and appraisal processes
Retention schedule development
Data transfer to archives
Development of enhanced business case
GeoMAPP: Project Components
• Repository Goal
– Capture at-risk data
– Explore technical and organizational challenges
• Project End Goal
– Data Producers: Improved temporal data management practices
– Archives: More efficient means of acquiring and preserving data; progress towards best practices
NC Geospatial Data Archiving Project (NCGDAP)
• Temporal data management vs. long-term preservation
• Data capture
– Backups are common, but not long-term archives
– Producer focus on current data
– Shift to web services-based access
• Inadequate or non-existent metadata
– Consistent NC survey statistics: Only 40% of data producers create and maintain metadata
– Existing metadata often needs to be normalized, synchronized with the data, and remediated
Geospatial Data Preservation Challenges
Loss of memory about the data is also a problem
When to automate and when not to
– Learn first from human intervention
– Minimizing risk of error related to human intervention
Accepting that ingest packages used will evolve over time (implications for archive?)
Handling post-ingest migrations
Ongoing Challenges
Challenge: Preservation Metadata
[Chart: Metadata Archived? Y-axis: % of Respondents (0.0% to 70.0%). Categories: FGDC format; Locally defined metadata; NC OneMap metadata starter block; None.]
Results from a 2006 survey of all 100 NC counties and 25 largest NC municipalities
• Capture “transfer set” metadata
• Normalize, synchronize, and remediate existing metadata, and retain original metadata record
• Treat contact information as archival
• Update metadata with format conversions
• Use ESRI Profile of FGDC
– Added technical and administrative elements
– Has an XML schema
– ArcCatalog tool support
• Use simple rights encoding scheme
• Record metadata in a workflow management database
Some Key Metadata Decisions
NCSU Libraries, 27 March 2006, Digital Preservation in State Government - Wilmington
SIP Item Creation: Workflow
• Submission Information Package grouping
– Ontology logic based on defined multi-file complex format components and directory structure
• Repository-agnostic item grouping
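Grouping files into items by format components and directory structure can be sketched as below. This is a simplified illustration, not the project's ontology logic: it knows only one multi-file format (the ESRI shapefile), the extension list is an assumed subset of possible components, and the function name `group_items` is hypothetical.

```python
import posixpath
from collections import defaultdict

# Assumed component extensions for one multi-file format (ESRI shapefile).
SHAPEFILE_EXTS = {".shp", ".shx", ".dbf", ".prj", ".sbn", ".sbx", ".cpg"}


def group_items(paths):
    """Group file paths into items keyed by (directory, name, kind)."""
    groups = defaultdict(list)
    for path in paths:
        directory, name = posixpath.split(path)
        stem, ext = posixpath.splitext(name)
        ext = ext.lower()
        if ext == ".xml" and stem.lower().endswith(".shp"):
            # Metadata sidecar such as streets.shp.xml belongs to streets.
            groups[(directory, stem[:-4], "shapefile")].append(path)
        elif ext in SHAPEFILE_EXTS:
            groups[(directory, stem, "shapefile")].append(path)
        else:
            # Anything unrecognized is treated as a stand-alone item.
            groups[(directory, name, "file")].append(path)
    return dict(groups)


if __name__ == "__main__":
    example = [
        "wake/streets.shp",
        "wake/streets.shx",
        "wake/streets.dbf",
        "wake/streets.shp.xml",
        "wake/readme.txt",
        "durham/streets.shp",
    ]
    for key, members in group_items(example).items():
        print(key, members)
```

Because the directory is part of the key, identically named datasets from different producers stay separate, and the result is a plain mapping that any repository ingest tool could consume.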
• Federal Geographic Data Committee (FGDC) Content Standard for Digital Geospatial Metadata (CSDGM)
– Version one (1994) mandated for use by federal agencies
– Descriptive metadata, plus some administrative and technical
– Extensive use at state level, spotty use at local level
– Problem: content standard without an encoding spec
– FGDC profiles: ESRI, NBII, Remote Sensing, etc.
• ISO Standards
– ISO 19115: Geographic Information – Metadata (2003)
– ISO 19139: Geographic Information – Metadata – XML Schema Implementation (2007)
– North American Profile of ISO to replace FGDC CSDGM
Metadata Overview