Preservation of Digital Geospatial Data: Challenges and Opportunities
Steve Morris
Head of Digital Library Initiatives
North Carolina State University Libraries
NARA Meeting Dec. 14, 2005
Note: Percentages based on the actual number of respondents to each question 2
Outline
Digital Geospatial Data: Types
Risks to Digital Geospatial Data
Overview of NC Geospatial Data Archiving Project
Preservation Challenges and Possible Solutions
Geospatial data types: Vector data
Geospatial data types: Satellite imagery
Geospatial data types: Aerial imagery
Geospatial data types: Aerial imagery
Geospatial data types: Aerial imagery
Geospatial data types: Tabular data (w/vector)
Time series – vector data
Parcel Boundary Changes 2001-2004, North Raleigh, NC
Time series – Ortho imagery
Vicinity of Raleigh-Durham International Airport 1993-2002
Today’s geospatial data as tomorrow’s cultural heritage
Risks to Digital Geospatial Data
.shp
.mif
.gml
.e00
.dwg
.dgn
.bsb
.bil
.sid
Risks to Digital Geospatial Data
Producer focus on current data
  Time-versioned content generally not archived
Future support of data formats in question
  Vast range of data formats in use, many of them complex
Shift to “streaming data” for access
  Archives have been a by-product of providing access
Preservation metadata requirements
  Descriptive, administrative, technical, DRM
Geodatabases
  Complex functionality
NC Geospatial Data Archiving Project
Partnership between university library (NCSU) and state agency (NCCGIA)
Focus on state and local geospatial content in North Carolina (state demonstration)
Tied to NC OneMap initiative, which provides for seamless access to data, metadata, and inventory information
Objective: engage existing state/federal geospatial data infrastructures in preservation
Targeted Content
Resource Types
  GIS “vector” (point/line/polygon) data
  Digital orthophotography
  Digital maps
  Tabular data (e.g. assessment data)
Content Producers
  Mostly state, local, regional agencies
  Some university, not-for-profit, commercial
  Selected local federal projects
Local Government GIS: Archival Issues
Data resources are highly distributed and subject to frequent update
More detailed, current, accurate than federal/state data resources
North Carolina local agency GIS environment
  100 counties, 95 with GIS
  85 counties with high resolution orthophotography
  Growing number of municipal systems
  Value: $162 million plus investment (est. in 2003)
Work plan in a Nutshell
Work from existing data inventories
NC OneMap Data Sharing Agreements as the “blanket”, individual agreements as the “quilt”
Partnership: work with existing geospatial data infrastructures (state and federal)
Technical approach
  METS with FGDC, PREMIS?, GeoDRM?
  DSpace now; re-ingest to different environment
Web services consumption for archival development
NCGDAP Philosophy of Engagement
Take the data in the manner in which it can be obtained
Provide feedback to producer organizations/inform state geospatial infrastructure
“Wrangle” and archive data
Note the ‘Project’ in ‘North Carolina Geospatial Data Archiving Project’ – the process, the learning experience, and the engagement with geospatial data infrastructures are more important than the archive
Big Challenges
Format migration paths
Management of data versions over time
Preservation metadata
Harnessing geospatial web services
Preserving cartographic representation
Keeping content repository-agnostic
Preserving geodatabases
More …
Vector Data Format Issues
Vector data much more complicated than image data
‘Archiving’ vs. ‘Permanent access’
  An ‘open’ pile of XML might make an archive, but if using it requires a team of programmers to do digital archaeology then it does not provide permanent access
Piles of XML need to be widely understood piles
GML: need widely accepted application schemas (like OSMM?)
The Geodatabase conundrum
  Export feature classes, and lose topology, annotation, relationships, etc.
… or use the Geodatabase as the primary archival platform (some are now thinking this way)
GIS Software Used: NC Local Agencies
[Bar chart, axis 0% to 70%. Categories: ArcGIS (ESRI), ArcInfo (ESRI), ArcView 8.x (ESRI), ArcView 3.x (ESRI), ArcIMS (ESRI), GenaMap, IMAGINE, Intergraph, MapInfo, Understanding Systems, Other, Not Sure]
Source: NC OneMap Data Inventory 2004
Vector Data Format Options
Option A: use an open format and accept a really unfortunate transformation and limited vendor support for the output object
Option B: use a closed format but retain the original content and count on short- and medium-term vendor support
Option C: do both to buy time and look for an open, ASCII-based solution (watch GML activity)
No sweet spot, just an evolving and changing mix of flawed options that are used in combination.
Geography Markup Language Issues
GML still more useful as a transfer format than an archival format; support limited even for transfer
“Permanent access” requirements:
  profiles and application schemas widely understood and supported, avoid requiring “digital archaeology”
  role of GML Simple Features Profile?
Assessing formats for preservation: sustainability factors, quality & functionality factors
Apply same approach to GML profiles and application schemas?
Geography Markup Language Issues
Plans for environmental scan of existing GML profiles and application schemas
  schema name (e.g. OSMM, top10NL, ESRI GML, LandGML)
  responsible agency; schema has official government status?
  GML version; known unsupported GML components
  schema history; known interoperation with other schemas
  vendor support; translator support; stability over time
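The scan attributes listed above could be recorded in a simple structured form so that schemas can be compared side by side. The following is only a sketch: the class name, the field names, and the `unknown_fields` helper are illustrative assumptions, not part of the project's work plan, and no facts about any schema are filled in.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class GmlSchemaRecord:
    """One row in a hypothetical environmental scan of GML schemas."""
    schema_name: str
    responsible_agency: Optional[str] = None
    official_government_status: Optional[bool] = None
    gml_version: Optional[str] = None
    unsupported_gml_components: List[str] = field(default_factory=list)
    schema_history: Optional[str] = None
    interoperates_with: List[str] = field(default_factory=list)
    vendor_support: List[str] = field(default_factory=list)
    translator_support: List[str] = field(default_factory=list)
    stability_notes: Optional[str] = None

    def unknown_fields(self) -> List[str]:
        """Names of attributes the scan has not filled in yet."""
        return [k for k, v in asdict(self).items() if v is None or v == []]


# Start a record with only the schema name known; everything else
# remains to be researched during the scan.
record = GmlSchemaRecord(schema_name="OSMM")
todo = record.unknown_fields()
```

Keeping unknown values explicit makes it easy to see how much of the scan is still outstanding for each schema.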
Managing Time-versioned Content
Managing Time-versioned Content
Many local agency data layers continuously updated
E.g., some county cadastral data updated daily—older versions not generally available
Individual versioned datasets will wander off from the archive
How do users “get current metadata/DRM/object” from a versioned dataset found “in the wild”?
How do we certify concurrency and agreement between the metadata and the data?
Managing Time-versioned Content
Can we manage the relationship loosely using a persistent identifier link to a parent object?
[Diagram: several version objects linked through a Persistent ID Resolver to a Parent Object Manager]
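One way to picture the loose relationship between versions and a parent object is a small in-memory resolver. This is a sketch under stated assumptions: the class names, the identifier scheme, the example URLs, and the use of ISO dates as version labels are all hypothetical and do not describe an existing resolver service.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Version:
    label: str      # assumed to be an ISO date so labels sort chronologically
    location: str   # where this copy of the dataset can be fetched


@dataclass
class ParentObject:
    persistent_id: str
    metadata_location: str          # current metadata/DRM for all versions
    versions: List[Version] = field(default_factory=list)

    def latest(self) -> Optional[Version]:
        """Most recent version by label, or None if nothing is registered."""
        return max(self.versions, key=lambda v: v.label, default=None)


class Resolver:
    """Maps a persistent ID carried by any version back to its parent."""

    def __init__(self) -> None:
        self._parents: Dict[str, ParentObject] = {}

    def register(self, parent: ParentObject) -> None:
        self._parents[parent.persistent_id] = parent

    def resolve(self, persistent_id: str) -> ParentObject:
        return self._parents[persistent_id]


# Hypothetical identifiers and locations, for illustration only.
resolver = Resolver()
parent = ParentObject("example:county-parcels", "https://example.org/meta/parcels.xml")
parent.versions.append(Version("2001-01-01", "https://example.org/data/parcels-2001.zip"))
parent.versions.append(Version("2004-01-01", "https://example.org/data/parcels-2004.zip"))
resolver.register(parent)

found = resolver.resolve("example:county-parcels")
newest = found.latest()
```

A version found "in the wild" only needs to carry the persistent ID; the parent object then supplies the current metadata and the list of sibling versions.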
Preservation Metadata Issues
FGDC Metadata
  Many flavors, incoming metadata needs processing
Cross-walk elements to PREMIS, MODS?
Metadata wrapper/Content packaging
  METS (Metadata Encoding and Transmission Standard) vs. other industry solutions
Need a geospatial industry solution for the ‘METS-like problem’
GeoDRM a likely trigger—wrapper to enforce licensing (MPEG 21 references in OGIS Web Services 3)
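As an illustration of the wrapper idea, the sketch below builds a minimal METS document that embeds an FGDC record and points at one data file. It is not a validated METS profile and not the project's packaging design; the function name, the IDs, the file URL, and the FGDC fragment (which is deliberately incomplete) are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_mets(fgdc_xml: str, file_href: str) -> ET.Element:
    """Wrap an FGDC metadata record and one file pointer in a METS element."""
    mets = ET.Element(f"{{{METS_NS}}}mets")

    # Descriptive metadata section carrying the FGDC record inline.
    dmd = ET.SubElement(mets, f"{{{METS_NS}}}dmdSec", ID="dmd1")
    wrap = ET.SubElement(dmd, f"{{{METS_NS}}}mdWrap", MDTYPE="FGDC")
    xml_data = ET.SubElement(wrap, f"{{{METS_NS}}}xmlData")
    xml_data.append(ET.fromstring(fgdc_xml))

    # File section pointing at the data file by URL.
    file_sec = ET.SubElement(mets, f"{{{METS_NS}}}fileSec")
    grp = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp")
    f = ET.SubElement(grp, f"{{{METS_NS}}}file", ID="file1", DMDID="dmd1")
    ET.SubElement(
        f,
        f"{{{METS_NS}}}FLocat",
        {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": file_href},
    )

    # Structural map tying the file into the package.
    struct = ET.SubElement(mets, f"{{{METS_NS}}}structMap")
    div = ET.SubElement(struct, f"{{{METS_NS}}}div")
    ET.SubElement(div, f"{{{METS_NS}}}fptr", FILEID="file1")
    return mets


# Hypothetical, incomplete FGDC fragment and file location.
fgdc = (
    "<metadata><idinfo><citation><citeinfo>"
    "<title>Example parcels</title>"
    "</citeinfo></citation></idinfo></metadata>"
)
mets = build_mets(fgdc, "https://example.org/data/parcels-2004.zip")
xml_text = ET.tostring(mets, encoding="unicode")
```

Preservation metadata such as PREMIS, or rights information, would need further sections; those are left out here because the slides treat them as open questions.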
Metadata Availability
Harnessing Geospatial Web Services
Geospatial Web Service Types
Image services
  Deliver image resulting from query against underlying data
  Limited opportunity for analysis
Feature services
  Stream actual feature data, greater opportunity for data analysis
Other
  Geocoding services
  Routing, etc.
Accessible ArcXML Services
Geospatial Web Services Rights Issues
Example: Desktop GIS-accessible ArcIMS
39 of 100 NC counties have desktop GIS-accessible ArcIMS services
It is difficult to know how many of these counties actually expect users to either:
  A) access data through desktop GIS for viewing only, or
  B) extract and download data
Harnessing Geospatial Web Services
Automated content identification: ‘capabilities files,’ registries, catalog services
WMS (Web Map Service) for batch extraction of image atlases
  last ditch capture option
  preserve cartographic representation
  retain records of decision-making process
… feature services (WFS) later.
Rights issues in the web services space are ambiguous
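Batch extraction of an image atlas from a WMS server amounts to issuing one GetMap request per tile of a grid. The sketch below only builds the request URLs, using standard WMS 1.1.1 GetMap parameters; the endpoint, layer name, extent, and grid size are hypothetical, and no request is sent. Whether a given service permits this kind of harvesting is a rights question the code does not address.

```python
from typing import Iterator, List, Tuple
from urllib.parse import urlencode

BBox = Tuple[float, float, float, float]  # minx, miny, maxx, maxy


def tile_bboxes(extent: BBox, cols: int, rows: int) -> Iterator[BBox]:
    """Split an extent into a cols x rows grid, row by row from the south."""
    minx, miny, maxx, maxy = extent
    dx = (maxx - minx) / cols
    dy = (maxy - miny) / rows
    for r in range(rows):
        for c in range(cols):
            yield (minx + c * dx, miny + r * dy,
                   minx + (c + 1) * dx, miny + (r + 1) * dy)


def getmap_url(base_url: str, layers: List[str], bbox: BBox,
               srs: str = "EPSG:4326", width: int = 1024,
               height: int = 1024, fmt: str = "image/png") -> str:
    """Build a WMS 1.1.1 GetMap URL for one tile."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": ",".join(layers),
        "STYLES": "",  # empty value requests the server's default styles
        "SRS": srs,
        "BBOX": ",".join(f"{v:.6f}" for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": fmt,
    }
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint, layer, and extent.
extent = (-80.0, 35.0, -78.0, 36.0)
tiles = list(tile_bboxes(extent, cols=2, rows=2))
urls = [getmap_url("https://example.org/wms", ["parcels"], t) for t in tiles]
```

Storing each URL alongside the returned image would also keep a record of which layers and styles produced the archived map.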
“Web mash-ups” and the New Mainstream Geospatial Web Services
Preserving Cartographic Representation
Preserving Cartographic Representation
The true counterpart of the old map is not the GIS dataset, but rather the cartographic representation that builds on that data:
Intellectual choices about symbolization, layer combinations
Data models, analysis, annotations
Cartographic representation typically encoded in proprietary files (.avl, .lyr, .apr, .mxd) that do not lend themselves well to migration
Symbologies have meaning to particular communities at particular points in time, preserving information about symbol sets and their meaning is a different problem
Preserving Cartographic Representation
Image-based approaches
Generate images using Map Book or similar tools
Harvest existing atlas images
Capture atlases from WMS servers
Export ‘layouts’ or ‘maps’ to image
Vector-based approaches
Store explicitly in the data format (e.g. Feature Class Representation in ArcGIS 9.2)
Archive and upward-migrate existing files .avl, .apr, .lyr, .mxd, etc.
SVG, VML or other XML approaches
Other?
Preserving Cartographic Representation
Preserving Cartographic Representation
Repository Architecture Issues
Interest in how geospatial content interacts with widely available digital repository software
Focus on salient, domain-specific issues
Challenge: remain repository agnostic
  Avoid “imprinting” on repository software environment
  Preservation package should not be the same as the ingest object of the first environment
  Tension between exploiting repository software features vs. becoming software dependent
Preserving Geodatabases
Spatial databases in general vs. ESRI Geodatabase “format”
  Not just data layers and attributes, but also topology, annotation, relationships, behaviors
ESRI Geodatabase archival issues
XML Export, Geodatabase History, File Geodatabase, Geodatabase Replication
Some looking to Geodatabase as archival platform (in addition to feature class export)
Geodatabase Availability
Local agencies, especially municipalities, are increasingly turning to the ESRI Geodatabase format to manage geospatial data. According to the 2003 Local Government GIS Data Inventory, 10.0% of all county framework data and 32.7% of all municipal framework data were managed in that format.
[Charts: Street Centerline Formats, one for cities and one for counties. Categories in each: Geodatabase, Shapefile, Coverage, Other]
Evolving Geodatabase Handling Approaches
Project Stage: Planned Approach
Original Proposal (Nov. 2003): Export feature classes as shapefiles; archive Geodatabases less than 2 GB in size
Finalized Work Plan (Dec. 2004): Also export content as Geodatabase XML
Possible Future Work Plan Changes: Explore maintenance of some archival content in Geodatabase form; explore Geodatabase replication as an archive development approach; archive Geodatabases of unlimited size
Efficient Content Replication
Content replication also needed for:
  Disaster preparedness
  State and federal data improvement projects
  Aggregation by regional geospatial web service providers
WFS, e.g.: efficiency in complete content transfer?
Rsync-like function, plus: rights management, inventory processes, metadata management, informed by data update cycles
Archiving delta files vs. complete replication – need to avoid requiring “digital archaeology” in the future
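The "rsync-like function" can be pictured as comparing checksum manifests from two points in time and transferring only what differs. The sketch below is an assumption about how such a comparison might look, not a description of any tool the project uses; the function names and the example file names and checksums are hypothetical, and rights, inventory, and metadata handling are left out.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    manifest: Dict[str, str] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Reads whole files into memory; large imagery would need chunking.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def diff_manifests(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, List[str]]:
    """Report which files were added, removed, or changed between captures."""
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(k for k in set(old) & set(new) if old[k] != new[k]),
    }


# Hypothetical manifests from two captures of the same dataset.
earlier = {"parcels.shp": "aaa", "parcels.dbf": "bbb", "roads.shp": "ccc"}
later = {"parcels.shp": "aaa", "parcels.dbf": "ddd", "zoning.shp": "eee"}
delta = diff_manifests(earlier, later)
```

A delta like this is efficient to transfer, but an archive built only from deltas must be replayed in order to recover any one version, which is the "digital archaeology" risk noted above; keeping periodic complete copies alongside the deltas is one way to limit it.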
Points of Engagement with the Open Geospatial Consortium (OGC)
GML for archiving
GeoDRM -- Adding preservation use cases
Content Packaging -- Industry solution?
Web Services Context Documents
  Can we save data state as well as application state?
Content Replication
  Is this layer in the architecture?
Persistent Identifiers
Project Outcomes
Demonstration archive
Outreach activity – planting seeds
  International, national, state, local, commercial
Learning experience, informing:
  Spatial data infrastructure
  Commercial vendors (data/software/consulting)
  Repository software communities
  Metadata practice (both GIS & preservation)
  Rights management developments
  Data and interoperability standards
Content Identification and Selection
Work from NC OneMap Data Inventory
Combine with inventory information from various state agencies and from previous NCSU efforts
Develop methodology for selecting from among “early,” “middle,” and “late” stage products
Develop criteria for time series development
Investigate use of emerging Open Geospatial Consortium technologies in data identification
Content Acquisition
Work from NC OneMap Data Sharing Agreements as a starting point (the “blanket”)
Secure individual agreements (the “quilt”)
Investigate use of OGC technologies in capture
Explore use of METS as a metadata wrapper
  Ingest FGDC metadata; crosswalk to MODS? PREMIS?
  Maybe METS DRM short term; GeoDRM long term
  Consider links to services; version management
  Get the geospatial community to tackle the content packaging problem (maybe MPEG 21?)
Partnership Building
Work within context of the NC OneMap initiative
  State, local, federal partnership
  State expression of the National Map
  Defined characteristic: “Historic and temporal data will be maintained and available”
Advisory Committee drawn from the NC Geographic Information Coordinating Council subcommittees
Seek external partners
  National States Geographic Information Council
  FGDC Historical Data Committee
… more
Content Retention and Transfer
Ingest into DSpace
  Explore how geospatial content interacts with existing digital repository software environments
Investigate re-ingest into a second platform
  Challenge: keep the collection repository-agnostic
Start to define format migration paths
  Special problem: geodatabases
Pursue long-term solution
  Roles of data producing agencies, state agencies; NC OneMap; NCSU
Project Status
Completing inventory analysis stage
Storage system and backup deployed
DSpace deployed to production
Metadata workflow finalized
Ingest workflow near finalization
Content migration workflow near finalization
Regional site visits planned for coming months
Wide range of outreach/collaboration: FGDC, ESRI, EDINA (JISC), USGS, OGC, TRB, etc.
Pilot project, georegistering digital archival geologic maps
Questions?
Contact:
Steve Morris
Head, Digital Library Initiatives
NCSU Libraries
ph: (919) [email protected]