Cooperative Project with Library of Congress on Preservation of Digital Geospatial Data

Steve Morris, Head of Digital Library Initiatives, NCSU Libraries



Note: Percentages based on the actual number of respondents to each question 2

NC Geospatial Data Archiving Project(NCGDAP)

Partnership between NCSU Libraries and NC Center for Geographic Information & Analysis

$520,000 funding – 3 years

Focus on state and local geospatial content in North Carolina (state demonstration)

Address NC OneMap objective: “Historic and temporal data will be maintained and available.”

One of eight projects in the first NDIIPP funding round: “Building a Network of Partners”


NDIIPP Overview

National Digital Information Infrastructure and Preservation Program

Congress appropriated $100 million for this effort; the legislation instructs the Library to spend an initial $25 million to develop and execute a congressionally approved strategic plan

Eight initial projects, 2004-2007: web pages, cultural heritage, numeric data, video, business records, mixed content, geospatial (2)

Developing partnerships and identifying issues

Extensive interaction among NDIIPP projects


Targeted Content

Resource Types

GIS “vector” (point/line/polygon) data

Digital orthophotography

Digital maps

Tabular data (e.g. assessment data)

Content Producers

Mostly state, local, regional agencies

Some university, not-for-profit, commercial

Selected local federal projects


Risks to Digital Geospatial Data

.shp

.mif

.gml

.e00

.dwg

.dgn

.bsb

.bil

.sid


Risks to Digital Geospatial Data

Focus on current data

Archiving data does not guarantee “permanent access”

Future support of data formats in question

Need to migrate formats or allow for emulation

Data failure

“Bit rot”, media failure

Preservation metadata requirements

Descriptive, administrative, technical, DRM

Shift to “streaming data” for access
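One common guard against "bit rot" and media failure is periodic fixity checking: record a checksum for every archived file, then recompute and compare later. The following is a minimal sketch of that idea, not the project's actual workflow; the function names and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file under root (relative POSIX path) to its digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(root: Path, manifest: dict[str, str]) -> list[str]:
    """List manifest entries that are now missing or no longer match."""
    current = build_manifest(root)
    return sorted(
        name for name, digest in manifest.items() if current.get(name) != digest
    )
```

A manifest built at ingest time can be stored next to the data; running `changed_files` on a schedule flags files whose bytes have silently changed or disappeared.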


Time series – vector data

Parcel Boundary Changes 2001-2004, North Raleigh, NC


Time series – Ortho imagery

Vicinity of Raleigh-Durham International Airport 1993-2002


Today’s geospatial data as tomorrow’s cultural heritage


Earlier NCSU Acquisition Efforts

NCSU University Extension project 2000-2001

Target: County/city data in eastern NC

“Digital rescue” not “digital preservation”

Project learning outcomes

Confirmed concerns about long-term access

Need for efficient inventory/acquisition

Wide range in rights/licensing

Need to work within statewide infrastructure

Acquired experience; unanticipated collaboration


One Earlier Project Outcome: Directory of County and City Services

Among top 15 most used resources on library web site

99.5% of directory users from outside ncsu.edu


NDIIPP Project Phases

Content Identification and Selection

Content Acquisition

Partnership Building

Content Retention and Transfer

All 8 NDIIPP cooperative projects adhere to this structure


Content Identification and Selection

Work from NC OneMap Data Inventory

Combine with inventory information from various state agencies and from previous NCSU efforts

Develop methodology for selecting from among “early,” “middle,” and “late” stage products

Develop criteria for time series development

Investigate use of emerging Open Geospatial Consortium technologies in data identification


Content Acquisition

Work from NC OneMap Data Sharing Agreements as a starting point (the “blanket”)

Secure individual agreements (the “quilt”)

Investigate use of OGC technologies in capture

Use METS (Metadata Encoding and Transmission Standard) as a metadata wrapper

Bundle data files, metadata, ancillary documentation

Supplement FGDC metadata with additional administrative, technical, and descriptive metadata

Encode rights (Digital Rights Management – DRM)

Links to services
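To make the wrapper idea concrete, here is a minimal sketch of a METS-style document that references an external FGDC metadata record and lists the bundled data files. It uses only Python's standard library and has not been validated against the METS schema; the function name, identifiers, and file names are hypothetical, and a real profile would also carry the administrative, technical, and rights metadata mentioned above.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_mets(object_id: str, metadata_href: str, data_hrefs: list[str]) -> ET.Element:
    """Build a minimal METS-style wrapper around FGDC metadata and data files."""
    mets = ET.Element(f"{{{METS_NS}}}mets", {"OBJID": object_id})

    # Descriptive metadata section: point at the external FGDC record.
    dmd = ET.SubElement(mets, f"{{{METS_NS}}}dmdSec", {"ID": "DMD1"})
    ET.SubElement(
        dmd,
        f"{{{METS_NS}}}mdRef",
        {"LOCTYPE": "URL", "MDTYPE": "FGDC", f"{{{XLINK_NS}}}href": metadata_href},
    )

    # File section: one entry per data or ancillary file in the bundle.
    file_sec = ET.SubElement(mets, f"{{{METS_NS}}}fileSec")
    group = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp", {"USE": "archival data"})
    for index, href in enumerate(data_hrefs, start=1):
        entry = ET.SubElement(group, f"{{{METS_NS}}}file", {"ID": f"FILE{index}"})
        ET.SubElement(
            entry,
            f"{{{METS_NS}}}FLocat",
            {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": href},
        )

    # Structural map: ties the files and the metadata together as one object.
    struct = ET.SubElement(mets, f"{{{METS_NS}}}structMap")
    div = ET.SubElement(struct, f"{{{METS_NS}}}div", {"DMDID": "DMD1"})
    for index in range(1, len(data_hrefs) + 1):
        ET.SubElement(div, f"{{{METS_NS}}}fptr", {"FILEID": f"FILE{index}"})
    return mets
```

Calling `ET.tostring(build_mets("demo-object", "parcels.xml", ["parcels.shp", "parcels.dbf"]), encoding="unicode")` yields the serialized wrapper.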


Partnership Building

Work within context of the NC OneMap initiative

Explore state, local, federal partnerships

Defined characteristic: “Historic and temporal data will be maintained and available”

Advisory Committee drawn from the NC Geographic Information Coordinating Council subcommittees

Seek external partners

National States Geographic Information Council

FGDC Historical Data Committee

… more


Content Retention and Transfer

Ingest into DSpace open source digital repository software

Look more generically at the issue of putting geospatial content into digital repositories

Investigate re-ingest into a second platform

Start to define format migration paths

Special problem: geodatabases

Pursue long-term solution

Roles of data producing agencies, state agencies; NC OneMap; NCSU


Big Geoarchiving Challenges

Format migration paths

Management of data versions over time

Preservation metadata

Preserving cartographic representation

Keeping content repository-agnostic

Preserving geodatabases

Harnessing geospatial web services

More …


Vector Data Format Issues

Vector data much more complicated than image data

‘Preservation’ vs. ‘Permanent access’

An ‘open’ pile of XML might make an archive, but if using it requires a team of programmers to do digital archaeology then it does not provide permanent access

Piles of XML need to be widely understood piles

GML: need widely accepted application schemas (like OSMM?)

The Geodatabase conundrum

Export feature classes, and lose topology, annotation, relationships, etc.

… or use the Geodatabase as the primary archival platform (some are now thinking this way)


Geography Markup Language Issues

GML still more useful as a transfer format than an archival format; support limited even for transfer

FGDC Historical Data Working Group investigations into GML for use in archiving

Plans for environmental scan of existing GML profiles and application schemas, recording for each:

schema name (e.g. OSMM, top10NL, ESRI GML, LandGML)

responsible agency; schema has official government status?

GML version; known unsupported GML components

schema history; known interoperation with other schemas

vendor support; translator support
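The attributes listed for the environmental scan map naturally onto a simple record type. The sketch below shows one possible shape for such an inventory entry; the class name, field names, and the `open_questions` helper are assumptions for illustration and do not describe the working group's actual template.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GmlSchemaRecord:
    """One entry in a hypothetical inventory of GML application schemas."""

    schema_name: str
    responsible_agency: str = ""
    official_government_status: Optional[bool] = None  # None = not yet determined
    gml_version: str = ""
    unsupported_components: list[str] = field(default_factory=list)
    schema_history: str = ""
    interoperates_with: list[str] = field(default_factory=list)
    vendor_support: list[str] = field(default_factory=list)
    translator_support: list[str] = field(default_factory=list)

    def open_questions(self) -> list[str]:
        """Names of fields that are still empty or undetermined."""
        return [
            name for name, value in vars(self).items() if value in ("", None, [])
        ]
```

Creating `GmlSchemaRecord("LandGML")` and calling `open_questions()` lists every attribute the scan still has to fill in for that schema.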


Managing Time-versioned Content

Many local agency data layers continuously updated

Older versions not generally available

Individual versioned datasets will wander off from the archive

How do users “get current metadata/DRM/object” from a versioned dataset found “in the wild”?

How do we certify concurrency and agreement between the metadata and the data?
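One possible approach to both questions is to key archive records on a digest of the data itself, so that a copy found "in the wild" can be matched back to its version record, and to store a digest of the accompanying metadata in the same record so agreement can be checked. This is a minimal sketch under those assumptions; the record layout and function names are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class VersionRecord:
    dataset_id: str       # hypothetical archive identifier
    snapshot: date        # date this version was captured
    data_sha256: str      # digest of the data bytes
    metadata_sha256: str  # digest of the metadata describing those bytes


def digest(payload: bytes) -> str:
    """Return the SHA-256 hex digest of a byte string."""
    return hashlib.sha256(payload).hexdigest()


def register(index: dict[str, VersionRecord], dataset_id: str, snapshot: date,
             data: bytes, metadata: bytes) -> VersionRecord:
    """Add a version to the index, keyed by the digest of its data."""
    record = VersionRecord(dataset_id, snapshot, digest(data), digest(metadata))
    index[record.data_sha256] = record
    return record


def identify(index: dict[str, VersionRecord], data: bytes) -> Optional[VersionRecord]:
    """Match a stray copy of a dataset back to its archived version, if any."""
    return index.get(digest(data))


def metadata_agrees(record: VersionRecord, metadata: bytes) -> bool:
    """Check that a metadata file is the one archived with this version."""
    return record.metadata_sha256 == digest(metadata)
```

The lookup only works for byte-identical copies; a dataset that has been re-exported or edited after leaving the archive would need some other identifier carried inside the file.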


Preservation Metadata Issues

FGDC Metadata

Many flavors, incoming metadata needs processing

Other standards: PREMIS, MODS

Metadata wrapper

METS (Metadata Encoding and Transmission Standard) vs. other industry solutions

Need a geospatial industry solution for the ‘METS-like problem’

GeoDRM a likely trigger: wrapper to enforce licensing (MPEG 21 references in OGC Web Services 3)


Preserving Cartographic Representation

The true counterpart of the old map is not the GIS dataset, but rather the cartographic representation that builds on that data:

Intellectual choices about symbolization, layer combinations

Data models, analysis, annotations

Cartographic representation typically encoded in proprietary files (.avl, .lyr, .apr, .mxd) that do not lend themselves well to migration

Symbologies have meaning to particular communities at particular points in time, preserving information about symbol sets and their meaning is a different problem


Preserving Cartographic Representation

Image-based approaches (“desiccated data”)

Generate images using Map Book or similar tools

Harvest existing atlas images

Capture atlases from WMS servers

Export ‘layouts’ or ‘maps’ to image

Vector-based approaches

Store explicitly in the data format (e.g. Feature Class Representation in ArcGIS 9.2)

Archive and upward-migrate existing files .avl, .apr, .lyr, .mxd, etc.

SVG, VML or other XML approaches

Other?
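As an illustration of the SVG option, the sketch below writes a single polygon ring to SVG with its symbolization (fill and stroke) stored next to the geometry in an open XML format. The function name, colors, and canvas size are assumptions for illustration; a real export would also handle map projection, multiple layers, and labels.

```python
def ring_to_svg(ring, fill="#c8e6c9", stroke="#2e7d32", size=200):
    """Render one polygon ring as an SVG string, scaled to a square canvas."""
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    min_x, max_x = min(xs), max(xs)
    min_y, max_y = min(ys), max(ys)
    # Use one scale for both axes so the shape is not distorted.
    span = max(max_x - min_x, max_y - min_y) or 1.0
    scale = size / span
    # SVG's y axis points down, so flip map coordinates vertically.
    points = " ".join(
        f"{(x - min_x) * scale:.1f},{(max_y - y) * scale:.1f}" for x, y in ring
    )
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}">'
        f'<polygon points="{points}" fill="{fill}" stroke="{stroke}"/>'
        "</svg>"
    )
```

Because the styling sits in plain attributes, the symbol choices remain readable even if no software that understands the original project file survives.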


Preserving Geodatabases

Not just data layers and attributes—also topology, annotation, relationships, behaviors

ESRI Geodatabase archival issues

XML Export, Geodatabase History, File Geodatabase, Geodatabase Replication

Growing use of geodatabases by municipal, county agencies

Some looking to Geodatabase as archival platform (in addition to feature class export)


Geodatabase Availability

According to the 2003 Local Government GIS Data Inventory, 10.0% of all county framework data and 32.7% of all municipal framework data were managed in Geodatabase format.

Cities: Street Centerline Formats (chart categories: Geodatabase, Shapefile, Coverage, Other)

Counties: Street Centerline Formats (chart categories: Geodatabase, Shapefile, Coverage, Other)


Evolving Geodatabase Handling Approaches

Project Stage: Planned Approach

Original Proposal (Nov. 2003): Export feature classes as shapefiles; archive Geodatabases less than 2 GB in size

Finalized Work Plan (Dec. 2004): Also export content as Geodatabase XML

Possible Future Work Plan Changes: Explore maintenance of some archival content in Geodatabase form; explore Geodatabase replication as an archive development approach; archive Geodatabases of unlimited size


Harnessing Geospatial Web Services

Automated content identification: ‘capabilities files,’ registries, catalog services

WMS (Web Map Service) for batch extraction of image atlases

last ditch capture option

preserve cartographic representation

retain records of decision-making process

… feature services (WFS) later.

Rights issues in the web services space are ambiguous
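Batch extraction of an image atlas from a WMS server amounts to issuing one GetMap request per tile of a grid laid over the area of interest. The sketch below only builds the request URLs, using the standard WMS 1.1.1 GetMap parameters; the function name, server URL, layer name, tile size, and coordinate system are assumptions for illustration, and the actual downloading, rate limiting, and rights checks are left out.

```python
from urllib.parse import urlencode


def getmap_urls(base_url, layer, bbox, rows, cols, tile_px=1024,
                srs="EPSG:4326", image_format="image/png"):
    """Yield (row, col, url) for a grid of WMS 1.1.1 GetMap requests over bbox."""
    min_x, min_y, max_x, max_y = bbox
    step_x = (max_x - min_x) / cols
    step_y = (max_y - min_y) / rows
    # Keep the pixel aspect ratio equal to the tile's ground aspect ratio.
    height_px = max(1, round(tile_px * step_y / step_x))
    for row in range(rows):
        for col in range(cols):
            tile = (
                min_x + col * step_x,
                min_y + row * step_y,
                min_x + (col + 1) * step_x,
                min_y + (row + 1) * step_y,
            )
            params = {
                "SERVICE": "WMS",
                "VERSION": "1.1.1",
                "REQUEST": "GetMap",
                "LAYERS": layer,
                "STYLES": "",
                "SRS": srs,
                "BBOX": ",".join(f"{value:.6f}" for value in tile),
                "WIDTH": tile_px,
                "HEIGHT": height_px,
                "FORMAT": image_format,
            }
            yield row, col, f"{base_url}?{urlencode(params)}"
```

For example, `list(getmap_urls("https://example.org/wms", "streets", (-79.0, 35.5, -78.0, 36.0), rows=1, cols=2))` produces two URLs, one per half-degree tile. Saving each request URL alongside the returned image keeps a record of how the atlas was captured.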


Partnerships

ESRI

Discussing software requirements: meetings with development teams April 2005

Open Geospatial Consortium (OGC)

Meet with Architecture Working Group Nov. 2005

National Archives and Records Administration

Investigations into GML for archiving; planned presentation to NARA technology team

FGDC Historical Data Working Group

General geospatial data preservation issues


Partnerships

EDINA (University of Edinburgh, UK)

NCSU is Associate Partner on UK project for geospatial institutional repositories

UC Santa Barbara & Stanford University

Other NDIIPP geospatial project

EROS Data Center

Planned site visit

Project visits to regional GIS groups

Albemarle Regional GIS meeting Nov. 3

More planned …


Progress to Date

Completion of project agreements

Hiring staff

Acquisition and deployment of storage system (12.4 TB capacity – two 16.8 TB systems)

Testing and deployment of repository software

Development of metadata workflow

Development of ingest workflow

Pilot project with NC Geologic Survey data

… Initial focus on developing the “plumbing”


Questions for You?

What are your current practices for:

Archiving data and managing time versions

Managing geodatabase versions

Transfer mechanisms for data (to regional entities? to off-site storage for disaster recovery?)

Archiving project files and finished products

What rights issues exist with regard to putting county and city data into an archive?

What would you like this project to do?


Ways to Participate in NCGDAP

Identifying data for inclusion in the repository

Discussing data format strategies

Sharing ideas about archiving approaches and architectures

Sharing and identifying concerns about rights issues, liability, etc.

Hosting project visits to regional GIS groups

Using the Local Government GIS listserv to discuss preservation issues?


Questions?

Contact:

Steve Morris

Head, Digital Library Initiatives

NCSU [email protected]

http://www.lib.ncsu.edu/ncgdap