Digital Preservation NOW
Andrew Waugh
Senior Technical Advisor
Public Record Office Victoria
Goal of session
• To present practical steps that you can take to preserve digital information now, without having a digital archive
Outline of session
• Goal of preservation
• Preserving the bit stream
• Preserving accessibility
• Preserving the context
• Conclusions
The goal of preservation
• Ensure access to records as long as they are required
• A record is…
– information created, received, and maintained as evidence and information by an organization or person, in pursuance of legal obligations or in the transaction of business (AS ISO 15489.1-2002)
The key to records is evidence
• What, where, when, how, who
• Evidence to colleagues (business activity)
• Evidence of accountability (investigations)
• Evidence to courts (legal evidence)
• Evidence to researchers (historical evidence)
So what does evidence require?
• That record was produced as part of normal business process (authentic)
• That record can be found & read (accessible)
• That it can be related to the rest of the records (context)
• That it hasn’t been tampered with (integrity)
Key issues
• Preserving the bit stream
– If you don’t have the bits, you don’t have anything
• Preserving access to the information
– In the face of fragile applications
• Preserving the context
– The evidence
Preserving the bit stream
Core issue
• If you don’t have the binary data (files) that makes up the record then you cannot preserve anything
• Problems you need to protect against
– Media failures (corruption, crashes)
– Technology obsolescence
– Human error
Basically a solved problem
• A core function of your IT department
– Day to day operation of storage systems
– Back-up/restore and disaster recovery
– Periodic replacement of media and technology
Recommendations
• Store on at least two pieces of media, ideally two technologies or (less ideally) two brands
• Store in at least two sites
• Information not being accessed must be periodically checked for corruption
• Track individual pieces of media – include brand and batch
• Always use mainstream technology in widespread use
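The first three recommendations can be checked mechanically once each copy of a record is listed in a register. The following is a minimal sketch, not a prescribed tool: the `Copy` fields, the `check_redundancy` function, and the wording of the reported problems are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    """One stored copy of a record (field names are illustrative)."""
    record_id: str   # which record this copy holds
    media_id: str    # label of the individual piece of media
    site: str        # where the media is kept
    technology: str  # e.g. "disc" or "tape"
    brand: str
    batch: str


def check_redundancy(copies):
    """Return {record_id: [problems]} for records that break the rules."""
    by_record = defaultdict(list)
    for copy in copies:
        by_record[copy.record_id].append(copy)

    problems = {}
    for record_id, held in by_record.items():
        issues = []
        if len({c.media_id for c in held}) < 2:
            issues.append("fewer than two pieces of media")
        if len({c.site for c in held}) < 2:
            issues.append("fewer than two sites")
        # Ideally two technologies, or failing that two brands.
        if len({(c.technology, c.brand) for c in held}) < 2:
            issues.append("all copies share one technology and brand")
        if issues:
            problems[record_id] = issues
    return problems
```

Recording brand and batch for every piece of media also means that, if one batch proves faulty, the register shows which other records are at risk.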
Storage media (disc)
• Default storage choice should be on-line (disc) storage unless massive storage required
– e.g. 3 Terabytes RAID 5 ~$4000
• RAID 1 or 6 (or derivatives) to guard against disc failures. RAID 5 is problematic now.
• Expect to replace each disc within 5 years
• External (USB) discs not recommended for long term storage (> 1 year)
Storage media (tape)
• The choice when you need more storage capacity than is economic with disc
– Be sure to factor in whole of life costs including media replacement and operator costs
• Preferred formats LTO Ultrium, IBM 3592, T10000
• Tape robots are preferred over manual handling
• Get expert advice on tape solutions as these are no longer common – use only for large organisations
• NEVER EVER choose leading edge technology, always stay within industry standard
Storage media (optical)
• Prefer CD-R (phthalocyanine dye)
• Can use CD-R (azo dye) or DVD-R, but monitor carefully
• Do not use CD-RW, DVD-RW, or CD-R (cyanine dye)
• Use ‘name brands’, and archival quality if possible
• Refresh in 2 to 5 years
• Unlikely to be generally economic compared with disc or tape due to high operator cost and low capacity
Monitor…
• Recommend statistical sampling of data to
– check for corruption of copies (checksums)
– detect deterioration of media
• Technology watch to guard against obsolete media
– plan for media refresh every 2 to 10 years
• Track individual pieces of media (if used)
– Ensure that none are lost
– Ensure that all are tested and refreshed
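As an illustration of checksum sampling, the sketch below draws a random sample of files and compares each against a checksum recorded when the file was stored. It assumes that a manifest mapping relative file names to SHA-256 digests already exists; the function names and the manifest layout are hypothetical.

```python
import hashlib
import random
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_sample(root: Path, manifest: dict, sample_size: int, seed=None):
    """Check a random sample of manifest entries under root.

    manifest maps a file name relative to root to the digest recorded
    when the file was stored. Returns a list of (name, problem) pairs.
    """
    rng = random.Random(seed)
    names = sorted(manifest)
    chosen = rng.sample(names, min(sample_size, len(names)))

    failures = []
    for name in chosen:
        target = root / name
        if not target.is_file():
            failures.append((name, "missing"))
        elif file_digest(target) != manifest[name]:
            failures.append((name, "checksum mismatch"))
    return failures
```

Any failure in the sample is a signal to test the rest of that piece of media, and probably the rest of its batch.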
Back-up & disaster recovery
• Ensure that
– Your IT organisation has both a back-up and disaster recovery regime
– It is effective (periodically test restoration)
Preserving accessibility
Software fragility
• Without software to interpret and display the content, the data is lost
– Software may not run on the current version of the operating system or current computer
– Current software version may not accurately deal with files from older versions
– You may not have the required software
Do nothing option
• So far has worked because backwards compatibility is better than we thought
– Operating systems continue to support older programs (Windows, Unix/Linux)
– Modern programs seem to have good support for files from older versions
– This may not last forever…
If you are going to do nothing…
• Perform a risk analysis
– Survey your holdings to identify and quantify file formats (versions if possible, ages if not)
– Consider risk of loss of access (use criteria from normalisation section)
– Identify high value holdings
• Monitor software trends (is risk increasing?)
• Identify contingency plans
• Influence users to use lower risk formats
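A first-pass survey of holdings can be as simple as counting files by extension and noting the oldest file of each type. The sketch below assumes the holdings sit under a single directory; `survey_formats` is a hypothetical name. The file extension is only a rough proxy for the format and says nothing about the version, so treat the result as a starting point for the risk analysis rather than a definitive inventory.

```python
from collections import Counter
from datetime import datetime, timezone
from pathlib import Path


def survey_formats(root: Path) -> dict:
    """Count files and find the oldest modification date per extension."""
    counts = Counter()
    oldest = {}
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        ext = path.suffix.lower() or "(none)"
        counts[ext] += 1
        mtime = path.stat().st_mtime
        if ext not in oldest or mtime < oldest[ext]:
            oldest[ext] = mtime

    # Most common extensions first, so the largest holdings lead the report.
    return {
        ext: {
            "files": n,
            "oldest_modified": datetime.fromtimestamp(
                oldest[ext], tz=timezone.utc
            ).date().isoformat(),
        }
        for ext, n in counts.most_common()
    }
```

Modification dates stand in for age here because they are available without opening the files; they can be misleading if files have been copied or touched since creation.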
Normalisation option
• Proactively convert formats to a long term preservation format (LTPF)
• This is a format that is likely to be usable for the foreseeable future
– Can find replacement software to render data
– Can find software to migrate from LTPF to new format
• Library of Congress sustainability factors
– http://www.digitalpreservation.gov/formats/
Characteristics of a good LTPF
• Supports critical features of your data
• Published file format specification
• Independent implementations
• Wide community adoption
• Simple
• Formal standard
• Public domain
• Low risk conversion
If you normalise
• Don’t jump out of the frying pan
– Still need to do the analysis presented for the ‘do nothing’ case
– Just fewer formats
• Develop test regime to test conversion into nominated format
– Suite of ‘typical’ documents illustrating critical features
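One way to structure such a test regime is to convert each document in the suite and compare its critical features before and after. The sketch below is deliberately generic: `convert` and `extract_features` are placeholders that you would supply for your own formats and tools, and `check_conversion` is a hypothetical name.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def check_conversion(
    samples: Iterable[Tuple[str, Any]],
    convert: Callable[[Any], Any],
    extract_features: Callable[[Any], Dict[str, Any]],
) -> Dict[str, Dict[str, Tuple[Any, Any]]]:
    """Convert each sample and report critical features that changed.

    samples yields (name, document) pairs. The result maps a sample name
    to {feature: (value_before, value_after)} for every feature that did
    not survive conversion; samples that convert cleanly are omitted.
    """
    report = {}
    for name, original in samples:
        before = extract_features(original)
        after = extract_features(convert(original))
        changed = {
            key: (value, after.get(key))
            for key, value in before.items()
            if after.get(key) != value
        }
        if changed:
            report[name] = changed
    return report
```

An empty report means every listed feature survived in every sample; it says nothing about features that `extract_features` does not measure, so the choice of features matters as much as the choice of documents.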
LTPF suggestions
• Documents
– PDF/A, ODF
• Images
– TIFF, JPEG2000, JPEG (if already in JPEG)
• Video
– MPEG2 or MPEG4
Normalisation challenges
• Many types of data have no suitable LTPF (e.g. CAD/GIS)
• Long tail of formats (you will never be able to assign an LTPF for all types of digital object)
• Loss of characteristics in the normalisation
• Increasing complexity of digital objects (i.e. formats embedded within formats)
Digital rights management
• DRM systems are designed to control (prevent) access to digital objects
– Owner of digital object removes right of access
– May not permit access even though it is required (e.g. investigations)
– DRM system ceases to exist
• DRM systems do not recognise an organisation’s right to use their records
• Trusted Computing and Digital Rights Management Principles and Policies, NZESC
– http://www.e.govt.nz/policy/tc-and-drm/principles-policies-06/tc-drm-0906.pdf
Is it evidence?
(Context)
Core Issues
• If you cannot find it, it does not exist
• If you can find it, and cannot understand the context, it is meaningless
– Users are interested in the story, not a document
• If you cannot show its authenticity, integrity, and context, it may have low evidential weight
It’s all basic records management
• Create the record as part of the business process (authenticity)
– This includes putting it aside
• Putting the record in its context
– Tell the story – who, what, where and when
• Show that the record has not been subsequently modified
– Audit log
Key requirements
• Making sure that records are created in their context (business issue)
• Having someplace to put the records and capture their context
– Electronic Document & Records Management System (EDRMS)
– Classification system
If you do not have an EDRMS?
• Do whatever you can…
• Set up classification system in
– Email system
– Corporate file server
• Good idea even if you plan to get an EDRMS
– It gets everyone used to using a classification system
Why is metadata important?
• Who, what, where and when is answered by metadata associated with record
– Captured (ideally) by system when record is created
– Entered by user
• Many different metadata standards
NAA/ANZ metadata standard
• Proposed basis for an Australian recordkeeping standard
• Australian Government Recordkeeping Standard version 2.0
– http://www.naa.gov.au/Images/AGRkMS_Final%20Edit_16%2007%2008_Revised_tcm2-12630.pdf
Minimum metadata to be kept
• Identifier (unique id referring to this object)
• Name (human readable tag)
• Start date (creation date)
• Contextual link (relation with file, series)
• Change history (demonstrating integrity)
• Disposal (when and how to dispose of record)
• Extent (size)
• Agent (organisation or person associated with record)
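The minimum set above maps naturally onto a small data structure with a completeness check. The sketch below is illustrative only: the field names and types are assumptions and are not taken from the recordkeeping standard itself. It also assumes that the change history should hold at least one entry, for example the creation event.

```python
from dataclasses import dataclass, field, fields
from datetime import date
from typing import List, Optional


@dataclass
class RecordMetadata:
    """Minimum metadata for one record (field names are illustrative)."""
    identifier: Optional[str] = None        # unique id for this object
    name: Optional[str] = None              # human readable tag
    start_date: Optional[date] = None       # creation date
    contextual_link: Optional[str] = None   # relation with file, series
    change_history: List[str] = field(default_factory=list)
    disposal: Optional[str] = None          # when and how to dispose
    extent_bytes: Optional[int] = None      # size
    agent: Optional[str] = None             # organisation or person


def missing_elements(record: RecordMetadata) -> List[str]:
    """Return the names of elements that are absent or empty."""
    missing = []
    for spec in fields(record):
        value = getattr(record, spec.name)
        if value is None or value == "" or value == []:
            missing.append(spec.name)
    return missing
```

A check like this can run whenever a record is captured, so that gaps are found while the person who knows the answers is still available.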
What can you do now – storage
• Make sure that your organisation can preserve the bits
– Survey holdings of media to discover the extent of your problem
– Move records off unmanaged, obsolete, deteriorating media
– Ensure back-up and disaster recovery systems are in place and working
– Sample records to detect corruption and decay
– Plan to migrate to new technology
What can you do now – access
• Make sure that your organisation can turn the files into something a human can understand
– Survey holdings of records to understand what formats you have and the importance of the records
– Perform a risk assessment on the formats
– Choose an LTPF and normalise high risk formats
– Encourage use of LTPF for business
What can you do now – context
• Make sure that digital objects are records
– Organise the objects so that they have a context (classification)
– Move towards an EDRMS or business application that captures the records, preserves their context, and protects their integrity
Questions?