CHITO N. ANGELES

Standards and procedure in digitization and digital preservation

Are there standards for digitization or digital archiving?

Yes, but limited to certain aspects only.

ISO/TR 13028:2010 - Information and documentation - Implementation guidelines for digitization of records.

Not applicable to: technical specifications for the digital capture of records; technical specifications for the long-term preservation of digital records; or digitization of existing archival holdings for preservation purposes, etc.

ISO 19005-1:2005; ISO 19005-2:2011; ISO 19005-3:2012 (under development) - Document management - Electronic document file format for long-term preservation.

Specifies how to use the Portable Document Format (PDF) for long-term preservation of electronic documents.

Standard is known as PDF/A.

Unlike preservation microfilming and photocopying, there are no formal standards that govern the capture, processing, and storage of digital images.

There are, however, a number of projects and publications that have set forth best practices for creating high-quality digital images, access systems, and storage systems.

Digitization - also known as imaging or scanning, the means of converting hard-copy, or non-digital, records into digital format.

Hard-copy or non-digital records include audio, visual, image or text.

Digitization may also be undertaken by taking digital photographs of the source records, where appropriate.

Source: Government Recordkeeping Group, Archives New Zealand. Continuum Create and Maintain: Digitisation Standard (2005).

Digital Preservation - a process by which digital data is preserved in digital form in order to ensure the usability, durability and intellectual integrity of the information contained therein.

A more precise definition is: the storage, maintenance, and accessibility of a digital object over the long term, usually as a consequence of applying one or more digital preservation strategies.

These strategies may include technology preservation, technology emulation or data migration.

Source: The NINCH Guide to Good Practice in the Digital Representation and Management of Cultural Heritage Materials (2002).

Born Digital - Digital materials which are created and retained in digital form.

May or may not have a non-digital equivalent.

Source: Government Recordkeeping Group, Archives New Zealand. Continuum Create and Maintain: Digitisation Standard (2005).

Digital Repository / Archive - a place where digital content (assets) is stored and can be searched and retrieved for later use.

A repository supports mechanisms to import, export, identify, store and retrieve digital assets.

Putting digital content into a repository enables staff and institutions to then manage and preserve it, and therefore derive maximum value from it.

Digital repositories may include research outputs and journal articles, theses, e-learning objects and teaching materials, or research data.

Source: Digital Repositories: Helping universities and colleges. JISC, August 2005.

Master - A faithful digital reproduction of a document, optimized for longevity and for production of a range of delivery versions (derivatives).

Masters are captured at the highest practicable quality or resolution and stored for long-term usage.

Typically, masters are stored in an off-line mode on tape or CD and are accessed only for the production of derivative images.

Source: Government Recordkeeping Group, Archives New Zealand. Continuum Create and Maintain: Digitisation Standard (2005).

Derivative - an image created from the master image, through some kind of image editing process to create a user or working copy.

The process usually involves a loss of information to reduce the size by sampling it to a lower resolution, using lossy compression techniques, or altering an image using image processing techniques.

Typically, derivatives are made for purposes such as web access, including “thumbnail” images, or as “reference” or “service” images that should fit completely within an average monitor.

Source: Government Recordkeeping Group, Archives New Zealand. Continuum Create and Maintain: Digitisation Standard (2005).

Digital images - electronic snapshots taken of a scene or scanned from documents, such as photographs, manuscripts, printed texts, and artwork.

The digital image is sampled and mapped as a grid of dots or picture elements (pixels).

Each pixel is assigned a tonal value (black, white, shades of gray or color), which is represented in binary code (zeros and ones).

Resolution - a measure of the ability to capture detail in the original work.

The spatial frequency at which a digital image is sampled (the sampling frequency) is often a good indicator of resolution.

Dots-per-inch (dpi) or pixels-per-inch (ppi) are common and synonymous terms used to express resolution for digital images.

Pixel Dimensions - the horizontal and vertical measurements of an image expressed in pixels.

May be determined by multiplying both the width and the height by the dpi.

Example: an 8" x 10" document scanned at 300 dpi has the pixel dimensions of 2,400 pixels (8" x 300 dpi) by 3,000 pixels (10" x 300 dpi).
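The multiplication above can be sketched in a few lines of Python. The function name `pixel_dimensions` is a hypothetical helper introduced here for illustration only; it simply applies the rule of multiplying each physical dimension (in inches) by the dpi.

```python
def pixel_dimensions(width_in, height_in, dpi):
    """Return (width_px, height_px) for a document scanned at the given dpi.

    width_in and height_in are the physical dimensions in inches.
    """
    return round(width_in * dpi), round(height_in * dpi)


# The 8" x 10" document scanned at 300 dpi from the example above.
print(pixel_dimensions(8, 10, 300))  # (2400, 3000)
```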

Bit Depth - determined by the number of bits used to define each pixel.

The greater the bit depth, the greater the number of tones (grayscale or color) that can be represented.

Digital images may be produced in black and white (bitonal), grayscale, or color.

A bitonal image is represented by pixels consisting of 1 bit each, which can represent two tones (typically black and white), using the values 0 for black and 1 for white or vice versa.

A grayscale image is composed of pixels represented by multiple bits of information, typically ranging from 2 to 8 bits or more.

A color image is typically represented by a bit depth ranging from 8 to 24 or higher.

With a 24-bit image, the bits are often divided into three groupings: 8 for red, 8 for green, and 8 for blue. Combinations of those bits are used to represent other colors.

A 24-bit image offers 16.7 million (2^24) color values.
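The relationship between bit depth and the number of representable tones is a power of two. As a small sketch (the name `tone_count` is a hypothetical helper, not part of any library):

```python
def tone_count(bit_depth):
    """Number of distinct tonal values a pixel of this bit depth can hold."""
    return 2 ** bit_depth


# Bitonal, 8-bit grayscale, and 24-bit color.
for bits in (1, 8, 24):
    print(bits, tone_count(bits))
# 1 2
# 8 256
# 24 16777216
```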

File Size - calculated by multiplying the surface area of the document to be scanned (height x width) by the bit depth and by dpi^2.

Because image file size is represented in bytes, which are made up of 8 bits, divide this figure by 8.

Formula 1 for File Size: FS = (height x width x bit depth x dpi^2) / 8

Example: Compute the file size of a US-Letter size page captured in 8-bit Grayscale at 100dpi.

FS = (8.5 x 11 x 8 x 100^2) / 8

FS = 935,000 bytes.
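Formula 1 can be checked with a short Python sketch. The function name `file_size_from_area` is hypothetical, and the result is the uncompressed image data only, ignoring any file header.

```python
def file_size_from_area(height_in, width_in, bit_depth, dpi):
    """Uncompressed file size in bytes from physical size, bit depth, and dpi.

    Implements FS = (height x width x bit depth x dpi^2) / 8.
    """
    total_bits = height_in * width_in * bit_depth * dpi ** 2
    return total_bits / 8  # 8 bits per byte


# US-Letter page (8.5" x 11"), 8-bit grayscale, 100 dpi.
print(file_size_from_area(11, 8.5, 8, 100))  # 935000.0
```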

If the pixel dimensions are given, multiply them by each other and the bit depth to determine the number of bits in an image file.

Formula 2 for File Size: FS = (pixel dimensions x bit depth) / 8

Example: Compute the file size of a 24-bit image captured with a digital camera with pixel dimensions of 2,048 x 3,072.

FS = (2048 x 3072 x 24)/8

FS = 18,874,368 bytes.
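Formula 2 works the same way when the pixel dimensions are already known. In the sketch below, `file_size_from_pixels` is a hypothetical name; rounding up to a whole byte is an assumption added for images whose bit count is not a multiple of 8.

```python
def file_size_from_pixels(width_px, height_px, bit_depth):
    """Uncompressed file size in bytes from pixel dimensions and bit depth.

    Implements FS = (pixel dimensions x bit depth) / 8, rounded up to a
    whole byte (an assumption for bit counts not divisible by 8).
    """
    total_bits = width_px * height_px * bit_depth
    return (total_bits + 7) // 8


size = file_size_from_pixels(2048, 3072, 24)
print(size)              # 18874368
print(size / 1024 ** 2)  # 18.0 (size in MiB)
```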

Compression - algorithms designed to reduce the size of the image for storage or transmission.

Lossless schemes (e.g., ITU-T.6) abbreviate the binary code without discarding any information, so that when the image is "decompressed" it is bit-for-bit identical to the original. Most often used with bitonal scanning of textual material.

Lossy schemes (e.g., JPEG) utilize a means for averaging or discarding the least significant information, based on an understanding of visual perception. Typically used with tonal images.
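The difference between the two families can be illustrated with a toy sketch. It does not implement ITU-T.6 or JPEG: Python's standard `zlib` (DEFLATE) stands in for a lossless scheme, and dropping the four least significant bits of each pixel stands in for a lossy step. The sample pixel data is made up for the illustration.

```python
import zlib

# Hypothetical 8-bit grayscale data: a smooth ramp repeated four times.
pixels = bytes(range(256)) * 4

# Lossless: decompressing returns data that is bit-for-bit identical.
packed = zlib.compress(pixels)
restored = zlib.decompress(packed)
assert restored == pixels

# Lossy (toy stand-in): discard the 4 least significant bits of each pixel
# before compressing. The discarded detail cannot be recovered afterwards.
quantized = bytes(p & 0xF0 for p in pixels)
lossy_packed = zlib.compress(quantized)
assert zlib.decompress(lossy_packed) != pixels

# Compare the original size with the two compressed sizes.
print(len(pixels), len(packed), len(lossy_packed))
```

Real lossy codecs choose what to discard based on visual perception rather than simply truncating bits, but the trade-off is the same: smaller files in exchange for information that is gone for good.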

File Formats - consist of both the bits that comprise the image and header information on how to read and interpret the file.

File formats vary in terms of resolution, bit-depth, color capabilities, and support for compression and metadata.

Optical Character Recognition (OCR) - a technology that enables you to convert different types of documents, such as scanned paper documents, PDF files or images captured by a digital camera into editable and searchable data.

Source: http://finereader.abbyy.com/about_ocr/whatis_ocr/

Quality (usability, functionality)

Persistence (long-term access)

Interoperability (e.g., across platforms and software environments)

Storage Space (file size)

Storage Hardware

Storage Media (e.g., DVDs, CDs)

Master copies should be created to the highest technical standards achievable.

Image formats should be open (non-proprietary), with published technical specifications available in the public domain.

Image formats should be widely supported by many software applications and operating systems.

Digitize an original or first generation (i.e., print rather than microfilm) of the source material to achieve the best quality image possible.

Create backup copies of all files on servers and storage media (e.g., DVDs) and have an off-site backup strategy.

Create meaningful metadata for image files or collections.

Prior to digitization, consideration of third party copyright or other constraints inherent in the record should be resolved.

OCR should be performed on all digital reproductions where the content is primarily textual and computer processed. Collections that are photographic in nature, and those that are not computer processed, do not require OCR.

Plan for future technological developments and migration.

Tagged Image File Format (TIFF)

Extensions: .tif, .tiff

Bit-depths: 1-bit bitonal; 4- or 8-bit grayscale or palette color; up to 64-bit color.

Compression:

◦ Uncompressed

◦ Lossless: ITU-T.6, LZW, etc.

◦ Lossy: JPEG

Standard/ Proprietary: De facto standard.

Web Support: plug-in or external application.

Supports multiple images/file (multi-page).

Joint Photographic Expert Group (JPEG) / JPEG File Interchange Format (JFIF)

Extensions: .jpg, .jpeg, .jif, .jfif

Bit-depths: 8-bit grayscale; 24-bit color.

Compression: Lossless; Lossy: JPEG.

Standard/ Proprietary: JPEG: ISO 10918-1/2; JFIF: de facto standard.

Web Support: Native since Microsoft Internet Explorer 2, Netscape Navigator 2.

JP2-JPX/ JPEG 2000

Extensions: .jp2, .jpx, .j2k, .j2c

Bit-depths: supports up to 2^14 channels, each with 1-38 bits; gray or color.

Compression:

◦ Uncompressed

◦ Lossless/Lossy: Wavelet.

Standard/ Proprietary: JPEG: ISO/IEC 15444 parts 1-6, 8-11.

Web Support: Plug-in.

Portable Document Format (PDF)

Extension: .pdf

Bit-depths: 4-bit grayscale; 8-bit color; up to 64-bit color support.

Compression:

◦ Uncompressed

◦ Lossless: ITU-T.6, LZW, JBIG

◦ Lossy: JPEG

Standard/ Proprietary: De facto standard.

Web Support: Plug-in or external application.

Contains OCR text layer.

DjVu, pronounced “day·zha·voo”

Extension: .djvu

Bit-depths: 1-bit bitonal, 4- to 8-bit grayscale; 24-bit color support.

Compression: Lossless: JB2, IW44; Lossy.

Standard/ Proprietary: Emerging standard.

Web Support: Plug-in or external application.

Supports multiple images/file (multi-page).

Contains OCR text layer.

DjVu offers a high-quality image compression technique:

◦ Scanned bitonal at 300 dpi: 5-40 KB per page (3-10 times smaller than TIFF/G4).

◦ 5-10 times smaller than JPEG or PDF.

Image Masters

◦ Preservation / Archive Copy

◦ Uncompressed

◦ Highest possible quality recommended

Derivatives

◦ Display / Viewing / Reading

◦ Printing

◦ Thumbnails

Image Masters

◦ TIFF

◦ JPEG (if using digital cameras)

Derivatives / Deliverables

◦ Text / Documents: PDF, DjVu

◦ Photographs: PNG, DjVu

Black and White

◦ File Format: TIFF

◦ Compression: Uncompressed or lossless compressed using CCITT Group 4 (ITU-T.6)

◦ Resolution and Bit Depth: 600 dpi, bitonal

Grayscale

◦ File Format: TIFF

◦ Compression: Uncompressed or lossless compressed using LZW or JPEG 2000

◦ Resolution and Bit Depth: 300 dpi, 8-bit grayscale

Color

◦ File Format: TIFF

◦ Compression: Uncompressed or lossless compressed using LZW or JPEG 2000

◦ Resolution and Bit Depth: 300 dpi, 24-bit color

Thumbnail

◦ File Format: JPEG

◦ Compression: Lossy

◦ Resolution: 72-100 dpi

View / Service Copy

◦ File Format: JPEG / PDF / DjVu

◦ Compression: Lossy

◦ Resolution: 72-100 dpi

Print Copy

◦ File Format: PDF / DjVu

◦ Compression: Lossy

◦ Resolution: 100-150 dpi
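When a derivative is produced by resampling a master to a lower resolution, its pixel dimensions shrink in proportion to the dpi. A minimal sketch of that arithmetic follows; `derivative_dimensions` is a hypothetical helper, and it assumes the derivative represents the same physical page size as the master.

```python
def derivative_dimensions(master_px, master_dpi, target_dpi):
    """Pixel dimensions of a derivative resampled from a master image.

    master_px is (width_px, height_px) of the master. The physical size
    of the page is assumed to stay the same; only the dpi changes.
    """
    width_px, height_px = master_px
    return (
        round(width_px * target_dpi / master_dpi),
        round(height_px * target_dpi / master_dpi),
    )


# A 2,400 x 3,000 pixel master captured at 300 dpi.
print(derivative_dimensions((2400, 3000), 300, 100))  # (800, 1000)
print(derivative_dimensions((2400, 3000), 300, 72))   # (576, 720)
```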

Flatbed Scanner

◦ Best known and largest selling scanner

Sheet Feed Scanner

◦ Uses the same basic technology as flatbeds, but maximizes throughput, usually at the expense of quality.

◦ Designed for high-volume scanning

Overhead Scanner

◦ High speed book scanner.

◦ Sometimes referred to as “Planetary scanner”

◦ Bound volumes can be placed face up for scanning

V-Shaped Book Scanner

◦ Uses digital SLR cameras and a unique v-shaped, auto-adjusting book cradle and platen to capture sharp images at up to 700 pages an hour.

◦ Natively captures flat images. No need for page curvature correction.

Image Capture and Processing

◦ IrfanView (Freeware): image capture, conversion, processing

◦ Adobe Acrobat (Proprietary): PDF creation, conversion, processing; OCR; watermarks

◦ Document Express Editor (Proprietary): DjVu creation, conversion, processing; OCR

Image Capture

Image Processing

Quality Control

Delivery

Storage and Backup

Document(s) or other materials are captured in digital form using a scanner or digital camera.

Guidelines and Procedures:

◦ Pre-scanning: prepare an item-level inventory list.

◦ Copyright Statement: should accompany each digital file. If accessed from the web, the copyright statement can be displayed on the website (if the same rights apply to all items on the site).

Image editing (if necessary)

◦ Compression of files, sharpening of images, deskewing, image rotation, cropping, deleting and reordering pages.

Optical Character Recognition

Creating Derivatives

Adding Watermarks

Adding Security (e.g., restrictions on copying, printing, or extraction, and password protection)

Creation of metadata describing the scanned materials.

What to look for when checking digital images for quality:

◦ Missing pages.

◦ Incorrect order of pages.

◦ Pages of different sizes.

◦ Readability of text.

◦ Black or white areas on parts of the page that cover the content.

◦ Image not the correct size

◦ Image in wrong resolution

◦ Image in wrong file format

◦ Image in wrong mode or bit-depth

◦ Overall light problems (e.g., too dark)

◦ Loss of detail in highlights or shadows

◦ Poor contrasts

◦ Uneven tone or flares

◦ Missing scan lines or dropped-out pixels

◦ Lack of sharpness

◦ Excessive sharpening

◦ Image in wrong orientation

◦ Image not centered or skewed

◦ Incomplete or cropped images

◦ Excessive noise (see dark areas)

◦ Misaligned color channels

◦ Image processing and scanner artifacts (e.g., extraneous lines, noise, banding)
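Some of the checks in this list (wrong resolution, file format, or bit depth; missing or misordered pages) are mechanical and could be automated, while tonal and sharpness problems still need visual inspection. The sketch below is one possible approach under stated assumptions: image metadata is assumed to be already available as plain dictionaries, the `SPEC` values follow the grayscale master recommendation given earlier, and all names (`SPEC`, `check_image`, `check_page_sequence`) are hypothetical.

```python
# Hypothetical target specification for a grayscale master image.
SPEC = {"format": "TIFF", "dpi": 300, "bit_depth": 8}


def check_image(meta, spec=SPEC):
    """Return a list of problems found in one image's metadata record."""
    problems = []
    for key in ("format", "dpi", "bit_depth"):
        if meta.get(key) != spec[key]:
            problems.append(
                f"wrong {key}: expected {spec[key]}, got {meta.get(key)}"
            )
    return problems


def check_page_sequence(page_numbers, expected_count):
    """Report missing pages and incorrect page order for one document."""
    problems = []
    expected = set(range(1, expected_count + 1))
    missing = sorted(expected - set(page_numbers))
    if missing:
        problems.append(f"missing pages: {missing}")
    if page_numbers != sorted(page_numbers):
        problems.append("pages out of order")
    return problems


print(check_image({"format": "JPEG", "dpi": 300, "bit_depth": 8}))
# ['wrong format: expected TIFF, got JPEG']
print(check_page_sequence([1, 2, 4, 3], expected_count=5))
# ['missing pages: [5]', 'pages out of order']
```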

The process of getting the scanned images to the user through computer networks/Web, monitors, and printers.

Delivery Methods

◦ Removable Storage Devices

◦ Optical Media (CDs, DVDs)

◦ Static Web Pages

◦ Digital Repositories

Recommended Digital Repository software:

◦ EPrints

◦ DSpace

◦ Greenstone

Strategies for storage and backup may include:

◦ Dedicated server or shared storage solution: database systems; file-based systems (FTP, WebDAV, shared folders).

◦ Writing the digitized records to magnetic tape.

◦ Writing the digitized records to optical media (e.g., CD, DVD).
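Whichever storage strategy is chosen, it helps to confirm that a backup copy is identical to the file it was made from. The slides do not prescribe a verification method; the sketch below shows one common approach, comparing SHA-256 checksums with Python's standard `hashlib`. The function names and file paths are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(original_path, backup_path):
    """True if the backup copy has the same checksum as the original."""
    return sha256_of(original_path) == sha256_of(backup_path)


# Example call with hypothetical paths:
# backup_matches("masters/page_0001.tif", "/mnt/backup/masters/page_0001.tif")
```

Storing the checksum alongside each master also makes it possible to detect silent corruption on tape or optical media at a later date.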