Digital Archiving – A Workflow K P Raghuraman National Centre for Science Information Indian Institute of Science, Bangalore NAMASKARA

April 19, 2023. Archives and Publication Cell, IISc

Acknowledgements

• Organizers
• Mr. Francis Jayakant
• Mr. Filbert Minj
• Friends who supported me in the effort
• Internet

Digital Archiving

What is a Digital Archive

• Documented information and storage system
• Holds permanent, fixed data for a long time (?) in a structured and easily accessible way
• Employs information architecture configured to assure trustworthiness and long-term retention

Digital Archiving – Need

• A practical task for keeping documents intact for future use
• Improved access to information resources, with preservation and dissemination as required
• Access at any time and from any place

Digital Archiving – Benefits

Digitisation

• Contributes to conservation of physical resources
• Enables effective sharing of information and contributes to knowledge flow
• Unlocks information that was previously difficult to access in paper form
• Provides digital surrogates, which reduce wear and tear on originals and can be made more legible
• Negates the need to handle the originals
• Allows access to information to be restricted even when it is provided remotely
• Provides a customizable user interface for a collaborative working environment
• Gives faster support for queries and questions
• Saves cost on paper and time in finding information

Digital Archiving – Advantages

Improved searching mechanisms

• Metadata search, full-text search, Boolean search
• Supports simultaneous searching, in a standardised form, across a range of resource categories
• Information, rather than media, can be collated to support a query, regardless of the original source material type

Space saving

• The content of about 3000 kg of paper could be stored on a DVD
• Data can be recombined for manipulation and compressed for various applications
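As a rough illustration of the Boolean search mentioned above, the following minimal sketch combines AND, OR and NOT terms over a record's title and full text. The `Record` fields, the `search` function and the sample records are hypothetical and only serve the example; a real archive would normally rely on a search engine or repository software.

```python
from dataclasses import dataclass


@dataclass
class Record:
    # Hypothetical record: a metadata field (title) plus extracted full text.
    title: str
    text: str


def tokens(s: str) -> set[str]:
    """Lower-case words with surrounding punctuation stripped."""
    return {w.strip(".,;:()").lower() for w in s.split()}


def search(records, all_of=(), any_of=(), none_of=()):
    """Boolean search: AND over all_of, OR over any_of, NOT over none_of."""
    hits = []
    for r in records:
        words = tokens(r.title) | tokens(r.text)
        if not all(t.lower() in words for t in all_of):
            continue
        if any_of and not any(t.lower() in words for t in any_of):
            continue
        if any(t.lower() in words for t in none_of):
            continue
        hits.append(r)
    return hits


# Invented sample data, for illustration only.
records = [
    Record("Annual report 1950", "Council minutes and budget"),
    Record("Letter to the Director", "Request for laboratory budget"),
    Record("Photograph album", "Convocation 1950"),
]

# "budget AND NOT letter"
result = search(records, all_of=["budget"], none_of=["letter"])
print([r.title for r in result])
```

Because the query runs over text rather than over the physical medium, the same search can cover letters, reports and photograph captions alike.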

Digital Archiving – Technology and Process

The digital record is a mirror image of the original analogue/paper-based file in terms of

• Page layout and number of pages
• Handwritten text, graphics and logos
• Colour of the original document

These images are then rendered into the desired format (e.g. PDF) for archiving, printing and distribution

Creation of metadata, used for search and indexing

Additional metadata providing contextual information

• Who uses the records
• How they will be used
• When they will be used
• Access codes to prevent unauthorized access
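A metadata record of the kind described above could be sketched as a small data structure. The field names here (`identifier`, `intended_users`, `access_code_required` and so on) are assumptions made for illustration, not a metadata standard and not the schema used by the author.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class ArchiveRecord:
    # Descriptive metadata used for search and indexing (hypothetical fields).
    identifier: str
    title: str
    pages: int
    file_format: str
    keywords: list[str] = field(default_factory=list)
    # Contextual metadata: who uses the record and how.
    intended_users: str = ""
    intended_use: str = ""
    # Access control flag.
    access_code_required: bool = False


def can_open(rec: ArchiveRecord, has_access_code: bool) -> bool:
    """A record is readable if it is unrestricted or the user holds a code."""
    return has_access_code or not rec.access_code_required


record = ArchiveRecord(
    identifier="item-0001",
    title="Sample letter",
    pages=2,
    file_format="PDF",
    keywords=["letter", "sample"],
    intended_users="researchers",
    intended_use="reference",
    access_code_required=True,
)

print(asdict(record))
print(can_open(record, has_access_code=False))
```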

Digitisation

Crude definition

• Scan
• Save

Is it just scan and save?

Is there a workflow?

Are there guidelines for the whole process?

Digitisation

Definition

• Converting written and printed information into electronic form
• Creating a computerised version of a printed analogue

Contents

• Text, image, audio or a combination of these (multimedia)

Objective of digitisation

• Create content for databases
• Facilitate access
• Preservation
• Dissemination of information resources

Digitisation Process

Output

Electronic document

• Tagged Image File Format (TIFF)
• Portable Document Format (PDF)
  – Useful for hosting information on the intranet
  – Platform independent
  – PDF readers are available as free downloads

Digitisation - Objects and Process

• Image
• Text
• Audio
• Video

• The scanner captures images.
• Software analyses the images and creates text and images.
• Software converters convert raw audio and raw video to a standard digital format.

Digitisation - Issues

Hardware

• Computer
• Scanner

Software

• Communication software between PC and scanner: TWAIN compliant
• Image processing: Photoshop, Macromedia Fireworks, etc.
• OCR (Optical Character Recognition), which converts scanned text material to text: ABBYY, OmniPage

Suitable policy

• A consistent quality threshold for scanned images.
• Choosing an appropriate image format: TIFF, JPEG, etc.
• Choosing an appropriate file name scheme.
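One possible file name scheme is sketched below. The pattern (collection slug, zero-padded item number, zero-padded page number) is an assumption chosen for illustration, not a scheme prescribed in this presentation; the point is only that names should be predictable and sort in page order.

```python
import re


def scan_file_name(collection: str, item: int, page: int, ext: str = "tif") -> str:
    """Build a predictable, sortable file name for one scanned page.

    Hypothetical pattern: <collection-slug>_<item>_p<page>.<ext>
    """
    # Reduce the collection name to lower-case letters, digits and hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", collection.lower()).strip("-")
    # Zero padding keeps alphabetical order identical to numeric order.
    return f"{slug}_{item:05d}_p{page:04d}.{ext}"


names = [scan_file_name("Council Minutes", 12, p) for p in (10, 2, 1)]
print(sorted(names))
```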

Scanners

• Flatbed scanner: the normal desktop scanner
• Sheet-fed scanner: same as above, but the document moves and the scan head is immobile
• Handheld scanner: used to capture text; about the size of a pen
• Drum scanner: used in the publishing industry
• Planetary scanner: for scanning books

Types of Images

• 1-bit black and white: each pixel is either black or white
  – Used for printed text or line graphics
  – Unsuitable for photographic images
• 8-bit grey scale: 256 shades of grey
  – Black and white photographs
  – Non-color documents
• 8-bit color: 256 colors
  – Low quality images
• 24-bit color: 16.8 million shades of color
  – Ideal for archival quality images
  – For color photo printing
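The counts quoted above follow from the bit depth: n bits per pixel can encode 2^n distinct values. A short check (the function name `levels` is just for this example):

```python
def levels(bits_per_pixel: int) -> int:
    """Number of distinct values a pixel of the given bit depth can hold."""
    return 2 ** bits_per_pixel


for bits in (1, 8, 24):
    print(f"{bits:>2}-bit: {levels(bits):,} values")
```

For 24 bits this gives 16,777,216, which is the "16.8 million" figure on the slide.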

Resolution

• Measured in dots per inch (dpi)
• The higher the dpi, the larger the file size

Resolution (dpi)        400     300     200     100
1-bit black and white   20 K    11 K    5 K     1 K
8-bit B&W or color      158 K   89 K    39 K    9 K
24-bit color            475 K   267 K   118 K   29 K
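The relationship between resolution, bit depth and file size can be worked out directly: an uncompressed image needs (width in inches × dpi) × (height in inches × dpi) × bits per pixel / 8 bytes. The sketch below applies this to one square inch at bit depths of 1, 8 and 24. That the tabulated figures refer to roughly one square inch of uncompressed image is an inference from the numbers, not something stated on the slide, and real files add header overhead.

```python
def raw_size_bytes(width_in: float, height_in: float, dpi: int, bits_per_pixel: int) -> int:
    """Uncompressed image size in bytes, ignoring any file header."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


# One square inch at each resolution and bit depth.
for bits in (1, 8, 24):
    row = [raw_size_bytes(1, 1, dpi, bits) for dpi in (400, 300, 200, 100)]
    print(f"{bits:>2}-bit:", row)
```

Halving the dpi quarters the size, and going from 8-bit to 24-bit triples it, which matches the pattern in the table.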

Image - Size

• Image size is measured in pixels
• Image size varies with the scanning resolution
• Modification of image size is called resampling
• On screen, image pixels are mapped onto screen pixels
• One screen pixel displays one image pixel and can take any RGB value
• 800 x 600 pixels: 14” monitor
• 1024 x 768 pixels: 16” monitor
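To make the link between scanning resolution, pixel size and resampling concrete, the sketch below computes the pixel dimensions of a scanned page and the size it would be resampled to in order to fit a screen. The page size (8.5 x 11 inches) and the helper names are assumptions for the example; only the arithmetic is shown, not the actual pixel interpolation an image editor performs.

```python
def pixel_size(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a scan: inches multiplied by dots per inch."""
    return round(width_in * dpi), round(height_in * dpi)


def fit_within(size: tuple[int, int], box: tuple[int, int]) -> tuple[int, int]:
    """Target size when resampling to fit inside box, keeping the aspect ratio.

    The scale is capped at 1.0 so that small images are never enlarged.
    """
    scale = min(box[0] / size[0], box[1] / size[1], 1.0)
    return round(size[0] * scale), round(size[1] * scale)


scan = pixel_size(8.5, 11, 300)          # a page scanned at 300 dpi
on_screen = fit_within(scan, (1024, 768))
print(scan, on_screen)
```

A 300 dpi page is several times larger than the screen in each direction, which is why a smaller resampled copy is normally used for viewing.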

Image – File Formats

Some standard image formats

• TIFF: Tagged Image File Format
• JPEG: Joint Photographic Experts Group
• DjVu: "déjà vu" (a free file format)
• GIF: Graphics Interchange Format
• PNG: Portable Network Graphics
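Each of these formats can be recognised from the first few bytes of a file, which is useful when file extensions are missing or wrong. The sketch below checks the well-known leading signatures; the function name `sniff_format` is hypothetical, and a production workflow would more likely use a dedicated format identification tool.

```python
# Leading byte signatures ("magic numbers") of the formats listed above.
SIGNATURES = [
    (b"II*\x00", "TIFF"),            # little-endian TIFF
    (b"MM\x00*", "TIFF"),            # big-endian TIFF
    (b"\xff\xd8\xff", "JPEG"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"AT&TFORM", "DjVu"),
]


def sniff_format(header: bytes) -> str:
    """Guess the image format from the first bytes of a file."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


print(sniff_format(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8))
print(sniff_format(b"plain text"))
```

In practice the header would come from something like `open(path, "rb").read(16)`.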

TIFF

• Multiple images and data in the same file
• Tags in the file header (information on size, compression)
• Lossless format, useful for archival images
• Platform independent
• Useful for future modification: can be edited without compression loss

Disadvantage

• File size is very large

JPEG

• Strongest format for web images and for printing images
• Superior quality can be produced
• Variety of compression levels
• Best method for online viewing

Disadvantage

• Lossy compression format

GIF

• Very old format
• Lossless compression format
• Needs less storage space
• Strong candidate for graphic art and drawings

Disadvantage

• Limited to 256 colors

DjVu

• File format for saving scanned images, especially those containing text
• Uses advanced image-layer separation of text and images
• High quality, readable images stored in minimum space: useful for the web
• Progressive loading: useful for the web
• Format used for the Million Books Project

PNG

• A new format
• Created to improve on the GIF format
• Supports 24-bit color or greyscale
• Provides a variety of transparency options
• Lossless data compression

Disadvantage

• New, so older software may not support it
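"Lossless" means the exact original data comes back after decompression. PNG's compression is based on DEFLATE, which Python's standard `zlib` module also implements, so the property can be illustrated with a round trip. This is only a demonstration of the principle on made-up pixel data; it does not read or write actual PNG files.

```python
import zlib

# Synthetic 8-bit greyscale "scan": 100 rows, mostly white with a dark margin.
row = bytes([255] * 90 + [0] * 10)
pixels = row * 100

packed = zlib.compress(pixels)       # DEFLATE compression
restored = zlib.decompress(packed)   # exact inverse

print(len(pixels), len(packed), restored == pixels)
```

A lossy format such as JPEG would not pass the final equality check, which is why it is kept for surrogates rather than masters.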

File Formats

Audio

• WAV
  – Microsoft/IBM audio file format
  – Lossless storage method, so files are large
• MP3 (MPEG-1 Audio Layer 3)
  – Popular digital audio encoding
  – Lossy compression format, so files are smaller
  – Can still produce a good reproduction of the original
• RealAudio (.ram)
  – Variety of audio codecs, from low bitrate to high fidelity formats
  – Streaming audio format

File Formats

Video

• MPEG-21
  – Defines a "Rights Expression Language" standard for sharing digital rights, permissions and restrictions for content, from content creator to consumer
  – XML based: can communicate machine-readable license information in a "ubiquitous, unambiguous and secure" manner
  – The main objective of MPEG-21 is to define the technology needed to support users to exchange, access, consume, trade or manipulate Digital Items in an efficient and transparent way

OCR

Optical Character Recognition

• Goal: recreate text and other elements such as tables and layout so that they can be edited in popular word processors
• Requirement: scanner and text conversion software (OCR)
• Technology: examines patterns of dots, recognizes them and writes them out as alphabetic characters and numbers

OCR - Process

• The scanner or camera produces a TIFF image
• The software cleans noise from the image and starts recognizing patterns
• Recognized patterns are converted into letters and numbers
• Unrecognized patterns are kept as images

Widely used settings

• 24-bit color
• 600 dpi (300 or 400 dpi are popular for text)
• TIFF Rev. 6, uncompressed or with LZW compression
• (PNG is currently becoming popular)
• Photographs to be scanned at twice the size
• B&W photographs in grey scale
• Text can also use the above settings and can be stored as PDF or DjVu

Popular Practices Followed

• Initially, Preservation Masters are created
  – Should be uncompressed to retain archival integrity
  – For long-term storage purposes
• Compressed web files are created as surrogate files for the repository or the website

Specific File Formats

Original   Preservation Master   Surrogates
Image      TIFF                  JPEG, DjVu
Text       TIFF                  JPEG, DjVu, PDF
Audio      Linear WAV            MP3 / RealAudio (.ram)
Video      MPEG-21

OCR - Accuracy

Depends on

• Color of the paper
• Characters should be reasonably well formed
• The font should be one of the popular ones

99% accuracy achieved with

• Bleached white paper
• 10 pt character size
• 1.5 line spacing
• Computer-based printouts
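An accuracy figure such as 99% is usually understood per character. One common way to estimate it, assumed here rather than taken from the presentation, is to compare the OCR output with a hand-checked transcription using edit distance (the number of single-character insertions, deletions and substitutions needed to turn one into the other). The function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def character_accuracy(truth: str, ocr_output: str) -> float:
    """Share of ground-truth characters recovered (1.0 means a perfect match)."""
    if not truth:
        return 1.0
    return max(0.0, 1 - edit_distance(truth, ocr_output) / len(truth))


# Two typical confusions: "l" read as "1", and "i" read as "l".
accuracy = character_accuracy("Digital Archiving", "Digita1 Archlving")
print(f"{accuracy:.1%}")
```

Measuring a sample of pages this way gives a basis for deciding whether the OCR text is good enough to use.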

OCR - Issues

Dealing with archival material

• Old text printed during the hand-press period
• Gothic and exotic fonts used
• Paper has yellowed
• Characters are often broken and not well formed due to age and environmental factors

Best Practice

• First scan and store as TIFF files
• Run OCR on the TIFF files
• Depending on the application and size, convert to PDF or any other format
• Depending on the accuracy of the OCR, use either the TIFF images or the OCR copies for the PDF

OCR – Software

• ABBYY FineReader: very popular
• OmniPage: high-end OCR tool
• Readiris: a competitor to ABBYY and OmniPage
• MODI: Microsoft Office Document Imaging (introduced with Office XP; exports to Word)

The camera produces a raw, uncorrected color photo of each page

The software cleans up the image and saves it as a high-resolution TIFF image

Using OCR, it can be converted to editable text

Summary

• Digitization is a process
• A large number of analogue items such as images, text, audio and video are captured into digital form
• Understand the variables and tasks in the process
  – Methods of capturing images
  – Conversion process performed

Summary

• Document the workflow
• This will lead to a life history for each digitized item
• Helps create consistency and reliability
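A documented workflow can be as simple as a dated log of every step applied to an item. The sketch below is one hypothetical way to keep such a life history; the class name, the step names and the `hypothetical-ocr` tool label are illustrative assumptions only.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ItemHistory:
    """Life history of one digitized item: an ordered list of dated events."""
    identifier: str
    events: list[dict] = field(default_factory=list)

    def log(self, step: str, **details) -> None:
        # Record what was done, when, and with which settings.
        self.events.append({
            "step": step,
            "at": datetime.now(timezone.utc).isoformat(),
            **details,
        })


history = ItemHistory("item-0001")
history.log("scan", dpi=600, bit_depth=24, output="TIFF")
history.log("ocr", tool="hypothetical-ocr")
history.log("derive", output="PDF")

print([e["step"] for e in history.events])
```

Recording the same fields for every item is what makes the results consistent and repeatable across operators and batches.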

New Definition

• Is this the end of digitization?
• Are we through with the work?
• As in every other job, here too sustainability and maintenance are necessary

Long term maintenance

• Technology is changing rapidly
• Obstacles that may need to be overcome
  – Lack of awareness in general about how such resources may be exploited effectively for scholarly purposes
  – Lack of relevant IT skills and/or analytical methods
  – Lack of appropriate user support

Strategies for preserving data

• Preserving the data together with the hardware and software platforms from which they were originally made accessible.
• Refreshing data by copying them periodically onto new storage media.
• Migrating data through changing technical regimes by rendering them into appropriate standard interchange formats.
• Emulating the look and feel of the original data on successive generations of hardware and software platforms.
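When data are refreshed onto new media, it is common practice (assumed here, not described in the presentation) to verify that the copy is bit-for-bit identical by comparing checksums. The sketch below uses SHA-256 from the Python standard library; the file names and the `refresh` helper are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum of a file, read in chunks so large masters fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh(src: Path, dst: Path) -> str:
    """Copy src to dst and confirm the copy matches the original bitstream."""
    before = sha256_of(src)
    shutil.copyfile(src, dst)
    if sha256_of(dst) != before:
        raise IOError(f"checksum mismatch after copying {src}")
    return before


# Demonstration with temporary files standing in for old and new media.
with tempfile.TemporaryDirectory() as tmp:
    old = Path(tmp) / "old_media.tif"
    new = Path(tmp) / "new_media.tif"
    old.write_bytes(b"example bitstream")
    checksum = refresh(old, new)
    copied_ok = new.read_bytes() == old.read_bytes()

print(checksum[:12], copied_ok)
```

Storing the checksum alongside the item lets later refresh cycles detect silent corruption as well as failed copies.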

Points to ponder

Unlike paper, parchment and other traditional forms of recording medium, electronic systems and their data are not durable. Digital materials have very different preservation requirements to analogue materials, which may last for many decades through storage in optimal environmental conditions.

The other difficulty with electronic data and files is that they require the intervention of other systems to facilitate readability or usability. This innate dependency makes the files themselves very fragile. A problem in any of the supporting components can render the information useless.

It is not enough to physically preserve the storage medium or present the bitstream. Without the commensurate tools to decode and present the bitstream, a future user will be met with gibberish.

Digitization – Next Step

• Will mean preservation of materials that are ‘born digital’.
• Migration: electronic data transferred from one data format to another.
• Emulation: attempts to use current and future technologies to emulate the tools and logic used when the records and files were originally created.

Informative web sites

• Irish Virtual Research Library and Archive, Project Workbook: http://www.ucd.ie/ivrla/workbook/wdigpreservation.html
• The Arts and Humanities Data Service (AHDS) is a UK national service aiding the discovery, creation and preservation of digital resources in and for research, teaching and learning in the arts and humanities: http://ahds.ac.uk/about/publications/index.htm
