Digital Archiving – A Workflow
K P Raghuraman
National Centre for Science Information
Indian Institute of Science, Bangalore
NAMASKARA
April 19, 2023, Archives and Publication Cell, IISc
Acknowledgements
Organizers
Mr. Francis Jayakant
Mr. Filbert Minj
Friends who supported me in the effort
Internet
Digital Archiving
What is a Digital Archive?
Documented information & storage system
Holds permanent, fixed data for a long time (?) in a structured and easily accessible way
Employs information architecture configured to assure trustworthiness and long-term retention
Digital Archiving – Need
A practical task for keeping documents intact for future use
Improved access to information resources, preservation and dissemination as required
Any time, anywhere
Digital Archiving – Benefits
Digitisation contributes to
• Conservation of physical resources
• Effective sharing of information, which contributes to knowledge flow
• Unlocking information that was previously difficult to access in paper form
• Reduced wear and tear of originals through the use of digital surrogates, which can also be made more legible
• Negating the need to use originals
• Access to information can be restricted with remote access
• A customizable user interface for a collaborative working environment
• Faster support for any query or question
• Cost saving on paper and time saving in finding information
Digital Archiving – Advantages
Improved searching mechanisms
• Metadata search, full text search, Boolean search
• Supports simultaneous searching in a standardised form, across a range of resource categories.
• Information, rather than media, can be collated to support a query, regardless of the original source material type.
Space saving
• 3000 kg of paper could be saved in a DVD
• Data can be recombined for manipulation and compressed for various applications
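
The search modes listed above can be illustrated with a small sketch. It assumes a toy in-memory collection; the record identifiers, metadata fields and helper names are hypothetical and do not come from any system named in these slides.

```python
from collections import defaultdict

# Hypothetical records: identifier -> (metadata, full text).
RECORDS = {
    "rec1": ({"author": "Author A"}, "scattering of light in liquids"),
    "rec2": ({"author": "Author B"}, "scattering of electrons"),
    "rec3": ({"author": "Author A"}, "structure of diamond"),
}

def build_index(records):
    """Full text index: map each lower-cased word to the ids containing it."""
    index = defaultdict(set)
    for rec_id, (_meta, text) in records.items():
        for word in text.lower().split():
            index[word].add(rec_id)
    return index

def metadata_search(records, field, value):
    """Return ids of records whose metadata field equals the value."""
    return {rid for rid, (meta, _text) in records.items() if meta.get(field) == value}

INDEX = build_index(RECORDS)

# Boolean AND / OR / NOT map directly onto set operations.
both = INDEX["scattering"] & INDEX["light"]          # scattering AND light
either = INDEX["light"] | INDEX["diamond"]           # light OR diamond
without = INDEX["scattering"] - INDEX["electrons"]   # scattering NOT electrons
# Metadata and full text results can be combined in the same way.
author_a_scattering = metadata_search(RECORDS, "author", "Author A") & INDEX["scattering"]
```

Because every search returns a set of identifiers, results from different resource categories can be collated with the same operators.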
Digital Archiving – Technology and Process
The digital record is a mirror image of the original analogue/paper-based file in terms of
• Page layout and number of pages
• Handwritten text, graphics & logos
• Colour of the original document
These images are then rendered into the desired format (e.g. PDF) for archiving, printing and distribution
Creation of metadata, used for search and indexing
Additional metadata providing contextual information
• Who uses the records
• How will they be used
• When will they be used
• Access codes to prevent unauthorized access
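
As a rough illustration of the metadata described above, the sketch below groups descriptive fields, contextual fields and access codes in one record. All field names and the access rule are assumptions made for the example, not a schema from the talk.

```python
from dataclasses import dataclass, field

@dataclass
class ArchiveRecord:
    # Descriptive metadata used for search and indexing.
    identifier: str
    title: str
    page_count: int
    file_format: str = "pdf"
    # Contextual metadata: who uses the record, how and when.
    used_by: list = field(default_factory=list)
    intended_use: str = ""
    usage_period: str = ""
    # Access codes allowed to open the record (hypothetical scheme).
    access_codes: set = field(default_factory=set)

    def can_access(self, code: str) -> bool:
        """An empty code set means the record is open to everyone."""
        return not self.access_codes or code in self.access_codes

record = ArchiveRecord(
    identifier="item-0001",
    title="Sample annual report",
    page_count=12,
    used_by=["researchers"],
    intended_use="reference",
    usage_period="on request",
    access_codes={"staff"},
)
```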
Digitisation
Crude definition
• Scan
• Save
Is it just Scan and Save?
Is there a workflow?
Are there guidelines for the whole process?
Digitisation
Definition
• Converting written and printed information into electronic form
• Creation of a computerised version of a printed analogue.
Contents
• Text, image, audio or a combination of these (multimedia)
Objective of Digitisation
• Create content of databases
• Facilitate access
• Preservation
• Dissemination of information resources
Digitisation Process
Output
Electronic Document
• Tagged Image File Format (TIFF)
• Portable Document Format (PDF)
Useful for hosting information on the intranet
Platform independent
PDF readers are available as free downloads
Digitisation - Objects and Process
Image
Text
Audio
Video
Scanner captures images.
Software analyses images and creates texts and images
Software converters convert raw audio and raw video to standard digital formats
Digitisation - Issues
Hardware
• Computer
• Scanner
Software
• Communication software PC – Scanner – TWAIN compliant
• Image processing – Photoshop, Macromedia Fireworks etc.
• Enable text material to be converted to text, i.e. OCR (Optical Character Recognition) – ABBYY, OmniPage
Suitable Policy
• Consistent quality threshold for scanned images.
• Choosing appropriate image format – TIFF, JPEG etc.
• Choosing an appropriate file name scheme.
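
One possible file name scheme is sketched below. The pattern (collection, item number, page number) and the function name are purely illustrative assumptions; the slides do not prescribe a particular scheme.

```python
import re

def make_file_name(collection: str, item_no: int, page_no: int, ext: str = "tif") -> str:
    """Build names like 'reports_00042_p0007.tif' (hypothetical scheme)."""
    # Keep only lower-case letters and digits so names stay portable.
    slug = re.sub(r"[^a-z0-9]+", "-", collection.lower()).strip("-")
    # Zero-padded numbers make a plain alphabetical sort follow page order.
    return f"{slug}_{item_no:05d}_p{page_no:04d}.{ext}"
```

Whatever scheme is chosen, generating names from one function rather than typing them by hand keeps them consistent across the whole project.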
Scanners
Flat bed scanners
• Normal desktop scanner
Sheet fed scanners
• Same as above, but here the document moves and the scan-head is immobile
Handheld scanner
• Used to capture text – size of a pen.
Drum scanner
• Used in publishing industries
Planetary scanner
• Scanning books
Types of Images
1-bit black and white – either black or white
• Used for printed text or line graphics
• Unsuitable for images
8-bit grey scale – 256 shades of grey
• Black and white photographs
• Non-color documents
8-bit color – 256 colors
• Low quality images
24-bit color – 16.8 million shades of color
• Ideal archival quality images
• For color photo printing
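
The counts quoted above follow from the bit depth: n bits per pixel give 2^n distinct values. A minimal check (the function name is just for illustration):

```python
def shades(bits_per_pixel: int) -> int:
    """Number of distinct values a pixel can take at the given bit depth."""
    return 2 ** bits_per_pixel

for bits in (1, 8, 24):
    print(f"{bits}-bit: {shades(bits)} values")
```

For 24 bits this gives 16,777,216, which the slide rounds to 16.8 million.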
Resolution
Measured in dots per inch (dpi)
The higher the dpi, the larger the file size

Resolution (dpi)         400      300      200      100
1-bit black and white    20 K     11 K     5 K      1 K
8-bit B&W or color       158 K    89 K     39 K     9 K
24-bit color             475 K    267 K    118 K    29 K
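
The relationship between dpi, bit depth and size can be sketched as follows. The slide does not say what area its figures refer to; the one-square-inch default here is an assumption, and the function ignores compression and file headers.

```python
def uncompressed_bytes(dpi: int, bits_per_pixel: int,
                       width_in: float = 1.0, height_in: float = 1.0) -> float:
    """Raw image size in bytes, before any compression or file header."""
    pixels = (dpi * width_in) * (dpi * height_in)
    return pixels * bits_per_pixel / 8

# Doubling the resolution quadruples the size, since both dimensions double.
ratio = uncompressed_bytes(400, 8) / uncompressed_bytes(200, 8)
```

Under that assumption, 400 dpi at 1 bit per pixel comes to 20,000 bytes per square inch, and 100 dpi at 24 bits to 30,000 bytes, which is of the same order as the table's figures.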
Image - Size
Image size is measured in pixels
Image size varies with the scanned resolution
Modification of image size is called resampling
Image pixels are displayed on the pixels of the screen
One screen pixel contains one image pixel and can have any RGB value
800 x 600 pixels: 14-inch monitor
1024 x 768 pixels: 16-inch monitor
Image – File Formats
Some standard image formats
• TIFF – Tagged Image File Format
• JPEG – Joint Photographic Experts Group
• DjVu – déjà vu (a free file format)
• GIF – Graphics Interchange Format
• PNG – Portable Network Graphics
TIFF
• Multiple images and data in the same file
• Tags in file header (information on size, compression)
• Lossless format, useful for archival images
• Platform independent
• Format useful for future modification – can be edited without compression loss
Disadvantage
• File size is very large
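
To make the "tags in file header" point concrete, here is a minimal sketch that reads the fixed 8-byte header of a classic TIFF file: a byte-order mark ("II" or "MM"), the number 42, and the offset of the first tag directory. It covers classic TIFF only and does not read the tags themselves; the function name is hypothetical.

```python
import struct

def parse_tiff_header(data: bytes):
    """Return (byte_order, first_ifd_offset) or raise ValueError."""
    if len(data) < 8:
        raise ValueError("too short to be a TIFF file")
    order = data[:2]
    if order == b"II":
        endian = "<"   # little-endian
    elif order == b"MM":
        endian = ">"   # big-endian
    else:
        raise ValueError("missing TIFF byte-order mark")
    # 2-byte magic number followed by a 4-byte offset to the first directory.
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("missing TIFF magic number 42")
    return order.decode("ascii"), ifd_offset

# A synthetic header for demonstration; a real file would be read in binary mode.
sample = b"II" + struct.pack("<HI", 42, 8)
```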
JPEG
• Strongest format for web images and printing images
• Superior quality can be produced
• Variety of compression levels
• Best method for online viewing
Disadvantage
• Lossy compression format
GIF
Very old format
Lossless compression format
Less storage space
Strong candidate for graphic art and drawing.
Disadvantage
Limited to 256 colors.
DjVu
File format to save scanned images, especially with text.
Advanced technology for image layer separation of text and images.
High quality readable images, stored in minimum space – useful for web.
Progressive loading – useful for web.
Format used for the Million Books project
PNG
A new format
Created to improve on the GIF format
Supports 24-bit color or greyscale
Provides for a variety of transparency options
Lossless data compression
Disadvantage
Being new, it is not supported by older software
File Formats
Audio
WAV
• Microsoft, IBM audio file format.
• Lossless storage method – large files.
MP3 – MPEG-1 Audio Layer 3
• Popular digital audio encoding.
• Lossy compression format, so smaller files.
• Still can produce good reproduction of the original.
RealAudio – ram
• Variety of audio codecs from low bitrate to high fidelity formats
• Streaming audio format
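
The size difference between uncompressed and lossy audio can be estimated with simple arithmetic. The sampling parameters and bitrate used in the tests below (44.1 kHz, 16-bit stereo, 128 kbit/s) are illustrative assumptions, not values from the slides, and file headers are ignored.

```python
def pcm_size_bytes(sample_rate_hz: int, bits_per_sample: int,
                   channels: int, seconds: float) -> float:
    """Size of uncompressed (linear PCM) audio data, as stored in a WAV file."""
    return sample_rate_hz * (bits_per_sample / 8) * channels * seconds

def compressed_size_bytes(bitrate_kbps: int, seconds: float) -> float:
    """Size of a constant-bitrate stream such as MP3 at the given kbit/s."""
    return bitrate_kbps * 1000 / 8 * seconds
```

For one minute of audio with those assumed settings, the uncompressed data is roughly eleven times larger than the compressed stream, which is why masters and surrogates use different formats.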
File Formats
Video
MPEG-21
• Defines a "Rights Expression Language" standard
– Sharing digital rights/permissions/restrictions for content from content creator to consumer
• XML based file system
– Can communicate machine readable license information in a "ubiquitous, unambiguous and secure" manner.
The main objective of MPEG-21 is to define the technology needed to support users to exchange, access, consume, trade or manipulate Digital Items in an efficient and transparent way.
OCR
Optical Character Recognition
Goal – Recreate text and other elements like tables and layout so that they can be edited in popular word processors
Requirement – Scanner and text conversion software (OCR)
Technology – Examines patterns of dots, recognizes them and writes them as alphabetic characters and numbers
OCR - Process
The scanner or camera produces a TIFF image
The software cleans noise from the image and starts recognizing patterns
Recognized patterns are converted into letters and numbers
Unrecognized patterns are kept as images
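
The recognise-or-keep-as-image step can be illustrated with a deliberately tiny sketch. It uses 3x3 dot patterns and simple template matching, which is a stand-in chosen for brevity; commercial OCR engines use far more elaborate methods. The templates, names and error tolerance are all hypothetical.

```python
# Toy 3x3 glyph templates: "1" is an inked dot, "0" is blank paper.
TEMPLATES = {
    "I": ("010", "010", "010"),
    "L": ("100", "100", "111"),
    "T": ("111", "010", "010"),
}

def differing_dots(a, b) -> int:
    """Count the positions where two glyph bitmaps disagree."""
    return sum(ca != cb for row_a, row_b in zip(a, b) for ca, cb in zip(row_a, row_b))

def recognise(glyph, max_errors: int = 1):
    """Return the best matching character, or None to keep the glyph as an image."""
    best_char, best_dist = min(
        ((char, differing_dots(glyph, template)) for char, template in TEMPLATES.items()),
        key=lambda pair: pair[1],
    )
    return best_char if best_dist <= max_errors else None
```

A clean or slightly noisy pattern is returned as a character; anything too far from every template returns None, mirroring the rule that unrecognized patterns stay as images.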
Widely used settings
24-bit color
600 dpi (while 300 or 400 dpi are popular for text)
TIFF Rev 6 without compression or with LZW compression
(PNG is currently becoming popular)
Photographs to be scanned at twice the size
B&W photographs in grey scale
Text can also use the above settings and can be stored as PDF or DjVu
Popular Practices Followed
Initially Preservation Masters are created.
• Should be uncompressed to retain archival integrity
• For long-term storage purposes.
Compressed web files are created as surrogate files for the repository or the web site
Specific File Formats

Original    Preservation Master    Surrogates
Image       TIFF                   JPEG, DjVu
Text        TIFF                   JPEG, DjVu, PDF
Audio       Linear WAV             MP3 / RealAudio (ram)
Video       MPEG 21
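
The table above can be kept as a small lookup so that scripts apply the same format policy everywhere. This is only a sketch with hypothetical names; the video row is left out because the source lists a single value for it without showing which column it belongs to.

```python
# Format policy per original type, copied from the table's image, text and audio rows.
FORMATS = {
    "image": {"master": "TIFF", "surrogates": ["JPEG", "DjVu"]},
    "text": {"master": "TIFF", "surrogates": ["JPEG", "DjVu", "PDF"]},
    "audio": {"master": "Linear WAV", "surrogates": ["MP3", "RealAudio"]},
}

def formats_for(kind: str):
    """Return (preservation master format, list of surrogate formats)."""
    entry = FORMATS.get(kind.lower())
    if entry is None:
        raise KeyError(f"no format policy recorded for {kind!r}")
    return entry["master"], entry["surrogates"]
```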
OCR - Accuracy
Depends on
• Color of paper
• Characters should be reasonably well formed
• The font should be one of the popular ones.
99% accuracy achieved with
• Bleached white paper
• 10pt character size
• 1.5 line spacing
• Computer based printouts
OCR - Issues
Dealing with archival material
Old text printed during the hand-press period
Gothic and exotic fonts used
Paper color is yellow
Characters are often broken and not well-formed due to age and environmental factors
Best Practice
First scan and store as TIFF files
OCR the TIFF files
Depending on the application and size, convert to PDF or any other format
Depending on the accuracy of the OCR, use the TIFF or the OCR copies for the PDF
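
The last step above is a simple decision that can be written down explicitly. The 0.99 threshold below is an assumed value chosen for illustration; the slides do not state a cut-off.

```python
def choose_pdf_source(ocr_accuracy: float, threshold: float = 0.99) -> str:
    """Return 'ocr' when the recognised text is good enough, else 'tiff'."""
    if not 0.0 <= ocr_accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    # Below the threshold, the page image is used instead of the OCR text.
    return "ocr" if ocr_accuracy >= threshold else "tiff"
```

Either way the TIFF file remains the stored original, so a better OCR pass can be run later.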
OCR – Software
ABBYY FineReader – Very popular
OmniPage – High end OCR tool
Readiris – A competitor to ABBYY and OmniPage
MODI – Microsoft Office Document Imaging (introduced with Office XP; exports to Word)
Camera produces a raw, uncorrected color photo of each page
The software cleans up the image and saves it as a Hi-Res TIFF image
Using OCR it can be converted to editable text
Summary
Digitization is a process
Large numbers of analogue items like image, text, audio and video are captured into digital form
Understand the variables and tasks in the process
• Methods of capturing images
• Conversion process performed
Summary
Document the workflow
This will lead to a life history for each digitized item
Helps create consistency and reliability
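
One way to record such a life history is to log each workflow step against the item as it happens. The class, step names and detail fields below are hypothetical; they only show the idea of an append-only record per item.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ItemHistory:
    item_id: str
    events: list = field(default_factory=list)

    def log(self, step: str, **details) -> None:
        """Append one workflow step with a UTC timestamp and any settings used."""
        self.events.append(
            {"step": step, "at": datetime.now(timezone.utc).isoformat(), **details}
        )

    def steps(self) -> list:
        """Names of the steps in the order they were performed."""
        return [event["step"] for event in self.events]

history = ItemHistory("item-0001")
history.log("scan", dpi=600, bit_depth=24)
history.log("save_master", file_format="TIFF")
history.log("ocr")
history.log("create_surrogate", file_format="PDF")
```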
New Definition
Is this the end of digitization?
Are we through with the work?
As in every other job, here too sustainability and maintenance are necessary
Long term maintenance
Technology is changing rapidly
Obstacles that may need to be overcome
• Lack of awareness in general about how such resources may be exploited effectively for scholarly purposes
• Lack of relevant IT skills and/or analytical methods
• Lack of appropriate user support.
Strategies for preserving data
• Preserving the data and the hardware and software platforms from which they were originally made accessible.
• Refreshing data by copying them periodically onto new storage media.
• Migrating data through changing technical regimes by rendering them into appropriate standard interchange formats.
• Emulating the look and feel of the original data on successive generations of hardware and software platforms.
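
Of the strategies above, refreshing is the easiest to sketch in code. The example copies a file to a new location and compares checksums before trusting the copy. The checksum verification is an addition made for this sketch, not something the slides prescribe, and the function names are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Checksum of a file, read in chunks so large masters fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def refresh_copy(source: Path, target_dir: Path) -> Path:
    """Copy a file onto new storage and verify that the copy is bit-identical."""
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    shutil.copy2(source, target)
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"checksum mismatch after copying {source}")
    return target
```

Refreshing only guards against media decay; it does not address format obsolescence, which is what migration and emulation are for.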
Points to ponder
Unlike paper, parchment and other traditional forms of recording medium, electronic systems and their data are not durable. Digital materials have very different preservation requirements to analogue materials, which may last for many decades through storage in optimal environmental conditions.
The other difficulty with electronic data and files is that they require the intervention of other systems to facilitate readability or usability. This innate dependency makes the files themselves very fragile. A problem in any of the supporting components can render the information useless.
It is not enough to physically preserve the storage medium or the bitstream. Without the commensurate tools to decode and present the bitstream, a future user will be met with gibberish.
Digitization – Next Step
Will mean preservation of materials that are 'born digital'.
Migration
• Electronic data transferred from one data format to another.
Emulation
• Attempts to use current and future technologies to emulate the tools and logic used when the records and files were originally created
Informative web sites
Irish Virtual Research Library and Archive - Project Workbook
http://www.ucd.ie/ivrla/workbook/wdigpreservation.html
The Arts and Humanities Data Service (AHDS) is a UK national service aiding the discovery, creation and preservation of digital resources in and for research, teaching and learning in the arts and humanities
http://ahds.ac.uk/about/publications/index.htm