Transcript
Page 1: Digitization Workflow Management System for Massive Digitization Projects

Digitization Workflow Management System for Massive Digitization Projects

Bibliotheca Alexandrina

November 19, 2006

The 2nd International Conference on Universal Digital Library 2006 (ICUDL 2006)

Mohamed Yakout Noha Adly Magdy [email protected] [email protected] [email protected]

Page 2: Digitization Workflow Management System for Massive Digitization Projects

Goals

Automate, track, and manage the digitization workflow.
Flexibility in defining digitization workflow Phases.
Support dynamic evolution and deviations, with history tracking.
Flexible integration with the LIS and the Library Digital Repository.
Accept external, partially digitized Jobs and start them in the proper Phase within the digitization workflow.
Simultaneous management of multiple projects with a diversity of materials (books, journals, manuscripts, audio, video, slides, etc.).

Page 3: Digitization Workflow Management System for Massive Digitization Projects

Related Work

Manual workflow management using several software packages (MS Excel, MS SharePoint, MS Project).
Simple workflow tracking systems with limited capabilities.
Several digitization activities (digital capturing, image processing, OCRing, …) integrated in one software package:
  DOCWorks from CCS
  BookRestorer from i2s
  OUPS

Limitations:
  Tightly coupled with certain tools; other tools cannot easily be integrated.
  No resource management (e.g. workstations and users).
  Lack of project and collection management.
  Manual file handling between the storage server and clients.
  Lack of handling for workflow exceptions, dynamic evolution, and deviations except through manual intervention.

Page 4: Digitization Workflow Management System for Massive Digitization Projects

System Data Model

Phase

Job

Job Type

Collection

Workstation

User

Page 5: Digitization Workflow Management System for Massive Digitization Projects

System Data Model

Job: the object being digitized, e.g. a book by Naguib Mahfouz, photos of an event, a map of Alexandria, a music sheet by Omar Khayrat.

Page 6: Digitization Workflow Management System for Massive Digitization Projects

System Data Model

Job Type: the types of material handled by the system (Book, Manuscripts, Map, Journals, Audio, Video).

Page 7: Digitization Workflow Management System for Massive Digitization Projects

System Data Model

Phase: a task that should be applied within the digitization process, e.g. Scanning, Processing, OCRing, Encoding, Publishing, Zipping for archiving.

Page 8: Digitization Workflow Management System for Massive Digitization Projects

System Data Model

User: the system users, with several roles (digital lab operators, shift operators, administrator).

Page 9: Digitization Workflow Management System for Massive Digitization Projects

System Data Model

Collection: a logical grouping of Jobs, e.g. Nasser, AlexMed, AMEEL.

Page 10: Digitization Workflow Management System for Massive Digitization Projects

System Data Model

Workstation: the computer used to perform the Phase.
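The six entities of the data model can be pictured as plain records. The following is a minimal Python sketch, not the DWMS schema: only the entity names and their one-line descriptions come from the slides, while every field name (`phase_sequence`, `current_phase`, `roles`, `hostname`, ...) and every relationship between the records is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Phase:
    """A task applied within the digitization process, e.g. Scanning."""
    name: str


@dataclass
class JobType:
    """A type of material; the default Phase Sequence field is assumed."""
    name: str
    phase_sequence: List[Phase] = field(default_factory=list)


@dataclass
class Collection:
    """A logical grouping of Jobs."""
    name: str


@dataclass
class Workstation:
    """The computer used to perform a Phase."""
    hostname: str


@dataclass
class User:
    """A system user holding one or more roles."""
    name: str
    roles: List[str] = field(default_factory=list)


@dataclass
class Job:
    """The object being digitized."""
    title: str
    job_type: JobType
    collection: Collection
    current_phase: Optional[Phase] = None


# Illustrative values; the pairing of job, type and collection is made up.
scanning, processing = Phase("Scanning"), Phase("Processing")
maps = JobType("Map", [scanning, processing])
job = Job("Sample map", maps, Collection("Sample collection"), scanning)
```

Keeping the Phase Sequence on the Job Type rather than on each Job is one possible design; it means every Job of that type shares the same default path.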

Page 11: Digitization Workflow Management System for Massive Digitization Projects

System Architecture

[Architecture diagram. Components shown: Phase Manager (Job Types A, B, C, each with Phases 1 to N; each Phase has a PrePhase and a PostPhase); XML Phases Definition Handler; File Handler; Database Handler; Authentication and Authorization Handler; DAF; Database with Stored Procedures; LIS Server (LIS); File Server; Check-In Module; Jobs in the System; Administration Module; Reporting Module; Check-Out to the Digital Documents Repository; Archiving Module with Off-line Storage.]

Page 12: Digitization Workflow Management System for Massive Digitization Projects

System Architecture

[Architecture diagram repeated; see Page 11.]

Page 13: Digitization Workflow Management System for Massive Digitization Projects

System Architecture

[Architecture diagram repeated; see Page 11.]

Page 14: Digitization Workflow Management System for Massive Digitization Projects

System Handlers

XML Phases Definition Handler
  Pre-Phase and Post-Phase
  Physical section
  Database section
  Reflection Call

<Phase Name="Book Arabic OCR">
  <PrePhase>
    <Physical Mode="UnRestricted">
      <Folder Name="OTIFF" Create="false" ToDestination="false" NewName="OTIFF" Mode="Restircted">
        <File Name="OriginalFiles" Type="tif" Count="+" ToDestination="false" Compare=""/>
      </Folder>
      . .
    </Physical>
  </PrePhase>
  <PostPhase>
    <Physical Mode="UnRestricted">
      <Folder Name="TXT" Create="false" ToDestination="true" NewName="TXT" Mode="Restircted">
        <File Name="" Type="frf" Count="1" ToDestination="true" Compare=""/>
        <File Name="" Type="art" Count="1" ToDestination="true" Compare=""/>
      </Folder>
    </Physical>
    <Database>
      <Field Name="Font" DisplayName="Font Family: " />
      <Field Name="LrnPage" DisplayName="Learn Page : "/>
      . .
    </Database>
    <ReflectionCall Method="packageName.doSomething" />
  </PostPhase>
</Phase>
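To make the Physical section more concrete, here is a hedged Python sketch of how such a definition might be checked against the files a Job actually contains. The slides do not state the attribute semantics, so the sketch assumes that `Count="+"` means "one or more" and a numeric `Count` means "exactly that many", and that `Type` is a file extension. The helper names `count_ok` and `check_section` are hypothetical, and the XML is the slide's example with the elided parts left out.

```python
import xml.etree.ElementTree as ET

PHASE_XML = """
<Phase Name="Book Arabic OCR">
  <PrePhase>
    <Physical Mode="UnRestricted">
      <Folder Name="OTIFF" Create="false" ToDestination="false" NewName="OTIFF" Mode="Restircted">
        <File Name="OriginalFiles" Type="tif" Count="+" ToDestination="false" Compare=""/>
      </Folder>
    </Physical>
  </PrePhase>
  <PostPhase>
    <Physical Mode="UnRestricted">
      <Folder Name="TXT" Create="false" ToDestination="true" NewName="TXT" Mode="Restircted">
        <File Name="" Type="frf" Count="1" ToDestination="true" Compare=""/>
        <File Name="" Type="art" Count="1" ToDestination="true" Compare=""/>
      </Folder>
    </Physical>
  </PostPhase>
</Phase>
"""


def count_ok(rule, actual):
    """Assumed semantics: '+' is one or more, a number is an exact count."""
    return actual >= 1 if rule == "+" else actual == int(rule)


def check_section(phase, section, listing):
    """Return the problems found for 'PrePhase' or 'PostPhase'.

    `listing` maps a folder name to the file names found in that folder.
    """
    problems = []
    for folder in phase.findall(f"./{section}/Physical/Folder"):
        files = listing.get(folder.get("Name"), [])
        for rule in folder.findall("File"):
            ext = "." + rule.get("Type")
            found = sum(1 for name in files if name.lower().endswith(ext))
            if not count_ok(rule.get("Count"), found):
                problems.append(
                    f"{folder.get('Name')}: expected {rule.get('Count')} "
                    f"*{ext} file(s), found {found}"
                )
    return problems


phase = ET.fromstring(PHASE_XML)
pre = check_section(phase, "PrePhase", {"OTIFF": ["0001.tif", "0002.tif"]})
post = check_section(phase, "PostPhase", {"TXT": ["book.frf"]})
```

Here `pre` is empty because two `.tif` files satisfy `Count="+"`, while `post` reports the missing `.art` file. The `Mode`, `Create`, `ToDestination`, `NewName`, and `Compare` attributes are ignored in this sketch.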

Page 15: Digitization Workflow Management System for Massive Digitization Projects

System Handlers

XML Phases Definition Handler (same example as on Page 14)

Page 16: Digitization Workflow Management System for Massive Digitization Projects

System Handlers

XML Phases Definition Handler (same example as on Page 14)

Page 17: Digitization Workflow Management System for Massive Digitization Projects

System Handlers

XML Phases Definition Handler (same example as on Page 14)
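The Reflection Call element names a method to be invoked by name at run time. DWMS does this with Java reflection according to the slides; the sketch below is only a Python analogue of the idea, using `importlib` to resolve a dotted name. The helper `resolve_callable` is hypothetical, and a standard-library function stands in for the slide's placeholder `packageName.doSomething`, which does not exist here.

```python
import importlib
import xml.etree.ElementTree as ET


def resolve_callable(dotted_name):
    """Split 'module.path.function' and look the function up at run time."""
    module_name, _, attr = dotted_name.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.function', got {dotted_name!r}")
    module = importlib.import_module(module_name)
    func = getattr(module, attr)
    if not callable(func):
        raise TypeError(f"{dotted_name} is not callable")
    return func


# Stand-in target from the standard library (an assumption for the demo).
element = ET.fromstring('<ReflectionCall Method="os.path.basename" />')
hook = resolve_callable(element.get("Method"))
result = hook("/jobs/1234/TXT/book.frf")
```

Resolving the method from configuration keeps phase-specific behavior out of the workflow engine: adding a new post-phase action means editing the XML definition rather than the engine code.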

Page 18: Digitization Workflow Management System for Massive Digitization Projects

System Architecture

[Architecture diagram repeated; see Page 11.]

Page 19: Digitization Workflow Management System for Massive Digitization Projects

System Architecture

[Architecture diagram repeated; see Page 11.]

Page 20: Digitization Workflow Management System for Massive Digitization Projects

System Architecture

[Architecture diagram repeated; see Page 11.]

Page 21: Digitization Workflow Management System for Massive Digitization Projects

System Architecture

[Architecture diagram repeated; see Page 11.]

Page 22: Digitization Workflow Management System for Massive Digitization Projects

System Modules

Check-In
  Plug-in based for integration.
  Creates the Job in the system.
  Assigns the Job to any Phase.

Check-Out
  Java Reflection Call section of the XML Phases Definition.
  Ingests the Job's digital objects into the repository.

[Diagram: DWMS with Check-in plug-ins (Virtua, DigiArab, MARC File, ..., MODS File) and Check-out plug-ins (DAR, Fedora, DSpace, ..., aDORe) connected to the DAR, Fedora, DSpace, and aDORe repositories.]
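The plug-in arrangement for Check-In and Check-Out can be sketched as two small interfaces plus a registry. This is an illustration in Python under stated assumptions, not the DWMS plug-in API: the class names, method names (`fetch`, `ingest`), and registry keys are all hypothetical, and in-memory stand-ins replace real LIS and repository connectors.

```python
from abc import ABC, abstractmethod


class CheckInPlugin(ABC):
    """Turns a record from an external source into Job metadata."""

    @abstractmethod
    def fetch(self, identifier):
        ...


class CheckOutPlugin(ABC):
    """Ingests a finished Job's digital objects into a repository."""

    @abstractmethod
    def ingest(self, job, files):
        ...


class DictCheckIn(CheckInPlugin):
    """Illustrative stand-in for an LIS or metadata-file plug-in."""

    def __init__(self, records):
        self.records = records

    def fetch(self, identifier):
        return {"id": identifier, **self.records[identifier]}


class InMemoryCheckOut(CheckOutPlugin):
    """Illustrative stand-in for a repository plug-in."""

    def __init__(self):
        self.stored = {}

    def ingest(self, job, files):
        self.stored[job["id"]] = list(files)
        return len(files)


# Registries keyed by plug-in name; the names here are made up.
check_in_plugins = {"demo-lis": DictCheckIn({"b1": {"title": "Sample book"}})}
check_out_plugins = {"demo-repo": InMemoryCheckOut()}

job = check_in_plugins["demo-lis"].fetch("b1")
ingested = check_out_plugins["demo-repo"].ingest(job, ["0001.tif", "book.pdf"])
```

With this shape, supporting another metadata source or repository means registering one more class; the workflow core only depends on the two interfaces.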

Page 23: Digitization Workflow Management System for Massive Digitization Projects

System Architecture

[Architecture diagram repeated; see Page 11.]

Page 24: Digitization Workflow Management System for Massive Digitization Projects

System Modules

Phases Manager
  Request a new Job.
  Download the Job's folders and files.
  Submit the Job back to the system to continue through other Phases.
  Reject a Job and recommend another Phase, specifying the reasons.
  Redirect a Job from the default Phase Sequence.
  Provide file-level information to help solve problems.

Page 25: Digitization Workflow Management System for Massive Digitization Projects

System Modules (Contd)

Reporting
  Workflow tracking
  Pending items
  Late Jobs
  Operator rates
  Build customized reports

Archiving
  On different media with different sizes, and on online storage

Administration
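Two of the reports listed for the Reporting module, Late Jobs and operator rates, can be illustrated with a few lines of Python. The slides do not define either report, so this sketch assumes that a Job is late when it finishes after its due date or is unfinished past that date, and that an operator's rate is simply the number of Jobs the operator has finished. The record fields and sample data are hypothetical.

```python
from collections import Counter
from datetime import date

# Hypothetical job records; the field names are assumptions.
jobs = [
    {"id": 1, "operator": "op1", "due": date(2006, 11, 1), "done": date(2006, 11, 3)},
    {"id": 2, "operator": "op1", "due": date(2006, 11, 5), "done": date(2006, 11, 4)},
    {"id": 3, "operator": "op2", "due": date(2006, 11, 2), "done": None},
]


def late_jobs(jobs, today):
    """Ids of Jobs finished after their due date, or unfinished and past due."""
    return [j["id"] for j in jobs if (j["done"] or today) > j["due"]]


def operator_rates(jobs):
    """Number of finished Jobs per operator."""
    return Counter(j["operator"] for j in jobs if j["done"])


late = late_jobs(jobs, today=date(2006, 11, 19))
rates = operator_rates(jobs)
```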

Page 26: Digitization Workflow Management System for Massive Digitization Projects

BA Digitization Workflow

Page 27: Digitization Workflow Management System for Massive Digitization Projects

[Workflow diagram: Check-in and Check-out bracket the Phases of each Job Type.]

Job Type: Arabic Books. Phases: Scanning, Processing, OCRing, Encoding & Publishing, QA, Archiving.
Job Type: Latin Books. Phases: Scanning, Processing, OCRing, Encoding & Publishing, QA, Archiving.
Job Type: Manuscripts. Phases: Scanning, Processing, Encoding & Publishing, QA, Archiving.
Job Type: Small Images. Phases: Scanning, Processing, Publishing, Archiving.
Job Type: Large Images. Phases: Scanning, Processing, Publishing, Archiving.
Job Type: Maps. Phases: Scanning, Processing, Publishing, Archiving.

Page 28: Digitization Workflow Management System for Massive Digitization Projects

Quality Assurance

Supported at two different stages:
  QA information is maintained at the file level while a Job moves from one Phase to another.
  A QA Phase is defined in the Digitization Phase Sequence as the last Phase before Archiving.

[Diagram: Arabic Books Phases (Scanning, Processing, OCRing, Encoding & Publishing, QA, Archiving), annotated "Information at the output objects (pages) level".]

Page 29: Digitization Workflow Management System for Massive Digitization Projects

Achieving Flexibility Using DWMS

The defined Phase Sequence for a Job Type is a guide rather than a prescription.
A Job Type's Phases may or may not be part of the Phase Sequence; the operator can assign the Job to any of these Phases.
Jobs can be forwarded dynamically to another Phase in the Phase Sequence.
Changes in the Phase Sequence affect both current and new Jobs in the system, leading to natural process evolution.

[Diagram: three views of the Arabic Books Phase Sequence (Scanning, Processing, OCRing, Encoding & Publishing, QA, Archiving); the third omits Encoding & Publishing.]
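One way to read "a guide rather than a prescription" is that a Job stores only its current Phase, and the default successor is looked up in the shared Phase Sequence each time the Job moves on. The Python sketch below illustrates that reading. The data layout and the function names `next_phase` and `forward` are assumptions, and the Phase order places QA before Archiving as the Quality Assurance slide describes.

```python
# Phase Sequences are shared per Job Type, so editing one affects
# Jobs already in the system as well as new ones.
phase_sequences = {
    "Arabic Books": ["Scanning", "Processing", "OCRing",
                     "Encoding & Publishing", "QA", "Archiving"],
}


def next_phase(job):
    """Default successor of the Job's current Phase, or None if there is none."""
    sequence = phase_sequences[job["job_type"]]
    if job["phase"] not in sequence:
        return None  # off-sequence: an operator has to forward the Job
    i = sequence.index(job["phase"])
    return sequence[i + 1] if i + 1 < len(sequence) else None


def forward(job, phase, reason):
    """Move the Job to any Phase and keep a history entry."""
    job["history"].append((job["phase"], phase, reason))
    job["phase"] = phase


job = {"job_type": "Arabic Books", "phase": "OCRing", "history": []}
before = next_phase(job)                    # follows the default sequence

# The sequence evolves: Encoding & Publishing is dropped for this Job Type.
phase_sequences["Arabic Books"].remove("Encoding & Publishing")
after = next_phase(job)                     # the in-flight Job sees the change

forward(job, "Processing", "skewed pages")  # deviation from the default path
```

Because the lookup happens at transition time, the Job that was waiting in OCRing is routed to QA after the sequence changes, and the forwarded Job keeps a record of why it left the default path.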

Page 30: Digitization Workflow Management System for Massive Digitization Projects

Job Life Cycle

[State diagram. States: Start, Assign, Reject, Redirect, Finish. Transition labels: New Job; File transfer; Ordinary job finishing; Job assigned to next stage; Reject job for some problems; Administrator accepts the rejection; Recommend re-doing a Phase for the Job; Administrator accepts the recommendation; To Repository.]
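The Job life cycle can be expressed as a small state machine. The state names below come from the slide, but the transition table is an assumption: it is one plausible reading of the state and transition labels, not the authors' definition, and the event names (`assign`, `reject`, `admin_accepts`, `finish`) are hypothetical.

```python
# (current state, event) -> next state; an assumed reading of the diagram.
TRANSITIONS = {
    ("Start", "assign"): "Assign",
    ("Assign", "finish"): "Finish",
    ("Assign", "reject"): "Reject",
    ("Reject", "admin_accepts"): "Redirect",
    ("Redirect", "assign"): "Assign",
    ("Finish", "assign"): "Assign",
}


class Job:
    def __init__(self):
        self.state = "Start"
        self.history = ["Start"]

    def fire(self, event):
        """Apply an event, refusing transitions the table does not list."""
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"{event!r} is not allowed in state {self.state!r}")
        self.state = TRANSITIONS[key]
        self.history.append(self.state)
        return self.state


job = Job()
for event in ["assign", "reject", "admin_accepts", "assign", "finish"]:
    job.fire(event)
```

Keeping the transitions in a table makes the allowed paths easy to audit, and the recorded history gives the tracking of deviations that the Goals slide asks for.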

Page 31: Digitization Workflow Management System for Massive Digitization Projects

Future Work

Check-out plug-in for Fedora.
Check-in plug-ins will be implemented to support various metadata standard formats (MODS, DC, VAR, etc.).
Enhance the software interface with graphical tools to help design and follow the digitization process.

Page 32: Digitization Workflow Management System for Massive Digitization Projects

Thank You

[email protected]

