Managing the Digitization of Large Press Archives

DESCRIPTION

From the 2014 DLF Forum in Atlanta, GA.

Session Leaders: Bassem Elsayed, Bibliotheca Alexandrina; Ahmed Samir, Bibliotheca Alexandrina

Managing the digitization of press material is a challenge, not only in terms of quantity but also in terms of text and material quality, the design of the workflow system that organizes the operations, and the handling of metadata. This challenge has been the focus of the Bibliotheca Alexandrina’s digitization work during the past year, in the course of its partnership with the Center for Economic, Judicial, and Social Study and Documentation (CEDEJ). With more than 800,000 pages of press articles to be digitally preserved and made publicly accessible, a workflow was needed that could manage such a massive collection and handle its attributes proficiently.

The project required simultaneous work on four main aspects: analyzing the data of the collection, developing a digitization workflow for the collection at hand, implementing and installing the necessary software tools for metadata entry, and finally publishing the digital archive online for researchers and the public.

The presentation will demonstrate the workflow system being implemented to manage this press collection, which has yielded more than 400,000 pages to date. It will describe the BA’s Digital Assets Factory (DAF), the nucleus upon which the digitization process of the CEDEJ collection has been built. It will also discuss the tools implemented for ingesting data into the digitization process, starting from indexing and continuing through the creation of the batches that are ingested into the system. The outflow will be discussed in terms of organizing and grouping multipart press clips, as well as reviewing, validating, and correcting the output.

Finally, the presentation will cover the challenges of pairing the online archive with a powerful search engine that supports multidimensional search while maintaining a user-friendly navigation experience.

The New Library of Alexandria Overview

Bibliotheca Alexandrina (BA)  

Ø  Center of excellence in the production and dissemination of knowledge

Ø  Place of dialogue, learning and understanding between cultures and peoples

Ø  The World’s Window on Egypt

Ø  Egypt’s Window on the World

Ø  Instrument for Rising to the Challenges of the Digital Age

Ø  Center for Dialogue Between Peoples and Civilizations

Not just a Library of Books but rather a vast cultural and scientific complex

A library that can accommodate millions of books  

http://archive.bibalex.org

http://descegy.bibalex.org

http://lartarab.bibalex.org

More than 230,000 Arabic books are freely available online for Arabic readers worldwide

http://suezcanal.bibalex.org

http://naguib.bibalex.org/

http://nasser.bibalex.org

http://sadat.bibalex.org

Ø  Project Overview
Ø  Collection Overview
Ø  Data Representation
Ø  System Workflow
    -  DAF (Digital Assets Factory)
    -  Cataloguing
    -  Website
        §  Solr Search Engine
        §  Article Viewer

Ø  The Centre for Economic, Judicial, and Social Study and Documentation (CEDEJ) collaborated with Bibliotheca Alexandrina (BA) to digitize its massive archive of press articles

Ø  The project consists of multiple modules to:
    -  Index the press archive collection
    -  Control the data entry workflow
    -  Digitize and process data
    -  Catalogue and review articles
    -  Publish the archive on the web

Ø  Package of press archive
    -  800,000+ press clips varying between
        §  Press
        §  Reports
    -  500+ publishers
    -  60,000+ writers and reporters
    -  200 different subjects
        §  Economics, politics, social life, etc.
    -  Archive languages:
        §  Arabic, English and French
    -  Date range from 1966 to 2009

Ø  Finished so far
    -  115,000 press clips varying between
        §  Press
        §  Reports
    -  200 publishers
    -  14,000 writers and reporters
    -  100 different subjects
        §  Economics, politics, social life, etc.
    -  Archive languages:
        §  Arabic, English and French
    -  Date range from 1966 to 2009

Ø  A list of the packaged press archive is submitted to Bibliotheca Alexandrina to be scanned and catalogued

Ø  The source of data is a collection of boxes

Ø  Each box is organized in the following hierarchy
    -  Folder
    -  File
    -  Sub-File
    -  Document

Ø  A Document represents a single press page
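
The box hierarchy described above could be modelled roughly as follows. This is only a sketch under stated assumptions: the class names mirror the slide's levels, but every field name (`label`, `image_name`) and the path format are hypothetical and are not taken from the DAF or the CEDEJ index.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Document:
    """Leaf of the hierarchy: a single scanned press page."""
    image_name: str  # hypothetical field: file name of the scanned page


@dataclass
class SubFile:
    label: str
    documents: List[Document] = field(default_factory=list)


@dataclass
class File:
    label: str
    sub_files: List[SubFile] = field(default_factory=list)


@dataclass
class Folder:
    label: str
    files: List[File] = field(default_factory=list)


@dataclass
class Box:
    label: str
    folders: List[Folder] = field(default_factory=list)

    def iter_documents(self) -> Iterator[Tuple[str, Document]]:
        """Yield (path, document) pairs; path is box/folder/file/sub-file."""
        for folder in self.folders:
            for f in folder.files:
                for sub in f.sub_files:
                    for doc in sub.documents:
                        path = "/".join(
                            [self.label, folder.label, f.label, sub.label]
                        )
                        yield path, doc


# Illustrative data only; the labels and file names are made up.
box = Box("box-001", [
    Folder("folder-A", [
        File("file-1", [
            SubFile("sub-1", [
                Document("page-0001.tif"),
                Document("page-0002.tif"),
            ]),
        ]),
    ]),
])
pages = list(box.iter_documents())
```

Walking the tree this way gives every page a stable path, which is one plausible way to index pages before they are grouped into batches.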

Article Creation

Article Metadata

Lookups Management

Reports

Ø  Based on Apache Lucene project v4.1

Ø  SolrNet API is used to connect to Solr server

Ø  Features
    -  Simple/Advanced search
    -  Results highlighting
    -  Fields autocomplete
    -  Text search (Article Viewer)
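
As a rough illustration of the search and highlighting features, the sketch below builds a request URL for Solr's standard `select` handler using the common parameters `q`, `hl`, `hl.fl`, `rows` and `wt`. The archive itself connects through SolrNet (a .NET client); Python is used here only for illustration. The host, the core name `articles` and the field name `content` are assumptions, not details of the BA deployment.

```python
from urllib.parse import urlencode, parse_qs, urlsplit


def build_search_url(base_url: str, text: str, rows: int = 10) -> str:
    """Build a Solr select URL for a phrase search with highlighting."""
    # Escape backslashes and quotes so the text stays one quoted phrase.
    phrase = text.replace("\\", "\\\\").replace('"', '\\"')
    params = {
        "q": f'content:"{phrase}"',  # 'content' is a hypothetical field name
        "hl": "true",                # ask Solr to return highlighted snippets
        "hl.fl": "content",          # field to highlight (hypothetical)
        "rows": rows,                # number of results to return
        "wt": "json",                # response format
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical local Solr core named 'articles'.
url = build_search_url("http://localhost:8983/solr/articles", "Suez Canal")
query = parse_qs(urlsplit(url).query)
```

An advanced search would typically add more field clauses or filter queries (`fq`) to the same request; which fields the archive actually exposes is not stated in the slides.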

Ø  Article viewer is used for previewing articles
    -  It is one of multiple viewers developed at BA

Ø  Architecture
    -  Server side: RESTful services
    -  Client side: JavaScript using JSONP

Ø  Features
    -  Image preview
    -  Metadata preview
    -  Text selection
    -  Searching/highlighting
    -  Zooming options: fit width/height
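
Since the client side uses JSONP, each service response is presumably wrapped in a call to a function named by the client. The sketch below shows that wrapping in general terms; the function name `jsonp_body`, the callback name and the payload keys are hypothetical, and nothing here is taken from the BA viewer's code.

```python
import json
import re

# Accept only dotted JavaScript identifiers as callback names, so that
# arbitrary script cannot be injected through the callback parameter.
_CALLBACK_RE = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*(\.[A-Za-z_$][A-Za-z0-9_$]*)*")


def jsonp_body(callback: str, payload: dict) -> str:
    """Wrap a JSON payload in a call to the client-supplied callback."""
    if not _CALLBACK_RE.fullmatch(callback):
        raise ValueError("invalid JSONP callback name")
    return f"{callback}({json.dumps(payload)});"


# Hypothetical metadata payload for one article.
body = jsonp_body("onMetadata", {"pageCount": 3, "width": 2480, "height": 3508})
```

The browser loads such a response through a `<script>` tag, which is what lets the viewer call services hosted on another origin.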

Ø  Viewer Web Services
    -  Metadata Web Service:
        §  Retrieve article catalogue metadata
        §  Return technical information (width, height, page count, etc.)
    -  Content Web Service:
        §  Retrieve the image of each single page in the article, scaled responsively to a custom width and height
        §  Return the selected text based on the user-highlighted area
    -  Search Web Service:
        §  Perform the search in the content of the articles using the Solr engine APIs
        §  Highlight the matching phrases in the article image
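
Two of the Content Web Service behaviours above, scaling a page to fit the viewer and returning the text under a highlighted area, can be sketched as follows. This assumes that each recognised word is stored with a bounding box in page pixels, which the slides do not confirm; the `Word` type and all function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    """A recognised word and its bounding box in page pixels (assumed)."""
    text: str
    x: int
    y: int
    w: int
    h: int


def fit_scale(page_w: int, page_h: int, view_w: int, view_h: int, mode: str) -> float:
    """Scale factor for the 'fit width' or 'fit height' zoom option."""
    if mode == "width":
        return view_w / page_w
    if mode == "height":
        return view_h / page_h
    raise ValueError("mode must be 'width' or 'height'")


def scaled_size(page_w: int, page_h: int, scale: float) -> Tuple[int, int]:
    """Pixel size of the page image after scaling."""
    return round(page_w * scale), round(page_h * scale)


def select_text(words: List[Word], region: Tuple[int, int, int, int], scale: float) -> str:
    """Return the words overlapping a region given in viewer pixels.

    The region (x, y, w, h) is mapped back to page pixels first.
    """
    rx, ry, rw, rh = (v / scale for v in region)
    picked = [
        wd for wd in words
        if wd.x < rx + rw and rx < wd.x + wd.w
        and wd.y < ry + rh and ry < wd.y + wd.h
    ]
    return " ".join(wd.text for wd in picked)


# Illustrative page of 2000x3000 pixels shown in a 1000x900 viewer.
words = [
    Word("Suez", 100, 100, 200, 50),
    Word("Canal", 320, 100, 220, 50),
    Word("report", 100, 400, 250, 50),
]
scale = fit_scale(2000, 3000, 1000, 900, "width")
selected = select_text(words, (40, 40, 250, 50), scale)
```

Keeping word boxes in page coordinates and converting the user's rectangle back through the scale factor means the same stored data serves every zoom level.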
