IMPACT Final Event 26-06-2012 - Automated metadata extraction from title pages. Report on the FEP pilot at the German National Library by Christa Schöning-Walter (DNB)


IMPACT final event – The Hague – 26 June 2012

Metadata extraction from title pages

Evaluation of the FEP pilot


at the German National Library

Christa Schöning-Walter

Outline

1. Institutional background

2. IMPACT test case

3. Strategic goals

4. Preliminary work

5. Results

6. Perspective

| IMPACT event | June 26, 2012 |

The German National Library (DNB) – some facts and figures (I)

− Legal deposit: Collecting, cataloguing, archiving and making available to the general public all German and German-language publications, publications about Germany etc from 1913 onwards

− Bibliographic services:

− National Bibliography

− Authority files

− Bibliographic standards

− 2 sites: Leipzig, Frankfurt am Main


The German National Library (DNB) – some facts and figures (II)

− Collection size (January 2012): 27 million media units

− Daily input: 1,500 physical units (each with 2 copies)

− Since 2006: Collection mandate includes non-physical media (online publications)

− DNBG = Law regarding the German National Library

− PflAV = Legal Deposit Regulation

− Since 2009: Considerations on and implementation of automated cataloguing processes



Target of the IMPACT scenario

Opening questions (summer 2011):

− Can a rule engine successfully extract metadata from the title pages of simply structured monographic publications?

− Does this help to accelerate the cataloguing processes when no machine-readable metadata is available from other sources?


Test case: Theses

− 14,000 print units annually

− Simple structure (!?)

Starting point

Since January 2012:

− Experimental application studies in collaboration with the University of Innsbruck

− Using the rule-based exploitation features of FEP (Functional Extension Parser)

What is FEP?

− Software platform for the purpose of analysing the logical structure of documents

− Developed within IMPACT work package EE4 (Goal: enrichment of OCR output with structure information)


Strategic goals

In particular:

− Making descriptive cataloguing less time-consuming and literature processing of printed media faster by

− Partial digitisation

− Automated metadata extraction

− Result transfer into the bibliographic record

− Quality check and completion of cataloguing by the staff


Generally:

− Gaining experience in the area of automated metadata extraction / automated cataloguing


Conceptual design of the workflow

Example: http://d-nb.info/1017138931

[Workflow diagram. Labelled components: Accession (printed media units), Scan service (title page + ToC) at the service partner, OCR output / indexing, FEP results, Repository, Data provider, OAI harvester, Cataloguing, Bibliographic record, Quality check, Statistics, Stack]



The idea: Taking bibliographic data over from metadata mining tools.

The Objective: Automated exploitation of descriptive bibliographic data

− Specification, implementation, evaluation and gradual improvement of

− Appropriate structure types

− Dictionaries (controlled vocabulary, indicating keywords, abbreviations etc)

− Expert rules

− Etc

Illustration: University of Innsbruck

Preliminary work (I)

− Specification of the bibliographic statements to be mined from the title page

Attribute          Value
Publication year   2010
Language code      /1ger/1eng
Creator            <last name>,<first name>
Title              <full title>:<additional title information>/<author statement>
Size               30 cm / 21 cm
Theses statement   <city name>, <corporate body name>, <type of publication>, <year of graduation>
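The attribute patterns above can be read as simple string templates. The following is a minimal sketch of how extracted values might be assembled into those patterns. The class name, field names and sample values are hypothetical and are not taken from FEP or from DNB systems.

```python
from dataclasses import dataclass


@dataclass
class ThesisRecord:
    """Hypothetical container for the statements mined from a title page."""
    last_name: str
    first_name: str
    full_title: str
    additional_title: str
    author_statement: str
    city: str
    corporate_body: str
    publication_type: str
    graduation_year: str

    def creator(self) -> str:
        # Pattern from the slide: <last name>,<first name>
        return f"{self.last_name},{self.first_name}"

    def title(self) -> str:
        # Pattern from the slide:
        # <full title>:<additional title information>/<author statement>
        return f"{self.full_title}:{self.additional_title}/{self.author_statement}"

    def theses_statement(self) -> str:
        # Pattern from the slide: city, corporate body, type, year
        parts = [self.city, self.corporate_body,
                 self.publication_type, self.graduation_year]
        return ", ".join(parts)


# Invented sample values, for illustration only.
record = ThesisRecord(
    last_name="Musterfrau", first_name="Erika",
    full_title="Ein Beispieltitel", additional_title="eine Untersuchung",
    author_statement="Erika Musterfrau",
    city="Berlin", corporate_body="Freie Univ.",
    publication_type="Diss.", graduation_year="2010",
)
print(record.creator())
print(record.title())
print(record.theses_statement())
```

Keeping the patterns in one place like this makes it easy to compare the assembled strings with the manually catalogued record during a quality check.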

Preliminary work (II)

− Going over some hundreds of title pages of theses (scans from 2009-2011 + documents from daily business)

− Exploring typical structural patterns / regularities etc, such as

− Prefixes

− Phrases

− Notation

− Position

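Examples of indicating phrases for finding the creator (German phrases as they appear on title pages, with English glosses):

− von (by)
− von <Verfasser> vorgelegte Dissertation (dissertation submitted by <author>)
− von Herrn/Frau: (by Mr/Ms:)
− vorgelegt von(:) (submitted by)
− vorgelegt JJJJ von (submitted YYYY by)
− vorgelegt dem Fachbereich ... von (submitted to the department ... by)
− Name: (name:)
− Name des Verfassers: (name of the author, masculine form)
− Name der Verfasserin: (name of the author, feminine form)
− verfasst von(:) (written by)
− eingereicht von (submitted by)
− ...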

Expert rules
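Indicating phrases of this kind lend themselves to pattern matching. The sketch below shows one possible way to express a few of them as a regular expression in Python. It is an illustration only and does not reproduce the FEP rule syntax. The name pattern (a run of capitalised words on one line) is a simplifying assumption, and the bare "von" is left out because it is too ambiguous on its own.

```python
import re
from typing import Optional

# Longer, more specific phrases come first so that they win over shorter ones.
INDICATORS = [
    r"vorgelegt\s+dem\s+Fachbereich\s+.+?\s+von",
    r"vorgelegt\s+\d{4}\s+von",
    r"vorgelegt\s+von",
    r"verfasst\s+von",
    r"eingereicht\s+von",
    r"Name\s+des\s+Verfassers",
    r"Name\s+der\s+Verfasserin",
    r"von\s+Herrn",
    r"von\s+Frau",
    r"Name\s*:",
]

# The indicator is matched case-insensitively; the name must stay
# case-sensitive so that a lowercase word ends the match.
CREATOR_RE = re.compile(
    r"\b(?i:" + "|".join(INDICATORS) + r")\s*:?\s*"
    r"(?P<name>[A-ZÄÖÜ][\w.\-]*(?:[ \t]+[A-ZÄÖÜ][\w.\-]*)*)"
)


def find_creator(ocr_text: str) -> Optional[str]:
    """Return the text following the first indicating phrase, if any."""
    match = CREATOR_RE.search(ocr_text)
    return match.group("name") if match else None


# Invented sample text, for illustration only.
sample = "Dissertation\nvorgelegt von\nErika Musterfrau\naus Berlin"
print(find_creator(sample))
```

A real rule set would also need positional hints and a check against the name authority data, since capitalisation alone cannot separate a personal name from other capitalised German nouns.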


Preliminary work (III)

Choosing / preparing dictionaries for tagging, matching and mapping purposes:

− List of universities which have the right to award degrees (identifying the corporate bodies)

− Name Authority File subset (identifying personal names)

− List of academic degrees

Theses statement items (examples):
…
Berlin, ESCP Europe Wirtschaftshochschule
Berlin, Freie Univ.
Berlin, Humboldt-Univ.
Berlin, Steinbeis-Hochsch.
Berlin, Techn. Univ.
Berlin, Univ. der Künste
…

Academic degrees (examples):
…
M.A.    Master of Arts / Magister Artium
M.Sc.   Master of Science
M.Eng.  Master of Engineering
LL.M.   Master of Laws / Legum Magister
M.F.A.  Master of Fine Arts
M.Mus.  Master of Music
M.Ed.   Master of Education
…
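Dictionaries such as those listed above can be used both for tagging (recognising a known token) and for mapping (replacing a title-page form with the catalogue form). The sketch below illustrates both with plain Python dictionaries. The degree entries and the normalised corporate body forms come from the examples above. The title-page variants on the left-hand side of `BODY_VARIANTS` are assumptions added for illustration.

```python
from typing import List, Optional, Tuple

# Degree abbreviations and expansions as listed in the examples above.
ACADEMIC_DEGREES = {
    "M.A.": "Master of Arts / Magister Artium",
    "M.Sc.": "Master of Science",
    "M.Eng.": "Master of Engineering",
    "LL.M.": "Master of Laws / Legum Magister",
}

# Assumed title-page spellings mapped to the normalised forms shown above.
BODY_VARIANTS = {
    "Freie Universität Berlin": "Berlin, Freie Univ.",
    "Humboldt-Universität zu Berlin": "Berlin, Humboldt-Univ.",
    "Technische Universität Berlin": "Berlin, Techn. Univ.",
}


def tag_degrees(text: str) -> List[Tuple[str, str]]:
    """Tagging: return every known degree abbreviation found as a token."""
    tokens = {token.strip(",;()") for token in text.split()}
    return [(abbr, name) for abbr, name in ACADEMIC_DEGREES.items()
            if abbr in tokens]


def map_corporate_body(text: str) -> Optional[str]:
    """Mapping: return the normalised form of the first known variant."""
    lowered = text.lower()
    for variant, normalised in BODY_VARIANTS.items():
        if variant.lower() in lowered:
            return normalised
    return None


print(tag_degrees("Erika Musterfrau, M.Sc."))
print(map_corporate_body("Dissertation, Freie Universität Berlin, 2010"))
```

Exact lookups like these are cheap to maintain, which fits the dictionary maintenance tasks named later in the talk, but they miss OCR errors; a production setup would probably need some tolerance for misrecognised characters.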

Preliminary work (IV)

− Setting up a sample of documents for evaluation purposes:

− 1,000 theses from several universities

− Publication year: 2010 – 2011

− Different dimensions (A- and B-size)

− Scans: 300 dpi, bitonal

− Transfer format: PDF (in future: XML files)

− Ground truth determination:

− Manual region tagging on image files (done in Vietnam with the Aletheia tool)


Document processing in brief

− Database: Storage of all available information (OCR output, automatically or manually produced annotations, dictionaries, facts etc)

− Input of expert rules

− Rule engine: Stepwise proceeding taking intermediary results into account

Illustration: University of Innsbruck
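The stepwise proceeding described above can be pictured as rules that read from and write to a shared store of facts, repeated until nothing changes. The following is a minimal sketch of that idea in Python. It is not the FEP rule engine; all function names, fact keys and sample values are hypothetical.

```python
import re
from typing import Callable, Dict, List, Optional

Facts = Dict[str, str]
Rule = Callable[[Facts], Optional[Facts]]


def rule_graduation_year(facts: Facts) -> Optional[Facts]:
    """Derive the year from the OCR text if it is not known yet."""
    if "graduation_year" in facts:
        return None
    match = re.search(r"\b(19|20)\d{2}\b", facts.get("ocr_text", ""))
    return {"graduation_year": match.group(0)} if match else None


def rule_theses_statement(facts: Facts) -> Optional[Facts]:
    """Build the theses statement once all of its parts are available."""
    needed = ("city", "corporate_body", "publication_type", "graduation_year")
    if all(key in facts for key in needed):
        return {"theses_statement": ", ".join(facts[key] for key in needed)}
    return None


def run_rules(facts: Facts, rules: List[Rule], max_rounds: int = 10) -> Facts:
    """Apply all rules repeatedly until no rule adds or changes a fact."""
    facts = dict(facts)
    for _ in range(max_rounds):
        changed = False
        for rule in rules:
            for key, value in (rule(facts) or {}).items():
                if facts.get(key) != value:
                    facts[key] = value
                    changed = True
        if not changed:
            break
    return facts


# Invented sample facts, for illustration only.
start = {
    "ocr_text": "Dissertation Freie Universität Berlin 2010",
    "city": "Berlin",
    "corporate_body": "Freie Univ.",
    "publication_type": "Diss.",
}
result = run_rules(start, [rule_theses_statement, rule_graduation_year])
print(result["theses_statement"])
```

In this example the theses statement rule cannot fire in the first round because the year is still missing; it succeeds in the second round using the intermediary result of the year rule.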


Results

Second test phase with a revised list of universities (June 2012):


(1) total conformity (2) complete title + noise (just to be deleted by the staff)
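The two result categories can be illustrated by comparing an extracted title with its ground truth. The sketch below is one possible reading of the categories, not the evaluation procedure actually used in the pilot: category 1 is an exact match after whitespace normalisation, category 2 means the full ground truth is contained in a longer extracted string, and category 0 is a hypothetical label for everything else.

```python
from collections import Counter


def normalise(text: str) -> str:
    """Collapse runs of whitespace so that line breaks do not matter."""
    return " ".join(text.split())


def classify(extracted: str, ground_truth: str) -> int:
    """Return 1 (total conformity), 2 (complete title + noise) or 0 (other)."""
    found = normalise(extracted)
    truth = normalise(ground_truth)
    if found == truth:
        return 1
    if truth and truth in found:
        return 2
    return 0


# Invented sample pairs (extracted, ground truth), for illustration only.
pairs = [
    ("Ein  Beispieltitel", "Ein Beispieltitel"),
    ("Dissertation Ein Beispieltitel", "Ein Beispieltitel"),
    ("Ein Beispiel", "Ein Beispieltitel"),
]
tally = Counter(classify(found, truth) for found, truth in pairs)
print(tally)
```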


Forecast: Feasibility study

− Technical and organisational requirements: Operational aspects, technical workflow, interfaces etc

− Further functional enhancements needed:

− Dictionary maintenance: Expanding controlled vocabulary, sorting out unsuitable items etc

− Taking additional facts into account: Ground truth etc

− Additional expert rules (?)

− Additional functions: Language guesser, document size etc

− Customising FEP (?)


New ideas

− Extraction of defined structures from the body of monographic publications, such as table of contents, abstracts, pure text (without any introductory remarks, footers, references etc)

Target:

− Improvement of the results of current automated subject cataloguing projects, such as

− Thematic classification by machine learning techniques

− Obtaining subject headings by text analysis techniques


Reducing the noise via preceding structure analysis processes


Thank you for your attention.

Christa Schöning-Walter
Staff position 'Automated Cataloguing'
[email protected]
German National Library

Sandra Hamm
Project leader
[email protected]
German National Library

Digital Services

Frankfurt am Main, Germany