
IMPACT Final Event 26-06-2012 - Automated metadata extraction from title pages. Report on the FEP pilot at the German National Library by Christa Schöning-Walter (DNB)


IMPACT final event – The Hague – 26 June 2012

Metadata extraction from title pages

Evaluation of the FEP pilot

at the German National Library

Christa Schöning-Walter

Outline

1. Institutional background

2. IMPACT test case

3. Strategic goals

4. Preliminary work

5. Results

6. Perspective

| IMPACT event | June 26, 2012 |2

The German National Library (DNB)– some facts and figures (I)

− Legal deposit: Collecting, cataloguing, archiving and making available to the general public all German and German-language publications, publications about Germany etc. from 1913 onwards

− Bibliographic services:

− National Bibliography

− Authority files

− Bibliographic standards

− 2 sites: Leipzig, Frankfurt am Main

The German National Library (DNB) – some facts and figures (II)

− Collection size (January 2012): 27 million media units

− Daily input: 1,500 physical units (each with 2 copies)

− Since 2006: Collection mandate includes non-physical media (online publications)

− DNBG = Law regarding the German National Library

− PflAV = Legal Deposit Regulation

− Since 2009: Considerations on and implementation of automated cataloguing processes

Target of the IMPACT scenario

Opening questions (summer 2011):

− Can metadata be extracted successfully from title pages by a rule engine in the case of simply structured monographic publications?

− Is this useful in order to accelerate the cataloguing processes if no machine-readable metadata from other sources is available?

Test case: Theses

− 14,000 print units annually

− Simple structure (?)

Starting point

Since January 2012:

− Experimental application studies in collaboration with the University of Innsbruck

− Using the rule-based exploitation features of FEP (Functional Extension Parser)

What is FEP?

− Software platform for the purpose of analysing the logical structure of documents

− Developed within IMPACT work package EE4 (Goal: enrichment of OCR output with structure information)

Strategic goals

In particular:

− Making descriptive cataloguing less time-consuming and literature processing of printed media faster by

− Partial digitisation

− Automated metadata extraction

− Result transfer into the bibliographic record

− Quality check and completion of cataloguing by the staff

Generally:

− Gaining experience in the area of automated metadata extraction / automated cataloguing

Conceptual design of the workflow

Example: http://d-nb.info/1017138931

[Figure: workflow diagram. Labels: Accession (printed media units); Service partner; Scan service (title page + ToC); OCR output / Indexing; FEP results; Data Provider; Repository; OAI-Harvester; Cataloguing; Bibliographic record; Quality check; Statistics; Stack]

The idea: Taking bibliographic data over from metadata mining tools.

The Objective: Automated exploitation of descriptive bibliographic data

− Specification, implementation, evaluation and gradual improvement of

− Appropriate structure types

− Dictionaries (controlled vocabulary, indicating keywords, abbreviations etc.)

− Expert rules

− Etc

Illustration: University of Innsbruck

Preliminary work (I)

− Specification of the bibliographic statements to be mined from the title page

Attribute          Value
Publication year   2010
Language code      /1ger/1eng
Creator            <last name>,<first name>
Title              <full title>:<additional title information>/<author statement>
Size               30 cm; 21 cm
Theses statement   <city name>, <corporate body name>,<type of publication>,<year of graduation>
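The target patterns in the table above can be illustrated with a small record type. This is only a sketch: the class name `ThesisRecord`, its field names and the sample values used to exercise it are hypothetical and not part of the DNB specification or of FEP. It only shows how the creator, title and theses statement strings could be assembled once the individual items have been mined.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThesisRecord:
    """Hypothetical container for the statements mined from a title page."""
    publication_year: str
    language_code: str
    creator_last: str
    creator_first: str
    full_title: str
    additional_title: Optional[str]
    author_statement: str
    size: str
    city: str
    corporate_body: str
    publication_type: str
    graduation_year: str

    def creator(self) -> str:
        # Pattern from the table: <last name>,<first name>
        return f"{self.creator_last},{self.creator_first}"

    def title(self) -> str:
        # Pattern: <full title>:<additional title information>/<author statement>
        text = self.full_title
        if self.additional_title:
            text += f":{self.additional_title}"
        return f"{text}/{self.author_statement}"

    def theses_statement(self) -> str:
        # Pattern: <city name>, <corporate body name>,<type of publication>,<year of graduation>
        return (f"{self.city}, {self.corporate_body},"
                f"{self.publication_type},{self.graduation_year}")
```

Keeping the mined items as separate fields and building the display strings only at the end leaves room for the staff to correct a single item during the quality check.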

Preliminary work (II)

− Going over some hundreds of title pages of theses (scans from 2009-2011 + documents from daily business)

− Exploring typical structural patterns / regularities etc, such as

− Prefixes

− Phrases

− Notation

− Position

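Examples of indicating phrases used to find the creator (German phrases as they appear on title pages, with English glosses):

− von (by)

− von <Verfasser> vorgelegte Dissertation (dissertation submitted by <author>)

− von Herrn/Frau: (by Mr/Ms:)

− vorgelegt von(:) (submitted by)

− vorgelegt JJJJ von (submitted YYYY by)

− vorgelegt dem Fachbereich ... von (submitted to the department ... by)

− Name: (name:)

− Name des Verfassers: / Name der Verfasserin: (name of the author:)

− verfasst von(:) (written by)

− eingereicht von (handed in by)

− ...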

Expert rules
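A few of the indicating phrases above can be written as regular expressions. The following is a minimal sketch, not the rule syntax used by FEP: the function `find_creator`, the pattern list and in particular the `NAME` pattern (two or more capitalised words on one line) are assumptions made for illustration.

```python
import re
from typing import Optional

# Assumed shape of a personal name: at least two capitalised words on one line.
NAME = r"(?P<name>[A-ZÄÖÜ][\w.\-]+(?:[ \t]+[A-ZÄÖÜ][\w.\-]+)+)"

# A subset of the indicating phrases, each followed by the name pattern.
CREATOR_PATTERNS = [
    re.compile(r"vorgelegt\s+von\s*:?\s*" + NAME),
    re.compile(r"verfasst\s+von\s*:?\s*" + NAME),
    re.compile(r"eingereicht\s+von\s*:?\s*" + NAME),
    re.compile(r"Name\s+de[sr]\s+Verfasser(?:s|in)\s*:\s*" + NAME),
    re.compile(r"von\s+(?:Herrn|Frau)\s*:?\s*" + NAME),
]


def find_creator(ocr_text: str) -> Optional[str]:
    """Return the first name that follows a known indicating phrase, if any."""
    for pattern in CREATOR_PATTERNS:
        match = pattern.search(ocr_text)
        if match:
            return match.group("name")
    return None
```

The bare prefix "von" is left out on purpose: on its own it is too ambiguous for a plain pattern match and would need supporting evidence such as position on the page or a hit in a name dictionary.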

Preliminary work (III)

Choosing / preparing dictionaries for tagging, matching and mapping purposes:

− List of universities which have the right to award degrees (identifying the corporate bodies)

− Name Authority File subset (identifying personal names)

− List of academic degrees

Theses statement items (examples):

− …

− Berlin, ESCP Europe Wirtschaftshochschule

− Berlin, Freie Univ.

− Berlin, Humboldt-Univ.

− Berlin, Steinbeis-Hochsch.

− Berlin, Techn. Univ.

− Berlin, Univ. der Künste

− …

Academic degrees (examples):

− …

− M.A. Master of Arts / Magister Artium

− M.Sc. Master of Science

− M.Eng. Master of Engineering

− LL.M. Master of Laws / Legum Magister

− M.F.A. Master of Fine Arts

− M.Mus. Master of Music

− M.Ed. Master of Education

− …
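Dictionary-based matching and mapping can be sketched as follows. The degree entries are the examples from the slide. The `VARIANTS` table, which maps spellings as they might appear on a title page to the dictionary form, is a hypothetical illustration, as are the function names.

```python
from typing import Optional, Tuple

# Degree abbreviations and expansions taken from the examples above.
DEGREES = {
    "M.A.": "Master of Arts / Magister Artium",
    "M.Sc.": "Master of Science",
    "M.Eng.": "Master of Engineering",
    "LL.M.": "Master of Laws / Legum Magister",
    "M.F.A.": "Master of Fine Arts",
    "M.Mus.": "Master of Music",
    "M.Ed.": "Master of Education",
}

# Hypothetical mapping: lower-cased spelling on the page -> "<city>, <corporate body>".
VARIANTS = {
    "freie universität berlin": "Berlin, Freie Univ.",
    "humboldt-universität zu berlin": "Berlin, Humboldt-Univ.",
    "technische universität berlin": "Berlin, Techn. Univ.",
}


def match_university(ocr_text: str) -> Optional[str]:
    """Map a spelling found in the OCR text to its dictionary entry."""
    lowered = ocr_text.lower()
    for variant, entry in VARIANTS.items():
        if variant in lowered:
            return entry
    return None


def match_degree(ocr_text: str) -> Optional[Tuple[str, str]]:
    """Return (abbreviation, expansion) for the first degree found."""
    for abbreviation, expansion in DEGREES.items():
        if abbreviation in ocr_text:
            return abbreviation, expansion
    return None
```

Plain substring matching is the simplest possible choice here. OCR errors on real scans would probably call for a more tolerant comparison.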

Preliminary work (IV)

− Setting up a sample of documents for evaluation purposes:

− 1,000 theses from several universities

− Publication year: 2010 – 2011

− Different dimensions (A- and B-size)

− Scans: 300 dpi, bitonal

− Transfer format: PDF (in future: XML files)

− Ground truth determination:

− Manual region tagging on image files (done in Vietnam with the Aletheia tool)

Document processing in brief

− Database: Storage of all available information (OCR output, automatically or manually produced annotations, dictionaries, facts etc.)

− Input of expert rules

− Rule engine: Stepwise proceeding taking intermediary results into account

Illustration: University of Innsbruck
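The stepwise proceeding of a rule engine can be illustrated with a generic forward-chaining loop. This is a sketch of the general technique and not FEP's actual engine: the rule functions, the fact keys and the one-entry `UNIVERSITY_DICTIONARY` are all hypothetical, and the theses statement is simplified (the type of publication is left out).

```python
import re
from typing import Callable, Dict, List

Facts = Dict[str, str]
Rule = Callable[[Facts], Facts]

# Hypothetical dictionary: spelling on the page -> (city, corporate body).
UNIVERSITY_DICTIONARY = {"Freie Universität Berlin": ("Berlin", "Freie Univ.")}


def rule_year(facts: Facts) -> Facts:
    # Works directly on the OCR output.
    match = re.search(r"\b(?:19|20)\d{2}\b", facts.get("ocr_text", ""))
    return {"year": match.group(0)} if match else {}


def rule_university(facts: Facts) -> Facts:
    # Dictionary lookup on the OCR output.
    for spelling, (city, body) in UNIVERSITY_DICTIONARY.items():
        if spelling in facts.get("ocr_text", ""):
            return {"city": city, "corporate_body": body}
    return {}


def rule_theses_statement(facts: Facts) -> Facts:
    # Depends on intermediate results produced by the two rules above.
    if all(key in facts for key in ("city", "corporate_body", "year")):
        return {"theses_statement":
                f"{facts['city']}, {facts['corporate_body']}, {facts['year']}"}
    return {}


def run_rules(facts: Facts, rules: List[Rule], max_rounds: int = 5) -> Facts:
    """Apply all rules repeatedly until no rule adds or changes a fact."""
    facts = dict(facts)
    for _ in range(max_rounds):
        changed = False
        for rule in rules:
            for key, value in rule(facts).items():
                if facts.get(key) != value:
                    facts[key] = value
                    changed = True
        if not changed:
            break
    return facts
```

Because the loop repeats until nothing changes, the order in which the rules are listed does not matter: a rule that depends on intermediate results simply fires in a later round.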

Results

Second test phase with a revised list of universities (June 2012):

[Figure: results of the second test phase; the values are not preserved in this text]

Categories: (1) total conformity; (2) complete title + noise (the noise just has to be deleted by the staff)
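The two categories, (1) total conformity and (2) complete title + noise, can be illustrated by comparing an extracted title with its ground truth. This is a sketch under assumptions: the whitespace normalisation, the substring test and the third category "other" are illustrative choices, not the evaluation procedure actually used in the pilot.

```python
from collections import Counter
from typing import Iterable, Tuple


def normalise(text: str) -> str:
    """Collapse runs of whitespace so that line breaks do not count as differences."""
    return " ".join(text.split())


def classify_title(extracted: str, ground_truth: str) -> str:
    ext, truth = normalise(extracted), normalise(ground_truth)
    if ext == truth:
        return "total conformity"
    if truth and truth in ext:
        # The full title was found, surrounded by text the staff would delete.
        return "complete title + noise"
    return "other"  # assumed catch-all category, not named on the slide


def tally(pairs: Iterable[Tuple[str, str]]) -> Counter:
    """Count categories over (extracted, ground truth) pairs."""
    return Counter(classify_title(ext, truth) for ext, truth in pairs)
```

Counting the second category separately makes sense for the workflow described earlier, since deleting surplus text is much quicker for the staff than keying in a title.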

Forecast: Feasibility study

− Technical and organisational requirements: Operational aspects, technical workflow, interfaces etc.

− Further functional enhancements needed:

− Dictionary maintenance: Expanding controlled vocabulary, sorting out unsuitable items etc

− Taking additional facts into account: Ground truth etc

− Additional expert rules (?)

− Additional functions: Language guesser, document size etc.

− Customising FEP (?)

New ideas

− Extraction of defined structures from the body of monographic publications, such as table of contents, abstracts, pure text (without any introductory remarks, footers, references etc.)

Target:

− Improvement of the results of current automated subject cataloguing projects, such as

− Thematic classification by machine learning techniques

− Obtaining subject headings by text analysis techniques

Reducing the noise via preceding structure analysis processes

Thank you for your attention.

Christa Schöning-Walter, Staff position 'Automated Cataloguing', German National Library

Sandra Hamm, Project leader, German National Library

Digital Services

Frankfurt am Main, Germany
