IMPACT final event – The Hague – 26 June 2012
Metadata extraction from title pages
Evaluation of the FEP pilot
at the German National Library
Christa Schöning-Walter
Outline
1. Institutional background
2. IMPACT test case
3. Strategic goals
4. Preliminary work
5. Results
6. Perspective
| IMPACT event | June 26, 2012 |2
The German National Library (DNB) – some facts and figures (I)
− Legal deposit: Collecting, cataloguing, archiving and making available to the general public all German and German-language publications, publications about Germany etc from 1913
− Bibliographic services:
− National Bibliography
− Authority files
− Bibliographic standards
− 2 sites: Leipzig, Frankfurt am Main
The German National Library (DNB) – some facts and figures (II)
− Collection size (January 2012): 27 million media units
− Daily input: 1,500 physical units (each with 2 copies)
− Since 2006: Collection mandate includes non-physical media (online publications)
− DNBG = Law regarding the German National Library
− PflAV = Legal Deposit Regulation
− Since 2009: Considerations on and implementation of automated cataloguing processes
Target of the IMPACT scenario
Opening questions (summer 2011):
− Can metadata extraction from title pages successfully be done by a rule engine in case of simple structured monographic publications?
− Is this useful in order to accelerate the cataloguing processes if no machine-readable metadata from other sources is available?
Test case: Theses
− 14,000 print units annually
− Simple structure (?)
Starting point
Since January 2012:
− Experimental application studies in collaboration with the University of Innsbruck
− Using the rule-based exploitation features of FEP (Functional Extension Parser)
What is FEP?
− Software platform for the purpose of analysing the logical structure of documents
− Developed within IMPACT work package EE4 (Goal: enrichment of OCR output with structure information)
Strategic goals
In particular:
− Making descriptive cataloguing less time-consuming and literature processing of printed media faster by
− Partial digitisation
− Automated metadata extraction
− Result transfer into the bibliographic record
− Quality check and completion of cataloguing by the staff
Generally:
− Gaining experience in the area of automated metadata extraction / automated cataloguing
Conceptual design of the workflow
Example: http://d-nb.info/1017138931
[Workflow diagram. Components shown: Accession (printed media units); Bibliographic record; Data Provider; Repository; OAI-Harvester; Cataloguing; Quality check; Statistics; FEP results; Service partner; Scan service (title page + ToC); OCR output / Indexing; Statistics; Stack]
The idea: Taking over bibliographic data from metadata mining tools.
The Objective: Automated exploitation of descriptive bibliographic data
− Specification, implementation, evaluation and gradual improvement of
− Appropriate structure types
− Dictionaries (controlled vocabulary, indicating keywords, abbreviations etc)
− Expert rules
− Etc
Illustration: University of Innsbruck
Preliminary work (I)
− Specification of the bibliographic statements to be mined from the title page
Attribute          Value
Publication year   2010
Language code      /1ger, /1eng
Creator            <last name>,<first name>
Title              <full title>:<additional title information>/<author statement>
Size               30 cm, 21 cm
Theses statement   <city name>, <corporate body name>,<type of publication>,<year of graduation>
Preliminary work (II)
− Reviewing several hundred title pages of theses (scans from 2009-2011 plus documents from daily business)
− Exploring typical structural patterns / regularities etc, such as
− Prefixes
− Phrases
− Notation
− Position
Examples of indicating phrases to find out the creator:
− von
− von <Verfasser> vorgelegte Dissertation
− von Herrn/Frau:
− vorgelegt von(:)
− vorgelegt JJJJ von
− vorgelegt dem Fachbereich ... von
− Name:
− Name des Verfassers:
− Name der Verfasserin:
− verfasst von(:)
− eingereicht von
− ...
Expert rules
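Indicating phrases of this kind lend themselves to pattern matching on the OCR lines of a title page. The following is a minimal sketch in Python, not the FEP rule syntax: the function name `find_creator`, the subset of phrases chosen and the fallback to the next non-empty line are assumptions made for illustration.

```python
import re

# A subset of the indicating phrases listed on the slide.
# Phrases with "(:)" accept an optional colon; the others require one.
_CREATOR_PREFIX = re.compile(
    r"^\s*(?:"
    r"(?:vorgelegt von|verfasst von|eingereicht von)(?::|\b)"
    r"|(?:von Herrn/Frau|Name des Verfassers|Name der Verfasserin|Name):"
    r")\s*(?P<name>.*?)\s*$",
    re.IGNORECASE,
)

def find_creator(lines):
    """Return the text that follows the first indicating phrase, or None.

    If the phrase stands alone on its line, the next non-empty line is
    taken as the name (an assumption about typical title page layout).
    """
    for index, line in enumerate(lines):
        match = _CREATOR_PREFIX.match(line)
        if not match:
            continue
        if match.group("name"):
            return match.group("name")
        for following in lines[index + 1:]:
            if following.strip():
                return following.strip()
    return None

print(find_creator(["Dissertation", "vorgelegt von", "", "Erika Mustermann"]))
```

A real rule set would also need to handle the phrases with embedded variables (for example "vorgelegt JJJJ von") and to check the candidate name against the Name Authority File.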
Preliminary work (III)
Choosing / preparing dictionaries for tagging, matching and mapping purposes:
− List of universities entitled to award degrees (identifying the corporate bodies)
− Name Authority File subset (identifying personal names)
− List of academic degrees
Theses statement items (examples):
− …
− Berlin, ESCP Europe Wirtschaftshochschule
− Berlin, Freie Univ.
− Berlin, Humboldt-Univ.
− Berlin, Steinbeis-Hochsch.
− Berlin, Techn. Univ.
− Berlin, Univ. der Künste
− …
Academic degrees (examples):
− …
− M.A. Master of Arts / Magister Artium
− M.Sc. Master of Science
− M.Eng. Master of Engineering
− LL.M. Master of Laws / Legum Magister
− M.F.A. Master of Fine Arts
− M.Mus. Master of Music
− M.Ed. Master of Education
− …
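Dictionaries like these can be used to tag OCR text and map what was found to the controlled form. The sketch below is only an illustration in Python: the dictionary entries are taken from the examples above, while the spelled-out university names used as surface forms, the function name `tag` and the simple substring matching are assumptions, not the way FEP handles dictionaries.

```python
# Controlled form -> surface forms expected on a title page.
# The surface forms are assumptions added for illustration.
UNIVERSITIES = {
    "Berlin, Freie Univ.": ["Freie Universität Berlin"],
    "Berlin, Humboldt-Univ.": ["Humboldt-Universität zu Berlin"],
    "Berlin, Techn. Univ.": ["Technische Universität Berlin"],
}

# Abbreviation -> long forms, as listed on the slide.
DEGREES = {
    "M.A.": "Master of Arts / Magister Artium",
    "M.Sc.": "Master of Science",
    "LL.M.": "Master of Laws / Legum Magister",
}

def tag(text):
    """Return (kind, controlled form) pairs for dictionary entries found in text."""
    lowered = text.casefold()
    found = []
    for controlled, variants in UNIVERSITIES.items():
        if any(variant.casefold() in lowered for variant in variants):
            found.append(("corporate_body", controlled))
    for abbreviation, long_forms in DEGREES.items():
        candidates = [abbreviation] + long_forms.split(" / ")
        if any(candidate.casefold() in lowered for candidate in candidates):
            found.append(("degree", abbreviation))
    return found

print(tag("Dissertation, Freie Universität Berlin, Master of Science"))
```

Plain substring matching is the simplest possible choice here; OCR errors and abbreviated spellings on real title pages would call for fuzzier matching.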
Preliminary work (IV)
− Setting up a sample of documents for evaluation purposes:
− 1,000 theses from several universities
− Publication year: 2010 – 2011
− Different dimensions (A- and B-size)
− Scans: 300 dpi, bitonal
− Transfer format: PDF (in future: XML files)
− Ground truth determination:
− Manual region tagging on image files (done in Vietnam with the Aletheia tool)
Document processing in brief
− Database: Storage of all available information (OCR output, automatically or manually produced annotations, dictionaries, facts etc)
− Input of expert rules
− Rule engine: Stepwise processing that takes intermediate results into account
Illustration: University of Innsbruck
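The stepwise principle can be pictured as rules that read from and write to a shared store of facts, so that a later rule can build on what an earlier one found. The following Python sketch shows that idea only; it is not the FEP rule engine. All rule names, the sample OCR text and the abbreviation "Diss." are assumptions made for illustration.

```python
import re

# Each hypothetical rule reads the facts gathered so far and returns new ones.
def rule_thesis_statement(facts):
    # Depends on intermediate results produced by the other rules.
    needed = ("corporate_body", "publication_type", "year")
    if all(key in facts for key in needed):
        return {"thesis_statement": ", ".join(facts[key] for key in needed)}
    return {}

def rule_corporate_body(facts):
    if "Freie Universität Berlin" in facts["ocr_text"]:
        return {"corporate_body": "Berlin, Freie Univ."}
    return {}

def rule_publication_type(facts):
    if "Dissertation" in facts["ocr_text"]:
        return {"publication_type": "Diss."}  # assumed abbreviation
    return {}

def rule_year(facts):
    match = re.search(r"\b(?:19|20)\d{2}\b", facts["ocr_text"])
    return {"year": match.group(0)} if match else {}

def run_rules(rules, facts):
    """Apply all rules repeatedly until no rule contributes a new fact."""
    facts = dict(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            for key, value in rule(facts).items():
                if key not in facts:
                    facts[key] = value
                    changed = True
    return facts

RULES = [rule_thesis_statement, rule_corporate_body, rule_publication_type, rule_year]
result = run_rules(RULES, {"ocr_text": "Dissertation\nFreie Universität Berlin\n2010"})
print(result["thesis_statement"])  # Berlin, Freie Univ., Diss., 2010
```

The rule that assembles the theses statement is listed first on purpose: it yields nothing in the first pass and only fires in the second, once the intermediate facts are available.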
Results
Second test phase with a revised list of universities (June 2012):
Categories: (1) total conformity; (2) complete title + noise (the noise only has to be deleted by the staff)
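Given a ground-truth title, the two categories can be told apart mechanically. The following Python sketch shows one way to do so; the function name `classify_title`, the whitespace and case normalisation and the residual category "other" are assumptions, not the evaluation procedure used in the pilot.

```python
import re

def _normalise(text: str) -> str:
    """Collapse whitespace and ignore case before comparing."""
    return re.sub(r"\s+", " ", text).strip().casefold()

def classify_title(extracted: str, ground_truth: str) -> str:
    got, want = _normalise(extracted), _normalise(ground_truth)
    if got == want:
        return "total conformity"
    if want and want in got:
        # The full title is present; surplus text only needs deleting.
        return "complete title + noise"
    return "other"

print(classify_title("Dissertation Metadata extraction from title pages",
                     "Metadata extraction from title pages"))
```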
Forecast: Feasibility study
− Technical and organisational requirements: Operational aspects, technical workflow, interfaces etc
− Further functional enhancements needed:
− Dictionary maintenance: Expanding controlled vocabulary, sorting out unsuitable items etc
− Taking additional facts into account: Ground truth etc
− Additional expert rules (?)
− Additional functions: Language guesser, document size etc
− Customising FEP (?)
New ideas
− Extraction of defined structures from the body of monographic publications, such as table of contents, abstracts, pure text (without any introductory remarks, footers, references etc)
Target:
− Improvement of the results of current automated subject cataloguing projects, such as
− Thematic classification by machine learning techniques
− Obtaining subject headings by text analysis techniques
Reducing the noise via preceding structure analysis processes
Thank you for your attention.
Christa Schöning-Walter Sandra Hamm
Staff position ’Automated Cataloguing’ Project leader
[email protected] [email protected]
German National Library
Digital Services
Frankfurt am Main, Germany