ACL/LaTeCH-Portland, June 24th 2011
Enrichment and Structuring of Archival Description Metadata
Kalliopi Zervanou*, Ioannis Korkontzelos**, Antal van den Bosch* & Sophia Ananiadou**
* Tilburg Centre for Cognition & Communication, The University of Tilburg, NL
  K.Zervanou@uvt.nl, Antal.vdnBosch@uvt.nl
** National Centre for Text Mining, The University of Manchester, UK
  Ioannis.Korkontzelos@manchester.ac.uk, Sophia.Ananiadou@manchester.ac.uk
Research on Metadata
• Developing standards:
  – collection specific (e.g. EAD, MARC21)
  – cross-collection (e.g. Dublin Core)
• Provide mappings:
  – across schemas
  – ontologies (ad hoc or standard CIDOC-CRM)
• Discard metadata for IR (Koolen et al., 2007)
• Exploit metadata for IR (Zhang & Kamps, 2009)
The IISH EAD dataset
• EAD: XML standard for encoding archival descriptions
• Challenges:
  – Variety of languages used
  – Varying type and amount of information
  – Style: enumerations, lists, incomplete sentences
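As an illustration of what reading an EAD description involves, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The sample record is invented, and the choice of `unittitle`, `bioghist` and `scopecontent` as the free-text elements is an assumption made for this example; the slides do not list the elements actually used.

```python
import xml.etree.ElementTree as ET

# Invented toy record; real EAD files are far larger and may declare an
# XML namespace, in which case tags appear as "{namespace}unittitle".
SAMPLE_EAD = """<ead>
  <archdesc level="fonds">
    <did>
      <unittitle>Example Trade Union Archive</unittitle>
    </did>
    <bioghist><p>Founded in 1901; merged in 1920.</p></bioghist>
    <scopecontent><p>Minutes, correspondence, <emph>pamphlets</emph>.</p></scopecontent>
  </archdesc>
</ead>"""

# Assumption: these three EAD elements stand in for "free-text" content.
FREE_TEXT_ELEMENTS = ("unittitle", "bioghist", "scopecontent")


def extract_free_text(ead_xml, wanted=FREE_TEXT_ELEMENTS):
    """Return (tag, text) pairs for the selected elements, in document order."""
    root = ET.fromstring(ead_xml)
    snippets = []
    for element in root.iter():
        if element.tag in wanted:
            # itertext() also collects text inside nested markup such as <p> or <emph>.
            text = " ".join("".join(element.itertext()).split())
            if text:
                snippets.append((element.tag, text))
    return snippets


for tag, text in extract_free_text(SAMPLE_EAD):
    print(f"{tag}: {text}")
```

The short, list-like strings this returns reflect the style challenge noted above: many elements hold enumerations or sentence fragments rather than running text.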
Motivation & Objectives
• Improved search and retrieval
  – content-based metadata document clustering
  – content-based/semantic search
  – support exploratory search
  – link across collections, metadata formats & institutions
  – create unified metadata knowledge resources
Method overview
Pre-processing
• EAD/XML element selection & extraction
  – EAD elements containing free-text & archive content information
• Language identification (n-gram method)
  – Identifier trained on Europarl corpus
• Text snippets length: ~20 tokens
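The language identification step can be sketched as follows. The slides only say "n-gram method"; this example assumes the rank-based "out-of-place" comparison of character n-gram profiles (Cavnar & Trenkle style), which may differ from the identifier actually used. The two training sentences are invented stand-ins for the Europarl corpus, and all function names are hypothetical.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Map the top_k most frequent character n-grams (1..n_max) to their rank."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(
        padded[i:i + n]
        for n in range(1, n_max + 1)
        for i in range(len(padded) - n + 1)
    )
    # Sort by descending count, then alphabetically, so ranks are deterministic.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))[:top_k]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams unknown to the language get a fixed penalty."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def identify(snippet, lang_profiles):
    """Return the language whose profile is closest to the snippet's profile."""
    doc_profile = ngram_profile(snippet)
    return min(
        lang_profiles,
        key=lambda lang: out_of_place(doc_profile, lang_profiles[lang]),
    )


# Assumption: toy training data; a usable identifier needs a large corpus per language.
TOY_TRAINING = {
    "en": "the archive contains the minutes and the correspondence of the trade union",
    "nl": "het archief bevat de notulen en de correspondentie van de vakbond",
}
PROFILES = {lang: ngram_profile(text) for lang, text in TOY_TRAINING.items()}

print(identify("correspondence of the union", PROFILES))
```

Profiles built from very short snippets are sparse, which is one reason a minimum snippet length matters for this kind of identifier.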
Snippet length based on language
Enrichment & Structuring
• Topic detection: Automatic term recognition using C-value method
• Agglomerative hierarchical term clustering:
  – complete, single & average linkage criteria
  – document co-occurrence & lexical similarity measures
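For the term recognition step, the core C-value score can be sketched as below. This is a simplified reading of the published formula (Frantzi, Ananiadou & Mima): log2 of the term length times its frequency, reduced by the mean frequency of the longer candidates that contain it. Candidate extraction with linguistic filters is omitted, and the frequencies are invented toy numbers.

```python
import math


def contains(longer, shorter):
    """True if the word tuple `shorter` occurs contiguously inside a longer tuple."""
    n = len(shorter)
    return len(longer) > n and any(
        longer[i:i + n] == shorter for i in range(len(longer) - n + 1)
    )


def c_value(freq):
    """Score candidates; `freq` maps word tuples to total corpus frequency."""
    scores = {}
    for term, f in freq.items():
        # Frequencies of the longer candidates in which this term is nested.
        nesting = [f_b for other, f_b in freq.items() if contains(other, term)]
        if nesting:
            f = f - sum(nesting) / len(nesting)
        # log2 of the length means single-word candidates always score 0.
        scores[term] = math.log2(len(term)) * f
    return scores


# Assumption: invented frequencies, for illustration only.
TOY_FREQ = {
    ("trade", "union"): 10,
    ("trade", "union", "congress"): 4,
    ("international", "trade", "union"): 2,
}

for term, score in sorted(c_value(TOY_FREQ).items(), key=lambda kv: -kv[1]):
    print(" ".join(term), round(score, 2))
```

Here "trade union" is nested in two longer candidates with mean frequency 3, so its score is 1 × (10 − 3) = 7.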
Term results (auto eval)
Results
• C-value best performance: candidates that occur as non-nested at least once
• Average linkage criterion & Doc Co-occurrence: provide broader and richer hierarchies
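To make the linkage criteria concrete, here is a small pure-Python sketch of agglomerative term clustering over document co-occurrence. Using Jaccard distance between the sets of documents in which terms occur is an assumption for this example; the slides do not define the co-occurrence measure. The terms and document ids are invented.

```python
from itertools import combinations


def jaccard_distance(docs_a, docs_b):
    """1 minus the overlap of two document-id sets (assumed co-occurrence measure)."""
    union = docs_a | docs_b
    return 1.0 - len(docs_a & docs_b) / len(union) if union else 1.0


LINKAGE = {
    "single": min,                              # closest pair of members
    "complete": max,                            # farthest pair of members
    "average": lambda ds: sum(ds) / len(ds),    # mean over all member pairs
}


def agglomerate(term_docs, linkage="average"):
    """Merge the closest clusters until one remains; return the merge history."""
    combine = LINKAGE[linkage]

    def cluster_distance(pair):
        a, b = pair
        return combine(
            [jaccard_distance(term_docs[x], term_docs[y]) for x in a for y in b]
        )

    clusters = [frozenset([term]) for term in sorted(term_docs)]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=cluster_distance)
        merges.append((sorted(a), sorted(b), cluster_distance((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges


# Assumption: invented terms mapped to the ids of documents that mention them.
TOY_TERM_DOCS = {
    "trade union": {1, 2, 3},
    "labour union": {1, 2, 3, 4},
    "annual report": {5, 6},
    "financial report": {5, 6, 7},
}

for left, right, distance in agglomerate(TOY_TERM_DOCS, "average"):
    print(left, "+", right, round(distance, 3))
```

Passing `"single"` or `"complete"` instead of `"average"` changes only how the distance between two clusters is derived from their members, which is what lets the three criteria be compared on the same data.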
Questions? Check out our poster!