ACL/LaTeCH-Portland, June 24th 2011

Enrichment and Structuring of Archival Description Metadata

Kalliopi Zervanou*, Ioannis Korkontzelos**, Antal van den Bosch* & Sophia Ananiadou**

* Tilburg Centre for Cognition & Communication, The University of Tilburg, NL
** National Centre for Text Mining, The University of Manchester, UK

Research on Metadata

• Developing standards:
  – collection-specific (e.g. EAD, MARC21)
  – cross-collection (e.g. Dublin Core)

• Provide mappings:
  – across schemas
  – ontologies (ad hoc, or the standard CIDOC-CRM)

• Discard metadata for IR (Koolen et al., 2007)

• Exploit metadata for IR (Zhang & Kamps, 2009)

The IISH EAD dataset

• EAD: XML standard for encoding archival descriptions

• Challenges:
  – Variety of languages used
  – Varying type and amount of information
  – Style: enumerations, lists, incomplete sentences
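
To make the format concrete, here is a minimal sketch of pulling free text out of an EAD-like fragment with Python's standard library. The sample record is invented, and the choice of `unittitle`, `bioghist` and `scopecontent` as free-text elements is an assumption for illustration, not the element selection used in this work. Real EAD files may also declare XML namespaces, which this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Invented sample record; real EAD files are far larger and more varied.
SAMPLE_EAD = """<ead>
  <archdesc level="fonds">
    <did>
      <unittitle>Example Collection</unittitle>
      <unitdate>1900-1950</unitdate>
    </did>
    <bioghist><p>Founded in 1900; active in the labour movement.</p></bioghist>
    <scopecontent><p>Correspondence, minutes, reports.</p></scopecontent>
  </archdesc>
</ead>"""

# Assumption: these EAD elements are treated as carrying free text.
FREE_TEXT_TAGS = {"unittitle", "bioghist", "scopecontent"}


def extract_free_text(xml_string, tags=FREE_TEXT_TAGS):
    """Return (tag, text) pairs for the selected elements, in document order."""
    root = ET.fromstring(xml_string)
    snippets = []
    for elem in root.iter():
        if elem.tag in tags:
            # Join all nested text and normalise whitespace.
            text = " ".join("".join(elem.itertext()).split())
            if text:
                snippets.append((elem.tag, text))
    return snippets


SNIPPETS = extract_free_text(SAMPLE_EAD)
```

Even this tiny record shows the style issue listed above: the scope note is an enumeration rather than a full sentence.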

Motivation & Objectives

• Improved search and retrieval
  – content-based metadata document clustering
  – content-based/semantic search
  – support exploratory search
  – link across collections, metadata formats & institutions
  – create unified metadata knowledge resources

Method overview

Pre-processing

• EAD/XML element selection & extraction
  – EAD elements containing free text & archive content information

• Language identification (n-gram method)
  – Identifier trained on Europarl corpus

• Text snippet length: ~20 tokens
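
The language identification step can be illustrated with one common n-gram technique, the rank-order ("out-of-place") comparison of character n-gram profiles. The slides only say "n-gram method", so this particular variant is an assumption. The two one-sentence training texts are toy stand-ins for Europarl, and `ngram_profile`, `out_of_place` and `identify_language` are hypothetical helper names.

```python
from collections import Counter


def ngram_profile(text, n=3, top_k=300):
    """Map each of the top_k most frequent character n-grams to its rank."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    ranked = [gram for gram, _ in counts.most_common(top_k)]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(snippet_profile, lang_profile, max_penalty=300):
    """Sum of rank differences; n-grams missing from the language profile
    cost max_penalty."""
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in snippet_profile.items()
    )


def identify_language(snippet, lang_profiles):
    """Return the language whose profile is closest to the snippet's."""
    snippet_profile = ngram_profile(snippet)
    return min(
        lang_profiles,
        key=lambda lang: out_of_place(snippet_profile, lang_profiles[lang]),
    )


# Toy training data (assumption): one sentence per language instead of Europarl.
TOY_TRAINING = {
    "en": "the archive contains letters and reports of the international labour movement",
    "nl": "het archief bevat brieven en verslagen van de internationale arbeidersbeweging",
}
TOY_PROFILES = {lang: ngram_profile(text) for lang, text in TOY_TRAINING.items()}

ENGLISH_GUESS = identify_language("letters and reports", TOY_PROFILES)
DUTCH_GUESS = identify_language("brieven en verslagen", TOY_PROFILES)
```

Short snippets yield sparse profiles, which is one reason the snippet length matters for identification accuracy.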

Snippet length based on language

Method overview

Enrichment & Structuring

• Topic detection: automatic term recognition using the C-value method

• Agglomerative hierarchical term clustering:
  – complete, single & average linkage criteria
  – document co-occurrence & lexical similarity measures
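
As a rough illustration of the clustering step, the sketch below runs naive agglomerative clustering over terms with a selectable linkage criterion. The distance used here, one minus the Jaccard overlap of the document sets in which two terms occur, is only one possible document co-occurrence measure and is an assumption, as are the toy terms and document IDs.

```python
from itertools import combinations


def jaccard_distance(docs_a, docs_b):
    """1 - |A n B| / |A u B| over the sets of documents containing each term."""
    union = docs_a | docs_b
    return 1.0 - len(docs_a & docs_b) / len(union) if union else 1.0


# How pairwise term distances are combined into a cluster-to-cluster distance.
LINKAGE = {
    "single": min,
    "complete": max,
    "average": lambda distances: sum(distances) / len(distances),
}


def agglomerative(term_docs, linkage="average"):
    """Merge the two closest clusters until one remains.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    which encodes the resulting term hierarchy bottom-up.
    """
    combine = LINKAGE[linkage]
    clusters = [(term,) for term in sorted(term_docs)]
    merges = []

    def cluster_distance(c1, c2):
        return combine(
            [jaccard_distance(term_docs[a], term_docs[b]) for a in c1 for b in c2]
        )

    while len(clusters) > 1:
        c1, c2 = min(
            combinations(clusters, 2), key=lambda pair: cluster_distance(*pair)
        )
        merges.append((c1, c2, cluster_distance(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Toy input (assumption): term -> IDs of the documents it occurs in.
TERM_DOCS = {
    "strike": {1, 2, 3},
    "trade union": {1, 2, 3, 4},
    "anarchism": {5, 6},
    "anarchist press": {5, 6},
}
MERGES = agglomerative(TERM_DOCS, linkage="average")
```

Swapping `linkage` between `"single"`, `"complete"` and `"average"` changes only how distances between multi-term clusters are combined, so the same input can be used to compare the hierarchies each criterion produces.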

Method overview

Term results (auto eval)

Results

• C-value best performance: candidates that occur as non-nested at least once

• Average linkage criterion & document co-occurrence: provide broader and richer hierarchies
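
For readers unfamiliar with the nested/non-nested distinction in the first result, the sketch below computes the basic C-value score: log2 of the term length times its frequency, with the frequency of a nested candidate reduced by the mean frequency of the longer candidates that contain it. It is a simplified illustration, not the configuration evaluated here. It omits the linguistic filter and frequency thresholds, and the candidate terms and counts are invented.

```python
import math


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous sub-sequence of `longer`."""
    n, m = len(longer), len(shorter)
    return n > m and any(longer[i:i + m] == shorter for i in range(n - m + 1))


def c_value(freq):
    """Score multi-word candidates.

    freq maps each candidate (a tuple of tokens) to its total corpus
    frequency, including occurrences nested inside longer candidates.
    """
    scores = {}
    for term, f_term in freq.items():
        # Frequencies of the longer candidates in which this term is nested.
        nesting = [f_b for b, f_b in freq.items() if contains(b, term)]
        if nesting:
            adjusted = f_term - sum(nesting) / len(nesting)
        else:
            adjusted = f_term
        scores[term] = math.log2(len(term)) * adjusted
    return scores


# Invented counts for illustration only.
CANDIDATE_FREQ = {
    ("international", "labour", "movement"): 5,
    ("labour", "movement"): 8,
}
SCORES = c_value(CANDIDATE_FREQ)
```

Here "labour movement" occurs 8 times, 5 of them inside the longer candidate, so its score rests on the 3 remaining occurrences. A candidate that appeared only inside that one longer term would score zero.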

Questions? Check out our poster!