Thesauri Quality Assessment: Analyzing the Rijksmuseum Library Thesaurus
By: Daan de Ruijter
Supervised by: Jacco van Ossenbruggen & Chris Dijkshoorn
Agenda
▪ Background information and context
▪ Research layout
▪ Results
▪ Conclusion
▪ Questions and discussion
What is a Thesaurus?
A thesaurus is a:
➔ Structured vocabulary
➔ Describing concepts
➔ According to a predefined format
A characterizing feature of thesauri is the hierarchy between different concepts.
➔ A hierarchy helps with searching, navigating and maintaining the vocabulary
An Example of how a Thesaurus Describes a Concept:
Archeologie
What falls under this concept?
➔ All literature about “archeologie”
➔ This is a design choice
What kind of information does a thesaurus need to describe this concept?
➔ A unique identifier (ID)
➔ A preferred label (in one or multiple languages)
➔ Alternative labels (e.g. synonyms or verb forms)
➔ Hierarchical relations with other concepts
  ◆ Broader, Narrower or Related
➔ Time-related data (creation date, last modification)
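As a minimal sketch, the fields listed above could be held in a small Python record. The class name `Concept` and all field names are illustrative assumptions, not the format used by the Rijksmuseum. The sample values are taken from the “archeologie” record used as the running example in this deck.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional


@dataclass
class Concept:
    """Illustrative container for one thesaurus concept (all names are assumptions)."""
    concept_id: str                                    # unique identifier
    pref_labels: Dict[str, str]                        # language code -> preferred label
    alt_labels: Dict[str, List[str]] = field(default_factory=dict)  # language -> alternatives
    broader: List[str] = field(default_factory=list)   # IDs of broader concepts
    narrower: List[str] = field(default_factory=list)  # IDs of narrower concepts
    related: List[str] = field(default_factory=list)   # IDs of related concepts
    created: Optional[date] = None                     # creation date
    modified: Optional[date] = None                    # last modification


archeologie = Concept(
    concept_id="126536",
    pref_labels={"nl": "archeologie", "en": "archaeology"},
    narrower=["126561", "126580"],
    created=date(2009, 12, 31),
    modified=date(2014, 11, 21),
)
print(archeologie.pref_labels["nl"], archeologie.narrower)
```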
Context
The Rijksmuseum Library thesaurus is currently maintained in the MARC format
➔ This is a somewhat dated format
➔ The thesaurus relies on manual data entry
➔ Entries lack quality assurance
Main research focus:
How can we assess the quality of such a thesaurus?
Research Questions
1. What changes are caused when converting from MARC to SKOS?
2. What are the quality issues of the thesaurus expressed in SKOS?
3. How well can the Rijksmuseum collection and library thesauri be aligned with each other?
Research Layout
1. Convert MARC to SKOS
SKOS supports tools and standards for quality analysis.
Done by directly mapping XML tags with an XSL Transformation (XSLT).
2. Analyze SKOS Quality Issues
With formalized methods defined by previous studies.
Done with standard SKOS tools and custom Python scripts.
3. Align Library and Collection Thesauri
To see how well the two can be integrated.
Mainly done through string matching.
Data Structure: MARC
<record>
<leader>00485nz a2200169o 4500</leader>
<controlfield tag="001">126536</controlfield>
<controlfield tag="003">NL-AmRIJ</controlfield>
<controlfield tag="005">20141121114503.0</controlfield>
<controlfield tag="008">091231 ||az||||||||||||||||||||||||||| d</controlfield>
<datafield tag="040" ind1=" " ind2=" ">
<subfield code="a">NL-AmRIJ</subfield>
<subfield code="b">dut</subfield>
<subfield code="c">NL-AmRIJ</subfield>
<subfield code="e">fobidrtb</subfield>
</datafield>
<datafield tag="150" ind1=" " ind2=" ">
<subfield code="a">archeologie</subfield>
</datafield>
<datafield tag="550" ind1=" " ind2=" ">
<subfield code="w">h</subfield>
<subfield code="a">klassieke oudheid</subfield>
<subfield code="0">(NLAmRIJ)126561</subfield>
</datafield>
<datafield tag="550" ind1=" " ind2=" ">
<subfield code="w">h</subfield>
<subfield code="a">archeologische sites</subfield>
</datafield>
<datafield tag="550" ind1=" " ind2=" ">
<subfield code="w">h</subfield>
<subfield code="a">onderwaterarcheologie</subfield>
</datafield>
<datafield tag="550" ind1=" " ind2=" ">
<subfield code="w">h</subfield>
<subfield code="a">pre- en protohistorie</subfield>
<subfield code="0">(NLAmRIJ)126580</subfield>
</datafield>
<datafield tag="680" ind1=" " ind2=" ">
<subfield code="i">Vertaling: archaeology</subfield>
</datafield>
<datafield tag="942" ind1=" " ind2=" ">
<subfield code="a">TOPIC_TERM</subfield>
</datafield>
</record>
Data Structure: MARC
<record>
<leader>00485nz a2200169o 4500</leader>
<controlfield tag="001">126536</controlfield>
<controlfield tag="003">NL-AmRIJ</controlfield>
<controlfield tag="005">20141121114503.0</controlfield>
<controlfield tag="008">091231 ||az||||||||||||||||||||||||||| d</controlfield>
<datafield tag="150" ind1=" " ind2=" ">
<subfield code="a">archeologie</subfield>
</datafield>
<datafield tag="680" ind1=" " ind2=" ">
<subfield code="i">Vertaling: archaeology</subfield>
</datafield>
<datafield tag="550" ind1=" " ind2=" ">
<subfield code="w">h</subfield>
<subfield code="a">klassieke oudheid</subfield>
<subfield code="0">(NLAmRIJ)126561</subfield>
</datafield>
Record ID: controlfield 001
Record “nl” label: datafield 150, subfield “a”
Record “en” label: datafield 680, subfield “i”
Hierarchical relation: datafield 550
Data Structure: SKOS
<skos:Concept rdf:about="http://hdl.handle.net/10934/RM0001.LIBTHESAU.126536">
<skos:prefLabel xml:lang="nl">archeologie</skos:prefLabel>
<skos:prefLabel xml:lang="en">archaeology</skos:prefLabel>
<skos:narrower rdf:resource="http://hdl.handle.net/10934/RM0001.LIBTHESAU.126561"/>
<skos:narrower rdf:resource="http://hdl.handle.net/10934/RM0001.LIBTHESAU.126580"/>
<skos:inScheme rdf:resource="http://hdl.handle.net/10934/RM0001.SCHEME.TOPIC_TERM"/>
<skos:changeNote>
<rdf:Description>
<dct:modified>2014-11-21</dct:modified>
</rdf:Description>
</skos:changeNote>
<dct:created>2009-12-31</dct:created>
</skos:Concept>
Concept ID: rdf:about
Concept “nl” label: skos:prefLabel with xml:lang="nl"
Concept “en” label: skos:prefLabel with xml:lang="en"
Hierarchical relation: skos:narrower
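The conversion in this project was done with XSLT. The Python below is only an illustration of the same tag mapping, not the stylesheet that was used. It assumes that a 550 field with code “w” equal to “h” becomes a skos:narrower link (as in the example records), that the English label is the text after “Vertaling:” in field 680, and that a usable code “0” has the form `(NLAmRIJ)` or `(NL-AmRIJ)` followed by digits. The function names `subfield` and `marc_to_skos` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# URI prefix as it appears in the SKOS example record
BASE = "http://hdl.handle.net/10934/RM0001.LIBTHESAU."

MARC = """<record>
  <controlfield tag="001">126536</controlfield>
  <datafield tag="150" ind1=" " ind2=" ">
    <subfield code="a">archeologie</subfield>
  </datafield>
  <datafield tag="680" ind1=" " ind2=" ">
    <subfield code="i">Vertaling: archaeology</subfield>
  </datafield>
  <datafield tag="550" ind1=" " ind2=" ">
    <subfield code="w">h</subfield>
    <subfield code="a">klassieke oudheid</subfield>
    <subfield code="0">(NLAmRIJ)126561</subfield>
  </datafield>
  <datafield tag="550" ind1=" " ind2=" ">
    <subfield code="w">h</subfield>
    <subfield code="a">archeologische sites</subfield>
  </datafield>
</record>"""


def subfield(datafield, code):
    """Return the text of the first subfield with this code, or None."""
    node = datafield.find(f"subfield[@code='{code}']")
    return node.text if node is not None else None


def marc_to_skos(record_xml):
    """Map one MARC authority record to a dict of SKOS-like properties."""
    record = ET.fromstring(record_xml)
    concept = {
        "about": BASE + record.find("controlfield[@tag='001']").text,
        "prefLabel": {},
        "narrower": [],
        "unconverted": [],  # labels of 550 relations that could not be mapped
    }
    for df in record.findall("datafield"):
        tag = df.get("tag")
        if tag == "150":
            concept["prefLabel"]["nl"] = subfield(df, "a")
        elif tag == "680":
            note = subfield(df, "i") or ""
            if note.startswith("Vertaling:"):
                concept["prefLabel"]["en"] = note.split(":", 1)[1].strip()
        elif tag == "550":
            # Assumed ID shape: "(NLAmRIJ)" or "(NL-AmRIJ)" followed by digits
            target = re.fullmatch(r"\(NL-?AmRIJ\)(\d+)", subfield(df, "0") or "")
            if subfield(df, "w") == "h" and target:
                concept["narrower"].append(BASE + target.group(1))
            else:
                concept["unconverted"].append(subfield(df, "a"))
    return concept


print(marc_to_skos(MARC))
```

The `unconverted` list keeps the labels of 550 fields without a usable target ID, so that lost relations stay visible instead of being dropped silently.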
Converting the Thesaurus: MARC 550 Tag Errors. What are they?
In our concept “archeologie”:
➔ Number of hierarchical relations in MARC: 4
➔ Number of hierarchical relations in SKOS: 2
What happened?
<datafield tag="550" ind1=" " ind2=" ">
<subfield code="w">h</subfield>
<subfield code="a">klassieke oudheid</subfield>
<subfield code="0">(NLAmRIJ)126561</subfield>
</datafield>
<datafield tag="550" ind1=" " ind2=" ">
<subfield code="w">h</subfield>
<subfield code="a">archeologische sites</subfield>
</datafield>
<datafield tag="550" ind1=" " ind2=" ">
<subfield code="w">h</subfield>
<subfield code="a">onderwaterarcheologie</subfield>
</datafield>
<datafield tag="550" ind1=" " ind2=" ">
<subfield code="w">h</subfield>
<subfield code="a">pre- en protohistorie</subfield>
<subfield code="0">(NLAmRIJ)126580</subfield>
</datafield>
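Code “0” holds the concept ID of the related concept.
The entries for “archeologische sites” and “onderwaterarcheologie” have no code “0”.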
MARC 550 Tag Errors (N = 14828)
                      | code “w” | code “a”                       | code “0”
Error count           | 9        | 37                             | 875
Correct entry example | h        | boekwetenschap                 | (NL-AmRIJ)126543
Entry error examples  | w        | NULL                           | NULL
                      | 9        | mariaverering (NL-AmRIJ)131820 | (NL-AmRIJ)#129341
                      |          | hippodromen                    | (NL-AmRIJ)
Each error represents a hierarchical relation that cannot be converted to SKOS
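A check of this kind can be sketched in a few lines of Python. This is not the script used in the project. The function name `check_550`, the set of accepted “w” values (g for broader and h for narrower, following MARC 21 authority conventions) and the pattern for a valid code “0” are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed shape of a valid code "0" value: institution code followed by digits
ID_PATTERN = re.compile(r"\(NL-?AmRIJ\)\d+")
# Assumption: "g" (broader term) and "h" (narrower term) are the accepted values
VALID_W = {"g", "h"}


def check_550(w, a, zero):
    """Return the set of subfield codes that look invalid for one 550 entry."""
    errors = set()
    if w not in VALID_W:
        errors.add("w")
    if not a or a == "NULL":
        errors.add("a")
    if not zero or not ID_PATTERN.fullmatch(zero):
        errors.add("0")
    return errors


entries = [
    ("h", "boekwetenschap", "(NL-AmRIJ)126543"),  # correct entry from the table
    ("w", "NULL", "NULL"),                        # all three subfields invalid
    ("h", "archeologische sites", None),          # no code "0"
    ("h", "hippodromen", "(NL-AmRIJ)"),           # code "0" without a number
]
counts = Counter(code for entry in entries for code in check_550(*entry))
print(dict(counts))
```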
Language Coverage (N = 7826)
The low number of English terms and alternative labels could be seen as a quality issue
➔ This depends on the intended use of the thesaurus
          | nl   | en
prefLabel | 7826 | 60
altLabel  | 1149 | 0
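Counts like these can be gathered with a short script. The sketch below uses only the Python standard library and a two-concept sample in RDF/XML. The function name `language_coverage` and the sample document are assumptions for illustration, not the project's own scripts or data.

```python
import xml.etree.ElementTree as ET
from collections import Counter

SKOS = "http://www.w3.org/2004/02/skos/core#"
# ElementTree exposes xml:lang under the predefined XML namespace
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

RDF_XML = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://hdl.handle.net/10934/RM0001.LIBTHESAU.126536">
    <skos:prefLabel xml:lang="nl">archeologie</skos:prefLabel>
    <skos:prefLabel xml:lang="en">archaeology</skos:prefLabel>
  </skos:Concept>
  <skos:Concept rdf:about="http://hdl.handle.net/10934/RM0001.LIBTHESAU.129814">
    <skos:prefLabel xml:lang="nl">kunstenaars</skos:prefLabel>
  </skos:Concept>
</rdf:RDF>"""


def language_coverage(rdf_xml):
    """Count labels per (label type, language) over all skos:Concept elements."""
    counts = Counter()
    root = ET.fromstring(rdf_xml)
    for concept in root.iter(f"{{{SKOS}}}Concept"):
        for label_type in ("prefLabel", "altLabel"):
            for label in concept.findall(f"{{{SKOS}}}{label_type}"):
                counts[(label_type, label.get(XML_LANG))] += 1
    return counts


coverage = language_coverage(RDF_XML)
print(coverage)
```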
Quality Analysis Results (for all: N = 7826)
Quality Issue                    | Count in MARC | Count in SKOS | After Skosify
Omitted or Invalid Language Tags | 0             | 0             | 0
Incomplete Language Coverage     | 7766          | 7766          | 7766
No Common Language               | 0             | 0             | 0
Overlapping Labels               | 29            | 29            | 29
Empty Labels                     | 0             | 0             | 0
Orphan Concepts                  | 976           | 1391          | 1364
Cyclic Hierarchical Relations    | unknown       | 23            | 0
Valueless Associative Relations  | unknown       | 183           | 0
Omitted Top Concepts             | unknown       | 2043          | 0
Orphan concept: a concept without a hierarchical relation
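Two of these checks, orphan concepts and cyclic hierarchical relations, can be sketched on a plain dictionary of narrower links. This illustrates the idea only and is not how Skosify or the other tools implement it. The function names and the toy concept IDs A to D are assumptions.

```python
def find_orphans(concepts, narrower):
    """Concepts that take part in no hierarchical relation, as parent or child."""
    linked = set()
    for parent, children in narrower.items():
        if children:
            linked.add(parent)
            linked.update(children)
    return sorted(set(concepts) - linked)


def has_cycle(narrower):
    """Depth-first search reporting whether the narrower graph contains a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {}

    def visit(node):
        state[node] = GREY
        for child in narrower.get(node, []):
            child_state = state.get(child, WHITE)
            if child_state == GREY:
                return True  # reached a node already on the current path
            if child_state == WHITE and visit(child):
                return True
        state[node] = BLACK
        return False

    return any(state.get(n, WHITE) == WHITE and visit(n) for n in list(narrower))


concepts = ["A", "B", "C", "D"]
cyclic = {"A": ["B"], "B": ["C"], "C": ["A"]}  # A > B > C > A
acyclic = {"A": ["B"], "B": ["C"]}

print(find_orphans(concepts, cyclic))  # D has no hierarchical relation
print(has_cycle(cyclic), has_cycle(acyclic))
```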
What is an alignment?
An alignment is a:
➔ Pair of concepts from two different thesauri that are found to be identical
Alignments are most commonly found by:
➔ Exactly matching concept labels
➔ Matching modified labels (stemming, lemmatization)
➔ Comparing concept structures
An Alignment Example
Concept in the library thesaurus:
<skos:Concept rdf:about="http://hdl.handle.net/10934/RM0001.LIBTHESAU.129814">
<skos:prefLabel xml:lang="nl">kunstenaars</skos:prefLabel>
<skos:inScheme rdf:resource="http://hdl.handle.net/10934/RM0001.SCHEME.TOPIC_TERM"/>
</skos:Concept>
This concept is seen as a topic term.
Concept in the collection thesaurus:
<skos:Concept rdf:about="http://hdl.handle.net/10934/RM0001.THESAU.38160">
<skos:prefLabel xml:lang="nl">kunstenaar</skos:prefLabel>
<skos:inScheme rdf:resource="http://hdl.handle.net/10934/RM0001.SCHEME.OCCUPATION"/>
</skos:Concept>
This concept is seen as an occupation.
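The two labels in this example differ only by a plural “s”, so exact matching misses the pair while matching on stemmed labels finds it. The sketch below shows this with a deliberately crude suffix stripper. `crude_stem` is a hypothetical stand-in and not a real Dutch stemmer, and `align` is an assumed helper name rather than the project's code.

```python
def crude_stem(label):
    """Rough stand-in for a stemmer: lowercases and strips one plural ending."""
    word = label.lower().strip()
    for suffix in ("en", "s"):
        # Keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def align(library, collection, normalise=lambda s: s):
    """Pair concept IDs whose normalised labels are identical."""
    index = {}
    for concept_id, label in collection.items():
        index.setdefault(normalise(label), []).append(concept_id)
    return [
        (lib_id, col_id)
        for lib_id, label in library.items()
        for col_id in index.get(normalise(label), [])
    ]


library = {"LIBTHESAU.129814": "kunstenaars"}
collection = {"THESAU.38160": "kunstenaar"}

print(align(library, collection))              # exact matching
print(align(library, collection, crude_stem))  # matching after crude stemming
```

String matching alone does not look at the concept scheme, so a topic term and an occupation can still be paired. Whether that counts as a correct alignment is a separate judgment.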
Library Thesaurus Alignment onto the Collection Thesaurus Using Exact String Matching (N = 7826)
Selected Label Type           | Selected Languages | Aligned Concepts (before stemming) | Aligned Concepts after Stemming | Percentage of Aligned Concepts after Stemming
skos:prefLabel, skos:altLabel | nl, en             | 844                                | 1030                            | 13.16%
skos:prefLabel, skos:altLabel | nl                 | 840                                | 1024                            | 13.08%
skos:prefLabel, skos:altLabel | en                 | 3                                  | 4                               | 0.05%
skos:prefLabel                | nl, en             | 729                                | 894                             | 11.42%
skos:prefLabel                | nl                 | 726                                | 890                             | 11.37%
skos:prefLabel                | en                 | 3                                  | 4                               | 0.05%
skos:altLabel                 | nl, en             | 13                                 | 16                              | 0.20%
skos:altLabel                 | nl                 | 13                                 | 16                              | 0.20%
skos:altLabel                 | en                 | 0                                  | 0                               | 0.00%
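The percentages are the number of aligned concepts after stemming divided by the total number of library concepts (N = 7826). The snippet below is a small check of that arithmetic. The helper name `aligned_percentage` is an assumption.

```python
TOTAL_CONCEPTS = 7826  # N from the table title


def aligned_percentage(aligned, total=TOTAL_CONCEPTS):
    """Share of library concepts with an alignment, in percent (two decimals)."""
    return round(100 * aligned / total, 2)


for aligned in (1030, 1024, 894, 890, 16, 4):
    print(aligned, f"{aligned_percentage(aligned):.2f}%")
```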
Conclusion
▪ The mapping from MARC to SKOS tags proved to be a viable conversion method.
▪ SKOS both provided better insight into quality issues and was supported by tools to fix them.
▪ The number of alignments between the Rijksmuseum thesauri was low.
What’s in it for the Rijksmuseum?
▪ Converting the thesaurus from MARC to SKOS would allow for better maintainability and interoperability
▪ Improving the thesaurus quality in terms of documentation and structure allows for more alignments to be made with other thesauri
THANK YOU FOR YOUR ATTENTION
Are there any questions?
Follow this project on GitHub:
Special thanks to:
Jacco van Ossenbruggen (VU - supervision)
Chris Dijkshoorn (Rijksmuseum - supervision)
Contact me: [email protected]