Hunting for semantic clusters – Hierarchical structuring of Cultural Heritage objects within large aggregations

Shenghui Wang (1), Antoine Isaac (2), Valentine Charles (2), Rob Koopman (1), Anthi Agoropoulou (2), Titia van der Werf (1)

(1) OCLC Research, Leiden, The Netherlands
(2) Europeana Foundation, The Hague, The Netherlands

TPDL 2013


Outline

1 Introduction

2 Hierarchically structuring CH objects based on levels of similarity

3 Results and evaluation


Large-scale aggregators


Search Europeana


Duplicates


Duplicates?

Same objects, different providers

Same page digitised three times


Challenges in large-scale aggregators

Aggregating metadata from heterogeneous collections leads to data quality issues (e.g., duplicates).

Mapping from different formats and vocabularies to a shared data model may cause information loss (e.g., internal and external links between objects).

Cultural Heritage objects can be linked in different ways (e.g., duplication, depiction/representation, derivation, succession, etc.).

Keyword-based search does not give end users a global overview of what is available.


Hierarchical structuring based on levels of similarity

Our method has three parts:

Fast clustering algorithm based on minhashes and compression similarity

Field selection for focal semantic clusters

Hierarchical structuring of records based on similarity


Fast clustering based on minhashes and compression similarity

Two-step approach:

1 Grouping records which could potentially be further clustered

    Transform metadata into a set of minhashes
    Group records with similar minhashes

2 Iterative parallel clustering of records based on compression similarity

    Select cluster heads which are far apart
    Greedily assign records to the closest cluster head
    Divide clusters if the clusters are not "compact" enough

By varying the similarity level, clusters with different compactness can be produced.
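The first step (turning metadata into minhashes and grouping records with similar minhashes) can be sketched as follows. This is a minimal illustration, not the implementation behind the slides: the character 4-gram shingling, the seeded BLAKE2b hashes, the banding scheme used to group signatures, and the function names `shingles`, `minhash`, `estimate_jaccard` and `candidate_groups` are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=4):
    """Character k-grams of a lower-cased, whitespace-normalised string."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash(text, num_hashes=32):
    """Signature: for each seeded hash function, the minimum hash over all shingles."""
    grams = shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for g in grams
        ))
    return tuple(signature)


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of shingle-set overlap."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def candidate_groups(records, num_hashes=32, bands=8):
    """Group record ids that share at least one band of their signature (assumed scheme)."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for rec_id, text in records.items():
        sig = minhash(text, num_hashes)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(rec_id)
    # Only buckets with more than one record are worth clustering further.
    return {frozenset(ids) for ids in buckets.values() if len(ids) > 1}
```

Records that end up in the same candidate group would then be passed to the more expensive compression-based comparison; records that share no band are never compared pairwise, which is what keeps this step fast.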


Focal semantic clusters

At level 100, many clusters contain duplicates, or records with almost identical metadata.

At level 80, many clusters are of specific interest, e.g., pages of the same book, pictures of the same building, etc.

These focal semantic clusters often represent small cultural entities, which can be connected to other entities.
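How a similarity level controls cluster compactness can be sketched as follows. The slides do not define "compression similarity", so this example assumes a normalised compression distance computed with zlib and rescaled to 0-100; the function names are hypothetical, and the loop is a simplified sequential version that omits the parallel execution and the step that divides clusters which are not compact enough.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed representation."""
    return len(zlib.compress(data, 9))


def compression_similarity(a: str, b: str) -> float:
    """100 * (1 - NCD), clipped to [0, 100]. NCD is an assumed stand-in measure."""
    xa, xb = a.encode("utf-8"), b.encode("utf-8")
    ca, cb, cab = compressed_size(xa), compressed_size(xb), compressed_size(xa + xb)
    ncd = (cab - min(ca, cb)) / max(ca, cb)
    return 100.0 * max(0.0, min(1.0, 1.0 - ncd))


def cluster_at_level(records, level):
    """Assign each record to its closest cluster head if similar enough, else make it a head."""
    clusters = {}  # head id -> list of member ids (the head is its own first member)
    for rec_id, text in records.items():
        best_head, best_sim = None, -1.0
        for head in clusters:
            sim = compression_similarity(records[head], text)
            if sim > best_sim:
                best_head, best_sim = head, sim
        if best_head is not None and best_sim >= level:
            clusters[best_head].append(rec_id)
        else:
            clusters[rec_id] = [rec_id]
    return clusters
```

Raising `level` produces more, tighter clusters; lowering it merges records that are only loosely related. With a general-purpose compressor even identical short strings score slightly below 100, so the thresholds in this sketch are not directly comparable to the levels quoted on the slides.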


Field selection for focal semantic clusters

However, different data providers do not apply the same standards in the same way.

    The same information could be put into different metadata fields.
    The extent to which an object is described varies a lot from provider to provider.

Not all metadata fields should be used for clustering.

    Otherwise, the pages of one book are scattered across multiple clusters.

We applied a standard Genetic Algorithm to automatically select the important fields which give the best focal clusters.
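A field-selection search of this kind can be sketched as a genetic algorithm over bit masks, one bit per metadata field. The slides do not say how the quality of the resulting focal clusters was scored, so the fitness function below is purely hypothetical: it rewards record pairs from the same known group that agree on the selected fields and penalises agreeing pairs from different groups. The toy records, the group labels and all function names are assumptions for illustration.

```python
import random
from itertools import combinations


def make_fitness(records, groups):
    """Hypothetical fitness: +1 per agreeing same-group pair, -1 per agreeing cross-group pair."""
    def fitness(selected):
        score = 0.0
        for a, b in combinations(range(len(records)), 2):
            if all(records[a].get(f) == records[b].get(f) for f in selected):
                score += 1 if groups[a] == groups[b] else -1
        return score - 0.01 * len(selected)  # slight preference for fewer fields
    return fitness


def genetic_field_selection(fields, fitness, pop_size=12, generations=30,
                            mutation_rate=0.1, seed=0):
    """Standard GA: tournament selection, one-point crossover, bit-flip mutation, elitism."""
    rng = random.Random(seed)
    n = len(fields)

    def decode(mask):
        return tuple(f for f, bit in zip(fields, mask) if bit)

    def score(mask):
        return fitness(decode(mask))

    # Start from the "use every field" baseline plus random subsets.
    population = [tuple([1] * n)]
    population += [tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(pop_size - 1)]

    for _ in range(generations):
        ranked = sorted(population, key=score, reverse=True)
        next_gen = ranked[:2]  # elitism: the two best masks survive unchanged
        while len(next_gen) < pop_size:
            parent_a = max(rng.sample(population, 3), key=score)
            parent_b = max(rng.sample(population, 3), key=score)
            cut = rng.randint(1, n - 1) if n > 1 else 0
            child = parent_a[:cut] + parent_b[cut:]
            child = tuple(bit ^ 1 if rng.random() < mutation_rate else bit for bit in child)
            next_gen.append(child)
        population = next_gen

    return decode(max(population, key=score))


FIELDS = ["dc:title", "dc:type", "dc:description"]
RECORDS = [
    {"dc:title": "Book A", "dc:type": "text", "dc:description": "page 1"},
    {"dc:title": "Book A", "dc:type": "text", "dc:description": "page 2"},
    {"dc:title": "Book A", "dc:type": "text", "dc:description": "page 3"},
    {"dc:title": "Book B", "dc:type": "text", "dc:description": "page 1"},
    {"dc:title": "Book B", "dc:type": "text", "dc:description": "page 2"},
]
GROUPS = ["A", "A", "A", "B", "B"]  # hypothetical ground truth: which book a page belongs to

toy_fitness = make_fitness(RECORDS, GROUPS)
best_fields = genetic_field_selection(FIELDS, toy_fitness)
```

In the toy data, using every field scatters the pages of one book because the descriptions differ, which mirrors the problem described above. Because the best masks are carried over unchanged each generation, the returned selection never scores worse than the all-fields baseline the search starts from.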


Hierarchical structuring based on levels of similarity

[Figure: hierarchy diagram. Labels recoverable from the slide: Provider 1, Provider 2, Provider 3, each with its own field selection (Field selection 1, 2, 3); row labels Original, G.A., Repr.; similarity levels 100, 80, 60, 40 and 20, in that order.]


Hierarchy example

[Figure: example hierarchy tree. Node labels: 20_651; 40_4745, 40_7923; 80_17351, 80_17198. Numeric identifiers shown: 19954396, 19954417, 19971448, 19955162, 19954431, 19956753, 19954427, 19955460, 19954333, 19954398.]


Experiments with a small dataset

We applied the method to 1.1 million records from the UK.

We manually checked randomly chosen clusters to understand what made these records cluster together, i.e., to identify the semantic links between records.


Categories of clusters

Same objects/duplicate records

Views of the same object

Derivative works

Parts of the same object

Collections

Thematic groupings

Example captions shown on the slide: "Bowburn, boiler house"; "Letter from Capt. John Livingston RAMC"; "Rural life".


Working with the full Europeana dataset

23.6M records from 2428 data providers across Europe (a data dump from February 2013)

At level 100, we found more than 200K clusters which contain highly similar records.

At level 80, we found nearly 1.5 million focal clusters from all individual data providers.

Similarity level   #Records to be clustered   #Clusters   Time
100                23,595,555                 200,245     6m2.82s
80                 23,595,555                 1,476,089   *
60                 6,407,615                  382,268     3m35.26s
40                 2,431,753                  212,389     2m28.79s
20                 1,068,188                  84,554      1m20.99s


Field selection for focal semantic clusters

For the 10 providers with the most records (covering 35% of the whole Europeana dataset), field selection took 161 minutes on average.

For datasets with 200-250 records, it took 21 minutes on average.

(a) Top 10 most selected fields

Rank   #Providers   Metadata field
1      2358         dc:title
2      436          dc:type
3      328          dc:language
4      315          dc:rights
5      309          dc:subject

(b) Top 5 most selected field combinations

Rank   #Providers   Field combination
1      1521         dc:title
2      37           dc:title dc:type
3      28           dc:title dc:creator
4      23           dc:title dc:identifier
5      20           dc:description


Manual evaluation

Randomly select 100 clusters at each level

7 evaluators categorised these clusters, based on the categories found in the first round.

Cluster category                  Similarity level
                                  100    80    60    40    20
Same objects/duplicate records     11    10     1     0     0
Views of the same object           61    33     6     2     5
Parts of an object                 10    11     3     1     2
Derivative works                    2     1     0     0     0
Collections                         1     4    27    13    43
Thematic grouping                   9    34    36    29    22
Nonsense                            2     3    30    57    28
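To read the table as proportions, the share of each category per similarity level can be computed as in the short sketch below. The counts are copied from the table; the names `LEVELS`, `COUNTS`, `column_totals` and `share` are introduced only for this example. The column totals as transcribed are not exactly 100 at every level, so the sketch normalises by the actual column totals rather than by 100.

```python
LEVELS = [100, 80, 60, 40, 20]

# Counts per category, in the same column order as LEVELS.
COUNTS = {
    "Same objects/duplicate records": [11, 10, 1, 0, 0],
    "Views of the same object": [61, 33, 6, 2, 5],
    "Parts of an object": [10, 11, 3, 1, 2],
    "Derivative works": [2, 1, 0, 0, 0],
    "Collections": [1, 4, 27, 13, 43],
    "Thematic grouping": [9, 34, 36, 29, 22],
    "Nonsense": [2, 3, 30, 57, 28],
}


def column_totals(counts):
    """Number of categorised clusters per similarity level."""
    return [sum(column) for column in zip(*counts.values())]


def share(counts, category):
    """Percentage of clusters at each level that fall into the given category."""
    totals = column_totals(counts)
    return [100.0 * c / t for c, t in zip(counts[category], totals)]


nonsense_share = dict(zip(LEVELS, share(COUNTS, "Nonsense")))
```

The "Nonsense" share stays in the low single digits at levels 100 and 80 and rises sharply at levels 60 and below.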


Conclusions

Finding similar CH objects is the first step towards identifying semantic links and groups of objects within large-scale aggregations.

We developed a fast and scalable clustering algorithm, applied a Genetic Algorithm to select important fields, and proposed an infrastructure to hierarchically structure CH objects.

Our evaluation shows that the clusters at high similarity levels are usually accurate and useful, while those at lower levels need more investigation.


Multidimensional similarities

Sir James Eyre (1734-1799), Chief Justice of the Common Pleas

(Government Art Collection)

Sir John Eardley Wilmot (1709-1792) Chief Justice of the Common Pleas

(Government Art Collection)

Eyre, James (Austrian National Library)


Continue hunting for semantic clusters

Thank you!

Shenghui Wang ([email protected])
Antoine Isaac ([email protected])
Valentine Charles ([email protected])
Rob Koopman ([email protected])
Anthi Agoropoulou ([email protected])
Titia van der Werf ([email protected])
