22
Detecting and Reporting Extensional Concept Drift in Statistical Linked Data Albert Mero˜ no-Pe˜ nuela 1,2 Christophe Gu´ eret 2 Rinke Hoekstra 1,3 Stefan Schlobach 1 1 Department of Computer Science, VU University Amsterdam, NL 2 Data Archiving and Networked Services, KNAW, NL 3 Leibniz Center for Law, Faculty of Law, University of Amsterdam, NL October 22th, 2013 First International Workshop on Semantic Statistics ISWC 2013 Mero˜ no-Pe˜ nuela, A. et al. Extensional Concept Drift

Detecting and Reporting Extensional Concept Drift in Statistical Linked Data

Embed Size (px)

DESCRIPTION

 

Citation preview

Page 1: Detecting and Reporting Extensional Concept Drift in Statistical Linked Data

Detecting and Reporting Extensional ConceptDrift in Statistical Linked Data

Albert Merono-Penuela1,2 Christophe Gueret2

Rinke Hoekstra1,3 Stefan Schlobach1

1Department of Computer Science, VU University Amsterdam, NL

2Data Archiving and Networked Services, KNAW, NL

3Leibniz Center for Law, Faculty of Law, University of Amsterdam, NL

October 22th, 2013First International Workshop on Semantic Statistics

ISWC 2013

Merono-Penuela, A. et al. Extensional Concept Drift

Page 2: Detecting and Reporting Extensional Concept Drift in Statistical Linked Data

MotivationStability of Meaning of Concepts

As the world changes continuously, concepts also change theirmeaning over time

We call this concept drift

Smooth transitions

Radical transitions (concept shift)

Merono-Penuela, A. et al. Extensional Concept Drift

Page 3: Detecting and Reporting Extensional Concept Drift in Statistical Linked Data

MotivationStability of Meaning of Concepts in SLD

Concepts are also present in SLD

Variable meaning/semantics (what the variable is supposed torepresent?)

Variable values/factors (RomschKatholik, RomsKatholic,KatholicChristelijk)

Merono-Penuela, A. et al. Extensional Concept Drift

Page 4: Detecting and Reporting Extensional Concept Drift in Statistical Linked Data

MotivationStability of Meaning of Concepts in SLD

To what extent stability of meaning of these concepts isguaranteed in data collected on very long time ranges?

If meaning is stable

Old models are reusable

Backwards comparisons always make sense

Else

New models may be necessary

Backwards comparisons may be incorrect

Merono-Penuela, A. et al. Extensional Concept Drift

Page 5: Detecting and Reporting Extensional Concept Drift in Statistical Linked Data

Concept DriftTypes of change of meaning

The meaning of a concept C can change in several ways1.

Intension drift occurs when there is a difference in theproperties or attributes of two variants of the same concept(simint(C

′,C ′′) 6= 1).

Extension drift occurs when there is a difference in theindividuals that belong to two variants of the same concept(simext(C

′,C ′′) 6= 1).

Label drift occurs when there is a difference in the labels oftwo variants of the same concept (simlabel(C

′,C ′′) 6= 1).

1S. Wang, S. Schlobach, M. Klein, What Is Concept Drift and How toMeasure It?, 2010

Merono-Penuela, A. et al. Extensional Concept Drift

Page 6: Detecting and Reporting Extensional Concept Drift in Statistical Linked Data

Key ideasIf you remember nothing else, remember this

In this talk we present two key ideas

A method to detect extensionally drifted concepts

A recommendation to annotate such drifts in SLD

Merono-Penuela, A. et al. Extensional Concept Drift

Page 7: Detecting and Reporting Extensional Concept Drift in Statistical Linked Data

Extensional drift detectionUsing statistics

We define the extension function ext(C ) of a concept C asthe number of individuals that belong to C

ext(C ) = |{a : C}|

We define the extension similarity function simext(C′,C ′′)

between two variants C ′,C ′′ of a concept C as the functionthat returns the probability that C ′ and C ′′ have identicalpopulations2

simext(C′,C ′′) = wilcox .test(ext(C ′), ext(C ′′))

Intuitive idea: concept variants with significantly differentpopulations (p < 0.05) suffer radical transitions

2F. Wilcoxon, Individual comparisons by ranking methods, 1945Merono-Penuela, A. et al. Extensional Concept Drift

Page 8: Detecting and Reporting Extensional Concept Drift in Statistical Linked Data

Extensional drift reportingUsing SPARQL UPDATEs

Linked Data Cloud

SPARQL endpoint

Typical analysis workflow. Users SELECT data from SLD, butanalyses run offline and results are not pushed back

Merono-Penuela, A. et al. Extensional Concept Drift

Page 9: Detecting and Reporting Extensional Concept Drift in Statistical Linked Data

Extensional drift reportingUsing SPARQL UPDATEs

Linked Data Cloud

SPARQL endpoint

Proposal. After running analyses, UPDATE results (e.g. detectedextensional drifts) back to the endpoint

Merono-Penuela, A. et al. Extensional Concept Drift

Page 10: Detecting and Reporting Extensional Concept Drift in Statistical Linked Data

Case-StudyPublishing Linked Census Data

The Dutch historical censuses (1795 - 1971)

Lots of statistical data (2,288 tables)

Lots of concepts

Big time span (176 years)

May contain lots of drifted concepts!Merono-Penuela, A. et al. Extensional Concept Drift

Page 11: Detecting and Reporting Extensional Concept Drift in Statistical Linked Data

Case-StudyPublishing Linked Census Data

Towards 5-star historical census data

Merono-Penuela, A. et al. Extensional Concept Drift

Page 12: Detecting and Reporting Extensional Concept Drift in Statistical Linked Data

Case-StudyPublishing Linked Census Data

> 1 year ago

1 year ago

Currently

TabLinker3 : Supervised XLS2RDF converter

3https://github.com/Data2Semantics/TabLinker

Merono-Penuela, A. et al. Extensional Concept Drift

Page 13: Detecting and Reporting Extensional Concept Drift in Statistical Linked Data

Case-StudyQuerying

SPARQL template for unfolding RDF Data Cubes

Merono-Penuela, A. et al. Extensional Concept Drift

Page 14: Detecting and Reporting Extensional Concept Drift in Statistical Linked Data

Case-StudyQuerying

Census contains very heterogeneous dataSubset selection: occupation census of 1889 and 1899

Age range Gender Marital status Municipality

36-50 M G Velsen51-60 V O Zaandam23-35 VROUWEN O. Haarlem36-50 MANNEN G. Weesp

Class Subclass Occupation Position Population

II b Aanemers A 3IV a Agenten B 1

XXI d Ambtenarenen beambten

C 2

I h Afwerkenvan huizen

A 5

Merono-Penuela, A. et al. Extensional Concept Drift

Page 15: Detecting and Reporting Extensional Concept Drift in Statistical Linked Data

Case-StudyData preprocessing

1889 HISCO mappings 1899 HISCO mappings

HISCO normalization

Merono-Penuela, A. et al. Extensional Concept Drift

Page 16: Detecting and Reporting Extensional Concept Drift in Statistical Linked Data

Case-StudyExtensional drift detection in R

Is there extensional concept drift between two variants of the sameoccupation (i.e. same HISCO code)?

E.g. is there concept drift between ship loaders of 1889 and 1899?

Merono-Penuela, A. et al. Extensional Concept Drift

Page 17: Detecting and Reporting Extensional Concept Drift in Statistical Linked Data

Case-StudyExtensional drift detection in R

Are the occupations in the two censuses comparable?

217 common HISCO codes

72.2% of all 1889 HISCO codes

71.1% of all 1899 HISCO codes

Merono-Penuela, A. et al. Extensional Concept Drift

Page 18: Detecting and Reporting Extensional Concept Drift in Statistical Linked Data

Case-StudyExtensional drift detection in R

For all h common HISCO codes

Query for population distributions of h in 1889

Query for population distributions of h in 1899

Do wilcox.test between the two and get p-values

Merono-Penuela, A. et al. Extensional Concept Drift

Page 19: Detecting and Reporting Extensional Concept Drift in Statistical Linked Data

Case-StudyResults

Late industralization of the Netherlands (late 19th century)19.35% of occupations show radical extensional drifts

Merono-Penuela, A. et al. Extensional Concept Drift

Page 20: Detecting and Reporting Extensional Concept Drift in Statistical Linked Data

Case-StudyExtensional drift reporting

Closing the pull-push cycle4

4https://github.com/albertmeronyo/ConceptDrift/

Merono-Penuela, A. et al. Extensional Concept Drift

Page 21: Detecting and Reporting Extensional Concept Drift in Statistical Linked Data

Conclusions & Future work

Concept drift: concepts change over time

It affects model reusability and backwards querying

Extensional drifts can be detected with wilcox.test

We SPARQL UPDATE drifts to let others know

Scale up variables, tables

Parametrization of the wilcox.test depending on the timegap

Integration of intension and label drift measures

RDF HISCO will be released soon

HISCO mappings will be published as separate named graphs(so that you can link yours)

Merono-Penuela, A. et al. Extensional Concept Drift

Page 22: Detecting and Reporting Extensional Concept Drift in Statistical Linked Data

Thank youQuestions, suggestions?

@albertmeronyohttp://www.cedar-project.nl

http://www.data2semantics.org

Merono-Penuela, A. et al. Extensional Concept Drift