
Crowdsourcing the Semantic Web

Elena Simperl

University of Southampton

Seminar@University of Manchester 4 March 2015

Crowdsourcing Web semantics

• Crowdsourcing is increasingly used to augment the results of algorithms solving Semantic Web problems

• Great challenges

• Which form of crowdsourcing for which task?

• How to design the crowdsourcing exercise effectively?

• How to combine human- and machine-driven approaches?

There is crowdsourcing and crowdsourcing


Typology of crowdsourcing

• Macrotasks

• Microtasks

• Challenges

• Self-organized crowds

• Crowdfunding


Source: Prpić, Shukla, Kietzmann and McCarthy (2015)

Dimensions of crowdsourcing

• What to crowdsource?

– Tasks you can’t complete in-house or using computers

– A question of time, budget, resources, ethics etc.

• Who is the crowd?

– Crowdsourcing ≠ ‘turkers’

– Open call, biased through use of platforms and promotion channels

– No traditional means to manage and incentivize

– Crowd has little to no context about the project

• How to crowdsource?

– Macro vs. microtasks

– Complex workflows

– Assessment and aggregation

– Aligning incentives

– Hybrid systems

– Learn to optimize from previous interactions


Hybrid systems (or ‘social machines’)


[Diagram: a model of social interaction links the physical world (people and devices) with the virtual world (a network of social interactions), via design and composition on one side and participation and data supply on the other. Source: Dave Robertson]

CROWDSOURCING DATA CURATION

M. Acosta, A. Zaveri, E. Simperl, D. Kontokostas, S. Auer, J. Lehmann: Crowdsourcing Linked Data Quality Assessment. ISWC 2013, 260-276


Overview

• Compare two forms of crowdsourcing (challenges and paid microtasks) in the context of data quality assessment and repair

• Experiments on DBpedia

– Identify potential errors, classify, and repair them

– TripleCheckMate challenge vs. Mechanical Turk


What to crowdsource

• Incorrect object

Example: dbpedia:Dave_Dobbyn dbprop:dateOfBirth “3”.

• Incorrect data type or language tags

Example: dbpedia:Torishima_Izu_Islands foaf:name “鳥島”@en.

• Incorrect link to “external Web pages”

Example: dbpedia:John-Two-Hawks dbpedia-owl:wikiPageExternalLink <http://cedarlakedvd.com/>
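For illustration only (not part of the study), a query along the following lines could surface candidate triples of the first kind for crowd review. The public DBpedia SPARQL endpoint, the SPARQLWrapper library and the crude string-length filter are all assumptions, not the authors' pipeline.

# Hypothetical sketch: pull candidate "incorrect object" triples from DBpedia for
# crowd review. Assumes the public DBpedia SPARQL endpoint and a crude heuristic:
# a dbprop:dateOfBirth literal shorter than 4 characters (like "3") is unlikely
# to be a real date.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
endpoint.setQuery("""
    PREFIX dbprop: <http://dbpedia.org/property/>
    SELECT ?person ?dob WHERE {
        ?person dbprop:dateOfBirth ?dob .
        FILTER (isLiteral(?dob) && strlen(str(?dob)) < 4)
    } LIMIT 100
""")
endpoint.setReturnFormat(JSON)

for row in endpoint.query().convert()["results"]["bindings"]:
    # Each suspicious triple becomes one candidate for the crowd to verify.
    print(row["person"]["value"], row["dob"]["value"])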

Who is the crowd

• Challenge (‘Find’ stage): LD experts, difficult task, final prize

  – Tool: TripleCheckMate [Kontokostas2013]

• Microtasks (‘Verify’ stage): workers, easy task, micropayments

  – Tool: Amazon Mechanical Turk (http://mturk.com), task design adapted from [Bernstein2010]

How to crowdsource: workflow

[Workflow: Find (TripleCheckMate challenge with LD experts) → Verify (MTurk microtasks)]

How to crowdsource: microtask design

• Selection of foaf:name or rdfs:label to extract human-readable descriptions

• Values extracted automatically from Wikipedia infoboxes

• Link to the Wikipedia article via foaf:isPrimaryTopicOf

• Preview of external pages by embedding an HTML iframe
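A rough sketch (not code from the paper) of how this task context could be assembled for one resource: an English label via rdfs:label or foaf:name, plus the Wikipedia article linked through foaf:isPrimaryTopicOf. The endpoint URL and the SPARQLWrapper library are assumptions; the predicates are those named on the slide.

from SPARQLWrapper import SPARQLWrapper, JSON

def task_context(resource_uri):
    """Fetch the pieces the microtask interface shows for one DBpedia resource."""
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery(f"""
        PREFIX foaf: <http://xmlns.com/foaf/0.1/>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?label ?article WHERE {{
            <{resource_uri}> rdfs:label|foaf:name ?label ;
                             foaf:isPrimaryTopicOf ?article .
            FILTER (lang(?label) = "en")
        }} LIMIT 1
    """)
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    return {k: v["value"] for k, v in rows[0].items()} if rows else None

# e.g. task_context("http://dbpedia.org/resource/Dave_Dobbyn")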

[Microtask interface variants: incorrect object, incorrect data type or language tag, incorrect outlink]

Experiments

• Crowdsourcing approaches

  – Find stage: challenge with LD experts

  – Verify stage: microtasks (5 assignments)

• Gold standard

  – Two of the authors independently generated a gold standard for all contest triples

  – Conflicts resolved via mutual agreement

• Validation: precision (a small aggregation and precision sketch follows below)
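As a minimal illustration of the aggregation and metric just listed (not the authors' code), the sketch below reduces the five assignments per triple to a majority answer and computes precision, read here as the fraction of triples flagged as incorrect that the gold standard also marks incorrect. The sample data is made up.

from collections import Counter

def majority_vote(judgements):
    """Most frequent answer among the assignments (n=5 in the experiments)."""
    return Counter(judgements).most_common(1)[0][0]

def precision(predicted, gold):
    """Fraction of triples flagged 'incorrect' that the gold standard also marks 'incorrect'."""
    flagged = [t for t, label in predicted.items() if label == "incorrect"]
    if not flagged:
        return 0.0
    return sum(1 for t in flagged if gold.get(t) == "incorrect") / len(flagged)

# Illustrative data, not from the paper.
votes = {
    "dbpedia:Dave_Dobbyn#dateOfBirth": ["incorrect", "incorrect", "incorrect", "correct", "incorrect"],
    "dbpedia:John-Two-Hawks#extLink":  ["correct", "correct", "incorrect", "correct", "correct"],
}
crowd_labels = {triple: majority_vote(v) for triple, v in votes.items()}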

Overall results

                                  LD experts              Microtask workers
Number of distinct participants   50                      80
Total time                        3 weeks (predefined)    4 days
Total triples evaluated           1,512                   1,073
Total cost                        ~US$ 400 (predefined)   ~US$ 43

Precision results: Incorrect objects

• Turkers can reduce the error rates of LD experts from the Find stage

• 117 DBpedia triples had predicates related to dates with incorrect/incomplete values:

”2005 Six Nations Championship” Date 12 .

• 52 DBpedia triples had erroneous values from the source:

”English (programming language)” Influenced by ? .

• Experts classified all these triples as incorrect

• Workers compared values against Wikipedia and successfully classified these triples as “correct”

Triples compared   LD experts   MTurk (majority voting, n=5)
509                0.7151       0.8977

Precision results: Incorrect data types

[Bar chart: number of triples per data type (Date, English, Millimetre, Nanometre, Number, Number with decimals, Second, Volt, Year, Not specified/URI), split into expert true/false positives and crowd true/false positives]

Triples compared   LD experts   MTurk (majority voting, n=5)
341                0.8270       0.4752

Precision results: Incorrect links

• We analyzed the 189 misclassifications by the experts

• The 6% of misclassifications by the workers correspond to pages in a language other than English

[Pie chart: expert misclassifications by type – Freebase links 50%, Wikipedia images 39%, external links 11%]

Triples compared   Baseline   LD experts   MTurk (majority voting, n=5)
223                0.2598     0.1525       0.9412

Conclusions

• The effort of LD experts should be invested in tasks that demand domain-specific skills

– Experts seem less motivated to work on tasks that come with additional overhead (e.g., checking an external link, going back to Wikipedia, repairing data)

• The MTurk crowd was exceptionally good at object and link checks

– But seems not to understand the intricacies of data types

Future work

• Additional experiments with data types

• Workflows for new types of errors, e.g., tasks which experts cannot answer on the fly

• Training the crowd

• Building the DBpedia ‘social machine’

– Add automatic QA tools

– Close the feedback loop (from crowd to experts)

– Change the parameters of the ‘Find’ step to achieve seamless integration

– Change crowdsourcing model of ‘Find’ to improve retention


CROWDSOURCING IMAGE ANNOTATION

O. Feyisetan, E. Simperl, M. Van Kleek, N. Shadbolt: Improving Paid Microtasks through Gamification and Adaptive Furtherance Incentives. WWW 2015, to appear


Overview

• Make paid microtasks more cost-effective through gamification*

*use of game elements and mechanics in a non-game context

What to crowdsource: image labeling


Who is the crowd: paid microtask platforms


[Kaufmann, Schulze, Veit, 2011]

Research hypotheses

• Contributors will perform better if tasks are more engaging

• Increased accuracy through higher inter-annotator agreement

• Cost savings through reduced unit costs

• Micro-targeting incentives when players attempt to quit improves retention

How to crowdsource: microtask design

• Image labeling tasks published on microtask platform

– Free text labels, varying numbers of labels per image, taboo words

– 1st setting: ‘standard’ tasks, including basic spam control

– 2nd setting: the same requirements and rewards, but contributors were asked to carry out the task in Wordsmith


How to crowdsource: gamification

• Levels – 9 levels from newbie to Wordsmith, function of images tagged

• Badges – function of number of images tagged

• Bonus points – for new tags

• Treasure points – for multiples of bonus points

• Feedback alerts - related to badges, points, levels

• Leaderboard - hourly scores and level of the top 5 players

• Activities widget – real-time updates on other players

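A minimal sketch of how the level, badge and bonus-point mechanics above might be computed. ‘Newbie’ and ‘Wordsmith’ are named on the slide; all thresholds, point values and intermediate level names below are invented for illustration.

# Sketch of the gamification mechanics: level (and badge) as a function of the number
# of images tagged, bonus points for previously unseen tags. Thresholds, point values
# and the intermediate level names are assumptions, not the paper's parameters.
LEVELS = ["Newbie"] + [f"Level {i}" for i in range(2, 9)] + ["Wordsmith"]  # 9 levels

def level_for(images_tagged, images_per_level=25):
    """Map the number of images a player has tagged to one of the 9 levels (assumed step)."""
    return LEVELS[min(images_tagged // images_per_level, len(LEVELS) - 1)]

def points_for(tag, existing_tags, base_points=1, bonus_points=5):
    """Base points per tag, plus bonus points when the tag is new for this image."""
    return base_points + (bonus_points if tag not in existing_tags else 0)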

How to crowdsource: incentives

• Money – 5 cents extra for the effort

• Power – see how other players tagged the images

• Leaderboard - advance to the ‘Global’ leaderboard seen by everyone

• Badges – receive the ‘Ultimate’ badge and a shiny new avatar

• Levels – advance straight to the next level

• Access - quicker access to treasure points

27

Furtherance incentives are assigned at random or targeted based on Bayes inference, with the number of tagged images as the feature (see the sketch below)
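A minimal sketch of one way such targeted assignment could work, assuming a history of earlier sessions recording whether players with a similar image count kept playing after each incentive. The bucketing, smoothing and data format are assumptions, not the paper's implementation.

# Sketch of the "targeted" condition: when a player tries to quit, pick the furtherance
# incentive that historically kept players with a similar number of tagged images in the
# game, using a simple Bayesian estimate with the image count as the single feature.
INCENTIVES = ["money", "power", "leaderboard", "badges", "levels", "access"]

def bucket(images_tagged, size=5):
    """Coarse feature: the player's image count, bucketed (bucket size is assumed)."""
    return images_tagged // size

def pick_incentive(images_tagged, history):
    """history[(incentive, bucket)] = (kept_playing, times_offered) from earlier sessions."""
    b = bucket(images_tagged)
    def p_stay(incentive):
        kept, offered = history.get((incentive, b), (0, 0))
        # Posterior mean under a uniform Beta(1,1) prior; unseen combinations default to 0.5.
        return (kept + 1) / (offered + 2)
    return max(INCENTIVES, key=p_stay)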

Experiments

• Crowdsourcing approaches

– Paid microtasks

– Wordsmith GWAP

• Data set: ESP Game (120K images with curated labels)

• Validation: precision


Experiments (2)

• 1st experiment: ‘newcomers’

– Tag one image with two keywords ~ complete Wordsmith level 1

– 200 images, US$ 0.02 per task, 3 judgements per task

• 2nd experiment: ‘novice’

– Tag 11 images ~ complete Wordsmith level 2

– 2200 images, US$ 0.10 per task, 600 workers

– Furtherance incentives: none, random, targeted (Bayes inference based on # of images tagged)


Results: 1st experiment

Metric               CrowdFlower   Wordsmith
Total workers        600           423
Total keywords       1,200         41,206
Unique keywords      111           5,708
Avg. agreement       5.72%         37.7%
Mean images/person   1             32
Max images/person    1             200

Results: 2nd experiment

Metric               CrowdFlower   Wordsmith (no furtherance incentives)
Total workers        600           514
Total keywords       13,200        35,890
Unique keywords      1,323         4,091
Avg. agreement       6.32%         10.9%
Mean images/person   11            27
Max images/person    1             351

Results: 2nd experiment, incentives

• Incentives led to increased participation

– Power: mean images/person = 132

– People come back (20 times!) and play longer (43 hours, vs. 3 hours without incentives)


Conclusions and future work

• Task design matters as much as payment

• Gamification achieves high accuracy for lower costs and improved engagement

• Workers appreciate social features, but their main motivation is still task-driven

– Feedback mechanisms, peer learning, training


Email: [email protected]

Twitter: @esimperl
