
Task and Workflow Design in Human Computation

KSE 652 Social Computing System Design and Analysis

Uichin Lee

TurKit: Human Computation Algorithms on Mechanical Turk

Greg Little, Lydia B. Chilton, Rob Miller, and Max Goldman

(MIT CSAIL), UIST 2010

Workflow in M-Turk

(Diagram: the requester posts HIT groups to Mechanical Turk; completed HITs are collected in a CSV file, and the data is exported for use.)

Workflow: Pros & Cons

• Easy to run simple, parallelized tasks.
• Not so easy to run tasks in which turkers improve on or validate each other's work.
• TurKit to the rescue!

The TurKit Toolkit

• Arrows indicate the flow of information.
• The programmer writes two sets of source code:
  – HTML files for web servers
  – JavaScript executed by TurKit
• Output is retrieved via a JavaScript database.

(Diagram: programmer, *.html / web server, *.js / TurKit, Mechanical Turk, turkers, and the JavaScript database, connected by arrows showing the flow of information.)

Crash-and-rerun programming model

• Observation: local computation is cheap, but external calls (posting HITs) cost money.
• Managing state over a long-running program is challenging.
  – Examples: What if the computer restarts? What if an error occurs?
• Solution: store state in a database (just in case).
  – If an error happens, just crash the program and re-run it, following the history in the DB.
  – Throw a "crash" exception; the script is automatically re-run.
• New keyword "once":
  – Removes non-determinism.
  – An expensive operation does not need to be re-executed when the script is re-run.
• But why should we re-run at all?
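A minimal sketch of the idea in TurKit-style JavaScript (only once() and the "crash" exception come from the slides; postImageHIT and waitForResult are hypothetical helpers used for illustration):

// Crash-and-rerun sketch: the result of each expensive step is stored in the
// script's database the first time it succeeds, so re-running is cheap.
var hit = once(function () {
  // Executed (and paid for) only on the first run; on later runs the stored
  // return value is replayed from the database instead.
  return postImageHIT();            // hypothetical helper that posts a paid HIT
});

var answer = waitForResult(hit);    // hypothetical helper; if the result is not
                                    // ready yet, it throws "crash", the script
                                    // exits, and TurKit simply re-runs it later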

Example: quicksort
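The slide's code listing is not reproduced in this text. Below is a simplified sketch (not the paper's verbatim listing) of how quicksort can be driven by human comparisons, assuming mturk.vote(message, options) returns the option the voters chose:

// Quicksort with a human comparison step (illustrative sketch).
function humanLess(a, b) {
  // Each comparison is a small voting HIT; under crash-and-rerun, completed
  // votes are replayed from the database when the script is re-run.
  return mturk.vote("Which of these two is better?", [a, b]) == a;
}

function quicksort(items) {
  if (items.length <= 1) return items;
  var pivot = items[0], left = [], right = [];
  for (var i = 1; i < items.length; i++) {
    if (humanLess(items[i], pivot)) left.push(items[i]);
    else right.push(items[i]);
  }
  return quicksort(left).concat([pivot], quicksort(right));
}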

Parallelism

• The first time the script runs, HITs A and C are created.
• For a given forked branch, if a task fails (e.g., HIT A), TurKit crashes that forked branch and re-runs it.
• Synchronization with join().
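A minimal sketch of the pattern, assuming TurKit-style fork(fn)/join() and the prompt primitive from the next slide (task texts and variable names are illustrative):

// Two independent branches; each posts its own HIT.
var a, c;
fork(function () {
  a = mturk.prompt("Describe image A", 1);   // HIT A
});
fork(function () {
  c = mturk.prompt("Describe image C", 1);   // HIT C
});
join();   // waits (via crash-and-rerun) until both branches have results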

MTurk Functions

• Prompt(message, # of people)
  – mturk.prompt("What is your favorite color?", 100)
• Voting(message, options)
• Sort(message, items)

(Screenshots: VOTE() and SORT() task interfaces.)
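A short usage sketch; the slide gives only the signatures, so the exact return values (an array of answers, the winning option, the sorted list) are assumptions:

var ideas  = mturk.prompt("Suggest a name for a new product", 5);        // 5 answers
var winner = mturk.vote("Which name is better?", [ideas[0], ideas[1]]);  // winning option
var ranked = mturk.sort("Order these names from best to worst", ideas);  // sorted list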

TurKit: Implementation

• TurKit: implemented in Java, using Rhino to interpret JavaScript code and E4X to handle XML results from MTurk

• IDE: Google App Engine (GAE)

Online IDE

Exploring Iterative and Parallel Human Computation Processes

Greg Little, Lydia B. Chilton, Max Goldman, Robert C. Miller

HCOMP 2010

HC Task Model

• Dimensions:
  – Dependent (iterative) vs. independent (parallel) tasks
  – Creation vs. decision tasks
• Task model examples:
  – Creation tasks (creating new content): e.g., writing, ideas, imagery, solutions, etc.
  – Decision tasks (voting/rating): e.g., rating the quality of a description of an image

HC Task Model

• Combining tasks: iterative and parallel patterns (sketched below)
  – Iterative pattern: a sequence of creation tasks, where the result of each task feeds into the next one, followed by a comparison task
  – Parallel pattern: a set of creation tasks executed in parallel, followed by a task of choosing the best result
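A minimal sketch of the two patterns for the image-description task, using the TurKit primitives above (the exact prompts, the six-task budget wiring, and prompt returning an array are assumptions):

// Iterative: each creation task sees the current best; a comparison (vote)
// decides whether to keep the new version.
var desc = mturk.prompt("Describe this image", 1);                       // first creation task
for (var i = 1; i < 6; i++) {
  var improved = mturk.prompt("Improve this description: " + desc, 1);   // next creation task
  desc = mturk.vote("Which description is better?", [desc, improved]);   // comparison task
}

// Parallel: six independent descriptions, then a single choice of the best one.
var drafts = mturk.prompt("Describe this image", 6);
var best = mturk.vote("Which description is best?", drafts);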

Experiment: Writing Image Description

• Iterative vs. parallel; each condition uses 6 creation tasks ($0.02 each), followed by rating tasks (1–10 scale, $0.01 each)

Experiment: Writing Image Description

• Turkers in the iterative condition are given the latest description to improve, while the parallel condition always shows an empty text area.

Experiment: Writing Image Description

• Average rating after n iterations
  – After six iterations: 7.9 vs. 7.4 (t-test, T29 = 2.1, p = 0.04)

(Figure: average rating vs. iteration, iterative vs. parallel.)

Experiment: Writing Image Description

• Length vs. rating: positive correlation

• The two outliers (circled) represent instances of text copied from the Internet (with superficial description)

(Scatter plot: rating vs. length in characters.)

Experiment: Writing Image Description

• Work quality:
  – 31% mainly append content at the end, and make only minor modifications (if any) to existing content;
  – 27% modify/expand existing content, but it is evident that they use the provided description as a basis;
  – 17% seem to ignore the provided description entirely and start over;
  – 13% mostly trim or remove content;
  – 11% make very small changes (adding a word, fixing a misspelling, etc.);
  – 1% copy-paste superficially related content found on the internet.
• Creating vs. improving: takes about the same time (avg. 211 seconds).

Experiment: Brainstorming


• Iterative work: higher average rating
  – But biased thinking: later ideas riff on earlier ones (e.g., tech -> xxtech -> yytech)
• Parallel work: diversity, higher deviation in ratings
  – No iteration for brainstorming

(Figure: average rating vs. iteration, iterative vs. parallel.)

Example: Blurry Text Recognition


• Iterative performs better than parallel

(Figure: accuracy vs. iteration.)

Summary

• TurKit: a flexible programming tool for Mechanical Turk

• Various workflows can be designed, e.g., iterative, parallel, and hybrid

• Iterative performs better than parallel in several cases (e.g., image description, brainstorming, text recognition)

Turkalytics: Real-time Analytics for Human Computation

Paul Heymann and Hector Garcia-Molina, WWW '11

Basic Buyer human programming
• A human program generates forms, which are advertised through a marketplace.
• Workers look at the posts, and then complete the forms for compensation.

Game Maker human programming
• The programmer writes a human program and a game.
• The game implements features to make it fun and difficult to cheat.
• The human program loads and dumps data from the game.


Human Processing programming
• Task description:
  – Input, output, web forms, human driver, other information
  – Human task instance
• Human drivers: interact with workers
  – Functions: initialization (forms, games), retrieving results
  – The "human program" accesses workers via "human drivers"
• Recruiters: post task instances into the marketplaces (by working with marketplace drivers)
  – A marketplace driver provides an interface to a marketplace


Turkalytics

• Challenge: collecting reliable data about the workers and the tasks they perform

• Why?
  – If a task is not being completed, is it because no workers are seeing it? Is it because the task is currently being offered at too low a price?
  – How does the task completion time break down?
  – Do workers spend more time previewing tasks or doing them?
  – Do they take long breaks?
  – Which are the more "reliable" workers?

Interaction Model

• Search-Preview-Accept (SPA) model

Interaction Model

• Search-Continue-RapidAccept-Accept-Preview (SCRAP) model
  – Continue: complete a task that was accepted but not yet submitted
  – RapidAccept: accept the next task in a HIT Group without previewing it

Turkalytics Data Models

Turkalytics Architecture

(Diagram: client-side JavaScript (ta.js) running in workers' browsers sends log messages (JSON) to the log server via Ajax POST; the analysis server pulls log messages (JSON) from the log server.)

Implementation: client-side Javascript

• The requester embeds the Turkalytics script (ta.js) into a HIT when designing it.
  – Monitoring: detect relevant worker data and actions.
  – Sending: log events by making image requests / Ajax POSTs to the log server.

Implementation: ta.js -- client-side JavaScript

• ta.js's monitoring activities:
  – Client information: What is the worker's screen resolution? What plugins are supported? Can ta.js set cookies?
  – DOM events: over the course of a page view, the browser emits various events (e.g., load, submit, beforeunload, and unload).
  – Activity: listens on a second-by-second basis for the mousemove, scroll, and keydown events to determine whether the worker is active or inactive.
  – Form contents: examines forms on the page and their contents; logs initial form contents, incremental updates, and the final state.
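A rough sketch of this kind of client-side logging (not the actual ta.js source; the endpoint, event format, and helper names are assumptions):

// Ship an event to the log server as a fire-and-forget image request.
var LOG_SERVER = "https://log.example.com/log";          // hypothetical endpoint
function logEvent(type, data) {
  var payload = encodeURIComponent(JSON.stringify({type: type, data: data, t: Date.now()}));
  new Image().src = LOG_SERVER + "?e=" + payload;
}

// Client information, captured once per page view.
logEvent("client", {resolution: screen.width + "x" + screen.height,
                    cookiesEnabled: navigator.cookieEnabled});

// Second-by-second activity detection from mousemove/scroll/keydown.
var active = false;
["mousemove", "scroll", "keydown"].forEach(function (ev) {
  document.addEventListener(ev, function () { active = true; });
});
setInterval(function () { logEvent("activity", {active: active}); active = false; }, 1000);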

Implementation: log/analysis

• Log server:
  – A simple web app built on Google's App Engine.
  – Receives logging events from clients running ta.js and saves them to a data store (IP address, user agent, referer, etc.).
• Analysis server:
  – Periodically polls the log server to download any new events that have been received.
  – Events are inserted into the DB, considering the following:
    • Time constraints: when data becomes available to the analysis server
    • Dependencies: whether events depend on one another
    • Incomplete input: what if not all events have been received yet?
    • Unknown input: what if unexpected input is received?
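A minimal sketch of the analysis server's polling loop (the endpoint, event format, and the two helpers are assumptions for illustration, not the paper's implementation):

var LOG_SERVER_URL = "https://log.example.com";          // hypothetical
var db = [];                                             // stand-in for the analysis DB

function insertEvent(e) { db.push(e); }                  // real system: relational insert
function dependenciesSatisfied(e) { return true; }       // real system: check related events

async function poll(lastSeen) {
  var res = await fetch(LOG_SERVER_URL + "/events?since=" + lastSeen);
  var events = await res.json();
  for (var i = 0; i < events.length; i++) {
    var e = events[i];
    if (!dependenciesSatisfied(e)) continue;             // dependent/incomplete input: retry on a later poll
    insertEvent(e);
    lastSeen = Math.max(lastSeen, e.time);
  }
  return lastSeen;                                       // caller re-polls periodically
}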

Implementation: analysis

(Screenshot of a log message, annotated: the type of data (event) being sent, the actual data for that type, detailed info about the task, and a session ID.)

Experiments

• Tasks:
  – Named Entity Recognition (NER): posted in groups of 200 by a researcher in natural language processing; asks workers to label words in a Wikipedia article if they correspond to people, organizations, locations, or demonyms. (2,000 HITs, 1 HIT Type, more than 500 workers.)
  – Turker Count (TC): posted once a week by a professor of business at U.C. Berkeley; asks workers to push a button, and is designed just to gauge how many workers are present in the marketplace. (2 HITs, 1 HIT Type, more than 1,000 workers each.)
  – Create Diagram (CD): posted by the authors; asked workers to draw diagrams for the paper based on hand-drawn sketches.

Experiments: origin of workers

• The GeoLite City DB from MaxMind is used to geolocate all remote users by IP address.

Experiments: worker characteristics

Experiments: states/actions

• RapidAccept is quite popular (Continue is rare)

Experiments: # previews

• Artificial recency for NER/CD (keeping them near the top of the list):
  – NER and CD exhibit a less severe drop in previews than TC.

Experiments: activity vs. delay

• Average active and total seconds for each worker who completed the NER task (correlation 0.88)

Discussion

• Multi-tasking users? Activity vs. working time
• Privacy?
  – We can collect as much as we can...
  – How about Google Analytics? Any web page that we visit can collect such information.
• False data injection?
• How can we better utilize the dataset?
  – Re-designing existing tasks, pricing, etc. (or mining user behavior?)
