Query Optimization over Crowdsourced Data


Hyunjung Park, Jennifer Widom
Stanford University

Deco: Declarative Crowdsourcing

Give me a Spanish-speaking country.

Give me a country.
What language do they speak in country X?
What is the capital of country X?

8/27/2013 Hyunjung Park 2

“Find the capitals of eight Spanish-speaking countries”

DBMS

country   language   capital
Italy     Italian    Rome
Spain     Spanish    Madrid
…         …          …

Deco System

Deco Query Optimization

•  Crowd incurs monetary cost
•  Some query plans are much cheaper than others
•  Cost estimation is complicated by:
–  Previously collected data
–  Unknown database state
–  Inconsistency of human answers


Outline

•  Motivating example
•  Deco data model and queries
•  Cost and cardinality estimation
•  Experimental results


Everything implemented in full prototype

Motivating Example: Plan 1


[Figure: Plan 1 workflow. Ask "Give me a country."; if the country is unseen (T), ask "What language do they speak in country X?"; if the language is Spanish (T), ask "What is the capital of country X?". Repeated until 8 results are collected (8x).]

Motivating Example: Plan 2





[Figure: Plan 2 workflow, with "unseen" and "Spanish" checks (T/F branches), repeated until 8 results are collected (8x).]

Preview of Experimental Results

[Figure: bar chart of actual costs spent on Mechanical Turk ($) for Plan 1 and Plan 2; y-axis from 0 to 15.]



Deco: Data Model (1/2)

•  Conceptual Relation: visible to end-users
    Country(country, language, capital)
•  Resolution Rules: cleanse raw data using UDFs
    country: dupElim
    language: majority(3)
    capital: majority(3)
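The two resolution rules can be pictured with a minimal Python sketch. The slides do not show the UDF interface, so the function names and signatures here are hypothetical, and the sketch assumes that majority(3) means "the value given by at least two of up to three answers" and that answers are compared as exact strings.

```python
from collections import Counter


def dup_elim(values):
    """Drop repeated values, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for value in values:
        if value not in seen:
            seen.add(value)
            unique.append(value)
    return unique


def majority_of_three(answers):
    """Return the value given by at least 2 of the first 3 answers, else None."""
    counts = Counter(answers[:3])
    if not counts:
        return None
    value, votes = counts.most_common(1)[0]
    return value if votes >= 2 else None


# Hypothetical raw answers collected from the crowd.
raw_countries = ["Spain", "Peru", "Spain", "Chile"]
raw_capitals = ["Madrid", "Barcelona", "Madrid"]

print(dup_elim(raw_countries))
print(majority_of_three(raw_capitals))
```

Under this reading, a value resolves after two answers when they agree and after three otherwise.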


Deco: Data Model (2/2)

•  Fetch Rules: “access methods” for the crowd
    language => country: “Give me a {language}-speaking country.”
    Ø => country: “Give me a country.”
    country => language: “What language do they speak in {country}?”
    country => capital: “What is the capital of {country}?”


[$0.05]

[$0.01]

[$0.02]

[$0.03]
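One way to picture a fetch rule is as a small record holding the attributes given to the worker, the attribute requested, and the question template. The following Python sketch is illustrative only: the `FetchRule` class and its field names are assumptions, not Deco's actual representation, and prices are left out because the sketch does not need them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FetchRule:
    given: tuple    # attributes supplied to the worker (empty tuple for Ø)
    asks_for: str   # attribute the worker is asked to provide
    template: str   # question text with {attribute} placeholders

    def question(self, **bindings):
        """Fill the template with values for the given attributes."""
        missing = [a for a in self.given if a not in bindings]
        if missing:
            raise ValueError(f"missing values for: {missing}")
        return self.template.format(**bindings)


# The four fetch rules from the slide.
RULES = [
    FetchRule(("language",), "country", "Give me a {language}-speaking country."),
    FetchRule((), "country", "Give me a country."),
    FetchRule(("country",), "language", "What language do they speak in {country}?"),
    FetchRule(("country",), "capital", "What is the capital of {country}?"),
]

print(RULES[0].question(language="Spanish"))
print(RULES[3].question(country="Peru"))
```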

Deco: Queries

•  Deco query: SQL query over conceptual relations
    SELECT country, capital
    FROM Country
    WHERE language='Spanish'
    MINTUPLES 8

•  Query processor: access the crowd as needed to produce query result while:
1.  Minimizing monetary cost
2.  Reducing latency


query optimizer

query execution engine

Query Optimization

•  Find the best query plan in terms of estimated monetary cost

•  As in a traditional query optimizer:
1.  Cost and cardinality estimation
2.  Search space
3.  Plan enumeration algorithm

Cost Estimation

•  Total monetary cost = ∑_{Fetch F} (F.price × F.cardinality)
–  Existing data is “free”
•  Definition of cardinality in Deco
–  Total number of expected output tuples from operator until query execution terminates
•  Cardinality estimation
–  Final database state needs to be estimated simultaneously
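The cost formula is a sum of price times cardinality over the Fetch operators in a plan. A minimal Python sketch follows; the `FetchEstimate` record and the two example rules with their prices and cardinalities are made up for illustration and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass
class FetchEstimate:
    rule: str           # label of the fetch rule, for readability only
    price: float        # dollars paid per fetch
    cardinality: float  # estimated number of new fetches (existing data is free)


def total_monetary_cost(fetches):
    """Sum price x cardinality over all Fetch operators in a plan."""
    return sum(f.price * f.cardinality for f in fetches)


# Hypothetical plan with two Fetch operators.
example = [
    FetchEstimate("A => B", 0.10, 30),
    FetchEstimate("B => C", 0.02, 50),
]
cost = total_monetary_cost(example)
print(f"${cost:.2f}")  # prints $4.00
```

Only new fetches count toward cardinality here, which is one way to reflect that existing data is free.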


Cardinality Estimation: Setting

•  $0.05 for all fetch rules

•  No existing data

•  Selectivity factors
–  language='Spanish': 0.1
–  dupElim: 0.8
–  majority(3): 0.4 (=1/2.5)


Cardinality Estimation: Plan 1


SELECT country, capital
FROM Country
WHERE language='Spanish'
MINTUPLES 8

[Figure: query plan tree, operators numbered 1 to 14: MinTuples[8], Project[co,ca], two DLOJoin[co], Resolve[dupElim], two Resolve[maj3], Filter[la='Spanish'], Scan[CtryA] paired with Fetch[Ø→co], Scan[CtryD1] paired with Fetch[co→la], Scan[CtryD2] paired with Fetch[co→ca].]

Estimated fetch cardinalities: Ø => country: 100; country => language: 200; country => capital: 20

Cost estimation: $0.05 × (100 + 200 + 20) = $16.00
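The fetch counts 100, 200, and 20 can be reproduced by working backward from MINTUPLES 8 through the selectivity factors in the setting. The Python sketch below is a back-of-the-envelope reconstruction under that assumption, not the estimation algorithm itself (which the slides defer to the paper); all variable names are hypothetical. Fractions are used so the arithmetic is exact.

```python
from fractions import Fraction

# Values from the "Setting" slide.
SEL_SPANISH = Fraction(1, 10)   # language='Spanish'
SEL_DUPELIM = Fraction(4, 5)    # dupElim
SEL_MAJ3 = Fraction(2, 5)       # majority(3), i.e. 2.5 answers per resolved value
PRICE = Fraction(5, 100)        # $0.05 for every fetch rule
MIN_TUPLES = 8


def plan1_fetch_counts():
    """Work backward from 8 output tuples to the fetches each rule needs."""
    # Only 1 in 10 countries passes the filter, so 80 distinct countries are needed.
    distinct_countries = MIN_TUPLES / SEL_SPANISH
    return {
        # Duplicates inflate the number of "Give me a country." fetches.
        "Ø => country": distinct_countries / SEL_DUPELIM,
        # Every distinct country needs its language resolved before filtering.
        "country => language": distinct_countries / SEL_MAJ3,
        # Only the 8 surviving countries need a capital.
        "country => capital": MIN_TUPLES / SEL_MAJ3,
    }


counts = plan1_fetch_counts()
total = PRICE * sum(counts.values())
for rule, n in counts.items():
    print(rule, n)
print(f"${float(total):.2f}")
```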

Cardinality Estimation: Plan 2


SELECT country, capital
FROM Country
WHERE language='Spanish'
MINTUPLES 8

[Figure: query plan tree with the same operators as Plan 1, except that Fetch[la→co] takes the place of Fetch[Ø→co]: MinTuples[8], Project[co,ca], two DLOJoin[co], Resolve[dupElim], two Resolve[maj3], Filter[la='Spanish'], Scan[CtryA] paired with Fetch[la→co], Scan[CtryD1] paired with Fetch[co→la], Scan[CtryD2] paired with Fetch[co→ca].]

Estimated fetch cardinalities: language => country: 10; country => language: 20; country => capital: 20

Cost estimation: $0.05 × (10 + 20 + 20) = $2.50
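The Plan 2 numbers follow from the selectivity factors in the setting by the same backward reasoning from MINTUPLES 8. The worked arithmetic below is a reconstruction, and it assumes that every answer to "Give me a Spanish-speaking country." passes the language filter, so only dupElim and majority(3) inflate the fetch counts.

```latex
\begin{align*}
\text{language} \Rightarrow \text{country}: &\quad 8 / 0.8 = 10 \\
\text{country} \Rightarrow \text{language}: &\quad 8 / 0.4 = 20 \\
\text{country} \Rightarrow \text{capital}:  &\quad 8 / 0.4 = 20 \\
\text{estimated cost}: &\quad \$0.05 \times (10 + 20 + 20) = \$2.50
\end{align*}
```

Compared with the $16.00 estimate for Plan 1, Plan 2 is estimated to be 6.4 times cheaper.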


Experimental Results

[Figure: two bar charts of cost ($), one for Plan 1 (y-axis from 0 to 15) and one for Plan 2 (y-axis from 0 to 3), each comparing Actual and Estimated cost. Legend (fetch rules): country => capital, country => language, language => country, Ø => country.]

Related Work

•  Declarative approach for crowdsourcing
–  Arnold, CrowdDB, CrowdSearcher, Jabberwocky, Qurk, ...
•  Crowd-powered algorithms/operations
–  Filter, sort, join, max, entity resolution, …
•  Also:
–  Traditional query optimization
–  Heterogeneous or federated database systems


Summary

•  Cost estimation in Deco
–  Distinguish between existing data and new data
–  Estimate cardinality and final database state simultaneously
•  In the paper:
–  Full description of cost estimation and plan enumeration algorithms
–  More experimental results


Thank you!
