Query Optimization over Crowdsourced Data
Hyunjung Park, Jennifer Widom (Stanford University)

Presented in VLDB 2013.

Page 1: Query Optimization over Crowdsourced Data

Query Optimization over Crowdsourced Data

Hyunjung Park, Jennifer Widom Stanford University

Page 2: Query Optimization over Crowdsourced Data

Deco: Declarative Crowdsourcing

“Find the capitals of eight Spanish-speaking countries”

Questions that can be posed to the crowd:
•  Give me a Spanish-speaking country.
•  Give me a country.
•  What language do they speak in country X?
•  What is the capital of country X?

[Figure: a DBMS and the Deco System, each shown with a table of the following form]

country   language   capital
Italy     Italian    Rome
Spain     Spanish    Madrid
…         …          …

8/27/2013 Hyunjung Park 2

Page 3: Query Optimization over Crowdsourced Data

Deco Query Optimization

•  Crowd incurs monetary cost
•  Some query plans are much cheaper than others
•  Cost estimation is complicated by:
   –  Previously collected data
   –  Unknown database state
   –  Inconsistency of human answers

Page 4: Query Optimization over Crowdsourced Data

Outline

•  Motivating example
•  Deco data model and queries
•  Cost and cardinality estimation
•  Experimental results

Everything implemented in full prototype

Page 5: Query Optimization over Crowdsourced Data

Motivating Example: Plan 1

“Find the capitals of eight Spanish-speaking countries”

[Figure: flowchart for Plan 1, labeled 8x. Ask “Give me a country.”, then test “unseen” (T/F); ask “What language do they speak in country X?”, then test “Spanish” (T/F); ask “What is the capital of country X?”]

Page 6: Query Optimization over Crowdsourced Data

Motivating Example: Plan 2

“Find the capitals of eight Spanish-speaking countries”

[Figure: flowchart for Plan 2, labeled 8x. Ask “Give me a Spanish-speaking country.” (in place of “Give me a country.”), then test “unseen” (T/F); ask “What language do they speak in country X?”, then test “Spanish” (T/F); ask “What is the capital of country X?”]

Page 7: Query Optimization over Crowdsourced Data

Preview of Experimental Results

[Figure: bar chart of actual costs spent on Mechanical Turk ($, axis from 0 to 15) for Plan 1 and Plan 2, broken down by question: “What is the capital of country X?”, “What language do they speak in country X?”, “Give me a Spanish-speaking country.”, “Give me a country.”]

Page 8: Query Optimization over Crowdsourced Data

Outline

•  Motivating example
•  Deco data model and queries
•  Cost and cardinality estimation
•  Experimental results

Page 9: Query Optimization over Crowdsourced Data

Deco: Data Model (1/2)

•  Conceptual Relation: visible to end-users
      Country (country, language, capital)
•  Resolution Rules: cleanse raw data using UDFs
      country: dupElim
      language: majority(3)
      capital: majority(3)
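The two resolution rules named on this slide can be pictured as ordinary functions over raw crowd answers. The sketch below is illustrative only: the names `dup_elim` and `majority` are hypothetical, and reading majority(3) as "the value given by a strict majority of up to three answers" is an assumption, not Deco's actual UDF interface.

```python
from collections import Counter
from typing import Hashable, Iterable, List, Optional


def dup_elim(raw_values: Iterable[Hashable]) -> List[Hashable]:
    """Drop repeated raw answers, keeping first-seen order."""
    return list(dict.fromkeys(raw_values))


def majority(raw_values: Iterable[Hashable], n: int = 3) -> Optional[Hashable]:
    """Return the value holding a strict majority of n votes, else None.

    Only the first n answers are considered (assumed reading of majority(3)).
    """
    votes = Counter(list(raw_values)[:n])
    if not votes:
        return None
    value, count = votes.most_common(1)[0]
    return value if count > n // 2 else None
```

Under this reading, two matching answers are enough to resolve a value, and a third answer is needed only when the first two disagree.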

Page 10: Query Optimization over Crowdsourced Data

Deco: Data Model (2/2)

•  Fetch Rules: “access methods” for the crowd
      language => country   “Give me a {language}-speaking country.”
      Ø => country          “Give me a country.”
      country => language   “What language do they speak in {country}?”
      country => capital    “What is the capital of {country}?”
•  Prices shown next to the fetch rules: [$0.05], [$0.01], [$0.02], [$0.03]
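A fetch rule pairs a set of given attributes with one requested attribute and a question template. The sketch below shows one possible in-memory representation; the `FetchRule` class and its field names are hypothetical, and prices are left out because this transcript does not show which price belongs to which rule.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class FetchRule:
    given: Tuple[str, ...]  # attributes supplied to the worker (empty for Ø)
    asks_for: str           # attribute the worker is asked to provide
    template: str           # question text with {attribute} placeholders

    def question(self, bindings: Dict[str, str]) -> str:
        """Fill the template with values for the given attributes."""
        missing = [a for a in self.given if a not in bindings]
        if missing:
            raise ValueError(f"missing values for: {missing}")
        return self.template.format(**{a: bindings[a] for a in self.given})


# The four fetch rules from the slide, in the same order.
FETCH_RULES = [
    FetchRule(("language",), "country", "Give me a {language}-speaking country."),
    FetchRule((), "country", "Give me a country."),
    FetchRule(("country",), "language", "What language do they speak in {country}?"),
    FetchRule(("country",), "capital", "What is the capital of {country}?"),
]
```

For example, `FETCH_RULES[0].question({"language": "Spanish"})` produces the question used by Plan 2 in the motivating example.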

Page 11: Query Optimization over Crowdsourced Data

Deco: Queries

•  Deco query: SQL query over conceptual relations
      SELECT country, capital
      FROM Country
      WHERE language='Spanish'
      MINTUPLES 8
•  Query processor: access the crowd as needed to produce query result while:
   1.  Minimizing monetary cost (query optimizer)
   2.  Reducing latency (query execution engine)

Page 12: Query Optimization over Crowdsourced Data

Query Optimization

•  Find the best query plan in terms of estimated monetary cost
•  As in a traditional query optimizer:
   1.  Cost and cardinality estimation
   2.  Search space
   3.  Plan enumeration algorithm
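Once each candidate plan has an estimated monetary cost, choosing the best plan reduces to taking the minimum. The sketch below assumes the candidates are already enumerated and costed; it is not the plan enumeration algorithm from the paper, and the helper name `cheapest_plan` is hypothetical. The two cost figures in the usage note are the estimates given later in this deck.

```python
from typing import Dict


def cheapest_plan(estimated_cost: Dict[str, float]) -> str:
    """Return the name of the plan with the lowest estimated monetary cost."""
    if not estimated_cost:
        raise ValueError("no candidate plans")
    return min(estimated_cost, key=estimated_cost.get)
```

With the estimates from the cardinality estimation slides, `cheapest_plan({"Plan 1": 16.00, "Plan 2": 2.50})` selects Plan 2.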

Page 13: Query Optimization over Crowdsourced Data

Cost Estimation

•  Total monetary cost = $\sum_{\text{Fetch } F} F.\mathrm{price} \times F.\mathrm{cardinality}$
   –  Existing data is “free”
•  Definition of Cardinality in Deco
   –  Total number of expected output tuples from operator until query execution terminates
•  Cardinality estimation
   –  Final database state needs to be estimated simultaneously
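The cost formula above sums price times cardinality over the Fetch operators of a plan. A minimal sketch follows, assuming each Fetch operator is summarized by a price and an estimated cardinality that counts only newly fetched answers (existing data being free); the `FetchEstimate` type and the numbers in the usage note are made up for illustration.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Iterable


@dataclass(frozen=True)
class FetchEstimate:
    rule: str          # label of the fetch rule, e.g. "country => capital"
    price: Decimal     # dollars paid per crowd answer
    cardinality: int   # estimated number of new answers to fetch


def total_monetary_cost(fetches: Iterable[FetchEstimate]) -> Decimal:
    """Sum F.price * F.cardinality over all Fetch operators F."""
    return sum((f.price * f.cardinality for f in fetches), Decimal("0"))
```

For instance, one fetch priced at $0.05 with cardinality 10 and another priced at $0.02 with cardinality 5 give a total of $0.60. `Decimal` is used here only to avoid floating-point rounding in currency amounts.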

Page 14: Query Optimization over Crowdsourced Data

Cardinality Estimation: Setting

•  $0.05 for all fetch rules
•  No existing data
•  Selectivity factors
   –  language='Spanish': 0.1
   –  dupElim: 0.8
   –  majority(3): 0.4 (=1/2.5)

Page 15: Query Optimization over Crowdsourced Data

Cardinality Estimation: Plan 1

SELECT country, capital FROM Country WHERE language='Spanish' MINTUPLES 8

[Figure: query plan tree with operators numbered 1 to 14: MinTuples[8], Project[co,ca], DLOJoin[co] (twice), Resolve[dupElim], Resolve[maj3] (twice), Filter[la='Spanish'], Scan[CtryA], Scan[CtryD1], Scan[CtryD2], Fetch[Ø→co], Fetch[co→la], Fetch[co→ca]]

Estimated Fetch cardinalities:
      Ø => country          100
      country => language   200
      country => capital     20

Cost estimation: $0.05×(100+200+20) = $16.00
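One way to arrive at the three fetch cardinalities for Plan 1 is to work backward from MINTUPLES 8 through the selectivity factors in the setting. The sketch below is a back-of-the-envelope reading that reproduces the slide's numbers; it is not the estimation algorithm from the paper, and all names are illustrative. `Fraction` keeps the arithmetic exact.

```python
from fractions import Fraction

MIN_TUPLES = 8
SEL_SPANISH = Fraction("0.1")  # language='Spanish'
SEL_DUPELIM = Fraction("0.8")  # dupElim
SEL_MAJ3 = Fraction("0.4")     # majority(3), i.e. 2.5 answers per resolved value
PRICE = Fraction("0.05")       # dollars per fetch, same for all rules


def plan1_fetch_counts():
    # Distinct countries needed so that 8 of them pass the language filter.
    countries = MIN_TUPLES / SEL_SPANISH
    return {
        # Raw country answers needed to obtain that many distinct countries.
        "Ø => country": countries / SEL_DUPELIM,
        # Language answers needed to resolve a language for every country.
        "country => language": countries / SEL_MAJ3,
        # Capital answers needed only for the 8 countries that pass the filter.
        "country => capital": MIN_TUPLES / SEL_MAJ3,
    }


def plan1_cost():
    return PRICE * sum(plan1_fetch_counts().values())
```

`plan1_fetch_counts()` yields 100, 200, and 20, and `plan1_cost()` yields 16, matching the estimate on the slide.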

Page 16: Query Optimization over Crowdsourced Data

Cardinality Estimation: Plan 2

SELECT country, capital FROM Country WHERE language='Spanish' MINTUPLES 8

[Figure: query plan tree with operator labels 1 to 14, with 8a in place of 8: MinTuples[8], Project[co,ca], DLOJoin[co] (twice), Resolve[dupElim], Resolve[maj3] (twice), Filter[la='Spanish'], Scan[CtryA], Scan[CtryD1], Scan[CtryD2], Fetch[la→co], Fetch[co→la], Fetch[co→ca]]

Estimated Fetch cardinalities:
      language => country   10
      country => language   20
      country => capital    20

Cost estimation: $0.05×(10+20+20) = $2.50
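The Plan 2 cardinalities can be reproduced with the same kind of backward reasoning from MINTUPLES 8. The sketch below assumes that every country returned by the language => country rule passes the language filter, so only 8 distinct countries are needed; that assumption and all names are illustrative, not taken from the paper.

```python
from fractions import Fraction

MIN_TUPLES = 8
SEL_DUPELIM = Fraction("0.8")  # dupElim
SEL_MAJ3 = Fraction("0.4")     # majority(3), i.e. 2.5 answers per resolved value
PRICE = Fraction("0.05")       # dollars per fetch, same for all rules


def plan2_fetch_counts():
    # Assumed: all fetched countries are Spanish-speaking, so 8 suffice.
    countries = Fraction(MIN_TUPLES)
    return {
        # Raw country answers needed to obtain 8 distinct countries.
        "language => country": countries / SEL_DUPELIM,
        # Language answers needed to resolve a language for each of them.
        "country => language": countries / SEL_MAJ3,
        # Capital answers needed to resolve a capital for each of them.
        "country => capital": countries / SEL_MAJ3,
    }


def plan2_cost():
    return PRICE * sum(plan2_fetch_counts().values())
```

`plan2_fetch_counts()` yields 10, 20, and 20, and `plan2_cost()` yields 5/2, that is $2.50. The saving comes from skipping the countries that the language filter would otherwise discard.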

Page 17: Query Optimization over Crowdsourced Data

Experimental Results

[Figure: two bar charts of actual cost ($), one for Plan 1 (axis from 0 to 15) and one for Plan 2 (axis from 0 to 3), broken down by fetch rule: country => capital, country => language, language => country, Ø => country]

Page 18: Query Optimization over Crowdsourced Data

Experimental Results

[Figure: two bar charts comparing actual and estimated cost ($), one for Plan 1 (axis from 0 to 15) and one for Plan 2 (axis from 0 to 3), broken down by fetch rule: country => capital, country => language, language => country, Ø => country]

Page 19: Query Optimization over Crowdsourced Data

Related Work

•  Declarative approach for crowdsourcing
   –  Arnold, CrowdDB, CrowdSearcher, Jabberwocky, Qurk, ...
•  Crowd-powered algorithms/operations
   –  Filter, sort, join, max, entity resolution, …
•  Also:
   –  Traditional query optimization
   –  Heterogeneous or federated database systems

Page 20: Query Optimization over Crowdsourced Data

Summary

•  Cost estimation in Deco
   –  Distinguish between existing data vs. new data
   –  Estimate cardinality and final database state simultaneously
•  In the paper:
   –  Full description of cost estimation and plan enumeration algorithms
   –  More experimental results

Page 21: Query Optimization over Crowdsourced Data

Thank you!