Crowd Algorithms Hector Garcia-Molina, Stephen Guo, Aditya Parameswaran, Hyunjung Park, Alkis Polyzotis, Petros Venetis, Jennifer Widom Stanford and UC Santa Cruz Scoop — The Stanford – Santa Cruz Project for Cooperative Computing with Algorithms, Data, and People



The Goal

Design Fundamental Algorithms for Human Computation

• Latency
• Cost
• Uncertainty

• Which questions do I ask?
• When do I ask the questions?
• When do I stop?
• How do I combine the answers?

The Problems

• Crowd-Sort / Max: Difficult!
• Crowd-GraphSearch: Difficult!
• Crowd-Categorize: Difficult!
• Crowd-Filter: Difficult!

Each problem involves Latency, Cost, and Uncertainty.

Progress! [VLDB 2011]

Filters are the focus of this talk, followed by summaries of the rest.

Filters

Dataset of Items → Predicate 1 → Predicate 2 → … → Predicate k → Filtered Dataset

Example predicates:
• Is this image that of Bytes Café?
• Is the image blurry?
• Does it show people’s faces?

Given:
• Error Probability (FP/FN) & Selectivity for each predicate
• Desired Overall Error Probability

To: Compose a filtering strategy
• Minimize Overall Cost (# of questions)

• Which questions do I ask?
• When do I ask the questions?
• When do I stop?
• How do I combine the answers?
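One way to read "How do I combine the answers?" for a single predicate is Bayes' rule over the YES/NO counts. The sketch below is only an illustration under stated assumptions: worker answers are independent given the item's true value, selectivity is treated as the prior probability that an item satisfies the predicate, and the names `posterior_pass`, `fp`, and `fn` are hypothetical, not taken from the talk.

```python
def posterior_pass(yes: int, no: int, selectivity: float, fp: float, fn: float) -> float:
    """P(item truly satisfies the predicate | `yes` YES answers, `no` NO answers).

    Assumptions (not from the slides):
      - answers are independent given the item's true value
      - fp = P(worker says YES | item does not satisfy the predicate)
      - fn = P(worker says NO  | item satisfies the predicate)
      - selectivity = prior P(item satisfies the predicate)
    """
    # Joint probability of the observed counts with each possible true value.
    # The binomial coefficient is the same in both terms, so it cancels.
    like_pass = selectivity * (1.0 - fn) ** yes * fn ** no
    like_fail = (1.0 - selectivity) * fp ** yes * (1.0 - fp) ** no
    return like_pass / (like_pass + like_fail)
```

With no answers the function returns the prior, and each additional YES or NO shifts the estimate according to the error rates.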

Single Filter

Dataset of Items → Predicate 1 → Filtered Dataset

Surprisingly difficult!
• Need to meet an overall error threshold
  • Say, up to 10% of my images may be wrongly filtered
• Minimize overall expected number of questions

Boils down to the following:
• Take one item
• Ask some questions
  • Results in a certain number of (Y, N) for a given item
• Do I stop (if so, what do I return), or do I continue asking?

Hasn’t this been done before?

Solutions from statistics guarantee the same error per item
• Important in contexts like:
  • Automobile testing
  • Diagnosis

We’re worried about aggregate error over all items: a uniquely data-oriented problem
• I don’t care if every image is perfect as long as the overall error is met.
• As we will see, results in $$$ savings

Strategies

[Figure: grid of points with axes "YES Answers" and "NO Answers". Start here, with no questions (YES = 0, NO = 0).]

Example points:
• YES = 5, NO = 6: Return “Passed”
• YES = 3, NO = 7: Return “Failed”
• YES = 3, NO = 5: Continue

Reformulated Task:
For each point in grid: Return Pass/Fail/Cont.

Equivalently,
Find the best shape and color it!
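To make the grid view concrete, the sketch below scores any such strategy by propagating probability mass from the origin through the "continue" points. It is a minimal illustration, not the authors' algorithm: it assumes independent answers, a single predicate, and a question budget `max_questions`; the names `evaluate_strategy` and `ask_three` are hypothetical.

```python
def evaluate_strategy(strategy, selectivity, fp, fn, max_questions):
    """Return (expected #questions per item, expected error per item).

    strategy(yes, no) must return "pass", "fail", or "cont".
    Assumptions (not from the slides): answers are independent,
    fp = P(YES | item fails), fn = P(NO | item passes).
    """
    # reach[(yes, no)] = [P(item is true and reaches point), P(item is false and reaches point)]
    reach = {(0, 0): [selectivity, 1.0 - selectivity]}
    cost = 0.0
    error = 0.0
    # Process points diagonal by diagonal (increasing number of questions asked).
    for total in range(max_questions + 1):
        for yes in range(total + 1):
            no = total - yes
            p_true, p_false = reach.get((yes, no), (0.0, 0.0))
            if p_true + p_false == 0.0:
                continue  # point is never reached under this strategy
            decision = strategy(yes, no)
            if decision == "cont":
                if total == max_questions:
                    raise ValueError("strategy must stop within max_questions")
                up = reach.setdefault((yes + 1, no), [0.0, 0.0])     # one more YES
                up[0] += p_true * (1.0 - fn)
                up[1] += p_false * fp
                right = reach.setdefault((yes, no + 1), [0.0, 0.0])  # one more NO
                right[0] += p_true * fn
                right[1] += p_false * (1.0 - fp)
            else:
                cost += total * (p_true + p_false)
                # Passing a false item or failing a true item is an error.
                error += p_false if decision == "pass" else p_true
    return cost, error


def ask_three(yes, no):
    """Example strategy: always ask 3 questions, then return the majority answer."""
    if yes + no < 3:
        return "cont"
    return "pass" if yes > no else "fail"
```

Comparing two "shapes" then amounts to calling `evaluate_strategy` on each and checking which meets the error threshold at lower expected cost.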

Common Strategies

• Always ask X questions, return most likely answer
  • The triangle shape
• If you get X YES, return “Pass” or Y NO, return “Fail”, else keep asking.
  • Rectangular shape
• Ask until |#YES - #NO| > X, or at most Y questions
  • Chopped off rectangle
  • Anhai’s work on MOBS
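The three common strategies can each be written as a function from a grid point to Pass/Fail/Continue. The sketch below is illustrative only: the function names are hypothetical, "most likely answer" is simplified to a majority vote (which matches the posterior only when error rates and prior are symmetric), and ties are resolved as "fail" by assumption.

```python
def fixed_questions(x):
    """Triangle shape: always ask x questions, then return the majority answer."""
    def strategy(yes, no):
        if yes + no < x:
            return "cont"
        return "pass" if yes > no else "fail"  # ties -> "fail" (assumption)
    return strategy


def first_to(x, y):
    """Rectangular shape: stop at x YES answers ("pass") or y NO answers ("fail")."""
    def strategy(yes, no):
        if yes >= x:
            return "pass"
        if no >= y:
            return "fail"
        return "cont"
    return strategy


def margin_or_cap(x, cap):
    """Chopped-off rectangle: stop once |#YES - #NO| > x or after `cap` questions."""
    def strategy(yes, no):
        if abs(yes - no) > x or yes + no >= cap:
            return "pass" if yes > no else "fail"  # ties -> "fail" (assumption)
        return "cont"
    return strategy
```

Each factory returns a `strategy(yes, no)` callable, so the three shapes can be compared on the same grid.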


Summary of Results

• A characterization of which “shapes” are optimal
• An optimal PTIME “probabilistic” approach
  • LP leveraging the inherent DP structure
  • Optimal: Strategy with minimum overall cost for given parameters and requirements
  • Probabilistic: Probability of “Pass”, “Fail”, “Continue”

Empirical Results

Evaluation on 10000 synthetic scenarios

Tested:
• Optimal, Brute Force, Statistical, 5 Heuristic Algorithms

Optimal Probabilistic issues fewer questions overall
• 15% savings on average compared to brute force
  • 32% savings when optimal wins
• 22% savings on average compared to the statistics approach
  • 49% savings when optimal wins

Translates to $$$ for many items!!

[Figure: experiment pipeline. Generate Parameters, then run Other Algorithms, Brute Force, Deterministic Optimal, and Probabilistic; costs compared as COST1 >> COST2 >> COST3.]

Crowd-Max/Sort

The problem(s):
• Find the strategy for sorting n items
  • Given: Probability of error for a comparison
  • Given: Desired threshold on error, #questions, #rounds

Sorting automatically given evidence
• NP-Hard even for a simple probability of error model
• Related work in the area of voting theory, economics

Which r questions do we ask next?

[Figure: spectrum of strategies, ordered by Decreasing Parallelism and More Accuracy]
• Ask all pairs a total of 2k/n times
• Tournament, with k repetitions at each level
• One question in each round
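The middle option, a tournament with k repetitions at each level, can be sketched as follows. This is an illustration under assumptions, not the authors' method: each match is decided by a majority over k noisy comparisons, all matches in a level could be issued in one parallel round, and the names `tournament_max` and `noisy_greater` (a simulated worker with a fixed error probability) are hypothetical.

```python
import random


def tournament_max(items, compare, k):
    """Single-elimination tournament; each match is a majority vote over k asks.

    compare(a, b) returns True when the (possibly noisy) answer is "a beats b".
    Use an odd k to avoid tied votes; on a tie the second item advances.
    """
    if not items:
        raise ValueError("items must be non-empty")
    current = list(items)
    while len(current) > 1:
        next_level = []
        # All matches of one level are independent, so they could be asked in parallel.
        for i in range(0, len(current) - 1, 2):
            a, b = current[i], current[i + 1]
            votes_for_a = sum(1 for _ in range(k) if compare(a, b))
            next_level.append(a if 2 * votes_for_a > k else b)
        if len(current) % 2 == 1:
            next_level.append(current[-1])  # odd item out gets a bye
        current = next_level
    return current[0]


def noisy_greater(error, rng):
    """Simulated worker: answers "a > b" correctly with probability 1 - error."""
    def compare(a, b):
        truth = a > b
        return truth if rng.random() >= error else not truth
    return compare
```

Raising k buys accuracy at the price of more questions per level, while the number of rounds stays logarithmic in the number of items.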

Crowd-GraphSearch

Image Categorization Example

[Figure: category hierarchy]
vehicle
  car
    nissan
      maxima
      sentra
    honda
    toyota

To attach: image of a honda car
• Is image one of vehicle? YES!
• Is image one of toyota? NO!
• Is image one of honda? YES!

• Target node = intended category
• "Is the image one of X?" = "Is the target node reachable from X?"

Find the target node by asking minimum number of search questions.
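As a baseline for the search problem, a naive top-down descent asks the reachability question for each child in turn. The sketch below is not the minimum-question algorithm; it assumes a tree-shaped hierarchy, a single target, error-free answers, and a target reachable from the root. The names `find_target` and `make_oracle` are hypothetical.

```python
def find_target(children, root, reachable):
    """Descend from root; at each node ask "is the target reachable from X?" per child.

    children: dict mapping a node to the list of its child nodes.
    reachable(x): answers the search question for node x (assumed error-free).
    Returns (target node, number of questions asked).
    """
    node = root
    asked = 0
    while True:
        for child in children.get(node, []):
            asked += 1
            if reachable(child):
                node = child
                break
        else:
            # No child reaches the target, so the current node is the target.
            return node, asked


def make_oracle(children, target):
    """Simulated perfect worker: True if target is x or a descendant of x."""
    def reachable(x):
        stack = [x]
        while stack:
            n = stack.pop()
            if n == target:
                return True
            stack.extend(children.get(n, []))
        return False
    return reachable
```

The number of questions this baseline asks depends on child order, which is exactly the slack a minimum-question strategy tries to remove.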

Crowd-Categorize

• k buckets, n items
• Categorize every item, overall error < threshold
• For k = 1, same as filters problem
• Two versions:
  • Discrete
    • Independent (like in the filters case)
    • Dependent buckets (e.g., colors, GraphSearch)
  • Continuous (e.g., age)


Questions?