Example Generation for Data Flow Programs

Chris Olston, Shubham Chopra, Utkarsh Srivastava

Generating Example Data For Dataflow Programs

Research

Data Processing Renaissance

Lots of data (TBs/day at Yahoo!)

Lots of queries and programs to analyze that data

New data flow languages: Map-Reduce, Pig Latin, Dryad

Other data flow systems: Aurora, Tioga, River

Example Dataflow Program

LOAD(user, url)

LOAD(url, pagerank)

TRANSFORM user, canonicalize(url)

JOIN on url

GROUP on user

TRANSFORM user, AVG(pagerank)

FILTER avgPR > 0.5

Find users that tend to visit high-pagerank pages

Iterative Process

LOAD(user, url)

LOAD(url, pagerank)

TRANSFORM user, canonicalize(url)

JOIN on url

GROUP on user

TRANSFORM user, AVG(pagerank)

FILTER avgPR > 0.5

Bug in UDF canonicalize?

Joining on right attribute?

Everything being filtered out?

No Output

How to do test runs?

• Run with real data

– Too inefficient (TBs of data)

• Create smaller data sets (e.g., by sampling)

– Empty results due to joins [Chaudhuri et al. 99] and selective filters

• Biased sampling for joins

– Indexes not always present
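
The empty-result problem with independent sampling can be sketched as follows. This is only an illustration under assumed conditions: the two tables are synthetic stand-ins (10,000 urls, one row each), and the helper names sample and join_on_url are hypothetical, not part of the original work.

```python
import random

def sample(rows, fraction, rng):
    """Keep each row independently with the given probability."""
    return [r for r in rows if rng.random() < fraction]

def join_on_url(visits, pageranks):
    """Inner join of (user, url) rows with (url, pagerank) rows on url."""
    rank_by_url = dict(pageranks)
    return [(user, url, rank_by_url[url])
            for user, url in visits if url in rank_by_url]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic stand-in tables: every url appears once on each side.
visits = [(f"user{i}", f"url{i}") for i in range(10_000)]
pageranks = [(f"url{i}", (i % 100) / 100) for i in range(10_000)]

full = join_on_url(visits, pageranks)
sampled = join_on_url(sample(visits, 0.01, rng), sample(pageranks, 0.01, rng))

# The full join has one row per url; the join of two independent 1% samples
# keeps a url only if it was drawn on both sides, so it is nearly empty.
print(len(full), len(sampled))
```

With a 1% sample on each side, a given url survives the join with probability of about 0.01 × 0.01, which is why a sampled test run so often produces no output at all.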

Examples to Illustrate Program

LOAD(user, url)

LOAD(url, pagerank)

TRANSFORM user, canonicalize(url)

JOIN on url

GROUP on user

TRANSFORM user, AVG(pagerank)

FILTER avgPR > 0.5

(Amy, cnn.com) (Amy, http://www.frogs.com) (Fred, www.snails.com/index.html)

(Amy, www.cnn.com) (Amy, www.frogs.com) (Fred, www.snails.com)

(www.cnn.com, 0.9) (www.frogs.com, 0.3) (www.snails.com, 0.4)

(Amy, www.cnn.com, 0.9) (Amy, www.frogs.com, 0.3) (Fred, www.snails.com, 0.4)

(Amy, {(Amy, www.cnn.com, 0.9), (Amy, www.frogs.com, 0.3)})

(Fred, {(Fred, www.snails.com, 0.4)})

(Amy, 0.6) (Fred, 0.4)

(Amy, 0.6)
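
The example tables above can be reproduced with a short Python sketch of the same dataflow. The canonicalize function here is a hypothetical stand-in for the UDF (it drops the scheme and path and forces a www. prefix); the slides do not show the real one.

```python
from collections import defaultdict

def canonicalize(url):
    """Hypothetical stand-in for the UDF: drop scheme and path, force 'www.'."""
    host = url.split("://", 1)[-1].split("/", 1)[0]
    return host if host.startswith("www.") else "www." + host

# LOAD (user, url) and LOAD (url, pagerank), using the slide's example records
visits = [("Amy", "cnn.com"), ("Amy", "http://www.frogs.com"),
          ("Fred", "www.snails.com/index.html")]
pageranks = [("www.cnn.com", 0.9), ("www.frogs.com", 0.3),
             ("www.snails.com", 0.4)]

# TRANSFORM user, canonicalize(url)
canonical = [(user, canonicalize(url)) for user, url in visits]

# JOIN on url
rank = dict(pageranks)
joined = [(user, url, rank[url]) for user, url in canonical if url in rank]

# GROUP on user
groups = defaultdict(list)
for user, url, pr in joined:
    groups[user].append(pr)

# TRANSFORM user, AVG(pagerank)
averages = [(user, sum(prs) / len(prs)) for user, prs in groups.items()]

# FILTER avgPR > 0.5
result = [(user, avg) for user, avg in averages if avg > 0.5]
print(result)
```

Only Amy survives the final filter, since her average pagerank is about 0.6 while Fred's is 0.4.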

Value Addition From Examples

• Examples can be used for

– Debugging

– Understanding a program written by someone else

– Learning a new operator, or language

Outline

• Formalization of good examples

• Example Generation Algorithm

• Performance Evaluation

Good Examples: Consistency

LOAD(user, url)

LOAD(url, pagerank)

TRANSFORM user, canonicalize(url)

JOIN on url

GROUP on user

TRANSFORM user, AVG(pagerank)

FILTER avgPR > 0.5

0. Consistency: output example = operator applied on input example

(Amy, cnn.com) (Amy, http://www.frogs.com) (Fred, www.snails.com/index.html)

(Amy, www.cnn.com) (Amy, www.frogs.com) (Fred, www.snails.com)

Good Examples: Realism

LOAD(user, url)

LOAD(url, pagerank)

TRANSFORM user, canonicalize(url)

JOIN on url

GROUP on user

TRANSFORM user, AVG(pagerank)

FILTER avgPR > 0.5

1. Realism

Formalization: Fraction of examples that are real or are derived from real records

(Amy, cnn.com) (Amy, http://www.frogs.com) (Fred, www.snails.com/index.html)

(Amy, www.cnn.com) (Amy, www.frogs.com) (Fred, www.snails.com)
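
The realism fraction can be sketched in a few lines of Python. The (record, is_real) representation, the synthetic record ("Zed", "www.example.com"), and the choice to return 0.0 for an empty example set are all assumptions made for illustration.

```python
def realism(examples):
    """Fraction of example records that are real or derived from real records.

    `examples` is a list of (record, is_real) pairs; an empty list scores 0.0
    (an assumption, since the slides do not define that case).
    """
    if not examples:
        return 0.0
    return sum(1 for _, is_real in examples if is_real) / len(examples)

# Two records taken from the slide plus one made-up synthetic record.
examples = [
    (("Amy", "cnn.com"), True),
    (("Fred", "www.snails.com/index.html"), True),
    (("Zed", "www.example.com"), False),  # hypothetical synthetic record
]
score = realism(examples)  # 2 of the 3 records are real
print(score)
```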

Good Examples: Completeness

LOAD(user, url)

LOAD(url, pagerank)

TRANSFORM user, canonicalize(url)

JOIN on url

GROUP on user

TRANSFORM user, AVG(pagerank)

FILTER avgPR > 0.5

2. Completeness

Demonstrate the salient properties of each operator, e.g., FILTER

(Amy, 0.6) (Fred, 0.4)

(Amy, 0.6)

Good Examples: Completeness

LOAD(user, url)

LOAD(url, pagerank)

TRANSFORM user, canonicalize(url)

JOIN on url

GROUP on user

TRANSFORM user, AVG(pagerank)

FILTER avgPR > 0.5

2. Completeness

Demonstrate the salient properties of each operator, e.g., JOIN

(Amy, www.cnn.com) (Amy, www.frogs.com) (Fred, www.snails.com)

(www.cnn.com, 0.9) (www.frogs.com, 0.3) (www.snails.com, 0.4)

(Amy, www.cnn.com, 0.9) (Amy, www.frogs.com, 0.3) (Fred, www.snails.com, 0.4)

Formalizing Completeness

• For any operator, classify input/output example records into equivalence classes.

• Each equivalence class demonstrates one property of the operator.

• Try to have at least one example from each class

Equivalence Class Examples

FILTER

E0: All input records that pass the filter

E1: All input records that fail the filter

JOIN

E0: All output records

UNION

E0: All records belonging to first input

E1: All records belonging to second input
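
The three classifications above can be written down directly. The following is a minimal sketch; the function names and the dictionary layout are illustrative choices, not taken from the original work.

```python
def filter_classes(records, predicate):
    """FILTER: E0 = input records that pass, E1 = input records that fail."""
    classes = {"E0": [], "E1": []}
    for record in records:
        classes["E0" if predicate(record) else "E1"].append(record)
    return classes

def join_classes(output_records):
    """JOIN: a single class E0 holding every output record."""
    return {"E0": list(output_records)}

def union_classes(first, second):
    """UNION: E0 = records from the first input, E1 = from the second."""
    return {"E0": list(first), "E1": list(second)}

# FILTER avgPR > 0.5 applied to the example records from the earlier slides
by_class = filter_classes([("Amy", 0.6), ("Fred", 0.4)], lambda r: r[1] > 0.5)
print(by_class)
```

Here both FILTER classes are populated: Amy demonstrates a record that passes and Fred one that fails.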

Formalizing Completeness

Operator Completeness: Fraction of equivalence classes that have at least one example record.

Overall Completeness: Average of per-operator completeness.
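
Both completeness definitions translate into a few lines of Python. This sketch assumes each operator's examples are already sorted into equivalence classes; the dictionary layout and the sample data are hypothetical.

```python
def operator_completeness(classes):
    """Fraction of an operator's equivalence classes with at least one example."""
    return sum(1 for members in classes.values() if members) / len(classes)

def overall_completeness(per_operator_classes):
    """Average of the per-operator completeness values."""
    scores = [operator_completeness(c) for c in per_operator_classes.values()]
    return sum(scores) / len(scores)

# Hypothetical example set: no record fails the filter, so its E1 is empty.
program = {
    "FILTER age>18": {"E0": [("Amy", 20), ("Jack", 30)], "E1": []},
    "UNION": {"E0": [("Amy", 20)], "E1": [("Jack", 30)]},
}
score = overall_completeness(program)  # (0.5 + 1.0) / 2
print(score)
```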

Good Examples: Conciseness

LOAD(user, url)

LOAD(url, pagerank)

TRANSFORM user, canonicalize(url)

JOIN on url

GROUP on user

TRANSFORM user, AVG(pagerank)

FILTER avgPR > 0.5

3. Conciseness

Operator Conciseness: (# equivalence classes) / (# example records)

Overall Conciseness: Average of per-operator conciseness

(Amy, cnn.com) (Amy, http://www.frogs.com) (Fred, www.snails.com/index.html)

(Amy, www.cnn.com) (Amy, www.frogs.com) (Fred, www.snails.com)
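
Taking operator conciseness as the ratio of equivalence classes to example records, the metric can be sketched as below. Capping the ratio at 1 is an assumption made here so the score stays on a 0-to-1 scale; the slide itself shows only the ratio.

```python
def operator_conciseness(num_classes, num_example_records):
    """Ratio of equivalence classes to example records for one operator."""
    if num_example_records <= 0:
        raise ValueError("need at least one example record")
    # Assumption: cap at 1 so the score stays within [0, 1].
    return min(1.0, num_classes / num_example_records)

def overall_conciseness(per_operator):
    """Average of per-operator conciseness over (classes, records) pairs."""
    scores = [operator_conciseness(c, n) for c, n in per_operator]
    return sum(scores) / len(scores)

# Hypothetical counts: a FILTER (2 classes) shown with 4 records,
# and a UNION (2 classes) shown with 2 records.
score = overall_conciseness([(2, 4), (2, 2)])  # (0.5 + 1.0) / 2
print(score)
```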

Outline

• Formalization of good examples

• Example Generation Algorithm

• Performance Evaluation

Related Work

• Related areas

– Reverse Query Processing

– Database Testing

– Software and Hardware Verification

• Differences

– Realism not a concern

– Notion of conciseness is different

– Intermediate result size is immaterial

Strawman I: Downstream Propagation

Take some portion of input data and run the program over it.

1. Realism

2. Completeness

3. Conciseness

Strawman II: Upstream Propagation

Start from what output is desired, and work backwards

1. Realism

2. Completeness

3. Conciseness

Our Algorithm

Algorithm Passes: 1. Downstream 2. Pruning 3. Upstream 4. Pruning

Our Algorithm

Take a subset of input and propagate through the program.

LOAD(user, age)

FILTER udf(user)

LOAD(user, age)

UNION

FILTER age>18

(Amy, 20) (Fred, 25)

(Amy, 20) (Fred, 25)

(Jack, 30)

(Amy, 20) (Fred, 25) (Jack, 30)

(Amy, 20) (Fred, 25) (Jack, 30)

Algorithm Passes: 1. Downstream 2. Pruning 3. Upstream 4. Pruning
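
The downstream pass on this program can be sketched as follows. The function name and the dictionary of per-operator outputs are illustrative, and the UDF is replaced by a hypothetical predicate that accepts every user, which matches the slide where both input records survive the first filter.

```python
def downstream(sample_a, sample_b, udf):
    """Push sampled inputs through FILTER udf(user), UNION, FILTER age>18.

    Returns the example records observed at the output of each operator.
    """
    filtered_a = [r for r in sample_a if udf(r[0])]
    unioned = filtered_a + sample_b
    output = [r for r in unioned if r[1] > 18]
    return {
        "FILTER udf(user)": filtered_a,
        "UNION": unioned,
        "FILTER age>18": output,
    }

# Hypothetical UDF stand-in: accepts every user.
tables = downstream([("Amy", 20), ("Fred", 25)], [("Jack", 30)],
                    udf=lambda user: True)
for operator, records in tables.items():
    print(operator, records)
```

Every sampled record passes FILTER age>18, so the "fails the filter" class stays empty after this pass; that gap is what the later upstream pass tries to fill.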

Our Algorithm

LOAD(user, age)

FILTER udf(user)

LOAD(user, age)

UNION

FILTER age>18

(Amy, 20) (Fred, 25)

(Amy, 20) (Fred, 25)

(Jack, 30)

(Amy, 20) (Fred, 25) (Jack, 30)

(Amy, 20) (Fred, 25) (Jack, 30)

Prune redundant examples, i.e., improve conciseness without hurting completeness.

Algorithm Passes: 1. Downstream 2. Pruning 3. Upstream 4. Pruning

Our Algorithm

LOAD(user, age)

FILTER udf(user)

LOAD(user, age)

UNION

FILTER age>18

(Amy, 20) (Fred, 25)

(Jack, 30)

(Amy, 20) (Fred, 25) (Jack, 30)

(Amy, 20) (Fred, 25)

(Amy, 20) (Fred, 25) (Jack, 30)

Prune redundant examples, i.e., improve conciseness without hurting completeness.

Algorithm Passes: 1. Downstream 2. Pruning 3. Upstream 4. Pruning

Our Algorithm

LOAD(user, age)

FILTER udf(user)

LOAD(user, age)

UNION

FILTER age>18

(Amy, 20)

(Jack, 30)

(Amy, 20) (Amy, 20) (Jack, 30)

(Amy, 20) (Jack, 30)

Prune redundant examples, i.e., improve conciseness without hurting completeness.

Algorithm Passes: 1. Downstream 2. Pruning 3. Upstream 4. Pruning

Formalization of Pruning

Example Records → Elements

Equivalence Classes → Sets

Pick minimum # records to cover every equivalence class: Set-Cover Problem

• More involved because completeness of other operators must be maintained; details in paper
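
The covering idea behind pruning can be illustrated with the standard greedy approximation for set cover. This is a swapped-in textbook heuristic, not necessarily the procedure used in the paper, and it ignores the cross-operator constraints mentioned above. The class labels are hypothetical names for the equivalence classes of the three operators in the running example.

```python
def greedy_cover(covers):
    """Pick records until every equivalence class has an example.

    `covers` maps each record to the set of equivalence classes it
    demonstrates. Greedy rule: always take the record that demonstrates
    the most classes that are still uncovered.
    """
    uncovered = set().union(*covers.values())
    chosen = []
    while uncovered:
        best = max(covers, key=lambda r: len(covers[r] & uncovered))
        chosen.append(best)
        uncovered -= covers[best]
    return chosen

# Hypothetical labels: udf.E0 = passes FILTER udf(user), union.E0/E1 = came
# from the first/second input, age.E0 = passes FILTER age>18.
covers = {
    ("Amy", 20): {"udf.E0", "union.E0", "age.E0"},
    ("Fred", 25): {"udf.E0", "union.E0", "age.E0"},
    ("Jack", 30): {"union.E1", "age.E0"},
}
kept = greedy_cover(covers)
print(kept)
```

Fred demonstrates nothing that Amy does not, so he is dropped, while Jack is kept as the only record from the second input.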

Our Algorithm

LOAD(user, age)

FILTER udf(user)

LOAD(user, age)

UNION

FILTER age>18

(Amy, 20)

(Jack, 30)

(Amy, 20) (Amy, 20) (Jack, 30)

(Amy, 20) (Jack, 30)

Enhance completeness by inserting constraint records (best effort; details in paper)

Algorithm Passes: 1. Downstream 2. Pruning 3. Upstream 4. Pruning

Our Algorithm

LOAD(user, age)

FILTER udf(user)

LOAD(user, age)

UNION

FILTER age>18

(Amy, 20)

(Jack, 30)

(Amy, 20) (Amy, 20) (Jack, 30)

(Amy, 20) (Jack, 30)

(--, 17)

Enhance completeness by inserting constraint records (best effort; details in paper)

Algorithm Passes: 1. Downstream 2. Pruning 3. Upstream 4. Pruning

Our Algorithm

LOAD(user, age)

FILTER udf(user)

LOAD(user, age)

UNION

FILTER age>18

(Amy, 20) (Bill, 17)

(Jack, 30) (--, 17)

(Amy, 20) (--, 17)

(Amy, 20) (Jack, 30)

(Amy, 20) (Jack, 30) (--, 17)

Enhance completeness by inserting constraint records (best effort; details in paper)

Algorithm Passes: 1. Downstream 2. Pruning 3. Upstream 4. Pruning

Our Algorithm

LOAD(user, age)

FILTER udf(user)

LOAD(user, age)

UNION

FILTER age>18

(Amy, 20) (Bill, 17)

(Jack, 30) (Bob, 17)

(Amy, 20) (Bill, 17)

(Amy, 20) (Jack, 30)

(Amy, 20) (Jack, 30) (Bill, 17) (Bob, 17)

Enhance completeness by inserting constraint records (best effort; details in paper)

Algorithm Passes: 1. Downstream 2. Pruning 3. Upstream 4. Pruning
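
The constraint-record step can be sketched for the single case shown on the slides: FILTER age>18 has no failing example, so a record with an unconstrained user and a failing age is synthesized and then filled in. The helper names are hypothetical, None plays the role of the "--" don't-care field, and the sketch does not check whether the filled-in user would also pass udf(user).

```python
def failing_constraint_record(threshold):
    """Constraint record for the 'fails age > threshold' class.

    None marks a don't-care field, written as -- on the slides.
    """
    return (None, threshold)  # age == threshold does not satisfy age > threshold

def fill_dont_cares(record, fill_user):
    """Replace the don't-care user field with a concrete (synthetic) value."""
    user, age = record
    return (fill_user if user is None else user, age)

constraint = failing_constraint_record(17)  # plays the role of (--, 17)
source_a = fill_dont_cares(constraint, "Bill")
source_b = fill_dont_cares(constraint, "Bob")
print(constraint, source_a, source_b)
```

The age 17 and the names Bill and Bob mirror the slides; any age of 18 or less would demonstrate the failing class equally well.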

Our Algorithm

LOAD(user, age)

FILTER udf(user)

LOAD(user, age)

UNION

FILTER age>18

(Amy, 20) (Bill, 17)

(Jack, 30) (Bob, 17)

(Amy, 20) (Bill, 17)

(Amy, 20) (Jack, 30)

(Amy, 20) (Jack, 30) (Bill, 17) (Bob, 17)

Prune redundant examples (as in Pass 2). Favor real examples over synthetic ones.

Algorithm Passes: 1. Downstream 2. Pruning 3. Upstream 4. Pruning

Our Algorithm

LOAD(user, age)

FILTER udf(user)

LOAD(user, age)

UNION

FILTER age>18

(Bill, 17)

(Jack, 30)

(Bill, 17) (Jack, 30) (Jack, 30)

(Bill, 17)

Prune redundant examples (as in Pass 2). Favor real examples over synthetic ones.

Algorithm Passes: 1. Downstream 2. Pruning 3. Upstream 4. Pruning
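
One simple way to favor real examples during the final pruning is to break ties in a greedy cover toward real records. This is an illustrative heuristic under stated assumptions, not the paper's procedure; the record ("Zed", 40) and the class labels "pass" and "fail" are made up for the example.

```python
def prune_favoring_real(covers, is_real):
    """Greedy cover of equivalence classes that prefers real records on ties.

    `covers` maps each record to the classes it demonstrates;
    `is_real` maps each record to True (real) or False (synthetic).
    """
    uncovered = set().union(*covers.values())
    chosen = []
    while uncovered:
        # Rank by newly covered classes first, then by realness (True > False).
        best = max(covers,
                   key=lambda r: (len(covers[r] & uncovered), is_real[r]))
        chosen.append(best)
        uncovered -= covers[best]
    return chosen

covers = {
    ("Zed", 40): {"pass"},   # hypothetical synthetic record that passes
    ("Amy", 20): {"pass"},   # real record that passes
    ("Bill", 17): {"fail"},  # synthetic record, the only one that fails
}
is_real = {("Zed", 40): False, ("Amy", 20): True, ("Bill", 17): False}
kept = prune_favoring_real(covers, is_real)
print(kept)
```

The real record Amy wins the tie against the synthetic Zed, while the synthetic Bill is still kept because no real record demonstrates the failing class.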

Implementation Status

• Available as ILLUSTRATE command in open-source release of Pig

• Available as Eclipse Plugin (PigPen)

PigPen Snapshot

Performance Evaluation

Program I: (Web Search Result Viewing Statistics)

– LOAD

– FILTER by compound arithmetic expression

– GROUP

– TRANSFORM using built-in aggregate function

Performance on Program I

[Chart: realism, conciseness, and completeness (scale 0 to 1) for downstream, upstream, and our algorithm]

Performance Evaluation

Program II: (Web Advertising Activity)

– LOAD table A

– FILTER A by compound logical expression

– JOIN with table B (highly selective)

– TRANSFORM using 4 string manipulation UDFs (non-invertible)

Performance on Program II

[Chart: realism, conciseness, and completeness (scale 0 to 1) for downstream, upstream, and our algorithm]

Running Time

[Chart: running time (seconds, scale 0 to 4) of downstream, upstream, and our algorithm on programs P1 to P8]

Conclusions

• Writing dataflow programs is an iterative process.

• Actual dataset too large for test runs.

• Our algorithm can automatically generate examples that illustrate the program through:

– Realism

– Conciseness

– Completeness
