Chris Olston Shubham Chopra Utkarsh Srivastava Generating Example Data For Dataflow Programs Research

Example Generation for Data Flow Programs


Page 1: Example Generation for Data Flow Programs

Chris Olston, Shubham Chopra, Utkarsh Srivastava

Generating Example Data For Dataflow Programs

Research

Page 2: Example Generation for Data Flow Programs

Data Processing Renaissance

Lots of data (TBs/day at Yahoo!)

Lots of queries and programs to analyze that data

New data flow languages: Map-Reduce, Pig Latin, Dryad

Other data flow systems: Aurora, Tioga, River

Page 3: Example Generation for Data Flow Programs

Example Dataflow Program

LOAD(user, url)

LOAD(url, pagerank)

TRANSFORM user, canonicalize(url)

JOIN on url

GROUP on user

TRANSFORM user, AVG(pagerank)

FILTER avgPR > 0.5

Find users that tend to visit high-pagerank pages

Page 4: Example Generation for Data Flow Programs

Iterative Process

LOAD(user, url)

LOAD(url, pagerank)

TRANSFORM user, canonicalize(url)

JOIN on url

GROUP on user

TRANSFORM user, AVG(pagerank)

FILTER avgPR > 0.5

Bug in UDF canonicalize?

Joining on right attribute?

Everything being filtered out?

No Output

Page 5: Example Generation for Data Flow Programs

How to do test runs?

• Run with real data
– Too inefficient (TBs of data)

• Create smaller data sets (e.g., by sampling)
– Empty results due to joins [Chaudhuri et al. 99] and selective filters

• Biased sampling for joins
– Indexes not always present
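
A back-of-the-envelope sketch of why naive sampling tends to produce empty join results. It assumes a one-to-one join and independent uniform sampling of both inputs at the same rate; the function name and the sizes are illustrative and do not come from the slides.

```python
def expected_join_rows(n_matching_pairs: int, sample_rate: float) -> float:
    """Expected join output size when both inputs are sampled independently.

    Assumption: each joined pair consists of one record from each input,
    so the pair survives only if both records are sampled, which happens
    with probability sample_rate ** 2.
    """
    return n_matching_pairs * sample_rate ** 2


# Illustrative sizes: one million joinable pairs, 0.01% samples of each input.
rows = expected_join_rows(1_000_000, 0.0001)
print(rows)
```

With these assumed numbers the expected output is far below one record, so most test runs would show nothing after the JOIN.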

Page 6: Example Generation for Data Flow Programs

Examples to Illustrate Program

LOAD(user, url)

LOAD(url, pagerank)

TRANSFORM user, canonicalize(url)

JOIN on url

GROUP on user

TRANSFORM user, AVG(pagerank)

FILTER avgPR > 0.5

After LOAD(user, url): (Amy, cnn.com) (Amy, http://www.frogs.com) (Fred, www.snails.com/index.html)

After TRANSFORM user, canonicalize(url): (Amy, www.cnn.com) (Amy, www.frogs.com) (Fred, www.snails.com)

After LOAD(url, pagerank): (www.cnn.com, 0.9) (www.frogs.com, 0.3) (www.snails.com, 0.4)

After JOIN on url: (Amy, www.cnn.com, 0.9) (Amy, www.frogs.com, 0.3) (Fred, www.snails.com, 0.4)

After GROUP on user: (Amy, (Amy, www.cnn.com, 0.9) (Amy, www.frogs.com, 0.3)) (Fred, (Fred, www.snails.com, 0.4))

After TRANSFORM user, AVG(pagerank): (Amy, 0.6) (Fred, 0.4)

After FILTER avgPR > 0.5: (Amy, 0.6)
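
The example program can be sketched in plain Python to trace the records above. This is a minimal illustration rather than Pig Latin; the body of `canonicalize` is a hypothetical stand-in for the UDF (the slides do not show it), chosen only so that it reproduces the three example URLs.

```python
from collections import defaultdict


def canonicalize(url: str) -> str:
    """Hypothetical stand-in for the canonicalize UDF on the slides."""
    if url.startswith("http://"):
        url = url[len("http://"):]
    if url.endswith("/index.html"):
        url = url[: -len("/index.html")]
    if not url.startswith("www."):
        url = "www." + url
    return url


def run_program(visits, pageranks, threshold=0.5):
    # TRANSFORM user, canonicalize(url)
    canonical = [(user, canonicalize(url)) for user, url in visits]
    # JOIN on url
    rank_of = dict(pageranks)
    joined = [(user, url, rank_of[url]) for user, url in canonical if url in rank_of]
    # GROUP on user
    groups = defaultdict(list)
    for user, _url, rank in joined:
        groups[user].append(rank)
    # TRANSFORM user, AVG(pagerank)
    averages = [(user, sum(ranks) / len(ranks)) for user, ranks in groups.items()]
    # FILTER avgPR > 0.5
    return [(user, avg) for user, avg in averages if avg > threshold]


visits = [("Amy", "cnn.com"), ("Amy", "http://www.frogs.com"),
          ("Fred", "www.snails.com/index.html")]
pageranks = [("www.cnn.com", 0.9), ("www.frogs.com", 0.3), ("www.snails.com", 0.4)]
result = run_program(visits, pageranks)
print(result)
```

Only Amy survives the final FILTER, since her average pagerank is about 0.6 while Fred's is 0.4.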

Page 7: Example Generation for Data Flow Programs

Value Addition From Examples

• Examples can be used for

– Debugging

– Understanding a program written by someone else

– Learning a new operator, or language

Page 8: Example Generation for Data Flow Programs

Outline

• Formalization of good examples

• Example Generation Algorithm

• Performance Evaluation

Page 9: Example Generation for Data Flow Programs

Good Examples: Consistency

LOAD(user, url)

LOAD(url, pagerank)

TRANSFORM user, canonicalize(url)

JOIN on url

GROUP on user

TRANSFORM user, AVG(pagerank)

FILTER avgPR > 0.5

0. Consistency: output example = operator applied on input example

(Amy, cnn.com) (Amy, http://www.frogs.com) (Fred, www.snails.com/index.html)

(Amy, www.cnn.com) (Amy, www.frogs.com) (Fred, www.snails.com)
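
The consistency requirement can be phrased as a one-line check. The sketch below is illustrative only: `is_consistent` and `filter_adults` are hypothetical names, and a simple FILTER on age stands in for an arbitrary operator.

```python
def is_consistent(operator, input_example, output_example):
    """True when the output example equals the operator applied to the input example."""
    return operator(input_example) == output_example


def filter_adults(records):
    """Hypothetical operator: FILTER age > 18 over (user, age) records."""
    return [record for record in records if record[1] > 18]


inputs = [("Amy", 20), ("Bill", 17)]
ok = is_consistent(filter_adults, inputs, [("Amy", 20)])
bad = is_consistent(filter_adults, inputs, [("Amy", 20), ("Bill", 17)])
print(ok, bad)
```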

Page 10: Example Generation for Data Flow Programs

Good Examples: Realism

LOAD(user, url)

LOAD(url, pagerank)

TRANSFORM user, canonicalize(url)

JOIN on url

GROUP on user

TRANSFORM user, AVG(pagerank)

FILTER avgPR > 0.5

1. Realism

Formalization: Fraction of examples that are real or are derived from real records

(Amy, cnn.com) (Amy, http://www.frogs.com) (Fred, www.snails.com/index.html)

(Amy, www.cnn.com) (Amy, www.frogs.com) (Fred, www.snails.com)
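
The realism formalization is a simple fraction. In the sketch below each example carries an `is_real` flag; that representation, and returning 0.0 for an empty example set, are assumptions made for illustration.

```python
def realism(examples):
    """Fraction of example records that are real or derived from real records.

    Each example is a (record, is_real) pair; the flag is an assumption
    of this sketch, not something the slides specify.
    """
    if not examples:
        return 0.0  # assumed convention for an empty example set
    return sum(1 for _record, is_real in examples if is_real) / len(examples)


examples = [(("Amy", 20), True), (("Jack", 30), True), (("Bob", 17), False)]
score = realism(examples)
print(score)
```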

Page 11: Example Generation for Data Flow Programs

Good Examples: Completeness

LOAD(user, url)

LOAD(url, pagerank)

TRANSFORM user, canonicalize(url)

JOIN on url

GROUP on user

TRANSFORM user, AVG(pagerank)

FILTER avgPR > 0.5

2. Completeness: Demonstrate the salient properties of each operator, e.g., FILTER

(Amy, 0.6) (Fred, 0.4)

(Amy, 0.6)

Page 12: Example Generation for Data Flow Programs

Good Examples: Completeness

LOAD(user, url)

LOAD(url, pagerank)

TRANSFORM user, canonicalize(url)

JOIN on url

GROUP on user

TRANSFORM user, AVG(pagerank)

FILTER avgPR > 0.5

2. Completeness: Demonstrate the salient properties of each operator, e.g., JOIN

(Amy, www.cnn.com) (Amy, www.frogs.com) (Fred, www.snails.com)

(www.cnn.com, 0.9) (www.frogs.com, 0.3) (www.snails.com, 0.4)

(Amy, www.cnn.com, 0.9) (Amy, www.frogs.com, 0.3) (Fred, www.snails.com, 0.4)

Page 13: Example Generation for Data Flow Programs

Formalizing Completeness

• For any operator, classify input/output example records into equivalence classes.

• Each equivalence class demonstrates one property of the operator.

• Try to have at least one example from each class

Page 14: Example Generation for Data Flow Programs

Equivalence Class Examples

FILTER
– E0: All input records that pass the filter
– E1: All input records that fail the filter

JOIN
– E0: All output records

UNION
– E0: All records belonging to first input
– E1: All records belonging to second input
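
The equivalence classes listed above can be computed directly. The following sketch covers FILTER and UNION; the function names and the dictionary layout are illustrative assumptions.

```python
def filter_classes(records, predicate):
    """Split FILTER input records into E0 (pass) and E1 (fail)."""
    classes = {"E0": [], "E1": []}
    for record in records:
        classes["E0" if predicate(record) else "E1"].append(record)
    return classes


def union_classes(first_input, second_input):
    """E0 holds records from the first input, E1 records from the second."""
    return {"E0": list(first_input), "E1": list(second_input)}


classes = filter_classes(
    [("Amy", 20), ("Fred", 25), ("Bill", 17)],
    lambda record: record[1] > 18,  # FILTER age > 18
)
print(classes)
```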

Page 15: Example Generation for Data Flow Programs

Formalizing Completeness

Operator Completeness: Fraction of equivalence classes that have at least one example record.

Overall Completeness: Average of per-operator completeness.
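
A small sketch of the two completeness measures, assuming each operator's equivalence classes are given as a dictionary from class name to the example records in that class (a representation chosen here for illustration).

```python
def operator_completeness(classes):
    """Fraction of an operator's equivalence classes with at least one example record."""
    return sum(1 for records in classes.values() if records) / len(classes)


def overall_completeness(per_operator_classes):
    """Average of per-operator completeness."""
    scores = [operator_completeness(classes) for classes in per_operator_classes]
    return sum(scores) / len(scores)


filter_op = {"E0": [("Amy", 20)], "E1": []}             # no failing record: 1/2
union_op = {"E0": [("Amy", 20)], "E1": [("Jack", 30)]}  # both classes covered: 2/2
score = overall_completeness([filter_op, union_op])
print(score)
```

Here the FILTER has no example that fails the predicate, so it scores 0.5 and pulls the overall completeness down to 0.75.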

Page 16: Example Generation for Data Flow Programs

Good Examples: Conciseness

LOAD(user, url)

LOAD(url, pagerank)

TRANSFORM user, canonicalize(url)

JOIN on url

GROUP on user

TRANSFORM user, AVG(pagerank)

FILTER avgPR > 0.5

3. Conciseness

Operator Conciseness: (# equivalence classes) / (# example records)

Overall Conciseness: Average of per-operator conciseness

(Amy, cnn.com) (Amy, http://www.frogs.com) (Fred, www.snails.com/index.html)

(Amy, www.cnn.com) (Amy, www.frogs.com) (Fred, www.snails.com)
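
Conciseness as defined on this slide is the ratio of equivalence classes to example records. The sketch below computes it per operator and overall; the function names and the rejection of an empty example set are assumptions of the sketch.

```python
def operator_conciseness(num_classes, num_records):
    """Ratio of an operator's equivalence classes to its example records."""
    if num_records <= 0:
        raise ValueError("need at least one example record")  # assumed convention
    return num_classes / num_records


def overall_conciseness(per_operator):
    """Average of per-operator conciseness over (num_classes, num_records) pairs."""
    scores = [operator_conciseness(c, r) for c, r in per_operator]
    return sum(scores) / len(scores)


# Illustrative counts: a FILTER with 2 classes shown by 4 records,
# and a UNION with 2 classes shown by 2 records.
score = overall_conciseness([(2, 4), (2, 2)])
print(score)
```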

Page 17: Example Generation for Data Flow Programs

Outline

• Formalization of good examples

• Example Generation Algorithm

• Performance Evaluation

Page 18: Example Generation for Data Flow Programs

Related Work

• Related Areas
– Reverse Query Processing
– Database Testing
– Software and Hardware Verification

• Differences
– Realism not a concern
– Notion of conciseness is different
– Intermediate result size is immaterial

Page 19: Example Generation for Data Flow Programs

Strawman I: Downstream Propagation

Take some portion of input data and run the program over it.

1. Realism

2. Completeness

3. Conciseness

Page 20: Example Generation for Data Flow Programs

Strawman II: Upstream Propagation

Start from what output is desired, and work backwards

1. Realism

2. Completeness

3. Conciseness

Page 21: Example Generation for Data Flow Programs

Our Algorithm

Algorithm Passes: 1. Downstream 2. Pruning 3. Upstream 4. Pruning

Page 22: Example Generation for Data Flow Programs

Our Algorithm

Take a subset of input and propagate through the program.

LOAD(user, age)

FILTER udf(user)

LOAD(user, age)

UNION

FILTER age > 18

(Amy, 20) (Fred, 25)

(Amy, 20) (Fred, 25)

(Jack, 30)

(Amy, 20) (Fred, 25) (Jack, 30)

(Amy, 20) (Fred, 25) (Jack, 30)

Algorithm Passes: 1. Downstream 2. Pruning 3. Upstream 4. Pruning
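
The downstream pass on this slide can be sketched as plain forward evaluation over a small input subset. The UDF body is unknown, so `accepts_all` below is a hypothetical placeholder that happens to pass every sampled user, matching the records shown.

```python
def downstream_pass(sample_a, sample_b, udf):
    """Propagate two small input samples through the example program."""
    after_udf = [record for record in sample_a if udf(record[0])]   # FILTER udf(user)
    after_union = after_udf + list(sample_b)                        # UNION
    after_age = [record for record in after_union if record[1] > 18]  # FILTER age > 18
    return {
        "FILTER udf(user)": after_udf,
        "UNION": after_union,
        "FILTER age > 18": after_age,
    }


def accepts_all(user):
    """Hypothetical UDF; the real predicate is not shown on the slides."""
    return True


tables = downstream_pass([("Amy", 20), ("Fred", 25)], [("Jack", 30)], accepts_all)
for operator, records in tables.items():
    print(operator, records)
```

Every sampled record passes both filters, so this pass alone yields no example of a record failing a FILTER.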

Page 23: Example Generation for Data Flow Programs

Our Algorithm

LOAD(user, age)

FILTER udf(user)

LOAD(user, age)

UNION

FILTER age > 18

(Amy, 20) (Fred, 25)

(Amy, 20) (Fred, 25)

(Jack, 30)

(Amy, 20) (Fred, 25) (Jack, 30)

(Amy, 20) (Fred, 25) (Jack, 30)

Prune redundant examples, i.e., improve conciseness without hurting completeness.

Algorithm Passes: 1. Downstream 2. Pruning 3. Upstream 4. Pruning

Page 24: Example Generation for Data Flow Programs

Our Algorithm

LOAD(user, age)

FILTER udf(user)

LOAD(user, age)

UNION

FILTER age > 18

(Amy, 20) (Fred, 25)

(Jack, 30)

(Amy, 20) (Fred, 25) (Jack, 30)

Prune redundant examples, i.e., improve conciseness without hurting completeness.

Algorithm Passes: 1. Downstream 2. Pruning 3. Upstream 4. Pruning

(Amy, 20) (Fred, 25)

(Amy, 20) (Fred, 25) (Jack, 30)

Page 25: Example Generation for Data Flow Programs

Our Algorithm

LOAD(user, age)

FILTER udf(user)

LOAD(user, age)

UNION

FILTER age > 18

(Amy, 20)

(Jack, 30)

(Amy, 20) (Amy, 20) (Jack, 30)

(Amy, 20) (Jack, 30)

Prune redundant examples, i.e., improve conciseness without hurting completeness.

Algorithm Passes: 1. Downstream 2. Pruning 3. Upstream 4. Pruning

Page 26: Example Generation for Data Flow Programs

Formalization of Pruning

Example Records → Elements

Equivalence Classes → Sets

Pick minimum # records to cover every equivalence class → Set-Cover Problem

• More involved because completeness of other operators must be maintained; details in paper
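
To make the set-cover view concrete, here is a sketch using the standard greedy set-cover heuristic. This is a swapped-in textbook technique, not necessarily the procedure in the paper, and it ignores the cross-operator constraint mentioned in the bullet above. The class labels such as `"UNION.E0"` are illustrative names.

```python
def greedy_prune(record_classes):
    """Keep a small set of records that still covers every equivalence class.

    record_classes maps each example record to the set of equivalence
    classes it demonstrates. Greedy set cover: repeatedly keep the record
    that covers the most classes not yet covered.
    """
    uncovered = set().union(*record_classes.values())
    kept = []
    while uncovered:
        best = max(record_classes, key=lambda r: len(record_classes[r] & uncovered))
        kept.append(best)
        uncovered -= record_classes[best]
    return kept


record_classes = {
    ("Amy", 20): {"UNION.E0", "FILTER.E0"},
    ("Fred", 25): {"UNION.E0", "FILTER.E0"},
    ("Jack", 30): {"UNION.E1", "FILTER.E0"},
}
kept = greedy_prune(record_classes)
print(kept)
```

With these assumed class assignments, Fred is redundant with Amy, so only Amy and Jack are kept, as in the pruned example on the preceding slide.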

Page 27: Example Generation for Data Flow Programs

Our Algorithm

LOAD(user, age)

FILTER udf(user)

LOAD(user, age)

UNION

FILTER age > 18

(Amy, 20)

(Jack, 30)

(Amy, 20) (Amy, 20) (Jack, 30)

(Amy, 20) (Jack, 30)

Enhance completeness by inserting constraint records (best effort; details in paper)

Algorithm Passes: 1. Downstream 2. Pruning 3. Upstream 4. Pruning

Page 28: Example Generation for Data Flow Programs

Our Algorithm

LOAD(user, age)

FILTER udf(user)

LOAD(user, age)

UNION

FILTER age > 18

(Amy, 20)

(Jack, 30)

(Amy, 20) (Amy, 20) (Jack, 30)

(Amy, 20) (Jack, 30)

(--, 17)

Enhance completeness by inserting constraint records (best effort; details in paper)

Algorithm Passes: 1. Downstream 2. Pruning 3. Upstream 4. Pruning

Page 29: Example Generation for Data Flow Programs

Our Algorithm

LOAD(user, age)

FILTER udf(user)

LOAD(user, age)

UNION

FILTER age > 18

(Amy, 20) (Bill, 17)

(Jack, 30) (--, 17)

(Amy, 20) (--, 17)

(Amy, 20) (Jack, 30)

(Amy, 20) (Jack, 30) (--, 17)

Enhance completeness by inserting constraint records (best effort; details in paper)

Algorithm Passes: 1. Downstream 2. Pruning 3. Upstream 4. Pruning

Page 30: Example Generation for Data Flow Programs

Our Algorithm

LOAD(user, age)

FILTER udf(user)

LOAD(user, age)

UNION

FILTER age > 18

(Amy, 20) (Bill, 17)

(Jack, 30) (Bob, 17)

(Amy, 20) (Bill, 17)

(Amy, 20) (Jack, 30)

(Amy, 20) (Jack, 30) (Bill, 17) (Bob, 17)

Enhance completeness by inserting constraint records (best effort; details in paper)

Algorithm Passes: 1. Downstream 2. Pruning 3. Upstream 4. Pruning
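
The upstream pass shown on these slides first inserts a constraint record such as (--, 17) and then fills in the unconstrained field. The sketch below mimics that for the FILTER age > 18 operator only. All names are hypothetical, `None` stands in for the "--" placeholder, and choosing `threshold - 1` is an assumption (any age of 18 or less would fail the filter).

```python
def failing_record_for_age_filter(threshold=18):
    """Constraint record that fails FILTER age > threshold; the user is a don't-care."""
    return (None, threshold - 1)


def fill_dont_cares(record, fallback_user):
    """Replace the unconstrained user field with a concrete value."""
    user, age = record
    return (fallback_user if user is None else user, age)


constraint = failing_record_for_age_filter()    # corresponds to (--, 17)
concrete = fill_dont_cares(constraint, "Bill")  # corresponds to (Bill, 17)
print(constraint, concrete)
```

How the don't-care value is actually chosen is left to the paper; the fallback name here is only for illustration.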

Page 31: Example Generation for Data Flow Programs

Our Algorithm

LOAD(user, age)

FILTER udf(user)

LOAD(user, age)

UNION

FILTER age > 18

(Amy, 20) (Bill, 17)

(Jack, 30) (Bob, 17)

(Amy, 20) (Bill, 17)

(Amy, 20) (Jack, 30)

(Amy, 20) (Jack, 30) (Bill, 17) (Bob, 17)

Prune redundant examples (as in Pass 2). Favor real examples over synthetic ones.

Algorithm Passes: 1. Downstream 2. Pruning 3. Upstream 4. Pruning


Page 33: Example Generation for Data Flow Programs

Our Algorithm

LOAD(user, age)

FILTER udf(user)

LOAD(user, age)

UNION

FILTER age > 18

(Bill, 17)

(Jack, 30)

(Bill, 17) (Jack, 30) (Jack, 30)

(Bill, 17)

Prune redundant examples (as in Pass 2). Favor real examples over synthetic ones.

Algorithm Passes: 1. Downstream 2. Pruning 3. Upstream 4. Pruning
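
One way to sketch "favor real examples over synthetic ones" is to add a tie-break to a greedy cover: when two records cover equally many uncovered classes, prefer the real one. This is an illustrative heuristic, not the paper's procedure. The class labels and the real/synthetic flags below are assumed for the example; the slides do not say which records are real.

```python
def prune_favoring_real(candidates):
    """Greedy cover over (record, classes_covered, is_real) triples.

    Records covering more uncovered classes win; on ties, a real record
    (is_real=True) beats a synthetic one.
    """
    uncovered = set().union(*(classes for _record, classes, _real in candidates))
    kept = []
    while uncovered:
        record, classes, _real = max(
            candidates, key=lambda c: (len(c[1] & uncovered), c[2])
        )
        kept.append(record)
        uncovered -= classes
    return kept


candidates = [
    (("Bob", 17), {"FILTER age.E1"}, False),   # assumed synthetic
    (("Bill", 17), {"FILTER age.E1"}, True),   # assumed real
    (("Jack", 30), {"FILTER age.E0"}, True),   # assumed real
]
kept = prune_favoring_real(candidates)
print(kept)
```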

Page 34: Example Generation for Data Flow Programs

Implementation Status

• Available as ILLUSTRATE command in open-source release of Pig

• Available as Eclipse Plugin (PigPen)

Page 35: Example Generation for Data Flow Programs

PigPen Snapshot

Page 36: Example Generation for Data Flow Programs

Performance Evaluation

Program I: (Web Search Result Viewing Statistics)

– LOAD

– FILTER by compound arithmetic expression

– GROUP

– TRANSFORM using built-in aggregate function

Page 37: Example Generation for Data Flow Programs

Performance on Program I

[Bar chart: realism, conciseness, and completeness (scale 0 to 1) for downstream, upstream, and our algorithm]

Page 38: Example Generation for Data Flow Programs

Performance Evaluation

Program II: (Web Advertising Activity)

– LOAD table A

– FILTER A by compound logical expression

– JOIN with table B (highly selective)

– TRANSFORM using 4 string manipulation UDFs (non-invertible)

Page 39: Example Generation for Data Flow Programs

Performance on Program II

[Bar chart: realism, conciseness, and completeness (scale 0 to 1) for downstream, upstream, and our algorithm]

Page 40: Example Generation for Data Flow Programs

Running Time

[Bar chart: running time in seconds (scale 0 to 4) on programs P1 through P8 for downstream, upstream, and our algorithm]

Page 41: Example Generation for Data Flow Programs

Conclusions

• Writing dataflow programs is an iterative process.

• Actual dataset too large for test runs.

• Our algorithm can automatically generate examples that illustrate the program through:
– Realism
– Conciseness
– Completeness
