Chris Olston, Shubham Chopra, Utkarsh Srivastava
Generating Example Data For Dataflow Programs
Yahoo! Research
Data Processing Renaissance
Lots of data (TBs/day at Yahoo!)
Lots of queries and programs to analyze that data
New data flow languages: Map-Reduce, Pig Latin, Dryad
Other data flow systems: Aurora, Tioga, River
Example Dataflow Program
LOAD (user, url)
LOAD (url, pagerank)
TRANSFORM user, canonicalize(url)
JOIN on url
GROUP on user
TRANSFORM user, AVG(pagerank)
FILTER avgPR > 0.5
Find users that tend to visit high-pagerank pages
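A minimal Python sketch of this dataflow, run on three made-up visit records. The `canonicalize` body is a hypothetical stand-in for the UDF (the slides do not define it), and the dictionary-based join assumes each url appears once in the pagerank table.

```python
from collections import defaultdict

def canonicalize(url):
    # Hypothetical stand-in for the UDF: drop scheme and path, force a "www." prefix.
    host = url.split("://")[-1].split("/")[0]
    return host if host.startswith("www.") else "www." + host

# LOAD (user, url) and LOAD (url, pagerank): illustrative records only
visits = [("Amy", "cnn.com"), ("Amy", "http://www.frogs.com"),
          ("Fred", "www.snails.com/index.html")]
pageranks = [("www.cnn.com", 0.9), ("www.frogs.com", 0.3), ("www.snails.com", 0.4)]

# TRANSFORM user, canonicalize(url)
canonical = [(user, canonicalize(url)) for user, url in visits]

# JOIN on url (assumes url is unique in pageranks)
rank_of = dict(pageranks)
joined = [(user, url, rank_of[url]) for user, url in canonical if url in rank_of]

# GROUP on user
groups = defaultdict(list)
for user, url, pagerank in joined:
    groups[user].append(pagerank)

# TRANSFORM user, AVG(pagerank)
avg_pr = {user: sum(ranks) / len(ranks) for user, ranks in groups.items()}

# FILTER avgPR > 0.5
result = {user: avg for user, avg in avg_pr.items() if avg > 0.5}
print(result)
```

Only Amy survives the final filter here; Fred's single visit averages 0.4.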
Iterative Process
No output. Possible causes:
• Bug in UDF canonicalize?
• Joining on the right attribute?
• Everything being filtered out?
How to do test runs?
• Run with real data
– Too inefficient (TBs of data)
• Create smaller data sets (e.g., by sampling)
– Empty results due to joins [Chaudhuri et al. 99] and selective filters
• Biased sampling for joins
– Indexes not always present
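To see why independent sampling tends to produce empty join results, here is a small Python sketch. The table sizes, the 0.1% sampling rate, and all record values are assumptions chosen for illustration; the seed is fixed only to make the run repeatable.

```python
import random

rng = random.Random(0)  # fixed seed, illustration only
n = 100_000

# Synthetic stand-ins for the two inputs; every visit has exactly one matching pagerank row.
visits = [(f"user{i}", f"site{i}.com") for i in range(n)]
pageranks = [(f"site{i}.com", rng.random()) for i in range(n)]

def sample(records, rate):
    # Independent Bernoulli sample: keep each record with probability `rate`.
    return [r for r in records if rng.random() < rate]

def join_on_url(visit_rows, pagerank_rows):
    rank_of = dict(pagerank_rows)
    return [(user, url, rank_of[url]) for user, url in visit_rows if url in rank_of]

rate = 0.001
visits_sample = sample(visits, rate)
ranks_sample = sample(pageranks, rate)
small_join = join_on_url(visits_sample, ranks_sample)

# A matching pair survives only if both sides are sampled: probability rate * rate.
expected_matches = n * rate * rate
print(len(visits_sample), len(ranks_sample), len(small_join), expected_matches)
```

With these numbers each sample holds roughly 100 records, yet the expected number of joined records is about 0.1, so the join output is usually empty.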
Examples to Illustrate Program
LOAD (user, url): (Amy, cnn.com) (Amy, http://www.frogs.com) (Fred, www.snails.com/index.html)
LOAD (url, pagerank): (www.cnn.com, 0.9) (www.frogs.com, 0.3) (www.snails.com, 0.4)
TRANSFORM user, canonicalize(url): (Amy, www.cnn.com) (Amy, www.frogs.com) (Fred, www.snails.com)
JOIN on url: (Amy, www.cnn.com, 0.9) (Amy, www.frogs.com, 0.3) (Fred, www.snails.com, 0.4)
GROUP on user: (Amy, {(Amy, www.cnn.com, 0.9), (Amy, www.frogs.com, 0.3)}) (Fred, {(Fred, www.snails.com, 0.4)})
TRANSFORM user, AVG(pagerank): (Amy, 0.6) (Fred, 0.4)
FILTER avgPR > 0.5: (Amy, 0.6)
Value Addition From Examples
• Examples can be used for
– Debugging
– Understanding a program written by someone else
– Learning a new operator, or language
Outline
• Formalization of good examples
• Example Generation Algorithm
• Performance Evaluation
Good Examples: Consistency
0. Consistency
Output example = operator applied on input example, e.g., TRANSFORM user, canonicalize(url):
Input: (Amy, cnn.com) (Amy, http://www.frogs.com) (Fred, www.snails.com/index.html)
Output: (Amy, www.cnn.com) (Amy, www.frogs.com) (Fred, www.snails.com)
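The consistency requirement can be checked mechanically. The sketch below assumes a hypothetical `canonicalize` implementation (the slides do not define the UDF) and a helper name, `is_consistent`, introduced here for illustration.

```python
def canonicalize(url):
    # Hypothetical stand-in for the UDF: drop scheme and path, force a "www." prefix.
    host = url.split("://")[-1].split("/")[0]
    return host if host.startswith("www.") else "www." + host

def transform(records):
    # TRANSFORM user, canonicalize(url)
    return [(user, canonicalize(url)) for user, url in records]

def is_consistent(operator, input_example, output_example):
    # Consistent when re-applying the operator to the input example
    # reproduces the output example exactly.
    return operator(input_example) == output_example

input_example = [("Amy", "cnn.com"), ("Amy", "http://www.frogs.com"),
                 ("Fred", "www.snails.com/index.html")]
output_example = [("Amy", "www.cnn.com"), ("Amy", "www.frogs.com"),
                  ("Fred", "www.snails.com")]

consistent = is_consistent(transform, input_example, output_example)
print(consistent)
```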
Good Examples: Realism
1. Realism
Formalization: Fraction of examples that are real or are derived from real records
LOAD (user, url): (Amy, cnn.com) (Amy, http://www.frogs.com) (Fred, www.snails.com/index.html)
TRANSFORM user, canonicalize(url): (Amy, www.cnn.com) (Amy, www.frogs.com) (Fred, www.snails.com)
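One way to compute this fraction is to tag each example record with whether it was fabricated. The `Example` class, its `synthetic` flag, and the convention that an empty example set scores 1.0 are assumptions made for this sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Example:
    values: tuple
    synthetic: bool = False  # True if the record (or an ancestor of it) was fabricated

def realism(examples):
    # Fraction of examples that are real or derived from real records.
    if not examples:
        return 1.0  # assumption: nothing unrealistic in an empty set
    real = sum(1 for e in examples if not e.synthetic)
    return real / len(examples)

examples = [
    Example(("Amy", 20)),
    Example(("Jack", 30)),
    Example((None, 17), synthetic=True),  # a fabricated record
]
score = realism(examples)
print(score)
```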
Good Examples: Completeness
2. Completeness
Demonstrate the salient properties of each operator, e.g., FILTER avgPR > 0.5:
Input: (Amy, 0.6) (Fred, 0.4)
Output: (Amy, 0.6)
Good Examples: Completeness
2. Completeness
Demonstrate the salient properties of each operator, e.g., JOIN on url:
Input 1: (Amy, www.cnn.com) (Amy, www.frogs.com) (Fred, www.snails.com)
Input 2: (www.cnn.com, 0.9) (www.frogs.com, 0.3) (www.snails.com, 0.4)
Output: (Amy, www.cnn.com, 0.9) (Amy, www.frogs.com, 0.3) (Fred, www.snails.com, 0.4)
Formalizing Completeness
• For any operator, classify input/output example records into equivalence classes.
• Each equivalence class demonstrates one property of the operator.
• Try to have at least one example from each class
Equivalence Class Examples
• FILTER
– E0: All input records that pass the filter
– E1: All input records that fail the filter
• JOIN
– E0: All output records
• UNION
– E0: All records belonging to first input
– E1: All records belonging to second input
Formalizing Completeness
Operator Completeness: Fraction of equivalence classes that have at least one example record.
Overall Completeness: Average of per-operator completeness.
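The two definitions can be sketched in Python as follows. The function names and the dictionary representation of equivalence classes are assumptions for illustration; the class definitions follow the FILTER and UNION examples on the previous slide.

```python
def filter_classes(inputs, predicate):
    # FILTER: E0 = inputs that pass, E1 = inputs that fail.
    return {"E0": [r for r in inputs if predicate(r)],
            "E1": [r for r in inputs if not predicate(r)]}

def union_classes(first, second):
    # UNION: E0 = records from the first input, E1 = records from the second.
    return {"E0": list(first), "E1": list(second)}

def operator_completeness(classes):
    # Fraction of equivalence classes with at least one example record.
    covered = sum(1 for members in classes.values() if members)
    return covered / len(classes)

def overall_completeness(per_operator):
    # Average of per-operator completeness.
    return sum(per_operator) / len(per_operator)

filter_score = operator_completeness(
    filter_classes([("Amy", 0.6), ("Fred", 0.4)], lambda r: r[1] > 0.5))
union_score = operator_completeness(union_classes([("Amy", 20)], []))
overall = overall_completeness([filter_score, union_score])
print(filter_score, union_score, overall)
```

Here the FILTER examples cover both classes (1.0), while the UNION examples show no record from the second input (0.5), giving an overall completeness of 0.75.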
Good Examples: Conciseness
3. Conciseness
Operator Conciseness: (# equivalence classes) / (# example records)
Overall Conciseness: Average of per-operator conciseness
(Amy, cnn.com) (Amy, http://www.frogs.com) (Fred, www.snails.com/index.html)
(Amy, www.cnn.com) (Amy, www.frogs.com) (Fred, www.snails.com)
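A direct Python reading of the conciseness ratio, as a sketch. Which records count toward an operator's example records is not spelled out on the slide; here the caller passes them in, and raising an error for an empty list is an assumption.

```python
def operator_conciseness(num_equivalence_classes, example_records):
    # Ratio of the operator's equivalence classes to its example records.
    if not example_records:
        raise ValueError("operator has no example records")  # assumption
    return num_equivalence_classes / len(example_records)

def overall_conciseness(per_operator):
    # Average of per-operator conciseness.
    return sum(per_operator) / len(per_operator)

# FILTER has two classes (pass, fail). Three input examples are more than needed.
before = operator_conciseness(2, [("Amy", 20), ("Fred", 25), ("Bill", 17)])
# Dropping one redundant passing record makes the examples maximally concise.
after = operator_conciseness(2, [("Amy", 20), ("Bill", 17)])
print(before, after, overall_conciseness([before, after]))
```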
Related Work
• Related areas
– Reverse Query Processing
– Database Testing
– Software and Hardware Verification
• Differences
– Realism not a concern
– Notion of conciseness is different
– Intermediate result size is immaterial
Strawman I: Downstream Propagation
Take some portion of input data and run the program over it.
1. Realism
2. Completeness
3. Conciseness
Strawman II: Upstream Propagation
Start from what output is desired, and work backwards
1. Realism
2. Completeness
3. Conciseness
Our Algorithm
Algorithm Passes
1. Downstream
2. Pruning
3. Upstream
4. Pruning
Our Algorithm
Pass 1 (Downstream): Take a subset of the input and propagate it through the program.
LOAD (user, age): (Amy, 20) (Fred, 25)
FILTER udf(user): (Amy, 20) (Fred, 25)
LOAD (user, age): (Jack, 30)
UNION: (Amy, 20) (Fred, 25) (Jack, 30)
FILTER age > 18: (Amy, 20) (Fred, 25) (Jack, 30)
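The downstream pass on this small program can be sketched in Python. The operator order (UNION feeding FILTER age > 18) is inferred from the example records on the slide, and `udf` is a hypothetical placeholder that accepts every user, since the slides do not define it.

```python
def udf(user):
    # Hypothetical UDF: not defined in the slides; here every user passes.
    return True

def downstream(load1, load2):
    # Propagate example inputs through:
    # LOAD -> FILTER udf(user), LOAD, UNION, FILTER age > 18
    filter_udf = [r for r in load1 if udf(r[0])]
    union = filter_udf + load2
    filter_age = [r for r in union if r[1] > 18]
    return {
        "LOAD 1": load1,
        "FILTER udf(user)": filter_udf,
        "LOAD 2": load2,
        "UNION": union,
        "FILTER age > 18": filter_age,
    }

examples = downstream([("Amy", 20), ("Fred", 25)], [("Jack", 30)])
for operator, records in examples.items():
    print(operator, records)
```

Every sampled record passes both filters, so neither FILTER has an example of a failing record after this pass.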
Our Algorithm
Pass 2 (Pruning): Prune redundant examples, i.e., improve conciseness without hurting completeness.
LOAD (user, age): (Amy, 20)
FILTER udf(user): (Amy, 20)
LOAD (user, age): (Jack, 30)
UNION: (Amy, 20) (Jack, 30)
FILTER age > 18: (Amy, 20) (Jack, 30)
Formalization of Pruning
Example records correspond to elements; equivalence classes correspond to sets.
Pick the minimum number of records that covers every equivalence class: a Set-Cover Problem.
• More involved because completeness of other operators must be maintained; details in paper
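As a rough illustration of the set-cover view, the sketch below uses the standard greedy approximation: repeatedly pick the record that appears in the most still-uncovered equivalence classes. This is not the paper's pruning procedure (which must also preserve the completeness of other operators); the class labels and the `greedy_prune` name are made up for the example.

```python
def greedy_prune(classes_of):
    # classes_of maps each example record to the set of equivalence
    # classes it belongs to. Returns a small list of records such that
    # every class contains at least one chosen record.
    uncovered = set().union(*classes_of.values())
    chosen = []
    while uncovered:
        best = max(classes_of, key=lambda r: len(classes_of[r] & uncovered))
        chosen.append(best)
        uncovered -= classes_of[best]
    return chosen

# Hypothetical class membership for the records produced by the downstream pass.
classes_of = {
    ("Amy", 20): {"udf.pass", "union.first", "age.pass"},
    ("Fred", 25): {"udf.pass", "union.first", "age.pass"},
    ("Jack", 30): {"union.second", "age.pass"},
}
kept = greedy_prune(classes_of)
print(kept)
```

Fred is redundant with Amy, so the greedy choice keeps Amy and Jack, matching the pruned example set.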
Our Algorithm
Pass 3 (Upstream): Enhance completeness by inserting constraint records (best effort; details in paper).
LOAD (user, age): (Amy, 20)
FILTER udf(user): (Amy, 20)
LOAD (user, age): (Jack, 30)
UNION: (Amy, 20) (Jack, 30) (--, 17)
FILTER age > 18: (Amy, 20) (Jack, 30)
Our Algorithm
Pass 3 (Upstream): Enhance completeness by inserting constraint records (best effort; details in paper).
LOAD (user, age): (Amy, 20) (Bill, 17)
FILTER udf(user): (Amy, 20) (--, 17)
LOAD (user, age): (Jack, 30) (--, 17)
UNION: (Amy, 20) (Jack, 30) (--, 17)
FILTER age > 18: (Amy, 20) (Jack, 30)
Our Algorithm
Pass 3 (Upstream): Enhance completeness by inserting constraint records (best effort; details in paper).
LOAD (user, age): (Amy, 20) (Bill, 17)
FILTER udf(user): (Amy, 20) (Bill, 17)
LOAD (user, age): (Jack, 30) (Bob, 17)
UNION: (Amy, 20) (Jack, 30) (Bill, 17) (Bob, 17)
FILTER age > 18: (Amy, 20) (Jack, 30)
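A simplified sketch of the upstream idea for FILTER age > 18: build a constraint record that fails the filter, then try to satisfy it with a real record before falling back to a fabricated one. All helper names, the choice of `threshold - 1`, the base tables, and the fallback name are assumptions for illustration; the slides defer the actual procedure to the paper.

```python
def failing_constraint(threshold):
    # For FILTER age > threshold, any age <= threshold fails.
    # None plays the role of "--" (unconstrained field) on the slide.
    return (None, threshold - 1)

def matches(record, constraint):
    return all(c is None or c == v for c, v in zip(constraint, record))

def realize(constraint, base_table, fallback_user="Bob"):
    # Prefer a real record that satisfies the constraint.
    for record in base_table:
        if matches(record, constraint):
            return record, True          # real record
    # Otherwise fabricate one (hypothetical user name).
    return (fallback_user, constraint[1]), False

constraint = failing_constraint(18)
real_hit = realize(constraint, [("Amy", 20), ("Bill", 17), ("Carl", 40)])
synthetic_hit = realize(constraint, [("Jack", 30)])
print(constraint, real_hit, synthetic_hit)
```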
Our Algorithm
Pass 4 (Pruning): Prune redundant examples (as in Pass 2). Favor real examples over synthetic ones.
LOAD (user, age): (Bill, 17)
FILTER udf(user): (Bill, 17)
LOAD (user, age): (Jack, 30)
UNION: (Bill, 17) (Jack, 30)
FILTER age > 18: (Jack, 30)
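For example sets this small, the final pruning can be illustrated by exhaustive search: find the smallest subsets that still hit every equivalence class, then prefer the one with the fewest synthetic records. The class labels, the function name, and the assumption that (Bob, 17) is synthetic while (Bill, 17) is real are all hypothetical; the slides do not say which records were fabricated.

```python
from itertools import combinations

def min_cover_prefer_real(classes_of, synthetic):
    # classes_of: record -> set of equivalence classes it belongs to.
    # synthetic: set of records that were fabricated.
    universe = set().union(*classes_of.values())
    records = list(classes_of)
    for size in range(1, len(records) + 1):
        covers = [combo for combo in combinations(records, size)
                  if set().union(*(classes_of[r] for r in combo)) == universe]
        if covers:
            # Among the smallest covers, favor real examples.
            return min(covers, key=lambda combo: sum(r in synthetic for r in combo))
    return ()

classes_of = {
    ("Amy", 20): {"udf.pass", "union.first", "age.pass"},
    ("Bill", 17): {"udf.pass", "union.first", "age.fail"},
    ("Jack", 30): {"union.second", "age.pass"},
    ("Bob", 17): {"union.second", "age.fail"},
}
kept = min_cover_prefer_real(classes_of, synthetic={("Bob", 17)})
print(kept)
```

Both {Amy, Bob} and {Bill, Jack} cover every class with two records; under the stated assumption the all-real pair Bill and Jack is kept.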
Implementation Status
• Available as ILLUSTRATE command in open-source release of Pig
• Available as Eclipse Plugin (PigPen)
PigPen Snapshot
Performance Evaluation
Program I: (Web Search Result Viewing Statistics)
– LOAD
– FILTER by compound arithmetic expression
– GROUP
– TRANSFORM using built-in aggregate function
Performance on Program I
[Bar chart: realism, conciseness, and completeness (scale 0 to 1) for downstream, upstream, and our algorithm]
Performance Evaluation
Program II: (Web Advertising Activity)
– LOAD table A
– FILTER A by compound logical expression
– JOIN with table B (highly selective)
– TRANSFORM using 4 string manipulation UDFs (non-invertible)
Performance on Program II
[Bar chart: realism, conciseness, and completeness (scale 0 to 1) for downstream, upstream, and our algorithm]
Running Time
[Bar chart: running time in seconds (scale 0 to 4) on programs P1 to P8 for downstream, upstream, and our algorithm]
Conclusions
• Writing dataflow programs is an iterative process.
• Actual dataset too large for test runs.
• Our algorithm can automatically generate examples that illustrate the program through:
– Realism
– Conciseness
– Completeness