21
SkewReduce YongChul Kwon Magdalena Balazinska, Bill Howe, Jerome Rolia* University of Washington, *HP Labs Skew-Resistant Parallel Processing of Feature-Extracting Scientific User-Defined Functions Published in SoCC 2010

SkewReduce YongChul Kwon Magdalena Balazinska, Bill Howe, Jerome Rolia* University of Washington, *HP Labs Skew-Resistant Parallel Processing of Feature-Extracting

Embed Size (px)

Citation preview

  • Slide 1

Slide 2 SkewReduce YongChul Kwon Magdalena Balazinska, Bill Howe, Jerome Rolia* University of Washington, *HP Labs Skew-Resistant Parallel Processing of Feature-Extracting Scientific User-Defined Functions Published in SoCC 2010 Slide 3 Motivation Science is becoming a data management problem MapReduce is an attractive solution Simple API, declarative layer, seamless scalability, But it is hard to Express complex algorithms and Get good performance (14 hours vs. 70 minutes) SkewReduce: Goal: Scalable analysis with minimal effort Toward scalable feature extraction analysis 2 Slide 4 Application 1: Extracting Celestial Objects Input { (x,y,ir,r,g,b,uv,) } Coordinates Light intensities Output List of celestial objects Star Galaxy Planet Asteroid M34 from Sloan Digital Sky Survey 3 Slide 5 Application 2: Friends of Friends Simple clustering algorithm used in astronomy Input: Points in multi-dimensional space Output: List of clusters (e.g., galaxy, ) Original data annotated with cluster ID Friends Two points within distance Friends of Friends Transitive closure of Friends relation 4 Slide 6 Parallel Friends of Friends Partition Local clustering Merge P1-P2 P3-P4 Merge P1-P2-P3-P4 Finalize Annotate original data P2P4 P3P1 C1 C2 C3 C4 C5 C6 C5 C3 C6 C3 C4C3C6 C5 5 Slide 7 Parallel Feature Extraction Partition multi-dimensional input data Extract features from each partition Merge (or reconcile) features Finalize output Features INPUT DATA Map Hierarchical Reduce 6 Slide 8 Skew Local Clustering (MAP) Merge (REDUCE) Problem: Skew The top red line runs for 1.5 hours 5 minutes Time Task ID 35 minutes 7 8 node cluster, 32map/32reduce slots Slide 9 Unbalanced Computation: Skew Computation skew Characteristics of an algorithm Same amount of input data != Same runtime O(N log N) ~ O(N 2 ) 0 friends per particleO(N) friends per particle 8 Can we scale out off-the-shelf implementation without (or minimal) modifications? Slide 10 Solution 1? Micro partition Assign tiny amount of work to each task to reduce skew 9 Slide 11 How about having micro partitions? It works! Framework overhead! To find sweet spot, need to try different granularities! Can we find a good partitioning plan without trial and error? 10 8 node cluster, 32map/32reduce slots Slide 12 Outline Motivation SkewReduce API (in the paper) Partition Optimization Evaluation Summary 11 Slide 13 Partition Optimization Varying granularities of partitions Can we automatically find a good partition plan and schedule? Serial Feature Extraction Algorithm Merge Algorithm 1 2 13 14 15 5 6 9 3 4 12 7 8 10 11 12 Slide 14 Approach Sample SkewReduce Optimizer 1 2 13 14 15 5 6 9 3 4 12 7 8 10 11 Cluster configuration Cluster configuration Cost functions Goal: minimize expected total runtime SkewReduce runtime plan Bounding boxes for data partitions Schedule Runtime Plan 13 Slide 15 Partition Plan Guided By Cost Functions Given sample, how long will it take to process? Two cost functions: Feature cost: (Bounding box, sample, sample rate) cost Merge cost:(Bounding boxes, sample, sample rate) cost Basis of two optimization decisions How (axis, point) to split a partition When to stop partitioning 14 Slide 16 Search Partition Plan Greedy top-down search Split if total expected runtime improves Evaluate costs for subpartitions and merge Estimate new runtime 100 Original 1 2 3 50 10 Possible Split 2 1 3 1 3 2 Schedule 2 = 110 Schedule 1 = 60 REJECT ACCEPT 15 Time Slide 17 Summary of Contributions Given a feature extraction application Possibly with computation skew SkewReduce Automatically partitions input data Improves runtime in spite of computation skew Key technique: user-defined cost functions 16 Slide 18 Evaluation 8 node cluster Dual quad core CPU, 16 GB RAM Hadoop 0.20.1 + custom patch in MapReduce API Distributed Friends of Friends Astro: Gravitational simulation snapshot 900 M particles Seaflow: flow cytometry survey 59 M observations 17 Slide 19 Does SkewReduce work? SkewReduce plan yields 2 ~ 8 times faster running time 128 MB16 MB4 MB2 MBManualSkewReduce 14.18.84.15.72.01.6 87.263.177.798.7-14.1 Hours Minutes 18 (1.9 GB, 3D)(18 GB, 3D) MapReduce 1 hour preparation Slide 20 Impact of Cost Function Higher fidelity = Better performance Astro 19 Slide 21 Highlights of Evaluation Sample size Representativeness of sample is important Runtime of SkewReduce optimization Less than 15% of real runtime of SkewReduce plan Data volume in Merge phase Total volume during Merge = 1% of input data Details in the paper 20 Slide 22 Conclusion Scientific analysis should be easy to write, scalable, and have a predictable performance SkewReduce API for feature extracting functions Scalable execution Good performance in spite of skew Cost-based partition optimization using a data sample Published in SoCC 2010 More general version is coming out soon! 21