Big Data Processing Systems Study
Vasiliki Kalavri, EMJD-DC
3 Dec 2012
MapReduce: Simplified Data Processing on Large Clusters
OSDI 2004
MapReduce
● Specify a map and a reduce function
● The system takes care of
○ parallelization
○ partitioning
○ scheduling
○ communication
○ fault-tolerance
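The two user-supplied functions can be sketched as a single-process Python simulation of the classic word-count example (function names here are illustrative, not the Hadoop API):

```python
from collections import defaultdict

def map_fn(line):
    # Emit (word, 1) for every word in one input record.
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # Sum the partial counts collected for one key.
    yield word, sum(counts)

def run_mapreduce(lines, map_fn, reduce_fn):
    # Shuffle: group intermediate pairs by key, as the framework would.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    out = {}
    for key, values in groups.items():
        for k, v in reduce_fn(key, values):
            out[k] = v
    return out

result = run_mapreduce(["to be or", "not to be"], map_fn, reduce_fn)
# result maps each word to its total count
```

In the real system the shuffle step is the distributed part: map outputs are partitioned by key and shipped to the reducers.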
3
Hadoop MapReduce 1.0
4
MapReduce Limitations
● Static Pipeline
● No support for common operations
● Data materialization after every job
● Slow - not fit for interactive analysis
● Complex configuration
5
YARN (MapReduce v.2)
6
What is Hadoop/MR NOT good for?
● All the things it wasn't built for
○ Iterative computations
○ Stream processing
○ Incremental computations
○ Interactive analysis
○ [insert research paper here]
7
Improving Hadoop performance
● Reduce network & disk I/O
● Skewed datasets
● DB-like optimizations
○ column-oriented storage
○ indexes
8
Map-Reduce Inspired Systems
Extending the Programming Model
Map-Reduce Inspired Systems
● Extend the programming model to support
○ Iterative applications
○ Streaming applications
10
Iterative Processing
● Characteristics
○ Datasets already stored
○ Need to reuse a dataset more than once, possibly multiple times
○ Iterative jobs, e.g. estimates, convergence
● Problems with iterative MR applications
○ manual orchestration of several MR jobs
○ re-loading & re-processing of invariant data
○ no explicit way to define a termination condition
11
HaLoop: Efficient Iterative Data Processing on Large Clusters
VLDB 2010
System Overview
13
Programming Model
● Iterative Programming Model
Ri+1 = R0 ∪ (Ri ⋈ L)
● Extensions to MR
○ loop body
○ termination condition
○ loop-invariant data
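A minimal Python sketch of this recurrence, computing transitive closure with an edge set as the loop-invariant dataset L (a single-machine illustration, not HaLoop's actual API):

```python
def haloop_fixpoint(r0, loop_invariant_edges):
    # Ri+1 = R0 ∪ (Ri ⋈ L): join the current result with the
    # loop-invariant dataset L until no new tuples appear.
    r = set(r0)
    while True:
        joined = {(src, dst2)
                  for (src, dst) in r
                  for (dst1, dst2) in loop_invariant_edges
                  if dst == dst1}
        nxt = set(r0) | joined
        if nxt == r:          # termination condition: fixpoint reached
            return r
        r = nxt

edges = {("a", "b"), ("b", "c"), ("c", "d")}
closure = haloop_fixpoint(edges, edges)   # all reachable pairs
```

In HaLoop each iteration of this driver loop would be a MapReduce job, with L cached on the reducers instead of being re-read from HDFS.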
14
Loop-Aware Scheduling
● Inter-Iteration Locality
○ schedule tasks of different iterations that access the same data on the same machines
15
Caching and Indexing
● Reducer Input Cache
○ caches and indexes reducer inputs
○ reduces M->R I/O
● Reducer Output Cache
○ stores and indexes the most recent local reducer outputs
○ reduces termination condition computation cost
● Mapper Input Cache
○ avoids non-local data reads in mappers
16
Stream Processing
● Characteristics
○ Data continuously comes into the system
○ Usually needs to be processed as it arrives
○ Frequent updates
● Problems with streaming MR applications
○ runs on a static snapshot of a dataset
○ computations need to finish
17
Muppet: MapReduce-Style Processing of Fast Data
VLDB 2012
Programming Model
● MapUpdate
○ operates on streams, i.e. sequences of events with the same id in increasing timestamp order
● Slates
○ in-memory data structures that "summarize" all events with key k that an Update function has seen so far
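A minimal sketch of MapUpdate in Python, counting events per key with one slate per key (the names and signatures here are illustrative, not Muppet's actual API):

```python
from collections import defaultdict

slates = defaultdict(int)   # one in-memory slate per key

def map_event(event):
    # Map: turn a raw stream event into (key, value) pairs.
    yield event["retailer"], 1

def update(key, value):
    # Update: fold the new value into the slate for this key;
    # the slate summarizes every event with this key seen so far.
    slates[key] += value
    return slates[key]

stream = [{"retailer": "A"}, {"retailer": "B"}, {"retailer": "A"}]
for event in stream:
    for key, value in map_event(event):
        update(key, value)
```

Unlike a reduce function, update never sees the whole key group at once; it is invoked per event and carries state forward in the slate.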
19
Example Applications
● An application that monitors the Foursquare check-in stream to count the number of check-ins per retailer and displays the count on a Web page
● Detect "hot" topics in Twitter
20
System Overview
● Uses Cassandra to persist slate states
21
Map-Reduce Inspired Systems
Improving Performance
Map-Reduce Inspired Systems
● Improve performance by○ reusing data
○ building caches / indexes
○ DBMS-like optimizations
○ reducing I/O
23
Incoop: MapReduce for Incremental Computations
SOCC 2011
System Overview
25
Inc-HDFS
● Content-based chunking
● Fingerprint calculation
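A simplified sketch of content-based chunking in Python: chunk boundaries are placed where a fingerprint of the last few bytes matches a pattern, so boundaries depend on content rather than fixed offsets, and an edit near the start of a file does not shift every later chunk. The hash and parameters are illustrative, not Inc-HDFS's actual fingerprint function (which is also rolled incrementally rather than recomputed per position):

```python
def content_chunks(data, window=4, mask=0x3F):
    # Cut a chunk whenever the fingerprint of the trailing `window`
    # bytes has its low bits all zero (about one cut per 64 positions).
    chunks, start = [], 0
    for i in range(window, len(data) + 1):
        h = 0
        for b in data[i - window:i]:
            h = (h * 31 + b) & 0xFFFFFFFF   # toy polynomial hash
        if h & mask == 0:
            chunks.append(data[start:i])
            start = i
    if start < len(data):
        chunks.append(data[start:])
    return chunks

data = b"the quick brown fox jumps over the lazy dog " * 8
chunks = content_chunks(data)
```

Each chunk's fingerprint then serves as the key for the memoization described on the next slide.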
26
Incremental MapReduce
● Incremental Map
○ persistently store intermediate results
○ insert a reference into the memoization server
○ query the memoization server and fetch the result if already computed
● Incremental Reduce
○ persistently store entire task computations
○ store and map sub-computations used in the Contraction phase
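The incremental-map idea can be sketched in Python with a dictionary standing in for the memoization server (illustrative only):

```python
memo = {}   # stands in for Incoop's memoization server

def incremental_map(chunks, map_fn):
    # Re-run map only on chunks whose fingerprint is unseen;
    # otherwise fetch the stored intermediate result.
    results, recomputed = [], 0
    for chunk in chunks:
        key = hash(chunk)          # illustrative fingerprint
        if key not in memo:
            memo[key] = map_fn(chunk)
            recomputed += 1
        results.append(memo[key])
    return results, recomputed
```

On a second run where only one chunk changed, only that chunk's map invocation is re-executed.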
27
Contraction Phase
● Break up large Reduce tasks into many applications of the Combine function
● Only a subset of Combiners needs to be re-executed
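A sketch of the contraction idea in Python: an associative Combine function applied as a tree with memoized sub-results, so changing one input re-executes only the combiners on its path to the root (the tree shape and caching scheme here are assumptions, not Incoop's implementation):

```python
combine_cache = {}
calls = {"n": 0}

def combine(values):
    # Combine must be associative so it can be applied in a tree.
    calls["n"] += 1
    return sum(values)

def tree_reduce(leaves):
    # Pair up values level by level, memoizing each Combine
    # application on its exact inputs.
    level = list(leaves)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            pair = tuple(level[i:i + 2])
            if pair not in combine_cache:
                combine_cache[pair] = combine(list(pair))
            nxt.append(combine_cache[pair])
        level = nxt
    return level[0]

total = tree_reduce([1, 2, 3, 4, 5, 6, 7, 8])        # 7 Combine calls
total_after = tree_reduce([9, 2, 3, 4, 5, 6, 7, 8])  # only 3 new calls
```

Changing the first leaf re-runs one combiner per tree level instead of re-reducing all eight inputs.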
28
HAIL: Only Aggressive Elephants are Fast Elephants
VLDB 2012
System Overview
30
Upload Pipeline
● The HDFS upload pipeline is changed so that:
○ the client creates PAX blocks
○ DataNodes do not flush data or checksums to disk
○ after all chunks of a block have been received, the block is sorted in memory and flushed
○ each DataNode computes its own checksums
31
Query Pipeline
Transparency is achieved using UDFs:
● HailInputFormat
○ elaborate splitting policy
○ scheduling takes into account relevant indexes
● HailRecordReader
○ uses user annotations / configuration info to select records for the map phase
○ transforms records from PAX to row format
32
Themis: An I/O-Efficient MapReduce
SOCC 2012
How to limit Disk I/O?
● Process records in memory and spill to disk as rarely as possible
● Relax fault-tolerance guarantees
○ job-level recovery
● Dynamic memory management
○ pluggable policies
● Per-node I/O management
○ organize data in large batches
34
Memory policies
● Pool-based
○ fixed-size pre-allocated buffers
● Quota-based
○ controls dataflow between computational stages using queues
● Constraint-based
○ dynamically adjusts memory allocation based on requests and available memory
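The quota-based policy can be sketched in Python with a bounded queue between two pipeline stages: the upstream stage blocks once its quota of in-flight records is exhausted, which caps the stage's memory footprint (an illustration of the idea, not Themis's implementation):

```python
import queue
import threading

def producer(out_q, items):
    # put() blocks when the quota (queue capacity) is exhausted.
    for item in items:
        out_q.put(item)
    out_q.put(None)          # end-of-stream marker

def consumer(in_q, results):
    while True:
        item = in_q.get()
        if item is None:
            break
        results.append(item * 2)   # stand-in for real stage work

q = queue.Queue(maxsize=4)   # quota: at most 4 in-flight records
results = []
t1 = threading.Thread(target=producer, args=(q, range(10)))
t2 = threading.Thread(target=consumer, args=(q, results))
t1.start(); t2.start(); t1.join(); t2.join()
```

The quota between each pair of stages bounds total memory regardless of how fast each stage runs.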
35
System Overview
Data-flow graph consisting of stages:
● Phase Zero extracts information about the distribution of records and keys
● Phase One implements mapping and shuffling
● Phase Two implements sorting and reduce, always keeping results in memory
36
ReStore: Reusing Results of MapReduce Jobs
VLDB 2012
System Overview
● Built as an extension to Pig
● When a workflow is submitted, ReStore:
○ rewrites the query to reuse stored results
○ stores outputs of the workflow
○ stores results of sub-jobs
○ decides which outputs to store in HDFS and which to delete
38
System Architecture
39
Example
40
MANIMAL: Automatic Optimization for MapReduce Programs
VLDB 2011
Idea
● Apply well-known query optimization techniques to Map-Reduce jobs
● Static analysis of compiled code
● Apply optimizations only when "safe"
42
System Architecture
43
Example Optimizations
● Selection
○ if the map function is a filter, use a B+Tree to scan only the relevant portion of the input
● Projection
○ eliminate unnecessary fields from input records
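A sketch of the selection optimization in Python, with a sorted array plus binary search standing in for the B+Tree (illustrative only, not MANIMAL's generated code):

```python
import bisect

# Stand-in for a B+Tree: records kept sorted on the filtered key.
records = sorted([(3, "c"), (1, "a"), (7, "d"), (5, "b")])
keys = [k for k, _ in records]

def range_scan(lo, hi):
    # If static analysis proves the map function only keeps records
    # with lo <= key <= hi, scan just that slice of the input
    # instead of the whole dataset.
    left = bisect.bisect_left(keys, lo)
    right = bisect.bisect_right(keys, hi)
    return records[left:right]
```

The "safe" check matters: the rewrite is only valid if the analyzer can prove the map function touches no record outside the range.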
44
SkewTune: Mitigating Skew in MapReduce Applications
SIGMOD 2012
Common Types of Skew
● Uneven distribution of input data
○ partitioning that does not guarantee even distribution
○ popular key groups
● Expensive records
○ some portions of the input take longer to process than others
46
System Overview
● Per-task progress estimation
● Per-task statistics
● Late skew detection
○ skew mitigation is delayed until a slot is available
● Only re-partition one task at a time
○ only when half the remaining time exceeds the re-partitioning overhead
47
Implementation
Re-partitioning a map task:
● mitigators execute as mappers within a new MapReduce job
● output is written to HDFS
Re-partitioning a reduce task:
● a mitigator job with an identity map reads input from the task tracker
48
Starfish: A Self-Tuning System for Big Data Analytics
CIDR 2011
System Overview
50
Job-Level Tuning
● Just-in-Time Optimizer
○ chooses efficient execution techniques, e.g. joins
● Profiler
○ learns performance models and job profiles
● Sampler
○ collects statistics about input, intermediate, and output data
○ helps the Profiler build approximate models
51
Workflow-Level Tuning
● Workflow-aware Scheduler
○ exploits data locality at the workflow level instead of making locally optimal decisions
● What-If Engine
○ answers questions based on simulations of job executions
52
Workload-Level Tuning
● Workload Optimizer
○ data-flow sharing
○ materialization of intermediate results for reuse
○ reorganization
● Elastisizer
○ node and network configuration automation
53
Big-Data Processing Beyond MapReduce
Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks
EuroSys 2007
System Overview
56
Graph Description
57
Communication
58
Graph Optimizations
● Schedule vertices close to the input data
● If a computation is associative and commutative, use an aggregation tree
● Dynamically refine the graph based on output data sizes
○ vary the number of vertices in each stage and their connectivity
59
SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets
VLDB 2008
System Overview
61
SCOPE scripting language
● resembles SQL with C# expressions
● commands are data transformation operators
● extensible MapReduce-like commands
62
SCOPE Execution
● The Compiler creates an internal parse tree
● The Optimizer creates a parallel execution plan, i.e. a Cosmos job
● The Job Manager constructs the graph and schedules execution
63
Spark: Cluster Computing with Working Sets
HotCloud 2010
RDDs
● Read-only collections of objects
● partitioned across machines
● store their "lineage"
● can be reconstructed
● users can control persistence and partitioning
65
Programming Model
● Scala API
● driver program
○ defines RDDs and actions on them
● workers
○ long-lived processes
○ store and process RDD partitions in-memory
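A minimal Python sketch of lineage-based RDDs: each transformation records its parent and compute function, so a lost or unpersisted result can be rebuilt by replaying the chain (a toy single-machine illustration, not Spark's Scala API):

```python
class RDD:
    # Each RDD stores its "lineage": the parent RDD and the
    # transformation needed to recompute its data.
    def __init__(self, compute, parent=None):
        self.compute = compute
        self.parent = parent
        self.cache = None

    def map(self, f):
        return RDD(lambda data: [f(x) for x in data], parent=self)

    def filter(self, f):
        return RDD(lambda data: [x for x in data if f(x)], parent=self)

    def collect(self):
        # If this RDD is not cached, recompute it from its lineage.
        if self.cache is None:
            parent_data = self.parent.collect() if self.parent else None
            self.cache = self.compute(parent_data)
        return self.cache

source = RDD(lambda _: list(range(10)))
evens = source.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
```

Dropping `evens.cache` and calling `collect()` again rebuilds the result from the lineage chain, which is how Spark recovers lost partitions without replicating data.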
66
Job Stages
67
Nephele/PACTs: A Programming Model and Execution Framework for Web-Scale Analytical Processing
SoCC 2010
The Stratosphere Stack
69
System Overview
● Execution plan in the form of a DAG
● Abstracts parallelization and communication
● Optimizer chooses the best execution strategy
70
Programming Model
● Input Contracts:
○ give guarantees on how data is organized into independent subsets
○ Map, Reduce, Match, Cross, CoGroup
● Output Contracts:
○ define properties of the output data
○ Same-Key, Super-Key, Unique-Key
71
ASTERIX: A Scalable, Semi-structured Data Platform for Evolving-World Models
Distributed and Parallel Databases 2011
Evolving World Model
● As-of queries○ What is the best route to get to the Olympic
Stadium right now?
○ What is the traffic situation like on Saturday nights close to the city center?
○ How many visitors that visited the City Hall during the past year also went for dinner in that nearby restaurant?
73
Data Model - Query Language
● Semi-structured data model, ADM
○ dataset ~ table: indexed, partitioned, replicated
○ dataverse ~ database
○ DDL: primary key, partitioning key
○ "open" data schemas
● AQL query language
○ declarative, inspired by Jaql and XQuery
○ logical plan -> DAG -> Hyracks job
74
System Overview
75
Dremel: Interactive Analysis of Web-Scale Datasets
VLDB 2010
Columnar Storage
● lossless representation
○ saves field types and repetition/definition levels
● fast encoding
○ recursively traverses records and computes levels
● efficient record assembly
○ uses an FSM to reconstruct records
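A much-simplified Python sketch of definition levels for a flat schema with one optional field: storing (definition level, value) pairs makes the column lossless even when the field is missing. Real Dremel also tracks repetition levels for nested, repeated fields, which this sketch omits:

```python
def stripe(records, field):
    # Columnar encoding of one optional field. For a flat schema the
    # definition level is just 0 (field absent) or 1 (field present).
    column = []
    for rec in records:
        if field in rec:
            column.append((1, rec[field]))
        else:
            column.append((0, None))
    return column

def assemble(column, field):
    # Rebuild the records for this field from the column alone.
    return [{field: v} if d == 1 else {} for d, v in column]

column = stripe([{"name": "a"}, {}, {"name": "b"}], "name")
```

With nesting, the definition level generalizes to "how many optional/repeated fields on the path are defined", which is what makes the encoding lossless for arbitrary record shapes.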
77
Query Execution
● Language based on SQL
● Tree architecture
○ Root server
■ receives incoming queries
■ reads table metadata
■ routes queries to the next level of the tree
○ Leaf servers
■ communicate with the storage layer
78
Query Dispatcher
● Schedules queries to available slots
● Balances the load
● Assures fault-tolerance
● Specifies the percentage of tablets that must be scanned before returning a result
79
CIEL: A Universal Execution Engine for Distributed Data-flow Computing
NSDI 2011
Dynamic Task Graph
81
System Architecture
82
Skywriting Language
● Turing-complete
● Arbitrary data-dependent control flow
○ while loops
○ recursive functions
● Supports invocation of code written in other languages
83
References
www.citeulike.org/user/vasiakalavri
84