2. Acknowledgements
Work on Deformable Mesh Abstractions is joint work with Geeta Iyer
and Sriram Kailasam
Work on Edge Node File Systems is joint work with Kovendhan
Work on Deformable Mesh Abstractions is funded by Yahoo
Research
3. Introduction
Cloud computing: provides pay-for-use access to compute and storage
resources over the Internet.
Smart applications: intelligence embedded within the application
(e.g. Recommender systems)
Computation, data requirements, and algorithms are becoming
increasingly complex.
Popular programming models for cloud: MapReduce, Dryad.
Are these the right abstractions for smart apps?
4. MapReduce Origins
Primary motivation:
To facilitate indexing, searching, and sorting-like operations on
massive datasets over large resource pools.
Inspired by the map and reduce primitives in LISP.
Requirement to perform computations on key-value pairs to generate
intermediate key-value pairs and reduce all values with the same
key.
The runtime is responsible for parallelizing map and reduce tasks,
and handles other low-level details.
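The model can be sketched in a few lines of Python. This is a single-process simulation, not any real framework's API: the names `map_fn`, `reduce_fn`, and `run_mapreduce` are illustrative, and the in-memory grouping loop stands in for the distributed shuffle the runtime would perform.

```python
# Minimal single-process sketch of the MapReduce programming model.
# A real runtime parallelizes the map and reduce tasks across machines
# and handles shuffling, scheduling, and fault tolerance.
from collections import defaultdict

def map_fn(_key, text):
    # Emit an intermediate (word, 1) pair for every word in the input.
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce all values that share the same intermediate key.
    return word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)          # shuffle: group by key
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

result = run_mapreduce([("doc1", "a b a"), ("doc2", "b c")],
                       map_fn, reduce_fn)
# result == {"a": 2, "b": 2, "c": 1}
```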
5. Limitations and Proposed Extensions
Limitations in original MR model:
Input/output restricted to key-value pairs.
Jobs are loosely synchronized (no connected computation).
No support for iteration and recursion.
Does not directly support multiple inputs for a job.
Optimized for batch processing.
Different nodes are assumed to perform work roughly at the same
rate.
Inherent assumption that all tasks require the same amount of
time.
Extensions:
IterativeMR:
Adds support for iterations.
Relies on long-running MapReduce tasks and streaming data between
iterations.
Spark:
Supports iterations and interactive queries.
Each iteration is handled as a separate MapReduce job, incurring
job submission overheads.
Streaming makes fault tolerance difficult.
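To make the iteration limitation concrete, here is a sketch (plain Python, not any framework's API) of the driver-loop pattern the original model forces: the driver resubmits a fresh job every pass until a convergence test holds, paying job-submission overhead each time. The helpers and the toy fixed-point computation are illustrative assumptions.

```python
# Driver-side iteration over the original MapReduce model: each pass
# is a complete, separately submitted job.
from collections import defaultdict

def run_job(pairs, map_fn, reduce_fn):
    # One complete MapReduce job: map, group by key, reduce.
    groups = defaultdict(list)
    for k, v in pairs:
        for ik, iv in map_fn(k, v):
            groups[ik].append(iv)
    return [reduce_fn(k, vs) for k, vs in groups.items()]

def iterate(pairs, map_fn, reduce_fn, converged, max_iters=50):
    # Each loop body is a separate job submission: this per-iteration
    # overhead is what IterativeMR-style extensions try to avoid.
    for i in range(1, max_iters + 1):
        new_pairs = run_job(pairs, map_fn, reduce_fn)
        if converged(pairs, new_pairs):
            return new_pairs, i
        pairs = new_pairs
    return pairs, max_iters

# Toy fixed-point computation: halve each value until all are < 0.1.
halve = lambda k, v: [(k, v / 2)]
keep = lambda k, vs: (k, vs[0])
done = lambda old, new: all(v < 0.1 for _, v in new)
result, iters = iterate([("x", 1.0), ("y", 3.0)], halve, keep, done)
```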
6. Basic Database Operations
Projection
Selection
Aggregation
Join, Cartesian product, Set operations
Only the unary operations can be directly modeled with the original
MapReduce framework.
There is no direct support for operations over multiple, possibly
heterogeneous input data sources.
Can be done indirectly by chaining extra MapReduce steps.
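The indirect route for a binary operation can be sketched as a reduce-side join simulated in plain Python: each record is tagged with its source relation in the map phase, and the reducer pairs up tags for matching keys. The relation names `users` and `orders` and the tagging scheme are illustrative assumptions, not part of the original framework.

```python
# Reduce-side equi-join: fit a binary operation into the unary
# MapReduce mold by tagging each record with its origin relation.
from collections import defaultdict
from itertools import product

def map_tagged(source, key, value):
    yield key, (source, value)     # tag the record with its origin

def reduce_join(key, tagged):
    left = [v for s, v in tagged if s == "users"]
    right = [v for s, v in tagged if s == "orders"]
    # Emit the cross product of matching records for this key.
    return [(key, l, r) for l, r in product(left, right)]

users = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (3, "lamp")]
groups = defaultdict(list)
for k, v in users:
    for ik, iv in map_tagged("users", k, v):
        groups[ik].append(iv)
for k, v in orders:
    for ik, iv in map_tagged("orders", k, v):
        groups[ik].append(iv)
joined = [row for k, vs in groups.items() for row in reduce_join(k, vs)]
# joined == [(1, "alice", "book"), (1, "alice", "pen")]
```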
7. Dryad & DryadLINQ
Motivated primarily by parallel databases.
Makes the communication graph explicit.
Execution graph expressed as Directed Acyclic Graph (DAG).
DryadLINQ allows computations to be expressed in terms of LINQ
operators (similar to SQL operators), automatically parallelized by
the Dryad execution engine.
Supports multiple datasets and runtime optimizations of complete
execution graph.
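A toy sketch of the idea, not Dryad's or DryadLINQ's actual API: the computation is built as an explicit graph of operator vertices over multiple input datasets, which an engine could then schedule and optimize as a whole. The `Node` class and the operators are invented for illustration.

```python
# Illustrative DAG of operators: vertices are operations, edges carry
# data between them. A real engine schedules vertices across machines;
# here the graph is just evaluated recursively in one process.
class Node:
    def __init__(self, op, *inputs):
        self.op, self.inputs = op, inputs

    def evaluate(self):
        return self.op(*(n.evaluate() for n in self.inputs))

source_a = Node(lambda: [1, 2, 3])                 # first input dataset
source_b = Node(lambda: [3, 4])                    # second input dataset
select = Node(lambda xs: [x for x in xs if x > 1], source_a)
join = Node(lambda xs, ys: [x for x in xs if x in ys], select, source_b)
# join.evaluate() == [3]
```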
8. Limitations
Lacks support for recursively spawning new tasks as computation
proceeds.
Adaptive computations like AI planning, branch-and-bound cannot be
supported directly.
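As an illustration of why this matters, branch-and-bound spawns tasks dynamically: expanding one task may enqueue further tasks, so the task graph is only known at runtime and cannot be laid out as a fixed DAG up front. The sketch below, a toy 0/1 knapsack solved single-process with a simple value-sum bound, is illustrative only.

```python
# Branch-and-bound over a shared work queue: expanding a task may
# spawn up to two new tasks, so the computation's shape emerges at
# runtime rather than being fixed in advance.
from collections import deque

def branch_and_bound(items, capacity):
    # Suffix sums of values: an optimistic bound on what remains.
    remain = [0] * (len(items) + 1)
    for i in range(len(items) - 1, -1, -1):
        remain[i] = remain[i + 1] + items[i][1]
    best = 0
    tasks = deque([(0, 0, 0)])        # (next index, weight used, value)
    while tasks:
        i, w, v = tasks.popleft()
        best = max(best, v)
        if i == len(items) or v + remain[i] <= best:
            continue                  # leaf reached, or bound prunes
        wt, val = items[i]
        if w + wt <= capacity:
            tasks.append((i + 1, w + wt, v + val))   # spawn: take item i
        tasks.append((i + 1, w, v))                  # spawn: skip item i
    return best

# branch_and_bound([(2, 3), (3, 6), (4, 5)], 5) == 9
```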
9. Smart Apps
Key aspects / requirements:
10. Different nodes executing in parallel need to communicate;
requires support for a shared communication model.
11. Data partitioning changes as the computation proceeds.
12. Efficient support for a fixed number of iterations or
condition-based termination.
13. Real-world graphs may not be captured by hash-based partitioning;
alternate partitioning schemes are needed.
Classes of Applications
AI planning
Decision tree algorithms
Association rule mining
Recommender systems
Data mining
Graph algorithms
Clustering algorithms
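A tiny illustration of the partitioning requirement above: on a graph whose edges connect nearby vertex ids, hash partitioning can cut every edge, while a locality-aware (here, range-based) scheme cuts none. The graph and both partitioners are made-up examples, not any framework's defaults.

```python
# Compare partitioning schemes by counting cut edges, i.e. edges whose
# endpoints land in different partitions (each cut edge implies
# cross-partition communication).
edges = [(0, 1), (1, 2), (2, 3), (4, 5), (5, 6), (6, 7)]

def cut_edges(edges, part):
    return sum(part(u) != part(v) for u, v in edges)

hash_part = lambda v: v % 2            # hash-based: vertex id mod 2
range_part = lambda v: v // 4          # range-based: ids 0-3 vs 4-7

# Every edge is cut under hash_part; none are cut under range_part.
```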
14. Deformable Mesh Abstraction
Focus:
A new programming model targeted at a wider class of applications
that cannot be modeled efficiently using existing frameworks.
At the same time, support MapReduce-like computations
efficiently.
Brings out a clear separation between programmer expressibility
issues and runtime environment issues.
15. Expressibility Issues
16. Capturing different programming paradigms efficiently.
17. Recursive spawning of new tasks at runtime.
18. Efficient and location-independent communication support.
19. Changing the "shared nothing" viewpoint.
20. Supporting operation on changing datasets.
[Slide diagram: taxonomy of programming paradigms: unconnected,
loosely synchronized (all-to-all, point-to-point), iterative,
recursive, runtime creation]
21. Runtime Issues
22. Offering performance guarantees on unreliable environments.
23. Handling heterogeneity in terms of:
24. capability
25. storage
26. reliability
27. Minimizing synchronization delay between different tasks.
28. Providing efficient fault tolerance.