Programming Abstractions for Smart Apps on Clouds Prof. D. Janakiram, Professor, Dept of CSE, IIT, Madras

Apache Hadoop India Summit 2011 Keynote talk "Programming Abstractions for Smart Apps on Clouds" by D. Janakiram


1. Programming Abstractions for Smart Apps on Clouds
Prof. D. Janakiram,
Professor, Dept of CSE,
IIT, Madras

2. Acknowledgements
Work on Deformable Mesh Abstractions is joint work with Geeta Iyer and Sriram Kailasam
Work on Edge Node File Systems is joint work with Kovendhan
Work on Deformable Mesh Abstractions is funded by Yahoo Research
3. Introduction
Cloud computing: provides pay-for-use access to compute and storage resources over the Internet.
Smart applications: intelligence embedded within the application (e.g. Recommender systems)
Computation, data requirements and algorithms increasingly becoming complex.
Popular programming models for cloud: MapReduce, Dryad.
Are these right abstractions for smart apps?
4. MapReduce Origins
Primary motivation:
To facilitate indexing, searching, sorting like operations on massive datasets over large resources.
Inspired from map and reduce primitives in LISP.
Requirement to perform computations on key-value pairs to generate intermediate key-value pairs and reduce all values with the same key.
Runtime responsible for parallelization of map and reduce tasks, and handles other low level details.
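The key-value contract described above can be sketched in a few lines. This is a minimal, single-process illustration of the model, not any real runtime: the helper names (`map_task`, `reduce_task`, `run_mapreduce`) are invented here, and an actual framework would distribute the tasks and the shuffle across machines.

```python
from collections import defaultdict

def map_task(_key, line):
    # Emit an intermediate (word, 1) pair for every word in the input line.
    for word in line.split():
        yield (word, 1)

def reduce_task(word, counts):
    # Combine all intermediate values that share the same key.
    yield (word, sum(counts))

def run_mapreduce(records, mapper, reducer):
    groups = defaultdict(list)
    for key, value in records:
        for k, v in mapper(key, value):
            groups[k].append(v)          # shuffle: group by intermediate key
    result = {}
    for k, vs in sorted(groups.items()):
        for out_k, out_v in reducer(k, vs):
            result[out_k] = out_v
    return result

lines = [(0, "smart apps on clouds"), (1, "apps on clouds")]
print(run_mapreduce(lines, map_task, reduce_task))
```

The runtime owns everything outside `map_task` and `reduce_task`: partitioning, the shuffle, and fault handling are invisible to the programmer.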
5. Limitations and Proposed Extensions
Limitations in original MR model:
Input/output restricted to key-value pairs.
Jobs are loosely synchronized (no connected computation).
No support for iteration and recursion.
Does not directly support multiple inputs for a job.
Optimized for batch processing.
Different nodes are assumed to perform work roughly at the same rate.
Inherent assumption that all tasks require the same amount of time.
Extensions:
IterativeMR:
adds support for iterations
relies on long-running MapReduce tasks and streaming data between iterations
Spark:
Supports iterations and interactive queries.
Each iteration is handled as a separate MapReduce job, incurring job submission overheads.
Streaming makes fault tolerance difficult.
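The per-iteration job-submission overhead mentioned for Spark-style extensions can be seen in a toy driver loop. This is an illustrative sketch only: `run_job` here stands in for one full MapReduce job submission, which on a real cluster would pay scheduling and startup costs on every pass.

```python
def run_job(records, mapper):
    # Stand-in for submitting one complete MapReduce job (illustrative);
    # a real cluster pays scheduling and startup overhead per call.
    return [mapper(r) for r in records]

def halve(x):
    return x / 2.0

data = [80.0, 40.0, 10.0]
jobs_submitted = 0
while max(data) > 1.0:          # condition-based termination in the driver
    data = run_job(data, halve)  # a brand-new job every iteration
    jobs_submitted += 1

print(jobs_submitted)
```

Because the model has no notion of a loop, the convergence test lives in the external driver and each pass over the data is a fresh job.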
6. Basic Database Operations
Projection
Selection
Aggregation
Join, Cartesian product, Set operations
Only the unary operations can be directly modeled with the original MapReduce framework.
There is no direct support for operations over multiple, possibly heterogeneous input data sources.
Can be done indirectly by chaining extra MapReduce steps.
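The indirect route mentioned above is commonly realized as a reduce-side join: records from each input are tagged with their source so the reducer can pair them. The sketch below compresses that pattern into one process; the table contents are invented for illustration.

```python
from collections import defaultdict

users  = [(1, "alice"), (2, "bob")]          # input 1: (user_id, name)
orders = [(1, "book"), (1, "pen"), (2, "lamp")]  # input 2: (user_id, item)

# "Map" phase: tag each record with its source table by placing it in
# a per-source bucket under the join key.
groups = defaultdict(lambda: ([], []))
for uid, name in users:
    groups[uid][0].append(name)
for uid, item in orders:
    groups[uid][1].append(item)

# "Reduce" phase: for each key, emit the cross product of the two buckets.
joined = []
for uid, (names, items) in sorted(groups.items()):
    for n in names:
        for it in items:
            joined.append((uid, n, it))

print(joined)
```

The extra bookkeeping (tagging, bucketing, cross product) is exactly the chained-MapReduce machinery that binary operators require but unary projection/selection/aggregation do not.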
7. Dryad & DryadLINQ
Motivated primarily by parallel databases.
Makes the communication graph explicit.
Execution graph expressed as Directed Acyclic Graph (DAG).
DryadLINQ allows computations to be expressed in terms of LINQ operators (similar to SQL operators)
Automatically parallelized by Dryad execution engine.
Supports multiple datasets and runtime optimizations of complete execution graph.
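Making the communication graph explicit, as Dryad does, means the engine sees the whole plan before running it. A minimal sketch of executing such a DAG of operators follows; the node names and operator bodies are invented for illustration, and a real engine would run independent nodes in parallel rather than sequentially.

```python
from graphlib import TopologicalSorter

# Each node maps to the set of nodes whose outputs it consumes.
dag = {"read": set(), "select": {"read"}, "project": {"read"},
       "join": {"select", "project"}}

ops = {
    "read":    lambda inputs: [(1, "a"), (2, "b")],
    "select":  lambda inputs: [r for r in inputs[0] if r[0] > 1],
    "project": lambda inputs: [r[1] for r in inputs[0]],
    "join":    lambda inputs: (inputs[0], inputs[1]),
}

results = {}
for node in TopologicalSorter(dag).static_order():
    deps = sorted(dag[node])                 # deterministic input order
    results[node] = ops[node]([results[d] for d in deps])

print(results["join"])
```

Because the full graph is known up front, the engine can reorder, fuse, or re-partition stages, which is the basis for DryadLINQ's whole-graph runtime optimizations.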
8. Limitations
Lacks support for recursively spawning new tasks as computation proceeds.
Adaptive computations like AI planning, branch-and-bound cannot be supported directly.
9. Smart Apps
Key aspects / requirements

  • As computation proceeds, search space expands with newly generated data; requires support for spawning new tasks on-the-fly.

10. Different nodes executing in parallel need to communicate; requires support for a shared communication model.
11. Data partitioning changes as the computation proceeds.
12. Efficient support for a fixed number of iterations or condition-based termination.
13. Real-world graphs may not be captured by hash-based partitioning; alternate partitioning schemes are needed.
Classes of Applications
AI planning
Decision tree algorithms
Association rule mining
Recommender systems
Data mining
Graph algorithms
Clustering algorithms
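The first requirement above, spawning new tasks as the search space expands, is what branch-and-bound needs. A toy sketch under stated assumptions: the worklist below plays the role of a distributed task queue, and the knapsack-style instance (`items`, `limit`) is invented for illustration; a real system would run the popped tasks in parallel.

```python
from collections import deque

items = [(3, 4), (2, 3), (4, 5)]    # (weight, value) pairs
limit = 5                           # weight budget
best = 0

# Each task explores one partial subset; a task may spawn up to two
# child tasks, so the task graph grows as the computation proceeds.
tasks = deque([(0, 0, 0)])          # (next item index, weight, value)
while tasks:
    i, w, v = tasks.popleft()
    best = max(best, v)
    if i == len(items):
        continue
    wi, vi = items[i]
    if w + wi <= limit:
        tasks.append((i + 1, w + wi, v + vi))   # spawn: take item i
    tasks.append((i + 1, w, v))                 # spawn: skip item i

print(best)
```

MapReduce and Dryad fix the task graph before execution, so this kind of data-dependent spawning has no direct home in either model.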
14. Deformable Mesh Abstraction
Focus:
New programming model targeted towards wider applications that cannot be modeled efficiently using existing frameworks.
At the same time, support MapReduce-like computations efficiently.
Bring out clear separation between programmer expressibility issues and runtime environment issues.
15. Expressibility Issues

  • Extending programmer expressibility

16. capturing different programming paradigms efficiently.
17. recursive spawning of new tasks at runtime.
18. efficient and location-independent communication support.
19. changing the shared-nothing viewpoint.
20. support for operating on changing datasets.

[Slide diagram of programming paradigms: Unconnected, Loosely Synchronized, Iterative, Recursive, Runtime creation; with labels (all-to-all) and (point-to-point)]
21. Runtime Issues

  • Handling runtime level details

22. offering performance guarantees on unreliable environments.
23. handling heterogeneity in terms of
24. capability
25. storage
26. reliability
27. minimizing synchronization delay between different tasks.
28. providing efficient fault tolerance.