Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework
Guided By: Asst. Prof. Ms. Mary Mareena
Submitted By: Muhammed Safir O P
Contents: INTRODUCTION
ABSTRACT
EXISTING SYSTEM
PROPOSED SYSTEM
SYSTEM ARCHITECTURE
RESULT AND DISCUSSION
CONCLUSION
INTRODUCTION:
Google MapReduce: a software framework for large-scale distributed computing on large amounts of data.
Hadoop: an open-source implementation of the Google MapReduce programming model.
Two phases: the Map phase and the Reduce phase.
Dache provisions a cache layer on top of MapReduce for efficiently identifying and accessing cache items.
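The two phases can be illustrated with a minimal in-process sketch (a hypothetical word-count example, not Hadoop API code):

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle/sort: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values per key.
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(map_phase("a rose is a rose")))
print(counts)  # {'a': 2, 'rose': 2, 'is': 1}
```

In a real cluster the map and reduce workers run on different machines and the shuffle happens over the network; the data flow is the same.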
EXISTING SYSTEM: MapReduce provides a standardized framework for large-scale distributed computation.
Intermediate data are thrown away, since MapReduce is unable to reuse them.
LIMITATIONS:
Inefficient incremental processing.
Duplicate computations are performed.
No mechanism to detect duplicate computations and accelerate job execution.
EXISTING SYSTEM(Cont.):
The input is split and fed to workers in the map phase.
Intermediate files generated in the map phase are shuffled and sorted by the system and fed to workers in the reduce phase.
Final results are computed by multiple reducers and written to disk.
PROPOSED SYSTEM
Dache: a data-aware cache system for big-data applications using the MapReduce framework.
Dache's aim is to extend the MapReduce framework by provisioning a cache layer for efficiently identifying and accessing cache items in a MapReduce job.
PROPOSED SYSTEM(Cont.): Dache identifies the source input from which a cache item is obtained, and the operations applied to that input, so that a cache item produced by the workers in the map phase is indexed properly.
The partition operation is applied in the map phase.
MAP REDUCE ARCHITECTURE
Job Client: submits jobs.
Job Tracker: coordinates jobs.
Task Tracker: executes job tasks.
MAP REDUCE ARCHITECTURE(Cont.)
1. Clients submit jobs to the Job Tracker.
2. The Job Tracker talks to the NameNode.
3. The Job Tracker creates an execution plan.
4. The Job Tracker submits work to Task Trackers.
5. Task Trackers report progress via heartbeats.
6. The Job Tracker manages the phases.
7. The Job Tracker updates the job status.
MAP PHASE CACHE DESCRIPTION:
A cache item is a piece of cached data stored in the Distributed File System (DFS).
The content of a cache item is described by the original data and the operations applied to it.
2-tuple: {Origin, Operation}.
Origin: the name of a file in the DFS. Operation: a linear list of operations performed on the origin file.
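A cache description of this form might be modeled as follows (a sketch; the field values are illustrative, not Dache's actual schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CacheDescription:
    # Origin: name of the source file in the DFS.
    origin: str
    # Operation: linear list of operations applied to the origin file.
    operations: tuple = ()

item = CacheDescription("dfs://input/part-00000", ("split", "item_count"))
print(item.origin)      # dfs://input/part-00000
print(item.operations)  # ('split', 'item_count')
```

Making the description immutable and hashable (frozen=True) lets the cache manager use it directly as an index key.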
REDUCE PHASE CACHE DESCRIPTION:
The input for the reduce phase is also a list of key-value pairs, where each value may itself be a list of values.
The original input and the applied operations are required to describe a cache item.
The original input is obtained by storing the intermediate results of the map phase in the DFS.
Job Types and Cache Organization Relation:
When processing each file split, the cache manager reports the previous file-splitting scheme used in its cache item.
Job Types and Cache Organization Relation(Cont.):
To find words starting with 'ab', we reuse the cached results for words starting with 'a', and also add the new results to the cache.
Find the best match among overlapping results (choose 'ab' instead of 'a').
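The best-match idea can be sketched as follows (a hypothetical helper; Dache's actual matching works on cache descriptions, not raw strings):

```python
def best_cached_prefix(query, cached_prefixes):
    # Choose the longest cached prefix of the query, so the least
    # extra work remains (e.g. prefer 'ab' over 'a' for query 'ab').
    matches = [p for p in cached_prefixes if query.startswith(p)]
    return max(matches, key=len) if matches else None

print(best_cached_prefix("ab", ["a", "ab", "b"]))  # ab
print(best_cached_prefix("abc", ["a", "ab"]))      # ab
```

A longer match covers more of the requested computation, so only the remainder must be recomputed.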
CACHE REQUEST AND REPLY:
MAP CACHE:
Cache requests must be sent out before the file-splitting phase.
The Job Tracker issues cache requests to the cache manager.
The cache manager replies with a list of cache descriptions.
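The map-cache request/reply exchange might be sketched like this (hypothetical interfaces and file names; the real exchange is a remote call between the Job Tracker and the cache manager):

```python
class CacheManager:
    """Toy cache manager: maps an origin file to its cache descriptions."""

    def __init__(self):
        # Each description pairs an origin file with the operations applied.
        self.db = {
            "dfs://input/f1": [("dfs://input/f1", ("split",)),
                               ("dfs://input/f1", ("split", "item_count"))],
        }

    def handle_request(self, origin):
        # Reply with every cache description recorded for the origin file.
        return self.db.get(origin, [])

manager = CacheManager()
# The Job Tracker sends the request before the file-splitting phase.
descriptions = manager.handle_request("dfs://input/f1")
print(len(descriptions))  # 2
```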
CACHE REQUEST AND REPLY:
REDUCE CACHE:
First, the requested cache item is compared with the cached items in the cache manager's database.
The cache manager identifies overlaps between the original input files of the requested cache item and of the stored cache items.
A linear scan is used here.
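The linear scan over stored cache items could look like this (a sketch, assuming each stored item records the set of original input files it covers):

```python
def find_overlapping_items(requested_files, stored_items):
    # stored_items: list of (item_id, set_of_input_files) pairs.
    # Linear scan: check every stored item for overlap with the request.
    overlaps = []
    for item_id, files in stored_items:
        shared = requested_files & files
        if shared:
            overlaps.append((item_id, shared))
    return overlaps

stored = [("c1", {"f1", "f2"}), ("c2", {"f3"}), ("c3", {"f2", "f4"})]
print(find_overlapping_items({"f2", "f3"}, stored))
```

A linear scan is O(n) in the number of stored items; an index keyed by input file would be faster, but the scan keeps the manager simple.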
LIFETIME MANAGEMENT OF CACHE ITEMS:
Cache Manager: determines how long a cache item can be kept in the DFS.
Two types of policies:
• Fixed storage quota: Least Recently Used (LRU) eviction is employed.
• Optimal utility: estimates the computation time t_s saved by caching a cache item for a given amount of time t_a.
Expense = P_storage × S_cache × t_a
Save = P_computation × R_duplicate × t_s
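The optimal-utility decision might be sketched as follows (a sketch under stated assumptions: P_storage and P_computation are unit prices, S_cache the item size, R_duplicate the expected rate of duplicate computation; the numeric values are illustrative):

```python
def should_keep(p_storage, s_cache, t_a, p_computation, r_duplicate, t_s):
    # Expense of keeping the item in the DFS for time t_a.
    expense = p_storage * s_cache * t_a
    # Computation cost saved by reusing the item (t_s of saved time).
    save = p_computation * r_duplicate * t_s
    # Keep the cache item only while its utility (save - expense) is positive.
    return save > expense

# A cheap-to-store, frequently reused item is worth keeping:
print(should_keep(p_storage=0.01, s_cache=2.0, t_a=10.0,
                  p_computation=1.0, r_duplicate=0.5, t_s=5.0))  # True
```

Here expense = 0.01 × 2.0 × 10.0 = 0.2 while save = 1.0 × 0.5 × 5.0 = 2.5, so the item stays cached.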
RESULT AND DISCUSSION: The graph shows the CPU utilization of Hadoop and Dache for the two programs:
Tera-sort Program. Word-count Program.
RESULT AND DISCUSSION(Cont.): The graph shows the completion time of the two programs using Dache and Hadoop:
Tera-sort Program. Word-count Program.
RESULT AND DISCUSSION(Cont.): The graph shows the total cache size in GB for the two programs using Dache and Hadoop:
Tera-sort Program. Word-count Program.
CONCLUSION: Dache requires minimal change to the original MapReduce programming model.
Application code requires only slight changes in order to utilize Dache.
Dache is implemented in Hadoop by extending the relevant components.
Testbed experiments show that it can eliminate all the duplicate tasks in incremental MapReduce jobs.
It significantly reduces completion time and CPU utilization.
REFERENCES:
J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.
Hadoop, http://Hadoop.apache.org, 2013.
Cache algorithms, http://en.wikipedia.org/wiki/Cache_algorithms, 2013.
Java programming language, http://www.java.com/, 2013.
Google Compute Engine, http://cloud.google.com/products/computeengine.html, 2013.
Any Questions???