MAPREDUCE
PRESENTED BY: KATIE WOODS & JORDAN HOWELL
TEAM MEMBERS:
• Katie Woods: Covered Sections
• 1. Introduction
• 3. Implementation
• 6. Experience
• 7. Related Work
• Jordan Howell: Covered Sections
• 2. Programming Model
• 4. Refinements
• 5. Performance
• Conclusion
OVERVIEW
• What is MapReduce?
• Programming model
• Sections 2.1-6.1
• Related Work
• Conclusion
• How to run MapReduce on OpenStack
• References
• What we went over
• Questions/Comments
WHAT IS MAPREDUCE?
• Originally created by Google
• Used to query large data-sets
• Extracts relations from unstructured data
• Can draw from many disparate data sources
2. PROGRAMMING MODEL
• Two parts: Map() and Reduce()
• The MapReduce library groups together all intermediate values associated with the same intermediate key I
• It passes the values to Reduce() via an iterator
• Reduce() merges the values to form a possibly smaller set of values
• Typically zero or one output value is produced per Reduce invocation (a minimal word-count sketch follows below)
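The slides do not include code, but as a minimal sequential sketch of this model in Python (the function names and the tiny in-memory "library" are illustrative assumptions, not Google's implementation), word counting might look like this:

```python
from itertools import groupby
from operator import itemgetter

# User-defined Map: emit an intermediate (word, 1) pair for every word.
def map_fn(doc_name, contents):
    for word in contents.split():
        yield word, 1

# User-defined Reduce: merge all counts emitted for the same word.
def reduce_fn(word, values):
    yield sum(values)

# Tiny sequential stand-in for the library: group intermediate values by
# key and pass them to Reduce via an iterator, as the model describes.
def run(documents):
    intermediate = []
    for name, contents in documents.items():
        intermediate.extend(map_fn(name, contents))
    intermediate.sort(key=itemgetter(0))
    results = {}
    for key, group in groupby(intermediate, key=itemgetter(0)):
        values = (v for _, v in group)          # iterator over values for key
        results[key] = list(reduce_fn(key, values))
    return results

if __name__ == "__main__":
    docs = {"a.txt": "the quick brown fox", "b.txt": "the lazy dog"}
    print(run(docs))   # {'brown': [1], 'dog': [1], ..., 'the': [2]}
```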
3. IMPLEMENTATION
• Many different implementations
• The right choice depends on the environment
• For example: Google’s setup
• Typically dual x86 processors with 2-4 GB of memory
• 100Mb/s or 1Gb/s networking hardware
• Hundreds or thousands of machines per cluster
• Inexpensive IDE disks directly on machines
3.1 EXECUTION
• The input data is partitioned into M splits
• The intermediate key space is partitioned into R pieces
• 1. The library splits the input files into pieces of 16 to 64 MB each, then starts up many copies of the program on a cluster of machines.
• 2. The master assigns each worker a map task or a reduce task.
• 3. A map worker reads its input split and applies the Map function.
• 4. Periodically, the buffered intermediate pairs are written to the worker's local disk.
• 5. A reduce worker makes remote procedure calls to retrieve the data from the map workers' local disks.
• 6. Results are written to the final output file for that reduce partition.
• 7. The master wakes the user program and returns the output files (a toy single-process sketch of this flow follows below).
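Below is a toy, single-process sketch of this seven-step flow (hypothetical names; the real library distributes these steps across a master and many worker machines, and writes the intermediate buckets to local disk rather than keeping them in memory):

```python
from collections import defaultdict

def map_fn(record):
    for word in record.split():
        yield word, 1

def reduce_fn(key, values):
    return sum(values)

def execute(records, M=4, R=2):
    # 1. Partition the input into M splits (normally 16-64 MB byte ranges;
    #    here we just slice a list of records).
    splits = [records[i::M] for i in range(M)]

    # 2-4. Each "map task" processes one split and buckets its output by
    #      partition number hash(key) % R (its "local disk").
    map_outputs = []
    for split in splits:
        buckets = defaultdict(list)
        for record in split:
            for key, value in map_fn(record):
                buckets[hash(key) % R].append((key, value))
        map_outputs.append(buckets)

    # 5-6. Each "reduce task" r pulls bucket r from every map task, groups
    #      the values by key, and produces one output "file" per partition.
    output_files = []
    for r in range(R):
        grouped = defaultdict(list)
        for buckets in map_outputs:
            for key, value in buckets[r]:
                grouped[key].append(value)
        output_files.append({k: reduce_fn(k, v) for k, v in sorted(grouped.items())})

    # 7. Hand the R output files back to the user program.
    return output_files

if __name__ == "__main__":
    print(execute(["a b a", "b c", "a c c"]))
```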
3.2 MASTER DATA STRUCTURES
• Master keeps several data structures
• For each map task and reduce task, it stores the state (idle, in-progress, or completed) and the identity of the worker machine
• The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks (a minimal bookkeeping sketch follows below)
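A minimal sketch of that bookkeeping (the class and field names are assumptions made for illustration, not taken from the paper or the slides):

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class State(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"

@dataclass
class TaskInfo:
    state: State = State.IDLE
    worker: Optional[str] = None       # machine executing a non-idle task

@dataclass
class Master:
    map_tasks: dict = field(default_factory=dict)       # task id -> TaskInfo
    reduce_tasks: dict = field(default_factory=dict)     # task id -> TaskInfo
    # For each completed map task: the locations of its R intermediate file
    # regions, pushed incrementally to workers with in-progress reduce tasks.
    intermediate_regions: dict = field(default_factory=dict)

    def complete_map(self, task_id, regions):
        self.map_tasks[task_id].state = State.COMPLETED
        self.intermediate_regions[task_id] = regions

if __name__ == "__main__":
    m = Master(map_tasks={0: TaskInfo(State.IN_PROGRESS, "worker-7")})
    m.complete_map(0, ["worker-7:/tmp/map-0-part-%d" % r for r in range(4)])
    print(m.map_tasks[0].state, len(m.intermediate_regions[0]))
```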
3.3 FAULT TOLERANCE
• Since the MapReduce library is designed to process very large amounts of data, it must tolerate machine failures gracefully
• Worker failure: the master pings every worker periodically; tasks on a failed worker are reset to idle and rescheduled on other machines
• Master failure: there is only a single master, so the current implementation simply aborts the computation if the master fails (it could instead checkpoint its state)
• Semantics in the presence of failures: when the Map and Reduce operators are deterministic, the distributed output is the same as that of a non-faulting sequential execution
3.4 LOCALITY
• Because network bandwidth is relatively scarce, the input data is stored on the local disks of the machines
• The files are split into 64 MB blocks, and each block is typically replicated on three machines
• The master tries to schedule a map task on a machine that contains a replica of the corresponding input data
• Otherwise, it schedules the task near a machine containing that data (e.g., on the same network switch)
• When running large operations, most input data is read locally and consumes no network bandwidth (a rough scheduling sketch follows below)
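A rough sketch of that scheduling preference (the replica map, rack table, and worker names are made up for illustration):

```python
def pick_worker(split, replicas, idle_workers, same_rack):
    """Prefer an idle worker holding a replica of the split; otherwise one
    'close' to a replica (e.g., same switch/rack); otherwise any idle worker."""
    holders = replicas.get(split, set())
    for worker in idle_workers:
        if worker in holders:
            return worker                      # local read, no network traffic
    for worker in idle_workers:
        if any(same_rack(worker, h) for h in holders):
            return worker                      # nearby read, cheap network hop
    return next(iter(idle_workers), None)      # no locality available

if __name__ == "__main__":
    replicas = {"split-0": {"m1", "m7", "m9"}}     # three replicas of the block
    rack = {"m1": 0, "m2": 0, "m7": 1, "m9": 2}
    choice = pick_worker("split-0", replicas, ["m2", "m7"],
                         lambda a, b: rack[a] == rack[b])
    print(choice)                                  # -> m7 (holds a replica)
```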
3.5 TASK GRANULARITY
• The map and reduce phases are subdivided into pieces:
• M pieces for Map Phase &
• R pieces for Reduce Phase
• Ideally, M and R should be much larger than the number of worker machines
• This improves dynamic load balancing and speeds up recovery when a worker fails
• R is usually constrained by users, since each reduce task produces a separate output file
• M is chosen so that each piece of input data is roughly 16 MB to 64 MB
• Google often uses M = 200,000 and R = 5,000 with 2,000 worker machines (a small sizing example follows below)
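As a small sizing example (the input size below is made up), choosing M from a target split size is just:

```python
import math

def choose_M(input_bytes, split_bytes=64 * 2**20):
    """Number of map pieces so that each piece is roughly `split_bytes`."""
    return math.ceil(input_bytes / split_bytes)

if __name__ == "__main__":
    one_terabyte = 10**12
    print(choose_M(one_terabyte))   # ~14902 map pieces at 64 MB per split
```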
3.6 BACKUP TASKS
• “Straggler” machines can cause large total computation time
• Stragglers can arise for many different reasons (bad disks, over-committed machines, configuration bugs)
• The library alleviates stragglers with backup tasks
• When the operation is close to finishing, the master schedules backup executions of the remaining in-progress tasks
• A task is marked as complete when either the primary or the backup execution completes
• The overhead of backup tasks has been tuned to no more than a few percent
• The sort example takes 44% longer when backup tasks are disabled (a toy illustration follows below)
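A toy illustration of the mechanism using Python threads (a deliberately simplified sketch: one artificially slow task, a fixed timeout standing in for "close to finishing", and the master's role collapsed into the main thread):

```python
import concurrent.futures as cf
import time

def work(task_id, attempt):
    # Simulate a straggler: the primary attempt of task 3 is very slow.
    time.sleep(5.0 if (task_id == 3 and attempt == 0) else 0.1)
    return task_id

if __name__ == "__main__":
    start, n_tasks = time.time(), 5
    with cf.ThreadPoolExecutor(max_workers=8) as pool:
        primaries = [pool.submit(work, t, 0) for t in range(n_tasks)]
        # Near the end of the phase, schedule backup executions for tasks
        # that are still in progress.
        done, _ = cf.wait(primaries, timeout=0.5)
        completed = {f.result() for f in done}
        backups = [pool.submit(work, t, 1) for t in range(n_tasks)
                   if t not in completed]
        # A task counts as complete when either its primary or its backup
        # attempt finishes first.
        for fut in cf.as_completed(primaries + backups):
            completed.add(fut.result())
            if len(completed) == n_tasks:
                break
        print(f"phase complete in {time.time() - start:.2f}s")  # ~0.6s, not ~5s
        # (The process still waits for the straggler thread before exiting.)
```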
4. REFINEMENTS
• General algorithms fit most needs
• User-defined extensions have been found useful
4.1 PARTITIONING FUNCTION
• Users can define the number of reduce tasks to run (R)
• Users can also supply a custom partitioning function over the intermediate key
• The default function is hash(key) mod R, which yields fairly well-balanced partitions
• Sometimes we want to group output together, such as grouping all web pages from the same host into one output file
• We can redefine the partition function as hash(Hostname(urlkey)) mod R (a small sketch of both follows below)
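A small sketch of both choices (R, the URLs, and the helper names are made up; Python's built-in hash stands in for whatever hash the library uses):

```python
from urllib.parse import urlparse

R = 4   # number of reduce tasks / output partitions

def default_partition(key):
    # Default: hash(key) mod R gives fairly well-balanced partitions.
    return hash(key) % R

def host_partition(url_key):
    # Custom: hash only the hostname, so every URL from the same host
    # lands in the same output partition (and hence the same output file).
    return hash(urlparse(url_key).hostname) % R

if __name__ == "__main__":
    urls = ["http://example.com/a", "http://example.com/b", "http://other.org/x"]
    print([default_partition(u) for u in urls])   # example.com pages may scatter
    print([host_partition(u) for u in urls])      # first two always match
```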
4.2 ORDERING GUARANTEES
• Within each partition, intermediate key/value pairs are processed in increasing key order
• Makes it easy to generate a sorted output file per partition
• This supports efficient lookup of random keys
4.3 COMBINER FUNCTION
• There is sometimes significant repetition in the intermediate keys
• This is usually handled in the Reduce function, but sometimes it helps to partially merge the data on the map worker before it is sent over the network
• This combiner function sometimes yields significant performance gains (a word-count combiner sketch follows below)
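Continuing the earlier word-count sketch, a combiner that partially merges counts on the map worker before they cross the network could look like this (illustrative code, not Google's):

```python
from collections import Counter

def map_fn(contents):
    for word in contents.split():
        yield word, 1

def combiner(pairs):
    # Runs on the map worker: merge repeated intermediate keys locally, so
    # "the" is sent once as ("the", 57) rather than 57 times as ("the", 1).
    partial = Counter()
    for word, count in pairs:
        partial[word] += count
    return partial.items()

def reduce_fn(word, counts):
    return sum(counts)    # the reduce side uses the same merging logic

if __name__ == "__main__":
    print(sorted(combiner(map_fn("to be or not to be"))))
    # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```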
4.4 INPUT AND OUTPUT TYPES
• MapReduce can take data from several different formats
• The way the input data is read and organized strongly affects the computation
• Adding support for a new input type only requires implementing a simple reader interface
4.5 SIDE EFFECTS
• Sometimes we want to produce auxiliary files as additional outputs from the Map or Reduce operators
• Users are responsible for making such side-effect files atomic and idempotent; tasks that produce them should be deterministic
• This restriction has never been an issue in practice
4.6 SKIPPING BAD RECORDS
• Sometimes there are bugs in user code
• The usual course of action is to fix the bug, but that is sometimes not feasible (e.g., the bug is in a third-party library whose source is unavailable)
• It is acceptable to ignore a few records
• Optional mode of execution
• It detects which records cause deterministic crashes and skips them (a toy sketch of the idea follows below)
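A toy version of the idea (the paper's signal-handler and sequence-number protocol is reduced here to catching exceptions and giving up on a record after two failures; all names are illustrative):

```python
from collections import Counter

def run_with_skipping(records, process, max_failures=2):
    failures = Counter()          # per-record crash counts, kept by the "master"
    pending = list(range(len(records)))
    results, skipped = [], []
    while pending:
        idx = pending.pop(0)
        try:
            results.append(process(records[idx]))
        except Exception:
            failures[idx] += 1
            if failures[idx] >= max_failures:
                skipped.append(idx)     # deterministic crash: skip the record
            else:
                pending.append(idx)     # re-execute once before giving up
    return results, skipped

if __name__ == "__main__":
    def process(rec):
        if rec == "corrupt":
            raise ValueError("bug in user code triggered by this record")
        return rec.upper()

    out, skipped = run_with_skipping(["a", "corrupt", "b"], process)
    print(out, "skipped:", skipped)   # ['A', 'B'] skipped: [1]
```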
4.7 LOCAL EXECUTION
• Debugging problems in Map or Reduce functions can be tricky, because the actual computation runs in a distributed system with work assignment decisions made dynamically by the master
• An alternative implementation of the library executes all of the work sequentially on the local machine
• This facilitates debugging, profiling, and small-scale testing
• Controls are provided to limit execution to particular map tasks
4.8 STATUS INFORMATION
• The master runs an internal HTTP server and exports a set of status pages
• Shows progress of the computation
• Contains links to the standard error and standard output files
• Users can use this data to predict how long the computation will take and whether more resources should be added
• The top-level status page shows which workers have failed and which Map and Reduce tasks they were processing when they failed
4.9 COUNTERS
• MapReduce provides a counter facility to count occurrences of various events
• Some counter values are automatically maintained by the MapReduce library
• Users have found the counter facility useful for sanity-checking the behavior of operations (an illustrative sketch follows below)
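A small illustration of a user-defined counter in the word-count setting (the Counters class and its API are assumptions for the sketch, not the library's real interface):

```python
from collections import Counter

class Counters:
    """Per-worker counter object; the real master aggregates the values from
    successful tasks and shows them on the status page and at completion."""
    def __init__(self):
        self.values = Counter()

    def increment(self, name, amount=1):
        self.values[name] += amount

def map_fn(contents, counters):
    for word in contents.split():
        if word.isupper():
            counters.increment("uppercase-words")
        yield word, 1

if __name__ == "__main__":
    counters = Counters()
    list(map_fn("MapReduce IS simple AND scalable", counters))
    print(dict(counters.values))    # {'uppercase-words': 2}
```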
5 PERFORMANCE
• This section measures the performance of MapReduce on two computations, Grep and Sort.
• These programs represent a large subset of real programs that MapReduce users have created
5.1 CLUSTER CONFIGURATION
• Cluster of ≈1800 machines.
• Two 2GHz Intel Xeon processors with Hyper-Threading.
• 4 GB of memory.
• Two 160 GB IDE (Integrated Drive Electronics) disks.
• Gigabit Ethernet link.
• Arranged in a two-level tree-shaped switched network.
• ≈ 100-200 Gbps aggregate bandwidth available at root.
• Every machine is located in the same hosting facility.
• Round trip between pairs is less than a millisecond.
• Out of the 4GB of memory available, approximately 1-1.5GB was reserved by other tasks.
• Programs were run on a weekend afternoon, when the CPUs, disks, and network were mostly idle.
5.2 GREP
• Grep scans through 10^10 100-byte records.
• The program looks for a match to a rare 3-character pattern.
• This pattern occurs in 92,337 records.
• The input gets split up into ≈64 MB pieces.
• The output is stored in a single file (a sketch of grep as map/reduce follows below).
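Expressed in the model, grep is roughly the following (a sketch; the pattern and the sample records are made up, and the real run spreads this work over thousands of map tasks):

```python
import re

PATTERN = re.compile(r"xyz")    # stand-in for the rare three-character pattern

def map_fn(offset, line):
    # Emit a matching line; the Reduce step is simply the identity function.
    if PATTERN.search(line):
        yield offset, line

def reduce_fn(key, values):
    yield from values

if __name__ == "__main__":
    records = ["abc def", "qq xyz qq", "hello", "xyz at start"]
    matches = [v for i, r in enumerate(records) for _, v in map_fn(i, r)]
    print(matches)    # ['qq xyz qq', 'xyz at start']
```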
5.3 SORT
• The sort program sorts through 10^10 100-byte records.
• This is modeled after the TeraSort benchmark.
• The whole program is less than 50 lines of user code.
• Like Grep, the input for the sort program is split up into 64 MB pieces.
• The sorted output is partitioned into 4000 files.
• The partitioning function uses the initial bytes of the key to segregate the output into one of the 4000 pieces (a range-partition sketch follows below).
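The map and reduce functions themselves are nearly trivial; the interesting piece is the partitioning. A sketch of range-partitioning by the first byte of the key (4 output pieces here instead of 4000, and a handful of toy records instead of 10^10):

```python
R = 4   # the real run partitioned the output into 4000 files

def range_partition(key):
    # Use the initial byte of the key to pick one of R ordered buckets, so
    # concatenating the sorted buckets gives a globally sorted output.
    return min(key[0] * R // 256, R - 1)

def map_fn(record):
    yield record[:10], record      # key = first 10 bytes, value = the record

if __name__ == "__main__":
    records = [bytes([0xF0]) + b"-rec", bytes([0x05]) + b"-rec",
               bytes([0x80]) + b"-rec"]
    buckets = {r: [] for r in range(R)}
    for rec in records:
        for key, value in map_fn(rec):
            buckets[range_partition(key)].append(value)
    print([sorted(buckets[r]) for r in range(R)])
    # low byte values land in low-numbered partitions: 0x05->0, 0x80->2, 0xF0->3
```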
SORT CONTINUED
5.4 EFFECT OF BACKUP TASKS
• With backup tasks disabled:
• After 960 seconds, all but 5 reduce tasks have completed
• The computation takes 1283 seconds to finish completely
• An increase of 44% in elapsed time
5.5 MACHINE FAILURES
• 200 worker processes were intentionally killed partway through the computation
• The sort still finishes in 933 seconds
• A 5% increase over the normal execution time
6 EXPERIENCE
• Extraction of data for popular queries
• Google Zeitgeist
• Extracting properties of web pages
• Geographical locations of web pages for localized search
• Clustering problems for Google News and Froogle products
• Large-scale machine learning problems and graph computations
6.1 LARGE SCALE INDEXING
• Production Indexing System
• Produces data structures for searches
• Completely rewritten with MapReduce
• What it does:
• Crawler gathers approx. 20 terabytes of data
• Indexing Process: 5-10 MapReduce operations
6.1 CONTINUED
• The indexing code is simpler
• One computation phase dropped from ≈3800 lines of C++ to ≈700 with MapReduce
• Improved Performance
• Separates unrelated computations
• Avoids extra passes over data
• Easier to Operate
• MapReduce handles issues without operator intervention
• Machine failures, slow machines, networking hiccups
7 RELATED WORK
• MapReduce can be viewed as a simplification of many systems' programming models, made adaptable and scalable
• It works off a restricted model in which data is stored locally on the workers' disks and computation is scheduled close to that data
• The backup task mechanism is similar to the eager scheduling of the Charlotte System, with the addition of the ability to skip bad records that cause repeated failures
CONCLUSION
• MapReduce has been used successfully at Google for many different purposes
• It is easy for programmers to use, even without a background in distributed and parallel systems
• It has been applied to sorting, data mining, machine learning, and many other problems
LESSONS LEARNED
• Restricting the programming model makes it easy to parallelize and distribute computations and to make them fault-tolerant.
• Network bandwidth is precious; many optimizations exist to save it.
• Redundant execution can be used to reduce the impact of slow machines and to handle machine failures and data loss.
MAPREDUCE WITH OPENSTACK
• Cloud Computing with Data Analytics
COMMON DEPLOYMENT
• Swift Storage joined to MapReduce cluster.
• Scalable storage mode to handle large amounts of data.
• This matters because annual data growth is roughly 60%
BEGINNER DEPLOYMENT
• Cloudera offers a MapReduce distribution
• It lets companies adopt big data incrementally
ADVANCED DEPLOYMENT
• Offers the best flexibility, scalability, and autonomy
• Build the cloud first, then add Swift and Nova
• Quantum (OpenStack networking) should be added for network segmentation
GOOGLE FILE SYSTEM
• Lets a third party handle storage while the company focuses on computational processing.
NEED TO KNOW
• Let employees grow with technology
• If you do not plan, you will most likely fail
WHAT WE WENT OVER
• What MapReduce is
• Programming Model: types
• Implementation: execution, master data structures, fault tolerance, locality, task granularity, and backup tasks
• Refinements: partitioning function, ordering guarantees, combiner functions, input & output types, side-effects, skipping bad records, local execution, status information, and counters
• Performance: cluster configuration, Grep, Sort, effect of backup tasks, and machine failures
• Experiences
• Related Work
• Conclusion
• How MapReduce is run with OpenStack
QUESTIONS/COMMENTS
REFERENCES