MAPREDUCE
PRESENTED BY: KATIE WOODS & JORDAN HOWELL
TEAM MEMBERS:
• Katie Woods: Covered Sections
• 1. Introduction
• 3. Implementation
• 6. Experience
• 7. Related Work
• Jordan Howell: Covered Sections
• 2. Programming Model
• 4. Refinements
• 5. Performance
• Conclusion
OVERVIEW
• What is MapReduce?
• Programming model
• Sections 2.1-6.1
• Related Work
• Conclusion
• How to run MapReduce on OpenStack
• References
• What we went over
• Questions/Comments
WHAT IS MAPREDUCE?
• Originally created by Google
• Used to query large data-sets
• Extracts relations from unstructured data
• Can draw from many disparate data sources
2. PROGRAMMING MODEL
• Two parts: Map() and Reduce()
• The MapReduce library groups together all intermediate values associated with the same intermediate key I
• It passes the values to Reduce() via an iterator
• Reduce() merges the values to form a possibly smaller set of values
• Typically zero or one output value is produced per Reduce invocation (a minimal word-count sketch follows below)
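The slides do not include code, but as a minimal sequential sketch of this model in Python (the function names and the tiny in-memory "library" are illustrative assumptions, not Google's implementation), word counting might look like this:

```python
from itertools import groupby
from operator import itemgetter

# User-defined Map: emit an intermediate (word, 1) pair for every word.
def map_fn(doc_name, contents):
    for word in contents.split():
        yield word, 1

# User-defined Reduce: merge all counts emitted for the same word.
def reduce_fn(word, values):
    yield sum(values)

# Tiny sequential stand-in for the library: group intermediate values by
# key and pass them to Reduce via an iterator, as the model describes.
def run(documents):
    intermediate = []
    for name, contents in documents.items():
        intermediate.extend(map_fn(name, contents))
    intermediate.sort(key=itemgetter(0))
    results = {}
    for key, group in groupby(intermediate, key=itemgetter(0)):
        values = (v for _, v in group)          # iterator over values for key
        results[key] = list(reduce_fn(key, values))
    return results

if __name__ == "__main__":
    docs = {"a.txt": "the quick brown fox", "b.txt": "the lazy dog"}
    print(run(docs))   # {'brown': [1], 'dog': [1], ..., 'the': [2]}
```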
3. IMPLEMENTATION
• Many different implementations
• The right choice depends on the environment
• For example: Google’s setup
• Typically dual x86 processors with 2-4 GB of memory
• 100Mb/s or 1Gb/s networking hardware
• Hundreds or thousands of machines per cluster
• Inexpensive IDE disks directly on machines
3.1 EXECUTION
• The input data is partitioned into M splits
• The intermediate key space is partitioned into R pieces
• 1. The library splits the input files into pieces of 16 to 64 MB each, then starts up many copies of the program on a cluster of machines.
• 2. The master assigns each worker a map task or a reduce task.
• 3. A map worker reads its input split and applies the Map function.
• 4. Periodically, the buffered intermediate pairs are written to the worker's local disk.
• 5. A reduce worker makes remote procedure calls to retrieve the data from the map workers' local disks.
• 6. Results are written to the final output file for that reduce partition.
• 7. The master wakes the user program and returns the output files (a toy single-process sketch of this flow follows below).
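Below is a toy, single-process sketch of this seven-step flow (hypothetical names; the real library distributes these steps across a master and many worker machines, and writes the intermediate buckets to local disk rather than keeping them in memory):

```python
from collections import defaultdict

def map_fn(record):
    for word in record.split():
        yield word, 1

def reduce_fn(key, values):
    return sum(values)

def execute(records, M=4, R=2):
    # 1. Partition the input into M splits (normally 16-64 MB byte ranges;
    #    here we just slice a list of records).
    splits = [records[i::M] for i in range(M)]

    # 2-4. Each "map task" processes one split and buckets its output by
    #      partition number hash(key) % R (its "local disk").
    map_outputs = []
    for split in splits:
        buckets = defaultdict(list)
        for record in split:
            for key, value in map_fn(record):
                buckets[hash(key) % R].append((key, value))
        map_outputs.append(buckets)

    # 5-6. Each "reduce task" r pulls bucket r from every map task, groups
    #      the values by key, and produces one output "file" per partition.
    output_files = []
    for r in range(R):
        grouped = defaultdict(list)
        for buckets in map_outputs:
            for key, value in buckets[r]:
                grouped[key].append(value)
        output_files.append({k: reduce_fn(k, v) for k, v in sorted(grouped.items())})

    # 7. Hand the R output files back to the user program.
    return output_files

if __name__ == "__main__":
    print(execute(["a b a", "b c", "a c c"]))
```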
3.2 MASTER DATA STRUCTURES
• Master keeps several data structures
• For each map task and reduce task, it stores the state (idle, in-progress, or completed) and the identity of the worker machine
• The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks (a minimal bookkeeping sketch follows below)
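A minimal sketch of that bookkeeping (the class and field names are assumptions made for illustration, not taken from the paper or the slides):

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class State(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"

@dataclass
class TaskInfo:
    state: State = State.IDLE
    worker: Optional[str] = None       # machine executing a non-idle task

@dataclass
class Master:
    map_tasks: dict = field(default_factory=dict)       # task id -> TaskInfo
    reduce_tasks: dict = field(default_factory=dict)     # task id -> TaskInfo
    # For each completed map task: the locations of its R intermediate file
    # regions, pushed incrementally to workers with in-progress reduce tasks.
    intermediate_regions: dict = field(default_factory=dict)

    def complete_map(self, task_id, regions):
        self.map_tasks[task_id].state = State.COMPLETED
        self.intermediate_regions[task_id] = regions

if __name__ == "__main__":
    m = Master(map_tasks={0: TaskInfo(State.IN_PROGRESS, "worker-7")})
    m.complete_map(0, ["worker-7:/tmp/map-0-part-%d" % r for r in range(4)])
    print(m.map_tasks[0].state, len(m.intermediate_regions[0]))
```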
3.3 FAULT TOLERANCE
• Since the MapReduce library is designed to process very large amounts of data, it must tolerate machine failures gracefully
• Worker failure: the master pings every worker periodically; tasks on a failed worker are reset to idle and rescheduled on other machines
• Master failure: there is only a single master, so the current implementation simply aborts the computation if the master fails (it could instead checkpoint its state)
• Semantics in the presence of failures: when the Map and Reduce operators are deterministic, the distributed output is the same as that of a non-faulting sequential execution
3.4 LOCALITY
• Because network bandwidth is relatively scarce, the input data is stored on the local disks of the machines
• The files are split into 64 MB blocks, and each block is typically replicated on three machines
• The master tries to schedule a map task on a machine that contains a replica of the corresponding input data
• Otherwise, it schedules the task near a machine containing that data (e.g., on the same network switch)
• When running large operations, most input data is read locally and consumes no network bandwidth (a rough scheduling sketch follows below)
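A rough sketch of that scheduling preference (the replica map, rack table, and worker names are made up for illustration):

```python
def pick_worker(split, replicas, idle_workers, same_rack):
    """Prefer an idle worker holding a replica of the split; otherwise one
    'close' to a replica (e.g., same switch/rack); otherwise any idle worker."""
    holders = replicas.get(split, set())
    for worker in idle_workers:
        if worker in holders:
            return worker                      # local read, no network traffic
    for worker in idle_workers:
        if any(same_rack(worker, h) for h in holders):
            return worker                      # nearby read, cheap network hop
    return next(iter(idle_workers), None)      # no locality available

if __name__ == "__main__":
    replicas = {"split-0": {"m1", "m7", "m9"}}     # three replicas of the block
    rack = {"m1": 0, "m2": 0, "m7": 1, "m9": 2}
    choice = pick_worker("split-0", replicas, ["m2", "m7"],
                         lambda a, b: rack[a] == rack[b])
    print(choice)                                  # -> m7 (holds a replica)
```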
3.5 TASK GRANULARITY
• The map and reduce phases are subdivided into pieces:
• M pieces for Map Phase &
• R pieces for Reduce Phase
• Ideally, M and R should be much larger than the number of worker machines
• This improves dynamic load balancing and speeds up recovery when a worker fails
• R is usually constrained by users, since each reduce task produces a separate output file
• M is chosen so that each piece of input data is roughly 16 MB to 64 MB
• Google often uses M = 200,000 and R = 5,000 with 2,000 worker machines (a small sizing example follows below)
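As a small sizing example (the input size below is made up), choosing M from a target split size is just:

```python
import math

def choose_M(input_bytes, split_bytes=64 * 2**20):
    """Number of map pieces so that each piece is roughly `split_bytes`."""
    return math.ceil(input_bytes / split_bytes)

if __name__ == "__main__":
    one_terabyte = 10**12
    print(choose_M(one_terabyte))   # ~14902 map pieces at 64 MB per split
```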
3.6 BACKUP TASKS
• “Straggler” machines can cause large total computation time
• Stragglers can arise for many different reasons (bad disks, over-committed machines, configuration bugs)
• The library alleviates stragglers with backup tasks
• When the operation is close to finishing, the master schedules backup executions of the remaining in-progress tasks
• A task is marked as complete when either the primary or the backup execution completes
• The overhead of backup tasks has been tuned to no more than a few percent
• The sort example takes 44% longer when backup tasks are disabled (a toy illustration follows below)
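A toy illustration of the mechanism using Python threads (a deliberately simplified sketch: one artificially slow task, a fixed timeout standing in for "close to finishing", and the master's role collapsed into the main thread):

```python
import concurrent.futures as cf
import time

def work(task_id, attempt):
    # Simulate a straggler: the primary attempt of task 3 is very slow.
    time.sleep(5.0 if (task_id == 3 and attempt == 0) else 0.1)
    return task_id

if __name__ == "__main__":
    start, n_tasks = time.time(), 5
    with cf.ThreadPoolExecutor(max_workers=8) as pool:
        primaries = [pool.submit(work, t, 0) for t in range(n_tasks)]
        # Near the end of the phase, schedule backup executions for tasks
        # that are still in progress.
        done, _ = cf.wait(primaries, timeout=0.5)
        completed = {f.result() for f in done}
        backups = [pool.submit(work, t, 1) for t in range(n_tasks)
                   if t not in completed]
        # A task counts as complete when either its primary or its backup
        # attempt finishes first.
        for fut in cf.as_completed(primaries + backups):
            completed.add(fut.result())
            if len(completed) == n_tasks:
                break
        print(f"phase complete in {time.time() - start:.2f}s")  # ~0.6s, not ~5s
        # (The process still waits for the straggler thread before exiting.)
```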
4. REFINEMENTS
• General algorithms fit most needs
• User-defined extensions have been found useful
4.1 PARTITIONING FUNCTION
• Users can define the number of reduce tasks to run (R)
• Users can also supply a custom partitioning function over the intermediate key
• The default function is hash(key) mod R, which yields fairly well-balanced partitions
• Sometimes we want to group output together, such as grouping all web pages from the same host into one output file
• We can redefine the partition function as hash(Hostname(urlkey)) mod R (a small sketch of both follows below)
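A small sketch of both choices (R, the URLs, and the helper names are made up; Python's built-in hash stands in for whatever hash the library uses):

```python
from urllib.parse import urlparse

R = 4   # number of reduce tasks / output partitions

def default_partition(key):
    # Default: hash(key) mod R gives fairly well-balanced partitions.
    return hash(key) % R

def host_partition(url_key):
    # Custom: hash only the hostname, so every URL from the same host
    # lands in the same output partition (and hence the same output file).
    return hash(urlparse(url_key).hostname) % R

if __name__ == "__main__":
    urls = ["http://example.com/a", "http://example.com/b", "http://other.org/x"]
    print([default_partition(u) for u in urls])   # example.com pages may scatter
    print([host_partition(u) for u in urls])      # first two always match
```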
4.2 ORDERING GUARANTEES
• Within each partition, intermediate key/value pairs are processed in increasing key order
• Makes it easy to generate a sorted output file per partition
• This supports efficient lookup of random keys
4.3 COMBINER FUNCTION
• There is sometimes significant repetition in the intermediate keys
• This is usually handled in the Reduce function, but sometimes it helps to partially merge the data on the map worker before it is sent over the network
• This combiner function sometimes yields significant performance gains (a word-count combiner sketch follows below)
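Continuing the earlier word-count sketch, a combiner that partially merges counts on the map worker before they cross the network could look like this (illustrative code, not Google's):

```python
from collections import Counter

def map_fn(contents):
    for word in contents.split():
        yield word, 1

def combiner(pairs):
    # Runs on the map worker: merge repeated intermediate keys locally, so
    # "the" is sent once as ("the", 57) rather than 57 times as ("the", 1).
    partial = Counter()
    for word, count in pairs:
        partial[word] += count
    return partial.items()

def reduce_fn(word, counts):
    return sum(counts)    # the reduce side uses the same merging logic

if __name__ == "__main__":
    print(sorted(combiner(map_fn("to be or not to be"))))
    # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```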
4.4 INPUT AND OUTPUT TYPES
• MapReduce can take data from several different formats
• The way the input data is read and organized strongly affects the computation
• Adding support for a new input type only requires implementing a simple reader interface
4.5 SIDE EFFECTS
• Sometimes we want to produce auxiliary files as additional outputs from the Map or Reduce operators
• Users are responsible for making such side-effect files atomic and idempotent; tasks that produce them should be deterministic
• This restriction has never been an issue in practice
4.6 SKIPPING BAD RECORDS
• Sometimes there are bugs in user code
• The usual course of action is to fix the bug, but that is sometimes not feasible (e.g., the bug is in a third-party library whose source is unavailable)
• It is acceptable to ignore a few records
• Optional mode of execution
• It detects which records cause deterministic crashes and skips them (a toy sketch of the idea follows below)
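A toy version of the idea (the paper's signal-handler and sequence-number protocol is reduced here to catching exceptions and giving up on a record after two failures; all names are illustrative):

```python
from collections import Counter

def run_with_skipping(records, process, max_failures=2):
    failures = Counter()          # per-record crash counts, kept by the "master"
    pending = list(range(len(records)))
    results, skipped = [], []
    while pending:
        idx = pending.pop(0)
        try:
            results.append(process(records[idx]))
        except Exception:
            failures[idx] += 1
            if failures[idx] >= max_failures:
                skipped.append(idx)     # deterministic crash: skip the record
            else:
                pending.append(idx)     # re-execute once before giving up
    return results, skipped

if __name__ == "__main__":
    def process(rec):
        if rec == "corrupt":
            raise ValueError("bug in user code triggered by this record")
        return rec.upper()

    out, skipped = run_with_skipping(["a", "corrupt", "b"], process)
    print(out, "skipped:", skipped)   # ['A', 'B'] skipped: [1]
```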
4.7 LOCAL EXECUTION
• Debugging problems in Map or Reduce functions can be tricky, because the actual computation runs in a distributed system with work assignment decisions made dynamically by the master
• An alternative implementation of the library executes all of the work sequentially on the local machine
• This facilitates debugging, profiling, and small-scale testing
• Controls are provided to limit execution to particular map tasks
4.8 STATUS INFORMATION
• The master runs an internal HTTP server and exports a set of status pages
• Shows progress of the computation
• Contains links to the standard error and standard output files
• Users can use this data to predict how long the computation will take and whether more resources should be added
• The top-level status page shows which workers have failed and which Map and Reduce tasks they were processing when they failed
4.9 COUNTERS
• MapReduce provides a counter facility to count occurrences of various events
• Some counter values are automatically maintained by the MapReduce library
• Users have found the counter facility useful for sanity-checking the behavior of operations (an illustrative sketch follows below)
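A small illustration of a user-defined counter in the word-count setting (the Counters class and its API are assumptions for the sketch, not the library's real interface):

```python
from collections import Counter

class Counters:
    """Per-worker counter object; the real master aggregates the values from
    successful tasks and shows them on the status page and at completion."""
    def __init__(self):
        self.values = Counter()

    def increment(self, name, amount=1):
        self.values[name] += amount

def map_fn(contents, counters):
    for word in contents.split():
        if word.isupper():
            counters.increment("uppercase-words")
        yield word, 1

if __name__ == "__main__":
    counters = Counters()
    list(map_fn("MapReduce IS simple AND scalable", counters))
    print(dict(counters.values))    # {'uppercase-words': 2}
```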
5 PERFORMANCE
• This section measures the performance of MapReduce on two computations, Grep and Sort.
• These programs represent a large subset of real programs that MapReduce users have created
5.1 CLUSTER CONFIGURATION
• Cluster of ≈1800 machines.
• Two 2GHz Intel Xeon processors with Hyper-Threading.
• 4 GB of memory.
• Two 160 GB IDE (Integrated Drive Electronics) disks.
• Gigabit Ethernet link.
• Arranged in a two-level tree-shaped switched network.
• ≈ 100-200 Gbps aggregate bandwidth available at root.
• Every machine is located in the same hosting facility.
• Round trip between pairs is less than a millisecond.
• Out of the 4GB of memory available, approximately 1-1.5GB was reserved by other tasks.
• Programs were run on a weekend afternoon, when the CPUs, disks, and network were mostly idle.
5.2 GREP
• Grep scans through 10^10 100-byte records.
• The program looks for a match to a rare 3-character pattern.
• This pattern occurs in 92,337 records.
• The input gets split up into ≈64 MB pieces.
• The output is stored in a single file (a sketch of grep as map/reduce follows below).
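Expressed in the model, grep is roughly the following (a sketch; the pattern and the sample records are made up, and the real run spreads this work over thousands of map tasks):

```python
import re

PATTERN = re.compile(r"xyz")    # stand-in for the rare three-character pattern

def map_fn(offset, line):
    # Emit a matching line; the Reduce step is simply the identity function.
    if PATTERN.search(line):
        yield offset, line

def reduce_fn(key, values):
    yield from values

if __name__ == "__main__":
    records = ["abc def", "qq xyz qq", "hello", "xyz at start"]
    matches = [v for i, r in enumerate(records) for _, v in map_fn(i, r)]
    print(matches)    # ['qq xyz qq', 'xyz at start']
```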
5.3 SORT
• The sort program sorts through 10^10 100-byte records.
• This is modeled after the TeraSort benchmark.
• The whole program is less than 50 lines of user code.
• Like Grep, the input for the sort program is split up into 64 MB pieces.
• The sorted output is partitioned into 4000 files.
• The partitioning function uses the initial bytes of the key to segregate the output into one of the 4000 pieces (a range-partition sketch follows below).
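The map and reduce functions themselves are nearly trivial; the interesting piece is the partitioning. A sketch of range-partitioning by the first byte of the key (4 output pieces here instead of 4000, and a handful of toy records instead of 10^10):

```python
R = 4   # the real run partitioned the output into 4000 files

def range_partition(key):
    # Use the initial byte of the key to pick one of R ordered buckets, so
    # concatenating the sorted buckets gives a globally sorted output.
    return min(key[0] * R // 256, R - 1)

def map_fn(record):
    yield record[:10], record      # key = first 10 bytes, value = the record

if __name__ == "__main__":
    records = [bytes([0xF0]) + b"-rec", bytes([0x05]) + b"-rec",
               bytes([0x80]) + b"-rec"]
    buckets = {r: [] for r in range(R)}
    for rec in records:
        for key, value in map_fn(rec):
            buckets[range_partition(key)].append(value)
    print([sorted(buckets[r]) for r in range(R)])
    # low byte values land in low-numbered partitions: 0x05->0, 0x80->2, 0xF0->3
```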
SORT CONTINUED
5.4 EFFECT OF BACKUP TASKS
• With backup tasks disabled:
• After 960 seconds, all but 5 reduce tasks have completed
• The computation takes 1283 seconds to finish completely
• An increase of 44% in elapsed time
5.5 MACHINE FAILURES
• 200 worker processes were intentionally killed partway through the computation
• The sort still finishes in 933 seconds
• A 5% increase over the normal execution time
6 EXPERIENCE
• Extraction of data for popular queries
• Google Zeitgeist
• Extracting properties of web pages
• Geographical locations of web pages for localized search
• Clustering problems for Google News and Froogle products
• Large-scale machine learning problems and graph computations
6.1 LARGE SCALE INDEXING
• Production Indexing System
• Produces data structures for searches
• Completely rewritten with MapReduce
• What it does:
• Crawler gathers approx. 20 terabytes of data
• Indexing Process: 5-10 MapReduce operations
6.1 CONTINUED
• The indexing code is simpler
• One computation phase dropped from ≈3800 lines of C++ to ≈700 with MapReduce
• Improved Performance
• Separates unrelated computations
• Avoids extra passes over data
• Easier to Operate
• MapReduce handles issues without operator intervention
• Machine failures, slow machines, networking hiccups
7 RELATED WORK
• MapReduce can be viewed as a simplification of many systems' programming models, made adaptable and scalable
• It works off a restricted model in which data is stored locally on the workers' disks and computation is scheduled close to that data
• The backup task mechanism is similar to the eager scheduling of the Charlotte System, with the addition of the ability to skip bad records that cause repeated failures
CONCLUSION
• MapReduce has been used successfully at Google for many different purposes
• It is easy for programmers to use, even without a background in distributed and parallel systems
• It has been applied to sorting, data mining, machine learning, and many other problems
LESSONS LEARNED
• Restricting the programming model makes it easy to parallelize and distribute computations and to make them fault-tolerant.
• Network bandwidth is precious; many optimizations exist to save it.
• Redundant execution can be used to reduce the impact of slow machines and to handle machine failures and data loss.
MAPREDUCE WITH OPENSTACK
• Cloud Computing with Data Analytics
COMMON DEPLOYMENT
• Swift Storage joined to MapReduce cluster.
• Scalable storage mode to handle large amounts of data.
• This matters because annual data growth is roughly 60%
BEGINNER DEPLOYMENT
• Cloudera offers a MapReduce distribution
• It lets companies adopt big data incrementally
ADVANCED DEPLOYMENT
• Offers the best flexibility, scalability, and autonomy
• Build the cloud first, then add Swift and Nova
• Quantum (OpenStack networking) should be added for network segmentation
GOOGLE FILE SYSTEM
• Lets a third party handle storage while the company focuses on computational processing.
NEED TO KNOW
• Let employees grow with technology
• If you do not plan, you will most likely fail
WHAT WE WENT OVER
• What MapReduce is
• Programming Model: types
• Implementation: execution, master data structures, fault tolerance, locality, task granularity, and backup tasks
• Refinements: partitioning function, ordering guarantees, combiner functions, input & output types, side-effects, skipping bad records, local execution, status information, and counters
• Performance: cluster configuration, Grep, Sort, effect of backup tasks, and machine failures
• Experiences
• Related Work
• Conclusion
• How MapReduce is run with OpenStack
QUESTIONS/COMMENTS
REFERENCES