Download pdf - Map Reduce (Part 3) - Graz University of Technologykti.tugraz.at/staff/rkern/courses/dbase2/map_reduce_part... · 2018-11-26 · Map Reduce (Part 3) Databases 2 (VU) (706.711 / 707.030)

Map Reduce (Part 3)Databases 2 (VU) (706.711 / 707.030)

Mark Kroll

Institute of Interactive Systems and Data Science,Graz University of Technology

Nov. 26, 2018

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 1 / 45

Outline

1 Problems Suited for MapReduce

2 MapReduce: ApplicationsMatrix-Vector MultiplicationInformation Retrieval

3 Hadoop EcosystemDesigning a Big Data SystemBig Data Storage Technologies

Slides are partially based onSlides “Mining Massive Datasets” by Jure Leskovec.

“Limitations and Challenges of HDFS and MapReduce” by Weets et al., 2015.

Tutorial on “’Data-Intensive Text Processing with MapReduce’ by Lin, 2009.


Not all problems are suited for MapReduce

problems at hand need to be formulated and implemented as MapReduce program, i.e.parallelizable according to the divide&conquer idea, for instance

I analyzing textual data (word count statistics)I matrix-vector multiplications (PageRank)I relational algebra operations (Database query primitives such as joins)

in addition, files need to be largeI due to task setup time and scheduling overhead

and should be rarely updated in placeI not suitable when managing online sales, for instance, the principal operations on Amazon

data involve responding to searches for products, recording sales, and so on; processes thatinvolve relatively li�le calculation and that change the database


When MapReduce is not a suitable choice

when your processing requires a lot of data to be shu�led over the networkI remember the idea is to bring the algorithm to the data

real-time processing - when fast responses are needed

for complex algorithmsI for example, machine learning algorithms such as SVMs

processing graphs→ use dedicated frameworks such as Giraph


When MapReduce is not a suitable choice

when lot of data needs to be sorted / shu�led, e.g. the map phase generates too many keysI shu�ling is very time-consuming o�en overloading the networkI reason is that the reducers fetch all intermediate data at once as soon as the last mapper

finishes→ led to strategies such as virtual shu�ling or predictive scheduling

when iterations are neededI for example clustering algorithms such as K-Means→ use frameworks such as Spark

handling streaming dataI MapReduce is best suited to batch process hugh amounts of data→ use frameworks such as Storm, Flink


MapReduce: Typical Applications

matrix operations such as matrix-matrix and matrix-vector multiplications which fit nicelyinto the MapReduce programming model

I for example, to calculate the PageRank

another important class of operations that can use MapReduce e�ectively arerelational-algebra operations

I many operations on data can be described easily in terms of the common database-queryprimitives such as selection, union, intersection, joins

I which can be used for social network analysis, e.g. finding paths of di�erent lengths, countingfriends, etc.


Application 1: Matrix-Vector Multiplication

What do we need to do?Suppose we have an n× n matrixM, whose element in row i and column j will be denoted mij .Suppose we also have a vector v of length n, whose jth element is vj . Then the matrix-vectorproduct is the vector x of length n, whose ith element is given by

xi =n∑

j=1

mijvj

Outline a Map-Reduce program that calculates the vector x.


Matrix-Vector Multiplication



let us first assume that the vector v is large, but it still can fit into the memory

the matrix M and the vector v will be each stored in a file of the DFSassume that the row-column coordinates of a matrix element (indices) can be discovered

I for example, each value is stored as a triple (i, j,mij)

similarly, the position of vj can be discovered analogously



So how does the Map Function look like?

a map function applies to one element of the matrixMthe vector v is first read in its entirety and is available for all Map tasks at that computenode

from each matrix element mij the map function produces the key-value pair (i,mijvj)

all terms of the sum that make up the component xi of the matrix-vector product will getthe same key i



What about the Reduce Function?reduce function sums all the values associated with a given key i

result is a pair (i, xi)

thereby calculating one entry in the output vector x


Matrix-Vector Multiplication: Divide&Conquer

however, it might be that the vector v does not fit into main memoryI it is not required that the vector v fits into the memory at a compute node, but if it does not

there will be a very large number of disk accesses as we move pieces of the vector into mainmemory

alternatively we can divide the matrixM into vertical stripes of equal width and divide thevector into an equal number of horizontal stripes of the same height

→ use enough stripes so that the portion of the vector in one stripe can fit into mainmemory at a compute node



Figure: Divide matrixM and vector v into stripes



the ith stripe of matrixM multiplies only components from the ith stripe of the vector

can divide matrixM into one file for each stripe, and do the same for the vector veach Map task is assigned a chunk from one of the stripes in the matrix and gets the entirecorresponding stripe of the vector

Map and Reduce tasks can then act exactly as before

need to sum up once more the results of the stripes multiplication


Application 2: Information Retrieval

Information Retrieval (aka Search) in a Nutshellis the activity of obtaining information relevant to an information need

I for instance a search query you submit to Google

from a collection of information resourcesI for instance databases of texts, images or sounds

Information Retrieval is so to saythe Science of Searching for Information


Information Retrieval Process

taken from Eissa Alshari Semantic Arabic Information Retrieval Framework


Search Process - Step 1: Indexing

taken from h�ps://developer.apple.com/


Search Process - Step 2: Retrieval


Are Indexing or Retrieval MapReducable?

The Indexing ProblemScalability is critical

must be relatively fast, but need not be real time

fundamentally a batch operation

→ sounds like a Map-Reduce problem

The Retrieval Problemmust have sub-second response time

for the web, only need relatively few results

→ rather not a Map-Reduce problem


Index Construction with MapReduce

original purpose behind the MapReduce programming conceptMap over all documents

I emit term as key, (docNr., termFrequency) as valueI emit other information as necessary (e.g., term position)

Sort/shu�le: group postings by term

ReduceI gather and sort the postings (e.g., by docNr. or tf)I write postings to disk

→MapReduce does all the heavy li�ing


Index Construction with MapReduce: An Overview


including information on term position:


Apache provides a nice so�ware stack


Parts of the Hadoop Eco System

Yarn (Yet Another Resource Negotiator)

is a cluster management system to run Big Data applications on a cluster data; not a dataprocessing platform itself but enables the platforms to run their code in a clusterenvironment

HBase

open source, non-relational, distributed database modeled a�er Google’s BigTable

Hive

data warehouse infrastructure built on top of Hadoop for providing data summarization,query, and analysis

Pig

high-level platform for creating programs that run on Apache Hadoop (language is calledPig Latin)


Data Processing: From Batch to Streaming . . .

Batch Processing involves operating over a large, static dataset and returning the result at a later time when thecomputation is complete.Stream Processing systems compute over data as it enters the system.The term Microbatch is frequently used to describe scenarios where batches are small and/or processed at small intervals.Even though processing may happen as o�en as once every few minutes, data is still processed a batch at a time.


Big Data Processing Frameworks (1)

Batch-only frameworks:I Apache Hadoop can be considered a processing framework with MapReduce as its default

processing engine. Hadoop was the first big data framework to gain significant traction in theopen-source community.

Stream-only frameworks:I Apache Storm focuses on extremely low latency and is perhaps the best option for workloads

that require near real-time processing. It can handle very large quantities of data with anddeliver results with less latency than other solutions. Its topologies are composed of (i)Streams, (ii) Spouts and (iii) Bolts

I Apache Samza is tightly tied to the Apache Kafka messaging system. While Kafka can beused by many stream processing systems, Samza is designed specifically to take advantage ofKafka’s unique architecture and guarantees. It uses Kafka to provide fault tolerance, bu�ering,and state storage


Big Data Processing Frameworks (2)

Hybrid frameworks:I Apache Spark is a next generation batch processing framework with stream processing

capabilities; Spark focuses primarily on speeding up batch processing workloads by o�eringfull in-memory computation and processing optimization.

I Apache Flink a stream processing framework that can also handle batch tasks. It considersbatches to simply be data streams with finite boundaries, and thus treats batch processing as asubset of stream processing. This stream-first approach to all processing has a number ofinteresting side e�ects. This stream-first approach has been called the Kappa architecture.


Java(-ish) is the Hadoop Language


The Good, the Bad and the Ugly wrt. the Ecosystem

The GoodI Easy to write parallel, highly scalable applicationsI Stream and Batch processingI Seamless integration with other systems (e.g. RDBMS)

The BadI Rapid development, hard to keep overviewI 150 projects in (or near) Hadoop Eco System

F see https://hadoopecosystemtable.github.io/

The UglyI Maven dependency hell, if integrated with other systemsI Spark depends on > 50 libraries with a specific version!


https://hadoopecosystemtable.github.io/

System Design: An Overview

usually boils down to picking the right components wrt. Input, Processing + OutputMark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 30 / 45

System Design: An Overview

usually boils down to picking the right components wrt. Input, Processing + OutputMark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 31 / 45

So let’s design a system that

1 receives 5 million sensor measurements per minuteI 100 bytes per record ∼ 8MB/sek

2 stores measurements for 3 yearsI ∼ 240 TB per year, 720 TB in total

3 creates daily reportsI for any given date

4 detects faulty sensors in real-timeI alert the maintainers via SMS


System Design: Processing & Analyzing Sensor Data

. . . receives 5 million sensor measurements per minute



. . . stores measurements for 3 years



. . . creates daily reportsMark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 35 / 45


. . . detects faulty sensors in real-timeMark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 36 / 45

Big Data Storage Technologies

File-based: HDFSI distributed, permanent file storageI tuned for large filesI no indexing

Key-Value based: HBaseI distributed Key/Value storeI fast look-upsI based on HDFS

Message-based: KafkaI distributed Producer/Consumer messaging systemI data partitioned in topicsI producer groups / consumer groups


File-based: HDFS


File-based: HDFS


Key-Value based: HBase


Key-Value based: HBase


Message-based: Kafka


Message-based: Kafka


Map Reduce Lectures’ Recap:

Part 1:I Handling Big DataI Key Elements: MapReduce Framework & Distributed File System

Part 2:I Optimization: Maximizing ParallelismI Stragglers Problem & Input Data Skew

Part 3:I Suitability & ApplicationsI Hadoop Ecosystem & Storage Technologies


The EndNext Lecture: “Beyond Map-Reduce”