Map Reduce (Part 3)Databases 2 (VU) (706.711 / 707.030)
Mark Kroll
Institute of Interactive Systems and Data Science,Graz University of Technology
Nov. 26, 2018
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 1 / 45
Outline
1 Problems Suited for MapReduce
2 MapReduce: ApplicationsMatrix-Vector MultiplicationInformation Retrieval
3 Hadoop EcosystemDesigning a Big Data SystemBig Data Storage Technologies
Slides are partially based onSlides “Mining Massive Datasets” by Jure Leskovec.
“Limitations and Challenges of HDFS and MapReduce” by Weets et al., 2015.
Tutorial on “’Data-Intensive Text Processing with MapReduce’ by Lin, 2009.
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 2 / 45
Not all problems are suited for MapReduce
problems at hand need to be formulated and implemented as MapReduce program, i.e.parallelizable according to the divide&conquer idea, for instance
I analyzing textual data (word count statistics)I matrix-vector multiplications (PageRank)I relational algebra operations (Database query primitives such as joins)
in addition, files need to be largeI due to task setup time and scheduling overhead
and should be rarely updated in placeI not suitable when managing online sales, for instance, the principal operations on Amazon
data involve responding to searches for products, recording sales, and so on; processes thatinvolve relatively li�le calculation and that change the database
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 3 / 45
When MapReduce is not a suitable choice
when your processing requires a lot of data to be shu�led over the networkI remember the idea is to bring the algorithm to the data
real-time processing - when fast responses are needed
for complex algorithmsI for example, machine learning algorithms such as SVMs
processing graphs→ use dedicated frameworks such as Giraph
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 4 / 45
When MapReduce is not a suitable choice
when lot of data needs to be sorted / shu�led, e.g. the map phase generates too many keysI shu�ling is very time-consuming o�en overloading the networkI reason is that the reducers fetch all intermediate data at once as soon as the last mapper
finishes→ led to strategies such as virtual shu�ling or predictive scheduling
when iterations are neededI for example clustering algorithms such as K-Means→ use frameworks such as Spark
handling streaming dataI MapReduce is best suited to batch process hugh amounts of data→ use frameworks such as Storm, Flink
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 5 / 45
MapReduce: Typical Applications
matrix operations such as matrix-matrix and matrix-vector multiplications which fit nicelyinto the MapReduce programming model
I for example, to calculate the PageRank
another important class of operations that can use MapReduce e�ectively arerelational-algebra operations
I many operations on data can be described easily in terms of the common database-queryprimitives such as selection, union, intersection, joins
I which can be used for social network analysis, e.g. finding paths of di�erent lengths, countingfriends, etc.
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 6 / 45
Application 1: Matrix-Vector Multiplication
What do we need to do?Suppose we have an n× n matrixM, whose element in row i and column j will be denoted mij .Suppose we also have a vector v of length n, whose jth element is vj . Then the matrix-vectorproduct is the vector x of length n, whose ith element is given by
xi =n∑
j=1
mijvj
Outline a Map-Reduce program that calculates the vector x.
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 7 / 45
Matrix-Vector Multiplication
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 8 / 45
Matrix-Vector Multiplication
let us first assume that the vector v is large, but it still can fit into the memory
the matrix M and the vector v will be each stored in a file of the DFSassume that the row-column coordinates of a matrix element (indices) can be discovered
I for example, each value is stored as a triple (i, j,mij)
similarly, the position of vj can be discovered analogously
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 9 / 45
Matrix-Vector Multiplication
So how does the Map Function look like?
a map function applies to one element of the matrixMthe vector v is first read in its entirety and is available for all Map tasks at that computenode
from each matrix element mij the map function produces the key-value pair (i,mijvj)
all terms of the sum that make up the component xi of the matrix-vector product will getthe same key i
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 10 / 45
Matrix-Vector Multiplication
What about the Reduce Function?reduce function sums all the values associated with a given key i
result is a pair (i, xi)
thereby calculating one entry in the output vector x
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 11 / 45
Matrix-Vector Multiplication: Divide&Conquer
however, it might be that the vector v does not fit into main memoryI it is not required that the vector v fits into the memory at a compute node, but if it does not
there will be a very large number of disk accesses as we move pieces of the vector into mainmemory
alternatively we can divide the matrixM into vertical stripes of equal width and divide thevector into an equal number of horizontal stripes of the same height
→ use enough stripes so that the portion of the vector in one stripe can fit into mainmemory at a compute node
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 12 / 45
Matrix-Vector Multiplication: Divide&Conquer
Figure: Divide matrixM and vector v into stripes
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 13 / 45
Matrix-Vector Multiplication: Divide&Conquer
the ith stripe of matrixM multiplies only components from the ith stripe of the vector
can divide matrixM into one file for each stripe, and do the same for the vector veach Map task is assigned a chunk from one of the stripes in the matrix and gets the entirecorresponding stripe of the vector
Map and Reduce tasks can then act exactly as before
need to sum up once more the results of the stripes multiplication
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 14 / 45
Application 2: Information Retrieval
Information Retrieval (aka Search) in a Nutshellis the activity of obtaining information relevant to an information need
I for instance a search query you submit to Google
from a collection of information resourcesI for instance databases of texts, images or sounds
Information Retrieval is so to saythe Science of Searching for Information
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 15 / 45
Information Retrieval Process
taken from Eissa Alshari Semantic Arabic Information Retrieval Framework
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 16 / 45
Search Process - Step 1: Indexing
taken from h�ps://developer.apple.com/
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 17 / 45
Search Process - Step 2: Retrieval
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 18 / 45
Are Indexing or Retrieval MapReducable?
The Indexing ProblemScalability is critical
must be relatively fast, but need not be real time
fundamentally a batch operation
→ sounds like a Map-Reduce problem
The Retrieval Problemmust have sub-second response time
for the web, only need relatively few results
→ rather not a Map-Reduce problem
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 19 / 45
Index Construction with MapReduce
original purpose behind the MapReduce programming conceptMap over all documents
I emit term as key, (docNr., termFrequency) as valueI emit other information as necessary (e.g., term position)
Sort/shu�le: group postings by term
ReduceI gather and sort the postings (e.g., by docNr. or tf)I write postings to disk
→MapReduce does all the heavy li�ing
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 20 / 45
Index Construction with MapReduce: An Overview
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 21 / 45
including information on term position:
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 22 / 45
Apache provides a nice so�ware stack
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 23 / 45
Parts of the Hadoop Eco System
Yarn (Yet Another Resource Negotiator)
is a cluster management system to run Big Data applications on a cluster data; not a dataprocessing platform itself but enables the platforms to run their code in a clusterenvironment
HBase
open source, non-relational, distributed database modeled a�er Google’s BigTable
Hive
data warehouse infrastructure built on top of Hadoop for providing data summarization,query, and analysis
Pig
high-level platform for creating programs that run on Apache Hadoop (language is calledPig Latin)
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 24 / 45
Data Processing: From Batch to Streaming . . .
Batch Processing involves operating over a large, static dataset and returning the result at a later time when thecomputation is complete.Stream Processing systems compute over data as it enters the system.The term Microbatch is frequently used to describe scenarios where batches are small and/or processed at small intervals.Even though processing may happen as o�en as once every few minutes, data is still processed a batch at a time.
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 25 / 45
Big Data Processing Frameworks (1)
Batch-only frameworks:I Apache Hadoop can be considered a processing framework with MapReduce as its default
processing engine. Hadoop was the first big data framework to gain significant traction in theopen-source community.
Stream-only frameworks:I Apache Storm focuses on extremely low latency and is perhaps the best option for workloads
that require near real-time processing. It can handle very large quantities of data with anddeliver results with less latency than other solutions. Its topologies are composed of (i)Streams, (ii) Spouts and (iii) Bolts
I Apache Samza is tightly tied to the Apache Kafka messaging system. While Kafka can beused by many stream processing systems, Samza is designed specifically to take advantage ofKafka’s unique architecture and guarantees. It uses Kafka to provide fault tolerance, bu�ering,and state storage
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 26 / 45
Big Data Processing Frameworks (2)
Hybrid frameworks:I Apache Spark is a next generation batch processing framework with stream processing
capabilities; Spark focuses primarily on speeding up batch processing workloads by o�eringfull in-memory computation and processing optimization.
I Apache Flink a stream processing framework that can also handle batch tasks. It considersbatches to simply be data streams with finite boundaries, and thus treats batch processing as asubset of stream processing. This stream-first approach to all processing has a number ofinteresting side e�ects. This stream-first approach has been called the Kappa architecture.
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 27 / 45
Java(-ish) is the Hadoop Language
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 28 / 45
The Good, the Bad and the Ugly wrt. the Ecosystem
The GoodI Easy to write parallel, highly scalable applicationsI Stream and Batch processingI Seamless integration with other systems (e.g. RDBMS)
The BadI Rapid development, hard to keep overviewI 150 projects in (or near) Hadoop Eco System
F see https://hadoopecosystemtable.github.io/
The UglyI Maven dependency hell, if integrated with other systemsI Spark depends on > 50 libraries with a specific version!
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 29 / 45
System Design: An Overview
usually boils down to picking the right components wrt. Input, Processing + OutputMark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 30 / 45
System Design: An Overview
usually boils down to picking the right components wrt. Input, Processing + OutputMark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 31 / 45
So let’s design a system that
1 receives 5 million sensor measurements per minuteI 100 bytes per record ∼ 8MB/sek
2 stores measurements for 3 yearsI ∼ 240 TB per year, 720 TB in total
3 creates daily reportsI for any given date
4 detects faulty sensors in real-timeI alert the maintainers via SMS
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 32 / 45
System Design: Processing & Analyzing Sensor Data
. . . receives 5 million sensor measurements per minute
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 33 / 45
System Design: Processing & Analyzing Sensor Data
. . . stores measurements for 3 years
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 34 / 45
System Design: Processing & Analyzing Sensor Data
. . . creates daily reportsMark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 35 / 45
System Design: Processing & Analyzing Sensor Data
. . . detects faulty sensors in real-timeMark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 36 / 45
Big Data Storage Technologies
File-based: HDFSI distributed, permanent file storageI tuned for large filesI no indexing
Key-Value based: HBaseI distributed Key/Value storeI fast look-upsI based on HDFS
Message-based: KafkaI distributed Producer/Consumer messaging systemI data partitioned in topicsI producer groups / consumer groups
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 37 / 45
File-based: HDFS
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 38 / 45
File-based: HDFS
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 39 / 45
Key-Value based: HBase
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 40 / 45
Key-Value based: HBase
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 41 / 45
Message-based: Kafka
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 42 / 45
Message-based: Kafka
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 43 / 45
Map Reduce Lectures’ Recap:
Part 1:I Handling Big DataI Key Elements: MapReduce Framework & Distributed File System
Part 2:I Optimization: Maximizing ParallelismI Stragglers Problem & Input Data Skew
Part 3:I Suitability & ApplicationsI Hadoop Ecosystem & Storage Technologies
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 44 / 45
The EndNext Lecture: “Beyond Map-Reduce”
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 45 / 45