COSC 6397
Big Data Analytics
Advanced MapReduce
Edgar Gabriel
Spring 2017
Basic statistical operations
• Calculating minimum, maximum, mean, median, standard deviation
• Data typically multi-dimensional -> analytics can be based on one
or more dimensions of the data
– Same intermediate key used by all mappers
– Exploiting parallelism only on the map side – single reducer
often required
Image source: Hadoop MapReduce Cookbook, chapter 5.
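The single-reducer pattern above can be emulated without Hadoop: all mappers emit their values under the same intermediate key, so one reduce() invocation sees every value and can compute the aggregates in a single pass. A minimal plain-Java sketch (class and method names are illustrative, not from the cookbook; the median is omitted because it needs the sorted values):

```java
import java.util.Arrays;
import java.util.List;

public class BasicStats {
    // Emulates the single reduce() call that receives all values,
    // because every mapper emitted them under the same key.
    // Returns { min, max, mean, population standard deviation }.
    public static double[] reduce(List<Double> values) {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        double sum = 0.0, sumSq = 0.0;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
            sum += v;
            sumSq += v * v;
        }
        int n = values.size();
        double mean = sum / n;
        double stddev = Math.sqrt(sumSq / n - mean * mean);  // E[x^2] - mean^2
        return new double[] { min, max, mean, stddev };
    }

    public static void main(String[] args) {
        double[] r = reduce(Arrays.asList(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0));
        // prints: min=2.0 max=9.0 mean=5.0 stddev=2.0
        System.out.printf("min=%.1f max=%.1f mean=%.1f stddev=%.1f%n", r[0], r[1], r[2], r[3]);
    }
}
```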
Group-by operations
• Calculate basic operations per group
– Allows the use of more than one reducer
– Grouping based on key of the mapper step
• Example: calculate number of accesses to a webpage based on a
log-file
Image source: Hadoop MapReduce Cookbook, chapter 5.
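The webpage-access example can be emulated locally: the map step emits <URL, 1>, and the shuffle/reduce step sums the ones per URL. A hedged sketch (the URLs are made up for illustration):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class GroupByCount {
    // Emulates map (emit <URL, 1>) plus shuffle/reduce (sum per URL).
    public static Map<String, Integer> countByKey(String[] urls) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String url : urls) {
            counts.merge(url, 1, Integer::sum);  // reduce step: sum the ones
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] accesses = { "/index.html", "/countdown.html", "/index.html" };
        // prints: {/index.html=2, /countdown.html=1}
        System.out.println(countByKey(accesses));
    }
}
```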
Frequency distributions
• Arrangement of values that one or more variables take in
a sample
• Each entry in the table contains the number of
occurrences of values within a particular group
• Table summarizes the distribution of values in the
sample
• Example:
– Analyze the log file of a web server
– Sort the number of hits received by each URL in ascending
order
– Input example:
205.212.115.106 - [01/Jul/1995:00:00:00:12 -0400] "GET /countdown.html HTTP/1.0" 200 3985
Frequency distributions
• First MapReduce job counts the number of occurrences
of a URL
– Result of the MapReduce job: a file containing the list of
<URL> <no. of occurrences>
• Second MapReduce job
– Use the output of first MapReduce job as input
– Mapper: use <no of occurrences> as key and <URL> as
value
– Reducer: emits the <URL> entries in sorted order (the
<no. of occurrences> key can be omitted in the output file)
• Sorting is done implicitly by the Hadoop framework
Example output
Image source: Hadoop MapReduce Cookbook, chapter 5.
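The two chained jobs can be emulated in a few lines of plain Java: job 1 counts occurrences per URL, and job 2 swaps key and value so that sorting by key (what the Hadoop shuffle does implicitly) orders the URLs by hit count. A sketch under those assumptions:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FrequencySort {
    // Job 1: count occurrences per URL.
    public static Map<String, Integer> countJob(List<String> urls) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String u : urls) counts.merge(u, 1, Integer::sum);
        return counts;
    }

    // Job 2: swap <URL, count> to <count, URL>; the TreeMap stands in
    // for the framework's implicit sort of intermediate keys.
    public static List<String> sortJob(Map<String, Integer> counts) {
        TreeMap<Integer, List<String>> byCount = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet())
            byCount.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        List<String> ordered = new ArrayList<>();
        for (List<String> group : byCount.values()) ordered.addAll(group);
        return ordered;  // URLs in ascending order of hit count
    }

    public static void main(String[] args) {
        List<String> hits = List.of("/a", "/b", "/a", "/c", "/a", "/b");
        System.out.println(sortJob(countJob(hits)));  // prints: [/c, /b, /a]
    }
}
```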
Histograms
• Graphical representation of the distribution of data
• Estimate of the probability distribution of a continuous
variable
• Representation of tabulated frequencies, shown as
adjacent rectangles, erected over discrete intervals
– area proportional to the frequency of the observations in
the interval
• Example:
– Determine the number of accesses to the web server per
hour
Image source: Hadoop MapReduce Cookbook, chapter 5.
Histograms
• Map step uses the hour as the key and ‘one’ as the
value
• Reducer sums up the number of occurrences for each
hour
Image source: Hadoop MapReduce Cookbook, chapter 5.
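The accesses-per-hour histogram reduces to a group-by on the hour field of each log timestamp. A plain-Java sketch, assuming timestamps of the simplified form dd/Mon/yyyy:HH:MM:SS:

```java
import java.util.Map;
import java.util.TreeMap;

public class HourHistogram {
    // Map step: extract the hour and emit <hour, 1>;
    // reduce step: sum the ones per hour.
    public static Map<Integer, Integer> build(String[] timestamps) {
        Map<Integer, Integer> hist = new TreeMap<>();
        for (String ts : timestamps) {
            int hour = Integer.parseInt(ts.split(":")[1]);  // dd/Mon/yyyy:HH:MM:SS
            hist.merge(hour, 1, Integer::sum);
        }
        return hist;
    }

    public static void main(String[] args) {
        String[] ts = { "01/Jul/1995:00:00:06", "01/Jul/1995:00:00:09", "01/Jul/1995:13:42:00" };
        System.out.println(build(ts));  // prints: {0=2, 13=1}
    }
}
```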
Histograms

[Figure: histogram of accesses per hour of day; x-axis: hour 1-24, y-axis: number of accesses, 0 to 140,000]
Scatter Plots
• A scatter plot uses Cartesian coordinates to display the
values of two variables for a set of data
• Typically used when one variable is under the control of
the experimenter
– this parameter is systematically incremented and/or
decremented
– also called the control parameter or independent variable
– typically plotted along the horizontal axis
– The measured or dependent variable is customarily
plotted along the vertical axis
Scatter Plots
• Example: find the relationship between the size of web
pages and the number of hits
Image source: Hadoop MapReduce Cookbook, chapter 5.
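One way to produce the points for this plot, sketched in plain Java: take the response size from each log record, bucket it (here by kilobyte, an arbitrary choice), and count accesses per bucket; each resulting <bucket, count> pair is one point of the scatter plot. This is an illustrative emulation, not the cookbook's recipe:

```java
import java.util.Map;
import java.util.TreeMap;

public class ScatterPoints {
    // Map: emit <size-in-KB, 1>; reduce: sum per bucket.
    // Each resulting <bucket, count> entry is one scatter-plot point.
    public static Map<Integer, Integer> points(int[] responseSizes) {
        Map<Integer, Integer> pts = new TreeMap<>();
        for (int size : responseSizes) {
            pts.merge(size / 1024, 1, Integer::sum);
        }
        return pts;
    }

    public static void main(String[] args) {
        int[] sizes = { 3985, 4200, 512, 700, 3985 };
        System.out.println(points(sizes));  // prints: {0=2, 3=2, 4=1}
    }
}
```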
Scatter Plots
Image source: Hadoop MapReduce Cookbook, chapter 5.
Secondary Sorting
• MapReduce sorts intermediate key-value pairs by their
keys during the shuffle and sort phase
• Sometimes, additional sorting based on the values
would be useful
• Example: data from sensors
– Intermediate key-value pair:
key = mi, value = ( tj, ri)
with mi being a sensor id
tj being a time stamp
ri being the actual value
– Order of values for a given key is not in increasing order
of timestamps
Secondary Sorting (II)
• Solution: encode sensor id and time stamp in the key
key = mi, value = (tj, ri)  ->  key = (mi:tj), value = ri
but need to ensure that all keys containing mi end up at the
same reduce invocation!
• Implement three classes:
– Partitioner: which intermediate keys are sent to which
reducers
– SortComparator: decides how intermediate keys are sorted
– GroupComparator: decides which intermediate keys are
grouped into a single reduce method invocation
job.setPartitionerClass(SensorPartitioner.class);
job.setGroupingComparatorClass(KeyGroupingComparator.class);
job.setSortComparatorClass(CompositeKeyComparator.class);
Partitioner

public static class SensorPartitioner extends Partitioner<Text,Text> {
    @Override
    public int getPartition(Text key, Text val, int numReducers) {
        // sensor id is the first part of the composite key (mi:tj)
        String sensorId = key.toString().split(":")[0];
        return (sensorId.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}
• Data from one sensor will end up at the same reducer
• Since the keys are still different, the reduce method
will still be invoked separately for each key = (mi:tj)
SortComparator: determine order in which
keys are presented to the reducer
public class CompositeKeyComparator extends WritableComparator {
    protected CompositeKeyComparator() {
        super(Text.class, true);
    }
    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        String[] t1 = w1.toString().split(":");
        String[] t2 = w2.toString().split(":");
        // sorting based on sensor id
        int result = t1[0].compareTo(t2[0]);
        if (0 == result) {
            // sorting based on time stamp (ascending)
            result = Long.compare(Long.parseLong(t1[1]), Long.parseLong(t2[1]));
        }
        return result;
    }
}
GroupComparator: determine which keys are
grouped together in a single call to a reducer
public class KeyGroupingComparator extends WritableComparator {
    protected KeyGroupingComparator() {
        super(Text.class, true);
    }
    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        // group by sensor id only, ignoring the time stamp
        String s1 = w1.toString().split(":")[0];
        String s2 = w2.toString().split(":")[0];
        return s1.compareTo(s2);
    }
}
Graphical flow

input file:
s1 0800 x1
s1 0805 x2
s2 0920 x3
s2 0910 x4
s1 0715 x5
s3 1005 x6

map() emits intermediate key-value pairs:
s1:0800 x1   s1:0805 x2   s2:0920 x3   s2:0910 x4   s1:0715 x5   s3:1005 x6

SensorPartitioner sends all keys of one sensor to the same reducer:
reducer 1: s1:0800 x1, s1:0805 x2, s1:0715 x5
reducer 2: s2:0920 x3, s3:1005 x6, s2:0910 x4

CompositeKeyComparator sorts by sensor id, then by time stamp:
reducer 1: s1:0715 x5, s1:0800 x1, s1:0805 x2
reducer 2: s2:0910 x4, s2:0920 x3, s3:1005 x6

KeyGroupingComparator groups each sensor's keys into one reduce() call:
reduce(s1): x5, x1, x2
reduce(s2): x4, x3
reduce(s3): x6
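This flow can be reproduced in plain Java: compose (sensorId:timestamp) keys, sort them the way CompositeKeyComparator would, then group them the way KeyGroupingComparator would. A local sketch with no Hadoop involved (names are illustrative):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SecondarySortDemo {
    // Sort composite keys "sensor:time" by sensor id first, then time stamp,
    // then group records with the same sensor id -- the roles of
    // CompositeKeyComparator and KeyGroupingComparator, respectively.
    // Each record is a { compositeKey, value } pair.
    public static Map<String, List<String>> run(List<String[]> records) {
        List<String[]> sorted = new ArrayList<>(records);
        sorted.sort((a, b) -> {
            int c = a[0].split(":")[0].compareTo(b[0].split(":")[0]);      // sensor id
            return c != 0 ? c : a[0].split(":")[1].compareTo(b[0].split(":")[1]);  // time
        });
        Map<String, List<String>> groups = new LinkedHashMap<>();  // one reduce() per sensor
        for (String[] kv : sorted)
            groups.computeIfAbsent(kv[0].split(":")[0], k -> new ArrayList<>()).add(kv[1]);
        return groups;
    }

    public static void main(String[] args) {
        List<String[]> recs = List.of(
            new String[] { "s1:0800", "x1" }, new String[] { "s1:0805", "x2" },
            new String[] { "s2:0920", "x3" }, new String[] { "s2:0910", "x4" },
            new String[] { "s1:0715", "x5" }, new String[] { "s3:1005", "x6" });
        // prints: {s1=[x5, x1, x2], s2=[x4, x3], s3=[x6]}
        System.out.println(run(recs));
    }
}
```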
Inverted index
• Inverted index: An index data structure storing a
mapping from content to its locations in the original
document
• Necessary for fast full text searches, used on a large
scale in search engines
• Requires significant processing when a document is
added to the corpus
• Popular data structure used in document retrieval
systems
Example
• Two input files:
input.txt: The sample input file contains sample keywords.
input2.txt: Another input contains different keywords.
• Simple inverted index:
Another input2.txt
The input.txt
contains input.txt input2.txt
different input2.txt
file input.txt
input input.txt input2.txt
keywords input.txt input2.txt
sample input.txt
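The example index can be reproduced with a map from term to the set of files containing it; this emulates the map step (emit <term, fileName>) and the reduce step (collect the distinct file names per term) in plain Java. The whitespace/punctuation split is a simplification:

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class InvertedIndex {
    // Map: for each word in a document, emit <word, fileName>.
    // Reduce: collect the distinct file names per word.
    public static Map<String, Set<String>> build(Map<String, String> docs) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet())
            for (String word : doc.getValue().split("[\\s.,]+"))
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(doc.getKey());
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = Map.of(
            "input.txt", "The sample input file contains sample keywords.",
            "input2.txt", "Another input contains different keywords.");
        for (Map.Entry<String, Set<String>> e : build(docs).entrySet())
            System.out.println(e.getKey() + " " + String.join(" ", e.getValue()));
    }
}
```

Running this reproduces the eight index lines shown above (uppercase terms sort before lowercase ones under the default String ordering).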
MapReduce it?
• The indexing problem
– Scalability is critical
– Must be relatively fast, but need not be real time
– Fundamentally a batch operation
– Incremental updates may or may not be important
• The retrieval problem
– Must have sub-second response time
– For the web, only need relatively few results
MapReduce: Index Construction
• Map over all documents, assuming one document as
input to a mapper
– Emit term as key, (documentId, termFrequency) as value
– The document id can be derived from the input file name:
FileSplit fileSplit = (FileSplit) context.getInputSplit();
String fileName = fileSplit.getPath().getName();
– Mapper needs to be able to store all terms related to a
document in memory!
• Sort/shuffle: group postings by term
• Reduce
– Gather and sort the postings (e.g., by documentId or tf)
– Write postings to disk
• Fundamentally, a large sorting problem
Inverted Indexing: Pseudo-Code
Initial implementation: terms as keys, postings as values
• Reducers must buffer all postings associated with a key (in order to sort them)
• Can run out of memory to buffer postings
Alternative Solution
• Emit key: (t, docid) value: tf
• How is this different?
• Framework does the sorting
• Term frequency implicitly stored
• Directly write postings to disk
• Requires secondary sorting!
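With (term, docId) as the composite key and tf as the value, the framework delivers each term's postings already sorted, so the reducer can stream them out instead of buffering a whole postings list. A plain-Java sketch of just the key change (continuing the local-emulation style; names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class PostingsStream {
    // Keys are (term, docId) pairs; sorting them yields each term's
    // postings in docId order, so a reducer can write them as a stream
    // without holding a full postings list in memory.
    // Each emitted record is a { term, docId, tf } triple.
    public static List<String> sortedPostings(List<String[]> emitted) {
        List<String[]> pairs = new ArrayList<>(emitted);
        pairs.sort((a, b) -> {
            int c = a[0].compareTo(b[0]);              // primary sort: term
            return c != 0 ? c : a[1].compareTo(b[1]);  // secondary sort: docId
        });
        List<String> out = new ArrayList<>();
        for (String[] p : pairs) out.add(p[0] + " " + p[1] + ":" + p[2]);  // term doc:tf
        return out;
    }

    public static void main(String[] args) {
        List<String[]> emitted = List.of(
            new String[] { "input", "input2.txt", "1" },
            new String[] { "input", "input.txt", "1" },
            new String[] { "sample", "input.txt", "2" });
        // prints: [input input.txt:1, input input2.txt:1, sample input.txt:2]
        System.out.println(sortedPostings(emitted));
    }
}
```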