
COSC 6397
Big Data Analytics
Advanced MapReduce

Edgar Gabriel
Spring 2017

Basic statistical operations

• Calculating minimum, maximum, mean, median, standard deviation
• Data typically multi-dimensional -> analytics can be based on one or more dimensions of the data
  – Same intermediate key used by all mappers
  – Exploiting parallelism only on the map side – a single reducer is often required

Image source: Hadoop MapReduce Cookbook, chapter 5.


Group-by operations

• Calculate basic operations by group
  – Allows the use of more than one reducer
  – Grouping is based on the key emitted by the mapper step
• Example: calculate the number of accesses to a webpage based on a log file

Image source: Hadoop MapReduce Cookbook, chapter 5.
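The group-by pattern can be illustrated without a cluster. The following plain-Java sketch (class and method names are hypothetical, not part of Hadoop) simulates the map step (emit <URL, 1> per log entry) and the reduce step (sum the ones per key):

```java
import java.util.*;

public class GroupByCount {
    // Simulates map + reduce in one pass: count accesses per URL.
    public static Map<String, Integer> countAccesses(List<String> urls) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String url : urls) {                 // map: emit <url, 1>
            counts.merge(url, 1, Integer::sum);   // reduce: sum the ones per key
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> hits = List.of("/index.html", "/about.html", "/index.html");
        System.out.println(countAccesses(hits)); // {/about.html=1, /index.html=2}
    }
}
```

In a real job, the shuffle phase would route all pairs with the same URL to the same reducer; here the map plays that role.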

Frequency distributions

• Arrangement of the values that one or more variables take in a sample
• Each entry in the table contains the number of occurrences of values within a particular group
• The table summarizes the distribution of values in the sample
• Example:
  – Analyze the log file of a web server
  – Sort the number of hits received by each URL in ascending order
  – Input example:
    205.212.115.106 - [01/Jul/1995:00:00:12 -0400] "GET /countdown.html HTTP/1.0" 200 3985


Frequency distributions

• First MapReduce job counts the number of occurrences of each URL
  – Result of the job: a file containing the list of <URL> <no. of occurrences> pairs
• Second MapReduce job
  – Uses the output of the first job as its input
  – Mapper: uses <no. of occurrences> as the key and <URL> as the value
  – Reducer: writes each <URL> together with its <no. of occurrences> to the output file
• Sorting is done implicitly by the Hadoop framework

Example output

Image source: Hadoop MapReduce Cookbook, chapter 5.
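The key/value swap in the second job can be sketched in plain Java (a simulation, not the Hadoop code; names are hypothetical). A `TreeMap` keyed by the count stands in for the framework's shuffle-and-sort phase:

```java
import java.util.*;

public class FrequencySort {
    // Simulates the second MapReduce job: input <URL, count> pairs,
    // the mapper swaps them to <count, URL>, and the implicit sort
    // by key yields the URLs in ascending order of hits.
    public static List<String> sortByCount(Map<String, Integer> urlCounts) {
        TreeMap<Integer, List<String>> byCount = new TreeMap<>();  // "shuffle and sort"
        for (Map.Entry<String, Integer> e : urlCounts.entrySet()) {
            byCount.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        List<String> result = new ArrayList<>();                   // "reducer" output
        for (Map.Entry<Integer, List<String>> e : byCount.entrySet()) {
            for (String url : e.getValue()) {
                result.add(url + " " + e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(sortByCount(Map.of("/a.html", 5, "/b.html", 2)));
        // [/b.html 2, /a.html 5]
    }
}
```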


Histograms

• Graphical representation of the distribution of data
• Estimate of the probability distribution of a continuous variable
• Representation of tabulated frequencies, shown as adjacent rectangles erected over discrete intervals
  – The area of each rectangle is proportional to the frequency of the observations in the interval
• Example:
  – Determine the number of accesses to the web server per hour

Image source: Hadoop MapReduce Cookbook, chapter 5.

Histograms

• The map step uses the hour as the key and 'one' as the value
• The reducer sums up the number of occurrences for each hour

Image source: Hadoop MapReduce Cookbook, chapter 5.


Histograms

[Bar chart: number of accesses per hour of the day (hours 1–24); counts range from 0 to roughly 140,000]

Scatter Plots

• A scatter plot uses Cartesian coordinates to display the values of two variables for a set of data
• Typically used when one variable is under the control of the user
  – That parameter is systematically incremented and/or decremented
  – Also called the control parameter or independent variable; typically plotted along the horizontal axis
  – The measured or dependent variable is customarily plotted along the vertical axis


Scatter Plots

• Example: find the relationship between the size of web pages and the number of hits

Image source: Hadoop MapReduce Cookbook, chapter 5.

Scatter Plots

Image source: Hadoop MapReduce Cookbook, chapter 5.
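Preparing the scatter-plot data follows the same map/reduce shape. A plain-Java sketch (a simulation, with hypothetical names) emits the response size (last field of the log line) as the key and 'one' as the value, then sums the hits per size:

```java
import java.util.*;

public class ScatterPlotData {
    // Map: emit <response size, 1> per log line; reduce: sum hits per size.
    // Each resulting (size, hits) entry is one point of the scatter plot.
    public static Map<Long, Integer> hitsBySize(List<String> logLines) {
        Map<Long, Integer> points = new TreeMap<>();
        for (String line : logLines) {
            String[] fields = line.trim().split("\\s+");
            long size = Long.parseLong(fields[fields.length - 1]);  // last field: bytes
            points.merge(size, 1, Integer::sum);
        }
        return points;
    }

    public static void main(String[] args) {
        List<String> log = List.of(
            "a - [t] \"GET /x HTTP/1.0\" 200 3985",
            "b - [t] \"GET /y HTTP/1.0\" 200 3985",
            "c - [t] \"GET /z HTTP/1.0\" 200 120");
        System.out.println(hitsBySize(log)); // {120=1, 3985=2}
    }
}
```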


Secondary Sorting

• MapReduce sorts intermediate key-value pairs by their keys during the shuffle and sort phase
• Sometimes, additional sorting based on the values would be useful
• Example: data from sensors
  – Intermediate key-value pair:
    key = mi, value = (tj, ri)
    with mi being a sensor id, tj a time stamp, and ri the actual measured value
  – The order of the values for a given key is not guaranteed to be in increasing order of time stamps

Secondary Sorting (II)

• Solution: encode sensor id and time stamp in the key
    key = mi, value = (tj, ri)  ->  key = (mi:tj), value = ri
  but we need to ensure that all keys containing mi end up at the same reduce invocation!
• Implement three classes:
  – Partitioner: decides which intermediate keys are sent to which reducers
  – SortComparator: decides how intermediate keys are sorted
  – GroupComparator: decides which intermediate keys are grouped into a single reduce method invocation

job.setPartitionerClass(SensorPartitioner.class);
job.setGroupingComparatorClass(KeyGroupingComparator.class);
job.setSortComparatorClass(CompositeKeyComparator.class);


Partitioner

public static class SensorPartitioner extends Partitioner<Text,Text> {
    @Override
    public int getPartition(Text key, Text val, int numReducers) {
        // key has the form <sensorId>:<timestamp>
        String[] parts = key.toString().split(":");
        int sensorId = Integer.parseInt(parts[0]);
        return sensorId % numReducers;
    }
}

• Data from one sensor will end up at the same reducer
• Since the keys are still different, the reduce method will still be invoked separately for each key = (mi:tj)

SortComparator: determines the order in which keys are presented to the reducer

public class CompositeKeyComparator extends WritableComparator {
    protected CompositeKeyComparator() {
        super(Text.class, true);
    }
    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        String[] t1 = w1.toString().split(":");
        String[] t2 = w2.toString().split(":");
        // sorting based on the sensor id ...
        int result = Integer.compare(Integer.parseInt(t1[0]), Integer.parseInt(t2[0]));
        if (result == 0) {
            // ... then sorting based on the time stamp
            result = Double.compare(Double.parseDouble(t1[1]), Double.parseDouble(t2[1]));
        }
        return result;
    }
}


GroupComparator: determines which keys are grouped together in a single call to the reducer

public class KeyGroupingComparator extends WritableComparator {
    protected KeyGroupingComparator() {
        super(Text.class, true);
    }
    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        // compare only the sensor id part of the composite key
        int s1 = Integer.parseInt(w1.toString().split(":")[0]);
        int s2 = Integer.parseInt(w2.toString().split(":")[0]);
        return Integer.compare(s1, s2);
    }
}

Graphical flow

Input file:
    s1 0800 x1
    s1 0805 x2
    s2 0920 x3
    s2 0910 x4
    s1 0715 x5
    s3 1005 x6

map(): the mappers emit composite intermediate keys:
    s1:0800 x1, s1:0805 x2, s2:0920 x3, s2:0910 x4, s1:0715 x5, s3:1005 x6

SensorPartitioner: all keys of one sensor go to the same reducer:
    reducer 1: s1:0800 x1, s1:0805 x2, s1:0715 x5
    reducer 2: s2:0920 x3, s2:0910 x4, s3:1005 x6

CompositeKeyComparator: keys are sorted by sensor id, then by time stamp:
    reducer 1: s1:0715 x5, s1:0800 x1, s1:0805 x2
    reducer 2: s2:0910 x4, s2:0920 x3, s3:1005 x6

KeyGroupingComparator: keys with the same sensor id are grouped into one reduce() call:
    reduce(s1): x5, x1, x2
    reduce(s2): x4, x3
    reduce(s3): x6
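The sort-then-group behavior can be reproduced in plain Java (a simulation of the comparators' effect, not Hadoop code; names are hypothetical and numeric sensor ids are assumed, matching the partitioner above):

```java
import java.util.*;

public class SecondarySortDemo {
    // Sort composite keys "sensorId:timestamp" by sensor id, then timestamp
    // (the CompositeKeyComparator's job), and group them per sensor id
    // (the KeyGroupingComparator's job).
    public static Map<Integer, List<String>> groupSorted(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort(Comparator
            .comparingInt((String k) -> Integer.parseInt(k.split(":")[0]))  // sensor id
            .thenComparingInt(k -> Integer.parseInt(k.split(":")[1])));     // timestamp
        Map<Integer, List<String>> groups = new LinkedHashMap<>();
        for (String k : sorted) {
            groups.computeIfAbsent(Integer.parseInt(k.split(":")[0]),
                                   s -> new ArrayList<>()).add(k);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("1:0800", "1:0805", "2:0920", "2:0910", "1:0715");
        System.out.println(groupSorted(keys));
        // {1=[1:0715, 1:0800, 1:0805], 2=[2:0910, 2:0920]}
    }
}
```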


Inverted index

• Inverted index: an index data structure storing a mapping from content to its locations in the original documents
• Necessary for fast full-text searches; used on a large scale in search engines
• Requires significant processing when a document is added to the corpus
• Popular data structure used in document retrieval systems

Example

• Two input files:
  Input.txt:  The sample input file contains sample keywords.
  Input2.txt: Another input contains different keywords.
• Simple inverted index:
  Another    input2.txt
  The        input.txt
  contains   input.txt input2.txt
  different  input2.txt
  file       input.txt
  input      input.txt input2.txt
  keywords   input.txt input2.txt
  sample     input.txt
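The example index above can be built with a small plain-Java sketch (a simulation of the map and reduce logic, with hypothetical names): map each word of each document to <word, fileName>, then collect the set of files per word.

```java
import java.util.*;

public class InvertedIndexDemo {
    // Map: emit <word, fileName> for every word of every document;
    // Reduce: collect the (sorted) set of file names per word.
    public static Map<String, Set<String>> build(Map<String, String> docs) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String word : doc.getValue().split("[\\s.,]+")) {
                if (!word.isEmpty()) {
                    index.computeIfAbsent(word, w -> new TreeSet<>()).add(doc.getKey());
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = Map.of(
            "input.txt", "The sample input file contains sample keywords.",
            "input2.txt", "Another input contains different keywords.");
        System.out.println(build(docs).get("keywords")); // [input.txt, input2.txt]
    }
}
```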


MapReduce it?

• The indexing problem

– Scalability is critical

– Must be relatively fast, but need not be real time

– Fundamentally a batch operation

– Incremental updates may or may not be important

• The retrieval problem

– Must have sub-second response time

– For the web, only need relatively few results

MapReduce: Index Construction

• Map over all documents, assuming one document is the input to a mapper
  – Emit the term as key, (documentId, termFrequency) as value

    FileSplit fileSplit = (FileSplit) context.getInputSplit();
    String fileName = fileSplit.getPath().getName();

  – The mapper needs to be able to store all terms of a document in memory!
• Sort/shuffle: group postings by term
• Reduce
  – Gather and sort the postings (e.g., by documentId or tf)
  – Write postings to disk
• Fundamentally, a large sorting problem


Inverted Indexing: Pseudo-Code

• Initial implementation: terms as keys, postings as values
• Reducers must buffer all postings associated with a key (in order to sort them)
• Can run out of memory buffering the postings


Alternative Solution

• Emit key: (t, docid), value: tf
• How is this different?
  – The framework does the sorting
  – The term frequency is implicitly stored
  – Postings can be written directly to disk
  – Requires secondary sorting!
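The alternative mapper can be sketched in plain Java (a simulation, with hypothetical names; "#" is an arbitrary composite-key separator chosen here). It emits ((term, docId) -> tf) pairs, leaving the sorting to the framework:

```java
import java.util.*;

public class TermDocEmitter {
    // Sketch of the alternative mapper: emit ((term, docId) -> tf), so that
    // the framework, with secondary sorting on docId, delivers each term's
    // postings in order and the reducer can stream them to disk.
    public static Map<String, Integer> emit(String docId, String text) {
        Map<String, Integer> tf = new TreeMap<>();   // local term frequencies
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) tf.merge(term, 1, Integer::sum);
        }
        Map<String, Integer> out = new TreeMap<>();  // composite key -> tf
        tf.forEach((term, f) -> out.put(term + "#" + docId, f));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(emit("doc1", "big data big analytics"));
        // {analytics#doc1=1, big#doc1=2, data#doc1=1}
    }
}
```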