Introduction to RHadoop
Master's Degree in Informatics Engineering
Master's Programme in ICT Innovation: Data Science (EIT ICT Labs Master School)
Academic Year 2015-2016




Page 2: IntroducontoRHadooplaurel.datsi.fi.upm.es/_media/docencia/asignaturas/... · IntroducontoRHadoop Master’s(Degree(in(Informacs(Engineering(Master’s(Programme(in(ICT(Innovaon:(DataScience((EIT(ICT(Labs(Master(School)(Academic(Year(2015G2106

Contents

• Introduction to…
• MapReduce
• HDFS
• Hadoop
• Data Analytics with RHadoop


MapReduce & DQ

• Divide and Conquer (DQ)
• General idea
  • Divide a problem into smaller sub-problems
  • Solve each sub-problem (independently)
  • Combine the solutions


DQ: pseudo-code

function DQ(X: problem data)
    if small(X) then
        S = easy(X)
    else
        divide(X) => (X1, ..., Xk)
        for i = 1 to k do
            Si = DQ(Xi)
        S = combine(S1, ..., Sk)
    return S
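As a concrete instance of the scheme above, here is a minimal merge sort (sketched in Python rather than R so it runs stand-alone; an added illustration, not from the slides): small(X) is a list of at most one element, divide splits the list in half, and combine merges the two sorted halves.

```python
def merge_sort(xs):
    # small(X): a list of 0 or 1 elements is already sorted (easy case)
    if len(xs) <= 1:
        return xs
    # divide(X) => (X1, X2): split the list in half
    mid = len(xs) // 2
    left, right = merge_sort(xs[:mid]), merge_sort(xs[mid:])
    # combine(S1, S2): merge the two sorted halves
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

print(merge_sort([5, 2, 9, 1, 5, 6]))  # → [1, 2, 5, 5, 6, 9]
```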



DQ: efficiency

• Efficiency of this approach
  • An appropriate threshold must be selected for applying easy(X)
  • The decomposition and combination functions must be efficient
  • The sub-problems must be (approximately) of the same size
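The cost of uneven splitting can be seen with a small sketch (in Python, added here as an illustration): halving the problem at every step gives logarithmic recursion depth, while peeling off one element at a time gives linear depth.

```python
def depth_balanced(n):
    """Recursion depth when each step halves the problem (balanced split)."""
    depth = 0
    while n > 1:
        n //= 2
        depth += 1
    return depth

def depth_unbalanced(n):
    """Recursion depth when each step only peels off one element."""
    return n - 1 if n > 1 else 0

# For n = 1024: the balanced split recurses 10 levels deep,
# the unbalanced one 1023 levels deep.
print(depth_balanced(1024), depth_unbalanced(1024))  # → 10 1023
```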


DQ: Remarks

• It cannot be applied to every type of problem
• Sometimes it is not obvious how to divide a large problem into sub-problems
• If the division is uneven, the system will be unbalanced, which has an important impact on the overall performance of the algorithm
• The sub-problems must be significantly smaller than the original one, so that a massively parallel computer can be used and the communication overhead is compensated


MapReduce: general scheme

Source: www.academia.edu


MapReduce: more detail

Source: Hadoop Book


MapReduce: example

Source: MilanoR


Hadoop Distributed File System (HDFS)

• A distributed file system that evolved from Google's implementation (GFS)
• Fault-tolerant: files are divided into chunks, which are distributed and replicated across the cluster
  • Normally, the replication factor is 3
• A Master Node stores the meta-data: which files exist, into how many chunks they are divided, and where those chunks are stored
• Large block sizes are preferred (128 MB by default)
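As a quick worked example of these defaults (an added Python sketch; the 128 MB block size and replication factor 3 are the values quoted above): a 1 GB file is split into 8 blocks, and the cluster stores 24 block copies in total.

```python
import math

def hdfs_blocks(file_size_mb, block_size_mb=128, replication=3):
    """Number of HDFS blocks for a file, and total stored block copies."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    return blocks, blocks * replication

# A 1 GB (1024 MB) file: 8 blocks, 24 replicated copies across the cluster.
print(hdfs_blocks(1024))  # → (8, 24)
```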


Hadoop Distributed File System (HDFS)

Source: Hadoop tutorial


Hadoop Distributed File System (HDFS)

• In HDFS, blocks should be read from beginning to end (this favors the MapReduce approach)
• Files in HDFS ARE NOT stored alongside the host system's files
  • HDFS is normally an abstraction OVER an existing file system (ext3, ext4, etc.)
  • Thus, there are specific commands to manipulate the HDFS file system
• To open a file stored in HDFS, the client must contact the NameNode to retrieve the location of each block of the file (at the DataNodes)
  • Parallel reads are possible (and preferred)


Hadoop Distributed File System (HDFS)

• Data locality: normally, when a job is launched, it runs on the same node that stores the data it must manipulate
• The meta-data stored in the NameNode is not automatically replicated (this must be done manually or with an inactive NameNode)


HDFS from the command line

• Each user of HDFS has a personal directory
• No security directives are implemented, so users can write anywhere

• Access to HDFS is through the hdfs command:
  hdfs dfs command

• Important commands
  • -copyFromLocal vs. -copyToLocal
  • -mkdir
  • -cp, -mv

• Documentation on the Hadoop website


Hadoop MRv1 vs YARN (MRv2)

• Hadoop MRv1
  • Resource management and task scheduling and monitoring are done by a single process (a bottleneck): the JobTracker
  • Each sub-problem is run by an independent process: a TaskTracker

• Hadoop MRv2
  • Resource management and task scheduling and monitoring are split into different processes
    • Resource Manager (RM): overall resource management
    • Application Master (AM): per-job task scheduling and monitoring
  • A NodeManager runs the tasks at each computing node


Hadoop MRv1 vs YARN (MRv2)


Example: wordcount

• Input: a document made up of words
• Output: a set of (word, count(word)) pairs
• Two functions: map and reduce

map(k1, v1):
    for each word w in v1
        emit(w, 1)

reduce(k2, v2_list):
    result = 0
    for each v in v2_list
        result += v
    emit(k2, result)
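The two functions above can be simulated outside Hadoop in a few lines of Python (an added sketch; in real MapReduce the framework performs the shuffle step across the cluster between the two phases):

```python
from collections import defaultdict

def map_fn(v1):
    # emit (w, 1) for each word in the input value
    return [(w, 1) for w in v1.split()]

def shuffle(pairs):
    # group values by key, as the framework does between map and reduce
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_fn(k2, v2_list):
    # emit (k2, sum of the counts)
    return (k2, sum(v2_list))

doc = "to be or not to be"
pairs = map_fn(doc)
counts = dict(reduce_fn(k, vs) for k, vs in shuffle(pairs).items())
print(counts)  # → {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```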


Example: wordcount


Example: wordcount


RHadoop

• Developed by Revolution Analytics (acquired by Microsoft)
• Three main components
  • rhdfs: R + HDFS
  • rmr2: R + MapReduce
  • rhbase: R + HBase

• Can be downloaded from: https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads

• Already installed and configured in the VM provided…


RHadoop: interacting with HDFS

# Load the rhdfs library
library(rhdfs)

# Start rhdfs
hdfs.init()

# Basic "ls"; the path is mandatory
hdfs.ls("/user/hadoop")

# Create a directory
work.dir <- "/user/hadoop/aux/"
hdfs.mkdir(work.dir)

# And delete it
hdfs.delete(work.dir)

# Create it again
hdfs.mkdir(work.dir)


RHadoop: wordcount example

• Library loading and initialization

# Loading the RHadoop libraries
library('rhdfs')
library('rmr2')

# Initializing RHadoop
hdfs.init()


RHadoop: wordcount example

wordcount = function(input,
                     # The output can be an HDFS path, but if it is NULL
                     # a temporary file will be generated and wrapped in
                     # a big data object, like the ones generated by to.dfs
                     output = NULL,
                     pattern = " ") {

    # Defining the wordcount Map function
    wc.map = function(., lines) {
        keyval(unlist(strsplit(x = lines, split = pattern)), 1)
    }

    # Defining the wordcount Reduce function
    wc.reduce = function(word, counts) {
        keyval(word, sum(counts))
    }


RHadoop: wordcount example

    # Defining the MapReduce parameters by calling the mapreduce function
    mapreduce(input = input,
              output = output,
              # You can specify your own input and output formats
              # and produce binary formats with the functions
              # make.input.format and make.output.format
              input.format = "text",
              map = wc.map,
              reduce = wc.reduce,
              # With combiner
              combine = TRUE)
}


RHadoop: wordcount example

# Running the MapReduce job by passing the HDFS
# input directory location as a parameter
wordcount('/user/hadoop/wordcount/quijote.txt')

# Retrieving the RHadoop MapReduce output data
# by passing the output directory location as a parameter
from.dfs("/tmp/file1b0817a5bcd0")

• El Quijote can be downloaded from: http://www.gutenberg.org/cache/epub/996/pg996.txt


RHadoop: airline example

• We will analyze the commercial data of an airline
• The input data file is a CSV
• We will need a custom input formatter to ease the task of processing the file

• Data can be downloaded from: http://stat-computing.org/dataexpo/2009/1987.csv.bz2


RHadoop: airline example

library(rmr2)
library('rhdfs')

hdfs.init()

# Put the data in HDFS
hdfs.data.root = '/user/hadoop/rhadoop/airline'
hdfs.data = file.path(hdfs.data.root, 'data')
hdfs.mkdir(hdfs.data)

hdfs.put("/home/hadoop/Downloads/1987.csv", hdfs.data)

hdfs.out = file.path(hdfs.data.root, 'out')


RHadoop: airline example (input format)

#
# asa.csv.input.format() - read CSV data files and label field names
# for better code readability (especially in the mapper)
#
asa.csv.input.format = make.input.format(
    format = 'csv', mode = 'text', streaming.format = NULL, sep = ',',
    col.names = c('Year', 'Month', 'DayofMonth', 'DayOfWeek',
                  'DepTime', 'CRSDepTime', 'ArrTime', 'CRSArrTime',
                  'UniqueCarrier', 'FlightNum', 'TailNum',
                  'ActualElapsedTime', 'CRSElapsedTime', 'AirTime',
                  'ArrDelay', 'DepDelay', 'Origin', 'Dest', 'Distance',
                  'TaxiIn', 'TaxiOut', 'Cancelled', 'CancellationCode',
                  'Diverted', 'CarrierDelay', 'WeatherDelay',
                  'NASDelay', 'SecurityDelay', 'LateAircraftDelay'),
    stringsAsFactors = F)


RHadoop: airline example (mapper 1/2)

#
# the mapper gets keys and values from the input formatter
# in our case, the key is NULL and the value is a data.frame from read.table()
#
mapper.year.market.enroute_time = function(key, val.df) {

    # Remove header lines, cancellations, and diversions:
    val.df = subset(val.df, Year != 'Year' & Cancelled == 0 & Diverted == 0)

    # We don't care about the direction of travel, so construct a new 'market'
    # vector with airports ordered alphabetically (e.g., LAX to JFK becomes 'JFK-LAX')
    market = with(val.df, ifelse(Origin < Dest,
                                 paste(Origin, Dest, sep='-'),
                                 paste(Dest, Origin, sep='-')))


RHadoop: airline example (mapper 2/2)

    # the key consists of year and market
    output.key = data.frame(year = as.numeric(val.df$Year), market = market,
                            stringsAsFactors = F)

    # emit a data.frame of gate-to-gate elapsed times (CRS and actual) + time in air
    output.val = val.df[, c('CRSElapsedTime', 'ActualElapsedTime', 'AirTime')]
    colnames(output.val) = c('scheduled', 'actual', 'inflight')

    # and finally, make sure they're numeric while we're at it
    output.val = transform(output.val,
                           scheduled = as.numeric(scheduled),
                           actual = as.numeric(actual),
                           inflight = as.numeric(inflight))

    return(keyval(output.key, output.val))
}
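The 'market' construction in the mapper (direction-independent airport pairs) can be checked with a tiny Python sketch (added here as an illustration; the R code achieves the same effect with ifelse and paste):

```python
def market(origin, dest):
    # order the two airport codes alphabetically so that both
    # directions of travel map to the same market key
    return '-'.join(sorted([origin, dest]))

# LAX→JFK and JFK→LAX both become 'JFK-LAX'
print(market('LAX', 'JFK'), market('JFK', 'LAX'))  # → JFK-LAX JFK-LAX
```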


RHadoop: airline example (reducer)

#
# the reducer gets all the values for a given key
# the values (which may be multi-valued, as here) come in the form of a data.frame
#
reducer.year.market.enroute_time = function(key, val.df) {

    output.key = key
    output.val = data.frame(flights = nrow(val.df),
                            scheduled = mean(val.df$scheduled, na.rm=T),
                            actual = mean(val.df$actual, na.rm=T),
                            inflight = mean(val.df$inflight, na.rm=T))

    return(keyval(output.key, output.val))
}


RHadoop: final configuration and execution

mr.year.market.enroute_time = function(input, output) {
    mapreduce(input = input,
              output = output,
              input.format = asa.csv.input.format,
              map = mapper.year.market.enroute_time,
              reduce = reducer.year.market.enroute_time,
              backend.parameters = list(
                  hadoop = list(D = "mapred.reduce.tasks=2")
              ),
              verbose = TRUE)
}

out = mr.year.market.enroute_time(hdfs.data, hdfs.out)


RHadoop: gathering results

results = from.dfs(out)
results.df = as.data.frame(results, stringsAsFactors = F)
colnames(results.df) = c('year', 'market', 'flights', 'scheduled', 'actual', 'inflight')

print(head(results.df))

# save(results.df, file = "out/enroute.time.market.RData")