Learning Hadoop 2

Table of Contents
Learning Hadoop 2
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Introduction
A note on versioning
The background of Hadoop
Components of Hadoop
Common building blocks
Storage
Computation
Better together
Hadoop 2 – what's the big deal?
Storage in Hadoop 2

Computation in Hadoop 2
Distributions of Apache Hadoop
A dual approach
AWS – infrastructure on demand from Amazon
Simple Storage Service (S3)
Elastic MapReduce (EMR)
Getting started
Cloudera QuickStart VM
Amazon EMR
Creating an AWS account
Signing up for the necessary services
Using Elastic MapReduce
Getting Hadoop up and running
How to use EMR
AWS credentials
The AWS command-line interface
Running the examples
Data processing with Hadoop
Why Twitter?
Building our first dataset
One service, multiple APIs
Anatomy of a Tweet
Twitter credentials
Programmatic access with Python
Summary
2. Storage
The inner workings of HDFS
Cluster startup
NameNode startup
DataNode startup
Block replication

Command-line access to the HDFS filesystem
Exploring the HDFS filesystem
Protecting the filesystem metadata
Secondary NameNode not to the rescue
Hadoop 2 NameNode HA
Keeping the HA NameNodes in sync
Client configuration
How a failover works
Apache ZooKeeper – a different type of filesystem
Implementing a distributed lock with sequential ZNodes
Implementing group membership and leader election using ephemeral ZNodes
Java API
Building blocks
Further reading
Automatic NameNode failover
HDFS snapshots
Hadoop filesystems
Hadoop interfaces
Java FileSystem API
Libhdfs
Thrift
Managing and serializing data
The Writable interface
Introducing the wrapper classes
Array wrapper classes
The Comparable and WritableComparable interfaces
Storing data
Serialization and Containers
Compression
General-purpose file formats
Column-oriented data formats

RCFile
ORC
Parquet
Avro
Using the Java API
Summary
3. Processing – MapReduce and Beyond
MapReduce
Java API to MapReduce
The Mapper class
The Reducer class
The Driver class
Combiner
Partitioning
The optional partition function
Hadoop-provided mapper and reducer implementations
Sharing reference data
Writing MapReduce programs
Getting started
Running the examples
Local cluster
Elastic MapReduce
WordCount, the Hello World of MapReduce
Word co-occurrences
Trending topics
The Top N pattern
Sentiment of hashtags
Text cleanup using chain mapper
Walking through a run of a MapReduce job
Startup
Splitting the input

Task assignment
Task startup
Ongoing JobTracker monitoring
Mapper input
Mapper execution
Mapper output and reducer input
Reducer input
Reducer execution
Reducer output
Shutdown
Input/Output
InputFormat and RecordReader
Hadoop-provided InputFormat
Hadoop-provided RecordReader
OutputFormat and RecordWriter
Hadoop-provided OutputFormat
Sequence files
YARN
YARN architecture
The components of YARN
Anatomy of a YARN application
Lifecycle of a YARN application
Fault tolerance and monitoring
Thinking in layers
Execution models
YARN in the real world – Computation beyond MapReduce
The problem with MapReduce
Tez
Hive-on-tez
Apache Spark
Apache Samza

YARN-independent frameworks
YARN today and beyond
Summary
4. Real-time Computation with Samza
Stream processing with Samza
How Samza works
Samza high-level architecture
Samza's best friend – Apache Kafka
YARN integration
An independent model
Hello Samza!
Building a tweet parsing job
The configuration file
Getting Twitter data into Kafka
Running a Samza job
Samza and HDFS
Windowing functions
Multijob workflows
Tweet sentiment analysis
Bootstrap streams
Stateful tasks
Summary
5. Iterative Computation with Spark
Apache Spark
Cluster computing with working sets
Resilient Distributed Datasets (RDDs)
Actions
Deployment
Spark on YARN
Spark on EC2
Getting started with Spark

Writing and running standalone applications
Scala API
Java API
WordCount in Java
Python API
The Spark ecosystem
Spark Streaming
GraphX
MLlib
Spark SQL
Processing data with Apache Spark
Building and running the examples
Running the examples on YARN
Finding popular topics
Assigning a sentiment to topics
Data processing on streams
State management
Data analysis with Spark SQL
SQL on data streams
Comparing Samza and Spark Streaming
Summary
6. Data Analysis with Apache Pig
An overview of Pig
Getting started
Running Pig
Grunt – the Pig interactive shell
Elastic MapReduce
Fundamentals of Apache Pig
Programming Pig
Pig data types
Pig functions

Load/store
Eval
The tuple, bag, and map functions
The math, string, and datetime functions
Dynamic invokers
Macros
Working with data
Filtering
Aggregation
Foreach
Join
Extending Pig (UDFs)
Contributed UDFs
Piggybank
Elephant Bird
Apache DataFu
Analyzing the Twitter stream
Prerequisites
Dataset exploration
Tweet metadata
Data preparation
Top n statistics
Datetime manipulation
Sessions
Capturing user interactions
Link analysis
Influential users
Summary
7. Hadoop and SQL
Why SQL on Hadoop
Other SQL-on-Hadoop solutions

Prerequisites
Overview of Hive
The nature of Hive tables
Hive architecture
Data types
DDL statements
File formats and storage
JSON
Avro
Columnar stores
Queries
Structuring Hive tables for given workloads
Partitioning a table
Overwriting and updating data
Bucketing and sorting
Sampling data
Writing scripts
Hive and Amazon Web Services
Hive and S3
Hive on Elastic MapReduce
Extending HiveQL
Programmatic interfaces
JDBC
Thrift
Stinger initiative
Impala
The architecture of Impala
Co-existing with Hive
A different philosophy
Drill, Tajo, and beyond
Summary

8. Data Lifecycle Management
What data lifecycle management is
Importance of data lifecycle management
Tools to help
Building a tweet analysis capability
Getting the tweet data
Introducing Oozie
A note on HDFS file permissions
Making development a little easier
Extracting data and ingesting into Hive
A note on workflow directory structure
Introducing HCatalog
Using HCatalog
The Oozie sharelib
HCatalog and partitioned tables
Producing derived data
Performing multiple actions in parallel
Calling a subworkflow
Adding global settings
Challenges of external data
Data validation
Validation actions
Handling format changes
Handling schema evolution with Avro
Final thoughts on using Avro schema evolution
Only make additive changes
Manage schema versions explicitly
Think about schema distribution
Collecting additional data
Scheduling workflows
Other Oozie triggers

Pulling it all together
Other tools to help
Summary
9. Making Development Easier
Choosing a framework
Hadoop streaming
Streaming word count in Python
Differences in jobs when using streaming
Finding important words in text
Calculate term frequency
Calculate document frequency
Putting it all together – TF-IDF
Kite Data
Data Core
Data HCatalog
Data Hive
Data MapReduce
Data Spark
Data Crunch
Apache Crunch
Getting started
Concepts
Data serialization
Data processing patterns
Aggregation and sorting
Joining data
Pipelines implementation and execution
SparkPipeline
MemPipeline
Crunch examples
Word co-occurrence

TF-IDF
Kite Morphlines
Concepts
Morphline commands
Summary
10. Running a Hadoop Cluster
I'm a developer – I don't care about operations!
Hadoop and DevOps practices
Cloudera Manager
To pay or not to pay
Cluster management using Cloudera Manager
Cloudera Manager and other management tools
Monitoring with Cloudera Manager
Finding configuration files
Cloudera Manager API
Cloudera Manager lock-in
Ambari – the open source alternative
Operations in the Hadoop 2 world
Sharing resources
Building a physical cluster
Physical layout
Rack awareness
Service layout
Upgrading a service
Building a cluster on EMR
Considerations about filesystems
Getting data into EMR
EC2 instances and tuning
Cluster tuning
JVM considerations
The small files problem

Map and reduce optimizations
Security
Evolution of the Hadoop security model
Beyond basic authorization
The future of Hadoop security
Consequences of using a secured cluster
Monitoring
Hadoop – where failures don't matter
Monitoring integration
Application-level metrics
Troubleshooting
Logging levels
Access to log files
ResourceManager, NodeManager, and ApplicationManager
Applications
Nodes
Scheduler
MapReduce
MapReduce v1
MapReduce v2 (YARN)
JobHistory Server
NameNode and DataNode
Summary
11. Where to Go Next
Alternative distributions
Cloudera Distribution for Hadoop
Hortonworks Data Platform
MapR
And the rest…
Choosing a distribution
Other computational frameworks

Apache Storm
Apache Giraph
Apache HAMA
Other interesting projects
HBase
Sqoop
Whirr
Mahout
Hue
Other programming abstractions
Cascading
AWS resources
SimpleDB and DynamoDB
Kinesis
Data Pipeline
Sources of information
Source code
Mailing lists and forums
LinkedIn groups
HUGs
Conferences
Summary
Index

Learning Hadoop 2

Learning Hadoop 2

Copyright © 2015 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: February 2015

Production reference: 1060215

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78328-551-8

www.packtpub.com

Credits

Authors
Garry Turkington
Gabriele Modena

Reviewers
Atdhe Buja
Amit Gurdasani
Jakob Homan
James Lampton
Davide Setti
Valerie Parham-Thompson

Commissioning Editor
Edward Gordon

Acquisition Editor
Joanne Fitzpatrick

Content Development Editor
Vaibhav Pawar

Technical Editors
Indrajit A. Das
Menza Mathew

Copy Editors
Roshni Banerjee
Sarang Chari
Pranjali Chury

Project Coordinator
Kranti Berde

Proofreaders
Simran Bhogal
Martin Diver
Lawrence A. Herman

Paul Hindle

Indexer
Hemangini Bari

Graphics
Abhinash Sahu

Production Coordinator
Nitesh Thakur

Cover Work
Nitesh Thakur

About the Authors

Garry Turkington has over 15 years of industry experience, most of which has been focused on the design and implementation of large-scale distributed systems. In his current role as the CTO at Improve Digital, he is primarily responsible for the realization of systems that store, process, and extract value from the company's large data volumes. Before joining Improve Digital, he spent time at Amazon.co.uk, where he led several software development teams, building systems that process the Amazon catalog data for every item worldwide. Prior to this, he spent a decade in various government positions in both the UK and the USA.

He has BSc and PhD degrees in Computer Science from Queens University Belfast in Northern Ireland, and a Master's degree in Engineering in Systems Engineering from Stevens Institute of Technology in the USA. He is the author of Hadoop Beginner's Guide, published by Packt Publishing in 2013, and is a committer on the Apache Samza project.

I would like to thank my wife Lea and mother Sarah for their support and patience through the writing of another book and my daughter Maya for frequently cheering me up and asking me hard questions. I would also like to thank Gabriele for being such an amazing co-author on this project.

Gabriele Modena is a data scientist at Improve Digital. In his current position, he uses Hadoop to manage, process, and analyze behavioral and machine-generated data. Gabriele enjoys using statistical and computational methods to look for patterns in large amounts of data. Prior to his current job in ad tech, he held a number of positions in academia and industry where he did research in machine learning and artificial intelligence.

He holds a BSc degree in Computer Science from the University of Trento, Italy, and a Research MSc degree in Artificial Intelligence: Learning Systems, from the University of Amsterdam in the Netherlands.

First and foremost, I want to thank Laura for her support, constant encouragement and endless patience putting up with far too many "can't do, I'm working on the Hadoop book". She is my rock and I dedicate this book to her.

A special thank you goes to Amit, Atdhe, Davide, Jakob, James and Valerie, whose invaluable feedback and commentary made this work possible.

Finally, I'd like to thank my co-author, Garry, for bringing me on board with this project; it has been a pleasure working together.

About the Reviewers

Atdhe Buja is a certified ethical hacker, DBA (MCITP, OCA 11g), and developer with good management skills. He is a DBA at the Agency for Information Society/Ministry of Public Administration, where he also manages some projects of e-governance and has more than 10 years' experience working on SQL Server.

Atdhe is a regular columnist for UBT News. Currently, he holds an MSc degree in computer science and engineering and has a bachelor's degree in management and information. He specializes in and is certified in many technologies, such as SQL Server (all versions), Oracle 11g, CEH, Windows Server, MS Project, SCOM 2012 R2, BizTalk, and integration business processes.

He was the reviewer of the book, Microsoft SQL Server 2012 with Hadoop, published by Packt Publishing. His capabilities go beyond the aforementioned knowledge!

I thank Donika and my family for all the encouragement and support.

Amit Gurdasani is a software engineer at Amazon. He architects distributed systems to process product catalogue data. Prior to building high-throughput systems at Amazon, he was working on the entire software stack, both as a systems-level developer at Ericsson and IBM as well as an application developer at Manhattan Associates. He maintains a strong interest in bulk data processing, data streaming, and service-oriented software architectures.

Jakob Homan has been involved with big data and the Apache Hadoop ecosystem for more than 5 years. He is a Hadoop committer as well as a committer for the Apache Giraph, Spark, Kafka, and Tajo projects, and is a PMC member. He has worked in bringing all these systems to scale at Yahoo! and LinkedIn.

James Lampton is a seasoned practitioner of all things data (big or small) with 10 years of hands-on experience in building and using large-scale data storage and processing platforms. He is a believer in holistic approaches to solving problems using the right tool for the right job. His favorite tools include Python, Java, Hadoop, Pig, Storm, and SQL (which sometimes I like and sometimes I don't). He has recently completed his PhD from the University of Maryland with the release of Pig Squeal: a mechanism for running Pig scripts on Storm.

I would like to thank my spouse, Andrea, and my son, Henry, for giving me time to read work-related things at home. I would also like to thank Garry, Gabriele, and the folks at Packt Publishing for the opportunity to review this manuscript and for their patience and understanding, as my free time was consumed when writing my dissertation.

Davide Setti, after graduating in physics from the University of Trento, joined the SoNet research unit at the Fondazione Bruno Kessler in Trento, where he applied large-scale data analysis techniques to understand people's behaviors in social networks and large collaborative projects such as Wikipedia.

In 2010, Davide moved to Fondazione, where he led the development of data analytic tools to support research on civic media, citizen journalism, and digital media.

In 2013, Davide became the CTO of SpazioDati, where he leads the development of tools to perform semantic analysis of massive amounts of data in the business information sector.

When not solving hard problems, Davide enjoys taking care of his family vineyard and playing with his two children.

www.PacktPub.com

Support files, eBooks, discount offers, and more

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?

Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser

Free access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.

Preface

This book will take you on a hands-on exploration of the wonderful world that is Hadoop 2 and its rapidly growing ecosystem. Building on the solid foundation from the earlier versions of the platform, Hadoop 2 allows multiple data processing frameworks to be executed on a single Hadoop cluster.

To give an understanding of this significant evolution, we will explore both how these new models work and also show their applications in processing large data volumes with batch, iterative, and near-real-time algorithms.

What this book covers

Chapter 1, Introduction, gives the background to Hadoop and the Big Data problems it looks to solve. We also highlight the areas in which Hadoop 1 had room for improvement.

Chapter 2, Storage, delves into the Hadoop Distributed File System, where most data processed by Hadoop is stored. We examine the particular characteristics of HDFS, show how to use it, and discuss how it has improved in Hadoop 2. We also introduce ZooKeeper, another storage system within Hadoop, upon which many of its high-availability features rely.

Chapter 3, Processing – MapReduce and Beyond, first discusses the traditional Hadoop processing model and how it is used. We then discuss how Hadoop 2 has generalized the platform to use multiple computational models, of which MapReduce is merely one.

Chapter 4, Real-time Computation with Samza, takes a deeper look at one of these alternative processing models enabled by Hadoop 2. In particular, we look at how to process real-time streaming data with Apache Samza.

Chapter 5, Iterative Computation with Spark, delves into a very different alternative processing model. In this chapter, we look at how Apache Spark provides the means to do iterative processing.

Chapter 6, Data Analysis with Pig, demonstrates how Apache Pig makes the traditional computational model of MapReduce easier to use by providing a language to describe data flows.

Chapter 7, Hadoop and SQL, looks at how the familiar SQL language has been implemented atop data stored in Hadoop. Through the use of Apache Hive and describing alternatives such as Cloudera Impala, we show how Big Data processing can be made possible using existing skills and tools.

Chapter 8, Data Lifecycle Management, takes a look at the bigger picture of just how to manage all that data that is to be processed in Hadoop. Using Apache Oozie, we show how to build up workflows to ingest, process, and manage data.

Chapter 9, Making Development Easier, focuses on a selection of tools aimed at helping a developer get results quickly. Through the use of Hadoop streaming, Apache Crunch and Kite, we show how the use of the right tool can speed up the development loop or provide new APIs with richer semantics and less boilerplate.

Chapter 10, Running a Hadoop Cluster, takes a look at the operational side of Hadoop. By focusing on the areas of interest to developers, such as cluster management, monitoring, and security, this chapter should help you to work better with your operations staff.

Chapter 11, Where to Go Next, takes you on a whirlwind tour through a number of other projects and tools that we feel are useful, but could not cover in detail in the book due to space constraints. We also give some pointers on where to find additional sources of information and how to engage with the various open source communities.

What you need for this book

Because most people don't have a large number of spare machines sitting around, we use the Cloudera QuickStart virtual machine for most of the examples in this book. This is a single machine image with all the components of a full Hadoop cluster pre-installed. It can be run on any host machine supporting either the VMware or the VirtualBox virtualization technology.

We also explore Amazon Web Services and how some of the Hadoop technologies can be run on the AWS Elastic MapReduce service. The AWS services can be managed through a web browser or a Linux command-line interface.

Who this book is for

This book is primarily aimed at application and system developers interested in learning how to solve practical problems using the Hadoop framework and related components. Although we show examples in a few programming languages, a strong foundation in Java is the main prerequisite.

Data engineers and architects might also find the material concerning data lifecycle, file formats, and computational models useful.

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "If Avro dependencies are not present in the classpath, we need to add the Avro MapReduce .jar file to our environment before accessing individual fields."

A block of code is set as follows:

topic_edges_grouped = FOREACH topic_edges_grouped {
  GENERATE
    group.topic_id as topic,
    group.source_id as source,
    topic_edges.(destination_id, w) as edges;
}

Any command-line input or output is written as follows:

$ hdfs dfs -put target/elephant-bird-pig-4.5.jar hdfs:///jar/
$ hdfs dfs -put target/elephant-bird-hadoop-compat-4.5.jar hdfs:///jar/
$ hdfs dfs -put elephant-bird-core-4.5.jar hdfs:///jar/

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes, appear in the text like this: "Once the form is filled in, we need to review and accept the terms of service and click on the Create Application button in the bottom-left corner of the page."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

The source code for this book can be found on GitHub at https://github.com/learninghadoop2/book-examples. The authors will be applying any errata to this code and keeping it up to date as the technologies evolve. In addition, you can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at <[email protected]> if you are having a problem with any aspect of the book, and we will do our best to address it.

Chapter 1. Introduction

This book will teach you how to build amazing systems using the latest release of Hadoop. Before you change the world though, we need to do some groundwork, which is where this chapter comes in.

In this introductory chapter, we will cover the following topics:

A brief refresher on the background to Hadoop
A walk-through of Hadoop's evolution
The key elements in Hadoop 2
The Hadoop distributions we'll use in this book
The dataset we'll use for examples

A note on versioning

In Hadoop 1, the version history was somewhat convoluted with multiple forked branches in the 0.2x range, leading to odd situations, where a 1.x version could, in some situations, have fewer features than a 0.23 release. In the version 2 codebase, this is fortunately much more straightforward, but it's important to clarify exactly which version we will use in this book.

Hadoop 2.0 was released in alpha and beta versions, and along the way, several incompatible changes were introduced. There was, in particular, a major API stabilization effort between the beta and final release stages.

Hadoop 2.2.0 was the first general availability (GA) release of the Hadoop 2 codebase, and its interfaces are now declared stable and forward compatible. We will therefore use the 2.2 product and interfaces in this book. Though the principles will be usable on a 2.0 beta, in particular, there will be API incompatibilities in the beta. This is particularly important as MapReduce v2 was back-ported to Hadoop 1 by several distribution vendors, but these products were based on the beta and not the GA APIs. If you are using such a product, then you will encounter these incompatible changes. It is recommended that a release based upon Hadoop 2.2 or later is used for both the development and the production deployments of any Hadoop 2 workloads.

ThebackgroundofHadoopWe’reassumingthatmostreaderswillhavealittlefamiliaritywithHadoop,orattheveryleast,withbigdata-processingsystems.Consequently,wewon’tgiveadetailedbackgroundastowhyHadoopissuccessfulorthetypesofproblemithelpstosolveinthisbook.However,particularlybecauseofsomeaspectsofHadoop2andtheotherproductswewilluseinlaterchapters,itisusefultogiveasketchofhowweseeHadoopfittingintothetechnologylandscapeandwhicharetheparticularproblemareaswherewebelieveitgivesthemostbenefit.

Inancienttimes,beforetheterm“bigdata”cameintothepicture(whichequatestomaybeadecadeago),therewerefewoptionstoprocessdatasetsofsizesinterabytesandbeyond.Somecommercialdatabasescould,withveryspecificandexpensivehardwaresetups,bescaledtothislevel,buttheexpertiseandcapitalexpenditurerequiredmadeitanoptionforonlythelargestorganizations.Alternatively,onecouldbuildacustomsystemaimedatthespecificproblemathand.Thissufferedfromsomeofthesameproblems(expertiseandcost)andaddedtheriskinherentinanycutting-edgesystem.Ontheotherhand,ifasystemwassuccessfullyconstructed,itwaslikelyaverygoodfittotheneed.

Fewsmall-tomid-sizecompaniesevenworriedaboutthisspace,notonlybecausethesolutionswereoutoftheirreach,buttheygenerallyalsodidn’thaveanythingclosetothedatavolumesthatrequiredsuchsolutions.Astheabilitytogenerateverylargedatasetsbecamemorecommon,sodidtheneedtoprocessthatdata.

Eventhoughlargedatabecamemoredemocratizedandwasnolongerthedomainoftheprivilegedfew,majorarchitecturalchangeswererequiredifthedata-processingsystemscouldbemadeaffordabletosmallercompanies.Thefirstbigchangewastoreducetherequiredupfrontcapitalexpenditureonthesystem;thatmeansnohigh-endhardwareorexpensivesoftwarelicenses.Previously,high-endhardwarewouldhavebeenutilizedmostcommonlyinarelativelysmallnumberofverylargeserversandstoragesystems,eachofwhichhadmultipleapproachestoavoidhardwarefailures.Thoughveryimpressive,suchsystemsarehugelyexpensive,andmovingtoalargernumberoflower-endserverswouldbethequickestwaytodramaticallyreducethehardwarecostofanewsystem.Movingmoretowardcommodityhardwareinsteadofthetraditionalenterprise-gradeequipmentwouldalsomeanareductionincapabilitiesintheareaofresilienceandfaulttolerance.Thoseresponsibilitieswouldneedtobetakenupbythesoftwarelayer.Smartersoftware,dumberhardware.

Google started the change that would eventually be known as Hadoop, when, in 2003 and in 2004, they released two academic papers describing the Google File System (GFS) (http://research.google.com/archive/gfs.html) and MapReduce (http://research.google.com/archive/mapreduce.html). The two together provided a platform for very large-scale data processing in a highly efficient manner. Google had taken the build-it-yourself approach, but instead of constructing something aimed at one specific problem or dataset, they instead created a platform on which multiple processing applications could be implemented. In particular, they utilized large numbers of commodity servers and built GFS and MapReduce in a way that assumed hardware failures would be commonplace and were simply something that the software needed to deal with.

At the same time, Doug Cutting was working on the Nutch open source web crawler. He was working on elements within the system that resonated strongly once the Google GFS and MapReduce papers were published. Doug started work on open source implementations of these Google ideas, and Hadoop was soon born, firstly, as a subproject of Lucene, and then as its own top-level project within the Apache Software Foundation.

Yahoo! hired Doug Cutting in 2006 and quickly became one of the most prominent supporters of the Hadoop project. In addition to often publicizing some of the largest Hadoop deployments in the world, Yahoo! allowed Doug and other engineers to contribute to Hadoop while employed by the company, not to mention contributing back some of its own internally developed Hadoop improvements and extensions.

Components of Hadoop

The broad Hadoop umbrella project has many component subprojects, and we'll discuss several of them in this book. At its core, Hadoop provides two services: storage and computation. A typical Hadoop workflow consists of loading data into the Hadoop Distributed File System (HDFS) and processing using the MapReduce API or several tools that rely on MapReduce as an execution framework.

Hadoop 1: HDFS and MapReduce

Both layers are direct implementations of Google's own GFS and MapReduce technologies.

Common building blocks

Both HDFS and MapReduce exhibit several of the architectural principles described in the previous section. In particular, the common principles are as follows:

Both are designed to run on clusters of commodity (that is, low to medium specification) servers
Both scale their capacity by adding more servers (scale-out) as opposed to the previous models of using larger hardware (scale-up)
Both have mechanisms to identify and work around failures
Both provide most of their services transparently, allowing the user to concentrate on the problem at hand
Both have an architecture where a software cluster sits on the physical servers and manages aspects such as application load balancing and fault tolerance, without relying on high-end hardware to deliver these capabilities

Storage

HDFS is a filesystem, though not a POSIX-compliant one. This basically means that it does not display the same characteristics as that of a regular filesystem. In particular, the characteristics are as follows:

HDFS stores files in blocks that are typically at least 64 MB or (more commonly now) 128 MB in size, much larger than the 4-32 KB seen in most filesystems
HDFS is optimized for throughput over latency; it is very efficient at streaming reads of large files but poor when seeking for many small ones
HDFS is optimized for workloads that are generally write-once and read-many
Instead of handling disk failures by having physical redundancies in disk arrays or similar strategies, HDFS uses replication. Each of the blocks comprising a file is stored on multiple nodes within the cluster, and a service called the NameNode constantly monitors to ensure that failures have not dropped any block below the desired replication factor. If this does happen, then it schedules the making of another copy within the cluster.
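
Both the block size and the replication factor are ordinary configuration properties. As an illustrative sketch only (the values shown are examples, not recommendations), they can be set cluster-wide in hdfs-site.xml:

<!-- hdfs-site.xml: example values only -->
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value> <!-- 128 MB blocks -->
</property>
<property>
  <name>dfs.replication</name>
  <value>3</value> <!-- each block stored on three DataNodes -->
</property>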

Computation

MapReduce is an API, an execution engine, and a processing paradigm; it provides a series of transformations from a source into a result dataset. In the simplest case, the input data is fed through a map function and the resultant temporary data is then fed through a reduce function.

MapReduce works best on semistructured or unstructured data. Instead of data conforming to rigid schemas, the requirement is instead that the data can be provided to the map function as a series of key-value pairs. The output of the map function is a set of other key-value pairs, and the reduce function performs aggregation to collect the final set of results.

Hadoop provides a standard specification (that is, interface) for the map and reduce phases, and the implementation of these are often referred to as mappers and reducers. A typical MapReduce application will comprise a number of mappers and reducers, and it's not unusual for several of these to be extremely simple. The developer focuses on expressing the transformation between the source and the resultant data, and the Hadoop framework manages all aspects of job execution and coordination.
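
To give a flavor of what such implementations look like, the following is a minimal word-count mapper and reducer sketch against the Hadoop 2 Java API (org.apache.hadoop.mapreduce); the class names are ours and the job driver is omitted for brevity:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emit (word, 1) for every word in each input line
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);
      }
    }
  }
}

// Reducer: sum the counts emitted for each word
class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable count : values) {
      sum += count.get();
    }
    context.write(key, new IntWritable(sum));
  }
}

Everything else in this hypothetical job, such as splitting the input, routing each word to a single reducer, and rerunning failed tasks, is handled by the framework.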

Better together

It is possible to appreciate the individual merits of HDFS and MapReduce, but they are even more powerful when combined. They can be used individually, but when they are together, they bring out the best in each other, and this close interworking was a major factor in the success and acceptance of Hadoop 1.

When a MapReduce job is being planned, Hadoop needs to decide on which host to execute the code in order to process the dataset most efficiently. If the MapReduce cluster hosts are all pulling their data from a single storage host or array, then this largely doesn't matter as the storage system is a shared resource that will cause contention. If the storage system was more transparent and allowed MapReduce to manipulate its data more directly, then there would be an opportunity to perform the processing closer to the data, building on the principle of it being less expensive to move processing than data.

The most common deployment model for Hadoop sees the HDFS and MapReduce clusters deployed on the same set of servers. Each host that contains data and the HDFS component to manage the data also hosts a MapReduce component that can schedule and execute data processing. When a job is submitted to Hadoop, it can use the locality optimization to schedule data on the hosts where data resides as much as possible, thus minimizing network traffic and maximizing performance.

Hadoop2–what’sthebigdeal?IfwelookatthetwomaincomponentsofthecoreHadoopdistribution,storageandcomputation,weseethatHadoop2hasaverydifferentimpactoneachofthem.WhereastheHDFSfoundinHadoop2ismostlyamuchmorefeature-richandresilientproductthantheHDFSinHadoop1,forMapReduce,thechangesaremuchmoreprofoundandhave,infact,alteredhowHadoopisperceivedasaprocessingplatformingeneral.Let’slookatHDFSinHadoop2first.

StorageinHadoop2We’lldiscusstheHDFSarchitectureinmoredetailinChapter2,Storage,butfornow,it’ssufficienttothinkofamaster-slavemodel.Theslavenodes(calledDataNodes)holdtheactualfilesystemdata.Inparticular,eachhostrunningaDataNodewilltypicallyhaveoneormoredisksontowhichfilescontainingthedataforeachHDFSblockarewritten.TheDataNodeitselfhasnounderstandingoftheoverallfilesystem;itsroleistostore,serve,andensuretheintegrityofthedataforwhichitisresponsible.

Themasternode(calledtheNameNode)isresponsibleforknowingwhichoftheDataNodesholdswhichblockandhowtheseblocksarestructuredtoformthefilesystem.Whenaclientlooksatthefilesystemandwishestoretrieveafile,it’sviaarequesttotheNameNodethatthelistofrequiredblocksisretrieved.
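
This block-to-file mapping can be inspected from the command line with the standard hdfs fsck tool; a minimal sketch, with an example path:

$ hdfs fsck /user/cloudera/input/tweets.txt -files -blocks -locations

The output lists each block that makes up the file, its replication factor, and the DataNodes currently holding a replica of it.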

This model works well and has been scaled to clusters with tens of thousands of nodes at companies such as Yahoo! So, though it is scalable, there is a resiliency risk; if the NameNode becomes unavailable, then the entire cluster is rendered effectively useless. No HDFS operations can be performed, and since the vast majority of installations use HDFS as the storage layer for services, such as MapReduce, they also become unavailable even if they are still running without problems.

More catastrophically, the NameNode stores the filesystem metadata to a persistent file on its local filesystem. If the NameNode host crashes in a way that this data is not recoverable, then all data on the cluster is effectively lost forever. The data will still exist on the various DataNodes, but the mapping of which blocks comprise which files is lost. This is why, in Hadoop 1, the best practice was to have the NameNode synchronously write its filesystem metadata to both local disks and at least one remote network volume (typically via NFS).

Several NameNode high-availability (HA) solutions have been made available by third-party suppliers, but the core Hadoop product did not offer such resilience in Version 1. Given this architectural single point of failure and the risk of data loss, it won't be a surprise to hear that NameNode HA is one of the major features of HDFS in Hadoop 2 and is something we'll discuss in detail in later chapters. The feature provides both a standby NameNode that can be automatically promoted to service all requests should the active NameNode fail, but also builds additional resilience for the critical filesystem metadata atop this mechanism.

HDFS in Hadoop 2 is still a non-POSIX filesystem; it still has a very large block size and it still trades latency for throughput. However, it does now have a few capabilities that can make it look a little more like a traditional filesystem. In particular, the core HDFS in Hadoop 2 now can be remotely mounted as an NFS volume. This is another feature that was previously offered as a proprietary capability by third-party suppliers but is now in the main Apache codebase.

Overall, the HDFS in Hadoop 2 is more resilient and can be more easily integrated into existing workflows and processes. It's a strong evolution of the product found in Hadoop 1.

Computation in Hadoop 2

The work on HDFS 2 was started before a direction for MapReduce crystallized. This was likely due to the fact that features such as NameNode HA were such an obvious path that the community knew the most critical areas to address. However, MapReduce didn't really have a similar list of areas of improvement, and that's why, when the MRv2 initiative started, it wasn't completely clear where it would lead.

Perhaps the most frequent criticism of MapReduce in Hadoop 1 was how its batch processing model was ill-suited to problem domains where faster response times were required. Hive, for example, which we'll discuss in Chapter 7, Hadoop and SQL, provides a SQL-like interface onto HDFS data, but, behind the scenes, the statements are converted into MapReduce jobs that are then executed like any other. A number of other products and tools took a similar approach, providing a specific user-facing interface that hid a MapReduce translation layer.

Though this approach has been very successful, and some amazing products have been built, the fact remains that in many cases, there is a mismatch as all of these interfaces, some of which expect a certain type of responsiveness, are behind the scenes, being executed on a batch-processing platform. When looking to enhance MapReduce, improvements could be made to make it a better fit to these use cases, but the fundamental mismatch would remain. This situation led to a significant change of focus of the MRv2 initiative; perhaps MapReduce itself didn't need change, but the real need was to enable different processing models on the Hadoop platform. Thus was born Yet Another Resource Negotiator (YARN).

Looking at MapReduce in Hadoop 1, the product actually did two quite different things; it provided the processing framework to execute MapReduce computations, but it also managed the allocation of this computation across the cluster. Not only did it direct data to and between the specific map and reduce tasks, but it also determined where each task would run, and managed the full job lifecycle, monitoring the health of each task and node, rescheduling if any failed, and so on.

This is not a trivial task, and the automated parallelization of workloads has always been one of the main benefits of Hadoop. If we look at MapReduce in Hadoop 1, we see that after the user defines the key criteria for the job, everything else is the responsibility of the system. Critically, from a scale perspective, the same MapReduce job can be applied to datasets of any volume hosted on clusters of any size. If the data is 1 GB in size and on a single host, then Hadoop will schedule the processing accordingly. If the data is instead 1 PB in size and hosted across 1,000 machines, then it does likewise. From the user's perspective, the actual scale of the data and cluster is transparent, and aside from affecting the time taken to process the job, it does not change the interface with which to interact with the system.

In Hadoop 2, this role of job scheduling and resource management is separated from that of executing the actual application, and is implemented by YARN.

YARN is responsible for managing the cluster resources, and so MapReduce exists as an application that runs atop the YARN framework. The MapReduce interface in Hadoop 2 is completely compatible with that in Hadoop 1, both semantically and practically. However, under the covers, MapReduce has become a hosted application on the YARN framework.

The significance of this split is that other applications can be written that provide processing models more focused on the actual problem domain and can offload all the resource management and scheduling responsibilities to YARN. The latest versions of many different execution engines have been ported onto YARN, either in a production-ready or experimental state, and it has shown that the approach can allow a single Hadoop cluster to run everything from batch-oriented MapReduce jobs through fast-response SQL queries to continuous data streaming and even to implement models such as graph processing and the Message Passing Interface (MPI) from the High Performance Computing (HPC) world. The following diagram shows the architecture of Hadoop 2:

Hadoop 2

This is why much of the attention and excitement around Hadoop 2 has been focused on YARN and frameworks that sit on top of it, such as Apache Tez and Apache Spark. With YARN, the Hadoop cluster is no longer just a batch-processing engine; it is the single platform on which a vast array of processing techniques can be applied to the enormous data volumes stored in HDFS. Moreover, applications can build on these computation paradigms and execution models.

The analogy that is achieving some traction is to think of YARN as the processing kernel upon which other domain-specific applications can be built. We'll discuss YARN in more detail in this book, particularly in Chapter 3, Processing – MapReduce and Beyond, Chapter 4, Real-time Computation with Samza, and Chapter 5, Iterative Computation with Spark.
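
One small way to see this multi-framework model in practice is the yarn command-line client, which lists whatever applications (MapReduce, Spark, Samza, and so on) happen to be running on a cluster at that moment; a minimal sketch, assuming a running YARN cluster:

$ yarn application -list
$ yarn node -list

The first command shows each running application with its type and state; the second shows the NodeManagers currently available to host that work.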

Distributions of Apache Hadoop

In the very early days of Hadoop, the burden of installing (often building from source) and managing each component and its dependencies fell on the user. As the system became more popular and the ecosystem of third-party tools and libraries started to grow, the complexity of installing and managing a Hadoop deployment increased dramatically to the point where providing a coherent offer of software packages, documentation, and training built around the core Apache Hadoop has become a business model. Enter the world of distributions for Apache Hadoop.

Hadoop distributions are conceptually similar to how Linux distributions provide a set of integrated software around a common core. They take the burden of bundling and packaging software themselves and provide the user with an easy way to install, manage, and deploy Apache Hadoop and a selected number of third-party libraries. In particular, the distribution releases deliver a series of product versions that are certified to be mutually compatible. Historically, putting together a Hadoop-based platform was often greatly complicated by the various version interdependencies.

Cloudera (http://www.cloudera.com), Hortonworks (http://www.hortonworks.com), and MapR (http://www.mapr.com) are amongst the first to have reached the market, each characterized by different approaches and selling points. Hortonworks positions itself as the open source player; Cloudera is also committed to open source but adds proprietary bits for configuring and managing Hadoop; MapR provides a hybrid open source/proprietary Hadoop distribution characterized by a proprietary NFS layer instead of HDFS and a focus on providing services.

Another strong player in the distributions ecosystem is Amazon, which offers a version of Hadoop called Elastic MapReduce (EMR) on top of the Amazon Web Services (AWS) infrastructure.

With the advent of Hadoop 2, the number of available distributions for Hadoop has increased dramatically, far in excess of the four we mentioned. A possibly incomplete list of software offerings that includes Apache Hadoop can be found at http://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Support.

A dual approach

In this book, we will discuss both the building and the management of local Hadoop clusters in addition to showing how to push the processing into the cloud via EMR.

The reason for this is twofold: firstly, though EMR makes Hadoop much more accessible, there are aspects of the technology that only become apparent when manually administering the cluster. Although it is also possible to use EMR in a more manual mode, we'll generally use a local cluster for such explorations. Secondly, though it isn't necessarily an either/or decision, many organizations use a mixture of in-house and cloud-hosted capacities, sometimes due to a concern of overreliance on a single external provider, but practically speaking, it's often convenient to do development and small-scale tests on local capacity and then deploy at production scale into the cloud.

In a few of the later chapters, where we discuss additional products that integrate with Hadoop, we'll mostly give examples of local clusters, as there is no difference between how the products work regardless of where they are deployed.

AWS – infrastructure on demand from Amazon

AWS is a set of cloud-computing services offered by Amazon. We will use several of these services in this book.

SimpleStorageService(S3)Amazon’sSimpleStorageService(S3),foundathttp://aws.amazon.com/s3/,isastorageservicethatprovidesasimplekey-valuestoragemodel.Usingweb,command-line,orprogrammaticinterfacestocreateobjects,whichcanbeanythingfromtextfilestoimagestoMP3s,youcanstoreandretrieveyourdatabasedonahierarchicalmodel.Inthismodel,youcreatebucketsthatcontainobjects.Eachbuckethasauniqueidentifier,andwithineachbucket,everyobjectisuniquelynamed.ThissimplestrategyenablesanextremelypowerfulserviceforwhichAmazontakescompleteresponsibility(forservicescaling,inadditiontoreliabilityandavailabilityofdata).

ElasticMapReduce(EMR)Amazon’sElasticMapReduce,foundathttp://aws.amazon.com/elasticmapreduce/,isbasicallyHadoopinthecloud.Usinganyofthemultipleinterfaces(webconsole,CLI,orAPI),aHadoopworkflowisdefinedwithattributessuchasthenumberofHadoophostsrequiredandthelocationofthesourcedata.TheHadoopcodeimplementingtheMapReducejobsisprovided,andthevirtualGobuttonispressed.

Initsmostimpressivemode,EMRcanpullsourcedatafromS3,processitonaHadoopclusteritcreatesonAmazon’svirtualhoston-demandserviceEC2,pushtheresultsbackintoS3,andterminatetheHadoopclusterandtheEC2virtualmachineshostingit.Naturally,eachoftheseserviceshasacost(usuallyonperGBstoredandserver-timeusagebasis),buttheabilitytoaccesssuchpowerfuldata-processingcapabilitieswithnoneedfordedicatedhardwareisapowerfulone.

Getting started

We will now describe the two environments we will use throughout the book: Cloudera's QuickStart virtual machine will be our reference system on which we will show all examples, but we will additionally demonstrate some examples on Amazon's EMR when there is some particularly valuable aspect to running the example in the on-demand service.

Although the examples and code provided are aimed at being as general-purpose and portable as possible, our reference setup, when talking about a local cluster, will be Cloudera running atop CentOS Linux.

For the most part, we will show examples that make use of, or are executed from, a terminal prompt. Although Hadoop's graphical interfaces have improved significantly over the years (for example, the excellent HUE and Cloudera Manager), when it comes to development, automation, and programmatic access to the system, the command line is still the most powerful tool for the job.

All examples and source code presented in this book can be downloaded from https://github.com/learninghadoop2/book-examples. In addition, we have a homepage for the book where we will publish updates and related material at http://learninghadoop2.com.

ClouderaQuickStartVMOneoftheadvantagesofHadoopdistributionsisthattheygiveaccesstoeasy-to-install,packagedsoftware.ClouderatakesthisonestepfurtherandprovidesafreelydownloadableVirtualMachineinstanceofitslatestdistribution,knownastheCDHQuickStartVM,deployedontopofCentOSLinux.

Intheremainingpartsofthisbook,wewillusetheCDH5.0.0VMasthereferenceandbaselinesystemtorunexamplesandsourcecode.ImagesoftheVMareavailableforVMware(http://www.vmware.com/nl/products/player/),KVM(http://www.linux-kvm.org/page/Main_Page),andVirtualBox(https://www.virtualbox.org/)virtualizationsystems.

Amazon EMR

Before using Elastic MapReduce, we need to set up an AWS account and register it with the necessary services.

Creating an AWS account

Amazon has integrated its general accounts with AWS, which means that, if you already have an account for any of the Amazon retail websites, this is the only account you will need to use AWS services.

Note

Note that AWS services have a cost; you will need an active credit card associated with the account to which charges can be made.

If you require a new Amazon account, go to http://aws.amazon.com, select Create a new AWS account, and follow the prompts. Amazon has added a free tier for some services, so you might find that in the early days of testing and exploration, you are keeping many of your activities within the non-charged tier. The scope of the free tier has been expanding, so make sure you know what you will and won't be charged for.

Signing up for the necessary services

Once you have an Amazon account, you will need to register it for use with the required AWS services, that is, Simple Storage Service (S3), Elastic Compute Cloud (EC2), and Elastic MapReduce. There is no cost to simply sign up to any AWS service; the process just makes the service available to your account.

Go to the S3, EC2, and EMR pages linked from http://aws.amazon.com, click on the Sign up button on each page, and then follow the prompts.

Using Elastic MapReduce

Having created an account with AWS and registered all the required services, we can proceed to configure programmatic access to EMR.

Getting Hadoop up and running

Note

Caution! This costs real money!

Before going any further, it is critical to understand that use of AWS services will incur charges that will appear on the credit card associated with your Amazon account. Most of the charges are quite small and increase with the amount of infrastructure consumed; storing 10 GB of data in S3 costs 10 times more than 1 GB, and running 20 EC2 instances costs 20 times as much as a single one. There are tiered cost models, so the actual costs tend to have smaller marginal increases at higher levels. But you should read carefully through the pricing sections for each service before using any of them. Note also that currently data transfer out of AWS services, such as EC2 and S3, is chargeable, but data transfer between services is not. This means it is often most cost-effective to carefully design your use of AWS to keep data within AWS through as much of the data processing as possible. For information regarding AWS and EMR, consult http://aws.amazon.com/elasticmapreduce/#pricing.

How to use EMR

Amazon provides both web and command-line interfaces to EMR. Both interfaces are just a frontend to the very same system; a cluster created with the command-line interface can be inspected and managed with the web tools, and vice versa.

For the most part, we will be using the command-line tools to create and manage clusters programmatically and will fall back on the web interface in cases where it makes sense to do so.

AWS credentials

Before using either programmatic or command-line tools, we need to look at how an account holder authenticates to AWS to make such requests.

Each AWS account has several identifiers, such as the following, that are used when accessing the various services:

Account ID: each AWS account has a numeric ID.
Access key: the associated access key is used to identify the account making the request.
Secret access key: the partner to the access key is the secret access key. The access key is not a secret and could be exposed in service requests, but the secret access key is what you use to validate yourself as the account owner. Treat it like your credit card.
Key pairs: these are the key pairs used to log in to EC2 hosts. It is possible to either generate public/private key pairs within EC2 or to import externally generated keys into the system.

User credentials and permissions are managed via a web service called Identity and Access Management (IAM), which you need to sign up to in order to obtain access and secret keys.

If this sounds confusing, it's because it is, at least at first. When using a tool to access an AWS service, there's usually the single, upfront step of adding the right credentials to a configuration file, and then everything just works. However, if you do decide to explore programmatic or command-line tools, it will be worth investing a little time to read the documentation for each service to understand how its security works. More information on creating an AWS account and obtaining access credentials can be found at http://docs.aws.amazon.com/iam.

The AWS command-line interface

Each AWS service historically had its own set of command-line tools. Recently though, Amazon has created a single, unified command-line tool that allows access to most services. The AWS CLI can be found at http://aws.amazon.com/cli.

It can be installed from a tarball or via the pip or easy_install package managers.

On the CDH QuickStart VM, we can install awscli using the following command:

$ pip install awscli

In order to access the API, we need to configure the software to authenticate to AWS using our access and secret keys.

This is also a good moment to set up an EC2 key pair by following the instructions provided at https://console.aws.amazon.com/ec2/home?region=us-east-1#c=EC2&s=KeyPairs.

Although a key pair is not strictly necessary to run an EMR cluster, it will give us the capability to remotely log in to the master node and gain low-level access to the cluster.

The following command will guide you through a series of configuration steps and store the resulting configuration in the .aws/credentials file:

$ aws configure

Once the CLI is configured, we can query AWS with aws <service> <arguments>. To create and query an S3 bucket, use something like the following commands. Note that S3 bucket names need to be globally unique across all AWS accounts, so most common names, such as s3://mybucket, will not be available:

$ aws s3 mb s3://learninghadoop2
$ aws s3 ls

We can provision an EMR cluster with five m1.xlarge nodes using the following command:

$ aws emr create-cluster --name "EMR cluster" \
    --ami-version 3.2.0 \
    --instance-type m1.xlarge \
    --instance-count 5 \
    --log-uri s3://learninghadoop2/emr-logs

Where --ami-version is the ID of an Amazon Machine Image template (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html), and --log-uri instructs EMR to collect logs and store them in the learninghadoop2 S3 bucket.

Note

If you did not specify a default region when setting up the AWS CLI, then you will also have to add one to most EMR commands using the --region argument; for example, --region eu-west-1 to use the EU (Ireland) region. You can find details of all available AWS regions at http://docs.aws.amazon.com/general/latest/gr/rande.html.
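While the cluster is starting up, its progress can also be checked from the same CLI; as a quick illustration (passing back the cluster ID returned by create-cluster), commands such as the following list the clusters in the account and describe a specific one:

$ aws emr list-clusters
$ aws emr describe-cluster --cluster-id <cluster>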

We can submit workflows by adding steps to a running cluster using the following command:

$ aws emr add-steps --cluster-id <cluster> --steps <steps>

To terminate the cluster, use the following command line:

$ aws emr terminate-clusters --cluster-ids <cluster>

In later chapters, we will show you how to add steps to execute MapReduce jobs and Pig scripts.

More information on using the AWS CLI can be found at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-manage.html.

Running the examples

The source code of all examples is available at https://github.com/learninghadoop2/book-examples.

Gradle (http://www.gradle.org/) scripts and configurations are provided to compile most of the Java code. The gradlew script included with the examples will bootstrap Gradle and use it to fetch dependencies and compile code.

JAR files can be created by invoking the jar task via the gradlew script, as follows:

./gradlew jar

Jobs are usually executed by submitting a JAR file using the hadoop jar command, as follows:

$ hadoop jar example.jar <MainClass> [-libjars $LIBJARS] arg1 arg2 … argN

The optional -libjars parameter specifies runtime third-party dependencies to ship to remote nodes.

Note

Some of the frameworks we will work with, such as Apache Spark, come with their own build and package management tools. Additional information and resources will be provided for these particular cases.

The copyJar Gradle task can be used to download third-party dependencies into build/libjars/<example>/lib, as follows:

./gradlew copyJar
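As a quick illustration of how the two pieces fit together (the JAR, class, and argument names here are purely hypothetical), the downloaded dependencies can then be passed to hadoop jar through -libjars:

$ export LIBJARS=build/libjars/example/lib/dep1.jar,build/libjars/example/lib/dep2.jar
$ hadoop jar build/libs/example.jar com.learninghadoop2.ExampleDriver -libjars $LIBJARS arg1 arg2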

For convenience, we provide a fatJar Gradle task that bundles the example classes and their dependencies into a single JAR file. Although this approach is discouraged in favor of using -libjars, it might come in handy when dealing with dependency issues.

The following command will generate build/libs/<example>-all.jar:

$ ./gradlew fatJar

Data processing with Hadoop

In the remaining chapters of this book, we will introduce the core components of the Hadoop ecosystem as well as a number of third-party tools and libraries that will make writing robust, distributed code an accessible and hopefully enjoyable task. While reading this book, you will learn how to collect, process, store, and extract information from large amounts of structured and unstructured data.

We will use a dataset generated from Twitter's (http://www.twitter.com) real-time firehose. This approach will allow us to experiment with relatively small datasets locally and, once ready, scale the examples up to production-level data sizes.

Why Twitter?

Thanks to its programmatic APIs, Twitter provides an easy way to generate datasets of arbitrary size and inject them into our local or cloud-based Hadoop clusters. Other than the sheer size, the dataset that we will use has a number of properties that fit several interesting data modeling and processing use cases.

Twitter data possesses the following properties:

Unstructured: each status update is a text message that can contain references to media content such as URLs and images
Structured: tweets are timestamped, sequential records
Graph: relationships such as replies and mentions can be modeled as a network of interactions
Geolocated: the location where a tweet was posted or where a user resides
Real time: all data generated on Twitter is available via a real-time firehose

These properties will be reflected in the types of application that we can build with Hadoop. These include examples of sentiment analysis, social network analysis, and trend analysis.

Building our first dataset

Twitter's terms of service prohibit redistribution of user-generated data in any form; for this reason, we cannot make available a common dataset. Instead, we will use a Python script to programmatically access the platform and create a dump of user tweets collected from a live stream.

One service, multiple APIs

Twitter users share more than 200 million tweets, also known as status updates, a day. The platform offers access to this corpus of data via four types of APIs, each of which represents a facet of Twitter and aims at satisfying specific use cases, such as linking and interacting with Twitter content from third-party sources (Twitter for Products), programmatic access to specific users' or sites' content (REST), search capabilities across users' or sites' timelines (Search), and access to all content created on the Twitter network in real time (Streaming).

The Streaming API allows direct access to the Twitter stream, tracking keywords, retrieving geotagged tweets from a certain region, and much more. In this book, we will make use of this API as a data source to illustrate both the batch and real-time capabilities of Hadoop. We will not, however, interact with the API itself; rather, we will make use of third-party libraries to offload chores such as authentication and connection management.

Anatomy of a Tweet

Each tweet object returned by a call to the real-time APIs is represented as a serialized JSON string that contains a set of attributes and metadata in addition to a textual message. This additional content includes a numerical ID that uniquely identifies the tweet, the location where the tweet was shared, the user who shared it (user object), whether it was republished by other users (retweeted) and how many times (retweet count), the machine-detected language of its text, whether the tweet was posted in reply to someone and, if so, the user and tweet IDs it replied to, and so on.

The structure of a Tweet, and any other object exposed by the API, is constantly evolving. An up-to-date reference can be found at https://dev.twitter.com/docs/platform-objects/tweets.
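To make the preceding description more concrete, the following heavily abridged fragment illustrates the kinds of field just discussed; the values are invented for illustration only, and the authoritative, full schema is at the URL above:

{
  "id": 123456789012345678,
  "created_at": "Wed Nov 12 11:29:00 +0000 2014",
  "text": "Hello world",
  "lang": "en",
  "retweet_count": 0,
  "in_reply_to_status_id": null,
  "coordinates": null,
  "user": {"id": 12345, "screen_name": "example_user", "location": "London"}
}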

Twitter credentials

Twitter makes use of the OAuth protocol to authenticate and authorize access from third-party software to its platform.

The application obtains, through an external channel, for instance a web form, the following pair of credentials:

Consumer key
Consumer secret

The consumer secret is never directly transmitted to the third party, as it is used to sign each request.

The user authorizes the application to access the service via a three-way process that, once completed, grants the application a token consisting of the following:

Access token
Access secret

Similarly to the consumer secret, the access secret is never directly transmitted to the third party, and it is used to sign each request.

In order to use the Streaming API, we will first need to register an application and grant it programmatic access to the system. If you require a new Twitter account, proceed to the signup page at https://twitter.com/signup, and fill in the required information. Once this step is completed, we need to create a sample application that will access the API on our behalf and grant it the proper authorization rights. We will do so using the web form found at https://dev.twitter.com/apps.

When creating a new app, we are asked to give it a name, a description, and a URL. The following screenshot shows the settings of a sample application named Learning Hadoop 2 Book Dataset. For the purpose of this book, we do not need to specify a valid URL, so we used a placeholder instead.

Once the form is filled in, we need to review and accept the terms of service and click on the Create Application button in the bottom-left corner of the page.

We are now presented with a page that summarizes our application details, as seen in the following screenshot; the authentication and authorization credentials can be found under the OAuth Tool tab.

We are finally ready to generate our very first Twitter dataset.


Programmatic access with Python

In this section, we will use Python and the tweepy library, found at https://github.com/tweepy/tweepy, to collect Twitter's data. The stream.py file found in the ch1 directory of the book code archive instantiates a listener to the real-time firehose, grabs a data sample, and echoes each tweet's text to standard output.

The tweepy library can be installed using either the easy_install or pip package managers or by cloning the repository at https://github.com/tweepy/tweepy.

On the CDH QuickStart VM, we can install tweepy using the following command line:

$ pip install tweepy

When invoked with the -j parameter, the script will output a JSON tweet to standard output; -t extracts and prints the text field. We specify how many tweets to print with -n <num tweets>. When -n is not specified, the script will run indefinitely. Execution can be terminated by pressing Ctrl + C.

The script expects OAuth credentials to be stored as shell environment variables; the following credentials will have to be set in the terminal session from where stream.py will be executed:

$ export TWITTER_CONSUMER_KEY="your_consumer_key"
$ export TWITTER_CONSUMER_SECRET="your_consumer_secret"
$ export TWITTER_ACCESS_KEY="your_access_key"
$ export TWITTER_ACCESS_SECRET="your_access_secret"

Once the required dependency has been installed and the OAuth data in the shell environment has been set, we can run the program as follows:

$ python stream.py -t -n 1000 > tweets.txt

We are relying on the Linux shell's I/O redirection, using the > operator to send the output of stream.py to a file called tweets.txt. If everything was executed correctly, you should see a wall of text, where each line is a tweet.

Notice that in this example, we did not make use of Hadoop at all. In the next chapters, we will show how to import a dataset generated from the Streaming API into Hadoop and analyze its content on the local cluster and Amazon EMR.

For now, let's take a look at the source code of stream.py, which can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch1/stream.py:

import tweepy
import os
import json
import argparse

consumer_key = os.environ['TWITTER_CONSUMER_KEY']
consumer_secret = os.environ['TWITTER_CONSUMER_SECRET']
access_key = os.environ['TWITTER_ACCESS_KEY']
access_secret = os.environ['TWITTER_ACCESS_SECRET']


class EchoStreamListener(tweepy.StreamListener):
    def __init__(self, api, dump_json=False, numtweets=0):
        self.api = api
        self.dump_json = dump_json
        self.count = 0
        self.limit = int(numtweets)
        super(tweepy.StreamListener, self).__init__()

    def on_data(self, tweet):
        tweet_data = json.loads(tweet)
        if 'text' in tweet_data:
            if self.dump_json:
                print tweet.rstrip()
            else:
                print tweet_data['text'].encode("utf-8").rstrip()
            self.count = self.count + 1
            return False if self.count == self.limit else True

    def on_error(self, status_code):
        return True

    def on_timeout(self):
        return True


if __name__ == '__main__':
    # get_parser() is defined in the full stream.py and builds the
    # argparse parser that handles the -j, -t, and -n options.
    parser = get_parser()
    args = parser.parse_args()

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_key, access_secret)
    api = tweepy.API(auth)

    sapi = tweepy.streaming.Stream(
        auth, EchoStreamListener(
            api=api,
            dump_json=args.json,
            numtweets=args.numtweets))
    sapi.sample()

First, we import our dependencies: tweepy, and the os and json modules, which come with the Python interpreter version 2.6 or greater.

We then define a class, EchoStreamListener, that inherits and extends StreamListener from tweepy. As the name suggests, StreamListener listens for events and tweets being published on the real-time stream and performs actions accordingly.

Whenever a new event is detected, it triggers a call to on_data(). In this method, we extract the text field from a tweet object and print it to standard output with UTF-8 encoding. Alternatively, if the script is invoked with -j, we print the whole JSON tweet. When the script is executed, we instantiate a tweepy.OAuthHandler object with the OAuth credentials that identify our Twitter account, and then we use this object to authenticate with the application access and secret key. We then use the auth object to create an instance of the tweepy.API class (api).

Upon successful authentication, we tell Python to listen for events on the real-time stream using EchoStreamListener.

An HTTP GET request to the statuses/sample endpoint is performed by sample(). The request returns a random sample of all public statuses.

Note

Beware! By default, sample() will run indefinitely. Remember to explicitly terminate the method call by pressing Ctrl + C.

Summary

This chapter gave a whirlwind tour of where Hadoop came from, its evolution, and why the version 2 release is such a major milestone. We also described the emerging market in Hadoop distributions and how we will use a combination of local and cloud distributions in the book.

Finally, we described how to set up the needed software, accounts, and environments required in subsequent chapters and demonstrated how to pull data from the Twitter stream that we will use for examples.

With this background out of the way, we will now move on to a detailed examination of the storage layer within Hadoop.

Chapter 2. Storage

After the overview of Hadoop in the previous chapter, we will now start looking at its various component parts in more detail. We will start at the conceptual bottom of the stack in this chapter: the means and mechanisms for storing data within Hadoop. In particular, we will discuss the following topics:

Describe the architecture of the Hadoop Distributed File System (HDFS)
Show what enhancements to HDFS have been made in Hadoop 2
Explore how to access HDFS using command-line tools and the Java API
Give a brief description of ZooKeeper, another (sort of) filesystem within Hadoop
Survey considerations for storing data in Hadoop and the available file formats

In Chapter 3, Processing – MapReduce and Beyond, we will describe how Hadoop provides the framework to allow data to be processed.

The inner workings of HDFS

In Chapter 1, Introduction, we gave a very high-level overview of HDFS; we will now explore it in a little more detail. As mentioned in that chapter, HDFS can be viewed as a filesystem, though one with very specific performance characteristics and semantics. It's implemented with two main server processes: the NameNode and the DataNodes, configured in a master/slave setup. If you view the NameNode as holding all the filesystem metadata and the DataNodes as holding the actual filesystem data (blocks), then this is a good starting point. Every file placed onto HDFS will be split into multiple blocks that might reside on numerous DataNodes, and it's the NameNode that understands how these blocks can be combined to construct the files.

Cluster startup

Let's explore the various responsibilities of these nodes and the communication between them by assuming we have an HDFS cluster that was previously shut down and then examining the startup behavior.

NameNode startup

We'll firstly consider the startup of the NameNode (though there is no actual ordering requirement for this and we are doing it for narrative reasons alone). The NameNode actually stores two types of data about the filesystem:

The structure of the filesystem, that is, directory names, filenames, locations, and attributes
The blocks that comprise each file on the filesystem

This data is stored in files that the NameNode reads at startup. Note that the NameNode does not persistently store the mapping of which blocks are stored on which particular DataNodes; we'll see how that information is communicated shortly.

Because the NameNode relies on this in-memory representation of the filesystem, it tends to have quite different hardware requirements compared to the DataNodes. We'll explore hardware selection in more detail in Chapter 10, Running a Hadoop Cluster; for now, just remember that the NameNode tends to be quite memory hungry. This is particularly true on very large clusters with many (millions or more) files, particularly if these files have very long names. This scaling limitation on the NameNode has also led to an additional Hadoop 2 feature that we will not explore in much detail: NameNode federation, whereby multiple NameNodes (or NameNode HA pairs) work collaboratively to provide the overall metadata for the full filesystem.

The main file written by the NameNode is called fsimage; this is the single most important piece of data in the entire cluster, as without it, the knowledge of how to reconstruct all the data blocks into the usable filesystem is lost. This file is read into memory and all future modifications to the filesystem are applied to this in-memory representation of the filesystem. The NameNode does not write out new versions of fsimage as new changes are applied while it is running; instead, it writes another file called edits, which is a list of the changes that have been made since the last version of fsimage was written.

The NameNode startup process is to first read the fsimage file, then to read the edits file and apply all the changes stored in the edits file to the in-memory copy of fsimage. It then writes to disk a new up-to-date version of the fsimage file and is ready to receive client requests.

DataNode startup

When the DataNodes start up, they first catalog the blocks for which they hold copies. Typically, these blocks will be written simply as files on the local DataNode filesystem.

The DataNode will perform some block consistency checking and then report to the NameNode the list of blocks for which it has valid copies. This is how the NameNode constructs the final mapping it requires: by learning which blocks are stored on which DataNodes. Once the DataNode has registered itself with the NameNode, an ongoing series of heartbeat requests will be sent between the nodes to allow the NameNode to detect DataNodes that have shut down, become unreachable, or have newly entered the cluster.

Block replication

HDFS replicates each block onto multiple DataNodes; the default replication factor is 3, but this is configurable on a per-file level. HDFS can also be configured to be able to determine whether given DataNodes are in the same physical hardware rack or not. Given smart block placement and this knowledge of the cluster topology, HDFS will attempt to place the second replica on a different host but in the same equipment rack as the first, and the third on a host outside the rack. In this way, the system can survive the failure of as much as a full rack of equipment and still have at least one live replica for each block. As we'll see in Chapter 3, Processing – MapReduce and Beyond, knowledge of block placement also allows Hadoop to schedule processing as near as possible to a replica of each block, which can greatly improve performance.
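For instance, the replication factor of an individual file can be changed with the setrep subcommand of the hdfs utility introduced later in this chapter (the path here is just an example; the -w flag waits until the new replication level has been reached):

$ hdfs dfs -setrep -w 2 /path/to/file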

Remember that replication is a strategy for resilience but is not a backup mechanism; if you have critical data mastered in HDFS, then you need to consider backup or other approaches that give protection against errors, such as accidentally deleted files, against which replication will not defend.

When the NameNode starts up and is receiving the block reports from the DataNodes, it will remain in safe mode until a configurable threshold of blocks (the default is 99.9 percent) have been reported as live. While in safe mode, clients cannot make any modifications to the filesystem.

Command-line access to the HDFS filesystem

Within the Hadoop distribution, there is a command-line utility called hdfs, which is the primary way to interact with the filesystem from the command line. Run this without any arguments to see the various subcommands available. There are many, though; several are used to do things like starting or stopping various HDFS components. The general form of the hdfs command is:

hdfs <sub-command> <command> [arguments]

The two main subcommands we will use in this book are:

dfs: This is used for general filesystem access and manipulation, including reading/writing and accessing files and directories
dfsadmin: This is used for administration and maintenance of the filesystem. We will not cover this command in detail, though. Have a look at the -report command, which gives a listing of the state of the filesystem and all DataNodes:

$ hdfs dfsadmin -report

Note

Note that the dfs and dfsadmin commands can also be used with the main Hadoop command-line utility, for example, hadoop fs -ls /. This was the approach in earlier versions of Hadoop but is now deprecated in favor of the hdfs command.

Exploring the HDFS filesystem

Run the following to get a list of the available commands provided by the dfs subcommand:

$ hdfs dfs

As will be seen from the output of the preceding command, many of these look similar to standard Unix filesystem commands and, not surprisingly, they work as would be expected. In our test VM, we have a user account called cloudera. Using this user, we can list the root of the filesystem as follows:

$ hdfs dfs -ls /
Found 7 items
drwxr-xr-x   - hbase    hbase               0 2014-04-04 15:18 /hbase
drwxr-xr-x   - hdfs     supergroup          0 2014-10-21 13:16 /jar
drwxr-xr-x   - hdfs     supergroup          0 2014-10-15 15:26 /schema
drwxr-xr-x   - solr     solr                0 2014-04-04 15:16 /solr
drwxrwxrwt   - hdfs     supergroup          0 2014-11-12 11:29 /tmp
drwxr-xr-x   - hdfs     supergroup          0 2014-07-13 09:05 /user
drwxr-xr-x   - hdfs     supergroup          0 2014-04-04 15:15 /var

The output is very similar to the Unix ls command. The file attributes work the same as the user/group/world attributes on a Unix filesystem (including the t sticky bit, as can be seen) plus details of the owner, group, and modification time of the directories. The column between the group name and the modification date is the size; this is 0 for directories but will have a value for files, as we'll see in the code following the next information box:

Note

If relative paths are used, they are taken from the home directory of the user. If there is no home directory, we can create it using the following commands:

$ sudo -u hdfs hdfs dfs -mkdir /user/cloudera
$ sudo -u hdfs hdfs dfs -chown cloudera:cloudera /user/cloudera

The mkdir and chown steps require superuser privileges (sudo -u hdfs).

$ hdfs dfs -mkdir testdir
$ hdfs dfs -ls
Found 1 items
drwxr-xr-x   - cloudera cloudera            0 2014-11-13 11:21 testdir

Then, we can create a file, copy it to HDFS, and read its contents directly from its location on HDFS, as follows:

$ echo "Hello world" > testfile.txt
$ hdfs dfs -put testfile.txt testdir

Note that there is an older command called -copyFromLocal, which works in the same way as -put; you might see it in older documentation online. Now, run the following command and check the output:

$ hdfs dfs -ls testdir
Found 1 items
-rw-r--r--   3 cloudera cloudera         12 2014-11-13 11:21 testdir/testfile.txt

Note the new column between the file attributes and the owner; this is the replication factor of the file. Now, finally, run the following command:

$ hdfs dfs -tail testdir/testfile.txt
Hello world

Most of the remaining dfs subcommands are pretty intuitive; play around. We'll explore snapshots and programmatic access to HDFS later in this chapter.

Protecting the filesystem metadata

Because the fsimage file is so critical to the filesystem, its loss is a catastrophic failure. In Hadoop 1, where the NameNode was a single point of failure, the best practice was to configure the NameNode to synchronously write the fsimage and edits files to both local storage plus at least one other location on a remote filesystem (often NFS). In the event of NameNode failure, a replacement NameNode could be started using this up-to-date copy of the filesystem metadata. The process would require non-trivial manual intervention, however, and would result in a period of complete cluster unavailability.

Secondary NameNode not to the rescue

The most unfortunately named component in all of Hadoop 1 was the SecondaryNameNode, which, not unreasonably, many people expected to be some sort of backup or standby NameNode. It is not; instead, the SecondaryNameNode was responsible only for periodically reading the latest versions of the fsimage and edits files and creating a new up-to-date fsimage with the outstanding edits applied. On a busy cluster, this checkpoint could significantly speed up the restart of the NameNode by reducing the number of edits it had to apply before being able to service clients.

In Hadoop 2, the naming is clearer; there are Checkpoint nodes, which perform the role previously carried out by the SecondaryNameNode, plus Backup NameNodes, which keep a local up-to-date copy of the filesystem metadata, even though the process to promote a Backup node to be the primary NameNode is still a multistage manual process.

Hadoop 2 NameNode HA

In most production Hadoop 2 clusters, however, it makes more sense to use the full High Availability (HA) solution instead of relying on Checkpoint and Backup nodes. It is actually an error to try to combine NameNode HA with the Checkpoint and Backup node mechanisms.

The core idea is for a pair (currently no more than two are supported) of NameNodes configured in an active/passive cluster. One NameNode acts as the live master that services all client requests, and the second remains ready to take over should the primary fail. In particular, Hadoop 2 HDFS enables this HA through two mechanisms:

Providing a means for both NameNodes to have consistent views of the filesystem
Providing a means for clients to always connect to the master NameNode

Keeping the HA NameNodes in sync

There are actually two mechanisms by which the active and standby NameNodes keep their views of the filesystem consistent: use of an NFS share or the Quorum Journal Manager (QJM).

In the NFS case, there is an obvious requirement for an external remote NFS file share; note that, as use of NFS was best practice in Hadoop 1 for a second copy of the filesystem metadata, many clusters already have one. If high availability is a concern, though, it should be borne in mind that making NFS highly available often requires high-end and expensive hardware. In Hadoop 2 HA based on NFS, however, the NFS location becomes the primary location for the filesystem metadata. As the active NameNode writes all filesystem changes to the NFS share, the standby node detects these changes and updates its copy of the filesystem metadata accordingly.

The QJM mechanism uses an external service (the Journal Managers) instead of a filesystem. The Journal Manager cluster is an odd number of services (3, 5, and 7 are the most common) running on that number of hosts. All changes to the filesystem are submitted to the QJM service, and a change is treated as committed only when a majority of the QJM nodes have committed the change. The standby NameNode receives change updates from the QJM service and uses this information to keep its copy of the filesystem metadata up to date.

The QJM mechanism does not require additional hardware, as the Journal Manager services are lightweight and can be co-located with other services. There is also no single point of failure in the model. Consequently, QJM-based HA is usually the preferred option.

In either case, both in NFS-based HA and QJM-based HA, the DataNodes send block status reports to both NameNodes to ensure that both have up-to-date information of the mapping of blocks to DataNodes. Remember that this block assignment information is not held in the fsimage/edits data.

Client configuration

The clients to the HDFS cluster remain mostly unaware of the fact that NameNode HA is being used. The configuration files need to include the details of both NameNodes, but the mechanisms for determining which is the active NameNode (and when to switch to the standby) are fully encapsulated in the client libraries. The fundamental concept, though, is that instead of referring to an explicit NameNode host as in Hadoop 1, HDFS in Hadoop 2 identifies a nameservice ID for the NameNode, within which multiple individual NameNodes (each with its own NameNode ID) are defined for HA. Note that the concept of nameservice ID is also used by NameNode federation, which we briefly mentioned earlier.
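As a rough sketch of what this means in practice (the nameservice name mycluster and the hosts namenode1 and namenode2 are hypothetical, and this is not the complete set of properties an HA deployment needs), the client-side configuration contains entries along these lines:

<!-- core-site.xml -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://mycluster</value>
</property>

<!-- hdfs-site.xml -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>namenode2:8020</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>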

How a failover works

Failover can be either manual or automatic. A manual failover requires an administrator to trigger the switch that promotes the standby to the currently active NameNode. Though automatic failover has the greatest impact on maintaining system availability, there might be conditions in which it is not always desirable. Triggering a manual failover requires running only a few commands and, therefore, even in this mode, the failover is significantly easier than in the case of Hadoop 1 or with Hadoop 2 Backup nodes, where the transition to a new NameNode requires substantial manual effort.

Regardless of whether the failover is triggered manually or automatically, it has two main phases: confirmation that the previous master is no longer serving requests, and the promotion of the standby to be the master.

The greatest risk in a failover is to have a period in which both NameNodes are servicing requests. In such a situation, it is possible that conflicting changes might be made to the filesystem on the two NameNodes or that they might become out of sync. Even though this should not be possible if the QJM is being used (it only ever accepts connections from a single client), out-of-date information might be served to clients, who might then try to make incorrect decisions based on this stale metadata. This is, of course, particularly likely if the previous master NameNode is behaving incorrectly in some way, which is why the need for the failover is identified in the first place.

To ensure only one NameNode is active at any time, a fencing mechanism is used to validate that the existing NameNode master has been shut down. The simplest included mechanism will try to ssh into the NameNode host and actively kill the process, though a custom script can also be executed, so the mechanism is flexible. The failover will not continue until the fencing is successful and the system has confirmed that the previous master NameNode is now dead and has released any required resources.

Once fencing succeeds, the standby NameNode becomes the master and will start writing to the NFS-mounted fsimage and edits logs if NFS is being used for HA, or will become the single client to the QJM if that is the HA mechanism.

Before discussing automatic failover, we need a slight segue to introduce another Apache project that is used to enable this feature.

Apache ZooKeeper – a different type of filesystem

Within Hadoop, we will mostly talk about HDFS when discussing filesystems and data storage. But, inside almost all Hadoop 2 installations, there is another service that looks somewhat like a filesystem, but which provides significant capability crucial to the proper functioning of distributed systems. This service is Apache ZooKeeper (http://zookeeper.apache.org) and, as it is a key part of the implementation of HDFS HA, we will introduce it in this chapter. It is, however, also used by multiple other Hadoop components and related projects, so we will touch on it several more times throughout the book.

ZooKeeper started out as a subcomponent of HBase and was used to enable several operational capabilities of the service. When any complex distributed system is built, there are a series of activities that are almost always required and which are always difficult to get right. These activities include things such as handling shared locks, detecting component failure, and supporting leader election within a group of collaborating services. ZooKeeper was created as the coordination service that would provide a series of primitive operations upon which HBase could implement these types of operationally critical features. Note that ZooKeeper also takes inspiration from the Google Chubby system described at http://research.google.com/archive/chubby-osdi06.pdf.

ZooKeeper runs as a cluster of instances referred to as an ensemble. The ensemble provides a data structure, which is somewhat analogous to a filesystem. Each location in the structure is called a ZNode and can have children as if it were a directory, but can also have content as if it were a file. Note that ZooKeeper is not a suitable place to store very large amounts of data, and by default, the maximum amount of data in a ZNode is 1 MB. At any point in time, one server in the ensemble is the master and makes all decisions about client requests. There are very well-defined rules around the responsibilities of the master, including that it has to ensure that a request is only committed when a majority of the ensemble have committed the change, and that once committed any conflicting change is rejected.

You should have ZooKeeper installed within your Cloudera Virtual Machine. If not, use Cloudera Manager to install it as a single node on the host. In production systems, ZooKeeper has very specific semantics around absolute majority voting, so some of the logic only makes sense in a larger ensemble (3, 5, or 7 nodes are the most common sizes).

There is a command-line client to ZooKeeper called zookeeper-client in the Cloudera VM; note that in the vanilla ZooKeeper distribution it is called zkCli.sh. If you run it with no arguments, it will connect to the ZooKeeper server running on the local machine. From here, you can type help to get a list of commands.

The most immediately interesting commands will be create, ls, and get. As the names suggest, these create a ZNode, list the ZNodes at a particular point in the filesystem, and get the data stored at a particular ZNode. Here are some examples of usage.

Create a ZNode with no data:

$ create /zk-test ''

Create a child of the first ZNode and store some text in it:

$ create /zk-test/child1 'sampledata'

Retrieve the data associated with a particular ZNode:

$ get /zk-test/child1

The client can also register a watcher on a given ZNode; this will raise an alert if the ZNode in question changes, that is, if either its data or its children are modified.

This might not sound very useful, but ZNodes can additionally be created as sequential and ephemeral nodes, and this is where the magic starts.

Implementing a distributed lock with sequential ZNodes

If a ZNode is created within the CLI with the -s option, it will be created as a sequential node. ZooKeeper will suffix the supplied name with a 10-digit integer guaranteed to be unique and greater than any other sequential children of the same ZNode. We can use this mechanism to create a distributed lock. ZooKeeper itself is not holding the actual lock; the client needs to understand what particular states in ZooKeeper mean in terms of their mapping to the application locks in question.

If we create a (non-sequential) ZNode at /zk-lock, then any client wishing to hold the lock will create a sequential child node. For example, the create -s /zk-lock/locknode command might create the node /zk-lock/locknode-0000000001 in the first case, with increasing integer suffixes for subsequent calls. When a client creates a ZNode under the lock, it will then check if its sequential node has the lowest integer suffix. If it does, then it is treated as having the lock. If not, then it will need to wait until the node holding the lock is deleted. The client will usually put a watch on the node with the next lowest suffix and then be alerted when that node is deleted, indicating that it now holds the lock.

Implementing group membership and leader election using ephemeral ZNodes

Any ZooKeeper client will send heartbeats to the server throughout the session, showing that it is alive. For the ZNodes we have discussed until now, we can say that they are persistent and will survive across sessions. We can, however, create a ZNode as ephemeral, meaning it will disappear once the client that created it either disconnects or is detected as being dead by the ZooKeeper server. Within the CLI, an ephemeral ZNode is created by adding the -e flag to the create command.

Ephemeral ZNodes are a good mechanism to implement group membership discovery within a distributed system. For any system where nodes can fail, join, and leave without notice, knowing which nodes are alive at any point in time is often a difficult task. Within ZooKeeper, we can provide the basis for such discovery by having each node create an ephemeral ZNode at a certain location in the ZooKeeper filesystem. The ZNodes can hold data about the service nodes, such as hostname, IP address, port number, and so on. To get a list of live nodes, we can simply list the child nodes of the parent group ZNode. Because of the nature of ephemeral nodes, we can have confidence that the list of live nodes retrieved at any time is up to date.

If we have each service node create ZNode children that are not just ephemeral but also sequential, then we can also build a mechanism for leader election for services that need to have a single master node at any one time. The mechanism is the same as for locks; the client service node creates the sequential and ephemeral ZNode and then checks if it has the lowest sequence number. If so, then it is the master. If not, then it will register a watcher on the next lowest sequence node to be alerted when it might become the master.

Java API

The org.apache.zookeeper.ZooKeeper class is the main programmatic client to access a ZooKeeper ensemble. Refer to the javadocs for the full details, but the basic interface is relatively straightforward, with an obvious one-to-one correspondence to commands in the CLI. For example:

create: is equivalent to CLI create
getChildren: is equivalent to CLI ls
getData: is equivalent to CLI get
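The following minimal sketch (the connection string and paths are placeholders, and error handling is omitted) shows this correspondence in code, repeating the earlier CLI session:

import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperCliEquivalents {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper server; 5000 ms is the session timeout and
        // no Watcher is registered for this simple example.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, null);

        // CLI: create /zk-test ''
        zk.create("/zk-test", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // CLI: create /zk-test/child1 'sampledata'
        zk.create("/zk-test/child1", "sampledata".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // CLI: ls /zk-test
        List<String> children = zk.getChildren("/zk-test", false);

        // CLI: get /zk-test/child1
        byte[] data = zk.getData("/zk-test/child1", false, null);

        System.out.println(children + " : " + new String(data));
        zk.close();
    }
}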

Building blocks

As can be seen, ZooKeeper provides a small number of well-defined operations with very strong semantic guarantees that can be built into higher-level services, such as the locks, group membership, and leader election we discussed earlier. It's best to think of ZooKeeper as a toolkit of well-engineered and reliable functions critical to distributed systems that can be built upon without having to worry about the intricacies of their implementation. The provided ZooKeeper interface is quite low-level though, and there are a few higher-level interfaces emerging that provide more of a mapping of the low-level primitives into application-level logic. The Curator project (http://curator.apache.org/) is a good example of this.

ZooKeeper was used sparingly within Hadoop 1, but it's now quite ubiquitous. It's used by both MapReduce and HDFS for the high availability of their JobTracker and NameNode components. Hive and Impala, which we will explore later, use it to place locks on data tables that are being accessed by multiple concurrent jobs. Kafka, which we'll discuss in the context of Samza, uses ZooKeeper for node (broker in Kafka terminology) membership, leader election, and state management.

Further reading

We have not described ZooKeeper in much detail and have completely omitted aspects such as its ability to apply quotas and access control lists to ZNodes within the filesystem and the mechanisms to build callbacks. Our purpose here was to give enough of the details so that you would have some idea of how it is being used within the Hadoop services we explore in this book. For more information, consult the project homepage.

Automatic NameNode failover

Now that we have introduced ZooKeeper, we can show how it is used to enable automatic NameNode failover.

Automatic NameNode failover introduces two new components to the system: a ZooKeeper quorum, and the ZooKeeper Failover Controller (ZKFC), which runs on each NameNode host. The ZKFC creates an ephemeral ZNode in ZooKeeper and holds this ZNode for as long as it detects the local NameNode to be alive and functioning correctly. It determines this by continuously sending simple health-check requests to the NameNode, and if the NameNode fails to respond correctly over a short period of time, the ZKFC will assume the NameNode has failed. If a NameNode machine crashes or otherwise fails, the ZKFC session in ZooKeeper will be closed and the ephemeral ZNode will also be automatically removed.

The ZKFC processes are also monitoring the ZNodes of the other NameNodes in the cluster. If the ZKFC on the standby NameNode host sees the existing master ZNode disappear, it will assume the master has failed and will attempt a failover. It does this by trying to acquire the lock for the NameNode (through the protocol described in the ZooKeeper section) and, if successful, will initiate a failover through the same fencing/promotion mechanism described earlier.

HDFS snapshots

We mentioned earlier that HDFS replication alone is not a suitable backup strategy. In the Hadoop 2 filesystem, snapshots have been added, which bring another level of data protection to HDFS.

Filesystem snapshots have been used for some time across a variety of technologies. The basic idea is that it becomes possible to view the exact state of the filesystem at particular points in time. This is achieved by taking a copy of the filesystem metadata at the point the snapshot is made and making this available to be viewed in the future.

As changes to the filesystem are made, any change that would affect the snapshot is treated specially. For example, if a file that exists in the snapshot is deleted then, even though it will be removed from the current state of the filesystem, its metadata will remain in the snapshot, and the blocks associated with its data will remain on the filesystem, though not accessible through any view of the system other than the snapshot.

An example might illustrate this point. Say you have a filesystem containing the following files:

/data1 (5 blocks)
/data2 (10 blocks)

You take a snapshot and then delete the file /data2. If you view the current state of the filesystem, then only /data1 will be visible. If you examine the snapshot, you will see both files. Behind the scenes, all 15 blocks still exist, but only those associated with the undeleted file /data1 are part of the current filesystem. The blocks for the file /data2 will be released only when the snapshot is itself removed; snapshots are read-only views.

Snapshots in Hadoop 2 can be applied at either the full filesystem level or only on particular paths. A path needs to be set as snapshottable, and note that you cannot have a path snapshottable if any of its children or parent paths are themselves snapshottable.

Let's take a simple example based on the directory we created earlier to illustrate the use of snapshots. The commands we are going to illustrate need to be executed with superuser privileges, which can be obtained with sudo -u hdfs.

First, use the dfsadmin subcommand of the hdfs CLI utility to enable snapshots of a directory, as follows:

$ sudo -u hdfs hdfs dfsadmin -allowSnapshot /user/cloudera/testdir
Allowing snapshot on testdir succeeded

Now, we create the snapshot and examine it; snapshots are available through the .snapshot subdirectory of the snapshottable directory. Note that the .snapshot directory will not be visible in a normal listing of the directory. Here's how we create a snapshot and examine it:

$ sudo -u hdfs hdfs dfs -createSnapshot /user/cloudera/testdir sn1
Created snapshot /user/cloudera/testdir/.snapshot/sn1
$ sudo -u hdfs hdfs dfs -ls /user/cloudera/testdir/.snapshot/sn1
Found 1 items
-rw-r--r--   1 cloudera cloudera         12 2014-11-13 11:21 /user/cloudera/testdir/.snapshot/sn1/testfile.txt

Now, we remove the test file from the main directory and verify that it is now empty:

$ sudo -u hdfs hdfs dfs -rm /user/cloudera/testdir/testfile.txt
14/11/13 13:13:51 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 1440 minutes, Emptier interval = 0 minutes.
Moved: 'hdfs://localhost.localdomain:8020/user/cloudera/testdir/testfile.txt' to trash at: hdfs://localhost.localdomain:8020/user/hdfs/.Trash/Current
$ hdfs dfs -ls /user/cloudera/testdir
$

Note the mention of trash directories; by default, HDFS will copy any deleted files into a .Trash directory in the user's home directory, which helps to defend against slipping fingers. These files can be removed through hdfs dfs -expunge or will be automatically purged in 7 days by default.

Now, we examine the snapshot, where the now-deleted file is still available:

$ hdfs dfs -ls testdir/.snapshot/sn1
Found 1 items
drwxr-xr-x   - cloudera cloudera            0 2014-11-13 13:12 testdir/.snapshot/sn1
$ hdfs dfs -tail testdir/.snapshot/sn1/testfile.txt
Hello world

Then, we can delete the snapshot, freeing up any blocks held by it, as follows:

$ sudo -u hdfs hdfs dfs -deleteSnapshot /user/cloudera/testdir sn1
$ hdfs dfs -ls testdir/.snapshot
$

As can be seen, the files within a snapshot are fully available to be read and copied, providing access to the historical state of the filesystem at the point when the snapshot was made. Each directory can have up to 65,535 snapshots, and HDFS manages snapshots in such a way that they are quite efficient in terms of impact on normal filesystem operations. They are a great mechanism to use prior to any activity that might have adverse effects, such as trying a new version of an application that accesses the filesystem. If the new software corrupts files, the old state of the directory can be restored. If, after a period of validation, the software is accepted, then the snapshot can instead be deleted.

Hadoop filesystems

Until now, we have referred to HDFS as the Hadoop filesystem. In reality, Hadoop has a rather abstract notion of a filesystem. HDFS is only one of several implementations of the org.apache.hadoop.fs.FileSystem Java abstract class. A list of available filesystems can be found at https://hadoop.apache.org/docs/r2.5.0/api/org/apache/hadoop/fs/FileSystem.html. The following table summarizes some of these filesystems, along with the corresponding URI scheme and Java implementation class.

Filesystem         URI scheme   Java implementation
Local              file         org.apache.hadoop.fs.LocalFileSystem
HDFS               hdfs         org.apache.hadoop.hdfs.DistributedFileSystem
S3 (native)        s3n          org.apache.hadoop.fs.s3native.NativeS3FileSystem
S3 (block-based)   s3           org.apache.hadoop.fs.s3.S3FileSystem

There exist two implementations of the S3 filesystem. The native one, s3n, is used to read and write regular files. Data stored using s3n can be accessed by any S3 tool and, conversely, s3n can be used to read data generated by other S3 tools. s3n cannot handle files larger than 5 TB or rename operations.

Much like HDFS, the block-based S3 filesystem stores files in blocks and requires an S3 bucket to be dedicated to the filesystem. Files stored in a block-based S3 filesystem can be larger than 5 TB, but they will not be interoperable with other S3 tools. Additionally, the block-based S3 filesystem supports rename operations.

Hadoop interfaces

Hadoop is written in Java and, not surprisingly, all interaction with the system happens via the Java API. The command-line interface we used through the hdfs command in previous examples is a Java application that uses the FileSystem class to carry out input/output operations on the available filesystems.

Java FileSystem API

The Java API, provided by the org.apache.hadoop.fs package, exposes Apache Hadoop filesystems.

org.apache.hadoop.fs.FileSystem is the abstract class each filesystem implements, and it provides a general interface to interact with data in Hadoop. All code that uses HDFS should be written with the capability of handling a FileSystem object.
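As a minimal sketch of the API (the path simply reuses the file created earlier in this chapter, and error handling is kept to a minimum), the following program obtains whichever filesystem is configured as the default and prints the contents of a file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CatHdfsFile {
    public static void main(String[] args) throws Exception {
        // Configuration picks up core-site.xml and friends from the classpath,
        // so FileSystem.get() returns whichever filesystem fs.defaultFS points at.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/user/cloudera/testdir/testfile.txt");
        FSDataInputStream in = fs.open(path);
        try {
            // Copy the file contents to standard output.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            in.close();
        }
    }
}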

Libhdfs

Libhdfs is a C library that, despite its name, can be used to access any Hadoop filesystem and not just HDFS. It is written using the Java Native Interface (JNI) and mimics the Java FileSystem class.

Thrift

Apache Thrift (http://thrift.apache.org) is a framework for building cross-language software through data serialization and remote method invocation mechanisms. The Hadoop Thrift API, available in contrib, exposes Hadoop filesystems as a Thrift service. This interface makes it easy for non-Java code to access data stored in a Hadoop filesystem.

Other than the aforementioned interfaces, there exist interfaces that allow access to Hadoop filesystems via HTTP and FTP (these for HDFS only), as well as WebDAV.

Managing and serializing data

Having a filesystem is all well and good, but we also need mechanisms to represent data and store it on the filesystems. We will explore some of these mechanisms now.

The Writable interface

It is useful to us as developers if we can manipulate higher-level data types and have Hadoop look after the processes required to serialize them into bytes when writing to a filesystem and to reconstruct them from a stream of bytes when reading from the filesystem.

The org.apache.hadoop.io package contains the Writable interface, which provides this mechanism and is specified as follows:

public interface Writable
{
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

The main purpose of this interface is to provide mechanisms for the serialization and deserialization of data as it is passed across the network or read from and written to disk.

When we explore processing frameworks on Hadoop in later chapters, we will often see instances where the requirement is for a data argument to be of the type Writable. If we use data structures that provide a suitable implementation of this interface, then the Hadoop machinery can automatically manage the serialization and deserialization of the data type without knowing anything about what it represents or how it is used.
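As a minimal sketch of what such an implementation looks like (the class and its two fields are purely illustrative and are not used elsewhere in the book), a custom type only needs to write and read its fields in a consistent order:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class ScoreRecord implements Writable {
    private long id;
    private double score;

    // Hadoop instantiates Writables reflectively, so a no-argument
    // constructor is required before readFields() can be called.
    public ScoreRecord() {
    }

    public ScoreRecord(long id, double score) {
        this.id = id;
        this.score = score;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // Serialize the fields in a fixed order...
        out.writeLong(id);
        out.writeDouble(score);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // ...and read them back in exactly the same order.
        id = in.readLong();
        score = in.readDouble();
    }
}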

Introducing the wrapper classes

Fortunately, you don't have to start from scratch and build Writable variants of all the data types you will use. Hadoop provides classes that wrap the Java primitive types and implement the Writable interface. They are provided in the org.apache.hadoop.io package.

These classes are conceptually similar to the primitive wrapper classes, such as Integer and Long, found in java.lang. They hold a single primitive value that can be set either at construction or via a setter method. They are as follows:

BooleanWritable
ByteWritable
DoubleWritable
FloatWritable
IntWritable
LongWritable
VIntWritable: a variable-length integer type
VLongWritable: a variable-length long type

There is also Text, which wraps java.lang.String.
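A brief sketch of how these wrappers are typically used (the values are arbitrary):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;

    public class WrapperDemo {
        public static void main(String[] args) {
            // Wrappers hold a single value, settable at construction or later.
            IntWritable count = new IntWritable(1);
            count.set(42);                    // update the wrapped int
            int raw = count.get();            // back to a Java primitive

            Text word = new Text("hadoop");   // wraps java.lang.String
            System.out.println(word + " seen " + raw + " times");
        }
    }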


Array wrapper classes
Hadoop also provides some collection-based wrapper classes. These classes provide Writable wrappers for arrays of other Writable objects. For example, an instance could hold an array of IntWritable or DoubleWritable, but not arrays of the raw int or float types. A specific subclass for the required Writable class will be required. They are as follows:

ArrayWritable

TwoDArrayWritable


The Comparable and WritableComparable interfaces
We were slightly inaccurate when we said that the wrapper classes implement Writable; they actually implement a composite interface called WritableComparable in the org.apache.hadoop.io package, which combines Writable with the standard java.lang.Comparable interface:

public interface WritableComparable extends Writable, Comparable
{}

The need for Comparable will only become apparent when we explore MapReduce in the next chapter, but for now, just remember that the wrapper classes provide mechanisms for them to be both serialized and sorted by Hadoop or any of its frameworks.
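As a hedged sketch, a custom key type only needs to add a compareTo() implementation on top of the Writable methods; the field used for ordering here is hypothetical:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    public class TweetIdKey implements WritableComparable<TweetIdKey> {
        private long id;  // hypothetical field used as the sort key

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeLong(id);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            id = in.readLong();
        }

        @Override
        public int compareTo(TweetIdKey other) {
            // Defines the sort order used when Hadoop sorts keys.
            return Long.compare(id, other.id);
        }

        // Real key types usually also override hashCode() and equals()
        // so that partitioning and grouping behave consistently.
    }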


Storing data
Until now, we have introduced the architecture of HDFS and how to store and retrieve data using the command-line tools and the Java API. In the examples seen until now, we have implicitly assumed that our data was stored as text files. In reality, some applications and datasets will require ad hoc data structures to hold the file's contents. Over the years, file formats have been created to address both the requirements of MapReduce processing (for instance, we want data to be splittable) and the need to model both structured and unstructured data. Currently, a lot of focus is dedicated to better capturing the use cases of relational data storage and modeling. In the remainder of this chapter, we will introduce some of the popular file format choices available within the Hadoop ecosystem.


Serialization and Containers
When talking about file formats, we are assuming two types of scenarios, which are as follows:

Serialization: we want to encode data structures generated and manipulated at processing time into a format that we can store to a file, transmit, and, at a later stage, retrieve and translate back for further manipulation
Containers: once data is serialized to files, containers provide a means to group multiple files together and add additional metadata


Compression
When working with data, file compression can often lead to significant savings, both in terms of the space necessary to store files and in the data I/O across the network and to/from local disks.

In broad terms, when using a processing framework, compression can occur at three points in the processing pipeline:

Input files to be processed
Output files that result after processing is completed
Intermediate/temporary files produced internally within the pipeline

When we add compression at any of these stages, we have an opportunity to dramatically reduce the amount of data to be read from or written to disk or across the network. This is particularly useful with frameworks such as MapReduce that can, for example, produce volumes of temporary data that are larger than either the input or output datasets.

Apache Hadoop comes with a number of compression codecs: gzip, bzip2, LZO, and Snappy, each with its own trade-offs. Picking a codec is an educated choice that should consider both the kind of data being processed and the nature of the processing framework itself.

Other than the general space/time trade-off, where the largest space savings come at the expense of compression and decompression speed (and vice versa), we need to take into account that data stored in HDFS will be accessed by parallel, distributed software; some of this software will also add its own particular requirements on file formats. MapReduce, for example, is most efficient on files that can be split into valid subfiles.

This can complicate decisions, such as the choice of whether to compress at all and which codec to use if so, as most compression codecs (such as gzip) do not support splittable files, whereas a few (such as LZO) do.
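To show where these choices surface in code, the following hedged sketch enables compression of intermediate map output and of the final job output in a MapReduce driver. The property names are standard Hadoop 2 settings, but the availability of specific codecs (Snappy, LZO) depends on how the cluster was built, so treat this as illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Inside a driver's run() method:
    Configuration conf = new Configuration();
    // Compress intermediate map output to cut shuffle traffic.
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec",
            SnappyCodec.class, CompressionCodec.class);

    Job job = Job.getInstance(conf, "compressed job");
    // Compress the final output files written by the reducers.
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);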


General-purpose file formats
The first class of file formats comprises general-purpose formats that can be applied to any application domain and make no assumptions about data structure or access patterns.

Text: the simplest approach to storing data on HDFS is to use flat files. Text files can be used both to hold unstructured data (a web page or a tweet) as well as structured data (a CSV file that is a few million rows long). Text files are splittable, though one needs to consider how to handle boundaries between multiple elements (for example, lines) in the file.
SequenceFile: a SequenceFile is a flat data structure consisting of binary key/value pairs, introduced to address specific requirements of MapReduce-based processing. It is still extensively used in MapReduce as an input/output format. As we will see in Chapter 3, Processing – MapReduce and Beyond, internally the temporary outputs of maps are stored using SequenceFile.

SequenceFile provides Writer, Reader, and Sorter classes to write, read, and sort data, respectively.

Depending on the compression mechanism in use, three variations of SequenceFile can be distinguished:

Uncompressed key/value records.
Record-compressed key/value records: only 'values' are compressed.
Block-compressed key/value records: keys and values are collected in blocks of arbitrary size and compressed separately.

In each case, however, the SequenceFile remains splittable, which is one of its biggest strengths.
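A hedged sketch of the Writer and Reader classes in action; the file name and key/value types are arbitrary choices rather than anything used elsewhere in this book:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SequenceFileDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path path = new Path("tweets.seq");   // hypothetical location

            // Write a couple of key/value records.
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(path),
                    SequenceFile.Writer.keyClass(IntWritable.class),
                    SequenceFile.Writer.valueClass(Text.class))) {
                writer.append(new IntWritable(1), new Text("first tweet"));
                writer.append(new IntWritable(2), new Text("second tweet"));
            }

            // Read them back in insertion order.
            IntWritable key = new IntWritable();
            Text value = new Text();
            try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                    SequenceFile.Reader.file(path))) {
                while (reader.next(key, value)) {
                    System.out.println(key + "\t" + value);
                }
            }
        }
    }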


Column-oriented data formats
In the relational database world, column-oriented data stores organize and store tables based on columns; generally speaking, the data for each column will be stored together. This is a significantly different approach compared to most relational DBMSs, which organize data per row. Column-oriented storage has significant performance advantages; for example, if a query needs to read only two columns from a very wide table containing hundreds of columns, then only the required column data files are accessed. A traditional row-oriented database would have to read all columns for each row for which data was required. This has the greatest impact on workloads where aggregate functions are computed over large numbers of similar items, such as the OLAP workloads typical of data warehouse systems.

In Chapter 7, Hadoop and SQL, we will see how Hadoop is becoming a SQL backend for the data warehouse world thanks to projects such as Apache Hive and Cloudera Impala. As part of the expansion into this domain, a number of file formats have been developed to account for both relational modeling and data warehousing needs.

RCFile, ORC, and Parquet are three state-of-the-art column-oriented file formats developed with these use cases in mind.

RCFile
The Record Columnar File (RCFile) was originally developed by Facebook as the backend storage for its Hive data warehouse system, the first mainstream SQL-on-Hadoop system available as open source.

RCFile aims to provide the following:

Fast data loading
Fast query processing
Efficient storage utilization
Adaptability to dynamic workloads

More information on RCFile can be found at http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/abs11-4.html.

ORC
The Optimized Row Columnar (ORC) file format aims to combine the performance of RCFile with the flexibility of Avro. It is primarily intended to work with Apache Hive and was initially developed by Hortonworks to overcome the perceived limitations of other available file formats.

More details can be found at http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/orcfile.html.

Parquet
Parquet, found at http://parquet.incubator.apache.org, was originally a joint effort of Cloudera, Twitter, and Criteo, and has now been donated to the Apache Software Foundation. The goal of Parquet is to provide a modern, performant, columnar file format to be used with Cloudera Impala. As with Impala, Parquet was inspired by the Dremel paper (http://research.google.com/pubs/pub36632.html). It allows complex, nested data structures and efficient encoding at a per-column level.

Avro
Apache Avro (http://avro.apache.org) is a schema-oriented binary data serialization format and file container. Avro will be our preferred binary data format throughout this book. It is both splittable and compressible, making it an efficient format for data processing with frameworks such as MapReduce.

Numerous other projects also have built-in Avro support and integration, so it is very widely applicable. When data is stored in an Avro file, its schema, defined as a JSON object, is stored with it. A file can later be processed by a third party with no a priori notion of how the data is encoded. This makes the data self-describing and facilitates use with dynamic and scripting languages. The schema-on-read model also helps Avro records to be efficient to store, as there is no need for the individual fields to be tagged.
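For illustration only, a minimal schema for a tweet record can be defined inline as JSON and parsed with the Avro Java API; the real tweets_avro.avsc used below contains more fields than this hypothetical single-field version:

    import org.apache.avro.Schema;

    public class SchemaDemo {
        public static void main(String[] args) {
            // A hypothetical, cut-down schema with a single string field.
            String schemaJson =
                  "{\"type\": \"record\", \"name\": \"tweets_avro\","
                + " \"namespace\": \"com.learninghadoop2.avrotables\","
                + " \"fields\": [ {\"name\": \"text\", \"type\": \"string\"} ]}";

            Schema schema = new Schema.Parser().parse(schemaJson);
            System.out.println(schema.getField("text").schema().getType());
        }
    }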

In later chapters, you will see how these properties can make data lifecycle management easier and allow non-trivial operations such as schema migrations.

Using the Java API
We'll now demonstrate the use of the Java API to parse Avro schemas, read and write Avro files, and use Avro's code generation facilities. Note that the format is intrinsically language independent; there are APIs for most languages, and files created by Java will seamlessly be read from any other language.

Avro schemas are described as JSON documents and represented by the org.apache.avro.Schema class. To demonstrate the API for manipulating Avro documents, we'll look ahead to an Avro specification we use for a Hive table in Chapter 7, Hadoop and SQL. The following code can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch2/src/main/java/com/learninghadoop2/avro/AvroParse.java.

In the following code, we will use the Avro Java API to create an Avro file containing a tweet record and then re-read the file, using the schema stored in the file to extract the details of the stored records:

public static void testGenericRecord() {
    try {
        Schema schema = new Schema.Parser()
            .parse(new File("tweets_avro.avsc"));
        GenericRecord tweet = new GenericData
            .Record(schema);
        tweet.put("text", "The generic tweet text");

        File file = new File("tweets.avro");
        DatumWriter<GenericRecord> datumWriter =
            new GenericDatumWriter<>(schema);
        DataFileWriter<GenericRecord> fileWriter =
            new DataFileWriter<>(datumWriter);
        fileWriter.create(schema, file);
        fileWriter.append(tweet);
        fileWriter.close();

        DatumReader<GenericRecord> datumReader =
            new GenericDatumReader<>(schema);
        DataFileReader<GenericRecord> fileReader =
            new DataFileReader<>(file, datumReader);
        GenericRecord genericTweet = null;

        while (fileReader.hasNext()) {
            genericTweet = (GenericRecord) fileReader
                .next(genericTweet);
            for (Schema.Field field :
                    genericTweet.getSchema().getFields()) {
                Object val = genericTweet.get(field.name());
                if (val != null) {
                    System.out.println(val);
                }
            }
        }
    } catch (IOException ie) {
        System.out.println("Error parsing or writing file.");
    }
}

The tweets_avro.avsc schema, found at https://github.com/learninghadoop2/book-examples/blob/master/ch2/tweets_avro.avsc, describes a tweet with multiple fields. To create an Avro object of this type, we first parse the schema file. We then use Avro's concept of a GenericRecord to build an Avro document that complies with this schema. In this case, we only set a single attribute: the tweet text itself.

To write this Avro file, containing a single object, we then use Avro's I/O capabilities. To read the file, we do not need to start with the schema, as we can extract it from the GenericRecord we read from the file. We then walk through the schema structure and dynamically process the document based on the discovered fields. This is particularly powerful, as it is the key enabler of clients remaining independent of the Avro schema and of how it evolves over time.

If we have the schema file in advance, however, we can use Avro code generation to create a customized class that makes manipulating Avro records much easier. To generate the code, we use the compile class in avro-tools.jar, passing it the name of the schema file and the desired output directory:

$ java -jar /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/avro/avro-tools.jar compile schema tweets_avro.avsc src/main/java

The class will be placed in a directory structure based on any namespace defined in the schema. Since we created this schema in the com.learninghadoop2.avrotables namespace, we see the following:

$ ls src/main/java/com/learninghadoop2/avrotables/tweets_avro.java

With this class, let's revisit the creation, reading, and writing of Avro objects, as follows:

public static void testGeneratedCode() {
    tweets_avro tweet = new tweets_avro();
    tweet.setText("The code generated tweet text");

    try {
        File file = new File("tweets.avro");
        DatumWriter<tweets_avro> datumWriter =
            new SpecificDatumWriter<>(tweets_avro.class);
        DataFileWriter<tweets_avro> fileWriter =
            new DataFileWriter<>(datumWriter);
        fileWriter.create(tweet.getSchema(), file);
        fileWriter.append(tweet);
        fileWriter.close();

        DatumReader<tweets_avro> datumReader =
            new SpecificDatumReader<>(tweets_avro.class);
        DataFileReader<tweets_avro> fileReader =
            new DataFileReader<>(file, datumReader);

        while (fileReader.hasNext()) {
            tweet = fileReader.next(tweet);
            System.out.println(tweet.getText());
        }
    } catch (IOException ie) {
        System.out.println("Error in parsing or writing files.");
    }
}

Because we used code generation, we now use Avro's SpecificRecord mechanism alongside the generated class that represents the object in our domain model. Consequently, we can directly instantiate the object and access its attributes through familiar get/set methods.

Writing the file is similar to the action performed before, except that we use the specific classes and also retrieve the schema directly from the tweet object when needed. Reading is similarly eased through the ability to create instances of a specific class and use get/set methods.


Summary
This chapter has given a whistle-stop tour through storage on a Hadoop cluster. In particular, we covered:

The high-level architecture of HDFS, the main filesystem used in Hadoop
How HDFS works under the covers and, in particular, its approach to reliability
How Hadoop 2 has added significantly to HDFS, particularly in the form of NameNode HA and filesystem snapshots
What ZooKeeper is and how it is used by Hadoop to enable features such as NameNode automatic failover
An overview of the command-line tools used to access HDFS
The API for filesystems in Hadoop and how, at a code level, HDFS is just one implementation of a more flexible filesystem abstraction
How data can be serialized onto a Hadoop filesystem and some of the support provided in the core classes
The various file formats in which data is most frequently stored in Hadoop and some of their particular use cases

In the next chapter, we will look in detail at how Hadoop provides processing frameworks that can be used to process the data stored within it.


Chapter 3. Processing – MapReduce and Beyond
In Hadoop 1, the platform had two clear components: HDFS for data storage and MapReduce for data processing. The previous chapter described the evolution of HDFS in Hadoop 2, and in this chapter we'll discuss data processing.

The picture with processing in Hadoop 2 has changed more significantly than storage has, and Hadoop now supports multiple processing models as first-class citizens. In this chapter we'll explore both MapReduce and other computational models in Hadoop 2. In particular, we'll cover:

What MapReduce is and the Java API required to write applications for it
How MapReduce is implemented in practice
How Hadoop reads data into and out of its processing jobs
YARN, the Hadoop 2 component that allows processing beyond MapReduce on the platform
An introduction to several computational models implemented on YARN


MapReduce
MapReduce is the primary processing model supported in Hadoop 1. It follows a divide-and-conquer model for processing data made popular by a 2004 paper from Google (http://research.google.com/archive/mapreduce.html) and has foundations both in functional programming and database research. The name itself refers to two distinct steps applied to all input data: a map function and a reduce function.

Every MapReduce application is a sequence of jobs that build atop this very simple model. Sometimes, the overall application may require multiple jobs, where the output of the reduce stage from one is the input to the map stage of another, and sometimes there might be multiple map or reduce functions, but the core concepts remain the same.

We will introduce the MapReduce model by looking at the nature of the map and reduce functions and then describe the Java API required to build implementations of these functions. After showing some examples, we will walk through a MapReduce execution to give more insight into how the actual MapReduce framework executes code at runtime.

Learning the MapReduce model can be a little counter-intuitive; it's often difficult to appreciate how very simple functions can, when combined, provide very rich processing on enormous datasets. But it does work, trust us!

As we explore the nature of the map and reduce functions, think of them as being applied to a stream of records retrieved from the source dataset. We'll describe how that happens later; for now, think of the source data as being sliced into smaller chunks, each of which gets fed to a dedicated instance of the map function. Each record has the map function applied to it, producing a set of intermediary data. Records are retrieved from this temporary dataset, and all associated records are fed together through the reduce function. The final output of the reduce function for all the sets of records is the overall result for the complete job.

From a functional perspective, MapReduce transforms data structures from one list of (key, value) pairs into another. During the Map phase, data is loaded from HDFS, a function is applied in parallel to every input (key, value) pair, and a new list of (key, value) pairs is the output:

map(k1, v1) -> list(k2, v2)

The framework then collects all pairs with the same key from all lists and groups them together, creating one group for each key. A Reduce function is applied in parallel to each group, which in turn produces a list of values:

reduce(k2, list(v2)) -> (k3, list(v3))

The output is then written back to HDFS, as shown in the following figure:


Map and Reduce phases


Java API to MapReduce
The Java API to MapReduce is exposed by the org.apache.hadoop.mapreduce package. Writing a MapReduce program, at its core, is a matter of subclassing the Hadoop-provided Mapper and Reducer base classes and overriding the map() and reduce() methods with our own implementations.


The Mapper class
For our own Mapper implementations, we will subclass the Mapper base class and override the map() method, as follows:

class Mapper<K1, V1, K2, V2>
{
    void map(K1 key, V1 value, Mapper.Context context)
        throws IOException, InterruptedException
    ...
}

The class is defined in terms of the key/value input and output types, and the map method takes an input key/value pair as its parameters. The other parameter is an instance of the Context class, which provides various mechanisms to communicate with the Hadoop framework, one of which is to output the results of a map or reduce method.

Notice that the map method only refers to a single instance of K1 and V1 key/value pairs. This is a critical aspect of the MapReduce paradigm: you write classes that process single records, and the framework is responsible for all the work required to turn an enormous dataset into a stream of key/value pairs. You will never have to write map or reduce classes that try to deal with the full dataset. Hadoop also provides mechanisms through its InputFormat and OutputFormat classes that supply implementations of common file formats and likewise remove the need to write file parsers for anything but custom file types.

There are three additional methods that may sometimes need to be overridden:

protected void setup(Mapper.Context context)
    throws IOException, InterruptedException

This method is called once before any key/value pairs are presented to the map method. The default implementation does nothing.

protected void cleanup(Mapper.Context context)
    throws IOException, InterruptedException

This method is called once after all key/value pairs have been presented to the map method. The default implementation does nothing.

protected void run(Mapper.Context context)
    throws IOException, InterruptedException

This method controls the overall flow of task processing within a JVM. The default implementation calls the setup method once before repeatedly calling the map method for each key/value pair in the split, and then finally calls the cleanup method.
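To make that flow concrete, the default run() implementation is roughly equivalent to the following simplified sketch (the actual Hadoop source differs in detail):

    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        try {
            // Pull key/value pairs from the input split one at a time.
            while (context.nextKeyValue()) {
                map(context.getCurrentKey(), context.getCurrentValue(), context);
            }
        } finally {
            cleanup(context);
        }
    }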


The Reducer class
The Reducer base class works very similarly to the Mapper class and usually requires subclasses to override only a single reduce() method. Here is the cut-down class definition:

public class Reducer<K2, V2, K3, V3>
{
    void reduce(K2 key, Iterable<V2> values,
        Reducer.Context context)
        throws IOException, InterruptedException
    ...
}

Again, notice the class definition in terms of the broader data flow (the reduce method accepts K2/V2 as input and provides K3/V3 as output), while the actual reduce method takes only a single key and its associated list of values. The Context object is again the mechanism used to output the result of the method.

This class also has setup, run, and cleanup methods with default implementations similar to those of the Mapper class, which can optionally be overridden:

protected void setup(Reducer.Context context)
    throws IOException, InterruptedException

The setup() method is called once before any keys and their lists of values are presented to the reduce method. The default implementation does nothing.

protected void cleanup(Reducer.Context context)
    throws IOException, InterruptedException

The cleanup() method is called once after all keys and their lists of values have been presented to the reduce method. The default implementation does nothing.

protected void run(Reducer.Context context)
    throws IOException, InterruptedException

The run() method controls the overall flow of processing the task within the JVM. The default implementation calls the setup method before repeatedly, and potentially concurrently, calling the reduce method for as many keys and lists of values as are provided to the Reducer class, and then finally calls the cleanup method.


The Driver class
The Driver class communicates with the Hadoop framework and specifies the configuration elements needed to run a MapReduce job. This involves aspects such as telling Hadoop which Mapper and Reducer classes to use, where to find the input data and in what format, and where to place the output data and how to format it.

The driver logic usually exists in the main method of the class written to encapsulate a MapReduce job. There is no default parent Driver class to subclass:

public class ExampleDriver extends Configured implements Tool
{
    ...
    public int run(String[] args) throws Exception
    {
        // Create a Configuration object that is used to set other options
        Configuration conf = getConf();
        // Get command line arguments
        args = new GenericOptionsParser(conf, args)
            .getRemainingArgs();

        // Create the object representing the job
        Job job = Job.getInstance(conf, "ExampleJob");
        // Set the name of the main class in the job jar file
        job.setJarByClass(ExampleDriver.class);
        // Set the mapper class
        job.setMapperClass(ExampleMapper.class);
        // Set the reducer class
        job.setReducerClass(ExampleReducer.class);
        // Set the types for the final output key and value
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Set input and output file paths
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Execute the job and wait for it to complete
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception
    {
        int exitCode = ToolRunner.run(new ExampleDriver(), args);
        System.exit(exitCode);
    }
}

In the preceding lines of code, org.apache.hadoop.util.Tool is an interface for handling command-line options. The actual handling is delegated to ToolRunner.run, which runs the Tool with the given Configuration used to get and set a job's configuration options. By subclassing org.apache.hadoop.conf.Configured, we can set the Configuration object directly from command-line options via GenericOptionsParser.

Given our previous talk of jobs, it's not surprising that much of the setup involves operations on a Job object. This includes setting the job name and specifying which classes are to be used for the mapper and reducer implementations.

Certain input/output configurations are set and, finally, the arguments passed to the main method are used to specify the input and output locations for the job. This is a very common model that you will see often.

There are a number of default values for configuration options, and we are implicitly using some of them in the preceding class. Most notably, we don't say anything about the format of the input files or how the output files are to be written. These are defined through the InputFormat and OutputFormat classes mentioned earlier; we will explore them in detail later. The default input and output formats are text files, which suit our examples. There are multiple ways of expressing the format within text files, in addition to particularly optimized binary formats.

A common model for less complex MapReduce jobs is to have the Mapper and Reducer classes as inner classes within the driver. This allows everything to be kept in a single file, which simplifies code distribution.


Combiner
Hadoop allows the use of a combiner class to perform some early aggregation of the output from the map method before it is retrieved by the reducer.

Much of Hadoop's design is predicated on reducing the expensive parts of a job, which usually equate to disk and network I/O. The output of the mapper is often large; it's not infrequent to see it many times the size of the original input. Hadoop does allow configuration options to help reduce the impact of the reducers transferring such large chunks of data across the network. The combiner takes a different approach, where it is possible to perform early aggregation so that less data needs to be transferred in the first place.

The combiner does not have its own interface; a combiner must have the same signature as the reducer and hence also subclasses the Reducer class from the org.apache.hadoop.mapreduce package. The effect of this is basically to perform a mini-reduce on the mapper's output destined for each reducer.

Hadoop does not guarantee whether the combiner will be executed. Sometimes, it may not be executed at all, while at other times it may be used once, twice, or more times depending on the size and number of output files generated by the mapper for each reducer.
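Enabling a combiner is a single call in the driver. A hedged sketch, reusing the WordCount classes shown later in this chapter; note that reusing a reducer as a combiner is only safe when the aggregation is associative and commutative, as it is for sums and counts:

    // In the driver, alongside the other job.set* calls:
    job.setMapperClass(WordCountMapper.class);
    // Run the reducer logic locally on each mapper's output as a combiner.
    job.setCombinerClass(WordCountReducer.class);
    job.setReducerClass(WordCountReducer.class);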


Partitioning
One of the implicit guarantees of the Reduce interface is that a single reducer will be given all the values associated with a given key. With multiple reduce tasks running across a cluster, each mapper output must therefore be partitioned into separate outputs destined for each reducer. These partitioned files are stored on the local node filesystem.

The number of reduce tasks across the cluster is not as dynamic as that of mappers, and indeed we can specify the value as part of our job submission. Hadoop therefore knows how many reducers will be needed to complete the job and, from this, how many partitions the mapper output should be split into.

The optional partition function
Within the org.apache.hadoop.mapreduce package is the Partitioner class, an abstract class with the following signature:

public abstract class Partitioner<Key, Value>
{
    public abstract int getPartition(Key key, Value value,
        int numPartitions);
}

By default, Hadoop uses a strategy that hashes the output key to perform the partitioning. This functionality is provided by the HashPartitioner class within the org.apache.hadoop.mapreduce.lib.partition package, but in some cases it is necessary to provide a custom subclass of Partitioner with application-specific partitioning logic. Notice that the getPartition function takes the key, the value, and the number of partitions as parameters, any of which can be used by the custom partitioning logic.

A custom partitioning strategy would be particularly necessary if, for example, the data gave a very uneven distribution when the standard hash function was applied. Uneven partitioning can result in some tasks having to perform significantly more work than others, leading to a much longer overall job execution time.
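As a hedged sketch of the mechanics (not a recommended production strategy), the following partitioner routes keys by their first letter and falls back to a hash otherwise; it would be registered in the driver with job.setPartitionerClass:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            if (numPartitions == 0) {
                return 0;
            }
            String s = key.toString();
            if (!s.isEmpty() && Character.isLetter(s.charAt(0))) {
                // Keys starting with the same letter go to the same reducer.
                return Character.toLowerCase(s.charAt(0)) % numPartitions;
            }
            // Fall back to a hash, kept non-negative before the modulo.
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    // In the driver:
    // job.setPartitionerClass(FirstLetterPartitioner.class);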


Hadoop-provided mapper and reducer implementations
We don't always have to write our own Mapper and Reducer classes from scratch. Hadoop provides several common Mapper and Reducer implementations that can be used in our jobs. If we don't override any of the methods in the Mapper and Reducer classes, the default implementations are the identity Mapper and Reducer classes, which simply output the input unchanged.

The mappers are found in org.apache.hadoop.mapreduce.lib.map and include the following:

InverseMapper: returns (value, key) as output; that is, the input key is output as the value and the input value is output as the key
TokenCounterMapper: counts the number of discrete tokens in each line of input
IdentityMapper: implements the identity function, mapping inputs directly to outputs

The reducers are found in org.apache.hadoop.mapreduce.lib.reduce and currently include the following:

IntSumReducer: outputs the sum of the list of integer values per key
LongSumReducer: outputs the sum of the list of long values per key
IdentityReducer: implements the identity function, mapping inputs directly to outputs
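As a hedged sketch, these building blocks can sometimes replace custom code entirely. Inside a Tool-style driver such as those shown earlier, a token-count job can be assembled purely from the provided classes (TokenCountDriver is a hypothetical driver class name):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

    // Inside run(String[] args) of a Configured/Tool driver:
    Job job = Job.getInstance(getConf(), "TokenCount");
    job.setJarByClass(TokenCountDriver.class);     // hypothetical driver class
    job.setMapperClass(TokenCounterMapper.class);  // emits (token, 1) per token
    job.setCombinerClass(IntSumReducer.class);     // safe: sums are associative
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));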


Sharing reference data
Occasionally, we might want to share data across tasks. For instance, if we need to perform a lookup operation on an ID-to-string translation table, we might want such a data source to be accessible from the mapper or reducer. A straightforward approach is to store the data we want to access on HDFS and use the FileSystem API to query it as part of the map or reduce steps.

Hadoop gives us an alternative mechanism to achieve the goal of sharing reference data across all tasks in the job: the DistributedCache, defined by the org.apache.hadoop.mapreduce.filecache.DistributedCache class. This can be used to efficiently make common read-only files used by the map or reduce tasks available to all nodes.

The files can be text data, as in this case, but could also be additional JARs, binary data, or archives; anything is possible. The files to be distributed are placed on HDFS and added to the DistributedCache within the job driver. Hadoop copies the files onto the local filesystem of each node prior to job execution, meaning every task has local access to the files.

An alternative is to bundle the needed files into the job JAR submitted to Hadoop. This does tie the data to the job JAR, making it more difficult to share across jobs and requiring the JAR to be rebuilt if the data changes.
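A hedged sketch of the Hadoop 2 Job-level API for the cache follows; the file path is hypothetical, and it assumes the default behaviour of linking each cached file into the task's working directory under its base name:

    // In the driver: register an HDFS file to be copied to every node.
    job.addCacheFile(new org.apache.hadoop.fs.Path("/data/id-lookup.txt").toUri());

    // In the Mapper (or Reducer), load it once in setup():
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        java.net.URI[] cached = context.getCacheFiles();
        if (cached != null && cached.length > 0) {
            // The cached file is available locally under its base name.
            String localName = new java.io.File(cached[0].getPath()).getName();
            try (java.io.BufferedReader reader =
                    new java.io.BufferedReader(new java.io.FileReader(localName))) {
                // ... read the lookup table into an in-memory map here ...
            }
        }
    }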


Writing MapReduce programs
In this chapter, we will be focusing on batch workloads: given a set of historical data, we will look at properties of that dataset. In Chapter 4, Real-time Computation with Samza, and Chapter 5, Iterative Computation with Spark, we will show how a similar type of analysis can be performed over a stream of text collected in real time.


Getting started
In the following examples, we will assume a dataset generated by collecting 1,000 tweets using the stream.py script, as shown in Chapter 1, Introduction:

$ python stream.py -t -n 1000 > tweets.txt

We can then copy the dataset into HDFS with:

$ hdfs dfs -put tweets.txt <destination>

Tip
Note that until now we have been working only with the text of tweets. In the remainder of this book, we'll extend stream.py to output additional tweet metadata in JSON format. Keep this in mind before dumping terabytes of messages with stream.py.

Our first MapReduce program will be the canonical WordCount example. A variation of this program will be used to determine trending topics. We will then analyze the text associated with those topics to determine whether it expresses a "positive" or "negative" sentiment. Finally, we will make use of a MapReduce pattern, ChainMapper, to pull things together and present a data pipeline that cleans and prepares the textual data we'll feed to the trending topic and sentiment analysis models.


Running the examples
The full source code of the examples described in this section can be found at https://github.com/learninghadoop2/book-examples/tree/master/ch3.

Before we run our job in Hadoop, we must compile our code and collect the required class files into a single JAR file that we will submit to the system. Using Gradle, you can build the needed JAR file with:

$ ./gradlew jar

Local cluster
Jobs are executed on Hadoop using the jar option of the hadoop command-line utility. To use this, we specify the name of the JAR file, the main class within it, and any arguments that will be passed to the main class, as shown in the following command:

$ hadoop jar <job jar file> <main class> <argument1> … <argumentN>

Elastic MapReduce
Recall from Chapter 1, Introduction, that Elastic MapReduce expects the job JAR file and its input data to be located in an S3 bucket and will, conversely, dump its output back into S3.

Note
Be careful: this will cost money! For this example, we will use the smallest possible cluster configuration available for EMR: a single-node cluster.

First of all, we will copy the tweet dataset and the list of positive and negative words to S3 using the aws command-line utility:

$ aws s3 cp tweets.txt s3://<bucket>/input

$ aws s3 cp job.jar s3://<bucket>

We can execute a job using the EMR command-line tool as follows, by uploading the JAR file to s3://<bucket> and adding a CUSTOM_JAR step with the aws CLI:

$ aws emr add-steps --cluster-id <cluster-id> --steps \
Type=CUSTOM_JAR,\
Name=CustomJAR,\
Jar=s3://<bucket>/job.jar,\
MainClass=<classname>,\
Args=arg1,arg2,…,argN

Here, cluster-id is the ID of a running EMR cluster, <classname> is the fully qualified name of the main class, and arg1, arg2, …, argN are the job arguments.


WordCount, the Hello World of MapReduce
WordCount counts word occurrences in a dataset. The source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch3/src/main/java/com/learninghadoop2/mapreduce/WordCount.java. Consider the following block of code for example:

public class WordCount extends Configured implements Tool
{
    public static class WordCountMapper
        extends Mapper<Object, Text, Text, IntWritable>
    {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context
            ) throws IOException, InterruptedException {
            String[] words = value.toString().split(" ");
            for (String str : words)
            {
                word.set(str);
                context.write(word, one);
            }
        }
    }

    public static class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values,
            Context context
            ) throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable val : values) {
                total++;
            }
            context.write(key, new IntWritable(total));
        }
    }

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        args = new GenericOptionsParser(conf, args)
            .getRemainingArgs();

        Job job = Job.getInstance(conf);
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new WordCount(), args);
        System.exit(exitCode);
    }
}

This is our first complete MapReduce job. Look at the structure, and you should recognize the elements we have previously discussed: the overall Job class with the driver configuration in its run method and the Mapper and Reducer implementations defined as static nested classes.

We'll do a more detailed walkthrough of the mechanics of MapReduce in the next section, but for now, let's look at the preceding code and think about how it realizes the key/value transformations we discussed earlier.

The input to the Mapper class is arguably the hardest to understand, as the key is not actually used. The job specifies TextInputFormat as the format of the input data and, by default, this delivers to the mapper data where the key is the byte offset in the file and the value is the text of that line. In reality, you may never actually see a mapper that uses that byte offset key, but it is provided.

The mapper is executed once for each line of text in the input source, and every time it takes the line and breaks it into words. It then uses the Context object to output (more commonly known as emitting) each new key/value pair of the form (word, 1). These are our K2/V2 values.

We said before that the input to the reducer is a key and a corresponding list of values, and there is some magic that happens between the map and reduce methods to collect together the values for each key that facilitates this. It is called the shuffle stage, which we won't describe right now. Hadoop executes the reducer once for each key, and the preceding reducer implementation simply counts the number of entries in the Iterable object and emits output for each word in the form (word, count). These are our K3/V3 values.

Take a look at the signatures of our mapper and reducer classes: the WordCountMapper class accepts Object and Text as input and provides Text and IntWritable as output. The WordCountReducer class accepts Text and IntWritable as both input and output. This is again quite a common pattern, where the map method performs an inversion on the key and values and instead emits a series of data pairs on which the reducer performs aggregation.

The driver is more meaningful here, as we have real values for the parameters. We use arguments passed to the class to specify the input and output locations.

Run the job with:

$ hadoop jar build/libs/mapreduce-example.jar \
com.learninghadoop2.mapreduce.WordCount \
twitter.txt output


Examine the output with a command such as the following; the actual filename might be different, so just look inside the directory called output in your home directory on HDFS:

$ hdfs dfs -cat output/part-r-00000


Word co-occurrences
Words occurring together are likely to be phrases, and common (frequently occurring) phrases are likely to be important. In Natural Language Processing, a list of co-occurring terms is called an N-Gram. N-Grams are the foundation of several statistical methods for text analytics. We will give an example of the special case of an N-Gram composed of two terms (a bigram), a metric often encountered in analytics applications.

A naive implementation in MapReduce would be an extension of WordCount that emits a multi-field key composed of two tab-separated words:

public class BiGramCount extends Configured implements Tool
{
    public static class BiGramMapper
        extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context
            ) throws IOException, InterruptedException {
            String[] words = value.toString().split(" ");
            Text bigram = new Text();
            String prev = null;

            for (String s : words) {
                if (prev != null) {
                    bigram.set(prev + "\t" + s);
                    context.write(bigram, one);
                }
                prev = s;
            }
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        args = new GenericOptionsParser(conf, args).getRemainingArgs();

        Job job = Job.getInstance(conf);
        job.setJarByClass(BiGramCount.class);
        job.setMapperClass(BiGramMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new BiGramCount(), args);
        System.exit(exitCode);
    }
}

In this job, we replace WordCountReducer with org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer, which implements the same logic. The source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch3/src/main/java/com/learninghadoop2/mapreduce/BiGramCount.java.


Trending topics
The # symbol, called a hashtag, is used to mark keywords or topics in a tweet. It was created organically by Twitter users as a way to categorize messages. Twitter Search (found at https://twitter.com/search-home) popularized the use of hashtags as a method to connect with and find content related to specific topics, as well as the people talking about those topics. By counting the frequency with which a hashtag is mentioned over a given time period, we can determine which topics are trending in the social network:

public class HashTagCount extends Configured implements Tool
{
    public static class HashTagCountMapper
        extends Mapper<Object, Text, Text, IntWritable>
    {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        private String hashtagRegExp =
            "(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)";

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
            String[] words = value.toString().split(" ");
            for (String str : words)
            {
                if (str.matches(hashtagRegExp)) {
                    word.set(str);
                    context.write(word, one);
                }
            }
        }
    }

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        args = new GenericOptionsParser(conf, args)
            .getRemainingArgs();

        Job job = Job.getInstance(conf);
        job.setJarByClass(HashTagCount.class);
        job.setMapperClass(HashTagCountMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new HashTagCount(), args);
        System.exit(exitCode);
    }
}

As in the WordCount example, we tokenize the text in the Mapper. We use a regular expression, hashtagRegExp, to detect the presence of a hashtag in the tweet's text and emit the hashtag and the number 1 when a hashtag is found. In the Reducer step, we then count the total number of emitted hashtag occurrences using IntSumReducer.

The full source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch3/src/main/java/com/learninghadoop2/mapreduce/HashTagCount.java.

This compiled class will be in the JAR file we built with Gradle earlier, so now we execute HashTagCount with the following command:

$ hadoop jar build/libs/mapreduce-example.jar \
com.learninghadoop2.mapreduce.HashTagCount twitter.txt output

Let's examine the output as before:

$ hdfs dfs -cat output/part-r-00000

You should see output similar to the following:

#whey          1
#willpower     1
#win           2
#winterblues   1
#winterstorm   1
#wipolitics    1
#women         6
#woodgrain     1

Each line is composed of a hashtag and the number of times it appears in the tweets dataset. As you can see, the MapReduce job orders results by key. If we want to find the most mentioned topics, we need to order the result set. The naive approach would be to perform a total ordering of the aggregated values and select the top 10.

If the output dataset is small, we can pipe it to standard output and sort it using the sort utility:

$ hdfs dfs -cat output/part-r-00000 | sort -k2 -n -r | head -n 10

Another solution would be to write another MapReduce job to traverse the whole result set and sort by value. When data becomes large, this type of global sorting can become quite expensive. In the following section, we will illustrate an efficient design pattern for sorting aggregated data.

The Top N pattern
In the Top N pattern, we keep data sorted in a local data structure. Each mapper calculates a list of the top N records in its split and sends its list to the reducer. A single reducer task then finds the top N global records.

We will apply this design pattern to implement a TopTenHashTag job that finds the top ten topics in our dataset. The job takes as input the output data generated by HashTagCount and returns a list of the ten most frequently mentioned hashtags.

In TopTenMapper, we use a TreeMap to keep a sorted list (in ascending order) of hashtags. The key of this map is the number of occurrences; the value is a tab-separated string of hashtag and frequency. In map(), for each value, we update the topN map. When topN has more than ten items, we remove the smallest:

public static class TopTenMapper extends Mapper<Object, Text,
    NullWritable, Text> {

    private TreeMap<Integer, Text> topN = new TreeMap<Integer, Text>();
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws
        IOException, InterruptedException {
        String[] words = value.toString().split("\t");
        if (words.length < 2) {
            return;
        }
        topN.put(Integer.parseInt(words[1]), new Text(value));
        if (topN.size() > 10) {
            topN.remove(topN.firstKey());
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException,
        InterruptedException {
        for (Text t : topN.values()) {
            context.write(NullWritable.get(), t);
        }
    }
}

We don't emit any key/value pairs in the map function. Instead, we implement a cleanup() method that, once the mapper has consumed all its input, emits the (hashtag, count) values in topN. We use a NullWritable key because we want all values to be associated with the same key, so that we can perform a global ordering over all the mappers' top N lists. This implies that our job will execute only one reducer.

The reducer implements logic similar to what we have in map(). We instantiate a TreeMap and use it to keep an ordered list of the top ten values:

public static class TopTenReducer extends
    Reducer<NullWritable, Text, NullWritable, Text> {

    private TreeMap<Integer, Text> topN = new TreeMap<Integer, Text>();

    @Override
    public void reduce(NullWritable key, Iterable<Text> values, Context
        context) throws IOException, InterruptedException {
        for (Text value : values) {
            String[] words = value.toString().split("\t");
            topN.put(Integer.parseInt(words[1]),
                new Text(value));
            if (topN.size() > 10) {
                topN.remove(topN.firstKey());
            }
        }

        for (Text word : topN.descendingMap().values()) {
            context.write(NullWritable.get(), word);
        }
    }
}

Finally, we traverse topN in descending order to generate the list of trending topics.

Note
Note that in this implementation, we overwrite hashtags that have a frequency value already present in the TreeMap when calling topN.put(). Depending on the use case, it is advisable to use a different data structure, such as the ones offered by the Guava library (https://code.google.com/p/guava-libraries/), or to adjust the updating strategy.

In the driver, we enforce a single reducer by setting job.setNumReduceTasks(1). We can then run the job with:

$ hadoop jar build/libs/mapreduce-example.jar \
com.learninghadoop2.mapreduce.TopTenHashTag \
output/part-r-00000 \
top-ten

We can inspect the top ten to list trending topics:

$ hdfs dfs -cat top-ten/part-r-00000

#Stalker48          150
#gameinsight        55
#12M                52
#KCA                46
#LORDJASONJEROME    29
#Valencia           19
#LesAnges6          16
#VoteLuan           15
#hadoop2            12
#Gameinsight        11

The source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch3/src/main/java/com/learninghadoop2/mapreduce/TopTenHashTag.java.


Sentiment of hashtags
The process of identifying subjective information in a data source is commonly referred to as sentiment analysis. In the previous example, we showed how to detect trending topics in a social network; we'll now analyze the text shared around those topics to determine whether it expresses a mostly positive or negative sentiment.

A list of positive and negative words for the English language, a so-called opinion lexicon, can be found at http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar.

Note
These resources, and many more, have been collected by Prof. Bing Liu's group at the University of Illinois at Chicago and have been used, among others, in Bing Liu, Minqing Hu and Junsheng Cheng. "Opinion Observer: Analyzing and Comparing Opinions on the Web." Proceedings of the 14th International World Wide Web Conference (WWW-2005), May 10-14, 2005, Chiba, Japan.

In this example, we'll present a bag-of-words method that, although simplistic in nature, can be used as a baseline for mining opinion in text. For each tweet and each hashtag, we will count the number of times a positive or a negative word appears and normalize this count by the text length.

Note
The bag-of-words model is an approach used in Natural Language Processing and Information Retrieval to represent textual documents. In this model, text is represented as the set or bag (with multiplicity) of its words, disregarding grammar, morphological properties, and even word order.

Uncompress the archive and place the word lists into HDFS with the following command lines:

$ hdfs dfs -put positive-words.txt <destination>
$ hdfs dfs -put negative-words.txt <destination>

In the Mapper class, we define two objects that will hold the word lists, positiveWords and negativeWords, as Set<String>:

private Set<String> positiveWords = null;
private Set<String> negativeWords = null;

We override the default setup() method of the Mapper so that the lists of positive and negative words, specified by two configuration properties (job.positivewords.path and job.negativewords.path), are read from HDFS using the FileSystem API we discussed in the previous chapter. We could also have used the DistributedCache to share this data across the cluster. The helper method, parseWordsList, reads a word list, strips out comments, and loads the words into a HashSet<String>:

private HashSet<String> parseWordsList(FileSystem fs, Path wordsListPath)
{
    HashSet<String> words = new HashSet<String>();
    try {
        if (fs.exists(wordsListPath)) {
            FSDataInputStream fi = fs.open(wordsListPath);
            BufferedReader br =
                new BufferedReader(new InputStreamReader(fi));
            String line = null;
            while ((line = br.readLine()) != null) {
                if (line.length() > 0 && !line.startsWith(BEGIN_COMMENT)) {
                    words.add(line);
                }
            }
            fi.close();
        }
    }
    catch (IOException e) {
        e.printStackTrace();
    }
    return words;
}
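As a hedged sketch of what such a setup() override might look like (the actual implementation in the book's repository may differ in detail), the configured paths are resolved and passed to parseWordsList:

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        FileSystem fs = FileSystem.get(conf);
        // The word-list locations are passed in by the driver as configuration properties.
        positiveWords = parseWordsList(fs, new Path(conf.get("job.positivewords.path")));
        negativeWords = parseWordsList(fs, new Path(conf.get("job.negativewords.path")));
    }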

In the Mapper step, for each hashtag in the tweet, we emit the overall sentiment of the tweet (simply the positive word count minus the negative word count) and the length of the tweet.

We'll use these in the reducer to calculate an overall sentiment ratio, weighted by the length of the tweets, to estimate the sentiment expressed by a tweet about a hashtag, as follows:

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
        String[] words = value.toString().split(" ");
        Integer positiveCount = new Integer(0);
        Integer negativeCount = new Integer(0);
        Integer wordsCount = new Integer(0);

        for (String str : words)
        {
            if (str.matches(HASHTAG_PATTERN)) {
                hashtags.add(str);
            }
            if (positiveWords.contains(str)) {
                positiveCount += 1;
            } else if (negativeWords.contains(str)) {
                negativeCount += 1;
            }
            wordsCount += 1;
        }

        Integer sentimentDifference = 0;
        if (wordsCount > 0) {
            sentimentDifference = positiveCount - negativeCount;
        }

        String stats;
        for (String hashtag : hashtags) {
            word.set(hashtag);
            stats = String.format("%d %d", sentimentDifference,
                wordsCount);
            context.write(word, new Text(stats));
        }
    }
}

In the Reducer step, we add together the sentiment scores given to each instance of the hashtag and divide by the total size of all the tweets in which it occurred:

public static class HashTagSentimentReducer
    extends Reducer<Text, Text, Text, DoubleWritable> {

    public void reduce(Text key, Iterable<Text> values,
        Context context
        ) throws IOException, InterruptedException {
        double totalDifference = 0;
        double totalWords = 0;
        for (Text val : values) {
            String[] parts = val.toString().split(" ");
            totalDifference += Double.parseDouble(parts[0]);
            totalWords += Double.parseDouble(parts[1]);
        }
        context.write(key,
            new DoubleWritable(totalDifference / totalWords));
    }
}

The full source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch3/src/main/java/com/learninghadoop2/mapreduce/HashTagSentiment.java.

After building the preceding code, execute HashTagSentiment with the following command:

$ hadoop jar build/libs/mapreduce-example.jar \
com.learninghadoop2.mapreduce.HashTagSentiment twitter.txt output-sentiment \
<positive words> <negative words>

You can examine the output with the following command:

$ hdfs dfs -cat output-sentiment/part-r-00000

You should see an output similar to the following:

#1068            0.011861271213042056
#10YearsOfLove   0.012285135487494233
#11              0.011941109121333999
#12              0.011938693593171155
#12F             0.012339242266249566
#12M             0.011864286953783268
#12MCalleEnPazYaTeVasNicolas

In the preceding output, each line is composed of a hashtag and the sentiment polarity associated with it. This number is a heuristic that tells us whether a hashtag is associated mostly with positive (polarity > 0) or negative (polarity < 0) sentiment, and the magnitude of that sentiment: the higher or lower the number, the stronger the sentiment.


Text cleanup using chain mapper
In the examples presented until now, we have ignored a key step of essentially every application built around text processing: the normalization and cleanup of the input data. Three common components of this normalization step are:

Changing the letter case to either lower or upper
Removal of stopwords
Stemming

In this section, we will show how the ChainMapper class, found at org.apache.hadoop.mapreduce.lib.chain.ChainMapper, allows us to sequentially combine a series of Mappers as the first step of a data cleanup pipeline. Mappers are added to the configured job using the following:

ChainMapper.addMapper(
    Job job,
    Class<? extends Mapper<K1, V1, K2, V2>> klass,
    Class<? extends K1> inputKeyClass,
    Class<? extends V1> inputValueClass,
    Class<? extends K2> outputKeyClass,
    Class<? extends V2> outputValueClass,
    Configuration mapperConf)

The static method, addMapper, requires the following arguments to be passed:

job: the Job to add the Mapper class to
klass: the Mapper class to add
inputKeyClass: the mapper input key class
inputValueClass: the mapper input value class
outputKeyClass: the mapper output key class
outputValueClass: the mapper output value class
mapperConf: a Configuration with the configuration for the Mapper class

In this example, we will take care of the first item listed above: before computing the sentiment of each tweet, we will convert each word present in its text to lowercase. This will allow us to more accurately ascertain the sentiment of hashtags by ignoring differences in capitalization across tweets.

First of all, we define a new Mapper, LowerCaseMapper, whose map() function calls Java String's toLowerCase() method on its input value and emits the lowercased text:

public class LowerCaseMapper extends Mapper<LongWritable, Text,
    IntWritable, Text> {

    private Text lowercased = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
        lowercased.set(value.toString().toLowerCase());
        context.write(new IntWritable(1), lowercased);
    }
}

In the HashTagSentimentChain driver, we configure the Job object so that both Mappers will be chained together and executed:

public class HashTagSentimentChain extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        args = new GenericOptionsParser(conf, args).getRemainingArgs();

        // location (on HDFS) of the positive words list
        conf.set("job.positivewords.path", args[2]);
        conf.set("job.negativewords.path", args[3]);

        Job job = Job.getInstance(conf);
        job.setJarByClass(HashTagSentimentChain.class);

        Configuration lowerCaseMapperConf = new Configuration(false);
        ChainMapper.addMapper(job,
                LowerCaseMapper.class,
                LongWritable.class, Text.class,
                IntWritable.class, Text.class,
                lowerCaseMapperConf);

        Configuration hashTagSentimentConf = new Configuration(false);
        ChainMapper.addMapper(job,
                HashTagSentiment.HashTagSentimentMapper.class,
                IntWritable.class, Text.class,
                Text.class, Text.class,
                hashTagSentimentConf);

        job.setReducerClass(HashTagSentiment.HashTagSentimentReducer.class);

        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));

        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new HashTagSentimentChain(), args);
        System.exit(exitCode);
    }
}

The LowerCaseMapper and HashTagSentimentMapper classes are invoked in a pipeline, where the output of the first becomes the input of the second. The output of the last Mapper will be written to the task's output. An immediate benefit of this design is a reduction of disk I/O operations. Mappers do not need to be aware that they are chained.


It’sthereforepossibletoreusespecializedMappersthatcanbecombinedwithinasingletask.NotethatthispatternassumesthatallMappers—andtheReduce—usematchingoutputandinput(key,value)pairs.NocastingorconversionisdonebyChainMapperitself.

Finally,noticethattheaddMappercallforthelastmapperinthechainspecifiestheoutputkey/valueclassesapplicabletothewholemapperpipelinewhenusedasacomposite.

Thefullsourcecodeofthisexamplecanbefoundathttps://github.com/learninghadoop2/book-examples/blob/master/ch3/src/main/java/com/learninghadoop2/mapreduce/HashTagSentimentChain.java

ExecuteHashTagSentimentChainwiththecommand:

$hadoopjarbuild/libs/mapreduce-example.jar

com.learninghadoop2.mapreduce.HashTagSentimentChaintwitter.txtoutput

<positivewords><negativewords>

Youshouldseeanoutputsimilartothepreviousexample.Noticethatthistime,thehashtagineachlineislowercased.


Walking through a run of a MapReduce job

To explore the relationship between mapper and reducer in more detail, and to expose some of Hadoop's inner workings, we'll now go through how a MapReduce job is executed. This applies to MapReduce in both Hadoop 1 and Hadoop 2, even though the latter is implemented very differently using YARN, which we'll discuss later in this chapter. Additional information on the services described in this section, as well as suggestions for troubleshooting MapReduce applications, can be found in Chapter 10, Running a Hadoop Cluster.


Startup

The driver is the only piece of code that runs on our local machine, and the call to Job.waitForCompletion() starts the communication with the JobTracker, which is the master node in the MapReduce system. The JobTracker is responsible for all aspects of job scheduling and execution, so it becomes our primary interface when performing any task related to job management.

To share resources on the cluster, the JobTracker can use one of several scheduling approaches to handle incoming jobs. The general model is to have a number of queues to which jobs can be submitted, along with policies to assign resources across the queues. The most commonly used implementations of these policies are the Capacity Scheduler and the Fair Scheduler.

The JobTracker communicates with the NameNode on our behalf and manages all interactions relating to the data stored on HDFS.


Splitting the input

The first of these interactions happens when the JobTracker looks at the input data and determines how to assign it to map tasks. Recall that HDFS files are usually split into blocks of at least 64 MB and the JobTracker will assign each block to one map task. Our WordCount example, of course, used a trivial amount of data that was well within a single block. Picture a much larger input file measured in terabytes, and the split model makes more sense. Each segment of the file (or split, in MapReduce terminology) is processed uniquely by one map task. Once it has computed the splits, the JobTracker places them and the JAR file containing the Mapper and Reducer classes into a job-specific directory on HDFS, whose path will be passed to each task as it starts.


Task assignment

The TaskTracker service is responsible for allocating resources, executing, and tracking the status of map and reduce tasks running on a node. Once the JobTracker has determined how many map tasks will be needed, it looks at the number of hosts in the cluster, how many TaskTrackers are working, and how many map tasks each can concurrently execute (a user-definable configuration variable). The JobTracker also looks to see where the various input data blocks are located across the cluster and attempts to define an execution plan that maximizes the cases when a TaskTracker processes a split/block located on the same physical host, or, failing that, it processes at least one in the same hardware rack. This data locality optimization is a huge reason behind Hadoop's ability to efficiently process such large datasets. Recall also that, by default, each block is replicated across three different hosts, so the likelihood of producing a task/host plan that sees most blocks processed locally is higher than it might seem at first.


Task startup

Each TaskTracker then starts up a separate Java virtual machine to execute the tasks. This does add a startup time penalty, but it isolates the TaskTracker from problems caused by misbehaving map or reduce tasks, and the JVM can be configured to be shared between subsequently executed tasks.

If the cluster has enough capacity to execute all the map tasks at once, they will all be started and given a reference to the split they are to process and the job JAR file. If there are more tasks than the cluster capacity, the JobTracker will keep a queue of pending tasks and assign them to nodes as they complete their initially assigned map tasks.

We are now ready to see the map tasks execute on their data. If all this sounds like a lot of work, it is; it explains why, when running any MapReduce job, there is always a non-trivial amount of time taken as the system gets started and performs all these steps.


Ongoing JobTracker monitoring

The JobTracker doesn't just stop work now and wait for the TaskTrackers to execute all the mappers and reducers. It's constantly exchanging heartbeat and status messages with the TaskTrackers, looking for evidence of progress or problems. It also collects metrics from the tasks throughout the job execution, some provided by Hadoop and others specified by the developer of the map and reduce tasks, although we don't use any in this example.


Mapper input

The driver class specifies the format and structure of the input file using TextInputFormat, and from this, Hadoop knows to treat this as text with the byte offset as the key and line contents as the value. Assume that our dataset contains the following text:

This is a test
Yes it is

The two invocations of the mapper will therefore be given the following input:

1 This is a test
2 Yes it is


Mapper execution

The key/value pairs received by the mapper are the offset in the file of the line and the line contents, respectively, because of how the job is configured. Our implementation of the map method in WordCountMapper discards the key, as we do not care where each line occurred in the file, and splits the provided value into words using the split method on the standard Java String class. Note that better tokenization could be provided by use of regular expressions or the StringTokenizer class, but for our purposes this simple approach will suffice. For each individual word, the mapper then emits a key comprised of the actual word itself, and a value of 1.
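To make this concrete, the following is a minimal sketch of a mapper along these lines. It is our own illustration rather than the exact WordCountMapper listing used earlier in the book, but it has the same shape: it discards the offset key, splits the line on spaces, and emits (word, 1) for every token.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapperSketch
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The byte-offset key is ignored; only the line contents matter here
        for (String token : value.toString().split(" ")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}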


Mapper output and reducer input

The output of the mapper is a series of pairs of the form (word, 1); in our example, these will be:

(This, 1), (is, 1), (a, 1), (test, 1), (Yes, 1), (it, 1), (is, 1)

These output pairs from the mapper are not passed directly to the reducer. Between mapping and reducing is the shuffle stage, where much of the magic of MapReduce occurs.


Reducer input

The reducer TaskTracker receives updates from the JobTracker that tell it which nodes in the cluster hold map output partitions that need to be processed by its local reduce task. It then retrieves these from the various nodes and merges them into a single file that will be fed to the reduce task.


Reducer execution

Our WordCountReducer class is very simple; for each word, it simply counts the number of elements in the list of values and emits the final (word, count) output for each word. For our invocation of WordCount on our sample input, all but one word has only one value in the list of values; is has two.
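As a companion to the mapper sketch above, a reducer implementing this behavior might look like the following. Again, this is our own hedged illustration rather than the book's exact WordCountReducer listing; it sums the 1s emitted for each word and writes out the final count.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducerSketch
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the 1s emitted by the mappers for this word
        int count = 0;
        for (IntWritable value : values) {
            count += value.get();
        }
        total.set(count);
        context.write(key, total);
    }
}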


Reducer output

The final set of reducer output for our example is therefore:

(This, 1), (is, 2), (a, 1), (test, 1), (Yes, 1), (it, 1)

This data will be output to partition files within the output directory specified in the driver that will be formatted using the specified OutputFormat implementation. Each reduce task writes to a single file with the filename part-r-nnnnn, where nnnnn starts at 00000 and is incremented.


Shutdown

Once all tasks have completed successfully, the JobTracker outputs the final state of the job to the client, along with the final aggregates of some of the more important counters that it has been aggregating along the way. The full job and task history is available in the log directory on each node or, more accessibly, via the JobTracker web UI; point your browser to port 50030 on the JobTracker node.


Input/Output

We have talked about files being broken into splits as part of the job startup and the data in a split being sent to the mapper implementation. However, this overlooks two aspects: how the data is stored in the file and how the individual keys and values are passed to the mapper structure.


InputFormat and RecordReader

Hadoop has the concept of InputFormat for the first of these responsibilities. The InputFormat abstract class in the org.apache.hadoop.mapreduce package provides two methods as shown in the following code:

public abstract class InputFormat<K, V>
{
    public abstract List<InputSplit> getSplits(JobContext context);

    public abstract RecordReader<K, V> createRecordReader(InputSplit split,
        TaskAttemptContext context);
}

These methods display the two responsibilities of the InputFormat class:

To provide details on how to divide an input file into the splits required for map processing
To create a RecordReader that will generate the series of key/value pairs from a split

The RecordReader class is also an abstract class within the org.apache.hadoop.mapreduce package:

public abstract class RecordReader<Key, Value> implements Closeable
{
    public abstract void initialize(InputSplit split,
        TaskAttemptContext context);

    public abstract boolean nextKeyValue()
        throws IOException, InterruptedException;

    public abstract Key getCurrentKey()
        throws IOException, InterruptedException;

    public abstract Value getCurrentValue()
        throws IOException, InterruptedException;

    public abstract float getProgress()
        throws IOException, InterruptedException;

    public abstract void close() throws IOException;
}

A RecordReader instance is created for each split. The framework calls nextKeyValue() to find out whether another key/value pair is available, and, if so, the getCurrentKey() and getCurrentValue() methods are used to access the key and value respectively.

The combination of the InputFormat and RecordReader classes is therefore all that is required to bridge between any kind of input data and the key/value pairs required by MapReduce.
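To illustrate how little is needed to provide this bridge when the existing line-oriented machinery is enough, the following is a hedged sketch of a custom InputFormat; the class name SimpleLineInputFormat is our own invention, not an existing Hadoop class. It inherits split generation from FileInputFormat and returns the standard LineRecordReader.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class SimpleLineInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        // getSplits() is inherited from FileInputFormat; we only need to
        // supply the RecordReader that turns a split into key/value pairs
        return new LineRecordReader();
    }
}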


Hadoop-provided InputFormat

There are some Hadoop-provided InputFormat implementations within the org.apache.hadoop.mapreduce.lib.input package:

FileInputFormat: is an abstract base class that can be the parent of any file-based input
SequenceFileInputFormat: is an efficient binary file format that will be discussed in an upcoming section
TextInputFormat: is used for plain text files
KeyValueTextInputFormat: is used for plain text files; each line is divided into key and value parts by a separator byte

Note that input formats are not restricted to reading from files; FileInputFormat is itself a subclass of InputFormat. It's possible to have Hadoop use data that is not based on files as the input to MapReduce jobs; common sources are relational databases or column-oriented databases, such as Amazon DynamoDB or HBase.
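As an illustration of selecting one of these implementations in a driver, the following fragment is our own sketch rather than part of the book's example code. It switches a job to KeyValueTextInputFormat and sets the byte that separates keys from values; the property name shown is, to the best of our knowledge, the one used by Hadoop 2's KeyValueLineRecordReader, so verify it against your version.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class KeyValueInputExample {
    public static Job createJob(Configuration conf) throws Exception {
        // Treat each line as "key<separator>value" rather than offset/line pairs;
        // here a comma is used instead of the default tab character
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");

        Job job = Job.getInstance(conf, "key-value input example");
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        return job;
    }
}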


Hadoop-provided RecordReader

Hadoop provides a few common RecordReader implementations, which are also present within the org.apache.hadoop.mapreduce.lib.input package:

LineRecordReader: is the default RecordReader class for text files; it presents the byte offset in the file as the key and the line contents as the value
SequenceFileRecordReader: reads the key/value from the binary SequenceFile container


OutputFormat and RecordWriter

There is a similar pattern for writing the output of a job, coordinated by subclasses of OutputFormat and RecordWriter from the org.apache.hadoop.mapreduce package. We won't explore these in any detail here, but the general approach is similar, although OutputFormat does have a more involved API, as it has methods for tasks such as validation of the output specification.

It's this step that causes a job to fail if a specified output directory already exists. If you wanted different behavior, it would require a subclass of OutputFormat that overrides this method.


Hadoop-provided OutputFormat

The following output formats are provided in the org.apache.hadoop.mapreduce.lib.output package:

FileOutputFormat: is the base class for all file-based OutputFormats
NullOutputFormat: is a dummy implementation that discards the output and writes nothing to the file
SequenceFileOutputFormat: writes to the binary SequenceFile format
TextOutputFormat: writes a plain text file

Note that these classes define their required RecordWriter implementations as static nested classes, so there are no separately provided RecordWriter implementations.


Sequence files

The SequenceFile class within the org.apache.hadoop.io package provides an efficient binary file format that is often useful as an output from a MapReduce job. This is especially true if the output from the job is processed as the input of another job. Sequence files have several advantages, as follows:

As binary files, they are intrinsically more compact than text files
They additionally support optional compression, which can also be applied at different levels, that is, they compress each record or an entire split
They can be split and processed in parallel

This last characteristic is important, as most binary formats, particularly those that are compressed or encrypted, cannot be split and must be read as a single linear stream of data. Using such files as input to a MapReduce job means that a single mapper will be used to process the entire file, causing a potentially large performance hit. In such a situation, it's preferable to use a splittable format, such as SequenceFile, or, if you cannot avoid receiving the file in another format, do a preprocessing step that converts it into a splittable format. This will be a tradeoff, as the conversion will take time, but in many cases, especially with complex map tasks, this will be outweighed by the time saved through increased parallelism.
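The following driver fragment is our own sketch, not taken from the book's examples, showing one way a job might be configured to write block-compressed SequenceFile output using the standard Hadoop APIs.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SequenceFileOutputExample {
    public static void configureSequenceFileOutput(Job job, String outputPath) {
        // Write records in the binary SequenceFile container format
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        // Enable compression and compress whole blocks of records rather than
        // individual records; block compression usually gives the best ratio
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, DefaultCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(job,
                SequenceFile.CompressionType.BLOCK);

        FileOutputFormat.setOutputPath(job, new Path(outputPath));
    }
}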


YARN

YARN started out as part of the MapReduce v2 (MRv2) initiative but is now an independent sub-project within Hadoop (that is, it's at the same level as MapReduce). It grew out of a realization that MapReduce in Hadoop 1 conflated two related but distinct responsibilities: resource management and application execution.

Although it has enabled previously unimagined processing on enormous datasets, the MapReduce model at a conceptual level has an impact on performance and scalability. Implicit in the MapReduce model is that any application can only be composed of a series of largely linear MapReduce jobs, each of which follows a model of one or more maps followed by one or more reduces. This model is a great fit for some applications, but not all. In particular, it's a poor fit for workloads requiring very low-latency response times; the MapReduce startup times and sometimes lengthy job chains often greatly exceed the tolerance for a user-facing process. The model has also been found to be very inefficient for jobs that would more naturally be represented as a directed acyclic graph (DAG) of tasks, where the nodes on the graph are processing steps and the edges are data flows. If analyzed and executed as a DAG, then the application may be performed in one step with high parallelism across the processing steps, but when viewed through the MapReduce lens, the result is usually an inefficient series of interdependent MapReduce jobs.

Numerous projects have built different types of processing atop MapReduce and although many are wildly successful (Apache Hive and Pig are two standout examples), the close coupling of MapReduce as a processing paradigm with the job scheduling mechanism in Hadoop 1 made it very difficult for any new project to tailor either of these areas to its specific needs.

The result is Yet Another Resource Negotiator (YARN), which provides a highly capable job scheduling mechanism within Hadoop and the well-defined interfaces for different processing models to be implemented within it.


YARN architecture

To understand how YARN works, it's important to stop thinking about MapReduce and how it processes data. YARN itself says nothing about the nature of the applications that run atop it; rather, it's focused on providing the machinery for the scheduling and execution of these jobs. As we'll see, YARN is just as capable of hosting long-running stream processing or low-latency, user-facing workloads as it is capable of hosting batch-processing workloads, such as MapReduce.

The components of YARN

YARN is comprised of two main components: the ResourceManager (RM), which manages resources across the cluster, and the NodeManager (NM), which runs on each host and manages the resources on the individual machine. The ResourceManager and NodeManagers deal with the scheduling and management of containers, an abstract notion of the memory, CPU, and I/O that will be dedicated to run a particular piece of application code. Using MapReduce as an example, when running atop YARN, the JobTracker and each TaskTracker all run in their own dedicated containers. Note, though, that in YARN, each MapReduce job has its own dedicated JobTracker; there is no single instance that manages all jobs, as in Hadoop 1.

YARN itself is responsible only for the scheduling of tasks across the cluster; all notions of application-level progress, monitoring, and fault tolerance are handled in the application code. This is a very explicit design decision; by making YARN as independent as possible, it has a very clear set of responsibilities and does not artificially constrain the types of application that can be implemented on YARN.

As the arbiter of all cluster resources, YARN has the ability to efficiently manage the cluster as a whole and not focus on application-level resource requirements. It has a pluggable scheduling policy with the provided implementations similar to the existing Hadoop Capacity and Fair Schedulers. YARN also treats all application code as inherently untrusted and all application management and control tasks are performed in user space.

Anatomy of a YARN application

A submitted YARN application has two components: the ApplicationMaster (AM), which coordinates the overall application flow, and the specification of the code that will run on the worker nodes. For MapReduce atop YARN, the JobTracker implements the ApplicationMaster functionality and TaskTrackers are the application custom code deployed on the worker nodes.

As mentioned in the previous section, the responsibilities of application management, progress monitoring, and fault tolerance are pushed to the application level in YARN. It's the ApplicationMaster that performs these tasks; YARN itself says nothing about the mechanisms for communication between the ApplicationMaster and the code running in the worker containers, for example.

This genericity allows YARN applications to not be tied to Java classes. The ApplicationMaster can instead request a NodeManager to execute shell scripts, native applications, or any other type of processing that is made available on each node.


Lifecycle of a YARN application

As with MapReduce jobs in Hadoop 1, YARN applications are submitted to the cluster by a client. When a YARN application is started, the client first calls the ResourceManager (more specifically, the ApplicationManager portion of the ResourceManager) and requests the initial container within which to execute the ApplicationMaster. In most cases, the ApplicationMaster will run from a hosted container in the cluster, just as will the rest of the application code. The ApplicationManager communicates with the other main component of the ResourceManager, the scheduler itself, which has the ultimate responsibility of managing all resources across the cluster.

The ApplicationMaster starts up in the provided container, registers itself with the ResourceManager, and begins the process of negotiating its required resources. The ApplicationMaster communicates with the ResourceManager and requests the containers it requires. The specification of the containers requested can also include additional information, such as desired placement within the cluster and concrete resource requirements, such as a particular amount of memory or CPU.

The ResourceManager provides the ApplicationMaster with the details of the containers it has been allocated, and the ApplicationMaster then communicates with the NodeManagers to start the application-specific task for each container. This is done by providing the NodeManager with the specification of the application to be executed, which, as mentioned, may be a JAR file, a script, a path to a local executable, or anything else that the NodeManager can invoke. Each NodeManager instantiates the container for the application code and starts the application based on the provided specification.

Fault tolerance and monitoring

From this point onward, the behavior is largely application specific. YARN will not manage application progress but does perform a number of ongoing tasks. The AMLivelinessMonitor within the ResourceManager receives heartbeats from all ApplicationMasters, and if it determines that an ApplicationMaster has failed or stopped working, it will de-register the failed ApplicationMaster and release all its allocated containers. The ResourceManager will then reschedule the application a configurable number of times.

Alongside this process, the NMLivelinessMonitor within the ResourceManager receives heartbeats from the NodeManagers and keeps track of the health of each NodeManager in the cluster. Similar to the monitoring of ApplicationMaster health, a NodeManager will be marked as dead after receiving no heartbeats for a default time of 10 minutes, after which all allocated containers are marked as dead, and the node is excluded from future resource allocation.

At the same time, the NodeManager will actively monitor resource utilization of each allocated container and, for those resources not constrained by hard limits, will kill containers that exceed their resource allocation.


At a higher level, the YARN scheduler will always be looking to maximize the cluster utilization within the constraints of the sharing policy being employed. As with Hadoop 1, this will allow low-priority applications to use more cluster resources if contention is low, but the scheduler will then preempt these additional containers (that is, request them to be terminated) if higher-priority applications are submitted.

The rest of the responsibility for application-level fault tolerance and progress monitoring must be implemented within the application code. For MapReduce on YARN, for example, all the management of task scheduling and retries is provided at the application level and is not in any way delivered by YARN.


Thinking in layers

These last statements may suggest that writing applications to run on YARN is a lot of work, and this is true. The YARN API is quite low-level and likely intimidating for most developers who just want to run some processing tasks on their data. If all we had was YARN and every new Hadoop application had to have its own ApplicationMaster implemented, then YARN would not look quite as interesting as it does.

What makes the picture better is that, in general, the requirement isn't to implement each and every application on YARN, but instead use it for a smaller number of processing frameworks that provide much friendlier interfaces to be implemented. The first of these was MapReduce; with it hosted on YARN, the developer writes to the usual map and reduce interfaces and is largely unaware of the YARN mechanics.

But on the same cluster, another developer may be running a job that uses a different framework with significantly different processing characteristics, and YARN will manage both at the same time.

We'll give some more detail on several YARN processing models currently available, but they run the gamut from batch processing through low-latency queries to stream and graph processing and beyond.

As the YARN experience grows, however, there are a number of initiatives to make the development of these processing frameworks easier. On the one hand, there are higher-level interfaces, such as Cloudera Kitten (https://github.com/cloudera/kitten) or Apache Twill (http://twill.incubator.apache.org/), that give friendlier abstractions above the YARN APIs. Perhaps a more significant development model, though, is the emergence of frameworks that provide richer tools to more easily construct applications with a common general class of performance characteristics.


Execution models

We have mentioned different YARN applications having distinct processing characteristics, but an emerging pattern has seen their execution models in general being a source of differentiation. By this, we refer to how the YARN application lifecycle is managed, and we identify three main types: per-job application, per-session, and always-on.

Batch processing, such as MapReduce on YARN, sees the lifecycle of the MapReduce framework tied to that of the submitted application. If we submit a MapReduce job, then the JobTracker and TaskTrackers that execute it are created specifically for the job and are terminated when the job completes. This works well for batch, but if we wish to provide a more interactive model, then the startup overhead of establishing the YARN application and all its resource allocations will severely impact the user experience if every command issued suffers this penalty. A more interactive, or session-based, lifecycle will see the YARN application start and then be available to service a number of submitted requests/commands. The YARN application terminates only when the session is exited.

Finally, we have the concept of long-running applications that process continuous data streams independent of any interactive input. For these, it makes most sense for the YARN application to start and continuously process data that is retrieved through some external mechanism. The application will only exit when explicitly shut down or if an abnormal situation occurs.


YARN in the real world – Computation beyond MapReduce

The previous discussions have been a little abstract, so in this section, we will explore a few existing YARN applications to see just how they use the framework and how they provide a breadth of processing capability. Of particular interest is how the YARN frameworks take very different approaches to resource management, I/O pipelining, and fault tolerance.


The problem with MapReduce

Until now, we have looked at MapReduce in terms of API. MapReduce in Hadoop is more than that; up until Hadoop 2, it was the default execution engine for a number of tools, among which were Hive and Pig, which we will discuss in more detail later in this book. We have seen how MapReduce applications are, in fact, chains of jobs. This very aspect is one of the biggest pain points and constraining factors of the framework. MapReduce checkpoints data to HDFS for intra-process communication:

A chain of MapReduce jobs

At the end of each reduce phase, output is written to disk so that it can then be loaded by the mappers of the next job and used as its input. This I/O overhead introduces latency, especially when we have applications that require multiple passes on a dataset (hence multiple writes). Unfortunately, this type of iterative computation is at the core of many analytics applications.

Apache Tez and Apache Spark are two frameworks that address this problem by generalizing the MapReduce paradigm. We will briefly discuss them in the remainder of this section, alongside Apache Samza, a framework that takes an entirely different approach to real-time processing.


Tez

Tez (http://tez.apache.org) is a low-level API and execution engine focused on providing low-latency processing, and is being used as the basis of the latest evolution of Hive, Pig, and several other frameworks that implement standard join, filter, merge, and group operations. Tez is an implementation and evolution of a programming model presented by Microsoft in the 2009 Dryad paper (http://research.microsoft.com/en-us/projects/dryad/). Tez is a generalization of MapReduce as dataflow that strives to achieve fast, interactive computing by pipelining I/O operations over a queue for intra-process communication. This avoids the expensive writes to disk that affect MapReduce. The API provides primitives expressing dependencies between jobs as a DAG. The full DAG is then submitted to a planner that can optimize the execution flow. The same application depicted in the preceding diagram would be executed in Tez as a single job, with I/O pipelined from reducers to reducers without HDFS writes and subsequent reads by mappers. An example can be seen in the following diagram:

A Tez DAG is a generalization of MapReduce

The canonical WordCount example can be found at https://github.com/apache/incubator-tez/blob/master/tez-mapreduce-examples/src/main/java/org/apache/tez/mapreduce/examples/WordCount.java.

DAG dag = new DAG("WordCount");
dag.addVertex(tokenizerVertex)
   .addVertex(summerVertex)
   .addEdge(new Edge(tokenizerVertex, summerVertex,
       edgeConf.createDefaultEdgeProperty()));

Even though the graph topology dag can be expressed with a few lines of code, the boilerplate required to execute the job is considerable. This code handles many of the low-level scheduling and execution responsibilities, including fault tolerance. When Tez detects a failed task, it walks back up the processing graph to find the point from which to re-execute the failed tasks.

Hive-on-tez

Hive 0.13 is the first high-profile project to use Tez as its execution engine. We'll discuss Hive in a lot more detail in Chapter 7, Hadoop and SQL, but for now we will just touch on how it's implemented on YARN.

Hive (http://hive.apache.org) is an engine for querying data stored on HDFS through standard SQL syntax. It has been enormously successful, as this type of capability greatly reduces the barriers to start analytic exploration of data in Hadoop.

In Hadoop 1, Hive had no choice but to implement its SQL statements as a series of MapReduce jobs. When SQL is submitted to Hive, it generates the required MapReduce jobs behind the scenes and executes these on the cluster. This approach has two main drawbacks: there is a non-trivial startup penalty each time, and the constrained MapReduce model means that seemingly simple SQL statements are often translated into a lengthy series of multiple dependent MapReduce jobs. This is an example of the type of processing more naturally conceptualized as a DAG of tasks, as described earlier in this chapter.

Although some benefits are achieved when Hive executes its MapReduce jobs within YARN, the major benefits come in Hive 0.13, when the project is fully re-implemented using Tez. By exploiting the Tez APIs, which are focused on providing low-latency processing, Hive gains even more performance while making its codebase simpler.

Since Tez treats its workloads as DAGs, which provide a much better fit for translated SQL queries, Hive on Tez can perform any SQL statement as a single job with maximized parallelism.

Tez helps Hive support interactive queries by providing an always-running service instead of requiring the application to be instantiated from scratch for each SQL submission. This is important because, even though queries that process huge data volumes will simply take some time, the goal is for Hive to become less of a batch tool and instead move to be as much of an interactive tool as possible.


Apache Spark

Spark (http://spark.apache.org) is a processing framework that excels at iterative and near real-time processing. Created at UC Berkeley, it has been donated as an Apache project. Spark provides an abstraction that allows data in Hadoop to be viewed as a distributed data structure upon which a series of operations can be performed. The framework is based on the same concepts Tez draws inspiration from (Dryad), but excels with jobs that allow data to be held and processed in memory, and it can very efficiently schedule processing on the in-memory dataset across the cluster. Spark automatically controls replication of data across the cluster, ensuring that each element of the distributed dataset is held in memory on at least two machines, and provides replication-based fault tolerance somewhat akin to HDFS.

Spark started as a standalone system, but was ported to also run on YARN as of its 0.8 release. Spark is particularly interesting because, although its classic processing model is batch-oriented, with the Spark shell it provides an interactive frontend, and with the Spark Streaming sub-project it also offers near real-time processing of data streams. Spark is different things to different people; it's both a high-level API and an execution engine. At the time of writing, ports of Hive and Pig to Spark are in progress.


Apache Samza

Samza (http://samza.apache.org) is a stream-processing framework developed at LinkedIn and donated to the Apache Software Foundation. Samza processes conceptually infinite streams of data, which are seen by the application as a series of messages.

Samza currently integrates most tightly with Apache Kafka (http://kafka.apache.org), although it does have a pluggable architecture. Kafka itself is a messaging system that excels at large data volumes and provides a topic-based abstraction similar to most other messaging platforms, such as RabbitMQ. Publishers send messages to topics and interested clients consume messages from the topics as they arrive. Kafka has multiple aspects that set it apart from other messaging platforms, but for this discussion, the most interesting one is that Kafka stores messages for a period of time, which allows messages in topics to be replayed. Topics are partitioned across multiple hosts and partitions can be replicated across hosts to protect from node failure.

Samza builds its processing flow on its concept of streams, which, when using Kafka, map directly to Kafka partitions. A typical Samza job may listen to one topic for incoming messages, perform some transformations, and then write the output to a different topic. Multiple Samza jobs can then be composed to provide more complex processing structures.

As a YARN application, the Samza ApplicationMaster monitors the health of all running Samza tasks. If a task fails, then a replacement task is instantiated in a new container. Samza achieves fault tolerance by having each task write its progress to a new stream (again modeled as a Kafka topic), so any replacement task just needs to read the latest task state from this checkpoint topic and then replay the main message topic from the last processed position. Samza additionally offers support for local task state, which can be very useful for join and aggregation type workloads. This local state is again built atop the stream abstraction and hence is intrinsically resilient to host failure.

YARN-independent frameworks

An interesting point to note is that two of the preceding projects (Samza and Spark) run atop YARN but are not specific to YARN. Spark started out as a standalone service and has implementations for other schedulers, such as Apache Mesos, or to run on Amazon EC2. Though Samza runs only on YARN today, its architecture explicitly is not YARN-specific, and there are discussions about providing realizations on other platforms.

If the YARN model of pushing as much as possible into the application has its downsides through implementation complexity, then this decoupling is one of its major benefits. An application written to use YARN need not be tied to it; by definition, all the functionality for the actual application logic and management is encapsulated within the application code and is independent of YARN or another framework. This is, of course, not saying that designing a scheduler-independent application is a trivial task, but it's now a tractable task; this was absolutely not the case pre-YARN.


YARN today and beyond

Though YARN has been used in production (at Yahoo! in particular) for some time, the final GA version was not released until late 2012. The interfaces to YARN were also somewhat fluid until quite late in the development cycle. Consequently, the fully forward compatible YARN as of Hadoop 2.2 is still relatively new.

YARN is fully functional today, and the future direction will see extensions to its current capabilities. Perhaps most notable among these will be the ability to specify and control container resources on more dimensions. Currently, only location, memory, and CPU specifications are possible, and this will be expanded into areas such as storage and network I/O.

In addition, the ApplicationMaster currently has little control over the management of how containers are co-located or not. Finer-grained control here will allow the ApplicationMaster to specify policies around when containers may or may not be scheduled on the same node. In addition, the current resource allocation model is quite static, and it will be useful to allow an application to dynamically change the resources allocated to a running container.


Summary

This chapter explored how to process those large volumes of data that we discussed so much in the previous chapter. In particular, we covered:

How MapReduce was the only processing model available in Hadoop 1 and its conceptual model
The Java API to MapReduce, and how to use this to build some examples, from a word count to sentiment analysis of Twitter hashtags
The details of how MapReduce is implemented in practice, and we walked through the execution of a MapReduce job
How Hadoop stores data and the classes involved to represent input and output formats and record readers and writers
The limitations of MapReduce that led to the development of YARN, opening the door to multiple computational models on the Hadoop platform
The YARN architecture and how applications are built atop it

In the next two chapters, we will move away from strictly batch processing and delve into the world of near real-time and iterative processing, using two of the YARN-hosted frameworks we introduced in this chapter, namely Samza and Spark.


Chapter 4. Real-time Computation with Samza

The previous chapter discussed YARN, and frequently mentioned the breadth of computational models and processing frameworks outside of traditional batch-based MapReduce that it enables on the Hadoop platform. In this chapter and the next, we will explore two such projects in some depth, namely Apache Samza and Apache Spark. We chose these frameworks as they demonstrate the usage of stream and iterative processing and also provide interesting mechanisms to combine processing paradigms. In this chapter we will explore Samza and cover the following topics:

What Samza is and how it integrates with YARN and other projects such as Apache Kafka
How Samza provides a simple callback-based interface for stream processing
How Samza composes multiple stream processing jobs into more complex workflows
How Samza supports persistent local state within tasks and how this greatly enriches what it can enable


Stream processing with Samza

To explore a pure stream-processing platform, we will use Samza, which is available at https://samza.apache.org. The code shown here was tested with the current 0.8 release and we'll keep the GitHub repository updated as the project continues to evolve.

Samza was built at LinkedIn and donated to the Apache Software Foundation in September 2013. Over the years, LinkedIn has built a model that conceptualizes much of their data as streams, and from this they saw the need for a framework that can provide a developer-friendly mechanism to process these ubiquitous data streams.

The team at LinkedIn realized that when it came to data processing, much of the attention went to the extreme ends of the spectrum; for example, RPC workloads are usually implemented as synchronous systems with very low latency requirements, or batch systems where the periodicity of jobs is often measured in hours. The ground in between has been relatively poorly supported, and this is the area that Samza is targeted at; most of its jobs expect response times ranging from milliseconds to minutes. They also assume that data arrives in a theoretically infinite stream of continuous messages.


How Samza works

There are numerous stream-processing systems, such as Storm (http://storm.apache.org), in the open source world, and many other (mostly commercial) tools such as complex event processing (CEP) systems that also target processing on continuous message streams. These systems have many similarities but also some major differences.

For Samza, perhaps the most significant difference is its assumptions about message delivery. Many systems work very hard to reduce the latency of each message, sometimes with an assumption that the goal is to get the message into and out of the system as fast as possible. Samza assumes almost the opposite; its streams are persistent and resilient and any message written to a stream can be re-read for a period of time after its first arrival. As we will see, this gives significant capability around fault tolerance. Samza also builds on this model to allow each of its tasks to hold resilient local state.

Samza is mostly implemented in Scala even though its public APIs are written in Java. We'll show Java examples in this chapter, but any JVM language can be used to implement Samza applications. We'll discuss Scala when we explore Spark in the next chapter.


Samza high-level architecture

Samza views the world as having three main layers or components: the streaming, execution, and processing layers.

Samza architecture

The streaming layer provides access to the data streams, both for consumption and publication. The execution layer provides the means by which Samza applications can be run, have resources such as CPU and memory allocated, and have their lifecycles managed. The processing layer is the actual Samza framework itself, and its interfaces allow per-message functionality.

Samza provides pluggable interfaces to support the first two layers, though the current main implementations use Kafka for streaming and YARN for execution. We'll discuss these further in the following sections.


Samza’sbestfriend–ApacheKafkaSamzaitselfdoesnotimplementtheactualmessagestream.Instead,itprovidesaninterfaceforamessagesystemwithwhichitthenintegrates.ThedefaultstreamimplementationisbuiltuponApacheKafka(http://kafka.apache.org),amessagingsystemalsobuiltatLinkedInbutnowasuccessfulandwidelyadoptedopensourceproject.

KafkacanbeviewedasamessagebrokerakintosomethinglikeRabbitMQorActiveMQ,butasmentionedearlier,itwritesallmessagestodiskandscalesoutacrossmultiplehostsasacorepartofitsdesign.Kafkausestheconceptofapublish/subscribemodelthroughnamedtopicstowhichproducerswritemessagesandfromwhichconsumersreadmessages.Theseworkmuchliketopicsinanyothermessagingsystem.

BecauseKafkawritesallmessagestodisk,itmightnothavethesameultra-lowlatencymessagethroughputasothermessagingsystems,whichfocusongettingthemessageprocessedasfastaspossibleanddon’taimtostorethemessagelongterm.Kafkacan,however,scaleexceptionallywellanditsabilitytoreplayamessagestreamcanbeextremelyuseful.Forexample,ifaconsumingclientfails,thenitcanre-readmessagesfromaknowngoodpointintime,orifadownstreamalgorithmchanges,thentrafficcanbereplayedtoutilizethenewfunctionality.

Whenscalingacrosshosts,Kafkapartitionstopicsandsupportspartitionreplicationforfaulttolerance.EachKafkamessagehasakeyassociatedwiththemessageandthisisusedtodecidetowhichpartitionagivenmessageissent.Thisallowssemanticallyusefulpartitioning,forexample,ifthekeyisauserIDinthesystem,thenallmessagesforagivenuserwillbesenttothesamepartition.Kafkaguaranteesordereddeliverywithineachpartitionsothatanyclientreadingapartitioncanknowthattheyarereceivingallmessagesforeachkeyinthatpartitionintheorderinwhichtheyarewrittenbytheproducer.

Samzaperiodicallywritesoutcheckpointsofthepositionuptowhichithasreadinallthestreamsitisconsuming.ThesecheckpointmessagesarethemselveswrittentoaKafkatopic.Thus,whenaSamzajobstartsup,eachtaskcanrereaditscheckpointstreamtoknowfromwhichpositioninthestreamtostartprocessingmessages.ThismeansthatineffectKafkaalsoactsasabuffer;ifaSamzajobcrashesoristakendownforupgrade,nomessageswillbelost.Instead,thejobwilljustrestartfromthelastcheckpointedpositionwhenitrestarts.Thisbufferfunctionalityisalsoimportant,asitmakesiteasierformultipleSamzajobstorunaspartofacomplexworkflow.WhenKafkatopicsarethepointsofcoordinationbetweenthejobs,onejobmightconsumeatopicbeingwrittentobyanother;insuchcases,Kafkacanhelpsmoothoutissuescausedduetoanygivenjobrunningslowerthanothers.Traditionally,thebackpressurecausedbyaslowrunningjobcanbearealissueinasystemcomprisedofmultiplejobstages,butKafkaastheresilientbufferallowseachjobtoreadandwriteatitsownrate.NotethatthisisanalogoustohowmultiplecoordinatingMapReducejobswilluseHDFSforsimilarpurposes.

Kafka provides at-least once message delivery semantics, that is to say that any message written to Kafka will be guaranteed to be available to a client of the particular partition. Because progress is recorded only at checkpoints, however, a task that fails and restarts may reprocess messages received since its last checkpoint, so it is possible for duplicate messages to be received by the client. There are application-specific mechanisms to mitigate this, and both Kafka and Samza have exactly-once semantics on their roadmaps, but for now it is something you should take into consideration when designing jobs.

We won't explain Kafka further beyond what we need to demonstrate Samza. If you are interested, check out its website and wiki; there is a lot of good information, including some excellent papers and presentations.


YARN integration

As mentioned earlier, just as Samza utilizes Kafka for its streaming layer implementation, it uses YARN for the execution layer. Just like any YARN application described in Chapter 3, Processing – MapReduce and Beyond, Samza provides an implementation of an ApplicationMaster, which controls the lifecycle of the overall job, plus implementations of Samza-specific functionality (called tasks) that are executed in each container. Just as Kafka partitions its topics, tasks are the mechanism by which Samza partitions its processing. Each Kafka partition will be read by a single Samza task. If a Samza job consumes multiple streams, then a given task will be the only consumer within the job for every stream partition assigned to it.

The Samza framework is told by each job configuration about the Kafka streams that are of interest to the job, and Samza continuously polls these streams to determine if any new messages have arrived. When a new message is available, the Samza task invokes a user-defined callback to process the message, a model that shouldn't look too alien to MapReduce developers. This method is defined in an interface called StreamTask and has the following signature:

public void process(IncomingMessageEnvelope envelope,
                    MessageCollector collector,
                    TaskCoordinator coordinator)

This is the core of each Samza task and defines the functionality to be applied to received messages. The received message that is to be processed is wrapped in the IncomingMessageEnvelope; output messages can be written to the MessageCollector, and task management (such as shutdown) can be performed via the TaskCoordinator.

As mentioned, Samza creates one task instance for each partition in the underlying Kafka topic. Each YARN container will manage one or more of these tasks. The overall model then is of the Samza ApplicationMaster coordinating multiple containers, each of which is responsible for one or more StreamTask instances.


An independent model

Though we will talk exclusively of Kafka and YARN as the providers of Samza's streaming and execution layers in this chapter, it is important to remember that the core Samza system uses well-defined interfaces for both the stream and execution systems. There are implementations of multiple stream sources (we'll see one in the next section), and alongside the YARN support, Samza ships with a LocalJobRunner class. This alternative method of running tasks can execute StreamTask instances in-process on the JVM instead of requiring a full YARN cluster, which can sometimes be a useful testing and debugging tool. There is also a discussion of Samza implementations on top of other cluster manager or virtualization frameworks.


Hello Samza!

Since not everyone already has ZooKeeper, Kafka, and YARN clusters ready to be used, the Samza team has created a wonderful way to get started with the product. Instead of just having a Hello world! program, there is a repository called Hello Samza, which is available by cloning the repository at git://git.apache.org/samza-hello-samza.git.

This will download and install dedicated instances of ZooKeeper, Kafka, and YARN (the three major prerequisites for Samza), creating a full stack upon which you can submit Samza jobs.

There are also a number of example Samza jobs that process data from Wikipedia edit notifications. Take a look at the page at http://samza.apache.org/startup/hello-samza/0.8/ and follow the instructions given there. (At the time of writing, Samza is still a relatively young project and we'd rather not include direct information about the examples, which might be subject to change.)

For the remainder of the Samza examples in this chapter, we'll assume you are either using the Hello Samza package to provide the necessary components (ZooKeeper/Kafka/YARN) or you have integrated with other instances of each.

This example has three different Samza jobs that build upon each other. The first reads the Wikipedia edits, the second parses these records, and the third produces statistics based on the processed records. We'll build our own multistream workflow shortly.

One interesting point is the WikipediaFeed example here; it uses Wikipedia as its message source instead of Kafka. Specifically, it provides another implementation of the Samza SystemConsumer interface to allow Samza to read messages from an external system. As mentioned earlier, Samza is not tied to Kafka and, as this example shows, building a new stream implementation does not have to be against a generic infrastructure component; it can be quite job-specific, as the work required is not huge.

Tip

Note that the default configuration for both ZooKeeper and Kafka will write system data to directories under /tmp, which will be what you have set if you use Hello Samza. Be careful if you are using a Linux distribution that purges the contents of this directory on a reboot. If you plan to carry out any significant testing, then it's best to reconfigure these components to use less ephemeral locations. Change the relevant config files for each service; they are located in the service directory under the hello-samza/deploy directory.


Building a tweet parsing job

Let's build our own simple job implementation to show the full code required. We'll use parsing of the Twitter stream as the example in this chapter and will later set up a pipe from our client consuming messages from the Twitter API into a Kafka topic. So, we need a Samza task that will read the stream of JSON messages, extract the actual tweet text, and write these to a topic of tweets.

Here is the main code from TwitterParseStreamTask.java, available at https://github.com/learninghadoop2/book-examples/blob/master/ch4/src/main/java/com/learninghadoop2/samza/tasks/TwitterParseStreamTask.java:

package com.learninghadoop2.samza.tasks;

public class TwitterParseStreamTask implements StreamTask {
    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        String msg = ((String) envelope.getMessage());

        try {
            JSONParser parser = new JSONParser();
            Object obj = parser.parse(msg);
            JSONObject jsonObj = (JSONObject) obj;
            String text = (String) jsonObj.get("text");

            collector.send(new OutgoingMessageEnvelope(
                    new SystemStream("kafka", "tweets-parsed"), text));
        } catch (ParseException pe) {}
    }
}

The code is largely self-explanatory, but there are a few points of interest. We use JSON Simple (http://code.google.com/p/json-simple/) for our relatively straightforward JSON parsing requirements; we'll also use it later in this book.

The IncomingMessageEnvelope and its corresponding OutgoingMessageEnvelope are the main structures concerned with the actual message data. Along with the message payload, the envelope will also have data concerning the system, topic name, and (optionally) partition number in addition to other metadata. For our purposes, we just extract the message body from the incoming message and send the tweet text we extract from it via a new OutgoingMessageEnvelope to a topic called tweets-parsed within a system called kafka. Note the lowercase name; we'll explain this in a moment.

The type of message in the IncomingMessageEnvelope is java.lang.Object. Samza does not currently enforce a data model and hence does not have strongly-typed message bodies. Therefore, when extracting the message contents, an explicit cast is usually required. Since each task needs to know the expected message format of the streams it processes, this is not the oddity that it may appear to be.

Page 192: index-of.co.ukindex-of.co.uk/Big-Data-Technologies/Learning Hadoop 2 - Garry... · Table of Contents Learning Hadoop 2 Credits About the Authors About the Reviewers Support files,

The configuration file

There was nothing in the previous code that said where the messages came from; the framework just presents them to the StreamTask implementation, but obviously Samza needs to know from where to fetch messages. There is a configuration file for each job that defines this and more. The following can be found as twitter-parser.properties at https://github.com/learninghadoop2/book-examples/blob/master/ch4/src/main/resources/twitter-parser.properties:

# Job
job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
job.name=twitter-parser

# YARN
yarn.package.path=file:///home/gturkington/samza/build/distributions/learninghadoop2-0.1.tar.gz

# Task
task.class=com.learninghadoop2.samza.tasks.TwitterParseStreamTask
task.inputs=kafka.tweets
task.checkpoint.factory=org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory
task.checkpoint.system=kafka
# Normally, this would be 3, but we have only one broker.
task.checkpoint.replication.factor=1

# Serializers
serializers.registry.string.class=org.apache.samza.serializers.StringSerdeFactory

# Systems
systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
systems.kafka.streams.tweets.samza.msg.serde=string
systems.kafka.streams.tweets-parsed.samza.msg.serde=string
systems.kafka.consumer.zookeeper.connect=localhost:2181/
systems.kafka.consumer.auto.offset.reset=largest
systems.kafka.producer.metadata.broker.list=localhost:9092
systems.kafka.producer.producer.type=sync
systems.kafka.producer.batch.num.messages=1

This may look like a lot, but for now we'll just consider the high-level structure and some key settings. The job section sets YARN as the execution framework (as opposed to the local job runner class) and gives the job a name. If we were to run multiple copies of this same job, we would also give each copy a unique ID. The task section specifies the implementation class of our task and also the name of the streams for which it should receive messages. Serializers tell Samza how to read and write messages to and from the stream, and the system section defines systems by name and associates implementation classes with them.

In our case, we define only one system called kafka and we refer to this system when sending our message in the preceding task. Note that this name is arbitrary and we could call it whatever we want. Obviously, for clarity it makes sense to call the Kafka system by the same name, but this is only a convention. In particular, sometimes you will need to give different names when dealing with multiple systems that are similar to each other, or sometimes even when treating the same system differently in different parts of a configuration file.

In this section, we will also specify the SerDe to be associated with the streams used by the task. Recall that Kafka messages have a body and an optional key that is used to determine to which partition the message is sent. Samza needs to know how to treat the contents of the keys and messages for these streams. Samza has support to treat these as raw bytes or specific types such as string, integer, and JSON, as mentioned earlier.
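
For illustration, if a job also needed to treat message keys as strings, or another stream's payloads as JSON or integers, the registrations would look something like the following sketch. The json and integer factory class names are based on Samza's built-in serializers and should be checked against the configuration reference for your version, and raw-tweets is a hypothetical stream name:

serializers.registry.json.class=org.apache.samza.serializers.JsonSerdeFactory
serializers.registry.integer.class=org.apache.samza.serializers.IntegerSerdeFactory
systems.kafka.streams.tweets.samza.key.serde=string
systems.kafka.streams.raw-tweets.samza.msg.serde=json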

The rest of the configuration will be mostly unchanged from job to job, as it includes things such as the location of the ZooKeeper ensemble and Kafka clusters, and specifies how streams are to be checkpointed. Samza allows a wide variety of customizations and the full configuration options are detailed at http://samza.apache.org/learn/documentation/0.8/jobs/configuration-table.html.


Getting Twitter data into Kafka

Before we run the job, we do need to get some tweets into Kafka. Let's create a new Kafka topic called tweets to which we'll write the tweets.

To perform this and other Kafka-related operations, we'll use command-line tools located within the bin directory of the Kafka distribution. If you are running a job from within the stack created as part of the Hello Samza application, this will be deploy/kafka/bin.

kafka-topics.sh is a general-purpose tool that can be used to create, update, and describe topics. Most of its usages require arguments to specify the location of the local ZooKeeper cluster, where Kafka brokers store their details, and the name of the topic to be operated upon. To create a new topic, run the following command:

$ kafka-topics.sh --zookeeper localhost:2181 --create --topic tweets --partitions 1 --replication-factor 1

This creates a topic called tweets and explicitly sets its number of partitions and replication factor to 1. This is suitable if you are running Kafka within a local test VM, but clearly production deployments will have more partitions to scale out the load across multiple brokers and a replication factor of at least 2 to provide fault tolerance.

Use the list option of the kafka-topics.sh tool to simply show the topics in the system, or use describe to get more detailed information on specific topics:

$ kafka-topics.sh --zookeeper localhost:2181 --describe --topic tweets

Topic: tweets  PartitionCount: 1  ReplicationFactor: 1  Configs:
    Topic: tweets  Partition: 0  Leader: 0  Replicas: 0  Isr: 0

The multiple 0s are possibly confusing as these are labels and not counts. Each broker in the system has an ID that usually starts from 0, as do the partitions within each topic. The preceding output is telling us that the topic called tweets has a single partition with ID 0, the broker acting as the leader for that partition is broker 0, and the set of in-sync replicas (ISR) for this partition is again only broker 0. This last value is particularly important when dealing with replication.

We'll use our Python utility from previous chapters to pull JSON tweets from the Twitter feed, and then use a Kafka CLI message producer to write the messages to a Kafka topic. This isn't a terribly efficient way of doing things, but it is suitable for illustration purposes. Assuming our Python script is in our home directory, run the following command from within the Kafka bin directory:

$ python ~/stream.py -j | ./kafka-console-producer.sh --broker-list localhost:9092 --topic tweets

This will run indefinitely, so be careful not to leave it running overnight on a test VM with small disk space, not that the authors have ever done such a thing.


Running a Samza job

To run a Samza job, we need our code to be packaged along with the Samza components required to execute it into a .tar.gz archive that will be read by the YARN NodeManager. This is the file referred to by the yarn.package.path property in the Samza task configuration file.

When using the single node Hello Samza we can just use an absolute path on the filesystem, as seen in the previous configuration example. For jobs on larger YARN grids, the easiest way is to put the package onto HDFS and refer to it by an hdfs:// URI or on a web server (Samza provides a mechanism to allow YARN to read the file via HTTP).
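
As a sketch of the HDFS approach (the paths and NameNode address here are illustrative rather than taken from the book's setup), the package would be uploaded once and then referenced from the job configuration:

$ hdfs dfs -mkdir -p /samza/packages
$ hdfs dfs -put build/distributions/learninghadoop2-0.1.tar.gz /samza/packages/

# In the job's .properties file:
yarn.package.path=hdfs://<namenode-host>:8020/samza/packages/learninghadoop2-0.1.tar.gz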

Because Samza has multiple subcomponents and each subcomponent has its own dependencies, the full YARN package can end up containing a lot of JAR files (over 100!). In addition, you need to include your custom code for the Samza task as well as some scripts from within the Samza distribution. It's not something to be done by hand. In the sample code for this chapter, found at https://github.com/learninghadoop2/book-examples/tree/master/ch4, we have set up a sample structure to hold the code and config files and provided some automation via Gradle to build the necessary task archive and start the tasks.

When in the root of the Samza example code directory for this book, perform the following command to build a single file archive containing all the classes of this chapter compiled together and bundled with all the other required files:

$ ./gradlew targz

This Gradle task will not only create the necessary .tar.gz archive in the build/distributions directory, but will also store an expanded version of the archive under build/samza-package. This will be useful, as we will use Samza scripts stored in the bin directory of the archive to actually submit the task to YARN.

So now, let's run our job. We need to have file paths for two things: the Samza run-job.sh script to submit a job to YARN and the configuration file for our job. Since our created job package has all the compiled tasks bundled together, it is by using a different configuration file, specifying a specific task implementation class in the task.class property, that we tell Samza which task to run. To actually run the task, we can run the following command from within the exploded project archive under build/samza-package:

$ bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=config/twitter-parser.properties

For convenience, we added a Gradle task to run this job:

$ ./gradlew runTwitterParser

To see the output of the job, we'll use the Kafka CLI client to consume messages:


$ ./kafka-console-consumer.sh --zookeeper localhost:2181 --topic tweets-parsed

You should see a continuous stream of tweets appearing on the client.

Note: we did not explicitly create the topic called tweets-parsed. Kafka can allow topics to be created dynamically when either a producer or consumer tries to use the topic. In many situations, though, the default partitioning and replication values may not be suitable, and explicit topic creation will be required to ensure these critical topic attributes are correctly defined.
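
For instance, the tweets-parsed topic could have been created up front in the same way as the tweets topic; the partition and replication values below are illustrative rather than taken from the book's setup:

$ kafka-topics.sh --zookeeper localhost:2181 --create --topic tweets-parsed --partitions 3 --replication-factor 2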


Samza and HDFS

You may have noticed that we just mentioned HDFS for the first time in our discussion of Samza. Though Samza integrates tightly with YARN, it has no direct integration with HDFS. At a logical level, Samza's stream-implementing systems (such as Kafka) are providing the storage layer that is usually provided by HDFS for traditional Hadoop workloads. In the terminology of Samza's architecture, as described earlier, YARN is the execution layer in both models; whereas Samza uses a streaming layer for its source and destination data, frameworks such as MapReduce use HDFS. This is a good example of how YARN enables alternative computational models that not only process data very differently than batch-oriented MapReduce, but that can also use entirely different storage systems for their source data.


WindowingfunctionsIt’sfrequentlyusefultogeneratesomedatabasedonthemessagesreceivedonastreamoveracertaintimewindow.Anexampleofthismaybetorecordthetopnattributevaluesmeasuredeveryminute.SamzasupportsthisthroughtheWindowableTaskinterface,whichhasthefollowingsinglemethodtobeimplemented:

publicvoidwindow(MessageCollectorcollector,TaskCoordinator

coordinator);

ThisshouldlooksimilartotheprocessmethodintheStreamTaskinterface.However,becausethemethodiscalledonatimeschedule,itsinvocationisnotassociatedwithareceivedmessage.TheMessageCollectorandTaskCoordinatorparametersarestillthere,however,asmostwindowabletaskswillproduceoutputmessagesandmayalsowishtoperformsometaskmanagementactions.

Let’stakeourprevioustaskandaddawindowfunctionthatwilloutputthenumberoftweetsreceivedineachwindowedtimeperiod.ThisisthemainclassimplementationofTwitterStatisticsStreamTask.javafoundathttps://github.com/learninghadoop2/book-examples/blob/master/ch4/src/main/java/com/learninghadoop2/samza/tasks/TwitterStatisticsStreamTask.java

public class TwitterStatisticsStreamTask implements StreamTask, WindowableTask {
    private int tweets = 0;

    @Override
    public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
        tweets++;
    }

    @Override
    public void window(MessageCollector collector, TaskCoordinator coordinator) {
        collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "tweet-stats"), "" + tweets));

        // Reset counts after windowing.
        tweets = 0;
    }
}

The TwitterStatisticsStreamTask class has a private member variable called tweets that is initialized to 0 and is incremented in every call to the process method. We therefore know that this variable will be incremented for each message passed to the task from the underlying stream implementation. Each Samza container has a single thread running in a loop that executes the process and window methods on all the tasks within the container. This means that we do not need to guard instance variables against concurrent modifications; only one method on each task within a container will be executing simultaneously.


In our window method, we send a message to a new topic we call tweet-stats and then reset the tweets variable. This is pretty straightforward and the only missing piece is how Samza will know when to call the window method. We specify this in the configuration file:

task.window.ms=5000

This tells Samza to call the window method on each task instance every 5 seconds. To run the window task, there is a Gradle task:

$ ./gradlew runTwitterStatistics

If we use kafka-console-consumer.sh to listen on the tweet-stats stream now, we will see the following output:

Number of tweets: 5012
Number of tweets: 5398

Note: the term window in this context refers to Samza conceptually slicing the stream of messages into time ranges and providing a mechanism to perform processing at each range boundary. Samza does not directly provide an implementation of the other use of the term with regard to sliding windows, where a series of values is held and processed over time. However, the WindowableTask interface does provide the plumbing to implement such sliding windows.
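
To illustrate that idea, the following is a minimal sketch (not part of the book's example code) of a task that keeps the counts from the last few fixed windows in a queue and emits a total across that sliding range; the output topic name and the number of retained windows are arbitrary choices:

package com.learninghadoop2.samza.tasks;

import java.util.ArrayDeque;
import java.util.Deque;

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;
import org.apache.samza.task.WindowableTask;

// Hypothetical sketch: emits the tweet count over the last five window intervals,
// that is, a sliding window built on top of Samza's fixed window callback.
public class SlidingTweetCountStreamTask implements StreamTask, WindowableTask {

    private static final int WINDOWS_TO_KEEP = 5;

    private final Deque<Integer> recentCounts = new ArrayDeque<Integer>();
    private int tweetsInCurrentWindow = 0;

    @Override
    public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
        tweetsInCurrentWindow++;
    }

    @Override
    public void window(MessageCollector collector, TaskCoordinator coordinator) {
        // Record the count for the window that just closed and drop the oldest one.
        recentCounts.addLast(tweetsInCurrentWindow);
        if (recentCounts.size() > WINDOWS_TO_KEEP) {
            recentCounts.removeFirst();
        }
        tweetsInCurrentWindow = 0;

        int total = 0;
        for (int count : recentCounts) {
            total += count;
        }

        collector.send(new OutgoingMessageEnvelope(
            new SystemStream("kafka", "tweet-sliding-stats"),
            "Tweets in last " + recentCounts.size() + " windows: " + total));
    }
}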


Multijob workflows

As we saw with the Hello Samza examples, some of the real power of Samza comes from composition of multiple jobs and we'll use a text cleanup job to start demonstrating this capability.

In the following section, we'll perform tweet sentiment analysis by comparing tweets with a set of English positive and negative words. Simply applying this to the raw Twitter feed will have very patchy results, however, given how richly multilingual the Twitter stream is. We also need to consider things such as text cleanup, capitalization, frequent contractions, and so on. As anyone who has worked with any non-trivial dataset knows, the act of making the data fit for processing is usually where a large amount of effort (often the majority!) goes.

So before we try and detect tweet sentiments, let's do some simple text cleanup; in particular, we'll select only English language tweets and we will force their text to be lowercase before sending them to a new output stream.

Language detection is a difficult problem and for this we'll use a feature of the Apache Tika library (http://tika.apache.org). Tika provides a wide array of functionality to extract text from various sources and then to extract further information from that text. If using our Gradle scripts, the Tika dependency is already specified and will automatically be included in the generated job package. If building through another mechanism, you will need to download the Tika JAR file from the homepage and add it to your YARN job package. The following code can be found as TextCleanupStreamTask.java at https://github.com/learninghadoop2/book-examples/blob/master/ch4/src/main/java/com/learninghadoop2/samza/tasks/TextCleanupStreamTask.java:

public class TextCleanupStreamTask implements StreamTask {
    @Override
    public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
        String rawtext = ((String) envelope.getMessage());

        if ("en".equals(detectLanguage(rawtext))) {
            collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "english-tweets"), rawtext.toLowerCase()));
        }
    }

    private String detectLanguage(String text) {
        LanguageIdentifier li = new LanguageIdentifier(text);
        return li.getLanguage();
    }
}

This task is quite straightforward thanks to the heavy lifting performed by Tika. We create a utility method that wraps the creation and use of a Tika LanguageIdentifier, and then we call this method on the message body of each incoming message in the process method. We only write to the output stream if the result of applying this utility method is "en", that is, the two-letter code for English.

The configuration file for this task is similar to that of our previous task, with the specific values for the task name and implementing class. It is in the repository as textcleanup.properties at https://github.com/learninghadoop2/book-examples/blob/master/ch4/src/main/resources/textcleanup.properties. We also need to specify the input stream:

task.inputs=kafka.tweets-parsed

This is important because we need this task to process the tweet text that was extracted in the earlier task and avoid duplicating the JSON parsing logic that is best encapsulated in one place. We can run this task with the following command:

$ ./gradlew runTextCleanup

Now, we can run all three tasks together; TwitterParseStreamTask and TwitterStatisticsStreamTask will consume the raw tweet stream, while TextCleanupStreamTask will consume the output from TwitterParseStreamTask.

Data processing on streams


TweetsentimentanalysisWe’llnowimplementatasktoperformtweetsentimentanalysissimilartowhatwedidusingMapReduceinthepreviouschapter.ThiswillalsoshowusausefulmechanismofferedbySamza:bootstrapstreams.

BootstrapstreamsGenerallyspeaking,moststream-processingjobs(inSamzaoranotherframework)willstartprocessingmessagesthatarriveaftertheystartupandgenerallyignorehistoricalmessages.Becauseofitsconceptofreplayablestreams,Samzadoesn’thavethislimitation.

Inoursentimentanalysisjob,wehadtwosetsofreferenceterms:positiveandnegativewords.Thoughwe’venotshownitsofar,Samzacanconsumemessagesfrommultiplestreamsandtheunderlyingmachinerywillpollallnamedstreamsandprovidetheirmessages,oneatatime,totheprocessmethod.Wecanthereforecreatestreamsforthepositiveandnegativewordsandpushthedatasetsontothosestreams.Atfirstglance,wecouldplantorewindthesetwostreamstotheearliestpointandreadtweetsastheyarrive.TheproblemisthatSamzawon’tguaranteeorderingofmessagesfrommultiplestreams,andeventhoughthereisamechanismtogivestreamshigherpriority,wecan’tassumethatallnegativeandpositivewordswillbeprocessedbeforethefirsttweetarrives.

Forsuchtypesofscenarios,Samzahastheconceptofbootstrapstreams.Ifataskhasanybootstrapstreamsdefined,thenitwillreadthesestreamsfromtheearliestoffsetuntiltheyarefullyprocessed(technically,itwillreadthestreamstilltheygetcaughtup,sothatanynewwordssenttoeitherstreamwillbetreatedwithoutpriorityandwillarriveinterleavedbetweentweets).
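
Before the job can bootstrap anything, the word lists need to exist as Kafka topics. A minimal sketch of loading them with the console producer follows; the topic names match the configuration shown later, while the word list file locations are assumptions about where you keep those files:

$ ./kafka-console-producer.sh --broker-list localhost:9092 --topic positive-words < ~/positive-words.txt
$ ./kafka-console-producer.sh --broker-list localhost:9092 --topic negative-words < ~/negative-words.txt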

We’llnowcreateanewjobcalledTweetSentimentStreamTaskthatreadstwobootstrapstreams,collectstheircontentsintoHashMaps,gathersrunningcountsforsentimenttrends,andusesawindowfunctiontooutputthisdataatintervals.Thiscodecanbefoundathttps://github.com/learninghadoop2/book-examples/blob/master/ch4/src/main/java/com/learninghadoop2/samza/tasks/TwitterSentimentStreamTask.java

public class TwitterSentimentStreamTask implements StreamTask, WindowableTask {
    private Set<String> positiveWords = new HashSet<String>();
    private Set<String> negativeWords = new HashSet<String>();
    private int tweets = 0;
    private int positiveTweets = 0;
    private int negativeTweets = 0;
    private int maxPositive = 0;
    private int maxNegative = 0;

    @Override
    public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
        if ("positive-words".equals(envelope.getSystemStreamPartition().getStream())) {
            positiveWords.add(((String) envelope.getMessage()));
        } else if ("negative-words".equals(envelope.getSystemStreamPartition().getStream())) {
            negativeWords.add(((String) envelope.getMessage()));
        } else if ("english-tweets".equals(envelope.getSystemStreamPartition().getStream())) {
            tweets++;

            int positive = 0;
            int negative = 0;
            String words = ((String) envelope.getMessage());

            for (String word : words.split(" ")) {
                if (positiveWords.contains(word)) {
                    positive++;
                } else if (negativeWords.contains(word)) {
                    negative++;
                }
            }

            if (positive > negative) {
                positiveTweets++;
            }
            if (negative > positive) {
                negativeTweets++;
            }
            if (positive > maxPositive) {
                maxPositive = positive;
            }
            if (negative > maxNegative) {
                maxNegative = negative;
            }
        }
    }

    @Override
    public void window(MessageCollector collector, TaskCoordinator coordinator) {
        String msg = String.format("Tweets: %d Positive: %d Negative: %d MaxPositive: %d MaxNegative: %d",
            tweets, positiveTweets, negativeTweets, maxPositive, maxNegative);

        collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "tweet-sentiment-stats"), msg));

        // Reset counts after windowing.
        tweets = 0;
        positiveTweets = 0;
        negativeTweets = 0;
        maxPositive = 0;
        maxNegative = 0;
    }
}

In this task, we add a number of private member variables that we will use to keep a running count of the number of overall tweets, how many were positive and negative, and the maximum positive and negative counts seen in a single tweet.

This task consumes from three Kafka topics. Even though we will configure two to be used as bootstrap streams, they are all still exactly the same type of Kafka topic from which messages are received; the only difference with bootstrap streams is that we tell Samza to use Kafka's rewinding capabilities to fully re-read each message in the stream. For the other stream of tweets, we just start reading new messages as they arrive.

As hinted earlier, if a task subscribes to multiple streams, the same process method will receive messages from each stream. That is why we use envelope.getSystemStreamPartition().getStream() to extract the stream name for each given message and then act accordingly. If the message is from either of the bootstrapped streams, we add its contents to the appropriate set. We break a tweet message into its constituent words, test each word for positive or negative sentiment, and then update counts accordingly. As you can see, this task doesn't output the received tweets to another topic.

Sincewedon’tperformanydirectprocessing,thereisnopointindoingso;anyothertaskthatwishestoconsumemessagescanjustsubscribedirectlytotheincomingtweetsstream.However,apossiblemodificationcouldbetowritepositiveandnegativesentimenttweetstodedicatedstreamsforeach.

Thewindowmethodoutputsaseriesofcountsandthenresetsthevariables(asitdidbefore).NotethatSamzadoeshavesupporttodirectlyexposemetricsthroughJMX,whichcouldpossiblybeabetterfitforsuchsimplewindowingexamples.However,wewon’thavespacetocoverthataspectoftheprojectinthisbook.

Torunthisjob,weneedtomodifytheconfigurationfilebysettingthejobandtasknamesasusual,butwealsoneedtospecifymultipleinputstreamsnow:

task.inputs=kafka.english-tweets,kafka.positive-words,kafka.negative-words

Then,weneedtospecifythattwoofourstreamsarebootstrapstreamsthatshouldbereadfromtheearliestoffset.Specifically,wesetthreepropertiesforthestreams.Wesaytheyaretobebootstrapped,thatis,fullyreadbeforeotherstreams,andthisisachievedbyspecifyingthattheoffsetoneachstreamneedstoberesettotheoldest(first)position:

systems.kafka.streams.positive-words.samza.bootstrap=true
systems.kafka.streams.positive-words.samza.reset.offset=true
systems.kafka.streams.positive-words.samza.offset.default=oldest
systems.kafka.streams.negative-words.samza.bootstrap=true
systems.kafka.streams.negative-words.samza.reset.offset=true
systems.kafka.streams.negative-words.samza.offset.default=oldest

We can run this job with the following command:


$ ./gradlew runTwitterSentiment

After starting the job, look at the output of the messages on the tweet-sentiment-stats topic.

The sentiment detection job will bootstrap the positive and negative word streams before reading any of our newly detected lower-case English tweets.

With the sentiment detection job, we can now visualize our four collaborating jobs as shown in the following diagram:

Bootstrap streams and collaborating tasks

Tip: To correctly run the jobs, it may seem necessary to start the JSON parser job followed by the cleanup job before finally starting the sentiment job, but this is not the case. Any unread messages remain buffered in Kafka, so it doesn't matter in which order the jobs of a multi-job workflow are started. Of course, the sentiment job will output counts of 0 tweets until it starts receiving data, but nothing will break if a stream job starts before those it depends on.


Stateful tasks

The final aspect of Samza that we will explore is how it allows the tasks processing stream partitions to have persistent local state. In the previous example, we used private variables to keep a track of running totals, but sometimes it is useful for a task to have richer local state. An example could be the act of performing a logical join on two streams, where it is useful to build up a state model from one stream and compare this with the other.

Note: Samza can utilize its concept of partitioned streams to greatly optimize the act of joining streams. If each stream to be joined uses the same partition key (for example, a user ID), then each task consuming these streams will receive all messages associated with each ID across all the streams.

Samza uses another abstraction here, similar in spirit to the way it separates the framework that manages its jobs from the code that implements its tasks: it defines an abstract key-value store that can have multiple concrete implementations. Samza uses existing open source projects for the on-disk implementations, and used LevelDB as of v0.7 and added RocksDB as of v0.8. There is also an in-memory store that does not persist the key-value data but that may be useful in testing or potentially very specific production workloads.

Each task can write to this key-value store and Samza manages its persistence to the local implementation. To support persistent states, the store is also modeled as a stream and all writes to the store are also pushed into a stream. If a task fails, then on restart, it can recover the state of its local key-value store by replaying the messages in the backing topic. An obvious concern here will be the number of messages that need to be replayed; however, when using Kafka, for example, it compacts messages with the same key so that only the latest update remains in the topic.

We’llmodifyourprevioustweetsentimentexampletoaddalifetimecountofthemaximumpositiveandnegativesentimentseeninanytweet.ThefollowingcodecanbefoundasTwitterStatefulSentimentStateTask.javaathttps://github.com/learninghadoop2/book-examples/blob/master/ch4/src/main/java/com/learninghadoop2/samza/tasks/TwitterStatefulSentimentStreamTask.javaNotethattheprocessmethodisthesameasTwitterSentimentStateTask.java,sowehaveomittedithereforspacereasons:

public class TwitterStatefulSentimentStreamTask implements StreamTask, WindowableTask, InitableTask {
    private Set<String> positiveWords = new HashSet<String>();
    private Set<String> negativeWords = new HashSet<String>();
    private int tweets = 0;
    private int positiveTweets = 0;
    private int negativeTweets = 0;
    private int maxPositive = 0;
    private int maxNegative = 0;
    private KeyValueStore<String, Integer> store;

    @SuppressWarnings("unchecked")
    @Override
    public void init(Config config, TaskContext context) {
        this.store = (KeyValueStore<String, Integer>) context.getStore("tweet-store");
    }

    @Override
    public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
        ...
    }

    @Override
    public void window(MessageCollector collector, TaskCoordinator coordinator) {
        Integer lifetimeMaxPositive = store.get("lifetimeMaxPositive");
        Integer lifetimeMaxNegative = store.get("lifetimeMaxNegative");

        if ((lifetimeMaxPositive == null) || (maxPositive > lifetimeMaxPositive)) {
            lifetimeMaxPositive = maxPositive;
            store.put("lifetimeMaxPositive", lifetimeMaxPositive);
        }

        if ((lifetimeMaxNegative == null) || (maxNegative > lifetimeMaxNegative)) {
            lifetimeMaxNegative = maxNegative;
            store.put("lifetimeMaxNegative", lifetimeMaxNegative);
        }

        String msg = String.format(
            "Tweets: %d Positive: %d Negative: %d MaxPositive: %d MaxNegative: %d LifetimeMaxPositive: %d LifetimeMaxNegative: %d",
            tweets, positiveTweets, negativeTweets, maxPositive, maxNegative, lifetimeMaxPositive, lifetimeMaxNegative);

        collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "tweet-stateful-sentiment-stats"), msg));

        // Reset counts after windowing.
        tweets = 0;
        positiveTweets = 0;
        negativeTweets = 0;
        maxPositive = 0;
        maxNegative = 0;
    }
}

This class implements a new interface called InitableTask. This has a single method called init and is used when a task needs to configure aspects of its setup before it begins execution. We use the init() method here to obtain an instance of the KeyValueStore class and store it in a private member variable.


KeyValueStore, as the name suggests, provides a familiar put/get type interface. In this case, we specify that the keys are of the type String and the values are Integers. In our window method, we retrieve any previously stored values for the maximum positive and negative sentiment and, if the count in the current window is higher, update the store accordingly. Then, we just output the results of the window method as before.

As you can see, the user does not need to deal with the details of either the local or remote persistence of the KeyValueStore instance; this is all handled by Samza. The efficiency of the mechanism also makes it tractable for tasks to hold sizeable amounts of local state, which can be particularly valuable in cases such as long-running aggregations or stream joins.
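
As an illustration of the stream join case, the following is a minimal sketch (not from the book's example code) of a task that builds state from one stream and uses it to enrich messages from another. The stream names, store name, and output topic are all hypothetical, and both input streams are assumed to be partitioned by the same key. A task like this would need its own store entries in its job configuration, analogous to those shown next for the stateful sentiment job:

package com.learninghadoop2.samza.tasks;

import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

// Hypothetical sketch: joins a stream of user profile updates with a stream of
// user activity events, both partitioned by user ID.
public class UserEnrichmentStreamTask implements StreamTask, InitableTask {

    private KeyValueStore<String, String> profiles;

    @SuppressWarnings("unchecked")
    @Override
    public void init(Config config, TaskContext context) {
        // "profile-store" would need to be declared in the job configuration.
        this.profiles = (KeyValueStore<String, String>) context.getStore("profile-store");
    }

    @Override
    public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
        String stream = envelope.getSystemStreamPartition().getStream();
        String userId = (String) envelope.getKey();

        if ("user-profiles".equals(stream)) {
            // Remember the latest profile seen for this user.
            profiles.put(userId, (String) envelope.getMessage());
        } else {
            // An activity event: enrich it with the stored profile, if any.
            String profile = profiles.get(userId);
            String enriched = (String) envelope.getMessage() + "|" + (profile == null ? "unknown" : profile);
            collector.send(new OutgoingMessageEnvelope(
                new SystemStream("kafka", "enriched-activity"), userId, enriched));
        }
    }
}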

The configuration file for the job can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch4/src/main/resources/twitter-stateful-sentiment.properties. It needs to have a few entries added, which are as follows:

stores.tweet-store.factory=org.apache.samza.storage.kv.KeyValueStorageEngineFactory
stores.tweet-store.changelog=kafka.twitter-stats-state
stores.tweet-store.key.serde=string
stores.tweet-store.msg.serde=integer

The first line specifies the implementation class for the store, the second line specifies the Kafka topic to be used for persistent state, and the last two lines specify the types of the store key and value.

To run this job, use the following command:

$ ./gradlew runTwitterStatefulSentiment

For convenience, the following command will start up four jobs: the JSON parser, the text cleanup, the statistics job, and the stateful sentiment job:

$ ./gradlew runTasks

Samza is a pure stream-processing system that provides pluggable implementations of its storage and execution layers. The most commonly used plugins are YARN and Kafka, and these demonstrate how Samza can integrate tightly with Hadoop YARN while using a completely different storage layer. Samza is still a relatively new project and the current features are only a subset of what is envisaged. It is recommended to consult its web page to get the latest information on its current status.


Summary

This chapter focused much more on what can be done on Hadoop 2, and in particular YARN, than the details of Hadoop internals. This is almost certainly a good thing, as it demonstrates that Hadoop is realizing its goal of becoming a much more flexible and generic data processing platform that is no longer tied to batch processing. In particular, we highlighted how Samza shows that the processing frameworks that can be implemented on YARN can innovate and enable functionality vastly different from that available in Hadoop 1.

In particular, we saw how Samza goes to the opposite end of the latency spectrum from batch processing and enables per-message processing of individual messages as they arrive.

We also saw how Samza provides a callback mechanism that MapReduce developers will be familiar with, but uses it for a very different processing model. We also discussed the ways in which Samza utilizes YARN as its main execution framework and how it implements the model described in Chapter 3, Processing – MapReduce and Beyond.

In the next chapter, we will switch gears and explore Apache Spark. Though it has a very different data model than Samza, we'll see that it does also have an extension that supports processing of real-time data streams, including the option of Kafka integration. However, both projects are so different that they are complementary more than in competition.


Chapter 5. Iterative Computation with Spark

In the previous chapter, we saw how Samza can enable near real-time stream data processing within Hadoop. This is quite a step away from the traditional batch processing model of MapReduce, but still keeps with the model of providing a well-defined interface against which business logic tasks can be implemented. In this chapter we will explore Apache Spark, which can be viewed both as a framework on which applications can be built as well as a processing framework in its own right. Not only are applications being built on Spark, but entire components within the Hadoop ecosystem are also being reimplemented to use Spark as their underlying processing framework. In particular, we will cover the following topics:

- What Spark is and how its core system can run on YARN
- The data model provided by Spark that enables hugely scalable and highly efficient data processing
- The breadth of additional Spark components and related projects

It's important to note up front that although Spark has its own mechanism to process streaming data, this is but one part of what Spark has to offer. It's best to think of it as a much broader initiative.


Apache Spark

Apache Spark (https://spark.apache.org/) is a data processing framework based on a generalization of MapReduce. It was originally developed by the AMPLab at UC Berkeley (https://amplab.cs.berkeley.edu/). Like Tez, Spark acts as an execution engine that models data transformations as DAGs and strives to eliminate the I/O overhead of MapReduce in order to perform iterative computation at scale. While Tez's main goal was to provide a faster execution engine for MapReduce on Hadoop, Spark has been designed both as a standalone framework and an API for application development. The system is designed to perform general-purpose in-memory data processing, stream workflows, as well as interactive and iterative computation.

Spark is implemented in Scala, which is a statically typed programming language for the Java VM, and exposes native programming interfaces for Java and Python in addition to Scala itself. Note that though Java code can call the Scala interface directly, there are some aspects of the type system that make such code pretty unwieldy, and hence we use the native Java API.

Scala ships with an interactive shell similar to that of Ruby and Python; this allows users to run Spark interactively from the interpreter to query any dataset.

The Scala interpreter operates by compiling a class for each line typed by the user, loading it into the JVM, and invoking a function on it. This class includes a singleton object that contains the variables or functions on that line and runs the line's code in an initialize method. In addition to its rich programming interfaces, Spark is becoming established as an execution engine, with popular tools of the Hadoop ecosystem (such as Pig and Hive) being ported to the framework.


Cluster computing with working sets

Spark's architecture is centered around the concept of Resilient Distributed Datasets (RDDs), which is a read-only collection of Scala objects partitioned across a set of machines that can persist in memory. This abstraction was proposed in a 2012 research paper, Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, which can be found at https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf.

A Spark application consists of a driver program that executes parallel operations on a cluster of workers and long-lived processes that can store data partitions in memory, by dispatching functions that run as parallel tasks, as shown in the following diagram:

Spark cluster architecture

Processes are coordinated via a SparkContext instance. SparkContext connects to a resource manager (such as YARN), requests executors on worker nodes, and sends tasks to be executed. Executors are responsible for running tasks and managing memory locally.

Spark allows you to share variables between tasks, or between tasks and the driver, using an abstraction known as shared variables. Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are additive variables such as counters and sums.

Resilient Distributed Datasets (RDDs)

An RDD is stored in memory, shared across machines and is used in MapReduce-like parallel operations. Fault tolerance is achieved through the notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to be able to rebuild just that partition. An RDD can be built in four ways:

- By reading data from a file stored in HDFS
- By dividing (parallelizing) a Scala collection into a number of partitions that are sent to workers
- By transforming an existing RDD using parallel operators
- By changing the persistence of an existing RDD

Spark shines when RDDs can fit in memory and can be cached across operations. The API exposes methods to persist RDDs and allows for several persistence strategies and storage levels, allowing for spill to disk as well as space-efficient binary serialization.

Actions

Operations are invoked by passing functions to Spark. The system deals with variables and side effects according to the functional programming paradigm. Closures can refer to variables in the scope where they are created. Examples of actions are count (returns the number of elements in the dataset) and save (outputs the dataset to storage). Other parallel operations on RDDs include the following:

- map: applies a function to each element of the dataset
- filter: selects elements from a dataset based on user-provided criteria
- reduce: combines dataset elements using an associative function
- collect: sends all elements of the dataset to the driver program
- foreach: passes each element through a user-provided function
- groupByKey: groups items together by a provided key
- sortByKey: sorts items by key


Deployment

Spark can run both in local mode, similar to a Hadoop single-node setup, or atop a resource manager. Currently supported resource managers include:

- Spark Standalone Cluster Mode
- YARN
- Apache Mesos

Spark on YARN

An ad-hoc consolidated JAR needs to be built in order to deploy Spark on YARN. Spark launches an instance of the standalone deployed cluster within the ResourceManager. Cloudera and MapR both ship with Spark on YARN as part of their software distribution. At the time of writing, Spark is available for Hortonworks' HDP as a technology preview (http://hortonworks.com/hadoop/spark/).

Spark on EC2

Spark comes with a deployment script, spark-ec2, located in the ec2 directory. This script automatically sets up Spark and HDFS on a cluster of EC2 instances. In order to launch a Spark cluster on the Amazon cloud, go to the ec2 directory and run the following command:

./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name>

Here, <keypair> is the name of your EC2 key pair, <key-file> is the private key file for the key pair, <num-slaves> is the number of slave nodes to be launched, and <cluster-name> is the name to be given to your cluster. See Chapter 1, Introduction, for more details regarding the setup of key pairs, and verify that the cluster scheduler is up and sees all the slaves by going to its web UI, the address of which will be printed once the script completes.

You can specify a path in S3 as the input through a URI of the form s3n://<bucket>/path. You will also need to set your Amazon security credentials, either by setting the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY before your program is executed, or through SparkContext.hadoopConfiguration.
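
A minimal sketch of the environment variable approach, with placeholders standing in for real credentials:

$ export AWS_ACCESS_KEY_ID=<your-access-key-id>
$ export AWS_SECRET_ACCESS_KEY=<your-secret-access-key>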


Getting started with Spark

Spark binaries and source code are available on the project website at http://spark.apache.org/. The examples in the following section have been tested using Spark 1.1.0 built from source on the Cloudera CDH 5.0 QuickStart VM.

Download and uncompress the gzip archive with the following commands:

$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0.tgz
$ tar xvzf spark-1.1.0.tgz
$ cd spark-1.1.0

Spark is built on Scala 2.10 and uses sbt (https://github.com/sbt/sbt) to build the source core and related examples:

$ ./sbt/sbt -Dhadoop.version=2.2.0 -Pyarn assembly

With the -Dhadoop.version=2.2.0 and -Pyarn options, we instruct sbt to build against Hadoop versions 2.2.0 or higher and enable YARN support.

Start Spark in standalone mode with the following command:

$ ./sbin/start-all.sh

This command will launch a local master instance at spark://localhost:7077 as well as a worker node.

A web interface to the master node can be accessed at http://localhost:8080/ and can be seen in the following screenshot:

Master node web interface

Spark can run interactively through spark-shell, which is a modified version of the Scala shell. As a first example, we will implement a word count of the Twitter dataset we used in Chapter 3, Processing – MapReduce and Beyond, using the Scala API.

Start an interactive spark-shell session by running the following command:

$ ./bin/spark-shell

The shell instantiates a SparkContext object, sc, that is responsible for handling driver connections to workers. We will describe its semantics later in this chapter.


To make things a bit easier, let's create a sample textual dataset that contains one status update per line:

$ stream.py -t -n 1000 > sample.txt

Then, copy it to HDFS:

$ hdfs dfs -put sample.txt /tmp

Within spark-shell, we first create an RDD, file, from the sample data:

val file = sc.textFile("/tmp/sample.txt")

Then, we apply a series of transformations to count the word occurrences in the file. Note that the output of the transformation chain, counts, is still an RDD:

val counts = file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((m, n) => m + n)

This chain of transformations corresponds to the map and reduce phases that we are familiar with. In the map phase, we load each line of the dataset (flatMap), tokenize each tweet into a sequence of words, count the occurrence of each word (map), and emit (key, value) pairs. In the reduce phase, we group by key (word) and sum values (m, n) together to obtain word counts.

Finally, we print the first ten elements, counts.take(10), to the console:

counts.take(10).foreach(println)


Writing and running standalone applications

Spark allows standalone applications to be written using three APIs: Scala, Java, and Python.

Scala API

The first thing a Spark driver must do is to create a SparkContext object, which tells Spark how to access a cluster. After importing classes and implicit conversions into a program, as in the following:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

The SparkContext object can be created with the following constructor:

new SparkContext(master, appName, [sparkHome])

It can also be created through SparkContext(conf), which takes a SparkConf object.

The master parameter is a string that specifies a cluster URI to connect to (such as spark://localhost:7077) or a local string to run in local mode. The appName term is the application name that will be shown in the cluster web UI.

It is not possible to override the default SparkContext class, nor is it possible to create a new one within a running Spark shell. It is however possible to specify which master the context connects to using the MASTER environment variable. For example, to run spark-shell on four cores, use the following:

$ MASTER=local[4] ./bin/spark-shell

Java API

The org.apache.spark.api.java package exposes all the Spark features available in the Scala version to Java. The Java API has a JavaSparkContext class that returns instances of org.apache.spark.api.java.JavaRDD and works with Java collections instead of Scala ones.

There are a few key differences between the Java and Scala APIs:

- Java 7 does not support anonymous or first-class functions; therefore, functions must be implemented by extending the org.apache.spark.api.java.function.Function, Function2, and other classes. As of Spark version 1.0, the API has been refactored to support Java 8 lambda expressions. With Java 8, Function classes can be replaced with inline expressions that act as a shorthand for anonymous functions.
- The RDD methods return Java collections.
- Key-value pairs, which are simply written as (key, value) in Scala, are represented by the scala.Tuple2 class.
- To maintain type safety, some RDD and function methods, such as those that handle key pairs and doubles, are implemented as specialized classes.


WordCount in Java

An example of WordCount in Java is included with the Spark source code distribution at examples/src/main/java/org/apache/spark/examples/JavaWordCount.java.

First of all, we create a context using the JavaSparkContext class:

JavaSparkContext sc = new JavaSparkContext(master, "JavaWordCount",
    System.getenv("SPARK_HOME"),
    JavaSparkContext.jarOfClass(JavaWordCount.class));

JavaRDD<String> data = sc.textFile(infile, 1);

JavaRDD<String> words = data.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public Iterable<String> call(String s) {
        return Arrays.asList(s.split(" "));
    }
});

JavaPairRDD<String, Integer> ones = words.mapToPair(new PairFunction<String, String, Integer>() {
    @Override
    public Tuple2<String, Integer> call(String s) {
        return new Tuple2<String, Integer>(s, 1);
    }
});

JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer i1, Integer i2) {
        return i1 + i2;
    }
});

We then build an RDD from the HDFS location infile. In the first step of the transformation chain, we tokenize each tweet in the dataset and return a list of words. We use an instance of JavaPairRDD<String, Integer> to count occurrences of each word. Finally, we reduce the RDD to a new JavaPairRDD<String, Integer> instance that contains a list of tuples, each representing a word and the number of times it was found in the dataset.

Python API

PySpark requires Python version 2.6 or higher. RDDs support the same methods as their Scala counterparts but take Python functions and return Python collection types. Lambda syntax (https://docs.python.org/2/reference/expressions.html) is used to pass functions to RDDs.

The word count in pyspark is relatively similar to its Scala counterpart:

tweets = sc.textFile("/tmp/sample.txt")

counts = tweets.flatMap(lambda tweet: tweet.split(' ')) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda m, n: m + n)

The lambda construct creates anonymous functions at runtime. lambda tweet: tweet.split(' ') creates a function that takes a string tweet as the input and outputs a list of strings split by whitespace. Spark's flatMap applies this function to each line of the tweets dataset. In the map phase, for each word token, lambda word: (word, 1) returns (word, 1) tuples that indicate the occurrence of a word in the dataset. In reduceByKey, we group these tuples by key (word) and sum the values together to obtain the word count with lambda m, n: m + n.


The Spark ecosystem

Apache Spark powers a number of tools, both as a library and as an execution engine.


Spark Streaming

Spark Streaming (found at http://spark.apache.org/docs/latest/streaming-programming-guide.html) is an extension of the Scala API that allows data ingestion from streams such as Kafka, Flume, Twitter, ZeroMQ, and TCP sockets.

Spark Streaming receives live input data streams and divides the data into batches (arbitrarily sized time windows), which are then processed by the Spark core engine to generate the final stream of results in batches. This high-level abstraction is called DStream (org.apache.spark.streaming.dstream.DStream) and is implemented as a sequence of RDDs. DStream allows for two kinds of operations: transformations and output operations. Transformations work on one or more DStreams to create new DStreams. As part of a chain of transformations, data can be persisted either to a storage layer (HDFS) or an output channel. Spark Streaming allows for transformations over a sliding window of data. A window-based operation needs to specify two parameters: the window length (the duration of the window) and the slide interval (the interval at which the window-based operation is performed).


GraphX

GraphX (found at https://spark.apache.org/docs/latest/graphx-programming-guide.html) is an API for graph computation that exposes a set of operators and algorithms for graph-oriented computation as well as an optimized variant of Pregel.


MLlib

MLlib (found at http://spark.apache.org/docs/latest/mllib-guide.html) provides common Machine Learning (ML) functionality, including tests and data generators. MLlib currently supports four types of algorithms: binary classification, regression, clustering, and collaborative filtering.


Spark SQL

Spark SQL is derived from Shark, which is an implementation of the Hive data warehousing system that uses Spark as an execution engine. We will discuss Hive in Chapter 7, Hadoop and SQL. With Spark SQL, it is possible to mix SQL-like queries with Scala or Python code. The result sets returned by a query are themselves RDDs, and as such, they can be manipulated by Spark core methods or MLlib and GraphX.


Processing data with Apache Spark

In this section, we will implement the examples from Chapter 3, Processing – MapReduce and Beyond, using the Scala API. We will consider both the batch and real-time processing scenarios. We will show you how Spark Streaming can be used to compute statistics on the live Twitter stream.


Building and running the examples

Scala source code for the examples can be found at https://github.com/learninghadoop2/book-examples/tree/master/ch5. We will be using sbt to build, manage, and execute code.

The build.sbt file controls the codebase metadata and software dependencies; these include the version of the Scala interpreter that Spark links to, a link to the Akka package repository used to resolve implicit dependencies, as well as dependencies on Spark and Hadoop libraries.
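
The essential parts of such a build definition look roughly like the following sketch; the project name and exact dependency list are illustrative rather than copied from the book's repository:

name := "chapter-5-examples"

scalaVersion := "2.10.4"

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.1.0",
  "org.apache.spark" %% "spark-streaming" % "1.1.0",
  "org.apache.spark" %% "spark-streaming-twitter" % "1.1.0"
)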

The source code for all examples can be compiled with:

$ sbt compile

Or, it can be packaged into a JAR file with:

$ sbt package

A helper script to execute compiled classes can be generated with:

$ sbt add-start-script-tasks
$ sbt start-script

The helper can be invoked as follows:

$ target/start <class name> <master> <param1> … <param n>

Here, <master> is the URI of the master node. An interactive Scala session can be invoked via sbt with the following command:

$ sbt console

This console is not the same as the Spark interactive shell; rather, it is an alternative way to execute code. In order to run Spark code in it we will need to manually import and instantiate a SparkContext object. All examples presented in this section expect a twitter4j.properties file containing the consumer key and secret and the access tokens to be present in the same directory where sbt or spark-shell is being invoked:

oauth.consumerKey=
oauth.consumerSecret=
oauth.accessToken=
oauth.accessTokenSecret=

Running the examples on YARN

To run the examples on a YARN grid, we first build a JAR file using:

$ sbt package

Then, we ship it to the resource manager using the spark-submit command:

./bin/spark-submit --class application.to.execute --master yarn-cluster [options] target/scala-2.10/chapter-4_2.10-1.0.jar [<param1> … <param n>]


Unlike the standalone mode, we don't need to specify a <master> URI. In YARN, the ResourceManager is selected from the cluster configuration. More information on launching Spark in YARN can be found at http://spark.apache.org/docs/latest/running-on-yarn.html.

Finding popular topics

Unlike the earlier examples with the Spark shell, we initialize a SparkContext as part of the program. We pass three arguments to the SparkContext constructor: the type of scheduler we want to use, a name for the application, and the directory where Spark is installed:

import org.apache.spark.SparkContext._
import org.apache.spark.SparkContext
import scala.util.matching.Regex

object HashtagCount {
  def main(args: Array[String]) {
    […]

    val sc = new SparkContext(master,
      "HashtagCount",
      System.getenv("SPARK_HOME"))

    val file = sc.textFile(inputFile)

    val pattern = new Regex("(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)")

    val counts = file.flatMap(line =>
        (pattern findAllIn line).toList)
      .map(word => (word, 1))
      .reduceByKey((m, n) => m + n)

    counts.saveAsTextFile(outputPath)
  }
}

We create an initial RDD from a dataset stored in HDFS, inputFile, and apply logic that is similar to the WordCount example.

For each tweet in the dataset, we extract an array of strings that match the hashtag pattern, (pattern findAllIn line).toArray, and we count an occurrence of each string using the map operator. This generates a new RDD as a list of tuples in the form:

(word, 1), (word2, 1), (word, 1)

Finally, we combine together elements of this RDD using the reduceByKey() method. We store the RDD generated by this last step back into HDFS with saveAsTextFile.

The code for the standalone driver can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch5/src/main/scala/com/learninghadoop2/spark/HashTagCount.scala.

Assigning a sentiment to topics

The source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch5/src/main/scala/com/learninghadoop2/spark/HashTagSentiment.scala and the code is as follows:

import org.apache.spark.SparkContext._
import org.apache.spark.SparkContext
import scala.util.matching.Regex
import scala.io.Source

object HashtagSentiment {
  def main(args: Array[String]) {
    […]

    val sc = new SparkContext(master,
      "HashtagSentiment",
      System.getenv("SPARK_HOME"))

    val file = sc.textFile(inputFile)

    val positive = Source.fromFile(positiveWordsPath)
      .getLines
      .filterNot(_ startsWith ";")
      .toSet

    val negative = Source.fromFile(negativeWordsPath)
      .getLines
      .filterNot(_ startsWith ";")
      .toSet

    val pattern = new Regex("(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)")

    val counts = file.flatMap(line => (pattern findAllIn line).map({
      word => (word, sentimentScore(line, positive, negative))
    })).reduceByKey({ (m, n) => (m._1 + n._1, m._2 + n._2) })

    val sentiment = counts.map({ hashtagScore =>
      val hashtag = hashtagScore._1
      val score = hashtagScore._2
      val normalizedScore = score._1 / score._2
      (hashtag, normalizedScore)
    })

    sentiment.saveAsTextFile(outputPath)
  }
}

First, we read a list of positive and negative words into Scala Set objects and filter out comments (strings beginning with ;).

When a hashtag is found, we call a function, sentimentScore, to estimate the sentiment expressed by that given text. This function implements the same logic we used in Chapter 3, Processing – MapReduce and Beyond, to estimate the sentiment of a tweet. It takes as input parameters the tweet's text, str, and a list of positive and negative words as Set[String] objects. The return value is the difference between the positive and negative scores and the number of words in the tweet. In Spark, we represent this return value as a pair of Double and Integer objects:


def sentimentScore(str: String, positive: Set[String],
    negative: Set[String]): (Double, Int) = {
  var positiveScore = 0; var negativeScore = 0;

  str.split("""\s+""").foreach { w =>
    if (positive.contains(w)) { positiveScore += 1; }
    if (negative.contains(w)) { negativeScore += 1; }
  }

  ((positiveScore - negativeScore).toDouble,
    str.split("""\s+""").length)
}

We reduce the map output by aggregating by the key (the hashtag). In this phase, we emit a triple made of the hashtag, the sum of the difference between positive and negative scores, and the number of words per tweet. We use an additional map step to normalize the sentiment score and store the resulting list of hashtag and sentiment pairs to HDFS.


Data processing on streams

The previous example can be easily adjusted to work on a real-time stream of data. In this and the following section, we will use spark-streaming-twitter to perform some simple analytics tasks on the real-time firehose:

val window = 10
val ssc = new StreamingContext(master, "TwitterStreamEcho",
  Seconds(window), System.getenv("SPARK_HOME"))

val stream = TwitterUtils.createStream(ssc, auth)
val tweets = stream.map(tweet => (tweet.getText()))

tweets.print()

ssc.start()
ssc.awaitTermination()
}

The Scala source code for this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch5/src/main/scala/com/learninghadoop2/spark/TwitterStreamEcho.scala.

The two key packages we need to import are:

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter._

We initialize a new StreamingContext ssc on a local cluster using a 10-second window and use this context to create a DStream of tweets whose text we print.

Upon successful execution, Twitter's real-time firehose will be echoed in the terminal in batches of 10 seconds' worth of data. Notice that the computation will continue indefinitely but can be interrupted at any moment by pressing Ctrl + C.

The TwitterUtils object is a wrapper around the Twitter4j library (http://twitter4j.org/en/index.html) that ships with spark-streaming-twitter. A successful call to TwitterUtils.createStream will return a DStream of Twitter4j objects (TwitterInputDStream). In the preceding example, we used the getText() method to extract the tweet text; however, notice that the twitter4j object exposes the full Twitter API. For instance, we can print a stream of users with the following call:

val users = stream.map(tweet => (tweet.getUser().getId(),
  tweet.getUser().getName()))
users.print()

State management

Spark Streaming provides an ad hoc DStream to keep the state of each key in an RDD and the updateStateByKey method to mutate state.

We can reuse the code of the batch example to assign and update sentiment scores on streams:


object StreamingHashTagSentiment {
  […]
  val counts = text.flatMap(line => (pattern findAllIn line)
    .toList
    .map(word => (word, sentimentScore(line, positive, negative))))
    .reduceByKey({ (m, n) => (m._1 + n._1, m._2 + n._2) })

  val sentiment = counts.map({ hashtagScore =>
    val hashtag = hashtagScore._1
    val score = hashtagScore._2
    val normalizedScore = score._1 / score._2
    (hashtag, normalizedScore)
  })

  val stateDstream = sentiment
    .updateStateByKey[Double](updateFunc)

  stateDstream.print

  ssc.checkpoint("/tmp/checkpoint")
  ssc.start()
}

A state DStream is created by calling sentiment.updateStateByKey.

The updateFunc function implements the state mutation logic, which is a cumulative sum of sentiment scores over a period of time:

val updateFunc = (values: Seq[Double], state: Option[Double]) => {
  val currentScore = values.sum
  val previousScore = state.getOrElse(0.0)
  Some((currentScore + previousScore) * decayFactor)
}

decayFactor is a constant value, less than or equal to one, that we use to proportionally decrease the score over time. Intuitively, this will fade hashtags if they are not trending anymore. Spark Streaming writes intermediate data for stateful operations to HDFS, so we need to checkpoint the streaming context with ssc.checkpoint.

The source code for this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch5/src/main/scala/com/learninghadoop2/spark/StreamingHashTagSentiment.scala.


Data analysis with Spark SQL

Spark SQL can ease the task of representing and manipulating structured data. We will load a JSON file into a temporary table and calculate simple statistics by blending SQL statements and Scala code:

object SparkJson {
  […]
  val file = sc.textFile(inputFile)
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext._

  val tweets = sqlContext.jsonFile(inFile)
  tweets.printSchema()

  // Register the SchemaRDD as a table
  tweets.registerTempTable("tweets")

  val text = sqlContext.sql("SELECT text, user.id FROM tweets")

  // Find the ten most popular hashtags
  val pattern = new Regex("(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)")
  val counts = text.flatMap(sqlRow => (pattern findAllIn
    sqlRow(0).toString).toList)
    .map(word => (word, 1))
    .reduceByKey((m, n) => m + n)

  counts.registerTempTable("hashtag_frequency")
  counts.printSchema

  val top10 = sqlContext.sql("SELECT _1 as hashtag, _2 as frequency FROM hashtag_frequency order by frequency desc limit 10")
  top10.foreach(println)
}

As with previous examples, we instantiate a SparkContext sc and load the dataset of JSON tweets. We then create an instance of org.apache.spark.sql.SQLContext based on the existing sc. The import sqlContext._ gives access to all functions and implicit conversions for sqlContext. We load the tweets' JSON dataset using sqlContext.jsonFile. The resulting tweets object is an instance of SchemaRDD, which is a new type of RDD introduced by Spark SQL. The SchemaRDD class is conceptually similar to a table in a relational database; it is composed of Row objects and a schema that describes the content in each Row. We can see the schema for a tweet by calling tweets.printSchema(). Before we're able to manipulate tweets with SQL statements, we need to register the SchemaRDD as a table in the SQLContext. We then extract the text field of a JSON tweet with an SQL query. Note that the output of sqlContext.sql is an RDD again. As such, we can manipulate it using Spark core methods. In our case, we reuse the logic used in previous examples to extract hashtags and count their occurrences. Finally, we register the resulting RDD as a table, hashtag_frequency, and order hashtags by frequency with a SQL query.

The source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch5/src/main/scala/com/learninghadoop2/spark/SparkJson.scala.

SQL on data streams

At the time of writing, a SQLContext cannot be directly instantiated from a StreamingContext object. It is, however, possible to query a DStream by registering a SchemaRDD for each RDD in a given stream:

object SqlOnStream {
  […]
  val ssc = new StreamingContext(sc, Seconds(window))

  val gson = new Gson()
  val dstream = TwitterUtils
    .createStream(ssc, auth)
    .map(gson.toJson(_))

  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext._

  dstream.foreachRDD(rdd => {
    rdd.foreach(println)
    val jsonRDD = sqlContext.jsonRDD(rdd)
    jsonRDD.registerTempTable("tweets")
    jsonRDD.printSchema
    sqlContext.sql(query)
  })

  ssc.checkpoint("/tmp/checkpoint")
  ssc.start()
  ssc.awaitTermination()
}

In order to get the two working together, we first create a SparkContext sc that we use to initialize both a StreamingContext ssc and a sqlContext. As in previous examples, we use TwitterUtils.createStream to create a DStream, dstream. In this example, we use Google's Gson JSON parser to serialize each twitter4j object to a JSON string. To execute Spark SQL queries on the stream, we register a SchemaRDD jsonRDD within a dstream.foreachRDD loop. We use the sqlContext.jsonRDD method to create an RDD from a batch of JSON tweets. At this point, we can query the SchemaRDD using the sqlContext.sql method.

The source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch5/src/main/scala/com/learninghadoop2/spark/SqlOnStream.scala.


Comparing Samza and Spark Streaming

It is useful to compare Samza and Spark Streaming to help identify the areas in which each can best be applied. As has hopefully been made clear in this book, these technologies are very much complementary. Even though Spark Streaming might appear competitive with Samza, we feel both products offer compelling advantages in certain areas.

Samza shines when the input data is truly a stream of discrete events and you wish to build processing that operates on this type of input. Samza jobs running on Kafka can have latencies in the order of milliseconds. This provides a programming model focused on the individual messages and is the better fit for true near real-time processing applications. Though it lacks support to build topologies of collaborating jobs, its simple model allows similar constructs to be built and, perhaps more importantly, be easily reasoned about. Its model of partitioning and scaling also focuses on simplicity, which again makes a Samza application very easy to understand and gives it a significant advantage when dealing with something as intrinsically complex as real-time data.

Spark is much more than a streaming product. Its support for building distributed data structures from existing datasets and using powerful primitives to manipulate these gives it the ability to process large datasets at a higher level of granularity. Other products in the Spark ecosystem build additional interfaces or abstractions upon this common batch processing core. This is very much a different focus to the message stream model of Samza.

This batch model is also demonstrated when we look at Spark Streaming; instead of a per-message processing model, it slices the message stream into a series of RDDs. With a fast execution engine, this means latencies as low as 1 second (http://www.cs.berkeley.edu/~matei/papers/2012/hotcloud_spark_streaming.pdf). For workloads that wish to analyze the stream in such a way, this will be a better fit than Samza's per-message model, which requires additional logic to provide such windowing.


Summary

This chapter explored Spark and showed you how it adds iterative processing as a new rich framework upon which applications can be built atop YARN. In particular, we highlighted:

- The distributed data-structure-based processing model of Spark and how it allows very efficient in-memory data processing
- The broader Spark ecosystem and how multiple additional projects are built atop it to specialize the computational model even further

In the next chapter we will explore Apache Pig and its programming language, Pig Latin. We will see how this tool can greatly simplify software development for Hadoop by abstracting away some of the MapReduce and Spark complexity.


Chapter 6. Data Analysis with Apache Pig

In the previous chapters, we explored a number of APIs for data processing. MapReduce, Spark, Tez, and Samza are rather low-level, and writing non-trivial business logic with them often requires significant Java development. Moreover, different users will have different needs. It might be impractical for an analyst to write MapReduce code or build a DAG of inputs and outputs to answer some simple queries. At the same time, a software engineer or a researcher might want to prototype ideas and algorithms using high-level abstractions before jumping into low-level implementation details.

In this chapter and the following one, we will explore some tools that provide a way to process data on HDFS using higher-level abstractions. In this chapter we will explore Apache Pig, and, in particular, we will cover the following topics:

- What Apache Pig is and the dataflow model it provides
- Pig Latin's data types and functions
- How Pig can be easily enhanced using custom user code
- How we can use Pig to analyze the Twitter stream


An overview of Pig

Historically, the Pig toolkit consisted of a compiler that generated MapReduce programs, bundled their dependencies, and executed them on Hadoop. Pig jobs are written in a language called Pig Latin and can be executed in both interactive and batch fashions. Furthermore, Pig Latin can be extended using User Defined Functions (UDFs) written in Java, Python, Ruby, Groovy, or JavaScript.

Pig use cases include the following:

- Data processing
- Ad hoc analytical queries
- Rapid prototyping of algorithms
- Extract Transform Load pipelines

Following a trend we have seen in previous chapters, Pig is moving towards a general-purpose computing architecture. As of version 0.13, the ExecutionEngine interface (org.apache.pig.backend.executionengine) acts as a bridge between the frontend and the backend of Pig, allowing Pig Latin scripts to be compiled and executed on frameworks other than MapReduce. At the time of writing, version 0.13 ships with MRExecutionEngine (org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRExecutionEngine), and work on a low-latency backend based on Tez (org.apache.pig.backend.hadoop.executionengine.tez.*) is expected to be included in version 0.14 (see https://issues.apache.org/jira/browse/PIG-3446). Work on integrating Spark is currently in progress in the development branch (see https://issues.apache.org/jira/browse/PIG-4059).

Pig 0.13 comes with a number of performance enhancements for the MapReduce backend, in particular two features to reduce the latency of small jobs: direct HDFS access (https://issues.apache.org/jira/browse/PIG-3642) and auto local mode (https://issues.apache.org/jira/browse/PIG-3463). Direct HDFS access, the opt.fetch property, is turned on by default. When doing a DUMP in a simple (map-only) script that contains only LIMIT, FILTER, UNION, STREAM, or FOREACH operators, input data is fetched from HDFS and the query is executed directly in Pig, bypassing MapReduce. With auto local mode, the pig.auto.local.enabled property, Pig will run a query in the Hadoop local mode when the data size is smaller than pig.auto.local.input.maxbytes. Auto local mode is off by default.

Pig will launch MapReduce jobs if both modes are off or if the query is not eligible for either. If both modes are on, Pig will check whether the query is eligible for direct access and, if not, fall back to auto local mode. Failing that, it will execute the query on MapReduce.
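As an illustration (a sketch, not from the book; the property names are the ones mentioned above and the values are purely illustrative), these optimizations can be toggled from within a script or Grunt session with Pig's set command:

-- A sketch: toggling the small-job optimizations described above
set opt.fetch false;                          -- disable direct HDFS access
set pig.auto.local.enabled true;              -- enable auto local mode
set pig.auto.local.input.maxbytes 100000000;  -- size threshold below which auto local mode is used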


Getting started

We will use the stream.py script options to extract JSON data and retrieve a specific number of tweets; we can run this with a command such as the following:

$ python stream.py -j -n 10000 > tweets.json

The tweets.json file will contain one JSON string on each line representing a tweet.

Remember that the Twitter API credentials need to be made available as environment variables or hardcoded in the script itself.


Running Pig

Pig is a tool that translates statements written in Pig Latin and executes them either on a single machine in standalone mode or on a full Hadoop cluster when in distributed mode. Even in the latter, Pig's role is to translate Pig Latin statements into MapReduce jobs, and therefore it doesn't require the installation of additional services or daemons. It is used as a command-line tool with its associated libraries.

Cloudera CDH ships with Apache Pig version 0.12. Alternatively, the Pig source code and binary distributions can be obtained at https://pig.apache.org/releases.html.

As can be expected, the MapReduce mode requires access to a Hadoop cluster and HDFS installation. MapReduce mode is the default mode executed when running the pig command at the command-line prompt. Scripts can be executed with the following command:

$ pig -f <script>

Parameters can be passed via the command line using -param <param>=<val>, as follows:

$ pig -param input=tweets.txt

Parameters can also be specified in a param file that can be passed to Pig using the -param_file <file> option. Multiple files can be specified. If a parameter is present multiple times in the file, the last value will be used and a warning will be displayed. A parameter file contains one parameter per line. Empty lines and comments (specified by starting a line with #) are allowed. Within a Pig script, parameters are in the form $<parameter>. The default value can be assigned using the default statement: %default input 'tweets.json'. The default command will not work within a Grunt session; we'll discuss Grunt in the next section.
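As a hedged illustration (not from the book; the file name, parameter values, and script fragment are hypothetical), a parameter file and a script using $-substitution and %default might look like the following, invoked with pig -param_file params.txt -f script.pig:

# params.txt - one parameter per line; lines starting with # are comments
input=tweets.json
output=top_hashtags

-- script.pig - a sketch using the parameters above
%default input 'tweets.json'
tweets = LOAD '$input' USING JsonLoader('created_at:chararray, id:long, id_str:chararray, text:chararray');
STORE tweets INTO '$output';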

In local mode, all files are installed and run using the local host and filesystem. Specify local mode using the -x flag:

$ pig -x local

In both execution modes, Pig programs can be run either in an interactive shell or in batch mode.


Grunt - the Pig interactive shell

Pig can run in an interactive mode using the Grunt shell, which is invoked when we use the pig command at the terminal prompt. In the rest of this chapter, we will assume that examples are executed within a Grunt session. Other than executing Pig Latin statements, Grunt offers a number of utilities and access to shell commands (a brief example session follows this list):

- fs: allows users to manipulate Hadoop filesystem objects and has the same semantics as the Hadoop CLI
- sh: executes commands via the operating system shell
- exec: launches a Pig script within an interactive Grunt session
- kill: kills a MapReduce job
- help: prints a list of all available commands
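The following is a short hypothetical Grunt session illustrating these utilities (the HDFS path and script name are made up for the example):

grunt> fs -ls /user/cloudera
grunt> sh date
grunt> exec top_hashtags.pig
grunt> help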

Elastic MapReduce

Pig scripts can be executed on EMR by creating a cluster with --applications Name=Pig,Args=--version,<version>, as follows:

$ aws emr create-cluster \
  --name "Pig cluster" \
  --ami-version <ami version> \
  --instance-type <EC2 instance> \
  --instance-count <number of nodes> \
  --applications Name=Pig,Args=--version,<version> \
  --log-uri <S3 bucket> \
  --steps Type=PIG,\
Name="Pig script",\
Args=[-f,s3://<script location>,\
-p,input=<input param>,\
-p,output=<output param>]

The preceding command will provision a new EMR cluster and execute s3://<script location>. Notice that the scripts to be executed and the input (-p input) and output (-p output) paths are expected to be located on S3.

As an alternative to creating a new EMR cluster, it is possible to add Pig steps to an already-instantiated EMR cluster using the following command:

$ aws emr add-steps \
  --cluster-id <cluster id> \
  --steps Type=PIG,\
Name="Other Pig script",\
Args=[-f,s3://<script location>,\
-p,input=<input param>,\
-p,output=<output param>]

In the preceding command, <cluster id> is the ID of the instantiated cluster.

It is also possible to ssh into the master node and run Pig Latin statements within a Grunt session with the following command:

$ aws emr ssh --cluster-id <cluster id> --key-pair-file <key pair>


Fundamentals of Apache Pig

The primary interface to program Apache Pig is Pig Latin, a procedural language that implements ideas of the dataflow paradigm.

Pig Latin programs are generally organized as follows:

- A LOAD statement reads data from HDFS
- A series of statements aggregates and manipulates data
- A STORE statement writes output to the filesystem
- Alternatively, a DUMP statement displays the output to the terminal

The following example shows a sequence of statements that outputs the top 10 hashtags ordered by frequency, extracted from the dataset of tweets:

tweets = LOAD 'tweets.json'
    USING JsonLoader('created_at:chararray,
                      id:long,
                      id_str:chararray,
                      text:chararray');
hashtags = FOREACH tweets {
    GENERATE FLATTEN(
        REGEX_EXTRACT(
            text,
            '(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)', 1)
    ) as tag;
}
hashtags_grpd = GROUP hashtags BY tag;
hashtags_count = FOREACH hashtags_grpd {
    GENERATE
        group,
        COUNT(hashtags) as occurrencies;
}
hashtags_count_sorted = ORDER hashtags_count BY occurrencies DESC;
top_10_hashtags = LIMIT hashtags_count_sorted 10;
DUMP top_10_hashtags;

First, we load the tweets.json dataset from HDFS, de-serialize the JSON file, and map it to a four-column schema that contains a tweet's creation time, its ID in numerical and string form, and the text. For each tweet, we extract hashtags from its text using a regular expression. We aggregate on hashtag, count the number of occurrences, and order by frequency. Finally, we limit the ordered records to the top 10 most frequent hashtags.

A series of statements like the previous one is picked up by the Pig compiler, transformed into MapReduce jobs, and executed on a Hadoop cluster. The planner and optimizer will resolve dependencies on input and output relations and parallelize the execution of statements wherever possible.

Statements are the building blocks of processing data with Pig. They take a relation as input and produce another relation as output. In Pig Latin terms, a relation can be defined as a bag of tuples, two data types we will use throughout the remainder of this chapter.

Users experienced with SQL and the relational data model might find Pig Latin's syntax somewhat familiar. While there are indeed similarities in the syntax itself, Pig Latin implements an entirely different computational model. Pig Latin is procedural; it specifies the actual data transforms to be performed, whereas SQL is declarative and describes the nature of the problem but does not specify the actual runtime processing. In terms of organizing data, a relation can be thought of as a table in a relational database, where tuples in a bag correspond to the rows in a table. Relations are unordered and therefore easily parallelizable, and they are less constrained than relational tables. Pig relations can contain tuples with different numbers of fields, and those with the same field count can have fields of different types in corresponding positions.

A key difference between SQL and the dataflow model adopted by Pig Latin lies in how splits in a data pipeline are managed. In the relational world, a declarative language such as SQL implements and executes queries that will generate a single result. The dataflow model sees data transformations as a graph where input and output are nodes connected by an operator. For instance, intermediate steps of a query might require the input to be grouped by a number of keys and result in multiple outputs (GROUP BY). Pig has built-in mechanisms to manage multiple data flows in such a graph by executing operators as soon as inputs are readily available and potentially applying different operators to each flow. For instance, Pig's implementation of the GROUP BY operator uses the parallel feature (http://pig.apache.org/docs/r0.12.0/perf.html#parallel) to allow a user to increase the number of reduce tasks for the MapReduce jobs generated and hence increase concurrency. An additional side effect of this property is that when multiple operators can be executed in parallel in the same program, Pig does so (more details on Pig's multi-query implementation can be found at http://pig.apache.org/docs/r0.12.0/perf.html#multi-query-execution). Another consequence of Pig Latin's approach to computation is that it allows the persistence of data at any point in the pipeline. It allows the developer to select specific operator implementations and execution plans when necessary, effectively overriding the optimizer.

Pig Latin allows and even encourages developers to insert their own code almost anywhere in a pipeline by means of User Defined Functions (UDFs) as well as by utilizing Hadoop streaming. UDFs allow users to specify custom business logic on how data is loaded, how it is stored, and how it is processed, whereas streaming allows users to launch executables at any point in the dataflow.


Programming Pig

Pig Latin comes with a number of built-in functions (the eval, load/store, math, string, bag, and tuple functions) and a number of scalar and complex data types. Additionally, Pig allows function and data-type extension by means of UDFs and dynamic invocation of Java methods.


Pig data types

Pig supports the following scalar data types:

- int: a signed 32-bit integer
- long: a signed 64-bit integer
- float: a 32-bit floating point
- double: a 64-bit floating point
- chararray: a character array (string) in Unicode UTF-8 format
- bytearray: a byte array (blob)
- boolean: a boolean
- datetime: a datetime
- biginteger: a Java BigInteger
- bigdecimal: a Java BigDecimal

Pig supports the following complex data types:

- map: an associative array enclosed by [], with the key and value separated by # and items separated by ,
- tuple: an ordered list of data, where elements can be of any scalar or complex type, enclosed by () with items separated by ,
- bag: an unordered collection of tuples enclosed by {} and separated by ,

By default, Pig treats data as untyped. The user can declare the types of data at load time or manually cast it when necessary. If a data type is not declared, but a script implicitly treats a value as a certain type, Pig will assume it is of that type and cast it accordingly. The fields of a bag or tuple can be referred to by name, tuple.field, or by position, $<index>. Pig counts from 0 and hence the first element will be denoted as $0.
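A minimal sketch (not from the book; the users.tsv file and its fields are hypothetical) showing a schema declared at load time and fields referenced both by position and by name:

users  = LOAD 'users.tsv' AS (name:chararray, age:int);
names  = FOREACH users GENERATE $0;        -- positional reference, equivalent to name
adults = FILTER users BY age >= 18;        -- by-name reference, with age treated as an int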


Pig functions

Built-in functions are implemented in Java, and they try to follow standard Java conventions. There are, however, a number of differences to keep in mind, which are as follows:

- Function names are case sensitive and uppercase
- If the result value is null, empty, or not a number (NaN), Pig returns null
- If Pig is unable to process the expression, it returns an exception

A list of all built-in functions can be found at http://pig.apache.org/docs/r0.12.0/func.html.

Load/store

Load/store functions determine how data goes into and comes out of Pig. The PigStorage, TextLoader, and BinStorage functions can be used to read and write delimited UTF-8 text, unstructured text, and binary data respectively. Support for compression is determined by the load/store function. The PigStorage and TextLoader functions support gzip and bzip2 compression for both read (load) and write (store). The BinStorage function does not support compression.

As of version 0.12, Pig includes built-in support for loading and storing Avro and JSON data via AvroStorage (load/store), JsonStorage (store), and JsonLoader (load). At the time of writing, JSON support is still somewhat limited. In particular, Pig expects a schema for the data to be provided as an argument to JsonLoader/JsonStorage, or it assumes that .pig_schema (produced by JsonStorage) is present in the directory containing the input data. In practice, this makes it difficult to work with JSON dumps not generated by Pig itself.

As seen in our following example, we can load the JSON dataset with JsonLoader:

tweets = LOAD 'tweets.json' USING JsonLoader(
    'created_at:chararray,
     id:long,
     id_str:chararray,
     text:chararray,
     source:chararray');

We provide a schema so that the first five elements of a JSON object (created_at, id, id_str, text, and source) are mapped. We can look at the schema of tweets by using describe tweets, which returns the following:

tweets: {created_at: chararray, id: long, id_str: chararray, text: chararray, source: chararray}

Eval

Eval functions implement a set of operations to be applied on an expression that returns a bag or map data type. The expression result is evaluated within the function context.

- AVG(expression): computes the average of the numeric values in a single-column bag
- COUNT(expression): counts all elements with non-null values in the first position in a bag
- COUNT_STAR(expression): counts all elements in a bag
- IsEmpty(expression): checks whether a bag or map is empty
- MAX(expression), MIN(expression), and SUM(expression): return the max, min, or the sum of elements in a bag
- TOKENIZE(expression): splits a string and outputs a bag of words
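As a short illustration of the eval functions just listed (a sketch, not from the book), some of them can be combined on the tweets relation loaded earlier; SIZE is a further built-in, not listed above, that returns the number of elements in a bag:

words_per_tweet = FOREACH tweets GENERATE SIZE(TOKENIZE(text)) AS n_words;
avg_words = FOREACH (GROUP words_per_tweet ALL) GENERATE AVG(words_per_tweet.n_words);
DUMP avg_words;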

The tuple, bag, and map functions

These functions allow conversion from and to the bag, tuple, and map types. They include the following:

- TOTUPLE(expression), TOMAP(expression), and TOBAG(expression): These coerce expression to a tuple, map, or bag
- TOP(n, column, relation): This returns the top n tuples from a bag of tuples

The math, string, and datetime functions

Pig exposes a number of functions provided by the java.lang.Math, java.lang.String, java.util.Date, and Joda-Time DateTime classes (found at http://www.joda.org/joda-time/).

Dynamic invokers

Dynamic invokers allow the execution of Java functions without having to wrap them in a UDF. They can be used for any static function that:

- accepts no arguments or accepts a combination of string, int, long, double, float, or array with these same types
- returns a string, int, long, double, or float value

Only primitives can be used for numbers, and Java boxed classes (such as Integer) cannot be used as arguments. Depending on the return type, a specific kind of invoker must be used: InvokeForString, InvokeForInt, InvokeForLong, InvokeForDouble, or InvokeForFloat. More details regarding dynamic invokers can be found at http://pig.apache.org/docs/r0.12.0/func.html#dynamic-invokers.
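For example, the Pig documentation exposes java.net.URLDecoder.decode through InvokeForString; the sketch below follows that pattern (the encoded_urls relation and its url field are hypothetical):

DEFINE UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');
decoded = FOREACH encoded_urls GENERATE UrlDecode(url, 'UTF-8');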

Macros

As of version 0.9, Pig Latin's preprocessor supports macro expansion. Macros are defined using the DEFINE statement:

DEFINE macro_name(param1, ..., paramN) RETURNS output_bag {
    pig_latin_statements
};

The macro is expanded inline, and its parameters are referenced in the Pig Latin block within {}.


The macro output relation is given in the RETURNS statement (output_bag). RETURNS void is used for a macro with no output relation.

We can define a macro to count the number of rows in a relation, as follows:

DEFINE count_rows(X) RETURNS cnt {
    grpd = GROUP $X all;
    $cnt = FOREACH grpd GENERATE COUNT($X);
};

We can use it in a Pig script or Grunt session to count the number of tweets:

tweets_count = count_rows(tweets);
DUMP tweets_count;

Macros allow us to make scripts modular by housing code in separate files and importing them where needed. For example, we can save count_rows in a file called count_rows.macro and later on import it with the command import 'count_rows.macro'.

Macros have a number of limitations; in particular, only Pig Latin statements are allowed inside a macro. It is not possible to use REGISTER statements and shell commands, UDFs are not allowed, and parameter substitution inside the macro is not supported.


Working with data

Pig Latin provides a number of relational operators to combine functions and apply transformations on data. Typical operations in a data pipeline consist of filtering relations (FILTER), aggregating inputs based on keys (GROUP), generating transformations based on columns of data (FOREACH), and joining relations (JOIN) based on shared keys.

In the following sections, we will illustrate such operators on a dataset of tweets generated by loading JSON data.

Filtering

The FILTER operator selects tuples from a relation based on an expression, as follows:

relation = FILTER relation BY expression;

We can use this operator to filter tweets whose text matches the hashtag regular expression, as follows:

tweets_with_tag = FILTER tweets BY
    (text
        MATCHES '(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)'
    );

Aggregation

The GROUP operator groups together data in one or more relations based on an expression or a key, as follows:

relation = GROUP relation BY expression;

We can group tweets by the source field into a new relation grpd, as follows:

grpd = GROUP tweets BY source;

It is possible to group on multiple dimensions by specifying a tuple as the key, as follows:

grpd = GROUP tweets BY (created_at, source);

The result of a GROUP operation is a relation that includes one tuple per unique value of the group expression. This tuple contains two fields. The first field is named group and is of the same type as the group key. The second field takes the name of the original relation and is of the type bag. The names of both fields are generated by the system.

Using the ALL keyword, Pig will aggregate across the whole relation. The GROUP tweets ALL scheme will aggregate all tuples in the same group.
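For instance, a single-tuple count of the whole dataset can be obtained with the following sketch (not from the book, but the same pattern used by the count_rows macro earlier in the chapter):

tweet_count = FOREACH (GROUP tweets ALL) GENERATE COUNT(tweets);
DUMP tweet_count;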

As previously mentioned, Pig allows explicit handling of the concurrency level of the GROUP operator using the PARALLEL operator:

grpd = GROUP tweets BY (created_at, id) PARALLEL 10;

In the preceding example, the MapReduce job generated by the compiler will run 10 concurrent reduce tasks. Pig has a heuristic estimate of how many reducers to use.


Another way of globally enforcing the number of reduce tasks is to use the set default_parallel <n> command.

Foreach

The FOREACH operator applies functions on columns, as follows:

relation = FOREACH relation GENERATE transformation;

The output of FOREACH depends on the transformation applied.

We can use the operator to project the text of all tweets that contain a hashtag, as follows:

t = FOREACH tweets_with_tag GENERATE text;

We can also apply a function to the projected columns. For instance, we can use the TOKENIZE function to split each tweet into words, as follows:

t = FOREACH tweets_with_tag GENERATE FLATTEN(TOKENIZE(text)) as word;

The FLATTEN modifier further un-nests the bag generated by TOKENIZE into a tuple of words.
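To make the difference concrete, here is a small illustration (not from the book), assuming a single tweet whose text is "pig is fun"; the expected shape of each output is shown as a comment:

bags  = FOREACH tweets_with_tag GENERATE TOKENIZE(text);           -- one row: ({(pig),(is),(fun)})
words = FOREACH tweets_with_tag GENERATE FLATTEN(TOKENIZE(text));  -- three rows: (pig) (is) (fun)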

Join

The JOIN operator performs an inner join of two or more relations based on common field values. Its syntax is as follows:

relation = JOIN relation1 BY expression1, relation2 BY expression2;

We can use a join operation to detect tweets that contain positive words, as follows:

positive = LOAD 'positive-words.txt' USING PigStorage() as (w:chararray);

Filter out the comments, as follows:

positive_words = FILTER positive BY NOT w MATCHES '^;.*';

positive_words is a bag of tuples, each containing a word. We then tokenize the tweets' text and create a new bag of (id_str, word) tuples as follows:

id_words = FOREACH tweets {
    GENERATE
        id_str,
        FLATTEN(TOKENIZE(text)) as word;
}

We join the two relations on the word field and obtain a relation of all tweets that contain one or more positive words, as follows:

positive_tweets = JOIN positive_words BY w, id_words BY word;

In this statement, we join positive_words and id_words on the condition that id_words.word is a positive word. The positive_tweets relation is a bag in the form of {w: chararray, id_str: chararray, word: chararray} that contains all elements of positive_words and id_words that match the join condition.


We can combine the GROUP and FOREACH operators to calculate the number of positive words per tweet (with at least one positive word). First, we group the relation of positive tweets by the tweet ID, and then we count the number of occurrences of each ID in the relation, as follows:

grpd = GROUP positive_tweets BY id_str;
score = FOREACH grpd GENERATE FLATTEN(group), COUNT(positive_tweets);

The JOIN operator can make use of the parallel feature as well, as follows:

positive_tweets = JOIN positive_words BY w, id_words BY word PARALLEL 10;

The preceding command will execute the join with 10 reducer tasks.

It is possible to specify the operator's behavior with the USING keyword followed by the ID of a specialized join. More details can be found at http://pig.apache.org/docs/r0.12.0/perf.html#specialized-joins.
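For example (a sketch, not from the book), since the positive_words relation is small enough to fit in memory, the join above could use Pig's fragment-replicated join, which avoids the reduce phase; the small relation is listed last:

positive_tweets = JOIN id_words BY word, positive_words BY w USING 'replicated';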


Extending Pig (UDFs)

Functions can be a part of almost every operator in Pig. There are two main differences between UDFs and built-in functions. First, UDFs need to be registered using the REGISTER keyword in order to make them available to Pig. Secondly, they need to be qualified when used. Pig UDFs can currently be implemented in Java, Python, Ruby, JavaScript, and Groovy. The most extensive support is provided for Java functions, which allow you to customize all parts of the process including data load/store, transformation, and aggregation. Additionally, Java functions are also more efficient because they are implemented in the same language as Pig and because additional interfaces are supported, such as the Algebraic and Accumulator interfaces. On the other hand, the Ruby and Python APIs allow more rapid prototyping.

The integration of UDFs with the Pig environment is mainly managed by the following two statements, REGISTER and DEFINE:

REGISTER registers a JAR file so that the UDFs in the file can be used, as follows:

REGISTER 'piggybank.jar'

DEFINE creates an alias to a function or a streaming command, as follows:

DEFINE MyFunction my.package.uri.MyFunction

Version 0.12 of Pig introduced streaming UDFs as a mechanism for writing functions using languages with no JVM implementation.


Contributed UDFs

Pig's code base hosts a UDF repository called Piggybank. Other popular contributed repositories are Twitter's Elephant Bird (found at https://github.com/kevinweil/elephant-bird/) and Apache DataFu (found at http://datafu.incubator.apache.org/).

Piggybank

Piggybank is a place for Pig users to share their functions. Shared code is located in the official Pig Subversion repository found at http://svn.apache.org/viewvc/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/. The API documentation can be found at http://pig.apache.org/docs/r0.12.0/api/ under the contrib section. Piggybank UDFs can be obtained by checking out and compiling the sources from the Subversion repository or by using the JAR file that ships with binary releases of Pig. In Cloudera CDH, piggybank.jar is available at /opt/cloudera/parcels/CDH/lib/pig/piggybank.jar.

Elephant Bird

Elephant Bird is an open source library of all things Hadoop used in production at Twitter. This library contains a number of serialization tools, custom input and output formats, writables, Pig load/store functions, and more miscellanea.

Elephant Bird ships with an extremely flexible JSON loader function, which, at the time of writing, is the go-to resource for manipulating JSON data in Pig.

Apache DataFu

Apache DataFu Pig collects a number of analytical functions developed and contributed by LinkedIn. These include statistical and estimation functions, bag and set operations, sampling, hashing, and link analysis.


Analyzing the Twitter stream

In the following examples, we will use the implementation of JsonLoader provided by Elephant Bird to load and manipulate JSON data. We will use Pig to explore tweet metadata and analyze trends in the dataset. Finally, we will model the interaction between users as a graph and use Apache DataFu to analyze this social network.


Prerequisites

Download the elephant-bird-pig (http://central.maven.org/maven2/com/twitter/elephantbird/elephant-bird-pig/4.5/elephant-bird-pig-4.5.jar), elephant-bird-hadoop-compat (http://central.maven.org/maven2/com/twitter/elephantbird/elephant-bird-hadoop-compat/4.5/elephant-bird-hadoop-compat-4.5.jar), and elephant-bird-core (http://central.maven.org/maven2/com/twitter/elephantbird/elephant-bird-core/4.5/elephant-bird-core-4.5.jar) JAR files from the Maven central repository and copy them onto HDFS using the following command:

$ hdfs dfs -put target/elephant-bird-pig-4.5.jar hdfs:///jar/
$ hdfs dfs -put target/elephant-bird-hadoop-compat-4.5.jar hdfs:///jar/
$ hdfs dfs -put elephant-bird-core-4.5.jar hdfs:///jar/


Dataset exploration

Before diving deeper into the dataset, we need to register the dependencies to Elephant Bird and DataFu, as follows:

REGISTER /opt/cloudera/parcels/CDH/lib/pig/datafu-1.1.0-cdh5.0.0.jar
REGISTER /opt/cloudera/parcels/CDH/lib/pig/lib/json-simple-1.1.jar
REGISTER hdfs:///jar/elephant-bird-pig-4.5.jar
REGISTER hdfs:///jar/elephant-bird-hadoop-compat-4.5.jar
REGISTER hdfs:///jar/elephant-bird-core-4.5.jar

Then, load the JSON dataset of tweets using com.twitter.elephantbird.pig.load.JsonLoader, as follows:

tweets = LOAD 'tweets.json' USING
    com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');

com.twitter.elephantbird.pig.load.JsonLoader decodes each line of the input file to JSON and passes the resulting map of values to Pig as a single-element tuple. This enables access to elements of the JSON object without having to specify a schema upfront. The -nestedLoad argument instructs the class to load nested data structures.


Tweet metadata

In the remainder of the chapter, we will use metadata from the JSON dataset to model the tweet stream. One example of metadata attached to a tweet is the Place object, which contains geographical information about the user's location. Place contains fields that describe its name, ID, country, country code, and more. A full description can be found at https://dev.twitter.com/docs/platform-objects/places.

place = FOREACH tweets GENERATE (chararray)$0#'place' as place;

Entities give information such as structured data from tweets, URLs, hashtags, and mentions, without having to extract them from text. A description of entities can be found at https://dev.twitter.com/docs/entities. The hashtag entity is an array of tags extracted from a tweet. Each entity has the following two attributes:

- Text: is the hashtag text
- Indices: is the character position from which the hashtag was extracted

The following code uses entities:

hashtags_bag = FOREACH tweets {
    GENERATE
        FLATTEN($0#'entities'#'hashtags') as tag;
}

We then flatten hashtags_bag to extract each hashtag's text:

hashtags = FOREACH hashtags_bag GENERATE tag#'text' as topic;

Entities for user objects contain information that appears in the user profile and description fields. We can extract the tweet author's ID via the user field in the tweet map:

users = FOREACH tweets GENERATE $0#'user'#'id' as id;


Data preparation

The SAMPLE built-in operator selects a set of n tuples with probability p out of the dataset, as follows:

sampled = SAMPLE tweets 0.01;

The preceding command will select approximately 1 percent of the dataset. Given that SAMPLE is probabilistic (http://en.wikipedia.org/wiki/Bernoulli_sampling), there is no guarantee that the sample size will be exact. Moreover, the function samples with replacement, which means that each item might appear more than once.

Apache DataFu implements a number of sampling methods for cases where having an exact sample size and no replacement is desired (SimpleRandomSampling), sampling with replacement (SimpleRandomSampleWithReplacementVote and SimpleRandomSampleWithReplacementElect), when we want to account for sample bias (WeightedRandomSampling), or to sample across multiple relations (SampleByKey).

We can create a sample of exactly 1 percent of the dataset, with each item having the same probability of being selected, using SimpleRandomSample.

Note

The actual guarantee is a sample of size ceil(p*n) with a probability of at least 99 percent.

First, we pass a sampling probability 0.01 to the UDF constructor:

DEFINE SRS datafu.pig.sampling.SimpleRandomSample('0.01');

and the bag, created with (GROUP tweets ALL), to be sampled:

sampled = FOREACH (GROUP tweets ALL) GENERATE FLATTEN(SRS(tweets));

The SimpleRandomSample UDF selects without replacement, which means that each item will appear only once.

Note

Which sampling method to use depends both on the data we are working with, assumptions on how items are distributed, the size of the dataset, and what we practically want to achieve. In general, when we want to explore a dataset to formulate hypotheses, SimpleRandomSample can be a good choice. However, in several analytics applications, it is common to use methods that assume replacement (for example, bootstrapping).

Note that when working with very large datasets, sampling with replacement and sampling without replacement tend to behave similarly. The probability of an item being selected twice out of a population of billions of items will be low.


Top n statistics

One of the first questions we might want to ask is how frequent certain things are. For instance, we might want to create a histogram of the top 10 topics by the number of mentions. Similarly, we might want to find the top 50 countries or the top 10 users. Before looking at tweets data, we will define a macro so that we can apply the same selection logic to different collections of items:

DEFINE top_n(rel, col, n)
RETURNS top_n_items {
    grpd = GROUP $rel BY $col;
    cnt_items = FOREACH grpd
        GENERATE FLATTEN(group), COUNT($rel) AS cnt;
    cnt_items_sorted = ORDER cnt_items BY cnt DESC;
    $top_n_items = LIMIT cnt_items_sorted $n;
}

The top_n macro takes a relation rel, the column col we want to count, and the number of items to return n as parameters. In the Pig Latin block, we first group rel by items in col, count the number of occurrences of each item, sort them, and select the most frequent n.

To find the top 10 English hashtags, we filter tweets by language and extract the hashtag text:

tweets_en = FILTER tweets BY $0#'lang' == 'en';
hashtags_bag = FOREACH tweets_en {
    GENERATE
        FLATTEN($0#'entities'#'hashtags') AS tag;
}
hashtags = FOREACH hashtags_bag GENERATE tag#'text' AS tag;

And apply the top_n macro:

top_10_hashtags = top_n(hashtags, tag, 10);

In order to better characterize what is trending and make this information more relevant to users, we can drill down into the dataset and look at hashtags per geographic location.

First, we generate a bag of (place, hashtag) tuples, as follows:

hashtags_country_bag = FOREACH tweets {
    GENERATE
        $0#'place' as place,
        FLATTEN($0#'entities'#'hashtags') as tag;
}

And then, we extract the country code and hashtag text, as follows:

hashtags_country = FOREACH hashtags_country_bag {
    GENERATE
        place#'country_code' as co,
        tag#'text' as tag;
}

Then, we count how many times each country code and hashtag appear together, as follows:

hashtags_country_frequency = FOREACH (GROUP hashtags_country ALL) {
    GENERATE
        FLATTEN(group),
        COUNT(hashtags_country) as cnt;
}

Finally, we count the top 10 countries per hashtag with the TOP function, as follows:

hashtags_country_regrouped = GROUP hashtags_country_frequency BY cnt;
top_results = FOREACH hashtags_country_regrouped {
    result = TOP(10, 1, hashtags_country_frequency);
    GENERATE FLATTEN(result);
}

TOP's parameters are the number of tuples to return, the column to compare, and the relation containing said column:

top_results = FOREACH D {
    result = TOP(10, 1, C);
    GENERATE FLATTEN(result);
}

The source code for this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch6/topn.pig.


Datetime manipulation

The created_at field in the JSON tweets gives us time-stamped information about when the tweet was posted. Unfortunately, its format is not compatible with Pig's built-in datetime type.

Piggybank comes to the rescue with a number of time manipulation UDFs contained in org.apache.pig.piggybank.evaluation.datetime.convert. One of them is CustomFormatToISO, which converts an arbitrarily formatted timestamp into an ISO 8601 datetime string.

In order to access these UDFs, we first need to register the piggybank.jar file, as follows:

REGISTER /opt/cloudera/parcels/CDH/lib/pig/piggybank.jar

To make our code less verbose, we create an alias for the CustomFormatToISO class's fully qualified Java name:

DEFINE CustomFormatToISO
    org.apache.pig.piggybank.evaluation.datetime.convert.CustomFormatToISO();

By knowing how to manipulate timestamps, we can calculate statistics at different time intervals. For instance, we can look at how many tweets are created per hour. Pig has a built-in GetHour function that extracts the hour out of a datetime type. To use this, we first convert the timestamp string to ISO 8601 with CustomFormatToISO and then the resulting chararray to datetime using the built-in ToDate function, as follows:

hourly_tweets = FOREACH tweets {
    GENERATE
        GetHour(
            ToDate(
                CustomFormatToISO(
                    $0#'created_at', 'EEE MMMM d HH:mm:ss Z y')
            )
        ) as hour;
}

Now, it is just a matter of grouping hourly_tweets by hour and then generating a count of tweets per group, as follows:

hourly_tweets_count = FOREACH (GROUP hourly_tweets BY hour) {
    GENERATE FLATTEN(group), COUNT(hourly_tweets);
}

Sessions

DataFu's Sessionize class can help us to better capture user activity over time. A session represents the activity of a user within a given period of time. For instance, we can look at each user's tweet stream at intervals of 15 minutes and measure these sessions to determine both network volumes as well as user activity:

DEFINE Sessionize datafu.pig.sessions.Sessionize('15m');

users_activity = FOREACH tweets {
    GENERATE
        CustomFormatToISO($0#'created_at',
            'EEE MMMM d HH:mm:ss Z y') AS dt,
        (chararray)$0#'user'#'id' as user_id;
}

users_activity_sessionized = FOREACH
    (GROUP users_activity BY user_id) {
    ordered = ORDER users_activity BY dt;
    GENERATE FLATTEN(Sessionize(ordered))
        AS (dt, user_id, session_id);
}

users_activity simply records the time dt a given user_id posted a status update.

Sessionize takes the session timeout and a bag as input. The first element of the input bag is an ISO 8601 timestamp, and the bag must be sorted by this timestamp. Events that are within 15 minutes of each other will belong to the same session.

It returns the input bag with a new field, session_id, that uniquely identifies a session. With this data, we can calculate the session's length and some other statistics. More examples of Sessionize usage can be found at http://datafu.incubator.apache.org/docs/datafu/guide/sessions.html.
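As a hedged sketch of such a statistic (not from the book), the length of each session can be approximated by exploiting the fact that ISO 8601 timestamps sort lexicographically, so MIN and MAX of the chararray dt give the first and last event of a session; MilliSecondsBetween and ToDate are Pig built-in datetime functions:

-- A sketch: approximate session duration in milliseconds per session_id
session_length = FOREACH (GROUP users_activity_sessionized BY session_id) {
    GENERATE
        group AS session_id,
        MilliSecondsBetween(
            ToDate(MAX(users_activity_sessionized.dt)),
            ToDate(MIN(users_activity_sessionized.dt))) AS duration_ms;
}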


Capturing user interactions

In the remainder of the chapter, we will look at how to capture patterns from user interactions. As a first step in this direction, we will create a dataset suitable to model a social network. This dataset will contain a timestamp, the ID of the tweet, the user who posted the tweet, the user and tweet she's replying to, and the hashtag in the tweet.

Twitter considers as a reply (in_reply_to_status_id_str) any message beginning with the @ character. Such tweets are interpreted as a direct message to that person. Placing an @ character anywhere else in the tweet is interpreted as a mention ('entities'#'user_mentions') and not a reply. The difference is that mentions are immediately broadcast to a person's followers, whereas replies are not. Replies are, however, considered as mentions.

When working with personally identifiable information, it is a good idea to anonymize, if not remove entirely, sensitive data such as IP addresses, names, and user IDs. A commonly used technique involves a hash function that takes as input the data we want to anonymize, concatenated with additional random data called salt. The following code shows an example of such anonymization:

DEFINE SHA datafu.pig.hash.SHA();

from_to_bag = FOREACH tweets {
    dt = $0#'created_at';
    user_id = (chararray)$0#'user'#'id';
    tweet_id = (chararray)$0#'id_str';
    reply_to_tweet = (chararray)$0#'in_reply_to_status_id_str';
    reply_to = (chararray)$0#'in_reply_to_user_id_str';
    place = $0#'place';
    topics = $0#'entities'#'hashtags';
    GENERATE
        CustomFormatToISO(dt, 'EEE MMMM d HH:mm:ss Z y') AS dt,
        SHA((chararray)CONCAT('SALT', user_id)) AS source,
        SHA(((chararray)CONCAT('SALT', tweet_id))) AS tweet_id,
        ((reply_to_tweet IS NULL)
            ? NULL
            : SHA((chararray)CONCAT('SALT', reply_to_tweet)))
            AS reply_to_tweet_id,
        ((reply_to IS NULL)
            ? NULL
            : SHA((chararray)CONCAT('SALT', reply_to)))
            AS destination,
        (chararray)place#'country_code' as country,
        FLATTEN(topics) AS topic;
}

-- extract the hashtag text
from_to = FOREACH from_to_bag {
    GENERATE
        dt,
        tweet_id,
        reply_to_tweet_id,
        source,
        destination,
        country,
        (chararray)topic#'text' AS topic;
}

In this example, we use CONCAT to append a (not so random) salt string to personal data. We then generate a hash of the salted IDs with DataFu's SHA function. The SHA function requires its input parameters to be non-null. We enforce this condition using if-then-else statements. In Pig Latin, this is expressed as <condition is true> ? <true branch> : <false branch>. If the string is null, we return NULL, and if not, we return the salted hash. To make the code more readable, we use aliases for the tweet JSON fields and reference them in the GENERATE block.


Link analysis

We can redefine our approach to determine trending topics to include users' reactions. A first, naïve approach could be to consider a topic as important if it caused a number of replies larger than a threshold value.

A problem with this approach is that tweets generate relatively few replies, so the volume of the resulting dataset will be low. Hence, it requires a very large amount of data to contain tweets being replied to and produce any result. In practice, we would likely want to combine this metric with other ones (for example, mentions) in order to perform more meaningful analyses.

To satisfy this query, we will create a new dataset that includes the hashtags extracted from both the tweet and the one a user is replying to:

tweet_hashtag = FOREACH from_to GENERATE tweet_id, topic;

from_to_self_joined = JOIN from_to BY reply_to_tweet_id LEFT,
    tweet_hashtag BY tweet_id;

twitter_graph = FOREACH from_to_self_joined {
    GENERATE
        from_to::dt AS dt,
        from_to::tweet_id AS tweet_id,
        from_to::reply_to_tweet_id AS reply_to_tweet_id,
        from_to::source AS source,
        from_to::destination AS destination,
        from_to::topic AS topic,
        from_to::country AS country,
        tweet_hashtag::topic AS topic_replied;
}

Note that Pig does not allow a cross join on the same relation, hence we have to create tweet_hashtag for the right-hand side of the join. Here, we use the :: operator to disambiguate from which relation and column we want to select records.

Once again, we can look for the top 10 topics by number of replies using the top_n macro:

top_10_topics = top_n(twitter_graph, topic_replied, 10);

Counting things will only take us so far. We can compute more descriptive statistics on this dataset with DataFu. Using the Quantile function, we can calculate the median, the 90th, 95th, and the 99th percentiles of the number of hashtag reactions, as follows:

DEFINE Quantile datafu.pig.stats.Quantile('0.5', '0.90', '0.95', '0.99');

Since the UDF expects an ordered bag of integer values as input, we first count the frequency of each topic_replied entry, as follows:

topics_with_replies_grpd = GROUP twitter_graph BY topic_replied;
topics_with_replies_cnt = FOREACH topics_with_replies_grpd {
    GENERATE
        COUNT(twitter_graph) as cnt;
}


Then, we apply Quantile on the bag of frequencies, as follows:

quantiles = FOREACH (GROUP topics_with_replies_cnt ALL) {
    sorted = ORDER topics_with_replies_cnt BY cnt;
    GENERATE Quantile(sorted);
}

The source code for this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch6/graph.pig.


Influential users

We will use PageRank, an algorithm developed by Google to rank web pages (http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf), to identify influential users in the Twitter graph we generated in the previous section.

This type of analysis has a number of use cases, such as targeted and contextual advertisement, recommendation systems, spam detection, and obviously measuring the importance of web pages. A similar approach, used by Twitter to implement the Who to Follow feature, is described in the research paper WTF: The Who to Follow service at Twitter, found at http://stanford.edu/~rezab/papers/wtf_overview.pdf.

Informally, PageRank determines the importance of a page based on the importance of other pages linking to it and assigns it a score between 0 and 1. A high PageRank score indicates that a lot of pages point to it. Intuitively, being linked by pages with a high PageRank is a quality endorsement. In terms of the Twitter graph, we assume that users receiving a lot of replies are important or influential within the social network. In Twitter's case, we consider an extended definition of PageRank, where the link between two users is given by a direct reply and labeled by any eventual hashtag present in the message. Heuristically, we want to identify influential users on a given topic.

In DataFu's implementation, each graph is represented as a bag of (source, edges) tuples. The source tuple is an integer ID representing the source node. The edges are a bag of (destination, weight) tuples. destination is an integer ID representing the destination node. weight is a double representing how much the edge should be weighted. The output of the UDF is a bag of (source, rank) pairs, where rank is the PageRank value for the source user in the graph. Notice that we talked about nodes, edges, and graphs as abstract concepts. In Google's case, nodes are web pages, edges are links from one page to the other, and graphs are groups of pages connected directly and indirectly.

In our case, nodes represent users, edges represent in_reply_to_user_id_str mentions, and edges are labeled by hashtags in tweets. The output of PageRank should suggest which users are influential on a given topic given their interaction patterns.

In this section, we will write a pipeline to:

- Represent data as a graph where each node is a user and a hashtag labels the edge
- Map IDs and hashtags to integers so that they can be consumed by PageRank
- Apply PageRank
- Store the results into HDFS in an interoperable format (Avro)

We represent the graph as a bag of tuples in the form (source, destination, topic), where each tuple represents the interaction between nodes. The source code for this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch6/pagerank.pig.

We will map users' and hashtags' text to numerical IDs. We use the Java String hashCode() method to perform this conversion step and wrap the logic in an Eval UDF.


Note

The size of an integer is effectively the upper bound for the number of nodes and edges in the graph. For production code, it is recommended that you use a more robust hash function.

The StringToInt class takes a string as input, calls the hashCode() method, and returns the method output to Pig. The UDF code can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch6/udf/com/learninghadoop2/pig/udf/StringToInt.java.

package com.learninghadoop2.pig.udf;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class StringToInt extends EvalFunc<Integer> {
    public Integer exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.hashCode();
        } catch (Exception e) {
            throw
                new IOException("Cannot convert String to Int", e);
        }
    }
}

We extend org.apache.pig.EvalFunc and override the exec method to return str.hashCode() on the function input. The EvalFunc<Integer> class is parameterized with the return type of the UDF (Integer).

Next, we compile the class and archive it into a JAR, as follows:

$javac-classpath/opt/cloudera/parcels/CDH/lib/pig/pig.jar:$(hadoop

classpath)com/learninghadoop2/pig/udf/StringToInt.java

$jarcvfmyudfs-pig.jarcom/learninghadoop2/pig/udf/StringToInt.class

We can now register the UDF in Pig and create an alias to StringToInt, as follows:

REGISTER myudfs-pig.jar
DEFINE StringToInt com.learninghadoop2.pig.udf.StringToInt();

We filter out tweets with no destination and no topic, as follows:

tweets_graph_filtered = FILTER twitter_graph BY
    (destination IS NOT NULL) AND
    (topic IS NOT NULL);

Then, we convert the source, destination, and topic to integer IDs:

from_to = foreach tweets_graph_filtered {
    GENERATE
        StringToInt(source) as source_id,
        StringToInt(destination) as destination_id,
        StringToInt(topic) as topic_id;
}

Once data is in the appropriate format, we can reuse the implementation of PageRank and the example code (found at https://github.com/apache/incubator-datafu/blob/master/datafu-pig/src/main/java/datafu/pig/linkanalysis/PageRank.java) provided by DataFu, as shown in the following code:

DEFINE PageRank datafu.pig.linkanalysis.PageRank('dangling_nodes', 'true');

We begin by creating a bag of (source_id, destination_id, topic_id) tuples, as follows:

reply_to = group from_to by (source_id, destination_id, topic_id);

We count the occurrences of each tuple, that is, how many times two people talked about a topic, as follows:

topic_edges = foreach reply_to {
    GENERATE flatten(group), ((double) COUNT(from_to.topic_id)) as w;
}

Remember that topic is the edge of our graph; we begin by creating an association between the source node and the topic edge, as follows:

topic_edges_grouped = GROUP topic_edges by (topic_id, source_id);

Then we regroup it with the purpose of adding a destination node and the edge weight, as follows:

topic_edges_grouped = FOREACH topic_edges_grouped {
    GENERATE
        group.topic_id as topic,
        group.source_id as source,
        topic_edges.(destination_id, w) as edges;
}

Once we create the Twitter graph, we calculate the PageRank of all users (source_id):

topic_rank = FOREACH (GROUP topic_edges_grouped BY topic) {
    GENERATE
        group as topic,
        FLATTEN(PageRank(topic_edges_grouped.(source, edges))) as (source, rank);
}
topic_rank = FOREACH topic_rank GENERATE topic, source, rank;

We store the result in HDFS in Avro format. If Avro dependencies are not present in the classpath, we need to add the Avro MapReduce JAR file to our environment before accessing individual fields. Within Pig, for example, on the Cloudera CDH5 VM:

REGISTER /opt/cloudera/parcels/CDH/lib/avro/avro.jar
REGISTER /opt/cloudera/parcels/CDH/lib/avro/avro-mapred-hadoop2.jar

STORE topic_rank INTO 'replies-pagerank' using AvroStorage();

Note


In these last two sections, we made a number of implicit assumptions on what a Twitter graph might look like and what the concepts of topic and user interaction mean. Given the constraints that we posed, the resulting social network we analyzed will be relatively small and not necessarily representative of the entire Twitter social network. Extrapolating results from this dataset is discouraged. In practice, there are many other factors that should be taken into account to generate a robust model of social interaction.


Summary

In this chapter, we introduced Apache Pig, a platform for large-scale data analysis on Hadoop. In particular, we covered the following topics:

The goals of Pig as a way of providing a dataflow-like abstraction that does not require hands-on MapReduce development
How Pig's approach to processing data compares to SQL, where Pig is procedural while SQL is declarative
Getting started with Pig, an easy task, as it is a library that generates custom code and doesn't require additional services
An overview of the data types, core functions, and extension mechanisms provided by Pig
Examples of applying Pig to analyze the Twitter dataset in detail, which demonstrated its ability to express complex concepts in a very concise fashion
How libraries such as Piggybank, Elephant Bird, and DataFu provide repositories for numerous useful prewritten Pig functions

In the next chapter, we will revisit the SQL comparison by exploring tools that expose a SQL-like abstraction over data stored in HDFS.


Chapter 7. Hadoop and SQL

MapReduce is a powerful paradigm that enables complex data processing that can reveal valuable insights. As discussed in earlier chapters, however, it does require a different mindset and some training and experience on the model of breaking processing analytics into a series of map and reduce steps. There are several products that are built atop Hadoop to provide higher-level or more familiar views of the data held within HDFS, and Pig is a very popular one. This chapter will explore the other most common abstraction implemented atop Hadoop: SQL.

In this chapter, we will cover the following topics:

What the use cases for SQL on Hadoop are and why it is so popular
HiveQL, the SQL dialect introduced by Apache Hive
Using HiveQL to perform SQL-like analysis of the Twitter dataset
How HiveQL can approximate common features of relational databases such as joins and views
How HiveQL allows the incorporation of user-defined functions into its queries
How SQL on Hadoop complements Pig
Other SQL-on-Hadoop products such as Impala and how they differ from Hive


Why SQL on Hadoop

So far we have seen how to write Hadoop programs using the MapReduce APIs and how Pig Latin provides a scripting abstraction and a wrapper for custom business logic by means of UDFs. Pig is a very powerful tool, but its dataflow-based programming model is not familiar to most developers or business analysts. The traditional tool of choice for such people to explore data is SQL.

Back in 2008, Facebook released Hive, the first widely used implementation of SQL on Hadoop.

Instead of providing a way of more quickly developing map and reduce tasks, Hive offers an implementation of HiveQL, a query language based on SQL. Hive takes HiveQL statements and immediately and automatically translates the queries into one or more MapReduce jobs. It then executes the overall MapReduce program and returns the results to the user.

This interface to Hadoop not only reduces the time required to produce results from data analysis, it also significantly widens the net as to who can use Hadoop. Instead of requiring software development skills, anyone who's familiar with SQL can use Hive.

The consequence of these attributes is that HiveQL is often used as a tool for business and data analysts to perform ad hoc queries on the data stored on HDFS. With Hive, the data analyst can work on refining queries without the involvement of a software developer. Just as with Pig, Hive also allows HiveQL to be extended by means of User Defined Functions, enabling the base SQL dialect to be customized with business-specific functionality.


Other SQL-on-Hadoop solutions

Though Hive was the first product to introduce and support HiveQL, it is no longer the only one. Later in this chapter, we will also discuss Impala, released in 2013 and already a very popular tool, particularly for low-latency queries. There are others, but we will mostly discuss Hive and Impala as they have been the most successful.

While introducing the core features and capabilities of SQL on Hadoop, however, we will give examples using Hive; even though Hive and Impala share many SQL features, they also have numerous differences. We don't want to constantly have to caveat each new feature with exactly how it is supported in Hive compared to Impala. We'll generally be looking at aspects of the feature set that are common to both, but if you use both products, it's important to read the latest release notes to understand the differences.


Prerequisites

Before diving into specific technologies, let's generate some data that we'll use in the examples throughout this chapter. We'll create a modified version of a former Pig script as the main functionality for this. The script in this chapter assumes that the Elephant Bird JARs used previously are available in the /jar directory on HDFS. The full source code is at https://github.com/learninghadoop2/book-examples/blob/master/ch7/extract_for_hive.pig, but the core of extract_for_hive.pig is as follows:

-- load JSON data
tweets = load '$inputDir' using com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');

-- Tweets
tweets_tsv = foreach tweets {
    generate
        (chararray)CustomFormatToISO($0#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt,
        (chararray)$0#'id_str',
        (chararray)$0#'text' as text,
        (chararray)$0#'in_reply_to',
        (boolean)$0#'retweeted' as is_retweeted,
        (chararray)$0#'user'#'id_str' as user_id,
        (chararray)$0#'place'#'id' as place_id;
}
store tweets_tsv into '$outputDir/tweets' using PigStorage('\u0001');

-- Places
needed_fields = foreach tweets {
    generate
        (chararray)CustomFormatToISO($0#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt,
        (chararray)$0#'id_str' as id_str,
        $0#'place' as place;
}
place_fields = foreach needed_fields {
    generate
        (chararray)place#'id' as place_id,
        (chararray)place#'country_code' as co,
        (chararray)place#'country' as country,
        (chararray)place#'name' as place_name,
        (chararray)place#'full_name' as place_full_name,
        (chararray)place#'place_type' as place_type;
}
filtered_places = filter place_fields by co != '';
unique_places = distinct filtered_places;
store unique_places into '$outputDir/places' using PigStorage('\u0001');

-- Users
users = foreach tweets {
    generate
        (chararray)CustomFormatToISO($0#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt,
        (chararray)$0#'id_str' as id_str,
        $0#'user' as user;
}
user_fields = foreach users {
    generate
        (chararray)CustomFormatToISO(user#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt,
        (chararray)user#'id_str' as user_id,
        (chararray)user#'location' as user_location,
        (chararray)user#'name' as user_name,
        (chararray)user#'description' as user_description,
        (int)user#'followers_count' as followers_count,
        (int)user#'friends_count' as friends_count,
        (int)user#'favourites_count' as favourites_count,
        (chararray)user#'screen_name' as screen_name,
        (int)user#'listed_count' as listed_count;
}
unique_users = distinct user_fields;
store unique_users into '$outputDir/users' using PigStorage('\u0001');

Run this script as follows:

$ pig -f extract_for_hive.pig -param inputDir=<json input> -param outputDir=<output path>

The preceding code writes data into three separate TSV files for the tweet, user, and place information. Notice that in the store command, we pass an argument when calling PigStorage. This single argument changes the default field separator from a tab character to the Unicode value U0001, which can also be entered as Ctrl + A. This is often used as a separator in Hive tables and will be particularly useful to us as our tweet data could contain tabs in other fields.


Overview of Hive

We will now show how you can import data into Hive and run a query against the table abstraction Hive provides over the data. In this example, and in the remainder of the chapter, we will assume that queries are typed into the shell that can be invoked by executing the hive command.

Recently, a client called Beeline has also become available and will likely be the preferred CLI client in the near future.

When importing any new data into Hive, there is generally a three-stage process:

Create the specification of the table into which the data is to be imported
Import the data into the created table
Execute HiveQL queries against the table

Most of the HiveQL statements are direct analogues to similarly named statements in standard SQL. We assume only a passing knowledge of SQL throughout this chapter, but if you need a refresher, there are numerous good online learning resources.

Hive gives a structured query view of our data, and to enable that, we must first define the specification of the table's columns and import the data into the table before we can execute any queries. A table specification is generated using a CREATE statement that specifies the table name, the name and types of its columns, and some metadata about how the table is stored:

CREATE table tweets (
    created_at string,
    tweet_id string,
    text string,
    in_reply_to string,
    retweeted boolean,
    user_id string,
    place_id string
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE;

The statement creates a new table, tweets, defined by a list of names for columns in the dataset and their data types. We specify that fields are delimited by the Unicode U0001 character and that the format used to store data is TEXTFILE.

Data can be imported from the location tweets/ in HDFS using the LOAD DATA statement:

LOAD DATA INPATH 'tweets' OVERWRITE INTO TABLE tweets;

By default, data for Hive tables is stored on HDFS under /user/hive/warehouse. If a LOAD statement is given a path to data on HDFS, it will not simply copy the data into /user/hive/warehouse, but will move it there instead. If you want to analyze data on HDFS that is used by other applications, then either create a copy or use the EXTERNAL mechanism that will be described later.


Once data has been imported into Hive, we can run queries against it. For instance:

SELECT COUNT(*) FROM tweets;

The preceding code will return the total number of tweets present in the dataset. HiveQL, like SQL, is not case sensitive in terms of keywords, columns, or table names. By convention, SQL statements use uppercase for SQL language keywords, and we will generally follow this when using HiveQL within files, as will be shown later. However, when typing interactive commands, we will frequently take the line of least resistance and use lowercase.

If you look closely at the time taken by the various commands in the preceding example, you'll notice that loading data into a table takes about as long as creating the table specification, but even the simple count of all rows takes significantly longer. The output also shows that table creation and the loading of data do not actually cause MapReduce jobs to be executed, which explains the very short execution times.


The nature of Hive tables

Although Hive copies the data file into its working directory, it does not actually process the input data into rows at that point.

Both the CREATE TABLE and LOAD DATA statements do not truly create concrete table data as such; instead, they produce the metadata that will be used when Hive generates MapReduce jobs to access the data conceptually stored in the table but actually residing on HDFS. Even though the HiveQL statements refer to a specific table structure, it is Hive's responsibility to generate code that correctly maps this to the actual on-disk format in which the data files are stored.

This might seem to suggest that Hive isn't a real database; this is true, it isn't. Whereas a relational database will require a table schema to be defined before data is ingested and will then ingest only data that conforms to that specification, Hive is much more flexible. The less concrete nature of Hive tables means that schemas can be defined based on the data as it has already arrived and not on some assumption of how the data should be, which might prove to be wrong. Though changeable data formats are troublesome regardless of technology, the Hive model provides an additional degree of freedom in handling the problem when, not if, it arises.


Hive architecture

Until version 2, Hadoop was primarily a batch system. As we saw in previous chapters, MapReduce jobs tend to have high latency and overhead derived from submission and scheduling. Internally, Hive compiles HiveQL statements into MapReduce jobs. Hive queries have traditionally been characterized by high latency. This has changed with the Stinger initiative and the improvements introduced in Hive 0.13 that we will discuss later.

Hive runs as a client application that processes HiveQL queries, converts them into MapReduce jobs, and submits these to a Hadoop cluster, either to native MapReduce in Hadoop 1 or to the MapReduce Application Master running on YARN in Hadoop 2.

Regardless of the model, Hive uses a component called the metastore, in which it holds all its metadata about the tables defined in the system. Ironically, this is stored in a relational database dedicated to Hive's usage. In the earliest versions of Hive, all clients communicated directly with the metastore, but this meant that every user of the Hive CLI tool needed to know the metastore username and password.

HiveServer was created to act as a point of entry for remote clients, which could also act as a single access-control point and which controlled all access to the underlying metastore. Because of limitations in HiveServer, the newest way to access Hive is through the multi-client HiveServer2.

Note

HiveServer2 introduces a number of improvements over its predecessor, including user authentication and support for multiple connections from the same client. More information can be found at https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2.

Instances of HiveServer and HiveServer2 can be manually executed with the hive --service hiveserver and hive --service hiveserver2 commands, respectively.

In the examples we saw before and in the remainder of this chapter, we implicitly use HiveServer to submit queries via the Hive command-line tool. HiveServer2 comes with Beeline. For compatibility and maturity reasons, Beeline being relatively new, both tools are available on Cloudera and most other major distributions. The Beeline client is part of the core Apache Hive distribution and so is also fully open source. Beeline can be executed in embedded mode with the following command:

$ beeline -u jdbc:hive2://


Data types

HiveQL supports many of the common data types provided by standard database systems. These include primitive types, such as float, double, int, and string, through to structured collection types that provide the SQL analogues to types such as arrays, structs, and unions (structs with options for some fields). Since Hive is implemented in Java, primitive types will behave like their Java counterparts. We can distinguish Hive data types into the following five broad categories:

Numeric: tinyint, smallint, int, bigint, float, double, and decimal
Date and time: timestamp and date
String: string, varchar, and char
Collections: array, map, struct, and uniontype
Misc: boolean, binary, and NULL
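As a minimal sketch of how the collection types appear in a table definition (the table and column names here are purely illustrative and are not used elsewhere in this chapter), a declaration mixing primitive and collection columns could look like this:

CREATE TABLE type_examples (
    id         bigint,
    name       string,
    scores     array<double>,
    attributes map<string, string>,
    location   struct<latitude: double, longitude: double>
);

Individual elements are then addressed with expressions such as scores[0], attributes['key'], and location.latitude.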


DDL statements

HiveQL provides a number of statements to create, delete, and alter databases, tables, and views. The CREATE DATABASE <name> statement creates a new database with the given name. A database represents a namespace where table and view metadata is contained. If multiple databases are present, the USE <database name> statement specifies which one to use to query tables or create new metadata. If no database is explicitly specified, Hive will run all statements against the default database. SHOW [DATABASES, TABLES, VIEWS] displays the databases currently available within the data warehouse and which table and view metadata is present within the database currently in use:

CREATE DATABASE twitter;
SHOW databases;
USE twitter;
SHOW TABLES;

The CREATE TABLE [IF NOT EXISTS] <name> statement creates a table with the given name. As alluded to earlier, what is really created is the metadata representing the table and its mapping to files on HDFS, as well as a directory in which to store the data files. If a table or view with the same name already exists, Hive will raise an exception.

Both table and column names are case insensitive. In older versions of Hive (0.12 and earlier), only alphanumeric and underscore characters were allowed in table and column names. As of Hive 0.13, the system supports Unicode characters in column names. Reserved words, such as load and create, need to be escaped by backticks (the ` character) to be treated literally.

The EXTERNAL keyword specifies that the table exists in resources out of Hive's control, which can be a useful mechanism to extract data from another source at the beginning of a Hadoop-based Extract-Transform-Load (ETL) pipeline. The LOCATION clause specifies where the source file (or directory) is to be found. The EXTERNAL keyword and LOCATION clause have been used in the following code:

CREATE EXTERNAL TABLE tweets (
    created_at string,
    tweet_id string,
    text string,
    in_reply_to string,
    retweeted boolean,
    user_id string,
    place_id string
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE
LOCATION '${input}/tweets';

This table will be created in the metastore, but the data will not be copied into the /user/hive/warehouse directory.

Tip


Note that Hive has no concept of a primary key or unique identifier. Uniqueness and data normalization are aspects to be addressed before loading data into the data warehouse.

The CREATE VIEW <view name> … AS SELECT statement creates a view with the given name. For example, we can create a view to isolate retweets from other messages, as follows:

CREATE VIEW retweets
COMMENT 'Tweets that have been retweeted'
AS SELECT * FROM tweets WHERE retweeted = true;

Unless otherwise specified, column names are derived from the defining SELECT statement. Hive does not currently support materialized views.

The DROP TABLE and DROP VIEW statements remove both metadata and data for a given table or view. When dropping an EXTERNAL table or a view, only metadata will be removed and the actual data files will not be affected.

Hive allows table metadata to be altered via the ALTER TABLE statement, which can be used to change a column type, name, position, and comment, or to add and replace columns.

When adding columns, it is important to remember that only the metadata will be changed and not the dataset itself. This means that if we were to add a column in the middle of the table which didn't exist in older files, then while selecting from older data, we might get wrong values in the wrong columns. This is because we would be looking at old files with a new format. We will discuss data and schema migrations in Chapter 8, Data Lifecycle Management, when discussing Avro.

Similarly, ALTER VIEW <view name> AS <select statement> changes the definition of an existing view.
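As a quick sketch of the syntax (the new column below is illustrative only and is not used in later examples), adding a column to the tweets table and redefining the retweets view might look as follows:

ALTER TABLE tweets ADD COLUMNS (lang string COMMENT 'ISO 639-1 language code');

ALTER VIEW retweets AS SELECT * FROM tweets WHERE retweeted = true AND user_id IS NOT NULL;

Remember that, as discussed above, ADD COLUMNS changes only the metadata; the underlying files are not rewritten.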


File formats and storage

The data files underlying a Hive table are no different from any other file on HDFS. Users can directly read the HDFS files in the Hive tables using other tools. They can also use other tools to write to HDFS files that can be loaded into Hive through CREATE EXTERNAL TABLE or through LOAD DATA INPATH.

Hive uses the Serializer and Deserializer classes, SerDe, as well as FileFormat, to read and write table rows. A native SerDe is used if ROW FORMAT is not specified or ROW FORMAT DELIMITED is specified in a CREATE TABLE statement. The DELIMITED clause instructs the system to read delimited files. Delimiter characters can be escaped using the ESCAPED BY clause.

Hive currently uses the following FileFormat classes to read and write HDFS files:

TextInputFormat and HiveIgnoreKeyTextOutputFormat: these will read/write data in plain text file format
SequenceFileInputFormat and SequenceFileOutputFormat: these classes read/write data in the Hadoop SequenceFile format

Additionally, the following SerDe classes can be used to serialize and deserialize data:

MetadataTypedColumnsetSerDe: will read/write delimited records such as CSV or tab-separated records
ThriftSerDe and DynamicSerDe: will read/write Thrift objects

JSON

As of version 0.13, Hive ships with the native org.apache.hive.hcatalog.data.JsonSerDe. For older versions of Hive, Hive-JSON-Serde (found at https://github.com/rcongiu/Hive-JSON-Serde) is arguably one of the most feature-rich JSON serialization/deserialization modules.

We can use either module to load JSON tweets without any need for preprocessing and just define a Hive schema that matches the content of a JSON document. In the following example, we use Hive-JSON-Serde.

As with any third-party module, we load the SerDe JARs into Hive with the following code:

ADD JAR json-serde-1.3-jar-with-dependencies.jar;

Then, we issue the usual CREATE statement, as follows:

CREATE EXTERNAL TABLE tweets (
    contributors string,
    coordinates struct<
        coordinates: array<float>,
        type: string>,
    created_at string,
    entities struct<
        hashtags: array<struct<
            indices: array<tinyint>,
            text: string>>,
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
LOCATION 'tweets';

With this SerDe, we can map nested documents (such as entities or users) to the struct or map types. We tell Hive that the data stored at LOCATION 'tweets' is text (STORED AS TEXTFILE) and that each row is a JSON object (ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'). In Hive 0.13 and later, we can express this property as ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'.

Manually specifying the schema for complex documents can be a tedious and error-prone process. The hive-json module (found at https://github.com/hortonworks/hive-json) is a handy utility to analyze large documents and generate an appropriate Hive schema. Depending on the document collection, further refinement might be necessary.

In our example, we used a schema generated with hive-json that maps the tweets JSON to a number of struct data types. This allows us to query the data using a handy dot notation. For instance, we can extract the screen_name and description fields of a user object with the following code:

SELECT user.screen_name, user.description FROM tweets_json LIMIT 10;

Avro

AvroSerde (https://cwiki.apache.org/confluence/display/Hive/AvroSerDe) allows us to read and write data in Avro format. Starting from 0.14, Avro-backed tables can be created using the STORED AS AVRO statement, and Hive will take care of creating an appropriate Avro schema for the table. Prior versions of Hive are a bit more verbose.

As an example, let's load into Hive the PageRank dataset we generated in Chapter 6, Data Analysis with Apache Pig. This dataset was created using Pig's AvroStorage class, and has the following schema:

{
    "type": "record",
    "name": "record",
    "fields": [
        {"name": "topic", "type": ["null", "int"]},
        {"name": "source", "type": ["null", "int"]},
        {"name": "rank", "type": ["null", "float"]}
    ]
}

The table structure is captured in an Avro record, which contains header information (a name and an optional namespace to qualify the name) and an array of the fields. Each field is specified with its name and type as well as an optional documentation string.

For a few of the fields, the type is not a single value, but instead a pair of values, one of which is null. This is an Avro union, and this is the idiomatic way of handling columns that might have a null value. Avro specifies null as a concrete type, and any location where another type might have a null value needs to be specified in this way. This will be handled transparently for us when we use the following schema.

With this definition, we can now create a Hive table that uses this schema for its table specification, as follows:

CREATE EXTERNAL TABLE tweets_pagerank
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal'='{
    "type": "record",
    "name": "record",
    "fields": [
        {"name": "topic", "type": ["null", "int"]},
        {"name": "source", "type": ["null", "int"]},
        {"name": "rank", "type": ["null", "float"]}
    ]
}')
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '${data}/ch5-pagerank';

Then, look at the following table definition from within Hive (note also that HCatalog, which we'll introduce in Chapter 8, Data Lifecycle Management, also supports such definitions):

DESCRIBE tweets_pagerank;
OK
topic    int      from deserializer
source   int      from deserializer
rank     float    from deserializer

In the DDL, we told Hive that data is stored in Avro format using AvroContainerInputFormat and AvroContainerOutputFormat. Each row needs to be serialized and deserialized using org.apache.hadoop.hive.serde2.avro.AvroSerDe. The table schema is inferred by Hive from the Avro schema embedded in avro.schema.literal.

Alternatively, we can store a schema on HDFS and have Hive read it to determine the table structure. Create the preceding schema in a file called pagerank.avsc (this is the standard file extension for Avro schemas). Then place it on HDFS; we prefer to have a common location for schema files, such as /schema/avro. Finally, define the table using the avro.schema.url SerDe property, WITH SERDEPROPERTIES ('avro.schema.url'='hdfs://<namenode>/schema/avro/pagerank.avsc').
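Putting those pieces together, a sketch of the equivalent table definition using a schema file (assuming pagerank.avsc has already been copied to /schema/avro on HDFS) might look like this:

CREATE EXTERNAL TABLE tweets_pagerank
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.url'='hdfs://<namenode>/schema/avro/pagerank.avsc')
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '${data}/ch5-pagerank';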

If Avro dependencies are not present in the classpath, we need to add the Avro MapReduce JAR to our environment before accessing individual fields. Within Hive, on the Cloudera CDH5 VM:

ADD JAR /opt/cloudera/parcels/CDH/lib/avro/avro-mapred-hadoop2.jar;


We can also use this table like any other. For instance, we can query the data to select the user and topic pairs with a high PageRank:

SELECT source, topic FROM tweets_pagerank WHERE rank >= 0.9;

In Chapter 8, Data Lifecycle Management, we will see how Avro and avro.schema.url play an instrumental role in enabling schema migrations.

Columnar stores

Hive can also take advantage of columnar storage via the ORC (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC) and Parquet (https://cwiki.apache.org/confluence/display/Hive/Parquet) formats.

If a table is defined with very many columns, it is not unusual for any given query to only process a small subset of these columns. But even in a SequenceFile, each full row and all its columns will be read from disk, decompressed, and processed. This consumes a lot of system resources for data that we know in advance is not of interest.

Traditional relational databases also store data on a row basis, and a type of database called columnar changed this to be column-focused. In the simplest model, instead of one file for each table, there would be one file for each column in the table. If a query only needed to access five columns in a table with 100 columns in total, then only the files for those five columns will be read. Both ORC and Parquet use this principle as well as other optimizations to enable much faster queries.
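As a brief sketch (these tables are not used elsewhere in this chapter), switching to a columnar format is mostly a matter of the STORED AS clause; on a recent Hive release we could create ORC or Parquet copies of the tweets table as follows:

CREATE TABLE tweets_orc STORED AS ORC AS SELECT * FROM tweets;

CREATE TABLE tweets_parquet STORED AS PARQUET AS SELECT * FROM tweets;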


Queries

Tables can be queried using the familiar SELECT … FROM statement. The WHERE statement allows the specification of filtering conditions, GROUP BY aggregates records, ORDER BY specifies sorting criteria, and LIMIT specifies the number of records to retrieve. Aggregate functions, such as count and sum, can be applied to aggregated records. For instance, the following code returns the top 10 most prolific users in the dataset:

SELECT user_id, COUNT(*) AS cnt FROM tweets GROUP BY user_id ORDER BY cnt DESC LIMIT 10;

This returns the following output:

2263949659    4
1332188053    4
959468857     3
1367752118    3
362562944     3
58646041      3
2375296688    3
1468188529    3
37114209      3
2385040940    3

We can improve the readability of the hive output by setting the following:

SET hive.cli.print.header=true;

This will instruct hive, though not beeline, to print column names as part of the output.

Tip

You can add the command to the .hiverc file, usually found in the root of the executing user's home directory, to have it apply to all hive CLI sessions.

HiveQL implements a JOIN operator that enables us to combine tables together. In the Prerequisites section, we generated separate datasets for the user and place objects. Let's now load them into Hive using external tables.

We first create a user table to store user data, as follows:

CREATE EXTERNAL TABLE user (
    created_at string,
    user_id string,
    `location` string,
    name string,
    description string,
    followers_count bigint,
    friends_count bigint,
    favourites_count bigint,
    screen_name string,
    listed_count bigint
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE
LOCATION '${input}/users';

We then create a place table to store location data, as follows:

CREATE EXTERNAL TABLE place (
    place_id string,
    country_code string,
    country string,
    `name` string,
    full_name string,
    place_type string
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE
LOCATION '${input}/places';

We can use the JOIN operator to display the names of the 10 most prolific users, as follows:

SELECT tweets.user_id, user.name, COUNT(tweets.user_id) AS cnt
FROM tweets
JOIN user ON user.user_id = tweets.user_id
GROUP BY tweets.user_id, user.user_id, user.name
ORDER BY cnt DESC LIMIT 10;

Tip

Only equality, outer, and left (semi) joins are supported in Hive.

Notice that there might be multiple entries with a given user ID but different values for the followers_count, friends_count, and favourites_count columns. To avoid duplicate entries, we count only user_id from the tweets table.

We can rewrite the previous query as follows:

SELECT tweets.user_id, u.name, COUNT(*) AS cnt
FROM tweets
JOIN (SELECT user_id, name FROM user GROUP BY user_id, name) u
ON u.user_id = tweets.user_id
GROUP BY tweets.user_id, u.name
ORDER BY cnt DESC LIMIT 10;

Instead of directly joining the user table, we execute a subquery, as follows:

SELECT user_id, name FROM user GROUP BY user_id, name;

The subquery extracts unique user IDs and names. Note that Hive has limited support for subqueries, historically only permitting a subquery in the FROM clause of a SELECT statement. Hive 0.13 has added limited support for subqueries within the WHERE clause also.

HiveQL is an ever-evolving, rich language, a full exposition of which is beyond the scope of this chapter. A description of its query and DDL capabilities can be found at https://cwiki.apache.org/confluence/display/Hive/LanguageManual.


Structuring Hive tables for given workloads

Often Hive isn't used in isolation; instead, tables are created with particular workloads in mind, or queries are invoked in ways that are suitable for inclusion in automated processes. We'll now explore some of these scenarios.


Partitioning a table

With columnar file formats, we explained the benefits of excluding unneeded data as early as possible when processing a query. A similar concept has been used in SQL for some time: table partitioning.

When creating a partitioned table, a column is specified as the partition key. All values with that key are then stored together. In Hive's case, different subdirectories for each partition key are created under the table directory in the warehouse location on HDFS.

It's important to understand the cardinality of the partition column. With too few distinct values, the benefits are reduced as the files are still very large. If there are too many values, then queries might need a large number of files to be scanned to access all the required data. Perhaps the most common partition key is one based on date. We could, for example, partition our user table from earlier based on the created_at column, that is, the date the user was first registered. Note that since partitioning a table by definition affects its file structure, we create this table now as a non-external one, as follows:

CREATE TABLE partitioned_user (
    created_at string,
    user_id string,
    `location` string,
    name string,
    description string,
    followers_count bigint,
    friends_count bigint,
    favourites_count bigint,
    screen_name string,
    listed_count bigint
) PARTITIONED BY (created_at_date string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE;

To load data into a partition, we can explicitly give a value for the partition into which to insert the data, as follows:

INSERT INTO TABLE partitioned_user
PARTITION (created_at_date = '2014-01-01')
SELECT
    created_at,
    user_id,
    location,
    name,
    description,
    followers_count,
    friends_count,
    favourites_count,
    screen_name,
    listed_count
FROM user;

This is at best verbose, as we need a statement for each partition key value; if a single LOAD or INSERT statement contains data for multiple partitions, it just won't work. Hive also has a feature called dynamic partitioning, which can help us here. We set the following three variables:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=5000;

The first two statements enable all partitions (the nonstrict option) to be dynamic. The third one allows 5,000 distinct partitions to be created on each mapper and reducer node.

We can then simply use the name of the column to be used as the partition key, and Hive will insert data into partitions depending on the value of the key for a given row:

INSERT INTO TABLE partitioned_user
PARTITION (created_at_date)
SELECT
    created_at,
    user_id,
    location,
    name,
    description,
    followers_count,
    friends_count,
    favourites_count,
    screen_name,
    listed_count,
    to_date(created_at) as created_at_date
FROM user;

Even though we use only a single partition column here, we can partition a table by multiple column keys; just have them as a comma-separated list in the PARTITIONED BY clause.

Note that the partition key columns need to be included as the last columns in any statement being used to insert into a partitioned table. In the preceding code, we use Hive's to_date function to convert the created_at timestamp to a YYYY-MM-DD formatted string.

Partitioned data is stored in HDFS as /path/to/warehouse/<database>/<table>/key=<value>. In our example, the partitioned_user table structure will look like /user/hive/warehouse/default/partitioned_user/created_at_date=2014-04-01.

If data is added directly to the filesystem, for instance by some third-party processing tool or by hadoop fs -put, the metastore won't automatically detect the new partitions. The user will need to manually run an ALTER TABLE statement such as the following for each newly added partition:

ALTER TABLE <table_name> ADD PARTITION (<partition spec>) LOCATION '<location>';
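For example (a hypothetical date, following the directory layout described above), registering a directory that was added for 2014-04-02 might look like this:

ALTER TABLE partitioned_user ADD PARTITION (created_at_date = '2014-04-02')
LOCATION '/user/hive/warehouse/default/partitioned_user/created_at_date=2014-04-02';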

To add metadata for all partitions not currently present in the metastore, we can use the MSCK REPAIR TABLE <table_name>; statement. On EMR, this is equivalent to executing the following statement:

ALTER TABLE <table_name> RECOVER PARTITIONS;


Notice that both statements will also work with EXTERNAL tables. In the following chapter, we will see how this pattern can be exploited to create flexible and interoperable pipelines.

Overwriting and updating data

Partitioning is also useful when we need to update a portion of a table. Normally, a statement of the following form will replace all the data for the destination table:

INSERT OVERWRITE TABLE <table> …

If OVERWRITE is omitted, then each INSERT statement will add additional data to the table. Sometimes, this is desirable, but often, the source data being ingested into a Hive table is intended to fully update a subset of the data and keep the rest untouched.

If we perform an INSERT OVERWRITE statement (or a LOAD OVERWRITE statement) into a partition of a table, then only the specified partition will be affected. Thus, if we were inserting user data and only wanted to affect the partitions with data in the source file, we could achieve this by adding the OVERWRITE keyword to our previous INSERT statement.
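For instance (a sketch reusing the static-partition form shown earlier), overwriting just the 2014-01-01 partition while leaving all other partitions untouched could look like this:

INSERT OVERWRITE TABLE partitioned_user
PARTITION (created_at_date = '2014-01-01')
SELECT
    created_at,
    user_id,
    location,
    name,
    description,
    followers_count,
    friends_count,
    favourites_count,
    screen_name,
    listed_count
FROM user
WHERE to_date(created_at) = '2014-01-01';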

We can also add caveats to the SELECT statement. Say, for example, we only wanted to update data for a certain month:

INSERT INTO TABLE partitioned_user
PARTITION (created_at_date)
SELECT
    created_at,
    user_id,
    location,
    name,
    description,
    followers_count,
    friends_count,
    favourites_count,
    screen_name,
    listed_count,
    to_date(created_at) as created_at_date
FROM user
WHERE to_date(created_at) BETWEEN '2014-03-01' and '2014-03-31';

Bucketing and sorting

Partitioning a table is a construct that you take explicit advantage of by using the partition column (or columns) in the WHERE clause of queries against the tables. There is another mechanism called bucketing that can further segment how a table is stored and does so in a way that allows Hive itself to optimize its internal query plans to take advantage of the structure.

Let's create bucketed versions of our tweets and user tables; note the following additional CLUSTERED BY and SORTED BY clauses in the CREATE TABLE statements:

CREATE table bucketed_tweets (
    tweet_id string,
    text string,
    in_reply_to string,
    retweeted boolean,
    user_id string,
    place_id string
) PARTITIONED BY (created_at string)
CLUSTERED BY (user_ID) INTO 64 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE;

CREATE TABLE bucketed_user (
    user_id string,
    `location` string,
    name string,
    description string,
    followers_count bigint,
    friends_count bigint,
    favourites_count bigint,
    screen_name string,
    listed_count bigint
) PARTITIONED BY (created_at string)
CLUSTERED BY (user_ID) SORTED BY (name) INTO 64 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE;

Note that we changed the tweets table to also be partitioned; you can only bucket a table that is partitioned.

Just as we need to specify a partition column when inserting into a partitioned table, we must also take care to ensure that data inserted into a bucketed table is correctly clustered. We do this by setting the following flag before inserting the data into the table:

SET hive.enforce.bucketing=true;

Just as with partitioned tables, you cannot apply the bucketing function when using the LOAD DATA statement; if you wish to load external data into a bucketed table, first insert it into a temporary table, and then use the INSERT … SELECT … syntax to populate the bucketed table.
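As a sketch of that second step (treating the tweets table loaded earlier as the staging table, and assuming dynamic partitioning has been enabled as shown in the previous section), the population step could look like this:

SET hive.enforce.bucketing=true;

INSERT INTO TABLE bucketed_tweets
PARTITION (created_at)
SELECT
    tweet_id,
    text,
    in_reply_to,
    retweeted,
    user_id,
    place_id,
    to_date(created_at)
FROM tweets;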

When data is inserted into a bucketed table, rows are allocated to a bucket based on the result of a hash function applied to the column specified in the CLUSTERED BY clause.

One of the greatest advantages of bucketing a table comes when we need to join two tables that are similarly bucketed, as in the previous example. So, for example, any query of the following form would be vastly improved:

SET hive.optimize.bucketmapjoin=true;

SELECT …
FROM bucketed_user u JOIN bucketed_tweets t
ON u.user_id = t.user_id;

With the join being performed on the column used to bucket the table, Hive can optimize the amount of processing as it knows that each bucket contains the same set of user_id columns in both tables. While determining which rows against which to match, only those in the bucket need to be compared against, and not the whole table. This does require that the tables are both clustered on the same column and that the bucket numbers are either identical or one is a multiple of the other. In the latter case, with say one table clustered into 32 buckets and another into 64, the nature of the default hash function used to allocate data to a bucket means that the IDs in bucket 3 in the first table will cover those in both buckets 3 and 35 in the second.

Sampling data

Bucketing a table can also help while using Hive's ability to sample data in a table. Sampling allows a query to gather only a specified subset of the overall rows in the table. This is useful when you have an extremely large table with moderately consistent data patterns. In such a case, applying a query to a small fraction of the data will be much faster and will still give a broadly representative result. Note, of course, that this only applies to queries where you are looking to determine table characteristics, such as pattern ranges in the data; if you are trying to count anything, then the result needs to be scaled to the full table size.

For a non-bucketed table, you can sample in a mechanism similar to what we saw earlier by specifying that the query should only be applied to a certain subset of the table:

SELECT max(friends_count)
FROM user TABLESAMPLE(BUCKET 2 OUT OF 64 ON name);

In this query, Hive will effectively hash the rows in the table into 64 buckets based on the name column. It will then only use the second bucket for the query. Multiple buckets can be specified, and if RAND() is given as the ON clause, then the entire row is used by the bucketing function.

Though successful, this is highly inefficient, as the full table needs to be scanned to generate the required subset of data. If we sample on a bucketed table and ensure the number of buckets sampled is equal to or a multiple of the buckets in the table, then Hive will only read the buckets in question. For example:

SELECT MAX(friends_count)
FROM bucketed_user TABLESAMPLE(BUCKET 2 OUT OF 32 ON user_id);

In the preceding query against the bucketed_user table, which is created with 64 buckets on the user_id column, the sampling, since it is using the same column, will only read the required buckets. In this case, these will be buckets 2 and 34 from each partition.

A final form of sampling is block sampling. In this case, we can specify the required amount of the table to be sampled, and Hive will use an approximation of this by only reading enough source data blocks on HDFS to meet the required size. Currently, the data size can be specified as either a percentage of the table, as an absolute data size, or as a number of rows (in each block). The syntax for TABLESAMPLE is as follows, which will sample 0.5 percent of the table, 1 GB of data, or 100 rows per split, respectively:

TABLESAMPLE(0.5 PERCENT)
TABLESAMPLE(1G)
TABLESAMPLE(100 ROWS)

If these latter forms of sampling are of interest, then consult the documentation, as there are some specific limitations on the input formats and file formats that are supported.


Writing scripts

We can place Hive commands in a file and run them with the -f option in the hive CLI utility:

$ cat show_tables.hql
show tables;
$ hive -f show_tables.hql

We can parameterize HiveQL statements by means of the hiveconf mechanism. This allows us to specify an environment variable name at the point it is used rather than at the point of invocation. For example:

$ cat show_tables2.hql
show tables like '${hiveconf:TABLENAME}';
$ hive -hiveconf TABLENAME=user -f show_tables2.hql

The variable can also be set within the Hive script or in an interactive session:

SET TABLENAME='user';

The preceding hiveconf argument will add any new variables in the same namespace as the Hive configuration options. As of Hive 0.8, there is a similar option called hivevar that adds any user variables into a distinct namespace. Using hivevar, the preceding command would be as follows:

$ cat show_tables3.hql
show tables like '${hivevar:TABLENAME}';
$ hive -hivevar TABLENAME=user -f show_tables3.hql

Or we can write the command interactively:

SET hivevar:TABLENAME='user';


Hive and Amazon Web Services

With Elastic MapReduce as the AWS Hadoop-on-demand service, it is of course possible to run Hive on an EMR cluster. But it is also possible to use Amazon storage services, particularly S3, from any Hadoop cluster, be it within EMR or your own local cluster.


Hive and S3

As mentioned in Chapter 2, Storage, it is possible to specify a default filesystem other than HDFS for Hadoop, and S3 is one option. But it doesn't have to be an all-or-nothing thing; it is possible to have specific tables stored in S3. The data for these tables will be retrieved into the cluster to be processed, and any resulting data can either be written to a different S3 location (the same table cannot be the source and destination of a single query) or onto HDFS.

We can take a file of our tweet data and place it onto a location in S3 with a command such as the following:

$ aws s3 cp tweets.tsv s3://<bucket-name>/tweets/

We first need to specify the access key and secret access key that can access the bucket. This can be done in three ways:

Set fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey to the appropriate values in the Hive CLI
Set the same values in hive-site.xml, though note that this limits the use of S3 to a single set of credentials
Specify the table location explicitly in the table URL, that is, s3n://<access key>:<secret access key>@<bucket>/<path>
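The first option is just a pair of SET commands issued in the Hive session (the values shown are placeholders for your own credentials):

SET fs.s3n.awsAccessKeyId=<your access key>;
SET fs.s3n.awsSecretAccessKey=<your secret access key>;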

Then we can create a table referencing this data, as follows:

CREATE table remote_tweets (
    created_at string,
    tweet_id string,
    text string,
    in_reply_to string,
    retweeted boolean,
    user_id string,
    place_id string
) CLUSTERED BY (user_ID) INTO 64 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3n://<bucket-name>/tweets';

This can be an incredibly effective way of pulling S3 data into a local Hadoop cluster for processing.

Note

In order to use AWS credentials in the URI of an S3 location, regardless of how the parameters are passed, the secret and access keys must not contain /, +, =, or \ characters. If necessary, a new set of credentials can be generated from the IAM console at https://console.aws.amazon.com/iam/.

In theory, you can just leave the data in the external table and refer to it when needed to avoid WAN data transfer latencies (and costs), even though it often makes sense to pull the data into a local table and do future processing from there. If the table is partitioned, then you might find yourself retrieving a new partition each day, for example.


Hive on Elastic MapReduce

On one level, using Hive within Amazon Elastic MapReduce is just the same as everything discussed in this chapter. You can create a persistent cluster, log into the master node, and use the Hive CLI to create tables and submit queries. Doing all this will use the local storage on the EC2 instances for the table data.

Not surprisingly, jobs on EMR clusters can also refer to tables whose data is stored on S3 (or DynamoDB). And also not surprisingly, Amazon has made extensions to its version of Hive to make all this very seamless. It is quite simple from within an EMR job to pull data from a table stored in S3, process it, write any intermediate data to the EMR local storage, and then write the output results into S3, DynamoDB, or one of a growing list of other AWS services.

The pattern mentioned earlier, where new data is added to a new partition directory for a table each day, has proved very effective in S3; it is often the storage location of choice for large and incrementally growing datasets. There is a syntax difference when using EMR; instead of the MSCK command mentioned earlier, the command to update a Hive table with new data added to a partition directory is as follows:

ALTER TABLE <table-name> RECOVER PARTITIONS;

Consult the EMR documentation for the latest enhancements at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hive-additional-features.html. Also, consult the broader EMR documentation. In particular, the integration points with other AWS services are an area of rapid growth.


Extending HiveQL

The HiveQL language can be extended by means of plugins and third-party functions. In Hive, there are three types of functions, characterized by the number of rows they take as input and produce as output:

User Defined Functions (UDFs): simpler functions that act on one row at a time
User Defined Aggregate Functions (UDAFs): take multiple rows as input and generate a single output row. These are aggregate functions to be used in conjunction with a GROUP BY statement (similar to COUNT(), AVG(), MIN(), MAX(), and so on)
User Defined Table Functions (UDTFs): take a single row as input and generate multiple rows as output, a logical table that can be used in join expressions

Tip

These APIs are provided only in Java. For other languages, it is possible to stream data through a user-defined script using the TRANSFORM, MAP, and REDUCE clauses that act as a frontend to Hadoop's streaming capabilities.

Two APIs are available to write UDFs. A simple API, org.apache.hadoop.hive.ql.exec.UDF, can be used for functions that take and return basic writable types. A richer API, which provides support for data types other than writable, is available in the org.apache.hadoop.hive.ql.udf.generic.GenericUDF package. We'll now illustrate how org.apache.hadoop.hive.ql.exec.UDF can be used to implement a string-to-ID function similar to the one we used in Chapter 5, Iterative Computation with Spark, to map hashtags to integers in Pig. Building a UDF with this API only requires extending the UDF class and writing an evaluate() method, as follows:

package com.learninghadoop2.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class StringToInt extends UDF {
    public Integer evaluate(Text input) {
        if (input == null)
            return null;
        String str = input.toString();
        return str.hashCode();
    }
}

The function takes a Text object as input and maps it to an integer value with the hashCode() method. The source code of this function can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch7/udf/com/learninghadoop2/hive/udf/StringToInt.java.

Tip

As noted in Chapter 6, Data Analysis with Apache Pig, a more robust hash function should be used in production.


We compile the class and archive it into a JAR file, as follows:

$ javac -classpath $(hadoop classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/* com/learninghadoop2/hive/udf/StringToInt.java
$ jar cvf myudfs-hive.jar com/learninghadoop2/hive/udf/StringToInt.class

Before being able to use it, a UDF must be registered in Hive with the following commands:

ADD JAR myudfs-hive.jar;
CREATE TEMPORARY FUNCTION string_to_int AS 'com.learninghadoop2.hive.udf.StringToInt';

The ADD JAR statement adds a JAR file to the distributed cache. The CREATE TEMPORARY FUNCTION <function> AS <class> statement registers a function in Hive that implements a given Java class. The function will be dropped once the Hive session is closed. As of Hive 0.13, it is possible to create permanent functions, whose definition is kept in the metastore, using CREATE FUNCTION ….
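A sketch of the permanent form (the HDFS path below is an assumption; place the JAR wherever suits your cluster) might look like this:

CREATE FUNCTION string_to_int
AS 'com.learninghadoop2.hive.udf.StringToInt'
USING JAR 'hdfs:///user/hive/jars/myudfs-hive.jar';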

Once registered, StringToInt can be used in a query just like any other function. In the following example, we first extract a list of hashtags from the tweet's text by applying regexp_extract. Then, we use string_to_int to map each tag to a numerical ID:

SELECT unique_hashtags.hashtag, string_to_int(unique_hashtags.hashtag) AS tag_id FROM
(
    SELECT regexp_extract(text, '(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)') AS hashtag
    FROM tweets
    GROUP BY regexp_extract(text, '(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)')
) unique_hashtags
GROUP BY unique_hashtags.hashtag, string_to_int(unique_hashtags.hashtag);

Just as we did in the previous chapter, we can use the preceding query to create a lookup table:

CREATE TABLE lookuptable (tag string, tag_id bigint);

INSERT OVERWRITE TABLE lookuptable
SELECT unique_hashtags.hashtag, string_to_int(unique_hashtags.hashtag) AS tag_id
FROM
(
    SELECT regexp_extract(text, '(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)') AS hashtag
    FROM tweets
    GROUP BY regexp_extract(text, '(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)')
) unique_hashtags
GROUP BY unique_hashtags.hashtag, string_to_int(unique_hashtags.hashtag);


Programmatic interfaces

In addition to the hive and beeline command-line tools, it is possible to submit HiveQL queries to the system via the JDBC and Thrift programmatic interfaces. Support for ODBC was bundled in older versions of Hive, but as of Hive 0.12, it needs to be built from scratch. More information on this process can be found at https://cwiki.apache.org/confluence/display/Hive/HiveODBC.


JDBC

A Hive client written using JDBC APIs looks exactly the same as a client program written for other database systems (for example, MySQL). The following is a sample Hive client program using JDBC APIs. The source code for this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch7/clients/com/learninghadoop2/hive/client/HiveJdbcClient.java.

package com.learninghadoop2.hive.client;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class HiveJdbcClient {
    private static String driverName = "org.apache.hive.jdbc.HiveDriver";
    // connection string
    public static String URL = "jdbc:hive2://localhost:10000";
    // Show all tables in the default database
    public static String QUERY = "show tables";

    public static void main(String[] args) throws SQLException {
        try {
            Class.forName(driverName);
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
            System.exit(1);
        }
        Connection con = DriverManager.getConnection(URL);
        Statement stmt = con.createStatement();
        ResultSet resultSet = stmt.executeQuery(QUERY);
        while (resultSet.next()) {
            System.out.println(resultSet.getString(1));
        }
    }
}

The URL part is the JDBC URI that describes the connection endpoint. The format for establishing a remote connection is jdbc:hive2://<host>:<port>/<database>. Connections in embedded mode can be established by not specifying a host or port, as in jdbc:hive2://.

hive and hive2 are the drivers to be used when connecting to HiveServer and HiveServer2, respectively. QUERY contains the HiveQL query to be executed.

Tip

Hive's JDBC interface exposes only the default database. In order to access other databases, you need to reference them explicitly in the underlying queries using the <database>.<table> notation.

First, we load the HiveServer2 JDBC driver, org.apache.hive.jdbc.HiveDriver.

Tip


Use org.apache.hadoop.hive.jdbc.HiveDriver to connect to HiveServer.

Then, like with any other JDBC program, we establish a connection to URL and use it to instantiate a Statement class. We execute QUERY, with no authentication, and store the output dataset into the ResultSet object. Finally, we scan resultSet and print its content to the command line.

Compile and execute the example with the following commands:

$ javac HiveJdbcClient.java
$ java -cp $(hadoop classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/*:/opt/cloudera/parcels/CDH/lib/hive/lib/hive-jdbc.jar: com.learninghadoop2.hive.client.HiveJdbcClient


Thrift

Thrift provides lower-level access to Hive and has a number of advantages over the JDBC implementation of HiveServer. Primarily, it allows multiple connections from the same client, and it allows programming languages other than Java to be used with ease. With HiveServer2, it is a less commonly used option, but still worth mentioning for compatibility. A sample Thrift client implemented using the Java API can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch7/clients/com/learninghadoop2/hive/client/HiveThriftClient.java. This client can be used to connect to HiveServer, but due to protocol differences, it won't work with HiveServer2.

In the example, we define a getClient() method that takes as input the host and port of a HiveServer service and returns an instance of org.apache.hadoop.hive.service.ThriftHive.Client.

A client is obtained by first instantiating a socket connection, org.apache.thrift.transport.TSocket, to the HiveServer service, and by specifying a protocol, org.apache.thrift.protocol.TBinaryProtocol, to serialize and transmit data, as follows:

TSocket transport = new TSocket(host, port);
transport.setTimeout(TIMEOUT);
transport.open();

TBinaryProtocol protocol = new TBinaryProtocol(transport);
client = new ThriftHive.Client(protocol);

We call getClient() from the main method and use the client to execute a query against an instance of HiveServer running on localhost on port 11111, as follows:

public static void main(String[] args) throws Exception {
    Client client = getClient("localhost", 11111);
    client.execute("show tables");
    List<String> results = client.fetchAll();
    for (String result : results) {
        System.out.println(result);
    }
}

Make sure that HiveServer is running on port 11111, and if not, start an instance with the following command:

$ sudo hive --service hiveserver -p 11111

Compile and execute the HiveThriftClient.java example with:

$ javac -cp $(hadoop classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/* com/learninghadoop2/hive/client/HiveThriftClient.java
$ java -cp $(hadoop classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/*: com.learninghadoop2.hive.client.HiveThriftClient


Stinger initiative

Hive has remained very successful and capable since its earliest releases, particularly in its ability to provide SQL-like processing on enormous datasets. But other technologies did not stand still, and Hive acquired a reputation of being relatively slow, particularly in regard to lengthy startup times on large jobs and its inability to give quick responses to conceptually simple queries.

These perceived limitations were less due to Hive itself and more a consequence of how the translation of SQL queries into the MapReduce model carries much built-in inefficiency when compared to other ways of implementing a SQL query. Particularly in regard to very large datasets, MapReduce saw lots of I/O (and consequently time) spent writing out the results of one MapReduce job just to have them read by another. As discussed in Chapter 3, Processing – MapReduce and Beyond, this is a major driver in the design of Tez, which can schedule jobs on a Hadoop cluster as a graph of tasks that does not require inefficient writes and reads between them.

Consider the following query, which we will run on both the MapReduce framework and Tez:

SELECT a.country, COUNT(b.place_id) FROM place a JOIN tweets b ON (a.place_id = b.place_id) GROUP BY a.country;

The following figure contrasts the execution plan for the preceding query on the MapReduce framework versus Tez:

Hive on MapReduce versus Tez

In plain MapReduce, two jobs are created for the GROUP BY and JOIN clauses. The first job is composed of a set of MapReduce tasks that read data from the disk to carry out grouping. The reducers write intermediate results to the disk so that output can be synchronized. The mappers in the second job read the intermediate results from the disk as well as data from table b. The combined dataset is then passed to the reducer, where shared keys are joined. Were we to execute an ORDER BY statement, this would have resulted in a third job and further MapReduce passes. The same query is executed on Tez as a single job by a single set of Map tasks that read data from the disk. I/O, grouping, and joining are pipelined across reducers.

Alongside these architectural limitations, there were quite a few areas around SQL language support that could also provide better efficiency, and in early 2013, the Stinger initiative was launched with an explicit goal of making Hive over 100 times as fast and with much richer SQL support. Hive 0.13 has all the features of the three phases of Stinger, resulting in a much more complete SQL dialect. Tez is also offered as an execution framework in addition to a MapReduce-based implementation atop YARN, which is more efficient than previous implementations on Hadoop 1 MapReduce.

With Tez as the execution engine, Hive is no longer limited to a series of linear MapReduce jobs and can instead build a processing graph where any given step can, for example, stream results to multiple sub-steps.

To take advantage of the Tez framework, there is a new hive variable setting:

set hive.execution.engine=tez;

This setting relies on Tez being installed on the cluster; it is available in source form from http://tez.apache.org or in several distributions, though at the time of writing, not Cloudera.

The alternative value is mr, which uses the classic MapReduce model (atop YARN), so it is possible in a single installation to compare the performance of Hive with and without Tez.
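As a rough comparison sketch (assuming Tez is installed and the place and tweets tables exist), the same query can be timed under both engines from the command line; hive -e accepts an inline HiveQL string:

$ hive -e "set hive.execution.engine=tez; SELECT a.country, COUNT(b.place_id) FROM place a JOIN tweets b ON (a.place_id = b.place_id) GROUP BY a.country;"
$ hive -e "set hive.execution.engine=mr; SELECT a.country, COUNT(b.place_id) FROM place a JOIN tweets b ON (a.place_id = b.place_id) GROUP BY a.country;"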


Impala

Hive is not the only product providing SQL-on-Hadoop capability. The second most widely used is likely Impala, announced in late 2012 and released in spring 2013. Though originally developed internally within Cloudera, its source code is periodically pushed to an open source Git repository (https://github.com/cloudera/impala).

Impala was created out of the same perception of Hive's weaknesses that led to the Stinger initiative.

Impala also took some inspiration from Google Dremel (http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf), which was first openly described in a paper published in 2009. Dremel was built at Google to address the gap between the need for very fast queries on very large datasets and the high latency inherent in the existing MapReduce model underpinning Hive at the time. Dremel was a sophisticated approach to this problem that, rather than building mitigations atop MapReduce such as those implemented by Hive, instead created a new service that accessed the same data stored in HDFS. Dremel also benefited from significant work to optimize the storage format of the data in a way that made it more amenable to very fast analytic queries.


The architecture of Impala

The basic architecture has three main components: the Impala daemons, the statestore, and the clients. Recent versions have added additional components that improve the service, but we'll focus on the high-level architecture.

The Impala daemon (impalad) should be run on each host where a DataNode process is managing HDFS data. Note that impalad does not access the filesystem blocks through the full HDFS FileSystem API; instead, it uses a feature called short-circuit reads to make data access more efficient.

When a client submits a query, it can do so to any of the running impalad processes, and this one will become the coordinator for the execution of that query. The key aspect of Impala's performance is that for each query, it generates custom native code, which is then pushed to and executed by all the impalad processes on the system. This highly optimized code performs the query on the local data, and each impalad then returns its subset of the result set to the coordinator node, which performs the final data consolidation to produce the final result. This type of architecture should be familiar to anyone who has worked with any of the (usually commercial and expensive) Massively Parallel Processing (MPP) data warehouse solutions available today; MPP is the term used for this type of shared scale-out architecture. As the cluster runs, the statestore daemon ensures that each impalad process is aware of all the others and provides a view of the overall cluster health.


Co-existing with Hive

Impala, as a newer product, tends to have a more restricted set of SQL data types and supports a more constrained dialect of SQL than Hive. It is, however, expanding this support with each new release. Refer to the Impala documentation (http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/Impala/impala.html) to get an overview of the current level of support.

Impala supports the Hive metastore mechanism used by Hive to persistently store the metadata surrounding its table structure and storage. This means that on a cluster with an existing Hive setup, it should be immediately possible to use Impala, as it will access the same metastore and therefore provide access to the same tables available in Hive.
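For example (a sketch assuming the impala-shell client is installed and an impalad is running locally), tables defined through Hive become queryable from Impala once its metadata cache has been refreshed:

$ impala-shell -i localhost -q "INVALIDATE METADATA"
$ impala-shell -i localhost -q "SELECT COUNT(*) FROM tweets"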

But be warned that the differences in SQL dialect and data types might cause unexpected results when working in a combined Hive and Impala environment. Some queries might work on one but not the other, they might show very different performance characteristics (more on this later), or they might actually give different results. This last point might become apparent when using data types such as float and double that are simply treated differently in the underlying systems (Hive is implemented in Java while Impala is written in C++).

As of version 1.2, Impala supports UDFs written in both C++ and Java, although C++ is strongly recommended as a much faster solution. Keep this in mind if you are looking to share custom functions between Hive and Impala.


A different philosophy

When Impala was first released, its greatest benefit was in how it truly enabled what is often called speed-of-thought analysis. Queries could be returned sufficiently fast that an analyst could explore a thread of analysis in a completely interactive fashion without having to wait for minutes at a time for each query to complete. It's fair to say that most adopters of Impala were at times stunned by its performance, especially when compared to the version of Hive shipping at the time.

The Impala focus has remained mostly on these shorter queries, and this does impose some limitations on the system. Impala tends to be quite memory-heavy, as it relies on in-memory processing to achieve much of its performance. If a query requires a dataset to be held in memory that is larger than what is available on the executing node, then that query will simply fail in versions of Impala before 2.0.

Comparing the work on Stinger to Impala, it could be argued that Impala has a much stronger focus on excelling at the shorter (and arguably more common) queries that support interactive data analysis. Many business intelligence tools and services are now certified to run directly on Impala. The Stinger initiative has put less effort into making Hive just as fast in the area where Impala excels but has instead improved Hive (to varying degrees) for all workloads. Impala is still developing at a fast pace and Stinger has put additional momentum into Hive, so it is most likely wise to consider both products and determine which best meets the performance and functionality requirements of your projects and workflows.

It should also be kept in mind that there are competitive commercial pressures shaping the direction of Impala and Hive. Impala was created and is still driven by Cloudera, the most popular vendor of Hadoop distributions. The Stinger initiative, though contributed to by many companies as diverse as Microsoft (yes, really!) and Intel, was led by Hortonworks, probably the second largest vendor of Hadoop distributions. The fact is that if you are using the Cloudera distribution of Hadoop, then some of the core features of Hive might be slower to arrive, whereas Impala will always be up to date. Conversely, if you use another distribution, you might get the latest Hive release, but that might either come with an older Impala or, as is currently the case, you might have to download and install it yourself.

A similar situation has arisen with the Parquet and ORC file formats mentioned earlier. Parquet is preferred by Impala and developed by a group of companies led by Cloudera, while ORC is preferred by Hive and is championed by Hortonworks.

Unfortunately, the reality is that Parquet support is often very quick to arrive in the Cloudera distribution but less so in, say, the Hortonworks distribution, where the ORC file format is preferred.

These themes are a little concerning since, although competition in this space is a good thing, and arguably the announcement of Impala helped energize the Hive community, there is a greater risk that your choice of distribution might have a larger impact on the tools and file formats that will be fully supported, unlike in the past. Hopefully, the current situation is just an artifact of where we are in the development cycles of all these new and improved technologies, but do consider your choice of distribution carefully in relation to your SQL-on-Hadoop needs.


Drill, Tajo, and beyond

You should also consider that SQL on Hadoop no longer refers only to Hive or Impala. Apache Drill (http://drill.apache.org) is a fuller implementation of the Dremel model first described by Google. Although Impala implements the Dremel architecture across HDFS data, Drill looks to provide similar functionality across multiple data sources. It is still in its early stages, but if your needs are broader than what Hive or Impala provides, it might be worth considering.

Tajo (http://tajo.apache.org) is another Apache project that seeks to be a full data warehouse system on Hadoop data. With an architecture similar to that of Impala, it offers a much richer system with components such as multiple optimizers and ETL tools that are commonplace in traditional data warehouses but less frequently bundled in the Hadoop world. It has a much smaller user base, but has been used by certain companies very successfully for a significant length of time, and might be worth considering if you need a fuller data warehousing solution.

Other products are also emerging in this space, and it's a good idea to do some research. Hive and Impala are awesome tools, but if you find that they don't meet your needs, then look around; something else might.


Summary

In its early days, Hadoop was sometimes erroneously seen as the latest supposed relational database killer. Over time, it has become more apparent that the more sensible approach is to view it as a complement to RDBMS technologies and that, in fact, the RDBMS community has developed tools, such as SQL, that are also valuable in the Hadoop world.

HiveQL is an implementation of SQL on Hadoop and was the primary focus of this chapter. In regard to HiveQL and its implementations, we covered the following topics:

- How HiveQL provides a logical model atop data stored in HDFS, in contrast to relational databases where the table structure is enforced in advance
- How HiveQL supports many standard SQL data types and commands, including joins and views
- The ETL-like features offered by HiveQL, including the ability to import data into tables and optimize the table structure through partitioning and similar mechanisms
- How HiveQL offers the ability to extend its core set of operators with user-defined code, and how this contrasts with the Pig UDF mechanism
- The recent history of Hive developments, such as the Stinger initiative, that have seen Hive transition to an updated implementation that uses Tez
- The broader ecosystem around HiveQL that now includes products such as Impala, Tajo, and Drill, and how each of these focuses on specific areas in which to excel

With Pig and Hive, we've introduced alternative models to process MapReduce data, but so far we've not looked at another question: what approaches and tools are required to actually allow this massive dataset being collected in Hadoop to remain useful and manageable over time? In the next chapter, we'll take a slight step up the abstraction hierarchy and look at how to manage the lifecycle of this enormous data asset.


Chapter 8. Data Lifecycle Management

Our previous chapters were quite technology focused, describing particular tools or techniques and how they can be used. In this and the next chapter, we are going to take a more top-down approach whereby we will describe a problem space you are likely to encounter and then explore how to address it. In particular, we'll cover the following topics:

- What we mean by the term data lifecycle management
- Why data lifecycle management is something to think about
- The categories of tools that can be used to address the problem
- How to use these tools to build the first half of a Twitter sentiment analysis pipeline


What data lifecycle management is

Data doesn't exist only at a point in time. Particularly for long-running production workflows, you are likely to acquire a significant quantity of data in a Hadoop cluster. Requirements rarely stay static for long, so alongside new logic you might also see the format of that data change or require multiple data sources to be used to provide the dataset processed in your application. We use the term data lifecycle management to describe an approach to handling the collection, storage, and transformation of data that ensures that data is where it needs to be, in the format it needs to be in, in a way that allows data and system evolution over time.


Importance of data lifecycle management

If you build data processing applications, you are by definition reliant on the data that is processed. Just as we consider the reliability of applications and systems, it becomes necessary to ensure that the data is also production-ready.

Data at some point needs to be ingested into Hadoop. Hadoop is one part of an enterprise and often has multiple points of integration with external systems. If the ingest of data coming from those systems is not reliable, then the impact on the jobs that process that data is often as disruptive as a major system failure. Data ingest becomes a critical component in its own right. And when we say the ingest needs to be reliable, we don't just mean that data is arriving; it also has to be arriving in a format that is usable and through a mechanism that can handle evolution over time.

The problem with many of these issues is that they do not arise in a significant fashion until the flows are large, the system is critical, and the business impact of any problems is non-trivial. Ad hoc approaches that worked for a less critical data flow often will simply not scale, and will be very painful to replace on a live system.


Tools to help

But don't panic! There are a number of categories of tools that can help with the data lifecycle management problem. We'll give examples of the following three broad categories in this chapter:

- Orchestration services: building an ingest pipeline usually has multiple discrete stages, and we will use an orchestration tool to allow these to be described, executed, and managed
- Connectors: given the importance of integration with external systems, we will look at how we can use connectors to simplify the abstractions provided by Hadoop storage
- File formats: how we store the data impacts how we manage format evolution over time, and several rich storage formats have ways of supporting this


Building a tweet analysis capability

In earlier chapters, we used various implementations of Twitter data analysis to describe several concepts. We will take this capability to a deeper level and approach it as a major case study.

In this chapter, we will build a data ingest pipeline, constructing a production-ready dataflow that is designed with reliability and future evolution in mind.

We'll build out the pipeline incrementally throughout the chapter. At each stage, we'll highlight what has changed but can't include full listings at each stage without trebling the size of the chapter. The source code for this chapter, however, has every iteration in its full glory.


Getting the tweet data

The first thing we need to do is get the actual tweet data. As in previous examples, we can pass the -j and -n arguments to stream.py to dump JSON tweets to stdout:

$ stream.py -j -n 10000 > tweets.json

Since we have this tool that can create a batch of sample tweets on demand, we could start our ingest pipeline by having this job run on a periodic basis. But how?


Introducing Oozie

We could, of course, bang rocks together and use something like cron for simple job scheduling, but recall that we want an ingest pipeline that is built with reliability in mind. So, we really want a scheduling tool that we can use to detect failures and otherwise respond to exceptional situations.

The tool we will use here is Oozie (http://oozie.apache.org), a workflow engine and scheduler built with a focus on the Hadoop ecosystem.

Oozie provides a means to define a workflow as a series of nodes with configurable parameters and controlled transition from one node to the next. It is installed as part of the Cloudera QuickStart VM, and the main command-line client is, not surprisingly, called oozie.

Note

We've tested the workflows in this chapter against version 5.0 of the Cloudera QuickStart VM; at the time of writing, the Oozie bundled in the latest version, 5.1, has some issues. There's nothing particularly version-specific in our workflows, however, so they should be compatible with any correctly working Oozie v4 implementation.

Though powerful and flexible, Oozie can take a little getting used to, so we'll give some examples and describe what we are doing along the way.

The most common node in an Oozie workflow is an action. It is within action nodes that the steps of the workflow are actually executed; the other node types handle management of the workflow in terms of decisions, parallelism, and failure detection. Oozie has multiple types of actions that it can perform. One of these is the shell action, which can be used to execute any command on the system, such as native binaries, shell scripts, or any other command-line utility. Let's create a script to generate a file of tweets and copy this to HDFS:

set -e
source twitter.keys
python stream.py -j -n 500 > /tmp/tweets.out
hdfs dfs -put /tmp/tweets.out /tmp/tweets/tweets.out
rm -f /tmp/tweets.out

Note that the first line will cause the entire script to fail should any of the included commands fail. We use an environment file, twitter.keys, to provide the Twitter keys to our script; it is of the following form:

export TWITTER_CONSUMER_KEY=<value>
export TWITTER_CONSUMER_SECRET=<value>
export TWITTER_ACCESS_KEY=<value>
export TWITTER_ACCESS_SECRET=<value>

Oozie uses XML to describe its workflows, usually stored in a file called workflow.xml. Let's walk through the definition for an Oozie workflow that calls a shell command.


The schema for an Oozie workflow is called workflow-app, and we can give the workflow a specific name. This is useful when viewing job history in the CLI or the Oozie web UI. In the examples in this book, we'll use an increasing version number to allow us to more easily separate the iterations within the source repository. This is how we give the workflow-app a specific name:

<workflow-app xmlns="uri:oozie:workflow:0.4" name="v1">

Oozie workflows are made up of a series of connected nodes, each of which represents a step in the process, and which are represented by XML nodes in the workflow definition. Oozie has a number of nodes that deal with the transition of the workflow from one step to the next. The first of these is the start node, which simply states the name of the first node to be executed as part of the workflow, as follows:

<start to="fs-node"/>

We then have the definition for the named start node. In this case, it is an action node, which is the generic node type for most Oozie nodes that actually perform some processing, as follows:

<action name="fs-node">

Action is a broad category of nodes, and we will typically specialize it with the particular processing for the given node. In this case, we are using the fs node type, which allows us to perform filesystem operations:

<fs>

We want to ensure that the directory on HDFS to which we wish to copy the file of tweet data exists, is empty, and has suitable permissions. We do this by trying to delete the directory if it exists, then creating it, and finally applying the required permissions, as follows:

<delete path="${nameNode}/tmp/tweets"/>
<mkdir path="${nameNode}/tmp/tweets"/>
<chmod path="${nameNode}/tmp/tweets" permissions="777"/>
</fs>

We'll see an alternative way of setting up directories later. After performing the functionality of the node, Oozie needs to know how to proceed with the workflow. In most cases, this will comprise moving to another action node if this node was successful and aborting the workflow otherwise. This is specified by the next elements. The ok element gives the name of the node to which to transition if the execution was successful; the error element names the destination node for failure scenarios. Here's how the ok and error elements are used:

<ok to="shell-node"/>
<error to="fail"/>
</action>

<action name="shell-node">

The second action node is again specialized with its specific processing type; in this case, we have a shell node:

<shell xmlns="uri:oozie:shell-action:0.2">

The shell action then has the Hadoop JobTracker and NameNode locations specified. Note that the actual values are given by variables; we'll explain where they come from later. The JobTracker and NameNode are specified as follows:

<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>

As mentioned in Chapter 3, Processing – MapReduce and Beyond, MapReduce uses multiple queues to provide support for different approaches to resource scheduling. The next element specifies the MapReduce queue to which the workflow should be submitted:

<configuration>
  <property>
    <name>mapred.job.queue.name</name>
    <value>${queueName}</value>
  </property>
</configuration>

Now that the shell node is fully configured, we can specify the command to invoke, again via a variable, as follows:

<exec>${EXEC}</exec>

The various steps of Oozie workflows are executed as MapReduce jobs. This shell action will therefore be executed as a specific task instance on a particular TaskTracker, so we need to specify which files need to be copied to the local working directory on the TaskTracker machine before the action can be performed. In this case, we need to copy the main shell script, the Python tweet generator, and the Twitter config file, as follows:

<file>${workflowRoot}/${EXEC}</file>
<file>${workflowRoot}/twitter.keys</file>
<file>${workflowRoot}/stream.py</file>

After closing the shell element, we again specify what to do depending on whether the action completed successfully or not. Because MapReduce is used for job execution, the majority of node types by definition have built-in retry and recovery logic, though this is not the case for shell nodes:

</shell>
<ok to="end"/>
<error to="fail"/>
</action>

If the workflow fails, let's just kill it in this case. The kill node type does exactly that: it stops the workflow from proceeding to any further steps, usually logging error messages along the way. Here's how the kill node type is used:

<killname="fail">

<message>Shellactionfailed,error

message[${wf:errorMessage(wf:lastErrorNode())}]</message>

Page 326: index-of.co.ukindex-of.co.uk/Big-Data-Technologies/Learning Hadoop 2 - Garry... · Table of Contents Learning Hadoop 2 Credits About the Authors About the Reviewers Support files,

</kill>

TheendnodeontheotherhandsimplyhaltstheworkflowandlogsitasasuccessfulcompletionwithinOozie:

<endname="end"/>

</workflow-app>

The obvious question is what the preceding variables represent and from where they get their concrete values. These variables are examples of the Oozie Expression Language, often referred to as EL.

Alongside the workflow definition file (workflow.xml), which describes the steps in the flow, we also need to create a configuration file that gives the specific values for a given execution of the workflow. This separation of functionality and configuration allows us to write workflows that can be used on different clusters, on different file locations, or with different variable values without having to recreate the workflow itself. By convention, this file is usually named job.properties. For the preceding workflow, here's a sample job.properties file.

Firstly, we specify the location of the JobTracker, the NameNode, and the MapReduce queue to which to submit the workflow. The following should work on the Cloudera 5.0 QuickStart VM, though in v5.1 the hostname has been changed to quickstart.cloudera. The important thing is that the specified NameNode and JobTracker addresses need to be in the Oozie whitelist; the local services on the VM are added automatically:

jobTracker=localhost.localdomain:8032
nameNode=hdfs://localhost.localdomain:8020
queueName=default

Next, we set some values for where the workflow definitions and associated files can be found on the HDFS filesystem. Note the use of a variable representing the username running the job. This allows a single workflow to be applied to different paths depending on the submitting user, as follows:

tasksRoot=book
workflowRoot=${nameNode}/user/${user.name}/${tasksRoot}/v1
oozie.wf.application.path=${nameNode}/user/${user.name}/${tasksRoot}/v1

Next, we name the command to be executed in the workflow as ${EXEC}:

EXEC=gettweets.sh

More complex workflows will require additional entries in the job.properties file; the preceding workflow is as simple as it gets.

The oozie command-line tool needs to know where the Oozie server is running. This can be added as an argument to every Oozie shell command, but that gets unwieldy very quickly. Instead, you can set a shell environment variable, as follows:

$ export OOZIE_URL='http://localhost:11000/oozie'

After all that work, we can now actually run an Oozie workflow. Create a directory on HDFS as specified in the values in the job.properties file. With the preceding configuration, we'd be creating this as book/v1 under our home directory on HDFS. Copy the stream.py, gettweets.sh, and twitter.keys files to that directory; these are the files required to perform the actual execution of the shell command. Then, add the workflow.xml file to the same directory.
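A minimal sketch of that setup, assuming the workflow files sit in a local directory named v1 and that book/v1 matches the workflowRoot in job.properties, might look like this:

$ hdfs dfs -mkdir -p book/v1
$ hdfs dfs -put v1/workflow.xml v1/gettweets.sh v1/stream.py v1/twitter.keys book/v1/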

To run the workflow, we then do the following:

$ oozie job -run -config <path-to-job.properties>

If submitted successfully, Oozie will print the job name to the screen. You can see the current status of this workflow with:

$ oozie job -info <job-id>

You can also check the logs for the job:

$ oozie job -log <job-id>

In addition, all current and recent jobs can be viewed with:

$ oozie jobs

A note on HDFS file permissions

There is a subtle aspect of the shell command that can catch the unwary. As an alternative to having the fs node, we could instead include a preparation element within the shell node to create the directory we need on the filesystem. It would look like the following:

<prepare>
  <mkdir path="${nameNode}/tmp/tweets"/>
</prepare>

The prepare stage is executed by the user who submitted the workflow, but since the actual script execution is performed on YARN, it is usually executed as the yarn user. You might hit a problem where the script generates the tweets and the /tmp/tweets directory is created on HDFS, but the script then does not have permission to write to that directory. You can either resolve this by assigning permissions more precisely or, as shown earlier, add a filesystem node to encapsulate the needed operations. We'll use a mixture of both techniques in this chapter; for non-shell nodes, we'll use prepare elements, particularly if the needed directory is manipulated only by that node. For cases where a shell node is involved or where the created directories will be used across multiple nodes, we'll be safe and use the more explicit fs node.

Making development a little easier

It can sometimes get awkward to manage the files and resources for an Oozie job during development. Some need to be on HDFS, while some need to be local, and changes to some files require changes to others. The easiest approach is often to develop or make changes in a complete clone of the workflow directory on the local filesystem and push changes from there to the similarly named directory in HDFS, not forgetting, of course, to ensure that all changes are under revision control! For operational execution of the workflow, the job.properties file is the only thing that needs to be on the local filesystem and, conversely, all the other files need to be on HDFS. Always remember this: it's all too easy to make changes to a local copy of a workflow, forget to push the changes to HDFS, and then be confused as to why the workflow isn't reflecting the changes.
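One way to make that push step harder to forget is to script it. The following sketch (the local directory name v1 and the HDFS target are assumptions matching the layout used earlier) simply replaces the workflow directory on HDFS with the local copy:

$ hdfs dfs -rm -r -f book/v1
$ hdfs dfs -put v1 book/v1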

Extracting data and ingesting into Hive

With our data on HDFS, we can now extract the separate datasets for tweets, users, and place data, as in previous chapters. We can reuse extract_for_hive.pig to parse the raw tweet JSON into separate files, store them again on HDFS, and then follow up with a Hive step that ingests these new files into Hive tables for tweets, users, and places.

To do this within Oozie, we'll need to add two new nodes to our workflow: a Pig action for the first step and a Hive action for the second.

For our Hive action, we'll just create three external tables that point to the files generated by Pig. This would then allow us to follow our previously described model of ingesting into temporary or external tables and using HiveQL INSERT statements from there to insert into the operational, and often partitioned, tables. This create.hql script can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch8/v2/hive/create.hql but is simply of the following form:

CREATE DATABASE IF NOT EXISTS twttr;
USE twttr;

DROP TABLE IF EXISTS tweets;
CREATE EXTERNAL TABLE tweets (
...
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE
LOCATION '${ingestDir}/tweets';

DROP TABLE IF EXISTS user;
CREATE EXTERNAL TABLE user (
...
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE
LOCATION '${ingestDir}/users';

DROP TABLE IF EXISTS place;
CREATE EXTERNAL TABLE place (
...
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE
LOCATION '${ingestDir}/places';

Note that the file separator on each table is also explicitly set to match what we are outputting from Pig. In addition to this, locations in both scripts are specified by variables for which we will provide concrete values in our job.properties file.

With the preceding statements in place, we can create the Pig node for our workflow, found in the source code as v2 of the pipeline. Much of the node definition looks similar to the shell node used previously, as we set the same configuration elements; also notice our use of the prepare element to create the needed output directory. The Pig node is defined as shown in the following action:

<actionname="pig-node">

<pig>

<job-tracker>${jobTracker}</job-tracker>

<name-node>${nameNode}</name-node>

<prepare>

<deletepath="${nameNode}/${outputDir}"/>

<mkdirpath="${nameNode}/${outputDir}"/>

</prepare>

<configuration>

<property>

<name>mapred.job.queue.name</name>

<value>${queueName}</value>

</property>

</configuration>

Similarlyaswiththeshellcommand,weneedtotellthePigactionthelocationoftheactualPigscript.Thisisspecifiedinthefollowingscriptelement:

<script>${workflowRoot}/pig/extract_for_hive.pig</script>

WealsoneedtomodifythecommandlineusedtoinvokethePigscripttoaddseveralparameters.Thefollowingelementsdothis;notetheconstructionpatternwhereinoneelementaddstheactualparameternameandthenextitsvalue(we’llseeanalternativemechanismforpassingargumentsinthenextsection):

<argument>-param</argument>

<argument>inputDir=${inputDir}</argument>

<argument>-param</argument>

<argument>outputDir=${outputDir}</argument>

</pig>

BecausewewanttomovefromthissteptotheHivenode,weneedtosetthefollowingelementsappropriately:

<okto="hive-node"/>

<errorto="fail"/>

</action>

The Hive action itself is a little different than the previous nodes; even though it starts in a similar fashion, it specifies the Hive action-specific namespace, as follows:

<action name="hive-node">
  <hive xmlns="uri:oozie:hive-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>

The Hive action needs many of the configuration elements used by Hive itself and, in most cases, we copy the hive-site.xml file into the workflow directory and specify its location, as shown in the following XML; note that this mechanism is not Hive-specific and can also be used for custom actions:

<job-xml>${workflowRoot}/hive-site.xml</job-xml>

In addition, we might need to override some MapReduce default configuration properties, as shown in the following XML, where we specify that intermediate compression should be used for our job:

<configuration>
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
</configuration>

After configuring the Hive environment, we now specify the location of the Hive script:

<script>${workflowRoot}/hive/create.hql</script>

We also have to provide the mechanism to pass arguments to the Hive script. But instead of building out the command line one component at a time, we'll add param elements that map the name of a configuration element in the job.properties file to variables specified in the Hive script; this mechanism is also supported with Pig actions:

<param>dbName=${dbName}</param>
<param>ingestDir=${ingestDir}</param>
</hive>

The Hive node then closes as the others do, as follows:

<ok to="end"/>
<error to="fail"/>
</action>

We now need to put all this together to run the multistage workflow in Oozie. The full workflow.xml file can be found at https://github.com/learninghadoop2/book-examples/tree/master/ch8/v2 and the workflow is visualized in the following diagram:

Data ingestion workflow v2


This workflow performs all the steps discussed before; it generates tweet data, extracts subsets of data via Pig, and then ingests these into Hive.

A note on workflow directory structure

We now have quite a few files in our workflow directory, and it is best to adopt some structure and naming conventions. For the current workflow, our directory on HDFS looks like the following:

/hive/
/hive/create.hql
/lib/
/pig/
/pig/extract_for_hive.pig
/scripts/
/scripts/gettweets.sh
/scripts/stream-json-batch.py
/scripts/twitter-keys
/hive-site.xml
/job.properties
/workflow.xml

The model we follow is to keep configuration files in the top-level directory but to keep files related to a given action type in dedicated subdirectories. Note that it is useful to have a lib directory even if it is empty, as some node types look for it.

With the preceding structure, the job.properties file for our combined job is now the following:

jobTracker=localhost.localdomain:8032
nameNode=hdfs://localhost.localdomain:8020
queueName=default
tasksRoot=book
workflowRoot=${nameNode}/user/${user.name}/${tasksRoot}/v2
oozie.wf.application.path=${nameNode}/user/${user.name}/${tasksRoot}/v2
oozie.use.system.libpath=true
EXEC=gettweets.sh
inputDir=/tmp/tweets
outputDir=/tmp/tweetdata
ingestDir=/tmp/tweetdata
dbName=twttr

In the preceding code, we've fully updated the workflow.xml definition to include all the steps described so far, including an initial fs node to create the required directory without worrying about user permissions.

Introducing HCatalog

If we look at our current workflow, there is inefficiency in how we use HDFS as the interface between Pig and Hive. We need to output the result of our Pig script onto HDFS, where the Hive script can then use it as the location of some new tables. What this highlights is that it is often very useful to have data stored in Hive, but this is limited, as few tools (primarily Hive) can access the Hive metastore and hence read and write such data. If we think about it, Hive has two main layers: its tools for accessing and manipulating its data, plus the execution framework to run queries on that data.

The HCatalog subproject of Hive effectively provides an independent implementation of the first of these layers: the means to access and manipulate data in the Hive metastore. HCatalog provides mechanisms for other tools, such as Pig and MapReduce, to natively read and write table-structured data that is stored on HDFS.

Remember, of course, that the data is stored on HDFS in one format or another. The Hive metastore provides the models to abstract these files into the relational table structure familiar from Hive. So when we say we are storing data in HCatalog, what we really mean is that we are storing data on HDFS in such a way that this data can then be exposed by table structures specified within the Hive metastore. Conversely, when we refer to Hive data, what we really mean is data whose metadata is stored in the Hive metastore, and which can be accessed by any metastore-aware tool, such as HCatalog.

Using HCatalog

The HCatalog command-line tool is called hcat and will be preinstalled on the Cloudera QuickStart VM; it is installed, in fact, with any version of Hive from 0.11 onward.

The hcat utility doesn't have an interactive mode, so generally you will use it with explicit command-line arguments or by pointing it at a file of commands, as follows:

$ hcat -e "use default; show tables"
$ hcat -f commands.hql

Though the hcat tool is useful and can be incorporated into scripts, the more interesting element of HCatalog for our purposes here is its integration with Pig. HCatalog defines a new Pig loader called HCatLoader and a storer called HCatStorer. As the names suggest, these allow Pig scripts to read from or write to Hive tables directly. We can use this mechanism to replace our previous Pig and Hive actions in our Oozie workflow with a single HCatalog-based Pig action that writes the output of the Pig job directly into our tables in Hive.
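As an aside, when an HCatalog-aware Pig script is run directly from the command line rather than from Oozie, the HCatalog JARs need to be on Pig's classpath; one common way to arrange this (a sketch, with the script name as a placeholder) is the -useHCatalog flag:

$ pig -useHCatalog my_hcat_script.pig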

Forclarity,we’llcreatenewtablesnamedtweets_hcat,places_hcat,andusers_hcatintowhichwe’llinsertthisdata;notethatthesearenolongerexternaltables:

CREATETABLEtweets_hcat…

CREATETABLEplaces_hcat…

CREATETABLEusers_hcat…

Notethatifwehadthesecommandsinascriptfile,wecouldusethehcatCLItooltoexecutethem,asfollows:

$hcat–fcreate.hql

TheHCatCLItooldoesnot,however,offeraninteractiveshellakintotheHiveCLI.WecannowuseourpreviousPigscriptandneedtoonlychangethestorecommands,replacingtheuseofPigStoragewithHCatStorer.OurupdatedPigscript,

Page 333: index-of.co.ukindex-of.co.uk/Big-Data-Technologies/Learning Hadoop 2 - Garry... · Table of Contents Learning Hadoop 2 Credits About the Authors About the Reviewers Support files,

extract_to_hcat.pig,thereforeincludesstorecommandssuchasthefollowing:

storetweets_tsvinto'twttr.tweets_hcat'using

org.apache.hive.hcatalog.pig.HCatStorer();

NotethatthepackagenamefortheHCatStorerclasshastheorg.apache.hive.hcatalogprefix;whenHCatalogwasintheApacheincubator,itusedorg.apache.hcatalogforitspackageprefix.Thisolderformisnowdeprecated,andthenewformthatexplicitlyshowsHCatalogasasubprojectofHiveshouldbeusedinstead.

WiththisnewPigscript,wecannowreplaceourpreviousPigandHiveactionwithanupdatedPigactionusingHCatalog.ThisalsorequiresthefirstusageoftheOoziesharelib,whichwe’lldiscussinthenextsection.Inourworkflowdefinition,thepigelementofthisactionwillbedefinedasshowninthefollowingxmlandcanbefoundasv3ofthepipelineinthesourcebundle;inv3,we’vealsoaddedautilityHivenodetorunbeforethePignodetoensurethatallnecessarytablesexistbeforethePigscriptthatrequiresthemisexecuted.

<pig>
  <job-tracker>${jobTracker}</job-tracker>
  <name-node>${nameNode}</name-node>
  <job-xml>${workflowRoot}/hive-site.xml</job-xml>
  <configuration>
    <property>
      <name>mapred.job.queue.name</name>
      <value>${queueName}</value>
    </property>
    <property>
      <name>oozie.action.sharelib.for.pig</name>
      <value>pig,hcatalog</value>
    </property>
  </configuration>
  <script>${workflowRoot}/pig/extract_to_hcat.pig</script>
  <argument>-param</argument>
  <argument>inputDir=${inputDir}</argument>
</pig>

The two changes of note are the addition of the explicit reference to the hive-site.xml file, which is required by HCatalog, and the new configuration element that tells Oozie to include the required HCatalog JARs.

The Oozie sharelib

That last addition touched on an important aspect of Oozie we've not mentioned thus far: the Oozie sharelib. When Oozie runs its various action types, it requires multiple JARs to access Hadoop and to invoke various tools, such as Hive and Pig. As part of the Oozie installation, a large number of dependent JARs have been placed on HDFS to be used by Oozie and its various action types: this is the Oozie sharelib.

For most usages of Oozie, it's enough to know the sharelib exists, usually under /user/oozie/share/lib on HDFS, and when, as in the previous example, some explicit configuration values need to be added. When using a Pig action, the Pig JARs will automatically get picked up, but when the Pig script uses something like HCatalog, this dependency will not be explicitly known to Oozie.

The Oozie CLI allows manipulation of the sharelib, though the scenarios where this will be required are outside the scope of this book. The following command can be useful, though, to see which components are included in the Oozie sharelib:

$ oozie admin -shareliblist

The following command is useful to see the individual JARs comprising a particular component within the sharelib, in this case HCatalog:

$ oozie admin -shareliblist hcat

These commands can be useful to verify that the required JARs are being included and to see which specific versions are being used.

HCatalog and partitioned tables

If you rerun the previous workflow a second time, it will fail; dig into the logs, and you will see HCatalog complaining that it cannot write to a table that already contains data. This is a current limitation of HCatalog; it views tables and partitions within tables as immutable by default. Hive, on the other hand, will add new data to a table or partition; its default view of a table is that it is mutable.

Upcoming changes to Hive and HCatalog will see the support of a new table property that will control this behavior in either tool; for example, the following added to a table definition would allow table appends as supported in Hive today:

TBLPROPERTIES("immutable"="false")

This is currently not available in the shipping version of Hive and HCatalog, however. For us to have a workflow that adds more and more data into our tables, we therefore need to create a new partition for each new run of the workflow. We've made these changes in v4 of our pipeline, where we first recreate the tables with an integer partition key, as follows:

CREATE TABLE tweets_hcat (
…)
PARTITIONED BY (partition_key int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS SEQUENCEFILE;

CREATE TABLE `places_hcat` (
…)
PARTITIONED BY (partition_key int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS SEQUENCEFILE
TBLPROPERTIES("immutable"="false");

CREATE TABLE `users_hcat` (
…)
PARTITIONED BY (partition_key int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS SEQUENCEFILE
TBLPROPERTIES("immutable"="false");

The Pig HCatStorer takes an optional partition definition, and we modify the store statements in our Pig script accordingly; for example:

store tweets_tsv into 'twttr.tweets_hcat'
  using org.apache.hive.hcatalog.pig.HCatStorer(
    'partition_key=$partitionKey');

We then modify our Pig action in the workflow.xml file to include this additional parameter:

<script>${workflowRoot}/pig/extract_to_hcat.pig</script>
<param>inputDir=${inputDir}</param>
<param>partitionKey=${partitionKey}</param>

The question is then how we pass this partition key to the workflow. We could specify it in the job.properties file, but by doing so we would hit the same problem with trying to write to an existing partition on the next re-run.

Ingestion workflow v4

For now, we'll pass this as an explicit argument to the invocation of the Oozie CLI and explore better ways to do this later:

$ oozie job -run -config v4/job.properties -DpartitionKey=12345

Note

Note that a consequence of this behavior is that rerunning an HCat workflow with the same arguments will fail. Be aware of this when testing workflows or playing with the sample code from this book.
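One simple way to avoid clashing keys when testing (a sketch, not part of the book's own scripts) is to derive the partition key from the current time, which still fits the integer partition column used above:

$ oozie job -run -config v4/job.properties -DpartitionKey=$(date +%s)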


Producing derived data

Now that we have our main data pipeline established, there is most likely a series of actions that we wish to take after we add each new additional dataset. As a simple example, note that with our previous mechanism of adding each set of user data to a separate partition, the users_hcat table will contain users multiple times. Let's create a new table for unique users and regenerate this each time we add new user data.

Note that given the aforementioned limitations of HCatalog, we'll use a Hive action for this purpose, as we need to replace the data in a table.

First, we'll create a new table for unique user information, as follows:

CREATE TABLE IF NOT EXISTS `unique_users` (
  `user_id` string,
  `name` string,
  `description` string,
  `screen_name` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE;

In this table, we'll only store the attributes of a user that either never change (ID) or change rarely (the screen name, and so on). We can then write a simple Hive statement to populate this table from the full users_hcat table:

USE twttr;

INSERT OVERWRITE TABLE unique_users
SELECT DISTINCT user_id, name, description, screen_name
FROM users_hcat;

We can then add an additional Hive action node that comes after our previous Pig node in the workflow. When doing this, we discover that our pattern of simply giving nodes names such as hive-node is a really bad idea, as we now have two Hive-based nodes. In v5 of the workflow, we add this new node and also change our nodes to have more descriptive names:


Ingestion workflow v5

Performing multiple actions in parallel

Our workflow has two types of activity: initial setup with the nodes that initialize the filesystem and Hive tables, and the functional nodes that perform actual processing. If we look at the two setup nodes we have been using, it is obvious that they are quite distinct and not interdependent. We can therefore take advantage of an Oozie feature called fork and join nodes to execute these actions in parallel. The start of our workflow.xml file now becomes:

<start to="setup-fork-node"/>

The Oozie fork node contains a number of path elements, each of which specifies a starting node. Each of these will be launched in parallel:

<fork name="setup-fork-node">
  <path start="setup-filesystem-node"/>
  <path start="create-tables-node"/>
</fork>

Each of the specified action nodes is no different from any we have used previously. An action node can link to a series of other nodes; the only requirement is that each parallel series of actions must end with a transition to the join node associated with the fork node, as follows:

<action name="setup-filesystem-node">
  <ok to="setup-join-node"/>
  <error to="fail"/>
</action>

<action name="create-tables-node">
  <ok to="setup-join-node"/>
  <error to="fail"/>
</action>

The join node itself acts as the point of coordination; any path that has completed will wait until all the paths specified in the fork node reach this point. At that point, the workflow continues at the node specified within the join node. Here's how the join node is used:

<join name="create-join-node" to="gettweets-node"/>

In the preceding code, we omitted the action definitions for space purposes, but the full workflow definition is in v6:


Ingestion workflow v6

Calling a subworkflow

Though the fork/join mechanism makes the process of parallel actions more efficient, it does still add significant verbosity if we include it in our main workflow.xml definition. Conceptually, we have a series of actions that are performing related tasks required by our workflow but not necessarily part of it. For this and similar cases, Oozie offers the ability to invoke a subworkflow. The parent workflow will execute the child and wait for it to complete, with the ability to pass configuration elements from one workflow to the other.

The child workflow will be a full workflow in its own right, usually stored in a directory on HDFS with all the usual structure we expect for a workflow: the main workflow.xml file and any required Hive, Pig, or similar files.

We can create a new directory on HDFS called setup-workflow, and in this create the files required only for our filesystem and Hive creation actions. The subworkflow definition will look like the following:

<workflow-app xmlns="uri:oozie:workflow:0.4" name="create-workflow">
  <start to="setup-fork-node"/>

  <fork name="setup-fork-node">
    <path start="setup-filesystem-node"/>
    <path start="create-tables-node"/>
  </fork>

  <action name="setup-filesystem-node">
  </action>

  <action name="create-tables-node">
  </action>

  <join name="create-join-node" to="end"/>

  <kill name="fail">
    <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>

  <end name="end"/>
</workflow-app>


With this subworkflow defined, we then modify the first nodes of our main workflow to use a subworkflow node, as in the following:

<start to="create-subworkflow-node"/>

<action name="create-subworkflow-node">
  <sub-workflow>
    <app-path>${subWorkflowRoot}</app-path>
    <propagate-configuration/>
  </sub-workflow>
  <ok to="gettweets-node"/>
  <error to="fail"/>
</action>

We will specify subWorkflowRoot in the job.properties file of our parent workflow, and the propagate-configuration element will pass the configuration of the parent workflow to the child.

Adding global settings

By extracting utility nodes into subworkflows, we can significantly reduce clutter and complexity in our main workflow definition. In v7 of our ingest pipeline, we'll make one additional simplification and add a global configuration section, as in the following:

<workflow-app xmlns="uri:oozie:workflow:0.4" name="v7">
  <global>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <job-xml>${workflowRoot}/hive-site.xml</job-xml>
    <configuration>
      <property>
        <name>mapred.job.queue.name</name>
        <value>${queueName}</value>
      </property>
    </configuration>
  </global>

  <start to="create-subworkflow-node"/>

By adding this global configuration section, we remove the need to specify any of these values in the Hive and Pig nodes in the remaining workflow (note that currently the shell node does not support the global configuration mechanism). This can dramatically simplify some of our nodes; for example, our Pig node is now as follows:

<action name="hcat-ingest-node">
  <pig>
    <configuration>
      <property>
        <name>oozie.action.sharelib.for.pig</name>
        <value>pig,hcatalog</value>
      </property>
    </configuration>
    <script>${workflowRoot}/pig/extract_to_hcat.pig</script>
    <param>inputDir=${inputDir}</param>
    <param>dbName=${dbName}</param>
    <param>partitionKey=${partitionKey}</param>
  </pig>
  <ok to="derived-data-node"/>
  <error to="fail"/>
</action>

As can be seen, we can add additional configuration elements, or indeed override those specified in the global section, resulting in a much clearer action definition that focuses only on the information specific to the action in question. Our workflow v7 has had both a global section added as well as the addition of the subworkflow, and this makes a significant improvement in the workflow readability:

Ingestion workflow v7


Challenges of external data

When we rely on external data to drive our application, we are implicitly dependent on the quality and stability of that data. This is, of course, true for any data, but when the data is generated by an external source over which we do not have control, the risks are most likely higher. Regardless, when building what we expect to be reliable applications on top of such data feeds, and especially when our data volumes grow, we need to think about how to mitigate these risks.


Data validation

We use the general term data validation to refer to the act of ensuring that incoming data complies with our expectations and potentially applying normalization to modify it accordingly or to even delete malformed or corrupt input. What this actually involves will be very application-specific. In some cases, the important thing is ensuring the system only ingests data that conforms to a given definition of accurate or clean. For our tweet data, we don't care about every single record and could very easily adopt a policy such as dropping records that don't have values in particular fields we care about. For other applications, however, it is imperative to capture every input record, and this might drive the implementation of logic to reformat every record to make sure it complies with the requirements. In yet other cases, only correct records will be ingested, but the rest, instead of being discarded, might be stored elsewhere for later analysis.

The bottom line is that trying to define a generic approach to data validation is vastly beyond the scope of this chapter.

However, we can offer some thoughts on where in the pipeline to incorporate various types of validation logic.

Validation actions

Logic to do any necessary validation or cleanup can be incorporated directly into other actions. A shell node running a script to gather data can have commands added to handle malformed records differently. Pig and Hive actions that load data into tables can either perform filtering on ingest (easier done in Pig) or add caveats when copying data from an ingest table to the operational store.

There is an argument, though, for the addition of a validation node into the workflow, even if initially it performs no actual logic. This could, for instance, be a Pig action that reads the data, applies the validation, and writes the validated data to a new location to be read by follow-on nodes. The advantage here is that we can later update the validation logic without altering our other actions, which should reduce the risk of accidentally breaking the rest of the pipeline and also make nodes more cleanly defined in terms of responsibilities. The natural extension of this train of thought is that a new subworkflow for validation is most likely a good model as well, as it not only provides separation of responsibilities, but also makes the validation logic easier to test and update.

The obvious disadvantage of this approach is that it adds additional processing and another cycle of reading the data and writing it all again. This is, of course, directly working against one of the advantages we highlighted when considering the use of HCatalog from Pig.

In the end, it will come down to a trade-off of performance against workflow complexity and maintainability. When considering how to perform validation and just what that means for your workflow, take all these elements into account before deciding on an implementation.


Handling format changes

We can't declare victory just because we have data flowing into our system and are confident the data is sufficiently validated. Particularly when the data comes from an external source, we have to think about how the structure of the data might change over time.

Remember that systems such as Hive only apply the table schema when the data is being read. This is a huge benefit in enabling flexible data storage and ingest, but can lead to user-facing queries or workloads failing suddenly when the ingested data no longer matches the queries being executed against it. A relational database, which applies schemas on write, would not even allow such data to be ingested into the system.

The obvious approach to handling changes made to the data format would be to reprocess existing data into the new format. Though this is tractable on smaller datasets, it quickly becomes infeasible on the sort of volumes seen in large Hadoop clusters.


Handling schema evolution with Avro

Avro has some features with respect to its integration with Hive that help us with this problem. If we take our table for tweets data, we could represent the structure of a tweet record by the following Avro schema:

{

"namespace":"com.learninghadoop2.avrotables",

"type":"record",

"name":"tweets_avro",

"fields":[

{"name":"created_at","type":["null","string"]},

{"name":"tweet_id_str","type":["null","string"]},

{"name":"text","type":["null","string"]},

{"name":"in_reply_to","type":["null","string"]},

{"name":"is_retweeted","type":["null","string"]},

{"name":"user_id","type":["null","string"]},

{"name":"place_id","type":["null","string"]}

]

}

Create the preceding schema in a file called tweets_avro.avsc; this is the standard file extension for Avro schemas. Then, place it on HDFS; we like to have a common location for schema files such as /schema/avro.

With this definition, we can now create a Hive table that uses this schema for its table specification, as follows:

CREATE TABLE tweets_avro
PARTITIONED BY (`partition_key` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
'avro.schema.url'='hdfs://localhost.localdomain:8020/schema/avro/tweets_avro.avsc'
)
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';

Then, look at the table definition from within Hive (or HCatalog, which also supports such definitions):

describe tweets_avro;

OK

created_at      string    from deserializer
tweet_id_str    string    from deserializer
text            string    from deserializer
in_reply_to     string    from deserializer
is_retweeted    string    from deserializer
user_id         string    from deserializer
place_id        string    from deserializer
partition_key   int       None


We can also use this table like any other, for example, to copy the data from partition 3 from the non-Avro table into the Avro table, as follows:

SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE tweets_avro
PARTITION (partition_key)
SELECT * FROM tweets_hcat;

Note: Just as in previous examples, if Avro dependencies are not present in the classpath, we need to add the Avro MapReduce JAR to our environment before being able to select from the table.

We now have a new tweets table specified by an Avro schema; so far it just looks like other tables. But the real benefits for our purposes in this chapter are in how we can use the Avro mechanism to handle schema evolution. Let's add a new field to our table schema, as follows:

{

"namespace":"com.learninghadoop2.avrotables",

"type":"record",

"name":"tweets_avro",

"fields":[

{"name":"created_at","type":["null","string"]},

{"name":"tweet_id_str","type":["null","string"]},

{"name":"text","type":["null","string"]},

{"name":"in_reply_to","type":["null","string"]},

{"name":"is_retweeted","type":["null","string"]},

{"name":"user_id","type":["null","string"]},

{"name":"place_id","type":["null","string"]},

{"name":"new_feature","type":"string","default":"wow!"}

]

}

With this new schema in place, we can validate that the table definition has also been updated, as follows:

describe tweets_avro;

OK

created_at      string    from deserializer
tweet_id_str    string    from deserializer
text            string    from deserializer
in_reply_to     string    from deserializer
is_retweeted    string    from deserializer
user_id         string    from deserializer
place_id        string    from deserializer
new_feature     string    from deserializer
partition_key   int       None

Without adding any new data, we can run queries on the new field that will return the default value for our existing data, as follows:

SELECT new_feature FROM tweets_avro LIMIT 5;

...


OK

wow!

wow!

wow!

wow!

wow!

Even more impressive is the fact that the new column doesn't need to be added at the end; it can be anywhere in the record. With this mechanism, we can now update our Avro schemas to represent the new data structure and see these changes automatically reflected in our Hive table definitions. Any queries that refer to the new column will retrieve the default value for all our existing data that does not have that field present.

Note that the default mechanism we are using here is core to Avro and is not specific to Hive. Avro is a very powerful and flexible format that has applications in many areas and is definitely worth deeper examination than we are giving it here.
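As a quick illustration of that point (a sketch that is not part of the book's pipeline; it assumes the original seven-field schema has been kept in a separate, hypothetical tweets_avro_v1.avsc file alongside the updated tweets_avro.avsc), plain Avro schema resolution in Java produces exactly the same default value when old data is read with the new reader schema:

import java.io.ByteArrayOutputStream;
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.generic.*;
import org.apache.avro.io.*;

public class SchemaEvolutionSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical file names: the old schema kept as a versioned copy, plus the new one
    Schema oldSchema = new Schema.Parser().parse(new File("tweets_avro_v1.avsc"));
    Schema newSchema = new Schema.Parser().parse(new File("tweets_avro.avsc"));

    // Serialize one record using the old (writer) schema
    GenericRecord tweet = new GenericData.Record(oldSchema);
    tweet.put("text", "a tweet written before the schema change");
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(oldSchema).write(tweet, encoder);
    encoder.flush();

    // Deserialize it using the new schema as the reader schema; fields that exist only
    // in the reader schema are filled in from their declared defaults
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    GenericDatumReader<GenericRecord> reader =
        new GenericDatumReader<GenericRecord>(oldSchema, newSchema);
    GenericRecord evolved = reader.read(null, decoder);
    System.out.println(evolved.get("new_feature")); // prints the default value: wow!
  }
}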

Technically, what this provides us with is forward compatibility. We can make changes to our table schema and have all our existing data remain automatically compliant with the new structure. We can't, however, continue to ingest data of the old format into the updated tables, since the mechanism does not provide backward compatibility:

INSERT INTO TABLE tweets_avro
PARTITION (partition_key)
SELECT * FROM tweets_hcat;

FAILED: SemanticException [Error 10044]: Line 1:18 Cannot insert into
target table because column number/types are different 'tweets_avro': Table
insclause-0 has 8 columns, but query has 7 columns.

Supporting schema evolution with Avro allows data changes to be something that is handled as part of normal business instead of the firefighting emergency they all too often turn into. But plainly, it's not for free; there is still a need to make the changes in the pipeline and roll these into production. Having Hive tables that provide forward compatibility does, however, allow the process to be performed in more manageable steps; otherwise, you would need to synchronize changes across every stage of the pipeline. If the changes are made from ingest up to the point they are inserted into Avro-backed Hive tables, then all users of those tables can remain unchanged (as long as they don't do things like select *, which is usually a terrible idea anyway) and continue to run existing queries against the new data. These applications can then be changed on a different timetable to the ingestion mechanism. In our v8 of the ingest pipeline, we show how to fully use Avro tables for all of our existing functionality.

Note: Hive 0.14, currently unreleased at the time of writing, will likely include more built-in support for Avro that might simplify the process of schema evolution even further. If Hive 0.14 is available when you read this, then do check out the final implementation.

Final thoughts on using Avro schema evolution


With this discussion of Avro, we have touched on some aspects of much broader topics, in particular of data management on a broader scale and policies around data versioning and retention. Much of this area becomes very specific to an organization, but here are a few parting thoughts that we feel are more broadly applicable.

Only make additive changes

We discussed adding columns in the preceding example. Sometimes, though more rarely, your source data drops columns or you discover you no longer need a column. Avro doesn't really provide tools to help with this, and we feel it is often undesirable. Instead of dropping old columns, we tend to maintain the old data and simply do not use the empty columns in all the new data. This is much easier to manage if you control the data format; if you are ingesting external sources, then to follow this approach you will either need to reprocess data to remove the old column or change the ingest mechanism to add a default value for all new data.

Manage schema versions explicitly

In the preceding examples, we had a single schema file to which we made changes directly. This is likely a very bad idea, as it removes our ability to track schema changes over time. In addition to treating schemas as artifacts to be kept under version control (your schemas are in Git too, aren't they?), it is often useful to tag each schema with an explicit version. This is particularly useful when the incoming data is also explicitly versioned. Then, instead of overwriting the existing schema file, you can add the new file and use an ALTER TABLE statement to point the Hive table definition at the new schema. We are, of course, assuming here that you don't have the option of using a different query for the old data with the different format. Though there is no automatic mechanism for Hive to select schema, there might be cases where you can control this manually and sidestep the evolution question.

Think about schema distribution

When using a schema file, think about how it will be distributed to the clients. If, as in the previous example, the file is on HDFS, then it likely makes sense to give it a high replication factor. The file will be retrieved by each mapper in every MapReduce job that queries the table.

The Avro URL can also be specified as a local filesystem location (file://), which is useful for development, and also as a web resource (http://). Though the latter is very useful as it is a convenient mechanism to distribute the schema to non-Hadoop clients, remember that the load on the web server might be high. With modern hardware and efficient web servers, this is most likely not a huge concern, but if you have a cluster of thousands of machines running many parallel jobs where each mapper needs to hit the web server, then be careful.


Collecting additional data

Many data processing systems don't have a single data ingest source; often, one primary source is enriched by other secondary sources. We will now look at how to incorporate the retrieval of such reference data into our data warehouse.

At a high level, the problem isn't very different from our retrieval of the raw tweet data, as we wish to pull data from an external source, possibly do some processing on it, and store it somewhere where it can be used later. But this does highlight an aspect we need to consider: do we really want to retrieve this data every time we ingest new tweets? The answer is certainly no. The reference data changes very rarely, and we could easily fetch it much less frequently than new tweet data. This raises a question we've skirted until now: just how do we schedule Oozie workflows?


Scheduling workflows

Until now, we've run all our Oozie workflows on demand from the CLI. Oozie also has a scheduler that allows jobs to be started either on a timed basis or when external criteria, such as data appearing in HDFS, are met. It would be a good fit for our workflows to have our main tweet pipeline run, say, every 10 minutes but the reference data only refreshed daily.

Tip: Regardless of when data is retrieved, think carefully about how to handle datasets that perform a delete/replace operation. In particular, don't do the delete before retrieving and validating the new data; otherwise, any jobs that require the reference data will fail until the next run of the retrieval succeeds. It could be a good option to include the destructive operations in a subworkflow that is only triggered after successful completion of the retrieval steps.

Oozie actually defines two types of applications that it can run: workflows such as we've used so far and coordinators, which schedule workflows to be executed based on various criteria. A coordinator job is conceptually similar to our other workflows; we push an XML configuration file onto HDFS and use a parameterized properties file to configure it at runtime. In addition, coordinator jobs have the facility to receive additional parameterization from the events that trigger their execution.

This is possibly best described by an example. Let's say we wish to do as previously mentioned and create a coordinator that executes v7 of our ingest workflow every 10 minutes. Here's the coordinator.xml file (the standard name for the coordinator XML definition):

<coordinator-app name="tweets-10min-coordinator" frequency="${freq}"
    start="${startTime}" end="${endTime}" timezone="UTC"
    xmlns="uri:oozie:coordinator:0.2">

The main action node in a coordinator is the workflow, for which we need to specify its root location on HDFS and all required properties, as follows:

<action>

<workflow>

<app-path>${workflowPath}</app-path>

<configuration>

<property>

<name>workflowRoot</name>

<value>${workflowRoot}</value>

</property>

We also need to include any properties required by any action in the workflow or by any subworkflow it triggers; in effect, this means that any user-defined variables present in any of the workflows to be triggered need to be included here, as follows:

<property>

<name>dbName</name>

Page 350: index-of.co.ukindex-of.co.uk/Big-Data-Technologies/Learning Hadoop 2 - Garry... · Table of Contents Learning Hadoop 2 Credits About the Authors About the Reviewers Support files,

<value>${dbName}</value>

</property>

<property>

<name>partitionKey</name>

<value>${coord:formatTime(coord:nominalTime(),'yyyyMMddhhmm')}

</value>

</property>

<property>

<name>exec</name>

<value>gettweets.sh</value>

</property>

<property>

<name>inputDir</name>

<value>/tmp/tweets</value>

</property>

<property>

<name>subWorkflowRoot</name>

<value>${subWorkflowRoot}</value>

</property>

</configuration>

</workflow>

</action>

</coordinator-app>

We used a few coordinator-specific features in the preceding XML. Note the specification of the starting and ending time of the coordinator and also its frequency (in minutes). We are using the simplest form here; Oozie also has a set of functions to allow quite rich specifications of the frequency.

We use coordinator EL functions in our definition of the partitionKey variable. Earlier, when running workflows from the CLI, we specified these explicitly but mentioned there was a better way; this is it. The following expression generates a formatted output containing the year, month, day, hour, and minute:

${coord:formatTime(coord:nominalTime(),'yyyyMMddhhmm')}

If we then use this as the value for our partition key, we can ensure that each invocation of the workflow correctly creates a unique partition in our HCatalog tables.

The corresponding job.properties for the coordinator job looks much like our previous config files, with the usual entries for the NameNode and similar variables, as well as having values for the application-specific variables, such as dbName. In addition, we need to specify the root of the coordinator location on HDFS, as follows:

oozie.coord.application.path=${nameNode}/user/${user.name}/${tasksRoot}/tweets_10min

Note the oozie.coord namespace prefix instead of the previously used oozie.wf. With the coordinator definition on HDFS, we can submit the file to Oozie just as with the previous jobs. But in this case, the job will only run for a given time period. Specifically, it will run every 10 minutes (the frequency is variable) when the system clock is between startTime and endTime.

We've included the full configuration in the tweets_10min directory in the source code for this chapter.


Other Oozie triggers

The preceding coordinator has a very simple trigger; it starts periodically within a specified time range. Oozie has an additional capability called datasets, where it can be triggered by the availability of new data.

This isn't a great fit for how we've defined our pipeline until now, but imagine that, instead of our workflow collecting tweets as its first step, an external system was pushing new files of tweets onto HDFS on a continuous basis. Oozie can be configured to either look for the presence of new data based on a directory pattern or to specifically trigger when a ready file appears on HDFS. This latter configuration provides a very convenient mechanism with which to integrate the output of MapReduce jobs, which, by default, write a _SUCCESS file into their output directory.

Oozie datasets are arguably one of the most powerful parts of the whole system, and we cannot do them justice here for space reasons. But we do strongly recommend that you consult the Oozie home page for more information.


Pulling it all together

Let's review what we've discussed until now and how we can use Oozie to build a sophisticated series of workflows that implement an approach to data lifecycle management by putting together all the discussed techniques.

First, it's important to define clear responsibilities and implement parts of the system using good design and separation of concern principles. By applying this, we end up with several different workflows:

A subworkflow to ensure the environment (mainly HDFS and Hive metadata) is correctly configured
A subworkflow to perform data validation
The main workflow that triggers both the preceding subworkflows and then pulls new data through a multistep ingest pipeline
A coordinator that executes the preceding workflows every 10 minutes
A second coordinator that ingests reference data that will be useful to the application pipeline

We also define all our tables with Avro schemas and use them wherever possible to help manage schema evolution and changing data formats over time.

We present the full source code of these components in the final version of the workflow in the source code of this chapter.


Other tools to help

Though Oozie is a very powerful tool, sometimes it can be somewhat difficult to correctly write workflow definition files. As pipelines get sizeable, managing complexity becomes a challenge even with good functional partitioning into multiple workflows. At a simpler level, XML is just never fun for a human to write! There are a few tools that can help. Hue, the tool calling itself the Hadoop UI (http://gethue.com/), provides some graphical tools to help compose, execute, and manage Oozie workflows. Though powerful, Hue is not a beginner tool; we'll mention it a little more in Chapter 11, Where to Go Next.

A new Apache project called Falcon (http://falcon.incubator.apache.org) might also be of interest. Falcon uses Oozie to build a range of much higher-level dataflows and actions. For example, Falcon provides recipes to enable and ensure cross-site replication across multiple Hadoop clusters. The Falcon team is working on much better interfaces to build their workflows, so the project might well be worth watching.


Summary

Hopefully, this chapter presented the topic of data lifecycle management as something other than a dry abstract concept. We covered a lot, particularly:

The definition of data lifecycle management and how it covers a number of issues and techniques that usually become important with large data volumes
The concept of building a data ingest pipeline along good data lifecycle management principles that can then be utilized by higher-level analytic tools
Oozie as a Hadoop-focused workflow manager and how we can use it to compose a series of actions into a unified workflow
Various Oozie tools, such as subworkflows, parallel action execution, and global variables, that allow us to apply true design principles to our workflows
HCatalog and how it provides the means for tools other than Hive to read and write table-structured data; we showed its great promise and integration with tools such as Pig but also highlighted some current weaknesses
Avro as our tool of choice to handle schema evolution over time
Using Oozie coordinators to build scheduled workflows based either on time intervals or data availability to drive the execution of multiple ingest pipelines
Some other tools that can make these tasks easier, namely, Hue and Falcon

In the next chapter, we'll look at several of the higher-level analytic tools and frameworks that can build sophisticated application logic upon the data collected in an ingest pipeline.


Chapter 9. Making Development Easier

In this chapter, we will look at how, depending on use cases and end goals, application development in Hadoop can be simplified using a number of abstractions and frameworks built on top of the Java APIs. In particular, we will learn about the following topics:

How the streaming API allows us to write MapReduce jobs using dynamic languages such as Python and Ruby
How frameworks such as Apache Crunch and Kite Morphlines allow us to express data transformation pipelines using higher-level abstractions
How Kite Data, a promising framework developed by Cloudera, provides us with the ability to apply design patterns and boilerplate to ease integration and interoperability of different components within the Hadoop ecosystem


Choosing a framework

In the previous chapters, we looked at the MapReduce and Spark programming APIs to write distributed applications. Although very powerful and flexible, these APIs come with a certain level of complexity and possibly require significant development time.

In an effort to reduce verbosity, we introduced the Pig and Hive frameworks, which compile domain-specific languages, Pig Latin and HiveQL, into a number of MapReduce jobs or Spark DAGs, effectively abstracting the APIs away. Both languages can be extended with UDFs, which is a way of mapping complex logic to the Pig and Hive data models.

At times when we need a certain degree of flexibility and modularity, things can get tricky. Depending on the use case and developer needs, the Hadoop ecosystem presents a vast choice of APIs, frameworks, and libraries. In this chapter, we identify four categories of users and match them with the following relevant tools:

Developers that want to avoid Java in favor of scripting MapReduce jobs using dynamic languages, or use languages not implemented on the JVM. A typical use case would be upfront analysis and rapid prototyping: Hadoop streaming
Java developers that need to integrate components of the Hadoop ecosystem and could benefit from codified design patterns and boilerplate: Kite Data
Java developers who want to write modular data pipelines using a familiar API: Apache Crunch
Developers who would rather configure chains of data transformations. For instance, a data engineer that wants to embed existing code in an ETL pipeline: Kite Morphlines


Hadoop streaming

We have mentioned previously that MapReduce programs don't have to be written in Java. There are several reasons why you might want or need to write your map and reduce tasks in another language. Perhaps you have existing code to leverage or need to use third-party binaries; the reasons are varied and valid.

Hadoop provides a number of mechanisms to aid non-Java development, primary amongst which are Hadoop pipes, which provides a native C++ interface, and Hadoop streaming, which allows any program that uses standard input and output to be used for map and reduce tasks. With the MapReduce Java API, both map and reduce tasks provide implementations for methods that contain the task functionality. These methods receive the input to the task as method arguments and then output results via the Context object. This is a clear and type-safe interface, but it is by definition Java-specific.

Hadoop streaming takes a different approach. With streaming, you write a map task that reads its input from standard input, one line at a time, and gives the output of its results to standard output. The reduce task then does the same, again using only standard input and output for its data flow.

Any program that reads and writes from standard input and output can be used in streaming, such as compiled binaries, Unix shell scripts, or programs written in a dynamic language such as Python or Ruby. The biggest advantage to streaming is that it can allow you to try ideas and iterate on them more quickly than using Java. Instead of a compile/JAR/submit cycle, you just write the scripts and pass them as arguments to the streaming JAR file. Especially when doing initial analysis on a new dataset or trying out new ideas, this can significantly speed up development.

The classic debate regarding dynamic versus static languages balances the benefits of swift development against runtime performance and type checking. These dynamic downsides also apply when using streaming. Consequently, we favor the use of streaming for upfront analysis and Java for the implementation of jobs that will be executed on the production cluster.


Streaming word count in Python

We'll demonstrate Hadoop streaming by re-implementing our familiar word count example using Python. First, we create a script that will be our mapper. It consumes UTF-8 encoded rows of text from standard input with a for loop, splits this into words, and uses the print function to write each word to standard output, as follows:

#!/bin/env python
import sys

for line in sys.stdin:
    # skip empty lines
    if line == '\n':
        continue
    # preserve utf-8 encoding
    try:
        line = line.encode('utf-8')
    except UnicodeDecodeError:
        continue
    # newline characters can appear within the text
    line = line.replace('\n', '')
    # lowercase and tokenize
    line = line.lower().split()
    for term in line:
        if not term:
            continue
        try:
            print(
                u"%s" % (
                    term.decode('utf-8')))
        except UnicodeEncodeError:
            continue

The reducer counts the number of occurrences of each word from standard input, and gives the output as the final value to standard output, as follows:

#!/bin/env python
import sys

count = 1
current = None
for word in sys.stdin:
    word = word.strip()
    if word == current:
        count += 1
    else:
        if current:
            print "%s\t%s" % (current.decode('utf-8'), count)
        current = word
        count = 1

if current == word:
    print "%s\t%s" % (current.decode('utf-8'), count)

Note: In both cases, we are implicitly using the Hadoop input and output formats discussed in the earlier chapters. It is the TextInputFormat that processes the source file and provides each line one at a time to the map script. Conversely, the TextOutputFormat will ensure that the output of reduce tasks is also correctly written as text.

Copy map.py and reduce.py to HDFS, and execute the scripts as a streaming job using the sample data from the previous chapters, as follows:

$ hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -file map.py \
    -mapper "python map.py" \
    -file reduce.py \
    -reducer "python reduce.py" \
    -input sample.txt \
    -output output.txt

Note: Tweets are UTF-8 encoded. Make sure that PYTHONIOENCODING is set accordingly in order to pipe data in a UNIX terminal:

$ export PYTHONIOENCODING='UTF-8'

The same code can be executed from the command-line prompt:

$ cat sample.txt | python map.py | python reduce.py > out.txt

The mapper and reducer code can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch9/streaming/wc/python/map.py.


Differences in jobs when using streaming

In Java, we know that our map() method will be invoked once for each input key/value pair and our reduce() method will be invoked for each key and its set of values.

With streaming, we don't have the concept of the map or reduce methods anymore; instead we have written scripts that process streams of received data. This changes how we need to write our reducer. In Java, the grouping of values to each key was performed by Hadoop; each invocation of the reduce method would receive a single, tab-separated key and all its values. In streaming, each instance of the reduce task is given the individual ungathered values one at a time.

Hadoop streaming does sort the keys; for example, if a mapper emitted the following data:

First 1
Word 1
Word 1
A 1
First 1

The streaming reducer would receive it in the following order:

A 1
First 1
First 1
Word 1
Word 1

Hadoop still collects the values for each key and ensures that each key is passed only to a single reducer. In other words, a reducer gets all the values for a number of keys, and they are grouped together; however, they are not packaged into individual executions of the reducer, that is, one per key, as with the Java API. Since Hadoop streaming uses the stdin and stdout channels to exchange data between tasks, debug and error messages should not be printed to standard output. In the following example, we will use the Python logging (https://docs.python.org/2/library/logging.html) package to log warning statements to a file.


Finding important words in text

We will now implement a metric, Term Frequency-Inverse Document Frequency (TF-IDF), that will help us to determine the importance of words based on how frequently they appear across a set of documents (tweets, in our case).

Intuitively, if a word appears frequently in a document, it is important and should be given a high score. However, if a word appears in many documents, we should penalize it with a lower score, as it is a common word and its frequency is not unique to this document.

Therefore, common words such as the and for, which appear in many documents, will be scaled down. Words that appear frequently in a single tweet will be scaled up. Uses of TF-IDF, often in combination with other metrics and techniques, include stop word removal and text classification. Note that this technique will have shortcomings when dealing with short documents, such as tweets. In such cases, the term frequency component will tend to become one. Conversely, one could exploit this property to detect outliers.

The definition of TF-IDF we will use in our example is the following:

tf = # of times the term appears in a document (raw frequency)

idf = 1 + log(# of documents / # of documents with the term in them)

tf-idf = tf * idf
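As a quick illustration (using natural logarithms, as the Python implementation below does via math.log): a term that appears 3 times in a tweet and occurs in 10 out of 1,000 tweets scores roughly 3 * (1 + ln(100)) ≈ 16.8, while a term that appears 3 times but occurs in every tweet scores only 3 * (1 + ln(1)) = 3.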

We will implement the algorithm in Python using three MapReduce jobs:

The first one calculates term frequency
The second one calculates document frequency (the denominator of IDF)
The third one calculates per-tweet TF-IDF

Calculate term frequency

The term frequency part is very similar to the word count example. The main difference is that we will be using a multi-field, tab-separated key to keep track of co-occurrences of terms and document IDs. For each tweet, in JSON format, the mapper extracts the id_str and text fields, tokenizes text, and emits a (term, doc_id) tuple:

for tweet in sys.stdin:
    # skip empty lines
    if tweet == '\n':
        continue
    try:
        tweet = json.loads(tweet)
    except:
        logger.warn("Invalid input %s" % tweet)
        continue
    # In our example one tweet corresponds to one document.
    doc_id = tweet['id_str']
    if not doc_id:
        continue
    # preserve utf-8 encoding
    text = tweet['text'].encode('utf-8')
    # newline characters can appear within the text
    text = text.replace('\n', '')
    # lowercase and tokenize
    text = text.lower().split()
    for term in text:
        try:
            print(
                u"%s\t%s" % (
                    term.decode('utf-8'), doc_id.decode('utf-8'))
            )
        except UnicodeEncodeError:
            logger.warn("Invalid term %s" % term)

In the reducer, we emit the frequency of each term in a document as a tab-separated string:

freq = 1
cur_term, cur_doc_id = sys.stdin.readline().split()
for line in sys.stdin:
    line = line.strip()
    try:
        term, doc_id = line.split('\t')
    except:
        logger.warn("Invalid record %s" % line)
    # the key is a (doc_id, term) pair
    if (doc_id == cur_doc_id) and (term == cur_term):
        freq += 1
    else:
        print(
            u"%s\t%s\t%s" % (
                cur_term.decode('utf-8'), cur_doc_id.decode('utf-8'),
                freq))
        cur_doc_id = doc_id
        cur_term = term
        freq = 1

print(
    u"%s\t%s\t%s" % (
        cur_term.decode('utf-8'), cur_doc_id.decode('utf-8'), freq))

For this implementation to work, it is crucial that the reducer input is sorted by term. We can test both scripts from the command line with the following pipe:

$ cat tweets.json | python map-tf.py | sort -k1,2 | \
    python reduce-tf.py

Whereas at the command line we use the sort utility, in MapReduce we will use org.apache.hadoop.mapreduce.lib.KeyFieldBasedComparator. This comparator implements a subset of features provided by the sort command. In particular, ordering by field can be specified with the -k<position> option. To sort by term, the first field of our key, we set -D mapreduce.text.key.comparator.options=-k1:

/usr/bin/hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D map.output.key.field.separator=\t \
    -D stream.num.map.output.key.fields=2 \
    -D mapreduce.output.key.comparator.class=\
org.apache.hadoop.mapreduce.lib.KeyFieldBasedComparator \
    -D mapreduce.text.key.comparator.options=-k1,2 \
    -input tweets.json \
    -output /tmp/tf-out.tsv \
    -file map-tf.py \
    -mapper "python map-tf.py" \
    -file reduce-tf.py \
    -reducer "python reduce-tf.py"

Note: We specify which fields belong to the key (for shuffling) in the comparator options.

The mapper and reducer code can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch9/streaming/tf-idf/python/map-tf.py.

Calculate document frequency

The main logic to calculate document frequency is in the reducer, while the mapper is just an identity function that loads and pipes the (ordered by term) output of the TF job. In the reducer, for each term, we count how many times it occurs across all documents. For each term, we keep a buffer key_cache of (term, doc_id, tf) tuples, and when a new term is found we flush the buffer to standard output, together with the accumulated document frequency df:

# Cache the (term, doc_id, tf) tuple.
key_cache = []

line = sys.stdin.readline().strip()
cur_term, cur_doc_id, cur_tf = line.split('\t')
cur_tf = int(cur_tf)
cur_df = 1

for line in sys.stdin:
    line = line.strip()
    try:
        term, doc_id, tf = line.strip().split('\t')
        tf = int(tf)
    except:
        logger.warn("Invalid record: %s" % line)
        continue
    # term is the only key for this input
    if (term == cur_term):
        # increment document frequency
        cur_df += 1
        key_cache.append(
            u"%s\t%s\t%s" % (term.decode('utf-8'), doc_id.decode('utf-8'),
            tf))
    else:
        for key in key_cache:
            print("%s\t%s" % (key, cur_df))
        print(
            u"%s\t%s\t%s\t%s" % (
                cur_term.decode('utf-8'),
                cur_doc_id.decode('utf-8'),
                cur_tf, cur_df)
        )
        # flush the cache
        key_cache = []
        cur_doc_id = doc_id
        cur_term = term
        cur_tf = tf
        cur_df = 1

for key in key_cache:
    print(u"%s\t%s" % (key.decode('utf-8'), cur_df))
print(
    u"%s\t%s\t%s\t%s\n" % (
        cur_term.decode('utf-8'),
        cur_doc_id.decode('utf-8'),
        cur_tf, cur_df))

We can test the scripts from the command line with:

$ cat /tmp/tf-out.tsv | python map-df.py | python reduce-df.py > /tmp/df-out.tsv

And we can test the scripts on Hadoop streaming with:

/usr/bin/hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D map.output.key.field.separator=\t \
    -D stream.num.map.output.key.fields=3 \
    -D mapreduce.output.key.comparator.class=\
org.apache.hadoop.mapreduce.lib.KeyFieldBasedComparator \
    -D mapreduce.text.key.comparator.options=-k1 \
    -input /tmp/tf-out.tsv/part-00000 \
    -output /tmp/df-out.tsv \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -file reduce-df.py \
    -reducer "python reduce-df.py"

On Hadoop, we use org.apache.hadoop.mapred.lib.IdentityMapper, which provides the same logic as the map-df.py script.

The mapper and reducer code can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch9/streaming/tf-idf/python/map-df.py.

Putting it all together – TF-IDF

To calculate TF-IDF, we only need a mapper that consumes the output of the previous step:

num_doc = sys.argv[1]
for line in sys.stdin:
    line = line.strip()
    try:
        term, doc_id, tf, df = line.split('\t')
        tf = float(tf)
        df = float(df)
        num_doc = float(num_doc)
    except:
        logger.warn("Invalid record %s" % line)
    # idf = num_doc / df
    tf_idf = tf * (1 + math.log(num_doc / df))
    print("%s\t%s\t%s" % (term, doc_id, tf_idf))

The number of documents in the collection is passed as a parameter to tf-idf.py:

/usr/bin/hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.reduce.tasks=0 \
    -input /tmp/df-out.tsv/part-00000 \
    -output /tmp/tf-idf.out \
    -file tf-idf.py \
    -mapper "python tf-idf.py 15578"

To calculate the total number of tweets, we can use the cat and wc Unix utilities in combination with Hadoop streaming:

/usr/bin/hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -input tweets.json \
    -output tweets.cnt \
    -mapper /bin/cat \
    -reducer /usr/bin/wc

The mapper source code can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch9/streaming/tf-idf/python/tf-idf.py.


Kite Data

The Kite SDK (http://www.kitesdk.org) is a collection of classes, command-line tools, and examples that aims at easing the process of building applications on top of Hadoop.

In this section we will look at how Kite Data, a subproject of Kite, can ease integration with several components of a Hadoop data warehouse. Kite examples can be found at https://github.com/kite-sdk/kite-examples.

On Cloudera's QuickStart VM, Kite JARs can be found at /opt/cloudera/parcels/CDH/lib/kite/.

Kite Data is organized in a number of subprojects, some of which we'll describe in the following sections.


Data Core

As the name suggests, the core is the building block for all capabilities provided in the Data module. Its principal abstractions are datasets and repositories.

The org.kitesdk.data.Dataset interface is used to represent an immutable set of data:

@Immutable
public interface Dataset<E> extends RefinableView<E> {
  String getName();
  DatasetDescriptor getDescriptor();
  Dataset<E> getPartition(PartitionKey key, boolean autoCreate);
  void dropPartition(PartitionKey key);
  Iterable<Dataset<E>> getPartitions();
  URI getUri();
}

Each dataset is identified by a name and an instance of the org.kitesdk.data.DatasetDescriptor interface, that is, the structural description of a dataset that provides its schema (org.apache.avro.Schema) and partitioning strategy.

Implementations of the Reader<E> interface are used to read data from an underlying storage system and produce deserialized entities of type E. The newReader() method can be used to get an appropriate implementation for a given dataset:

public interface DatasetReader<E> extends Iterator<E>, Iterable<E>, Closeable {
  void open();
  boolean hasNext();
  E next();
  void remove();
  void close();
  boolean isOpen();
}

An instance of DatasetReader will provide methods to read and iterate over streams of data. Similarly, org.kitesdk.data.DatasetWriter provides an interface to write streams of data to the Dataset objects:

public interface DatasetWriter<E> extends Flushable, Closeable {
  void open();
  void write(E entity);
  void flush();
  void close();
  boolean isOpen();
}

Like readers, writers are use-once objects. They serialize instances of entities of type E and write them to the underlying storage system. Writers are usually not instantiated directly; rather, an appropriate implementation can be created by the newWriter() factory method. Implementations of DatasetWriter will hold resources until close() is called and expect the caller to invoke close() in a finally block when the writer is no longer in use. Finally, note that implementations of DatasetWriter are typically not thread-safe. The behavior of a writer being accessed from multiple threads is undefined.
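As a minimal sketch of the read path (not taken from the Kite examples; "tweets" is a hypothetical dataset name, and how the DatasetRepository itself is obtained is deliberately left out), a dataset can be loaded from a repository and consumed through a DatasetReader as follows:

import org.apache.avro.generic.GenericRecord;
import org.kitesdk.data.Dataset;
import org.kitesdk.data.DatasetReader;
import org.kitesdk.data.DatasetRepository;

public class DumpTweets {
  // Sketch only: the repository (HCatalog, Hive, or filesystem backed) is created elsewhere
  public static void dump(DatasetRepository repo) {
    Dataset<GenericRecord> tweets = repo.load("tweets");
    DatasetReader<GenericRecord> reader = tweets.newReader();
    try {
      reader.open();                        // readers are use-once objects
      for (GenericRecord tweet : reader) {  // DatasetReader is Iterable
        System.out.println(tweet.get("text"));
      }
    } finally {
      reader.close();                       // always release the underlying resources
    }
  }
}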

A particular case of a dataset is the View interface, which is as follows:

public interface View<E> {
  Dataset<E> getDataset();
  DatasetReader<E> newReader();
  DatasetWriter<E> newWriter();
  boolean includes(E entity);
  public boolean deleteAll();
}

Views carry subsets of the keys and partitions of an existing dataset; they are conceptually similar to the notion of "view" in the relational model.

A View interface can be created from ranges of data, or ranges of keys, or as a union between other views.


Data HCatalog

Data HCatalog is a module that enables the accessing of HCatalog repositories. The core abstractions of this module are org.kitesdk.data.hcatalog.HCatalogAbstractDatasetRepository and its concrete implementation, org.kitesdk.data.hcatalog.HCatalogDatasetRepository.

They describe a DatasetRepository that uses HCatalog to manage metadata and HDFS for storage, as follows:

public class HCatalogDatasetRepository extends HCatalogAbstractDatasetRepository {

  HCatalogDatasetRepository(Configuration conf) {
    super(conf, new HCatalogManagedMetadataProvider(conf));
  }

  HCatalogDatasetRepository(Configuration conf, MetadataProvider provider) {
    super(conf, provider);
  }

  public <E> Dataset<E> create(String name, DatasetDescriptor descriptor) {
    getMetadataProvider().create(name, descriptor);
    return load(name);
  }

  public boolean delete(String name) {
    return getMetadataProvider().delete(name);
  }

  public static class Builder {
  }
}

Note: As of Kite 0.17, Data HCatalog is deprecated in favor of the new Data Hive module.

The location of the data directory is either chosen by Hive/HCatalog (so-called "managed tables"), or specified when creating an instance of this class by providing a filesystem and a root directory in the constructor (external tables).


Data Hive

The kite-data-hive module exposes Hive schemas via the Dataset interface. As of Kite 0.17, this package supersedes Data HCatalog.


Data MapReduce

The org.kitesdk.data.mapreduce package provides interfaces to read and write data to and from a Dataset with MapReduce.


Data Spark

The org.kitesdk.data.spark package provides interfaces for reading and writing data to and from a Dataset with Apache Spark.


Data Crunch

The org.kitesdk.data.crunch package provides CrunchDatasets, a helper class to expose datasets and views as Crunch ReadableSource or Target classes:

public class CrunchDatasets {

  public static <E> ReadableSource<E> asSource(View<E> view, Class<E> type) {
    return new DatasetSourceTarget<E>(view, type);
  }

  public static <E> ReadableSource<E> asSource(URI uri, Class<E> type) {
    return new DatasetSourceTarget<E>(uri, type);
  }

  public static <E> ReadableSource<E> asSource(String uri, Class<E> type) {
    return asSource(URI.create(uri), type);
  }

  public static <E> Target asTarget(View<E> view) {
    return new DatasetTarget<E>(view);
  }

  public static Target asTarget(String uri) {
    return asTarget(URI.create(uri));
  }

  public static Target asTarget(URI uri) {
    return new DatasetTarget<Object>(uri);
  }
}
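As a hedged sketch of how these helpers tie the two libraries together (the views and the pipeline are assumed to be created elsewhere, for example an MRPipeline as used in the Crunch examples that follow), a Kite view can be read into a Crunch pipeline and a result written back to another view:

import org.apache.avro.generic.GenericRecord;
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.kitesdk.data.View;
import org.kitesdk.data.crunch.CrunchDatasets;

public class KiteCrunchBridge {
  // Sketch only: copies the records of one Kite view into another via a Crunch pipeline
  public static void copy(Pipeline pipeline, View<GenericRecord> in, View<GenericRecord> out) {
    PCollection<GenericRecord> records =
        pipeline.read(CrunchDatasets.asSource(in, GenericRecord.class));
    pipeline.write(records, CrunchDatasets.asTarget(out));
    pipeline.done();
  }
}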


Apache Crunch

Apache Crunch (http://crunch.apache.org) is a Java and Scala library to create pipelines of MapReduce jobs. It is based on Google's FlumeJava (http://dl.acm.org/citation.cfm?id=1806638) paper and library. The project goal is to make the task of writing MapReduce jobs as straightforward as possible for anybody familiar with the Java programming language by exposing a number of patterns that implement operations such as aggregating, joining, filtering, and sorting records.

Similar to tools such as Pig, Crunch pipelines are created by composing immutable, distributed data structures and running all processing operations on such structures; the operations are expressed and implemented as user-defined functions. Pipelines are compiled into a DAG of MapReduce jobs, whose execution is managed by the library's planner. Crunch allows us to write iterative code and abstracts away the complexity of thinking in terms of map and reduce operations, while at the same time avoiding the need for an ad hoc programming language such as Pig Latin. In addition, Crunch offers a highly customizable type system that allows us to work with, and mix, Hadoop Writables, HBase, and Avro serialized objects.

FlumeJava's main assumption is that MapReduce is the wrong level of abstraction for several classes of problems, where computations are often made up of multiple, chained jobs. Frequently, we need to compose logically independent operations (for example, filtering, projecting, grouping, and other transformations) into a single physical MapReduce job for performance reasons. This aspect also has implications for code testability. Although we won't cover this aspect in this chapter, the reader is encouraged to look further into it by consulting Crunch's documentation.


Getting started

Crunch JARs are already installed on the QuickStart VM. By default, the JARs are found in /opt/cloudera/parcels/CDH/lib/crunch.

Alternatively, recent Crunch libraries can be downloaded from https://crunch.apache.org/download.html, from Maven Central, or from Cloudera-specific repositories.


Concepts

Crunch pipelines are created by composing two abstractions: PCollection and PTable.

The PCollection<T> interface is a distributed, immutable collection of objects of type T. The PTable<Key, Value> interface is a distributed, immutable hash table (a sub-interface of PCollection) of keys of the Key type and values of the Value type that exposes methods to work with the key-value pairs.

These two abstractions support the following four primitive operations:

parallelDo: applies a user-defined function, DoFn, to a given PCollection and returns a new PCollection
union: merges two or more PCollections into a single virtual PCollection
groupByKey: sorts and groups the elements of a PTable by their keys
combineValues: aggregates the values from a groupByKey operation

The HashtagCount.java example at https://github.com/learninghadoop2/book-examples/blob/master/ch9/crunch/src/main/java/com/learninghadoop2/crunch/HashtagCount.java implements a Crunch MapReduce pipeline that counts hashtag occurrences:

Pipeline pipeline = new MRPipeline(HashtagCount.class, getConf());
pipeline.enableDebug();

PCollection<String> lines = pipeline.readTextFile(args[0]);

PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
  public void process(String line, Emitter<String> emitter) {
    for (String word : line.split("\\s+")) {
      if (word.matches("(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)")) {
        emitter.emit(word);
      }
    }
  }
}, Writables.strings());

PTable<String, Long> counts = words.count();

pipeline.writeTextFile(counts, args[1]);

// Execute the pipeline as a MapReduce.
pipeline.done();

In this example, we first create an MRPipeline pipeline and use it to read the content of sample.txt created with stream.py -t into a collection of strings, where each element of the collection represents a tweet. We tokenize each tweet into words with tweet.split("\\s+"), and we emit each word that matches the hashtag regular expression, serialized as Writable. Note that the tokenizing and filtering operations are executed in parallel by MapReduce jobs created by the parallelDo call. We create a PTable that associates each hashtag, represented as a string, with the number of times it occurred in the datasets. Finally, we write the PTable counts into HDFS as a text file. The pipeline is executed with pipeline.done().

To compile and execute the pipeline, we can use Gradle to manage the needed dependencies, as follows:

$ ./gradlew jar
$ ./gradlew copyJars

Add the Crunch and Avro dependencies downloaded with copyJars to the LIBJARS environment variable:

$ export CRUNCH_DEPS=build/libjars/crunch-example/lib
$ export LIBJARS=${LIBJARS},${CRUNCH_DEPS}/crunch-core-0.9.0-cdh5.0.3.jar,${CRUNCH_DEPS}/avro-1.7.5-cdh5.0.3.jar,${CRUNCH_DEPS}/avro-mapred-1.7.5-cdh5.0.3-hadoop2.jar

Then, run the example on Hadoop:

$ hadoop jar build/libs/crunch-example.jar \
    com.learninghadoop2.crunch.HashtagCount \
    tweets.json count-out \
    -libjars $LIBJARS


Data serialization

One of the framework's goals is to make it easy to process complex records containing nested and repeated data structures, such as protocol buffers and Thrift records.

The org.apache.crunch.types.PType interface defines the mapping between a data type that is used in a Crunch pipeline and a serialization and storage format that is used to read/write data from/to HDFS. Every PCollection has an associated PType that tells Crunch how to read/write data.

The org.apache.crunch.types.PTypeFamily interface provides an abstract factory to implement instances of PType that share the same serialization format. Currently, Crunch supports two type families: one based on the Writable interface and the other on Apache Avro.

Note: Although Crunch permits mixing and matching PCollection interfaces that use different instances of PType in the same pipeline, each PCollection interface's PType must belong to a unique family. For instance, it is not possible to have a PTable with a key serialized as Writable and its value serialized using Avro.

Both type families support a common set of primitive types (strings, longs, integers, floats, doubles, booleans, and bytes) as well as more complex PType interfaces that can be constructed out of other PTypes. These include tuples and collections of other PTypes. A particularly important, complex, PType is tableOf, which determines whether the return type of parallelDo will be a PCollection or a PTable.

New PTypes can be created by inheriting and extending the built-ins of the Avro and Writable families. This requires implementing inputMapFn<S, T> and outputMapFn<T, S> classes, where S is the original type and T is the new type.

Derived PTypes can be found in the PTypes class. These include serialization support for protocol buffers, Thrift records, Java Enums, BigInteger, and UUIDs. The Elephant Bird library we discussed in Chapter 6, Data Analysis with Apache Pig, contains additional examples.


Data processing patterns

org.apache.crunch.lib implements a number of design patterns for common data manipulation operations.

Aggregation and sorting

Most of the data processing patterns provided by org.apache.crunch.lib rely on the PTable's groupByKey method. The method has three different overloaded forms:

groupByKey(): lets the planner determine the number of partitions
groupByKey(int numPartitions): is used to set the number of partitions specified by the developer
groupByKey(GroupingOptions options): allows us to specify custom partitions and comparators for shuffling

The org.apache.crunch.GroupingOptions class takes instances of Hadoop's Partitioner and RawComparator classes to implement custom partitioning and sorting operations.

The groupByKey method returns an instance of PGroupedTable, Crunch's representation of a grouped table. It corresponds to the output of the shuffle phase of a MapReduce job and allows values to be combined with the combineValues method.
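For example, the counting we obtained from words.count() in the earlier example could also be expressed explicitly as a groupByKey followed by combineValues. The following is a sketch only; it assumes the Aggregators helper class in org.apache.crunch.fn and a table of per-hashtag partial counts produced by an earlier pipeline stage:

import org.apache.crunch.PTable;
import org.apache.crunch.fn.Aggregators;

public class SumByKey {
  // Groups a (hashtag, partial count) table by key and sums each group's values
  public static PTable<String, Long> sum(PTable<String, Long> partialCounts) {
    return partialCounts.groupByKey().combineValues(Aggregators.SUM_LONGS());
  }
}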

The org.apache.crunch.lib.Aggregate package exposes methods to perform simple aggregations (count, max, top, and length) on PCollection instances.

Sort provides an API to sort PCollection and PTable instances whose contents implement the Comparable interface.

By default, Crunch sorts data using one reducer. This behavior can be modified by passing the number of partitions required to the sort method. The Sort.Order parameter signals the order in which a sort should be done.

The following is how different sort options can be specified for collections:

public static <T> PCollection<T> sort(PCollection<T> collection)
public static <T> PCollection<T> sort(PCollection<T> collection, Sort.Order order)
public static <T> PCollection<T> sort(PCollection<T> collection, int numReducers, Sort.Order order)

The following is how different sort options can be specified for tables:

public static <K, V> PTable<K, V> sort(PTable<K, V> table)
public static <K, V> PTable<K, V> sort(PTable<K, V> table, Sort.Order key)
public static <K, V> PTable<K, V> sort(PTable<K, V> table, int numReducers, Sort.Order key)

Finally, sortPairs sorts the PCollection of pairs using the specified column order in Sort.ColumnOrder:


sortPairs(PCollection<Pair<U, V>> collection, Sort.ColumnOrder... columnOrders)

Joining data

The org.apache.crunch.lib.Join package is an API to join PTables based on a common key. The following four join operations are supported:

fullJoin
join (defaults to innerJoin)
leftJoin
rightJoin

The methods have a common return type and signature. For reference, we will describe the commonly used join method that implements an inner join:

public static <K, U, V> PTable<K, Pair<U, V>> join(PTable<K, U> left, PTable<K, V> right)
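For instance, the following sketch (not from the book's examples; both input tables are assumed to come from earlier pipeline stages) inner-joins a table of per-hashtag counts with a per-hashtag label table, producing a PTable whose values are Pairs:

import org.apache.crunch.PTable;
import org.apache.crunch.Pair;
import org.apache.crunch.lib.Join;

public class JoinExample {
  // Inner-joins hashtag counts with a hashtag -> label lookup table on the common key
  public static PTable<String, Pair<Long, String>> withLabels(
      PTable<String, Long> counts, PTable<String, String> labels) {
    return Join.join(counts, labels);
  }
}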

The org.apache.crunch.lib.Join.JoinStrategy package provides an interface to define custom join strategies. Crunch's default strategy (defaultStrategy) is to join data reduce-side.


Pipelines implementation and execution

Crunch comes with three implementations of the Pipeline interface. The oldest one, implicitly used in this chapter, is org.apache.crunch.impl.mr.MRPipeline, which uses Hadoop's MapReduce as its execution engine. org.apache.crunch.impl.mem.MemPipeline allows all operations to be performed in memory, with no serialization to disk performed. Crunch 0.10 introduced org.apache.crunch.impl.spark.SparkPipeline, which compiles and runs a DAG of PCollections on Apache Spark.

SparkPipeline

With SparkPipeline, Crunch delegates much of the execution to Spark and does relatively little of the planning tasks, with the following exceptions:

Multiple inputs
Multiple outputs
Data serialization
Checkpointing

At the time of writing, SparkPipeline is still heavily under development and might not handle all of the use cases of a standard MRPipeline. The Crunch community is actively working to ensure complete compatibility between the two implementations.

MemPipeline

MemPipeline executes in-memory on a client. Unlike MRPipeline, MemPipeline is not explicitly created but referenced by calling the static method MemPipeline.getInstance(). All operations are in memory, and the use of PTypes is minimal.
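The following small sketch (the file name is just an illustration) shows how the same Pipeline API calls used earlier in this chapter can be exercised in memory for quick local checks:

import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mem.MemPipeline;

public class InMemoryRun {
  public static void main(String[] args) {
    // No MRPipeline constructor or cluster needed; sample.txt is just an illustration
    Pipeline pipeline = MemPipeline.getInstance();
    PCollection<String> lines = pipeline.readTextFile("sample.txt");
    System.out.println(lines.length().getValue());  // materialized entirely in memory
    pipeline.done();
  }
}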


Crunch examples

We will now use Apache Crunch to reimplement some of the MapReduce code written so far in a more modular fashion.

Word co-occurrence

In Chapter 3, Processing – MapReduce and Beyond, we showed a MapReduce job, BiGramCount, to count co-occurrences of words in tweets. That same logic can be implemented as a DoFn. Instead of emitting a multi-field key and having to parse it at a later stage, with Crunch we can use a complex type Pair<String, String>, as follows:

class BiGram extends DoFn<String, Pair<String, String>> {
  @Override
  public void process(String tweet,
      Emitter<Pair<String, String>> emitter) {
    String[] words = tweet.split(" ");
    Text bigram = new Text();
    String prev = null;
    for (String s : words) {
      if (prev != null) {
        emitter.emit(Pair.of(prev, s));
      }
      prev = s;
    }
  }
}

Notice how, compared to MapReduce, the BiGram Crunch implementation is a standalone class, easily reusable in any other codebase. The code for this example is included in https://github.com/learninghadoop2/book-examples/blob/master/ch9/crunch/src/main/java/com/learninghadoop2/crunch/DataPreparationPipeline.java.

TF-IDF

We can implement the TF-IDF chain of jobs with an MRPipeline, as follows:

public class CrunchTermFrequencyInvertedDocumentFrequency
    extends Configured implements Tool, Serializable {

  private Long numDocs;

  @SuppressWarnings("deprecation")
  public static class TF {
    String term;
    String docId;
    int frequency;

    public TF() {}

    public TF(String term,
        String docId, Integer frequency) {
      this.term = term;
      this.docId = docId;
      this.frequency = (int) frequency;
    }
  }

  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println();
      System.err.println("Usage: " + this.getClass().getName() +
          " [generic options] input output");
      return 1;
    }

    // Create an object to coordinate pipeline creation and execution.
    Pipeline pipeline =
        new MRPipeline(TermFrequencyInvertedDocumentFrequency.class, getConf());
    // enable debug options
    pipeline.enableDebug();

    // Reference a given text file as a collection of Strings.
    PCollection<String> tweets = pipeline.readTextFile(args[0]);
    numDocs = tweets.length().getValue();

    // We use Avro reflection to map the TF POJO to avsc
    PTable<String, TF> tf = tweets.parallelDo(new TermFrequencyAvro(),
        Avros.tableOf(Avros.strings(), Avros.reflects(TF.class)));

    // Calculate DF
    PTable<String, Long> df = Aggregate.count(tf.parallelDo(new
        DocumentFrequencyString(), Avros.strings()));

    // Finally we calculate TF-IDF
    PTable<String, Pair<TF, Long>> tfDf = Join.join(tf, df);
    PCollection<Tuple3<String, String, Double>> tfIdf =
        tfDf.parallelDo(new TermFrequencyInvertedDocumentFrequency(),
            Avros.triples(
                Avros.strings(),
                Avros.strings(),
                Avros.doubles()));

    // Serialize as avro
    tfIdf.write(To.avroFile(args[1]));

    // Execute the pipeline as a MapReduce.
    PipelineResult result = pipeline.done();
    return result.succeeded() ? 0 : 1;
  }
}


The approach that we follow here has a number of advantages compared to streaming. First of all, we don't need to manually chain MapReduce jobs using a separate script; this task is Crunch's main purpose. Secondly, we can express each component of the metric as a distinct class, making it easier to reuse in future applications.

To implement term frequency, we create a DoFn class that takes as input a tweet and emits Pair<String, TF>. The first element is a term, and the second is an instance of the POJO class that will be serialized using Avro. The TF part contains three variables: term, docId, and frequency. In the reference implementation, we expect input data to be a JSON string that we deserialize and parse. We also include tokenizing as a subtask of the process method.

Depending on the use cases, we could abstract both operations into separate DoFns, as follows:

class TermFrequencyAvro extends DoFn<String, Pair<String, TF>> {
    public void process(String JSONTweet,
            Emitter<Pair<String, TF>> emitter) {
        Map<String, Integer> termCount = new HashMap<>();
        String tweet;
        String docId;

        JSONParser parser = new JSONParser();
        try {
            Object obj = parser.parse(JSONTweet);
            JSONObject jsonObject = (JSONObject) obj;
            tweet = (String) jsonObject.get("text");
            docId = (String) jsonObject.get("id_str");

            // Count term occurrences in the tweet (case-insensitive)
            for (String term : tweet.split("\\s+")) {
                if (termCount.containsKey(term.toLowerCase())) {
                    termCount.put(term.toLowerCase(),
                        termCount.get(term.toLowerCase()) + 1);
                } else {
                    termCount.put(term.toLowerCase(), 1);
                }
            }

            // Emit one (term, TF) pair per distinct term in the document
            for (Entry<String, Integer> entry : termCount.entrySet()) {
                emitter.emit(Pair.of(entry.getKey(),
                    new TF(entry.getKey(), docId, entry.getValue())));
            }
        } catch (ParseException e) {
            e.printStackTrace();
        }
    }
}

Document frequency is straightforward. For each Pair<String, TF> generated in the term frequency step, we emit the term, the first element of the pair. We aggregate and count the resulting PCollection of terms to obtain document frequency, as follows:

class DocumentFrequencyString extends DoFn<Pair<String, TF>, String> {
    @Override
    public void process(Pair<String, TF> tfAvro,
            Emitter<String> emitter) {
        emitter.emit(tfAvro.first());
    }
}

We finally join the PTable TF with the PTable DF on the shared key (term) and feed the resulting Pair<String, Pair<TF, Long>> object to TermFrequencyInvertedDocumentFrequency.

For each term and document, we calculate TF-IDF and return a (term, docId, tfIdf) triple:

class TermFrequencyInvertedDocumentFrequency extends
        MapFn<Pair<String, Pair<TF, Long>>, Tuple3<String, String, Double>> {
    @Override
    public Tuple3<String, String, Double> map(
            Pair<String, Pair<TF, Long>> input) {
        Pair<TF, Long> tfDf = input.second();
        Long df = tfDf.second();
        TF tf = tfDf.first();

        // Use floating-point division when computing the IDF ratio
        double idf = 1.0 + Math.log((double) numDocs / df);
        double tfIdf = idf * tf.frequency;
        return Tuple3.of(tf.term, tf.docId, tfIdf);
    }
}

We use MapFn because we are going to output one record for each input. The source code for this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch9/crunch/src/main/java/com/learninghadoop2/crunch/CrunchTermFrequencyInvertedDocumentFrequency.java.

The example can be compiled and executed with the following commands:

$ ./gradlew jar
$ ./gradlew copyJars

If not already done, add the Crunch and Avro dependencies downloaded with copyJars to the LIBJARS environment variable, as follows:

$ export CRUNCH_DEPS=build/libjars/crunch-example/lib
$ export LIBJARS=${LIBJARS},${CRUNCH_DEPS}/crunch-core-0.9.0-cdh5.0.3.jar,${CRUNCH_DEPS}/avro-1.7.5-cdh5.0.3.jar,${CRUNCH_DEPS}/avro-mapred-1.7.5-cdh5.0.3-hadoop2.jar

Furthermore, add the json-simple JAR to LIBJARS:

$ export LIBJARS=${LIBJARS},${CRUNCH_DEPS}/json-simple-1.1.1.jar


Finally, run CrunchTermFrequencyInvertedDocumentFrequency as a MapReduce job, as follows:

$ hadoop jar build/libs/crunch-example.jar \
    com.learninghadoop2.crunch.CrunchTermFrequencyInvertedDocumentFrequency \
    -libjars ${LIBJARS} \
    tweets.json tweets.avro-out


Kite Morphlines

Kite Morphlines is a data transformation library, inspired by Unix pipes, originally developed as part of Cloudera Search. A morphline is an in-memory chain of transformation commands that relies on a plugin structure to tap heterogeneous data sources. It uses declarative commands to carry out ETL operations on records. Commands are defined in a configuration file, which is later fed to a driver class.

The goal is to make embedding ETL logic into any Java codebase a trivial task by providing a library that allows developers to replace programming with a series of configuration settings.

Concepts

Morphlines are built around two abstractions: Command and Record.

Records are instances of the org.kitesdk.morphline.api.Record class:

public final class Record {
    private ArrayListMultimap<String, Object> fields;

    private Record(ArrayListMultimap<String, Object> fields) { … }

    public ListMultimap<String, Object> getFields() { … }

    public List get(String key) { … }

    public void put(String key, Object value) { … }
}

A record is a set of named fields, where each field has a list of one or more values. A Record is implemented on top of Google Guava's ListMultimap and ArrayListMultimap classes. Note that a value can be any Java object, fields can be multivalued, and two records don't need to use common field names. A record can contain an _attachment_body field that can be a java.io.InputStream or a byte array.

Commands implement the org.kitesdk.morphline.api.Command interface:

public interface Command {
    void notify(Record notification);
    boolean process(Record record);
    Command getParent();
}

A command transforms a record into zero or more records. Commands can call the methods on the Record instance provided for read and write operations, as well as for adding or removing fields.

Commands are chained together, and at each step of a morphline the parent command sends records to its child, which in turn processes them. Information between parents and children is exchanged using two communication channels (planes): notifications are sent via a control plane, and records are sent over a data plane. Records are processed by the process() method, which returns a Boolean value to indicate whether the morphline should proceed or not.
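To make this parent-child chaining concrete, here is a minimal sketch of our own (not part of the book's example code; the "text" field name is purely an assumption) showing a command that uppercases the values of a field before handing the record to its child:

final class UppercaseText implements Command {
    private final Command child;

    UppercaseText(Command child) {
        this.child = child;
    }

    @Override
    public void notify(Record notification) {
        // Forward notifications down the control plane
        child.notify(notification);
    }

    @Override
    public Command getParent() {
        return null;
    }

    @Override
    public boolean process(Record record) {
        // Copy the current values, replace them with uppercased versions,
        // then pass the record along the data plane to the child command
        List<Object> values = new ArrayList<Object>(record.get("text"));
        record.removeAll("text");
        for (Object value : values) {
            record.put("text", value.toString().toUpperCase());
        }
        return child.process(record);
    }
}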


Commands are not instantiated directly, but via an implementation of the org.kitesdk.morphline.api.CommandBuilder interface:

public interface CommandBuilder {
    Collection<String> getNames();
    Command build(Config config,
                  Command parent,
                  Command child,
                  MorphlineContext context);
}

The getNames method returns the names with which the command can be invoked. Multiple names are supported to allow backwards-compatible name changes. The build() method creates and returns a command rooted at the given morphline configuration.

The org.kitesdk.morphline.api.MorphlineContext class allows additional parameters to be passed to all morphline commands.

The data model of morphlines is structured following a source-pipe-sink pattern, where data is captured from a source, piped through a number of processing steps, and its output is then delivered into a sink.

Morphline commands

Kite Morphlines comes with a number of default commands that implement data transformations on common serialization formats (plain text, Avro, JSON). Currently available commands are organized as subprojects of morphlines and include:

kite-morphlines-core-stdio: reads data from binary large objects (BLOBs) and text
kite-morphlines-core-stdlib: wraps Java data types for data manipulation and representation
kite-morphlines-avro: serialization into and deserialization from data in the Avro format
kite-morphlines-json: serializes and deserializes data in the JSON format
kite-morphlines-hadoop-core: used to access HDFS
kite-morphlines-hadoop-parquet-avro: serializes and deserializes data in the Parquet format
kite-morphlines-hadoop-sequencefile: serializes and deserializes data in the SequenceFile format
kite-morphlines-hadoop-rcfile: serializes and deserializes data in the RCFile format

A list of all available commands can be found at http://kitesdk.org/docs/0.17.0/kite-morphlines/morphlinesReferenceGuide.html.

Commands are defined by declaring a chain of transformations in a configuration file, morphline.conf, which is then compiled and executed by a driver program. For instance, we can specify a read_tweets morphline that will load tweets stored as JSON data, serialize and deserialize them using Jackson, and print the first 10, by combining the default readJson and head commands contained in the org.kitesdk.morphline package, as follows:

morphlines : [{
    id : read_tweets
    importCommands : ["org.kitesdk.morphline.**"]
    commands : [
        {
            readJson {
                outputClass : com.fasterxml.jackson.databind.JsonNode
            }
        }
        {
            head {
                limit : 10
            }
        }
    ]
}]

We will now show how this morphline can be executed both from a standalone Java program and from MapReduce.

MorphlineDriver.java shows how to use the library embedded into a host system. The first step that we carry out in the main method is to load the morphline's JSON configuration, build a MorphlineContext object, and compile it into an instance of Command that acts as the starting node of the morphline. Note that Compiler.compile() takes a finalChild parameter; in this case, it is RecordEmitter. We use RecordEmitter to act as a sink for the morphline, by either printing a record to stdout or storing it into HDFS. In the MorphlineDriver example, we use org.kitesdk.morphline.base.Notifications to manage and monitor the morphline lifecycle in a transactional fashion.

A call to Notifications.notifyStartSession(morphline) starts the transformation chain within a transaction defined by calling Notifications.notifyBeginTransaction. Upon success, we terminate the pipeline with Notifications.notifyShutdown(morphline). In the event of failure, we roll back the transaction with Notifications.notifyRollbackTransaction(morphline) and pass an exception handler from the morphline context to the calling Java code:

public class MorphlineDriver {
    private static final class RecordEmitter implements Command {
        private final Text line = new Text();

        @Override
        public Command getParent() {
            return null;
        }

        @Override
        public void notify(Record record) {
        }

        @Override
        public boolean process(Record record) {
            line.set(record.get("_attachment_body").toString());
            System.out.println(line);
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        /* Load a morphline conf and set it up */
        File morphlineFile = new File(args[0]);
        String morphlineId = args[1];
        MorphlineContext morphlineContext = new MorphlineContext.Builder().build();
        Command morphline = new Compiler().compile(morphlineFile,
            morphlineId, morphlineContext, new RecordEmitter());

        /* Prepare the morphline for execution.
         *
         * Notifications are sent through the communication channel.
         */
        Notifications.notifyBeginTransaction(morphline);

        /* Note that we are using the local filesystem, not HDFS */
        InputStream in = new BufferedInputStream(new FileInputStream(args[2]));

        /* Fill in a record and pass it over */
        Record record = new Record();
        record.put(Fields.ATTACHMENT_BODY, in);
        try {
            Notifications.notifyStartSession(morphline);
            boolean success = morphline.process(record);
            if (!success) {
                System.out.println("Morphline failed to process record: " + record);
            }
            /* Commit the morphline */
        } catch (RuntimeException e) {
            Notifications.notifyRollbackTransaction(morphline);
            morphlineContext.getExceptionHandler().handleException(e, null);
        } finally {
            in.close();
        }

        /* Shut it down */
        Notifications.notifyShutdown(morphline);
    }
}

In this example, we load data in JSON format from the local filesystem into an InputStream object and use it to initialize a new Record instance. The RecordEmitter class contains the last processed record instance of the chain, on which we extract _attachment_body and print it to standard output. The source code for MorphlineDriver can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch9/kite/src/main/java/com/learninghadoop2/kite/morphlines/MorphlineDriver.java.

Using the same morphline from a MapReduce job is straightforward. During the setup phase of the Mapper, we build a context that contains the instantiation logic, while the map method sets the Record object up and fires off the processing logic, as follows:

public static class ReadTweets
        extends Mapper<Object, Text, Text, NullWritable> {
    private final Record record = new Record();
    private Command morphline;

    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        File morphlineConf = new File(context.getConfiguration()
            .get(MORPHLINE_CONF));
        String morphlineId = context.getConfiguration()
            .get(MORPHLINE_ID);
        MorphlineContext morphlineContext =
            new MorphlineContext.Builder().build();
        morphline = new org.kitesdk.morphline.base.Compiler()
            .compile(morphlineConf,
                morphlineId,
                morphlineContext,
                new RecordEmitter(context));
    }

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        record.put(Fields.ATTACHMENT_BODY,
            new ByteArrayInputStream(
                value.toString().getBytes("UTF8")));
        if (!morphline.process(record)) {
            System.out.println(
                "Morphline failed to process record: " + record);
        }
        record.removeAll(Fields.ATTACHMENT_BODY);
    }
}

In the MapReduce code, we modify RecordEmitter to extract the Fields payload from post-processed records and store it into the context. This allows us to write data into HDFS by specifying a FileOutputFormat in the MapReduce configuration boilerplate:

private static final class RecordEmitter implements Command {
    private final Text line = new Text();
    private final Mapper.Context context;

    private RecordEmitter(Mapper.Context context) {
        this.context = context;
    }

    @Override
    public void notify(Record notification) {
    }

    @Override
    public Command getParent() {
        return null;
    }

    @Override
    public boolean process(Record record) {
        line.set(record.get(Fields.ATTACHMENT_BODY).toString());
        try {
            context.write(line, null);
        } catch (Exception e) {
            e.printStackTrace();
            return false;
        }
        return true;
    }
}

Notice that we can now change the processing pipeline behavior and add further data transformations by modifying morphline.conf, without the explicit need to alter the instantiation and processing logic. The MapReduce driver source code can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch9/kite/src/main/java/com/learninghadoop2/kite/morphlines/MorphlineDriverMapReduce.java.

Both examples can be compiled from ch9/kite/ with the following commands:

$ ./gradlew jar
$ ./gradlew copyJar

We add the runtime dependencies to LIBJARS, as follows:

$ export KITE_DEPS=build/libjars/kite-example/lib
$ export LIBJARS=${LIBJARS},${KITE_DEPS}/kite-morphlines-core-0.17.0.jar,${KITE_DEPS}/kite-morphlines-json-0.17.0.jar,${KITE_DEPS}/metrics-core-3.0.2.jar,${KITE_DEPS}/metrics-healthchecks-3.0.2.jar,${KITE_DEPS}/config-1.0.2.jar,${KITE_DEPS}/jackson-databind-2.3.1.jar,${KITE_DEPS}/jackson-core-2.3.1.jar,${KITE_DEPS}/jackson-annotations-2.3.0.jar

We can run the MapReduce driver with the following:

$ hadoop jar build/libs/kite-example.jar \
    com.learninghadoop2.kite.morphlines.MorphlineDriverMapReduce \
    -libjars ${LIBJARS} \
    morphline.conf \
    read_tweets \
    tweets.json \
    morphlines-out


The Java standalone driver can be executed with the following command:

$ export CLASSPATH=${CLASSPATH}:${KITE_DEPS}/kite-morphlines-core-0.17.0.jar:${KITE_DEPS}/kite-morphlines-json-0.17.0.jar:${KITE_DEPS}/metrics-core-3.0.2.jar:${KITE_DEPS}/metrics-healthchecks-3.0.2.jar:${KITE_DEPS}/config-1.0.2.jar:${KITE_DEPS}/jackson-databind-2.3.1.jar:${KITE_DEPS}/jackson-core-2.3.1.jar:${KITE_DEPS}/jackson-annotations-2.3.0.jar:${KITE_DEPS}/slf4j-api-1.7.5.jar:${KITE_DEPS}/guava-11.0.2.jar:${KITE_DEPS}/hadoop-common-2.3.0-cdh5.0.3.jar

$ java -cp $CLASSPATH:./build/libs/kite-example.jar \
    com.learninghadoop2.kite.morphlines.MorphlineDriver \
    morphline.conf \
    read_tweets tweets.json \
    morphlines-out


Summary

In this chapter, we introduced four tools to ease development on Hadoop. In particular, we covered:

How Hadoop Streaming allows the writing of MapReduce jobs using dynamic languages
How Kite Data simplifies interfacing with heterogeneous data sources
How Apache Crunch provides a high-level abstraction to write pipelines of Spark and MapReduce jobs that implement common design patterns
How Morphlines allows us to declare chains of commands and data transformations that can then be embedded in any Java codebase

In Chapter 10, Running a Hadoop 2 Cluster, we will shift our focus from the domain of software development to system administration. We will discuss how to set up, manage, and scale a Hadoop cluster, while taking aspects such as monitoring and security into consideration.


Chapter 10. Running a Hadoop Cluster

In this chapter, we will change our focus a little and look at some of the considerations you will face when running an operational Hadoop cluster. In particular, we will cover the following topics:

Why a developer should care about operations and why Hadoop operations are different
More detail on Cloudera Manager and its capabilities and limitations
Designing a cluster for use on both physical hardware and EMR
Securing a Hadoop cluster
Hadoop monitoring
Troubleshooting problems with an application running on Hadoop


I'm a developer – I don't care about operations!

Before going any further, we need to explain why we are putting a chapter about systems operations in a book squarely aimed at developers. For anyone who has developed for more traditional platforms (for example, web apps, database programming, and so on), the norm might well have been a very clear delineation between development and operations. The first group builds the code and packages it up, and the second group controls and operates the environment in which it runs.

In recent years, the DevOps movement has gained momentum with a belief that it is best for everyone if these silos are removed and the teams work more closely together. When it comes to running systems and services based on Hadoop, we believe this is absolutely essential.


Hadoop and DevOps practices

Even though a developer can conceptually build an application ready to be dropped into YARN and forgotten about, the reality is often more nuanced. How many resources are allocated to the application at runtime is most likely something the developer wishes to influence. Once the application is running, the operations staff likely want some insight into the application when they are trying to optimize the cluster. There really isn't the same clear-cut split of responsibilities seen in traditional enterprise IT. And that's likely a really good thing.

In other words, developers need to be more aware of the operations aspects, and the operations staff need to be more aware of what the developers are doing. So consider this chapter our contribution to help you have those discussions with your operations staff. We don't intend to make you an expert Hadoop administrator by the end of this chapter; that really is emerging as a dedicated role and skill set in itself. Instead, we will give a whistle-stop tour of issues you do need some awareness of and that will make your life easier once your applications are running on live clusters.

By the nature of this coverage, we will be touching on a lot of topics and going into them only lightly; if any are of deeper interest, then we provide links for further investigation. Just make sure you keep your operations staff involved!


Cloudera Manager

In this book, we used as the most common platform the Cloudera Hadoop Distribution (CDH) with its convenient QuickStart virtual machine and the powerful Cloudera Manager application. With a Cloudera-based cluster, Cloudera Manager will become (at least initially) your primary interface into the system to manage and monitor the cluster, so let's explore it a little.

Note that Cloudera Manager has extensive and high-quality online documentation. We won't duplicate this documentation here; instead we'll attempt to highlight where Cloudera Manager fits into your development and operational workflows and how it might or might not be something you want to embrace. Documentation for the latest and previous versions of Cloudera Manager can be accessed via the main Cloudera documentation page at http://www.cloudera.com/content/support/en/documentation.html.


To pay or not to pay

Before getting all excited about Cloudera Manager, it's important to consult the current documentation concerning what features are available in the free version and which ones require subscription to a paid-for Cloudera offering. If you absolutely want some of the features offered only in the paid-for version but either can't or don't wish to pay for subscription services, then Cloudera Manager, and possibly the entire Cloudera distribution, might not be a good fit for you. We'll return to this topic in Chapter 11, Where to Go Next.


Cluster management using Cloudera Manager

Using the QuickStart VM, it won't be obvious, but Cloudera Manager is the primary tool to be used for management of all services in the cluster. If you want to enable a new service, you'll use Cloudera Manager. To change a configuration, you will need Cloudera Manager. To upgrade to the latest release, you will again require Cloudera Manager.

Even if the primary management of the cluster is handled by operational staff, as a developer you'll likely still want to become familiar with the Cloudera Manager interface just to see exactly how the cluster is configured. If your jobs are running slowly, then looking into Cloudera Manager to see just how things are currently configured will likely be your first step. The default port for the Cloudera Manager web interface is 7180, so the home page will usually be connected to via a URL such as http://<hostname>:7180/cmf/home, and can be seen in the following screenshot:

Cloudera Manager home page

It's worth poking around the interface; however, if you are connecting with a user account with admin privileges, be careful!

Click on the Clusters link, and this will expand to give a list of the clusters currently managed by this instance of Cloudera Manager. This should tell you that a single Cloudera Manager instance can manage multiple clusters. This is very useful, especially if you have many clusters spread across development and production.

For each expanded cluster, there will be a list of the services currently running on the cluster. Click on a service, and then you will see a list of additional choices. Select Configuration, and you can start browsing the detailed configuration of that particular service. Click on Actions, and you will get some service-specific options; this will usually include stopping, starting, restarting, and otherwise managing the service.


Click on the Hosts option instead of Clusters, and you can start drilling down into the servers managed by Cloudera Manager, and from there, see which service components are deployed on each.

Cloudera Manager and other management tools

That last comment might raise a question: how does Cloudera Manager integrate with other systems management tools? Given our earlier comments regarding the importance of DevOps philosophies, how well does it integrate with the tools favored in DevOps environments?

The honest answer: not always very well. Though the main Cloudera Manager server can itself be managed by automation tools, such as Puppet or Chef, there is an explicit assumption that Cloudera Manager will control the installation and configuration of all the software it needs on all the hosts that will be included in its clusters. To some administrators, this makes the hardware behind Cloudera Manager look like a big black box; they might control the installation of the base operating system, but the management of the configuration baseline going forward is entirely handled by Cloudera Manager. There's nothing much to be done here; it is what it is. To get the benefits of Cloudera Manager, it will add itself as a new management system in your infrastructure, and how well that fits in with your broader environment will be determined on a case-by-case basis.


Monitoring with Cloudera Manager

A similar point can be made regarding systems monitoring, as Cloudera Manager is also conceptually a point of duplication here. But start clicking around the interface, and it will become apparent very quickly that Cloudera Manager provides an exceptionally rich set of tools to assess the health and performance of managed clusters.

From graphing the relative performance of Impala queries, through showing the job status for YARN applications, to giving low-level data on the blocks stored on HDFS, it is all there in a single interface. We'll discuss later in this chapter how troubleshooting on Hadoop can be challenging, but the single point of visibility provided by Cloudera Manager is a great tool when looking to assess cluster health or performance. We'll discuss monitoring in a little more detail later in this chapter.

Finding configuration files

One of the first confusions faced when running a cluster managed by Cloudera Manager is trying to find the configuration files used by the cluster. In the vanilla Apache releases of products, such as core Hadoop, these files would typically be stored in /etc/hadoop, and similarly /etc/hive for Hive, /etc/oozie for Oozie, and so on.

In a Cloudera Manager-managed cluster, however, the config files are regenerated each time a service is restarted, and instead of sitting in the /etc locations on the filesystem, they will be found at /var/run/cloudera-scm-agent/process/<pid>-<taskname>/, where the last directory might have a name such as 7007-yarn-NODEMANAGER. This might seem odd to anyone used to working on earlier Hadoop clusters or other distributions that don't do such a thing. But in a Cloudera Manager-controlled cluster, it might often be easier to use the web interface to browse the configuration instead of looking for the underlying config files. Which approach is best? This is a little philosophical, and each team needs to decide which works best for them.


Cloudera Manager API

We've only given the highest-level overview of Cloudera Manager, and in doing so, have completely ignored one area that might be very useful for some organizations: Cloudera Manager offers an API that allows integration of its capabilities into other systems and tools. Consult the documentation if this might be of interest to you.


Cloudera Manager lock-in

This brings us to the point that is implicit in the whole discussion around Cloudera Manager: it does cause a degree of lock-in to Cloudera and their distribution. That lock-in might only exist in certain ways; code, for example, should be portable across clusters, modulo the usual caveats about different underlying versions, but the cluster itself might not easily be reconfigured to use a different distribution. Assume that switching distributions would be a complete remove/reformat/reinstall activity.

We aren't saying don't use it, rather that you need to be aware of the lock-in that comes with the use of Cloudera Manager. For small teams with little dedicated operations support or existing infrastructure, the impact of such a lock-in is likely outweighed by the significant capabilities that Cloudera Manager gives you.

For larger teams or ones working in an environment where integration with existing tools and processes has more weight, the decision might be less clear. Look at Cloudera Manager, discuss with your operations people, and determine what is right for you.

Note that it is possible to manually download and install the various components of the Cloudera distribution without using Cloudera Manager to manage the cluster and its hosts. This might be an attractive middle ground for some users, as the Cloudera software can be used but deployment and management can be built into the existing deployment and management tools. This is also potentially a way of avoiding the additional expense of the paid-for levels of Cloudera support mentioned earlier.


Ambari – the open source alternative

Ambari is an Apache project (http://ambari.apache.org), which, in theory, provides an open source alternative to Cloudera Manager. It is the administration console for the Hortonworks distribution. At the time of writing, Hortonworks employees are also the vast majority of the project contributors.

Ambari, as one would expect given its open source nature, relies on other open source products, such as Puppet and Nagios, to provide the management and monitoring of its managed clusters. It also has high-level functionality similar to Cloudera Manager, that is, the installation, configuration, management, and monitoring of a Hadoop cluster and the component services within it.

It is good to be aware of the Ambari project, as the choice is not just between full lock-in to Cloudera and Cloudera Manager or a manually managed cluster. Ambari provides a graphical tool that might be worth consideration, or indeed involvement, as it matures. On an HDP cluster, the Ambari UI equivalent to the Cloudera Manager home page shown earlier can be reached at http://<hostname>:8080/#/main/dashboard and looks like the following screenshot:

Ambari


Operations in the Hadoop 2 world

As mentioned in Chapter 2, Storage, some of the most significant changes made to HDFS in Hadoop 2 involve its fault tolerance and better integration with external systems. This is not just a curiosity; the NameNode High Availability features, in particular, have made a massive difference to the management of clusters since Hadoop 1. In the bad old days of 2012 or so, a significant part of the operational preparedness of a Hadoop cluster was built around mitigations for, and restoration processes around, failure of the NameNode. If the NameNode died in Hadoop 1, and you didn't have a backup of the HDFS fsimage metadata file, then you basically lost access to all your data. If the metadata was permanently lost, then so was the data.

Hadoop 2 has added the built-in NameNode HA and the machinery to make it work. In addition, there are components such as the NFS gateway into HDFS, which make it a much more flexible system. But this additional capability does come at the expense of more moving parts. To enable NameNode HA, there are additional components in the JournalManager and FailoverController, and the NFS gateway requires Hadoop-specific implementations of the portmap and nfsd services.

Hadoop 2 also now has extensive other integration points with external services, as well as a much broader selection of applications and services that run atop it. Consequently, it might be useful to view Hadoop 2, in terms of operations, as having traded the simplicity of Hadoop 1 for additional complexity that delivers a substantially more capable platform.


Sharing resources

In Hadoop 1, the only time one had to consider resource sharing was in choosing which scheduler to use for the MapReduce JobTracker. Since all jobs were eventually translated into MapReduce code, having a policy for resource sharing at the MapReduce level was usually sufficient to manage cluster workloads in the large.

Hadoop 2 and YARN changed this picture. As well as running many MapReduce jobs, a cluster might also be running many other applications atop other YARN ApplicationMasters. Tez and Spark are frameworks in their own right that run additional applications atop their provided interfaces.

If everything runs on YARN, then it provides ways of configuring the maximum resource allocation (in terms of CPU, memory, I/O, and so on) consumed by each container allocated to an application. The primary goal here is to ensure that enough resources are allocated to keep the hardware fully utilized, without either having unused capacity or overloading it.

Things get somewhat more interesting when non-YARN applications, such as Impala, are running on the cluster and want to grab allocated slices of capacity (particularly memory in the case of Impala). This could also happen if, say, you were running Spark on the same hosts in its non-YARN mode, or indeed any other distributed application that might benefit from co-location on the Hadoop machines.

Basically, in Hadoop 2, you need to think of the cluster as much more of a multi-tenancy environment that requires more attention given to the allocation of resources to the various tenants.

There really is no silver-bullet recommendation here; the right configuration will be entirely dependent on the services co-located and the workloads they are running. This is another example where you want to work closely with your operations team to do a series of load tests with thresholds to determine just what the resource requirements of the various clients are and which approach will give the maximum utilization and performance. The following blog post from Cloudera engineers gives a good overview of how they approach this very issue in having Impala and MapReduce coexist effectively: http://blog.cloudera.com/blog/2013/06/configuring-impala-and-mapreduce-for-multi-tenant-performance/.


Building a physical cluster

There is one requirement before thinking about allocation of hardware resources: defining and selecting the hardware used for your cluster. In this section, we'll discuss a physical cluster and move on to Amazon EMR in the next.

Any specific hardware advice will be out of date the moment it is written. We advise perusing the websites of the various Hadoop distribution vendors, as they regularly write new articles on the currently recommended configurations.

Instead of telling you how many cores or GB of memory you need, we'll look at hardware selection at a slightly higher level. The first thing to realize is that the hosts running your Hadoop cluster will most likely look very different from the rest of your enterprise. Hadoop is optimized for low(er) cost hardware, so instead of seeing a small number of very large servers, expect to see a larger number of machines with fewer enterprise reliability features. But don't think that Hadoop will run great on any junk you have lying around. It might, but recently the profile of typical Hadoop servers has been moving away from the bottom end of the market, and instead, the sweet spot would seem to be mid-range servers where the maximum cores/disks/memory can be achieved at a price point.

You should also expect to have different resource requirements for the hosts running services such as the HDFS NameNode or the YARN ResourceManager, as opposed to the worker nodes storing data and executing the application logic. For the former, there is usually much less requirement for lots of storage but, frequently, a need for more memory and possibly faster disks.

For Hadoop worker nodes, the ratio between the three main hardware categories of cores, memory, and I/O is often the most important thing to get right. And this will directly inform the decisions you make regarding workload and resource allocation.

For example, many workloads tend to become I/O bound, and having many times as many containers allocated on a host as there are physical disks might actually cause an overall slowdown due to contention for the spinning disks. At the time of writing, current recommendations here are for the number of YARN containers to be no more than 1.8 times the number of disks. If you have workloads that are I/O bound, then you will most likely get much better performance by adding more hosts to the cluster instead of trying to get more containers running, or indeed faster processors or more memory, on the current hosts.
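As a worked example of that guideline, a worker node with 12 data disks would be capped at around 12 * 1.8 ≈ 21 concurrently running YARN containers, however many cores or gigabytes of memory it has.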

Conversely, if you expect to run lots of concurrent Impala, Spark, and other memory-hungry jobs, then memory might quickly become the resource most under pressure. This is why, even though you can get current hardware recommendations for general-purpose clusters from the distribution vendors, you still need to validate against your expected workloads and tailor accordingly. There is really no substitute for benchmarking on a small test cluster, or indeed on EMR, which can be a great platform to explore the resource requirements of multiple applications and inform hardware acquisition decisions. Perhaps EMR might be your main environment; if so, we'll discuss that in a later section.


Physical layout

If you do use a physical cluster, there are a few things you will need to consider that are largely transparent on EMR.

Rack awareness

The first of these aspects, for clusters large enough to consume more than one rack of data center space, is building rack awareness. As mentioned in Chapter 2, Storage, when HDFS places replicas of new files, it attempts to place the second replica on a different host than the first, and the third in a different rack of equipment in a multi-rack system. This heuristic is aimed at maximizing resilience; there will be at least one replica available even if an entire rack of equipment fails. MapReduce uses similar logic to attempt to get a better-balanced task spread.

If you do nothing, then each host will be specified as being in the single default rack. But if the cluster grows beyond this point, you will need to update the rack names.

Under the covers, Hadoop discovers a node's rack by executing a user-supplied script that maps node hostnames to rack names. Cloudera Manager allows rack names to be set on a given host, and these are then retrieved when its rack awareness scripts are called by Hadoop. To set the rack for a host, click on Hosts -> <hostname> -> Assign Rack, and then assign the rack from the Cloudera Manager home page.

Service layout

As mentioned earlier, you are likely to have two types of hardware in your cluster: the machines running the workers and those running the servers. When deploying a physical cluster, you will need to decide which services, and which subcomponents of the services, run on which physical machines.

For the workers, this is usually pretty straightforward; most, though not all, services have a model of a worker agent on all worker hosts. But for the master/server components, it requires a little thought. If you have three master nodes, then how do you spread your primary and backup NameNodes, the YARN ResourceManager, maybe Hue, a few Hive servers, and an Oozie manager? Some of these are highly available, while others are not. As you add more and more services to your cluster, you'll also see this list of master services grow substantially.

In an ideal world, you might have a host per service master, but that is only tractable for very large clusters; in smaller installations it is prohibitively expensive, and it might always be a little wasteful. There are no hard-and-fast rules here either, but do look at your available hardware and try to spread the services across the nodes as much as possible. Don't, for example, have two nodes for the two NameNodes and then put everything else on a third. Think about the impact of a single host failure and manage the layout to minimize it. As the cluster grows across multiple racks of equipment, the considerations will also need to include how to survive single-rack failures. Hadoop itself helps with this, since HDFS will attempt to ensure each block of data has replicas across at least two racks. But this type of resilience is undermined if, for example, all the master nodes reside in a single rack.

Upgrading a service

Upgrading Hadoop has historically been a time-consuming and somewhat risky task. This remains the case on a manually deployed cluster, that is, one not managed by a tool such as Cloudera Manager.

If you are using Cloudera Manager, then it takes the time-consuming part out of the activity, but not necessarily the risk. Any upgrade should always be viewed as an activity with a high chance of unexpected issues, and you should arrange enough cluster downtime to account for this surprise excitement. There's really no substitute for doing a test upgrade on a test cluster, which underlines the importance of thinking about Hadoop as a component of your environment that needs to be treated with a deployment lifecycle like any other.

Sometimes an upgrade requires modification to the HDFS metadata or might otherwise affect the filesystem. This is, of course, where the real risks lie. In addition to running a test upgrade, be aware of the ability to put HDFS in upgrade mode, which effectively makes a snapshot of the filesystem state prior to the upgrade that is retained until the upgrade is finalized. This can be really helpful, as even an upgrade that goes badly wrong and corrupts data can potentially be fully rolled back.


Building a cluster on EMR

Elastic MapReduce is a flexible solution that, depending on requirements and workloads, can sit next to, or replace, a physical Hadoop cluster. As we've seen so far, EMR provides clusters preloaded and configured with Hive, Streaming, and Pig, as well as custom JAR clusters that allow the execution of MapReduce applications.

A second distinction to make is between transient and long-running lifecycles. A transient EMR cluster is generated on demand; data is loaded in S3 or HDFS, some processing workflow is executed, output results are stored, and the cluster is automatically shut down. A long-running cluster is kept alive once the workflow terminates, and the cluster remains available for new data to be copied over and new workflows to be executed. Long-running clusters are typically well suited for data warehousing or for working with datasets large enough that loading and processing data would be inefficient compared to a transient instance.

In a must-read whitepaper for prospective users (found at https://media.amazonwebservices.com/AWS_Amazon_EMR_Best_Practices.pdf), Amazon gives a heuristic to estimate which cluster type is a better fit, as follows:

If number of jobs per day * (time to set up cluster, including Amazon S3 data load time if using Amazon S3, + data processing time) < 24 hours, consider transient Amazon EMR clusters or physical instances.

Long-running instances are instantiated by passing the --alive argument to the Elastic MapReduce command, which enables the KeepAlive option and disables auto-termination.
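For example, if you run 4 jobs per day and each needs roughly 1 hour of cluster setup and S3 data loading plus 3 hours of processing, the total is 4 * (1 + 3) = 16 hours, which is under 24 and points towards transient clusters; the same workload at 8 jobs per day totals 32 hours, and a long-running cluster becomes the more economical choice.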

Note that transient and long-running clusters share the same properties and limitations; in particular, data on HDFS is not persisted once the cluster is shut down.


Considerations about filesystems

In our examples so far, we assumed data to be available in S3. In this case, a bucket is mounted in EMR as an s3n filesystem, and it is used as an input source as well as a temporary filesystem to store intermediate data in computations. With S3 we introduce potential I/O overhead; operations such as reads and writes fire off GET and PUT HTTP requests.

Note that EMR does not support S3 block storage. The s3 URI maps to s3n.

Another option would be to load data into the cluster HDFS and run processing from there. In this case, we do have faster I/O and data locality, but we lose persistence: when the cluster is shut down, our data disappears. As a rule of thumb, if you are running a transient cluster, it makes sense to use S3 as a backend. In practice, one should monitor and take decisions based on the workflow characteristics. Iterative, multi-pass MapReduce jobs would greatly benefit from HDFS; one could argue that for those types of workflows, an execution engine like Tez or Spark would be more appropriate.


Getting data into EMR

When copying data from HDFS to S3, it is recommended to use s3distcp (http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html) instead of Apache distcp or Hadoop distcp. This approach is also suitable to transfer data within EMR and from S3 to HDFS. To move very large amounts of data from the local disk into S3, Amazon recommends parallelizing the workload using Jets3t or GNU Parallel. In general, it's important to be aware that PUT requests to S3 are capped at 5 GB per file. To upload larger files, one needs to rely on Multipart Upload (https://aws.amazon.com/about-aws/whats-new/2010/11/10/Amazon-S3-Introducing-Multipart-Upload/), an API that allows splitting large files into smaller parts and reassembling them once uploaded. Files can also be copied with tools such as the AWS CLI or the popular s3cmd utility, but these do not have the parallelism advantages of s3distcp.


EC2 instances and tuning

The size of an EMR cluster depends on the dataset size, the number of files and blocks (which determines the number of splits), and the type of workload (try to avoid spilling to disk when a task runs out of memory). As a rule of thumb, a good size is one that maximizes parallelism. The number of mappers and reducers per instance, as well as the heap size per JVM daemon, is generally configured by EMR when the cluster is provisioned and tuned in the event of changes in the available resources.


Cluster tuning

In addition to the previous comments specific to a cluster run on EMR, there are some general thoughts to keep in mind when running workloads on any type of cluster. These will, of course, be more explicit when running outside of EMR, as it often abstracts some of the details.


JVM considerations

You should be running the 64-bit version of a JVM and using the server mode. This can take longer to produce optimized code, but it also uses more aggressive strategies and will re-optimize code over time. This makes it a much better fit for long-running services, such as Hadoop processes.

Ensure that you allocate enough memory to the JVM to prevent overly frequent Garbage Collection (GC) pauses. The concurrent mark-and-sweep collector is currently the most tested and recommended for Hadoop. The Garbage First (G1) collector has become the GC option of choice in numerous other workloads since its introduction with JDK 7, so it's worth monitoring recommended best practice as it evolves. These options can be configured as custom Java arguments within each service's configuration section of Cloudera Manager.
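As a purely illustrative sketch (the heap size here is an assumption to be tuned against your hosts and co-located services, not a recommendation), the extra arguments for a long-running daemon might look like the following:

-server -Xmx4g -XX:+UseConcMarkSweepGC -verbose:gc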

The small files problem

Heap allocation to Java processes on worker nodes will be something you consider when thinking about service co-location. But there is a particular situation regarding the NameNode you should be aware of: the small files problem.

Hadoop is optimized for very large files with large block sizes. But sometimes particular workloads or data sources push many small files onto HDFS. This is most likely suboptimal, as it means each task processing a block at a time will read only a small amount of data before completing, causing inefficiency.

Having many small files also consumes more NameNode memory; it holds in memory the mapping from files to blocks and consequently holds metadata for each file and block. If the number of files, and hence blocks, increases quickly, then so will the NameNode memory usage. This is likely to only hit a subset of systems as, at the time of writing, 1 GB of memory can support 2 million files or blocks, but with a default heap size of 2 or 4 GB, this limit can easily be reached. If the NameNode needs to start very aggressively running garbage collection or eventually runs out of memory, then your cluster will be very unhealthy. The mitigation is to assign more heap to the JVM; the longer-term approach is to combine many small files into a smaller number of larger ones, ideally compressed with a splittable compression codec.
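To put numbers on this, at roughly 1 GB of heap per 2 million files or blocks, 10 million small files that each occupy their own block amount to about 20 million objects and therefore need on the order of 10 GB of NameNode heap for metadata alone, several times the default 2 or 4 GB.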


Map and reduce optimizations

Mappers and reducers both provide areas for optimizing performance; here are a few pointers to consider:

The number of mappers depends on the number of splits. When files are smaller than the default block size or compressed using a non-splittable format, the number of mappers will equal the number of files. Otherwise, the number of mappers is given by the total size of each file divided by the block size (see the worked example after this list).
Compress mapper output to reduce writes to disk and improve I/O. LZO is a good format for this task.
Avoid spilling to disk: the mappers should have enough memory to retain as much data as possible.
Number of reducers: it is recommended that you use fewer reducers than the total reducer capacity (this avoids execution waits).
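As a worked example of the first point, a 10 GB uncompressed text file with the Hadoop 2 default block size of 128 MB produces roughly 10,240 / 128 = 80 splits and therefore about 80 map tasks, whereas 80 files of 10 MB each would also produce 80 mappers, but ones that each do very little useful work.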


Security

Once you built a cluster, the first thing you thought about was how to secure it, right? Don't worry, most people don't. But as Hadoop has moved on from running in-house analysis in the research department to directly driving critical systems, it's not something to ignore for too long.

Securing Hadoop is not something to be done on a whim or without significant testing. We cannot give detailed advice on this topic and cannot stress strongly enough the need to take it seriously and do it properly. It might consume time, it might cost money, but weigh this against the cost of having your cluster compromised.

Security is also a much bigger topic than just the Hadoop cluster. We'll explore some of the security features available in Hadoop, but you do need a coherent security strategy into which these discrete components fit.


Evolution of the Hadoop security model

In Hadoop 1, there was effectively no security protection, as the provided security model had obvious attack vectors. The Unix user ID with which you connected to the cluster was assumed to be valid, and you had all the privileges of that user. Plainly, this meant that anyone with administrative access on a host that could access the cluster could effectively impersonate any other user.

This led to the development of the so-called "head node" access model, whereby the Hadoop cluster was firewalled off from every host except one, the head node, and all access to the cluster was mediated through this centrally controlled node. This was an effective mitigation for the lack of a real security model and can still be useful even when richer security schemes are utilized.


Beyond basic authorization

Core Hadoop has had additional security features added, which address the previous concerns. In particular, they address the following:

A cluster can require a user to authenticate via Kerberos and prove they are who they say they are.
In secure mode, the cluster can also use Kerberos for all node-to-node communications, ensuring that all communicating nodes are authenticated and preventing malicious nodes from attempting to join the cluster.
To ease management, users can be collected into groups against which data-access privileges can be defined. This is called Role Based Access Control (RBAC) and is a prerequisite for a secure cluster with more than a handful of users. The user-group mappings can be retrieved from corporate systems, such as LDAP or Active Directory.
HDFS can apply ACLs to replace the current Unix-inspired owner/group/world model.

These capabilities give Hadoop a significantly stronger security posture than in the past, but the community is moving fast, and additional dedicated Apache projects have emerged to address specific areas of security.

Apache Sentry (https://sentry.incubator.apache.org) is a system to provide much finer-grained authorization to Hadoop data and services. Other services build Sentry mappings, and this allows, for example, specific restrictions to be placed not only on particular HDFS directories, but also on entities such as Hive tables.

Whereas Sentry focuses on providing much richer tools for the internal, fine-grained aspects of Hadoop security, Apache Knox (http://knox.apache.org) provides a secure gateway to Hadoop that integrates with external identity management systems and provides access control mechanisms to allow or disallow access to specific Hadoop services and operations. It does this by presenting a REST-only interface to Hadoop and securing all calls to this API.


The future of Hadoop security

There are many other developments happening in the Hadoop world. Core Hadoop 2.5 added extended file attributes to HDFS, which can be used as the basis of additional access control mechanisms. Future versions will incorporate capabilities for better support of encryption for data in transit as well as at rest, and the Project Rhino initiative led by Intel (https://github.com/intel-hadoop/project-rhino/) is building out richer support for filesystem cryptographic modules, a secure filesystem, and, at some point, a fuller key-management infrastructure.

The Hadoop distribution vendors are moving fast to add these capabilities to their releases, so if you care about security (you do, don't you!), then consult the documentation for the latest release of your distribution. New security features are being added even in point updates and aren't being delayed until major upgrades.


Consequences of using a secured cluster

After teasing you with all the security goodness that is now available and that which is coming, it's only fair to give some words of warning. Security is often hard to do correctly, and often the false sense of security provided by a buggy deployment is worse than knowing you have no security.

However, even if you do it right, there are consequences to running a secure cluster. It certainly makes things harder for the administrators, and often the users, so there is definitely an overhead. Specific Hadoop tools and services will also work differently depending on what security is employed on a cluster.

Oozie, which we discussed in Chapter 8, Data Lifecycle Management, uses its own delegation tokens behind the scenes. This allows the oozie user to submit jobs that are then executed on behalf of the originally submitting user. In a cluster using only the basic authorization mechanism, this is very easily configured, but using Oozie in a secure cluster will require additional logic to be added to the workflow definitions and the general Oozie configuration. This isn't a problem with Hadoop or Oozie; however, much like the additional complexity resulting from the much better HA features of HDFS in Hadoop 2, better security mechanisms will simply have costs and consequences that you need to take into consideration.


Monitoring

Earlier in this chapter, we discussed Cloudera Manager as a visual monitoring tool and hinted that it could also be programmatically integrated with other monitoring systems. But before plugging Hadoop into any monitoring framework, it's worth considering just what it means to operationally monitor a Hadoop cluster.


Hadoop – where failures don't matter

Traditional systems monitoring tends to be quite a binary tool; generally speaking, either something is working or it isn't. A host is alive or dead, and a web server is responding or it isn't. But in the Hadoop world, things are a little different; the important thing is service availability, and a service can still be treated as live even if particular pieces of hardware or software have failed. No Hadoop cluster should be in trouble if a single worker node fails. As of Hadoop 2, even the failure of server processes, such as the NameNode, shouldn't really be a concern if HA is configured. So any monitoring of Hadoop needs to take into account the service health and not that of specific host machines, which should be unimportant. Operations people on a 24/7 pager are not going to be happy getting paged at 3 AM to discover that one worker node in a cluster of 10,000 has failed. Indeed, once the scale of the cluster increases beyond a certain point, the failure of individual pieces of hardware becomes an almost commonplace occurrence.


Monitoring integration

You won't be building your own monitoring tools; instead, you will most likely want to integrate with existing tools and frameworks. For popular open source monitoring tools, such as Nagios and Zabbix, there are multiple sample templates for integrating Hadoop's service-wide and node-specific metrics.

This can give the sort of separation hinted at previously; the failure of the YARN ResourceManager would be a high-criticality event that should most likely cause alerts to be sent to operations staff, whereas a high load on specific hosts should only be captured and not cause alerts to be fired. This provides the duality of firing alerts when bad things happen in addition to capturing the information needed to delve into system data over time and do trend analysis.
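As a rough illustration of how that split might look in Nagios, the following is a minimal, untested sketch; it assumes the stock check_http command definition, a generic-service template, and placeholder host names, all of which you would adapt to your environment:

# High-criticality check: page operations if the ResourceManager UI stops responding
define service {
    use                     generic-service
    host_name               rm-host
    service_description     YARN ResourceManager web UI
    check_command           check_http!-p 8088
    notifications_enabled   1
}

# Low-criticality check: track one worker's NodeManager for trend analysis, but never page anyone
define service {
    use                     generic-service
    host_name               worker-node-01
    service_description     NodeManager web UI
    check_command           check_http!-p 8042
    notifications_enabled   0
}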

Cloudera Manager provides a REST interface, which offers another point of integration; tools such as Nagios can pull the service-level metrics defined by Cloudera Manager instead of having to define their own.

For heavier-weight enterprise-monitoring infrastructure built on frameworks such as IBM Tivoli or HP OpenView, Cloudera Manager can also deliver events via SNMP traps that will be collected by these systems.


Application-level metrics

At times, you might also want your applications to gather metrics that can be centrally captured within the system. The mechanisms for this will differ from one computational model to another, but the most well-known are the application counters available within MapReduce.

When a MapReduce job completes, it outputs a number of counters, gathered by the system throughout the job execution, that cover metrics such as the number of map tasks, bytes written, failed tasks, and so on. You can also write application-specific metrics that will be available alongside the system counters and which are automatically aggregated across the map/reduce execution. First, define a Java enum and name your desired metrics within it, as follows:

public enum AppMetrics {
    MAX_SEEN,
    MIN_SEEN,
    BAD_RECORDS
};

Then, within the map, reduce, setup, and cleanup methods of your Map or Reduce implementations, you can do something like the following to increment a counter by one:

context.getCounter(AppMetrics.BAD_RECORDS).increment(1);

Refer to the JavaDoc of the org.apache.hadoop.mapreduce.Counter interface for more details of this mechanism.
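To put the pieces together, the following is a minimal sketch of how such a counter might be used inside a mapper; the AppMetrics enum above is assumed to be on the classpath, and the record format (tab-separated lines) is purely illustrative:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RecordValidationMapper
        extends Mapper<LongWritable, Text, Text, LongWritable> {

    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        if (fields.length < 2) {
            // Malformed input: count it and move on rather than failing the whole task
            context.getCounter(AppMetrics.BAD_RECORDS).increment(1);
            return;
        }
        context.write(new Text(fields[0]), ONE);
    }
}

The resulting counter values appear alongside the built-in counters in the job output and in the web UIs discussed later in this chapter.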


Troubleshooting

Monitoring and logging counters or additional information is all well and good, but it can be difficult to know how to actually find the information you need when troubleshooting a problem with an application. In this section, we will look at how Hadoop stores logs and system information. We can distinguish three categories of logs, as follows:

- YARN applications, including MapReduce jobs
- Daemon logs (NameNode and ResourceManager)
- Services that log non-distributed workloads, for example, HiveServer2 logging to /var/log

Next to these log categories, Hadoop exposes a number of metrics at the filesystem level (storage availability, replication factor, and number of blocks) and at the system level. As mentioned, both Apache Ambari and Cloudera Manager do a nice job as frontends that centralize access to debug information. However, under the hood, each service logs either to HDFS or to the local filesystem of a single node. Furthermore, YARN, MapReduce, and HDFS expose their log files and metrics via web interfaces and programmatic APIs.


Logging levels

Hadoop logs messages via Log4j by default. Log4j is configured via log4j.properties in the classpath. This file defines both what is logged and the layout used:

log4j.rootLogger=${root.logger}
root.logger=INFO,console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n

The default root logger is INFO,console, which logs all messages at the level INFO and above to the console's stderr. Single applications deployed on Hadoop can ship their own log4j.properties and set the level and other properties of their emitted logs as required.
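For example, a job could bundle a log4j.properties that keeps the default root logger but raises the verbosity of its own packages; the package name below is a hypothetical placeholder for your application code:

# Keep Hadoop's defaults, but emit DEBUG messages for our own application classes
log4j.rootLogger=${root.logger}
root.logger=INFO,console
log4j.logger.com.example.myjob=DEBUG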

Hadoop daemons provide a web page for getting and setting the log level of any Log4j property. This interface is exposed by the /logLevel endpoint in each service's web UI. To enable debug logging for the ResourceManager class, we would visit http://resourcemanagerhost:8088/logLevel, as shown in the following screenshot:

Getting and setting the log level on the ResourceManager

Alternatively, the hadoop daemonlog <host:port> command interfaces with the service's /logLevel endpoint. We can inspect the level associated with mapreduce.map.log.level on the ResourceManager using the -getlevel <property> parameter, as follows:

$ hadoop daemonlog -getlevel localhost.localdomain:8088 mapreduce.map.log.level
Connecting to http://localhost.localdomain:8088/logLevel?log=mapreduce.map.log.level
Submitted Log Name: mapreduce.map.log.level
Log Class: org.apache.commons.logging.impl.Log4JLogger
Effective level: INFO

The effective level can be modified using the -setlevel <property> <level> option:

$ hadoop daemonlog -setlevel localhost.localdomain:8088 mapreduce.map.log.level DEBUG
Connecting to http://localhost.localdomain:8088/logLevel?log=mapreduce.map.log.level&level=DEBUG


Submitted Log Name: mapreduce.map.log.level
Log Class: org.apache.commons.logging.impl.Log4JLogger
Submitted Level: DEBUG
Setting Level to DEBUG...
Effective level: DEBUG

Note that this setting will affect all logs produced by the ResourceManager class. This includes system-generated entries as well as the ones generated by applications running on YARN.


Access to log files

Log file locations and naming conventions are likely to differ based on the distribution. Apache Ambari and Cloudera Manager centralize access to log files, both for services and for single applications. On Cloudera's QuickStart VM, an overview of the currently running processes, with links to their log files and to the stderr and stdout channels, can be found at http://localhost.localdomain:7180/cmf/hardware/hosts/1/processes, as shown in the following screenshot:

Access to log resources in Cloudera Manager

Ambari provides a similar overview via the Services dashboard, found at http://127.0.0.1:8080/#/main/services on the HDP Sandbox, as shown in the following screenshot:


Access to log resources in Apache Ambari

Non-distributed logs are usually found under /var/log/<service> on each cluster node. The locations of YARN container and MRv2 logs also depend on the distribution. On CDH5, these resources are available in HDFS under /tmp/logs/<user>.

The standard way to access distributed logs is either via command-line tools or via the services' web UIs.

For instance, the command is as follows:

$ yarn application -list -appStates ALL

The preceding command will list all running and retired YARN applications. The URL in the Tracking-URL column points to a web interface that exposes the task logs, as follows:

14/08/03 14:44:38 INFO client.RMProxy: Connecting to ResourceManager at localhost.localdomain/127.0.0.1:8032
Total number of applications (application-types: [] and states: [NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED]): 4
Application-Id                  Application-Name         Application-Type  User      Queue          State     Final-State  Progress  Tracking-URL
application_1405630696162_0002  PigLatin:DefaultJobName  MAPREDUCE         cloudera  root.cloudera  FINISHED  SUCCEEDED    100%      http://localhost.localdomain:19888/jobhistory/job/job_1405630696162_0002
application_1405630696162_0004  PigLatin:DefaultJobName  MAPREDUCE         cloudera  root.cloudera  FINISHED  SUCCEEDED    100%      http://localhost.localdomain:19888/jobhistory/job/job_1405630696162_0004
application_1405630696162_0003  PigLatin:DefaultJobName  MAPREDUCE         cloudera  root.cloudera  FINISHED  SUCCEEDED    100%      http://localhost.localdomain:19888/jobhistory/job/job_1405630696162_0003
application_1405630696162_0005  PigLatin:DefaultJobName  MAPREDUCE         cloudera  root.cloudera  FINISHED  SUCCEEDED    100%      http://localhost.localdomain:19888/jobhistory/job/job_1405630696162_0005

For instance, http://localhost.localdomain:19888/jobhistory/job/job_1405630696162_0002, the link to a job belonging to the user cloudera, is a frontend to the content stored under hdfs:///tmp/logs/cloudera/logs/application_1405630696162_0002/.
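When log aggregation is enabled, the same aggregated container logs can also be fetched from the command line, which is often quicker than navigating the UIs; for example, using the application ID shown above:

$ yarn logs -applicationId application_1405630696162_0002

This dumps the stdout, stderr, and syslog output of each container of the application to the terminal.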

In the following sections, we will give an overview of the web UIs available for the different services.

Note: Provisioning an EMR cluster with the --log-uri s3://<bucket> option will ensure that Hadoop logs are copied into the s3://<bucket> location.


ResourceManager, NodeManager, and ApplicationManager

On YARN, the ResourceManager web UI provides general information and job statistics for the Hadoop cluster, the running/completed/failed jobs, and a job history log file. By default, the UI is exposed at http://<resourcemanagerhost>:8088/ and can be seen in the following screenshot:

ResourceManager

Applications

On the left-hand sidebar, it is possible to review the applications with the status of interest: NEW, SUBMITTED, ACCEPTED, RUNNING, FINISHING, FINISHED, FAILED, or KILLED. Depending on the application status, the following information is available:

- The application ID
- The submitting user
- The application name
- The scheduler queue in which the application is placed
- Start/finish times and state
- A link to the Tracking UI for the application history

In addition, the Cluster Metrics view gives you information on the following:

- Overall application status
- Number of running containers
- Memory usage
- Node status
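The same application and cluster information is also exposed by the ResourceManager REST API, which is handy when integrating with scripts or monitoring systems; a couple of illustrative calls, using the host and port from the earlier examples, might look like this:

# List applications known to the ResourceManager
$ curl 'http://localhost.localdomain:8088/ws/v1/cluster/apps'

# Cluster-wide metrics: running containers, memory usage, node status, and so on
$ curl 'http://localhost.localdomain:8088/ws/v1/cluster/metrics'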

Nodes

The Nodes view is a frontend to the NodeManager services, showing health and location information for the nodes and their running applications, as follows:


Nodes status

Each individual node of the cluster exposes further information and statistics at the host level via its own UI. These include which version of Hadoop is running on the node, how much memory is available on the node, the node status, and a list of running applications and containers, as shown in the following screenshot:

Single node info

Scheduler

The following screenshot shows the Scheduler window:

Scheduler

MapReduce

Though the same information and logging details are available in MapReduce v1 and MapReduce v2, the way they are accessed is slightly different.

MapReduce v1

The following screenshot shows the MapReduce JobTracker UI:

The JobTracker UI

The JobTracker UI, available by default at http://<jobtracker>:50030, exposes information on all currently running as well as retired MapReduce jobs, a summary of the cluster resources and health, as well as scheduling information and completion percentages, as shown in the following screenshot:

Job details


For each running and retired job, details are available, including its ID, owner, priority, and the task assignment and launch information for the mappers. Clicking on a job ID link will lead to a job details page (the same URL exposed by the mapred job -list command). This resource gives details about both the map and reduce tasks as well as general counter statistics at the job, filesystem, and MapReduce levels; these include the memory used, the number of read/write operations, and the number of bytes read and written.

For each Map and Reduce operation, the JobTracker exposes the total, pending, running, completed, and failed tasks, as shown in the following screenshot:

Job tasks overview

Clicking on the links in the Job table will lead to a further overview at the task and task-attempt levels, as shown in the following screenshot:

Task attempts

From this last page, we can access the logs of each task attempt, both for successful and failed/killed tasks, on each individual TaskTracker host. This log contains the most granular information about the status of the MapReduce job, including the output of Log4j appenders as well as output piped to the stdout and stderr channels and syslog, as shown in the following screenshot:


TaskTracker logs

MapReduce v2 (YARN)

As we have seen in Chapter 3, Processing – MapReduce and Beyond, with YARN, MapReduce is only one of many processing frameworks that can be deployed. Recall from previous chapters that the JobTracker and TaskTracker services have been replaced by the ResourceManager and NodeManager, respectively. As such, both the service UIs and the log files from YARN are more generic than those of MapReduce v1.

The application_1405630696162_0002 name shown in the ResourceManager corresponds to the MapReduce job with the job_1405630696162_0002 ID. The application ID belongs to the task running inside the container, and clicking on it will reveal an overview of the MapReduce job and allow a drill-down to the individual tasks of either phase, until the single-task log is reached, as shown in the following screenshot:


A YARN application containing a MapReduce job

JobHistory Server

YARN ships with a JobHistory REST service that exposes details on finished applications. Currently, it only supports MapReduce and provides information on finished jobs. This includes the job's final status (SUCCEEDED or FAILED), who submitted the job, the total number of map and reduce tasks, and timing information.

A UI is available at http://<jobhistoryhost>:19888/jobhistory, as shown in the following screenshot:

JobHistory UI

Clicking on each job ID will lead to the MapReduce job UI shown in the YARN application screenshot.
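The REST service mentioned above can also be queried directly, which is useful for automated reporting on finished jobs; a minimal sketch, assuming the default history server address used in this chapter, is:

# List finished MapReduce jobs known to the history server
$ curl 'http://localhost.localdomain:19888/ws/v1/history/mapreduce/jobs'

# Details (status, submitter, task counts, timings) for a specific job
$ curl 'http://localhost.localdomain:19888/ws/v1/history/mapreduce/jobs/job_1405630696162_0002'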


NameNode and DataNode

The web interface for the Hadoop Distributed File System (HDFS) shows information about the NameNode itself as well as the filesystem in general.

By default, it is located at http://<namenodehost>:50070/, as shown in the following screenshot:

NameNode UI

The Overview menu exposes NameNode information about DFS capacity and usage and the block pool status, and it gives a summary of the status of DataNode health and availability. The information contained in this page is for the most part equivalent to what is shown at the command-line prompt by:

$ hdfs dfsadmin -report
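If you prefer to collect the same figures programmatically, for example to feed a monitoring system, the NameNode web server also exposes its metrics as JSON via the standard JMX servlet; a hedged example, assuming the default port shown above, is:

# Dump NameNode filesystem state (capacity, blocks, live/dead DataNodes) as JSON
$ curl 'http://<namenodehost>:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState'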

The DataNodes menu gives more detailed information about the status of each node and offers a drill-down at the single-host level, both for available and decommissioned nodes, as shown in the following screenshot:


DataNode UI


Summary

This has been quite a whistle-stop tour around the considerations of running an operational Hadoop cluster. We didn't try to turn developers into administrators, but hopefully, the broader perspective will help you to help your operations staff. In particular, we covered the following topics:

- How Hadoop is a natural fit for DevOps approaches, as its multilayered complexity means it's not possible or desirable to have substantial knowledge gaps between development and operations staff
- Cloudera Manager, and how it can be a great management and monitoring tool; it might cause integration problems though, if you have other enterprise tools, and it comes with a vendor lock-in risk
- Ambari, the Apache open source alternative to Cloudera Manager, and how it is used in the Hortonworks distribution
- How to think about selecting hardware for a physical Hadoop cluster, and how this naturally fits into the considerations of how the multiple workloads possible in the world of Hadoop 2 can peacefully coexist on shared resources
- The different considerations for firing up and using EMR clusters, and how this can be both an adjunct to, as well as an alternative to, a physical cluster
- The Hadoop security ecosystem, how it is a very fast-moving area, and how the features available today are vastly better than some years ago, with still more around the corner
- Monitoring of a Hadoop cluster, considering what events are important in the Hadoop model of embracing failure, and how these alerts and metrics can be integrated into other enterprise-monitoring frameworks
- How to troubleshoot issues with a Hadoop cluster, both in terms of what might have happened and how to find the information to inform your analysis
- A quick tour of the various web UIs provided by Hadoop, which can give very good overviews of what is happening within the various components of the system

This concludes our treatment of Hadoop in depth. In the final chapter, we will express some thoughts on the broader Hadoop ecosystem, give some pointers to useful and interesting tools and products that we didn't have a chance to cover in the book, and suggest how to get involved with the community.


Chapter 11. Where to Go Next

In the previous chapters, we have examined many parts of Hadoop 2 and the ecosystem around it. However, we have necessarily been limited by page count; some areas we didn't get into in as much depth as was possible, other areas we referred to only in passing or did not mention at all.

The Hadoop ecosystem, with its distributions and Apache and non-Apache projects, is an incredibly vibrant and healthy place to be right now. In this chapter, we hope to complement the previously discussed, more detailed material with a travel guide, if you will, for other interesting destinations. In this chapter, we will discuss the following topics:

- Hadoop distributions
- Other significant Apache and non-Apache projects
- Sources of information and help

Of course, note that any overview of the ecosystem is both skewed by our interests and preferences, and outdated the moment it is written. In other words, don't for a moment think this is all that's available; consider it instead a whetting of the appetite.


Alternative distributions

We've generally used the Cloudera distribution of Hadoop in this book, but have attempted to keep the coverage distribution-independent as much as possible. We've also mentioned the Hortonworks Data Platform (HDP) throughout this book, but these are certainly not the only distribution choices available to you.

Before taking a look around, let's consider whether you need a distribution at all. It is completely possible to go to the Apache website, download the source tarballs of the projects in which you are interested, and then work to build them all together. However, given version dependencies, this is likely to consume more time than you would expect; potentially, vastly more. In addition, the end product will likely lack some polish in terms of tools or scripts for operational deployment and management. For most users, these areas are why employing an existing Hadoop distribution is the natural choice.

A note on free and commercial extensions: Hadoop being an open source project with a quite liberal license, distribution creators are also free to enhance it with proprietary extensions that are made available either as free open source or as commercial products.

This can be a controversial issue, as some open source advocates dislike any commercialization of successful open source projects; to them, it appears that the commercial entity is freeloading by taking the fruits of the open source community without having to build it for themselves. Others see this as a healthy aspect of the flexible Apache license; the base product will always be free, and individuals and companies can choose whether to go with commercial extensions or not. We don't pass judgment either way, but be aware that this is another of the controversies you will almost certainly encounter.

So you need to decide whether you need a distribution and, if so, for what reasons: which specific aspects will benefit you most over rolling your own? Do you wish for a fully open source product, or are you willing to pay for commercial extensions? With these questions in mind, let's look at a few of the main distributions.


Cloudera Distribution for Hadoop

You will be familiar with the Cloudera distribution (http://www.cloudera.com), as it has been used throughout this book. CDH was the first widely available alternative distribution, and its breadth of available software, proven level of quality, and free cost have made it a very popular choice.

Recently, Cloudera has been actively extending the products it adds to its distribution beyond the core Hadoop projects. In addition to Cloudera Manager and Impala (both Cloudera-developed products), it has also added other tools such as Cloudera Search (based on Apache Solr) and Cloudera Navigator (a data governance solution). While CDH versions prior to 5 were focused more on the integration benefits of a distribution, version 5 (and presumably beyond) is adding more and more capability atop the base Apache Hadoop projects.

Cloudera also offers commercial support for its products in addition to training and consultancy services. Details can be found on the company web page.


Hortonworks Data Platform

In 2011, the Yahoo! division responsible for so much of the development of Hadoop was spun off into a new company called Hortonworks. They have also produced their own pre-integrated Hadoop distribution called the Hortonworks Data Platform (HDP), available at http://hortonworks.com/products/hortonworksdataplatform/.

HDP is conceptually similar to CDH, but the two products differ in their focus. Hortonworks makes much of the fact that HDP is fully open source, including the management tool Ambari, which we discussed briefly in Chapter 10, Running a Hadoop Cluster. It has also positioned HDP as a key integration platform through its support for tools such as Talend Open Studio. Hortonworks does not offer proprietary software; its business model focuses instead on offering professional services and support for the platform.

Both Cloudera and Hortonworks are venture-backed companies with significant engineering expertise; both employ many of the most prolific contributors to Hadoop. The underlying technology is, however, comprised of the same Apache projects; the distinguishing factors are how they are packaged, the versions employed, and the additional value-added offerings provided by the companies.


MapR

A different type of distribution is offered by MapR Technologies, although the company and distribution are usually referred to simply as MapR. The distribution, available from http://www.mapr.com, is based on Hadoop but has added a number of changes and enhancements.

The focus of the MapR distribution is on performance and availability. For example, it was the first distribution to offer a high-availability solution for the Hadoop NameNode and JobTracker, which, as you will remember from Chapter 2, Storage, was a significant weakness in core Hadoop 1. It also offered native integration with NFS filesystems long before Hadoop 2, which makes processing of existing data much easier. To achieve these features, MapR replaced HDFS with a fully POSIX-compliant filesystem that has no NameNode, resulting in a truly distributed system with no master, and a claim of much better hardware utilization than Apache HDFS.

MapR provides both a community and an enterprise edition of its distribution; not all the extensions are available in the free product. The company also offers support services as part of the enterprise product subscription, in addition to training and consultancy.


And the rest…

Hadoop distributions are not just the territory of young start-ups, nor are they a static marketplace. Intel had its own distribution until early 2014, when it decided to fold its changes into CDH instead. IBM has its own distribution called IBM InfoSphere BigInsights, available in both free and commercial editions. There are also stories of numerous large enterprises rolling their own distributions, some of which are made openly available while others are not. You will have no shortage of options with so many high-quality distributions available.


Choosing a distribution

This raises the question: how do you choose a distribution? As can be seen, the available distributions (and we didn't cover them all) range from convenient packaging and integration of fully open source products through to entire bespoke integration and analysis layers atop them. There is no overall best distribution; think carefully about your requirements and consider the alternatives. Since all of these offer a free download of at least a basic version, it's good to simply play with them and experience the options for yourself.


Other computational frameworks

We've frequently discussed the myriad possibilities brought to the Hadoop platform by YARN. We went into the details of two new models, Samza and Spark. Additionally, other more established frameworks, such as Pig, are also being ported to the framework.

To give a view of the much bigger picture, in this section we will illustrate the breadth of processing possible by presenting a set of computational models that are currently being ported to Hadoop on top of YARN.


Apache Storm

Storm (http://storm.apache.org) is a distributed computation framework written (mainly) in the Clojure programming language. It uses custom-created spouts and bolts to define information sources and manipulations, allowing distributed processing of streaming data. A Storm application is designed as a topology of interfaces that creates a stream of transformations. It provides similar functionality to a MapReduce job, with the exception that the topology will theoretically run indefinitely until it is manually terminated.

Though initially built distinct from Hadoop, a YARN port is being developed by Yahoo! and can be found at https://github.com/yahoo/storm-yarn.


Apache Giraph

Giraph originated as the open source implementation of Google's Pregel paper (which can be found at http://kowshik.github.io/JPregel/pregel_paper.pdf). Both Giraph and Pregel are inspired by the Bulk Synchronous Parallel (BSP) model of distributed computation introduced by Valiant in 1990. Giraph adds several features, including master computation, sharded aggregators, edge-oriented input, and out-of-core computation. The YARN port can be found at https://issues.apache.org/jira/browse/GIRAPH-13.


Apache HAMA

Hama is a top-level Apache project that aims, like other methods we've encountered so far, to address the weakness of MapReduce with regard to iterative programming. Similar to the aforementioned Giraph, Hama implements BSP techniques and has been heavily inspired by the Pregel paper. The YARN port can be found at https://issues.apache.org/jira/browse/HAMA-431.


Other interesting projects

Whether you use a bundled distribution or stick with the base Apache Hadoop download, you will encounter many references to other related projects. We've covered several of these, such as Hive, Samza, and Crunch, in this book; we'll now highlight some of the others.

Note that this coverage seeks to point out the highlights (from the authors' perspective) as well as give a taste of the breadth of the types of projects available. As mentioned earlier, keep looking around, as new ones are launching all the time.


HBase

Perhaps the most popular Apache Hadoop-related project that we didn't cover in this book is HBase (http://hbase.apache.org). Based on the BigTable model of data storage publicized by Google in an academic paper (sound familiar?), HBase is a non-relational data store sitting atop HDFS.

While both MapReduce and Hive focus on batch-like data access patterns, HBase instead seeks to provide very low-latency access to data. Consequently, HBase can, unlike the aforementioned technologies, directly support user-facing services.

The HBase data model is not the relational approach used in Hive and all other RDBMSs, nor does it offer the full ACID guarantees that are taken for granted with relational stores. Instead, it is a key-value, schema-less solution that takes a column-oriented view of data; columns can be added at runtime and depend on the values inserted into HBase. Each lookup operation is then very fast, as it is effectively a key-value mapping from the row key to the desired column. HBase also treats timestamps as another dimension on the data, so one can directly retrieve data from a point in time.
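To give a flavour of that data model, here is a small, illustrative session in the HBase shell; the table and column family names are made up for the example:

# Create a table with one column family, then write and read a cell
hbase> create 'users', 'info'
hbase> put 'users', 'row-42', 'info:email', 'someone@example.com'
hbase> get 'users', 'row-42'
# Ask for older versions of a cell (assuming the column family keeps multiple versions)
hbase> get 'users', 'row-42', {COLUMN => 'info:email', VERSIONS => 3}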

The data model is very powerful but does not suit all use cases, just as the relational model isn't universally applicable. But if you have a requirement for structured, low-latency views on large-scale data stored in Hadoop, then HBase is absolutely something you should look at.


Sqoop

In Chapter 7, Hadoop and SQL, we looked at tools for presenting a relational-like interface to data stored on HDFS. Often, such data either needs to be retrieved from an existing relational database, or the output of its processing needs to be stored back into one.

Apache Sqoop (http://sqoop.apache.org) provides a mechanism for declaratively specifying data movement between relational databases and Hadoop. It takes a task definition and from this generates MapReduce jobs to execute the required data retrieval or storage. It will also generate code to help manipulate the relational records as custom Java classes. In addition, it can integrate with HBase and HCatalog/Hive, providing a very rich set of integration possibilities.
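A typical Sqoop 1 invocation is a single command line; the following is an illustrative sketch only, with a made-up MySQL database, table, and target directory:

# Import a relational table into HDFS as delimited text, using four parallel map tasks
$ sqoop import \
    --connect jdbc:mysql://dbhost/sales \
    --username reporting \
    --password-file /user/cloudera/db.password \
    --table orders \
    --target-dir /data/raw/orders \
    --num-mappers 4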

At the time of writing, Sqoop is slightly in flux. Its original version, Sqoop 1, is a pure client-side application. Much like the original Hive command-line tool, Sqoop 1 has no server and generates all code on the client. This unfortunately means that each client needs to know a lot of details about the physical data sources, including exact hostnames as well as authentication credentials.

Sqoop 2 provides a centralized Sqoop server that encapsulates all these details and offers the various configured data sources to the connecting clients. It is a superior model, but at the time of writing, the general community recommendation is to stick with Sqoop 1 until the new version evolves further. Check on the current status if you are interested in this type of tool.


Whirr

When looking to use cloud services such as Amazon AWS for Hadoop deployments, it is usually a lot easier to use a higher-level service such as Elastic MapReduce as opposed to setting up your own cluster on EC2. Though there are scripts to help, the fact is that the overhead of Hadoop-based deployments on cloud infrastructures can be involved. That's where Apache Whirr (https://whirr.apache.org/) comes in.

Whirr isn't focused on Hadoop; it's about supplier-independent instantiation of cloud services, of which Hadoop is a single example. Whirr aims to provide a programmatic way of specifying and creating Hadoop-based deployments on cloud infrastructures in a way that handles all the underlying service aspects for you. It does this in a provider-independent fashion so that, once you've launched on, say, EC2, you can use the same code to create an identical setup on another provider such as Rightscale or Eucalyptus. This makes vendor lock-in, often a concern with cloud deployments, less of an issue.

Whirr isn't quite there yet. Today, it is limited in the services it can create and the providers it supports; however, if you are interested in cloud deployment with less pain, then it's worth watching its progress.

Note: If you are building out your full infrastructure on Amazon Web Services, then you might find CloudFormation gives much of the same ability to define application requirements, though obviously in an AWS-specific fashion.


Mahout

Apache Mahout (http://mahout.apache.org/) is a collection of distributed algorithms, Java classes, and tools for performing advanced analytics on top of Hadoop. Similar to Spark's MLlib, briefly mentioned in Chapter 5, Iterative Computation with Spark, Mahout ships with a number of algorithms for common use cases: recommendation, clustering, regression, and feature engineering. Although the system is focused on natural language processing and text-mining tasks, its building blocks (linear algebra operations) are suitable for application to a number of domains. As of version 0.9, the project is being decoupled from the MapReduce framework in favor of richer programming models such as Spark. The community's end goal is to obtain a platform-independent library based on a Scala DSL.


Hue

Initially developed by Cloudera and marketed as the "User Interface for Hadoop", Hue (http://gethue.com/) is a collection of applications, bundled together under a common web interface, that act as clients for the core services and a number of components of the Hadoop ecosystem:

The Hue Query Editor for Hive

Hue leverages many of the tools we discussed in previous chapters and provides an integrated interface for analyzing and visualizing data. Two components are particularly interesting. On one hand, there is a query editor that allows the user to create and save Hive (or Impala) queries, export the result set in CSV or Microsoft Office Excel format, as well as plot it in the browser. The editor supports sharing both HiveQL and result sets, thus facilitating collaboration within an organization. On the other hand, there is an Oozie workflow and coordinator editor that allows a user to create and deploy Oozie jobs manually, automating the generation of XML configurations and boilerplate.

Both the Cloudera and Hortonworks distributions ship with Hue, which typically includes the following:

- A file manager for HDFS
- A Job Browser for YARN (MapReduce)
- An Apache HBase browser
- A Hive metastore explorer
- Query editors for Hive and Impala
- A script editor for Pig
- A job editor for MapReduce and Spark
- An editor for Sqoop 2 jobs
- An Oozie workflow editor and dashboard
- An Apache ZooKeeper browser

On top of this, Hue is a framework with an SDK that contains a number of web assets, APIs, and patterns for developing third-party applications that interact with Hadoop.


Other programming abstractions

Hadoop isn't just extended by additional functionality; there are also tools that provide entirely different paradigms for writing the code used to process your data within Hadoop.


Cascading

Developed by Concurrent, and open sourced under an Apache license, Cascading (http://www.cascading.org/) is a popular framework that abstracts the complexity of MapReduce away and allows us to create complex workflows on top of Hadoop. Cascading jobs can compile to, and be executed on, MapReduce, Tez, and Spark. Conceptually, the framework is similar to Apache Crunch, covered in Chapter 9, Making Development Easier, though practically there are differences in terms of data abstractions and end goals. Cascading adopts a tuple data model (similar to Pig) rather than arbitrary objects, and encourages the user to rely on a higher-level DSL, powerful built-in types, and tools to manipulate data.

Put in simple terms, Cascading is to Pig Latin and HiveQL what Crunch is to a user-defined function.

Like Morphlines, which we also saw in Chapter 9, Making Development Easier, the Cascading data model follows a source-pipe-sink approach, where data is captured from a source, piped through a number of processing steps, and its output is then delivered into a sink, ready to be picked up by another application.

Cascading lets developers write code in a number of JVM languages. Ports of the framework exist for Python (PyCascading), JRuby (Cascading.jruby), Clojure (Cascalog), and Scala (Scalding). Cascalog and Scalding in particular have gained a lot of traction and spawned their very own ecosystems.

An area where Cascading excels is documentation. The project provides comprehensive javadocs for the API, extensive tutorials (http://www.cascading.org/documentation/tutorials/), and an interactive, exercise-based learning environment (https://github.com/Cascading/Impatient).

Another strong selling point of Cascading is its integration with third-party environments. Amazon EMR supports Cascading as a first-class processing framework and allows us to launch Cascading clusters both with the command line and web interfaces (http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/CreateCascading.html). Plugins for the SDK exist for both the IntelliJ IDEA and Eclipse integrated development environments. One of the framework's top projects, Cascading Patterns, a collection of machine-learning algorithms, features a utility for translating Predictive Model Markup Language (PMML) documents into applications on Apache Hadoop, thus facilitating interoperability with popular statistical environments and scientific tools such as R (http://cran.r-project.org/web/packages/pmml/index.html).


AWS resources

Many Hadoop technologies can be deployed on AWS as part of a self-managed cluster. However, just as Amazon offers support for Elastic MapReduce, which handles Hadoop as a managed service, there are a few other services that are worth mentioning.


SimpleDB and DynamoDB

For some time, AWS has offered SimpleDB as a hosted service providing an HBase-like data model.

It has, however, largely been superseded by a more recent service from AWS, DynamoDB, located at http://aws.amazon.com/dynamodb. Though its data model is very similar to that of SimpleDB and HBase, it is aimed at a very different type of application. Where SimpleDB has quite a rich search API but is very limited in terms of size, DynamoDB provides a more constrained, though constantly evolving, API, but with a service guarantee of near-unlimited scalability.

The DynamoDB pricing model is particularly interesting; instead of paying for a certain number of servers hosting the service, you allocate a certain capacity for read and write operations, and DynamoDB manages the resources required to meet this provisioned capacity. This is an interesting development, as it is a purer service model, where the mechanism for delivering the desired performance is kept completely opaque to the service user. Have a look at DynamoDB if you need a much larger data store than SimpleDB can offer; however, do consider the pricing model carefully, as provisioning too much capacity can become very expensive very quickly. Amazon provides some good best practices for DynamoDB at the following URL, which illustrate that minimizing the service costs can result in additional application-layer complexity: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/BestPractices.html.

Note: Of course, the discussion of DynamoDB and SimpleDB assumes a non-relational data model; for a relational database in the cloud, there is the Amazon Relational Database Service (Amazon RDS).


Kinesis

Just as EMR is hosted Hadoop and DynamoDB has similarities to a hosted HBase, it wasn't surprising to see AWS announce Kinesis, a hosted streaming-data service, in 2013. It can be found at http://aws.amazon.com/kinesis, and it has very similar conceptual building blocks to the stack of Samza atop Kafka. Kinesis provides a partitioned view of messages as a stream of data and an API for callbacks that execute when messages arrive. As with most AWS services, there is tight integration with other services, making it easy to get data into and out of locations such as S3.


Data Pipeline

The final AWS service that we'll mention is Data Pipeline, which can be found at http://aws.amazon.com/datapipeline. As the name suggests, it is a framework for building up data-processing jobs that involve multiple steps, data movements, and transformations. It has quite a conceptual overlap with Oozie, but with a few twists. Firstly, Data Pipeline has the expected deep integration with many other AWS services, enabling easy definition of data workflows that incorporate diverse repositories such as RDS, S3, and DynamoDB. In addition, however, Data Pipeline has the ability to integrate with agents installed on local infrastructure, providing an interesting avenue for building workflows that span the AWS and on-premises environments.


Sources of information

You don't just need new technologies and tools, even if they are cool. Sometimes, a little help from a more experienced source can pull you out of a hole. In this regard, you are well covered, as the Hadoop community is extremely strong in many areas.


Source code

It's sometimes easy to overlook, but Hadoop and all the other Apache projects are, after all, fully open source. The actual source code is the ultimate source (pardon the pun) of information about how the system works. Becoming familiar with the source and tracing through some of the functionality can be hugely informative, not to mention helpful when you are hitting unexpected behavior.


Mailing lists and forums

Almost all the projects and services listed in this chapter have their own mailing lists and/or forums; check out the home pages for the specific links. Most distributions also have their own forums and other mechanisms to share knowledge and get (non-commercial) help from the community. Additionally, if you are using AWS, make sure to check out the AWS developer forums at https://forums.aws.amazon.com.

Always remember to read the posting guidelines carefully and understand the expected etiquette. These are tremendous sources of information; the lists and forums are frequently visited by the developers of the particular project. Expect to see the core Hadoop developers on the Hadoop lists, Hive developers on the Hive list, EMR developers on the EMR forums, and so on.


LinkedIn groups

There are a number of Hadoop and related groups on the professional social network LinkedIn. Do a search for your particular areas of interest, but a good starting point might be the general Hadoop users' group at http://www.linkedin.com/groups/Hadoop-Users-988957.


HUGs

If you want more face-to-face interaction, then look for a Hadoop User Group (HUG) in your area; most of these are listed at http://wiki.apache.org/hadoop/HadoopUserGroups. They tend to arrange semi-regular get-togethers that combine things such as quality presentations, the ability to discuss technology with like-minded individuals, and often pizza and drinks.

No HUG near where you live? Consider starting one.


Conferences

Though some industries take decades to build up a conference circuit, Hadoop already has significant conference activity involving the open source, academic, and commercial worlds. Events such as the Hadoop Summit and Strata are pretty big; these and some others are linked from http://wiki.apache.org/hadoop/Conferences.


Summary

In this chapter, we took a quick gallop around the broader Hadoop ecosystem, looking at the following topics:

- Why alternative Hadoop distributions exist, and some of the more popular ones
- Other projects that provide capabilities, extensions, or Hadoop supporting tools
- Alternative ways of writing or creating Hadoop jobs
- Sources of information and how to connect with other enthusiasts

Now, go have fun and build something amazing!


IndexA

additionaldata,collectingabout/Collectingadditionaldataworkflows,scheduling/SchedulingworkflowsOozietriggers/OtherOozietriggers

addMappermethod,argumentsjob/Textcleanupusingchainmapperclass/TextcleanupusingchainmapperinputKeyClass/TextcleanupusingchainmapperinputValueClass/TextcleanupusingchainmapperoutputKeyClass/TextcleanupusingchainmapperoutputValueClass/TextcleanupusingchainmappermapperConf/Textcleanupusingchainmapper

alternativedistributionsabout/AlternativedistributionsClouderaDistribution/ClouderaDistributionforHadoopHortonworksDataPlatform(HDP)/HortonworksDataPlatformMapR/MapRselecting/Choosingadistribution

Amazonaccountreferencelink/CreatinganAWSaccount

AmazonCLIreferencelink/TheAWScommand-lineinterface

AmazonEMRabout/AmazonEMRAWSaccount,creating/CreatinganAWSaccountrequiredservices,signingup/Signingupforthenecessaryservices

AmazonRelationalDatabaseService(AmazonRDS)/SimpleDBandDynamoDBAmazonWebServices

Hive,workingwith/HiveandAmazonWebServicesAmbari

about/Ambari–theopensourcealternativeURL/Ambari–theopensourcealternative

AMPLabatUCBerkeley,URL/ApacheSpark

ApacheAvroabout/AvroURL/Avro

ApacheCrunchabout/ApacheCrunchURL/ApacheCrunch

Page 476: index-of.co.ukindex-of.co.uk/Big-Data-Technologies/Learning Hadoop 2 - Garry... · Table of Contents Learning Hadoop 2 Credits About the Authors About the Reviewers Support files,

JARs/Gettingstartedlibraries/Gettingstartedconcepts/ConceptsPCollection<T>interface/ConceptsPTable<Key,Value>interface/Conceptsdataserialization/Dataserializationdataprocessingpatterns/DataprocessingpatternsPipelinesimplementation/Pipelinesimplementationandexecutionexecution/Pipelinesimplementationandexecutionexamples/CrunchexamplesKiteMorphlines/KiteMorphlines

ApacheDataFureferencelink/ContributedUDFs,ApacheDataFuabout/ApacheDataFu

ApacheGiraphabout/ApacheGiraphURL/ApacheGiraph

ApacheHAMAabout/ApacheHAMA

ApacheKafkaURL/ApacheSamza,Samza’sbestfriend–ApacheKafkaabout/Samza’sbestfriend–ApacheKafkaTwitterdata,gettinginto/GettingTwitterdataintoKafka

ApacheKnoxabout/BeyondbasicauthorizationURL/Beyondbasicauthorization

ApacheSentryURL/Beyondbasicauthorization

ApacheSparkabout/ApacheSpark,GettingstartedwithSparkURL/ApacheSpark,GettingstartedwithSparkclustercomputing,withworkingsets/ClustercomputingwithworkingsetsResilientDistributedDatasets(RDDs)/ResilientDistributedDatasets(RDDs)actions/Actionsdeployment/DeploymentonYARN/SparkonYARNonEC2/SparkonEC2standaloneapplications,writing/WritingandrunningstandaloneapplicationsScalaAPI/ScalaAPIJavaAPI/JavaAPIWordCount,inJava/WordCountinJavaPythonAPI/PythonAPIdata,processing/ProcessingdatawithApacheSpark

ApacheSpark,ecosystem

Page 477: index-of.co.ukindex-of.co.uk/Big-Data-Technologies/Learning Hadoop 2 - Garry... · Table of Contents Learning Hadoop 2 Credits About the Authors About the Reviewers Support files,

about/TheSparkecosystemSparkStreaming/SparkStreamingGraphX/GraphXMLLib/MLlibSparkSQL/SparkSQL

ApacheStormabout/ApacheStormURL/ApacheStorm

ApacheThriftabout/ThriftURL/Thrift

ApacheTikaabout/MultijobworkflowsURL/Multijobworkflows

ApacheTwillURL/Thinkinginlayers

ApacheZooKeeperabout/ApacheZooKeeper–adifferenttypeoffilesystemURL/ApacheZooKeeper–adifferenttypeoffilesystemdistributedlock,implementingwithsequentialZNodes/ImplementingadistributedlockwithsequentialZNodesgroupmembership,implementing/ImplementinggroupmembershipandleaderelectionusingephemeralZNodesleaderelection,implementingwithephemeralZNodes/ImplementinggroupmembershipandleaderelectionusingephemeralZNodesJavaAPI/JavaAPIblocks,building/Buildingblocksused,forenablingautomaticNameNodefailover/AutomaticNameNodefailover

applicationdevelopmentframework,selecting/Choosingaframework

ApplicationManagerabout/ResourceManager,NodeManager,andApplicationManager

ApplicationMaster(AM)about/AnatomyofaYARNapplication

architecturalprinciples,HDFSandMapReduce/CommonbuildingblocksArraywrapperclasses

about/ArraywrapperclassesautomaticNameNodefailover

enabling/AutomaticNameNodefailoverAvro

about/AvroAvroschemaevolution,using

thoughts/FinalthoughtsonusingAvroschemaevolution

Page 478: index-of.co.ukindex-of.co.uk/Big-Data-Technologies/Learning Hadoop 2 - Garry... · Table of Contents Learning Hadoop 2 Credits About the Authors About the Reviewers Support files,

additivechanges,making/Onlymakeadditivechangesschemaversions,managingexplicitly/Manageschemaversionsexplicitlyschemadistribution/Thinkaboutschemadistribution

Avroschemasabout/UsingtheJavaAPI

AvroSerdeURL/Avroabout/Avro

AWSabout/DistributionsofApacheHadoop,AWS–infrastructureondemandfromAmazonSimpleStorageService(S3)/SimpleStorageService(S3)ElasticMapReduce(EMR)/ElasticMapReduce(EMR)

AWScommand-lineinterfaceabout/TheAWScommand-lineinterfacereferencelink/TheAWScommand-lineinterface

AWScredentialsabout/AWScredentialsaccountID/AWScredentialsaccesskey/AWScredentialssecretaccesskey/AWScredentialskeypairs/AWScredentialsreferencelink/AWScredentials

AWSdeveloperforumsURL/Mailinglistsandforums

AWSresourcesabout/AWSresourcesSimpleDB/SimpleDBandDynamoDBDynamoDB/SimpleDBandDynamoDBDataPipeline/DataPipeline

Page 479: index-of.co.ukindex-of.co.uk/Big-Data-Technologies/Learning Hadoop 2 - Garry... · Table of Contents Learning Hadoop 2 Credits About the Authors About the Reviewers Support files,

Bblockreplication

about/BlockreplicationBulkSynchronousParallel(BSP)model

about/ApacheGiraph

Page 480: index-of.co.ukindex-of.co.uk/Big-Data-Technologies/Learning Hadoop 2 - Garry... · Table of Contents Learning Hadoop 2 Credits About the Authors About the Reviewers Support files,

CCascading

about/CascadingURL/Cascadingreferencelinks/Cascading

ClouderaURL/DistributionsofApacheHadoopURL,fordocumentation/ClouderaManagerURL,forblogpost/Sharingresources

Clouderadistributionabout/ClouderaDistributionforHadoopURL/ClouderaDistributionforHadoop

ClouderaHadoopDistribution(CDH)about/ClouderaManager

ClouderaKittenURL/Thinkinginlayers

ClouderaManagerabout/ClouderaManagerpayment,forsubscriptionservices/Topayornottopayclustermanagement,performing/ClustermanagementusingClouderaManagerintegrating,withsystemsmanagementtools/ClouderaManagerandothermanagementtoolsmonitoringwith/MonitoringwithClouderaManagerlogfiles,finding/Findingconfigurationfiles

ClouderaManagerAPIabout/ClouderaManagerAPI

ClouderaManagerlock-inabout/ClouderaManagerlock-in

ClouderaQuickstartVMabout/ClouderaQuickStartVMadvantages/ClouderaQuickStartVM

clusterbuilding,onEMR/BuildingaclusteronEMR

cluster,APacheSparkcomputing,withworkingsets/Clustercomputingwithworkingsets

cluster,onEMRfilesystem,considerations/Considerationsaboutfilesystemsdata,obtainingintoEMR/GettingdataintoEMREC2instances/EC2instancesandtuningEC2tuning/EC2instancesandtuning

clustermanagementperforming,ClouderaManagerused/ClustermanagementusingClouderaManager

Page 481: index-of.co.ukindex-of.co.uk/Big-Data-Technologies/Learning Hadoop 2 - Garry... · Table of Contents Learning Hadoop 2 Credits About the Authors About the Reviewers Support files,

clusterstartup,HDFSabout/ClusterstartupNameNodestartup/NameNodestartupDataNodestartup/DataNodestartup

clustertuningabout/ClustertuningJVMconsiderations/JVMconsiderationsmapoptimization/Mapandreduceoptimizationsreduceoptimization/Mapandreduceoptimizations

column-orienteddataformatsabout/Column-orienteddataformatsRCFile/RCFileORC/ORCParquet/ParquetAvro/AvroJavaAPI,using/UsingtheJavaAPI

columnarabout/Columnarstores

columnarstores/Columnarstorescombinerclass,JavaAPItoMapReduce

about/CombinercombineValuesoperation

about/Conceptscommand-lineaccess,HDFSfilesystem

about/Command-lineaccesstotheHDFSfilesystemhdfscommand/Command-lineaccesstotheHDFSfilesystemdfscommand/Command-lineaccesstotheHDFSfilesystemdfsadmincommand/Command-lineaccesstotheHDFSfilesystem

Comparableinterfaceabout/TheComparableandWritableComparableinterfaces

complexdatatypesmap/Pigdatatypestuple/Pigdatatypesbag/Pigdatatypes

complexeventprocessing(CEP)about/HowSamzaworks

components,Hadoopabout/ComponentsofHadoopcommonbuildingblocks/Commonbuildingblocksstorage/Storagecomputation/Computation

components,YARNabout/ThecomponentsofYARNResourceManager(RM)/ThecomponentsofYARN

Page 482: index-of.co.ukindex-of.co.uk/Big-Data-Technologies/Learning Hadoop 2 - Garry... · Table of Contents Learning Hadoop 2 Credits About the Authors About the Reviewers Support files,

NodeManager(NM)/ThecomponentsofYARNcomputation

about/Computationcomputation,Hadoop2

about/ComputationinHadoop2computationalframeworks

about/OthercomputationalframeworksApacheStorm/ApacheStormApacheGiraph/ApacheGiraph,ApacheHAMA

conferencesabout/Conferencesreferencelink/Conferences

configurationfile,Samzaabout/Theconfigurationfile

containersabout/SerializationandContainers

contributedUDFsabout/ContributedUDFsPiggybank/PiggybankElephantBird/ElephantBirdApacheDataFu/ApacheDataFu

create.hqlscriptreferencelink/ExtractingdataandingestingintoHive

Crunchexamplesabout/Crunchexampleswordco-occurrence/Wordco-occurrenceTF-IDF/TF-IDF

Curatorprojectreferencelink/Buildingblocks

Page 483: index-of.co.ukindex-of.co.uk/Big-Data-Technologies/Learning Hadoop 2 - Garry... · Table of Contents Learning Hadoop 2 Credits About the Authors About the Reviewers Support files,

Ddata,managing

about/ManagingandserializingdataWritableinterface/TheWritableinterfacewrapperclasses/IntroducingthewrapperclassesArraywrapperclasses/ArraywrapperclassesComparableinterface/TheComparableandWritableComparableinterfacesWritableComparableinterface/TheComparableandWritableComparableinterfaces

data,Pigworkingwith/WorkingwithdataFILTERoperator/Filteringaggregation/AggregationFOREACHoperator/ForeachJOINoperator/Join

data,storingabout/Storingdataserializationfileformat/SerializationandContainerscontainersfileformat/SerializationandContainersfilecompression/Compressiongeneral-purposefileformats/General-purposefileformatscolumn-orienteddataformats/Column-orienteddataformats

Datacoreabout/DataCore

DataCrunchabout/DataCrunch

DataHCatalogabout/DataHCatalog

DataHiveabout/DataHive

datalifecyclemanagementabout/Whatdatalifecyclemanagementisimportance/Importanceofdatalifecyclemanagementtools/Toolstohelp

DataMapReduceabout/DataMapReduce

DataNode/NameNodeandDataNodeDataNodes

about/StorageinHadoop2DataNodestartup

about/DataNodestartupDataPipeline

about/DataPipeline

Page 484: index-of.co.ukindex-of.co.uk/Big-Data-Technologies/Learning Hadoop 2 - Garry... · Table of Contents Learning Hadoop 2 Credits About the Authors About the Reviewers Support files,

  reference link / Data Pipeline
data processing
  about / Data processing with Hadoop
  dataset, generating from Twitter / Why Twitter?
  dataset, building / Building our first dataset
  programmatic access, with Python / Programmatic access with Python
data processing, Apache Spark
  about / Processing data with Apache Spark
  examples, running / Building and running the examples
  examples, building / Building and running the examples
  examples, running on YARN / Running the examples on YARN
  popular topics, finding / Finding popular topics
  sentiment, assigning to topics / Assigning a sentiment to topics
  on streams / Data processing on streams
  state management / State management
  data analysis, with Spark SQL / Data analysis with Spark SQL
  SQL, on data streams / SQL on data streams
data processing patterns, Crunch
  about / Data processing patterns
  aggregation and sorting / Aggregation and sorting
  joining data / Joining data
data serialization, Crunch
  about / Data serialization
dataset, building with Twitter
  about / Building our first dataset
  multiple APIs, using / One service, multiple APIs
  anatomy, of Tweet / Anatomy of a Tweet
  Twitter credentials / Twitter credentials
Data Spark
  about / Data Spark
data types, Hive
  numeric / Data types
  date and time / Data types
  string / Data types
  collections / Data types
  misc / Data types
data types, Pig
  scalar data types / Pig data types
  complex data types / Pig data types
DDL statements, Hive / DDL statements
decayFactor function / State management
DEFINE operator
  about / Extending Pig (UDFs)
derived data, producing
  about / Producing derived data
  multiple actions, performing in parallel / Performing multiple actions in parallel
  subworkflow, calling / Calling a subworkflow
  global settings, adding / Adding global settings
DevOps practices / Hadoop and DevOps practices
directed acyclic graph (DAG)
  about / YARN
document frequency
  about / Calculate document frequency
  calculating, TF-IDF used / Calculate document frequency
Drill
  URL / Drill, Tajo, and beyond
  about / Drill, Tajo, and beyond
Driver class, Java API to MapReduce
  about / The Driver class
dynamic invokers
  about / Dynamic invokers
  reference link / Dynamic invokers
DynamoDB
  URL / SimpleDB and DynamoDB
  about / SimpleDB and DynamoDB
E
EC2
  Apache Spark on / Spark on EC2
EC2 key-value pair
  reference link / The AWS command-line interface
Elastic MapReduce
  Hive, using with / Hive on Elastic MapReduce
Elastic MapReduce (EMR)
  about / Distributions of Apache Hadoop, Elastic MapReduce (EMR)
  URL / Elastic MapReduce (EMR)
  using / Using Elastic MapReduce
Elephant Bird
  reference link / Contributed UDFs, Elephant Bird
EMR
  cluster, building on / Building a cluster on EMR
  URL, for best practices / Building a cluster on EMR
EMR documentation
  URL / Hive on Elastic MapReduce
entities
  about / Tweet metadata
ephemeral ZNodes
  about / Implementing group membership and leader election using ephemeral ZNodes
eval functions, Pig
  AVG(expression) / Eval
  COUNT(expression) / Eval
  COUNT_STAR(expression) / Eval
  IsEmpty(expression) / Eval
  MAX(expression) / Eval
  MIN(expression) / Eval
  SUM(expression) / Eval
  TOKENIZE(expression) / Eval
examples
  running / Running the examples
examples, MapReduce programs
  reference link / Running the examples
  local cluster / Local cluster
  Elastic MapReduce / Elastic MapReduce
examples and source code
  download link / Getting started
ExecutionEngine interface / An overview of Pig
external data, challenges
  about / Challenges of external data
  data validation / Data validation
  validation actions / Validation actions
  format changes, handling / Handling format changes
  schema evolution, handling with Avro / Handling schema evolution with Avro
EXTERNAL keyword / DDL statements
Extract-Transform-Load (ETL) / DDL statements
extract_for_hive.pig
  URL, for source code / Prerequisites
F
Falcon
  URL / Other tools to help
  about / Other tools to help
file format, Hive
  about / File formats and storage
  JSON / JSON
FileFormat classes, Hive
  TextInputFormat / File formats and storage
  HiveIgnoreKeyTextOutputFormat / File formats and storage
  SequenceFileInputFormat / File formats and storage
  SequenceFileOutputFormat / File formats and storage
filesystem metadata, HDFS
  protecting / Protecting the filesystem metadata
  Secondary NameNode, demerits / Secondary NameNode not to the rescue
  Hadoop 2 NameNode HA / Hadoop 2 NameNode HA
  client configuration / Client configuration
  failover, working / How a failover works
FILTER operator
  about / Filtering
FlumeJava
  reference link / Apache Crunch
FOREACH operator
  about / Foreach
fork node
  about / Performing multiple actions in parallel
functions, Pig
  about / Pig functions
  built-in functions / Pig functions
  reference link, for built-in functions / Pig functions
  load/store functions / Load/store
  eval / Eval
  tuple / The tuple, bag, and map functions
  bag / The tuple, bag, and map functions
  map / The tuple, bag, and map functions
  string / The math, string, and datetime functions
  math / The math, string, and datetime functions
  datetime / The math, string, and datetime functions
  dynamic invokers / Dynamic invokers
  macros / Macros
G
Garbage Collection (GC) / JVM considerations
Garbage First (G1) collector / JVM considerations
general-purpose file formats
  about / General-purpose file formats
  Text files / General-purpose file formats
  SequenceFile / General-purpose file formats
general availability (GA) / A note on versioning
Google Chubby system
  reference link / Apache ZooKeeper – a different type of filesystem
Google File System (GFS)
  reference link / The background of Hadoop
Gradle
  URL / Running the examples
GraphX
  about / GraphX
  URL / GraphX
groupByKey() method / Aggregation and sorting
groupByKey(GroupingOptions options) method / Aggregation and sorting
groupByKey(int numPartitions) method / Aggregation and sorting
groupByKey operation
  about / Concepts
GROUP operator
  about / Aggregation
Grunt
  about / Grunt – the Pig interactive shell
  sh command / Grunt – the Pig interactive shell
  help command / Grunt – the Pig interactive shell
Guava library
  URL / The TopN pattern
H
Hadoop
  versioning / A note on versioning
  background / The background of Hadoop
  components / Components of Hadoop
  dual approach / A dual approach
  about / Getting started
  using / Getting Hadoop up and running
  EMR, using / How to use EMR
  AWS credentials / AWS credentials
  data processing / Data processing with Hadoop
  practices / Hadoop and DevOps practices
  alternative distributions / Alternative distributions
  computational frameworks / Other computational frameworks
  interesting projects / Other interesting projects
  programming abstractions / Other programming abstractions
  AWS resources / AWS resources
  sources of information / Sources of information
Hadoop-provided InputFormat, MapReduce job
  about / Hadoop-provided InputFormat
  FileInputFormat / Hadoop-provided InputFormat
  SequenceFileInputFormat / Hadoop-provided InputFormat
  TextInputFormat / Hadoop-provided InputFormat
  KeyValueTextInputFormat / Hadoop-provided InputFormat
Hadoop-provided Mapper and Reducer implementations, Java API to MapReduce
  about / Hadoop-provided mapper and reducer implementations
  mappers / Hadoop-provided mapper and reducer implementations
  reducers / Hadoop-provided mapper and reducer implementations
Hadoop-provided OutputFormat, MapReduce job
  about / Hadoop-provided OutputFormat
  FileOutputFormat / Hadoop-provided OutputFormat
  NullOutputFormat / Hadoop-provided OutputFormat
  SequenceFileOutputFormat / Hadoop-provided OutputFormat
  TextOutputFormat / Hadoop-provided OutputFormat
Hadoop-provided RecordReader, MapReduce job
  about / Hadoop-provided RecordReader
  LineRecordReader / Hadoop-provided RecordReader
  SequenceFileRecordReader / Hadoop-provided RecordReader
Hadoop 2
  about / Hadoop 2 – what's the big deal?
  storage / Storage in Hadoop 2
  computation / Computation in Hadoop 2
  diagrammatic representation, architecture / Computation in Hadoop 2
  reference link / Getting started
  operations / Operations in the Hadoop 2 world
Hadoop 2 NameNode HA
  about / Hadoop 2 NameNode HA
  enabling / Hadoop 2 NameNode HA
  keeping, in sync / Keeping the HA NameNodes in sync
Hadoop Distributed File System (HDFS) / NameNode and DataNode
Hadoop distributions
  about / Distributions of Apache Hadoop
  Hortonworks / Distributions of Apache Hadoop
  Cloudera / Distributions of Apache Hadoop
  MapR / Distributions of Apache Hadoop
  reference link / Distributions of Apache Hadoop
Hadoop filesystems
  about / Hadoop filesystems
  reference link / Hadoop filesystems
  Hadoop interfaces / Hadoop interfaces
Hadoop interfaces
  about / Hadoop interfaces
  Java FileSystem API / Java FileSystem API
  Libhdfs / Libhdfs
  Apache Thrift / Thrift
Hadoop operations
  about / I'm a developer – I don't care about operations!
Hadoop security
  future / The future of Hadoop security
Hadoop security model
  evolution / Evolution of the Hadoop security model
  additional security features / Beyond basic authorization
Hadoop streaming
  about / Hadoop streaming
  wordcount, streaming in Python / Streaming wordcount in Python
  differences in jobs / Differences in jobs when using streaming
  importance of words, determining / Finding important words in text
Hadoop UI
  URL / Other tools to help
  about / Other tools to help
Hadoop User Group (HUG) / HUGs
hashtagRegExp / Trending topics
hashtags
  about / Sentiment of hashtags
HBase
  about / HBase
  URL / HBase
HCatalog
  about / Introducing HCatalog
  using / Using HCatalog
HCat CLI tool
  about / Using HCatalog
hcat utility
  about / Using HCatalog
HDFS
  about / Components of Hadoop, Storage, Samza and HDFS
  characteristics / Storage
  architecture / The inner workings of HDFS
  NameNode / The inner workings of HDFS
  DataNodes / The inner workings of HDFS
  cluster startup / Cluster startup
  block replication / Block replication
HDFS and MapReduce
  merits / Better together
HDFS filesystem
  command-line access / Command-line access to the HDFS filesystem
  exploring / Exploring the HDFS filesystem
HDFS snapshots
  about / HDFS snapshots
Hello Samza
  about / Hello Samza!
  URL / Hello Samza!
high-availability (HA)
  about / Storage in Hadoop 2
High Performance Computing (HPC) / Computation in Hadoop 2
Hive
  about / Hive-on-tez
  URL / Hive-on-tez
  overview / Overview of Hive
  data types / Data types
  DDL statements / DDL statements
  file formats / File formats and storage
  storage / File formats and storage
  queries / Queries
  scripts, writing / Writing scripts
  working, with Amazon Web Services / Hive and Amazon Web Services
  using, with S3 / Hive and S3
  using, with Elastic MapReduce / Hive on Elastic MapReduce
  URL, for source code of JDBC client / JDBC
  URL, for source code of Thrift client / Thrift
Hive-JSON-Serde
  URL / JSON
hive-json module
  URL / JSON
  about / JSON
Hive-on-tez
  about / Hive-on-tez
Hive 0.13
  about / Hive-on-tez
Hive architecture
  about / Hive architecture
HiveQL
  about / Why SQL on Hadoop, Queries
  extending / Extending HiveQL
HiveServer2
  about / Hive architecture
  URL / Hive architecture
Hive tables
  about / The nature of Hive tables
  structuring, from workloads / Structuring Hive tables for given workloads
Hortonworks' HDP
  URL / Spark on YARN
Hortonworks
  URL / Distributions of Apache Hadoop
Hortonworks Data Platform (HDP)
  about / Alternative distributions, Hortonworks Data Platform
  URL / Hortonworks Data Platform
Hue
  about / Hue
  URL / Hue
HUGs
  about / HUGs
  reference link / HUGs
I
IAM console
  URL / Hive and S3
IBM Infosphere BigInsights
  about / And the rest…
Identity and Access Management (IAM) / AWS credentials
Impala
  about / Impala
  references / Impala, Co-existing with Hive
  architecture / The architecture of Impala
  co-existing, with Hive / Co-existing with Hive
in-sync replicas (ISR)
  about / Getting Twitter data into Kafka
indices attribute, entity
  about / Tweet metadata
input/output, MapReduce job
  about / Input/Output
InputFormat, MapReduce job
  about / InputFormat and RecordReader
J
Java
  WordCount / WordCount in Java
Java API
  about / Java API
  and Scala API, differences / Java API
Java API to MapReduce
  about / Java API to MapReduce
  Mapper class / The Mapper class
  Reducer class / The Reducer class
  Driver class / The Driver class
  combiner class / Combiner
  partitioning / Partitioning
  Hadoop-provided Mapper and Reducer implementations / Hadoop-provided mapper and reducer implementations
  reference data, sharing / Sharing reference data
Java FileSystem API
  about / Java FileSystem API
JDBC
  about / JDBC
JobTracker monitoring, MapReduce job
  about / Ongoing JobTracker monitoring
join node
  about / Performing multiple actions in parallel
JOIN operator
  about / Join
  / Queries
JSON
  about / JSON
JSON Simple
  URL / Building a tweet parsing job
JVM considerations, cluster tuning
  about / JVM considerations
  small files problem / The small files problem
K
kite-morphlines-avro command / Morphline commands
kite-morphlines-core-stdio command / Morphline commands
kite-morphlines-core-stdlib command / Morphline commands
kite-morphlines-hadoop-core command / Morphline commands
kite-morphlines-hadoop-parquet-avro command / Morphline commands
kite-morphlines-hadoop-rcfile command / Morphline commands
kite-morphlines-hadoop-sequencefile command / Morphline commands
kite-morphlines-json command / Morphline commands
Kite Data
  about / Kite Data
  Data Core / Data Core
  Data HCatalog / Data HCatalog
  Data Hive / Data Hive
  Data MapReduce / Data MapReduce
  Data Spark / Data Spark
  Data Crunch / Data Crunch
Kite examples
  reference link / Kite Data
Kite JARs
  reference link / Kite Data
Kite Morphlines
  about / Kite Morphlines
  concepts / Concepts
  Record abstractions / Concepts
  commands / Morphline commands
Kite SDK
  URL / Kite Data
KVM
  reference link / Cloudera QuickStart VM
L
Lambda syntax
  URL / Python API
Libhdfs
  about / Libhdfs
LinkedIn groups
  about / LinkedIn groups
  URL / LinkedIn groups
Log4j
  about / Logging levels
log files
  accessing to / Access to log files
logging levels
  about / Logging levels
M
Machine Learning (ML)
  about / MLlib
macros
  about / Macros
Mahout
  about / Mahout
  URL / Mahout
map optimization, cluster tuning
  considerations / Map and reduce optimizations
Mapper class, Java API to MapReduce
  about / The Mapper class
mapper execution, MapReduce job
  about / Mapper execution
mapper input, MapReduce job
  about / Mapper input
mapper output, MapReduce job
  about / Mapper output and reducer input
mappers, Mapper and Reducer implementations
  InverseMapper / Hadoop-provided mapper and reducer implementations
  TokenCounterMapper / Hadoop-provided mapper and reducer implementations
  IdentityMapper / Hadoop-provided mapper and reducer implementations
MapR
  URL / Distributions of Apache Hadoop, MapR
  about / MapR
MapReduce
  reference link / The background of Hadoop, MapReduce
  about / MapReduce
  Map phase / MapReduce
MapReduce API
  about / Components of Hadoop, Computation
MapReduce driver source code
  reference link / Morphline commands
MapReduce job
  about / Walking through a run of a MapReduce job
  startup / Startup
  input, splitting / Splitting the input
  task assignment / Task assignment
  task startup / Task startup
  JobTracker monitoring / Ongoing JobTracker monitoring
  mapper input / Mapper input
  mapper execution / Mapper execution
  mapper output / Mapper output and reducer input
  reducer input / Reducer input
  reducer execution / Reducer execution
  reducer output / Reducer output
  shutdown / Shutdown
  input/output / Input/Output
  InputFormat / InputFormat and RecordReader
  RecordReader / InputFormat and RecordReader
  Hadoop-provided InputFormat / Hadoop-provided InputFormat
  Hadoop-provided RecordReader / Hadoop-provided RecordReader
  OutputFormat / OutputFormat and RecordWriter
  RecordWriter / OutputFormat and RecordWriter
  Hadoop-provided OutputFormat / Hadoop-provided OutputFormat
  sequence files / Sequence files
MapReduce programs
  writing / Writing MapReduce programs, Getting started
  examples, running / Running the examples
  WordCount example / WordCount, the Hello World of MapReduce
  word co-occurrences / Word co-occurrences
  social network topics / Trending topics
  reference link, for HashTagCount example source code / Trending topics
  TopN pattern / The TopN pattern
  reference link, for TopTenHashTag source code / The TopN pattern
  hashtags / Sentiment of hashtags
  reference link, for HashTagSentiment source code / Sentiment of hashtags
  text cleanup, chain mapper used / Text cleanup using chain mapper
  reference link, for HashTagSentimentChain source code / Text cleanup using chain mapper
Massively Parallel Processing (MPP)
  about / The architecture of Impala
MemPipeline
  about / MemPipeline
Message Passing Interface (MPI) / Computation in Hadoop 2
MLlib
  about / MLlib
monitoring
  about / Monitoring
  Hadoop / Hadoop – where failures don't matter
  application-level metrics / Application-level metrics
monitoring tools
  about / Monitoring integration
MorphlineDriver source code
  reference link / Morphline commands
Morphline commands
  kite-morphlines-core-stdio / Morphline commands
  kite-morphlines-core-stdlib / Morphline commands
  kite-morphlines-avro / Morphline commands
  kite-morphlines-json / Morphline commands
  kite-morphlines-hadoop-parquet-avro / Morphline commands
  kite-morphlines-hadoop-sequencefile / Morphline commands
  kite-morphlines-hadoop-rcfile / Morphline commands
  reference link / Morphline commands
MR Execution Engine / An overview of Pig
Multipart Upload
  URL / Getting data into EMR
N
NameNode
  about / Storage in Hadoop 2
  / NameNode and DataNode
NameNode HA
  about / Storage in Hadoop 2
NameNode startup
  about / NameNode startup
NFS share / Keeping the HA NameNodes in sync
NodeManager
  about / ResourceManager, NodeManager, and ApplicationManager
NodeManager (NM)
  about / The components of YARN
O
Oozie
  about / Introducing Oozie
  URL / Introducing Oozie
  features / Introducing Oozie
  action nodes / Introducing Oozie
  HDFS file permissions / A note on HDFS file permissions
  development, making easier / Making development a little easier
  data, extracting / Extracting data and ingesting into Hive
  data, ingesting into Hive / Extracting data and ingesting into Hive
  workflow directory structure / A note on workflow directory structure
  HCatalog / Introducing HCatalog
  sharelib / The Oozie sharelib
  HCatalog and partitioned tables / HCatalog and partitioned tables
  using / Pulling it all together
Oozie triggers / Other Oozie triggers
Oozie workflow
  about / Introducing Oozie
  / Introducing Oozie
operations, Hadoop 2
  about / Operations in the Hadoop 2 world
opinion lexicon
  URL / Sentiment of hashtags
Optimized Row Columnar file format (ORC)
  about / ORC
  reference link / ORC
ORC
  URL / Columnar stores
org.apache.zookeeper.ZooKeeper class
  about / Java API
OutputFormat, MapReduce job
  about / OutputFormat and RecordWriter
P
parallelDo operation
  about / Concepts
PARALLEL operator
  about / Aggregation
Parquet
  reference link / Parquet
  about / Parquet
  URL / Columnar stores
partitioning, Java API to MapReduce
  about / Partitioning
  optional partition function / The optional partition function
PCollection<T> interface, Crunch
  about / Concepts
physical cluster
  building / Building a physical cluster
physical cluster, considerations
  about / Physical layout
  rack awareness / Rack awareness
  service layout / Service layout
  service, upgrading / Upgrading a service
Pig
  overview / An overview of Pig
  use cases / An overview of Pig
  about / Getting started, Why SQL on Hadoop
  running / Running Pig
  reference link, for source code and binary distributions / Running Pig
  Grunt / Grunt – the Pig interactive shell
  Elastic MapReduce / Elastic MapReduce
  fundamentals / Fundamentals of Apache Pig
  reference link, for parallel feature / Fundamentals of Apache Pig
  reference link, for multi-query implementation / Fundamentals of Apache Pig
  programming / Programming Pig
  data types / Pig data types
  functions / Pig functions
  data, working with / Working with data
Piggybank
  about / Piggybank
Pig Latin / An overview of Pig
Pig UDFs
  extending / Extending Pig (UDFs)
  contributed UDFs / Contributed UDFs
pipelines implementation, Apache Crunch
  about / Pipelines implementation and execution
  SparkPipeline / SparkPipeline
  MemPipeline / MemPipeline
positive_words operator
  about / Join
pre-requisites
  about / Prerequisites
Predictive Model Markup Language (PMML) / Cascading
processing models, YARN
  Cloudera Kitten / Thinking in layers
  Apache Twill / Thinking in layers
programmatic interfaces
  about / Programmatic interfaces
  JDBC / JDBC
  Thrift / Thrift
Project Rhino
  URL / The future of Hadoop security
PTable<Key, Value> interface, Crunch
  about / Concepts
Python
  used, for programmatic access / Programmatic access with Python
Python API
  about / Python API
Q
QJM mechanism
  about / Keeping the HA NameNodes in sync
queries, Hive / Queries
R
RDDs
  about / Cluster computing with working sets, Resilient Distributed Datasets (RDDs)
RDDs, operations
  map / Actions
  filter / Actions
  reduce / Actions
  collect / Actions
  foreach / Actions
  groupByKey / Actions
  sortByKey / Actions
Record abstractions
  implementing / Concepts
RecordReader, MapReduce job
  about / InputFormat and RecordReader
RecordWriter, MapReduce job
  about / OutputFormat and RecordWriter
Reduce function
  about / MapReduce
reduce optimization, cluster tuning
  considerations / Map and reduce optimizations
Reducer class, Java API to MapReduce
  about / The Reducer class
reducer execution, MapReduce job
  about / Reducer execution
reducer input, MapReduce job
  about / Reducer input
reducer output, MapReduce job
  about / Reducer output
reducers, Mapper and Reducer implementations
  IntSumReducer / Hadoop-provided mapper and reducer implementations
  LongSumReducer / Hadoop-provided mapper and reducer implementations
  IdentityReducer / Hadoop-provided mapper and reducer implementations
reference data, Java API to MapReduce
  sharing / Sharing reference data
REGISTER operator
  about / Extending Pig (UDFs)
required services, AWS
  Simple Storage Service (S3) / Signing up for the necessary services
  Elastic MapReduce / Signing up for the necessary services
  Elastic Compute Cloud (EC2) / Signing up for the necessary services
ResourceManager
  about / ResourceManager, NodeManager, and ApplicationManager
  applications / Applications
  Nodes view / Nodes
  Scheduler window / Scheduler
  MapReduce / MapReduce
  MapReduce v1 / MapReduce v1
  MapReduce v2 (YARN) / MapReduce v2 (YARN)
  JobHistoryServer / JobHistoryServer
resources
  sharing / Sharing resources
Role Based Access Control (RBAC) / Beyond basic authorization
Row Columnar File (RCFile)
  about / RCFile
  reference link / RCFile
S
S3
  Hive, using with / Hive and S3
s3distcp
  URL / Getting data into EMR
s3n / Hadoop filesystems
Samza
  about / Apache Samza
  URL / Apache Samza, Stream processing with Samza
  YARN-independent frameworks / YARN-independent frameworks
  used, for stream processing / Stream processing with Samza
  working / How Samza works
  architecture / Samza high-level architecture
  Apache Kafka / Samza's best friend – Apache Kafka
  integrating, with YARN / YARN integration
  independent model / An independent model
  Hello Samza / Hello Samza!
  tweet parsing job, building / Building a tweet parsing job
  configuration file / The configuration file
  URL, for configuration options / The configuration file
  Twitter data, getting into Apache Kafka / Getting Twitter data into Kafka
  HDFS / Samza and HDFS
  window function, adding / Windowing functions
  multijob workflows / Multijob workflows
  tweet sentiment analysis, performing / Tweet sentiment analysis
  tasks processing / Stateful tasks
  and Spark Streaming, comparing / Comparing Samza and Spark Streaming
Samza, layers
  streaming / Samza high-level architecture
  execution / Samza high-level architecture
  processing / Samza high-level architecture
Samza job
  executing / Running a Samza job
sbt
  URL / Getting started with Spark
Scala and Java source code, examples
  URL / Building and running the examples
Scala API
  about / Scala API
scalar data types
  int / Pig data types
  long / Pig data types
  float / Pig data types
  double / Pig data types
  chararray / Pig data types
  bytearray / Pig data types
  boolean / Pig data types
  datetime / Pig data types
  biginteger / Pig data types
  bigdecimal / Pig data types
Scala source code
  URL / Data processing on streams
Secondary NameNode
  about / Secondary NameNode not to the rescue
  demerits / Secondary NameNode not to the rescue
secured cluster
  using, consequences / Consequences of using a secured cluster
security
  about / Security
sentiment analysis
  about / Sentiment of hashtags
SequenceFile
  about / General-purpose file formats
SequenceFile class, MapReduce job
  about / Sequence files
sequence files, MapReduce job
  about / Sequence files
  advantages / Sequence files
SerDe classes, Hive
  MetadataTypedColumnsetSerDe / File formats and storage
  ThriftSerDe / File formats and storage
  DynamicSerDe / File formats and storage
serialization
  about / Serialization and Containers
sharelib, Oozie
  about / The Oozie sharelib
SimpleDB
  about / SimpleDB and DynamoDB
Simple Storage Service (S3), AWS
  about / Simple Storage Service (S3)
  URL / Simple Storage Service (S3)
sources of information, Hadoop
  about / Sources of information
  source code / Source code
  mailing lists / Mailing lists and forums
  forums / Mailing lists and forums
  LinkedIn groups / LinkedIn groups
  HUGs / HUGs
  conferences / Conferences
Spark
  about / Apache Spark
  URL / Apache Spark
SparkContext object / Scala API
SparkPipeline
  about / SparkPipeline
Spark SQL
  about / Spark SQL
  data analysis with / Data analysis with Spark SQL
Spark Streaming
  URL / Spark Streaming
  about / Spark Streaming
  and Samza, comparing / Comparing Samza and Spark Streaming
specialized join
  reference link / Join
speed of thought analysis / A different philosophy
SQL
  on data streams / SQL on data streams
  on data streams, URL / SQL on data streams
SQL-on-Hadoop
  need for / Why SQL on Hadoop
  solutions / Other SQL-on-Hadoop solutions
Sqoop
  about / Sqoop
  URL / Sqoop
Sqoop 1
  about / Sqoop
Sqoop 2
  about / Sqoop
standalone applications, Apache Spark
  writing / Writing and running standalone applications
  running / Writing and running standalone applications
statements
  about / Fundamentals of Apache Pig
Stinger initiative
  about / Stinger initiative
storage
  about / Storage
storage, Hadoop 2
  about / Storage in Hadoop 2
storage, Hive
  about / File formats and storage
  columnar stores / Columnar stores
Storm
  URL / How Samza works
  about / How Samza works
stream.py
  reference link / Programmatic access with Python
stream processing
  with Samza / Stream processing with Samza
streams
  data, processing on / Data processing on streams
systems management tools
  Cloudera Manager, integrating with / Cloudera Manager and other management tools
T
table partitioning
  about / Partitioning a table
  data, overwriting / Overwriting and updating data
  data, updating / Overwriting and updating data
  bucketing / Bucketing and sorting
  sorting / Bucketing and sorting
  data, sampling / Sampling data
Tajo
  URL / Drill, Tajo, and beyond
  about / Drill, Tajo, and beyond
tasks processing, Samza
  about / Stateful tasks
term frequency
  about / Calculate term frequency
  calculating, with TF-IDF / Calculate term frequency
text attribute, entity
  about / Tweet metadata
Text files
  about / General-purpose file formats
Tez
  about / Tez
  URL / Tez, Stinger initiative
  reference link, for canonical WordCount example / Tez
  Hive-on-tez / Hive-on-tez
  / An overview of Pig
TF-IDF
  about / Finding important words in text
  definition / Finding important words in text
  term frequency, calculating / Calculate term frequency
  document frequency, calculating / Calculate document frequency
  implementing / Putting it all together – TF-IDF
Thrift
  about / Thrift
TOBAG(expression) function / The tuple, bag, and map functions
TOMAP(expression) function / The tuple, bag, and map functions
tools, data lifecycle management
  orchestration services / Tools to help
  connectors / Tools to help
  file formats / Tools to help
TOP(n, column, relation) function / The tuple, bag, and map functions
TOTUPLE(expression) function / The tuple, bag, and map functions
troubleshooting
  about / Troubleshooting
tuples
  about / Fundamentals of Apache Pig
Tweet, structure
  reference link / Anatomy of a Tweet
tweet analysis capability
  building / Building a tweet analysis capability
  tweet data, obtaining / Getting the tweet data
  Oozie / Introducing Oozie
  derived data, producing / Producing derived data
tweet sentiment analysis
  performing / Tweet sentiment analysis
  bootstrap streams / Bootstrap streams
Twitter
  used, for generating dataset / Data processing with Hadoop
  URL / Data processing with Hadoop
  about / Why Twitter?
  signup page / Twitter credentials
  web form / Twitter credentials
Twitter data, properties
  unstructured / Why Twitter?
  structured / Why Twitter?
  graph / Why Twitter?
  geolocated / Why Twitter?
  real time / Why Twitter?
Twitter Search
  URL / Trending topics
Twitter stream
  analyzing / Analyzing the Twitter stream
  prerequisites / Prerequisites
  dataset exploration / Dataset exploration
  tweet metadata / Tweet metadata
  data preparation / Data preparation
  top n statistics / Top n statistics
  datetime manipulation / Datetime manipulation
  sessions / Sessions
  users' interaction, capturing / Capturing user interactions
  link analysis / Link analysis
  influential users, identifying / Influential users
U
union operation
  about / Concepts
updateFunc function / State management
User Defined Aggregate Functions (UDAFs) / Extending HiveQL
User Defined Functions (UDFs) / An overview of Pig, Extending HiveQL
  about / Fundamentals of Apache Pig
User Defined Table Functions (UDTF) / Extending HiveQL
V
versioning, Hadoop
  about / A note on versioning
VirtualBox
  reference link / Cloudera QuickStart VM
VMware
  reference link / Cloudera QuickStart VM
W
Whir
  about / Whir
  URL / Whir
Who to Follow service
  reference link / Influential users
window function
  adding / Windowing functions
WordCount
  in Java / WordCount in Java
WordCount example, MapReduce programs
  about / WordCount, the Hello World of MapReduce
  reference link, for source code / Word co-occurrences
workflow-app
  about / Introducing Oozie
workflow.xml file
  reference link / Extracting data and ingesting into Hive
workflows
  building, Oozie used / Pulling it all together
workloads
  Hive tables, structuring from / Structuring Hive tables for given workloads
wrapper classes
  about / Introducing the wrapper classes
WritableComparable interface
  about / The Comparable and WritableComparable interfaces
Writable interface
  about / The Writable interface
Y
YARN
  about / Computation in Hadoop 2, YARN, YARN in the real world – Computation beyond MapReduce
  architecture / YARN architecture
  components / The components of YARN
  processing frameworks / Thinking in layers
  processing models / Thinking in layers
  issues, with MapReduce / The problem with MapReduce
  Tez / Tez
  Apache Spark / Apache Spark
  Apache Samza / Apache Samza
  future / YARN today and beyond
  present situation / YARN today and beyond
  Samza, integrating / YARN integration
  Apache Spark on / Spark on YARN
  examples, running on / Running the examples on YARN
  URL / Running the examples on YARN
YARN API
  about / Thinking in layers
YARN application
  anatomy / Anatomy of a YARN application
  ApplicationMaster (AM) / Anatomy of a YARN application
  lifecycle / Lifecycle of a YARN application
  fault-tolerance / Fault tolerance and monitoring
  monitoring / Fault tolerance and monitoring
  execution models / Execution models
Z
ZooKeeperFailoverController (ZKFC) / Automatic NameNode failover
ZooKeeper quorum / Automatic NameNode failover