Learning Hadoop 2
Table of Contents
Learning Hadoop 2
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Introduction
A note on versioning
The background of Hadoop
Components of Hadoop
Common building blocks
Storage
Computation
Better together
Hadoop 2 – what's the big deal?
Storage in Hadoop 2
Computation in Hadoop 2
Distributions of Apache Hadoop
A dual approach
AWS – infrastructure on demand from Amazon
Simple Storage Service (S3)
Elastic MapReduce (EMR)
Getting started
Cloudera QuickStart VM
Amazon EMR
Creating an AWS account
Signing up for the necessary services
Using Elastic MapReduce
Getting Hadoop up and running
How to use EMR
AWS credentials
The AWS command-line interface
Running the examples
Data processing with Hadoop
Why Twitter?
Building our first dataset
One service, multiple APIs
Anatomy of a Tweet
Twitter credentials
Programmatic access with Python
Summary
2. Storage
The inner workings of HDFS
Cluster startup
NameNode startup
DataNode startup
Block replication
Command-line access to the HDFS filesystem
Exploring the HDFS filesystem
Protecting the filesystem metadata
Secondary NameNode not to the rescue
Hadoop 2 NameNode HA
Keeping the HA NameNodes in sync
Client configuration
How a failover works
Apache ZooKeeper – a different type of filesystem
Implementing a distributed lock with sequential ZNodes
Implementing group membership and leader election using ephemeral ZNodes
Java API
Building blocks
Further reading
Automatic NameNode failover
HDFS snapshots
Hadoop filesystems
Hadoop interfaces
Java FileSystem API
Libhdfs
Thrift
Managing and serializing data
The Writable interface
Introducing the wrapper classes
Array wrapper classes
The Comparable and WritableComparable interfaces
Storing data
Serialization and Containers
Compression
General-purpose file formats
Column-oriented data formats
RCFile
ORC
Parquet
Avro
Using the Java API
Summary
3. Processing – MapReduce and Beyond
MapReduce
Java API to MapReduce
The Mapper class
The Reducer class
The Driver class
Combiner
Partitioning
The optional partition function
Hadoop-provided mapper and reducer implementations
Sharing reference data
Writing MapReduce programs
Getting started
Running the examples
Local cluster
Elastic MapReduce
WordCount, the Hello World of MapReduce
Word co-occurrences
Trending topics
The Top N pattern
Sentiment of hashtags
Text cleanup using chain mapper
Walking through a run of a MapReduce job
Startup
Splitting the input
Task assignment
Task startup
Ongoing JobTracker monitoring
Mapper input
Mapper execution
Mapper output and reducer input
Reducer input
Reducer execution
Reducer output
Shutdown
Input/Output
InputFormat and RecordReader
Hadoop-provided InputFormat
Hadoop-provided RecordReader
OutputFormat and RecordWriter
Hadoop-provided OutputFormat
Sequence files
YARN
YARN architecture
The components of YARN
Anatomy of a YARN application
Life cycle of a YARN application
Fault tolerance and monitoring
Thinking in layers
Execution models
YARN in the real world – Computation beyond MapReduce
The problem with MapReduce
Tez
Hive-on-tez
Apache Spark
Apache Samza
YARN-independent frameworks
YARN today and beyond
Summary
4. Real-time Computation with Samza
Stream processing with Samza
How Samza works
Samza high-level architecture
Samza's best friend – Apache Kafka
YARN integration
An independent model
Hello Samza!
Building a tweet parsing job
The configuration file
Getting Twitter data into Kafka
Running a Samza job
Samza and HDFS
Windowing functions
Multijob workflows
Tweet sentiment analysis
Bootstrap streams
Stateful tasks
Summary
5. Iterative Computation with Spark
Apache Spark
Cluster computing with working sets
Resilient Distributed Datasets (RDDs)
Actions
Deployment
Spark on YARN
Spark on EC2
Getting started with Spark
Writing and running standalone applications
Scala API
Java API
WordCount in Java
Python API
The Spark ecosystem
Spark Streaming
GraphX
MLlib
Spark SQL
Processing data with Apache Spark
Building and running the examples
Running the examples on YARN
Finding popular topics
Assigning a sentiment to topics
Data processing on streams
State management
Data analysis with Spark SQL
SQL on data streams
Comparing Samza and Spark Streaming
Summary
6. Data Analysis with Apache Pig
An overview of Pig
Getting started
Running Pig
Grunt – the Pig interactive shell
Elastic MapReduce
Fundamentals of Apache Pig
Programming Pig
Pig data types
Pig functions
Load/store
Eval
The tuple, bag, and map functions
The math, string, and datetime functions
Dynamic invokers
Macros
Working with data
Filtering
Aggregation
Foreach
Join
Extending Pig (UDFs)
Contributed UDFs
Piggybank
Elephant Bird
Apache DataFu
Analyzing the Twitter stream
Prerequisites
Dataset exploration
Tweet metadata
Data preparation
Top n statistics
Datetime manipulation
Sessions
Capturing user interactions
Link analysis
Influential users
Summary
7. Hadoop and SQL
Why SQL on Hadoop
Other SQL-on-Hadoop solutions
Prerequisites
Overview of Hive
The nature of Hive tables
Hive architecture
Data types
DDL statements
File formats and storage
JSON
Avro
Columnar stores
Queries
Structuring Hive tables for given workloads
Partitioning a table
Overwriting and updating data
Bucketing and sorting
Sampling data
Writing scripts
Hive and Amazon Web Services
Hive and S3
Hive on Elastic MapReduce
Extending HiveQL
Programmatic interfaces
JDBC
Thrift
Stinger initiative
Impala
The architecture of Impala
Co-existing with Hive
A different philosophy
Drill, Tajo, and beyond
Summary
8. Data Lifecycle Management
What data lifecycle management is
Importance of data lifecycle management
Tools to help
Building a tweet analysis capability
Getting the tweet data
Introducing Oozie
A note on HDFS file permissions
Making development a little easier
Extracting data and ingesting into Hive
A note on workflow directory structure
Introducing HCatalog
Using HCatalog
The Oozie sharelib
HCatalog and partitioned tables
Producing derived data
Performing multiple actions in parallel
Calling a subworkflow
Adding global settings
Challenges of external data
Data validation
Validation actions
Handling format changes
Handling schema evolution with Avro
Final thoughts on using Avro schema evolution
Only make additive changes
Manage schema versions explicitly
Think about schema distribution
Collecting additional data
Scheduling workflows
Other Oozie triggers
Pulling it all together
Other tools to help
Summary
9. Making Development Easier
Choosing a framework
Hadoop streaming
Streaming word count in Python
Differences in jobs when using streaming
Finding important words in text
Calculate term frequency
Calculate document frequency
Putting it all together – TF-IDF
Kite Data
Data Core
Data HCatalog
Data Hive
Data MapReduce
Data Spark
Data Crunch
Apache Crunch
Getting started
Concepts
Data serialization
Data processing patterns
Aggregation and sorting
Joining data
Pipelines implementation and execution
SparkPipeline
MemPipeline
Crunch examples
Word co-occurrence
TF-IDF
Kite Morphlines
Concepts
Morphline commands
Summary
10. Running a Hadoop Cluster
I'm a developer – I don't care about operations!
Hadoop and DevOps practices
Cloudera Manager
To pay or not to pay
Cluster management using Cloudera Manager
Cloudera Manager and other management tools
Monitoring with Cloudera Manager
Finding configuration files
Cloudera Manager API
Cloudera Manager lock-in
Ambari – the open source alternative
Operations in the Hadoop 2 world
Sharing resources
Building a physical cluster
Physical layout
Rack awareness
Service layout
Upgrading a service
Building a cluster on EMR
Considerations about filesystems
Getting data into EMR
EC2 instances and tuning
Cluster tuning
JVM considerations
The small files problem
Map and reduce optimizations
Security
Evolution of the Hadoop security model
Beyond basic authorization
The future of Hadoop security
Consequences of using a secured cluster
Monitoring
Hadoop – where failures don't matter
Monitoring integration
Application-level metrics
Troubleshooting
Logging levels
Access to log files
ResourceManager, NodeManager, and ApplicationManager
Applications
Nodes
Scheduler
MapReduce
MapReduce v1
MapReduce v2 (YARN)
JobHistory Server
NameNode and DataNode
Summary
11. Where to Go Next
Alternative distributions
Cloudera Distribution for Hadoop
Hortonworks Data Platform
MapR
And the rest…
Choosing a distribution
Other computational frameworks
Apache Storm
Apache Giraph
Apache HAMA
Other interesting projects
HBase
Sqoop
Whir
Mahout
Hue
Other programming abstractions
Cascading
AWS resources
SimpleDB and DynamoDB
Kinesis
Data Pipeline
Sources of information
Source code
Mailing lists and forums
LinkedIn groups
HUGs
Conferences
Summary
Index
Learning Hadoop 2
Learning Hadoop 2
Copyright © 2015 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: February 2015
Production reference: 1060215
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78328-551-8
www.packtpub.com
Credits
Authors
Garry Turkington
Gabriele Modena
Reviewers
Atdhe Buja
Amit Gurdasani
Jakob Homan
James Lampton
Davide Setti
Valerie Parham-Thompson
Commissioning Editor
Edward Gordon
Acquisition Editor
Joanne Fitzpatrick
Content Development Editor
Vaibhav Pawar
Technical Editors
Indrajit A. Das
Menza Mathew
Copy Editors
Roshni Banerjee
Sarang Chari
Pranjali Chury
Project Coordinator
Kranti Berde
Proofreaders
Simran Bhogal
Martin Diver
Lawrence A. Herman
Paul Hindle
Indexer
Hemangini Bari
Graphics
Abhinash Sahu
Production Coordinator
Nitesh Thakur
Cover Work
Nitesh Thakur
About the Authors
Garry Turkington has over 15 years of industry experience, most of which has been focused on the design and implementation of large-scale distributed systems. In his current role as the CTO at Improve Digital, he is primarily responsible for the realization of systems that store, process, and extract value from the company's large data volumes. Before joining Improve Digital, he spent time at Amazon.co.uk, where he led several software development teams, building systems that process the Amazon catalog data for every item worldwide. Prior to this, he spent a decade in various government positions in both the UK and the USA.
He has BSc and PhD degrees in Computer Science from Queens University Belfast in Northern Ireland, and a Master's degree in Engineering in Systems Engineering from Stevens Institute of Technology in the USA. He is the author of Hadoop Beginner's Guide, published by Packt Publishing in 2013, and is a committer on the Apache Samza project.
I would like to thank my wife Lea and mother Sarah for their support and patience through the writing of another book and my daughter Maya for frequently cheering me up and asking me hard questions. I would also like to thank Gabriele for being such an amazing co-author on this project.
Gabriele Modena is a data scientist at Improve Digital. In his current position, he uses Hadoop to manage, process, and analyze behavioral and machine-generated data. Gabriele enjoys using statistical and computational methods to look for patterns in large amounts of data. Prior to his current job in ad tech, he held a number of positions in academia and industry where he did research in machine learning and artificial intelligence.
He holds a BSc degree in Computer Science from the University of Trento, Italy, and a Research MSc degree in Artificial Intelligence: Learning Systems, from the University of Amsterdam in the Netherlands.
First and foremost, I want to thank Laura for her support, constant encouragement and endless patience putting up with far too many "can't do, I'm working on the Hadoop book". She is my rock and I dedicate this book to her.
A special thank you goes to Amit, Atdhe, Davide, Jakob, James and Valerie, whose invaluable feedback and commentary made this work possible.
Finally, I'd like to thank my co-author, Garry, for bringing me on board with this project; it has been a pleasure working together.
About the Reviewers
Atdhe Buja is a certified ethical hacker, DBA (MCITP, OCA 11g), and developer with good management skills. He is a DBA at the Agency for Information Society/Ministry of Public Administration, where he also manages some projects of e-governance and has more than 10 years' experience working on SQL Server.
Atdhe is a regular columnist for UBT News. Currently, he holds an MSc degree in computer science and engineering and has a bachelor's degree in management and information. He specializes in and is certified in many technologies, such as SQL Server (all versions), Oracle 11g, CEH, Windows Server, MS Project, SCOM 2012 R2, BizTalk, and integration business processes.
He was the reviewer of the book, Microsoft SQL Server 2012 with Hadoop, published by Packt Publishing. His capabilities go beyond the aforementioned knowledge!
I thank Donika and my family for all the encouragement and support.
Amit Gurdasani is a software engineer at Amazon. He architects distributed systems to process product catalogue data. Prior to building high-throughput systems at Amazon, he was working on the entire software stack, both as a systems-level developer at Ericsson and IBM as well as an application developer at Manhattan Associates. He maintains a strong interest in bulk data processing, data streaming, and service-oriented software architectures.
Jakob Homan has been involved with big data and the Apache Hadoop ecosystem for more than 5 years. He is a Hadoop committer as well as a committer for the Apache Giraph, Spark, Kafka, and Tajo projects, and is a PMC member. He has worked in bringing all these systems to scale at Yahoo! and LinkedIn.
James Lampton is a seasoned practitioner of all things data (big or small) with 10 years of hands-on experience in building and using large-scale data storage and processing platforms. He is a believer in holistic approaches to solving problems using the right tool for the right job. His favorite tools include Python, Java, Hadoop, Pig, Storm, and SQL (which sometimes I like and sometimes I don't). He has recently completed his PhD from the University of Maryland with the release of Pig Squeal: a mechanism for running Pig scripts on Storm.
I would like to thank my spouse, Andrea, and my son, Henry, for giving me time to read work-related things at home. I would also like to thank Garry, Gabriele, and the folks at Packt Publishing for the opportunity to review this manuscript and for their patience and understanding, as my free time was consumed when writing my dissertation.
Davide Setti, after graduating in physics from the University of Trento, joined the SoNet research unit at the Fondazione Bruno Kessler in Trento, where he applied large-scale data analysis techniques to understand people's behaviors in social networks and large collaborative projects such as Wikipedia.
In 2010, Davide moved to Fondazione, where he led the development of data analytic tools to support research on civic media, citizen journalism, and digital media.
In 2013, Davide became the CTO of SpazioDati, where he leads the development of tools to perform semantic analysis of massive amounts of data in the business information sector.
When not solving hard problems, Davide enjoys taking care of his family vineyard and playing with his two children.
www.PacktPub.com
Support files, eBooks, discount offers, and more
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Free access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.
Preface
This book will take you on a hands-on exploration of the wonderful world that is Hadoop 2 and its rapidly growing ecosystem. Building on the solid foundation from the earlier versions of the platform, Hadoop 2 allows multiple data processing frameworks to be executed on a single Hadoop cluster.
To give an understanding of this significant evolution, we will explore both how these new models work and also show their applications in processing large data volumes with batch, iterative, and near-real-time algorithms.
What this book covers
Chapter 1, Introduction, gives the background to Hadoop and the Big Data problems it looks to solve. We also highlight the areas in which Hadoop 1 had room for improvement.
Chapter 2, Storage, delves into the Hadoop Distributed File System, where most data processed by Hadoop is stored. We examine the particular characteristics of HDFS, show how to use it, and discuss how it has improved in Hadoop 2. We also introduce ZooKeeper, another storage system within Hadoop, upon which many of its high-availability features rely.
Chapter 3, Processing – MapReduce and Beyond, first discusses the traditional Hadoop processing model and how it is used. We then discuss how Hadoop 2 has generalized the platform to use multiple computational models, of which MapReduce is merely one.
Chapter 4, Real-time Computation with Samza, takes a deeper look at one of these alternative processing models enabled by Hadoop 2. In particular, we look at how to process real-time streaming data with Apache Samza.
Chapter 5, Iterative Computation with Spark, delves into a very different alternative processing model. In this chapter, we look at how Apache Spark provides the means to do iterative processing.
Chapter 6, Data Analysis with Pig, demonstrates how Apache Pig makes the traditional computational model of MapReduce easier to use by providing a language to describe data flows.
Chapter 7, Hadoop and SQL, looks at how the familiar SQL language has been implemented atop data stored in Hadoop. Through the use of Apache Hive and describing alternatives such as Cloudera Impala, we show how Big Data processing can be made possible using existing skills and tools.
Chapter 8, Data Lifecycle Management, takes a look at the bigger picture of just how to manage all that data that is to be processed in Hadoop. Using Apache Oozie, we show how to build up workflows to ingest, process, and manage data.
Chapter 9, Making Development Easier, focuses on a selection of tools aimed at helping a developer get results quickly. Through the use of Hadoop streaming, Apache Crunch and Kite, we show how the use of the right tool can speed up the development loop or provide new APIs with richer semantics and less boilerplate.
Chapter 10, Running a Hadoop Cluster, takes a look at the operational side of Hadoop. By focusing on the areas of interest to developers, such as cluster management, monitoring, and security, this chapter should help you to work better with your operations staff.
Chapter 11, Where to Go Next, takes you on a whirlwind tour through a number of other projects and tools that we feel are useful, but could not cover in detail in the book due to space constraints. We also give some pointers on where to find additional sources of information and how to engage with the various open source communities.
What you need for this book
Because most people don't have a large number of spare machines sitting around, we use the Cloudera QuickStart virtual machine for most of the examples in this book. This is a single machine image with all the components of a full Hadoop cluster pre-installed. It can be run on any host machine supporting either the VMware or the VirtualBox virtualization technology.
We also explore Amazon Web Services and how some of the Hadoop technologies can be run on the AWS Elastic MapReduce service. The AWS services can be managed through a web browser or a Linux command-line interface.
Who this book is for
This book is primarily aimed at application and system developers interested in learning how to solve practical problems using the Hadoop framework and related components. Although we show examples in a few programming languages, a strong foundation in Java is the main prerequisite.
Data engineers and architects might also find the material concerning data lifecycle, file formats, and computational models useful.
Conventions
In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "If Avro dependencies are not present in the classpath, we need to add the Avro MapReduce .jar file to our environment before accessing individual fields."
A block of code is set as follows:
topic_edges_grouped = FOREACH topic_edges_grouped {
  GENERATE
    group.topic_id as topic,
    group.source_id as source,
    topic_edges.(destination_id, w) as edges;
}
Any command-line input or output is written as follows:
$ hdfs dfs -put target/elephant-bird-pig-4.5.jar hdfs:///jar/
$ hdfs dfs -put target/elephant-bird-hadoop-compat-4.5.jar hdfs:///jar/
$ hdfs dfs -put elephant-bird-core-4.5.jar hdfs:///jar/
New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes, appear in the text like this: "Once the form is filled in, we need to review and accept the terms of service and click on the Create Application button in the bottom-left corner of the page."
Note
Warnings or important notes appear in a box like this.
Tip
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
Downloading the example code
The source code for this book can be found on GitHub at https://github.com/learninghadoop2/book-examples. The authors will be applying any errata to this code and keeping it up to date as the technologies evolve. In addition, you can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy
Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <[email protected]> with a link to the suspected pirated material.
We appreciate your help in protecting our authors, and our ability to bring you valuable content.
Questions
You can contact us at <[email protected]> if you are having a problem with any aspect of the book, and we will do our best to address it.
Chapter 1. Introduction
This book will teach you how to build amazing systems using the latest release of Hadoop. Before you change the world though, we need to do some groundwork, which is where this chapter comes in.
In this introductory chapter, we will cover the following topics:
A brief refresher on the background to Hadoop
A walk-through of Hadoop's evolution
The key elements in Hadoop 2
The Hadoop distributions we'll use in this book
The dataset we'll use for examples
A note on versioning
In Hadoop 1, the version history was somewhat convoluted with multiple forked branches in the 0.2x range, leading to odd situations, where a 1.x version could, in some situations, have fewer features than a 0.23 release. In the version 2 codebase, this is fortunately much more straightforward, but it's important to clarify exactly which version we will use in this book.
Hadoop 2.0 was released in alpha and beta versions, and along the way, several incompatible changes were introduced. There was, in particular, a major API stabilization effort between the beta and final release stages.
Hadoop 2.2.0 was the first general availability (GA) release of the Hadoop 2 codebase, and its interfaces are now declared stable and forward compatible. We will therefore use the 2.2 product and interfaces in this book. Though the principles will be usable on a 2.0 beta, in particular, there will be API incompatibilities in the beta. This is particularly important as MapReduce v2 was back-ported to Hadoop 1 by several distribution vendors, but these products were based on the beta and not the GA APIs. If you are using such a product, then you will encounter these incompatible changes. It is recommended that a release based upon Hadoop 2.2 or later is used for both the development and the production deployments of any Hadoop 2 workloads.
The background of Hadoop
We're assuming that most readers will have a little familiarity with Hadoop, or at the very least, with big data-processing systems. Consequently, we won't give a detailed background as to why Hadoop is successful or the types of problem it helps to solve in this book. However, particularly because of some aspects of Hadoop 2 and the other products we will use in later chapters, it is useful to give a sketch of how we see Hadoop fitting into the technology landscape and which are the particular problem areas where we believe it gives the most benefit.
In ancient times, before the term "big data" came into the picture (which equates to maybe a decade ago), there were few options to process datasets of sizes in terabytes and beyond. Some commercial databases could, with very specific and expensive hardware setups, be scaled to this level, but the expertise and capital expenditure required made it an option for only the largest organizations. Alternatively, one could build a custom system aimed at the specific problem at hand. This suffered from some of the same problems (expertise and cost) and added the risk inherent in any cutting-edge system. On the other hand, if a system was successfully constructed, it was likely a very good fit to the need.
Few small- to mid-size companies even worried about this space, not only because the solutions were out of their reach, but they generally also didn't have anything close to the data volumes that required such solutions. As the ability to generate very large datasets became more common, so did the need to process that data.
Even though large data became more democratized and was no longer the domain of the privileged few, major architectural changes were required if the data-processing systems could be made affordable to smaller companies. The first big change was to reduce the required upfront capital expenditure on the system; that means no high-end hardware or expensive software licenses. Previously, high-end hardware would have been utilized most commonly in a relatively small number of very large servers and storage systems, each of which had multiple approaches to avoid hardware failures. Though very impressive, such systems are hugely expensive, and moving to a larger number of lower-end servers would be the quickest way to dramatically reduce the hardware cost of a new system. Moving more toward commodity hardware instead of the traditional enterprise-grade equipment would also mean a reduction in capabilities in the area of resilience and fault tolerance. Those responsibilities would need to be taken up by the software layer. Smarter software, dumber hardware.
Google started the change that would eventually be known as Hadoop, when in 2003, and in 2004, they released two academic papers describing the Google File System (GFS) (http://research.google.com/archive/gfs.html) and MapReduce (http://research.google.com/archive/mapreduce.html). The two together provided a platform for very large-scale data processing in a highly efficient manner. Google had taken the build-it-yourself approach, but instead of constructing something aimed at one specific problem or dataset, they instead created a platform on which multiple processing applications could be implemented. In particular, they utilized large numbers of commodity servers and built GFS and MapReduce in a way that assumed hardware failures would be commonplace and were simply something that the software needed to deal with.
At the same time, Doug Cutting was working on the Nutch open source web crawler. He was working on elements within the system that resonated strongly once the Google GFS and MapReduce papers were published. Doug started work on open source implementations of these Google ideas, and Hadoop was soon born, firstly, as a subproject of Lucene, and then as its own top-level project within the Apache Software Foundation.
Yahoo! hired Doug Cutting in 2006 and quickly became one of the most prominent supporters of the Hadoop project. In addition to often publicizing some of the largest Hadoop deployments in the world, Yahoo! allowed Doug and other engineers to contribute to Hadoop while employed by the company, not to mention contributing back some of its own internally developed Hadoop improvements and extensions.
Components of Hadoop
The broad Hadoop umbrella project has many component subprojects, and we'll discuss several of them in this book. At its core, Hadoop provides two services: storage and computation. A typical Hadoop workflow consists of loading data into the Hadoop Distributed File System (HDFS) and processing using the MapReduce API or several tools that rely on MapReduce as an execution framework.
Hadoop 1: HDFS and MapReduce
Both layers are direct implementations of Google's own GFS and MapReduce technologies.
Common building blocks
Both HDFS and MapReduce exhibit several of the architectural principles described in the previous section. In particular, the common principles are as follows:
Both are designed to run on clusters of commodity (that is, low to medium specification) servers
Both scale their capacity by adding more servers (scale-out) as opposed to the previous models of using larger hardware (scale-up)
Both have mechanisms to identify and work around failures
Both provide most of their services transparently, allowing the user to concentrate on the problem at hand
Both have an architecture where a software cluster sits on the physical servers and manages aspects such as application load balancing and fault tolerance, without relying on high-end hardware to deliver these capabilities
Storage
HDFS is a filesystem, though not a POSIX-compliant one. This basically means that it does not display the same characteristics as that of a regular filesystem. In particular, the characteristics are as follows:
HDFS stores files in blocks that are typically at least 64 MB or (more commonly now) 128 MB in size, much larger than the 4-32 KB seen in most filesystems
HDFS is optimized for throughput over latency; it is very efficient at streaming reads of large files but poor when seeking for many small ones
HDFS is optimized for workloads that are generally write-once and read-many
Instead of handling disk failures by having physical redundancies in disk arrays or similar strategies, HDFS uses replication. Each of the blocks comprising a file is stored on multiple nodes within the cluster, and a service called the NameNode constantly monitors to ensure that failures have not dropped any block below the desired replication factor. If this does happen, then it schedules the making of another copy within the cluster.
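To make the block and replication numbers above concrete, the following short Python sketch (illustrative only, not part of the book's example code) works out how many blocks a file occupies and how much raw disk capacity its replicas consume, assuming a 128 MB block size and the default HDFS replication factor of 3:

block_size = 128 * 1024 ** 2        # assumed 128 MB block size
replication = 3                     # default HDFS replication factor
file_size = 1 * 1024 ** 3           # a hypothetical 1 GB file

# Number of blocks the file is split into (the last block may be partially filled)
blocks = -(-file_size // block_size)   # ceiling division
# HDFS does not pad the final block, so raw usage is simply file size times replication
raw_usage = file_size * replication

print '%d blocks, %d block replicas, %d GB of raw storage' % (
    blocks, blocks * replication, raw_usage // 1024 ** 3)

For a 1 GB file, this reports 8 blocks, 24 block replicas across the cluster, and 3 GB of raw storage consumed.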
Computation
MapReduce is an API, an execution engine, and a processing paradigm; it provides a series of transformations from a source into a result dataset. In the simplest case, the input data is fed through a map function and the resultant temporary data is then fed through a reduce function.
MapReduce works best on semistructured or unstructured data. Instead of data conforming to rigid schemas, the requirement is instead that the data can be provided to the map function as a series of key-value pairs. The output of the map function is a set of other key-value pairs, and the reduce function performs aggregation to collect the final set of results.
Hadoop provides a standard specification (that is, interface) for the map and reduce phases, and the implementation of these are often referred to as mappers and reducers. A typical MapReduce application will comprise a number of mappers and reducers, and it's not unusual for several of these to be extremely simple. The developer focuses on expressing the transformation between the source and the resultant data, and the Hadoop framework manages all aspects of job execution and coordination.
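To make the key-value flow concrete, here is a minimal word count sketch in Python, written in the style of Hadoop Streaming (covered in Chapter 9) rather than the Java API (covered in Chapter 3). The file names mapper.py and reducer.py are illustrative: the mapper emits a (word, 1) pair per word, and the reducer, which receives its input sorted by key, sums the counts for each word.

#!/usr/bin/env python
# mapper.py - read lines from standard input and emit one (word, 1) pair per word
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print '%s\t%d' % (word, 1)

#!/usr/bin/env python
# reducer.py - input arrives sorted by key, so all counts for a word are contiguous
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip('\n').split('\t')
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print '%s\t%d' % (current_word, current_count)
        current_word, current_count = word, int(count)
if current_word is not None:
    print '%s\t%d' % (current_word, current_count)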
Better together
It is possible to appreciate the individual merits of HDFS and MapReduce, but they are even more powerful when combined. They can be used individually, but when they are together, they bring out the best in each other, and this close interworking was a major factor in the success and acceptance of Hadoop 1.
When a MapReduce job is being planned, Hadoop needs to decide on which host to execute the code in order to process the dataset most efficiently. If the MapReduce cluster hosts are all pulling their data from a single storage host or array, then this largely doesn't matter as the storage system is a shared resource that will cause contention. If the storage system was more transparent and allowed MapReduce to manipulate its data more directly, then there would be an opportunity to perform the processing closer to the data, building on the principle of it being less expensive to move processing than data.
The most common deployment model for Hadoop sees the HDFS and MapReduce clusters deployed on the same set of servers. Each host that contains data and the HDFS component to manage the data also hosts a MapReduce component that can schedule and execute data processing. When a job is submitted to Hadoop, it can use the locality optimization to schedule data on the hosts where data resides as much as possible, thus minimizing network traffic and maximizing performance.
Hadoop 2 – what's the big deal?
If we look at the two main components of the core Hadoop distribution, storage and computation, we see that Hadoop 2 has a very different impact on each of them. Whereas the HDFS found in Hadoop 2 is mostly a much more feature-rich and resilient product than the HDFS in Hadoop 1, for MapReduce, the changes are much more profound and have, in fact, altered how Hadoop is perceived as a processing platform in general. Let's look at HDFS in Hadoop 2 first.
Storage in Hadoop 2
We'll discuss the HDFS architecture in more detail in Chapter 2, Storage, but for now, it's sufficient to think of a master-slave model. The slave nodes (called DataNodes) hold the actual filesystem data. In particular, each host running a DataNode will typically have one or more disks onto which files containing the data for each HDFS block are written. The DataNode itself has no understanding of the overall filesystem; its role is to store, serve, and ensure the integrity of the data for which it is responsible.
The master node (called the NameNode) is responsible for knowing which of the DataNodes holds which block and how these blocks are structured to form the filesystem. When a client looks at the filesystem and wishes to retrieve a file, it's via a request to the NameNode that the list of required blocks is retrieved.
This model works well and has been scaled to clusters with tens of thousands of nodes at companies such as Yahoo! So, though it is scalable, there is a resiliency risk; if the NameNode becomes unavailable, then the entire cluster is rendered effectively useless. No HDFS operations can be performed, and since the vast majority of installations use HDFS as the storage layer for services, such as MapReduce, they also become unavailable even if they are still running without problems.
More catastrophically, the NameNode stores the filesystem metadata to a persistent file on its local filesystem. If the NameNode host crashes in a way that this data is not recoverable, then all data on the cluster is effectively lost forever. The data will still exist on the various DataNodes, but the mapping of which blocks comprise which files is lost. This is why, in Hadoop 1, the best practice was to have the NameNode synchronously write its filesystem metadata to both local disks and at least one remote network volume (typically via NFS).
Several NameNode high-availability (HA) solutions have been made available by third-party suppliers, but the core Hadoop product did not offer such resilience in Version 1. Given this architectural single point of failure and the risk of data loss, it won't be a surprise to hear that NameNode HA is one of the major features of HDFS in Hadoop 2 and is something we'll discuss in detail in later chapters. The feature provides both a standby NameNode that can be automatically promoted to service all requests should the active NameNode fail, but also builds additional resilience for the critical filesystem metadata atop this mechanism.
HDFS in Hadoop 2 is still a non-POSIX filesystem; it still has a very large block size and it still trades latency for throughput. However, it does now have a few capabilities that can make it look a little more like a traditional filesystem. In particular, the core HDFS in Hadoop 2 now can be remotely mounted as an NFS volume. This is another feature that was previously offered as a proprietary capability by third-party suppliers but is now in the main Apache codebase.
Overall, the HDFS in Hadoop 2 is more resilient and can be more easily integrated into existing workflows and processes. It's a strong evolution of the product found in Hadoop 1.
Computation in Hadoop 2
The work on HDFS 2 was started before a direction for MapReduce crystallized. This was likely due to the fact that features such as NameNode HA were such an obvious path that the community knew the most critical areas to address. However, MapReduce didn't really have a similar list of areas of improvement, and that's why, when the MRv2 initiative started, it wasn't completely clear where it would lead.
Perhaps the most frequent criticism of MapReduce in Hadoop 1 was how its batch processing model was ill-suited to problem domains where faster response times were required. Hive, for example, which we'll discuss in Chapter 7, Hadoop and SQL, provides a SQL-like interface onto HDFS data, but, behind the scenes, the statements are converted into MapReduce jobs that are then executed like any other. A number of other products and tools took a similar approach, providing a specific user-facing interface that hid a MapReduce translation layer.
Though this approach has been very successful, and some amazing products have been built, the fact remains that in many cases, there is a mismatch as all of these interfaces, some of which expect a certain type of responsiveness, are behind the scenes, being executed on a batch-processing platform. When looking to enhance MapReduce, improvements could be made to make it a better fit to these use cases, but the fundamental mismatch would remain. This situation led to a significant change of focus of the MRv2 initiative; perhaps MapReduce itself didn't need change, but the real need was to enable different processing models on the Hadoop platform. Thus was born Yet Another Resource Negotiator (YARN).
Looking at MapReduce in Hadoop 1, the product actually did two quite different things; it provided the processing framework to execute MapReduce computations, but it also managed the allocation of this computation across the cluster. Not only did it direct data to and between the specific map and reduce tasks, but it also determined where each task would run, and managed the full job lifecycle, monitoring the health of each task and node, rescheduling if any failed, and so on.
This is not a trivial task, and the automated parallelization of workloads has always been one of the main benefits of Hadoop. If we look at MapReduce in Hadoop 1, we see that after the user defines the key criteria for the job, everything else is the responsibility of the system. Critically, from a scale perspective, the same MapReduce job can be applied to datasets of any volume hosted on clusters of any size. If the data is 1 GB in size and on a single host, then Hadoop will schedule the processing accordingly. If the data is instead 1 PB in size and hosted across 1,000 machines, then it does likewise. From the user's perspective, the actual scale of the data and cluster is transparent, and aside from affecting the time taken to process the job, it does not change the interface with which to interact with the system.
In Hadoop 2, this role of job scheduling and resource management is separated from that of executing the actual application, and is implemented by YARN.
YARN is responsible for managing the cluster resources, and so MapReduce exists as an application that runs atop the YARN framework. The MapReduce interface in Hadoop 2 is completely compatible with that in Hadoop 1, both semantically and practically. However, under the covers, MapReduce has become a hosted application on the YARN framework.
The significance of this split is that other applications can be written that provide processing models more focused on the actual problem domain and can offload all the resource management and scheduling responsibilities to YARN. The latest versions of many different execution engines have been ported onto YARN, either in a production-ready or experimental state, and it has shown that the approach can allow a single Hadoop cluster to run everything from batch-oriented MapReduce jobs through fast-response SQL queries to continuous data streaming and even to implement models such as graph processing and the Message Passing Interface (MPI) from the High Performance Computing (HPC) world. The following diagram shows the architecture of Hadoop 2:
Hadoop 2
This is why much of the attention and excitement around Hadoop 2 has been focused on YARN and frameworks that sit on top of it, such as Apache Tez and Apache Spark. With YARN, the Hadoop cluster is no longer just a batch-processing engine; it is the single platform on which a vast array of processing techniques can be applied to the enormous data volumes stored in HDFS. Moreover, applications can build on these computation paradigms and execution models.
The analogy that is achieving some traction is to think of YARN as the processing kernel upon which other domain-specific applications can be built. We'll discuss YARN in more detail in this book, particularly in Chapter 3, Processing – MapReduce and Beyond, Chapter 4, Real-time Computation with Samza, and Chapter 5, Iterative Computation with Spark.
Distributions of Apache Hadoop
In the very early days of Hadoop, the burden of installing (often building from source) and managing each component and its dependencies fell on the user. As the system became more popular and the ecosystem of third-party tools and libraries started to grow, the complexity of installing and managing a Hadoop deployment increased dramatically to the point where providing a coherent offer of software packages, documentation, and training built around the core Apache Hadoop has become a business model. Enter the world of distributions for Apache Hadoop.
Hadoop distributions are conceptually similar to how Linux distributions provide a set of integrated software around a common core. They take the burden of bundling and packaging software themselves and provide the user with an easy way to install, manage, and deploy Apache Hadoop and a selected number of third-party libraries. In particular, the distribution releases deliver a series of product versions that are certified to be mutually compatible. Historically, putting together a Hadoop-based platform was often greatly complicated by the various version interdependencies.
Cloudera (http://www.cloudera.com), Hortonworks (http://www.hortonworks.com), and MapR (http://www.mapr.com) are amongst the first to have reached the market, each characterized by different approaches and selling points. Hortonworks positions itself as the open source player; Cloudera is also committed to open source but adds proprietary bits for configuring and managing Hadoop; MapR provides a hybrid open source/proprietary Hadoop distribution characterized by a proprietary NFS layer instead of HDFS and a focus on providing services.
Another strong player in the distributions ecosystem is Amazon, which offers a version of Hadoop called Elastic MapReduce (EMR) on top of the Amazon Web Services (AWS) infrastructure.
With the advent of Hadoop 2, the number of available distributions for Hadoop has increased dramatically, far in excess of the four we mentioned. A possibly incomplete list of software offerings that includes Apache Hadoop can be found at http://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Support.
A dual approach
In this book, we will discuss both the building and the management of local Hadoop clusters in addition to showing how to push the processing into the cloud via EMR.
The reason for this is twofold: firstly, though EMR makes Hadoop much more accessible, there are aspects of the technology that only become apparent when manually administering the cluster. Although it is also possible to use EMR in a more manual mode, we'll generally use a local cluster for such explorations. Secondly, though it isn't necessarily an either/or decision, many organizations use a mixture of in-house and cloud-hosted capacities, sometimes due to a concern of overreliance on a single external provider, but practically speaking, it's often convenient to do development and small-scale tests on local capacity and then deploy at production scale into the cloud.
In a few of the later chapters, where we discuss additional products that integrate with Hadoop, we'll mostly give examples of local clusters, as there is no difference between how the products work regardless of where they are deployed.
AWS – infrastructure on demand from Amazon
AWS is a set of cloud-computing services offered by Amazon. We will use several of these services in this book.
Simple Storage Service (S3)
Amazon's Simple Storage Service (S3), found at http://aws.amazon.com/s3/, is a storage service that provides a simple key-value storage model. Using web, command-line, or programmatic interfaces to create objects, which can be anything from text files to images to MP3s, you can store and retrieve your data based on a hierarchical model. In this model, you create buckets that contain objects. Each bucket has a unique identifier, and within each bucket, every object is uniquely named. This simple strategy enables an extremely powerful service for which Amazon takes complete responsibility (for service scaling, in addition to reliability and availability of data).
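As a minimal sketch of the bucket/object model, the following Python snippet uses the boto3 library (an AWS SDK that is not used elsewhere in this book; the bucket name is a placeholder and must be globally unique across all AWS accounts):

# Illustrative only; assumes boto3 is installed, credentials are configured,
# and the default us-east-1 region is in use.
import boto3

s3 = boto3.client('s3')
s3.create_bucket(Bucket='my-example-bucket')                       # a bucket is a namespace
s3.put_object(Bucket='my-example-bucket', Key='tweets/sample.txt',  # a key names an object
              Body=b'hello S3')
obj = s3.get_object(Bucket='my-example-bucket', Key='tweets/sample.txt')
print obj['Body'].read()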
Elastic MapReduce (EMR)
Amazon's Elastic MapReduce, found at http://aws.amazon.com/elasticmapreduce/, is basically Hadoop in the cloud. Using any of the multiple interfaces (web console, CLI, or API), a Hadoop workflow is defined with attributes such as the number of Hadoop hosts required and the location of the source data. The Hadoop code implementing the MapReduce jobs is provided, and the virtual Go button is pressed.
In its most impressive mode, EMR can pull source data from S3, process it on a Hadoop cluster it creates on Amazon's virtual host on-demand service EC2, push the results back into S3, and terminate the Hadoop cluster and the EC2 virtual machines hosting it. Naturally, each of these services has a cost (usually on a per GB stored and server-time usage basis), but the ability to access such powerful data-processing capabilities with no need for dedicated hardware is a powerful one.
Getting started
We will now describe the two environments we will use throughout the book: Cloudera's QuickStart virtual machine will be our reference system on which we will show all examples, but we will additionally demonstrate some examples on Amazon's EMR when there is some particularly valuable aspect to running the example in the on-demand service.
Although the examples and code provided are aimed at being as general-purpose and portable as possible, our reference setup, when talking about a local cluster, will be Cloudera running atop CentOS Linux.
For the most part, we will show examples that make use of, or are executed from, a terminal prompt. Although Hadoop's graphical interfaces have improved significantly over the years (for example, the excellent HUE and Cloudera Manager), when it comes to development, automation, and programmatic access to the system, the command line is still the most powerful tool for the job.
All examples and source code presented in this book can be downloaded from https://github.com/learninghadoop2/book-examples. In addition, we have a homepage for the book where we will publish updates and related material at http://learninghadoop2.com.
Cloudera QuickStart VM
One of the advantages of Hadoop distributions is that they give access to easy-to-install, packaged software. Cloudera takes this one step further and provides a freely downloadable Virtual Machine instance of its latest distribution, known as the CDH QuickStart VM, deployed on top of CentOS Linux.
In the remaining parts of this book, we will use the CDH 5.0.0 VM as the reference and baseline system to run examples and source code. Images of the VM are available for VMware (http://www.vmware.com/nl/products/player/), KVM (http://www.linux-kvm.org/page/Main_Page), and VirtualBox (https://www.virtualbox.org/) virtualization systems.
Amazon EMR
Before using Elastic MapReduce, we need to set up an AWS account and register it with the necessary services.
Creating an AWS account
Amazon has integrated its general accounts with AWS, which means that, if you already have an account for any of the Amazon retail websites, this is the only account you will need to use AWS services.
Note
Note that AWS services have a cost; you will need an active credit card associated with the account to which charges can be made.
If you require a new Amazon account, go to http://aws.amazon.com, select Create a new AWS account, and follow the prompts. Amazon has added a free tier for some services, so you might find that in the early days of testing and exploration, you are keeping many of your activities within the non-charged tier. The scope of the free tier has been expanding, so make sure you know what you will and won't be charged for.
Signing up for the necessary services
Once you have an Amazon account, you will need to register it for use with the required AWS services, that is, Simple Storage Service (S3), Elastic Compute Cloud (EC2), and Elastic MapReduce. There is no cost to simply sign up to any AWS service; the process just makes the service available to your account.
Go to the S3, EC2, and EMR pages linked from http://aws.amazon.com, click on the Sign up button on each page, and then follow the prompts.
Using Elastic MapReduce
Having created an account with AWS and registered all the required services, we can proceed to configure programmatic access to EMR.
Getting Hadoop up and running
Note
Caution! This costs real money!
Before going any further, it is critical to understand that use of AWS services will incur charges that will appear on the credit card associated with your Amazon account. Most of the charges are quite small and increase with the amount of infrastructure consumed; storing 10 GB of data in S3 costs 10 times more than 1 GB, and running 20 EC2 instances costs 20 times as much as a single one. There are tiered cost models, so the actual costs tend to have smaller marginal increases at higher levels. But you should read carefully through the pricing sections for each service before using any of them. Note also that currently data transfer out of AWS services, such as EC2 and S3, is chargeable, but data transfer between services is not. This means it is often most cost-effective to carefully design your use of AWS to keep data within AWS through as much of the data processing as possible. For information regarding AWS and EMR, consult http://aws.amazon.com/elasticmapreduce/#pricing.
How to use EMR
Amazon provides both web and command-line interfaces to EMR. Both interfaces are just a frontend to the very same system; a cluster created with the command-line interface can be inspected and managed with the web tools and vice-versa.
For the most part, we will be using the command-line tools to create and manage clusters programmatically and will fall back on the web interface in cases where it makes sense to do so.
AWS credentials
Before using either programmatic or command-line tools, we need to look at how an account holder authenticates to AWS to make such requests.
Each AWS account has several identifiers, such as the following, that are used when accessing the various services:
Account ID: each AWS account has a numeric ID.
Access key: the associated access key is used to identify the account making the request.
Secret access key: the partner to the access key is the secret access key. The access key is not a secret and could be exposed in service requests, but the secret access key is what you use to validate yourself as the account owner. Treat it like your credit card.
Key pairs: these are the key pairs used to log in to EC2 hosts. It is possible to either generate public/private key pairs within EC2 or to import externally generated keys into the system.
User credentials and permissions are managed via a web service called Identity and Access Management (IAM), which you need to sign up to in order to obtain access and secret keys.
If this sounds confusing, it's because it is, at least at first. When using a tool to access an AWS service, there's usually the single, upfront step of adding the right credentials to a configured file, and then everything just works. However, if you do decide to explore programmatic or command-line tools, it will be worth investing a little time to read the documentation for each service to understand how its security works. More information on creating an AWS account and obtaining access credentials can be found at http://docs.aws.amazon.com/iam.
The AWS command-line interface
Each AWS service historically had its own set of command-line tools. Recently though, Amazon has created a single, unified command-line tool that allows access to most services. The Amazon CLI can be found at http://aws.amazon.com/cli.
It can be installed from a tarball or via the pip or easy_install package managers.
On the CDH QuickStart VM, we can install awscli using the following command:
$ pip install awscli
In order to access the API, we need to configure the software to authenticate to AWS using our access and secret keys.
This is also a good moment to set up an EC2 key pair by following the instructions provided at https://console.aws.amazon.com/ec2/home?region=us-east-1#c=EC2&s=KeyPairs.
Although a key pair is not strictly necessary to run an EMR cluster, it will give us the capability to remotely log in to the master node and gain low-level access to the cluster.
The following command will guide you through a series of configuration steps and store the resulting configuration in the .aws/credentials file:
$ aws configure
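The stored file looks something like the following (the values shown are the placeholder keys used in the AWS documentation, not real credentials); depending on the CLI version, the default region and output format may be written to a separate .aws/config file:

[default]
aws_access_key_id = AKIAIOSFODNN7EXAMPLE
aws_secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY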
Once the CLI is configured, we can query AWS with aws <service> <arguments>. To create and query an S3 bucket, use something like the following command. Note that S3 buckets need to be globally unique across all AWS accounts, so most common names, such as s3://mybucket, will not be available:
$ aws s3 mb s3://learninghadoop2
$ aws s3 ls
We can provision an EMR cluster with five m1.xlarge nodes using the following commands:
$ aws emr create-cluster --name "EMRcluster" \
--ami-version 3.2.0 \
--instance-type m1.xlarge \
--instance-count 5 \
--log-uri s3://learninghadoop2/emr-logs
Where --ami-version is the ID of an Amazon Machine Image template (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html), and --log-uri instructs EMR to collect logs and store them in the learninghadoop2 S3 bucket.
Note
If you did not specify a default region when setting up the AWS CLI, then you will also have to add one to most EMR commands in the AWS CLI using the --region argument; for example, --region eu-west-1 is used for the EU Ireland region. You can find details of all available AWS regions at http://docs.aws.amazon.com/general/latest/gr/rande.html.
We can submit workflows by adding steps to a running cluster using the following command:
$ aws emr add-steps --cluster-id <cluster> --steps <steps>
To terminate the cluster, use the following command line:
$ aws emr terminate-clusters --cluster-id <cluster>
In later chapters, we will show you how to add steps to execute MapReduce jobs and Pig scripts.
More information on using the AWS CLI can be found at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-manage.html.
Running the examples
The source code of all examples is available at https://github.com/learninghadoop2/book-examples.
Gradle (http://www.gradle.org/) scripts and configurations are provided to compile most of the Java code. The gradlew script included with the examples will bootstrap Gradle and use it to fetch dependencies and compile code.
JAR files can be created by invoking the jar task via the gradlew script, as follows:
./gradlew jar
Jobs are usually executed by submitting a JAR file using the hadoop jar command, as follows:
$ hadoop jar example.jar <MainClass> [-libjars $LIBJARS] arg1 arg2 … argN
The optional -libjars parameter specifies runtime third-party dependencies to ship to remote nodes.
Note
Some of the frameworks we will work with, such as Apache Spark, come with their own build and package management tools. Additional information and resources will be provided for these particular cases.
The copyJar Gradle task can be used to download third-party dependencies into build/libjars/<example>/lib, as follows:
./gradlew copyJar
For convenience, we provide a fatJar Gradle task that bundles the example classes and their dependencies into a single JAR file. Although this approach is discouraged in favor of using -libjars, it might come in handy when dealing with dependency issues.
The following command will generate build/libs/<example>-all.jar:
$ ./gradlew fatJar
Data processing with Hadoop
In the remaining chapters of this book, we will introduce the core components of the Hadoop ecosystem as well as a number of third-party tools and libraries that will make writing robust, distributed code an accessible and hopefully enjoyable task. While reading this book, you will learn how to collect, process, store, and extract information from large amounts of structured and unstructured data.
We will use a dataset generated from Twitter's (http://www.twitter.com) real-time firehose. This approach will allow us to experiment with relatively small datasets locally and, once ready, scale the examples up to production-level data sizes.
Why Twitter?
Thanks to its programmatic APIs, Twitter provides an easy way to generate datasets of arbitrary size and inject them into our local- or cloud-based Hadoop clusters. Other than the sheer size, the dataset that we will use has a number of properties that fit several interesting data modeling and processing use cases.
Twitter data possesses the following properties:
Unstructured: each status update is a text message that can contain references to media content such as URLs and images
Structured: tweets are timestamped, sequential records
Graph: relationships such as replies and mentions can be modeled as a network of interactions
Geolocated: the location where a tweet was posted or where a user resides
Real time: all data generated on Twitter is available via a real-time firehose
These properties will be reflected in the type of application that we can build with Hadoop. These include examples of sentiment analysis, social network analysis, and trend analysis.
Building our first dataset
Twitter's terms of service prohibit redistribution of user-generated data in any form; for this reason, we cannot make available a common dataset. Instead, we will use a Python script to programmatically access the platform and create a dump of user tweets collected from a live stream.
One service, multiple APIs
Twitter users share more than 200 million tweets, also known as status updates, a day. The platform offers access to this corpus of data via four types of APIs, each of which represents a facet of Twitter and aims at satisfying specific use cases, such as linking and interacting with Twitter content from third-party sources (Twitter for Products), programmatic access to specific users' or sites' content (REST), search capabilities across users' or sites' timelines (Search), and access to all content created on the Twitter network in real time (Streaming).
The Streaming API allows direct access to the Twitter stream, tracking keywords, retrieving geotagged tweets from a certain region, and much more. In this book, we will make use of this API as a data source to illustrate both the batch and real-time capabilities of Hadoop. We will not, however, interact with the API itself; rather, we will make use of third-party libraries to offload chores such as authentication and connection management.
Anatomy of a Tweet
Each tweet object returned by a call to the real-time APIs is represented as a serialized JSON string that contains a set of attributes and metadata in addition to a textual message. This additional content includes a numerical ID that uniquely identifies the tweet, the location where the tweet was shared, the user who shared it (user object), whether it was republished by other users (retweeted) and how many times (retweet count), the machine-detected language of its text, whether the tweet was posted in reply to someone and, if so, the user and tweet IDs it replied to, and so on.
The structure of a Tweet, and any other object exposed by the API, is constantly evolving. An up-to-date reference can be found at https://dev.twitter.com/docs/platform-objects/tweets.
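A heavily abridged sketch of such a JSON object is shown below; the field names follow the platform objects documentation linked above, while the values are invented placeholders and many fields are omitted:

{
  "id": 210462857140252672,
  "created_at": "Wed Jun 06 20:07:10 +0000 2012",
  "text": "Just an example status update #hadoop",
  "lang": "en",
  "retweet_count": 0,
  "in_reply_to_status_id": null,
  "in_reply_to_user_id": null,
  "coordinates": null,
  "user": {
    "id": 12345,
    "screen_name": "example_user",
    "followers_count": 42
  },
  "entities": {
    "hashtags": [{"text": "hadoop", "indices": [31, 38]}],
    "urls": [],
    "user_mentions": []
  }
}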
Twitter credentials
Twitter makes use of the OAuth protocol to authenticate and authorize access from third-party software to its platform.
The application obtains, through an external channel such as a web form, the following pair of credentials:
Consumer key
Consumer secret
The consumer secret is never directly transmitted to the third party as it is used to sign each request.
The user authorizes the application to access the service via a three-way process that, once completed, grants the application a token consisting of the following:
Access token
Access secret
Similarly to the consumer secret, the access secret is never directly transmitted to the third party, and it is used to sign each request.
In order to use the Streaming API, we will first need to register an application and grant it programmatic access to the system. If you require a new Twitter account, proceed to the signup page at https://twitter.com/signup, and fill in the required information. Once this step is completed, we need to create a sample application that will access the API on our behalf and grant it the proper authorization rights. We will do so using the web form found at https://dev.twitter.com/apps.
When creating a new app, we are asked to give it a name, a description, and a URL. The following screenshot shows the settings of a sample application named Learning Hadoop 2 Book Dataset. For the purpose of this book, we do not need to specify a valid URL, so we used a placeholder instead.
Once the form is filled in, we need to review and accept the terms of service and click on the Create Application button in the bottom-left corner of the page.
We are now presented with a page that summarizes our application details as seen in the following screenshot; the authentication and authorization credentials can be found under the OAuth Tool tab.
We are finally ready to generate our very first Twitter dataset.
Programmatic access with Python
In this section, we will use Python and the tweepy library, found at https://github.com/tweepy/tweepy, to collect Twitter's data. The stream.py file found in the ch1 directory of the book code archive instantiates a listener to the real-time firehose, grabs a data sample, and echoes each tweet's text to standard output.
The tweepy library can be installed using either the easy_install or pip package managers or by cloning the repository at https://github.com/tweepy/tweepy.
On the CDH QuickStart VM, we can install tweepy using the following command line:
$ pip install tweepy
When invoked with the -j parameter, the script will output a JSON tweet to standard output; -t extracts and prints the text field. We specify how many tweets to print with -n <num tweets>. When -n is not specified, the script will run indefinitely. Execution can be terminated by pressing Ctrl + C.
The script expects OAuth credentials to be stored as shell environment variables; the following credentials will have to be set in the terminal session from where stream.py will be executed.
$ export TWITTER_CONSUMER_KEY="your_consumer_key"
$ export TWITTER_CONSUMER_SECRET="your_consumer_secret"
$ export TWITTER_ACCESS_KEY="your_access_key"
$ export TWITTER_ACCESS_SECRET="your_access_secret"
Once the required dependency has been installed and the OAuth data in the shell environment has been set, we can run the program as follows:
$ python stream.py -t -n 1000 > tweets.txt
We are relying on Linux's shell I/O to redirect the output of stream.py to a file called tweets.txt with the > operator. If everything was executed correctly, you should see a wall of text, where each line is a tweet.
Notice that in this example, we did not make use of Hadoop at all. In the next chapters, we will show how to import a dataset generated from the Streaming API into Hadoop and analyze its content on the local cluster and Amazon EMR.
For now, let's take a look at the source code of stream.py, which can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch1/stream.py:
import tweepy
import os
import json
import argparse

# Read the OAuth credentials from the shell environment
consumer_key = os.environ['TWITTER_CONSUMER_KEY']
consumer_secret = os.environ['TWITTER_CONSUMER_SECRET']
access_key = os.environ['TWITTER_ACCESS_KEY']
access_secret = os.environ['TWITTER_ACCESS_SECRET']

# Listener that echoes each tweet (raw JSON or just its text) to standard output
class EchoStreamListener(tweepy.StreamListener):
    def __init__(self, api, dump_json=False, numtweets=0):
        self.api = api
        self.dump_json = dump_json
        self.count = 0
        self.limit = int(numtweets)
        super(tweepy.StreamListener, self).__init__()

    def on_data(self, tweet):
        tweet_data = json.loads(tweet)
        if 'text' in tweet_data:
            if self.dump_json:
                print tweet.rstrip()
            else:
                print tweet_data['text'].encode("utf-8").rstrip()
            self.count = self.count + 1
            # Stop streaming once the requested number of tweets has been printed
            return False if self.count == self.limit else True

    def on_error(self, status_code):
        return True

    def on_timeout(self):
        return True
…

if __name__ == '__main__':
    parser = get_parser()
    args = parser.parse_args()

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_key, access_secret)
    api = tweepy.API(auth)

    sapi = tweepy.streaming.Stream(
        auth, EchoStreamListener(
            api=api,
            dump_json=args.json,
            numtweets=args.numtweets))
    sapi.sample()
First, we import three dependencies: tweepy, and the os and json modules, which come with the Python interpreter version 2.6 or greater.
We then define a class, EchoStreamListener, that inherits and extends StreamListener from tweepy. As the name suggests, StreamListener listens for events and tweets being published on the real-time stream and performs actions accordingly.
Whenever a new event is detected, it triggers a call to on_data(). In this method, we extract the text field from a tweet object and print it to standard output with UTF-8 encoding. Alternatively, if the script is invoked with -j, we print the whole JSON tweet. When the script is executed, we instantiate a tweepy.OAuthHandler object with the OAuth credentials that identify our Twitter account, and then we use this object to authenticate with the application access and secret key. We then use the auth object to create an instance of the tweepy.API class (api).
Upon successful authentication, we tell Python to listen for events on the real-time stream using EchoStreamListener.
An HTTP GET request to the statuses/sample endpoint is performed by sample(). The request returns a random sample of all public statuses.
Note
Beware! By default, sample() will run indefinitely. Remember to explicitly terminate the method call by pressing Ctrl + C.
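If you would rather collect tweets on specific topics than a random sample, tweepy's Stream class also exposes a filter() method that tracks keywords. The following is a minimal sketch, assuming the same auth and EchoStreamListener objects defined in stream.py; the keyword list is purely illustrative:

# Track tweets mentioning specific (illustrative) keywords instead of sampling
sapi = tweepy.streaming.Stream(
    auth, EchoStreamListener(api=api, dump_json=False, numtweets=100))
sapi.filter(track=['hadoop', 'bigdata'])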
Summary
This chapter gave a whirlwind tour of where Hadoop came from, its evolution, and why the version 2 release is such a major milestone. We also described the emerging market in Hadoop distributions and how we will use a combination of local and cloud distributions in the book.
Finally, we described how to set up the needed software, accounts, and environments required in subsequent chapters and demonstrated how to pull data from the Twitter stream that we will use for examples.
With this background out of the way, we will now move on to a detailed examination of the storage layer within Hadoop.
Chapter 2. Storage
After the overview of Hadoop in the previous chapter, we will now start looking at its various component parts in more detail. We will start at the conceptual bottom of the stack in this chapter: the means and mechanisms for storing data within Hadoop. In particular, we will discuss the following topics:
Describe the architecture of the Hadoop Distributed File System (HDFS)
Show what enhancements to HDFS have been made in Hadoop 2
Explore how to access HDFS using command-line tools and the Java API
Give a brief description of ZooKeeper, another (sort of) filesystem within Hadoop
Survey considerations for storing data in Hadoop and the available file formats
In Chapter 3, Processing – MapReduce and Beyond, we will describe how Hadoop provides the framework to allow data to be processed.
The inner workings of HDFS
In Chapter 1, Introduction, we gave a very high-level overview of HDFS; we will now explore it in a little more detail. As mentioned in that chapter, HDFS can be viewed as a filesystem, though one with very specific performance characteristics and semantics. It's implemented with two main server processes: the NameNode and the DataNodes, configured in a master/slave setup. If you view the NameNode as holding all the filesystem metadata and the DataNodes as holding the actual filesystem data (blocks), then this is a good starting point. Every file placed onto HDFS will be split into multiple blocks that might reside on numerous DataNodes, and it's the NameNode that understands how these blocks can be combined to construct the files.
Cluster startup
Let's explore the various responsibilities of these nodes and the communication between them by assuming we have an HDFS cluster that was previously shut down and then examining the startup behavior.
NameNode startup
We'll firstly consider the startup of the NameNode (though there is no actual ordering requirement for this and we are doing it for narrative reasons alone). The NameNode actually stores two types of data about the filesystem:
The structure of the filesystem, that is, directory names, filenames, locations, and attributes
The blocks that comprise each file on the filesystem
This data is stored in files that the NameNode reads at startup. Note that the NameNode does not persistently store the mapping of the blocks that are stored on particular DataNodes; we'll see how that information is communicated shortly.
Because the NameNode relies on this in-memory representation of the filesystem, it tends to have quite different hardware requirements compared to the DataNodes. We'll explore hardware selection in more detail in Chapter 10, Running a Hadoop Cluster; for now, just remember that the NameNode tends to be quite memory hungry. This is particularly true on very large clusters with many (millions or more) files, particularly if these files have very long names. This scaling limitation on the NameNode has also led to an additional Hadoop 2 feature that we will not explore in much detail: NameNode federation, whereby multiple NameNodes (or NameNode HA pairs) work collaboratively to provide the overall metadata for the full filesystem.
The main file written by the NameNode is called fsimage; this is the single most important piece of data in the entire cluster, as without it, the knowledge of how to reconstruct all the data blocks into the usable filesystem is lost. This file is read into memory and all future modifications to the filesystem are applied to this in-memory representation of the filesystem. The NameNode does not write out new versions of fsimage as new changes are applied after it is run; instead, it writes another file called edits, which is a list of the changes that have been made since the last version of fsimage was written.
The NameNode startup process is to first read the fsimage file, then to read the edits file, and apply all the changes stored in the edits file to the in-memory copy of fsimage. It then writes to disk a new up-to-date version of the fsimage file and is ready to receive client requests.
DataNode startup
When the DataNodes start up, they first catalog the blocks for which they hold copies. Typically, these blocks will be written simply as files on the local DataNode filesystem.
The DataNode will perform some block consistency checking and then report to the NameNode the list of blocks for which it has valid copies. This is how the NameNode constructs the final mapping it requires, by learning which blocks are stored on which DataNodes. Once the DataNode has registered itself with the NameNode, an ongoing series of heartbeat requests will be sent between the nodes to allow the NameNode to detect DataNodes that have shut down, become unreachable, or have newly entered the cluster.
Block replication
HDFS replicates each block onto multiple DataNodes; the default replication factor is 3, but this is configurable on a per-file level. HDFS can also be configured to be able to determine whether given DataNodes are in the same physical hardware rack or not. Given smart block placement and this knowledge of the cluster topology, HDFS will attempt to place the second replica on a different host but in the same equipment rack as the first and the third on a host outside the rack. In this way, the system can survive the failure of as much as a full rack of equipment and still have at least one live replica for each block. As we'll see in Chapter 3, Processing – MapReduce and Beyond, knowledge of block placement also allows Hadoop to schedule processing as near as possible to a replica of each block, which can greatly improve performance.
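Because the replication factor is a per-file property, it can also be changed after a file has been written. As a quick illustration (the path here is just an example), the hdfs dfs -setrep command asks the NameNode to adjust the target replication of an existing file, and the -w flag waits for the change to complete:
$ hdfs dfs -setrep -w 2 /user/cloudera/somefile.txt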
Remember that replication is a strategy for resilience but is not a backup mechanism; if you have data mastered in HDFS that is critical, then you need to consider backup or other approaches that give protection for errors, such as accidentally deleted files, against which replication will not defend.
When the NameNode starts up and is receiving the block reports from the DataNodes, it will remain in safe mode until a configurable threshold of blocks (the default is 99.9 percent) have been reported as live. While in safe mode, clients cannot make any modifications to the filesystem.
Command-line access to the HDFS filesystem
Within the Hadoop distribution, there is a command-line utility called hdfs, which is the primary way to interact with the filesystem from the command line. Run this without any arguments to see the various subcommands available. There are many, though; several are used to do things like starting or stopping various HDFS components. The general form of the hdfs command is:
hdfs <sub-command> <command> [arguments]
The two main subcommands we will use in this book are:
dfs: This is used for general filesystem access and manipulation, including reading/writing and accessing files and directories
dfsadmin: This is used for administration and maintenance of the filesystem. We will not cover this command in detail, though. Have a look at the -report command, which gives a listing of the state of the filesystem and all DataNodes:
$ hdfs dfsadmin -report
Note
Note that the dfs and dfsadmin commands can also be used with the main Hadoop command-line utility, for example, hadoop fs -ls /. This was the approach in earlier versions of Hadoop but is now deprecated in favor of the hdfs command.
Exploring the HDFS filesystem
Run the following to get a list of the available commands provided by the dfs subcommand:
$ hdfs dfs
As will be seen from the output of the preceding command, many of these look similar to standard Unix filesystem commands and, not surprisingly, they work as would be expected. In our test VM, we have a user account called cloudera. Using this user, we can list the root of the filesystem as follows:
$ hdfs dfs -ls /
Found 7 items
drwxr-xr-x   - hbase hbase          0 2014-04-04 15:18 /hbase
drwxr-xr-x   - hdfs  supergroup     0 2014-10-21 13:16 /jar
drwxr-xr-x   - hdfs  supergroup     0 2014-10-15 15:26 /schema
drwxr-xr-x   - solr  solr           0 2014-04-04 15:16 /solr
drwxrwxrwt   - hdfs  supergroup     0 2014-11-12 11:29 /tmp
drwxr-xr-x   - hdfs  supergroup     0 2014-07-13 09:05 /user
drwxr-xr-x   - hdfs  supergroup     0 2014-04-04 15:15 /var
The output is very similar to the Unix ls command. The file attributes work the same as the user/group/world attributes on a Unix filesystem (including the t sticky bit as can be seen) plus details of the owner, group, and modification time of the directories. The column between the group name and the modified date is the size; this is 0 for directories but will have a value for files as we'll see in the code following the next information box:
Note
If relative paths are used, they are taken from the home directory of the user. If there is no home directory, we can create it using the following commands:
$ sudo -u hdfs hdfs dfs -mkdir /user/cloudera
$ sudo -u hdfs hdfs dfs -chown cloudera:cloudera /user/cloudera
The mkdir and chown steps require superuser privileges (sudo -u hdfs).
$ hdfs dfs -mkdir testdir
$ hdfs dfs -ls
Found 1 items
drwxr-xr-x   - cloudera cloudera          0 2014-11-13 11:21 testdir
Then, we can create a file, copy it to HDFS, and read its contents directly from its location on HDFS, as follows:
$ echo "Hello world" > testfile.txt
$ hdfs dfs -put testfile.txt testdir
Note that there is an older command called -copyFromLocal, which works in the same way as -put; you might see it in older documentation online. Now, run the following command and check the output:
$ hdfs dfs -ls testdir
Found 1 items
-rw-r--r--   3 cloudera cloudera         12 2014-11-13 11:21 testdir/testfile.txt
Note the new column between the file attributes and the owner; this is the replication factor of the file. Now, finally, run the following command:
$ hdfs dfs -tail testdir/testfile.txt
Hello world
Much of the rest of the dfs subcommands are pretty intuitive; play around. We'll explore snapshots and programmatic access to HDFS later in this chapter.
Protecting the filesystem metadata
Because the fsimage file is so critical to the filesystem, its loss is a catastrophic failure. In Hadoop 1, where the NameNode was a single point of failure, the best practice was to configure the NameNode to synchronously write the fsimage and edits files to both local storage plus at least one other location on a remote filesystem (often NFS). In the event of NameNode failure, a replacement NameNode could be started using this up-to-date copy of the filesystem metadata. The process would require non-trivial manual intervention, however, and would result in a period of complete cluster unavailability.
Secondary NameNode not to the rescue
The most unfortunately named component in all of Hadoop 1 was the SecondaryNameNode, which, not unreasonably, many people expect to be some sort of backup or standby NameNode. It is not; instead, the SecondaryNameNode was responsible only for periodically reading the latest version of the fsimage and edits files and creating a new up-to-date fsimage with the outstanding edits applied. On a busy cluster, this checkpoint could significantly speed up the restart of the NameNode by reducing the number of edits it had to apply before being able to service clients.
In Hadoop 2, the naming is more clear; there are Checkpoint nodes, which do the role previously performed by the SecondaryNameNode, plus Backup NameNodes, which keep a local up-to-date copy of the filesystem metadata, even though the process to promote a Backup node to be the primary NameNode is still a multistage manual process.
Hadoop 2 NameNode HA
In most production Hadoop 2 clusters, however, it makes more sense to use the full High Availability (HA) solution instead of relying on Checkpoint and Backup nodes. It is actually an error to try to combine NameNode HA with the Checkpoint and Backup node mechanisms.
The core idea is for a pair (currently no more than two are supported) of NameNodes configured in an active/passive cluster. One NameNode acts as the live master that services all client requests, and the second remains ready to take over should the primary fail. In particular, Hadoop 2 HDFS enables this HA through two mechanisms:
Providing a means for both NameNodes to have consistent views of the filesystem
Providing a means for clients to always connect to the master NameNode
Keeping the HA NameNodes in sync
There are actually two mechanisms by which the active and standby NameNodes keep their views of the filesystem consistent: use of an NFS share or the Quorum Journal Manager (QJM).
In the NFS case, there is an obvious requirement on an external remote NFS file share; note that as use of NFS was best practice in Hadoop 1 for a second copy of filesystem metadata, many clusters already have one. If high availability is a concern, though, it should be borne in mind that making NFS highly available often requires high-end and expensive hardware. When Hadoop 2 HA uses NFS, the NFS location becomes the primary location for the filesystem metadata. As the active NameNode writes all filesystem changes to the NFS share, the standby node detects these changes and updates its copy of the filesystem metadata accordingly.
The QJM mechanism uses an external service (the Journal Managers) instead of a filesystem. The Journal Manager cluster is an odd number of services (3, 5, and 7 are the most common) running on that number of hosts. All changes to the filesystem are submitted to the QJM service, and a change is treated as committed only when a majority of the QJM nodes have committed the change. The standby NameNode receives change updates from the QJM service and uses this information to keep its copy of the filesystem metadata up to date.
The QJM mechanism does not require additional hardware as the Checkpoint nodes are lightweight and can be co-located with other services. There is also no single point of failure in the model. Consequently, the QJM HA is usually the preferred option.
In either case, both in NFS-based HA and QJM-based HA, the DataNodes send block status reports to both NameNodes to ensure that both have up-to-date information of the mapping of blocks to DataNodes. Remember that this block assignment information is not held in the fsimage/edits data.
Client configuration
The clients to the HDFS cluster remain mostly unaware of the fact that NameNode HA is being used. The configuration files need to include the details of both NameNodes, but the mechanisms for determining which is the active NameNode, and when to switch to the standby, are fully encapsulated in the client libraries. The fundamental concept, though, is that instead of referring to an explicit NameNode host as in Hadoop 1, HDFS in Hadoop 2 identifies a nameservice ID for the NameNode, within which multiple individual NameNodes (each with its own NameNode ID) are defined for HA. Note that the concept of the nameservice ID is also used by NameNode federation, which we briefly mentioned earlier.
How a failover works
Failover can be either manual or automatic. A manual failover requires an administrator to trigger the switch that promotes the standby to the currently active NameNode. Though automatic failover has the greatest impact on maintaining system availability, there might be conditions in which this is not always desirable. Triggering a manual failover requires running only a few commands and, therefore, even in this mode, the failover is significantly easier than in the case of Hadoop 1 or with Hadoop 2 Backup nodes, where the transition to a new NameNode requires substantial manual effort.
Regardless of whether the failover is triggered manually or automatically, it has two main phases: confirmation that the previous master is no longer serving requests and the promotion of the standby to be the master.
The greatest risk in a failover is to have a period in which both NameNodes are servicing requests. In such a situation, it is possible that conflicting changes might be made to the filesystem on the two NameNodes or that they might become out of sync. Even though this should not be possible if the QJM is being used (it only ever accepts connections from a single client), out-of-date information might be served to clients, who might then try to make incorrect decisions based on this stale metadata. This is, of course, particularly likely if the previous master NameNode is behaving incorrectly in some way, which is why the need for the failover is identified in the first place.
To ensure only one NameNode is active at any time, a fencing mechanism is used to validate that the existing NameNode master has been shut down. The simplest included mechanism will try to ssh into the NameNode host and actively kill the process, though a custom script can also be executed, so the mechanism is flexible. The failover will not continue until the fencing is successful and the system has confirmed that the previous master NameNode is now dead and has released any required resources.
Once fencing succeeds, the standby NameNode becomes the master and will start writing to the NFS-mounted fsimage and edits logs if NFS is being used for HA or will become the single client to the QJM if that is the HA mechanism.
Before discussing automatic failover, we need a slight segue to introduce another Apache project that is used to enable this feature.
Apache ZooKeeper – a different type of filesystem
Within Hadoop, we will mostly talk about HDFS when discussing filesystems and data storage. But, inside almost all Hadoop 2 installations, there is another service that looks somewhat like a filesystem, but which provides significant capability crucial to the proper functioning of distributed systems. This service is Apache ZooKeeper (http://zookeeper.apache.org) and, as it is a key part of the implementation of HDFS HA, we will introduce it in this chapter. It is, however, also used by multiple other Hadoop components and related projects, so we will touch on it several more times throughout the book.
ZooKeeper started out as a subcomponent of HBase and was used to enable several operational capabilities of the service. When any complex distributed system is built, there are a series of activities that are almost always required and which are always difficult to get right. These activities include things such as handling shared locks, detecting component failure, and supporting leader election within a group of collaborating services. ZooKeeper was created as the coordination service that would provide a series of primitive operations upon which HBase could implement these types of operationally critical features. Note that ZooKeeper also takes inspiration from the Google Chubby system described at http://research.google.com/archive/chubby-osdi06.pdf.
ZooKeeper runs as a cluster of instances referred to as an ensemble. The ensemble provides a data structure, which is somewhat analogous to a filesystem. Each location in the structure is called a ZNode and can have children as if it were a directory but can also have content as if it were a file. Note that ZooKeeper is not a suitable place to store very large amounts of data, and by default, the maximum amount of data in a ZNode is 1 MB. At any point in time, one server in the ensemble is the master and makes all decisions about client requests. There are very well-defined rules around the responsibilities of the master, including that it has to ensure that a request is only committed when a majority of the ensemble have committed the change, and that once committed any conflicting change is rejected.
You should have ZooKeeper installed within your Cloudera Virtual Machine. If not, use Cloudera Manager to install it as a single node on the host. In production systems, ZooKeeper has very specific semantics around absolute majority voting, so some of the logic only makes sense in a larger ensemble (3, 5, or 7 nodes are the most common sizes).
There is a command-line client to ZooKeeper called zookeeper-client in the Cloudera VM; note that in the vanilla ZooKeeper distribution it is called zkCli.sh. If you run it with no arguments, it will connect to the ZooKeeper server running on the local machine. From here, you can type help to get a list of commands.
The most immediately interesting commands will be create, ls, and get. As the names suggest, these create a ZNode, list the ZNodes at a particular point in the filesystem, and get the data stored at a particular ZNode. Here are some examples of usage.
Create a ZNode with no data:
$ create /zk-test ''
Create a child of the first ZNode and store some text in it:
$ create /zk-test/child1 'sampledata'
Retrieve the data associated with a particular ZNode:
$ get /zk-test/child1
The client can also register a watcher on a given ZNode; this will raise an alert if the ZNode in question changes, either its data or children being modified.
This might not sound very useful, but ZNodes can additionally be created as both sequential and ephemeral nodes, and this is where the magic starts.
Implementing a distributed lock with sequential ZNodes
If a ZNode is created within the CLI with the -s option, it will be created as a sequential node. ZooKeeper will suffix the supplied name with a 10-digit integer guaranteed to be unique and greater than any other sequential children of the same ZNode. We can use this mechanism to create a distributed lock. ZooKeeper itself is not holding the actual lock; the client needs to understand what particular states in ZooKeeper mean in terms of their mapping to the application locks in question.
If we create a (non-sequential) ZNode at /zk-lock, then any client wishing to hold the lock will create a sequential child node. For example, the create -s /zk-lock/locknode command might create the node /zk-lock/locknode-0000000001 in the first case, with increasing integer suffixes for subsequent calls. When a client creates a ZNode under the lock, it will then check if its sequential node has the lowest integer suffix. If it does, then it is treated as having the lock. If not, then it will need to wait until the node holding the lock is deleted. The client will usually put a watch on the node with the next lowest suffix and then be alerted when that node is deleted, indicating that it now holds the lock.
Implementing group membership and leader election using ephemeral ZNodes
Any ZooKeeper client will send heartbeats to the server throughout the session, showing that it is alive. For the ZNodes we have discussed until now, we can say that they are persistent and will survive across sessions. We can, however, create a ZNode as ephemeral, meaning it will disappear once the client that created it either disconnects or is detected as being dead by the ZooKeeper server. Within the CLI, an ephemeral ZNode is created by adding the -e flag to the create command.
Ephemeral ZNodes are a good mechanism to implement group membership discovery within a distributed system. For any system where nodes can fail, join, and leave without notice, knowing which nodes are alive at any point in time is often a difficult task. Within ZooKeeper, we can provide the basis for such discovery by having each node create an ephemeral ZNode at a certain location in the ZooKeeper filesystem. The ZNodes can hold data about the service nodes, such as hostname, IP address, port number, and so on. To get a list of live nodes, we can simply list the child nodes of the parent group ZNode. Because of the nature of ephemeral nodes, we can have confidence that the list of live nodes retrieved at any time is up to date.
If we have each service node create ZNode children that are not just ephemeral but also sequential, then we can also build a mechanism for leader election for services that need to have a single master node at any one time. The mechanism is the same as for locks; the client service node creates the sequential and ephemeral ZNode and then checks if it has the lowest sequence number. If so, then it is the master. If not, then it will register a watcher on the next lowest sequence node to be alerted when it might become the master.
Java API
The org.apache.zookeeper.ZooKeeper class is the main programmatic client to access a ZooKeeper ensemble. Refer to the Javadocs for the full details, but the basic interface is relatively straightforward, with an obvious one-to-one correspondence to commands in the CLI, as shown in the list and the sketch that follows it. For example:
create: is equivalent to CLI create
getChildren: is equivalent to CLI ls
getData: is equivalent to CLI get
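The following is a minimal, illustrative sketch of these calls against a local ZooKeeper server; the connection string, paths, and data are assumptions chosen to mirror the CLI examples shown earlier, not code from the book's examples:

import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper server on localhost with a 5-second session timeout
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, null);

        // create: equivalent to the CLI create command
        zk.create("/zk-test", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        zk.create("/zk-test/child1", "sampledata".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // getChildren: equivalent to the CLI ls command
        List<String> children = zk.getChildren("/zk-test", false);
        System.out.println(children);

        // getData: equivalent to the CLI get command
        byte[] data = zk.getData("/zk-test/child1", false, null);
        System.out.println(new String(data));

        zk.close();
    }
}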
Building blocks
As can be seen, ZooKeeper provides a small number of well-defined operations with very strong semantic guarantees that can be built into higher-level services, such as the locks, group membership, and leader election we discussed earlier. It's best to think of ZooKeeper as a toolkit of well-engineered and reliable functions critical to distributed systems that can be built upon without having to worry about the intricacies of their implementation. The provided ZooKeeper interface is quite low-level, though, and there are a few higher-level interfaces emerging that provide more of the mapping of the low-level primitives into application-level logic. The Curator project (http://curator.apache.org/) is a good example of this.
ZooKeeper was used sparingly within Hadoop 1, but it's now quite ubiquitous. It's used by both MapReduce and HDFS for the high availability of their JobTracker and NameNode components. Hive and Impala, which we will explore later, use it to place locks on data tables that are being accessed by multiple concurrent jobs. Kafka, which we'll discuss in the context of Samza, uses ZooKeeper for node (broker in Kafka terminology) leader election and state management.
Further reading
We have not described ZooKeeper in much detail and have completely omitted aspects such as its ability to apply quotas and access control lists to ZNodes within the filesystem and the mechanisms to build callbacks. Our purpose here was to give enough of the details so that you would have some idea of how it is being used within the Hadoop services we explore in this book. For more information, consult the project home page.
Automatic NameNode failover
Now that we have introduced ZooKeeper, we can show how it is used to enable automatic NameNode failover.
Automatic NameNode failover introduces two new components to the system: a ZooKeeper quorum, and the ZooKeeper Failover Controller (ZKFC), which runs on each NameNode host. The ZKFC creates an ephemeral ZNode in ZooKeeper and holds this ZNode for as long as it detects the local NameNode to be alive and functioning correctly. It determines this by continuously sending simple health-check requests to the NameNode, and if the NameNode fails to respond correctly over a short period of time, the ZKFC will assume the NameNode has failed. If a NameNode machine crashes or otherwise fails, the ZKFC session in ZooKeeper will be closed and the ephemeral ZNode will also be automatically removed.
The ZKFC processes are also monitoring the ZNodes of the other NameNodes in the cluster. If the ZKFC on the standby NameNode host sees the existing master ZNode disappear, it will assume the master has failed and will attempt a failover. It does this by trying to acquire the lock for the NameNode (through the protocol described in the ZooKeeper section) and, if successful, will initiate a failover through the same fencing/promotion mechanism described earlier.
HDFS snapshots
We mentioned earlier that HDFS replication alone is not a suitable backup strategy. In the Hadoop 2 filesystem, snapshots have been added, which brings another level of data protection to HDFS.
Filesystem snapshots have been used for some time across a variety of technologies. The basic idea is that it becomes possible to view the exact state of the filesystem at particular points in time. This is achieved by taking a copy of the filesystem metadata at the point the snapshot is made and making this available to be viewed in the future.
As changes to the filesystem are made, any change that would affect the snapshot is treated specially. For example, if a file that exists in the snapshot is deleted then, even though it will be removed from the current state of the filesystem, its metadata will remain in the snapshot, and the blocks associated with its data will remain on the filesystem, though not accessible through any view of the system other than the snapshot.
An example might illustrate this point. Say, you have a filesystem containing the following files:
/data1 (5 blocks)
/data2 (10 blocks)
You take a snapshot and then delete the file /data2. If you view the current state of the filesystem, then only /data1 will be visible. If you examine the snapshot, you will see both files. Behind the scenes, all 15 blocks still exist, but only those associated with the un-deleted file /data1 are part of the current filesystem. The blocks for the file /data2 will be released only when the snapshot is itself removed; snapshots are read-only views.
Snapshots in Hadoop 2 can be applied at either the full filesystem level or only on particular paths. A path needs to be set as snapshottable, and note that you cannot have a path snapshottable if any of its children or parent paths are themselves snapshottable.
Let's take a simple example based on the directory we created earlier to illustrate the use of snapshots. The commands we are going to illustrate need to be executed with superuser privileges, which can be obtained with sudo -u hdfs.
First, use the dfsadmin subcommand of the hdfs CLI utility to enable snapshots of a directory, as follows:
$ sudo -u hdfs hdfs dfsadmin -allowSnapshot \
/user/cloudera/testdir
Allowing snapshot on testdir succeeded
Now, we create the snapshot and examine it; snapshots are available through the .snapshot subdirectory of the snapshottable directory. Note that the .snapshot directory will not be visible in a normal listing of the directory. Here's how we create a snapshot and examine it:
$ sudo -u hdfs hdfs dfs -createSnapshot \
/user/cloudera/testdir sn1
Created snapshot /user/cloudera/testdir/.snapshot/sn1
$ sudo -u hdfs hdfs dfs -ls \
/user/cloudera/testdir/.snapshot/sn1
Found 1 items
-rw-r--r--   1 cloudera cloudera         12 2014-11-13 11:21 /user/cloudera/testdir/.snapshot/sn1/testfile.txt
Now, we remove the test file from the main directory and verify that it is now empty:
$ sudo -u hdfs hdfs dfs -rm \
/user/cloudera/testdir/testfile.txt
14/11/13 13:13:51 INFO fs.TrashPolicyDefault: Namenode trash configuration:
Deletion interval = 1440 minutes, Emptier interval = 0 minutes. Moved:
'hdfs://localhost.localdomain:8020/user/cloudera/testdir/testfile.txt' to
trash at: hdfs://localhost.localdomain:8020/user/hdfs/.Trash/Current
$ hdfs dfs -ls /user/cloudera/testdir
$
Note the mention of trash directories; by default, HDFS will copy any deleted files into a .Trash directory in the user's home directory, which helps to defend against slipping fingers. These files can be removed through hdfs dfs -expunge or will be automatically purged in 7 days by default.
Now, we examine the snapshot where the now-deleted file is still available:
$ hdfs dfs -ls testdir/.snapshot/sn1
Found 1 items
drwxr-xr-x   - cloudera cloudera          0 2014-11-13 13:12 testdir/.snapshot/sn1
$ hdfs dfs -tail testdir/.snapshot/sn1/testfile.txt
Hello world
Then, we can delete the snapshot, freeing up any blocks held by it, as follows:
$ sudo -u hdfs hdfs dfs -deleteSnapshot \
/user/cloudera/testdir sn1
$ hdfs dfs -ls testdir/.snapshot
$
As can be seen, the files within a snapshot are fully available to be read and copied, providing access to the historical state of the filesystem at the point when the snapshot was made. Each directory can have up to 65,535 snapshots, and HDFS manages snapshots in such a way that they are quite efficient in terms of impact on normal filesystem operations. They are a great mechanism to use prior to any activity that might have adverse effects, such as trying a new version of an application that accesses the filesystem. If the new software corrupts files, the old state of the directory can be restored. If, after a period of validation, the software is accepted, then the snapshot can instead be deleted.
Hadoop filesystems
Until now, we referred to HDFS as the Hadoop filesystem. In reality, Hadoop has a rather abstract notion of filesystem. HDFS is only one of several implementations of the org.apache.hadoop.fs.FileSystem Java abstract class. A list of available filesystems can be found at https://hadoop.apache.org/docs/r2.5.0/api/org/apache/hadoop/fs/FileSystem.html. The following table summarizes some of these filesystems, along with the corresponding URI scheme and Java implementation class.
Filesystem        URI scheme  Java implementation
Local             file        org.apache.hadoop.fs.LocalFileSystem
HDFS              hdfs        org.apache.hadoop.hdfs.DistributedFileSystem
S3 (native)       s3n         org.apache.hadoop.fs.s3native.NativeS3FileSystem
S3 (block-based)  s3          org.apache.hadoop.fs.s3.S3FileSystem
There exist two implementations of the S3 filesystem. Native, s3n, is used to read and write regular files. Data stored using s3n can be accessed by any S3 tool and, conversely, it can be used to read data generated by other S3 tools. s3n cannot handle files larger than 5 TB or rename operations.
Much like HDFS, the block-based S3 filesystem stores files in blocks and requires an S3 bucket to be dedicated to the filesystem. Files stored in an S3 filesystem can be larger than 5 TB, but they will not be interoperable with other S3 tools. Additionally, block-based S3 supports rename operations.
Hadoop interfaces
Hadoop is written in Java, and not surprisingly, all interaction with the system happens via the Java API. The command-line interface we used through the hdfs command in previous examples is a Java application that uses the FileSystem class to carry out input/output operations on the available filesystems.
Java FileSystem API
The Java API, provided by the org.apache.hadoop.fs package, exposes Apache Hadoop filesystems.
org.apache.hadoop.fs.FileSystem is the abstract class each filesystem implements and provides a general interface to interact with data in Hadoop. All code that uses HDFS should be written with the capability of handling a FileSystem object.
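As a brief, illustrative sketch (the path below is an assumption for demonstration only, not part of the book's examples), the typical pattern is to obtain a FileSystem instance from the configuration and then write and read files through it:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemExample {
    public static void main(String[] args) throws Exception {
        // Obtain the filesystem named in the configuration
        // (HDFS on a cluster, the local filesystem otherwise)
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a small file
        Path path = new Path("/user/cloudera/fs-api-example.txt");
        FSDataOutputStream out = fs.create(path);
        out.write("Hello world\n".getBytes("UTF-8"));
        out.close();

        // Read it back and copy the contents to standard output
        FSDataInputStream in = fs.open(path);
        IOUtils.copyBytes(in, System.out, conf, true);
    }
}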
Libhdfs
Libhdfs is a C library that, despite its name, can be used to access any Hadoop filesystem and not just HDFS. It is written using the Java Native Interface (JNI) and mimics the Java FileSystem class.
Thrift
Apache Thrift (http://thrift.apache.org) is a framework for building cross-language software through data serialization and remote method invocation mechanisms. The Hadoop Thrift API, available in contrib, exposes Hadoop filesystems as a Thrift service. This interface makes it easy for non-Java code to access data stored in a Hadoop filesystem.
Other than the aforementioned interfaces, there exist other interfaces that allow access to Hadoop filesystems via HTTP and FTP (these for HDFS only) as well as WebDAV.
Managing and serializing data
Having a filesystem is all well and good, but we also need mechanisms to represent data and store it on the filesystems. We will explore some of these mechanisms now.
The Writable interface
It is useful, to us as developers, if we can manipulate higher-level data types and have Hadoop look after the processes required to serialize them into bytes to write to a filesystem and reconstruct from a stream of bytes when it is read from the filesystem.
The org.apache.hadoop.io package contains the Writable interface, which provides this mechanism and is specified as follows:
public interface Writable
{
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}
The main purpose of this interface is to provide mechanisms for the serialization and deserialization of data as it is passed across the network or read and written from the disk.
When we explore processing frameworks on Hadoop in later chapters, we will often see instances where the requirement is for a data argument to be of the type Writable. If we use data structures that provide a suitable implementation of this interface, then the Hadoop machinery can automatically manage the serialization and deserialization of the data type without knowing anything about what it represents or how it is used.
Introducing the wrapper classes
Fortunately, you don't have to start from scratch and build Writable variants of all the data types you will use. Hadoop provides classes that wrap the Java primitive types and implement the Writable interface. They are provided in the org.apache.hadoop.io package.
These classes are conceptually similar to the primitive wrapper classes, such as Integer and Long, found in java.lang. They hold a single primitive value that can be set either at construction or via a setter method. They are as follows:
BooleanWritable
ByteWritable
DoubleWritable
FloatWritable
IntWritable
LongWritable
VIntWritable: a variable length integer type
VLongWritable: a variable length long type
There is also Text, which wraps java.lang.String.
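As a minimal illustration (not code from the book's examples), the following sketch serializes an IntWritable and a Text value to a byte stream and reads them back, which is exactly the mechanism Hadoop applies behind the scenes:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class WritableExample {
    public static void main(String[] args) throws Exception {
        // Serialize the wrappers to a byte stream via Writable.write()
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        new IntWritable(42).write(out);
        new Text("Hello world").write(out);

        // Deserialize them again via Writable.readFields()
        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray()));
        IntWritable number = new IntWritable();
        Text text = new Text();
        number.readFields(in);
        text.readFields(in);
        System.out.println(number.get() + " " + text.toString());
    }
}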
Array wrapper classes
Hadoop also provides some collection-based wrapper classes. These classes provide Writable wrappers for arrays of other Writable objects. For example, an instance could either hold an array of IntWritable or DoubleWritable, but not arrays of the raw int or float types. A specific subclass for the required Writable class will be required. They are as follows:
ArrayWritable
TwoDArrayWritable
The Comparable and WritableComparable interfaces
We were slightly inaccurate when we said that the wrapper classes implement Writable; they actually implement a composite interface called WritableComparable in the org.apache.hadoop.io package that combines Writable with the standard java.lang.Comparable interface:
public interface WritableComparable extends Writable, Comparable
{}
The need for Comparable will only become apparent when we explore MapReduce in the next chapter, but for now, just remember that the wrapper classes provide mechanisms for them to be both serialized and sorted by Hadoop or any of its frameworks.
Storing data
Until now, we introduced the architecture of HDFS and how to programmatically store and retrieve data using the command-line tools and the Java API. In the examples seen until now, we have implicitly assumed that our data was stored as a text file. In reality, some applications and datasets will require ad hoc data structures to hold the file's contents. Over the years, file formats have been created to address both the requirements of MapReduce processing (for instance, we want data to be splittable) and to satisfy the need to model both structured and unstructured data. Currently, a lot of focus has been dedicated to better capture the use cases of relational data storage and modeling. In the remainder of this chapter, we will introduce some of the popular file format choices available within the Hadoop ecosystem.
Serialization and Containers
When talking about file formats, we are assuming two types of scenarios, which are as follows:
Serialization: we want to encode data structures generated and manipulated at processing time to a format we can store to a file, transmit, and, at a later stage, retrieve and translate back for further manipulation
Containers: once data is serialized to files, containers provide means to group multiple files together and add additional metadata
Compression
When working with data, file compression can often lead to significant savings, both in terms of the space necessary to store files as well as on the data I/O across the network and from/to local disks.
In broad terms, when using a processing framework, compression can occur at three points in the processing pipeline:
Input files to be processed
Output files that result after processing is completed
Intermediate/temporary files produced internally within the pipeline
When we add compression at any of these stages, we have an opportunity to dramatically reduce the amount of data to be read or written to the disk or across the network. This is particularly useful with frameworks such as MapReduce that can, for example, produce volumes of temporary data that are larger than either the input or output datasets.
Apache Hadoop comes with a number of compression codecs: gzip, bzip2, LZO, snappy, each with its own tradeoffs. Picking a codec is an educated choice that should consider both the kind of data being processed as well as the nature of the processing framework itself.
Other than the general space/time tradeoff, where the largest space savings come at the expense of compression and decompression speed (and vice versa), we need to take into account that data stored in HDFS will be accessed by parallel, distributed software; some of this software will also add its own particular requirements on file formats. MapReduce, for example, is most efficient on files that can be split into valid subfiles.
This can complicate decisions, such as the choice of whether to compress and which codec to use if so, as most compression codecs (such as gzip) do not support splittable files, whereas a few (such as LZO) do.
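As an illustrative sketch of where these choices are expressed in practice (the property names and codecs are standard Hadoop ones, but the combination shown is just an example, not a recommendation from the book), intermediate map output and final job output compression can be enabled from a MapReduce driver as follows:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionConfigExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate map output with Snappy to cut shuffle I/O
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compression example");
        // Compress the final job output with gzip
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        // ... the rest of the job configuration (mapper, reducer, paths) goes here
    }
}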
General-purpose file formats
The first class of file formats are those general-purpose ones that can be applied to any application domain and make no assumptions on data structure or access patterns.
Text: the simplest approach to storing data on HDFS is to use flat files. Text files can be used both to hold unstructured data (a web page or a tweet) as well as structured data (a CSV file that is a few million rows long). Text files are splittable, though one needs to consider how to handle boundaries between multiple elements (for example, lines) in the file.
SequenceFile: a SequenceFile is a flat data structure consisting of binary key/value pairs, introduced to address specific requirements of MapReduce-based processing. It is still extensively used in MapReduce as an input/output format. As we will see in Chapter 3, Processing – MapReduce and Beyond, internally, the temporary outputs of maps are stored using SequenceFile.
SequenceFile provides Writer, Reader, and Sorter classes to write, read, and sort data, respectively.
Depending on the compression mechanism in use, three variations of SequenceFile can be distinguished:
Uncompressed key/value records.
Record compressed key/value records. Only 'values' are compressed.
Block compressed key/value records. Keys and values are collected in blocks of arbitrary size and compressed separately.
In each case, however, the SequenceFile remains splittable, which is one of its biggest strengths.
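The following is a minimal sketch, not taken from the book's examples, of the Writer and Reader classes mentioned above; the file path and records are illustrative only:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("example.seq");

        // Write a few key/value records
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class));
        for (int i = 0; i < 3; i++) {
            writer.append(new IntWritable(i), new Text("record-" + i));
        }
        writer.close();

        // Read them back in insertion order
        SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(path));
        IntWritable key = new IntWritable();
        Text value = new Text();
        while (reader.next(key, value)) {
            System.out.println(key + "\t" + value);
        }
        reader.close();
    }
}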
Column-oriented data formats
In the relational database world, column-oriented data stores organize and store tables based on the columns; generally speaking, the data for each column will be stored together. This is a significantly different approach compared to most relational DBMSs that organize data per row. Column-oriented storage has significant performance advantages; for example, if a query needs to read only two columns from a very wide table containing hundreds of columns, then only the required column data files are accessed. A traditional row-oriented database would have to read all columns for each row for which data was required. This has the greatest impact on workloads where aggregate functions are computed over large numbers of similar items, such as with OLAP workloads typical of data warehouse systems.
In Chapter 7, Hadoop and SQL, we will see how Hadoop is becoming a SQL backend for the data warehouse world thanks to projects such as Apache Hive and Cloudera Impala. As part of the expansion into this domain, a number of file formats have been developed to account for both relational modeling and data warehousing needs.
RCFile, ORC, and Parquet are three state-of-the-art column-oriented file formats developed with these use cases in mind.
RCFile
Row Columnar File (RCFile) was originally developed by Facebook to be used as the backend storage for their Hive data warehouse system, which was the first mainstream SQL-on-Hadoop system available as open source.
RCFile aims to provide the following:
Fast data loading
Fast query processing
Efficient storage utilization
Adaptability to dynamic workloads
More information on RCFile can be found at http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/abs11-4.html.
ORC
The Optimized Row Columnar file format (ORC) aims to combine the performance of the RCFile with the flexibility of Avro. It is primarily intended to work with Apache Hive and has been initially developed by Hortonworks to overcome the perceived limitations of other available file formats.
More details can be found at http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/orcfile.html.
Parquet
Parquet, found at http://parquet.incubator.apache.org, was originally a joint effort of Cloudera, Twitter, and Criteo, and has now been donated to the Apache Software Foundation. The goals of Parquet are to provide a modern, performant, columnar file format to be used with Cloudera Impala. As with Impala, Parquet has been inspired by the Dremel paper (http://research.google.com/pubs/pub36632.html). It allows complex, nested data structures and allows efficient encoding on a per-column level.
Avro
Apache Avro (http://avro.apache.org) is a schema-oriented binary data serialization format and file container. Avro will be our preferred binary data format throughout this book. It is both splittable and compressible, making it an efficient format for data processing with frameworks such as MapReduce.
Numerous other projects also have built-in specific Avro support and integration, however, so it is very widely applicable. When data is stored in an Avro file, its schema, defined as a JSON object, is stored with it. A file can be later processed by a third party with no a priori notion of how data is encoded. This makes data self-describing and facilitates use with dynamic and scripting languages. The schema-on-read model also helps Avro records to be efficient to store as there is no need for the individual fields to be tagged.
In later chapters, you will see how these properties can make data lifecycle management easier and allow non-trivial operations such as schema migrations.
Using the Java API
We'll now demonstrate the use of the Java API to parse Avro schemas, read and write Avro files, and use Avro's code generation facilities. Note that the format is intrinsically language independent; there are APIs for most languages, and files created by Java will seamlessly be read from any other language.
Avro schemas are described as JSON documents and represented by the org.apache.avro.Schema class. To demonstrate the API for manipulating Avro documents, we'll look ahead to an Avro specification we use for a Hive table in Chapter 7, Hadoop and SQL. The following code can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch2/src/main/java/com/learninghadoop2/avro/AvroParse.java.
In the following code, we will use the Avro Java API to create an Avro file containing a tweet record and then re-read the file, using the schema in the file to extract the details of the stored records:
public static void testGenericRecord() {
    try {
        // Parse the schema and build a record that complies with it
        Schema schema = new Schema.Parser()
                .parse(new File("tweets_avro.avsc"));
        GenericRecord tweet = new GenericData.Record(schema);
        tweet.put("text", "The generic tweet text");

        // Write a data file containing the single record
        File file = new File("tweets.avro");
        DatumWriter<GenericRecord> datumWriter =
                new GenericDatumWriter<>(schema);
        DataFileWriter<GenericRecord> fileWriter =
                new DataFileWriter<>(datumWriter);
        fileWriter.create(schema, file);
        fileWriter.append(tweet);
        fileWriter.close();

        // Re-read the file, driving off the schema embedded in it
        DatumReader<GenericRecord> datumReader =
                new GenericDatumReader<>(schema);
        DataFileReader<GenericRecord> fileReader =
                new DataFileReader<>(file, datumReader);
        GenericRecord genericTweet = null;
        while (fileReader.hasNext()) {
            genericTweet = (GenericRecord) fileReader.next(genericTweet);
            for (Schema.Field field :
                    genericTweet.getSchema().getFields()) {
                Object val = genericTweet.get(field.name());
                if (val != null) {
                    System.out.println(val);
                }
            }
        }
    } catch (IOException ie) {
        System.out.println("Error parsing or writing file.");
    }
}
The tweets_avro.avsc schema, found at https://github.com/learninghadoop2/book-examples/blob/master/ch2/tweets_avro.avsc, describes a tweet with multiple fields. To create an Avro object of this type, we first parse the schema file. We then use Avro's concept of a GenericRecord to build an Avro document that complies with this schema. In this case, we only set a single attribute: the tweet text itself.
To write this Avro file, containing a single object, we then use Avro's I/O capabilities. To read the file, we do not need to start with the schema, as we can extract this from the GenericRecord we read from the file. We then walk through the schema structure and dynamically process the document based on the discovered fields. This is particularly powerful, as it is the key enabler of clients remaining independent of the Avro schema and how it evolves over time.
If we have the schema file in advance, however, we can use Avro code generation to create a customized class that makes manipulating Avro records much easier. To generate the code, we will use the compile class in the avro-tools.jar, passing it the name of the schema file and the desired output directory:
$ java -jar /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/avro/avro-tools.jar compile schema tweets_avro.avsc src/main/java
The class will be placed in a directory structure based on any namespace defined in the schema. Since we created this schema in the com.learninghadoop2.avrotables namespace, we see the following:
$ ls src/main/java/com/learninghadoop2/avrotables/tweets_avro.java
With this class, let's revisit the creation and the act of reading and writing Avro objects, as follows:
public static void testGeneratedCode() {
    tweets_avro tweet = new tweets_avro();
    tweet.setText("The code generated tweet text");
    try {
        File file = new File("tweets.avro");
        DatumWriter<tweets_avro> datumWriter =
                new SpecificDatumWriter<>(tweets_avro.class);
        DataFileWriter<tweets_avro> fileWriter =
                new DataFileWriter<>(datumWriter);
        fileWriter.create(tweet.getSchema(), file);
        fileWriter.append(tweet);
        fileWriter.close();

        DatumReader<tweets_avro> datumReader =
                new SpecificDatumReader<>(tweets_avro.class);
        DataFileReader<tweets_avro> fileReader =
                new DataFileReader<>(file, datumReader);
        while (fileReader.hasNext()) {
            tweet = fileReader.next(tweet);
            System.out.println(tweet.getText());
        }
    } catch (IOException ie) {
        System.out.println("Error in parsing or writing files.");
    }
}
Because we used code generation, we now use the Avro SpecificRecord mechanism alongside the generated class that represents the object in our domain model. Consequently, we can directly instantiate the object and access its attributes through familiar get/set methods.
Writing the file is similar to the action performed before, except that we use specific classes and also retrieve the schema directly from the tweet object when needed. Reading is similarly eased through the ability to create instances of a specific class and use get/set methods.
Summary
This chapter has given a whistle-stop tour through storage on a Hadoop cluster. In particular, we covered:
The high-level architecture of HDFS, the main filesystem used in Hadoop
How HDFS works under the covers and, in particular, its approach to reliability
How Hadoop 2 has added significantly to HDFS, particularly in the form of NameNode HA and filesystem snapshots
What ZooKeeper is and how it is used by Hadoop to enable features such as NameNode automatic failover
An overview of the command-line tools used to access HDFS
The API for filesystems in Hadoop and how at a code level HDFS is just one implementation of a more flexible filesystem abstraction
How data can be serialized onto a Hadoop filesystem and some of the support provided in the core classes
The various file formats available in which data is most frequently stored in Hadoop and some of their particular use cases
In the next chapter, we will look in detail at how Hadoop provides processing frameworks that can be used to process the data stored within it.
Chapter 3. Processing – MapReduce and Beyond
In Hadoop 1, the platform had two clear components: HDFS for data storage and MapReduce for data processing. The previous chapter described the evolution of HDFS in Hadoop 2 and in this chapter we'll discuss data processing.
The picture with processing in Hadoop 2 has changed more significantly than has storage, and Hadoop now supports multiple processing models as first-class citizens. In this chapter we'll explore both MapReduce and other computational models in Hadoop 2. In particular, we'll cover:
What MapReduce is and the Java API required to write applications for it
How MapReduce is implemented in practice
How Hadoop reads data into and out of its processing jobs
YARN, the Hadoop 2 component that allows processing beyond MapReduce on the platform
An introduction to several computational models implemented on YARN
MapReduce
MapReduce is the primary processing model supported in Hadoop 1. It follows a divide and conquer model for processing data made popular by a 2004 paper by Google (http://research.google.com/archive/mapreduce.html) and has foundations both in functional programming and database research. The name itself refers to two distinct steps applied to all input data, a map function and a reduce function.
Every MapReduce application is a sequence of jobs that build atop this very simple model. Sometimes, the overall application may require multiple jobs, where the output of the reduce stage from one is the input to the map stage of another, and sometimes there might be multiple map or reduce functions, but the core concepts remain the same.
We will introduce the MapReduce model by looking at the nature of the map and reduce functions and then describe the Java API required to build implementations of the functions. After showing some examples, we will walk through a MapReduce execution to give more insight into how the actual MapReduce framework executes code at runtime.
Learning the MapReduce model can be a little counter-intuitive; it's often difficult to appreciate how very simple functions can, when combined, provide very rich processing on enormous datasets. But it does work, trust us!
As we explore the nature of the map and reduce functions, think of them as being applied to a stream of records being retrieved from the source dataset. We'll describe how that happens later; for now, think of the source data being sliced into smaller chunks, each of which gets fed to a dedicated instance of the map function. Each record has the map function applied, producing a set of intermediary data. Records are retrieved from this temporary dataset and all associated records are fed together through the reduce function. The final output of the reduce function for all the sets of records is the overall result for the complete job.
From a functional perspective, MapReduce transforms data structures from one list of (key, value) pairs into another. During the Map phase, data is loaded from HDFS, and a function is applied in parallel to every input (key, value) and a new list of (key, value) pairs is the output:
map(k1, v1) -> list(k2, v2)
The framework then collects all pairs with the same key from all lists and groups them together, creating one group for each key. A Reduce function is applied in parallel to each group, which in turn produces a list of values:
reduce(k2, list(v2)) -> (k3, list(v3))
The output is then written back to HDFS in the following manner:
Map and Reduce phases
Java API to MapReduce
The Java API to MapReduce is exposed by the org.apache.hadoop.mapreduce package. Writing a MapReduce program, at its core, is a matter of subclassing Hadoop-provided Mapper and Reducer base classes, and overriding the map() and reduce() methods with our own implementation.
The Mapper class
For our own Mapper implementations, we will subclass the Mapper base class and override the map() method, as follows:
class Mapper<K1, V1, K2, V2>
{
    void map(K1 key, V1 value, Mapper.Context context)
            throws IOException, InterruptedException
    ...
}
The class is defined in terms of the key/value input and output types, and then the map method takes an input key/value pair as its parameter. The other parameter is an instance of the Context class that provides various mechanisms to communicate with the Hadoop framework, one of which is to output the results of a map or reduce method.
Notice that the map method only refers to a single instance of K1 and V1 key/value pairs. This is a critical aspect of the MapReduce paradigm in which you write classes that process single records, and the framework is responsible for all the work required to turn an enormous dataset into a stream of key/value pairs. You will never have to write map or reduce classes that try to deal with the full dataset. Hadoop also provides mechanisms through its InputFormat and OutputFormat classes that provide implementations of common file formats and likewise remove the need for having to write file parsers for any but custom file types.
There are three additional methods that sometimes may be required to be overridden:
protected void setup(Mapper.Context context)
        throws IOException, InterruptedException
This method is called once before any key/value pairs are presented to the map method. The default implementation does nothing:
protected void cleanup(Mapper.Context context)
        throws IOException, InterruptedException
This method is called once after all key/value pairs have been presented to the map method. The default implementation does nothing:
protected void run(Mapper.Context context)
        throws IOException, InterruptedException
This method controls the overall flow of task processing within a JVM. The default implementation calls the setup method once before repeatedly calling the map method for each key/value pair in the split and then finally calls the cleanup method.
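To make this concrete, the following is a minimal sketch of a Mapper subclass (not one of the book's examples) that emits each word in its input line with a count of 1, the classic word-count map function:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The input key is the byte offset of the line; the value is the line itself
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                // Emit (word, 1); the framework groups these pairs by key
                context.write(word, ONE);
            }
        }
    }
}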
The Reducer class
The Reducer base class works very similarly to the Mapper class and usually requires only subclasses to override a single reduce() method. Here is the cut-down class definition:
public class Reducer<K2, V2, K3, V3>
{
    void reduce(K2 key, Iterable<V2> values,
            Reducer.Context context)
            throws IOException, InterruptedException
    ...
}
Again, notice the class definition in terms of the broader data flow (the reduce method accepts K2/V2 as input and provides K3/V3 as output), while the actual reduce method takes only a single key and its associated list of values. The Context object is again the mechanism to output the result of the method.
This class also has the setup, run, and cleanup methods with similar default implementations as with the Mapper class that can optionally be overridden:
protected void setup(Reducer.Context context)
        throws IOException, InterruptedException
The setup() method is called once before any key/lists of values are presented to the reduce method. The default implementation does nothing:
protected void cleanup(Reducer.Context context)
        throws IOException, InterruptedException
The cleanup() method is called once after all key/lists of values have been presented to the reduce method. The default implementation does nothing:
protected void run(Reducer.Context context)
        throws IOException, InterruptedException
The run() method controls the overall flow of processing the task within the JVM. The default implementation calls the setup method before repeatedly, and potentially concurrently, calling the reduce method for as many key/value pairs as are provided to the Reducer class, and then finally calls the cleanup method.
The Driver class
The Driver class communicates with the Hadoop framework and specifies the configuration elements needed to run a MapReduce job. This involves aspects such as telling Hadoop which Mapper and Reducer classes to use, where to find the input data and in what format, and where to place the output data and how to format it.
The driver logic usually exists in the main method of the class written to encapsulate a MapReduce job. There is no default parent Driver class to subclass:
public class ExampleDriver extends Configured implements Tool
{
    ...
    public int run(String[] args) throws Exception
    {
        // Create a Configuration object that is used to set other options
        Configuration conf = getConf();
        // Get command line arguments
        args = new GenericOptionsParser(conf, args)
                .getRemainingArgs();
        // Create the object representing the job
        Job job = new Job(conf, "ExampleJob");
        // Set the name of the main class in the job jar file
        job.setJarByClass(ExampleDriver.class);
        // Set the mapper class
        job.setMapperClass(ExampleMapper.class);
        // Set the reducer class
        job.setReducerClass(ExampleReducer.class);
        // Set the types for the final output key and value
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Set input and output file paths
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Execute the job and wait for it to complete
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception
    {
        int exitCode = ToolRunner.run(new ExampleDriver(), args);
        System.exit(exitCode);
    }
}
Intheprecedinglinesofcode,org.apache.hadoop.util.Toolisaninterfaceforhandlingcommand-lineoptions.TheactualhandlingisdelegatedtoToolRunner.run,whichruns
ToolwiththegivenConfigurationusedtogetandsetajob’sconfigurationoptions.Bysubclassingorg.apache.hadoop.conf.Configured,wecansettheConfigurationobjectdirectlyfromcommand-lineoptionsviaGenericOptionsParser.
Givenourprevioustalkofjobs,it’snotsurprisingthatmuchofthesetupinvolvesoperationsonajobobject.Thisincludessettingthejobnameandspecifyingwhichclassesaretobeusedforthemapperandreducerimplementations.
Certaininput/outputconfigurationsaresetand,finally,theargumentspassedtothemainmethodareusedtospecifytheinputandoutputlocationsforthejob.Thisisaverycommonmodelthatyouwillseeoften.
Thereareanumberofdefaultvaluesforconfigurationoptions,andweareimplicitlyusingsomeofthemintheprecedingclass.Mostnotably,wedon’tsayanythingabouttheformatoftheinputfilesorhowtheoutputfilesaretobewritten.ThesearedefinedthroughtheInputFormatandOutputFormatclassesmentionedearlier;wewillexplorethemindetaillater.Thedefaultinputandoutputformatsaretextfilesthatsuitourexamples.Therearemultiplewaysofexpressingtheformatwithintextfilesinadditiontoparticularlyoptimizedbinaryformats.
AcommonmodelforlesscomplexMapReducejobsistohavetheMapperandReducerclassesasinnerclasseswithinthedriver.Thisallowseverythingtobekeptinasinglefile,whichsimplifiesthecodedistribution.
Combiner
Hadoop allows the use of a combiner class to perform some early aggregation of the output from the map method before it's retrieved by the reducer.
Much of Hadoop's design is predicated on reducing the expensive parts of a job, which usually equate to disk and network I/O. The output of the mapper is often large; it's not infrequent to see it be many times the size of the original input. Hadoop does allow configuration options to help reduce the impact of the reducers transferring such large chunks of data across the network. The combiner takes a different approach: where possible, it performs early aggregation so that less data needs to be transferred in the first place.
The combiner does not have its own interface; a combiner must have the same signature as the reducer and hence also subclasses the Reducer class from the org.apache.hadoop.mapreduce package. The effect of this is basically to perform a mini-reduce on the mapper's output destined for each reducer.
Hadoop does not guarantee whether the combiner will be executed. Sometimes, it may not be executed at all, while at other times it may be used once, twice, or more times, depending on the size and number of output files generated by the mapper for each reducer.
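As a sketch of the wiring involved (assuming a mapper such as the earlier ExampleMapper that emits (word, 1) pairs, so that partial sums can safely be merged), registering a combiner is a single extra call on the Job object; here the Hadoop-provided IntSumReducer is reused for both roles:

// Registering a combiner in the driver; this is safe only because summing
// integer counts is associative and commutative.
job.setMapperClass(ExampleMapper.class);      // emits (word, 1) pairs
job.setCombinerClass(IntSumReducer.class);    // per-mapper mini-reduce
job.setReducerClass(IntSumReducer.class);     // final, global sums

A reducer that is not a pure aggregation (for example, one that depends on seeing every value for a key exactly once) should not be reused as a combiner.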
Partitioning
One of the implicit guarantees of the reduce contract is that a single reducer will be given all the values associated with a given key. With multiple reduce tasks running across a cluster, each mapper output must therefore be partitioned into the separate outputs destined for each reducer. These partitioned files are stored on the local node filesystem.
The number of reduce tasks across the cluster is not as dynamic as that of mappers; indeed, we can specify the value as part of our job submission. Hadoop therefore knows how many reducers will be needed to complete the job and, from this, how many partitions the mapper output should be split into.
The optional partition function
Within the org.apache.hadoop.mapreduce package is the Partitioner class, an abstract class with the following signature:

public abstract class Partitioner<Key, Value>
{
    public abstract int getPartition(Key key, Value value,
        int numPartitions);
}

By default, Hadoop will use a strategy that hashes the output key to perform the partitioning. This functionality is provided by the HashPartitioner class within the org.apache.hadoop.mapreduce.lib.partition package, but it's necessary in some cases to provide a custom subclass of Partitioner with application-specific partitioning logic. Notice that the getPartition function takes the key, value, and number of partitions as parameters, any of which can be used by the custom partitioning logic.
A custom partitioning strategy would be particularly necessary if, for example, the data gave a very uneven distribution when the standard hash function was applied. Uneven partitioning can result in some tasks having to perform significantly more work than others, leading to a much longer overall job execution time.
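As an illustration (not one of the book's examples), the following sketch partitions keys by their first character; a real implementation would be chosen only after inspecting the actual key distribution:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// A custom Partitioner that routes keys by their first character rather
// than by the default hash of the whole key.
public class FirstCharacterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.getLength() == 0) {
            return 0;
        }
        // Mask the sign bit so the partition index is never negative.
        return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
    }
}

Such a class would be registered in the driver with job.setPartitionerClass(FirstCharacterPartitioner.class).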
Hadoop-provided mapper and reducer implementations
We don't always have to write our own Mapper and Reducer classes from scratch. Hadoop provides several common Mapper and Reducer implementations that can be used in our jobs. If we don't override any of the methods in the Mapper and Reducer classes, the default implementations are the identity Mapper and Reducer classes, which simply output the input unchanged.
The mappers are found in org.apache.hadoop.mapreduce.lib.map and include the following:

InverseMapper: returns (value, key) as output, that is, the input key is output as the value and the input value is output as the key
TokenCounterMapper: counts the number of discrete tokens in each line of input
IdentityMapper: implements the identity function, mapping inputs directly to outputs

The reducers are found in org.apache.hadoop.mapreduce.lib.reduce and currently include the following:

IntSumReducer: outputs the sum of the list of integer values per key
LongSumReducer: outputs the sum of the list of long values per key
IdentityReducer: implements the identity function, mapping inputs directly to outputs

Sharing reference data
Occasionally, we might want to share data across tasks. For instance, if we need to perform a lookup operation on an ID-to-string translation table, we might want such a data source to be accessible by the mapper or reducer. A straightforward approach is to store the data we want to access on HDFS and use the FileSystem API to query it as part of the map or reduce steps.
Hadoop gives us an alternative mechanism to achieve the goal of sharing reference data across all tasks in the job: the DistributedCache, defined by the org.apache.hadoop.mapreduce.filecache.DistributedCache class. This can be used to efficiently make common read-only files used by the map or reduce tasks available to all nodes.
The files can be text data, as in this case, but could also be additional JARs, binary data, or archives; anything is possible. The files to be distributed are placed on HDFS and added to the DistributedCache within the job driver. Hadoop copies the files onto the local filesystem of each node prior to job execution, meaning every task has local access to the files.
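A minimal sketch of this mechanism, assuming the Hadoop 2 Job API (the HDFS path is illustrative, and the parsing of the local copy is left as a comment): the driver registers the file, and each task can then consult the cached copies in setup().

// In the driver, before submitting the job:
job.addCacheFile(new URI("/reference/id-to-name.tsv"));

// In the Mapper or Reducer:
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    // The registered files have been copied to the local filesystem of
    // every node before the task starts.
    URI[] cacheFiles = context.getCacheFiles();
    // Parse the local copy, for example into a HashMap<String, String>
    // that map() or reduce() can use for lookups.
}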
An alternative is to bundle needed files into the job JAR submitted to Hadoop. This ties the data to the job JAR, making it more difficult to share across jobs and requiring the JAR to be rebuilt if the data changes.
Writing MapReduce programs
In this chapter, we will be focusing on batch workloads; given a set of historical data, we will look at properties of that dataset. In Chapter 4, Real-time Computation with Samza, and Chapter 5, Iterative Computation with Spark, we will show how a similar type of analysis can be performed over a stream of text collected in real time.
Getting started
In the following examples, we will assume a dataset generated by collecting 1,000 tweets using the stream.py script, as shown in Chapter 1, Introduction:

$ python stream.py -t -n 1000 > tweets.txt

We can then copy the dataset into HDFS with:

$ hdfs dfs -put tweets.txt <destination>

Tip
Note that until now we have been working only with the text of tweets. In the remainder of this book, we'll extend stream.py to output additional tweet metadata in JSON format. Keep this in mind before dumping terabytes of messages with stream.py.
Our first MapReduce program will be the canonical WordCount example. A variation of this program will be used to determine trending topics. We will then analyze text associated with topics to determine whether it expresses a "positive" or "negative" sentiment. Finally, we will make use of a MapReduce pattern, ChainMapper, to pull things together and present a data pipeline to clean and prepare the textual data we'll feed to the trending topic and sentiment analysis models.
Running the examples
The full source code of the examples described in this section can be found at https://github.com/learninghadoop2/book-examples/tree/master/ch3.
Before we run our job in Hadoop, we must compile our code and collect the required class files into a single JAR file that we will submit to the system. Using Gradle, you can build the needed JAR file with:

$ ./gradlew jar

Local cluster
Jobs are executed on Hadoop using the jar option of the Hadoop command-line utility. To use this, we specify the name of the JAR file, the main class within it, and any arguments that will be passed to the main class, as shown in the following command:

$ hadoop jar <job jar file> <main class> <argument1> ... <argumentN>
Elastic MapReduce
Recall from Chapter 1, Introduction, that Elastic MapReduce expects the job JAR file and its input data to be located in an S3 bucket and, conversely, will dump its output back into S3.
Note
Be careful: this will cost money! For this example, we will use the smallest possible cluster configuration available for EMR, a single-node cluster.
First of all, we will copy the tweet dataset and the job JAR to S3 using the aws command-line utility:

$ aws s3 cp tweets.txt s3://<bucket>/input
$ aws s3 cp job.jar s3://<bucket>

We can then execute a job using the EMR command-line tool by adding a CUSTOM_JAR step with the aws CLI:

$ aws emr add-steps --cluster-id <cluster-id> --steps \
Type=CUSTOM_JAR,\
Name=CustomJAR,\
Jar=s3://<bucket>/job.jar,\
MainClass=<class name>,\
Args=arg1,arg2,...,argN

Here, cluster-id is the ID of a running EMR cluster, <class name> is the fully qualified name of the main class, and arg1, arg2, ..., argN are the job arguments.
WordCount, the Hello World of MapReduce
WordCount counts word occurrences in a dataset. The source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch3/src/main/java/com/learninghadoop2/mapreduce/WordCount.java. Consider the following block of code:
public class WordCount extends Configured implements Tool
{
    public static class WordCountMapper
        extends Mapper<Object, Text, Text, IntWritable>
    {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context
            ) throws IOException, InterruptedException {
            String[] words = value.toString().split(" ");
            for (String str : words)
            {
                word.set(str);
                context.write(word, one);
            }
        }
    }

    public static class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values,
            Context context
            ) throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable val : values) {
                total++;
            }
            context.write(key, new IntWritable(total));
        }
    }

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        args = new GenericOptionsParser(conf, args)
            .getRemainingArgs();
        Job job = Job.getInstance(conf);
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new WordCount(), args);
        System.exit(exitCode);
    }
}
This is our first complete MapReduce job. Look at the structure, and you should recognize the elements we have previously discussed: the overall job class with the driver configuration in its run and main methods, and the Mapper and Reducer implementations defined as static nested classes.
We'll do a more detailed walkthrough of the mechanics of MapReduce in the next section, but for now, let's look at the preceding code and think of how it realizes the key/value transformations we discussed earlier.
The input to the Mapper class is arguably the hardest to understand, as the key is not actually used. The job specifies TextInputFormat as the format of the input data and, by default, this delivers to the mapper data where the key is the byte offset in the file and the value is the text of that line. In reality, you may never actually see a mapper that uses that byte offset key, but it's provided.
The mapper is executed once for each line of text in the input source, and every time it takes the line and breaks it into words. It then uses the Context object to output (more commonly known as emitting) each new key/value pair of the form (word, 1). These are our K2/V2 values.
We said before that the input to the reducer is a key and a corresponding list of values, and there is some magic that happens between the map and reduce methods to collect together the values for each key, which facilitates this. This is called the shuffle stage, which we won't describe right now. Hadoop executes the reducer once for each key, and the preceding reducer implementation simply counts the elements in the Iterable object and emits output for each word in the form of (word, count). These are our K3/V3 values.
Take a look at the signatures of our mapper and reducer classes: the WordCountMapper class accepts the line offset and its Text contents as input and provides Text and IntWritable as output. The WordCountReducer class has Text and IntWritable as both input and output. This is again quite a common pattern, where the map method transforms its input into a series of data pairs on which the reducer then performs aggregation.
The driver is more meaningful here, as we have real values for the parameters. We use arguments passed to the class to specify the input and output locations.
Run the job with:

$ hadoop jar build/libs/mapreduce-example.jar \
com.learninghadoop2.mapreduce.WordCount \
twitter.txt output

Examine the output with a command such as the following; the actual filename might be different, so just look inside the directory called output in your home directory on HDFS:

$ hdfs dfs -cat output/part-r-00000

Word co-occurrences
Words occurring together are likely to be phrases, and common (frequently occurring) phrases are likely to be important. In Natural Language Processing, a list of co-occurring terms is called an n-gram. N-grams are the foundation of several statistical methods for text analytics. We will give an example of the special case of an n-gram composed of two terms (a bigram), a metric often encountered in analytics applications.
A naïve implementation in MapReduce would be an extension of WordCount that emits a multi-field key composed of two tab-separated words:
public class BiGramCount extends Configured implements Tool
{
    public static class BiGramMapper
        extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context
            ) throws IOException, InterruptedException {
            String[] words = value.toString().split(" ");
            Text bigram = new Text();
            String prev = null;
            for (String s : words) {
                if (prev != null) {
                    bigram.set(prev + "\t+\t" + s);
                    context.write(bigram, one);
                }
                prev = s;
            }
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        args = new GenericOptionsParser(conf, args).getRemainingArgs();
        Job job = Job.getInstance(conf);
        job.setJarByClass(BiGramCount.class);
        job.setMapperClass(BiGramMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new BiGramCount(), args);
        System.exit(exitCode);
    }
}
In this job, we replace WordCountReducer with org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer, which implements the same logic. The source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch3/src/main/java/com/learninghadoop2/mapreduce/BiGramCount.java.
Trending topics
The # symbol, called a hashtag, is used to mark keywords or topics in a tweet. It was created organically by Twitter users as a way to categorize messages. Twitter Search (found at https://twitter.com/search-home) popularized the use of hashtags as a method to connect and find content related to specific topics as well as the people talking about those topics. By counting the frequency with which a hashtag is mentioned over a given time period, we can determine which topics are trending in the social network:
public class HashTagCount extends Configured implements Tool
{
    public static class HashTagCountMapper
        extends Mapper<Object, Text, Text, IntWritable>
    {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        private String hashtagRegExp =
            "(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)";

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
            String[] words = value.toString().split(" ");
            for (String str : words)
            {
                if (str.matches(hashtagRegExp)) {
                    word.set(str);
                    context.write(word, one);
                }
            }
        }
    }

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        args = new GenericOptionsParser(conf, args)
            .getRemainingArgs();
        Job job = Job.getInstance(conf);
        job.setJarByClass(HashTagCount.class);
        job.setMapperClass(HashTagCountMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new HashTagCount(), args);
        System.exit(exitCode);
    }
}
As in the WordCount example, we tokenize text in the Mapper. We use a regular expression, hashtagRegExp, to detect the presence of a hashtag in the tweet's text and emit the hashtag and the number 1 when a hashtag is found. In the Reducer step, we then count the total number of emitted hashtag occurrences using IntSumReducer.
The full source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch3/src/main/java/com/learninghadoop2/mapreduce/HashTagCount.java.
This compiled class will be in the JAR file we built with Gradle earlier, so now we execute HashTagCount with the following command:

$ hadoop jar build/libs/mapreduce-example.jar \
com.learninghadoop2.mapreduce.HashTagCount twitter.txt output

Let's examine the output as before:

$ hdfs dfs -cat output/part-r-00000

You should see output similar to the following:

#whey	1
#willpower	1
#win	2
#winterblues	1
#winterstorm	1
#wipolitics	1
#women	6
#woodgrain	1

Each line is composed of a hashtag and the number of times it appears in the tweet dataset. As you can see, the MapReduce job orders results by key. If we want to find the most mentioned topics, we need to order the result set. The naïve approach would be to perform a total ordering of the aggregated values and select the top ten.
If the output dataset is small, we can pipe it to standard output and sort it using the sort utility:

$ hdfs dfs -cat output/part-r-00000 | sort -k2 -n -r | head -n 10

Another solution would be to write another MapReduce job to traverse the whole result set and sort by value. When data becomes large, this type of global sorting can become quite expensive. In the following section, we will illustrate an efficient design pattern to sort aggregated data.
The TopN pattern
In the TopN pattern, we keep data sorted in a local data structure. Each mapper calculates a list of the top N records in its split and sends its list to the reducer. A single reducer task then finds the top N global records.
We will apply this design pattern to implement a TopTenHashTag job that finds the top ten topics in our dataset. The job takes as input the output data generated by HashTagCount and returns a list of the ten most frequently mentioned hashtags.
In TopTenMapper, we use a TreeMap to keep a sorted list, in ascending order, of hashtags. The key of this map is the number of occurrences; the value is a tab-separated string of hashtag and frequency. In map(), for each value, we update the topN map. When topN has more than ten items, we remove the smallest:

public static class TopTenMapper extends Mapper<Object, Text,
    NullWritable, Text> {
    private TreeMap<Integer, Text> topN = new TreeMap<Integer, Text>();
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws
        IOException, InterruptedException {
        String[] words = value.toString().split("\t");
        if (words.length < 2) {
            return;
        }
        topN.put(Integer.parseInt(words[1]), new Text(value));
        if (topN.size() > 10) {
            topN.remove(topN.firstKey());
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException,
        InterruptedException {
        for (Text t : topN.values()) {
            context.write(NullWritable.get(), t);
        }
    }
}

We don't emit any key/value pairs in the map function. Instead, we implement a cleanup() method that, once the mapper has consumed all its input, emits the (hashtag, count) values in topN. We use a NullWritable key because we want all values to be associated with the same key, so that we can perform a global ordering over all mappers' top-ten lists. This implies that our job will execute only one reducer.
The reducer implements logic similar to what we have in map(). We instantiate a TreeMap and use it to keep an ordered list of the top ten values:
public static class TopTenReducer extends
    Reducer<NullWritable, Text, NullWritable, Text> {
    private TreeMap<Integer, Text> topN = new TreeMap<Integer, Text>();

    @Override
    public void reduce(NullWritable key, Iterable<Text> values, Context
        context) throws IOException, InterruptedException {
        for (Text value : values) {
            String[] words = value.toString().split("\t");
            topN.put(Integer.parseInt(words[1]),
                new Text(value));
            if (topN.size() > 10) {
                topN.remove(topN.firstKey());
            }
        }
        for (Text word : topN.descendingMap().values()) {
            context.write(NullWritable.get(), word);
        }
    }
}

Finally, we traverse topN in descending order to generate the list of trending topics.
Note
Note that in this implementation, we overwrite hashtags whose frequency value is already present in the TreeMap when calling topN.put(). Depending on the use case, it is advisable to use a different data structure, such as those offered by the Guava library (https://code.google.com/p/guava-libraries/), or to adjust the updating strategy.
In the driver, we enforce a single reducer by setting job.setNumReduceTasks(1).
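A minimal sketch of that driver's run() method follows (the full version is in the book's repository; only the essential calls are shown here, and the output key/value classes are assumed from the mapper and reducer above):

public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    args = new GenericOptionsParser(conf, args).getRemainingArgs();
    Job job = Job.getInstance(conf);
    job.setJarByClass(TopTenHashTag.class);
    job.setMapperClass(TopTenMapper.class);
    job.setReducerClass(TopTenReducer.class);
    job.setNumReduceTasks(1);                  // a single, global reducer
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
}

With the JAR built as before, run the job with: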
$ hadoop jar build/libs/mapreduce-example.jar \
com.learninghadoop2.mapreduce.TopTenHashTag \
output/part-r-00000 \
top-ten

We can inspect the top ten to list trending topics:

$ hdfs dfs -cat top-ten/part-r-00000

#Stalker48	150
#gameinsight	55
#12M	52
#KCA	46
#LORDJASONJEROME	29
#Valencia	19
#LesAnges6	16
#VoteLuan	15
#hadoop2	12
#Gameinsight	11

The source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch3/src/main/java/com/learninghadoop2/mapreduce/TopTenHashTag.java.
Sentiment of hashtags
The process of identifying subjective information in a data source is commonly referred to as sentiment analysis. In the previous example, we showed how to detect trending topics in a social network; we'll now analyze the text shared around those topics to determine whether it expresses a mostly positive or negative sentiment.
A list of positive and negative words for the English language, a so-called opinion lexicon, can be found at http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar.
Note
These resources, and many more, have been collected by Prof. Bing Liu's group at the University of Illinois at Chicago and have been used, among others, in Bing Liu, Minqing Hu and Junsheng Cheng. "Opinion Observer: Analyzing and Comparing Opinions on the Web." Proceedings of the 14th International World Wide Web Conference (WWW-2005), May 10-14, 2005, Chiba, Japan.
In this example, we'll present a bag-of-words method that, although simplistic in nature, can be used as a baseline to mine opinion in text. For each tweet and each hashtag, we will count the number of times a positive or a negative word appears and normalize this count by the text length.
Note
The bag-of-words model is an approach used in Natural Language Processing and Information Retrieval to represent textual documents. In this model, text is represented as the set or bag (with multiplicity) of its words, disregarding grammar, morphological properties, and even word order.
Uncompress the archive and place the word lists into HDFS with the following command lines:

$ hdfs dfs -put positive-words.txt <destination>
$ hdfs dfs -put negative-words.txt <destination>

In the Mapper class, we define two objects that will hold the word lists, positiveWords and negativeWords, as Set<String>:

private Set<String> positiveWords = null;
private Set<String> negativeWords = null;

We override the default setup() method of the Mapper so that the lists of positive and negative words, specified by two configuration properties, job.positivewords.path and job.negativewords.path, are read from HDFS using the filesystem API we discussed in the previous chapter. We could have also used the DistributedCache to share this data across the cluster. The helper method parseWordsList reads a word list, strips out comments, and loads the words into a HashSet<String>:
private HashSet<String> parseWordsList(FileSystem fs, Path wordsListPath)
{
    HashSet<String> words = new HashSet<String>();
    try {
        if (fs.exists(wordsListPath)) {
            FSDataInputStream fi = fs.open(wordsListPath);
            BufferedReader br =
                new BufferedReader(new InputStreamReader(fi));
            String line = null;
            while ((line = br.readLine()) != null) {
                if (line.length() > 0 && !line.startsWith(BEGIN_COMMENT)) {
                    words.add(line);
                }
            }
            fi.close();
        }
    }
    catch (IOException e) {
        e.printStackTrace();
    }
    return words;
}
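The setup() override that calls this helper is then just a few lines; a sketch (exact field and constant names may differ slightly from the repository version) is:

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    FileSystem fs = FileSystem.get(conf);
    // Load the word lists from the HDFS paths passed in as job properties.
    positiveWords = parseWordsList(fs, new Path(conf.get("job.positivewords.path")));
    negativeWords = parseWordsList(fs, new Path(conf.get("job.negativewords.path")));
}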
In the Mapper step, we emit, for each hashtag in the tweet, the overall sentiment of the tweet (simply the positive word count minus the negative word count) and the length of the tweet.
We'll use these in the reducer to calculate an overall sentiment ratio, weighted by the length of the tweets, to estimate the sentiment expressed by a tweet on a hashtag, as follows:

public void map(Object key, Text value, Context context)
    throws IOException, InterruptedException {
    String[] words = value.toString().split(" ");
    Integer positiveCount = new Integer(0);
    Integer negativeCount = new Integer(0);
    Integer wordsCount = new Integer(0);

    for (String str : words)
    {
        if (str.matches(HASHTAG_PATTERN)) {
            hashtags.add(str);
        }
        if (positiveWords.contains(str)) {
            positiveCount += 1;
        } else if (negativeWords.contains(str)) {
            negativeCount += 1;
        }
        wordsCount += 1;
    }

    Integer sentimentDifference = 0;
    if (wordsCount > 0) {
        sentimentDifference = positiveCount - negativeCount;
    }

    String stats;
    for (String hashtag : hashtags) {
        word.set(hashtag);
        stats = String.format("%d %d", sentimentDifference,
            wordsCount);
        context.write(word, new Text(stats));
    }
}
}
In the Reducer step, we add together the sentiment scores given to each instance of the hashtag and divide by the total size of all the tweets in which it occurred:

public static class HashTagSentimentReducer
    extends Reducer<Text, Text, Text, DoubleWritable> {
    public void reduce(Text key, Iterable<Text> values,
        Context context
        ) throws IOException, InterruptedException {
        double totalDifference = 0;
        double totalWords = 0;
        for (Text val : values) {
            String[] parts = val.toString().split(" ");
            totalDifference += Double.parseDouble(parts[0]);
            totalWords += Double.parseDouble(parts[1]);
        }
        context.write(key,
            new DoubleWritable(totalDifference / totalWords));
    }
}

The full source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch3/src/main/java/com/learninghadoop2/mapreduce/HashTagSentiment.java.
Once compiled as before, execute HashTagSentiment with the following command:

$ hadoop jar build/libs/mapreduce-example.jar \
com.learninghadoop2.mapreduce.HashTagSentiment twitter.txt output-sentiment \
<positive words> <negative words>

You can examine the output with the following command:

$ hdfs dfs -cat output-sentiment/part-r-00000

You should see an output similar to the following:

#1068	0.011861271213042056
#10YearsOfLove	0.012285135487494233
#11	0.011941109121333999
#12	0.011938693593171155
#12F	0.012339242266249566
#12M	0.011864286953783268
#12MCalleEnPazYaTeVasNicolas

In the preceding output, each line is composed of a hashtag and the sentiment polarity associated with it. This number is a heuristic that tells us whether a hashtag is associated mostly with positive (polarity > 0) or negative (polarity < 0) sentiment, and the magnitude of the sentiment: the higher or lower the number, the stronger the sentiment.
Text cleanup using chain mapper
In the examples presented until now, we ignored a key step of essentially every application built around text processing: the normalization and cleanup of the input data. Three common components of this normalization step are:

Changing the letter case to either lower or upper
Removal of stop words
Stemming

In this section, we will show how the ChainMapper class, found at org.apache.hadoop.mapreduce.lib.chain.ChainMapper, allows us to sequentially combine a series of Mappers into the first step of a data cleanup pipeline. Mappers are added to the configured job using the following:

ChainMapper.addMapper(
    Job job,
    Class<? extends Mapper> klass,
    Class<?> inputKeyClass,
    Class<?> inputValueClass,
    Class<?> outputKeyClass,
    Class<?> outputValueClass,
    Configuration mapperConf)

The static method addMapper requires the following arguments to be passed:

job: the Job to which the Mapper class is added
klass: the Mapper class to add
inputKeyClass: the mapper input key class
inputValueClass: the mapper input value class
outputKeyClass: the mapper output key class
outputValueClass: the mapper output value class
mapperConf: a Configuration object with the configuration for the Mapper class

In this example, we will take care of the first item listed above: before computing the sentiment of each tweet, we will convert each word present in its text to lowercase. This will allow us to more accurately ascertain the sentiment of hashtags by ignoring differences in capitalization across tweets.
First of all, we define a new Mapper, LowerCaseMapper, whose map() function calls the Java String toLowerCase() method on its input value and emits the lowercased text:
public class LowerCaseMapper extends Mapper<LongWritable, Text,
    IntWritable, Text> {
    private Text lowercased = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
        lowercased.set(value.toString().toLowerCase());
        context.write(new IntWritable(1), lowercased);
    }
}

In the HashTagSentimentChain driver, we configure the Job object so that both Mappers will be chained together and executed:
public class HashTagSentimentChain
    extends Configured implements Tool
{
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        args = new GenericOptionsParser(conf, args).getRemainingArgs();
        // location (on HDFS) of the positive words list
        conf.set("job.positivewords.path", args[2]);
        conf.set("job.negativewords.path", args[3]);
        Job job = Job.getInstance(conf);
        job.setJarByClass(HashTagSentimentChain.class);

        Configuration lowerCaseMapperConf = new Configuration(false);
        ChainMapper.addMapper(job,
            LowerCaseMapper.class,
            LongWritable.class, Text.class,
            IntWritable.class, Text.class,
            lowerCaseMapperConf);

        Configuration hashTagSentimentConf = new Configuration(false);
        ChainMapper.addMapper(job,
            HashTagSentiment.HashTagSentimentMapper.class,
            IntWritable.class,
            Text.class, Text.class,
            Text.class,
            hashTagSentimentConf);

        job.setReducerClass(HashTagSentiment.HashTagSentimentReducer.class);
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(
            new HashTagSentimentChain(), args);
        System.exit(exitCode);
    }
}

The LowerCaseMapper and HashTagSentimentMapper classes are invoked in a pipeline, where the output of the first becomes the input of the second. The output of the last Mapper will be written to the task's output. An immediate benefit of this design is a reduction of disk I/O operations. Mappers do not need to be aware that they are chained.
It's therefore possible to reuse specialized Mappers that can be combined within a single task. Note that this pattern assumes that all Mappers, and the Reducer, use matching output and input (key, value) pairs. No casting or conversion is done by ChainMapper itself.
Finally, notice that the addMapper call for the last mapper in the chain specifies the output key/value classes applicable to the whole mapper pipeline when used as a composite.
The full source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch3/src/main/java/com/learninghadoop2/mapreduce/HashTagSentimentChain.java.
Execute HashTagSentimentChain with the command:

$ hadoop jar build/libs/mapreduce-example.jar \
com.learninghadoop2.mapreduce.HashTagSentimentChain twitter.txt output \
<positive words> <negative words>

You should see an output similar to the previous example. Notice that this time, the hashtag in each line is lowercased.
Walking through a run of a MapReduce job
To explore the relationship between mapper and reducer in more detail, and to expose some of Hadoop's inner workings, we'll now go through how a MapReduce job is executed. This applies to MapReduce in both Hadoop 1 and Hadoop 2, even though the latter is implemented very differently using YARN, which we'll discuss later in this chapter. Additional information on the services described in this section, as well as suggestions for troubleshooting MapReduce applications, can be found in Chapter 10, Running a Hadoop Cluster.
Startup
The driver is the only piece of code that runs on our local machine, and the call to Job.waitForCompletion() starts the communication with the JobTracker, which is the master node in the MapReduce system. The JobTracker is responsible for all aspects of job scheduling and execution, so it becomes our primary interface when performing any task related to job management.
To share resources on the cluster, the JobTracker can use one of several scheduling approaches to handle incoming jobs. The general model is to have a number of queues to which jobs can be submitted, along with policies to assign resources across the queues. The most commonly used implementations of these policies are the Capacity Scheduler and the Fair Scheduler.
The JobTracker communicates with the NameNode on our behalf and manages all interactions relating to the data stored on HDFS.
Splitting the input
The first of these interactions happens when the JobTracker looks at the input data and determines how to assign it to map tasks. Recall that HDFS files are usually split into blocks of at least 64 MB, and the JobTracker will assign each block to one map task. Our WordCount example, of course, used a trivial amount of data that was well within a single block. Picture a much larger input file measured in terabytes, and the split model makes more sense. Each segment of the file, or split, in MapReduce terminology, is processed uniquely by one map task. Once it has computed the splits, the JobTracker places them, and the JAR file containing the Mapper and Reducer classes, into a job-specific directory on HDFS, whose path will be passed to each task as it starts.
Task assignment
The TaskTracker service is responsible for allocating resources to, executing, and tracking the status of map and reduce tasks running on a node. Once the JobTracker has determined how many map tasks will be needed, it looks at the number of hosts in the cluster, how many TaskTrackers are working, and how many map tasks each can concurrently execute (a user-definable configuration variable). The JobTracker also looks to see where the various input data blocks are located across the cluster and attempts to define an execution plan that maximizes the cases when a TaskTracker processes a split/block located on the same physical host or, failing that, one in the same hardware rack. This data locality optimization is a huge reason behind Hadoop's ability to efficiently process such large datasets. Recall also that, by default, each block is replicated across three different hosts, so the likelihood of producing a task/host plan that sees most blocks processed locally is higher than it might seem at first.
Task startup
Each TaskTracker then starts up a separate Java virtual machine to execute the tasks. This does add a startup time penalty, but it isolates the TaskTracker from problems caused by misbehaving map or reduce tasks, and the JVM can be configured to be shared between subsequently executed tasks.
If the cluster has enough capacity to execute all the map tasks at once, they will all be started and given a reference to the split they are to process and the job JAR file. If there are more tasks than the cluster capacity, the JobTracker will keep a queue of pending tasks and assign them to nodes as they complete their initially assigned map tasks.
We are now ready to see data being processed by the map tasks. If all this sounds like a lot of work, it is; it explains why, when running any MapReduce job, there is always a non-trivial amount of time taken as the system gets started and performs all these steps.
Ongoing JobTracker monitoring
The JobTracker doesn't just stop work now and wait for the TaskTrackers to execute all the mappers and reducers. It's constantly exchanging heartbeat and status messages with the TaskTrackers, looking for evidence of progress or problems. It also collects metrics from the tasks throughout the job execution, some provided by Hadoop and others specified by the developer of the map and reduce tasks, although we don't use any in this example.
Mapper input
The driver class specifies the format and structure of the input file using TextInputFormat, and from this, Hadoop knows to treat it as text with the byte offset as the key and line contents as the value. Assume that our dataset contains the following text:

This is a test
Yes it is

The two invocations of the mapper will therefore be given the following input:

1 This is a test
2 Yes it is

Mapper execution
The key/value pairs received by the mapper are the offset in the file of the line and the line contents, respectively, because of how the job is configured. Our implementation of the map method in WordCountMapper discards the key, as we do not care where each line occurred in the file, and splits the provided value into words using the split method on the standard Java String class. Note that better tokenization could be provided by use of regular expressions or the StringTokenizer class, but for our purposes this simple approach will suffice. For each individual word, the mapper then emits a key comprised of the actual word itself and a value of 1.
Mapper output and reducer input
The output of the mapper is a series of pairs of the form (word, 1); in our example, these will be:

(This, 1), (is, 1), (a, 1), (test, 1), (Yes, 1), (it, 1), (is, 1)

These output pairs from the mapper are not passed directly to the reducer. Between mapping and reducing lies the shuffle stage, where much of the magic of MapReduce occurs.
Reducer input
The reducer TaskTracker receives updates from the JobTracker that tell it which nodes in the cluster hold map output partitions that need to be processed by its local reduce task. It then retrieves these from the various nodes and merges them into a single file that will be fed to the reduce task.
Reducer execution
Our WordCountReducer class is very simple; for each word, it simply counts the number of elements in the list of values and emits the final (word, count) output for each word. For our invocation of WordCount on our sample input, all but one word have only one value in the list of values; is has two.
Reducer output
The final set of reducer output for our example is therefore:

(This, 1), (is, 2), (a, 1), (test, 1), (Yes, 1), (it, 1)

This data will be output to partition files within the output directory specified in the driver, formatted using the specified OutputFormat implementation. Each reduce task writes to a single file with the filename part-r-nnnnn, where nnnnn starts at 00000 and is incremented.
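When a job runs with many reducers, there will be one such part-r-nnnnn file per reduce task; if a single local file is more convenient for inspection, the pieces can be pulled together with a command such as the following (the directory and filename are assumed):

$ hdfs dfs -getmerge output wordcount-merged.txt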
Shutdown
Once all tasks have completed successfully, the JobTracker outputs the final state of the job to the client, along with the final aggregates of some of the more important counters that it has been aggregating along the way. The full job and task history is available in the log directory on each node or, more accessibly, via the JobTracker web UI; point your browser to port 50030 on the JobTracker node.
Input/Output
We have talked about files being broken into splits as part of the job startup and the data in a split being sent to the mapper implementation. However, this overlooks two aspects: how the data is stored in the file and how the individual keys and values are passed to the mapper structure.
InputFormat and RecordReader
Hadoop has the concept of an InputFormat for the first of these responsibilities. The InputFormat abstract class in the org.apache.hadoop.mapreduce package provides two methods, as shown in the following code:

public abstract class InputFormat<K, V>
{
    public abstract List<InputSplit> getSplits(JobContext context);
    public abstract RecordReader<K, V> createRecordReader(InputSplit split,
        TaskAttemptContext context);
}

These methods display the two responsibilities of the InputFormat class:

To provide details on how to divide an input file into the splits required for map processing
To create a RecordReader that will generate the series of key/value pairs from a split

The RecordReader class is also an abstract class within the org.apache.hadoop.mapreduce package:

public abstract class RecordReader<Key, Value> implements Closeable
{
    public abstract void initialize(InputSplit split,
        TaskAttemptContext context);
    public abstract boolean nextKeyValue()
        throws IOException, InterruptedException;
    public abstract Key getCurrentKey()
        throws IOException, InterruptedException;
    public abstract Value getCurrentValue()
        throws IOException, InterruptedException;
    public abstract float getProgress()
        throws IOException, InterruptedException;
    public abstract void close() throws IOException;
}

A RecordReader instance is created for each split. The framework calls nextKeyValue(), which returns a Boolean indicating whether another key/value pair is available and, if so, the getCurrentKey() and getCurrentValue() methods are used to access the key and value, respectively.
The combination of the InputFormat and RecordReader classes is therefore all that is required to bridge between any kind of input data and the key/value pairs required by MapReduce.
Hadoop-provided InputFormat
There are some Hadoop-provided InputFormat implementations within the org.apache.hadoop.mapreduce.lib.input package:

FileInputFormat: an abstract base class that can be the parent of any file-based input
SequenceFileInputFormat: reads an efficient binary file format that will be discussed in an upcoming section
TextInputFormat: used for plain text files
KeyValueTextInputFormat: also used for plain text files, where each line is divided into key and value parts by a separator byte

Note that input formats are not restricted to reading from files; FileInputFormat is itself a subclass of InputFormat. It's possible to have Hadoop use data that is not based on files as the input to MapReduce jobs; common sources are relational databases or column-oriented databases, such as Amazon DynamoDB or HBase.
Hadoop-provided RecordReader
Hadoop provides a few common RecordReader implementations, which are also present within the org.apache.hadoop.mapreduce.lib.input package:

LineRecordReader: the default RecordReader class for text files, which presents the byte offset in the file as the key and the line contents as the value
SequenceFileRecordReader: reads the key/value pairs from the binary SequenceFile container

OutputFormat and RecordWriter
There is a similar pattern for writing the output of a job, coordinated by subclasses of OutputFormat and RecordWriter from the org.apache.hadoop.mapreduce package. We won't explore these in any detail here, but the general approach is similar, although OutputFormat does have a more involved API, as it has methods for tasks such as validation of the output specification.
It's this step that causes a job to fail if a specified output directory already exists. If you wanted different behavior, it would require a subclass of OutputFormat that overrides this method.
Hadoop-provided OutputFormat
The following output formats are provided in the org.apache.hadoop.mapreduce.lib.output package:

FileOutputFormat: the base class for all file-based OutputFormats
NullOutputFormat: a dummy implementation that discards the output and writes nothing
SequenceFileOutputFormat: writes to the binary SequenceFile format
TextOutputFormat: writes a plain text file

Note that these classes define their required RecordWriter implementations as static nested classes, so there are no separately provided RecordWriter implementations.
Sequence files
The SequenceFile class within the org.apache.hadoop.io package provides an efficient binary file format that is often useful as an output from a MapReduce job. This is especially true if the output from the job is processed as the input of another job. Sequence files have several advantages, as follows:

As binary files, they are intrinsically more compact than text files
They additionally support optional compression, which can be applied at different levels, that is, they can compress each record or an entire split
They can be split and processed in parallel

This last characteristic is important, as most binary formats, particularly those that are compressed or encrypted, cannot be split and must be read as a single linear stream of data. Using such files as input to a MapReduce job means that a single mapper will be used to process the entire file, causing a potentially large performance hit. In such a situation, it's preferable to use a splittable format such as SequenceFile or, if you cannot avoid receiving the file in another format, do a preprocessing step that converts it into a splittable format. This will be a trade-off, as the conversion will take time, but in many cases, especially with complex map tasks, this will be outweighed by the time saved through increased parallelism.
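As a sketch of how a job would opt into this format, a driver can switch its output to a block-compressed SequenceFile with a few extra calls (the Snappy codec shown here is illustrative and requires the native library to be available on the cluster):

// Write the job's output as a block-compressed SequenceFile.
job.setOutputFormatClass(SequenceFileOutputFormat.class);
SequenceFileOutputFormat.setCompressOutput(job, true);
SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
SequenceFileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);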
YARN
YARN started out as part of the MapReduce v2 (MRv2) initiative but is now an independent sub-project within Hadoop (that is, it's at the same level as MapReduce). It grew out of a realization that MapReduce in Hadoop 1 conflated two related but distinct responsibilities: resource management and application execution.
Although it has enabled previously unimagined processing of enormous datasets, the MapReduce model at a conceptual level has an impact on performance and scalability. Implicit in the MapReduce model is that any application can only be composed of a series of largely linear MapReduce jobs, each of which follows a model of one or more maps followed by one or more reduces. This model is a great fit for some applications, but not all. In particular, it's a poor fit for workloads requiring very low-latency response times; the MapReduce startup times and sometimes lengthy job chains often greatly exceed the tolerance for a user-facing process. The model has also been found to be very inefficient for jobs that would more naturally be represented as a directed acyclic graph (DAG) of tasks, where the nodes on the graph are processing steps and the edges are data flows. If analyzed and executed as a DAG, the application may be performed in one step with high parallelism across the processing steps, but when viewed through the MapReduce lens, the result is usually an inefficient series of interdependent MapReduce jobs.
Numerous projects have built different types of processing atop MapReduce, and although many are wildly successful (Apache Hive and Pig are two standout examples), the close coupling of MapReduce as a processing paradigm with the job scheduling mechanism in Hadoop 1 made it very difficult for any new project to tailor either of these areas to its specific needs.
The result is Yet Another Resource Negotiator (YARN), which provides a highly capable job scheduling mechanism within Hadoop and well-defined interfaces for different processing models to be implemented within it.
YARN architecture
To understand how YARN works, it's important to stop thinking about MapReduce and how it processes data. YARN itself says nothing about the nature of the applications that run atop it; rather, it's focused on providing the machinery for the scheduling and execution of these jobs. As we'll see, YARN is just as capable of hosting long-running stream processing or low-latency, user-facing workloads as it is of hosting batch-processing workloads such as MapReduce.
The components of YARN
YARN comprises two main components: the ResourceManager (RM), which manages resources across the cluster, and the NodeManager (NM), which runs on each host and manages the resources on the individual machine. The ResourceManager and NodeManagers deal with the scheduling and management of containers, an abstract notion of the memory, CPU, and I/O that will be dedicated to running a particular piece of application code. Using MapReduce as an example, when running atop YARN, the JobTracker and each TaskTracker all run in their own dedicated containers. Note, though, that in YARN each MapReduce job has its own dedicated JobTracker; there is no single instance that manages all jobs, as in Hadoop 1.
YARN itself is responsible only for the scheduling of tasks across the cluster; all notions of application-level progress, monitoring, and fault tolerance are handled in the application code. This is a very explicit design decision; by making YARN as independent as possible, it has a very clear set of responsibilities and does not artificially constrain the types of application that can be implemented on YARN.
As the arbiter of all cluster resources, YARN has the ability to efficiently manage the cluster as a whole and not focus on application-level resource requirements. It has a pluggable scheduling policy, with the provided implementations similar to the existing Hadoop Capacity and Fair Schedulers. YARN also treats all application code as inherently untrusted, and all application management and control tasks are performed in user space.
Anatomy of a YARN application
A submitted YARN application has two components: the ApplicationMaster (AM), which coordinates the overall application flow, and the specification of the code that will run on the worker nodes. For MapReduce atop YARN, the JobTracker implements the ApplicationMaster functionality and the TaskTrackers are the application's custom code deployed on the worker nodes.
As mentioned in the previous section, the responsibilities of application management, progress monitoring, and fault tolerance are pushed to the application level in YARN. It's the ApplicationMaster that performs these tasks; YARN itself says nothing about the mechanisms for communication between the ApplicationMaster and the code running in the worker containers, for example.
This genericity allows YARN applications to not be tied to Java classes. The ApplicationMaster can instead request that a NodeManager execute shell scripts, native applications, or any other type of processing that is made available on each node.
Lifecycle of a YARN application
As with MapReduce jobs in Hadoop 1, YARN applications are submitted to the cluster by a client. When a YARN application is started, the client first calls the ResourceManager (more specifically, the ApplicationManager portion of the ResourceManager) and requests the initial container within which to execute the ApplicationMaster. In most cases, the ApplicationMaster will run from a hosted container in the cluster, just as the rest of the application code will. The ApplicationManager communicates with the other main component of the ResourceManager, the scheduler itself, which has the ultimate responsibility of managing all resources across the cluster.
The ApplicationMaster starts up in the provided container, registers itself with the ResourceManager, and begins the process of negotiating its required resources. The ApplicationMaster communicates with the ResourceManager and requests the containers it requires. The specification of the requested containers can also include additional information, such as desired placement within the cluster and concrete resource requirements, such as a particular amount of memory or CPU.
The ResourceManager provides the ApplicationMaster with the details of the containers it has been allocated, and the ApplicationMaster then communicates with the NodeManagers to start the application-specific task for each container. This is done by providing the NodeManager with the specification of the application to be executed, which, as mentioned, may be a JAR file, a script, a path to a local executable, or anything else that the NodeManager can invoke. Each NodeManager instantiates the container for the application code and starts the application based on the provided specification.
Fault tolerance and monitoring
From this point onward, the behavior is largely application specific. YARN will not manage application progress, but it does perform a number of ongoing tasks. The AMLivelinessMonitor within the ResourceManager receives heartbeats from all ApplicationMasters, and if it determines that an ApplicationMaster has failed or stopped working, it will de-register the failed ApplicationMaster and release all its allocated containers. The ResourceManager will then reschedule the application a configurable number of times.
Alongside this process, the NMLivelinessMonitor within the ResourceManager receives heartbeats from the NodeManagers and keeps track of the health of each NodeManager in the cluster. Similar to the monitoring of ApplicationMaster health, a NodeManager will be marked as dead after receiving no heartbeats for a default time of 10 minutes, after which all its allocated containers are marked as dead and the node is excluded from future resource allocation.
At the same time, the NodeManager will actively monitor resource utilization of each allocated container and, for those resources not constrained by hard limits, will kill containers that exceed their resource allocation.
At a higher level, the YARN scheduler will always be looking to maximize cluster utilization within the constraints of the sharing policy being employed. As with Hadoop 1, this will allow low-priority applications to use more cluster resources if contention is low, but the scheduler will then preempt these additional containers (that is, request them to be terminated) if higher-priority applications are submitted.
The rest of the responsibility for application-level fault tolerance and progress monitoring must be implemented within the application code. For MapReduce on YARN, for example, all the management of task scheduling and retries is provided at the application level and is not in any way delivered by YARN.
Thinking in layers
These last statements may suggest that writing applications to run on YARN is a lot of work, and this is true. The YARN API is quite low-level and likely intimidating for most developers who just want to run some processing tasks on their data. If all we had was YARN, and every new Hadoop application had to have its own ApplicationMaster implemented, then YARN would not look quite as interesting as it does.
What makes the picture better is that, in general, the requirement isn't to implement each and every application on YARN, but instead to use it for a smaller number of processing frameworks that provide much friendlier interfaces to be implemented against. The first of these was MapReduce; with it hosted on YARN, the developer writes to the usual map and reduce interfaces and is largely unaware of the YARN mechanics.
But on the same cluster, another developer may be running a job that uses a different framework with significantly different processing characteristics, and YARN will manage both at the same time.
We'll give some more detail on several YARN processing models currently available, but they run the gamut from batch processing through low-latency queries to stream and graph processing and beyond.
As the YARN experience grows, however, there are a number of initiatives to make the development of these processing frameworks easier. On the one hand, there are higher-level interfaces, such as Cloudera Kitten (https://github.com/cloudera/kitten) or Apache Twill (http://twill.incubator.apache.org/), that give friendlier abstractions above the YARN APIs. Perhaps a more significant development model, though, is the emergence of frameworks that provide richer tools to more easily construct applications with a common general class of performance characteristics.
Execution models
We have mentioned different YARN applications having distinct processing characteristics, but an emerging pattern has seen their execution models in general being a source of differentiation. By this, we refer to how the YARN application lifecycle is managed, and we identify three main types: per-job application, per-session, and always-on.
Batch processing, such as MapReduce on YARN, sees the lifecycle of the MapReduce framework tied to that of the submitted application. If we submit a MapReduce job, then the JobTracker and TaskTrackers that execute it are created specifically for the job and are terminated when the job completes. This works well for batch, but if we wish to provide a more interactive model, then the startup overhead of establishing the YARN application and all its resource allocations will severely impact the user experience if every command issued suffers this penalty. A more interactive, or session-based, lifecycle will see the YARN application start and then be available to service a number of submitted requests/commands. The YARN application terminates only when the session is exited.
Finally, we have the concept of long-running applications that process continuous data streams independent of any interactive input. For these, it makes most sense for the YARN application to start and continuously process data that is retrieved through some external mechanism. The application will only exit when explicitly shut down or if an abnormal situation occurs.
YARN in the real world – Computation beyond MapReduce
The previous discussions have been a little abstract, so in this section, we will explore a few existing YARN applications to see just how they use the framework and how they provide a breadth of processing capability. Of particular interest is how the YARN frameworks take very different approaches to resource management, I/O pipelining, and fault tolerance.
The problem with MapReduce
Until now, we have looked at MapReduce in terms of its API. MapReduce in Hadoop is more than that; up until Hadoop 2, it was the default execution engine for a number of tools, among which were Hive and Pig, which we will discuss in more detail later in this book. We have seen how MapReduce applications are, in fact, chains of jobs. This very aspect is one of the biggest pain points and constraining factors of the framework. MapReduce checkpoints data to HDFS for communication between the jobs in a chain:

A chain of MapReduce jobs

At the end of each reduce phase, output is written to disk so that it can then be loaded by the mappers of the next job and used as its input. This I/O overhead introduces latency, especially when we have applications that require multiple passes over a dataset (and hence multiple writes). Unfortunately, this type of iterative computation is at the core of many analytics applications.
Apache Tez and Apache Spark are two frameworks that address this problem by generalizing the MapReduce paradigm. We will briefly discuss them in the remainder of this section, next to Apache Samza, a framework that takes an entirely different approach to real-time processing.
Tez
Tez (http://tez.apache.org) is a low-level API and execution engine focused on providing low-latency processing, and it is being used as the basis of the latest evolution of Hive, Pig, and several other frameworks that implement standard join, filter, merge, and group operations. Tez is an implementation and evolution of a programming model presented by Microsoft in the 2009 Dryad paper (http://research.microsoft.com/en-us/projects/dryad/). Tez is a generalization of MapReduce as dataflow that strives to achieve fast, interactive computing by pipelining I/O operations over a queue for inter-process communication. This avoids the expensive writes to disk that affect MapReduce. The API provides primitives for expressing dependencies between jobs as a DAG. The full DAG is then submitted to a planner that can optimize the execution flow. The same application depicted in the preceding diagram would be executed in Tez as a single job, with I/O pipelined from reducers to reducers without HDFS writes and subsequent reads by mappers. An example can be seen in the following diagram:

A Tez DAG is a generalization of MapReduce

The canonical WordCount example can be found at https://github.com/apache/incubator-tez/blob/master/tez-mapreduce-examples/src/main/java/org/apache/tez/mapreduce/examples/WordCount.java:

DAG dag = new DAG("WordCount");
dag.addVertex(tokenizerVertex)
    .addVertex(summerVertex)
    .addEdge(new Edge(tokenizerVertex, summerVertex,
        edgeConf.createDefaultEdgeProperty()));

Even though the graph topology dag can be expressed with a few lines of code, the boilerplate required to execute the job is considerable. This code handles many of the low-level scheduling and execution responsibilities, including fault tolerance. When Tez detects a failed task, it walks back up the processing graph to find the point from which to re-execute the failed tasks.
Hive-on-tez
Hive 0.13 is the first high-profile project to use Tez as its execution engine. We'll discuss Hive in a lot more detail in Chapter 7, Hadoop and SQL, but for now we will just touch on how it's implemented on YARN.
Hive (http://hive.apache.org) is an engine for querying data stored on HDFS through standard SQL syntax. It has been enormously successful, as this type of capability greatly reduces the barriers to starting analytic exploration of data in Hadoop.
In Hadoop 1, Hive had no choice but to implement its SQL statements as a series of MapReduce jobs. When SQL is submitted to Hive, it generates the required MapReduce jobs behind the scenes and executes them on the cluster. This approach has two main drawbacks: there is a non-trivial startup penalty each time, and the constrained MapReduce model means that seemingly simple SQL statements are often translated into a lengthy series of multiple dependent MapReduce jobs. This is an example of the type of processing more naturally conceptualized as a DAG of tasks, as described earlier in this chapter.
Although some benefits are achieved when Hive executes its MapReduce jobs within YARN, the major benefits come in Hive 0.13, where the project is fully re-implemented using Tez. By exploiting the Tez APIs, which are focused on providing low-latency processing, Hive gains even more performance while making its codebase simpler.
Since Tez treats its workloads as DAGs, which provide a much better fit for translated SQL queries, Hive on Tez can perform any SQL statement as a single job with maximized parallelism.
Tez helps Hive support interactive queries by providing an always-running service instead of requiring the application to be instantiated from scratch for each SQL submission. This is important because, even though queries that process huge data volumes will simply take some time, the goal is for Hive to become less of a batch tool and instead move to being as much of an interactive tool as possible.
ApacheSparkSpark(.apache.org)isaprocessingframeworkthatexcelsatiterativeandnearreal-timeprocessing.CreatedatUCBerkeley,ithasbeendonatedasanApacheproject.SparkprovidesanabstractionthatallowsdatainHadooptobeviewedasadistributeddatastructureuponwhichaseriesofoperationscanbeperformed.TheframeworkisbasedonthesameconceptsTezdrawsinspirationfrom(Dryad),butexcelswithjobsthatallowdatatobeheldandprocessedinmemory,anditcanveryefficientlyscheduleprocessingonthein-memorydatasetacrossthecluster.Sparkautomaticallycontrolsreplicationofdataacrossthecluster,ensuringthateachelementofthedistributeddatasetisheldinmemoryonatleasttwomachines,andprovidesreplication-basedfaulttolerancesomewhatakintoHDFS.
Spark started as a standalone system, but was ported to also run on YARN as of its 0.8 release. Spark is particularly interesting because, although its classic processing model is batch-oriented, with the Spark shell it provides an interactive frontend, and with the Spark Streaming sub-project it also offers near real-time processing of data streams. Spark is different things to different people; it's both a high-level API and an execution engine. At the time of writing, ports of Hive and Pig to Spark are in progress.

Apache Samza

Samza (http://samza.apache.org) is a stream-processing framework developed at LinkedIn and donated to the Apache Software Foundation. Samza processes conceptually infinite streams of data, which are seen by the application as a series of messages.

Samza currently integrates most tightly with Apache Kafka (http://kafka.apache.org), although it does have a pluggable architecture. Kafka itself is a messaging system that excels at large data volumes and provides a topic-based abstraction similar to most other messaging platforms, such as RabbitMQ. Publishers send messages to topics and interested clients consume messages from the topics as they arrive. Kafka has multiple aspects that set it apart from other messaging platforms, but for this discussion, the most interesting one is that Kafka stores messages for a period of time, which allows messages in topics to be replayed. Topics are partitioned across multiple hosts and partitions can be replicated across hosts to protect from node failure.

Samza builds its processing flow on its concept of streams, which, when using Kafka, map directly to Kafka partitions. A typical Samza job may listen to one topic for incoming messages, perform some transformations, and then write the output to a different topic. Multiple Samza jobs can then be composed to provide more complex processing structures.

As a YARN application, the Samza ApplicationMaster monitors the health of all running Samza tasks. If a task fails, then a replacement task is instantiated in a new container. Samza achieves fault tolerance by having each task write its progress to a new stream (again modeled as a Kafka topic), so any replacement task just needs to read the latest task state from this checkpoint topic and then replay the main message topic from the last processed position. Samza additionally offers support for local task state, which can be very useful for join and aggregation type workloads. This local state is again built atop the stream abstraction and hence is intrinsically resilient to host failure.
YARN-independent frameworks

An interesting point to note is that two of the preceding projects (Samza and Spark) run atop YARN but are not specific to YARN. Spark started out as a standalone service and has implementations for other schedulers, such as Apache Mesos, or to run on Amazon EC2. Though Samza runs only on YARN today, its architecture explicitly is not YARN-specific, and there are discussions about providing realizations on other platforms.

If the YARN model of pushing as much as possible into the application has its downsides through implementation complexity, then this decoupling is one of its major benefits. An application written to use YARN need not be tied to it; by definition, all the functionality for the actual application logic and management is encapsulated within the application code and is independent of YARN or any other framework. This is, of course, not saying that designing a scheduler-independent application is a trivial task, but it's now a tractable task; this was absolutely not the case pre-YARN.
YARN today and beyond

Though YARN has been used in production (at Yahoo! in particular) for some time, the final GA version was not released until late 2012. The interfaces to YARN were also somewhat fluid until quite late in the development cycle. Consequently, the fully forward compatible YARN as of Hadoop 2.2 is still relatively new.

YARN is fully functional today, and the future direction will see extensions to its current capabilities. Perhaps most notable among these will be the ability to specify and control container resources on more dimensions. Currently, only location, memory, and CPU specifications are possible, and this will be expanded into areas such as storage and network I/O.
In addition, the ApplicationMaster currently has little control over the management of how containers are co-located or not. Finer-grained control here will allow the ApplicationMaster to specify policies around when containers may or may not be scheduled on the same node. Also, the current resource allocation model is quite static, and it will be useful to allow an application to dynamically change the resources allocated to a running container.
Summary

This chapter explored how to process those large volumes of data that we discussed so much in the previous chapter. In particular, we covered:

How MapReduce was the only processing model available in Hadoop 1, and its conceptual model
The Java API to MapReduce, and how to use this to build some examples, from a word count to sentiment analysis of Twitter hashtags
The details of how MapReduce is implemented in practice, and we walked through the execution of a MapReduce job
How Hadoop stores data and the classes involved to represent input and output formats and record readers and writers
The limitations of MapReduce that led to the development of YARN, opening the door to multiple computational models on the Hadoop platform
The YARN architecture and how applications are built atop it

In the next two chapters, we will move away from strictly batch processing and delve into the world of near real-time and iterative processing, using two of the YARN-hosted frameworks we introduced in this chapter, namely Samza and Spark.
Chapter 4. Real-time Computation with Samza

The previous chapter discussed YARN, and frequently mentioned the breadth of computational models and processing frameworks outside of traditional batch-based MapReduce that it enables on the Hadoop platform. In this chapter and the next, we will explore two such projects in some depth, namely Apache Samza and Apache Spark. We chose these frameworks as they demonstrate the usage of stream and iterative processing and also provide interesting mechanisms to combine processing paradigms. In this chapter we will explore Samza and cover the following topics:

What Samza is and how it integrates with YARN and other projects such as Apache Kafka
How Samza provides a simple callback-based interface for stream processing
How Samza composes multiple stream processing jobs into more complex workflows
How Samza supports persistent local state within tasks and how this greatly enriches what it can enable

Stream processing with Samza

To explore a pure stream-processing platform, we will use Samza, which is available at https://samza.apache.org. The code shown here was tested with the current 0.8 release and we'll keep the GitHub repository updated as the project continues to evolve.

Samza was built at LinkedIn and donated to the Apache Software Foundation in September 2013. Over the years, LinkedIn has built a model that conceptualizes much of their data as streams, and from this they saw the need for a framework that can provide a developer-friendly mechanism to process these ubiquitous data streams.
The team at LinkedIn realized that when it came to data processing, much of the attention went to the extreme ends of the spectrum: for example, RPC workloads are usually implemented as synchronous systems with very low latency requirements, while batch systems have a periodicity of jobs that is often measured in hours. The ground in between has been relatively poorly supported, and this is the area that Samza is targeted at; most of its jobs expect response times ranging from milliseconds to minutes. They also assume that data arrives in a theoretically infinite stream of continuous messages.
How Samza works

There are numerous stream-processing systems, such as Storm (http://storm.apache.org), in the open source world, and many other (mostly commercial) tools, such as complex event processing (CEP) systems, that also target processing on continuous message streams. These systems have many similarities but also some major differences.

For Samza, perhaps the most significant difference is its assumptions about message delivery. Many systems work very hard to reduce the latency of each message, sometimes with an assumption that the goal is to get the message into and out of the system as fast as possible. Samza assumes almost the opposite; its streams are persistent and resilient and any message written to a stream can be re-read for a period of time after its first arrival. As we will see, this gives significant capability around fault tolerance. Samza also builds on this model to allow each of its tasks to hold resilient local state.

Samza is mostly implemented in Scala even though its public APIs are written in Java. We'll show Java examples in this chapter, but any JVM language can be used to implement Samza applications. We'll discuss Scala when we explore Spark in the next chapter.

Samza high-level architecture

Samza views the world as having three main layers or components: the streaming, execution, and processing layers.

Samza architecture

The streaming layer provides access to the data streams, both for consumption and publication. The execution layer provides the means by which Samza applications can be run, have resources such as CPU and memory allocated, and have their lifecycles managed. The processing layer is the actual Samza framework itself, and its interfaces allow per-message functionality.

Samza provides pluggable interfaces to support the first two layers, though the current main implementations use Kafka for streaming and YARN for execution. We'll discuss these further in the following sections.

Samza's best friend – Apache Kafka

Samza itself does not implement the actual message stream. Instead, it provides an interface for a message system with which it then integrates. The default stream implementation is built upon Apache Kafka (http://kafka.apache.org), a messaging system also built at LinkedIn but now a successful and widely adopted open source project.

Kafka can be viewed as a message broker akin to something like RabbitMQ or ActiveMQ, but as mentioned earlier, it writes all messages to disk and scales out across multiple hosts as a core part of its design. Kafka uses the concept of a publish/subscribe model through named topics to which producers write messages and from which consumers read messages. These work much like topics in any other messaging system.

Because Kafka writes all messages to disk, it might not have the same ultra-low latency message throughput as other messaging systems, which focus on getting the message processed as fast as possible and don't aim to store the message long term. Kafka can, however, scale exceptionally well and its ability to replay a message stream can be extremely useful. For example, if a consuming client fails, then it can re-read messages from a known good point in time, or if a downstream algorithm changes, then traffic can be replayed to utilize the new functionality.
When scaling across hosts, Kafka partitions topics and supports partition replication for fault tolerance. Each Kafka message has a key associated with the message and this is used to decide to which partition a given message is sent. This allows semantically useful partitioning; for example, if the key is a user ID in the system, then all messages for a given user will be sent to the same partition. Kafka guarantees ordered delivery within each partition so that any client reading a partition can know that they are receiving all messages for each key in that partition in the order in which they are written by the producer.

Samza periodically writes out checkpoints of the position up to which it has read in all the streams it is consuming. These checkpoint messages are themselves written to a Kafka topic. Thus, when a Samza job starts up, each task can reread its checkpoint stream to know from which position in the stream to start processing messages. This means that in effect Kafka also acts as a buffer; if a Samza job crashes or is taken down for upgrade, no messages will be lost. Instead, the job will just restart from the last checkpointed position when it restarts. This buffer functionality is also important, as it makes it easier for multiple Samza jobs to run as part of a complex workflow. When Kafka topics are the points of coordination between the jobs, one job might consume a topic being written to by another; in such cases, Kafka can help smooth out issues caused due to any given job running slower than others. Traditionally, the backpressure caused by a slow running job can be a real issue in a system comprised of multiple job stages, but Kafka as the resilient buffer allows each job to read and write at its own rate. Note that this is analogous to how multiple coordinating MapReduce jobs will use HDFS for similar purposes.
Kafka provides at-least-once message delivery semantics, that is to say that any message written to Kafka will be guaranteed to be available to a client of the particular partition. Messages might be processed between checkpoints, however, in which case it is possible for duplicate messages to be received by the client after a restart. There are application-specific mechanisms to mitigate this, and both Kafka and Samza have exactly-once semantics on their roadmaps, but for now it is something you should take into consideration when designing jobs.
Wewon’texplainKafkafurtherbeyondwhatweneedtodemonstrateSamza.Ifyouareinterested,checkoutitswebsiteandwiki;thereisalotofgoodinformation,includingsomeexcellentpapersandpresentations.
YARN integration

As mentioned earlier, just as Samza utilizes Kafka for its streaming layer implementation, it uses YARN for the execution layer. Just like any YARN application described in Chapter 3, Processing – MapReduce and Beyond, Samza provides an implementation of an ApplicationMaster, which controls the lifecycle of the overall job, plus implementations of Samza-specific functionality (called tasks) that are executed in each container. Just as Kafka partitions its topics, tasks are the mechanism by which Samza partitions its processing. Each Kafka partition will be read by a single Samza task. If a Samza job consumes multiple streams, then a given task will be the only consumer within the job for every stream partition assigned to it.
The Samza framework is told by each job configuration about the Kafka streams that are of interest to the job, and Samza continuously polls these streams to determine if any new messages have arrived. When a new message is available, the Samza task invokes a user-defined callback to process the message, a model that shouldn't look too alien to MapReduce developers. This method is defined in an interface called StreamTask and has the following signature:
public void process(IncomingMessageEnvelope envelope,
                    MessageCollector collector,
                    TaskCoordinator coordinator)
This is the core of each Samza task and defines the functionality to be applied to received messages. The received message that is to be processed is wrapped in the IncomingMessageEnvelope; output messages can be written to the MessageCollector, and task management (such as shutdown) can be performed via the TaskCoordinator.

As mentioned, Samza creates one task instance for each partition in the underlying Kafka topic. Each YARN container will manage one or more of these tasks. The overall model then is of the Samza ApplicationMaster coordinating multiple containers, each of which is responsible for one or more StreamTask instances.
An independent model

Though we will talk exclusively of Kafka and YARN as the providers of Samza's streaming and execution layers in this chapter, it is important to remember that the core Samza system uses well-defined interfaces for both the stream and execution systems. There are implementations of multiple stream sources (we'll see one in the next section) and, alongside the YARN support, Samza ships with a LocalJobRunner class. This alternative method of running tasks can execute StreamTask instances in-process on the JVM instead of requiring a full YARN cluster, which can sometimes be a useful testing and debugging tool. There is also discussion of Samza implementations on top of other cluster manager or virtualization frameworks.

Hello Samza!

Since not everyone already has ZooKeeper, Kafka, and YARN clusters ready to be used, the Samza team has created a wonderful way to get started with the product. Instead of just having a Hello world! program, there is a repository called Hello Samza, which is available by cloning the repository at git://git.apache.org/samza-hello-samza.git.

This will download and install dedicated instances of ZooKeeper, Kafka, and YARN (the three major prerequisites for Samza), creating a full stack upon which you can submit Samza jobs.

There are also a number of example Samza jobs that process data from Wikipedia edit notifications. Take a look at the page at http://samza.apache.org/startup/hello-samza/0.8/ and follow the instructions given there. (At the time of writing, Samza is still a relatively young project and we'd rather not include direct information about the examples, which might be subject to change.)

For the remainder of the Samza examples in this chapter, we'll assume you are either using the Hello Samza package to provide the necessary components (ZooKeeper/Kafka/YARN) or you have integrated with other instances of each.

This example has three different Samza jobs that build upon each other. The first reads the Wikipedia edits, the second parses these records, and the third produces statistics based on the processed records. We'll build our own multistream workflow shortly.

One interesting point is the WikipediaFeed example here; it uses Wikipedia as its message source instead of Kafka. Specifically, it provides another implementation of the Samza SystemConsumer interface to allow Samza to read messages from an external system. As mentioned earlier, Samza is not tied to Kafka and, as this example shows, building a new stream implementation does not have to be against a generic infrastructure component; it can be quite job-specific, as the work required is not huge.

Tip

Note that the default configuration for both ZooKeeper and Kafka will write system data to directories under /tmp, which will be what you have set if you use Hello Samza. Be careful if you are using a Linux distribution that purges the contents of this directory on a reboot. If you plan to carry out any significant testing, then it's best to reconfigure these components to use less ephemeral locations. Change the relevant config files for each service; they are located in the service directory under the hello-samza/deploy directory.
Building a tweet parsing job

Let's build our own simple job implementation to show the full code required. We'll use parsing of the Twitter stream as the examples in this chapter and will later set up a pipe from our client consuming messages from the Twitter API into a Kafka topic. So, we need a Samza task that will read the stream of JSON messages, extract the actual tweet text, and write these to a topic of tweets.

Here is the main code from TwitterParseStreamTask.java, available at https://github.com/learninghadoop2/book-examples/blob/master/ch4/src/main/java/com/learninghadoop2/samza/tasks/TwitterParseStreamTask.java:
package com.learninghadoop2.samza.tasks;

// Samza and JSON Simple classes used by the task.
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;

public class TwitterParseStreamTask implements StreamTask {
    @Override
    public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
        TaskCoordinator coordinator) {
        String msg = ((String) envelope.getMessage());

        try {
            // Extract the tweet text from the raw JSON message ...
            JSONParser parser = new JSONParser();
            Object obj = parser.parse(msg);
            JSONObject jsonObj = (JSONObject) obj;
            String text = (String) jsonObj.get("text");

            // ... and send it to the tweets-parsed topic.
            collector.send(new OutgoingMessageEnvelope(
                new SystemStream("kafka", "tweets-parsed"), text));
        } catch (ParseException pe) {}
    }
}
The code is largely self-explanatory, but there are a few points of interest. We use JSON Simple (http://code.google.com/p/json-simple/) for our relatively straightforward JSON parsing requirements; we'll also use it later in this book.
The IncomingMessageEnvelope and its corresponding OutgoingMessageEnvelope are the main structures concerned with the actual message data. Along with the message payload, the envelope will also have data concerning the system, topic name, and (optionally) partition number in addition to other metadata. For our purposes, we just extract the message body from the incoming message and send the tweet text we extract from it via a new OutgoingMessageEnvelope to a topic called tweets-parsed within a system called kafka. Note the lowercase name; we'll explain this in a moment.
The type of message in the IncomingMessageEnvelope is java.lang.Object. Samza does not currently enforce a data model and hence does not have strongly-typed message bodies. Therefore, when extracting the message contents, an explicit cast is usually required. Since each task needs to know the expected message format of the streams it processes, this is not the oddity that it may appear to be.
The configuration file

There was nothing in the previous code that said where the messages came from; the framework just presents them to the StreamTask implementation, but obviously Samza needs to know from where to fetch messages. There is a configuration file for each job that defines this and more. The following can be found as twitter-parser.properties at https://github.com/learninghadoop2/book-examples/blob/master/ch4/src/main/resources/twitter-parser.properties:
# Job
job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
job.name=twitter-parser

# YARN
yarn.package.path=file:///home/gturkington/samza/build/distributions/learninghadoop2-0.1.tar.gz

# Task
task.class=com.learninghadoop2.samza.tasks.TwitterParseStreamTask
task.inputs=kafka.tweets
task.checkpoint.factory=org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory
task.checkpoint.system=kafka
# Normally, this would be 3, but we have only one broker.
task.checkpoint.replication.factor=1

# Serializers
serializers.registry.string.class=org.apache.samza.serializers.StringSerdeFactory

# Systems
systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
systems.kafka.streams.tweets.samza.msg.serde=string
systems.kafka.streams.tweets-parsed.samza.msg.serde=string
systems.kafka.consumer.zookeeper.connect=localhost:2181/
systems.kafka.consumer.auto.offset.reset=largest
systems.kafka.producer.metadata.broker.list=localhost:9092
systems.kafka.producer.producer.type=sync
systems.kafka.producer.batch.num.messages=1
This may look like a lot, but for now we'll just consider the high-level structure and some key settings. The job section sets YARN as the execution framework (as opposed to the local job runner class) and gives the job a name. If we were to run multiple copies of this same job, we would also give each copy a unique ID. The task section specifies the implementation class of our task and also the name of the streams for which it should receive messages. Serializers tell Samza how to read and write messages to and from the stream, and the systems section defines systems by name and associates implementation classes with them.

In our case, we define only one system called kafka and we refer to this system when sending our message in the preceding task. Note that this name is arbitrary and we could call it whatever we want. Obviously, for clarity it makes sense to call the Kafka system by the same name, but this is only a convention. In particular, sometimes you will need to give different names when dealing with multiple systems that are similar to each other, or sometimes even when treating the same system differently in different parts of a configuration file.

In this section, we will also specify the SerDe to be associated with the streams used by the task. Recall that Kafka messages have a body and an optional key that is used to determine to which partition the message is sent. Samza needs to know how to treat the contents of the keys and messages for these streams. Samza has support to treat these as raw bytes or specific types such as string, integer, and JSON, as mentioned earlier.

The rest of the configuration will be mostly unchanged from job to job, as it includes things such as the location of the ZooKeeper ensemble and Kafka clusters, and specifies how streams are to be checkpointed. Samza allows a wide variety of customizations and the full configuration options are detailed at http://samza.apache.org/learn/documentation/0.8/jobs/configuration-table.html.
Getting Twitter data into Kafka

Before we run the job, we do need to get some tweets into Kafka. Let's create a new Kafka topic called tweets to which we'll write the tweets.

To perform this and other Kafka-related operations, we'll use command-line tools located within the bin directory of the Kafka distribution. If you are running a job from within the stack created as part of the Hello Samza application, this will be deploy/kafka/bin.

kafka-topics.sh is a general-purpose tool that can be used to create, update, and describe topics. Most of its usages require arguments to specify the location of the local ZooKeeper cluster, where Kafka brokers store their details, and the name of the topic to be operated upon. To create a new topic, run the following command:
$ kafka-topics.sh --zookeeper localhost:2181 --create --topic tweets --partitions 1 --replication-factor 1
This creates a topic called tweets and explicitly sets its number of partitions and replication factor to 1. This is suitable if you are running Kafka within a local test VM, but clearly production deployments will have more partitions to scale out the load across multiple brokers and a replication factor of at least 2 to provide fault tolerance.
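On a multi-broker cluster, for example, a production topic might be created with the same tool along the following lines; the topic name and counts here are purely illustrative:

$ kafka-topics.sh --zookeeper localhost:2181 --create --topic tweets-prod --partitions 6 --replication-factor 2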
Use the list option of the kafka-topics.sh tool to simply show the topics in the system, or use describe to get more detailed information on specific topics:

$ kafka-topics.sh --zookeeper localhost:2181 --describe --topic tweets
Topic:tweets  PartitionCount:1  ReplicationFactor:1  Configs:
  Topic: tweets  Partition: 0  Leader: 0  Replicas: 0  Isr: 0
The multiple 0s are possibly confusing, as these are labels and not counts. Each broker in the system has an ID that usually starts from 0, as do the partitions within each topic. The preceding output is telling us that the topic called tweets has a single partition with ID 0, the broker acting as the leader for that partition is broker 0, and the set of in-sync replicas (ISR) for this partition is again only broker 0. This last value is particularly important when dealing with replication.

We'll use our Python utility from previous chapters to pull JSON tweets from the Twitter feed, and then use a Kafka CLI message producer to write the messages to a Kafka topic. This isn't a terribly efficient way of doing things, but it is suitable for illustration purposes. Assuming our Python script is in our home directory, run the following command from within the Kafka bin directory:

$ python ~/stream.py -j | ./kafka-console-producer.sh --broker-list localhost:9092 --topic tweets

This will run indefinitely, so be careful not to leave it running overnight on a test VM with small disk space; not that the authors have ever done such a thing.
Running a Samza job

To run a Samza job, we need our code to be packaged along with the Samza components required to execute it into a .tar.gz archive that will be read by the YARN NodeManager. This is the file referred to by the yarn.package.path property in the Samza task configuration file.
When using the single node Hello Samza, we can just use an absolute path on the filesystem, as seen in the previous configuration example. For jobs on larger YARN grids, the easiest way is to put the package onto HDFS and refer to it by an hdfs:// URI, or onto a web server (Samza provides a mechanism to allow YARN to read the file via HTTP).

Because Samza has multiple subcomponents and each subcomponent has its own dependencies, the full YARN package can end up containing a lot of JAR files (over 100!). In addition, you need to include your custom code for the Samza task as well as some scripts from within the Samza distribution. It's not something to be done by hand. In the sample code for this chapter, found at https://github.com/learninghadoop2/book-examples/tree/master/ch4, we have set up a sample structure to hold the code and config files and provided some automation via Gradle to build the necessary task archive and start the tasks.

When in the root of the Samza example code directory for this book, perform the following command to build a single file archive containing all the classes of this chapter compiled together and bundled with all the other required files:

$ ./gradlew targz

This Gradle task will not only create the necessary .tar.gz archive in the build/distributions directory, but will also store an expanded version of the archive under build/samza-package. This will be useful, as we will use Samza scripts stored in the bin directory of the archive to actually submit the task to YARN.

So now, let's run our job. We need to have file paths for two things: the Samza run-job.sh script to submit a job to YARN and the configuration file for our job. Since our created job package has all the compiled tasks bundled together, it is by using a different configuration file that specifies a specific task implementation class in the task.class property that we tell Samza which task to run. To actually run the task, we can run the following command from within the exploded project archive under build/samza-archives:
$ bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=config/twitter-parser.properties
For convenience, we added a Gradle task to run this job:

$ ./gradlew runTwitterParser

To see the output of the job, we'll use the Kafka CLI client to consume messages:

$ ./kafka-console-consumer.sh --zookeeper localhost:2181 --topic tweets-parsed

You should see a continuous stream of tweets appearing on the client.
Note

Note that we did not explicitly create the topic called tweets-parsed. Kafka can allow topics to be created dynamically when either a producer or consumer tries to use the topic. In many situations, though, the default partitioning and replication values may not be suitable, and explicit topic creation will be required to ensure these critical topic attributes are correctly defined.
Samza and HDFS

You may have noticed that we just mentioned HDFS for the first time in our discussion of Samza. Though Samza integrates tightly with YARN, it has no direct integration with HDFS. At a logical level, Samza's stream-implementing systems (such as Kafka) are providing the storage layer that is usually provided by HDFS for traditional Hadoop workloads. In the terminology of Samza's architecture, as described earlier, YARN is the execution layer in both models; whereas Samza uses a streaming layer for its source and destination data, frameworks such as MapReduce use HDFS. This is a good example of how YARN enables alternative computational models that not only process data very differently than batch-oriented MapReduce, but that can also use entirely different storage systems for their source data.
WindowingfunctionsIt’sfrequentlyusefultogeneratesomedatabasedonthemessagesreceivedonastreamoveracertaintimewindow.Anexampleofthismaybetorecordthetopnattributevaluesmeasuredeveryminute.SamzasupportsthisthroughtheWindowableTaskinterface,whichhasthefollowingsinglemethodtobeimplemented:
publicvoidwindow(MessageCollectorcollector,TaskCoordinator
coordinator);
ThisshouldlooksimilartotheprocessmethodintheStreamTaskinterface.However,becausethemethodiscalledonatimeschedule,itsinvocationisnotassociatedwithareceivedmessage.TheMessageCollectorandTaskCoordinatorparametersarestillthere,however,asmostwindowabletaskswillproduceoutputmessagesandmayalsowishtoperformsometaskmanagementactions.
Let’stakeourprevioustaskandaddawindowfunctionthatwilloutputthenumberoftweetsreceivedineachwindowedtimeperiod.ThisisthemainclassimplementationofTwitterStatisticsStreamTask.javafoundathttps://github.com/learninghadoop2/book-examples/blob/master/ch4/src/main/java/com/learninghadoop2/samza/tasks/TwitterStatisticsStreamTask.java
public class TwitterStatisticsStreamTask implements StreamTask, WindowableTask {
    private int tweets = 0;

    @Override
    public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
        TaskCoordinator coordinator) {
        tweets++;
    }

    @Override
    public void window(MessageCollector collector, TaskCoordinator coordinator) {
        collector.send(new OutgoingMessageEnvelope(
            new SystemStream("kafka", "tweet-stats"), "" + tweets));

        // Reset counts after windowing.
        tweets = 0;
    }
}
The TwitterStatisticsStreamTask class has a private member variable called tweets that is initialized to 0 and is incremented in every call to the process method. We therefore know that this variable will be incremented for each message passed to the task from the underlying stream implementation. Each Samza container has a single thread running in a loop that executes the process and window methods on all the tasks within the container. This means that we do not need to guard instance variables against concurrent modifications; only one method on each task within a container will be executing simultaneously.

In our window method, we send a message to a new topic we call tweet-stats and then reset the tweets variable. This is pretty straightforward and the only missing piece is how Samza will know when to call the window method. We specify this in the configuration file:
task.window.ms=5000
This tells Samza to call the window method on each task instance every 5 seconds. To run the window task, there is a Gradle task:

$ ./gradlew runTwitterStatistics

If we use kafka-console-consumer.sh to listen on the tweet-stats stream now, we will see the following output:

Number of tweets: 5012
Number of tweets: 5398
Note

Note that the term window in this context refers to Samza conceptually slicing the stream of messages into time ranges and providing a mechanism to perform processing at each range boundary. Samza does not directly provide an implementation of the other use of the term with regards to sliding windows, where a series of values is held and processed over time. However, the WindowableTask interface does provide the plumbing to implement such sliding windows.
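As a rough illustration of how that plumbing could be used (this is a sketch of ours, not code from the book's repository; the window count, class name, and output topic are arbitrary choices), a task could remember the counts of the last few fixed windows and emit a rolling total every time the window method fires:

import java.util.ArrayDeque;
import java.util.Deque;

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;
import org.apache.samza.task.WindowableTask;

public class RollingTweetCountStreamTask implements StreamTask, WindowableTask {
    // With task.window.ms=5000, 12 windows approximate the last minute.
    private static final int WINDOWS_TO_KEEP = 12;
    private final Deque<Integer> recentCounts = new ArrayDeque<Integer>();
    private int currentCount = 0;

    @Override
    public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
        TaskCoordinator coordinator) {
        currentCount++;
    }

    @Override
    public void window(MessageCollector collector, TaskCoordinator coordinator) {
        // Close the current fixed window and drop the oldest one if needed.
        recentCounts.addLast(currentCount);
        currentCount = 0;
        if (recentCounts.size() > WINDOWS_TO_KEEP) {
            recentCounts.removeFirst();
        }

        // Emit the total across the retained windows, i.e. a sliding count.
        int rollingTotal = 0;
        for (int count : recentCounts) {
            rollingTotal += count;
        }
        collector.send(new OutgoingMessageEnvelope(
            new SystemStream("kafka", "tweet-rolling-stats"), "" + rollingTotal));
    }
}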
Multijob workflows

As we saw with the Hello Samza examples, some of the real power of Samza comes from the composition of multiple jobs, and we'll use a text cleanup job to start demonstrating this capability.

In the following section, we'll perform tweet sentiment analysis by comparing tweets with a set of English positive and negative words. Simply applying this to the raw Twitter feed will have very patchy results, however, given how richly multilingual the Twitter stream is. We also need to consider things such as text cleanup, capitalization, frequent contractions, and so on. As anyone who has worked with any non-trivial dataset knows, the act of making the data fit for processing is usually where a large amount of effort (often the majority!) goes.

So before we try and detect tweet sentiments, let's do some simple text cleanup; in particular, we'll select only English language tweets and we will force their text to be lowercase before sending them to a new output stream.

Language detection is a difficult problem and for this we'll use a feature of the Apache Tika library (http://tika.apache.org). Tika provides a wide array of functionality to extract text from various sources and then to extract further information from that text. If using our Gradle scripts, the Tika dependency is already specified and will automatically be included in the generated job package. If building through another mechanism, you will need to download the Tika JAR file from the homepage and add it to your YARN job package. The following code can be found as TextCleanupStreamTask.java at https://github.com/learninghadoop2/book-examples/blob/master/ch4/src/main/java/com/learninghadoop2/samza/tasks/TextCleanupStreamTask.java:
public class TextCleanupStreamTask implements StreamTask {
    @Override
    public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
        TaskCoordinator coordinator) {
        String rawtext = ((String) envelope.getMessage());

        if ("en".equals(detectLanguage(rawtext))) {
            collector.send(new OutgoingMessageEnvelope(
                new SystemStream("kafka", "english-tweets"),
                rawtext.toLowerCase()));
        }
    }

    private String detectLanguage(String text) {
        LanguageIdentifier li = new LanguageIdentifier(text);
        return li.getLanguage();
    }
}
This task is quite straightforward thanks to the heavy lifting performed by Tika. We create a utility method that wraps the creation and use of a Tika LanguageIdentifier, and then we call this method on the message body of each incoming message in the process method. We only write to the output stream if the result of applying this utility method is "en", that is, the two-letter code for English.
The configuration file for this task is similar to that of our previous task, with the specific values for the task name and implementing class. It is in the repository as textcleanup.properties at https://github.com/learninghadoop2/book-examples/blob/master/ch4/src/main/resources/textcleanup.properties. We also need to specify the input stream:
task.inputs=kafka.tweets-parsed
This is important because we need this task to parse the tweet text that was extracted in the earlier task and avoid duplicating the JSON parsing logic that is best encapsulated in one place. We can run this task with the following command:

$ ./gradlew runTextCleanup

Now, we can run all three tasks together; TwitterParseStreamTask and TwitterStatisticsStreamTask will consume the raw tweet stream, while TextCleanupStreamTask will consume the output from TwitterParseStreamTask.

Data processing on streams
TweetsentimentanalysisWe’llnowimplementatasktoperformtweetsentimentanalysissimilartowhatwedidusingMapReduceinthepreviouschapter.ThiswillalsoshowusausefulmechanismofferedbySamza:bootstrapstreams.
BootstrapstreamsGenerallyspeaking,moststream-processingjobs(inSamzaoranotherframework)willstartprocessingmessagesthatarriveaftertheystartupandgenerallyignorehistoricalmessages.Becauseofitsconceptofreplayablestreams,Samzadoesn’thavethislimitation.
Inoursentimentanalysisjob,wehadtwosetsofreferenceterms:positiveandnegativewords.Thoughwe’venotshownitsofar,Samzacanconsumemessagesfrommultiplestreamsandtheunderlyingmachinerywillpollallnamedstreamsandprovidetheirmessages,oneatatime,totheprocessmethod.Wecanthereforecreatestreamsforthepositiveandnegativewordsandpushthedatasetsontothosestreams.Atfirstglance,wecouldplantorewindthesetwostreamstotheearliestpointandreadtweetsastheyarrive.TheproblemisthatSamzawon’tguaranteeorderingofmessagesfrommultiplestreams,andeventhoughthereisamechanismtogivestreamshigherpriority,wecan’tassumethatallnegativeandpositivewordswillbeprocessedbeforethefirsttweetarrives.
Forsuchtypesofscenarios,Samzahastheconceptofbootstrapstreams.Ifataskhasanybootstrapstreamsdefined,thenitwillreadthesestreamsfromtheearliestoffsetuntiltheyarefullyprocessed(technically,itwillreadthestreamstilltheygetcaughtup,sothatanynewwordssenttoeitherstreamwillbetreatedwithoutpriorityandwillarriveinterleavedbetweentweets).
We’llnowcreateanewjobcalledTweetSentimentStreamTaskthatreadstwobootstrapstreams,collectstheircontentsintoHashMaps,gathersrunningcountsforsentimenttrends,andusesawindowfunctiontooutputthisdataatintervals.Thiscodecanbefoundathttps://github.com/learninghadoop2/book-examples/blob/master/ch4/src/main/java/com/learninghadoop2/samza/tasks/TwitterSentimentStreamTask.java
public class TwitterSentimentStreamTask implements StreamTask, WindowableTask {
    private Set<String> positiveWords = new HashSet<String>();
    private Set<String> negativeWords = new HashSet<String>();
    private int tweets = 0;
    private int positiveTweets = 0;
    private int negativeTweets = 0;
    private int maxPositive = 0;
    private int maxNegative = 0;

    @Override
    public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
        TaskCoordinator coordinator) {
        if ("positive-words".equals(envelope.getSystemStreamPartition().getStream())) {
            positiveWords.add(((String) envelope.getMessage()));
        } else if ("negative-words".equals(envelope.getSystemStreamPartition().getStream())) {
            negativeWords.add(((String) envelope.getMessage()));
        } else if ("english-tweets".equals(envelope.getSystemStreamPartition().getStream())) {
            tweets++;
            int positive = 0;
            int negative = 0;
            String words = ((String) envelope.getMessage());

            for (String word : words.split(" ")) {
                if (positiveWords.contains(word)) {
                    positive++;
                } else if (negativeWords.contains(word)) {
                    negative++;
                }
            }

            if (positive > negative) {
                positiveTweets++;
            }
            if (negative > positive) {
                negativeTweets++;
            }
            if (positive > maxPositive) {
                maxPositive = positive;
            }
            if (negative > maxNegative) {
                maxNegative = negative;
            }
        }
    }

    @Override
    public void window(MessageCollector collector, TaskCoordinator coordinator) {
        String msg = String.format(
            "Tweets: %d Positive: %d Negative: %d MaxPositive: %d MaxNegative: %d",
            tweets, positiveTweets, negativeTweets, maxPositive, maxNegative);

        collector.send(new OutgoingMessageEnvelope(
            new SystemStream("kafka", "tweet-sentiment-stats"), msg));

        // Reset counts after windowing.
        tweets = 0;
        positiveTweets = 0;
        negativeTweets = 0;
        maxPositive = 0;
        maxNegative = 0;
    }
}
In this task, we add a number of private member variables that we will use to keep a running count of the number of overall tweets, how many were positive and negative, and the maximum positive and negative counts seen in a single tweet.

This task consumes from three Kafka topics. Even though we will configure two to be used as bootstrap streams, they are all still exactly the same type of Kafka topic from which messages are received; the only difference with bootstrap streams is that we tell Samza to use Kafka's rewinding capabilities to fully re-read each message in the stream. For the other stream of tweets, we just start reading new messages as they arrive.
As hinted earlier, if a task subscribes to multiple streams, the same process method will receive messages from each stream. That is why we use envelope.getSystemStreamPartition().getStream() to extract the stream name for each given message and then act accordingly. If the message is from either of the bootstrapped streams, we add its contents to the appropriate word set. We break a tweet message into its constituent words, test each word for positive or negative sentiment, and then update counts accordingly. As you can see, this task doesn't output the received tweets to another topic.

Since we don't transform the tweets themselves, there is no point in doing so; any other task that wishes to consume messages can just subscribe directly to the incoming tweets stream. However, a possible modification could be to write positive and negative sentiment tweets to dedicated streams for each.
The window method outputs a series of counts and then resets the variables (as it did before). Note that Samza does have support to directly expose metrics through JMX, which could possibly be a better fit for such simple windowing examples. However, we won't have space to cover that aspect of the project in this book.

To run this job, we need to modify the configuration file by setting the job and task names as usual, but we also need to specify multiple input streams now:
task.inputs=kafka.english-tweets,kafka.positive-words,kafka.negative-words
Then, we need to specify that two of our streams are bootstrap streams that should be read from the earliest offset. Specifically, we set three properties for the streams. We say they are to be bootstrapped, that is, fully read before other streams, and this is achieved by specifying that the offset on each stream needs to be reset to the oldest (first) position:
systems.kafka.streams.positive-words.samza.bootstrap=true
systems.kafka.streams.positive-words.samza.reset.offset=true
systems.kafka.streams.positive-words.samza.offset.default=oldest
systems.kafka.streams.negative-words.samza.bootstrap=true
systems.kafka.streams.negative-words.samza.reset.offset=true
systems.kafka.streams.negative-words.samza.offset.default=oldest
We can run this job with the following command:

$ ./gradlew runTwitterSentiment

After starting the job, look at the output of the messages on the tweet-sentiment-stats topic.

The sentiment detection job will bootstrap the positive and negative word streams before reading any of our newly detected lowercase English tweets.

With the sentiment detection job, we can now visualize our four collaborating jobs as shown in the following diagram:

Bootstrap streams and collaborating tasks
Tip

To correctly run the jobs, it may seem necessary to start the JSON parser job followed by the cleanup job before finally starting the sentiment job, but this is not the case. Any unread messages remain buffered in Kafka, so it doesn't matter in which order the jobs of a multi-job workflow are started. Of course, the sentiment job will output counts of 0 tweets until it starts receiving data, but nothing will break if a stream job starts before those it depends on.

Stateful tasks

The final aspect of Samza that we will explore is how it allows the tasks processing stream partitions to have persistent local state. In the previous example, we used private variables to keep a track of running totals, but sometimes it is useful for a task to have richer local state. An example could be the act of performing a logical join on two streams, where it is useful to build up a state model from one stream and compare this with the other.

Note

Note that Samza can utilize its concept of partitioned streams to greatly optimize the act of joining streams. If each stream to be joined uses the same partition key (for example, a user ID), then each task consuming these streams will receive all messages associated with each ID across all the streams.
Samza has another pluggable abstraction, similar to its notions of the framework that manages its jobs and the system that implements its streams: it defines an abstract key-value store that can have multiple concrete implementations. Samza uses existing open source projects for the on-disk implementations; it used LevelDB as of v0.7 and added RocksDB as of v0.8. There is also an in-memory store that does not persist the key-value data, but that may be useful in testing or potentially very specific production workloads.
Each task can write to this key-value store, and Samza manages its persistence to the local implementation. To support persistent state, the store is also modeled as a stream, and all writes to the store are also pushed into a stream. If a task fails, then on restart it can recover the state of its local key-value store by replaying the messages in the backing topic. An obvious concern here will be the number of messages that need to be replayed; however, when using Kafka, for example, it compacts messages with the same key so that only the latest update remains in the topic.
We’llmodifyourprevioustweetsentimentexampletoaddalifetimecountofthemaximumpositiveandnegativesentimentseeninanytweet.ThefollowingcodecanbefoundasTwitterStatefulSentimentStateTask.javaathttps://github.com/learninghadoop2/book-examples/blob/master/ch4/src/main/java/com/learninghadoop2/samza/tasks/TwitterStatefulSentimentStreamTask.javaNotethattheprocessmethodisthesameasTwitterSentimentStateTask.java,sowehaveomittedithereforspacereasons:
public class TwitterStatefulSentimentStreamTask implements StreamTask, WindowableTask, InitableTask {
    private Set<String> positiveWords = new HashSet<String>();
    private Set<String> negativeWords = new HashSet<String>();
    private int tweets = 0;
    private int positiveTweets = 0;
    private int negativeTweets = 0;
    private int maxPositive = 0;
    private int maxNegative = 0;
    private KeyValueStore<String, Integer> store;

    @SuppressWarnings("unchecked")
    @Override
    public void init(Config config, TaskContext context) {
        this.store = (KeyValueStore<String, Integer>) context.getStore("tweet-store");
    }

    @Override
    public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
        TaskCoordinator coordinator) {
        ...
    }

    @Override
    public void window(MessageCollector collector, TaskCoordinator coordinator) {
        Integer lifetimeMaxPositive = store.get("lifetimeMaxPositive");
        Integer lifetimeMaxNegative = store.get("lifetimeMaxNegative");

        if ((lifetimeMaxPositive == null) || (maxPositive > lifetimeMaxPositive)) {
            lifetimeMaxPositive = maxPositive;
            store.put("lifetimeMaxPositive", lifetimeMaxPositive);
        }

        if ((lifetimeMaxNegative == null) || (maxNegative > lifetimeMaxNegative)) {
            lifetimeMaxNegative = maxNegative;
            store.put("lifetimeMaxNegative", lifetimeMaxNegative);
        }

        String msg = String.format(
            "Tweets: %d Positive: %d Negative: %d MaxPositive: %d MaxNegative: %d LifetimeMaxPositive: %d LifetimeMaxNegative: %d",
            tweets, positiveTweets, negativeTweets, maxPositive, maxNegative,
            lifetimeMaxPositive, lifetimeMaxNegative);

        collector.send(new OutgoingMessageEnvelope(
            new SystemStream("kafka", "tweet-stateful-sentiment-stats"), msg));

        // Reset counts after windowing.
        tweets = 0;
        positiveTweets = 0;
        negativeTweets = 0;
        maxPositive = 0;
        maxNegative = 0;
    }
}
This class implements a new interface called InitableTask. This has a single method called init and is used when a task needs to perform some setup before it begins execution. We use the init() method here to retrieve an instance of the KeyValueStore interface and store it in a private member variable.
KeyValueStore, as the name suggests, provides a familiar put/get type interface. In this case, we specify that the keys are of the type String and the values are Integers. In our window method, we retrieve any previously stored values for the maximum positive and negative sentiment and, if the count in the current window is higher, update the store accordingly. Then, we just output the results of the window method as before.

As you can see, the user does not need to deal with the details of either the local or remote persistence of the KeyValueStore instance; this is all handled by Samza. The efficiency of the mechanism also makes it tractable for tasks to hold sizeable amounts of local state, which can be particularly valuable in cases such as long-running aggregations or stream joins.

The configuration file for the job can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch4/src/main/resources/twitter-stateful-sentiment.properties. It needs to have a few entries added, which are as follows:
stores.tweet-store.factory=org.apache.samza.storage.kv.KeyValueStorageEngineFactory
stores.tweet-store.changelog=kafka.twitter-stats-state
stores.tweet-store.key.serde=string
stores.tweet-store.msg.serde=integer
The first line specifies the implementation class for the store, the second line specifies the Kafka topic to be used for persistent state, and the last two lines specify the types of the store key and value.

To run this job, use the following command:

$ ./gradlew runTwitterStatefulSentiment

For convenience, the following command will start up four jobs: the JSON parser, the text cleanup, the statistics job, and the stateful sentiment job:

$ ./gradlew runTasks

Samza is a pure stream-processing system that provides pluggable implementations of its storage and execution layers. The most commonly used plugins are YARN and Kafka, and these demonstrate how Samza can integrate tightly with Hadoop YARN while using a completely different storage layer. Samza is still a relatively new project and the current features are only a subset of what is envisaged. It is recommended to consult its web page to get the latest information on its current status.
Summary

This chapter focused much more on what can be done on Hadoop 2, and in particular YARN, than the details of Hadoop internals. This is almost certainly a good thing, as it demonstrates that Hadoop is realizing its goal of becoming a much more flexible and generic data processing platform that is no longer tied to batch processing. In particular, we highlighted how Samza shows that the processing frameworks that can be implemented on YARN can innovate and enable functionality vastly different from that available in Hadoop 1.
In particular, we saw how Samza goes to the opposite end of the latency spectrum from batch processing and enables processing of individual messages as they arrive.

We also saw how Samza provides a callback mechanism that MapReduce developers will be familiar with, but uses it for a very different processing model, and we discussed the ways in which Samza utilizes YARN as its main execution framework and how it implements the model described in Chapter 3, Processing – MapReduce and Beyond.

In the next chapter, we will switch gears and explore Apache Spark. Though it has a very different data model than Samza, we'll see that it also has an extension that supports processing of real-time data streams, including the option of Kafka integration. However, both projects are so different that they are complementary more than in competition.
Chapter 5. Iterative Computation with Spark

In the previous chapter, we saw how Samza can enable near real-time stream data processing within Hadoop. This is quite a step away from the traditional batch processing model of MapReduce, but still keeps with the model of providing a well-defined interface against which business logic tasks can be implemented. In this chapter we will explore Apache Spark, which can be viewed both as a framework on which applications can be built as well as a processing framework in its own right. Not only are applications being built on Spark, but entire components within the Hadoop ecosystem are also being reimplemented to use Spark as their underlying processing framework. In particular, we will cover the following topics:

What Spark is and how its core system can run on YARN
The data model provided by Spark that enables hugely scalable and highly efficient data processing
The breadth of additional Spark components and related projects

It's important to note up front that although Spark has its own mechanism to process streaming data, this is but one part of what Spark has to offer. It's best to think of it as a much broader initiative.
Apache Spark

Apache Spark (https://spark.apache.org/) is a data processing framework based on a generalization of MapReduce. It was originally developed by the AMPLab at UC Berkeley (https://amplab.cs.berkeley.edu/). Like Tez, Spark acts as an execution engine that models data transformations as DAGs and strives to eliminate the I/O overhead of MapReduce in order to perform iterative computation at scale. While Tez's main goal was to provide a faster execution engine for MapReduce on Hadoop, Spark has been designed both as a standalone framework and an API for application development. The system is designed to perform general-purpose in-memory data processing, stream workflows, as well as interactive and iterative computation.
Spark is implemented in Scala, which is a statically typed programming language for the Java VM, and it exposes native programming interfaces for Java and Python in addition to Scala itself. Note that though Java code can call the Scala interface directly, there are some aspects of the type system that make such code pretty unwieldy, and hence we use the native Java API.

Scala ships with an interactive shell similar to that of Ruby and Python; this allows users to run Spark interactively from the interpreter to query any dataset.

The Scala interpreter operates by compiling a class for each line typed by the user, loading it into the JVM, and invoking a function on it. This class includes a singleton object that contains the variables or functions on that line and runs the line's code in an initialize method. In addition to its rich programming interfaces, Spark is becoming established as an execution engine, with popular tools of the Hadoop ecosystem (such as Pig and Hive) being ported to the framework.
ClustercomputingwithworkingsetsSpark’sarchitectureiscenteredaroundtheconceptofResilientDistributedDatasets(RDDs),whichisaread-onlycollectionofScalaobjectspartitionedacrossasetofmachinesthatcanpersistinmemory.Thisabstractionwasproposedina2012researchpaper,ResilientDistributedDatasets:AFault-TolerantAbstractionforIn-MemoryClusterComputing,whichcanbefoundathttps://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf.
A Spark application consists of a driver program that executes parallel operations on a cluster of workers (long-lived processes that can store data partitions in memory) by dispatching functions that run as parallel tasks, as shown in the following diagram:
Spark cluster architecture

Processes are coordinated via a SparkContext instance. SparkContext connects to a resource manager (such as YARN), requests executors on worker nodes, and sends tasks to be executed. Executors are responsible for running tasks and managing memory locally.
Spark allows you to share variables between tasks, or between tasks and the driver, using an abstraction known as shared variables. Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are additive variables such as counters and sums.
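As a brief sketch of both (assuming a spark-shell session like the one set up later in this chapter, so that sc and the sample.txt dataset already exist, and using an illustrative stop-word list), the two kinds of shared variables might be used as follows:

// Broadcast a small lookup collection once to every node rather than
// shipping it with each task.
val stopWords = sc.broadcast(Set("the", "a", "an"))

// An accumulator that tasks can only add to; the driver reads its value.
val skipped = sc.accumulator(0)

val words = sc.textFile("/tmp/sample.txt").flatMap(_.split(" "))
val kept = words.filter { w =>
  if (stopWords.value.contains(w)) { skipped += 1; false } else true
}

// Accumulator updates only happen once an action forces evaluation.
kept.count()
println("Words skipped: " + skipped.value)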
Resilient Distributed Datasets (RDDs)

An RDD is stored in memory, shared across machines, and is used in MapReduce-like parallel operations. Fault tolerance is achieved through the notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to be able to rebuild just that partition. An RDD can be built in four ways:

By reading data from a file stored in HDFS
By dividing (parallelizing) a Scala collection into a number of partitions that are sent to workers
By transforming an existing RDD using parallel operators
By changing the persistence of an existing RDD
Spark shines when RDDs can fit in memory and can be cached across operations. The API exposes methods to persist RDDs and allows for several persistence strategies and storage levels, allowing for spill to disk as well as space-efficient binary serialization.
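For example, in a spark-shell session an RDD might be persisted as in the following minimal sketch; MEMORY_AND_DISK is just one of the storage levels on offer:

import org.apache.spark.storage.StorageLevel

val tweets = sc.textFile("/tmp/sample.txt")

// Keep the RDD in memory, spilling partitions to disk if they do not fit;
// cache() would be shorthand for persist(StorageLevel.MEMORY_ONLY).
tweets.persist(StorageLevel.MEMORY_AND_DISK)

// The first action materializes and caches the data; later actions reuse it.
tweets.count()
tweets.count()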
Actions

Operations are invoked by passing functions to Spark. The system deals with variables and side effects according to the functional programming paradigm. Closures can refer to variables in the scope where they are created. Examples of actions are count (returns the number of elements in the dataset) and save (outputs the dataset to storage). Other parallel operations on RDDs include the following, a few of which are combined in a short example after this list:

map: applies a function to each element of the dataset
filter: selects elements from a dataset based on user-provided criteria
reduce: combines dataset elements using an associative function
collect: sends all elements of the dataset to the driver program
foreach: passes each element through a user-provided function
groupByKey: groups items together by a provided key
sortByKey: sorts items by key
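Continuing the spark-shell session and the sample.txt dataset used elsewhere in this chapter (the search term below is arbitrary), a minimal sketch combining some of these operations might look like this:

val lines = sc.textFile("/tmp/sample.txt")

// filter: keep only lines that mention "hadoop"
val hadoopLines = lines.filter(line => line.contains("hadoop"))

// map + reduce: total number of characters across the selected lines
val totalChars = hadoopLines.map(line => line.length).reduce((a, b) => a + b)

// collect: bring the (hopefully small) result set back to the driver
hadoopLines.collect().foreach(println)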
Deployment

Spark can run both in local mode, similar to a Hadoop single-node setup, or atop a resource manager. Currently supported resource managers include:

Spark Standalone Cluster Mode
YARN
Apache Mesos
Spark on YARN

An ad hoc consolidated JAR needs to be built in order to deploy Spark on YARN. Spark launches an instance of the standalone deployed cluster within the ResourceManager. Cloudera and MapR both ship with Spark on YARN as part of their software distribution. At the time of writing, Spark is available for Hortonworks' HDP as a technology preview (http://hortonworks.com/hadoop/spark/).
Spark on EC2

Spark comes with a deployment script, spark-ec2, located in the ec2 directory. This script automatically sets up Spark and HDFS on a cluster of EC2 instances. In order to launch a Spark cluster on the Amazon cloud, go to the ec2 directory and run the following command:

./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name>

Here, <keypair> is the name of your EC2 key pair, <key-file> is the private key file for the key pair, <num-slaves> is the number of slave nodes to be launched, and <cluster-name> is the name to be given to your cluster. See Chapter 1, Introduction, for more details regarding the setup of key pairs, and verify that the cluster scheduler is up and sees all the slaves by going to its web UI, the address of which will be printed once the script completes.
You can specify a path in S3 as the input through a URI of the form s3n://<bucket>/path. You will also need to set your Amazon security credentials, either by setting the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY before your program is executed, or through SparkContext.hadoopConfiguration.
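If you take the second approach, a minimal sketch (with placeholder credentials and an illustrative bucket name) would set the Hadoop properties used by the s3n filesystem directly on the context:

// Placeholder values; substitute your own AWS credentials.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY_ID")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY")

val s3Data = sc.textFile("s3n://mybucket/path/to/input")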
Getting started with Spark

Spark binaries and source code are available on the project website at http://spark.apache.org/. The examples in the following section have been tested using Spark 1.1.0 built from source on the Cloudera CDH 5.0 QuickStart VM.

Download and uncompress the gzip archive with the following commands:

$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0.tgz
$ tar xvzf spark-1.1.0.tgz
$ cd spark-1.1.0

Spark is built on Scala 2.10 and uses sbt (https://github.com/sbt/sbt) to build the source core and related examples:

$ ./sbt/sbt -Dhadoop.version=2.2.0 -Pyarn assembly

With the -Dhadoop.version=2.2.0 and -Pyarn options, we instruct sbt to build against Hadoop version 2.2.0 or higher and enable YARN support.
Start Spark in standalone mode with the following command:

$ ./sbin/start-all.sh

This command will launch a local master instance at spark://localhost:7077 as well as a worker node.

A web interface to the master node can be accessed at http://localhost:8080/ and can be seen in the following screenshot:

Master node web interface

Spark can run interactively through spark-shell, which is a modified version of the Scala shell. As a first example, we will implement a word count of the Twitter dataset we used in Chapter 3, Processing – MapReduce and Beyond, using the Scala API.

Start an interactive spark-shell session by running the following command:

$ ./bin/spark-shell

The shell instantiates a SparkContext object, sc, that is responsible for handling driver connections to workers. We will describe its semantics later in this chapter.
To make things a bit easier, let's create a sample textual dataset that contains one status update per line:

$ stream.py -t -n 1000 > sample.txt

Then, copy it to HDFS:

$ hdfs dfs -put sample.txt /tmp

Within spark-shell, we first create an RDD, file, from the sample data:

val file = sc.textFile("/tmp/sample.txt")
Then, we apply a series of transformations to count the word occurrences in the file. Note that the output of the transformation chain, counts, is still an RDD:

val counts = file.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey((m, n) => m + n)
This chain of transformations corresponds to the map and reduce phases that we are familiar with. In the map phase, we load each line of the dataset (flatMap), tokenize each tweet into a sequence of words, count the occurrence of each word (map), and emit (key, value) pairs. In the reduce phase, we group by key (word) and sum values (m, n) together to obtain word counts.

Finally, we print the first ten elements, counts.take(10), to the console:

counts.take(10).foreach(println)
Writing and running standalone applications

Spark allows standalone applications to be written using three APIs: Scala, Java, and Python.

Scala API

The first thing a Spark driver must do is to create a SparkContext object, which tells Spark how to access a cluster. After importing classes and implicit conversions into a program, as in the following:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
The SparkContext object can be created with the following constructor:

new SparkContext(master, appName, [sparkHome])

It can also be created through SparkContext(conf), which takes a SparkConf object.
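For instance, a minimal sketch of the SparkConf-based form (the master URL and application name here are arbitrary) looks like this:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://localhost:7077")
  .setAppName("LearningHadoop2Example")

val sc = new SparkContext(conf)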
The master parameter is a string that specifies a cluster URI to connect to (such as spark://localhost:7077), or the special local string to run in local mode. The appName term is the application name that will be shown in the cluster web UI.
It is not possible to override the default SparkContext instance, nor is it possible to create a new one, within a running Spark shell. It is, however, possible to specify which master the context connects to using the MASTER environment variable. For example, to run spark-shell on four cores, use the following:

$ MASTER=local[4] ./bin/spark-shell
JavaAPITheorg.apache.spark.api.javapackageexposesalltheSparkfeaturesavailableintheScalaversiontoJava.TheJavaAPIhasaJavaSparkContextclassthatreturnsinstancesoforg.apache.spark.api.java.JavaRDDandworkswithJavacollectionsinsteadofScalaones.
ThereareafewkeydifferencesbetweentheJavaandScalaAPIs:
Java7doesnotsupportanonymousorfirst-classfunctions;therefore,functionsmustbeimplementedbyextendingtheorg.apache.spark.api.java.function.Function,Function2,andotherclasses.AsofSparkversion1.0theAPIhasbeenrefactoredtosupportJava8lambdaexpressions.WithJava8,Functionclassescanbereplacedwithinlineexpressionsthatactasashorthandforanonymousfunctions.TheRDDmethodsreturnJavacollectionsKey-valuepairs,whicharesimplywrittenas(key,value)inScala,arerepresentedbythescala.Tuple2class.Tomaintaintypesafety,someRDDandfunctionmethods,suchasthosethathandlekeypairsanddoubles,areimplementedasspecializedclasses.
WordCount in Java
An example of WordCount in Java is included with the Spark source code distribution at examples/src/main/java/org/apache/spark/examples/JavaWordCount.java.
First of all, we create a context using the JavaSparkContext class:
JavaSparkContext sc = new JavaSparkContext(master, "JavaWordCount",
    System.getenv("SPARK_HOME"),
    JavaSparkContext.jarOfClass(JavaWordCount.class));

JavaRDD<String> data = sc.textFile(infile, 1);

JavaRDD<String> words = data.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public Iterable<String> call(String s) {
        return Arrays.asList(s.split(" "));
    }
});

JavaPairRDD<String, Integer> ones = words.map(new PairFunction<String, String, Integer>() {
    @Override
    public Tuple2<String, Integer> call(String s) {
        return new Tuple2<String, Integer>(s, 1);
    }
});

JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer i1, Integer i2) {
        return i1 + i2;
    }
});
We then build an RDD from the HDFS location infile. In the first step of the transformation chain, we tokenize each tweet in the dataset and return a list of words. We use an instance of JavaPairRDD<String, Integer> to count occurrences of each word. Finally, we reduce the RDD to a new JavaPairRDD<String, Integer> instance that contains a list of tuples, each representing a word and the number of times it was found in the dataset.
Python API
PySpark requires Python version 2.6 or higher. RDDs support the same methods as their Scala counterparts but take Python functions and return Python collection types. Lambda syntax (https://docs.python.org/2/reference/expressions.html) is used to pass functions to RDDs.
The word count in pyspark is relatively similar to its Scala counterpart:
tweets = sc.textFile("/tmp/sample.txt")
counts = tweets.flatMap(lambda tweet: tweet.split(' ')) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda m, n: m + n)
The lambda construct creates anonymous functions at runtime. lambda tweet: tweet.split(' ') creates a function that takes a string tweet as the input and outputs a list of strings split by whitespace. Spark's flatMap applies this function to each line of the tweets dataset. In the map phase, for each word token, lambda word: (word, 1) returns (word, 1) tuples that indicate the occurrence of a word in the dataset. In reduceByKey, we group these tuples by key, word, and sum the values together to obtain the word count with lambda m, n: m + n.
The Spark ecosystem
Apache Spark powers a number of tools, both as a library and as an execution engine.
Spark Streaming
Spark Streaming (found at http://spark.apache.org/docs/latest/streaming-programming-guide.html) is an extension of the Scala API that allows data ingestion from streams such as Kafka, Flume, Twitter, ZeroMQ, and TCP sockets.
Spark Streaming receives live input data streams and divides the data into batches (arbitrarily sized time windows), which are then processed by the Spark core engine to generate the final stream of results in batches. This high-level abstraction is called DStream (org.apache.spark.streaming.dstream.DStreams) and is implemented as a sequence of RDDs. DStream allows for two kinds of operations: transformations and output operations. Transformations work on one or more DStreams to create new DStreams. As part of a chain of transformations, data can be persisted either to a storage layer (HDFS) or an output channel. Spark Streaming allows for transformations over a sliding window of data. A window-based operation needs to specify two parameters: the window length, which is the duration of the window, and the slide interval, which is the interval at which the window-based operation is performed.
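As a minimal sketch of such a window-based operation, assuming an existing StreamingContext named ssc, the following counts words over the last 30 seconds of data and recomputes the result every 10 seconds; the host, port, and durations are illustrative values, not taken from the book's examples:
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

// Assumed input: a text stream read from a socket (placeholder host and port).
val lines = ssc.socketTextStream("localhost", 9999)
val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

// Window length of 30 seconds, slide interval of 10 seconds.
val windowedCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,
  Seconds(30),
  Seconds(10))
windowedCounts.print()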
GraphX
GraphX (found at https://spark.apache.org/docs/latest/graphx-programming-guide.html) is an API for graph computation that exposes a set of operators and algorithms for graph-oriented computation as well as an optimized variant of Pregel.
MLlib
MLlib (found at http://spark.apache.org/docs/latest/mllib-guide.html) provides common Machine Learning (ML) functionality, including tests and data generators. MLlib currently supports four types of algorithms: binary classification, regression, clustering, and collaborative filtering.
Spark SQL
Spark SQL is derived from Shark, which is an implementation of the Hive data warehousing system that uses Spark as an execution engine. We will discuss Hive in Chapter 7, Hadoop and SQL. With Spark SQL, it is possible to mix SQL-like queries with Scala or Python code. The result sets returned by a query are themselves RDDs, and as such, they can be manipulated by Spark core methods or MLlib and GraphX.
Processing data with Apache Spark
In this section, we will implement the examples from Chapter 3, Processing – MapReduce and Beyond, using the Scala API. We will consider both the batch and real-time processing scenarios. We will show you how Spark Streaming can be used to compute statistics on the live Twitter stream.
Building and running the examples
Scala source code for the examples can be found at https://github.com/learninghadoop2/book-examples/tree/master/ch5. We will be using sbt to build, manage, and execute code.
The build.sbt file controls the codebase metadata and software dependencies; these include the version of the Scala interpreter that Spark links to, a link to the Akka package repository used to resolve implicit dependencies, as well as dependencies on the Spark and Hadoop libraries.
The source code for all examples can be compiled with:
$ sbt compile
Or, it can be packaged into a JAR file with:
$ sbt package
A helper script to execute compiled classes can be generated with:
$ sbt add-start-script-tasks
$ sbt start-script
The helper can be invoked as follows:
$ target/start <class name> <master> <param1> … <param n>
Here, <master> is the URI of the master node. An interactive Scala session can be invoked via sbt with the following command:
$ sbt console
This console is not the same as the Spark interactive shell; rather, it is an alternative way to execute code. In order to run Spark code in it, we will need to manually import and instantiate a SparkContext object. All examples presented in this section expect a twitter4j.properties file containing the consumer key and secret and the access tokens to be present in the same directory where sbt or spark-shell is being invoked:
oauth.consumerKey=
oauth.consumerSecret=
oauth.accessToken=
oauth.accessTokenSecret=
Running the examples on YARN
To run the examples on a YARN grid, we first build a JAR file using:
$ sbt package
Then, we ship it to the resource manager using the spark-submit command:
./bin/spark-submit --class application.to.execute --master yarn-cluster
    [options] target/scala-2.10/chapter-4_2.10-1.0.jar [<param1> … <param n>]
Unlike the standalone mode, we don't need to specify a <master> URI. In YARN, the ResourceManager is selected from the cluster configuration. More information on launching Spark in YARN can be found at http://spark.apache.org/docs/latest/running-on-yarn.html.
Finding popular topics
Unlike the earlier examples with the Spark shell, we initialize a SparkContext as part of the program. We pass three arguments to the SparkContext constructor: the type of scheduler we want to use, a name for the application, and the directory where Spark is installed:
import org.apache.spark.SparkContext._
import org.apache.spark.SparkContext
import scala.util.matching.Regex

object HashtagCount {
    def main(args: Array[String]) {
        […]
        val sc = new SparkContext(master,
            "HashtagCount",
            System.getenv("SPARK_HOME"))

        val file = sc.textFile(inputFile)
        val pattern = new Regex("(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)")

        val counts = file.flatMap(line =>
                (pattern findAllIn line).toList)
            .map(word => (word, 1))
            .reduceByKey((m, n) => m + n)

        counts.saveAsTextFile(outputPath)
    }
}
We create an initial RDD from a dataset stored in HDFS, inputFile, and apply logic that is similar to the WordCount example.
For each tweet in the dataset, we extract the strings that match the hashtag pattern, (pattern findAllIn line).toList, and we count an occurrence of each string using the map operator. This generates a new RDD as a list of tuples in the form:
(word, 1), (word2, 1), (word, 1)
Finally, we combine together the elements of this RDD using the reduceByKey() method. We store the RDD generated by this last step back into HDFS with saveAsTextFile.
The code for the standalone driver can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch5/src/main/scala/com/learninghadoop2/spark/HashTagCount.scala.
Assigning a sentiment to topics
The source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch5/src/main/scala/com/learninghadoop2/spark/HashTagSentiment.scala and the code is as follows:
import org.apache.spark.SparkContext._
import org.apache.spark.SparkContext
import scala.util.matching.Regex
import scala.io.Source

object HashtagSentiment {
    def main(args: Array[String]) {
        […]
        val sc = new SparkContext(master,
            "HashtagSentiment",
            System.getenv("SPARK_HOME"))

        val file = sc.textFile(inputFile)

        val positive = Source.fromFile(positiveWordsPath)
            .getLines
            .filterNot(_ startsWith ";")
            .toSet
        val negative = Source.fromFile(negativeWordsPath)
            .getLines
            .filterNot(_ startsWith ";")
            .toSet

        val pattern = new Regex("(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)")

        val counts = file.flatMap(line => (pattern findAllIn line).map({
            word => (word, sentimentScore(line, positive, negative))
        })).reduceByKey({ (m, n) => (m._1 + n._1, m._2 + n._2) })

        val sentiment = counts.map({ hashtagScore =>
            val hashtag = hashtagScore._1
            val score = hashtagScore._2
            val normalizedScore = score._1 / score._2
            (hashtag, normalizedScore)
        })

        sentiment.saveAsTextFile(outputPath)
    }
}
First, we read a list of positive and negative words into Scala Set objects and filter out comments (strings beginning with ;).
When a hashtag is found, we call a function, sentimentScore, to estimate the sentiment expressed by that given text. This function implements the same logic we used in Chapter 3, Processing – MapReduce and Beyond, to estimate the sentiment of a tweet. It takes as input parameters the tweet's text, str, and the lists of positive and negative words as Set[String] objects. The return value is the difference between the positive and negative scores and the number of words in the tweet. In Spark, we represent this return value as a pair of Double and Integer objects:
def sentimentScore(str: String, positive: Set[String],
        negative: Set[String]): (Double, Int) = {
    var positiveScore = 0; var negativeScore = 0;
    str.split("""\s+""").foreach { w =>
        if (positive.contains(w)) { positiveScore += 1; }
        if (negative.contains(w)) { negativeScore += 1; }
    }
    ((positiveScore - negativeScore).toDouble,
        str.split("""\s+""").length)
}
We reduce the map output by aggregating by the key (the hashtag). In this phase, we emit a triple made of the hashtag, the sum of the differences between positive and negative scores, and the number of words per tweet. We use an additional map step to normalize the sentiment score and store the resulting list of hashtag and sentiment pairs to HDFS.
Data processing on streams
The previous example can be easily adjusted to work on a real-time stream of data. In this and the following section, we will use spark-streaming-twitter to perform some simple analytics tasks on the real-time firehose:
val window = 10
val ssc = new StreamingContext(master, "TwitterStreamEcho",
    Seconds(window), System.getenv("SPARK_HOME"))

val stream = TwitterUtils.createStream(ssc, auth)
val tweets = stream.map(tweet => (tweet.getText()))
tweets.print()

ssc.start()
ssc.awaitTermination()
}
The Scala source code for this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch5/src/main/scala/com/learninghadoop2/spark/TwitterStreamEcho.scala.
The two key packages we need to import are:
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter._
We initialize a new StreamingContext, ssc, on a local cluster using a 10-second window and use this context to create a DStream of tweets whose text we print.
Upon successful execution, Twitter's real-time firehose will be echoed in the terminal in batches of 10 seconds' worth of data. Notice that the computation will continue indefinitely but can be interrupted at any moment by pressing Ctrl + C.
The TwitterUtils object is a wrapper to the Twitter4j library (http://twitter4j.org/en/index.html) that ships with spark-streaming-twitter. A successful call to TwitterUtils.createStream will return a DStream of Twitter4j objects (TwitterInputDStream). In the preceding example, we used the getText() method to extract the tweet text; however, notice that the twitter4j object exposes the full Twitter API. For instance, we can print a stream of users with the following call:
val users = stream.map(tweet => (tweet.getUser().getId(),
    tweet.getUser().getName()))
users.print()
State management
Spark Streaming provides an ad hoc DStream to keep the state of each key in an RDD and the updateStateByKey method to mutate state.
We can reuse the code of the batch example to assign and update sentiment scores on streams:
object StreamingHashTagSentiment {
    […]
    val counts = text.flatMap(line => (pattern findAllIn line)
        .toList
        .map(word => (word, sentimentScore(line, positive, negative))))
        .reduceByKey({ (m, n) => (m._1 + n._1, m._2 + n._2) })

    val sentiment = counts.map({ hashtagScore =>
        val hashtag = hashtagScore._1
        val score = hashtagScore._2
        val normalizedScore = score._1 / score._2
        (hashtag, normalizedScore)
    })

    val stateDstream = sentiment
        .updateStateByKey[Double](updateFunc)

    stateDstream.print

    ssc.checkpoint("/tmp/checkpoint")
    ssc.start()
}
A state DStream is created by calling hashtagSentiment.updateStateByKey.
The updateFunc function implements the state mutation logic, which is a cumulative sum of sentiment scores over a period of time:
val updateFunc = (values: Seq[Double], state: Option[Double]) => {
    val currentScore = values.sum
    val previousScore = state.getOrElse(0.0)
    Some((currentScore + previousScore) * decayFactor)
}
decayFactor is a constant value, less than or equal to one, that we use to proportionally decrease the score over time. Intuitively, this will fade hashtags if they are not trending anymore. Spark Streaming writes intermediate data for stateful operations to HDFS, so we need to checkpoint the Streaming context with ssc.checkpoint.
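As a rough, hypothetical illustration of this fading behaviour (the starting score of 10.0 and the 0.9 factor below are made-up values, not taken from the example code), a hashtag that stops receiving mentions sees its state shrink on every batch:
val decayFactor = 0.9  // assumed example value, with 0 < decayFactor <= 1
val fadingScores = Iterator.iterate(10.0)(score => score * decayFactor).take(5).toList
println(fadingScores)  // List(10.0, 9.0, 8.1, 7.29, 6.561)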
The source code for this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch5/src/main/scala/com/learninghadoop2/spark/StreamingHashTagSentiment.scala.
Data analysis with Spark SQL
Spark SQL can ease the task of representing and manipulating structured data. We will load a JSON file into a temporary table and calculate simple statistics by blending SQL statements and Scala code:
object SparkJson {
    […]
    val file = sc.textFile(inputFile)

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext._

    val tweets = sqlContext.jsonFile(inFile)
    tweets.printSchema()

    // Register the SchemaRDD as a table
    tweets.registerTempTable("tweets")

    val text = sqlContext.sql("SELECT text, user.id FROM tweets")

    // Find the ten most popular hashtags
    val pattern = new Regex("(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)")
    val counts = text.flatMap(sqlRow => (pattern findAllIn
            sqlRow(0).toString).toList)
        .map(word => (word, 1))
        .reduceByKey((m, n) => m + n)

    counts.registerTempTable("hashtag_frequency")
    counts.printSchema

    val top10 = sqlContext.sql("SELECT _1 as hashtag, _2 as frequency FROM hashtag_frequency order by frequency desc limit 10")

    top10.foreach(println)
}
As with previous examples, we instantiate a SparkContext, sc, and load the dataset of JSON tweets. We then create an instance of org.apache.spark.sql.SQLContext based on the existing sc. The import sqlContext._ statement gives access to all functions and implicit conversions for sqlContext. We load the tweets' JSON dataset using sqlContext.jsonFile. The resulting tweets object is an instance of SchemaRDD, which is a new type of RDD introduced by Spark SQL. The SchemaRDD class is conceptually similar to a table in a relational database; it is composed of Row objects and a schema that describes the content in each Row. We can see the schema for a tweet by calling tweets.printSchema(). Before we're able to manipulate tweets with SQL statements, we need to register the SchemaRDD as a table in the SQLContext. We then extract the text field of a JSON tweet with an SQL query. Note that the output of sqlContext.sql is an RDD again. As such, we can manipulate it using Spark core methods. In our case, we reuse the logic from previous examples to extract hashtags and count their occurrences. Finally, we register the resulting RDD as a table, hashtag_frequency, and order hashtags by frequency with a SQL query.
The source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch5/src/main/scala/com/learninghadoop2/spark/SparkJson.scala.
SQL on data streams
At the time of writing, a SQLContext cannot be directly instantiated from a StreamingContext object. It is, however, possible to query a DStream by registering a SchemaRDD for each RDD in a given stream:
object SqlOnStream {
    […]
    val ssc = new StreamingContext(sc, Seconds(window))

    val gson = new Gson()
    val dstream = TwitterUtils
        .createStream(ssc, auth)
        .map(gson.toJson(_))

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext._

    dstream.foreachRDD(rdd => {
        rdd.foreach(println)
        val jsonRDD = sqlContext.jsonRDD(rdd)
        jsonRDD.registerTempTable("tweets")
        jsonRDD.printSchema
        sqlContext.sql(query)
    })

    ssc.checkpoint("/tmp/checkpoint")
    ssc.start()
    ssc.awaitTermination()
}
In order to get the two working together, we first create a SparkContext, sc, that we use to initialize both a StreamingContext, ssc, and a sqlContext. As in previous examples, we use TwitterUtils.createStream to create a DStream RDD, dstream. In this example, we use Google's Gson JSON parser to serialize each twitter4j object to a JSON string. To execute Spark SQL queries on the stream, we register a SchemaRDD, jsonRDD, within a dstream.foreachRDD loop. We use the sqlContext.jsonRDD method to create an RDD from a batch of JSON tweets. At this point, we can query the SchemaRDD using the sqlContext.sql method.
The source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch5/src/main/scala/com/learninghadoop2/spark/SqlOnStream.scala.
Comparing Samza and Spark Streaming
It is useful to compare Samza and Spark Streaming to help identify the areas in which each can best be applied. As has hopefully been made clear in this book, these technologies are very much complementary. Even though Spark Streaming might appear competitive with Samza, we feel both products offer compelling advantages in certain areas.
Samza shines when the input data is truly a stream of discrete events and you wish to build processing that operates on this type of input. Samza jobs running on Kafka can have latencies in the order of milliseconds. This provides a programming model focused on the individual messages and is the better fit for true near real-time processing applications. Though it lacks support to build topologies of collaborating jobs, its simple model allows similar constructs to be built and, perhaps more importantly, to be easily reasoned about. Its model of partitioning and scaling also focuses on simplicity, which again makes a Samza application very easy to understand and gives it a significant advantage when dealing with something as intrinsically complex as real-time data.
Spark is much more than a streaming product. Its support for building distributed data structures from existing datasets and using powerful primitives to manipulate these gives it the ability to process large datasets at a higher level of granularity. Other products in the Spark ecosystem build additional interfaces or abstractions upon this common batch processing core. This is very much a different focus to the message stream model of Samza.
This batch model is also demonstrated when we look at Spark Streaming; instead of a per-message processing model, it slices the message stream into a series of RDDs. With a fast execution engine, this means latencies as low as 1 second (http://www.cs.berkeley.edu/~matei/papers/2012/hotcloud_spark_streaming.pdf). For workloads that wish to analyze the stream in such a way, this will be a better fit than Samza's per-message model, which requires additional logic to provide such windowing.
Summary
This chapter explored Spark and showed you how it adds iterative processing as a new rich framework upon which applications can be built atop YARN. In particular, we highlighted:
The distributed data-structure-based processing model of Spark and how it allows very efficient in-memory data processing
The broader Spark ecosystem and how multiple additional projects are built atop it to specialize the computational model even further
In the next chapter, we will explore Apache Pig and its programming language, Pig Latin. We will see how this tool can greatly simplify software development for Hadoop by abstracting away some of the MapReduce and Spark complexity.
Chapter 6. Data Analysis with Apache Pig
In the previous chapters, we explored a number of APIs for data processing. MapReduce, Spark, Tez and Samza are rather low-level, and writing non-trivial business logic with them often requires significant Java development. Moreover, different users will have different needs. It might be impractical for an analyst to write MapReduce code or build a DAG of inputs and outputs to answer some simple queries. At the same time, a software engineer or a researcher might want to prototype ideas and algorithms using high-level abstractions before jumping into low-level implementation details.
In this chapter and the following one, we will explore some tools that provide a way to process data on HDFS using higher-level abstractions. In this chapter we will explore Apache Pig, and, in particular, we will cover the following topics:
What Apache Pig is and the dataflow model it provides
Pig Latin's data types and functions
How Pig can be easily enhanced using custom user code
How we can use Pig to analyze the Twitter stream
An overview of Pig
Historically, the Pig toolkit consisted of a compiler that generated MapReduce programs, bundled their dependencies, and executed them on Hadoop. Pig jobs are written in a language called Pig Latin and can be executed in both interactive and batch fashions. Furthermore, Pig Latin can be extended using User Defined Functions (UDFs) written in Java, Python, Ruby, Groovy, or JavaScript.
Pig use cases include the following:
Data processing
Ad hoc analytical queries
Rapid prototyping of algorithms
Extract Transform Load pipelines
Following a trend we have seen in previous chapters, Pig is moving towards a general-purpose computing architecture. As of version 0.13, the ExecutionEngine interface (org.apache.pig.backend.executionengine) acts as a bridge between the frontend and the backend of Pig, allowing Pig Latin scripts to be compiled and executed on frameworks other than MapReduce. At the time of writing, version 0.13 ships with MRExecutionEngine (org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRExecutionEngine), and work on a low-latency backend based on Tez (org.apache.pig.backend.hadoop.executionengine.tez.*) is expected to be included in version 0.14 (see https://issues.apache.org/jira/browse/PIG-3446). Work on integrating Spark is currently in progress in the development branch (see https://issues.apache.org/jira/browse/PIG-4059).
Pig 0.13 comes with a number of performance enhancements for the MapReduce backend, in particular two features to reduce the latency of small jobs: direct HDFS access (https://issues.apache.org/jira/browse/PIG-3642) and auto local mode (https://issues.apache.org/jira/browse/PIG-3463). Direct HDFS access, the opt.fetch property, is turned on by default. When doing a DUMP in a simple (map-only) script that contains only LIMIT, FILTER, UNION, STREAM, or FOREACH operators, input data is fetched from HDFS, and the query is executed directly in Pig, bypassing MapReduce. With auto local mode, the pig.auto.local.enabled property, Pig will run a query in the Hadoop local mode when the data size is smaller than pig.auto.local.input.maxbytes. Auto local mode is off by default.
Pig will launch MapReduce jobs if both modes are off or if the query is not eligible for either. If both modes are on, Pig will check whether the query is eligible for direct access and, if not, fall back to auto local mode. Failing that, it will execute the query on MapReduce.
Getting started
We will use the stream.py script options to extract JSON data and retrieve a specific number of tweets; we can run this with a command such as the following:
$ python stream.py -j -n 10000 > tweets.json
The tweets.json file will contain one JSON string on each line representing a tweet.
Remember that the Twitter API credentials need to be made available as environment variables or hardcoded in the script itself.
Running Pig
Pig is a tool that translates statements written in Pig Latin and executes them, either on a single machine in standalone mode or on a full Hadoop cluster when in distributed mode. Even in the latter case, Pig's role is to translate Pig Latin statements into MapReduce jobs, and therefore it doesn't require the installation of additional services or daemons. It is used as a command-line tool with its associated libraries.
Cloudera CDH ships with Apache Pig version 0.12. Alternatively, the Pig source code and binary distributions can be obtained at https://pig.apache.org/releases.html.
As can be expected, the MapReduce mode requires access to a Hadoop cluster and an HDFS installation. MapReduce mode is the default mode executed when running the pig command at the command-line prompt. Scripts can be executed with the following command:
$ pig -f <script>
Parameters can be passed via the command line using -param <param>=<val>, as follows:
$ pig -param input=tweets.txt
Parameters can also be specified in a param file that can be passed to Pig using the -param_file <file> option. Multiple files can be specified. If a parameter is present multiple times in the file, the last value will be used and a warning will be displayed. A parameter file contains one parameter per line. Empty lines and comments (specified by starting a line with #) are allowed. Within a Pig script, parameters are in the form $<parameter>. The default value can be assigned using the %default statement: %default input 'tweets.json'. The %default command will not work within a Grunt session; we'll discuss Grunt in the next section.
In local mode, all files are installed and run using the local host and filesystem. Specify local mode using the -x flag:
$ pig -x local
In both execution modes, Pig programs can be run either in an interactive shell or in batch mode.
Grunt – the Pig interactive shell
Pig can run in an interactive mode using the Grunt shell, which is invoked when we use the pig command at the terminal prompt. In the rest of this chapter, we will assume that examples are executed within a Grunt session. Other than executing Pig Latin statements, Grunt offers a number of utilities and access to shell commands:
fs: allows users to manipulate Hadoop filesystem objects and has the same semantics as the Hadoop CLI
sh: executes commands via the operating system shell
exec: launches a Pig script within an interactive Grunt session
kill: kills a MapReduce job
help: prints a list of all available commands
Elastic MapReduce
Pig scripts can be executed on EMR by creating a cluster with --applications Name=Pig,Args=--version,<version>, as follows:
$ aws emr create-cluster \
    --name "Pig cluster" \
    --ami-version <ami version> \
    --instance-type <EC2 instance> \
    --instance-count <number of nodes> \
    --applications Name=Pig,Args=--version,<version> \
    --log-uri <S3 bucket> \
    --steps Type=PIG,\
Name="Pig script",\
Args=[-f,s3://<script location>,\
-p,input=<input param>,\
-p,output=<output param>]
The preceding command will provision a new EMR cluster and execute s3://<script location>. Notice that the scripts to be executed and the input (-p input) and output (-p output) paths are expected to be located on S3.
As an alternative to creating a new EMR cluster, it is possible to add Pig steps to an already-instantiated EMR cluster using the following command:
$ aws emr add-steps \
    --cluster-id <cluster id> \
    --steps Type=PIG,\
Name="Other Pig script",\
Args=[-f,s3://<script location>,\
-p,input=<input param>,\
-p,output=<output param>]
In the preceding command, <cluster id> is the ID of the instantiated cluster.
It is also possible to ssh into the master node and run Pig Latin statements within a Grunt session with the following command:
$ aws emr ssh --cluster-id <cluster id> --key-pair-file <key pair>
Fundamentals of Apache Pig
The primary interface to program Apache Pig is Pig Latin, a procedural language that implements ideas of the dataflow paradigm.
Pig Latin programs are generally organized as follows:
A LOAD statement reads data from HDFS
A series of statements aggregates and manipulates data
A STORE statement writes output to the filesystem
Alternatively, a DUMP statement displays the output to the terminal
The following example shows a sequence of statements that outputs the top 10 hashtags ordered by frequency, extracted from the dataset of tweets:
tweets = LOAD 'tweets.json'
    USING JsonLoader('created_at:chararray,
        id:long,
        id_str:chararray,
        text:chararray');

hashtags = FOREACH tweets {
    GENERATE FLATTEN(
        REGEX_EXTRACT(
            text,
            '(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)', 1)
    ) as tag;
}

hashtags_grpd = GROUP hashtags BY tag;

hashtags_count = FOREACH hashtags_grpd {
    GENERATE
        group,
        COUNT(hashtags) as occurrencies;
}

hashtags_count_sorted = ORDER hashtags_count BY occurrencies DESC;
top_10_hashtags = LIMIT hashtags_count_sorted 10;

DUMP top_10_hashtags;
First, we load the tweets.json dataset from HDFS, de-serialize the JSON file, and map it to a four-column schema that contains a tweet's creation time, its ID in numerical and string form, and the text. For each tweet, we extract hashtags from its text using a regular expression. We aggregate on hashtag, count the number of occurrences, and order by frequency. Finally, we limit the ordered records to the top 10 most frequent hashtags.
A series of statements like the previous one is picked up by the Pig compiler, transformed into MapReduce jobs, and executed on a Hadoop cluster. The planner and optimizer will resolve dependencies on input and output relations and parallelize the execution of statements wherever possible.
Statements are the building blocks of processing data with Pig. They take a relation as input and produce another relation as output. In Pig Latin terms, a relation can be defined as a bag of tuples, two data types we will use throughout the remainder of this chapter.
Users experienced with SQL and the relational data model might find Pig Latin's syntax somewhat familiar. While there are indeed similarities in the syntax itself, Pig Latin implements an entirely different computational model. Pig Latin is procedural; it specifies the actual data transforms to be performed, whereas SQL is declarative and describes the nature of the problem but does not specify the actual runtime processing. In terms of organizing data, a relation can be thought of as a table in a relational database, where tuples in a bag correspond to the rows in a table. Relations are unordered and therefore easily parallelizable, and they are less constrained than relational tables. Pig relations can contain tuples with different numbers of fields, and those with the same field count can have fields of different types in corresponding positions.
A key difference between SQL and the dataflow model adopted by Pig Latin lies in how splits in a data pipeline are managed. In the relational world, a declarative language such as SQL implements and executes queries that will generate a single result. The dataflow model sees data transformations as a graph where input and output are nodes connected by an operator. For instance, intermediate steps of a query might require the input to be grouped by a number of keys and result in multiple outputs (GROUP BY). Pig has built-in mechanisms to manage multiple data flows in such a graph by executing operators as soon as inputs are readily available and potentially applying different operators to each flow. For instance, Pig's implementation of the GROUP BY operator uses the parallel feature (http://pig.apache.org/docs/r0.12.0/perf.html#parallel) to allow a user to increase the number of reduce tasks for the MapReduce jobs generated and hence increase concurrency. An additional side effect of this property is that when multiple operators can be executed in parallel in the same program, Pig does so (more details on Pig's multi-query implementation can be found at http://pig.apache.org/docs/r0.12.0/perf.html#multi-query-execution). Another consequence of Pig Latin's approach to computation is that it allows the persistence of data at any point in the pipeline. It allows the developer to select specific operator implementations and execution plans when necessary, effectively overriding the optimizer.
Pig Latin allows and even encourages developers to insert their own code almost anywhere in a pipeline by means of User Defined Functions (UDFs) as well as by utilizing Hadoop streaming. UDFs allow users to specify custom business logic on how data is loaded, how it is stored, and how it is processed, whereas streaming allows users to launch executables at any point in the dataflow.
Programming Pig
Pig Latin comes with a number of built-in functions (the eval, load/store, math, string, bag, and tuple functions) and a number of scalar and complex data types. Additionally, Pig allows function and data-type extension by means of UDFs and dynamic invocation of Java methods.
Pig data types
Pig supports the following scalar data types:
int: a signed 32-bit integer
long: a signed 64-bit integer
float: a 32-bit floating point
double: a 64-bit floating point
chararray: a character array (string) in Unicode UTF-8 format
bytearray: a byte array (blob)
boolean: a boolean
datetime: a datetime
biginteger: a Java BigInteger
bigdecimal: a Java BigDecimal
Pig supports the following complex data types:
map: an associative array enclosed by [], with the key and value separated by # and items separated by ,
tuple: an ordered list of data, where elements can be of any scalar or complex type, enclosed by (), with items separated by ,
bag: an unordered collection of tuples enclosed by {} and separated by ,
By default, Pig treats data as untyped. The user can declare the types of data at load time or manually cast it when necessary. If a data type is not declared, but a script implicitly treats a value as a certain type, Pig will assume it is of that type and cast it accordingly. The fields of a bag or tuple can be referred to by the name tuple.field or by the position $<index>. Pig counts from 0 and hence the first element will be denoted as $0.
Pig functions
Built-in functions are implemented in Java, and they try to follow standard Java conventions. There are however a number of differences to keep in mind, which are as follows:
Function names are case sensitive and uppercase
If the result value is null, empty, or not a number (NaN), Pig returns null
If Pig is unable to process the expression, it returns an exception
A list of all built-in functions can be found at http://pig.apache.org/docs/r0.12.0/func.html.
Load/store
Load/store functions determine how data goes into and comes out of Pig. The PigStorage, TextLoader, and BinStorage functions can be used to read and write UTF-8 delimited, unstructured text, and binary data respectively. Support for compression is determined by the load/store function. The PigStorage and TextLoader functions support gzip and bzip2 compression for both read (load) and write (store). The BinStorage function does not support compression.
As of version 0.12, Pig includes built-in support for loading and storing Avro and JSON data via AvroStorage (load/store), JsonStorage (store), and JsonLoader (load). At the time of writing, JSON support is still somewhat limited. In particular, Pig expects a schema for the data to be provided as an argument to JsonLoader/JsonStorage, or it assumes that .pig_schema (produced by JsonStorage) is present in the directory containing the input data. In practice, this makes it difficult to work with JSON dumps not generated by Pig itself.
As seen in our following example, we can load the JSON dataset with JsonLoader:
tweets = LOAD 'tweets.json' USING JsonLoader(
    'created_at:chararray,
    id:long,
    id_str:chararray,
    text:chararray,
    source:chararray');
We provide a schema so that the first five elements of a JSON object, created_at, id, id_str, text, and source, are mapped. We can look at the schema of tweets by using describe tweets, which returns the following:
tweets: {created_at: chararray, id: long, id_str: chararray, text: chararray, source: chararray}
Eval
Eval functions implement a set of operations to be applied on an expression that returns a bag or map data type. The expression result is evaluated within the function context.
AVG(expression): computes the average of the numeric values in a single-column bag
COUNT(expression): counts all elements with non-null values in the first position in a bag
COUNT_STAR(expression): counts all elements in a bag
IsEmpty(expression): checks whether a bag or map is empty
MAX(expression), MIN(expression), and SUM(expression): return the max, min, or the sum of elements in a bag
TOKENIZE(expression): splits a string and outputs a bag of words
The tuple, bag, and map functions
These functions allow conversion from and to the bag, tuple, and map types. They include the following:
TOTUPLE(expression), TOMAP(expression), and TOBAG(expression): These coerce expression to a tuple, map, or bag
TOP(n, column, relation): This returns the top n tuples from a bag of tuples
The math, string, and datetime functions
Pig exposes a number of functions provided by the java.lang.Math, java.lang.String, java.util.Date, and Joda-Time DateTime classes (found at http://www.joda.org/joda-time/).
Dynamic invokers
Dynamic invokers allow the execution of Java functions without having to wrap them in a UDF. They can be used for any static function that:
accepts no arguments or accepts a combination of string, int, long, double, float, or array with these same types
returns a string, int, long, double, or float value
Only primitives can be used for numbers, and Java boxed classes (such as Integer) cannot be used as arguments. Depending on the return type, a specific kind of invoker must be used: InvokeForString, InvokeForInt, InvokeForLong, InvokeForDouble, or InvokeForFloat. More details regarding dynamic invokers can be found at http://pig.apache.org/docs/r0.12.0/func.html#dynamic-invokers.
Macros
As of version 0.9, Pig Latin's preprocessor supports macro expansion. Macros are defined using the DEFINE statement:
DEFINE macro_name(param1, ..., paramN) RETURNS output_bag {
    pig_latin_statements
};
The macro is expanded inline, and its parameters are referenced in the Pig Latin block within {}.
The macro output relation is given in the RETURNS statement (output_bag). RETURNS void is used for a macro with no output relation.
We can define a macro to count the number of rows in a relation, as follows:
DEFINE count_rows(X) RETURNS cnt {
    grpd = group $X all;
    $cnt = foreach grpd generate COUNT($X);
};
We can use it in a Pig script or Grunt session to count the number of tweets:
tweets_count = count_rows(tweets);
DUMP tweets_count;
Macros allow us to make scripts modular by housing code in separate files and importing them where needed. For example, we can save count_rows in a file called count_rows.macro and later on import it with the command import 'count_rows.macro'.
Macros have a number of limitations; in particular, only Pig Latin statements are allowed inside a macro. It is not possible to use REGISTER statements and shell commands, UDFs are not allowed, and parameter substitution inside the macro is not supported.
Working with data
Pig Latin provides a number of relational operators to combine functions and apply transformations on data. Typical operations in a data pipeline consist of filtering relations (FILTER), aggregating inputs based on keys (GROUP), generating transformations based on columns of data (FOREACH), and joining relations (JOIN) based on shared keys.
In the following sections, we will illustrate such operators on a dataset of tweets generated by loading JSON data.
Filtering
The FILTER operator selects tuples from a relation based on an expression, as follows:
relation = FILTER relation BY expression;
We can use this operator to filter tweets whose text matches the hashtag regular expression, as follows:
tweets_with_tag = FILTER tweets BY
    (text
        MATCHES '(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)'
    );
Aggregation
The GROUP operator groups together data in one or more relations based on an expression or a key, as follows:
relation = GROUP relation BY expression;
We can group tweets by the source field into a new relation, grpd, as follows:
grpd = GROUP tweets BY source;
It is possible to group on multiple dimensions by specifying a tuple as the key, as follows:
grpd = GROUP tweets BY (created_at, source);
The result of a GROUP operation is a relation that includes one tuple per unique value of the group expression. This tuple contains two fields. The first field is named group and is of the same type as the group key. The second field takes the name of the original relation and is of the type bag. The names of both fields are generated by the system.
Using the ALL keyword, Pig will aggregate across the whole relation. The GROUP tweets ALL scheme will aggregate all tuples in the same group.
As previously mentioned, Pig allows explicit handling of the concurrency level of the GROUP operator using the PARALLEL operator:
grpd = GROUP tweets BY (created_at, id) PARALLEL 10;
In the preceding example, the MapReduce job generated by the compiler will run 10 concurrent reduce tasks. Pig has a heuristic estimate of how many reducers to use.
Another way of globally enforcing the number of reduce tasks is to use the set default_parallel <n> command.
Foreach
The FOREACH operator applies functions on columns, as follows:
relation = FOREACH relation GENERATE transformation;
The output of FOREACH depends on the transformation applied.
We can use the operator to project the text of all tweets that contain a hashtag, as follows:
t = FOREACH tweets_with_tag GENERATE text;
We can also apply a function to the projected columns. For instance, we can use the TOKENIZE function to split each tweet into words, as follows:
t = FOREACH tweets_with_tag GENERATE FLATTEN(TOKENIZE(text)) as word;
The FLATTEN modifier further un-nests the bag generated by TOKENIZE into a tuple of words.
Join
The JOIN operator performs an inner join of two or more relations based on common field values. Its syntax is as follows:
relation = JOIN relation1 BY expression1, relation2 BY expression2;
We can use a join operation to detect tweets that contain positive words, as follows:
positive = LOAD 'positive-words.txt' USING PigStorage() as (w:chararray);
Filter out the comments, as follows:
positive_words = FILTER positive BY NOT w MATCHES '^;.*';
positive_words is a bag of tuples, each containing a word. We then tokenize the tweets' text and create a new bag of (id_str, word) tuples, as follows:
id_words = FOREACH tweets {
    GENERATE
        id_str,
        FLATTEN(TOKENIZE(text)) as word;
}
We join the two relations on the word field and obtain a relation of all tweets that contain one or more positive words, as follows:
positive_tweets = JOIN positive_words BY w, id_words BY word;
In this statement, we join positive_words and id_words on the condition that id_words.word is a positive word. The positive_tweets operator is a bag in the form of {w: chararray, id_str: chararray, word: chararray} that contains all elements of positive_words and id_words that match the join condition.
We can combine the GROUP and FOREACH operators to calculate the number of positive words per tweet (with at least one positive word). First, we group the relation of positive tweets by the tweet ID, and then we count the number of occurrences of each ID in the relation, as follows:
grpd = GROUP positive_tweets BY id_str;
score = FOREACH grpd GENERATE FLATTEN(group), COUNT(positive_tweets);
The JOIN operator can make use of the parallelize feature as well, as follows:
positive_tweets = JOIN positive_words BY w, id_words BY word PARALLEL 10;
The preceding command will execute the join with 10 reducer tasks.
It is possible to specify the operator's behavior with the USING keyword followed by the ID of a specialized join. More details can be found at http://pig.apache.org/docs/r0.12.0/perf.html#specialized-joins.
Extending Pig (UDFs)
Functions can be a part of almost every operator in Pig. There are two main differences between UDFs and built-in functions. First, UDFs need to be registered using the REGISTER keyword in order to make them available to Pig. Secondly, they need to be qualified when used. Pig UDFs can currently be implemented in Java, Python, Ruby, JavaScript, and Groovy. The most extensive support is provided for Java functions, which allow you to customize all parts of the processing, including data load/store, transformation, and aggregation. Additionally, Java functions are also more efficient because they are implemented in the same language as Pig and because additional interfaces are supported, such as the Algebraic and Accumulator interfaces. On the other hand, the Ruby and Python APIs allow more rapid prototyping.
The integration of UDFs with the Pig environment is mainly managed by the following two statements, REGISTER and DEFINE:
REGISTER registers a JAR file so that the UDFs in the file can be used, as follows:
REGISTER 'piggybank.jar'
DEFINE creates an alias to a function or a streaming command, as follows:
DEFINE MyFunction my.package.uri.MyFunction
Version 0.12 of Pig introduced streaming UDFs as a mechanism for writing functions using languages with no JVM implementation.
Contributed UDFs
Pig's codebase hosts a UDF repository called Piggybank. Other popular contributed repositories are Twitter's Elephant Bird (found at https://github.com/kevinweil/elephant-bird/) and Apache DataFu (found at http://datafu.incubator.apache.org/).
Piggybank
Piggybank is a place for Pig users to share their functions. Shared code is located in the official Pig Subversion repository found at http://svn.apache.org/viewvc/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/. The API documentation can be found at http://pig.apache.org/docs/r0.12.0/api/ under the contrib section. Piggybank UDFs can be obtained by checking out and compiling the sources from the Subversion repository or by using the JAR file that ships with binary releases of Pig. In Cloudera CDH, piggybank.jar is available at /opt/cloudera/parcels/CDH/lib/pig/piggybank.jar.
Elephant Bird
Elephant Bird is an open source library of all things Hadoop used in production at Twitter. This library contains a number of serialization tools, custom input and output formats, writables, Pig load/store functions, and more miscellanea.
Elephant Bird ships with an extremely flexible JSON loader function, which, at the time of writing, is the go-to resource for manipulating JSON data in Pig.
Apache DataFu
Apache DataFu Pig collects a number of analytical functions developed and contributed by LinkedIn. These include statistical and estimation functions, bag and set operations, sampling, hashing, and link analysis.
Analyzing the Twitter stream
In the following examples, we will use the implementation of JsonLoader provided by Elephant Bird to load and manipulate JSON data. We will use Pig to explore tweet metadata and analyze trends in the dataset. Finally, we will model the interaction between users as a graph and use Apache DataFu to analyze this social network.
Prerequisites
Download the elephant-bird-pig (http://central.maven.org/maven2/com/twitter/elephantbird/elephant-bird-pig/4.5/elephant-bird-pig-4.5.jar), elephant-bird-hadoop-compat (http://central.maven.org/maven2/com/twitter/elephantbird/elephant-bird-hadoop-compat/4.5/elephant-bird-hadoop-compat-4.5.jar), and elephant-bird-core (http://central.maven.org/maven2/com/twitter/elephantbird/elephant-bird-core/4.5/elephant-bird-core-4.5.jar) JAR files from the Maven central repository and copy them onto HDFS using the following commands:
$ hdfs dfs -put target/elephant-bird-pig-4.5.jar hdfs:///jar/
$ hdfs dfs -put target/elephant-bird-hadoop-compat-4.5.jar hdfs:///jar/
$ hdfs dfs -put elephant-bird-core-4.5.jar hdfs:///jar/
Dataset exploration
Before diving deeper into the dataset, we need to register the dependencies to Elephant Bird and DataFu, as follows:
REGISTER /opt/cloudera/parcels/CDH/lib/pig/datafu-1.1.0-cdh5.0.0.jar
REGISTER /opt/cloudera/parcels/CDH/lib/pig/lib/json-simple-1.1.jar
REGISTER hdfs:///jar/elephant-bird-pig-4.5.jar
REGISTER hdfs:///jar/elephant-bird-hadoop-compat-4.5.jar
REGISTER hdfs:///jar/elephant-bird-core-4.5.jar
Then, load the JSON dataset of tweets using com.twitter.elephantbird.pig.load.JsonLoader, as follows:
tweets = LOAD 'tweets.json' USING
    com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
com.twitter.elephantbird.pig.load.JsonLoader decodes each line of the input file to JSON and passes the resulting map of values to Pig as a single-element tuple. This enables access to elements of the JSON object without having to specify a schema upfront. The -nestedLoad argument instructs the class to load nested data structures.
Tweet metadata
In the remainder of the chapter, we will use metadata from the JSON dataset to model the tweet stream. One example of metadata attached to a tweet is the Place object, which contains geographical information about the user's location. Place contains fields that describe its name, ID, country, country code, and more. A full description can be found at https://dev.twitter.com/docs/platform-objects/places.
place = FOREACH tweets GENERATE (chararray)$0#'place' as place;
Entities give information such as structured data from tweets, URLs, hashtags, and mentions, without having to extract them from text. A description of entities can be found at https://dev.twitter.com/docs/entities. The hashtag entity is an array of tags extracted from a tweet. Each entity has the following two attributes:
Text: is the hashtag text
Indices: is the character position from which the hashtag was extracted
The following code uses entities:
hashtags_bag = FOREACH tweets {
    GENERATE
        FLATTEN($0#'entities'#'hashtags') as tag;
}
We then flatten hashtags_bag to extract each hashtag's text:
hashtags = FOREACH hashtags_bag GENERATE tag#'text' as topic;
Entities for user objects contain information that appears in the user profile and description fields. We can extract the tweet author's ID via the user field in the tweet map:
users = FOREACH tweets GENERATE $0#'user'#'id' as id;
Data preparation
The SAMPLE built-in operator selects a set of n tuples with probability p out of the dataset, as follows:
sampled = SAMPLE tweets 0.01;
The preceding command will select approximately 1 percent of the dataset. Given that SAMPLE is probabilistic (http://en.wikipedia.org/wiki/Bernoulli_sampling), there is no guarantee that the sample size will be exact. Moreover, the function samples with replacement, which means that each item might appear more than once.
Apache DataFu implements a number of sampling methods for cases where having an exact sample size and no replacement is desired (SimpleRandomSampling), sampling with replacement (SimpleRandomSampleWithReplacementVote and SimpleRandomSampleWithReplacementElect), when we want to account for sample bias (WeightedRandomSampling), or to sample across multiple relations (SampleByKey).
We can create a sample of exactly 1 percent of the dataset, with each item having the same probability of being selected, using SimpleRandomSample.
Note
The actual guarantee is a sample of size ceil(p*n) with a probability of at least 99 percent.
First, we pass a sampling probability of 0.01 to the UDF constructor:
DEFINE SRS datafu.pig.sampling.SimpleRandomSample('0.01');
and the bag, created with (GROUP tweets ALL), to be sampled:
sampled = FOREACH (GROUP tweets ALL) GENERATE FLATTEN(SRS(tweets));
The SimpleRandomSample UDF selects without replacement, which means that each item will appear only once.
Note
Which sampling method to use depends both on the data we are working with, assumptions on how items are distributed, the size of the dataset, and what we practically want to achieve. In general, when we want to explore a dataset to formulate hypotheses, SimpleRandomSample can be a good choice. However, in several analytics applications, it is common to use methods that assume replacement (for example, bootstrapping).
Note that when working with very large datasets, sampling with replacement and sampling without replacement tend to behave similarly. The probability of an item being selected twice out of a population of billions of items will be low.
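To put a rough number on this claim, here is a small back-of-the-envelope calculation in Scala; the population and sample sizes are purely illustrative and are not taken from the book's dataset:
val n = 1e9                   // population size (illustrative)
val k = 1e7                   // number of draws, that is, a 1 percent sample
val expectedDistinct = n * (1 - math.pow(1 - 1.0 / n, k))
val duplicateDraws = k - expectedDistinct
println(duplicateDraws / k)   // roughly 0.005: about half a percent of draws repeat an item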
Top n statistics
One of the first questions we might want to ask is how frequent certain things are. For instance, we might want to create a histogram of the top 10 topics by the number of mentions. Similarly, we might want to find the top 50 countries or the top 10 users. Before looking at tweets data, we will define a macro so that we can apply the same selection logic to different collections of items:
DEFINE top_n(rel, col, n)
RETURNS top_n_items {
    grpd = GROUP $rel BY $col;
    cnt_items = FOREACH grpd
        GENERATE FLATTEN(group), COUNT($rel) AS cnt;
    cnt_items_sorted = ORDER cnt_items BY cnt DESC;
    $top_n_items = LIMIT cnt_items_sorted $n;
};
The top_n macro takes a relation rel, the column col we want to count, and the number of items to return n as parameters. In the Pig Latin block, we first group rel by items in col, count the number of occurrences of each item, sort them, and select the most frequent n.
To find the top 10 English hashtags, we filter tweets by language and extract their text:
tweets_en = FILTER tweets BY $0#'lang' == 'en';

hashtags_bag = FOREACH tweets_en {
    GENERATE
        FLATTEN($0#'entities'#'hashtags') AS tag;
}

hashtags = FOREACH hashtags_bag GENERATE tag#'text' AS tag;
And apply the top_n macro:
top_10_hashtags = top_n(hashtags, tag, 10);
In order to better characterize what is trending and make this information more relevant to users, we can drill down into the dataset and look at hashtags per geographic location.
First, we generate a bag of (place, hashtag) tuples, as follows:
hashtags_country_bag = FOREACH tweets {
    GENERATE
        $0#'place' as place,
        FLATTEN($0#'entities'#'hashtags') as tag;
}
And then, we extract the country code and hashtag text, as follows:
hashtags_country = FOREACH hashtags_country_bag {
    GENERATE
        place#'country_code' as co,
        tag#'text' as tag;
}
Then, we count how many times each country code and hashtag appear together, as follows:
hashtags_country_frequency = FOREACH (GROUP hashtags_country BY (co, tag)) {
    GENERATE
        FLATTEN(group),
        COUNT(hashtags_country) as cnt;
}
Finally, we count the top 10 countries per hashtag with the TOP function, as follows:
hashtags_country_regrouped = GROUP hashtags_country_frequency BY cnt;

top_results = FOREACH hashtags_country_regrouped {
    result = TOP(10, 1, hashtags_country_frequency);
    GENERATE FLATTEN(result);
}
TOP's parameters are the number of tuples to return, the column to compare, and the relation containing said column:
top_results = FOREACH D {
    result = TOP(10, 1, C);
    GENERATE FLATTEN(result);
}
The source code for this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch6/topn.pig.
Datetime manipulation
The created_at field in the JSON tweets gives us time-stamped information about when the tweet was posted. Unfortunately, its format is not compatible with Pig's built-in datetime type.
Piggybank comes to the rescue with a number of time manipulation UDFs contained in org.apache.pig.piggybank.evaluation.datetime.convert. One of them is CustomFormatToISO, which converts an arbitrarily formatted timestamp into an ISO 8601 datetime string.
In order to access these UDFs, we first need to register the piggybank.jar file, as follows:
REGISTER /opt/cloudera/parcels/CDH/lib/pig/piggybank.jar
To make our code less verbose, we create an alias for the CustomFormatToISO class's fully qualified Java name:
DEFINE CustomFormatToISO
    org.apache.pig.piggybank.evaluation.datetime.convert.CustomFormatToISO();
By knowing how to manipulate timestamps, we can calculate statistics at different time intervals. For instance, we can look at how many tweets are created per hour. Pig has a built-in GetHour function that extracts the hour out of a datetime type. To use this, we first convert the timestamp string to ISO 8601 with CustomFormatToISO and then the resulting chararray to datetime using the built-in ToDate function, as follows:
hourly_tweets = FOREACH tweets {
    GENERATE
        GetHour(
            ToDate(
                CustomFormatToISO(
                    $0#'created_at', 'EEE MMMM d HH:mm:ss Z y')
            )
        ) as hour;
}
Now, it is just a matter of grouping hourly_tweets by hour and then generating a count of tweets per group, as follows:
hourly_tweets_count = FOREACH (GROUP hourly_tweets BY hour) {
    GENERATE FLATTEN(group), COUNT(hourly_tweets);
}
SessionsDataFu’sSessionizeclasscanhelpustobettercaptureuseractivityovertime.Asessionrepresentstheactivityofauserwithinagivenperiodoftime.Forinstance,wecanlookateachuser’stweetstreamatintervalsof15minutesandmeasurethesesessionstodeterminebothnetworkvolumesaswellasuseractivity:
DEFINESessionizedatafu.pig.sessions.Sessionize('15m');
users_activity=FOREACHtweets{
GENERATE
CustomFormatToISO($0#'created_at',
'EEEMMMMdHH:mm:ssZy')ASdt,
(chararray)$0#'user'#'id'asuser_id;
}
users_activity_sessionized=FOREACH
(GROUPusers_activityBYuser_id){
ordered=ORDERusers_activityBYdt;
GENERATEFLATTEN(Sessionize(ordered))
AS(dt,user_id,session_id);
}
users_activity simply records the time dt at which a given user_id posted a status update.
Sessionize takes the session timeout and a bag as input. The first element of the input bag is an ISO 8601 timestamp, and the bag must be sorted by this timestamp. Events that are within 15 minutes of each other will belong to the same session.
It returns the input bag with a new field, session_id, that uniquely identifies a session. With this data, we can calculate the session's length and some other statistics. More examples of Sessionize usage can be found at http://datafu.incubator.apache.org/docs/datafu/guide/sessions.html.
Capturing user interactions
In the remainder of the chapter, we will look at how to capture patterns from user interactions. As a first step in this direction, we will create a dataset suitable to model a social network. This dataset will contain a timestamp, the ID of the tweet, the user who posted the tweet, the user and tweet she's replying to, and the hashtag in the tweet.
Twitter considers as a reply (in_reply_to_status_id_str) any message beginning with the @ character. Such tweets are interpreted as a direct message to that person. Placing an @ character anywhere else in the tweet is interpreted as a mention ('entities'#'user_mentions') and not a reply. The difference is that mentions are immediately broadcast to a person's followers, whereas replies are not. Replies are, however, considered as mentions.
When working with personally identifiable information, it is a good idea to anonymize, if not remove entirely, sensitive data such as IP addresses, names, and user IDs. A commonly used technique involves a hash function that takes as input the data we want to anonymize, concatenated with additional random data called salt. The following code shows an example of such anonymization:
DEFINE SHA datafu.pig.hash.SHA();

from_to_bag = FOREACH tweets {
    dt = $0#'created_at';
    user_id = (chararray)$0#'user'#'id';
    tweet_id = (chararray)$0#'id_str';
    reply_to_tweet = (chararray)$0#'in_reply_to_status_id_str';
    reply_to = (chararray)$0#'in_reply_to_user_id_str';
    place = $0#'place';
    topics = $0#'entities'#'hashtags';
    GENERATE
        CustomFormatToISO(dt, 'EEE MMMM d HH:mm:ss Z y') AS dt,
        SHA((chararray)CONCAT('SALT', user_id)) AS source,
        SHA(((chararray)CONCAT('SALT', tweet_id))) AS tweet_id,
        ((reply_to_tweet IS NULL)
            ? NULL
            : SHA((chararray)CONCAT('SALT', reply_to_tweet)))
            AS reply_to_tweet_id,
        ((reply_to IS NULL)
            ? NULL
            : SHA((chararray)CONCAT('SALT', reply_to)))
            AS destination,
        (chararray)place#'country_code' as country,
        FLATTEN(topics) AS topic;
}

-- extract the hashtag text
from_to = FOREACH from_to_bag {
    GENERATE
        dt,
        tweet_id,
        reply_to_tweet_id,
        source,
        destination,
        country,
        (chararray)topic#'text' AS topic;
}
In this example, we use CONCAT to append a (not so random) salt string to personal data. We then generate a hash of the salted IDs with DataFu's SHA function. The SHA function requires its input parameters to be non-null. We enforce this condition using if-then-else statements. In Pig Latin, this is expressed as <condition is true> ? <true branch> : <false branch>. If the string is null, we return NULL, and if not, we return the salted hash. To make the code more readable, we use aliases for the tweet JSON fields and reference them in the GENERATE block.
Link analysis
We can redefine our approach to determine trending topics to include users' reactions. A first, naïve, approach could be to consider a topic as important if it caused a number of replies larger than a threshold value.
A problem with this approach is that tweets generate relatively few replies, so the volume of the resulting dataset will be low. Hence, it requires a very large amount of data to contain tweets being replied to and produce any result. In practice, we would likely want to combine this metric with other ones (for example, mentions) in order to perform more meaningful analyses.
To satisfy this query, we will create a new dataset that includes the hashtags extracted from both the tweet and the one a user is replying to:
tweet_hashtag = FOREACH from_to GENERATE tweet_id, topic;

from_to_self_joined = JOIN from_to BY reply_to_tweet_id LEFT,
    tweet_hashtag BY tweet_id;

twitter_graph = FOREACH from_to_self_joined {
    GENERATE
        from_to::dt AS dt,
        from_to::tweet_id AS tweet_id,
        from_to::reply_to_tweet_id AS reply_to_tweet_id,
        from_to::source AS source,
        from_to::destination AS destination,
        from_to::topic AS topic,
        from_to::country AS country,
        tweet_hashtag::topic AS topic_replied;
}
Note that Pig does not allow a cross join on the same relation, hence we have to create tweet_hashtag for the right-hand side of the join. Here, we use the :: operator to disambiguate from which relation and column we want to select records.
Once again, we can look for the top 10 topics by number of replies using the top_n macro:
top_10_topics = top_n(twitter_graph, topic_replied, 10);
Counting things will only take us so far. We can compute more descriptive statistics on this dataset with DataFu. Using the Quantile function, we can calculate the median, the 90th, 95th, and the 99th percentiles of the number of hashtag reactions, as follows:
DEFINE Quantile datafu.pig.stats.Quantile('0.5', '0.90', '0.95', '0.99');
Since the UDF expects an ordered bag of integer values as input, we first count the frequency of each topic_replied entry, as follows:
topics_with_replies_grpd = GROUP twitter_graph BY topic_replied;

topics_with_replies_cnt = FOREACH topics_with_replies_grpd {
    GENERATE
        COUNT(twitter_graph) as cnt;
}
Then, we apply Quantile on the bag of frequencies, as follows:
quantiles = FOREACH (GROUP topics_with_replies_cnt ALL) {
    sorted = ORDER topics_with_replies_cnt BY cnt;
    GENERATE Quantile(sorted);
}
The source code for this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch6/graph.pig.
Influential users
We will use PageRank, an algorithm developed by Google to rank web pages (http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf), to identify influential users in the Twitter graph we generated in the previous section.
This type of analysis has a number of use cases, such as targeted and contextual advertisement, recommendation systems, spam detection, and obviously measuring the importance of web pages. A similar approach, used by Twitter to implement the Who to Follow feature, is described in the research paper WTF: The Who to Follow service at Twitter, found at http://stanford.edu/~rezab/papers/wtf_overview.pdf.
Informally, PageRank determines the importance of a page based on the importance of other pages linking to it and assigns it a score between 0 and 1. A high PageRank score indicates that a lot of pages point to it. Intuitively, being linked by pages with a high PageRank is a quality endorsement. In terms of the Twitter graph, we assume that users receiving a lot of replies are important or influential within the social network. In Twitter's case, we consider an extended definition of PageRank, where the link between two users is given by a direct reply and labeled by any eventual hashtag present in the message. Heuristically, we want to identify influential users on a given topic.
In DataFu's implementation, each graph is represented as a bag of (source, edges) tuples. The source tuple is an integer ID representing the source node. The edges are a bag of (destination, weight) tuples. destination is an integer ID representing the destination node. weight is a double representing how much the edge should be weighted. The output of the UDF is a bag of (source, rank) pairs, where rank is the PageRank value for the source user in the graph. Notice that we talked about nodes, edges, and graphs as abstract concepts. In Google's case, nodes are web pages, edges are links from one page to the other, and graphs are groups of pages connected directly and indirectly.
In our case, nodes represent users, edges represent in_reply_to_user_id_str mentions, and edges are labeled by hashtags in tweets. The output of PageRank should suggest which users are influential on a given topic given their interaction patterns.
In this section, we will write a pipeline to:
Represent data as a graph where each node is a user and a hashtag labels the edge
Map IDs and hashtags to integers so that they can be consumed by PageRank
Apply PageRank
Store the results into HDFS in an interoperable format (Avro)
We represent the graph as a bag of tuples in the form (source, destination, topic), where each tuple represents the interaction between nodes. The source code for this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch6/pagerank.pig.
We will map users' and hashtags' text to numerical IDs. We use the Java String hashCode() method to perform this conversion step and wrap the logic in an Eval UDF.
Note
The size of an integer is effectively the upper bound for the number of nodes and edges in the graph. For production code, it is recommended that you use a more robust hash function.
TheStringToIntclasstakesastringasinput,callsthehashCode()method,andreturnsthemethodoutputtoPig.TheUDFcodecanbefoundathttps://github.com/learninghadoop2/book-examples/blob/master/ch6/udf/com/learninghadoop2/pig/udf/StringToInt.java.
packagecom.learninghadoop2.pig.udf;
importjava.io.IOException;
importorg.apache.pig.EvalFunc;
importorg.apache.pig.data.Tuple;
publicclassStringToIntextendsEvalFunc<Integer>{
publicIntegerexec(Tupleinput)throwsIOException{
if(input==null||input.size()==0)
returnnull;
try{
Stringstr=(String)input.get(0);
returnstr.hashCode();
}catch(Exceptione){
throw
newIOException("CannotconvertStringtoInt",e);
}
}
}
Weextendorg.apache.pig.EvalFuncandoverridetheexecmethodtoreturnstr.hashCode()onthefunctioninput.TheEvalFunc<Integer>classisparameterizedwiththereturntypeoftheUDF(Integer).
Next,wecompiletheclassandarchiveitintoaJAR,asfollows:
$javac-classpath/opt/cloudera/parcels/CDH/lib/pig/pig.jar:$(hadoop
classpath)com/learninghadoop2/pig/udf/StringToInt.java
$jarcvfmyudfs-pig.jarcom/learninghadoop2/pig/udf/StringToInt.class
WecannowregistertheUDFinPigandcreateanaliastoStringToInt,asfollows:
REGISTERmyudfs-pig.jar
DEFINEStringToIntcom.learninghadoop2.pig.udf.StringToInt();
We filter out tweets with no destination and no topic, as follows:
tweets_graph_filtered = FILTER twitter_graph by
(destination IS NOT NULL) AND
(topic IS NOT null);
Then, we convert the source, destination, and topic to integer IDs:
from_to = foreach tweets_graph_filtered {
GENERATE
StringToInt(source) as source_id,
StringToInt(destination) as destination_id,
StringToInt(topic) as topic_id;
}
Once data is in the appropriate format, we can reuse the implementation of PageRank and the example code (found at https://github.com/apache/incubator-datafu/blob/master/datafu-pig/src/main/java/datafu/pig/linkanalysis/PageRank.java) provided by DataFu, as shown in the following code:
DEFINE PageRank datafu.pig.linkanalysis.PageRank('dangling_nodes','true');
We begin by creating a bag of (source_id, destination_id, topic_id) tuples, as follows:
reply_to = group from_to by (source_id, destination_id, topic_id);
We count the occurrences of each tuple, that is, how many times two people talked about a topic, as follows:
topic_edges = foreach reply_to {
GENERATE flatten(group), ((double) COUNT(from_to.topic_id)) as w;
}
Remember that topic is the edge of our graph; we begin by creating an association between the source node and the topic edge, as follows:
topic_edges_grouped = GROUP topic_edges by (topic_id, source_id);
Then we regroup it with the purpose of adding a destination node and the edge weight, as follows:
topic_edges_grouped = FOREACH topic_edges_grouped {
GENERATE
group.topic_id as topic,
group.source_id as source,
topic_edges.(destination_id, w) as edges;
}
Once we create the Twitter graph, we calculate the PageRank of all users (source_id):
topic_rank = FOREACH (GROUP topic_edges_grouped BY topic) {
GENERATE
group as topic,
FLATTEN(PageRank(topic_edges_grouped.(source, edges))) as (source, rank);
}
topic_rank = FOREACH topic_rank GENERATE topic, source, rank;
We store the result in HDFS in Avro format. If Avro dependencies are not present in the classpath, we need to add the Avro MapReduce JAR file to our environment before accessing individual fields. Within Pig, for example, on the Cloudera CDH5 VM:
REGISTER /opt/cloudera/parcels/CDH/lib/avro/avro.jar
REGISTER /opt/cloudera/parcels/CDH/lib/avro/avro-mapred-hadoop2.jar
STORE topic_rank INTO 'replies-pagerank' using AvroStorage();
Note
In these last two sections, we made a number of implicit assumptions on what a Twitter graph might look like and what the concepts of topic and user interaction mean. Given the constraints that we posed, the resulting social network we analyzed will be relatively small and not necessarily representative of the entire Twitter social network. Extrapolating results from this dataset is discouraged. In practice, there are many other factors that should be taken into account to generate a robust model of social interaction.
Summary
In this chapter, we introduced Apache Pig, a platform for large-scale data analysis on Hadoop. In particular, we covered the following topics:
The goals of Pig as a way of providing a dataflow-like abstraction that does not require hands-on MapReduce development
How Pig's approach to processing data compares to SQL, where Pig is procedural while SQL is declarative
Getting started with Pig, an easy task, as it is a library that generates custom code and doesn't require additional services
An overview of the data types, core functions, and extension mechanisms provided by Pig
Examples of applying Pig to analyze the Twitter dataset in detail, which demonstrated its ability to express complex concepts in a very concise fashion
How libraries such as Piggybank, Elephant Bird, and DataFu provide repositories for numerous useful prewritten Pig functions
In the next chapter, we will revisit the SQL comparison by exploring tools that expose a SQL-like abstraction over data stored in HDFS.
Chapter 7. Hadoop and SQL
MapReduce is a powerful paradigm that enables complex data processing that can reveal valuable insights. As discussed in earlier chapters, however, it does require a different mindset and some training and experience on the model of breaking processing analytics into a series of map and reduce steps. There are several products that are built atop Hadoop to provide higher-level or more familiar views of the data held within HDFS, and Pig is a very popular one. This chapter will explore the other most common abstraction implemented atop Hadoop: SQL.
In this chapter, we will cover the following topics:
What the use cases for SQL on Hadoop are and why it is so popular
HiveQL, the SQL dialect introduced by Apache Hive
Using HiveQL to perform SQL-like analysis of the Twitter dataset
How HiveQL can approximate common features of relational databases such as joins and views
How HiveQL allows the incorporation of user-defined functions into its queries
How SQL on Hadoop complements Pig
Other SQL-on-Hadoop products such as Impala and how they differ from Hive
Why SQL on Hadoop
So far we have seen how to write Hadoop programs using the MapReduce APIs and how Pig Latin provides a scripting abstraction and a wrapper for custom business logic by means of UDFs. Pig is a very powerful tool, but its dataflow-based programming model is not familiar to most developers or business analysts. The traditional tool of choice for such people to explore data is SQL.
Back in 2008 Facebook released Hive, the first widely used implementation of SQL on Hadoop.
Instead of providing a way of more quickly developing map and reduce tasks, Hive offers an implementation of HiveQL, a query language based on SQL. Hive takes HiveQL statements and immediately and automatically translates the queries into one or more MapReduce jobs. It then executes the overall MapReduce program and returns the results to the user.
This interface to Hadoop not only reduces the time required to produce results from data analysis, it also significantly widens the net as to who can use Hadoop. Instead of requiring software development skills, anyone who's familiar with SQL can use Hive.
The combination of these attributes is that HiveQL is often used as a tool for business and data analysts to perform ad hoc queries on the data stored on HDFS. With Hive, the data analyst can work on refining queries without the involvement of a software developer. Just as with Pig, Hive also allows HiveQL to be extended by means of User Defined Functions, enabling the base SQL dialect to be customized with business-specific functionality.
Other SQL-on-Hadoop solutions
Though Hive was the first product to introduce and support HiveQL, it is no longer the only one. Later in this chapter, we will also discuss Impala, released in 2013 and already a very popular tool, particularly for low-latency queries. There are others, but we will mostly discuss Hive and Impala as they have been the most successful.
While introducing the core features and capabilities of SQL on Hadoop, however, we will give examples using Hive; even though Hive and Impala share many SQL features, they also have numerous differences. We don't want to constantly have to caveat each new feature with exactly how it is supported in Hive compared to Impala. We'll generally be looking at aspects of the feature set that are common to both, but if you use both products, it's important to read the latest release notes to understand the differences.
Prerequisites
Before diving into specific technologies, let's generate some data that we'll use in the examples throughout this chapter. We'll create a modified version of a former Pig script as the main functionality for this. The script in this chapter assumes that the Elephant Bird JARs used previously are available in the /jar directory on HDFS. The full source code is at https://github.com/learninghadoop2/book-examples/blob/master/ch7/extract_for_hive.pig, but the core of extract_for_hive.pig is as follows:
-- load JSON data
tweets = load '$inputDir' using
com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
-- Tweets
tweets_tsv = foreach tweets {
generate
(chararray)CustomFormatToISO($0#'created_at',
'EEE MMMM d HH:mm:ss Z y') as dt,
(chararray)$0#'id_str',
(chararray)$0#'text' as text,
(chararray)$0#'in_reply_to',
(boolean)$0#'retweeted' as is_retweeted,
(chararray)$0#'user'#'id_str' as user_id,
(chararray)$0#'place'#'id' as place_id;
}
store tweets_tsv into '$outputDir/tweets'
using PigStorage('\u0001');
-- Places
needed_fields = foreach tweets {
generate
(chararray)CustomFormatToISO($0#'created_at',
'EEE MMMM d HH:mm:ss Z y') as dt,
(chararray)$0#'id_str' as id_str,
$0#'place' as place;
}
place_fields = foreach needed_fields {
generate
(chararray)place#'id' as place_id,
(chararray)place#'country_code' as co,
(chararray)place#'country' as country,
(chararray)place#'name' as place_name,
(chararray)place#'full_name' as place_full_name,
(chararray)place#'place_type' as place_type;
}
filtered_places = filter place_fields by co != '';
unique_places = distinct filtered_places;
store unique_places into '$outputDir/places'
using PigStorage('\u0001');
-- Users
users = foreach tweets {
generate
(chararray)CustomFormatToISO($0#'created_at',
'EEE MMMM d HH:mm:ss Z y') as dt,
(chararray)$0#'id_str' as id_str,
$0#'user' as user;
}
user_fields = foreach users {
generate
(chararray)CustomFormatToISO(user#'created_at',
'EEE MMMM d HH:mm:ss Z y') as dt,
(chararray)user#'id_str' as user_id,
(chararray)user#'location' as user_location,
(chararray)user#'name' as user_name,
(chararray)user#'description' as user_description,
(int)user#'followers_count' as followers_count,
(int)user#'friends_count' as friends_count,
(int)user#'favourites_count' as favourites_count,
(chararray)user#'screen_name' as screen_name,
(int)user#'listed_count' as listed_count;
}
unique_users = distinct user_fields;
store unique_users into '$outputDir/users'
using PigStorage('\u0001');
Run this script as follows:
$ pig -f extract_for_hive.pig -param inputDir=<json input> -param outputDir=<output path>
The preceding code writes data into three separate TSV files for the tweet, user, and place information. Notice that in the store command, we pass an argument when calling PigStorage. This single argument changes the default field separator from a tab character to the Unicode value U+0001 (Ctrl + A). This is often used as a separator in Hive tables and will be particularly useful to us as our tweet data could contain tabs in other fields.
Overview of Hive
We will now show how you can import data into Hive and run a query against the table abstraction Hive provides over the data. In this example, and in the remainder of the chapter, we will assume that queries are typed into the shell that can be invoked by executing the hive command.
Recently a client called Beeline also became available and will likely be the preferred CLI client in the near future.
When importing any new data into Hive, there is generally a three-stage process:
Create the specification of the table into which the data is to be imported
Import the data into the created table
Execute HiveQL queries against the table
Most of the HiveQL statements are direct analogues to similarly named statements in standard SQL. We assume only a passing knowledge of SQL throughout this chapter, but if you need a refresher, there are numerous good online learning resources.
Hive gives a structured query view of our data, and to enable that, we must first define the specification of the table's columns and import the data into the table before we can execute any queries. A table specification is generated using a CREATE statement that specifies the table name, the name and types of its columns, and some metadata about how the table is stored:
CREATE table tweets (
created_at string,
tweet_id string,
text string,
in_reply_to string,
retweeted boolean,
user_id string,
place_id string
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE;
The statement creates a new table, tweets, defined by a list of names for columns in the dataset and their data type. We specify that fields are delimited by the Unicode U+0001 character and that the format used to store data is TEXTFILE.
Data can be imported from a location in HDFS, tweets/, using the LOAD DATA statement:
LOAD DATA INPATH 'tweets' OVERWRITE INTO TABLE tweets;
By default, data for Hive tables is stored on HDFS under /user/hive/warehouse. If a LOAD statement is given a path to data on HDFS, it will not simply copy the data into /user/hive/warehouse, but will move it there instead. If you want to analyze data on HDFS that is used by other applications, then either create a copy or use the EXTERNAL mechanism that will be described later.
Once data has been imported into Hive, we can run queries against it. For instance:
SELECT COUNT(*) FROM tweets;
The preceding code will return the total number of tweets present in the dataset. HiveQL, like SQL, is not case sensitive in terms of keywords, columns, or table names. By convention, SQL statements use uppercase for SQL language keywords, and we will generally follow this when using HiveQL within files, as will be shown later. However, when typing interactive commands, we will frequently take the line of least resistance and use lowercase.
If you look closely at the time taken by the various commands in the preceding example, you'll notice that loading data into a table takes about as long as creating the table specification, but even the simple count of all rows takes significantly longer. The output also shows that table creation and the loading of data do not actually cause MapReduce jobs to be executed, which explains the very short execution times.
The nature of Hive tables
Although Hive copies the data file into its working directory, it does not actually process the input data into rows at that point.
Both the CREATE TABLE and LOAD DATA statements do not truly create concrete table data as such; instead, they produce the metadata that will be used when Hive generates MapReduce jobs to access the data conceptually stored in the table but actually residing on HDFS. Even though the HiveQL statements refer to a specific table structure, it is Hive's responsibility to generate code that correctly maps this to the actual on-disk format in which the data files are stored.
This might seem to suggest that Hive isn't a real database; this is true, it isn't. Whereas a relational database will require a table schema to be defined before data is ingested and then ingest only data that conforms to that specification, Hive is much more flexible. The less concrete nature of Hive tables means that schemas can be defined based on the data as it has already arrived and not on some assumption of how the data should be, which might prove to be wrong. Though changeable data formats are troublesome regardless of technology, the Hive model provides an additional degree of freedom in handling the problem when, not if, it arises.
Hive architecture
Until version 2, Hadoop was primarily a batch system. As we saw in previous chapters, MapReduce jobs tend to have high latency and overhead derived from submission and scheduling. Internally, Hive compiles HiveQL statements into MapReduce jobs. Hive queries have traditionally been characterized by high latency. This has changed with the Stinger initiative and the improvements introduced in Hive 0.13 that we will discuss later.
Hive runs as a client application that processes HiveQL queries, converts them into MapReduce jobs, and submits these to a Hadoop cluster, either to native MapReduce in Hadoop 1 or to the MapReduce Application Master running on YARN in Hadoop 2.
Regardless of the model, Hive uses a component called the metastore, in which it holds all its metadata about the tables defined in the system. Ironically, this is stored in a relational database dedicated to Hive's usage. In the earliest versions of Hive, all clients communicated directly with the metastore, but this meant that every user of the Hive CLI tool needed to know the metastore username and password.
HiveServer was created to act as a point of entry for remote clients, which could also act as a single access-control point and which controlled all access to the underlying metastore. Because of limitations in HiveServer, the newest way to access Hive is through the multi-client HiveServer2.
Note
HiveServer2 introduces a number of improvements over its predecessor, including user authentication and support for multiple connections from the same client. More information can be found at https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2.
Instances of HiveServer and HiveServer2 can be manually executed with the hive --service hiveserver and hive --service hiveserver2 commands, respectively.
In the examples we saw before and in the remainder of this chapter, we implicitly use HiveServer to submit queries via the Hive command-line tool. HiveServer2 comes with Beeline. For compatibility and maturity reasons, Beeline being relatively new, both tools are available on Cloudera and most other major distributions. The Beeline client is part of the core Apache Hive distribution and so is also fully open source. Beeline can be executed in embedded mode with the following command:
$ beeline -u jdbc:hive2://
Data types
HiveQL supports many of the common data types provided by standard database systems. These include primitive types, such as float, double, int, and string, through to structured collection types that provide the SQL analogues to types such as arrays, structs, and unions (structs with options for some fields). Since Hive is implemented in Java, primitive types will behave like their Java counterparts. We can distinguish Hive data types into the following five broad categories:
Numeric: tinyint, smallint, int, bigint, float, double, and decimal
Date and time: timestamp and date
String: string, varchar, and char
Collections: array, map, struct, and uniontype
Misc: boolean, binary, and NULL
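As an illustration, the following is a minimal sketch of a table declaration mixing primitive and collection types; the table and column names are hypothetical and are not part of the book's Twitter dataset:
CREATE TABLE example_types (
id bigint,
name string,
scores array<int>,
attributes map<string, string>,
address struct<street: string, city: string, zip: string>
);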
DDL statements
HiveQL provides a number of statements to create, delete, and alter databases, tables, and views. The CREATE DATABASE <name> statement creates a new database with the given name. A database represents a namespace where table and view metadata is contained. If multiple databases are present, the USE <database name> statement specifies which one to use to query tables or create new metadata. If no database is explicitly specified, Hive will run all statements against the default database. SHOW [DATABASES, TABLES, VIEWS] displays the databases currently available within a data warehouse and which table and view metadata is present within the database currently in use:
CREATE DATABASE twitter;
SHOW databases;
USE twitter;
SHOW TABLES;
The CREATE TABLE [IF NOT EXISTS] <name> statement creates a table with the given name. As alluded to earlier, what is really created is the metadata representing the table and its mapping to files on HDFS as well as a directory in which to store the data files. If a table or view with the same name already exists, Hive will raise an exception.
Both table and column names are case insensitive. In older versions of Hive (0.12 and earlier), only alphanumeric and underscore characters were allowed in table and column names. As of Hive 0.13, the system supports unicode characters in column names. Reserved words, such as load and create, need to be escaped by backticks (the ` character) to be treated literally.
The EXTERNAL keyword specifies that the table exists in resources out of Hive's control, which can be a useful mechanism to extract data from another source at the beginning of a Hadoop-based Extract-Transform-Load (ETL) pipeline. The LOCATION clause specifies where the source file (or directory) is to be found. The EXTERNAL keyword and LOCATION clause have been used in the following code:
CREATE EXTERNAL TABLE tweets (
created_at string,
tweet_id string,
text string,
in_reply_to string,
retweeted boolean,
user_id string,
place_id string
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE
LOCATION '${input}/tweets';
This table will be created in the metastore, but the data will not be copied into the /user/hive/warehouse directory.
Tip
Note that Hive has no concept of primary key or unique identifier. Uniqueness and data normalization are aspects to be addressed before loading data into the data warehouse.
The CREATE VIEW <view name> … AS SELECT statement creates a view with the given name. For example, we can create a view to isolate retweets from other messages, as follows:
CREATE VIEW retweets
COMMENT 'Tweets that have been retweeted'
AS SELECT * FROM tweets WHERE retweeted = true;
Unless otherwise specified, column names are derived from the defining SELECT statement. Hive does not currently support materialized views.
The DROP TABLE and DROP VIEW statements remove both metadata and data for a given table or view. When dropping an EXTERNAL table or a view, only metadata will be removed and the actual data files will not be affected.
Hive allows table metadata to be altered via the ALTER TABLE statement, which can be used to change a column type, name, position, and comment or to add and replace columns.
When adding columns, it is important to remember that only metadata will be changed and not the dataset itself. This means that if we were to add a column in the middle of the table which didn't exist in older files, then while selecting from older data, we might get wrong values in the wrong columns. This is because we would be looking at old files with a new format. We will discuss data and schema migrations in Chapter 8, Data Lifecycle Management, when discussing Avro.
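For instance, a sketch of a few common ALTER TABLE operations against the tweets table defined earlier; the lang column and the tweets_raw name are hypothetical:
ALTER TABLE tweets ADD COLUMNS (lang string COMMENT 'tweet language code');
ALTER TABLE tweets CHANGE COLUMN lang language string;
ALTER TABLE tweets RENAME TO tweets_raw;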
Similarly, ALTER VIEW <view name> AS <select statement> changes the definition of an existing view.
File formats and storage
The data files underlying a Hive table are no different from any other file on HDFS. Users can directly read the HDFS files in the Hive tables using other tools. They can also use other tools to write to HDFS files that can be loaded into Hive through CREATE EXTERNAL TABLE or through LOAD DATA INPATH.
Hive uses the Serializer and Deserializer classes, SerDe, as well as FileFormat classes to read and write table rows. A native SerDe is used if ROW FORMAT is not specified or ROW FORMAT DELIMITED is specified in a CREATE TABLE statement. The DELIMITED clause instructs the system to read delimited files. Delimiter characters can be escaped using the ESCAPED BY clause.
Hive currently uses the following FileFormat classes to read and write HDFS files:
TextInputFormat and HiveIgnoreKeyTextOutputFormat: will read/write data in plain text file format
SequenceFileInputFormat and SequenceFileOutputFormat: these classes read/write data in the Hadoop SequenceFile format
Additionally, the following SerDe classes can be used to serialize and deserialize data:
MetadataTypedColumnsetSerDe: will read/write delimited records such as CSV or tab-separated records
ThriftSerDe and DynamicSerDe: will read/write Thrift objects
JSON
As of version 0.13, Hive ships with the native org.apache.hive.hcatalog.data.JsonSerDe. For older versions of Hive, Hive-JSON-Serde (found at https://github.com/rcongiu/Hive-JSON-Serde) is arguably one of the most feature-rich JSON serialization/deserialization modules.
We can use either module to load JSON tweets without any need for preprocessing and just define a Hive schema that matches the content of a JSON document. In the following example, we use Hive-JSON-Serde.
As with any third-party module, we load the SerDe JARs into Hive with the following code:
ADD JAR json-serde-1.3-jar-with-dependencies.jar;
Then, we issue the usual CREATE statement, as follows:
CREATE EXTERNAL TABLE tweets (
contributors string,
coordinates struct<
coordinates: array<float>,
type: string>,
created_at string,
entities struct<
hashtags: array<struct<
indices: array<tinyint>,
text: string>>,
…
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
LOCATION 'tweets';
With this SerDe, we can map nested documents (such as entities or users) to the struct or map types. We tell Hive that the data stored at LOCATION 'tweets' is text (STORED AS TEXTFILE) and that each row is a JSON object (ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'). In Hive 0.13 and later, we can express this property as ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'.
Manually specifying the schema for complex documents can be a tedious and error-prone process. The hive-json module (found at https://github.com/hortonworks/hive-json) is a handy utility to analyze large documents and generate an appropriate Hive schema. Depending on the document collection, further refinement might be necessary.
In our example, we used a schema generated with hive-json that maps the tweets JSON to a number of struct data types. This allows us to query the data using a handy dot notation. For instance, we can extract the screen name and description fields of a user object with the following code:
SELECT user.screen_name, user.description FROM tweets_json LIMIT 10;
Avro
AvroSerde (https://cwiki.apache.org/confluence/display/Hive/AvroSerDe) allows us to read and write data in Avro format. Starting from 0.14, Avro-backed tables can be created using the STORED AS AVRO statement, and Hive will take care of creating an appropriate Avro schema for the table. Prior versions of Hive are a bit more verbose.
As an example, let's load into Hive the PageRank dataset we generated in Chapter 6, Data Analysis with Apache Pig. This dataset was created using Pig's AvroStorage class, and has the following schema:
{
   "type": "record",
   "name": "record",
   "fields": [
      {"name": "topic", "type": ["null", "int"]},
      {"name": "source", "type": ["null", "int"]},
      {"name": "rank", "type": ["null", "float"]}
   ]
}
The table structure is captured in an Avro record, which contains header information (a name and optional namespace to qualify the name) and an array of the fields. Each field is specified with its name and type as well as an optional documentation string.
For a few of the fields, the type is not a single value, but instead a pair of values, one of which is null. This is an Avro union, and this is the idiomatic way of handling columns that might have a null value. Avro specifies null as a concrete type, and any location where another type might have a null value needs to be specified in this way. This will be handled transparently for us when we use the following schema.
With this definition, we can now create a Hive table that uses this schema for its table specification, as follows:
CREATE EXTERNAL TABLE tweets_pagerank
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal'='{
   "type": "record",
   "name": "record",
   "fields": [
      {"name": "topic", "type": ["null", "int"]},
      {"name": "source", "type": ["null", "int"]},
      {"name": "rank", "type": ["null", "float"]}
   ]
}')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '${data}/ch5-pagerank';
Then, look at the following table definition from within Hive (note also that HCatalog, which we'll introduce in Chapter 8, Data Life Cycle Management, also supports such definitions):
DESCRIBE tweets_pagerank;
OK
topic     int       from deserializer
source    int       from deserializer
rank      float     from deserializer
In the DDL, we told Hive that data is stored in Avro format using AvroContainerInputFormat and AvroContainerOutputFormat. Each row needs to be serialized and deserialized using org.apache.hadoop.hive.serde2.avro.AvroSerDe. The table schema is inferred by Hive from the Avro schema embedded in avro.schema.literal.
Alternatively, we can store a schema on HDFS and have Hive read it to determine the table structure. Create the preceding schema in a file called pagerank.avsc (this is the standard file extension for Avro schemas). Then place it on HDFS; we prefer to have a common location for schema files such as /schema/avro. Finally, define the table using the avro.schema.url SerDe property WITH SERDEPROPERTIES ('avro.schema.url'='hdfs://<namenode>/schema/avro/pagerank.avsc').
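Putting this together, a sketch of the same table definition referencing the schema file instead of embedding it might look as follows (the NameNode host is a placeholder):
CREATE EXTERNAL TABLE tweets_pagerank
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.url'='hdfs://<namenode>/schema/avro/pagerank.avsc')
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '${data}/ch5-pagerank';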
If Avro dependencies are not present in the classpath, we need to add the Avro MapReduce JAR to our environment before accessing individual fields. Within Hive, on the Cloudera CDH5 VM:
ADD JAR /opt/cloudera/parcels/CDH/lib/avro/avro-mapred-hadoop2.jar;
We can also use this table like any other. For instance, we can query the data to select the user and topic pairs with a high PageRank:
SELECT source, topic FROM tweets_pagerank WHERE rank >= 0.9;
In Chapter 8, Data Lifecycle Management, we will see how Avro and avro.schema.url play an instrumental role in enabling schema migrations.
Columnar stores
Hive can also take advantage of columnar storage via the ORC (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC) and Parquet (https://cwiki.apache.org/confluence/display/Hive/Parquet) formats.
If a table is defined with very many columns, it is not unusual for any given query to only process a small subset of these columns. But even in a SequenceFile, each full row and all its columns will be read from disk, decompressed, and processed. This consumes a lot of system resources for data that we know in advance is not of interest.
Traditional relational databases also store data on a row basis, and a type of database called columnar changed this to be column-focused. In the simplest model, instead of one file for each table, there would be one file for each column in the table. If a query only needed to access five columns in a table with 100 columns in total, then only the files for those five columns will be read. Both ORC and Parquet use this principle as well as other optimizations to enable much faster queries.
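As a minimal sketch (reusing a subset of the tweets columns from earlier; STORED AS ORC is standard HiveQL, and STORED AS PARQUET can be used in the same way in recent releases), a columnar copy of the tweets table could be created and populated as follows:
CREATE TABLE tweets_orc (
created_at string,
tweet_id string,
text string,
user_id string,
place_id string
) STORED AS ORC;
INSERT INTO TABLE tweets_orc
SELECT created_at, tweet_id, text, user_id, place_id FROM tweets;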
Queries
Tables can be queried using the familiar SELECT … FROM statement. The WHERE statement allows the specification of filtering conditions, GROUP BY aggregates records, ORDER BY specifies sorting criteria, and LIMIT specifies the number of records to retrieve. Aggregate functions, such as count and sum, can be applied to aggregated records. For instance, the following code returns the top 10 most prolific users in the dataset:
SELECT user_id, COUNT(*) AS cnt FROM tweets GROUP BY user_id ORDER BY cnt
DESC LIMIT 10
This returns the top 10 most prolific users in the dataset:
2263949659    4
1332188053    4
959468857     3
1367752118    3
362562944     3
58646041      3
2375296688    3
1468188529    3
37114209      3
2385040940    3
We can improve the readability of the hive output by setting the following:
SET hive.cli.print.header=true;
This will instruct hive, though not beeline, to print column names as part of the output.
Tip
You can add the command to the .hiverc file, usually found in the root of the executing user's home directory, to have it apply to all hive CLI sessions.
HiveQL implements a JOIN operator that enables us to combine tables together. In the Prerequisites section, we generated separate datasets for the user and place objects. Let's now load them into Hive using external tables.
We first create a user table to store user data, as follows:
CREATE EXTERNAL TABLE user (
created_at string,
user_id string,
`location` string,
name string,
description string,
followers_count bigint,
friends_count bigint,
favourites_count bigint,
screen_name string,
listed_count bigint
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE
LOCATION '${input}/users';
We then create a place table to store location data, as follows:
CREATE EXTERNAL TABLE place (
place_id string,
country_code string,
country string,
`name` string,
full_name string,
place_type string
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE
LOCATION '${input}/places';
We can use the JOIN operator to display the names of the 10 most prolific users, as follows:
SELECT tweets.user_id, user.name, COUNT(tweets.user_id) AS cnt
FROM tweets
JOIN user ON user.user_id = tweets.user_id
GROUP BY tweets.user_id, user.user_id, user.name
ORDER BY cnt DESC LIMIT 10;
Tip
Only equality, outer, and left (semi) joins are supported in Hive.
Notice that there might be multiple entries with a given user ID but different values for the followers_count, friends_count, and favourites_count columns. To avoid duplicate entries, we count only user_id from the tweets table.
We can rewrite the previous query as follows:
SELECT tweets.user_id, u.name, COUNT(*) AS cnt
FROM tweets
join (SELECT user_id, name FROM user GROUP BY user_id, name) u
ON u.user_id = tweets.user_id
GROUP BY tweets.user_id, u.name
ORDER BY cnt DESC LIMIT 10;
Instead of directly joining the user table, we execute a subquery, as follows:
SELECT user_id, name FROM user GROUP BY user_id, name;
The subquery extracts unique user IDs and names. Note that Hive has limited support for subqueries, historically only permitting a subquery in the FROM clause of a SELECT statement. Hive 0.13 has added limited support for subqueries within the WHERE clause also.
HiveQL is an ever-evolving rich language, a full exposition of which is beyond the scope of this chapter. A description of its query and DDL capabilities can be found at https://cwiki.apache.org/confluence/display/Hive/LanguageManual.
Structuring Hive tables for given workloads
Often Hive isn't used in isolation; instead, tables are created with particular workloads in mind or need to be invoked in ways that are suitable for inclusion in automated processes. We'll now explore some of these scenarios.
Partitioning a table
With columnar file formats, we explained the benefits of excluding unneeded data as early as possible when processing a query. A similar concept has been used in SQL for some time: table partitioning.
When creating a partitioned table, a column is specified as the partition key. All values with that key are then stored together. In Hive's case, different subdirectories for each partition key are created under the table directory in the warehouse location on HDFS.
It's important to understand the cardinality of the partition column. With too few distinct values, the benefits are reduced as the files are still very large. If there are too many values, then queries might need a large number of files to be scanned to access all the required data. Perhaps the most common partition key is one based on date. We could, for example, partition our user table from earlier based on the created_at column, that is, the date the user was first registered. Note that since partitioning a table by definition affects its file structure, we create this table now as a non-external one, as follows:
CREATE TABLE partitioned_user (
created_at string,
user_id string,
`location` string,
name string,
description string,
followers_count bigint,
friends_count bigint,
favourites_count bigint,
screen_name string,
listed_count bigint
) PARTITIONED BY (created_at_date string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE;
To load data into a partition, we can explicitly give a value for the partition into which to insert the data, as follows:
INSERT INTO TABLE partitioned_user
PARTITION (created_at_date = '2014-01-01')
SELECT
created_at,
user_id,
location,
name,
description,
followers_count,
friends_count,
favourites_count,
screen_name,
listed_count
FROM user;
This is at best verbose, as we need a statement for each partition key value; if a single LOAD or INSERT statement contains data for multiple partitions, it just won't work. Hive also has a feature called dynamic partitioning, which can help us here. We set the following three variables:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=5000;
The first two statements enable all partitions (the nonstrict option) to be dynamic. The third one allows 5,000 distinct partitions to be created on each mapper and reducer node.
We can then simply use the name of the column to be used as the partition key, and Hive will insert data into partitions depending on the value of the key for a given row:
INSERT INTO TABLE partitioned_user
PARTITION (created_at_date)
SELECT
created_at,
user_id,
location,
name,
description,
followers_count,
friends_count,
favourites_count,
screen_name,
listed_count,
to_date(created_at) as created_at_date
FROM user;
Even though we use only a single partition column here, we can partition a table by multiple column keys; just have them as a comma-separated list in the PARTITIONED BY clause.
Note that the partition key columns need to be included as the last columns in any statement being used to insert into a partitioned table. In the preceding code, we use Hive's to_date function to convert the created_at timestamp to a YYYY-MM-DD formatted string.
Partitioned data is stored in HDFS as /path/to/warehouse/<database>/<table>/key=<value>. In our example, the partitioned_user table structure will look like /user/hive/warehouse/default/partitioned_user/created_at_date=2014-04-01.
If data is added directly to the filesystem, for instance by some third-party processing tool or by hadoop fs -put, the metastore won't automatically detect the new partitions. The user will need to manually run an ALTER TABLE statement such as the following for each newly added partition:
ALTER TABLE <table_name> ADD PARTITION <location>;
To add metadata for all partitions not currently present in the metastore, we can use the MSCK REPAIR TABLE <table_name>; statement. On EMR, this is equivalent to executing the following statement:
ALTER TABLE <table_name> RECOVER PARTITIONS;
Notice that both statements will also work with EXTERNAL tables. In the following chapter, we will see how this pattern can be exploited to create flexible and interoperable pipelines.
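As a concrete sketch of the fuller syntax (the partition value and directory are placeholders), registering a single directory added out of band to the partitioned_user table might look like this:
ALTER TABLE partitioned_user ADD PARTITION (created_at_date='2014-04-01')
LOCATION '/user/hive/warehouse/partitioned_user/created_at_date=2014-04-01';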
Overwriting and updating data
Partitioning is also useful when we need to update a portion of a table. Normally a statement of the following form will replace all the data for the destination table:
INSERT OVERWRITE TABLE <table> …
If OVERWRITE is omitted, then each INSERT statement will add additional data to the table. Sometimes, this is desirable, but often, the source data being ingested into a Hive table is intended to fully update a subset of the data and keep the rest untouched.
If we perform an INSERT OVERWRITE statement (or a LOAD OVERWRITE statement) into a partition of a table, then only the specified partition will be affected. Thus, if we were inserting user data and only wanted to affect the partitions with data in the source file, we could achieve this by adding the OVERWRITE keyword to our previous INSERT statement.
We can also add caveats to the SELECT statement. Say, for example, we only wanted to update data for a certain month:
INSERT INTO TABLE partitioned_user
PARTITION (created_at_date)
SELECT created_at,
user_id,
location,
name,
description,
followers_count,
friends_count,
favourites_count,
screen_name,
listed_count,
to_date(created_at) as created_at_date
FROM user
WHERE to_date(created_at) BETWEEN '2014-03-01' and '2014-03-31';
Bucketing and sorting
Partitioning a table is a construct that you take explicit advantage of by using the partition column (or columns) in the WHERE clause of queries against the tables. There is another mechanism called bucketing that can further segment how a table is stored and does so in a way that allows Hive itself to optimize its internal query plans to take advantage of the structure.
Let's create bucketed versions of our tweets and user tables; note the following additional CLUSTERED BY and SORTED BY clauses in the CREATE TABLE statements:
CREATE table bucketed_tweets (
tweet_id string,
text string,
in_reply_to string,
retweeted boolean,
user_id string,
place_id string
) PARTITIONED BY (created_at string)
CLUSTERED BY (user_ID) into 64 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE;
CREATE TABLE bucketed_user (
user_id string,
`location` string,
name string,
description string,
followers_count bigint,
friends_count bigint,
favourites_count bigint,
screen_name string,
listed_count bigint
) PARTITIONED BY (created_at string)
CLUSTERED BY (user_ID) SORTED BY (name) into 64 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE;
Note that we changed the tweets table to also be partitioned; you can only bucket a table that is partitioned.
Just as we need to specify a partition column when inserting into a partitioned table, we must also take care to ensure that data inserted into a bucketed table is correctly clustered. We do this by setting the following flag before inserting the data into the table:
SET hive.enforce.bucketing=true;
Just as with partitioned tables, you cannot apply the bucketing function when using the LOAD DATA statement; if you wish to load external data into a bucketed table, first insert it into a temporary table, and then use the INSERT … SELECT … syntax to populate the bucketed table.
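A sketch of that two-step pattern, assuming the raw data has already been loaded into the plain tweets table shown earlier and that the dynamic partitioning settings from the previous section are enabled:
SET hive.enforce.bucketing=true;
-- populate the bucketed, partitioned table from the plain table;
-- the partition column (created_at) must come last in the SELECT
INSERT INTO TABLE bucketed_tweets
PARTITION (created_at)
SELECT tweet_id, text, in_reply_to, retweeted, user_id, place_id, created_at
FROM tweets;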
When data is inserted into a bucketed table, rows are allocated to a bucket based on the result of a hash function applied to the column specified in the CLUSTERED BY clause.
One of the greatest advantages of bucketing a table comes when we need to join two tables that are similarly bucketed, as in the previous example. So, for example, any query of the following form would be vastly improved:
SET hive.optimize.bucketmapjoin=true;
SELECT …
FROM bucketed_user u JOIN bucketed_tweet t
ON u.user_id = t.user_id;
With the join being performed on the column used to bucket the table, Hive can optimize the amount of processing as it knows that each bucket contains the same set of user_id columns in both tables. While determining which rows against which to match, only those in the bucket need to be compared against, and not the whole table. This does require that the tables are both clustered on the same column and that the bucket numbers are either identical or one is a multiple of the other. In the latter case, with say one table clustered into 32 buckets and another into 64, the nature of the default hash function used to allocate data to a bucket means that the IDs in bucket 3 in the first table will cover those in both buckets 3 and 35 in the second.
Sampling data
Bucketing a table can also help while using Hive's ability to sample data in a table. Sampling allows a query to gather only a specified subset of the overall rows in the table. This is useful when you have an extremely large table with moderately consistent data patterns. In such a case, applying a query to a small fraction of the data will be much faster and will still give a broadly representative result. Note, of course, that this only applies to queries where you are looking to determine table characteristics, such as pattern ranges in the data; if you are trying to count anything, then the result needs to be scaled to the full table size.
For a non-bucketed table, you can sample in a mechanism similar to what we saw earlier by specifying that the query should only be applied to a certain subset of the table:
SELECT max(friends_count)
FROM user TABLESAMPLE(BUCKET 2 OUT OF 64 ON name);
In this query, Hive will effectively hash the rows in the table into 64 buckets based on the name column. It will then only use the second bucket for the query. Multiple buckets can be specified, and if RAND() is given as the ON clause, then the entire row is used by the bucketing function.
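For instance, a minimal sketch of sampling on the whole row rather than on a specific column:
SELECT max(friends_count)
FROM user TABLESAMPLE(BUCKET 1 OUT OF 64 ON RAND());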
Though successful, this is highly inefficient as the full table needs to be scanned to generate the required subset of data. If we sample on a bucketed table and ensure the number of buckets sampled is equal to or a multiple of the buckets in the table, then Hive will only read the buckets in question. For example:
SELECT MAX(friends_count)
FROM bucketed_user TABLESAMPLE(BUCKET 2 OUT OF 32 ON user_id);
In the preceding query against the bucketed_user table, which is created with 64 buckets on the user_id column, the sampling, since it is using the same column, will only read the required buckets. In this case, these will be buckets 2 and 34 from each partition.
A final form of sampling is block sampling. In this case, we can specify the required amount of the table to be sampled, and Hive will use an approximation of this by only reading enough source data blocks on HDFS to meet the required size. Currently, the data size can be specified as either a percentage of the table, as an absolute data size, or as a number of rows (in each block). The syntax for TABLESAMPLE is as follows, which will sample 0.5 percent of the table, 1 GB of data, or 100 rows per split, respectively:
TABLESAMPLE(0.5 PERCENT)
TABLESAMPLE(1G)
TABLESAMPLE(100 ROWS)
If these latter forms of sampling are of interest, then consult the documentation, as there are some specific limitations on the input format and file formats that are supported.
Writing scripts
We can place Hive commands in a file and run them with the -f option in the hive CLI utility:
$ cat show_tables.hql
show tables;
$ hive -f show_tables.hql
We can parameterize HiveQL statements by means of the hiveconf mechanism. This allows us to specify an environment variable name at the point it is used rather than at the point of invocation. For example:
$ cat show_tables2.hql
show tables like '${hiveconf:TABLENAME}';
$ hive -hiveconf TABLENAME=user -f show_tables2.hql
The variable can also be set within the Hive script or an interactive session:
SET TABLE_NAME='user';
The preceding hiveconf argument will add any new variables in the same namespace as the Hive configuration options. As of Hive 0.8, there is a similar option called hivevar that adds any user variables into a distinct namespace. Using hivevar, the preceding command would be as follows:
$ cat show_tables3.hql
show tables like '${hivevar:TABLENAME}';
$ hive -hivevar TABLENAME=user -f show_tables3.hql
Or we can write the command interactively:
SET hivevar:TABLE_NAME='user';
Hive and Amazon Web Services
With Elastic MapReduce as the AWS Hadoop-on-demand service, it is of course possible to run Hive on an EMR cluster. But it is also possible to use Amazon storage services, particularly S3, from any Hadoop cluster, be it within EMR or your own local cluster.
Hive and S3
As mentioned in Chapter 2, Storage, it is possible to specify a default filesystem other than HDFS for Hadoop, and S3 is one option. But it doesn't have to be an all-or-nothing thing; it is possible to have specific tables stored in S3. The data for these tables will be retrieved into the cluster to be processed, and any resulting data can either be written to a different S3 location (the same table cannot be the source and destination of a single query) or onto HDFS.
We can take a file of our tweet data and place it onto a location in S3 with a command such as the following:
$ aws s3 cp tweets.tsv s3://<bucket-name>/tweets/
We firstly need to specify the access key and secret access key that can access the bucket. This can be done in three ways:
Set fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey to the appropriate values in the Hive CLI (as shown below)
Set the same values in hive-site.xml, though note this limits use of S3 to a single set of credentials
Specify the table location explicitly in the table URL, that is, s3n://<access key>:<secret access key>@<bucket>/<path>
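For the first option, a sketch of what this looks like from the Hive CLI (the key values are placeholders):
SET fs.s3n.awsAccessKeyId=<your access key>;
SET fs.s3n.awsSecretAccessKey=<your secret access key>;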
Then we can create a table referencing this data, as follows:
CREATE table remote_tweets (
created_at string,
tweet_id string,
text string,
in_reply_to string,
retweeted boolean,
user_id string,
place_id string
) CLUSTERED BY (user_ID) into 64 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3n://<bucket-name>/tweets'
This can be an incredibly effective way of pulling S3 data into a local Hadoop cluster for processing.
Note
In order to use AWS credentials in the URI of an S3 location, regardless of how the parameters are passed, the secret and access keys must not contain /, +, =, or \ characters. If necessary, a new set of credentials can be generated from the IAM console at https://console.aws.amazon.com/iam/.
In theory, you can just leave the data in the external table and refer to it when needed to avoid WAN data transfer latencies (and costs), even though it often makes sense to pull the data into a local table and do future processing from there. If the table is partitioned, then you might find yourself retrieving a new partition each day, for example.
Hive on Elastic MapReduce
On one level, using Hive within Amazon Elastic MapReduce is just the same as everything discussed in this chapter. You can create a persistent cluster, log in to the master node, and use the Hive CLI to create tables and submit queries. Doing all this will use the local storage on the EC2 instances for the table data.
Not surprisingly, jobs on EMR clusters can also refer to tables whose data is stored on S3 (or DynamoDB). And also not surprisingly, Amazon has made extensions to its version of Hive to make all this very seamless. It is quite simple from within an EMR job to pull data from a table stored in S3, process it, write any intermediate data to the EMR local storage, and then write the output results into S3, DynamoDB, or one of a growing list of other AWS services.
The pattern mentioned earlier, where new data is added to a new partition directory for a table each day, has proved very effective in S3; it is often the storage location of choice for large and incrementally growing datasets. There is a syntax difference when using EMR; instead of the MSCK command mentioned earlier, the command to update a Hive table with new data added to a partition directory is as follows:
ALTER TABLE <table-name> RECOVER PARTITIONS;
Consult the EMR documentation for the latest enhancements at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hive-additional-features.html. Also, consult the broader EMR documentation. In particular, the integration points with other AWS services is an area of rapid growth.
Extending HiveQL
The HiveQL language can be extended by means of plugins and third-party functions. In Hive, there are three types of functions, characterized by the number of rows they take as input and produce as output:
User Defined Functions (UDFs): simpler functions that act on one row at a time
User Defined Aggregate Functions (UDAFs): take multiple rows as input and generate a single row as output. These are aggregate functions to be used in conjunction with a GROUP BY statement (similar to COUNT(), AVG(), MIN(), MAX(), and so on)
User Defined Table Functions (UDTFs): generate a logical table comprised of multiple rows that can be used in join expressions
Tip
These APIs are provided only in Java. For other languages, it is possible to stream data through a user-defined script using the TRANSFORM, MAP, and REDUCE clauses that act as a frontend to Hadoop's streaming capabilities.
Two APIs are available to write UDFs. A simple API, org.apache.hadoop.hive.ql.exec.UDF, can be used for functions that take and return basic writable types. A richer API, which provides support for data types other than writable, is available in the org.apache.hadoop.hive.ql.udf.generic.GenericUDF package. We'll now illustrate how org.apache.hadoop.hive.ql.exec.UDF can be used to implement a string-to-ID function similar to the one we used in Chapter 5, Iterative Computation with Spark, to map hashtags to integers in Pig. Building a UDF with this API only requires extending the UDF class and writing an evaluate() method, as follows:
public class StringToInt extends UDF {
    public Integer evaluate(Text input) {
        if (input == null)
            return null;
        String str = input.toString();
        return str.hashCode();
    }
}
The function takes a Text object as input and maps it to an integer value with the hashCode() method. The source code of this function can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch7/udf/com/learninghadoop2/hive/udf/StringToInt.java.
Tip
As noted in Chapter 6, Data Analysis with Apache Pig, a more robust hash function should be used in production.
We compile the class and archive it into a JAR file, as follows:
$ javac -classpath $(hadoop classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/* com/learninghadoop2/hive/udf/StringToInt.java
$ jar cvf myudfs-hive.jar com/learninghadoop2/hive/udf/StringToInt.class
Before being able to use it, a UDF must be registered in Hive with the following commands:
ADD JAR myudfs-hive.jar;
CREATE TEMPORARY FUNCTION string_to_int AS
'com.learninghadoop2.hive.udf.StringToInt';
The ADD JAR statement adds a JAR file to the distributed cache. The CREATE TEMPORARY FUNCTION <function> AS <class> statement registers a function in Hive that implements a given Java class. The function will be dropped once the Hive session is closed. As of Hive 0.13, it is possible to create permanent functions whose definition is kept in the metastore using CREATE FUNCTION ….
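A sketch of the permanent form, assuming the JAR has been copied to HDFS first (the path is a placeholder):
CREATE FUNCTION string_to_int
AS 'com.learninghadoop2.hive.udf.StringToInt'
USING JAR 'hdfs:///user/hive/jars/myudfs-hive.jar';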
Once registered, StringToInt can be used in a query just like any other function. In the following example, we first extract a list of hashtags from the tweet's text by applying regexp_extract. Then, we use string_to_int to map each tag to a numerical ID:
SELECT unique_hashtags.hashtag, string_to_int(unique_hashtags.hashtag) AS tag_id FROM
(
SELECT regexp_extract(text,
'(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)') as hashtag
FROM tweets
GROUP BY regexp_extract(text,
'(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)')
) unique_hashtags GROUP BY unique_hashtags.hashtag,
string_to_int(unique_hashtags.hashtag);
Just as we did in the previous chapter, we can use the preceding query to create a lookup table:
CREATE TABLE lookuptable (tag string, tag_id bigint);
INSERT OVERWRITE TABLE lookuptable
SELECT unique_hashtags.hashtag,
string_to_int(unique_hashtags.hashtag) as tag_id
FROM
(
SELECT regexp_extract(text,
'(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)') AS hashtag
FROM tweets
GROUP BY regexp_extract(text,
'(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)')
) unique_hashtags
GROUP BY unique_hashtags.hashtag, string_to_int(unique_hashtags.hashtag);
Programmatic interfaces
In addition to the hive and beeline command-line tools, it is possible to submit HiveQL queries to the system via the JDBC and Thrift programmatic interfaces. Support for ODBC was bundled in older versions of Hive, but as of Hive 0.12, it needs to be built from scratch. More information on this process can be found at https://cwiki.apache.org/confluence/display/Hive/HiveODBC.
JDBC
A Hive client written using JDBC APIs looks exactly the same as a client program written for other database systems (for example, MySQL). The following is a sample Hive client program using JDBC APIs. The source code for this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch7/clients/com/learninghadoop2/hive/client/HiveJdbcClient.java.
public class HiveJdbcClient {
    private static String driverName = "org.apache.hive.jdbc.HiveDriver";

    // connection string
    public static String URL = "jdbc:hive2://localhost:10000";

    // Show all tables in the default database
    public static String QUERY = "show tables";

    public static void main(String[] args) throws SQLException {
        try {
            Class.forName(driverName);
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
            System.exit(1);
        }
        Connection con = DriverManager.getConnection(URL);
        Statement stmt = con.createStatement();
        ResultSet resultSet = stmt.executeQuery(QUERY);
        while (resultSet.next()) {
            System.out.println(resultSet.getString(1));
        }
    }
}
The URL part is the JDBC URI that describes the connection endpoint. The format for establishing a remote connection is jdbc:hive2://<host>:<port>/<database>. Connections in embedded mode can be established by not specifying a host or port, like jdbc:hive2://.
hive and hive2 are the drivers to be used when connecting to HiveServer and HiveServer2. QUERY contains the HiveQL query to be executed.
Tip
Hive's JDBC interface exposes only the default database. In order to access other databases, you need to reference them explicitly in the underlying queries using the <database>.<table> notation.
First we load the HiveServer2 JDBC driver org.apache.hive.jdbc.HiveDriver.
Tip
Use org.apache.hadoop.hive.jdbc.HiveDriver to connect to HiveServer.
Then, like with any other JDBC program, we establish a connection to URL and use it to instantiate a Statement class. We execute QUERY, with no authentication, and store the output dataset into the ResultSet object. Finally, we scan resultSet and print its content to the command line.
Compile and execute the example with the following commands:
$ javac HiveJdbcClient.java
$ java -cp $(hadoop classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/*:/opt/cloudera/parcels/CDH/lib/hive/lib/hive-jdbc.jar: com.learninghadoop2.hive.client.HiveJdbcClient
Thrift
Thrift provides lower-level access to Hive and has a number of advantages over the JDBC implementation of HiveServer. Primarily, it allows multiple connections from the same client, and it allows programming languages other than Java to be used with ease. With HiveServer2, it is a less commonly used option but still worth mentioning for compatibility. A sample Thrift client implemented using the Java API can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch7/clients/com/learninghadoop2/hive/client/HiveThriftClient.java. This client can be used to connect to HiveServer, but due to protocol differences, the client won't work with HiveServer2.
In the example, we define a getClient() method that takes as input the host and port of a HiveServer service and returns an instance of org.apache.hadoop.hive.service.ThriftHive.Client.
A client is obtained by first instantiating a socket connection, org.apache.thrift.transport.TSocket, to the HiveServer service, and by specifying a protocol, org.apache.thrift.protocol.TBinaryProtocol, to serialize and transmit data, as follows:
TSocket transport = new TSocket(host, port);
transport.setTimeout(TIMEOUT);
transport.open();
TBinaryProtocol protocol = new TBinaryProtocol(transport);
client = new ThriftHive.Client(protocol);
We call getClient() from the main method and use the client to execute a query against an instance of HiveServer running on localhost on port 11111, as follows:
public static void main(String[] args) throws Exception {
    Client client = getClient("localhost", 11111);
    client.execute("show tables");
    List<String> results = client.fetchAll();
    for (String result : results) {
        System.out.println(result);
    }
}
Make sure that HiveServer is running on port 11111, and if not, start an instance with the following command:
$ sudo hive --service hiveserver -p 11111
Compile and execute the HiveThriftClient.java example with:
$ javac -classpath $(hadoop classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/* com/learninghadoop2/hive/client/HiveThriftClient.java
$ java -cp $(hadoop classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/*: com.learninghadoop2.hive.client.HiveThriftClient
Stinger initiative
Hive has remained very successful and capable since its earliest releases, particularly in its ability to provide SQL-like processing on enormous datasets. But other technologies did not stand still, and Hive acquired a reputation of being relatively slow, particularly in regard to lengthy startup times on large jobs and its inability to give quick responses to conceptually simple queries.
These perceived limitations were less due to Hive itself and more a consequence of how translation of SQL queries into the MapReduce model has much built-in inefficiency when compared to other ways of implementing a SQL query. Particularly in regard to very large datasets, MapReduce saw lots of I/O (and consequently time) spent writing out the results of one MapReduce job just to have them read by another. As discussed in Chapter 3, Processing – MapReduce and Beyond, this is a major driver in the design of Tez, which can schedule jobs on a Hadoop cluster as a graph of tasks that does not require inefficient writes and reads between them.
The following is a query to compare on the MapReduce framework versus Tez:
SELECT a.country, COUNT(b.place_id) FROM place a JOIN tweets b ON (a.place_id = b.place_id) GROUP BY a.country;
The following figure contrasts the execution plan for the preceding query on the MapReduce framework versus Tez:
Hive on MapReduce versus Tez
In plain MapReduce, two jobs are created for the GROUP BY and JOIN clauses. The first job is composed of a set of MapReduce tasks that read data from the disk to carry out grouping. The reducers write intermediate results to the disk so that output can be synchronized. The mappers in the second job read the intermediate results from the disk as well as data from table b. The combined dataset is then passed to the reducer where shared keys are joined. Were we to execute an ORDER BY statement, this would have resulted in a third job and further MapReduce passes. The same query is executed on Tez as a single job by a single set of Map tasks that read data from the disk. I/O, grouping, and joining are pipelined across reducers.
Alongside these architectural limitations, there were quite a few areas around SQL language support that could also provide better efficiency, and in early 2013, the Stinger initiative was launched with an explicit goal of making Hive over 100 times as fast and with much richer SQL support. Hive 0.13 has all the features of the three phases of Stinger, resulting in a much more complete SQL dialect. Also, Tez is offered as an execution framework in addition to a MapReduce-based implementation atop YARN, which is more efficient than previous implementations on Hadoop 1 MapReduce.
With Tez as the execution engine, Hive is no longer limited to a series of linear MapReduce jobs and can instead build a processing graph where any given step can, for example, stream results to multiple sub-steps.
To take advantage of the Tez framework, there is a new hive variable setting:
set hive.execution.engine=tez;
This setting relies on Tez being installed on the cluster; it is available in source form from http://tez.apache.org or in several distributions, though at the time of writing, not Cloudera.
The alternative value is mr, which uses the classic MapReduce model (atop YARN), so it is possible in a single installation to compare the performance of Hive with and without Tez.
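A quick sketch of how the two engines can be compared from the same session, assuming Tez is installed, is to run the earlier query under each setting:
set hive.execution.engine=mr;
SELECT a.country, COUNT(b.place_id) FROM place a JOIN tweets b
ON (a.place_id = b.place_id) GROUP BY a.country;
set hive.execution.engine=tez;
SELECT a.country, COUNT(b.place_id) FROM place a JOIN tweets b
ON (a.place_id = b.place_id) GROUP BY a.country;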
Impala
Hive is not the only product providing SQL-on-Hadoop capability. The second most widely used is likely Impala, announced in late 2012 and released in spring 2013. Though originally developed internally within Cloudera, its source code is periodically pushed to an open source Git repository (https://github.com/cloudera/impala).
Impala was created out of the same perception of Hive's weaknesses that led to the Stinger initiative.
Impala also took some inspiration from Google Dremel (http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf), which was first openly described by a paper published in 2009. Dremel was built at Google to address the gap between the need for very fast queries on very large datasets and the high latency inherent in the existing MapReduce model underpinning Hive at the time. Dremel was a sophisticated approach to this problem that, rather than building mitigations atop MapReduce such as those implemented by Hive, instead created a new service that accessed the same data stored in HDFS. Dremel also benefited from significant work to optimize the storage format of the data in a way that made it more amenable to very fast analytic queries.
The architecture of Impala
The basic architecture has three main components: the Impala daemons, the statestore, and the clients. Recent versions have added additional components that improve the service, but we'll focus on the high-level architecture.
The Impala daemon (impalad) should be run on each host where a DataNode process is managing HDFS data. Note that impalad does not access the filesystem blocks through the full HDFS FileSystem API; instead, it uses a feature called short-circuit reads to make data access more efficient.
When a client submits a query, it can do so to any of the running impalad processes, and this one will become the coordinator for the execution of that query. The key aspect of Impala's performance is that for each query, it generates custom native code, which is then pushed to and executed by all the impalad processes on the system. This highly optimized code performs the query on the local data, and each impalad then returns its subset of the result set to the coordinator node, which performs the final data consolidation to produce the final result. This type of architecture should be familiar to anyone who has worked with any of the (usually commercial and expensive) Massively Parallel Processing (MPP) data warehouse solutions available today; MPP is the term used for this type of shared scale-out architecture. As the cluster runs, the statestore daemon ensures that each impalad process is aware of all the others and provides a view of the overall cluster health.
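As a minimal sketch of how this looks from the client side (the host name is an assumption for illustration; 21000 is the default impalad port), a query can be submitted to any impalad with impala-shell, which then makes that daemon the coordinator for the statement:
$ impala-shell -i impala-host.example.com:21000 -q "SELECT COUNT(*) FROM tweets"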
Co-existing with Hive
Impala, as a newer product, tends to have a more restricted set of SQL data types and supports a more constrained dialect of SQL than Hive. It is, however, expanding this support with each new release. Refer to the Impala documentation (http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/Impala/impala.html) to get an overview of the current level of support.
Impala supports the Hive metastore mechanism used by Hive to persistently store the metadata surrounding its table structure and storage. This means that on a cluster with an existing Hive setup, it should be immediately possible to use Impala as it will access the same metastore and therefore provide access to the same tables available in Hive.
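For example, the following sketch (the table name is an assumption for illustration) creates a table through Hive and then queries it from Impala; because each impalad caches metastore information, an INVALIDATE METADATA statement is needed before Impala sees the newly created table:
$ hive -e "CREATE TABLE shared_demo (id STRING, msg STRING)"
$ impala-shell -q "INVALIDATE METADATA shared_demo"
$ impala-shell -q "SELECT COUNT(*) FROM shared_demo"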
But be warned that the differences in SQL dialect and data types might cause unexpected results when working in a combined Hive and Impala environment. Some queries might work on one but not the other, they might show very different performance characteristics (more on this later), or they might actually give different results. This last point might become apparent when using data types such as float and double that are simply treated differently in the underlying systems (Hive is implemented in Java while Impala is written in C++).
As of version 1.2, Impala supports UDFs written both in C++ and Java, although C++ is strongly recommended as a much faster solution. Keep this in mind if you are looking to share custom functions between Hive and Impala.
A different philosophy
When Impala was first released, its greatest benefit was in how it truly enabled what is often called speed-of-thought analysis. Queries could be returned sufficiently fast that an analyst could explore a thread of analysis in a completely interactive fashion without having to wait for minutes at a time for each query to complete. It's fair to say that most adopters of Impala were at times stunned by its performance, especially when compared to the version of Hive shipping at the time.
The Impala focus has remained mostly on these shorter queries, and this does impose some limitations on the system. Impala tends to be quite memory-heavy as it relies on in-memory processing to achieve much of its performance. If a query requires a dataset to be held in memory that is larger than what is available on the executing node, then that query will simply fail in versions of Impala before 2.0.
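One mitigation is to cap the memory a single query may use so that an oversized query fails quickly rather than destabilizing the node. The following is a minimal sketch (the limit value is an arbitrary assumption, the prompt is abbreviated, and the exact behavior of the MEM_LIMIT query option varies between Impala versions):
$ impala-shell
> SET MEM_LIMIT=2000000000;
> SELECT a.country, COUNT(b.place_id) FROM place a JOIN tweets b ON (a.place_id = b.place_id) GROUP BY a.country;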
Comparing the work on Stinger to Impala, it could be argued that Impala has a much stronger focus on excelling in the shorter (and arguably more common) queries that support interactive data analysis. Many business intelligence tools and services are now certified to run directly on Impala. The Stinger initiative has put less effort into making Hive just as fast in the area where Impala excels but has instead improved Hive (to varying degrees) for all workloads. Impala is still developing at a fast pace and Stinger has put additional momentum into Hive, so it is most likely wise to consider both products and determine which best meets the performance and functionality requirements of your projects and workflows.
It should also be kept in mind that there are competitive commercial pressures shaping the direction of Impala and Hive. Impala was created and is still driven by Cloudera, the most popular vendor of Hadoop distributions. The Stinger initiative, though contributed to by many companies as diverse as Microsoft (yes, really!) and Intel, was led by Hortonworks, probably the second largest vendor of Hadoop distributions. The fact is that if you are using the Cloudera distribution of Hadoop, then some of the core features of Hive might be slower to arrive, whereas Impala will always be up-to-date. Conversely, if you use another distribution, you might get the latest Hive release, but that might either have an older Impala or, as is currently the case, you might have to download and install it yourself.
A similar situation has arisen with the Parquet and ORC file formats mentioned earlier. Parquet is preferred by Impala and developed by a group of companies led by Cloudera, while ORC is preferred by Hive and is championed by Hortonworks.
Unfortunately, the reality is that Parquet support is often very quick to arrive in the Cloudera distribution but less so in, say, the Hortonworks distribution, where the ORC file format is preferred.
These themes are a little concerning since, although competition in this space is a good thing, and arguably the announcement of Impala helped energize the Hive community, there is a greater risk that your choice of distribution might have a larger impact on the tools and file formats that will be fully supported, unlike in the past. Hopefully, the current situation is just an artifact of where we are in the development cycles of all these new and improved technologies, but do consider your choice of distribution carefully in relation to your SQL-on-Hadoop needs.
Drill, Tajo, and beyond
You should also consider that SQL on Hadoop no longer only refers to Hive or Impala. Apache Drill (http://drill.apache.org) is a fuller implementation of the Dremel model first described by Google. Although Impala implements the Dremel architecture across HDFS data, Drill looks to provide similar functionality across multiple data sources. It is still in its early stages, but if your needs are broader than what Hive or Impala provides, it might be worth considering.
Tajo (http://tajo.apache.org) is another Apache project that seeks to be a full data warehouse system on Hadoop data. With an architecture similar to that of Impala, it offers a much richer system with components such as multiple optimizers and ETL tools that are commonplace in traditional data warehouses but less frequently bundled in the Hadoop world. It has a much smaller user base but has been used by certain companies very successfully for a significant length of time, and might be worth considering if you need a fuller data warehousing solution.
Other products are also emerging in this space, and it's a good idea to do some research. Hive and Impala are awesome tools, but if you find that they don't meet your needs, then look around; something else might.
Summary
In its early days, Hadoop was sometimes erroneously seen as the latest supposed relational database killer. Over time, it has become more apparent that the more sensible approach is to view it as a complement to RDBMS technologies and that, in fact, the RDBMS community has developed tools such as SQL that are also valuable in the Hadoop world.
HiveQL is an implementation of SQL on Hadoop and was the primary focus of this chapter. In regard to HiveQL and its implementations, we covered the following topics:
How HiveQL provides a logical model atop data stored in HDFS, in contrast to relational databases where the table structure is enforced in advance
How HiveQL supports many standard SQL data types and commands, including joins and views
The ETL-like features offered by HiveQL, including the ability to import data into tables and optimize the table structure through partitioning and similar mechanisms
How HiveQL offers the ability to extend its core set of operators with user-defined code, and how this contrasts with the Pig UDF mechanism
The recent history of Hive developments, such as the Stinger initiative, that have seen Hive transition to an updated implementation that uses Tez
The broader ecosystem around HiveQL that now includes products such as Impala, Tajo, and Drill, and how each of these focuses on specific areas in which to excel
With Pig and Hive, we've introduced alternative models to process MapReduce data, but so far we've not looked at another question: what approaches and tools are required to actually allow this massive dataset being collected in Hadoop to remain useful and manageable over time? In the next chapter, we'll take a slight step up the abstraction hierarchy and look at how to manage the lifecycle of this enormous data asset.
Chapter 8. Data Lifecycle Management
Our previous chapters were quite technology focused, describing particular tools or techniques and how they can be used. In this and the next chapter, we are going to take a more top-down approach whereby we will describe a problem space you are likely to encounter and then explore how to address it. In particular, we'll cover the following topics:
What we mean by the term data lifecycle management
Why data lifecycle management is something to think about
The categories of tools that can be used to address the problem
How to use these tools to build the first half of a Twitter sentiment analysis pipeline
What data lifecycle management is
Data doesn't exist only at a point in time. Particularly for long-running production workflows, you are likely to acquire a significant quantity of data in a Hadoop cluster. Requirements rarely stay static for long, so alongside new logic you might also see the format of that data change or require multiple data sources to be used to provide the dataset processed in your application. We use the term data lifecycle management to describe an approach to handling the collection, storage, and transformation of data that ensures that data is where it needs to be, in the format it needs to be in, in a way that allows data and system evolution over time.
Importance of data lifecycle management
If you build data processing applications, you are by definition reliant on the data that is processed. Just as we consider the reliability of applications and systems, it becomes necessary to ensure that the data is also production-ready.
Data at some point needs to be ingested into Hadoop. It is one part of an enterprise and often has multiple points of integration with external systems. If the ingest of data coming from those systems is not reliable, then the impact on the jobs that process that data is often as disruptive as a major system failure. Data ingest becomes a critical component in its own right. And when we say the ingest needs to be reliable, we don't just mean that data is arriving; it also has to be arriving in a format that is usable and through a mechanism that can handle evolution over time.
The problem with many of these issues is that they do not arise in a significant fashion until the flows are large, the system is critical, and the business impact of any problems is non-trivial. Ad hoc approaches that worked for a less critical data flow often will simply not scale, but will be very painful to replace on a live system.
Tools to help
But don't panic! There are a number of categories of tools that can help with the data lifecycle management problem. We'll give examples of the following three broad categories in this chapter:
Orchestration services: building an ingest pipeline usually has multiple discrete stages, and we will use an orchestration tool to allow these to be described, executed, and managed
Connectors: given the importance of integration with external systems, we will look at how we can use connectors to simplify the abstractions provided by Hadoop storage
File formats: how we store the data impacts how we manage format evolution over time, and several rich storage formats have ways of supporting this
Building a tweet analysis capability
In earlier chapters, we used various implementations of Twitter data analysis to describe several concepts. We will take this capability to a deeper level and approach it as a major case study.
In this chapter, we will build a data ingest pipeline, constructing a production-ready data flow that is designed with reliability and future evolution in mind.
We'll build out the pipeline incrementally throughout the chapter. At each stage, we'll highlight what has changed but can't include full listings at each stage without trebling the size of the chapter. The source code for this chapter, however, has every iteration in its full glory.
Getting the tweet data
The first thing we need to do is get the actual tweet data. As in previous examples, we can pass the -j and -n arguments to stream.py to dump JSON tweets to stdout:
$ stream.py -j -n 10000 > tweets.json
Since we have this tool that can create a batch of sample tweets on demand, we could start our ingest pipeline by having this job run on a periodic basis. But how?
Introducing Oozie
We could, of course, bang rocks together and use something like cron for simple job scheduling, but recall that we want an ingest pipeline that is built with reliability in mind. So, we really want a scheduling tool that we can use to detect failures and otherwise respond to exceptional situations.
The tool we will use here is Oozie (http://oozie.apache.org), a workflow engine and scheduler built with a focus on the Hadoop ecosystem.
Oozie provides a means to define a workflow as a series of nodes with configurable parameters and controlled transition from one node to the next. It is installed as part of the Cloudera QuickStart VM, and the main command-line client is, not surprisingly, called oozie.
Note
We've tested the workflows in this chapter against version 5.0 of the Cloudera QuickStart VM, and at the time of writing, Oozie in the latest version of the VM, 5.1, has some issues. There's nothing particularly version-specific in our workflows, however, so they should be compatible with any correctly working Oozie v4 implementation.
Though powerful and flexible, Oozie can take a little getting used to, so we'll give some examples and describe what we are doing along the way.
The most common node in an Oozie workflow is an action. It is within action nodes that the steps of the workflow are actually executed; the other node types handle management of the workflow in terms of decisions, parallelism, and failure detection. Oozie has multiple types of actions that it can perform. One of these is the shell action, which can be used to execute any command on the system, such as native binaries, shell scripts, or any other command-line utility. Let's create a script to generate a file of tweets and copy this to HDFS:
set -e
source twitter.keys
python stream.py -j -n 500 > /tmp/tweets.out
hdfs dfs -put /tmp/tweets.out /tmp/tweets/tweets.out
rm -f /tmp/tweets.out
Note that the first line will cause the entire script to fail should any of the included commands fail. We use an environment file to provide the Twitter keys to our script in twitter.keys, which is of the following form:
export TWITTER_CONSUMER_KEY=<value>
export TWITTER_CONSUMER_SECRET=<value>
export TWITTER_ACCESS_KEY=<value>
export TWITTER_ACCESS_SECRET=<value>
Oozie uses XML to describe its workflows, usually stored in a file called workflow.xml. Let's walk through the definition for an Oozie workflow that calls a shell command.
The schema for an Oozie workflow is called workflow-app, and we can give the workflow a specific name. This is useful when viewing job history in the CLI or Oozie web UI. In the examples in this book, we'll use an increasing version number to allow us to more easily separate the iterations within the source repository. This is how we give the workflow-app a specific name:
<workflow-app xmlns="uri:oozie:workflow:0.4" name="v1">
Oozie workflows are made up of a series of connected nodes, each of which represents a step in the process, and which are represented by XML nodes in the workflow definition. Oozie has a number of nodes that deal with the transition of the workflow from one step to the next. The first of these is the start node, which simply states the name of the first node to be executed as part of the workflow, as follows:
<start to="fs-node"/>
We then have the definition for the named start node. In this case, it is an action node, which is the generic node type for most Oozie nodes that actually perform some processing, as follows:
<action name="fs-node">
Action is a broad category of nodes, and we will typically then specialize it with the particular processing for this given node. In this case, we are using the fs node type, which allows us to perform filesystem operations:
<fs>
We want to ensure that the directory on HDFS to which we wish to copy the file of tweet data exists, is empty, and has suitable permissions. We do this by trying to delete the directory if it exists, then creating it, and finally applying the required permissions, as follows:
  <delete path="${nameNode}/tmp/tweets"/>
  <mkdir path="${nameNode}/tmp/tweets"/>
  <chmod path="${nameNode}/tmp/tweets" permissions="777"/>
</fs>
We’llseeanalternativewayofsettingupdirectorieslater.Afterperformingthefunctionalityofthenode,Oozieneedsknowhowtoproceedwiththeworkflow.Inmostcases,thiswillcomprisemovingtoanotheractionnodeifthisnodewassuccessfulandabortingtheworkflowotherwise.Thisisspecifiedbythenextelements.Theoknodegivesthenameofthenodetowhichtotransitioniftheexecutionwassuccessful;theerrornodenamesthedestinationnodeforfailurescenarios.Here’showtheokandfailnodesareused:
<okto="shell-node"/>
<errorto="fail"/>
</action>
<actionname="shell-node">
The second action node is again specialized with its specific processing type; in this case, we have a shell node:
<shell xmlns="uri:oozie:shell-action:0.2">
The shell action then has the Hadoop JobTracker and NameNode locations specified. Note that the actual values are given by variables; we'll explain where they come from later. The JobTracker and NameNode are specified as follows:
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
As mentioned in Chapter 3, Processing – MapReduce and Beyond, MapReduce uses multiple queues to provide support for different approaches to resource scheduling. The next element specifies the MapReduce queue to which the workflow should be submitted:
<configuration>
  <property>
    <name>mapred.job.queue.name</name>
    <value>${queueName}</value>
  </property>
</configuration>
Now that the shell node is fully configured, we can specify the command to invoke, again via a variable, as follows:
<exec>${EXEC}</exec>
The various steps of Oozie workflows are executed as MapReduce jobs. This shell action will, therefore, be executed as a specific task instance on a particular TaskTracker. We, therefore, need to specify which files need to be copied to the local working directory on the TaskTracker machine before the action can be performed. In this case, we need to copy the main shell script, the Python tweet generator, and the Twitter config file, as follows:
<file>${workflowRoot}/${EXEC}</file>
<file>${workflowRoot}/twitter.keys</file>
<file>${workflowRoot}/stream.py</file>
After closing the shell element, we again specify what to do depending on whether the action completed successfully or not. Because MapReduce is used for job execution, the majority of node types by definition have built-in retry and recovery logic, though this is not the case for shell nodes:
</shell>
<ok to="end"/>
<error to="fail"/>
</action>
If the workflow fails, let's just kill it in this case. The kill node type does exactly that: it terminates the workflow, preventing it from proceeding to any further steps, usually logging error messages along the way. Here's how the kill node type is used:
<kill name="fail">
  <message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
The end node, on the other hand, simply halts the workflow and logs it as a successful completion within Oozie:
<end name="end"/>
</workflow-app>
The obvious question is what the preceding variables represent and from where they get their concrete values. The preceding variables are examples of the Oozie Expression Language, often referred to as EL.
Alongside the workflow definition file (workflow.xml), which describes the steps in the flow, we also need to create a configuration file that gives the specific values for a given execution of the workflow. This separation of functionality and configuration allows us to write workflows that can be used on different clusters, on different file locations, or with different variable values without having to recreate the workflow itself. By convention, this file is usually named job.properties. For the preceding workflow, here's a sample job.properties file.
Firstly, we specify the location of the JobTracker, the NameNode, and the MapReduce queue to which to submit the workflow. The following should work on the Cloudera 5.0 QuickStart VM, though in v5.1 the hostname has been changed to quickstart.cloudera. The important thing is that the specified NameNode and JobTracker addresses need to be in the Oozie whitelist; the local services on the VM are added automatically:
jobTracker=localhost.localdomain:8032
nameNode=hdfs://localhost.localdomain:8020
queueName=default
Next, we set some values for where the workflow definitions and associated files can be found on the HDFS filesystem. Note the use of a variable representing the username running the job. This allows a single workflow to be applied to different paths depending on the submitting user, as follows:
tasksRoot=book
workflowRoot=${nameNode}/user/${user.name}/${tasksRoot}/v1
oozie.wf.application.path=${nameNode}/user/${user.name}/${tasksRoot}/v1
Next, we name the command to be executed in the workflow as ${EXEC}:
EXEC=gettweets.sh
More complex workflows will require additional entries in the job.properties file; the preceding workflow is as simple as it gets.
The oozie command-line tool needs to know where the Oozie server is running. This can be added as an argument to every Oozie shell command, but that gets unwieldy very quickly. Instead, you can set the shell environment variable, as follows:
$ export OOZIE_URL='http://localhost:11000/oozie'
After all that work, we can now actually run an Oozie workflow. Create a directory on HDFS as specified in the values in the job.properties file. In the preceding configuration, we'd be creating this as book/v1 under our home directory on HDFS. Copy the stream.py, gettweets.sh, and twitter.keys files to that directory; these are the files required to perform the actual execution of the shell command. Then, add the workflow.xml file to the same directory.
To run the workflow, we then do the following:
$ oozie job -run -config <path-to-job.properties>
If submitted successfully, Oozie will print the job name to the screen. You can see the current status of this workflow with:
$ oozie job -info <job-id>
You can also check the logs for the job:
$ oozie job -log <job-id>
In addition, all current and recent jobs can be viewed with:
$ oozie jobs
A note on HDFS file permissions
There is a subtle aspect in the shell command that can catch the unwary. As an alternative to having the fs node, we could instead include a prepare element within the shell node to create the directory we need on the filesystem. It would look like the following:
<prepare>
  <mkdir path="${nameNode}/tmp/tweets"/>
</prepare>
The prepare stage is executed by the user who submitted the workflow, but since the actual script execution is performed on YARN, it is usually executed as the yarn user. You might hit a problem where the script generates the tweets, the /tmp/tweets directory is created on HDFS, but the script then fails to have permission to write to that directory. You can either resolve this through assigning permissions more precisely or, as shown earlier, add a filesystem node to encapsulate the needed operations. We'll use a mixture of both techniques in this chapter; for non-shell nodes, we'll use prepare elements, particularly if the needed directory is manipulated only by that node. For cases where a shell node is involved or where the created directories will be used across multiple nodes, we'll be safe and use the more explicit fs node.
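As a quick illustration of the more precise approach (a minimal sketch; the paths match those used in the workflow above), the directory's ownership and permissions can be inspected and adjusted from the command line before re-running the job:
$ hdfs dfs -ls /tmp/tweets
$ hdfs dfs -chmod 777 /tmp/tweets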
Making development a little easier
It can sometimes get awkward to manage the files and resources for an Oozie job during development. Some need to be on HDFS, while some need to be local, and changes to some files require changes to others. The easiest approach is often to develop or make changes in a complete clone of the workflow directory on the local filesystem and push changes from there to the similarly named directory in HDFS, not forgetting, of course, to ensure that all changes are under revision control! For operational execution of the workflow, the job.properties file is the only thing that needs to be on the local filesystem and, conversely, all the other files need to be on HDFS. Always remember this: it's all too easy to make changes to a local copy of a workflow, forget to push the changes to HDFS, and then be confused as to why the workflow isn't reflecting the changes.
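The following is a minimal sketch of such a push step, assuming the local clone lives in a directory named v1 and the target matches the workflowRoot used earlier; the first command will complain harmlessly if the remote directory does not exist yet:
$ hdfs dfs -rm -r -skipTrash book/v1
$ hdfs dfs -put v1 book/v1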
Extracting data and ingesting into Hive
With our data on HDFS, we can now extract the separate datasets for tweets, users, and place data, as in previous chapters. We can reuse extract_for_hive.pig to parse the raw tweet JSON into separate files, store them again on HDFS, and then follow up with a Hive step that ingests these new files into Hive tables for tweets, users, and places.
To do this within Oozie, we'll need to add two new nodes to our workflow: a Pig action for the first step and a Hive action for the second.
For our Hive action, we'll just create three external tables that point to the files generated by Pig. This would then allow us to follow our previously described model of ingesting into temporary or external tables and using HiveQL INSERT statements from there to insert into the operational, and often partitioned, tables. This create.hql script can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch8/v2/hive/create.hql but is simply of the following form:
CREATE DATABASE IF NOT EXISTS twttr;
USE twttr;
DROP TABLE IF EXISTS tweets;
CREATE EXTERNAL TABLE tweets (
...
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE
LOCATION '${ingestDir}/tweets';
DROP TABLE IF EXISTS user;
CREATE EXTERNAL TABLE user (
...
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE
LOCATION '${ingestDir}/users';
DROP TABLE IF EXISTS place;
CREATE EXTERNAL TABLE place (
...
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE
LOCATION '${ingestDir}/places';
Note that the file separator on each table is also explicitly set to match what we are outputting from Pig. In addition to this, locations in both scripts are specified by variables for which we will provide concrete values in our job.properties file.
With the preceding statements, we can create the Pig node for our workflow, found in the source code as v2 of the pipeline. Much of the node definition looks similar to the shell node used previously, as we set the same configuration elements; also notice our use of the prepare element to create the needed output directory. We can create the Pig node for our workflow as shown in the following action:
<action name="pig-node">
  <pig>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <prepare>
      <delete path="${nameNode}/${outputDir}"/>
      <mkdir path="${nameNode}/${outputDir}"/>
    </prepare>
    <configuration>
      <property>
        <name>mapred.job.queue.name</name>
        <value>${queueName}</value>
      </property>
    </configuration>
Similarly as with the shell command, we need to tell the Pig action the location of the actual Pig script. This is specified in the following script element:
    <script>${workflowRoot}/pig/extract_for_hive.pig</script>
We also need to modify the command line used to invoke the Pig script to add several parameters. The following elements do this; note the construction pattern wherein one element adds the actual parameter name and the next its value (we'll see an alternative mechanism for passing arguments in the next section):
    <argument>-param</argument>
    <argument>inputDir=${inputDir}</argument>
    <argument>-param</argument>
    <argument>outputDir=${outputDir}</argument>
  </pig>
Because we want to move from this step to the Hive node, we need to set the following elements appropriately:
  <ok to="hive-node"/>
  <error to="fail"/>
</action>
The Hive action itself is a little different than the previous nodes; even though it starts in a similar fashion, it specifies the Hive action-specific namespace, as follows:
<action name="hive-node">
  <hive xmlns="uri:oozie:hive-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
The Hive action needs many of the configuration elements used by Hive itself and, in most cases, we copy the hive-site.xml file into the workflow directory and specify its location, as shown in the following XML; note that this mechanism is not Hive-specific and can also be used for custom actions:
    <job-xml>${workflowRoot}/hive-site.xml</job-xml>
In addition, we might need to override some MapReduce default configuration properties, as shown in the following XML, where we specify that intermediate compression should be used for our job:
    <configuration>
      <property>
        <name>mapred.compress.map.output</name>
        <value>true</value>
      </property>
    </configuration>
After configuring the Hive environment, we now specify the location of the Hive script:
    <script>${workflowRoot}/hive/create.hql</script>
We also have to provide the mechanism to pass arguments to the Hive script. But instead of building out the command line one component at a time, we'll add the param elements that map the name of a configuration element in the job.properties file to variables specified in the Hive script; this mechanism is also supported with Pig actions:
    <param>dbName=${dbName}</param>
    <param>ingestDir=${ingestDir}</param>
  </hive>
The Hive node then closes as the others, as follows:
  <ok to="end"/>
  <error to="fail"/>
</action>
We now need to put all this together to run the multistage workflow in Oozie. The full workflow.xml file can be found at https://github.com/learninghadoop2/book-examples/tree/master/ch8/v2 and the workflow is visualized in the following diagram:
Data ingestion workflow v2
This workflow performs all the steps discussed before; it generates tweet data, extracts subsets of data via Pig, and then ingests these into Hive.
A note on workflow directory structure
We now have quite a few files in our workflow directory and it is best to adopt some structure and naming conventions. For the current workflow, our directory on HDFS looks like the following:
/hive/
/hive/create.hql
/lib/
/pig/
/pig/extract_for_hive.pig
/scripts/
/scripts/gettweets.sh
/scripts/stream-json-batch.py
/scripts/twitter-keys
/hive-site.xml
/job.properties
/workflow.xml
The model we follow is to keep configuration files in the top-level directory but to keep files related to a given action type in dedicated subdirectories. Note that it is useful to have a lib directory even if empty, as some node types look for it.
With the preceding structure, the job.properties file for our combined job is now the following:
jobTracker=localhost.localdomain:8032
nameNode=hdfs://localhost.localdomain:8020
queueName=default
tasksRoot=book
workflowRoot=${nameNode}/user/${user.name}/${tasksRoot}/v2
oozie.wf.application.path=${nameNode}/user/${user.name}/${tasksRoot}/v2
oozie.use.system.libpath=true
EXEC=gettweets.sh
inputDir=/tmp/tweets
outputDir=/tmp/tweetdata
ingestDir=/tmp/tweetdata
dbName=twttr
In the preceding code, we've fully updated the workflow.xml definition to include all the steps described so far, including an initial fs node to create the required directory without worrying about user permissions.
Introducing HCatalog
If we look at our current workflow, there is inefficiency in how we use HDFS as the interface between Pig and Hive. We need to output the result of our Pig script onto HDFS, where the Hive script can then use it as the location of some new tables. What this highlights is that it is often very useful to have data stored in Hive, but this is limited, as few tools (primarily Hive) can access the Hive metastore and hence read and write such data. If we think about it, Hive has two main layers: its tools for accessing and manipulating its data plus the execution framework to run queries on that data.
The HCatalog subproject of Hive effectively provides an independent implementation of the first of these layers: the means to access and manipulate data in the Hive metastore. HCatalog provides mechanisms for other tools, such as Pig and MapReduce, to natively read and write table-structured data that is stored on HDFS.
Remember, of course, that the data is stored on HDFS in one format or another. The Hive metastore provides the models to abstract these files into the relational table structure familiar from Hive. So when we say we are storing data in HCatalog, what we really mean is that we are storing data on HDFS in such a way that this data can then be exposed by table structures specified within the Hive metastore. Conversely, when we refer to Hive data, what we really mean is data whose metadata is stored in the Hive metastore, and which can be accessed by any metastore-aware tool, such as HCatalog.
Using HCatalog
The HCatalog command-line tool is called hcat and will be preinstalled on the Cloudera QuickStart VM; it is installed, in fact, with any version of Hive from 0.11 onward.
The hcat utility doesn't have an interactive mode, so generally you will use it with explicit command-line arguments or by pointing it at a file of commands, as follows:
$ hcat -e "use default; show tables"
$ hcat -f commands.hql
Though the hcat tool is useful and can be incorporated into scripts, the more interesting element of HCatalog for our purposes here is its integration with Pig. HCatalog defines a new Pig loader called HCatLoader and a storer called HCatStorer. As the names suggest, these allow Pig scripts to read from or write to Hive tables directly. We can use this mechanism to replace our previous Pig and Hive actions in our Oozie workflow with a single HCatalog-based Pig action that writes the output of the Pig job directly into our tables in Hive.
For clarity, we'll create new tables named tweets_hcat, places_hcat, and users_hcat into which we'll insert this data; note that these are no longer external tables:
CREATE TABLE tweets_hcat ...
CREATE TABLE places_hcat ...
CREATE TABLE users_hcat ...
Note that if we had these commands in a script file, we could use the hcat CLI tool to execute them, as follows:
$ hcat -f create.hql
The HCat CLI tool does not, however, offer an interactive shell akin to the Hive CLI. We can now use our previous Pig script and need only change the store commands, replacing the use of PigStorage with HCatStorer. Our updated Pig script, extract_to_hcat.pig, therefore includes store commands such as the following:
store tweets_tsv into 'twttr.tweets_hcat' using
org.apache.hive.hcatalog.pig.HCatStorer();
Note that the package name for the HCatStorer class has the org.apache.hive.hcatalog prefix; when HCatalog was in the Apache incubator, it used org.apache.hcatalog for its package prefix. This older form is now deprecated, and the new form that explicitly shows HCatalog as a subproject of Hive should be used instead.
With this new Pig script, we can now replace our previous Pig and Hive action with an updated Pig action using HCatalog. This also requires the first usage of the Oozie sharelib, which we'll discuss in the next section. In our workflow definition, the pig element of this action will be defined as shown in the following XML and can be found as v3 of the pipeline in the source bundle; in v3, we've also added a utility Hive node to run before the Pig node to ensure that all necessary tables exist before the Pig script that requires them is executed.
<pig>
  <job-tracker>${jobTracker}</job-tracker>
  <name-node>${nameNode}</name-node>
  <job-xml>${workflowRoot}/hive-site.xml</job-xml>
  <configuration>
    <property>
      <name>mapred.job.queue.name</name>
      <value>${queueName}</value>
    </property>
    <property>
      <name>oozie.action.sharelib.for.pig</name>
      <value>pig,hcatalog</value>
    </property>
  </configuration>
  <script>${workflowRoot}/pig/extract_to_hcat.pig</script>
  <argument>-param</argument>
  <argument>inputDir=${inputDir}</argument>
</pig>
The two changes of note are the addition of the explicit reference to the hive-site.xml file, which is required by HCatalog, and the new configuration element that tells Oozie to include the required HCatalog JARs.
The Oozie sharelib
That last addition touched on an important aspect of Oozie we've not mentioned thus far: the Oozie sharelib. When Oozie runs all its various action types, it requires multiple JARs to access Hadoop and to invoke various tools, such as Hive and Pig. As part of the Oozie installation, a large number of dependent JARs have been placed on HDFS to be used by Oozie and its various action types: this is the Oozie sharelib.
For most usages of Oozie, it's enough to know that the sharelib exists, usually under /user/oozie/share/lib on HDFS, and to know when, as in the previous example, some explicit configuration values need to be added. When using a Pig action, the Pig JARs will automatically get picked up, but when the Pig script uses something like HCatalog, then this dependency will not be explicitly known to Oozie.
The Oozie CLI allows manipulation of the sharelib, though the scenarios where this will be required are outside of the scope of this book. The following command can be useful though to see which components are included in the Oozie sharelib:
$ oozie admin -shareliblist
The following command is useful to see the individual JARs comprising a particular component within the sharelib, in this case HCatalog:
$ oozie admin -shareliblist hcat
These commands can be useful to verify that the required JARs are being included and to see which specific versions are being used.
HCatalog and partitioned tables
If you rerun the previous workflow a second time, it will fail; dig into the logs, and you will see HCatalog complaining that it cannot write to a table that already contains data. This is a current limitation of HCatalog; it views tables and partitions within tables as immutable by default. Hive, on the other hand, will add new data to a table or partition; its default view of a table is that it is mutable.
Upcoming changes to Hive and HCatalog will see the support of a new table property that will control this behavior in either tool; for example, the following added to a table definition would allow table appends as supported in Hive today:
TBLPROPERTIES("immutable"="false")
This is currently not available in the shipping version of Hive and HCatalog, however. For us to have a workflow that adds more and more data into our tables, we therefore need to create a new partition for each new run of the workflow. We've made these changes in v4 of our pipeline, where we first recreate the tables with an integer partition key, as follows:
CREATE TABLE tweets_hcat (
…)
PARTITIONED BY (partition_key int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS SEQUENCEFILE;
CREATE TABLE `places_hcat` (
…)
PARTITIONED BY (partition_key int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS SEQUENCEFILE
TBLPROPERTIES("immutable"="false");
CREATE TABLE `users_hcat` (
…)
PARTITIONED BY (partition_key int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS SEQUENCEFILE
TBLPROPERTIES("immutable"="false");
The Pig HCatStorer takes an optional partition definition and we modify the store statements in our Pig script accordingly; for example:
store tweets_tsv into 'twttr.tweets_hcat'
using org.apache.hive.hcatalog.pig.HCatStorer(
'partition_key=$partitionKey');
We then modify our Pig action in the workflow.xml file to include this additional parameter:
<script>${workflowRoot}/pig/extract_to_hcat.pig</script>
<param>inputDir=${inputDir}</param>
<param>partitionKey=${partitionKey}</param>
The question is then how we pass this partition key to the workflow. We could specify it in the job.properties file, but by doing so we would hit the same problem with trying to write to an existing partition on the next re-run.
Ingestion workflow v4
For now, we'll pass this as an explicit argument to the invocation of the Oozie CLI and explore better ways to do this later:
$ oozie job -run -config v4/job.properties -DpartitionKey=12345
Note
Note that a consequence of this behavior is that rerunning an HCat workflow with the same arguments will fail. Be aware of this when testing workflows or playing with the sample code from this book.
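One simple way to avoid such collisions when re-running by hand is to derive the key from the current time; for example (a sketch that assumes a shell with the date command, and which matches the yyyyMMddhhmm format used by the coordinator later in this chapter):
$ oozie job -run -config v4/job.properties -DpartitionKey=$(date +%Y%m%d%H%M)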
Producing derived data
Now that we have our main data pipeline established, there is most likely a series of actions that we wish to take after we add each new additional dataset. As a simple example, note that with our previous mechanism of adding each set of user data to a separate partition, the users_hcat table will contain users multiple times. Let's create a new table for unique users and regenerate this each time we add new user data.
Note that given the aforementioned limitations of HCatalog, we'll use a Hive action for this purpose, as we need to replace the data in a table.
First, we'll create a new table for unique user information, as follows:
CREATE TABLE IF NOT EXISTS `unique_users` (
  `user_id` string,
  `name` string,
  `description` string,
  `screen_name` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE;
In this table, we'll only store the attributes of a user that either never change (ID) or change rarely (the screen name, and so on). We can then write a simple Hive statement to populate this table from the full users_hcat table:
USE twttr;
INSERT OVERWRITE TABLE unique_users
SELECT DISTINCT user_id, name, description, screen_name
FROM users_hcat;
We can then add an additional Hive action node that comes after our previous Pig node in the workflow. When doing this, we discover that our pattern of simply giving nodes names such as hive-node is a really bad idea, as we now have two Hive-based nodes. In v5 of the workflow, we add this new node and also change our nodes to have more descriptive names:
Ingestion workflow v5
Performing multiple actions in parallel
Our workflow has two types of activity: initial setup with the nodes that initialize the filesystem and Hive tables, and the functional nodes that perform actual processing. If we look at the two setup nodes we have been using, it is obvious that they are quite distinct and not interdependent. We can therefore take advantage of an Oozie feature called fork and join nodes to execute these actions in parallel. The start of our workflow.xml file now becomes:
<start to="setup-fork-node"/>
The Oozie fork node contains a number of path elements, each of which specifies a starting node. Each of these will be launched in parallel:
<fork name="setup-fork-node">
  <path start="setup-filesystem-node"/>
  <path start="create-tables-node"/>
</fork>
Each of the specified action nodes is no different from any we have used previously. An action node can link to a series of other nodes; the only requirement is that each parallel series of actions must end with a transition to the join node associated with the fork node, as follows:
<action name="setup-filesystem-node">
…
  <ok to="setup-join-node"/>
  <error to="fail"/>
</action>
<action name="create-tables-node">
…
  <ok to="setup-join-node"/>
  <error to="fail"/>
</action>
The join node itself acts as the point of coordination; any path that has completed will wait until all the paths specified in the fork node reach this point. At that point, the workflow continues at the node specified within the join node. Here's how the join node is used:
<join name="create-join-node" to="gettweets-node"/>
In the preceding code, we omitted the action definitions for space purposes, but the full workflow definition is in v6:
Ingestion workflow v6
Calling a subworkflow
Though the fork/join mechanism makes the process of parallel actions more efficient, it does still add significant verbosity if we include it in our main workflow.xml definition. Conceptually, we have a series of actions that are performing related tasks required by our workflow but not necessarily part of it. For this and similar cases, Oozie offers the ability to invoke a subworkflow. The parent workflow will execute the child and wait for it to complete, with the ability to pass configuration elements from one workflow to the other.
The child workflow will be a full workflow in its own right, usually stored in a directory on HDFS with all the usual structure we expect for a workflow: the main workflow.xml file and any required Hive, Pig, or similar files.
We can create a new directory on HDFS called setup-workflow, and in this create the files required only for our filesystem and Hive creation actions. The subworkflow configuration file will look like the following:
<workflow-app xmlns="uri:oozie:workflow:0.4" name="create-workflow">
  <start to="setup-fork-node"/>
  <fork name="setup-fork-node">
    <path start="setup-filesystem-node"/>
    <path start="create-tables-node"/>
  </fork>
  <action name="setup-filesystem-node">
…
  </action>
  <action name="create-tables-node">
…
  </action>
  <join name="create-join-node" to="end"/>
  <kill name="fail">
    <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
With this subworkflow defined, we then modify the first nodes of our main workflow to use a subworkflow node, as in the following:
<start to="create-subworkflow-node"/>
<action name="create-subworkflow-node">
  <sub-workflow>
    <app-path>${subWorkflowRoot}</app-path>
    <propagate-configuration/>
  </sub-workflow>
  <ok to="gettweets-node"/>
  <error to="fail"/>
</action>
We will specify the subWorkflowRoot in the job.properties of our parent workflow, and the propagate-configuration element will pass the configuration of the parent workflow to the child.
Adding global settings
By extracting utility nodes into subworkflows, we can significantly reduce clutter and complexity in our main workflow definition. In v7 of our ingest pipeline, we'll make one additional simplification and add a global configuration section, as in the following:
<workflow-app xmlns="uri:oozie:workflow:0.4" name="v7">
  <global>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <job-xml>${workflowRoot}/hive-site.xml</job-xml>
    <configuration>
      <property>
        <name>mapred.job.queue.name</name>
        <value>${queueName}</value>
      </property>
    </configuration>
  </global>
  <start to="create-subworkflow-node"/>
By adding this global configuration section, we remove the need to specify any of these values in the Hive and Pig nodes in the remaining workflow (note that currently the shell node does not support the global configuration mechanism). This can dramatically simplify some of our nodes; for example, our Pig node is now as follows:
<action name="hcat-ingest-node">
  <pig>
    <configuration>
      <property>
        <name>oozie.action.sharelib.for.pig</name>
        <value>pig,hcatalog</value>
      </property>
    </configuration>
    <script>${workflowRoot}/pig/extract_to_hcat.pig</script>
    <param>inputDir=${inputDir}</param>
    <param>dbName=${dbName}</param>
    <param>partitionKey=${partitionKey}</param>
  </pig>
  <ok to="derived-data-node"/>
  <error to="fail"/>
</action>
As can be seen, we can add additional configuration elements, or indeed override those specified in the global section, resulting in a much cleaner action definition that focuses only on the information specific to the action in question. Our workflow v7 has had both a global section added as well as the addition of the subworkflow, and this makes a significant improvement in the workflow readability:
Ingestion workflow v7
Challenges of external data
When we rely on external data to drive our application, we are implicitly dependent on the quality and stability of that data. This is, of course, true for any data, but when the data is generated by an external source over which we do not have control, the risks are most likely higher. Regardless, when building what we expect to be reliable applications on top of such data feeds, and especially when our data volumes grow, we need to think about how to mitigate these risks.
Data validation
We use the general term data validation to refer to the act of ensuring that incoming data complies with our expectations and potentially applying normalization to modify it accordingly or to even delete malformed or corrupt input. What this actually involves will be very application-specific. In some cases, the important thing is ensuring the system only ingests data that conforms to a given definition of accurate or clean. For our tweet data, we don't care about every single record and could very easily adopt a policy such as dropping records that don't have values in particular fields we care about. For other applications, however, it is imperative to capture every input record, and this might drive the implementation of logic to reformat every record to make sure it complies with the requirements. In yet other cases, only correct records will be ingested, but the rest, instead of being discarded, might be stored elsewhere for later analysis.
The bottom line is that trying to define a generic approach to data validation is vastly beyond the scope of this chapter.
However, we can offer some thoughts on where in the pipeline to incorporate various types of validation logic.
Validation actions
Logic to do any necessary validation or cleanup can be incorporated directly into other actions. A shell node running a script to gather data can have commands added to handle malformed records differently. Pig and Hive actions that load data into tables can either perform filtering on ingest (easier done in Pig) or add caveats when copying data from an ingest table to the operational store.
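As a minimal sketch of the shell-node approach (the field checked and the file names are assumptions for illustration), gettweets.sh could drop any line that does not contain a user object before the data is pushed to HDFS:
grep '"user"' /tmp/tweets.out > /tmp/tweets.valid
hdfs dfs -put /tmp/tweets.valid /tmp/tweets/tweets.out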
There is an argument though for the addition of a validation node into the workflow, even if initially it performs no actual logic. This could, for instance, be a Pig action that reads the data, applies the validation, and writes the validated data to a new location to be read by follow-on nodes. The advantage here is that we can later update the validation logic without altering our other actions, which should reduce the risk of accidentally breaking the rest of the pipeline and also make nodes more cleanly defined in terms of responsibilities. The natural extension of this train of thought is that a new subworkflow for validation is most likely a good model as well, as it not only provides separation of responsibilities, but also makes the validation logic easier to test and update.
The obvious disadvantage of this approach is that it adds additional processing and another cycle of reading the data and writing it all again. This is, of course, directly working against one of the advantages we highlighted when considering the use of HCatalog from Pig.
In the end, it will come down to a trade-off of performance against workflow complexity and maintainability. When considering how to perform validation and just what that means for your workflow, take all these elements into account before deciding on an implementation.
Handling format changes
We can't declare victory just because we have data flowing into our system and are confident the data is sufficiently validated. Particularly when the data comes from an external source, we have to think about how the structure of the data might change over time.
Remember that systems such as Hive only apply the table schema when the data is being read. This is a huge benefit in enabling flexible data storage and ingest, but can lead to user-facing queries or workloads failing suddenly when the ingested data no longer matches the queries being executed against it. A relational database, which applies schemas on write, would not even allow such data to be ingested into the system.
The obvious approach to handling changes made to the data format would be to reprocess existing data into the new format. Though this is tractable on smaller datasets, it quickly becomes infeasible on the sort of volumes seen in large Hadoop clusters.
Handling schema evolution with Avro
Avro has some features with respect to its integration with Hive that help us with this problem. If we take our table for tweets data, we could represent the structure of a tweet record by the following Avro schema:
{
  "namespace": "com.learninghadoop2.avrotables",
  "type": "record",
  "name": "tweets_avro",
  "fields": [
    {"name": "created_at", "type": ["null", "string"]},
    {"name": "tweet_id_str", "type": ["null", "string"]},
    {"name": "text", "type": ["null", "string"]},
    {"name": "in_reply_to", "type": ["null", "string"]},
    {"name": "is_retweeted", "type": ["null", "string"]},
    {"name": "user_id", "type": ["null", "string"]},
    {"name": "place_id", "type": ["null", "string"]}
  ]
}
Create the preceding schema in a file called tweets_avro.avsc (this is the standard file extension for Avro schemas). Then, place it on HDFS; we like to have a common location for schema files such as /schema/avro.
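For example, a minimal sketch of that step, using the directory convention above:
$ hdfs dfs -mkdir -p /schema/avro
$ hdfs dfs -put tweets_avro.avsc /schema/avro/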
With this definition, we can now create a Hive table that uses this schema for its table specification, as follows:
CREATE TABLE tweets_avro
PARTITIONED BY (`partition_key` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
'avro.schema.url'='hdfs://localhost.localdomain:8020/schema/avro/tweets_avro.avsc'
)
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';
Then, look at the table definition from within Hive (or HCatalog, which also supports such definitions):
describe tweets_avro
OK
created_at      string    from deserializer
tweet_id_str    string    from deserializer
text            string    from deserializer
in_reply_to     string    from deserializer
is_retweeted    string    from deserializer
user_id         string    from deserializer
place_id        string    from deserializer
partition_key   int       None
We can also use this table like any other, for example, to copy the data from partition 3 from the non-Avro table into the Avro table, as follows:
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE tweets_avro
PARTITION (partition_key)
SELECT * FROM tweets_hcat;
Note
Just as in previous examples, if Avro dependencies are not present in the classpath, we need to add the Avro MapReduce JAR to our environment before being able to select from the table.
We now have a new tweets table specified by an Avro schema; so far it just looks like other tables. But the real benefits for our purposes in this chapter are in how we can use the Avro mechanism to handle schema evolution. Let's add a new field to our table schema, as follows:
{
  "namespace": "com.learninghadoop2.avrotables",
  "type": "record",
  "name": "tweets_avro",
  "fields": [
    {"name": "created_at", "type": ["null", "string"]},
    {"name": "tweet_id_str", "type": ["null", "string"]},
    {"name": "text", "type": ["null", "string"]},
    {"name": "in_reply_to", "type": ["null", "string"]},
    {"name": "is_retweeted", "type": ["null", "string"]},
    {"name": "user_id", "type": ["null", "string"]},
    {"name": "place_id", "type": ["null", "string"]},
    {"name": "new_feature", "type": "string", "default": "wow!"}
  ]
}
With this new schema in place, we can validate that the table definition has also been updated, as follows:
describe tweets_avro;
OK
created_at      string    from deserializer
tweet_id_str    string    from deserializer
text            string    from deserializer
in_reply_to     string    from deserializer
is_retweeted    string    from deserializer
user_id         string    from deserializer
place_id        string    from deserializer
new_feature     string    from deserializer
partition_key   int       None
Without adding any new data, we can run queries on the new field that will return the default value for our existing data, as follows:
SELECT new_feature FROM tweets_avro LIMIT 5;
...
OK
wow!
wow!
wow!
wow!
wow!
Even more impressive is the fact that the new column doesn't need to be added at the end; it can be anywhere in the record. With this mechanism, we can now update our Avro schemas to represent the new data structure and see these changes automatically reflected in our Hive table definitions. Any queries that refer to the new column will retrieve the default value for all our existing data that does not have that field present.
Note that the default mechanism we are using here is core to Avro and is not specific to Hive. Avro is a very powerful and flexible format that has applications in many areas and is definitely worth deeper examination than we are giving it here.
Technically, what this provides us with is forward compatibility. We can make changes to our table schema and have all our existing data remain automatically compliant with the new structure. We can't, however, continue to ingest data of the old format into the updated tables since the mechanism does not provide backward compatibility:
INSERT INTO TABLE tweets_avro
PARTITION (partition_key)
SELECT * FROM tweets_hcat;
FAILED: SemanticException [Error 10044]: Line 1:18 Cannot insert into
target table because column number/types are different 'tweets_avro': Table
insclause-0 has 8 columns, but query has 7 columns.
Supporting schema evolution with Avro allows data changes to be something that is handled as part of normal business instead of the firefighting emergency they all too often turn into. But plainly, it's not for free; there is still a need to make the changes in the pipeline and roll these into production. Having Hive tables that provide forward compatibility does, however, allow the process to be performed in more manageable steps; otherwise, you would need to synchronize changes across every stage of the pipeline. If the changes are made from ingest up to the point they are inserted into Avro-backed Hive tables, then all users of those tables can remain unchanged (as long as they don't do things like select *, which is usually a terrible idea anyway) and continue to run existing queries against the new data. These applications can then be changed on a different timetable to the ingestion mechanism. In our v8 of the ingest pipeline, we show how to fully use Avro tables for all of our existing functionality.
Note
Note that Hive 0.14, currently unreleased at the time of writing this, will likely include more built-in support for Avro that might simplify the process of schema evolution even further. If Hive 0.14 is available when you read this, then do check out the final implementation.
Final thoughts on using Avro schema evolution
With this discussion of Avro, we have touched on some aspects of much broader topics, in particular of data management on a broader scale and policies around data versioning and retention. Much of this area becomes very specific to an organization, but here are a few parting thoughts that we feel are more broadly applicable.
Only make additive changes
We discussed adding columns in the preceding example. Sometimes, though more rarely, your source data drops columns or you discover you no longer need a column. Avro doesn't really provide tools to help with this, and we feel it is often undesirable. Instead of dropping old columns, we tend to maintain the old data and simply do not use the empty columns in all the new data. This is much easier to manage if you control the data format; if you are ingesting external sources, then to follow this approach you will either need to reprocess data to remove the old column or change the ingest mechanism to add a default value for all new data.
Manage schema versions explicitly
In the preceding examples, we had a single schema file to which we made changes directly. This is likely a very bad idea, as it removes our ability to track schema changes over time. In addition to treating schemas as artifacts to be kept under version control (your schemas are in Git too, aren't they?), it is often useful to tag each schema with an explicit version. This is particularly useful when the incoming data is also explicitly versioned. Then, instead of overwriting the existing schema file, you can add the new file and use an ALTER TABLE statement to point the Hive table definition at the new schema. We are, of course, assuming here that you don't have the option of using a different query for the old data with the different format. Though there is no automatic mechanism for Hive to select the schema, there might be cases where you can control this manually and sidestep the evolution question.
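A minimal sketch of that repointing step might look like the following; the versioned filename is an assumption, and depending on the Hive version the property may need to be set through SERDEPROPERTIES rather than TBLPROPERTIES:
$ hive -e "ALTER TABLE tweets_avro SET TBLPROPERTIES ('avro.schema.url'='hdfs://localhost.localdomain:8020/schema/avro/tweets_avro.v2.avsc')"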
Think about schema distribution
When using a schema file, think about how it will be distributed to the clients. If, as in the previous example, the file is on HDFS, then it likely makes sense to give it a high replication factor. The file will be retrieved by each mapper in every MapReduce job that queries the table.
The Avro URL can also be specified as a local filesystem location (file://), which is useful for development, and also as a web resource (http://). Though the latter is very useful as it is a convenient mechanism to distribute the schema to non-Hadoop clients, remember that the load on the web server might be high. With modern hardware and efficient web servers, this is most likely not a huge concern, but if you have a cluster of thousands of machines running many parallel jobs where each mapper needs to hit the web server, then be careful.
Collecting additional data
Many data processing systems don't have a single data ingest source; often, one primary source is enriched by other secondary sources. We will now look at how to incorporate the retrieval of such reference data into our data warehouse.
At a high level, the problem isn't very different from our retrieval of the raw tweet data, as we wish to pull data from an external source, possibly do some processing on it, and store it somewhere where it can be used later. But this does highlight an aspect we need to consider: do we really want to retrieve this data every time we ingest new tweets? The answer is certainly no. The reference data changes very rarely, and we could easily fetch it much less frequently than new tweet data. This raises a question we've skirted until now: just how do we schedule Oozie workflows?
Scheduling workflows
Until now, we've run all our Oozie workflows on demand from the CLI. Oozie also has a scheduler that allows jobs to be started either on a timed basis or when external criteria such as data appearing in HDFS are met. It would be a good fit for our workflows to have our main tweet pipeline run, say, every 10 minutes but the reference data only refreshed daily.
Tip
Regardless of when data is retrieved, think carefully about how to handle datasets that perform a delete/replace operation. In particular, don't do the delete before retrieving and validating the new data; otherwise, any jobs that require the reference data will fail until the next run of the retrieval succeeds. It could be a good option to include the destructive operations in a subworkflow that is only triggered after successful completion of the retrieval steps.
Oozie actually defines two types of applications that it can run: workflows such as we've used so far and coordinators, which schedule workflows to be executed based on various criteria. A coordinator job is conceptually similar to our other workflows; we push an XML configuration file onto HDFS and use a parameterized properties file to configure it at runtime. In addition, coordinator jobs have the facility to receive additional parameterization from the events that trigger their execution.
This is possibly best described by an example. Let's say we wish to do as previously mentioned and create a coordinator that executes v7 of our ingest workflow every 10 minutes. Here's the coordinator.xml file (the standard name for the coordinator XML definition):
<coordinator-app name="tweets-10min-coordinator" frequency="${freq}"
  start="${startTime}" end="${endTime}" timezone="UTC"
  xmlns="uri:oozie:coordinator:0.2">
The main action node in a coordinator is the workflow, for which we need to specify its root location on HDFS and all required properties, as follows:
<action>
  <workflow>
    <app-path>${workflowPath}</app-path>
    <configuration>
      <property>
        <name>workflowRoot</name>
        <value>${workflowRoot}</value>
      </property>
…
We also need to include any properties required by any action in the workflow or by any subworkflow it triggers; in effect, this means that any user-defined variables present in any of the workflows to be triggered need to be included here, as follows:
      <property>
        <name>dbName</name>
        <value>${dbName}</value>
      </property>
      <property>
        <name>partitionKey</name>
        <value>${coord:formatTime(coord:nominalTime(), 'yyyyMMddhhmm')}</value>
      </property>
      <property>
        <name>exec</name>
        <value>gettweets.sh</value>
      </property>
      <property>
        <name>inputDir</name>
        <value>/tmp/tweets</value>
      </property>
      <property>
        <name>subWorkflowRoot</name>
        <value>${subWorkflowRoot}</value>
      </property>
    </configuration>
  </workflow>
</action>
</coordinator-app>
We used a few coordinator-specific features in the preceding XML. Note the specification of the starting and ending time of the coordinator and also its frequency (in minutes). We are using the simplest form here; Oozie also has a set of functions to allow quite rich specifications of the frequency.
We use coordinator EL functions in our definition of the partitionKey variable. Earlier, when running workflows from the CLI, we specified these explicitly but mentioned there was a better way; this is it. The following expression generates a formatted output containing the year, month, day, hour, and minute:
${coord:formatTime(coord:nominalTime(), 'yyyyMMddhhmm')}
If we then use this as the value for our partition key, we can ensure that each invocation of the workflow correctly creates a unique partition in our HCatalog tables.
Thecorrespondingjob.propertiesforthecoordinatorjoblooksmuchlikeourpreviousconfigfileswiththeusualentriesfortheNameNodeandsimilarvariablesaswellashavingvaluesfortheapplication-specificvariables,suchasdbName.Inaddition,weneedtospecifytherootofthecoordinatorlocationonHDFS,asfollows:
oozie.coord.application.path=${nameNode}/user/${user.name}/${tasksRoot}/tweets_10min
Note the oozie.coord namespace prefix instead of the previously used oozie.wf. With the coordinator definition on HDFS, we can submit the file to Oozie just as with the previous jobs. But in this case, the job will only run for a given time period. Specifically, it will run at the configured frequency (every 10 minutes in our example) whenever the system clock is between startTime and endTime.
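For illustration only (the Oozie server URL and the properties filename here are assumptions rather than values from the chapter's configuration), submitting and then checking on the coordinator from the CLI follows the same pattern as the workflow jobs:

$ oozie job -oozie http://localhost:11000/oozie -config job.properties -run
$ oozie job -oozie http://localhost:11000/oozie -info <coordinator-job-id>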
We've included the full configuration in the tweets_10min directory in the source code for this chapter.
Other Oozie triggers

The preceding coordinator has a very simple trigger; it starts periodically within a specified time range. Oozie has an additional capability called datasets, where it can be triggered by the availability of new data.

This isn't a great fit for how we've defined our pipeline until now, but imagine that, instead of our workflow collecting tweets as its first step, an external system was pushing new files of tweets onto HDFS on a continuous basis. Oozie can be configured to either look for the presence of new data based on a directory pattern or to specifically trigger when a ready file appears on HDFS. This latter configuration provides a very convenient mechanism with which to integrate the output of MapReduce jobs, which, by default, write a _SUCCESS file into their output directory.
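As a rough sketch of the idea (the dataset name, directory layout, and start time are invented for illustration, though the element names follow the Oozie coordinator schema), a coordinator that waits for an hourly tweet directory containing a _SUCCESS flag might declare something like:

<datasets>
    <dataset name="raw_tweets" frequency="${coord:hours(1)}"
             initial-instance="2014-01-01T00:00Z" timezone="UTC">
        <!-- hypothetical directory layout written by the external system -->
        <uri-template>${nameNode}/data/tweets/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
        <!-- an instance is only considered available once the MapReduce success marker exists -->
        <done-flag>_SUCCESS</done-flag>
    </dataset>
</datasets>
<input-events>
    <data-in name="tweet_input" dataset="raw_tweets">
        <instance>${coord:current(0)}</instance>
    </data-in>
</input-events>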
Oozie datasets are arguably one of the most powerful parts of the whole system, and we cannot do them justice here for space reasons. But we do strongly recommend that you consult the Oozie homepage for more information.
Pulling it all together

Let's review what we've discussed until now and how we can use Oozie to build a sophisticated series of workflows that implement an approach to data lifecycle management by putting together all the discussed techniques.

First, it's important to define clear responsibilities and implement parts of the system using good design and separation of concern principles. By applying this, we end up with several different workflows:

- A subworkflow to ensure the environment (mainly HDFS and Hive metadata) is correctly configured
- A subworkflow to perform data validation
- The main workflow that triggers both the preceding subworkflows and then pulls new data through a multistep ingest pipeline
- A coordinator that executes the preceding workflows every 10 minutes
- A second coordinator that ingests reference data that will be useful to the application pipeline

We also define all our tables with Avro schemas and use them wherever possible to help manage schema evolution and changing data formats over time.

We present the full source code of these components in the final version of the workflow in the source code of this chapter.
Other tools to help

Though Oozie is a very powerful tool, sometimes it can be somewhat difficult to correctly write workflow definition files. As pipelines get sizeable, managing complexity becomes a challenge even with good functional partitioning into multiple workflows. At a simpler level, XML is just never fun for a human to write! There are a few tools that can help. Hue, the tool calling itself the Hadoop UI (http://gethue.com/), provides some graphical tools to help compose, execute, and manage Oozie workflows. Though powerful, Hue is not a beginner tool; we'll mention it a little more in Chapter 11, Where to Go Next.

A newer Apache project called Falcon (http://falcon.incubator.apache.org) might also be of interest. Falcon uses Oozie to build a range of much higher-level data flows and actions. For example, Falcon provides recipes to enable and ensure cross-site replication across multiple Hadoop clusters. The Falcon team is working on much better interfaces to build their workflows, so the project might well be worth watching.
Summary

Hopefully, this chapter presented the topic of data lifecycle management as something other than a dry abstract concept. We covered a lot, particularly:

- The definition of data lifecycle management and how it covers a number of issues and techniques that usually become important with large data volumes
- The concept of building a data ingest pipeline along good data lifecycle management principles that can then be utilized by higher-level analytic tools
- Oozie as a Hadoop-focused workflow manager and how we can use it to compose a series of actions into a unified workflow
- Various Oozie tools, such as subworkflows, parallel action execution, and global variables, that allow us to apply true design principles to our workflows
- HCatalog and how it provides the means for tools other than Hive to read and write table-structured data; we showed its great promise and integration with tools such as Pig but also highlighted some current weaknesses
- Avro as our tool of choice to handle schema evolution over time
- Using Oozie coordinators to build scheduled workflows based either on time intervals or data availability to drive the execution of multiple ingest pipelines
- Some other tools that can make these tasks easier, namely Hue and Falcon

In the next chapter, we'll look at several of the higher-level analytic tools and frameworks that can build sophisticated application logic upon the data collected in an ingest pipeline.
Chapter 9. Making Development Easier

In this chapter, we will look at how, depending on use cases and end goals, application development in Hadoop can be simplified using a number of abstractions and frameworks built on top of the Java APIs. In particular, we will learn about the following topics:

- How the streaming API allows us to write MapReduce jobs using dynamic languages such as Python and Ruby
- How frameworks such as Apache Crunch and Kite Morphlines allow us to express data transformation pipelines using higher-level abstractions
- How Kite Data, a promising framework developed by Cloudera, provides us with the ability to apply design patterns and boilerplate to ease integration and interoperability of different components within the Hadoop ecosystem
Choosing a framework

In the previous chapters, we looked at the MapReduce and Spark programming APIs to write distributed applications. Although very powerful and flexible, these APIs come with a certain level of complexity and possibly require significant development time.

In an effort to reduce verbosity, we introduced the Pig and Hive frameworks, which compile domain-specific languages, Pig Latin and HiveQL, into a number of MapReduce jobs or Spark DAGs, effectively abstracting the APIs away. Both languages can be extended with UDFs, which is a way of mapping complex logic to the Pig and Hive data models.

At times when we need a certain degree of flexibility and modularity, things can get tricky. Depending on the use case and developer needs, the Hadoop ecosystem presents a vast choice of APIs, frameworks, and libraries. In this chapter, we identify four categories of users and match them with the following relevant tools:

- Developers who want to avoid Java in favor of scripting MapReduce jobs using dynamic languages, or who use languages not implemented on the JVM. A typical use case would be upfront analysis and rapid prototyping: Hadoop streaming
- Java developers who need to integrate components of the Hadoop ecosystem and could benefit from codified design patterns and boilerplate: Kite Data
- Java developers who want to write modular data pipelines using a familiar API: Apache Crunch
- Developers who would rather configure chains of data transformations; for instance, a data engineer who wants to embed existing code in an ETL pipeline: Kite Morphlines
Hadoop streaming

We have mentioned previously that MapReduce programs don't have to be written in Java. There are several reasons why you might want or need to write your map and reduce tasks in another language. Perhaps you have existing code to leverage or need to use third-party binaries; the reasons are varied and valid.

Hadoop provides a number of mechanisms to aid non-Java development, primary amongst which are Hadoop pipes, which provides a native C++ interface, and Hadoop streaming, which allows any program that uses standard input and output to be used for map and reduce tasks. With the MapReduce Java API, both map and reduce tasks provide implementations for methods that contain the task functionality. These methods receive the input to the task as method arguments and then output results via the Context object. This is a clear and type-safe interface, but it is by definition Java-specific.

Hadoop streaming takes a different approach. With streaming, you write a map task that reads its input from standard input, one line at a time, and gives the output of its results to standard output. The reduce task then does the same, again using only standard input and output for its data flow.

Any program that reads and writes from standard input and output can be used in streaming, such as compiled binaries, Unix shell scripts, or programs written in a dynamic language such as Python or Ruby. The biggest advantage to streaming is that it can allow you to try ideas and iterate on them more quickly than using Java. Instead of a compile/JAR/submit cycle, you just write the scripts and pass them as arguments to the streaming JAR file. Especially when doing initial analysis on a new dataset or trying out new ideas, this can significantly speed up development.

The classic debate regarding dynamic versus static languages balances the benefits of swift development against runtime performance and type checking. These dynamic downsides also apply when using streaming. Consequently, we favor the use of streaming for upfront analysis and Java for the implementation of jobs that will be executed on the production cluster.
Streaming word count in Python

We'll demonstrate Hadoop streaming by re-implementing our familiar word count example using Python. First, we create a script that will be our mapper. It consumes UTF-8 encoded rows of text from standard input with a for loop, splits each line into words, and uses the print function to write each word to standard output, as follows:
#!/bin/env python
import sys

for line in sys.stdin:
    # skip empty lines
    if line == '\n':
        continue
    # preserve utf-8 encoding
    try:
        line = line.encode('utf-8')
    except UnicodeDecodeError:
        continue
    # newline characters can appear within the text
    line = line.replace('\n', '')
    # lowercase and tokenize
    line = line.lower().split()
    for term in line:
        if not term:
            continue
        try:
            print(
                u"%s" % (
                    term.decode('utf-8')))
        except UnicodeEncodeError:
            continue
The reducer counts the number of occurrences of each word from standard input and gives the output as the final value to standard output, as follows:
#!/bin/env python
import sys

count = 1
current = None
for word in sys.stdin:
    word = word.strip()
    if word == current:
        count += 1
    else:
        if current:
            print "%s\t%s" % (current.decode('utf-8'), count)
        current = word
        count = 1

if current == word:
    print "%s\t%s" % (current.decode('utf-8'), count)
Note

In both cases, we are implicitly using the Hadoop input and output formats discussed in the earlier chapters. It is the TextInputFormat that processes the source file and provides each line one at a time to the map script. Conversely, the TextOutputFormat will ensure that the output of reduce tasks is also correctly written as text.

Copy map.py and reduce.py to HDFS, and execute the scripts as a streaming job using the sample data from the previous chapters, as follows:
$ hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -file map.py \
    -mapper "python map.py" \
    -file reduce.py \
    -reducer "python reduce.py" \
    -input sample.txt \
    -output output.txt
Note

Tweets are UTF-8 encoded. Make sure that PYTHONIOENCODING is set accordingly in order to pipe data in a UNIX terminal:

$ export PYTHONIOENCODING='UTF-8'
The same code can be tested from the command-line prompt, adding a sort step between the mapper and the reducer to mimic the MapReduce shuffle so that identical words arrive at the reducer adjacently:

$ cat sample.txt | python map.py | sort | python reduce.py > out.txt
The mapper and reducer code can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch9/streaming/wc/python/map.py.
Differences in jobs when using streaming

In Java, we know that our map() method will be invoked once for each input key/value pair and our reduce() method will be invoked for each key and its set of values.

With streaming, we don't have the concept of the map or reduce methods anymore; instead, we have written scripts that process streams of received data. This changes how we need to write our reducer. In Java, the grouping of values for each key was performed by Hadoop; each invocation of the reduce method would receive a single key and all of its values. In streaming, each instance of the reduce task is given the individual, ungathered values one at a time.

Hadoop streaming does sort the keys. For example, if a mapper emitted the following data:

First 1
Word 1
Word 1
A 1
First 1

The streaming reducer would receive it in the following order:

A 1
First 1
First 1
Word 1
Word 1

Hadoop still collects the values for each key and ensures that each key is passed only to a single reducer. In other words, a reducer gets all the values for a number of keys, and they are grouped together; however, they are not packaged into individual executions of the reducer, that is, one per key, as with the Java API. Since Hadoop streaming uses the stdin and stdout channels to exchange data between tasks, debug and error messages should not be printed to standard output. In the following examples, we will use the Python logging (https://docs.python.org/2/library/logging.html) package to log warning statements to a file.
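The scripts that follow refer to a logger object; a minimal setup along these lines (the log filename is just an illustrative choice) could sit near the top of each script:

import logging

# write warnings to a local file instead of stdout, which streaming uses for data
logging.basicConfig(filename='mapred_errors.log', level=logging.WARNING)
logger = logging.getLogger(__name__)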
Finding important words in text

We will now implement a metric, Term Frequency-Inverse Document Frequency (TF-IDF), that will help us to determine the importance of words based on how frequently they appear across a set of documents (tweets, in our case).

Intuitively, if a word appears frequently in a document, it is important and should be given a high score. However, if a word appears in many documents, we should penalize it with a lower score, as it is a common word and its frequency is not unique to this document.

Therefore, common words such as "the" and "for", which appear in many documents, will be scaled down. Words that appear frequently in a single tweet will be scaled up. Uses of TF-IDF, often in combination with other metrics and techniques, include stop word removal and text classification. Note that this technique has shortcomings when dealing with short documents, such as tweets; in such cases, the term frequency component will tend to be one. Conversely, one could exploit this property to detect outliers.

The definition of TF-IDF we will use in our example is the following:

tf = # of times the term appears in a document (raw frequency)
idf = 1 + log(# of documents / # of documents with the term in it)
tf-idf = tf * idf
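As a quick sanity check with made-up numbers: in a collection of 10,000 tweets, a term occurring 3 times in one tweet and appearing in 100 tweets overall scores tf-idf = 3 * (1 + log(10000/100)) = 3 * (1 + log(100)), which is roughly 3 * 5.6, or about 16.8 when using the natural logarithm. A common word appearing in, say, 9,000 of the tweets with the same raw frequency scores only about 3 * (1 + log(10000/9000)), or roughly 3.3.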
We will implement the algorithm in Python using three MapReduce jobs:

- The first one calculates term frequency
- The second one calculates document frequency (the denominator of IDF)
- The third one calculates per-tweet TF-IDF

Calculate term frequency

The term frequency part is very similar to the word count example. The main difference is that we will be using a multi-field, tab-separated key to keep track of co-occurrences of terms and document IDs. For each tweet, in JSON format, the mapper extracts the id_str and text fields, tokenizes text, and emits a (term, doc_id) tuple:
for tweet in sys.stdin:
    # skip empty lines
    if tweet == '\n':
        continue
    try:
        tweet = json.loads(tweet)
    except:
        logger.warn("Invalid input %s" % tweet)
        continue
    # In our example one tweet corresponds to one document.
    doc_id = tweet['id_str']
    if not doc_id:
        continue
    # preserve utf-8 encoding
    text = tweet['text'].encode('utf-8')
    # newline characters can appear within the text
    text = text.replace('\n', '')
    # lowercase and tokenize
    text = text.lower().split()
    for term in text:
        try:
            print(
                u"%s\t%s" % (
                    term.decode('utf-8'), doc_id.decode('utf-8'))
            )
        except UnicodeEncodeError:
            logger.warn("Invalid term %s" % term)
In the reducer, we emit the frequency of each term in a document as a tab-separated string:
freq = 1
cur_term, cur_doc_id = sys.stdin.readline().split()
for line in sys.stdin:
    line = line.strip()
    try:
        term, doc_id = line.split('\t')
    except:
        logger.warn("Invalid record %s" % line)
    # the key is a (doc_id, term) pair
    if (doc_id == cur_doc_id) and (term == cur_term):
        freq += 1
    else:
        print(
            u"%s\t%s\t%s" % (
                cur_term.decode('utf-8'), cur_doc_id.decode('utf-8'),
                freq))
        cur_doc_id = doc_id
        cur_term = term
        freq = 1

print(
    u"%s\t%s\t%s" % (
        cur_term.decode('utf-8'), cur_doc_id.decode('utf-8'), freq))
For this implementation to work, it is crucial that the reducer input is sorted by term. We can test both scripts from the command line with the following pipe:

$ cat tweets.json | python map-tf.py | sort -k1,2 | \
    python reduce-tf.py

Whereas at the command line we use the sort utility, in MapReduce we will use org.apache.hadoop.mapreduce.lib.KeyFieldBasedComparator. This comparator implements a subset of the features provided by the sort command. In particular, ordering by field can be specified with the -k<position> option. To order by term, the first field of our key, we set -D mapreduce.text.key.comparator.options=-k1:
/usr/bin/hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D map.output.key.field.separator=\t \
    -D stream.num.map.output.key.fields=2 \
    -D mapreduce.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.KeyFieldBasedComparator \
    -D mapreduce.text.key.comparator.options=-k1,2 \
    -input tweets.json \
    -output /tmp/tf-out.tsv \
    -file map-tf.py \
    -mapper "python map-tf.py" \
    -file reduce-tf.py \
    -reducer "python reduce-tf.py"
Note

We specify which fields belong to the key (for shuffling) in the comparator options.

The mapper and reducer code can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch9/streaming/tf-idf/python/map-tf.py.
Calculate document frequency

The main logic to calculate document frequency is in the reducer, while the mapper is just an identity function that loads and pipes the (ordered by term) output of the TF job. In the reducer, for each term, we count how many times it occurs across all documents. For each term, we keep a buffer key_cache of (term, doc_id, tf) tuples, and when a new term is found we flush the buffer to standard output, together with the accumulated document frequency df:
# Cache the (term, doc_id, tf) tuples.
key_cache = []

line = sys.stdin.readline().strip()
cur_term, cur_doc_id, cur_tf = line.split('\t')
cur_tf = int(cur_tf)
cur_df = 1

for line in sys.stdin:
    line = line.strip()
    try:
        term, doc_id, tf = line.strip().split('\t')
        tf = int(tf)
    except:
        logger.warn("Invalid record: %s" % line)
        continue
    # term is the only key for this input
    if (term == cur_term):
        # increment document frequency
        cur_df += 1
        key_cache.append(
            u"%s\t%s\t%s" % (term.decode('utf-8'), doc_id.decode('utf-8'),
                             tf))
    else:
        for key in key_cache:
            print("%s\t%s" % (key, cur_df))
        print(
            u"%s\t%s\t%s\t%s" % (
                cur_term.decode('utf-8'),
                cur_doc_id.decode('utf-8'),
                cur_tf, cur_df)
        )
        # flush the cache
        key_cache = []
        cur_doc_id = doc_id
        cur_term = term
        cur_tf = tf
        cur_df = 1

for key in key_cache:
    print(u"%s\t%s" % (key.decode('utf-8'), cur_df))
print(
    u"%s\t%s\t%s\t%s\n" % (
        cur_term.decode('utf-8'),
        cur_doc_id.decode('utf-8'),
        cur_tf, cur_df))
We can test the scripts from the command line with:

$ cat /tmp/tf-out.tsv | python map-df.py | python reduce-df.py > /tmp/df-out.tsv
And we can run the scripts on Hadoop streaming with:

/usr/bin/hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D map.output.key.field.separator=\t \
    -D stream.num.map.output.key.fields=3 \
    -D mapreduce.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.KeyFieldBasedComparator \
    -D mapreduce.text.key.comparator.options=-k1 \
    -input /tmp/tf-out.tsv/part-00000 \
    -output /tmp/df-out.tsv \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -file reduce-df.py \
    -reducer "python reduce-df.py"
On Hadoop, we use org.apache.hadoop.mapred.lib.IdentityMapper, which provides the same logic as the map-df.py script.

The mapper and reducer code can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch9/streaming/tf-idf/python/map-df.py.
Putting it all together – TF-IDF

To calculate TF-IDF, we only need a mapper that consumes the output of the previous step:
num_doc = sys.argv[1]
for line in sys.stdin:
    line = line.strip()
    try:
        term, doc_id, tf, df = line.split('\t')
        tf = float(tf)
        df = float(df)
        num_doc = float(num_doc)
    except:
        logger.warn("Invalid record %s" % line)
    # idf = num_doc / df
    tf_idf = tf * (1 + math.log(num_doc / df))
    print("%s\t%s\t%s" % (term, doc_id, tf_idf))
The number of documents in the collection is passed as a parameter to tf-idf.py:

/usr/bin/hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.reduce.tasks=0 \
    -input /tmp/df-out.tsv/part-00000 \
    -output /tmp/tf-idf.out \
    -file tf-idf.py \
    -mapper "python tf-idf.py 15578"
To calculate the total number of tweets, we can use the cat and wc Unix utilities in combination with Hadoop streaming:

/usr/bin/hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -input tweets.json \
    -output tweets.cnt \
    -mapper /bin/cat \
    -reducer /usr/bin/wc

The mapper source code can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch9/streaming/tf-idf/python/tf-idf.py.
Kite Data

The Kite SDK (http://www.kitesdk.org) is a collection of classes, command-line tools, and examples that aims at easing the process of building applications on top of Hadoop.

In this section, we will look at how Kite Data, a subproject of Kite, can ease integration with several components of a Hadoop data warehouse. Kite examples can be found at https://github.com/kite-sdk/kite-examples.

On Cloudera's QuickStart VM, Kite JARs can be found at /opt/cloudera/parcels/CDH/lib/kite/.

Kite Data is organized in a number of subprojects, some of which we'll describe in the following sections.

Data Core

As the name suggests, the core is the building block for all capabilities provided in the Data module. Its principal abstractions are datasets and repositories.

The org.kitesdk.data.Dataset interface is used to represent an immutable set of data:
@Immutable
public interface Dataset<E> extends RefinableView<E> {
    String getName();
    DatasetDescriptor getDescriptor();
    Dataset<E> getPartition(PartitionKey key, boolean autoCreate);
    void dropPartition(PartitionKey key);
    Iterable<Dataset<E>> getPartitions();
    URI getUri();
}
Each dataset is identified by a name and an instance of the org.kitesdk.data.DatasetDescriptor interface, which is the structural description of a dataset and provides its schema (org.apache.avro.Schema) and partitioning strategy.
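As a rough sketch (the record layout below is an invented two-field tweet schema, not one of the chapter's Avro files), a descriptor is typically assembled with the builder that Kite provides:

// a trivial Avro schema, parsed inline purely for illustration
Schema schema = new Schema.Parser().parse(
    "{\"type\":\"record\",\"name\":\"Tweet\",\"fields\":["
    + "{\"name\":\"id_str\",\"type\":\"string\"},"
    + "{\"name\":\"text\",\"type\":\"string\"}]}");

// the descriptor ties the schema (and optionally a partition strategy) to a dataset
DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
    .schema(schema)
    .build();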
Implementations of the Reader<E> interface are used to read data from an underlying storage system and produce deserialized entities of type E. The newReader() method can be used to get an appropriate implementation for a given dataset:
public interface DatasetReader<E> extends Iterator<E>, Iterable<E>, Closeable {
    void open();
    boolean hasNext();
    E next();
    void remove();
    void close();
    boolean isOpen();
}
An instance of DatasetReader will provide methods to read and iterate over streams of data. Similarly, org.kitesdk.data.DatasetWriter provides an interface to write streams of data to the Dataset objects:
public interface DatasetWriter<E> extends Flushable, Closeable {
    void open();
    void write(E entity);
    void flush();
    void close();
    boolean isOpen();
}
Like readers, writers are use-once objects. They serialize instances of entities of type E and write them to the underlying storage system. Writers are usually not instantiated directly; rather, an appropriate implementation can be created by the newWriter() factory method. Implementations of DatasetWriter will hold resources until close() is called and expect the caller to invoke close() in a finally block when the writer is no longer in use. Finally, note that implementations of DatasetWriter are typically not thread-safe; the behavior of a writer being accessed from multiple threads is undefined.
A particular case of a dataset is the View interface, which is as follows:
public interface View<E> {
    Dataset<E> getDataset();
    DatasetReader<E> newReader();
    DatasetWriter<E> newWriter();
    boolean includes(E entity);
    public boolean deleteAll();
}
Views carry subsets of the keys and partitions of an existing dataset; they are conceptually similar to the notion of a "view" in the relational model.

A View interface can be created from ranges of data, or ranges of keys, or as a union of other views.
Data HCatalog

Data HCatalog is a module that enables access to HCatalog repositories. The core abstractions of this module are org.kitesdk.data.hcatalog.HCatalogAbstractDatasetRepository and its concrete implementation, org.kitesdk.data.hcatalog.HCatalogDatasetRepository.

They describe a DatasetRepository that uses HCatalog to manage metadata and HDFS for storage, as follows:
public class HCatalogDatasetRepository extends HCatalogAbstractDatasetRepository {

    HCatalogDatasetRepository(Configuration conf) {
        super(conf, new HCatalogManagedMetadataProvider(conf));
    }

    HCatalogDatasetRepository(Configuration conf, MetadataProvider provider) {
        super(conf, provider);
    }

    public <E> Dataset<E> create(String name, DatasetDescriptor descriptor) {
        getMetadataProvider().create(name, descriptor);
        return load(name);
    }

    public boolean delete(String name) {
        return getMetadataProvider().delete(name);
    }

    public static class Builder {
        …
    }
}
Note

As of Kite 0.17, Data HCatalog is deprecated in favor of the new Data Hive module.

The location of the data directory is either chosen by Hive/HCatalog (so-called "managed tables") or specified when creating an instance of this class by providing a filesystem and a root directory in the constructor (external tables).
Data Hive

The kite-data-hive module exposes Hive schemas via the Dataset interface. As of Kite 0.17, this package supersedes Data HCatalog.

Data MapReduce

The org.kitesdk.data.mapreduce package provides interfaces to read and write data to and from a Dataset with MapReduce.

Data Spark

The org.kitesdk.data.spark package provides interfaces for reading and writing data to and from a Dataset with Apache Spark.

Data Crunch

The org.kitesdk.data.crunch.CrunchDatasets helper class exposes datasets and views as Crunch ReadableSource or Target classes:
public class CrunchDatasets {
    public static <E> ReadableSource<E> asSource(View<E> view, Class<E> type) {
        return new DatasetSourceTarget<E>(view, type);
    }
    public static <E> ReadableSource<E> asSource(URI uri, Class<E> type) {
        return new DatasetSourceTarget<E>(uri, type);
    }
    public static <E> ReadableSource<E> asSource(String uri, Class<E> type) {
        return asSource(URI.create(uri), type);
    }
    public static <E> Target asTarget(View<E> view) {
        return new DatasetTarget<E>(view);
    }
    public static Target asTarget(String uri) {
        return asTarget(URI.create(uri));
    }
    public static Target asTarget(URI uri) {
        return new DatasetTarget<Object>(uri);
    }
}
Apache Crunch

Apache Crunch (http://crunch.apache.org) is a Java and Scala library to create pipelines of MapReduce jobs. It is based on Google's FlumeJava paper and library (http://dl.acm.org/citation.cfm?id=1806638). The project goal is to make the task of writing MapReduce jobs as straightforward as possible for anybody familiar with the Java programming language by exposing a number of patterns that implement operations such as aggregating, joining, filtering, and sorting records.

Similar to tools such as Pig, Crunch pipelines are created by composing immutable, distributed data structures and running processing operations on those structures, which are expressed and implemented as user-defined functions. Pipelines are compiled into a DAG of MapReduce jobs, whose execution is managed by the library's planner. Crunch allows us to write iterative code and abstracts away the complexity of thinking in terms of map and reduce operations, while at the same time avoiding the need for an ad hoc programming language such as Pig Latin. In addition, Crunch offers a highly customizable type system that allows us to work with, and mix, Hadoop Writables, HBase, and Avro serialized objects.

FlumeJava's main assumption is that MapReduce is the wrong level of abstraction for several classes of problems, where computations are often made up of multiple, chained jobs. Frequently, we need to compose logically independent operations (for example, filtering, projecting, grouping, and other transformations) into a single physical MapReduce job for performance reasons. This aspect also has implications for code testability. Although we won't cover this aspect in this chapter, the reader is encouraged to look further into it by consulting Crunch's documentation.
Getting started

Crunch JARs are already installed on the QuickStart VM. By default, the JARs are found in /opt/cloudera/parcels/CDH/lib/crunch.

Alternatively, recent Crunch libraries can be downloaded from https://crunch.apache.org/download.html, from Maven Central, or from Cloudera-specific repositories.
Concepts

Crunch pipelines are created by composing two abstractions: PCollection and PTable.

The PCollection<T> interface is a distributed, immutable collection of objects of type T. The PTable<Key, Value> interface is a distributed, immutable hash table, a sub-interface of PCollection, of keys of the Key type and values of the Value type that exposes methods to work with key-value pairs.

These two abstractions support the following four primitive operations:

- parallelDo: applies a user-defined function, DoFn, to a given PCollection and returns a new PCollection
- union: merges two or more PCollections into a single virtual PCollection
- groupByKey: sorts and groups the elements of a PTable by their keys
- combineValues: aggregates the values from a groupByKey operation

The code at https://github.com/learninghadoop2/book-examples/blob/master/ch9/crunch/src/main/java/com/learninghadoop2/crunch/HashtagCount.java implements a Crunch MapReduce pipeline that counts hashtag occurrences:
Pipeline pipeline = new MRPipeline(HashtagCount.class, getConf());
pipeline.enableDebug();

PCollection<String> lines = pipeline.readTextFile(args[0]);

PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
    public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
            if (word.matches("(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)")) {
                emitter.emit(word);
            }
        }
    }
}, Writables.strings());

PTable<String, Long> counts = words.count();
pipeline.writeTextFile(counts, args[1]);

// Execute the pipeline as a MapReduce.
pipeline.done();
In this example, we first create an MRPipeline and use it to read the content of sample.txt, created with stream.py -t, into a collection of strings, where each element of the collection represents a tweet. We tokenize each tweet into words with tweet.split("\\s+"), and we emit each word that matches the hashtag regular expression, serialized as a Writable. Note that the tokenizing and filtering operations are executed in parallel by MapReduce jobs created by the parallelDo call. We create a PTable that associates each hashtag, represented as a string, with the number of times it occurred in the datasets. Finally, we write the PTable counts into HDFS as a text file. The pipeline is executed with pipeline.done().
To compile and execute the pipeline, we can use Gradle to manage the needed dependencies, as follows:

$ ./gradlew jar
$ ./gradlew copyJars

Add the Crunch and Avro dependencies downloaded with copyJars to the LIBJARS environment variable:

$ export CRUNCH_DEPS=build/libjars/crunch-example/lib
$ export LIBJARS=${LIBJARS},${CRUNCH_DEPS}/crunch-core-0.9.0-cdh5.0.3.jar,${CRUNCH_DEPS}/avro-1.7.5-cdh5.0.3.jar,${CRUNCH_DEPS}/avro-mapred-1.7.5-cdh5.0.3-hadoop2.jar
Then, run the example on Hadoop:

$ hadoop jar build/libs/crunch-example.jar \
    com.learninghadoop2.crunch.HashtagCount \
    tweets.json count-out \
    -libjars $LIBJARS
Data serialization

One of the framework's goals is to make it easy to process complex records containing nested and repeated data structures, such as protocol buffers and Thrift records.

The org.apache.crunch.types.PType interface defines the mapping between a data type that is used in a Crunch pipeline and a serialization and storage format that is used to read/write data from/to HDFS. Every PCollection has an associated PType that tells Crunch how to read/write data.

The org.apache.crunch.types.PTypeFamily interface provides an abstract factory to implement instances of PType that share the same serialization format. Currently, Crunch supports two type families: one based on the Writable interface and the other on Apache Avro.

Note

Although Crunch permits mixing and matching PCollection interfaces that use different instances of PType in the same pipeline, each PCollection interface's PType must belong to a unique family. For instance, it is not possible to have a PTable with a key serialized as Writable and its value serialized using Avro.

Both type families support a common set of primitive types (strings, longs, integers, floats, doubles, booleans, and bytes) as well as more complex PType interfaces that can be constructed out of other PTypes. These include tuples and collections of other PTypes. A particularly important, complex, PType is tableOf, which determines whether the return type of parallelDo will be a PCollection or a PTable.
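As a small illustration of the two families (the DoFn names here are placeholders rather than classes from the book's examples, and lines is assumed to be a PCollection<String> like the one in the earlier HashtagCount pipeline), the same parallelDo call can target either a Writable-typed or an Avro-typed output, and tableOf is what turns the result into a PTable:

// Writable family: a plain collection of strings
PCollection<String> tags = lines.parallelDo(new ExtractHashtagsFn(), Writables.strings());

// Avro family: tableOf makes parallelDo return a PTable instead of a PCollection
PTable<String, Long> tagCounts = lines.parallelDo(new CountingFn(),
        Avros.tableOf(Avros.strings(), Avros.longs()));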
New PTypes can be created by inheriting from and extending the built-ins of the Avro and Writable families. This requires implementing inputMapFn<S, T> and outputMapFn<T, S> classes, where S is the original type and T is the new type.

Derived PTypes can be found in the PTypes class. These include serialization support for protocol buffers, Thrift records, Java Enums, BigInteger, and UUIDs. The Elephant Bird library we discussed in Chapter 6, Data Analysis with Apache Pig, contains additional examples.
Data processing patterns

org.apache.crunch.lib implements a number of design patterns for common data manipulation operations.

Aggregation and sorting

Most of the data processing patterns provided by org.apache.crunch.lib rely on PTable's groupByKey method. The method has three different overloaded forms:

- groupByKey(): lets the planner determine the number of partitions
- groupByKey(int numPartitions): sets the number of partitions specified by the developer
- groupByKey(GroupingOptions options): allows us to specify custom partitions and comparators for shuffling

The org.apache.crunch.GroupingOptions class takes instances of Hadoop's Partitioner and RawComparator classes to implement custom partitioning and sorting operations.

The groupByKey method returns an instance of PGroupedTable, Crunch's representation of a grouped table. It corresponds to the output of the shuffle phase of a MapReduce job and allows values to be combined with the combineValues method.

The org.apache.crunch.lib.Aggregate class exposes methods to perform simple aggregations (count, max, top, and length) on PCollection instances.

Sort provides an API to sort PCollection and PTable instances whose contents implement the Comparable interface.

By default, Crunch sorts data using one reducer. This behavior can be modified by passing the number of partitions required to the sort method. The Sort.Order enum signals the order in which a sort should be done.
The following are the different sort options that can be specified for collections:

public static <T> PCollection<T> sort(PCollection<T> collection)
public static <T> PCollection<T> sort(PCollection<T> collection, Sort.Order order)
public static <T> PCollection<T> sort(PCollection<T> collection, int numReducers, Sort.Order order)

The following are the different sort options that can be specified for tables:

public static <K, V> PTable<K, V> sort(PTable<K, V> table)
public static <K, V> PTable<K, V> sort(PTable<K, V> table, Sort.Order key)
public static <K, V> PTable<K, V> sort(PTable<K, V> table, int numReducers, Sort.Order key)

Finally, sortPairs sorts a PCollection of pairs using the column order specified in Sort.ColumnOrder:

sortPairs(PCollection<Pair<U, V>> collection, Sort.ColumnOrder… columnOrders)
Joining data

The org.apache.crunch.lib.Join package is an API to join PTables based on a common key. The following four join operations are supported:

- fullJoin
- join (defaults to innerJoin)
- leftJoin
- rightJoin

The methods have a common return type and signature. For reference, we will describe the commonly used join method that implements an inner join:

public static <K, U, V> PTable<K, Pair<U, V>> join(PTable<K, U> left, PTable<K, V> right)

The org.apache.crunch.lib.join.JoinStrategy interface allows custom join strategies to be defined. Crunch's default strategy (defaultStrategy) is to join data reduce-side.
Pipelines implementation and execution

Crunch comes with three implementations of the Pipeline interface. The oldest one, implicitly used in this chapter, is org.apache.crunch.impl.mr.MRPipeline, which uses Hadoop's MapReduce as its execution engine. org.apache.crunch.impl.mem.MemPipeline allows all operations to be performed in memory, with no serialization to disk. Crunch 0.10 introduced org.apache.crunch.impl.spark.SparkPipeline, which compiles and runs a DAG of PCollections on Apache Spark.

SparkPipeline

With SparkPipeline, Crunch delegates much of the execution to Spark and does relatively little of the planning tasks, with the following exceptions:

- Multiple inputs
- Multiple outputs
- Data serialization
- Checkpointing

At the time of writing, SparkPipeline is still heavily under development and might not handle all of the use cases of a standard MRPipeline. The Crunch community is actively working to ensure complete compatibility between the two implementations.
MemPipeline

MemPipeline executes in-memory on the client. Unlike MRPipeline, a MemPipeline is not explicitly created but obtained by calling the static method MemPipeline.getInstance(). All operations are performed in memory, and the use of PTypes is minimal.
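This makes MemPipeline handy for quick local tests; a minimal sketch, with arbitrary sample strings, might look like this:

// obtain the in-memory pipeline and build a small collection without touching HDFS
Pipeline pipeline = MemPipeline.getInstance();
PCollection<String> lines = MemPipeline.collectionOf("#hadoop is great", "learning #hadoop");

// run the same kind of DoFn logic an MRPipeline would run, then inspect results locally
PTable<String, Long> counts = lines.parallelDo(new DoFn<String, String>() {
    public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
            if (word.startsWith("#")) {
                emitter.emit(word);
            }
        }
    }
}, Writables.strings()).count();

for (Pair<String, Long> entry : counts.materialize()) {
    System.out.println(entry.first() + "\t" + entry.second());
}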
Crunch examples

We will now use Apache Crunch to reimplement some of the MapReduce code written so far in a more modular fashion.

Word co-occurrence

In Chapter 3, Processing – MapReduce and Beyond, we showed a MapReduce job, BiGramCount, to count co-occurrences of words in tweets. That same logic can be implemented as a DoFn. Instead of emitting a multi-field key and having to parse it at a later stage, with Crunch we can use the complex type Pair<String, String>, as follows:
class BiGram extends DoFn<String, Pair<String, String>> {
    @Override
    public void process(String tweet,
            Emitter<Pair<String, String>> emitter) {
        String[] words = tweet.split(" ");
        String prev = null;
        for (String s : words) {
            if (prev != null) {
                emitter.emit(Pair.of(prev, s));
            }
            prev = s;
        }
    }
}
Notice how, compared to MapReduce, the BiGram Crunch implementation is a standalone class, easily reusable in any other codebase. The code for this example is included in https://github.com/learninghadoop2/book-examples/blob/master/ch9/crunch/src/main/java/com/learninghadoop2/crunch/DataPreparationPipeline.java.

TF-IDF

We can implement the TF-IDF chain of jobs with an MRPipeline, as follows:
publicclassCrunchTermFrequencyInvertedDocumentFrequency
extendsConfiguredimplementsTool,Serializable{
privateLongnumDocs;
@SuppressWarnings("deprecation")
publicstaticclassTF{
Stringterm;
StringdocId;
intfrequency;
publicTF(){}
publicTF(Stringterm,
StringdocId,Integerfrequency){
this.term=term;
this.docId=docId;
this.frequency=(int)frequency;
}
}
publicintrun(String[]args)throwsException{
if(args.length!=2){
System.err.println();
System.err.println("Usage:"+this.getClass().getName()+"
[genericoptions]inputoutput");
return1;
}
//Createanobjecttocoordinatepipelinecreationandexecution.
Pipelinepipeline=
newMRPipeline(TermFrequencyInvertedDocumentFrequency.class,getConf());
//enabledebugoptions
pipeline.enableDebug();
//ReferenceagiventextfileasacollectionofStrings.
PCollection<String>tweets=pipeline.readTextFile(args[0]);
numDocs=tweets.length().getValue();
//WeuseAvroreflectionstomaptheTFPOJOtoavsc
PTable<String,TF>tf=tweets.parallelDo(newTermFrequencyAvro(),
Avros.tableOf(Avros.strings(),Avros.reflects(TF.class)));
//CalculateDF
PTable<String,Long>df=Aggregate.count(tf.parallelDo(new
DocumentFrequencyString(),Avros.strings()));
//FinallywecalculateTF-IDF
PTable<String,Pair<TF,Long>>tfDf=Join.join(tf,df);
PCollection<Tuple3<String,String,Double>>tfIdf=
tfDf.parallelDo(newTermFrequencyInvertedDocumentFrequency(),
Avros.triples(
Avros.strings(),
Avros.strings(),
Avros.doubles()));
//Serializeasavro
tfIdf.write(To.avroFile(args[1]));
//ExecutethepipelineasaMapReduce.
PipelineResultresult=pipeline.done();
returnresult.succeeded()?0:1;
}
…
}
The approach that we follow here has a number of advantages compared to streaming. First of all, we don't need to manually chain MapReduce jobs using a separate script; this task is Crunch's main purpose. Secondly, we can express each component of the metric as a distinct class, making it easier to reuse in future applications.

To implement term frequency, we create a DoFn class that takes a tweet as input and emits Pair<String, TF>. The first element is a term, and the second is an instance of the POJO class that will be serialized using Avro. The TF part contains three variables: term, documentId, and frequency. In the reference implementation, we expect input data to be a JSON string that we deserialize and parse. We also include tokenizing as a subtask of the process method.

Depending on the use cases, we could abstract both operations into separate DoFns, as follows:
classTermFrequencyAvroextendsDoFn<String,Pair<String,TF>>{
publicvoidprocess(StringJSONTweet,
Emitter<Pair<String,TF>>emitter){
Map<String,Integer>termCount=newHashMap<>();
Stringtweet;
StringdocId;
JSONParserparser=newJSONParser();
try{
Objectobj=parser.parse(JSONTweet);
JSONObjectjsonObject=(JSONObject)obj;
tweet=(String)jsonObject.get("text");
docId=(String)jsonObject.get("id_str");
for(Stringterm:tweet.split("\\s+")){
if(termCount.containsKey(term.toLowerCase())){
termCount.put(term,
termCount.get(term.toLowerCase())+1);
}else{
termCount.put(term.toLowerCase(),1);
}
}
for(Entry<String,Integer>entry:termCount.entrySet()){
emitter.emit(Pair.of(entry.getKey(),newTF(entry.getKey(),
docId,entry.getValue())));
}
}catch(ParseExceptione){
e.printStackTrace();
}
}
}
}
Document frequency is straightforward. For each Pair<String, TF> generated in the term frequency step, we emit the term, the first element of the pair. We aggregate and count the resulting PCollection of terms to obtain document frequency, as follows:
classDocumentFrequencyStringextendsDoFn<Pair<String,TF>,String>{
@Override
publicvoidprocess(Pair<String,TF>tfAvro,
Emitter<String>emitter){
emitter.emit(tfAvro.first());
}
}
We finally join the PTable TF with the PTable DF on the shared key (term) and feed the resulting Pair<String, Pair<TF, Long>> object to TermFrequencyInvertedDocumentFrequency. For each term and document, we calculate TF-IDF and return a (term, docId, tfIdf) triple:
classTermFrequencyInvertedDocumentFrequencyextendsMapFn<Pair<String,
Pair<TF,Long>>,Tuple3<String,String,Double>>{
@Override
publicTuple3<String,String,Double>map(
Pair<String,Pair<TF,Long>>input){
Pair<TF,Long>tfDf=input.second();
Longdf=tfDf.second();
TFtf=tfDf.first();
doubleidf=1.0+Math.log(numDocs/df);
doubletfIdf=idf*tf.frequency;
returnTuple3.of(tf.term,tf.docId,tfIdf);
}
}
We use MapFn because we are going to output one record for each input. The source code for this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch9/crunch/src/main/java/com/learninghadoop2/crunch/CrunchTermFrequencyInvertedDocumentFrequency.java.
The example can be compiled and executed with the following commands:

$ ./gradlew jar
$ ./gradlew copyJars

If not already done, add the Crunch and Avro dependencies downloaded with copyJars to the LIBJARS environment variable, as follows:

$ export CRUNCH_DEPS=build/libjars/crunch-example/lib
$ export LIBJARS=${LIBJARS},${CRUNCH_DEPS}/crunch-core-0.9.0-cdh5.0.3.jar,${CRUNCH_DEPS}/avro-1.7.5-cdh5.0.3.jar,${CRUNCH_DEPS}/avro-mapred-1.7.5-cdh5.0.3-hadoop2.jar

Furthermore, add the json-simple JAR to LIBJARS:

$ export LIBJARS=${LIBJARS},${CRUNCH_DEPS}/json-simple-1.1.1.jar

Finally, run CrunchTermFrequencyInvertedDocumentFrequency as a MapReduce job, as follows:

$ hadoop jar build/libs/crunch-example.jar \
    com.learninghadoop2.crunch.CrunchTermFrequencyInvertedDocumentFrequency \
    -libjars ${LIBJARS} \
    tweets.json tweets.avro-out
Kite Morphlines

Kite Morphlines is a data transformation library, inspired by Unix pipes, originally developed as part of Cloudera Search. A morphline is an in-memory chain of transformation commands that relies on a plugin structure to tap heterogeneous data sources. It uses declarative commands to carry out ETL operations on records. Commands are defined in a configuration file, which is later fed to a driver class.

The goal is to make embedding ETL logic into any Java codebase a trivial task by providing a library that allows developers to replace programming with a series of configuration settings.

Concepts

Morphlines are built around two abstractions: Command and Record.

Records are implementations of the org.kitesdk.morphline.api.Record class:
public final class Record {
    private ArrayListMultimap<String, Object> fields;
    …
    private Record(ArrayListMultimap<String, Object> fields) {…}
    public ListMultimap<String, Object> getFields() {…}
    public List get(String key) {…}
    public void put(String key, Object value) {…}
    …
}
A record is a set of named fields, where each field has a list of one or more values. A Record is implemented on top of Google Guava's ListMultimap and ArrayListMultimap classes. Note that a value can be any Java object, fields can be multivalued, and two records don't need to use common field names. A record can contain an _attachment_body field that can be a java.io.InputStream or a byte array.

Commands implement the org.kitesdk.morphline.api.Command interface:
public interface Command {
    void notify(Record notification);
    boolean process(Record record);
    Command getParent();
}
A command transforms a record into zero or more records. Commands can call the methods on the Record instance provided for read and write operations as well as for adding or removing fields.

Commands are chained together, and at each step of a morphline the parent command sends records to its child, which in turn processes them. Information between parents and children is exchanged using two communication channels (planes): notifications are sent via a control plane, and records are sent over a data plane. Records are processed by the process() method, which returns a Boolean value to indicate whether the morphline should proceed or not.

Commands are not instantiated directly, but via an implementation of the org.kitesdk.morphline.api.CommandBuilder interface:
public interface CommandBuilder {
    Collection<String> getNames();
    Command build(Config config,
                  Command parent,
                  Command child,
                  MorphlineContext context);
}
The getNames method returns the names with which the command can be invoked. Multiple names are supported to allow backwards-compatible name changes. The build() method creates and returns a command rooted at the given morphline configuration.

The org.kitesdk.morphline.api.MorphlineContext interface allows additional parameters to be passed to all morphline commands.

The data model of morphlines is structured following a source-pipe-sink pattern, where data is captured from a source, piped through a number of processing steps, and its output is then delivered into a sink.

Morphline commands

Kite Morphlines comes with a number of default commands that implement data transformations on common serialization formats (plain text, Avro, JSON). Currently available commands are organized as subprojects of morphlines and include:

- kite-morphlines-core-stdio: reads data from binary large objects (BLOBs) and text
- kite-morphlines-core-stdlib: wraps Java data types for data manipulation and representation
- kite-morphlines-avro: is used for serialization into and deserialization from data in the Avro format
- kite-morphlines-json: serializes and deserializes data in JSON format
- kite-morphlines-hadoop-core: is used to access HDFS
- kite-morphlines-hadoop-parquet-avro: serializes and deserializes data in the Parquet format
- kite-morphlines-hadoop-sequencefile: serializes and deserializes data in the SequenceFile format
- kite-morphlines-hadoop-rcfile: serializes and deserializes data in the RCFile format

A list of all available commands can be found at http://kitesdk.org/docs/0.17.0/kite-morphlines/morphlinesReferenceGuide.html.

Commands are defined by declaring a chain of transformations in a configuration file, morphline.conf, which is then compiled and executed by a driver program. For instance, we can specify a read_tweets morphline that will load tweets stored as JSON data, serialize and deserialize them using Jackson, and print the first 10, by combining the default readJson and head commands contained in the org.kitesdk.morphline package, as follows:
morphlines : [{
    id : read_tweets
    importCommands : ["org.kitesdk.morphline.**"]
    commands : [
        {
            readJson {
                outputClass : com.fasterxml.jackson.databind.JsonNode
            }
        }
        {
            head {
                limit : 10
            }
        }
    ]
}]
We will now show how this morphline can be executed both from a standalone Java program as well as from MapReduce.

MorphlineDriver.java shows how to use the library embedded into a host system. The first step that we carry out in the main method is to load the morphline's JSON configuration, build a MorphlineContext object, and compile it into an instance of Command that acts as the starting node of the morphline. Note that Compiler.compile() takes a finalChild parameter; in this case, it is RecordEmitter. We use RecordEmitter to act as a sink for the morphline, by either printing a record to stdout or storing it into HDFS. In the MorphlineDriver example, we use org.kitesdk.morphline.base.Notifications to manage and monitor the morphline lifecycle in a transactional fashion.

A call to Notifications.notifyStartSession(morphline) starts the transformation chain within a transaction defined by calling Notifications.notifyBeginTransaction. Upon success, we terminate the pipeline with Notifications.notifyShutdown(morphline). In the event of failure, we roll back the transaction with Notifications.notifyRollbackTransaction(morphline) and pass an exception handler from the morphline context to the calling Java code:
publicclassMorphlineDriver{
privatestaticfinalclassRecordEmitterimplementsCommand{
privatefinalTextline=newText();
@Override
publicCommandgetParent(){
returnnull;
}
@Override
publicvoidnotify(Recordrecord){
}
@Override
publicbooleanprocess(Recordrecord){
line.set(record.get("_attachment_body").toString());
System.out.println(line);
returntrue;
}
}
publicstaticvoidmain(String[]args)throwsIOException{
/*loadamorphlineconfandsetitup*/
FilemorphlineFile=newFile(args[0]);
StringmorphlineId=args[1];
MorphlineContextmorphlineContext=new
MorphlineContext.Builder().build();
Commandmorphline=newCompiler().compile(morphlineFile,
morphlineId,morphlineContext,newRecordEmitter());
/*Preparethemorphlineforexecution
*
*Notificationsaresentthroughthecommunicationchannel
**/
Notifications.notifyBeginTransaction(morphline);
/*Notethatweareusingthelocalfilesystem,nothdfs*/
InputStreamin=newBufferedInputStream(new
FileInputStream(args[2]));
/*fillinarecordandpassitover*/
Recordrecord=newRecord();
record.put(Fields.ATTACHMENT_BODY,in);
try{
Notifications.notifyStartSession(morphline);
booleansuccess=morphline.process(record);
if(!success){
System.out.println("Morphlinefailedtoprocessrecord:"+
record);
}
/*Committhemorphline*/
}catch(RuntimeExceptione){
Notifications.notifyRollbackTransaction(morphline);
morphlineContext.getExceptionHandler().handleException(e,null);
}
finally{
in.close();
}
/*shutitdown*/
Notifications.notifyShutdown(morphline);
}
}
In this example, we load data in JSON format from the local filesystem into an InputStream object and use it to initialize a new Record instance. The RecordEmitter class holds the last processed record of the chain, from which we extract _attachment_body and print it to standard output. The source code for MorphlineDriver can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch9/kite/src/main/java/com/learninghadoop2/kite/morphlines/MorphlineDriver.java.

Using the same morphline from a MapReduce job is straightforward. During the setup phase of the Mapper, we build a context that contains the instantiation logic, while the map method sets the Record object up and fires off the processing logic, as follows:
publicstaticclassReadTweets
extendsMapper<Object,Text,Text,NullWritable>{
privatefinalRecordrecord=newRecord();
privateCommandmorphline;
@Override
protectedvoidsetup(Contextcontext)
throwsIOException,InterruptedException{
FilemorphlineConf=newFile(context.getConfiguration()
.get(MORPHLINE_CONF));
StringmorphlineId=context.getConfiguration()
.get(MORPHLINE_ID);
MorphlineContextmorphlineContext=
newMorphlineContext.Builder()
.build();
morphline=neworg.kitesdk.morphline.base.Compiler()
.compile(morphlineConf,
morphlineId,
morphlineContext,
newRecordEmitter(context));
}
publicvoidmap(Objectkey,Textvalue,Contextcontext)
throwsIOException,InterruptedException{
record.put(Fields.ATTACHMENT_BODY,
newByteArrayInputStream(
value.toString().getBytes("UTF8")));
if(!morphline.process(record)){
System.out.println(
"Morphlinefailedtoprocessrecord:"+record);
}
record.removeAll(Fields.ATTACHMENT_BODY);
}
}
In the MapReduce code, we modify RecordEmitter to extract the Fields payload from post-processed records and store it into the context. This allows us to write data into HDFS by specifying a FileOutputFormat in the MapReduce configuration boilerplate:
privatestaticfinalclassRecordEmitterimplementsCommand{
privatefinalTextline=newText();
privatefinalMapper.Contextcontext;
privateRecordEmitter(Mapper.Contextcontext){
this.context=context;
}
@Override
publicvoidnotify(Recordnotification){
}
@Override
publicCommandgetParent(){
returnnull;
}
@Override
publicbooleanprocess(Recordrecord){
line.set(record.get(Fields.ATTACHMENT_BODY).toString());
try{
context.write(line,null);
}catch(Exceptione){
e.printStackTrace();
returnfalse;
}
returntrue;
}
}
Notice that we can now change the processing pipeline behavior and add further data transformations by modifying morphline.conf, without the explicit need to alter the instantiation and processing logic. The MapReduce driver source code can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch9/kite/src/main/java/com/learninghadoop2/kite/morphlines/MorphlineDriverMapReduce.java.

Both examples can be compiled from ch9/kite/ with the following commands:

$ ./gradlew jar
$ ./gradlew copyJars

We add the runtime dependencies to LIBJARS, as follows:
$ export KITE_DEPS=build/libjars/kite-example/lib
$ export LIBJARS=${LIBJARS},${KITE_DEPS}/kite-morphlines-core-
0.17.0.jar,${KITE_DEPS}/kite-morphlines-json-
0.17.0.jar,${KITE_DEPS}/metrics-core-3.0.2.jar,${KITE_DEPS}/metrics-
healthchecks-3.0.2.jar,${KITE_DEPS}/config-1.0.2.jar,${KITE_DEPS}/jackson-
databind-2.3.1.jar,${KITE_DEPS}/jackson-core-
2.3.1.jar,${KITE_DEPS}/jackson-annotations-2.3.0.jar
We can run the MapReduce driver with the following:

$ hadoop jar build/libs/kite-example.jar \
    com.learninghadoop2.kite.morphlines.MorphlineDriverMapReduce \
    -libjars ${LIBJARS} \
    morphline.conf \
    read_tweets \
    tweets.json \
    morphlines-out
The Java standalone driver can be executed with the following command:

$ export CLASSPATH=${CLASSPATH}:${KITE_DEPS}/kite-morphlines-core-
0.17.0.jar:${KITE_DEPS}/kite-morphlines-json-
0.17.0.jar:${KITE_DEPS}/metrics-core-3.0.2.jar:${KITE_DEPS}/metrics-
healthchecks-3.0.2.jar:${KITE_DEPS}/config-1.0.2.jar:${KITE_DEPS}/jackson-
databind-2.3.1.jar:${KITE_DEPS}/jackson-core-
2.3.1.jar:${KITE_DEPS}/jackson-annotations-2.3.0.jar:${KITE_DEPS}/slf4j-
api-1.7.5.jar:${KITE_DEPS}/guava-11.0.2.jar:${KITE_DEPS}/hadoop-common-
2.3.0-cdh5.0.3.jar
$ java -cp $CLASSPATH:./build/libs/kite-example.jar \
    com.learninghadoop2.kite.morphlines.MorphlineDriver \
    morphline.conf \
    read_tweets tweets.json \
    morphlines-out
Summary

In this chapter, we introduced four tools to ease development on Hadoop. In particular, we covered:

- How Hadoop streaming allows the writing of MapReduce jobs using dynamic languages
- How Kite Data simplifies interfacing with heterogeneous data sources
- How Apache Crunch provides a high-level abstraction to write pipelines of Spark and MapReduce jobs that implement common design patterns
- How Morphlines allows us to declare chains of commands and data transformations that can then be embedded in any Java codebase

In Chapter 10, Running a Hadoop 2 Cluster, we will shift our focus from the domain of software development to system administration. We will discuss how to set up, manage, and scale a Hadoop cluster, while taking aspects such as monitoring and security into consideration.
Chapter 10. Running a Hadoop Cluster

In this chapter, we will change our focus a little and look at some of the considerations you will face when running an operational Hadoop cluster. In particular, we will cover the following topics:

- Why a developer should care about operations and why Hadoop operations are different
- More detail on Cloudera Manager and its capabilities and limitations
- Designing a cluster for use on both physical hardware and EMR
- Securing a Hadoop cluster
- Hadoop monitoring
- Troubleshooting problems with an application running on Hadoop
I'm a developer – I don't care about operations!

Before going any further, we need to explain why we are putting a chapter about systems operations in a book squarely aimed at developers. For anyone who has developed for more traditional platforms (for example, web apps, database programming, and so on), the norm might well have been a very clear delineation between development and operations. The first group builds the code and packages it up, and the second group controls and operates the environment in which it runs.

In recent years, the DevOps movement has gained momentum with a belief that it is best for everyone if these silos are removed and the teams work more closely together. When it comes to running systems and services based on Hadoop, we believe this is absolutely essential.

Hadoop and DevOps practices

Even though a developer can conceptually build an application ready to be dropped into YARN and forgotten about, the reality is often more nuanced. How many resources are allocated to the application at runtime is most likely something the developer wishes to influence. Once the application is running, the operations staff likely want some insight into the application when they are trying to optimize the cluster. There really isn't the same clear-cut split of responsibilities seen in traditional enterprise IT. And that's likely a really good thing.

In other words, developers need to be more aware of the operations aspects, and the operations staff need to be more aware of what the developers are doing. So consider this chapter our contribution to help you have those discussions with your operations staff. We don't intend to make you an expert Hadoop administrator by the end of this chapter; that really is emerging as a dedicated role and skill set in itself. Instead, we will give a whistle-stop tour of issues you do need some awareness of and that will make your life easier once your applications are running on live clusters.

By the nature of this coverage, we will be touching on a lot of topics and going into them only lightly; if any are of deeper interest, then we provide links for further investigation. Just make sure you keep your operations staff involved!
Cloudera Manager

In this book, we used the Cloudera Hadoop Distribution (CDH) as the most common platform, with its convenient QuickStart virtual machine and the powerful Cloudera Manager application. With a Cloudera-based cluster, Cloudera Manager will become (at least initially) your primary interface into the system to manage and monitor the cluster, so let's explore it a little.

Note that Cloudera Manager has extensive and high-quality online documentation. We won't duplicate this documentation here; instead, we'll attempt to highlight where Cloudera Manager fits into your development and operational workflows and how it might or might not be something you want to embrace. Documentation for the latest and previous versions of Cloudera Manager can be accessed via the main Cloudera documentation page at http://www.cloudera.com/content/support/en/documentation.html.

To pay or not to pay

Before getting all excited about Cloudera Manager, it's important to consult the current documentation concerning what features are available in the free version and which ones require subscription to a paid-for Cloudera offering. If you absolutely want some of the features offered only in the paid-for version but either can't or don't wish to pay for subscription services, then Cloudera Manager, and possibly the entire Cloudera distribution, might not be a good fit for you. We'll return to this topic in Chapter 11, Where to Go Next.
Cluster management using Cloudera Manager

Using the QuickStart VM, it won't be obvious, but Cloudera Manager is the primary tool to be used for management of all services in the cluster. If you want to enable a new service, you'll use Cloudera Manager. To change a configuration, you will need Cloudera Manager. To upgrade to the latest release, you will again require Cloudera Manager.

Even if the primary management of the cluster is handled by operational staff, as a developer you'll likely still want to become familiar with the Cloudera Manager interface, just to see exactly how the cluster is configured. If your jobs are running slowly, then looking into Cloudera Manager to see just how things are currently configured will likely be your first step. The default port for the Cloudera Manager web interface is 7180, so the home page will usually be reached via a URL such as http://<hostname>:7180/cmf/home, and can be seen in the following screenshot:

Cloudera Manager home page

It's worth poking around the interface; however, if you are connecting with a user account with admin privileges, be careful!

Click on the Clusters link, and this will expand to give a list of the clusters currently managed by this instance of Cloudera Manager. This should tell you that a single Cloudera Manager instance can manage multiple clusters, which is very useful, especially if you have many clusters spread across development and production.

For each expanded cluster, there will be a list of the services currently running on the cluster. Click on a service, and you will see a list of additional choices. Select Configuration, and you can start browsing the detailed configuration of that particular service. Click on Actions, and you will get some service-specific options; this will usually include stopping, starting, restarting, and otherwise managing the service.

Click on the Hosts option instead of Clusters, and you can start drilling down into the servers managed by Cloudera Manager and, from there, see which service components are deployed on each.
Cloudera Manager and other management tools
That last comment might raise a question: how does Cloudera Manager integrate with other systems management tools? Given our earlier comments regarding the importance of DevOps philosophies, how well does it integrate with the tools favored in DevOps environments?
The honest answer: not always very well. Though the main Cloudera Manager server can itself be managed by automation tools, such as Puppet or Chef, there is an explicit assumption that Cloudera Manager will control the installation and configuration of all the software it needs on all the hosts that will be included in its clusters. To some administrators, this makes the hardware behind Cloudera Manager look like a big black box; they might control the installation of the base operating system, but the management of the configuration baseline going forward is entirely handled by Cloudera Manager. There's nothing much to be done here; it is what it is: to get the benefits of Cloudera Manager, it will add itself as a new management system in your infrastructure, and how well that fits in with your broader environment will be determined on a case-by-case basis.
Monitoring with Cloudera Manager
A similar point can be made regarding systems monitoring, as Cloudera Manager is also conceptually a point of duplication here. But start clicking around the interface, and it will become apparent very quickly that Cloudera Manager provides an exceptionally rich set of tools to assess the health and performance of managed clusters.
From graphing the relative performance of Impala queries through showing the job status for YARN applications and giving low-level data on the blocks stored on HDFS, it is all there in a single interface. We'll discuss later in this chapter how troubleshooting on Hadoop can be challenging, but the single point of visibility provided by Cloudera Manager is a great tool when looking to assess cluster health or performance. We'll discuss monitoring in a little more detail later in this chapter.
Finding configuration files
One of the first confusions faced when running a cluster managed by Cloudera Manager is trying to find the configuration files used by the cluster. In the vanilla Apache releases of products, such as core Hadoop, these files would typically be stored in /etc/hadoop, similarly /etc/hive for Hive, /etc/oozie for Oozie, and so on.
In a Cloudera Manager managed cluster, however, the config files are regenerated each time a service is restarted, and instead of sitting in the /etc locations on the filesystem, they will be found at /var/run/cloudera-scm-agent-process/<pid>-<taskname>/, where the last directory might have a name such as 7007-yarn-NODEMANAGER. This might seem odd to anyone used to working on earlier Hadoop clusters or other distributions that don't do such a thing. But in a Cloudera Manager-controlled cluster, it is often easier to use the web interface to browse the configuration instead of looking for the underlying config files. Which approach is best? This is a little philosophical, and each team needs to decide which works best for them.
Cloudera Manager API
We've only given the highest level of overview of Cloudera Manager, and in doing so, have completely ignored one area that might be very useful for some organizations: Cloudera Manager offers an API that allows integration of its capabilities into other systems and tools. Consult the documentation if this might be of interest to you.
Cloudera Manager lock-in
This brings us to the point that is implicit in the whole discussion around Cloudera Manager: it does cause a degree of lock-in to Cloudera and their distribution. That lock-in might only exist in certain ways; code, for example, should be portable across clusters modulo the usual caveats about different underlying versions, but the cluster itself might not easily be reconfigured to use a different distribution. Assume that switching distributions would be a complete remove/reformat/reinstall activity.
We aren't saying don't use it, rather that you need to be aware of the lock-in that comes with the use of Cloudera Manager. For small teams with little dedicated operations support or existing infrastructure, the impact of such a lock-in is likely outweighed by the significant capabilities that Cloudera Manager gives you.
For larger teams or ones working in an environment where integration with existing tools and processes has more weight, the decision might be less clear. Look at Cloudera Manager, discuss with your operations people, and determine what is right for you.
Note that it is possible to manually download and install the various components of the Cloudera distribution without using Cloudera Manager to manage the cluster and its hosts. This might be an attractive middle ground for some users, as the Cloudera software can be used while deployment and management are built into the existing deployment and management tools. This is also potentially a way of avoiding the additional expense of the paid-for levels of Cloudera support mentioned earlier.
Ambari – the open source alternative
Ambari is an Apache project (http://ambari.apache.org), which, in theory, provides an open source alternative to Cloudera Manager. It is the administration console for the Hortonworks distribution, and at the time of writing, Hortonworks employees are also the vast majority of the project contributors.
Ambari, as one would expect given its open source nature, relies on other open source products, such as Puppet and Nagios, to provide the management and monitoring of its managed clusters. It also has high-level functionality similar to Cloudera Manager, that is, the installation, configuration, management, and monitoring of a Hadoop cluster and the component services within it.
It is good to be aware of the Ambari project, as the choice is not just between full lock-in to Cloudera and Cloudera Manager or a manually managed cluster. Ambari provides a graphical tool that might be worth consideration, or indeed involvement, as it matures. On an HDP cluster, the Ambari UI equivalent to the Cloudera Manager home page shown earlier can be reached at http://<hostname>:8080/#/main/dashboard and looks like the following screenshot:
Ambari
Operations in the Hadoop 2 world
As mentioned in Chapter 2, Storage, some of the most significant changes made to HDFS in Hadoop 2 involve its fault tolerance and better integration with external systems. This is not just a curiosity; the NameNode High Availability features, in particular, have made a massive difference in the management of clusters since Hadoop 1. In the bad old days of 2012 or so, a significant part of the operational preparedness of a Hadoop cluster was built around mitigations for, and restoration processes around, failure of the NameNode. If the NameNode died in Hadoop 1 and you didn't have a backup of the HDFS fsimage metadata file, then you basically lost access to all your data. If the metadata was permanently lost, then so was the data.
Hadoop 2 has added the in-built NameNode HA and the machinery to make it work. In addition, there are components such as the NFS gateway into HDFS, which make it a much more flexible system. But this additional capability does come at the expense of more moving parts. To enable NameNode HA, there are additional components in the JournalManager and FailoverController, and the NFS gateway requires Hadoop-specific implementations of the portmap and nfsd services.
Hadoop 2 also now has extensive other integration points with external services, as well as a much broader selection of applications and services that run atop it. Consequently, it might be useful to view Hadoop 2 in terms of operations as having traded the simplicity of Hadoop 1 for additional complexity, which delivers a substantially more capable platform.
Sharing resources
In Hadoop 1, the only time one had to consider resource sharing was in choosing which scheduler to use for the MapReduce JobTracker. Since all jobs were eventually translated into MapReduce code, having a policy for resource sharing at the MapReduce level was usually sufficient to manage cluster workloads in the large.
Hadoop 2 and YARN changed this picture. As well as running many MapReduce jobs, a cluster might also be running many other applications atop other YARN ApplicationMasters. Tez and Spark are frameworks in their own right that run additional applications atop their provided interfaces.
If everything runs on YARN, then it provides ways of configuring the maximum resource allocation (in terms of CPU, memory, I/O, and so on) consumed by each container allocated to an application. The primary goal here is to ensure that enough resources are allocated to keep the hardware fully utilized without either having unused capacity or overloading it.
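For instance, the following yarn-site.xml fragment shows the standard YARN properties that bound what a NodeManager offers to containers and the largest single allocation the scheduler will grant. The values shown are purely illustrative and should come from your own hosts and load tests, not be copied as-is:
<!-- yarn-site.xml: illustrative values only -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>49152</value>   <!-- memory this NodeManager offers to containers -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>16</value>      <!-- virtual cores offered to containers -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value>    <!-- largest single container the scheduler will grant -->
</property>
On a Cloudera Manager cluster, the same settings are exposed through the YARN service configuration pages rather than by editing the file directly.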
Things get somewhat more interesting when non-YARN applications, such as Impala, are running on the cluster and want to grab allocated slices of capacity (particularly memory in the case of Impala). This could also happen if, say, you were running Spark on the same hosts in its non-YARN mode, or indeed any other distributed application that might benefit from co-location on the Hadoop machines.
Basically, in Hadoop 2, you need to think of the cluster as much more of a multi-tenancy environment that requires more attention given to the allocation of resources to the various tenants.
There really is no silver bullet recommendation here; the right configuration will be entirely dependent on the services co-located and the workloads they are running. This is another example where you want to work closely with your operations team to do a series of load tests with thresholds to determine just what the resource requirements of the various clients are and which approach will give the maximum utilization and performance. The following blog post from Cloudera engineers gives a good overview of how they approach this very issue in having Impala and MapReduce coexist effectively: http://blog.cloudera.com/blog/2013/06/configuring-impala-and-mapreduce-for-multi-tenant-performance/.
Building a physical cluster
There is one minor requirement before thinking about allocation of hardware resources: defining and selecting the hardware used for your cluster. In this section, we'll discuss a physical cluster and move on to Amazon EMR in the next.
Any specific hardware advice will be out of date the moment it is written. We advise perusing the websites of the various Hadoop distribution vendors, as they regularly write new articles on the currently recommended configurations.
Instead of telling you how many cores or GB of memory you need, we'll look at hardware selection at a slightly higher level. The first thing to realize is that the hosts running your Hadoop cluster will most likely look very different from the rest of your enterprise. Hadoop is optimized for low(er) cost hardware, so instead of seeing a small number of very large servers, expect to see a larger number of machines with fewer enterprise reliability features. But don't think that Hadoop will run great on any junk you have lying around. It might, but recently the profile of typical Hadoop servers has been moving away from the bottom end of the market; instead, the sweet spot would seem to be mid-range servers where the maximum cores/disks/memory can be achieved at a price point.
You should also expect to have different resource requirements for the hosts running services such as the HDFS NameNode or the YARN ResourceManager, as opposed to the worker nodes storing data and executing the application logic. For the former, there is usually much less requirement for lots of storage, but frequently a need for more memory and possibly faster disks.
For Hadoop worker nodes, the ratio between the three main hardware categories of cores, memory, and I/O is often the most important thing to get right, and this will directly inform the decisions you make regarding workload and resource allocation.
For example, many workloads tend to be I/O bound, and having many times as many containers allocated on a host than there are physical disks might actually cause an overall slowdown due to contention for the spinning disks. At the time of writing, current recommendations here are for the number of YARN containers to be no more than 1.8 times the number of disks. If you have workloads that are I/O bound, then you will most likely get much better performance by adding more hosts to the cluster instead of trying to get more containers running, or indeed faster processors or more memory, on the current hosts.
Conversely, if you expect to run lots of concurrent Impala, Spark, and other memory-hungry jobs, then memory might quickly become the resource most under pressure. This is why, even though you can get current hardware recommendations for general-purpose clusters from the distribution vendors, you still need to validate against your expected workloads and tailor accordingly. There is really no substitute for benchmarking on a small test cluster or indeed on EMR, which can be a great platform to explore the resource requirements of multiple applications and inform hardware acquisition decisions. Perhaps EMR might be your main environment; if so, we'll discuss that in a later section.
Physical layout
If you do use a physical cluster, there are a few things you will need to consider that are largely transparent on EMR.
Rack awareness
The first of these aspects, for clusters large enough to consume more than one rack of data center space, is building rack awareness. As mentioned in Chapter 2, Storage, when HDFS places replicas of new files, it attempts to place the second replica on a different host than the first, and the third in a different rack of equipment in a multi-rack system. This heuristic is aimed at maximizing resilience; there will be at least one replica available even if an entire rack of equipment fails. MapReduce uses similar logic to attempt to get a better-balanced task spread.
If you do nothing, then each host will be specified as being in the single default rack. But if the cluster grows beyond this point, you will need to update the rack name.
Under the covers, Hadoop discovers a node's rack by executing a user-supplied script that maps node hostnames to rack names. Cloudera Manager allows rack names to be set on a given host, and this is then retrieved when its rack awareness scripts are called by Hadoop. To set the rack for a host, click on Hosts -> <hostname> -> Assign Rack from the Cloudera Manager home page, and then assign the rack.
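On a cluster that is not managed by Cloudera Manager, the same mechanism is configured by pointing the net.topology.script.file.name property in core-site.xml at your own mapping script. The following is a minimal sketch of such a script; the /etc/hadoop/conf/rack-map.txt mapping file and the rack names are hypothetical:
#!/bin/bash
# Hadoop passes one or more hostnames or IP addresses as arguments and
# expects one rack path per argument on standard output.
MAP=/etc/hadoop/conf/rack-map.txt    # lines of the form: hostname /datacenter1/rack1
for host in "$@"; do
  rack=$(awk -v h="$host" '$1 == h {print $2}' "$MAP")
  echo "${rack:-/default-rack}"      # fall back to the default rack if unmapped
done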
Service layout
As mentioned earlier, you are likely to have two types of hardware in your cluster: the machines running the workers and those running the servers. When deploying a physical cluster, you will need to decide which services, and which subcomponents of those services, run on which physical machines.
For the workers, this is usually pretty straightforward; most, though not all, services have a model of a worker agent on all worker hosts. But for the master/server components, it requires a little thought. If you have three master nodes, then how do you spread your primary and backup NameNodes, the YARN ResourceManager, maybe Hue, a few Hive servers, and an Oozie manager? Some of these services are highly available, while others are not. As you add more and more services to your cluster, you'll also see this list of master services grow substantially.
In an ideal world, you might have a host per service master, but that is only tractable for very large clusters; in smaller installations it is prohibitively expensive, and it might always be a little wasteful. There are no hard-and-fast rules here either, but do look at your available hardware and try to spread the services across the nodes as much as possible. Don't, for example, have two nodes for the two NameNodes and then put everything else on a third. Think about the impact of a single host failure and manage the layout to minimize it. As the cluster grows across multiple racks of equipment, you will also need to consider how to survive single-rack failures. Hadoop itself helps with this, since HDFS will attempt to ensure each block of data has replicas across at least two racks. But this type of resilience is undermined if, for example, all the master nodes reside in a single rack.
Upgrading a service
Upgrading Hadoop has historically been a time-consuming and somewhat risky task. This remains the case on a manually deployed cluster, that is, one not managed by a tool such as Cloudera Manager.
If you are using Cloudera Manager, then it takes the time-consuming part out of the activity, but not necessarily the risk. Any upgrade should always be viewed as an activity with a high chance of unexpected issues, and you should arrange enough cluster downtime to account for this surprise excitement. There's really no substitute for doing a test upgrade on a test cluster, which underlines the importance of thinking about Hadoop as a component of your environment that needs to be treated with a deployment lifecycle like any other.
Sometimes an upgrade requires modification to the HDFS metadata or might otherwise affect the filesystem. This is, of course, where the real risks lie. In addition to running a test upgrade, be aware of the ability to put HDFS into upgrade mode, which effectively makes a snapshot of the filesystem state prior to the upgrade and which will be retained until the upgrade is finalized. This can be really helpful, as even an upgrade that goes badly wrong and corrupts data can potentially be fully rolled back.
Building a cluster on EMR
Elastic MapReduce is a flexible solution that, depending on requirements and workloads, can sit next to, or replace, a physical Hadoop cluster. As we've seen so far, EMR provides clusters preloaded and configured with Hive, Streaming, and Pig, as well as custom JAR clusters that allow the execution of MapReduce applications.
A second distinction to make is between transient and long-running lifecycles. A transient EMR cluster is generated on demand; data is loaded in S3 or HDFS, some processing workflow is executed, output results are stored, and the cluster is automatically shut down. A long-running cluster is kept alive once the workflow terminates, and the cluster remains available for new data to be copied over and new workflows to be executed. Long-running clusters are typically well suited for data warehousing or for working with datasets large enough that loading and processing the data on a transient instance would be inefficient.
In a must-read whitepaper for prospective users (found at https://media.amazonwebservices.com/AWS_Amazon_EMR_Best_Practices.pdf), Amazon gives a heuristic to estimate which cluster type is a better fit, as follows:
If number of jobs per day * (time to set up the cluster, including Amazon S3 data load time if using Amazon S3, + data processing time) < 24 hours, consider transient Amazon EMR clusters; otherwise, long-running clusters or physical instances may be a better fit. Long-running instances are instantiated by passing the --alive argument to the elastic-mapreduce command, which enables the KeepAlive option and disables autotermination.
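As a sketch, and assuming the same elastic-mapreduce CLI used earlier in the book, a long-running cluster could be started as follows; with the newer unified AWS CLI the equivalent behaviour comes from the --no-auto-terminate flag (the names, instance types, and counts here are illustrative):
./elastic-mapreduce --create --alive --name "long-running-cluster" \
    --num-instances 3 --instance-type m1.large
aws emr create-cluster --name "long-running-cluster" --ami-version 3.2.0 \
    --instance-count 3 --instance-type m1.large --no-auto-terminate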
Note that transient and long-running clusters share the same properties and limitations; in particular, data on HDFS is not persisted once the cluster is shut down.
Considerations about filesystems
In our examples so far, we assumed data to be available in S3. In this case, a bucket is mounted in EMR as an s3n filesystem, and it is used as an input source as well as a temporary filesystem to store intermediate data in computations. With S3 we introduce potential I/O overhead: operations such as reads and writes fire off GET and PUT HTTP requests.
Note that EMR does not support S3 block storage; the s3 URI maps to s3n.
Another option would be to load data into the cluster HDFS and run processing from there. In this case, we do have faster I/O and data locality, but we lose persistence: when the cluster is shut down, our data disappears. As a rule of thumb, if you are running a transient cluster, it makes sense to use S3 as a backend. In practice, one should monitor and take decisions based on the workflow characteristics. Iterative, multi-pass MapReduce jobs would greatly benefit from HDFS; one could argue that for those types of workflows, an execution engine like Tez or Spark would be more appropriate.
Getting data into EMR
When copying data from HDFS to S3, it is recommended to use s3distcp (http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html) instead of Apache distcp or Hadoop distcp. This approach is also suitable for transferring data within EMR and from S3 to HDFS. To move very large amounts of data from local disk into S3, Amazon recommends parallelizing the workload using Jets3t or GNU Parallel. In general, it's important to be aware that PUT requests to S3 are capped at 5 GB per file. To upload larger files, one needs to rely on Multipart Upload (https://aws.amazon.com/about-aws/whats-new/2010/11/10/Amazon-S3-Introducing-Multipart-Upload/), an API that allows splitting large files into smaller parts and reassembles them when uploaded. Files can also be copied with tools such as the AWS CLI or the popular S3CMD utility, but these do not have the parallelism advantages of s3distcp.
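A minimal sketch of an S3DistCp invocation, run as a step on the master node, might look like the following; the bucket and paths are placeholders, and the JAR location varies between EMR AMI versions:
hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
    --src hdfs:///output/tweets/ \
    --dest s3://my-bucket/tweets/ \
    --outputCodec gzip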
EC2 instances and tuning
The size of an EMR cluster depends on the dataset size, the number of files and blocks (which determines the number of splits), and the type of workload (try to avoid spilling to disk when a task runs out of memory). As a rule of thumb, a good size is one that maximizes parallelism. The number of mappers and reducers per instance, as well as the heap size per JVM daemon, is generally configured by EMR when the cluster is provisioned and tuned in the event of changes in the available resources.
Cluster tuning
In addition to the previous comments specific to a cluster run on EMR, there are some general thoughts to keep in mind when running workloads on any type of cluster. This will, of course, be more explicit when running outside of EMR, as EMR often abstracts some of the details.
JVM considerations
You should be running the 64-bit version of a JVM and using the server mode. This can take longer to produce optimized code, but it also uses more aggressive strategies and will re-optimize code over time. This makes it a much better fit for long-running services, such as Hadoop processes.
Ensure that you allocate enough memory to the JVM to prevent overly frequent Garbage Collection (GC) pauses. The concurrent mark-and-sweep collector is currently the most tested and recommended for Hadoop. The Garbage First (G1) collector has become the GC option of choice in numerous other workloads since its introduction with JDK 7, so it's worth monitoring recommended best practice as it evolves. These options can be configured as custom Java arguments within each service's configuration section of Cloudera Manager.
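For instance, a daemon's Java options might look something like the following when using the concurrent mark-and-sweep collector; the heap size and GC log path are illustrative and need tuning for your own hosts:
-server -Xms4g -Xmx4g -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled \
    -verbose:gc -Xloggc:/var/log/hadoop/namenode-gc.log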
The small files problem
Heap allocation to Java processes on worker nodes will be something you consider when thinking about service co-location. But there is a particular situation regarding the NameNode you should be aware of: the small files problem.
Hadoop is optimized for very large files with large block sizes. But sometimes particular workloads or data sources push many small files onto HDFS. This is most likely suboptimal, as it means each task processing a block at a time will read only a small amount of data before completing, causing inefficiency.
Having many small files also consumes more NameNode memory; it holds in memory the mapping from files to blocks and consequently holds metadata for each file and block. If the number of files, and hence blocks, increases quickly, then so will the NameNode memory usage. This is likely to only hit a subset of systems as, at the time of writing, 1 GB of memory can support 2 million files or blocks, but with a default heap size of 2 or 4 GB, this limit can easily be reached. If the NameNode needs to start very aggressively running garbage collection, or eventually runs out of memory, then your cluster will be very unhealthy. The short-term mitigation is to assign more heap to the JVM; the longer-term approach is to combine many small files into a smaller number of larger ones, ideally compressed with a splittable compression codec.
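One common way to consolidate existing small files, assuming hypothetical paths, is to pack a directory into a Hadoop archive (HAR), which reduces the number of objects the NameNode has to track while keeping the data readable in place:
hadoop archive -archiveName small-logs.har -p /data/raw/small-logs /data/archived
hdfs dfs -ls har:///data/archived/small-logs.har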
Map and reduce optimizations
Mappers and reducers both provide areas for optimizing performance; here are a few pointers to consider:
The number of mappers depends on the number of splits. When files are smaller than the default block size, or compressed using a non-splittable format, the number of mappers will equal the number of files. Otherwise, the number of mappers is given by the total size of each file divided by the block size.
Compress the mappers' output to reduce writes to disk and the volume of intermediate I/O. LZO is a good format for this task (see the configuration snippet after this list).
Avoid spills to disk: the mappers should have enough memory to retain as much data as possible.
Number of reducers: it is recommended that you use fewer reducers than the total reducer capacity, as this avoids execution waits.
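As a sketch, intermediate map output compression can be switched on with the following mapred-site.xml properties; Snappy is shown because it ships with most distributions, whereas LZO usually requires additional native libraries to be installed:
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>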
Security
Once you built a cluster, the first thing you thought about was how to secure it, right? Don't worry, most people don't. But as Hadoop has moved on from being something running in-house analysis in the research department to directly driving critical systems, it's not something to ignore for too long.
Securing Hadoop is not something to be done on a whim or without significant testing. We cannot give detailed advice on this topic, and cannot stress strongly enough the need to take it seriously and do it properly. It might consume time, it might cost money, but weigh this against the cost of having your cluster compromised.
Security is also a much bigger topic than just the Hadoop cluster. We'll explore some of the security features available in Hadoop, but you do need a coherent security strategy into which these discrete components fit.
Evolution of the Hadoop security model
In Hadoop 1, there was effectively no security protection, as the provided security model had obvious attack vectors. The Unix user ID with which you connected to the cluster was assumed to be valid, and you had all the privileges of that user. Plainly, this meant that anyone with administrative access on a host that could access the cluster could effectively impersonate any other user.
This led to the development of the so-called "head node" access model, whereby the Hadoop cluster was firewalled off from every host except one, the head node, and all access to the cluster was mediated through this centrally-controlled node. This was an effective mitigation for the lack of a real security model and can still be useful in situations even when richer security schemes are utilized.
Beyond basic authorization
Core Hadoop has had additional security features added, which address the previous concerns. In particular, they address the following:
A cluster can require a user to authenticate via Kerberos and prove they are who they say they are.
In secure mode, the cluster can also use Kerberos for all node-to-node communications, ensuring that all communicating nodes are authenticated and preventing malicious nodes from attempting to join the cluster.
To ease management, users can be collected into groups against which data-access privileges can be defined. This is called Role Based Access Control (RBAC) and is a prerequisite for a secure cluster with more than a handful of users. The user-group mappings can be retrieved from corporate systems, such as LDAP or Active Directory.
HDFS can apply ACLs to replace the current Unix-inspired owner/group/world model (a brief example follows this list).
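For example, assuming ACLs have been enabled on the NameNode (dfs.namenode.acls.enabled set to true) and using hypothetical users and paths, an extra read grant can be layered on top of the basic permissions as follows:
hdfs dfs -setfacl -m user:alice:r-x,group:analysts:r-x /data/warehouse/tweets
hdfs dfs -getfacl /data/warehouse/tweets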
These capabilities give Hadoop a significantly stronger security posture than in the past, but the community is moving fast, and additional dedicated Apache projects have emerged to address specific areas of security.
Apache Sentry (https://sentry.incubator.apache.org) is a system to provide much finer-grained authorization to Hadoop data and services. Other services build Sentry mappings, and this allows, for example, specific restrictions to be placed not only on particular HDFS directories, but also on entities such as Hive tables.
Whereas Sentry focuses on providing much richer tools for the internal, fine-grained aspects of Hadoop security, Apache Knox (http://knox.apache.org) provides a secure gateway to Hadoop that integrates with external identity management systems and provides access control mechanisms to allow or disallow access to specific Hadoop services and operations. It does this by presenting a REST-only interface to Hadoop and securing all calls to this API.
The future of Hadoop security
There are many other developments happening in the Hadoop world. Core Hadoop 2.5 added extended file attributes to HDFS, which can be used as the basis of additional access control mechanisms. Future versions will incorporate capabilities for better support of encryption for data in transit as well as at rest, and the Project Rhino initiative led by Intel (https://github.com/intel-hadoop/project-rhino/) is building out richer support for filesystem cryptographic modules, a secure filesystem, and, at some point, a fuller key-management infrastructure.
The Hadoop distribution vendors are moving fast to add these capabilities to their releases, so if you care about security (you do, don't you!), then consult the documentation for the latest release of your distribution. New security features are being added even in point updates and aren't being delayed until major upgrades.
Consequences of using a secured cluster
After teasing you with all the security goodness that is now available and that which is coming, it's only fair to give some words of warning. Security is often hard to do correctly, and a false sense of security from a buggy deployment is often worse than knowing you have no security.
However, even if you do it right, there are consequences to running a secure cluster. It makes things harder for the administrators, certainly, and often the users, so there is definitely an overhead. Specific Hadoop tools and services will also work differently depending on what security is employed on a cluster.
Oozie, which we discussed in Chapter 8, Data Lifecycle Management, uses its own delegation tokens behind the scenes. This allows the oozie user to submit jobs that are then executed on behalf of the originally submitting user. In a cluster using only the basic authorization mechanism, this is very easily configured, but using Oozie in a secure cluster will require additional logic to be added to the workflow definitions and the general Oozie configuration. This isn't a problem with Hadoop or Oozie; however, just as with the additional complexity resulting from the much better HA features of HDFS in Hadoop 2, better security mechanisms will simply have costs and consequences that you need to take into consideration.
Monitoring
Earlier in this chapter, we discussed Cloudera Manager as a visual monitoring tool and hinted that it could also be programmatically integrated with other monitoring systems. But before plugging Hadoop into any monitoring framework, it's worth considering just what it means to operationally monitor a Hadoop cluster.
Hadoop – where failures don't matter
Traditional systems monitoring tends to be quite a binary tool; generally speaking, either something is working or it isn't. A host is alive or dead, and a web server is responding or it isn't. But in the Hadoop world, things are a little different; the important thing is service availability, and a service can still be treated as live even if particular pieces of hardware or software have failed. No Hadoop cluster should be in trouble if a single worker node fails. As of Hadoop 2, even the failure of server processes such as the NameNode shouldn't really be a concern if HA is configured. So, any monitoring of Hadoop needs to take into account the service health and not that of specific host machines, which should be unimportant. Operations people on 24/7 pager duty are not going to be happy getting paged at 3 AM to discover that one worker node in a cluster of 10,000 has failed. Indeed, once the scale of the cluster increases beyond a certain point, the failure of individual pieces of hardware becomes an almost commonplace occurrence.
Monitoring integration
You won't be building your own monitoring tools; instead, you will most likely want to integrate with existing tools and frameworks. For popular open source monitoring tools, such as Nagios and Zabbix, there are multiple sample templates to integrate Hadoop's service-wide and node-specific metrics.
This can give the sort of separation hinted at previously; the failure of the YARN ResourceManager would be a high-criticality event that should most likely cause alerts to be sent to operations staff, but a high load on specific hosts should only be captured and not cause alerts to be fired. This then provides the duality of firing alerts when bad things happen, in addition to capturing and providing the information needed to delve into system data over time to do trend analysis.
Cloudera Manager provides a REST interface, which is another point of integration against which tools such as Nagios can integrate and pull the Cloudera Manager-defined service-level metrics instead of having to define their own.
For heavier-weight enterprise-monitoring infrastructure built on frameworks such as IBM Tivoli or HP OpenView, Cloudera Manager can also deliver events via SNMP traps that will be collected by these systems.
Application-level metrics
At times, you might also want your applications to gather metrics that can be centrally captured within the system. The mechanisms for this will differ from one computational model to another, but the most well known are the application counters available within MapReduce.
When a MapReduce job completes, it outputs a number of counters, gathered by the system throughout the job execution, that deal with metrics such as the number of map tasks, bytes written, failed tasks, and so on. You can also write application-specific metrics that will be available alongside the system counters and which are automatically aggregated across the map/reduce execution. First define a Java enum, and name your desired metrics within it, as follows:
public enum AppMetrics {
  MAX_SEEN,
  MIN_SEEN,
  BAD_RECORDS
};
Then, within the map, reduce, setup, and cleanup methods of your Map or Reduce implementations, you can do something like the following to increment a counter by one:
context.getCounter(AppMetrics.BAD_RECORDS).increment(1);
Refer to the JavaDoc of the org.apache.hadoop.mapreduce.Counter interface for more details of this mechanism.
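The same counters can also be read back in the driver once the job has finished; the following minimal sketch assumes the Job object used to submit the application:
// Retrieve an application-defined counter after the job completes
boolean success = job.waitForCompletion(true);
long badRecords = job.getCounters()
    .findCounter(AppMetrics.BAD_RECORDS).getValue();
System.out.println("Bad records seen: " + badRecords);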
Troubleshooting
Monitoring and logging counters or additional information is all well and good, but it can be intimidating to know how to actually find the information you need when troubleshooting a problem with an application. In this section, we will look at how Hadoop stores logs and system information. We can distinguish three types of logs, as follows:
YARN applications, including MapReduce jobs
Daemon logs (for example, the NameNode and ResourceManager)
Services that log non-distributed workloads, for example, HiveServer2 logging to /var/log
Next to these log types, Hadoop exposes a number of metrics at the filesystem level (storage availability, replication factor, and number of blocks) and at the system level. As mentioned, both Apache Ambari and Cloudera Manager, which centralize access to debug information, do a nice job as the frontend. However, under the hood, each service logs to either HDFS or the single-node filesystem. Furthermore, YARN, MapReduce, and HDFS expose their log files and metrics via web interfaces and programmatic APIs.
Logging levels
Hadoop logs messages to Log4j by default. Log4j is configured via log4j.properties in the classpath. This file defines what is logged and with which layout:
log4j.rootLogger=${root.logger}
root.logger=INFO,console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
The default root logger is INFO,console, which logs all messages at the level INFO and above to the console's stderr. Single applications deployed on Hadoop can ship their own log4j.properties and set the level and other properties of their emitted logs as required.
Hadoop daemons have a web page to get and set the log level for any Log4j property. This interface is exposed by the /logLevel endpoint in each service's web UI. To enable debug logging for the ResourceManager class, we would visit http://resourcemanagerhost:8088/logLevel, as shown in the following screenshot:
Getting and setting the log level on the ResourceManager
Alternatively, the yarn daemonlog <host:port> command interfaces with the service's /logLevel endpoint. We can inspect the level associated with mapreduce.map.log.level for the ResourceManager class using the -getlevel <property> parameter, as follows:
$ hadoop daemonlog -getlevel localhost.localdomain:8088 mapreduce.map.log.level
Connecting to http://localhost.localdomain:8088/logLevel?log=mapreduce.map.log.level
Submitted Log Name: mapreduce.map.log.level
Log Class: org.apache.commons.logging.impl.Log4JLogger
Effective level: INFO
The effective level can be modified using the -setlevel <property> <level> option:
$ hadoop daemonlog -setlevel localhost.localdomain:8088 mapreduce.map.log.level DEBUG
Connecting to http://localhost.localdomain:8088/logLevel?log=mapreduce.map.log.level&level=DEBUG
Submitted Log Name: mapreduce.map.log.level
Log Class: org.apache.commons.logging.impl.Log4JLogger
Submitted Level: DEBUG
Setting Level to DEBUG…
Effective level: DEBUG
Note that this setting will affect all logs produced by the ResourceManager class. This includes system-generated entries as well as the ones generated by applications running on YARN.
Access to log files
Log file locations and naming conventions are likely to differ based on the distribution. Apache Ambari and Cloudera Manager centralize access to log files, both for services and single applications. On Cloudera's QuickStart VM, an overview of the currently running processes and links to their log files, the stderr and stdout channels, can be found at http://localhost.localdomain:7180/cmf/hardware/hosts/1/processes, as shown in the following screenshot:
Access to log resources in Cloudera Manager
Ambari provides a similar overview via the Services dashboard found at http://127.0.0.1:8080/#/main/services on the HDP Sandbox, as shown in the following screenshot:
Access to log resources on Apache Ambari
Non-distributed logs are usually found under /var/log/<service> on each cluster node. YARN container and MRv2 log locations also depend on the distribution. On CDH 5, these resources are available in HDFS under /tmp/logs/<user>.
The standard way to access distributed logs is either via command-line tools or using the services' web UIs.
For instance, the command is as follows:
$ yarn application -list -appStates ALL
The preceding command will list all running and retired YARN applications. The URL in the Tracking-URL column points to a web interface that exposes the task log, as follows:
14/08/03 14:44:38 INFO client.RMProxy: Connecting to ResourceManager at localhost.localdomain/127.0.0.1:8032
Total number of applications (application-types: [] and states: [NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED]): 4
Application-Id                  Application-Name          Application-Type  User      Queue          State     Final-State  Progress  Tracking-URL
application_1405630696162_0002  PigLatin:DefaultJobName   MAPREDUCE         cloudera  root.cloudera  FINISHED  SUCCEEDED    100%      http://localhost.localdomain:19888/jobhistory/job/job_1405630696162_0002
application_1405630696162_0004  PigLatin:DefaultJobName   MAPREDUCE         cloudera  root.cloudera  FINISHED  SUCCEEDED    100%      http://localhost.localdomain:19888/jobhistory/job/job_1405630696162_0004
application_1405630696162_0003  PigLatin:DefaultJobName   MAPREDUCE         cloudera  root.cloudera  FINISHED  SUCCEEDED    100%      http://localhost.localdomain:19888/jobhistory/job/job_1405630696162_0003
application_1405630696162_0005  PigLatin:DefaultJobName   MAPREDUCE         cloudera  root.cloudera  FINISHED  SUCCEEDED    100%      http://localhost.localdomain:19888/jobhistory/job/job_1405630696162_0005
For instance, http://localhost.localdomain:19888/jobhistory/job/job_1405630696162_0002, a link to a task belonging to user cloudera, is a frontend to the content stored under hdfs:///tmp/logs/cloudera/logs/application_1405630696162_0002/.
In the following sections, we will give an overview of the available UIs for different services.
Note that provisioning an EMR cluster with the --log-uri s3://<bucket> option will ensure that Hadoop logs are copied into the s3://<bucket> location.
ResourceManager, NodeManager, and ApplicationManager
On YARN, the ResourceManager web UI provides information and general job statistics of the Hadoop cluster, running/completed/failed jobs, and a job history log file. By default, the UI is exposed at http://<resourcemanagerhost>:8088/ and can be seen in the following screenshot:
ResourceManager
Applications
On the left-hand sidebar, it is possible to review the application statuses of interest: NEW, SUBMITTED, ACCEPTED, RUNNING, FINISHING, FINISHED, FAILED, or KILLED. Depending on the application status, the following information is available:
The application ID
The submitting user
The application name
The scheduler queue in which the application is placed
Start/finish times and state
A link to the Tracking UI for application history
In addition, the Cluster Metrics view gives you information on the following:
Overall application status
Number of running containers
Memory usage
Node status
Nodes
The Nodes view is a frontend to the NodeManager service menu, which shows health and location information on the node's running applications, as follows:
Nodes status
Each individual node of the cluster exposes further information and statistics at host level via its own UI. These include which version of Hadoop is running on the node, how much memory is available on the node, the node status, and a list of running applications and containers, as shown in the following screenshot:
Single node info
Scheduler
The following screenshot shows the Scheduler window:
Scheduler
MapReduce
Though the same information and logging details are available in MapReduce v1 and MapReduce v2, the access modality is slightly different.
MapReduce v1
The following screenshot shows the MapReduce JobTracker UI:
The JobTracker UI
The JobTracker UI, available by default at http://<jobtracker>:50030, exposes information on all currently running as well as retired MapReduce jobs, a summary of the cluster resources and health, as well as scheduling information and completion percentage, as shown in the following screenshot:
Job details
For each running and retired job, details are available, including its ID, owner, priority, task assignment, and task launch for the mapper. Clicking on a job ID link will lead to a job details page, the same URL exposed by the mapred job -list command. This resource gives details about both the map and reduce tasks as well as general counter statistics at the job, filesystem, and MapReduce levels; these include the memory used, the number of read/write operations, and the number of bytes read and written.
For each map and reduce operation, the JobTracker exposes the total, pending, running, completed, and failed tasks, as shown in the following screenshot:
Job tasks overview
Clicking on the links in the Job table will lead to a further overview at the task and task-attempt levels, as shown in the following screenshot:
Task attempts
From this last page, we can access the logs of each task attempt, both for successful and failed/killed tasks, on each individual TaskTracker host. This log contains the most granular information about the status of the MapReduce job, including the output of Log4j appenders as well as output piped to the stdout and stderr channels and syslog, as shown in the following screenshot:
TaskTracker logs
MapReduce v2 (YARN)
As we have seen in Chapter 3, Processing – MapReduce and Beyond, with YARN, MapReduce is only one of many processing frameworks that can be deployed. Recall from previous chapters that the JobTracker and TaskTracker services have been replaced by the ResourceManager and NodeManager, respectively. As such, both the service UIs and the log files from YARN are more generic than MapReduce v1.
The application_1405630696162_0002 name shown in the ResourceManager corresponds to a MapReduce job with the job_1405630696162_0002 ID. That application ID belongs to the task running inside the container, and clicking on it will reveal an overview of the MapReduce job and allow a drill-down to the individual tasks from either phase until the single-task log is reached, as shown in the following screenshot:
A YARN application containing a MapReduce job
JobHistory Server
YARN ships with a JobHistory REST service that exposes details on finished applications. Currently, it only supports MapReduce and provides information on finished jobs. This includes the job's final status, SUCCESSFUL or FAILED, who submitted the job, the total number of map and reduce tasks, and timing information.
A UI is available at http://<jobhistoryhost>:19888/jobhistory, as shown in the following screenshot:
JobHistory UI
Clicking on each job ID will lead to the MapReduce job UI shown in the YARN application screenshot.
NameNode and DataNode
The web interface for the Hadoop Distributed File System (HDFS) shows information about the NameNode itself as well as the filesystem generally.
By default, it is located at http://<namenodehost>:50070/, as shown in the following screenshot:
NameNode UI
The Overview menu exposes NameNode information about DFS capacity and usage and the block pool status, and it gives a summary of the status of DataNode health and availability. The information contained in this page is for the most part equivalent to what is shown at the command-line prompt by:
$ hdfs dfsadmin -report
The DataNodes menu gives more detailed information about the status of each node and offers a drill-down at the single-host level, both for available and decommissioned nodes, as shown in the following screenshot:
DataNode UI
Summary
This has been quite a whistle-stop tour around the considerations of running an operational Hadoop cluster. We didn't try to turn developers into administrators, but hopefully, the broader perspective will help you to help your operations staff. In particular, we covered the following topics:
How Hadoop is a natural fit for DevOps approaches, as its multilayered complexity means it's not possible or desirable to have substantial knowledge gaps between development and operations staff
Cloudera Manager, and how it can be a great management and monitoring tool; it might cause integration problems though, if you have other enterprise tools, and it comes with a vendor lock-in risk
Ambari, the Apache open source alternative to Cloudera Manager, and how it is used in the Hortonworks distribution
How to think about selecting hardware for a physical Hadoop cluster, and how this naturally fits into the considerations of how the multiple workloads possible in the world of Hadoop 2 can peacefully coexist on shared resources
The different considerations for firing up and using EMR clusters, and how this can be both an adjunct to, as well as an alternative to, a physical cluster
The Hadoop security ecosystem, how it is a very fast-moving area, and how the features available today are vastly better than some years ago, with still more around the corner
Monitoring of a Hadoop cluster, considering what events are important in the Hadoop model of embracing failure, and how these alerts and metrics can be integrated into other enterprise-monitoring frameworks
How to troubleshoot issues with a Hadoop cluster, both in terms of what might have happened and how to find the information to inform your analysis
A quick tour of the various web UIs provided by Hadoop, which can give very good overviews of happenings within various components in the system
This concludes our treatment of Hadoop in depth. In the final chapter, we will express some thoughts on the broader Hadoop ecosystem, give some pointers for useful and interesting tools and products that we didn't have a chance to cover in the book, and suggest how to get involved with the community.
Chapter 11. Where to Go Next
In the previous chapters we have examined many parts of Hadoop 2 and the ecosystem around it. However, we have necessarily been limited by page count; some areas we didn't get into in as much depth as was possible, other areas we referred to only in passing or did not mention at all.
The Hadoop ecosystem, with distributions, Apache and non-Apache projects, is an incredibly vibrant and healthy place to be right now. In this chapter, we hope to complement the previously discussed, more detailed material with a travel guide, if you will, for other interesting destinations. In this chapter, we will discuss the following topics:
Hadoop distributions
Other significant Apache and non-Apache projects
Sources of information and help
Of course, note that any overview of the ecosystem is both skewed by our interests and preferences, and is outdated the moment it is written. In other words, don't for a moment think this is all that's available; consider it instead a whetting of the appetite.
AlternativedistributionsWe’vegenerallyusedtheClouderadistributionforHadoopinthisbook,buthaveattemptedtokeepthecoveragedistributionindependentasmuchaspossible.We’vealsomentionedtheHortonworksDataPlatform(HDP)throughoutthisbookbutthesearecertainlynottheonlydistributionchoicesavailabletoyou.
Beforetakingalookaround,let’sconsiderwhetheryouneedadistributionatall.ItiscompletelypossibletogototheApachewebsite,downloadthesourcetarballsoftheprojectsinwhichyouareinterested,thenworktobuildthemalltogether.However,givenversiondependencies,thisislikelytoconsumemoretimethanyouwouldexpect.Potentially,vastlymoreso.Inaddition,theendproductwilllikelylacksomepolishintermsoftoolsorscriptsforoperationaldeploymentandmanagement.Formostusers,theseareasarewhyemployinganexistingHadoopdistributionisthenaturalchoice.
Anoteonfreeandcommercialextensions—beinganopensourceprojectwithaquiteliberallicense,distributioncreatorsarealsofreetoenhanceHadoopwithproprietaryextensionsthataremadeavailableeitherasfreeopensourceorcommercialproducts.
Thiscanbeacontroversialissueassomeopensourceadvocatesdislikeanycommercializationofsuccessfulopensourceprojects;tothem,itappearsthatthecommercialentityisfreeloadingbytakingthefruitsoftheopensourcecommunitywithouthavingtobuilditforthemselves.OthersseethisasahealthyaspectoftheflexibleApachelicense;thebaseproductwillalwaysbefree,andindividualsandcompaniescanchoosewhethertogowithcommercialextensionsornot.Wedon’tgivejudgmenteitherway,butbeawarethatthisisanotherofthecontroversiesyouwillalmostcertainlyencounter.
Soyouneedtodecideifyouneedadistributionandifsoforwhatreasons,whichspecificaspectswillbenefityoumostaboverollingyourown?Doyouwishforafullyopensourceproductorareyouwillingtopayforcommercialextensions?Withthesequestionsinmind,let’slookatafewofthemaindistributions.
Cloudera Distribution for Hadoop
You will be familiar with the Cloudera distribution (http://www.cloudera.com) as it has been used throughout this book. CDH was the first widely available alternative distribution, and its breadth of available software, proven level of quality, and free cost have made it a very popular choice.
Recently, Cloudera has been actively extending the products it adds to its distribution beyond the core Hadoop projects. In addition to Cloudera Manager and Impala (both Cloudera-developed products), it has also added other tools such as Cloudera Search (based on Apache Solr) and Cloudera Navigator (a data governance solution). While CDH versions prior to 5 were focused more on the integration benefits of a distribution, version 5 (and presumably beyond) is adding more and more capability atop the base Apache Hadoop projects.
Cloudera also offers commercial support for its products, in addition to training and consultancy services. Details can be found on the company web page.
Hortonworks Data Platform
In 2011, the Yahoo! division responsible for so much of the development of Hadoop was spun off into a new company called Hortonworks. They have also produced their own pre-integrated Hadoop distribution called the Hortonworks Data Platform (HDP), available at http://hortonworks.com/products/hortonworksdataplatform/.
HDP is conceptually similar to CDH, but both products have differences in their focus. Hortonworks makes much of the fact that HDP is fully open source, including the management tool Ambari, which we discussed briefly in Chapter 10, Running a Hadoop Cluster. They have also positioned HDP as a key integration platform through its support for tools such as Talend Open Studio. Hortonworks does not offer proprietary software; its business model focuses instead on offering professional services and support for the platform.
Both Cloudera and Hortonworks are venture-backed companies with significant engineering expertise; both companies employ many of the most prolific contributors to Hadoop. The underlying technology is, however, comprised of the same Apache projects; the distinguishing factors are how they are packaged, the versions employed, and the additional value-added offerings provided by the companies.
MapR
A different type of distribution is offered by MapR Technologies, although the company and distribution are usually referred to simply as MapR. The distribution, available from http://www.mapr.com, is based on Hadoop but has added a number of changes and enhancements.
The focus of the MapR distribution is on performance and availability. For example, it was the first distribution to offer a high-availability solution for the Hadoop NameNode and JobTracker, which, as you will remember from Chapter 2, Storage, was a significant weakness in core Hadoop 1. It also offered native integration with NFS filesystems long before Hadoop 2, which makes processing of existing data much easier. To achieve these features, MapR replaced HDFS with a fully POSIX-compliant filesystem that also features no NameNode, resulting in a truly distributed system with no master, and a claim of much better hardware utilization than Apache HDFS.
MapR provides both a community and an enterprise edition of its distribution; not all the extensions are available in the free product. The company also offers support services as part of the enterprise product subscription, in addition to training and consultancy.
And the rest…
Hadoop distributions are not just the territory of young start-ups, nor are they a static marketplace. Intel had its own distribution until early 2014, when it decided to fold its changes into CDH instead. IBM has its own distribution called IBM InfoSphere BigInsights, available in both free and commercial editions. There are also various stories of numerous large enterprises rolling their own distributions, some of which are made openly available while others are not. You will have no shortage of options with so many high-quality distributions available.
Choosing a distribution
This raises the question: how to choose a distribution? As can be seen, the available distributions (and we didn't cover them all) range from convenient packaging and integration of fully open source products through to entirely bespoke integration and analysis layers atop them. There is no overall best distribution; think carefully about your requirements and consider the alternatives. Since all of these offer a free download of at least a basic version, it's good to simply play and experience the options for yourself.
OthercomputationalframeworksWe’vefrequentlydiscussedthemyriadpossibilitiesbroughttotheHadoopplatformbyYARN.Wewentintodetailsoftwonewmodels,SamzaandSpark.Additionally,othermoreestablishedframeworkssuchasPigarealsobeingportedtotheframework.
Togiveaviewofthemuchbiggerpictureinthissection,wewillillustratethebreadthofprocessingpossibleusingYARNbypresentingasetofcomputationalmodelsthatarecurrentlybeingportedtoHadoopontopofYARN.
Apache Storm
Storm (http://storm.apache.org) is a distributed computation framework written (mainly) in the Clojure programming language. It uses custom-created spouts and bolts to define information sources and manipulations to allow distributed processing of streaming data. A Storm application is designed as a topology of interfaces that creates a stream of transformations. It provides similar functionality to a MapReduce job, with the exception that the topology will theoretically run indefinitely until it is manually terminated.
Though initially built distinct from Hadoop, a YARN port is being developed by Yahoo! and can be found at https://github.com/yahoo/storm-yarn.
Apache Giraph
Giraph originated as the open source implementation of Google's Pregel paper (which can be found at http://kowshik.github.io/JPregel/pregel_paper.pdf). Both Giraph and Pregel are inspired by the Bulk Synchronous Parallel (BSP) model of distributed computation introduced by Valiant in 1990. Giraph adds several features, including master computation, sharded aggregators, edge-oriented input, and out-of-core computation. The YARN port can be found at https://issues.apache.org/jira/browse/GIRAPH-13.
Apache Hama
Hama is a top-level Apache project that aims, like other methods we've encountered so far, to address the weakness of MapReduce with regard to iterative programming. Similar to the aforementioned Giraph, Hama implements the BSP techniques and has been heavily inspired by the Pregel paper. The YARN port can be found at https://issues.apache.org/jira/browse/HAMA-431.
Other interesting projects
Whether you use a bundled distribution or stick with the base Apache Hadoop download, you will encounter many references to other related projects. We've covered several of these, such as Hive, Samza, and Crunch, in this book; we'll now highlight some of the others.
Note that this coverage seeks to point out the highlights (from the authors' perspective) as well as give a taste of the breadth of types of projects available. As mentioned earlier, keep looking out, as there will be new ones launching all the time.
HBase
Perhaps the most popular Apache Hadoop-related project that we didn't cover in this book is HBase (http://hbase.apache.org). Based on the BigTable model of data storage publicized by Google in an academic paper (sound familiar?), HBase is a non-relational data store sitting atop HDFS.
While both MapReduce and Hive focus on batch-like data access patterns, HBase instead seeks to provide very low-latency access to data. Consequently, HBase can, unlike the aforementioned technologies, directly support user-facing services.
The HBase data model is not the relational approach that was used in Hive and all other RDBMSs, nor does it offer the full ACID guarantees that are taken for granted with relational stores. Instead, it is a key-value, schema-less solution that takes a column-oriented view of data; columns can be added at runtime and depend on the values inserted into HBase. Each lookup operation is then very fast, as it is effectively a key-value mapping from the row key to the desired column. HBase also treats timestamps as another dimension on the data, so one can directly retrieve data from a point in time.
The data model is very powerful but does not suit all use cases, just as the relational model isn't universally applicable. But if you have a requirement for structured low-latency views on large-scale data stored in Hadoop, then HBase is absolutely something you should look at.
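To get a feel for the model, the following HBase shell session sketches creating a table with a single column family and writing and reading one cell; the table name and row key are, of course, hypothetical:
hbase shell
create 'tweets', 'd'
put 'tweets', 'user1-20140803', 'd:text', 'hello hadoop'
get 'tweets', 'user1-20140803'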
Sqoop
In Chapter 7, Hadoop and SQL, we looked at tools for presenting a relational-like interface to data stored on HDFS. Often, such data either needs to be retrieved from an existing relational database, or the output of its processing needs to be stored back in one.
Apache Sqoop (http://sqoop.apache.org) provides a mechanism for declaratively specifying data movement between relational databases and Hadoop. It takes a task definition and from this generates MapReduce jobs to execute the required data retrieval or storage. It will also generate code to help manipulate relational records with custom Java classes. In addition, it can integrate with HBase and HCatalog/Hive, and it provides a very rich set of integration possibilities.
At the time of writing, Sqoop is slightly in flux. Its original version, Sqoop 1, was a pure client-side application. Much like the original Hive command-line tool, Sqoop 1 has no server and generates all code on the client. This unfortunately means that each client needs to know a lot of details about the physical data sources, including exact hostnames as well as authentication credentials.
Sqoop 2 provides a centralized Sqoop server that encapsulates all these details and offers the various configured data sources to the connecting clients. It is a superior model, but at the time of writing, the general community recommendation is to stick with Sqoop 1 until the new version evolves further. Check on the current status if you are interested in this type of tool.
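As an illustration of the declarative style, a Sqoop 1 import might look like the following; the JDBC URL, user, table, and target directory are all placeholders:
sqoop import \
    --connect jdbc:mysql://dbhost/tweets_db \
    --username reporting -P \
    --table tweets \
    --target-dir /data/raw/tweets \
    --num-mappers 4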
Whirr
When looking to use cloud services such as Amazon AWS for Hadoop deployments, it is usually a lot easier to use a higher-level service such as Elastic MapReduce as opposed to setting up your own cluster on EC2. Though there are scripts to help, the fact is that the overhead of Hadoop-based deployments on cloud infrastructures can be involved. That's where Apache Whirr (https://whirr.apache.org/) comes in.
Whirr isn't focused on Hadoop; it's about supplier-independent instantiation of cloud services, of which Hadoop is a single example. Whirr aims to provide a programmatic way of specifying and creating Hadoop-based deployments on cloud infrastructures in a way that handles all the underlying service aspects for you. It does this in a provider-independent fashion, so that once you've launched on, say, EC2, you can use the same code to create the identical setup on another provider such as Rightscale or Eucalyptus. This makes vendor lock-in, often a concern with cloud deployments, less of an issue.
Whirr isn't quite there yet. Today, it is limited in the services it can create and the providers it supports; however, if you are interested in cloud deployment with less pain, then it's worth watching its progress.
Note that if you are building out your full infrastructure on Amazon Web Services, then you might find CloudFormation gives much of the same ability to define application requirements, though obviously in an AWS-specific fashion.
Mahout
Apache Mahout (http://mahout.apache.org/) is a collection of distributed algorithms, Java classes, and tools for performing advanced analytics on top of Hadoop. Similar to Spark's MLlib, briefly mentioned in Chapter 5, Iterative Computation with Spark, Mahout ships with a number of algorithms for common use cases: recommendation, clustering, regression, and feature engineering. Although the system is focused on natural language processing and text-mining tasks, its building blocks (linear algebra operations) are suitable to be applied to a number of domains. As of version 0.9, the project is being decoupled from the MapReduce framework in favor of richer programming models such as Spark. The community's end goal is to obtain a platform-independent library based on a Scala DSL.
Hue
Initially developed by Cloudera and marketed as the "User Interface for Hadoop", Hue (http://gethue.com/) is a collection of applications, bundled together under a common web interface, that act as clients for core services and a number of components of the Hadoop ecosystem:
The Hue Query Editor for Hive
Hue leverages many of the tools we discussed in previous chapters and provides an integrated interface for analyzing and visualizing data. There are two components that are remarkably interesting. On the one hand, there is a query editor that allows the user to create and save Hive (or Impala) queries, export the result set in CSV or Microsoft Office Excel format, as well as plot it in the browser. The editor features the capability of sharing both HiveQL and result sets, thus facilitating collaboration within an organization. On the other hand, there is an Oozie workflow and coordinator editor that allows a user to create and deploy Oozie jobs manually, automating the generation of XML configurations and boilerplate.
Both the Cloudera and Hortonworks distributions ship with Hue, which typically includes the following:
A file manager for HDFS
A Job Browser for YARN (MapReduce)
An Apache HBase browser
A Hive metastore explorer
Query editors for Hive and Impala
A script editor for Pig
A job editor for MapReduce and Spark
An editor for Sqoop 2 jobs
An Oozie workflow editor and dashboard
An Apache ZooKeeper browser
On top of this, Hue is a framework with an SDK that contains a number of web assets, APIs, and patterns for developing third-party applications that interact with Hadoop.
OtherprogrammingabstractionsHadoopisn’tjustextendedbyadditionalfunctionality,therearetoolstoprovideentirelydifferentparadigmsforwritingthecodeusedtoprocessyourdatawithinHadoop.
CascadingDevelopedbyConcurrent,andopensourcedunderanApachelicense,Cascading(http://www.cascading.org/)isapopularframeworkthatabstractsthecomplexityofMapReduceawayandallowsustocreatecomplexworkflowsontopofHadoop.Cascadingjobscancompileto,andbeexecutedon,MapReduce,Tez,andSpark.Conceptually,theframeworkissimilartoApacheCrunch,coveredinChapter9,MakingDevelopmentEasier,thoughpracticallytherearedifferencesintermsofdataabstractionsandendgoals.Cascadingadoptsatupledatamodel(similartoPig)ratherthanarbitraryobjects,andencouragestheusertorelyonahigherlevelDSL,powerfulbuilt-intypes,andtoolstomanipulatedata.
Putinsimpleterms,CascadingistoPigLatinandHiveQLwhatCrunchistoauser-definedfunction.
LikeMorphlines,whichwealsosawinChapter9,MakingDevelopmentEasier,theCascadingdatamodelfollowsasource-pipe-sinkapproach,wheredataiscapturedfromasource,pipedthroughanumberofprocessingsteps,anditsoutputisthendeliveredintoasink,readytobepickedupbyanotherapplication.
Cascading encourages developers to write code in a number of JVM languages. Ports of the framework exist for Python (PyCascading), JRuby (Cascading.jruby), Clojure (Cascalog), and Scala (Scalding). Cascalog and Scalding in particular have gained a lot of traction and spawned their own ecosystems.
An area where Cascading excels is documentation. The project provides comprehensive javadocs of the API, extensive tutorials (http://www.cascading.org/documentation/tutorials/), and an interactive, exercise-based learning environment (https://github.com/Cascading/Impatient).
Another strong selling point of Cascading is its integration with third-party environments. Amazon EMR supports Cascading as a first-class processing framework and allows us to launch Cascading clusters both with the command line and web interfaces (http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/CreateCascading.html). Plugins for the SDK exist for both the IntelliJ IDEA and Eclipse integrated development environments. One of the framework's top projects, Cascading Patterns, a collection of machine-learning algorithms, features a utility for translating Predictive Model Markup Language (PMML) documents into applications on Apache Hadoop, thus facilitating interoperability with popular statistical environments and scientific tools such as R (http://cran.r-project.org/web/packages/pmml/index.html).
AWS resources
Many Hadoop technologies can be deployed on AWS as part of a self-managed cluster. However, just as Amazon offers support for Elastic MapReduce, which handles Hadoop as a managed service, there are a few other services that are worth mentioning.
SimpleDB and DynamoDB
For some time, AWS has offered SimpleDB as a hosted service providing an HBase-like data model.
It has, however, largely been superseded by a more recent service from AWS, DynamoDB, located at http://aws.amazon.com/dynamodb. Though its data model is very similar to that of SimpleDB and HBase, it is aimed at a very different type of application. Where SimpleDB has quite a rich search API but is very limited in terms of size, DynamoDB provides a more constrained (though constantly evolving) API, but with a service guarantee of near-unlimited scalability.
The DynamoDB pricing model is particularly interesting; instead of paying for a certain number of servers hosting the service, you allocate a certain capacity for read and write operations, and DynamoDB manages the resources required to meet this provisioned capacity. This is an interesting development, as it is a purer service model, where the mechanism of delivering the desired performance is kept completely opaque to the service user. Have a look at DynamoDB if you need a much larger data store than SimpleDB can offer; however, do consider the pricing model carefully, as provisioning too much capacity can become very expensive very quickly. Amazon provides some good best practices for DynamoDB at the following URL, which illustrate that minimizing service costs can result in additional application-layer complexity: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/BestPractices.html.
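To make the provisioned-capacity model concrete, here is a minimal sketch using the AWS SDK for Java; the table name, key attribute, and the figures of 10 read and 5 write capacity units are assumptions made for the example:

```java
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
import com.amazonaws.services.dynamodbv2.model.AttributeDefinition;
import com.amazonaws.services.dynamodbv2.model.CreateTableRequest;
import com.amazonaws.services.dynamodbv2.model.KeySchemaElement;
import com.amazonaws.services.dynamodbv2.model.KeyType;
import com.amazonaws.services.dynamodbv2.model.ProvisionedThroughput;
import com.amazonaws.services.dynamodbv2.model.ScalarAttributeType;

public class CreateTweetsTable {
    public static void main(String[] args) {
        // Credentials are picked up from the environment or local AWS configuration
        AmazonDynamoDBClient client = new AmazonDynamoDBClient();

        CreateTableRequest request = new CreateTableRequest()
                .withTableName("tweets")
                .withKeySchema(new KeySchemaElement("tweet_id", KeyType.HASH))
                .withAttributeDefinitions(
                        new AttributeDefinition("tweet_id", ScalarAttributeType.S))
                // The provisioned capacity: 10 read units and 5 write units;
                // you pay for this allocation rather than for individual servers
                .withProvisionedThroughput(new ProvisionedThroughput(10L, 5L));

        client.createTable(request);
    }
}
```

Raising or lowering the throughput later is a table update call rather than a redeployment, which is where the "pure service" nature of the pricing model shows through.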
Note: Of course, the discussion of DynamoDB and SimpleDB assumes a non-relational data model; there is the Amazon Relational Database Service (Amazon RDS) if you want a relational database in the cloud.
Kinesis
Just as EMR is hosted Hadoop and DynamoDB has similarities to a hosted HBase, it wasn't surprising to see AWS announce Kinesis, a hosted streaming data service, in 2013. It can be found at http://aws.amazon.com/kinesis and has very similar conceptual building blocks to the stack of Samza atop Kafka. Kinesis provides a partitioned view of messages as a stream of data and an API for registering callbacks that execute when messages arrive. As with most AWS services, there is tight integration with other services, making it easy to get data into and out of locations such as S3.
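As a minimal sketch of the partitioned-stream idea, the following uses the AWS SDK for Java to push a message into a stream; the stream name, the use of a username as the partition key, and the JSON payload are assumptions made for the example:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import com.amazonaws.services.kinesis.AmazonKinesisClient;
import com.amazonaws.services.kinesis.model.PutRecordRequest;
import com.amazonaws.services.kinesis.model.PutRecordResult;

public class TweetProducer {
    public static void main(String[] args) {
        AmazonKinesisClient kinesis = new AmazonKinesisClient();

        String tweetJson = "{\"user\": \"alice\", \"text\": \"hello kinesis\"}";

        // Records sharing a partition key are routed to the same shard,
        // much like keyed messages landing in the same Kafka topic partition
        PutRecordRequest request = new PutRecordRequest()
                .withStreamName("tweets")
                .withPartitionKey("alice")
                .withData(ByteBuffer.wrap(tweetJson.getBytes(StandardCharsets.UTF_8)));

        PutRecordResult result = kinesis.putRecord(request);
        System.out.println("Stored in shard " + result.getShardId()
                + " at sequence number " + result.getSequenceNumber());
    }
}
```

On the consuming side, the Kinesis Client Library plays a role loosely analogous to a Samza task, invoking your processing code as records arrive on each shard.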
Data Pipeline
The final AWS service that we'll mention is Data Pipeline, which can be found at http://aws.amazon.com/datapipeline. As the name suggests, it is a framework for building up data-processing jobs that involve multiple steps, data movements, and transformations. It has quite a conceptual overlap with Oozie, but with a few twists. Firstly, Data Pipeline has the expected deep integration with many other AWS services, enabling easy definition of data workflows that incorporate diverse repositories such as RDS, S3, and DynamoDB. In addition, however, Data Pipeline has the ability to integrate agents installed on local infrastructure, providing an interesting avenue for building workflows that span the AWS and on-premises environments.
Sources of information
You don't just need new technologies and tools, cool as they are. Sometimes, a little help from a more experienced source can pull you out of a hole. In this regard, you are well covered, as the Hadoop community is extremely strong in many areas.
SourcecodeIt’ssometimeseasytooverlook,butHadoopandalltheotherApacheprojectsareafterallfullyopensource.Theactualsourcecodeistheultimatesource(pardonthepun)ofinformationabouthowthesystemworks.Becomingfamiliarwiththesourceandtracingthroughsomeofthefunctionalitycanbehugelyinformative.Nottomentionhelpfulwhenyouarehittingunexpectedbehavior.
Mailing lists and forums
Almost all the projects and services listed in this chapter have their own mailing lists and/or forums; check out the home pages for the specific links. Most distributions also have their own forums and other mechanisms to share knowledge and get (non-commercial) help from the community. Additionally, if using AWS, make sure to check out the AWS developer forums at https://forums.aws.amazon.com.
Always remember to read posting guidelines carefully and understand the expected etiquette. These are tremendous sources of information; the lists and forums are often frequented by the developers of the particular project. Expect to see the core Hadoop developers on the Hadoop lists, Hive developers on the Hive list, EMR developers on the EMR forums, and so on.
LinkedIn groups
There are a number of Hadoop and related groups on the professional social network LinkedIn. Do a search for your particular areas of interest, but a good starting point might be the general Hadoop users' group at http://www.linkedin.com/groups/Hadoop-Users-988957.
HUGs
If you want more face-to-face interaction, look for a Hadoop User Group (HUG) in your area; most of these are listed at http://wiki.apache.org/hadoop/HadoopUserGroups. They tend to arrange semi-regular get-togethers that combine quality presentations, the chance to discuss technology with like-minded individuals, and often pizza and drinks.
No HUG near where you live? Consider starting one.
Conferences
Though some industries take decades to build up a conference circuit, Hadoop already has significant conference activity spanning the open source, academic, and commercial worlds. Events such as the Hadoop Summit and Strata are pretty big; these and some others are linked from http://wiki.apache.org/hadoop/Conferences.
Summary
In this chapter, we took a quick gallop around the broader Hadoop ecosystem, looking at the following topics:
- Why alternative Hadoop distributions exist and some of the more popular ones
- Other projects that provide capabilities, extensions, or Hadoop supporting tools
- Alternative ways of writing or creating Hadoop jobs
- Sources of information and how to connect with other enthusiasts
Now, go have fun and build something amazing!