www.it-ebooks.info

Apache Hive Essentials - DropPDF2.droppdf.com/files/i9eKZ/apache-hive-essentials.pdf · Neha Bhatnagar Proofreaders Paul Hindle ... , with PDF and ... Apache Hive Essentials prepares

Apache Hive Essentials

Table of Contents

Apache Hive Essentials

Credits

About the Author

About the Reviewers

www.PacktPub.com

Support files, eBooks, discount offers, and more

Why subscribe?

Free access for Packt account holders

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

1. Overview of Big Data and Hive

A short history

Introducing big data

Relational and NoSQL database versus Hadoop

Batch, real-time, and stream processing

Overview of the Hadoop ecosystem

Hive overview

Summary

2. Setting Up the Hive Environment

Installing Hive from Apache

Installing Hive from vendor packages

Starting Hive in the cloud

Using the Hive command line and Beeline

The Hive-integrated development environment

Summary

3. Data Definition and Description

Understanding Hive data types

Data type conversions

Hive Data Definition Language

Hive database

Hive internal and external tables

Hive partitions

Hive buckets

Hive views

Summary

4. Data Selection and Scope

The SELECT statement

The INNER JOIN statement

The OUTER JOIN and CROSS JOIN statements

Special JOIN – MAPJOIN

Set operation – UNION ALL

Summary

5. Data Manipulation

Data exchange – LOAD

Data exchange – INSERT

Data exchange – EXPORT and IMPORT

ORDER and SORT

Operators and functions

Transactions

Summary

6. Data Aggregation and Sampling

Basic aggregation – GROUP BY

Advanced aggregation – GROUPING SETS

Advanced aggregation – ROLLUP and CUBE

Aggregation condition – HAVING

Analytic functions

Sampling

Summary

7. Performance Considerations

Performance utilities

The EXPLAIN statement

The ANALYZE statement

Design optimization

Partition tables

Bucket tables

Index

Data file optimization

File format

Compression

Storage optimization

Job and query optimization

Local mode

JVM reuse

Parallel execution

Join optimization

Common join

Map join

Bucket map join

Sort merge bucket (SMB) join

Sort merge bucket map (SMBM) join

Skew join

Summary

8. Extensibility Considerations

User-defined functions

The UDF code template

The UDAF code template

The UDTF code template

Development and deployment

Streaming

SerDe

Summary

9. Security Considerations

Authentication

Metastore server authentication

HiveServer2 authentication

Authorization

Legacy mode

Storage-based mode

SQL standard-based mode

Encryption

Summary

10. Working with Other Tools

JDBC/ODBC connector

HBase

Hue

HCatalog

ZooKeeper

Oozie

Hive roadmap

Summary

Index

Apache Hive Essentials

Copyright © 2015 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: February 2015

Production reference: 1210215

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78355-857-5

www.packtpub.com

Credits

Author

Dayong Du

Reviewers

Puneetha B M

Hamzeh Khazaei

Nitin Pradeep Kumar

Balaswamy Vaddeman

Commissioning Editor

Ashwin Nair

Acquisition Editor

Shaon Basu

Content Development Editor

Merwyn D’souza

Technical Editor

Taabish Khan

Copy Editors

Sameen Siddiqui

Laxmi Subramanian

Project Coordinator

Neha Bhatnagar

Proofreaders

Paul Hindle

Jonathan Todd

Indexer

Monica Ajmera Mehta

Production Coordinator

Aparna Bhagat

Cover Work

Aparna Bhagat

About the Author

Dayong Du is a big data practitioner, leader, and developer with expertise in technology consulting, designing, and implementing enterprise big data solutions. With more than 10 years of experience in enterprise data warehouse, business intelligence, and big data and analytics, he has provided his data intelligence expertise in various industries, such as media, travel, telecommunications, and so on. He is currently working with QuickPlay Media in Toronto, Canada, to build enterprise big data intelligence reporting for online media services and content providers. He has a master’s degree in computer science from Dalhousie University, and he holds the Cloudera Certified Developer for Apache Hadoop certification.

I would like to sincerely thank my wife, Joice, and daughter, Elaine, for their sacrifices and encouragement during this journey. Also, I would like to thank my parents for their support during the time of writing this book.

I would also like to thank everyone at Packt Publishing and the technical reviewers for their valuable help, guidance, and feedback on my book.

About the Reviewers

Puneetha B M is a software engineer, data enthusiast, and technical blogger. Her research interests include big data, cloud computing, machine learning, and NoSQL databases. She is also a professional software engineer with more than 2 years of working experience. She holds a master’s degree in computer applications from P.E.S. Institute of Technology. Other than programming, she enjoys painting and listening to music. You can learn more from her blog (http://blog.puneethabm.in/) and LinkedIn profile (https://www.linkedin.com/in/puneethabm).

I owe a great deal to Prof. Dr. Ram Rustagi for being a role model in my life and for his zealous inspiration. I would like to thank my brother, Nischith B. M., for supporting me in everything I do. I would also like to thank Packt Publishing and its staff for providing the opportunity to contribute to this book.

Hamzeh Khazaei is a postdoctoral research scientist at IBM Canada Research and Development Centre. He received his PhD degree in computer science from University of Manitoba, Winnipeg, Manitoba, Canada (2009–2012). Earlier, he received both his BSc and MSc degrees in computer science from Amirkabir University of Technology, Tehran, Iran (2000–2008). He is also a sessional instructor in the Computer Science department at Ryerson University (http://scs.ryerson.ca/~hkhazaei). He teaches software engineering to fourth year undergraduate students. His research area includes big data analytics, cloud computing infrastructure, analytics as a service, and modeling of computing systems.

I would like to thank my dear wife for her perpetual support in all my endeavors.

Nitin Pradeep Kumar is a passionate developer with extensive experience and oodles of interest in emerging technologies such as the cloud and mobile. He is currently a cloud quality engineer at Appcelerator, a leading Silicon Valley-based start-up that provides an MBaaS platform purpose-built for mobile and cloud development. Before this stint, he studied at the National University of Singapore toward a master’s degree in knowledge engineering, which involves building intelligent systems using cutting-edge artificial intelligence and data-mining techniques. He enjoys the start-up environment and has worked with technologies such as Hadoop, Hive, and data warehousing. He lives in Singapore and spends his spare cycles playing retro PC games on his mobile and learning Muay Thai.

I would like to thank my family, friends, and my wonderful brother, Nivin, for supporting me in all my endeavors.

Balaswamy Vaddeman is a Hadoop hackathon winner for Hyderabad in 2013. He is one of the top contributors on the Hive tag at http://www.stackoverflow.com. He is a big data professional with 3 years of experience. He is well known for training people on big data/Hadoop. So far, he has delivered six big data projects. He is a Java/J2EE expert with 8 years of IT experience and 5 years of RDBMS experience. He is an automation expert on Unix-based systems using Shell scripting. He has experience in setting up teams and bringing them up to speed on big data projects. He is an active participant in Hadoop/big data forums.

I would like to thank my wife, Radha, my son, Pandu, and my daughter, Bubly, for their cooperation in completing this book.

www.PacktPub.com

Support files, eBooks, discount offers, and more

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt’s online digital book library. Here, you can search, access, and read Packt’s entire library of books.

Why subscribe?

  • Fully searchable across every book published by Packt
  • Copy and paste, print, and bookmark content
  • On demand and accessible via a web browser

Free access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.

I dedicate this book to my daughter

Preface

With an increasing interest in big data analysis, Hive over Hadoop has become a cutting-edge data solution for storing, computing, and analyzing big data. The SQL-like syntax makes Hive easier to learn and widely accepted as a standard for interactive SQL queries over big data. The variety of features available within Hive provides us with the capability of doing complex big data analysis without advanced coding skills. The maturity of Hive lets it gradually merge and share its valuable architecture and functionalities across different computing frameworks beyond Hadoop.

Apache Hive Essentials prepares your journey to big data by covering the introduction of backgrounds and concepts in the big data domain along with the process of setting up and getting familiar with your Hive working environment in the first two chapters. In the next four chapters, the book guides you through discovering and transforming the value behind big data by examples and skills of Hive query languages. In the last four chapters, the book highlights well-selected and advanced topics, such as performance, security, and extensions as exciting adventures for this worthwhile big data journey.

What this book covers

Chapter 1, Overview of Big Data and Hive, introduces the evolution of big data, the Hadoop ecosystem, and Hive. You will also learn the Hive architecture and the advantages of using Hive in big data analysis.

Chapter 2, Setting Up the Hive Environment, describes the Hive environment setup and configuration. It also covers using Hive through the command line and development tools.

Chapter 3, Data Definition and Description, introduces the basic data types and data definition language for tables, partitions, buckets, and views in Hive.

Chapter 4, Data Selection and Scope, shows you ways to discover the data by querying, linking, and scoping the data in Hive.

Chapter 5, Data Manipulation, describes the process of exchanging, moving, sorting, and transforming the data in Hive.

Chapter 6, Data Aggregation and Sampling, explains how to do aggregation and sample using aggregation functions, analytic functions, windowing, and sample clauses.

Chapter 7, Performance Considerations, introduces the best practices of performance considerations in the aspects of design, file format, compression, storage, query, and job.

Chapter 8, Extensibility Considerations, describes how to extend Hive by creating user-defined functions, streaming, serializers, and deserializers.

Chapter 9, Security Considerations, introduces the area of Hive security in terms of authentication, authorization, and encryption.

Chapter 10, Working with Other Tools, discusses how Hive works with other big data tools. It also reviews the key milestones of Hive releases.

What you need for this book

You will need to install both Hadoop and Hive to run the examples in this book. The scripts in this book were written and tested with Cloudera Distributed Hadoop (CDH) v5.3 (contains Hive v0.13.x and Hadoop v2.5.0), Hortonworks Data Platform (HDP) v2.2 (contains Hive v0.14.0 and Hadoop v2.6.0), and Apache Hive 1.0.0 (with Hadoop 1.2.1) in pseudo-distributed mode. However, the majority of the scripts will also run on the previous versions of Hadoop and Hive. The following are the other software applications you may need for a better understanding of the Hive-related tools mentioned in the book. These tools are also available in the CDH or HDP packages.

  • Hue 2.2.0 and above
  • HBase 0.98.4
  • Oozie 4.0.0 and above
  • Zookeeper 3.4.5
  • Tez 0.6.0

Who this book is for

If you are a data analyst, developer, or user who wants to use Hive to explore and analyze data in Hadoop, this is the book for you. Whether you are new to big data or an expert, you will be able to master both the basic and the advanced features of Hive. Since Hive is an SQL-like language, some previous experience with the SQL language and databases is useful for a better understanding of this book.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: “Aggregate function can be used with other aggregate functions in the same select statement.”

A block of code is set as follows:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://myhost:3306/hive?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

customAuthenticator.java
package com.packtpub.hive.essentials.hiveudf;
import java.util.Hashtable;
import javax.security.sasl.AuthenticationException;
import org.apache.hive.service.auth.PasswdAuthenticationProvider;

Any command-line input or output is written as follows:

bash-4.1$ hdfs dfs -mkdir /tmp

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: “Click on the OK button and restart Oracle SQL Developer.”

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the book’s title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.

Chapter 1. Overview of Big Data and Hive

This chapter is an overview of big data and Hive, especially in the Hadoop ecosystem. It briefly introduces the evolution of big data so that readers know where they are in the journey of big data and find their preferred areas in future learning. This chapter also covers how Hive has become one of the leading tools in big data warehousing and why Hive is still competitive.

In this chapter, we will cover the following topics:

  • A short history from database and data warehouse to big data
  • Introducing big data
  • Relational and NoSQL databases versus Hadoop
  • Batch, real-time, and stream processing
  • Hadoop ecosystem overview
  • Hive overview

A short history

In the 1960s, when computers became a more cost-effective option for businesses, people started to use databases to manage data. Later on, in the 1970s, relational databases became more popular for business needs since they connected physical data to the logical business easily and closely. In the next decade, around the 1980s, Structured Query Language (SQL) became the standard query language for databases. The effectiveness and simplicity of SQL motivated lots of people to use databases and brought databases closer to a wide range of users and developers. From then on, people used databases for data application and management, and this continued for a long period of time.

Once plenty of data was collected, people started to think about how to deal with the old data. Then, the term data warehousing came up in the 1990s. From that time onwards, people started to discuss how to evaluate the current performance by reviewing the historical data. Various data models and tools were created at that time to help enterprises effectively manage, transform, and analyze the historical data. Traditional relational databases also evolved to provide more advanced aggregation and analytic functions as well as optimizations for data warehousing. The leading query language was still SQL, but it was more intuitive and powerful as compared to the previous versions. The data was still well structured and the model was normalized.

As we entered the 2000s, the Internet gradually became the topmost industry for the creation of the majority of data in terms of variety and volume. Newer technologies, such as social media analytics, web mining, and data visualizations, helped lots of businesses and companies deal with massive amounts of data for a better understanding of their customers, products, competition, and markets. The data volume grew and the data format changed faster than ever before, which forced people to search for new solutions, especially from the academic and open source areas. As a result, big data became a hot topic and a challenging field for many researchers and companies.

However, in every challenge there lies great opportunity. Hadoop was one of the open source projects earning wide attention due to its open source license and active communities. This was one of the few times that an open source project led to changes in technology trends before any commercial software products. Soon after, NoSQL databases and real-time and stream computing followed, and quickly became important components of big data ecosystems. Armed with these big data technologies, companies were able to review the past, evaluate the present, and also predict the future.

Around the 2010s, time to market became the key factor for making business competitive and successful. When it comes to big data analysis, people could not wait to see the reports or results. A short delay could make a great difference when making important business decisions. Decision makers wanted to see the reports or results immediately, within a few hours, minutes, or even possibly seconds in a few cases. Real-time analytical tools, such as Impala (http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html), Presto (http://prestodb.io/), Storm (https://storm.apache.org/), and so on, make this possible in different ways.

Introducing big data

Big data is not simply a big volume of data. Here, the word “Big” refers to the big scope of data. A well-known saying in this domain is to describe big data with the help of three words starting with the letter V. They are volume, velocity, and variety. But the analytical and data science world has seen data varying in other dimensions in addition to the fundamental 3Vs of big data, such as veracity, variability, volatility, visualization, and value. The different Vs mentioned so far are explained as follows:

  • Volume: This refers to the amount of data generated every second. 90 percent of the world’s data today has been created in the last two years. Since that time, the data in the world doubles every two years. Such big volumes of data are mainly generated by machines, networks, social media, and sensors, including structured, semi-structured, and unstructured data.
  • Velocity: This refers to the speed at which the data is generated, stored, analyzed, and moved around. With the availability of Internet-connected devices, wireless or wired, machines and sensors can pass on their data as soon as it is created. This leads to real-time streaming and helps businesses to make valuable and fast decisions.
  • Variety: This refers to the different data formats. Data used to be stored as text, dat, and csv from sources such as filesystems, spreadsheets, and databases. This type of data that resides in a fixed field within a record or file is called structured data. Nowadays, data is not always in the traditional format. The newer semi-structured or unstructured forms of data can come from various sources such as e-mails, photos, audio, video, PDFs, SMSes, or even something we have no idea about. These varieties of data formats create problems for storing and analyzing data. This is one of the major challenges we need to overcome in the big data domain.
  • Veracity: This refers to the quality of data, such as trustworthiness, biases, noise, and abnormality in data. Corrupt data is quite normal. It could originate due to a number of reasons, such as typos, missing or uncommon abbreviations, data reprocessing, system failures, and so on. However, ignoring this bad data could lead to inaccurate data analysis and eventually a wrong decision. Therefore, making sure the data is correct in terms of data auditing and correction is very important for big data analysis.
  • Variability: This refers to the changing of data. It means that the same data could have different meanings in different contexts. This is particularly important when carrying out sentiment analysis. The analysis algorithms are able to understand the context and discover the exact meaning and values of data in that context.
  • Volatility: This refers to how long the data is valid and stored. This is particularly important for real-time analysis. It requires a target scope of data to be determined so that analysts can focus on particular questions and gain good performance out of the analysis.
  • Visualization: This refers to the way of making data well understood. Visualization does not mean ordinary graphs or pie charts. It makes vast amounts of data comprehensible in a multidimensional view that is easy to understand. Visualization is an innovative way to show changes in data. It requires lots of interaction, conversations, and joint efforts between big data analysts and business domain experts to make the visualization meaningful.
  • Value: This refers to the knowledge gained from data analysis on big data. The value of big data is how organizations turn themselves into big data-driven companies and use the insight from big data analysis for their decision making.

In summary, big data is not just about lots of data; it is a practice to discover new insight from existing data and guide the analysis for future data. A big-data-driven business will be more agile and competitive to overcome challenges and win competitions.

Relational and NoSQL database versus Hadoop

Let's compare different data solutions with the ways of traveling. You will be surprised to find that they have many similarities. When people travel, they either take cars or airplanes depending on the travel distance and cost. For example, when you travel to Vancouver from Toronto, an airplane is always the first choice in terms of the travel time versus cost. When you travel to Niagara Falls from Toronto, a car is always a good choice. When you travel to Montreal from Toronto, some people may prefer taking a car to an airplane. The distance and cost here are like the big data volume and investment. The traditional relational database is like the car in this example. The Hadoop big data tool is like the airplane in this example. When you deal with a small amount of data (short distance), a relational database (like the car) is always the best choice since it is faster and more agile in dealing with a small or moderate size of data. When you deal with a big amount of data (long distance), Hadoop (like the airplane) is the best choice since it is more linear, fast, and stable in dealing with the big size of data. Conversely, you can drive from Toronto to Vancouver, but it takes too much time. You can also take an airplane from Toronto to Niagara, but it could take more time and cost way more than if you travel by car. In addition, you may have a choice to either take a ship or a train. This is like a NoSQL database, which offers characteristics of both a relational database and Hadoop in terms of good performance and various data format support for big data.

Batch, real-time, and stream processing

Batch processing is used to process data in batches: it reads data input, processes it, and writes it to the output. Apache Hadoop is the most well-known and popular open source implementation of batch processing and a distributed system using the MapReduce paradigm. The data is stored in a shared and distributed filesystem called Hadoop Distributed File System (HDFS), divided into splits, which are the logical data divisions for MapReduce processing. To process these splits using the MapReduce paradigm, the map task reads a split, passes all of its key/value pairs to a map function, and writes the results to intermediate files. After the map phase is completed, the reducer reads the intermediate files and passes them to the reduce function. Finally, the reduce task writes results to the final output files. The advantages of the MapReduce model include making distributed programming easier, near-linear speedup, good scalability, as well as fault tolerance. The disadvantage of this batch processing model is that it is poorly suited to recursive or iterative jobs. In addition, the obvious batch behavior is that all input must be processed by the map phase before the reduce phase starts, which makes MapReduce unsuitable for online and stream processing use cases.
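To make the map, intermediate grouping, and reduce steps concrete, here is a minimal single-process sketch in Python. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, run_job) and the word-count task are assumptions chosen for illustration, and the in-memory list stands in for the intermediate files.

```python
from collections import defaultdict

def map_phase(split):
    """Map task: emit a (word, 1) pair for every word in one input split."""
    for line in split:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Group intermediate pairs by key before they reach the reducers."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce function: collapse all values of one key into a single result."""
    return key, sum(values)

def run_job(splits):
    # All map output is collected first; only then does the reduce phase run.
    # This barrier is the batch behavior described in the text.
    intermediate = [pair for split in splits for pair in map_phase(split)]
    return dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())

# Two hypothetical input splits, each a list of text lines.
splits = [["hive runs on hadoop"], ["hadoop stores data", "hive queries data"]]
word_counts = run_job(splits)
```

Because run_job builds the complete intermediate list before reducing, a record that arrives after the job has started cannot influence the result, which is why this model does not fit streaming input.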

Real-time processing is to process data and get the result almost immediately. This concept in the area of real-time ad hoc queries over big data was first implemented in Dremel by Google. It uses a novel columnar storage format for nested structures with fast index and scalable aggregation algorithms for computing query results in parallel instead of batch sequences. These two techniques are the major characteristics of real-time processing and are used by similar implementations, such as Cloudera Impala, Facebook Presto, Apache Drill, and Hive on Tez powered by Stinger, whose effort is to make a 100x performance improvement over Apache Hive. On the other hand, in-memory computing no doubt offers other solutions for real-time processing. In-memory computing offers very high bandwidth, which is more than 10 gigabytes/second, compared to hard disks' 200 megabytes/second. Also, the latency is comparatively lower, nanoseconds versus milliseconds, compared to hard disks. With the price of RAM going lower and lower each day, in-memory computing is becoming more affordable as a real-time solution, such as Apache Spark, which is a popular open source implementation of in-memory computing. Spark can be easily integrated with Hadoop, and the resilient distributed dataset can be generated from data sources such as HDFS and HBase for efficient caching.
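The bandwidth figures quoted above can be turned into a rough back-of-the-envelope comparison. The sketch below assumes a 1 terabyte dataset (decimal units) read sequentially at the stated rates; the dataset size is our assumption, and real systems add latency, parallelism, and caching effects that this ignores.

```python
def scan_seconds(data_bytes, bytes_per_second):
    """Time to read data_bytes sequentially at a sustained bandwidth."""
    return data_bytes / bytes_per_second

ONE_TERABYTE = 10**12          # assumed dataset size, decimal units
DISK_BANDWIDTH = 200 * 10**6   # 200 megabytes/second, as quoted in the text
MEMORY_BANDWIDTH = 10 * 10**9  # 10 gigabytes/second, as quoted in the text

disk_seconds = scan_seconds(ONE_TERABYTE, DISK_BANDWIDTH)
memory_seconds = scan_seconds(ONE_TERABYTE, MEMORY_BANDWIDTH)
speedup = disk_seconds / memory_seconds
```

Under these assumptions, a full scan drops from 5,000 seconds on a single disk to 100 seconds from memory, a factor of 50.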

Stream processing is to continuously process and act on the live stream data to get a result. In stream processing, there are two popular frameworks: Storm (https://storm.apache.org/) from Twitter and S4 (http://incubator.apache.org/s4/) from Yahoo!. Both the frameworks run on the Java Virtual Machine (JVM) and both process keyed streams. In terms of the programming model, S4 is a program defined as a graph of Processing Elements (PE), small subprograms, and S4 instantiates a PE per key. In short, Storm gives you the basic tools to build a framework, while S4 gives you a well-defined framework.
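The idea of instantiating one processing element per key can be sketched in a few lines of Python. This is not the S4 or Storm API; CountingPE and KeyedStreamRouter are hypothetical names, and the sketch runs in a single process only to show how per-key state is created lazily and updated as events arrive.

```python
class CountingPE:
    """A hypothetical processing element that keeps state for a single key."""

    def __init__(self, key):
        self.key = key
        self.count = 0

    def process(self, value):
        # Update the per-key state and emit the running total.
        self.count += value
        return self.key, self.count


class KeyedStreamRouter:
    """Routes each event to the PE for its key, creating the PE on first use."""

    def __init__(self, pe_factory):
        self.pe_factory = pe_factory
        self.elements = {}

    def on_event(self, key, value):
        if key not in self.elements:
            self.elements[key] = self.pe_factory(key)
        return self.elements[key].process(value)


router = KeyedStreamRouter(CountingPE)
stream = [("hive", 1), ("spark", 1), ("hive", 1)]
running_totals = [router.on_event(key, value) for key, value in stream]
```

Unlike the batch model, every event produces an updated result immediately, and the two keys seen in the stream end up with two independent processing elements.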

Overview of the Hadoop ecosystem

Apache released Hadoop version 1.0.0 in 2011. It only contained HDFS and MapReduce. Hadoop was designed as both a computing (MapReduce) and storage (HDFS) platform from the very beginning. With the increasing need for big data analysis, Hadoop has attracted lots of other software to resolve big data questions together, merging into a Hadoop-centric big data ecosystem. The following diagram gives a brief introduction to the Hadoop ecosystem and the core software or components in the ecosystem:

The Hadoop ecosystem

In the current Hadoop ecosystem, HDFS is still the major storage option. On top of it, Snappy, RCFile, Parquet, and ORCFile could be used for storage optimization. Core Hadoop MapReduce released a version 2.0 called YARN for better performance and scalability. Spark and Tez, as solutions for real-time processing, are able to run on YARN to work with Hadoop closely. HBase is a leading NoSQL database, especially when there is a NoSQL database request on the deployed Hadoop clusters. Sqoop is still one of the leading and mature tools for exchanging data between Hadoop and relational databases. Flume is a mature, distributed, and reliable log-collecting tool to move or collect data to HDFS. Impala and Presto query directly against the data on HDFS for better performance. However, Hortonworks focuses on the Stinger initiative to make Hive 100 times faster. In addition, Hive over Spark and Hive over Tez offer a choice for users to run Hive on other computing frameworks rather than MapReduce. As a result, Hive is playing a more important role in the ecosystem than ever.

Hive overview

Hive is a standard for SQL queries over petabytes of data in Hadoop. It provides SQL-like access to data in HDFS, allowing Hadoop to be used like a data warehouse. The Hive Query Language (HQL) has similar semantics and functions to standard SQL in relational databases, so experienced database analysts can easily get their hands on it. Hive's query language can run on different computing frameworks, such as MapReduce, Tez, and Spark, for better performance.

Hive's data model provides a high-level, table-like structure on top of HDFS. It supports three data structures: tables, partitions, and buckets, where tables correspond to HDFS directories and can be divided into partitions, which in turn can be divided into buckets. Hive supports a majority of primitive data types, such as TIMESTAMP, STRING, FLOAT, BOOLEAN, DECIMAL, DOUBLE, INT, SMALLINT, and BIGINT, and complex data types, such as UNION, STRUCT, MAP, and ARRAY.
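The mapping from tables, partitions, and buckets to HDFS paths can be illustrated with a short Python sketch. Several details here are assumptions rather than statements from this chapter: the partition columns year and month are hypothetical, partitions are assumed to be stored as col=value subdirectories, the bucket of an integer key is assumed to be the key modulo the bucket count, and bucket files are assumed to be named by a zero-padded index.

```python
import posixpath

WAREHOUSE_DIR = "/user/hive/warehouse"  # Hive's default warehouse path

def partition_dir(table, partition_spec):
    """Directory of one partition; each partition column adds a col=value level."""
    parts = [f"{col}={val}" for col, val in partition_spec]
    return posixpath.join(WAREHOUSE_DIR, table, *parts)

def bucket_index(int_key, num_buckets):
    """Bucket of an integer key, assuming the hash of an INT is its own value."""
    return (int_key & 0x7FFFFFFF) % num_buckets

def bucket_file(table, partition_spec, int_key, num_buckets):
    # Assumed naming: one file per bucket, zero-padded index plus a suffix.
    name = f"{bucket_index(int_key, num_buckets):06d}_0"
    return posixpath.join(partition_dir(table, partition_spec), name)

path = bucket_file("employee", [("year", 2015), ("month", 1)],
                   int_key=42, num_buckets=4)
```

The point of the sketch is the nesting order: a table is a directory, each partition narrows it to a subdirectory, and each bucket is a file inside that subdirectory.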

The following diagram shows the architecture of Hive within the Hadoop ecosystem. The Hive metadata store (also called the metastore) can use either embedded, local, or remote databases. Hive servers are built on Apache Thrift Server technology. Since Hive 0.11, HiveServer2 is available to handle multiple concurrent clients. It supports Kerberos, LDAP, and custom pluggable authentication, providing better options for JDBC and ODBC clients, especially for metadata access.

Hive architecture

Here are some highlights of Hive that we can keep in mind moving forward:

• Hive provides a simpler query model with less coding than MapReduce
• HQL and SQL have similar syntax
• Hive provides lots of functions that lead to easier analytics usage
• The response time is typically much faster than other types of queries on the same type of huge datasets
• Hive supports running on different computing frameworks
• Hive supports ad hoc querying data on HDFS
• Hive supports user-defined functions, scripts, and a customized I/O format to extend its functionality
• Hive is scalable and extensible to various types of data and bigger datasets
• Matured JDBC and ODBC drivers allow many applications to pull Hive data for seamless reporting
• Hive allows users to read data in arbitrary formats, using SerDes and Input/Output formats
• Hive has a well-defined architecture for metadata management, authentication, and query optimizations
• There is a big community of practitioners and developers working on and using Hive

Summary

After going through this chapter, we are now able to understand why and when to use big data instead of a traditional relational database. We also understand the difference between batch processing, real-time processing, and stream processing. We got familiar with the Hadoop ecosystem, especially Hive. We have also gone back in time and brushed through the history of database and warehouse to big data along with some big data terms, the Hadoop ecosystem, Hive architecture, and the advantage of using Hive. In the next chapter, we will practice setting up Hive and all the tools needed to get started using Hive in the command line.

Chapter 2. Setting Up the Hive Environment

This chapter will introduce how to install and set up the Hive environment in the cluster and cloud. It also covers the usage of basic Hive commands and the Hive-integrated development environment.

In this chapter, we will cover the following topics:

• Installing Hive from Apache
• Installing Hive from vendor packages
• Starting Hive in the cloud
• Using the Hive command line and Beeline
• The Hive-integrated development environment

Installing Hive from Apache

To introduce the Hive installation, we use Hive version 1.0.0 as an example. The pre-installation requirements for this installation are as follows:

• JDK 1.7.0_51
• Hadoop 0.20.x, 0.23.x.y, 1.x.y, or 2.x.y
• Ubuntu 14.04/CentOS 6.2

Note

Since we focus on Hive in this book, the installation steps for Java and Hadoop are not provided here. For steps on installing them, please refer to https://www.java.com/en/download/help/download_options.xml and http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html.

The following steps describe how to install Hive from Apache through the Linux command line:

1. Download Hive from Apache Hive and unpack it:

bash-4.1$ wget http://apache.mirror.rafal.ca/hive/hive-1.0.0/apache-hive-1.0.0-bin.tar.gz
bash-4.1$ tar -zxvf apache-hive-1.0.0-bin.tar.gz

2. Add Hive to the system path by opening /etc/profile or ~/.bashrc and adding the following two rows:

export HIVE_HOME=/home/hivebooks/apache-hive-1.0.0-bin
export PATH=$PATH:$HIVE_HOME/bin:$HIVE_HOME/conf

3. Enable the settings immediately:

bash-4.1$ source /etc/profile

4. Create the configuration files:

bash-4.1$ cd apache-hive-1.0.0-bin/conf
bash-4.1$ cp hive-default.xml.template hive-site.xml
bash-4.1$ cp hive-env.sh.template hive-env.sh
bash-4.1$ cp hive-exec-log4j.properties.template hive-exec-log4j.properties
bash-4.1$ cp hive-log4j.properties.template hive-log4j.properties

5. Modify the configuration file at $HIVE_HOME/conf/hive-env.sh:

# Set HADOOP_HOME to point to a specific Hadoop install directory
export HADOOP_HOME=/home/hivebooks/hadoop-2.2.0
# Hive Configuration Directory can be accessed at:
export HIVE_CONF_DIR=/home/hivebooks/apache-hive-1.0.0-bin/conf

6. Modify the configuration file at $HIVE_HOME/conf/hive-site.xml. There are some important parameters that need special attention:

hive.metastore.warehouse.dir: This is the path for Hive warehouse storage. By default it is /user/hive/warehouse.
hive.exec.scratchdir: This is the temporary data file path. By default it is /tmp/hive-${user.name}.

By default, Hive uses the Derby (http://db.apache.org/derby/) database as the metadata store. Hive can also use other databases, such as PostgreSQL (http://www.postgresql.org/) or MySQL (http://www.mysql.com/), as the metadata store. To configure Hive to use other databases, the following parameters should be configured:

javax.jdo.option.ConnectionURL // the database URL
javax.jdo.option.ConnectionDriverName // the JDBC driver name
javax.jdo.option.ConnectionUserName // database username
javax.jdo.option.ConnectionPassword // database password

The following is an example setting using MySQL as the metastore database:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://myhost:3306/hive?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
  <description>username to use against metastore database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive</value>
  <description>password to use against metastore database</description>
</property>

Make sure the MySQL JDBC driver is available at $HIVE_HOME/lib.
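When the same metastore settings have to be produced for several environments, it can help to generate the property elements rather than edit them by hand. The following Python sketch builds them with the standard library. The helper name build_hive_site is hypothetical, and wrapping the properties in a configuration root element follows the usual Hadoop configuration file layout, which is an assumption as far as this chapter's excerpt goes.

```python
import xml.etree.ElementTree as ET

def build_hive_site(properties):
    """Build a <configuration> element with one <property> per setting."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return root

# The four metastore parameters from the MySQL example above.
metastore_settings = {
    "javax.jdo.option.ConnectionURL":
        "jdbc:mysql://myhost:3306/hive?createDatabaseIfNotExist=true",
    "javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
    "javax.jdo.option.ConnectionUserName": "hive",
    "javax.jdo.option.ConnectionPassword": "hive",
}

site = build_hive_site(metastore_settings)
xml_text = ET.tostring(site, encoding="unicode")
```

Generating the file this way also avoids the kind of accidental line break inside a value that would corrupt the JDBC URL.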

Note

The difference between an embedded Derby database and an external database is that an external database offers a shared service so that users can share the metadata of Hive. However, an embedded database is only visible to the local users.

7. Create folders and grant proper write permissions to the user group in the HDFS folder:

bash-4.1$ hdfs dfs -mkdir /tmp
bash-4.1$ hdfs dfs -mkdir -p /user/hive/warehouse
bash-4.1$ hdfs dfs -chmod g+w /tmp
bash-4.1$ hdfs dfs -chmod g+w /user/hive/warehouse

That's all about Apache Hive installation. In one of the Hive nodes installed, type hive to enter the Hive command-line environment (hive>), which verifies Hive is successfully installed.

Installing Hive from vendor packages

Right now, many companies, such as Cloudera, MapR, IBM, and Hortonworks, have packaged Hadoop into more easily manageable distributions. Each company takes a slightly different strategy, but the consensus for all of these packages is to make Hadoop easier to use for enterprise. For example, we can easily install Hive from Cloudera Distributed Hadoop (CDH), which can be downloaded from http://www.cloudera.com/content/cloudera/en/downloads/cdh.html.

Once CDH is installed to have the Hadoop environment ready, we can add Hive to the Hadoop cluster by following a few steps:

1. Log in to the Cloudera manager and click on the drop-down button after the cluster name to choose Add a Service.

Cloudera manager main page

2. In the first Add Service Wizard page, choose Hive to install.

3. In the second Add Service Wizard page, set the dependencies for the service. Sentry is the authorization policy service for Hive.

4. In the third Add Service Wizard page, choose the proper hosts for HiveServer2, Hive Metastore Server, WebHCat Server, and Gateway.

5. In the fourth Add Service Wizard page, configure Hive Metastore Server database connections.

6. In the last page of Add Service Wizard, review the changes on the Hive warehouse directory and metastore server port number. Keep the default values and click on the Continue button to start installing the Hive service. Once it is complete, close the wizard to finish the Hive installation.

Note

Hive can also be installed along with other services when we first install CDH in the cluster, or we can directly import the vendors' quick-start Hadoop virtual machine image.

Starting Hive in the cloud

Right now, Amazon EMR, Cloudera Director, and Microsoft Azure HDInsight Service are some of the major offerings of mature Hadoop and Hive services in the cloud. Using the cloud version of Hive is very convenient. It requires almost no installation and setup.

Amazon EMR (http://aws.amazon.com/elasticmapreduce/) is the earliest Hadoop service in the cloud. However, it is not a pure open source version of Hadoop, but is customized to run only on the AWS cloud. Cloudera is one of the first few players that offered open source Hadoop solutions to the enterprise. Since the middle of October 2014, Cloudera has delivered Cloudera Director (http://www.cloudera.com/content/cloudera/en/products-and-services/director.html), which opens up Hadoop deployments in the cloud through a simple, self-service interface, and is fully supported on Amazon Web Services. Windows Azure HDInsight Service (http://azure.microsoft.com/en-us/documentation/services/hdinsight/) is a service that deploys and provisions Apache Hadoop clusters in the Azure cloud. Although Hadoop was first built on Linux, Hortonworks and Microsoft have partnered to bring the benefits of Apache Hadoop to the Windows Azure cloud.

The consensus among all the vendors here is to allow the enterprise to provision highly available Hadoop clusters powered with flexibility, security, management, and governance functionalities with a very simple user interface.

Using the Hive command line and Beeline

Hive first started with HiveServer1. However, this version of the Hive server was not very stable. It sometimes suspended or blocked clients' connections quietly. Since version 0.11, Hive includes a new Hive server called HiveServer2 as an addition to HiveServer1. HiveServer2 is an enhanced Hive server designed for multiclient concurrency and improved authentication. HiveServer2 also supports Beeline as the alternative command-line interface. HiveServer1 is deprecated and has been removed from Hive since version 1.0.0.

The primary difference between the two Hive servers is how the clients connect to Hive. Hive CLI is an Apache Thrift-based client, and Beeline is a JDBC client based on the SQLLine (http://sqlline.sourceforge.net/) CLI. The Hive CLI directly connects to the Hive drivers and requires installing Hive on the same machine as the client. However, Beeline connects to HiveServer2 through JDBC connections and does not require the installation of Hive libraries on the same machine as the client. That means we can run Beeline remotely from outside of the Hadoop cluster.
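Because Beeline only needs a JDBC URL and credentials, a remote invocation is easy to assemble from a script. The Python sketch below builds the argument list for a one-shot query. It uses the -u, -n, -p, and -e options and the jdbc:hive2://host:port/database URL form that appear in this chapter; the helper names, the default port of 10000, and the sample credentials are assumptions for illustration, and the command is only constructed here, not executed.

```python
import shlex

def hive2_jdbc_url(host, port=10000, database="default"):
    """JDBC URL for HiveServer2 in the jdbc:hive2://host:port/database form."""
    return f"jdbc:hive2://{host}:{port}/{database}"

def beeline_command(host, username, password, query, **url_options):
    """Argument list for a one-shot Beeline query using -u, -n, -p, and -e."""
    url = hive2_jdbc_url(host, **url_options)
    return ["beeline", "-u", url, "-n", username, "-p", password, "-e", query]

# Hypothetical host and credentials.
argv = beeline_command("localhost", "hive", "hive", "SHOW TABLES;")

# A shell-safe rendering of the same command, for logging or copy and paste.
command_line = " ".join(shlex.quote(arg) for arg in argv)
```

Passing the list form to a process launcher avoids quoting problems with queries that contain spaces or semicolons.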

The following table lists the commonly used commands for both Beeline and Hive CLI. For more usage of HiveServer2 and Beeline, refer to https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients.

Purpose | HiveServer2 Beeline | HiveServer1 CLI
Server connection | beeline -u <jdbcurl> -n <username> -p <password> | hive -h <hostname> -p <port>
Help | beeline -h or beeline --help | hive -H
Run query | beeline -e <query in quotes> or beeline -f <query file name> | hive -e <query in quotes> or hive -f <query file name>
Define variable | beeline --hivevar key=value (This is available after Hive 0.13.0.) | hive --hivevar key=value

The following is the command-line syntax in Beeline or Hive CLI:

Purpose | HiveServer2 Beeline | HiveServer1 CLI
Enter mode | beeline | hive
Connect | !connect <jdbcurl> | n/a
List tables | !table | show tables;
List columns | !column <table_name> | desc <table_name>;
Run query | <HQL query>; | <HQL query>;
Save result set | !record <file_name> and !record | N/A
Run shell CMD | !sh ls (This is available since Hive 0.14.0.) | !ls;
Run dfs CMD | dfs -ls | dfs -ls;
Run file of SQL | !run <file_name> | source <file_name>;
Check Hive version | !dbinfo | !hive --version;
Quit mode | !quit | quit;

Note

For Beeline, ; is not needed after a command that starts with !.

When running a query in Hive CLI, the MapReduce statistics information is shown in the console screen while processing, whereas Beeline does not show it.

Neither Beeline nor Hive CLI supports running a pasted query with <tab> inside, because <tab> is used for autocomplete by default in the environment. Alternatively, running the query from files has no such issues.

Hive CLI shows the exact line and position of the Hive query or syntax errors when the query has multiple lines. However, Beeline processes the multiple-line query as a single line, so only the position is shown for query or syntax errors, with the line number as 1 for all instances. In this respect, Hive CLI is more convenient than Beeline for debugging the Hive query.

In both Hive CLI and Beeline, using the up and down arrow keys can retrieve up to 10,000 previous commands. The !history command can be used in Beeline to show all history.

Both Hive CLI and Beeline support variable substitution; refer to https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VariableSubstitution.
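Variable substitution replaces references such as ${hivevar:name} in a query with values supplied through --hivevar key=value. The Python sketch below imitates that textual replacement so the effect is easy to see. It is a simplification and not Hive's implementation: the function name is hypothetical, only the hivevar namespace is handled, and raising an error for an undefined variable is a choice made for this sketch.

```python
import re

HIVEVAR_PATTERN = re.compile(r"\$\{hivevar:(\w+)\}")

def substitute_hivevars(query, variables):
    """Replace each ${hivevar:name} reference with its value from variables."""
    def lookup(match):
        name = match.group(1)
        if name not in variables:
            # Sketch-only behavior: fail loudly on an undefined variable.
            raise KeyError(f"undefined hivevar: {name}")
        return variables[name]
    return HIVEVAR_PATTERN.sub(lookup, query)

template = "SELECT name FROM ${hivevar:tbl} WHERE work_place[0] = '${hivevar:city}'"
resolved = substitute_hivevars(template, {"tbl": "employee", "city": "Montreal"})
```

Since the substitution is purely textual, quoting remains the query author's responsibility, as the quotes around the city reference show.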

A list of Hive configuration settings and properties can be accessed and overwritten by the set keyword from the command-line environment. For more details, refer to the Apache Hive wiki at https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties.

The Hive-integrated development environment

Besides the command-line interface, there are a few integrated development environment (IDE) tools available for Hive development. One of the best is Oracle SQL Developer, which leverages the powerful functionalities of the Oracle IDE and is totally free to use. If we have to use Oracle along with Hive in a project, it is quite convenient to switch between them within the same IDE.

Oracle SQL Developer has supported Hive since version 4.0.3. Configuring it to work with Hive is quite straightforward. The following are a few steps to configure the IDE to connect to Hive:

1. Download Hive JDBC drivers from the vendor website, such as Cloudera.
2. Unzip the JDBC version 4 driver to a local directory.
3. Start Oracle SQL Developer and navigate to Preferences | Database | Third Party JDBC Drivers.
4. Add all of the JAR files contained in the unzipped directory to the Third-party JDBC Driver Path setting as follows:

SQL developer configuration

5. Click on the OK button and restart Oracle SQL Developer.
6. Create new connections in the Hive tab, giving a proper Connection Name, Username, Password, Host name (Hive server hostname), Port, and Database. Then, click on the Add and Connect buttons to connect to Hive.

SQL developer connections

In Oracle SQL Developer, we can run all Hive interactive commands as well as Hive queries. We can also leverage the power of Oracle SQL Developer to browse and export data into a Hive table from the graphic user interface and wizard.

Besides the Hive IDE, Hive also has its own built-in web interface, Hive Web Interface. However, it is not powerful and is not used very often. Hue (http://gethue.com/) is another web interface for the Hadoop ecosystem, including Hive. It is a very powerful and user-friendly web user interface. More details about using Hue with Hive are introduced in Chapter 10, Working with Other Tools.

Summary

In this chapter, we introduced the setup of Hive in different environments with proper settings. We also looked into a few of the Hive interactive commands and queries in Hive CLI, Beeline, and IDEs. After going through this chapter, we should be able to set up our own Hive environment locally and use Hive from CLI or IDE tools.

In the next chapter, we will dive into the details of Hive data definition languages.

Chapter 3. Data Definition and Description

This chapter introduces the basic data types, data definition language, and schema in Hive to describe data. It also covers the best practices to describe data correctly and effectively by using internal or external tables, partitions, buckets, and views.

In this chapter, we will cover the following topics:

• Hive primitive and complex data types
• Data type conversions
• Hive tables
• Hive partitions
• Hive buckets
• Hive views

Understanding Hive data types

Hive data types are categorized into two types: primitive and complex data types. String and integer are the most useful primitive types, which are supported by most Hive functions.

Tip

Downloading the example code

You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

The details of primitive types are as follows:

Primitive data type | Description | Example
TINYINT | It has 1 byte from -128 to 127. The postfix is Y. It is used as a small range of numbers. | 10Y
SMALLINT | It has 2 bytes from -32,768 to 32,767. The postfix is S. It is used as a regular descriptive number. | 10S
INT | It has 4 bytes from -2,147,483,648 to 2,147,483,647. | 10
BIGINT | It has 8 bytes from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. The postfix is L. | 100L
FLOAT | This is a 4-byte single precision floating point number from 1.40129846432481707e-45 to 3.40282346638528860e+38 (positive or negative). Scientific notation is not yet supported. It stores very close approximations of numeric values. | 1.2345679
DOUBLE | This is an 8-byte double precision floating point number from 4.94065645841246544e-324d to 1.79769313486231570e+308d (positive or negative). Scientific notation is not yet supported. It stores very close approximations of numeric values. | 1.2345678901234567
DECIMAL | This was introduced in Hive 0.11.0 with a hardcoded precision of 38 digits. Hive 0.13.0 introduced user-definable precision and scale. The range is around 1 - 10^38 to 10^38 - 1. Decimal data types store exact representations of numeric values. The default definition of this type is decimal(10,0). | DECIMAL(3,2) for 3.14
BINARY | This was introduced in Hive 0.8.0 and only supports CAST to STRING and vice versa. | 1011
BOOLEAN | This is a TRUE or FALSE value. | TRUE
STRING | This includes characters expressed with either single quotes (') or double quotes ("). Hive uses C-style escaping within the strings. The max size is around 2G. | 'Books' or "Books"
CHAR | This is available starting with Hive 0.13.0. Most UDFs will work for this type after Hive 0.14.0. The maximum length is fixed at 255. | 'US' or "US"
VARCHAR | This is available starting with Hive 0.12.0. Most UDFs will work for this type after Hive 0.14.0. The maximum length is fixed at 65535. If a string value being converted/assigned to a varchar value exceeds the length specified, the string is silently truncated. | 'Books' or "Books"
DATE | This describes a specific year, month, and day in the format of YYYY-MM-DD. It is available since Hive 0.12.0. The range of date is from 0000-01-01 to 9999-12-31. | '2013-01-01'
TIMESTAMP | This describes a specific year, month, day, hours, minutes, seconds, and milliseconds in the format of YYYY-MM-DD HH:MM:SS[.fff…]. It is available since Hive 0.8.0. | '2013-01-01 12:00:01.345'
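The integer ranges in the table follow directly from the byte widths, since each type is a signed two's-complement number. The short Python sketch below derives them and picks the narrowest type for a given value; the helper names are ours and are not part of Hive.

```python
# Byte widths of the integer types, as listed in the table above.
INTEGER_TYPE_BYTES = {"TINYINT": 1, "SMALLINT": 2, "INT": 4, "BIGINT": 8}

def signed_range(num_bytes):
    """Smallest and largest values of a two's-complement integer of this width."""
    bits = num_bytes * 8
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

def smallest_type_for(value):
    """Narrowest integer type from the table whose range contains value."""
    for type_name, num_bytes in INTEGER_TYPE_BYTES.items():
        low, high = signed_range(num_bytes)
        if low <= value <= high:
            return type_name
    raise OverflowError(f"{value} does not fit in BIGINT")

ranges = {name: signed_range(size) for name, size in INTEGER_TYPE_BYTES.items()}
```

For example, 127 still fits in TINYINT, while 128 needs SMALLINT, which is the kind of boundary to keep in mind when choosing column types.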

Hive has three main complex types: ARRAY, MAP, and STRUCT. These data types are built on top of the primitive data types. ARRAY and MAP are similar to those in Java. STRUCT is a record type, which may contain a set of any type of fields. Complex types allow the nesting of types. The details of complex types are as follows:

Complex data type | Description | Example
ARRAY | This is a list of items of the same type, such as (val1, val2, and so on). You can access the value using array_name[index], for example, fruit[0]='apple'. | ['apple','orange','mango']
MAP | This is a set of key-value pairs, such as (key1, val1, key2, val2, and so on). You can access the value using map_name[key], for example, fruit[1]="apple". | {1: "apple", 2: "orange"}
STRUCT | This is a user-defined structure of any type of fields, such as {val1, val2, val3, and so on}. By default, STRUCT field names will be col1, col2, and so on. You can access the value using structs_name.column_name, for example, fruit.col1=1. | {1, "apple"}
NAMED STRUCT | This is a user-defined structure of any number of typed fields, such as (name1, val1, name2, val2, and so on). You can access the value using structs_name.column_name, for example, fruit.apple="gala". | {"apple":"gala","weight kg":1}
UNION | This is a structure that has exactly any one of the specified data types. It is available since Hive 0.7.0. It is not commonly used. | {2:["apple","orange"]}

Note

For MAP, the type of keys and values are unified. However, STRUCT is more flexible. STRUCT is more like a table whereas MAP is more like an ARRAY with a customized index.
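For readers who think in terms of a general-purpose language, the three main complex types can be compared with familiar Python structures. This is only an analogy under our own naming: the SexAge class and the sample values are illustrative, chosen to match the employee table used in the practice that follows.

```python
from typing import NamedTuple

class SexAge(NamedTuple):
    """Stand-in for STRUCT<sex:string,age:int>: fixed, named, typed fields."""
    sex: str
    age: int

# ARRAY<string>: items of one type, addressed by position.
work_place = ["Montreal", "Toronto"]

# MAP<string,int>: one key type and one value type, addressed by key.
skills_score = {"DB": 80}

# STRUCT: a record whose fields may each have a different type.
sex_age = SexAge(sex="Male", age=30)

first_city = work_place[0]      # comparable to work_place[0] in a query
db_score = skills_score["DB"]   # comparable to skills_score['DB']
age = sex_age.age               # comparable to sex_age.age
```

The comparison also mirrors the note above: the map is indexed like an array with custom keys, while the struct behaves like a small row with named columns.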

ThefollowingisashortpracticeforallthecommonlyusedHivetypes.ThedetailsoftheCREATE,LOAD,andSELECTstatementswillbedescribedlater.Let’stakealookattheprocess:

1. Prepare the data as follows:

-bash-4.1$ vi employee.txt
Michael|Montreal,Toronto|Male,30|DB:80|Product:Developer^DLead
Will|Montreal|Male,35|Perl:85|Product:Lead,Test:Lead
Shelley|New York|Female,27|Python:80|Test:Lead,COE:Architect
Lucy|Vancouver|Female,57|Sales:89,HR:94|Sales:Lead

2. Log in to Beeline with the proper HiveServer2 hostname, port number, database name, username, and password:

-bash-4.1$ beeline
beeline> !connect jdbc:hive2://localhost:10000/default
scan complete in 20ms
Connecting to jdbc:hive2://localhost:10000/default
Enter username for jdbc:hive2://localhost:10000/default: dayongd
Enter password for jdbc:hive2://localhost:10000/default:

3. Create a table using the ARRAY, MAP, and STRUCT composite data types:

jdbc:hive2://> CREATE TABLE employee
.......> (
.......> name string,
.......> work_place ARRAY<string>,
.......> sex_age STRUCT<sex:string,age:int>,
.......> skills_score MAP<string,int>,
.......> depart_title MAP<string,ARRAY<string>>
.......> )
.......> ROW FORMAT DELIMITED
.......> FIELDS TERMINATED BY '|'
.......> COLLECTION ITEMS TERMINATED BY ','
.......> MAP KEYS TERMINATED BY ':';
No rows affected (0.149 seconds)

4. Verify the table's creation:

jdbc:hive2://> !table employee
+-----------+--------------+------------+---------------+---------+
| TABLE_CAT | TABLE_SCHEMA | TABLE_NAME | TABLE_TYPE    | REMARKS |
+-----------+--------------+------------+---------------+---------+
|           | default      | employee   | MANAGED_TABLE |         |
+-----------+--------------+------------+---------------+---------+
jdbc:hive2://> !column employee
+-------------+------------+--------------+----------------------------+
| TABLE_SCHEM | TABLE_NAME | COLUMN_NAME  | TYPE_NAME                  |
+-------------+------------+--------------+----------------------------+
| default     | employee   | name         | STRING                     |
| default     | employee   | work_place   | array<string>              |
| default     | employee   | sex_age      | struct<sex:string,age:int> |
| default     | employee   | skills_score | map<string,int>            |
| default     | employee   | depart_title | map<string,array<string>>  |
+-------------+------------+--------------+----------------------------+

5. Load data into the table:

jdbc:hive2://> LOAD DATA LOCAL INPATH '/home/hadoop/employee.txt'
.......> OVERWRITE INTO TABLE employee;
No rows affected (1.023 seconds)

6. Query all the rows in the table:

jdbc:hive2://> SELECT * FROM employee;
+---------+---------------------+--------------+-------------------+--------------------------------+
| name    | work_place          | sex_age      | skills_score      | depart_title                   |
+---------+---------------------+--------------+-------------------+--------------------------------+
| Michael | [Montreal, Toronto] | [Male, 30]   | {DB=80}           | {Product=[Developer, Lead]}    |
| Will    | [Montreal]          | [Male, 35]   | {Perl=85}         | {Test=[Lead], Product=[Lead]}  |
| Shelley | [New York]          | [Female, 27] | {Python=80}       | {Test=[Lead], COE=[Architect]} |
| Lucy    | [Vancouver]         | [Female, 57] | {Sales=89, HR=94} | {Sales=[Lead]}                 |
+---------+---------------------+--------------+-------------------+--------------------------------+
4 rows selected (0.677 seconds)

7. Query the whole array and each array element in the table:

jdbc:hive2://> SELECT work_place FROM employee;
+---------------------+
| work_place          |
+---------------------+
| [Montreal, Toronto] |
| [Montreal]          |
| [New York]          |
| [Vancouver]         |
+---------------------+
4 rows selected (27.231 seconds)
jdbc:hive2://> SELECT work_place[0] AS col_1,
.......> work_place[1] AS col_2, work_place[2] AS col_3
.......> FROM employee;
+-----------+---------+-------+
| col_1     | col_2   | col_3 |
+-----------+---------+-------+
| Montreal  | Toronto |       |
| Montreal  |         |       |
| New York  |         |       |
| Vancouver |         |       |
+-----------+---------+-------+
4 rows selected (24.689 seconds)

8. Query the whole struct and each struct field in the table:

jdbc:hive2://> SELECT sex_age FROM employee;
+--------------+
| sex_age      |
+--------------+
| [Male, 30]   |
| [Male, 35]   |
| [Female, 27] |
| [Female, 57] |
+--------------+
4 rows selected (28.91 seconds)
jdbc:hive2://> SELECT sex_age.sex, sex_age.age FROM employee;
+--------+-----+
| sex    | age |
+--------+-----+
| Male   | 30  |
| Male   | 35  |
| Female | 27  |
| Female | 57  |
+--------+-----+
4 rows selected (26.663 seconds)

9. Query the whole map and each map value in the table:

jdbc:hive2://> SELECT skills_score FROM employee;
+-------------------+
| skills_score      |
+-------------------+
| {DB=80}           |
| {Perl=85}         |
| {Python=80}       |
| {Sales=89, HR=94} |
+-------------------+
4 rows selected (32.659 seconds)
jdbc:hive2://> SELECT name, skills_score['DB'] AS DB,
.......> skills_score['Perl'] AS Perl,
.......> skills_score['Python'] AS Python,
.......> skills_score['Sales'] AS Sales,
.......> skills_score['HR'] AS HR
.......> FROM employee;
+---------+----+------+--------+-------+----+
| name    | db | perl | python | sales | hr |
+---------+----+------+--------+-------+----+
| Michael | 80 |      |        |       |    |
| Will    |    | 85   |        |       |    |
| Shelley |    |      | 80     |       |    |
| Lucy    |    |      |        | 89    | 94 |
+---------+----+------+--------+-------+----+
4 rows selected (24.669 seconds)

Note

The column names shown in a Hive result set are always in lowercase letters.

10. Query the composite type in the table:

jdbc:hive2://> SELECT depart_title FROM employee;
+--------------------------------+
| depart_title                   |
+--------------------------------+
| {Product=[Developer, Lead]}    |
| {Test=[Lead], Product=[Lead]}  |
| {Test=[Lead], COE=[Architect]} |
| {Sales=[Lead]}                 |
+--------------------------------+
4 rows selected (30.583 seconds)
jdbc:hive2://> SELECT name,
.......> depart_title['Product'] AS Product,
.......> depart_title['Test'] AS Test,
.......> depart_title['COE'] AS COE,
.......> depart_title['Sales'] AS Sales
.......> FROM employee;
+---------+-------------------+--------+-------------+--------+
| name    | product           | test   | coe         | sales  |
+---------+-------------------+--------+-------------+--------+
| Michael | [Developer, Lead] |        |             |        |
| Will    | [Lead]            | [Lead] |             |        |
| Shelley |                   | [Lead] | [Architect] |        |
| Lucy    |                   |        |             | [Lead] |
+---------+-------------------+--------+-------------+--------+
4 rows selected (26.641 seconds)
jdbc:hive2://> SELECT name,
.......> depart_title['Product'][0] AS product_col0,
.......> depart_title['Test'][0] AS test_col0
.......> FROM employee;
+---------+--------------+-----------+
| name    | product_col0 | test_col0 |
+---------+--------------+-----------+
| Michael | Developer    |           |
| Will    | Lead         | Lead      |
| Shelley |              | Lead      |
| Lucy    |              |           |
+---------+--------------+-----------+
4 rows selected (26.659 seconds)
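Beyond index, key, and field access, Hive ships built-in functions that operate on whole complex values. The following is a small sketch against the employee table created in the preceding steps; the column aliases and the wp table alias are illustrative choices, not part of the original exercise:

```sql
-- Inspect the ARRAY and MAP columns with built-in collection functions.
SELECT name,
       size(work_place)                      AS num_work_places,
       array_contains(work_place, 'Toronto') AS works_in_toronto,
       map_keys(skills_score)                AS skill_names,
       map_values(skills_score)              AS skill_scores
FROM employee;

-- Flatten the work_place ARRAY into one output row per element.
SELECT name, place
FROM employee
LATERAL VIEW explode(work_place) wp AS place;
```

The second query is the usual way to turn an ARRAY column into rows when indexing each position by hand, as in step 7, is not practical.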

Note

The default delimiters in Hive are as follows:

Field delimiter: Ctrl + A or ^A (use \001 when creating the table)
Collection item delimiter: Ctrl + B or ^B (\002)
Map key delimiter: Ctrl + C or ^C (\003)

If a delimiter is overridden during table creation, the override only applies to the flat structure. This is still a limitation in Hive, described in Apache Jira HIVE-365 (https://issues.apache.org/jira/browse/HIVE-365).

For nested types, for example, the depart_title column in the preceding table, the level of nesting determines the delimiter. Using an ARRAY of ARRAY as an example, the delimiters for the outer ARRAY are Ctrl + B (\002) characters, as expected, but for the inner ARRAY they are Ctrl + C (\003) characters, the next delimiter in the list. For our example of using a MAP of ARRAY, the MAP key delimiter is \003, and the ARRAY delimiter is Ctrl + D or ^D (\004). This is why the sample data separates Developer and Lead with ^D rather than a comma.
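As a sketch of how these defaults look when written out, the following DDL names each default delimiter explicitly by its octal escape. The table name employee_default_delim is hypothetical, and the statement is expected to behave the same as omitting the ROW FORMAT clause altogether:

```sql
-- Hypothetical table that spells out Hive's default text delimiters.
-- \001 = Ctrl + A, \002 = Ctrl + B, \003 = Ctrl + C
CREATE TABLE employee_default_delim
(
  name         string,
  work_place   ARRAY<string>,
  skills_score MAP<string,int>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
```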


Data type conversions

Similar to Java, Hive supports both implicit type conversion and explicit type conversion.

Primitive type conversion from a narrow to a wider type is known as implicit conversion. However, the reverse conversion is not allowed. All the integral numeric types, FLOAT, and STRING can be implicitly converted to DOUBLE, and TINYINT, SMALLINT, and INT can all be converted to FLOAT. BOOLEAN types cannot be converted to any other type. In the Apache Hive wiki, there is a data type cross table describing the allowed implicit conversions between every two types in Hive, which can be found at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types.

Explicit type conversion uses the CAST function with the CAST(value AS TYPE) syntax. For example, CAST('100' AS INT) converts the string 100 to the integer value 100. If the cast fails, such as CAST('INT' AS INT), the function returns NULL. In addition, the BINARY type can only be cast to STRING, and then cast from STRING to other types, if needed.
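The three CAST cases described above can be tried directly from Beeline. This is a minimal sketch, assuming Hive 0.13.0 or later so that SELECT works without a FROM clause; the aliases are illustrative:

```sql
-- Explicit conversion: the string '100' becomes the integer 100.
SELECT CAST('100' AS INT) AS ok_cast;

-- A cast that cannot succeed yields NULL instead of raising an error.
SELECT CAST('INT' AS INT) AS failed_cast;

-- A BINARY value is cast to STRING first, and only then to another type.
SELECT CAST(CAST(CAST('100' AS BINARY) AS STRING) AS INT) AS via_string;
```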


Hive Data Definition Language

Hive Data Definition Language (DDL) is a subset of Hive SQL statements that describe the data structure in Hive by creating, deleting, or altering schema objects such as databases, tables, views, partitions, and buckets. Most Hive DDL statements start with the keywords CREATE, DROP, or ALTER. The syntax of Hive DDL is very similar to the DDL in SQL. Comments in Hive start with --.


Hive database

The database in Hive describes a collection of tables that are used for a similar purpose or belong to the same group. If the database is not specified, the default database is used. Whenever a new database is created, Hive creates a directory for it under /user/hive/warehouse, the location defined in hive.metastore.warehouse.dir. For example, the myhivebook database is located at /user/hive/warehouse/myhivebook.db. However, the default database doesn't have its own directory. The following is the core DDL for Hive databases:

Create the database without checking whether the database already exists:

jdbc:hive2://> CREATE DATABASE myhivebook;

Create the database and check whether the database already exists:

jdbc:hive2://> CREATE DATABASE IF NOT EXISTS myhivebook;

Create the database with location, comments, and metadata information:

jdbc:hive2://> CREATE DATABASE IF NOT EXISTS myhivebook
.......> COMMENT 'hive database demo'
.......> LOCATION '/hdfs/directory'
.......> WITH DBPROPERTIES ('creator'='dayongd','date'='2015-01-01');

Show and describe the database with wildcards:

jdbc:hive2://> SHOW DATABASES;
+---------------+
| database_name |
+---------------+
| default       |
+---------------+
1 row selected (1.7 seconds)
jdbc:hive2://> SHOW DATABASES LIKE 'my.*';
jdbc:hive2://> DESCRIBE DATABASE default;
+---------+-----------------------+-------------------------------------------+
| db_name | comment               | location                                  |
+---------+-----------------------+-------------------------------------------+
| default | Default Hive database | hdfs://localhost:8020/user/hive/warehouse |
+---------+-----------------------+-------------------------------------------+
1 row selected (1.352 seconds)

Use the database:

jdbc:hive2://> USE myhivebook;

Drop the empty database:

jdbc:hive2://> DROP DATABASE IF EXISTS myhivebook;

Note

Hive keeps the database and the table in directory mode. In order to remove the parent directory, we need to remove the subdirectories first. By default, the database cannot be dropped if it is not empty, unless CASCADE is specified. CASCADE drops the tables in the database automatically before dropping the database.

Drop the database with CASCADE:

jdbc:hive2://> DROP DATABASE IF EXISTS myhivebook CASCADE;

Alter the database properties. The ALTER DATABASE statement can only change the database properties and the owner (a user or a role, in Hive 0.13.0 and later) of the database. The other metadata about the database cannot be changed:

jdbc:hive2://> ALTER DATABASE myhivebook
.......> SET DBPROPERTIES ('edited-by'='Dayong');
jdbc:hive2://> ALTER DATABASE myhivebook
.......> SET OWNER user dayongd;

Note

SHOW and DESCRIBE

The SHOW and DESCRIBE keywords in Hive are used to show the definition information for most of the Hive objects, such as tables, partitions, and so on.

The SHOW statement supports a wide range of Hive objects, such as tables, tables' properties, table DDL, indexes, partitions, columns, functions, locks, roles, configurations, transactions, and compactions.

The DESCRIBE statement supports a smaller range of Hive objects, such as databases, tables, views, columns, and partitions. However, the DESCRIBE statement is able to provide more detailed information when combined with the EXTENDED or FORMATTED keywords.

In this book, there is no single section to introduce SHOW and DESCRIBE, but we introduce their usage in line with other HQL through the remaining chapters.
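As a quick orientation, a few commonly used forms of both statements are sketched here against the employee table from the earlier exercise. The 'emp*' pattern is only an illustrative choice, and the exact output layout depends on the Hive version:

```sql
-- List tables, optionally filtered by a wildcard pattern.
SHOW TABLES;
SHOW TABLES 'emp*';

-- Show the DDL and the table properties of one table.
SHOW CREATE TABLE employee;
SHOW TBLPROPERTIES employee;

-- List the available functions.
SHOW FUNCTIONS;

-- Describe objects at increasing levels of detail.
DESCRIBE DATABASE EXTENDED default;
DESCRIBE employee;
DESCRIBE EXTENDED employee;
DESCRIBE FORMATTED employee;
```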


Hive internal and external tables

The concept of a table in Hive is very similar to the table in a relational database. Each table is associated with a directory in HDFS, whose parent location is configured in ${HIVE_HOME}/conf/hive-site.xml. By default, it is /user/hive/warehouse in HDFS. For example, /user/hive/warehouse/employee is created by Hive in HDFS for the employee table. All the data in the table is kept in that directory. Such Hive tables are also referred to as internal or managed tables.

When there is data already in HDFS, an external Hive table can be created to describe the data. It is called EXTERNAL because the data in the external table is specified in the LOCATION property instead of the default warehouse directory. When keeping data in internal tables, Hive fully manages the life cycle of the table and data. This means the data is removed once the internal table is dropped. If the external table is dropped, the table metadata is deleted but the data is kept. Most of the time, an external table is preferred to avoid deleting data along with tables by mistake. The following are DDLs for Hive internal and external table examples:

Show the data file's location and content for the employee internal table:

bash-4.1$ vi /home/hadoop/employee.txt
Michael|Montreal,Toronto|Male,30|DB:80|Product:Developer^DLead
Will|Montreal|Male,35|Perl:85|Product:Lead,Test:Lead
Shelley|New York|Female,27|Python:80|Test:Lead,COE:Architect
Lucy|Vancouver|Female,57|Sales:89,HR:94|Sales:Lead

Create the internal table and load the data:

jdbc:hive2://> CREATE TABLE IF NOT EXISTS employee_internal
.......> (
.......> name string,
.......> work_place ARRAY<string>,
.......> sex_age STRUCT<sex:string,age:int>,
.......> skills_score MAP<string,int>,
.......> depart_title MAP<STRING,ARRAY<STRING>>
.......> )
.......> COMMENT 'This is an internal table'
.......> ROW FORMAT DELIMITED
.......> FIELDS TERMINATED BY '|'
.......> COLLECTION ITEMS TERMINATED BY ','
.......> MAP KEYS TERMINATED BY ':'
.......> STORED AS TEXTFILE;
No rows affected (0.149 seconds)
jdbc:hive2://> LOAD DATA LOCAL INPATH '/home/hadoop/employee.txt'
.......> OVERWRITE INTO TABLE employee_internal;

Create the external table and load the data:

jdbc:hive2://> CREATE EXTERNAL TABLE employee_external
.......> (
.......> name string,
.......> work_place ARRAY<string>,
.......> sex_age STRUCT<sex:string,age:int>,
.......> skills_score MAP<string,int>,
.......> depart_title MAP<STRING,ARRAY<STRING>>
.......> )
.......> COMMENT 'This is an external table'
.......> ROW FORMAT DELIMITED
.......> FIELDS TERMINATED BY '|'
.......> COLLECTION ITEMS TERMINATED BY ','
.......> MAP KEYS TERMINATED BY ':'
.......> STORED AS TEXTFILE
.......> LOCATION '/user/dayongd/employee';
No rows affected (1.332 seconds)
jdbc:hive2://> LOAD DATA LOCAL INPATH '/home/hadoop/employee.txt'
.......> OVERWRITE INTO TABLE employee_external;

Note

CREATE TABLE

Hive tables do not yet support constraints in the way relational database tables do.

If the folder named in the LOCATION property does not exist, Hive creates that folder. If there is another folder inside the folder specified in the LOCATION property, Hive will NOT report errors when creating the table, but will report an error when querying the table.

A temporary table, which is automatically deleted at the end of the Hive session, is supported as of Hive 0.14.0 by HIVE-7090 (https://issues.apache.org/jira/browse/HIVE-7090) through the CREATE TEMPORARY TABLE statement.

The STORED AS property is set to TEXTFILE by default. Other file format values, such as SEQUENCEFILE, RCFILE, ORC, AVRO (since Hive 0.14.0), and PARQUET (since Hive 0.13.0), can also be specified.
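Two of the points in this note can be sketched as follows. The table names tmp_employee and employee_orc are hypothetical, and the temporary table requires Hive 0.14.0 or later:

```sql
-- Session-scoped copy of two columns; the table disappears when the session ends.
CREATE TEMPORARY TABLE tmp_employee AS
SELECT name, work_place FROM employee;

-- A table stored as ORC; STORED AS overrides the TEXTFILE default.
CREATE TABLE IF NOT EXISTS employee_orc
(
  name       string,
  work_place ARRAY<string>
)
STORED AS ORC;
```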

Create the table as select (CTAS):

jdbc:hive2://> CREATE TABLE ctas_employee
.......> AS SELECT * FROM employee_external;
No rows affected (1.562 seconds)

Note

CTAS

CTAS copies the data as well as the table definition. The table created by CTAS is atomic; this means that other users do not see the table until all the query results are populated. CTAS has the following restrictions:

The table created cannot be a partitioned table
The table created cannot be an external table
The table created cannot be a list bucketing table

A CTAS statement triggers a map job to populate the data, even though SELECT * by itself does not trigger any MapReduce job.

CTAS with Common Table Expression (CTE) can be created as follows:

jdbc:hive2://> CREATE TABLE cte_employee AS
.......> WITH r1 AS
.......> (SELECT name FROM r2
.......> WHERE name = 'Michael'),
.......> r2 AS
.......> (SELECT name FROM employee
.......> WHERE sex_age.sex = 'Male'),
.......> r3 AS
.......> (SELECT name FROM employee
.......> WHERE sex_age.sex = 'Female')
.......> SELECT * FROM r1 UNION ALL SELECT * FROM r3;
No rows affected (61.852 seconds)
jdbc:hive2://> SELECT * FROM cte_employee;
+-------------------+
| cte_employee.name |
+-------------------+
| Michael           |
| Shelley           |
| Lucy              |
+-------------------+
3 rows selected (0.091 seconds)

Note

CTE

CTE is available since Hive 0.13.0. It is a temporary result set derived from a simple SELECT query specified in a WITH clause, followed by a SELECT or INSERT keyword to operate on this result set. The CTE is defined only within the execution scope of a single statement. One or more CTEs can be used in a nested or chained way with Hive keywords, such as the SELECT, INSERT, CREATE TABLE AS SELECT, or CREATE VIEW AS SELECT statements.
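Besides CTAS, the same WITH clause can feed a plain SELECT or a view. The following is a minimal sketch against the employee table, assuming Hive 0.13.0 or later; the names male_employee and male_employee_view are hypothetical:

```sql
-- A CTE feeding a plain SELECT.
WITH male_employee AS
  (SELECT name FROM employee WHERE sex_age.sex = 'Male')
SELECT name FROM male_employee;

-- The same CTE used inside CREATE VIEW AS SELECT.
CREATE VIEW male_employee_view AS
WITH male_employee AS
  (SELECT name FROM employee WHERE sex_age.sex = 'Male')
SELECT name FROM male_employee;
```

In both statements the CTE exists only for the duration of that one statement, so it has to be repeated in each.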

Empty tables can be created in two ways, as follows:

1. Use CTAS as shown here:

jdbc:hive2://> CREATE TABLE empty_ctas_employee AS
.......> SELECT * FROM employee_internal WHERE 1=2;
No rows affected (213.356 seconds)

2. Use LIKE as shown here:

jdbc:hive2://> CREATE TABLE empty_like_employee
.......> LIKE employee_internal;
No rows affected (0.115 seconds)

Check the row counts for both tables:

jdbc:hive2://> SELECT COUNT(*) AS row_cnt
.......> FROM empty_ctas_employee;
+---------+
| row_cnt |
+---------+
| 0       |
+---------+
1 row selected (51.228 seconds)
jdbc:hive2://> SELECT COUNT(*) AS row_cnt
.......> FROM empty_like_employee;
+---------+
| row_cnt |
+---------+
| 0       |
+---------+
1 row selected (41.628 seconds)

Note

The LIKE approach is faster because it does not trigger a MapReduce job; it only duplicates metadata.

The DROP TABLE command removes the metadata completely and, if Trash is configured, moves the data to the Trash (the .Trash/Current directory):

jdbc:hive2://> DROP TABLE IF EXISTS empty_ctas_employee;
No rows affected (0.283 seconds)
jdbc:hive2://> DROP TABLE IF EXISTS empty_like_employee;
No rows affected (0.202 seconds)

The TRUNCATE TABLE command removes all the rows from a table, which must be an internal table:

jdbc:hive2://> SELECT * FROM cte_employee;
+-------------------+
| cte_employee.name |
+-------------------+
| Michael           |
| Shelley           |
| Lucy              |
+-------------------+
3 rows selected (0.158 seconds)
jdbc:hive2://> TRUNCATE TABLE cte_employee;
No rows affected (0.093 seconds)
--Table is empty after truncate
jdbc:hive2://> SELECT * FROM cte_employee;
+-------------------+
| cte_employee.name |
+-------------------+
+-------------------+
No rows selected (0.059 seconds)

Use the ALTER TABLE statement to rename the table:

jdbc:hive2://> !table
+-------------+-------------------+------------+---------------------------+
| TABLE_SCHEM | TABLE_NAME        | TABLE_TYPE | REMARKS                   |
+-------------+-------------------+------------+---------------------------+
| default     | employee          | TABLE      | NULL                      |
| default     | employee_internal | TABLE      | This is an internal table |
| default     | employee_external | TABLE      | This is an external table |
| default     | ctas_employee     | TABLE      | NULL                      |
| default     | cte_employee      | TABLE      | NULL                      |
+-------------+-------------------+------------+---------------------------+
jdbc:hive2://> ALTER TABLE cte_employee RENAME TO c_employee;
No rows affected (0.237 seconds)

Alter the table's properties, such as comments:

jdbc:hive2://> ALTER TABLE c_employee
.......> SET TBLPROPERTIES ('comment'='New name, comments');
No rows affected (0.239 seconds)
jdbc:hive2://> !table
+-------------+-------------------+------------+---------------------------+
| TABLE_SCHEM | TABLE_NAME        | TABLE_TYPE | REMARKS                   |
+-------------+-------------------+------------+---------------------------+
| default     | employee          | TABLE      | NULL                      |
| default     | employee_internal | TABLE      | This is an internal table |
| default     | employee_external | TABLE      | This is an external table |
| default     | ctas_employee     | TABLE      | NULL                      |
| default     | c_employee        | TABLE      | New name, comments        |
+-------------+-------------------+------------+---------------------------+

Alter the table's delimiter through SERDEPROPERTIES:

jdbc:hive2://> ALTER TABLE employee_internal SET
.......> SERDEPROPERTIES ('field.delim' = '$');
No rows affected (0.148 seconds)

Alter the table's file format:

jdbc:hive2://> ALTER TABLE c_employee SET FILEFORMAT RCFILE;
No rows affected (0.235 seconds)

Alter the table's location, which must be a full HDFS URI:

jdbc:hive2://> ALTER TABLE c_employee
.......> SET LOCATION
.......> 'hdfs://localhost:8020/user/dayongd/employee';
No rows affected (0.169 seconds)

Alter the table's protection by enabling or disabling NO_DROP, which prevents a table from being dropped, or OFFLINE, which prevents data (not metadata) in a table from being queried:

jdbc:hive2://> ALTER TABLE c_employee ENABLE NO_DROP;
jdbc:hive2://> ALTER TABLE c_employee DISABLE NO_DROP;
jdbc:hive2://> ALTER TABLE c_employee ENABLE OFFLINE;
jdbc:hive2://> ALTER TABLE c_employee DISABLE OFFLINE;

Alter the table's concatenation to merge small files into larger files:

--Convert to a supported file format
jdbc:hive2://> ALTER TABLE c_employee SET FILEFORMAT ORC;
No rows affected (0.160 seconds)
--Concatenate files
jdbc:hive2://> ALTER TABLE c_employee CONCATENATE;
No rows affected (0.165 seconds)
--Convert back to the regular file format
jdbc:hive2://> ALTER TABLE c_employee SET FILEFORMAT TEXTFILE;
No rows affected (0.143 seconds)

Note

CONCATENATE

In Hive release 0.8.0, RCFile support was added for fast block-level merging of small RCFiles using the CONCATENATE command. In Hive release 0.14.0, ORC support was added for fast stripe-level merging of small ORC files using the CONCATENATE command. Other file formats are not supported yet. RCFiles are merged at the block level and ORC files at the stripe level, thereby avoiding the overhead of decompressing and decoding the data. MapReduce is triggered when performing concatenation.

Alter the column's name, data type, and position with CHANGE:

--Check the columns before changes
jdbc:hive2://> DESC employee_internal;
+---------------+----------------------------+---------+
| col_name      | data_type                  | comment |
+---------------+----------------------------+---------+
| name          | string                     |         |
| work_place    | array<string>              |         |
| sex_age       | struct<sex:string,age:int> |         |
| skills_score  | map<string,int>            |         |
| depart_title  | map<string,array<string>>  |         |
+---------------+----------------------------+---------+
5 rows selected (0.119 seconds)
--Change column name and order (the type stays string)
jdbc:hive2://> ALTER TABLE employee_internal
.......> CHANGE name employee_name string AFTER sex_age;
No rows affected (0.23 seconds)
--Verify the changes
jdbc:hive2://> DESC employee_internal;
+---------------+----------------------------+---------+
| col_name      | data_type                  | comment |
+---------------+----------------------------+---------+
| work_place    | array<string>              |         |
| sex_age       | struct<sex:string,age:int> |         |
| employee_name | string                     |         |
| skills_score  | map<string,int>            |         |
| depart_title  | map<string,array<string>>  |         |
+---------------+----------------------------+---------+
5 rows selected (0.214 seconds)

Alter the column's name and order again, moving it back to the first position:

jdbc:hive2://> ALTER TABLE employee_internal
.......> CHANGE employee_name name string FIRST;
No rows affected (0.238 seconds)
--Verify the changes
jdbc:hive2://> DESC employee_internal;
+---------------+----------------------------+---------+
| col_name      | data_type                  | comment |
+---------------+----------------------------+---------+
| name          | string                     |         |
| work_place    | array<string>              |         |
| sex_age       | struct<sex:string,age:int> |         |
| skills_score  | map<string,int>            |         |
| depart_title  | map<string,array<string>>  |         |
+---------------+----------------------------+---------+
5 rows selected (0.119 seconds)

Add/replace columns:

--Add columns to the table
jdbc:hive2://> ALTER TABLE c_employee ADD COLUMNS (work string);
No rows affected (0.184 seconds)
--Verify the added columns
jdbc:hive2://> DESC c_employee;
+----------+-----------+---------+
| col_name | data_type | comment |
+----------+-----------+---------+
| name     | string    |         |
| work     | string    |         |
+----------+-----------+---------+
2 rows selected (0.115 seconds)
--Replace all columns
jdbc:hive2://> ALTER TABLE c_employee
.......> REPLACE COLUMNS (name string);
No rows affected (0.132 seconds)
--Verify that all columns were replaced
jdbc:hive2://> DESC c_employee;
+----------+-----------+---------+
| col_name | data_type | comment |
+----------+-----------+---------+
| name     | string    |         |
+----------+-----------+---------+
1 row selected (0.129 seconds)

Note

The ALTER command only modifies Hive's metadata, NOT the data. Users should manually make sure the actual data conforms to the metadata definition.


Hive partitions

By default, a simple query in Hive scans the whole Hive table. This slows down performance when querying a large table. The issue can be resolved by creating Hive partitions, which are very similar to partitions in an RDBMS. In Hive, each partition corresponds to a value (or combination of values) of the predefined partition column(s) and is stored as a subdirectory in the table's directory in HDFS. When the table gets queried, only the required partitions (directories) of data in the table are read, so the I/O and time of the query are greatly reduced. It is very easy to implement Hive partitions when the table is created and to check the partitions created, as follows:

--Create partitions when creating tables
jdbc:hive2://> CREATE TABLE employee_partitioned
.......> (
.......> name string,
.......> work_place ARRAY<string>,
.......> sex_age STRUCT<sex:string,age:int>,
.......> skills_score MAP<string,int>,
.......> depart_title MAP<STRING,ARRAY<STRING>>
.......> )
.......> PARTITIONED BY (Year INT, Month INT)
.......> ROW FORMAT DELIMITED
.......> FIELDS TERMINATED BY '|'
.......> COLLECTION ITEMS TERMINATED BY ','
.......> MAP KEYS TERMINATED BY ':';
No rows affected (0.293 seconds)
--Show partitions
jdbc:hive2://> SHOW PARTITIONS employee_partitioned;
+-----------+
| partition |
+-----------+
+-----------+
No rows selected (0.177 seconds)

From the preceding result, we can see that no partitions are created automatically. We have to use ALTER TABLE ADD PARTITION to add partitions to a table. The ADD PARTITION command changes the table's metadata, but does not load data. If the data does not exist in the partition's location, queries will not return any results. To drop a partition, including both data and metadata, use the ALTER TABLE DROP PARTITION statement. Both statements are used as follows:

--Add multiple partitions
jdbc:hive2://> ALTER TABLE employee_partitioned ADD
.......> PARTITION (year=2014, month=11)
.......> PARTITION (year=2014, month=12);
No rows affected (0.248 seconds)
jdbc:hive2://> SHOW PARTITIONS employee_partitioned;
+--------------------+
| partition          |
+--------------------+
| year=2014/month=11 |
| year=2014/month=12 |
+--------------------+
2 rows selected (0.108 seconds)
--Drop the partition
jdbc:hive2://> ALTER TABLE employee_partitioned
.......> DROP IF EXISTS PARTITION (year=2014, month=11);
jdbc:hive2://> SHOW PARTITIONS employee_partitioned;
+--------------------+
| partition          |
+--------------------+
| year=2014/month=12 |
+--------------------+
1 row selected (0.107 seconds)

To avoid manually adding partitions, dynamic partition insert (or multipartition insert) is designed to dynamically determine which partitions should be created and populated while scanning the input table. This is introduced in more detail in Chapter 5, Data Manipulation.

To load or overwrite data in a partition, we can use the LOAD or INSERT OVERWRITE statements. The statement only overwrites the data in the specified partitions. Although partition columns are subdirectory names, we can query them in the SELECT clause or specify them in the WHERE clause to narrow down the result set. The following steps show how to load data into the partitioned table:

Load data to the partition:

jdbc:hive2://> LOAD DATA LOCAL INPATH
.......> '/home/dayongd/Downloads/employee.txt'
.......> OVERWRITE INTO TABLE employee_partitioned
.......> PARTITION (year=2014, month=12);
No rows affected (0.96 seconds)

Verify the data that is loaded:

jdbc:hive2://> SELECT name, year, month FROM employee_partitioned;
+---------+------+-------+
| name    | year | month |
+---------+------+-------+
| Michael | 2014 | 12    |
| Will    | 2014 | 12    |
| Shelley | 2014 | 12    |
| Lucy    | 2014 | 12    |
+---------+------+-------+
4 rows selected (37.451 seconds)
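The INSERT OVERWRITE route and the partition filter mentioned earlier can be sketched as follows. This assumes the employee and employee_partitioned tables from the preceding examples; the choice of the year=2014/month=11 partition as the target is only illustrative:

```sql
-- Overwrite a single partition from a query.
-- Only the year=2014/month=11 partition is replaced; other partitions are untouched.
INSERT OVERWRITE TABLE employee_partitioned
PARTITION (year = 2014, month = 11)
SELECT name, work_place, sex_age, skills_score, depart_title
FROM employee;

-- Filter on the partition columns so that only the matching subdirectory is read.
SELECT name, year, month
FROM employee_partitioned
WHERE year = 2014 AND month = 12;
```

Note that the SELECT list of the insert contains only the regular columns, because the partition values are fixed in the PARTITION clause.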

The ALTER TABLE/PARTITION statements for file format, location, protections, and concatenation have the same syntax as the ALTER TABLE statements and are shown here:

ALTER TABLE table_name PARTITION partition_spec SET FILEFORMAT file_format;
ALTER TABLE table_name PARTITION partition_spec SET LOCATION 'full URI';
ALTER TABLE table_name PARTITION partition_spec ENABLE NO_DROP;
ALTER TABLE table_name PARTITION partition_spec ENABLE OFFLINE;
ALTER TABLE table_name PARTITION partition_spec DISABLE NO_DROP;
ALTER TABLE table_name PARTITION partition_spec DISABLE OFFLINE;
ALTER TABLE table_name PARTITION partition_spec CONCATENATE;


Hive buckets

Besides partitions, buckets are another technique to cluster datasets into more manageable parts to optimize query performance. Unlike a partition, a bucket corresponds to a segment of files in HDFS. For example, the employee_partitioned table from the previous section uses year and month as the top-level partitions. If there is a further request to use employee_id as a third level of partition, it leads to many deep and small partitions and directories. Instead, we can bucket the employee_partitioned table using employee_id as the bucket column. The value of this column is hashed into a user-defined number of buckets. Records with the same employee_id are always stored in the same bucket (segment of files). By using buckets, Hive can easily and efficiently do sampling (see Chapter 6, Data Aggregation and Sampling) and map side joins (see Chapter 4, Data Selection and Scope). An example of creating a bucket table is as follows:

--Prepare another dataset and table for bucket table
jdbc:hive2://> CREATE TABLE employee_id
. . . . . . .> (
. . . . . . .> name string,
. . . . . . .> employee_id int,
. . . . . . .> work_place ARRAY<string>,
. . . . . . .> sex_age STRUCT<sex:string,age:int>,
. . . . . . .> skills_score MAP<string,int>,
. . . . . . .> depart_title MAP<string,ARRAY<string>>
. . . . . . .> )
. . . . . . .> ROW FORMAT DELIMITED
. . . . . . .> FIELDS TERMINATED BY '|'
. . . . . . .> COLLECTION ITEMS TERMINATED BY ','
. . . . . . .> MAP KEYS TERMINATED BY ':';
No rows affected (0.101 seconds)

jdbc:hive2://> LOAD DATA LOCAL INPATH
. . . . . . .> '/home/dayongd/Downloads/employee_id.txt'
. . . . . . .> OVERWRITE INTO TABLE employee_id;
No rows affected (0.112 seconds)

--Create the buckets table
jdbc:hive2://> CREATE TABLE employee_id_buckets
. . . . . . .> (
. . . . . . .> name string,
. . . . . . .> employee_id int,
. . . . . . .> work_place ARRAY<string>,
. . . . . . .> sex_age STRUCT<sex:string,age:int>,
. . . . . . .> skills_score MAP<string,int>,
. . . . . . .> depart_title MAP<string,ARRAY<string>>
. . . . . . .> )
. . . . . . .> CLUSTERED BY (employee_id) INTO 2 BUCKETS
. . . . . . .> ROW FORMAT DELIMITED
. . . . . . .> FIELDS TERMINATED BY '|'
. . . . . . .> COLLECTION ITEMS TERMINATED BY ','
. . . . . . .> MAP KEYS TERMINATED BY ':';
No rows affected (0.104 seconds)


Note

Bucket numbers

To choose a proper number of buckets, we should avoid having too much or too little data in each bucket. A good choice is somewhere near two blocks of data per bucket. For example, we can plan for 512 MB of data in each bucket if the Hadoop block size is 256 MB. If possible, use 2^N as the number of buckets.

Bucketing depends closely on how the underlying data is loaded. To properly load data into a bucket table, we need to either set the maximum number of reducers to the number of buckets specified in the table creation (for example, 2) or enable enforced bucketing as follows:

jdbc:hive2://> set mapred.reduce.tasks = 2;
No rows affected (0.026 seconds)

jdbc:hive2://> set hive.enforce.bucketing = true;
No rows affected (0.002 seconds)

To populate the bucket table, we cannot use the LOAD keyword as we did for regular tables, since LOAD does not verify the data against the metadata. Instead, INSERT should be used to populate the bucket table as follows:

jdbc:hive2://> INSERT OVERWRITE TABLE employee_id_buckets
. . . . . . .> SELECT * FROM employee_id;
No rows affected (75.468 seconds)

--Verify the buckets in the HDFS
-bash-4.1$ hdfs dfs -ls /user/hive/warehouse/employee_id_buckets
Found 2 items
-rwxrwxrwx 1 hive hive 900 2014-11-02 10:54 /user/hive/warehouse/employee_id_buckets/000000_0
-rwxrwxrwx 1 hive hive 582 2014-11-02 10:54 /user/hive/warehouse/employee_id_buckets/000001_0
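To make the hashing step more concrete, the following sketch previews which bucket each row would fall into and then reads a single bucket back. It assumes the classic bucketing scheme of the Hive versions covered here, where the hash of an INT value is the value itself; newer Hive releases may use a different hash function, so treat the computed index as illustrative only.

```sql
-- Preview the bucket index for each row: hash the bucket column and
-- take it modulo the number of buckets (2, as in the CREATE TABLE above).
-- hash() and pmod() are Hive built-in functions.
SELECT name, employee_id, pmod(hash(employee_id), 2) AS bucket_index
FROM employee_id;

-- Read only the first of the two buckets instead of scanning the whole table.
SELECT name, employee_id
FROM employee_id_buckets TABLESAMPLE (BUCKET 1 OUT OF 2 ON employee_id);
```

Because rows with the same employee_id always hash to the same index, the TABLESAMPLE query can be answered from one bucket file when the table was populated with bucketing enforced.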


Hive views

In Hive, views are logical data structures that can be used to simplify queries, either by hiding complexities such as joins, subqueries, and filters or by flattening the data. Unlike views in some RDBMS, Hive views do not store data and are not materialized. Once a Hive view is created, its schema is frozen immediately. Subsequent changes to the underlying tables (for example, adding a column) will not be reflected in the view's schema. If an underlying table is dropped or changed, subsequent attempts to query the invalid view will fail. A view is created as follows:

jdbc:hive2://> CREATE VIEW employee_skills
. . . . . . .> AS
. . . . . . .> SELECT name, skills_score['DB'] AS DB,
. . . . . . .> skills_score['Perl'] AS Perl,
. . . . . . .> skills_score['Python'] AS Python,
. . . . . . .> skills_score['Sales'] as Sales,
. . . . . . .> skills_score['HR'] as HR
. . . . . . .> FROM employee;
No rows affected (0.253 seconds)
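Once defined, the view can be queried like a table. The following is a small sketch that uses the employee_skills view created above; the filter itself is only an illustration and is not part of the original example.

```sql
-- Query the view exactly as if it were a table.
-- The map lookups defined in the view are evaluated at query time.
SELECT name, DB, Python
FROM employee_skills
WHERE DB IS NOT NULL OR Python IS NOT NULL;

-- Inspect the stored definition and metadata of the view.
SHOW CREATE TABLE employee_skills;
DESC FORMATTED employee_skills;
```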

When a view is created, no MapReduce job is triggered at all, since this is only a metadata change. However, a proper MapReduce job will be triggered when querying the view. Use SHOW CREATE TABLE or DESC FORMATTED to display the CREATE VIEW statement that created a view. The following are other Hive view DDLs:

Alter the view's properties:

jdbc:hive2://> ALTER VIEW employee_skills
. . . . . . .> SET TBLPROPERTIES ('comment' = 'This is a view');
No rows affected (0.19 seconds)

Redefine views:

jdbc:hive2://> ALTER VIEW employee_skills AS
. . . . . . .> SELECT * from employee;
No rows affected (0.17 seconds)

Drop views:

jdbc:hive2://> DROP VIEW employee_skills;
No rows affected (0.156 seconds)


Summary

After going through this chapter, we are able to define and use various data types in Hive. We should know how to create, alter, and drop tables, partitions, and views in Hive and how to use external tables, internal tables, partitions, buckets, and views in Hive.

In the next chapter, we will dive into the details of querying data with Hive.


Chapter 4. Data Selection and Scope

This chapter is about how to discover the data by querying the data, linking the data, and limiting the data ranges or scopes. The chapter mainly covers the syntax and usage of Hive SELECT, WHERE, LIMIT, JOIN, and UNION ALL to operate on datasets.

In this chapter we will cover the following topics:

The SELECT statement
The common JOIN statement
The special JOIN (MAPJOIN) statement
The set operation statement (UNION ALL)


The SELECT statement

The most common use case for Hive is to query data in Hadoop. To achieve this, we need to write and execute the SELECT statement in Hive. A SELECT statement typically projects the rows of the target table that meet the conditions specified in the WHERE clause and returns them as the result set. The SELECT statement is quite often used with the FROM, DISTINCT, WHERE, and LIMIT keywords. We will introduce them through examples as follows.

The SELECT * statement means that all the columns in the table are selected. By default, all rows are returned, including duplicated rows. If the DISTINCT keyword is used, only unique rows from the table are selected and returned. The LIMIT keyword limits the number of rows returned; which rows are returned is arbitrary. In addition, SELECT * scans the whole table/file without triggering MapReduce jobs, so it runs faster than SELECT <column_name>. Since Hive 0.10.0, simple SELECT statements, such as SELECT <column_name> FROM <table_name> LIMIT n, can also avoid triggering the MapReduce job if the Hive fetch task conversion is enabled by setting hive.fetch.task.conversion = more.

The following tasks can be done:

Query all or specific columns in the table:

jdbc:hive2://> SELECT * FROM employee;
+---------+--------------------+-------------+------------------+-------------------------------+
| name    | work_place         | sex_age     | skills_score     | depart_title                  |
+---------+--------------------+-------------+------------------+-------------------------------+
| Michael | [Montreal,Toronto] | [Male,30]   | {DB=80}          | {Product=[Developer,Lead]}    |
| Will    | [Montreal]         | [Male,35]   | {Perl=85}        | {Test=[Lead],Product=[Lead]}  |
| Shelley | [New York]         | [Female,27] | {Python=80}      | {Test=[Lead],COE=[Architect]} |
| Lucy    | [Vancouver]        | [Female,57] | {Sales=89,HR=94} | {Sales=[Lead]}                |
+---------+--------------------+-------------+------------------+-------------------------------+
4 rows selected (0.677 seconds)

jdbc:hive2://> SELECT name FROM employee;
+----------+
| name     |
+----------+
| Michael  |
| Will     |
| Shelley  |
| Lucy     |
+----------+
4 rows selected (162.452 seconds)


Select unique values of the specified column:

jdbc:hive2://> SELECT DISTINCT name FROM employee LIMIT 2;
+----------+
| name     |
+----------+
| Lucy     |
| Michael  |
+----------+
2 rows selected (71.125 seconds)

Enable fetch and verify the performance improvement:

jdbc:hive2://> SET hive.fetch.task.conversion=more;
No rows affected (0.002 seconds)

jdbc:hive2://> SELECT name FROM employee;
+----------+
| name     |
+----------+
| Michael  |
| Will     |
| Shelley  |
| Lucy     |
+----------+
4 rows selected (0.242 seconds)

Besides LIMIT, WHERE is another generic condition clause to limit the returned result set. The WHERE condition can be any Boolean expression or user-defined function applied to table or partition columns:

jdbc:hive2://> SELECT name, work_place FROM employee
. . . . . . .> WHERE name = 'Michael';
+----------+-------------------------+
| name     | work_place              |
+----------+-------------------------+
| Michael  | ["Montreal","Toronto"]  |
+----------+-------------------------+
1 row selected (38.107 seconds)
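A WHERE condition can also combine several Boolean expressions and built-in functions. The following sketch filters on a STRUCT field and an ARRAY column of the employee table; the specific thresholds are chosen only for illustration.

```sql
-- Combine a comparison on a STRUCT field with a built-in function
-- that tests membership in an ARRAY column.
SELECT name, work_place, sex_age.age AS age
FROM employee
WHERE sex_age.age > 30
  AND array_contains(work_place, 'Montreal');
```

With the four sample rows shown earlier, only Will is both older than 30 and located in Montreal, so a single row should be returned.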

Multiple SELECT statements can work together to build a complex query using nested queries or subqueries, as well as JOIN and UNION. The following are a few examples of nested queries/subqueries. Subqueries can be written with the WITH keyword (also referred to as a CTE, available since Hive 0.13.0) or placed after the FROM or WHERE keyword. When using subqueries, an alias should be given for the subquery (see t1 in the following example). Otherwise, Hive will report exceptions. The different uses of SELECT statements are as follows:

Nested SELECT using CTE can be implemented as follows:

jdbc:hive2://> WITH t1 AS (
. . . . . . .> SELECT * FROM employee
. . . . . . .> WHERE sex_age.sex = 'Male')
. . . . . . .> SELECT name, sex_age.sex AS sex FROM t1;
+----------+-------+
| name     | sex   |
+----------+-------+
| Michael  | Male  |
| Will     | Male  |
+----------+-------+
2 rows selected (38.706 seconds)

Nested SELECT after the FROM statement can be implemented as follows:

jdbc:hive2://> SELECT name, sex_age.sex AS sex
. . . . . . .> FROM
. . . . . . .> (
. . . . . . .> SELECT * FROM employee
. . . . . . .> WHERE sex_age.sex = 'Male'
. . . . . . .> ) t1;
+----------+-------+
| name     | sex   |
+----------+-------+
| Michael  | Male  |
| Will     | Male  |
+----------+-------+
2 rows selected (48.198 seconds)

The Hive subquery in the WHERE clause can be used with IN, NOT IN, EXISTS, or NOT EXISTS as follows. If the alias (see a for the employee table in the following example) is not specified before the columns (name) in the WHERE condition, Hive will report the error Correlating expression cannot contain unqualified column references. This is a limitation of the Hive subquery. A subquery that uses EXISTS or NOT EXISTS must refer to both the inner and the outer expression. This is similar to a JOIN between tables, which is introduced later. Such references to the outer query are not supported by the IN and NOT IN clauses.

jdbc:hive2://> SELECT name, sex_age.sex AS sex
. . . . . . .> FROM employee a
. . . . . . .> WHERE a.name IN
. . . . . . .> (SELECT name FROM employee
. . . . . . .> WHERE sex_age.sex = 'Male'
. . . . . . .> );
+----------+-------+
| name     | sex   |
+----------+-------+
| Michael  | Male  |
| Will     | Male  |
+----------+-------+
2 rows selected (54.644 seconds)

jdbc:hive2://> SELECT name, sex_age.sex AS sex
. . . . . . .> FROM employee a
. . . . . . .> WHERE EXISTS
. . . . . . .> (SELECT * FROM employee b
. . . . . . .> WHERE a.sex_age.sex = b.sex_age.sex
. . . . . . .> AND b.sex_age.sex = 'Male'
. . . . . . .> );
+----------+-------+
| name     | sex   |
+----------+-------+
| Michael  | Male  |
| Will     | Male  |
+----------+-------+
2 rows selected (69.48 seconds)

There are additional restrictions for subqueries used in WHERE clauses:

Subqueries can only appear on the right-hand side of the WHERE clauses
Nested subqueries are not allowed
The IN and NOT IN statement supports only one column
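The negated forms follow the same rules. As a sketch, the two queries below look for employees that have no record in the employee_hr table (which is created in the next section of this chapter); each respects the restrictions listed above: one column in the NOT IN subquery, and a correlated, alias-qualified condition in the NOT EXISTS subquery.

```sql
-- NOT IN: the subquery is uncorrelated and returns exactly one column.
SELECT a.name
FROM employee a
WHERE a.name NOT IN
  (SELECT name FROM employee_hr);

-- NOT EXISTS: the subquery must reference the outer query,
-- and every column is qualified with its table alias.
SELECT a.name
FROM employee a
WHERE NOT EXISTS
  (SELECT * FROM employee_hr b
   WHERE a.name = b.name);
```

Note that NOT IN and NOT EXISTS can differ when the subquery column contains NULL values, so the two forms are interchangeable only when that column has no NULLs.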


The INNER JOIN statement

Hive JOIN is used to combine rows from two or more tables. Hive supports the common JOIN operations found in an RDBMS, for example, JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, FULL OUTER JOIN, and CROSS JOIN. However, Hive only supports equality joins (equal JOIN), not inequality joins (unequal JOIN), because an unequal JOIN is difficult to convert to MapReduce jobs.

The INNER JOIN in Hive uses the JOIN keyword, which returns rows meeting the JOIN conditions from both the left and right tables. Since Hive 0.13.0, the JOIN keyword can also be omitted by listing comma-separated table names. The following examples show various inner JOIN statements in Hive:

Prepare another table to join and load data:

jdbc:hive2://> CREATE TABLE IF NOT EXISTS employee_hr
. . . . . . .> (
. . . . . . .> name string,
. . . . . . .> employee_id int,
. . . . . . .> sin_number string,
. . . . . . .> start_date date
. . . . . . .> )
. . . . . . .> ROW FORMAT DELIMITED
. . . . . . .> FIELDS TERMINATED BY '|'
. . . . . . .> STORED AS TEXTFILE;
No rows affected (1.732 seconds)

jdbc:hive2://> LOAD DATA LOCAL INPATH
. . . . . . .> '/home/Dayongd/employee_hr.txt'
. . . . . . .> OVERWRITE INTO TABLE employee_hr;
No rows affected (0.635 seconds)

Perform inner JOIN between two tables with equal JOIN conditions:

jdbc:hive2://> SELECT emp.name, emph.sin_number
. . . . . . .> FROM employee emp
. . . . . . .> JOIN employee_hr emph ON emp.name = emph.name;
+-----------+------------------+
| emp.name  | emph.sin_number  |
+-----------+------------------+
| Michael   | 547-968-091      |
| Will      | 527-948-090      |
| Lucy      | 577-928-094      |
+-----------+------------------+
3 rows selected (71.083 seconds)

The JOIN operation can be performed among more tables (three tables in this case), as follows:

jdbc:hive2://> SELECT emp.name, empi.employee_id, emph.sin_number
. . . . . . .> FROM employee emp
. . . . . . .> JOIN employee_hr emph ON emp.name = emph.name
. . . . . . .> JOIN employee_id empi ON emp.name = empi.name;
+-----------+-------------------+------------------+
| emp.name  | empi.employee_id  | emph.sin_number  |
+-----------+-------------------+------------------+
| Michael   | 100               | 547-968-091      |
| Will      | 101               | 527-948-090      |
| Lucy      | 103               | 577-928-094      |
+-----------+-------------------+------------------+
3 rows selected (67.933 seconds)

Self-join is a special JOIN where one table joins itself. When doing such joins, a different alias should be given to each occurrence of the table to distinguish them:

jdbc:hive2://> SELECT emp.name
. . . . . . .> FROM employee emp
. . . . . . .> JOIN employee emp_b
. . . . . . .> ON emp.name = emp_b.name;
+-----------+
| emp.name  |
+-----------+
| Michael   |
| Will      |
| Shelley   |
| Lucy      |
+-----------+
4 rows selected (59.891 seconds)

Implicit join is a JOIN operation without using the JOIN keyword. It is supported since Hive 0.13.0:

jdbc:hive2://> SELECT emp.name, emph.sin_number
. . . . . . .> FROM employee emp, employee_hr emph
. . . . . . .> WHERE emp.name = emph.name;
+-----------+------------------+
| emp.name  | emph.sin_number  |
+-----------+------------------+
| Michael   | 547-968-091      |
| Will      | 527-948-090      |
| Lucy      | 577-928-094      |
+-----------+------------------+
3 rows selected (47.241 seconds)

A JOIN operation that uses different columns in its join conditions will create an additional MapReduce job:

jdbc:hive2://> SELECT emp.name, empi.employee_id, emph.sin_number
. . . . . . .> FROM employee emp
. . . . . . .> JOIN employee_hr emph ON emp.name = emph.name
. . . . . . .> JOIN employee_id empi ON emph.employee_id = empi.employee_id;
+-----------+-------------------+------------------+
| emp.name  | empi.employee_id  | emph.sin_number  |
+-----------+-------------------+------------------+
| Michael   | 100               | 547-968-091      |
| Will      | 101               | 527-948-090      |
| Lucy      | 103               | 577-928-094      |
+-----------+-------------------+------------------+
3 rows selected (49.785 seconds)


Note

If JOIN uses different columns in the join conditions, it will request additional job stages to complete the join. If the JOIN operation uses the same column in the join conditions, Hive will join on this condition using one stage.

When JOIN is performed between multiple tables, MapReduce jobs are created to process the data in HDFS. Each of the jobs is called a stage. Usually, it is suggested that JOIN statements put the big table at the end, both for better performance and to avoid Out Of Memory (OOM) exceptions, because the last table in the sequence is streamed through the reducers whereas the others are buffered in the reducers by default. Also, a hint, such as /*+STREAMTABLE(table_name)*/, can be specified to tell Hive which table is streamed, as follows:

jdbc:hive2://> SELECT /*+ STREAMTABLE(employee_hr) */
. . . . . . .> emp.name, empi.employee_id, emph.sin_number
. . . . . . .> FROM employee emp
. . . . . . .> JOIN employee_hr emph ON emp.name = emph.name
. . . . . . .> JOIN employee_id empi ON emph.employee_id = empi.employee_id;
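To see how many stages a join is planned into, the query can be prefixed with EXPLAIN. The following sketch inspects the three-table join used above; the exact plan text depends on the Hive version and on settings such as automatic map join conversion, so no output is shown here.

```sql
-- Print the execution plan instead of running the query.
-- The STAGE DEPENDENCIES section of the output lists the stages
-- and the order in which they run.
EXPLAIN
SELECT emp.name, empi.employee_id, emph.sin_number
FROM employee emp
JOIN employee_hr emph ON emp.name = emph.name
JOIN employee_id empi ON emph.employee_id = empi.employee_id;
```

Comparing this plan with the plan of the earlier query that joins both tables on emp.name is one way to check the effect of using the same column in all join conditions.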


The OUTER JOIN and CROSS JOIN statements

Besides INNER JOIN, Hive also supports regular OUTER JOIN and FULL JOIN. The logic of such JOIN is the same as in an RDBMS. The following table summarizes the differences between the common JOIN types:

| Common JOIN type | Logic | Rows returned (assume table_m has m rows and table_n has n rows) |
| table_m JOIN table_n | This returns all rows matched in both tables. | m ∩ n |
| table_m LEFT [OUTER] JOIN table_n | This returns all rows in the left table and matched rows in the right table. If there is no match in the right table, return null in the right table. | m |
| table_m RIGHT [OUTER] JOIN table_n | This returns all rows in the right table and matched rows in the left table. If there is no match in the left table, return null in the left table. | n |
| table_m FULL [OUTER] JOIN table_n | This returns all rows in both the tables and matched rows in both the tables. If there is no match in the left or right table, return null instead. | m + n - m ∩ n |
| table_m CROSS JOIN table_n | This returns all row combinations in both the tables to produce a Cartesian product. | m * n |

The following examples demonstrate OUTER JOIN:

jdbc:hive2://> SELECT emp.name, emph.sin_number
. . . . . . .> FROM employee emp
. . . . . . .> LEFT JOIN employee_hr emph ON emp.name = emph.name;
+-----------+------------------+
| emp.name  | emph.sin_number  |
+-----------+------------------+
| Michael   | 547-968-091      |
| Will      | 527-948-090      |
| Shelley   | NULL             |
| Lucy      | 577-928-094      |
+-----------+------------------+
4 rows selected (39.637 seconds)

jdbc:hive2://> SELECT emp.name, emph.sin_number
. . . . . . .> FROM employee emp
. . . . . . .> RIGHT JOIN employee_hr emph ON emp.name = emph.name;
+-----------+------------------+
| emp.name  | emph.sin_number  |
+-----------+------------------+
| Michael   | 547-968-091      |
| Will      | 527-948-090      |
| NULL      | 647-968-598      |
| Lucy      | 577-928-094      |
+-----------+------------------+
4 rows selected (34.485 seconds)

jdbc:hive2://> SELECT emp.name, emph.sin_number
. . . . . . .> FROM employee emp
. . . . . . .> FULL JOIN employee_hr emph ON emp.name = emph.name;
+-----------+------------------+
| emp.name  | emph.sin_number  |
+-----------+------------------+
| Lucy      | 577-928-094      |
| Michael   | 547-968-091      |
| Shelley   | NULL             |
| NULL      | 647-968-598      |
| Will      | 527-948-090      |
+-----------+------------------+
5 rows selected (64.251 seconds)

The CROSS JOIN statement, which is available since Hive 0.10.0, does not have a JOIN condition. A CROSS JOIN can also be written using JOIN without a condition or with an always-true condition, such as 1 = 1. The following three ways of writing CROSS JOIN produce the same result set:

jdbc:hive2://> SELECT emp.name, emph.sin_number
. . . . . . .> FROM employee emp
. . . . . . .> CROSS JOIN employee_hr emph;

jdbc:hive2://> SELECT emp.name, emph.sin_number
. . . . . . .> FROM employee emp
. . . . . . .> JOIN employee_hr emph;

jdbc:hive2://> SELECT emp.name, emph.sin_number
. . . . . . .> FROM employee emp
. . . . . . .> JOIN employee_hr emph ON 1=1;
+-----------+------------------+
| emp.name  | emph.sin_number  |
+-----------+------------------+
| Michael   | 547-968-091      |
| Michael   | 527-948-090      |
| Michael   | 647-968-598      |
| Michael   | 577-928-094      |
| Will      | 547-968-091      |
| Will      | 527-948-090      |
| Will      | 647-968-598      |
| Will      | 577-928-094      |
| Shelley   | 547-968-091      |
| Shelley   | 527-948-090      |
| Shelley   | 647-968-598      |
| Shelley   | 577-928-094      |
| Lucy      | 547-968-091      |
| Lucy      | 527-948-090      |
| Lucy      | 647-968-598      |
| Lucy      | 577-928-094      |
+-----------+------------------+
16 rows selected (34.924 seconds)

In addition, JOIN always happens before WHERE. If possible, put filter conditions in the JOIN condition rather than in the WHERE clause, so that rows are filtered during the JOIN instead of afterwards. What's more, JOIN is NOT commutative! It is always left associative, no matter whether it is a LEFT JOIN or a RIGHT JOIN.
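The placement of a filter matters most for outer joins. The following sketch contrasts the two placements on a LEFT JOIN between employee and employee_hr; the threshold on employee_id is a hypothetical value chosen only to illustrate the difference.

```sql
-- Filter in the ON clause: it is applied while matching rows,
-- so every employee row is still returned, with NULL in the
-- right-hand columns when no employee_hr row qualifies.
SELECT emp.name, emph.sin_number
FROM employee emp
LEFT JOIN employee_hr emph
  ON emp.name = emph.name AND emph.employee_id > 100;

-- Same filter in the WHERE clause: it is applied to the joined rows,
-- so rows whose right-hand side is NULL are removed as well.
SELECT emp.name, emph.sin_number
FROM employee emp
LEFT JOIN employee_hr emph
  ON emp.name = emph.name
WHERE emph.employee_id > 100;
```

For an inner JOIN both forms return the same rows, so the choice there is mainly about how early the filter can be applied.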

Although Hive does not support unequal JOIN explicitly, there are workarounds using CROSS JOIN and WHERE conditions mentioned in the following example:

jdbc:hive2://> SELECT emp.name, emph.sin_number
. . . . . . .> FROM employee emp
. . . . . . .> JOIN employee_hr emph ON emp.name <> emph.name;
Error: Error while compiling statement: FAILED: SemanticException [Error 10017]: Line 1:77 Both left and right aliases encountered in JOIN 'name' (state=42000,code=10017)

jdbc:hive2://> SELECT emp.name, emph.sin_number
. . . . . . .> FROM employee emp
. . . . . . .> CROSS JOIN employee_hr emph WHERE emp.name <> emph.name;
+-----------+------------------+
| emp.name  | emph.sin_number  |
+-----------+------------------+
| Michael   | 527-948-090      |
| Michael   | 647-968-598      |
| Michael   | 577-928-094      |
| Will      | 547-968-091      |
| Will      | 647-968-598      |
| Will      | 577-928-094      |
| Shelley   | 547-968-091      |
| Shelley   | 527-948-090      |
| Shelley   | 647-968-598      |
| Shelley   | 577-928-094      |
| Lucy      | 547-968-091      |
| Lucy      | 527-948-090      |
| Lucy      | 647-968-598      |
+-----------+------------------+
13 rows selected (35.016 seconds)


Special JOIN – MAPJOIN

The MAPJOIN statement performs the JOIN operation in the map phase only, without a reduce job. The MAPJOIN statement reads all the data from the small table into memory and broadcasts it to all maps. During the map phase, the JOIN operation is performed by comparing each row of data in the big table with the small tables against the join conditions. Because no reduce phase is needed, the JOIN performance is improved. When the hive.auto.convert.join setting is set to true, Hive automatically converts the JOIN to MAPJOIN at runtime if possible, instead of checking the map join hint. In addition, MAPJOIN can be used for unequal joins to improve performance, since both MAPJOIN and WHERE are performed in the map phase. The following is an example of MAPJOIN that is enabled by a query hint:

jdbc:hive2://> SELECT /*+ MAPJOIN(employee) */ emp.name, emph.sin_number
. . . . . . .> FROM employee emp
. . . . . . .> CROSS JOIN employee_hr emph WHERE emp.name <> emph.name;
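Instead of a hint, the automatic conversion can be switched on for the session. The following is a minimal sketch; hive.mapjoin.smalltable.filesize is the size threshold (in bytes) that decides whether a table counts as small, and the value shown is the commonly documented default, so check the configuration of your own installation before relying on it.

```sql
-- Let Hive decide at runtime whether a join can run as a map join.
SET hive.auto.convert.join=true;

-- Tables below this size in bytes are candidates for being loaded
-- into memory (assumed default value; adjust to your cluster).
SET hive.mapjoin.smalltable.filesize=25000000;

-- An ordinary join with no hint; Hive converts it to a map join
-- if one side is small enough.
SELECT emp.name, emph.sin_number
FROM employee emp
JOIN employee_hr emph ON emp.name = emph.name;
```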

The MAPJOIN operation does not support the following:

The use of MAPJOIN after UNION ALL, LATERAL VIEW, GROUP BY/JOIN/SORT BY/CLUSTER BY/DISTRIBUTE BY
The use of MAPJOIN before UNION, JOIN, and another MAPJOIN

The bucket map join is a special type of MAPJOIN that uses bucket columns (the columns specified by CLUSTERED BY in the CREATE TABLE statement) as the join condition. Instead of fetching the whole table as done by the regular map join, the bucket map join only fetches the required bucket data. To enable bucket map join, we need to set hive.optimize.bucketmapjoin = true and make sure that the number of buckets in one table is a multiple of the number of buckets in the other. If both joined tables are sorted and bucketed with the same number of buckets, a sort-merge join can be performed instead of caching all the small tables in memory. The following additional settings are needed to enable this behavior:

SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
SET hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

The LEFT SEMI JOIN statement is also a type of MAPJOIN. Before Hive supported IN/EXISTS, LEFT SEMI JOIN was used to implement such requests, as shown in the following example. The restriction of using LEFT SEMI JOIN is that the right-hand side table should only be referenced in the join condition, but not in the WHERE or SELECT clauses.

jdbc:hive2://> SELECT a.name
. . . . . . .> FROM employee a
. . . . . . .> WHERE EXISTS
. . . . . . .> (SELECT * FROM employee_id b
. . . . . . .> WHERE a.name = b.name);

jdbc:hive2://> SELECT a.name
. . . . . . .> FROM employee a
. . . . . . .> LEFT SEMI JOIN employee_id b
. . . . . . .> ON a.name = b.name;
+----------+
| a.name   |
+----------+
| Michael  |
| Will     |
| Shelley  |
| Lucy     |
+----------+
4 rows selected (35.027 seconds)


Set operation – UNION ALL

To combine result sets vertically, Hive only supports UNION ALL right now. The result set of UNION ALL keeps duplicates, if any. Before Hive 0.13.0, UNION ALL could only be used in a subquery. Since Hive 0.13.0, UNION ALL can also be used in top-level queries. The following are examples of the UNION ALL statements:

Check the name column in the employee_hr and employee tables:

jdbc:hive2://> SELECT name FROM employee_hr;
+----------+
| name     |
+----------+
| Michael  |
| Will     |
| Steven   |
| Lucy     |
+----------+
4 rows selected (0.116 seconds)

jdbc:hive2://> SELECT name FROM employee;
+----------+
| name     |
+----------+
| Michael  |
| Will     |
| Shelley  |
| Lucy     |
+----------+
4 rows selected (0.049 seconds)

Use UNION ALL on the name column from both tables, including duplications:

jdbc:hive2://> SELECT a.name
. . . . . . .> FROM employee a
. . . . . . .> UNION ALL
. . . . . . .> SELECT b.name
. . . . . . .> FROM employee_hr b;
+-----------+
| _u1.name  |
+-----------+
| Michael   |
| Will      |
| Shelley   |
| Lucy      |
| Michael   |
| Will      |
| Steven    |
| Lucy      |
+-----------+
8 rows selected (39.93 seconds)

For other set operations supported by RDBMS, such as UNION, INTERSECT, and MINUS, we can use SELECT with DISTINCT, JOIN, and WHERE conditions to implement them as follows:


Implement UNION between two tables without duplications:

jdbc:hive2://> SELECT DISTINCT name
. . . . . . .> FROM
. . . . . . .> (
. . . . . . .> SELECT a.name AS name
. . . . . . .> FROM employee a
. . . . . . .> UNION ALL
. . . . . . .> SELECT b.name AS name
. . . . . . .> FROM employee_hr b
. . . . . . .> ) union_set;
+----------+
| name     |
+----------+
| Lucy     |
| Michael  |
| Shelley  |
| Steven   |
| Will     |
+----------+
5 rows selected (100.366 seconds)

Note

The subquery alias (such as union_set in this example) must be given to avoid a Hive syntax error.

The employee table implements INTERSECT on employee_hr using JOIN:

jdbc:hive2://> SELECT a.name
. . . . . . .> FROM employee a
. . . . . . .> JOIN employee_hr b
. . . . . . .> ON a.name = b.name;
+----------+
| a.name   |
+----------+
| Michael  |
| Will     |
| Lucy     |
+----------+
3 rows selected (44.862 seconds)

The employee table implements MINUS on employee_hr using OUTER JOIN:

jdbc:hive2://> SELECT a.name
. . . . . . .> FROM employee a
. . . . . . .> LEFT JOIN employee_hr b
. . . . . . .> ON a.name = b.name
. . . . . . .> WHERE b.name IS NULL;
+----------+
| a.name   |
+----------+
| Shelley  |
+----------+
1 row selected (36.841 seconds)


Summary

In this chapter, you learned to use SELECT statements to discover the data you need. Then, we introduced Hive operations to link different datasets in vertical or horizontal directions using JOIN or UNION ALL. After going through this chapter, we should be able to use the SELECT statement with different WHERE conditions, LIMIT, DISTINCT, and complex subqueries. We should be able to understand and use different types of JOIN statements to link different datasets horizontally and UNION ALL to combine different datasets vertically.

In the next chapter, we will talk about the details of exchanging, ordering, and transforming data as well as transactions in Hive.


Chapter 5. Data Manipulation

The ability to manipulate data is a critical capability in big data analysis. Manipulating data is the process of exchanging, moving, sorting, and transforming the data. This technique is used in many situations, such as cleaning data, searching patterns, creating trends, and so on. Hive offers various query statements, keywords, operators, and functions to carry out data manipulation.

In this chapter, we will cover the following topics:

Data exchange using LOAD, INSERT, IMPORT, and EXPORT
Order and sort
Operators and functions
Transaction


Dataexchange–LOADTomovedatainHive,itusestheLOADkeyword.Moveheremeanstheoriginaldataismovedtothetargettable/partitionanddoesnotexistintheoriginalplaceanymore.ThefollowingisanexampleofhowtomovedatatotheHivetableorpartitionfromlocalorHDFSfiles.TheLOCALkeywordspecifieswherethefilesarelocatedinthehost.IftheLOCALkeywordisnotspecified,thefilesareloadedfromthefullUniformResourceIdentifier(URI)specifiedafterINPATHorthevaluefromthefs.default.nameHivepropertybydefault.ThepathafterINPATHcanbearelativepathoranabsolutepath.Thepatheitherpointstoafileorafolder(allfilesinthefolder)tobeloaded,butthesubfolderisnotallowedinthepathspecified.Ifthedataisloadedintoapartitiontable,thepartitioncolumnmustbespecified.TheOVERWRITEkeywordisusedtodecidewhethertoappendorreplacetheexistingdatainthetargettable/partition.

The following are the examples to load files into Hive tables:

Load local data to the Hive table:

jdbc:hive2://> LOAD DATA LOCAL INPATH
.......> '/home/dayongd/Downloads/employee_hr.txt'
.......> OVERWRITE INTO TABLE employee_hr;
No rows affected (0.436 seconds)

Load local data to the Hive partition table:

jdbc:hive2://> LOAD DATA LOCAL INPATH
.......> '/home/dayongd/Downloads/employee.txt'
.......> OVERWRITE INTO TABLE employee_partitioned
.......> PARTITION (year=2014, month=12);
No rows affected (0.772 seconds)

Load HDFS data to the Hive table using the default system path:

jdbc:hive2://> LOAD DATA INPATH
.......> '/user/dayongd/employee/employee.txt'
.......> OVERWRITE INTO TABLE employee;
No rows affected (0.453 seconds)

Load HDFS data to the Hive table with full URI:

jdbc:hive2://> LOAD DATA INPATH
.......> 'hdfs://[dfs_host]:8020/user/dayongd/employee/employee.txt'
.......> OVERWRITE INTO TABLE employee;
No rows affected (0.297 seconds)
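The four statements above differ only in the LOCAL, OVERWRITE, and PARTITION options. As a minimal sketch, the following Python helper assembles a LOAD DATA statement from those options. The function name build_load_statement and its parameters are hypothetical and are not part of Hive; the helper does no quoting or escaping, so it is only suitable for trusted inputs.

```python
def build_load_statement(path, table, local=False, overwrite=False, partition=None):
    """Assemble a HiveQL LOAD DATA statement (illustrative helper, not a Hive API)."""
    parts = ["LOAD DATA"]
    if local:
        # LOCAL: the path refers to the local filesystem instead of HDFS
        parts.append("LOCAL")
    parts.append("INPATH '{}'".format(path))
    if overwrite:
        # OVERWRITE replaces existing data; without it the data is appended
        parts.append("OVERWRITE")
    parts.append("INTO TABLE {}".format(table))
    if partition:
        # Partitioned tables need a value for each partition column
        spec = ", ".join("{}={}".format(k, v) for k, v in partition.items())
        parts.append("PARTITION ({})".format(spec))
    return " ".join(parts) + ";"


statement = build_load_statement(
    "/home/dayongd/Downloads/employee.txt",
    "employee_partitioned",
    local=True,
    overwrite=True,
    partition={"year": 2014, "month": 12},
)
print(statement)
```

The generated string can then be submitted through whichever Hive client is in use.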

Data exchange – INSERT

To extract the data from Hive tables/partitions, we can use the INSERT keyword. Like RDBMS, Hive supports inserting data by selecting data from other tables. This is a very common way to populate a table from existing data. The basic INSERT statement has the same syntax as a relational database's INSERT. However, Hive has improved its INSERT statement by supporting OVERWRITE, multiple INSERT, dynamic partition INSERT, as well as using INSERT to files. The following are a few examples:

The following is a regular INSERT from the SELECT statement:

--Check the target table, which is empty.
jdbc:hive2://> SELECT name, work_place, sex_age
.......> FROM employee;
+---------------+---------------------+------------------+
| employee.name | employee.work_place | employee.sex_age |
+---------------+---------------------+------------------+
+---------------+---------------------+------------------+
No rows selected (0.115 seconds)

--Populate data from SELECT
jdbc:hive2://> INSERT INTO TABLE employee
.......> SELECT * FROM ctas_employee;
No rows affected (31.701 seconds)

--Verify the data loaded
jdbc:hive2://> SELECT name, work_place, sex_age FROM employee;
+---------------+------------------------+---------------------------+
| employee.name | employee.work_place    | employee.sex_age          |
+---------------+------------------------+---------------------------+
| Michael       | ["Montreal","Toronto"] | {"sex":"Male","age":30}   |
| Will          | ["Montreal"]           | {"sex":"Male","age":35}   |
| Shelley       | ["New York"]           | {"sex":"Female","age":27} |
| Lucy          | ["Vancouver"]          | {"sex":"Female","age":57} |
+---------------+------------------------+---------------------------+
4 rows selected (0.12 seconds)

Insert data from the CTE statement:

jdbc:hive2://> WITH a AS (SELECT * FROM ctas_employee)
.......> FROM a
.......> INSERT OVERWRITE TABLE employee
.......> SELECT *;
No rows affected (30.1 seconds)

Run multiple INSERT by only scanning the source table once:

jdbc:hive2://> FROM ctas_employee
.......> INSERT OVERWRITE TABLE employee
.......> SELECT *
.......> INSERT OVERWRITE TABLE employee_internal
.......> SELECT *;
No rows affected (27.919 seconds)

Note

The INSERT OVERWRITE statement will replace the data in the target table/partition while INSERT INTO will append data.

When inserting data to the partitions, we need to specify the partition columns. Instead of specifying static values for static partitions, Hive also supports dynamically giving partition values. Dynamic partitions are useful when the data volume is large and we don't know what the partition values will be. For example, a date can be used dynamically as the partition columns.

Depending on the Hive version, dynamic partitioning may not be enabled by default. We need to set the following property to make it work:

jdbc:hive2://> SET hive.exec.dynamic.partition=true;
No rows affected (0.002 seconds)

By default, the user must specify at least one static partition column. This is to avoid accidentally overwriting partitions. To disable this restriction, we can set the partition mode to nonstrict from the default strict mode before inserting into dynamic partitions as follows:

jdbc:hive2://> SET hive.exec.dynamic.partition.mode=nonstrict;
No rows affected (0.002 seconds)

jdbc:hive2://> INSERT INTO TABLE employee_partitioned
.......> PARTITION(year, month)
.......> SELECT name, array('Toronto') as work_place,
.......> named_struct("sex","Male","age",30) as sex_age,
.......> map("Python",90) as skills_score,
.......> map("R&D",array('Developer')) as depart_title,
.......> year(start_date) as year, month(start_date) as month
.......> FROM employee_hr eh
.......> WHERE eh.employee_id = 102;
No rows affected (29.024 seconds)

Note

Complex type constructors are used in the preceding example to assign a constant value to a complex data type column.
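To make the idea of dynamic partition values more concrete, the following Python sketch models how rows are routed to partitions from the trailing year and month expressions of the SELECT. It is only an illustration of the routing rule, not of how Hive implements it, and the employee names and start dates are made-up sample values.

```python
from collections import defaultdict
from datetime import date


def route_to_partitions(rows):
    """Group (name, start_date) rows by the partition derived from start_date."""
    partitions = defaultdict(list)
    for name, start_date in rows:
        # The partition values come from the data itself, like
        # year(start_date) and month(start_date) in the SELECT list.
        key = "year={}/month={}".format(start_date.year, start_date.month)
        partitions[key].append(name)
    return dict(partitions)


# Hypothetical sample rows
rows = [
    ("Michael", date(2014, 11, 29)),
    ("Will", date(2013, 10, 2)),
    ("Steven", date(2012, 11, 3)),
    ("Lucy", date(2010, 1, 3)),
]
routed = route_to_partitions(rows)
for partition, names in sorted(routed.items()):
    print(partition, names)
```

Each distinct (year, month) pair becomes its own partition directory, which is why no partition values have to be listed in the PARTITION clause.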

The Hive INSERT to files statement is the opposite operation for LOAD. It extracts the data from SELECT statements to local or HDFS files. However, it only supports the OVERWRITE keyword, not INTO. This means we cannot append data extracted to the existing files. By default, the columns are separated by ^A and rows are separated by newlines. Since Hive 0.11.0, the row format, such as the field separator, can be specified. The following are a few examples to insert data to files:

We can insert to local files with default separators. In some recent versions of Hadoop, the local directory path only works for a directory level less than two. We may need to set hive.insert.into.multilevel.dirs=true to get this fixed:

jdbc:hive2://> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/output1'
.......> SELECT * FROM employee;
No rows affected (30.859 seconds)

Note

By default, many partial files could be created by the reducers when doing INSERT. To merge them into one, we can use HDFS commands, as shown in the following example:

hdfs dfs -getmerge hdfs://<host_name>:8020/user/dayongd/output /tmp/test

Insert to local files with a specified row format:

jdbc:hive2://> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/output2'
.......> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
.......> SELECT * FROM employee;
No rows affected (31.937 seconds)

--Verify the separator
vi /tmp/output2/000000_0
Michael,Montreal^BToronto,Male^B30,DB^C80,Product^CDeveloper^DLead
Will,Montreal,Male^B35,Perl^C85,Product^CLead^BTest^CLead
Shelley,New York,Female^B27,Python^C80,Test^CLead^BCOE^CArchitect
Lucy,Vancouver,Female^B57,Sales^C89^BHR^C94,Sales^CLead
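The ^B, ^C, and ^D markers in the file above are control characters (bytes 0x02, 0x03, and 0x04) that separate collection items, map keys, and nested collection items. As a minimal sketch, the following Python function reads one such line back into nested structures. The function name parse_employee_line is hypothetical, and the code assumes the five-column employee layout shown above and that no field value contains a comma.

```python
ITEM_SEP = "\x02"    # shown as ^B: separates array items, struct fields, map entries
KEY_SEP = "\x03"     # shown as ^C: separates a map key from its value
NESTED_SEP = "\x04"  # shown as ^D: separates items of an array nested in a map


def parse_employee_line(line):
    """Parse one comma-delimited line exported from the employee table."""
    name, work_place, sex_age, skills_score, depart_title = line.split(",")
    sex, age = sex_age.split(ITEM_SEP)
    skills = {}
    for entry in skills_score.split(ITEM_SEP):
        skill, score = entry.split(KEY_SEP)
        skills[skill] = int(score)
    titles = {}
    for entry in depart_title.split(ITEM_SEP):
        department, title_list = entry.split(KEY_SEP)
        titles[department] = title_list.split(NESTED_SEP)
    return {
        "name": name,
        "work_place": work_place.split(ITEM_SEP),
        "sex_age": {"sex": sex, "age": int(age)},
        "skills_score": skills,
        "depart_title": titles,
    }


sample = "Michael,Montreal\x02Toronto,Male\x0230,DB\x0380,Product\x03Developer\x04Lead"
record = parse_employee_line(sample)
print(record)
```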

Fire multiple INSERT statements from the same table SELECT statement:

jdbc:hive2://> FROM employee
.......> INSERT OVERWRITE DIRECTORY '/user/dayongd/output'
.......> SELECT *
.......> INSERT OVERWRITE DIRECTORY '/user/dayongd/output1'
.......> SELECT *;
No rows affected (25.4 seconds)

Note

Besides the Hive INSERT statement, Hive and HDFS shell commands can also be used to extract data to local or remote files with both append and overwrite support. The hive -e 'quoted_hql_string' or hive -f <hql_filename> commands can execute a Hive query statement or query file. Linux redirect operators and piping can be used with these commands to redirect result sets. The following are a few examples:

Append to local files:

$ hive -e 'select * from employee' >> test

Overwrite to local files:

$ hive -e 'select * from employee' > test

Append to HDFS files:

$ hive -e 'select * from employee' | hdfs dfs -appendToFile - /user/dayongd/output2/test

Overwrite to HDFS files:

$ hive -e 'select * from employee' | hdfs dfs -put -f - /user/dayongd/output2/test

Data exchange – EXPORT and IMPORT

When working with Hive, sometimes we need to migrate data among different environments, or we may need to back up some data. Since Hive 0.8.0, EXPORT and IMPORT statements are available to support the import and export of data in HDFS for data migration or backup/restore purposes.

The EXPORT statement will export both data and metadata from a table or partition. Metadata is exported in a file called _metadata. Data is exported in a subdirectory called data:

jdbc:hive2://> EXPORT TABLE employee TO '/user/dayongd/output3';
No rows affected (0.19 seconds)

After EXPORT, we can manually copy the exported files to other Hive instances or use Hadoop distcp commands to copy to other HDFS clusters. Then, we can import the data in the following manner:

Import data to a table with the same name. It throws an error if the table exists:

jdbc:hive2://> IMPORT FROM '/user/dayongd/output3';
Error: Error while compiling statement: FAILED: SemanticException
[Error 10119]: Table exists and contains data files
(state=42000,code=10119)

Import data to a new table:

jdbc:hive2://> IMPORT TABLE empolyee_imported FROM
.......> '/user/dayongd/output3';
No rows affected (0.788 seconds)

Import data to an external table, where the LOCATION property is optional:

jdbc:hive2://> IMPORT EXTERNAL TABLE empolyee_imported_external
.......> FROM '/user/dayongd/output3'
.......> LOCATION '/user/dayongd/output4';
No rows affected (0.256 seconds)

Export and import partitions:

jdbc:hive2://> EXPORT TABLE employee_partitioned partition
.......> (year=2014, month=11) TO '/user/dayongd/output5';
No rows affected (0.247 seconds)

jdbc:hive2://> IMPORT TABLE employee_partitioned_imported
.......> FROM '/user/dayongd/output5';
No rows affected (0.14 seconds)

ORDER and SORT

Another aspect of manipulating data in Hive is to properly order or sort the data or result sets to clearly identify the important facts, such as top N values, maximum, minimum, and so on.

There are the following keywords used in Hive to order and sort data:

ORDER BY (ASC|DESC): This is similar to the RDBMS ORDER BY statement. A sorted order is maintained across all of the output from every reducer. It performs the global sort using only one reducer, so it takes a longer time to return the result. Usage with LIMIT is strongly recommended for ORDER BY. When hive.mapred.mode = strict is set (by default, hive.mapred.mode = nonstrict) and we do not specify LIMIT, there are exceptions. This can be used as follows:

jdbc:hive2://> SELECT name FROM employee ORDER BY NAME DESC;
+----------+
| name     |
+----------+
| Will     |
| Shelley  |
| Michael  |
| Lucy     |
+----------+
4 rows selected (57.057 seconds)

SORT BY (ASC|DESC): This indicates which columns to sort when ordering the reducer input records. This means it completes sorting before sending data to the reducer. The SORT BY statement does not perform a global sort and only makes sure data is locally sorted in each reducer unless we set mapred.reduce.tasks=1. In this case, it is equal to the result of ORDER BY. It can be used as follows:

--Use more than 1 reducer
jdbc:hive2://> SET mapred.reduce.tasks = 2;
No rows affected (0.001 seconds)

jdbc:hive2://> SELECT name FROM employee SORT BY NAME DESC;
+----------+
| name     |
+----------+
| Shelley  |
| Michael  |
| Lucy     |
| Will     |
+----------+
4 rows selected (54.386 seconds)

--Use only 1 reducer
jdbc:hive2://> SET mapred.reduce.tasks = 1;
No rows affected (0.002 seconds)

jdbc:hive2://> SELECT name FROM employee SORT BY NAME DESC;
+----------+
| name     |
+----------+
| Will     |
| Shelley  |
| Michael  |
| Lucy     |
+----------+
4 rows selected (46.03 seconds)

DISTRIBUTE BY: Rows with matching column values will be partitioned to the same reducer. When used alone, it does not guarantee sorted input to the reducer. The DISTRIBUTE BY statement is similar to GROUP BY in RDBMS in terms of deciding which reducer to distribute the mapper output to. When used with SORT BY, DISTRIBUTE BY must be specified before the SORT BY statement. Also, the column used to distribute must appear in the select column list. It can be used as follows:

jdbc:hive2://> SELECT name
.......> FROM employee_hr DISTRIBUTE BY employee_id;
Error: Error while compiling statement: FAILED: SemanticException
[Error 10004]: Line 1:44 Invalid table alias or column reference
'employee_id': (possible column names are: name)
(state=42000,code=10004)

jdbc:hive2://> SELECT name, employee_id
.......> FROM employee_hr DISTRIBUTE BY employee_id;
+----------+--------------+
| name     | employee_id  |
+----------+--------------+
| Lucy     | 103          |
| Steven   | 102          |
| Will     | 101          |
| Michael  | 100          |
+----------+--------------+
4 rows selected (38.92 seconds)

--Used with SORT BY
jdbc:hive2://> SELECT name, employee_id
.......> FROM employee_hr
.......> DISTRIBUTE BY employee_id SORT BY name;
+----------+--------------+
| name     | employee_id  |
+----------+--------------+
| Lucy     | 103          |
| Michael  | 100          |
| Steven   | 102          |
| Will     | 101          |
+----------+--------------+
4 rows selected (38.01 seconds)

CLUSTER BY: This is a shorthand operator to perform DISTRIBUTE BY and SORT BY operations on the same group of columns. It is sorted locally in each reducer. The CLUSTER BY statement does not support ASC or DESC yet. Compared to ORDER BY, which is globally sorted, the CLUSTER BY operation is sorted in each distributed group. To fully utilize all the available reducers when doing a global sort, we can do CLUSTER BY first and then ORDER BY. This can be used as follows:

jdbc:hive2://> SELECT name, employee_id
.......> FROM employee_hr CLUSTER BY name;
+----------+--------------+
| name     | employee_id  |
+----------+--------------+
| Lucy     | 103          |
| Michael  | 100          |
| Steven   | 102          |
| Will     | 101          |
+----------+--------------+
4 rows selected (39.791 seconds)
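The difference between a global sort and a per-reducer sort is easier to see with a small model. The following Python sketch simulates ORDER BY with a single sorted list and DISTRIBUTE BY with SORT BY using two in-memory "reducers". It is only a sketch: the modulo on employee_id stands in for Hive's real hash partitioning, and the function names are hypothetical.

```python
# Rows from the employee_hr examples above: (name, employee_id)
employees = [("Lucy", 103), ("Steven", 102), ("Will", 101), ("Michael", 100)]


def order_by(rows, sort_key):
    """ORDER BY: one reducer sees every row and sorts them globally."""
    return sorted(rows, key=sort_key)


def distribute_and_sort(rows, distribute_key, sort_key, num_reducers):
    """DISTRIBUTE BY + SORT BY: rows are split across reducers, then each
    reducer sorts only the rows it received."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        # Assumption: a simple modulo replaces Hive's hash partitioner.
        reducers[distribute_key(row) % num_reducers].append(row)
    return [sorted(bucket, key=sort_key) for bucket in reducers]


globally_sorted = order_by(employees, sort_key=lambda r: r[0])
per_reducer = distribute_and_sort(
    employees,
    distribute_key=lambda r: r[1],
    sort_key=lambda r: r[0],
    num_reducers=2,
)
print(globally_sorted)
print(per_reducer)
```

Each inner list of per_reducer is sorted by name, but concatenating the lists does not give a globally sorted result, which is the behavior SORT BY shows with more than one reducer.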

The difference between ORDER BY and CLUSTER BY can be seen in the following diagram:

Operators and functions

To further manipulate data, we can also use expressions, operators, and functions in Hive to transform data. The Hive wiki (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF) offers specifications for each expression and function, so we do not repeat all of them here, except for a few important usages or tips in this chapter.

Hive has defined relational operators, arithmetic operators, logical operators, complex type constructors, and complex type operators. The relational, arithmetic, and logical operators are similar to the standard operators in SQL/Java, so we do not repeat them in this chapter. Operators on complex data types have already been introduced in the Understanding Hive data types section of Chapter 3, Data Definition and Description, as well as in the example for a dynamic partition insert in this chapter.

The functions in Hive are categorized as follows:

Mathematical functions: These functions are mainly used to perform mathematical calculations, such as RAND() and E().
Collection functions: These functions are used to find the size, keys, and values for complex types, such as SIZE(Array<T>).
Type conversion functions: These are mainly CAST and BINARY functions to convert one type to the other.
Date functions: These functions are used to perform date-related calculations, such as YEAR(string date) and MONTH(string date).
Conditional functions: These functions are used to check specific conditions with a defined value returned, such as COALESCE, IF, and CASE WHEN.
String functions: These functions are used to perform string-related operations, such as UPPER(string A) and TRIM(string A).
Aggregate functions: These functions are used to perform aggregation (which is introduced in more detail in the next chapter), such as SUM() and COUNT(*).
Table-generating functions: These functions transform a single input row into multiple output rows, such as EXPLODE(MAP) and JSON_TUPLE(jsonString, k1, k2, …).
Customized functions: These functions are created by Java code as extensions for Hive. They are introduced in Chapter 8, Extensibility Considerations.

To list Hive built-in functions/UDF, we can use the following commands in Hive CLI:

SHOW FUNCTIONS; --List all functions
DESCRIBE FUNCTION <function_name>; --Detail for specified function
DESCRIBE FUNCTION EXTENDED <function_name>; --Even more details

The following are a few examples and tips for using these functions:

Complex data type function tips: The SIZE function is used to calculate the size for MAP, ARRAY, or nested MAP/ARRAY. It returns -1 if the size is unknown. It can be used as follows:

jdbc:hive2://> SELECT work_place, skills_score, depart_title
.......> FROM employee;
+------------------------+----------------------+---------------------------------------+
| work_place             | skills_score         | depart_title                          |
+------------------------+----------------------+---------------------------------------+
| ["Montreal","Toronto"] | {"DB":80}            | {"Product":["Developer","Lead"]}      |
| ["Montreal"]           | {"Perl":85}          | {"Product":["Lead"],"Test":["Lead"]}  |
| ["New York"]           | {"Python":80}        | {"Test":["Lead"],"COE":["Architect"]} |
| ["Vancouver"]          | {"Sales":89,"HR":94} | {"Sales":["Lead"]}                    |
+------------------------+----------------------+---------------------------------------+
4 rows selected (0.084 seconds)

jdbc:hive2://> SELECT SIZE(work_place) AS array_size,
.......> SIZE(skills_score) AS map_size,
.......> SIZE(depart_title) AS complex_size,
.......> SIZE(depart_title["Product"]) AS nest_size
.......> FROM employee;
+-------------+-----------+---------------+------------+
| array_size  | map_size  | complex_size  | nest_size  |
+-------------+-----------+---------------+------------+
| 2           | 1         | 1             | 2          |
| 1           | 1         | 2             | 1          |
| 1           | 1         | 2             | -1         |
| 1           | 2         | 1             | -1         |
+-------------+-----------+---------------+------------+
4 rows selected (0.062 seconds)
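The -1 values in the nest_size column come from rows whose depart_title map has no "Product" key, so the looked-up value is NULL. A minimal Python model of that rule, using the depart_title values from the table above, could look like this. The function name hive_size is hypothetical and only mirrors the behavior described here.

```python
def hive_size(value):
    """Model of SIZE as described above: -1 when the value is NULL (None)."""
    return -1 if value is None else len(value)


# depart_title values from the four employee rows above
depart_titles = [
    {"Product": ["Developer", "Lead"]},
    {"Product": ["Lead"], "Test": ["Lead"]},
    {"Test": ["Lead"], "COE": ["Architect"]},
    {"Sales": ["Lead"]},
]

complex_sizes = [hive_size(d) for d in depart_titles]
# dict.get returns None for a missing key, like a map lookup returning NULL
nest_sizes = [hive_size(d.get("Product")) for d in depart_titles]
print(complex_sizes)
print(nest_sizes)
```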

The ARRAY_CONTAINS function checks whether the array contains a given value and returns TRUE or FALSE. The SORT_ARRAY function sorts the array in ascending order. These can be used as follows:

jdbc:hive2://> SELECT ARRAY_CONTAINS(work_place, 'Toronto')
.......> AS is_Toronto,
.......> SORT_ARRAY(work_place) AS sorted_array
.......> FROM employee;
+-------------+-------------------------+
| is_toronto  | sorted_array            |
+-------------+-------------------------+
| true        | ["Montreal","Toronto"]  |
| false       | ["Montreal"]            |
| false       | ["New York"]            |
| false       | ["Vancouver"]           |
+-------------+-------------------------+
4 rows selected (0.059 seconds)

Date function tips: The FROM_UNIXTIME(UNIX_TIMESTAMP()) statement performs the same function as SYSDATE in Oracle. It dynamically returns the current date-time in the Hive server, as follows:

jdbc:hive2://> SELECT
.......> FROM_UNIXTIME(UNIX_TIMESTAMP()) AS current_time
.......> FROM employee LIMIT 1;
+----------------------+
| current_time         |
+----------------------+
| 2014-11-15 19:28:29  |
+----------------------+
1 row selected (0.047 seconds)

The UNIX_TIMESTAMP() function can be used to compare two dates or can be used after ORDER BY to properly order the different string types of a date value, such as ORDER BY UNIX_TIMESTAMP(string_date, 'dd-MM-yyyy'). This can be used as follows:

--To compare the difference between two dates.
jdbc:hive2://> SELECT (UNIX_TIMESTAMP('2015-01-21 18:00:00')
.......> - UNIX_TIMESTAMP('2015-01-10 11:00:00'))/60/60/24
.......> AS daydiff FROM employee LIMIT 1;
+---------------------+
| daydiff             |
+---------------------+
| 11.291666666666666  |
+---------------------+
1 row selected (0.093 seconds)
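The arithmetic behind daydiff can be checked outside Hive. The following Python sketch repeats the same calculation with the standard datetime module: the difference in seconds is divided by 60, 60, and 24 to get days. It uses naive timestamps, so, unlike UNIX_TIMESTAMP on a real server, it ignores time zones and daylight saving changes.

```python
from datetime import datetime

FORMAT = "%Y-%m-%d %H:%M:%S"

later = datetime.strptime("2015-01-21 18:00:00", FORMAT)
earlier = datetime.strptime("2015-01-10 11:00:00", FORMAT)

# Seconds between the two timestamps, like subtracting two UNIX_TIMESTAMP values
seconds = (later - earlier).total_seconds()

# Seconds -> minutes -> hours -> days
day_diff = seconds / 60 / 60 / 24
print(day_diff)
```

The two timestamps are 11 days and 7 hours apart, which is 11 + 7/24 days and matches the value returned by the query.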

The TO_DATE function removes the hours, minutes, and seconds from a date. This is useful when we need to check whether the value of a date-time type column is within a date range, such as WHERE TO_DATE(update_datetime) BETWEEN '2014-11-01' AND '2014-11-30'. This can be used as follows:

jdbc:hive2://> SELECT TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP()))
.......> AS current_date FROM employee LIMIT 1;
+---------------+
| current_date  |
+---------------+
| 2014-11-15    |
+---------------+
1 row selected (0.153 seconds)

CASE for different data types: Before Hive 0.13.0, the data type after THEN or ELSE needed to be the same. Otherwise, it would give an exception, such as The expression after ELSE should have the same type as those after THEN: "bigint" is expected but "int" is found. The workaround is to use IF. In Hive 0.13.0, this gets fixed, as shown here:

jdbc:hive2://> SELECT
.......> CASE WHEN 1 IS NULL THEN 'TRUE' ELSE 0 END
.......> AS case_result FROM employee LIMIT 1;
+--------------+
| case_result  |
+--------------+
| 0            |
+--------------+
1 row selected (0.063 seconds)

Parser and search tips: The LATERAL VIEW statement is used with user-defined table-generating functions such as EXPLODE() to flatten the map or array type of a column. The explode function can be used on both ARRAY and MAP with LATERAL VIEW. If even one of the columns exploded is NULL, the whole row is filtered out, such as the row of Steven in the following example. To avoid this, LATERAL VIEW OUTER can be used as follows since Hive 0.12.0:

--Prepare data
jdbc:hive2://> INSERT INTO TABLE employee
.......> SELECT 'Steven' AS name, array(null) as work_place,
.......> named_struct("sex","Male","age",30) as sex_age,
.......> map("Python",90) as skills_score,
.......> map("R&D",array('Developer')) as depart_title
.......> FROM employee LIMIT 1;
No rows affected (28.187 seconds)

jdbc:hive2://> SELECT name, work_place, skills_score
.......> FROM employee;
+----------+-------------------------+-----------------------+
| name     | work_place              | skills_score          |
+----------+-------------------------+-----------------------+
| Michael  | ["Montreal","Toronto"]  | {"DB":80}             |
| Will     | ["Montreal"]            | {"Perl":85}           |
| Shelley  | ["New York"]            | {"Python":80}         |
| Lucy     | ["Vancouver"]           | {"Sales":89,"HR":94}  |
| Steven   | NULL                    | {"Python":90}         |
+----------+-------------------------+-----------------------+
5 rows selected (0.053 seconds)

--LATERAL VIEW ignores the rows when EXPLODE returns NULL
jdbc:hive2://> SELECT name, workplace, skills, score
.......> FROM employee
.......> LATERAL VIEW explode(work_place) wp AS workplace
.......> LATERAL VIEW explode(skills_score) ss
.......> AS skills, score;
+----------+------------+---------+--------+
| name     | workplace  | skills  | score  |
+----------+------------+---------+--------+
| Michael  | Montreal   | DB      | 80     |
| Michael  | Toronto    | DB      | 80     |
| Will     | Montreal   | Perl    | 85     |
| Shelley  | New York   | Python  | 80     |
| Lucy     | Vancouver  | Sales   | 89     |
| Lucy     | Vancouver  | HR      | 94     |
+----------+------------+---------+--------+
6 rows selected (24.733 seconds)

--LATERAL VIEW OUTER keeps rows when EXPLODE returns NULL
jdbc:hive2://> SELECT name, workplace, skills, score
.......> FROM employee
.......> LATERAL VIEW OUTER explode(work_place) wp
.......> AS workplace
.......> LATERAL VIEW explode(skills_score) ss
.......> AS skills, score;
+----------+------------+---------+--------+
| name     | workplace  | skills  | score  |
+----------+------------+---------+--------+
| Michael  | Montreal   | DB      | 80     |
| Michael  | Toronto    | DB      | 80     |
| Will     | Montreal   | Perl    | 85     |
| Shelley  | New York   | Python  | 80     |
| Lucy     | Vancouver  | Sales   | 89     |
| Lucy     | Vancouver  | HR      | 94     |
| Steven   | None       | Python  | 90     |
+----------+------------+---------+--------+
7 rows selected (24.573 seconds)
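The row counts above (6 without OUTER, 7 with it) follow from how each exploded column multiplies the rows. The following Python sketch models this behavior on the same five employee rows. It is a simplified model rather than Hive's implementation: a NULL collection is represented by None, and the function names are hypothetical.

```python
def explode(collection):
    """Return one item per array element, or one (key, value) pair per map entry."""
    if collection is None:
        return []
    if isinstance(collection, dict):
        return list(collection.items())
    return list(collection)


def lateral_view(rows, outer=False):
    """Flatten work_place and skills_score; OUTER applies to work_place only."""
    result = []
    for name, work_place, skills_score in rows:
        places = explode(work_place)
        if not places and outer:
            # LATERAL VIEW OUTER keeps the row and fills the column with NULL
            places = [None]
        for place in places:
            for skill, score in explode(skills_score):
                result.append((name, place, skill, score))
    return result


rows = [
    ("Michael", ["Montreal", "Toronto"], {"DB": 80}),
    ("Will", ["Montreal"], {"Perl": 85}),
    ("Shelley", ["New York"], {"Python": 80}),
    ("Lucy", ["Vancouver"], {"Sales": 89, "HR": 94}),
    ("Steven", None, {"Python": 90}),
]

inner_rows = lateral_view(rows)
outer_rows = lateral_view(rows, outer=True)
print(len(inner_rows), len(outer_rows))
```

Without OUTER, Steven contributes zero work places and therefore zero output rows; with OUTER, a single NULL place is kept and combined with his one skill.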

The REVERSE function can be used to reverse the order of each letter in a string. The SPLIT function can be used to tokenize the string using a specified tokenizer. The following is an example of using them to get the filename from a Linux path:

jdbc:hive2://> SELECT
.......> reverse(split(reverse('/home/user/employee.txt'),'/')[0])
.......> AS linux_file_name FROM employee LIMIT 1;
+------------------+
| linux_file_name  |
+------------------+
| employee.txt     |
+------------------+
1 row selected (0.1 seconds)
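The trick works because reversing the path puts the filename first, so the first token before '/' is the reversed filename. The same three steps can be traced in Python as a quick check; the variable names are only for illustration.

```python
path = "/home/user/employee.txt"

reversed_path = path[::-1]                   # 'txt.eeyolpme/resu/emoh/'
first_token = reversed_path.split("/")[0]    # 'txt.eeyolpme'
file_name = first_token[::-1]                # reverse back to the filename
print(file_name)
```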

Whereas explode outputs each element in an array or map as separate rows, collect_set and collect_list do the opposite by returning a collection with elements from each row. The collect_set function will remove duplicates from the result, but collect_list does not. This is shown here:

jdbc:hive2://> SELECT collect_set(work_place[0])
.......> AS flat_workplace0 FROM employee;
+-------------------------------------+
| flat_workplace0                     |
+-------------------------------------+
| ["Vancouver","Montreal","New York"] |
+-------------------------------------+
1 row selected (43.455 seconds)

jdbc:hive2://> SELECT collect_list(work_place[0])
.......> AS flat_workplace0 FROM employee;
+------------------------------------------------+
| flat_workplace0                                |
+------------------------------------------------+
| ["Montreal","Montreal","New York","Vancouver"] |
+------------------------------------------------+
1 row selected (45.488 seconds)
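The two aggregations can be modeled with a Python list and set. This sketch uses the work_place values of the five employee rows and assumes that NULL values are skipped, as the four-element collect_list output above suggests. As with collect_set, the order of elements in a Python set is not guaranteed.

```python
# work_place values of the five employee rows; None stands for NULL
work_places = [
    ["Montreal", "Toronto"],
    ["Montreal"],
    ["New York"],
    ["Vancouver"],
    None,
]

# work_place[0] for each row, skipping NULL values (assumption, see above)
first_places = [wp[0] for wp in work_places if wp]

collected_list = list(first_places)   # keeps duplicates, like collect_list
collected_set = set(first_places)     # removes duplicates, like collect_set
print(collected_list)
print(sorted(collected_set))
```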

Virtual columns: Virtual columns are a special function type of columns in Hive. Right now, Hive offers two virtual columns: INPUT__FILE__NAME and BLOCK__OFFSET__INSIDE__FILE. The INPUT__FILE__NAME column is the input file's name for a mapper task. The BLOCK__OFFSET__INSIDE__FILE column is the current global file position or, if the file is compressed, the current block's file offset. The following are examples of using virtual columns to find out where the data is physically located in HDFS, especially for bucketed and partitioned tables:

jdbc:hive2://> SELECT INPUT__FILE__NAME,
.......> BLOCK__OFFSET__INSIDE__FILE AS OFFSIDE
.......> FROM employee_id_buckets;
+--------------------------------------------------------+---------+
| input__file__name                                      | offside |
+--------------------------------------------------------+---------+
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 0       |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 55      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 120     |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 175     |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 240     |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 295     |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 360     |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 415     |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 480     |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 535     |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 592     |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 657     |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 712     |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 769     |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0 | 834     |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0 | 0       |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0 | 57      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0 | 122     |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0 | 177     |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0 | 234     |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0 | 291     |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0 | 348     |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0 | 405     |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0 | 462     |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0 | 517     |
+--------------------------------------------------------+---------+
25 rows selected (0.073 seconds)

jdbc:hive2://> SELECT INPUT__FILE__NAME FROM employee_partitioned;
+---------------------------------------------------------------------------+
| input__file__name                                                         |
+---------------------------------------------------------------------------+
| hdfs://warehouse_URI/employee_partitioned/year=2010/month=1/000000_0      |
| hdfs://warehouse_URI/employee_partitioned/year=2012/month=11/000000_0     |
| hdfs://warehouse_URI/employee_partitioned/year=2014/month=12/employee.txt |
| hdfs://warehouse_URI/employee_partitioned/year=2014/month=12/employee.txt |
| hdfs://warehouse_URI/employee_partitioned/year=2014/month=12/employee.txt |
| hdfs://warehouse_URI/employee_partitioned/year=2014/month=12/employee.txt |
| hdfs://warehouse_URI/employee_partitioned/year=2015/month=01/000000_0     |
| hdfs://warehouse_URI/employee_partitioned/year=2015/month=01/000000_0     |
| hdfs://warehouse_URI/employee_partitioned/year=2015/month=01/000000_0     |
| hdfs://warehouse_URI/employee_partitioned/year=2015/month=01/000000_0     |
+---------------------------------------------------------------------------+
10 rows selected (0.47 seconds)

Functions not mentioned in the Hive wiki: The following are the functions not mentioned in the Hive wiki:

--Functions to check for null values
jdbc:hive2://> SELECT work_place, isnull(work_place) is_null,
.......> isnotnull(work_place) is_not_null FROM employee;
+-------------------------+----------+--------------+
| work_place              | is_null  | is_not_null  |
+-------------------------+----------+--------------+
| ["Montreal","Toronto"]  | false    | true         |
| ["Montreal"]            | false    | true         |
| ["New York"]            | false    | true         |
| ["Vancouver"]           | false    | true         |
| NULL                    | true     | false        |
+-------------------------+----------+--------------+
5 rows selected (0.058 seconds)

--assert_true, throw an exception if 'condition' is not true.
jdbc:hive2://> SELECT assert_true(work_place IS NULL)
.......> FROM employee;
Error: java.io.IOException:
org.apache.hadoop.hive.ql.metadata.HiveException: ASSERT_TRUE():
assertion failed. (state=,code=0)

--elt(n, str1, str2, ...), returns the n-th string
jdbc:hive2://> SELECT elt(2,'New York','Montreal','Toronto')
.......> FROM employee LIMIT 1;
+-----------+
| _c0       |
+-----------+
| Montreal  |
+-----------+
1 row selected (0.055 seconds)

--Return the name of current_database since Hive 0.13.0
jdbc:hive2://> SELECT current_database();
+----------+
| _c0      |
+----------+
| default  |
+----------+
1 row selected (0.057 seconds)
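Note that elt counts from 1, not from 0, which is why elt(2, ...) returns the second string. A minimal Python model of this indexing is sketched below. Returning None for an out-of-range position is an assumption made for this sketch and is not taken from the example above.

```python
def elt(n, *strings):
    """Return the n-th string, counting from 1 (model of the elt example above)."""
    if 1 <= n <= len(strings):
        return strings[n - 1]
    # Assumption: an out-of-range position yields NULL (None)
    return None


second = elt(2, "New York", "Montreal", "Toronto")
print(second)
```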

TransactionsBeforeHiveversion0.13.0,Hivedoesnotsupportrow-leveltransactions.Asaresult,thereisnowaytoupdate,insert,ordeleterowsofdata.Hence,dataoverwritecanonlyhappenontablesorpartitions.ThismakesHiveverydifficultwhendealingwithconcurrentread/writeanddata-cleaningusecases.

SinceHiveversion0.13.0,Hivefullysupportsrow-leveltransactionsbyofferingfullAtomicity,Consistency,Isolation,andDurability(ACID)toHive.Fornow,allthetransactionsareautocommutedandonlysupportdataintheOptimizedRowColumnar(ORC)file(availablesinceHive0.11.0)formatandinbucketedtables.

ThefollowingconfigurationparametersmustbesetappropriatelytoturnontransactionsupportinHive:

SEThive.support.concurrency=true;

SEThive.enforce.bucketing=true;

SEThive.exec.dynamic.partition.mode=nonstrict;

SEThive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

SEThive.compactor.initiator.on=true;

SEThive.compactor.worker.threads=1;

The SHOW TRANSACTIONS command, added in Hive 0.13.0, shows the currently open and aborted transactions in the system:

jdbc:hive2://> SHOW TRANSACTIONS;
+-----------------+--------------------+-------+-----------+
| txnid           | state              | user  | host      |
+-----------------+--------------------+-------+-----------+
| Transaction ID  | Transaction State  | User  | Hostname  |
+-----------------+--------------------+-------+-----------+
1 row selected (15.209 seconds)

Since Hive 0.14.0, the INSERT ... VALUES, UPDATE, and DELETE commands have been available to operate on rows, with the following syntax:

INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)]
VALUES values_row [, values_row …];

UPDATE tablename SET column = value [, column = value …] [WHERE expression]

DELETE FROM tablename [WHERE expression]


Summary

In this chapter, we covered how to exchange data between Hive and files using the LOAD, INSERT, IMPORT, and EXPORT keywords. Then, we introduced the different Hive ordering and sorting options. We also covered some commonly used tips using Hive functions. Finally, we provided an overview of row-level transactions, which have been supported since Hive 0.13.0. After going through this chapter, we should be able to import data to and export data from Hive. We should also be experienced in using the different types of ordering and sorting keywords, Hive functions, and transactions.

In the next chapter, we'll look at the different ways of carrying out data aggregations and sampling in Hive.


Chapter 6. Data Aggregation and Sampling

This chapter is about how to aggregate and sample data in Hive. It first covers the usage of several aggregation functions, analytic functions working with GROUP BY and PARTITION BY, and windowing clauses. Then, it introduces different ways of sampling data in Hive.

In this chapter, we will cover the following topics:

Basic aggregation
Advanced aggregation
Aggregation condition
Analytic functions
Sampling


Basic aggregation – GROUP BY

Data aggregation is any process that gathers and expresses data in a summary form to get more information about particular groups based on specific conditions. Hive offers several built-in aggregate functions, such as MAX, MIN, AVG, and so on. Hive also supports advanced aggregation by using GROUPING SETS, ROLLUP, CUBE, analytic functions, and windowing.

The Hive basic built-in aggregate functions are usually used with the GROUP BY clause. If no GROUP BY clause is specified, Hive aggregates over the whole table by default. Apart from the aggregate functions, all other columns that are selected must also be included in the GROUP BY clause. The following are a few examples using the built-in aggregate functions:

Aggregation without GROUP BY columns:

jdbc:hive2://> SELECT count(*) AS row_cnt FROM employee;
+----------+
| row_cnt  |
+----------+
| 5        |
+----------+
1 row selected (60.709 seconds)

Aggregation with GROUP BY columns:

jdbc:hive2://> SELECT sex_age.sex, count(*) AS row_cnt
. . . . . . .> FROM employee
. . . . . . .> GROUP BY sex_age.sex;
+--------------+----------+
| sex_age.sex  | row_cnt  |
+--------------+----------+
| Female       | 2        |
| Male         | 3        |
+--------------+----------+
2 rows selected (100.565 seconds)

-- The selected column name is not one of the GROUP BY columns
jdbc:hive2://> SELECT name, sex_age.sex, count(*) AS row_cnt
. . . . . . .> FROM employee GROUP BY sex_age.sex;
Error: Error while compiling statement: FAILED: SemanticException
[Error 10025]: Line 1:7 Expression not in GROUP BY key 'name'
(state=42000,code=10025)

If we have to select columns that are not GROUP BY columns, one way is to use analytic functions, which are introduced later, to avoid using the GROUP BY clause completely. The other way is to use the collect_set function, which returns a set of objects with duplicate elements eliminated, as follows:

-- Find the row count by sex and a sampled age for each sex
jdbc:hive2://> SELECT sex_age.sex,
. . . . . . .> collect_set(sex_age.age)[0] AS random_age,
. . . . . . .> count(*) AS row_cnt
. . . . . . .> FROM employee GROUP BY sex_age.sex;
+--------------+-------------+----------+
| sex_age.sex  | random_age  | row_cnt  |
+--------------+-------------+----------+
| Female       | 27          | 2        |
| Male         | 35          | 3        |
+--------------+-------------+----------+
2 rows selected (48.15 seconds)

An aggregate function can be used with other aggregate functions in the same SELECT statement. It can also be combined with other functions, such as conditional functions, in a nested way. However, nesting one aggregate function inside another is not supported. See the following examples for more details:

Multiple aggregate functions are called in the same SELECT statement, as follows:

jdbc:hive2://> SELECT sex_age.sex, AVG(sex_age.age) AS avg_age,
. . . . . . .> count(*) AS row_cnt
. . . . . . .> FROM employee GROUP BY sex_age.sex;
+--------------+---------------------+----------+
| sex_age.sex  | avg_age             | row_cnt  |
+--------------+---------------------+----------+
| Female       | 42.0                | 2        |
| Male         | 31.666666666666668  | 3        |
+--------------+---------------------+----------+
2 rows selected (98.857 seconds)

These aggregate functions are used with CASE WHEN, as follows:

jdbc:hive2://> SELECT sum(CASE WHEN sex_age.sex = 'Male'
. . . . . . .> THEN sex_age.age ELSE 0 END) /
. . . . . . .> count(CASE WHEN sex_age.sex = 'Male' THEN 1
. . . . . . .> ELSE NULL END) AS male_age_avg FROM employee;
+---------------------+
| male_age_avg        |
+---------------------+
| 31.666666666666668  |
+---------------------+
1 row selected (38.415 seconds)

These aggregate functions are used with COALESCE and IF, as follows:

jdbc:hive2://> SELECT
. . . . . . .> sum(coalesce(sex_age.age, 0)) AS age_sum,
. . . . . . .> sum(if(sex_age.sex = 'Female', sex_age.age, 0))
. . . . . . .> AS female_age_sum FROM employee;
+----------+-----------------+
| age_sum  | female_age_sum  |
+----------+-----------------+
| 179      | 84              |
+----------+-----------------+
1 row selected (42.137 seconds)

Nested aggregate functions are not allowed, as shown here:

jdbc:hive2://> SELECT avg(count(*)) AS row_cnt
. . . . . . .> FROM employee;
Error: Error while compiling statement: FAILED: SemanticException
[Error 10128]: Line 1:11 Not yet supported place for UDAF 'count'
(state=42000,code=10128)

Aggregate functions can also be used with the DISTINCT keyword to do aggregation on unique values:

jdbc:hive2://> SELECT count(DISTINCT sex_age.sex) AS sex_uni_cnt,
. . . . . . .> count(DISTINCT name) AS name_uni_cnt
. . . . . . .> FROM employee;
+--------------+---------------+
| sex_uni_cnt  | name_uni_cnt  |
+--------------+---------------+
| 2            | 5             |
+--------------+---------------+
1 row selected (35.935 seconds)

Note

When we use COUNT and DISTINCT together, Hive always ignores the setting for the number of reducers (such as mapred.reduce.tasks=20) and uses only one reducer. In this case, the single reducer becomes the bottleneck when processing big volumes of data. The workaround is to use a subquery, as follows:

-- Triggers a single reducer during the whole processing
SELECT count(distinct sex_age.sex) AS sex_uni_cnt FROM employee;

-- Use a subquery to select unique values before aggregation for better performance
SELECT count(*) AS sex_uni_cnt FROM (SELECT distinct sex_age.sex FROM employee) a;

With the subquery, the first stage of the query, which implements DISTINCT, can use more than one reducer. In the second stage, the mapper has less output to pass on for the COUNT, since the data is already unique after DISTINCT has been applied. As a result, the reducer will not be overloaded.
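The two-stage idea behind the subquery workaround can be sketched in plain Python. This is only an illustrative model, not how Hive schedules its jobs: the function name count_distinct_two_stage and the num_reducers parameter are assumptions made up for this sketch.

```python
from collections import defaultdict


def count_distinct_two_stage(values, num_reducers=4):
    """Toy model of the subquery rewrite (hypothetical helper, not a Hive API)."""
    # Stage 1: route every value to a reducer by hash, so equal values always
    # meet on the same reducer, where a set removes the duplicates.
    partitions = defaultdict(set)
    for value in values:
        partitions[hash(value) % num_reducers].add(value)
    # Stage 2: the final step only adds up one small count per partition.
    return sum(len(unique) for unique in partitions.values())


sexes = ["Male", "Male", "Female", "Male", "Female"]
print(count_distinct_two_stage(sexes))
```

Because equal values always land in the same partition, the result does not depend on how many reducers the first stage uses.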

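We may encounter a very special behavior when Hive deals with aggregation across columns that contain a NULL value. If any column used in the aggregated expression is NULL in a row, the whole expression evaluates to NULL and the entire row is ignored by the aggregation, as happens to the second row in the following example. To avoid this, we can use COALESCE to assign a default value when the column value is NULL. This can be done as follows: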

-- Create a table t for testing
jdbc:hive2://> CREATE TABLE t AS SELECT * FROM
. . . . . . .> (SELECT employee_id - 99 AS val1,
. . . . . . .> (employee_id - 98) AS val2 FROM employee_hr
. . . . . . .> WHERE employee_id <= 101
. . . . . . .> UNION ALL
. . . . . . .> SELECT null val1, 2 AS val2 FROM employee_hr
. . . . . . .> WHERE employee_id = 100) a;
No rows affected (0.138 seconds)

-- Check the rows in the table created
jdbc:hive2://> SELECT * FROM t;
+---------+---------+
| t.val1  | t.val2  |
+---------+---------+
| 1       | 2       |
| NULL    | 2       |
| 2       | 3       |
+---------+---------+
3 rows selected (0.069 seconds)

-- The 2nd row (NULL, 2) is ignored when doing sum(val1 + val2)
jdbc:hive2://> SELECT sum(val1), sum(val1 + val2)
. . . . . . .> FROM t;
+------+------+
| _c0  | _c1  |
+------+------+
| 3    | 8    |
+------+------+
1 row selected (57.775 seconds)

jdbc:hive2://> SELECT sum(coalesce(val1, 0)),
. . . . . . .> sum(coalesce(val1, 0) + val2) FROM t;
+------+------+
| _c0  | _c1  |
+------+------+
| 3    | 10   |
+------+------+
1 row selected (69.967 seconds)
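The NULL behavior above can be reproduced with a small Python model in which None stands in for NULL. The helper names sql_add, sql_sum, and coalesce are assumptions written for this sketch; they only imitate the SQL semantics and are not Hive code.

```python
def sql_add(a, b):
    """NULL-propagating addition: any NULL operand makes the result NULL."""
    return None if a is None or b is None else a + b


def sql_sum(values):
    """SUM skips NULL inputs and returns NULL when nothing is left to add."""
    non_null = [v for v in values if v is not None]
    return sum(non_null) if non_null else None


def coalesce(*args):
    """Return the first argument that is not NULL, or NULL if there is none."""
    for arg in args:
        if arg is not None:
            return arg
    return None


# The three rows of table t from the example: (val1, val2)
t = [(1, 2), (None, 2), (2, 3)]

print(sql_sum(v1 for v1, _ in t))                            # 3
print(sql_sum(sql_add(v1, v2) for v1, v2 in t))              # 8, the row (NULL, 2) is skipped
print(sql_sum(sql_add(coalesce(v1, 0), v2) for v1, v2 in t)) # 10
```

The second result is 8 rather than 10 because 1 + 2 and 2 + 3 are summed while NULL + 2 is dropped; wrapping val1 in coalesce brings the missing 2 back.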

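The hive.map.aggr property controls aggregations in the map task. When it is set to true, Hive does the first-level aggregation directly in the map task for better performance, at the cost of more memory. The setting defaults to true in Hive 0.3.0 and later, so it only needs to be set explicitly if it has been turned off in our environment:

jdbc:hive2://> SET hive.map.aggr=true;
No rows affected (0.002 seconds)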


Advanced aggregation – GROUPING SETS

Hive offers the GROUPING SETS keywords to implement advanced multiple GROUP BY operations against the same set of data. In effect, GROUPING SETS is a shorthand way of connecting several GROUP BY result sets with UNION ALL. The GROUPING SETS keyword completes all processes in one stage of jobs, which is more efficient than GROUP BY and UNION ALL having multiple stages. A blank set () in the GROUPING SETS clause calculates the overall aggregation. The following are a few examples that show the equivalence of GROUPING SETS; in each of them, the || line separates the GROUPING SETS query from its equivalent. For better understanding, we can say that the outer level of GROUPING SETS defines what data UNION ALL is to be applied to. The inner level defines what data GROUP BY is to be applied to in each UNION ALL branch.

SELECT name, work_place[0] AS main_place,
count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY name, work_place[0] GROUPING SETS((name, work_place[0]));
||
SELECT name, work_place[0] AS main_place,
count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY name, work_place[0];

SELECT name, work_place[0] AS main_place,
count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY name, work_place[0] GROUPING SETS(name, work_place[0]);
||
SELECT name, NULL AS main_place, count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY name
UNION ALL
SELECT NULL AS name, work_place[0] AS main_place,
count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY work_place[0];

SELECT name, work_place[0] AS main_place,
count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY name, work_place[0]
GROUPING SETS((name, work_place[0]), name);
||
SELECT name, work_place[0] AS main_place,
count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY name, work_place[0]
UNION ALL
SELECT name, NULL AS main_place, count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY name;


SELECT name, work_place[0] AS main_place,
count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY name, work_place[0]
GROUPING SETS((name, work_place[0]), name, work_place[0], ());
||
SELECT name, work_place[0] AS main_place,
count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY name, work_place[0]
UNION ALL
SELECT name, NULL AS main_place, count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY name
UNION ALL
SELECT NULL AS name, work_place[0] AS main_place,
count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY work_place[0]
UNION ALL
SELECT NULL AS name, NULL AS main_place,
count(employee_id) AS emp_id_cnt
FROM employee_id;
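The equivalence between GROUPING SETS and a UNION ALL of GROUP BY queries can be modeled in a few lines of Python. This is a sketch under stated assumptions: the function grouping_sets_count is hypothetical, the three sample rows are made up for illustration, and every row is assumed to have a non-NULL employee_id so that counting rows matches count(employee_id).

```python
from collections import Counter


def grouping_sets_count(rows, columns, grouping_sets):
    """Run one GROUP BY per grouping set and concatenate the results,
    which is the UNION ALL view of GROUPING SETS (toy model, not Hive code)."""
    result = []
    for grouping_set in grouping_sets:
        counts = Counter()
        for row in rows:
            # Columns outside the current grouping set are reported as NULL (None).
            key = tuple(row[c] if c in grouping_set else None for c in columns)
            counts[key] += 1
        result.extend(counts.items())
    return result


# Made-up sample rows; they are not taken from the employee_id table.
rows = [
    {"name": "Michael", "main_place": "Montreal"},
    {"name": "Will", "main_place": "Montreal"},
    {"name": "Lucy", "main_place": "Vancouver"},
]
columns = ["name", "main_place"]
sets = [("name", "main_place"), ("name",), ("main_place",), ()]

for key, cnt in grouping_sets_count(rows, columns, sets):
    print(key, cnt)
```

The outer loop plays the role of UNION ALL and the inner Counter plays the role of each GROUP BY; the blank set () produces the single overall total with every grouping column set to NULL.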

However, the GROUPING SETS operation still has unresolved issues when working with columns referred to through a table or record type alias (see Apache Jira HIVE-6950 at https://issues.apache.org/jira/browse/HIVE-6950). This is shown here:

jdbc:hive2://> SELECT sex_age.sex, sex_age.age,
. . . . . . .> count(name) AS name_cnt
. . . . . . .> FROM employee
. . . . . . .> GROUP BY sex_age.sex, sex_age.age
. . . . . . .> GROUPING SETS((sex_age.sex, sex_age.age));
Error: Error while compiling statement: FAILED: ParseException line 1:131
missing ) at ',' near '<EOF>'
line 1:145 extraneous input ')' expecting EOF near '<EOF>'
(state=42000,code=40000)


Advanced aggregation – ROLLUP and CUBE

The ROLLUP statement enables a SELECT statement to calculate multiple levels of aggregations across a specified group of dimensions. The ROLLUP statement is a simple extension to the GROUP BY clause with high efficiency and minimal overhead to a query. Compared to GROUPING SETS, which creates the specified levels of aggregations, ROLLUP creates n+1 levels of aggregations, where n is the number of grouping columns. First, it calculates the standard aggregate values specified in the GROUP BY clause. Then, it creates higher-level subtotals, moving from right to left through the list of grouping columns, as shown in the following example:

GROUP BY a, b, c WITH ROLLUP

This is equivalent to the following:

GROUP BY a, b, c GROUPING SETS ((a, b, c), (a, b), (a), ())

The CUBE statement takes a specified set of grouping columns and creates aggregations for all of their possible combinations. If n columns are specified for CUBE, there will be 2^n combinations of aggregations returned, as shown in the following example:

GROUP BY a, b, c WITH CUBE

This is equivalent to the following:

GROUP BY a, b, c GROUPING SETS ((a, b, c), (a, b), (b, c), (a, c), (a), (b), (c), ())
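The n+1 and 2^n counts follow directly from how the grouping sets are generated. A minimal Python sketch (the helper names rollup_sets and cube_sets are assumptions made up for this illustration) enumerates them:

```python
from itertools import combinations


def rollup_sets(columns):
    """ROLLUP: drop columns one at a time from the right, giving n + 1 sets."""
    return [tuple(columns[:i]) for i in range(len(columns), -1, -1)]


def cube_sets(columns):
    """CUBE: every subset of the grouping columns, giving 2 ** n sets."""
    return [
        subset
        for size in range(len(columns), -1, -1)
        for subset in combinations(columns, size)
    ]


cols = ["a", "b", "c"]
print(rollup_sets(cols))     # [('a', 'b', 'c'), ('a', 'b'), ('a',), ()]
print(len(cube_sets(cols)))  # 8
```

The cube_sets helper lists the same eight sets as the GROUPING SETS clause above, although not in the same order.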

The GROUPING__ID function works as an extension to distinguish entire rows from each other. It returns the decimal equivalent of the bit vector for the columns specified after GROUP BY. The returned decimal number is converted from a binary number of 1s and 0s, in which each bit represents whether the corresponding column takes part in the grouping for the row (the bit is 1 and the value is not NULL) or not (the bit is 0 and the value is NULL). The bits are counted starting from the column nearest to GROUP BY, which takes the lowest-order bit. In the following example, the first column is start_date:

jdbc:hive2://> SELECT GROUPING__ID,
. . . . . . .> BIN(CAST(GROUPING__ID AS BIGINT)) AS bit_vector,
. . . . . . .> name, start_date, count(employee_id) emp_id_cnt
. . . . . . .> FROM employee_hr
. . . . . . .> GROUP BY start_date, name
. . . . . . .> WITH CUBE ORDER BY start_date;
+---------------+-------------+----------+-------------+------------+
| grouping__id  | bit_vector  | name     | start_date  | emp_id_cnt |
+---------------+-------------+----------+-------------+------------+
| 2             | 10          | Steven   | NULL        | 1          |
| 2             | 10          | Michael  | NULL        | 1          |
| 2             | 10          | Lucy     | NULL        | 1          |
| 0             | 0           | NULL     | NULL        | 4          |
| 2             | 10          | Will     | NULL        | 1          |
| 3             | 11          | Lucy     | 2010-01-03  | 1          |
| 1             | 1           | NULL     | 2010-01-03  | 1          |
| 1             | 1           | NULL     | 2012-11-03  | 1          |
| 3             | 11          | Steven   | 2012-11-03  | 1          |
| 1             | 1           | NULL     | 2013-10-02  | 1          |
| 3             | 11          | Will     | 2013-10-02  | 1          |
| 1             | 1           | NULL     | 2014-01-29  | 1          |
| 3             | 11          | Michael  | 2014-01-29  | 1          |
+---------------+-------------+----------+-------------+------------+
13 rows selected (136.708 seconds)
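The grouping__id and bit_vector columns in this output can be recomputed with a short Python sketch. It assumes the bit convention visible in the output above (the first GROUP BY column is the lowest-order bit, and a bit is 1 when the column takes part in the grouping); other Hive versions may number the bits differently. The function name grouping_id is made up for this illustration.

```python
def grouping_id(group_by_columns, grouping_set):
    """Decimal value of the bit vector under the convention described above:
    the i-th GROUP BY column contributes 2**i when it is part of the grouping set."""
    return sum(
        1 << position
        for position, column in enumerate(group_by_columns)
        if column in grouping_set
    )


group_by = ["start_date", "name"]
for grouping_set in [(), ("start_date",), ("name",), ("start_date", "name")]:
    gid = grouping_id(group_by, grouping_set)
    print(grouping_set, gid, format(gid, "b"))
```

For example, the rows grouped by name only (start_date is NULL) get the bit vector 10, which is 2 in decimal, matching the Steven, Michael, Lucy, and Will rows with a NULL start_date.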


Aggregation condition – HAVING

Since Hive 0.7.0, HAVING has been available to support the conditional filtering of GROUP BY results. By using HAVING, we can avoid using a subquery after GROUP BY. The following is an example:

jdbc:hive2://> SELECT sex_age.age FROM employee
. . . . . . .> GROUP BY sex_age.age HAVING count(*) <= 1;
+--------------+
| sex_age.age  |
+--------------+
| 57           |
| 27           |
| 35           |
+--------------+
3 rows selected (74.376 seconds)

If we do not use HAVING, we can use a subquery instead, as follows:

jdbc:hive2://> SELECT a.age
. . . . . . .> FROM
. . . . . . .> (SELECT count(*) as cnt, sex_age.age
. . . . . . .> FROM employee GROUP BY sex_age.age
. . . . . . .> ) a WHERE a.cnt <= 1;
+--------+
| a.age  |
+--------+
| 57     |
| 27     |
| 35     |
+--------+
3 rows selected (87.298 seconds)


Analytic functions

Analytic functions, available since Hive 0.11.0, are a special group of functions that scan multiple input rows to compute each output value. Analytic functions are usually used with OVER, PARTITION BY, ORDER BY, and the windowing specification. Unlike the regular aggregate functions used with the GROUP BY clause, which are limited to one result value per group, analytic functions operate on windows where the input rows are ordered and grouped using flexible conditions expressed through an OVER PARTITION clause. Though analytic functions give aggregate results, they do not group the result set. They return the group value multiple times, once with each record. Analytic functions offer greater flexibility and functionality than the regular GROUP BY clause and make special aggregations in Hive easier and more powerful. The syntax for an analytic function is as follows:

Function (arg1, ..., argn) OVER ([PARTITION BY <...>] [ORDER BY <....>]
[<window_clause>])

Function (arg1, ..., argn) can be any function in the following list:

Standard aggregations: This can be either COUNT(), SUM(), MIN(), MAX(), or AVG().
RANK: It ranks items in a group, such as finding the top N rows for specific conditions.
DENSE_RANK: It is similar to RANK, but leaves no gaps in the ranking sequence when there are ties. For example, if we rank a match using DENSE_RANK and two players tie for second place, we would see that the two players are both in second place and that the next person is ranked third. The RANK function would also rank the two players in second place, but the next person would be in fourth place.
ROW_NUMBER: It assigns a unique sequence number starting from 1 to each row according to the partition and order specification.
CUME_DIST: It computes the number of rows whose value is smaller than or equal to the value of the current row, divided by the total number of rows.
PERCENT_RANK: It is similar to CUME_DIST, but it uses rank values rather than row counts in its numerator: (current rank - 1) divided by (total number of rows - 1). Therefore, it returns the percent rank of a value relative to a group of values.
NTILE: It divides an ordered dataset into a number of buckets and assigns an appropriate bucket number to each row. It can be used to divide rows into equal sets and assign a number to each row.
LEAD: The LEAD function, lead(value_expr[, offset[, default]]), is used to return data from a following row. The number of rows to lead (offset) can optionally be specified. If it is not specified, the lead is one row by default. When the lead for the current row extends beyond the end of the window, the function returns default, or NULL if the default is not specified.
LAG: The LAG function, lag(value_expr[, offset[, default]]), is used to access data from a previous row. The number of rows to lag (offset) can optionally be specified. If it is not specified, the lag is one row by default. When the lag for the current row extends beyond the beginning of the window, the function returns default, or NULL if the default is not specified.
FIRST_VALUE: It returns the first result from an ordered set.
LAST_VALUE: It returns the last result from an ordered set. For LAST_VALUE, using the default windowing clause, the result can be a little unexpected. This is because the default windowing clause is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which means the current row will always be the last value. Changing the windowing clause to RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING gives us the result we probably expected (see the last_value column in the following examples).

The [PARTITION BY <...>] statement is similar to the GROUP BY clause. It divides the rows into groups containing identical values in one or more partition-by columns. These logical groups are known as partitions, which is not the same as the term used for partitioned tables. Omitting the PARTITION BY statement applies the analytic operation to all the rows in the table.

The [ORDER BY <....>] clause is the same as the regular ORDER BY expr [ASC|DESC] clause. It makes sure the rows produced by the PARTITION BY clause are ordered by specifications, such as ascending or descending order. Right now, Hive only supports one ORDER BY column in this case. Otherwise, it will throw a semantic exception (see Apache Jira HIVE-4662 at https://issues.apache.org/jira/browse/HIVE-4662). The workaround is to use the rows unbounded preceding windowing clause (see the runningTotal2 column in the following examples):

Prepare the table and data for demonstration:

jdbc:hive2://> CREATE TABLE IF NOT EXISTS employee_contract
. . . . . . .> (
. . . . . . .> name string,
. . . . . . .> dept_num int,
. . . . . . .> employee_id int,
. . . . . . .> salary int,
. . . . . . .> type string,
. . . . . . .> start_date date
. . . . . . .> )
. . . . . . .> ROW FORMAT DELIMITED
. . . . . . .> FIELDS TERMINATED BY '|'
. . . . . . .> STORED AS TEXTFILE;
No rows affected (0.282 seconds)

jdbc:hive2://> LOAD DATA LOCAL INPATH
. . . . . . .> '/home/dayongd/Downloads/employee_contract.txt'
. . . . . . .> OVERWRITE INTO TABLE employee_contract;
No rows affected (0.48 seconds)

The regular aggregations are used as analytic functions, as follows:

jdbc:hive2://> SELECT name, dept_num, salary,
. . . . . . .> COUNT(*) OVER (PARTITION BY dept_num) AS row_cnt,
. . . . . . .> SUM(salary) OVER (PARTITION BY dept_num
. . . . . . .> ORDER BY dept_num) AS deptTotal,
. . . . . . .> SUM(salary) OVER (ORDER BY dept_num)
. . . . . . .> AS runningTotal1, SUM(salary)
. . . . . . .> OVER (ORDER BY dept_num, name rows unbounded
. . . . . . .> preceding) AS runningTotal2
. . . . . . .> FROM employee_contract
. . . . . . .> ORDER BY dept_num, name;
+-------+--------+------+-------+---------+-------------+-------------+
|name   |dept_num|salary|row_cnt|deptTotal|runningTotal1|runningTotal2|
+-------+--------+------+-------+---------+-------------+-------------+
|Lucy   |1000    |5500  |5      |24900    |24900        |5500         |
|Michael|1000    |5000  |5      |24900    |24900        |10500        |
|Steven |1000    |6400  |5      |24900    |24900        |16900        |
|Will   |1000    |4000  |5      |24900    |24900        |24900        |
|Will   |1000    |4000  |5      |24900    |24900        |20900        |
|Jess   |1001    |6000  |3      |17400    |42300        |30900        |
|Lily   |1001    |5000  |3      |17400    |42300        |35900        |
|Mike   |1001    |6400  |3      |17400    |42300        |42300        |
|Richard|1002    |8000  |3      |20500    |62800        |50300        |
|Wei    |1002    |7000  |3      |20500    |62800        |57300        |
|Yun    |1002    |5500  |3      |20500    |62800        |62800        |
+-------+--------+------+-------+---------+-------------+-------------+
11 rows selected (359.918 seconds)
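The difference between runningTotal1 and runningTotal2 comes from the window frame: with only ORDER BY, the default RANGE frame includes every row that ties with the current row on the ordering key, whereas rows unbounded preceding stops exactly at the current row. The following Python sketch models both; the function names are assumptions made up for this illustration, and the data is the first two departments from the output above.

```python
def running_total_range(rows, order_key, value):
    """Default frame with ORDER BY: RANGE BETWEEN UNBOUNDED PRECEDING AND
    CURRENT ROW, so all peers (rows with an equal ordering key) are included."""
    ordered = sorted(rows, key=order_key)
    return [
        sum(value(r) for r in ordered if order_key(r) <= order_key(current))
        for current in ordered
    ]


def running_total_rows(rows, order_key, value):
    """ROWS UNBOUNDED PRECEDING: sum up to and including the current row only."""
    totals, accumulated = [], 0
    for row in sorted(rows, key=order_key):
        accumulated += value(row)
        totals.append(accumulated)
    return totals


# (name, dept_num, salary) for departments 1000 and 1001
contracts = [
    ("Lucy", 1000, 5500), ("Michael", 1000, 5000), ("Steven", 1000, 6400),
    ("Will", 1000, 4000), ("Will", 1000, 4000),
    ("Jess", 1001, 6000), ("Lily", 1001, 5000), ("Mike", 1001, 6400),
]

print(running_total_range(contracts, lambda r: r[1], lambda r: r[2]))
print(running_total_rows(contracts, lambda r: (r[1], r[0]), lambda r: r[2]))
```

Ordering by dept_num alone makes all five rows of department 1000 peers, so each of them sees the full 24900, while the row-based frame grows one row at a time.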

Other analytic functions are used as follows:

jdbc:hive2://> SELECT name, dept_num, salary,
. . . . . . .> RANK() OVER (PARTITION BY dept_num ORDER BY salary)
. . . . . . .> AS rank,
. . . . . . .> DENSE_RANK()
. . . . . . .> OVER (PARTITION BY dept_num ORDER BY salary)
. . . . . . .> AS dense_rank, ROW_NUMBER() OVER () AS row_num,
. . . . . . .> ROUND((CUME_DIST() OVER (PARTITION BY dept_num
. . . . . . .> ORDER BY salary)), 1) AS cume_dist,
. . . . . . .> PERCENT_RANK() OVER (PARTITION BY dept_num
. . . . . . .> ORDER BY salary) AS percent_rank, NTILE(4)
. . . . . . .> OVER (PARTITION BY dept_num ORDER BY salary)
. . . . . . .> AS ntile
. . . . . . .> FROM employee_contract ORDER BY dept_num;
+-------+--------+------+----+----------+-------+---------+------------+-----+
|name   |dept_num|salary|rank|dense_rank|row_num|cume_dist|percent_rank|ntile|
+-------+--------+------+----+----------+-------+---------+------------+-----+
|Will   |1000    |4000  |1   |1         |11     |0.4      |0.0         |1    |
|Will   |1000    |4000  |1   |1         |10     |0.4      |0.0         |1    |
|Michael|1000    |5000  |3   |2         |9      |0.6      |0.5         |2    |
|Lucy   |1000    |5500  |4   |3         |8      |0.8      |0.75        |3    |
|Steven |1000    |6400  |5   |4         |7      |1.0      |1.0         |4    |
|Lily   |1001    |5000  |1   |1         |6      |0.3      |0.0         |1    |
|Jess   |1001    |6000  |2   |2         |5      |0.7      |0.5         |2    |
|Mike   |1001    |6400  |3   |3         |4      |1.0      |1.0         |3    |
|Yun    |1002    |5500  |1   |1         |3      |0.3      |0.0         |1    |
|Wei    |1002    |7000  |2   |2         |2      |0.7      |0.5         |2    |
|Richard|1002    |8000  |3   |3         |1      |1.0      |1.0         |3    |
+-------+--------+------+----+----------+-------+---------+------------+-----+
11 rows selected (367.112 seconds)
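The ranking columns above can be recomputed for a single partition with a small Python model. This is a sketch of the definitions only, not of Hive's implementation; the function names rank_stats and ntile are assumptions made up for this illustration, and the input is the list of salaries in department 1000.

```python
def rank_stats(values):
    """Compute RANK, DENSE_RANK, CUME_DIST and PERCENT_RANK for one partition,
    ordered ascending by value."""
    ordered = sorted(values)
    total = len(ordered)
    stats = []
    for position, value in enumerate(ordered):
        rank = ordered.index(value) + 1             # ties share the lowest position
        dense_rank = len(set(ordered[:position + 1]))  # distinct values seen so far
        cume_dist = sum(1 for v in ordered if v <= value) / total
        percent_rank = (rank - 1) / (total - 1) if total > 1 else 0.0
        stats.append((value, rank, dense_rank, cume_dist, percent_rank))
    return stats


def ntile(row_count, buckets):
    """Bucket number for each ordered row; earlier buckets take the extra rows."""
    base, extra = divmod(row_count, buckets)
    numbers = []
    for bucket in range(1, buckets + 1):
        numbers.extend([bucket] * (base + (1 if bucket <= extra else 0)))
    return numbers


dept_1000 = [5500, 5000, 6400, 4000, 4000]
for row in rank_stats(dept_1000):
    print(row)
print(ntile(len(dept_1000), 4))  # [1, 1, 2, 3, 4]
```

For the salary 5000, the rank is 3 and there are five rows, so PERCENT_RANK is (3 - 1) / (5 - 1) = 0.5 and CUME_DIST is 3 / 5 = 0.6, in line with the Michael row above.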

jdbc:hive2://> SELECT name, dept_num, salary,
. . . . . . .> LEAD(salary, 2) OVER (PARTITION BY dept_num
. . . . . . .> ORDER BY salary) AS lead,
. . . . . . .> LAG(salary, 2, 0) OVER (PARTITION BY dept_num
. . . . . . .> ORDER BY salary) AS lag,
. . . . . . .> FIRST_VALUE(salary) OVER (PARTITION BY dept_num
. . . . . . .> ORDER BY salary) AS first_value,
. . . . . . .> LAST_VALUE(salary) OVER (PARTITION BY dept_num
. . . . . . .> ORDER BY salary) AS last_value_default,
. . . . . . .> LAST_VALUE(salary) OVER (PARTITION BY dept_num
. . . . . . .> ORDER BY salary
. . . . . . .> RANGE BETWEEN UNBOUNDED PRECEDING
. . . . . . .> AND UNBOUNDED FOLLOWING) AS last_value
. . . . . . .> FROM employee_contract ORDER BY dept_num;
+-------+--------+------+----+----+-----------+------------------+----------+
|name   |dept_num|salary|lead|lag |first_value|last_value_default|last_value|
+-------+--------+------+----+----+-----------+------------------+----------+
|Will   |1000    |4000  |5000|0   |4000       |4000              |6400      |
|Will   |1000    |4000  |5500|0   |4000       |4000              |6400      |
|Michael|1000    |5000  |6400|4000|4000       |5000              |6400      |
|Lucy   |1000    |5500  |NULL|4000|4000       |5500              |6400      |
|Steven |1000    |6400  |NULL|5000|4000       |6400              |6400      |
|Lily   |1001    |5000  |6400|0   |5000       |5000              |6400      |
|Jess   |1001    |6000  |NULL|0   |5000       |6000              |6400      |
|Mike   |1001    |6400  |NULL|5000|5000       |6400              |6400      |
|Yun    |1002    |5500  |8000|0   |5500       |5500              |8000      |
|Wei    |1002    |7000  |NULL|0   |5500       |7000              |8000      |
|Richard|1002    |8000  |NULL|5500|5500       |8000              |8000      |
+-------+--------+------+----+----+-----------+------------------+----------+
11 rows selected (92.572 seconds)
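The lead, lag, and last_value columns above can be modeled on one ordered partition in Python. This is an illustrative sketch: the function names are assumptions made up here, None stands in for NULL, and the input is the ordered salary list of department 1000.

```python
def lead(values, offset=1, default=None):
    """Value `offset` rows after the current row, or `default` past the end."""
    return [
        values[i + offset] if i + offset < len(values) else default
        for i in range(len(values))
    ]


def lag(values, offset=1, default=None):
    """Value `offset` rows before the current row, or `default` before the start."""
    return [
        values[i - offset] if i - offset >= 0 else default
        for i in range(len(values))
    ]


def last_value_default_frame(ordered):
    """RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: the frame ends at the
    last peer of the current row, so the last value equals the current value."""
    return [[v for v in ordered if v <= current][-1] for current in ordered]


def last_value_full_frame(ordered):
    """RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: the frame is
    the whole partition, so every row sees the final value."""
    return [ordered[-1]] * len(ordered)


salaries = [4000, 4000, 5000, 5500, 6400]  # department 1000, ordered by salary
print(lead(salaries, 2))                   # [5000, 5500, 6400, None, None]
print(lag(salaries, 2, 0))                 # [0, 0, 4000, 4000, 5000]
print(last_value_default_frame(salaries))  # [4000, 4000, 5000, 5500, 6400]
print(last_value_full_frame(salaries))     # [6400, 6400, 6400, 6400, 6400]
```

The third line shows why last_value_default simply echoes the salary column, and the fourth shows the effect of widening the frame to the whole partition.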

The [<window_clause>] clause is used to further subpartition the result and apply the analytic functions. There are two types of windows: row type windows and range type windows.

Note

According to the Jira issue at https://issues.apache.org/jira/browse/HIVE-4797, the RANK, NTILE, DENSE_RANK, CUME_DIST, PERCENT_RANK, LEAD, LAG, and ROW_NUMBER functions do not support being used with a window clause yet.

For row type windows, the definition is in terms of row numbers before or after the current row. The general syntax of the row window clause is as follows:

ROWS BETWEEN <start_expr> AND <end_expr>

The <start_expr> can be any one of the following:

UNBOUNDED PRECEDING
CURRENT ROW
N PRECEDING or FOLLOWING

The <end_expr> can be any one of the following:

UNBOUNDED FOLLOWING
CURRENT ROW
N PRECEDING or FOLLOWING

The following are the window expressions:

BETWEEN … AND: Use the BETWEEN … AND clause to specify the start point and endpoint for the window. The first expression (before AND) defines the start point and the second expression (after AND) defines the endpoint. If we omit BETWEEN … AND (such as ROWS N PRECEDING or ROWS UNBOUNDED PRECEDING), Hive considers the given expression as the start point, and the endpoint defaults to the current row (see the win13 column in the upcoming examples).
N PRECEDING or FOLLOWING: This indicates N rows before or after the current row.
UNBOUNDED PRECEDING: This indicates that the window starts at the first row of the partition. This is the start point specification and cannot be used as an endpoint specification.
UNBOUNDED FOLLOWING: This indicates that the window ends at the last row of the partition. This is the endpoint specification and cannot be used as a start point specification.
UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: This indicates the first and last row for every row, meaning all rows in the partition, or all rows in the table when there is no PARTITION BY (see the win12 column in the upcoming examples).
CURRENT ROW: As a start point, CURRENT ROW specifies that the window begins at the current row or value depending on whether we have specified ROWS or RANGE (RANGE is introduced later in this chapter). In this case, the endpoint cannot be N PRECEDING. As an endpoint, CURRENT ROW specifies that the window ends at the current row or value depending on whether we have specified ROWS or RANGE. In this case, the start point cannot be N FOLLOWING.

The following is a diagram that can help us understand the preceding definitions more clearly:

Figure: Window expression definition

The following examples implement the window expressions:

jdbc:hive2://> SELECT name, dept_num AS dept, salary AS sal,
. . . . . . .> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
. . . . . . .> name ROWS
. . . . . . .> BETWEEN 2 PRECEDING AND CURRENT ROW) win1,
. . . . . . .> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
. . . . . . .> name ROWS
. . . . . . .> BETWEEN 2 PRECEDING AND UNBOUNDED FOLLOWING) win2,
. . . . . . .> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
. . . . . . .> name ROWS
. . . . . . .> BETWEEN 1 PRECEDING AND 2 FOLLOWING) win3,
. . . . . . .> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
. . . . . . .> name ROWS
. . . . . . .> BETWEEN 1 PRECEDING AND 2 PRECEDING) win4,
. . . . . . .> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
. . . . . . .> name ROWS
. . . . . . .> BETWEEN 1 FOLLOWING AND 2 FOLLOWING) win5,
. . . . . . .> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
. . . . . . .> name ROWS
. . . . . . .> BETWEEN CURRENT ROW AND CURRENT ROW) win7,
. . . . . . .> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
. . . . . . .> name ROWS
. . . . . . .> BETWEEN CURRENT ROW AND 1 FOLLOWING) win8,
. . . . . . .> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
. . . . . . .> name ROWS
. . . . . . .> BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) win9,
. . . . . . .> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
. . . . . . .> name ROWS
. . . . . . .> BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) win10,
. . . . . . .> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
. . . . . . .> name ROWS
. . . . . . .> BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING) win11,
. . . . . . .> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
. . . . . . .> name ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED
. . . . . . .> FOLLOWING) win12,
. . . . . . .> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
. . . . . . .> name ROWS 2 PRECEDING) win13
. . . . . . .> FROM employee_contract
. . . . . . .> ORDER BY dept_num, name;

+-------+----+----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+
|name   |dept|sal |win1|win2|win3|win4|win5|win7|win8|win9|win10|win11|win12|win13|
+-------+----+----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+
|Lucy   |1000|5500|5500|6400|6400|NULL|6400|5500|5500|6400|5500 |5500 |6400 |5500 |
|Michael|1000|5000|5500|6400|6400|NULL|6400|5000|6400|6400|5500 |6400 |6400 |5500 |
|Steven |1000|6400|6400|6400|6400|NULL|4000|6400|6400|6400|6400 |6400 |6400 |6400 |
|Will   |1000|4000|6400|6400|4000|NULL|NULL|4000|4000|4000|6400 |6400 |6400 |6400 |
|Will   |1000|4000|6400|6400|6400|NULL|4000|4000|4000|4000|6400 |6400 |6400 |6400 |
|Jess   |1001|6000|6000|6400|6400|NULL|6400|6000|6000|6400|6000 |6000 |6400 |6000 |
|Lily   |1001|5000|6000|6400|6400|NULL|6400|5000|6400|6400|6000 |6400 |6400 |6000 |
|Mike   |1001|6400|6400|6400|6400|NULL|NULL|6400|6400|6400|6400 |6400 |6400 |6400 |
|Richard|1002|8000|8000|8000|8000|NULL|7000|8000|8000|8000|8000 |8000 |8000 |8000 |
|Wei    |1002|7000|8000|8000|8000|NULL|5500|7000|7000|7000|8000 |8000 |8000 |8000 |
|Yun    |1002|5500|8000|8000|7000|NULL|NULL|5500|5500|5500|8000 |8000 |8000 |8000 |
+-------+----+----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+
11 rows selected (168.732 seconds)

From the preceding example, we can see that the win4 column is NULL. This is because the row specified by <start_expr> must come before the row specified by <end_expr>. However, if we try to fix this by reordering the boundaries, especially when using the PRECEDING keyword, Hive reports the following exceptions, and the same applies to UNBOUNDED PRECEDING. This is a known issue (https://issues.apache.org/jira/browse/HIVE-9412) in Hive windowing at the time of writing:

jdbc:hive2://> SELECT name, dept_num, salary,
.......> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
.......> name ROWS
.......> BETWEEN 2 PRECEDING AND 1 PRECEDING) win4_alter
.......> FROM employee_contract
.......> ORDER BY dept_num, name;
Error: Error while compiling statement: FAILED: SemanticException Failed to
breakup Windowing invocations into Groups. At least 1 group must only
depend on input columns. Also check for circular dependencies.
Underlying error: Window range invalid, start boundary is greater than end
boundary: window(start=range(2 PRECEDING), end=range(1 PRECEDING))
(state=42000,code=40000)

jdbc:hive2://> SELECT name, dept_num, salary,
.......> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
.......> name ROWS
.......> BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) win1
.......> FROM employee_contract
.......> ORDER BY dept_num, name;
Error: Error while compiling statement: FAILED: SemanticException End of a
WindowFrame cannot be UNBOUNDED PRECEDING (state=42000,code=40000)

In addition, windows can be defined in a separate WINDOW clause or referred to by other windows, as follows:

jdbc:hive2://> SELECT name, dept_num, salary,
.......> MAX(salary) OVER w1 AS win1,
.......> MAX(salary) OVER w1 AS win2,
.......> MAX(salary) OVER w1 AS win3
.......> FROM employee_contract
.......> ORDER BY dept_num, name
.......> WINDOW
.......> w1 AS (PARTITION BY dept_num ORDER BY name
.......> ROWS BETWEEN 2 PRECEDING AND CURRENT ROW),
.......> w2 AS w3,
.......> w3 AS (PARTITION BY dept_num ORDER BY name
.......> ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING);
+---------+----------+--------+------+------+------+
| name    | dept_num | salary | win1 | win2 | win3 |
+---------+----------+--------+------+------+------+
| Lucy    | 1000     | 5500   | 5500 | 5500 | 5500 |
| Michael | 1000     | 5000   | 5500 | 5500 | 5500 |
| Steven  | 1000     | 6400   | 6400 | 6400 | 6400 |
| Will    | 1000     | 4000   | 6400 | 6400 | 6400 |
| Will    | 1000     | 4000   | 6400 | 6400 | 6400 |
| Jess    | 1001     | 6000   | 6000 | 6000 | 6000 |
| Lily    | 1001     | 5000   | 6000 | 6000 | 6000 |
| Mike    | 1001     | 6400   | 6400 | 6400 | 6400 |
| Richard | 1002     | 8000   | 8000 | 8000 | 8000 |
| Wei     | 1002     | 7000   | 8000 | 8000 | 8000 |
| Yun     | 1002     | 5500   | 8000 | 8000 | 8000 |
+---------+----------+--------+------+------+------+
11 rows selected (156.902 seconds)
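As a variation on the preceding query, the following sketch lets each aggregate refer to a different named window, so the three result columns may differ from one another. It assumes the same employee_contract table. The column aliases are illustrative only, and no output is shown because the query has not been run here.

```sql
-- Each aggregate references a different named window.
SELECT name, dept_num, salary,
  MAX(salary) OVER w1 AS max_prev2,   -- current row and the two rows before it
  MAX(salary) OVER w2 AS max_alias,   -- w2 reuses the definition of w3
  MAX(salary) OVER w3 AS max_near     -- one row before through two rows after
FROM employee_contract
WINDOW
  w1 AS (PARTITION BY dept_num ORDER BY name
         ROWS BETWEEN 2 PRECEDING AND CURRENT ROW),
  w2 AS w3,
  w3 AS (PARTITION BY dept_num ORDER BY name
         ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING);
```

Since w2 is only an alias of w3, max_alias and max_near should always agree. Naming windows this way keeps long frame definitions in one place.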

Whereas row type windows are defined in terms of rows, range type windows are defined in terms of values before or after the current value of the ORDER BY column, which must be a number or date type. For now, range type windows support only one ORDER BY column.

jdbc:hive2://> SELECT name, salary, start_year,
.......> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
.......> start_year RANGE
.......> BETWEEN 2 PRECEDING AND CURRENT ROW) win1
.......> FROM
.......> (
.......> SELECT name, salary, dept_num,
.......> YEAR(start_date) AS start_year
.......> FROM employee_contract
.......> ) a;
+---------+--------+------------+------+
| name    | salary | start_year | win1 |
+---------+--------+------------+------+
| Lucy    | 5500   | 2010       | 5500 |
| Steven  | 6400   | 2012       | 6400 |
| Will    | 4000   | 2013       | 6400 |
| Will    | 4000   | 2014       | 6400 |
| Michael | 5000   | 2014       | 6400 |
| Mike    | 6400   | 2013       | 6400 |
| Jess    | 6000   | 2014       | 6400 |
| Lily    | 5000   | 2014       | 6400 |
| Wei     | 7000   | 2010       | 7000 |
| Richard | 8000   | 2013       | 8000 |
| Yun     | 5500   | 2014       | 8000 |
+---------+--------+------------+------+
11 rows selected (92.035 seconds)

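Note

If we omit the windowing clause entirely but keep ORDER BY, the default window is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. If both ORDER BY and the windowing clause are omitted, the window covers the whole partition (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING).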
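The default window can be made visible with a small sketch that computes the same running total twice, once with the frame omitted and once with it spelled out. It assumes the employee_contract table used earlier. The aliases total_default and total_explicit are made up for illustration.

```sql
-- With ORDER BY but no frame clause, the frame defaults to
-- RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so both
-- columns are expected to hold the same running total.
SELECT name, dept_num, salary,
  SUM(salary) OVER (PARTITION BY dept_num ORDER BY salary)
    AS total_default,
  SUM(salary) OVER (PARTITION BY dept_num ORDER BY salary
                    RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
    AS total_explicit
FROM employee_contract;
```

Because this is a RANGE frame, rows that tie on salary within a department are peers and receive the same running total.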


Sampling

When the data volume is extremely large, we may need to work with a subset of the data to speed up analysis. Sampling is a technique used to select and analyze a subset of data in order to identify patterns and trends. In Hive, there are three ways of sampling data: random sampling, bucket table sampling, and block sampling.

Random sampling uses the RAND() function and the LIMIT keyword to get a sample of data, as shown in the following example. The DISTRIBUTE and SORT keywords are used here to make sure the data is also randomly distributed among mappers and reducers efficiently. The ORDER BY RAND() statement can achieve the same purpose, but it performs poorly because ORDER BY forces a global sort through a single reducer:

SELECT * FROM <Table_Name> DISTRIBUTE BY RAND() SORT BY RAND()
LIMIT <N rows to sample>;
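Filling in the template gives a concrete sketch. It assumes the employee_id_buckets table that appears in the later examples, and the sample size of 5 is arbitrary.

```sql
-- Return up to 5 randomly chosen names.
-- DISTRIBUTE BY RAND() spreads rows randomly across reducers;
-- SORT BY RAND() shuffles the rows within each reducer.
SELECT name
FROM employee_id_buckets
DISTRIBUTE BY RAND() SORT BY RAND()
LIMIT 5;
```

Each run is expected to return a different set of rows, since RAND() is not seeded here.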

Bucket table sampling is a special kind of sampling optimized for bucket tables, as shown in the following syntax and example. The colname value specifies the column on which to sample the data. The RAND() function can also be used when sampling is on entire rows. If the sample column is also the CLUSTERED BY column, the TABLESAMPLE statement will be more efficient.

--Syntax
SELECT * FROM <Table_Name>
TABLESAMPLE(BUCKET <specified bucket number to sample> OUT OF <total number
of buckets> ON [colname|RAND()]) table_alias;

--An example
jdbc:hive2://> SELECT name FROM employee_id_buckets
.......> TABLESAMPLE(BUCKET 1 OUT OF 2 ON rand()) a;
+---------+
| name    |
+---------+
| Lucy    |
| Shelley |
| Lucy    |
| Lucy    |
| Shelley |
| Lucy    |
| Will    |
| Shelley |
| Michael |
| Will    |
| Will    |
| Will    |
| Will    |
| Will    |
| Lucy    |
+---------+
15 rows selected (0.07 seconds)
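For comparison, here is a sketch of sampling on a column instead of on RAND(). It assumes that employee_id_buckets has an employee_id column and was created with CLUSTERED BY (employee_id) INTO 2 BUCKETS. The excerpt above does not show the table definition, so treat both as assumptions.

```sql
-- Sample the first of two buckets, hashing on employee_id.
-- If employee_id is also the CLUSTERED BY column, Hive can read
-- just the matching bucket file rather than scanning the table.
SELECT name
FROM employee_id_buckets
TABLESAMPLE(BUCKET 1 OUT OF 2 ON employee_id) a;
```

Unlike sampling on RAND(), sampling on a column is repeatable: the same rows come back on every run as long as the data does not change.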

Block sampling allows Hive to randomly pick N rows of data, a percentage (n percent) of the data size, or N bytes of data. The sampling granularity is the HDFS block size. Its syntax and examples are as follows:

--Syntax
SELECT *
FROM <Table_Name> TABLESAMPLE(N PERCENT|ByteLengthLiteral|N ROWS) s;

--ByteLengthLiteral
--(Digit)+ ('b' | 'B' | 'k' | 'K' | 'm' | 'M' | 'g' | 'G')

--Sample by rows
jdbc:hive2://> SELECT name
.......> FROM employee_id_buckets TABLESAMPLE(4 ROWS) a;
+---------+
| name    |
+---------+
| Lucy    |
| Shelley |
| Lucy    |
| Shelley |
+---------+
4 rows selected (0.055 seconds)

--Sample by percentage of data size
jdbc:hive2://> SELECT name
.......> FROM employee_id_buckets TABLESAMPLE(10 PERCENT) a;
+---------+
| name    |
+---------+
| Lucy    |
| Shelley |
| Lucy    |
+---------+
3 rows selected (0.061 seconds)

--Sample by data size
jdbc:hive2://> SELECT name
.......> FROM employee_id_buckets TABLESAMPLE(3M) a;
+---------+
| name    |
+---------+
| Lucy    |
| Shelley |
| Lucy    |
| Shelley |
| Lucy    |
| Shelley |
| Lucy    |
| Shelley |
| Lucy    |
| Will    |
| Shelley |
| Lucy    |
| Will    |
| Shelley |
| Michael |
| Will    |
| Shelley |
| Lucy    |
| Will    |
| Will    |
| Will    |
| Will    |
| Will    |
| Lucy    |
| Shelley |
+---------+
25 rows selected (0.07 seconds)


Summary

In this chapter, we covered how to aggregate data using basic aggregation functions. Then, we introduced advanced aggregations with GROUPING SETS, ROLLUP, and CUBE, as well as aggregation conditions using HAVING. We also covered the various analytic functions and windowing clauses. At the end of the chapter, we introduced three ways of sampling data in Hive. After going through this chapter, you should be able to do basic and advanced aggregations and data sampling in Hive.

In the next chapter, we’ll talk about performance considerations in Hive.


Chapter 7. Performance Considerations

Although Hive is built to deal with big data, we still cannot ignore the importance of performance. Most of the time, a Hive query can rely on the smart query optimizer to find the best execution strategy, as well as on the default settings and best practices from vendor packages. However, as experienced users, we should learn more about the theory and practice of performance tuning in Hive, especially when working in a performance-critical project or environment. In this chapter, we will start with the utilities available in Hive to find potential issues causing poor performance. Then, we introduce best practices for performance in the areas of design, file format, compression, storage, query, and job.

In this chapter, we will cover the following topics:

Performance utilities

Design optimization

Data file optimization

Job and query optimization


Performance utilities

Hive provides the EXPLAIN and ANALYZE statements, which can be used as utilities to check and identify the performance of queries.


The EXPLAIN statement

Hive provides an EXPLAIN command to return a query execution plan without running the query. We can use EXPLAIN for queries whenever we have a doubt or a concern about performance. It also helps to see the difference between two or more queries written for the same purpose. The syntax for EXPLAIN is as follows:

EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] hive_query

The following keywords can be used:

EXTENDED: This provides additional information for the operators in the plan, such as the file path name and the abstract syntax tree.

DEPENDENCY: This provides JSON format output that contains a list of tables and partitions that the query depends on. It is available since Hive 0.10.0.

AUTHORIZATION: This lists all entities that need to be authorized, including the inputs and outputs required to run the Hive query, and authorization failures, if any. It is available since Hive 0.14.0.

A typical query plan contains the following three sections. We will also have a look at an example later:

Abstract syntax tree (AST): Hive uses a parser generator called ANTLR (see http://www.antlr.org/) to automatically generate a syntax tree for HQL. We can usually ignore this section.

Stage dependencies: This lists all dependencies and the number of stages used to run the query.

Stage plans: This contains important information, such as operators and sort orders, for running the job.

The following is what a typical query plan looks like. From the following example, we can see that the AST section is not shown since the EXTENDED keyword is not used with EXPLAIN. In the STAGE DEPENDENCIES section, both Stage-0 and Stage-1 are independent root stages. In the STAGE PLANS section, Stage-1 has one map and one reduce, referred to by Map Operator Tree and Reduce Operator Tree. Inside each Map/Reduce Operator Tree section, all operators corresponding to Hive query keywords, as well as expressions and aggregations, are listed. Stage-0 does not have a map or reduce. It is just a Fetch operation.

jdbc:hive2://> EXPLAIN SELECT sex_age.sex, count(*)
.......> FROM employee_partitioned
.......> WHERE year=2014 GROUP BY sex_age.sex LIMIT 2;

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: employee_partitioned
            Statistics: Num rows: 0 Data size: 227 Basic stats: PARTIAL Column stats: NONE
            Select Operator
              expressions: sex_age (type: struct<sex:string,age:int>)
              outputColumnNames: sex_age
              Statistics: Num rows: 0 Data size: 227 Basic stats: PARTIAL Column stats: NONE
              Group By Operator
                aggregations: count()
                keys: sex_age.sex (type: string)
                mode: hash
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 0 Data size: 227 Basic stats: PARTIAL Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: string)
                  Statistics: Num rows: 0 Data size: 227 Basic stats: PARTIAL Column stats: NONE
                  value expressions: _col1 (type: bigint)
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0)
          keys: KEY._col0 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
          Select Operator
            expressions: _col0 (type: string), _col1 (type: bigint)
            outputColumnNames: _col0, _col1
            Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
            Limit
              Number of rows: 2
              Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: 2

53 rows selected (0.26 seconds)
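The other EXPLAIN keywords can be tried on the same query. The following is a sketch only: it assumes the employee_partitioned table from the example above, and the output is omitted because it depends on the Hive version and the data.

```sql
-- List the tables and partitions the query would read, in JSON format.
EXPLAIN DEPENDENCY
SELECT sex_age.sex, count(*)
FROM employee_partitioned
WHERE year = 2014
GROUP BY sex_age.sex;

-- Show the same plan with additional operator details, such as file paths.
EXPLAIN EXTENDED
SELECT sex_age.sex, count(*)
FROM employee_partitioned
WHERE year = 2014
GROUP BY sex_age.sex;
```

EXPLAIN DEPENDENCY is a quick way to confirm that a partition filter such as year = 2014 really limits the partitions being read.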


The ANALYZE statement

Hive statistics are a collection of data that describes more details, such as the number of rows, number of files, and raw data size, about the objects in the Hive database. Statistics are metadata about Hive data. Hive supports statistics at the table, partition, and column level. These statistics serve as an input to the Hive Cost-Based Optimizer (CBO), which is an optimizer that picks the query plan with the lowest cost in terms of the system resources required to complete the query.

The statistics are gathered through the ANALYZE statement, available since Hive 0.10.0, on tables, partitions, and columns, as given in the following examples:

jdbc:hive2://> ANALYZE TABLE employee COMPUTE STATISTICS;
No rows affected (27.979 seconds)

jdbc:hive2://> ANALYZE TABLE employee_partitioned
.......> PARTITION(year=2014, month=12) COMPUTE STATISTICS;
No rows affected (45.054 seconds)

jdbc:hive2://> ANALYZE TABLE employee_id COMPUTE STATISTICS
.......> FOR COLUMNS employee_id;
No rows affected (41.074 seconds)

Once the statistics are built, we can check them with the DESCRIBE EXTENDED/FORMATTED statement. From the table/partition output, we can find the statistics information inside the parameters, such as parameters:{numFiles=1, COLUMN_STATS_ACCURATE=true, transient_lastDdlTime=1417726247, numRows=4, totalSize=227, rawDataSize=223}. The following is an example:

jdbc:hive2://> DESCRIBE EXTENDED employee_partitioned
.......> PARTITION(year=2014, month=12);

jdbc:hive2://> DESCRIBE EXTENDED employee;
parameters:{numFiles=1, COLUMN_STATS_ACCURATE=true,
transient_lastDdlTime=1417726247, numRows=4, totalSize=227,
rawDataSize=223}

jdbc:hive2://> DESCRIBE FORMATTED employee.name;
+----------+-----------+-----+-----+-----------+----------------+-------------+-------------+-----------+------------+-------------------+
| col_name | data_type | min | max | num_nulls | distinct_count | avg_col_len | max_col_len | num_trues | num_falses | comment           |
+----------+-----------+-----+-----+-----------+----------------+-------------+-------------+-----------+------------+-------------------+
| name     | string    |     |     | 0         | 5              | 5.6         | 7           |           |            | from deserializer |
+----------+-----------+-----+-----+-----------+----------------+-------------+-------------+-----------+------------+-------------------+
3 rows selected (0.116 seconds)

Hive statistics are persisted in the metastore to avoid computing them every time. For newly created tables and/or partitions, statistics are computed automatically when the following setting is enabled:

jdbc:hive2://> SET hive.stats.autogather=true;
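Gathered statistics only help if the optimizer is allowed to use them. The sketch below shows session settings that are commonly paired with ANALYZE. The properties exist in Hive 0.13/0.14 and later, but their availability and defaults vary by version and vendor package, so check them against your own installation. The employee table is the one used in the preceding examples.

```sql
-- Collect only file-level statistics (number and size of files) without scanning rows.
ANALYZE TABLE employee COMPUTE STATISTICS NOSCAN;

-- Let the cost-based optimizer use the statistics when planning.
SET hive.cbo.enable=true;
SET hive.stats.fetch.column.stats=true;

-- Answer simple aggregates such as count(*) from statistics where possible.
SET hive.compute.query.using.stats=true;
```

With hive.compute.query.using.stats enabled, stale statistics can produce stale answers, so it is safest on tables whose statistics are refreshed after every load.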

Note

Hive logs

Logs provide useful information to find out how a Hive query/job runs. By checking the Hive logs, we can identify runtime problems and issues that may cause bad performance. There are two types of logs available in Hive: the system log and the job log.

The system log contains the Hive running status and issues. It is configured in {HIVE_HOME}/conf/hive-log4j.properties. The following three lines for the Hive log can be found there:

hive.root.logger=WARN,DRFA

hive.log.dir=/tmp/${user.name}

hive.log.file=hive.log

To modify the logging level, we can either modify the preceding lines in hive-log4j.properties (applies to all users) or set it from the Hive CLI (only applies to the current user and current session) as follows:

hive --hiveconf hive.root.logger=DEBUG,console

The job log contains Hive query information and is saved in the same place, /tmp/${user.name}, by default, as one file for each Hive user session. We can override it in hive-site.xml with the hive.querylog.location property. If a Hive query generates MapReduce jobs, those logs can also be viewed through the Hadoop JobTracker Web UI.


Design optimization

Design optimization covers several data layout and design strategies to improve performance.


Partition tables

Hive partitioning is one of the most effective methods to improve query performance on larger tables. A query with partition filtering will only load the data in the specified partitions (subdirectories), so it can execute much faster than a normal query that filters by a non-partitioning field. The selection of the partition key is always an important factor for performance. It should always be a low-cardinality attribute to avoid the overhead of too many subdirectories.

The following are some dimensions commonly used as partition keys:

Partitions by date and time: Use date and time, such as year, month, and day (even hours), as partition keys when data is associated with the time dimension

Partitions by locations: Use country, territory, state, and city as partition keys when data is location related

Partitions by business logics: Use department, sales region, applications, customers, and so on as partition keys when data can be separated evenly by some business logic
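To make the date-and-time case concrete, here is a minimal sketch. The table sales_partitioned and its columns are hypothetical and chosen only for illustration.

```sql
-- Hypothetical table partitioned by two low-cardinality date parts.
CREATE TABLE sales_partitioned (
  order_id INT,
  amount   DOUBLE
)
PARTITIONED BY (year INT, month INT);

-- Filters on the partition columns let Hive read only the
-- year=2014/month=12 subdirectory instead of the whole table.
SELECT SUM(amount)
FROM sales_partitioned
WHERE year = 2014 AND month = 12;

-- A filter on a regular column gives no such pruning:
-- every partition has to be scanned.
SELECT SUM(amount)
FROM sales_partitioned
WHERE order_id > 1000;
```

Year and month together produce only 12 subdirectories per year, which keeps the partition count manageable. Partitioning by order_id would create one subdirectory per order.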


Bucket tables

Similar to partitioning, a bucket table organizes data into separate files in HDFS. Bucketing can speed up data sampling in Hive when sampling on buckets. Bucketing can also improve join performance if the join keys are also bucket keys, because bucketing ensures that a given key is present in a certain bucket. More details are given in the Job and query optimization section in this chapter.
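A bucket table might be declared and loaded as in the following sketch. The tables sales_bucketed and sales_staging, the column names, and the bucket count of 8 are all hypothetical.

```sql
-- Hypothetical table hashed on customer_id into 8 bucket files.
CREATE TABLE sales_bucketed (
  order_id    INT,
  customer_id INT,
  amount      DOUBLE
)
CLUSTERED BY (customer_id) INTO 8 BUCKETS;

-- In Hive versions before 2.0, bucketing has to be enforced at insert
-- time so that the number of reducers matches the number of buckets.
SET hive.enforce.bucketing=true;

-- Populate the buckets from a (hypothetical) staging table.
INSERT OVERWRITE TABLE sales_bucketed
SELECT order_id, customer_id, amount
FROM sales_staging;
```

Choosing the join key (here customer_id) as the bucket key is what allows bucket-based join optimizations to apply later.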


Index

Indexes are very common in RDBMSs when we want to speed up access to a column or set of columns. Hive supports index creation on tables/partitions since Hive 0.7.0. An index in Hive provides a key-based data view and better data access for certain operations, such as WHERE, GROUP BY, and JOIN. We can use an index as a cheaper alternative to full table scans. The command to create an index in Hive is straightforward, as follows:

jdbc:hive2://> CREATE INDEX idx_id_employee_id
.......> ON TABLE employee_id (employee_id)
.......> AS 'COMPACT'
.......> WITH DEFERRED REBUILD;
No rows affected (1.149 seconds)

In addition to the COMPACT keyword (which refers to org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler) used in the preceding example, Hive also supports BITMAP indexes since Hive 0.8.0 for columns with fewer distinct values, as shown in the following example:

jdbc:hive2://> CREATE INDEX idx_sex_employee_id
.......> ON TABLE employee_id (sex_age)
.......> AS 'BITMAP'
.......> WITH DEFERRED REBUILD;
No rows affected (0.251 seconds)

The WITH DEFERRED REBUILD keyword in the preceding example prevents the index from being built immediately. To build the index, we can issue ALTER…REBUILD commands as in the following example. When data in the base table changes, the ALTER…REBUILD command must be used to bring the index up to date. This is an atomic operation, so if the rebuild of an index on a previously indexed table fails, the state of the index remains the same, as shown here:

jdbc:hive2://> ALTER INDEX idx_id_employee_id ON employee_id REBUILD;
No rows affected (111.413 seconds)

jdbc:hive2://> ALTER INDEX idx_sex_employee_id ON employee_id
.......> REBUILD;
No rows affected (82.23 seconds)

Once the index is built, Hive will create a new index table for each index as follows:

jdbc:hive2://> !table
+-------------+--------------------------------------------+-------------+---------+
| TABLE_SCHEM | TABLE_NAME                                 | TABLE_TYPE  | REMARKS |
+-------------+--------------------------------------------+-------------+---------+
| default     | default__employee_id_idx_id_employee_id__  | INDEX_TABLE | NULL    |
| default     | default__employee_id_idx_sex_employee_id__ | INDEX_TABLE | NULL    |
+-------------+--------------------------------------------+-------------+---------+

The index table follows a naming convention such as default__tablename_indexname__. It contains the indexed column, _bucketname (typically a file URI on HDFS), and _offsets (the offsets for each row). This index table can then be queried like a regular table wherever we need to look up the indexed columns, as shown here:

jdbc:hive2://> DESC default__employee_id_idx_id_employee_id__;
+-------------+---------------+---------+
| col_name    | data_type     | comment |
+-------------+---------------+---------+
| employee_id | int           |         |
| _bucketname | string        |         |
| _offsets    | array<bigint> |         |
+-------------+---------------+---------+
3 rows selected (0.135 seconds)
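Querying the index table directly could look like the following sketch. The employee_id value of 100 is an arbitrary example. Column names that start with an underscore generally need backtick quoting. Index support was removed in Hive 3.0, so this applies only to earlier versions.

```sql
-- Find which files and offsets hold rows for a given key.
SELECT employee_id, `_bucketname`, `_offsets`
FROM default__employee_id_idx_id_employee_id__
WHERE employee_id = 100;

-- List the indexes defined on the base table.
SHOW FORMATTED INDEX ON employee_id;
```

The first query reads only the much smaller index table, which is where the saving over a full scan of the base table comes from.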

To drop an index, use the DROP INDEX index_name ON table_name statement, as follows. Note that the index table cannot be dropped with a DROP TABLE statement:

jdbc:hive2://> DROP INDEX idx_sex_employee_id ON employee_id;
No rows affected (0.247 seconds)

Note

Since Hive 0.13.0, Hive includes the following new features for performance optimizations:

Tez: Tez (http://tez.apache.org/) is an application framework built on YARN that can execute complex directed acyclic graphs (DAGs) for general data-processing tasks. Tez further splits map and reduce jobs into smaller tasks and combines them in a flexible and efficient way for execution. Tez is considered a flexible and powerful successor to the MapReduce framework. To configure Hive to use Tez, we need to override the following setting, which defaults to MapReduce:

SET hive.execution.engine=tez;

Vectorization: Vectorization optimization processes a larger batch of data at the same time rather than one row at a time, thus significantly reducing computing overhead. Each batch consists of a column vector that is usually an array of primitive types. Operations are performed on the entire column vector, which improves instruction pipelining and cache usage. Files must be stored in the Optimized Row Columnar (ORC) format in order to use vectorization. For more on vectorization, please refer to the Apache Hive wiki at https://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution. To enable vectorization, we need the following setting:

SET hive.vectorized.execution.enabled=true;
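One way to check whether these settings take effect is to inspect the plan, as in this sketch. The table employee_contract_orc is hypothetical and assumed to be stored as ORC with dept_num and salary columns. Whether a given query is vectorized also depends on the Hive version and on the data types and operators involved.

```sql
-- Switch the session to Tez and turn on vectorized execution.
SET hive.execution.engine=tez;
SET hive.vectorized.execution.enabled=true;

-- Inspect the plan of a simple aggregate on an ORC-backed table.
-- When vectorization applies, the plan is expected to include a line
-- such as "Execution mode: vectorized" for the affected tasks.
EXPLAIN
SELECT dept_num, MAX(salary)
FROM employee_contract_orc
GROUP BY dept_num;
```

If the expected line is missing, the usual suspects are a non-ORC table or an unsupported column type in the query.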


Data file optimization

Data file optimization covers performance improvements on the data files in terms of file format, compression, and storage.


File format

Hive supports the TEXTFILE, SEQUENCEFILE, RCFILE, ORC, and PARQUET file formats. The three ways to specify the file format are as follows:

CREATE TABLE … STORED AS <File_Format>
ALTER TABLE … [PARTITION partition_spec] SET FILEFORMAT <File_Format>
SET hive.default.fileformat=<File_Format> --default file format for table

Here, <File_Format> is TEXTFILE, SEQUENCEFILE, RCFILE, ORC, or PARQUET.

We can load a text file directly into a table with the TEXTFILE format. To load data into a table with another file format, we need to load the data into a TEXTFILE format table first. Then, use INSERT OVERWRITE TABLE <target_file_format_table> SELECT * FROM <text_format_source_table> to convert and insert the data into the expected file format.
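The two-step load described above can be sketched as follows. The table names, columns, field delimiter, and the /tmp/employee.txt path are all hypothetical.

```sql
-- Step 1: a staging table in TEXTFILE format that matches the raw file.
CREATE TABLE employee_text (
  name   STRING,
  salary INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/tmp/employee.txt'
OVERWRITE INTO TABLE employee_text;

-- Step 2: the target table in ORC format with the same columns.
CREATE TABLE employee_orc (
  name   STRING,
  salary INT
)
STORED AS ORC;

-- Convert by reading the text table and writing ORC files.
INSERT OVERWRITE TABLE employee_orc
SELECT * FROM employee_text;
```

LOAD DATA only moves files into place without rewriting them, which is why the conversion has to go through an INSERT … SELECT.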

The file formats supported by Hive and their optimizations are as follows:

TEXTFILE: This is the default file format for Hive. Data is not compressed in the text file. It can be compressed with compression tools, such as GZip, Bzip2, and Snappy. However, files compressed with most of these codecs are not splittable as input during processing. As a result, a single, huge map job has to process one big file.

SEQUENCEFILE: This is a binary storage format for key/value pairs. The benefit of a sequence file is that it is more compact than a text file and fits well with the MapReduce output format. Sequence files can be compressed at the record or block level, where the block level has a better compression ratio. To enable block level compression, we need the following settings:

jdbc:hive2://> SET hive.exec.compress.output=true;
jdbc:hive2://> SET io.seqfile.compression.type=BLOCK;

Unfortunately, both text and sequence files, as row-level storage file formats, are not an optimal solution, since Hive has to read a full row even if only one column is requested. Hybrid row-columnar storage file formats, such as RCFILE, ORC, and PARQUET, were created to resolve this problem.

RCFILE: This is short for Record Columnar File. It is a flat file consisting of binary key/value pairs that shares many similarities with a sequence file. The RCFile splits data horizontally into row groups. One or several groups are stored in an HDFS file. Then, RCFile saves the row group data in a columnar format by saving the first column across all rows, then the second column across all rows, and so on. This format is splittable and allows Hive to skip irrelevant parts of the data and get results faster and cheaper.

ORC: This is short for Optimized Row Columnar. It is available since Hive 0.11.0. The ORC format can be considered an improved version of RCFILE. It provides a larger block size of 256 MB by default (RCFILE has 4 MB and SEQUENCEFILE has 1 MB), optimized for large sequential reads on HDFS for more throughput and fewer files, which reduces the load on the namenode. Unlike RCFILE, which relies on the metastore to know data types, the ORC file understands the data types by using specific encoders, so that it can optimize compression depending on the type. It also stores basic statistics, such as MIN, MAX, SUM, and COUNT, on columns, as well as a lightweight index that can be used to skip blocks of rows that do not matter.

PARQUET: This is another row columnar file format that has a similar design to that of ORC. What’s more, Parquet has a wider range of support across the majority of projects in the Hadoop ecosystem compared to ORC, which only supports Hive and Pig. Parquet leverages the design best practices of Google’s Dremel (see http://research.google.com/pubs/pub36632.html) to support the nested structure of data. Parquet is supported by a plugin since Hive 0.10.0 and has had native support since Hive 0.13.0.

Considering the maturity of Hive, it is suggested to use the ORC format if Hive is the main tool used in your Hadoop environment. If you use several tools in the Hadoop ecosystem, PARQUET is a better choice in terms of adaptability.

Note

Hadoop Archive File (HAR) is another type of file format to pack HDFS files into archives. This is an option (although not a good one) for storing a large number of small-sized files in HDFS, as storing a large number of small-sized files directly in HDFS is not very efficient. However, HAR still has some limitations that make it unpopular, such as an immutable archive process, not being splittable, and compatibility issues. For more information about HAR and archiving, please refer to the Apache Hive wiki at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Archiving.


Compression

Compression techniques in Hive can significantly reduce the amount of data transferred between mappers and reducers through proper intermediate output compression, as well as the output data size in HDFS through output compression. As a result, the overall Hive query will have better performance. To compress intermediate files produced by Hive between multiple MapReduce jobs, we need to set the following property (false by default) in the Hive CLI or the hive-site.xml file:

jdbc:hive2://> SET hive.exec.compress.intermediate=true

Then, we need to decide which compression codec to configure. A list of common codecs supported in Hadoop and Hive is as follows:

Compression  Codec                                       Extension  Splittable
Deflate      org.apache.hadoop.io.compress.DefaultCodec  .deflate   N
GZip         org.apache.hadoop.io.compress.GzipCodec     .gz        N
Bzip2        org.apache.hadoop.io.compress.BZip2Codec    .bz2       Y
LZO          com.hadoop.compression.lzo.LzopCodec        .lzo       N
LZ4          org.apache.hadoop.io.compress.Lz4Codec      .lz4       N
Snappy       org.apache.hadoop.io.compress.SnappyCodec   .snappy    N

Hadoop has a default codec (.deflate). GZip has a higher compression ratio, but also a higher CPU cost. Bzip2 is splittable, but splitting was not supported by Hadoop until 1.1 (see https://issues.apache.org/jira/browse/HADOOP-4012). In addition, Bzip2 is too slow for compression considering its huge CPU cost. LZO files are not natively splittable, but we can preprocess them (using com.hadoop.compression.lzo.LzoIndexer) to create an index that determines the file splits. When it comes to the balance of CPU cost and compression ratio, LZ4 or Snappy do a better job. Since the majority of codecs do not support splitting after compression, it is suggested to avoid compressing big files in HDFS.

The compression codec can be specified in mapred-site.xml, hive-site.xml, or the Hive CLI, as in the following example:

jdbc:hive2://> SET hive.intermediate.compression.codec=
.......> org.apache.hadoop.io.compress.SnappyCodec

Intermediatecompressionwillonlysavediskspaceforspecificjobsthatrequiremultiplemapandreducejobs.Forfurthersavingofdiskspace,theactualHiveoutputfilescanbecompressed.Whenthehive.exec.compress.outputpropertyissettotrue,Hivewillusethecodecconfiguredbythemapred.map.output.compression.codecpropertytocompressthestorageinHDFSasfollows.Thesepropertiescanbesetinthehive-site.xmlorintheHiveCLI.


jdbc:hive2://> SET hive.exec.compress.output=true;
jdbc:hive2://> SET mapred.output.compression.codec=
. . . . . . .> org.apache.hadoop.io.compress.SnappyCodec;


Storage optimization

Data that is used or scanned frequently can be identified as hot data. Usually, query performance on hot data is critical for overall performance. Increasing the data replication factor in HDFS (see the following example) for hot data could increase the chance of the data being hit locally by Hive jobs and improve performance. However, this is a trade-off against storage.

$ hdfs dfs -setrep -R -w 4 /user/hive/warehouse/employee
Replication 4 set: /user/hive/warehouse/employee/000000_0

On the other hand, too many files or too much redundancy could exhaust the namenode’s memory, especially with lots of small files that are smaller than the HDFS block size. Hadoop itself already has some solutions for dealing with too many small files, such as the following:

Hadoop Archive and HAR: These are toolkits to pack small files.
SequenceFile format: This is a format to compress small files into bigger files.
CombineFileInputFormat: A type of InputFormat to combine small files before map and reduce processing. It is the default InputFormat for Hive (see https://issues.apache.org/jira/browse/HIVE-2245).
HDFS federation: It makes namenodes extensible and powerful to manage more files.

We can also leverage other tools in the Hadoop ecosystem if we have them installed, such as the following:

HBase has a smaller block size and a better file format to deal with smaller-file access issues
Flume NG can be used as pipes to merge small files into big ones
A scheduled offline file merge program to merge small files in HDFS or before loading them to HDFS

For Hive, we can use the following configurations for merging the files of query results to avoid recreating small files:

hive.merge.mapfiles: This merges small files at the end of a map-only job. By default, it is true.
hive.merge.mapredfiles: This merges small files at the end of a MapReduce job. Set it to true since its default is false.
hive.merge.size.per.task: This defines the size of merged files at the end of the job. The default value is 256,000,000.
hive.merge.smallfiles.avgsize: This is the threshold for triggering a file merge. The default value is 16,000,000.

When the average output file size of a job is less than the value specified by hive.merge.smallfiles.avgsize, and the corresponding merge property is set to true (hive.merge.mapfiles for map-only jobs, hive.merge.mapredfiles for MapReduce jobs), Hive will start an additional MapReduce job to merge the output files into big files.
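The merge trigger described above can be written down as a small decision function. This is only a sketch of the rule as stated in the text, not Hive's actual implementation; the function name should_merge and its parameters are invented for the illustration, with defaults taken from the property defaults listed earlier.

```python
def should_merge(output_file_sizes, map_only,
                 merge_mapfiles=True, merge_mapredfiles=False,
                 smallfiles_avgsize=16_000_000):
    """Return True if the rule in the text would trigger an extra merge job.

    output_file_sizes: sizes in bytes of the files a job produced.
    map_only: True for a map-only job, False for a full MapReduce job.
    """
    if not output_file_sizes:
        return False
    # Pick the switch that applies to this kind of job.
    enabled = merge_mapfiles if map_only else merge_mapredfiles
    average = sum(output_file_sizes) / len(output_file_sizes)
    return enabled and average < smallfiles_avgsize
```

For example, ten 1 MB output files from a map-only job satisfy the condition with the defaults, while the same files from a MapReduce job do so only once merge_mapredfiles is switched on.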


Job and query optimization

Job and query optimization covers experience and skills for improving performance in the areas of job-running mode, JVM reuse, parallel job execution, and query optimizations in JOIN.


Local mode

Hadoop can run in standalone, pseudo-distributed, and fully distributed mode. Most of the time, we need to configure Hadoop to run in fully distributed mode. When the data to process is small, starting distributed data processing is an overhead, since launching a fully distributed job takes more time than the job processing itself. Since Hive 0.7.0, Hive supports automatic conversion of a job to run in local mode with the following settings:

jdbc:hive2://> SET hive.exec.mode.local.auto=true; --default false
jdbc:hive2://> SET hive.exec.mode.local.auto.inputbytes.max=50000000;
jdbc:hive2://> SET hive.exec.mode.local.auto.input.files.max=5;
--default 4

A job must satisfy the following conditions to run in local mode:

The total input size of the job is lower than hive.exec.mode.local.auto.inputbytes.max
The total number of map tasks is less than hive.exec.mode.local.auto.input.files.max
The total number of reduce tasks required is 1 or 0
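The three conditions above combine into a single check. The following sketch simply restates them in code; the function name can_run_local is hypothetical, and the default thresholds are the values used in the SET statements above rather than Hive's built-in defaults.

```python
def can_run_local(total_input_bytes, num_map_tasks, num_reduce_tasks,
                  inputbytes_max=50_000_000, input_files_max=5):
    """Return True if a job meets all three local-mode conditions from the text."""
    return (
        total_input_bytes < inputbytes_max      # small total input
        and num_map_tasks < input_files_max     # few map tasks
        and num_reduce_tasks in (0, 1)          # at most one reducer
    )
```

A job with 10 MB of input, two map tasks, and one reducer passes the check, whereas the same job with two reducers would stay in distributed mode.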


JVM reuse

By default, Hadoop launches a new JVM for each map or reduce task and runs the tasks in parallel. When a map or reduce task is lightweight and runs only for a few seconds, the JVM startup process can be a significant overhead. The MapReduce framework (version 1 only, not YARN) has an option to reuse a JVM by running mappers/reducers serially in a shared JVM instead of in parallel. JVM reuse applies to map or reduce tasks in the same job. Tasks from different jobs always run in separate JVMs. To enable reuse, we can set the maximum number of tasks of a single job that may share a JVM using the mapred.job.reuse.jvm.num.tasks property. Its default value is 1:

jdbc:hive2://> SET mapred.job.reuse.jvm.num.tasks=5;

We can also set the value to -1 to indicate that all the tasks for a job will run in the same JVM.


Parallel execution

Hive queries are commonly translated into a number of stages that are executed in the default sequence. These stages are not always dependent on each other. Instead, they can run in parallel to reduce the overall job running time. We can enable this feature with the following settings:

jdbc:hive2://> SET hive.exec.parallel=true; --default false
jdbc:hive2://> SET hive.exec.parallel.thread.number=16;
--default 8, it defines the max number for running in parallel

Parallel execution will increase cluster utilization. If the utilization of a cluster is already very high, parallel execution will not help much in terms of overall performance.
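To see which stages could overlap, it helps to group them by their dependencies. The sketch below is a generic illustration of that idea and not how Hive schedules stages internally; the function name parallel_waves and the stage names in the example are assumptions.

```python
def parallel_waves(dependencies):
    """Group stages into waves whose members do not depend on each other.

    dependencies maps each stage name to the set of stages it waits for.
    Stages in the same wave could run at the same time.
    """
    remaining = {stage: set(deps) for stage, deps in dependencies.items()}
    done, waves = set(), []
    while remaining:
        # A stage is ready once everything it waits for has finished.
        ready = sorted(s for s, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("cyclic or unknown stage dependencies")
        waves.append(ready)
        done.update(ready)
        for stage in ready:
            del remaining[stage]
    return waves
```

For a plan in which Stage-3 waits for Stage-1 and Stage-2, the function returns two waves: the first two stages together, then Stage-3. With sequential execution the same plan would need three steps.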


Join optimization

We have already discussed optimization in different types of Hive joins in Chapter 4, Data Selection and Scope. Here, we’ll briefly review the key settings for join improvement.

Common join

The common join is also called the reduce side join. It is the basic join in Hive and works most of the time. For common joins, we need to make sure the big table is on the right-most side or specified by a hint, as follows:

/*+ STREAMTABLE(stream_table_name) */

Map join

Map join is used when one of the join tables is small enough to fit in memory, so it is very fast but limited. Since Hive 0.7.0, Hive can convert to a map join automatically with the following settings:

jdbc:hive2://> SET hive.auto.convert.join=true; --default false
jdbc:hive2://> SET hive.mapjoin.smalltable.filesize=600000000;
--default 25M
jdbc:hive2://> SET hive.auto.convert.join.noconditionaltask=true;
--default false. Set to true so that map join hint is not needed
jdbc:hive2://> SET hive.auto.convert.join.noconditionaltask.size=10000000;
--The default value controls the size of table to fit in memory

Once auto convert is enabled, Hive automatically compares the file size of the smaller table with the value specified by hive.mapjoin.smalltable.filesize. If the file size is bigger than this threshold, Hive uses a common join. If the file size is smaller than the threshold, Hive tries to convert the common join into a map join. Once auto convert join is enabled, there is no need to provide map join hints in the query.
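The reason a map join needs the small table to fit in memory becomes clearer with a toy version of the technique. The following is a minimal sketch of a hash-based inner join in plain Python, not Hive code; the function name map_join and the row layout (dictionaries) are assumptions made for the illustration.

```python
def map_join(big_rows, small_rows, big_key, small_key):
    """Inner join: load the small table into a hash map, then stream the big one.

    Only the small table is held in memory; the big table is read once,
    row by row, so no shuffle or reduce phase is needed.
    """
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[small_key], []).append(row)
    for row in big_rows:
        for match in lookup.get(row[big_key], []):
            yield {**match, **row}
```

Joining a list of employee rows against a short list of department rows on a shared dept_id column yields one merged dictionary per matching pair, while employees without a matching department are dropped.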

Bucket map join

Bucket map join is a special type of map join applied on bucket tables. To enable bucket map join, we need to enable the following settings:

jdbc:hive2://> SET hive.auto.convert.join=true; --default false
jdbc:hive2://> SET hive.optimize.bucketmapjoin=true; --default false

In a bucket map join, all the join tables must be bucket tables and must join on the bucket columns. In addition, the number of buckets in the bigger table must be a multiple of the number of buckets in the small tables.
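The multiple-of requirement can be motivated with a little modular arithmetic. The sketch below assumes that a row lands in bucket "hash of key modulo number of buckets" and that the hash is a non-negative integer; both the assumption and the function names bucket_of and matching_small_bucket are simplifications for illustration, not Hive internals.

```python
def bucket_of(key_hash, num_buckets):
    """Bucket index for a non-negative key hash (assumed: hash modulo bucket count)."""
    return key_hash % num_buckets


def matching_small_bucket(big_bucket, big_buckets, small_buckets):
    """Return the only small-table bucket that can hold matches for a big-table bucket."""
    if big_buckets % small_buckets != 0:
        raise ValueError("bucket counts must be multiples of each other")
    return big_bucket % small_buckets
```

With 8 buckets in the big table and 4 in the small one, every key in big-table bucket 6 must sit in small-table bucket 2, so a mapper working on bucket 6 only needs to load that single small bucket. If the counts were 8 and 3, no such fixed pairing would exist.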

Sort merge bucket (SMB) join

SMB is the join performed on bucket tables that have the same sorted, bucket, and join condition columns. It reads data from both bucket tables and performs common joins (map and reduce triggered) on the bucket tables. We need to enable the following properties to use SMB:

jdbc:hive2://> SET hive.input.format=
. . . . . . .> org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
jdbc:hive2://> SET hive.auto.convert.sortmerge.join=true;
jdbc:hive2://> SET hive.optimize.bucketmapjoin=true;
jdbc:hive2://> SET hive.optimize.bucketmapjoin.sortedmerge=true;
jdbc:hive2://> SET hive.auto.convert.sortmerge.join.noconditionaltask=true;

Sort merge bucket map (SMBM) join

SMBM join is a special bucket join that triggers a map-side join only. Unlike map join, it avoids caching all rows in memory. To perform SMBM joins, the join tables must have the same bucket, sort, and join condition columns. To enable such joins, we need to enable the following settings:

jdbc:hive2://> SET hive.auto.convert.join=true;
jdbc:hive2://> SET hive.auto.convert.sortmerge.join=true;
jdbc:hive2://> SET hive.optimize.bucketmapjoin=true;
jdbc:hive2://> SET hive.optimize.bucketmapjoin.sortedmerge=true;
jdbc:hive2://> SET hive.auto.convert.sortmerge.join.noconditionaltask=true;
jdbc:hive2://> SET
hive.auto.convert.sortmerge.join.bigtable.selection.policy=
org.apache.hadoop.hive.ql.optimizer.TableSizeBasedBigTableSelectorForAutoSMJ;

Skew join

When working with data that has a highly uneven distribution, data skew can happen in such a way that a small number of compute nodes must handle the bulk of the computation. The following settings inform Hive to optimize properly if data skew happens:

jdbc:hive2://> SET hive.optimize.skewjoin=true;
--If there is data skew in join, set it to true. Default is false.
jdbc:hive2://> SET hive.skewjoin.key=100000;
--This is the default value. If the number of rows for a key is bigger than
--this, the new keys will be sent to the other unused reducers.

Note

Skewed data can occur in GROUP BY data too. To optimize it, we need the following setting to enable skew data optimization in the GROUP BY result:

SET hive.groupby.skewindata=true;

Once configured, Hive will first trigger an additional MapReduce job whose map output is randomly distributed to the reducers to avoid data skew.
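The two-job idea behind the GROUP BY skew setting can be imitated in a few lines. This is a conceptual sketch only: it counts rows per key in two stages, spreading rows randomly in the first stage so that a hot key does not land on a single worker. The function name two_stage_count and the fixed random seed are illustrative choices.

```python
import random
from collections import Counter


def two_stage_count(keys, num_reducers=4, seed=0):
    """COUNT(*) ... GROUP BY key, computed in two stages to spread hot keys."""
    rng = random.Random(seed)
    # Stage 1: rows go to a random reducer, each of which builds partial counts.
    partials = [Counter() for _ in range(num_reducers)]
    for key in keys:
        partials[rng.randrange(num_reducers)][key] += 1
    # Stage 2: partial counts are grouped by key and added up.
    final = Counter()
    for partial in partials:
        final.update(partial)
    return final
```

Even if one key accounts for nearly all rows, each first-stage reducer sees only about a quarter of them, and the second stage has to combine at most four partial counts per key. The final result is the same as a direct count.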

For more information about Hive join optimization, please refer to the Apache Hive wiki available at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization and https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization.


Summary

In this chapter, we first covered how to identify performance bottlenecks using the EXPLAIN and ANALYZE statements. Then, we spoke about design optimization for performance when using tables, partitions, and indexes. We also covered data file optimization, including file format, compression, and storage. At the end of this chapter, we discussed job and query optimization in Hive. After going through this chapter, we should be able to do performance troubleshooting and tuning in Hive.

In the next chapter, we’ll talk about function extensions for Hive.


Chapter 8. Extensibility Considerations

Although Hive has many built-in functions, users sometimes need power beyond that provided by built-in functions. For these instances, Hive offers the following three main areas where its functionality can be extended:

User-defined function (UDF): This provides a way to extend functionality with an external function (mainly written in Java) that can be evaluated in HQL
Streaming: This plugs users’ own customized mapper and reducer programs into the data stream
SerDe: This stands for serializers and deserializers and provides a way to serialize or deserialize a custom file format with files stored on HDFS

In this chapter, we’ll talk about each of them in more detail.


User-defined functions

Hive defines the following three types of UDF:

UDFs: These are regular user-defined functions that operate row-wise and output one result for one row, such as most built-in mathematical and string functions.
UDAFs: These are user-defined aggregating functions that operate row-wise or group-wise and output one row or one row for each group as a result, such as the MAX and COUNT built-in functions.
UDTFs: These are user-defined table-generating functions that also operate row-wise, but they produce multiple rows/tables as a result, such as the EXPLODE function. A UDTF can be used either after SELECT or after the LATERAL VIEW statement.

Note

Since Hive is implemented in Java, UDFs should be written in Java as well. Since Java supports running code in other languages through the javax.script API (see http://docs.oracle.com/javase/6/docs/api/javax/script/package-summary.html), UDFs can be written in languages other than Java. In this book, we only focus on Java UDFs.

We’ll start looking at the Java code template for each kind of function in more detail.


The UDF code template

The code template for a regular UDF is as follows:

package com.packtpub.hive.essentials.hiveudf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.udf.UDFType;
// Below are options or add more when needed
import org.apache.hadoop.io.Text;
import org.apache.commons.lang.StringUtils;

@Description(
  name = "udf_name",
  value = "_FUNC_(arg1, arg2, ... argN) - A short description for the function",
  extended = "This is more detail about the function, such as syntax, examples."
)
@UDFType(deterministic = true, stateful = false)
public class udf_name extends UDF {
  public String evaluate() {
    /*
     * Do something here
     */
    return "return the udf result";
  }

  // Overloading is supported
  public String evaluate(<Type_arg1> arg1, ..., <Type_argN> argN) {
    /*
     * Do something here
     */
    return "return the udf result";
  }
}

In the preceding template, the package definition and imports should be self-explanatory. We can import whatever is needed besides the top three mandatory libraries. The @Description annotation is a useful Hive-specific annotation that provides usage information for the UDF in the Hive console. The information defined in the value property is shown by the HQL DESCRIBE FUNCTION command. The information defined in the extended property is shown by the HQL DESCRIBE FUNCTION EXTENDED command. The @UDFType annotation tells Hive what behavior to expect from the function. A deterministic UDF (deterministic = true) is a function that always gives the same result when passed the same arguments, such as LENGTH(string input), MAX(), and so on. On the other hand, a non-deterministic (deterministic = false) UDF can return a different result for the same set of arguments, for example, UNIX_TIMESTAMP() returning the current timestamp in the default time zone. The stateful (stateful = true) property allows functions to keep some static variables available across rows, such as ROW_NUMBER(), which assigns sequential numbers for all rows in a table.

All UDFs extend the Hive UDF class, so a UDF subclass must implement the evaluate method called by Hive. The evaluate method can be overloaded for different purposes. In this method, we can implement whatever logic and exception handling the function’s design calls for, using the Java Hadoop library and the Hadoop data types for MapReduce data serialization, such as Text, DoubleWritable, IntWritable, and so on.


The UDAF code template

In this section, we introduce the UDAF code template by extending it from the UDAF class. The code template is as follows:

package com.packtpub.hive.essentials.hiveudaf;

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.udf.UDFType;

@Description(
  name = "udaf_name",
  value = "_FUNC_(arg1, arg2, ... argN) - A short description for the function",
  extended = "This is more detail about the function, such as syntax, examples."
)
@UDFType(deterministic = false, stateful = true)
public final class udaf_name extends UDAF {
  /**
   * The internal state of an aggregation function.
   *
   * Note that this is only needed if the internal state
   * cannot be represented by a primitive.
   *
   * The internal state can contain fields with types like
   * ArrayList<String> and HashMap<String,Double> if needed.
   */
  public static class UDAFState {
    private <Type_state1> state1;
    private <Type_stateN> stateN;
  }

  /**
   * The actual class for doing the aggregation. Hive will
   * automatically look for all internal classes of the UDAF
   * that implement UDAFEvaluator.
   */
  public static class UDAFExampleAvgEvaluator implements UDAFEvaluator {
    UDAFState state;

    public UDAFExampleAvgEvaluator() {
      super();
      state = new UDAFState();
      init();
    }

    /**
     * Reset the state of the aggregation.
     */
    public void init() {
      /*
       * Examples for initializing state.
       */
      state.state1 = 0;
      state.stateN = 0;
    }

    /**
     * Iterate through one row of original data.
     *
     * The number and type of arguments need to be the same as we
     * call this UDAF from the Hive command line.
     *
     * This function should always return true.
     */
    public boolean iterate(<Type_arg1> arg1, ..., <Type_argN> argN) {
      /*
       * Add logic here for how to do aggregation if there is
       * a new value to be aggregated.
       */
      return true;
    }

    /**
     * Called on the mapper side on different data nodes.
     * Terminate a partial aggregation and return the state.
     * If the state is a primitive, just return primitive Java
     * classes like Integer or String.
     */
    public UDAFState terminatePartial() {
      /*
       * Check and return a partial result in expectations.
       */
      return state;
    }

    /**
     * Merge with a partial aggregation.
     *
     * This function should always have a single argument,
     * which has the same type as the return value of
     * terminatePartial().
     */
    public boolean merge(UDAFState o) {
      /*
       * Define operations how to merge the result calculated
       * from all data nodes.
       */
      return true;
    }

    /**
     * Terminates the aggregation and returns the final result.
     */
    public long terminate() {
      /*
       * Check and return final result in expectations.
       */
      return state.stateN;
    }
  }
}

A UDAF must be a subclass of org.apache.hadoop.hive.ql.exec.UDAF containing one or more nested static classes implementing org.apache.hadoop.hive.ql.exec.UDAFEvaluator. Make sure that the inner class that implements UDAFEvaluator is defined as public. Otherwise, Hive won’t be able to use reflection and determine the UDAFEvaluator implementation. We should also implement the five required functions, init, iterate, terminatePartial, merge, and terminate, already described in the code comments.
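How the five functions cooperate is easier to follow when the call order is simulated outside Hive. The following Python sketch mimics the lifecycle for an average: one evaluator per data partition plays the mapper role, and a final evaluator merges the partial states. The class AvgEvaluator, the driver function aggregate, and the snake_case method name terminate_partial are illustrative stand-ins, not part of the Hive API.

```python
class AvgEvaluator:
    """Toy evaluator that follows the init/iterate/terminatePartial/merge/terminate flow."""

    def __init__(self):
        self.init()

    def init(self):
        # Reset the aggregation state.
        self.total = 0.0
        self.count = 0

    def iterate(self, value):
        # Called once per input row; NULL (None) values are skipped.
        if value is not None:
            self.total += value
            self.count += 1
        return True

    def terminate_partial(self):
        # Partial state handed from the map side to the reduce side.
        return (self.total, self.count)

    def merge(self, partial):
        # Combine a partial state produced by another evaluator.
        if partial is not None:
            total, count = partial
            self.total += total
            self.count += count
        return True

    def terminate(self):
        # Final result; None stands in for NULL when no rows were seen.
        return self.total / self.count if self.count else None


def aggregate(partitions):
    """Drive the lifecycle: one evaluator per partition, then a final merge."""
    partials = []
    for rows in partitions:
        mapper = AvgEvaluator()
        for value in rows:
            mapper.iterate(value)
        partials.append(mapper.terminate_partial())
    reducer = AvgEvaluator()
    for partial in partials:
        reducer.merge(partial)
    return reducer.terminate()
```

Note that the partial state carries both the sum and the count. Averaging the per-partition averages instead would give a wrong answer whenever partitions differ in size, which is why terminatePartial and terminate return different things.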

Note

Both UDF and UDAF can also be implemented by extending the GenericUDF and GenericUDAFEvaluator classes to avoid using Java reflection, for better performance. These generic functions are actually what Hive’s built-in UDF implementations extend internally. Generic functions support complex data types, such as MAP, ARRAY, and STRUCT, as arguments, but the UDF and UDAF classes do not. For more information about GenericUDAF, please refer to the Apache Hive wiki at https://cwiki.apache.org/confluence/display/Hive/GenericUDAFCaseStudy.


The UDTF code template

To implement a UDTF, there is only one way: extending org.apache.hadoop.hive.ql.udf.generic.GenericUDTF. There is no plain UDTF class. We need to implement three methods: initialize, process, and close. The initialize method is called first and returns information about the function output, such as data types, number of output columns, and so on. Then, the process method is called to perform the core function logic on the arguments and forward the result. At the end, the close method does a proper cleanup, if needed. The code template for a UDTF is as follows:

package com.packtpub.hive.essentials.hiveudtf;

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

@Description(
  name = "udtf_name",
  value = "_FUNC_(arg1, arg2, ... argN) - A short description for the function",
  extended = "This is more detail about the function, such as syntax, examples."
)
public class udtf_name extends GenericUDTF {
  private PrimitiveObjectInspector stringOI = null;

  /**
   * This method will be called exactly once per instance.
   * It performs any custom initialization logic we need.
   * It is also responsible for verifying the input types and
   * specifying the output types.
   */
  @Override
  public StructObjectInspector initialize(ObjectInspector[] args)
      throws UDFArgumentException {
    // Check number of arguments.
    if (args.length != 1) {
      throw new UDFArgumentException("The UDTF should take exactly one argument");
    }

    /*
     * Check that the input ObjectInspector[] array contains a
     * single PrimitiveObjectInspector of the Primitive type,
     * such as String. The checks are combined with || so that the
     * cast only happens once the category is known to be PRIMITIVE.
     */
    if (args[0].getCategory() != ObjectInspector.Category.PRIMITIVE
        || ((PrimitiveObjectInspector) args[0]).getPrimitiveCategory()
            != PrimitiveObjectInspector.PrimitiveCategory.STRING) {
      throw new UDFArgumentException("The UDTF should take a string as a parameter");
    }
    stringOI = (PrimitiveObjectInspector) args[0];

    /*
     * Define the expected output for this function, including
     * each alias and types for the aliases.
     */
    List<String> fieldNames = new ArrayList<String>(2);
    List<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>(2);
    fieldNames.add("alias1");
    fieldNames.add("alias2");
    fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
    fieldOIs.add(PrimitiveObjectInspectorFactory.javaIntObjectInspector);

    // Set up the output schema.
    return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames,
        fieldOIs);
  }

  /**
   * This method is called once per input row and generates
   * output. The "forward" method is used (instead of
   * "return") in order to specify the output from the function.
   */
  @Override
  public void process(Object[] record) throws HiveException {
    /*
     * We may need to convert the object to a primitive type
     * before implementing customized logic.
     */
    final String recStr = (String) stringOI.getPrimitiveJavaObject(record[0]);

    // Emit newly created structs after applying customized logic.
    forward(new Object[] {recStr, Integer.valueOf(1)});
  }

  /**
   * This method is for any cleanup that is necessary before
   * returning from the UDTF. Since the output stream has
   * already been closed at this point, this method cannot
   * emit more rows.
   */
  @Override
  public void close() throws HiveException {
    // Do nothing.
  }
}


Development and deployment

We’ll go through the whole development and deployment steps using an example. Let’s create a Hive function called toUpper, which will convert a string to uppercase, using the following steps:

1. Download and install a Java IDE, such as Eclipse, from http://www.eclipse.org/downloads/packages/eclipse-ide-java-developers/lunasr1.

2. Start the IDE and create a Java project.

3. Right-click on the project to choose the Build Path | Configure Build Path | Add External Jars option. It will open a new window. Navigate to the directory containing the Hive and Hadoop libraries. Then, select and add all the JAR files needed for import. We can also resolve library dependencies automatically by using Maven (see http://maven.apache.org/) and a proper pom.xml file. How to configure a library repository in pom.xml files is usually well described in the Hadoop vendor package or the Apache Hive and Hadoop help documents.

4. In the IDE, create the ToUpper.java file as follows, according to the UDF template mentioned previously:

package com.packtpub.hive.essentials.hiveudf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// The class must be public so that Hive can instantiate it.
public class ToUpper extends UDF {
  public Text evaluate(Text input) {
    if (input == null) return null;
    return new Text(input.toString().toUpperCase());
  }
}

5. Now, export this project as a JAR file (or build it with Maven) named toupper.jar.

6. Copy this JAR file to a directory, such as /home/dayongd/hive/lib/, on a node of the Hive cluster.

7. Add the JAR to the Hive environment using one of the following options (option 3 or 4 is recommended):

Option 1: Run ADD JAR /home/dayongd/hive/lib/toupper.jar in the Hive CLI. This is only valid for the current session, and does not work for ODBC connections.
Option 2: Add ADD JAR /home/dayongd/hive/lib/toupper.jar in /home/$USER/.hiverc (we can create the file if it is not there). In this case, the file needs to be deployed to every node from where we might launch the Hive shell. This is only valid for the current session, and does not work for ODBC connections.
Option 3: Add the following configuration in the hive-site.xml file:

<property>
  <name>hive.aux.jars.path</name>
  <value>file:///home/dayongd/hive/lib/toupper.jar</value>
</property>

Option 4: Copy and paste the JAR file to the /${HIVE_HOME}/auxlib/ folder (create it if it does not exist).

8. Create the function. We can create a temporary function that is only valid in the current Hive session as follows:

CREATE TEMPORARY FUNCTION toUpper AS
'com.packtpub.hive.essentials.hiveudf.ToUpper';

Note

Since Hive 0.13.0, we can use one command to add the JAR and create a permanent function, which is registered in the metastore and can be referenced in a query without creating a temporary function in each session:

CREATE FUNCTION toUpper AS
'com.packtpub.hive.essentials.hiveudf.ToUpper' USING JAR
'hdfs:///path/to/jar';

9. Verify the function:

SHOW FUNCTIONS ToUpper;
DESCRIBE FUNCTION ToUpper;
DESCRIBE FUNCTION EXTENDED ToUpper;

10. Use the UDF in HQL:

SELECT toUpper(name) FROM employee LIMIT 1000;

11. Drop the function when needed:

DROP TEMPORARY FUNCTION IF EXISTS toUpper;


Streaming

Hive can also leverage the streaming feature in Hadoop to transform data in an alternative way. The streaming API opens an I/O pipe to an external process (script). Then, the process reads data from the standard input and writes the results out through the standard output. In Hive, we can use TRANSFORM clauses in HQL directly, and embed mapper and reducer scripts written in commands, shell scripts, Java, or other programming languages. Although streaming brings overhead by using serialization/deserialization between processes, it is a simpler coding mode for developers, especially non-Java developers. The syntax of the TRANSFORM clause is as follows:

FROM (
  FROM src
  SELECT TRANSFORM '(' expression (',' expression)* ')'
  (inRowFormat)?
  USING 'map_user_script'
  (AS colName (',' colName)*)?
  (outRowFormat)? (outRecordReader)?
  (CLUSTER BY? | DISTRIBUTE BY? SORT BY?) src_alias
)
SELECT TRANSFORM '(' expression (',' expression)* ')'
(inRowFormat)?
USING 'reduce_user_script'
(AS colName (',' colName)*)?
(outRowFormat)? (outRecordReader)?

By default, the INPUT values for the user script are the following:

Columns transformed to STRING values
Delimited by a tab
NULL values converted to the literal string \N (differentiates NULL values from empty strings)

By default, the OUTPUT values of the user script are the following:

Treated as tab-separated STRING columns
\N will be reinterpreted as NULL
The resulting STRING column will be cast to the data type specified in the table declaration

These defaults can be overridden with ROW FORMAT. An example of Hive streaming using the Python script upper.py is as follows:

#!/usr/bin/env python
'''
This is a script to upper all cases
'''
import sys

def main():
    try:
        for line in sys.stdin:
            n = line.strip()
            # print() with a single argument works in Python 2 and 3
            print(n.upper())
    except Exception:
        return None

if __name__ == "__main__":
    main()

Test the script, as follows:

$ echo "Will" | python upper.py
WILL

Call the script in the Hive CLI from HQL:

jdbc:hive2://> ADD FILE /home/dayongd/Downloads/upper.py;
jdbc:hive2://> SELECT TRANSFORM (name, work_place[0])
. . . . . . .> USING 'python upper.py' AS (CAP_NAME, CAP_PLACE)
. . . . . . .> FROM employee;
+-----------+------------+
| cap_name  | cap_place  |
+-----------+------------+
| MICHAEL   | MONTREAL   |
| WILL      | MONTREAL   |
| SHELLEY   | NEW YORK   |
| LUCY      | VANCOUVER  |
| STEVEN    | NULL       |
+-----------+------------+
5 rows selected (30.101 seconds)
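The default tab-delimited input and the \N marker for NULL can also be handled column by column. The following is a sketch of a variant of upper.py under those default assumptions; the names NULL_MARKER, transform_line, and run are made up for this illustration.

```python
NULL_MARKER = r"\N"  # how NULL arrives from (and returns to) Hive by default


def transform_line(line):
    """Upper-case every tab-separated column, passing NULL markers through."""
    fields = line.rstrip("\n").split("\t")
    out = [f if f == NULL_MARKER else f.upper() for f in fields]
    return "\t".join(out)


def run(stream_in, stream_out):
    """Apply transform_line to every record of a text stream."""
    for line in stream_in:
        stream_out.write(transform_line(line) + "\n")
```

Used as a streaming script, the file would end with a call such as run(sys.stdin, sys.stdout) after importing sys. Upper-casing happens to leave \N unchanged, but a transformation such as lower-casing would turn it into \n and silently lose the NULL, which is why the marker is checked explicitly. Splitting on tabs instead of stripping the whole line also keeps an empty first column intact.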

Note

The TRANSFORM command is not allowed when SQL standard-based authorization is configured, since Hive 0.13.0.


SerDe

SerDe stands for Serializer and Deserializer. It is the technology that Hive uses to process records and map them to column data types in Hive tables. To explain the scenario of using SerDe, we need to understand how Hive reads and writes data.

The process to read data is as follows:

1. Data is read from HDFS.

2. Data is processed by the INPUTFORMAT implementation, which defines the input data split and key/value records. In Hive, we can use CREATE TABLE … STORED AS <FILE_FORMAT> (see Chapter 7, Performance Considerations, for available file formats) to specify which INPUTFORMAT it reads from.

3. The Java Deserializer class defined in SerDe is called to format the data into a record that maps to column and data types in a table.

For an example of reading data, we can use JSON SerDe to read TEXTFILE format data from HDFS and translate each row of JSON attributes and values into rows in Hive tables with the correct schema.

The process to write data is as follows:

1. Data to be written (such as by an INSERT statement) is translated by the Serializer class defined in SerDe to the format that the OUTPUTFORMAT class can read.

2. Data is processed by the OUTPUTFORMAT implementation, which creates the RecordWriter object. Similar to the INPUTFORMAT implementation, the OUTPUTFORMAT implementation is specified in the same way as a table where it writes the data.

3. The data is written to the table (data saved in HDFS).

For an example of writing data, we can write a row-column of data to Hive tables using JSON SerDe, which translates data to a JSON text string saved to HDFS.
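The two directions can be pictured with a toy JSON example. The sketch below is not a Hive SerDe and does not use any Hive API; it only shows the shape of the work: turning one stored text record into typed column values on the read path and back into text on the write path. The schema SCHEMA and the function names deserialize and serialize are hypothetical.

```python
import json

# Hypothetical table schema: column name and the Python type it maps to.
SCHEMA = [("name", str), ("age", int)]


def deserialize(line, schema=SCHEMA):
    """Read path: one JSON text record -> tuple of typed column values."""
    record = json.loads(line)
    row = []
    for column, cast in schema:
        value = record.get(column)
        # A missing attribute becomes None, standing in for NULL.
        row.append(None if value is None else cast(value))
    return tuple(row)


def serialize(row, schema=SCHEMA):
    """Write path: tuple of column values -> one JSON text record."""
    return json.dumps({column: value for (column, _), value in zip(schema, row)})
```

A record such as {"name": "Will", "age": "31"} is read as the row ("Will", 31), and serializing a row followed by deserializing it returns the same row. The table schema, not the stored text, decides the column types.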

Recent Hive versions use the org.apache.hadoop.hive.serde2 library; org.apache.hadoop.hive.serde is the deprecated library. A list of commonly used SerDes in Hive is as follows:

LazySimpleSerDe: The default built-in SerDe (org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe) that’s used with the TEXTFILE format. It can be used as follows:

jdbc:hive2://> CREATE TABLE test_serde_lz
.......> STORED AS TEXTFILE AS
.......> SELECT name from employee;
No rows affected (32.665 seconds)

ColumnarSerDe: This is the built-in SerDe used with the RCFILE format. It can be used as follows:

jdbc:hive2://> CREATE TABLE test_serde_cs
.......> ROW FORMAT SERDE
.......> 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
.......> STORED AS RCFile AS
.......> SELECT name from employee;
No rows affected (27.187 seconds)

RegexSerDe: This is the built-in Java regular expression SerDe to parse text files. It can be used as follows:

--Parse , separated fields
jdbc:hive2://> CREATE TABLE test_serde_rex(
.......> name string,
.......> sex string,
.......> age string
.......> )
.......> ROW FORMAT SERDE
.......> 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
.......> WITH SERDEPROPERTIES(
.......> 'input.regex' = '([^,]*),([^,]*),([^,]*)',
.......> 'output.format.string' = '%1$s %2$s %3$s'
.......> )
.......> STORED AS TEXTFILE;
No rows affected (0.266 seconds)

HBaseSerDe: This is the built-in SerDe to enable Hive to integrate with HBase. We can store Hive tables in HBase by leveraging this SerDe. Make sure to have HBase installed before running the following query:

jdbc:hive2://> CREATE TABLE test_serde_hb(
.......> id string,
.......> name string,
.......> sex string,
.......> age string
.......> )
.......> ROW FORMAT SERDE
.......> 'org.apache.hadoop.hive.hbase.HBaseSerDe'
.......> STORED BY
.......> 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
.......> WITH SERDEPROPERTIES (
.......> "hbase.columns.mapping" =
.......> ":key,info:name,info:sex,info:age"
.......> )
.......> TBLPROPERTIES ("hbase.table.name" = "test_serde");
No rows affected (0.387 seconds)

AvroSerDe: This is the built-in SerDe that enables reading and writing Avro (see http://avro.apache.org/) data in Hive tables. Avro is a remote procedure call and data serialization framework. Since Hive 0.14.0, Avro-backed tables can simply be created by using the CREATE TABLE … STORED AS AVRO statement. The following statement spells out the SerDe and the input and output format classes explicitly:

jdbc:hive2://> CREATE TABLE test_serde_avro(
.......> name string,
.......> sex string,
.......> age string
.......> )
.......> ROW FORMAT SERDE
.......> 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
.......> STORED AS INPUTFORMAT
.......> 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
.......> OUTPUTFORMAT
.......> 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
.......> ;
No rows affected (0.31 seconds)

ParquetHiveSerDe: This is the built-in SerDe (parquet.hive.serde.ParquetHiveSerDe) that enables reading and writing the Parquet data format since Hive 0.13.0. It can be used as follows:

jdbc:hive2://> CREATE TABLE test_serde_parquet
.......> STORED AS PARQUET AS
.......> SELECT name from employee;
No rows affected (34.079 seconds)

OpenCSVSerDe: This is the SerDe to read and write CSV data. It comes as a built-in SerDe since Hive 0.14.0. We can also install the implementation from other open source libraries, such as https://github.com/ogrodnek/csv-serde. It can be used as follows:

jdbc:hive2://> CREATE TABLE test_serde_csv(
.......> name string,
.......> sex string,
.......> age string
.......> )
.......> ROW FORMAT SERDE
.......> 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
.......> STORED AS TEXTFILE;

JSONSerDe: This is a third-party SerDe to read and write JSON data records with Hive. Make sure to install it (from https://github.com/rcongiu/Hive-JSON-Serde) before running the following query:

jdbc:hive2://> CREATE TABLE test_serde_js(
.......> name string,
.......> sex string,
.......> age string
.......> )
.......> ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
.......> STORED AS TEXTFILE;
No rows affected (0.245 seconds)
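Before putting a pattern into the input.regex property of a RegexSerDe table, it can help to try it against a sample line in plain Java. The sketch below reuses the pattern from the RegexSerDe example in the list above; the class name RegexFieldDemo, the sample line, and the choice to return null for a non-matching line are assumptions made for this illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for trying out a RegexSerDe-style pattern outside Hive. */
public class RegexFieldDemo {
    // Same pattern as the input.regex property in the RegexSerDe example
    static final Pattern INPUT_REGEX = Pattern.compile("([^,]*),([^,]*),([^,]*)");

    /** Returns the three captured fields, or null when the whole line does not match. */
    public static String[] parse(String line) {
        Matcher m = INPUT_REGEX.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] fields = parse("Will,Male,30");
        // Mirrors the positional style of the output.format.string property
        System.out.println(String.format("%1$s %2$s %3$s", (Object[]) fields));
    }
}
```

Each capture group corresponds to one table column in order, so the number of groups should match the number of columns in the table definition.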

Hive also allows users to define a customized SerDe if none of these work for their data format. For more information about custom SerDe, please refer to the Apache wiki at https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide#DeveloperGuide-HowtoWriteYourOwnSerDe.


Summary

In this chapter, we introduced three main areas to extend Hive’s functionality. We covered the three kinds of user-defined functions in Hive, as well as the coding template and deployment steps to guide your coding and deployment practice. Then, we talked about streaming in Hive to plug in your own code, which does not have to be Java code. At the end of this chapter, we discussed the available SerDes in Hive to parse different data file formats when reading or writing data. After going through this chapter, we should be able to write basic UDFs, plug code into streaming, and use the available SerDes in Hive.

In the next chapter, we’ll talk about security considerations for Hive.


Chapter 9. Security Considerations

In most open source software, security is one of the most important areas, but it is often addressed only at a later stage. As the main SQL-like interface to data in Hadoop, Hive must ensure that data is securely protected and accessed. For this reason, security in Hive is now considered an integral and important part of the Hadoop ecosystem. Earlier versions of Hive mainly relied on HDFS for security. Hive security gradually matured after HiveServer2 was released as an important milestone of the Hive server.

This chapter will discuss Hive security in the following areas:

Authentication
Authorization
Encryption


Authentication

Authentication is the process of verifying the identity of a user by obtaining the user’s credentials. Hive has offered authentication since HiveServer2. In the previous HiveServer, if we could access the host/port over the network, we could access the data. In this case, the Hive Metastore Server can be used to authenticate thrift clients using Kerberos. As mentioned in Chapter 2, Setting Up the Hive Environment, it is strongly recommended to upgrade the Hive server to HiveServer2 in terms of security and reliability. In this section, we will briefly talk about authentication configurations in both Metastore Server and HiveServer2.

Note

Kerberos

Kerberos is a network authentication protocol developed by MIT as part of Project Athena. It uses time-sensitive tickets that are generated using symmetric key cryptography to securely authenticate a user in an unsecured network environment. Kerberos is derived from Greek mythology, where Kerberos was the three-headed dog that guarded the gates of Hades. The three-headed part refers to the three parties involved in the Kerberos authentication process: client, server, and Key Distribution Center (KDC). All clients and servers registered to a KDC are known as a realm, which is typically the domain’s DNS name in all caps. For more information, please refer to the MIT Kerberos website at http://web.mit.edu/kerberos/.


Metastore server authentication

To force clients to authenticate with the Hive Metastore server using Kerberos, we can set the following properties in the hive-site.xml file:

Enable the Simple Authentication and Security Layer (SASL) framework to enforce client Kerberos authentication, as follows:

<property>
  <name>hive.metastore.sasl.enabled</name>
  <value>true</value>
  <description>If true, the metastore thrift interface will be secured
  with SASL framework. Clients must authenticate with Kerberos.
  </description>
</property>

Specify the Kerberos keytab that is generated. Override the path in the following example if the file is kept in another place. Make sure the file access permissions are set to 400 (read permission for the owner only) so that the service identity cannot be stolen by others:

<property>
  <name>hive.metastore.kerberos.keytab.file</name>
  <value>/etc/hive/conf/hive.keytab</value>
  <description>The sample path to the Kerberos Keytab file containing
  the metastore thrift server's service principal.</description>
</property>

Specify the Kerberos principal pattern string. The special string _HOST will be replaced automatically with the correct hostnames. The YOUR-REALM.COM value should be replaced by the actual realm name:

<property>
  <name>hive.metastore.kerberos.principal</name>
  <value>hive/_HOST@YOUR-REALM.COM</value>
  <description>The service principal for the metastore thrift server.
  </description>
</property>
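The advice about keytab permissions can be checked programmatically. The following is a minimal sketch, assuming a POSIX file system; the class name KeytabPermissionCheck is hypothetical, and the default path is simply the sample path from the keytab property shown earlier in this section.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

/** Hypothetical helper that verifies a keytab file has mode 400. */
public class KeytabPermissionCheck {

    /** True when the file's POSIX mode is exactly r-------- (400). */
    public static boolean isOwnerReadOnly(Path file) throws IOException {
        Set<PosixFilePermission> actual = Files.getPosixFilePermissions(file);
        Set<PosixFilePermission> expected = PosixFilePermissions.fromString("r--------");
        return actual.equals(expected);
    }

    public static void main(String[] args) throws IOException {
        // Sample path from the configuration above; pass another path as the first argument
        Path keytab = Paths.get(args.length > 0 ? args[0] : "/etc/hive/conf/hive.keytab");
        System.out.println(keytab + " owner-read-only: " + isOwnerReadOnly(keytab));
    }
}
```

A check like this could run as part of a deployment script before the Metastore server is started.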

HiveServer2 authentication

HiveServer2 supports the following authentication modes. To configure HiveServer2 to use one of them, we can set the proper properties in hive-site.xml as follows:

None authentication: This is the default setting. “None” here means Hive allows anonymous access, as shown in the following setting:

<property>
  <name>hive.server2.authentication</name>
  <value>NONE</value>
</property>

Kerberos authentication: If Kerberos authentication is used, authentication is supported between the thrift client and HiveServer2, and between HiveServer2 and secure HDFS. To enable Kerberos authentication for HiveServer2, we can set the following properties by overriding the keytab path (if we want to keep the file in another place) as well as changing YOUR-REALM.COM to the actual realm name:

<property>
  <name>hive.server2.authentication</name>
  <value>KERBEROS</value>
</property>
<property>
  <name>hive.server2.authentication.kerberos.keytab</name>
  <value>/etc/hive/conf/hive.keytab</value>
</property>
<property>
  <name>hive.server2.authentication.kerberos.principal</name>
  <value>hive/_HOST@YOUR-REALM.COM</value>
</property>

Once Kerberos is enabled, the JDBC client (such as Beeline) must include the principal parameter in the JDBC connection string, such as the following:

jdbc:hive2://HiveServer2HostName:10000/default;principal=hive/HiveServer2HostName@YOUR-REALM.COM

LDAP authentication: To configure HiveServer2 to use user and password validation backed by LDAP (see http://tools.ietf.org/html/rfc4511), we can set the following properties:

<property>
  <name>hive.server2.authentication</name>
  <value>LDAP</value>
</property>
<property>
  <name>hive.server2.authentication.ldap.url</name>
  <value>LDAP_URL, such as ldap://[email protected]</value>
</property>
<property>
  <name>hive.server2.authentication.ldap.Domain</name>
  <value>Your Domain Name</value>
</property>

To configure with OpenLDAP, we can add the setting of baseDN instead of the Domain property as follows:

<property>
  <name>hive.server2.authentication.ldap.baseDN</name>
  <value>LDAP_BaseDN, such as ou=people,dc=packtpub,dc=com</value>
</property>

Pluggable custom authentication: Pluggable custom authentication provides a custom authentication provider for HiveServer2. To enable it, configure the settings as follows:

<property>
  <name>hive.server2.authentication</name>
  <value>CUSTOM</value>
</property>
<property>
  <name>hive.server2.custom.authentication.class</name>
  <value>pluggable-auth-class-name</value>
  <description>Custom authentication class name, such as
  com.packtpub.hive.essentials.hiveudf.customAuthenticator
  </description>
</property>

Note

The pluggable authentication with a customized class did not work until the bug (see https://issues.apache.org/jira/browse/HIVE-4778) was fixed in Hive 0.13.0.

The following is a sample of a customized class that implements the org.apache.hive.service.auth.PasswdAuthenticationProvider interface. The overridden Authenticate method has the core logic of how to authenticate a username and password. Make sure to copy the compiled JAR file to $HIVE_HOME/lib/ so that the preceding settings can work.

customAuthenticator.java

package com.packtpub.hive.essentials.hiveudf;

import java.util.Hashtable;
import javax.security.sasl.AuthenticationException;
import org.apache.hive.service.auth.PasswdAuthenticationProvider;

/*
 * The customized class for HiveServer2 authentication
 */
public class customAuthenticator implements PasswdAuthenticationProvider {

    // Demo credential store; hardcoded passwords are for illustration only
    Hashtable<String, String> authHashTable = null;

    public customAuthenticator() {
        authHashTable = new Hashtable<String, String>();
        authHashTable.put("user1", "passwd1");
        authHashTable.put("user2", "passwd2");
    }

    @Override
    public void Authenticate(String user, String password)
            throws AuthenticationException {
        String storedPasswd = authHashTable.get(user);
        // Return normally when the stored password matches
        if (storedPasswd != null && storedPasswd.equals(password))
            return;
        throw new AuthenticationException(
            "customAuthenticator Exception: Invalid user");
    }
}

Pluggable Authentication Modules (PAM) authentication: Since Hive 0.13.0, Hive supports PAM authentication, which provides the benefit of plugging existing authentication mechanisms into Hive. Configure the following settings to enable PAM authentication. For more information about how to install PAM, please refer to the Setting Up HiveServer2 article in the Apache Hive wiki at https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2#SettingUpHiveServer2-PluggableAuthenticationModules(PAM).

<property>
  <name>hive.server2.authentication</name>
  <value>PAM</value>
</property>
<property>
  <name>hive.server2.authentication.pam.services</name>
  <value>pam-service-names</value>
  <description>Set this to a list of comma-separated PAM services that
  will be used. Note that a file with the same name as the PAM service
  must exist in /etc/pam.d.</description>
</property>


Authorization

Authorization in Hive is used to verify if a user has permission to perform a certain action, such as creating, reading, and writing data or metadata. Hive provides three authorization modes: legacy mode, storage-based mode, and SQL standard-based mode.


Legacy mode

This is the default authorization mode in Hive, providing column and row-level authorization through HQL statements. However, it is not a completely secure authorization mode and has a couple of limitations. It can be mainly used to prevent good users from accidentally doing bad things rather than preventing malicious users’ operations. In order to enable the legacy authorization mode, we need to set the following properties in hive-site.xml:

<property>
  <name>hive.security.authorization.enabled</name>
  <value>true</value>
  <description>enables or disable the hive client authorization
  </description>
</property>
<property>
  <name>hive.security.authorization.createtable.owner.grants</name>
  <value>ALL</value>
  <description>the privileges automatically granted to the owner whenever a
  table gets created. An example like "select,drop" will grant select and
  drop privilege to the owner of the table.
  </description>
</property>

Since this is not a secure authorization mode, we will not discuss more details here. For more HQL support in the legacy authorization mode, please refer to the Apache Hive wiki at https://cwiki.apache.org/confluence/display/Hive/Hive+Default+Authorization+-+Legacy+Mode.


Storage-based mode

The storage-based authorization mode (since Hive 0.10.0) relies on the authorization provided by the storage layer HDFS, which provides both POSIX and ACL permissions (available since Hive 0.14.0; refer to https://issues.apache.org/jira/browse/HIVE-7583). Storage-based authorization is enabled in the Hive Metastore server, which keeps a single consistent view of metadata across other applications in the ecosystem. This mode checks Hive user permissions against the POSIX permissions on the corresponding file directories in HDFS. In addition to the POSIX permissions model, HDFS also provides access control lists, described in ACLs on HDFS at http://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html#ACLs_Access_Control_Lists. Because of its implementation, the storage-based authorization mode only offers authorization at the level of Hive databases, tables, and partitions rather than at the column and row level. Because it depends on HDFS permissions, it lacks the flexibility to manage authorization through HQL statements.

To enable storage-based authorization mode, we can set the following properties in the hive-site.xml file:

<property>
  <name>hive.security.authorization.enabled</name>
  <value>true</value>
  <description>enable or disable the hive client authorization
  </description>
</property>
<property>
  <name>hive.security.authorization.manager</name>
  <value>org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider</value>
  <description>The class name of the Hive client authorization manager.
  </description>
</property>
<property>
  <name>hive.server2.enable.doAs</name>
  <value>true</value>
  <description>Allows Hive queries to be run by the user who submits the
  query rather than the hive user.</description>
</property>
<property>
  <name>hive.metastore.pre.event.listeners</name>
  <value>org.apache.hadoop.hive.ql.security.authorization.AuthorizationPreEventListener</value>
  <description>This turns on metastore-side security.</description>
</property>
<property>
  <name>hive.security.metastore.authorization.manager</name>
  <value>org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider</value>
  <description>authorization manager class name to be used in the
  metastore for authorization.</description>
</property>

Note

Since Hive 0.14.0, storage-based authorization also authorizes read privileges on databases and tables by default through the hive.security.metastore.authorization.auth.reads property. For more information, please refer to https://issues.apache.org/jira/browse/HIVE-8221.


SQL standard-based mode

For fine-grained access control on a column and row level, we can use SQL standard-based mode, available since Hive 0.13.0. It is similar to SQL authorization in that it uses the GRANT and REVOKE statements to control access, and it is enforced through the HiveServer2 configuration. However, tools such as the Hive CLI and Hadoop/HDFS/MapReduce commands do not access data through HiveServer2, so SQL standard-based mode cannot authorize their access. Therefore, it is recommended to use storage-based mode together with SQL standard-based mode to authorize users who do not access data through HiveServer2.

To enable SQL standard-based mode authorization, we can set the following properties in the hive-site.xml file:

<property>
  <name>hive.server2.enable.doAs</name>
  <value>false</value>
  <description>Allows Hive queries to be run by the user who submits the
  query rather than the hive user. Need to turn it off for this SQL
  standard-based mode</description>
</property>
<property>
  <name>hive.users.in.admin.role</name>
  <value>dayongd,administrator</value>
  <description>Comma-separated list of users assigned to the ADMIN role.
  </description>
</property>
<property>
  <name>hive.security.authorization.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hive.security.authorization.manager</name>
  <value>org.apache.hadoop.hive.ql.security.authorization.plugin.sql</value>
</property>
<property>
  <name>hive.security.authenticator.manager</name>
  <value>org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator</value>
</property>
<property>
  <name>hive.metastore.uris</name>
  <value>" "</value>
  <description>" " (quotation marks surrounding a single empty space).
  </description>
</property>

Before restarting HiveServer2, the users in the configured admin role must run the following command to make the admin role effective, and then restart HiveServer2:

jdbc:hive2://> GRANT admin TO USER dayongd;


The basic syntax to grant or revoke a role is as follows:

GRANT <ROLENAME> TO <USERS> [WITH ADMIN OPTION];
REVOKE [ADMIN OPTION FOR] <ROLENAME> FROM <USERS>;

Here, the following parameters are used:

<ROLENAME>: This can be a comma-separated list of role names
<USERS>: This can be a user or a role
WITH ADMIN OPTION: This makes sure that the user gets privileges to grant the role to other users/roles

The syntax to grant or revoke a privilege is as follows:

GRANT <PRIVILEGE> ON <OBJECT> TO <USERS>;
REVOKE <PRIVILEGE> ON <OBJECT> FROM <USERS>;

Here, the following parameters are used:

<PRIVILEGE>: This can be INSERT, SELECT, UPDATE, DELETE, or ALL
<USERS>: This can be a user or a role
<OBJECT>: This is a table or a view

For more examples of HQL statements to manage SQL standard-based authorization, please refer to the Apache Hive wiki at https://cwiki.apache.org/confluence/display/Hive/SQL+Standard+Based+Hive+Authorization#SQLStandardBasedHiveAuthorization-Configuration.

Note

Sentry

Sentry is a highly modular system for providing centralized, fine-grained, role-based authorization to both data and metadata stored on an Apache Hadoop cluster. It can be integrated with Hive to deliver advanced authorization controls. For more information about Sentry, please refer to http://incubator.apache.org/projects/sentry.html.


Encryption

For sensitive and legally protected data, such as personally identifiable information (PII), the data must be stored in an encrypted format in the filesystem. However, Hive does not natively support encryption and decryption yet (see https://issues.apache.org/jira/browse/HIVE-5207).

Alternatively, we can look for third-party tools to encrypt and decrypt data after exporting it from Hive, but this requires additional postprocessing. The new HDFS encryption (see https://issues.apache.org/jira/browse/HDFS-6134) offers transparent encryption and decryption of data on HDFS. It meets our needs if we want to encrypt a whole dataset in HDFS. However, it cannot be applied to selected columns or rows in a Hive table, where the PII to be encrypted is usually only a part of the raw data. In this case, the best solution for now is to use Hive UDFs to plug in encryption and decryption implementations on selected columns or partial data in the Hive tables.

Sample UDF implementations for encryption and decryption using the AES encryption algorithm are as follows:

AESEncrypt.java: The implementation is as follows:

package com.packtpub.hive.essentials.hiveudf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.udf.UDFType;

/*
 * A Hive encryption UDF
 */
@Description(
    name = "aesencrypt",
    value = "_FUNC_(str) - Returns encrypted string based on AES key.",
    extended = "Example:\n" +
        " > SELECT aesencrypt(pii_info) FROM table_name;\n"
)
@UDFType(deterministic = true, stateful = false)
public class AESEncrypt extends UDF {

    public String evaluate(String unencrypted) {
        String encrypted = "";
        if (unencrypted != null) {
            try {
                encrypted = CipherUtils.encrypt(unencrypted);
            } catch (Exception e) {
                // Ignored: CipherUtils already handles its own exceptions
            }
        }
        return encrypted;
    }
}

AESDecrypt.java: This can be implemented as follows:

package com.packtpub.hive.essentials.hiveudf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.udf.UDFType;

/*
 * A Hive decryption UDF
 */
@Description(
    name = "aesdecrypt",
    value = "_FUNC_(str) - Returns unencrypted string based on AES key.",
    extended = "Example:\n" +
        " > SELECT aesdecrypt(pii_info) FROM table_name;\n"
)
@UDFType(deterministic = true, stateful = false)
public class AESDecrypt extends UDF {

    public String evaluate(String encrypted) {
        // Start from the input itself so that a null input yields null
        // instead of a NullPointerException
        String unencrypted = encrypted;
        if (encrypted != null) {
            try {
                unencrypted = CipherUtils.decrypt(encrypted);
            } catch (Exception e) {
                // Ignored: CipherUtils already handles its own exceptions
            }
        }
        return unencrypted;
    }
}

CipherUtils.java: This can be implemented as follows:

package com.packtpub.hive.essentials.hiveudf;

import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import org.apache.commons.codec.binary.Base64;

/*
 * The core encryption and decryption logic function
 */
public class CipherUtils
{
    // This is a secret key given as ASCII codes
    private static byte[] key = {
        0x75, 0x69, 0x69, 0x73, 0x40, 0x73, 0x41, 0x53, 0x65, 0x65,
        0x72, 0x69, 0x74, 0x4b, 0x65, 0x75
    };

    public static String encrypt(String strToEncrypt)
    {
        try
        {
            // prepare algorithm
            Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
            final SecretKeySpec secretKey = new SecretKeySpec(key, "AES");
            // initialize cipher for encryption
            cipher.init(Cipher.ENCRYPT_MODE, secretKey);
            // Base64.encodeBase64String gives an ASCII string
            final String encryptedString =
                Base64.encodeBase64String(cipher.doFinal(strToEncrypt.getBytes()));
            return encryptedString.replaceAll("\r|\n", "");
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
        return null;
    }

    public static String decrypt(String strToDecrypt)
    {
        try
        {
            // prepare algorithm
            Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5PADDING");
            final SecretKeySpec secretKey = new SecretKeySpec(key, "AES");
            // initialize cipher for decryption
            cipher.init(Cipher.DECRYPT_MODE, secretKey);
            final String decryptedString = new
                String(cipher.doFinal(Base64.decodeBase64(strToDecrypt)));
            return decryptedString;
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
        return null;
    }
}

Note

AES

Short for Advanced Encryption Standard, AES is a symmetric 128-bit block data encryption technique developed by Belgian cryptographers Joan Daemen and Vincent Rijmen. For more information, please refer to http://en.wikipedia.org/wiki/Advanced_Encryption_Standard.
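The same AES round trip can be tried outside Hive with JDK classes only. This is a minimal sketch rather than the book's implementation: the class name AesRoundTrip and the 16-byte demo key are hypothetical, java.util.Base64 stands in for the Commons Codec class, and the character set is fixed to UTF-8. It keeps the ECB mode of the sample code, which encrypts equal inputs to equal outputs; that suits a deterministic UDF but also reveals which rows hold the same value.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical stand-alone AES round trip using only JDK classes. */
public class AesRoundTrip {
    // Hypothetical 16-byte demo key; a real deployment would load it from a protected store
    private static final byte[] KEY = "0123456789abcdef".getBytes(StandardCharsets.US_ASCII);

    /** Encrypts the text and returns the ciphertext as a Base64 string. */
    public static String encrypt(String plain) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(KEY, "AES"));
        byte[] raw = cipher.doFinal(plain.getBytes(StandardCharsets.UTF_8));
        return Base64.getEncoder().encodeToString(raw);
    }

    /** Decodes the Base64 string and decrypts it back to the original text. */
    public static String decrypt(String encoded) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
        cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(KEY, "AES"));
        byte[] raw = cipher.doFinal(Base64.getDecoder().decode(encoded));
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        String encrypted = encrypt("Will");
        System.out.println(encrypted);
        System.out.println(decrypt(encrypted));
    }
}
```

Because the demo key differs from the key in CipherUtils, the Base64 output will not match the ciphertext shown in the Hive session.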

To deploy the UDFs and verify them, do the following:

jdbc:hive2://> ADD JAR /home/dayongd/Downloads/hiveessentials-1.0-SNAPSHOT.jar;
No rows affected (0.002 seconds)

jdbc:hive2://> CREATE TEMPORARY FUNCTION aesdecrypt AS
.......> 'com.packtpub.hive.essentials.hiveudf.AESDecrypt';
No rows affected (0.02 seconds)

jdbc:hive2://> CREATE TEMPORARY FUNCTION aesencrypt AS
.......> 'com.packtpub.hive.essentials.hiveudf.AESEncrypt';
No rows affected (0.015 seconds)

jdbc:hive2://> SELECT aesencrypt('Will') AS encrypt_name
.......> FROM employee LIMIT 1;
+---------------------------+
|       encrypt_name        |
+---------------------------+
| YGvo54QIahpb+CVOwv9OkQ==  |
+---------------------------+
1 row selected (34.494 seconds)

jdbc:hive2://> SELECT aesdecrypt('YGvo54QIahpb+CVOwv9OkQ==')
.......> AS decrypt_name
.......> FROM employee LIMIT 1;
+---------------+
| decrypt_name  |
+---------------+
| Will          |
+---------------+
1 row selected (45.43 seconds)


Summary

In this chapter, we introduced three main areas for Hive security: authentication, authorization, and encryption. We covered authentication in the Metastore server and HiveServer2. Then, we talked about the legacy (default), storage-based, and SQL standard-based authorization modes in Hive. At the end of this chapter, we discussed the use of Hive UDFs for encryption and decryption. After going through this chapter, we should clearly understand the different areas that will help us address Hive security.

In the next chapter, we’ll talk about using Hive with other tools.


Chapter 10. Working with Other Tools

As one of the earliest and most popular SQL over Hadoop tools, Hive has many use cases of working with other tools to offer an end-to-end data intelligence solution. In this chapter, we will discuss the way Hive works with other big data tools in the following areas:

JDBC / ODBC connector
HBase
Hue
HCatalog
ZooKeeper
Oozie
Hive roadmap


JDBC / ODBC connector

JDBC / ODBC is one of the most common ways for Hive to work with other tools. Hadoop vendors, such as Cloudera and Hortonworks, offer free Hive JDBC / ODBC drivers so that Hive can be connected through these drivers; these can be found at the following links:

For Cloudera, the link is http://www.cloudera.com/content/cloudera/en/downloads/connectors/hive.html
For Hortonworks, the link is http://hortonworks.com/hdp/addons/

We can use these JDBC / ODBC connectors to connect Hive to tools such as the following:

A command-line utility such as Beeline, mentioned in Chapter 2, Setting Up the Hive Environment
Integrated development environments such as Oracle SQL Developer, mentioned in Chapter 2, Setting Up the Hive Environment
Data extraction, transformation, loading, and integration tools, such as Talend Open Studio
Business intelligence reporting tools, such as JasperReports and QlikView
Data analysis tools such as Microsoft Excel 2013
Data visualization tools such as Tableau

Since the setup of connectors is very straightforward, please refer to the websites of the preceding tools for more detailed instructions to connect to Hive.


HBase

HBase (see http://hbase.apache.org/) is a high-performance NoSQL key/value store on Hadoop. Hive offers a storage handler mechanism to integrate with HBase: the HBaseStorageHandler class creates HBase tables that are managed by Hive. By integrating Hive with HBase, Hive users can leverage the real-time transaction performance of HBase to do real-time big data analysis. Currently, the integration feature is still in progress, especially in the areas of higher performance and snapshot support. There is another project called Phoenix (see http://phoenix.apache.org/), which provides basic SQL with higher-performance support over HBase.

An example of creating an HBase table in HQL is as follows:

CREATE TABLE hbase_table_sample (
  id int,
  value1 string,
  value2 string,
  map_value map<string, string>
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val,cf2:val,cf3:")
TBLPROPERTIES ("hbase.table.name" = "table_name_in_hbase");

In this special CREATE TABLE statement, the HBaseStorageHandler class delegates interaction with the HBase table to HiveHBaseTableInputFormat and HiveHBaseTableOutputFormat. The hbase.columns.mapping property is required; it maps each table column defined in the statement to the HBase table columns, in order. For example, id, the first column, maps to the HBase table's row key as :key. Sometimes, we may need to generate a proper row key column using Hive UDFs if there is no existing column that can be used as a row key for the HBase table. The value1 column maps to the val column in the cf1 column family in the HBase table; likewise, value2 maps to the val column in the cf2 column family. The Hive MAP data type can be used to access an entire column family. Each row can have a different set of columns, where the column names correspond to the map keys and the column values correspond to the map values, as with the map_value column. The hbase.table.name property, which is optional, specifies the table name known by HBase. If it is not provided, the Hive and HBase tables will have the same name, such as hbase_table_sample.
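The positional nature of the column mapping described above can be sketched in a few lines of Python. This is only a model of the rule (one mapping entry per Hive column, in order), not code from Hive or HBase; the function name describe_mapping and the wording of its output are assumptions made for illustration.

```python
def describe_mapping(hive_columns, columns_mapping):
    """Pair each Hive column with its entry in an hbase.columns.mapping string.

    Illustrative model only: entries are matched strictly by position.
    """
    targets = columns_mapping.split(",")
    if len(targets) != len(hive_columns):
        raise ValueError("mapping must have exactly one entry per Hive column")

    described = {}
    for column, target in zip(hive_columns, targets):
        if target == ":key":
            described[column] = "row key"
            continue
        family, _, qualifier = target.partition(":")
        if qualifier:
            described[column] = f"column {qualifier} in family {family}"
        else:
            # An entry such as "cf3:" names a whole family (Hive MAP type).
            described[column] = f"entire column family {family}"
    return described


if __name__ == "__main__":
    columns = ["id", "value1", "value2", "map_value"]
    for name, target in describe_mapping(columns, ":key,cf1:val,cf2:val,cf3:").items():
        print(name, "->", target)
```

Reordering either the Hive columns or the mapping entries changes which HBase column each Hive column reads from, which is why the two lists need to be kept in step.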

Note

For more information about the configuration and the in-progress features of Hive-HBase integration, please refer to the Apache Hive wiki at https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration.


Hue

Hue (see http://gethue.com/) is short for Hadoop User Experience. It is a web interface for making the Hadoop ecosystem easier to use. For Hive users, Hue offers a unified web interface for easily accessing both HDFS and Hive in an interactive environment. Hue can be installed alone or with the Hadoop vendor packages. In addition, Hue adds more programming-friendly features to Hive, such as the following:

Highlights HQL keywords
Autocompletes HQL queries
Offers live progress and logs for Hive and MapReduce jobs
Submits several queries and checks progress later
Browses data in Hive tables through a web user interface
Navigates through the metadata
Registers UDFs and adds files/archives through a web user interface
Saves, exports, and shares the query result
Creates various charts from the query result

The following is a screenshot of the Hive editor interface in Hue:

[Figure: Hue Hive editor user interface]


HCatalog

HCatalog (see https://cwiki.apache.org/confluence/display/Hive/HCatalog) is a metadata management system for Hadoop data. It stores consistent schema information for Hadoop ecosystem tools, such as Pig, Hive, and MapReduce. By default, HCatalog supports data in the RCFile, CSV, JSON, SequenceFile, and ORC file formats, as well as customized formats if InputFormat, OutputFormat, and SerDe are implemented. By using HCatalog, users are able to directly create, edit, and expose (via its REST API) metadata, which becomes effective immediately in all tools sharing the same piece of metadata. At first, HCatalog was a separate Apache project from Hive and was part of the Apache Incubator, where most Apache projects start. Eventually, HCatalog became a part of the Hive project in 2013, starting with Hive 0.11.0.

HCatalog is built on top of the Hive metastore and incorporates support for Hive DDL. It provides read and write interfaces for Pig, HCatLoader and HCatStorer, by implementing Pig's load and store interfaces, respectively. HCatalog also provides an interface for MapReduce programs through HCatInputFormat and HCatOutputFormat, which, much like other customized formats, implement Hadoop's InputFormat and OutputFormat. HCatalog provides a REST API from a component called WebHCat so that HTTP requests can be made to access the metadata of Hadoop MapReduce/YARN, Pig, Hive, and HCatalog DDL from other applications. There is no Hive-specific interface since HCatalog uses Hive's metastore. Therefore, HCatalog can define metadata for Hive directly through its CLI. The HCatalog CLI supports the HQL SHOW/DESCRIBE statements and the majority of Hive DDL, except the following statements, which require running MapReduce jobs:

CREATE TABLE … AS SELECT
ALTER INDEX … REBUILD
ALTER TABLE … CONCATENATE
ALTER TABLE ARCHIVE/UNARCHIVE PARTITION
ANALYZE TABLE … COMPUTE STATISTICS
IMPORT/EXPORT
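To make the WebHCat REST interface mentioned above a little more concrete, the following Python sketch builds the URL of a request that describes a table. It does not send anything over the network. The helper name webhcat_describe_table_url, the host, and the user name are assumptions for illustration, and the port 50111 is assumed to be the WebHCat default; the WebHCat reference on the Hive wiki lists the actual resources and parameters.

```python
from urllib.parse import quote, urlencode


def webhcat_describe_table_url(host, database, table, user, port=50111):
    """Build a WebHCat (templeton) URL for describing an HCatalog table.

    Illustrative helper: it only formats the URL and performs no request.
    """
    path = "/templeton/v1/ddl/database/{}/table/{}".format(
        quote(database, safe=""), quote(table, safe="")
    )
    # WebHCat identifies the caller through the user.name query parameter.
    query = urlencode({"user.name": user})
    return f"http://{host}:{port}{path}?{query}"


if __name__ == "__main__":
    print(webhcat_describe_table_url("localhost", "default", "employee", "hive"))
```

An HTTP GET on such a URL would be issued by any client library; the response format is described in the WebHCat documentation rather than assumed here.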


ZooKeeper

ZooKeeper (see http://zookeeper.apache.org/) is a centralized service for configuration management and the synchronization of various aspects of naming and coordination. It manages a naming registry and effectively implements a system for managing the various statically and dynamically named objects in a hierarchical system. It also enables coordination and control of shared resources, such as files and data, which are manipulated by multiple concurrent processes.

Unlike an RDBMS, Hive does not natively support concurrent access and locking mechanisms. Hive has relied on ZooKeeper for locking shared resources since Hive 0.7.0. There are two types of locks provided by Hive through ZooKeeper, and they are as follows:

Shared lock: This is acquired when a table/partition is read. Concurrent shared locks are allowed in Hive.
Exclusive lock: This is acquired for all other operations that modify the table. For partitioned tables, only a shared lock is acquired on the table if the change applies only to newly created partitions. An exclusive lock is acquired on the table if the change applies to all partitions. In addition, an exclusive lock on the table globally affects all partitions.

Any HQL statement must acquire the proper locks before being allowed to perform the corresponding lock-permitted operations.
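The compatibility rule between the two lock types can be summarized with a small Python model. This is a sketch of the rule as described above (many readers may share a lock, while an exclusive lock excludes everything else); it is not how Hive or ZooKeeper implements locking, and the names SHARED, EXCLUSIVE, and can_acquire are illustrative assumptions.

```python
SHARED = "shared"
EXCLUSIVE = "exclusive"


def can_acquire(requested, held):
    """Return True if a lock in mode `requested` is compatible with `held`.

    `held` is the list of lock modes currently held on the same table.
    """
    if not held:
        return True  # nothing is locked yet
    if requested == EXCLUSIVE:
        return False  # an exclusive lock needs the table to itself
    # A shared lock is fine as long as every existing lock is shared too.
    return all(mode == SHARED for mode in held)


if __name__ == "__main__":
    print(can_acquire(SHARED, [SHARED, SHARED]))
    print(can_acquire(SHARED, [EXCLUSIVE]))
```

A read issued while another session holds an exclusive lock falls into the second case, so it has to wait until that lock is released.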

To enable locking in Hive, we need to make sure ZooKeeper is installed and configured. Then, configure the following properties in Hive's hive-site.xml file:

<property>
  <name>hive.support.concurrency</name>
  <description>Enable Hive's Table Lock Manager Service</description>
  <value>true</value>
</property>
<property>
  <name>hive.zookeeper.quorum</name>
  <description>Comma separated Zookeeper quorum used by Hive's Table Lock Manager.</description>
  <value>localhost.localdomain</value>
</property>

We can also set the following property to use the new lock manager for transaction support, available since Hive 0.13.0:

<property>
  <name>hive.txn.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
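Before restarting Hive, it can be handy to confirm that the locking properties are really present in the configuration file. The following Python sketch reads properties from a hive-site.xml style document and checks the two locking settings shown above. The sample document, the function names, and the idea of treating a non-empty quorum as sufficient are assumptions for illustration; it is not a tool that ships with Hive.

```python
import xml.etree.ElementTree as ET

# A minimal stand-in for hive-site.xml, wrapped in the usual <configuration> root.
SAMPLE_HIVE_SITE = """<configuration>
  <property>
    <name>hive.support.concurrency</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.zookeeper.quorum</name>
    <value>localhost.localdomain</value>
  </property>
</configuration>"""


def read_properties(xml_text):
    """Return a dict of property name to value from a Hadoop-style XML config."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}


def locking_enabled(properties):
    """Check that concurrency support is on and a ZooKeeper quorum is set."""
    return (
        properties.get("hive.support.concurrency") == "true"
        and bool(properties.get("hive.zookeeper.quorum"))
    )


if __name__ == "__main__":
    print(locking_enabled(read_properties(SAMPLE_HIVE_SITE)))
```

For a real cluster, the text of the actual hive-site.xml file would be passed to read_properties instead of the sample string.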

Note

Once configured, we can further set the locking properties, which are specified and detailed at https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-Locking.

Locks are either implicitly acquired/released by HQL statements or explicitly acquired/released using the LOCK and UNLOCK statements, as follows:

-- Lock table and specify lock type
jdbc:hive2://> LOCK TABLE employee shared;
No rows affected (1.328 seconds)

-- Show the lock information on the specific tables
jdbc:hive2://> SHOW LOCKS employee EXTENDED;
+------------------------------------------------------------------------+-----+
| tab_name                                                               | mo  |
+------------------------------------------------------------------------+-----+
| default@employee                                                       | SHA |
| LOCK_QUERYID:hive_20150105170303_792598b1-0ac8-4aad-aa4e-c4cdb0de6697  |     |
| LOCK_TIME:1420495466554                                                |     |
| LOCK_MODE:EXPLICIT                                                     |     |
| LOCK_QUERYSTRING:LOCK TABLE employee shared                            |     |
+------------------------------------------------------------------------+-----+
5 rows selected (0.576 seconds)

-- Release the lock on the table
jdbc:hive2://> UNLOCK TABLE employee;
No rows affected (0.209 seconds)

-- Show all locks in the database
jdbc:hive2://> SHOW LOCKS;
+-----------+-------+
| tab_name  | mode  |
+-----------+-------+
+-----------+-------+
No rows selected (0.529 seconds)

jdbc:hive2://> LOCK TABLE employee exclusive;
No rows affected (0.185 seconds)

jdbc:hive2://> SHOW LOCKS employee EXTENDED;
+------------------------------------------------------------------------+-----+
| tab_name                                                               | mo  |
+------------------------------------------------------------------------+-----+
| default@employee                                                       | EXC |
| LOCK_QUERYID:hive_20150105170808_bbc6db18-e44a-49a1-bdda-3dc30b5c8cee  |     |
| LOCK_TIME:1420495807855                                                |     |
| LOCK_MODE:EXPLICIT                                                     |     |
| LOCK_QUERYSTRING:LOCK TABLE employee exclusive                         |     |
+------------------------------------------------------------------------+-----+
5 rows selected (0.578 seconds)

jdbc:hive2://> SELECT * FROM employee;

While the table is held under an exclusive lock, the preceding SELECT statement waits for the lock and returns no result set until we unlock the table in the other session. From the Hive log, we can find the following information, which shows that the SELECT statement is waiting to get the read lock:

15/01/05 17:13:39 INFO ql.Driver: <PERFLOG method=acquireReadWriteLocks>
15/01/05 17:13:39 ERROR ZooKeeperHiveLockManager: conflicting lock present for default@employee mode SHARED

Note

For more information about using ZooKeeper for Hive locks, please refer to the Apache Hive wiki at https://cwiki.apache.org/confluence/display/Hive/Locking.


Oozie

Oozie (see http://oozie.apache.org/) is an open source workflow coordination and scheduling service to manage data processing jobs. Oozie workflow jobs are defined as a series of nodes in a Directed Acyclical Graph (DAG). Acyclical here means that there are no loops in the graph and all nodes in the graph flow in one direction without going back. Oozie workflows contain either control flow nodes or action nodes:

Control flow node: This either defines the start, end, and failed nodes in a workflow or controls the workflow execution path, such as the decision, fork, and join nodes.
Action node: This defines the core data processing action job, such as MapReduce, Hadoop filesystem, Hive, Pig, Java, Shell, e-mail, and Oozie subworkflows. Additional types of actions are also supported by developing extensions.

Oozie is a scalable, reliable, and extensible system. It can be parameterized for workflow submission and scheduled to run automatically. Therefore, Oozie is very suitable for lightweight data integration or maintenance jobs.
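The "no loops" property of a workflow graph can be checked mechanically. The following Python sketch models a workflow as a dictionary of node names and their successors and uses a topological sort (Kahn's algorithm) to either produce an execution order or report a loop. It is a generic graph routine chosen for illustration, not Oozie's own validation code, and the node names are assumptions.

```python
from collections import deque


def topological_order(edges):
    """Return the nodes in a valid execution order.

    `edges` maps each node to the list of nodes that follow it.
    Raises ValueError if the graph contains a loop.
    """
    nodes = set(edges)
    for targets in edges.values():
        nodes.update(targets)

    # Count how many predecessors each node still waits for.
    indegree = {node: 0 for node in nodes}
    for targets in edges.values():
        for target in targets:
            indegree[target] += 1

    ready = deque(sorted(n for n in nodes if indegree[n] == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for target in edges.get(node, []):
            indegree[target] -= 1
            if indegree[target] == 0:
                ready.append(target)

    if len(order) != len(nodes):
        raise ValueError("workflow contains a loop")
    return order


if __name__ == "__main__":
    workflow = {"start": ["hive_action"], "hive_action": ["end"]}
    print(topological_order(workflow))
```

A fork simply gives a node several successors, and a join gives a node several predecessors; both remain acyclic as long as no path leads back to an earlier node.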

Hue offers very friendly and powerful support for Oozie through the Oozie editor. Creating and submitting an Oozie workflow of Hive actions from Hue is as straightforward as the following steps:

1. Log in to Hue and select Workflows | Editors | Workflows from the top menu bar to open Workflow Manager.
2. Click on the Create button to create a workflow.
3. Give a proper workflow name and save the workflow.
4. Once the workflow is saved, the Oozie editor window appears for further settings.
5. Drag a Hive action to the middle of the start and end nodes.
6. In the Edit Node: menu shown, the following settings are present. Provide proper settings as follows:

   Name: Give a proper action name.
   Description: This is where to describe the job. This is optional.
   Advanced: This is for SLA monitoring. This is optional.
   Script name: Choose the HQL scripts from HDFS for the Hive action.
   Prepare: Define actions, such as deleting files or creating folders, before running the script. This is optional.
   Parameters: This defines the parameters to be taken when submitting the job (such as ${date}). This is optional.
   Job properties: This is where to set Hadoop/Hive properties. This is optional.
   Files: This is where to select the files needed for the scripts. This is optional.
   Archives: This is where to select the archive files such as UDF JARs. This is optional.
   Job XML: Choose a copy of the hive-site.xml file of the Hive cluster from HDFS so that Oozie can connect to the Hive metastore.

7. Click on Done in the Edit Node: menu and then click on Save in Workflow Editor.
8. Click on Submit to submit the workflow. Then, the Hive action is triggered by the Oozie workflow successfully.
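The Parameters setting in step 6 is easiest to picture as placeholder substitution in the script text. The following Python sketch imitates that idea with the standard library's string.Template, which happens to use the same ${name} placeholder style. It is only an analogy: Oozie and Hive perform their own substitution, and the function name render_script, the sample query, and the start_date column are hypothetical.

```python
from string import Template


def render_script(script_text, parameters):
    """Replace ${name} placeholders in a script with the supplied values.

    Raises KeyError if the script uses a parameter that was not supplied.
    """
    return Template(script_text).substitute(parameters)


if __name__ == "__main__":
    # Hypothetical HQL script with one parameter supplied at submission time.
    script = "SELECT * FROM employee WHERE start_date = '${date}';"
    print(render_script(script, {"date": "2015-01-05"}))
```

Failing loudly on a missing parameter mirrors the practical advice of supplying every parameter a script refers to before submitting the workflow.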


Hive roadmap

As this is the end of this chapter as well as of this book, the highlights of each Hive release milestone and the future features expected are summarized as follows, along with best wishes to the Hive community for growing bigger and better in the near future:

December 2011 – Hive 0.8.0

Added Bitmap indexes
Added the TIMESTAMP data type
Added the Hive Plugin Developer Kit to make plugin building and testing easier
Improved the JDBC driver and fixed bugs

April 2012 – Hive 0.9.0

Added the CREATE OR REPLACE VIEW statement
Added NOT IN and NOT LIKE support
Added BETWEEN and the NULL-safe equality operator
Added the printf(), sort_array(), and concat_ws() functions
Added a filter push-down from Hive into HBase for the key column
Combined multiple UNION ALL statements in one MapReduce job
Combined multiple GROUP BY statements on the same data with the same keys in one MapReduce job

January 2013 – Hive 0.10.0

Added the CUBE and ROLLUP statements
Added better support for YARN
Added more information in the EXPLAIN statement
Added the SHOW CREATE TABLE statement
Added built-in support for reading/writing Avro data
Added improvements for skewed joins
Made simple queries faster by running them without MapReduce jobs

May 2013 – Hive 0.11.0 as Stinger Phase 1

Added ORC for better performance
Added analytic and windowing functions
Added HCatalog as part of Hive
Added GROUP BY column positions
Improved data types and added the DECIMAL data type
Improved joins for broadcast and SMB joins
Implemented HiveServer2

October 2013 – Hive 0.12.0 as Stinger Phase 2

Added VARCHAR and DATE support
Added parallel ORDER BY to Hive
Added more improvements for ORC, such as predicate push-down
Added a correlation optimizer
Added support for GROUP BY on the STRUCT type
Added support for the outer lateral view
Pushed LIMIT down to mappers

April 2014 – Hive 0.13.0 as Stinger Phase 3 Final

Added DECIMAL and CHAR data types
Added support for running jobs on Tez
Added a vectorized query engine
Added support for subqueries for IN, NOT IN, EXISTS, and NOT EXISTS
Added support for permanent functions
Added support for common table expressions
Added SQL standard-based authorization

November 2014 – Hive 0.14.0 as Stinger.next Phase 1

Added transactions with ACID semantics
Added a Cost-Based Optimizer (CBO)
Added the CREATE TEMPORARY TABLE statement
Added support for STORED AS AVRO in the CREATE TABLE statement
Added the skipTrash configuration for the DROP TABLE statement
Added AccumuloStorageHandler
Used Tez auto parallelism in Hive

February 2015 – Hive 1.0.0

Moved to a 1.x.y release naming structure
Made HiveMetaStoreClient a public API
Removed HiveServer1
Switched to Tez 0.5.2

Future

Offer subsecond queries with Live Long And Process (LLAP)
Offer Hive over Spark
Support SQL:2011 analytics
Support cross-geo queries
Offer materialized views
Offer workload management via YARN and LLAP integration
Hive as a unified data query tool


Summary

In this final chapter, we introduced some big data tools that can work with Hive, either through JDBC/ODBC connectors or through direct integration, such as HBase, Hue, HCatalog, ZooKeeper, and Oozie. Then, we reviewed the key releases of Hive from 0.8.0 to 1.0.0, as well as the exciting features expected in the future. After going through this chapter, we should understand how to use other big data tools with Hive to provide end-to-end data intelligence solutions.


Index

A

Abstract syntax tree (AST)
    about / The EXPLAIN statement
ACLs
    on HDFS, URL / Storage-based mode
Advanced Encryption Standard (AES)
    URL / Encryption
aggregate functions / Operators and functions
aggregation
    data aggregation / Basic aggregation – GROUP BY
    without GROUP BY columns / Basic aggregation – GROUP BY
    with GROUP BY columns / Basic aggregation – GROUP BY
    advanced / Advanced aggregation – GROUPING SETS, Advanced aggregation – ROLLUP and CUBE
    ROLLUP statement / Advanced aggregation – ROLLUP and CUBE
    CUBE statement / Advanced aggregation – ROLLUP and CUBE
    condition, HAVING statement / Aggregation condition – HAVING
Amazon EMR
    URL / Starting Hive in the cloud
analytic functions
    about / Analytic functions
    Function (arg1, …, argn) / Analytic functions
    Standard aggregations / Analytic functions
    RANK / Analytic functions
    DENSE_RANK / Analytic functions
    ROW_NUMBER / Analytic functions
    CUME_DIST / Analytic functions
    PERCENT_RANK / Analytic functions
    NTILE / Analytic functions
    LEAD function / Analytic functions
    LAG function / Analytic functions
    FIRST_VALUE / Analytic functions
    LAST_VALUE / Analytic functions
    window expressions / Analytic functions
ANALYZE statement
    about / The ANALYZE statement
ANTLR
    URL / The EXPLAIN statement
Apache
    used, for installing Hive / Installing Hive from Apache
Apache Hive
    Wiki, URL / Using the Hive command line and Beeline
Apache Hive Wiki
    URL / HBase
Apache JIRA Hive-365
    URL / Understanding Hive data types
Atomicity, Consistency, Isolation, and Durability (ACID)
    about / Transactions
authentication
    about / Authentication
    Metastore server authentication / Metastore server authentication
    HiveServer2 authentication / HiveServer2 authentication
authorization
    about / Authorization
    legacy mode / Legacy mode
    storage-based mode / Storage-based mode
    SQL standard-based mode / SQL standard-based mode
Avro
    URL / SerDe
AvroSerDe / SerDe
Azure HDInsight Service
    URL / Starting Hive in the cloud

B

batch processing
    about / Batch, real-time, and stream processing
Beeline
    using / Using the Hive command line and Beeline
    URL / Using the Hive command line and Beeline
    command-line syntax / Using the Hive command line and Beeline
big data
    about / Introducing big data
    Volume / Introducing big data
    volume / Introducing big data
    velocity / Introducing big data
    variety / Introducing big data
    veracity / Introducing big data
    variability / Introducing big data
    volatility / Introducing big data
    visualization / Introducing big data
    value / Introducing big data
block sampling / Sampling
bucket map join / Bucket map join
buckets
    about / Hive buckets
    number / Hive buckets
bucket tables
    about / Bucket tables
bucket table sampling / Sampling

C

cloud
    Hive, starting / Starting Hive in the cloud
Cloudera
    URL / Starting Hive in the cloud
    about / JDBC/ODBC connector
Cloudera Distributed Hadoop (CDH)
    URL / Installing Hive from vendor packages
CLUSTER BY / ORDER and SORT
collection functions / Operators and functions
collection item delimiter / Understanding Hive data types
ColumnarSerDe / SerDe
CombineFileInputFormat / Storage optimization
common join, join optimization / Common join
Common Table Expression (CTE) / Hive internal and external tables
compression / Compression
conditional functions / Operators and functions
Cost-Based Optimizer (CBO)
    about / The ANALYZE statement
Cost Base Optimizer (CBO) / Hive roadmap
CREATE TABLE / Hive internal and external tables
Create the table as select (CTAS) / Hive internal and external tables
CROSS JOIN statement / The OUTER JOIN and CROSS JOIN statements
CUBE statement
    about / Advanced aggregation – ROLLUP and CUBE

D

data aggregation
    about / Basic aggregation – GROUP BY
database, Hive
    about / Hive database
data exchange
    LOAD keyword / Data exchange – LOAD
    INSERT keyword / Data exchange – INSERT
    EXPORT statement / Data exchange – EXPORT and IMPORT
    IMPORT statement / Data exchange – EXPORT and IMPORT
data file optimization
    about / Data file optimization
    file format / File format
    compression / Compression
    storage optimization / Storage optimization
data type conversions
    about / Data type conversions
    primitive type conversion / Data type conversions
    explicit type conversion / Data type conversions
data type functions
    tips, complex / Operators and functions
data types, Hive
    about / Understanding Hive data types
    TINYINT / Understanding Hive data types
    SMALLINT / Understanding Hive data types
    INT / Understanding Hive data types
    BIGINT / Understanding Hive data types
    FLOAT / Understanding Hive data types
    DOUBLE / Understanding Hive data types
    DECIMAL / Understanding Hive data types
    BINARY / Understanding Hive data types
    BOOLEAN / Understanding Hive data types
    STRING / Understanding Hive data types
    CHAR / Understanding Hive data types
    VARCHAR / Understanding Hive data types
    DATE / Understanding Hive data types
    TIMESTAMP / Understanding Hive data types
date functions / Operators and functions
date function tips / Operators and functions
delimiters
    row delimiter / Understanding Hive data types
    collection item delimiter / Understanding Hive data types
    map key delimiter / Understanding Hive data types
deployment / Development and deployment
Derby
    URL / Installing Hive from Apache
design optimization
    about / Design optimization
    partition tables / Partition tables
    bucket tables / Bucket tables
    index / Index
development / Development and deployment
Directed Acyclical Graph (DAG) / Oozie
directed acyclic graphs (DAGs) / Index
DISTRIBUTE BY / ORDER and SORT

E

encryption
    about / Encryption
EXPLAIN statement
    about / The EXPLAIN statement
    EXTENDED keyword / The EXPLAIN statement
    DEPENDENCY keyword / The EXPLAIN statement
    AUTHORIZATION keyword / The EXPLAIN statement
explicit type conversion / Data type conversions
EXPORT statement / Data exchange – EXPORT and IMPORT
external tables
    about / Hive internal and external tables

F

file format, data file optimization
    about / File format
    TEXTFILE / File format
    SEQUENCEFILE / File format
    RCFILE / File format
    Optimized Row Columnar (ORC) / File format
    PARQUET / File format
Flume / Overview of the Hadoop ecosystem
functions
    about / Operators and functions
    mathematical functions / Operators and functions
    collection functions / Operators and functions
    type conversion functions / Operators and functions
    date functions / Operators and functions
    conditional functions / Operators and functions
    string functions / Operators and functions
    aggregate functions / Operators and functions
    table-generating functions / Operators and functions
    customized / Operators and functions
    complex data type functions tips / Operators and functions
    date function tips / Operators and functions
    CASE, for data types / Operators and functions
    parser and search tips / Operators and functions
    virtual columns / Operators and functions

G

GenericUDAF
    URL / The UDAF code template
GROUPING SETS keyword
    about / Advanced aggregation – GROUPING SETS

H

Hadoop
    versus relational database / Relational and NoSQL database versus Hadoop
    versus NoSQL database / Relational and NoSQL database versus Hadoop
Hadoop Archive and HAR / Storage optimization
Hadoop Archive File (HAR) / File format
Hadoop ecosystem
    about / Overview of the Hadoop ecosystem
HAVING statement
    about / Aggregation condition – HAVING
HBase
    about / HBase
    URL / HBase
    table, creating in HQL / HBase
HBaseSerDe / SerDe
HCatalog
    about / HCatalog
    URL / HCatalog
HDFS
    about / Batch, real-time, and stream processing, Overview of the Hadoop ecosystem
HDFS federation / Storage optimization
Hive
    about / Hive overview
    installing, from Apache / Installing Hive from Apache
    URL / Installing Hive from Apache
    installing, from vendor packages / Installing Hive from vendor packages
    starting, in cloud / Starting Hive in the cloud
    data types / Understanding Hive data types
    complex types / Understanding Hive data types
    types / Understanding Hive data types
    database / Hive database
    internal tables / Hive internal and external tables
    external tables / Hive internal and external tables
    partitions / Hive partitions
    buckets / Hive buckets
    views / Hive views
    performance utilities / Performance utilities
Hive, complex types
    ARRAY / Understanding Hive data types
    MAP / Understanding Hive data types
    STRUCT / Understanding Hive data types
    NAMED STRUCT / Understanding Hive data types
    UNION / Understanding Hive data types
Hive-integrated development environment (IDE)
    about / The Hive-integrated development environment
hive.map.aggr property / Basic aggregation – GROUP BY
Hive CLI
    command-line syntax / Using the Hive command line and Beeline
    URL / Using the Hive command line and Beeline
Hive command line
    using / Using the Hive command line and Beeline
Hive Data Definition Language (DDL)
    about / Hive Data Definition Language
Hive join optimization
    URL / Skew join
Hive roadmap
    about / Hive roadmap
HiveServer2
    URL / Using the Hive command line and Beeline
HiveServer2 authentication
    none authentication / HiveServer2 authentication
    Kerberos authentication / HiveServer2 authentication
    LDAP authentication / HiveServer2 authentication
    pluggable custom authentication / HiveServer2 authentication
    Pluggable Authentication Modules (PAM) authentication / HiveServer2 authentication
Hive Wiki
    URL / Operators and functions
Hortonworks
    URL / JDBC/ODBC connector
HQL
    about / Hive overview
Hue
    URL / The Hive-integrated development environment, Hue
    about / Hue

I

Impala
    URL / A short history
IMPORT statement / Data exchange – EXPORT and IMPORT
index
    about / Index
INNER JOIN statement / The INNER JOIN statement
INSERT keyword / Data exchange – INSERT
internal tables
    about / Hive internal and external tables

J

Java IDE
    URL / Development and deployment
Java Virtual Machine (JVM) / Batch, real-time, and stream processing
javax.script API
    URL / User-defined functions
JDBC/ODBC connector
    about / JDBC/ODBC connector
job and query optimization
    about / Job and query optimization
    local mode / Local mode
    JVM reuse / JVM reuse
    parallel execution / Parallel execution
join optimization
    about / Join optimization
    common join / Common join
    map join / Map join
    bucket map join / Bucket map join
    Sort merge bucket (SMB) join / Sort merge bucket (SMB) join
    Sort merge bucket map (SMBM) join / Sort merge bucket map (SMBM) join
    skew join / Skew join
JSONSerDe
    URL / SerDe
    about / SerDe
JVM reuse, job and query optimization / JVM reuse

K

Kerberos
    about / Authentication
Kerberos authentication / HiveServer2 authentication
Key Distribution Center (KDC) / Authentication

L

LazySimpleSerDe / SerDe
LDAP authentication / HiveServer2 authentication
legacy mode, authorization
    about / Legacy mode
Live Long And Process (LLAP) / Hive roadmap
LOAD keyword / Data exchange – LOAD
local mode, job and query optimization / Local mode

M

map join, join optimization / Map join
MAPJOIN statement / Special JOIN – MAPJOIN
map key delimiter / Understanding Hive data types
mathematical functions / Operators and functions
Maven
    URL / Development and deployment
metastore / Hive overview
Metastore server authentication
    about / Metastore server authentication
MIT Kerberos
    URL / Authentication
MySQL
    URL / Installing Hive from Apache

N

none authentication / HiveServer2 authentication
NoSQL database
    versus Hadoop / Relational and NoSQL database versus Hadoop

O

Oozie
    about / Oozie
    URL / Oozie
    control flow node / Oozie
    action node / Oozie
OpenCSVSerDe / SerDe
operators
    about / Operators and functions
Optimized Row Columnar (ORC) / Index, File format
Optimized Row Columnar (ORC) file
    about / Transactions
ORDER BY (ASC|DESC) keyword / ORDER and SORT
ORDER keyword / ORDER and SORT
OUTER JOIN statement / The OUTER JOIN and CROSS JOIN statements
Out Of Memory (OOM) exceptions / The INNER JOIN statement

P

parallel execution, job and query optimization / Parallel execution
ParquetHiveSerDe / SerDe
parser and search tips / Operators and functions
PARTITION BY statement / Analytic functions
partitions
  about / Hive partitions
partition tables
  by date and time / Partition tables
  by locations / Partition tables
  by business logics / Partition tables
personal identity information (PII)
  about / Encryption
Phoenix
  URL / HBase
Pluggable Authentication Modules (PAM) authentication / HiveServer2 authentication
pluggable custom authentication / HiveServer2 authentication
PostgreSQL
  URL / Installing Hive from Apache
Presto
  URL / A short history
primitive type conversion / Data type conversions
Processing Elements (PE) / Batch, real-time, and stream processing


R

random sampling
  URL / Sampling
real-time processing
  about / Batch, real-time, and stream processing
Record Columnar File (RCFILE) / File format
RegexSerDe / SerDe
relational database
  versus Hadoop / Relational and NoSQL database versus Hadoop
ROLLUP statement
  about / Advanced aggregation – ROLLUP and CUBE
row delimiter / Understanding Hive data types


S

sampling
  about / Sampling
  random sampling / Sampling
  bucket table sampling / Sampling
  block sampling / Sampling
SELECT * statement / The SELECT statement
SELECT statement / The SELECT statement
Sentry
  URL / SQL standard-based mode
SequenceFile format / Storage optimization
SerDe
  about / SerDe
  data, reading / SerDe
  data, writing / SerDe
  LazySimpleSerDe / SerDe
  ColumnarSerDe / SerDe
  RegexSerDe / SerDe
  HBaseSerDe / SerDe
  AvroSerDe / SerDe
  ParquetHiveSerDe / SerDe
  OpenCSVSerDe / SerDe
  JSONSerDe / SerDe
SHOW TRANSACTIONS command / Transactions
Simple Authentication and Security Layer (SASL) framework / Metastore server authentication
skew join / Skew join
SORT BY (ASC|DESC) keyword / ORDER and SORT
SORT keyword / ORDER and SORT
sort merge bucket (SMB) join / Sort merge bucket (SMB) join
sort merge bucket map (SMBM) join / Sort merge bucket map (SMBM) join
Spark / Overview of the Hadoop ecosystem
SQLLine
  URL / Using the Hive command line and Beeline
SQL standard-based mode, authorization
  about / SQL standard-based mode
Sqoop / Overview of the Hadoop ecosystem
stage dependencies
  about / The EXPLAIN statement
stage plans
  about / The EXPLAIN statement
storage-based mode, authorization
  about / Storage-based mode


storage optimization / Storage optimization
Storm
  URL / A short history, Batch, real-time, and stream processing
streaming
  about / Streaming
stream processing
  about / Batch, real-time, and stream processing
string functions / Operators and functions
Structured Query Language (SQL)
  about / A short history


T

table-generating functions / Operators and functions
Tez / Overview of the Hadoop ecosystem
  about / Index
  URL / Index
transactions
  about / Transactions
type conversion functions / Operators and functions


U

UDAF
  code, template / The UDAF code template
UDAFs
  about / User-defined functions
UDF
  code, template / The UDF code template
UDFs
  about / User-defined functions
UDTF
  code, template / The UDTF code template
UDTFs
  about / User-defined functions
Uniform Resource Identifier (URI) / Data exchange – LOAD
UNION ALL statement / Set operation – UNION ALL


V

value / Introducing big data
variability / Introducing big data
variety / Introducing big data
Vectorization optimization
  about / Index
  URL / Index
velocity / Introducing big data
vendor packages
  used, for installing Hive / Installing Hive from vendor packages
veracity / Introducing big data
views
  about / Hive views
  altering / Hive views
  redefining / Hive views
  dropping / Hive views
virtual columns / Operators and functions
visualization / Introducing big data
volatility / Introducing big data
volume / Introducing big data


W

WHERE clauses
  subqueries, restrictions / The SELECT statement
window expressions
  BETWEEN … AND clause / Analytic functions
  N PRECEDING or FOLLOWING / Analytic functions
  UNBOUNDED PRECEDING / Analytic functions
  UNBOUNDED FOLLOWING / Analytic functions
  UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING / Analytic functions
  CURRENT ROW / Analytic functions
  URL / Analytic functions


Y

Yarn / Overview of the Hadoop ecosystem


Z

ZooKeeper
  about / ZooKeeper
  URL / ZooKeeper
  shared lock / ZooKeeper
  exclusive lock / ZooKeeper
  for Hive locks, URL / ZooKeeper
