Enterprise Data Warehouse Optimization: 7 Keys to Success


© Hortonworks Inc. 2011–2017. All Rights Reserved

Scott Gnau, CTO, Hortonworks (@Scott_Gnau)
David Loshin, President, Knowledge Integrity (loshin@knowledge-integrity.com)

Legacy Architectures Impede Performance

• Data warehouse performance is no longer solely defined in terms of computation speed
• Optimal performance reflects the ability to maximize value across a range of dimensions
• The static design of legacy platforms has not kept pace with growing desire for business intelligence and analytics

[Diagram: EDW performance dimensions: Capital Costs, Operations Costs, Scalability, Analytic Flexibility, Time to Value, Data Quality, Data Variety]

© 2017 Knowledge Integrity, Inc. loshin@knowledge-integrity.com (301) 754-6350

Step 1: Leverage Horizontal Scalability

• DW appliances require significant capital investment
  – System must be sized to meet anticipated needs
  – Allows for unused capacity at beginning
  – Requires increased “step-up” investments on regular intervals
• Hadoop finesses this challenge
  – Relies on commodity components
  – Start with what you need, grow with increased demand
  – Introduce newer hardware seamlessly
  – Exploit innovations to speed performance (e.g., Stinger.next, Low Latency Analytical Processing)
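The contrast between "step-up" investments and growing with demand can be illustrated with a little arithmetic. The sketch below is only an illustration: the yearly demand figures, the 100 TB appliance step, and the 10 TB node size are all hypothetical and do not come from the slides.

```python
import math

def provisioned(demand, increment):
    """Capacity bought when purchases come in multiples of `increment`."""
    return [math.ceil(d / increment) * increment for d in demand]

def unused(capacity, demand):
    """Capacity paid for but not yet needed, period by period."""
    return [c - d for c, d in zip(capacity, demand)]

# Hypothetical demand in TB per year (illustrative only, not from the slides).
demand_tb = [20, 35, 60, 90, 130]

# Assumption: the appliance is bought in 100 TB steps,
# while the cluster grows one 10 TB commodity node at a time.
appliance = provisioned(demand_tb, increment=100)
cluster = provisioned(demand_tb, increment=10)

print("appliance idle TB:", unused(appliance, demand_tb))
print("cluster idle TB:  ", unused(cluster, demand_tb))
```

Under these assumed numbers the appliance carries tens of terabytes of idle capacity in most years, while the incrementally grown cluster stays within one node of actual demand.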


[Diagram: four racks, each labeled with a rack switch, a NameNode, and four DataNode & TaskTracker nodes]

Step 2: Augment EDW Storage with Hive

• The value of existing EDW investments can be extended using a Hybrid Architecture
• Hive continues to evolve with innovative performance improvements:
  – In-memory caching and persistent query executors
  – Column-oriented distributed data organization
  – Improved security using Apache Ranger
  – SQL ACID Merge


[Diagram: Hadoop Cluster and EDW]

Step 3: Increase Data Flexibility

• Conventional data warehouse architectures are organized using a dimensional model
  – Facts represent events
  – Dimensions characterize the facts
• The dimensional model is suited to typical DW operations
  – Aggregation and rolled-up reporting
  – “Slice and dice”
• However, this model forces all data into predetermined schema (“schema-on-write”)
  – Introduces bias, creates constraints and limits data flexibility
• Alternative: schema-on-read
  – Datasets are captured in their source formats
  – Frees data consumers to apply their own organization
  – Allows logical structure to be layered on top of data in source format
  – Enables use of creative algorithms for analytics, text mining, and machine learning
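A small sketch may make schema-on-read more concrete. It assumes the source data arrives as JSON lines; the records, field names, and the two consumer schemas are hypothetical and chosen only for illustration. The raw records are stored untouched, and each consumer layers its own logical structure on top at read time.

```python
import json

# Records captured in their source format (JSON lines); fields vary by record.
RAW_EVENTS = [
    '{"user": "u1", "action": "view", "item": "A", "ms": 1200}',
    '{"user": "u2", "action": "buy", "item": "A", "price": 9.5}',
    '{"user": "u1", "action": "buy", "item": "B", "price": 4.0}',
]

def read_with_schema(raw_lines, schema):
    """Apply a consumer-defined schema while reading.

    `schema` maps output column -> (source field, converter, default).
    A field missing from a record falls back to the default.
    """
    for line in raw_lines:
        record = json.loads(line)
        yield {
            column: convert(record[field]) if field in record else default
            for column, (field, convert, default) in schema.items()
        }

# Consumer 1 (hypothetical): a revenue view of the events.
sales_schema = {
    "customer": ("user", str, None),
    "revenue": ("price", float, 0.0),
}
# Consumer 2 (hypothetical): an activity view of the same events.
activity_schema = {
    "customer": ("user", str, None),
    "event": ("action", str, None),
}

sales = list(read_with_schema(RAW_EVENTS, sales_schema))
activity = list(read_with_schema(RAW_EVENTS, activity_schema))
total_revenue = sum(row["revenue"] for row in sales)
print(total_revenue)
```

Neither consumer had to agree on a shared schema in advance, and a third consumer could add another view without reloading the data.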


Step 4: Use Unstructured Data

• Data warehouses are engineered around structured data
• Many sources of increasing volume of unstructured data
  – Apps running on Internet-connected devices generate text streams
  – Machine-generated unstructured content
  – Semi-structured sources
• Applications that consume both structured and unstructured data provide fuller visibility into analytical results
• Tools like Lucene, Solr, Mahout, and other text analytics libraries help to parse and tag unstructured text
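The ingest, parse, tag, and organize flow can be sketched in a few lines of plain Python. This is a toy stand-in and not the Lucene, Solr, or Mahout API: the tagging rules and sample documents are hypothetical, and the inverted index only illustrates the kind of structure a search library maintains.

```python
import re
from collections import defaultdict

# Hypothetical tagging rules: tag name -> keywords that trigger it.
TAG_RULES = {
    "billing": {"invoice", "refund", "charge"},
    "outage": {"down", "outage", "error"},
}

def parse(text):
    """Split raw text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tag(tokens):
    """Return the sorted tags whose keywords appear among the tokens."""
    present = set(tokens)
    return sorted(name for name, words in TAG_RULES.items() if words & present)

def organize(documents):
    """Build an inverted index (token -> doc ids) and a per-document tag list."""
    index = defaultdict(set)
    tags = {}
    for doc_id, text in documents.items():  # ingest
        tokens = parse(text)                # parse
        tags[doc_id] = tag(tokens)          # tag
        for token in tokens:                # organize
            index[token].add(doc_id)
    return index, tags

docs = {
    1: "Service is DOWN again, error 503 on login.",
    2: "Please refund the duplicate charge on my invoice.",
    3: "Login works, but the invoice page shows an error.",
}
index, tags = organize(docs)
print(sorted(index["error"]), tags[3])
```

Once text is tagged and indexed this way, the tags can be joined with structured warehouse data, for example by customer or ticket identifier.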


[Diagram: Ingest, Parse, Tag, Organize; tools shown: Lucene, Solr, Mahout]

Step 5: Data Discovery

[Diagram: Data Ingestion & Transformation]

• Data imported into the data warehouse is homogenized and organized within predefined data models
• This constrains downstream consumers

[Diagram: Data Discovery & Preparation, repeated five times]

• Data discovery allows each user to configure the data for their specialized purposes

Step 6: Offload ETL to Hadoop

• 60-70% of the effort of data warehousing is attributed to extraction, transformation, and loading (ETL)
• Hadoop is a natural platform for ETL processing:
  – ETL is inherently data parallel, enabling faster execution
  – Development time can be drastically reduced with faster dev/test/debug cycle
  – Resources can be dynamically apportioned and released when ETL processing is completed, lowering costs
• Apache Hive supports SQL ACID Merge which handles inserts, updates, and deletes in a single pass
• Allows for in-database transformations without need for massive refreshes
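The single-pass behavior of a merge can be sketched as follows. This is a minimal illustration in plain Python, not Hive code: dictionaries stand in for tables, and the table contents and the `deleted` flag on staged rows are hypothetical. The three branches mirror the matched-delete, matched-update, and not-matched-insert clauses of a SQL MERGE.

```python
def merge(target, changes):
    """Apply inserts, updates, and deletes to `target` in one pass.

    target:  dict mapping key -> row (stands in for the warehouse table)
    changes: iterable of (key, row, deleted) tuples (the staged change set)
    Returns counts per action; `target` is modified in place.
    """
    counts = {"inserted": 0, "updated": 0, "deleted": 0}
    for key, row, deleted in changes:
        if key in target:
            if deleted:            # matched and flagged: delete
                del target[key]
                counts["deleted"] += 1
            else:                  # matched: update
                target[key] = row
                counts["updated"] += 1
        elif not deleted:          # not matched: insert
            target[key] = row
            counts["inserted"] += 1
    return counts

# Hypothetical target table and staged changes.
customers = {
    1: {"name": "Ann", "city": "Oslo"},
    2: {"name": "Bob", "city": "Rome"},
}
staged = [
    (1, {"name": "Ann", "city": "Bergen"}, False),  # update
    (2, None, True),                                # delete
    (3, {"name": "Cy", "city": "Lima"}, False),     # insert
]
result = merge(customers, staged)
print(result, customers)
```

Because every staged row is examined once and routed to exactly one action, the target never needs a full reload to absorb a change set.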


Step 7: Operational Data Governance

• Delegating more responsibility to the consumer community poses a risk of inconsistent interpretation and use
• Institute operational data governance to support versioning, lineage, and provenance
  – Metadata management
  – Data lineage
  – Archiving policies
  – Versioning policies
  – Data security and protection
• Apache Atlas is an open source component of the Hadoop ecosystem that captures data definitions, hierarchical taxonomies, data elements and their relationships, and lineage
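To show what capturing lineage involves, here is a minimal sketch of a lineage graph. It is not the Apache Atlas API; the class, dataset names, and process names are hypothetical. Each derivation is recorded as an edge from input to output, and provenance is answered by walking those edges upstream.

```python
from collections import defaultdict

class LineageGraph:
    """Records which datasets each dataset was derived from."""

    def __init__(self):
        self._parents = defaultdict(set)

    def record(self, output, inputs, process):
        """Note that `process` produced `output` from `inputs`."""
        for source in inputs:
            self._parents[output].add((source, process))

    def upstream(self, dataset):
        """Return every dataset that `dataset` depends on, directly or not."""
        seen = set()
        stack = [dataset]
        while stack:
            current = stack.pop()
            for source, _process in self._parents.get(current, ()):
                if source not in seen:
                    seen.add(source)
                    stack.append(source)
        return seen

# Hypothetical pipeline: raw sales are cleaned, then joined and aggregated.
graph = LineageGraph()
graph.record("sales_clean", ["sales_raw"], "dedupe")
graph.record("sales_by_region", ["sales_clean", "region_dim"], "join_and_aggregate")
print(sorted(graph.upstream("sales_by_region")))
```

With lineage recorded at each step, a consumer who questions a report figure can trace it back to every source that contributed to it.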


Modernization: Evolving the Hybrid EDW

• Conventional RDBMS-based data warehouses have served organizations well, but are being eclipsed by newer technologies
• Scalable systems built on commodity components are rapidly being adopted for business intelligence and analytics applications
• Optimize the EDW using an evolutionary approach to embracing Hadoop:
  – Expand the storage footprint
  – Increase computational power
  – Broaden the scope of application support
  – Lower costs


Questions & Suggestions

• www.knowledge-integrity.com
• www.dataqualitybook.com
• www.decisionworx.com
• If you have questions, comments, or suggestions, please contact me:
  David Loshin
  301-754-6350
  loshin@knowledge-integrity.com


The Next Gen EDW is the Big Data Warehouse

→ In Forrester’s 2016 global survey, 59% of respondents stated that leveraging big data and analytics was a critical or high priority.


Companies Are Looking to Big Data for EDW Optimization

→ 82% of 2550+ respondents are looking to Big Data for EDW Optimization rather than a straight replacement. (Source: 2016 Big Data Maturity Survey)


Hortonworks Connected Data Platforms and Solutions

• Hortonworks Solutions: Enterprise Data Warehouse Optimization; Cyber Security and Threat Management; Internet of Things and Streaming Analytics
• Hortonworks Connection: Subscription Support, SmartSense, Premier Support, Educational Services, Professional Services, Community Connection
• Cloud: Hortonworks Data Cloud, AWS, HDInsight
• Data Center: Hortonworks Data Suite
• HDF, HDP


Drivers of a Modern BI Infrastructure

• Deeper and Broader Data Sets
• Complete Data ‘Provenance’
• Leading Analytics and Tools
• Integrate non-EDW data and EDW data
• Total Cost of Ownership


Open Source Transformational Impact to EDW

• Unmatched Economics: support low cost data-center and cloud architectures for Enterprise Apache Hadoop
• Eliminates Risk and Ensures Integration: prevents vendor lock-in and speeds ecosystem adoption of ODPi-compliant core

[Chart: axes labeled COST EFFICIENCY and DATA VARIETY; items plotted: EDW, RDBMS, Proprietary Hadoop, Hortonworks Open Source]


But, why aren’t more companies running to this solution?

• Risky
• Hadoop requires a bunch of new skill sets
• It’ll take a long time
• There’s too much manual coding required
• It’s hard to integrate to my BI tool stack

Legacy EDW Solution

Using Hadoop to Optimize the Data Warehouse

→ Augment EDW with Hive
→ Offload ETL to Hadoop
→ Data Governance


Augment current EDW with Hive

• Hive LLAP GA: Interactive query in seconds, 10X fast join performance
• Ease of Use and Adoption: SQL Standard ACID Merge
• Enterprise Readiness: Supports all TPC-DS Queries
• Streamlined Operations: Hive Views

Hive 2 with LLAP: 26x Performance Boost at 1 TB Scale

[Chart: Hive 1/Tez Time (s), Hive 2/LLAP Time (s), and Speedup (x Factor); axes: Query Time (s) (Lower is Better) and Speedup (x Factor)]

Hive 2 with LLAP averages 26x faster than Hive 1


Hive LLAP in HDP 2.6: Stable Performance with High Concurrency

Concurrent Queries    Average Runtime
5                     7.76s
25                    36.24s
100                   102.89s

• From 5 to 25 concurrent queries: 5x Queries, 4.6x Runtime Difference
• From 25 to 100 concurrent queries: 4x Queries, 2.8x Runtime Difference
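The "Queries" and "Runtime Difference" factors quoted on this slide can be checked directly from the three measurements. The short calculation below uses only the slide's own numbers; the function name is just a label for the illustration.

```python
# Average runtimes from the slide: concurrent queries -> seconds.
runtimes = {5: 7.76, 25: 36.24, 100: 102.89}

def scaling(low, high):
    """Return (query ratio, runtime ratio) between two concurrency levels."""
    return high / low, runtimes[high] / runtimes[low]

for low, high in [(5, 25), (25, 100)]:
    queries, runtime = scaling(low, high)
    print(f"{low} -> {high}: {queries:.0f}x queries, {runtime:.2f}x runtime")
```

The ratios come out at roughly 4.67 and 2.84, which match the slide's 4.6x and 2.8x once truncated. In both steps runtime grows more slowly than the number of concurrent queries.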


Offload ETL to Hadoop

→ The Problem:
  – EDWs can consume between 50% and 90% of resources just on ETL/ELT tasks.
  – These jobs interfere with more business-critical tasks like BI and advanced analytics.
→ The Solution:
  – Hive and HDP deliver ETL that scales to petabytes.
  – Economical scale-out processing on commodity servers.
→ The Result:
  – Better SLAs for mission-critical analytics.
  – Limit EDW expansion or retire old systems.
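Scale-out ETL works because each partition of the data can be transformed independently of the others. The sketch below illustrates that pattern on a single machine, with a thread pool standing in for cluster workers. The records and the transformation are hypothetical, and Python threads here only demonstrate the structure, not real cluster speedups.

```python
from concurrent.futures import ThreadPoolExecutor

def transform(record):
    """Per-record step: clean the name and convert cents to dollars."""
    name, cents = record
    return name.strip().title(), cents / 100

def etl_partition(partition):
    """Transform one partition without reference to any other partition."""
    return [transform(record) for record in partition]

def split(records, n_partitions):
    """Deal records round-robin into n_partitions lists."""
    return [records[i::n_partitions] for i in range(n_partitions)]

def parallel_etl(records, n_partitions=4):
    """Run one task per partition and gather the transformed rows."""
    partitions = split(records, n_partitions)
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        results = list(pool.map(etl_partition, partitions))
    return [row for part in results for row in part]

# Hypothetical source rows: (name, amount in cents).
records = [(" alice ", 1250), ("BOB", 300), ("carol", 99), ("dave ", 10000), ("eve", 5)]
loaded = parallel_etl(records)
print(sorted(loaded))
```

Because no partition depends on another, adding workers (or, on a cluster, nodes) shortens the job without changing the transformation logic. Row order is not preserved by the round-robin split, which is usually acceptable for a bulk load.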

[Diagram: ETL/ELT, Data Mart, Data Landing & Deep Archive, Cube Mart, End User Applications, End Users and Apps]


Data Governance for EDW Optimization

Industry First: Dynamic Tag-based Security Policies

[Diagram labels]
• Ranger: Manage Access Policies and Audit Logs
  – Policies: Classification, Prohibition, Time, Location
  – PDP Resource Cache
• Atlas: Track Metadata and Lineage
  – Atlas Client subscribes to topic, gets metadata updates
  – Metastore: Tags, Assets, Entities
• Entities in Data Lake: Streams, Pipelines, Feeds, Hive Tables, HDFS Files, HBase Tables
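The idea behind tag-based policies is that access rules attach to a classification tag rather than to each individual resource. The sketch below illustrates that idea only. It is not the Apache Ranger or Apache Atlas API: every class, asset name, group, and rule in it is hypothetical, and allowing access to untagged assets by default is a simplification made for brevity.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class TagPolicy:
    tag: str                                      # classification the rule applies to
    allowed_groups: frozenset                     # groups permitted to read
    expires: Optional[date] = None                # time-based restriction
    allowed_locations: Optional[frozenset] = None # location-based restriction

# Hypothetical metadata catalog: asset -> classification tags.
ASSET_TAGS = {
    "hive:sales.customers": {"PII"},
    "hdfs:/landing/clickstream": set(),
}

# One hypothetical policy covers every asset tagged PII.
POLICIES = [
    TagPolicy("PII", frozenset({"compliance"}),
              expires=date(2030, 1, 1),
              allowed_locations=frozenset({"US"})),
]

def can_read(asset, group, today, location):
    """Allow access only if every policy on the asset's tags permits it."""
    tags = ASSET_TAGS.get(asset, set())
    for policy in POLICIES:
        if policy.tag not in tags:
            continue
        if group not in policy.allowed_groups:
            return False
        if policy.expires is not None and today > policy.expires:
            return False
        if (policy.allowed_locations is not None
                and location not in policy.allowed_locations):
            return False
    return True

def tag_asset(asset, tag):
    """Tagging an asset changes its access rules without editing any policy."""
    ASSET_TAGS.setdefault(asset, set()).add(tag)

today = date(2025, 1, 1)
before = can_read("hdfs:/landing/clickstream", "marketing", today, "EU")
tag_asset("hdfs:/landing/clickstream", "PII")
after = can_read("hdfs:/landing/clickstream", "marketing", today, "EU")
print(before, after)
```

The "dynamic" part is visible in the last lines: the moment the metadata catalog tags a new asset, the existing policy applies to it, with no per-resource rule to write.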


Use Case 1: Multi-Channel Behavioral Analysis

→ Industry: Mass Media
  – Largest broadcasting and cable company in the world by revenue
  – Multiple channels: Cable (set-top-box), wireless devices, streaming programming
  – 22 million+ subscribers (internet & video)
→ Results:
  – Scalability: 480B rows, 500 nodes
  – 60x query performance improvement
  – Insights: New info improves negotiations
  – Loyalty: Outreach to customers viewing competitive streams; ▼ churn ▲ revenue

[Diagram: Before and After architectures for a Leading Media Company. Components shown: Channel Feeds, Hortonworks HDP, Netezza Data Mart, AtScale Intelligence Server, Tableau + MS Excel, Tableau + MS Excel + R]


Use Case 2: Campaign Paid-Search Effectiveness

→ Industry: Retail / eCommerce
  – Top US department store (by rev)
  – Online sales $4B+ & growing (11%+ total)
  – 800+ department stores nationwide
→ Results
  – Scale: Millions of paid keywords analyzed
  – Speed: Eliminate extract step
  – Insight: Operationalized closed-loop analysis → insight → decision → action
  – Impact: Make and save $ millions w/ instant bid decisions over 6-week season → that drives 60% annual revenue

[Diagram: Before and After architectures for a Leading Retailer. Components shown: Ad & Paid Keywords, Hortonworks HDP, Vertica Data Marts, AtScale Intelligence Server, Cognos + Tableau + Excel, Tableau + Excel]


Use Case 3: Client and Patient Analysis

→ Industry: Managed Health Care
  – Member of Fortune 100
  – Health, life + other insurance products
  – ~52 million members; medical/dental/pharm
→ Results
  – Scalable: BI directly on data across 264+ nodes
  – Time: Eliminate data movement step
  – 62x query performance improvement
  – Speed: <2.2 second average query time
  – Insight: Tableau on Hadoop for 1000+
  – Security: Access control by user; HIPAA

[Diagram: Before and After architectures for a Leading Managed Healthcare Provider. Components shown: Client/Patient Details, Hortonworks HDP, Netezza Data Mart, AtScale Intelligence Server, Tableau + MS Excel]


Next Step:

→ Everyone will receive a free copy of the Forrester White Paper titled “The Next-Generation EDW Is The Big Data Warehouse”
→ EDW Optimization with HDP
  – http://hortonworks.com/solutions/edw-optimization/
  – EDW Optimization 7 min video


Thank You
