RUNNING DATA ANALYTIC WORKLOADS IN THE CLOUD


Page 1: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

RUNNING DATA ANALYTIC WORKLOADS IN THE CLOUD

Page 2: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

2 © Cloudera, Inc. All rights reserved.

WHO WE ARE

Jason Wang

Software Engineer

Mael Ropas

Sr. Sales Engineer

Vinithra Varadharajan

Sr. Engineering Manager

Eugene Fratkin

Director of Engineering

Page 3: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

TODAY’S AGENDA

• We will introduce concepts related to cloud deployment

• Talk about architecture and some major considerations

• Do hands-on exercises related to running data engineering and analytic database pipelines

WHY ARE YOU MOVING TO THE CLOUD

HOW TO MAP ON-PREM TO CLOUD

CLOUD ARCHITECTURE AND CONSIDERATIONS

CLOUDERA ALTUS + DIRECTOR

HANDS-ON LAB


Page 5: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Deployment types

Columns: On-prem, Infrastructure as a Service, Platform as a Service, Software as a Service

Stack layers shown in every column: Data, Application, Runtime, Middleware, OS, Virtualization, Servers, Storage, Networking

Legend: Managed for you
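The shading that marks which layers are "managed for you" does not survive in this text version. As a rough sketch, the split below follows the conventional IaaS/PaaS/SaaS responsibility model rather than anything read off the slide; `PROVIDER_MANAGED_FROM` and `managed_by_you` are illustrative names.

```python
# Stack layers as listed on the slide, top to bottom.
LAYERS = ["Data", "Application", "Runtime", "Middleware", "OS",
          "Virtualization", "Servers", "Storage", "Networking"]

# Assumption: the conventional responsibility split, not taken from the slide.
# Each value is the index of the first layer the provider manages.
PROVIDER_MANAGED_FROM = {
    "On-prem": len(LAYERS),                                  # nothing managed for you
    "Infrastructure as a Service": LAYERS.index("Virtualization"),
    "Platform as a Service": LAYERS.index("Runtime"),
    "Software as a Service": 0,                              # everything managed for you
}


def managed_by_you(deployment: str) -> list[str]:
    """Return the layers the customer still operates for a deployment type."""
    return LAYERS[:PROVIDER_MANAGED_FROM[deployment]]


for name in PROVIDER_MANAGED_FROM:
    print(f"{name}: you manage {managed_by_you(name) or ['nothing']}")
```

Reading the table this way makes the trade-off explicit: each step to the right removes layers from your operations, and with them the ability to tune those layers.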

Page 6: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Shift to the cloud

Columns: On-premise, Infrastructure as a Service, Platform as a Service, Software as a Service

Stack layers shown in every column: Data, Application, Runtime, Middleware, OS, Virtualization, Servers, Storage, Networking

Callout: 50% of the market by 2019

Legend: Managed for you

Page 7: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Simplified (CIO) view:

Let's migrate the Hive cluster from physical infra to cloud infra!

Cloud is the same but saves money!

Page 8: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Simplified (CIO) view:

Let's migrate the Hive cluster from physical infra to cloud infra!

Page 9: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Let's migrate the Hive cluster from physical infra to cloud infra!

A large on-prem cluster operated at high capacity will be cheaper than a cloud-based solution…

Except...

Page 10: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

10 © Cloudera, Inc. All rights reserved.

When is it worth it?

1 Compute/storage needs are changing over time and/or hard to predict
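The trade-off between a busy on-prem cluster and a cloud cluster with changing demand can be sketched as simple arithmetic. This is a minimal model, assuming on-prem cost is fixed per node and cloud cost scales with the hours actually used; the prices are hypothetical and the function names are illustrative.

```python
def monthly_cost_on_prem(nodes: int, cost_per_node: float) -> float:
    """On-prem capacity is paid for whether or not it is used."""
    return nodes * cost_per_node


def monthly_cost_cloud(nodes: int, utilization: float,
                       hourly_rate: float, hours: float = 730.0) -> float:
    """On-demand capacity is paid for only while it runs."""
    return nodes * utilization * hourly_rate * hours


def break_even_utilization(cost_per_node: float, hourly_rate: float,
                           hours: float = 730.0) -> float:
    """Utilization above which the on-prem cluster becomes the cheaper option."""
    return cost_per_node / (hourly_rate * hours)


# Hypothetical prices: 400 per node-month on-prem, 1.00 per node-hour in the cloud.
threshold = break_even_utilization(cost_per_node=400.0, hourly_rate=1.0)
print(f"on-prem is cheaper above {threshold:.0%} utilization")
```

Under these assumptions, a cluster that is busy most of the month favors on-prem, while a cluster whose demand is bursty or hard to predict favors paying by the hour.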

Page 11: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

11 © Cloudera, Inc. All rights reserved.

When is it worth it?

2 IT lacks scale to make cluster maintenance cost effective

Page 12: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

12 © Cloudera, Inc. All rights reserved.

When is it worth it?

3 IT lacks scale to integrate many new components

Page 13: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

13 © Cloudera, Inc. All rights reserved.

When is it worth it?

4 IT lacks scale to manage component lifecycles (upgrades, patches, etc.)

Page 14: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

TODAY’S AGENDA

• We will introduce concepts related to cloud deployment

• Talk about architecture and some major considerations

• Do hands-on exercises related to running data engineering and analytic database pipelines

WHY ARE YOU MOVING TO THE CLOUD

HOW TO MAP ON-PREM TO CLOUD

CLOUD ARCHITECTURE AND CONSIDERATIONS

CLOUDERA ALTUS + DIRECTOR

HANDS-ON LAB

Page 15: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Considerations for the cloud

There can be many dimensions along which to analyze a use case:

● Tenancy (single / multi user)
● Lifecycle (transient / long-lived / HA)
● Security requirements and network access
● Multiple services sharing resources / resource contention
● Level of configuration tuning
● Etc.

Based on that, on-prem, IaaS, PaaS, or SaaS may make sense

Page 16: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Considerations for the cloud

Grid axes: lifecycle (Transient, Long-lived, Highly Available) by tenancy (Multi Tenant, Single Tenant)

Impala ADB
● Multiple people and user personas
● Cluster needs security
● Cluster will need to be available, but can tolerate scheduled downtime
● Stores data in object storage

Hive ETL
● Run by a headless server
● Does not need authorization
● Does not need to exist after processing
● Only needs to preserve metadata
● Stores data in object storage

Kafka streams
● All processed with the same user
● Has stringent performance requirements
● Cannot take downtime
● Requires extensive monitoring
● Stores objects within the cluster

Page 17: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Considerations for the cloud – good PaaS candidates

Hive ETL
● Run by a headless server
● Does not need authorization
● Does not need to exist after processing
● Only needs to preserve metadata
● Stores data in object storage

● Clusters that need to be provisioned on demand or that have another simple lifecycle
● Clusters whose configuration can be derived from basic cluster characteristics (size, service type) and does not need to be highly optimized
● Cases where one can effectively leverage other advanced features provided by managed services (autoscaling, debugging)

Page 18: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Considerations for the cloud – good IaaS candidates

Kafka streams
● All processed with the same user
● Has stringent performance requirements
● Cannot take downtime
● Requires extensive monitoring
● Stores objects within the cluster

● Cluster has a non-trivial lifecycle, such as an HA cluster
● Cases where monitoring needs to be integrated with a sophisticated monitoring and metering system
● Configurations need to be highly optimized for a given use case: several services need to run on one cluster with non-trivial resource contention
● Very restrictive security environment into which the cluster needs to deploy (can’t talk to the managed service)
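The PaaS and IaaS criteria from these two slides can be read as a simple checklist. The sketch below is one possible encoding, not a rule from the talk: the `Workload` fields and the "any IaaS signal wins" policy are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    needs_ha: bool                  # non-trivial lifecycle, such as an HA cluster
    needs_custom_monitoring: bool   # must integrate with existing monitoring/metering
    needs_heavy_tuning: bool        # several contending services on one cluster
    restricted_network: bool        # cannot talk to a managed service


def suggest_model(w: Workload) -> str:
    """Suggest 'IaaS' when any IaaS criterion applies, otherwise 'PaaS'."""
    iaas_signals = (w.needs_ha, w.needs_custom_monitoring,
                    w.needs_heavy_tuning, w.restricted_network)
    return "IaaS" if any(iaas_signals) else "PaaS"


hive_etl = Workload("Hive ETL", False, False, False, False)
kafka = Workload("Kafka streams", True, True, True, False)
print(hive_etl.name, "->", suggest_model(hive_etl))
print(kafka.name, "->", suggest_model(kafka))
```

A real assessment would weigh these signals rather than treat them as booleans, but the checklist form is a convenient starting point for walking through use cases one at a time.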

Page 19: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Considerations for the cloud – practical considerations when switching

● It may be difficult to switch from an on-prem deployment model to a cloud deployment model
  ○ IT needs to map access and infrastructure to a new deployment model
  ○ Developers need to adjust all pipelines to a cloud-native architecture
● It is worth considering moving in stages, one use case at a time
  ○ Lift and shift
  ○ Data into S3 for a more cloud-native deployment
  ○ Metadata shareable between clusters
  ○ Full cloud-native, on-demand deployment

Page 20: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Considerations for the cloud – public cloud lock-in

● Choice of public cloud can be even more consequential than choice of on-prem vendors
● Some companies maintain multi-cloud infrastructure: several public clouds, or on-prem and public
  ○ Less lock-in
  ○ Multi-front effort
● Can maintain a migration plan “from the cloud”
  ○ Make sure there is an understood data migration plan
  ○ Make sure there is an understood metadata migration plan
  ○ Rely on non-proprietary technologies for core processing

Page 21: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

21 © Cloudera, Inc. All rights reserved.

WHY ARE YOU MOVING TO THE CLOUD

HOW TO MAP ON-PREM TO CLOUD

CLOUD ARCHITECTURE AND CONSIDERATIONS

CLOUDERA ALTUS + DIRECTOR

HANDS-ON LAB

TODAY’S AGENDA

• We will introduce concepts related to cloud deployment

• Talk about architecture and some major considerations

• Do hands-on exercises related to running data engineering and analytic database pipelines

Page 22: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

High level view: physical infra, Hive cluster, users

Technical view: physical infra, Hive cluster, private subnet, Storage

Component labels: HDFS 2.7.3, Spark 2.0.1, Oozie 2.0.2, Hue 4.1, Hive 2.1, Yarn 1.6.0, Sentry 1.7.1, CM 5.1, Active Directory, SSSD, BDR

Page 23: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Properties and considerations of cloud deployment

Diagram labels: Cloud infrastructure, IAM, virtual private subnet, private subnet, compute, storage, Hive cluster (HDFS 2.7.3, Spark 2.0.1, Oozie 2.0.2, Hue 4.1, Hive 2.1, Yarn 1.6.0, Sentry 1.7.1, CM 5.1), Active Directory, BDR, User computers, users

● Decoupled storage and compute
● Multi-cluster metadata
● Bursting to the cloud
● Data security
● User management

For clarity, we may specifically refer to AWS component names; unless pointed out otherwise, the same applies to Azure and GCP

Page 24: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Decoupled compute and storage

Diagram labels: Cloud infrastructure, IAM, virtual private subnet, private subnet, compute, storage, Hive cluster (HDFS, Spark, Oozie, Hue, Hive, Yarn, Sentry, CM), Active Directory, User computers

Page 25: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Decoupled compute and storage – local and attached volumes

Diagram labels: six workers, each with local disk; some also with EBS and/or HDFS; S3

● Except for specific instance types, cloud VMs have minimal local disk
● Local disk may, by default, contain sensitive information (e.g. logs), so you should not forget that it exists
● Most of the time a virtual disk needs to be attached to each instance for things like Spark memory spills and logs, even when no data is meant to persist there

Page 26: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Decoupled compute and storage – same HDFS, new role

Diagram labels: six workers, each with local disk; some also with EBS and/or HDFS; S3

● Local HDFS can make sense for latency-sensitive data computations, but it should not be used to persist data
● Probability of VM failure is almost an order of magnitude higher than that of a well-maintained physical server
● If spot instances are used, you must build a framework able to tolerate catastrophic failures
● HDFS, and its level of redundancy, is not meant to deal with the volatility of public cloud
● Not all instances need to be part of HDFS

Page 27: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Decoupled compute and storage – S3 vs volumes

Diagram labels: six workers, each with local disk; some also with EBS and/or HDFS; S3

● S3 is elastic
● S3 is much more reliable
● Attached volumes provide lower-latency access, even with HDFS
● S3 has higher latency, but can come close in throughput

Page 28: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Decoupled compute and storage – S3 consistency

Diagram labels: Process 1, Process 2, S3, metadata

● AWS S3 is not strongly consistent, unlike Azure ADLS and HDFS
● Objects written to S3 may not be instantly seen by everyone
● While the object is present, its metadata lags, so it cannot be listed yet
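The listing lag described on this slide can be illustrated with a toy model. The sketch below is an in-memory stand-in, not the S3 API: `LaggyStore` and `wait_until_listed` are hypothetical names, and the delay is an arbitrary value chosen for the demonstration.

```python
import time


class LaggyStore:
    """Toy object store: reads by key work at once, listings lag behind."""

    def __init__(self, listing_delay: float):
        self.listing_delay = listing_delay
        self._objects = {}  # key -> (data, time at which the key becomes listable)

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = (data, time.monotonic() + self.listing_delay)

    def get(self, key: str) -> bytes:
        return self._objects[key][0]

    def list_keys(self) -> list[str]:
        now = time.monotonic()
        return sorted(k for k, (_, visible_at) in self._objects.items()
                      if visible_at <= now)


def wait_until_listed(store: LaggyStore, key: str,
                      timeout: float = 2.0, poll: float = 0.01) -> bool:
    """Poll the listing until the key shows up or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if key in store.list_keys():
            return True
        time.sleep(poll)
    return False


store = LaggyStore(listing_delay=0.2)
store.put("logs/part-0000", b"rows")
print("readable by key:", store.get("logs/part-0000") == b"rows")
print("listed right away:", "logs/part-0000" in store.list_keys())
print("listed after waiting:", wait_until_listed(store, "logs/part-0000"))
```

A job that discovers its input by listing a prefix directly after another process wrote to it can silently miss files in this model, which is why polling or a separate consistent metadata source is needed.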

Page 29: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Decoupled compute and storage – S3 consistency

S3Guard introduces strong consistency to S3

● DynamoDB serves as the local metadata source
● If objects are deleted from S3 without going through S3Guard, the S3Guard metadata needs to be invalidated
● EMRFS is AWS’s counterpart to S3Guard, available only for EMR

Diagram labels: Process 1, Process 2, S3, metadata, DynamoDB

Page 30: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Multi-cluster metadata

Diagram labels: virtual private subnet, Hive cluster (HDFS, Spark, Hive, CM), Impala cluster (HDFS, Impala, CM), Sentry, HMS, Navigator, Cloud infrastructure, storage

Page 31: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Multi-cluster metadata

Diagram labels: v. private subnet, Hive cluster, Impala cluster, Spark cluster, HMS, Navigator, Sentry

● One of the main advantages of running in the cloud is the ability to separate workloads
  ○ Each cluster is optimized for a single workflow
  ○ Cluster lifecycle matches the workflow’s demands
● Persistent storage (e.g. S3, ADLS) is necessary but not sufficient; we need answers for metadata:
  ○ Schema
  ○ Authorization
  ○ Lineage / data governance
● All of them need to be persistent, with a lifecycle decoupled from that of the cluster

Page 32: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Multi-cluster metadata

Diagram labels: v. private subnet, Hive cluster, Impala cluster, Spark cluster, HMS, Navigator, Sentry

● If we are sharing resources we need to address:
  ○ Performance impact
  ○ Concurrency
  ○ Other design limitations

Page 33: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Multi-cluster metadata – Latency

Diagram labels: v. private subnet, Hive cluster, Impala cluster, Spark cluster, HMS, Navigator, Sentry

● Impact of schema and authorization latency
  ○ Most of the Apache stack is being optimized for cloud use cases, but was not designed this way originally
  ○ Depending on the location of and load on these components, latency may increase significantly
  ○ Certain queries are extremely metadata intensive and should be done with care
    ■ e.g. invalidating the entire metadata for Impala
    ■ e.g. declaring DDL over a large number of objects

Page 34: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Multi-cluster metadata – Concurrency

Diagram labels: v. private subnet, Hive cluster, Impala cluster, Spark cluster, HMS, Navigator, Sentry

● Concurrent write access by multiple components can leave metadata in an inconsistent state
  ○ While locking exists in HMS, it is not meant to be multi-cluster
  ○ As a result, it is the responsibility of the user to minimize the risk of concurrent writes

Page 35: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Multi-cluster metadata – other design limitations

Diagram labels: Hive cluster, Impala cluster, Spark cluster, HMS

● Managed: relationship between table and data lifecycle
  ○ SQL syntax explicitly differentiates between managed and unmanaged tables
● Local/Global: visibility of data across clusters
  ○ The notion of global tables does not explicitly exist, which can cause issues
● Example:
  ○ A cluster is created to run a daily workflow, which creates a table in short-lived HDFS
  ○ Metadata is shared via a global HMS
  ○ The table and its contents disappear when the cluster is shut down, but HMS maintains a record of it as a managed table
  ○ Subsequent execution will fail, since the new declaration of the table collides with the recorded declaration of a table that no longer exists

Managed tables:
● Drop table means data is deleted

Unmanaged (external) tables:
● Drop table only deletes table references

Local tables:
● HDFS/EBS backed tables that go away when the cluster is gone

Global tables:
● S3/ADLS backed tables that do not depend on cluster lifecycle
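One way to sidestep the collision in the example above is to declare the table as external, backed by object storage, and to make the declaration idempotent. The sketch below only builds the HiveQL text; it is a possible mitigation rather than the approach prescribed by the talk, and the table, column, and bucket names are hypothetical.

```python
def external_table_ddl(table: str, columns: dict[str, str], location: str) -> str:
    """Build idempotent HiveQL for an external table over object storage."""
    cols = ", ".join(f"`{name}` {hive_type}" for name, hive_type in columns.items())
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} ({cols}) "
        f"STORED AS PARQUET LOCATION '{location}'"
    )


# Hypothetical table and bucket names, for illustration only.
ddl = external_table_ddl(
    "daily_orders",
    {"order_id": "BIGINT", "total": "DOUBLE"},
    "s3a://example-bucket/warehouse/daily_orders",
)
print(ddl)
```

Because the table is external, dropping it only removes the reference, and because its data lives in S3 rather than in short-lived HDFS, the record in a shared HMS keeps pointing at data that still exists after the cluster is gone. `IF NOT EXISTS` lets the daily workflow rerun without failing on the existing declaration.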

Page 36: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Multi-cluster metadata – other design limitations

Diagram labels: v. private subnet, Hive cluster, Impala cluster, Spark cluster, HMS, Navigator, Sentry

● HMS was designed to be local to the cluster
  ○ No strict boundary existed for file access
  ○ Some variables were stored in the local process
● If a central HMS lives in a different subnet with limited data access (HDFS, S3), operations that need that access will fail

Page 37: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Bursting to the cloud

Diagram labels: Cloud infrastructure, IAM, virtual private subnet, private subnet, compute, storage, Hive cluster (HDFS, Spark, Oozie, Hue, Hive, Yarn, Sentry, CM), Active Directory, User computers

Page 38: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Bursting to the cloud

Diagram labels: private subnet with Hive cluster, v. private subnet with Hive cluster

Concept:

● Local cluster is at max capacity
● We need to perform critical computational tasks
● We will create a cluster on the cloud and run the tasks!

Page 39: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Bursting to the cloud – data movement

Diagram labels: private subnet with Hive cluster and CM, v. private subnet with Hive cluster and CM, BDR, Cloud storage, data

● Data could be periodically synced using BDR (e.g. with Cloudera Manager)
● Data can take a very long time to copy
● Is BDR data fresh enough for bursting purposes?
● Is it feasible given the data size?
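Whether replication is feasible and fresh enough is largely a matter of arithmetic on data size and link speed. The sketch below is a back-of-the-envelope estimate, assuming decimal units and a made-up link efficiency of 70%; none of the numbers come from the talk.

```python
def transfer_hours(data_tb: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Hours needed to copy data_tb terabytes over a link_gbps link."""
    bits = data_tb * 8e12                              # 1 TB = 8e12 bits
    seconds = bits / (link_gbps * 1e9 * efficiency)    # usable bits per second
    return seconds / 3600


def fresh_enough(delta_tb: float, link_gbps: float, max_staleness_hours: float) -> bool:
    """Can the changed data be shipped within the staleness the burst job tolerates?"""
    return transfer_hours(delta_tb, link_gbps) <= max_staleness_hours


# Hypothetical sizes: a 100 TB initial copy and a 2 TB daily delta over 10 Gbps.
print(f"initial copy: {transfer_hours(100, 10):.1f} h")
print(f"daily delta:  {transfer_hours(2, 10):.1f} h")
print("delta fits a 4 h freshness window:", fresh_enough(2, 10, 4))
```

The same estimate applied to the full data set usually shows why bursting only works when the bulk of the data is already in cloud storage and only deltas need to move.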

Page 40: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Bursting to the cloud – data movement

Diagram labels: private subnet with Hive cluster and CM, v. private subnet with Hive cluster and CM, BDR, Cloud storage, data, metadata

● Metadata should also be replicated during BDR
● Metadata recreation can take hours, which is an obstacle not only for bursting, but for any on-demand use

Page 41: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Data security

Diagram labels: Cloud infrastructure, IAM, virtual private subnet, private subnet, compute, storage, Hive cluster (HDFS, Spark, Oozie, Hue, Hive, Yarn, Sentry, CM), Active Directory, User computers

Page 42: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Data security – object storage encryption

Diagram labels: cluster instances (each with local disk, EBS, and an instance profile); IAM profile policies for KMS and for bkt 1, bkt 2, bkt 3; S3 with bucket 1, bucket 2, bucket 3 and the labels CMK, KMS, SSE

● Three server-side types of S3 encryption
  ○ SSE-S3 (server side, S3-managed keys). Built into ADLS.
  ○ SSE-KMS (server side, specific key)
  ○ SSE-C (customer-provided key)
● Type of encryption needs to be specified per object read/write
● To read data in a bucket encrypted with a KMS key
  ○ Right to read from the bucket
  ○ Right to use the KMS key
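The two rights needed to read KMS-encrypted data can be written down as one IAM policy attached to the instance profile. This is a minimal sketch that only builds the policy document; the bucket name, account number, and key ID are placeholders, and a real policy would likely be narrower.

```python
import json


def read_policy(bucket: str, kms_key_arn: str) -> dict:
    """IAM policy granting both rights needed to read SSE-KMS encrypted objects."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # right to read from the bucket
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
            },
            {   # right to use the KMS key for decryption
                "Effect": "Allow",
                "Action": ["kms:Decrypt"],
                "Resource": [kms_key_arn],
            },
        ],
    }


# Placeholder identifiers, for illustration only.
policy = read_policy(
    "example-bucket-1",
    "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
)
print(json.dumps(policy, indent=2))
```

If either statement is missing, reads fail even though the other permission is in place, which is a common source of confusing access-denied errors.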

Page 43: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Data security – object storage encryption

Diagram labels: hive cluster instances (each with local disk, EBS, and an instance profile), HMS, HDFS; IAM profile policies for KMS and for bkt 1, bkt 2, bkt 3; S3 with bkt 1, bkt 2, bkt 3 and the labels CMK, KMS, SSE; queries issued over s3a:

● Default bucket encryption: policy specified on the bucket
  ○ Was always there in Azure
  ○ Available in AWS since late 2017
● Will do the work for you
  ○ Check that the instance has bucket access
  ○ Check which KMS key to use
  ○ Check that the instance has KMS key access

Page 44: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Data security – wire encryption

● Easy outside of the cluster / hard inside of the cluster
● Communication to S3
  ○ Secured using HTTPS
● Communication between UI components
  ○ TLS
● Communication between Impala workers
  ○ Kerberos for establishing the connection
  ○ TLS for data
● HDFS RPC uses its own framework
  ○ Supports Kerberos for authentication
  ○ TLS for data
● Other service communication
  ○ E.g. HMS, YARN client-server relationships
  ○ Does not use TLS (AES based), uses Kerberos (3DES based)
  ○ No hardware acceleration, so data movement is significantly slower

Diagram labels: Active Directory, KDC, Impala cluster (YARN, Impala, HMS, HDFS, CM), S3 (bkt 1, bkt 2, bkt 3), User computers, HTTPS, TLS

Page 45: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Identity management

Diagram labels: Cloud infrastructure, IAM, virtual private subnet, private subnet, compute, storage, Hive cluster (HDFS, Spark, Oozie, Hue, Hive, Yarn, Sentry, CM), Active Directory, User computers

Page 46: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Identity management

● On-demand cluster deployment needs to connect cloud identity to corporate identity
● There are three main types of identities that need to be reconciled
  ○ Corporate User Directory
  ○ Cluster OS users
  ○ IAM roles / users

Diagram labels: Cloud user Directory (LDAP, KDC), Impala cluster (HDFS, Hue, Impala, CM), IAM, Instance OS Users, Corporate User Directory (on-prem, LDAP)

Page 47: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Identity management – LDAP on-prem and cloud

● Need to have both LDAP and KDC (Kerberos Key Distribution Center)
  ○ Can be maintained separately
● Two common choices for user management
  ○ Active Directory
  ○ FreeIPA
● Both contain LDAP and KDC
  ○ Will synchronize users

Diagram labels: Cloud user Directory (LDAP, KDC), Impala cluster (HDFS, Hue, Impala, CM), IAM, Instance OS Users, Corporate User Directory (on-prem, LDAP)

Page 48: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Identity management – LDAP on-prem and cloud

● In the case of AD, content can be synchronized using Active Directory Federation Service (ADFS)
● Requires bidirectional network access
  ○ May be a problem for some corporate infosec teams
  ○ Most of the time a non-starter with third-party providers

Diagram labels: Cloud user Directory (LDAP, KDC), Impala cluster (HDFS, Hue, Impala, CM), IAM, Instance OS Users, Corporate User Directory (on-prem, LDAP), ADFS

Page 49: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Identity management – LDAP on-prem and cloud

● For third-party integration the preferred way is to use SAML
  ○ No secrets that can be leaked, just a signed assertion
  ○ Does not require an inbound connection to on-prem
  ○ Could be exposed via an API endpoint, without whitelisting AD/FreeIPA

Diagram labels: Cloud user Directory (LDAP, KDC), Impala cluster (HDFS, Hue, Impala, CM), IAM, Instance OS Users, Corporate User Directory (on-prem, LDAP), SAML

Page 50: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Identity management – OS users

● OS users need to be synchronized with the user directory
● Standard tool is SSSD
● Needs to be automated to create on-demand provisioning
  ○ Or done by managed services

Diagram labels: Cloud user Directory (LDAP, KDC), Impala cluster (HDFS, Hue, Impala, CM), IAM, Instance OS Users, Corporate User Directory (on-prem, LDAP), SAML, SSSD

Page 51: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Identity management – HDFS users

● HDFS users can come from several places
  ○ Directly from LDAP
    ■ Not efficient; lacks some features
  ○ Can maintain an isolated set of users
    ■ Does not help
  ○ Can refer to OS users
    ■ Unix system lookup of user-group membership

Diagram labels: Cloud user Directory (LDAP, KDC), Impala cluster (HDFS, Hue, Impala, CM), IAM, Instance OS Users, Corporate User Directory (on-prem, LDAP), SAML, SSSD
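The "refer to OS users" option boils down to asking the operating system which groups a user belongs to. The sketch below shows that kind of lookup with Python's standard Unix modules; it illustrates the idea rather than how HDFS itself is implemented, and it assumes a Unix host where the `root` account exists.

```python
import grp
import os
import pwd


def groups_for_user(username: str) -> list[str]:
    """Resolve a user's groups through the OS name service (and so through SSSD, if configured)."""
    primary_gid = pwd.getpwnam(username).pw_gid
    names = []
    for gid in os.getgrouplist(username, primary_gid):
        try:
            names.append(grp.getgrgid(gid).gr_name)
        except KeyError:
            names.append(str(gid))  # group id without a name entry
    return names


# 'root' is used only because it exists on most Unix systems.
print("root:", groups_for_user("root"))
```

When SSSD is wired into the host's name service, the same call returns directory-backed groups without the caller knowing about LDAP, which is what makes this route attractive for on-demand clusters.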

Page 52: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Identity management – CM, Hue, Impala

● Other components support various authentication means
  ○ LDAP (CM, Hue, Impala)
  ○ KDC (Hue, Impala)
  ○ SAML (CM)
  ○ Internal user directory (Hue)
  ○ Customer service (CM)
● LDAP is the common denominator for most

Diagram labels: Cloud user Directory (LDAP, KDC), Impala cluster (HDFS, Hue, Impala, CM), IAM, Instance OS Users, Corporate User Directory (on-prem, LDAP), SAML, SSSD

Page 53: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Identity management – IAM

● Depending on the public cloud, this can be easy or hard
  ○ Azure uses AD, so you can just federate
  ○ AWS requires third-party tools
● Unfortunately, this synchronization is still difficult to tie to data (object store) access, as we discussed
● IAM is required for S3 bucket access

Diagram labels: Cloud user Directory (LDAP, KDC), Impala cluster (HDFS, Hue, Impala, CM), IAM, Instance OS Users, Corporate User Directory (on-prem, LDAP), SAML, SSSD

Page 54: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

54 © Cloudera, Inc. All rights reserved.

Questions?

Page 55: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

55 © Cloudera, Inc. All rights reserved.

WHY ARE YOU MOVING TO THE CLOUD

HOW TO MAP ON-PREM TO CLOUD

CLOUD ARCHITECTURE AND CONSIDERATIONS

CLOUDERA ALTUS + DIRECTOR

HANDS-ON LAB

TODAY’S AGENDA

• We will introduce concepts related to cloud deployment

• Talk about architecture and some major considerations

• Do hands-on exercises related to running data engineering and analytic database pipelines

Page 56: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Diagram labels:

● IaaS: Cloudera Director, CDH
● PaaS: Cloudera Altus, Altus services (Data Engineering, Analytic Database)
● Data catalog, Security, Governance, Workload management, Ingest & replication
● Multicloud: AWS S3, Microsoft ADLS

Page 57: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Diagram labels: on-prem CDH; IaaS: Cloudera Director, CDH; PaaS: Cloudera Altus, Altus services (Data Engineering, Analytic Database)

1 Migrate data to the cloud. Migrate services to run on the cloud. Evaluate performance without investment in re-engineering.

2 Given that data and metadata are cloud based, opportunistically migrate specific use cases to cloud-native frameworks.

Page 58: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Altus Data Engineering

What is it?
• Short-lived
• Single tenant
• Hive, Spark, or MapReduce cluster

What is it used for?
• ETL jobs and data pipelining
• Batch processing
• With data living in S3 or ADLS
• Provides fast and easy job submission without cluster management

Page 59: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Altus Analytic DB

What is it?
• Long-lived
• Multi tenant
• Impala cluster

What is it used for?
• Data warehousing
• Analytics
• With data living in S3 or ADLS
• Provides fast and easy analytics without cluster management

Page 60: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Altus Shared Data Experience (SDX)

What is it?
• Cloud-native shared metadata store
• With metadata living in Cloudera-managed cloud storage

What is it used for?
• Shared cataloging to define and preserve structure and business context of data
• Provides unified security across transient and recurring workloads
• Enables consistent governance across all data to increase compliance

Page 61: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

TODAY’S AGENDA

• We will introduce concepts related to cloud deployment

• Talk about architecture and some major considerations

• Do hands-on exercises related to running data engineering and analytic database pipelines

WHY ARE YOU MOVING TO THE CLOUD

HOW TO MAP ON-PREM TO CLOUD

CLOUD ARCHITECTURE AND CONSIDERATIONS

CLOUDERA ALTUS + DIRECTOR

HANDS-ON LAB

Page 62: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Setting the scene

Solving a business need

- We work for an outdoor clothing retail company and website sales are struggling

- We need to figure out whether sales orders correlate with website visits and what steps we can take to improve sales

- We’ll architect and price out clusters, set up a data pipeline, run analytics on the data, and then troubleshoot common problems

Page 63: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

63 © Cloudera, Inc. All rights reserved.

Already set up: raw data ingestion

Sales Orders Raw Logs

Page 64: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

64 © Cloudera, Inc. All rights reserved.

Part one: Data Exploration and Analytics on Sales Orders

Sales Orders

ALTUS SDX

ANALYTIC DATABASE

Page 65: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

65 © Cloudera, Inc. All rights reserved.

Part two: Data Pipeline - Prepare and clean the web logs

Sales Orders Raw Logs Tokenized logs

DATA ENGINEERING

ALTUS SDX

Page 66: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

66 © Cloudera, Inc. All rights reserved.

Part three: Analytics on orders and site visits

Sales Orders Raw Logs Tokenized logs

DATA ENGINEERING

ANALYTIC DATABASE

ALTUS SDX

Page 67: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

67 © Cloudera, Inc. All rights reserved.

http://tiny.cloudera.com/stratalondon18

Page 68: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Handout #1: Architecture and cost

Compute (EC2) and storage (S3) will usually be the biggest costs (and, depending on workload, data transfer as well)

There are a lot of things to consider when designing a cluster architecture. Three big ones for people used to on-prem deployments are:

- Transient vs. persistent
- How to manage data (S3 vs HDFS)
- Automate, automate, automate

See the handout for more things to consider

Page 69: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

Data Engineering on AWS - Best Practices

For full details, Google “Cloudera Data Engineering on AWS Best Practices”

Page 70: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

DE Pattern #1: Transient Cluster on Object Storage

Ideal for async, infrequent, and irregular jobs. Often set up as an automatic pipeline where cluster startup times don’t matter.

Pros
- Tailor clusters to jobs
- Run workloads in isolation
- Use Spot instances to lower cost
- Create / terminate clusters as needed

Cons
- Incur cluster startup cost (time and money; time will probably matter less)
- Spot instances can be terminated (need to automate handling this)
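Automating the handling of Spot terminations usually means rerunning the job on a fresh cluster. The sketch below shows that control flow only: `SpotInterrupted`, `create_cluster`, and `job` are hypothetical stand-ins for whatever the provisioning tool exposes, and the approach assumes the job is safe to rerun because its output lands in object storage.

```python
class SpotInterrupted(Exception):
    """Hypothetical signal that the cluster lost its Spot capacity."""


def run_with_retries(job, create_cluster, max_attempts: int = 3):
    """Run job on a fresh cluster, recreating the cluster after each interruption."""
    for attempt in range(1, max_attempts + 1):
        cluster = create_cluster(attempt)
        try:
            return job(cluster)
        except SpotInterrupted:
            print(f"attempt {attempt} interrupted, retrying on a new cluster")
    raise RuntimeError(f"job did not finish in {max_attempts} attempts")


# Simulated environment: the first cluster is reclaimed, the second one survives.
def fake_create_cluster(attempt: int) -> str:
    return f"cluster-{attempt}"


def fake_job(cluster: str) -> str:
    if cluster == "cluster-1":
        raise SpotInterrupted()
    return f"done on {cluster}"


print(run_with_retries(fake_job, fake_create_cluster))
```

Capping the number of attempts keeps a run of repeated interruptions from retrying forever; a production version might fall back to on-demand instances for the final attempt.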

Page 71: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

DE Pattern #2: Persistent Cluster on Object Storage

Ideal for homogeneous jobs that run well on the same cluster and jobs that run frequently or continuously.

Pros
- No cluster startup cost (time and money)
- Grow and shrink the cluster on demand

Cons
- Potentially paying for unused compute (can shrink the cluster to lower the base cost)
- Less workload isolation

Page 72: IN THE CLOUD RUNNING DATA ANALYTIC WORKLOADS

DE Pattern #3: Persistent Cluster on Local Storage

Ideal for “lift and shift” jobs that require maximum performance, or when you’re first getting started in the cloud.

Pros
- Good first step in cloud migration
- Most performant
- No cluster startup cost (time and money)

Cons
- Potentially paying for unused compute
- Clusters are less elastic (unlike S3, it’s difficult to resize HDFS)
- Less workload isolation / cluster flexibility


Analytic Database on AWS - Best Practices

For full details, Google “Cloudera Analytic Database on AWS Best Practices”


ADB Pattern #1: Transient Cluster on Object Storage

Ideal for async, unpredictable, and irregular workloads, as well as exploratory workloads and workloads still under development.

Pros
- Quick iteration
- Tailor instances and jobs to workloads
- Run workloads in isolation
- Use Spot instances to lower cost

Cons
- Incur cluster startup cost (time and money)
- Spot instances are not always ideal (subject to more interruptions)


ADB Pattern #2: Persistent Cluster on Object Storage

Ideal for frequent, flexible, and changeable workloads.

Pros
- Predictable results
- Full multi-tenant isolation
- No cluster startup cost (time and money)

Cons
- Per-node performance is lower with S3 than with HDFS. You can compensate by using more, cheaper nodes for more throughput.


ADB Pattern #3: Persistent Cluster on Local Storage

Ideal for “lift and shift” jobs that require maximum performance, or when you’re first getting started in the cloud.

Pros
- Good first step in cloud migration
- Most performant for tight SLAs
- No cluster startup cost (time and money)

Cons
- Clusters are less elastic (unlike S3, it’s difficult to resize HDFS)
- Less workload isolation
- Less instance type and cluster flexibility


Handout #1: Architecture and cost

Review cluster and S3 considerations in the handout


Handout #1a: Architecture and cost for an ADB cluster


Handout #1b: Architecture and cost for a DE cluster

Assignment

1. Calculate per hour cost for one node (m4.2xlarge)

2. Calculate per month cost for the whole cluster (at 4 hours / day)

3. Calculate per month S3 cost (assuming all 90 days of data is already in S3)

4. Calculate total AWS cost


Handout #1b: Architecture and cost for a DE cluster

Answers

1. Calculate per hour cost for one node (m4.2xlarge)
$0.40 [ec2] + $0.071 [ebs] = $0.471

2. Calculate per month cost for the whole cluster (at 4 hours / day)
$0.471 [1 node / hour] * 5 [nodes / cluster] * (4 * 30) [hours / month] = $282.60

3. Calculate per month S3 cost (assuming all 90 days of data is already in S3)
1 TB * 90 = 90 TB = 90,000 GB; 90,000 GB * $0.023 = $2,070

4. Calculate total AWS cost
$282.60 + $2,070 = $2,352.60
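
The four answers above can be reproduced with a short script, which also makes it easy to rerun the exercise with a different node count or usage pattern. This is a minimal sketch: the rates and sizes are the figures used on the slide (not current AWS list prices), the "1 TB per day for 90 days" reading of the S3 volume is inferred from the slide's `1TB * 90`, and the function and constant names are illustrative.

```python
# Reproduces the Handout #1b cost arithmetic.
# All figures are taken from the slide; names are illustrative.
EC2_PER_HOUR = 0.40        # m4.2xlarge hourly price used on the slide
EBS_PER_HOUR = 0.071       # EBS cost per node-hour used on the slide
NODES = 5
HOURS_PER_DAY = 4
DAYS_PER_MONTH = 30
S3_PER_GB_MONTH = 0.023
DATA_GB = 90 * 1000        # 90 days at 1 TB / day, with 1 TB counted as 1,000 GB


def node_hour_cost(ec2=EC2_PER_HOUR, ebs=EBS_PER_HOUR):
    """Step 1: cost of one node for one hour."""
    return ec2 + ebs


def cluster_month_cost(per_node_hour, nodes=NODES,
                       hours_per_day=HOURS_PER_DAY, days=DAYS_PER_MONTH):
    """Step 2: cost of the whole cluster for one month."""
    return per_node_hour * nodes * hours_per_day * days


def s3_month_cost(data_gb=DATA_GB, rate=S3_PER_GB_MONTH):
    """Step 3: monthly S3 storage cost for the data already stored."""
    return data_gb * rate


node_hour = node_hour_cost()
cluster_month = cluster_month_cost(node_hour)
s3_month = s3_month_cost()
total = cluster_month + s3_month   # Step 4

print(f"node-hour:     ${node_hour:.3f}")
print(f"cluster/month: ${cluster_month:,.2f}")
print(f"S3/month:      ${s3_month:,.2f}")
print(f"total/month:   ${total:,.2f}")
```

Note that storage dominates the total here: at 4 hours / day the compute bill is roughly an eighth of the S3 bill.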


Handout #2: Log into Altus

If you have an AWS account, use Altus Direct (Handout #3)

If you don’t have an AWS account, hold tight


Handout #3: Altus Direct

Walkthrough


Handout #4: Create the Clusters


15-minute break


Handout #5: Run Queries against ADB - Part 1

• Review the data in S3 using the AWS cloud-altus-demo account
• Connect to your Analytic DB cluster and create tables
• Explore the data: run your first queries

• Based on the total number of items sold, what are the top 5 best selling products?
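
The shape of a "top 5 by items sold" query is an aggregate, a descending sort, and a limit. The sketch below runs that query against an in-memory SQLite table so it can be tried without a cluster. The `order_items` table, its `product` and `quantity` columns, and the sample rows are hypothetical; the lab's real table and column names come from the handout, and in the lab the query would be run against the Analytic DB cluster rather than SQLite.

```python
import sqlite3

# Hypothetical schema and sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_items (product TEXT, quantity INTEGER)")
conn.executemany(
    "INSERT INTO order_items VALUES (?, ?)",
    [
        ("widget", 7), ("widget", 5),
        ("gadget", 9),
        ("gizmo", 4), ("gizmo", 3),
        ("doohickey", 6),
        ("thingamajig", 2),
        ("whatsit", 1),
    ],
)

# Sum quantities per product, sort by the total, keep the first five.
TOP_5_QUERY = """
SELECT product, SUM(quantity) AS items_sold
FROM order_items
GROUP BY product
ORDER BY items_sold DESC
LIMIT 5
"""

top5 = conn.execute(TOP_5_QUERY).fetchall()
for product, items_sold in top5:
    print(product, items_sold)
```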


Handout #6: Run Queries against DE

• Review the web logs in S3 using the AWS cloud-altus-demo account
• Execute a batch job to parse the logs and produce a tokenized table
• Look at S3 and see what has been created

• Optional exercise: run a second batch job to produce a denormalized products table
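
Tokenizing a web log means splitting each raw line into named fields that can then be stored as table columns. The sketch below shows the idea on a single line. It assumes the logs are in Apache Common Log Format, which the deck does not state, and the pattern, field names, and sample line are illustrative; the lab's batch job performs this step on the cluster and may work differently.

```python
import re

# Assumed format: Apache Common Log Format (not confirmed by the deck).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def tokenize(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


sample = ('127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326')
row = tokenize(sample)
print(row)
```

Returning `None` for lines that do not match lets a batch job count or skip malformed records instead of failing on them.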


Handout #7: Run Queries against ADB - Part 2

• Connect to Hue in the Analytic Cluster
• See what new tables are available for Data Analysts
• Explore the tokenized logs table

• Run a complex query: compare the sales information with the web site visits and see if there is anything wrong. Use the provided SQL query.


Handout #8: Troubleshoot failures

- Create a Hive DE job expected to fail
- Use Workload Analytics to understand the error

- (Optional) People with their own account connect using [email protected]
- Look for the failed SparkPi job
- Use Workload Analytics to understand the error


WRAP-UP


Key benefits of PaaS

Spin up working environments ad hoc

Bring your own data and tools

Adjust resources on-demand

Pay for your actual consumption of resources


THANK YOU