Data Lake Best Practices

Agenda

• Why Data Lake
• Key Components of a Data Lake
• Modern Data Architecture
• Some Best Practices
• Case Study
• Summary
• Takeaways

What is a Data Lake?

What, why, etc.

What is a data lake?
• It is an architecture that allows you to collect, store, process, analyze, and consume all data that flows into your organization.

Why data lake?
• Leverage all data that flows into your organization
• Customer centricity
• Business agility
• Better predictions via Machine Learning
• Competitive advantage

Comparison of a Data Lake to an Enterprise Data Warehouse

Data lake | Enterprise data warehouse (EDW)
Complementary to EDW (not replacement) | Data lake can be source for EDW
Schema on read (no predefined schemas) | Schema on write (predefined schemas)
Structured/semi-structured/unstructured data | Structured data only
Fast ingestion of new data/content | Time consuming to introduce new content
Data Science + Prediction/Advanced Analytics + BI use cases | BI use cases only (no prediction/advanced analytics)
Data at low level of detail/granularity | Data at summary/aggregated level of detail
Loosely defined SLAs | Tight SLAs (production schedules)
Flexibility in tools (open source/tools for advanced analytics) | Limited flexibility in tools (SQL only)

[Figure labels: Enterprise DW, EMR, S3]
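The schema-on-read versus schema-on-write distinction in the comparison above can be sketched in a few lines of Python. This is an illustration only: the sample events, the field names, and the two helper functions are hypothetical and do not come from the slides.

```python
import json

# Hypothetical raw events as they might land in a data lake (nothing enforced at write time).
RAW_EVENTS = [
    '{"user": "a1", "amount": "19.99", "channel": "web"}',
    '{"user": "b2", "amount": 5.0}',
    '{"user": "c3", "note": "free-form text, no amount"}',
]

# Schema on write: a fixed schema is checked at load time (warehouse style).
WAREHOUSE_SCHEMA = {"user": str, "amount": float}


def load_schema_on_write(lines):
    """Accept only records that match the predefined schema exactly."""
    accepted, rejected = [], []
    for line in lines:
        record = json.loads(line)
        matches = set(record) == set(WAREHOUSE_SCHEMA) and all(
            isinstance(record[name], expected)
            for name, expected in WAREHOUSE_SCHEMA.items()
        )
        (accepted if matches else rejected).append(record)
    return accepted, rejected


def total_amount_schema_on_read(lines):
    """Keep every raw record; interpret the 'amount' field only at query time."""
    total = 0.0
    for line in lines:
        record = json.loads(line)
        try:
            total += float(record.get("amount", 0))
        except (TypeError, ValueError):
            continue  # unreadable value for this query; the raw record is still stored
    return total


accepted, rejected = load_schema_on_write(RAW_EVENTS)
print(len(accepted), "accepted,", len(rejected), "rejected at load time")
print("total amount read from raw data:", round(total_amount_schema_on_read(RAW_EVENTS), 2))
```

With schema on write, two of the three records never reach the store. With schema on read, all three are kept and each query decides how to interpret them.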

Key Concepts Associated with a Data Lake

[Figure: one STORAGE block and multiple separate COMPUTE blocks]

Components of a Data Lake

Data Storage
• High durability
• Stores raw data from input sources
• Support for any type of data
• Low cost

Streaming
• Streaming ingest of feed data
• Provides the ability to consume any dataset as a stream
• Facilitates low-latency analytics

Storage & Streams

Catalogue & Search

Entitlements

API & UI

Components of a Data Lake

Catalogue
• Metadata lake
• Used for summary statistics and data classification management

Search
• Simplified access model for data discovery

Components of a Data Lake

Entitlements system
• Encryption
• Authentication
• Authorisation
• Chargeback
• Quotas
• Data masking
• Regional restrictions

Components of a Data Lake

API & User Interface
• Exposes the data lake to customers
• Programmatically query catalogue
• Expose search API
• Ensures that entitlements are respected

The Modern Data Architecture

Why Is Amazon S3 the Fabric of Data Lake?
• Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
• Decouple storage and compute
  • No need to run compute clusters for storage (unlike HDFS)
  • Can run transient Hadoop clusters & Amazon EC2 Spot Instances
  • Multiple & heterogeneous analysis clusters can use the same data
• Virtually unlimited number of objects and volume of data
• Very high bandwidth – no aggregate throughput limit
• Designed for 99.99% availability – can tolerate zone failure
• Designed for 99.999999999% durability
• No need to pay for data replication
• Native support for versioning
• Tiered storage (Standard, IA, Amazon Glacier) via life-cycle policies
  • Use HDFS for very frequently accessed (hot) data
• Secure – SSL, client/server-side encryption at rest
• Low cost
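The tiered-storage bullet can be made concrete with a small sketch of a life-cycle configuration, written in the dictionary shape that boto3's `put_bucket_lifecycle_configuration` accepts. The prefix, the day thresholds, the rule ID format, and the bucket name in the comment are assumptions chosen for illustration, not values from the slides.

```python
import json


def build_lifecycle_configuration(prefix, ia_after_days, glacier_after_days):
    """Build a rule that moves objects to Standard-IA and later to Glacier."""
    if not 0 < ia_after_days < glacier_after_days:
        raise ValueError("expected 0 < ia_after_days < glacier_after_days")
    return {
        "Rules": [
            {
                "ID": f"tier-{prefix.strip('/') or 'all'}",  # hypothetical naming scheme
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


# Assumed thresholds: infrequent access after 30 days, archive after one year.
config = build_lifecycle_configuration("raw/", 30, 365)
print(json.dumps(config, indent=2))

# Applying it needs boto3 and AWS credentials; the bucket name is hypothetical:
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-data-lake-bucket", LifecycleConfiguration=config)
```

Keeping the configuration as plain data makes it easy to review and version before it is applied to a bucket.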

Indexing and Searching using Metadata

[Figure: ObjectCreated / ObjectDeleted events → AWS Lambda → PutItem → Metadata Index (DynamoDB) → Update Stream → AWS Lambda (Extract Search Fields) → Update Index → Search Index (Amazon Elasticsearch Service or Amazon CloudSearch)]
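The first hop of this pipeline (object events turned into metadata index writes) might look roughly like the sketch below. It only translates an S3 event notification into a list of index actions, so it runs without AWS access; the attribute names (`object_id`, `size_bytes`, and so on) and the sample event are hypothetical. Note that S3 itself names deletion events `ObjectRemoved`; the sketch also accepts the slide's `ObjectDeleted` label.

```python
from urllib.parse import unquote_plus


def to_index_actions(event):
    """Translate an S3 event notification into ("put", item) / ("delete", id) actions."""
    actions = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        key = unquote_plus(s3["object"]["key"])  # keys arrive URL-encoded in events
        object_id = f"{bucket}/{key}"
        event_name = record["eventName"]
        if event_name.startswith("ObjectCreated"):
            item = {
                "object_id": object_id,
                "bucket": bucket,
                "key": key,
                "size_bytes": s3["object"].get("size", 0),
                "event_time": record.get("eventTime"),
            }
            actions.append(("put", item))  # would become a DynamoDB PutItem call
        elif event_name.startswith(("ObjectRemoved", "ObjectDeleted")):
            actions.append(("delete", object_id))  # would become a DeleteItem call
    return actions


# Hypothetical event containing one upload and one deletion.
sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "eventTime": "2017-03-01T12:00:00.000Z",
            "s3": {
                "bucket": {"name": "example-lake"},
                "object": {"key": "raw/2017/03/trades+file.csv", "size": 1024},
            },
        },
        {
            "eventName": "ObjectRemoved:Delete",
            "s3": {
                "bucket": {"name": "example-lake"},
                "object": {"key": "raw/2016/old.csv"},
            },
        },
    ]
}

for action in to_index_actions(sample_event):
    print(action)
```

In a Lambda function, the handler would call this helper and issue the corresponding DynamoDB writes. A second function on the table's update stream would then extract search fields and update the search index.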

Identity & Access Management
• Manage users, groups, and roles
• Identity federation with OpenID
• Temporary credentials with Amazon Security Token Service (Amazon STS)
• Stored policy templates
• Powerful policy language
• Amazon S3 bucket policies
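As one example of what the policy language and an Amazon S3 bucket policy can express, the sketch below builds a policy document that denies any request not made over SSL/TLS. The bucket name and the statement ID are hypothetical, and a real data lake would typically combine this with further statements.

```python
import json


def deny_insecure_transport_policy(bucket_name):
    """Bucket policy that rejects every request not sent over SSL/TLS."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",  # hypothetical statement ID
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
                # aws:SecureTransport is "false" for plain HTTP requests.
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


policy = deny_insecure_transport_policy("example-lake")
print(json.dumps(policy, indent=2))
```

The resulting JSON is what would be attached to the bucket as its policy.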

Data Encryption

AWS CloudHSM
• Dedicated Tenancy SafeNet Luna SA HSM Device
• Common Criteria EAL4+, NIST FIPS 140-2

AWS Key Management Service
• Automated key rotation & auditing
• Integration with other AWS services

AWS server-side encryption
• AWS managed key infrastructure

Data Lake API & UI
• Exposes the Metadata API, search, and Amazon S3 storage services to customers
• Can be based on TVM/STS Temporary Access for many services, and a bespoke API for Metadata
• Drive all UI operations from API?

Introducing Amazon API Gateway
• Host multiple versions and stages of APIs
• Create and distribute API keys to developers
• Leverage AWS SigV4 to authorize access to APIs
• Throttle and monitor requests to protect the backend
• Leverages AWS Lambda

https://aws.amazon.com/big-data/partner-solutions/

Data Integration Partners
Reduce the effort to move, cleanse, synchronize, manage, and automate data-related processes.

Putting it all together

Building a Data Lake on AWS

[Figure: reference architecture with numbered steps 1 to 10; labelled services include Kinesis Firehose, Athena Query Service, Glue, and Batch]

Processing Data for Analytics on your data lake

Processing & Analytics

[Figure: services grouped under Real-time, Batch, AI & Predictive, BI & Data Visualization, and Transactional & RDBMS]

Services shown:
• AWS Lambda
• Apache Storm on EMR
• Apache Flink on EMR
• Spark Streaming on EMR
• Elasticsearch Service
• Kinesis Analytics, Kinesis Streams
• DynamoDB (NoSQL DB)
• Relational Database, Aurora
• EMR (Hadoop, Spark, Presto)
• Redshift (Data Warehouse)
• Athena (Query Service)
• Amazon Lex (Speech recognition)
• Amazon Rekognition
• Amazon Polly (Text to speech)
• Machine Learning (Predictive analytics)
• Kinesis Streams & Firehose

Important considerations

Data Temperature

             | Hot       | Warm    | Cold
Volume       | MB–GB     | GB–TB   | PB–EB
Item size    | B–KB      | KB–MB   | KB–TB
Latency      | ms        | ms, sec | min, hrs
Durability   | Low–high  | High    | Very high
Request rate | Very high | High    | Low
Cost/GB      | $$-$      | $-¢¢    | ¢

[Figure: scale running from Hot data through Warm data to Cold data]

Which Stream/Message Storage Should I Use?

                      | Amazon DynamoDB Streams | Amazon Kinesis Streams | Amazon Kinesis Firehose     | Apache Kafka      | Amazon SQS (Standard) | Amazon SQS (FIFO)
AWS managed           | Yes                     | Yes                    | Yes                         | No                | Yes                   | Yes
Guaranteed ordering   | Yes                     | Yes                    | No                          | Yes               | No                    | Yes
Delivery (deduping)   | Exactly-once            | At-least-once          | At-least-once               | At-least-once     | At-least-once         | Exactly-once
Data retention period | 24 hours                | 7 days                 | N/A                         | Configurable      | 14 days               | 14 days
Availability          | 3 AZ                    | 3 AZ                   | 3 AZ                        | Configurable      | 3 AZ                  | 3 AZ
Scale/throughput      | No limit / ~table IOPS  | No limit / ~shards     | No limit / automatic        | No limit / ~nodes | No limits / automatic | 300 TPS/queue
Parallel consumption  | Yes                     | Yes                    | No                          | Yes               | No                    | No
Stream MapReduce      | Yes                     | Yes                    | N/A                         | Yes               | N/A                   | N/A
Row/object size       | 400 KB                  | 1 MB                   | Destination row/object size | Configurable      | 256 KB                | 256 KB
Cost                  | Higher (table cost)     | Low                    | Low                         | Low (+admin)      | Low-medium            | Low-medium

[Slide labels: Hot, Warm, New]
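One way to use the comparison above is to treat it as data and filter on hard requirements. The sketch below transcribes four rows of the table as presented in the slides (the values may have changed since the talk). The `StreamStore` class and the `candidates` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamStore:
    name: str
    aws_managed: bool
    guaranteed_ordering: bool
    parallel_consumption: bool
    delivery: str


# Values transcribed from the comparison table in the slides.
STORES = [
    StreamStore("Amazon DynamoDB Streams", True, True, True, "exactly-once"),
    StreamStore("Amazon Kinesis Streams", True, True, True, "at-least-once"),
    StreamStore("Amazon Kinesis Firehose", True, False, False, "at-least-once"),
    StreamStore("Apache Kafka", False, True, True, "at-least-once"),
    StreamStore("Amazon SQS (Standard)", True, False, False, "at-least-once"),
    StreamStore("Amazon SQS (FIFO)", True, True, False, "exactly-once"),
]


def candidates(*, aws_managed=None, guaranteed_ordering=None,
               parallel_consumption=None, delivery=None):
    """Return the stores matching every requirement that is not None."""
    wanted = {
        "aws_managed": aws_managed,
        "guaranteed_ordering": guaranteed_ordering,
        "parallel_consumption": parallel_consumption,
        "delivery": delivery,
    }
    return [
        store.name
        for store in STORES
        if all(getattr(store, field) == value
               for field, value in wanted.items() if value is not None)
    ]


# Example: a managed service with ordered records and several parallel consumers.
print(candidates(aws_managed=True, guaranteed_ordering=True, parallel_consumption=True))
```

Soft criteria such as cost, retention, and record size still need a human judgement; the filter only narrows the shortlist.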

Analytics Types & Frameworks (PROCESS / ANALYZE)

Batch
• Takes minutes to hours
• Example: Daily/weekly/monthly reports
• Amazon EMR (MapReduce, Hive, Pig, Spark)

Interactive
• Takes seconds
• Example: Self-service dashboards
• Amazon Redshift, Amazon Athena, Amazon EMR (Presto, Spark)
• Subsecond: ElastiCache (Redis 3.2 TiB, Memcached), SAP HANA

Message
• Takes milliseconds to seconds
• Example: Message processing
• Amazon SQS applications on Amazon EC2

Stream
• Takes milliseconds to seconds
• Example: Fraud alerts, 1-minute metrics
• Amazon EMR (Spark Streaming), Amazon Kinesis Analytics, KCL, Storm, AWS Lambda

Artificial Intelligence
• Takes milliseconds to minutes
• Example: Fraud detection, forecast demand, text to speech
• Amazon AI (Lex, Polly, ML, Rekognition), Amazon EMR (Spark ML), Deep Learning AMI (MXNet, TensorFlow, Theano, Torch, CNTK, and Caffe)

[Figure: PROCESS / ANALYZE stage showing Message, Stream, Interactive, Batch, and AI frameworks on a Fast-to-Slow scale]

Which Analysis Tool Should I Use?

                    | Amazon Redshift                                    | Amazon Athena                                 | Amazon EMR (Presto, Spark, Hive)
Use case            | Optimized for data warehousing                     | Ad-hoc interactive queries                    | Presto: interactive query; Spark: general purpose (iterative ML, RT, ..); Hive: batch
Scale/throughput    | ~Nodes                                             | Automatic / No limits                         | ~Nodes
AWS managed service | Yes                                                | Yes, serverless                               | Yes
Storage             | Local storage                                      | Amazon S3                                     | Amazon S3, HDFS
Optimization        | Columnar storage, data compression, and zone maps  | CSV, TSV, JSON, Parquet, ORC, Apache Web log  | Framework dependent
Metadata            | Amazon Redshift managed                            | Athena Catalog Manager                        | Hive Meta-store
BI tools support    | Yes (JDBC/ODBC)                                    | Yes (JDBC)                                    | Yes (JDBC/ODBC & Custom)
Access controls     | Users, groups, and access controls                 | AWS IAM                                       | Integration with LDAP
UDF support         | Yes (Scalar)                                       | No                                            | Yes

Case Study

“For our market surveillance systems, we are looking at about 40% [savings with AWS], but the real benefits are the business benefits: We can do things that we physically weren’t able to do before, and that is priceless.”

- Steve Randich, CIO

Case Study: Re-architecting Compliance

What FINRA needed
• Infrastructure for its market surveillance platform
• Support of analysis and storage of approximately 75 billion market events every day

Why they chose AWS
• Fulfillment of FINRA’s security requirements
• Ability to create a flexible platform using dynamic clusters (Hadoop, Hive, and HBase), Amazon EMR, and Amazon S3

Benefits realized
• Increased agility, speed, and cost savings
• Estimated savings of $10-20 million annually by using AWS

Fraud Detection

FINRA uses Amazon EMR and Amazon S3 to process up to 75 billion trading events per day and securely store over 5 petabytes of data, attaining savings of $10-20 million per year.
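To put the 75 billion events per day in perspective, a quick back-of-the-envelope calculation converts it to an average per-second rate. The sketch assumes events are spread evenly across a 24-hour day, which real market traffic is not, so peak rates would be higher.

```python
EVENTS_PER_DAY = 75_000_000_000   # "up to 75 billion trading events per day"
SECONDS_PER_DAY = 24 * 60 * 60    # 86,400

# Average rate under the (unrealistic) assumption of a uniform 24-hour spread.
average_events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY
print(f"{average_events_per_second:,.0f} events per second on average")
# prints: 868,056 events per second on average
```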

Summary

• AWS enables you to build sophisticated data lakes and related analytics applications

• Retrospective, Real-time, Predictive

• You can build incrementally, adding use cases and increasing scale as you go

• AWS provides a broad range of security and auditing features to enable you to meet your security requirements

https://aws.amazon.com/big-data/

Takeaways

• Prescriptive guidance and rapidly deployable solutions to help you store, analyze, and process big data on the AWS Cloud

• Derive Insights from IoT in Minutes using AWS IoT, Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight

• Deploying a Data Lake on AWS - March 2017 AWS Online Tech Talks

• Harmonize, Search, and Analyze Loosely Coupled Datasets on AWS

• Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly Webinar Series - YouTube

http://bit.ly/2qiElYx

http://amzn.to/2mzGppL

http://bit.ly/2qipA8h

http://amzn.to/2qpiFaK

http://amzn.to/2lpbc8p
