Data Lake BestPractices - aws-de-media.s3-eu-west-1 ...... · AWS CloudHSM Dedicated Tenancy...

Data Lake Best Practices

Agenda

• Why Data Lake
• Key Components of a Data Lake
• Modern Data Architecture
• Some Best Practices
• Case Study
• Summary
• Takeaways

What is a Data Lake?

What, why, etc.

What is a data lake?
• It is an architecture that allows you to collect, store, process, analyze and consume all data that flows into your organization.

Why data lake?
• Leverage all data that flows into your organization
• Customer centricity
• Business agility
• Better predictions via Machine Learning
• Competitive advantage

Comparison of a Data Lake to an Enterprise Data Warehouse

Data lake | Enterprise data warehouse (EDW)
Complementary to EDW (not replacement) | Data lake can be source for EDW
Schema on read (no predefined schemas) | Schema on write (predefined schemas)
Structured/semi-structured/unstructured data | Structured data only
Fast ingestion of new data/content | Time consuming to introduce new content
Data Science + Prediction/Advanced Analytics + BI use cases | BI use cases only (no prediction/advanced analytics)
Data at low level of detail/granularity | Data at summary/aggregated level of detail
Loosely defined SLAs | Tight SLAs (production schedules)
Flexibility in tools (open source/tools for advanced analytics) | Limited flexibility in tools (SQL only)

[Diagram: Enterprise DW, EMR, S3]
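The "schema on read" row in the comparison can be illustrated with a minimal Python sketch. It assumes raw events are stored as JSON lines exactly as they arrived and that each consumer applies its own projection when reading. The field names (`user`, `amount`, `channel`) and the helper `read_with_schema` are hypothetical and only serve the illustration.

```python
import json

# Raw records are stored exactly as they arrived; no schema is enforced on write.
RAW_LINES = [
    '{"user": "a1", "amount": "12.50", "ts": "2017-03-01T10:00:00Z"}',
    '{"user": "b2", "amount": 7, "channel": "web"}',
    '{"user": "c3"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: select the named fields and cast them.

    `schema` maps a field name to a casting function. Fields missing from a
    record become None instead of raising, which is the usual trade-off of
    schema on read.
    """
    for line in lines:
        record = json.loads(line)
        yield {
            name: (cast(record[name]) if name in record else None)
            for name, cast in schema.items()
        }


# Two consumers read the same raw data through different schemas.
billing_view = list(read_with_schema(RAW_LINES, {"user": str, "amount": float}))
channel_view = list(read_with_schema(RAW_LINES, {"user": str, "channel": str}))
```

With schema on write, the second and third records would have been rejected or reshaped at load time; here they stay in storage untouched and each view decides how to interpret them.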

Key Concepts Associated with a Data Lake

[Diagram: a STORAGE block and several COMPUTE blocks]

Components of a Data Lake

Data Storage

• High durability
• Stores raw data from input sources
• Support for any type of data
• Low cost

Streaming

• Streaming ingest of feed data
• Provides the ability to consume any dataset as a stream
• Facilitates low latency analytics

Storage & Streams

Catalogue & Search

Entitlements

API & UI

Catalogue

• Metadata lake
• Used for summary statistics and data classification management

Search

• Simplified access model for data discovery

Entitlements system

• Encryption
• Authentication
• Authorisation
• Chargeback
• Quotas
• Data masking
• Regional restrictions

API & User Interface

• Exposes the data lake to customers
• Programmatically query catalogue
• Expose search API
• Ensures that entitlements are respected

The Modern Data Architecture


Why Is Amazon S3 the Fabric of Data Lake?

• Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
• Decouple storage and compute
  • No need to run compute clusters for storage (unlike HDFS)
  • Can run transient Hadoop clusters & Amazon EC2 Spot Instances
  • Multiple & heterogeneous analysis clusters can use the same data
• Virtually unlimited number of objects and volume of data
• Very high bandwidth – no aggregate throughput limit
• Designed for 99.99% availability – can tolerate zone failure
• Designed for 99.999999999% durability
• No need to pay for data replication
• Native support for versioning
• Tiered-storage (Standard, IA, Amazon Glacier) via life-cycle policies
  • Use HDFS for very frequently accessed (hot) data
• Secure – SSL, client/server-side encryption at rest
• Low cost
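The tiered-storage bullet can be made concrete with a small Python sketch that builds a lifecycle configuration as a plain dictionary. The dictionary follows the shape S3 accepts for lifecycle rules (for example as the `LifecycleConfiguration` argument of boto3's `put_bucket_lifecycle_configuration`). The prefix, the day counts, and the helper name `tiering_rule` are assumptions chosen for illustration, and no AWS call is made here.

```python
import json


def tiering_rule(prefix, ia_after_days, glacier_after_days):
    """Build one lifecycle rule that moves objects under `prefix` to colder tiers."""
    if not 0 < ia_after_days < glacier_after_days:
        raise ValueError("expected 0 < ia_after_days < glacier_after_days")
    return {
        # Hypothetical naming scheme for the rule ID.
        "ID": "tier-" + prefix.strip("/").replace("/", "-"),
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
    }


# Assumed policy: infrequent access after 30 days, Glacier after 90 days.
lifecycle_configuration = {"Rules": [tiering_rule("raw/clickstream/", 30, 90)]}
print(json.dumps(lifecycle_configuration, indent=2))
```

Keeping the rule in code makes the hot, warm, and cold boundaries explicit and reviewable; the actual thresholds depend on how often each dataset is read.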


Indexing and Searching using Metadata

[Diagram: ObjectCreated and ObjectDeleted events invoke AWS Lambda, which calls PutItem on the Metadata Index (DynamoDB); the table's update stream feeds a step that extracts search fields and updates the Search Index (Amazon Elasticsearch Service or Amazon CloudSearch)]
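The first hop of this indexing flow can be sketched in Python as a pure function that turns an S3 event notification into metadata-index actions. The record layout (`Records`, `eventName`, `s3.bucket.name`, `s3.object.key`) is the standard S3 notification format, where deletions arrive under `ObjectRemoved` event names. The item attributes, the `object_id` key, and the function name are hypothetical, and the DynamoDB write itself is deliberately left out.

```python
from urllib.parse import unquote_plus


def index_actions(event):
    """Translate an S3 event notification into metadata-index actions.

    Returns a list of ("put", item) or ("delete", key) tuples. A real Lambda
    function would forward these to the metadata table; that call is omitted.
    """
    actions = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        object_id = f"{bucket}/{key}"  # hypothetical primary key for the index
        if record["eventName"].startswith("ObjectCreated"):
            actions.append(("put", {
                "object_id": object_id,
                "bucket": bucket,
                "key": key,
                "size_bytes": record["s3"]["object"].get("size", 0),
                "event_time": record["eventTime"],
            }))
        elif record["eventName"].startswith("ObjectRemoved"):
            actions.append(("delete", {"object_id": object_id}))
    return actions


# Hand-written sample event for illustration only.
sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "eventTime": "2017-03-01T10:00:00.000Z",
            "s3": {
                "bucket": {"name": "example-data-lake"},
                "object": {"key": "raw/2017/my+file%3D1.json", "size": 2048},
            },
        },
        {
            "eventName": "ObjectRemoved:Delete",
            "eventTime": "2017-03-02T10:00:00.000Z",
            "s3": {
                "bucket": {"name": "example-data-lake"},
                "object": {"key": "raw/2016/old.json"},
            },
        },
    ]
}
actions = index_actions(sample_event)
```

Separating the translation from the database call keeps the handler easy to test; the search index would then be updated from the metadata table's stream rather than from S3 directly.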


Identity & Access Management

• Manage users, groups, and roles
• Identity federation with OpenID
• Temporary credentials with AWS Security Token Service (AWS STS)
• Stored policy templates
• Powerful policy language
• Amazon S3 bucket policies
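As an illustration of the policy language and of S3 bucket policies, the following Python sketch assembles a bucket policy document that grants one IAM role read access to a single prefix. The policy elements (`Version`, `Statement`, `Effect`, `Principal`, `Action`, `Resource`) are standard IAM policy grammar. The bucket name, prefix, account ID, role name, and helper name are placeholders assumed for the example.

```python
import json


def read_only_prefix_policy(bucket, prefix, role_arn):
    """Build a bucket policy that lets one IAM role read objects under a prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowRoleReadOnPrefix",
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            }
        ],
    }


# All identifiers below are placeholders, not real resources.
policy = read_only_prefix_policy(
    "example-data-lake",
    "curated/sales/",
    "arn:aws:iam::123456789012:role/AnalystRole",
)
print(json.dumps(policy, indent=2))
```

Generating such documents from a small template function is one way to realise the "stored policy templates" idea: the entitlement system fills in the bucket, prefix, and principal per dataset.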

Data Encryption

AWS CloudHSM
Dedicated Tenancy SafeNet Luna SA HSM Device
Common Criteria EAL4+, NIST FIPS 140-2

AWS Key Management Service
Automated key rotation & auditing
Integration with other AWS services

AWS server side encryption
AWS managed key infrastructure


Data Lake API & UI

Exposes the Metadata API, search, and Amazon S3 storage services to customers

Can be based on TVM/STS Temporary Access for many services, and a bespoke API for Metadata

Drive all UI operations from API?

Introducing Amazon API Gateway

Host multiple versions and stages of APIs

Create and distribute API keys to developers

Leverage AWS SigV4 to authorize access to APIs

Throttle and monitor requests to protect the backend

Leverages AWS Lambda
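Request throttling of the kind listed above is commonly implemented with a token bucket. The Python sketch below is a generic illustration of that technique under assumed parameters; it is not Amazon API Gateway's implementation, and the class name, rate, and burst values are hypothetical. Time is passed in explicitly so the behaviour is deterministic.

```python
class TokenBucket:
    """Allow up to `burst` requests at once, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec, burst, now=0.0):
        self.rate = float(rate_per_sec)
        self.capacity = float(burst)
        self.tokens = float(burst)  # start with a full bucket
        self.last = float(now)

    def allow(self, now):
        """Return True if a request arriving at time `now` may proceed."""
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Assumed limits: 2 requests per second steady state, bursts of 3.
bucket = TokenBucket(rate_per_sec=2, burst=3)
burst_results = [bucket.allow(0.0) for _ in range(4)]
later_results = [bucket.allow(0.5), bucket.allow(0.5)]
```

The burst size protects the backend from sudden spikes while the refill rate bounds sustained load; rejected requests would typically be answered with an HTTP 429 so that clients can back off.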


https://aws.amazon.com/big-data/partner-solutions/

Data Integration Partners

Reduce the effort to move, cleanse, synchronize, manage, and automate data-related processes.

Putting it all together

Building a Data Lake on AWS

[Diagram labels: Kinesis Firehose, Athena Query Service, Glue Batch]

Processing Data for Analytics on your data lake

[Diagram: Processing & Analytics]

Categories shown: Real-time, Batch, AI & Predictive, BI & Data Visualization, Transactional & RDBMS

Services shown:
• AWS Lambda
• Apache Storm on EMR
• Apache Flink on EMR
• Spark Streaming on EMR
• Elasticsearch Service
• Kinesis Analytics, Kinesis Streams
• DynamoDB (NoSQL DB)
• Aurora (Relational Database)
• EMR (Hadoop, Spark, Presto)
• Redshift (Data Warehouse)
• Athena (Query Service)
• Amazon Lex (Speech recognition)
• Amazon Rekognition
• Amazon Polly (Text to speech)
• Machine Learning (Predictive analytics)
• Kinesis Streams & Firehose

Important considerations

Data Temperature

Characteristic | Hot data | Warm data | Cold data
Volume | MB–GB | GB–TB | PB–EB
Item size | B–KB | KB–MB | KB–TB
Latency | ms | ms, sec | min, hrs
Durability | Low–high | High | Very high
Request rate | Very high | High | Low
Cost/GB | $$-$ | $-¢¢ | ¢

Which Stream/Message Storage Should I Use?

Criterion | Amazon DynamoDB Streams | Amazon Kinesis Streams | Amazon Kinesis Firehose | Apache Kafka | Amazon SQS (Standard) | Amazon SQS (FIFO)
AWS managed | Yes | Yes | Yes | No | Yes | Yes
Guaranteed ordering | Yes | Yes | No | Yes | No | Yes
Delivery (deduping) | Exactly-once | At-least-once | At-least-once | At-least-once | At-least-once | Exactly-once
Data retention period | 24 hours | 7 days | N/A | Configurable | 14 days | 14 days
Availability | 3 AZ | 3 AZ | 3 AZ | Configurable | 3 AZ | 3 AZ
Scale/throughput | No limit / ~table IOPS | No limit / ~shards | No limit / automatic | No limit / ~nodes | No limits / automatic | 300 TPS/queue
Parallel consumption | Yes | Yes | No | Yes | No | No
Stream MapReduce | Yes | Yes | N/A | Yes | N/A | N/A
Row/object size | 400 KB | 1 MB | Destination row/object size | Configurable | 256 KB | 256 KB
Cost | Higher (table cost) | Low | Low | Low (+admin) | Low-medium | Low-medium

[Temperature scale beneath the table: Hot to Warm]
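The "Delivery (deduping)" row matters in practice: with at-least-once delivery the same message can arrive more than once, so the consumer has to be idempotent. The Python sketch below shows one common approach, deduplicating on a message ID. The class name and the in-memory set are assumptions for illustration; a production consumer would keep the seen IDs in a durable store with an expiry.

```python
class IdempotentConsumer:
    """Process each message ID at most once on top of at-least-once delivery."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # in-memory only; a durable store is assumed in production

    def receive(self, message_id, body):
        """Return True if the message was processed, False if it was a duplicate."""
        if message_id in self._seen:
            return False
        # Record the ID only after the handler succeeds, so a failed
        # attempt is retried when the message is redelivered.
        self._handler(body)
        self._seen.add(message_id)
        return True


processed = []
consumer = IdempotentConsumer(processed.append)
results = [
    consumer.receive("m-1", "order created"),
    consumer.receive("m-2", "order paid"),
    consumer.receive("m-1", "order created"),  # redelivery of the first message
]
```

Sources that already provide exactly-once semantics in the table remove the need for this bookkeeping, which is part of the trade-off against their other limits.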

Analytics Types & Frameworks

Batch
Takes minutes to hours
Example: Daily/weekly/monthly reports
Amazon EMR (MapReduce, Hive, Pig, Spark)

Interactive
Takes seconds
Example: Self-service dashboards
Amazon Redshift, Amazon Athena, Amazon EMR (Presto, Spark)
Subsecond: ElastiCache (Redis 3.2TiB, Memcached), SAP HANA

Message
Takes milliseconds to seconds
Example: Message processing
Amazon SQS applications on Amazon EC2

Stream
Takes milliseconds to seconds
Example: Fraud alerts, 1 minute metrics
Amazon EMR (Spark Streaming), Amazon Kinesis Analytics, KCL, Storm, AWS Lambda

Artificial Intelligence
Takes milliseconds to minutes
Example: Fraud detection, forecast demand, text to speech
Amazon AI (Lex, Polly, ML, Rekognition), Amazon EMR (Spark ML), Deep Learning AMI (MXNet, TensorFlow, Theano, Torch, CNTK and Caffe)

[Diagram: PROCESS/ANALYZE stage, with the frameworks above grouped by type (Message, Stream, Interactive, AI)]

Which Analysis Tool Should I Use?

Criterion | Amazon Redshift | Amazon Athena | Amazon EMR (Presto, Spark, Hive)
Use case | Optimized for data warehousing | Ad-hoc interactive queries | Presto: interactive query; Spark: general purpose (iterative ML, RT, ..)
Scale/throughput | ~Nodes | Automatic / No limits | ~Nodes
AWS managed service | Yes | Yes, Serverless | Yes
Storage | Local storage | Amazon S3 | Amazon S3, HDFS
Optimization | Columnar storage, data compression, and zone maps | CSV, TSV, JSON, Parquet, ORC, Apache Web log | Framework dependent
Metadata | Amazon Redshift managed | Athena Catalog Manager | Hive Meta-store
BI tools support | Yes (JDBC/ODBC) | Yes (JDBC) | Yes (JDBC/ODBC & Custom)
Access controls | Users, groups, and access controls | AWS IAM | Integration with LDAP
UDF support | Yes (Scalar) | No | Yes

Case Study

“For our market surveillance systems, we are looking at about 40% [savings with AWS], but the real benefits are the business benefits: We can do things that we physically weren’t able to do before, and that is priceless.”

- Steve Randich, CIO

Case Study: Re-architecting Compliance

What FINRA needed
• Infrastructure for its market surveillance platform
• Support of analysis and storage of approximately 75 billion market events every day

Why they chose AWS
• Fulfillment of FINRA’s security requirements
• Ability to create a flexible platform using dynamic clusters (Hadoop, Hive, and HBase), Amazon EMR, and Amazon S3

Benefits realized
• Increased agility, speed, and cost savings
• Estimated savings of $10-20m annually by using AWS

Fraud Detection

FINRA uses Amazon EMR and Amazon S3 to process up to 75 billion trading events per day and securely store over 5 petabytes of data, attaining savings of $10-20mm per year.

Summary

• AWS enables you to build sophisticated data lakes and related analytics applications

• Retrospective, Real-time, Predictive

• You can build incrementally, adding use cases and increasing scale as you go

• AWS provides a broad range of security and auditing features to enable you to meet your security requirements

https://aws.amazon.com/big-data/

Takeaways

• Prescriptive guidance and rapidly deployable solutions to help you store, analyze, and process big data on the AWS Cloud

• Derive Insights from IoT in Minutes using AWS IoT, Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight

• Deploying a Data Lake on AWS - March 2017 AWS Online Tech Talks

• Harmonize, Search, and Analyze Loosely Coupled Datasets on AWS

• Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly Webinar Series - YouTube

http://bit.ly/2qiElYx

http://amzn.to/2mzGppL

http://bit.ly/2qipA8h

http://amzn.to/2qpiFaK

http://amzn.to/2lpbc8p
