Data Lake Best Practices - Amazon Web Services

  • Data Lake Best Practices

  • Agenda

    Why Data Lake, Key Components of a Data Lake, Modern Data Architecture, Some Best Practices, Case Study, Summary, Takeaways

  • What is a Data Lake?

  • What, why, etc.

    What is a data lake? It is an architecture that allows you to collect, store, process, analyze, and consume all data that flows into your organization.

    Why a data lake? Leverage all data that flows into your organization:
      Customer centricity
      Business agility
      Better predictions via machine learning
      Competitive advantage

  • Comparison of a Data Lake to an Enterprise Data Warehouse

    Data lake | Enterprise data warehouse (EDW)
    Complementary to EDW (not a replacement) | Data lake can be a source for the EDW
    Schema on read (no predefined schemas) | Schema on write (predefined schemas)
    Structured/semi-structured/unstructured data | Structured data only
    Fast ingestion of new data/content | Time consuming to introduce new content
    Data science + prediction/advanced analytics + BI use cases | BI use cases only (no prediction/advanced analytics)
    Data at low level of detail/granularity | Data at summary/aggregated level of detail
    Loosely defined SLAs | Tight SLAs (production schedules)
    Flexibility in tools (open source/tools for advanced analytics) | Limited flexibility in tools (SQL only)

    [Diagram labels: Enterprise DW, EMR, S3]
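
    The schema-on-read row of the comparison can be illustrated with a minimal Python sketch. The records, the field names, and the `read_with_schema` helper are all hypothetical; the point is only that raw data is stored as-is and the reader decides which fields and types it needs at query time.

```python
import json

# Raw events land in the lake exactly as produced; note the differing fields.
RAW_LINES = [
    '{"user": "a1", "amount": "12.50", "channel": "web"}',
    '{"user": "b2", "amount": "7"}',
    '{"user": "c3", "amount": "3.25", "coupon": "SPRING"}',
]

# A schema chosen by the reader at query time (hypothetical field names).
READ_SCHEMA = {"user": str, "amount": float}


def read_with_schema(lines, schema):
    """Parse raw JSON lines and project/cast them to the reader's schema."""
    rows = []
    for line in lines:
        record = json.loads(line)
        # Fields outside the schema are ignored rather than rejected.
        rows.append({name: cast(record[name]) for name, cast in schema.items()})
    return rows


rows = read_with_schema(RAW_LINES, READ_SCHEMA)
total = sum(row["amount"] for row in rows)
```

    A schema-on-write system would instead reject or reshape the second and third records at load time, which is why introducing new content takes longer there.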

  • Key Concepts Associated with a Data Lake

  • [Diagram: one STORAGE block alongside multiple COMPUTE blocks]

  • Components of a Data Lake

    Data Storage:
      High durability
      Stores raw data from input sources
      Support for any type of data
      Low cost

    Streaming:
      Streaming ingest of feed data
      Provides the ability to consume any dataset as a stream
      Facilitates low-latency analytics

    [Diagram layers: Storage & Streams, Catalogue & Search, Entitlements, API & UI]

  • Components of a Data Lake

    Catalogue:
      Metadata lake
      Used for summary statistics and data classification management

    Search:
      Simplified access model for data discovery

  • Components of a Data Lake

    Entitlements system:
      Encryption
      Authentication
      Authorisation
      Chargeback
      Quotas
      Data masking
      Regional restrictions

  • Components of a Data Lake

    API & User Interface:
      Exposes the data lake to customers
      Programmatically query the catalogue
      Expose a search API
      Ensures that entitlements are respected

  • The Modern Data Architecture

  • Why Is Amazon S3 the Fabric of Data Lake?

    Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
    Decouple storage and compute
      No need to run compute clusters for storage (unlike HDFS)
      Can run transient Hadoop clusters and Amazon EC2 Spot Instances
      Multiple, heterogeneous analysis clusters can use the same data
    Virtually unlimited number of objects and volume of data
    Very high bandwidth: no aggregate throughput limit
    Designed for 99.99% availability: can tolerate zone failure
    Designed for 99.999999999% durability
    No need to pay for data replication
    Native support for versioning
    Tiered storage (Standard, IA, Amazon Glacier) via life-cycle policies
    Use HDFS for very frequently accessed (hot) data
    Secure: SSL, client/server-side encryption at rest
    Low cost
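
    The tiered-storage point above can be sketched as a lifecycle configuration. This is an illustration under assumptions, not a setup taken from the slides: the `raw/` prefix, the 30- and 90-day thresholds, and both helper names are hypothetical. The dictionary follows the shape accepted by boto3's `put_bucket_lifecycle_configuration`.

```python
def build_lifecycle_rules(prefix, ia_after_days=30, glacier_after_days=90):
    """Build a lifecycle configuration that tiers objects under `prefix`."""
    if glacier_after_days <= ia_after_days:
        raise ValueError("Glacier transition must come after the IA transition")
    return {
        "Rules": [
            {
                "ID": f"tier-{prefix.strip('/') or 'all'}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


def apply_lifecycle(bucket, config):
    """Send the configuration to S3 (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above runs without it

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )


config = build_lifecycle_rules("raw/")
```

    Keeping the rule builder separate from the API call makes the policy easy to review and test before it touches a real bucket.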

  • Indexing and Searching using Metadata

    [Diagram labels, in flow order: ObjectCreated / ObjectDeleted events -> AWS Lambda -> PutItem -> Metadata Index (DynamoDB) -> Update Stream -> AWS Lambda (Extract Search Fields) -> Update Index -> Search Index (Amazon Elasticsearch Service or Amazon CloudSearch)]
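
    The first half of this flow (object events feeding a metadata index) might look roughly like the following Lambda sketch. The item layout, the `object_id` key, and the optional `table` argument are assumptions made for illustration. Note that S3 reports deletions with event names beginning `ObjectRemoved`, which is what the code checks for.

```python
from urllib.parse import unquote_plus


def s3_event_to_actions(event):
    """Translate an S3 notification event into metadata-index actions."""
    actions = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        if record["eventName"].startswith("ObjectCreated"):
            item = {
                "object_id": f"{bucket}/{key}",  # hypothetical partition key
                "bucket": bucket,
                "key": key,
                "size": record["s3"]["object"].get("size", 0),
                "event_time": record["eventTime"],
            }
            actions.append(("put", item))
        elif record["eventName"].startswith("ObjectRemoved"):
            actions.append(("delete", {"object_id": f"{bucket}/{key}"}))
    return actions


def handler(event, context, table=None):
    """Lambda entry point; `table` is a boto3 DynamoDB Table resource."""
    actions = s3_event_to_actions(event)
    if table is not None:
        for verb, payload in actions:
            if verb == "put":
                table.put_item(Item=payload)
            else:
                table.delete_item(Key=payload)
    return {"indexed": len(actions)}
```

    A second function subscribed to the table's update stream would then extract search fields and update the search index; it is omitted here.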

  • Identity & Access Management

    Manage users, groups, and roles
    Identity federation with OpenID
    Temporary credentials with AWS Security Token Service (AWS STS)
    Stored policy templates
    Powerful policy language
    Amazon S3 bucket policies
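
    As one example of what the policy language and S3 bucket policies allow, the sketch below builds a policy that denies any request not sent over SSL/TLS. The bucket name and the function name are placeholders; `aws:SecureTransport` is a standard IAM condition key.

```python
import json


def deny_insecure_transport_policy(bucket):
    """Bucket policy that rejects any request not made over TLS."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                # Matches requests that did not arrive over HTTPS.
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


policy_json = json.dumps(deny_insecure_transport_policy("example-data-lake"))
```

    The resulting JSON string is the form expected by the bucket policy editor or by an API call that sets the bucket policy.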

  • Data Encryption

    AWS CloudHSM:
      Dedicated tenancy SafeNet Luna SA HSM device
      Common Criteria EAL4+, NIST FIPS 140-2

    AWS Key Management Service:
      Automated key rotation and auditing
      Integration with other AWS services

    AWS server-side encryption:
      AWS managed key infrastructure
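
    For the server-side options, a minimal sketch of how an upload could request encryption. The helper name and its arguments are hypothetical; `ServerSideEncryption` and `SSEKMSKeyId` are parameters of boto3's S3 `put_object`.

```python
def encrypted_put_args(bucket, key, body, kms_key_id=None):
    """Build put_object arguments that request server-side encryption."""
    args = {"Bucket": bucket, "Key": key, "Body": body}
    if kms_key_id:
        # SSE-KMS: S3 encrypts with a key managed in AWS KMS.
        args["ServerSideEncryption"] = "aws:kms"
        args["SSEKMSKeyId"] = kms_key_id
    else:
        # SSE-S3: S3 encrypts with keys it manages itself.
        args["ServerSideEncryption"] = "AES256"
    return args


args = encrypted_put_args("example-data-lake", "raw/events.json", b"{}")
```

    With credentials configured, the upload itself would be `boto3.client("s3").put_object(**args)`.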

  • Data Lake API & UI

    Exposes the Metadata API, search, and Amazon S3 storage services to customers
    Can be based on TVM/STS temporary access for many services, and a bespoke API for metadata
    Drive all UI operations from the API?

  • Introducing Amazon API Gateway

    Host multiple versions and stages of APIs
    Create and distribute API keys to developers
    Leverage AWS SigV4 to authorize access to APIs
    Throttle and monitor requests to protect the backend
    Leverages AWS Lambda
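
    A bespoke metadata search endpoint behind API Gateway could be backed by a Lambda function along these lines. This is a sketch only: the in-memory `CATALOGUE`, the `tag` query parameter, and the response fields are invented for illustration. The return value follows the Lambda proxy integration format (`statusCode`, `headers`, `body`).

```python
import json

# Stand-in for the metadata catalogue; a real deployment would query an index.
CATALOGUE = [
    {"dataset": "trades", "owner": "markets", "tags": ["finance", "daily"]},
    {"dataset": "clickstream", "owner": "web", "tags": ["events", "raw"]},
]


def handler(event, context):
    """Handle GET /search?tag=... behind a Lambda proxy integration."""
    # API Gateway sends None here when the request has no query string.
    params = event.get("queryStringParameters") or {}
    tag = params.get("tag")
    if not tag:
        return {
            "statusCode": 400,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"error": "missing 'tag' query parameter"}),
        }
    matches = [entry for entry in CATALOGUE if tag in entry["tags"]]
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"results": matches}),
    }
```

    Authorization (for example SigV4) and throttling are configured on the API Gateway side, so the function itself stays small.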

  • Data Integration Partners

    Reduce the effort to move, cleanse, synchronize, manage, and automate data-related processes.

    https://aws.amazon.com/big-data/partner-solutions/

  • Putting it all together

  • Building a Data Lake on AWS

    [Architecture diagram with numbered steps 1 to 10; legible labels: Kinesis Firehose, Athena Query Service, Glue Batch]

  • Processing Data for Analytics on your data lake

  • Processing & Analytics

    Categories: Real-time; Batch; AI & Predictive; BI & Data Visualization; Transactional & RDBMS

    Services shown:
      AWS Lambda
      Apache Storm on EMR
      Apache Flink on EMR
      Spark Streaming on EMR
      Elasticsearch Service
      Kinesis Analytics, Kinesis Streams
      DynamoDB (NoSQL DB)
      Aurora (Relational Database)
      EMR (Hadoop, Spark, Presto)
      Redshift (Data Warehouse)
      Athena (Query Service)
      Amazon Lex (Speech recognition)
      Amazon Rekognition
      Amazon Polly (Text to speech)
      Machine Learning (Predictive analytics)
      Kinesis Streams & Firehose

  • Important considerations

  • Data Temperature

    Attribute | Hot data | Warm data | Cold data
    Volume | MB-GB | GB-TB | PB-EB
    Item size | B-KB | KB-MB | KB-TB
    Latency | ms | ms, sec | min, hrs
    Durability | Low-high | High | Very high
    Request rate | Very high | High | Low
    Cost/GB | $$-$ | $- |

  • Which Stream/Message Storage Should I Use?

    Attribute | Amazon DynamoDB Streams | Amazon Kinesis Streams | Amazon Kinesis Firehose | Apache Kafka | Amazon SQS (Standard) | Amazon SQS (FIFO)
    AWS managed | Yes | Yes | Yes | No | Yes | Yes
    Guaranteed ordering | Yes | Yes | No | Yes | No | Yes
    Delivery (deduping) | Exactly-once | At-least-once | At-least-once | At-least-once | At-least-once | Exactly-once
    Data retention period | 24 hours | 7 days | N/A | Configurable | 14 days | 14 days
    Availability | 3 AZ | 3 AZ | 3 AZ | Configurable | 3 AZ | 3 AZ
    Scale/throughput | No limit / ~ table IOPS | No limit / ~ shards | No limit / automatic | No limit / ~ nodes | No limits / automatic | 300 TPS / queue
    Parallel consumption | Yes | Yes | No | Yes | No | No
    Stream MapReduce | Yes | Yes | N/A | Yes | N/A | N/A
    Row/object size | 400 KB | 1 MB | Destination row/object size | Configurable | 256 KB | 256 KB
    Cost | Higher (table cost) | Low | Low | Low (+admin) | Low-medium | Low-medium
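
    One way to use this comparison is to encode a few rows and filter on the properties a workload needs. The sketch below transcribes only the "AWS managed", "Guaranteed ordering", and "Delivery" rows as given on the slide; the `candidates` helper is hypothetical.

```python
# Three rows of the comparison, transcribed as (managed, ordered, delivery).
SERVICES = {
    "Amazon DynamoDB Streams": (True, True, "exactly-once"),
    "Amazon Kinesis Streams": (True, True, "at-least-once"),
    "Amazon Kinesis Firehose": (True, False, "at-least-once"),
    "Apache Kafka": (False, True, "at-least-once"),
    "Amazon SQS (Standard)": (True, False, "at-least-once"),
    "Amazon SQS (FIFO)": (True, True, "exactly-once"),
}


def candidates(managed=None, ordered=None, delivery=None):
    """Return services matching every requirement that is not None."""
    result = []
    for name, (is_managed, is_ordered, mode) in SERVICES.items():
        if managed is not None and is_managed != managed:
            continue
        if ordered is not None and is_ordered != ordered:
            continue
        if delivery is not None and mode != delivery:
            continue
        result.append(name)
    return result


ordered_managed = candidates(managed=True, ordered=True)
```

    Remaining rows such as retention, throughput, and cost would narrow the shortlist further.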

  • Analytics Types & Frameworks (PROCESS / ANALYZE)

    Batch: takes minutes to hours. Example: daily/weekly/monthly reports. Amazon EMR (MapReduce, Hive, Pig, Spark)

    Interactive: takes seconds. Example: self-service dashboards. Amazon Redshift, Amazon Athena, Amazon EMR (Presto, Spark). Subsecond: ElastiCache (Redis 3.2 TiB, Memcached), SAP HANA

    Message: takes milliseconds to seconds. Example: message processing. Amazon SQS applications on Amazon EC2

    Stream: takes milliseconds to seconds. Example: fraud alerts, 1-minute metrics. Amazon EMR (Spark Streaming), Amazon Kinesis Analytics, KCL, Storm, AWS Lambda

    Artificial Intelligence: takes milliseconds to minutes. Example: fraud detection, forecast demand, text to speech. Amazon AI (Lex, Polly, ML, Rekognition), Amazon EMR (Spark ML), Deep Learning AMI (MXNet, TensorFlow, Theano, Torch, CNTK, and Caffe)

  • Which Analysis Tool Should I Use?

    Attribute | Amazon Redshift | Amazon Athena | Amazon EMR (Presto, Spark, Hive)
    Use case | Optimized for data warehousing | Ad-hoc interactive queries | Presto: interactive query; Spark: general purpose (iterative ML, RT, ..); Hive: batch
    Scale/throughput | ~Nodes | Automatic / no limits | ~Nodes
    AWS managed service | Yes | Yes, serverless | Yes
    Storage | Local storage | Amazon S3 | Amazon S3, HDFS
    Optimization | Columnar storage, data compression, and zone maps | CSV, TSV, JSON, Parquet, ORC, Apache Web log | Framework dependent
    Metadata | Amazon Redshift managed | Athena Catalog Manager | Hive Meta-store
    BI tools support | Yes (JDBC/ODBC) | Yes (JDBC) | Yes (JDBC/ODBC & Custom)
    Access controls | Users, groups, and access controls | AWS IAM | Integration with LDAP
    UDF support | Yes (Scalar) | No | Yes
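
    To make the Athena column concrete (serverless, ad-hoc queries over data in Amazon S3), here is a hedged sketch of submitting a query. The helper names, the database, the table, and the results location are placeholders; `start_query_execution` and its three arguments are part of the boto3 Athena client.

```python
def athena_query_args(sql, database, output_location):
    """Build keyword arguments for Athena's StartQueryExecution call."""
    if not output_location.startswith("s3://"):
        raise ValueError("Athena writes results to an S3 location")
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(args):
    """Submit the query (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above runs without it

    response = boto3.client("athena").start_query_execution(**args)
    return response["QueryExecutionId"]


query_args = athena_query_args(
    "SELECT COUNT(*) FROM events",  # hypothetical table
    "example_lake_db",  # hypothetical database
    "s3://example-data-lake/athena-results/",  # hypothetical results prefix
)
```

    The call returns immediately with an execution ID; results are fetched separately once the query has finished.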

  • Case Study

  • For our market surveillance systems, we are looking at about 40% [savings with AWS], but the real benefits are the b