Data Lake Best Practices
Agenda
• Why Data Lake
• Key Components of a Data Lake
• Modern Data Architecture
• Some Best Practices
• Case Study
• Summary
• Takeaways
What is a Data Lake?
What, why, etc.
What is a data lake?
• It is an architecture that allows you to collect, store, process, analyze and consume all data that flows into your organization.
Why data lake?
• Leverage all data that flows into your organization
• Customer centricity
• Business agility
• Better predictions via Machine Learning
• Competitive advantage
Comparison of a Data Lake to an Enterprise Data Warehouse
Data Lake (EMR, S3) | Enterprise DW
Complementary to EDW (not replacement) | Data lake can be source for EDW
Schema on read (no predefined schemas) | Schema on write (predefined schemas)
Structured/semi-structured/unstructured data | Structured data only
Fast ingestion of new data/content | Time consuming to introduce new content
Data Science + Prediction/Advanced Analytics + BI use cases | BI use cases only (no prediction/advanced analytics)
Data at low level of detail/granularity | Data at summary/aggregated level of detail
Loosely defined SLAs | Tight SLAs (production schedules)
Flexibility in tools (open source/tools for advanced analytics) | Limited flexibility in tools (SQL only)
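The "schema on read" entry in the comparison can be illustrated with a minimal Python sketch. The record layout and the field names (`user`, `amount`, `channel`) are hypothetical and not taken from the slides; the point is only that raw records are stored untouched and a schema is applied by the reader.

```python
import json

# Raw events land in the lake exactly as received; nothing is enforced on write.
raw_lines = [
    '{"user": "a1", "amount": "12.50", "channel": "web"}',
    '{"user": "b2", "amount": "3"}',             # a missing field is accepted
    '{"user": "c3", "amount": "n/a", "x": 1}',   # a bad value is accepted too
]

# The schema is declared by the reader: field name -> (converter, default).
purchase_schema = {
    "user": (str, None),
    "amount": (float, None),
    "channel": (str, "unknown"),
}

def read_with_schema(lines, schema):
    """Parse raw JSON lines and project each record onto the reader's schema."""
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, (convert, default) in schema.items():
            try:
                row[field] = convert(record[field])
            except (KeyError, ValueError, TypeError):
                # Missing or unconvertible values fall back to the default.
                row[field] = default
        yield row

rows = list(read_with_schema(raw_lines, purchase_schema))
```

A different consumer could read the same `raw_lines` with a different schema, which is what makes introducing new content fast compared with a schema-on-write warehouse.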
Key Concepts Associated with a Data Lake
[Figure: one STORAGE block and several COMPUTE blocks]
Components of a Data Lake
Data Storage
• High durability
• Stores raw data from input sources
• Support for any type of data
• Low cost
Streaming
• Streaming ingest of feed data
• Provides the ability to consume any dataset as a stream
• Facilitates low latency analytics
Storage & Streams
Catalogue & Search
Entitlements
API & UI
Components of a Data Lake
Catalogue
• Metadata lake
• Used for summary statistics and data classification management
Search
• Simplified access model for data discovery
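As a rough illustration of the catalogue and search roles, the sketch below keeps summary statistics and a classification label per dataset and offers a simple discovery function. It is an in-memory toy under assumed names (`catalogue_entry`, `search`, the dataset names and labels); a real catalogue would sit on a metadata store and a search index.

```python
from collections import Counter

def catalogue_entry(name, rows, classification):
    """Build a metadata record holding summary statistics for one dataset."""
    columns = Counter()
    for row in rows:
        columns.update(row.keys())
    return {
        "name": name,
        "classification": classification,   # hypothetical labels, e.g. "internal"
        "row_count": len(rows),
        "columns": dict(columns),            # column -> number of rows that carry it
    }

def search(catalogue, term):
    """Return names of datasets whose name or column names contain the term."""
    term = term.lower()
    return sorted(
        entry["name"]
        for entry in catalogue
        if term in entry["name"].lower()
        or any(term in col.lower() for col in entry["columns"])
    )

catalogue = [
    catalogue_entry("trades_2017", [{"symbol": "X", "price": 1.0}, {"symbol": "Y"}], "confidential"),
    catalogue_entry("web_clicks", [{"url": "/", "user": "a1"}], "internal"),
]
matches = search(catalogue, "price")
```

Here `matches` contains only `trades_2017`, because that is the only dataset with a `price` column.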
Components of a Data Lake
Entitlements system
• Encryption
• Authentication
• Authorisation
• Chargeback
• Quotas
• Data masking
• Regional restrictions
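Two of the entitlement functions listed, authorisation and data masking, can be sketched together. The roles, dataset names, and the `ENTITLEMENTS` structure are hypothetical and serve only to show the order of operations: check the grant first, then mask what the caller may not see.

```python
# Hypothetical entitlements: role -> readable datasets and columns shown in clear.
ENTITLEMENTS = {
    "analyst": {"datasets": {"trades"}, "clear_columns": {"symbol", "price"}},
    "auditor": {"datasets": {"trades", "accounts"},
                "clear_columns": {"symbol", "price", "account_id"}},
}

def read_rows(role, dataset, rows):
    """Authorise the request, then mask every column the role may not see."""
    grant = ENTITLEMENTS.get(role)
    if grant is None or dataset not in grant["datasets"]:
        raise PermissionError(f"{role!r} is not entitled to read {dataset!r}")
    return [
        {col: (val if col in grant["clear_columns"] else "***") for col, val in row.items()}
        for row in rows
    ]

trades = [{"symbol": "X", "price": 10.0, "account_id": "ACC-1"}]
masked = read_rows("analyst", "trades", trades)
```

The analyst receives the row with `account_id` replaced by `***`, while a request for the `accounts` dataset from the same role raises `PermissionError`.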
Components of a Data Lake
API & User Interface
• Exposes the data lake to customers
• Programmatically query catalogue
• Expose search API
• Ensures that entitlements are respected
The Modern Data Architecture
Why Is Amazon S3 the Fabric of Data Lake?
• Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
• Decouple storage and compute
• No need to run compute clusters for storage (unlike HDFS)
• Can run transient Hadoop clusters & Amazon EC2 Spot Instances
• Multiple & heterogeneous analysis clusters can use the same data
• Virtually unlimited number of objects and volume of data
• Very high bandwidth – no aggregate throughput limit
• Designed for 99.99% availability – can tolerate zone failure
• Designed for 99.999999999% durability
• No need to pay for data replication
• Native support for versioning
• Tiered-storage (Standard, IA, Amazon Glacier) via life-cycle policies
• Use HDFS for very frequently accessed (hot) data
• Secure – SSL, client/server-side encryption at rest
• Low cost
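The tiered-storage point can be made concrete with a lifecycle configuration. The sketch below only builds the configuration dictionary in the shape accepted by boto3's `put_bucket_lifecycle_configuration`; the prefix, the 30- and 90-day thresholds, and the bucket name in the comment are assumptions chosen for illustration.

```python
import json

def tiering_rule(prefix, ia_after_days, glacier_after_days):
    """Build one lifecycle rule that moves objects to colder storage classes over time."""
    if not 0 < ia_after_days < glacier_after_days:
        raise ValueError("transitions must move to colder tiers at increasing ages")
    return {
        "ID": f"tier-{prefix.strip('/') or 'all'}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
    }

lifecycle = {"Rules": [tiering_rule("raw/", 30, 90)]}
print(json.dumps(lifecycle, indent=2))

# With boto3 installed and credentials configured, this could be applied with:
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-data-lake",            # hypothetical bucket name
#     LifecycleConfiguration=lifecycle,
# )
```

Keeping the rule per prefix lets raw landing data age out to colder tiers while curated, frequently queried data stays in the Standard class.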
Indexing and Searching using Metadata
[Diagram: ObjectCreated and ObjectDeleted events trigger AWS Lambda, which calls PutItem on the Metadata Index (DynamoDB); the table's update stream triggers a second AWS Lambda, which extracts search fields and updates the Search Index (Amazon Elasticsearch Service or Amazon CloudSearch)]
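A minimal sketch of the first Lambda step, mirroring S3 object events into a metadata index, might look as follows. The item attributes (`bucket`, `key`, `size`, `event_time`) and the `InMemoryTable` stand-in are assumptions so that the code runs without AWS access; its `put_item` and `delete_item` methods imitate the boto3 DynamoDB table calls of the same names. Note that S3 reports deletions with event names beginning `ObjectRemoved`, which is what the diagram labels ObjectDeleted.

```python
from urllib.parse import unquote_plus

class InMemoryTable:
    """Stand-in for a DynamoDB table so the sketch runs without AWS access."""
    def __init__(self):
        self.items = {}

    def put_item(self, Item):
        self.items[(Item["bucket"], Item["key"])] = Item

    def delete_item(self, Key):
        self.items.pop((Key["bucket"], Key["key"]), None)

def index_s3_event(event, table):
    """Mirror S3 ObjectCreated / ObjectRemoved notifications into the metadata index."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        if record["eventName"].startswith("ObjectCreated"):
            table.put_item(Item={
                "bucket": bucket,
                "key": key,
                "size": record["s3"]["object"].get("size", 0),
                "event_time": record.get("eventTime"),
            })
        elif record["eventName"].startswith("ObjectRemoved"):
            table.delete_item(Key={"bucket": bucket, "key": key})

table = InMemoryTable()
created = {"Records": [{
    "eventName": "ObjectCreated:Put",
    "eventTime": "2017-03-01T00:00:00.000Z",
    "s3": {"bucket": {"name": "lake"}, "object": {"key": "raw/my+file.csv", "size": 42}},
}]}
index_s3_event(created, table)
```

In a deployed function, `table` would be a real DynamoDB table resource and `index_s3_event` would be called from the Lambda handler; the second Lambda in the diagram would then read the table's stream and update the search index.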
Identity & Access Management
• Manage users, groups, and roles
• Identity federation with OpenID
• Temporary credentials with Amazon Security Token Service (Amazon STS)
• Stored policy templates
• Powerful policy language
• Amazon S3 bucket policies
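To show what the policy language and an Amazon S3 bucket policy look like in practice, here is a small sketch that generates a read-only policy for one prefix. The bucket name, prefix, account ID, and role name are placeholders, and the policy is an illustration rather than a recommended baseline.

```python
import json

def read_only_prefix_policy(bucket, prefix, role_arn):
    """Bucket policy letting one IAM role list and read objects under a single prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListPrefix",
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                "Sid": "ReadPrefix",
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }

policy = read_only_prefix_policy(
    "example-data-lake",                              # hypothetical bucket
    "curated/",
    "arn:aws:iam::123456789012:role/AnalystRole",     # hypothetical role
)
policy_json = json.dumps(policy, indent=2)
```

Generating policies from a function like this is one way to realise the "stored policy templates" idea: the template is fixed and only the bucket, prefix, and principal vary.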
Data Encryption
AWS CloudHSM
• Dedicated Tenancy SafeNet Luna SA HSM Device
• Common Criteria EAL4+, NIST FIPS 140-2
AWS Key Management Service
• Automated key rotation & auditing
• Integration with other AWS services
AWS server side encryption
• AWS managed key infrastructure
Data Lake API & UI
• Exposes the Metadata API, search, and Amazon S3 storage services to customers
• Can be based on TVM/STS Temporary Access for many services, and a bespoke API for Metadata
• Drive all UI operations from API?
Introducing Amazon API Gateway
• Host multiple versions and stages of APIs
• Create and distribute API keys to developers
• Leverage AWS Sigv4 to authorize access to APIs
• Throttle and monitor requests to protect the backend
• Leverages AWS Lambda
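For readers unfamiliar with SigV4, the sketch below shows only the signing-key derivation and the final HMAC step, using the Python standard library. It deliberately omits building the canonical request and the string to sign, and the secret key shown is a dummy value. `execute-api` is the service name used when signing calls to API Gateway endpoints.

```python
import hashlib
import hmac

def _hmac_sha256(key: bytes, message: str) -> bytes:
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()

def sigv4_signing_key(secret_key: str, date: str, region: str, service: str) -> bytes:
    """Derive the SigV4 signing key; date is YYYYMMDD in UTC."""
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")

def sign(signing_key: bytes, string_to_sign: str) -> str:
    """Final signature: hex-encoded HMAC-SHA256 of the string to sign."""
    return hmac.new(signing_key, string_to_sign.encode("utf-8"), hashlib.sha256).hexdigest()

# Dummy secret for illustration only.
key = sigv4_signing_key("EXAMPLE-SECRET", "20170301", "us-east-1", "execute-api")
```

Because the key is scoped to a date, region, and service, a leaked signature cannot be replayed against another service or on another day. In practice an AWS SDK performs all of these steps.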
https://aws.amazon.com/big-data/partner-solutions/
Data Integration Partners
Reduce the effort to move, cleanse, synchronize, manage, and automate data-related processes.
Putting it all together
Building a Data Lake on AWS
[Architecture diagram with numbered steps 1–10; labeled services include Kinesis Firehose, Athena Query Service, Glue, and Batch]
Processing Data for Analytics on your data lake
Processing & Analytics
Categories: Real-time; Batch; AI & Predictive; BI & Data Visualization; Transactional & RDBMS
Services shown:
• AWS Lambda
• Apache Storm on EMR
• Apache Flink on EMR
• Spark Streaming on EMR
• Elasticsearch Service
• Kinesis Analytics, Kinesis Streams
• DynamoDB (NoSQL DB)
• Aurora (Relational Database)
• EMR (Hadoop, Spark, Presto)
• Redshift (Data Warehouse)
• Athena (Query Service)
• Amazon Lex (Speech recognition)
• Amazon Rekognition
• Amazon Polly (Text to speech)
• Machine Learning (Predictive analytics)
• Kinesis Streams & Firehose
Important considerations
Data Temperature
Property | Hot | Warm | Cold
Volume | MB–GB | GB–TB | PB–EB
Item size | B–KB | KB–MB | KB–TB
Latency | ms | ms, sec | min, hrs
Durability | Low–high | High | Very high
Request rate | Very high | High | Low
Cost/GB | $$-$ | $-¢¢ | ¢
Which Stream/Message Storage Should I Use?
Property | Amazon DynamoDB Streams | Amazon Kinesis Streams | Amazon Kinesis Firehose | Apache Kafka | Amazon SQS (Standard) | Amazon SQS (FIFO)
AWS managed | Yes | Yes | Yes | No | Yes | Yes
Guaranteed ordering | Yes | Yes | No | Yes | No | Yes
Delivery (deduping) | Exactly-once | At-least-once | At-least-once | At-least-once | At-least-once | Exactly-once
Data retention period | 24 hours | 7 days | N/A | Configurable | 14 days | 14 days
Availability | 3 AZ | 3 AZ | 3 AZ | Configurable | 3 AZ | 3 AZ
Scale/throughput | No limit / ~table IOPS | No limit / ~shards | No limit / automatic | No limit / ~nodes | No limits / automatic | 300 TPS/queue
Parallel consumption | Yes | Yes | No | Yes | No | No
Stream MapReduce | Yes | Yes | N/A | Yes | N/A | N/A
Row/object size | 400 KB | 1 MB | Destination row/object size | Configurable | 256 KB | 256 KB
Cost | Higher (table cost) | Low | Low | Low (+admin) | Low-medium | Low-medium
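One way to use the comparison is to treat it as data and filter it by requirements. The sketch below transcribes three of its rows (AWS managed, guaranteed ordering, parallel consumption); the property names and the `candidates` helper are made up for this example.

```python
# Three rows of the comparison, transcribed as data.
STREAM_STORES = {
    "Amazon DynamoDB Streams": {"aws_managed": True,  "ordered": True,  "parallel": True},
    "Amazon Kinesis Streams":  {"aws_managed": True,  "ordered": True,  "parallel": True},
    "Amazon Kinesis Firehose": {"aws_managed": True,  "ordered": False, "parallel": False},
    "Apache Kafka":            {"aws_managed": False, "ordered": True,  "parallel": True},
    "Amazon SQS (Standard)":   {"aws_managed": True,  "ordered": False, "parallel": False},
    "Amazon SQS (FIFO)":       {"aws_managed": True,  "ordered": True,  "parallel": False},
}

def candidates(**required):
    """Return the stores whose entries match every required property."""
    return [
        name
        for name, props in STREAM_STORES.items()
        if all(props[key] == value for key, value in required.items())
    ]

shortlist = candidates(aws_managed=True, ordered=True, parallel=True)
```

With these three requirements the shortlist is Amazon DynamoDB Streams and Amazon Kinesis Streams; the remaining rows (retention, item size, cost) would then decide between them.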
Analytics Types & Frameworks (PROCESS / ANALYZE)
Batch
• Takes minutes to hours
• Example: Daily/weekly/monthly reports
• Amazon EMR (MapReduce, Hive, Pig, Spark)
Interactive
• Takes seconds
• Example: Self-service dashboards
• Amazon Redshift, Amazon Athena, Amazon EMR (Presto, Spark)
• Subsecond: ElastiCache (Redis 3.2TiB, MemCache), SAP Hana
Message
• Takes milliseconds to seconds
• Example: Message processing
• Amazon SQS applications on Amazon EC2
Stream
• Takes milliseconds to seconds
• Example: Fraud alerts, 1 minute metrics
• Amazon EMR (Spark Streaming), Amazon Kinesis Analytics, KCL, Storm, AWS Lambda
Artificial Intelligence
• Takes milliseconds to minutes
• Example: Fraud detection, forecast demand, text to speech
• Amazon AI (Lex, Polly, ML, Rekognition), Amazon EMR (Spark ML), Deep Learning AMI (MXNet, TensorFlow, Theano, Torch, CNTK and Caffe)
[Diagram: the frameworks above arranged on a fast-to-slow scale under Message, Stream, Interactive, Batch, and AI]
Which Analysis Tool Should I Use?
Property | Amazon Redshift | Amazon Athena | Amazon EMR (Presto, Spark, Hive)
Use case | Optimized for data warehousing | Ad-hoc interactive queries | Presto: interactive query; Spark: general purpose (iterative ML, RT, ..); Hive: batch
Scale/throughput | ~Nodes | Automatic / No limits | ~Nodes
AWS managed service | Yes | Yes, serverless | Yes
Storage | Local storage | Amazon S3 | Amazon S3, HDFS
Optimization | Columnar storage, data compression, and zone maps | CSV, TSV, JSON, Parquet, ORC, Apache Web log | Framework dependent
Metadata | Amazon Redshift managed | Athena Catalog Manager | Hive Meta-store
BI tools supports | Yes (JDBC/ODBC) | Yes (JDBC) | Yes (JDBC/ODBC & Custom)
Access controls | Users, groups, and access controls | AWS IAM | Integration with LDAP
UDF support | Yes (Scalar) | No | Yes
Case Study
“For our market surveillance systems, we are looking at about 40% [savings with AWS], but the real benefits are the business benefits: We can do things that we physically weren’t able to do before, and that is priceless.”
- Steve Randich, CIO
Case Study: Re-architecting Compliance
What FINRA needed
• Infrastructure for its market surveillance platform
• Support of analysis and storage of approximately 75 billion market events every day
Why they chose AWS
• Fulfillment of FINRA’s security requirements
• Ability to create a flexible platform using dynamic clusters (Hadoop, Hive, and HBase), Amazon EMR, and Amazon S3
Benefits realized
• Increased agility, speed, and cost savings
• Estimated savings of $10-20m annually by using AWS
Fraud Detection
FINRA uses Amazon EMR and Amazon S3 to process up to 75 billion trading events per day and securely store over 5 petabytes of data, attaining savings of $10-20mm per year.
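To put the daily volume in perspective, a quick back-of-the-envelope calculation converts it to an average per-second rate. It assumes the load is spread evenly over 24 hours, which real market traffic is not, so peak rates would be higher.

```python
EVENTS_PER_DAY = 75_000_000_000   # "up to 75 billion" trading events per day
SECONDS_PER_DAY = 24 * 60 * 60

avg_events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY
print(f"{avg_events_per_second:,.0f} events/s on average")   # prints "868,056 events/s on average"
```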
Summary
• AWS enables you to build sophisticated data lakes and related analytics applications
• Retrospective, Real-time, Predictive
• You can build incrementally, adding use cases and increasing scale as you go
• AWS provides a broad range of security and auditing features to enable you to meet your security requirements
https://aws.amazon.com/big-data/
Takeaways
• Prescriptive guidance and rapidly deployable solutions to help you store, analyze, and process big data on the AWS Cloud
• Derive Insights from IoT in Minutes using AWS IoT, Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight
• Deploying a Data Lake on AWS - March 2017 AWS Online Tech Talks
• Harmonize, Search, and Analyze Loosely Coupled Datasets on AWS
• Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly Webinar Series - YouTube
http://bit.ly/2qiElYx
http://amzn.to/2mzGppL
http://bit.ly/2qipA8h
http://amzn.to/2qpiFaK
http://amzn.to/2lpbc8p