Deep Dive on Big Data

© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

John Yeung, Solutions Architect

31 October 2017

Deep Dive on AWS with Demo
AWS Big Data and Machine Learning Day | Hong Kong


What to expect from the session

• Big Data Challenges
• Architectural Principles
• Design Patterns
• Demo (around 15 mins)


Ever-Increasing Big Data

Volume

Velocity

Variety


Big Data Evolution

Batch Processing

Stream Processing

Machine Learning


Plenty of Tools

Amazon Glacier
S3
DynamoDB
RDS
EMR
Amazon Redshift
Data Pipeline
Amazon Kinesis
Amazon Kinesis Streams app
Lambda
Amazon ML
SQS
ElastiCache
DynamoDB Streams
Amazon Elasticsearch Service
Amazon Kinesis Analytics


Big Data Challenges

Why?

How?

What tools should I use?

Is there a reference architecture?


Architectural Principles


Architecture Principles

#1: Build Decoupled Systems
• Data → Store → Process → Store → Analyze → Answers

#2: Use Right Tool for the Job
• Data structure, latency, throughput, access patterns

#3: Leverage AWS Managed Services
• Scalable/elastic, available, reliable, secure, no/low admin

#4: Use Lambda Architecture Ideas
• Immutable (append-only) log, batch/speed/serving layer

#5: Be Cost-conscious
• Big data ≠ big cost
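Principle #4 can be illustrated with a minimal sketch. It assumes a toy in-memory log and a simple per-key count as the "view"; the class name, method names, and event shape are hypothetical and are not taken from any AWS service or from the original deck.

```python
from collections import Counter


class LambdaSketch:
    """Toy model of the batch / speed / serving layers over an append-only log."""

    def __init__(self):
        self.log = []                 # immutable master log: events are only appended
        self.batch_view = Counter()   # view recomputed from the full log
        self.batch_upto = 0           # number of log entries covered by batch_view

    def append(self, event):
        # Hypothetical event shape: (timestamp_seconds, key)
        self.log.append(event)

    def run_batch(self):
        # Batch layer: rebuild the view from the entire log.
        self.batch_view = Counter(key for _, key in self.log)
        self.batch_upto = len(self.log)

    def speed_view(self):
        # Speed layer: covers only events that the last batch run has not seen.
        return Counter(key for _, key in self.log[self.batch_upto:])

    def query(self, key):
        # Serving layer: merge the batch view with the speed view.
        return self.batch_view[key] + self.speed_view()[key]


sketch = LambdaSketch()
sketch.append((1, "a"))
sketch.append((2, "b"))
sketch.run_batch()          # batch view now covers the first two events
sketch.append((3, "a"))     # arrives after the batch run, served by the speed layer
print(sketch.query("a"))
```

Because the log is never modified, the batch view can always be rebuilt from scratch, while the speed layer only has to cover the short gap since the last batch run.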


Simplify Big Data Processing

COLLECT → STORE → PROCESS/ANALYZE → CONSUME

1. Time to answer (Latency)
2. Throughput
3. Cost


COLLECT


Types of Data (COLLECT)

• Applications (mobile apps, web apps, data centers via AWS Direct Connect) → RECORDS: in-memory data, database records
• Logging (Amazon CloudWatch, AWS CloudTrail) → DOCUMENTS, FILES: search documents, log files
• Transport (AWS Import/Export Snowball) → FILES
• Messaging → MESSAGES: messages
• IoT (devices, sensors & IoT platforms, AWS IoT) → STREAMS: data streams

Data arrives in three forms: transaction-based, file-based, and event-based.


Store


Types of Data Stores (STORE)

• Database: SQL & NoSQL databases
• Search: search engines
• File store: file systems
• Queue: message queues
• Stream storage: pub/sub message queues
• In-memory: caches


In-memory, Database, Search

Diagram: COLLECT → STORE
• RECORDS → Database, In-memory
• DOCUMENTS, FILES → Search
• File: Amazon S3
• Message: Amazon SQS
• Stream (hot): Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon DynamoDB Streams


Diagram: COLLECT → STORE
• Search: Amazon Elasticsearch Service
• SQL: Amazon RDS
• NoSQL: Amazon DynamoDB
• Cache: Amazon ElastiCache
• File: Amazon S3
• Message: Amazon SQS
• Stream (hot): Apache Kafka

Amazon ElastiCache
• Managed Memcached or Redis service

Amazon DynamoDB
• Managed NoSQL database service

Amazon RDS
• Managed relational database service

Amazon Elasticsearch Service
• Managed Elasticsearch service


Use the Right Tool for the Job

Data Tier

• Search
• In-memory: Amazon ElastiCache (Redis, Memcached)
• SQL: Amazon Aurora, Amazon RDS (MySQL, PostgreSQL, Oracle, SQL Server)
• NoSQL: Amazon DynamoDB, Cassandra, HBase, MongoDB


File Storage

Diagram: COLLECT → STORE, showing the file store (Amazon S3) alongside the in-memory, database, and search stores, the message store (Amazon SQS), and the hot stream store (Apache Kafka).


Why Is Amazon S3 Good for Big Data?

• Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
• Multiple & heterogeneous analysis clusters can use the same data
• Unlimited number of objects and volume of data
• Very high bandwidth – no aggregate throughput limit
• Designed for 99.99% availability – can tolerate zone failure
• Designed for 99.999999999% durability
• No need to pay for data replication
• Native support for versioning
• Tiered-storage (Standard, IA, Amazon Glacier) via life-cycle policies
• Secure – SSL, client/server-side encryption at rest
• Low cost
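The tiered-storage point can be made concrete with a small sketch of a life-cycle rule. It only builds the configuration as a Python dictionary, in the shape the S3 life-cycle API accepts; the helper name, the "logs/" prefix, the bucket name in the comment, and the 30/90-day thresholds are assumptions chosen for illustration.

```python
import json


def lifecycle_rules(prefix, ia_after_days=30, glacier_after_days=90):
    """Build a life-cycle configuration that tiers objects from Standard to IA to Glacier."""
    if glacier_after_days <= ia_after_days:
        raise ValueError("Glacier transition must come after the IA transition")
    return {
        "Rules": [
            {
                # Rule ID derived from the prefix, e.g. "logs/" becomes "tier-logs"
                "ID": "tier-" + prefix.strip("/").replace("/", "-"),
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


config = lifecycle_rules("logs/")
print(json.dumps(config, indent=2))

# With boto3 installed and credentials configured, the rule could be applied with:
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-bucket", LifecycleConfiguration=config)
```

Keeping the rule as plain data makes it easy to review the thresholds before applying them to a real bucket.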


Message & Stream Storage

Amazon SQS
• Managed message queue service

Apache Kafka
• High throughput distributed streaming platform

Amazon Kinesis Streams
• Managed stream storage + processing

Amazon Kinesis Firehose
• Managed data delivery

Amazon DynamoDB
• Managed NoSQL database
• Tables can be stream-enabled


Why Stream Storage

Decouple producers & consumers

Persistent buffer

Collect multiple streams

Preserve client ordering

Parallel consumption

Figure: records numbered 1 to 4 in four colors are written to a DynamoDB stream, Amazon Kinesis stream, or Kafka topic with two shards/partitions (shard 1 / partition 1 and shard 2 / partition 2). Consumer 1 reports count of red = 4 and count of violet = 4; Consumer 2 reports count of blue = 4 and count of green = 4.
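The ordering and parallel-consumption points can be sketched in a few lines. This is a simplified in-memory model, not the behavior of any specific service: the modulo-of-hash shard assignment, the function names, and the two-shard setup are assumptions made for illustration.

```python
import hashlib
from collections import Counter, defaultdict

NUM_SHARDS = 2  # assumption: two shards/partitions, as in the figure


def shard_for(partition_key, num_shards=NUM_SHARDS):
    # A stable hash sends every record with the same key to the same shard,
    # so records for one key keep their relative order.
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def produce(records, num_shards=NUM_SHARDS):
    # Append each (key, sequence) record to the shard chosen by its key.
    shards = defaultdict(list)
    for key, seq in records:
        shards[shard_for(key, num_shards)].append((key, seq))
    return shards


def consume(shard_records):
    # Each consumer reads one shard independently and counts records per key.
    return Counter(key for key, _ in shard_records)


colors = ("red", "violet", "blue", "green")
records = [(color, n) for n in range(1, 5) for color in colors]
shards = produce(records)
counts = {shard_id: consume(recs) for shard_id, recs in shards.items()}
print(counts)
```

Which colors land on which shard depends on the hash, but each color always ends up on exactly one shard, so per-color counts are complete and per-color order is preserved.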


What Stream Storage Should I Use?

Attribute | Amazon DynamoDB Streams | Amazon Kinesis Streams | Amazon Kinesis Firehose | Apache Kafka | Amazon SQS
AWS managed service | Yes | Yes | Yes | No | Yes
Guaranteed ordering | Yes | Yes | Yes | Yes | No
Delivery | exactly-once | at-least-once | exactly-once | at-least-once | at-least-once
Data retention period | 24 hours | 7 days | N/A | Configurable | 14 days
Availability | 3 AZ | 3 AZ | 3 AZ | Configurable | 3 AZ
Scale / throughput | No limit / ~ table IOPS | No limit / ~ shards | No limit / automatic | No limit / ~ nodes | No limits / automatic
Parallel clients | Yes | Yes | No | Yes | No
Stream MapReduce | Yes | Yes | N/A | Yes | N/A
Record/object size | 400 KB | 1 MB | Redshift row size | Configurable | 256 KB
Cost | Higher (table cost) | Low | Low | Low (+admin) | Low-medium

Temperature scale across the columns: Hot → Warm


Which Data Store Should I Use

Data Structure → Fixed schema, JSON, key-value

Access Patterns → Store data in the format you will access it

Data Characteristics → Hot, Warm, Cold

Cost → Right cost


Data Structure and Access Patterns

Access Patterns | What to use?
Put/Get (key, value) | In-memory, NoSQL
Simple relationships → 1:N, M:N | NoSQL
Multi-table joins, transaction, SQL | SQL
Faceting, search | Search

Data Structure | What to use?
Fixed schema | SQL, NoSQL
Schema-free (JSON) | NoSQL, Search
(Key, value) | In-memory, NoSQL
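The two lookup tables can be combined into a simple selection helper. This is only a sketch: the dictionary keys are shortened labels for the table rows, and the function name and the idea of intersecting both answers are assumptions, not part of the original guidance.

```python
# Recommendations transcribed from the two tables on this slide.
BY_ACCESS_PATTERN = {
    "put/get": ["In-memory", "NoSQL"],
    "simple relationships": ["NoSQL"],
    "multi-table joins": ["SQL"],
    "faceting/search": ["Search"],
}

BY_DATA_STRUCTURE = {
    "fixed schema": ["SQL", "NoSQL"],
    "schema-free": ["NoSQL", "Search"],
    "key-value": ["In-memory", "NoSQL"],
}


def candidate_stores(access_pattern, data_structure):
    """Return store types recommended by both tables, in access-pattern order."""
    by_access = BY_ACCESS_PATTERN[access_pattern]
    by_structure = BY_DATA_STRUCTURE[data_structure]
    return [store for store in by_access if store in by_structure]


print(candidate_stores("put/get", "key-value"))
print(candidate_stores("multi-table joins", "fixed schema"))
```

An empty result signals that the access pattern and the data structure pull in different directions, which is usually a hint to store the data in more than one format.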


What is the temperature of your data


Data characteristics: Hot, Warm or Cold

Characteristic | Hot data | Warm data | Cold data
Volume | MB–GB | GB–TB | PB–EB
Item size | B–KB | KB–MB | KB–TB
Latency | ms | ms, sec | min, hrs
Durability | Low–high | High | Very high
Request rate | Very high | High | Low
Cost/GB | $$-$ | $-¢¢ | ¢


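Figure: data stores (In-memory, NoSQL, SQL, Amazon Glacier) placed along a hot-to-cold axis, with structure (low to high) on the other axis. Along the hot-to-cold axis:
• Request rate: High → Low
• Cost/GB: High → Low
• Latency: Low → High
• Data volume: Low → High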


Which Data Store Should I Use?

Attribute | Amazon ElastiCache | Amazon DynamoDB | Amazon RDS/Aurora | Amazon ES | Amazon S3 | Amazon Glacier
Average latency | ms | ms | ms, sec | ms, sec | ms, sec, min (~ size) | hrs
Typical data stored | GB | GB–TBs (no limit) | GB–TB (64 TB max) | GB–TB | MB–PB (no limit) | GB–PB (no limit)
Typical item size | B-KB | KB (400 KB max) | KB (64 KB max) | B-KB (2 GB max) | KB-TB (5 TB max) | GB (40 TB max)
Request rate | High – very high | Very high (no limit) | High | High | Low – high (no limit) | Very low
Storage cost GB/month | $$ | ¢¢ | ¢¢ | ¢¢ | ¢ | ¢4/10
Durability | Low - moderate | Very high | Very high | High | Very high | Very high
Availability | High, 2 AZ | Very high, 3 AZ | Very high, 3 AZ | High, 2 AZ | Very high, 3 AZ | Very high, 3 AZ



PROCESS / ANALYZE


Analytics & Frameworks

Interactive
• Takes seconds
• Example: Self-service dashboards
• Amazon Redshift, Amazon Athena, Amazon EMR (Presto, Spark)

Batch
• Takes minutes to hours
• Example: Daily/weekly/monthly reports
• Amazon EMR (MapReduce, Hive, Pig, Spark)

Message
• Takes milliseconds to seconds
• Example: Message processing
• Amazon SQS applications on Amazon EC2

Stream
• Takes milliseconds to seconds
• Example: Fraud alerts, 1 minute metrics
• Amazon EMR (Spark Streaming), Amazon Kinesis Analytics, KCL, Storm, AWS Lambda

Diagram: PROCESS / ANALYZE
• ML: Amazon Machine Learning
• Message: Amazon SQS apps (Amazon EC2)
• Stream: streaming, KCL apps, AWS Lambda (Amazon EC2, Amazon EMR)
• Interactive and batch: Amazon Redshift, Presto (Amazon EMR), Amazon Athena
• Speed scale: Fast to Slow


What about ETL?

ETL sits between STORE and PROCESS / ANALYZE.

Data Integration Partners
Reduce the effort to move, cleanse, synchronize, manage, and automate data-related processes.
https://aws.amazon.com/big-data/partner-solutions/

AWS Glue (New)
AWS Glue is a fully managed ETL service that makes it easy to understand your data sources, prepare the data, and move it reliably between data stores.


CONSUME


Diagram: the COLLECT, STORE, and PROCESS / ANALYZE stages from the previous slides shown together, with ETL between STORE and PROCESS / ANALYZE, leading into CONSUME.


Diagram: COLLECT → ETL → STORE → PROCESS / ANALYZE → CONSUME
• Analysis and visualization: Amazon QuickSight (business users)
• Notebooks, IDE (data scientists, developers)
• Applications & API: Apps & Services


Put them together


Diagram: the complete pipeline.
• COLLECT: mobile apps, web apps, devices, messaging, AWS IoT, AWS Direct Connect, logging (Amazon CloudWatch, AWS CloudTrail), producing RECORDS, DOCUMENTS, FILES, MESSAGES, and STREAMS
• STORE: stream (Apache Kafka), message (Amazon SQS), file (Amazon S3), NoSQL (Amazon DynamoDB), SQL (Amazon RDS), cache (Amazon ElastiCache), search
• ETL
• PROCESS / ANALYZE: stream (streaming, KCL apps, AWS Lambda), message (Amazon SQS apps on Amazon EC2), interactive and batch (Amazon Redshift, Presto on Amazon EMR, Amazon Athena, Amazon EMR), ML (Amazon Machine Learning)
• CONSUME: analysis & visualization (Amazon QuickSight), notebooks, IDE, API (Apps & Services)


Design Patterns


Concept #1: Decoupled Data Bus

• Storage decoupled from processing
• Multiple stages

Store → Process → Store → Process


Concept #2: Multiple Stream Processing

• Parallel processing

Diagram: Store (Amazon Kinesis) → Process (AWS Lambda; Amazon Kinesis Connector Library, KCL) → Store (Amazon DynamoDB, Amazon S3)


Concept #3: Multiple Data Stores

• Analysis framework reads from or writes to multiple data stores

Diagram: Store (Amazon Kinesis, Amazon S3, Amazon DynamoDB) ↔ Process (AWS Lambda; Amazon Kinesis Connector Library, KCL; Amazon EMR with Spark Streaming and Spark SQL)


Real-time Analytics Design Pattern

Diagram:
• Store (input): Amazon Kinesis, Apache Kafka, DynamoDB Streams
• Process: KCL, AWS Lambda, Spark Streaming, Apache Storm, Amazon EMR, Amazon ML (real-time prediction)
• Store (output): Amazon SNS (notifications, alerts); Amazon ElastiCache (Redis), Amazon DynamoDB, Amazon RDS, Amazon ES (app state, KPI)


Message / Event Processing Design Pattern

Diagram:
• Messages / events: Amazon SQS, Amazon SQS priority queue
• Process: Amazon SQS apps in an Auto Scaling group
• Store (output): publish to Amazon SNS subscribers; Amazon ElastiCache (Redis), Amazon DynamoDB, Amazon RDS, Amazon ES (app state, KPI)


Interactive & Batch Analytics Design Pattern

Diagram:
• Store: files in Amazon S3, delivered by Amazon Kinesis Firehose
• Process, batch mode: Amazon EMR (Hive, Pig, Spark)
• Process, interactive mode: Amazon Redshift, Amazon EMR (Presto, Spark), Amazon Athena, Amazon Kinesis Analytics
• Amazon Machine Learning: batch prediction, real-time prediction
• Consume


Demonstration
Apply what we’ve just learnt


Real-time Analytics Design Pattern

Diagram: Apache Web Server (Availability Zone #1) → Amazon Kinesis Firehose → Amazon S3 bucket; a second Amazon Kinesis Firehose delivery stream → Amazon Elasticsearch Service → Kibana


Amazon Elastic Compute Cloud (Amazon EC2)

Amazon EC2 provides virtual machines (VMs), known as instances, to run your web application on the platform you choose. It allows you to configure and scale your compute capacity easily to meet changing requirements and demand.

In this demo, the instance runs Apache Web Server, which continuously generates web access log records, and Amazon Kinesis Agent, which streams these records to Amazon Kinesis Firehose.

Apache Web Server + Amazon Kinesis Agent



Amazon Kinesis Firehose

Amazon Kinesis Firehose is a fully managed service for delivering real-time streaming data to destinations such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, or Amazon Elasticsearch Service (Amazon ES).

In this step, we will create an Amazon Kinesis Firehose delivery stream to save each log entry in Amazon S3 and to provide the log data to the Amazon Kinesis Analytics application.


Example: Real-time Analytics (1)

Diagram: Apache Web Server → streaming data → Amazon Kinesis Firehose (stage: COLLECT)

1. A Linux instance running Amazon Kinesis Agent continuously sends log records to Amazon Kinesis Firehose.
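To make the record format concrete, here is a minimal sketch of turning one Apache access log line into a JSON record before it is streamed. It assumes the Apache Common Log Format; the "response" field name matches the column used by the SQL later in this deck, while the other field names, the regular expression, and the function name are illustrative assumptions.

```python
import json
import re

# Apache Common Log Format: host ident authuser [datetime] "request" status bytes
COMMON_LOG = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) \[(?P<datetime>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<response>\d{3}) (?P<bytes>\S+)'
)


def log_line_to_json(line):
    """Convert one access log line to a JSON string, or None if it does not parse."""
    match = COMMON_LOG.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["response"] = int(record["response"])  # numeric status code for aggregation
    return json.dumps(record)


sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
print(log_line_to_json(sample))
```

Converting each line to structured JSON at the source means downstream consumers can group by status code without re-parsing raw text.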


Amazon Simple Storage Service (Amazon S3)

Amazon S3 has a simple web services interface that you can use to store and retrieve any amount of data, at any time, from anywhere on the web. It gives any developer access to the same highly scalable, reliable, fast, inexpensive data storage infrastructure.

Examples: web access logs, static websites, and data lakes.

Amazon S3



Amazon Kinesis Analytics

Amazon Kinesis Analytics enables you to query streaming data or build entire streaming applications using SQL, so that you can gain actionable insights promptly.

It takes care of everything required to run your queries continuously and scales automatically to match the volume and throughput rate of your incoming data.



Diagram: Apache Web Server → Amazon Kinesis Firehose → Amazon S3 bucket (stages: COLLECT, STORE)

2a. Amazon Kinesis Firehose writes each log record to Amazon Simple Storage Service (Amazon S3) for durable storage.



Diagram: Apache Web Server → Amazon Kinesis Firehose → Amazon S3 bucket (stages: COLLECT, STORE, PROCESS / ANALYZE)

2b. Amazon Kinesis Analytics runs a SQL statement against the streaming input data.


SQL Operations Inside Kinesis Analytics

Source Stream → Insert & Select (Pump) → Destination Stream


-- Output stream: one row per HTTP status code per one-minute window
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    datetime VARCHAR(30),
    status INTEGER,
    statusCount INTEGER);

-- Pump: continuously insert per-minute counts from the source stream
CREATE OR REPLACE PUMP "STREAM_PUMP" AS
    INSERT INTO "DESTINATION_SQL_STREAM"
    SELECT STREAM
        TIMESTAMP_TO_CHAR('yyyy-MM-dd''T''HH:mm:ss.SSS', LOCALTIMESTAMP) AS datetime,
        "response" AS status,
        COUNT(*) AS statusCount
    FROM "SOURCE_SQL_STREAM_001"
    GROUP BY
        "response",
        FLOOR(("SOURCE_SQL_STREAM_001".ROWTIME - TIMESTAMP '1970-01-01 00:00:00') minute / 1 TO MINUTE);
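The grouping in the pump above is a one-minute tumbling window keyed by status code. The following sketch reproduces that logic over an in-memory list so the windowing can be checked by hand. It is an approximation under stated assumptions: the function name and input shape are hypothetical, and it labels each row with the window start, whereas the SQL emits the local timestamp at which the row is produced.

```python
from collections import Counter
from datetime import datetime


def count_by_status_per_minute(records):
    """records: iterable of (arrival_time: datetime, response: int) pairs."""
    counts = Counter()
    for arrival_time, response in records:
        # Truncate to the minute, mirroring FLOOR(... TO MINUTE) in the SQL.
        window_start = arrival_time.replace(second=0, microsecond=0)
        counts[(window_start, response)] += 1
    return [
        {"datetime": window.isoformat(), "status": status, "statusCount": n}
        for (window, status), n in sorted(counts.items())
    ]


rows = count_by_status_per_minute([
    (datetime(2017, 10, 31, 9, 0, 5), 200),
    (datetime(2017, 10, 31, 9, 0, 50), 200),
    (datetime(2017, 10, 31, 9, 0, 59), 404),
    (datetime(2017, 10, 31, 9, 1, 1), 200),
])
for row in rows:
    print(row)
```

Each output row corresponds to one (minute, status) pair, which is the shape of the aggregated data that the next step delivers to Amazon ES.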




Diagram: Apache Web Server → Amazon Kinesis Firehose → Amazon S3 bucket, plus a second Amazon Kinesis Firehose delivery stream (stages: COLLECT, STORE, PROCESS / ANALYZE, STORE)

3. Amazon Kinesis Analytics creates an aggregated data set every minute and outputs that data to a second Firehose delivery stream.


Amazon Elasticsearch Service (Amazon ES)

Amazon Elasticsearch Service makes it easy to deploy, secure, operate, and scale Elasticsearch for log analytics, full text search, application monitoring, and more. Amazon Elasticsearch Service is a fully managed service that delivers real-time analytics capabilities alongside the availability, scalability, and security that production workloads require.

The service offers built-in integrations with Kibana, Logstash, and other AWS services. It enables you to go from raw data to actionable insights quickly and securely.

Amazon Elasticsearch



Diagram: Apache Web Server → Amazon Kinesis Firehose → Amazon S3 bucket, plus a second Amazon Kinesis Firehose delivery stream → Amazon Elasticsearch Service (stages: COLLECT, STORE, PROCESS / ANALYZE, STORE)

4. This Firehose delivery stream writes the aggregated data to an Amazon ES domain.


Kibana

Kibana lets you visualize your Elasticsearch data. It provides interactive visualizations of various types, including histograms, line graphs, pie charts, and more. It leverages the full aggregation capabilities of Elasticsearch.

Kibana



Diagram: Apache Web Server → Amazon Kinesis Firehose → Amazon S3 bucket, plus a second Amazon Kinesis Firehose delivery stream (stages: COLLECT, STORE, PROCESS / ANALYZE, STORE, CONSUME)

5. Finally, use Kibana to visualize the results of your system.


Implementation Steps

Diagram: the full pipeline (Apache Web Server, two Amazon Kinesis Firehose delivery streams, Amazon S3 bucket) across the stages COLLECT, STORE, PROCESS / ANALYZE, STORE, CONSUME, annotated with step numbers 1, 2a, 2b, 3, 4, 5, 6.


Let’s build your own in 60 minutes!

https://aws.amazon.com/getting-started/projects/build-log-analytics-solution/


Thank you!
John Yeung | [email protected]
