© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. BDM205 Big Data Mini Con State of the Union Roger Barga, AWS November 29, 2016

AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Page 1: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

BDM205

Big Data Mini Con State of the Union

Roger Barga, AWS

November 29, 2016

Page 2: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

What is Big Data?

When your data sets become so large and complex that you have to start innovating around how to collect, store, process, analyze, and share them.

Page 3: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Amazon EMR Amazon EC2

Process & Analyze

Amazon Glacier

Amazon S3

Store

AWS Import/Export

AWS Direct Connect

Collect

Amazon Kinesis

Amazon Machine Learning

Amazon Redshift

Amazon DynamoDB

Amazon Kinesis Analytics

Amazon QuickSight

AWS Database Migration Service

AWS Data Pipeline

Amazon RDS, Aurora

Big Data services on AWS

Amazon Elasticsearch Service

Page 4: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Store anything

Object storage

Highly scalable

99.999999999% durability

Amazon S3

Collection and storage

Page 5: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Petabyte-scale data transfer service that uses Amazon-provided storage devices for transport.

Copy up to 80 TB of data from an on-premises file system to the Snowball through a 10 Gbps network interface

All data is encrypted with 256-bit GCM encryption

AWS Import/Export Snowball

Collection and storage

E-ink shipping label

Ruggedized case

“8.5G Impact”

50TB & 80TB 10G network
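
As a rough back-of-the-envelope check on the figures above, the sketch below estimates the theoretical minimum time needed to fill an 80 TB Snowball over a 10 Gbps link. It assumes decimal units (1 TB = 10^12 bytes) and a link running at full line rate with no protocol, file system, or encryption overhead, so real transfers should be expected to take longer. The function name is hypothetical and not part of any AWS tool.

```python
def min_transfer_hours(capacity_tb: float, link_gbps: float) -> float:
    """Lower bound on transfer time, ignoring all overhead (an idealized assumption)."""
    bits_to_move = capacity_tb * 1e12 * 8      # terabytes -> bits
    seconds = bits_to_move / (link_gbps * 1e9)  # bits / (bits per second)
    return seconds / 3600


if __name__ == "__main__":
    # 80 TB over 10 Gbps: 64,000 seconds, or about 17.8 hours
    print(round(min_transfer_hours(80, 10), 1))
```

The point of the estimate is only that the 10 Gbps interface lets a full device be loaded in well under a day in the ideal case.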

Page 6: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Relational data warehouse

Massively parallel; Petabyte scale

Fully managed

HDD and SSD Platforms

$1,000/TB/Year; start at $0.25/hour

Amazon Redshift

Structured data processing

Page 7: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Hadoop as a service

Spark, Presto, Flink, HBase, Hive, etc.

Easy to use; fully managed

On-demand and Spot pricing

HDFS & S3 file systems

Amazon EMR

Semi-structured / unstructured data processing

Page 8: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Distributed search and analytics engine

Managed service using Elasticsearch and Kibana

Fully managed - zero admin

Highly available and reliable

Tightly integrated with other AWS services

Amazon Elasticsearch Service

Semi-structured / unstructured data processing

Page 9: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Serverless compute service that runs your code in response to events.

Extend AWS services with user-defined custom logic.

Pay only for the requests served and compute time required - billing in increments of 100 milliseconds

AWS Lambda

Serverless event processing
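
The 100-millisecond billing increment mentioned above can be illustrated with a small sketch. It assumes, as was the case at the time of this talk, that execution time is rounded up to the next 100 ms boundary; the function name is hypothetical and no prices are modeled because the slide does not state them.

```python
import math

BILLING_INCREMENT_MS = 100  # increment stated on the slide


def billed_duration_ms(actual_ms: float) -> int:
    """Round an execution time up to the next billing increment."""
    if actual_ms <= 0:
        return 0
    return math.ceil(actual_ms / BILLING_INCREMENT_MS) * BILLING_INCREMENT_MS


if __name__ == "__main__":
    for ms in (1, 100, 101, 250):
        print(ms, "->", billed_duration_ms(ms))
```

Under this assumption a function that runs for 101 ms is billed the same as one that runs for 200 ms, which is why short, event-sized units of work suit this pricing model.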

Page 11: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Streams: Build your own custom application to process streaming data using the Amazon Kinesis Client Library. Connectors to S3, DynamoDB, Lambda, Amazon Redshift, Elasticsearch, Storm spout, …

Firehose: Load massive volumes of streaming data into S3, Amazon Redshift, Elasticsearch. Inline processing using Lambda and a library of ready-to-use templates.

Analytics: Analyze streaming data using standard SQL, no servers to manage, elastically scale, pay as you go.

Amazon Kinesis

Streaming data processing

Page 12: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Fast, powered by SPICE, automatically scales.

Explore, analyze, share insights with anyone.

1/10th the cost of traditional BI solutions.

Broad connectivity with AWS data services, on-premises data, files and business applications.

Amazon QuickSight

Visualize and explore

Amazon RDS

Amazon S3 Amazon Redshift

Page 13: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Putting it together

Scale

Page 14: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Scale as your data and business grow

The volume, variety, and velocity at which data is being generated are leaving organizations with new questions to answer, such as:

Page 15: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Store and analyze all your data, structured and unstructured from all of your sources, in one centralized location at low cost.

Quickly ingest data without needing to force it into a pre-defined schema, enabling ad-hoc analysis by applying schemas on read, not write.

Separating your storage and compute allows you to scale each component as required and to attach multiple data processing and analytics services to the same data set.

Scale

S3 Data Lake
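
To make "schema on read, not write" concrete, here is a minimal sketch. It assumes raw records land in the data lake as untyped JSON lines and that each consumer applies its own schema when reading; the sample records, field names, and the `read_with_schema` helper are all hypothetical and are not taken from the talk.

```python
import json

# Raw records as they might land in the lake: stored as-is, no schema enforced on write.
RAW_LINES = [
    '{"user": "u1", "amount": "12.50", "ts": "2016-11-29T10:00:00"}',
    '{"user": "u2", "amount": "3", "extra": "ignored"}',
]

# One consumer's schema: the fields it cares about and how to type them.
SCHEMA = {"user": str, "amount": float}


def read_with_schema(lines, schema):
    """Apply a schema at read time: keep known fields, cast them, skip the rest."""
    for line in lines:
        record = json.loads(line)
        yield {
            field: cast(record[field])
            for field, cast in schema.items()
            if field in record
        }


rows = list(read_with_schema(RAW_LINES, SCHEMA))

if __name__ == "__main__":
    for row in rows:
        print(row)
```

A different consumer could read the same raw lines with a different schema (for example, one that includes `ts`), which is what allows several analytics services to share one data set.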

Page 16: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Implementing a Data Lake on AWS

Elasticsearch

Page 17: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Starting small is powerful when you can scale up fast

Scaling up your analytics systems                      With AWS      Traditional IT *
Get a new BI server                                    20 minutes    3 months
Upgrade your analytics server to the newest
  Intel processors and add 16GB memory                 10 minutes    2 months
Add 500TB of storage                                   instant       2 months
Grow a DWH cluster from 8GB to 1PB                     1 hour        8 months
Build a 1024-node Hadoop cluster                       30 minutes    unlikely
Roll out multi-region production environment           hours         months

* actual provisioning times in a well-organized IT division

Page 18: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Netflix: Using Amazon S3 as the fabric of our big data ecosystem

Tuesday, Nov. 29, 5:30pm – 6:30pm

Mirage, St. Croix B

Page 19: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Putting it together

Cost

Page 21: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Putting it together: cost

How much would it cost to process the Twitter fire hose?

S3: $0.025/GB-Mo

Redshift: Starts at $0.25/hour

EC2: Starts at $0.02/hour

Glacier: $0.007/GB-Mo

Kinesis: $0.015/hour per shard (1 MB/s in; 2 MB/s out); $0.014/million PUTs

Page 22: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

500MM tweets/day = ~5,800 tweets/sec

2 KB/tweet is ~12 MB/sec (~1 TB/day)

$0.015/hour per shard, $0.014/million PUTS

Amazon Kinesis cost is $0.47/hour

Amazon Redshift cost is $0.850/hour (for a 2TB node)

S3 cost is $1.02/hour (no compression)

Total: $2.34/hour – on demand

Cost
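
The hourly figures on this slide can be reproduced with a short calculation. The sketch below derives the Amazon Kinesis cost from the stated inputs, assuming one PUT per tweet and one shard per 1 MB/s of ingest (rounded up); the Amazon Redshift and S3 hourly figures are taken from the slide as given rather than derived. All variable names are illustrative.

```python
import math

# Inputs stated on the slides
TWEETS_PER_DAY = 500_000_000
TWEET_SIZE_KB = 2
SHARD_INGEST_MB_PER_SEC = 1      # per-shard ingest capacity
SHARD_USD_PER_HOUR = 0.015
USD_PER_MILLION_PUTS = 0.014
REDSHIFT_USD_PER_HOUR = 0.850    # taken from the slide (2TB node)
S3_USD_PER_HOUR = 1.02           # taken from the slide (no compression)

tweets_per_sec = TWEETS_PER_DAY / 86_400
ingest_mb_per_sec = tweets_per_sec * TWEET_SIZE_KB / 1000

# Assumption: shards are sized on ingest throughput only.
shards = math.ceil(ingest_mb_per_sec / SHARD_INGEST_MB_PER_SEC)

# Assumption: each tweet is written with its own PUT.
puts_per_hour = tweets_per_sec * 3600

kinesis_usd_per_hour = (
    shards * SHARD_USD_PER_HOUR
    + puts_per_hour / 1_000_000 * USD_PER_MILLION_PUTS
)
total_usd_per_hour = kinesis_usd_per_hour + REDSHIFT_USD_PER_HOUR + S3_USD_PER_HOUR

if __name__ == "__main__":
    print(f"tweets/sec:   {tweets_per_sec:,.0f}")
    print(f"ingest MB/s:  {ingest_mb_per_sec:.1f}")
    print(f"shards:       {shards}")
    print(f"Kinesis $/h:  {kinesis_usd_per_hour:.2f}")
    print(f"Total $/h:    {total_usd_per_hour:.2f}")
```

With these assumptions the stream needs 12 shards, Kinesis comes to about $0.47/hour, and the total matches the slide's $2.34/hour.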

Page 23: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Use only the services you need

Scale only the services you need

Pay for only what you use

Discounts through Reserved Instances

Types including Spot, and upfront commitments.

Cost

Page 24: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Putting it together

Scale and security

Page 25: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Putting it together: scale and security

FINRA: Monitor and enforce trading regulations

FINRA handles approximately 75 billion market events every day to build a holistic picture of trading in the U.S., running hundreds of surveillance algorithms against massive amounts of data.

FINRA mission

• Deter misconduct by enforcing the rules

• Detect and prevent wrongdoing in U.S. markets

• Discipline those who break the rules

Scale brings unique challenges

• Market volumes are volatile and increasing

• Exchanges are dynamically evolving

• Regulatory rules are created and enhanced

• New securities products are introduced

• Market manipulators innovate

Page 26: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Petabytes of data generated on premises, brought to AWS, and stored in an S3 data lake.

Thousands of analytical queries performed on EMR and Redshift. Over 400 analytics packages.

Stringent security requirements met by leveraging VPC, VPN, encryption at rest and in transit, AWS CloudTrail, and database auditing.

Flexible Interactive Queries

Predefined Queries

Surveillance Analytics

Data Management

Data Movement

Data Registration

Version Management

Amazon S3

Platform that adapts to market dynamics

Web Applications

Analysts; Regulators

Amazon EMR

Amazon EMR

Amazon Redshift

Page 27: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Store an exabyte of data or more in S3

Analyze GB to PB using standard tools

Encryption of all data at each step

Auditability of all APIs and retrievals

Control egress and ingress points using VPCs

Scale and security

FINRA: Building a Secure Data Science Platform on AWS

Tuesday, Nov. 29, 4:00pm – 5:00pm

Mirage, St. Croix B

Page 28: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Putting it together

Agility and actionable insights

Page 29: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Actionable insights

Demonstration

http://amzn.to/bigdata

Access from a mobile device…

Page 30: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

What item most interests you this week?

What item will be the most difficult to explain to your significant other when you return home?

What will give you the biggest headache this week?

New Amazon Web Services

Blackjack

Networking with Peers

re:Play Party

Page 31: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

What item most interests you this week?

What are your colleagues most interested in hearing about when you return next week?

What will give you the biggest headache this week?

New Amazon Web Services

Blackjack

Networking with Peers

re:Play Party

Page 33: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Components: Amazon Cognito, Kinesis Ingestion Stream, Kinesis Analytics, Kinesis Aggregate Stream, Lambda Function, DynamoDB Table, S3 Bucket (HTML, JavaScript)

Data: Raw Device and Quadrant Data; Aggregated Data

SELECT ROWTIME, userId, COUNT(*)
FROM STREAM
GROUP BY userId, FLOOR(ROWTIME TO SECOND)

Demo architecture

Page 34: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

The demo application

CREATE OR REPLACE STREAM DESTINATION_SQL_STREAM (
    UNIQUE_USER_COUNT INT,
    ANDROID_COUNT INT,
    IOS_COUNT INT,
    WINDOWS_PHONE_COUNT INT,
    OTHER_OS_COUNT INT,
    QUADRANT_A_COUNT INT,
    QUADRANT_B_COUNT INT,
    QUADRANT_C_COUNT INT,
    QUADRANT_D_COUNT INT,
    WINDOW_TIME TIMESTAMP);

CREATE OR REPLACE STREAM DISTINCT_USER_STREAM (
    COGNITO_ID VARCHAR(64),
    DEVICE VARCHAR(32),
    OS VARCHAR(32),
    QUADRANT CHAR(1),
    DT TIMESTAMP);

CREATE OR REPLACE PUMP "DISTINCT_USER_PUMP" AS
    INSERT INTO "DISTINCT_USER_STREAM"
    SELECT STREAM DISTINCT
        "cognitoId",
        "device",
        "os",
        "quadrant",
        FLOOR("SOURCE_SQL_STREAM_001".ROWTIME TO SECOND)
    FROM "SOURCE_SQL_STREAM_001";

CREATE OR REPLACE PUMP "OUTPUT_PUMP" AS
    INSERT INTO "DESTINATION_SQL_STREAM"
    SELECT STREAM
        COUNT("DISTINCT_USER_STREAM".COGNITO_ID) AS UNIQUE_USER_COUNT,
        COUNT((CASE WHEN "DISTINCT_USER_STREAM".OS = 'Android' THEN COGNITO_ID ELSE null END)) AS ANDROID_COUNT,
        COUNT((CASE WHEN "DISTINCT_USER_STREAM".OS = 'iOS' THEN COGNITO_ID ELSE null END)) AS IOS_COUNT,
        COUNT((CASE WHEN "DISTINCT_USER_STREAM".OS = 'Windows Phone' THEN COGNITO_ID ELSE null END)) AS WINDOWS_PHONE_COUNT,
        COUNT((CASE WHEN "DISTINCT_USER_STREAM".OS = 'other' THEN COGNITO_ID ELSE null END)) AS OTHER_OS_COUNT,
        COUNT((CASE WHEN "DISTINCT_USER_STREAM".QUADRANT = 'A' THEN COGNITO_ID ELSE null END)) AS QUADRANT_A_COUNT,
        COUNT((CASE WHEN "DISTINCT_USER_STREAM".QUADRANT = 'B' THEN COGNITO_ID ELSE null END)) AS QUADRANT_B_COUNT,
        COUNT((CASE WHEN "DISTINCT_USER_STREAM".QUADRANT = 'C' THEN COGNITO_ID ELSE null END)) AS QUADRANT_C_COUNT,
        COUNT((CASE WHEN "DISTINCT_USER_STREAM".QUADRANT = 'D' THEN COGNITO_ID ELSE null END)) AS QUADRANT_D_COUNT,
        ROWTIME
    FROM "DISTINCT_USER_STREAM"
    GROUP BY
        FLOOR("DISTINCT_USER_STREAM".ROWTIME TO SECOND);
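
The two pumps above first de-duplicate events within each one-second window and then count them per operating system and quadrant. As a way to reason about that logic offline, here is a small in-memory Python sketch of the same two steps. It is a local simulation under simplifying assumptions (events arrive as a finite list, and each carries a `ts` field in seconds standing in for ROWTIME); it does not call Amazon Kinesis Analytics, and the function and key names are hypothetical.

```python
import math
from collections import defaultdict


def per_second_counts(events):
    """Simulate the two pumps: DISTINCT per second, then counts per window."""
    # Step 1 (DISTINCT_USER_PUMP): drop duplicate rows within the same second.
    distinct_rows = {
        (e["cognitoId"], e["device"], e["os"], e["quadrant"], math.floor(e["ts"]))
        for e in events
    }

    # Step 2 (OUTPUT_PUMP): one output record per one-second window.
    windows = defaultdict(lambda: defaultdict(int))
    for _cognito_id, _device, os_name, quadrant, second in distinct_rows:
        window = windows[second]
        window["unique_user_count"] += 1
        window["os:" + os_name] += 1
        window["quadrant:" + quadrant] += 1
    return {second: dict(counts) for second, counts in windows.items()}


SAMPLE_EVENTS = [
    {"cognitoId": "u1", "device": "phone", "os": "Android", "quadrant": "A", "ts": 0.1},
    {"cognitoId": "u1", "device": "phone", "os": "Android", "quadrant": "A", "ts": 0.7},
    {"cognitoId": "u2", "device": "phone", "os": "iOS", "quadrant": "B", "ts": 0.5},
    {"cognitoId": "u1", "device": "phone", "os": "Android", "quadrant": "A", "ts": 1.2},
]

result = per_second_counts(SAMPLE_EVENTS)

if __name__ == "__main__":
    for second in sorted(result):
        print(second, result[second])
```

As in the SQL, uniqueness is defined over the whole row, so a user who changes quadrant within a single second would be counted once per quadrant in that window.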

Page 35: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Big data does not mean just batch

• Can be streamed in

• Processed in real time

• Can be used to respond quickly to requests and actionable events, and generate business value

You can mix and match

• On-premises and cloud

• Custom development and managed services

Agility & actionable insights

Page 36: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Putting it together

Choice and selection

Page 37: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

1-click deployment to launch, in multiple regions around the world

Pay-as-you-go pricing with no long term contracts required

2,000+ product listings to browse, test, and buy software; 290 specific to big data.

Advanced Analytics

Database and Data Enablement

Business Intelligence

Putting it together: choice and selection

AWS Marketplace: Software store with simplified procurement

Page 38: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Largest ecosystem of ISVs & integrators

Tens of thousands of consulting and technology partners

Page 39: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

We have a retail mindset

Use our managed big data services

Build or bring your own

Or access thousands in our marketplace

Each customer decides for themselves

Choice & selection

Page 40: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Richard T. Freeman, Ph.D., Lead Data Engineer and Architect, JustGiving

November 29, 2016

JustGiving: Event-Driven Data Platform

BDM205

Page 41: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

We are a tech-for-good platform for events-based fundraising, charities, and crowdfunding

“Ensure no good cause goes unfunded”

• The #1 platform for online social giving in the world

• Peaks in traffic: Ice bucket, natural disasters

• Raised $4.2bn in donations

• 28.5m users

• 196 countries

• 27,000 good causes

• GiveGraph

  • 91 million nodes

  • 0.53 billion relationships

Page 42: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Fundraising page

Page 43: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Our requirements

• Limitation in existing SQL Server data warehouse

• Long-running and complex queries for data scientists

• New data sources: API, clickstream, unstructured, log, behavioral data, etc.

• Easy to add data sources and pipelines

• Reduce time spent on data preparation and experiments

Machine learning

Graph processing

Natural language processing

Stream processing

Data ingestion

Data preparation

Automated Pipelines

Insight

Predictions

Measure

Recommendations

Data-driven

Page 44: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Event-driven data platform at JustGiving [1 of 2]

• JustGiving developed an in-house analytics and data science platform on AWS called RAVEN.

• Reporting, Analytics, Visualization, Experimental, Networks

• Uses event-driven and serverless pipelines rather than workflows or DAGs

• Messaging, queues, pub/sub patterns

• Separate storage from compute

• Supports scalable event-driven

• ETL / ELT

• Machine learning

• Natural language processing

• Graph processing

• Allows users to consume raw tables, data blocks, metrics, KPIs, insight, reports etc.
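
To illustrate what "event-driven pipelines rather than workflows or DAGs" can look like, here is a minimal in-memory publish/subscribe sketch. It is not RAVEN's implementation, which the slides do not show; the `EventBus` class, the topic names, and the handlers are hypothetical. In a real AWS deployment the bus would typically be replaced by managed messaging, queueing, or function-trigger services.

```python
from collections import defaultdict


class EventBus:
    """Tiny synchronous pub/sub bus used only for illustration."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, payload):
        # Each step reacts to an event; no central scheduler orders the steps.
        for handler in list(self._subscribers[topic]):
            handler(payload)


bus = EventBus()
log = []


def load_raw(payload):
    """React to a new file arriving, then announce that a table is loaded."""
    log.append(f"loaded {payload['key']}")
    bus.publish("table_loaded", {"table": "clickstream_raw"})


def build_metrics(payload):
    """React to a loaded table by computing downstream metrics."""
    log.append(f"metrics for {payload['table']}")


bus.subscribe("object_created", load_raw)
bus.subscribe("table_loaded", build_metrics)
bus.publish("object_created", {"key": "logs/2016-11-29.json"})

if __name__ == "__main__":
    print(log)
```

Adding a new pipeline stage here means subscribing one more handler to an existing topic, without editing a central workflow definition, which is the property the slide's "easy to add data sources and pipelines" requirement points at.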

Page 45: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Event-driven data platform at JustGiving [2 of 2]

Page 46: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Serverless streaming analytics and persist stream

Page 47: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

The outcome

• Ingest full clickstream

• Near real-time streaming analytics

• Persist streams to Amazon S3 and Amazon Redshift

Amazon Kinesis

• AWS managed services

• Event-driven and serverless

• Scale out and automate complex queries

• Improved productivity

• Data-driven: Measure, insight, predict, recommend

RAVEN platform: scalable event-driven data platform in AWS

Page 48: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Thank you!

“Ensure no good cause goes unfunded”

Contact:

https://linkedin.com/in/drfreeman

BDM303 - JustGiving: Serverless Data Pipelines, Event-Driven ETL, and Stream Processing

Tuesday 2:30 PM - 3:30 PM

Wednesday, 3:30 PM - 4:30 PM [repeat]

Page 49: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Proven customer success

The vast majority of big data use cases deployed in the cloud today run on AWS.

Page 50: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Big Data Mini Con sessions

Rooms: Mirage, Bermuda A | Mirage, St. Croix B | Mirage, Event Center B | Mirage, Barbados A

1:00 PM

• Beeswax: Building a Real-Time Streaming Data Platform on AWS

• Big Data Architectural Patterns and Best Practices on AWS

• Deep Dive: Amazon EMR Best Practices & Design Patterns

• Workshop: Building Your First Big Data Application with AWS

2:30 PM

• JustGiving: Serverless Data Pipelines, Event-Driven ETL, and Stream Processing

• Best Practices for Apache Spark on Amazon EMR

• Understanding IoT Data: How to Leverage Amazon Kinesis in Building an IoT Analytics Platform on AWS

4:00 PM

• Analyzing Streaming Data in Real-time with Amazon Kinesis Analytics

• FINRA: Building a Secure Data Science Platform on AWS

• Best Practices for Data Warehousing with Amazon Redshift

• Workshop: Building Your First Big Data Application with AWS

5:30 PM

• Real-Time Data Exploration and Analytics with Amazon Elasticsearch Service and Kibana

• Netflix: Using Amazon S3 as the fabric of our big data ecosystem

• Visualizing Big Data Insights with Amazon QuickSight

Plus, repeats for many sessions throughout the week!

Page 51: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Get started with Big Data on AWS

aws.amazon.com/big-data

Big Data Quest: Learn at your own pace and practice working with AWS services for big data on QwikLABS. (3 Hours | Online) qwiklabs.com/quests/1

Big Data on AWS: How to use AWS services to process data with Hadoop & create big data environments. (3 Days | Classroom) aws.amazon.com/training/course-descriptions/bigdata/

Big Data Technology Fundamentals (FREE!): Overview of AWS big data solutions for architects or data scientists new to big data. (3 Hours | Online)

AWS Courses

Self-paced Online Labs

Page 52: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Remember to complete your evaluations!

Page 53: AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Thank you!