Big Data Analytics

Peter Sirota, General Manager, Amazon Elastic MapReduce

DESCRIPTION

Learn more about the tools, techniques and technologies for working productively with data at any scale. This session will introduce the family of data analytics tools on AWS which you can use to collect, compute and collaborate around data, from gigabytes to petabytes. We'll discuss Amazon Elastic MapReduce, Hadoop, structured and unstructured data, and the EC2 instance types which enable high performance analytics.

Page 1: Big Data Analytics

Big Data Analytics

Peter Sirota

General Manager, Amazon Elastic MapReduce

Page 2: Big Data Analytics

Overview

1. Introducing Big Data

2. From data to actionable information

3. Analytics and Cloud Computing

4. The Big Data ecosystem

Page 3: Big Data Analytics

1. Introducing Big Data

Page 4: Big Data Analytics

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Page 5: Big Data Analytics

The cost of data generation is falling

Page 6: Big Data Analytics

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Lower cost, higher throughput

Page 7: Big Data Analytics

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Lower cost, higher throughput

Highly constrained

Page 8: Big Data Analytics

[Chart: data volume, generated data vs. data available for analysis]

Sources:

Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011

IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares

Page 9: Big Data Analytics

Elastic and highly scalable

+ No upfront capital expense

+ Only pay for what you use

+ Available on-demand

= Remove constraints

Page 10: Big Data Analytics

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Lower cost, higher throughput

Highly constrained

Page 11: Big Data Analytics

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Accelerated

Page 12: Big Data Analytics

Close the gap.

Page 13: Big Data Analytics

Big Data

Technologies and techniques for working productively with data, at any scale.

Page 14: Big Data Analytics

2. From data to actionable information

Page 15: Big Data Analytics

“Who buys video games?”

Page 16: Big Data Analytics

Per day:

3.5 billion records

13 TB of click stream logs

71 million unique cookies
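The per-day figures on this slide can be turned into rough averages. The sketch below is only back-of-the-envelope arithmetic. It assumes decimal terabytes and that the 3.5 billion records are the entries of the 13 TB click stream logs, neither of which the slide states.

```python
# Figures taken from the slide (per day).
RECORDS_PER_DAY = 3.5e9
LOG_BYTES_PER_DAY = 13e12        # 13 TB, ASSUMING 1 TB = 10**12 bytes
UNIQUE_COOKIES_PER_DAY = 71e6


def per_day_averages(records, log_bytes, cookies):
    """Return (average bytes per record, average records per cookie)."""
    return log_bytes / records, records / cookies


bytes_per_record, records_per_cookie = per_day_averages(
    RECORDS_PER_DAY, LOG_BYTES_PER_DAY, UNIQUE_COOKIES_PER_DAY
)
print(f"~{bytes_per_record / 1e3:.1f} kB per record")
print(f"~{records_per_cookie:.0f} records per cookie")
```

Under those assumptions each record averages a few kilobytes and each cookie accounts for a few dozen records per day.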

Page 17: Big Data Analytics
Page 18: Big Data Analytics
Page 19: Big Data Analytics

Results:

500% return on ad spend

17,000% reduction in procurement time

Page 20: Big Data Analytics

“Who is using our service?”

Page 21: Big Data Analytics

Identified early mobile usage

Invested heavily in mobile development

Finding signal in the noise of logs

Page 22: Big Data Analytics

In January 2013

9,432,061 unique mobile devices used the Yelp mobile app.

4 million+ calls. 5 million+ directions.

Page 23: Big Data Analytics

Open web index.

3.4 billion records.

Available to all.

Page 24: Big Data Analytics

Full parse for impact of social networks

300 lines of Ruby code.

14 hours.

$100.

Page 25: Big Data Analytics

You Are What You Tweet: Analyzing Twitter for Public Health. M. J. Paul and M. Dredze, 2011

Tweeting about Flu

Page 26: Big Data Analytics

Tweeting about Food

[Chart: tweets about the price of rice vs. official food price inflation]

Page 27: Big Data Analytics

3. Analytics and Cloud Computing

Page 28: Big Data Analytics

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Page 29: Big Data Analytics

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

S3, Glacier, Storage Gateway, DynamoDB, Redshift, RDS, HBase

Page 30: Big Data Analytics

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

EC2 & Elastic MapReduce

Page 31: Big Data Analytics

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

EC2 & S3, CloudFormation, Elastic MapReduce, RDS, DynamoDB, Redshift

Page 32: Big Data Analytics

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

EC2 & S3, CloudFormation, Elastic MapReduce, RDS, DynamoDB, Redshift

EC2 & Elastic MapReduce

S3, Glacier, Storage Gateway, DynamoDB, Redshift, RDS, HBase

AWS Data Pipeline

Page 33: Big Data Analytics

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

EC2 & S3, CloudFormation, Elastic MapReduce, RDS, DynamoDB, Redshift

EC2 & Elastic MapReduce

S3, Glacier, Storage Gateway, DynamoDB, Redshift, RDS, HBase

AWS Data Pipeline

Page 34: Big Data Analytics

Elastic MapReduce

Page 35: Big Data Analytics

Managed Hadoop analytics

Page 36: Big Data Analytics

Input data

S3, DynamoDB, Redshift

Page 37: Big Data Analytics

Elastic MapReduce

Code

Input data

S3, DynamoDB, Redshift

Page 38: Big Data Analytics

Elastic MapReduce

Code

Name node

Input data

S3, DynamoDB, Redshift

Page 39: Big Data Analytics

Elastic MapReduce

Code

Name node

Input data

Elastic cluster

S3, DynamoDB, Redshift

S3/HDFS

Page 40: Big Data Analytics

Elastic MapReduce

Code

Name node

Input data

S3/HDFS

Queries + BI

Via JDBC, Pig, Hive

S3, DynamoDB, Redshift

Elastic cluster

Page 41: Big Data Analytics

Elastic MapReduce

Code

Name node

Output

Input data

Queries + BI

Via JDBC, Pig, Hive

S3, DynamoDB, Redshift

Elastic cluster

S3/HDFS

Page 42: Big Data Analytics

Output

Input data

S3, DynamoDB, Redshift
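The input, code, cluster, output flow in the preceding diagrams follows the MapReduce model that Hadoop implements. As a rough illustration only, here is a minimal single-process sketch of the map, shuffle and reduce phases using a word count. It does not use Hadoop or Elastic MapReduce, and the function names and sample input are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for each word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence on an in-memory input."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical stand-in for input data that would normally live in S3 or HDFS.
sample_input = ["big data analytics", "big data at any scale"]
word_counts = run_job(sample_input)
print(word_counts)
```

On a real cluster the map and reduce calls run in parallel across many nodes, and the shuffle moves data between them over the network.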

Page 43: Big Data Analytics
Page 44: Big Data Analytics
Page 45: Big Data Analytics
Page 46: Big Data Analytics
Page 47: Big Data Analytics
Page 48: Big Data Analytics
Page 49: Big Data Analytics
Page 50: Big Data Analytics
Page 51: Big Data Analytics
Page 52: Big Data Analytics
Page 53: Big Data Analytics

1. Elastic clusters

Page 54: Big Data Analytics

10 hours

Page 55: Big Data Analytics

6 hours

Page 56: Big Data Analytics

Peak capacity

Page 57: Big Data Analytics

2. Rapid, tuned provisioning

Page 58: Big Data Analytics

Tedious.

Page 59: Big Data Analytics

Remove undifferentiated heavy lifting.

Page 60: Big Data Analytics

3. Hadoop all the way down

Page 61: Big Data Analytics

Robust ecosystem. Databases, machine learning, segmentation, clustering, analytics, metadata stores, exchange formats, and so on...

Page 62: Big Data Analytics

4. Agility for experimentation

Page 63: Big Data Analytics

Instance choice. Stay flexible on instance type & number.

Page 64: Big Data Analytics

5. Cost optimizations

Page 65: Big Data Analytics

Built for Spot. Name-your-price supercomputing.
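One way to picture "stay flexible on instance type & number" together with Spot pricing is a cluster definition that mixes on-demand and Spot capacity. The sketch below only builds a plain dictionary and makes no AWS call. Its shape is meant to follow the `InstanceGroups` structure accepted by the boto3 EMR client's `run_job_flow`; the group names, instance type, counts and bid price are hypothetical and not taken from the talk.

```python
def build_instance_groups(core_count, task_count, spot_bid_usd,
                          instance_type="m5.xlarge"):
    """Build an InstanceGroups list: on-demand master and core, Spot task nodes.

    All names, the default instance type and the bid are illustrative
    assumptions, not recommendations from the presentation.
    """
    return [
        {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": 1},
        {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": core_count},
        # Task nodes hold no HDFS data, so losing them to a Spot
        # interruption costs compute time rather than stored blocks.
        {"Name": "task-spot", "InstanceRole": "TASK", "Market": "SPOT",
         "BidPrice": f"{spot_bid_usd:.2f}",
         "InstanceType": instance_type, "InstanceCount": task_count},
    ]


groups = build_instance_groups(core_count=2, task_count=8, spot_bid_usd=0.10)
for group in groups:
    print(group["Name"], group["Market"], group["InstanceCount"])
```

Changing `instance_type` or the counts between runs is the kind of experimentation the earlier slides describe; the definition is data, so it is cheap to vary.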

Page 66: Big Data Analytics

1. Elastic clusters

2. Rapid, tuned provisioning

3. Hadoop all the way down

4. Agility for experimentation

5. Cost optimizations

Page 67: Big Data Analytics

Vin Sharma

Director, Product Strategy & Marketing

Big Data Software, Intel Corporation

Page 68: Big Data Analytics

Analysis of Data Can Transform Society

Create new business models and improve organizational processes.

Enhance scientific understanding, drive innovation, and accelerate medical cures.

Increase public safety and improve energy efficiency with smart grids.

Page 69: Big Data Analytics

Intel’s Vision to Democratize Big Data

Unlock Value in Silicon

Support Open Platforms

Deliver Software Value

Page 70: Big Data Analytics

Intel at the Intersection of Big Data

HPC: Enabling exascale computing on massive data sets

Cloud: Helping enterprises build open interoperable clouds

Open Source: Contributing code and fostering ecosystem

Page 71: Big Data Analytics

Intel® Technology at the Heart of the Cloud

Server

Storage

Network

Page 72: Big Data Analytics

Scale-Out Big Data Compute Platform Optimization

Cost-effective performance

• Intel® Advanced Vector Extensions Technology

• Intel® Turbo Boost Technology 2.0

• Intel® Advanced Encryption Standard New Instructions Technology

Page 73: Big Data Analytics

Intel® Advanced Vector Extensions Technology

• Newest in a long line of processor instruction innovations

• Increases floating point operations per clock up to 2X performance [1]

[1] Performance comparison using Linpack benchmark. See backup for configuration details.

For more legal information on performance forecasts go to http://www.intel.com/performance

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Page 74: Big Data Analytics

Intel® Turbo Boost Technology 2.0

More Performance

Higher turbo speeds maximize performance for single and multi-threaded applications

Page 75: Big Data Analytics

Intel® Advanced Encryption Standard New Instructions

• Processor assistance for performing AES encryption: 7 new instructions

• Makes enabled encryption software faster and stronger

Page 76: Big Data Analytics

The Power of Intel® Platform Solutions: Richer user experiences

[Chart: TeraSort for 1 TB sort, from 4 HRS down to 10 MIN. Configurations shown: previous Intel® Xeon® Processor; Intel® Xeon® Processor E5 2600; Solid-State Drive; 10G Ethernet; Intel® Apache Hadoop. Reductions shown: 50%, 80%, 50%, 40%.]

Page 77: Big Data Analytics

Cloud

Intelligent Systems

Clients

The Virtuous Cycle of User Experience

Page 78: Big Data Analytics

4. The Big Data Ecosystem

Page 79: Big Data Analytics

Data, data, everywhere... Data is stored in silos.

Page 80: Big Data Analytics

S3

DynamoDB EMR

HBase on EMR RDS

Redshift

On-premises

Page 81: Big Data Analytics

“How do I get my data to the cloud?”

Page 82: Big Data Analytics

Data mobility

Generated and stored in AWS

Inbound data transfer is free

Multipart upload to S3

Physical media

AWS Direct Connect

Regional replication of AMIs and snapshots
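Of the options above, multipart upload to S3 is the one that is easiest to show in code. The following is a minimal planning sketch, not an upload client: it only works out how a large object could be split into parts. The 5 MiB minimum part size (for every part except the last) and the 10,000-part ceiling are S3 limits; the default 8 MiB part size and the doubling strategy are assumptions made for illustration.

```python
MIN_PART_SIZE = 5 * 1024 * 1024   # S3 minimum for every part except the last
MAX_PARTS = 10_000                # S3 maximum number of parts per upload


def plan_parts(total_size, part_size=8 * 1024 * 1024):
    """Return (part_number, offset, length) tuples covering total_size bytes."""
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the S3 minimum")
    # Double the part size until the object fits within the part limit.
    while -(-total_size // part_size) > MAX_PARTS:
        part_size *= 2
    parts = []
    offset = 0
    part_number = 1               # S3 part numbers start at 1
    while offset < total_size:
        length = min(part_size, total_size - offset)
        parts.append((part_number, offset, length))
        offset += length
        part_number += 1
    return parts


plan = plan_parts(20 * 1024 * 1024)   # a hypothetical 20 MiB object
print(plan)
```

Each tuple describes one independently uploadable byte range, which is what lets parts be sent in parallel and retried individually.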

Page 83: Big Data Analytics

“How do I integrate my data for maximum impact?”

Page 84: Big Data Analytics

S3

DynamoDB EMR

HBase on EMR RDS

Redshift

On-premises

Page 85: Big Data Analytics

S3

DynamoDB EMR

HBase on EMR RDS

Redshift

On-premises

Page 86: Big Data Analytics

S3

DynamoDB EMR

HBase on EMR RDS

Redshift

On-premises

Page 87: Big Data Analytics

S3

DynamoDB EMR

HBase on EMR RDS

Redshift

On-premises

Page 88: Big Data Analytics

S3

DynamoDB EMR

HBase on EMR RDS

Redshift

On-premises

Page 89: Big Data Analytics

AWS Data Pipeline

Announced in November, available now.

Orchestration for data-intensive workloads.

Page 90: Big Data Analytics

AWS Data Pipeline

Data-intensive orchestration and automation

Reliable and scheduled

Easy to use, drag and drop

Execution and retry logic

Map data dependencies

Create and manage temporary compute resources
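Two of the ideas listed above, mapping data dependencies and execution with retry logic, can be sketched in a few lines. This is a conceptual illustration only and does not use the AWS Data Pipeline API. The activity names are hypothetical, and it relies on `graphlib` from the Python standard library (3.9 or later).

```python
from graphlib import TopologicalSorter


def run_pipeline(activities, depends_on, max_attempts=3):
    """Run activities in dependency order, retrying each one on failure.

    activities: name -> zero-argument callable
    depends_on: name -> set of names that must finish first
    Returns a list of (name, attempts_used) in execution order.
    """
    order = TopologicalSorter(
        {name: depends_on.get(name, set()) for name in activities}
    ).static_order()
    log = []
    for name in order:
        for attempt in range(1, max_attempts + 1):
            try:
                activities[name]()
                log.append((name, attempt))
                break
            except Exception:
                if attempt == max_attempts:
                    raise  # give up after the final attempt


    return log


# Hypothetical activities: the middle one fails once, then succeeds.
attempts = {"count": 0}


def flaky_job():
    attempts["count"] += 1
    if attempts["count"] < 2:
        raise RuntimeError("transient failure")


activities = {
    "copy_logs": lambda: None,
    "emr_job": flaky_job,
    "load_results": lambda: None,
}
depends_on = {"emr_job": {"copy_logs"}, "load_results": {"emr_job"}}
run_log = run_pipeline(activities, depends_on)
print(run_log)
```

A managed service adds scheduling, notifications and provisioning of the temporary compute resources on top of this basic ordering and retry loop.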

Page 91: Big Data Analytics

Anatomy of a pipeline

Page 92: Big Data Analytics

Additional checks and notifications

Page 93: Big Data Analytics

Arbitrarily complex pipelines

Page 94: Big Data Analytics

aws.amazon.com/datapipeline

Page 95: Big Data Analytics

aws.amazon.com/big-data

Page 96: Big Data Analytics

Summary

1. Introducing Big Data

2. From data to actionable information

3. Analytics and Cloud Computing

4. The Big Data ecosystem

Page 97: Big Data Analytics

Get 600 hours of free supercomputing time!

www.powerof60.com