
Page 1: Building Your First Big Data Application - Amazon Web Services (source: aws-de-media.s3.amazonaws.com/images/_Munich_Loft...)

Building Your First Big Data Application

Philipp Behre, Solutions Architect, Amazon Web Services

…live, now, in realtime

Page 2

Amazon S3

Amazon Kinesis

Amazon DynamoDB

Amazon RDS (Aurora)

AWS Lambda

KCL Apps

Amazon EMR

Amazon Redshift

Amazon Machine Learning

Collect → Store → Process → Analyze

Data → Answers

Data Collection and Storage

Data Processing

Event Processing

Data Analysis

Amazon Kinesis Analytics

Page 3

A newer approach to the same challenge (event-driven)

Page 4

Your first big data application on AWS (a timeless approach)

Page 5

Collect Process Analyze

Store

Data → Answers

Page 6

Collect Process Analyze

Store

Data → Answers

Page 7

Collect Process Analyze

Store

Data → Answers

SQL

Page 8

Set up with the AWS CLI – sharp and precise

Page 9

Amazon Kinesis

Create a single-shard Amazon Kinesis stream for incoming log data:

aws kinesis create-stream \
  --stream-name AccessLogStream \
  --shard-count 1

Page 10

Amazon S3

YOUR-BUCKET-NAME
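The bucket-creation command on this slide did not survive extraction; a minimal sketch with the AWS CLI (bucket names are globally unique, so substitute your own) would be:

```shell
# Create an S3 bucket to hold the raw and processed log data
aws s3 mb s3://YOUR-BUCKET-NAME
```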

Page 11

Amazon EMR

Launch a 3-node Amazon EMR cluster with Spark and Hive:

m3.xlarge

YOUR-AWS-SSH-KEY
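The launch command itself was lost; a sketch consistent with the slide's fragments (m3.xlarge, 3 nodes, Spark and Hive). The release label and cluster name are assumptions; any EMR release that bundles Spark and Hive will do:

```shell
# Launch a 3-node EMR cluster with Spark and Hive installed
aws emr create-cluster \
  --name "big-data-demo" \
  --release-label emr-4.5.0 \
  --applications Name=Hive Name=Spark \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --ec2-attributes KeyName=YOUR-AWS-SSH-KEY \
  --use-default-roles
```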

Page 12

Amazon Redshift


CHOOSE-A-REDSHIFT-PASSWORD
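Only the password placeholder survived from this slide; a single-node cluster sketch (the identifier, database name, user name, and node type are assumptions):

```shell
# Create a single-node Amazon Redshift cluster
aws redshift create-cluster \
  --cluster-identifier demo \
  --db-name demo \
  --node-type dc1.large \
  --cluster-type single-node \
  --master-username master \
  --master-user-password CHOOSE-A-REDSHIFT-PASSWORD
```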

Page 13

Your first big data application on AWS

1. COLLECT: Stream data into Amazon Kinesis with Log4J

2. PROCESS: Process data with Amazon EMR using Spark & Hive

3. ANALYZE: Analyze data in Amazon Redshift using SQL

STORE

Page 14

1. Collect

Page 15

Amazon Kinesis Log4J Appender

In a separate terminal window on your local machine, download Log4J Appender:

Then download and save the sample Apache log file:
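The download commands were images and did not survive; the URLs below are assumptions based on AWS's published version of this lab, not confirmed by the slide:

```shell
# Download the Kinesis Log4J Appender jar and a sample Apache access log
wget http://emr-kinesis.s3.amazonaws.com/publisher/kinesis-log4j-appender-1.0.0.jar
wget http://elasticmapreduce.s3.amazonaws.com/samples/pig-apache/input/access_log_1
```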

Page 16

Amazon Kinesis Log4J Appender

Create a file called AwsCredentials.properties with credentials for an IAM user with permission to write to Amazon Kinesis:

accessKey=YOUR-IAM-ACCESS-KEY
secretKey=YOUR-SECRET-KEY

Then start the Amazon Kinesis Log4J Appender:
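The start command was lost; the class and file names below are assumptions based on AWS's published version of this lab:

```shell
# FilePublisher reads the log file line by line and pushes each
# line as a record into the Kinesis stream
java -cp .:kinesis-log4j-appender-1.0.0.jar \
  com.amazonaws.services.kinesis.log4j.FilePublisher access_log_1
```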

Page 17

Log file format
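The sample on this slide was an image; an illustrative line in Apache log format, whose fields (host, identity, user, request_time, request, status, size, referrer, agent) match the RegexSerDe table defined on a later slide:

```
75.35.230.210 - - [20/Jul/2009:22:22:42 -0700] "GET /images/pigtrihawk.jpg HTTP/1.1" 200 29236 "http://www.swebtec.com/" "Mozilla/4.0"
```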

Page 18

Spark

• Fast, general-purpose engine for large-scale data processing

• Write applications quickly in Java, Scala, or Python

• Combine SQL, streaming, and complex analytics

Page 19

Amazon Kinesis and Spark Streaming

Log4J Appender → Amazon Kinesis → Amazon EMR (Spark Streaming) → Amazon S3

Spark Streaming uses the Kinesis Client Library, which checkpoints its position in Amazon DynamoDB

Page 20

Using Spark Streaming on Amazon EMR

ssh -o TCPKeepAlive=yes -o ServerAliveInterval=30 \
  -i YOUR-AWS-SSH-KEY hadoop@YOUR-EMR-HOSTNAME

On your cluster, download the Amazon Kinesis client for Spark:
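The download command was lost; version 1.6.0 matches the jar name in the spark-shell invocation on the next slide, and the Maven Central path is an assumption:

```shell
# Fetch the Kinesis Client Library jar used by Spark Streaming
wget https://repo1.maven.org/maven2/com/amazonaws/amazon-kinesis-client/1.6.0/amazon-kinesis-client-1.6.0.jar
```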

Page 21

Using Spark Streaming on Amazon EMR

Cut down on console noise:
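The original command was lost; one plausible approach is to raise the root log threshold in Spark's log4j configuration so INFO chatter stops flooding the shell:

```shell
# Assumption: EMR's default config sets log4j.rootCategory=INFO, console
sudo sed -i 's/rootCategory=INFO/rootCategory=ERROR/' \
  /etc/spark/conf/log4j.properties
```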

Start the Spark shell:

spark-shell \
  --jars /usr/lib/spark/extras/lib/spark-streaming-kinesis-asl.jar,amazon-kinesis-client-1.6.0.jar \
  --driver-java-options "-Dlog4j.configuration=file:///etc/spark/conf/log4j.properties"

Page 22

Using Spark Streaming on Amazon EMR

/* import required libraries */
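The import block itself was lost; one set that satisfies the code on the following slides (KinesisUtils, StreamingContext, the Kinesis client, and the default credentials chain) is:

```scala
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import com.amazonaws.services.kinesis.AmazonKinesisClient
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kinesis.KinesisUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
```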

Page 23

Using Spark Streaming on Amazon EMR

/* Set up the variables as needed */

YOUR-REGION

YOUR-S3-BUCKET

/* Reconfigure the spark-shell */
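The variable definitions were lost; a sketch using the placeholders above (the 60-second batch interval is an assumption, and the spark-shell reconfiguration step did not survive at all):

```scala
/* Placeholders come from earlier slides; substitute real values */
val streamName = "AccessLogStream"
val endpointUrl = "https://kinesis.YOUR-REGION.amazonaws.com"
val outputDir = "s3://YOUR-S3-BUCKET/access-log-raw"
val outputInterval = Seconds(60)
```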

Page 24

Reading Amazon Kinesis with Spark Streaming

/* Setup the KinesisClient */

val kinesisClient = new AmazonKinesisClient(new DefaultAWSCredentialsProviderChain())

kinesisClient.setEndpoint(endpointUrl)

/* Determine the number of shards from the stream */

val numShards = kinesisClient.describeStream(streamName).getStreamDescription().getShards().size()

/* Create one worker per Kinesis shard */

val ssc = new StreamingContext(sc, outputInterval)

val kinesisStreams = (0 until numShards).map { i =>

KinesisUtils.createStream(ssc, streamName, endpointUrl, outputInterval,
  InitialPositionInStream.TRIM_HORIZON, StorageLevel.MEMORY_ONLY)

}

Page 25

Writing to Amazon S3 with Spark Streaming

/* Merge the worker Dstreams and translate the byteArray to string */

/* Write each RDD to Amazon S3 */
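The code for the two steps above was lost; a sketch consistent with the stream set up on the previous slide:

```scala
// Union the per-shard DStreams into one, then decode the records
val unionStreams = ssc.union(kinesisStreams)
val accessLogs = unionStreams.map(byteArray => new String(byteArray))

// Persist each batch to S3 under the output prefix, then start the job
accessLogs.saveAsTextFiles(outputDir, "txt")
ssc.start()
```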

Page 26

View the output files in Amazon S3

s3://YOUR-S3-BUCKET (one output directory per batch, named by yyyy mm dd HH)

Page 27

2. Process

Page 28

Spark SQL

Spark's module for working with structured data using SQL

Run unmodified Hive queries on existing data

Page 29

Using Spark SQL on Amazon EMR

ssh -i YOUR-AWS-SSH-KEY hadoop@YOUR-EMR-HOSTNAME

Start the Spark SQL shell:

spark-sql --driver-java-options "-Dlog4j.configuration=file:///etc/spark/conf/log4j.properties"

Page 30

Create a table that points to your Amazon S3 bucket

CREATE EXTERNAL TABLE access_log_raw(

host STRING, identity STRING,

user STRING, request_time STRING,

request STRING, status STRING,

size STRING, referrer STRING,

agent STRING

)

PARTITIONED BY (year INT, month INT, day INT, hour INT, min INT)

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'

WITH SERDEPROPERTIES (

"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?"

)

LOCATION 's3://YOUR-S3-BUCKET/access-log-raw';

MSCK REPAIR TABLE access_log_raw;

Page 31

Query the data with Spark SQL

-- return the first row in the stream

-- return the count of all items in the stream

-- find the top 10 hosts
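The queries themselves were lost; hedged reconstructions matching the three comments above, in order:

```sql
SELECT * FROM access_log_raw LIMIT 1;

SELECT COUNT(1) FROM access_log_raw;

SELECT host, COUNT(1) AS requests
FROM access_log_raw
GROUP BY host
ORDER BY requests DESC
LIMIT 10;
```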

Page 32

Preparing the data for Amazon Redshift import

We will transform the data that is returned by the query before writing it to our Amazon S3-stored external Hive table

Hive user-defined functions (UDF) in use for the text transformations: from_unixtime, unix_timestamp and hour

The “hour” value is important: this is what’s used to split and organize the output files before writing to Amazon S3. These splits will allow us to more efficiently load the data into Amazon Redshift later in the lab using the parallel “COPY” command.

Page 33

Create an external table in Amazon S3

YOUR-S3-BUCKET
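The DDL was lost; a sketch whose columns mirror the INSERT on a later slide and whose tab delimiter matches the Redshift COPY command (column types and the path suffix are assumptions):

```sql
CREATE EXTERNAL TABLE access_log_processed (
  request_time STRING,
  host STRING,
  request STRING,
  status STRING,
  referrer STRING,
  agent STRING
)
PARTITIONED BY (hour INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://YOUR-S3-BUCKET/access-log-processed';
```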

Page 34

Configure partition and compression

-- setup Hive's "dynamic partitioning"

-- this will split output files when writing to Amazon S3

-- compress output files on Amazon S3 using Gzip
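The SET statements were lost; a hedged reconstruction covering the three comments above (dynamic partitioning with per-hour output splits, and Gzip-compressed output):

```sql
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.dynamic.partition=true;

SET hive.exec.compress.output=true;
SET mapred.output.compress=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
```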

Page 35

Write output to Amazon S3

-- convert the Apache log timestamp to a UNIX timestamp

-- split files in Amazon S3 by the hour in the log lines

INSERT OVERWRITE TABLE access_log_processed PARTITION (hour)

SELECT

from_unixtime(unix_timestamp(request_time,

'[dd/MMM/yyyy:HH:mm:ss Z]')),

host,

request,

status,

referrer,

agent,

hour(from_unixtime(unix_timestamp(request_time,

'[dd/MMM/yyyy:HH:mm:ss Z]'))) as hour

FROM access_log_raw;

Page 36

View the output files in Amazon S3

s3://YOUR-S3-BUCKET/access-log-processed (one Gzip-compressed file per hour partition)

Page 37

3. Analyze

Page 38

Connect to Amazon Redshift

# using the PostgreSQL CLI

YOUR-REDSHIFT-ENDPOINT
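The connection command was lost; a sketch using psql (5439 is Redshift's default port; the user and database names are assumptions matching the earlier create-cluster sketch):

```shell
psql -h YOUR-REDSHIFT-ENDPOINT -p 5439 -U master demo
```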

Or use any JDBC or ODBC SQL client with the PostgreSQL 8.x drivers or native Amazon Redshift support

• Aginity Workbench for Amazon Redshift

• SQL Workbench/J

Page 39

Create an Amazon Redshift table to hold your data
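The DDL was lost; a sketch whose columns mirror the processed Hive table and whose name matches the COPY command on the next slide (types, DISTKEY, and SORTKEY choices are assumptions):

```sql
CREATE TABLE accesslogs (
  request_time TIMESTAMP,
  host VARCHAR(50),
  request VARCHAR(2048),
  status INT,
  referrer VARCHAR(2048),
  agent VARCHAR(2048),
  hour INT
)
DISTKEY (host)
SORTKEY (request_time);
```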

Page 40

Loading data into Amazon Redshift

“COPY” command loads files in parallel

COPY accesslogs

FROM 's3://YOUR-S3-BUCKET/access-log-processed'

CREDENTIALS

'aws_access_key_id=YOUR-IAM-ACCESS_KEY;aws_secret_access_key=YOUR-IAM-SECRET-KEY'

DELIMITER '\t' IGNOREHEADER 0

MAXERROR 0

GZIP;

Page 41

Amazon Redshift test queries

-- find distribution of status codes over days

-- find the 404 status codes

-- show all requests for status as PAGE NOT FOUND
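The queries were lost; hedged reconstructions matching the three comments above, in order (404 is the HTTP status for PAGE NOT FOUND):

```sql
SELECT TRUNC(request_time) AS day, status, COUNT(1) AS requests
FROM accesslogs
GROUP BY 1, 2
ORDER BY 1, 2;

SELECT COUNT(1) FROM accesslogs WHERE status = 404;

SELECT * FROM accesslogs WHERE status = 404 ORDER BY request_time;
```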

Page 42

Your first big data application on AWS

A favicon would fix 398 of the total 977 PAGE NOT FOUND (404) errors

Page 43

4. Visualize

Page 45

Try it yourself on the AWS cloud…

Service                  Est. Cost*
Amazon Kinesis           $1.00
Amazon S3 (free tier)    $0
Amazon EMR               $0.44
Amazon Redshift          $1.00
Est. Total               $2.44

…around the same cost as a cup of coffee

*Estimated costs assume: use of free tier where available, lower-cost instances, a dataset no bigger than 10 MB, and instances running for less than 4 hours. Costs may vary depending on options selected, size of dataset, and usage.

+ Amazon QuickSight @ $0 with Free Tier ($3.50+ otherwise; https://quicksight.aws/pricing/)

Page 46

Questions?

AMAZON CONFIDENTIAL