
ADVANCED LOGGING SCENARIOS WITH DYNAMODB AND ELASTIC MAPREDUCE

Data Analysis on AWS

Paolo Latella, XPeppers, [email protected]

Agenda

• Cloud Computing Log

• Generate Data

• Collect Data

• Store Data

• ETL

• Analysis

CLOUD COMPUTING LOG: A “big data” problem

A big data problem

• Log data analysis in cloud computing is a big data problem!

• Volume: large amounts of data are generated

• Velocity: data grows at high rates and needs to be analyzed quickly

• Variety: different types of structured and unstructured data

Log analysis tasks

Generate: server logs, application logs
Collect: Scribe, Flume, Fluentd, Logstash
Store: DynamoDB, S3
ETL: Elastic MapReduce, Data Pipeline
Analyze: DynamoDB, S3, CloudSearch, Redshift

GENERATE DATA: A large amount of data at large rates

A large amount of data, at a large rate

Don’t worry, we have autoscaling: www.animoto.com scaled to 5,000 instances. And all of those instances generate log data.

COLLECT DATA: Collect events and logs in real-time

Collect events and logs

• Batch processing (rsync): bursts of traffic, high latency, hard to analyze

• Remote server (rsyslog): hard to configure, not scalable, hard to analyze

• Log collector: simple to configure, scalable, NoSQL integration

In real-time: log collector (1/2)

• Apache Flume: a distributed, reliable, and available service (Java) for collecting, aggregating, and moving large amounts of log data
• source: a source of data from which Flume receives data
• sink: the counterpart to the source, the destination for data

• Fluentd: a fully free and open-source (Ruby) log collector that instantly enables a “Log Everything” architecture, with 125+ plugins
• input plugins: retrieve data from the most varied sources (file, MySQL, Scribe, Flume and others)
• output plugins: send data to different destinations (file, S3, DynamoDB, SQS and others)

In real-time: log collector (2/2)

• Logstash: a tool (Java) for managing events and logs
• input section: configures the sources of data and the corresponding plugins
• filter section: configures regular expressions or plugins that filter the input data
• output section: configures the destinations and the corresponding plugins

• Scribe: a service (C++ and Python) for aggregating log data streamed in real time from a large number of servers
• Scribe was developed at Facebook using Apache Thrift and released in 2008 as open source
• Scribe servers are arranged in a directed graph, with each server knowing only about the next server in the graph

Log collector: Fluentd

fluentd.org

Fluentd: plugin

http://www.treasure-data.com/
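Demo step 1 below configures fluentd for log collection. As a minimal sketch of the producing side, this uses fluent-logger, Fluentd’s Python client; the tag and field names are illustrative, and a fluentd daemon is assumed to be listening on the default forward port with a matching output plugin (DynamoDB, S3, ...) configured.

from fluent import sender
from fluent import event

# Point the logger at the local fluentd daemon (default forward port 24224).
sender.setup("weblog", host="localhost", port=24224)

# Emit a structured event; fluentd sees it with the tag "weblog.access"
# and routes it to whichever output plugin matches that tag.
event.Event("access", {
    "host": "10.0.0.12",
    "method": "GET",
    "path": "/index.html",
    "code": 200,
})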

STORE DATA: NoSQL database or storage

NoSQL database

NoSQL database: Amazon DynamoDB

• A fast, fully managed NoSQL database service in the AWS cloud

• You simply create a table and specify how much read/write request capacity you require

• Tables do not have fixed schemas, and each item may have a different number of attributes and multiple data types

• Performance, reliability and security are built in: SSD storage and automatic 3-way replication

• Integrates with Amazon Elastic MapReduce through a customized version of Hive
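A minimal sketch of creating such a table, using boto3 (the current AWS SDK for Python; the talk predates it). The table name, key schema and region are assumptions for illustration:

import boto3

dynamodb = boto3.client("dynamodb", region_name="eu-west-1")

# Hash key on the logging host, range key on the timestamp: items for one
# host are grouped together and sorted by time.
dynamodb.create_table(
    TableName="weblogs",
    AttributeDefinitions=[
        {"AttributeName": "host", "AttributeType": "S"},
        {"AttributeName": "timestamp", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "host", "KeyType": "HASH"},
        {"AttributeName": "timestamp", "KeyType": "RANGE"},
    ],
    # Provisioned read/write request capacity (see the throughput slide).
    ProvisionedThroughput={"ReadCapacityUnits": 12, "WriteCapacityUnits": 12},
)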

Amazon DynamoDB: web console

http://aws.amazon.com

Amazon DynamoDB: throughput

1 million evenly spread writes per day is equivalent to 1,000,000 (writes) / 24 (hours) / 60 (minutes) / 60 (seconds) = 11.6 writes per second. A DynamoDB write capacity unit can handle 1 write per second, so you need 12 write capacity units. Similarly, to handle 1 million evenly spread reads per day, you need 12 read capacity units.
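The same arithmetic in a few lines of Python, using the slide’s simplification of one request per second per capacity unit:

import math

def capacity_units(requests_per_day):
    # Evenly spread daily volume -> requests per second, rounded up.
    return math.ceil(requests_per_day / (24 * 60 * 60))

print(capacity_units(1_000_000))  # 1,000,000 / 86,400 = 11.6 -> 12 units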

Or Storage: Amazon S3

• Provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web

• Write, read, and delete objects containing from 1 byte to 5 terabytes of data each. The number of objects you can store is unlimited.

• Designed for 99.999999999% durability and 99.99% availability of objects over a given year

• Import data from DynamoDB with Apache Hive
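Storing and retrieving log objects is a pair of calls with boto3; the bucket and key names here are illustrative:

import boto3

s3 = boto3.client("s3")

# Upload a rotated log file to the bucket.
s3.upload_file("/var/log/app/access.log", "my-log-bucket", "raw/2013/04/access.log")

# Objects come back the same way, from anywhere on the web.
s3.download_file("my-log-bucket", "raw/2013/04/access.log", "/tmp/access.log")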

ETL: With Amazon Web Services

With Amazon Web Services (1/2)

• Amazon provides the Elastic MapReduce web service for ETL operations in the AWS cloud

• Extract: Elastic MapReduce extracts your data from DynamoDB, S3 or HDFS

• Transform: Elastic MapReduce transforms your data across a cluster of EC2 instances

• Load: Elastic MapReduce loads your data to S3 or DynamoDB, and from there to Redshift or CloudSearch

With Amazon Web Services (2/2)

[Diagram: EC2 instances in an Auto Scaling group write log data to an S3 bucket and a DynamoDB table (Extract); an EMR cluster transforms the data (Transform); the results are loaded into S3, DynamoDB, Amazon Redshift and Amazon CloudSearch (Load).]

Amazon Elastic MapReduce

• With Amazon Elastic MapReduce (Amazon EMR) you can analyze and process vast amounts of data

• It does this by distributing the computational work across a cluster of EC2 virtual servers running in the Amazon cloud (a launch sketch follows below)

• The cluster is managed using an open-source framework called Hadoop and uses the map-reduce model
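A hedged sketch of launching such a cluster with boto3 (the talk predates this SDK, and EMR releases have changed since). Cluster name, log bucket, instance types and counts are illustrative; the roles are EMR’s defaults:

import boto3

emr = boto3.client("emr", region_name="eu-west-1")

# Start a small Hadoop cluster with Hive installed, and keep it alive so
# that steps can be submitted to it later.
response = emr.run_job_flow(
    Name="log-analysis",
    LogUri="s3://my-log-bucket/emr-logs/",
    ReleaseLabel="emr-5.36.0",
    Applications=[{"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])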

Amazon EMR: cluster (1/2)

• EMR offers several different types of clusters, each configured for a particular type of data processing

• Hive cluster: uses Apache Hive on top of Hadoop
• You can load data onto the cluster and then query it using HiveQL

• Custom JAR cluster: runs a Java map-reduce application that you have previously compiled into a JAR file and uploaded to Amazon S3

Amazon EMR: cluster (2/2)

• Streaming cluster: runs mapper and reducer scripts that you have previously uploaded to Amazon S3 (a minimal pair is sketched after this list)
• The scripts can be written in any of the following supported languages: Ruby, Perl, Python, PHP, Bash, C++

• Pig cluster: uses Apache Pig, an open-source Apache library that runs on top of Hadoop
• You can load data onto your cluster and then query it using Pig Latin, a SQL-like query language
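As a sketch of a streaming step in Python, here is a mapper/reducer pair that counts HTTP status codes. The field position (9th whitespace-separated field, as in Apache access logs) is an assumption; in practice the two functions live in separate mapper.py and reducer.py scripts uploaded to S3 and passed as the -mapper and -reducer arguments of the streaming step.

import sys

def mapper():
    # mapper.py: read raw log lines from stdin, emit "status<TAB>1" pairs.
    for line in sys.stdin:
        fields = line.split()
        if len(fields) > 8:
            print(fields[8] + "\t1")

def reducer():
    # reducer.py: Hadoop streaming sorts by key before the reducer runs,
    # so equal keys arrive in runs and can be summed in a single pass.
    current, count = None, 0
    for line in sys.stdin:
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current and current is not None:
            print(current + "\t" + str(count))
            count = 0
        current = key
        count += int(value)
    if current is not None:
        print(current + "\t" + str(count))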

ETL workflow: Amazon Data Pipeline

Helps you easily create complex data processing workloads that are fault tolerant, repeatable, and highly available

You define a pipeline composed of the “data sources”, the “activities” or business logic and the “schedule” on which your business logic executes.
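A heavily abbreviated sketch of driving Data Pipeline from boto3; the pipeline name and field values are illustrative, and a runnable definition would also need a Schedule object and an activity (such as an EmrActivity) before activation:

import boto3

dp = boto3.client("datapipeline", region_name="eu-west-1")

# Create an empty pipeline; uniqueId makes the call idempotent.
pipeline_id = dp.create_pipeline(name="nightly-etl", uniqueId="nightly-etl-1")["pipelineId"]

# Upload the definition; only the Default object is shown here for brevity.
dp.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[
        {
            "id": "Default",
            "name": "Default",
            "fields": [
                {"key": "scheduleType", "stringValue": "cron"},
                {"key": "failureAndRerunMode", "stringValue": "CASCADE"},
            ],
        },
    ],
)

dp.activate_pipeline(pipelineId=pipeline_id)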

ANALYZE: Analysis and search

Analysis: Amazon Redshift (beta)

• Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the cloud

• Load data from your RDS, S3 or DynamoDB

• Use your preferred business intelligence software for big data analysis

• After you create an Amazon Redshift cluster, you can use any SQL client tool to connect to the cluster with the JDBC or ODBC drivers from postgresql.org (a sketch follows below)
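Because Redshift speaks the PostgreSQL wire protocol, any PostgreSQL client library works. A sketch with psycopg2, bulk-loading the tab-separated files Hive exported to S3 and then querying them; endpoint, credentials, table and bucket names are placeholders:

import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.eu-west-1.redshift.amazonaws.com",
    port=5439, dbname="logs", user="admin", password="...",
)
cur = conn.cursor()

# COPY pulls the exported files directly from S3 into the weblogs table.
cur.execute(r"""
    COPY weblogs FROM 's3://my-log-bucket/hive-export/'
    CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftCopy'
    DELIMITER '\t'
""")
conn.commit()

# Then analyze with plain SQL, e.g. the distribution of status codes.
cur.execute("SELECT code, COUNT(*) FROM weblogs GROUP BY code ORDER BY 2 DESC")
print(cur.fetchall())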

Amazon Redshift cluster

http://aws.amazon.com

Search: Amazon CloudSearch
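CloudSearch ingests JSON batches of add/delete operations through a per-domain document endpoint. A sketch with boto3; the endpoint, document id and fields are illustrative:

import boto3
import json

cs = boto3.client(
    "cloudsearchdomain",
    endpoint_url="https://doc-weblogs-abc123.eu-west-1.cloudsearch.amazonaws.com",
)

# One batch can mix "add" and "delete" operations.
batch = [{
    "type": "add",
    "id": "log-0001",
    "fields": {"host": "10.0.0.12", "path": "/index.html", "code": "200"},
}]
cs.upload_documents(
    documents=json.dumps(batch),
    contentType="application/json",
)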

DEMO

Demo recap

1. Configure fluentd for log collection

2. Create a DynamoDB table and set its throughput

3. Generate logs and write the items to the DynamoDB table

4. Create a Hive cluster on Elastic MapReduce

5. Import data from DynamoDB with Hive and join the tables (see the sketch after this list)

6. Export the joined tables from Hive to Amazon S3

7. Load the data from S3 into Redshift and query it

8. Load the data from S3 into CloudSearch
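A sketch of steps 5 and 6: the HiveQL below (kept in Python strings so it can be run with the hive CLI on the EMR master node) maps a Hive external table onto the DynamoDB table via the DynamoDB storage handler that EMR’s Hive ships with, joins it against a second table, and writes the result to S3. Table, column and bucket names are illustrative, and the hosts table is assumed to be defined elsewhere:

import subprocess

IMPORT_DDL = """
CREATE EXTERNAL TABLE weblogs (host string, ts string, code string)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name" = "weblogs",
  "dynamodb.column.mapping" = "host:host,ts:timestamp,code:code"
);
"""

EXPORT_QUERY = """
INSERT OVERWRITE DIRECTORY 's3://my-log-bucket/hive-export/'
SELECT w.host, w.ts, w.code, h.location
FROM weblogs w JOIN hosts h ON (w.host = h.host);
"""

# Run each script through the Hive CLI on the master node.
for script in (IMPORT_DDL, EXPORT_QUERY):
    subprocess.run(["hive", "-e", script], check=True)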

QUESTIONS & ANSWERS

Thank you.

AWS303