MAKING BIG DATA COME ALIVE
AWS Serverless Architecture
Think Big
Garrett Holbrook, Data Engineer
Feb 1st, 2017
Agenda
• What is Think Big?
• Example Project Walkthrough
• AWS Serverless
Think Big, a Teradata Company
• Big Data Consulting
  – Roadmaps
  – Training
  – Strategy & Architecture
  – Implementation
• Acquired by Teradata in 2014
• Open source
© 2017 Think Big, a Teradata Company 2/1/17
About Us
• Garrett Holbrook
  – Graduated Neumont University with a BS in CS
  – With Think Big ~1 year
• Mike Forsyth
  – Graduated BYU with a BS in Computer Engineering
  – With Think Big since May 2016
• Max Goff
  – Think Big Academy
Example Implementation Walkthrough
• The company has a lot of data stored in an RDBMS
• The RDBMS is costly to manage and underperforms on certain queries
• They hope Hadoop can provide reduced costs and better performance
• They hired Think Big to help
Next Step
• Evaluate and prioritize use cases
• Install MapReduce and HDFS on their servers
• Write some MapReduce jobs
• Done?
Not that simple, unfortunately/fortunately
• The Hadoop ecosystem has a mind-boggling number of technologies
Hadoop Ecosystem (diagram)
Not that simple, unfortunately/fortunately
• The Hadoop ecosystem has a mind-boggling number of technologies
• Each of these technologies fulfills some business or technical need
• MapReduce is only the tip of the iceberg
Sqoop
• Built for efficiently transferring bulk data between HDFS and relational databases
Source: blogs.apache.org/sqoop
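A Sqoop bulk import is normally launched from the command line. As a sketch, the invocation can be assembled like this; the JDBC URL, table name, and HDFS directory below are made-up examples, not details from this project:

```python
# Sketch: assemble a hypothetical `sqoop import` command line.
# The connection details and table names here are illustrative only.
def build_sqoop_import(jdbc_url, username, table, target_dir, num_mappers=4):
    """Return the argv list for a bulk import from an RDBMS into HDFS."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,              # JDBC URL of the source database
        "--username", username,
        "--table", table,                   # source table to copy
        "--target-dir", target_dir,         # HDFS directory for the output files
        "--num-mappers", str(num_mappers),  # parallelism of the transfer
    ]

cmd = build_sqoop_import(
    "jdbc:mysql://db.example.com/events", "etl_user",
    "raw_events", "/data/raw/events")
print(" ".join(cmd))
```

Sqoop splits the work across the requested number of mappers, which is where the "efficiently transferring bulk data" comes from.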
Spark
• Engine for general distributed big data processing
• Accomplishes the same goal as MapReduce, but does it better
• The Spark API provides many functions in addition to map and reduce
Hive + Tez
• “Hadoop’s data warehouse”
• SQL is the language of Hive; it turns SQL queries into MapReduce jobs
• Newer versions (including the stable release) use Tez for better performance
• SQL skills carry over
• It is NOT a relational database, despite the use of SQL
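Because HiveQL stays close to standard SQL, existing skills transfer directly. Purely to illustrate the kind of query involved, here is the same idea against Python's built-in sqlite3 standing in for a Hive session; the table uses the deck's flattened event schema, and in Hive this GROUP BY would be compiled to MapReduce or Tez jobs:

```python
import sqlite3

# Stand-in table with the flattened event schema from the walkthrough.
# sqlite3 is used only so the example runs anywhere; the SQL itself is
# what would carry over (modulo dialect) to HiveQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events "
    "(source_country TEXT, event_type_code INT, event_timestamp INT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("JP", 5, 1484911141), ("US", 2, 1484914741), ("US", 5, 1484918341)])

# Count events per country -- Hive would distribute this aggregation.
counts = dict(conn.execute(
    "SELECT source_country, COUNT(*) FROM events GROUP BY source_country"))
print(counts)  # {'JP': 1, 'US': 2}
```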
Hue
• Web interface for data analysis on Hadoop
• SQL editor for use with Hive, Phoenix, etc.
• Spark notebooks
• Can serve as the main tool through which users access a Hadoop cluster
Hue (screenshot)
Source: gethue.com
Example Implementation
1. Sqoop to import data from the RDBMS into Hadoop

Relational Database (source):
event_id | source_location      | event_xml
15234    | 40.741895,-73.989308 | <Header><Event>…
15235    | 35.689487,139.691706 | <Header><Event>…
Example Implementation
2. Spark to flatten the XML data

Flattened output:
source_country | event_type_code | event_timestamp
JP             | 5               | 1484911141
US             | 2               | 1484914741
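The flattening step turns each row's nested event_xml into flat columns. A minimal per-record sketch using only the standard library; the tag names under the event are assumptions, since the slides only show `<Header><Event>…` and the flattened output schema. In Spark, a function like this would be applied with `map` across the distributed dataset:

```python
import xml.etree.ElementTree as ET

def flatten_event(event_xml):
    """Flatten one nested event document into a flat dict of columns.

    The element names below are hypothetical -- the deck shows only
    the <Header><Event> prefix and the flattened column names.
    """
    root = ET.fromstring(event_xml)
    event = root.find("Event")
    return {
        "source_country": event.findtext("SourceCountry"),
        "event_type_code": int(event.findtext("TypeCode")),
        "event_timestamp": int(event.findtext("Timestamp")),
    }

sample = """<Header>
  <Event>
    <SourceCountry>JP</SourceCountry>
    <TypeCode>5</TypeCode>
    <Timestamp>1484911141</Timestamp>
  </Event>
</Header>"""
row = flatten_event(sample)
print(row)
```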
Example Implementation
3a. Hive to write a sample of the data to a sample table (sample_table, stored as ORC)
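Writing the sample table amounts to selecting a small random subset of the rows; in Hive this could be done with a TABLESAMPLE clause or a rand()-filtered INSERT. A plain-Python sketch of the idea:

```python
import random

# Toy stand-in for the flattened event rows.
rows = [("JP", 5, 1484911141), ("US", 2, 1484914741),
        ("US", 5, 1484918341), ("DE", 1, 1484921941)]

random.seed(42)  # deterministic only so the example is repeatable
sample_table = random.sample(rows, k=2)  # keep 2 of the 4 rows
print(len(sample_table))
```

Sampling keeps exploratory queries cheap while the full table stays available for the real workload.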
Example Implementation
3b. Hive to run the SQL query using a distributed, scalable processing engine (results stored as ORC in query_result)
Example Implementation
4. Hue for visualization and analysis of sample_table and query_result by users, analysts, etc.
Administration
• Have a plan and engineers are ready; now what?
• Build out the cluster
  – Provision hardware: on-site or cloud?
  – Install Hadoop, Spark, Sqoop, Hue, Hive, and in reality many more…
  – Test cluster stability
  – Set up security
  – And more, all on open-source software…
Administration
• Things that make your life easier when building a cluster
  – Hadoop admins
  – Hadoop distributions
• Hadoop distributions: Hortonworks Data Platform (HDP) and Cloudera
  – Provide version compatibility
  – Support
  – Additional software
HDP (screenshot)
Source: hortonworks.com
What is AWS?
• Amazon Web Services is the leading cloud services provider
• What is cloud?
  – Renting servers
  – Redundant data storage
• AWS has a lot of services built on top of their cloud infrastructure
AWS EC2
• Elastic Compute Cloud (EC2) is a cloud service that lets you rent servers
• Define hardware details
  – For example, a t2.large instance gives you 2 vCPUs and 8 GB of memory
  – Specify how much storage you need
• Define the OS image
  – Red Hat, Ubuntu, Windows Server, etc.
• Launch the instance
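Those choices map directly onto the parameters of a launch request. As a sketch, here is the shape of the parameters that would be handed to a call such as boto3's `run_instances`; the AMI ID and volume size are made-up placeholders:

```python
# Sketch of launch parameters for one t2.large instance. With boto3
# these would be passed as ec2_client.run_instances(**params); the
# AMI ID below is a hypothetical placeholder, not a real image.
params = {
    "ImageId": "ami-0123456789abcdef0",  # placeholder OS image (e.g. Ubuntu)
    "InstanceType": "t2.large",          # 2 vCPUs, 8 GB of memory
    "MinCount": 1,
    "MaxCount": 1,
    "BlockDeviceMappings": [{
        "DeviceName": "/dev/sda1",
        "Ebs": {"VolumeSize": 100},      # illustrative 100 GB root volume
    }],
}
print(params["InstanceType"])
```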
AWS S3
• Simple Storage Service (S3) is a redundant data storage service
• Files are called objects in S3, and the containers that hold them are called buckets
• Charged per GB-month of data stored in S3
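The GB-month billing model makes storage cost a simple multiplication. A back-of-the-envelope sketch; the per-GB-month rate below is an assumed illustrative figure, not a quoted AWS price:

```python
# Back-of-the-envelope S3 storage cost estimate.
RATE_PER_GB_MONTH = 0.023  # assumed USD rate, for illustration only

def monthly_storage_cost(gigabytes, rate=RATE_PER_GB_MONTH):
    """Cost of keeping `gigabytes` of data stored for one month."""
    return gigabytes * rate

cost = monthly_storage_cost(500)  # 500 GB stored for one month
print(f"${cost:.2f}")  # $11.50
```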
AWS Kinesis
• Kinesis Streams
  – Distributed, fault-tolerant messaging queue
  – A fit for small, high-frequency records
• Kinesis Firehose
  – Writes streaming data directly to S3 and other AWS storage services
• Kinesis Analytics
  – Run SQL on a Kinesis stream
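The useful thing Firehose adds over writing to S3 directly is buffering: many small records become one larger object. A minimal sketch of that batching behavior (the flush threshold is an arbitrary illustrative value; real Firehose buffers by size and time):

```python
# Sketch of Firehose-style buffering: accumulate small records and
# flush them as one batch, standing in for one object written to S3.
class BatchingBuffer:
    def __init__(self, flush_at=3):
        self.flush_at = flush_at
        self.pending = []
        self.flushed_batches = []  # stand-in for objects landed in S3

    def put_record(self, record):
        self.pending.append(record)
        if len(self.pending) >= self.flush_at:
            self.flushed_batches.append(list(self.pending))
            self.pending.clear()

buf = BatchingBuffer(flush_at=3)
for i in range(7):
    buf.put_record({"event_id": i})
print(len(buf.flushed_batches), len(buf.pending))  # 2 1
```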
AWS Lambda
• Run code in the cloud without worrying about servers
• Define a function
  – Java, Node, Python, and now C#
• Define a trigger
  – A file put in S3
  – Data sent to a Kinesis stream
  – The Lambda function called directly
• AWS deploys your code and runs it whenever the function is triggered
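A Python Lambda function is just a handler with a fixed signature; for an S3 put trigger, the event carries the bucket and key of the new object. A minimal sketch, simulated locally with a hand-built event (the bucket and key values are made-up examples):

```python
import json

# Minimal sketch of a Python Lambda handler for an S3 put trigger.
# The event follows the S3 notification shape (Records -> s3 ->
# bucket/object); the values used below are illustrative.
def handler(event, context):
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    return {"statusCode": 200,
            "body": json.dumps({"bucket": bucket, "key": key})}

# Simulating the trigger locally instead of deploying:
fake_event = {"Records": [{"s3": {
    "bucket": {"name": "example-raw-events"},
    "object": {"key": "incoming/batch-0001.csv"},
}}]}
result = handler(fake_event, None)
print(result["statusCode"])  # 200
```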
What do we mean by serverless?
• Any cloud service where the details and operation of the servers are not exposed to the user of the service
  – Lambda
  – Kinesis
  – DynamoDB
  – S3
  – Athena
  – Not EC2
Where would you use this?
• In the implementation example, administration is a large inhibitor of success and development speed
• Even with Hadoop distributions and support, getting everything installed and configured correctly is a large effort
• Server administration is still a big part of Hadoop administration
  – OS updates
  – OS-level security
  – Space concerns (log files get out of hand)
• Support is costly, and the hours spent on administration are costly
• Capacity planning, especially if the cluster is on-site
  – Scaling based on load
• Serverless potentially alleviates these issues
Serverless Example Implementation
1. Records are written to a Kinesis Firehose delivery stream; Firehose batches up the records and puts them in an S3 bucket
(Relational Database → Firehose delivery stream → S3 bucket)
Serverless Example Implementation
2. A Lambda function triggers on the write to S3, flattens the records, and pushes the flattened rows to a Kinesis stream
(S3 bucket → Lambda flatten → Kinesis stream)
Serverless Example Implementation
3. Kinesis Analytics is used both to write a sample and to run the query, each flowing through its own Firehose delivery stream into an S3 bucket
(Kinesis stream → Kinesis Analytics sample and query → Firehose delivery streams → S3 buckets)
Serverless Example Implementation
4. Athena is used for ad-hoc query and analysis over the data in S3