MAKING BIG DATA COME ALIVE

AWS Serverless Architecture
Think Big

Garrett Holbrook, Data Engineer
Feb 1st, 2017

Source: dml.cs.byu.edu/~cgc/docs/Capstone16-17/ThinkBig W17.pdf


Agenda

• What is Think Big?
• Example Project Walkthrough
• AWS Serverless

Think Big, a Teradata Company

• Big Data Consulting
  – Roadmaps
  – Training
  – Strategy & Architecture
  – Implementation
• Acquired by Teradata in 2014
• Open source

© 2017 Think Big, a Teradata Company 2/1/17

About Us

• Garrett Holbrook
  – Graduated Neumont University with BS in CS
  – With Think Big ~1 year
• Mike Forsyth
  – Graduated BYU with BS in Computer Engineering
  – With Think Big since May 2016
• Max Goff
  – Think Big Academy

Example implementation walkthrough

• Company has a lot of data stored in an RDBMS
• The RDBMS is costly to manage and underperforms on certain queries
• Hoping Hadoop can provide reduced costs and better performance
• Hired Think Big to help

Next step

• Evaluate use cases and prioritize
• Install MapReduce and HDFS on their servers
• Write some MapReduce jobs
• Done?

Not that simple, unfortunately/fortunately

• The Hadoop ecosystem has a mind-boggling number of technologies

Hadoop Ecosystem

Not that simple, unfortunately/fortunately

• The Hadoop ecosystem has a mind-boggling number of technologies
• Each of these technologies fulfills some business or technical need
• MapReduce is only the tip of the iceberg

Sqoop

• Built for efficiently transferring bulk data between HDFS and relational databases

Source: blogs.apache.org/sqoop

Spark

• Engine for general distributed big data processing
• Accomplishes the same goal as MapReduce, but is generally faster because it can keep intermediate data in memory
• Spark API provides functions in addition to map and reduce

Hive + Tez

• “Hadoop’s data warehouse”
• SQL is the language of Hive; it turns SQL queries into MapReduce jobs
• Newer versions (including the stable release) use Tez for better performance
• SQL skills carry over
• It is NOT a relational database despite the usage of SQL

Hue

• Web interface for data analysis on Hadoop
• SQL editor for use with Hive, Phoenix, etc.
• Spark notebooks
• Can be used as the main tool for users to gain access to a Hadoop cluster

Hue

Source: gethue.com

Example Implementation

1. Sqoop to import data from RDBMS into Hadoop

Relational Database

event_id  source_location       event_xml
15234     40.741895,-73.989308  <Header><Event>…
15235     35.689487,139.691706  <Header><Event>…

[Diagram: Relational Database → .csv files]
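
Step 1 could be launched roughly as follows. This is a minimal sketch, assuming Sqoop is installed on the machine that runs it. The JDBC URL, table name, user, and HDFS target directory are hypothetical placeholders that do not come from the slides; the flags themselves are standard `sqoop import` options. Credentials (for example via `--password-file`) are omitted.

```python
import shlex


def build_sqoop_import(jdbc_url, table, target_dir, username, num_mappers=4):
    """Build the argv list for a `sqoop import` of one table into HDFS as CSV."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,              # JDBC URL of the source RDBMS
        "--username", username,
        "--table", table,                   # source table to copy
        "--target-dir", target_dir,         # HDFS directory for the output files
        "--num-mappers", str(num_mappers),  # number of parallel import tasks
        "--fields-terminated-by", ",",      # write comma-delimited text
    ]


# Hypothetical values, for illustration only.
argv = build_sqoop_import(
    jdbc_url="jdbc:mysql://db.example.com/events_db",
    table="events",
    target_dir="/data/raw/events",
    username="etl_user",
)
print(shlex.join(argv))
# On a machine with Sqoop installed: subprocess.run(argv, check=True)
```

Building the command as a list keeps each argument separate, so values with spaces or special characters do not need manual quoting.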

Example Implementation

2. Spark to flatten XML data

source_country  event_type_code  event_timestamp
JP              5                1484911141
US              2                1484914741

[Diagram: .csv files → Flatten → .csv files]
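
The flatten step can be illustrated with a small function. The slides elide the real `event_xml` layout, so the element names below (`Country`, `TypeCode`, `Timestamp`) are a hypothetical schema chosen only to produce the three columns shown above.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema: the real event_xml layout is not shown in the slides.
SAMPLE_XML = (
    "<Header><Event>"
    "<Country>US</Country><TypeCode>2</TypeCode><Timestamp>1484914741</Timestamp>"
    "</Event></Header>"
)


def flatten_event(event_xml):
    """Turn one nested XML document into a flat dict of column values."""
    event = ET.fromstring(event_xml).find("Event")
    return {
        "source_country": event.findtext("Country"),
        "event_type_code": int(event.findtext("TypeCode")),
        "event_timestamp": int(event.findtext("Timestamp")),
    }


row = flatten_event(SAMPLE_XML)
print(row)
```

In a Spark job, a function like this could be applied to every record, for example with `rdd.map(flatten_event)`, so the work is spread across the cluster.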

Example Implementation

3a. Hive to write sample to sample table

[Diagram: .csv files → Sample → .orc files (sample_table)]

Example Implementation

3b. Hive to run SQL query using distributed and scalable processing engine

[Diagram: .csv files → Query → .orc files (query_result)]
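
The slides do not show the actual query. As an illustration only, the sketch below runs a count-per-country aggregate over the flattened columns, using Python's built-in `sqlite3` as a local stand-in for Hive. The table name `events` and the query itself are assumptions. A SELECT of this shape should also be valid in Hive, where it would run as distributed jobs instead of in a single process.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "source_country TEXT, event_type_code INTEGER, event_timestamp INTEGER)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("JP", 5, 1484911141), ("US", 2, 1484914741)],  # rows from the flatten slide
)

# Hypothetical query: count events per country.
QUERY = """
SELECT source_country, COUNT(*) AS event_count
FROM events
GROUP BY source_country
ORDER BY source_country
"""
query_result = conn.execute(QUERY).fetchall()
print(query_result)
```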

Example Implementation

4. Hue for visualization and analysis

[Diagram: query_result and sample_table → User, Analyst, etc.]

Administration

• Have a plan, engineers are ready, now what?
• Build out the cluster
  – Provision hardware
    - On-site or cloud?
  – Install Hadoop, Spark, Sqoop, Hue, Hive, and in reality many more…
  – Test cluster stability
  – Set up security
  – And more, all on open-source software...

Administration

• Things that make your life easier when building a cluster
  – Hadoop Admins
  – Hadoop Distributions
• Hadoop distributions
  – Hortonworks Data Platform (HDP) and Cloudera
  – Provide version compatibility
  – Support
  – Additional software

HDP

Source: hortonworks.com

What is AWS?

• Amazon Web Services is the leading cloud services provider
• What is cloud?
  – Renting servers
  – Redundant data storage
• AWS has a lot of services built on top of their cloud infrastructure

AWS EC2

• Elastic Compute Cloud (EC2) is a cloud service that lets you rent servers
• Define hardware details
  – For example, a t2.large instance gives you 2 vCPUs and 8 GB of memory
  – Specify how much storage you need
• Define the OS image
  – Red Hat, Ubuntu, Windows Server, etc.
• Launch instance
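
The three choices above (hardware, storage, OS image) map onto the parameters of a launch request. The sketch below builds the keyword arguments for the `run_instances` call of the boto3 EC2 client. The image ID is a placeholder, the 50 GiB volume size is an arbitrary example, and the root device name depends on the chosen image.

```python
def build_run_instances_request(image_id, instance_type="t2.large", volume_gib=50):
    """Keyword arguments for the boto3 EC2 client's run_instances call."""
    return {
        "ImageId": image_id,            # OS image (AMI) to boot
        "InstanceType": instance_type,  # hardware profile
        "MinCount": 1,
        "MaxCount": 1,
        "BlockDeviceMappings": [
            {
                "DeviceName": "/dev/sda1",          # root device name; varies by AMI
                "Ebs": {"VolumeSize": volume_gib},  # storage in GiB
            }
        ],
    }


def launch(request):
    """Send the request to AWS. Needs boto3 installed and credentials configured."""
    import boto3  # third-party AWS SDK, imported lazily so the sketch loads without it

    return boto3.client("ec2").run_instances(**request)


# "ami-EXAMPLE" is a placeholder, not a real image ID.
request = build_run_instances_request("ami-EXAMPLE")
print(request["InstanceType"])
```

Calling `launch(request)` would start a billable instance, so the sketch only builds and prints the request.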

AWS S3

• Simple Storage Service (S3) is a redundant data storage service
• Files are called objects in S3, and the top-level containers that hold them are called buckets
• Charged per GB/month of data stored in S3

AWS Kinesis

• Kinesis Streams
  – Distributed, fault-tolerant messaging queue
  – Fit for small, high-frequency data
• Kinesis Firehose
  – Writes streaming data directly to S3 and other AWS storage services
• Kinesis Analytics
  – Run SQL on a Kinesis Stream

AWS Lambda

• Run code in the cloud without worrying about servers
• Define a function
  – Java, Node, Python, and now C#
• Define a trigger
  – File put in S3
  – Data sent to a Kinesis Stream
  – Lambda function called directly
• AWS will deploy your code and run it whenever the function is triggered
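
As a sketch of the "file put in S3" trigger, the Python function below has the `(event, context)` signature Lambda expects and pulls the bucket name and object key out of the S3 event notification. The bucket and key in the local test event are placeholders, and the event is trimmed to the fields the function reads.

```python
from urllib.parse import unquote_plus


def handler(event, context):
    """Entry point that Lambda calls with the S3 event notification."""
    objects = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects


# Local invocation with a hand-written, trimmed-down event (names are placeholders).
fake_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {"key": "raw/events+2017.csv"},
            }
        }
    ]
}
result = handler(fake_event, None)
print(result)
```

Because the handler is a plain function, it can be exercised locally like this before it is deployed.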

What do we mean by serverless?

• Any cloud service where the details and operations of the server are not exposed to the user of the service
  – Lambda
  – Kinesis
  – DynamoDB
  – S3
  – Athena
  – Not EC2

Where would you use this?

• In the implementation example, administration is a large inhibitor of success and development speed
• Even with Hadoop distributions and support, getting everything installed and configured correctly is a large effort
• Server administration is still a big part of Hadoop administration
  – OS updates
  – OS-level security
  – Space concerns (log files get out of hand)
• Support is costly, and the hours spent on administration are costly
• Capacity planning, especially if the cluster is on-site
  – Scaling based on load
• Serverless potentially alleviates these issues

Serverless Example Implementation

1. Records are written to a Kinesis Firehose delivery stream. Firehose batches up records and puts them in S3

[Diagram: Relational Database → Kinesis Firehose (delivery stream) → S3 (bucket, .csv files)]
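
The slides do not say how records leave the relational database, so the sketch below only covers the write into Firehose. It assumes a client that behaves like `boto3.client("firehose")` and uses its `put_record_batch` call, which accepts at most 500 records per request. The stream name and row contents are placeholders, and a fake client stands in for AWS so the example runs locally.

```python
def send_to_firehose(client, stream_name, rows, batch_size=500):
    """Send rows to a Firehose delivery stream in batches.

    `client` is expected to behave like boto3.client("firehose").
    """
    sent = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        client.put_record_batch(
            DeliveryStreamName=stream_name,
            # Newline-terminate each record so rows stay separate in the S3 object.
            Records=[{"Data": (row + "\n").encode("utf-8")} for row in batch],
        )
        sent += len(batch)
    return sent


class FakeFirehose:
    """Stand-in client that records calls, so the sketch runs without AWS."""

    def __init__(self):
        self.calls = []

    def put_record_batch(self, **kwargs):
        self.calls.append(kwargs)
        return {"FailedPutCount": 0}


fake = FakeFirehose()
rows = [f"{15234 + i},row" for i in range(1200)]  # placeholder rows
sent = send_to_firehose(fake, "example-delivery-stream", rows)
print(sent, len(fake.calls))
```

Production code would also inspect `FailedPutCount` in each response and retry the records that were rejected.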

Serverless Example Implementation

2. Lambda function triggers on write to S3, flattens the records, and pushes them to a Kinesis stream

[Diagram: S3 (bucket) → trigger → Lambda (Flatten) → flattened rows → Kinesis Stream]
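
The body of such a function might look like the sketch below. It assumes clients that behave like `boto3.client("s3")` and `boto3.client("kinesis")`, and uses their `get_object` and `put_records` calls. The flatten function, the choice of `source_country` as the partition key, and all names are illustrative assumptions; fake clients stand in for AWS so the example runs locally.

```python
import io
import json


def process_object(s3, kinesis, bucket, key, stream_name, flatten):
    """Read one S3 object, flatten each line, and push the rows to a Kinesis stream."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    rows = [flatten(line) for line in body.splitlines() if line]
    records = [
        {
            "Data": json.dumps(row).encode("utf-8"),
            "PartitionKey": str(row["source_country"]),  # assumed key choice
        }
        for row in rows
    ]
    if records:
        # put_records takes at most 500 records; larger inputs would need chunking.
        kinesis.put_records(StreamName=stream_name, Records=records)
    return len(records)


def flatten_csv_line(line):
    """Placeholder flatten step; the real one would parse the event XML."""
    country, type_code, timestamp = line.split(",")
    return {
        "source_country": country,
        "event_type_code": int(type_code),
        "event_timestamp": int(timestamp),
    }


class FakeS3:
    """Returns a fixed object body, mimicking the shape of get_object's response."""

    def get_object(self, Bucket, Key):
        return {"Body": io.BytesIO(b"JP,5,1484911141\nUS,2,1484914741\n")}


class FakeKinesis:
    """Collects the records it is given instead of sending them to AWS."""

    def __init__(self):
        self.records = []

    def put_records(self, StreamName, Records):
        self.records.extend(Records)
        return {"FailedRecordCount": 0}


kinesis = FakeKinesis()
count = process_object(
    FakeS3(), kinesis, "example-bucket", "raw/part-0001",
    "example-stream", flatten_csv_line,
)
print(count)
```

Passing the clients in as arguments keeps the logic testable without network access; inside Lambda the real boto3 clients would be passed instead.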

Serverless Example Implementation

3. Kinesis Analytics is used to write a sample to S3 and to run the query

[Diagram: Kinesis Stream → Kinesis Analytics (Sample, Query) → Kinesis Firehose (two delivery streams) → S3 (two buckets)]

Serverless Example Implementation

4. Athena is used for ad-hoc query and analysis

[Diagram: S3]
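
An ad-hoc Athena query can also be submitted from code. The sketch below assumes a client that behaves like `boto3.client("athena")` and uses its `start_query_execution` call. The database name, table name, query, and result location are placeholders, and a fake client stands in for AWS so the example runs locally.

```python
def run_athena_query(client, sql, database, output_location):
    """Submit a query to Athena and return its execution ID.

    `client` is expected to behave like boto3.client("athena").
    """
    response = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},  # S3 path for results
    )
    return response["QueryExecutionId"]


class FakeAthena:
    """Stand-in client that records submitted queries, so the sketch runs without AWS."""

    def __init__(self):
        self.submitted = []

    def start_query_execution(self, **kwargs):
        self.submitted.append(kwargs)
        return {"QueryExecutionId": "fake-execution-id"}


fake = FakeAthena()
execution_id = run_athena_query(
    fake,
    "SELECT source_country, COUNT(*) FROM events GROUP BY source_country",
    database="example_db",
    output_location="s3://example-bucket/athena-results/",
)
print(execution_id)
```

Athena runs queries asynchronously, so real code would poll `get_query_execution` until the query finishes and then read the rows with `get_query_results`.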
