MAKING BIG DATA COME ALIVE
AWS Serverless Architecture
Think Big
Garrett Holbrook, Data Engineer
Feb 1st, 2017
Agenda
• What is Think Big?
• Example Project Walkthrough
• AWS Serverless
Think Big, a Teradata Company
• Big Data Consulting
  – Roadmaps
  – Training
  – Strategy & Architecture
  – Implementation
• Acquired by Teradata in 2014
• Open source
© 2017 Think Big, a Teradata Company 2/1/17
About Us
• Garrett Holbrook
  – Graduated Neumont University with a BS in CS
  – With Think Big ~1 year
• Mike Forsyth
  – Graduated BYU with a BS in Computer Engineering
  – With Think Big since May 2016
• Max Goff
  – Think Big Academy
Example Implementation Walkthrough
• The company has a lot of data stored in an RDBMS
• The RDBMS is costly to manage and underperforms on certain queries
• They hope Hadoop can provide reduced costs and better performance
• They hired Think Big to help
Next Step
• Evaluate and prioritize use cases
• Install MapReduce and HDFS on their servers
• Write some MapReduce jobs
• Done?
Not that simple, unfortunately/fortunately
• The Hadoop ecosystem has a mind-boggling number of technologies
Hadoop Ecosystem (diagram)
Not that simple, unfortunately/fortunately
• The Hadoop ecosystem has a mind-boggling number of technologies
• Each of these technologies fulfills some business or technical need
• MapReduce is only the tip of the iceberg
Sqoop
• Built for efficiently transferring bulk data between HDFS and relational databases
Source: blogs.apache.org/sqoop
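A Sqoop bulk import is normally launched from the command line. As a sketch, the invocation can be assembled like this; the JDBC URL, table name, and HDFS directory below are made-up examples, not details from this project:

```python
# Sketch: assemble a hypothetical `sqoop import` command line.
# The connection details and table names here are illustrative only.
def build_sqoop_import(jdbc_url, username, table, target_dir, num_mappers=4):
    """Return the argv list for a bulk import from an RDBMS into HDFS."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,              # JDBC URL of the source database
        "--username", username,
        "--table", table,                   # source table to copy
        "--target-dir", target_dir,         # HDFS directory for the output files
        "--num-mappers", str(num_mappers),  # parallelism of the transfer
    ]

cmd = build_sqoop_import(
    "jdbc:mysql://db.example.com/events", "etl_user",
    "raw_events", "/data/raw/events")
print(" ".join(cmd))
```

Sqoop splits the work across the requested number of mappers, which is where the "efficiently transferring bulk data" comes from.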
Spark
• Engine for general distributed big data processing
• Accomplishes the same goal as MapReduce, but does it better
• The Spark API provides many functions in addition to map and reduce
Hive + Tez
• “Hadoop’s data warehouse”
• SQL is the language of Hive; it turns SQL queries into MapReduce jobs
• Newer versions (including the stable release) use Tez for better performance
• SQL skills carry over
• It is NOT a relational database, despite the use of SQL
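Because HiveQL stays close to standard SQL, existing skills transfer directly. Purely to illustrate the kind of query involved, here is the same idea against Python's built-in sqlite3 standing in for a Hive session; the table uses the deck's flattened event schema, and in Hive this GROUP BY would be compiled to MapReduce or Tez jobs:

```python
import sqlite3

# Stand-in table with the flattened event schema from the walkthrough.
# sqlite3 is used only so the example runs anywhere; the SQL itself is
# what would carry over (modulo dialect) to HiveQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events "
    "(source_country TEXT, event_type_code INT, event_timestamp INT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("JP", 5, 1484911141), ("US", 2, 1484914741), ("US", 5, 1484918341)])

# Count events per country -- Hive would distribute this aggregation.
counts = dict(conn.execute(
    "SELECT source_country, COUNT(*) FROM events GROUP BY source_country"))
print(counts)  # {'JP': 1, 'US': 2}
```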
Hue
• Web interface for data analysis on Hadoop
• SQL editor for use with Hive, Phoenix, etc.
• Spark notebooks
• Can serve as the main tool through which users access a Hadoop cluster
Hue (screenshot)
Source: gethue.com
Example Implementation
1. Sqoop to import data from the RDBMS into Hadoop

Relational Database (source):
event_id | source_location      | event_xml
15234    | 40.741895,-73.989308 | <Header><Event>…
15235    | 35.689487,139.691706 | <Header><Event>…
Example Implementation
2. Spark to flatten the XML data

Flattened output:
source_country | event_type_code | event_timestamp
JP             | 5               | 1484911141
US             | 2               | 1484914741
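The flattening step turns each row's nested event_xml into flat columns. A minimal per-record sketch using only the standard library; the tag names under the event are assumptions, since the slides only show `<Header><Event>…` and the flattened output schema. In Spark, a function like this would be applied with `map` across the distributed dataset:

```python
import xml.etree.ElementTree as ET

def flatten_event(event_xml):
    """Flatten one nested event document into a flat dict of columns.

    The element names below are hypothetical -- the deck shows only
    the <Header><Event> prefix and the flattened column names.
    """
    root = ET.fromstring(event_xml)
    event = root.find("Event")
    return {
        "source_country": event.findtext("SourceCountry"),
        "event_type_code": int(event.findtext("TypeCode")),
        "event_timestamp": int(event.findtext("Timestamp")),
    }

sample = """<Header>
  <Event>
    <SourceCountry>JP</SourceCountry>
    <TypeCode>5</TypeCode>
    <Timestamp>1484911141</Timestamp>
  </Event>
</Header>"""
row = flatten_event(sample)
print(row)
```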
Example Implementation
3a. Hive to write a sample of the data to a sample table (sample_table, stored as ORC)
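Writing the sample table amounts to selecting a small random subset of the rows; in Hive this could be done with a TABLESAMPLE clause or a rand()-filtered INSERT. A plain-Python sketch of the idea:

```python
import random

# Toy stand-in for the flattened event rows.
rows = [("JP", 5, 1484911141), ("US", 2, 1484914741),
        ("US", 5, 1484918341), ("DE", 1, 1484921941)]

random.seed(42)  # deterministic only so the example is repeatable
sample_table = random.sample(rows, k=2)  # keep 2 of the 4 rows
print(len(sample_table))
```

Sampling keeps exploratory queries cheap while the full table stays available for the real workload.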
Example Implementation
3b. Hive to run the SQL query using a distributed, scalable processing engine (results stored as ORC in query_result)
Example Implementation
4. Hue for visualization and analysis of sample_table and query_result by users, analysts, etc.
Administration
• Have a plan and engineers are ready; now what?
• Build out the cluster
  – Provision hardware: on-site or cloud?
  – Install Hadoop, Spark, Sqoop, Hue, Hive, and in reality many more…
  – Test cluster stability
  – Set up security
  – And more, all on open-source software…
Administration
• Things that make your life easier when building a cluster
  – Hadoop admins
  – Hadoop distributions
• Hadoop distributions: Hortonworks Data Platform (HDP) and Cloudera
  – Provide version compatibility
  – Support
  – Additional software
HDP (screenshot)
Source: hortonworks.com
What is AWS?
• Amazon Web Services is the leading cloud services provider
• What is cloud?
  – Renting servers
  – Redundant data storage
• AWS has a lot of services built on top of their cloud infrastructure
AWS EC2
• Elastic Compute Cloud (EC2) is a cloud service that lets you rent servers
• Define hardware details
  – For example, a t2.large instance gives you 2 vCPUs and 8 GB of memory
  – Specify how much storage you need
• Define the OS image
  – Red Hat, Ubuntu, Windows Server, etc.
• Launch the instance
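Those choices map directly onto the parameters of a launch request. As a sketch, here is the shape of the parameters that would be handed to a call such as boto3's `run_instances`; the AMI ID and volume size are made-up placeholders:

```python
# Sketch of launch parameters for one t2.large instance. With boto3
# these would be passed as ec2_client.run_instances(**params); the
# AMI ID below is a hypothetical placeholder, not a real image.
params = {
    "ImageId": "ami-0123456789abcdef0",  # placeholder OS image (e.g. Ubuntu)
    "InstanceType": "t2.large",          # 2 vCPUs, 8 GB of memory
    "MinCount": 1,
    "MaxCount": 1,
    "BlockDeviceMappings": [{
        "DeviceName": "/dev/sda1",
        "Ebs": {"VolumeSize": 100},      # illustrative 100 GB root volume
    }],
}
print(params["InstanceType"])
```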
AWS S3
• Simple Storage Service (S3) is a redundant data storage service
• Files are called objects in S3, and the containers that hold them are called buckets
• Charged per GB-month of data stored in S3
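The GB-month billing model makes storage cost a simple multiplication. A back-of-the-envelope sketch; the per-GB-month rate below is an assumed illustrative figure, not a quoted AWS price:

```python
# Back-of-the-envelope S3 storage cost estimate.
RATE_PER_GB_MONTH = 0.023  # assumed USD rate, for illustration only

def monthly_storage_cost(gigabytes, rate=RATE_PER_GB_MONTH):
    """Cost of keeping `gigabytes` of data stored for one month."""
    return gigabytes * rate

cost = monthly_storage_cost(500)  # 500 GB stored for one month
print(f"${cost:.2f}")  # $11.50
```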
AWS Kinesis
• Kinesis Streams
  – Distributed, fault-tolerant messaging queue
  – A fit for small, high-frequency records
• Kinesis Firehose
  – Writes streaming data directly to S3 and other AWS storage services
• Kinesis Analytics
  – Run SQL on a Kinesis stream
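The useful thing Firehose adds over writing to S3 directly is buffering: many small records become one larger object. A minimal sketch of that batching behavior (the flush threshold is an arbitrary illustrative value; real Firehose buffers by size and time):

```python
# Sketch of Firehose-style buffering: accumulate small records and
# flush them as one batch, standing in for one object written to S3.
class BatchingBuffer:
    def __init__(self, flush_at=3):
        self.flush_at = flush_at
        self.pending = []
        self.flushed_batches = []  # stand-in for objects landed in S3

    def put_record(self, record):
        self.pending.append(record)
        if len(self.pending) >= self.flush_at:
            self.flushed_batches.append(list(self.pending))
            self.pending.clear()

buf = BatchingBuffer(flush_at=3)
for i in range(7):
    buf.put_record({"event_id": i})
print(len(buf.flushed_batches), len(buf.pending))  # 2 1
```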
AWS Lambda
• Run code in the cloud without worrying about servers
• Define a function
  – Java, Node, Python, and now C#
• Define a trigger
  – A file put in S3
  – Data sent to a Kinesis stream
  – The Lambda function called directly
• AWS deploys your code and runs it whenever the function is triggered
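A Python Lambda function is just a handler with a fixed signature; for an S3 put trigger, the event carries the bucket and key of the new object. A minimal sketch, simulated locally with a hand-built event (the bucket and key values are made-up examples):

```python
import json

# Minimal sketch of a Python Lambda handler for an S3 put trigger.
# The event follows the S3 notification shape (Records -> s3 ->
# bucket/object); the values used below are illustrative.
def handler(event, context):
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    return {"statusCode": 200,
            "body": json.dumps({"bucket": bucket, "key": key})}

# Simulating the trigger locally instead of deploying:
fake_event = {"Records": [{"s3": {
    "bucket": {"name": "example-raw-events"},
    "object": {"key": "incoming/batch-0001.csv"},
}}]}
result = handler(fake_event, None)
print(result["statusCode"])  # 200
```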
What do we mean by serverless?
• Any cloud service where the details and operation of the servers are not exposed to the user of the service
  – Lambda
  – Kinesis
  – DynamoDB
  – S3
  – Athena
  – Not EC2
Where would you use this?
• In the implementation example, administration is a large inhibitor of success and development speed
• Even with Hadoop distributions and support, getting everything installed and configured correctly is a large effort
• Server administration is still a big part of Hadoop administration
  – OS updates
  – OS-level security
  – Space concerns (log files get out of hand)
• Support is costly, and the hours spent on administration are costly
• Capacity planning, especially if the cluster is on-site
  – Scaling based on load
• Serverless potentially alleviates these issues
Serverless Example Implementation
1. Records are written to a Kinesis Firehose delivery stream; Firehose batches up the records and puts them in an S3 bucket
(Relational Database → Firehose delivery stream → S3 bucket)
Serverless Example Implementation
2. A Lambda function triggers on the write to S3, flattens the records, and pushes the flattened rows to a Kinesis stream
(S3 bucket → Lambda flatten → Kinesis stream)
Serverless Example Implementation
3. Kinesis Analytics is used both to write a sample and to run the query, each flowing through its own Firehose delivery stream into an S3 bucket
(Kinesis stream → Kinesis Analytics sample and query → Firehose delivery streams → S3 buckets)
Serverless Example Implementation
4. Athena is used for ad-hoc query and analysis over the data in S3