Back to School - St. Louis Hadoop Meetup September 2016


© 2016 MapR Technologies

Back to School – Hadoop Ecosystem Overview

Matt Miller, Solutions Engineer
September 2016


Roadmap

• What is Apache?
• Hadoop Timeline and Level Set
• Hadoop Suite of tools
  1. Hive
  2. Sqoop
  3. Pig
  4. Oozie
  5. HBase
  6. Flume
  7. Kafka
  8. Drill
  9. YARN
  10. ZooKeeper
• Use Cases
• Q&A


What is Apache?

• Non-profit organization (the Apache Software Foundation)

• Governs the development of open source “Projects”

• “Top Level” projects are the most prominent

• Features “committers” from all over the world


Hadoop Timeline

• 2003: GFS white paper published
• 2004: MapReduce white paper published
• 2006: Hadoop is born (HDFS + MapReduce)
• 2007 - Present: Hadoop continuously evolves; new tools are released to improve usability and make it easier to adopt
• 2009: Hadoop distributions start popping up
• 2016: Organized chaos; new projects released every few months, and only the winners gain traction


What is Hadoop?

Distributed File System (HDFS) + Processing Engine (MapReduce)


What is MapReduce?

• Three-phase program built for distributed processing
  – Map
  – Shuffle/Sort
  – Reduce
• Processing overhead associated with MR jobs (~30 seconds)

• Heavy disk usage
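To make the three phases concrete, here is a minimal word-count sketch that simulates Map, Shuffle/Sort, and Reduce locally in Python. On a real cluster the same logic would live in mapper/reducer scripts (e.g. via Hadoop Streaming) or Java Mapper/Reducer classes, with input coming from HDFS; the sample data here is made up.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit a (key, value) pair for every word.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_sort(pairs):
    # Shuffle/Sort: the framework groups all values for the same key together.
    return groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))

def reduce_phase(grouped):
    # Reduce: aggregate the values for each key.
    for word, pairs in grouped:
        yield (word, sum(count for _, count in pairs))

if __name__ == "__main__":
    data = ["hadoop stores data", "hadoop processes data"]
    for word, count in reduce_phase(shuffle_sort(map_phase(data))):
        print(word, count)
```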


1.) Hive

• First SQL on Hadoop; HiveQL is the language

• Hadoop data warehousing tool

• Converts HiveQL into a Map Reduce job

• Bash, Java, and Python scripts can execute Hive commands

• Not ANSI SQL compliant, but VERY similar

Use Hive for long-running jobs, not ad-hoc queries
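As a minimal sketch of scripting Hive, the Python snippet below hands a HiveQL string to the hive CLI's -e option, which Hive then compiles into MapReduce jobs. It assumes the hive CLI is on the PATH; the sales table and columns are illustrative.

```python
import subprocess

# Hypothetical warehouse table -- HiveQL looks almost identical to ANSI SQL.
query = """
SELECT region, SUM(amount) AS total_sales
FROM sales
GROUP BY region
"""

# hive -e runs the query string; Hive turns it into one or more MR jobs.
result = subprocess.run(["hive", "-e", query], capture_output=True, text=True)
print(result.stdout)
```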


2.) Sqoop

• RDBMS connector for Hadoop

• Execute Sqoop scripts via the command line

• Sqoop can move Schemas, Tables, or Select statement results

• Helps improve ETL or enable data warehouse offload

Use Sqoop anytime data needs to move to/from an RDBMS
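A hedged sketch of a typical table import, driven from Python for consistency with the other examples; the JDBC URL, credentials, table name, and HDFS path are all placeholders for your environment.

```python
import subprocess

# Copy one RDBMS table into HDFS with four parallel map tasks.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost:3306/sales",   # source RDBMS (hypothetical)
    "--table", "orders",                             # table to copy
    "--username", "etl", "--password", "secret",
    "--target-dir", "/data/raw/orders",              # destination in HDFS
    "-m", "4",                                       # number of parallel mappers
], check=True)
```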


3.) Pig

• High-level coding language for processing data

• Language used to express data flows is called Pig Latin

• Pig turns data flows into a series of MR jobs

• Can run in a single JVM or on a Hadoop Cluster

• User Defined Functions (UDFs) make Pig code easy to repurpose

Use Pig to speed up the development process
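A small illustrative data flow: the Pig Latin script is written once, then run either in a single local JVM (-x local) or on the cluster, where Pig compiles it into a series of MapReduce jobs. The input file and field names below are made up.

```python
import subprocess

# A tiny Pig Latin data flow (paths and field names are illustrative).
script = """
logs    = LOAD 'web_logs.tsv' AS (user:chararray, url:chararray, bytes:int);
big     = FILTER logs BY bytes > 10000;
by_user = GROUP big BY user;
totals  = FOREACH by_user GENERATE group AS user, SUM(big.bytes) AS total_bytes;
DUMP totals;
"""

with open("flow.pig", "w") as f:
    f.write(script)

# -x local runs in a single JVM; drop it to run on the Hadoop cluster.
subprocess.run(["pig", "-x", "local", "flow.pig"], check=True)
```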


4.) Oozie

• Workflow Orchestration

• Schedule tasks to be completed based on time or completion of a previous task

• Used for Automation

• Develop these workflows either in a GUI or in XML
  – Hint: the GUI is much much MUCH simpler

Use Oozie when you need workflows
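If you do go the XML route, submission usually looks like the sketch below. It assumes a workflow.xml already uploaded to HDFS and a job.properties file whose oozie.wf.application.path points at it; the server URL and file names are illustrative.

```python
import subprocess

# Submit and start a workflow through the Oozie CLI.
subprocess.run([
    "oozie", "job",
    "-oozie", "http://oozie-host:11000/oozie",  # Oozie server URL (hypothetical)
    "-config", "job.properties",                # nameNode, workflow path, parameters
    "-run",                                     # submit and immediately start it
], check=True)
```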


5.) HBase

• Database built on HDFS

• Meant for big and fast data

• HBase is a NoSQL database
  – There are multiple types of NoSQL databases: wide-column stores, document DBs, graph DBs, and key-value stores
  – HBase is a wide-column store

Use HBase when “real-time read/write access to very large datasets” is required
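A minimal read/write sketch using the happybase client. It assumes an HBase Thrift server is running on localhost and that a players table with a stats column family already exists; all of those names are illustrative.

```python
import happybase

# Connect through the HBase Thrift gateway (assumed to be on localhost).
connection = happybase.Connection("localhost")
table = connection.table("players")

# Reads and writes are addressed by row key -- this is the "real-time
# read/write access" HBase is built for.
table.put(b"player:42", {b"stats:wins": b"317", b"stats:losses": b"12"})
row = table.row(b"player:42")
print(row[b"stats:wins"])
```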


6.) Flume

• Meant for ingesting streams of data

• Runs on the same cluster and stores data in HDFS
  – Also flexible enough to stream into HBase or Solr

• Flume PUSHES data to its destination

• Flume does NOT store data within itself

Use Flume when basic streaming is required
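Flume agents are defined in a properties file that wires sources, channels, and sinks together. The hedged sketch below writes out a minimal definition (a netcat source pushing events through a memory channel into an HDFS sink) and starts the agent; the agent name, port, and HDFS path are illustrative.

```python
import subprocess

# Minimal agent definition: source -> channel -> sink.
config = """
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

a1.channels.c1.type = memory

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events

a1.sources.r1.channels = c1
a1.sinks.k1.channel    = c1
"""

with open("agent.conf", "w") as f:
    f.write(config)

# Start the agent; it PUSHES whatever arrives on the source straight to HDFS.
subprocess.run(["flume-ng", "agent", "--conf-file", "agent.conf", "--name", "a1"])
```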


7.) Kafka

• …Also meant for ingesting streams of data

• Runs on its own cluster

• Kafka does not PUSH data to other places
  – Other places pull from Kafka

• Kafka streams the data in, then PUBLISHES it on its cluster; multiple consumers can SUBSCRIBE to that data and each get their own copy

Use Kafka for advanced streaming
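A produce/consume sketch using the kafka-python client, assuming a broker is reachable at localhost:9092; the topic name and payload are illustrative. Note that the consumer pulls from Kafka, and any number of consumers can subscribe independently.

```python
from kafka import KafkaProducer, KafkaConsumer

# Publish one event to a topic on the (assumed) local broker.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("game-events", b'{"player": 42, "result": "win"}')
producer.flush()

# Subscribe and pull; each consumer group gets its own copy of the stream.
consumer = KafkaConsumer("game-events",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.value)
    break  # just show one record in this sketch
```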


8.) Drill

• Flexible SQL tool

• Works with a lot of data types and storage platforms

• Does not require transformations to the data

• For ad-hoc analytics and performant queries on LARGE data sets

• Scales to thousands of nodes

Use Drill for data exploration and performant SQL
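Drill also exposes a REST endpoint (default port 8047), so an ad-hoc query needs nothing more than an HTTP POST; the file path below is illustrative, and the JSON file requires no transformation before being queried.

```python
import requests

# Query a raw JSON file sitting on the distributed file system via Drill's
# REST API (assumes a drillbit is running locally on the default port).
resp = requests.post(
    "http://localhost:8047/query.json",
    json={"queryType": "SQL",
          "query": "SELECT name, score FROM dfs.`/data/players.json` LIMIT 10"},
)
print(resp.json())
```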


9.) YARN

• Yet Another Resource Negotiator
• Helps you allocate resources (and enforce usage quotas) to multiple groups/users

10.) ZooKeeper

• Coordinates the distribution of jobs
• Handles partial failures
• Provides synchronization of jobs

Use YARN for multitenancy

ALWAYS use ZooKeeper with Hadoop
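A small coordination sketch using the kazoo client, assuming a running ZooKeeper ensemble; the hosts, znode path, and data are illustrative. Ephemeral znodes vanish when the process that created them dies, which is one way partial failures get detected and work gets handed to a surviving node.

```python
from kazoo.client import KazooClient

# Connect to the (assumed) ZooKeeper ensemble.
zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

# Claim leadership of a job with an ephemeral znode; it disappears
# automatically if this process crashes.
zk.create("/jobs/nightly-report/leader", b"worker-07",
          ephemeral=True, makepath=True)

value, stat = zk.get("/jobs/nightly-report/leader")
print(value)

zk.stop()
```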


Use Case 1: Expensive RDBMS

• Organization has 5 TB of sales data in an RDBMS ($$$)

• Currently 50 reports being generated regularly

• Largest report takes ~24 hours to generate

• Team only knows SQL

[Diagram: HDFS for storage, Sqoop for ingest from the RDBMS, Hive for the reports, Hive/Drill for SQL access]


Use Case 2: Customer 360 Data Lake/Hub

• 50 TB of customer data

• Data consists of everything from ERP data to JSON data from a REST API

• Four different business units need access to the data and they each have performance requirements

• Basic users need ad-hoc query capabilities

• Weekly jobs need to be kicked off during off hours

[Diagram: HDFS for the data lake, YARN for multitenancy across the four business units, Drill for ad-hoc queries, Oozie for the scheduled weekly jobs]


Use Case 3: Online Video Game Support

• Stats need to be updated milliseconds after the game finishes

• Player needs to be able to randomly look up other player stats in less than a second

• System can never go down or lose information

• Management wants to save this data so analytics can be run on these datasets.

[Diagram: Kafka/Flume & HBase for the stat updates, HBase for sub-second lookups, Kafka & HBase for durability, HDFS for saving the data for analytics]


Advice for those getting started…

• Don't try to hire a big data team; build from within

– MOTIVATED Linux and SQL people are enough to get started

• Target a legacy RDBMS and move ~80% of it to Hadoop
  – Quick win
  – Instant validation and justification if you can cut costs and improve speed at the same time

• Have fun


Additional Resources

• Full list of Hadoop ecosystem projects

• Books:
  – Hadoop: The Definitive Guide
  – Hadoop Application Architectures

• Free Training:
  – Coursera and edX
    • My favorite is a Python specialization series
  – learn.mapr.com
    • Free courses from 100 level to 400 level


Q & A

matthewmiller@mapr.com

Engage with us!
@mapr | maprtech | MapR | mapr-technologies