Back to School - St. Louis Hadoop Meetup September 2016


© 2016 MapR Technologies

Back to School – Hadoop Ecosystem Overview

Matt Miller, Solutions Engineer
September 2016


Roadmap

• What is Apache?
• Hadoop Timeline and Level Set
• Hadoop Suite of tools
  1. Hive
  2. Sqoop
  3. Pig
  4. Oozie
  5. HBase
  6. Flume
  7. Kafka
  8. Drill
  9. YARN
  10. ZooKeeper
• Use Cases
• Q&A


What is Apache?

• Non-profit organization (the Apache Software Foundation)

• Governs the development of open source “Projects”

• “Top Level” projects are the most prominent

• Features “committers” from all over the world


Hadoop Timeline

• 2003: GFS white paper published
• 2004: MapReduce white paper published
• 2006: Hadoop is born (HDFS + MapReduce)
• 2007 - Present: Hadoop continuously evolves; new tools are released to improve usability and make it easier to adopt
• 2009: Hadoop distributions start popping up
• 2016: Organized chaos; new projects released every few months, and only the winners gain traction


What is Hadoop?

Distributed File System (HDFS) + Processing Engine (MapReduce)


What is MapReduce?

• Three-phase program built for distributed processing
  – Map
  – Shuffle/Sort
  – Reduce
• Processing overhead associated with MR jobs (~30 seconds)

• Heavy disk usage
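To make the three phases concrete, here is a minimal word-count sketch that simulates Map, Shuffle/Sort, and Reduce locally in Python. On a real cluster the same logic would live in mapper/reducer scripts (e.g. via Hadoop Streaming) or Java Mapper/Reducer classes, with input coming from HDFS; the sample data here is made up.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit a (key, value) pair for every word.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_sort(pairs):
    # Shuffle/Sort: the framework groups all values for the same key together.
    return groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))

def reduce_phase(grouped):
    # Reduce: aggregate the values for each key.
    for word, pairs in grouped:
        yield (word, sum(count for _, count in pairs))

if __name__ == "__main__":
    data = ["hadoop stores data", "hadoop processes data"]
    for word, count in reduce_phase(shuffle_sort(map_phase(data))):
        print(word, count)
```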


1.) Hive

• First SQL on Hadoop; HiveQL is the language

• Hadoop data warehousing tool

• Converts HiveQL into a Map Reduce job

• Bash, Java, and Python scripts can execute Hive commands

• Not ANSI SQL compliant, but VERY similar

Use Hive for long-running jobs, not ad-hoc queries
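As a minimal sketch of scripting Hive, the Python snippet below hands a HiveQL string to the hive CLI's -e option, which Hive then compiles into MapReduce jobs. It assumes the hive CLI is on the PATH; the sales table and columns are illustrative.

```python
import subprocess

# Hypothetical warehouse table -- HiveQL looks almost identical to ANSI SQL.
query = """
SELECT region, SUM(amount) AS total_sales
FROM sales
GROUP BY region
"""

# hive -e runs the query string; Hive turns it into one or more MR jobs.
result = subprocess.run(["hive", "-e", query], capture_output=True, text=True)
print(result.stdout)
```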


2.) Sqoop

• RDBMS connector for Hadoop

• Execute Sqoop scripts via the command line

• Sqoop can move Schemas, Tables, or Select statement results

• Helps improve ETL or enable data warehouse offload

Use Sqoop anytime data needs to move to/from an RDBMS
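A hedged sketch of a typical table import, driven from Python for consistency with the other examples; the JDBC URL, credentials, table name, and HDFS path are all placeholders for your environment.

```python
import subprocess

# Copy one RDBMS table into HDFS with four parallel map tasks.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost:3306/sales",   # source RDBMS (hypothetical)
    "--table", "orders",                             # table to copy
    "--username", "etl", "--password", "secret",
    "--target-dir", "/data/raw/orders",              # destination in HDFS
    "-m", "4",                                       # number of parallel mappers
], check=True)
```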


3.) Pig

• High-level coding language for processing data

• Language used to express data flows is called Pig Latin

• Pig turns data flows into a series of MR jobs

• Can run in a single JVM or on a Hadoop Cluster

• User Defined Functions (UDFs) make Pig code easy to repurpose

Use Pig to speed up the development process
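A small illustrative data flow: the Pig Latin script is written once, then run either in a single local JVM (-x local) or on the cluster, where Pig compiles it into a series of MapReduce jobs. The input file and field names below are made up.

```python
import subprocess

# A tiny Pig Latin data flow (paths and field names are illustrative).
script = """
logs    = LOAD 'web_logs.tsv' AS (user:chararray, url:chararray, bytes:int);
big     = FILTER logs BY bytes > 10000;
by_user = GROUP big BY user;
totals  = FOREACH by_user GENERATE group AS user, SUM(big.bytes) AS total_bytes;
DUMP totals;
"""

with open("flow.pig", "w") as f:
    f.write(script)

# -x local runs in a single JVM; drop it to run on the Hadoop cluster.
subprocess.run(["pig", "-x", "local", "flow.pig"], check=True)
```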


4.) Oozie

• Workflow Orchestration

• Schedule tasks to be completed based on time or completion of a previous task

• Used for Automation

• Develop these workflows either in a GUI or in XML
  – Hint: the GUI is much much MUCH simpler

Use Oozie when you need workflows
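If you do go the XML route, submission usually looks like the sketch below. It assumes a workflow.xml already uploaded to HDFS and a job.properties file whose oozie.wf.application.path points at it; the server URL and file names are illustrative.

```python
import subprocess

# Submit and start a workflow through the Oozie CLI.
subprocess.run([
    "oozie", "job",
    "-oozie", "http://oozie-host:11000/oozie",  # Oozie server URL (hypothetical)
    "-config", "job.properties",                # nameNode, workflow path, parameters
    "-run",                                     # submit and immediately start it
], check=True)
```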


5.) HBase

• Database built on HDFS

• Meant for big and fast data

• HBase is a NoSQL database
  – There are multiple types of NoSQL databases: wide-column stores, document DBs, graph DBs, and key-value stores
  – HBase is a wide-column store

Use HBase when “real-time read/write access to very large datasets” is required
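A minimal read/write sketch using the happybase client. It assumes an HBase Thrift server is running on localhost and that a players table with a stats column family already exists; all of those names are illustrative.

```python
import happybase

# Connect through the HBase Thrift gateway (assumed to be on localhost).
connection = happybase.Connection("localhost")
table = connection.table("players")

# Reads and writes are addressed by row key -- this is the "real-time
# read/write access" HBase is built for.
table.put(b"player:42", {b"stats:wins": b"317", b"stats:losses": b"12"})
row = table.row(b"player:42")
print(row[b"stats:wins"])
```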


6.) Flume

• Meant for ingesting streams of data

• Runs on the same cluster and stores data in HDFS
  – Also flexible enough to stream into HBase or Solr

• Flume PUSHES data to its destination

• Flume does NOT store data within itself

Use Flume when basic streaming is required
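Flume agents are defined in a properties file that wires sources, channels, and sinks together. The hedged sketch below writes out a minimal definition (a netcat source pushing events through a memory channel into an HDFS sink) and starts the agent; the agent name, port, and HDFS path are illustrative.

```python
import subprocess

# Minimal agent definition: source -> channel -> sink.
config = """
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

a1.channels.c1.type = memory

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events

a1.sources.r1.channels = c1
a1.sinks.k1.channel    = c1
"""

with open("agent.conf", "w") as f:
    f.write(config)

# Start the agent; it PUSHES whatever arrives on the source straight to HDFS.
subprocess.run(["flume-ng", "agent", "--conf-file", "agent.conf", "--name", "a1"])
```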


7.) Kafka

• …Also meant for ingesting streams of data

• Runs on its own cluster

• Kafka does not PUSH data to other places
  – Other places pull from Kafka

• Kafka streams the data in, then PUBLISHES it on its cluster; multiple consumers can SUBSCRIBE to that data and each get their own copy

Use Kafka for advanced streaming
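A produce/consume sketch using the kafka-python client, assuming a broker is reachable at localhost:9092; the topic name and payload are illustrative. Note that the consumer pulls from Kafka, and any number of consumers can subscribe independently.

```python
from kafka import KafkaProducer, KafkaConsumer

# Publish one event to a topic on the (assumed) local broker.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("game-events", b'{"player": 42, "result": "win"}')
producer.flush()

# Subscribe and pull; each consumer group gets its own copy of the stream.
consumer = KafkaConsumer("game-events",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.value)
    break  # just show one record in this sketch
```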


8.) Drill

• Flexible SQL tool

• Works with a lot of data types and storage platforms

• Does not require transformations to the data

• For ad-hoc analytics and performant queries on LARGE data sets

• Scales to thousands of nodes

Use Drill for data exploration and performant SQL
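Drill also exposes a REST endpoint (default port 8047), so an ad-hoc query needs nothing more than an HTTP POST; the file path below is illustrative, and the JSON file requires no transformation before being queried.

```python
import requests

# Query a raw JSON file sitting on the distributed file system via Drill's
# REST API (assumes a drillbit is running locally on the default port).
resp = requests.post(
    "http://localhost:8047/query.json",
    json={"queryType": "SQL",
          "query": "SELECT name, score FROM dfs.`/data/players.json` LIMIT 10"},
)
print(resp.json())
```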


9.) YARN

• Yet Another Resource Negotiator
• Helps you allocate resources (and enforce usage quotas) to multiple groups/users

10.) ZooKeeper

• Coordinates the distribution of jobs
• Handles partial failures
• Provides synchronization of jobs

Use YARN for multitenancy

ALWAYS use ZooKeeper with Hadoop
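A small coordination sketch using the kazoo client, assuming a running ZooKeeper ensemble; the hosts, znode path, and data are illustrative. Ephemeral znodes vanish when the process that created them dies, which is one way partial failures get detected and work gets handed to a surviving node.

```python
from kazoo.client import KazooClient

# Connect to the (assumed) ZooKeeper ensemble.
zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

# Claim leadership of a job with an ephemeral znode; it disappears
# automatically if this process crashes.
zk.create("/jobs/nightly-report/leader", b"worker-07",
          ephemeral=True, makepath=True)

value, stat = zk.get("/jobs/nightly-report/leader")
print(value)

zk.stop()
```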


Use Case 1: Expensive RDBMS

• Organization has 5 TB of sales data in an RDBMS ($$$)

• Currently 50 reports being generated regularly

• Largest report takes ~24 hours to generate

• Team only knows SQL

[Diagram: HDFS for storage, Sqoop for ingest from the RDBMS, Hive for the reports, Hive/Drill for SQL access]


Use Case 2: Customer 360 Data Lake/Hub

• 50 TB of customer data

• Data consists of everything from ERP data to JSON data from a REST API

• Four different business units need access to the data and they each have performance requirements

• Basic users need ad-hoc query capabilities

• Weekly jobs need to be kicked off during off hours

[Diagram: HDFS for the data lake, YARN for multitenancy across the four business units, Drill for ad-hoc queries, Oozie for the scheduled weekly jobs]


Use Case 3: Online Video Game Support

• Stats need to be updated milliseconds after the game finishes

• Player needs to be able to randomly look up other player stats in less than a second

• System can never go down or lose information

• Management wants to save this data so analytics can be run on these datasets.

[Diagram: Kafka/Flume & HBase for the stat updates, HBase for sub-second lookups, Kafka & HBase for durability, HDFS for saving the data for analytics]


Advice for those getting started…

• Don't try to hire a big data team; build from within

– MOTIVATED Linux and SQL people are enough to get started

• Target a legacy RDBMS and move ~80% of it to Hadoop
  – Quick win
  – Instant validation and justification if you can cut costs and improve speed at the same time

• Have fun


Additional Resources

• Full list of Hadoop ecosystem projects

• Books:
  – Hadoop: The Definitive Guide
  – Hadoop Application Architectures

• Free Training:
  – Coursera and edX
    • My favorite is a Python specialization series
  – learn.mapr.com
    • Free courses from 100 level to 400 level


Q & A

matthewmiller@mapr.com

Engage with us!
@mapr | maprtech | MapR | mapr-technologies