
The Big Data Ecosystem: Data Scientist's Paradise
Tuesday, August 26, 2014 | 6 E Bay Street, Jacksonville

(Ravi Nair, Jax Big Data)

www.meetup.com/jaxbigdata

Rules

- There are twelve horses and five fences.
- The race is announced.
- The starting gate holds the first six horses to arrive.
- A gun is fired to start the race.
- The time taken between fences is 2-6 seconds (sleep).
- No two horses jump the same fence at the same time.
- No two horses cross the finish line at the same time.
- A commentator announces: each horse jumping a fence, the first three horses to cross the finish line, and when all horses have finished the race.

Let's solve this problem. Let me implement and run it in front of you; it will form the basis for our future discussion.

Steeplechase Demo
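The original demo was run live; as a stand-in, here is a minimal Python sketch of the rules above using threads, an event for the starting gun, and one lock per fence. The "gate holds the first six" rule is omitted for brevity, and the 2-6 second gallops are scaled down to milliseconds so the race finishes quickly.

```python
import random
import threading
import time

NUM_HORSES = 12
NUM_FENCES = 5

fence_locks = [threading.Lock() for _ in range(NUM_FENCES)]  # one lock per fence
finish_lock = threading.Lock()   # only one horse may cross the line at a time
start_gun = threading.Event()    # fired once to release every horse
finish_order = []                # the commentator's record of finishers

def run_horse(horse_id):
    start_gun.wait()                              # wait for the gun
    for fence in range(NUM_FENCES):
        time.sleep(random.uniform(0.002, 0.006))  # stand-in for the 2-6 s gallop
        with fence_locks[fence]:                  # no two horses jump the same fence at once
            print(f"Horse {horse_id} jumps fence {fence + 1}")
    with finish_lock:                             # no two horses cross the line at once
        finish_order.append(horse_id)
        if len(finish_order) <= 3:
            print(f"Horse {horse_id} finishes in position {len(finish_order)}!")

horses = [threading.Thread(target=run_horse, args=(h,)) for h in range(1, NUM_HORSES + 1)]
for horse in horses:
    horse.start()
print("The race is announced... bang!")
start_gun.set()                                   # fire the starting gun
for horse in horses:
    horse.join()
print(f"All {len(finish_order)} horses have finished the race")
```

The per-fence locks and the finish-line lock are exactly the synchronization points the rules call out; everything else is ordinary thread bookkeeping.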


Challenges

- Heterogeneity
- Latency
- Remote memory vs. local memory
- Synchronization
- Partial failure

Applications need to adapt gracefully in the face of partial failure. Lamport once defined a distributed system as "one on which I cannot get any work done because some machine I have never heard of has crashed."


Elephant to the Rescue

Thanks to Doug Cutting, Tom White, Arun Murthy, et al.


MapReduce

- A simple data-parallel programming model designed for scalability and fault tolerance
- Pioneered by Google, which processes 20 petabytes of data per day with it
- Popularized by the open-source Hadoop project
- Used at Yahoo!, Facebook, Amazon, ...


MapReduce Use

- At Google: index construction for Google Search, article clustering for Google News, statistical machine translation
- At Yahoo!: the "web map" powering Yahoo! Search, spam detection for Yahoo! Mail
- At Facebook: data mining, ad optimization, spam detection


MapReduce Design Goals

- Scalability to large data volumes: thousands of machines, tens of thousands of disks
- Cost-efficiency: commodity machines (cheap but unreliable), a commodity network, automatic fault tolerance (fewer admins), and ease of use (fewer programmers)


Commodity Hardware

Typically a two-level architecture:
- Nodes are commodity PCs
- 30-40 nodes per rack
- Uplink from each rack is 3-4 gigabit
- Rack-internal bandwidth is 1 gigabit

Hadoop Cluster

Courtesy: http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/aw-apachecon-eu-2009.pdf


MapReduce In Action

[Diagram: word count in MapReduce]
Input (three records): "the quick brown fox", "the fox ate the mouse", "how now brown cow"
Map output: (the,1) (quick,1) (brown,1) (fox,1) | (the,1) (fox,1) (ate,1) (the,1) (mouse,1) | (how,1) (now,1) (brown,1) (cow,1)
Reduce output: (ate,1) (brown,2) (cow,1) (fox,2) (how,1) (mouse,1) (now,1) (quick,1) (the,3)

Pipeline: Input -> Map -> Shuffle & Sort -> Reduce -> Output
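The word-count flow above is easy to mimic in a few lines of plain Python. This sketch is purely illustrative (it is not the Hadoop API): each function stands in for one phase of the pipeline.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit (word, 1) for every word of every input record
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle_sort(pairs):
    # Shuffle & sort: bring all values for the same key together, sorted by key
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return sorted(groups.items())

def reduce_phase(grouped):
    # Reduce: sum the counts for each word
    for word, counts in grouped:
        yield word, sum(counts)

lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
counts = dict(reduce_phase(shuffle_sort(map_phase(lines))))
```

Running this on the three input records from the diagram yields the same counts the reducers produce, e.g. the appears 3 times and brown twice.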


One More Example – Inverted Index

Input: (filename, text) records
Output: the list of files containing each word
Map: for each word in text.split(): output(word, filename)
Combine: uniquify the filenames for each word
Reduce: def reduce(word, filenames): output(word, sort(filenames))
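The map/combine/reduce pseudocode above can be condensed into a small Python sketch; a set plays the role of the combiner (uniquifying filenames), and the final dict comprehension plays the reducer (sorting the file list per word).

```python
def inverted_index(files):
    # files: {filename: text}; returns {word: sorted list of filenames}
    index = {}
    # Map: emit (word, filename); Combine: the set uniquifies filenames per word
    for name, text in files.items():
        for word in text.split():
            index.setdefault(word, set()).add(name)
    # Reduce: output each word with its sorted file list
    return {word: sorted(names) for word, names in index.items()}

files = {"hamlet.txt": "to be or not to be",
         "12th.txt": "be not afraid of greatness"}
index = inverted_index(files)
```

On the two sample files from the flow diagram, "be" maps to both files while "afraid" maps only to 12th.txt.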


Inverted Index - Flow

[Diagram: inverted index flow]
Inputs: hamlet.txt ("to be or not to be"), 12th.txt ("be not afraid of greatness")
Map output: (to, hamlet.txt) (be, hamlet.txt) (or, hamlet.txt) (not, hamlet.txt); (be, 12th.txt) (not, 12th.txt) (afraid, 12th.txt) (of, 12th.txt) (greatness, 12th.txt)
Reduce output: afraid, (12th.txt); be, (12th.txt, hamlet.txt); greatness, (12th.txt); not, (12th.txt, hamlet.txt); of, (12th.txt); or, (hamlet.txt); to, (hamlet.txt)


I want more bread

Many parallel algorithms can be expressed as a series of MapReduce jobs. But MapReduce is fairly low-level: you must think about keys, values, partitioning, etc. Can we capture common "job building blocks"?


Pig

Started at Yahoo! Research; runs about 30% of Yahoo!'s jobs.

Features:
- Expresses sequences of MapReduce jobs
- Data model: nested "bags" of items
- Provides relational (SQL) operators (JOIN, GROUP BY, etc.)
- Easy to plug in Java functions
- Pig Pen development environment for Eclipse


Example - Pig

Problem:

Suppose you have logged-in user data in one file and tweet data in another, and you need to find the top 25 trending topics among users aged 18-35.

1. Load users
2. Load tweet data/topics
3. Filter by age
4. Join on name
5. Group on topic
6. Count topics
7. Order by count
8. Take top 25


Pig Translation

Each step of the pipeline translates directly into one Pig Latin statement:

Users = load ...
Filtered = filter ...
Topics = load ...
Joined = join ...
Grouped = group ...
Summed = ... count() ...
Sorted = order ...
Top25 = limit ...


Pig Latin

Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 35;
Topics = load 'topics' as (user, topic);
Joined = join Filtered by name, Topics by user;
Grouped = group Joined by topic;
Summed = foreach Grouped generate group, COUNT(Joined) as frequencies;
Sorted = order Summed by frequencies desc;
Top25 = limit Sorted 25;
store Top25 into 'top25trends';
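For intuition, the same filter-join-group-count-order-limit pipeline can be sketched in plain Python on toy data. The tiny users and topics lists below are hypothetical stand-ins for the 'users' and 'topics' inputs, not data from the talk.

```python
from collections import Counter

# Hypothetical stand-ins for the 'users' and 'topics' inputs
users = [("alice", 20), ("bob", 40), ("carol", 22)]
topics = [("alice", "hadoop"), ("carol", "hadoop"), ("alice", "pig"), ("bob", "hive")]

filtered = {name for name, age in users if 18 <= age <= 35}             # filter by age
joined = [(user, topic) for user, topic in topics if user in filtered]  # join on name
summed = Counter(topic for _, topic in joined)                          # group + count
top25 = summed.most_common(25)                                          # order desc + limit
```

Pig's value is that it runs exactly this logic as a sequence of MapReduce jobs over data far too large for one machine's memory.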


Hive | Introduction

Brings structured data to HDFS. Developed at Facebook and used for the majority of Facebook jobs. A "relational database" built on Hadoop:

- Maintains a list of table schemas
- SQL-like query language (HQL)
- Can call Hadoop Streaming scripts from HQL
- Supports table partitioning, clustering, complex data types, and some optimizations


Hive | Sample Hive Queries

• Find top 5 pages visited by users aged 18-25:

SELECT p.url, COUNT(1) AS clicks
FROM users u JOIN page_views p ON (u.name = p.user)
WHERE u.age >= 18 AND u.age <= 25
GROUP BY p.url
ORDER BY clicks DESC
LIMIT 5;
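The join-filter-group-order logic of that query can be mimicked in a few lines of plain Python; the tiny users and page_views structures below are hypothetical in-memory stand-ins for the Hive tables.

```python
from collections import Counter

# Hypothetical in-memory stand-ins for the users and page_views tables
users = {"alice": 20, "bob": 30, "carol": 22}          # name -> age
page_views = [("alice", "/home"), ("alice", "/news"),
              ("carol", "/home"), ("bob", "/home")]    # (user, url)

# JOIN + WHERE + GROUP BY in one pass; then ORDER BY clicks DESC ... LIMIT 5
clicks = Counter(url for user, url in page_views if 18 <= users.get(user, 0) <= 25)
top5 = clicks.most_common(5)
```

Hive compiles the HQL version of this into MapReduce jobs so the same logic scales to tables that don't fit in memory.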

• Filter page views through a Python script:

SELECT TRANSFORM(p.user, p.date)
USING 'map_script.py'
AS dt, uid
CLUSTER BY dt
FROM page_views p;
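The deck does not show the contents of map_script.py; the sketch below is a hypothetical body consistent with the query: Hive streams tab-separated (user, date) rows to the script's stdin, and the AS clause names the emitted columns (dt, uid), so the script simply swaps the fields.

```python
# Hypothetical body for 'map_script.py' (its real contents are not shown in the deck).
def transform(lines):
    # Each input line is a tab-separated (user, date) row from page_views;
    # emit (date, user) so Hive can name the outputs dt and uid.
    for line in lines:
        user, date = line.rstrip("\n").split("\t")
        yield f"{date}\t{user}"

# A real streaming script would loop over sys.stdin and print each output row.
rows = list(transform(["alice\t2014-08-26\n", "bob\t2014-08-25\n"]))
```

CLUSTER BY dt then distributes and sorts the transformed rows by the emitted date column.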


Hadoop 1 and Hadoop 2


Hadoop Daemons


Hadoop 1

- Limited to about 4,000 nodes per cluster
- Scales as O(number of tasks in the cluster)
- JobTracker is a bottleneck: resource management, job scheduling, and monitoring
- Only one namespace for managing HDFS
- Map and Reduce slots are static
- The only job type that runs is MapReduce

© Hortonworks Inc. 2011


Hadoop MapReduce Classic

- JobTracker: manages cluster resources and job scheduling
- TaskTracker: per-node agent that manages tasks

Hadoop 1 - Reading Files

[Diagram: the Hadoop client asks the NameNode (fsimage/edits) to read a file; the NameNode returns the DataNodes, block ids, etc., and the client reads the blocks directly from DataNode/TaskTracker (DN | TT) nodes spread across racks. DataNodes send heartbeats and block reports to the NameNode, and the Secondary NameNode periodically checkpoints the namespace.]

Hadoop 1 - Writing Files

[Diagram: the Hadoop client requests a write from the NameNode (fsimage/edits); the NameNode returns the target DataNodes, and the client writes blocks to a DataNode, which replicates them across racks through a replication pipeline. DataNodes send block reports, and the Secondary NameNode performs checkpoints.]

Hadoop 1 - Running Jobs

[Diagram: the Hadoop client submits a job to the JobTracker, which deploys map and reduce tasks onto DataNode/TaskTracker nodes across the racks; map output is shuffled to the reducers, which write the final output (part 0).]

Hadoop 1 - APIs

org.apache.hadoop.mapreduce.Partitioner

org.apache.hadoop.mapreduce.Mapper

org.apache.hadoop.mapreduce.Reducer

org.apache.hadoop.mapreduce.Job

Hadoop 2

- Potentially up to 10,000 nodes per cluster
- Scales as O(cluster size)
- Supports multiple namespaces for managing HDFS
- Efficient cluster utilization (YARN)
- Backward- and forward-compatible with MRv1
- Any application can integrate with Hadoop; goes beyond Java

YARN: A new abstraction layer

HADOOP 1.0 (single-use system: batch apps)
- HDFS (redundant, reliable storage)
- MapReduce (cluster resource management & data processing)

HADOOP 2.0 (multi-purpose platform: batch, interactive, online, streaming, ...)
- HDFS2 (redundant, reliable storage)
- YARN (cluster resource management)
- MapReduce and others (data processing)

YARN Architecture


Requirements

- Reliability
- Availability
- Utilization
- Wire compatibility
- Agility & evolution: the ability for customers to control upgrades to the grid software stack
- Scalability: clusters of 6,000-10,000 machines, each with 16 cores, 48 GB/96 GB RAM, and 24 TB/36 TB of disk; 100,000+ concurrent tasks; 10,000 concurrent jobs


Architecture

- ResourceManager: global resource scheduler with hierarchical queues
- NodeManager: per-machine agent that manages the life-cycle of containers and monitors container resources
- ApplicationMaster: one per application; manages application scheduling and task execution (e.g., the MapReduce ApplicationMaster)


Hadoop 2 - Reading Files (w/ NN Federation)

[Diagram: with NameNode federation, several NameNodes (NN1/ns1 through NN4/ns4) each manage their own namespace and block pool (e.g., ns1 -> dn1, dn2; ns2 -> dn1, dn3) over a shared set of DataNode/NodeManager (DN | NM) nodes. The client reads a file by asking the owning NameNode, which returns the DataNodes, block ids, etc.; DataNodes register, heartbeat, and send block reports; each NameNode has its own Secondary NameNode for checkpoints, or a Backup NameNode kept current via fs sync.]

Hadoop 2 - Writing Files

[Diagram: the client requests a write from the owning NameNode, which returns the target DataNodes; the client writes blocks to a DataNode/NodeManager node, which replicates them through a replication pipeline. Block reports, checkpoints, and the per-NameNode Secondary or Backup NameNode work as in the read path.]

Hadoop 2 - Running Jobs

[Diagram: each Hadoop client submits an application to the ResourceManager, whose ApplicationsManager (ASM) creates a per-application ApplicationMaster (AM1, AM2) on a NodeManager. The Scheduler partitions cluster resources into queues; each ApplicationMaster negotiates containers (C1.1 ... C2.3) from the ResourceManager, NodeManagers launch and report on them, and the ApplicationMasters send status reports back to the ResourceManager.]

Hadoop 2 - Security

[Diagram: REST clients, JDBC clients, and browsers (HUE) reach the Hadoop cluster through a firewalled DMZ via a Knox Gateway cluster; authentication integrates with LDAP/AD, a Kerberos KDC, and an enterprise/cloud SSO provider, while native Hive/HBase encryption protects data inside the cluster.]

Hadoop 2 - APIs

org.apache.hadoop.yarn.api.ApplicationClientProtocol

org.apache.hadoop.yarn.api.ApplicationMasterProtocol

org.apache.hadoop.yarn.api.ContainerManagementProtocol


- MapReduce is the processing model within any Hadoop ecosystem.
- Whenever you execute actions against Hadoop/Hive, MapReduce is invoked.
- This is (1) complex, (2) time-consuming, and (3) difficult to learn and debug.
- We often need tools/engines that can efficiently reduce this complexity.

Live Demo Objectives:
1. A Hive table is queried using standard Hive query language.
2. We see that a MapReduce program is executed in the background.
3. We use Presto to avoid MapReduce and get the data faster.
4. We connect R to Hive through Presto to retrieve data from Hive without MapReduce.

Flow: Hive table -> Presto -> R (no MapReduce)

R http://www.r-project.org/

1. R is a language and environment for statistical computing and graphics (similar to SAS, but open source).
2. R is supplied with a variety of packages for various analytical purposes.
3. By writing code in R, you can create customized actions, one of which is to connect to big data.

Presto http://prestodb.io

1. An open-source query engine against GBs to PBs of data.
2. Developed by Facebook, which also developed Hive.
3. Multiple data sources will be supported in the near future.
4. Currently used at Facebook to query their 300 PB Hive data warehouse.

My favourites – R and Presto


Please submit your feedback at www.meetup.com/jaxbigdata.

Thank you for the opportunity!
