The Big Data Ecosystem: Data Scientist’s Paradise
August 26, 2014 (Tuesday), 6 E Bay Street, Jacksonville
Ravi Nair, Jax Big Data
www.meetup.com/jaxbigdata

Source: files.meetup.com/1802328/The Big Analytics Ecosystem-Data Scientist's Paradise.pdf


Steeplechase Demo

Rules

  • There are twelve horses
  • There are five fences
  • The race is announced
  • The starting gate holds the first six to arrive
  • A gun is fired to start the race
  • The time taken between fences is 2-6 seconds (sleep)
  • No two horses jump the same fence at the same time
  • No two horses cross the finish line at the same time
  • A commentator announces:
      – Each horse jumping a fence
      – The first three horses to cross the finish line
      – When all horses have finished the race

Let's solve this problem

Let me implement and run this in front of you; it will form the basis for our future discussion.
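The demo code itself is not part of the slides. As a rough sketch only, the rules could be modelled with Python threads: a semaphore for the starting gate, an event for the gun, and one lock per fence and for the finish line. The function name `run_race`, the `time_scale` parameter, and the reading that only the six horses that get a gate slot actually run are assumptions made for illustration.

```python
import random
import threading
import time


def run_race(num_horses=12, num_fences=5, gate_size=6,
             min_gap=2.0, max_gap=6.0, time_scale=1.0, seed=None):
    """Run one race and return (commentary, finish_order)."""
    rng = random.Random(seed)
    # Draw every delay up front so worker threads never share the RNG.
    delays = [[rng.uniform(min_gap, max_gap) for _ in range(num_fences)]
              for _ in range(num_horses)]

    commentary = []
    finish_order = []
    mic = threading.Lock()                       # one announcement at a time
    gate = threading.Semaphore(gate_size)        # starting gate slots
    arrived = threading.Barrier(num_horses + 1)  # all horses plus the starter
    gun = threading.Event()
    fences = [threading.Lock() for _ in range(num_fences)]
    finish_line = threading.Lock()

    def announce(message):
        with mic:
            commentary.append(message)

    def horse(number):
        in_gate = gate.acquire(blocking=False)   # first gate_size to arrive
        arrived.wait()
        if not in_gate:
            return                               # assumption: gate full, no run
        gun.wait()
        for fence_no, delay in enumerate(delays[number - 1], start=1):
            time.sleep(delay * time_scale)       # time taken between fences
            with fences[fence_no - 1]:           # one horse per fence at a time
                announce(f"Horse {number} jumps fence {fence_no}")
        with finish_line:                        # one horse across at a time
            finish_order.append(number)
            place = len(finish_order)
            if place <= 3:
                announce(f"Horse {number} finishes in place {place}")

    announce("The race is announced")
    threads = [threading.Thread(target=horse, args=(n,))
               for n in range(1, num_horses + 1)]
    for t in threads:
        t.start()
    arrived.wait()                               # every horse reached the gate
    gun.set()                                    # the starting gun
    for t in threads:
        t.join()
    announce("All horses have finished the race")
    return commentary, finish_order


if __name__ == "__main__":
    events, order = run_race(time_scale=0.01, seed=42)
    print("\n".join(events))
    print("Finish order:", order)
```

Setting `time_scale=1.0` gives the 2-6 second gaps from the rules; smaller values only speed up the demonstration. On a single machine, locks and events are enough. The challenges on the next slide are what change once the horses run on different machines.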


Challenges

  • Heterogeneity
  • Latency
  • Remote memory vs. local memory
  • Synchronization
  • Partial failure

Applications need to adapt gracefully in the face of partial failure. Lamport once defined a distributed system as “one on which I cannot get any work done because some machine I have never heard of has crashed.”


Elephant to the Rescue

Thanks to Doug Cutting, Tom White, Murthy et al.


MapReduce

• Simple data-parallel programming model designed for scalability and fault-tolerance

• Pioneered by Google
    – Processes 20 petabytes of data per day

• Popularized by the open-source Hadoop project

• Used at Yahoo!, Facebook, Amazon, …


MapReduce Use

• At Google:
    – Index construction for Google Search
    – Article clustering for Google News
    – Statistical machine translation

• At Yahoo!:
    – “Web map” powering Yahoo! Search
    – Spam detection for Yahoo! Mail

• At Facebook:
    – Data mining
    – Ad optimization
    – Spam detection


MapReduce Design Goals

• Scalability to large data volumes: 1000’s of machines, 10,000’s of disks

• Cost-efficiency:
    – Commodity machines (cheap, but unreliable)
    – Commodity network
    – Automatic fault-tolerance (fewer admins)
    – Easy to use (fewer programmers)


Commodity Hardware


Typically in a 2-level architecture
  – Nodes are commodity PCs
  – 30-40 nodes/rack
  – Uplink from rack is 3-4 gigabit
  – Rack-internal is 1 gigabit


Hadoop Cluster

Courtesy: http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/aw-apachecon-eu-2009.pdf


MapReduce In Action

Input → Map → Shuffle & Sort → Reduce → Output

Input (one split per Map task):
  1. the quick brown fox
  2. the fox ate the mouse
  3. how now brown cow

Map output:
  Map 1: (the, 1) (brown, 1) (fox, 1) (quick, 1)
  Map 2: (the, 1) (fox, 1) (the, 1) (ate, 1) (mouse, 1)
  Map 3: (how, 1) (now, 1) (brown, 1) (cow, 1)

Reduce output:
  Reduce 1: (brown, 2) (fox, 2) (how, 1) (now, 1) (the, 3)
  Reduce 2: (ate, 1) (cow, 1) (mouse, 1) (quick, 1)
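The same flow can be imitated in a few lines of plain Python. This is a single-process sketch rather than Hadoop code: the function names are made up for illustration, and the byte-sum partitioning rule is a stand-in for the key hashing a real cluster would use, so words may land on different reducers than in the slide.

```python
from collections import defaultdict


def map_phase(split):
    """Map: emit (word, 1) for every word in one input split."""
    return [(word, 1) for word in split.split()]


def shuffle(pairs, num_reducers):
    """Shuffle & sort: route each key to one reducer and group its values."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for word, count in pairs:
        # Hypothetical, deterministic partitioning rule for this sketch.
        target = sum(word.encode()) % num_reducers
        partitions[target][word].append(count)
    return partitions


def reduce_phase(partition):
    """Reduce: sum the grouped counts for each word, in key order."""
    return {word: sum(counts) for word, counts in sorted(partition.items())}


def word_count(splits, num_reducers=2):
    pairs = [pair for split in splits for pair in map_phase(split)]
    merged = {}
    for partition in shuffle(pairs, num_reducers):
        merged.update(reduce_phase(partition))
    return merged


splits = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
counts = word_count(splits)
print(counts["the"], counts["brown"], counts["fox"])
```

Each key is seen by exactly one reducer, which is what lets the reducers run independently of each other.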


One More Example – Inverted Index

Input: (filename, text) records
Output: list of files containing each word
Map:
    foreach word in text.split():
        output(word, filename)
Combine: uniquify filenames for each word
Reduce:
    def reduce(word, filenames):
        output(word, sort(filenames))


Inverted Index - Flow

Input files:
  hamlet.txt: to be or not to be
  12th.txt: be not afraid of greatness

Map output:
  hamlet.txt: (to, hamlet.txt) (be, hamlet.txt) (or, hamlet.txt) (not, hamlet.txt)
  12th.txt: (be, 12th.txt) (not, 12th.txt) (afraid, 12th.txt) (of, 12th.txt) (greatness, 12th.txt)

Reduce output:
  afraid, (12th.txt)
  be, (12th.txt, hamlet.txt)
  greatness, (12th.txt)
  not, (12th.txt, hamlet.txt)
  of, (12th.txt)
  or, (hamlet.txt)
  to, (hamlet.txt)
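A runnable version of the inverted-index pseudocode, again as a single-process Python sketch rather than a Hadoop job. The names `map_fn`, `combine_fn`, `reduce_fn`, and `inverted_index` are illustrative, and the two records are the files from the flow slide.

```python
from collections import defaultdict


def map_fn(filename, text):
    """Map: emit (word, filename) for every word in the file."""
    for word in text.split():
        yield word, filename


def combine_fn(pairs):
    """Combine: drop duplicate (word, filename) pairs from one mapper."""
    return sorted(set(pairs))


def reduce_fn(word, filenames):
    """Reduce: output the word with its sorted list of files."""
    return word, sorted(set(filenames))


def inverted_index(records):
    grouped = defaultdict(list)          # stands in for shuffle & sort
    for filename, text in records:
        for word, fname in combine_fn(map_fn(filename, text)):
            grouped[word].append(fname)
    return dict(reduce_fn(w, f) for w, f in sorted(grouped.items()))


records = [
    ("hamlet.txt", "to be or not to be"),
    ("12th.txt", "be not afraid of greatness"),
]
index = inverted_index(records)
for word, files in index.items():
    print(word, files)
```

The combine step matters for "to be or not to be": without it the mapper would send (to, hamlet.txt) and (be, hamlet.txt) across the network twice.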


I want more bread

  • Many parallel algorithms can be expressed by a series of MapReduce jobs
  • But MapReduce is fairly low-level: must think about keys, values, partitioning, etc.
  • Can we capture common “job building blocks”?


Pig

Started at Yahoo! Research
Runs about 30% of Yahoo!’s jobs
Features:

  • Expresses sequences of MapReduce jobs
  • Data model: nested “bags” of items
  • Provides relational (SQL) operators (JOIN, GROUP BY, etc.)
  • Easy to plug in Java functions
  • Pig Pen development environment for Eclipse


Example - Pig

Problem:

Suppose you have logged-in user data in one file and tweet data in another, and you need to find the top 25 trending topics among users aged 18-35.

Load Users Load tweet data/topics

Filter by age

Join on name

Group on topics

Count topics

Order by topics

Take top 25


Pig Translation

Problem step              Pig Latin
Load Users                Users = load …
Filter by age             Filtered = filter …
Load Tweet data/topics    Topics = load …
Join on name              Joined = join …
Group on topics           Grouped = group …
Count topics              Total = … count() …
Order by topics           Sorted = order …
Take top 25               Top25 = limit …


Pig Latin

Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 35;
Topics = load 'topics' as (user, topic);
Joined = join Filtered by name, Topics by user;
Grouped = group Joined by topic;
Summed = foreach Grouped generate group, COUNT(Joined) as frequencies;
Sorted = order Summed by frequencies desc;
Top25 = limit Sorted 25;

store Top25 into 'top25trends';
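To see what each relation in the script holds, the same dataflow can be traced in plain Python on a handful of rows. This is only an illustrative sketch: the sample users and topics are invented, and the set-based join assumes user names are unique.

```python
from collections import Counter

# Invented sample data standing in for the 'users' and 'topics' files.
users = [("alice", 22), ("bob", 41), ("carol", 30), ("dave", 17)]
topics = [("alice", "hadoop"), ("alice", "pig"), ("bob", "hadoop"),
          ("carol", "hadoop"), ("carol", "hive"), ("dave", "pig")]


def top_topics(users, topics, limit=25, min_age=18, max_age=35):
    # Filtered = filter Users by age >= 18 and age <= 35
    filtered = {name for name, age in users if min_age <= age <= max_age}
    # Joined = join Filtered by name, Topics by user
    joined = [(user, topic) for user, topic in topics if user in filtered]
    # Grouped / Summed = group by topic, then count each group
    summed = Counter(topic for _, topic in joined)
    # Sorted = order by frequency, descending (ties broken by topic name)
    ordered = sorted(summed.items(), key=lambda kv: (-kv[1], kv[0]))
    # Top25 = limit Sorted 25
    return ordered[:limit]


print(top_topics(users, topics))
```

On a cluster, Pig turns these steps into a sequence of MapReduce jobs. The Python version only shows the intermediate results each step is expected to produce.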


Hive | Introduction

  • Structured data to HDFS
  • Developed at Facebook
  • Used for majority of Facebook jobs
  • “Relational database” built on Hadoop
      – Maintains list of table schemas
      – SQL-like query language (HQL)
      – Can call Hadoop Streaming scripts from HQL
      – Supports table partitioning, clustering, complex data types, some optimizations


Hive | Sample Hive Queries

• Find top 5 pages visited by users aged 18-25:

SELECT p.url, COUNT(1) as clicks
FROM users u JOIN page_views p ON (u.name = p.user)
WHERE u.age >= 18 AND u.age <= 25
GROUP BY p.url
ORDER BY clicks DESC
LIMIT 5;

• Filter page views through a Python script:

SELECT TRANSFORM(p.user, p.date)
USING 'map_script.py'
AS dt, uid
FROM page_views p
CLUSTER BY dt;
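The first query is ordinary SQL, so its shape can be tried without a cluster. The sketch below runs it against an in-memory SQLite database from Python's standard library. SQLite stands in for Hive here, the table contents are invented, and the results are ordered by clicks descending so that the most-visited pages come first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (name TEXT, age INTEGER);
    CREATE TABLE page_views (user TEXT, url TEXT);
    -- Invented sample rows for illustration only.
    INSERT INTO users VALUES ('ann', 19), ('ben', 24), ('cat', 40);
    INSERT INTO page_views VALUES
        ('ann', '/home'), ('ann', '/news'), ('ben', '/home'),
        ('cat', '/home'), ('cat', '/shop'), ('ben', '/news'), ('ben', '/home');
""")

TOP_PAGES = """
    SELECT p.url, COUNT(1) AS clicks
    FROM users u JOIN page_views p ON (u.name = p.user)
    WHERE u.age >= 18 AND u.age <= 25
    GROUP BY p.url
    ORDER BY clicks DESC
    LIMIT 5
"""

rows = conn.execute(TOP_PAGES).fetchall()
for url, clicks in rows:
    print(url, clicks)
```

In Hive the same statement is compiled into MapReduce jobs over files in HDFS. SQLite simply evaluates it in memory.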


Hadoop 1 and Hadoop 2


Hadoop Daemons


Hadoop 1

Limited to 4,000 nodes per cluster

O(# of tasks in a cluster)

JobTracker bottleneck: it handles resource management, job scheduling, and monitoring

Only one namespace for managing HDFS

Map and Reduce slots are static

MapReduce is the only kind of job that can run


© Hortonworks Inc. 2011


Hadoop MapReduce Classic

• JobTracker

–Manages cluster resources and job scheduling

• TaskTracker

–Per-node agent

–Manage tasks


Hadoop 1 - Reading Files

Diagram: The Hadoop Client sends a read-file request to the NameNode, which returns DNs, block ids, etc. The client then reads blocks from the DataNodes. Each worker node across Rack1, Rack2, Rack3 … RackN runs a DataNode and a TaskTracker (DN | TT) and sends heartbeats/block reports to the NameNode. The SNameNode checkpoints the NameNode's fsimage/edit.


Hadoop 1 - Writing Files

Diagram: The Hadoop Client sends a write request to the NameNode, which returns DNs, etc. The client then writes blocks to the DataNodes, with replication pipelining between DataNodes. Each worker node across Rack1, Rack2, Rack3 … RackN runs a DataNode and a TaskTracker (DN | TT) and sends block reports to the NameNode. The SNameNode checkpoints the NameNode's fsimage/edit.


Hadoop 1 - Running Jobs

Diagram: The Hadoop Client submits a job to the JobTracker, which deploys the job to the worker nodes (DN | TT) across Rack1, Rack2, Rack3 … RackN. Steps shown: submit job, deploy job, map, shuffle, reduce, and the output file part 0.


Hadoop 1 - APIs

org.apache.hadoop.mapreduce.Partitioner

org.apache.hadoop.mapreduce.Mapper

org.apache.hadoop.mapreduce.Reducer

org.apache.hadoop.mapreduce.Job
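Of these classes, Partitioner is the one that decides which reducer receives each key. Hadoop's default HashPartitioner computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`. The Python sketch below reproduces that arithmetic under one assumption: keys are hashed like `java.lang.String` (Hadoop's own `Text` type hashes its bytes differently). The function names are illustrative, not part of any Hadoop API.

```python
def java_string_hash(s):
    """Java's String.hashCode() for BMP text, as a signed 32-bit value."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF     # wrap like a Java int
    return h - 0x100000000 if h >= 0x80000000 else h


def get_partition(key, num_reduce_tasks):
    """Same arithmetic as the default HashPartitioner's getPartition."""
    # Masking with Integer.MAX_VALUE clears the sign bit before the modulo.
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reduce_tasks


for word in ["the", "quick", "brown", "fox"]:
    print(word, "-> reducer", get_partition(word, 2))
```

Because the result depends only on the key, every occurrence of a word goes to the same reducer no matter which mapper emitted it.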


Hadoop 2

Potentially up to 10,000 nodes per cluster

O(cluster size)

Supports multiple namespaces for managing HDFS

Efficient cluster utilization (YARN)

MRv1 backward and forward compatible

Any app can integrate with Hadoop

Beyond Java


YARN: A new abstraction layer

HADOOP 1.0: Single Use System (Batch Apps)

  • MapReduce (cluster resource management & data processing)
  • HDFS (redundant, reliable storage)

HADOOP 2.0: Multi Purpose Platform (Batch, Interactive, Online, Streaming, …)

  • MapReduce (data processing) and Others (data processing)
  • YARN (cluster resource management)
  • HDFS2 (redundant, reliable storage)


YARN Architecture


Requirements

• Reliability

• Availability

• Utilization

• Wire Compatibility

• Agility & Evolution – Ability for customers to control upgrades to the grid software stack.

• Scalability - Clusters of 6,000-10,000 machines

–Each machine with 16 cores, 48G/96G RAM, 24TB/36TB disks

–100,000+ concurrent tasks

–10,000 concurrent jobs


Architecture

• Resource Manager

–Global resource scheduler

–Hierarchical queues

• Node Manager

–Per-machine agent

–Manages the life-cycle of container

–Container resource monitoring

• Application Master

–Per-application

–Manages application scheduling and task execution

–E.g. MapReduce Application Master
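The division of labour between the three roles can be illustrated with a toy model. This is a deliberately simplified sketch and not the YARN API: the class and method names, the memory-only resource model, and the "most free memory first" placement rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Per-machine agent: tracks free memory and the containers it runs."""
    name: str
    free_mb: int
    containers: list = field(default_factory=list)

    def launch(self, container_id, mb):
        self.free_mb -= mb
        self.containers.append(container_id)


class ResourceManager:
    """Global scheduler: decides which node hosts each container."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.next_id = 0

    def allocate(self, app, mb):
        # Toy placement rule: pick the node with the most free memory.
        node = max(self.nodes, key=lambda n: n.free_mb)
        if node.free_mb < mb:
            return None                  # cluster is full
        self.next_id += 1
        container_id = f"{app}-c{self.next_id}"
        node.launch(container_id, mb)
        return container_id, node.name


class ApplicationMaster:
    """Per-application: asks the ResourceManager for one container per task."""

    def __init__(self, app, rm):
        self.app, self.rm = app, rm

    def run_tasks(self, num_tasks, mb_per_task):
        granted = []
        for _ in range(num_tasks):
            grant = self.rm.allocate(self.app, mb_per_task)
            if grant is None:
                break
            granted.append(grant)
        return granted


nodes = [NodeManager("nm1", 4096), NodeManager("nm2", 2048)]
rm = ResourceManager(nodes)
grants = ApplicationMaster("app1", rm).run_tasks(num_tasks=4, mb_per_task=2048)
print(grants)
```

The point of the split is that the ResourceManager only hands out resources. How an application uses its containers is left to its own ApplicationMaster, which is what allows frameworks other than MapReduce to share the cluster.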


Hadoop 2 - Reading Files (w/ NN Federation)

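Diagram: With NameNode federation there are several NameNodes, each serving its own namespace (NN1/ns1, NN2/ns2, NN3/ns3, NN4/ns4). The Hadoop Client sends a read-file request to a NameNode, which returns DNs, block ids, etc., and then reads blocks from the DataNodes. Each worker node across Rack1, Rack2, Rack3 … RackN runs a DataNode and a NodeManager (DN | NM) and sends register/heartbeat/block report messages to the NameNodes. Each NameNode is checkpointed either by an SNameNode (fsimage/edit copy) or by a Backup NN (fs sync).

Block Pools: one per namespace (ns1, ns2, ns3, ns4), shown on DataNodes dn1, dn2; dn1, dn3; dn4, dn5; dn4, dn5.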


Hadoop 2 - Writing Files

Diagram: The Hadoop Client sends a write request to a NameNode (NN1/ns1, NN2/ns2, NN3/ns3, NN4/ns4), which returns DNs, etc. The client then writes blocks to the DataNodes, with replication pipelining between DataNodes. Each worker node across Rack1, Rack2, Rack3 … RackN runs a DataNode and a NodeManager (DN | NM) and sends block reports to the NameNodes. Each NameNode is checkpointed either by an SNameNode (fsimage/edit copy) or by a Backup NN (fs sync).


Hadoop 2 - Running Jobs

Diagram: Hadoop Client 1 and Hadoop Client 2 submit app1 and app2 to the ResourceManager, which creates app1 and app2 on the cluster. NodeManagers across Rack1, Rack2 … RackN host one ApplicationMaster per application (AM1, AM2) and the applications' containers (C1.1, C1.2, C1.3, C1.4 for app1; C2.1, C2.2, C2.3 for app2). Status reports flow back to the ResourceManager.

Diagram labels inside the ResourceManager: ASM, Scheduler, queues, Containers, NM, and Resources, linked by the relations "negotiates", "reports to", and "partitions".


Hadoop 2 - Security

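Diagram components: JDBC Client, REST Client, and Browser (HUE) on the client side; a Knox Gateway Cluster in a DMZ between two firewalls; LDAP/AD, KDC, and an Enterprise/Cloud SSO Provider for authentication; the Hadoop Cluster behind the second firewall; native Hive/HBase encryption.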


Hadoop 2 - APIs

org.apache.hadoop.yarn.api.ApplicationClientProtocol

org.apache.hadoop.yarn.api.ApplicationMasterProtocol

org.apache.hadoop.yarn.api.ContainerManagementProtocol


• Map Reduce is the processing model within any Hadoop ecosystem.

• Whenever you execute your actions against Hadoop/Hive, Map Reduce is invoked.

• This is 1) complex, 2) time-consuming, and 3) difficult to learn/debug.

• We often need tools/engines that help reduce this complexity efficiently.

Live Demo Objectives:

1) A Hive table is queried using standard Hive query language

2) We see that a Map Reduce program is executed in the background

3) We use Presto to avoid Map Reduce and get the data faster

4) We connect R to Hive through Presto to retrieve data from Hive without Map Reduce

Hive Table

No Map Reduce

R (http://www.r-project.org/)

1) R is a language and environment for statistical computing and graphics. (Similar to SAS, but open source)

2) R is supplied with a variety of packages for various analytical purposes.

3) By writing code in “R”, you can create customized actions, one of which is to connect to big data.

Presto (http://prestodb.io)

1) Open-source query engine for running queries against GBs to PBs of data

2) Developed by Facebook, which also developed Hive

3) Multiple data sources will be supported in the near future

4) Currently used at Facebook for querying their 300 PB Hive data warehouse

My favourites – R and Presto


Please submit your feedback at www.meetup.com/jaxbigdata

Thank you for the opportunity!