Cmprssd Intrduction To Hadoop, SQL-on-Hadoop, NoSQL @arsenyspb Arseny.Chernov@Dell.com Singapore University of Technology & Design 2016-11-09

Page 1

Cmprssd Intrduction To Hadoop, SQL-on-Hadoop, NoSQL

@arsenyspb

Arseny.Chernov@Dell.com

Singapore University of Technology & Design 2016-11-09

Page 2

Thank You For Inviting! My special kind regards to:

Professor Meihui Zhang

Associate Director Hou Liang Seah

Industry Outreach Manager Robin Soo

Page 3

🤔🤔 What am I supposed to do?.. Please raise hand if you…

…want to learn about modern data analytics ?..

…are OK if I use words like “Java” or “Command Line” or “Port”?..

…got enough kopi / teh / red bull for next 1 hour?..

…have hands-on experience with Hadoop, Spark, Hive?..

Page 4

Shameless Self-Intro

Page 5

Hi, My Name Is Arseny, And I’m…

Page 6

Hadoop In A 🌰 Nutshell

Page 7

1998

2016

It All Started At Google

Page 8

2003

2004

2006

Hadoop is Google’s Tech in Open Source

2006

Page 9

Hadoop Originates From Hyperscale Approach

However, in 2016 big data & Hadoop don’t need a hyperscale datacenter

Presenter
Presentation Notes
Expect students from a DB background to be comfortable here. Expect they will become uncomfortable when we get to CAP/BASE.
Page 10

© Hortonworks Inc. 2011 – 2014. All Rights Reserved

Closer Look, e.g. Hortonworks Data Platform (HDP)

Hortonworks Data Platform 2.3 (diagram):

– GOVERNANCE & INTEGRATION
  ▪ Data Workflow: Sqoop, Flume, Kafka, NFS, WebHDFS
  ▪ Data Lifecycle & Governance: Falcon, Atlas
– DATA ACCESS
  ▪ Batch: MapReduce
  ▪ Script: Pig
  ▪ Search: Solr
  ▪ SQL: Hive
  ▪ NoSQL: HBase, Accumulo, Phoenix
  ▪ Stream: Storm
  ▪ In-memory: Spark
  ▪ Others: ISV Engines
  ▪ Engine layers: Tez, Slider
– DATA MANAGEMENT
  ▪ YARN: Data Operating System
  ▪ HDFS: Hadoop Distributed File System, on nodes 1 … N
– SECURITY
  ▪ Administration, Authentication, Authorization, Auditing, Data Protection
  ▪ Ranger, Knox, Atlas, HDFS Encryption
– OPERATIONS
  ▪ Provisioning, Managing, & Monitoring: Ambari, Cloudbreak, Zookeeper
  ▪ Scheduling: Oozie

We will “compress” all these topics into the next hour

Page 11

Quick demo

Page 12

HDFS In A 🌰 Nutshell Hadoop Distributed File System

Page 13

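© 2015 Pivotal Software, Inc. All rights reserved.

Reading Data From HDFS

Diagram: a client node (client JVM running the HDFS client, DistributedFileSystem and FSDataInputStream), the NameNode (namenode JVM) and three DataNodes (datanode JVMs).

1: open (HDFS client to DistributedFileSystem)
2: Request file block locations (DistributedFileSystem to NameNode)
3: read (HDFS client to FSDataInputStream)
4: read from block (FSDataInputStream to first DataNode)
5: read from block (FSDataInputStream to next DataNode)
6: close (HDFS client to FSDataInputStream)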

Presenter
Presentation Notes
Each data-block is read from one of the data-nodes that holds it (assuming it is replicated multiple times). For each block, the NameNode returns the data-nodes that hold a replica, sorted by network proximity to the client, so the read goes to the closest replica. Note: the ‘client’ is whatever code is reading the data from HDFS. It could be anything: a web app, Spring Batch, Spring Integration, a HAWQ query, anything. Typically it is running on one of the nodes in the Hadoop cluster rather than externally to the cluster. Good deep background on this can be found at http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/
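The read steps above can be sketched as a toy model in Python. This is not the Hadoop client API: the dictionaries, node names and block ids are all made up for illustration, and the sketch assumes the NameNode hands back replicas ordered closest-first.

```python
# Toy model of the HDFS read path; every name here is illustrative, not a Hadoop API.

# NameNode metadata: file path -> ordered list of block ids.
FILE_BLOCKS = {"/data/file.csv": ["blk_1", "blk_2"]}

# NameNode metadata: block id -> data-nodes holding a replica,
# assumed to be sorted closest-to-the-client first.
BLOCK_LOCATIONS = {
    "blk_1": ["datanode-a", "datanode-b", "datanode-c"],
    "blk_2": ["datanode-b", "datanode-c", "datanode-a"],
}

# Block contents as stored on each data-node (3x replication).
BLOCK_DATA = {"blk_1": b"first,red\n", "blk_2": b"second,blue\n"}
DATANODE_STORAGE = {
    node: dict(BLOCK_DATA) for node in ("datanode-a", "datanode-b", "datanode-c")
}


def get_block_locations(path):
    """Step 2: ask the NameNode where each block of the file lives."""
    return [(blk, BLOCK_LOCATIONS[blk]) for blk in FILE_BLOCKS[path]]


def read_file(path):
    """Steps 3-5: read each block from the first (closest) replica."""
    data = b""
    reads = []
    for block_id, datanodes in get_block_locations(path):
        closest = datanodes[0]
        data += DATANODE_STORAGE[closest][block_id]
        reads.append((block_id, closest))
    return data, reads


content, reads = read_file("/data/file.csv")
print(content.decode(), reads)
```

Note that the NameNode only serves metadata here; the bytes themselves come straight from the data-nodes, which is what lets reads scale with the number of nodes.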
Page 14

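© 2015 Pivotal Software, Inc. All rights reserved.

Writing Data to HDFS

Diagram: a client node (client JVM running the HDFS client, DistributedFileSystem, FSDataOutputStream and DataStreamer), the NameNode (namenode JVM) and three DataNodes (datanode JVMs). Diagram shows 3x replication.

1: create (HDFS client to DistributedFileSystem)
2: create (DistributedFileSystem to NameNode)
3a: write (HDFS client to FSDataOutputStream)
3b: Request allocation (as new blocks required) (DataStreamer to NameNode)
3c: Three data-node, data-block pairs returned
4a: write packet (client to first DataNode)
4b: write packet (first DataNode to second DataNode)
4c: write packet (second DataNode to third DataNode)
5a: ack packet (third DataNode to second DataNode)
5b: ack packet (second DataNode to first DataNode)
5c: ack packet (first DataNode to client)
6: close
7: complete (to NameNode)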

Presenter
Presentation Notes
The number of writes to data-nodes depends on the replication factor, of course. The NameNode returns as many (data-node, data-block) pairs as the replication factor requires. The first write is ‘on rack’, or at least as close to the client as possible. The other two writes are ‘off rack’: on a different rack from the first write.
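A minimal Python sketch of the packet pipeline (steps 4a-4c and 5a-5c). It is a toy model, not Hadoop code: the function name, node names and the in-memory `storage` dictionary are assumptions made for illustration.

```python
# Toy model of the HDFS write pipeline; names are illustrative, not a Hadoop API.

def write_packet(packet, pipeline, storage, log):
    """Forward a packet down the data-node pipeline, then ack back up it."""
    if not pipeline:
        return True  # past the last data-node: nothing left to forward to
    head, rest = pipeline[0], pipeline[1:]
    storage.setdefault(head, []).append(packet)  # 4a/4b/4c: write packet
    log.append(f"write {head}")
    acked = write_packet(packet, rest, storage, log)
    log.append(f"ack {head}")                    # 5a/5b/5c: ack packet
    return acked


storage, log = {}, []
pipeline = ["dn1", "dn2", "dn3"]  # 3c: three data-nodes returned by the NameNode
for packet in [b"pkt-0", b"pkt-1"]:
    write_packet(packet, pipeline, storage, log)

print(log[:6])
```

The acks arrive in the reverse order of the writes: the last data-node in the pipeline acknowledges first, and the client sees the ack only after every replica has the packet.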
Page 15

Quick demo

Page 16

YARN In A 🌰 Nutshell Yet Another Resource Negotiator

Page 17

Legacy SQL Is All Structured

Traditional SQL databases: structured, Schema-on-Write

row keys   color   shape    timestamp
first      red     square   HH:MM:SS
second     blue    round    HH:MM:SS
...        ...

1.) create schema on file or block storage
2.) load data
3.) query data: select ROW KEY, COLOR from … where

Can’t add data before the schema is created. To change the schema, drop and reload the entire table. Dropping a TB-size table with Foreign Keys could take days.
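The three schema-on-write steps can be tried with Python's built-in `sqlite3` module. SQLite stands in here for a traditional RDBMS; the table name `shapes`, the column names and the sample timestamps are assumptions made for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Loading before the schema exists fails: there is no table to write into.
try:
    conn.execute("INSERT INTO shapes VALUES ('first', 'red', 'square', '12:00:00')")
    load_before_schema_failed = False
except sqlite3.OperationalError:
    load_before_schema_failed = True

# 1.) create schema
conn.execute(
    "CREATE TABLE shapes (row_key TEXT PRIMARY KEY, color TEXT, shape TEXT, ts TEXT)"
)

# 2.) load data
conn.executemany(
    "INSERT INTO shapes VALUES (?, ?, ?, ?)",
    [("first", "red", "square", "12:00:00"), ("second", "blue", "round", "12:00:05")],
)

# 3.) query data
rows = conn.execute(
    "SELECT row_key, color FROM shapes WHERE shape = 'round'"
).fetchall()
print(load_before_schema_failed, rows)
```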

Page 18

Query MapReduce In Color

Unstructured, Schema-on-Read

Input: file.csv & other.txt

1.) load data straight from HDFS
2.) query data
  – map
  – shuffle
  – reduce
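The map, shuffle and reduce phases can be sketched in plain Python on a few CSV lines. This runs in a single process and only illustrates the data flow; the sample lines and function names are made up, and a real job would run the map and reduce functions in parallel across the cluster.

```python
from collections import defaultdict

# Raw lines as they might sit in file.csv; no schema was declared at load time.
lines = ["first,red,square", "second,blue,round", "third,red,round"]


def map_phase(line):
    """Map: apply the schema at read time and emit (color, 1)."""
    _, color, _ = line.split(",")
    yield color, 1


def shuffle(pairs):
    """Shuffle: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())
print(counts)
```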

Page 19

MapReduce In Process Diagram

Page 20

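© 2015 Pivotal Software, Inc. All rights reserved.

Starting Job – MapReduce v2.0

Diagram: a client node (client JVM running the MapReduce program and its Job), the ResourceManager JVM (on the node labelled “Jobtracker Node” on the slide), a Node Manager node whose NodeManager launches the MRAppMaster, a Node Manager node whose NodeManager launches a YARN child JVM (Mapper or Reducer), and a shared file system (e.g. HDFS).

1: initiate job (MapReduce program to Job)
2: request new application (Job to ResourceManager)
3: copy job jars, config (Job to shared file system)
4: submit job (Job to ResourceManager)
5a: start container (ResourceManager to NodeManager)
5b: launch (NodeManager to MRAppMaster)
5c: initialize job (MRAppMaster)
6: determine input splits (MRAppMaster, from shared file system)
7a: allocate task resources (MRAppMaster to ResourceManager)
7b: start container (MRAppMaster to NodeManager)
8: launch (NodeManager to YARN child)
9: retrieve job jars, data (YARN child, from shared file system)
10: run (YARN child runs Mapper or Reducer)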

Presenter
Presentation Notes
The copying of the job JARs and config (step 3) is done by the job client, not by the ResourceManager. A confusing aspect of this slide is that there is a ‘Node Manager Node’ that spawns the MRAppMaster AND a ‘Node Manager Node’ that launches a YARN child. From Hadoop: The Definitive Guide (p197 & 198), it seems what this is saying is that all data nodes also have a NodeManager daemon process running. The ResourceManager can contact that process to launch an MR job, which creates an MRAppMaster internally. While managing the job, the MRAppMaster can contact another NodeManager elsewhere in the cluster to spawn a YARN child, which runs either a Mapper task or a Reducer task.
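Steps 6 to 8 can be illustrated with a small Python sketch: the application master cuts the input into splits and asks for one container per split. The function names, the 128 MB split size and the round-robin placement are assumptions made for this sketch; a real YARN scheduler also weighs data locality and free capacity.

```python
import math
from itertools import cycle

MB = 1024 * 1024


def input_splits(file_size, split_size):
    """Step 6: cut the input into (offset, length) splits."""
    count = math.ceil(file_size / split_size)
    return [
        (i * split_size, min(split_size, file_size - i * split_size))
        for i in range(count)
    ]


def assign_containers(splits, node_managers):
    """Steps 7a-8: one map task per split, placed round-robin on NodeManagers."""
    return list(zip(splits, cycle(node_managers)))


splits = input_splits(300 * MB, 128 * MB)
placement = assign_containers(splits, ["nm-1", "nm-2"])
for (offset, length), node in placement:
    print(f"map task for bytes {offset}..{offset + length - 1} on {node}")
```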
Page 21

Quick demo

Page 22

Hive In A 🌰 Nutshell SQL interface to MapReduce Jobs

Page 23

Relational DB

DRY – Don’t Repeat Yourself

Relational DB and SQL were conceived to:
– Remove repeated data, replace with tabular structure & relationships
  ▪ Provide efficient & robust structure for data storage
– Exploit regular structure with declarative query language
  ▪ Structured Query Language

Page 24

What Hive Is…

A SQL-like processing capability based on Hadoop

• Enables easy data summarisation, ad-hoc reporting and querying, and analysis of large volumes of data
• Built on HQL, a SQL-like query language
  – Statements run as MapReduce jobs
  – Also allows MapReduce programmers to plug in custom mappers and reducers
• Works with plain text, HBase, ORC, Parquet and other formats
• Metadata (the Hive Metastore) is stored in a relational database, here MySQL

Page 25

Hive Schemas

• Hive is schema-on-read
  – Schema is only enforced when the data is read (at query time)
  – Allows greater flexibility: same data can be read using multiple schemas
• Contrast with RDBMSes, which are schema-on-write
  – Schema is enforced when the data is loaded
  – Speeds up queries at the expense of load times

Presenter
Presentation Notes
Note that Pig is schema-on-read too. So is MapReduce.
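A minimal Python sketch of schema-on-read: the same raw text is read twice, each time with a different schema supplied at query time. This is not how Hive is implemented; the `read_with_schema` helper and the sample data are assumptions made for illustration.

```python
import csv
import io

# Raw data as loaded: plain text, no schema attached.
RAW = "first,red,square,12:00:00\nsecond,blue,round,12:00:05\n"


def read_with_schema(raw, schema):
    """Apply a schema (column name -> (field index, type)) while reading."""
    return [
        {name: cast(fields[idx]) for name, (idx, cast) in schema.items()}
        for fields in csv.reader(io.StringIO(raw))
    ]


# Schema 1: only the key and the color, both as strings.
colors = read_with_schema(RAW, {"row_key": (0, str), "color": (1, str)})

# Schema 2: the shape, plus the hour parsed out of the timestamp as an integer.
shapes = read_with_schema(
    RAW, {"shape": (2, str), "hour": (3, lambda t: int(t.split(":")[0]))}
)

print(colors)
print(shapes)
```

Nothing was validated when `RAW` was created, so a malformed line would only surface as an error at query time. That is the load-time versus query-time trade-off the slide describes.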
Page 26

Hive Architecture

Hive Metastore + MySQL

Page 27

What Hive Is Not…

• Hive, like Hadoop, is designed for batch processing of large datasets
• Not a real-time system, not fully SQL-92 compliant
  – “Sibling” solutions like Tez, Impala and HAWQ offer more compliance
• Latency and throughput are both high compared to a traditional RDBMS
  – Latency stays high even when dealing with relatively small data (<100 MB)

Page 28

Quick demo

Page 29

HBase In A 🌰 Nutshell Non-relational distributed database

Page 30

ACID is a Business Requirement for RDBMSs

Traditional DBs have excellent support for ACID transactions

– Atomic: All write operations succeed, or nothing is written
– Consistent: Integrity rules guaranteed at commit
– Isolated: It appears to the user as if only one process executes at a time. (Two concurrent transactions will not see one another’s changes while “in flight”.)
– Durable: The updates made to the database in a committed transaction will be visible to future transactions. (Effects of a process do not get lost if the system crashes.)

Presenter
Presentation Notes
Expect students from a DB background to be comfortable here. Expect they will become uncomfortable when we get to CAP/BASE.
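The Atomic and Consistent properties can be seen with Python's built-in `sqlite3` module. SQLite stands in for a traditional RDBMS here; the `accounts` table, the `CHECK` rule and the `transfer` helper are assumptions made for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Integrity rule: a balance may never go negative.
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money in one transaction: both updates commit, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


overdraft_ok = transfer(conn, "alice", "bob", 150)  # would leave alice at -50
after_overdraft = balances(conn)
normal_ok = transfer(conn, "alice", "bob", 40)
after_normal = balances(conn)
print(overdraft_ok, after_overdraft, normal_ok, after_normal)
```

In the failed transfer the credit to bob had already been applied when the debit violated the integrity rule; the rollback undoes it, so no partial write is ever visible.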
Page 31

Scale RDBMS?..

RDBMS is a bad fit for huge-scale, online applications
Sharding?.. Scaling up?.. No Joins?.. Master-Slave?..

Big Data describes the problem, Not Only SQL defines the general approach to the solution:
– Emphasis on scale, distributed processing, use of commodity hardware

Page 32

Business Needs for “Not Only SQL”

Not Only SQL DBs evolved from web-scale use-cases

– Google, Amazon, Facebook, Twitter, Yahoo, …
  ▪ “Google Cache” = Entire page saved into a cell of a BigTable database
  ▪ Columnar layout preferred
  ▪ Filters (e.g. Bloom filters) reduce disk lookups for non-existent rows or columns, which speeds up query operations
– Requirement for massive scale, relational fits badly
  ▪ Queries relatively simple
  ▪ Direct interaction with online customers
– Cost-effective, dynamic horizontal scaling required
  ▪ Many nodes based on inexpensive (commodity) hardware
  ▪ Must manage frequent node failures & addition of nodes at any time

Presenter
Presentation Notes
Hierarchical and Network DBs actually predate Relational. Most of these companies were small startups and did not start with the resources necessary to buy big iron.
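The “filters” bullet on this slide most likely refers to Bloom filters, which Bigtable and HBase use to skip disk lookups for keys that are not there. Below is a minimal Python sketch under that assumption; the class, its sizes and the SHA-256 based hashing are illustrative choices, not the HBase implementation.

```python
import hashlib


class BloomFilter:
    """Set-membership sketch: may say 'maybe present', never misses a stored key."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


bloom = BloomFilter()
for row_key in ("first", "second"):
    bloom.add(row_key)


def lookup(row_key, disk):
    """Only touch the (slow) disk when the filter says the key may exist."""
    if not bloom.might_contain(row_key):
        return None  # definitely absent: disk lookup skipped
    return disk.get(row_key)


disk = {"first": "red", "second": "blue"}
print(lookup("first", disk), lookup("second", disk))
```

A negative answer is always correct, so the skipped lookups are safe; a positive answer can occasionally be a false positive, which costs one unnecessary disk read but never a wrong result.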
Page 33

🤔🤔 But how to build such a DB?..

Page 34

Reminder: The CAP Theorem (pick 2, not 3)

– Consistency: “Once a writer has written, all readers will see that write”
  ▪ Single Version of Truth?
– Availability: “System is Available to serve 100% of requests and complete them successfully.”
  ▪ No SPOF?..
– Partition tolerance: “A system can continue to operate in the presence of network Partitions”
  ▪ Replicas?..

Page 35

Eventually Consistent vs. ACID

An artificial acronym you may see is BASE

– Basically Available
  ▪ System seems to work all the time
– Soft State
  ▪ Not wholly consistent all the time, but…
– Eventual Consistency
  ▪ After a period with no updates, a given dataset will be consistent

Resulting systems are characterized as “eventually consistent”
– Example: overbooking an airline or hotel and passing the risk to the customer

Presenter
Presentation Notes
Maybe they don't ever need consistent data for some datasets! Examples, based on the previous slide:
- The bank has decided that it is OK to allow deposits and withdrawals during a partition failure. When the system comes back up we will reconcile, see how much we lost, and book it as the cost of high availability. My online banking site shows pending deposits and withdrawals (soft state); even they can’t tell me at a precise moment in time what is in my account!
- An airline or hotel may decide to overbook and pass the cost of inconsistent state on to the customer (ever happen to you?)
For deep background, a student suggested “Building on Quicksand”: http://www-db.cs.wisc.edu/cidr/cidr2009/Paper_133.pdf
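Soft state and eventual consistency can be sketched with two in-memory replicas in Python. The class and method names, and the idea of an explicit `sync()` call standing in for background replication, are assumptions made for this toy model.

```python
class EventuallyConsistentStore:
    """Writes land on replica 0 and are queued for the other replicas."""

    def __init__(self, replica_count=2):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # (replica index, key, value) not yet applied

    def write(self, key, value):
        self.replicas[0][key] = value  # acknowledged immediately
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))

    def read(self, key, replica=0):
        return self.replicas[replica].get(key)

    def sync(self):
        """Apply queued updates in order; afterwards all replicas agree."""
        for index, key, value in self.pending:
            self.replicas[index][key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("seat-12A", "alice")

fresh = store.read("seat-12A", replica=0)   # sees the write
stale = store.read("seat-12A", replica=1)   # soft state: not replicated yet
store.sync()                                # a period with no new updates
converged = store.read("seat-12A", replica=1)
print(fresh, stale, converged)
```

Between the write and the sync, a second customer reading replica 1 would still see seat 12A as free, which is exactly the overbooking window described in the notes.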
Page 36

Non-relational distributed database

• HBase is a database: has a schema, but it’s non-relational

row keys   column family “color”                        column family “shape”
first      “red”: #F00, “blue”: #00F, “yellow”: #F0F    “square”:
second                                                  “round”:, “size”: XXL

1.) Create column families
2.) Load data, multiples of rows form region files on HDFS
3.) Query data

  hbase> get “first”, “color”:“yellow”
  COLUMN    CELL
  yellow    timestamp=1295774833226, value=“#F0F”

  hbase> get “second”, “shape”:“size”
  COLUMN    CELL
  size      timestamp=1295723467122, value=“XXL”
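The data model above (row key, column family, column, timestamped cell) can be mimicked with nested dictionaries in Python. This is a toy model, not the HBase client API: the `Table` class and its methods are assumptions made for illustration, and only the cell values and timestamps shown on the slide are reused.

```python
from collections import defaultdict


class Table:
    """Row key -> column family -> column -> list of (timestamp, value) versions."""

    def __init__(self, column_families):
        self.families = set(column_families)  # 1.) create column families
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, family, column, value, timestamp):
        # Families are fixed up front; columns inside a family are free-form.
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row][family].setdefault(column, []).append((timestamp, value))

    def get(self, row, family, column):
        """Return the newest (timestamp, value) cell, or None if absent."""
        versions = self.rows.get(row, {}).get(family, {}).get(column)
        return max(versions) if versions else None


table = Table(["color", "shape"])

# 2.) load data: each row may use a different set of columns
table.put("first", "color", "yellow", "#F0F", 1295774833226)
table.put("second", "shape", "size", "XXL", 1295723467122)

# 3.) query data
print(table.get("first", "color", "yellow"))
print(table.get("second", "shape", "size"))
print(table.get("second", "color", "yellow"))  # sparse: the cell simply does not exist
```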

Page 37

Column Oriented Storage

Page 38

HBase Architecture, Read & Write

Diagram:

– Clients
  ▪ HBase API Client
  ▪ SQL JDBC Client, via Apache Phoenix (HBase Client)
  ▪ SQL ODBC Client, via Pivotal HAWQ PXF (HBase Client)
– Zookeeper
– Region Server
  ▪ Write-Ahead Log (WAL)
  ▪ Memstore
  ▪ HFile
– HDFS

(1) Put/Delete
(2.0) Write to WAL
(2.1) Write to MemStore
(3) Flush to HDFS
(4) Get/Scan Read Request

Annotations on the diagram:
– Sequential HDFS Write & L2 Read
– Adaptive Pre Fetch & L2 Reads, Sequential Writes
– Client RAM Pre-Fetch
– Memstore = Eventual Consistency
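The numbered write and read steps can be sketched as a small Python class. It is a toy model of a region server, not HBase code: the class name, the `flush_threshold` parameter and the use of plain dictionaries for HFiles are assumptions made for illustration, and deletes are left out.

```python
class RegionServer:
    """Toy write path: WAL first, then MemStore, flushed to immutable HFiles."""

    def __init__(self, flush_threshold=2):
        self.wal = []        # write-ahead log: append-only record of every edit
        self.memstore = {}   # in-memory buffer of recent edits
        self.hfiles = []     # immutable files flushed to HDFS, oldest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):               # (1) Put
        self.wal.append((key, value))        # (2.0) write to WAL
        self.memstore[key] = value           # (2.1) write to MemStore
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # (3) flush to HDFS: write the sorted MemStore out as a new HFile
        self.hfiles.append(dict(sorted(self.memstore.items())))
        self.memstore = {}

    def get(self, key):
        # (4) read: check the MemStore first, then HFiles from newest to oldest
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):
            if key in hfile:
                return hfile[key]
        return None


server = RegionServer()
server.put("first", "red")
server.put("second", "blue")   # reaches the threshold and triggers a flush
server.put("first", "green")   # newer value, still only in the MemStore

print(server.get("first"), server.get("second"), len(server.hfiles), len(server.wal))
```

Because every edit is appended to the WAL before it reaches the MemStore, edits that have not been flushed yet could be replayed from the log after a crash; all disk writes stay sequential.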

Page 39

HBase namespace layout

Page 40

From “HBase: The Definitive Guide”

http://www.slideshare.net/Hadoop_Summit/kamat-singh-june27425pmroom210cv2

Compression (HBase and others)

Presenter
Presentation Notes
Moby is integrated with CDH 5.1.2 and 5.1.3 and Ambari 1.5.1, 1.6
Page 41

Q&A?.. http://bit.ly/isilonhbase

@arsenyspb
