Transcript
Page 1: SQL on Hadoop. Todays agenda Introduction Hive – the first SQL approach Data ingestion and data formats Impala – MPP SQL

SQL on Hadoop

Page 2: SQL on Hadoop. Todays agenda Introduction Hive – the first SQL approach Data ingestion and data formats Impala – MPP SQL

Todays agenda

• Introduction• Hive – the first SQL approach• Data ingestion and data formats• Impala – MPP SQL

Page 3: SQL on Hadoop. Todays agenda Introduction Hive – the first SQL approach Data ingestion and data formats Impala – MPP SQL

Why SQL?

• Data warehousing…• Structured data– organization of the data– optimized data access

• Declarative data processing– No need to have developer skills, but…– Portable – universal language– We are lazy

• SQL drivers supported– No need of Hadoop client installation– Easier integration with the current systems

Page 4: SQL on Hadoop. Todays agenda Introduction Hive – the first SQL approach Data ingestion and data formats Impala – MPP SQL

Why not SQL

• It is not RDBMS!– big tables joins should by avoided– no indexes by default– no primary keys and constraints

• Not suited for OLTP– no locks– no transactions

• write once – read many• Additional data structuring during data shipping

(ETL) needed • Not all problems can be solved with SQL

Page 5: SQL on Hadoop. Todays agenda Introduction Hive – the first SQL approach Data ingestion and data formats Impala – MPP SQL

5

SQL on Hadoop

HDFSHadoop Distributed File System

Hba

seN

oSql

col

umna

r sto

re

YARNCluster resource manager

MapReduce

Hiv

eSQ

L

Pig

Scrip

ting

Flum

eLo

g da

ta c

olle

ctor Sq

oop

Dat

a ex

chan

ge w

ith R

DBM

S

Ooz

ieW

orkfl

ow m

anag

er

Mah

out

Mac

hine

lear

ning

Zook

eepe

rCo

ordi

natio

n Impa

laSQ

L

Spar

kLa

rge

scal

e da

ta p

roce

esin

g

Page 6: SQL on Hadoop. Todays agenda Introduction Hive – the first SQL approach Data ingestion and data formats Impala – MPP SQL

SQL on Hadoop

Client

Metadata

SQL master node

JDBC/ODBC server

SQL engine

HDFS

Executor

Cluster Node

HDFS

Executor

Cluster Node

HDFS

Executor

Cluster Node

HDFS

Executor

SQL

Data

Tables definition lookup

YARN

Page 8: SQL on Hadoop. Todays agenda Introduction Hive – the first SQL approach Data ingestion and data formats Impala – MPP SQL

Summary

• SQL on Hadoop is not for OLTP!• …but for data warehousing workloads• … ad-hoc queries

• Enforces semi-structuring of the data

• Does not enforce using certain data format on HDFS


Recommended