MySQL and Impala ecosystem: SQL friendly Hadoop

Agenda

- SQL support

- Architecture

- Hadoop integration

Who we are

● Web portal, search engine in the Czech Republic

● 30+ web services (news, email, media, listings…)

● Only open source technologies

● PPC ads, Google AdWords competitor in the Czech Republic

● Software engineers, team leaders, database enthusiasts

● MySQL, HBase, Hadoop, analytics

● MySQL trainings, internal consultations

Sklik.cz Statistics

MySQL

Hadoop

HBase client statistics

Apache Impala

Apache Impala (incubating) is...

Brief History

MySQL vs Impala

MySQL vs Impala: Purpose

MySQL vs Impala: Data

MySQL vs Impala: Environment

Impala Architecture

Why is Impala So Fast?

Architecture

Client (JDBC, ODBC, impala-shell)

Architecture: Impala Daemon

Architecture: Statestore

Architecture: Catalog

Architecture: Hive Metastore

Architecture: Data Storage Layer


Data storage formats

Parquet file format

Row Group 1

    row 1 col 1 | row 2 col 1 | row 3 col 1 | row 4 col 1

    row 1 col 2 | row 2 col 2 | row 3 col 2 | row 4 col 2

    row 1 col 3 | row 2 col 3 | row 3 col 3 | row 4 col 3

Row Group 2

    row 5 col 1
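The row group layout above can be sketched in a few lines of Python. This is only an illustration of the idea (rows are split into groups, and within each group the values are stored column by column); it is not the real Parquet encoding, which also has pages, compression and footer metadata. The function name `to_row_groups` and the column names `col1`..`col3` are made up for this example.

```python
from typing import Dict, List

Row = Dict[str, str]


def to_row_groups(rows: List[Row], columns: List[str],
                  group_size: int) -> List[Dict[str, List[str]]]:
    """Split rows into groups; inside each group keep values column by column."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        # One contiguous list of values per column, as in the diagram.
        groups.append({col: [row[col] for row in chunk] for col in columns})
    return groups


# Five rows with three columns, labelled like the cells on the slide.
rows = [
    {"col1": f"row {i} col 1", "col2": f"row {i} col 2", "col3": f"row {i} col 3"}
    for i in range(1, 6)
]
groups = to_row_groups(rows, ["col1", "col2", "col3"], group_size=4)

for number, group in enumerate(groups, start=1):
    print(f"Row Group {number}")
    for values in group.values():
        print("   ", " | ".join(values))
```

A query that reads only one column has to touch only that column's values in each row group, which is the main reason a columnar format suits analytic scans.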

Partitioning and schema proposal

2014 2015 2016

… WHERE year = 2016
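The effect of the `WHERE year = 2016` filter on a table partitioned by year can be sketched as follows. This is a toy model, not Impala's planner: the table name `stats`, the file names and the helper `files_to_scan` are all hypothetical, and the only assumption taken from practice is the Hive-style `key=value` naming of partition directories.

```python
from typing import Dict, List, Optional

# Hypothetical layout: one directory per partition value (year=2014, ...).
partitions: Dict[int, List[str]] = {
    2014: ["stats/year=2014/part-0.parq"],
    2015: ["stats/year=2015/part-0.parq"],
    2016: ["stats/year=2016/part-0.parq", "stats/year=2016/part-1.parq"],
}


def files_to_scan(parts: Dict[int, List[str]],
                  year: Optional[int] = None) -> List[str]:
    """Return the files a query must read, with or without a partition filter."""
    if year is None:
        # No filter on the partition column: every partition is read.
        return [f for files in parts.values() for f in files]
    # Filter on the partition column: only the matching directory is read.
    return list(parts.get(year, []))


print(files_to_scan(partitions, year=2016))
print(len(files_to_scan(partitions)))
```

With the filter, the 2014 and 2015 directories are never opened, so a partition column that matches the most common `WHERE` conditions keeps the scanned data small.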

Impala SQL support

SQL Support

$ impala-shell
[impala1.test:21000] > USE db_example;
Query: USE db_example;
Database changed.
[impala1.test:21000] > SELECT * FROM example;
Query: SELECT * FROM example;
+-----+----------+
| Day | Audience |
+-----+----------+
| 20  | 122      |
| 21  | 129      |
+-----+----------+
2 rows in set (0.18 sec)

Use Your Favorite SELECT...

Data Manipulation

How to Write the Data?

Impala Data Types

Kick Out From Your DDL...

Impala Specific DDL

CREATE TABLE … PARTITIONED BY (column INT)

CREATE TABLE … STORED AS PARQUET

CREATE TABLE … ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE

COMPUTE STATS mytable

REFRESH mytable

INVALIDATE METADATA

Hadoop Integration

Hadoop integration

HDFS, Kudu

Spark, MapReduce, Tez

Sqoop: import/export

Hive: HQL

Impala: SQL

HBase: NoSQL

Storm: stream

Hue / ODBC / ...

Apache Kudu (incubating)

● Source: Cloudera Blog

https://blog.cloudera.com/blog/2015/09/kudu-new-apache-hadoop-storage-for-fast-analytics-on-fast-data/

Other Big Data SQL

Apache Hive

Apache Drill

Druid

MariaDB Column Store

Our use case

○ Only for internal use, several queries per hour

○ No client reports

○ Billions of rows

○ Group by (web, zone, position, ...)

○ Period (from one day up to the entire period)

○ Aggregated daily, weekly, yearly reports

Yahoo! use case

○ Asynchronous client report

○ Around 15k requests/hour, 6 TB of data in total

○ couldn’t handle the use case

SELECT question FROM audience;
