SKT Hadoop DW SK telecom Corporate R&D Center Yousun Jeong

IEEE International Conference on Data Engineering 2015


Page 1: IEEE International Conference on Data Engineering 2015

SKT Hadoop DW

SK telecom Corporate R&D Center

Yousun Jeong

Page 2: IEEE International Conference on Data Engineering 2015

Copyright@ 2015 by SK Telecom All rights reserved.

Table of Contents

1. Big Data in SKT

2. What is Hadoop DW?

3. SQL on Hadoop - TAJO

4. Hadoop DW Commercialization Cases

Page 3: IEEE International Conference on Data Engineering 2015


1. Big Data in SKT

High TCO for Data Management

250 TB/day (91.25 PB/year). 4 Hadoop clusters with various commercial MPP databases for analytics.

[Figure: data flow across Operational Systems (Marketing, Sales, ERP, SCM), Integration Layer, Data Warehouse, and Marts (Mart A, Mart B, Mart C, Mart D), with ODS and Staging Area stages, served by Hadoop+Hive and an MPP DBMS.]

High TCO for Data Management (too much data is loaded into the MPP DBMS)

One Unified Solution

30 PB+ (compressed) on 1000+ nodes. 10+ Hadoop clusters with Tajo & Spark for all purposes.

[Figure: the same data flow, served by Hadoop+Tajo+Spark.]

Affordable & Faster (unified framework for Big Data)
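The yearly volume quoted on this slide follows directly from the daily ingest rate. A minimal check, assuming decimal units (1 PB = 1000 TB) and a 365-day year, neither of which is stated on the slide:

```python
# Daily ingest rate taken from the slide.
TB_PER_DAY = 250

# Assumptions (not stated on the slide): 365-day year, decimal units.
DAYS_PER_YEAR = 365
TB_PER_PB = 1000

tb_per_year = TB_PER_DAY * DAYS_PER_YEAR
pb_per_year = tb_per_year / TB_PER_PB

print(f"{tb_per_year} TB/year = {pb_per_year} PB/year")
```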


Page 4: IEEE International Conference on Data Engineering 2015


1. Big Data in SKT: SKT Hadoop Clusters

✓ Optimized configuration of a large-scale cluster
✓ Operation know-how of managing 1000+ nodes
✓ Fault tolerant and effective resource management system

[Figure: SKT Hadoop clusters (T-Hadoop). Labels on the slide: Data Collector (data collection & pre-processing); Main Cluster (analysis); R&D Cluster; ~250 TB/day (700+ nodes); Service Logic; Repository; (200+ nodes); (100+ nodes); Service Cluster (150+ nodes); App. 1 … App. N; Data Feeding; Commercialize; Develop.]

Page 5: IEEE International Conference on Data Engineering 2015


2. What is Hadoop DW?

【 Hadoop DW Infrastructure 】
“Hadoop S/W and Commodity H/W Based Cost-effective IT Infrastructure System”
Data: Structured/Unstructured Data; Scale-out Structure (Petabyte, Exabyte)
Cost: Low price ($200 ~ $1,000 / TB)
H/W: Commodity H/W (x86 Server)
S/W: Hadoop Architecture; SQL on Hadoop; Hadoop File System

【 Legacy IT Infrastructure 】
“High-price, High-performance Proprietary IT Infrastructure System”
Data: Structured Data; Scale-up Structure (Terabyte)
Cost: High price ($5,000 ~ $50,000 / TB)
H/W: High Performance H/W (MPP, Fabric Switch, etc.)
S/W: Proprietary S/W (RDBMS, etc.)

Also labeled on the slide: Transaction/Batch Processing (SQL)

※ MPP: Massively Parallel Processing; SAN: Storage Area Network; NAS: Network Attached Storage; RDBMS: Relational DB Management System; SQL: Structured Query Language

The Hadoop DW provides a Hadoop-architecture-based data warehouse for enterprise environments, so that users can accommodate massive, growing volumes of data at low cost.

Solution SKT Hadoop DW
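To make the per-terabyte price ranges on this slide concrete, the sketch below scales them to a capacity. The 100 TB capacity and the helper name `cost_range` are hypothetical, chosen only for illustration; the price ranges are the ones quoted on the slide.

```python
from typing import Tuple

# Price ranges in USD per TB, as quoted on the slide.
HADOOP_DW_USD_PER_TB = (200, 1_000)
LEGACY_USD_PER_TB = (5_000, 50_000)


def cost_range(capacity_tb: int, usd_per_tb: Tuple[int, int]) -> Tuple[int, int]:
    """Return the (low, high) cost in USD for storing capacity_tb terabytes."""
    low, high = usd_per_tb
    return capacity_tb * low, capacity_tb * high


capacity_tb = 100  # hypothetical capacity, not from the slide
hadoop_cost = cost_range(capacity_tb, HADOOP_DW_USD_PER_TB)
legacy_cost = cost_range(capacity_tb, LEGACY_USD_PER_TB)

print(f"Hadoop DW: ${hadoop_cost[0]:,} to ${hadoop_cost[1]:,}")
print(f"Legacy:    ${legacy_cost[0]:,} to ${legacy_cost[1]:,}")
```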


Page 6: IEEE International Conference on Data Engineering 2015


3. SQL on Hadoop - TAJO

[ Legacy Approach (MR) ]
Hive on MapReduce over HDFS (Hadoop Cluster): partially distributed, sequential process.

[ Tajo Approach ]
Tajo over HDFS (Hadoop Cluster + Tajo): fully distributed, vector process.

Processing speed: process more data on the same clusters with improved processing speed.

Response speed: try more queries for analysis with improved response speed. Query response on the Hadoop cluster: "up to 10x min"; with Tajo: "few sec ~ min".

High-speed SQL-on-Hadoop processing engine
• 3~5x improvement in processing speed over Hive under the TPC-H procedure
• 80~100% of Impala's response speed, without a data size limit
• Full ANSI SQL support for easy RDBMS migration

Page 7: IEEE International Conference on Data Engineering 2015


3. SQL on Hadoop - TAJO

[ Tajo Features ]

SQL Support
▪ ANSI SQL support
▪ Partition Type
▪ Meta Store

Service Stability
▪ High Availability
▪ Resource Manager
▪ Fair Scheduler

Performance
▪ High-speed processing
▪ Shuffling
▪ Dynamic Query Optimizer
▪ Query Rewriting

System Integration
▪ BI Connector
▪ Proxy Support
▪ Tajo-R

Function Support
▪ Analytic Function
▪ Hive Function

[Figures: Performance Comparison; Apache Top-Level Project]

Page 8: IEEE International Conference on Data Engineering 2015


3.1 Tajo Architecture

[Figure: Tajo architecture. Labels on the slide, grouped as the layout suggests:]

Clients: JDBC Driver, ODBC Driver, Tajo CLI
Tajo Master: Client Service Handler, SQL Parser, Query Rewriter, Logical Planner, Logical Optimizer, Resource Manager, Query Manager
Catalog: Tajo Catalog, HCatalog
Persistent Storage: Derby Store, MySQL Store, PostgreSQL Store
Worker roles: 1. Query Master, 2. TaskRunner
Worker (Query Master): Global Planner, Client Service Handler
Worker: Local Query Engine, Storage Manager
Storage: Local, HDFS/HBase, S3/Swift

Page 9: IEEE International Conference on Data Engineering 2015


3.1 Technical Characteristic - Logical Flow Data Processing

Tajo Master
- SQL Parser: query parsing
- Logical/Global Planner: decomposition of the query into work units
- Resource Manager: delivery of work units to the servers

Tajo Worker (several per cluster)
- Physical Planner: decomposition of a work unit into task operation units
- Query Engine: execution of unit operations
- Storage Manager: disk data I/O control
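The master/worker flow on this slide can be sketched as a toy pipeline. This is not Tajo code: the classes `Master`, `Worker`, and `WorkUnit`, the round-robin dispatch, the chunk sizes, and the single supported query are all assumptions made for illustration, and an in-memory list stands in for the Storage Manager.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WorkUnit:
    unit_id: int
    rows: List[int]


class Worker:
    """Toy worker: physical planning plus a trivial query engine."""

    def __init__(self, name: str) -> None:
        self.name = name

    def execute(self, unit: WorkUnit) -> int:
        # Physical planner: split the work unit into task operation units (2 rows each).
        tasks = [unit.rows[i:i + 2] for i in range(0, len(unit.rows), 2)]
        # Query engine: run the unit operation (a sum) on every task.
        return sum(sum(task) for task in tasks)


class Master:
    """Toy master: parse, decompose into work units, dispatch to workers."""

    def __init__(self, workers: List[Worker]) -> None:
        self.workers = workers

    def parse(self, query: str) -> str:
        # Stand-in for the SQL parser: only one hard-coded query is accepted.
        normalized = " ".join(query.split()).lower()
        if normalized != "select sum(x) from t":
            raise ValueError(f"unsupported query: {query!r}")
        return "sum"

    def plan(self, rows: List[int], unit_size: int) -> List[WorkUnit]:
        # Stand-in for the logical/global planner: fixed-size work units.
        return [WorkUnit(i // unit_size, rows[i:i + unit_size])
                for i in range(0, len(rows), unit_size)]

    def run(self, query: str, rows: List[int]) -> int:
        self.parse(query)
        units = self.plan(rows, unit_size=4)
        # Stand-in for the resource manager: round-robin delivery to workers.
        partials = [self.workers[u.unit_id % len(self.workers)].execute(u)
                    for u in units]
        return sum(partials)


master = Master([Worker("w1"), Worker("w2")])
total = master.run("SELECT SUM(x) FROM t", list(range(10)))
print(total)
```

The point of the sketch is only the division of labor: the master decides what the work units are and where they go, while each worker decides how to break its unit into tasks and execute them.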

Page 10: IEEE International Conference on Data Engineering 2015


3.1 Technical Characteristic - JIT Query Engine

Behavior depends on the operand characteristic:
- 1 + 2 = 3
- “a” + “b” = “ab”
- {1,2} + {3,4} = {4,6}
- 1 + {1,2} = {2,3}

<Existing methods>
The operator is implemented as one binary that must account for every operand-type case, which degrades performance (call, if, switch below 50%).

switch(operand)
  case numeric: add numeric
  case string: add string

<JIT methods>
Code is generated at run time based on the operand types, and combined operations can be handled by compiler optimization.
Four operators fused into a single operation (+ twice, - once, x once):

Result = A x (1-B) + (1+C)

[Figure: expression tree for Result, built from +, x, - and + nodes.]
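The contrast between per-call type dispatch and run-time code generation can be sketched in Python. This is only an analogy: Python's `compile`/`exec` produces bytecode, not native machine code, and the helper names `add_interpreted` and `compile_expression` are hypothetical, not Tajo APIs.

```python
from typing import Callable, Sequence


def add_interpreted(x, y):
    """Generic '+' that branches on the operand types at every call."""
    x_is_list, y_is_list = isinstance(x, list), isinstance(y, list)
    if x_is_list and y_is_list:
        return [a + b for a, b in zip(x, y)]  # {1,2} + {3,4} = {4,6}
    if y_is_list:
        return [x + b for b in y]             # 1 + {1,2} = {2,3}
    if x_is_list:
        return [a + y for a in x]
    return x + y                              # 1 + 2 = 3, "a" + "b" = "ab"


def compile_expression(source: str, columns: Sequence[str]) -> Callable:
    """Generate one specialized function for a whole expression at run time."""
    code = f"def generated({', '.join(columns)}):\n    return {source}\n"
    namespace: dict = {}
    exec(compile(code, "<generated>", "exec"), namespace)
    return namespace["generated"]


# The slide's expression, fused into a single generated function.
result_fn = compile_expression("A * (1 - B) + (1 + C)", ["A", "B", "C"])

rows = [(2.0, 0.5, 3.0), (4.0, 0.25, 1.0)]
results = [result_fn(a, b, c) for a, b, c in rows]
print(results)
```

In the generated function the four operators are evaluated in one call with no per-operator type switch, which is the effect the slide attributes to the JIT approach.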

Page 11: IEEE International Conference on Data Engineering 2015


3.1 Technical Characteristic - Vectorized Query Engine

<Tuple at a time>
- DB
- 1 operation/record

<Vectorized engine>
- Vectorized data
- 1 operation/vector

A[] = {a1, a2, a3, a4, a5, a6}
B[] = {b1, b2, b3, b4, b5, b6}
C[] = A[] + B[]

[Figure: tuple-at-a-time processing applies one + per pair (a1 + b1, …, a6 + b6); the vectorized engine applies a single + to the whole vectors A and B.]
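The difference in operator invocations can be sketched in plain Python. The function names and the call counters are illustrative assumptions; a real vectorized engine would run the per-vector loop as a tight, cache- and SIMD-friendly loop rather than a Python list comprehension.

```python
from typing import List, Tuple


def add_record(a: int, b: int) -> int:
    """The '+' operator applied to a single record."""
    return a + b


def add_vector(a_col: List[int], b_col: List[int]) -> List[int]:
    """The '+' operator applied to two whole column vectors."""
    return [a + b for a, b in zip(a_col, b_col)]


def tuple_at_a_time(a_col: List[int], b_col: List[int]) -> Tuple[List[int], int]:
    """One operator call per record; returns the result and the call count."""
    out, calls = [], 0
    for a, b in zip(a_col, b_col):
        out.append(add_record(a, b))
        calls += 1
    return out, calls


def vectorized(a_col: List[int], b_col: List[int]) -> Tuple[List[int], int]:
    """One operator call per vector; returns the result and the call count."""
    return add_vector(a_col, b_col), 1


A = [1, 2, 3, 4, 5, 6]
B = [10, 20, 30, 40, 50, 60]

c_tuple, tuple_calls = tuple_at_a_time(A, B)
c_vec, vec_calls = vectorized(A, B)
print(c_vec, tuple_calls, vec_calls)
```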

Page 12: IEEE International Conference on Data Engineering 2015


3.1 Technical Characteristic - Storage Manager

[Figure: Storage Manager. Labels on the slide: Tajo Worker (scan) x3; Scan Scheduler; Request queue x3; Disk Scanner x3; Pre-fetching Buffer; Bulk Read; Fine granularity file request.]

Page 13: IEEE International Conference on Data Engineering 2015


4. Hadoop DW Commercialization Cases: Telco

[ SK Telecom ]

Business Challenge

• Explosion of log data with LTE service
• Increase in types of data to be analyzed
• Insufficient DW capacity due to high cost

How SKT Hadoop DW Helped

✓ 3x storage expansion under same price, or 80% reduction in unit price
✓ Enabled ad-hoc analysis of unstructured text data sets for daily
✓ Hadoop DW decreased contents-based analysis process time from a few hours to 20 minutes at most

Category          MPP DBMS             Hadoop DW
Raw Data Size     0.5 TB/day           4 TB/day
Total ETL Time    Average of 3 hours   Average of 6 hours
DW Creation       30 minutes           40 minutes
Mart Creation     1 hour               1 hour 40 minutes
Report Creation   1 hour 30 minutes    2 hours 4 minutes
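The table above compares runs on different raw data sizes, so the raw durations are not directly comparable. One way to normalize them, sketched below, is ETL throughput in TB per hour, computed from the "Raw Data Size" and "Total ETL Time" rows only. The slide does not report this derived figure; it is an illustration under the assumption that the daily raw volume is processed within the quoted average ETL time.

```python
# (raw data size in TB/day, average total ETL time in hours), from the table.
runs = {
    "MPP DBMS": (0.5, 3.0),
    "Hadoop DW": (4.0, 6.0),
}


def etl_throughput(raw_tb: float, etl_hours: float) -> float:
    """Return TB of raw data processed per hour of ETL."""
    return raw_tb / etl_hours


throughput = {name: etl_throughput(tb, hours) for name, (tb, hours) in runs.items()}
speedup = throughput["Hadoop DW"] / throughput["MPP DBMS"]

for name, value in throughput.items():
    print(f"{name}: {value:.3f} TB/hour")
print(f"Hadoop DW / MPP DBMS throughput ratio: {speedup:.1f}")
```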

Page 14: IEEE International Conference on Data Engineering 2015


4. Hadoop DW Commercialization Cases: Manufacturer

[ Global Top-5 Semiconductor Player ]

Business Challenge

• An immense amount of unstructured measurement data is collected during manufacturing
• RDBMS & BI tools cannot handle this type of data
• Even data loading can take up to 20 min

How SKT Hadoop DW Helped

✓ Support for unstructured data through variable column schema
✓ 100x increase in data processing capacity
✓ Decreased data loading time by 10x (2 min)
✓ Minimized user action for pivot/unpivot

Page 15: IEEE International Conference on Data Engineering 2015


Thank you.