SKT Hadoop DW
SK telecom Corporate R&D Center
Yousun Jeong
Copyright@ 2015 by SK Telecom All rights reserved.
Table of Contents
1. Big Data in SKT
2. What is Hadoop DW?
3. SQL on Hadoop - Tajo
4. Hadoop DW Commercialization Cases
1. Big Data in SKT

High TCO for Data Management
• 250 TB/day (91.25 PB/year)
• 4 Hadoop clusters with various commercial MPP databases for analytics
• Platform: Hadoop + Hive, MPP DBMS
• Too much data is loaded into the MPP DBMS

One Unified Solution
• 30 PB+ (compressed) on 1000+ nodes
• 10+ Hadoop clusters with Tajo & Spark for all purposes
• Platform: Hadoop + Tajo + Spark
• Affordable & faster (unified framework for Big Data)

[Diagram, same layout in both cases: Operational Systems (Marketing, Sales, ERP, SCM), Integration Layer, Data Warehouse, and Marts (Mart A to Mart D), with an ODS and Staging Areas along the way]
1. Big Data in SKT: SKT Hadoop Clusters

T-Hadoop
✓ Optimized configuration of a large-scale cluster
✓ Operation know-how of managing 1000+ nodes
✓ Fault-tolerant and effective resource management system

[Diagram of the T-Hadoop clusters. Components: Data Collector (data collection & pre-processing), Main Cluster (analysis), R&D Cluster, Service Cluster (150+ nodes) running App. 1 … App. N, Service Logic, Repository. Other figures shown: ~250 TB/day, 700+ nodes, 200+ nodes, 100+ nodes. Flows: Data Feeding (to two clusters), Develop., Commercialize]
2. What is Hadoop DW?

Solution: SKT Hadoop DW
The Hadoop DW provides a data warehouse built on the Hadoop architecture for enterprise environments, so that users can accommodate massive and growing volumes of data at low cost.

【 Legacy IT Infrastructure 】
“High-price, High-performance Proprietary IT Infrastructure System”
• Data: structured data; scale-up structure (terabyte)
• Cost: high price ($5,000~$50,000 / TB)
• H/W: high-performance H/W (MPP, fabric switch, etc.)
• S/W: proprietary S/W (RDBMS, etc.)

【 Hadoop DW Infrastructure 】
“Hadoop S/W and Commodity H/W Based Cost-effective IT Infrastructure System”
• Data: structured/unstructured data; scale-out structure (petabyte, exabyte)
• Cost: low price ($200~$1,000 / TB)
• H/W: commodity H/W (x86 server)
• S/W: Hadoop architecture, SQL on Hadoop

Other labels in the figure: Transaction/Batch Processing (SQL), Hadoop File System

※ MPP: Massively Parallel Processing, SAN: Storage Area Network, NAS: Network Attached Storage, RDBMS: Relational DB Management System, SQL: Structured Query Language
3. SQL on Hadoop - Tajo

High-speed SQL-on-Hadoop processing engine
• 3~5x faster processing than Hive on the TPC-H benchmark
• 80~100% of Impala's response speed, without a data size limit
• Full ANSI SQL support for easy RDBMS migration

[ Legacy Approach (MR) ]
Hive on MapReduce over HDFS (Hadoop Cluster)
• Partially distributed
• Sequential processing

[ Tajo Approach ]
Tajo over HDFS (Hadoop Cluster + Tajo)
• Fully distributed
• Vector processing

Processing Speed
• Process more data on the same clusters with improved processing speed

Response Speed
• Try more queries for analysis with improved response speed
• Query on a Hadoop cluster: up to 10x min; query on a Hadoop cluster + Tajo: a few sec~min
3. SQL on Hadoop - Tajo

[ Tajo Features ]

SQL Support
▪ ANSI SQL support
▪ Partition Type
▪ Meta Store

Service Stability
▪ High Availability
▪ Resource Manager
▪ Fair Scheduler

Performance
▪ High-speed processing
▪ Shuffling
▪ Dynamic Query Optimizer
▪ Query Rewriting

System Integration
▪ BI Connector
▪ Proxy Support
▪ Tajo-R

Function Support
▪ Analytic Function
▪ Hive Function

[ Performance Comparison ]
[ Apache Top-Level Project ]
3.1 Tajo Architecture

[Architecture diagram]
• Clients: Tajo CLI, JDBC Driver, ODBC Driver
• Tajo Master: Client Service Handler, SQL Parser, Query Rewriter, Logical Planner, Logical Optimizer, Query Manager, Resource Manager
• Catalog: Tajo Catalog, HCatalog
• Persistent Storage: Derby Store, MySQL Store, PostgreSQL Store
• Worker roles: 1. Query Master, 2. TaskRunner
• Worker components: Query Master (Global Planner, Client Service Handler), Local Query Engine, Storage Manager
• Storage: Local, HDFS/HBase, S3/Swift
3.1 Technical Characteristic - Logical Flow of Data Processing

Tajo Master
• SQL Parser: query parsing
• Logical/Global Planner: decomposition into work units
• Resource Manager: work units delivered to the servers

Tajo Worker (multiple workers)
• Physical Planner: decomposing a task into operation units
• Query Engine: unit operation
• Storage Manager: disk data I/O control
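The master/worker split on this slide can be pictured with a toy sketch. This is not Tajo code: the function names (`plan_work_units`, `run_work_unit`, `execute_query`), the fixed unit size, and the use of a thread pool as stand-in workers are all assumptions made for illustration. The "query" is a simple sum over blocks of numbers.

```python
from concurrent.futures import ThreadPoolExecutor


def plan_work_units(blocks, unit_size):
    """Master side (hypothetical): split the input blocks into work units."""
    return [blocks[i:i + unit_size] for i in range(0, len(blocks), unit_size)]


def run_work_unit(unit):
    """Worker side (hypothetical): run one unit, here a partial sum."""
    return sum(sum(block) for block in unit)


def execute_query(blocks, unit_size=2, num_workers=3):
    """Plan on the 'master', run units on 'workers', merge the partial results."""
    units = plan_work_units(blocks, unit_size)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(run_work_unit, units))
    return sum(partials), len(units)


# Five data blocks standing in for file blocks on the cluster.
blocks = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
total, unit_count = execute_query(blocks)
print(total, unit_count)
```

The point of the sketch is only the shape of the flow: planning happens once, units run independently, and the results are merged at the end.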
3.1 Technical Characteristic - JIT Query Engine

Behavior depends on the operand characteristics:
- 1 + 2 = 3
- "a" + "b" = "ab"
- {1,2} + {3,4} = {4,6}
- 1 + {1,2} = {2,3}

<Existing methods>
Implemented as one precompiled binary that must account for every possible case, which degrades performance (call, if, switch: below 50%).
switch(operand)
  case numeric: add numeric
  case string: add string

<JIT methods>
Code is generated at run time based on the operand types, so a combined operation can be handled by compiler optimization.
Four operations become a single operation (two +, one -, one x):
Result = A x (1-B) + (1+C)

[Figure: expression tree for Result, with operator nodes +, x, -, +]
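A rough sketch of the contrast, under stated assumptions: the first function walks the expression tree and branches on every node for every row, and the second generates one fused function from the tree and reuses it. It uses Python's built-in `compile`/`eval` as a stand-in for a real JIT compiler; the tuple-based tree format and all function names are hypothetical and do not come from Tajo.

```python
# Expression tree for Result = A x (1 - B) + (1 + C), as nested tuples.
EXPR = ("+", ("*", "A", ("-", 1, "B")), ("+", 1, "C"))


def interpret(node, row):
    """Existing method: walk the tree and branch on each node, per row."""
    if isinstance(node, tuple):
        op, left, right = node
        lhs, rhs = interpret(left, row), interpret(right, row)
        if op == "+":
            return lhs + rhs
        if op == "-":
            return lhs - rhs
        if op == "*":
            return lhs * rhs
        raise ValueError(f"unknown operator: {op}")
    if isinstance(node, str):  # column reference
        return row[node]
    return node  # literal constant


def to_source(node):
    """Render the tree as a single Python expression string."""
    if isinstance(node, tuple):
        op, left, right = node
        return f"({to_source(left)} {op} {to_source(right)})"
    if isinstance(node, str):
        return f"row[{node!r}]"
    return repr(node)


def jit_compile(expr):
    """JIT-style method: generate the code once, then reuse the function."""
    source = f"lambda row: {to_source(expr)}"
    return eval(compile(source, "<generated>", "eval"))


rows = [{"A": 2, "B": 3, "C": 4}, {"A": 10, "B": 0, "C": -1}]
fused = jit_compile(EXPR)
results_interp = [interpret(EXPR, r) for r in rows]
results_jit = [fused(r) for r in rows]
print(results_interp, results_jit)
```

Both paths return the same values; the difference is that the generated function contains no per-node dispatch, which is the effect the slide attributes to run-time code generation.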
3.1 Technical Characteristic - Vectorized Query Engine

<Tuple at a time>
- DB
- 1 operation/record

<Vectorized engine>
- Vectorized data
- 1 operation/vector

A[] = {a1, a2, a3, a4, a5, a6}
B[] = {b1, b2, b3, b4, b5, b6}
C[] = A[] + B[]

[Figure: tuple at a time performs six separate additions (a1 + b1, …, a6 + b6); the vectorized engine adds column a1…a6 to column b1…b6 with a single + operation]
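The difference in operation counts can be sketched as below. This is a minimal illustration in plain Python, not how Tajo implements it: the function names and the `vector_size` parameter are assumptions, and a real engine would run the vector step as a tight loop over typed arrays rather than Python lists.

```python
from operator import add

A = [1, 2, 3, 4, 5, 6]
B = [10, 20, 30, 40, 50, 60]


def add_tuple_at_a_time(a_col, b_col):
    """One operator invocation per record."""
    out, calls = [], 0
    for a, b in zip(a_col, b_col):
        out.append(add(a, b))
        calls += 1
    return out, calls


def add_vectorized(a_col, b_col, vector_size=1024):
    """One operator invocation per vector (batch) of records."""
    out, calls = [], 0
    for start in range(0, len(a_col), vector_size):
        a_vec = a_col[start:start + vector_size]
        b_vec = b_col[start:start + vector_size]
        out.extend(map(add, a_vec, b_vec))  # whole vector in one step
        calls += 1
    return out, calls


c_rows, row_calls = add_tuple_at_a_time(A, B)
c_vec, vec_calls = add_vectorized(A, B)
print(c_rows, row_calls)
print(c_vec, vec_calls)
```

For the six-element columns on the slide, the first version makes six invocations and the second makes one, with identical results.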
3.1 Technical Characteristic - Storage Manager

[Diagram]
• Tajo Workers (scan) send fine-granularity file requests to the Storage Manager
• Storage Manager components: Scan Scheduler, three Request queues, three Disk Scanners, Pre-fetching Buffer
• Disk access is done as Bulk Read
4. Hadoop DW Commercialization Cases: Telco

[ SK Telecom ]

Business Challenge
• Explosion of log data with LTE service
• Increase in the types of data to be analyzed
• Insufficient DW capacity due to high cost

How SKT Hadoop DW Helped
✓ 3x storage expansion at the same price, or 80% reduction in unit price
✓ Enabled ad-hoc analysis of unstructured text data sets for daily
✓ Reduced content-based analysis time from a few hours to 20 minutes at most

Category | MPP DBMS | Hadoop DW
Raw Data Size | 0.5 TB/day | 4 TB/day
Total ETL Time | Average of 3 hours | Average of 6 hours
DW Creation | 30 minutes | 40 minutes
Mart Creation | 1 hour | 1 hour 40 minutes
Report Creation | 1 hour 30 minutes | 2 hours 4 minutes
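The table is easier to read when normalized by data volume, since the Hadoop DW column handles 8x the raw data. A small worked calculation follows, assuming that "Total ETL Time" refers to processing one day's raw data; that reading, and the dictionary keys used here, are assumptions rather than something the slide states.

```python
# Figures taken from the table above.
CASES = {
    "MPP DBMS": {"raw_tb_per_day": 0.5, "total_etl_hours": 3.0},
    "Hadoop DW": {"raw_tb_per_day": 4.0, "total_etl_hours": 6.0},
}


def etl_hours_per_tb(case):
    """ETL hours spent per terabyte of raw data."""
    return case["total_etl_hours"] / case["raw_tb_per_day"]


per_tb = {name: etl_hours_per_tb(case) for name, case in CASES.items()}
ratio = per_tb["MPP DBMS"] / per_tb["Hadoop DW"]

for name, hours in per_tb.items():
    print(f"{name}: {hours:.1f} ETL hours per TB")
print(f"Per-TB ETL time ratio (MPP DBMS / Hadoop DW): {ratio:.1f}x")
```

Under that assumption, total ETL time doubles while the raw data volume grows eightfold, so the time per terabyte drops by a factor of four.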
4. Hadoop DW Commercialization Cases: Manufacturer

[ Global Top-5 Semiconductor Player ]

Business Challenge
• An immense amount of unstructured measurement data is collected during manufacturing
• RDBMS & BI tools cannot handle this type of data
• Data loading alone can take up to 20 min

How SKT Hadoop DW Helped
✓ Support for unstructured data through a variable column schema
✓ 100x increase in data processing capacity
✓ Data loading time reduced by 10x (to 2 min)
✓ Minimized user actions for pivot/unpivot
Thank you.