Architecture and Performance Considerations in the Logical Data Lake
Dr. Alberto Pan, Chief Technical Officer
Agenda
1. Data Lake Architecture
2. Data Virtualization in the Logical Data Lake
3. Performance: ‘Move Processing to the Data’
4. Performance: Choosing the Best Execution Plan
5. Example Scenario: The Numbers
Data Lake Architecture
Architecture of the Data Lake
[Figure: data lake architecture diagram. Labels recoverable from the diagram:]
Consumers: Real-Time Decision Management, Alerts, Scorecards, Dashboards, Reporting, Data Discovery, Self-Service Search, Predictive Analytics, Statistical Analytics (R), Text Analytics, Data Mining
Data sources: Big Data (Sensor Data, Machine Data (Logs), Social Data, Clickstream Data, Internet Data, Image and Video, Enterprise Content (Unstructured)); Traditional Enterprise Data (Enterprise Applications); Cloud (Cloud Applications)
Cross-cutting: Metadata Management, Data Governance, Data Security
Data stores: Data Warehouse (EDW, In-Memory (SAP Hana, …), Analytical Appliances, Cloud DW (Redshift, ...), ODS); NoSQL
Data movement: Big Data ETL, CDC, Sqoop, (Flume, Kafka, …); Real-Time Data Access (On-Demand / Streaming); Batch
Hadoop: HDFS, YARN / Workload Management, Tez, MapRed., Hive, Spark, Drill, Impala, Storm, HBase, Solr, Hunk (DW, Streams, NoSQL, Search, SQL)
The Logical Data Lake Architecture
Integrated view of a plurality of systems: Hadoop, EDW, streaming, in-memory, ...
Questions for the Logical Data Lake:
How can I combine data from several systems while ensuring good performance?
How can I insulate consuming applications from technology change and evolving requirements?
How can I enforce consistent security and governance policies across the data lake?
DV in the Logical Data Lake
Architecture of the Logical Data Lake
[Figure: logical data lake architecture diagram. It shows the same consumers, data sources, data stores, data movement tools and Hadoop stack as the data lake architecture diagram, with a Data Virtualization layer placed between the consumers and the underlying systems.]
Data Virtualization layer: Real-Time Data Access (On-Demand / Streaming), Data Caching, Data Services, Data Search & Discovery, Governance, Security, Optimization, Data Abstraction, Data Transformation, Data Federation
What is Needed?
Requirements for the Integration Component in the Logical Data Lake
Denodo Data Virtualization is the only option that meets all of these requirements:
Ability to answer ad-hoc queries combining data from several systems
Performance comparable to physical approaches
Ability to expose different logical views over the same data
Single entry point for applying security and governance policies, with comprehensive, granular security support
Performance: Move Processing to the Data
Move Processing to the Data
Process the data where it resides
Each data source processes its data locally
The DV system combines the partial results
Minimizes network traffic
Leverages specialized data sources
Move Processing to the Data: Example 1
Obtain Total Sales by Product (Naive Strategy)
Naive strategy: 350M rows moved through the network
Move Processing to the Data: Example 1
Obtain Total Sales by Product (Move Processing to the Data)
Denodo strategy: 30K rows moved through the network
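The difference between the two strategies in Example 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the sales data source, the table and column names (sales, product_id, amount) are hypothetical, and the row counts are scaled far down from the slide's 350M and 30K. It is not Denodo's implementation.

```python
import sqlite3
from collections import defaultdict


def make_sales_source(n_sales=3_500, n_products=30):
    """Hypothetical stand-in for the sales data source (in-memory SQLite)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (product_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [(i % n_products, 1.0) for i in range(n_sales)],
    )
    return conn


def naive_total_sales(source):
    """Naive strategy: fetch every sales row, then aggregate in the DV layer."""
    rows = source.execute("SELECT product_id, amount FROM sales").fetchall()
    totals = defaultdict(float)
    for product_id, amount in rows:
        totals[product_id] += amount
    return dict(totals), len(rows)


def pushdown_total_sales(source):
    """Pushdown strategy: the source runs the GROUP BY, so only one row
    per product crosses the network."""
    rows = source.execute(
        "SELECT product_id, SUM(amount) FROM sales GROUP BY product_id"
    ).fetchall()
    return dict(rows), len(rows)


source = make_sales_source()
naive_totals, naive_rows = naive_total_sales(source)
pushed_totals, pushed_rows = pushdown_total_sales(source)
print(f"naive: {naive_rows} rows transferred, pushdown: {pushed_rows} rows transferred")
```

Both functions return the same totals; only the number of rows that leave the data source changes.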
Move Processing to the Data: Example 2
Maximum Sales Discount by Product in the Last Year: On-the-Fly Data Movement
Move the Products data to a temp table in the DW: 20K rows moved through the network + 10K rows inserted in the DW
Execute the full query on the DW: 10K rows through the network
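The two steps of on-the-fly data movement can be sketched as follows. This is a minimal illustration, assuming two in-memory SQLite databases as stand-ins for the Products DB and the data warehouse. The schemas (products.list_price, sales.sale_price), the temp table name tmp_products and the definition of discount as list price minus sale price are all hypothetical, and the "last year" filter is omitted for brevity.

```python
import sqlite3

# Hypothetical Products DB: the small table.
products_db = sqlite3.connect(":memory:")
products_db.execute("CREATE TABLE products (product_id INTEGER, list_price REAL)")
products_db.executemany(
    "INSERT INTO products VALUES (?, ?)", [(1, 100.0), (2, 50.0)]
)

# Hypothetical data warehouse: the large fact table.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE sales (product_id INTEGER, sale_price REAL)")
dw.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, 90.0), (1, 70.0), (2, 45.0), (2, 50.0)],
)


def max_discount_with_data_movement(products_db, dw):
    # Step 1: copy the small table into a temp table inside the warehouse.
    rows = products_db.execute(
        "SELECT product_id, list_price FROM products"
    ).fetchall()
    dw.execute("CREATE TEMP TABLE tmp_products (product_id INTEGER, list_price REAL)")
    dw.executemany("INSERT INTO tmp_products VALUES (?, ?)", rows)

    # Step 2: the warehouse now holds both tables and runs the full query.
    result = dw.execute(
        "SELECT s.product_id, MAX(p.list_price - s.sale_price) "
        "FROM sales s JOIN tmp_products p ON p.product_id = s.product_id "
        "GROUP BY s.product_id ORDER BY s.product_id"
    ).fetchall()

    dw.execute("DROP TABLE tmp_products")
    return result, len(rows)


result, rows_moved = max_discount_with_data_movement(products_db, dw)
print(result, rows_moved)
```

Only the small table and the final result travel over the network; the large sales table never leaves the warehouse.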
Move Processing to the Data: Example 2
Maximum Sales Discount by Product in the Last Year: Partial Aggregation Pushdown
Products DB: 10K rows through the network
Data Warehouse: #rows through the network = 10K * average #sale_prices_per_product
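One way to read the row counts above is that the warehouse groups sales by product and sale price, so it returns one row per distinct (product, sale price) pair, and the DV layer finishes the aggregation after joining with the product data. The sketch below illustrates that reading only; the SQLite stand-ins, the schemas and the discount definition (list price minus sale price) are assumptions, and the "last year" filter is omitted.

```python
import sqlite3

# Hypothetical Products DB.
products_db = sqlite3.connect(":memory:")
products_db.execute("CREATE TABLE products (product_id INTEGER, list_price REAL)")
products_db.executemany(
    "INSERT INTO products VALUES (?, ?)", [(1, 100.0), (2, 50.0)]
)

# Hypothetical data warehouse; note the repeated sale prices.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE sales (product_id INTEGER, sale_price REAL)")
dw.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, 90.0), (1, 90.0), (1, 70.0), (2, 45.0), (2, 50.0), (2, 45.0)],
)


def max_discount_partial_pushdown(products_db, dw):
    # The Products DB returns one row per product.
    products = dict(
        products_db.execute("SELECT product_id, list_price FROM products").fetchall()
    )
    # The warehouse performs a partial aggregation: one row per distinct
    # (product, sale price) pair instead of one row per sale.
    partial = dw.execute(
        "SELECT product_id, sale_price FROM sales GROUP BY product_id, sale_price"
    ).fetchall()
    # The DV layer joins the partial results and completes the aggregation.
    result = {}
    for product_id, sale_price in partial:
        discount = products[product_id] - sale_price
        result[product_id] = max(result.get(product_id, discount), discount)
    return result, len(products), len(partial)


result, product_rows, dw_rows = max_discount_partial_pushdown(products_db, dw)
print(result, product_rows, dw_rows)
```

With this toy data the warehouse sends 4 rows rather than 6; the saving grows with the number of sales that share a price.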
Performance: Choosing the Best Execution Plan
How to Choose the Best Execution Plan?
Cost-Based Optimization in Data Virtualization
Must take into account:
Data statistics, to estimate the size of intermediate result sets
Data source indexes (and other physical structures)
Execution model of the data sources: e.g. parallel databases vs. Hadoop clusters vs. relational databases
Features of the data sources (e.g. number of processing cores in a parallel database or Hadoop cluster)
Data transfer rate
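A toy cost model shows how some of these factors could combine into a plan choice. Everything here is an assumption made for illustration: the Plan fields, the cost formula (transfer time plus processing time at the source) and the rates are hypothetical, and only the transferred-row figures are borrowed from Example 1. A real optimizer would also account for indexes and the execution model of each source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    rows_transferred: int          # estimated from data statistics
    rows_processed_at_source: int  # work delegated to the data source
    source_cores: int              # feature of the data source


def estimated_cost(plan, transfer_rate_rows_per_s=100_000,
                   rows_per_core_per_s=1_000_000):
    """Hypothetical cost in seconds: network transfer plus source processing."""
    transfer_time = plan.rows_transferred / transfer_rate_rows_per_s
    processing_time = plan.rows_processed_at_source / (
        rows_per_core_per_s * plan.source_cores
    )
    return transfer_time + processing_time


def choose_plan(plans):
    """Pick the candidate plan with the lowest estimated cost."""
    return min(plans, key=estimated_cost)


plans = [
    Plan("naive: ship all rows to the DV layer", 350_000_000, 0, 1),
    Plan("full aggregation pushdown", 30_000, 350_000_000, 8),
]
best = choose_plan(plans)
print(best.name)
```

Under these assumed rates the transfer term dominates, which is why the pushdown plan wins; a slower source or a faster network would shift the balance.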
Example Scenario: The Numbers
Example Scenario: The Numbers
Best Performance Even When Processing Billions of Rows
Performance comparison of physical vs. logical scenario
Big Data volumes
TPC-DS benchmark
Sales (Netezza): 290M
Customers (Oracle): 2M
Items (SQL Server): 400K
Example Scenario: The Numbers
Physical vs. Logical DW Performance
Query Description | Rows Returned | AVG Time Physical (all data in Netezza) | AVG Time Logical | Optimization Technique (automatically chosen by Denodo 6.0)
Total sales by customer | 1.99 M | 20975 ms | 21457 ms | Full group by pushdown
Total sales by customer and year between 2000 and 2004 | 5.51 M | 52313 ms | 59060 ms | Full group by pushdown
Total sales by item brand | 31.35 K | 4697 ms | 5330 ms | Partial group by pushdown
Total sales by item where sale price less than current list price | 17.05 K | 3509 ms | 5229 ms | On-the-fly data movement
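The relative overhead of the logical scenario can be derived from the timings in the table. The short script below does that arithmetic; the dictionary layout and variable names are only a convenience for this sketch.

```python
# (physical ms, logical ms) per query, copied from the table above.
timings_ms = {
    "Total sales by customer": (20975, 21457),
    "Total sales by customer and year between 2000 and 2004": (52313, 59060),
    "Total sales by item brand": (4697, 5330),
    "Total sales by item where sale price less than current list price": (3509, 5229),
}

# Overhead of the logical scenario relative to the physical one, in percent.
overhead_pct = {
    query: (logical - physical) / physical * 100
    for query, (physical, logical) in timings_ms.items()
}

for query, pct in overhead_pct.items():
    print(f"{query}: logical is {pct:.1f}% slower than physical")
```

The overhead works out to roughly 2% to 13% for the three pushdown queries and about 49% for the on-the-fly data movement query, where the absolute difference is under two seconds.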
Thanks!
www.denodo.com
info@denodo.com
© Copyright Denodo Technologies. All rights reserved.
Unless otherwise specified, no part of this PDF file may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without prior written authorization from Denodo Technologies.
Find more details at datavirtualization.blog: http://www.datavirtualizationblog.com/myths-in-data-virtualization-performance/