Designing a High-Performance Data Warehouse
Welcome to the webinar.
Contents
1. What happened in the Data 1.0 world
2. What is shaping the new Data 2.0 world
3. Designing a high-performance data warehouse
4. Q&A
What happened in the Data 1.0 World?
Before 2000
- Do we need a DWH?
- Advent of the ODS
- Data silos
- Metrics for success?
- OLAP = insights
- Painful implementations

2000s
- Select success: top-down & bottom-up
- We've got BI / DWH tools
- Performance vs. volume: game changer
- Drill-down reporting from the DWH getting into the mainstream
- Standardized KPIs
- Analytics as differentiator?
- Retaining skills and expertise
- Business led

Now
- Volume | Variety | Velocity | Value
- Need insights from non-structured data as well
- Analytics is a differentiator
- Show me the ROI
- (Data) big, real-time, in-memory: what to do with existing initiatives?
- Data 2.0: scale, performance, knowledge, relevance
Challenges in the current DW environment (survey)

TDWI research based on 278 respondents. Top responses:
- 42% say: Can't scale to big data volumes
- 27% say: Inadequate data load speed
- 27% say: Poor query response
- 25% say: Existing DW modeled for reports & OLAP only
- 24% say: Can't score analytic models fast enough
- 24% say: Cost of scaling up or out is too expensive
- 23% say: Can't support high concurrent user count
- 19% say: Inadequate support for in-memory processing
- 18% say: Current platform needs great manual effort for performance
- 18% say: Poorly suited to real-time workloads
- 15% say: Can't support in-database analytics
- 15% say: Poor CPU speed and capacity
- 9% say: Current platform is a legacy; we must phase it out
High-Performance Data Warehouse
- Speed
- Ability to scale
- Concurrency enabled
- Able to handle complexity
Data 2.0 World
- New data types: social media data, text data, sensor data, syndicated data, numeric data
- New insights: true sentiment, faster compliance, faster reach, big data analytics
- Business drivers: analytics = competitive advantage; efficiencies driving down costs; customer experience & service
- Non-rich structured and unstructured enterprise data doubles every 18 months
- Business is now equipped to consume, identify, and act upon this data for superior insights
So what is a high-performance data warehouse?
Key Dimensions

A high-performance data warehouse is defined along four dimensions:

- SPEED: fast queries; streaming big data; event processing; real-time operation; operational BI; near-time analytics; dashboard refresh
- COMPLEXITY: big data variety (unstructured, sensor, social media); many sources and targets; complex models and SQL; high availability
- CONCURRENCY: competing workloads (OLAP, analytics); intraday data loads; thousands of users; ad hoc queries
- SCALE: big data volumes; detailed source data; thousands of reports; scale out into cloud, clusters, grids, etc.
Designing a High-Performance Data Warehouse
Industry-recognized top techniques

TDWI research based on 329 responses from 114 respondents:
- 45% say: Creating summary tables
- 44% say: Adding indexes
- 33% say: Altering SQL statements or routines
- 24% say: Changing physical data models
- 24% say: Using in-memory databases
- 21% say: Upgrading hardware
- 20% say: Moving an application to a separate data mart
- 16% say: Applying workload management controls
- 16% say: Choosing between column- and row-oriented data storage
- 16% say: Restricting or throttling user queries
- 15% say: Shifting some workloads to off-peak hours
- 10% say: Adjusting system parameters
- 6% say: Others
Designing Summary Tables
Summary table design process

COLLECT a good sampling of queries. These may come from user interviews, testing/QA queries, production queries, reports, or any other means that provides a good representation of expected production queries.

ANALYZE the dimension hierarchy levels, dimension attributes, and fact table measures that are required by each query or report.

IDENTIFY the row counts associated with each dimension level represented.

BALANCE the most commonly required dimension levels against the number of rows in the resulting summary tables. A goal should be to design summary tables that are roughly 1/100th the size of the source fact tables in terms of rows, or less.

MINIMIZE the columns carried in the summary table in favor of joining back to the dimension table. The larger the summary table, the less performance advantage it provides.
Some of the best candidates for aggregation will be those where the row counts decrease the most from one level in a hierarchy to the next.
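The process above ends in a physical table build. The following is a minimal sketch (SQLite stands in for the warehouse platform; all table and column names are hypothetical) of a summary table aggregated to the store/month grain, with a check of the row-count reduction:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical detailed fact table: one row per store/item/month-day.
cur.execute("CREATE TABLE sales_fact (store_id INT, item_id INT, "
            "sale_date TEXT, sale_amt REAL, sales_qty INT)")
rows = [(s, i, f"2024-{m:02d}-15", 10.0, 1)
        for s in range(10) for i in range(20) for m in range(1, 13)]
cur.executemany("INSERT INTO sales_fact VALUES (?,?,?,?,?)", rows)

# Summary table: aggregate to store/month, carrying only the keys
# needed to join back to the dimension tables (the MINIMIZE step).
cur.execute("""
    CREATE TABLE sales_store_month AS
    SELECT store_id,
           substr(sale_date, 1, 7) AS sale_month,
           SUM(sale_amt)  AS sale_amt,
           SUM(sales_qty) AS sales_qty
    FROM sales_fact
    GROUP BY store_id, substr(sale_date, 1, 7)
""")

fact_rows = cur.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
summary_rows = cur.execute("SELECT COUNT(*) FROM sales_store_month").fetchone()[0]
print(fact_rows, summary_rows)  # 2400 fact rows collapse to 120 summary rows
```

Here a 2,400-row fact table collapses to 120 summary rows, a 20x reduction; hitting the 1/100th target from the BALANCE step would call for an even coarser grain.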
Capturing requirements for the summary table

- Choosing aggregates to create: two basic pieces of information are required to select the appropriate aggregates: the expected usage patterns of the data, and the data volumes and distributions in the fact table.

Sample report requirements (dimension levels and measures per report):

Report    | Store    | Item     | Date                          | Measures
Report 1  | District |          | Calendar Year                 | Sale_Amt, Sales_Qty
Report 2  | District | Category | Calendar Year, Calendar Month | Sale_Amt, Sales_Qty
Report 3  | District |          | Calendar Year, Calendar Month | Sale_Amt, Sales_Qty
Report 4  | District |          | Fiscal Period                 | Sale_Amt
Report 5  | Store    | Dept     | Fiscal Week                   | Sales_Qty
Report 6  |          | Dept     | Fiscal Period                 | Sale_Amt
Report 7  | District |          | Fiscal Week                   | Sale_Amt, Sales_Qty
Report 8  | District |          | Fiscal Week                   | Sale_Amt, Sales_Qty
Report 9  | District | Dept     | Fiscal Quarter                | Sale_Amt
Report 10 | District |          | Fiscal Period                 | Sales_Qty
Report 11 | Region   | Category | Fiscal Week                   |

Populated member counts per dimension level:

Dimension       | Level          | # of Populated Members
Store Geography | Division       | 1
                | Region         | 3
                | District       | 50
                | Store          | 3980
Item Category   | Subject        | 279
                | Category       | 1987
                | Department     | 4145
Date            | Fiscal Year    | 3
                | Fiscal Quarter | 12
                | Fiscal Period  | 36
                | Fiscal Week    | 156
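The "row counts decrease the most from one level to the next" heuristic can be made concrete by turning the member counts above into level-to-level reduction ratios, as in this small sketch:

```python
# Member counts per hierarchy level, taken from the dimension table above.
hierarchies = {
    "Store Geography": [("Division", 1), ("Region", 3),
                        ("District", 50), ("Store", 3980)],
    "Item Category":   [("Subject", 279), ("Category", 1987),
                        ("Department", 4145)],
    "Date":            [("Fiscal Year", 3), ("Fiscal Quarter", 12),
                        ("Fiscal Period", 36), ("Fiscal Week", 156)],
}

# For each adjacent pair of levels, compute how much the row count shrinks
# when rolling up from the finer level to the coarser one.
ratios = {}
for dim, levels in hierarchies.items():
    for (coarse, coarse_n), (fine, fine_n) in zip(levels, levels[1:]):
        ratios[(dim, fine, coarse)] = fine_n / coarse_n

best = max(ratios, key=ratios.get)
print(best, round(ratios[best], 1))  # Store -> District shrinks ~80x
```

By this measure, aggregating Store up to District (3980 to 50 members, roughly an 80x reduction) is the strongest single candidate in the sample data.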
Summary table design considerations

Aggregate storage column selection
- Semi-additive and all non-additive fact data need not be stored in the summary table
- Add as many "pre-calculated" columns as possible
- "Count" columns can be added for non-additive facts to preserve a portion of the information

Storing aggregate rows
- A combined table containing base-level fact rows and aggregate rows
- A single aggregate table which holds all aggregate data for a single base fact table
- A separate table for each aggregate created (the most preferred option)

Recreating vs. updating aggregates
- Updating is efficient: aggregation programs fold the newly loaded data into the aggregate tables
- Regeneration is more appropriate if there is a lot of program logic needed to determine what data must be updated in the aggregate table

Storing aggregate dimension data (multiple hierarchies in a single dimension)
- Store all of the aggregate dimension records together in a single table
- Use a separate table for each level in the dimension
- Add dimension data to the aggregate fact table
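The "updating" option for maintaining aggregates can be sketched as an incremental load (names are hypothetical; SQLite's UPSERT syntax stands in for whatever the target platform provides):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales_store_month (store_id INT, sale_month TEXT, "
            "sale_amt REAL, PRIMARY KEY (store_id, sale_month))")
# Existing aggregate row from a previous load.
cur.execute("INSERT INTO sales_store_month VALUES (1, '2024-01', 100.0)")

# Newly loaded fact data: one row hits an existing grain, one is new.
new_rows = [(1, "2024-01", 25.0), (2, "2024-01", 40.0)]

# Fold the new data into the existing aggregate instead of rebuilding it.
cur.executemany("""
    INSERT INTO sales_store_month (store_id, sale_month, sale_amt)
    VALUES (?, ?, ?)
    ON CONFLICT (store_id, sale_month)
    DO UPDATE SET sale_amt = sale_amt + excluded.sale_amt
""", new_rows)

final = sorted(cur.execute("SELECT * FROM sales_store_month").fetchall())
print(final)  # [(1, '2024-01', 125.0), (2, '2024-01', 40.0)]
```

This works cleanly only for additive measures; when complex logic decides which aggregate rows are affected, full regeneration (as noted above) is the safer choice.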
Efficient indexing for the data warehouse
Dimension table indexing
- Create a non-clustered primary key on the surrogate key of each dimension table
- A clustered index on the business key should be considered; it enhances query response when the business key is used in the WHERE clause, and helps avoid lock escalation during the ETL process
- For large Type 2 SCDs, create a four-part non-clustered index: business key, record begin date, record end date, and surrogate key
- Create non-clustered indexes on columns in the dimension that will be used for searching, sorting, or grouping
- If there is a hierarchy in a dimension, such as Category > Sub-Category > Product ID, create an index on the hierarchy

Index columns                                                  | Index type
EmployeeKey                                                    | Non-clustered
EmployeeNationalIDAlternateKey                                 | Clustered
EmployeeNationalIDAlternateKey, StartDate, EndDate, EmployeeKey | Non-clustered
FirstName, LastName, DepartmentName                            | Non-clustered
Fact table indexing
- Create a clustered, composite index composed of each of the foreign keys of the fact table
- Keep the most commonly queried date column as the leftmost column in the index
- There can be more than one date in the fact table, but there is usually one date of most interest to business users; a clustered index on this column quickly narrows the data that must be evaluated for a given query

Index columns                                                                              | Index type
OrderDateKey, ProductKey, CustomerKey, PromotionKey, CurrencyKey, SalesTerritoryKey, DueDateKey | Clustered
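A small illustration of the composite-index guidance (SQLite stands in here; it has no clustered indexes, so an ordinary composite index with the date key leftmost is used, and EXPLAIN QUERY PLAN confirms the index is chosen for a date-ranged query):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE fact_sales (order_date_key INT, product_key INT, "
            "customer_key INT, sale_amt REAL)")
cur.executemany("INSERT INTO fact_sales VALUES (?,?,?,?)",
                [(d, d % 7, d % 11, 1.0) for d in range(1000)])

# Composite index with the most commonly queried date column leftmost.
cur.execute("CREATE INDEX ix_fact ON fact_sales "
            "(order_date_key, product_key, customer_key)")

# A date-ranged query can now be answered via the index instead of a scan.
plan = cur.execute(
    "EXPLAIN QUERY PLAN SELECT COUNT(*) FROM fact_sales "
    "WHERE order_date_key BETWEEN 100 AND 200").fetchall()
print(plan)  # the plan should mention ix_fact rather than a full table scan
```

If the date column were not leftmost, the same range predicate could not seek on the index at all, which is exactly why column order matters in the composite.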
Column Oriented databases
Row Store and Column Store
Most queries do not process all the attributes of a given relation.

Row store: (+) easy to add or modify a record; (-) may read in unnecessary data
Column store: (+) only needs to read in relevant data; (-) tuple writes require multiple accesses

One can obtain the performance benefits of a column store using a row store by making changes to the physical structure of the row store:
- Vertical partitioning
- Index-only plans
- Materialized views
Vertical partitioning
- Fully vertically partition each relation: each column becomes one physical table
- Achieve this by adding an integer position column to every table (better than repeating the primary key)
- Join on position for multi-column fetches
Index-only plans
- Add a B+Tree index for every table.column
- Plans never access the actual tuples on disk
- Headers are not stored, so per-tuple overhead is lower
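A toy sketch of the vertical-partitioning idea above: one physical table per column, keyed by an integer position, joined back together only when a multi-column fetch requires it (all names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Each column of the logical relation becomes its own physical table,
# keyed by an integer position rather than the primary key.
cur.execute("CREATE TABLE t_name (pos INT PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE t_amount (pos INT PRIMARY KEY, amount REAL)")
cur.executemany("INSERT INTO t_name VALUES (?,?)",
                [(1, "a"), (2, "b"), (3, "c")])
cur.executemany("INSERT INTO t_amount VALUES (?,?)",
                [(1, 10.0), (2, 20.0), (3, 30.0)])

# A query touching only one attribute reads only that column's table...
total = cur.execute("SELECT SUM(amount) FROM t_amount").fetchone()[0]

# ...while a multi-column fetch joins the column tables on position.
rows = cur.execute("SELECT n.name, a.amount FROM t_name n "
                   "JOIN t_amount a ON n.pos = a.pos "
                   "ORDER BY n.pos").fetchall()
print(total, rows)
```

The single-column query never touches the name data at all, which is the column-store benefit the slide describes; the cost is the position join on multi-column reads.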
Using Hadoop for the data warehouse
Hadoop ecosystem
- Distributed storage (HDFS)
- Distributed processing (MapReduce)
- Metadata management (HCatalog)
- Query (Hive)
- Scripting (Pig)
- Workflow & scheduling (Oozie)
- Non-relational database (HBase)
- Data extraction & loading (HCatalog APIs, WebHDFS, Talend Open Studio for Big Data, Sqoop)
- Management & monitoring (Ambari, ZooKeeper)

An ecosystem of open-source projects hosted by the Apache Foundation; Google developed and shared the underlying concepts. At its core is a distributed file system with the ability to scale out.
Promising uses of Hadoop in the DW context
- Data staging: Hadoop allows organizations to deploy an extremely scalable and economical ETL environment
- Data archiving: Hadoop's scalability and low cost enable organizations to keep all data forever in a readily accessible online environment
- Schema flexibility: Hadoop can quickly and easily ingest any data format
- Processing flexibility: Hadoop enables the growing practice of "late binding"; instead of transforming data as it is ingested, structure is applied at runtime
- Distributed DW architecture: offload workloads for big data and advanced analytics to HDFS, discovery platforms, and MapReduce
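The "late binding" practice mentioned above amounts to schema-on-read: land the raw bytes as-is and apply structure only when a consumer reads them. A minimal sketch (file contents and field names are hypothetical):

```python
import csv
import io

# Raw data is landed as-is; no schema is imposed at ingest time.
raw = "2024-01-05,store-7,129.90\n2024-01-06,store-3,18.50\n"

# Structure is applied only at read time ("schema on read"): a consumer
# supplies the field names and types it cares about.
schema = ["sale_date", "store", "amount"]
records = [dict(zip(schema, row)) for row in csv.reader(io.StringIO(raw))]
records = [{**r, "amount": float(r["amount"])} for r in records]

print(records[0]["store"], sum(r["amount"] for r in records))
```

A different consumer could read the same raw file with a different schema, which is the flexibility late binding buys over transform-on-load.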
What led to the data warehouse at Facebook?

The problem: data, data and more data
- 200 GB per day in March 2008
- 2+ TB (compressed) per day

The Hadoop experiment
- Superior in availability, scalability, and manageability compared to commercial databases
- Uses the Hadoop Distributed File System (HDFS)

Challenges with Hadoop: programmability & metadata
- MapReduce is hard to program
- Need to publish data in well-known schemas

The solution: Hive
What is Hive?
- A system for managing and querying structured data, built on top of Hadoop
- Uses MapReduce for execution
- Uses HDFS for storage

Key building principles
- SQL on structured data as a familiar data warehousing tool
- Pluggable map/reduce scripts in the language of your choice
- Rich data types
- Performance

Tables
- Each table has a corresponding directory in HDFS
- A table can point to existing data directories in HDFS
- Data is split based on the hash of a column, mainly for parallelism
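The hash-based splitting mentioned in the last bullet (Hive calls this bucketing) boils down to routing each row by a hash of one column, so each bucket can be processed in parallel. A minimal sketch (zlib.crc32 is just a stable stand-in for whatever hash the engine actually uses; the data is invented):

```python
import zlib

NUM_BUCKETS = 4
rows = [{"user_id": f"u{i}", "clicks": i} for i in range(12)]

# Route every row to exactly one bucket by hashing the chosen column.
buckets = {b: [] for b in range(NUM_BUCKETS)}
for row in rows:
    b = zlib.crc32(row["user_id"].encode()) % NUM_BUCKETS
    buckets[b].append(row)

# Bucket membership is deterministic, so two tables bucketed the same way
# on the same column can be joined bucket-by-bucket in parallel.
print(sum(len(v) for v in buckets.values()))  # 12: every row lands somewhere
```
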
Analytical platforms
Analytical platforms overview

Purpose-built database management systems designed explicitly for query processing and analysis that provide dramatically higher price/performance and availability compared to general-purpose solutions.

Platforms include: 1010data, Aster Data (Teradata), Calpont, DATAllegro (Microsoft), Exasol, Greenplum (EMC), IBM Smart Analytics, Infobright, Kognitio, Netezza (IBM), Oracle Exadata, ParAccel, Pervasive, Sand Technology, SAP HANA, Sybase IQ (SAP), Teradata, Vertica (HP)

Deployment options:
- Software only (ParAccel, Vertica)
- Appliance (SAP, Exadata, Netezza)
- Hosted (1010data, Kognitio)

Examples in production:
- Kelley Blue Book: consolidates millions of auto transactions each week to calculate car valuations
- AT&T Mobility: tracks purchasing patterns for 80M customers daily to optimize targeted marketing
Which platform do you choose? It depends on where the data sits on the structured-to-unstructured spectrum:
- General-purpose RDBMS: structured data
- Analytic database: structured and semi-structured data
- Hadoop: semi-structured and unstructured data
Thank You
Please send your feedback, and your corporate training / consulting requirements on BI, to [email protected]