Top Five Reasons for Data Warehouse Modernization
Philip Russom, TDWI Research Director for Data Management
May 28, 2014
Speakers
• Philip Russom, TDWI Research Director, Data Management
• Steve Sarsfield, Product Marketing Manager, HP Vertica
Agenda
• Background
  – Why many users’ DWs need modernization
  – What is it?
  – There are many reasons, but I’ll boil it down to five
• Top Five Reasons
  – Analytics, Scale, Speed, Productivity, Cost Control
• New DW Architectures – Resulting from Modernization
• Recommendations
PLEASE TWEET @pRussom, #TDWI, #EDW, #DataWarehouse,
#DataArchitecture, #Analytics, #RealTime
“DW Modernization” has many meanings…
• Additions to the existing data warehouse
  – New data subjects, sources, tables, dimensions, etc.
• More standalone data platforms and tools
  – Complement the DW without replacing it
  – More marts and ODSs
  – New appliances, columnar databases, Hadoop, NoSQL, etc.
• Architectural adjustments
  – All the above
  – Better design
• Upgrades
  – Newer versions of current DBMS software
  – More hardware
• Rip and replace
  – Decommission the current DW platform and migrate to another
Top Five Goals for DW Modernization
• I’ll mostly focus on improvements to:
  – Analytics, Scale, Speed
  – These regularly rank high in TDWI surveys, for example:
    1. ANALYTICS
    2. SCALE
    3. SPEED
    SOURCE: 2014 TDWI Report: Evolving Data Warehouse Architectures, Figure 4
• I’ll also mention improvements to:
  – Productivity, Cost Control
  – These regularly come up in TDWI interviews with users
DW Modernization Goals are Related
• Analytics needs better productivity
• Speed contributes to scale and productivity
• The challenge is to gain improvements with the first four goals without incurring more of the fifth: cost
High-Performance Data Warehousing (HiPer DW) spans four requirement areas:
• CONCURRENCY – Competing workloads; reporting, real time, OLAP, advanced analytics, etc.; intra-day data loads; thousands of users; ad hoc queries
• SCALE – Big data volumes; detailed source data; thousands of reports; scale out into clouds, clusters, grids, distributed architectures
• SPEED – Streaming big data; event processing; real-time operation; operational BI; near-time analytics; dashboard refresh; fast queries
• COMPLEXITY – Big data variety; unstructured data; machine/sensor data; web & social media; many sources/targets; complex models & SQL; high availability
SOURCE: 2012 TDWI Report: High-Performance Data Warehousing, Figure 1.
BEYOND OLAP & REPORTING TO Advanced Analytics
• Organizations need more analytic insights
  – To compete, serve customers, be profitable, control costs, improve quality, grow, etc.
• Analytics is becoming a larger portion of BI work
  – Reporting and OLAP are still important
• Organizations need advanced forms of analytics
  – Technologies: extreme SQL, data mining, statistics, natural language processing, text mining, AI, graph, etc.
  – Methods: predictive, clustering, segmentation, risk, fraud detection, etc.
• Most users designed EDWs for reporting and OLAP
  – Analytics’ requirements differ from those of reports and OLAP
• Users face multiple paths to enabling advanced analytics
  – Retrofit analytics onto the report-focused EDW
  – Deploy an analytic data platform that complements the EDW
  – Replace the EDW’s platform with one that handles all workloads
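As a concrete (and deliberately tiny) illustration of analytics that goes beyond reports and OLAP, the sketch below aggregates transactions with SQL, then flags statistical outliers in Python. All table and customer names are invented for the example; sqlite3 is only a stand-in for an analytic DBMS.

```python
import sqlite3
import statistics

# Toy "EDW" table of transactions; the query aggregates per customer, and
# Python flags unusually large spenders -- a minimal stand-in for the kind
# of fraud/risk scoring the slide describes, not any vendor's method.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txns (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO txns VALUES (?, ?)", [
    ("alice", 20), ("alice", 25), ("bob", 30), ("bob", 28),
    ("carol", 22), ("mallory", 480),  # one unusually large spender
])

totals = dict(conn.execute(
    "SELECT customer, SUM(amount) FROM txns GROUP BY customer"))

mean = statistics.mean(totals.values())
stdev = statistics.pstdev(totals.values())

# Flag customers well above the mean spend (threshold is illustrative).
outliers = {c for c, t in totals.items() if t > mean + 1.5 * stdev}
print(outliers)
```

The same pattern scales down or up: the aggregation runs in the database, and only the small aggregated result is scored outside it.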
Scale TO MORE DATA, USERS, REPORTS, ANALYSES…
• Data’s growing volumes are a challenge
  – Large data warehouses – data for both reporting and analytics
  – Big data – volume aside, also diversity of data type, source, latency
• Scale is also a challenge to basic BI functions, like reporting
  – Thousands of concurrent BI users; thousands of reports
  – Eventually, thousands of analytic users
• Scale to increasing complexity
  – More processing for ETL, integration, quality, analytics, real time, etc.
  – Distributed DW architectures have more moving parts
• Scale despite growing numbers of concurrent workloads
  – Reporting, real time, OLAP, analytics, data loads, ad hoc queries…
• Users have a number of choices for scaling
  – Scale up: more hardware for more data; efficient storage
  – Scale out: clouds, clusters, grids, racks, distributed architectures
  – Deploy or migrate to data platforms built for analytics with big data: columnar databases, data warehouse appliances, newer brands of databases, Hadoop, NoSQL, etc.
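The scale-out option above rests on one simple mechanism: routing each row to a node by hashing a key. The sketch below shows the idea with invented node names; real MPP systems add rebalancing, replication, and failure handling on top.

```python
import hashlib

# Scale-out sketch: distribute rows across a cluster by hashing a key so each
# node holds a predictable shard. Node names and count are illustrative only.
NODES = ["node0", "node1", "node2", "node3"]

def node_for(key: str) -> str:
    # A stable hash ensures the same key always routes to the same node.
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

rows = [{"customer": f"cust{i}"} for i in range(1000)]
shards = {n: [] for n in NODES}
for row in rows:
    shards[node_for(row["customer"])].append(row)

# Every row lands on exactly one node; queries on a key touch one shard.
print({n: len(s) for n, s in shards.items()})
```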
EVERYTHING NEEDS MORE Speed
• Speed involves a temporal continuum
  – From high performance to near time and true real time
• Speed is enabled by a functional continuum
  – From hardware to perky queries to event processing
  – Many options are available for modernizing EDWs and analytics
• High-performance functionality
  – In-memory databases, in-database analytics, columnar databases, DW appliances, solid-state drives, modern CPUs, big memory in servers, etc.
• Near-time functionality
  – Microbatches, federation, virtualization, replication, services, query optimization, etc.
• Real-time functionality
  – Complex event processing (CEP), stream processing, operational intelligence, etc.
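Of the near-time options above, microbatching is easy to show in miniature: buffer incoming events and flush them as a batch when the buffer fills or a time window elapses. The class below is a generic sketch (the flush target stands in for a bulk load into the warehouse), not any product's loader.

```python
import time

# Near-time loading sketch: accumulate events, then flush a micro-batch when
# either a row threshold or a time threshold is reached.
class MicroBatchLoader:
    def __init__(self, max_rows=500, max_seconds=5.0):
        self.max_rows = max_rows
        self.max_seconds = max_seconds
        self.buffer = []
        self.last_flush = time.monotonic()
        self.flushed_batches = []  # stand-in for bulk loads into the DW

    def add(self, event):
        self.buffer.append(event)
        if (len(self.buffer) >= self.max_rows or
                time.monotonic() - self.last_flush >= self.max_seconds):
            self.flush()

    def flush(self):
        if self.buffer:
            self.flushed_batches.append(list(self.buffer))
            self.buffer.clear()
        self.last_flush = time.monotonic()

loader = MicroBatchLoader(max_rows=100, max_seconds=5.0)
for i in range(250):
    loader.add({"event_id": i})
loader.flush()  # drain the partial tail batch
print([len(b) for b in loader.flushed_batches])  # [100, 100, 50]
```

Tuning `max_rows` and `max_seconds` moves a feed along the temporal continuum between batch and near-real-time.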
MORE SOLUTIONS IN LESS TIME Productivity
• Agile and lean development methods
  – Early prototype, built out iteratively, instead of older “big bang” deliverables
  – Biz folks review/guide each iteration, to assure IT-to-biz alignment
• Requirements gathering (RG) now done online
  – Data exploration, discovery, and profiling replace RG
  – Req’s captured online, applied directly to the solution
• Fast tools and platforms make analytics productive
  – “Speed of thought” iterative analysis
  – Fast queries & bulk loads build analytic datasets fast
• Less time per project means
  – More projects
  – The organization uses the solution sooner
  – Greater agility for the business
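The "profiling replaces requirements gathering" point above amounts to asking the data what it can support. A minimal profiler like the one below (a sketch with invented column names, not a product feature) answers the first questions a written spec would: how complete, how varied, what range.

```python
# Profiling sketch: summarize a column's completeness and range so analysts
# can discover requirements from the data itself.
def profile(rows, column):
    values = [r.get(column) for r in rows]
    present = [v for v in values if v is not None]
    return {
        "rows": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }

rows = [
    {"region": "east", "sales": 120},
    {"region": "west", "sales": 95},
    {"region": "east", "sales": None},
]
print(profile(rows, "sales"))
```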
DATA VARIES IN VALUE; MANAGE IT ACCORDINGLY
Economics
• As you modernize a DW environment, rethink its economics
• Cost continuum of data platforms, from high $/TB to low:
  – Traditional platforms
  – New affordable platforms, built for DW/analytics
  – Cheap open source: Hadoop, NoSQL
• Choose a platform that fits a given data workload – but also fits the value of the data
  – High-value data on the core EDW
    • Modeling, cleansing, aggregating, and documenting data (which is required for reports and OLAP) increases its value
  – Analytic datasets in the mid tier
    • This data is lightly prepared or prepped on the fly; temp sandboxes
  – Source & archival data on the back tier
    • This is more of a “data lake” that preserves data in its original form, so it can be repurposed repeatedly as analytic projects arise
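A value-based placement policy like the one above can be made explicit as a small routing rule. The function below is only a sketch of that idea; the attribute names and tier rules are invented for illustration.

```python
# Economics sketch: route a dataset to a storage tier by its value, echoing
# the core-EDW / mid-tier / back-tier split described above.
def choose_tier(dataset):
    if dataset["curated"] and dataset["used_by_reports"]:
        return "core EDW"   # high-value, fully modeled and cleansed data
    if dataset["analytic_sandbox"]:
        return "mid tier"   # lightly prepared analytic datasets
    return "back tier"      # raw source/archival data (the "data lake")

print(choose_tier({"curated": True, "used_by_reports": True, "analytic_sandbox": False}))
print(choose_tier({"curated": False, "used_by_reports": False, "analytic_sandbox": True}))
print(choose_tier({"curated": False, "used_by_reports": False, "analytic_sandbox": False}))
```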
ONE WAY TO MODERNIZE A DW
Multi-Platform Data Warehouse Environments
• Many enterprise data warehouses (EDWs) are evolving into multi-platform data warehouse environments (DWEs).
• Users continue to add standalone data platforms to their warehouse tool and platform portfolio.
• The new platforms don’t replace the core warehouse, because it is still the best platform for the data that goes into standard reports, dashboards, performance management, and OLAP.
• Instead, the new platforms complement the warehouse, because they are optimized for workloads that manage, process, and analyze new forms of big data, non-structured data, and real-time data.
Modern DW System Architectures can be Complex
• The technology stack for DW, BI, analytics, and data integration has always been a multi-platform environment.
• What’s new? The trend toward a portfolio of many data platforms has accelerated.
• Why? More platform types to serve more data and workload types.
[Diagram: a data warehouse environment growing more complex over the passage of time – a core data warehouse (star or snowflake schema, multidimensional data models, OLAP cubes, OLAP DBMSs, metrics for performance management) surrounded by federated data marts, customer marts and ODSs, a real-time ODS, data staging areas, detailed source data, a DW from a merger, DW appliances, columnar DBMSs, NoSQL databases, Hadoop (distributed file system, MapReduce), an analytic sandbox, data federation & virtualization, streaming data tools, and complex event processing.]
Good Reasons for Integrating Hadoop with a Relational EDW
• A relational DBMS is good at:
  – Metadata management
  – Complex query optimization
  – Query federation
  – Table joins, views, keys, etc.
  – Security, including roles, directories
  – Much more mature development tools
• HDFS & other Hadoop tools are good at:
  – Massive scalability
  – Lower cost than most DW platforms & analytic DBMSs
  – Multi-structured data & no-schema data
  – Some ETL functions; late binding; custom code for analytics
• Use HDFS like a very scalable operational data store or data staging area, to modernize your existing DW environment
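The "HDFS as staging area, late binding" pattern above can be sketched in a few lines: raw delimited files land with no schema, and the schema is applied only when the data is parsed and bulk-loaded into the relational EDW. Here an in-memory CSV string stands in for a staged HDFS file and sqlite3 for the EDW; table and column names are invented for the example.

```python
import csv
import io
import sqlite3

# A raw staged file: no schema was applied when it landed (late binding).
staged_file = io.StringIO("order_id,amount\n1,19.99\n2,5.00\n3,42.50\n")

# The relational EDW side, where schema, types, and constraints live.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")

# Late binding: parse and type the data at read time, then bulk-insert.
reader = csv.DictReader(staged_file)
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(int(r["order_id"]), float(r["amount"])) for r in reader])

count, total = conn.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone()
print(count, total)
```

The raw file remains untouched in staging, so it can be re-parsed with a different schema for a later analytic project.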
Recommendations
• Reevaluate your data warehouse and related systems
  – There’s always room for improvement
  – Change is afoot, in both biz & tech
• Prioritize modernization by putting biz goals first
  – Biz wants to manage big data and leverage it
  – Biz wants to compete on analytics
  – Biz needs real-time tech to operate faster
  – Biz needs BI/DW solutions sooner and more agile
• Technology goals are also important, though secondary
  – Greater productivity from tech personnel
  – Assuring capacity for growth
  – Diversifying the data platform and tool portfolio to support more types of data, workloads, development methods, etc.
  – Migration to new platforms that are faster, more scalable, tuned for analytics, cost less, etc.
© Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Cost-Optimized Storage – Steve Sarsfield, Product Marketing Manager, HP Vertica
Recognizing that it’s time to modernize: Feeling the Pain
• TIME IS MONEY – What would be the business impact of reducing time from days to hours (hours to minutes)?
• READY FOR BIG DATA – What is your plan for managing the need for real-time data analysis as your data volumes continue to scale?
• ANALYTIC INNOVATION – Are you getting business insights from your organization’s data when you need them?
Big Data Warehouse – Key Features
• Manage Huge Data Volumes
• Deliver Fast Analytics
• Work with Legacy Tools
• Support Data Scientists
• Advanced Analytics
Feature details include: petabyte scale; joins and complex data types; SQL-based predictive analytics; Python and R; SQL-based visualization; ETL; what-if and A/B testing.
The same five-feature framework (manage huge data volumes, deliver fast analytics, work with legacy tools, support data scientists, advanced analytics) is used to compare the architecture options:
• Analytics Capabilities
• Reinforced Legacy Architectures
• New NoSQL Architectures
• Purpose-built Big Data Analytics Platform
Cost-Optimized Storage – ILM
• Tier off older data and discover value in it across a hot/cool/cold continuum:
  – Hot: interactive, frequently queried data – served, converted to Vertica storage format
  – Cool: batch data – explored in any format, via the Vertica data cache
  – Cold: archive data – stored in any format, via the Vertica data cache
  – Beyond cold sits dark data
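An ILM tiering policy of this kind usually keys off data age or access recency. The sketch below classifies data into hot/cool/cold by the date it was last queried; the thresholds are illustrative assumptions, not Vertica's actual defaults.

```python
from datetime import date, timedelta

# ILM sketch: classify data by access recency so older data can be tiered
# off to cheaper storage. Thresholds are invented for illustration.
def storage_tier(last_queried: date, today: date) -> str:
    age = today - last_queried
    if age <= timedelta(days=30):
        return "hot"    # interactive, frequently queried
    if age <= timedelta(days=365):
        return "cool"   # batch access
    return "cold"       # archive

today = date(2014, 5, 28)
print(storage_tier(date(2014, 5, 20), today))   # hot
print(storage_tier(date(2014, 1, 15), today))   # cool
print(storage_tier(date(2011, 3, 1), today))    # cold
```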
Core Capabilities Impact
How Do We Achieve Huge Performance Increases?
Secret Sauce of HP Vertica
• Columnar storage – speeds query time by reading only the necessary data
• Compression – lowers costly I/O to boost overall performance
• MPP scale-out – provides high scalability on clusters, with no name node or other single point of failure
• Distributed query – any node can initiate queries and use other nodes for work; no single point of failure
• Projections – combine high availability with special optimizations for query performance
[Diagram: shared-nothing cluster nodes, each with its own CPU, memory, and disk, holding distributed data segments.]
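The columnar-storage claim is easy to demonstrate in miniature: a query that touches one column of a column store reads only that column's values, while a row store must scan every field of every row. The simulation below counts fields read under each layout; it is a conceptual sketch, not Vertica's implementation.

```python
# Simulate why columnar storage cuts I/O for analytic queries.
rows = [{"id": i, "name": f"cust{i}", "region": "east", "amount": float(i)}
        for i in range(10_000)]

# Row store: summing one column still means scanning whole rows.
row_fields_read = sum(len(r) for r in rows)

# Column store: the 'amount' column is stored contiguously on its own,
# so the scan touches only those values.
amount_column = [r["amount"] for r in rows]
col_fields_read = len(amount_column)

print(row_fields_read, col_fields_read)  # 40000 10000
```

With 4 fields per row, the column scan reads a quarter of the data; real tables with hundreds of columns make the gap far larger, before compression shrinks it further.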
To find out more
• Purpose-built for Big Data from the first line of code
• Download and try: the Community Edition supports up to 1 TB on 3 nodes
• Contact us for more information or a 30-day trial
• http://www.vertica.com/try_vertica_community
• +1 617-386-4400
Contact Information
If you have further questions or comments:
Philip Russom, TDWI – [email protected]
Steve Sarsfield, HP – [email protected]