
Top Five Reasons for Data Warehouse Modernization

Philip Russom TDWI Research Director for Data Management

May 28, 2014

Sponsor


Speakers

Philip Russom, TDWI Research Director, Data Management

Steve Sarsfield, Product Marketing Manager, HP Vertica

Agenda

• Background
– Why many users’ DWs need modernization
– What is it?
– There are many reasons, but I’ll boil it down to five
• Top Five Reasons
– Analytics
– Scale
– Speed
– Productivity
– Cost Control
• New DW Architectures
– Resulting from Modernization
• Recommendations

PLEASE TWEET @pRussom, #TDWI, #EDW, #DataWarehouse, #DataArchitecture, #Analytics, #RealTime

“DW Modernization” has many meanings…

• Additions to existing data warehouse
– New data subjects, sources, tables, dimensions, etc.
• More standalone data platforms and tools
– Complement DW without replacing it
– More marts and ODSs
– New appliances, columnar databases, Hadoop, NoSQL, etc.
• Architectural Adjustments
– All the above
– Better design
• Upgrades
– Newer versions of current DBMS software
– More hardware
• Rip and Replace
– Decommission current DW platform and migrate to another


Contact Information

If you have further questions or comments:

Philip Russom, TDWI: [email protected]
Randy Lea, Teradata: [email protected]

Top Five Goals for DW Modernization

• I’ll mostly focus on improvements to:
– Analytics, Scale, Speed
• These regularly rank high in TDWI surveys, for example:
1. ANALYTICS
2. SCALE
3. SPEED
SOURCE: 2014 TDWI Report: Evolving Data Warehouse Architectures, Figure 4
• I’ll also mention improvements to:
– Productivity, Cost Control
• These regularly come up in TDWI interviews with users

DW Modernization Goals are Related

• Analytics needs better productivity
• The challenge is to gain improvements with the first four goals without incurring more of the fifth: cost.
• Speed contributes to scale and productivity

HIGH PERFORMANCE DATA WAREHOUSING (HiPer DW)

CONCURRENCY
• Competing Workloads
– Reporting, Real Time, OLAP, Adv. Analytics, etc.
• Intra-Day Data Loads
• Thousands of Users
• Ad hoc Queries

SCALE
• Big Data Volumes
• Detailed Source Data
• Thousands of Reports
• Scale Out Into:
– Clouds, clusters, grids, distributed architectures

SPEED
• Streaming Big Data
• Event Processing
• Real-Time Operation
• Operational BI
• Near-Time Analytics
• Dashboard Refresh
• Fast Queries

COMPLEXITY
• Big Data Variety
• Unstructured Data
• Machine/sensor Data
• Web & Social Media
• Many Sources/Targets
• Complex Models & SQL
• High Availability

SOURCE: 2012 TDWI Report: High Performance Data Warehousing, Figure 1.

BEYOND OLAP & REPORTING TO

Advanced Analytics

• Organizations need more analytic insights
– To compete, serve customers, be profitable, control costs, improve quality, grow, etc.
• Analytics is becoming a larger portion of BI work
– Reporting and OLAP are still important
• Organizations need advanced forms of analytics
– Technologies: Extreme SQL, data mining, statistics, natural language processing, text mining, AI, graph, etc.
– Methods: Predictive, clustering, segmentation, risk, fraud detection, etc.
• Most users designed EDWs for reporting and OLAP
– Analytics’ requirements differ from reports and OLAP
• Users face multiple paths to enabling advanced analytics
– Retrofit analytics onto report-focused EDW
– Deploy an analytic data platform that complements the EDW
– Replace the EDW’s platform with one that handles all workloads

Scale

TO MORE DATA, USERS, REPORTS, ANALYSES…

• Data’s Growing Volumes are a Challenge
– Large Data Warehouses: data for both reporting and analytics
– Big Data: volume aside, also diversity of data type, source, latency
• Scale is also a Challenge to Basic BI Functions, like Reporting
– Thousands of Concurrent BI Users; Thousands of Reports
– Eventually, thousands of analytic users
• Scale to Increasing Complexity
– More processing for ETL, integration, quality, analytics, real time, etc.
– Distributed DW architectures have more moving parts
• Scale despite Growing numbers of Concurrent Workloads
– Reporting, Real Time, OLAP, Analytics, Data Loads, Ad hoc Queries…
• Users have a number of choices for scaling
– Scale Up: More hardware for more data; efficient storage
– Scale Out: Clouds, clusters, grids, racks, distributed architectures
– Deploy or migrate to data platforms built for analytics with big data: columnar databases, data warehouse appliances, newer brands of databases, Hadoop, NoSQL, etc.

EVERYTHING NEEDS MORE

Speed

• Speed involves a temporal continuum
– From high performance to near time and true real time
• Speed is enabled by a functional continuum
– From hardware to perky queries to event processing
– Many options are available for modernizing EDWs and analytics
• High performance functionality
– In-memory databases, in-database analytics, columnar databases, DW appliances, solid-state drives, modern CPUs, big memory in servers, etc.
• Near-time functionality
– Microbatches, federation, virtualization, replication, services, query optimization, etc.
• Real-time functionality
– Complex event processing (CEP), stream processing, operational intelligence, etc.
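As one illustration of the near-time options above, microbatching loads small groups of records as soon as a batch fills up or ages out, instead of waiting for a nightly load. The following is a minimal Python sketch under stated assumptions: the function name `micro_batches`, the parameters `max_size` and `max_wait_s`, and the sample events are all hypothetical and are not taken from any product mentioned in this deck.

```python
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[float, str]  # (arrival time in seconds, payload)


def micro_batches(events: Iterable[Event], max_size: int = 3,
                  max_wait_s: float = 60.0) -> Iterator[List[Event]]:
    """Group time-ordered events into small batches.

    A batch is emitted when it holds max_size events, or when the next
    event arrives max_wait_s or more after the first event in the batch.
    """
    batch: List[Event] = []
    for event in events:
        # Age check: close the open batch if it has waited long enough.
        if batch and event[0] - batch[0][0] >= max_wait_s:
            yield batch
            batch = []
        batch.append(event)
        # Size check: close the batch as soon as it is full.
        if len(batch) >= max_size:
            yield batch
            batch = []
    if batch:  # flush whatever is left at the end of the stream
        yield batch


# Hypothetical sample stream: four events close together, one much later.
events = [(0.0, "a"), (10.0, "b"), (20.0, "c"), (30.0, "d"), (200.0, "e")]
batches = list(micro_batches(events))
```

Smaller values of `max_size` and `max_wait_s` move the load closer to real time at the cost of more, smaller load operations; larger values behave more like a traditional batch window.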

MORE SOLUTIONS IN LESS TIME

Productivity

• Agile and lean development methods
– Early prototype, built out iteratively
    • Instead of older “big bang” deliverables
– Biz folks review/guide each iteration
    • To assure IT-to-biz alignment
• Requirements gathering (RG) now done online
– Data exploration, discovery, profiling replace RG
– Req’s captured online, applied directly to solution
• Fast tools and platforms make analytics productive
– “Speed of thought” iterative analysis
– Fast queries & bulk loads build analytic datasets fast
• Less time per project means
– More projects
– Organization uses solution sooner
– Greater agility for the business

DATA VARIES IN VALUE; MANAGE IT ACCORDINGLY

Economics

• As you modernize a DW environment, rethink its economics
• Cost continuum of data platforms, from high $/TB to low:
– Traditional Platforms
– New Affordable Platforms, built for DW/Analytics
– Cheap Open Source: Hadoop, NoSQL
• Choose a platform that fits a given data workload, but also fits the value of data
– High-value data on the core EDW
    • Modeling, cleansing, aggregating, and documenting data (which is required for reports and OLAP) increases its value
– Analytic datasets in the mid tier
    • This data is lightly prepared or prepped on the fly; temp sandboxes
– Source & archival data on the back tier
    • This is more of a “data lake” that preserves data in its original form, so it can be repurposed repeatedly, as analytic projects arise
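The value-based placement described on this slide can be sketched as a simple rule. This is a hypothetical Python illustration only: the `Tier` names paraphrase the slide, while the `Dataset` fields and the `choose_tier` rule are assumptions made for the example, not a method prescribed by TDWI.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    CORE_EDW = "core EDW (high-value data)"
    MID_TIER = "mid tier (analytic datasets)"
    BACK_TIER = "back tier (source and archival data)"


@dataclass
class Dataset:
    name: str
    feeds_reports_or_olap: bool  # modeled, cleansed, aggregated, documented
    used_for_analytics: bool     # lightly prepared or prepped on the fly


def choose_tier(ds: Dataset) -> Tier:
    """Place a dataset on the cheapest tier that still fits its value."""
    if ds.feeds_reports_or_olap:
        return Tier.CORE_EDW
    if ds.used_for_analytics:
        return Tier.MID_TIER
    return Tier.BACK_TIER  # keep the original form for later repurposing


# Hypothetical datasets, invented for this example.
datasets = [
    Dataset("monthly_revenue", feeds_reports_or_olap=True, used_for_analytics=True),
    Dataset("churn_sandbox", feeds_reports_or_olap=False, used_for_analytics=True),
    Dataset("raw_web_logs", feeds_reports_or_olap=False, used_for_analytics=False),
]
placement = {ds.name: choose_tier(ds) for ds in datasets}
```

In practice the rule would weigh more factors (workload type, latency, retention), but the point is the same as on the slide: the platform follows the value of the data, not the other way around.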

ONE WAY TO MODERNIZE A DW

Multi-Platform Data Warehouse Environments

• Many enterprise data warehouses (EDWs) are evolving into multi-platform data warehouse environments (DWEs).
• Users continue to add standalone data platforms to their warehouse tool and platform portfolio.
• The new platforms don’t replace the core warehouse, because it is still the best platform for the data that goes into standard reports, dashboards, performance management, and OLAP.
• Instead, the new platforms complement the warehouse, because they are optimized for workloads that manage, process, and analyze new forms of big data, non-structured data, and real-time data.

Modern DW System Architectures can be Complex

• The technology stack for DW, BI, analytics, and data integration has always been a multi-platform environment.

• What’s new? The trend toward a portfolio of many data platforms has accelerated.

• Why? More platform types to serve more data and workload types.

[Figure: a multi-platform DW environment that has grown over the passage of time. Platforms shown: complex event processing; streaming data tools; analytic sandbox; data federation & virtualization; DW appliances; columnar DBMSs; NoSQL databases; Hadoop (distributed file system, MapReduce); star or snowflake schema data warehouse; federated data marts; customer mart or ODS; real-time ODS; metrics for performance management; multidimensional data models; OLAP cubes; OLAP DBMSs; data staging areas; detailed source data; DW from a merger.]

Good Reasons for Integrating Hadoop with Relational EDW

• A Relational DBMS is good at:
– Metadata management
– Complex query optimization
– Query federation
– Table joins, views, keys, etc.
– Security, including roles, directories
– Much more mature development tools
• HDFS & other Hadoop tools are good at:
– Massive scalability
– Lower cost than most DW platforms & analytic DBMSs
– Multi-structured data & no-schema data
– Some ETL functions; late binding; custom code for analytics
• Use HDFS like a very scalable operational data store or data staging area, to modernize your existing DW environment
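“Late binding” in the list above means the schema is applied when data is read, not when it is stored, which is what lets a staging area hold multi-structured records as they arrive. The sketch below illustrates the idea in plain Python, with JSON lines standing in for files in a staging area. The records, field names, and the helper `read_with_schema` are hypothetical; this is not a Hadoop API.

```python
import json
from typing import Callable, Dict, Iterable, Iterator

# Raw records are stored as-is; they do not share one structure.
raw_lines = [
    '{"user": "u1", "amount": "19.99", "ts": "2014-05-28"}',
    '{"user": "u2", "amount": "5.00"}',
    '{"user": "u3", "clicks": 7}',
]

# A schema maps field names to casts; it is chosen at read time.
Schema = Dict[str, Callable[[object], object]]


def read_with_schema(lines: Iterable[str], schema: Schema) -> Iterator[dict]:
    """Apply a schema while reading; missing fields become None."""
    for line in lines:
        record = json.loads(line)
        yield {field: (cast(record[field]) if field in record else None)
               for field, cast in schema.items()}


# One analytic project binds a "sales" view onto the raw data.
sales_schema: Schema = {"user": str, "amount": float}
sales = list(read_with_schema(raw_lines, sales_schema))
total = sum(r["amount"] for r in sales if r["amount"] is not None)
```

A different project could read the same raw lines with a different schema (for example, `{"user": str, "clicks": int}`) without reloading or remodeling the stored data.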

Recommendations

• Reevaluate your data warehouse and related systems
– There’s always room for improvement
– Change is afoot, in both biz & tech
• Prioritize modernization by putting biz goals first
– Biz wants to manage big data and leverage it
– Biz wants to compete on analytics
– Biz needs real-time tech to operate faster
– Biz needs BI/DW solutions sooner, more agile
• Technology goals are also important, though secondary
– Greater productivity from tech personnel
– Assuring capacity for growth
– Diversifying data platform and tool portfolio to support more types of data, workloads, development methods, etc.
– Migration to new platforms that are faster, more scalable, tuned for analytics, cost less, etc.

© Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Cost Optimized Storage

Steve Sarsfield, Product Marketing Manager, HP Vertica


Recognizing that it’s time to modernize

Feeling the Pain

• TIME IS MONEY: What would be the business impact of reducing time from days to hours (hours to minutes)?
• READY FOR BIG DATA: What is your plan for managing the need for real time data analysis as your data volumes continue to scale?
• ANALYTIC INNOVATION: Are you getting the business insights from your organization’s data when you need it?


Big Data Warehouse – Key Features

• Manage Huge Data Volumes
• Deliver Fast Analytics
• Work with Legacy Tools
• Support Data Scientists
• Advanced Analytics

Supporting capabilities shown on the slide: Joins, Complex Data Types; SQL-based Predictive Analytics; Python and R; SQL-based Visualization; ETL; Petabyte Scale; What-if, A/B testing


Analytics Capabilities

[Figure: three architectures compared on the same five capabilities (Manage Huge Data Volumes, Deliver Fast Analytics, Work with Legacy Tools, Support Data Scientists, Advanced Analytics).]

• Reinforced Legacy Architectures
• New NoSQL Architectures
• Purpose-built Big Data Analytics Platform


Cost-Optimized Storage - ILM

[Figure: information lifecycle management across Hot, Cool, and Cold tiers, tiering off older data. Data classes: Interactive Data (frequently queried), Batch Data, Archive Data, Dark Data. Stages, by location and format: Serve (convert data to Vertica storage format), Explore (any format), Store (any format). Other labels: Value Discovery; Vertica data cache.]


Core Capabilities Impact

How Do We Achieve Huge Performance Increases?


Secret Sauce of HP Vertica

• Columnar Storage
– Speeds query time by reading only necessary data
• Compression
– Lowers costly I/O to boost overall performance
• MPP Scale-Out
– Provides high scalability on clusters with no name node or other single point of failure
• Distributed Query
– Any node can initiate the queries and use other nodes for work. No single point of failure
• Projections
– Combine high availability with special optimizations for query performance

[Figure: three cluster nodes, each with its own CPU, memory, and disk.]
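To make the first two items concrete, here is a small, generic Python sketch of why a column layout lets a query read only the data it needs, and why sorted, low-cardinality columns compress well. It illustrates the general techniques (columnar layout and run-length encoding) and should not be read as HP Vertica's actual storage format or encoding. The table contents and the helper names `to_columns` and `rle_encode` are hypothetical.

```python
from itertools import groupby
from typing import Dict, List, Tuple

# Row-oriented records, as a transactional system might store them.
rows = [
    {"region": "east", "product": "A", "sales": 10},
    {"region": "east", "product": "B", "sales": 20},
    {"region": "west", "product": "A", "sales": 30},
    {"region": "west", "product": "A", "sales": 40},
]


def to_columns(records: List[dict]) -> Dict[str, list]:
    """Pivot row-oriented records into one list per column."""
    return {name: [rec[name] for rec in records] for name in records[0]}


def rle_encode(values: list) -> List[Tuple[object, int]]:
    """Run-length encode: collapse runs of equal adjacent values."""
    return [(value, len(list(run))) for value, run in groupby(values)]


columns = to_columns(rows)

# An aggregate over "sales" touches one column, not every field of every row.
total_sales = sum(columns["sales"])

# A sorted, low-cardinality column shrinks to a few (value, count) pairs.
region_rle = rle_encode(columns["region"])
```

Here the four `region` values collapse to two pairs; on large sorted columns the same idea reduces the amount of data read from disk, which is the I/O saving the slide refers to.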


To find out more

Purpose built for Big Data from the first line of code

Download and Try

Community Edition supports up to 1 TB on 3 nodes

Contact us for more information or a 30-day trial

Contact

http://www.vertica.com/try_vertica_community

[email protected]

+ 1 617-386-4400


Questions?


Contact Information

If you have further questions or comments:

Philip Russom, TDWI: [email protected]
Steve Sarsfield, HP: [email protected]

