Page 1

How to Select an Analytic DBMS

Overview, checklists, and tips

by Curt A. Monash, Ph.D.

President, Monash Research
Editor, DBMS2

contact @monash.com
http://www.monash.com
http://www.DBMS2.com

Page 2

Curt Monash

Analyst since 1981, own firm since 1987
Covered DBMS since the pre-relational days
Also analytics, search, etc.

Publicly available research
Blogs, including DBMS2 (www.DBMS2.com -- the source for most of this talk)
Feed at www.monash.com/blogs.html
White papers and more at www.monash.com

User and vendor consulting

Page 3

Our agenda

Why are there such things as specialized analytic DBMS?

What are the major analytic DBMS product alternatives?

What are the most relevant differentiations among analytic DBMS users?

What’s the best process for selecting an analytic DBMS?

Page 4

Why are there specialized analytic DBMS?

General-purpose database managers are optimized for updating short rows …
… not for analytic query performance
10-100X price/performance differences are not uncommon

At issue is the interplay between storage, processors, and RAM

Page 5

Moore’s Law, Kryder’s Law, and a huge exception

Growth factors:
Transistors/chip: >100,000 since 1971
Disk density: >100,000,000 since 1956
Disk speed: 12.5 since 1956

The disk speed barrier dominates everything!

[Chart: compound annual growth rate (0%-45% scale) of transistors/chip since 1971, disk density since 1956, and disk speed since 1956]
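
For context, the growth rates behind that chart can be recomputed from the factors listed above. A minimal sketch, assuming the comparison runs through roughly 2009 (the apparent date of this deck):

    def cagr(total_growth, years):
        """Compound annual growth rate implied by a total growth factor over `years`."""
        return total_growth ** (1.0 / years) - 1

    print(f"Transistors/chip since 1971: {cagr(100_000, 2009 - 1971):.0%}")      # ~35%
    print(f"Disk density since 1956:     {cagr(100_000_000, 2009 - 1956):.0%}")  # ~42%
    print(f"Disk speed since 1956:       {cagr(12.5, 2009 - 1956):.0%}")         # ~5%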


Page 6

The “1,000,000:1” disk-speed barrier

RAM access times ~5-7.5 nanoseconds
CPU clock speed <1 nanosecond
Interprocessor communication can be ~1,000X slower than on-chip

Disk seek times ~2.5-3 milliseconds
Limit = ½ rotation, i.e., 1/30,000 minutes (at 15,000 RPM), i.e., 1/500 seconds = 2 ms

Tiering brings it closer to ~1,000:1 in practice, but even so the difference is VERY BIG
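
A quick back-of-the-envelope check of the ratio in the slide title, using the low ends of the figures above:

    ram_access_s = 5e-9    # ~5 ns RAM access
    cpu_cycle_s  = 1e-9    # <1 ns CPU clock
    disk_seek_s  = 2.5e-3  # ~2.5 ms disk seek

    print(f"Disk seek vs. RAM access: {disk_seek_s / ram_access_s:,.0f}:1")  # ~500,000:1
    print(f"Disk seek vs. CPU cycle:  {disk_seek_s / cpu_cycle_s:,.0f}:1")   # ~2,500,000:1 -- order of 1,000,000:1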

Page 7

Software strategies to optimize analytic I/O

Minimize data returned
Classic query optimization

Minimize index accesses
Page size

Precalculate results
Materialized views
OLAP cubes

Return data sequentially
Store data in columns
Stash data in RAM
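
A toy illustration of "return data sequentially" and "store data in columns" above. This is not any vendor's engine, just a sketch of why an aggregate over one column touches far less data in a columnar layout:

    # Hypothetical table: 100,000 rows with a wide text field.
    rows = [{"id": i, "amount": i * 0.5, "note": "x" * 200} for i in range(100_000)]

    # The same data stored column by column.
    columns = {
        "id":     [r["id"] for r in rows],
        "amount": [r["amount"] for r in rows],
        "note":   [r["note"] for r in rows],
    }

    # Row layout: the scan drags every field (including the 200-byte note) past the CPU.
    total_from_rows = sum(r["amount"] for r in rows)

    # Column layout: only the "amount" column is read, as one sequential run.
    total_from_columns = sum(columns["amount"])

    assert total_from_rows == total_from_columns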

Page 8

Hardware strategies to optimize analytic I/O

Lots of RAM
Parallel disk access!!!
Lots of networking

Tuned MPP (Massively Parallel Processing) is the key

Page 9

Specialty hardware strategies

Custom or unusual chips (rare)
Custom or unusual interconnects
Fixed configurations of common parts
Appliances or recommended configurations

And there’s also SaaS

Page 10

18 contenders (and there are more)

Aster Data
Dataupia
Exasol
Greenplum
HP Neoview
IBM DB2 BCUs
Infobright/MySQL
Kickfire/MySQL
Kognitio
Microsoft Madison
Netezza
Oracle Exadata
Oracle w/o Exadata
ParAccel
SQL Server w/o Madison
Sybase IQ
Teradata
Vertica

Page 11

General areas of feature differentiation

Query performance
Update/load performance
Compatibilities
Advanced analytics
Alternate datatypes
Manageability and availability
Encryption and security

Page 12

Major analytic DBMS product groupings

Architecture is a hot subject

Traditional OLTP
Row-based MPP
Columnar
MOLAP/array-based (not covered tonight)

Page 13

Traditional OLTP examples

Oracle (especially pre-Exadata)
IBM DB2 (especially mainframe)
Microsoft SQL Server (pre-Madison)

Page 14

Analytic optimizations for OLTP DBMS

Two major kinds of precalculation:
Star indexes
Materialized views

Other specialized indexes
Query optimization tools
OLAP extensions
SQL 2003
Other embedded analytics

Page 15

Drawbacks

Complexity and people cost
Hardware cost
Software cost
Absolute performance

Page 16

Legitimate use scenarios

When TCO isn’t an issue
Undemanding performance (and therefore administration too)
When specialized features matter:
OLTP-like
Integrated MOLAP
Edge-case analytics

Rigid enterprise standards
Small enterprise/true single-instance

Page 17

Row-based MPP examples

Teradata
DB2 (open systems version)
Netezza
Oracle Exadata (sort of)
DATAllegro/Microsoft Madison
Greenplum
Aster Data
Kognitio
HP Neoview

Page 18

Typical design choices in row-based MPP

“Random” (hashed or round-robin) data distribution among nodes

Large block sizes
Suitable for scans rather than random accesses

Limited indexing alternatives
Or little optimization for using the full boat

Carefully balanced hardware
High-end networking
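
A minimal sketch of the first point, hashed data distribution. The eight-node cluster and the customer_id distribution key are illustrative assumptions, not any product's defaults:

    import hashlib
    from collections import Counter

    NODES = 8  # illustrative cluster size

    def node_for(distribution_key):
        """Hash the distribution key to pick a node; round-robin is the other common choice."""
        digest = hashlib.md5(str(distribution_key).encode()).hexdigest()
        return int(digest, 16) % NODES

    # Rows land roughly evenly, so a full scan does ~1/NODES of the work on each node in parallel.
    spread = Counter(node_for(customer_id) for customer_id in range(100_000))
    print(spread)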

Page 19

Tradeoffs among row MPP alternatives

Enterprise standards
Vendor size
Hardware lock-in
Total system price
Features

Page 20

Columnar DBMS examples

Sybase IQ
SAND
Vertica
ParAccel
Infobright
Kickfire
Exasol
MonetDB
SAP BI Accelerator (sort of)

Page 21

Columnar pros and cons

Bulk retrieval is faster
Pinpoint I/O is slower
Compression is easier
Memory-centric processing is easier
MPP is not quite as crucial
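
To illustrate "compression is easier": identical values sit adjacent in a column, so even naive run-length encoding collapses them, whereas a row store interleaves them with unrelated fields. A minimal sketch with made-up data:

    from itertools import groupby

    # One low-cardinality column, as a column store would lay it out.
    status_column = ["shipped"] * 5000 + ["pending"] * 3000 + ["returned"] * 200

    # Naive run-length encoding: (value, run length) pairs.
    rle = [(value, sum(1 for _ in run)) for value, run in groupby(status_column)]
    print(rle)  # [('shipped', 5000), ('pending', 3000), ('returned', 200)]
    # Three pairs stand in for 8,200 stored values.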

Page 22

Segmentation – a first cut

One database to rule them all
One analytic database to rule them all
Frontline analytic database
Very, very big analytic database
Big analytic database handled very cost-effectively

Page 23

Basics of systematic segmentation

Use cases
Metrics
Platform preferences

Page 24

Use cases – a first cut

Light reporting
Diverse EDW
Big Data
Operational analytics

Page 25

Metrics – a first cut

Total raw/user data
Below 1-2 TB, references abound
10 TB is another major breakpoint

Total concurrent users
5, 15, 50, or 500?

Data freshness
Hours
Minutes
Seconds

Page 26

Basic platform issues

Enterprise standards
Appliance-friendliness
Need for MPP?
Cloud/SaaS

Page 27

The selection process in a nutshell

Figure out what you’re trying to buy
Make a shortlist
Do free POCs*
Evaluate and decide

*The only part that’s even slightly specific to the analytic DBMS category

Page 28

Figure out what you’re trying to buy

Inventory your use cases
Current
Known future
Wish-list/dream-list future

Set constraints
People and platforms
Money

Establish target SLAs
Must-haves
Nice-to-haves

Page 29

Use-case checklist – generalities

Database growth
As time goes by …
More detail
New data sources

Users (human)
Users/usage (automated)
Freshness (data and query results)

Page 30

Use-case checklist – traditional BI

Reports
Today
Future

Dashboards and alerts
Today
Future
Latency

Ad-hoc
Users
Now that we have great response time …

Page 31

Use-case checklist – data mining

How much do you think it would improve results to:
Run more models?
Model on more data?
Add more variables?
Increase model complexity?

Which of those can the DBMS help with anyway?
What about scoring?

Real-time
Other latency issues

Page 32

SLA realism

What kind of turnaround truly matters?
Customer or customer-facing users
Executive users
Analyst users

How bad is downtime?
Customer or customer-facing users
Executive users
Analyst users

Page 33

Short list constraints

Cash cost
But purchases are heavily negotiated

Deployment effort
Appliances can be good

Platform politics
Appliances can be bad
You might as well consider incumbent(s)

Page 34

Filling out the shortlist

Who matches your requirements in theory?

What kinds of evidence do you require?
References? How many? How relevant?
A careful POC?
Analyst recommendations?
General “buzz”?

Page 35

A checklist for shortlists

What’s your tolerance for specialized hardware?
What’s your tolerance for set-up effort?
What’s your tolerance for ongoing administration?
What are your insert and update requirements?
At what volumes will you run fairly simple queries?
What are your complex queries like?
For which third-party tools do you need support?

and, most important,

Are you madly in love with your current DBMS?

Page 36

Proof-of-Concept basics

The better you match your use cases, the more reliable the POC is

Most of the effort is in the set-up
You might as well do POCs for several vendors – at (almost) the same time!
Where is the POC being held?

Page 37

The three big POC challenges

Getting data
Real? (Politics, privacy)
Synthetic? Hybrid?

Picking queries
And more?

Realistic simulation(s)
Workload
Platform
Talent
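
When politics or privacy block real data, one fallback is generating synthetic rows that at least mimic production volume and skew. A hedged sketch; the file name, schema, row count, and distributions here are placeholders to be matched to your own workload:

    import csv
    import random

    random.seed(42)  # reproducible test data

    with open("poc_orders.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["order_id", "customer_id", "amount", "region"])
        for order_id in range(1_000_000):
            writer.writerow([
                order_id,
                random.randint(1, 50_000),              # adjust cardinality to match reality
                round(random.lognormvariate(3, 1), 2),  # long-tailed order amounts
                random.choice(["NE", "SE", "MW", "W"]), # categorical dimension
            ])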

Page 38

POC tips

Don’t underestimate requirements
Don’t overestimate requirements
Get SOME data ASAP
Don’t leave the vendor in control
Test what you’ll be buying
Use the baseball bat

Page 39

Evaluate and decide

It all comes down to

Cost
Speed
Risk

and in some cases

Time to value
Upside

Page 40

Further information

Curt A. Monash, Ph.D.
President, Monash Research

Editor, DBMS2

contact @monash.com
http://www.monash.com
http://www.DBMS2.com