63
One Size Doesn’t Fit All: The New Database Revolution Mark Madsen & Robin Bloor

One Size Doesn't Fit All: The New Database Revolution

Embed Size (px)

DESCRIPTION

Slides from a webcast for the database revolution research report (report will be available at http://www.databaserevolution.com)Choosing the right database has never been more challenging, or potentially rewarding. The options available now span a wide spectrum of architectures, each of which caters to a particular workload. The range of pricing is also vast, with a variety of free and low-cost solutions now challenging the long-standing titans of the industry. How can you determine the optimal solution for your particular workload and budget? Register for this Webcast to find out! Robin Bloor, Ph.D. Chief Analyst of the Bloor Group, and Mark Madsen of Third Nature, Inc. will present the findings of their three-month research project focused on the evolution of database technology. They will offer practical advice for the best way to approach the evaluation, procurement and use of today’s database management systems. Bloor and Madsen will clarify market terminology and provide a buyer-focused, usage-oriented model of available technologies.Webcast video and audio will be available on the report download site as well.

Citation preview

Page 1: One Size Doesn't Fit All: The New Database Revolution

One Size Doesn’t Fit All:The New Database Revolution

Mark Madsen & Robin Bloor

Page 3: One Size Doesn't Fit All: The New Database Revolution

Analysts Host

Bloor Madsen

Page 4: One Size Doesn't Fit All: The New Database Revolution

Introduction

Significant and revolutionary changes are taking place in database technology

In order to investigate and analyze these changes and where they may lead, The Bloor Group has teamed up with Third Nature to launch an Open Research project.

This is the final webinar in a series of webinars and research activities that have comprised part of the project

All published research will be made available through our web site: Databaserevolution.com

Page 5: One Size Doesn't Fit All: The New Database Revolution

Sponsors of This Research

Page 6: One Size Doesn't Fit All: The New Database Revolution

General Webinar Structure

Market Changes, Database Changes (Some Of The Findings)

Workloads, Characteristics, Parameters

A General Discussion of Performance

How to Select A Database

Page 7: One Size Doesn't Fit All: The New Database Revolution

Market Changes, Database Changes

Page 8: One Size Doesn't Fit All: The New Database Revolution

Database Performance Bottlenecks

CPU saturation

Memory saturation

Disk I/O channel saturation

Locking

Network saturation

Parallelism – inefficient load balancing

Page 9: One Size Doesn't Fit All: The New Database Revolution

Big Data = Scale Out

Page 10: One Size Doesn't Fit All: The New Database Revolution

Cloud Hardware Architecture

• It’s a scale-out model. Uniform virtual node building blocks.

• This is the future of software deployments, albeit with increasing node sizes, so paying attention to early adopters today will pay off.

• This implies that an MPP database architecture will be needed for scale.

X

Page 11: One Size Doesn't Fit All: The New Database Revolution

Multiple Database Roles

Now there are more...

Page 12: One Size Doesn't Fit All: The New Database Revolution

The Origin of Big Data

Page 13: One Size Doesn't Fit All: The New Database Revolution

Let’s Stop Using the Term NoSQL

As the graph indicates, it’s just not helpful. In

fact it’s downright confusing.

Page 14: One Size Doesn't Fit All: The New Database Revolution

NoSQL DirectionsSome NDBMS do not attempt to provide all ACID properties. (Atomicity, Consistency, Isolation, Durability)

Some NDBMS deploy a distributed scale-out architecture with data redundancy.

XML DBMS using XQuery are NDBMS.

Some documents stores are NDBMS (OrientDB, Terrastore, etc.)

Object databases are NDBMS (Gemstone, Objectivity, ObjectStore, etc.)

Key value stores = schema-less stores (Cassandra, MongoDB, Berkeley DB, etc.)

Graph DBMS (DEX, OrientDB, etc.) are NDMBS

Large data pools (BigTable, Hbase, Mnesia, etc.) are NDBMS

Page 15: One Size Doesn't Fit All: The New Database Revolution

The Joys of SQL?

SQL: very good for set manipulation. Works for OLTP and many query environments.

Not good for nested data structures (documents, web pages, etc.)

Not good for ordered data sets

Not good for data graphs (networks of values)

Page 16: One Size Doesn't Fit All: The New Database Revolution

The “Impedance Mismatch”

The RDBMS stores data organized according to table structures

The OO programmer manipulates data organized according to complex object structures, which may have specific methods associated with them.

The data does not simply map to the structure it has within the database

Consequently a mapping activity is necessary to get and put data

Basically: hierarchies, types, result sets, crappy APIs, language bindings, tools

Page 17: One Size Doesn't Fit All: The New Database Revolution

The SQL Barrier

SQL has:DDL (for data definition)

DML (for Select, Project and Join)

But it has no MML (Math) or TML (Time)

Usually result sets are brought to the client for further analytical manipulation, but this creates problems

Alternatively doing all analytical manipulation in the database creates problems

Page 18: One Size Doesn't Fit All: The New Database Revolution

Hadoop/MapReduce

Hadoop is a parallel processing environment

Map/Reduce is a parallel processing framework

Hbase turns Hadoop into a database of a kind

Hive adds an SQL capability

Pig adds analytics

Page 19: One Size Doesn't Fit All: The New Database Revolution

Market Forces

A new set of products appear

They include some fundamental innovations

A few are sufficiently popular to last

Fashion and marketing drive greater adoption

Products defects begin to be addressed

They eventually challenge the dominant products

Page 20: One Size Doesn't Fit All: The New Database Revolution

Market forces affecting database choice

Performance: trouble doing what you already do today▪ Poor response times

▪ Not meeting data availability requirements

Scalability: doing more of what you do today▪ Adding users, processing more data

Capability: doing something new with your data▪ Data mining, recommendations, real‐time

Cost or complexity: working more efficiently▪ Consolidating / rehosting to simplify and reduce cost

What’s desired is possible but limited by the cost of growing and supporting the existing environment.

Page 20

Page 21: One Size Doesn't Fit All: The New Database Revolution

The relational database is the franchise technology for storing and retrieving data, but…

1. Global, static schema model

2. No rich typing system

3. No concept of ordering, creating challenges with e.g. time series

4. Many are not a good fit for network parallel computing, aka cloud

5. Limited API in atomic SQL statement syntax  & simple result set return

6. Poor developer support

Relational has a good conceptual model, but a prematurely standardized implementation

Page 22: One Size Doesn't Fit All: The New Database Revolution

Unstructured data isn’t really unstructured.

The problem is that this data is unmodeled.

The real challenge is complexity.

Big data?

Page 23: One Size Doesn't Fit All: The New Database Revolution

Text, Objects and Data Don’t Always Fit Together

So this is what they meant by “impedance mismatch”

Page 24: One Size Doesn't Fit All: The New Database Revolution

Many new choices, one way to look at them

http://blog.nahurst.com/visual-guide-to-nosql-systems

Page 25: One Size Doesn't Fit All: The New Database Revolution

Advanced Analytic Methods

Machine learning

Statistics

Numerical methods

Text mining & text 

analytics

Rules engines & constraint programming

Information theory & IR

Visualization

What About Analytics?

GIS

Page 26: One Size Doesn't Fit All: The New Database Revolution

The holy grail of databases under current market hype

A key problem is that we’re talking mostly about computation over data when we talk about “big data” and analytics, a potential mismatch for both relational and nosql.

Page 27: One Size Doesn't Fit All: The New Database Revolution

Technologies are not perfect replacements for one another.

When replacing the old with the new (or ignoring the new over the old) you always make tradeoffs, and usually you won’t see them for a long time.

Page 28: One Size Doesn't Fit All: The New Database Revolution

Scalability and performance are not the same thing

Page 29: One Size Doesn't Fit All: The New Database Revolution

Throughput: the number of tasks completed in a given time period

A measure of how much work is or can be done by a system in a set amount of time, e.g. TPM or data loaded per hour.

It’s easy to increase throughput without improving response time.

Page 29

Performance measures

Page 30: One Size Doesn't Fit All: The New Database Revolution

Performance measures

Response time: the speed of a single task

Response time is usually the measure of an individual's experience using a system. 

Response time = time interval / throughput

Page 30

Page 31: One Size Doesn't Fit All: The New Database Revolution

Scalability vs throughput vs response time

Scalability = consistent performance for a task over an increase in a scale factor

Page 32: One Size Doesn't Fit All: The New Database Revolution

Scale: Data Volume

The different ways people count make establishing rules of thumb for sizing hard.

How do you measure it?▪ Row counts▪ Transaction counts▪ Data size▪ Raw data vs loaded data▪ Schema objects

People still have trouble scaling for databases as large as a single PC hard drive.

Page 33: One Size Doesn't Fit All: The New Database Revolution

Scale: Concurrency (active and passive)

Page 34: One Size Doesn't Fit All: The New Database Revolution

Scalability relationships

As concurrency increases, response time (usually) decreases,

This can be addressed somewhat via workload management tools.

When a system hits a bottleneck, response time and throughput will often get worse, not just level off.

Page 35: One Size Doesn't Fit All: The New Database Revolution

Scale: Computational Complexity

Page 36: One Size Doesn't Fit All: The New Database Revolution

A key point worth remembering:

Performance over size <> performance over complexity

Analytics performance is about the intersection of both.

Database performance for BI is mostly related to size and query complexity.

Size, computational complexity and concurrency are the three dimensions that constrain a product’s performance. Workloads fall somewhere along all three.

Page 37: One Size Doesn't Fit All: The New Database Revolution

Solving Your Problem Depends on the Diagnosis

Page 38: One Size Doesn't Fit All: The New Database Revolution

Three General Workloads

Online Transaction Processing▪ Read, write, update▪ User concurrency is the common performance limiter

▪ Low data, compute complexity

Business Intelligence / Data warehousing▪ Assumed to be read‐only, but really read heavy, write heavy, usually separated in time

▪ Data size is the common performance limiter

▪ High data complexity, low compute complexity

Analytics▪ Read, write▪ Data size and complexity of algorithm are the limiters

▪ Moderate data , high compute complexity

Page 39: One Size Doesn't Fit All: The New Database Revolution

Types of workloads

Write‐biased: ▪ OLTP▪ OLTP, batch▪ OLTP, lite▪ Object persistence▪ Data ingest, batch▪ Data ingest, real‐time

Read‐biased:▪ Query▪ Query, simple retrieval

▪ Query, complex

▪ Query‐hierarchical / object / network

▪ Analytic

MixedInline analytic execution, operational BI

Page 40: One Size Doesn't Fit All: The New Database Revolution

Technology choice depends on workload & need

Optimizing for:▪ Response time?

▪ Throughput?▪ both?

Concerned about rapid growth in data?

Unpredictable spikes in use?

Extremely low latency (in or out) requirements?

Bulk loads or incremental inserts and/or updates?

Page 41: One Size Doesn't Fit All: The New Database Revolution

Important workload parameters to know

• Read‐intensive  vs. write‐intensive

Page 42: One Size Doesn't Fit All: The New Database Revolution

Important workload parameters to know

• Read‐intensive  vs. write‐intensive

• Mutable vs. immutable data

Page 43: One Size Doesn't Fit All: The New Database Revolution

Important workload parameters to know

• Read‐intensive  vs. write‐intensive

• Mutable vs. immutable data

• Immediate vs. eventual consistency

Page 44: One Size Doesn't Fit All: The New Database Revolution

Important workload parameters to know

• Read‐intensive  vs. write‐intensive

• Mutable vs. immutable data

• Immediate vs. eventual consistency

• Short vs. long access latency

Page 45: One Size Doesn't Fit All: The New Database Revolution

Important workload parameters to know

• Read‐intensive  vs. write‐intensive

• Mutable vs. immutable data

• Immediate vs. eventual consistency

• Short vs. long data latency

• Predictable vs. unpredictable data access patterns

Page 46: One Size Doesn't Fit All: The New Database Revolution

Important workload parameters to know

• Read‐intensive  vs. write‐intensive

• Mutable vs. immutable data

• Immediate vs. eventual consistency

• Short vs. long data latency

• Predictable vs. unpredictable data access patterns

• Simple vs. complex data types

Page 47: One Size Doesn't Fit All: The New Database Revolution

You must understand your workload mix ‐ throughput and response time requirements aren’t enough.▪ 100 simple queries accessing month‐to‐date data

▪ 90 simple queries accessing month‐to‐date data and 10 complex queries using two years of history

▪ Hazard calculation for the entire customer master

▪ Performance problems are rarely due to a single factor. 

Page 48: One Size Doesn't Fit All: The New Database Revolution

Selectivity and number of columns queried

Row store or column store, indexed or not?

Chart from “The Mimicking Octopus: Towards a one-size-fits-all Database Architecture”, Alekh Jindal

Page 49: One Size Doesn't Fit All: The New Database Revolution

Characteristics of query workloads

Workload Selectivity Retrieval Repetition Complexity

Reporting / BI Moderate Low Moderate Moderate

Dashboards / scorecards

Moderate Low High Low

Ad‐hoc query and analysis

Low to high

Moderateto low

Low Low to moderate

Analytics (batch) Low High Low to High Low*

Analytics (inline) High Low High Low*

Operational / embedded BI

High Low High Low

* Low for retrieving the data, high if doing analytics in SQL

Page 50: One Size Doesn't Fit All: The New Database Revolution

Characteristics of read‐write workloads

Workload Selectivity Retrieval Repetition Complexity

Online OLTP High Low High Low

Batch OLTP Moderate to low

Moderate to high

High Moderate to high

Object persistence

High Low High Low

Bulk ingest Low (write) n/a High Low

Realtime ingest High (write) n/a High Low

With ingest workloads we’re dealing with write-only, so selectivity and retrieval don’t apply in the same way, instead it’s write volume.

Page 51: One Size Doesn't Fit All: The New Database Revolution

Workload parameters and DB types at data scale

Workload parameters

Write‐biased

Read‐biased

Updateabledata

Eventual consistency ok?

Un‐predictablequery path

Computeintensive

Standard RDBMS

ParallelRDBMS

NoSQL (kv,dht, obj)

Hadoop*

Streaming database

You see the problem: it’s an intersection of multiple

Page 52: One Size Doesn't Fit All: The New Database Revolution

Problem: Architecture Can Define Options

Page 53: One Size Doesn't Fit All: The New Database Revolution

A general rule for the read‐write axes

As workloads increase in both intensity and complexity, we move into a realm of specialized databases adapted to specific workloads.

Write intensity

Read intensity

OldSQL

NewSQL

NoSQL

Page 54: One Size Doesn't Fit All: The New Database Revolution

In general…

Relational row store databases for conventionally tooled low to mid‐scale OLTP

Relational databases for ACID requirements

Parallel databases (row or column) for unpredictable or variable query workloads

Specialized databases for complex data query workjloads

NoSQL (KVS, DHT) for high scale OLTP

NoSQL (KVS, DHT) for low latency read‐mostly data access

Parallel databases (row or column) for analytic workloads over tabular data

NoSQL / Hadoop for batch analytic workloads over large data volumes

Page 55: One Size Doesn't Fit All: The New Database Revolution

How To Select A Database

Page 56: One Size Doesn't Fit All: The New Database Revolution

How To Select A Database - (1)1.What are the data management requirements and policies (if any) in respect

of:- Data security (including regulatory requirements)?- Data cleansing?- Data governance?- Deployment of solutions in the cloud?- If a deployment environment is mandated, what are its technical

characteristics and limitations? Best of breed, no standards foranything, “polyglot persistence” = silos on steroids, data integrationchallenges, shifting data movement architectures

2.What kind of data will be stored and used?- Is it structured or unstructured?- Is it likely to be one big table or many tables?

Page 57: One Size Doesn't Fit All: The New Database Revolution

How To Select A Database - (2)3.What are the data volumes expected to be?

- What is the expected daily ingest rate?- What will the data retention/archiving policy be?- How big do we expect the database to grow to? (estimate a range).

4. What are the applications that will use the database?- Estimate by user numbers and transaction numbers- Roughly classify transactions as OLTP, short query, long query, long

query with analytics.- What are the expectations in respect of growth of usage (per user) and

growth of user population?5.What are the expected service levels?

- Classify according to availability service levels- Classify according to response time service levels- Classify on throughput where appropriate

Page 58: One Size Doesn't Fit All: The New Database Revolution

How To Select A Database - (3)6.What is the budget for this project and what does that cover?7.What is the outline project plan?

- Timescales- Delivery of benefits- When are costs incurred?

8.Who will make up the project team?- Internal staff- External consultants- Vendor consultants

9.What is the policy in respect of external support, possibly including vendorconsultancy for the early stages of the project?

Page 59: One Size Doesn't Fit All: The New Database Revolution

How To Select A Database - (4)10.What are the business benefits?

- Which ones can be quantified financially?- Which ones can only be guessed at (financially)?- Are there opportunity costs?

Page 60: One Size Doesn't Fit All: The New Database Revolution

A random selection of databasesSybase IQ, ASETeradata, Aster DataOracle, RACMicrosoft SQLServer, PDWIBM DB2s, NetezzaParaccelKognitioEMC/GreenplumOracle ExadataSAP HANAInfobrightMySQLMarkLogicTokyo Cabinet

EnterpriseDB LucidDBVectorwiseMonetDBExasolIlluminateVerticaInfiniDB1010 DataSANDEndecaXtreme DataIMSHive

AlgebraixIntersystems CachéStreambaseSQLStreamCoral8IngresPostgresCassandraCouchDBMongoHbaseRedisRainStorScalaris

And a few hundred more…

Page 61: One Size Doesn't Fit All: The New Database Revolution

Product Selection

Preliminary investigation

Short-list (usually arrived at by elimination)

Be sure to set the goals and control the process.

Evaluation by technical analysis and modeling

Evaluation by proof of concept.

Do not be afraid to change your mind

Negotiation

Page 62: One Size Doesn't Fit All: The New Database Revolution

Conclusion

Wherein all is revealed, or ignorance exposed

Page 63: One Size Doesn't Fit All: The New Database Revolution

Thank YouFor YourAttention