Finding the Right Data Solution for Your Application in the Data Storage Haystack Srinath Perera Ph.D. Senior Software Architect, WSO2 Inc. Visiting Faculty, University of Moratuwa Research Scientist, Lanka Software Foundation



DESCRIPTION

The NoSQL movement has rekindled interest in data storage solutions. A few years ago, when systems had limited scale, storage choices for programmers and architects were simple: relational databases were almost always the choice. However, the advent of the Cloud and ever-increasing user bases for applications have given rise to larger-scale systems. Relational databases cannot always scale to meet the needs of those systems, and as an alternative, the NoSQL movement has proposed many solutions.

A programmer who wants to select a data model now has to choose from a wide variety of options such as local memory, relational databases, files, distributed caches, column family storage, document storage, name-value pairs, graph DBs, service registries, queues, and tuple spaces. Furthermore, there are different layers/access choices such as accessing data directly, using an object-relational mapping layer like Hibernate/JPA, or using data services. Moreover, users also need to worry about how to scale up the storage in multiple dimensions, such as the number of databases, the number of tables, the amount of data in a table, the frequency of requests, and the types of requests (read/write ratio).

Consequently, choosing the right data model for a given problem is no longer trivial, and such a choice needs a clear understanding of the different storage offerings, their similarities and differences, and the associated tradeoffs. We faced the same problem while designing the data interfaces for the Stratos Platform as a Service (PaaS) offering, and in this talk, we would like to share our findings and experiences from that work. We will present a survey of different data models, their differences and similarities, tradeoffs, and killer apps for each model. We believe the participants will walk away with a broader understanding of data models and guidelines on which model to use when.


Page 1: Finding the Right Data Solution for Your Application in the Data Storage Haystack

Finding the Right Data Solution for Your Application in the Data Storage Haystack

Srinath Perera Ph.D.
Senior Software Architect, WSO2 Inc.
Visiting Faculty, University of Moratuwa
Research Scientist, Lanka Software Foundation

Page 2: Finding the Right Data Solution for Your Application in the Data Storage Haystack

In Search of the Right Data Models
§  Many data models have been proposed (read Stonebraker’s “What Goes Around Comes Around” for more details)
    o  Hierarchical (IMS): late 1960’s and 1970’s
    o  Directed graph (CODASYL): 1970’s
    o  Relational: 1970’s and early 1980’s
    o  Entity-Relationship: 1970’s
    o  Extended Relational: 1980’s
    o  Semantic: late 1970’s and 1980’s

Copyright Greg Morss and licensed for reuse under CC License , http://www.geograph.org.uk/photo/990700

§  Database systems (SQL) together with transactions have been the de facto data solution.

Page 3: Finding the Right Data Solution for Your Application in the Data Storage Haystack

For many years, the choice of data storage was an easy one (use an RDBMS)

Copyright by Alan Murray Walsh and licensed for reuse under CC License , http://www.geograph.org.uk/photo/1652880

Page 4: Finding the Right Data Solution for Your Application in the Data Storage Haystack

Increasing Scale of Systems
§  However, the scale of systems is changing due to
    o  Increasing user bases of systems
    o  Mobile devices, online presence
    o  Cloud computing and multicore systems

§  Scaling up RDBMS
    o  Put it on a bigger machine
    o  Replicate (cluster) the database to 2-3 more nodes. But this approach does not scale up.
    o  Partition the data across many nodes (distribute, a.k.a. sharding). However, JOIN queries across many nodes are hard, and sometimes too slow. This often needs custom code and configurations. Transactions do not scale well either.
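
The sharding tradeoff described above can be illustrated with a minimal in-memory sketch. It is not taken from the talk: the shard count, the table layout, and the helper names (`shard_for`, `insert_user`, `insert_order`, `totals_by_user_name`) are hypothetical and chosen only for illustration. A lookup by key touches one shard, while a JOIN has to visit every shard and fetch matching rows from other shards.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical cluster size


def shard_for(key: str) -> int:
    """Pick a shard from a stable hash of the partition key."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS


# Each shard is modelled as two in-memory tables.
shards = [{"users": {}, "orders": []} for _ in range(NUM_SHARDS)]


def insert_user(user_id: str, name: str) -> None:
    shards[shard_for(user_id)]["users"][user_id] = name


def insert_order(order_id: str, user_id: str, total: float) -> None:
    # Orders are partitioned by order_id, so an order and its user
    # may live on different shards.
    shards[shard_for(order_id)]["orders"].append((order_id, user_id, total))


def get_user(user_id: str):
    """Key lookup: touches exactly one shard."""
    return shards[shard_for(user_id)]["users"].get(user_id)


def totals_by_user_name() -> dict:
    """JOIN orders with users: visits every shard and looks up each user."""
    totals = defaultdict(float)
    for shard in shards:                          # scatter over all shards
        for _order_id, user_id, total in shard["orders"]:
            name = get_user(user_id)              # possibly remote lookup per row
            totals[name] += total
    return dict(totals)


insert_user("u1", "alice")
insert_user("u2", "bob")
insert_order("o1", "u1", 10.0)
insert_order("o2", "u1", 5.0)
insert_order("o3", "u2", 7.5)
```

In a real deployment each `get_user` call inside the JOIN could be a network round trip, which is one reason cross-shard JOINs tend to be slow.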

Copyright digitalART2 and licensed for reuse under CC License , http://www.flickr.com/photos/digitalart/2101765353/

Page 5: Finding the Right Data Solution for Your Application in the Data Storage Haystack

CAP Theorem, Transactions, and Storage
§  The RDBMS model provides two things
    o  Relational model with SQL
    o  ACID transactions (Atomic, Consistent, Isolated, Durable)
§  It was a classic one-size-fits-all solution, but it worked for quite some time.

§  However, the CAP theorem says that you cannot have it all.
    o  Consistency, Availability and Partition Tolerance, pick two!

§  But there are many use cases that do not need all RDBMS features; when those are dropped, systems can scale (e.g. Google Bigtable).

§  However, to use them, one has to understand and utilize the application-specific behavior.

Copyright stephcarter and licensed for reuse under CC License , http://www.flickr.com/photos/stephcarter/541464462

Page 6: Finding the Right Data Solution for Your Application in the Data Storage Haystack

NoSQL and other Storage Systems
§  Large internet companies hit the problem first; they built systems that were specific to their problems, and those systems did scale.
    o  Google Bigtable
    o  Amazon Dynamo

§  Soon many others followed, and most of them are free and open source. Now there are a couple of dozen.

§  Among the advantages of NoSQL are
    o  Scalability
    o  Flexible schema
    o  Designed to scale and support fault tolerance out of the box

Copyright O hai :3 and licensed for reuse under CC License , http://www.flickr.com/photos/christigain/5636887941

Page 7: Finding the Right Data Solution for Your Application in the Data Storage Haystack

However, with NoSQL solutions, choosing a data storage solution is no longer simple.

Copyright Philipp Salzgeber on and licensed for reuse under CC License http://www.salzgeber.at/astro/pics/20081126_heart/index.html

Page 8: Finding the Right Data Solution for Your Application in the Data Storage Haystack

Selecting the Right Data Solution

§  What are the right questions to ask?
§  Categorize the answers for each question
§  Take different cases based on different answers and make recommendations!

Copyright by Krzysztof Poltorak, and licensed for reuse under CC License. http://www.fotocommunity.com/pc/pc/display/22077920

Page 9: Finding the Right Data Solution for Your Application in the Data Storage Haystack

What are the right Questions?
    o  Types of data
        -  Structured, Semi-Structured, Unstructured

    o  Need for Scalability
        -  Number of users
        -  Number of data items
        -  Size of files
        -  Read/Write ratio

    o  Types of Queries
        -  Retrieve by Key
        -  WHERE clauses
        -  JOIN queries
        -  Offline Queries

    o  Consistency
        -  Loose Consistency
        -  Single Operation Consistency
        -  Transactions

Copyright by romainguy, and licensed for reuse under CC License http://www.flickr.com/photos/romainguy/249370084

Page 10: Finding the Right Data Solution for Your Application in the Data Storage Haystack

4Q > Types of Data > Unstructured Data

§  This data is often kept in storage but consumed by humans at the end of the pipeline (e.g. a document repository).

§  One common use case is building structured data from unstructured data.

§  Metadata is often associated with the data to help searching.

Copyright Martyn Gorman and licensed for reuse under CC License, http://www.geograph.org.uk/photo/294134

§  The data does not have a particular structure and is often retrieved through a key (name).
    o  E.g. file systems.

§  Humans are good at processing unstructured data, but computers are not.

Page 11: Finding the Right Data Solution for Your Application in the Data Storage Haystack

4Q > Types of Data > Structured Data
§  Has a structure, often described through a schema
§  Often a table-like 2D structure is used, but other structures are also possible.
§  The main advantage of the structure is search

Copyright Marion Doss by and licensed for reuse under CC License , http://www.flickr.com/photos/ooocha/2611398859/

§  The schema can be provided at deployment time or at runtime (dynamic schema)

§  The schema can be used to
    o  Validate data
    o  Support user-friendly search
    o  Optimize storage and queries
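
As a rough illustration of using a schema to validate data, the following sketch checks records against a field-to-type mapping. The schema format, the `USER_SCHEMA` fields, and the `validate` helper are assumptions made for this example and do not correspond to any particular database.

```python
# Hypothetical schema: field name -> required Python type.
USER_SCHEMA = {"id": int, "name": str, "email": str}


def validate(record: dict, schema: dict) -> list:
    """Return a list of problems; an empty list means the record is valid."""
    problems = []
    for field, expected_type in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for field: {field}")
    for field in record:
        if field not in schema:
            problems.append(f"unknown field: {field}")
    return problems


good = {"id": 1, "name": "alice", "email": "alice@example.org"}
bad = {"id": "1", "name": "bob"}
```

A store with a flexible (dynamic) schema would typically skip or relax the "unknown field" check.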

Page 12: Finding the Right Data Solution for Your Application in the Data Storage Haystack

4Q > Types of Data > Semi-structured Data
§  The structure is not fully defined, but there is some inherent structure.

§  For example
    o  XML documents, where data are stored in a tree-like structure
    o  Graph data
    o  Data structures like lists and arrays
§  Supports queries based on structure
§  But processing the data often needs custom code.
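
A query based on structure can be sketched with the limited XPath subset supported by Python's standard `xml.etree.ElementTree` module. The sample document and its element names are hypothetical and serve only as an illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalogue document used only for illustration.
DOC = """
<library>
  <book year="2008"><title>Alpha</title><author>Ann</author></book>
  <book year="2011"><title>Beta</title><author>Bob</author></book>
  <journal year="2011"><title>Gamma</title></journal>
</library>
"""

root = ET.fromstring(DOC)

# Query by structure: titles of all <book> elements anywhere in the tree.
book_titles = [t.text for t in root.findall(".//book/title")]

# Query by structure plus an attribute predicate.
titles_2011 = [t.text for t in root.findall(".//*[@year='2011']/title")]
```

Full XPath or XQuery engines, as found in XML databases, support far richer expressions than this subset.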

Copyright Walter Baxter http://www.geograph.org.uk/photo/1069339

Page 13: Finding the Right Data Solution for Your Application in the Data Storage Haystack

4Q > Search
§  Unstructured data: no structure to support search.
    o  Search based on an inverted (reverse) index
    o  Search through properties

§  Semi-structured data
    o  XML (or any tree-like structure) can be searched with XPath or XQuery.
    o  Tuple spaces can be queried through tuple space templates
    o  Data registries can be searched for entries that match given metadata descriptions (search by properties)
    o  Graphs can be queried based on connectivity

§  Structured data
    o  Retrieve by key
    o  WHERE clauses
    o  Queries with JOINs
    o  Offline queries
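
For the unstructured case, search based on an inverted (reverse) index could look roughly like the sketch below. The tokenization (lower-casing and splitting on whitespace), the sample documents, and the function names are simplifying assumptions, not a description of any specific search engine.

```python
from collections import defaultdict


def build_index(docs: dict) -> dict:
    """Map each lower-cased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index: dict, query: str) -> set:
    """Return ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


docs = {
    "a.txt": "column family storage scales writes",
    "b.txt": "graph storage answers relationship queries",
    "c.txt": "relational storage supports join queries",
}
index = build_index(docs)
```

The index is built once over the stored files, so a query only intersects small sets of document ids instead of scanning every file.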

Copyright by digitalART2 and licensed for reuse under CC License , http://www.flickr.com/photos/digitalart/2101765353/

Page 14: Finding the Right Data Solution for Your Application in the Data Storage Haystack

4Q > Consistency and Scalability
§  Scalability: the ability to handle more users, more data, or larger files by adding more nodes. We will use 3 categories.
    o  Small systems (can be handled with 1-3 nodes)
    o  Scalable systems (can be handled with about 10 nodes)
    o  Highly scalable systems (anything larger, can be 100s or 1000s of nodes)

Copyright NNSANews and licensed for reuse under CC License , http://www.flickr.com/photos/nnsanews/5347287260/

§  Consistency: how the copies (replicas) of the same data on many nodes are kept in sync, and how they can be updated without data corruption. We will use 3 categories.
    o  Transactional: a series of operations applied in an ACID manner
    o  Atomic operation: a single operation, applied to all replicas
    o  Eventual consistency: data will eventually become consistent
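
Eventual consistency can be illustrated with a toy simulation. The version numbers, the last-write-wins rule, and the `anti_entropy` background sync are assumptions chosen for this sketch; real systems use a variety of replication and conflict-resolution schemes.

```python
class Replica:
    """One copy of the data; each key holds a (version, value) pair."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, version):
        current = self.data.get(key)
        # Last-write-wins: keep the entry with the highest version.
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def anti_entropy(replicas):
    """Background sync: offer every entry to every replica."""
    for source in replicas:
        for key, (version, value) in list(source.data.items()):
            for target in replicas:
                target.write(key, value, version)


replicas = [Replica(), Replica(), Replica()]
replicas[0].write("cart", ["book"], version=1)   # accepted by one node only
stale = replicas[2].read("cart")                  # not yet propagated
anti_entropy(replicas)
fresh = replicas[2].read("cart")                  # visible after the sync
```

Between the write and the sync, readers of other replicas see stale data; that window is the price paid for not coordinating every write across all replicas.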

Page 15: Finding the Right Data Solution for Your Application in the Data Storage Haystack

Data Storage Alternatives

Page 16: Finding the Right Data Solution for Your Application in the Data Storage Haystack

Data Storage Implementations

§  Expectations from data storage
    o  Reliably store the data
    o  Efficient search and retrieval of data whenever needed
    o  Data management: delete and update data

Copyright Stephen Eckert and licensed for reuse under CC License , http://www.flickr.com/photos/s_eckert/5378588233

Page 17: Finding the Right Data Solution for Your Application in the Data Storage Haystack

Challenges of Data Storage
§  Reliability
    o  Replicating data
    o  Creating backups or recovering using backups

§  Security
§  Scaling and parallel access
    o  Distribution or replication
    o  ACID transactions

§  Availability
    o  Data replication

§  Vendor lock-in
    o  Interoperability, standard query languages

§  Simple user experience
    o  Hide the physical location of data
    o  Provide simple APIs and security models
    o  Expressive query languages

Page 18: Finding the Right Data Solution for Your Application in the Data Storage Haystack

Data Storage Choices

| Storage Type | Data type | Advantages | Disadvantages | Query: Key | Query: Where | Query: Joins | Transactions | Scale | Flexible schema |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Local memory | Structured | Very fast | Not durable | Yes | No | No | No, unless STMs | No | Yes |
| Relational/SQL | Structured | Standardized | Rigid schema, good for read-oriented use cases | Yes | Yes | Yes | Yes | Moderate | No |
| Column families (NoSQL) | Structured | High write performance, replicated | Not transactional, no online joins | Yes | Yes, secondary index | No | No | High | Yes |
| Document DBs | Structured | High write performance, replicated | Not transactional, no online joins | Yes | Yes, views | No | No | Yes | Yes |
| Object Databases | Structured | Easy to integrate with programming languages | | Yes | Yes | Yes | Yes | No | No |

Page 19: Finding the Right Data Solution for Your Application in the Data Storage Haystack

| Storage Type | Data type | Advantages | Disadvantages | Query: Key | Query: Search | Transactions | Scale | Flexible schema |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Files | Unstructured | Save big files whose format is not understood | No structured search on content | Yes | Indexing | No | Moderate | Yes |
| Data Registries/Metadata Catalogs | Unstructured | Metadata search | | Yes | Property-based search (Where) | No | Moderate | Yes |
| Queues | Semi-structured | Representation of flow of messages over time / tasks | | Yes | N/A | No | Yes | Yes |
| Triple Stores | Semi-structured | Used for inference, very fast relationship processing | | Yes | Relationship search | No | No | Yes |
| XML database | Semi-structured | XML native | | | XPath/XQuery | | | |
| Distributed Cache | | Fast, replicated | No search | Yes | No | No | Yes | Yes |
| Key-value pairs | | High write performance, replicated | Model is too simple in some cases, not transactional | Yes | No | No | Yes | Yes |
| Graph DBs | | Very fast joins, natural to represent relationships | Not very scalable | Yes | Graph search | Yes | Low | N/A |

Page 20: Finding the Right Data Solution for Your Application in the Data Storage Haystack

Choosing the Right Data Solution

Page 21: Finding the Right Data Solution for Your Application in the Data Storage Haystack

How should We do this?

§  Consider structured, semi-structured, and unstructured data separately.
    o  Then drill down based on the other 3 properties: scale, consistency, and search.
§  The structured case is more complicated; the other two are a bit simpler.
§  Start by giving a de facto choice for each case

Copyright Brian Robert Marshall and licensed for reuse under CC License , http://www.geograph.org.uk/photo/938546

Page 22: Finding the Right Data Solution for Your Application in the Data Storage Haystack

Handling Structured Data
§  There are three main considerations: scale, consistency, and queries

| Scale | Consistency | Primary Key | Where | JOIN | Offline |
| --- | --- | --- | --- | --- | --- |
| Small (1-3 nodes) | Loose Consistency | DB/KV/CF | DB/CF/Doc | DB | DB/CF/Doc |
| Small (1-3 nodes) | Operation Consistency | DB/KV/CF | DB/CF/Doc | DB | DB/CF/Doc |
| Small (1-3 nodes) | ACID Transactions | DB | DB | DB | DB/CF/Doc |
| Scalable (10 nodes) | Loose Consistency | KV/CF | CF/Doc(?) | ?? | CF/Doc |
| Scalable (10 nodes) | Operation Consistency | KV/CF | CF/Doc(?) | ?? | CF/Doc |
| Scalable (10 nodes) | ACID Transactions | Partitioned DB? | Partitioned DB? | ?? | No |
| Highly Scalable (1000s nodes) | Loose Consistency | KV/CF | CF/Doc | No | CF/Doc |
| Highly Scalable (1000s nodes) | Operation Consistency | KV/CF | CF/Doc | No | CF/Doc |
| Highly Scalable (1000s nodes) | ACID Transactions | No | No | No | No |

*KV: Key-Value Systems, CF: Column Families, Doc: document based Systems
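
The recommendations in this table can also be read as a simple lookup keyed by query type, scale, and consistency. The sketch below only transcribes the cells of the table; the lower-case key names and the `recommend` function are hypothetical, and the "?" entries are the open questions marked on the slide, not answers.

```python
# Cells transcribed from the table on this slide.
# "?" marks cells the slide leaves open; "No" means no option is suggested.
SCALES = ("small", "scalable", "highly scalable")
CONSISTENCY = ("loose", "operation", "acid")

# For each query type: nine cells, ordered by scale, then by consistency.
MATRIX = {
    "primary key": ["DB/KV/CF", "DB/KV/CF", "DB",
                    "KV/CF", "KV/CF", "Partitioned DB?",
                    "KV/CF", "KV/CF", "No"],
    "where": ["DB/CF/Doc", "DB/CF/Doc", "DB",
              "CF/Doc(?)", "CF/Doc(?)", "Partitioned DB?",
              "CF/Doc", "CF/Doc", "No"],
    "join": ["DB", "DB", "DB",
             "??", "??", "??",
             "No", "No", "No"],
    "offline": ["DB/CF/Doc", "DB/CF/Doc", "DB/CF/Doc",
                "CF/Doc", "CF/Doc", "No",
                "CF/Doc", "CF/Doc", "No"],
}


def recommend(query: str, scale: str, consistency: str) -> str:
    """Look up the suggested storage types for one combination."""
    column = SCALES.index(scale) * len(CONSISTENCY) + CONSISTENCY.index(consistency)
    return MATRIX[query][column]
```

For example, `recommend("join", "small", "acid")` returns the cell for JOIN queries on a small system with ACID transactions.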

Page 23: Finding the Right Data Solution for Your Application in the Data Storage Haystack

Handling Small Scale Systems (1-3 nodes)
§  In general, using a DB for every case here might work.
§  Reasons for using options other than a DB
    o  When there is a potential need to scale later.
    o  High write throughput

§  KV is 1-D whereas the other two are 2-D

Small (1-3 nodes)

| Query type | Loose Consistency | Operation Consistency | ACID Transactions |
| --- | --- | --- | --- |
| Primary Key | DB/KV/CF | DB/KV/CF | DB |
| Where | DB/CF/Doc | DB/CF/Doc | DB |
| JOIN | DB | DB | DB |
| Offline | DB/CF/Doc | DB/CF/Doc | DB/CF/Doc |

*KV: Key-Value Systems, CF: Column Families, Doc: document based Systems

Page 24: Finding the Right Data Solution for Your Application in the Data Storage Haystack

Handling Scalable Systems
§  KV, CF, and Doc can easily handle this case.
§  If DBs are used with data sharded across many nodes
    o  Transactions might work, given that there are not too many participants in one transaction.
    o  JOINs might need to transfer too much data between nodes.
    o  In-memory DBs like VoltDB should also be considered.

§  Offline mode will work.
§  Most systems let users choose the consistency level, and loose consistency scales better (e.g. Cassandra).

Scalable (10 nodes)

| Query type | Loose Consistency | Operation Consistency | ACID Transactions |
| --- | --- | --- | --- |
| Primary Key | KV/CF | KV/CF | Partitioned DB? |
| Where | CF/Doc | CF/Doc | Partitioned DB? |
| JOIN | ?? | ?? | Partitioned DB?? |
| Offline | CF/Doc | CF/Doc | No |

*KV: Key-Value Systems, CF: Column Families, Doc: document based Systems

Page 25: Finding the Right Data Solution for Your Application in the Data Storage Haystack

Highly Scalable Systems

§  Transactions do not work at this scale (CAP theorem).

§  The same holds for JOINs. The problem is that sometimes too much data needs to be transferred between nodes to perform the JOIN.

§  The offline case is handled through Map-Reduce. Even the JOIN case is OK, since there is time.
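
An offline JOIN through Map-Reduce is commonly done as a reduce-side join. The sketch below imitates the map, shuffle, and reduce phases in a single process; the record layouts and function names are hypothetical, and a real framework such as Hadoop would run these phases across many nodes.

```python
from collections import defaultdict


def map_phase(users, orders):
    """Emit (join_key, tagged_record) pairs from both inputs."""
    for user_id, name in users:
        yield user_id, ("user", name)
    for order_id, user_id, total in orders:
        yield user_id, ("order", (order_id, total))


def shuffle(pairs):
    """Group values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Join users and orders that share the same key."""
    joined = []
    for user_id, values in groups.items():
        names = [v for tag, v in values if tag == "user"]
        user_orders = [v for tag, v in values if tag == "order"]
        for name in names:
            for order_id, total in user_orders:
                joined.append((user_id, name, order_id, total))
    return sorted(joined)


users = [("u1", "alice"), ("u2", "bob")]
orders = [("o1", "u1", 10.0), ("o2", "u2", 7.5), ("o3", "u1", 5.0)]
result = reduce_phase(shuffle(map_phase(users, orders)))
```

The shuffle step is where the data movement happens, which is acceptable offline but is the part that makes online JOINs expensive at this scale.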

Highly Scalable (1000s nodes)

| Query type | Loose Consistency | Operation Consistency | ACID Transactions |
| --- | --- | --- | --- |
| Primary Key | KV/CF | KV/CF | No |
| Where | CF/Doc | CF/Doc | No |
| JOIN | No | No | No |
| Offline | CF/Doc | CF/Doc | No |

*KV: Key-Value Systems, CF: Column Families, Doc: document based Systems

Page 26: Finding the Right Data Solution for Your Application in the Data Storage Haystack

Highly Scalable Systems + Primary Key Retrieval

§  This is (comparatively) the easy one.

§  It can be solved through DHT (Distributed Hash Table) based solutions or architectures like OceanStore.

§  Both Key-Value storage (KV) and Column Families (CF) can be used. But the Key-Value model is preferred as it is more scalable.
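
One common building block of DHT-style key lookup is consistent hashing. The sketch below is a simplified, single-process illustration under stated assumptions: the `HashRing` class, the node names, the use of SHA-256, and the number of virtual nodes are all hypothetical choices, not a description of any particular system.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a large integer position on the ring."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent hashing: each node owns several points on a ring."""

    def __init__(self, nodes, vnodes=64):
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # The first ring point clockwise from the key's hash owns the key.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["node-a", "node-b", "node-c"])
bigger = HashRing(["node-a", "node-b", "node-c", "node-d"])
keys = [f"user:{i}" for i in range(1000)]
moved = sum(1 for k in keys if ring.node_for(k) != bigger.node_for(k))
```

When a node is added, only the keys that now map to the new node change owner, which is why primary-key retrieval scales comparatively easily.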

Highly Scalable (1000s nodes)

| Query type | Loose Consistency | Operation Consistency | ACID Transactions |
| --- | --- | --- | --- |
| Primary Key | KV/CF | KV/CF | No |
| Where | CF/Doc(?) | CF/Doc(?) | No |
| JOIN | No | No | No |
| Offline | CF/Doc | CF/Doc | No |

*KV: Key-Value Systems, CF: Column Families, Doc: document based Systems

Page 27: Finding the Right Data Solution for Your Application in the Data Storage Haystack

Highly Scalable systems + WHERE

§  This is generally OK, but tricky.
§  CF works through a secondary index that does scatter-gather (e.g. Cassandra).

§  Doc works through Map-Reduce views (e.g. CouchDB)

§  There is Bissa, which builds an index for all possible queries (no range queries)

§  If you are doing this, you should do pilot runs and make sure things work.
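
The scatter-gather idea behind a secondary index can be sketched as follows. This is a toy model, not Cassandra's implementation: the `Node` and `Cluster` classes, the byte-sum routing, and the single indexed field `city` are assumptions made for illustration.

```python
from collections import defaultdict


class Node:
    """One storage node with a local secondary index on 'city'."""

    def __init__(self):
        self.rows = {}                      # primary key -> row
        self.by_city = defaultdict(set)     # city -> primary keys on this node

    def put(self, key, row):
        old = self.rows.get(key)
        if old is not None:
            self.by_city[old["city"]].discard(key)   # drop stale index entry
        self.rows[key] = row
        self.by_city[row["city"]].add(key)

    def find_by_city(self, city):
        return [self.rows[k] for k in sorted(self.by_city.get(city, ()))]


class Cluster:
    def __init__(self, size):
        self.nodes = [Node() for _ in range(size)]

    def _owner(self, key):
        # Toy routing by primary key; a real system would hash the key.
        return self.nodes[sum(key.encode("utf-8")) % len(self.nodes)]

    def put(self, key, row):
        self._owner(key).put(key, row)      # a write goes to one node

    def where_city(self, city):
        results = []
        for node in self.nodes:             # scatter the query to every node
            results.extend(node.find_by_city(city))   # gather partial results
        return sorted(results, key=lambda r: r["name"])


cluster = Cluster(3)
cluster.put("u1", {"name": "alice", "city": "colombo"})
cluster.put("u2", {"name": "bob", "city": "london"})
cluster.put("u3", {"name": "carol", "city": "colombo"})
```

Because every WHERE query contacts every node, its cost grows with cluster size, which is one reason the slide recommends pilot runs.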

Highly Scalable (1000s nodes)

| Query type | Loose Consistency | Operation Consistency | Transactions |
| --- | --- | --- | --- |
| Primary Key | KV/CF | KV/CF | No |
| Where | CF/Doc(?) | CF/Doc(?) | No |
| JOIN | No | No | No |
| Offline | CF/Doc | CF/Doc | No |

*KV: Key-Value Systems, CF: Column Families, Doc: document based Systems

Page 28: Finding the Right Data Solution for Your Application in the Data Storage Haystack

Handling Unstructured Data

§  Storage Options
    o  Distributed file systems: generally scalable (e.g. NFS), but HDFS (Hadoop) and Lustre are highly scalable versions.
    o  Metadata registries (e.g. Nirvana, SDSC Resource Broker)
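
The combination of a file store with a metadata registry can be sketched as below. The `MetadataRegistry` class, its property names, and the use of a local temporary directory in place of a distributed file system are all assumptions made for illustration.

```python
import os
import tempfile


class MetadataRegistry:
    """Keeps file contents on disk and searchable properties in memory."""

    def __init__(self, root):
        self.root = root
        self.entries = {}     # file name -> properties

    def store(self, name, content: bytes, **properties):
        with open(os.path.join(self.root, name), "wb") as handle:
            handle.write(content)
        self.entries[name] = dict(properties, size=len(content))

    def search(self, **wanted):
        """Return names of files whose properties match all given values."""
        return sorted(
            name for name, props in self.entries.items()
            if all(props.get(k) == v for k, v in wanted.items())
        )

    def load(self, name) -> bytes:
        with open(os.path.join(self.root, name), "rb") as handle:
            return handle.read()


# A temporary local directory stands in for the (distributed) file system.
registry = MetadataRegistry(tempfile.mkdtemp())
registry.store("q1.txt", b"first quarter", kind="report", year=2010)
registry.store("q2.txt", b"second quarter", kind="report", year=2011)
registry.store("logo.png", b"\x89PNG", kind="image", year=2011)
```

The file contents are never parsed; all searching happens on the associated properties, which matches the "search through properties" option for unstructured data.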

Page 29: Finding the Right Data Solution for Your Application in the Data Storage Haystack

Handling Semi-Structured Data

§  Storage Options
    o  The answer depends on the type of structure. If there is a server optimized for a given type, it is often much more efficient than using a DB (e.g. graph databases can support fast relationship search).

§  Search
    o  Very much custom. E.g. XML or any tree is searched with XPath; graphs can support very fast relationship search.

| Data type | Small Scale (1-3 nodes) | Scalable (10 nodes) | Highly Scalable |
| --- | --- | --- | --- |
| XML (queried through XPath) | XML DB or convert to a structured model | XML DB or convert to a structured model | ?? |
| Graphs | Graph DBs | Graph DBs if graph can be partitioned | ?? |
| Data Structures | Data Structure Servers, Object Databases | | |
| Queues | Distributed Queues | Distributed Queues | Distributed Queues |

Page 30: Finding the Right Data Solution for Your Application in the Data Storage Haystack

Hybrid Approaches
§  Some solutions have many types of data and hence need more than one data solution (hybrid architectures).

§  For example
    o  Using a DB for transactional data and CF for other data.
    o  Keeping metadata and actual data separate for large data archives.
    o  Using a graph DB to store relationship data while other data is in Column Family storage.

§  However, if transactions are needed, they have to be handled outside the storage (e.g. using Atomikos, ZooKeeper).

Copyright Matthew Oliphant by and licensed for reuse under CC License , http://www.flickr.com/photos/fajalar/3174131216/

Page 31: Finding the Right Data Solution for Your Application in the Data Storage Haystack

Other parameters
§  The above list is not exhaustive, and there are other parameters
    o  Read/write ratio: when it is high (read-heavy), it is easier to scale
    o  High write throughput
    o  Very large data products: you will need a file system. Maybe keep the metadata in a data registry and store the data in a file system.
    o  Flexible schema
    o  Archival use cases
    o  Analytical use cases
    o  Others …

Page 32: Finding the Right Data Solution for Your Application in the Data Storage Haystack

Take Home Message is ..

There is no silver bullet

You have to use the right tool for the job

Copyright eschipul, Siomuzzz and licensed for reuse under CC License , http://www.flickr.com/photos/eschipul/4160817135 and http://www.flickr.com/photos/siomuzzz/2577041081

Page 33: Finding the Right Data Solution for Your Application in the Data Storage Haystack

Sample Polyglot Architectures

| PaaS | Structured (Relational) | Structured (NoSQL) | Unstructured |
| --- | --- | --- | --- |
| WSO2 Stratos | MySQL based RDB as a Service | Cassandra as a Service | HDFS as a Service |
| Azure | MSSQL as a Service | MS NoSQL storage | |
| AppEngine | Hosted MySQL | BigTable | |

Our work on Data Solutions for WSO2 Stratos motivated this work.

You can try out WSO2 Stratos Data offerings from https://data.stratoslive.wso2.com/home/index.html

Page 34: Finding the Right Data Solution for Your Application in the Data Storage Haystack

Conclusion
§  For the last 20 years or so, DBMSs were the de facto storage solution
§  However, DBMSs could not scale well, and many NoSQL solutions have been proposed instead
§  As a result, it is no longer easy to find the best data solution for your problem.
§  We discussed many dimensions (types of data, scalability, queries, and consistency) and provided guidelines on when to use which data solution.

§  Your feedback and thoughts are most welcome. Contact me through [email protected]