Storing and processing data with the WSO2 Platform

Deependra Ariyadewa, Wathsala Vithanage

WSO2

• Founded in 2005 by acknowledged leaders in XML, Web Services Technologies & Standards and Open Source

• Producing entire middleware platform 100% open source under Apache license

• Business model is to sell comprehensive support & maintenance for our products

• Venture funded by Intel Capital and Quest Software.

• Global corporation with offices in USA, UK & Sri Lanka

• 150+ employees and growing.

Introduction to Data Problem

• Information explosion
  o Rapid growth of published data.
  o Managing large amounts of data is difficult (this leads to an information overload).
  o Difficulties include capture, storage, search, sharing, analytics and visualization.
  o We need new tools to deal with BIG DATA.

The Well Known Data Solution

RDBMS
• For many years this has been the choice
• Scaling up RDBMS
  o Put it in a bigger computer.
  o Replicate the database over 2 - 3 nodes. This does not work well with more than 2 - 3 nodes.
  o Partition data over several nodes. However, JOIN queries are hard across many nodes and may require custom code and configuration, and transactions may not scale well.
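The partitioning option can be sketched in a few lines. This is only an illustration, not how any particular RDBMS does it: the node count, the `node_for` helper and the sample tables are all hypothetical. It shows why a primary-key lookup touches one node, while a JOIN-like query on another column may have to visit every node.

```python
import hashlib

NUM_NODES = 3  # assumed cluster size, for illustration only


def node_for(key: str, num_nodes: int = NUM_NODES) -> int:
    """Map a key to a node with a stable hash (hypothetical helper)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


# Each node holds its own slice of the two tables.
nodes = [{"users": {}, "orders": {}} for _ in range(NUM_NODES)]


def put(table: str, key: str, row: dict) -> None:
    nodes[node_for(key)][table][key] = row


def get(table: str, key: str):
    # A primary-key lookup goes to exactly one node.
    return nodes[node_for(key)][table].get(key)


def orders_for_user(user_id: str) -> list:
    # Orders are partitioned by order id, not by user id, so this
    # JOIN-like query has to ask every node (custom code).
    found = []
    for node in nodes:
        for order in node["orders"].values():
            if order["user_id"] == user_id:
                found.append(order)
    return sorted(found, key=lambda o: o["order_id"])


put("users", "u1", {"name": "Alice"})
put("orders", "o1", {"order_id": "o1", "user_id": "u1", "total": 10})
put("orders", "o2", {"order_id": "o2", "user_id": "u1", "total": 25})
put("orders", "o3", {"order_id": "o3", "user_id": "u2", "total": 5})

print(get("users", "u1"))
print([o["order_id"] for o in orders_for_user("u1")])
```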

CAP Theorem and RDBMS

• RDBMS has two key features
  o Relational model with SQL
  o ACID transactions (Atomicity, Consistency, Isolation & Durability)
• The CAP theorem states that a distributed system can provide only two of the three properties Consistency, Availability & Partition Tolerance at any given time.
  o Once you have picked two properties, you will lose the remaining one.
• But some applications do not need all the properties of an RDBMS. Once these are dropped, the system scales (e.g. Google Bigtable).

Rise of NoSQL

• Large internet companies hit the problem first. They built systems that were specific to their problem, and those systems did scale.
  o Google Bigtable
  o Amazon Dynamo
• Soon many others followed, and most of them are free and open source.
• Among the advantages of NoSQL are
  o Scalability
  o Flexible schema
  o Designed to scale and support fault tolerance out of the box

Finding the right Data Solution

• Data Types
  o Unstructured Data: Files
  o Semi-Structured Data: XML Databases, Queues, Graphs and Lists
  o Structured Data: DBMS

Handling Unstructured Data

• Storage Options
  o Key-Value storages for small data items
  o Distributed file systems for other cases
  o Metadata registries (Nirvana, SDSC Resource Broker)
• Scalability
  o Key-Value storages are highly scalable (e.g. Amazon Dynamo)
  o Distributed file systems are generally scalable (HDFS, Lustre)
  o Metadata registries are also highly scalable
• Search
  o Each of the above provides key-based retrieval.
  o Metadata registries provide property-based search.
  o It is possible to build an index for content using tools like Lucene and use that for search.
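The idea of a content index can be illustrated with a toy inverted index. This is a sketch of the general technique only, not Lucene's API; the file names, contents and helper functions are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list:
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents: dict) -> dict:
    """Map each token to the set of storage keys whose content has it."""
    index = defaultdict(set)
    for key, content in documents.items():
        for token in tokenize(content):
            index[token].add(key)
    return index


def search(index: dict, query: str) -> set:
    """Return keys of documents that contain every query token."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical file contents, keyed by their storage key.
files = {
    "report.txt": "Quarterly sales report for the data platform",
    "notes.txt": "Meeting notes about storage and sales",
    "readme.txt": "How to install the platform",
}
index = build_index(files)
print(sorted(search(index, "sales")))
print(sorted(search(index, "platform sales")))
```

The store itself still only needs key-based retrieval: the index turns a content query into a set of keys, and the keys are then fetched as usual.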

Handling Semi-Structured Data

• Storage Options
  o The answer depends on the type of structure (e.g. XML = XML databases, graphs = graph databases, lists = data structure servers, work items = queues).
  o If there is a server optimized for a given type, it is often much more efficient than using a DB (e.g. graph databases can support fast relationship search).
• Scalability
  o XML databases can shard data across nodes, so they are usually scalable, but the others are not that scalable.
• Search
  o Very much custom, e.g. XML or any tree = XPath.
  o Graphs can support very fast relationship search.
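As a small illustration of XPath-style search over tree data, the sketch below uses Python's standard `xml.etree.ElementTree`, which supports only a limited XPath subset. The document and its element names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; any tree-shaped data works the same way.
xml_text = """
<library>
  <book year="2009"><title>Cassandra Basics</title></book>
  <book year="2011"><title>Hadoop in Practice</title></book>
  <magazine year="2011"><title>Data Weekly</title></magazine>
</library>
"""

root = ET.fromstring(xml_text)

# Child path: titles of all <book> elements directly under the root.
all_book_titles = [t.text for t in root.findall("./book/title")]

# Attribute predicate: only books with year="2011".
books_2011 = [b.find("title").text for b in root.findall("./book[@year='2011']")]

# Descendant search: every <title> anywhere in the tree.
titles_anywhere = [t.text for t in root.findall(".//title")]

print(all_book_titles)
print(books_2011)
print(titles_anywhere)
```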

Handling Structured Data (1-3 nodes)

• In general, using a DB here for every case might work.
• Reasons for using options other than a DB
  o When there is a potential need to scale later.
  o High write throughput.
• KV is 1-D whereas the other two are 2-D.

*KV: Key-Value Systems, CF: Column Families, Doc: Document-based Systems

Small (1-3 nodes)

              Loose Consistency   Operation Consistency   Transactions
Primary Key   DB/KV/CF            DB/KV/CF                DB
Where         DB/CF/Doc           DB/CF/Doc               DB
JOIN          DB                  DB                      DB
Offline       DB/CF/Doc           DB/CF/Doc               DB/CF/Doc
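The "1-D versus 2-D" distinction between key-value and column-family models can be pictured with plain dictionaries. This is a rough mental model only, with made-up keys and helper names, not the storage layout of any real system.

```python
import json

# Key-value (1-D): one key maps to one opaque value.
kv_store = {"user:42": json.dumps({"name": "Alice", "city": "Colombo"})}

# Column family (2-D): row key first, then column name.
column_family = {
    "user:42": {"name": "Alice", "city": "Colombo"},
    "user:43": {"name": "Bob"},  # rows may carry different columns
}


def kv_get(key):
    """1-D access: the store only understands the key."""
    return kv_store.get(key)


def cf_get(row_key, column):
    """2-D access: address a single column inside a row."""
    return column_family.get(row_key, {}).get(column)


def cf_where(column, value):
    """A WHERE-style filter is possible because columns are visible."""
    return sorted(k for k, cols in column_family.items() if cols.get(column) == value)


print(json.loads(kv_get("user:42"))["city"])  # the client decodes the value
print(cf_get("user:42", "city"))
print(cf_where("name", "Bob"))
```

This is one way to read the table: a key-value store appears only in the Primary Key row because it cannot see inside its values, while column-family and document stores also appear in the Where row.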

Handling Structured Data (10 nodes)

• KV, CF, and Doc can easily handle this case.
• DBs can be used if the data is sharded across many nodes.
• Transactions might work, provided that the participants in one transaction are not too many.
• JOINs might need to transfer too much data between nodes.
• In-memory DBs like VoltDB should also be considered.
• Offline mode will work.
• Most systems let users choose the consistency level, and loose consistency can scale more (e.g. Cassandra).

*KV: Key-Value Systems, CF: Column Families, Doc: Document-based Systems

Scalable (10 nodes)

              Loose Consistency   Operation Consistency   Transactions
Primary Key   KV/CF               KV/CF                   Partitioned DB?
Where         CF/Doc              CF/Doc                  Partitioned DB?
JOIN          ??                  ??                      Partitioned DB??
Offline       CF/Doc              CF/Doc                  No
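The consistency choice can be made concrete with the usual replica-overlap rule for replicated stores: a read is guaranteed to see the latest write when the number of replicas read plus the number of replicas written exceeds the replication factor. The sketch below is only arithmetic under that assumption. The level names mirror Cassandra's ONE, QUORUM and ALL, but no Cassandra API is used, and the replication factor of 3 is an assumed value.

```python
REPLICAS = 3  # assumed replication factor

# Number of replicas that must acknowledge at each level.
LEVELS = {"ONE": 1, "QUORUM": REPLICAS // 2 + 1, "ALL": REPLICAS}


def overlaps(n: int, w: int, r: int) -> bool:
    """True when every read set must share a replica with every write set."""
    return r + w > n


for write_level, read_level in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ONE", "ALL")]:
    ok = overlaps(REPLICAS, LEVELS[write_level], LEVELS[read_level])
    print(f"write={write_level:6} read={read_level:6} overlap guaranteed: {ok}")
```

Lower levels wait for fewer replicas, which is why loose consistency tends to scale better and stay available when some nodes are slow or down.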

Highly Scalable System

• Transactions do not work at this scale (CAP theorem).
• The same goes for JOINs. The problem is that sometimes too much data needs to be transferred between nodes to perform the JOIN.
• The offline case is handled through MapReduce. Even the JOIN case is OK since there is time.



Highly Scalable (1000s nodes)

              Loose Consistency   Operation Consistency   Transactions
Primary Key   KV/CF               KV/CF                   No
Where         CF/Doc              CF/Doc                  No
JOIN          No                  No                      No
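An offline JOIN through MapReduce is often done as a reduce-side join: both inputs are mapped to the join key, grouped, and combined in the reducer. The following is an in-memory sketch of that pattern with hypothetical sample data. It imitates the map, shuffle and reduce phases in one process and does not use Hadoop.

```python
from collections import defaultdict

# Hypothetical inputs: (user_id, name) and (order_id, user_id, total).
users = [("u1", "Alice"), ("u2", "Bob")]
orders = [("o1", "u1", 10), ("o2", "u1", 25), ("o3", "u2", 5)]


def map_phase():
    """Emit (join key, tagged record) pairs from both inputs."""
    for user_id, name in users:
        yield user_id, ("user", name)
    for order_id, user_id, total in orders:
        yield user_id, ("order", (order_id, total))


def shuffle(pairs):
    """Group values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Join the user record with every order that shares its key."""
    names = [v for tag, v in values if tag == "user"]
    order_rows = [v for tag, v in values if tag == "order"]
    return [(key, name, oid, total) for name in names for oid, total in order_rows]


joined = []
for key, values in sorted(shuffle(map_phase()).items()):
    joined.extend(reduce_phase(key, values))

for row in joined:
    print(row)
```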


Highly Scalable Systems + Primary Key Retrieval

• This is (comparatively) the easy one.
• Can be solved through DHT (Distributed Hash Table) based solutions or architectures like OceanStore.
• Both Key-Value Storages (KV) and Column Families (CF) can be used, but the Key-Value model is preferred as it is more scalable.



Highly Scalable (1000s nodes)

              Loose Consistency   Operation Consistency   Transactions
Primary Key   KV/CF               KV/CF                   No
Where         CF/Doc(?)           CF/Doc(?)               No
JOIN          No                  No                      No
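Many DHT-style stores place keys with consistent hashing. The sketch below is a minimal ring with virtual nodes and no replication, written under that assumption. The node names, the `HashRing` class and the virtual-node count are hypothetical and not taken from any specific product.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """A tiny consistent-hash ring (illustrative, no replication)."""

    def __init__(self, nodes, vnodes: int = 50):
        # Each node is placed on the ring at several pseudo-random points.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # First ring position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["node-a", "node-b", "node-c"])
bigger = HashRing(["node-a", "node-b", "node-c", "node-d"])

keys = [f"key-{i}" for i in range(1000)]
moved = [k for k in keys if ring.node_for(k) != bigger.node_for(k)]
print(f"{len(moved)} of {len(keys)} keys moved after adding node-d")
```

Only the keys that land on the new node change owner when the cluster grows, which is what makes primary-key retrieval comparatively easy to scale.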


Highly scalable systems + WHERE

• This is generally OK, but tricky.
• CF works through a secondary index that does scatter-gather (e.g. Cassandra).
• Doc works through Map-Reduce views (e.g. CouchDB).
• There is Bissa, which builds an index for all possible queries (no range queries).
• If you are doing this, you should do pilot runs and make sure things work.
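The scatter-gather idea can be sketched as follows: each partition keeps a local secondary index over its own rows, and a WHERE query is sent to every partition and the partial results are merged. This is a simplified in-memory model with hypothetical class and function names, not Cassandra's implementation.

```python
import zlib
from collections import defaultdict


class Partition:
    """One node's slice of the rows plus its local secondary index."""

    def __init__(self):
        self.rows = {}
        self.index = defaultdict(set)  # (column, value) -> row keys

    def put(self, row_key, columns):
        # Drop index entries of a previous version of the row, if any.
        for column, value in self.rows.get(row_key, {}).items():
            self.index[(column, value)].discard(row_key)
        self.rows[row_key] = dict(columns)
        for column, value in columns.items():
            self.index[(column, value)].add(row_key)

    def where(self, column, value):
        return [(k, self.rows[k]) for k in self.index.get((column, value), ())]


partitions = [Partition() for _ in range(4)]  # assumed cluster size


def put(row_key, columns):
    # Rows are placed by primary key only.
    idx = zlib.crc32(row_key.encode("utf-8")) % len(partitions)
    partitions[idx].put(row_key, columns)


def where(column, value):
    """Scatter the query to every partition, then gather the results."""
    results = []
    for partition in partitions:
        results.extend(partition.where(column, value))
    return sorted(results, key=lambda item: item[0])


put("u1", {"name": "Alice", "city": "Colombo"})
put("u2", {"name": "Bob", "city": "London"})
put("u3", {"name": "Chamara", "city": "Colombo"})

print([key for key, _ in where("city", "Colombo")])
```

Because every partition is contacted for each query, latency follows the slowest node, which is one reason pilot runs are worth doing.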





Hybrid Approaches

• Some solutions have many types of data and hence need more than one data solution (hybrid architectures).
• For example
  o Using a DB for transactional data and CF for other data.
  o Keeping metadata and actual data separate for large data archives.
  o Using a GraphDB to store relationship data while other data is in Column Family storage.
• However, if transactions are needed, they have to be handled outside the storages (e.g. using Atomikos, ZooKeeper).
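The first hybrid example can be sketched as below: balances live in a relational DB where the update is transactional, and a dictionary stands in for a column-family store that holds non-transactional activity data. SQLite, the table layout and the `transfer` helper are assumptions chosen for the illustration. The sketch does not coordinate a transaction across both stores, which is the part that would need an external coordinator.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
db.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 50)])
db.commit()

# Stand-in for a column family: row key -> {column name: value}.
activity_log = {}


def transfer(src, dst, amount):
    """Move money atomically in the DB, then log the event separately."""
    with db:  # commits on success, rolls back if an exception escapes
        cur = db.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ? AND balance >= ?",
            (amount, src, amount),
        )
        if cur.rowcount != 1:
            raise ValueError("insufficient funds or unknown account")
        db.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
        )
    # Not part of the DB transaction: best-effort write to the other store.
    events = activity_log.setdefault(src, {})
    events[f"transfer:{len(events)}"] = f"{amount} to {dst}"


def balances():
    return dict(db.execute("SELECT id, balance FROM accounts"))


transfer("a", "b", 30)
print(balances())
print(activity_log)
```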

Other Parameters

• The above list is not exhaustive, and there are other parameters
  o Read/Write ratio - when high, easy to scale.
  o High write throughput.
  o Very large data products - you will need a file system. Consider keeping metadata in a data registry and storing the data in a file system.
  o Flexible schema.
  o Archival use cases
  o Analytical use cases
  o Others ...

WSO2 Data Solutions

• Data Service Server - DSS
• Relational Storage Service - RSS
• Column Store Service - CSS
• File System as a Service (FSaaS) - HDFS
• DSS and RSS
• DSS and CSS

WSO2 Data Service Server (DSS)

• Support for large XML outputs
• Content filtering based on user's role
• Support for named parameters
• Ability to configure schema type for output elements
• Mixing multiple data sources in nested queries
• Distributed transaction support
• Oracle Ref Cursor support
• Support for multiple data source types
• Clustering support for High Availability and High Scalability
• Full support for WS-Security, WS-Trust, WS-Policy, WS-SecureConversation and XKMS
• JMX and Web interface based monitoring and management
• WS-* and REST support
• Data validations
• UDT (User Defined Type) support
• Complex results
• Auto-generated keys support
• Boxcarring support
• Batch request support
• Scheduled tasks
• Registry integration for Excel, CSV, XSLT
• Web scraping support
• Multiple SQL dialect support
• DB -> DS generation
• Service group/hierarchy support
• Database Explorer
• Data as a Service features - DSS Stratos Service
  o Cassandra integration
  o RDS provisioning

WSO2 Data Service Server (DSS)

Data Services Description Language - DSDL

DSS Management Console

WSO2 Stratos Support for Relational Data

• Offering a “database as a service” for tenants

WSO2 Relational Storage Service

• Users create database and receive JDBC URL

• Database is allocated from an Amazon RDS (MySQL) horizontal cluster
• Tenants are isolated from each other and integrated with the platform security model

WSO2 Relational Storage Service

• Use your own database server (anywhere)

• Register database connection as a datasource
• Use RSS to allocate a database

Stratos RSS

RSS Sample

WSO2 Column Store Service - CSS

Users can log in to the Web Console and create Cassandra key spaces.

Column Store Service (Contd.)

• Key spaces will be allocated from a Cassandra cluster
• Users can manage and share their key spaces through the Stratos Web Console and use those key spaces through the Hector client (Java client for Cassandra)
• In essence, we provide Cassandra as a service within Stratos, with multi-tenancy support and security integration with the WSO2 security model

WSO2 CSS Admin Console

Left Menu

Keyspace View

WSO2 CSS Admin Console

Keyspace Connection Details

WSO2 CSS Sample

File System as a Service - FSaaS

The volume is allocated from an HDFS cluster. Volumes are isolated from other tenants in Stratos and integrated with the WSO2 security model. Users can manage and share their file system through the Stratos Web Console and use it like any other file system.

FSaaS Sample

Data Processing - MapReduce

• MapReduce is inspired by the map and reduce functions used in functional programming.
  o Initially introduced by Google, with some parts being patented.
• Hadoop is a MapReduce implementation released under the Apache License.
• WSO2 provides MapReduce as a service.
• WSO2 Business Activity Monitor (BAM2) is an example use case for WSO2's MapReduce as a service.
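The functional-programming roots are easy to see with Python's built-in `map` and `functools.reduce`. The word count below is a single-process illustration of the two steps with made-up input lines. It is not Hadoop code and says nothing about how the WSO2 service is invoked.

```python
from collections import Counter
from functools import reduce

# Hypothetical input lines.
lines = [
    "big data needs new tools",
    "new tools for big data",
    "data",
]


def map_line(line):
    """Map step: turn one input line into partial word counts."""
    return Counter(line.split())


def merge(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


word_counts = reduce(merge, map(map_line, lines), Counter())
print(word_counts.most_common(3))
```

Because each line is mapped independently and the merge is associative, the same two functions could in principle be run on many machines in parallel, which is the property MapReduce frameworks exploit.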

WSO2 MapReduce

• WSO2 MapReduce is secure.
• WSO2 MapReduce can use both FSaaS and DSS.
  o HDFS (FSaaS)
  o Cassandra (DSS)

Q&A

Selected Customers

WSO2 engagement model

• QuickStart
• Development Support
• Development Services
• Production Support
• Turnkey Solutions
  o WSO2 Mobile Services Solution
  o WSO2 FIX Gateway Solution
  o WSO2 SAP Gateway Solution

Technology
