Crossdata: an efficient distributed datahub with batch and streaming query capabilities

Stratio Meta: An efficient distributed datahub with batch and streaming query capabilities
Daniel Higuero [email protected]
Alvaro Agea [email protected]
#CassandraSummit 2014




Stratio: Who are we?

•  Stratio is a Big Data company
•  Founded in 2013
•  Commercially launched in 2014
•  50+ employees in Madrid
•  Office in San Francisco
•  Certified Spark distribution

Cassandra: We love…

•  P2P architecture
•  Read/write performance
•  Fault tolerance
•  Easy to deploy
•  CQL


Agenda

•  Introduction
•  Crossdata architecture
•  Metadata management
•  Streaming sources
•  Full text search
•  Spark and Crossdata
•  ODBC
•  The future

Introduction

o  Big Data analysis is commonly associated with batch processing
   •  Users aiming to combine batch and stream processing have to rely on tailor-made architectures
o  Users buy Big Data platforms, but
   •  How do I start?
   •  What is my entry point to the platform?

What do our clients demand?

o  Easy deployment
o  Easy administration
o  Read/write performance
o  Easy-to-learn query language
o  Integration with BI tools
o  Join operations
o  Support for streaming sources
o  Integration with other data stores
o  Ability to query data without thinking about the schema (non-indexed data)





Crossdata

o  A new technology that:
   •  Is not limited by the underlying datastore capabilities
   •  Leverages Spark to perform non-natively supported operations
   •  Supports batch and streaming queries
   •  Supports multiple clusters and technologies


Our architecture


Connecting to the outside world

o  Crossdata defines an IConnector extension interface
o  Users can easily add new connectors to support
   •  Different datastores
   •  Different processing engines
   •  Different versions
o  Each connector defines its capabilities


Query execution

Our planner will choose the best connector for each query.

[Figure: query pipeline (Parsing, Validation, Planning, Execution); labels: C*, Connector1, Connector2, Connector3]
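
The slides do not show how the planner ranks connectors, so the following is only a minimal sketch of the idea under stated assumptions: each connector declares a set of capabilities, and the planner picks the cheapest connector that supports every operation a query needs. The `Connector` class, the capability names and the `cost` field are all hypothetical and are not part of the Crossdata API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connector:
    """Hypothetical stand-in for an IConnector implementation."""
    name: str
    capabilities: frozenset  # operations the connector declares it can execute
    cost: int                # assumed relative cost; lower is preferred


def choose_connector(required_ops, connectors):
    """Return the cheapest connector that supports every required operation."""
    candidates = [c for c in connectors if set(required_ops) <= c.capabilities]
    if not candidates:
        raise LookupError(f"no connector supports {sorted(required_ops)}")
    return min(candidates, key=lambda c: c.cost)


# Assumed example connectors: a native driver and a Spark-based one.
native = Connector("cassandra-native", frozenset({"SELECT", "FILTER_PK"}), cost=1)
spark = Connector(
    "cassandra-spark",
    frozenset({"SELECT", "FILTER_PK", "JOIN", "GROUP_BY"}),
    cost=10,
)

simple_query = choose_connector({"SELECT", "FILTER_PK"}, [native, spark])
join_query = choose_connector({"SELECT", "JOIN"}, [native, spark])
```

With these assumed costs, a simple lookup goes to the native connector, while a query that needs a join falls through to the Spark-based one.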

Multi-cluster support

o  Stratio Crossdata offers the possibility of accessing a single catalog across a set of datastores.
   •  Multiple clusters can coexist to optimize platform performance
      ▪  E.g., production cluster, test cluster, write-optimized cluster, read-optimized cluster, etc.
   •  A table is stored in a single datastore



Logical and physical mapping

[Figure: App catalog with the Users, Test and old_users tables, mapped onto the C* production, C* development and other datastores]

SELECT * FROM app.users;
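
As a rough illustration of the mapping, a single logical catalog can be resolved to physical clusters with a lookup table. This is a sketch only: the `TABLE_LOCATION` dictionary, the cluster names and the table-to-cluster assignment are assumptions made for the example, not Crossdata internals.

```python
# Hypothetical catalog metadata: each logical table lives in exactly one cluster.
# The table-to-cluster assignment below is assumed for illustration.
TABLE_LOCATION = {
    ("app", "users"): "cassandra_production",
    ("app", "test"): "cassandra_development",
    ("app", "old_users"): "other_datastore",
}


def resolve(qualified_name):
    """Map a logical 'catalog.table' name to the cluster that stores it."""
    catalog, table = qualified_name.split(".", 1)
    try:
        return TABLE_LOCATION[(catalog, table)]
    except KeyError:
        raise LookupError(f"unknown table {qualified_name!r}") from None


# The user only writes app.users; the physical cluster is looked up.
cluster = resolve("app.users")
```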

Metadata Management


Metadata in the era of Schemaless NoSQL datastores

o  Some datastores are schemaless, but our applications are not!
   •  Flexible schemas vs. schemaless
   •  Crossdata provides a Metadata Manager that stores schemas for any datasource
      ▪  Remember ODBC and those BI tools


Metadata management

[Figure: labels: Metadata Manager, Metadata Store, Infinispan, Connector, C* production]

o  Updated metadata information is maintained among Crossdata servers using Infinispan
o  If the connector does not support metadata operations, those are skipped
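
The behaviour described on this slide can be sketched as follows. This is a simplified, hypothetical model: a plain dictionary stands in for the replicated metadata store, the class and method names are invented for the example, and the order of the two steps (store first, then connector) is an assumption.

```python
class MetadataManager:
    """Hypothetical sketch: keeps schemas in a store and forwards them to connectors."""

    def __init__(self, store):
        self.store = store  # stands in for the replicated metadata store

    def create_table(self, name, schema, connector):
        self.store[name] = schema  # always record the schema
        create = getattr(connector, "create_table", None)
        if create is None:  # connector has no metadata operations: skip
            return "skipped"
        create(name, schema)
        return "propagated"


class SchemaAwareConnector:
    """Assumed connector that supports metadata operations."""

    def __init__(self):
        self.tables = {}

    def create_table(self, name, schema):
        self.tables[name] = schema


class SchemalessConnector:
    """Assumed connector with no metadata operations at all."""


manager = MetadataManager(store={})
schema = {"name": "text", "email": "text"}
first = manager.create_table("app.users", schema, SchemaAwareConnector())
second = manager.create_table("app.old_users", schema, SchemalessConnector())
```

In both cases the schema ends up in the store, so tools that need a schema can still see it even when the underlying datastore keeps none.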

Streaming sources



Managing streaming sources

o  Today's use cases expect some type of streaming datasource
   •  Streaming data has an ephemeral nature
   •  In Stratio Crossdata we defined the ephemeral table abstraction to work with streaming sources as classical RDBMS tables

[Figure: a streaming source with columns col1:text, col2:int, col3:int, col4:text, described as {schema:{col1:…},…}, feeding Streaming_query0 … Streaming_queryn]


Streaming queries

o  Streaming queries are infinite by definition
   •  A time window is defined to create a batch-like view of the rows ingested by the system in that period
   •  The user launches queries specifying a processing time window
      ▪  Crossdata provides methods to list and stop running streaming queries


Streaming queries: windows syntax


SELECT fieldGroup,avg(Field2) FROM eph_table WITH WINDOW 5 minutes WHERE field1=100 AND field2>100 GROUP BY fieldGroup;
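
To make the windowing idea concrete, the query above can be emulated in plain Python. This is a sketch under assumptions, not how Crossdata executes it: rows are assumed to carry a timestamp in seconds, and windows are assumed to be fixed and non-overlapping, which the slides do not specify.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60  # WITH WINDOW 5 minutes


def windowed_avg(rows):
    """Group rows into fixed windows, then filter and average per fieldGroup.

    Each row is (timestamp_seconds, fieldGroup, field1, field2).
    Returns {window_start: {fieldGroup: avg(field2)}}.
    """
    sums = defaultdict(lambda: defaultdict(lambda: [0.0, 0]))
    for ts, group, field1, field2 in rows:
        if field1 == 100 and field2 > 100:  # WHERE clause
            window_start = ts - ts % WINDOW_SECONDS  # assumed fixed windows
            acc = sums[window_start][group]
            acc[0] += field2
            acc[1] += 1
    return {
        start: {g: total / count for g, (total, count) in groups.items()}
        for start, groups in sums.items()
    }


rows = [
    (10, "a", 100, 200),
    (20, "a", 100, 400),
    (30, "b", 100, 50),    # dropped: field2 is not > 100
    (40, "b", 7, 500),     # dropped: field1 is not 100
    (310, "a", 100, 150),  # falls in the second window
]
result = windowed_avg(rows)
```

Each window yields an ordinary, finite result set, which is what lets an infinite stream be queried like a table.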


Joining batch and streaming

SELECT * FROM demo.temporal WITH WINDOW 10 secs INNER JOIN demo.users ON users.name = temporal.name;

[Figure: the query shown as three parts: the streaming part (SELECT * FROM demo.temporal WITH WINDOW 10 secs), the batch part (SELECT * FROM demo.users) and the join (INNER JOIN ON users.name = temporal.name)]
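
Once a window has been materialized, joining it with a batch table is an ordinary join. The sketch below shows one way to do that with a hash join on `name`; the function name, the sample rows and the choice of a hash join are assumptions for illustration and do not describe the Crossdata implementation.

```python
def join_window_with_batch(window_rows, users):
    """Inner-join one window of streaming rows with a batch table on 'name'."""
    users_by_name = {}
    for user in users:  # build a hash index on the batch side
        users_by_name.setdefault(user["name"], []).append(user)
    joined = []
    for event in window_rows:  # probe the index with each streamed row
        for user in users_by_name.get(event["name"], []):
            joined.append({**user, **event})
    return joined


# Assumed sample data standing in for demo.users and one window of demo.temporal.
users = [
    {"name": "ana", "email": "ana@example.com"},
    {"name": "bob", "email": "bob@example.com"},
]
window = [{"name": "ana", "value": 1}, {"name": "eve", "value": 2}]
rows = join_window_with_batch(window, users)
```

Streamed rows without a matching user are dropped, as expected from an inner join.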

Full text search


Full text search with Lucene

o  Clients request the ability to perform full-text searches
o  We have developed an integration between Lucene and Cassandra
o  C* users can now enjoy all Lucene features:
   •  Full-text searches, range queries, fuzzy queries…

https://github.com/Stratio/stratio-cassandra

Stratio Lucene 2i

[Figure: five C* nodes, each with its own Lucene index]

Full text search queries

o  With Crossdata, we simplify:
   •  The creation syntax
   •  The query syntax, using the MATCH operator

CREATE FULLTEXT INDEX ON app.users(name, email);

SELECT * FROM app.users WHERE email MATCH '*@stratio.com';

Spark & Stratio Crossdata

Why Spark?

o  Stratio Crossdata uses Spark to perform non-natively supported operations
o  Spark brings several benefits over Hadoop
   •  In-memory processing
   •  RDD abstraction
   •  Simpler API
   •  Increased flexibility (e.g., no need for identity mapping)


What about Spark SQL?

o  Different approach to query execution
   •  We only use Spark when it speeds up queries
      ▪  Native drivers are faster for simple queries
      ▪  Spark SQL has limited RDD sources
   •  Avoid some Spark limitations
   •  Several batch and streaming contexts in a single JVM (SPARK-2243)



Query approach

[Figure: two stacks side by side. SparkSQL approach: SparkSQL, Spark, Cassandra. Crossdata approach: Stratio Crossdata, Spark and native driver, Cassandra]


Our Cassandra-Spark integration

o  Project started in June 2013
   ▪  With the objective of providing a method to interact with Cassandra from Spark
   ▪  Initial approach based on the HadoopInputFormat interface
   ▪  Current version uses the native DataStax Java driver

https://github.com/Stratio/stratio-deep


Our Cassandra-Spark integration

o  Benchmark in progress comparing our solution with the DataStax Spark driver
   •  Results highly influenced by the split size
   •  Initial results are promising for the Stratio Spark integration using DataStax default values
   •  Group by: up to 40% faster
   •  Join: up to 17% faster
   •  Stay tuned for the benchmark publication!


Spark vs Lucene 2i

[Figure: time versus records returned, for Spark and Lucene 2i]

ODBC


Stratio Crossdata ODBC

o  Well-known interface standard (for BI tools, external apps, …)
o  We have implemented it using the Simba SDK
o  ODBC opens the full potential of Stratio Crossdata to the external world
o  Currently tested with Tableau, QlikView and MS Excel


One ODBC for all datastores!

The future



The future

o  Security
o  Query optimizer and smart query planner
o  Leverage system statistics
o  Support for UDFs
o  Become an Apache project

https://github.com/Stratio/stratio-meta


We are looking for an Apache Champion

Can you help us?

A wish list for Cassandra

o  Ability to stop running queries
o  Interactive users are unpredictable
o  Some exception paths are not clear or defined (e.g., secondary indexes)
o  Distribute some of the operations currently performed on the coordinator
   •  E.g., aggregations like count(*)

