Stratio Meta An efficient distributed datahub with batch and streaming query capabilities Daniel Higuero [email protected] Alvaro Agea [email protected] #CassandraSummit 2014

Crossdata: an efficient distributed datahub with batch and streaming query capabilities


DESCRIPTION

Big Data analysis is commonly associated with batch processing of data stored in distributed file systems. The advent of streaming data is exposing the shortcomings of traditional data analysis. Users aiming to combine both worlds, batch processing and streaming, have had to turn to unreliable in-house developments. We propose Stratio META to meet this new need. META is a technology based on a structured NoSQL datastore with advanced indexing capabilities. META includes an efficient query planner designed from scratch. The planner determines the optimal path to execute a query and which components should be involved.


Page 1: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

Stratio Meta An efficient distributed datahub with batch and streaming query capabilities Daniel Higuero [email protected]

Alvaro Agea [email protected]


Page 2: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

Stratio Crossdata An efficient distributed datahub with batch and streaming query capabilities Daniel Higuero [email protected]

Alvaro Agea [email protected]


Page 3: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

STRATIO Who are we?

•  Stratio is a Big Data Company
•  Founded in 2013
•  Commercially launched in 2014
•  50+ employees in Madrid
•  Office in San Francisco
•  Certified Spark distribution

Page 4: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

Cassandra We love…

•  P2P architecture
•  Read/write performance
•  Fault tolerance
•  Easy to deploy
•  CQL

Page 5: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

Agenda

•  Introduction
•  Crossdata architecture
•  Metadata management
•  Streaming sources
•  Full text search
•  Spark and Crossdata
•  ODBC
•  The future

Page 6: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

Introduction

o  Big Data analysis is commonly associated with batch processing
   •  Users aiming to combine batch and stream processing have to rely on tailor-made architectures
o  Users buy Big Data platforms, but
   •  How do I start?
   •  What is my entry point to the platform?

Page 7: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

What our clients demand?

o  Easy deployment
o  Easy administration
o  Read/write performance
o  Easy-to-learn query language
o  Integration with BI Tools
o  Join operations
o  Support for streaming sources
o  Integration with other data stores
o  Ability to query data without thinking about the schema (non-indexed data)

Page 8: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

What our clients demand?

!  Easy deployment
!  Easy administration
!  Read/write performance
!  Easy-to-learn query language
o  Integration with BI Tools
o  Join operations
o  Support for streaming sources
o  Integration with other data stores
o  Ability to query data without thinking about the schema (non-indexed data)

Page 9: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

What our clients demand?

!  Easy deployment
!  Easy administration
!  Read/write performance
!  Easy-to-learn query language
!  Integration with BI Tools
!  Join operations
!  Support for streaming sources
!  Integration with other data stores
!  Ability to query data without thinking about the schema (non-indexed data)

Page 10: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

Crossdata

o  A new technology that:
   •  Is not limited by the underlying datastore capabilities
   •  Leverages Spark to perform non-natively supported operations
   •  Supports batch and streaming queries
   •  Supports multiple clusters and technologies

Page 11: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

Our architecture


Page 12: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

Connecting to the outside world

o  Crossdata defines an IConnector extension interface
o  Users can easily add new connectors to support
   •  Different datastores
   •  Different processing engines
   •  Different versions
o  Where each connector defines its capabilities

Our planner will choose the best connector for each query
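Choosing a connector by its declared capabilities can be sketched as follows. This is a minimal illustration and not the actual IConnector API: the `Connector` class, the capability names, and the `cost` field are all hypothetical, and "best" is assumed here to mean "cheapest connector that supports every operation in the query".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connector:
    """Hypothetical stand-in for a connector; not the real IConnector interface."""
    name: str
    capabilities: frozenset  # operations this connector declares it can run
    cost: int                # assumed relative cost; lower is preferred


def choose_connector(required, connectors):
    """Return the cheapest connector whose capabilities cover the query."""
    candidates = [c for c in connectors if required <= c.capabilities]
    if not candidates:
        raise LookupError(f"no connector supports {sorted(required)}")
    return min(candidates, key=lambda c: c.cost)


# Illustrative connectors: a native driver for simple queries and a
# Spark-based one for operations the datastore cannot run natively.
CONNECTORS = [
    Connector("cassandra-native", frozenset({"select", "filter_indexed"}), cost=1),
    Connector(
        "cassandra-spark",
        frozenset({"select", "filter_indexed", "filter_non_indexed",
                   "join", "group_by"}),
        cost=10,
    ),
]

simple = choose_connector({"select", "filter_indexed"}, CONNECTORS)
heavy = choose_connector({"select", "join"}, CONNECTORS)
print(simple.name, heavy.name)
```

With these made-up costs, a simple lookup goes to the native connector, while a query that needs a join falls through to the Spark-based one.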

Page 13: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

Query execution

[Diagram: Parsing, Validation, Planning, Execution; C*, Connector1, Connector2, Connector3]

Our planner will choose the best connector for each query

Page 14: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

Multi-cluster support

o  Stratio Crossdata offers the possibility of accessing a single catalog across a set of datastores.
   •  Multiple clusters can coexist to optimize platform performance
      -  E.g., production cluster, test cluster, write-optimized cluster, read-optimized cluster, etc.
   •  A table is saved in a unique datastore

Page 15: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

Logical and physical mapping

[Diagram: App catalog; Users table, Test table, old_users table; C* production, C* development, Other datastores]

SELECT * FROM app.users;
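One way to picture the logical-to-physical mapping is a lookup from a catalog-qualified table name to the single datastore that holds it. The sketch below is only an illustration: the `CATALOGS` structure, the `resolve` function, and the table-to-cluster assignments are assumptions, not Crossdata internals.

```python
# Hypothetical catalog metadata: each table lives in exactly one cluster.
# The table-to-cluster assignments below are illustrative assumptions.
CATALOGS = {
    "app": {
        "users": "cassandra_production",
        "test": "cassandra_development",
        "old_users": "other_datastore",
    }
}


def resolve(qualified_name):
    """Map a logical name such as 'app.users' to (cluster, table)."""
    catalog, _, table = qualified_name.partition(".")
    try:
        return CATALOGS[catalog][table], table
    except KeyError:
        raise LookupError(f"unknown table: {qualified_name}") from None


print(resolve("app.users"))
```

The client only ever writes `app.users`; which cluster serves the query is decided by the lookup.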

Page 16: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

Metadata Management


Page 17: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

Metadata in the era of Schemaless NoSQL datastores

o  Some datastores are schemaless but our applications are not!
   •  Flexible schemas vs Schemaless
   •  Crossdata provides a Metadata manager that stores schemas for any datasource
      -  Remember ODBC and those BI tools

Page 18: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

Metadata management

[Diagram: Metadata Manager, Metadata Store, Infinispan, Connector, C* production; steps 1 and 2]

Updated metadata information is maintained among Crossdata servers using Infinispan

If the connector does not support metadata operations those are skipped
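The behaviour described on this slide (always keep the schema in the metadata store, and forward the operation to the datastore only when its connector supports it) can be sketched roughly as below. The class and attribute names are hypothetical, a plain dictionary stands in for the Infinispan-backed store, and the order of the two steps is an assumption.

```python
class MetadataManager:
    """Hypothetical sketch: keeps schemas itself, forwards DDL when possible."""

    def __init__(self, connector_supports_metadata):
        self.store = {}  # stands in for the replicated metadata store
        self.connector_supports_metadata = connector_supports_metadata
        self.forwarded = []  # operations that were sent to the connector

    def create_table(self, name, schema):
        # Always record the schema so clients (e.g. BI tools) can inspect it.
        self.store[name] = dict(schema)
        # Forward to the datastore only if its connector can handle it;
        # otherwise the operation is skipped on the datastore side.
        if self.connector_supports_metadata:
            self.forwarded.append(("create_table", name))
        return self.store[name]


schemaless = MetadataManager(connector_supports_metadata=False)
schemaless.create_table("app.users", {"name": "text", "email": "text"})
print(schemaless.store, schemaless.forwarded)
```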

Page 19: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

Streaming sources


Page 20: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

Managing streaming sources

o  Today's use cases expect some type of streaming datasource
   •  Streaming data has an ephemeral nature
   •  In Stratio Crossdata we defined the ephemeral table abstraction to work with streaming sources as classical RDBMS tables

[Diagram: streaming source; {schema:{col1:…},…}; columns col1:text, col2:int, col3:int, col4:text; Streaming_query0 … Streaming_queryn]

Page 21: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

Streaming queries

o  Streaming queries are infinite by definition
   •  A time window is defined to create a batch-like view of the rows ingested by the system in that period
   •  The user launches queries specifying a processing time window
      -  Crossdata provides methods to list and stop running streaming queries

Page 22: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

Streaming queries: windows syntax

SELECT fieldGroup, avg(field2) FROM eph_table WITH WINDOW 5 minutes WHERE field1 = 100 AND field2 > 100 GROUP BY fieldGroup;
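To make the windowed query concrete, the sketch below emulates it in plain Python over a list of timestamped rows. It assumes tumbling (non-overlapping) windows, which the slides do not specify, and the function name and sample data are made up for illustration.

```python
from collections import defaultdict


def windowed_avg(rows, window_seconds):
    """Emulate the windowed query over (timestamp, row) pairs.

    Assumes tumbling (non-overlapping) windows keyed by
    timestamp // window_seconds; the source does not specify this.
    """
    sums = defaultdict(lambda: [0.0, 0])  # (window, group) -> [total, count]
    for ts, row in rows:
        if row["field1"] == 100 and row["field2"] > 100:  # WHERE clause
            key = (int(ts // window_seconds), row["fieldGroup"])
            sums[key][0] += row["field2"]
            sums[key][1] += 1
    # One batch-like result per window: {window: {group: average}}
    result = defaultdict(dict)
    for (window, group), (total, count) in sums.items():
        result[window][group] = total / count
    return dict(result)


rows = [
    (10, {"fieldGroup": "a", "field1": 100, "field2": 150}),
    (20, {"fieldGroup": "a", "field1": 100, "field2": 250}),
    (30, {"fieldGroup": "b", "field1": 100, "field2": 50}),    # fails field2 > 100
    (310, {"fieldGroup": "a", "field1": 100, "field2": 300}),  # next 5-minute window
]
print(windowed_avg(rows, window_seconds=300))
```

Each window yields its own finite result set, which is what lets an otherwise infinite stream be queried like a table.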

Page 23: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

Joining batch and streaming

SELECT * FROM demo.temporal WITH WINDOW 10 secs INNER JOIN demo.users ON users.name = temporal.name;

[Diagram: the query split into three parts: SELECT * FROM demo.temporal WITH WINDOW 10 secs; SELECT * FROM demo.users; INNER JOIN ON users.name = temporal.name]
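The join of a streaming window with a batch table can be illustrated with a small hash join: index the batch side on the join key, then probe it with each row of the current window. This is a sketch under assumptions; the slides do not say which join strategy is used, and the function name, column prefixes, and sample rows are hypothetical.

```python
def join_window_with_batch(window_rows, batch_rows, key="name"):
    """Inner-join one window of streaming rows with a batch table.

    Builds a hash index on the batch side, then probes it with each
    streaming row. Column names are prefixed to avoid collisions.
    """
    index = {}
    for user in batch_rows:
        index.setdefault(user[key], []).append(user)
    joined = []
    for event in window_rows:
        for user in index.get(event[key], []):
            row = {f"temporal.{k}": v for k, v in event.items()}
            row.update({f"users.{k}": v for k, v in user.items()})
            joined.append(row)
    return joined


users = [{"name": "ana", "age": 30}, {"name": "luis", "age": 41}]
window = [{"name": "ana", "clicks": 3}, {"name": "eva", "clicks": 1}]
print(join_window_with_batch(window, users))
```

Only rows whose `name` appears on both sides survive, so the streaming row for "eva" is dropped in this example.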

Page 24: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

Full text search


Page 25: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

Full text search with

o  Clients request the ability to perform full text searches
o  We have developed an integration between Lucene and Cassandra
o  C* users can now enjoy all Lucene features:
   •  Full text searches, range queries, fuzzy queries…

https://github.com/Stratio/stratio-cassandra

Page 26: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

Stratio Lucene 2i

[Diagram: five C* nodes, each with a Lucene index]

Page 27: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

Full text search queries

o  With Crossdata, we simplify:
   •  The creation syntax
   •  The query syntax using the match operator

CREATE FULLTEXT INDEX ON app.users(name,email);

SELECT * FROM app.users WHERE email MATCH '*@stratio.com';

Page 28: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

& Stratio Crossdata


Page 29: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

Why Spark?

o  Stratio Crossdata uses Spark to perform non-natively supported operations
o  Spark brings several benefits over Hadoop
o  In-Memory processing
o  RDD abstraction
o  Simpler API
o  Increased flexibility (e.g., no need for identity mapping)

Page 30: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

What about Spark SQL?

o  Different approach to query execution
   •  We only use Spark when it speeds up queries
      -  Native drivers are faster for simple queries
      -  Spark SQL has limited RDD sources
   •  Avoid some Spark limitations
   •  Several batch and streaming contexts in a single JVM SPARK-2243

Page 31: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

Query approach

[Diagram: SparkSQL approach: Cassandra, Spark, SparkSQL. Crossdata approach: Cassandra, Spark, Native driver, Stratio Crossdata]

Page 32: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

Our Cassandra-Spark integration

o  Project started in June 2013
   -  With the objective of providing a method to interact with Cassandra from Spark
   -  Initial approach based on the HadoopInputFormat interface
   -  Current version uses the native Datastax Java driver

https://github.com/Stratio/stratio-deep

Page 33: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

Our Cassandra-Spark integration

o  Benchmark in progress comparing our solution with the Datastax Spark driver
   •  Results highly influenced by the split size
   •  Initial results are promising for Stratio Spark Integration using Datastax default values
   •  Group by: up to 40% faster
   •  Join: up to 17% faster
   •  Stay tuned for the benchmark publication!

Page 34: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

#CassandraSummit-2014

Spark vs Lucene 2i

[Plot: Time versus Records returned; series: Spark, Lucene 2i]

Page 35: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

ODBC


Page 36: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

Stratio Crossdata ODBC

o  Well-known interface standard (for BI tools, external apps, …)
o  We have implemented it using Simba SDK
o  ODBC opens the full potential of Stratio Crossdata to the external world
o  Currently tested with Tableau, Qlikview and MS Excel

One ODBC for all datastores!

Page 37: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

The future


Page 38: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

The future

o  Security
o  Query optimizer and smart query planner
o  Leverage system statistics
o  Support for UDFs
o  Become an Apache project

https://github.com/Stratio/stratio-meta

Page 39: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

We are looking for an Apache Champion

Can you help us?

Page 40: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

A wish list for Cassandra

o  Ability to stop running queries
o  Interactive users are unpredictable
o  Some exception paths are not clear or defined (e.g., secondary indexes)
o  Distribute some of the operations currently performed on the coordinator
   •  E.g., aggregations like count(*)

Page 41: Crossdata: an efficient distributed datahub with batch and streaming query capabilities

Stratio Crossdata An efficient distributed datahub with batch and streaming query capabilities Daniel Higuero [email protected]

Alvaro Agea [email protected]
