Big Data analysis is commonly associated with batch processing of data stored in distributed file systems. The advent of streaming data is exposing the shortcomings of traditional data analysis. Users aiming to combine both worlds, batch processing and streaming, have had to turn to unreliable in-house developments. We propose Stratio META to meet this new need. META is a technology based on a structured NoSQL datastore with advanced indexing capabilities. META includes an efficient query planner designed from scratch. The planner determines the optimal path to execute a query and which components should be involved.
Stratio Meta
An efficient distributed datahub with batch and streaming query capabilities
Daniel Higuero, Alvaro Agea
#CassandraSummit-2014
Stratio Crossdata
An efficient distributed datahub with batch and streaming query capabilities
Daniel Higuero, Alvaro Agea
Stratio: who are we?
• Stratio is a Big Data company
• Founded in 2013
• Commercially launched in 2014
• 50+ employees in Madrid
• Office in San Francisco
• Certified Spark distribution
Cassandra: we love…
• P2P architecture
• Read/write performance
• Fault tolerance
• Easy to deploy
• CQL
Agenda
• Introduction
• Crossdata architecture
• Metadata management
• Streaming sources
• Full text search
• Spark and Crossdata
• ODBC
• The future
Introduction
o Big Data analysis is commonly associated with batch processing
  • Users aiming to combine batch and stream processing have to rely on tailor-made architectures
o Users buy Big Data platforms, but
  • How do I start?
  • What is my entry point to the platform?
What do our clients demand?
o Easy deployment
o Easy administration
o Read/write performance
o Easy-to-learn query language
o Integration with BI tools
o Join operations
o Support for streaming sources
o Integration with other data stores
o Ability to query data without thinking about the schema (non-indexed data)
What do our clients demand?
✓ Easy deployment
✓ Easy administration
✓ Read/write performance
✓ Easy-to-learn query language
o Integration with BI tools
o Join operations
o Support for streaming sources
o Integration with other data stores
o Ability to query data without thinking about the schema (non-indexed data)
What do our clients demand?
✓ Easy deployment
✓ Easy administration
✓ Read/write performance
✓ Easy-to-learn query language
✓ Integration with BI tools
✓ Join operations
✓ Support for streaming sources
✓ Integration with other data stores
✓ Ability to query data without thinking about the schema (non-indexed data)
Crossdata
o A new technology that:
  • Is not limited by the underlying datastore capabilities
  • Leverages Spark to perform non-natively supported operations
  • Supports batch and streaming queries
  • Supports multiple clusters and technologies
Our architecture
Connecting to the outside world
o Crossdata defines an IConnector extension interface
o Users can easily add new connectors to support
  • Different datastores
  • Different processing engines
  • Different versions
o Each connector declares its own capabilities
Query execution
[Figure: query pipeline with the stages Parsing, Validation, Planning, Execution; Connector1, Connector2, Connector3; C*]
Our planner will choose the best connector for each query
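One way to picture that choice is the minimal sketch below. Everything in it is hypothetical: the `Connector` class, the capability names, and the `cost` field are illustrative and are not taken from the Crossdata IConnector interface. It also assumes that "best" means the cheapest connector whose declared capabilities cover every operation in the query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connector:
    """Hypothetical stand-in for a connector that declares its capabilities."""
    name: str
    capabilities: frozenset
    cost: int  # assumed relative cost; lower is preferred


def choose_connector(required, connectors):
    """Return the cheapest connector that supports every required operation."""
    candidates = [c for c in connectors if required <= c.capabilities]
    if not candidates:
        raise LookupError(f"no connector supports {sorted(required)}")
    return min(candidates, key=lambda c: c.cost)


# Illustrative registry: a native driver with few capabilities and a
# general-purpose engine with many (names and costs are assumptions).
CONNECTORS = [
    Connector("native-cassandra", frozenset({"select", "filter_indexed"}), cost=1),
    Connector(
        "spark",
        frozenset({"select", "filter_indexed", "filter_any", "join", "group_by"}),
        cost=10,
    ),
]

simple = choose_connector(frozenset({"select", "filter_indexed"}), CONNECTORS)
complex_ = choose_connector(frozenset({"select", "join"}), CONNECTORS)
print(simple.name)
print(complex_.name)
```

With this rule a simple indexed lookup stays on the native connector, and only a query that needs a join is routed to the heavier engine.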
Multi-cluster support
o Stratio Crossdata offers the possibility of accessing a single catalog across a set of datastores.
  • Multiple clusters can coexist to optimize platform performance
    - E.g., production cluster, test cluster, write-optimized cluster, read-optimized cluster, etc.
  • Each table is stored in exactly one datastore
Logical and physical mapping
[Figure: the App catalog with its Users, Test, and old_users tables, mapped onto C* production, C* development, and other datastores]
SELECT * FROM app.users;
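The logical-to-physical lookup behind such a query can be sketched as follows. This is an illustration only: the `CATALOG` dictionary, the cluster names, and the assignment of each table to a cluster are assumptions, and `resolve` is a hypothetical helper rather than a Crossdata function.

```python
# Hypothetical catalog metadata: each logical table lives in exactly one cluster.
CATALOG = {
    "app": {
        "users": "cassandra_production",
        "test": "cassandra_development",
        "old_users": "other_datastore",
    }
}


def resolve(qualified_name, catalog=CATALOG):
    """Map 'catalog.table' to the single cluster that stores the table."""
    catalog_name, _, table = qualified_name.partition(".")
    try:
        return catalog[catalog_name][table]
    except KeyError:
        raise LookupError(f"unknown table: {qualified_name}") from None


# The application only names app.users; the lookup decides where to send it.
print(resolve("app.users"))
```

The point of the design is that the query text never mentions a cluster, so a table can move between datastores without changing application code.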
Metadata Management
Metadata in the era of schemaless NoSQL datastores
o Some datastores are schemaless, but our applications are not!
  • Flexible schemas vs. schemaless
  • Crossdata provides a Metadata manager that stores schemas for any datasource
    - Remember ODBC and those BI tools
Metadata management
[Figure: Metadata Manager, Metadata Store, Infinispan, Connector, C* production]
o Updated metadata information is maintained among Crossdata servers using Infinispan
o If the connector does not support metadata operations, those are skipped
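The skip rule can be illustrated with a short sketch. The class names, the `supports_metadata` flag, and the in-memory dictionary standing in for the shared metadata store are all assumptions made for this example; it also assumes the schema is recorded even when the datastore-side operation is skipped.

```python
class Connector:
    """Hypothetical connector; `supports_metadata` is an assumed flag."""

    def __init__(self, name, supports_metadata):
        self.name = name
        self.supports_metadata = supports_metadata
        self.created = []  # tables this connector was asked to create

    def create_table(self, table, schema):
        self.created.append(table)


class MetadataManager:
    """Keeps every schema locally, whatever the datastore can do."""

    def __init__(self):
        self.store = {}  # stand-in for the shared metadata store

    def create_table(self, connector, table, schema):
        self.store[table] = dict(schema)     # always record the schema
        if not connector.supports_metadata:  # skip the datastore-side operation
            return False
        connector.create_table(table, schema)
        return True


manager = MetadataManager()
cassandra = Connector("cassandra", supports_metadata=True)
schemaless = Connector("schemaless-store", supports_metadata=False)

forwarded = manager.create_table(cassandra, "app.users", {"name": "text", "age": "int"})
skipped = manager.create_table(schemaless, "app.events", {"payload": "text"})
print(forwarded, skipped, sorted(manager.store))
```

Both schemas end up in the store, so tools that need a schema can still see the table held in the schemaless datastore.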
Streaming sources
Managing streaming sources
o Nowadays, use cases expect some type of streaming datasource
  • Streaming data has an ephemeral nature
  • In Stratio Crossdata we defined the ephemeral table abstraction to work with streaming sources as classical RDBMS tables
[Figure: a streaming source with columns col1:text, col2:int, col3:int, col4:text and schema {schema:{col1:…},…}, feeding Streaming_query0 … Streaming_queryn]
Streaming queries
o Streaming queries are infinite by definition
  • A time window is defined to create a batch-like view of the rows ingested by the system in that period
  • The user launches queries specifying a processing time window
    - Crossdata provides methods to list and stop running streaming queries
Streaming queries: window syntax
SELECT fieldGroup, avg(field2) FROM eph_table WITH WINDOW 5 minutes WHERE field1=100 AND field2>100 GROUP BY fieldGroup;
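To make the batch-like view concrete, here is a minimal sketch of what such a windowed query computes. It assumes fixed, non-overlapping windows keyed by an ingestion timestamp `ts` in seconds; the slides do not say whether Crossdata windows tumble or slide, and `windowed_avg` is a hypothetical helper, not part of Crossdata.

```python
from collections import defaultdict


def windowed_avg(rows, window_seconds):
    """Bucket rows into fixed windows by timestamp, keep rows with
    field1 == 100 and field2 > 100, and average field2 per fieldGroup."""
    sums = defaultdict(lambda: [0.0, 0])  # (window, group) -> [sum, count]
    for row in rows:
        if row["field1"] == 100 and row["field2"] > 100:
            window = int(row["ts"] // window_seconds)
            acc = sums[(window, row["fieldGroup"])]
            acc[0] += row["field2"]
            acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


ROWS = [
    {"ts": 10, "fieldGroup": "a", "field1": 100, "field2": 150},
    {"ts": 20, "fieldGroup": "a", "field1": 100, "field2": 250},
    {"ts": 30, "fieldGroup": "b", "field1": 100, "field2": 50},    # fails field2 > 100
    {"ts": 310, "fieldGroup": "a", "field1": 100, "field2": 300},  # next 5-minute window
]

result = windowed_avg(ROWS, window_seconds=300)
print(result)
```

Each key in the result pairs a window number with a group, which is why the same group can appear once per window: every window is treated as its own small batch.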
Joining batch and streaming
SELECT * FROM demo.temporal WITH WINDOW 10 secs INNER JOIN demo.users ON users.name = temporal.name;
The query breaks down into a streaming part, a batch part, and the join:
  SELECT * FROM demo.temporal WITH WINDOW 10 secs
  SELECT * FROM demo.users
  INNER JOIN ON users.name = temporal.name
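A sketch of the join step, under the assumption that one window of streamed rows has already been collected and that the batch table fits in memory. The function name, the sample rows, and the `users.` / `temporal.` column prefixes are illustrative choices, not Crossdata output.

```python
def join_window_with_table(window_rows, users):
    """Inner-join one window of streamed rows with a batch table on name."""
    users_by_name = {}
    for user in users:
        users_by_name.setdefault(user["name"], []).append(user)

    joined = []
    for event in window_rows:
        # Events whose name has no matching user are dropped (inner join).
        for user in users_by_name.get(event["name"], []):
            merged = {f"users.{k}": v for k, v in user.items()}
            merged.update({f"temporal.{k}": v for k, v in event.items()})
            joined.append(merged)
    return joined


USERS = [
    {"name": "ana", "email": "ana@example.com"},
    {"name": "luis", "email": "luis@example.com"},
]
WINDOW = [
    {"name": "ana", "clicks": 3},
    {"name": "eve", "clicks": 1},  # no matching user
]

joined = join_window_with_table(WINDOW, USERS)
print(joined)
```

Indexing the batch side by the join key first means each streamed row costs one dictionary lookup, which matters when a new window arrives every few seconds.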
Full text search
Full text search with
o Clients request the ability to perform full-text searches
o We have developed an integration between Lucene and Cassandra
o C* users can now enjoy all Lucene features:
  • Full-text searches, range queries, fuzzy queries, …
Stratio Lucene 2i
https://github.com/Stratio/stratio-cassandra
[Figure: C* nodes, each paired with its own Lucene index]
Full text search queries
o With Crossdata, we simplify:
  • The creation syntax
  • The query syntax, using the match operator
CREATE FULLTEXT INDEX ON app.users(name,email);
SELECT * FROM app.users WHERE email MATCH '*@stratio.com';
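The effect of the wildcard pattern can be imitated in a few lines. This sketch only approximates the pattern semantics with Python's shell-style matching; it scans every row instead of using an index, so it says nothing about how the Lucene-backed implementation works. The sample rows and the `match` helper are hypothetical.

```python
from fnmatch import fnmatchcase

USERS = [
    {"name": "ana", "email": "ana@stratio.com"},
    {"name": "bob", "email": "bob@example.org"},
]


def match(rows, column, pattern):
    """Return rows whose column matches a shell-style wildcard pattern."""
    return [row for row in rows if fnmatchcase(row[column], pattern)]


hits = match(USERS, "email", "*@stratio.com")
print([row["name"] for row in hits])
```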
Spark & Stratio Crossdata
Why Spark?
o Stratio Crossdata uses Spark to perform non-natively supported operations
o Spark brings several benefits over Hadoop
  • In-memory processing
  • RDD abstraction
  • Simpler API
  • Increased flexibility (e.g., no need for identity mapping)
What about Spark SQL?
o Different approach to query execution
  • We only use Spark when it speeds up queries
    - Native drivers are faster for simple queries
    - Spark SQL has limited RDD sources
  • Avoid some Spark limitations
  • Several batch and streaming contexts in a single JVM (SPARK-2243)
Query approach
[Figure: the SparkSQL approach (SparkSQL, Spark, Cassandra) next to the Crossdata approach (Stratio Crossdata, Spark and native driver, Cassandra)]
Our Cassandra-Spark integration
https://github.com/Stratio/stratio-deep
o Project started in June 2013
  - With the objective of providing a method to interact with Cassandra from Spark
  - Initial approach based on the HadoopInputFormat interface
  - Current version uses the native Datastax Java driver
Our Cassandra-Spark integration
o Benchmark in progress comparing our solution with the Datastax Spark driver
  • Results highly influenced by the split size
  • Initial results are promising for Stratio Spark integration using Datastax default values
  • Group by: up to 40% faster
  • Join: up to 17% faster
  • Stay tuned for the benchmark publication!
Spark vs Lucene 2i
[Figure: time against records returned, for Spark and Lucene 2i]
ODBC
Stratio Crossdata ODBC
o Well-known interface standard (for BI tools, external apps, …)
o We have implemented it using the Simba SDK
o ODBC opens the full potential of Stratio Crossdata to the external world
o Currently tested with Tableau, QlikView and MS Excel
One ODBC for all datastores!
The future
o Security
o Query optimizer and smart query planner
o Leverage system statistics
o Support for UDFs
o Become an Apache project
https://github.com/Stratio/stratio-meta
We are looking for an Apache Champion
Can you help us?
A wish list for Cassandra
o Ability to stop running queries
o Interactive users are unpredictable
o Some exception paths are not clear or defined (e.g., secondary indexes)
o Distribute some of the operations currently performed on the coordinator
  • E.g., aggregations like count(*)