Stratio big data spain

AN EFFICIENT DATA MINING SOLUTION

Hadoop?

Cassandra?

Spark?

Stratio Deep

An efficient data mining solution

“Two and two are four?

Sometimes… Sometimes they are five.”

G. Orwell

#StratioBD

Goals

• Why do you need Cassandra?

• What is the problem?

• Why do you need Spark?

• How do they work together?

Cassandra

• Based on Amazon’s Dynamo…

• Replication, Key/Value, P2P

• And based on Bigtable…

• Column-oriented

ROBUST, FAST, EFFICIENT

NO BOTTLENECK, REPLICATED, DECENTRALIZED

Another Database?

Why?

One user – Lots of data

Case A

Many users – Little data

Case B

Many users – Lots of data

Case C

Crawler app

Cassandra, I choose you

100M indexed pages

3k reads

Query time

< 1s

But…

Marketing walks in

New query

“I need to find all the references to the domain ACME.

I need the answer by Friday.”

Problem

Cassandra is not well suited to resolving this type of query

You need to design the schema with the query in mind
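
A minimal sketch of why the new query hurts, assuming a hypothetical in-memory stand-in for a table keyed by page URL. The names `pages`, `find_page`, `find_references` and `build_domain_index` are illustrative only; they are not part of Cassandra or of the crawler app. A lookup by key touches one entry, while finding every reference to a domain has to scan all of them unless a second, query-specific table was designed up front.

```python
from urllib.parse import urlparse

# Hypothetical stand-in for a table keyed by page URL (the partition key).
# Each value is the list of outgoing links found on that page.
pages = {
    "http://a.example/1": ["http://acme.example/x", "http://b.example/"],
    "http://b.example/2": ["http://c.example/"],
    "http://c.example/3": ["http://acme.example/y"],
}


def find_page(url):
    """The query the schema was designed for: a single lookup by key."""
    return pages.get(url)


def find_references(domain):
    """The query the schema was NOT designed for: it scans every row."""
    return sorted(
        url
        for url, links in pages.items()
        if any(urlparse(link).hostname == domain for link in links)
    )


def build_domain_index():
    """Query-first design: a second table keyed by the referenced domain."""
    index = {}
    for url, links in pages.items():
        for link in links:
            index.setdefault(urlparse(link).hostname, set()).add(url)
    return index
```

With the extra index, "all references to ACME" becomes a key lookup again, at the cost of maintaining a second table on every write.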

Challenge accepted

What options do we have?

• Run Hive queries on top of C*

• Write an ETL script and load data into another DB

• Clone the cluster

And now… what can we do?

“We can't solve problems by using the same kind of thinking we used when we created them”

Albert Einstein

Spark

• Alternative to MapReduce

• A low-latency cluster computing system

• For very large datasets

• Created by UC Berkeley AMPLab in 2010

• May be up to 100 times faster than MapReduce for:

Iterative algorithms. Interactive data mining

Logistic regression in Spark vs Hadoop

SOURCE | http://spark.incubator.apache.org/

WHO USES SPARK?

Spark and Cassandra

Integration points

Cassandra’s HDFS abstraction layer

Advantages:

• Easily integrates with legacy systems.

Drawbacks:

• Very high-level: no access to Cassandra’s low-level features.

• Questionable performance.

INTEGRATION POINTS: HDFS OVER CASSANDRA

Cassandra’s Hadoop Interface

• Thrift protocol

• CQL3 (our implementation)

Uses Cassandra’s new CqlPagingInputFormat

INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE

• Supports CQL3 features

• Respects data locality

• Good compromise between performance and implementation complexity

CQL3 Integration

INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3
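
What "respects data locality" means can be sketched as a toy scheduler. This is only an illustration under stated assumptions: the `splits` and `workers` data and the `assign` function are hypothetical and do not reproduce how Spark or Cassandra’s Hadoop interface actually schedule work. Each input split knows which nodes hold a replica of its data, and the scheduler prefers a worker running on one of those nodes so the rows do not have to cross the network.

```python
# Hypothetical input splits: each one knows which nodes hold a replica of its data.
splits = [
    {"id": 0, "replicas": ["node1", "node2"]},
    {"id": 1, "replicas": ["node2", "node3"]},
    {"id": 2, "replicas": ["node3", "node1"]},
    {"id": 3, "replicas": ["node4"]},  # no worker runs on node4
]

# Hypothetical nodes that run a worker process.
workers = ["node1", "node2", "node3"]


def assign(splits, workers):
    """Prefer the least-loaded worker holding a replica of the split.

    If no worker is co-located with the data, fall back to the
    least-loaded worker overall (a remote read).
    """
    load = {worker: 0 for worker in workers}
    plan = {}
    for split in splits:
        local = [node for node in split["replicas"] if node in load]
        candidates = local or workers
        chosen = min(candidates, key=lambda worker: load[worker])
        load[chosen] += 1
        plan[split["id"]] = chosen
    return plan


plan = assign(splits, workers)
```

In this toy run, splits 0 to 2 are each processed on a node that already stores their data; only split 3 needs a remote read.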

CQL3 Integration (II)

Provides a Java friendly API:

• Developers map Column Families to custom serializable POJOs

• StratioDeep hides the complexity of running Spark computations directly over the user-provided POJOs.
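
The idea of mapping rows to user-defined objects and computing over them can be sketched as follows. This is plain Python, not the StratioDeep or Spark API: the `Page` class, the `rows` data, `to_entity` and `count_by_domain` are all hypothetical, and a frozen dataclass stands in for a serializable POJO.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    """Stand-in for a serializable POJO mapped to one row of a column family."""
    url: str
    domain: str
    num_links: int


# Hypothetical rows, one dict per row, with column names as keys.
rows = [
    {"url": "http://acme.example/a", "domain": "acme.example", "num_links": 3},
    {"url": "http://acme.example/b", "domain": "acme.example", "num_links": 5},
    {"url": "http://other.example/", "domain": "other.example", "num_links": 1},
]


def to_entity(row):
    """Map one row to a typed object; column names must match field names."""
    return Page(**row)


def count_by_domain(entities):
    """Map each entity to its domain, then reduce by counting."""
    return Counter(page.domain for page in entities)


entities = [to_entity(row) for row in rows]
counts = count_by_domain(entities)
```

The user code only ever sees `Page` objects; how the rows are fetched and distributed stays behind the mapping step.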

INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3

Demo

Drawbacks:

• Still not performing as well as we’d like

Uses Cassandra’s Hadoop Interface

• No analyst-friendly interface:

No SQL-like query features

CQL3 Integration (III)

INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3

Bring the integration to another level:

• Dump Cassandra’s Hadoop Interface

• Direct access to Cassandra’s SSTable files.

• Extend Cassandra’s CQL3 to make use of Spark’s distributed data processing power

Future extensions

What are we currently working on?

Conclusion

THANKS
