alvaro-agea-herradon
AN EFFICIENT DATA MINING SOLUTION
Hadoop?
Cassandra?
Spark?
Stratio Deep
An efficient data mining solution
“Two and two are four?
Sometimes… Sometimes they are five.”
G. Orwell
#StratioBD
Goals
• Why do you need Cassandra?
• What is the problem?
• Why do you need Spark?
• How do they work together?
Cassandra
• Based on Amazon's Dynamo…
• Replication, Key/Value, P2P
• And based on Big Table…
• Column oriented
ROBUST, FAST, EFFICIENT
NO BOTTLENECK, REPLICATED, DECENTRALIZED
Another Database?
Why?
One user – Lots of data
Case A
Many users – Little data
Case B
Many users – Lots of data
Case C
Crawler app
Cassandra, I choose you
100M indexed pages
3k reads
Query time
< 1s
But…
Marketing walks in
New query
“I need to find all the references to the domain ACME.
I need the answer by Friday.”
Problem
Cassandra is not well suited to answering this type of query
You need to design the schema with the query in mind
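A minimal sketch of why the new query hurts, in plain Java rather than Cassandra's API. The class and method names (PageStore, linksOf, pagesReferencing) are hypothetical, and an in-memory map stands in for a table keyed by page URL. Looking up one page by its key is a single access; finding every page that references a domain has no key to use, so every row has to be visited.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a table keyed by page URL; not Cassandra's API.
final class PageStore {
    // Key (page URL) -> outgoing links found on that page.
    private final Map<String, List<String>> linksByUrl = new HashMap<>();

    void put(String url, List<String> outgoingLinks) {
        linksByUrl.put(url, new ArrayList<>(outgoingLinks));
    }

    // The query the schema was designed for: one key, one lookup.
    List<String> linksOf(String url) {
        List<String> links = linksByUrl.get(url);
        return links == null ? Collections.<String>emptyList() : links;
    }

    // The query marketing asked for: no key to look up, so every row is scanned.
    List<String> pagesReferencing(String domain) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<String>> row : linksByUrl.entrySet()) {
            for (String link : row.getValue()) {
                if (link.contains(domain)) {
                    result.add(row.getKey());
                    break; // one matching link is enough for this page
                }
            }
        }
        Collections.sort(result); // stable order for display
        return result;
    }
}
```

With 100M indexed pages the second method touches all 100M rows, which is the kind of work a batch or cluster computing engine is meant for.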
Challenge accepted
What options do we have?
• Run Hive Query on top of C*
• Write an ETL script and load data into another DB
• Clone the cluster
And now… what can we do?
“We can't solve problems by using the same kind
of thinking we used when we created them”
Albert Einstein
Spark
• Alternative to MapReduce
• A low-latency cluster computing system
• For very large datasets
• Created by UC Berkeley AMP Lab in 2010
• May be up to 100 times faster than MapReduce for:
Iterative algorithms. Interactive data mining
Logistic regression in Spark vs Hadoop
SOURCE | http://spark.incubator.apache.org/
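A small sketch of why logistic regression is the usual benchmark here, written in plain Java and not using Spark's API. The class name and parameters are assumptions made for illustration. Each gradient-descent iteration makes a full pass over the same data points, so an engine that keeps the dataset in memory between passes avoids re-reading it on every iteration.

```java
// Plain-Java sketch (not Spark's API): every iteration re-reads the same points.
final class IterativeLogisticRegression {
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // xs: one feature per point, ys: labels in {0, 1}. Returns {weight, bias}.
    static double[] train(double[] xs, int[] ys, int iterations, double learningRate) {
        double w = 0.0;
        double b = 0.0;
        for (int it = 0; it < iterations; it++) {
            double gradW = 0.0;
            double gradB = 0.0;
            for (int i = 0; i < xs.length; i++) { // full pass over the dataset
                double error = sigmoid(w * xs[i] + b) - ys[i];
                gradW += error * xs[i];
                gradB += error;
            }
            w -= learningRate * gradW / xs.length;
            b -= learningRate * gradB / xs.length;
        }
        return new double[] {w, b};
    }

    // Classifies x as 1 when the predicted probability is at least 0.5.
    static int predict(double[] model, double x) {
        return sigmoid(model[0] * x + model[1]) >= 0.5 ? 1 : 0;
    }
}
```

The inner loop is the part that would be distributed across a cluster; the outer loop is what makes the algorithm iterative.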
WHO USES SPARK?
Spark and Cassandra
Integration points
Cassandra’s HDFS abstraction layer
Advantages:
• Easily integrates with legacy systems.
Drawbacks:
• Very high-level: no access to Cassandra’s low-level features.
• Questionable performance.
INTEGRATION POINTS: HDFS OVER CASSANDRA
Cassandra’s Hadoop Interface
• Thrift protocol
• CQL3 (our implementation)
Uses Cassandra’s new CqlPagingInputFormat
• Supports CQL3 features
• Respects data locality
• Good compromise between
performance / implementation complexity
CQL3 Integration
INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3
CQL3 Integration (II)
Provides a Java friendly API:
• Developers map Column Families to custom serializable POJOs
• StratioDeep wraps the complexity of performing Spark calculations
directly over the user provided POJOs.
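A hedged illustration of the POJO idea, in plain Java only. It does not use the Stratio Deep API; the class PageRow, its fields, and the helper PageRowStats are hypothetical names chosen to match the crawler example. It shows the shape of a serializable object standing in for one row of a column family, and a computation written against those objects rather than against raw columns.

```java
import java.io.Serializable;
import java.util.List;

// Hypothetical POJO for one row of a crawler column family; field names are assumptions.
final class PageRow implements Serializable {
    private static final long serialVersionUID = 1L;

    private String url;
    private String domain;

    PageRow() { }

    PageRow(String url, String domain) {
        this.url = url;
        this.domain = domain;
    }

    String getUrl() { return url; }
    void setUrl(String url) { this.url = url; }

    String getDomain() { return domain; }
    void setDomain(String domain) { this.domain = domain; }
}

// Local stand-in for a distributed calculation over the mapped objects.
final class PageRowStats {
    static long countByDomain(List<PageRow> rows, String domain) {
        return rows.stream()
                .filter(row -> domain.equals(row.getDomain()))
                .count();
    }
}
```

Implementing Serializable matters because objects have to be shipped between the nodes of a cluster; the filter-and-count step is the part a framework would run in parallel.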
Demo
Drawbacks:
• Still not performing as well as we’d like
Uses Cassandra’s Hadoop Interface
• No analyst-friendly interface:
No SQL-like query features
CQL3 Integration (III)
Bring the integration to another level:
• Dump Cassandra’s Hadoop Interface
• Direct access to Cassandra’s SSTable files.
• Extend Cassandra’s CQL3 to make use of Spark’s distributed
data processing power
Future extensions
What are we currently working on?
Conclusion
THANKS