56
We know stuff. And get people excited about it.* *Even German engineers Rachel Pedreschi @RachelPedreschi Patrick McFadin @PatrickMcFadin A Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise 1

Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

Embed Size (px)

Citation preview

Page 1: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

We know stuff. And get people excited about it.*

*Even German engineers

Rachel Pedreschi @RachelPedreschi Patrick McFadin @PatrickMcFadin

A Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise

1

Page 2: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

Not (all queries preplanned. Very Fast)

Very(Ask anything, anytime. Slowest)

Adhociness

C* Search Analytics

Page 3: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

What is Search?

Page 4: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)
Page 5: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

The bright blue butterfly hangs on the breeze.

[the] [bright] [blue] [butterfly] [hangs] [on] [the] [breeze]

Tokens

Page 6: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

Credit: https://developer.apple.com/library/mac/documentation/userexperience/conceptual/SearchKitConcepts/searchKit_basics/searchKit_basics.html

Terms

Page 7: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

It can be lonely for Solr

Page 8: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

+ =

Page 9: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

Cassandra

✓ Highly available

✓ Linear scalability

✓ Low latency OLTP queries C*

Page 10: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

Like… High Availability

Page 11: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

Data PartitioningApplication

Data Center 1

hash(key) => token(43)

80

10

3050

70

60

40

20

Page 12: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

Replication

Data Center 1

hash(key) => token(43)

replication factor = 3

80

10

3050

70

60

40

20

Application

Page 13: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

Multi-Data Center Replication

Data Center 1

hash(key) => token(43)

replication factor = 3

80

10

3050

70

60

40

20

Data Center 2

replication factor = 3

81

11

3151

71

61

41

21

Application

Page 14: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

How does DSE integrate Solr?

C* C*/Solr

Transactional Search

Page 15: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)
Page 16: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

SELECT * FROM killrvideo.videos WHERE solr_query='{ "q": "{!edismax qf=\"name^2 tags^1 description\”}datastax" }';

SELECT id, value FROM keyspace.table WHERE token(id) >= -3074457345618258601 AND token(id) <= 3074457345618258603 AND solr_query='id:*'

Page 17: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

Vocab

Cassandra term Solr term

Column Family / Table Core

Row Document

Column Field

SSTable Index

Page 18: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

… … … …

… … … …

… … … …

… … … …

… … … …

… … … …

… … … …

… … … …

… … … …

… … … …

… … … …

… … … …

… … … …

… … … …

18

Node memory

Node file system

Client

partition key1 first:Oscar last:Orange level:42

partition key2 first:Ricky last:Red

Memtable (corresponds to a CQL table)

Coordinator

CommitLog

Append O

nly

… … … …

… … … …

… … … …

SSTables

Flush current state to SSTable

Compact relatedSSTables

Write <3, Betty, Blue, 63>

Acknowledge

partition key3 first:Betty last:Blue level:63

Compaction

Each write request …

Periodically …

Periodically …

Page 19: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

… … … …

… … … …

… … … …

… … … …

… … … …

… … … …

… … … …

… … … …

… … … …

… … … …

… … … …

… … … …

… … … …

… … … …

19

Node memory

Node file system

Client1 best 1

2 bright 2,3

Ram Buffer

Coordinator

… … … …

… … … …

… … … …

Segments

Flushes current state to Segment (Softcommit)

Write <1,blue, 2,3>

3 blue 2,3

Merge (STW)

Each write request …

Periodically …

On C* Memtable Flush, In memory segments hard commit to disk

Shard Router

… … … …

… … … …

… … … …

… … … …

… … … …

… … … …

… … … …

… … … …

… … … …

Page 20: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

… … … …

… … … …

… … … …

… … … …

… … … …

… … … …

… … … …

… … … …

… … … …

20

Node memory

Node file system

1 best 1

2 bright 2,3

Ram Buffer

… … … …

… … … …

… … … …

Segments

3 blue 2,3

Not Searchable

Searchable

Coordinator

Shard Router

… … … …

… … … …

… … … …

… … … …

… … … …

… … … …

… … … …

… … … …

… … … …

Page 21: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

And… Scalability

Page 22: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

80

10

3050

70

60

40

20Data Center 1

Application

Page 23: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

Data Center 1

80

10

30

50

70

60

40

20

Application

Page 24: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

Data Center 1

80

8

32

56

72

64

48

16

24

4040

24

Application

Page 25: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

Even… Improved Performance

Page 26: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

Standard Solr Indexing

DSE Search Live Indexing

Page 27: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

Spark and Cassandra

Page 28: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

Great combo

Store a ton of data Analyze a ton of data

Page 29: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

Great combo

Spark Streaming

Near Real-time

SparkSQL

Structured Data

MLLib

Machine Learning

GraphX

Graph Analysis

Page 30: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

Spark Streaming

Near Real-time

SparkSQL

Structured Data

MLLib

Machine Learning

GraphX

Graph Analysis

Great combo

CREATE TABLE raw_weather_data ( wsid text, year int, month int, day int, hour int, temperature double, dewpoint double, pressure double, wind_direction int, wind_speed double, sky_condition int, sky_condition_text text, one_hour_precip double, six_hour_precip double, PRIMARY KEY ((wsid), year, month, day, hour)) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);

Spark Connector

Page 31: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

How does it work? OSS Stack

Executer

Master

Worker

Executer

Executer

Server

Page 32: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

How does it work? OSS Stack

MasterWorker

0-24

Token Ranges 0-100

25-49

50-74

75-99

I will only analyze 25% of the data.

Worker Worker

Worker

Page 33: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

Master

0-24

25-49

50-74

75-99

AnalyticsTransactional

Worker

WorkerWorker

Worker

0-24

75-99 25-49

50-74

Page 34: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

75-99

SELECT * FROM keyspace.table WHERE token(pk) > 75 AND token(pk) <= 99

Spark RDD

Spark Partition

Spark Partition

Spark Partition

Spark Connector

Executer

Executer

Executer

Worker

Master

Page 35: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

Spark RDD

Spark Partition

Spark Partition

Spark Partition

Master

Worker

Executer

Executer

Executer

75-99

Page 36: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

Spark ConnectorCassandra Cassandra +

Spark

Joins and Unions No Yes

Transformations Limited Yes

Outside Data Integration No Yes

Aggregations Limited Yes

Page 37: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

Type mappingCQL Type Scala Typeascii Stringbigint Longboolean Booleancounter Longdecimal BigDecimal, java.math.BigDecimaldouble Doublefloat Floatinet java.net.InetAddressint Intlist Vector, List, Iterable, Seq, IndexedSeq, java.util.Listmap Map, TreeMap, java.util.HashMapset Set, TreeSet, java.util.HashSettext, varchar Stringtimestamp Long, java.util.Date, java.sql.Date, org.joda.time.DateTimetimeuuid java.util.UUIDuuid java.util.UUIDvarint BigInt, java.math.BigInteger*nullable values Option

Page 38: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)
Page 39: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

Confidential

Let’s go code diving!

Page 40: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

Behind the scenes…// Videos by id CREATE TABLE videos ( videoid uuid, userid uuid, name text, description text, location text, location_type int, preview_image_location text, tags set<text>, added_date timestamp, PRIMARY KEY (videoid) );

// Index for tag keywords CREATE TABLE videos_by_tag ( tag text, videoid uuid, added_date timestamp, userid uuid, name text, preview_image_location text, tagged_date timestamp, PRIMARY KEY (tag, videoid) );

Not a great idea

Possible Index

Page 41: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

// Videos by id CREATE TABLE videos ( videoid uuid, userid uuid, name text, description text, location text, location_type int, preview_image_location text, tags set<text>, added_date timestamp, PRIMARY KEY (videoid)

And th

is?

This?

This?

Page 42: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)
Page 43: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

1) Spin up a new C* Cluster with search enabled using the DSE installer.

$ sudo service dse cassandra -s

2) Run your schema DDL to create the C* keyspace and tables.3) Run dse_tool on the videos table

$ dsetool create_core killrvideo.videos generateResources=true

4) Use the Solr Admin to check sanity and make sure you have a core.5) Write a CQL query with a Solr Search in it.

SELECT * FROM killrvideo.videos WHERE solr_query='{ "q": "{!edismax qf=\"name^2 tags^1 description\”}?" }';

Page 44: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

Search all of the things in 5 easy steps…

Page 45: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

Now you get this!

SELECT name FROM videos WHERE solr_query = 'tags:crime*';

Page 46: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

Attaching to Spark and Cassandra

// Import Cassandra-specific functions on SparkContext and RDD objectsimport org.apache.spark.{SparkContext, SparkConf}import com.datastax.spark.connector._

/** The setMaster("local") lets us run & test the job right in our IDE */val conf = new SparkConf(true) .set("spark.cassandra.connection.host", "127.0.0.1") .setMaster(“local[*]") .setAppName(getClass.getName) // Optionally .set("cassandra.username", "cassandra") .set("cassandra.password", “cassandra") val sc = new SparkContext(conf)

Page 47: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

Comment table example

CREATE TABLE comments_by_video ( videoid uuid, commentid timeuuid, userid uuid, comment text, PRIMARY KEY (videoid, commentid) ) WITH CLUSTERING ORDER BY (commentid DESC);

Page 48: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

Simple example

/** keyspace & table */val tableRDD = sc.cassandraTable("killrvideo", “comments_by_video”) /** get a simple count of all the rows in the raw_weather_data table */val rowCount = tableRDD.count()println(s"Total Rows in Comments Table: $rowCount") sc.stop()

Page 49: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

Simple example/** keyspace & table */val tableRDD = sc.cassandraTable("killrvideo", “comments_by_video”) /** get a simple count of all the rows in the comments_by_video table */val rowCount = tableRDD.count()println(s"Total Rows in Comments Table: $rowCount") sc.stop()

Executer

SELECT * FROM killrvideo.comments_by_video

Spark RDD

Spark Partition

Spark Connector

Page 50: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

Using CQL

SELECT useridFROM comments_by_videoWHERE videoid = '01860584-de45-018f-12be-5f81704e8033'

val cqlRRD = sc.cassandraTable("killrvideo", “comments_by_video”) .select("userid") .where("videoid = ?”, “01860584-de45-018f-12be-5f81704e8033")

Page 51: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

qAAAaqq

spark-sql> SELECT cast(videoid as String) videoid, count(*) c FROM comments_by_video GROUP BY cast(videoid as String) ORDER BY c DESC limit 10;

Page 52: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

Saving back to Cassandra

// Create insert dataval collection = sc.parallelize(Seq(("01860584-de45-018f-12be-5f81704e8033", "Great video", "cdaf6bd5-8914-29e0-f0b6-8b0bc6156777"), ("01860584-de45-018f-12be-5f81704e8033", "Hated it", "cdaf6bd5-8914-29e0-f0b6-8b0bc6156777"))) // Insert data into tablecollection.saveToCassandra("killrvideo", "comments_by_video", SomeColumns("videoid", "comment", "userid"))

Page 53: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

val solrQueryRDD = sc.cassandraTable("killrvideo", “videos") .select("name").where("solr_query='tags:crime*'") solrQueryRDD.collect().map(row => println(row.getString("name")))

Page 54: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

C*

C*/Solr C*/Spark

OLTP

Search Analytics

Page 55: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

Confidential

Resourceswww.killrvideo.com

https://github.com/LukeTillman/killrvideo-csharp

www.datastax.com

Page 56: Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Enterprise (English)

Thank You