C* Summit EU 2013: Top-K Queries in Realtime with Cassandra and Intravert

#CassandraEU

Top-k queries in real-time with Cassandra and Intravert

Jonathan Halliday, [email protected]

Rui Vieira, Newcastle [email protected]


What is Top-k?

Top-k queries

• Rank matching results for the term(s)
  – We don't really care about the scoring algorithm

• Application: text search
  – Documents containing the search words

• Application: log analysis
  – Popular URLs in the time period

Yawn?

• SELECT document_id, score
  FROM data
  WHERE term = 'top-k'
  ORDER BY score DESC, document_id
  LIMIT 100

• Lunch time!

Not so fast...

• SELECT document_id, SUM(score) AS score
  FROM data
  WHERE term IN ('top-k', 'algorithm')
  GROUP BY document_id
  ORDER BY score DESC, document_id
  LIMIT 100

Distributed Top-k

• We have a lot of data

• It's spread out

• We need to combine a subset efficiently

• Map/Reduce to the rescue!
  – HiveQL, Stinger, Impala, Hawq

• Easy! But not fast

'real-time'

• Web pages, not control systems

• Performance, not Timeliness

• Pre-compute as much as possible
  – scores for each term

• Assemble pre-computed fragments at query time
  – 'group by'

Naive method

foreach (term in searchTerms) {
    SELECT ... FROM ... WHERE ...
}

• Handle group by in the application code

• Inefficient – transfers ALL the data for each term, even low scores
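
A minimal Python sketch of the naive method, assuming per-document scores are combined by summing (the slides leave the scoring open). `INDEX`, `fetch_all` and `naive_top_k` are hypothetical names, and the dictionary only stands in for the per-term SELECT results; nothing here talks to Cassandra.

```python
from collections import defaultdict

# Hypothetical stand-in for the stored rows: term -> [(doc_id, score), ...]
INDEX = {
    "top-k": [("d1", 9), ("d2", 5), ("d3", 1)],
    "algorithm": [("d2", 8), ("d3", 7), ("d4", 2)],
}


def fetch_all(term):
    """Stands in for one SELECT per term; returns every row, long tail included."""
    return INDEX.get(term, [])


def naive_top_k(search_terms, k):
    """Fetch everything for each term, then group, sort and cut in the client."""
    totals = defaultdict(int)
    rows_transferred = 0
    for term in search_terms:
        rows = fetch_all(term)
        rows_transferred += len(rows)
        for doc_id, score in rows:
            totals[doc_id] += score  # the 'group by', done in application code
    # Same ordering as the queries above: score descending, then document id
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k], rows_transferred
```

With this toy data, `naive_top_k(["top-k", "algorithm"], 2)` transfers all six rows in order to return two documents, which is the inefficiency the slide points at.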

How much data is enough?

• Data is stored keyed (i.e. sorted) by { term, score DESC, doc_id } or { time_period, score DESC, Url }

• Partition keys IN the query params
  – We can filter efficiently

• Can we range limit on score?
  – Avoid going into the long tail
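
To illustrate why the sort order matters, here is a small Python sketch in which a list sorted by score descending stands in for one partition. `PARTITIONS`, `head` and `range_scan` are hypothetical names chosen for this example, not part of any Cassandra driver.

```python
from itertools import takewhile

# Hypothetical partitions: rows for one term, already sorted by (score DESC, doc_id)
PARTITIONS = {
    "top-k": [(9, "d1"), (5, "d2"), (1, "d3")],
    "algorithm": [(8, "d2"), (7, "d3"), (2, "d4")],
}


def head(term, limit):
    """First `limit` rows of the partition, i.e. its highest scores."""
    return PARTITIONS[term][:limit]


def range_scan(term, min_score):
    """Rows with score >= min_score.

    Because the rows are sorted by score descending, the scan can stop at the
    first row below the bound and never reads the long tail.
    """
    return list(takewhile(lambda row: row[0] >= min_score, PARTITIONS[term]))
```

Both operations read only a prefix of the partition, which is what lets a threshold-based algorithm ask for "everything scoring at least T" cheaply.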

Bring on the clever algorithms

• Smart People thought about this problem already...

• ...but not in quite the same context
  – WAN distributed logs from CDNs

• Identify, adapt and reuse existing solutions
  – faster and less risky than starting over

Inside a clever algorithm

• Fetch a little bit of data

• Look at it, decide how much more we need

• Fetch some more

• Rinse and repeat
  – but not too many times.

Desirable Characteristics

• Fixed number of communication rounds is key

• Generality is good
  – Cope with any distribution of data

• So is flexibility
  – Tune for different use cases

Meet the candidates

Three-Phase Uniform Threshold (TPUT)
  'Efficient Top-K Query Calculation in Distributed Networks', Stanford/Princeton, 2004

Hybrid Threshold
  'Efficient Processing of Distributed Top-k Queries', UCSB, 2005

KLEE
  'KLEE: a framework for distributed top-k query algorithms', Max-Planck Institute, 2005
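
As a rough illustration of the first candidate, the sketch below follows the three phases of TPUT as it is commonly described: fetch each peer's local top k, fetch everything above a uniform threshold, then look up exact scores for the surviving candidates. It assumes non-negative scores combined by summation and at least one peer, and it uses in-memory dictionaries in place of peers. The function name `tput` and the data layout are choices made for this example; this is not the speakers' implementation.

```python
import heapq
from collections import defaultdict


def tput(peers, k):
    """peers: non-empty list of {item: non-negative score}, one dict per peer."""
    m = len(peers)

    def partial_sums(seen):
        sums = defaultdict(int)
        for s in seen:
            for item, score in s.items():
                sums[item] += score
        return sums

    def kth_largest(sums):
        top = heapq.nlargest(k, sums.values())
        return top[-1] if len(top) == k else 0

    # Phase 1: each peer sends its local top k.
    # The k-th largest partial sum is a lower bound on the true k-th total.
    seen = [dict(heapq.nlargest(k, p.items(), key=lambda kv: kv[1])) for p in peers]
    threshold = kth_largest(partial_sums(seen)) / m

    # Phase 2: each peer sends every item scoring at least the uniform threshold.
    for p, s in zip(peers, seen):
        s.update({item: score for item, score in p.items() if score >= threshold})
    partial = partial_sums(seen)
    tau2 = kth_largest(partial)
    # A score a peer did not send is below the threshold, so the threshold
    # bounds it from above. Drop items whose upper bound cannot reach tau2.
    candidates = [
        item for item in partial
        if partial[item] + threshold * sum(item not in s for s in seen) >= tau2
    ]

    # Phase 3: fetch exact scores, for the remaining candidates only.
    exact = {item: sum(p.get(item, 0) for p in peers) for item in candidates}
    return sorted(exact.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
```

The number of rounds is fixed at three regardless of the data, which matches the "fixed number of communication rounds" characteristic listed earlier.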

Implementation Issues

• Algorithms assume server side code execution

• Limitations of CQL3 add some round trips, increase network I/O

• Previous performance comparisons of algorithms may no longer be valid

Data Transfer vs. k

Execution Time vs. k

Execution Time vs. peers

YMMV

• Test with your own data

• Test with your own hardware

• Hybrid Threshold for exact top-k
  – Intravert optional

• KLEE for tunable approximate top-k
  – Inefficient without Intravert
  – Requires metadata

Intravert

• Cassandra++
  – Embed and extend the existing server
  – Based on Vert.x

• JSON over HTTP, REST API
  – yup, Virgil did that already

• Multiple commands per call, chain operations with REFs

Intravert

• Server side code execution
  – Groovy (for now – Vert.x is polyglot)

• Filter result sets

• Write path triggers
  – C* 2.0 has CASSANDRA-1311

• Run Groovy scripts on the server
  – Easier than extending the Thrift API

Intravert

• Good trade-off between power and operational complexity

• More complex development cycle
  – Not easy to move code between client and server

• Client not topology aware
  – 'run x on each node' not possible

Back to the clever algorithms

• Intravert server side execution enables cleaner, more efficient implementation

• Reduces network round trips

• Some dev and ops complexity increase

• Less complexity than custom server deployment
  – Reuse existing tools

Pre-aggregation

• For text search, can't predict common term sets

• For time periods, can predict contiguous periods

• Pre-calculate the rollups
  – Hours, days, weeks, months
  – Reduces number of terms (peers) to group at query time
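
The rollup idea can be sketched in Python as follows, assuming hourly hit counts per URL are the base data and only hour and day buckets exist. `rollup_days` and `cover` are hypothetical helper names; weeks and months would extend the same pattern.

```python
from collections import Counter, defaultdict

HOURS_PER_DAY = 24


def rollup_days(hourly):
    """hourly: {hour_index: Counter(url -> hits)}; returns {day_index: Counter}."""
    daily = defaultdict(Counter)
    for hour, counts in hourly.items():
        daily[hour // HOURS_PER_DAY].update(counts)  # Counter.update adds counts
    return dict(daily)


def cover(start_hour, end_hour):
    """Split [start_hour, end_hour) into pre-computed buckets.

    Whole days are used wherever the range is day-aligned and single hours
    fill the edges, so the query has fewer peers to combine.
    """
    buckets = []
    h = start_hour
    while h < end_hour:
        if h % HOURS_PER_DAY == 0 and h + HOURS_PER_DAY <= end_hour:
            buckets.append(("day", h // HOURS_PER_DAY))
            h += HOURS_PER_DAY
        else:
            buckets.append(("hour", h))
            h += 1
    return buckets
```

For example, `cover(22, 50)` spans 28 hours but needs only five buckets: two hours, one full day, and two more hours.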

Really clever algorithms

• Hierarchical node topology
  – Map to Cassandra ring: same node may own multiple keys (peers != nodes)

• Budget constrained approximate top-k
  – Get as close as possible with the allowable time and I/O constraints

• Fault tolerance
  – Approximation given available nodes

Questions?

Or email us:

Jonathan Halliday, [email protected]

Rui Vieira, Newcastle [email protected]


