53
Berlin Buzzwords· June 2010 Basho Technologies Rusty Klophaus - @rklophaus Riak Search A Full-Text Search and Indexing Engine based on Riak

Riak Search - Berlin Buzzwords 2010

Embed Size (px)

DESCRIPTION

Riak Search is a distributed data indexing and search platform built on top of Riak. The talk will introduce Riak Search, covering overall goals, architecture, and core functionality.

Citation preview

Page 1: Riak Search - Berlin Buzzwords 2010

Berlin Buzzwords· June 2010

Basho Technologies

Rusty Klophaus - @rklophaus

Riak SearchA Full-Text Search

and Indexing Engine

based on Riak

Page 2: Riak Search - Berlin Buzzwords 2010

Why did we build it?

What are the major goals?

How does it work?

2

Page 3: Riak Search - Berlin Buzzwords 2010

Part One

Why did we build

Riak Search?

3

Page 4: Riak Search - Berlin Buzzwords 2010

Riak is

a scalable, highly-available, networked,

open-source key/value store.

4

Page 5: Riak Search - Berlin Buzzwords 2010

Key/Value

CLIENT RIAK

5

Writing to a Key/Value Store

Page 6: Riak Search - Berlin Buzzwords 2010

Object

CLIENT RIAK

6

Writing to a Key/Value Store

Page 7: Riak Search - Berlin Buzzwords 2010

Key

Object

CLIENT RIAK

Querying a Key/Value Store

7

Page 8: Riak Search - Berlin Buzzwords 2010

Key + Instructions

Object(s)

CLIENT RIAK

Walk to Related

Keys

Querying Riak via LinkWalking

8

Page 9: Riak Search - Berlin Buzzwords 2010

Key(s) + JS Functions

Computed Value(s)

CLIENT RIAK

Map

Reduce

Map

Querying Riak via Map/Reduce

9

Page 10: Riak Search - Berlin Buzzwords 2010

Key/Value Stores

like

Key-Based Queries

10

Page 11: Riak Search - Berlin Buzzwords 2010

where Category == "Shoes"

CLIENT RIAK

WTF!? I'm aKV store!

Query by Secondary Index

11

Page 12: Riak Search - Berlin Buzzwords 2010

"Converse AND Shoes"

CLIENT RIAK

This is getting old.

Full-Text Query

12

Page 13: Riak Search - Berlin Buzzwords 2010

These kinds of queries

need an Index.

*Market Opportunity!*

13

Page 14: Riak Search - Berlin Buzzwords 2010

Part Two

What are the major

goals of Riak Search?

14

Page 15: Riak Search - Berlin Buzzwords 2010

Your Application

Riak

An application built on Riak.

15

Page 16: Riak Search - Berlin Buzzwords 2010

Your Application

RiakIndex

Object

Hrm... I need an index.

16

Page 17: Riak Search - Berlin Buzzwords 2010

Your Application

Riak???

Hrm... I need an index with more features.

17

Page 18: Riak Search - Berlin Buzzwords 2010

Your Application

RiakLucene

Lucene should do the trick...

18

Page 19: Riak Search - Berlin Buzzwords 2010

Your Application

Lucene Lucene Lucene Riak

...shard to add more storage capacity...

19

Page 20: Riak Search - Berlin Buzzwords 2010

Your Application

Lucene Lucene Lucene

Lucene Lucene Lucene

Lucene Lucene Lucene

Riak

...replicate to add more throughput.

20

Page 21: Riak Search - Berlin Buzzwords 2010

Your Application

Lucene Lucene Lucene

Lucene Lucene Lucene

Lucene Lucene Lucene

Riak

...replicate to add more throughput.

21

Operations nightmare!

Page 22: Riak Search - Berlin Buzzwords 2010

Your Application

Riak-ifiedLucene

Riak

What do we really want?

22

Page 23: Riak Search - Berlin Buzzwords 2010

Your Application

RiakSearch

Riak

What do we really want?

23

Page 24: Riak Search - Berlin Buzzwords 2010

Functionality? Be like Lucene (and more).

• Lucene Syntax

• Leverages Java Lucene Analyzers

• Solr Endpoints

• Integration via Riak Post-Commit Hook (Index)

• Integration via Riak Map/Reduce (Query)

• Near-Realtime

• Schema-less

24

Page 25: Riak Search - Berlin Buzzwords 2010

Operations? Be like Riak.

• No special nodes

• Add nodes, get more compute and storage

• Automatically load balance

• Replicas for durability and performance

• Index and query in parallel

• Swappable storage backends

25

Page 26: Riak Search - Berlin Buzzwords 2010

Part Three

How do we do it?

26

Page 27: Riak Search - Berlin Buzzwords 2010

A Gentle Introduction to

Document Indexing

27

Page 28: Riak Search - Berlin Buzzwords 2010

Every dog has his day.#1

day, 1

dog, 1

every, 1

has, 1

his, 1

Inverted IndexDocument

The Inverted Index

28

Page 29: Riak Search - Berlin Buzzwords 2010

The dog's bark is worse than his bite.

Every dog has his day.

Let the cat out of the bag.

It's raining cats and dogs.

#1

#2

#3

#4

Combined Inverted IndexDocuments

and, 4

bag, 3

bark, 2

bite, 2

cat, 3

cat, 4

day, 1

dog, 1

dog, 2

dog, 4

every, 1

has, 1

...

The Inverted Index

29

Page 30: Riak Search - Berlin Buzzwords 2010

"dog AND cat"

AND

dog cat

At Query Time...

30

Page 31: Riak Search - Berlin Buzzwords 2010

AND

dog cat

dog, 1

dog, 2

dog, 4

cat, 3

cat, 4

At Query Time...

31

Page 32: Riak Search - Berlin Buzzwords 2010

AND(Merge Intersection)

1

2

4

3

4

Result: 4

At Query Time...

32

Page 33: Riak Search - Berlin Buzzwords 2010

OR(Merge Union)

1

2

4

3

4

Result: 1, 2, 3, 4

At Query Time...

33

Page 34: Riak Search - Berlin Buzzwords 2010

Complex Behavior from Simple Structures

34

Page 35: Riak Search - Berlin Buzzwords 2010

Storage Approaches...

35

Page 36: Riak Search - Berlin Buzzwords 2010

Riak Search uses

Consistent Hashing

to store data on

Partitions

36

Page 37: Riak Search - Berlin Buzzwords 2010

Partitions = 10

Number of Nodes = 5

Partitions per Node = 2

Replicas (NVal) = 2

Introduction to Consistent Hashing and Partitions

37

Page 38: Riak Search - Berlin Buzzwords 2010

Object

Introduction to Consistent Hashing and Partitions

38

Page 39: Riak Search - Berlin Buzzwords 2010

Document Partitioning

vs.

Term Partitioning

39

Page 40: Riak Search - Berlin Buzzwords 2010

...and the

Resulting Tradeoffs

40

Page 41: Riak Search - Berlin Buzzwords 2010

Every dog has his day.#1

Document Partitioning @ Index Time

41

Page 42: Riak Search - Berlin Buzzwords 2010

"dog OR cat"

Document Partitioning @ Query Time

42

Page 43: Riak Search - Berlin Buzzwords 2010

Every dog has his day.#1

day, 1

dog, 1

every, 1

has, 1

his, 1

Term Partitioning @ Index Time

43

Page 44: Riak Search - Berlin Buzzwords 2010

day, 1 has, 1

every, 1his, 1

dog, 1

Term Partitioning @ Index Time

44

Page 45: Riak Search - Berlin Buzzwords 2010

"dog OR cat"

Term Partitioning @ Query Time

45

Page 46: Riak Search - Berlin Buzzwords 2010

Document Partitioning Term Partitioning

+ Lower Latency Queries

- Lower Throughput

- Lots of Disk Seeks

- Higher Latency Queries

+ Higher Throughput

- Hotspots in Ring (the "Obama" problem)

Tradeoffs...

46

Page 47: Riak Search - Berlin Buzzwords 2010

Riak Search: Term Partitioning

47

Term-partitioning is the most viable approach for our beta clients’ needs: high throughput on Really Big Datasets.

Optimizations:

• Term splitting to reduce hot spots

• Bloom filters & caching to save query-time bandwidth

• Batching to save query-time & index-time bandwidth

Support for either approach eventually.

Page 48: Riak Search - Berlin Buzzwords 2010

Part Four

Review

48

Page 49: Riak Search - Berlin Buzzwords 2010

"Converse AND Shoes"

CLIENT RIAK

WTF!? I'm a

KV store!

Riak Search turns this...

49

Page 50: Riak Search - Berlin Buzzwords 2010

"Converse AND Shoes"

CLIENT RIAK

Gladly!

...into this...

50

Page 51: Riak Search - Berlin Buzzwords 2010

"Converse AND Shoes"

CLIENT RIAK

Keys or Objects

...into this...

51

Page 52: Riak Search - Berlin Buzzwords 2010

Your Application

RiakSearch

Riak

...while keeping operations easy.

52

Page 53: Riak Search - Berlin Buzzwords 2010

Thanks! Questions?

Search Team:

John Muellerleile - @jrecursive

Rusty Klophaus - @rklophaus

Kevin Smith - @kevsmith

Currently working with a small set of Beta users.

Open-source release planned for Q3.

www.basho.com