Elasticsearch: much more than search! [JavaZone 2013]

DESCRIPTION

Search engines can solve far more challenges than a search box suggests. Perhaps you have a search problem without being aware of it? Elasticsearch, an open source search engine built on Lucene, is getting more and more attention, not only because it is excellent at solving typical search problems, but also because it can be used for analytics and "big data" challenges. The talk gives an overview of what search engines are good at, related problems you will run into, how Elasticsearch can help, and how it fits into your technology stack. It is not a tutorial, but with a relatively high pace and examples of realistic complexity it gives an overview of what is possible. We round off with how Elasticsearch can be classified in the multitude of "NoSQL" databases.

Elasticsearch

Much more than search!

Alex Brasetvik

alex@found.no

@alexbrasetvik

Wednesday, September 11, 13

Who?

Co-founder of Found AS

7+ years of search, 2+ with Elasticsearch

Manages hundreds of Elasticsearch clusters

Agenda

0. Elasticsearch

1. Use cases

2. Lingo

3. Data structures

4. Text processing

5. Elasticsearch

6. NOSQL?

Elasticsearch

Open source

Real-time search and analytics

Schema-free

Based on Lucene

$ curl localhost:9200/sample_index/sample_message -XPOST -d '{
  "user": { "name": "DEVOPS_BORAT" },
  "followers": 42000,
  "location": { "lat": 56.78, "lon": 12.34 },
  "tags": [ "questionable", "funny" ],
  "message": "1+1=2 only in legacy system. In modern distributed database with eventual consistent is 1+1=1.",
  "retweets": 123
}'

{"ok":true,"_index":"sample_index","_type":"sample_message","_id":"rjs9KSmPRnqhvs7QjgxJJw","_version":1}

$ curl localhost:9200/sample_index/sample_message/_search -XPOST -d '{
  "query": { "match": { "message": "consistent" } }
}'

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 },
  "hits" : {
    "total" : 1,
    "max_score" : 0.076713204,
    "hits" : [ {
      "_index" : "sample_index",
      "_type" : "sample_message",
      "_id" : "rjs9KSmPRnqhvs7QjgxJJw",
      "_score" : 0.076713204,
      "_source" : {
        "user": { "name": "DEVOPS_BORAT" },
        "message": "1+1=2 only in legacy system. In modern distributed database with eventual consistent is 1+1=1.",
        "retweets": 123,
        ...
      }
    } ]
  }
}

{
  "sample_index" : {
    "sample_message" : {
      "properties" : {
        "followers" : { "type" : "long" },
        "location" : {
          "properties" : {
            "lat" : { "type" : "double" },
            "lon" : { "type" : "double" }
          }
        },
        "message" : { "type" : "string" },
        "retweets" : { "type" : "long" },
        "tags" : { "type" : "string" },
        "user" : {
          "properties" : {
            "name" : { "type" : "string" }
          }
        }
      }
    }
  }
}

{"id"=>12296272736,
 "text"=>
  "An early look at Annotations: http://groups.google.com/group/twitter-api-announce/browse_thread/thread/fa5da2608865453",
 "created_at"=>"Fri Apr 16 17:55:46 +0000 2010",
 "in_reply_to_user_id"=>nil,
 "in_reply_to_screen_name"=>nil,
 "in_reply_to_status_id"=>nil,
 "favorited"=>false,
 "truncated"=>false,
 "user"=>
  {"id"=>6253282,
   "screen_name"=>"twitterapi",
   "name"=>"Twitter API",
   "description"=>
    "The Real Twitter API. I tweet about API changes, service issues and happily answer questions about Twitter and our API. Don't get an answer? It's on my website.",
   "url"=>"http://apiwiki.twitter.com",
   "location"=>"San Francisco, CA",
   "profile_background_color"=>"c1dfee",
   "profile_background_image_url"=>
    "http://a3.twimg.com/profile_background_images/59931895/twitterapi-background-new.png",
   "profile_background_tile"=>false,
   "profile_image_url"=>"http://a3.twimg.com/profile_images/689684365/api_normal.png",
   "profile_link_color"=>"0000ff",
   "profile_sidebar_border_color"=>"87bc44",
   "profile_sidebar_fill_color"=>"e0ff92",
   "profile_text_color"=>"000000",
   "created_at"=>"Wed May 23 06:01:13 +0000 2007",
   "contributors_enabled"=>true,
   "favourites_count"=>1,
   "statuses_count"=>1628,
   "friends_count"=>13,
   "time_zone"=>"Pacific Time (US & Canada)",
   "utc_offset"=>-28800,
   "lang"=>"en",
   "protected"=>false,
   "followers_count"=>100581,
   "geo_enabled"=>true,
   "notifications"=>false,
   "following"=>true,
   "verified"=>true},
 "contributors"=>[3191321],
 "geo"=>nil,
 "coordinates"=>nil,
 "place"=>
  {"id"=>"2b6ff8c22edd9576",
   "url"=>"http://api.twitter.com/1/geo/id/2b6ff8c22edd9576.json",
   "name"=>"SoMa",
   "full_name"=>"SoMa, San Francisco",
   "place_type"=>"neighborhood",
   "country_code"=>"US",
   "country"=>"The United States of America",
   "bounding_box"=>
    {"coordinates"=>
      [[[-122.42284884, 37.76893497],
        [-122.3964, 37.76893497],
        [-122.3964, 37.78752897],
        [-122.42284884, 37.78752897]]],
     "type"=>"Polygon"}},
 "source"=>"web"}

Map of a Twitter Status Object

Raffi Krikorian <raffi@twitter.com>, 18 April 2010

Annotations from the figure, in reading order:

The tweet's unique ID. These IDs are roughly sorted & developers should treat them as opaque (http://bit.ly/dCkppc).

Text of the tweet. Consecutive duplicate tweets are rejected. 140 character max (http://bit.ly/4ud3he).

Tweet's creation date.

DEPRECATED

The ID of an existing tweet that this tweet is in reply to. Won't be set unless the author of the referenced tweet is mentioned.

The screen name & user ID of replied to tweet author.

Truncated to 140 characters. Only possible from SMS.

The author of the tweet. This embedded object can get out of sync.

The author's user ID.

The author's user name.

The author's screen name.

The author's biography.

The author's URL.

The author's "location". This is a free-form text field, and there are no guarantees on whether it can be geocoded.

Rendering information for the author. Colors are encoded in hex values (RGB).

The creation date for this account.

Whether this account has contributors enabled (http://bit.ly/50npuu).

Number of favorites this user has.

Number of tweets this user has.

Number of users this user is following.

The timezone and offset (in seconds) for this user.

The user's selected language.

Whether this user is protected or not. If the user is protected, then this tweet is not visible except to "friends".

Number of followers for this user.

Whether this user has geo enabled (http://bit.ly/4pFY77).

DEPRECATED in this context

Whether this user has a verified badge.

The geo tag on this tweet in GeoJSON (http://bit.ly/b8L1Cp).

The contributors' (if any) user IDs (http://bit.ly/50npuu).

DEPRECATED

The place associated with this Tweet (http://bit.ly/b8L1Cp).

The place ID

The URL to fetch a detailed polygon for this place

The printable names of this place

The type of this place - can be a "neighborhood" or "city"

The country this place is in

The bounding box for this place

The application that sent this tweet


user:
  name: DEVOPS_BORAT
message: “1+1=2 only in legacy system. In modern distributed database with eventual consistent is 1+1=1.”
location:
  lon: 12.34
  lat: 56.78
followers: 42000
retweets: 123
tags: [questionable, funny]

Analysis

whitespace

The quick brown fox had a day off

whitespace-tokenizer
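As a minimal sketch of what a whitespace tokenizer does to the example sentence: it only splits on runs of whitespace and leaves case and punctuation untouched. The function name `whitespace_tokenize` is made up for this illustration and is not an Elasticsearch or Lucene API.

```python
def whitespace_tokenize(text):
    """Split text into tokens on runs of whitespace.

    Case and punctuation are left as they are; only whitespace separates tokens.
    """
    return text.split()


tokens = whitespace_tokenize("The quick brown fox had a day off")
print(tokens)
```

Because nothing is lowercased here, a search for "the" would not match the token "The" unless a lowercase step is added to the analysis chain.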


Filter / Query

Filter: boolean match

Query: match with a score

Can be composed of other queries

“Search”

The entire information need

Query, filters, facets, pagination, ...

Inverted index

"If you don't find it in the index, look very carefully through the entire catalog."

–Sears, Roebuck, and Co., Consumers' Guide 1897
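A rough sketch of the idea behind an inverted index, under simplifying assumptions: each term maps to the set of document IDs that contain it, so a lookup goes straight to the term instead of scanning every document. The sample documents and the names `build_inverted_index` and `search` are invented for this illustration; Lucene's real structures are considerably more involved.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased, whitespace-separated term to the IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, term):
    """Return the sorted document IDs for one term; no scan over the documents is needed."""
    return sorted(index.get(term.lower(), set()))


docs = {1: "The quick brown fox", 2: "The lazy dog", 3: "A quick dog"}
index = build_inverted_index(docs)
print(search(index, "quick"))
print(search(index, "dog"))
```

Only terms that were put into the index can be found, which is why the text processing done at index time matters so much.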


AbstractEnterpriseSingletonProxyFactoryBean


xkcd.com/292


camelCase

AbstractSingletonProxyFactoryBean

camelCase-tokenizer

lowercase
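A possible sketch of the camelCase step followed by lowercasing, using a regular expression. The pattern and the name `camel_case_tokenize` are assumptions made for this illustration, not the analyzer configuration used in the talk.

```python
import re

# Matches, in order of preference: an acronym followed by a capitalized word
# ("HTTP" in "HTTPServer"), a capitalized or lowercase word, a trailing acronym,
# or a run of digits.
_CAMEL_PART = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def camel_case_tokenize(identifier):
    """Split a camelCase identifier into its parts and lowercase each part."""
    return [part.lower() for part in _CAMEL_PART.findall(identifier)]


print(camel_case_tokenize("AbstractSingletonProxyFactoryBean"))
```

With terms like these in the index, a search for "factory" can find the class without any wildcard matching.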


Prefix problems!

Prefix problems

*suffix → xiffus*

(60.6384, 6.5017) → u4u8gyykk

123 → {1-hundreds, 12-tens, 123} (simplified)
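The three rewrites on this slide can be sketched as follows: a suffix search becomes a prefix search on reversed terms, a coordinate becomes a geohash whose prefixes cover ever larger areas, and a number is indexed together with coarser versions of itself. All function names are hypothetical, the geohash routine is the standard public algorithm rather than Elasticsearch's own code, and the numeric scheme just mirrors the simplified notation on the slide, not Lucene's actual encoding.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"
_MAGNITUDES = {1: "tens", 2: "hundreds", 3: "thousands"}


def reversed_term(term):
    """Index terms reversed, so the suffix query *suffix becomes the prefix query xiffus*."""
    return term[::-1]


def geohash(lat, lon, length=9):
    """Encode a coordinate as a geohash by interleaving longitude and latitude bits."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    chars, bits, bit_count, use_lon = [], 0, 0, True
    while len(chars) < length:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half of the interval
        else:
            bits <<= 1
            rng[1] = mid  # keep the lower half of the interval
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # five bits make one base-32 character
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def numeric_prefix_terms(number):
    """Return coarse-to-fine terms for a non-negative integer below 10000."""
    digits = str(number)
    terms = []
    for i in range(1, len(digits) + 1):
        dropped = len(digits) - i
        terms.append(digits if dropped == 0 else f"{digits[:i]}-{_MAGNITUDES[dropped]}")
    return terms


print(reversed_term("suffix"))
print(geohash(60.6384, 6.5017))
print(numeric_prefix_terms(123))
```

In each case the expensive question (ends with, is near, is in range) turns into a cheap lookup of one or a few prefixes in the term dictionary.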


Elasticsearch

Distributed

Cluster of nodes

Self-coordinating

Mapping
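The mapping returned for sample_index earlier in the deck was never declared; Elasticsearch derived it from the first document. A simplified sketch of that kind of type inference is shown below. The function `infer_mapping` is hypothetical, covers only the JSON types that appear in the sample document, and assumes arrays are non-empty; it is not how Elasticsearch implements dynamic mapping.

```python
def infer_mapping(value):
    """Guess a mapping fragment from a JSON-like Python value."""
    if isinstance(value, dict):
        return {"properties": {key: infer_mapping(val) for key, val in value.items()}}
    if isinstance(value, list):
        # Arrays have no type of their own; use the type of the first element.
        return infer_mapping(value[0])
    if isinstance(value, bool):  # must be checked before int, since bool is an int subclass
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "long"}
    if isinstance(value, float):
        return {"type": "double"}
    return {"type": "string"}


document = {
    "user": {"name": "DEVOPS_BORAT"},
    "followers": 42000,
    "location": {"lat": 56.78, "lon": 12.34},
    "tags": ["questionable", "funny"],
    "retweets": 123,
}
mapping = infer_mapping(document)
print(mapping)
```

Note that the inferred location is just two doubles; getting a geo point, or a differently analyzed string field, requires an explicit mapping.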


So far

Inverted indexes

Text processing

Index terms

Mappings

Index templates


xkcd.com/208

?q={!boost b=div(popularity,price) v=$qq}
&qq={!dismax qf=desc^2,review}cheap
&bq={!lucene df=keywords}lucene solr java
&fq={!geofilt sfield=location pt=10.312,-20.556 d=3.5}
&fq={!term f=$ff v=$vv}&ff=keywords&vv=solr
&sort=query(keywords:lame) asc, score desc


Filters

Cached as bitmaps

Compact

Very fast
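A small sketch of why bitmap-cached filters are compact and fast: each filter result is one bit per document, and combining filters is a single bitwise operation. Python integers stand in for the bitsets here; the sample documents and the names `build_bitmap` and `matching_ids` are invented for the illustration.

```python
def build_bitmap(docs, predicate):
    """Return an int whose bit i is set when document i satisfies the predicate."""
    bitmap = 0
    for doc_id, doc in enumerate(docs):
        if predicate(doc):
            bitmap |= 1 << doc_id
    return bitmap


def matching_ids(bitmap):
    """List the document IDs whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


docs = [
    {"tags": ["funny"], "retweets": 123},
    {"tags": ["serious"], "retweets": 5},
    {"tags": ["funny", "questionable"], "retweets": 2},
]

# Each bitmap is computed once and can then be reused by any later search.
funny = build_bitmap(docs, lambda d: "funny" in d["tags"])
popular = build_bitmap(docs, lambda d: d["retweets"] > 100)

# Combining two cached filters is one bitwise AND.
both = funny & popular
print(matching_ids(funny), matching_ids(both))
```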


term: className: "InternalFrameInternalFrameTitlePaneInternalFrameTitlePaneMaximizeButtonWindowNotFocusedState"


Filters

Use filters when you can …

… and queries when you need ranking.
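One way this advice could look in a request body, assuming the `filtered` query available in the Elasticsearch versions current at the time of the talk (later versions express the same thing with a `bool` query and its `filter` clause). The helper `filtered_search` is hypothetical; the field names are taken from the sample document earlier in the deck.

```python
import json


def filtered_search(text, tag):
    """Build a search body: the match query is scored, the term filter only includes or excludes."""
    return {
        "query": {
            "filtered": {
                # Needs ranking: how well does the message match the text?
                "query": {"match": {"message": text}},
                # Yes/no question: does the document carry this tag? Cacheable.
                "filter": {"term": {"tags": tag}},
            }
        }
    }


body = filtered_search("consistent", "funny")
print(json.dumps(body, indent=2))
```

The printed JSON could be sent with the same kind of curl call as the earlier `_search` example.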


Facets

Summarize the entire result set

Filters + facets are the basis for analytics use

Faceting options

Terms

Histograms

Date histograms

Geo distance

Statistical distribution

Filters/Queries
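To make two of these facet types concrete, here is a rough in-memory sketch of what a terms facet and a histogram facet compute over a result set. The documents and the functions `terms_facet` and `histogram_facet` are made up for illustration; Elasticsearch computes the same kind of summaries across shards from its field caches.

```python
from collections import Counter


def terms_facet(docs, field):
    """Count how often each term occurs in the field, most frequent first."""
    counts = Counter()
    for doc in docs:
        value = doc.get(field, [])
        counts.update(value if isinstance(value, list) else [value])
    return counts.most_common()


def histogram_facet(docs, field, interval):
    """Count documents per bucket, where a bucket is the value rounded down to the interval."""
    counts = Counter((doc[field] // interval) * interval for doc in docs if field in doc)
    return sorted(counts.items())


results = [
    {"tags": ["funny", "questionable"], "retweets": 123},
    {"tags": ["funny"], "retweets": 7},
    {"tags": ["serious"], "retweets": 150},
]
print(terms_facet(results, "tags"))
print(histogram_facet(results, "retweets", 100))
```

Both functions touch every matching document, which hints at why facets are demanding on CPU and memory.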


Facets

Resource-intensive

CPU + memory

Important to have enough memory

Caches

Filter caches

Field caches: facets, etc.

Page cache

There are two hard things in computer science: cache invalidation, naming things, and off-by-one errors.

Caches

Now you are thinking with...

Per segment

New segments do not invalidate old ones

Important for (near) real time
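A toy sketch of per-segment caching, assuming segments are immutable once written: a cached filter result stays valid for every old segment, and adding a segment only costs one new computation. The class `SegmentedFilterCache` and its counter are hypothetical and exist only to make the effect visible.

```python
class SegmentedFilterCache:
    """Cache filter results per immutable segment instead of per index."""

    def __init__(self):
        self.segments = []     # each segment is an immutable tuple of documents
        self.cache = {}        # (segment number, filter name) -> matching positions
        self.computations = 0  # how many times a filter was evaluated over a segment

    def add_segment(self, docs):
        """Append a new segment; existing cache entries are left untouched."""
        self.segments.append(tuple(docs))

    def matches(self, name, predicate):
        """Return (segment number, position) pairs, computing only uncached segments."""
        result = []
        for seg_no, segment in enumerate(self.segments):
            key = (seg_no, name)
            if key not in self.cache:
                self.computations += 1
                self.cache[key] = frozenset(
                    i for i, doc in enumerate(segment) if predicate(doc)
                )
            result.extend((seg_no, i) for i in sorted(self.cache[key]))
        return result


def is_funny(doc):
    return "funny" in doc["tags"]


cache = SegmentedFilterCache()
cache.add_segment([{"tags": ["funny"]}, {"tags": ["serious"]}])
first = cache.matches("funny", is_funny)

cache.add_segment([{"tags": ["funny"]}])    # newly indexed documents
second = cache.matches("funny", is_funny)   # only the new segment is evaluated
print(first, second, cache.computations)
```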


PostgreSQL

Verifies resource usage

Safe >> fast

Uses disk if it has to

Elasticsearch trusts you

Built for speed

What could possibly go wrong?

OutOfMemoryError

Woah there

I ate all the memories

Your cluster may or may not work any more

NOSQL?

Fast, not robust

Document database

Schema-flexible

No transactions

Easy to scale/distribute

Naive leader election

No auth/authz

?

Slides and relevant links at

found.no/jz13

(Try hosted Elasticsearch free for 6 months)

Solr meetup in the community room tomorrow!