Java Indexing and Searching, by Shay Sofer & Evgeny Borisov

JavaEdge09 : Java Indexing and Searching

DESCRIPTION

From AlphaCSP's Java conference, JavaEdge09: the presentation Evgeny Borisov and I gave on 'Java Indexing and Searching'. In this session we discussed the need for Full Text Search (as opposed to regular textual/SQL search), Lucene and its OO mismatches, the solution that Hibernate Search provides to those mismatches, and then a bit about Lucene's scoring algorithm.

Page 1: JavaEdge09 : Java Indexing and Searching

1

Java Indexing and Searching
By: Shay Sofer & Evgeny Borisov

Page 2: JavaEdge09 : Java Indexing and Searching

2

» Motivation
» Lucene Intro
» Hibernate Search
» Indexing
» Searching
» Scoring
» Alternatives

Agenda

Page 3: JavaEdge09 : Java Indexing and Searching

3

Motivation
What is Full Text Search and why do I need it?

Page 4: JavaEdge09 : Java Indexing and Searching

4

Id  Title                          Price
1   Head First Java                200
2   JBoss in action                120
3   Best jokes about Chuck Norris  250
4   Best of the best of the best   10

Motivation

Use case: “Book” table

Search query: Good practices for Gava

Page 5: JavaEdge09 : Java Indexing and Searching

5

» We’d like to:
    Index the information efficiently
    Answer queries using that index

» More common than you think

Full Text Search

Motivation

Page 6: JavaEdge09 : Java Indexing and Searching

6

» Integrated full text search engine in the database
    e.g. DBSight, recent versions of MySQL, MS SQL Server, Oracle Text, etc.
» Out of the box Search Appliances
    e.g. Google Search Appliance
» Third party libraries

Full Text Search Solutions

Motivation

Page 7: JavaEdge09 : Java Indexing and Searching

7

Lucene Intro

Page 8: JavaEdge09 : Java Indexing and Searching

8

» The most popular full text search library
» Scalable and high performance
» Around for about 9 years
» Open source
» Supported by the Apache Software Foundation

Apache Lucene

Lucene Intro

Page 9: JavaEdge09 : Java Indexing and Searching

9

Lucene Intro

Page 10: JavaEdge09 : Java Indexing and Searching

10

» “Word-oriented” search
» Powerful query syntax
    Wildcards, typos, proximity search.
» Sorting by relevance (Lucene’s scoring algorithm) or any other field
» Fast searching, fast indexing
    Inverted index.

Lucene’s Features

Lucene Intro

Page 11: JavaEdge09 : Java Indexing and Searching

11

DB (documents):

Id  Title
0   Head First Java
1   Best of the best of the best
2   Chuck Norris in action
3   JBoss in action

Inverted index (term → document ids):

Term    Ids
Head    0
First   0
Java    0
Action  2 3
Best    1
JBoss   3
Chuck   2
Norris  2

Lucene Intro
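To make the inverted index concrete, here is a minimal plain-Java sketch that builds a term → document ids map from the four titles above. It does not use Lucene; the class name, the lowercasing and the small stop-word list are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class InvertedIndexSketch {
    // Assumed stop-word list; the slide's index simply omits "in", "of" and "the".
    static final List<String> STOP_WORDS = Arrays.asList("in", "of", "the");

    // Maps each lowercased term to the ids of the documents that contain it.
    static Map<String, List<Integer>> build(List<String> titles) {
        Map<String, List<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < titles.size(); docId++) {
            for (String token : titles.get(docId).toLowerCase().split("\\s+")) {
                if (STOP_WORDS.contains(token)) {
                    continue;
                }
                List<Integer> postings = index.computeIfAbsent(token, t -> new ArrayList<>());
                // List a document once per term, however often the term repeats in it.
                if (postings.isEmpty() || postings.get(postings.size() - 1) != docId) {
                    postings.add(docId);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList(
                "Head First Java",
                "Best of the best of the best",
                "Chuck Norris in action",
                "JBoss in action");
        System.out.println(build(titles));
    }
}
```

A search for "action" then becomes a single map lookup instead of a scan over every title, which is why searching an inverted index is fast.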

Page 12: JavaEdge09 : Java Indexing and Searching

12

» A Field is a key+value. Value is always represented as a String (Textual)
» A Document can contain as many Fields as we’d like
» Lucene’s index is a collection of Documents

Basic Definitions

Lucene Intro

Page 13: JavaEdge09 : Java Indexing and Searching

13

Lucene Intro

Using Lucene API…

IndexSearcher is = new IndexSearcher("BookIndex");
QueryParser parser = new QueryParser("title", analyzer);
Query query = parser.parse("Good practices for Gava");
return is.search(query);

Page 14: JavaEdge09 : Java Indexing and Searching

14

OO domain model Vs. Lucene’s Index structure

Lucene Intro

Extensible type system

Strong type system

Polymorphic

OO Domain Model | Index structure

Page 15: JavaEdge09 : Java Indexing and Searching

15

» The Structural Mismatch
    Converting objects to strings and vice versa
    No representation of relations between Documents
» The Synchronization Mismatch
    DB must be kept in sync with the index
» The Retrieval Mismatch
    Retrieving documents (= pairs of key + value) and not objects

Object vs Flat text mismatches

Lucene Intro

Page 16: JavaEdge09 : Java Indexing and Searching

16

Hibernate Search

Emmanuel Bernard

Page 17: JavaEdge09 : Java Indexing and Searching

17

» Leverages ORM and Lucene together to solve those mismatches

» Complements Hibernate Core by providing FTS on persistent domain models.

» It’s actually a bridge that hides the sometimes complex Lucene API usage.

» Open source.

Hibernate Search

Page 18: JavaEdge09 : Java Indexing and Searching

18

» Document = Class (Mapped POJO)
» Hibernate Search metadata can be described by annotations only
» Regardless, you can still use Hibernate Core with XML descriptors (hbm files)

» Let’s create our first mapping – Book

Mapping

Hibernate Search

Page 19: JavaEdge09 : Java Indexing and Searching

19

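@Entity
@Indexed // marks the entity as indexable by Hibernate Search
public class Book implements Serializable {
    @Id
    private Long id;

    @Boost(2.0f) // matches on the title weigh twice as much
    @Field
    private String title;

    @Field
    private String description;

    private String imageURL; // no @Field: not indexed

    @Field(index = Index.UN_TOKENIZED) // indexed as a single term, not analyzed
    private String isbn;
    …
}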

Hibernate Search

Page 20: JavaEdge09 : Java Indexing and Searching

20

» Types will be converted via “Field Bridge”.
» It is a bridge between the Java type and its representation in Lucene (aka String)
» Hibernate Search comes with a set for most standard types (Numbers – primitives and wrappers, Date, Class etc)
» They are extendable, of course

Bridges

Hibernate Search

Page 21: JavaEdge09 : Java Indexing and Searching

21

» We can use a field bridge…

@FieldBridge(impl = MyPaddedFieldBridge.class,
             params = {@Parameter(name = "padding", value = "5")})
public Double getPrice() {
    return price;
}

» Or a class bridge, in case the data we want to index is more than just the field itself, e.g. a concatenation of 2 fields

Custom Bridges

Hibernate Search

Page 22: JavaEdge09 : Java Indexing and Searching

22

» In order to create a custom bridge we need to implement the interface StringBridge

» ParameterizedBridge – to inject params

Custom Bridges

Hibernate Search
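As an illustration of what a bridge such as MyPaddedFieldBridge might look like, here is a hedged sketch. The two interfaces declared at the top are local stand-ins that only mirror the shape of Hibernate Search's StringBridge and ParameterizedBridge, so that the example compiles on its own; in a real project you would import the library's interfaces instead. The padding logic (rounding to a whole number, non-negative prices only) is an assumption, not the presenters' implementation.

```java
import java.util.Collections;
import java.util.Map;

// Stand-in for the library's StringBridge interface (assumption: same method shape).
interface StringBridge {
    String objectToString(Object object);
}

// Stand-in for the library's ParameterizedBridge interface (assumption: same method shape).
interface ParameterizedBridge {
    void setParameterValues(Map<String, String> parameters);
}

class MyPaddedFieldBridge implements StringBridge, ParameterizedBridge {
    private int padding = 5; // number of digits to pad to, unless overridden by a parameter

    @Override
    public void setParameterValues(Map<String, String> parameters) {
        String value = parameters.get("padding");
        if (value != null) {
            padding = Integer.parseInt(value);
        }
    }

    @Override
    public String objectToString(Object object) {
        if (object == null) {
            return null;
        }
        // Round to a whole number and left-pad with zeros, so that the
        // alphabetical order of the strings matches the numeric order.
        long whole = Math.round(((Number) object).doubleValue());
        return String.format("%0" + padding + "d", whole);
    }

    public static void main(String[] args) {
        MyPaddedFieldBridge bridge = new MyPaddedFieldBridge();
        bridge.setParameterValues(Collections.singletonMap("padding", "5"));
        System.out.println(bridge.objectToString(120.0));
    }
}
```

Padding matters because Lucene compares field values as strings: without it, "120" would sort before "20".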

Page 23: JavaEdge09 : Java Indexing and Searching

23

» Directory is where Lucene stores its index structure.
» Filesystem Directory Provider
» In-memory Directory Provider
» Clustering

Directory Providers

Hibernate Search

Page 24: JavaEdge09 : Java Indexing and Searching

24

» Default
» Most efficient
» Limited only by the disk’s free space
» Can be easily replicated
» Luke support

Filesystem Directory Provider

Hibernate Search

Page 25: JavaEdge09 : Java Indexing and Searching

25

» Index dies as soon as SessionFactory is closed.
» Very useful when unit testing (alongside in-memory DBs).
» Data can be made persistent at any moment, if needed.
» Obviously, be aware of OutOfMemoryError

In-memory Directory Provider

Hibernate Search

Page 26: JavaEdge09 : Java Indexing and Searching

26

<!-- Hibernate Search Config -->
<property name="hibernate.search.default.directory_provider">
    org.hibernate.search.store.FSDirectoryProvider
</property>

<property name="hibernate.search.com.alphacsp.Book.directory_provider">
    org.hibernate.search.store.RAMDirectoryProvider
</property>

Directory Providers Config Example

Hibernate Search

Page 27: JavaEdge09 : Java Indexing and Searching

27

» Correlated queries - How do we navigate from one entity to another?

» Lucene doesn’t support relationships between documents

» Hibernate Search to the rescue - Denormalization

Relationships

Hibernate Search

Page 28: JavaEdge09 : Java Indexing and Searching

28

Hibernate Search

Page 29: JavaEdge09 : Java Indexing and Searching

29

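@Entity
@Indexed
public class Book {
    @ManyToOne
    @IndexEmbedded // index the author's fields inside the Book document
    private Author author;
}

@Entity
@Indexed
public class Author {
    @Field // needed so that author.firstName ends up in the index
    private String firstName;
}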

» Object navigation is easy (author.firstName)

Relationships

Hibernate Search
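Denormalization here means that the author's data is copied into the Book document under a prefixed field name. The following plain-Java sketch shows the idea with a flat map standing in for a Lucene Document; the classes, the sample values and the toDocument helper are hypothetical and are not part of Hibernate Search.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DenormalizationSketch {
    static class Author {
        final String firstName;

        Author(String firstName) {
            this.firstName = firstName;
        }
    }

    static class Book {
        final String title;
        final Author author;

        Book(String title, Author author) {
            this.title = title;
            this.author = author;
        }
    }

    // Flattens a Book and its embedded Author into one key/value document.
    static Map<String, String> toDocument(Book book) {
        Map<String, String> document = new LinkedHashMap<>();
        document.put("title", book.title);
        if (book.author != null) {
            // The embedded field is addressed by its object path.
            document.put("author.firstName", book.author.firstName);
        }
        return document;
    }

    public static void main(String[] args) {
        // Sample values are made up for the demonstration.
        Book book = new Book("Some title", new Author("Some name"));
        System.out.println(toDocument(book));
    }
}
```

Because the author's name is copied into every Book document, a change to the Author has to trigger re-indexing of the books that embed it.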

Page 30: JavaEdge09 : Java Indexing and Searching

30

» Entities can be referenced by other entities.

Relationships – Denormalization Pitfall

Hibernate Search

Page 33: JavaEdge09 : Java Indexing and Searching

33

» The solution: The association pointing back to the parent will be marked with @ContainedIn

@Entity
@Indexed
public class Book {
    @ManyToOne
    @IndexEmbedded
    private Author author;
}

@Entity
@Indexed
public class Author {
    @OneToMany(mappedBy = "author")
    @ContainedIn
    private Set<Book> books;
}

Relationships – Solution

Hibernate Search

Page 34: JavaEdge09 : Java Indexing and Searching

34

» Responsible for tokenizing and filtering words
» Tokenizing – not as trivial as it seems
» Filtering – Clearing the noise (case, stop words etc) and applying “other” operations
» Creating a custom analyzer is easy

» The default analyzer is Standard Analyzer

Analyzers

Hibernate Search

Page 35: JavaEdge09 : Java Indexing and Searching

35

» StandardTokenizer : Splits words and removes punctuation.
» StandardFilter : Removes apostrophes and dots from acronyms.
» LowerCaseFilter : Lowercases words.
» StopFilter : Eliminates common words.

Standard Analyzer

Hibernate Search
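The chain above (tokenize, clean up, lowercase, drop stop words) can be imitated in a few lines of plain Java. This is only a rough sketch of the idea, not Lucene's StandardAnalyzer: the splitting rule, the handling of possessives and the stop-word list are all simplifying assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class AnalyzerChainSketch {
    // Assumed stop-word list, much shorter than a real analyzer's.
    static final List<String> STOP_WORDS = Arrays.asList("a", "an", "and", "in", "of", "the");

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Tokenizing: split on anything that is not a letter, a digit or an apostrophe.
        for (String raw : text.split("[^\\p{L}\\p{N}']+")) {
            // Filtering: strip a trailing possessive and other apostrophes, then lowercase.
            String token = raw.replaceAll("'s$", "").replace("'", "").toLowerCase(Locale.ROOT);
            // Filtering: skip empty leftovers and stop words.
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Best of the best of the best"));
    }
}
```

The same analysis has to be applied to the indexed text and to the query, otherwise the terms on the two sides will not match.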

Page 36: JavaEdge09 : Java Indexing and Searching

36

Other cool Filters….

Hibernate Search

Page 37: JavaEdge09 : Java Indexing and Searching

37

» N-Gram algorithm – Indexing a sequence of n consecutive characters.

» Usually when a typo occurs, part of the word is still correct
    Encyclopedia in 3-grams = Enc | ncy | cyc | ycl | clo | lop | ope | ped | edi | dia

Approximative Search

Hibernate Search
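A small plain-Java sketch of the n-gram idea follows: it splits a word into n-grams and counts how many distinct grams two words share. The class and method names are illustrative assumptions; Lucene ships its own n-gram token filters, which this code does not use.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class NGramSketch {
    // Returns every run of n consecutive characters in the word, in order.
    static List<String> nGrams(String word, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    // Counts the distinct n-grams that both words contain.
    static int sharedGrams(String first, String second, int n) {
        Set<String> shared = new HashSet<>(nGrams(first, n));
        shared.retainAll(new HashSet<>(nGrams(second, n)));
        return shared.size();
    }

    public static void main(String[] args) {
        System.out.println(nGrams("Encyclopedia", 3));
        // "Encyclopadia" is a deliberate typo used for the comparison.
        System.out.println(sharedGrams("Encyclopedia", "Encyclopadia", 3));
    }
}
```

With the typo "Encyclopadia", 7 of the 10 trigrams still match, which is what lets an n-gram index find the correctly spelled word.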

Page 38: JavaEdge09 : Java Indexing and Searching

38

» Algorithms for indexing of words by their pronunciation

» The most widely known algorithm is Soundex
» Other algorithms that are available: RefinedSoundex, Metaphone, DoubleMetaphone

Phonetic Approximation

Hibernate Search
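To show what indexing by pronunciation means, here is a compact plain-Java sketch of the classic American Soundex rules (keep the first letter, map consonants to digits, collapse repeated codes, pad to four characters). It is written from the published rules for illustration and is not the implementation that Lucene or Hibernate Search uses; the class name is an assumption.

```java
class SoundexSketch {
    // Soundex digit for each letter A..Z; '0' marks letters without a code (vowels, H, W, Y).
    private static final String CODES = "01230120022455012623010202";

    static String encode(String word) {
        String letters = word.toUpperCase().replaceAll("[^A-Z]", "");
        if (letters.isEmpty()) {
            return "";
        }
        StringBuilder out = new StringBuilder().append(letters.charAt(0));
        char previous = CODES.charAt(letters.charAt(0) - 'A');
        for (int i = 1; i < letters.length() && out.length() < 4; i++) {
            char letter = letters.charAt(i);
            char code = CODES.charAt(letter - 'A');
            // Emit a digit only for coded letters that differ from the previous code.
            if (code != '0' && code != previous) {
                out.append(code);
            }
            // H and W do not separate letters that share a code; vowels do.
            if (letter != 'H' && letter != 'W') {
                previous = code;
            }
        }
        while (out.length() < 4) {
            out.append('0');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("Robert") + " " + encode("Rupert"));
    }
}
```

"Robert" and "Rupert" both encode to R163, so a query for one would match a document containing the other.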

Page 39: JavaEdge09 : Java Indexing and Searching

39

» Synonyms
    You can expand your synonym dictionary with your own rules (e.g. business-oriented words)

» Stemming
    Stemming is the process of reducing words to their stem, base or root form.
    “Fishing”, “Fisher”, “Fish” and “Fished” → Fish
    Snowball stemming language – supports over 15 languages

Synonyms & Stemming

Hibernate Search

Page 40: JavaEdge09 : Java Indexing and Searching

40

» Lucene is bundled with the basic analyzers, tokenizers and filters.

» More can be found at Lucene’s contribution part and at Apache-Solr

Additional Analyzers

Hibernate Search

Page 41: JavaEdge09 : Java Indexing and Searching

41

» No free Hebrew analyzer for Lucene
» Itamar Syn-Hershko
    Involved in the creation of CLucene (the C++ port of Lucene)
    Creating a Hebrew analyzer as a side project
    Looking to join forces: [email protected]

Hebrew?

Hibernate Search

Page 42: JavaEdge09 : Java Indexing and Searching

42

Hibernate Search

The Lord of the Rings, first version: The Fellowship of the Ring

Page 43: JavaEdge09 : Java Indexing and Searching

43

» Motivation
» Lucene Intro
» Hibernate Search
» Indexing
» Searching
» Scoring
» Alternatives

Agenda

Page 44: JavaEdge09 : Java Indexing and Searching

44

» When data has changed?
» Which data has changed?
» When to index the changing data?
» How to do it all efficiently?

Hibernate Search will do it for you!

Transparent indexing

Indexing

Page 45: JavaEdge09 : Java Indexing and Searching

45

Indexing – On Rollback

[Diagram: Application, Session (Entity Manager), DB, Lucene Index and Queue. Start Transaction, then insert/update/delete.]

Page 46: JavaEdge09 : Java Indexing and Searching

46

Indexing – On Rollback

[Diagram: Application, Session (Entity Manager), DB, Lucene Index and Queue. Start Transaction, then insert/update/delete, then "Transaction failed" and Rollback.]

Page 47: JavaEdge09 : Java Indexing and Searching

47

Indexing – On Commit

[Diagram: Application, Session (Entity Manager), DB, Lucene Index and Queue. Insert/update/delete, then "Transaction Committed".]
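The diagrams describe index work waiting in a queue until the outcome of the transaction is known: on commit it is applied to the index, on rollback it is dropped. The sketch below models that behaviour in plain Java. The class, its methods and the string-based "work" items are hypothetical and do not correspond to Hibernate Search's internal classes.

```java
import java.util.ArrayList;
import java.util.List;

class TransactionalIndexQueueSketch {
    private final List<String> pending = new ArrayList<>(); // work queued during the transaction
    private final List<String> index = new ArrayList<>();   // stand-in for the Lucene index

    // Called for every insert/update/delete made inside the transaction.
    void enqueue(String work) {
        pending.add(work);
    }

    // On commit the queued work reaches the index.
    void commit() {
        index.addAll(pending);
        pending.clear();
    }

    // On rollback the queued work is dropped, so the index never sees it.
    void rollback() {
        pending.clear();
    }

    List<String> indexed() {
        return new ArrayList<>(index);
    }

    public static void main(String[] args) {
        TransactionalIndexQueueSketch queue = new TransactionalIndexQueueSketch();
        queue.enqueue("add Book#1");
        queue.rollback();
        queue.enqueue("add Book#2");
        queue.commit();
        System.out.println(queue.indexed());
    }
}
```

Deferring the work this way keeps the index from containing changes that the database later discards.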

Page 48: JavaEdge09 : Java Indexing and Searching

48

<property name="org.hibernate.worker.execution">async</property>

<property name="org.hibernate.worker.thread_pool.size">2</property>

<property name="org.hibernate.worker.buffer_queue.max">10</property>

hibernate.cfg.xml

Indexing

Page 49: JavaEdge09 : Java Indexing and Searching

49

It’s too late! I already have a database without Lucene!

Indexing

Page 50: JavaEdge09 : Java Indexing and Searching

50

» FullTextSession extends Hibernate Core's Session
    Session session = sessionFactory.openSession();
    FullTextSession fts = Search.getFullTextSession(session);

» index(Object entity)
» purge(Class entityType, Serializable id)
» purgeAll(Class entityType)

Manual indexing

Indexing

Page 51: JavaEdge09 : Java Indexing and Searching

51

tx = fullTextSession.beginTransaction();
// read the data from the database
Criteria criteria = fullTextSession.createCriteria(Book.class);
List<Book> books = criteria.list();
for (Book book : books) {
    fullTextSession.index(book);
}
tx.commit();

Manual indexing

Indexing

Page 52: JavaEdge09 : Java Indexing and Searching

52

tx = fullTextSession.beginTransaction();
List<Integer> ids = getIds();
for (Integer id : ids) {
    if (…) {
        fullTextSession.purge(Book.class, id);
    }
}
tx.commit();

» fullTextSession.purgeAll(Book.class);

Removing objects from the Lucene index

Indexing

Page 53: JavaEdge09 : Java Indexing and Searching

53

Rrrr!!! I got an OutOfMemoryError!

Indexing

Page 54: JavaEdge09 : Java Indexing and Searching

54

// session is a FullTextSession
session.setFlushMode(FlushMode.MANUAL);
session.setCacheMode(CacheMode.IGNORE);
Transaction tx = session.beginTransaction();
ScrollableResults results = session.createCriteria(Item.class)
        .scroll(ScrollMode.FORWARD_ONLY);

int index = 0;
while (results.next()) {
    index++;
    session.index(results.get(0)); // queue the entity for indexing
    if (index % BATCH_SIZE == 0) {
        session.flushToIndexes(); // apply the queued changes to the index
        session.clear();          // free the first-level cache
    }
}
tx.commit();

Indexing

100

Page 55: JavaEdge09 : Java Indexing and Searching

55

Searching

Page 56: JavaEdge09 : Java Indexing and Searching

56

title:lord title:rings
+title:lord +title:rings
title:lord -author:Tolkien
title:r?ngs
title:r*gs
title:"Lord of the Rings"
title:"Lord Rings"~5
title:rengs~0.8
title:lord author:Tolkien^2
And more…

Lucene’s Query Syntax

Searching

Page 57: JavaEdge09 : Java Indexing and Searching

57

» To build FTS queries we need to:
    Create a Lucene query
    Create a Hibernate Search query that wraps the Lucene query

Why?
» No need to build a framework around Lucene
» Converting documents to objects happens transparently.
» Seamless integration with Hibernate Core API

Querying

Searching

Page 58: JavaEdge09 : Java Indexing and Searching

58

String stringToSearch = "rings";
Term term = new Term("title", stringToSearch);
TermQuery query = new TermQuery(term);
FullTextQuery hibQuery = session.createFullTextQuery(query, Book.class);

List<Book> results = hibQuery.list();

Hibernate Queries Examples

Searching

Page 59: JavaEdge09 : Java Indexing and Searching

59

String stringToSearch = "r??gs";
Term term = new Term("title", stringToSearch);
WildcardQuery query = new WildcardQuery(term);
...

List<Book> results = hibQuery.list();

WildcardQuery Example

Searching

Page 60: JavaEdge09 : Java Indexing and Searching

60

Id  Title                   Price
1   Head First Java         200
2   Chuck Norris in action  120
3   Chuck Norris vs JBoss   120
4   JBoss strikes back      10

Motivation

Use case: Book table

Search query: Good practices for Gava

Page 61: JavaEdge09 : Java Indexing and Searching

61

HS Query Flowchart

Searching

[Flowchart: the Client issues a Hibernate Search query; the query is run against the Lucene index and the matching ids are received; the objects are then loaded from the Persistence Context, with DB access if needed.]

Page 62: JavaEdge09 : Java Indexing and Searching

62

» You can use list(), uniqueResult(), iterate(), scroll() – just like in Hibernate Core!
» Multistage search engine
» Sorting
» Explanation object

Querying tips

Searching

Page 63: JavaEdge09 : Java Indexing and Searching

63

Score

Page 64: JavaEdge09 : Java Indexing and Searching

64

» Mostly based on Salton's Vector Space Model

Score

Page 66: JavaEdge09 : Java Indexing and Searching

66

Term Rating

Score

$$\text{term weight}_i = \log\left(\frac{\text{number of documents in the index}}{\text{total number of documents containing term } i}\right)$$

Example query: best java in action books

Page 67: JavaEdge09 : Java Indexing and Searching

67

Term Rating Calculation

Score

$$\log\left(\frac{500}{500}\right) = 0$$

$$\log\left(\frac{5000}{50}\right) = 2$$

$$\log\left(\frac{5000}{5}\right) = 3$$

Page 68: JavaEdge09 : Java Indexing and Searching

68

1. Head First Java
2. Best of the best of the best
3. Best examples from Hibernate in action
4. The best action of Chuck Norris

Scoring example

Score

Search for: “best java in action books”

Term    Frequency  Score
Java    1          0.60206
Best    3          0.12494
Action  2          0.30103
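The scores in this example can be reproduced with a few lines of plain Java, assuming that "Frequency" means the number of titles containing the term and that the logarithm is base 10 (both assumptions are consistent with the numbers on the slides). The class and method names are illustrative, and this is the simple term-weight formula from the slides, not Lucene's full scoring formula.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class TermRatingSketch {
    // Number of titles that contain the term as a whole word, ignoring case.
    static int documentFrequency(List<String> titles, String term) {
        String wanted = term.toLowerCase(Locale.ROOT);
        int count = 0;
        for (String title : titles) {
            if (Arrays.asList(title.toLowerCase(Locale.ROOT).split("\\s+")).contains(wanted)) {
                count++;
            }
        }
        return count;
    }

    // Term weight = log10(documents in the index / documents containing the term).
    // Assumes docsWithTerm > 0; a term that appears nowhere has no finite weight.
    static double termWeight(int totalDocs, int docsWithTerm) {
        return Math.log10((double) totalDocs / docsWithTerm);
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList(
                "Head First Java",
                "Best of the best of the best",
                "Best examples from Hibernate in action",
                "The best action of Chuck Norris");
        for (String term : Arrays.asList("java", "best", "action")) {
            int frequency = documentFrequency(titles, term);
            System.out.printf(Locale.ROOT, "%s %d %.5f%n", term, frequency, termWeight(titles.size(), frequency));
        }
    }
}
```

The rarer a term is across the index, the higher its weight: "java" appears in one title out of four and outweighs "best", which appears in three.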

Page 69: JavaEdge09 : Java Indexing and Searching

69

» Conventional Boolean retrieval
» Calculating score for only matching documents
» Customizing similarity algorithm
» Query boosting
» Custom scoring algorithms

Lucene’s scoring approach

Score

Page 70: JavaEdge09 : Java Indexing and Searching

70

Alternatives

Page 71: JavaEdge09 : Java Indexing and Searching

71

Alternatives

Shay Banon

Page 72: JavaEdge09 : Java Indexing and Searching

72

Alternatives

Simple

Lucene based

Configurable via XML or annotations

Local & External TX Manager

Integrates with popular ORM frameworks

Spring support

Distributed

Page 73: JavaEdge09 : Java Indexing and Searching

73

Alternatives

Page 74: JavaEdge09 : Java Indexing and Searching

74

» Enterprise Search Server
    Supports multiple protocols (xml, json, ruby, etc...)
» Runs as a standalone Full Text Search server within a servlet container, e.g. Tomcat
» Heavily based on Lucene
» JSA – Java Search API (based on JPA)
    ODM (Object/Document Mapping)
    Spring integration (Transactions)

Apache Solr

Alternatives

Page 75: JavaEdge09 : Java Indexing and Searching

75

» Powerful Web Administration Interface
    Can be tailored without any Java coding!
» Extensive plugin architecture
» Server statistics exposed over JMX
» Scalability – easily replicated

Apache Solr

Alternatives

Page 77: JavaEdge09 : Java Indexing and Searching

77

Thank you!
Q & A