Advanced full text searching techniques using Lucene

Efficient text searching techniques

Learn how to build an efficient search-based web application using Java

Who am I?

Asad Abbas

BS Computer Science FAST NUCES

Software Engineer Etilize Private Ltd

Agenda

Introduction to full text search
MySQL's full text search solutions
Lucene: what it is and what it is not (features)
Pros and cons compared to MySQL
Indexing and searching
Scoring criteria
Analyzers
Query types
Classes and APIs to remember
Hello World Lucene code
Faceted search
Apache Solr: features
Lucene resources and links

Application of text search

Nowadays, any modern web site worth its salt is considered to need a "Google-like" search function.

Users want to be able to just type the word(s) they're seeking and have the computer do the rest.

Search is an important component of almost any application: a blog, a news website, a desktop application, an email client, an e-commerce website, a content-based product such as a CMS, Inquire's export system, and so on.

MySQL's search options

The famous LIKE clause:

    SELECT * FROM table WHERE text LIKE '%query%' AND isactive

Flaws with this approach:

Bad performance for big tables (a pattern with a leading wildcard cannot use an ordinary index, so every row is scanned)
No support for boolean queries

MySQL's FULLTEXT index

Why do we index? The full-text index is much like other indexes: a sorted list of "keys" which point to records in the data file. Each key has:

Word (VARCHAR): a word within the text.
Count (LONG): how many times the word occurs in the text.
Weight (FLOAT): our evaluation of the word's importance.
Rowid: a pointer to the row in the data file.

Results can be returned in order of relevance.

Boolean queries are supported:

    SELECT * FROM contents WHERE MATCH(title, text) AGAINST('+Mysql -YourSql' IN BOOLEAN MODE)
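In boolean mode, a leading + marks a word that must be present and a leading - marks a word that must be absent. The following is only a rough plain-Java sketch of that matching rule, not how MySQL implements it; the class and method names are hypothetical, and tokenization is simplified to splitting on anything that is not a letter or digit.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of "+word -word" semantics; not MySQL code.
class BooleanModeSketch {

    /**
     * Returns true when the text contains every required word and none of
     * the excluded words. Both word sets are expected in lower case.
     */
    static boolean matches(String text, Set<String> required, Set<String> excluded) {
        // Simplified tokenization: lowercase, then split on non-alphanumerics.
        Set<String> words = new HashSet<>(
                Arrays.asList(text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")));
        return words.containsAll(required) && Collections.disjoint(words, excluded);
    }

    public static void main(String[] args) {
        Set<String> required = Collections.singleton("mysql");   // +Mysql
        Set<String> excluded = Collections.singleton("yoursql"); // -YourSql

        System.out.println(matches("MySQL full text search", required, excluded)); // true
        System.out.println(matches("MySQL versus YourSQL", required, excluded));   // false
        System.out.println(matches("Lucene in Action", required, excluded));       // false
    }
}
```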

Lucene: an advanced full text search library

Lucene is a high-performance, scalable Information Retrieval (IR) library.

Lucene allows you to add search capabilities to your application.

Lucene doesn’t care about the source of the data, its format, or even its language, as long as you can derive text from it.

Support for single and multiterm queries, phrase queries, wildcards, fuzzy queries, result ranking, and sorting

Open source at ASF ( http://lucene.apache.org )

Ports are available for .NET, Ruby, C++, PHP, Python, Perl, etc.

Used by many big companies such as Netflix, LinkedIn, Hewlett-Packard, Salesforce.com, Atlassian (Jira), Digg, and so on.

Lucene vs MySQL full text search

LUCENE

Faster than MySQL full text search.
Much more complex to use than MySQL.
Index updates are very fast.
No joins in Lucene.
The programmer controls everything: stop words, case sensitivity, analyzer, relevance, scoring, etc.
Highly scalable.

MYSQL

Slower.
Simple: just add a full text index on a field.
Inserts become very slow on a table with a full text index.
Complex joins on full text fields of different tables are possible.
No full text support in InnoDB; it is supported by MyISAM (InnoDB gained FULLTEXT indexes only in MySQL 5.6).
Few things are easily configurable or customizable.
Cannot scale to very large data sets or a large number of transactions.

What role does Lucene play in a search engine?

Logical box view of a Lucene index

Inverted index and searching
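To make the idea concrete, here is a minimal plain-Java sketch of an inverted index: a map from each term to the set of document ids that contain it, with a search that intersects those sets. It is only an illustration of the concept and does not reflect Lucene's actual on-disk format; the class and method names are hypothetical and the tokenization (lowercase, split on non-letters) is an assumption.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical toy index; Lucene's real index structure is far more elaborate.
class InvertedIndexSketch {

    // term -> sorted ids of the documents that contain the term
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    /** Lowercases the text, splits it on non-letters, and records each term. */
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Returns the ids of documents that contain every one of the given terms. */
    SortedSet<Integer> searchAll(String... terms) {
        SortedSet<Integer> result = null;
        for (String term : terms) {
            SortedSet<Integer> docs =
                    postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);   // first term: copy its posting list
            } else {
                result.retainAll(docs);         // later terms: intersect
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(0, "Lucene in Action");
        index.add(1, "Lucene for Dummies");
        index.add(2, "Managing Gigabytes");
        index.add(3, "The Art of Computer Science");

        System.out.println(index.searchAll("lucene"));           // [0, 1]
        System.out.println(index.searchAll("lucene", "action")); // [0]
    }
}
```

A lookup touches only the posting lists of the query terms, which is why this structure avoids the full scan that a LIKE '%query%' search needs.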

Scoring documents and relevance

The factors involved in Lucene's scoring algorithm are as follows:

1. tf
   Implementation: sqrt(freq)
   Implication: the more frequently a term occurs in a document, the greater its score
   Rationale: documents which contain more of a term are generally more relevant

2. idf
   Implementation: log(numDocs/(docFreq+1)) + 1
   Implication: the more documents a term occurs in, the lower its score
   Rationale: common terms are less important than uncommon ones

3. coord
   Implementation: overlap / maxOverlap
   Implication: of the terms in the query, a document that contains more of them will have a higher score
   Rationale: self-explanatory

4. lengthNorm
   Implementation: 1/sqrt(numTerms)
   Implication: a term matched in a field with fewer terms has a higher score
   Rationale: a term in a field with fewer terms is more important than one in a field with more

Lucene Scoring

5. queryNorm = normalization factor so that queries can be compared

6. boost (index) = boost of the field at index-time

7. boost (query) = boost of the field at query-time
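The first four factors can be written down directly from the formulas listed above. This is a plain-Java sketch with hypothetical class and method names, not Lucene's Similarity class; it deliberately leaves out queryNorm and the boosts, and it does not show how Lucene combines the factors into a final score.

```java
// Hypothetical helper that mirrors the formulas on the slide; not Lucene code.
class ScoringFactorsSketch {

    /** tf: square root of how often the term occurs in the document. */
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    /** idf: log(numDocs / (docFreq + 1)) + 1, using the natural logarithm. */
    static double idf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    /** coord: fraction of the query terms that the document contains. */
    static double coord(int overlap, int maxOverlap) {
        return (double) overlap / maxOverlap;
    }

    /** lengthNorm: 1 / sqrt(number of terms in the field). */
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    public static void main(String[] args) {
        // A term occurring 4 times scores higher than one occurring once.
        System.out.println("tf(1)=" + tf(1) + " tf(4)=" + tf(4));
        // A rare term (9 of 1000 docs) gets a larger idf than a common one (499 of 1000).
        System.out.println("idf rare=" + idf(1000, 9) + " idf common=" + idf(1000, 499));
        // Matching 1 of 2 query terms halves the coord factor.
        System.out.println("coord=" + coord(1, 2));
        // A 4-term title outweighs a 16-term body for the same match.
        System.out.println("lengthNorm(4)=" + lengthNorm(4) + " lengthNorm(16)=" + lengthNorm(16));
    }
}
```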

Types of Analyzer

WhitespaceAnalyzer, as the name implies, simply splits text into tokens on whitespace characters and makes no other effort to normalize the tokens.

Input: "XY&Z Corporation - xyz@example.com"
Output: [XY&Z] [Corporation] [-] [xyz@example.com]

SimpleAnalyzer first splits tokens at non-letter characters, then lowercases each token. Be careful! This analyzer quietly discards numeric characters.
Output: [xy] [z] [corporation] [xyz] [example] [com]

StopAnalyzer is the same as SimpleAnalyzer, except it removes common words. By default it removes common words in the English language (the, a, etc.), though you can pass in your own set.
Output: [xy] [z] [corporation] [xyz] [example] [com]

StandardAnalyzer is Lucene's most sophisticated core analyzer. It has quite a bit of logic to identify certain kinds of tokens, such as company names, email addresses, and host names. It also lowercases each token and removes stop words.
Output: [xy&z] [corporation] [xyz@example.com]
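The behaviour of the three simpler analyzers can be approximated in a few lines of plain Java. The sketch below is an imitation for illustration only: it does not use Lucene's Analyzer classes, the class and method names are hypothetical, and the stop word set is supplied by the caller rather than taken from Lucene's default list. StandardAnalyzer's grammar-based tokenization is not reproduced.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical imitation of analyzer behaviour; not Lucene's implementation.
class AnalyzerSketch {

    /** Like WhitespaceAnalyzer: split on whitespace, no normalization. */
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Like SimpleAnalyzer: split at non-letters, then lowercase. */
    static List<String> simpleTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    /** Like StopAnalyzer: simpleTokens minus the given (lowercase) stop words. */
    static List<String> stopTokens(String text, Set<String> stopWords) {
        List<String> tokens = simpleTokens(text);
        tokens.removeAll(stopWords);
        return tokens;
    }

    public static void main(String[] args) {
        String sample = "XY&Z Corporation - xyz@example.com";
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "a", "of"));

        System.out.println(whitespaceTokens(sample));
        // [XY&Z, Corporation, -, xyz@example.com]
        System.out.println(simpleTokens(sample));
        // [xy, z, corporation, xyz, example, com]
        System.out.println(stopTokens("The Art of Computer Science", stopWords));
        // [art, computer, science]
    }
}
```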

Types of Query

Query (abstract parent class)
TermQuery (single-term query)
RangeQuery (ranges, e.g. updatedate:[20040101 TO 20050101])
PrefixQuery (search by prefix)
BooleanQuery (combines multiple queries)
WildcardQuery (wildcard search)
FuzzyQuery (near/close words, e.g. for the query wazza we can get wazzu, fazzu, etc.)
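Fuzzy matching of this kind rests on edit distance: the number of single-character insertions, deletions, and substitutions needed to turn one word into another. The following plain-Java sketch computes the classic Levenshtein distance; the class and method names are hypothetical, and it is not the code Lucene's FuzzyQuery runs, which also converts the distance into a similarity threshold.

```java
// Hypothetical helper illustrating edit distance; not Lucene's FuzzyQuery code.
class EditDistanceSketch {

    /** Levenshtein distance computed with two rolling rows of the DP table. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting i characters of a yields the empty string
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insert = curr[j - 1] + 1;
                int delete = prev[j] + 1;
                int substitute = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insert, delete), substitute);
            }
            int[] tmp = prev; // reuse the arrays for the next row
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("wazza", "wazzu")); // 1
        System.out.println(levenshtein("wazza", "fazzu")); // 2
        System.out.println(levenshtein("wazza", "wazza")); // 0
    }
}
```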

Lucene - important classes

Analyzer
  Creates tokens using a Tokenizer and filters them through zero or more TokenFilters

IndexWriter
  Responsible for converting text into the internal Lucene format

Directory
  Where the index is stored
  RAMDirectory, FSDirectory, others

Lucene - important classes

Document
  A collection of Fields
  Can be boosted

Field
  Free text, keywords, dates, etc.
  Defines attributes for storing, indexing
  Can be boosted
  Field constructors and parameters: open up Fieldable and Field in an IDE

Lucene - important classes

Searcher
  Provides methods for searching
  Look at the Searcher class declaration
  IndexSearcher, MultiSearcher, ParallelMultiSearcher

IndexReader
  Loads a snapshot of the index into memory for searching

TopDocs
  The search results

QueryParser
  Converts a query string into a Query object

Query
  Logical representation of the program's information need

Hello Lucene Code Index

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

// initialize the analyzer (the same one is reused at query time)
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);

// 1. create the index (held in memory)
Directory index = new RAMDirectory();

// the boolean arg in the IndexWriter ctor means to
// create a new index, overwriting any existing index
IndexWriter w = new IndexWriter(index, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
addDoc(w, "Lucene in Action", "Lucene in action .. ");
addDoc(w, "Lucene for Dummies", " Lucene for Dummies ");
addDoc(w, "Managing Gigabytes", " Managing Gigabytes ");
addDoc(w, "The Art of Computer Science", " The Art of Computer Science ");
w.close();

Hello Lucene Code

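import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Adds one document with a boosted title field and a plain text field
private static void addDoc(IndexWriter w, String title, String text) throws IOException {
    Document doc = new Document();
    Field titleField = new Field("title", title, Field.Store.YES, Field.Index.ANALYZED);
    titleField.setBoost(1.5F); // matches in the title count for more
    doc.add(titleField);
    Field textField = new Field("text", text, Field.Store.YES, Field.Index.ANALYZED);
    doc.add(textField);
    w.addDocument(doc);
}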

Hello Lucene Code Query

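import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.search.BooleanClause.Occur;

// Build the query programmatically: both fields must contain "art"
TermQuery t1 = new TermQuery(new Term("title", "art"));
TermQuery t2 = new TermQuery(new Term("text", "art"));
BooleanQuery bq = new BooleanQuery();
bq.add(t1, Occur.MUST);
bq.add(t2, Occur.MUST);

OR

// Let QueryParser build the equivalent query from a query string
Query q = new QueryParser(Version.LUCENE_CURRENT, "title", analyzer).parse("title:art AND text:art");

Search

int hitsPerPage = 10;
IndexSearcher searcher = new IndexSearcher(index, true); // true = open read-only
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(bq, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;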

Hello Lucene Code

Finally Display results

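System.out.println("Found " + hits.length + " hits.");
for (int i = 0; i < hits.length; ++i) {
    int docId = hits[i].doc;
    // load the stored fields of the matching document
    Document d = searcher.doc(docId);
    System.out.println((i + 1) + ". " + d.get("title") + " : " + d.get("text"));
}
// close the searcher once the results have been read
searcher.close();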

Indexing databases

Indexing database example

String sql = "select id, productid, value from paragraphproductparameter where isactive";
ResultSet rs = stmt.executeQuery(sql);
while (rs.next()) {
    Document doc = new Document();
    // product id is indexed as a single exact-match token
    doc.add(new Field("productid", rs.getString("productid"), Field.Store.YES, Field.Index.NOT_ANALYZED));
    // value is tokenized by the analyzer for full text search
    doc.add(new Field("value", rs.getString("value"), Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(doc);
}

Query boosting

Boosting queries

At query time:

    title:free^2.0 AND text:free^1.0

Query.setBoost(float f): sets a query's or subquery's boost weight

Field.setBoost(float f): sets a field's boost at index creation time

Faceted search concept

Facets are often derived by analysis of the text of an item using entity extraction techniques, or from pre-existing fields in the database such as author, descriptor, language, and format.
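For the simpler case, where facets come from pre-existing fields, the core operation is counting how many matching documents carry each field value. Below is a minimal plain-Java sketch of that counting step; the class, method, and sample data are hypothetical, and neither Lucene nor Solr is used.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical facet counter over in-memory documents; not Lucene or Solr code.
class FacetCountSketch {

    /** Counts how many documents carry each value of the given field. */
    static Map<String, Integer> facetCounts(List<Map<String, String>> docs, String field) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by facet value
        for (Map<String, String> doc : docs) {
            String value = doc.get(field);
            if (value != null) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Builds a sample document with two facet fields (made-up data). */
    static Map<String, String> doc(String author, String language) {
        Map<String, String> d = new HashMap<>();
        d.put("author", author);
        d.put("language", language);
        return d;
    }

    public static void main(String[] args) {
        List<Map<String, String>> results = new ArrayList<>();
        results.add(doc("Author A", "English"));
        results.add(doc("Author B", "English"));
        results.add(doc("Author A", "Urdu"));

        System.out.println(facetCounts(results, "language")); // {English=2, Urdu=1}
        System.out.println(facetCounts(results, "author"));   // {Author A=2, Author B=1}
    }
}
```

A search interface would show these counts next to each value so that users can narrow the result set by picking one.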

Apache Solr

Standalone enterprise search server on top of Lucene. Salient features include:

Distributed index
Replication
Caching
REST-like API to update/get the index
Faceted searching and filtering
Clustering
Rich document parsing and indexing (PDF, Word, HTML, etc.) using Apache Tika

Open source at http://lucene.apache.org/solr

Links and resources for more on this

Lucene in Action (ebook)
Lucene Tutorial: http://www.lucenetutorial.com
http://www.informit.com/articles/article.aspx?p=461633
http://jayant7k.blogspot.com/2006/05/mysql-fulltext-search-versus-lucene.html
http://www.ibm.com/developerworks/library/wa-lucene/

Thanks a lot for attending the event

THANKS TO ALL FOR TAKING OUT YOUR PRECIOUS TIME FOR THE PRESENTATION
