Information Access with Apache Lucene
Pierpaolo Basile, PhD - Research Assistant, University of Bari Aldo Moro
co-founder, QuestionCube s.r.l. - pierpaolo.basile@{uniba.it, questioncube.com}
CODE CAMP: 20 JULY - 1 AUGUST 2012, S. VITO DEI NORMANNI (BR)
Outline
• Search Engine
– Vector Space Model
– Inverted Index
• Apache Lucene
– Indexing
– Searching
• Advanced topics
– Relevance feedback
– Document similarity
– Apache Tika
SEARCH ENGINE
Searching
Collection of documents
User query
Relevant documents
Information Retrieval
Information Retrieval (IR) is the area of study concerned with searching for documents, for information within documents, and for metadata about documents, as well as that of searching structured storage, relational databases, and the World Wide Web.
Wikipedia
Information Retrieval Process
[Diagram: the user interface passes text through text operations to obtain a logical view of each document; indexing builds the inverted index; a query goes through query operations, searching retrieves documents from the index, ranking produces the ranked documents, and user feedback can refine the query.]
Information Retrieval Model
<D, Q, F, R(qi, dj)>
• D: document representation
• Q: query representation
• F: query/document representation framework
• R(qi, dj): ranking function
Bag-of-words representation
Document/query as unordered collection of words
John likes to watch movies. Mary likes too. John also likes to watch football games.
{"John": 2, "likes": 3, "to": 2, "watch": 2, "movies": 1, "Mary": 1, "too": 1, "also": 1, "football": 1, "games": 1}
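The bag-of-words counts above can be reproduced with a few lines of dependency-free Java (a minimal sketch: tokenization here just lowercases and splits on non-letter characters):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal bag-of-words: count term occurrences, ignoring word order.
public class BagOfWords {

    public static Map<String, Integer> count(String text) {
        Map<String, Integer> bag = new LinkedHashMap<>();
        // Simplistic tokenizer: lowercase, split on non-letter characters.
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty()) {
                bag.merge(token, 1, Integer::sum);
            }
        }
        return bag;
    }

    public static void main(String[] args) {
        System.out.println(count(
                "John likes to watch movies. Mary likes too. "
                + "John also likes to watch football games."));
        // {john=2, likes=3, to=2, watch=2, movies=1, mary=1, too=1, also=1, football=1, games=1}
    }
}
```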
Vector Space Model
Query/Document Framework
• Algebraic model
• Documents/queries are represented as vectors in a geometric space
• Terms are vector components
Vector Space Model
dj = (w1,j, w2,j, …, wt,j)
q = (w1,q, w2,q, …, wt,q)

Each vector component corresponds to a term; the component value wi,j is the weight of the term ti in the document dj.
Vector Space Model
dj = (hit:5, refresh:2)
q1 = (refresh:1)
q2 = (hit:1)
[Plot: dj, q1 and q2 as vectors in the plane spanned by the terms hit and refresh]
Vector Space Model
Ranking function
• Vector similarity between query and document computed by cosine
R(dj, q) = cos θ = (dj · q) / (|dj| · |q|)
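The cosine ranking above can be sketched in plain Java over sparse term-weight vectors (maps from term to weight; the class and method names here are illustrative, not Lucene API):

```java
import java.util.Map;

// Cosine similarity between two sparse term-weight vectors:
// R(d, q) = (d · q) / (|d| * |q|)
public class Cosine {

    public static double similarity(Map<String, Double> d, Map<String, Double> q) {
        double dot = 0, normD = 0, normQ = 0;
        for (Map.Entry<String, Double> e : d.entrySet()) {
            Double w = q.get(e.getKey());
            if (w != null) dot += e.getValue() * w;   // shared terms contribute to the dot product
            normD += e.getValue() * e.getValue();
        }
        for (double w : q.values()) normQ += w * w;
        if (normD == 0 || normQ == 0) return 0;
        return dot / (Math.sqrt(normD) * Math.sqrt(normQ));
    }

    public static void main(String[] args) {
        // dj = (hit:5, refresh:2), q1 = (refresh:1), q2 = (hit:1) as in the example.
        Map<String, Double> dj = Map.of("hit", 5.0, "refresh", 2.0);
        System.out.println(similarity(dj, Map.of("refresh", 1.0)));  // 2/sqrt(29), ~0.371
        System.out.println(similarity(dj, Map.of("hit", 1.0)));      // 5/sqrt(29), ~0.928
    }
}
```

So q2 ranks dj higher than q1 does, matching the smaller angle in the plot.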
Vector Space Model
dj = (hit:5, refresh:2)
q1 = (refresh:1)
q2 = (hit:1)
[Plot: the angles θ1 and θ2 between dj and the two queries determine the ranking]
Term-Document matrix
• ‘It’s a friend of mine — a Cheshire Cat,’ said Alice: ‘allow me to introduce it.’
• ‘It’s the oldest rule in the book,’ said the King. ‘Then it ought to be Number One,’ said Alice.
• Alice watched the White King as he slowly struggled up from bar to bar, till at last she said, ‘Why, you’ll be hours and hours getting to the table, at that rate.’
• Alice looked round eagerly, and found that it was the Red Queen. ‘She’s grown a good deal!’ was her first remark.
• In the pool of light was a billiards table, with two figures moving around it. Alice walked toward them, and as she approached they turned to look at her.
• Alice lay back, and closed her eyes. There was the Red Queen again, with that incessant grin. Or was it the Cheshire cat's grin?
Term-Document matrix
D1 D2 D3
Cheshire Cat 1 0 1
Alice 2 2 2
book 1 0 0
King 1 1 0
table 0 1 1
Queen 0 1 1
grin 0 0 2
Term-Document matrix

Query: Alice AND Queen

Query vector over the same dictionary (Cheshire Cat, Alice, book, King, table, Queen, grin):
(0, 1, 0, 0, 0, 1, 0)
Inverted Index

Each dictionary term points to a posting list: the documents that contain the term, each with the term frequency.

Dictionary      Posting List
Alice           D1:2  D2:2  D3:2
book            D1:1
Cheshire Cat    D1:1  D3:1
grin            D3:2
King            D1:1  D2:1
Queen           D2:1  D3:1
table           D2:1  D3:1
Inverted Index: position list

Posting lists can also store a position list: the positions at which the term occurs in each document (e.g., the positions of the terms of D1: 14, 49 / 1, 40 / 18, 34 / 33).
Inverted Index: query processing

AND (intersection ∩)
Alice AND King -> (D1, D2)
<Alice>  D1:2  D2:2  D3:2
<King>   D1:1  D2:1
The intersection of the two posting lists gives D1 and D2.
Inverted Index: query processing

OR (union ∪)
grin OR King -> (D1, D2, D3)
<grin>  D3:2
<King>  D1:1  D2:1
The union of the two posting lists gives D1, D2 and D3.
Inverted Index: query processing

NOT (complement \)
Alice NOT grin -> (D1, D2)
<Alice>  D1:2  D2:2  D3:2
<grin>   D3:2
Removing the documents of <grin> from those of <Alice> gives D1 and D2.
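The three boolean operations over sorted posting lists can be sketched in plain Java (doc IDs only, frequencies omitted; a toy illustration, not Lucene's actual implementation):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Boolean query processing over sorted posting lists of doc IDs:
// AND = intersection, OR = union, NOT = complement.
public class PostingOps {

    // Linear merge of two sorted doc-ID lists (intersection).
    public static List<Integer> and(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = Integer.compare(a.get(i), b.get(j));
            if (cmp == 0) { out.add(a.get(i)); i++; j++; }
            else if (cmp < 0) i++;
            else j++;
        }
        return out;
    }

    public static List<Integer> or(List<Integer> a, List<Integer> b) {
        TreeSet<Integer> s = new TreeSet<>(a);  // keeps the union sorted
        s.addAll(b);
        return new ArrayList<>(s);
    }

    public static List<Integer> not(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>(a);
        out.removeAll(b);
        return out;
    }

    public static void main(String[] args) {
        List<Integer> alice = List.of(1, 2, 3);  // Alice: D1, D2, D3
        List<Integer> king  = List.of(1, 2);     // King:  D1, D2
        List<Integer> grin  = List.of(3);        // grin:  D3
        System.out.println(and(alice, king));    // [1, 2]
        System.out.println(or(grin, king));      // [1, 2, 3]
        System.out.println(not(alice, grin));    // [1, 2]
    }
}
```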
Term-weight
• Measures the term relevance in a document
– component value in the document representation
– TF*IDF
• TF (term frequency): term occurrences in the document
• IDF (inverse document frequency): inverse of the number of documents in which the term occurs

tf*idf(t, d) = tf(t, d) * log( |D| / |{d ∈ D : t ∈ d}| )
Here |D| is the number of documents in the collection; the log factor is the idf.
TF*IDF insights
• increases proportionally with the term frequency in the document
• decreases with the number of documents in which the term occurs
– common words are generally more frequent across the collection
• IDF depends on the collection, TF on the document
Inverted index/TF*IDF
• TF: computed from the term occurrences in the posting list
• IDF: computed from the posting-list cardinality

Alice -> D1:2  D2:2  D3:2, with |D| = 100
tf*idf(Alice, D1) = 2 * log(100/3)
(TF = 2 from the D1 posting; IDF = log(100/3) since the posting list has 3 entries)
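The computation above, as a tiny Java helper (tf and the posting-list statistics as plain integers; note that Lucene's actual scoring formula differs in its details):

```java
// TF*IDF from posting-list statistics:
// tf*idf(t, d) = tf(t, d) * log(|D| / df(t))
public class TfIdf {

    // tf: occurrences of the term in the document,
    // docFreq: number of documents containing the term (posting-list size),
    // numDocs: |D|, total documents in the collection.
    public static double tfidf(int tf, int docFreq, int numDocs) {
        return tf * Math.log((double) numDocs / docFreq);
    }

    public static void main(String[] args) {
        // Alice occurs twice in D1; its posting list has 3 entries; |D| = 100.
        System.out.println(tfidf(2, 3, 100));  // 2 * log(100/3), ~7.01
    }
}
```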
APACHE LUCENE
http://lucene.apache.org
Apache Lucene
• Apache Software Foundation project
– http://lucene.apache.org
• What Lucene is
– Search library: indexing and searching Application Programming Interface (API)
• What Lucene is not
– Search engine (no crawling, server, user interface, etc.)
Lucene
• Full-text search
• High performance
• Scalable
• Cross-platform
• 100%-pure Java
• Ports to other languages: .NET, Python
Information Retrieval Process in Lucene
[Diagram: the same IR process with Lucene classes (package org.apache.lucene) in each role: Analyzer/Tokenizer/Filter for text operations, IndexWriter for indexing, IndexReader for index access, IndexSearcher and Similarity for searching and ranking, QueryParser for query operations.]
Lucene Model 1/2
• Based on the Vector Space Model
• Features:
– multi-field documents
– tf (term frequency): number of matching terms in a field
– lengthNorm: number of tokens in a field
– idf: inverse document frequency
– coord: coordination factor, number of matching query terms
– field boost
– query clause boost
Lucene Model 2/2
• coord(q,d): score factor based on how many of the query terms are found in the specified document
• queryNorm(q): normalizing factor used to make scores between queries comparable
• boost(t): term boost
• norm(t,d): encapsulates a few (indexing-time) boost and length factors
Document Indexing
FSDirectory dir = FSDirectory.open(new File("/home/user/index_test"));
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);
config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
IndexWriter writer = new IndexWriter(dir, config);
Document doc = new Document();
doc.add(new Field("super_name", "Sandman", Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("name", "William Baker", Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("category", "superhero", Field.Store.YES, Field.Index.ANALYZED));
writer.addDocument(doc);
writer.close();
Field Options
• Field.Store
– NO: field value is not stored
– YES: field value is stored
• Field.Index
– ANALYZED: field is analyzed by the analyzer
– NO: not indexed
– NOT_ANALYZED: indexed but not analyzed
Document: Fields Representation
Example article: http://www.bbc.co.uk/news/technology-15365207
– Date: Field.Store.YES, Field.Index.NOT_ANALYZED
– Authors: Field.Store.YES, Field.Index.ANALYZED
– Title: Field.Store.NO, Field.Index.ANALYZED
– Abstract: Field.Store.NO, Field.Index.ANALYZED
– Content: Field.Store.NO, Field.Index.ANALYZED
– Comments: Field.Store.NO, Field.Index.ANALYZED
– URL: Field.Store.YES, Field.Index.NO
Text Operations
• Tokenization
– split text into tokens
• Stop word elimination
– remove common words and closed word classes (e.g. function words)
• Stemming
– reduce inflected (or sometimes derived) words to their stem
Analyzers
• KeywordAnalyzer
– returns a single token
• WhitespaceAnalyzer
– splits tokens at whitespace
• SimpleAnalyzer
– divides text at non-letter characters and lowercases
• StopAnalyzer
– divides text at non-letter characters, lowercases, and removes stop words
• StandardAnalyzer
– tokenizes based on a sophisticated grammar that recognizes e-mail addresses, acronyms, etc.; lowercases and removes stop words (optional)
Analyzers
The quick brown fox jumped over the lazy dogs
• WhitespaceAnalyzer:
[The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
• SimpleAnalyzer:
[the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
• StopAnalyzer:
[quick] [brown] [fox] [jumped] [lazy] [dogs]
• StandardAnalyzer:
[quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
Analyzers
"XY&Z Corporation - xyz@example.com"
• WhitespaceAnalyzer:
[XY&Z] [Corporation] [-] [xyz@example.com]
• SimpleAnalyzer:
[xy] [z] [corporation] [xyz] [example] [com]
• StandardAnalyzer:
[xy&z] [corporation] [xyz@example.com]
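A dependency-free mimic of a SimpleAnalyzer-plus-StopFilter chain (the stop list here is a small hypothetical one, not Lucene's ENGLISH_STOP_WORDS_SET):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy analyzer chain: split at non-letter characters, lowercase, drop stop words.
public class TinyAnalyzer {

    // Hypothetical stop list, just enough for the example sentence.
    static final Set<String> STOP = Set.of("the", "a", "an", "of", "over");

    public static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty() && !STOP.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The quick brown fox jumped over the lazy dogs"));
        // [quick, brown, fox, jumped, lazy, dogs]
    }
}
```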
Tokenizers
• A Tokenizer is a TokenStream whose input is a Reader
• Tokenizers break field text into tokens
• StandardTokenizer
– source string: "full-text lucene.apache.org"
– tokens: "full" "text" "lucene.apache.org"
• WhitespaceTokenizer
– "full-text" "lucene.apache.org"
• LetterTokenizer
– "full" "text" "lucene" "apache" "org"
TokenFilters
• A TokenFilter is a TokenStream whose input is another token stream
• LowerCaseFilter
• StopFilter
• LengthFilter
• PorterStemFilter
– stemming: reducing words to root form
– rides, ride, riding => ride
– country, countries => countri
Analyzer
public class MyAnalyzer2 extends Analyzer {

    private final Set sw = StopFilter.makeStopSet(Version.LUCENE_36,
            new ArrayList(StopAnalyzer.ENGLISH_STOP_WORDS_SET));

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = new WhitespaceTokenizer(Version.LUCENE_36, reader);
        ts = new LowerCaseFilter(Version.LUCENE_36, ts);
        ts = new StopFilter(Version.LUCENE_36, ts, sw);
        return ts;
    }
}
TermVectors
• Store information about term frequency, positions and offsets
• Relevance Feedback and “More Like This”
• Clustering
• Similarity between two documents
• Highlighter
– needs offsets info
Lucene TermVector
• TermFreqVector is a representation of all of the terms and term counts in a specific Field of a Document instance
• As a tuple:
– termFreq = <term, term count>
– <fieldName, <…, termFreq_i, termFreq_i+1, …>>
• As Java:
– public String getField();
– public String[] getTerms();
– public int[] getTermFrequencies();
Lucene TermPositionVector
• TermPositionVector is a representation of all of the terms, term counts, term position and term offsets in a specific Field of a Document instance
• As Java:
– public String getField();
– public String[] getTerms();
– public int[] getTermFrequencies();
– public int[] getTermPositions(int index);
– public TermVectorOffsetInfo[] getOffsets(int index);
Store TermVector
• During indexing, create a Field that stores term vectors:
new Field("text", text, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES);
• Options are:
– Field.TermVector.YES
– Field.TermVector.NO
– Field.TermVector.WITH_POSITIONS
– Field.TermVector.WITH_OFFSETS
– Field.TermVector.WITH_POSITIONS_OFFSETS
Term Vector: Positions and Offsets
Positions:  0     1      2      3      4       5      6      7      8
Tokens:     The   quick  brown  fox    jumped  over   the    lazy   dogs
Offsets:    0-3   4-9    10-15  16-19  20-26   27-31  32-35  36-40  41-45
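The positions and offsets above can be recomputed with plain Java string handling (an illustration of what the analyzer records, not Lucene code):

```java
import java.util.ArrayList;
import java.util.List;

// Recompute token positions and character offsets for a whitespace-separated text.
public class Offsets {

    // Returns entries of the form "position:token:startOffset-endOffset".
    public static List<String> offsets(String text) {
        List<String> out = new ArrayList<>();
        int position = 0, searchFrom = 0;
        for (String token : text.split("\\s+")) {
            int begin = text.indexOf(token, searchFrom);  // first occurrence after the previous token
            int end = begin + token.length();
            out.add(position + ":" + token + ":" + begin + "-" + end);
            position++;
            searchFrom = end;
        }
        return out;
    }

    public static void main(String[] args) {
        for (String e : offsets("The quick brown fox jumped over the lazy dogs")) {
            System.out.println(e);  // e.g. 4:jumped:20-26
        }
    }
}
```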
Example
• Main method with three parameters
– input_dir: directory to index
– index_dir: directory containing the index files
– flag (string) selecting among the Field.TermVector options:
• t: tokenize
• tv: term vectors
• tvp: term vectors with positions
• tvo: term vectors with positions and offsets
• The constructor of the Index class takes the three parameters as input and indexes the directory
– filters only ".txt" files
Lucene Search
• IndexReader: interface for accessing an index
• QueryParser: parses the user query
• IndexSearcher: implements search over a single IndexReader
– search(Query query, int numDoc)
– TopDocs -> result of search
• TopDocs.scoreDocs returns an array of retrieved documents
Lucene Search

Directory dir = FSDirectory.open(new File(args[0]));
IndexReader reader = IndexReader.open(dir);
IndexSearcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
QueryParser parser = new QueryParser(Version.LUCENE_36, "text", analyzer);
Query query = parser.parse(args[1]);
TopDocs topDocs = searcher.search(query, 100);
System.out.println("matches: " + topDocs.totalHits);
for (int i = 0; i < topDocs.scoreDocs.length; i++) {
    Document doc = reader.document(topDocs.scoreDocs[i].doc);
    System.out.println("name: " + doc.get("name")
            + ", score: " + topDocs.scoreDocs[i].score);
}
searcher.close();
Lucene QueryParser
• Example:
– queryParser.parse("name:SpiderMan");
• good for human-entered queries and debugging
• does text analysis and constructs the appropriate queries
• not all query types supported
QUERY SYNTAX 1/2
• title:"The Right Way" AND text:go
– phrase query + term query
• Wildcard searches
– tes* (test, tests, tester)
– te?t (test, text)
– te*t (tempt)
– tes?
• Fuzzy searches (Levenshtein distance) (TERM)
– roam~ (foam, roams)
– roam~0.8
• Range searches
– mod_date:[20020101 TO 20030101]
– title:{Aida TO Carmen}
• Proximity searches (PHRASE)
– "jakarta apache"~10
QUERY SYNTAX 2/2
• Boosting a term
– jakarta^4 apache
– "jakarta apache"^4 "Apache Lucene"
• Boolean operators
– NOT, OR, AND
– + (required operator)
– - (prohibit operator)
• Grouping with ( )
• Field grouping
– title:(+return +"pink panther")
• Escaping special characters with \
QUERY (BY CODE) 1/2
TermQuery query = new TermQuery(new Term("name", "Spider-Man"));

• explicit, no escaping necessary
• does not do text analysis for you
• Query subclasses:
– TermQuery
– BooleanQuery
– PhraseQuery
– FuzzyQuery / WildcardQuery / …
QUERY (BY CODE) 2/2
TermQuery tq1 = new TermQuery(new Term("name", "pippo"));
TermQuery tq2 = new TermQuery(new Term("name", "pluto"));
BooleanQuery bq = new BooleanQuery();
bq.add(tq1, BooleanClause.Occur.MUST);
bq.add(tq2, BooleanClause.Occur.SHOULD);
DELETING DOCUMENTS
IndexWriter
– deleteDocuments(Term t)
– deleteDocuments(Term... terms)
– deleteDocuments(Query q)
– deleteDocuments(Query... queries)
– updateDocument(Term t, Document d)
– updateDocument(Term t, Document d, Analyzer a)
• Deleting does not immediately reclaim space
ADVANCED TOPICS
Relevance feedback
Documents similarity
Apache Tika
Relevance feedback
• Improve retrieval performance using information about document relevance
– Explicit: relevance indicated by the user (assessor)
– Implicit: from user behavior
– Blind (or pseudo): using information about the top k retrieved documents
Relevance feedback
[Diagram: the user query goes to the searcher, which returns retrieved documents; relevance feedback on those documents produces a new query, which retrieves new documents.]
Rocchio Algorithm

Qnew = α·Q + β·(1/|Drel|)·Σ(dj ∈ Drel) dj - γ·(1/|Dnorel|)·Σ(dk ∈ Dnorel) dk

• Q: original query
• Drel: set of relevant documents
• Dnorel: set of non-relevant documents
Rocchio Algorithm (blind)

In blind (pseudo) relevance feedback only the relevant documents are known: they are assumed to be the top K retrieved documents, so the non-relevant term of the formula is dropped (this is what search engines generally adopt).
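A sketch of Rocchio over sparse term-weight maps (plain Java; the weights and parameter values in main are hypothetical toy numbers):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Rocchio query expansion:
// Qnew = alpha*Q + beta*mean(relevant docs) - gamma*mean(non-relevant docs)
public class Rocchio {

    public static Map<String, Double> expand(Map<String, Double> query,
                                             List<Map<String, Double>> rel,
                                             List<Map<String, Double>> nonRel,
                                             double alpha, double beta, double gamma) {
        Map<String, Double> out = new HashMap<>();
        query.forEach((t, w) -> out.merge(t, alpha * w, Double::sum));
        for (Map<String, Double> d : rel) {
            // Add beta times the centroid of the relevant documents.
            d.forEach((t, w) -> out.merge(t, beta * w / rel.size(), Double::sum));
        }
        for (Map<String, Double> d : nonRel) {
            // Subtract gamma times the centroid of the non-relevant documents.
            d.forEach((t, w) -> out.merge(t, -gamma * w / nonRel.size(), Double::sum));
        }
        out.values().removeIf(w -> w <= 0);  // keep only positive weights
        return out;
    }

    public static void main(String[] args) {
        // Expand the query "alice" with one relevant document (toy weights);
        // for blind feedback, nonRel is simply left empty, as above.
        Map<String, Double> qnew = expand(
                Map.of("alice", 1.0),
                List.of(Map.of("alice", 1.0, "cheshire", 1.0)),
                List.of(),
                1.0, 1.0, 0.5);
        System.out.println(qnew);  // contains alice=2.0 and cheshire=1.0
    }
}
```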
Relevance feedback
Q = Alice
It’s a friend of mine — a Cheshire Cat,’ said Alice: ‘allow me to introduce it.
It’s the oldest rule in the book,’ said the King. ‘Then it ought to be Number One,’ said Alice
Alice lay back, and closed her eyes. There was the Red Queen again, with that incessant grin. Or was it the Cheshire cat's grin?
Alice looked round eagerly, and found that it was the Red Queen. ‘She’s grown a good deal!’ was her first remark.
Qnew = Alice Cheshire Cat King book
It’s a friend of mine — a Cheshire Cat,’ said Alice: ‘allow me to introduce it.
It’s the oldest rule in the book,’ said the King. ‘Then it ought to be Number One,’ said Alice
Alice book was reprinted and published in 1866.
…
Relevance feedback in Lucene
• Possible using TermVector
– Retrieve the top K documents
– Build the Qnew using TermVector
– Re-query using Qnew
Document similarity
• Documents are represented by vectors
– Build the document vector using TermVector
– Compute the similarity using cosine similarity
• Implement a functionality like “similar to this document”
Apache Tika
• Extracts metadata and text from several file formats
– XML, HTML, OpenOffice, Microsoft Office, PDF
– MBOX format (email)
– compressed files
– EXIF metadata from images
– metadata from audio files (MP3)
• An Apache project, like Lucene
Extract metadata and text
Tika tika = new Tika();
String type = tika.detect(new File(args[0]));
System.out.println("File type: " + type);
Metadata metadata = new Metadata();
String text = tika.parseToString(new FileInputStream(new File(args[0])), metadata);
System.out.println("Metadata");
String[] names = metadata.names();
for (int i = 0; i < names.length; i++) {
String value = metadata.get(names[i]);
if (value != null) {
System.out.println(names[i] + "=" + value);
}
}
System.out.println("Text: " + text);
Apache Tika + Lucene
[Diagram: files/URLs go into Tika, which extracts metadata and text and passes them to Lucene for indexing.]
CONCLUSIONS
Conclusions
• A popular IR model
– Vector Space Model
• Lucene
– provides an API to build a search engine
• Apache Tika
– extracts metadata and text from files and URLs
• LET'S GO BUILD YOUR SEARCH ENGINE!
Lucene Contrib
• analyzers: additional analyzers
– Arabic, Chinese, …
• highlighter: a set of classes for highlighting matching terms in search results
• Snowball: stemmer analyzer based on Snowball library http://snowball.tartarus.org
• spellchecker: tools for spellchecking and suggestions with Lucene
Lucene-related projects
• Apache Solr: open-source enterprise search platform from the Apache Lucene project
– full-text search capabilities, high-volume web traffic, HTML administration interfaces
• Apache Nutch: web-search engine based on Solr and Lucene
– crawler, link-graph database, HTML
Thank you for your attention!
Questions?