Codemotion 2013 - Building Your Own Search Engine with Apache Solr (Creare il proprio motore di ricerca con Apache Solr)

Alfonso Focareta, alfonso.focareta@pronetics.it (@afocareta), Pro-netics S.p.A.
Angelo Quercioli, angelo.quercioli@pronetics.it, Pro-netics S.p.A.

Solr & Lucene


Lucene: features

• High-performance, scalable full-text search library
• 100% pure Java
• Focus: indexing and searching documents (a “Document” is just a list of name+value pairs)
• No crawlers or document parsing
• Flexible text analysis (tokenizers + token filters)


Solr: features

• A full-text search server based on Lucene
• XML/HTTP and JSON interfaces
• Faceted search (category counting)
• Flexible data schema to define types and fields
• Hit highlighting
• Configurable advanced caching
• Index replication
• Extensible open architecture, plugins
• Web administration interface
• Written in Java 5, deployable as a WAR


Solr: license

Open source! Apache License


Solr: Architecture


Solr: Installing and Starting

• JDK 5 or above installed


• Open http://localhost:8983/solr/admin/ in your web browser to administer it

Solr: Define a schema.xml

Define a Schema (schema.xml)

The schema.xml file describes the structure of the indexed data.

• Type definitions
• Field definitions
• CopyField section
• Additional definitions


Solr: Define a schema.xml (type definition)

Type Definition

List of types and components (simple and complex)

• Primitive types
• WhitespaceTokenizerFactory
• StopFilterFactory
• WordDelimiterFilterFactory
• LowerCaseFilterFactory
• SnowballPorterFilterFactory (stemming)
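To make the idea of a tokenizer followed by token filters concrete, here is a toy Python imitation of such a chain. It is only a sketch and not Solr code: the stop-word list and the crude "strip a trailing s" rule are assumptions invented for illustration, standing in for a real stop filter and a real stemmer.

```python
# Toy analysis chain: one tokenizer, then token filters applied in order.
# Not Solr code; STOP_WORDS and the stemming rule are made up for this sketch.

STOP_WORDS = {"the", "a", "of", "and"}  # hypothetical stop-word list


def whitespace_tokenize(text):
    """Split on whitespace only, as a whitespace tokenizer would."""
    return text.split()


def stop_filter(tokens):
    """Drop very common words that add little to a search."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def lowercase_filter(tokens):
    """Normalize case so that 'Solr' and 'solr' match."""
    return [t.lower() for t in tokens]


def naive_stem_filter(tokens):
    """Crude stand-in for stemming: strip one trailing 's' from longer words."""
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]


def analyze(text):
    """Run the tokenizer, then each filter in sequence."""
    tokens = whitespace_tokenize(text)
    for token_filter in (stop_filter, lowercase_filter, naive_stem_filter):
        tokens = token_filter(tokens)
    return tokens


print(analyze("The Engines of Search"))
```

The order of the filters matters: the same chain is normally applied at index time and at query time so that both sides produce comparable tokens.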


Solr: Define a schema.xml (type definition - example)

Type Definition - Example


Solr: Define a schema.xml (field definitions)

Field Definitions

• Field attributes: name, type, indexed, stored, multiValued, omitNorms, termVectors

<field name="id" type="string" indexed="true" stored="true"/>
<field name="sku" type="textTight" indexed="true" stored="true"/>
<field name="name" type="text" indexed="true" stored="true"/>
<field name="inStock" type="boolean" indexed="true" stored="false"/>
<field name="price" type="sfloat" indexed="true" stored="false"/>
<field name="category" type="text_ws" indexed="true" stored="true" multiValued="true"/>

• Dynamic Fields

<dynamicField name="*_i" type="sint" indexed="true" stored="true"/>
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
<dynamicField name="*_t" type="text" indexed="true" stored="true"/>
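As a rough illustration of how a field name ends up with a type, the following Python sketch checks explicit fields first and then falls back to the dynamic-field patterns from this slide. It is a simplified model, not Solr's actual lookup: `resolve_type` is a hypothetical helper, and it simply takes the first matching pattern.

```python
from fnmatch import fnmatchcase

# Explicit fields and dynamic-field patterns taken from the slide.
FIELDS = {
    "id": "string", "sku": "textTight", "name": "text",
    "inStock": "boolean", "price": "sfloat", "category": "text_ws",
}
DYNAMIC_FIELDS = [("*_i", "sint"), ("*_s", "string"), ("*_t", "text")]


def resolve_type(field_name):
    """Return the type name for a field, or None if nothing matches."""
    if field_name in FIELDS:  # explicit declarations are checked first
        return FIELDS[field_name]
    for pattern, field_type in DYNAMIC_FIELDS:
        if fnmatchcase(field_name, pattern):  # glob-style match, e.g. "*_i"
            return field_type
    return None


for name in ("price", "age_i", "nickname_s", "unknown"):
    print(name, "->", resolve_type(name))
```

Dynamic fields let documents carry fields such as `age_i` without declaring each one in schema.xml ahead of time.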


Solr: Define a schema.xml (Copy Field - example)

Copy Field

Copies one field to another at index time.

Case #1: Analyze the same field in different ways

– copy into a field with a different analyzer
– boost exact-case, exact-punctuation matches
– language translations, thesaurus, soundex

<field name="title" type="text"/>
<field name="title_exact" type="text_exact" stored="false"/>
<copyField source="title" dest="title_exact"/>

Case #2: Index multiple fields into a single searchable field
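Case #2 can be pictured with a small Python model of what happens to a document at index time. This is only a sketch: the catch-all field name `text` and the `description` field are assumptions chosen for illustration, and `apply_copy_fields` is a hypothetical helper rather than anything Solr exposes.

```python
def apply_copy_fields(doc, copy_rules):
    """Return a new document where each source field's values are also
    appended to its destination field. The input document is not modified."""
    result = {name: list(values) for name, values in doc.items()}
    for source, dest in copy_rules:
        if source in doc:  # copy from the original values only (no chaining)
            result.setdefault(dest, []).extend(doc[source])
    return result


# Hypothetical rules: two source fields feed one searchable field "text".
rules = [("title", "text"), ("description", "text")]
doc = {"title": ["Apache Solr"], "description": ["Search server"]}

indexed = apply_copy_fields(doc, rules)
print(indexed["text"])
```

A query against the single destination field then reaches content that originally arrived in several different fields.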


Solr: Indexing Method

Indexing Method

You put documents into Solr (called "indexing") via:

• XML
• JSON
• CSV
• Binary over HTTP (multipart request)


Solr: Indexing (Java API)

Indexing by Solrj

Send an XML message like this:


<add>
  <doc>
    <field name="id">043564</field>
    <field name="name">Alfonso</field>
    <field name="surname">Focareta</field>
    <field name="category">developer</field>
    <field name="language">Italian</field>
    <field name="language">English</field>
  </doc>
</add>
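An add message of this shape can be generated rather than typed by hand. The Python sketch below uses only the standard library; `build_add_message` is a hypothetical helper name, and a list value stands for a multi-valued field such as `language`.

```python
import xml.etree.ElementTree as ET


def build_add_message(doc):
    """Build an <add><doc>...</doc></add> message from a dict.
    A list or tuple value produces one <field> element per entry."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            field = ET.SubElement(doc_el, "field", {"name": name})
            field.text = str(item)  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


message = build_add_message({
    "id": "043564",
    "name": "Alfonso",
    "surname": "Focareta",
    "category": "developer",
    "language": ["Italian", "English"],
})
print(message)
```

Building the message through an XML library avoids malformed markup when a field value contains characters such as `<` or `&`.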

Solr: Indexing (Solrj)

Solrj

Solrj is a Java client for accessing Solr. It offers a Java interface to add, update, and query the Solr index.

Example ->


Solr: Indexing (Solrj) Example


Solr: Delete Document

Delete document(s)

• Delete by ID (most efficient)

<delete>
  <id>05591</id>
  <id>32552</id>
</delete>

• Delete by query

<delete>
  <query>language:english</query>
</delete>
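The two delete messages can be produced with a couple of small helpers. This is a sketch using Python's standard library; the function names `delete_by_ids` and `delete_by_query` are hypothetical and chosen only to mirror the two bullets above.

```python
import xml.etree.ElementTree as ET


def delete_by_ids(ids):
    """Build a <delete> message with one <id> element per document ID."""
    delete = ET.Element("delete")
    for doc_id in ids:
        ET.SubElement(delete, "id").text = str(doc_id)
    return ET.tostring(delete, encoding="unicode")


def delete_by_query(query):
    """Build a <delete> message that removes every document matching query."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")


print(delete_by_ids(["05591", "32552"]))
print(delete_by_query("language:english"))
```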


Solr: Commit and Optimize

Commit and Optimize

Commit: when you index documents into Solr, none of your changes become visible to searches until you run the commit command.

Optimize: the command that merges the index segments (increasing search speed) and removes any deleted (replaced) documents.
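Commit and optimize can be sent as small XML update messages, `<commit/>` and `<optimize/>`. The Python sketch below only prepares the HTTP requests and does not send them. The update URL is an assumption based on the localhost addresses used elsewhere in this deck, and `update_request` is a hypothetical helper.

```python
import urllib.request

# Assumed update endpoint for a default local installation.
UPDATE_URL = "http://localhost:8983/solr/update"


def update_request(xml_message):
    """Prepare (but do not send) a POST request carrying an XML update message."""
    return urllib.request.Request(
        UPDATE_URL,
        data=xml_message.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


commit = update_request("<commit/>")
optimize = update_request("<optimize/>")

# Sending requires a running Solr instance, for example:
#     urllib.request.urlopen(commit)
print(commit.get_method(), commit.full_url, commit.data)
```

Committing once after a batch of adds or deletes, rather than after each document, keeps the number of commit operations low.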


Solr: Searching

Searching

You can search documents in Solr over HTTP or with the Solrj library.

http://localhost:8983/solr/select?q=language:italian&start=0&rows=2&fl=name,surname

<response>
  <result numFound="15" start="0">
    <doc>
      <str name="name">Angelo</str>
      <str name="surname">Quercioli</str>
    </doc>
    <doc>
      <str name="name">Alfonso</str>
      <str name="surname">Focareta</str>
    </doc>
  </result>
</response>
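The search request and its XML response can be handled with Python's standard library, as sketched below. The helper names `build_select_url` and `parse_docs` are hypothetical, and the sample response is the simplified one from this slide rather than a capture from a live server.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

SELECT_URL = "http://localhost:8983/solr/select"

SAMPLE_XML = """
<response>
  <result numFound="15" start="0">
    <doc><str name="name">Angelo</str><str name="surname">Quercioli</str></doc>
    <doc><str name="name">Alfonso</str><str name="surname">Focareta</str></doc>
  </result>
</response>
"""


def build_select_url(query, start=0, rows=10, fields=None):
    """Build a select URL; urlencode percent-encodes ':' and ',' for us."""
    params = [("q", query), ("start", start), ("rows", rows)]
    if fields:
        params.append(("fl", ",".join(fields)))
    return SELECT_URL + "?" + urlencode(params)


def parse_docs(xml_text):
    """Return (numFound, list of {field name: value}) from an XML response."""
    result = ET.fromstring(xml_text).find("result")
    docs = [{f.get("name"): f.text for f in doc} for doc in result.findall("doc")]
    return int(result.get("numFound")), docs


print(build_select_url("language:italian", rows=2, fields=["name", "surname"]))
print(parse_docs(SAMPLE_XML))
```

Note that `numFound` is the total number of matches, while `start` and `rows` select the page of documents actually returned.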


Solr: Searching (Response Format)

Response Format

You can add &wt=json to get a JSON-formatted response:

{"response": {"numFound": 15, "start": 0, "docs": [
  {"name": "Angelo", "surname": "Quercioli"},
  {"name": "Alfonso", "surname": "Focareta"}
]}}
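A JSON response is straightforward to consume from a script. The Python sketch below assumes the documents sit under a top-level "response" key, which is what Solr's JSON writer emits and what the facet example in this deck shows. The sample text and the `summarize` helper are illustrative only.

```python
import json

SAMPLE_JSON = """
{"response": {"numFound": 15, "start": 0, "docs": [
  {"name": "Angelo", "surname": "Quercioli"},
  {"name": "Alfonso", "surname": "Focareta"}
]}}
"""


def summarize(json_text):
    """Return (total matches, list of 'name surname' strings for this page)."""
    body = json.loads(json_text)["response"]
    names = [f'{doc["name"]} {doc["surname"]}' for doc in body["docs"]]
    return body["numFound"], names


total, names = summarize(SAMPLE_JSON)
print(total, names)
```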


Solr: Searching – Query Syntax

Lucene Query Syntax

• italian english
  Equiv: italian OR english (the QueryParser default operator is "OR", so each term is optional)
• Wildcard searches: ang?o, alf*o, rom*
• +italian +english -name:angelo
  Equiv: italian AND english NOT name:angelo
• "justice league" -name:aquaman
• releaseDate:[2012-01-01T00:00:00Z TO 2013-12-31T23:59:59Z]
• description:"legge roma"~100
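Date range clauses are easy to get wrong by hand, since the timestamps use the ISO 8601 form with colons in the time part and a trailing Z for UTC. The Python sketch below builds such a clause from datetime objects. `solr_date` and `date_range` are hypothetical helper names.

```python
from datetime import datetime, timezone


def solr_date(dt):
    """Format an aware datetime as a UTC timestamp like 2012-01-01T00:00:00Z."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def date_range(field, start, end):
    """Build an inclusive range clause: field:[start TO end]."""
    return f"{field}:[{solr_date(start)} TO {solr_date(end)}]"


start = datetime(2012, 1, 1, tzinfo=timezone.utc)
end = datetime(2013, 12, 31, 23, 59, 59, tzinfo=timezone.utc)
query = date_range("releaseDate", start, end)
print(query)
```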


Solr: Searching – Query Syntax 2

Lucene Query Syntax 2

• *:*
• (angelo AND "pier francesco") OR (+federico +paolo)


Solr: Function Query

Function Query

• Allows adding a function of a field value to the score
  – Boost recently added or popular documents
• Current parser only supports function notation
• Example: log(sum(popularity,1))
• sum, min, max, log, sqrt, currency, ms … etc
• scale(x, target_min, target_max)
  – calculates min & max of x across all docs
• map(x, min, max, target)
  – useful for dealing with defaults
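To get a feel for what these functions compute, here is a plain-Python imitation of three of them. It is a toy model under stated assumptions, not Solr's implementation: `log` is taken to be base 10, `map` is taken to replace values inside the inclusive range [min, max] with the target and leave others unchanged, and `scale` is taken to rescale linearly between the two targets.

```python
import math


def solr_log(x):
    """Base-10 logarithm (assumed to match Solr's log function)."""
    return math.log10(x)


def solr_map(x, lo, hi, target):
    """Replace x with target when lo <= x <= hi, otherwise keep x."""
    return target if lo <= x <= hi else x


def solr_scale(values, target_min, target_max):
    """Linearly rescale all values so they span [target_min, target_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # every document has the same value
        return [target_min for _ in values]
    factor = (target_max - target_min) / (hi - lo)
    return [target_min + (v - lo) * factor for v in values]


popularity = [0, 9, 99]
# log(sum(popularity,1)): adding 1 avoids log(0) for unpopular documents.
print([solr_log(p + 1) for p in popularity])
# map(popularity,0,0,1): treat a missing or zero value as 1.
print([solr_map(p, 0, 0, 1) for p in popularity])
print(solr_scale([0, 5, 10], 0, 1))
```

The logarithm compresses large popularity values, so a very popular document does not completely dominate the relevance score.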


Solr: Boosted Query

Boosted Query

• Score is multiplied instead of added
  – New local params {!...} syntax added

&q={!boost b=sqrt(popularity)}"super man"

• Parameter dereferencing in local params

&q={!boost b=$boost v=$userq}&boost=sqrt(popularity)&userq="super man"
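Parameter dereferencing is convenient when the user's text comes from a form, because the query template stays fixed and the raw input travels in its own parameter. The Python sketch below builds such a query string. `boosted_params` is a hypothetical helper, and the parameter names `boost` and `userq` are the ones from this slide.

```python
from urllib.parse import urlencode, parse_qs


def boosted_params(user_query, boost_function):
    """Keep the local-params template constant; pass the values separately."""
    return [
        ("q", "{!boost b=$boost v=$userq}"),
        ("boost", boost_function),
        ("userq", user_query),
    ]


query_string = urlencode(boosted_params('"super man"', "sqrt(popularity)"))
print(query_string)
# Decoding the string again shows the three parameters unchanged.
print(parse_qs(query_string))
```

Letting `urlencode` do the escaping means braces, dollar signs, quotes, and spaces reach the server intact.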


Solr: Facet Query

Facet Query

Faceted search breaks up search results into multiple categories.

http://solr/select?q=foo&wt=json&indent=on&facet=true&facet.field=cat&facet.query=price:[0 TO 100]&facet.query=manu:IBM

{"response": {"numFound": 26, "start": 0, "docs": […]},
 "facet_counts": {
   "facet_queries": {
     "price:[0 TO 100]": 6,
     "manu:IBM": 2},
   "facet_fields": {
     "cat": ["electronics", 14, "memory", 3, "card", 2, "connector", 2]}}}
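In the JSON response, the counts for a facet field arrive as a flat list that alternates value and count. The Python sketch below turns that list into a dictionary. The sample is the response from this slide with the elided docs list replaced by an empty list, and `facet_field_counts` is a hypothetical helper.

```python
import json

SAMPLE_FACETS = """
{"response": {"numFound": 26, "start": 0, "docs": []},
 "facet_counts": {
   "facet_queries": {"price:[0 TO 100]": 6, "manu:IBM": 2},
   "facet_fields": {
     "cat": ["electronics", 14, "memory", 3, "card", 2, "connector", 2]}}}
"""


def facet_field_counts(flat):
    """Pair up ["value", count, "value", count, ...] into {value: count}."""
    return dict(zip(flat[0::2], flat[1::2]))


facets = json.loads(SAMPLE_FACETS)["facet_counts"]
categories = facet_field_counts(facets["facet_fields"]["cat"])
print(categories)
print(facets["facet_queries"])
```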


Solr: Filter Query

Filter Query

• Filters are restrictions in addition to the query
• Used in faceting to narrow the results
• Filters are cached separately for speed

1. User queries for memory; the query sent to Solr is
   &q=memory&fq=inStock:true&facet=true&…
2. User selects 1GB memory size
   &q=memory&fq=inStock:true&fq=size:1GB&…
3. User selects DDR2 memory type
   &q=memory&fq=inStock:true&fq=size:1GB&fq=type:DDR2&…
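The drill-down sequence amounts to keeping the main query fixed and adding one fq parameter per selection. A minimal Python sketch follows. `search_params` is a hypothetical helper, and the filter values are the ones from this slide.

```python
from urllib.parse import urlencode, parse_qs


def search_params(query, filters):
    """Build a query string with one q parameter and one fq per active filter."""
    params = [("q", query)] + [("fq", f) for f in filters]
    params.append(("facet", "true"))
    return urlencode(params)


filters = ["inStock:true"]
step1 = search_params("memory", filters)   # initial search

filters.append("size:1GB")                 # user picks a memory size
step2 = search_params("memory", filters)

filters.append("type:DDR2")                # user picks a memory type
step3 = search_params("memory", filters)

for step in (step1, step2, step3):
    print(step)
```

Sending each restriction as its own fq, rather than folding everything into q, lets each filter be cached and reused independently.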


Demo!


Questions?
