Build Your Own Search Engine with Apache Solr
Alfonso Focareta (@afocareta) and Angelo Quercioli, Pro-netics S.p.A
Solr & Lucene
Lucene: features
• High performance, full-text & scalable search library
• 100% pure Java
• Focus: Indexing + Searching Documents (“Document” is just a list of name+value pairs)
• No crawlers or document parsing
• Flexible text analysis (tokenizers + token filters)
Solr: features
• A full-text search server based on Lucene
• XML/HTTP, JSON interfaces
• Faceted search (category counting)
• Flexible data schema to define types and fields
• Hit highlighting
• Configurable advanced caching
• Index replication
• Extensible open architecture, plugins
• Web administration interface
• Written in Java 5, deployable as a WAR
Solr: license
Open source! Apache License
Solr: Architecture
Solr: Installing and Starting
• JDK 5 or above installed
• Start Solr, then open http://localhost:8983/solr/admin/ in your web browser to administer it
Solr: Define a schema.xml
Define a Schema (schema.xml)
The file schema.xml describes the structure of the indexed data.
• Type definitions
• Field definitions
• CopyField section
• Additional definitions
Solr: Define a schema.xml (type definition)
Type Definition
List of types and components (simple and complex)
• Primitive types
• WhitespaceTokenizerFactory
• StopFilterFactory
• WordDelimiterFilterFactory
• LowerCaseFilterFactory
• SnowballPorterFilterFactory (stemming)
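A text type chains one tokenizer with a series of token filters. The toy Python model below only illustrates that idea; it is not how Solr implements analysis, and the stop-word list and the filter order are assumptions chosen for the example.

```python
# Toy model of an analysis chain: one tokenizer followed by token filters.
# This is an illustration only, not Solr's implementation.

STOP_WORDS = {"the", "a", "of"}  # hypothetical stop-word list


def whitespace_tokenize(text):
    """Split the text on runs of whitespace."""
    return text.split()


def lowercase_filter(tokens):
    """Lowercase every token."""
    return [t.lower() for t in tokens]


def stop_filter(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def analyze(text):
    """Run the tokenizer, then each filter in order."""
    tokens = whitespace_tokenize(text)
    tokens = lowercase_filter(tokens)
    tokens = stop_filter(tokens)
    return tokens


print(analyze("The Return of the King"))  # ['return', 'king']
```

The same chain is normally applied at index time and at query time, which is why a query for "KING" can match a document containing "King".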
Solr: Define a schema.xml (type definition - example)
Type Definition - Example
Solr: Define a schema.xml (field definitions)
Field Definitions
• Field attributes: name, type, indexed, stored, multiValued, omitNorms, termVectors
<field name="id" type="string" indexed="true" stored="true"/>
<field name="sku" type="textTight" indexed="true" stored="true"/>
<field name="name" type="text" indexed="true" stored="true"/>
<field name="inStock" type="boolean" indexed="true" stored="false"/>
<field name="price" type="sfloat" indexed="true" stored="false"/>
<field name="category" type="text_ws" indexed="true" stored="true" multiValued="true"/>
• Dynamic fields
<dynamicField name="*_i" type="sint" indexed="true" stored="true"/>
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
<dynamicField name="*_t" type="text" indexed="true" stored="true"/>
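Dynamic fields let a document carry fields that were never declared by name: an incoming field name is checked against the explicit fields first and then against the wildcard patterns. The Python sketch below mimics that lookup for the definitions above. It is a simplified model, and how Solr chooses between several matching patterns is not modeled (the sketch takes the first match in declaration order).

```python
from fnmatch import fnmatchcase

# Explicit fields and dynamic-field patterns copied from the slide.
EXPLICIT_FIELDS = {"id": "string", "sku": "textTight", "name": "text"}
DYNAMIC_FIELDS = [("*_i", "sint"), ("*_s", "string"), ("*_t", "text")]


def resolve_type(field_name):
    """Return the type a field name would get, or None if nothing matches."""
    if field_name in EXPLICIT_FIELDS:
        return EXPLICIT_FIELDS[field_name]
    for pattern, type_name in DYNAMIC_FIELDS:
        # Simplification: first matching pattern wins.
        if fnmatchcase(field_name, pattern):
            return type_name
    return None


print(resolve_type("name"))      # text
print(resolve_type("weight_i"))  # sint
print(resolve_type("color"))     # None
```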
Solr: Define a schema.xml (Copy Field - example)
Copy Field
Copies one field to another at index time.
Case #1: Analyze the same field in different ways
– copy into a field with a different analyzer
– boost exact-case, exact-punctuation matches
– language translations, thesaurus, soundex
<field name="title" type="text"/>
<field name="title_exact" type="text_exact" stored="false"/>
<copyField source="title" dest="title_exact"/>
Case #2: Index multiple fields into a single searchable field
Solr: Indexing Method
Indexing Method
You put documents into Solr (called "indexing") via:
• XML
• JSON
• CSV
• Binary over HTTP (multipart request)
Solr: Indexing (Java API)
Indexing by SolrJ
Send an XML message like this:
<add>
  <doc>
    <field name="id">043564</field>
    <field name="name">Alfonso</field>
    <field name="surname">Focareta</field>
    <field name="category">developer</field>
    <field name="language">Italian</field>
    <field name="language">English</field>
  </doc>
</add>
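As a rough illustration, an add message of this shape can be generated from a dictionary with Python's standard library. The helper name build_add_message is hypothetical, and a list value is assumed to stand for a multi-valued field such as language.

```python
import xml.etree.ElementTree as ET


def build_add_message(doc):
    """Build an <add><doc>...</doc></add> message from a dict.

    A list value produces one <field> element per entry (multi-valued field).
    """
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        values = value if isinstance(value, list) else [value]
        for v in values:
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(v)
    return ET.tostring(add, encoding="unicode")


message = build_add_message({
    "id": "043564",
    "name": "Alfonso",
    "surname": "Focareta",
    "category": "developer",
    "language": ["Italian", "English"],
})
print(message)
```

Building the message with an XML library rather than string concatenation means characters such as & and < in field values are escaped automatically.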
Solr: Indexing (SolrJ)
SolrJ
SolrJ is a Java client for accessing Solr. It offers a Java interface to add, update, and query the Solr index.
Example ->
Solr: Indexing (SolrJ) Example
Solr: Delete Document
Delete document(s)
• Delete by Id (most efficient)
<delete>
  <id>05591</id>
  <id>32552</id>
</delete>
• Delete by Query
<delete>
  <query>language:english</query>
</delete>
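The two delete forms can be produced by one small helper, sketched here in Python. The function name build_delete_message is hypothetical; it only builds the message text and does not send anything.

```python
import xml.etree.ElementTree as ET


def build_delete_message(ids=(), queries=()):
    """Build a <delete> message with <id> and/or <query> children."""
    delete = ET.Element("delete")
    for doc_id in ids:
        ET.SubElement(delete, "id").text = str(doc_id)
    for query in queries:
        ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")


print(build_delete_message(ids=["05591", "32552"]))
# <delete><id>05591</id><id>32552</id></delete>
print(build_delete_message(queries=["language:english"]))
# <delete><query>language:english</query></delete>
```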
Solr: Commit and Optimize
Commit and Optimize
Commit: when you index documents into Solr, none of your changes become visible to searches until you run the commit command!
Optimize: the command that merges the index segments (increasing search speed) and removes any deleted (replaced) documents.
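Commit and optimize are themselves sent as small XML messages. The Python sketch below prepares such requests without sending them. The update URL is an assumption (the slides only show the /solr/select and /solr/admin paths), so adjust it to your installation.

```python
import urllib.request

# Assumed update endpoint; only the host and port appear on the slides.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def make_update_request(xml_body):
    """Prepare (but do not send) a POST request carrying an XML update message."""
    return urllib.request.Request(
        SOLR_UPDATE_URL,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


commit = make_update_request("<commit/>")
optimize = make_update_request("<optimize/>")

# urllib.request.urlopen(commit) would actually send the request;
# it is left out so the sketch runs without a Solr server.
print(commit.get_method(), commit.full_url, commit.data)
```

Since changes stay invisible until a commit, a typical batch job sends many add messages and a single commit at the end.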
Solr: Searching
Searching
You can search documents in Solr over HTTP or with the SolrJ library.
http://localhost:8983/solr/select?q=language:italian&start=0&rows=2&fl=name,surname
<response>
  <result numFound="15" start="0">
    <doc>
      <str name="name">Angelo</str>
      <str name="surname">Quercioli</str>
    </doc>
    <doc>
      <str name="name">Alfonso</str>
      <str name="surname">Focareta</str>
    </doc>
  </result>
</response>
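A minimal Python sketch of the HTTP route: it builds the select URL with proper escaping and parses a response of the shape shown above. The helper names are hypothetical, and the sample response is hard-coded so the code runs without a server.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def build_select_url(base, **params):
    """Append URL-encoded query parameters to the select handler URL."""
    return base + "?" + urlencode(params)


def parse_docs(xml_text):
    """Return (numFound, list of {field name: value}) from an XML response."""
    result = ET.fromstring(xml_text).find("result")
    docs = [{f.get("name"): f.text for f in doc} for doc in result.findall("doc")]
    return int(result.get("numFound")), docs


url = build_select_url(
    "http://localhost:8983/solr/select",
    q="language:italian", start=0, rows=2, fl="name,surname",
)
print(url)

# Sample response copied from the slide (no server needed).
SAMPLE = """<response><result numFound="15" start="0">
<doc><str name="name">Angelo</str><str name="surname">Quercioli</str></doc>
<doc><str name="name">Alfonso</str><str name="surname">Focareta</str></doc>
</result></response>"""

num_found, docs = parse_docs(SAMPLE)
print(num_found, docs)
```

Note that numFound (15) counts all matches, while rows=2 limits how many documents come back; start and rows together give paging.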
Solr: Searching (Response Format)
Response Format
You can add &wt=json for a JSON-formatted response
{"response": {"numFound":15, "start":0, "docs": [
  {"name":"Angelo", "surname":"Quercioli"},
  {"name":"Alfonso", "surname":"Focareta"}
]}}
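With wt=json the response can be consumed directly by any JSON parser. The Python sketch below reads a hard-coded sample; it assumes the result list sits under a top-level "response" key, as in the faceting response later in these slides.

```python
import json

# Hard-coded sample; the top-level "response" key is an assumption.
SAMPLE = (
    '{"response": {"numFound": 15, "start": 0, "docs": ['
    '{"name": "Angelo", "surname": "Quercioli"}, '
    '{"name": "Alfonso", "surname": "Focareta"}]}}'
)


def full_names(payload):
    """Return (numFound, ["name surname", ...]) from a JSON response string."""
    body = json.loads(payload)["response"]
    names = [f'{d["name"]} {d["surname"]}' for d in body["docs"]]
    return body["numFound"], names


print(full_names(SAMPLE))  # (15, ['Angelo Quercioli', 'Alfonso Focareta'])
```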
Solr: Searching – Query Syntax
Lucene Query Syntax
• italian english
  Equiv: italian OR english
  QueryParser default operator is "OR"/optional
• Wildcard searches: ang?o, alf*o, rom*
• +italian +english -name:angelo
  Equiv: italian AND english NOT name:angelo
• "justice league" -name:aquaman
• releaseDate:[2012-01-01T00:00:00Z TO 2013-12-31T23:59:59Z]
• description:"legge roma"~100
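Date range bounds must be written as UTC timestamps in the yyyy-MM-ddTHH:mm:ssZ form. A small Python helper, sketched here with hypothetical function names, avoids formatting mistakes when the bounds come from datetime objects.

```python
from datetime import datetime, timezone


def solr_date(dt):
    """Format a timezone-aware datetime as a UTC timestamp ending in Z."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def date_range(field, start, end):
    """Build an inclusive range clause: field:[start TO end]."""
    return f"{field}:[{solr_date(start)} TO {solr_date(end)}]"


query = date_range(
    "releaseDate",
    datetime(2012, 1, 1, tzinfo=timezone.utc),
    datetime(2013, 12, 31, 23, 59, 59, tzinfo=timezone.utc),
)
print(query)
# releaseDate:[2012-01-01T00:00:00Z TO 2013-12-31T23:59:59Z]
```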
Solr: Searching – Query Syntax 2
Lucene Query Syntax 2
• *:*
• (angelo AND "pier francesco") OR (+federico +paolo)
Solr: Function Query
Function Query
• Allows adding a function of a field value to the score
  – Boost recently added or popular documents
• Current parser only supports function notation
• Example: log(sum(popularity,1))
• sum, min, max, log, sqrt, currency, ms … etc
• scale(x, target_min, target_max)
  – calculates min & max of x across all docs
• map(x, min, max, target)
  – useful for dealing with defaults
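To make the arithmetic concrete, here is a plain-Python model of three of these functions. It only mirrors the math as described on the slide: log is taken as base 10, map is assumed to treat its bounds as inclusive, and none of this is Solr code.

```python
import math


def popularity_boost(popularity):
    """Model of log(sum(popularity,1)), with log taken as base 10."""
    return math.log10(popularity + 1)


def map_value(x, lo, hi, target):
    """Model of map(x, min, max, target): values in [lo, hi] become target."""
    return target if lo <= x <= hi else x


def scale(values, target_min, target_max):
    """Model of scale(x, target_min, target_max) over all documents' values."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values equal: map everything to the lower target bound.
        return [float(target_min) for _ in values]
    width = target_max - target_min
    return [target_min + (v - lo) * width / (hi - lo) for v in values]


print(popularity_boost(9))       # 1.0
print(map_value(0, 0, 0, 1))     # 1  (a missing value stored as 0 becomes 1)
print(scale([0, 5, 10], 1, 2))   # [1.0, 1.5, 2.0]
```

Adding 1 before taking the log keeps the boost defined for documents whose popularity is 0.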
Solr: Boosted Query
Boosted Query
• Score is multiplied instead of added
  – New local params {!...} syntax added
  &q={!boost b=sqrt(popularity)}"super man"
• Parameter dereferencing in local params
  &q={!boost b=$boost v=$userq}&boost=sqrt(popularity)&userq="super man"
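Parameter dereferencing keeps the user's text in its own request parameter, which makes escaping simpler. The Python sketch below builds such a request and models the multiplication in plain arithmetic. The boosted_score function is a toy illustration of "multiplied instead of added", not Solr's scoring code.

```python
import math
from urllib.parse import urlencode


def boosted_score(base_score, popularity):
    """Toy model: the relevance score is multiplied by sqrt(popularity)."""
    return base_score * math.sqrt(popularity)


# Parameters from the slide; $boost and $userq refer to the other parameters.
params = {
    "q": "{!boost b=$boost v=$userq}",
    "boost": "sqrt(popularity)",
    "userq": '"super man"',
}
query_string = urlencode(params)

print(query_string)
print(boosted_score(2.0, 16))  # 8.0
```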
Solr: Facet Query
Facet Query
Faceted search breaks up search results into multiple categories
http://solr/select?q=foo&wt=json&indent=on
  &facet=true&facet.field=cat
  &facet.query=price:[0 TO 100]
  &facet.query=manu:IBM
{"response":{"numFound":26,"start":0,"docs":[…]},
 "facet_counts":{
   "facet_queries":{
     "price:[0 TO 100]":6,
     "manu:IBM":2},
   "facet_fields":{
     "cat":[
       "electronics",14,
       "memory",3,
       "card",2,
       "connector",2]
 }}}
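In the JSON above, each facet field comes back as a flat list that alternates value and count. A short Python sketch (with a hypothetical helper name) turns that list into a dictionary; the sample reproduces only the facet_counts part of the response.

```python
import json

# facet_counts section copied from the slide.
SAMPLE = """{"facet_counts": {
  "facet_queries": {"price:[0 TO 100]": 6, "manu:IBM": 2},
  "facet_fields": {"cat": ["electronics", 14, "memory", 3, "card", 2, "connector", 2]}
}}"""


def facet_field_counts(flat):
    """Pair up a flat [value, count, value, count, ...] list as a dict."""
    return dict(zip(flat[0::2], flat[1::2]))


facets = json.loads(SAMPLE)["facet_counts"]
print(facet_field_counts(facets["facet_fields"]["cat"]))
# {'electronics': 14, 'memory': 3, 'card': 2, 'connector': 2}
print(facets["facet_queries"]["manu:IBM"])  # 2
```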
Solr: Filter Query
Filter Query
• Filters are restrictions in addition to the query
• Use in faceting to narrow the results
• Filters are cached separately for speed
1. User queries for memory, query sent to Solr is
   &q=memory&fq=inStock:true&facet=true&…
2. User selects 1GB memory size
   &q=memory&fq=inStock:true&fq=size:1GB&…
3. User selects DDR2 memory type
   &q=memory&fq=inStock:true&fq=size:1GB&fq=type:DDR2&…
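The drill-down steps above amount to appending one fq parameter per selection while q stays unchanged. A minimal Python sketch, with a hypothetical helper name:

```python
from urllib.parse import urlencode


def drill_down(query, filters):
    """Build the query string: one q, one fq per selected filter, faceting on."""
    params = [("q", query)]
    params += [("fq", f) for f in filters]
    params.append(("facet", "true"))
    return urlencode(params)


selected = ["inStock:true"]
print(drill_down("memory", selected))
# q=memory&fq=inStock%3Atrue&facet=true

selected.append("size:1GB")    # user picks a memory size
selected.append("type:DDR2")   # user picks a memory type
print(drill_down("memory", selected))
```

Keeping each restriction in its own fq, rather than folding everything into q, lets each filter be cached and reused independently across requests.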
Demo!
Questions?