Needle in an enterprise haystack


Search engine integrations: needle in an enterprise haystack

1

Who am I?

Andrew Mleczko
Plone Integrator
RedTurtle Technology (Ferrara/Italy)
andrew.mleczko@redturtle.net

2

so why do you need an external search engine?

3

why do you need an external search engine...

• Plone's portal_catalog is slow with big sites (large number of indexed objects)

• You want to reduce Plone memory consumption (by removing heavy indexes like SearchableText)

• You want to query Plone's content from external applications

• You want to use advanced search features

4

there are several solutions that you can use

5

Plone external indexing and searching

• Out-of-the-box:

    • collective.gsa (Google Search Appliance)

    • collective.solr (Apache Solr)

• Custom integrations:

    • Solr

    • Tsearch2

http://www.flickr.com/photos/jenny-pics/3527749814

6

Solr?

http://www.flickr.com/photos/st3f4n/2767217547

7

a search engine based on Lucene

8

Lucene?


9

Full-text search library written 100% in Java

10

Solr: XML/HTTP and JSON interfaces, open source

11

collective.solr: Python API and Plone integration

12

Document format

solr:

<add><doc>
  <field name="id">123</field>
  <field name="title">The Trap</field>
  <field name="author">Agatha Christie</field>
  <field name="genre">thriller</field>
</doc></add>

collective.solr:

>>> conn = SolrConnection(host='127.0.0.1', ...)
>>> book = {'title': 'The Trap',
...         'author': 'Agatha Christie',
...         'genre': 'thriller'}
>>> conn.add(**book)

13
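To make the mapping between a Python dict and the Solr XML message concrete, here is a minimal sketch using only the standard library. `to_solr_add_xml` is a hypothetical helper written for this illustration; it is not part of collective.solr, which builds the message for you inside `conn.add(...)`.

```python
import xml.etree.ElementTree as ET


def to_solr_add_xml(record):
    """Serialize a flat dict into a Solr <add><doc> update message.

    Hypothetical helper for illustration only; collective.solr does
    this internally when conn.add(**record) is called.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in record.items():
        # One <field name="..."> element per dict key.
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


book = {"id": 123, "title": "The Trap",
        "author": "Agatha Christie", "genre": "thriller"}
xml_message = to_solr_add_xml(book)
print(xml_message)
```

The sketch handles flat, single-valued fields only; multi-valued fields and escaping rules beyond what ElementTree does by default are left out.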

Response format

solr:

<response><result numFound="2" start="0">
  <doc><str name="title">Coma</str>
       <str name="author">Robin Cook</str></doc>
  <doc><str name="title">The Trap</str>
       <str name="author">Agatha Christie</str></doc>
</result></response>

collective.solr:

>>> query = {'genre': 'thriller'}
>>> response = conn.search(q=query)
>>> results = SolrResponse(response).response
>>> results.numFound
2
>>> results[0].title
'Coma'
>>> results[0].author
'Robin Cook'

14
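For readers who want to see what the response wrapper has to do, the following sketch parses the XML response shown on this slide with the standard library. `parse_solr_response` is a hypothetical function, not the collective.solr `SolrResponse` class, and it assumes the simplified response above (only `<str>` fields, no response header).

```python
import xml.etree.ElementTree as ET

# The simplified response from the slide, with plain ASCII quotes.
RESPONSE = """<response><result numFound="2" start="0">
<doc><str name="title">Coma</str>
     <str name="author">Robin Cook</str></doc>
<doc><str name="title">The Trap</str>
     <str name="author">Agatha Christie</str></doc>
</result></response>"""


def parse_solr_response(xml_text):
    """Return (numFound, list of dicts) for a simplified Solr response."""
    root = ET.fromstring(xml_text)
    result = root.find("result")
    docs = []
    for doc in result.findall("doc"):
        # Each child element carries the field name in its "name" attribute.
        docs.append({field.get("name"): field.text for field in doc})
    return int(result.get("numFound")), docs


num_found, docs = parse_solr_response(RESPONSE)
print(num_found, docs[0]["title"], docs[0]["author"])
```

A real Solr response also carries typed elements such as `<int>` and `<arr>`, which this sketch does not convert.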

Who uses Solr/Lucene?

15

"Biblioteca Virtuale Italiana di Testi in Formato Alternativo"

16

Architecture

[Diagram: sources (Z39.50, web site, CSV) -> one retriever per source -> Books -> populators (solr, ...) -> search]

17

Retrievers

• they normalize every source into a single common format

• a source can be anything from a CSV file to a public web site

18

Public sites

• makes a query

• grabs HTML results

• transforms the HTML results into Python data using a configurable XPath parser
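The talk does not show the parser itself, so the following is only a sketch of what a configurable XPath step could look like. The page markup, the `CONFIG` layout and `parse_results` are all assumptions made for this example. It uses the standard library's limited XPath support and therefore needs well-formed markup; real-world HTML usually calls for a more tolerant HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed result page returned by a public site.
PAGE = """<html><body>
<div class="result"><h2>Coma</h2><span class="author">Robin Cook</span></div>
<div class="result"><h2>The Trap</h2><span class="author">Agatha Christie</span></div>
</body></html>"""

# Hypothetical per-site configuration: one XPath for a result row,
# plus one XPath per field, relative to that row.
CONFIG = {
    "row": ".//div[@class='result']",
    "fields": {"title": "h2", "author": "span[@class='author']"},
}


def parse_results(page, config):
    """Turn a result page into a list of dicts, driven by the config."""
    root = ET.fromstring(page)
    records = []
    for row in root.findall(config["row"]):
        record = {}
        for name, path in config["fields"].items():
            node = row.find(path)
            has_text = node is not None and node.text
            record[name] = node.text.strip() if has_text else None
        records.append(record)
    return records


records = parse_results(PAGE, CONFIG)
print(records)
```

Keeping the XPath expressions in data rather than in code means a new site needs a new configuration entry, not a new parser.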

19

Normalize it!

every Book needs to have minimal metadata:

• Title

• Description

• Authors

• Publisher

• Format

• ISBN

• ISSN

• Data
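A normalization step over the metadata fields listed on this slide might be sketched as follows. The field names are lower-cased versions of the slide's labels; the `normalize` function, the treatment of missing values as `None`, and the `;` separator for authors are assumptions, since the talk does not say how the project handles them.

```python
# Field names taken from the slide, lower-cased for use as dict keys.
BOOK_FIELDS = ("title", "description", "authors", "publisher",
               "format", "isbn", "issn", "data")


def normalize(raw):
    """Map one raw source record onto the common Book format.

    Assumptions for this sketch: unknown keys are dropped, missing or
    empty values become None, and authors always end up as a list.
    """
    book = {}
    for name in BOOK_FIELDS:
        value = raw.get(name)
        if isinstance(value, str):
            value = value.strip() or None
        book[name] = value

    authors = book["authors"]
    if authors is None:
        book["authors"] = []
    elif isinstance(authors, str):
        book["authors"] = [a.strip() for a in authors.split(";") if a.strip()]
    else:
        book["authors"] = list(authors)
    return book


# Hypothetical row as a CSV retriever might deliver it.
csv_row = {"title": " The Trap ", "authors": "Agatha Christie", "shelf": "B2"}
book = normalize(csv_row)
print(book)
```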

20

Populators

Today:

• only one solr populator

In the future:

• populate other sites

• populate an RDBMS

• ...
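One way to keep retrievers and populators independent is a small shared interface. The class and function names below are hypothetical and only illustrate the idea; in the project described here, the Solr populator would presumably call `conn.add(**book)` on a collective.solr connection instead of appending to a list.

```python
class Populator:
    """Hypothetical base class: receives normalized books."""

    def populate(self, books):
        raise NotImplementedError


class MemoryPopulator(Populator):
    """Stand-in target for this sketch; keeps books in a list."""

    def __init__(self):
        self.store = []

    def populate(self, books):
        self.store.extend(books)
        return len(books)


def run(retrievers, populators):
    """Feed every book from every retriever to every populator."""
    books = [book for retrieve in retrievers for book in retrieve()]
    for populator in populators:
        populator.populate(books)
    return books


# Two hypothetical retrievers, each returning already normalized books.
def csv_retriever():
    return [{"title": "Coma"}]


def site_retriever():
    return [{"title": "The Trap"}]


first, second = MemoryPopulator(), MemoryPopulator()
run([csv_retriever, site_retriever], [first, second])
print(len(first.store), len(second.store))
```

Adding an RDBMS target then means writing one more `Populator` subclass, without touching any retriever.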

21

Conclusions

• multiple retrievers, multiple populators

• we have used only the collective.solr SolrConnection API

• 120,000 books indexed in Solr so far; querying and indexing are extremely fast

22

tsearch2 ?


23

tsearch2: a search engine fully integrated in PostgreSQL 8.3.x

24

tsearch2 main features

• Flexible and rich linguistic support (dictionaries, stop words), thesaurus

• Full UTF-8 support

• Sophisticated ranking functions with support of proximity and structure information (rank, rank_cd)

• Rich query language with query rewriting support

• Headline support (text fragments with highlighted search terms)

• It is mature (5 years of development)

25

first steps with tsearch2

1. PostgreSQL >= 8.4 (but 8.3 will work as well)

2. COLUMN
   ALTER TABLE content ADD COLUMN search_vector tsvector;

3. INDEX
   CREATE INDEX search_index ON content USING gin(search_vector);

26

first steps with tsearch2

4. TRIGGER

CREATE FUNCTION fullsearch_trigger() RETURNS trigger AS $$
begin
  new.search_vector :=
    setweight(to_tsvector('pg_catalog.english',
      coalesce(new.subject,'')), 'A') ||
    setweight(to_tsvector('pg_catalog.english',
      coalesce(new.title,'')), 'B') ||
    setweight(to_tsvector('pg_catalog.english',
      coalesce(new.description,'')), 'C');
  return new;
end
$$ LANGUAGE plpgsql;

CREATE TRIGGER tsvectorupdate
  BEFORE INSERT OR UPDATE ON content
  FOR EACH ROW EXECUTE PROCEDURE fullsearch_trigger();
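Once the column, index and trigger are in place, a search is a single SQL statement. The sketch below only builds that statement and its parameters. `build_search_query` is a hypothetical helper; the `%s` placeholders assume a psycopg2-style driver, and the `content` table with `title` and `description` columns is taken from the trigger above.

```python
def build_search_query(terms, limit=10):
    """Return (sql, params) for a ranked full-text search on content.

    Assumes a DB-API driver with %s placeholders (psycopg2 style).
    """
    sql = (
        "SELECT title, "
        "ts_rank(search_vector, query) AS rank, "
        # ts_headline returns a text fragment with the matches highlighted.
        "ts_headline('pg_catalog.english', description, query) AS headline "
        "FROM content, "
        "plainto_tsquery('pg_catalog.english', %s) AS query "
        # @@ is the match operator between a tsvector and a tsquery.
        "WHERE search_vector @@ query "
        "ORDER BY rank DESC LIMIT %s"
    )
    return sql, (terms, limit)


sql, params = build_search_query("plone solr", limit=5)
print(sql)
print(params)
```

With a live connection this would be run as `cursor.execute(sql, params)`. Passing the search terms as a parameter rather than formatting them into the string keeps user input out of the SQL text.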

27

tsearch2: how to serialize Plone content to SQL?

28

ore.contentmirror: "it focuses and supports out of the box, content deployment to a relational database"

29

how to add tsearch2 to ore.contentmirror ddl?


30

How to add tsearch2 to the ore.contentmirror DDL?

>>> from ore.contentmirror.schema import content
>>> def setup_search(event, schema_item, bind):
...     bind.execute("alter table content add "
...                  "column search_vector tsvector")
...
>>> content.append_ddl_listener('after-create', setup_search)

31

Geco - community portal for Italian youth

32

Geco

• Started in 2009 for Emilia-Romagna

• Multiple content types, including video, polls, articles and more

33

Geco

• 95 editors (Emilia-Romagna)

• 100,000 documents (Emilia-Romagna)

• This year: 2 more regions join

• Future: all 20 regions join the project

• Every region has its own server deployment

34

Objectives

✓ fast and efficient search engine that can integrate multiple different Plone sites

✓ search results should be ordered by rank

✓ content should be serialized in SQL so it can be reused by other applications (ratings, comments)

35

rt.tsearch2

• integrates tsearch2 in PostgreSQL

• extends SQLAlchemy queries with rank sorting

>>> rank = '{0,0.05,0.05,0.9}'
>>> term = 'Ferrara'
>>> query = query.order_by(desc(
...     "ts_rank('%s', Content.search_vector, "
...     "to_tsquery('%s'))" % (rank, term)))

36
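The `rank` string above is the optional weights array of PostgreSQL's `ts_rank`, given in the order {D, C, B, A}. Since the trigger shown earlier labels subject as 'A', title as 'B' and description as 'C', these weights make subject matches dominate the ordering. The helpers below are hypothetical and only sketch how the literal and a parameterized ORDER BY clause could be built; the `%s` placeholders again assume a psycopg2-style driver.

```python
def rank_weights(a=1.0, b=0.4, c=0.2, d=0.1):
    """Build the ts_rank weights literal, which is ordered {D,C,B,A}.

    The defaults are the ones PostgreSQL uses when no weights are given.
    """
    return "{" + ",".join(format(w, "g") for w in (d, c, b, a)) + "}"


def rank_order_clause(weights, term):
    """Return (sql, params) for ordering by weighted rank.

    Parameters are passed separately instead of being formatted into
    the SQL text.
    """
    sql = ("ORDER BY ts_rank(%s::float4[], search_vector, "
           "to_tsquery(%s)) DESC")
    return sql, (weights, term)


weights = rank_weights(a=0.9, b=0.05, c=0.05, d=0)
sql, params = rank_order_clause(weights, "Ferrara")
print(weights)
print(sql, params)
```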

Conclusions

http://www.flickr.com/photos/vramak/3499502280

37

Conclusions

✓ Integrating an external search engine with Plone is easy!

✓ You can find a solution that suits your needs!

38

Questions

Andrew Mleczko
RedTurtle Technology
andrew.mleczko@redturtle.net

39

Thank you.

40
