Needle in an enterprise haystack

search engine integrations

Who am I?

Andrew Mleczko
Plone Integrator
RedTurtle Technology (Ferrara/Italy)
andrew.mleczko@redturtle.net

so why do you need an external search engine?


• Plone's portal_catalog is slow with big sites (large number of indexed objects)

• You want to reduce Plone memory consumption (by removing heavy indexes like SearchableText)

• You want to query Plone's content from external applications

• You want to use advanced search features

there are several solutions that you can use

Plone external indexing and searching

• Out-of-the-box:

• collective.gsa (Google Search Appliance)

• collective.solr (Apache Solr)

• Custom integrations:

• Solr

• Tsearch2

http://www.flickr.com/photos/jenny-pics/3527749814

http://www.flickr.com/photos/st3f4n/2767217547

a search engine based on Lucene

Lucene?

Full-text search library, 100% in Java

Solr: XML/HTTP and JSON interface, open source

collective.solr: Python API and Plone integration

Document format

solr

<add><doc>
  <field name="id">123</field>
  <field name="title">The Trap</field>
  <field name="author">Agatha Christie</field>
  <field name="genre">thriller</field>
</doc></add>

collective.solr

>>> from collective.solr.solr import SolrConnection
>>> conn = SolrConnection(host='127.0.0.1', ...)
>>> book = {'title': 'The Trap',
...         'author': 'Agatha Christie',
...         'genre': 'thriller'}
>>> conn.add(**book)
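To make the link between the Python dictionary and the XML document format concrete, here is a minimal sketch that serializes keyword arguments into an `<add><doc>` message using only the standard library. It is not how collective.solr builds its requests; `build_add_message` is a hypothetical helper written for this illustration, and the `id` value is taken from the XML slide.

```python
import xml.etree.ElementTree as ET


def build_add_message(**fields):
    """Serialize keyword arguments as a Solr XML <add><doc> message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        # Each key/value pair becomes one <field name="...">value</field>.
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


book = {"id": 123, "title": "The Trap",
        "author": "Agatha Christie", "genre": "thriller"}
message = build_add_message(**book)
print(message)
```

The printed message is the same document as on the XML slide, on a single line.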

Response format

solr

<response>
  <result ...>
    <doc>
      ...
      <str name="author">Robin Cook</str></doc>
    <doc>
      ...
      <str name="author">Agatha Christie</str></doc>
</result></response>

collective.solr

>>> query = {'genre': 'thriller'}

>>> response = conn.search(q=query)

>>> results = SolrResponse(response).response

>>> results.numFound

>>> results[0].title

'Coma'

>>> results[0].author

'Robin Cook'
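As a rough illustration of what a response wrapper has to do, the sketch below parses a Solr-style XML response into plain Python values with the standard library. It does not reproduce collective.solr's `SolrResponse`; `parse_response` is a hypothetical helper, and the sample response (including `numFound="2"`) is assumed for the example.

```python
import xml.etree.ElementTree as ET

# Assumed sample response, modelled on the two books shown on the slides.
SAMPLE_RESPONSE = """
<response>
  <result name="response" numFound="2" start="0">
    <doc>
      <str name="title">Coma</str>
      <str name="author">Robin Cook</str>
    </doc>
    <doc>
      <str name="title">The Trap</str>
      <str name="author">Agatha Christie</str>
    </doc>
  </result>
</response>
"""


def parse_response(xml_text):
    """Return (numFound, [one dict per <doc>]) from a Solr XML response."""
    root = ET.fromstring(xml_text.strip())
    result = root.find("result")
    docs = [{field.get("name"): field.text for field in doc}
            for doc in result.findall("doc")]
    return int(result.get("numFound")), docs


num_found, docs = parse_response(SAMPLE_RESPONSE)
print(num_found, docs[0]["title"], docs[0]["author"])
```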


Who uses Solr/Lucene?

"Biblioteca Virtuale Italiana di Testi in Formato Alternativo"

Architecture

[Architecture diagram: sources (Z39.50, web site), retriever, populators (solr, ...), search]

Retrievers

• they normalize every source into a single common format

• a source can be anything from a CSV file to a public website

Public sites

• the retriever makes a query

• grabs the HTML results

• transforms the HTML results into Python data using a configurable XPath parser
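The steps above can be sketched as follows. This is an illustration only, not the project's parser: the page markup, `SOURCE_CONFIG` and `parse_results` are all hypothetical, and the standard-library parser used here requires well-formed markup, so real-world HTML would need a more lenient parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed result page standing in for a site's HTML.
RESULTS_PAGE = """
<html><body>
  <div class="result">
    <h2>The Trap</h2>
    <span class="author">Agatha Christie</span>
  </div>
  <div class="result">
    <h2>Coma</h2>
    <span class="author">Robin Cook</span>
  </div>
</body></html>
"""

# Per-source configuration: where the records are and where each field lives.
SOURCE_CONFIG = {
    "record": ".//div[@class='result']",
    "fields": {
        "title": "./h2",
        "author": "./span[@class='author']",
    },
}


def parse_results(page, config):
    """Apply the configured XPath expressions and return a list of dicts."""
    root = ET.fromstring(page.strip())
    records = []
    for node in root.findall(config["record"]):
        record = {}
        for name, path in config["fields"].items():
            found = node.find(path)
            has_text = found is not None and found.text
            record[name] = found.text.strip() if has_text else None
        records.append(record)
    return records


records = parse_results(RESULTS_PAGE, SOURCE_CONFIG)
print(records)
```

Supporting a new site then means writing a new configuration dictionary rather than new parsing code.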

Normalize it!

Every book needs to have minimal metadata:

• Title

• Description

• Authors

• Publisher

• Format

• ISBN

• ISSN

• Data
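One possible shape for such a normalized record is sketched below. The field names follow the list above; which fields are mandatory, the `normalize` helper and the `;`-separated author string are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Book:
    """Common record every retriever produces (field names from the slide)."""
    title: str  # assumption: only the title is treated as mandatory here
    description: str = ""
    authors: List[str] = field(default_factory=list)
    publisher: str = ""
    format: str = ""
    isbn: Optional[str] = None
    issn: Optional[str] = None
    data: Optional[str] = None


def normalize(raw):
    """Map one raw source dict onto a Book, trimming stray whitespace."""
    authors = raw.get("authors") or []
    if isinstance(authors, str):
        # Assumption: some sources give authors as one ';'-separated string.
        authors = authors.split(";")
    return Book(
        title=raw["title"].strip(),
        description=raw.get("description", "").strip(),
        authors=[a.strip() for a in authors if a.strip()],
        publisher=raw.get("publisher", "").strip(),
        format=raw.get("format", "").strip(),
        isbn=raw.get("isbn"),
        issn=raw.get("issn"),
        data=raw.get("data"),
    )


book = normalize({"title": " The Trap ", "authors": "Agatha Christie"})
print(book)
```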

Populators

Today:

• only one solr populator

In the future:

• populate other sites,

• populate RDBMS

• ...
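The retriever/populator split can be sketched as a small pipeline: every retriever yields normalized records and every populator receives all of them. All names here (`ListPopulator`, `run_pipeline`, the two retrievers) are hypothetical; the in-memory populator stands in for the Solr populator so the example runs without a server.

```python
class ListPopulator:
    """Stand-in populator that keeps records in memory instead of Solr."""

    def __init__(self):
        self.records = []

    def populate(self, records):
        self.records.extend(records)


def csv_retriever():
    # Hypothetical retriever: would read and normalize rows from a CSV file.
    return [{"title": "The Trap", "author": "Agatha Christie"}]


def site_retriever():
    # Hypothetical retriever: would query a public site and parse its HTML.
    return [{"title": "Coma", "author": "Robin Cook"}]


def run_pipeline(retrievers, populators):
    """Feed the records of every retriever to every populator."""
    count = 0
    for retrieve in retrievers:
        records = list(retrieve())
        for populator in populators:
            populator.populate(records)
        count += len(records)
    return count


first, second = ListPopulator(), ListPopulator()
count = run_pipeline([csv_retriever, site_retriever], [first, second])
print(count, [record["title"] for record in first.records])
```

Adding a populator for another site or an RDBMS would only require another object with a `populate` method.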

Conclusions

• multiple retrievers, multiple populators

• we have used only the collective.solr SolrConnection API

• 120,000 books indexed so far in Solr; querying and indexing are extremely fast

tsearch2?

search engine fully integrated in PostgreSQL 8.3.x

tsearch2 main features

• Flexible and rich linguistic support (dictionaries, stop words), thesaurus

• Full UTF-8 support

• Sophisticated ranking functions with support of proximity and structure information (rank, rank_cd)

• Rich query language with query rewriting support

• Headline support (text fragments with highlighted search terms)

• It is mature (5 years of development)

first steps with tsearch2

1. PostgreSQL >= 8.4 (but 8.3 will work as well)

2. COLUMN
ALTER TABLE content ADD COLUMN search_vector tsvector;

3. INDEX
CREATE INDEX search_index ON content USING gin(search_vector);


4. TRIGGER

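CREATE FUNCTION fullsearch_trigger() RETURNS trigger AS $$
BEGIN
  -- subject gets the highest weight (A), description the lowest (C)
  new.search_vector :=
    setweight(to_tsvector('pg_catalog.english',
      coalesce(new.subject,'')), 'A') ||
    setweight(to_tsvector('pg_catalog.english',
      coalesce(new.title,'')), 'B') ||
    setweight(to_tsvector('pg_catalog.english',
      coalesce(new.description,'')), 'C');
  return new;
END
$$ LANGUAGE plpgsql;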

CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE ON content FOR EACH ROW EXECUTE PROCEDURE fullsearch_trigger();
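The weighted expression in the trigger follows a regular pattern: one `setweight(to_tsvector(...))` term per column, joined with `||`. The sketch below generates that expression from a list of (column, weight) pairs; `weighted_tsvector_sql` is a hypothetical helper, not part of any library mentioned here, and it interpolates identifiers directly, so it should only be fed trusted configuration.

```python
def weighted_tsvector_sql(columns, config="pg_catalog.english", row="new"):
    """Build the '||'-joined setweight(to_tsvector(...)) SQL expression.

    columns is a list of (column_name, weight) pairs, weight in 'A'..'D'.
    """
    parts = []
    for column, weight in columns:
        if weight not in ("A", "B", "C", "D"):
            raise ValueError("weight must be one of A, B, C, D")
        parts.append(
            "setweight(to_tsvector('%s', coalesce(%s.%s,'')), '%s')"
            % (config, row, column, weight)
        )
    return " ||\n".join(parts)


expr = weighted_tsvector_sql(
    [("subject", "A"), ("title", "B"), ("description", "C")]
)
print(expr)
```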

tsearch2

how to serialize Plone content to SQL?

ore.contentmirror

"it focuses and supports out of the box, content deployment to a relational database"

how to add tsearch2 to the ore.contentmirror DDL?

>>> from ore.contentmirror.schema import content
>>> def setup_search(event, schema_item, bind):
...     bind.execute("alter table content add "
...                  "column search_vector tsvector")
...
>>> content.append_ddl_listener('after-create', setup_search)

Geco - community portal for Italian youth

• Started in 2009 for Emilia-Romagna

• Multiple content types, including video, polls, articles and more

• 95 editors (Emilia-Romagna)

• 100,000 documents (Emilia-Romagna)

• This year: 2 other regions join

• Future: all 20 regions join the project

• Every region has its own server deployment

Objectives

✓ fast and efficient search engine that can integrate multiple different Plone sites

✓ search results should be ordered by rank

✓ content should be serialized in SQL so it can be reused by other applications (ratings, comments)

rt.tsearch2

• integrates tsearch2 in PostgreSQL

• extends SQLAlchemy queries with rank sorting


>>> from sqlalchemy import desc
>>> rank = '{0,0.05,0.05,0.9}'
>>> term = 'Ferrara'
>>> query = query.order_by(desc(
...     "ts_rank('%s', Content.search_vector, to_tsquery('%s'))"
...     % (rank, term)))
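The `rank` string above is a PostgreSQL array of four weights in the order {D, C, B, A}, so `'{0,0.05,0.05,0.9}'` gives A-labelled text (the subject, in the trigger shown earlier) a weight of 0.9. A small helper makes that ordering explicit; `rank_weights` is a hypothetical name, and its defaults mirror PostgreSQL's documented default weights.

```python
def rank_weights(a=1.0, b=0.4, c=0.2, d=0.1):
    """Format ts_rank weights as a PostgreSQL array literal, ordered {D,C,B,A}."""
    return "{%s}" % ",".join("%g" % weight for weight in (d, c, b, a))


# Same weights as on the slide: A dominates, B and C contribute little.
rank = rank_weights(a=0.9, b=0.05, c=0.05, d=0)
print(rank)
```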

Conclusions

http://www.flickr.com/photos/vramak/3499502280

✓ Integrating an external search engine with Plone is easy!

✓ You can find a solution that suits your needs!

Questions

Andrew Mleczko
RedTurtle Technology
andrew.mleczko@redturtle.net

Thank you.
