Letting In The Light
Using Solr as an External Search Component
Jay Luker
Benoit Thiell
SAO/NASA Astrophysics Data System
http://adsabs.harvard.edu/
The SAO/NASA Astrophysics Data System (ADS) is a Digital Library portal for researchers in Astronomy and Physics, operated by the Smithsonian Astrophysical Observatory (SAO) under a NASA grant.
Here's what to expect...
Overview of ADS
Overview of Invenio
Our Solr-Invenio Integration Project
A few tips on Solr hacking along the way
The ADS Project
Established in 1989 (before the web!) as a portal for accessing astronomical data and bibliographic metadata
Was restructured in 1994 to become an A&I service for astronomers and astrophysicists, with fulltext archive
Has 100% penetration in astronomical community, with take-up in other areas of space sciences, engineering and physics
1994 was the move to the web
ADS Holdings
Almost 9M bibliographic metadata records
625K fulltext articles
Painstakingly curated collection of citations and links to fulltext and data products
ADS Services
Free!
Search, Browse, Notifications, Personalization
API access to all content (TWITA)
Network of 12 mirror sites
ADS Labs: http://labs.adsabs.harvard.edu
Astronomy: 1.8M, Physics: 5.8M, arXiv e-prints: 650K
Citations: 40M (over 3.4M papers with citations)
Curated links: 23M (fulltext, data products, citations)
4M scanned pages, 625K articles, 650K pages of historical material
Advanced search allows for searching by astronomical object (via SIMBAD) and attributes like "has dataset"
TWITA = The Website Is The API: via the data_type= param, also structured metadata within the pages
Never heard of Invenio?
1993: Started its life at CERN as a preprint server
2000: Extension of the server to allow storing multimedia content (photos, posters, brochures, videos) and creation of the open-source CDSware project
Renamed CDS Invenio and then Invenio
Both an institutional repository and a digital library
Check it out! http://invenio-software.org/
Why choose Invenio?
ADS and Invenio share the same objectives: store and disseminate information to scientific communities
Growing penetration in the field of physics
Metadata curation tools (record editor, merger)
Support for citation graphs and citation-based searches
Support for second-order searches
INSPIRE: Invenio for SPIRES, the Physics database at Stanford.
Under the hood
Written in Python (served via mod_wsgi), with some C and Lisp
Coupled with MySQL only (for now)
Scales to sets of 2M+ records
MARC storage of records
Modular architecture with: OAI harvesting, OAI server
Format conversion (MARCXML, DC, NLM, etc)
References and citations handler
Plot and figure extraction
invenio.intbitset
Sets of Invenio record IDs (MARC controlfield 001)
In-house C implementation of Python sets
Fast dumping and loading marshalling functions
Stored marshalled in the database and used as such in the search engine
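The idea behind intbitset can be sketched in plain Python: record IDs become set bits in a bitmap, and zlib keeps the dump compact. This is an illustration of the concept only; dump_ids/load_ids are made-up names and the byte layout is an assumption, not intbitset's actual fastdump format.

```python
import zlib

def dump_ids(ids):
    """Pack integer record IDs into a bitmap (bit i set <=> id i present),
    then zlib-compress it for storage or transport."""
    size = (max(ids) // 8 + 1) if ids else 0
    buf = bytearray(size)
    for i in ids:
        buf[i // 8] |= 1 << (i % 8)
    return zlib.compress(bytes(buf))

def load_ids(blob):
    """Inverse of dump_ids: recover the set of record IDs."""
    return {i * 8 + b
            for i, byte in enumerate(zlib.decompress(blob))
            for b in range(8) if byte & (1 << b)}
```

Set operations on such bitmaps (union, intersection) are what makes this representation fast for combining large result sets.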
Invenio sounds great! Why use Solr then?
Invenio's search engine has trouble with 9M+ records (work in progress)
Invenio's indexing is slow by design (it trades indexing time for search speed), but it is too slow for such a large repository
Solr has a wide community of users/developers and lots of extensions.
Issues with the integration
Keeping the metadata on both systems in sync
Invenio's search engine requires full sets of results
Communicating over HTTP means very large payloads
Invenio + Solr
Objectives
Take advantage of Solr fulltext
indexing & searching
Take advantage of Solr faceting
Not duplicate existing Invenio functionality
Write as little code as possible
Keep things loosely coupled
Obviously, performance was also an objective. The Invenio team had been skeptical of the necessity of incorporating an external tool/service to do fulltext indexing and/or faceting, but once introduced to Solr they quickly came around. In spite of the fact that at least some of the fancypants sorting, ranking, and filtering functionality could most likely be reproduced using Solr, there was a strong reluctance to rewrite that code. Writing as little Java as possible doesn't just come from a Java-phobic frame of mind; it's also about limiting how much we rely on custom Solr components. Rely as much as possible on what Solr affords. Loose integration in this case means the ability to swap in alternate services for retrieving fulltext search results and facets. More on how we succeeded in that towards the end.
Problem #1: Retrieving a very large result set of ids. Like, millions.
The WTH Approach
http://myhost:8983/solr/select?q={foo}&fl=id&rows={n}
Query for foo
Only return the id field
Return n rows of the result
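In modern Python the brute-force request can be sketched like this (the host and core path come from the slide's example URL, not a real endpoint, and the huge default for rows is a stand-in for "everything"):

```python
import json
import urllib.request
from urllib.parse import urlencode

def build_id_query(query, rows, solr_url="http://myhost:8983/solr/select"):
    """Build the naive 'return every matching id' request URL."""
    params = {"q": query, "fl": "id", "rows": rows, "wt": "json"}
    return "%s?%s" % (solr_url, urlencode(params))

def fetch_all_ids(query, rows=10000000):
    """Pull the full result set and keep only the schema ids."""
    with urllib.request.urlopen(build_id_query(query, rows)) as resp:
        data = json.load(resp)
    return [doc["id"] for doc in data["response"]["docs"]]
```

It works, but as the benchmark that follows shows, it gets painfully slow once the result set reaches the millions.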
(A bit about ids)
Schema ids
Defined in your schema.xml; can be integers, strings, etc.; typically set as the uniqueKey
Lucene ids
Internal to Lucene; always integers; unique within an index segment
When we talk about the ids being sent back and forth between Invenio & Solr we are talking about the schema ids.
The WTH Approach
[Benchmark chart: response time in seconds vs. result-set size*]
* warmed cache, different servers, same LAN
So what's going on here? Our first thought was maybe it was the time needed to serialize/de-serialize the response, but that turned out not to be it.
So what's going on here?
[Diagram: to build the Query Response, each doc id in the QueryResult list [1,5,16,84,...] is fetched through the documentCache as a full Lucene Doc (id: 1234, bibcode, title, ...)]
Relevant settings: queryResultMaxDocsCached, queryResultWindowSize, enableLazyFieldLoading
Solution: Custom Collector
[Diagram: the custom collector builds the Query Response directly from the QueryResult id list [1,5,16,84,...], bypassing the documentCache]
...
InvenioIdCollector collector = new InvenioIdCollector();
searcher.search(query, collector);
ArrayList ids = collector.getIds();
rsp.add("ids", ids);
...
MyQueryComponent.java
...
ArrayList ids = new ArrayList();
...
public void collect(int doc) {
    this.ids.add(this.idMap[doc]);
}
...
MyCollector.java
Solution: Custom Collector
OK, Let's Try This Again
http://myhost:8983/solr/select?q={foo}&qt=my_querytype
Query for foo
Use our custom query handler
No need to specify number of rows or which fields to return
Better. But ...
Problem #2: Facets.
Solr
Query → Processing → Post-processing → Return/Render
Fulltext Search
Record Ids
Invenio
What's Missing?
Post-processing = 2nd-order searching, filtering. We can't retrieve facets with the initial query because the final list of search results depends on Invenio post-processing. So how do you send a very large set of ids back to Solr to get a set of facet results?
Solr
Query → Processing → Post-processing → Return/Render
Fulltext Search
Record ids
Invenio
Again, WTH?
Record ids?
Facets
Solr
Query → Processing → Post-processing → Return/Render
Fulltext Search
InvenioBitSet
Invenio
Current Solution
InvenioBitSet
Facets
Satisfies almost all objectives. We get searching & faceting. We don't have to write a lot of Python or Java: Invenio needs only the indexing piece. We're not duplicating anything that Invenio already does very well. And it's loosely coupled: because communication is in a form that is native to Invenio, we could easily swap in/out different services for either piece.
Parts Required
Custom QueryComponent for accepting a fulltext search query and returning an Integer BitSet
Custom Collector to collect doc ids
Custom BitSet class (maybe)
Custom BinaryResponseWriter
Custom QueryComponent for accepting an Integer BitSet query and returning facets
Seems like a lot, but in total lines of code it's not that much, especially considering it's in Java. Plus, I suck at Java and I was able to do it all in 2-3 weeks of trial-and-error hacking. It also very closely conforms to the affordances of the Solr API; only one small thing might be considered a hack.
Invenio Query Component Config
Response writer: bitset_stream
Search components: invenio_query, stats
...
solrconfig.xml
Defining our custom query component and telling the default solr search handler to use itAlso defining our custom response writer
Invenio Query Component
public void process(ResponseBuilder rb) throws IOException {
    SolrQueryResponse rsp = rb.rsp;
    SolrIndexSearcher searcher = rb.req.getSearcher();

    InvenioIdCollector collector = new InvenioIdCollector();

    SolrIndexSearcher.QueryCommand cmd = rb.getQueryCommand();
    Query query = cmd.getQuery();

    searcher.search(query, collector);
    InvenioBitSet bitset = collector.getBitSet();
    rsp.add("bitset", bitset);
}
InvenioQueryComponent.java
A query component class has two opportunities to interact with the incoming request: prepare & process. We only need process.
Invenio Id Collector
public void setNextReader(IndexReader reader, int docBase)
        throws IOException {
    this.reader = reader;
    this.docBase = docBase;
    try {
        this.idMap = FieldCache.DEFAULT.getInts(this.reader, "id");
    } catch (IOException e) {
        SolrException.logOnce(SolrCore.log,
            "Exception during idMap init", e);
    }
}
InvenioIdCollector.java
Response Writer
public void write(OutputStream out, SolrQueryRequest req,
        SolrQueryResponse rsp) {
    InvenioBitSet bitset =
        (InvenioBitSet) rsp.getValues().get("bitset");
    ZOutputStream zOut = new ZOutputStream(out, JZlib.Z_BEST_SPEED);
    try {
        zOut.write(bitset.toByteArray());
        zOut.flush();
    } catch (IOException e) {
        SolrException.logOnce(SolrCore.log,
            "Exception during compression/output of bitset", e);
    }
}
InvenioBitsetStreamResponseWriter.java
These times include decompressing and unmarshalling the bitset into an invenio intbitset object in python
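On the Python side, unwrapping the writer's payload is just decompression plus a walk over the set bits. Invenio's intbitset.fastload does the real work; a dependency-free sketch of the idea (the bit order within each byte is an assumption here):

```python
import zlib

def bits_from_payload(blob):
    """Decompress a zlib-packed bitmap and yield the positions of set
    bits, i.e. the record IDs encoded in the response."""
    for i, byte in enumerate(zlib.decompress(blob)):
        for b in range(8):
            if byte & (1 << b):
                yield i * 8 + b
```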
Invenio Facet Component Config
Defaults: json, OR, 0, true, author_facet, ...
Search components: invenio_facets, facet
solrconfig.xml
Defining our custom facet component and telling the solr search handler to use it
A bit of python
r = urllib2.Request(facet_query_url)
data = bitset.fastdump()
boundary = mimetools.choose_boundary()

contents = '--%s\r\n' % boundary
contents += 'Content-Disposition: form-data; ' \
    + 'name="bitset"; filename="bitset"\r\n'
contents += 'Content-Type: application/octet-stream\r\n'
contents += '\r\n' + data + '\r\n'
contents += '--%s--\r\n\r\n' % boundary
r.add_data(contents)

r.add_unredirected_header('Content-Type',
    'multipart/form-data; boundary=%s' % boundary)

u = urllib2.urlopen(r)
facet_data = simplejson.load(u)
Facet Query Component
...
Iterable streams = req.getContentStreams();
...
InputStream is = stream.getStream();
ByteArrayOutputStream bOut = new ByteArrayOutputStream();
ZInputStream zIn = new ZInputStream(is);

IOUtils.copy(zIn, bOut);
InvenioBitSet bitset = new InvenioBitSet(bOut.toByteArray());
...
InvenioFacetComponent.java
Facet Query Component (cont.)
...
BitDocSet docSetFilter = new BitDocSet();
int i = 0;
while (bitset.nextSetBit(i) != -1) {
    int nextBit = bitset.nextSetBit(i);
    int lucene_id = idMap.get(nextBit);
    docSetFilter.add(lucene_id);
    i = nextBit + 1;
}
...
SolrIndexSearcher.QueryCommand cmd = rb.getQueryCommand();
cmd.setFilter(docSetFilter);
SolrIndexSearcher.QueryResult result = new SolrIndexSearcher.QueryResult();
searcher.search(result, cmd);
rb.setResult(result);
...
InvenioFacetComponent.java
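The loop above converts each set schema-id bit into a Lucene doc id before handing Solr a filter. Stripped of the Solr types, the mapping looks like this (invert_field_cache and facet_filter are illustrative names; id_array stands in for the FieldCache int array, indexed by Lucene doc id):

```python
def invert_field_cache(id_array):
    """FieldCache gives lucene_doc_id -> schema_id; the filter
    construction needs the inverse mapping."""
    return {schema_id: doc for doc, schema_id in enumerate(id_array)}

def facet_filter(set_bits, id_array):
    """Translate set schema-id bits into the Lucene doc ids to facet over."""
    id_map = invert_field_cache(id_array)
    return sorted(id_map[schema_id] for schema_id in set_bits)
```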
Alternative Approaches
PyLucene
Embedded Solr
CPython within Java
...
PyLucene is a Python wrapper around Java Lucene. It embeds a Java VM with Lucene into a Python process. The extension is machine-generated with JCC, a C++ code generator that makes it possible to call into Java classes from Python via Java's Native Invocation Interface (JNI).
Further Study
Can we make use of Solr's OpenBitSet?
Is there a way to bypass the Collector stage completely?
How can we return document scores?
Thanks!
Thanks also to: The ADS Team, @adsabs
The Invenio Team, especially...
Roman Chyla
Jan Iwaszkiewicz
https://github.com/lbjay/solr-invenio