War stories from building GPSN, a US Federal site for searching China's patents.
BUILDING A LIGHTWEIGHT DISCOVERY INTERFACE FOR CHINESE PATENTS
New York Solr/Lucene Meetup
ERIC PUGH | [email protected] | @dep4b
Who am I?
• Principal of OpenSource Connections - Solr/Lucene Search Consultancy http://bit.ly/OSCCommercialSummary
• Member of Apache Software Foundation
• SOLR-284 UpdateRichDocuments (July 07)
• Fascinated by the art of software development
Co-Author
Next Edition June!
Congrats to Trey and Tim!
Agilista
Selected Customers
Telling some war stories
• First USPTO application in “the cloud”
• Simple, and discoverable
• Expresses our philosophy of “Cloud meets Ocean”
• Check it out at http://gpsn.uspto.gov
Telling some stories
➡How to inject “Discovery” into your app
• The Cloud to the Rescue (sorta!)
• Parsers and Parsers and Parsers
• Don’t be Afraid to Share!
Flow of understanding
Data, Understanding, Information
Building “Discovery”
Engine
UX
Data
Tension
Grok data at gut level
Look for outliers
User Interviews
Surveys
Card Sorting
Scenarios/Personas
UX
Data
Brainstorm
Mockups
Proof of concept
Where to spend time?
UX / Engine / Data: 40% / 20% / 40%
We spent: 40% / 40% / 20%
Telling some stories
• How to inject “Discovery” into your app
➡The Cloud to the Rescue (sorta!)
• Parsers and Parsers and Parsers
• Don’t be Afraid to Share!
Boy meets Girl Story
Metadata
Ingest Pipeline
Discovery UX
Content Files
Nothing but JS and Solr!
• Updates are quarterly
• User state in browser
• Solr is the “RESTful” API ;-)
• KISS: EmberJS + Solr
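A minimal sketch of what "Solr is the API" can look like in practice: the client only needs to assemble a GET against the /select handler and read back JSON. The base URL, core name (`patents`), and facet field (`country`) below are hypothetical placeholders, not GPSN's actual configuration. The sketch is in Java for consistency with the detector code later in the deck, while the real client is EmberJS.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch only: builds a Solr /select URL with standard query parameters.
// Host, core name, and field names are assumptions made for illustration.
class SolrSelectUrl {

    static String build(String baseUrl, String query, String facetField, int rows) {
        // LinkedHashMap keeps the parameters in a predictable order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", query);                 // user query, Lucene syntax allowed
        params.put("wt", "json");               // ask Solr for a JSON response
        params.put("rows", Integer.toString(rows));
        params.put("facet", "true");            // enable faceting for the discovery UI
        params.put("facet.field", facetField);

        String queryString = params.entrySet().stream()
                .map(e -> e.getKey() + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
        return baseUrl + "/select?" + queryString;
    }

    public static void main(String[] args) {
        System.out.println(build("http://localhost:8983/solr/patents", "golf glove", "country", 10));
    }
}
```

Because all state lives in the URL, the same string can double as a sharable link to a search.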
How we built it
EmberJS Single Page Search App
HTML
XML
JSON
Server Dashboard
GPSN UI (Bootstrap CSS)
Browsers
Mobile/Tablet
Third Party Application
Servers
S3 Bucket
Solr
Yes, Solr is hanging out there on the Net…
• Using Jetty container security to lock down everything but the /select handler.
• Yes, the /admin interface appears to load, but no panels load.
• Go ahead, do a delete query! I dare you. Actually, please don’t. ;-)
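The "everything but /select" rule boils down to an allow-list check. The sketch below is not the Jetty configuration GPSN uses and not the solr-security-proxy code; it is a hypothetical Java filter showing the idea. The path `/solr/select` and the particular deny-list are assumptions. The denied names are real Solr request parameters that can reroute a request or pull in remote content, but the list is illustrative rather than complete.

```java
import java.util.Set;

// Sketch only: decide whether a request may be forwarded to a public Solr.
// The allowed path and the deny-list are illustrative assumptions.
class SelectOnlyFilter {

    // Only the read-only search handler is reachable from outside.
    private static final Set<String> ALLOWED_PATHS = Set.of("/solr/select");

    // Parameters that could switch handlers or make Solr fetch other content.
    private static final Set<String> DENIED_PARAMS =
            Set.of("qt", "stream.url", "stream.file", "stream.body", "shards");

    static boolean isAllowed(String path, Set<String> paramNames) {
        if (!ALLOWED_PATHS.contains(path)) {
            return false; // /update, /admin, and everything else is rejected
        }
        for (String name : paramNames) {
            if (DENIED_PARAMS.contains(name)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("/solr/select", Set.of("q", "rows")));
        System.out.println(isAllowed("/solr/update", Set.of("stream.body")));
    }
}
```

Rejecting by default and allowing one handler is easier to reason about than trying to list every dangerous path.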
Single 550 GB index
• Solr + Index are in an Amazon AMI image.
• Currently running two independent Solrs.
• Optimize works! Still.
• Elastic Load Balancer + AutoScale spins up more Solrs if needed.
• Threw lots of “provisioned IOPS” at VM
A better security proxy from Alex?
https://github.com/dergachev/solr-security-proxy
Spyglass
• EmberJS based Widget framework
• List of Results
• Facets
• Autocomplete
• “Deploy” is just .html + .js. S3 bucket!
• Tooling is a pain. EmberJS is complex!
Better than AjaxSolr!
Key scaling concept behind GPSN:
Cloud meets Ocean
More prosaically…
Database
Server (×3)
Client (×3)
$ (×4)
Lessons Learned
Don’t Move Files
• Copying 5 TB data up to S3 was very painful.
• We used S3Funnel which is “rsync like”
• We bought more network bandwidth for our office
Never underestimate
the bandwidth of a station wagon
full of tapes hurtling down the highway.
–Andrew Tanenbaum, 1981
Data Size
Chart: Patent Count per year, 1985 to 2011 (y-axis 0 to 1,000,000; labeled value 277871)
Think about Data Volume
• Started with an older dataset, and tasks like TIFF -> PNG conversion became progressively harder. Map/Reduce is nice, but we needed more visibility into progress.
• Should have sharded our search index from the beginning, just to make indexing a faster and cheaper process (500 GB index!)
• 8 shards dropped time from 12 hours to 2 hours. Merging took 5!
• We had too many steps in our pipeline
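The sharding lesson above rests on one simple step: every document must map deterministically to exactly one shard, so that several indexers can build their pieces in parallel and the pieces can be merged afterwards. A minimal sketch of that assignment follows, assuming a string document ID and a fixed shard count. The class and method names are hypothetical, and this is not Solr's own routing logic.

```java
// Sketch only: assign each document to one of N index shards by hashing its ID.
// ShardRouter and shardFor are hypothetical names used for illustration.
class ShardRouter {

    static int shardFor(String documentId, int shardCount) {
        // floorMod keeps the result in [0, shardCount) even when hashCode() is negative.
        return Math.floorMod(documentId.hashCode(), shardCount);
    }

    public static void main(String[] args) {
        int shards = 8; // the shard count mentioned on the slide
        for (String id : new String[] {"CN101234567A", "CN102345678B", "CN201111111U"}) {
            System.out.println(id + " -> shard " + shardFor(id, shards));
        }
    }
}
```

Each indexing worker then processes only the documents whose shard number matches its own, which is what lets the wall-clock time drop as shards are added.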
Building a Patents Index
Chart: Machine Count (y-axis 0 to 300) against indexing time: 1 machine, 5 days; 5 machines, 3 days; 300 machines, 30 minutes
Telling some stories
• How to inject “Discovery” into your app
• The Cloud to the Rescue (sorta!)
➡Parsers and Parsers and Parsers
• Don’t be Afraid to Share!
Why so many pipelines?
Morphlines
Tika as a pipeline?
Lots of File Types
• Sometimes in ZIP archives, sometimes not!
• multiple XML formats as well as CSV and EDI
• Purplebook, Yellowbook, Redbook, Greenbook, Questel, SIPO…
Tika as a pipeline!
• Auto detects content type
• Metadata structure has all the key/value needed for Solr
• Allows us to scale up with Behemoth project (and others!).
Lots of files!
HHHHHT APS1 ISSUE - 760106
PATN
WKU 039302717
SRC 5
APN 5328756
APT 1
ART 353
APD 19741216
TTL Golf glove
ISD 19760106
NCL 4
ECL 1

<PatentGrant>
  <BibliographicData>
    <GrantIdentification>
      <DocumentKindCode>B1</DocumentKindCode>
      <GrantNumber>06644224</GrantNumber>
      <CountryCode>US</CountryCode>
      <IssueDateText>2003-11-11</IssueDateText>
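Detector to pick File
import java.io.IOException;
import java.io.InputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.tika.detect.Detector;
import org.apache.tika.io.LookaheadInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

public class GreenbookDetector implements Detector {

    // Greenbook files carry a "PATN" record marker near the top of the file.
    private static final Pattern PATTERN = Pattern.compile("PATN");

    @Override
    public MediaType detect(InputStream stream, Metadata metadata) throws IOException {
        MediaType type = MediaType.OCTET_STREAM;
        if (stream == null) {
            return type; // nothing to sniff
        }

        // Read at most the first 1024 bytes; close() resets the original stream.
        InputStream lookahead = new LookaheadInputStream(stream, 1024);
        try {
            String extract = org.apache.commons.io.IOUtils.toString(lookahead, "UTF-8");
            Matcher matcher = PATTERN.matcher(extract);
            if (matcher.find()) {
                type = GreenbookParser.MEDIA_TYPE;
            }
        } finally {
            lookahead.close();
        }
        return type;
    }
}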
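The same look-ahead idea can be tried without Tika on the classpath. The sketch below is a plain-JDK stand-in, not the project's code: it peeks at the first 1024 bytes with mark/reset and returns a label. `FormatSniffer` and the labels `greenbook` and `xml-patent-grant` are hypothetical. The markers `PATN` and `<PatentGrant>` are taken from the two sample files shown earlier.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Sketch only: content sniffing on the head of a stream, leaving the stream
// positioned at the start so a parser can read the whole file afterwards.
class FormatSniffer {

    private static final int LOOKAHEAD_BYTES = 1024;

    static String sniff(InputStream stream) {
        if (!stream.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        try {
            stream.mark(LOOKAHEAD_BYTES);
            // ISO-8859-1 maps every byte to a char, so decoding never fails on binary input.
            String head = new String(stream.readNBytes(LOOKAHEAD_BYTES), StandardCharsets.ISO_8859_1);
            stream.reset(); // rewind for whichever parser runs next

            if (head.contains("<PatentGrant>")) {
                return "xml-patent-grant"; // hypothetical label
            }
            if (head.contains("PATN")) {
                return "greenbook";        // hypothetical label
            }
            return "unknown";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] sample = "HHHHHT APS1 ISSUE - 760106\nPATN\nWKU 039302717\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(sniff(new ByteArrayInputStream(sample)));
    }
}
```

Wrap file streams in a `BufferedInputStream` before calling `sniff`, since a raw `FileInputStream` does not support mark/reset.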
Telling some stories
• How to inject “Discovery” into your app
• The Cloud to the Rescue (sorta!)
• Parsers and Parsers and Parsers
➡Don’t be Afraid to Share!
Your Search solution isn’t perfect
• Allow users to export data
• Most business users want to work in Excel! Accept it!
• Allow other applications to build on top of it.
GPSN has
• Lots of easy “Print to PDF” options.
• Data stored in S3 as:
• individual patent files
• chunky downloads.
• Filtering to expand or select specific data sets.
• Permalinks: simple, very sharable URLs.
• Underlying Solr service is exposed to public via proxy. You can query Solr yourself.
• Need advanced querying? Use Lucene syntax in the search bar.
One more thought...
Measuring the impact of our algorithm changes is just getting harder as we get smarter.
www.quepid.com
Quepid: Give your Queries some Love
Project SolrPanl
We need beta users!
Thank you!
Questions?
• @dep4b
• www.opensourceconnections.com
• slideshare.com/o19s
Nervous about speaking up? Ask me later!