Lucene and Solr are state-of-the-art search technologies available for free as open source from the Apache Software Foundation. Lucene is the underlying search library, and Solr is a platform built on top of Lucene that makes it easy to build Lucene-based applications. Both are full-featured and offer excellent performance, relevancy ranking and scalability. These technologies are used today by thousands of organizations and power substantial search applications at AOL, Comcast Interactive Media, IBM, Netflix, LinkedIn and MySpace.
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Introduction-Apache-Lucene-and-Solr
Introduction to Apache Solr & Lucid Imagination
Grant Ingersoll
Thursday, 29 July 2010
Sponsored by
Co-sponsored by
We deliver information solutions
Lucid Imagination, Inc.© 2010
We consult and design.
We architect and build.
We support.
And we realise the true value of your content...
We deliver information solutions.
Steve Odart
www.ixxus.com
Agenda
Introductions
About Lucid Imagination & Open Source Search
LucidWorks for Solr
Searching your domain with Solr
Putting Solr into production
Questions
Slides are posted for download at the end of this presentation; full replay available within ~48 hours of live webcast
About me
Grant Ingersoll
Lucene/Solr committer
Co-founder Apache Mahout project
Co-author of upcoming “Taming Text”
Chair, Apache Lucene PMC
About Lucid Imagination
Build on and complement the open source technology and installed base of Apache Lucene and Solr
Deliver subscription-based value-add software, support and training to enhance & extend Lucene/Solr
Center of excellence for Lucene/Solr app developers
Company Background
Lucene Project Launched: 1997
Solr Project Launched: 2006
Company Launched: Aug. 2007
Financing: Shasta Ventures, Granite Ventures, Walden International, In-Q-Tel
Paying Customers: 100+ (and counting…)
HQ: San Mateo, California, USA
Partners: US, Europe, Japan, Latin America
Lucid Imagination Offerings
Search customers building better, faster, less costly search applications
Best Practices
Training
Consulting
Subscriptions
Certified Distributions
Health Checks
Lucene/Solr Success Stories with Lucid Imagination
Data Happens
Data is constantly growing, faster and more diverse
Mix of content, composition, and repositories: new terms, fields, range of data types grow in tandem with volume
Diversity and location of data are an application development problem
Search and discovery tools are the solution
Scalability, performance and relevancy key to user success
Transparency, breadth and flexibility are key to development success
Lucene/Solr
Lucene: powerful, flexible search library
Speed, accuracy, scalability, efficiency
Cross-platform portability of indexes
Java, ported to 7 other environments (PHP, C++, Python, etc.)
Liberal Apache License
One of Top 5 Apache Projects
Top 10 Open Source Project
Solr: The Lucene Search Server
REST-like interface
Faceting
Hit highlighting
Rich Document Handling
RDBMS integration
Distributed scalability
Easy configuration
Lucene, Solr and their logos are trademarks of the Apache Software Foundation
Lucene/Solr Open Source Quality @ the tipping point
Scalability
823 billion documents searched by Lucene at MySpace.com
Performance
Real time: LinkedIn search covers 48 million members, adding one new member (with new content) per second
Relevancy
Open source APIs deliver better customization and the ability to fine tune results
Economics
5-8x reduction in server footprint over commercial search
No vendor lock-in lowers lifecycle costs
Creating Lasting Business Value
Three key trends…
Shorter time to market: resulting from direct communication between innovators and users
Reduced risk: from being locked into single-vendor relationships
Better fit: access to code results in increased adaptability of process to systems
…result in:
CREATING COMPETITIVE ADVANTAGE: Focus on core process innovations unique to your business instead of operating and maintaining 3rd party software packages
Search 101
Search tools are designed for dealing with fuzzy data
Works well with structured and unstructured data
Performs well when dealing with large volumes of data
Many apps don’t need the limits that databases place on content
Search fits well alongside a DB too
Given a user’s information need (the query), find and, optionally, score content relevant to that need
Many different ways to solve this problem, each with tradeoffs
What does “relevant” mean?
Two Foundation Concepts
Indexing
Finds and maps terms and documents
Conceptually similar to a book index
At the heart of fast search/retrieve
Relevance
Vector Space Model (VSM) for relevance
Common across many search engines
Apache Lucene is a highly optimized implementation of the VSM
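The two concepts above can be illustrated with a toy example. This is a minimal sketch, not Lucene's implementation or its scoring formula: it builds an inverted index (term to documents, much like a book index) and ranks documents by TF-IDF cosine similarity, one common form of the Vector Space Model. The corpus, the function names and the weighting scheme are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

# Toy corpus, invented for illustration only.
DOCS = {
    1: "solr is a search server",
    2: "lucene is a search library",
    3: "solr is built on lucene",
}


def build_index(docs):
    """Inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index


def idf(index, term, n_docs):
    """Weight rare terms higher; terms not in the index get 0."""
    df = len(index.get(term, {}))
    return math.log(n_docs / df) + 1.0 if df else 0.0


def search(index, n_docs, query):
    """Return [(doc_id, score)] ranked by TF-IDF cosine similarity."""
    q_weights = {t: idf(index, t, n_docs) for t in set(query.lower().split())}
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))
    if q_norm == 0.0:
        return []  # no query term occurs in the index

    # Length of every document vector, needed for the cosine.
    doc_norms = defaultdict(float)
    for term, postings in index.items():
        w = idf(index, term, n_docs)
        for doc_id, tf in postings.items():
            doc_norms[doc_id] += (tf * w) ** 2

    # Dot product: only documents listed in the postings are touched.
    dots = defaultdict(float)
    for term, qw in q_weights.items():
        w = idf(index, term, n_docs)
        for doc_id, tf in index.get(term, {}).items():
            dots[doc_id] += qw * tf * w

    scored = [
        (doc_id, dot / (math.sqrt(doc_norms[doc_id]) * q_norm))
        for doc_id, dot in dots.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


index = build_index(DOCS)
results = search(index, len(DOCS), "lucene library")
```

The index is what makes retrieval fast: a query only visits the postings of its own terms instead of scanning every document.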
Solr Basics
Content is modeled via Documents and Fields
Content can be text, integers, floats, dates, custom
Analysis can be employed to alter content before indexing
Controlled via schema.xml
Searches are supported through a wide range of Query options
Keyword
Terms
Phrases
Wildcards and others
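The query types listed above correspond to different query strings. The following is a minimal sketch, assuming the standard Lucene query parser syntax (`field:value`, double-quoted phrases, a trailing `*` for prefix wildcards). The helper names are hypothetical, and the field names `name` and `manu` are borrowed from the examples elsewhere in this deck.

```python
def term_query(field, term):
    """Single term restricted to one field, e.g. name:ipod."""
    return f"{field}:{term}"


def phrase_query(field, phrase):
    """Words that must appear together and in order."""
    # Escape backslashes and quotes so the phrase stays one quoted unit.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


def prefix_query(field, prefix):
    """Trailing wildcard: matches every term starting with the prefix."""
    return f"{field}:{prefix}*"


queries = [
    "ipod",                                  # keyword against the default field
    term_query("name", "ipod"),              # term in a named field
    phrase_query("manu", "ASUS Computer"),   # phrase
    prefix_query("name", "ipo"),             # wildcard
]
```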
Solr Basics
Schema
Define Fields, field metadata and Analysis
<field name="name" type="text" indexed="true" stored="true"/>
Solr Config
Define low-level Lucene controls
Specify how clients interact with Solr via Request Handlers (“mini servlets”)
Configure highlighting, spell checking, admin, etc.
Getting Started
1. Install LucidWorks Certified Distribution
2. Model your domain
3. Index your content
4. Test
5. Deploy
LucidWorks Certified Distribution
Free certified distribution
Installer
Simple
Plugins and enhancements
Updateable
Complete Reference Guide
Support for Linux, Windows, Mac
UI and headless both available
Get started at http://lucene.li/R
Master Your Domain with Solr
Get to know your content
Get to know your users
Modeling your Content
Collection/Aggregate
Examine collection level stats, like:
MIME Types
Number of Docs
Update rates
Languages present
Much, much more
Look for patterns and relationships
Identify helpful resources
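Two of the collection-level stats above, number of docs and MIME types, are easy to gather before any indexing is done. This is a rough sketch under stated assumptions: the content is a list of file paths, and MIME types are guessed from file extensions with Python's standard `mimetypes` module rather than by inspecting file contents. The function name and the sample file names are hypothetical.

```python
import mimetypes
from collections import Counter


def collection_stats(paths):
    """Count documents and tally MIME types guessed from file extensions."""
    mime_counts = Counter()
    for path in paths:
        mime, _encoding = mimetypes.guess_type(path)
        mime_counts[mime or "unknown"] += 1
    return {"num_docs": len(paths), "mime_types": dict(mime_counts)}


# Hypothetical file names standing in for a real content collection.
sample = ["manual.pdf", "spec.pdf", "index.html", "notes"]
stats = collection_stats(sample)
```

A tally like this shows quickly which document formats dominate the collection and which ones have no recognizable type.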
Modeling your Content
Randomly sample a set of your documents
Look for:
Common structures like titles, tables, columns, etc.
Important metadata
Tokenization issues
Try out in http://localhost:8983/solr/admin/analysis.jsp
Importance Indicators
May also look at paragraph, sentence, word and character issues
Understanding your Users
Sophisticated vs. Simple
Speed and Relevance
Search and Discovery
Search
Faceting
Did you mean?
Similar Pages (More Like This)
Highlighting
UI expectations
Build your Application
Map your content into Documents and Fields via the Solr schema
Set up your Solr access patterns in solrconfig.xml
Index your content
Search/Browse/Discover
Indexing
Many Clients
Java, PHP, Ruby, etc.
See example/exampledocs
Example: Upload CSV, Solr XML
<add><doc>
<field name="id">EN7800GTX/2DHTV/256M</field>
<field name="manu">ASUS Computer Inc.</field>
<field name="cat">electronics</field>
</doc></add>
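An add message like the one above can also be generated from application data instead of written by hand. A minimal sketch using Python's standard `xml.etree.ElementTree`; the function name is hypothetical, and sending the result to Solr's update handler is deliberately left out.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Serialize a list of {field name: value} dicts as a Solr <add> message."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


# Same document as in the slide example.
xml_message = build_add_xml([
    {
        "id": "EN7800GTX/2DHTV/256M",
        "manu": "ASUS Computer Inc.",
        "cat": "electronics",
    }
])
```

Building the message through an XML library rather than string concatenation keeps field values containing `&` or `<` from producing malformed XML.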
Search
Clients also support search through API calls
HTTP support by definition:
http://localhost:8983/solr/select/?q=*:*&fl=score,id
http://localhost:8983/solr/select/?q=name:iPod&fl=score,id
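The request URLs above can be assembled in code so that the query string is always properly encoded. A minimal sketch using only the Python standard library; the function name is hypothetical, and the base URL and the `q` and `fl` parameters are taken from the two examples above. No request is actually sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

SOLR_SELECT = "http://localhost:8983/solr/select/"


def select_url(query, fields=("score", "id"), base=SOLR_SELECT):
    """Build a select URL: q is the query, fl the list of fields to return."""
    params = {"q": query, "fl": ",".join(fields)}
    return f"{base}?{urlencode(params)}"


match_all = select_url("*:*")        # every document
by_name = select_url("name:iPod")    # documents whose name field matches iPod

# Decoding the query string recovers the original parameters.
decoded = parse_qs(urlsplit(by_name).query)
```

The generated URLs percent-encode characters such as `:` and `*`, which is equivalent to the literal form shown on the slide once the server decodes the request.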
Getting to Production
Some Issues to think about:
Scaling
Improving Findability
Scaling Solr
Get the most out of each machine
Typical Hardware (your mileage may vary):
Modern multicore CPU, Fast disk (SSD?), 4-16 GB RAM
High Query Volume
Large Index
Both
http://lucene.li/V
Improving Findability
Common Techniques
Analysis:
Lowercase, stemming, synonyms, stopwords, compound analysis (e.g. STR-AV220 -> STR AV 220)
Faceting
Spell Checking
Editorial
See http://lucene.li/U
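The analysis techniques listed above can be pictured as a small pipeline applied to text before indexing and, typically, to queries as well. This is an illustrative sketch only, not Solr's analyzers: the stopword list, the synonym map and the crude suffix stripper are stand-ins invented for the example, and the compound splitter handles ASCII letters and digits only.

```python
import re

STOPWORDS = {"the", "a", "an", "of", "and"}   # illustrative list
SYNONYMS = {"tv": "television"}                # illustrative mapping


def split_compound(token):
    """Split on punctuation and letter/digit boundaries: STR-AV220 -> STR AV 220."""
    return re.findall(r"[A-Za-z]+|[0-9]+", token)


def stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Compound splitting, lowercasing, stopword removal, synonyms, stemming."""
    tokens = []
    for raw in text.split():
        for part in split_compound(raw):
            token = part.lower()
            if token in STOPWORDS:
                continue
            token = SYNONYMS.get(token, token)
            tokens.append(stem(token))
    return tokens


tokens = analyze("The Sony STR-AV220 receivers")
```

Running the same chain on content and on queries is what lets a search for `str av 220` find a document that only contains `STR-AV220`.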
Improving Findability
Phrase Queries and other Position-based Queries (SpanQuery)
Disjunction Max Query (aka “DisMax”)
Intent Analysis
Invisible Queries
Fake Queries
Relevance Feedback and “More Like This”
See http://lucene.li/S
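The idea behind the Disjunction Max Query in the list above can be shown in a few lines. A minimal sketch of the scoring rule as commonly described for Lucene's DisjunctionMaxQuery: when the same query matches several fields, the document takes the best field score plus a tie-breaker fraction of the remaining field scores. The function name, field names and numbers are hypothetical.

```python
def dismax_score(field_scores, tie=0.0):
    """Best field score plus `tie` times the sum of the other field scores."""
    if not field_scores:
        return 0.0
    best = max(field_scores.values())
    others = sum(field_scores.values()) - best
    return best + tie * others


# Hypothetical per-field scores for one document and one query.
scores = {"name": 2.0, "cat": 1.0}
pure_max = dismax_score(scores)               # tie=0.0: only the best field counts
with_tie = dismax_score(scores, tie=0.1)      # other fields contribute a little
```

With `tie=0.0` a document cannot outrank another simply by matching the same word weakly in many fields; raising the tie-breaker moves the behavior toward a plain sum.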
Resources
Websites
http://www.lucidimagination.com
http://search.lucidimagination.com
http://lucene.apache.org/solr
Solr Support
http://www.lucidimagination.com/How-We-Can-Help
Q&A
Slides are posted for download at http://lucene.li/a; full replay available within ~48 hours of live webcast