Google Search Appliance
November 2, 2010
Susan Fagan
2
Why Google Search Appliance?
• A different approach to search at EPA
• Smarter ranking
• Improved indexing
• Easier operations
• A future
We’re going to call it GSA from here on in
3
How GSA ranks documents
• It’s a secret, but we know some things
– PageRank
– Self learning
• We can control some things
– Date biasing
– Source biasing
– Metadata biasing
– Best bets
• We’re going to let it do its thing before we tune it too much
4
How GSA ranks documents: PageRank
• Who links to your pages?
• Who links to pages that link to your pages?
• How does everybody link?
– What does it say in the link text?
– Is the link always the primary URL? (If it isn’t,
you don’t get any points.)
A primary URL contains only primary aliases, as
defined by what you put in the TSSMS Alias Tool.
5
How GSA Ranks Documents: Things We Can Control
• Date biasing
– Newer is better
– We control how much better
• Source biasing
– Boost or decrease chunks of our website
– Regions are slightly decreased for Agency search
• Metadata biasing
– We control how much each metadata field counts
– We can turn up the bias as metadata quality improves
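To make the idea of biasing concrete, here is a toy sketch. It is not the GSA's actual formula, which is undisclosed; the function name, the weights, and the example.gov URL prefixes are all assumptions made for illustration. It only shows how a date bias and a source bias could nudge a base relevance score up or down.

```python
from datetime import date


def biased_score(base_score, last_modified, url, today,
                 date_weight=0.2, source_bias=None):
    """Toy illustration only: NOT the GSA's real (secret) ranking formula."""
    source_bias = source_bias or {}

    # Date biasing: newer documents get a larger boost.
    # date_weight is the hypothetical knob for "how much better" newer is.
    age_years = max((today - last_modified).days, 0) / 365.0
    date_factor = 1.0 + date_weight / (1.0 + age_years)

    # Source biasing: boost or decrease whole chunks of the site by URL prefix.
    source_factor = 1.0
    for prefix, factor in source_bias.items():
        if url.startswith(prefix):
            source_factor = factor
            break

    return base_score * date_factor * source_factor


today = date(2010, 11, 2)
newer = biased_score(1.0, date(2010, 10, 1), "https://example.gov/air/", today)
older = biased_score(1.0, date(2005, 10, 1), "https://example.gov/air/", today)
regional = biased_score(1.0, date(2010, 10, 1), "https://example.gov/region4/",
                        today, source_bias={"https://example.gov/region": 0.9})
```

With these made-up weights, the newer page outscores the older one, and the regional page scores slightly below an otherwise identical non-regional page.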
6
How GSA Ranks Documents: More Things We Can Control
• Best Bets
– Like buying keywords from Google.com
– Specific pages for specific keywords or phrases
– Always featured at the top
– Take effect immediately
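Conceptually, Best Bets behave like a lookup table from keywords to featured pages. The sketch below is an assumption about the behavior, not the GSA's implementation; the function name and sample data are hypothetical.

```python
def apply_best_bets(query, results, best_bets):
    """Put the featured pages for a query ahead of the regular results."""
    # Normalize case and whitespace so "  Radon " matches the "radon" entry.
    key = " ".join(query.lower().split())
    featured = best_bets.get(key, [])
    # Avoid listing a featured page twice.
    rest = [r for r in results if r not in featured]
    return featured + rest


best_bets = {"radon": ["https://example.gov/radon/"]}  # hypothetical entry
ranked = ["https://example.gov/air/", "https://example.gov/radon/"]
top = apply_best_bets("  Radon ", ranked, best_bets)
```

Because the table is consulted at query time, a change to it would show up on the next search, which matches the "take effect immediately" point above.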
7
How GSA Indexes Documents
• Continuous crawl
• Learns by experience
• Crawl rates tunable by host and time
• Requires some starting points (seeds)
• Restricted by Do Not Crawl list
A manually maintained list, kept in the GSA Admin UI,
of URL patterns that the crawler should not visit.
• Respects robots.txt (in its own way)
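The two crawl restrictions can be sketched as a single check. This is a simplified illustration: real GSA URL patterns have their own syntax, so plain substring matching stands in for them here, and the crawler name and example.gov URLs are hypothetical. The robots.txt check uses Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser


def should_crawl(url, do_not_crawl, robots, user_agent="example-crawler"):
    """Return True only if neither the Do Not Crawl list nor robots.txt blocks the URL."""
    # Simplification: treat each Do Not Crawl pattern as a plain substring.
    if any(pattern in url for pattern in do_not_crawl):
        return False
    return robots.can_fetch(user_agent, url)


# Parse an in-memory robots.txt instead of fetching one over the network.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])

do_not_crawl = ["/drafts/"]  # hypothetical pattern
ok = should_crawl("https://example.gov/air/index.html", do_not_crawl, robots)
blocked_by_list = should_crawl("https://example.gov/drafts/a.html", do_not_crawl, robots)
blocked_by_robots = should_crawl("https://example.gov/private/b.html", do_not_crawl, robots)
```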
8
How EPA is implementing GSA
• Same Java webapp on the same servers
• Your search form will stay the same
• Area search won’t change much
• Your XML search application may change (most
won’t)
• Smart, fast indexing, with some help
• Only indexing primary URLs
9
Implementing GSA: Your search form will stay the same
• Implemented Northern Light via an object-oriented Java
application
– We get to keep our code this time
– 6 weeks to change it, instead of 6 months
– Nothing changes for client pages
• Two Model 7007 Google Search Appliances
– Primary
– Hot spare for failover
– Parallel indexes
• 2,000,000 document license
10
Implementing GSA: Your search form
• URL is the same
• All common elements work the same
• Some obscure elements go away
– weighted_search, search_crumbs
• Custom result templates work the same
• Advanced search works the same
11
Implementing GSA: Area Search
• Area search is here for now
• If you search by TSSMS
– We will translate it on the fly to URL
– We will only translate TSSMS to primary alias
• If you search by URL
– Nothing changes…
– … but aliases are your problem
• Contact Peter to test your area search
12
Implementing GSA: Your XML search app
• Parameters and templates are unchanged
• GSA response packet automatically transformed
to original NL format
• Only 1,000 results are available for a single query
• Three applications have been observed exceeding
that limit
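An XML search application that pages through results can guard against the 1,000-result cap by planning its requests up front. The sketch below is an illustration only; the function name and the page size of 100 are assumptions, not parameters taken from the EPA application.

```python
def page_requests(wanted, page_size=100, cap=1000):
    """Yield (start, count) pairs that never reach past the result cap."""
    limit = min(wanted, cap)  # results beyond the cap are not available
    start = 0
    while start < limit:
        yield start, min(page_size, limit - start)
        start += page_size


small = list(page_requests(250))   # fits under the cap
large = list(page_requests(2500))  # truncated at the cap
```

An application that genuinely needs more than 1,000 documents would have to narrow its query (for example by area or date) rather than page further.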
13
Implementing GSA: Smart, fast indexing
• Continuous crawl – scans the website at least
daily for new links
• If it’s not linked, it won’t be found
• Librarian looks daily for new content
• If all this doesn’t work (quickly), tell the librarian
• Notes databases do not require Verity Views
14
Implementing GSA: Indexing your primary URL
• Search engines think different URLs are different
documents
• This means duplicates in search results
• All non-primary aliases are being placed in the Do
Not Crawl list
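Why aliases cause duplicates, and how mapping them to the primary URL removes them, can be shown with a small sketch. The alias table, function names, and example.gov URLs are hypothetical; in practice the mapping would come from the TSSMS Alias Tool.

```python
def to_primary(url, alias_to_primary):
    """Rewrite a URL that starts with a known alias to its primary form."""
    for alias, primary in alias_to_primary.items():
        if url.startswith(alias):
            return primary + url[len(alias):]
    return url


def dedupe(urls, alias_to_primary):
    """Keep one entry per document, preserving the original order."""
    seen = set()
    unique = []
    for url in urls:
        primary = to_primary(url, alias_to_primary)
        if primary not in seen:
            seen.add(primary)
            unique.append(primary)
    return unique


aliases = {"https://example.gov/oar/": "https://example.gov/air/"}  # hypothetical
hits = ["https://example.gov/air/ozone.html", "https://example.gov/oar/ozone.html"]
unique_hits = dedupe(hits, aliases)
```

To a search engine the two hits look like different documents; after mapping both to the primary URL only one remains.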
15
What will our customers see?
• The same thing… at first.
• Breadcrumbs are gone…what were they,
anyway?
• Folders replaced by Related Searches
• FAQ will come back
• Best Bets for top documents
• The document they’re looking for!
16
What do we have to do?
• Plan our November 19 public access
implementation
• Test (with your help)
• Implement
• Make it better
17
What do you have to do?
• Keep working on ROT
• Keep working on metadata
• Don’t change your search form…
• … Area search will work, if you want it
• Tell us what you think
18
What are we leaving out … for now?
• EPA thesaurus
– Contains only general terms
– We will add EPA vocabulary
• Google’s spellchecker
– We’ll use our own for now
– We’ll compare and use the winner
• RSS presentation – delivers only raw XML in search
results, for now
• Recent searches
19
What’s in our future?
• Marketplace of One Box modules
– Faceted search?
– Contextual search?
– Business intelligence?
• More social media
• OneEPA integration
• Web CMS integration
• Advanced analytics
• Special collections
• Geographic search?
• GSA for intranet
21