IBM Search Technologies – The Haifa Perspective
Aya Soffer
Manager, Search Technologies Dept.
WebSphere Portal Development
IBM Haifa Labs
16/2/2004 2
Outline
The world of search from an IR point of view
Enterprise search vs. Web search
IBM text search initiatives
Haifa contributions
The World of Search
Proprietary content (owned and stored by the search engine), “local”
• Private access (“inward facing”): content management, groupware, “Intranet search”
• Publicly available (“outward facing”): e-commerce, news, public documentation, government, …
Public content (owned and stored by others), “global”
• Private access (“inward facing”): market intelligence, news tracking, job data mining
• Publicly available (“outward facing”): Web search
User Expectations Shape Products
Global / Web search (public content)
• Users expect relevant answers to very low-content queries: ipod, mars (from Google Jan. top 10)
• Search engines deliver!
• Hence reinforcement of user expectations
The global context is extremely popular
• Shapes user expectations at the workplace
• Shapes enterprise search products
• In contrast to the previous enterprise-to-consumer path: spreadsheets, word processing, email
Local / Enterprise search (proprietary content)
• Users expect a similar interaction style: basically unstructured, full-text search
• Hence paradigm change in local search
• Yet more sophisticated user needs must also be met
Enterprise Search
Enterprise Search: Harder than Web Search
Far fewer resources, but high expectations
• Recall becomes an issue
Data is not “search friendly”
• Data not created with search in mind
Must index everything, including legacy data
• Web search effect in the enterprise: find everything I can access
Security
• But only show me what I am allowed to see
Link-based methods not as effective
• Linkage patterns not as robust
Enterprise knowledge needs to be factored into ranking
Search is not cheap!
• About 5-10 cents/document/year
Enterprise Search: Easier than Web Search
Smaller scale
• Corpus is much smaller
• Query load is much lower
Less anarchic
• Central authority
• Data formats can be controlled
More potential structure
• Data is better organized
• Opportunity for semantic search
No spam
• At least not intentional
Differences in Content Corpus
On the Web
• HTML is over 95%
• By definition, all content is visible to everyone
• Sources are not structured
  • With the exception of eCommerce sites
• Content-based filtering (for adult content) is a big issue
In the Enterprise
• Microsoft Office formats are prevalent
• Notes and Exchange are big repositories
  • Some of which should be visible only to some users
• Large number of authoritative structured sources
  • E.g., directories of people
Difference in Query Characteristics
On the Web
• Very flat distribution with a very long tail
• >300k queries to reach 25% coverage
• <25% of queries are navigational queries
• 30% of queries are unique
In the Enterprise
• Much sharper drop-off
• <1000 terms account for 25% coverage
• >45% of queries are navigational
• Users’ queries as well as administrators’ queries
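The coverage figures above can be read as "how many of the most frequent distinct queries account for a given share of all query traffic". A minimal sketch of that measurement follows, assuming the query log is simply a list of query strings; the function name and both sample logs are hypothetical and are not taken from the talk.

```python
from collections import Counter


def queries_needed_for_coverage(query_log, target=0.25):
    """Count how many of the most frequent distinct queries cover `target` of the log."""
    if not query_log:
        return 0
    counts = Counter(query_log)
    total = len(query_log)
    covered = 0
    # Walk the distinct queries from most to least frequent.
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered >= target * total:
            return rank
    return len(counts)


# Hypothetical logs: a skewed "enterprise-like" log and a flat "web-like" log.
enterprise_like = ["benefits"] * 40 + ["payroll"] * 30 + ["travel"] * 20 \
    + ["q%d" % i for i in range(10)]
web_like = ["q%d" % i for i in range(100)]

print(queries_needed_for_coverage(enterprise_like))
print(queries_needed_for_coverage(web_like))
```

A sharp drop-off shows up as a small return value; a flat, long-tailed distribution needs many more distinct queries to reach the same coverage.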
Differences in Organization Structure
On the Web
• Adversarial relationship between search engines and content providers
  • Content providers want to maximize traffic
  • Content providers cannot be trusted
• Consequently, meta-data from providers cannot be trusted
• Search engine is not accountable: best effort is the rule, no one to complain to
In the Enterprise
• Search engine and content providers cooperate
  • Search engine serves the content providers
• Consequently, meta-data can be trusted
• Search engine is accountable to providers, who expect to find their pages in the index
Differences in Economics
On the Web
• Advertising pays for the search engine
• Search engine does not pay for the time of the user
• Users do not pay search engine for use of service
• Search engine wants users to do more searches
In the Enterprise
• Search engine is a utility, not a profit center
• Same organization pays for search engine and user’s time
• Organization wants users to do less searching
• This, together with query characteristics, allows for a very different, service-oriented structure for search
Search in IBM
Search Integrated in IBM Products
Search technology plays a major role in several IBM products
IBM WebSphere Portal
• IBM’s market-leading portal solution
• Index and search over both internal and external content
• Search content of Portal Document Manager
• Search data available via portlets
• Portal search engine is based on Juru
  • 100% Java search engine developed in Haifa
  • Focus on high precision and customization
IBM Lotus Workplace
• Search in collaborative portlets
• Email, discussions, e-learning, people finding, chat
• Also using Juru search engine
WebSphere Portal
[Architecture diagram. Components shown: Search Engine; Portal Document Mgr; Web Content Publishing (LWP WCM); Collaboration* (People / Teams / Document / Real Time); Federated Search Broker*; Web Site Analyzer*; Portal Code (Admin / SSO / ACL / Security / Web Services / Portlets / Presentation Layer / Language / Member Srvs. / etc.); WebSphere Application Server; Relational Database; LDAP Dir]
Juru – Highlights
Easy to use
• 100% Java, thus platform independent
• Small footprint
Extensible
• Open APIs to add new document types and ranking models
Effective
• Top quality demonstrated on standard benchmarks:
  • TREC, organized by NIST, for standard IR
  • INEX, organized by DELOS, for XML retrieval
Efficient
• Sub-second query processing
• Fast indexing
Choice of tools for text and linguistic analysis
• From character-based, through stemmers, through morphology
XML support
• XML query by fragments: unique query-by-example technology
Juru – Main Features
Full text query specification
• Rich query syntax: +, -, *, phrases, fielded search, parametric search
• Lexical affinities for disambiguation
• Stemming and morphological basis for multi-lingual support
• N-gram tokenization for
  • Languages with no white space (e.g. CJK)
  • Languages with no available stemmer (e.g. Hebrew)
Efficient constraint solver
• Filter search results on categories
• Pre-defined filters for dates, security, numeric fields
Extensible ranking formulas
• For example, incorporate external scoring factors
Rich set of local and remote APIs
• Suitable for simple and complex implementations
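To illustrate the n-gram tokenization mentioned in the feature list above, here is a minimal sketch of overlapping character n-grams. It is a generic illustration, not Juru's tokenizer: the function name `char_ngrams` and the default n = 2 are assumptions made for the example.

```python
def char_ngrams(text, n=2):
    """Split each whitespace-separated chunk of `text` into overlapping character n-grams."""
    grams = []
    for chunk in text.split():
        if len(chunk) <= n:
            # Chunks no longer than n are kept whole.
            grams.append(chunk)
        else:
            grams.extend(chunk[i:i + n] for i in range(len(chunk) - n + 1))
    return grams


# Text without white space becomes a sequence of overlapping bigrams.
print(char_ngrams("東京都庁"))
# With white space, n-grams never cross word boundaries.
print(char_ngrams("search engine", n=3))
```

Documents and queries are tokenized the same way, so a query matches wherever its n-grams occur in the indexed text. No dictionary or stemmer is needed, at the cost of a larger index and occasional spurious matches.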
Portal Search - Index Build Process
[Flow diagram. Elements shown: Crawler; Filter; Text analysis components (categorizer, summarization, document filters); Metadata injected into original content; Approval workflow; Approved set of content (“in-basket”); Indexer; Collection. Two steps in the diagram are numbered 1 and 2.]
Search Portlet – Detailed View
Search Portlet – Browse View
Lotus Workplace Search
Lotus New Email Search
IBM Enterprise Search
Stand-alone solution: adds scale, administration, global analysis and more
• Index and search over 8 million unique pages
• Over 25 million unique URLs
• Over 7,000 websites
• 30-40K searches per day (300,000 employees)
• In production since Sept. 2003
• Access to a wide range of content: Intranet, news forums, Persona pages, Blue pages
Main Features
• Excellent relevancy of search
• Outstanding performance
• Ease of administration
• High scalability
Intranet Search Architecture
[Architecture diagram for W3.IBM.COM. Components shown: Crawler; Indexer; Search Server A; Search Server B]
1 – Crawling
2 – Parsing & Tokenizing
3 – Indexing: Build & Push to Search Server
4 – Searching
Developed by Haifa Research based on Juru Technology
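The four numbered steps can be sketched end to end with a toy inverted index. This is only an illustration of the general crawl, tokenize, index, search flow, not the W3 system: the page identifiers and texts are hypothetical, the "crawl" is a ready-made dictionary, and the "push to search server" step is reduced to handing the index to a search function.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Step 2: lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(pages):
    """Step 3: build an inverted index mapping each term to the pages containing it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in tokenize(text):
            index[term].add(page_id)
    return index


def search(index, query):
    """Step 4: return the pages that contain every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Step 1 stand-in: hypothetical crawled pages keyed by a made-up identifier.
pages = {
    "page-a": "Travel expense policy",
    "page-b": "Travel booking tool",
}
index = build_index(pages)
print(search(index, "travel policy"))
```

In the architecture shown, the index is built on the indexer and then pushed to separate search servers; the sketch keeps everything in one process for brevity.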
IBM Search in Support of the Enterprise
Feature | Enterprise Search | Portal Search
Query format | Free text, AND of terms first | Free text + advanced search
XML support | XML mapping | XML fragments
Categorization | Rule-based + fixed taxonomy | Rule-based + fixed taxonomy
Rich APIs | Search only | Search & indexing
Security | Built-in support | Built-in support + post filtering
Document formats | 8 major formats + NNTP | Over 200 formats
Duplicate elimination | Near duplicates | Exact duplicates
Support for document structure | Index all metadata | Index all metadata; admin can change
Additional Haifa Research Highlights
QUEST – QUEry Sensitive Tuning
• Novel adaptable ranking formula combining textual and static factors
• Parameters tuned according to query type
Efficient Query Evaluation Using a Two-Level Retrieval Process
• Quickly determine if document is candidate for top K
• Fully evaluate only promising candidates
• Skip unpromising documents
• Goal: maximize skipping and minimize full evaluations
Searching XML documents via XML fragments
• Documents and queries represented as XML
• Extension of vector-space model to handle tags
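The two-level retrieval idea listed above can be sketched as follows. This is a simplified illustration, not the Haifa implementation: it assumes a cheap per-document `upper_bound` function that never underestimates the true score and an expensive `full_score` function, both hypothetical names supplied by the caller.

```python
import heapq


def top_k_two_level(doc_ids, upper_bound, full_score, k):
    """Return the k best (score, doc_id) pairs and the number of full evaluations."""
    if k <= 0:
        return [], 0
    heap = []  # min-heap of the current top k; heap[0] is the weakest kept entry
    full_evaluations = 0
    for doc_id in doc_ids:
        # Level 1: cheap test. Skip the document if even its optimistic
        # bound cannot beat the weakest score currently in the top k.
        if len(heap) == k and upper_bound(doc_id) <= heap[0][0]:
            continue
        # Level 2: full (expensive) evaluation of a promising candidate.
        score = full_score(doc_id)
        full_evaluations += 1
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True), full_evaluations


# Hypothetical scores; the bound overestimates each true score by 0.5.
scores = {"d1": 5.0, "d2": 1.0, "d3": 4.0, "d4": 0.5, "d5": 3.0}
results, evaluated = top_k_two_level(
    scores, lambda d: scores[d] + 0.5, lambda d: scores[d], k=2)
print(results, evaluated)
```

In this run only three of the five documents are fully evaluated. The tighter the upper bound, the more documents are skipped, which is the stated goal of maximizing skipping and minimizing full evaluations.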
Conclusion
Enterprise search is shaped by Web search but poses different challenges
IBM is focused on Enterprise search solutions
Search is embedded in many IBM products
Haifa is the center of competence for search in IBM
• Fully owns the Juru search engine integrated in Lotus products
• Significant contributor to the IBM stand-alone search engine
Haifa is extremely active in the IR and Web research community
Our Secret Sauce