
IBM Search Technologies – The Haifa Perspective

Aya Soffer
Manager, Search Technologies Dept.

WebSphere Portal Development
IBM Haifa Labs

16/2/2004 2

Outline

The world of search from an IR point of view
Enterprise search vs. Web search
IBM text search initiatives
Haifa contributions


The World of Search

Two axes: access (private, "inward facing" vs. publicly available, "outward facing") and content (proprietary, owned and stored by the search engine, "local" vs. public, owned and stored by others, "global").

Proprietary content, private access: content management, groupware, "Intranet search"
Proprietary content, publicly available: e-commerce, news, public documentation, government, ...
Public content, private access: market intelligence, news tracking, job data mining
Public content, publicly available: Web search


User Expectations Shape Products

Global / Web search (public content)
• Users expect relevant answers to very low-content queries: ipod, mars (from Google Jan. top 10)
• Search engines deliver!
• Hence reinforcement of user expectations

The global context is extremely popular
• Shapes user expectations at the workplace
• Shapes enterprise search products
• In contrast to the previous enterprise-to-consumer path: spreadsheets, word processing, email

Local / Enterprise search (proprietary content)
• Users expect a similar interaction style: basically unstructured, full-text search
• Hence paradigm change in local search
• Yet more sophisticated user needs must also be met


Enterprise Search


Enterprise Search: Harder than Web Search

Far fewer resources, but high expectations
• Recall becomes an issue

Data is not "search friendly"
• Data not created with search in mind

Must index everything, including legacy data
• Web search effect in the enterprise: find everything I can access

Security
• But only show me what I am allowed to see

Link-based methods not as effective
• Linkage patterns not as robust
• Enterprise knowledge needs to be factored into ranking

Search is not cheap!
• About 5-10 cents/document/year
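The security requirement above (find everything I can access, but only show me what I am allowed to see) can be illustrated with a minimal sketch of result filtering against per-document access lists. The `Hit` type, the group names, and the documents are all hypothetical; the slides do not describe how any IBM product actually enforces access control.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    doc_id: str
    score: float
    allowed_groups: frozenset  # hypothetical per-document access list


def filter_by_acl(hits, user_groups):
    """Keep only hits whose access list shares at least one group with the user."""
    user_groups = frozenset(user_groups)
    return [h for h in hits if h.allowed_groups & user_groups]


# Illustrative ranked result list (made-up documents and scores).
hits = [
    Hit("hr/salaries.xls", 0.92, frozenset({"hr"})),
    Hit("intranet/cafeteria.html", 0.75, frozenset({"all-employees"})),
    Hit("legal/contract.doc", 0.60, frozenset({"legal", "executives"})),
]

visible = filter_by_acl(hits, {"all-employees", "legal"})
print([h.doc_id for h in visible])
```

Filtering after ranking keeps the index itself unaware of users, at the cost of sometimes discarding highly ranked hits; filtering inside the engine avoids that waste but ties the index to the access model.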


Enterprise Search: Easier than Web Search

Smaller scale
• Corpus is much smaller
• Query load is much lower

Less anarchic
• Central authority
• Data formats can be controlled

More potential structure
• Data is better organized
• Opportunity for semantic search

No spam
• At least not intentional


Differences in Content Corpus

On the Web
• HTML is over 95%
• By definition, all content is visible to everyone
• Sources are not structured
  • With the exception of eCommerce sites
• Content-based filtering (for adult content) is a big issue

In the Enterprise
• Microsoft Office formats are prevalent
• Notes and Exchange are big repositories
  • Some of which should be visible only to some users
• Large number of authoritative structured sources
  • E.g., directories of people


Difference in Query Characteristics

On the Web
• Very flat distribution with a very long tail
• >300k queries to reach 25% coverage
• <25% of queries are navigational queries
• 30% of queries are unique

In the Enterprise
• Much sharper drop-off
• <1000 terms account for 25% coverage
• >45% of queries are navigational
• Users' queries as well as administrators' queries
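The coverage figures above can be read as: how many of the most frequent distinct queries are needed before they account for 25% of all submitted queries. A minimal sketch of that computation follows; the function name and both query logs are made up for illustration and are not taken from the measurements quoted on the slide.

```python
from collections import Counter


def queries_needed_for_coverage(query_log, target=0.25):
    """Return how many of the most frequent distinct queries are needed
    to account for at least `target` of all query submissions."""
    counts = Counter(query_log)
    total = sum(counts.values())
    covered = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered / total >= target:
            return rank
    return len(counts)


# Toy logs: a head-heavy "enterprise" log and a completely flat "web" log.
enterprise_log = ["hr"] * 40 + ["travel"] * 30 + ["benefits"] * 20 + ["x%d" % i for i in range(10)]
web_log = ["q%d" % i for i in range(100)]

print(queries_needed_for_coverage(enterprise_log))  # head-heavy: one query suffices
print(queries_needed_for_coverage(web_log))         # flat: a quarter of all distinct queries
```

A small value relative to the number of distinct queries indicates the sharp drop-off described for the enterprise; a large value indicates the long, flat tail described for the Web.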


Differences in Organization Structure

On the Web
• Adversarial relationship between search engines & providers
  • Content providers want to maximize traffic
  • Content providers cannot be trusted
• Consequently, meta-data from providers cannot be trusted
• Search engine is not accountable: best effort is the rule, no one to complain to

In the Enterprise
• Search engine and content providers cooperate
  • Search engine serves the content providers
• Consequently, meta-data can be trusted
• Search engine is accountable to providers, who expect to find their pages in the index


Differences in Economics

On the Web
• Advertising pays for the search engine
• Search engine does not pay for the time of the user
• Users do not pay search engine for use of service
• Search engine wants users to do more searches

In the Enterprise
• Search engine is a utility, not a profit center
• Same organization pays for search engine and user's time
• Organization wants users to do less searching
• This, together with query characteristics, allows for a very different, service-oriented structure for search


Search in IBM


Search Integrated in IBM Products

Search technology plays a major role in several IBM products

IBM WebSphere Portal
• IBM's market-leading portal solution
• Index and search over both internal and external content
• Search content of Portal Document Manager
• Search data available via portlets
• Portal search engine is based on Juru
  • 100% Java search engine developed in Haifa
  • Focus on high precision and customization

IBM Lotus Workplace
• Search in collaborative portlets
• Email, discussions, e-learning, people finding, chat
• Also using Juru search engine


WebSphere Portal (architecture diagram; components as labeled)
• WebSphere Application Server
• Search Engine
• Portal Document Mgr
• Web Content Publishing (LWP WCM)
• Collaboration*: People / Teams / Document / Real Time
• Federated Search Broker*
• Web Site Analyzer*
• Portal Code: Admin / SSO / ACL / Security / Web Services / Portlets / Presentation Layer / Language / Member Srvs. / etc.
• Relational Database
• LDAP Dir


Juru – Highlights

Easy to use
• 100% Java, thus platform independent
• Small footprint

Extensible
• Open APIs to add new document types and ranking models

Effective
• Top quality demonstrated on standard benchmarks
  • TREC, organized by NIST, for standard IR
  • INEX, organized by DELOS, for XML retrieval

Efficient
• Sub-second query processing
• Fast indexing

Choice of tools for text and linguistic analysis
• From character-based, through stemmers, through morphology

XML support
• XML query by fragments: unique query-by-example technology


Juru – Main Features

Full text query specification
• Rich query syntax
  • +, -, *, phrases, fielded search, parametric search
• Lexical Affinities for disambiguation
• Stemming and morphological basis for multi-lingual support
• N-gram tokenization for
  • Languages with no white space (e.g. CJK)
  • Languages with no available stemmer (e.g. Hebrew)

Efficient constraint solver
• Filter search results on categories
• Pre-defined filters for dates, security, numeric fields

Extensible ranking formulas
• For example, incorporate external scoring factors

Rich set of local and remote APIs
• Suitable for simple and complex implementations
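To make the n-gram tokenization item above concrete, here is a minimal sketch of overlapping character n-grams computed per whitespace-delimited word. It is a generic illustration, not Juru's tokenizer: the function name and the choice of n are assumptions, and the slides do not say which n-gram sizes Juru uses.

```python
def ngram_tokens(text, n=2):
    """Split each whitespace-delimited word into overlapping character n-grams.

    Words no longer than n are kept whole. For text without white space
    (e.g. CJK) the entire string is treated as one word.
    """
    tokens = []
    for word in text.split():
        if len(word) <= n:
            tokens.append(word)
        else:
            tokens.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return tokens


# No white space: bigrams let a query match any two-character substring.
print(ngram_tokens("東京都庁", n=2))

# No stemmer available: trigrams give partial matches between related word forms.
print(ngram_tokens("search engine", n=3))
```

Indexing and querying with the same n-grams trades some precision for recall: related word forms share most of their n-grams, so they match without language-specific analysis.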


Portal Search - Index Build Process

(Diagram; elements as labeled)
• Crawler
• Filter
• Text analysis components: Categorizer, Summarization, Document filters
• Metadata injected into original content
• Approval Workflow
• Approved set of content ("In-basket")
• Indexer
• Collection


Search Portlet – Detailed View


Search Portlet – Browse View


Lotus Workplace Search


Lotus New Email Search


IBM Enterprise Search

Stand-alone solution: adds scale, administration, global analysis and more
• Index and search over 8 million unique pages
• Over 25 million unique URLs
• Over 7,000 websites
• 30-40K searches per day (300,000 employees)
• In production since Sept 2003
• Access a wide range of content: Intranet, news forums, Persona pages, Blue pages

Main Features
• Excellent relevancy of search
• Outstanding performance
• Ease of administration
• High scalability


Intranet Search Architecture

(Diagram; elements as labeled)
Components: W3.IBM.COM, Crawler, Indexer, Search Server A, Search Server B
1 - Crawling
2 - Parsing & Tokenizing
3 - Indexing: Build & Push to Search Server
4 - Searching

Developed by Haifa Research based on Juru Technology
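The four numbered steps in the architecture (crawling, parsing and tokenizing, building the index and pushing it to the search servers, searching) can be sketched end to end with a toy inverted index. Everything here is an assumption for illustration: the in-memory `PAGES` dictionary stands in for a real crawl, the tokenizer is a simple regular expression, and the `SearchServer` class is hypothetical. It is not the Juru-based implementation.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for crawled intranet pages (URL -> raw text).
PAGES = {
    "w3/hr/benefits": "Employee benefits and health plans",
    "w3/it/vpn": "VPN setup guide for remote employee access",
    "w3/news/q4": "Quarterly news and results",
}


def crawl(pages):
    """Step 1: yield (url, raw_text) pairs; a real crawler would fetch them."""
    yield from pages.items()


def tokenize(text):
    """Step 2: lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Step 3 (build): inverted index mapping each term to the URLs containing it."""
    index = defaultdict(set)
    for url, text in docs:
        for term in tokenize(text):
            index[term].add(url)
    return dict(index)


class SearchServer:
    """Step 4: answers queries from whatever index was last pushed to it."""

    def __init__(self, name):
        self.name = name
        self.index = {}

    def load(self, index):
        self.index = index

    def search(self, query):
        """AND semantics: return the URLs that contain every query term."""
        postings = [self.index.get(t, set()) for t in tokenize(query)]
        return sorted(set.intersection(*postings)) if postings else []


server_a, server_b = SearchServer("A"), SearchServer("B")
index = build_index(crawl(PAGES))
for server in (server_a, server_b):  # Step 3 (push): same index to every server
    server.load(index)

print(server_a.search("employee"))
print(server_b.search("employee vpn"))
```

Building the index in one place and pushing it to several search servers keeps query serving separate from indexing, so servers can be added for query load or availability without touching the crawler or indexer.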


IBM Search in Support of the Enterprise

Feature: Enterprise Search | Portal Search
Query format: Free text, AND of terms first | Free text + advanced search
XML support: XML mapping | XML fragments
Categorization: Rule-based + fixed taxonomy | Rule-based + fixed taxonomy
Rich APIs: Search only | Search & indexing
Security: Built-in support | Built-in support + post filtering
Document formats: 8 major formats + NNTP | Over 200 formats
Duplicate elimination: Near duplicates | Exact duplicates
Support for document structure: Index all metadata | Index all metadata, admin can change


Additional Haifa Research Highlights

QUEST – QUEry Sensitive Tuning
• Novel adaptable ranking formula combining textual and static factors
• Parameters tuned according to query type

Efficient Query Evaluation Using a Two-Level Retrieval Process
• Quickly determine if document is candidate for top K
• Fully evaluate only promising candidates
• Skip unpromising documents
• Goal: maximize skipping and minimize full evaluations

Searching XML documents via XML fragments
• Documents and queries represented as XML
• Extension of the vector-space model to handle tags
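The two-level retrieval idea on this slide (a cheap test for whether a document can reach the top K, followed by full evaluation only for promising candidates) can be sketched as follows. This is a simplified illustration, not the Haifa algorithm: the per-term upper bounds, the `full_score` callback, and the toy data are all assumptions, and a production system would traverse sorted posting lists rather than materialize a candidate set.

```python
import heapq


def two_level_top_k(query_terms, postings, upper_bounds, full_score, k):
    """Return the top-k (score, doc_id) pairs and the number of full evaluations.

    postings:     term -> set of doc ids containing the term
    upper_bounds: term -> maximum contribution the term can make to any score
    full_score:   function(doc_id) -> exact (expensive) score
    """
    candidates = set().union(*(postings.get(t, set()) for t in query_terms))
    heap = []              # min-heap holding the current top-k (score, doc_id)
    full_evaluations = 0
    for doc in sorted(candidates):
        # Level 1: cheap upper bound from the query terms the document contains.
        bound = sum(upper_bounds[t] for t in query_terms
                    if doc in postings.get(t, set()))
        if len(heap) == k and bound <= heap[0][0]:
            continue       # cannot enter the top k: skip the full evaluation
        # Level 2: exact scoring, only for promising candidates.
        score = full_score(doc)
        full_evaluations += 1
        if len(heap) < k:
            heapq.heappush(heap, (score, doc))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc))
    return sorted(heap, reverse=True), full_evaluations


# Toy data: exact scores never exceed the sum of the matching terms' upper bounds.
postings = {"ibm": {1, 2, 3, 4}, "search": {1, 3}}
upper_bounds = {"ibm": 1.0, "search": 2.0}
exact_scores = {1: 2.8, 2: 0.9, 3: 2.5, 4: 0.7}

top, evaluated = two_level_top_k(["ibm", "search"], postings, upper_bounds,
                                 exact_scores.get, k=2)
print(top, evaluated)
```

Skipping is safe only because the level-one bound never underestimates the exact score. The tighter the bounds and the sooner the heap fills with high scores, the more documents are skipped.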


Conclusion

Enterprise search is shaped by Web search but poses different challenges
IBM is focused on Enterprise search solutions
Search is embedded in many IBM products
Haifa is the center of competence for search in IBM
• Fully owns the Juru search engine integrated in Lotus products
• Significant contributor to IBM's stand-alone search engine
Haifa is extremely active in the IR and Web research community


Our Secret Sauce

Documents
