Surviving the Information Glut bbtitle 9/23/94 Presentation by Bob Boeri Factory Mutual Engineering & Research [email protected] October 7, 1994

Beyond Boolean - Enterprise Search Technologies


DESCRIPTION

Presentation I gave at a local Boston conference in 1994 about Enterprise Search. Some predictions panned out; some did not (at least not yet).


Page 1: Beyond Boolean - Enterprise Search Technologies

Surviving the Information Glut

bbtitle 9/23/94

Presentation by Bob Boeri
Factory Mutual Engineering & Research
[email protected]

October 7, 1994

Page 2: Beyond Boolean - Enterprise Search Technologies

Roots of the Problem

- Storage: increasing
- Access: faster
- Document complexity: more
- Information quantity: increasing exponentially

bb1 9/13/94

"'A cat may look at a king,' said Alice. 'I've read that in some book, but I don't remember where.'" -- Alice in Wonderland

Page 3: Beyond Boolean - Enterprise Search Technologies

Document Complexity

- word processor types
- fonts
- rich layouts
- tables
- graphics/photos
- video, sound, hypertext
- SGML
- ... what isn't a document?

bb2 9/13/94

"and what is the use of a book," thought Alice, "without pictures or conversations?" -- Alice in Wonderland

Page 4: Beyond Boolean - Enterprise Search Technologies

How to Find What You Are Looking For

- how large is your collection of documents?
- how complex are they?
- how complex are the searches?
- who will search?
  • individuals working by themselves
  • members of a corporate organization

bb3 9/17/94

Page 5: Beyond Boolean - Enterprise Search Technologies

Searching a Very Small Collection

- a few dozen documents
- simple structure (e.g., memo or e-mail)
- written consistently (e.g., you, one author)

Find that note about an inexpensive, simple word processor that never needs upgrading and will let you add simple graphics to your writing. Runs under MS-Windows.

bb4 9/17/94

Page 6: Beyond Boolean - Enterprise Search Technologies

Trivial Search Techniques

- browse through each one
- use a word processor "list files"
- simple search system, simple Boolean search

bb5 9/17/94

Page 7: Beyond Boolean - Enterprise Search Technologies

First Search Barrier

- somewhere between a few dozen documents and several hundred
- can't remember the exact words to search for; begin searching synonyms or using wild cards.

bb6 9/14/94

She went on, rather surprised at not being able to think of the word. 'I mean to get under the -- under the -- under THIS, you know!' putting her hand on the trunk of a tree. -- Alice in Wonderland

Page 8: Beyond Boolean - Enterprise Search Technologies

First-Level Search Techniques

- Range searches: "word processor" <sentence> "inexpensive"
  • only 2 hits (probably missed something).
  • forgot to ask about graphics support.

bb7 9/15/94

Page 9: Beyond Boolean - Enterprise Search Technologies

First-Level Search Techniques

- Wild cards: word process* <sentence> basic
  • 99 hits; unusable
  • Maybe asking about graphics support too will reduce number of hits

bb8 9/15/94

Page 10: Beyond Boolean - Enterprise Search Technologies

Searching Gets Complex

- (basic <sentence> word process*) <paragraph> (support* <sentence> graphic*)
  • complex expression
  • system searches a long time
  • finds nothing useful.

bb9 9/16/94

Page 11: Beyond Boolean - Enterprise Search Technologies

Sample Hits from 1st-level Complex Search

- "Although Visual Basic contains a rudimentary word processor... graphic support is really limited to OLE and DDE."
- "Basic word processing skills can sometimes be transferred to... programs which allow you to create graphic effects."

bb10 9/17/94
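The `<sentence>` proximity operator and the `*` wild card used in the preceding queries can be sketched in a few lines. This is a minimal illustration, not the behavior of any particular 1994 product: the function names (`sentences`, `has_term`, `same_sentence`) and the crude sentence splitter are assumptions made for the example, and the sample text is made up in the spirit of the hits above.

```python
import re
from fnmatch import fnmatchcase


def sentences(text):
    """Split text into sentences after ., ! or ? (a deliberately crude rule)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def words(sentence):
    """Lower-case alphanumeric tokens of one sentence."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def has_term(tokens, term):
    """True if the term occurs in tokens; a term may span several words
    and each word may use '*' as a wild card."""
    pattern = term.lower().split()
    n = len(pattern)
    return any(
        all(fnmatchcase(tok, pat) for tok, pat in zip(tokens[i:i + n], pattern))
        for i in range(len(tokens) - n + 1)
    )


def same_sentence(text, *terms):
    """Hypothetical <sentence> operator: sentences containing every term."""
    return [s for s in sentences(text) if all(has_term(words(s), t) for t in terms)]


# Made-up sample text.
doc = ("Visual Basic contains a rudimentary word processor. "
       "It is inexpensive! Graphics support is limited.")

false_hit = same_sentence(doc, "basic", "word process*")
no_hit = same_sentence(doc, "word processor", "inexpensive")
```

Here `false_hit` contains the Visual Basic sentence, which satisfies the expression without being about a basic word processor, while `no_hit` is empty because the two terms fall in different sentences. Both failure modes match the experience described on the slides.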

Page 12: Beyond Boolean - Enterprise Search Technologies

Need to Break the 1st-Level Search Barrier:

- reduce hits to most relevant
- get hits when simpler searches fail
- additional techniques beyond Boolean
  • new ways to divide and conquer
  • richer, easier search aids
  • richer reporting of results

bb11 9/19/94

Page 13: Beyond Boolean - Enterprise Search Technologies

Combine Structured and Full Text Queries

- Apply search to portion of library ("form queries")
- Requires knowledge of the library
- Requires "catalog card" for each document (e.g., date, subject)
- Smart system might construct catalog card
  • Requires highly regular documents
  • Risk of catalog errors

bb12 9/19/94

Page 14: Beyond Boolean - Enterprise Search Technologies

Combine Structured and Full Text Queries

- Could design as a form for users to fill out
- Example:

    DATE: after 1/1/94
    (inexpensive <sentence> "word processor" <sentence> "windows")

bb13 9/19/94
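A form query like the one above can be sketched as a structured filter followed by a full-text test. The `Document` class, its `written` field, and the helper names are hypothetical stand-ins for a "catalog card"; phrase matching here is a plain substring check, which is an assumption rather than how any specific engine works.

```python
import re
from dataclasses import dataclass
from datetime import date


@dataclass
class Document:
    title: str
    written: date   # hypothetical "catalog card" date field
    text: str


def sentence_has_all(text, phrases):
    """True if some sentence contains every phrase (case-insensitive substring)."""
    for sentence in re.split(r"(?<=[.!?])\s+", text.lower()):
        if all(p.lower() in sentence for p in phrases):
            return True
    return False


def form_query(library, after, phrases):
    """Apply the structured DATE test first, then the full-text test."""
    return [d for d in library
            if d.written > after and sentence_has_all(d.text, phrases)]


# Made-up library for illustration.
library = [
    Document("Old memo", date(1993, 6, 1),
             "An inexpensive word processor for Windows."),
    Document("New note", date(1994, 3, 15),
             "WriteIt is an inexpensive word processor for Windows. It handles graphics."),
    Document("Other", date(1994, 5, 2),
             "Windows is popular. A word processor helps."),
]

hits = form_query(library, date(1994, 1, 1),
                  ["inexpensive", "word processor", "windows"])
```

Only "New note" survives: the old memo fails the date field, and the third document never has all three phrases in one sentence. Filtering on the catalog field first also keeps the slower text scan off documents that cannot match.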

Page 15: Beyond Boolean - Enterprise Search Technologies

Relevancy Ranking

- Puts most likely hits at the top of the list
- Requires understanding of what's most important
  • # of hits/document
  • weighting certain hits (e.g., exact matches) more than others
  • weighting other criteria (such as date or other structured fields)
  • let users say what's most important to them

bb14 9/19/94
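The ranking criteria listed above can be combined into a single weighted score. The sketch below is one simple way to do it, assuming a linear combination; the weight names (`hit`, `exact`, `recent`, `recent_since`) and the sample documents are invented for illustration.

```python
import re
from datetime import date


def score(doc, terms, weights):
    """Weighted sum of: term hits, exact phrase matches, and a recency bonus."""
    text = doc["text"].lower()
    tokens = re.findall(r"[a-z0-9]+", text)
    hits = sum(tokens.count(t) for t in terms)          # number of hits per document
    exact = text.count(" ".join(terms))                 # exact phrase occurrences
    recent = 1 if doc["date"] >= weights["recent_since"] else 0
    return (weights["hit"] * hits
            + weights["exact"] * exact
            + weights["recent"] * recent)


def rank(docs, terms, weights):
    """Most relevant documents first."""
    return sorted(docs, key=lambda d: score(d, terms, weights), reverse=True)


# Made-up documents.
a = {"name": "a", "date": date(1993, 5, 1),
     "text": "Word processor review: this word processor is cheap"}
b = {"name": "b", "date": date(1994, 6, 1),
     "text": "A processor for every word of your budget"}

terms = ["word", "processor"]
favor_exact = {"hit": 1, "exact": 3, "recent": 2, "recent_since": date(1994, 1, 1)}
favor_recent = {"hit": 1, "exact": 0, "recent": 10, "recent_since": date(1994, 1, 1)}

by_exact = [d["name"] for d in rank([a, b], terms, favor_exact)]
by_recent = [d["name"] for d in rank([a, b], terms, favor_recent)]
```

With `favor_exact` the older review ranks first because it contains the exact phrase twice; with `favor_recent` the order flips. Exposing the weights is one way to "let users say what's most important to them."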

Page 16: Beyond Boolean - Enterprise Search Technologies

Thesauruses

- General
- Specific
  • medical
  • legal
  • scientific
- user-modifiable
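Thesaurus expansion can be sketched as looking a query term up in one or more synonym tables and searching for any of the results. The table contents and the names `GENERAL`, `COMPUTING`, and `user_terms` are made up for this example; a real thesaurus would be far larger.

```python
# Hypothetical synonym tables: a general one, a subject-specific one,
# and one the user can edit.
GENERAL = {"inexpensive": {"cheap", "low-cost"}, "simple": {"easy", "basic"}}
COMPUTING = {"word processor": {"text editor"}}
user_terms = {"inexpensive": {"budget"}}


def expand(term, *thesauruses):
    """Return the term plus every synonym found in the given thesauruses."""
    expanded = {term}
    for thesaurus in thesauruses:
        expanded |= thesaurus.get(term, set())
    return expanded


def matches(text, term, *thesauruses):
    """True if the text contains the term or any of its synonyms."""
    lowered = text.lower()
    return any(variant in lowered for variant in expand(term, *thesauruses))
```

For instance, `matches("A cheap editor", "inexpensive", GENERAL)` succeeds where the unexpanded search `matches("A cheap editor", "inexpensive")` does not.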

bb15 9/20/94

"I don't know the meaning of half those long words, and what's more, I don't believe you do either!" -- Alice in Wonderland

Page 17: Beyond Boolean - Enterprise Search Technologies

Linguistic Helps

- Automatic search for parts of speech
  • "sprinkle" also searches for "sprinkled," "sprinkling," etc.
- Fuzzy search
  • "sprinkle" also searches for "sparkle"
  • helps overcome some OCR errors.
  • user-specifiable (how many letters to make "fuzzy")
  • gets words you would have missed
  • gets words that make no sense at all.
- Natural Language Queries: ("Find me cheap reliable easy Windows word processors")

bb16 9/20/94

"Language is worth a thousand pounds a word." -- Through the Looking Glass
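The "sprinkle" also finds "sprinkled" and "sprinkling" behavior can be approximated by reducing words to a common stem before comparing them. The suffix list below is a toy assumption chosen for this example, far cruder than a real linguistic stemmer, and the function names are hypothetical.

```python
import re

# Toy suffix list, checked in order; an assumption for illustration only.
SUFFIXES = ("ing", "ed", "es", "s", "e")


def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def inflection_search(query, text):
    """Return the words in text that share a stem with the query."""
    target = stem(query)
    return [w for w in re.findall(r"[A-Za-z]+", text) if stem(w) == target]


found = inflection_search(
    "sprinkle",
    "She sprinkled sugar while sprinkling water; a sprinkle of sparkle.",
)
```

`found` holds "sprinkled", "sprinkling", and "sprinkle", but not "sparkle", which only a fuzzy search would pick up.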

Page 18: Beyond Boolean - Enterprise Search Technologies

Complex and Modular Queries

- Create, debug, save queries
- Use queries as models for new queries
- If modular ("Legos")
  • assemble large search queries by plugging together smaller ones.
  • fine tune searches (adjusting rankings of search criteria).
  • build libraries of modular searches

bb19 9/22/94
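One way to picture modular queries is as small functions that can be plugged together. In this sketch each query is a function from text to True/False; the combinator names (`term`, `all_of`, `any_of`) and the saved queries are assumptions for illustration, not a real product's query language.

```python
def term(word):
    """Smallest building block: case-insensitive substring test."""
    return lambda text: word.lower() in text.lower()


def all_of(*queries):
    """Boolean AND of any number of queries."""
    return lambda text: all(q(text) for q in queries)


def any_of(*queries):
    """Boolean OR of any number of queries."""
    return lambda text: any(q(text) for q in queries)


# A small "library" of saved, reusable queries.
cheap = any_of(term("inexpensive"), term("cheap"))
word_processor = any_of(term("word processor"), term("word processing"))
windows = term("windows")

# Larger queries assembled from the smaller ones.
cheap_windows_wp = all_of(cheap, word_processor, windows)
with_graphics = all_of(cheap_windows_wp, term("graphics"))
```

Because `cheap_windows_wp` is itself a query, it can be saved, tested on its own, and reused as a block inside `with_graphics`.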

Page 19: Beyond Boolean - Enterprise Search Technologies

Fuzzy Searches

- use neural network technology
- like sophisticated wildcard searches
- help overcome OCR errors
- find good matches and irrelevant ones
- can distort relevancy rankings by hit count

bb20 9/22/94
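The slide attributes fuzzy search to neural network technology. As a simpler stand-in, the sketch below uses Levenshtein edit distance, which is a different technique but shows the same trade-off: a user-set tolerance recovers OCR errors and also lets in unrelated words. The vocabulary is made up, and `max_edits` plays the role of "how many letters to make fuzzy."

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character inserts,
    deletes, or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute or keep
        prev = cur
    return prev[-1]


def fuzzy_matches(query, vocabulary, max_edits):
    """Words within max_edits of the query, in alphabetical order."""
    return sorted(w for w in vocabulary if edit_distance(query, w) <= max_edits)


# Made-up index vocabulary; "sprink1e" imitates an OCR misread of "l" as "1".
vocabulary = ["sprinkle", "sprinkled", "sparkle", "sprink1e", "spring"]

tight = fuzzy_matches("sprinkle", vocabulary, 1)
loose = fuzzy_matches("sprinkle", vocabulary, 3)
```

`tight` catches the OCR error and the inflected form; `loose` also admits "sparkle" and "spring," the kind of irrelevant match that can inflate hit counts and distort ranking.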

Page 20: Beyond Boolean - Enterprise Search Technologies

SGML Usage

- "Zone" searches
  • Confine searches to paragraph headings, chapter titles, etc.
- Use SGML DTDs directly:
  • Full, Arbitrary (all DTDs)
    - exploits full capabilities of your tag set
    - performance and/or size penalties
  • Specific DTDs only
    - "Any color Ford you want as long as it's black."
    - May be tuned for better use

bb17 9/20/94
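A zone search can be illustrated by restricting the match to the text of one element type. The sketch below uses Python's XML parser on a small well-formed sample as a stand-in; real SGML with an arbitrary DTD (omitted end tags, for example) would need a proper SGML parser, and the tag names `chapter`, `title`, and `para` are invented for the example.

```python
import xml.etree.ElementTree as ET

# Made-up, XML-well-formed sample standing in for an SGML document.
SAMPLE = """<doc>
  <chapter><title>Choosing a Word Processor</title>
    <para>Graphics support varies widely.</para></chapter>
  <chapter><title>Printing</title>
    <para>A word processor may print slowly.</para></chapter>
</doc>"""


def zone_search(markup, zone, phrase):
    """Return the text of each <zone> element that contains the phrase."""
    root = ET.fromstring(markup)
    return [el.text for el in root.iter(zone)
            if el.text and phrase.lower() in el.text.lower()]


in_titles = zone_search(SAMPLE, "title", "word processor")
in_paras = zone_search(SAMPLE, "para", "word processor")
```

The same phrase yields different hits depending on the zone: `in_titles` finds only the chapter title, while `in_paras` finds only the body paragraph.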

Page 21: Beyond Boolean - Enterprise Search Technologies

SGML Usage

- Filter (convert) SGML tags to application specific codes.
  • Not authentic SGML use
  • May be better performance than authentic SGML
- Best when documents are themselves highly structured.
- One-way (from SGML to proprietary); loses important SGML benefit.
- Few vendors support SGML well
- Those who do may skimp on other search facilities.

bb18 9/21/94

Page 22: Beyond Boolean - Enterprise Search Technologies

Interest Profiling

- Profile determined by any number of means
- "I like these documents. Find me more like this."
  • simple
  • unexpected results
  • electronic highlighter improves search
- The more search tools the better.
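"Find me more like this" can be sketched by building a word-count profile from the documents a user liked and ranking candidates by cosine similarity to it. This is one common approach, assumed here for illustration; the function names and sample texts are made up.

```python
import math
import re
from collections import Counter


def vector(text):
    """Bag-of-words vector: word -> count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity of two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def more_like_this(liked, candidates):
    """Rank candidates by similarity to the combined profile of liked texts."""
    profile = Counter()
    for text in liked:
        profile.update(vector(text))
    return sorted(candidates, key=lambda c: cosine(profile, vector(c)), reverse=True)


# Made-up examples.
liked = ["cheap word processor for windows",
         "simple word processor with graphics"]
candidates = ["used saab with low mileage",
              "a word processor for windows with graphics"]

ranked = more_like_this(liked, candidates)
```

The Saab advertisement still scores above zero because it shares the word "with," a small example of the "unexpected results" such profiles can produce.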

Page 23: Beyond Boolean - Enterprise Search Technologies

Information Agents

- passive
  • computed once, updated periodically
  • use when you choose (whenever new CD-ROM title appears)
- active
  • information gobots
  • always on the lookout for anything relevant
  • inform you with results or email notification
  • on-line or jukeboxes

Looking in classifieds for a low-mileage Saab, prefer beige or red, one-owner, automatic, 1993 or newer, less than $10,000.

Looking in PC literature for Windows word processor, easy to use, never needs upgrades, can handle graphics, bug-free, uses 1MB disk, less than $29.95.

Page 24: Beyond Boolean - Enterprise Search Technologies

Collateral Issues: Authoring and Using

- Authoring
  • Populating the system
  • Subject areas and forms
  • Document size
  • Legacy Documents

bb24 9/23/94

Page 25: Beyond Boolean - Enterprise Search Technologies

Populating the system

- Security: everyone have identical access?
- Easy way to get documents into system?
- Form per document for form queries?
  • date, subject area, sub-type?
  • subject area (e.g., word processors)?
  • sub-types within areas (e.g., character-based, GUI)
- Easy way to retract documents? Re-file documents? "See also" subject areas?
- QA of forms and documents
  • Form field info correct?
  • Complex document objects (e.g., tables).

bb25 9/24/94

Page 26: Beyond Boolean - Enterprise Search Technologies

Document Size

- Whole documents or chunks?
- What's appropriate to users?
  • Effort to build collection
  • Precision of hits
  • Size of hit list
  • What's natural and expected

bb26 9/24/94

"What size do you want to be," the caterpillar asked.

"Oh, I'm not so particular as to size," Alice hastily replied. "Only one doesn't like changing so often, you know."

-- Alice in Wonderland

Page 27: Beyond Boolean - Enterprise Search Technologies

Legacy Documents

- Paper
  • size, number, quality
  • OCR
  • Ability to attach page images
  • At least name file for faxing
- Electronic
  • document type
  • quality of author practices
  • fonts...
  • command launch when possible
  • what about form queries/document?

bb27 9/24/94

"These words were followed by a very long silence, broken only by an occasional exclamation of 'Hjckrrh!' from the Gryphon."

-- Alice in Wonderland

Page 28: Beyond Boolean - Enterprise Search Technologies

Collateral Issues: Using

- Pi fonts
- Non-English characters
- Equations
- Font fidelity, size on-screen
  • letter "o" and zero
  • digit one "1", letter el "l", and capital i "I".

bb28 9/25/94

"The White Queen whispered, 'I can read words of one letter!... However, don't be discouraged. You'll come to it in time.'"

-- Through the Looking Glass

Page 29: Beyond Boolean - Enterprise Search Technologies

Collateral Issues: Using

- Navigation within documents
- Viewers
- Launching when Viewers Inadequate
- CD-ROM Performance
- Exporting information for reuse.
- Printing

bb29 9/25/94

"... the books are something like our books, only the words go the wrong way."

-- Through the Looking Glass

Page 30: Beyond Boolean - Enterprise Search Technologies

Collateral Issues: Using

- Interactive searches
- Batch searches ("go do this later and tell me what you found")
- Autonomous information agents
  • Continuous monitoring
  • Urgent, routine notification
  • Empower agents to "Ring a bell"; "Push a button"
  • Active documents: "Go find me more like yourself"

bb30 9/26/94
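An autonomous agent of the kind described above can be pictured as a standing query that is re-run whenever the collection changes and that notifies its owner only about new matches. The class name `InformationAgent`, its methods, and the sample advertisements are hypothetical; the `notify` callback stands in for "ring a bell" or an e-mail notification.

```python
class InformationAgent:
    """A standing query that reports each matching document once."""

    def __init__(self, query, notify):
        self.query = query      # function: document text -> bool
        self.notify = notify    # function called once per new hit
        self.seen = set()       # documents already examined and reported

    def check(self, documents):
        """Run the query over documents and report only unseen hits."""
        new_hits = [d for d in documents
                    if d not in self.seen and self.query(d)]
        self.seen.update(new_hits)
        for hit in new_hits:
            self.notify(hit)
        return new_hits


# Made-up classified ads; the inbox list stands in for e-mail.
inbox = []
agent = InformationAgent(lambda ad: "saab" in ad.lower(), inbox.append)

monday = ["1993 Saab, beige, one owner", "Ford, black"]
tuesday = monday + ["Red Saab, automatic"]

first = agent.check(monday)
second = agent.check(tuesday)
```

The second check reports only the new red Saab, so repeated monitoring does not flood the user with hits already delivered.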

Page 31: Beyond Boolean - Enterprise Search Technologies

Adobe Acrobat version 2.0

- Powerful searching
- CD-ROM performance
- Font problem disappears
- SGML promised

bb31 9/26/94

Page 32: Beyond Boolean - Enterprise Search Technologies

And What of Our Original Search... Perfect Word Processor, Saab for a Song

Even the best searching system can't find what isn't there. But the best ones will keep on trying.

bb28 9/25/94

Alice laughed. 'There's no use trying,' she said: 'one CAN'T believe impossible things.'

'I daresay you haven't had much practice,' said the Queen. -- Through the Looking Glass