
Enterprise Search SharePoint 2009 Best Practices Final


DESCRIPTION

This presentation examines features and benefits in Microsoft Office SharePoint Server (MOSS) 2007 enterprise search. It contains configuration guidance, code snippets, tips, and tricks.


Page 1: Enterprise Search Share Point2009 Best Practices Final

Good afternoon, and many thanks for attending the last session on the last day of this conference. The focus of this presentation is the many excellent features contained in MOSS 2007 search. My goal is to show you why these features are excellent so that you will make use of them. Because if you do, you will be able to walk the halls of your organization with your heads held high and fear no “search sucks” cracks as you do.


Page 2: Enterprise Search Share Point2009 Best Practices Final

I am a pointy-head and not a propeller-head. While there are technical references in this presentation, the orientation will be more behavioral and less technical. There are terrific technical resources contained in the Resources section, and the occasional snippet of code did make its way into the main section.


Page 3: Enterprise Search Share Point2009 Best Practices Final


Page 4: Enterprise Search Share Point2009 Best Practices Final

UC Berkeley Study on How Much Information: http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/

Print, film, magnetic, and optical storage media produced about 5 exabytes of new information in 2002. Ninety-two percent of the new information was stored on magnetic media, mostly in hard disks.

How big is five exabytes? If digitized with full formatting, the seventeen million books in the Library of Congress contain about 136 terabytes of information; five exabytes of information is equivalent in size to the information contained in 37,000 new libraries the size of the Library of Congress book collections.

Hard disks store most new information. Ninety-two percent of new information is stored on magnetic media, primarily hard disks. Film represents 7% of the total, paper 0.01%, and optical media 0.002%.

The United States produces about 40% of the world's new stored information, including 33% of the world's new printed information, 30% of the world's new film titles, 40% of the world's information stored on optical media, and about 50% of the information stored on magnetic media.

How much new information per person? According to the Population Reference Bureau, the world population is 6.3 billion, thus almost 800 MB of recorded information is produced per person each year. It would take about 30 feet of books to store the equivalent of 800 MB of information on paper.

We estimate that the amount of new information stored on paper, film, magnetic, and optical media has about doubled in the last three years.

Information explosion? We estimate that new stored information grew about 30% a year between 1999 and 2002.

Paperless society? The amount of information printed on paper is still increasing, but the vast majority of original information on paper is produced by individuals in office documents and postal mail, not in formally published titles such as books, newspapers and journals.

Hosted websites [UC Berkeley How Much Information Project]

•July 1993: 1,776,000

•July 2005: 353,084,187

Size of the Web [Indexable Web: Gulli & Signorini 2005]

•1997: 200 million Web pages

•2005: 11.5 billion pages


Page 5: Enterprise Search Share Point2009 Best Practices Final


Information R/evolution: Michael Wesch, Kansas State University

http://www.youtube.com/user/mwesch

All of his work is very good

And how we manage information is different because searchers are squishy – some just want to find “it”, others want it to find them and others want to change it, create it, manipulate it, share it…

•They are searching because they don’t know

•Language and perception are different

•Some people think women put their stuff in a purse, others a pocketbook, and others a handbag.

•“Animal” is a mammal, a Sesame Street character, and an uncouth person

•Enterprise information is individualized.

•Gates Foundation has different issues than PACCAR

•Providence Healthcare has different types of content than King County Library

•Codeplex has a different user type [or a more standard one] than Microsoft Virtual Earth

Page 6: Enterprise Search Share Point2009 Best Practices Final


Search engines use bots to crawl pages and send compressed data back to the index, applying grammatical processing such as stemming [taking the word down to its most basic root] and removing stop words [common articles and other terms stipulated by the company]. This index is then inverted so that lookup is done on the basis of record contents and not the document ID, which is a completely different method of data storage and retrieval from relational database storage. A complete copy of the Web page may be stored in the search engine’s cache. With brute-force calculation, the system pulls each record from the inverted index [a mapping of words to where they appear in document text]. This is recall, or all documents in the corpus with text instances that match the query term(s).
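A minimal sketch of the idea in plain C# (a toy illustration, not MOSS internals): build an inverted index that maps each term to the documents containing it, then answer a query by intersecting the posting lists — that intersection is the recall set.

using System;
using System.Collections.Generic;

class InvertedIndexSketch
{
    static void Main()
    {
        // Toy corpus: document ID -> text (hypothetical content).
        var docs = new Dictionary<int, string>
        {
            { 1, "enterprise search best practices" },
            { 2, "web search relies on links" },
            { 3, "enterprise content is bounded" }
        };

        // Build the inverted index: term -> posting list (IDs of documents containing the term).
        var index = new Dictionary<string, HashSet<int>>();
        foreach (var doc in docs)
            foreach (var term in doc.Value.Split(' '))
            {
                if (!index.ContainsKey(term)) index[term] = new HashSet<int>();
                index[term].Add(doc.Key);
            }

        // Recall: documents whose posting lists contain every query term.
        var recall = new HashSet<int>(docs.Keys);
        foreach (var term in new[] { "enterprise", "search" })
            recall.IntersectWith(index.ContainsKey(term) ? index[term] : new HashSet<int>());

        Console.WriteLine("Recall: " + string.Join(", ", recall));   // prints: Recall: 1
    }
}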

Search engine indexes are not like relational databases. There is no normalization, no unique identifiers, and only the loosest of structures.

The “secret sauce” for each search engine is the set of algorithms that sort the recall results in a meaningful fashion. This is precision, or the number of documents from recall that are relevant to your query term(s). All search engines use a common set of values to refine precision. If the search term is used in the title of the document, in heading text, formatted in any way, or used in link text, the document is considered to be more relevant to the query. If the query term(s) are used frequently throughout the document, the document is considered to be more relevant.

Another example is Term Frequency - Inverse Document Frequency [TF-IDF] weighting. Here the raw term frequency (TF) of a term in a document is multiplied by the term's inverse document frequency (IDF) weight [the number of documents in the entire corpus divided by the number of documents containing the term, usually log-scaled], so terms that are frequent in a document but rare across the corpus count the most. [caveat emptor: high-level, low-level, level-playing-field math are not my strong suits].
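As a concrete illustration with toy numbers (a standard textbook formulation, not MOSS's internal weighting):

using System;

class TfIdfSketch
{
    static void Main()
    {
        // Hypothetical numbers: the term appears 5 times in this document,
        // and 10 of the 1,000 documents in the corpus contain it.
        double tf = 5;      // term frequency in the document
        double N = 1000;    // documents in the corpus
        double df = 10;     // documents containing the term

        double idf = Math.Log(N / df);   // rarer terms get a bigger boost
        double weight = tf * idf;        // TF-IDF weight for this term/document pair

        Console.WriteLine("IDF = {0:F2}, TF-IDF = {1:F2}", idf, weight);   // IDF = 4.61, TF-IDF = 23.03
    }
}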

Page 7: Enterprise Search Share Point2009 Best Practices Final


There is a fundamental difference between Web search and Enterprise search.

Web Search:

•Web search is generic search. One size fits all. Features serve the technology to better enable it to serve the masses.

•Search technology has to work for the broadest document set, those 11 billion plus pages

•Keys off strong linking [the # and the structure]

•Links are “editorial” – endorsement of destination content through “vote”

•Millions of publishers that are not required to adhere to any specific standards

•Site structure is not often tied to content or context

•Search engines are constantly fighting attempts to game their technology in the Web search space. Black hat techniques like cloaking, link farms, spamming, keyword stuffing, Sybil attacks and the like are a blight. They manipulate the results and reduce user confidence in the system

•Technology is changing and refining its operation to rely on both internal [document level] and external [site level] data. Examples of this would be IBM’s narrative distiller, MSN link text analysis, Google Scout that finds related hyperlinks, and Yahoo!’s document segmentation

Important to note: The PageRank algorithm is a pre-query calculation. It is a value that is assigned as a result of the search engine’s indexing of the entire Web and the associated value has no relationship to the user’s information need. There have been a number of additions and enhancements to lend some contextual credence to the relevance ranking of the results.
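To make “pre-query calculation” concrete, here is a minimal PageRank-style power iteration over a toy three-page link graph (a sketch of the published algorithm, not any engine's production code); the scores exist before any query is ever issued.

using System;

class PageRankSketch
{
    static void Main()
    {
        // Toy link graph: links[i] lists the pages that page i links to.
        int[][] links = { new[] { 1 }, new[] { 0, 2 }, new[] { 0 } };
        int n = links.Length;
        double damping = 0.85;

        var rank = new double[n];
        for (int i = 0; i < n; i++) rank[i] = 1.0 / n;   // start uniform

        // Power iteration: repeatedly redistribute rank along the links.
        for (int iter = 0; iter < 50; iter++)
        {
            var next = new double[n];
            for (int i = 0; i < n; i++) next[i] = (1 - damping) / n;
            for (int i = 0; i < n; i++)
                foreach (int target in links[i])
                    next[target] += damping * rank[i] / links[i].Length;
            rank = next;
        }

        Console.WriteLine("Ranks: {0:F3} {1:F3} {2:F3}", rank[0], rank[1], rank[2]);
    }
}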

Enterprise Search:

•Bounded corpus of content

•Produced and maintained by a limited set of authors

•No strong linking strategy – links mostly for navigation [not editorial]

•Information related in ways that key outside of document content

•Hierarchical structure intended – part of corporate culture

•Publishing guidelines can be established to enforce metadata standards that tune a search appliance and improve relevance through enforced semantic relationships.

Page 8: Enterprise Search Share Point2009 Best Practices Final

In the early days of search engines, Advanced Search was a means for those who could phrase their queries in Boolean or SQL language to do so for more refined results. As search engines became more sophisticated, the need for such query-coding ability diminished.

Usability studies show that most customers avoid Advanced Search because they assume that it is too advanced for them. A better method is to offer means for the searcher to refine their own search using facets based on document type, subject or location.


Page 9: Enterprise Search Share Point2009 Best Practices Final

From MOSS 2007 search Under the Hood PPT by Adir Ron

Search Query Execution:

•The query engine passes the query through a language-specific wordbreaker.

•After wordbreaking, the resulting words are passed through a stemmer to generate language-specific inflected forms of a given word.

•When the query engine executes a property value query, the index is checked first to get a list of possible matches.

•If the user does not have permission to a matching document, the query engine filters that document out of the list that is returned.
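For reference, a property value query of the kind described in the last two bullets can be issued from code using the SQL syntax of the search object model; a minimal sketch (the site URL and the author value are placeholders, and the rows come back already security trimmed):

using System.Data;
using Microsoft.Office.Server;
using Microsoft.Office.Server.Search.Query;
using Microsoft.SharePoint;

class PropertyQuerySketch
{
    static void Main()
    {
        // Placeholder site URL; adjust for your farm.
        using (SPSite site = new SPSite("http://portal"))
        {
            FullTextSqlQuery query = new FullTextSqlQuery(ServerContext.GetContext(site));
            query.QueryText = "SELECT Title, Path, Author FROM SCOPE() " +
                              "WHERE CONTAINS(Author, '\"Jane Doe\"')";
            query.ResultTypes = ResultType.RelevantResults;
            query.RowLimit = 20;

            ResultTableCollection results = query.Execute();
            DataTable table = new DataTable();
            table.Load(results[ResultType.RelevantResults], LoadOption.OverwriteChanges);
            // Rows in 'table' are security trimmed for the calling user.
        }
    }
}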

Search Architecture

http://www.sharepointblogs.com/heliosa/archive/2007/03/07/enterprise-search-architecture-in-sharepoint-technologies-2007.aspx

• Index Engine: Processes the chunks of text and properties filtered from content sources, storing them in the content index and property store.

• Query Engine: Executes keyword and SQL syntax queries against the content index and search configuration data.

• Protocol Handlers: Opens content sources in their native protocols and exposes documents and other items to be filtered.

• IFilters: Opens documents and other content source items in their native formats and filters into chunks of text and properties.

• Property Store: Stores a table of properties and associated values.

• Wordbreakers: Used by the query and index engines to break compound words and phrases into individual words or tokens.


Page 10: Enterprise Search Share Point2009 Best Practices Final


SPS 2003 was SQL search - different db structure, a more classic relational database model

MOSS 2007 is indexed search = inverted index based on words not records -- scopes, structured Biz data search, people search

MOSS 2007

•Click Distance: Browsing distance from authoritative sites: shorter tends to be more relevant

•Anchor Text: Hyperlinks act as annotations on their target

•URL Depth: URLs higher in the hierarchy tend to be more relevant

•URL Matching: Direct matches on text in URLs

•Metadata Extraction: Automatically extract titles and authors from document text

•Automatic Language Detection: Helps bias toward results in your language

•File Type Biasing: For example, PPT docs tend to be more relevant than XLS

•Text Analysis: Traditional text ranking based on matching terms, term frequencies, word variants, etc.

SPS 2003

•Collection frequency: The number of documents a term appears in compared to total number of documents. Search terms that occur in only a few documents are likely to be more useful than terms that occur in many documents.

•Term frequency: The number of occurrences of the search term in a document. The more frequently a search term appears in a document, the more important it is likely to be for ranking that document.

•Document length: The length of the searched document. A term that occurs the same number of times in a short document as in a long one is likely to be more important to the short document.

•Term Position: The position of a word within a document, for example, presence of a term in the document’s title. A term that appears in a particular component of the document, such as the title, is more likely to be important for ranking that document.
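The first three of those factors are the classic ingredients of probabilistic ranking functions such as Okapi BM25. The sketch below is the textbook BM25 formula with toy numbers — an illustration of how the factors combine, not Microsoft's actual SPS 2003 implementation:

using System;

class Bm25Sketch
{
    static void Main()
    {
        // Toy numbers for one term in one document.
        double tf = 3;           // term frequency in the document
        double docLength = 120;  // words in the document
        double avgLength = 300;  // average document length in the corpus
        double N = 10000;        // documents in the corpus
        double df = 50;          // documents containing the term (collection frequency)
        double k1 = 1.2, b = 0.75;   // standard BM25 tuning constants

        double idf = Math.Log((N - df + 0.5) / (df + 0.5));
        double norm = tf + k1 * (1 - b + b * docLength / avgLength);
        double score = idf * (tf * (k1 + 1)) / norm;

        // Shorter documents with the same tf score higher; rarer terms score higher.
        Console.WriteLine("BM25 contribution = {0:F3}", score);
    }
}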

Page 11: Enterprise Search Share Point2009 Best Practices Final


Here is where you manage the components that control search performance and the search experience

Because search is a shared service, you only have to configure it in one location

MOSS 2007 enables testing the configuration to ensure performance

Where you put the content is not necessarily where your customers will look for it

Page 12: Enterprise Search Share Point2009 Best Practices Final


•Better management and control

•Better resource management, both hardware and personnel

•Agile index changes

Page 13: Enterprise Search Share Point2009 Best Practices Final


Text Analysis [internal]: Traditional text ranking based on such factors as matching terms, term frequencies, and word variants.

Dynamic and Static ranking: Like other search technology, MOSS 2007 Search incorporates both internal [text on the page, term frequency, page layout and formatting, etc.] and external metadata to more closely match the user’s request. However, MOSS 2007 Search incorporates cutting-edge technology from Microsoft Search to push beyond the 1 link = 1 vote for quality/relevance of the PageRank model.

•Click Distance [external]: Browsing distance from authoritative sites (shorter distances tend to be more relevant).

•Anchor Text [external]: Hyperlinks act as annotations on their target. In addition, they tend to be highly descriptive.

•URL Depth [external]: URLs higher in the hierarchy tend to be more relevant.

•URL Matching [external]: Direct matches on text that's in URLs.

•Metadata Extraction [internal]: Automatically extracts titles and authors from document text if they are missing.

•Automatic Language Detection [internal]: Helps create a preference for results in your language.

•File Type Biasing [internal]: Certain file types tend to be more relevant (for example, PPT files are often more relevant than XLS files).

Page 14: Enterprise Search Share Point2009 Best Practices Final

You must turn on stemming and PDF indexing
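Stemming is off by default in the results experience; in the UI it is a Core Results Web Part option, and from code it is a single property on a query object. A minimal sketch follows (the site URL is a placeholder). PDF indexing has no object model switch: it requires installing a PDF IFilter on the index server and adding the pdf file type in Search Settings.

using Microsoft.Office.Server;
using Microsoft.Office.Server.Search.Query;
using Microsoft.SharePoint;

class StemmingSketch
{
    static void Main()
    {
        using (SPSite site = new SPSite("http://portal"))   // placeholder URL
        {
            KeywordQuery query = new KeywordQuery(ServerContext.GetContext(site));
            query.QueryText = "running";
            query.EnableStemming = true;   // also match inflected forms such as "run" and "runs"
            query.ResultTypes = ResultType.RelevantResults;
            query.Execute();
        }
    }
}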


Page 15: Enterprise Search Share Point2009 Best Practices Final

Project Description from Codeplex http://www.codeplex.com/FacetedSearch

MOSS Faceted Search is a set of web parts that provide an intuitive way to refine search results by category (facet).

The facets are implemented using the SharePoint API and stored within the native SharePoint metadata store. The solution demonstrates the following key features:

Grouping search results by facet

Displaying a total number of hits per facet value

Refining search results by facet value

Update of the facet menu based on refined search criteria

Displaying the search criteria in a breadcrumb

Ability to exclude the chosen facet from the search criteria

Flexibility of the Faceted search configuration and its consistency with MOSS administration
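Independent of the Codeplex implementation, the grouping idea itself is simple: take the result table from a query and count hits per distinct value of a chosen managed property. A minimal sketch (the "ContentType" facet column and the stand-in rows are hypothetical):

using System;
using System.Collections.Generic;
using System.Data;

class FacetCountSketch
{
    // Count hits per facet value in a result table, e.g. one loaded from a MOSS query.
    static Dictionary<string, int> CountFacet(DataTable results, string facetColumn)
    {
        var counts = new Dictionary<string, int>();
        foreach (DataRow row in results.Rows)
        {
            string value = Convert.ToString(row[facetColumn]);
            counts[value] = counts.ContainsKey(value) ? counts[value] + 1 : 1;
        }
        return counts;
    }

    static void Main()
    {
        // Stand-in data so the sketch runs without a farm.
        var table = new DataTable();
        table.Columns.Add("Title");
        table.Columns.Add("ContentType");
        table.Rows.Add("Budget", "Document");
        table.Rows.Add("Team wiki", "Wiki Page");
        table.Rows.Add("Plan", "Document");

        foreach (var pair in CountFacet(table, "ContentType"))
            Console.WriteLine("{0}: {1}", pair.Key, pair.Value);   // Document: 2, Wiki Page: 1
    }
}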


Page 16: Enterprise Search Share Point2009 Best Practices Final

Estimated dev time to create own FLD file is 3 days (from MS internal)

It is better to pass the query through and have the destination do the relevance ranking (saves bandwidth) than to access the destination index (though you lose the proprietary relevance ranking)

Day Software Delivers Standardized Connectivity for Open Text Livelink

http://www.econtentmag.com/Articles/ArticleReader.aspx?ArticleID=19280

Using SharePoint 2007 to Index Lotus Notes

http://meiyinglim.blogspot.com/2007/01/using-sharepoint-2007-to-index-lotus.html


Page 17: Enterprise Search Share Point2009 Best Practices Final


Microsoft Knowledge Network: Stored on separate server

Version 1.0 is an add-on product for Enterprise version of Stand-alone Search and for both versions of Full Product

Refinement/scoping available

Initial results are presented with identity masked – the KN server takes the user’s request and sends it to the person, who can accept or reject the request through the KN server without the identity ever being revealed.

Page 18: Enterprise Search Share Point2009 Best Practices Final
Page 19: Enterprise Search Share Point2009 Best Practices Final


The Business Data Catalog (BDC) crawls and integrates data from other applications [email servers, line-of-business applications, external databases, customer relationship management apps] and puts it into a cache for crawl by the search server.

It accesses these repositories with a connector: http://msdn.microsoft.com/en-us/library/ms563661.aspx

Available in the MOSS 2007 Search Enterprise edition and both versions of the MOSS 2007 Full Product

Page 20: Enterprise Search Share Point2009 Best Practices Final

Short term: FAST will remain an independent entity that Microsoft will continue to support on the non-Windows platforms with a connector for MOSS 2007. The next release will see 2 versions of FAST ESP, a stand-alone successor and a SharePoint edition that will incorporate the connector and add new features that require less customization

Relevance by using the underlying semantic relationships

•Categorization

•Transformation (lemmatization)

•Presentation

FAST Platform

•unity (federation of results from outside resources)

•admomentum (search driven monetization with ad serving)

•recommendations (recommendation engine similar to Amazon/Netflix - based on behavior of user base - cookie based, item to item, people to items)

•featured content (search driven content merchandizing)

•fast unity (search driven portal experiences)

Core Capabilities

•phrasing and anti-phrasing: strips out the extraneous terms

•clustering: comprehension through association

•can be taxonomy based or on the Open Source Directory

•flexible relevancy model: boost block search results - dynamic on per query basis

•whole equalizer with a whole set of knobs - reissues the query with different weights based on choices - ranking more than filtering - does not change the # of results, changes the order of display

•can work in conjunction with faceted search


Page 21: Enterprise Search Share Point2009 Best Practices Final


Search Scopes

Represent a collection of documents mapped to a single element [e.g., authored by, specific directory, file type, metadata type], no longer tied to an index crawl – effective immediately.

By default, the scope plug-in will create scopes for the following:

•Display URL

•Site (domain, sub-domain, host-name)

•Author

•All content (used to include all content)

•Global query exclusions (used to exclude content)

Results Collapsing

Results collapsing can group duplicated or similar results together, so that they are displayed as one entry in the search result set. This entry includes a link to display the expanded results for that collapsed result set entry. Search administrators can collapse results for the following content item groups:

•Duplicates and derivatives of documents

•Windows SharePoint Services discussion messages for the same topic

•Microsoft Exchange Server public folder messages for the same conversation topic

•Current versions of the same document

•Different language versions of the same document

•Content from the same site

By default, results collapsing is turned on in Enterprise Search. The search administrator can configure it, however, either through the Search Administration UI or the Search Administration object model.

Security Trimmed Results: they don’t see what they are not allowed to see

Best Bets: editorially programmed results or what you want them to want to see

Page 22: Enterprise Search Share Point2009 Best Practices Final


Page 23: Enterprise Search Share Point2009 Best Practices Final


Page 24: Enterprise Search Share Point2009 Best Practices Final


Report Center

•Dashboard-style data presentation

•Keys off a document library of reports

•Can import KPIs

KPIs are a central way of presenting business intelligence for an organization. High level goals for organization or site

KPIs increase the speed and efficiency of evaluating progress against key business goals. Reduces the amount of data for analysis

KPIs connect to business data from various sources. Consolidates data against KPI, not repository.

Each KPI gets a single value from a data source, either from a single property or by calculating averages across the selected data, and then compares that value against a pre-selected value. Data sources include:

•Excel workbooks: The data comes from an Excel workbook.

•SQL Server 2005 Analysis Services: The data comes from database stores known as cubes, for connections in a data connection library.

•Manually entered information: The data is from a static list, rather than based on underlying data sources. This is used less frequently, for test purposes prior to deployment or on occasions when regular data sources are unavailable but you still want to provide performance indicators

Page 25: Enterprise Search Share Point2009 Best Practices Final

Sometimes configuring search can seem like that big ticking box from Acme…


Page 26: Enterprise Search Share Point2009 Best Practices Final


Frank Lloyd Wright said something along the lines of it being easier to take an eraser to the drafting table than a sledgehammer to the construction site.

Page 27: Enterprise Search Share Point2009 Best Practices Final

Don’t boil the ocean.

A smaller segment of your content is satisfying a significant portion of your customer searches

Search logs, customer feedback, and server logs will reveal this portion


Page 28: Enterprise Search Share Point2009 Best Practices Final


Page 29: Enterprise Search Share Point2009 Best Practices Final

HILLTOP

Performed on a small subset of the corpus that best represents the nature of the whole

Ranked according to the number of non-affiliated “experts” that point to it – i.e. not in the same site or directory

Affiliation is transitive [if A=B and B=C then A=C]

Beauty of Hilltop is that unlike PageRank, it is query-specific and reinforces the relationship between the authority and the user’s query. You don’t have to be big or have a thousand links from auto parts sites to be an “authority”

Segmentation of corpus into broad topics

Subset that is then extrapolated to Web as a whole

Selection of authority sources within these topic areas

Authorities have lots of non-related pages on the same subject pointing to them

Quality of links more important than quantity of links

Determination of HUBS (pages that point to many authority sources)

Pre query calculations applied at query time

TOPIC SENSITIVE PR

•Consolidation of Hypertext Induced Topic Selection [HITS] and PageRank

•Pre-query calculation of factors based on subset of corpus: context of term use in document, context of term use in history of queries and context of term use by user submitting query

•Computes PR based on a set of representational topics [augments PR with content analysis]

•Topics derived from the Open Directory Project

•Uses a set of ranking vectors: Pre-query selection of topics + at-query comparison of the similarity of query to topics

Page 30: Enterprise Search Share Point2009 Best Practices Final

Page 31: Enterprise Search Share Point2009 Best Practices Final

Page 32: Enterprise Search Share Point2009 Best Practices Final

Page 33: Enterprise Search Share Point2009 Best Practices Final

Page 34: Enterprise Search Share Point2009 Best Practices Final

Page 35: Enterprise Search Share Point2009 Best Practices Final

Page 36: Enterprise Search Share Point2009 Best Practices Final

During the age of early explorers, map makers would insert this phrase when they reached the edge of their known world.

The “dragons” on the following slides are known issues that Ascentium developers have discovered in working with MOSS 2007 search or found through my own research. Few diamonds are flawless. I find it best to address the shortcomings upfront and have solutions in hand to mitigate customer pain.

Page 37: Enterprise Search Share Point2009 Best Practices Final

Page 38: Enterprise Search Share Point2009 Best Practices Final

Page 39: Enterprise Search Share Point2009 Best Practices Final

Page 40: Enterprise Search Share Point2009 Best Practices Final

Page 41: Enterprise Search Share Point2009 Best Practices Final

Page 42: Enterprise Search Share Point2009 Best Practices Final

Page 43: Enterprise Search Share Point2009 Best Practices Final

Page 44: Enterprise Search Share Point2009 Best Practices Final

•Advanced auto-classification, taxonomy management and compound term metadata tagging technology

•The only statistical metadata generation, auto-classification, and taxonomy management vendor in the world that uses concept extraction and compound term processing

•Proven to deliver the highest precision without the loss of recall

•The only tagging and classification solution fully integrated with MOSS, Microsoft Office, Exchange and Microsoft Enterprise Search

•Automatically classifies content at the time of creation or ingestion

•Generates compound term metadata (concepts) and stores in SharePoint properties

•Automatic classification within MS Office applications, metadata stored in the document

•Taxonomy Manager -Supports multiple taxonomies

•Priced by server - $95K per production server, $47.5K per staging/test server

•Highly scalable

•Vertical applications (Legal, Finance, eDiscovery, Services, Oil & Gas, Manufacturing, Government, Education, Life Sciences & Healthcare, Energy & Utilities)

•Horizontal applications (ECM, Document Management, Compliance & Risk Management, Records Management, Enterprise Search, Portals, Intranets & Information Rich Web Sites)


Page 45: Enterprise Search Share Point2009 Best Practices Final


Notes:

•The weights used in the product were carefully tested. Changes to the weights may also have a negative effect on relevance.

•After you set property.weight you must call the property.Update() method to save the change.
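A minimal sketch of that pattern through the search administration object model (the managed property name, the weight value, and the site URL are placeholders; per the note above, change weights cautiously):

using Microsoft.Office.Server;
using Microsoft.Office.Server.Search.Administration;
using Microsoft.SharePoint;

class PropertyWeightSketch
{
    static void Main()
    {
        using (SPSite site = new SPSite("http://portal"))   // placeholder URL
        {
            SearchContext context = SearchContext.GetContext(ServerContext.GetContext(site));
            Schema schema = new Schema(context);

            foreach (ManagedProperty property in schema.AllManagedProperties)
            {
                if (property.Name == "Title")          // placeholder managed property
                {
                    property.Weight = 80;              // example value only, not a recommendation
                    property.Update();                 // required to persist the change
                }
            }
        }
    }
}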

Page 46: Enterprise Search Share Point2009 Best Practices Final

Page 47: Enterprise Search Share Point2009 Best Practices Final

Page 48: Enterprise Search Share Point2009 Best Practices Final

Page 49: Enterprise Search Share Point2009 Best Practices Final

Used in custom Web parts to execute queries against the enterprise search service: http://msdn.microsoft.com/en-us/library/ms544561.aspx
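A minimal sketch of that pattern, assuming the slide refers to the KeywordQuery class in the Microsoft.Office.Server.Search.Query namespace (the query text is a placeholder; a production web part would add error handling and proper rendering):

using System.Data;
using System.Web.UI.WebControls.WebParts;
using Microsoft.Office.Server;
using Microsoft.Office.Server.Search.Query;
using Microsoft.SharePoint;

public class SimpleSearchWebPart : WebPart
{
    protected override void CreateChildControls()
    {
        // Issue a keyword query against the SSP search service for the current site.
        KeywordQuery query = new KeywordQuery(ServerContext.GetContext(SPContext.Current.Site));
        query.QueryText = "best practices";                 // placeholder query
        query.ResultTypes = ResultType.RelevantResults;
        query.RowLimit = 10;

        ResultTableCollection results = query.Execute();
        DataTable table = new DataTable();
        table.Load(results[ResultType.RelevantResults], LoadOption.OverwriteChanges);

        // Minimal rendering: one label per hit title.
        foreach (DataRow row in table.Rows)
            Controls.Add(new System.Web.UI.WebControls.Label { Text = row["Title"] + "<br/>" });
    }
}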