Technical White paper - IntelliSearch ESP 2.0
March 2007
TECHNICAL WHITE PAPER
Enterprise Search Platform
one access – any file – any source
Table of Contents
Introduction
Platform architecture – Open, modular, scaleable
  IntelliSearch ESP architecture
  Performance
Platform Search technology
Platform Administration tools
  Administration - Search Engine
  Administration – Reporting
Other technical issues
  End-user Interaction integrations
  Multi-tier Search Architecture
  Available file formats
System Requirements
  Software
  Hardware
Introduction
IntelliSearch’s award-winning platform gives enterprises access to information in any file, on any
file server, mail server, application, or website. IntelliSearch ESP processes all forms
of structured and unstructured information. The platform offers user-friendly interfaces with
advanced search and monitoring techniques to deliver quick, relevant search results.
The platform can distribute information to a large number of channels such as web,
email, SMS and more. Distribution can be configured on a regular or ad-hoc basis.
IntelliSearch Enterprise Search Platform (ESP) is a standalone search solution that securely
covers all enterprise sources, and is easy to use and deploy. IntelliSearch deploys a combination
of technologies to enable a contextual understanding of text, Web pages, e-mails, documents and
people's interests - including all formats on all platforms – and offers a unique solution to a
growing number of applications and a host of platforms and devices that are increasingly
dependent on utilizing unstructured information.
IntelliSearch ESP can power any application dependent on finding and analyzing unstructured
information. IntelliSearch ESP is built to provide:
• Accuracy
• Speed and performance
• Scalability
• Security
• Language Independence
• Easy integration
• Support for any content format
IntelliSearch can power any application that depends on unstructured information, including:
• Business Intelligence
• Content Publishing
• E-Commerce
• Electronic Customer Relationship Management
• ERP / Custom application
• Enterprise information portals
• Internet Portals
• Knowledge Management
• Online Publishing
This document describes the IntelliSearch ESP technical architecture.
Platform architecture – Open, modular, scaleable
IntelliSearch ESP architecture
IntelliSearch ESP is a modular and scalable platform written in .NET. The platform can be
set up to access any file of any format, from any internal or external source. The search
user interfaces can be integrated into any third-party application through web services. IntelliSearch
provides indexing of the following sources:
• File servers
• Mail servers
• Portals
• Standard applications
• Custom applications
• Databases (ODBC)
• Metadata
• Disks
• Public websites
• Password protected websites
• External newsfeeds
It comes with its own user interface and administration tools. The platform is depicted below:
[Figure: The IntelliSearch platform. Content sources (files/documents, databases, Internet, media, custom applications) enter through the Content API (database connector, file converter, web crawler, custom connectors) into the file processing pipeline. Filter, search and alert components sit between the file processing pipeline and the query/result processing pipeline. The Query API serves queries, results and alerts to vertical applications, portals and mobile devices. Management & application services, security access, tuning, and the tools & tool-building framework (deployment, business application, administration) span the platform.]
The IntelliSearch ESP consists of 12 main components. Each of these is described below:
Content APIs (connectors) represent a family of built-in connectors. Connectors are ready-made
interfaces to third-party systems, built on our generic content API. The connector family
provides access to documents that reside in external proprietary systems and applications.
Examples of built-in connectors are Windows NT file systems (NTFS), EMC Documentum
Content Server, IBM Lotus Notes and Microsoft Exchange. All connectors are pre-configured
(additional licensing may be required for some of the above connectors). IntelliSearch actively
monitors the market for popular applications and aims to support all such third-party
applications. In addition, our indexing interface allows customers and system integrators to
develop their own connectors to proprietary systems that may exist within the organisation.
Query APIs provide a search interface to external applications and devices. The platform can
interface with any application through Web Services/SOAP or HTTP POST, and has a purpose-built
interface for mobile devices.
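As an illustration of the HTTP POST option, the sketch below builds (but does not send) a form-encoded search request. The endpoint URL and the parameter names `q` and `max` are assumptions made for this example; the white paper does not document the actual request format.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and parameter names, chosen only for illustration.
SEARCH_URL = "http://search.example.com/query"


def build_search_request(query: str, max_results: int = 10) -> Request:
    """Build (without sending) a form-encoded HTTP POST search request."""
    body = urlencode({"q": query, "max": max_results}).encode("utf-8")
    return Request(
        SEARCH_URL,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


request = build_search_request("annual report 2006")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

A calling application would pass the request to `urllib.request.urlopen` and parse whatever response format the deployed Query API returns.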
Web Crawler is a process that runs on a set schedule. The crawler can access any web page, e.g.
XML, HTML or WML. When activated, the crawler spawns a configurable number of processor
threads that fetch documents from various data sources. Whenever the crawler encounters
embedded, non-HTML documents during crawling, it uses filters to automatically detect the
document type and to filter and index the document.
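The threading model described above can be sketched as follows. This is a minimal illustration, not the product's crawler: fetching is replaced by an in-memory dictionary so the example runs offline, and the document type is guessed from the file extension, which is a simplification of filter-based type detection. All names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Stand-in for real HTTP fetching (hypothetical data).
FAKE_WEB = {
    "http://example.com/index.html": b"<html>home</html>",
    "http://example.com/report.pdf": b"%PDF-1.4 ...",
    "http://example.com/notes.doc": b"\xd0\xcf\x11\xe0 ...",
}

# Simplified type detection by extension (assumption for illustration).
EXTENSION_TYPES = {".html": "html", ".pdf": "pdf", ".doc": "word"}


def fetch(url: str) -> tuple:
    """Fetch one document and return (url, detected type, size in bytes)."""
    content = FAKE_WEB[url]
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    return url, EXTENSION_TYPES.get(suffix, "unknown"), len(content)


def crawl(urls: list, threads: int = 4) -> list:
    """Fetch all URLs with a configurable number of worker threads."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(fetch, urls))  # map() preserves input order


results = crawl(sorted(FAKE_WEB), threads=2)
for url, doc_type, size in results:
    print(url, doc_type, size)
```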
File Converter is based on Microsoft iFilter, which enables indexing of most popular file formats.
Examples are pdf, xls, ppt, doc and jpeg. A complete list is provided in a separate section.
IntelliSearch ESP also has a built-in OCR converter that enables OCR conversion on the fly.
File processor analyzes and indexes content to make it searchable. It converts and processes content
through a pre-processing pipeline consisting of tokenization, spell checking, stemming, dictionaries,
vectorization and custom dictionaries. How it works is described in a separate section.
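To make the pipeline idea concrete, here is a toy sketch of three of the stages: tokenization, stemming and vectorization. The suffix list is a deliberately crude stand-in for a real, language-aware stemmer, and the term-count vector is an assumption; the white paper does not specify the algorithms used.

```python
import re
from collections import Counter

# Toy suffix list; a real stemmer is language-aware (assumption for illustration).
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text: str) -> list:
    """Split text into lower-cased word tokens."""
    return re.findall(r"\w+", text.lower())


def stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def vectorize(text: str) -> Counter:
    """Map a document to a term-count vector over stemmed tokens."""
    return Counter(stem(token) for token in tokenize(text))


vector = vectorize("Indexing indexed documents; the index documents searches.")
print(vector.most_common())
```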
Search User Interface is an out-of-the-box user interface. It also provides a web services API for
building custom applications for querying indexed data, and contains interfaces for Basic Search
Form, Advanced Search Form, Query Result Display, authentication and authorization, and so
on.
Alert Manager is an out-of-the-box user interface enabling the end user to set personal alerts.
Filter is the security mechanism that returns only the results that the end user is allowed to see.
Result processing is the process responsible for returning results to the end user. It converts
and processes results through the result pipeline. Tasks include organizing results for categorization,
auto-clustering and dynamic drill-down, passing results on to the application, and pushing results to
the alert engine and then to the external environment (e.g. mail, queue).
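One of the result-pipeline tasks, organizing results by category for drill-down, could look roughly like the sketch below. The result fields (`title`, `category`, `score`) are hypothetical; the white paper does not define the result format.

```python
from collections import defaultdict


def group_by_category(results: list) -> dict:
    """Group result titles by category, best-scoring first within each group."""
    groups = defaultdict(list)
    for doc in sorted(results, key=lambda d: d["score"], reverse=True):
        groups[doc["category"]].append(doc["title"])
    return dict(groups)


results = [
    {"title": "Q4 forecast", "category": "finance", "score": 0.71},
    {"title": "Travel policy", "category": "hr", "score": 0.64},
    {"title": "Budget 2007", "category": "finance", "score": 0.93},
]

drill_down = group_by_category(results)
# Per-category counts are what a drill-down navigation would typically display.
counts = {category: len(titles) for category, titles in drill_down.items()}
print(drill_down)
print(counts)
```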
Tuning and administration is where the administrator sets up search parameters such as
relevance and prioritization. Examples are absolute and relative query boosting, relative
document boosting, and custom processing logic (pre-index, query). The administration tool is a
browser-based application used to configure and schedule the crawler, configure the
server, and run several reporting features.
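The white paper does not define how boosting is computed. The sketch below shows one plausible reading, stated as an assumption: a relative boost multiplies a document's score by a factor, while an absolute boost pins a document to the top of the list regardless of score. Function and parameter names are hypothetical.

```python
def apply_boosts(hits: list, relative_boosts: dict, pinned_ids: set) -> list:
    """Re-rank (doc_id, score) pairs using relative and absolute boosts.

    Assumed semantics: relative boosts scale the score; pinned documents
    are listed before all others.
    """
    boosted = [
        (doc_id, score * relative_boosts.get(doc_id, 1.0))
        for doc_id, score in hits
    ]
    # Pinned documents sort first (False < True), then by descending score.
    return sorted(boosted, key=lambda hit: (hit[0] not in pinned_ids, -hit[1]))


hits = [("faq", 0.50), ("manual", 0.80), ("news", 0.60)]
ranked = apply_boosts(hits, relative_boosts={"faq": 2.0}, pinned_ids={"news"})
print([doc_id for doc_id, _ in ranked])
```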
Security. IntelliSearch ESP's unique combination of sophisticated mathematical algorithms
automates the processing and conceptual analysis of large volumes of content without sacrificing
critical security aspects. IntelliSearch ESP provides three basic forms of security:
- Authentication: This governs who is able to log in to the system. IntelliSearch ESP
allows direct connections to the preferred authentication directory, such as Notes,
Active Directory, LDAP, Exchange, Netware or Oracle.
- Entitlement: This governs which items in the results list can be seen by the user. In
all corporate environments it is essential that underlying security entitlement models
be respected. IntelliSearch remains synchronized with all underlying security models.
Updates and changes are immediately reflected in the IntelliSearch entitlement
model.
- Authorization: This governs who is able to view a document after clicking its
link in the results list. It is not required when entitlement is used.
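Entitlement filtering can be illustrated with a small sketch: each hit carries the set of groups allowed to see it, and the result list is trimmed to hits that share at least one group with the user. The group-based model and all names are assumptions; in the product, entitlements come from the underlying security models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    title: str
    allowed_groups: frozenset  # groups entitled to see this document


def filter_hits(hits: list, user_groups: set) -> list:
    """Keep only hits that share at least one group with the user."""
    groups = frozenset(user_groups)
    return [hit for hit in hits if hit.allowed_groups & groups]


hits = [
    Hit("Salary review", frozenset({"hr"})),
    Hit("Product brochure", frozenset({"sales", "everyone"})),
    Hit("Board minutes", frozenset({"board"})),
]

visible = filter_hits(hits, {"everyone", "sales"})
print([hit.title for hit in visible])
```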
Performance
The IntelliSearch ESP represents a high performance, scaleable platform. Below are the current
platform performance figures:
• Up to 50 million documents on one server*
• Query load: more than 1,000 queries per second
• Latency: less than 1 second for both data input and queries
* Hardware configuration dependent; see the System Requirements section
Platform Search technology
For the user, Search is all about speed and relevance. Technically it is about text strings, how
they are interpreted by the search engine, and how the search result is presented to the user.
IntelliSearch ESP uses advanced search functionality to find all relevant documents
regardless of misspellings, synonyms, and word forms (stemming). IntelliSearch ESP supports both
keyword search and relevant search, and can switch between the two automatically. Keyword
search is simple term matching, while relevant search uses a statistical algorithm that looks for text
uniqueness and finds matching relevant documents. ESP search supports exact matches,
wildcards, paragraphs, integers, Boolean expressions and truncation.
Combined with support for search strings of unlimited length, this enables precise search results.
Other advanced search mechanisms that improve the results are spell checking, reduction of words
to their base forms, synonyms and dictionaries. The process for matching a search string to the search
engine’s index is shown below:
[Figure: The search string processing. Stages: tokenizer, spell checker, base form reduction, synonyms, vectorization, custom dictionaries. Annotations: Tokenizer: ensures correct treatment of characters (e.g. on demand: no lower casing). Stemming + synonym: reduction to base form, represented symbolically; thesaurus support for narrower and broader terms; Norsk - nynorsk. Relevance: applying vectorization for relevance indexing. Adaptive query evaluation: ranking profiles, geo position.]
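The query-side stages named in this section (tokenizer, spell checker, base form reduction, synonyms) can be sketched as below. The vocabulary, base-form table and thesaurus are hypothetical, and spell checking is approximated with the standard library's `difflib`, which is a stand-in and not the product's algorithm.

```python
import difflib
import re

VOCABULARY = ["invoice", "invoices", "customer", "report"]  # hypothetical index terms
BASE_FORMS = {"invoices": "invoice"}                         # hypothetical base-form table
SYNONYMS = {"invoice": ["bill"], "customer": ["client"]}    # hypothetical thesaurus


def process_query(query: str) -> list:
    """Return, for each query token, the list of terms to match in the index."""
    expanded = []
    for token in re.findall(r"\w+", query.lower()):
        # Spell check: snap the token to the closest known term, if any.
        close = difflib.get_close_matches(token, VOCABULARY, n=1, cutoff=0.8)
        term = close[0] if close else token
        # Base form reduction, then synonym expansion.
        term = BASE_FORMS.get(term, term)
        expanded.append([term] + SYNONYMS.get(term, []))
    return expanded


terms = process_query("Custmer invoices")
print(terms)
```

Each inner list can be treated as an OR group when matching against the index, so a misspelled or inflected query still reaches documents that use the base form or a synonym.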
Platform Administration tools
Administration - Search Engine
Administration is carried out by five user groups: end user, advertiser, business manager,
administrator and developer. IntelliSearch ESP comes with functionality to suit each
user group's administration needs. Depending on the group, administration access is
provided in the end-user interface, in an administrator tool, or in a developer tool.
The multiple levels of administration for the various user groups are shown below:
[Figure: Administration framework, multiple levels of control. User groups: End Users, Advertiser, Business Managers, Administrator, Developer. Control levels: Application Model, User Profiles, Business Rules, Control Mechanisms, Core Algorithmic Model. Controls shown: sorting, navigation, feedback, alerts; media windows, banner upload & positioning, keyword ads, editing of information page; alert parameters, boosting/prioritization, ad/banner/keyword pricing (Business Managers); profile, security settings, manual data cleansing; algorithm "weights", categorization (Developer).]
For the administrator a tool is provided to set the following:
• Define and crawl data sources.
• Define crawler parameters like URL boundary rules, crawling depth, proxy settings, etc.
• Create and modify schedules for the crawler.
• Set query options. Query options allow users to limit their searches. Searches can be
limited to document attributes (e.g. title, author) and data source groups. Data source groups are
logical entities exposed to the search engine user.
• Adjust relevancy ranking of the search hit list. ESP allows administrators to influence the
order in which documents are ranked in the search hit list. Use this to promote important
documents to higher scores and make them easier to find.
• Define suggested links for specific search terms.
• Define alternative words for specific search terms.
• Set up authentication mechanisms for certain data sources.
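The crawler parameters in the list above (URL boundary rules, crawling depth, proxy settings, schedules) might be modelled as in the sketch below. The class, field names and the free-text schedule label are hypothetical; the administration tool's actual configuration format is not described in this document.

```python
from dataclasses import dataclass, field
from typing import Optional
from urllib.parse import urlparse


@dataclass
class CrawlerConfig:
    start_url: str
    allowed_hosts: set = field(default_factory=set)  # URL boundary rule
    max_depth: int = 3                                # crawling depth
    proxy: Optional[str] = None                       # proxy setting
    schedule: str = "daily 02:00"                     # free-text schedule label

    def should_crawl(self, url: str, depth: int) -> bool:
        """Apply the boundary and depth rules to a candidate URL."""
        host = urlparse(url).hostname or ""
        return depth <= self.max_depth and host in self.allowed_hosts


config = CrawlerConfig(
    start_url="http://intranet.example.com/",
    allowed_hosts={"intranet.example.com"},
    max_depth=2,
)
print(config.should_crawl("http://intranet.example.com/docs/a.html", depth=2))
print(config.should_crawl("http://other.example.org/", depth=1))
```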
Administration – Reporting
IntelliSearch provides tools to capture the following statistics:
• All search-strings
• Samples of available statistics are:
– Top searches
– Searches with no results
– Vendor/product/services search statistics
– Click through stats per vendor
– Correlation between ranking and click-throughs
– Banner ad showing
– Call-me button statistics
Data extraction is configurable, and IntelliSearch ESP offers export to Excel and to all analysis tools
through web services and XML.
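Two of the statistics listed above, top searches and searches with no results, are straightforward to derive from a log of search strings. The sketch below assumes a hypothetical log format of (search string, result count) pairs, such as might be obtained from an export.

```python
from collections import Counter

# (search string, number of results) pairs; hypothetical log format.
search_log = [
    ("price list", 12),
    ("holiday policy", 0),
    ("price list", 12),
    ("vpn setup", 4),
    ("holiday policy", 0),
    ("price list", 11),
]


def top_searches(log: list, n: int = 3) -> list:
    """Return the n most frequent search strings with their counts."""
    return Counter(query for query, _ in log).most_common(n)


def searches_without_results(log: list) -> list:
    """Return the distinct search strings that produced no hits."""
    return sorted({query for query, hits in log if hits == 0})


print(top_searches(search_log, n=2))
print(searches_without_results(search_log))
```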
Other technical issues
End-user Interaction integrations
IntelliSearch ESP makes it possible to set up the following interaction options in the
search result:
• Company links
• Text messaging (mobile)
• V-card
• Call-back button
• Integrated chat room
• Integrated discussion forums
• Integrated feedback system
Multi-tier Search Architecture
IntelliSearch ESP supports a multi-tier architecture. Customers in a multi-continent environment
may choose to set up separate physical search clusters for performance reasons. IntelliSearch
ESP provides multi-index support to serve multiple search centres. This enables a superb
end-user search experience in a global company without sacrificing relevancy and freshness.
IntelliSearch ESP synchronizes indexes on a regular basis, at a frequency set by the
customer. Below is an example of a multi-tier architecture.
[Figure: Example multi-tier architecture. Elements shown: clients; a search cluster containing client handlers and search servers; other search clusters; a core services host; customer data in databases A, B, C and D.]
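Index synchronization at a customer-defined frequency can be illustrated with a small scheduling check. This is only a sketch of the bookkeeping; the cluster names and the data structure are hypothetical, and the actual synchronization mechanism is not described in this document.

```python
from datetime import datetime, timedelta


def clusters_due_for_sync(last_synced: dict, interval: timedelta, now: datetime) -> list:
    """Return the names of clusters whose last sync is at least `interval` old."""
    return sorted(name for name, ts in last_synced.items() if now - ts >= interval)


now = datetime(2007, 3, 1, 12, 0)
last_synced = {
    "oslo": datetime(2007, 3, 1, 11, 50),
    "houston": datetime(2007, 3, 1, 11, 0),
    "singapore": datetime(2007, 3, 1, 10, 30),
}

due = clusters_due_for_sync(last_synced, timedelta(minutes=30), now)
print(due)
```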
Available file formats
When creating the index, the IntelliSearch platform uses the Microsoft iFilter interface
to extract text and property information from files. The filtering interface extracts
chunks of text from documents, filtering out embedded formatting and retaining
information about the position of the text. It also extracts chunks of values, which are
properties of an entire document or of well-defined parts of a document.
The following file formats are available for indexing:
Available filters included in IntelliSearch:
• Microsoft Office Word
• Microsoft Office Excel
• Microsoft Office PowerPoint
• Microsoft Office Visio
• HTML
• XML
• RTF - Rich-Text Format
• Text
• WordPad
• Adobe Acrobat PDF
• Word Perfect 8
• JPEG Filter
• DjVu
• MP3
• Microsoft Scheduler+
• News NNTP
Other filters available at a charge:
• Flash
• Open Office
• Microsoft Project
• SolidWorks
• Pro/Engineering
• vCard
• XMP - JPEG, GIF, TIFF, PNG, PS, EPS, PSD, AI and SVG
• Mail MSG files
• AutoCad 2002
• Windows Media/Audio
• AutoCad
• Coreldraw
• Pro Engineering
• Visio 2002
Archive formats: ZIP, SFX, SPLIT ZIP, JAR, JAR SFX, CAB, LHA, LHA SFX, LZH, LZH SFX, GZIP, TAR, TZ, TAZ, TGZ, UUE/XXE/ENC.
Any other formats not on this list can be delivered on demand.
System Requirements
When choosing hardware for your IntelliSearch ESP server, please follow the specifications
given in this document. Before purchasing or installing a server, please read the
implementation guide.
Software
Operating System: Windows 2003 R2 64-bit
File System: NTFS
Computer Role: Domain member
IP Address: Fixed
Applications: MySQL or MS SQL Server 2005 (for statistics only), .NET Framework 3.0, Lotus Notes Client (for Lotus Notes indexing)
Hardware
When choosing hardware, the most important parameters to consider are:
• The number of documents
• The number of connectors
• File size: large text documents require more disk space
• Total number of users: this is a substantial factor. One CPU can
handle roughly 100 queries per second, so scaling for thousands of users requires several CPUs
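The rule of thumb of roughly 100 queries per second per CPU translates into a simple sizing calculation, sketched below. The per-user query rate is an input the reader must estimate; the figure of 3 queries per user per minute in the example is purely illustrative.

```python
import math

QUERIES_PER_SECOND_PER_CPU = 100  # rough figure quoted in this section


def peak_qps(users: int, queries_per_user_per_minute: float) -> float:
    """Estimate peak queries per second from the number of active users."""
    return users * queries_per_user_per_minute / 60


def cpus_needed(peak_queries_per_second: float) -> int:
    """Round up to whole CPUs, with a minimum of one."""
    return max(1, math.ceil(peak_queries_per_second / QUERIES_PER_SECOND_PER_CPU))


load = peak_qps(users=5000, queries_per_user_per_minute=3)
print(load, cpus_needed(load))
```

Under these assumptions, 5,000 users generate 250 queries per second, which rounds up to three CPUs for query handling alone.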
                 Recommended                            Minimum
Documents (K)    CPU*         Memory   Hard Drive       CPU          Memory   Hard Drive
0-50             2.5 GHz      1 GB     50 GB            2 GHz        512 MB   50 GB
50-100           3 GHz        2 GB     75 GB            2.5 GHz      1 GB     75 GB
100-500          3 GHz        3 GB     100-200 GB       2.5 GHz      1.5 GB   100-200 GB
500-1,000        2*3 GHz      4 GB     200-250 GB       3 GHz        2 GB     200-250 GB
1,000-2,000      2*3.5 GHz    6 GB     250-400 GB       2*3 GHz      3 GB     250-400 GB
2,000-5,000      4*4 GHz      8 GB     400-500 GB       4*3.5 GHz    4 GB     400-500 GB
5,000-50,000     4*4 GHz      16 GB    500 GB+
All CPUs require 64-bit capability, either x64 or IA-64. Hard drive performance is a major
factor in search engine performance. IntelliSearch recommends SAS or SCSI drives running at
10K rpm or faster. There is no performance gain with striping (RAID) solutions. IntelliSearch does
not support virtual machines in a production environment.
For further information please contact info@intellisearch.no
* Actual CPU frequencies depend on processor family