II-SDV 2015, 20 - 21 April, in Nice


An Overview of the Enterprise Search Market, & Current Best Practices

Iain Fletcher

[email protected]

April 20, 2015


Agenda

• A brief overview of the current enterprise search market

• The convergence of search with analytics disciplines

• Likely future architectures for search applications


Search engines continue to proliferate…


High-level Search Engine Classifications

1. Part of a portfolio, many are recently acquired technologies

– E.g. SharePoint, HP Autonomy, IBM/Vivisimo, Dassault/Exalead

2. Stand-alone specialists, often bought to address specific apps

– E.g. GSA, Coveo, Attivio, Sinequa, Recommind

3. Open source, with or without support or proprietary add-ons

– Raw: e.g. Lucene, Solr, Elasticsearch

– With support/add-ons: e.g. LucidWorks, Cloudera Search, Elastic

4. Cloud-based services, typically based on open source technology

– E.g. Amazon CloudSearch, Microsoft Azure Search


Market Observations

The dominant market share is with SharePoint, open source, and the Google Search Appliance

• SharePoint 2013 search is credible, and bundled

– Search teams are under pressure to use it, or to provide a compelling reason to do otherwise

• Solr and Elasticsearch are robust and reliable

– Thanks to very widespread deployment

• The Google brand sells search, and a lot of GSAs have been shipped during the past few years


Functional Observations

• Core indexing / searching is generally fast and reliable

– Search is a maturing technology

• Key differences remain in peripheral functionality, such as content processing prior to indexing. For example:

– Coveo, Attivio, Sinequa all have well-developed indexing pipelines, UI tools, and a range of data connectors

– SharePoint and GSA have limited content processing functionality and rely on 3rd parties for connectivity

– Solr, Elasticsearch, Amazon CloudSearch and Azure Search don’t provide a formal indexing pipeline, UI, or connectors


Further Observations

• The search engines with less focus on peripheral issues (such as content processing and connectivity) have dominant market share

• Connectivity remains challenging, especially when combined with continual data growth

• The movement of data sets to the cloud adds further complexity

– Hybrid indexing environments will be with us for some years


Content Processing / Text Analysis Examples

• Normalization

– Names, dates, synonyms, spelling

• Entity identification and resolution

• Additional metadata from content analysis

• Categorization

• Document vector extraction

• Splitting and concatenation

• Dupe & near-dupe detection

• Link analysis

• Ingesting external signals

• Security enforcement and analysis

[Figure: an Index, labelled with security, category, and metadata]
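Two of the content processing steps listed on this slide, normalization and near-dupe detection, can be sketched in a few lines. This is a minimal illustration only, not the method of any product named in this deck: the function names, the word-shingle size of 3, and the 0.8 similarity threshold are all assumptions chosen for the example.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip accents, and drop punctuation and extra whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    words = re.findall(r"[a-z0-9]+", no_accents.lower())
    return " ".join(words)


def shingles(text: str, size: int = 3) -> set:
    """Return the set of overlapping word n-grams in the normalized text."""
    words = normalize(text).split()
    if len(words) < size:
        # Very short texts become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(doc_a: str, doc_b: str, threshold: float = 0.8) -> bool:
    """Flag two documents whose shingle sets overlap above the (assumed) threshold."""
    return jaccard(shingles(doc_a), shingles(doc_b)) >= threshold
```

In a real pipeline the comparison would not be pairwise over the whole corpus; hashing schemes such as MinHash are the usual way to scale the same idea.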


Future Directions

So what will search architectures look like in the future?

Important Influences:

• The need for organizational and analytical agility

• The convergence of search and (“big data”) analytics

• Continual growth in data volumes, and churn in repository / storage fashions


Converging Architectures

Let’s take a brief look at:

1. The “Big Data Architecture”, evangelized by IBM, Cloudera, etc.

2. Contemporary Search Architectures

Background Info


The Big Data Architecture

Designed for Structured Data


The Traditional Search Architecture

Designed for Unstructured Content

[Diagram: Content Sources (Employee Directory, CMS, File Share, etc.) -> Integrated Search Engine (Connectors -> Index Pipeline -> Search Index) -> UI]


The Traditional Search Architecture

[Diagram: Content Sources (Employee Directory, CMS, File Share, etc.) -> Integrated Search Engine (Connectors -> Index Pipeline -> Search Index) -> UI, with a RE-INDEX label]

• A few documents per second?

• There are only 2.6 million seconds in a month

• If you change something significant in the index pipeline, you will need to re-index
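The arithmetic behind this slide is easy to check. The sketch below assumes a 30-day month (which gives the "2.6 million seconds" figure) and uses hypothetical numbers for illustration only: a corpus of 10 million documents and an end-to-end rate of 3 documents per second.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_MONTH = 30 * SECONDS_PER_DAY  # 2,592,000, i.e. roughly 2.6 million


def docs_per_month(docs_per_second: float) -> int:
    """Documents that can be processed in a 30-day month at a steady rate."""
    return int(docs_per_second * SECONDS_PER_MONTH)


def reindex_days(doc_count: int, docs_per_second: float) -> float:
    """Days needed to re-index doc_count documents at a steady rate."""
    if docs_per_second <= 0:
        raise ValueError("docs_per_second must be positive")
    return doc_count / docs_per_second / SECONDS_PER_DAY


# Hypothetical corpus and rate, chosen only to illustrate the scale.
monthly_capacity = docs_per_month(3)
days_needed = reindex_days(10_000_000, 3)
```

At 3 documents per second a month of continuous running covers fewer than 8 million documents, so a full re-index of the hypothetical 10-million-document corpus takes more than a month.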


A Better Search Architecture

• Re-indexing rates greatly improved

• “Touch-time” with repositories can be managed autonomously

[Diagram: Content Sources (Employee Directory, CMS, etc.) -> Connectors -> Staging Repository with Content Processing -> Index Pipeline -> Search Index (Search Engine); RE-INDEX and Iterative Development run from the Staging Repository]
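The point of the staging repository can be shown with a small sketch: connectors touch the source systems once, and every later re-index reads only the staged copies. Everything here is hypothetical and in-memory (the `StagingRepository` class, the `cms_connector` stand-in, and the two pipeline functions); it illustrates the idea under those assumptions rather than describing any particular product.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Document = Dict[str, str]
Pipeline = Callable[[Document], Document]


class StagingRepository:
    """Holds raw copies of source documents so re-indexing never touches the sources."""

    def __init__(self) -> None:
        self._raw: Dict[str, Document] = {}

    def ingest(self, source: Iterable[Tuple[str, Document]]) -> int:
        """Connector step: copy (doc_id, document) pairs out of a source system."""
        count = 0
        for doc_id, doc in source:
            self._raw[doc_id] = dict(doc)
            count += 1
        return count

    def reindex(self, pipeline: Pipeline) -> Dict[str, Document]:
        """Rebuild an index from the staged copies only."""
        return {doc_id: pipeline(dict(doc)) for doc_id, doc in self._raw.items()}


fetch_log: List[str] = []  # records every (slow) fetch from the source system


def cms_connector():
    """Hypothetical stand-in for a slow repository crawl."""
    for doc_id, title in [("1", "  Annual Report "), ("2", "Search Strategy")]:
        fetch_log.append(doc_id)
        yield doc_id, {"title": title}


def pipeline_v1(doc: Document) -> Document:
    doc["title"] = doc["title"].strip()
    return doc


def pipeline_v2(doc: Document) -> Document:
    """A changed pipeline: adds a lowercased field on top of v1."""
    doc = pipeline_v1(doc)
    doc["title_lower"] = doc["title"].lower()
    return doc


staging = StagingRepository()
staging.ingest(cms_connector())
index_v1 = staging.reindex(pipeline_v1)
index_v2 = staging.reindex(pipeline_v2)  # pipeline changed; sources not re-crawled
```

After both index builds `fetch_log` still holds only two entries, which is the "touch-time" benefit: the pipeline can be changed and re-run as often as needed without going back to the CMS.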


The Future Architecture?

[Diagram: Content Sources (Employee Directory, CMS, etc.) -> Connectors -> Staging Repository with Content Processing, hosted on Hadoop -> Index Pipeline -> Search Index; RE-INDEX and Iterative Development run from the Staging Repository]

• This environment will encourage ever more sophisticated content processing

• We expect much innovation in text analytics during the next few years

• Driven by cheap, easily available processing power

• The deliverable is a richer search index


The Future Architecture

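[Diagram: Content Sources (Employee Directory, CMS, etc.) -> Connectors -> Staging Repository with Content Processing, hosted on Hadoop -> Index Pipeline -> Search Index; RE-INDEX and Iterative Development run from the Staging Repository]

• Google.com has worked something like this for 10+ years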


An Integrated Search/Analytics Architecture

[Diagram: Content Sources (CMS, file system, etc.) via Connectors / Crawlers, and Data Sources (data warehouse, log files, etc.) via ETL, feed a Staging Repository with Content Processing and Iterative Development, hosted on Hadoop; Rapid & ad hoc Indexing then serves an OSINT Search App., a Search App., and two Analysis Apps.]

• Encourages agile exploitation of data and content resources
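One way to picture this slide is a single set of enriched records feeding more than one purpose-built index. The sketch below is an assumption-laden illustration: the sample records, field names, and both builder functions are hypothetical, and a real deployment would write to a search engine and an analytics store rather than Python dictionaries.

```python
from collections import Counter, defaultdict

# Hypothetical enriched records, as they might leave content processing.
enriched = [
    {"id": "d1", "text": "patent search tools", "category": "IP"},
    {"id": "d2", "text": "enterprise search market", "category": "Market"},
    {"id": "d3", "text": "patent analytics", "category": "IP"},
]


def build_search_index(records):
    """Search app view: inverted index mapping each term to sorted document ids."""
    postings = defaultdict(set)
    for rec in records:
        for term in rec["text"].lower().split():
            postings[term].add(rec["id"])
    return {term: sorted(ids) for term, ids in postings.items()}


def build_category_counts(records):
    """Analysis app view: number of documents per category."""
    return Counter(rec["category"] for rec in records)


search_index = build_search_index(enriched)
category_counts = build_category_counts(enriched)
```

Both views are derived from the same staged, enriched content, so adding a third application means adding another builder, not another crawl.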


Summary

• Search and Analytics are tending towards the same architecture

• Autonomous connectivity and content processing systems simplify and de-risk projects

• The “search index” is a mature technology, and becoming a commodity

– Thanks to open source alternatives setting high standards

• The centre of attention is shifting from the index to the content preparation

– This perhaps fits well with the profile of dominant market leaders: SharePoint, GSA, Solr, Elasticsearch….


Conclusion

• The foundation of great search and analytical applications is a clean, rich and detailed index

• Much of the innovation during the next few years will be in content analytics

– The architecture discussed makes it easy to adopt new ideas and products

– And it promotes agility, experimentation, and innovation

• In a data-driven world, agility is vital


And finally….

The analyst quote….

“Enterprise Search Can Bring Big Data Within Reach”

• Multiple, purpose-built indexes that are derived from enriched content are necessary.

http://blogs.gartner.com/darin-stewart/2014/04/01/enterprise-search-can-bring-big-data-within-reach/

* Darin Stewart, Enterprise Search Can Bring Big Data Within Reach, April 2014 Blog




Thank you!


Spare Slides


Reference Architecture

[Diagram labels: Content sources; Connectors; Staging Repository; Content Processing Pipelines (Semantics, Text Mining, Quality Metrics); Big Data Framework; Indexes; Search Engine; Query parsing; Web Browser]
