
© 2004 EMC Corporation. All rights reserved.

Enterprise Content Integration Services

Jacques Conan, ECI Product Manager

Pierre-Yves “Pitch” Chevalier, ECI Software Development Manager

EMC Software Group


Information Overload

Information overload
– “Workers spend up to 50% of their time searching for actionable information.” (IDC)

Difficulty locating high-value information
– “80% of the information accumulated by companies is never used.” (KM World)


Enterprise Content Integration Solution

Adapts to changes of the information space

Coexists with any kind of IT infrastructure

Documentum Enterprise Content Integration Services

Provides access to the global content in just one query

Enhances existing solutions in place without imposing change

Find, assimilate, synthesize, share and integrate information


Documentum ECI status

ECI Services Version 4
– First release of the EMC Documentum Enterprise Content Integration solution
– Released on July 24th, 2004 for EF languages

ECI Services Version 4 SP1
– Localization for IGSJK, new version of the linguistic engine
– Planned for Nov 2004

ECI 5.3
– DCTM 5.3 to embed ECI capabilities
– ECI Services Version 5.3


Distributed Search

[Figure: distributed search. Adapters connect to OpenText, Notes, Oracle, the public Web and other contents. Results are clustered; the example topic tree shows Vehicle, Automobile, SUV, Sports Car, Convertible, Hatchback, Sedan, Minivan, Truck, Motorcycle and Hardtop.]


The Indexing Barrier

Analysis, Index
– 20-30% of total document size

Static Web
– Cannot crawl constantly
– Index becomes out-of-date

Dynamic Web (forms, login)
– Cannot go through login forms
– Cannot build permanent URLs (session expires)

Databases, Content repositories
– Duplicate internal indexes

Proprietary Applications
– Unlikely to preserve structured information


Live Query

– Dynamic queries to up-to-date content
– Preserve structure and meta-data
– Flexible

Static Web (Analysis, Index)
– Existing index, categorization

Dynamic Web (forms, login)
– Pass through login requests
– Build permanent URLs (across sessions)

Databases, Content repositories, Proprietary Applications
– Leverage internal indexes


Search Domains


Single Sign-On

– List of user credentials for sources requiring authentication
– Managed by each user individually
– Submitted with each query automatically
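The three points on this slide can be pictured as a small per-user credential list that is consulted whenever a query is sent. The sketch below is only an illustration under that reading: the class `CredentialStore` and its methods are hypothetical and are not part of ECI Services, and a real system would protect stored passwords rather than keep them in memory as plain strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the ECI implementation: each user keeps a private
// list of credentials, keyed by source name.
final class CredentialStore {
    // user -> (source -> {login, password})
    private final Map<String, Map<String, String[]>> byUser = new HashMap<>();

    // "Managed by each user individually": a user only edits their own entries.
    void save(String user, String source, String login, String password) {
        byUser.computeIfAbsent(user, u -> new HashMap<>())
              .put(source, new String[] {login, password});
    }

    // "Submitted with each query automatically": for the sources selected in a
    // query, return "source=login" for every source the user has stored.
    List<String> loginsFor(String user, List<String> selectedSources) {
        Map<String, String[]> own = byUser.getOrDefault(user, new HashMap<>());
        List<String> attached = new ArrayList<>();
        for (String source : selectedSources) {
            String[] credential = own.get(source);
            if (credential != null) {
                attached.add(source + "=" + credential[0]);
            }
        }
        return attached;
    }

    public static void main(String[] args) {
        CredentialStore store = new CredentialStore();
        store.save("alice", "Factiva", "alice01", "secret");
        // Google needs no stored credential here, so only Factiva is attached.
        System.out.println(store.loginsFor("alice", Arrays.asList("Factiva", "Google")));
    }
}
```

Sources without a stored credential are simply skipped, which matches the idea that only sources requiring authentication appear in the list.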


Sophisticated Query Language

Name      Semantic                                                          Usage
CONTAINS  attribute contains a sub-string that matches the value (default)  title:CONTAINS:Chaco canyon music
                                                                            title:Chaco canyon music
EQUALS    attribute is equal to the string                                  author:EQUALS:balzac
<         attribute is less than the numerical value                        sum:<:29.29
<=        attribute is less than or equal to the numerical value            sum:<=:29.29
=         attribute is equal to the numerical value                         sum:=:29.29
>=        attribute is greater than or equal to the numerical value         sum:>=:29.29
>         attribute is greater than the numerical value                     sum:>:29.29
AFTER     date is after day                                                 date:AFTER:1997-01-29
BEFORE    date is before day                                                date:BEFORE:1997-01-29

Examples

General, Library
– author:equals:flaubert or stendhal / title:contains:(madame AND bovary) OR (chartreuse AND parme)
– body:contains:flaubert or stendhal / title:contains:(madame AND bovary) OR (chartreuse AND parme)

General
– title:contains:"find good paying jobs" / date:after:2002-09-01

Shopping
– title:contains:Tolkien or (Lord and rings) / price:<=:20
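Each constraint on this slide has the shape attribute:OPERATOR:value, with CONTAINS used when no operator is given. The sketch below parses a single constraint that way. It is an illustration only, not the ECI parser: the class `Constraint` is hypothetical, and it handles neither the `/` separator between constraints nor Boolean expressions inside the value.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of ECI Services: parses one
// "attribute:OPERATOR:value" constraint.
final class Constraint {
    // Operator names from the slide; matching is case-insensitive because
    // the slide examples use both "CONTAINS" and "contains".
    private static final List<String> OPERATORS = Arrays.asList(
            "CONTAINS", "EQUALS", "<", "<=", "=", ">=", ">", "AFTER", "BEFORE");

    final String attribute;
    final String operator;
    final String value;

    private Constraint(String attribute, String operator, String value) {
        this.attribute = attribute;
        this.operator = operator;
        this.value = value;
    }

    static Constraint parse(String text) {
        int first = text.indexOf(':');
        if (first < 0) {
            throw new IllegalArgumentException("missing attribute in: " + text);
        }
        String attribute = text.substring(0, first).trim();
        String rest = text.substring(first + 1);
        int second = rest.indexOf(':');
        if (second >= 0) {
            String candidate = rest.substring(0, second).trim().toUpperCase(Locale.ROOT);
            if (OPERATORS.contains(candidate)) {
                return new Constraint(attribute, candidate, rest.substring(second + 1).trim());
            }
        }
        // No recognized operator: CONTAINS is the default.
        return new Constraint(attribute, "CONTAINS", rest.trim());
    }

    public static void main(String[] args) {
        Constraint c = Constraint.parse("price:<=:20");
        System.out.println(c.attribute + " " + c.operator + " " + c.value); // prints "price <= 20"
    }
}
```

With this reading, `title:Chaco canyon music` and `title:CONTAINS:Chaco canyon music` parse to the same constraint, as the CONTAINS row suggests.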


Search Status


Query Statistics


Quick Skimming of Results

A “floating window” shows a short abstract when the mouse moves over a particular result


Scheduled Queries


The Problem with Metadata


Metadata Extraction and Normalization


Document Snapshots

Content Extract
– Quickly determine content relevance
– No need to open the actual document

Metadata
– Any metadata found
– Organized for easy comprehension


Dynamic Linguistic Contextual Clustering

Clustering
– Designed for “state of the art” type searches
– Lets the user quickly view a large set of results by topics of interest

Dynamic
– Organizes and groups search results automatically
– Complements existing pre-defined taxonomies

Linguistic
– Extraction of terms (multi-word expressions), stemming

Contextual
– Selection of the most interesting groups based on search context
– Helps to refine search criteria
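To make the idea of grouping results by extracted terms more concrete, here is a deliberately naive sketch. It is not the ECI linguistic engine: the class `TermClusters` is hypothetical, the "stemmer" only strips one trailing "s", multi-word expressions are ignored, and no search context is used to rank the groups.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration of term-based grouping of search results.
final class TermClusters {
    // Very crude stemmer: lower-case the word and strip one trailing "s".
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        return (w.length() > 3 && w.endsWith("s")) ? w.substring(0, w.length() - 1) : w;
    }

    // Groups result titles by stemmed term and keeps only the terms shared
    // by at least two results.
    static Map<String, List<String>> cluster(List<String> titles) {
        Map<String, List<String>> byTerm = new TreeMap<>();
        for (String title : titles) {
            Set<String> seen = new HashSet<>();
            for (String word : title.split("\\W+")) {
                if (word.isEmpty()) {
                    continue;
                }
                String term = stem(word);
                if (seen.add(term)) {
                    byTerm.computeIfAbsent(term, k -> new ArrayList<>()).add(title);
                }
            }
        }
        byTerm.values().removeIf(group -> group.size() < 2);
        return byTerm;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList("Sports cars", "Used car prices", "Truck rentals");
        System.out.println(cluster(titles)); // prints "{car=[Sports cars, Used car prices]}"
    }
}
```

Because the groups are computed from the result set itself, they change with every query, which is the sense in which such clustering complements a fixed taxonomy.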


Personal Relevancy Ranking


Cross-Lingual Content Retrieval

Multi-Lingual Search
– On-the-fly query translation

Content comprehension aid
– Built-in dictionary for words and expressions


Content Export


DEMO


Adapter Technology

Interface to any application

Creates structure out of unstructured information

Meta-data extraction and attribute-based filtering

Intelligent components enhancing source querying capabilities

Unique framework for rapid adapter production and maintenance


Adapter Library

Content Providers
– Factiva
– Lexis Nexis

World Wide Web (General)
– AltaVista
– Google
– Open Directory
– Yahoo!
– Apple Sherlock plug-ins

Full-text search engines
– Verity K2, PortalOne, Search97
– Verity Ultraseek
– Google Search Appliance

Enterprise Repositories
– Documentum ECM 4.x, 5.x
– Documentum eRoom 6, 7
– Documentum AX 4.6, 5.0
– DocuShare 1.5, 2.0, 2.1, 3.0
– Lotus Notes R4.6
– Lotus Domino R5, R6
– Microsoft SiteServer 3.0
– Microsoft Exchange
– Oracle 7.2, Oracle 8i
– JDBC/ODBC
– Z39.50 (optional)

Bundles
– Pharma Bundle (35 adapters)
– Sciences Bundle (15 adapters)


Anatomy of an Adapter

[Figure: anatomy of an adapter. Labels: User Query, Query mapping, Query submission, Information Source, Responses, Meta-Data Extraction, Post Filtering, Filter Function, Service Description.]
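One way to read the labels on this slide is as a pipeline: the user query is mapped to the source's native syntax, submitted, and the responses go through meta-data extraction and post filtering. The sketch below follows that reading. It is an assumption about the flow, and the types `SketchAdapter` and `InMemoryAdapter` and all method signatures are hypothetical, not the ECI adapter API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: stage names follow the slide labels, signatures do not
// come from any real adapter framework.
abstract class SketchAdapter {
    // Query mapping: translate the user query into the source's native syntax.
    abstract String mapQuery(String userQuery);

    // Query submission: send the native query to the information source.
    abstract List<String> submit(String nativeQuery);

    // Meta-data extraction: turn a raw response into a structured title.
    abstract String extractTitle(String rawResponse);

    // Post filtering: keep only results that satisfy the full user query.
    abstract boolean accept(String title, String userQuery);

    final List<String> search(String userQuery) {
        List<String> kept = new ArrayList<>();
        for (String raw : submit(mapQuery(userQuery))) {
            String title = extractTitle(raw);
            if (accept(title, userQuery)) {
                kept.add(title);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(new InMemoryAdapter().search("approved Betaferon")); // prints "[Betaferon approved]"
    }
}

// Toy in-memory source standing in for a real repository or web site.
final class InMemoryAdapter extends SketchAdapter {
    private final List<String> rows = Arrays.asList(
            "<b>Betaferon approved</b>", "<b>Interferon trial</b>", "<b>Aspirin approved</b>");

    @Override
    String mapQuery(String userQuery) {
        // This toy source understands a single keyword only: send the first word.
        return userQuery.trim().split("\\s+")[0].toLowerCase(Locale.ROOT);
    }

    @Override
    List<String> submit(String nativeQuery) {
        List<String> hits = new ArrayList<>();
        for (String row : rows) {
            if (row.toLowerCase(Locale.ROOT).contains(nativeQuery)) {
                hits.add(row);
            }
        }
        return hits;
    }

    @Override
    String extractTitle(String rawResponse) {
        return rawResponse.replaceAll("<[^>]+>", ""); // strip the HTML tags
    }

    @Override
    boolean accept(String title, String userQuery) {
        // Every word of the user query must appear in the extracted title.
        String lower = title.toLowerCase(Locale.ROOT);
        for (String word : userQuery.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!lower.contains(word)) {
                return false;
            }
        }
        return true;
    }
}
```

In this toy run the source returns two rows for the single keyword it understands, and post filtering discards the one that does not satisfy the whole user query.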


Query Mapping

User query (SEARCH)
full-text:contains:Betaferon OR Interferon / title:contains:approved

DCTM (Products content repository): fielded search, Boolean operators
– Native query: select r_object_id from dm_document search topic ('Betaferon*' or 'Interferon*') where folder('Drugs', descend) and (lower(title) like '%approved%' or lower(object_name) like '%approved%')
– Post-filter: 4 results valid out of 4 received

PubMed (PubMed web site): full-text search, Boolean operators
– Native query: (Betaferon OR Interferon) AND approved
– Post-filter: 2 results valid out of 50 received

FDA (FDA CDER web site): full-text search, Internet-style, no Boolean
– Native queries: +Betaferon +approved and +Interferon +approved
– Post-filter: 8 results valid out of 100 received
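For a source without Boolean operators, the slide shows the query (Betaferon OR Interferon) AND approved being sent as two Internet-style queries. A minimal sketch of that expansion, assuming the query is already available as an AND of OR-groups, is given below. The class `PlusQueryMapper` is hypothetical and is not taken from the ECI adapter framework.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: expands an AND of OR-groups into "+term" queries for
// a source that requires every listed term and has no Boolean operators.
final class PlusQueryMapper {
    // Each inner list holds OR alternatives; the groups are ANDed together.
    static List<String> expand(List<List<String>> andOfOrs) {
        List<String> queries = new ArrayList<>();
        queries.add("");
        for (List<String> alternatives : andOfOrs) {
            List<String> next = new ArrayList<>();
            for (String prefix : queries) {
                for (String term : alternatives) {
                    next.add(prefix.isEmpty() ? "+" + term : prefix + " +" + term);
                }
            }
            queries = next;
        }
        return queries;
    }

    public static void main(String[] args) {
        List<List<String>> query = Arrays.asList(
                Arrays.asList("Betaferon", "Interferon"),
                Arrays.asList("approved"));
        for (String nativeQuery : expand(query)) {
            System.out.println(nativeQuery);
        }
        // prints "+Betaferon +approved" then "+Interferon +approved"
    }
}
```

The number of native queries is the product of the group sizes, so this kind of expansion only suits short queries. The title constraint from the user query is lost in a full-text source, which is why the results still need post filtering.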


Metadata retrieval for HTML Pages

UNSTRUCTURED CONTENT

STRUCTURED CONTENT


Metadata extraction: learning phase

[Figure: learning phase. HTML results page samples 1 to N each feed a learning phase that updates the extraction agent, from Version 1 to Version 2 to the final Version 3.]


Metadata extraction: self-repairing

When a change is detected:
– Find trusted fragments
– Apply one or more recovery routines
– Repair can be complete or partial
– Update the grammar if the recovery threshold is satisfied

Conceptual shift assumption
– Context change: a change in the page mark-up, such as putting the title in bold font
– Content change: CIKM “Conf. Information Knowledge Management”
– Structural change: label addition/removal, order permutation, etc.

Adapter Builder

Unique framework for rapid adapter production, testing and maintenance

DEMO


Adapter Configuration: setup

Admin Center:
– Web-based configuration
– Wizard for enterprise adapters: DCTM, Domino, eRoom, …


Adapter Configuration: diagnostic


Adapter Exchange Site

http://customernet.documentum.com/developer/articles/ECISAdapters.html

Repository to publish and exchange sample adapters (source-code or binary delivery)
– InternetArchive
– InvisibleWeb
– Java Developer Connection
– SourceForge
– …

Dedicated forum to discuss tips & issues related to adapter development

Go & contribute!


ECI 5.3 Architecture

[Figure: ECI 5.3 architecture. Labels: Webtop, Portal, WDK 5.3, DFC 5.3, Docbases, ECI Services 5.3, ECI Client 5.3, ECI Portal 5.3. Access to external contents: Lexis Nexis, Lotus Domino, Google, WWW.]


WDK Search Components

Single box search
– Multi-docbases

Results and Status
– Wait screen
– Results
– Status
– Status refreshed
– Status stopped
– Enter credentials from status
– No results

Advanced and Revise
– Advanced from results (revise a search)
– Advanced from browsing
– Advanced cleared

Preferences
– Search locations - favorite repositories
– Search locations - selected specifically
– Favorite repositories
– Favorite repositories (admin)

Changing Sources
– Change sources
– Check box selected
– Authenticate repository
– Repository added to list
– Navigate into repository and breadcrumb
– Navigate into a cabinet and breadcrumb

Saving a Search
– My saved searches
– Properties of a saved search
– All saved searches


DCTM 5.3 Extended Search

Query Building in DFC

import com.documentum.fc.client.search.*;
import com.documentum.fc.common.IDfValue;

// Initialization: client and sessionMgr are assumed to exist already
IDfSearchService searchService = client.newSearchService(sessionMgr);
IDfMetadataMgr metadataMgr = searchService.newMetadataMgr();
IDfQueryMgr queryMgr = searchService.newQueryMgr(metadataMgr);

// Creation of the query: select the sources to search
IDfQueryBuilder queryBuilder = queryMgr.newQueryBuilder();
queryBuilder.addSelectedSource("dm_notes");
queryBuilder.addSelectedSource("AmazonBooks");

// Definition of the constraints: a full-text expression plus a "title contains" expression
IDfExpressionSet exprSet = queryBuilder.getRootExpressionSet();
exprSet.addFullTextExpression("singleton Java");
exprSet.addSimpleAttrExpression("title", IDfValue.DF_STRING,
    IDfSimpleAttrExpression.SEARCH_OP_CONTAINS, false, false, "design patterns");


DCTM 5.3 Extended Search

Query Execution in DFC

// Synchronous execution
IDfQueryProcessor syncProcessor = searchService.newQueryProcessor(queryBuilder);
IDfResultsSet results = syncProcessor.blockingSearch(-1); // -1 == no timeout

// Sort the results by date. The slide omits the declaration of manipulator,
// so the type name IDfResultsManipulator is an assumption.
IDfResultsManipulator manipulator = searchService.newResultsManipulator(metadataMgr);
IDfResultsSet sortedResults = manipulator.sortBy(results, "date", true);


Customers Using Documentum ECI Services

European Patent Office


Conclusion

Get ECIS and ADK!
http://customernet.documentum.com

Adapter Exchange
http://customernet.documentum.com/developer/articles/ECISAdapters.html


Uniting The World Through Content