© 2004 EMC Corporation. All rights reserved.
Enterprise Content Integration Services
Jacques Conan, ECI Product Manager
Pierre-Yves “Pitch” Chevalier, ECI Software Development Manager
EMC Software Group
Information Overload
“Workers spend up to 50% of their time searching for actionable information.” (IDC)
Difficulty locating high-value information: “80% of the information accumulated by companies is never used.” (KM World)
Enterprise Content Integration Solution
Adapts to changes of the information space
Coexists with any kind of IT infrastructure
Documentum Enterprise Content Integration Services
Provides access to the global content in just one query
Enhances existing solutions in place without imposing change
Find, assimilate, synthesize, share and integrate information
Documentum ECI status
ECI Services Version 4: first release of the EMC Documentum Enterprise Content Integration solution, released on July 24th, 2004 for EF languages
ECI Services Version 4 SP1: localization for IGSJK, new version of the linguistic engine; planned for Nov 2004
ECI 5.3: DCTM 5.3 to embed ECI capabilities (ECI Services Version 5.3)
Distributed Search
[Diagram: one search is distributed through adapters to OpenText, Notes, Oracle, the public Web and other contents. Results are clustered by topic; example cluster labels: Automobile, Vehicle, SUV, Sports Car, Convertible, Hatchback, Sedan, Minivan, Truck, Motorcycle, Hardtop.]
The Indexing Barrier
Analysis and index: 20-30% of total document size
Static Web
– Cannot crawl constantly
– Index becomes out-of-date
Dynamic Web (forms, login)
– Cannot go through login forms
– Cannot build permanent URLs (session expires)
Databases, Content repositories
– Duplicate internal indexes
Proprietary Applications
– Unlikely to preserve structured information
Live Query
Dynamic queries to up-to-date content
Preserve structure and metadata
Flexible
Static Web
– Existing index, categorization
Dynamic Web (forms, login)
– Pass through login requests
– Build permanent URLs (across sessions)
Databases, Content repositories, Proprietary Applications
– Leverage internal indexes
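The live-query idea on this slide can be sketched as a fan-out: the same query goes to every source at search time, and whatever comes back before a deadline is merged. The class below is only an illustration under stated assumptions; `LiveQueryBroker`, its `search` method and the use of plain functions as sources are hypothetical and are not the ECI Services API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

/** Hypothetical live-query broker (illustration only, not the ECI Services API). */
class LiveQueryBroker {

    /** Sends the query to every source in parallel and merges what returns before the deadline. */
    static List<String> search(Map<String, Function<String, List<String>>> sources,
                               String query, long timeoutMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, sources.size()));
        try {
            // Submit one task per source; nothing is indexed ahead of time.
            Map<String, Future<List<String>>> pending = new LinkedHashMap<>();
            for (Map.Entry<String, Function<String, List<String>>> e : sources.entrySet()) {
                Function<String, List<String>> source = e.getValue();
                pending.put(e.getKey(), pool.submit(() -> source.apply(query)));
            }
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
            List<String> merged = new ArrayList<>();
            for (Map.Entry<String, Future<List<String>>> e : pending.entrySet()) {
                long remaining = Math.max(0L, deadline - System.nanoTime());
                try {
                    for (String hit : e.getValue().get(remaining, TimeUnit.NANOSECONDS)) {
                        merged.add(e.getKey() + ": " + hit);
                    }
                } catch (ExecutionException | TimeoutException slowOrFailed) {
                    // A slow or failing source must not block the others; skip it.
                    e.getValue().cancel(true);
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
            return merged;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

A caller would register one function per source (repository, web site, database) and call `LiveQueryBroker.search(sources, "ECI", 2000)`; sources that fail or time out simply contribute no results.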
Search Domains
Single Sign-On
List of user credentials for sources requiring authentication
Managed by each user individually
Submitted with each query automatically
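A minimal sketch of such a credential list, assuming an in-memory map per user. `CredentialVault`, `store` and `credentialsFor` are hypothetical names chosen for illustration; the slide does not describe how ECI Services actually stores or encrypts credentials.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical per-user credential list (illustration only, not the ECI Services API). */
class CredentialVault {
    // user -> (source name -> "login:password"); a real store would encrypt these values.
    private final Map<String, Map<String, String>> byUser = new HashMap<>();

    /** Each user manages only his or her own entries. */
    void store(String user, String source, String login, String password) {
        byUser.computeIfAbsent(user, u -> new HashMap<>()).put(source, login + ":" + password);
    }

    /** For one query, picks the credentials of the selected sources that have an entry. */
    Map<String, String> credentialsFor(String user, List<String> selectedSources) {
        Map<String, String> own = byUser.getOrDefault(user, new HashMap<>());
        Map<String, String> attached = new LinkedHashMap<>();
        for (String source : selectedSources) {
            if (own.containsKey(source)) {
                attached.put(source, own.get(source));
            }
        }
        return attached;
    }
}
```

The query layer would call `credentialsFor` once per query and hand each adapter its entry, so the user never types a password at search time.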
Sophisticated Query Language
Name      Semantic                                                           Usage
CONTAINS  attribute contains a sub-string that matches the value (default)   title:CONTAINS:Chaco canyon music, or title:Chaco canyon music
EQUALS    attribute is equal to the string                                   author:EQUALS:balzac
<         attribute is less than the numerical value                         sum:<:29.29
<=        attribute is less than or equal to the numerical value             sum:<=:29.29
=         attribute is equal to the numerical value                          sum:=:29.29
>=        attribute is greater than or equal to the numerical value          sum:>=:29.29
>         attribute is greater than the numerical value                      sum:>:29.29
AFTER     date is after the given day                                        date:AFTER:1997-01-29
BEFORE    date is before the given day                                       date:BEFORE:1997-01-29

Examples
General, Library:
  author:equals:flaubert or stendhal / title:contains:(madame AND bovary) OR (chartreuse AND parme)
  body:contains:flaubert or stendhal / title:contains:(madame AND bovary) OR (chartreuse AND parme)
General:
  title:contains:"find good paying jobs" / date:after:2002-09-01
Shopping:
  title:contains:Tolkien or (Lord and rings) / price:<=:20
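The constraint syntax above can be read as `attribute:OPERATOR:value`, with constraints separated by " / " and CONTAINS assumed when no operator is given. The parser below is a toy written from the slide's examples only; `ConstraintParser` is a hypothetical name, the separator and case-insensitive operator handling are inferred from the examples, and the Boolean expressions inside the value are left unparsed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy parser for the constraint syntax shown on the slide (not the ECI parser). */
class ConstraintParser {
    /** Operator names taken from the table on the slide. */
    static final Set<String> OPERATORS = new HashSet<>(Arrays.asList(
            "CONTAINS", "EQUALS", "<", "<=", "=", ">=", ">", "BEFORE", "AFTER"));

    /** One parsed constraint. */
    static final class Constraint {
        final String attribute;
        final String operator;
        final String value;

        Constraint(String attribute, String operator, String value) {
            this.attribute = attribute;
            this.operator = operator;
            this.value = value;
        }

        @Override
        public String toString() {
            return attribute + " " + operator + " " + value;
        }
    }

    /** Splits a query on " / " and parses each attribute:OPERATOR:value part. */
    static List<Constraint> parse(String query) {
        List<Constraint> constraints = new ArrayList<>();
        for (String rawPart : query.split("\\s+/\\s+")) {
            String part = rawPart.trim();
            String[] f = part.split(":", 3);
            if (f.length < 2) {
                throw new IllegalArgumentException("Not a constraint: " + part);
            }
            String op = f.length == 3 ? f[1].toUpperCase(Locale.ROOT) : "";
            if (OPERATORS.contains(op)) {
                constraints.add(new Constraint(f[0], op, f[2]));
            } else {
                // No recognized operator: CONTAINS is the default and the
                // value is everything after the first colon.
                constraints.add(new Constraint(f[0], "CONTAINS", part.substring(f[0].length() + 1)));
            }
        }
        return constraints;
    }
}
```

For instance, `ConstraintParser.parse("title:Chaco canyon music")` yields a single CONTAINS constraint on `title`, matching the "(default)" note in the table.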
Search Status
Query Statistics
Quick Skimming of Results
“Floating window” shows a short abstract by just moving the mouse over a particular result
Scheduled Queries
The Problem with Metadata
Metadata Extraction and Normalization
Document Snapshots
Content Extract
– Quickly determine content relevance
– No need to open the actual document
Metadata
– Any metadata found
– Organized for easy comprehension
Dynamic Linguistic Contextual Clustering
Clustering
– Designed for “state of the art” type searches
– Lets users quickly view a large set of results by topics of interest
Dynamic
– Organizes and groups search results automatically
– Complements existing pre-defined taxonomies
Linguistic
– Extraction of terms (multi-word expressions), stemming
Contextual
– Selection of the most interesting groups based on search context
– Helps to refine search criteria
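To make the three ingredients concrete, here is a deliberately naive sketch: results are grouped on the fly (dynamic), terms are stemmed before comparison (linguistic), and the query's own terms are ignored when picking groups (contextual). `ResultClusterer`, the plural-stripping stemmer and the "at least two results" rule are assumptions for illustration; the product's linguistic engine, including multi-word term extraction, is not described on the slide and is not reproduced here.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Toy result clustering; the stemmer and the thresholds are illustrative assumptions. */
class ResultClusterer {

    /** Crude stemmer: lower-cases the word and strips a trailing plural "s". */
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        return (w.length() > 3 && w.endsWith("s")) ? w.substring(0, w.length() - 1) : w;
    }

    /** Groups result titles by shared stemmed terms, ignoring the query's own terms. */
    static Map<String, List<String>> cluster(List<String> titles, String query) {
        Set<String> queryStems = new HashSet<>();
        for (String q : query.split("\\W+")) {
            queryStems.add(stem(q));
        }
        Map<String, List<String>> groups = new TreeMap<>();
        for (String title : titles) {
            Set<String> seen = new HashSet<>();
            for (String word : title.split("\\W+")) {
                String term = stem(word);
                // Skip short words, query terms (they occur in every result and
                // cannot discriminate), and terms already counted for this title.
                if (term.length() < 4 || queryStems.contains(term) || !seen.add(term)) {
                    continue;
                }
                groups.computeIfAbsent(term, t -> new ArrayList<>()).add(title);
            }
        }
        // Keep only the terms shared by at least two results.
        groups.values().removeIf(members -> members.size() < 2);
        return groups;
    }
}
```

Because groups are derived from the current result set, they complement rather than replace a pre-defined taxonomy, and each group label is a candidate term for refining the search.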
Personal Relevancy Ranking
Cross-Lingual Content Retrieval
Multi-Lingual Search
– On-the-fly query translation
Content comprehension aid
– Built-in dictionary for words and expressions
Content Export
DEMO
Adapter Technology
Interface to any application
Creates structure out of unstructured information
Metadata extraction and attribute-based filtering
Intelligent components enhancing source querying capabilities
Unique framework for rapid adapter production and maintenance
Adapter Library
Content Providers
– Factiva
– Lexis Nexis
World Wide Web (General)
– AltaVista
– Google
– Open Directory
– Yahoo!
– Apple Sherlock plug-ins
Full-text search engines
– Verity K2, PortalOne, Search97
– Verity Ultraseek
– Google Search Appliance
Enterprise Repositories
– Documentum ECM 4.x, 5.x
– Documentum eRoom 6, 7
– Documentum AX 4.6, 5.0
– DocuShare 1.5, 2.0, 2.1, 3.0
– Lotus Notes R4.6
– Lotus Domino R5, R6
– Microsoft SiteServer 3.0
– Microsoft Exchange
– Oracle 7.2, Oracle 8i
– JDBC/ODBC
– Z39.50 (optional)
Bundles
– Pharma Bundle (35 adapters)
– Sciences Bundle (15 adapters)
Anatomy of an Adapter
[Diagram: adapter stages between the user query and the information source: query mapping, query submission, service description, metadata extraction, post filtering (filter function); responses flow back from the source.]
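The stages named on this slide suggest a simple pipeline: map the query, submit it, extract metadata from each response, then post-filter. The abstract class below is a sketch under that reading; `AdapterSketch` and its method names are hypothetical and do not come from the Adapter Development Kit, and the "service description" stage is left out because the slide does not say what it contains.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical adapter contract mirroring the stages on the slide (not the ADK API). */
abstract class AdapterSketch {

    /** Query mapping: translate the user query into the source's native syntax. */
    abstract String mapQuery(String userQuery);

    /** Query submission: send the native query to the information source, return raw responses. */
    abstract List<String> submit(String nativeQuery);

    /** Metadata extraction: turn one raw response into attribute/value pairs. */
    abstract Map<String, String> extractMetadata(String rawResponse);

    /** Runs the stages in order; the filter function performs the post filtering. */
    final List<Map<String, String>> search(String userQuery, Predicate<Map<String, String>> filter) {
        List<Map<String, String>> results = new ArrayList<>();
        for (String raw : submit(mapQuery(userQuery))) {
            Map<String, String> metadata = extractMetadata(raw);
            if (filter.test(metadata)) {
                results.add(metadata);
            }
        }
        return results;
    }
}
```

Keeping the filter as a separate predicate means a source with weak query capabilities can still be searched: the adapter over-fetches and the filter discards what does not satisfy the original constraints.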
Query Mapping
SEARCH (user query):
  full-text:contains:Betaferon OR Interferon / title:contains:approved

SOURCE: Products content repository (DCTM)
  ADAPTER: fielded search, Boolean operators
  NATIVE: select r_object_id from dm_document search topic ('Betaferon*' or 'Interferon*') where folder('Drugs', descend) and (lower(title) like '%approved%' or lower(object_name) like '%approved%')
  Post-filter: 4 results valid out of 4 received

SOURCE: PubMed web site
  ADAPTER: full-text search, Boolean operators
  NATIVE: (Betaferon OR Interferon) AND approved
  Post-filter: 2 results valid out of 50 received

SOURCE: FDA CDER web site
  ADAPTER: full-text search, Internet-style, no Boolean
  NATIVE: +Betaferon +approved
          +Interferon +approved
  Post-filter: 8 results valid out of 100 received
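The two web-site mappings in this example can be sketched in a few lines: a Boolean-capable source gets one combined query, while a source without Boolean operators gets one "+term" query per OR alternative. The post-filter then re-checks the title constraint. `QueryMapper` and its methods are hypothetical, and the explanation of why results are discarded (a full-text source matching "approved" anywhere, not only in the title) is an interpretation of the slide's counts, not something it states.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustrative query mapping for the Betaferon/Interferon example (not ECI adapter code). */
class QueryMapper {

    /** Source with Boolean operators: a single native query. */
    static String toBoolean(List<String> anyOf, String required) {
        return "(" + String.join(" OR ", anyOf) + ") AND " + required;
    }

    /** Internet-style source without Boolean operators: one query per OR alternative. */
    static List<String> toInternetStyle(List<String> anyOf, String required) {
        List<String> queries = new ArrayList<>();
        for (String term : anyOf) {
            queries.add("+" + term + " +" + required);
        }
        return queries;
    }

    /** Post-filter: keep only results whose title really contains the required term. */
    static List<String> postFilter(List<String> titles, String requiredInTitle) {
        String needle = requiredInTitle.toLowerCase(Locale.ROOT);
        List<String> valid = new ArrayList<>();
        for (String title : titles) {
            if (title.toLowerCase(Locale.ROOT).contains(needle)) {
                valid.add(title);
            }
        }
        return valid;
    }
}
```

With `anyOf = [Betaferon, Interferon]` and `required = approved`, `toBoolean` reproduces the PubMed query and `toInternetStyle` reproduces the two FDA queries shown on the slide.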
Metadata retrieval for HTML Pages
[Figure: unstructured content turned into structured content.]
Metadata extraction: learning phase
[Diagram: HTML results page samples 1 to N each feed a learning phase; every phase updates the extraction agent, from version 1 through version 2 to the final version 3.]
Metadata extraction: self-repairing
When change is detected:
– Find out trusted fragments
– Apply one or more recovery routines
– Repairing can be complete or partial
– Update grammar if recovery threshold is satisfied
Conceptual shift assumption
– Context change: change in the page mark-up, like putting the title in bold font
– Content change: CIKM “Conf. Information Knowledge Management”
– Structural change: label addition/removal, order permutation, etc.
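The last two repair steps (complete or partial repair, then a threshold test before updating the grammar) can be sketched as a small decision function. Everything here is an assumption made for illustration: `RepairDecision`, the representation of recovered fields as a map with null for failures, and the reading of "recovery threshold" as a minimum fraction of recovered fields.

```java
import java.util.Map;

/** Hypothetical repair decision; names and threshold semantics are assumptions. */
class RepairDecision {
    enum Outcome { COMPLETE, PARTIAL, FAILED }

    /** Counts the expected fields for which a recovery routine produced a value. */
    static long recovered(Map<String, String> fields) {
        return fields.values().stream().filter(v -> v != null).count();
    }

    /** fields maps each expected field to its recovered value, or null when recovery failed. */
    static Outcome assess(Map<String, String> fields) {
        long ok = recovered(fields);
        if (ok == fields.size()) {
            return Outcome.COMPLETE;
        }
        return ok > 0 ? Outcome.PARTIAL : Outcome.FAILED;
    }

    /** Update the extraction grammar only when enough fields were recovered. */
    static boolean shouldUpdateGrammar(Map<String, String> fields, double threshold) {
        if (fields.isEmpty()) {
            return false;
        }
        return (double) recovered(fields) / fields.size() >= threshold;
    }
}
```

With three of four fields recovered, the repair is partial and the grammar would be updated at a threshold of 0.75 but not at 0.8.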
Adapter Builder
Unique framework for rapid adapter production, testing and maintenance
DEMO
Adapter Configuration: setup
Admin Center:
– Web-based configuration
– Wizard for enterprise adapters: DCTM, Domino, eRoom, …
Adapter Configuration: diagnostic
Adapter Exchange Site
http://customernet.documentum.com/developer/articles/ECISAdapters.html
Repository to publish and to exchange sample adapters (source-code or binary delivery)
– InternetArchive
– InvisibleWeb
– Java Developer Connection
– SourceForge
– …
Dedicated forum to discuss tips & issues related to adapter development
Go & contribute!
ECI 5.3 Architecture
[Architecture diagram. Components shown: Webtop, Portal, WDK 5.3, DFC 5.3, Docbases, ECI Services 5.3, ECI Client 5.3, ECI Portal 5.3. Access to external contents: Lexis Nexis, Lotus Domino, WWW.]
WDK Search Components
Single box search
– Multi-docbases
Results and Status
– Wait screen
– Results
– Status
– Status refreshed
– Status stopped
– Enter credentials from status
– No results
Advanced and Revise
– Advanced from results (revise a search)
– Advanced from browsing
– Advanced cleared
Preferences
– Search locations - favorite repositories
– Search locations - selected specifically
– Favorite repositories
– Favorite repositories (admin)
Changing Sources
– Change sources
– Check box selected
– Authenticate repository
– Repository added to list
– Navigate into repository and breadcrumb
– Navigate into a cabinet and breadcrumb
Saving a Search
– My saved searches
– Properties of a saved search
– All saved searches
DCTM 5.3 Extended Search
Query Building in DFC
import com.documentum.fc.client.search.*;
import com.documentum.fc.common.IDfValue;
// Initialization ('client' and 'sessionMgr' are assumed to be set up beforehand)
IDfSearchService searchService = client.newSearchService(sessionMgr);
IDfMetadataMgr metadataMgr = searchService.newMetadataMgr();
IDfQueryMgr queryMgr = searchService.newQueryMgr(metadataMgr);
// Creation of the query: select the sources to search
IDfQueryBuilder queryBuilder = queryMgr.newQueryBuilder();
queryBuilder.addSelectedSource("dm_notes");
queryBuilder.addSelectedSource("AmazonBooks");
// Definition of the constraints: a full-text expression plus a "title contains" attribute expression
IDfExpressionSet exprSet = queryBuilder.getRootExpressionSet();
exprSet.addFullTextExpression("singleton Java");
exprSet.addSimpleAttrExpression("title", IDfValue.DF_STRING,
    IDfSimpleAttrExpression.SEARCH_OP_CONTAINS, false, false, "design patterns");
DCTM 5.3 Extended Search
Query Execution in DFC
// Synchronous execution
IDfQueryProcessor syncProcessor = searchService.newQueryProcessor(queryBuilder);
IDfResultsSet results = syncProcessor.blockingSearch(-1); // -1 == no timeout
// Sort the results by date
IDfResultsManipulator manipulator = searchService.newResultsManipulator(metadataMgr);
IDfResultsSet sortedResults = manipulator.sortBy(results, "date", true);
Customers Using Documentum ECI Services
European Patent Office
Conclusion
Get ECIS and ADK!
http://customernet.documentum.com
Adapter Exchange
http://customernet.documentum.com/developer/articles/ECISAdapters.html