chrisjelinski
A brief introduction to cDiscovery / Contract Discovery
Whitepaper
Introducing cDiscovery
Why search and eDiscovery are inadequate when trying to locate contracts in an enterprise
May 2011
Contents
Background
Contractual Information Formats
Relational and Relevance
Automated Contract Recognition
Information Normalisation
High Risk/Value Clause Detection
cDiscovery Extensions over Search and eDiscovery
OCR and spell checking
Table Recognition and Information Extraction
Signature Detection
Customer Specific Information
Search Methods
Search and eDiscovery limitations
Summary
Background
Within today’s enterprise are many different document types held within many differing information sources. One such information type is contracts, with differing document formats such as scanned images, office files and PDFs.
Historically, contracts have been held within file shares as TIFF or image-embedded PDF files with limited metadata attached, originating from scanners or email attachments. Due to the nature of contractual information, contracts need to be signed by both parties, and as such the final contract would be held either as the original paper document or, in most cases, as a copy faxed back to the contracting parties.
Within each business unit and geographical location, various contract management policies could have been deployed. This can create a dispersed and highly irregular contract management environment.
Adding to the complexity, each location or department could have deployed and used many differing contract templates and have received hundreds of various inbound contract formats.
As many of the formats, layouts and information will be unknown, searching for information becomes an arduous task. With every different contracting party, for example, the number of combinations and “false positives” will increase, based on standard eDiscovery or search methods.
A false positive is defined as: “relating to or being an individual or a test result that is erroneously classified in a positive category”.
Related directly back to search: you are given a result that matches your query but is not a correct match.
A common misunderstanding is that eDiscovery provides rich contextual information. Some eDiscovery solutions use digital fingerprints, such as the NIST hash lists, to remove non-relevant data and so improve the relevancy of result sets. This approach, however, is the direct inverse of the cDiscovery process, which classifies information IN. Additionally, eDiscovery uses search within its processing, with the same functional limitations relating to contractual information.
The size of the issue is best illustrated with a simple example based on the contracting parties of a contract.
Each contracting party could be a person, company, organisation, country, department or any other combination of addresses and entities. To effectively search, locate and review contracts based on contracting parties alone, a user would need to know the parties before starting the search, filter out the average 80%+ false positives, then open and review each remaining item for the correct data.
The reason why the false positives could be so high is a factor of the search. Take for example a contracting party of Sony Music. Searching for this alone would produce a result set consisting of every item where Sony Music is mentioned. The search is not able to differentiate between a simple document referring to a music track held by Sony Music and a contract where Sony Music is one of the contracting parties.
A simple test of this is to place the following search into Google: “Sony music” + “contracts”. This will result in over 1.2 million hits, with very few if any being actual contracts. If we further refine the search with “Sony music” + “contracting party”, the results are just over 1,200 items. However, of those items none are actual contracts where Sony Music is the contracting party.
While a Google search of the internet is not a direct representation of internal customer environments, the challenges remain the same. Many organisations already use an internal search engine or eDiscovery, but these do not provide the functionality required to meet the challenge of locating contractual information.
This simple illustration shows that to effectively search, discover and manage contracts a new approach is required, targeted at the information held within contracts. With this in mind, a new technology and methodology is needed.
At this point in the whitepaper we introduce a new technology called cDiscovery. This term will be used to refer to the discovery of contracts throughout the rest of this document. Currently the only cDiscovery solution on the market, and the basis for reference within this document, is the Seal Software cDiscovery solution.
Contractual Information Formats
To further understand cDiscovery’s importance, it is necessary to consider contractual formats and layouts. Contracts in many cases can be free form information items, with dates, parties, clauses and obligations randomly distributed within the documents.
With the majority of contracts being in image formats, Optical Character Recognition (OCR) is required to extract information. During this extraction process, errors due to poor quality can be introduced. For example an “i” becomes an “!”, causing “client” to become “cl!ent”.
Further formats, such as images embedded within PDF files, also require specific handling and processing. For example, when processing PDF files, does the system use a PDF iFilter or equivalent, or does it process the items via an OCR engine because they contain embedded images?
Not only are the actual document formats to be considered, the layouts within the documents must also be recognised. Take for example a contract with a table detailing the contract party, contract values and jurisdiction: the relation of the headings and cells needs to be understood to enable effective processing and discovery. Simple text extraction is not capable of producing a relational view or data correlation between cells.
Relational and Relevance
One of the main advantages of cDiscovery over Search and eDiscovery is its ability to determine the relational mapping and relevance between information within the contract. To illustrate this point, a contract or document can contain a location, say New York. To determine the relevance and importance of this location, the system must process the preceding and following terms, words, phrases and sentences. Thus within the processing of the location, the system must first discover the location, investigate its context and then extract the information if required.
The process of identifying the contract’s jurisdiction is a good example of relevance. The actual contract might have many differing locations, countries or states listed within it. Thus to determine the Jurisdiction, Governing Law and Applicable Law all need to be accounted for. This can only be done when the relevance and relational positioning of the relevant terms is understood.
While standard search engines are capable of determining locations and presenting filters based on them, they don’t present the user with information targeted at the relevant and relational level. In many cases only the location is accounted for.
Further illustration of relevance and relational information can be applied to the first example. Let’s take cDiscovery as the discovery engine instead of a Search or eDiscovery engine. The search results now return only items where “Sony Music” is actually listed as one of the contracting parties, thus reducing the amount of “noise” and false positive results.
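The difference between a mention and a contracting party can be sketched in a few lines of Python. This is a minimal illustration, assuming a deliberately simplified rule: a name counts as a party only when it appears in a "between A and B" preamble. The pattern and the function names `contracting_parties` and `is_party` are hypothetical and are not taken from the Seal Software product.

```python
import re

# Hypothetical, simplified rule: parties are the two names in a
# "between <A> and <B>" preamble. Real contracts vary far more than this.
PREAMBLE = re.compile(
    r"\bbetween\s+(?P<a>[A-Z][\w&.\- ]*?)\s+and\s+(?P<b>[A-Z][\w&.\- ]*?)(?=[,.(]|$)"
)

def contracting_parties(text):
    """Return the names found in a party preamble, or [] if there is none."""
    match = PREAMBLE.search(text)
    if not match:
        return []
    return [match.group("a").strip(), match.group("b").strip()]

def is_party(name, text):
    """True only when `name` is listed as a party, not merely mentioned."""
    return name.lower() in [p.lower() for p in contracting_parties(text)]

contract = "This Agreement is made between Sony Music and Acme Ltd, on 1 June 2010."
memo = "The track is held by Sony Music and will be released next year."

print(is_party("Sony Music", contract))  # the name appears in the preamble
print(is_party("Sony Music", memo))      # the name is only mentioned
```

A plain keyword search would return both documents, whereas the context rule keeps only the first. A production system would need many more patterns and the surrounding-term analysis described above.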
Automated Contract Recognition
Even with the relevance and relational awareness detailed above, further methods are needed to detect contracts and provide users with a simple proactive view of the contractual information.
One area where this is important is within the actual recognition of the contract type. cDiscovery solutions present a graded level of confidence on items classified as contracts. This is important as false positives will likely still occur, though in greatly reduced numbers.
To present a graded confidence level on discovered contracts, the system needs to extract the “type” of contract discovered. To effectively do this, not only are the relational and relevance methods needed, but also the dynamic building of contract types is required.
Take a simple example: a Non-Disclosure Agreement could be listed within a contract as Non-Disclosure, Non-Disclosure Agreement, NDA, Mutual NDA etc. One can see there are many differing combinations for the same contract type. Thus to ensure that the correct contract type is applied, dynamic building of the contract types based on wording, phrases and relational information needs to be applied. This is a significant benefit of the cDiscovery methodology and application over standard search and eDiscovery methods.
Once a contract type has been identified, the confidence level that the item is actually a contract is at its highest level. There are contracts that will be extracted with no contract type, but contain relevant contractual information. These are given the next highest relevance scores. That leaves items that contain only some contractual matches, for example a cover letter for a contract with details on start and termination notices.
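The grading described above can be sketched as follows. This is a minimal illustration under stated assumptions: the alias table, the list of contractual cues, the tier names and the thresholds are all hypothetical. A real solution would build contract types dynamically rather than from a fixed list.

```python
import re

# Hypothetical alias table: surface forms mapped to one canonical contract type.
CONTRACT_TYPE_ALIASES = {
    "Non-Disclosure Agreement": [
        r"mutual\s+nda",
        r"non[- ]disclosure\s+agreement",
        r"non[- ]disclosure",
        r"\bnda\b",
    ],
}

# Hypothetical cues that suggest contractual content without naming a type.
CONTRACTUAL_CUES = [r"governing law", r"termination", r"contracting part(?:y|ies)"]

def classify(text):
    """Return (contract_type, confidence); confidence is high/medium/low/none."""
    lowered = text.lower()
    for contract_type, patterns in CONTRACT_TYPE_ALIASES.items():
        if any(re.search(p, lowered) for p in patterns):
            return contract_type, "high"  # a recognised type gives top confidence
    hits = sum(bool(re.search(p, lowered)) for p in CONTRACTUAL_CUES)
    if hits >= 2:
        return None, "medium"  # no type, but clearly contractual content
    if hits == 1:
        return None, "low"     # only some contractual matches, e.g. a cover letter
    return None, "none"

print(classify("MUTUAL NDA between the parties"))
print(classify("Cover letter: please note the termination notice dates."))
```

The three non-empty tiers mirror the ordering in the text: typed contracts first, untyped items with contractual information next, and items with only partial matches last.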
Information Normalisation
One area commonly overlooked and misunderstood within a search process is information normalisation. Information normalisation is the process of automatically determining the correct value when ambiguity exists. This can be most often seen when processing dates.
Date formats can be US English, UK English, European and many others, with short, long and textual dates being used. An example of this is 01/06/10. Based on the US and UK formats alone, this date can be the 6th of January 2010 or the 1st of June 2010. If this is further extended to word-based dates, such as “the First day of January 2010”, the number of possible forms grows again. It is clear that normalisation absolutely needs to take place.
The normalisation process covers not only dates; it also covers locations, people and companies, where short names or abbreviations are used.
The cDiscovery process, unlike search engines, needs to understand the relevance and context of dates and formats within contracts. It also needs to normalise the information. Without normalisation you could miss a renewal or termination date by nearly five months, referring to our example above.
This process of understanding the locale and relevance of the information is a key differentiator between cDiscovery and Search or eDiscovery methods.
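The ambiguity in the 01/06/10 example can be made concrete with Python’s standard `datetime` module. This is a minimal sketch that only enumerates the valid readings. The `FORMATS` table and the function name `candidate_dates` are assumptions for illustration, and choosing between the readings would still require the contextual analysis described above.

```python
from datetime import date, datetime

# Assumed candidate layouts; a real system would infer the locale from context.
FORMATS = {
    "US (month/day/year)": "%m/%d/%y",
    "UK (day/month/year)": "%d/%m/%y",
}

def candidate_dates(raw):
    """Return every valid reading of a numeric date, keyed by layout label."""
    readings = {}
    for label, fmt in FORMATS.items():
        try:
            readings[label] = datetime.strptime(raw, fmt).date()
        except ValueError:
            pass  # not a valid date under this layout
    return readings

readings = candidate_dates("01/06/10")
gap_days = abs(
    (readings["UK (day/month/year)"] - readings["US (month/day/year)"]).days
)

for label, value in readings.items():
    print(label, value.isoformat())
print("gap in days:", gap_days)
```

When only one layout yields a valid date (for instance a first field greater than 12), the ambiguity disappears without any further context.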
High Risk/Value Clause Detection
Another benefit of cDiscovery is its ability to identify contractual clauses or wordings that present risk or value to the organisation. One such example is the “Assignment” clause within many contracts. The contracting parties either have, or don’t have, the right to assign the contract during a sale, merger or outsourcing event.
Recognition of the risk within the clause is also extended to understanding dates and relative time periods. Take for example a conditional assignment of a contract, where the main contracting party is given the right to assign but must first provide 28 days written notice to all parties.
Again the detection of relevance, proximity, durations and the normalisation of values is required. Thus to understand the inherent risk or value of a clause or body of text, the cDiscovery solution must correlate multiple values.
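Correlating a clause type with a normalised duration can be sketched as below. This is a minimal illustration, assuming simple English phrasing. The pattern, the small number-word table, the 30-day month approximation and the function name `assignment_notice_days` are all hypothetical.

```python
import re

# Assumed lookup tables; the month length is an approximation.
NUMBER_WORDS = {"seven": 7, "fourteen": 14, "thirty": 30, "sixty": 60, "ninety": 90}
UNIT_DAYS = {"day": 1, "week": 7, "month": 30}

# Matches e.g. "28 days written notice" or "fourteen (14) days' prior written notice".
NOTICE = re.compile(
    r"(?P<n>\d+|[a-z]+)\s*(?:\(\d+\)\s*)?(?P<unit>day|week|month)s?'?\s+"
    r"(?:prior\s+)?(?:written\s+)?notice",
    re.IGNORECASE,
)

def assignment_notice_days(clause):
    """Return the notice period in days if the clause concerns assignment, else None."""
    if not re.search(r"\bassign", clause, re.IGNORECASE):
        return None  # not an assignment clause
    match = NOTICE.search(clause)
    if not match:
        return None
    number = match.group("n").lower()
    count = int(number) if number.isdigit() else NUMBER_WORDS.get(number)
    if count is None:
        return None  # number word not in the assumed table
    return count * UNIT_DAYS[match.group("unit").lower()]

clause = (
    "The Supplier may assign this Agreement but must first provide "
    "28 days written notice to all parties."
)
print(assignment_notice_days(clause))
```

The function returns a value only when both signals are present in the same clause, which is the kind of correlation between clause type and duration that the text describes.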
The ability to quickly extend and tailor the detection and extraction of key contextual metadata is also a critical aspect of the cDiscovery process. An item of value to one company can be seen as high risk to another. Thus a cDiscovery solution must have the ability to “learn” so that it can be tailored to customers’ needs based on “teaching”. This iterative process improves and refines cDiscovery’s overall accuracy and precision.
Search and standard eDiscovery methods are not well positioned to provide this level of correlation; they are designed to provide fast access to result sets over millions of documents, but leave the correlation and understanding to the user.
cDiscovery Extensions over Search and eDiscovery
Within the preceding sections, reference has been made to cDiscovery functionality and how this differs from that within a standard search solution. To further understand what cDiscovery provides over and above search and eDiscovery, some key functional extensions are described in this section.
OCR and spell checking
As many, if not all, contracts are held as image files (TIFF, GIF, PDF etc.), OCR processing and information capture need to be performed. During this process, the quality of the scanned images can introduce noise and errors within the text. The errors introduced could, if left unmanaged, cause contracts to be missed during the discovery phase.
To counter and eliminate, where possible, errors of this nature, spell checking and intelligent processing need to be performed. Intelligent processing of spelling mistakes is where the application again looks to the surrounding wordings and phrases to determine the best contextual and relevant replacement for an incorrectly spelled word.
While some search engines do provide spelling suggestions and corrections, this is based primarily on the Levenshtein distance or dictionary based lookups of common words. While this might work for searching and eDiscovery methods alone, it does not provide the required relevance and proximity calculations. This method is also aimed at users’ typing errors, rather than at correcting errors at the source of extraction.
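The limitation of a plain Levenshtein lookup can be shown with a short sketch. The first call below uses the standard edit distance. The second uses a weighted variant in which known OCR confusions are cheap substitutions. The confusion table, the vocabulary and the cost of 0.1 are assumptions for illustration, and the weighted variant is only a simple stand-in for the context-based processing described above, not the method used by any particular product.

```python
# Assumed OCR confusions: (misread character, intended character).
OCR_CONFUSIONS = {("!", "i"), ("!", "l"), ("1", "l"), ("0", "o"), ("5", "s")}

def edit_distance(source, target, cheap_pairs=frozenset(), cheap_cost=0.1):
    """Levenshtein distance; substitutions listed in cheap_pairs cost cheap_cost."""
    previous = [float(j) for j in range(len(target) + 1)]
    for i, s_char in enumerate(source, start=1):
        current = [float(i)]
        for j, t_char in enumerate(target, start=1):
            if s_char == t_char:
                substitution = 0.0
            elif (s_char, t_char) in cheap_pairs:
                substitution = cheap_cost
            else:
                substitution = 1.0
            current.append(min(
                previous[j] + 1.0,               # delete a character from source
                current[j - 1] + 1.0,            # insert a character into source
                previous[j - 1] + substitution,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]

def best_matches(token, vocabulary, cheap_pairs=frozenset()):
    """Return the vocabulary words tied for the smallest distance to token."""
    scored = {word: edit_distance(token, word, cheap_pairs) for word in vocabulary}
    best = min(scored.values())
    return sorted(word for word, score in scored.items() if score == best)

vocabulary = ["bill", "ball", "bell", "bull", "client"]

print(best_matches("b!ll", vocabulary))                  # plain distance: a four-way tie
print(best_matches("b!ll", vocabulary, OCR_CONFUSIONS))  # OCR-aware: one candidate
```

With the plain distance every one-letter neighbour is equally likely, so the engine cannot choose. Knowledge of how the error arose, or of the surrounding words, is what breaks the tie.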
Table Recognition and Information Extraction
With eDiscovery and Search being targeted at finding as much information as possible and leaving the processing to the users, formatting and tabular data is not processed in context. While this is OK for searches and eDiscovery tasks, when dealing with relational contractual data, tabular information must be accounted for.
Take for example a pricing structure that is based on a table, with dates, items and values including a total contractual value within the cells. In most if not all cases the eDiscovery and search engines will extract all the headings followed by all the data as a single stream of text. This totally removes any relationship between the headings, cells and columns, thus making it impossible to determine the context and relevance of the information.
As cDiscovery relies on being able to process information within context, it is imperative that it maintains the linkage between items. Therefore, it is required to be able to process tabular information within the tables, which can become challenging when dealing with image based items.
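The loss of linkage, and what is needed to restore it, can be sketched as follows. The example assumes that the number of columns is known and that the cells arrive in row order. Both are assumptions, since with image-based items this layout information would first have to be recovered. The sample data and the function name `rebuild_rows` are hypothetical.

```python
# A table as a plain text extractor might emit it: headings first, then every cell.
flat_stream = [
    "Date", "Item", "Value",
    "01 Jan 2010", "Licence", "1000",
    "01 Jun 2010", "Support", "250",
]

def rebuild_rows(stream, column_count):
    """Re-associate each cell with its heading, assuming a known column count."""
    headings, cells = stream[:column_count], stream[column_count:]
    if len(cells) % column_count:
        raise ValueError("cell count does not fill whole rows")
    return [
        dict(zip(headings, cells[start:start + column_count]))
        for start in range(0, len(cells), column_count)
    ]

rows = rebuild_rows(flat_stream, column_count=3)
total_value = sum(int(row["Value"]) for row in rows)

for row in rows:
    print(row)
print("total contractual value:", total_value)
```

Once each value is tied to its heading, questions such as "what is the total contractual value" can be answered, which is not possible from the flat stream alone.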
Signature Detection
To further reduce the possibility of false positive results and to target signed contracts, the capability to detect a possible signature within the contract should be available. With this detection, users can be presented with a targeted set of contracts that have a very high confidence level of being entered-into contracts.
Combining this with the extracted information, such as termination and renewal dates or notice periods, a risk and value matrix can be quickly determined.
Customer Specific Information
A further extension that the cDiscovery solution provides is the ability to allow the application to “learn” about the environment it has been installed into. In much the same way as a child learns by examples and reference information, the cDiscovery solution should be able to learn as well. It should not only be able, for example, to recognise a simple list of, say, companies or people that are known to the organisation; it should also be able to quickly use and incorporate this information into the processing algorithms to improve accuracy and extraction of relevant information.
Search Methods
As with eDiscovery, cDiscovery requires a search engine to process and present information to users, allowing them to search based on either the full-text information within the contracts or the proactively extracted contextual information.
Because of the proactive extraction of information, cDiscovery solutions can present users with information without the users knowing what they are looking for. An example of this type of information management and presentation is the contracting party and contract type.
Take the first example of “Sony Music”. This time, however, the user is presented with a view that groups all contracts by type, say Intellectual Property Sale and Sales Contracts. At this point the user only needs to select the view, or faceted search view, to see only the contracts of that type. Add to this the ability to search for Contracting Parties of Sony Music, and the system presents the user with accurate and targeted results, with the ability to view all the extracted information within a single view.
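A faceted view over already-extracted metadata can be sketched in a few lines. The records, the field names and the functions `facet_counts` and `filter_contracts` are hypothetical. The sketch assumes that contract type and contracting parties were extracted in an earlier step.

```python
from collections import Counter

# Hypothetical records, as if produced by an earlier extraction step.
contracts = [
    {"id": 1, "type": "Intellectual Property Sale", "parties": ["Sony Music", "Acme Ltd"]},
    {"id": 2, "type": "Sales Contract", "parties": ["Sony Music", "Globex"]},
    {"id": 3, "type": "Sales Contract", "parties": ["Initech", "Globex"]},
]

def facet_counts(records, field):
    """Count records per value of `field`, as shown beside each facet."""
    return Counter(record[field] for record in records)

def filter_contracts(records, contract_type=None, party=None):
    """Keep records matching the selected facet and/or contracting party."""
    return [
        record for record in records
        if (contract_type is None or record["type"] == contract_type)
        and (party is None or party in record["parties"])
    ]

print(facet_counts(contracts, "type"))
selected = filter_contracts(contracts, contract_type="Sales Contract", party="Sony Music")
print([record["id"] for record in selected])
```

Because the facets come from extracted fields rather than from full-text matches, the user can browse by type first and narrow by party afterwards without knowing the right search terms in advance.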
Search and eDiscovery limitations
The main challenges faced by Search and eDiscovery methods today are lost metadata and formatting when documents are converted to image type files. Most if not all entered-into contracts are image files, with historical data almost always being faxed versions of the original signed contract.
As has been previously detailed, the loss of formatting and metadata causes the applications to only extract streams of text, depending on whether an OCR process has been used. Even with the OCR process, little or no error correction and information correlation is performed by the eDiscovery and Search engines, so errors remain within the extracted text.
With the loss of the metadata and the induced errors, accurate discovery and classification of contracts becomes a significant challenge, and one that current Search and eDiscovery engines cannot meet.
Summary
It should be clear that within an effective contracts discovery process, additional functions and methods are needed over and above what Search and eDiscovery offer.
cDiscovery is a combination of Search, eDiscovery, complex document processing and targeted logic functions, for the proactive extraction and presentation of information within context. The Search and eDiscovery processes provide information reactively, relying on users’ knowledge and efforts to complete the processing. cDiscovery provides proactive presentation, as well as warnings on pending contractual obligations or milestones.
cDiscovery should be seen as a logical extension to any eDiscovery process, as the information discovered and extracted can be utilised by the eDiscovery engines. Further to this, standard web services interfaces are provided within the Search, eDiscovery and cDiscovery applications. Processing of the correct information can therefore occur within the appropriate application, with information flowing seamlessly between each function.
With new regulations and reporting rules, companies can no longer ignore contractual information within their environments. No Enterprise Search or eDiscovery engine is complete without the complement of cDiscovery processing.
COMMERCIAL IN CONFIDENCE © Copyright 2011. Seal Software Solutions Limited. All rights reserved. The contents of this document are commercial in confidence and are not to be copied or supplied in part or whole to third parties without the prior written consent of Seal Software Solutions Limited.