A brief introduction to cDiscovery / Contract Discovery

Whitepaper

Introducing cDiscovery

Why search and eDiscovery are inadequate when trying to locate contracts in an enterprise

May 2011

Contents

Background

Contractual Information Formats

Relational and Relevance

Automated Contract Recognition

Information Normalisation

High Risk / Value Clause Detection

cDiscovery Extensions over Search and eDiscovery

OCR and spell checking

Table Recognition and Information Extraction

Signature Detection

Customer Specific Information

Search Methods

Search and eDiscovery limitations

Summary

Background

Within today’s enterprise there are many different document types held within many differing information sources. One such information type is contracts, with differing document formats such as scanned images, office files and PDFs.

Historically, contracts have been held within file shares, with limited metadata attached to them, as TIFF or image-embedded PDF files originating from scanners or email attachments. Due to the nature of contractual information, contracts need to be signed by both parties, and as such the final contract would be held either as the original paper document or, in most cases, as a copy faxed back to the contracting parties.

Within each business unit and geographical location, various contract management policies could have been deployed. This can create a dispersed and highly irregular contract management environment.

Adding to the complexity, each location or department could have deployed and used many differing contract templates and have received hundreds of various inbound contract formats.

As many of the formats, layouts and information will be unknown, searching for information becomes an arduous task. With every different contracting party, for example, the number of combinations and “false positives” produced by standard eDiscovery or search methods will increase.

A false positive is defined as: “relating to or being an individual or a test result that is erroneously classified in a positive category”.

Relating this directly back to search: you are given a result that matches your query, but it is not a correct match.

A common misunderstanding is that eDiscovery provides rich contextual information. Some eDiscovery solutions use digital fingerprints (NIST lists) to remove non-relevant data and so improve the relevancy of result sets. This approach, however, is the direct inverse of the cDiscovery process, which classifies information IN. Additionally, eDiscovery uses search within its processing, with the same functional limitations relating to contractual information.

The size of the issue is best illustrated with a simple example based on the contracting parties of a contract.

Each contracting party could be a person, company, organisation, country, department or any other combination of addresses and entities. To effectively search, locate and review contracts based on contracting parties alone, a user would need to know the parties before starting the search, filter out the average of 80%+ false positives, then open the remaining items and review each one for the correct data.

The reason why the false positives could be so high is a factor of the search. Take for example a contracting party of Sony Music. Searching for this alone would produce a result set consisting of every item where Sony Music is mentioned. The search is not able to differentiate between a simple document referring to a music track held by Sony Music and a contract where Sony Music is one of the contracting parties.
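The limitation can be illustrated with a minimal Python sketch. The three sample documents and the `keyword_search` helper are invented for illustration and do not represent any particular search product; the point is only that a plain text match returns every mention of the name.

```python
# Hypothetical sample documents: only "licence.txt" is a contract in
# which Sony Music is one of the contracting parties.
DOCUMENTS = {
    "memo.txt": "The track is held by Sony Music and cannot be licensed.",
    "licence.txt": "This Agreement is made between Sony Music and Acme Ltd.",
    "news.txt": "Sony Music announced new artist contracts this week.",
}


def keyword_search(documents, phrase):
    """Return the names of all documents that mention the phrase."""
    needle = phrase.lower()
    return sorted(
        name for name, text in documents.items() if needle in text.lower()
    )


hits = keyword_search(DOCUMENTS, "Sony Music")
print(hits)  # every document matches, although only one is a contract
```

Two of the three hits are false positives in the sense defined earlier: they match the query but are not contracts with Sony Music as a party.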

A simple test of this is to place the following search into Google: “Sony music” + “contracts”. This will result in over 1.2 million hits, with very few if any being actual contracts. If we further refine the search with “Sony music” + “contracting party”, the results are just over 1200 items. However, of those items none are actual contracts where Sony Music is the contracting party.

While a Google search of the internet is not a direct representation of internal customer environments, the challenges remain the same. Many organisations already use an internal search engine or eDiscovery, but these do not provide the functionality required to meet the challenge of locating contractual information.

This simple illustration shows that to effectively search, discover and manage contracts a new approach is required, targeted at the information held within contracts. With this in mind, a new technology and methodology is needed.

At this point in the whitepaper we introduce a new technology called cDiscovery. This term will be used to refer to the discovery of contracts throughout the rest of this document. Currently the only cDiscovery solution on the market, and the basis for reference within this document, is the Seal Software cDiscovery solution.

Contractual Information Formats

To further understand cDiscovery’s importance, it is necessary to consider contractual formats and layouts. Contracts in many cases can be free-form information items, with dates, parties, clauses and obligations randomly distributed within the documents.

With the majority of contracts being in image formats, Optical Character Recognition (OCR) is required to extract information. During this extraction process, errors due to poor image quality can be introduced; for example, an “i” becomes an “!”, causing “client” to become “cl!ent”.

Further formats, such as images embedded within PDF files, also require specific handling and processing. For example, when processing PDF files, does the system use a PDF IFilter or equivalent, or does it process the items via an OCR engine because they contain embedded images?

Not only are the actual document formats to be considered; the layouts within the documents must also be recognised. Take for example a contract with a table detailing the contract party, contract values and jurisdiction. The relation of the headings and cells needs to be understood to enable effective processing and discovery. Simple text extraction is not capable of producing a relational view or data correlation between cells.

Relational and Relevance

One of the main advantages of cDiscovery over Search and eDiscovery is its ability to determine the relational mapping and relevance between information within the contract. To illustrate this point, a contract or document can contain a location, say New York. To determine the relevance and importance of this location, the system must process the preceding and following terms, words, phrases and sentences. Thus within the processing of the location, the system must first discover the location, investigate its context and then extract the information if required.

The process of identifying the contract’s Jurisdiction is a good example of relevance. The actual contract might have many differing locations, countries or states listed within it. Thus to determine the Jurisdiction, Governing Law and Applicable Law all need to be accounted for. This can only be done when the relevance and relational positioning of the relevant terms are understood.
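As a rough sketch of the difference between finding a location and finding a jurisdiction, the Python below contrasts the two. The cue phrases in `GOVERNING_LAW`, the `KNOWN_PLACES` list and the sample text are assumptions made for illustration; a real solution would need far more wording variants and is not claimed to work this way.

```python
import re

# Assumed cue phrases that signal a governing-law statement.
GOVERNING_LAW = re.compile(
    r"(?:governed by|subject to) the laws? of (?:the State of )?"
    r"([A-Z][a-z]+(?: [A-Z][a-z]+)*)"
)

# Hypothetical gazetteer used only to show what a plain location filter sees.
KNOWN_PLACES = ["London", "Paris", "Chicago", "New York"]

CONTRACT = (
    "Acme Ltd, with offices in London and Paris, shall deliver the goods "
    "to the warehouse in Chicago. This Agreement shall be governed by the "
    "laws of the State of New York."
)


def find_jurisdiction(text):
    """Return the location that follows a governing-law cue, if any."""
    match = GOVERNING_LAW.search(text)
    return match.group(1) if match else None


mentioned = [place for place in KNOWN_PLACES if place in CONTRACT]
jurisdiction = find_jurisdiction(CONTRACT)

print(mentioned)     # all four locations are mentioned
print(jurisdiction)  # only one of them is the governing law
```

A location filter alone would offer all four places; only the words surrounding “New York” identify it as the jurisdiction.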

While standard search engines are capable of determining locations and presenting filters based on them, they don’t present the user with information targeted at the relevant and relational level. In many cases only the location is accounted for.

Further illustration of relevance and relational information can be applied to the first example. Let’s take cDiscovery as the discovery engine instead of a Search or eDiscovery engine. The search results now return only items where “Sony Music” is actually listed as one of the contracting parties, thus reducing the amount of “noise” and false positive results.

Automated Contract Recognition

Even with the relevance and relational awareness detailed above, further methods are needed to detect contracts and provide users with a simple proactive view of the contractual information.

One area where this is important is within the actual recognition of the contract type. cDiscovery solutions present a graded level of confidence on items classified as contracts. This is important because false positives, though greatly reduced, will likely still occur.

To present a graded confidence level on discovered contracts, the system needs to extract the “type” of contract discovered. To effectively do this, not only are the relational and relevance methods needed, but also the dynamic building of contract types is required.

Take a simple example: a Non-Disclosure Agreement could be listed within a contract as Non-Disclosure, Non-Disclosure Agreement, NDA, Mutual NDA etc. One can see there are many differing combinations for the same contract type. Thus to ensure that the correct contract type is applied, dynamic building of the contract types based on wording, phrases and relational information needs to be applied. This is a significant benefit of the cDiscovery methodology and application over standard search and eDiscovery methods.
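The many-wordings-to-one-type idea can be sketched in Python as follows. This is a deliberately simplified stand-in: it uses a fixed, hand-written alias table (`CONTRACT_TYPE_ALIASES`, a hypothetical name) rather than the dynamic building of contract types described above, and the aliases listed are illustrative only.

```python
import re

# Hypothetical alias table: canonical contract type -> wordings seen in text.
CONTRACT_TYPE_ALIASES = {
    "Non-Disclosure Agreement": [
        "non-disclosure agreement",
        "non-disclosure",
        "mutual nda",
        "nda",
    ],
    "Sales Contract": ["sales contract", "agreement for sale"],
}


def detect_contract_type(text):
    """Map any known wording in the text to its canonical contract type."""
    lowered = text.lower()
    for canonical, aliases in CONTRACT_TYPE_ALIASES.items():
        for alias in aliases:
            # Word boundaries stop "nda" from matching inside "agenda".
            if re.search(r"\b" + re.escape(alias) + r"\b", lowered):
                return canonical
    return None


print(detect_contract_type("MUTUAL NDA between Acme Ltd and Example GmbH"))
print(detect_contract_type("Agenda for Monday's meeting"))
```

The first call resolves to the canonical type, while the second returns `None` because the word-boundary check prevents a match inside an unrelated word.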

Once a contract type has been identified, the confidence level that the item is actually a contract is at its highest level. There are contracts that will be extracted with no contract type, but contain relevant contractual information. These are given the next highest relevance scores. This leaves items that contain only some contractual matches, for example a cover letter for a contract with details on start and termination notices.
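The three tiers described here can be expressed as a small grading function. The sketch below is an assumption-laden illustration: the `Candidate` fields, the tier names and the `threshold` of three matches are invented for the example and are not taken from any product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    name: str
    contract_type: Optional[str]  # detected contract type, if any
    contractual_matches: int      # number of extracted contractual items


def confidence(candidate, threshold=3):
    """Grade a candidate document; the threshold value is an assumption."""
    if candidate.contract_type is not None:
        return "high"    # a contract type was identified
    if candidate.contractual_matches >= threshold:
        return "medium"  # no type, but substantial contractual information
    if candidate.contractual_matches > 0:
        return "low"     # only some matches, e.g. a cover letter
    return "none"


candidates = [
    Candidate("nda.pdf", "Non-Disclosure Agreement", 7),
    Candidate("scan_041.tif", None, 4),
    Candidate("cover_letter.doc", None, 1),
    Candidate("newsletter.pdf", None, 0),
]
grades = {c.name: confidence(c) for c in candidates}
print(grades)
```

Presenting the grade alongside each result lets a reviewer start with the highest-confidence items and treat the lowest tier as supporting material.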

Information Normalisation

One area commonly overlooked and misunderstood within a search process is information normalisation. Information Normalisation is the process of automatically determining the correct value when ambiguity exists. This can be most often seen when processing dates.

Date formats can be US English, UK English, European and many others, with short, long and textual dates being used. An example of this is 01/06/10. This date can be the 6th of January 2010 or the 1st of June 2010 based on only US and UK formats. Word-based dates, such as “the First day of January 2010”, add further variation. It is clear that normalisation absolutely needs to take place.
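The ambiguity is easy to reproduce with Python’s standard `datetime` module. In this minimal sketch the `FORMATS` table and its labels are assumptions chosen for the example; it only enumerates the possible readings and does not decide between them, which is the part that requires context.

```python
from datetime import datetime

# Assumed candidate formats; a real system would consider many more.
FORMATS = {
    "US (month/day/year)": "%m/%d/%y",
    "UK (day/month/year)": "%d/%m/%y",
}


def interpretations(raw):
    """Return every valid reading of a short numeric date as ISO 8601."""
    results = {}
    for label, fmt in FORMATS.items():
        try:
            results[label] = datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            pass  # the string is not a valid date under this format
    return results


readings = interpretations("01/06/10")
print(readings)
```

For `01/06/10` both formats parse, giving 2010-01-06 and 2010-06-01. A string such as `13/06/10` has only one valid reading, because 13 cannot be a month.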

The Normalisation process covers not only dates; it also covers locations, people and companies, where short names or abbreviations are used.

The cDiscovery process, unlike search engines, needs to understand the relevance and context of dates and formats within contracts. It also needs to normalise the information. Without normalisation you could miss a renewal or termination date by almost five months, referring to our example above.

This process of understanding the locale and relevance of the information is a key differentiator between cDiscovery and Search or eDiscovery methods.

High Risk / Value Clause Detection

Another benefit of cDiscovery is its ability to identify contractual clauses or wordings that present risk or value to the organisation. One such example is the “Assignment” clause within many contracts. The contracting parties either have, or don’t have, the right to assign the contract during a sale, merger or outsourcing event.

Recognition of the risk within the clause is also extended to understanding dates and relative time periods. Take for example a conditional assignment of a contract, where the main contracting party is given the right to assign but must first provide 28 days’ written notice to all parties.
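A relative time period such as this only becomes actionable once it is extracted and tied to a date. The Python sketch below assumes a single clause wording (the `NOTICE` pattern), an invented sample clause and a hypothetical planned assignment date; it is an illustration of the correlation involved, not a description of any product’s logic.

```python
import re
from datetime import date, timedelta

# Assumed wording pattern: "<n> days written notice", with optional
# apostrophe and optional "prior".
NOTICE = re.compile(
    r"(\d+)\s+days['’]?\s+(?:prior\s+)?written\s+notice", re.IGNORECASE
)

CLAUSE = (
    "The Supplier may assign this Agreement provided that it first gives "
    "28 days written notice to all parties."
)


def notice_period_days(clause):
    """Extract the notice period in days, or None if no period is found."""
    match = NOTICE.search(clause)
    return int(match.group(1)) if match else None


def latest_notice_date(event_date, clause):
    """Latest date on which notice can be given before the event."""
    days = notice_period_days(clause)
    if days is None:
        return None
    return event_date - timedelta(days=days)


# Hypothetical assignment date of 1 June 2011.
deadline = latest_notice_date(date(2011, 6, 1), CLAUSE)
print(notice_period_days(CLAUSE), deadline)
```

With a planned assignment on 1 June 2011, notice would have to be served by 4 May 2011, which is the kind of derived value a reviewer would otherwise have to work out by hand.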

Again the detection of relevance, proximity, durations and the normalisation of values is required. Thus to understand the inherent risk or value of a clause or body of text, the cDiscovery solution must correlate multiple values.

The ability to quickly extend and tailor the detection and extraction of key contextual metadata is also a critical aspect of the cDiscovery process. An item of value to one company can be seen as high risk to another. Thus a cDiscovery solution must have the ability to “learn” so that it can be tailored to customers’ needs based on “teaching”. This iterative process improves and refines cDiscovery’s overall accuracy and precision.

Search and standard eDiscovery methods are not well positioned to provide this level of correlation; they are designed to provide fast access to result sets over millions of documents, but leave the correlation and understanding to the user.

cDiscovery Extensions over Search and eDiscovery

Within the preceding sections, reference has been made to cDiscovery functionality and how this differs from that of a standard search solution. To further understand what cDiscovery provides over and above search and eDiscovery, some key functional extensions need to be considered.

OCR and spell checking

As many, if not all, contracts are held as image files (TIFF, GIF, PDF etc.), OCR processing and information capture need to be performed. During this process, the quality of the scanned images can introduce noise and errors within the text. The errors introduced could, if left unmanaged, cause contracts to be missed during the discovery phase.

To counter and eliminate, where possible, errors of this nature, spell checking and intelligent processing need to be performed. Intelligent processing of spelling mistakes is where the application again looks to the surrounding wordings and phrases to determine the best contextual and relevant replacement for an incorrectly spelled word.

While some search engines do provide spelling suggestions and corrections, this is based primarily on the Levenshtein distance or dictionary-based lookups of common words. While this might work for searching and eDiscovery methods alone, it does not provide the required relevance and proximity calculations. This method also corrects only the errors users type into their queries, rather than correcting errors at the source of extraction.
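For reference, the Levenshtein distance counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. The Python sketch below is a standard dynamic-programming implementation; the OCR-style sample strings are invented to show why distance alone cannot always pick the right correction.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


print(levenshtein("cl!ent", "client"))  # one substitution
print(levenshtein("c1ause", "clause"))  # one substitution
print(levenshtein("c1ause", "cause"))   # one deletion
```

The misread token `c1ause` is exactly one edit away from both `clause` and `cause`, so edit distance cannot choose between them; the surrounding words are needed to decide.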

Table Recognition and Information Extraction

With eDiscovery and Search being targeted at finding as much information as possible and leaving the processing to the users, formatting and tabular data are not processed in context. While this is acceptable for search and eDiscovery tasks, when dealing with relational contractual data, tabular information must be accounted for.

Take for example a pricing structure that is based on a table, with dates, items and values including a total contractual value within the cells. In most if not all cases the eDiscovery and search engines will extract all the headings followed by all the data as a single stream of text. This totally removes any relationship between the headings, cells and columns, thus making it impossible to determine the context and relevance of the information.
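The difference between the two extraction styles can be sketched in Python. The pricing table here is made up for illustration, and the two helper functions are hypothetical names; the sketch assumes the table structure has already been recognised, which is the hard part for image-based items.

```python
# Hypothetical pricing table taken from a contract.
HEADINGS = ["Date", "Item", "Value"]
ROWS = [
    ["01 Jan 2011", "Licence fee", "10000"],
    ["01 Jul 2011", "Support", "2500"],
]


def as_text_stream(headings, rows):
    """Flatten the table the way plain text extraction would."""
    cells = list(headings)
    for row in rows:
        cells.extend(row)
    return " ".join(cells)


def as_records(headings, rows):
    """Keep each cell linked to its heading."""
    return [dict(zip(headings, row)) for row in rows]


stream = as_text_stream(HEADINGS, ROWS)
records = as_records(HEADINGS, ROWS)
total = sum(int(record["Value"]) for record in records)

print(stream)   # headings and cells with no relationship between them
print(records)  # one record per row, keyed by heading
print(total)    # a total value can only be computed from the records
```

In the flat stream nothing indicates that `2500` is a value belonging to `Support`; in the record form that linkage is explicit and values can be correlated or summed.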

As cDiscovery relies on being able to process information within context, it is imperative that it maintains the linkage between items. It must therefore be able to process tabular information within tables, which can become challenging when dealing with image-based items.

Signature Detection

To further reduce the possibility of false positive results and to target signed contracts, the capability to detect a possible signature within the contract should be available. With this detection, users can be presented with a targeted set of contracts that have a very high confidence level of being contracts that were actually entered into.

Combining this with the extracted information, such as termination and renewal dates or notice periods, a risk and value matrix can be quickly determined.

Customer Specific Information

A further extension that the cDiscovery solution provides is the ability to allow the application to “learn” about the environment it has been installed into. In much the same way as a child learns by examples and reference information, the cDiscovery solution should be able to learn as well. It should not only be able, for example, to recognise a simple list, say of companies or people that are known to the organisation; it should also be able to quickly use and incorporate this information into the processing algorithms to improve accuracy and extraction of relevant information.

Search Methods

As with eDiscovery, cDiscovery requires a search engine to process and present information to users, allowing them to search based on either the full-text information within the contracts or the proactively extracted contextual information.

Because of the proactive extraction of information, cDiscovery solutions can present users with information without the users knowing what they are looking for. An example of this type of information management and presentation is the contracting party and contract type.

Take the first example of “Sony Music”. This time, however, the user is presented with a view that groups all contracts by type, say Intellectual Property Sale and Sales Contracts. At this point the user only needs to select the view, or faceted search view, to see only the contracts of that type. Add to this the ability to search for Contracting Parties of Sony Music, and the system presents the user with accurate and targeted results, with the ability to view all the extracted information within a single view.
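Once contract type and contracting parties exist as extracted fields, a faceted view reduces to grouping and filtering on those fields. The Python sketch below uses invented records and hypothetical helper names (`facet_by_type`, `filter_contracts`); it assumes the extraction has already happened and shows only the presentation step.

```python
from collections import defaultdict

# Hypothetical extracted metadata for three discovered contracts.
CONTRACTS = [
    {"file": "a.pdf", "type": "Intellectual Property Sale",
     "parties": ["Sony Music", "Acme Ltd"]},
    {"file": "b.tif", "type": "Sales Contract",
     "parties": ["Sony Music", "Example GmbH"]},
    {"file": "c.pdf", "type": "Sales Contract",
     "parties": ["Acme Ltd", "Example GmbH"]},
]


def facet_by_type(contracts):
    """Group contract files by their extracted contract type."""
    facets = defaultdict(list)
    for contract in contracts:
        facets[contract["type"]].append(contract["file"])
    return dict(facets)


def filter_contracts(contracts, contract_type=None, party=None):
    """Return files matching the selected facet and contracting party."""
    selected = []
    for contract in contracts:
        if contract_type is not None and contract["type"] != contract_type:
            continue
        if party is not None and party not in contract["parties"]:
            continue
        selected.append(contract["file"])
    return selected


facets = facet_by_type(CONTRACTS)
sony_sales = filter_contracts(CONTRACTS, "Sales Contract", "Sony Music")
print(facets)
print(sony_sales)
```

Because the filter runs on the extracted party field rather than on the full text, a document that merely mentions Sony Music would not appear in the result.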

Search and eDiscovery limitations

The main challenges faced by Search and eDiscovery methods today are lost metadata and formatting when documents are converted to image-type files. Most if not all entered-into contracts are image files, with historical data almost always being faxed versions of the original signed contract.

As has been previously detailed, the loss of formatting and metadata causes the applications to only extract streams of text, depending on whether an OCR process has been used. Even with the OCR process, little or no error correction and information correlation is performed by the eDiscovery and Search engines, thus leaving errors within the extracted text.

With the loss of the metadata and the induced errors, accurate discovery and classification of contracts becomes a significant challenge, and one that current Search and eDiscovery engines cannot meet.

Summary

It should be clear that within an effective contracts discovery process, additional functions and methods are needed over and above what Search and eDiscovery offer.

cDiscovery is a combination of Search, eDiscovery, complex document processing and targeted logic functions, for the proactive extraction and presentation of information within context. The Search and eDiscovery processes provide information reactively, relying on users’ knowledge and efforts to complete the processing. cDiscovery provides proactive presentation, as well as warnings on pending contractual obligations or milestones.

cDiscovery should be seen as a logical extension to any eDiscovery process, as the information discovered and extracted can be utilised by the eDiscovery engines. Further to this, standard web services interfaces are provided within the Search, eDiscovery and cDiscovery applications. Processing of the correct information can therefore occur within the appropriate application, with information flowing seamlessly between each function.

With new regulations and reporting rules, companies can no longer ignore contractual information within their environments. No Enterprise Search or eDiscovery engine is complete without the complement of cDiscovery processing.

COMMERCIAL IN CONFIDENCE © Copyright 2011. Seal Software Solutions Limited. All rights reserved. The contents of this document are commercial in confidence and are not to be copied or supplied in part or whole to third parties without the prior written consent of Seal Software Solutions Limited.