
October 28, 2008 JSE-PR-08-03: CIKM 2008 -- Why E-Discovery is a CIKM-Hard Problem © Copyright 2008, JustSystems Evans Research, Inc.

Why E-Discovery is a CIKM-Hard Problem

David A. Evans

JustSystems Evans Research, Inc.

October 28, 2008

JSE-PR-08-03


What is E-Discovery? Briefly…

• The requirement to provide to a party in a lawsuit (court case) documents that exist in electronic form.

• Electronic documents include e-mail, text messages, electronic calendars, voicemail, audio files, graphics, photographs, drawings, spreadsheets, CAD files, metadata, animations, files on portable devices and storage media, digital data, etc.

• Note: E-Discovery currently supplements – but is gradually replacing – traditional discovery of “paper” materials. A great deal of E-Discovery practice is grounded in the experiences and expectations of people who are steeped in traditional paper-based (and manual) document discovery.


Why Care about E-Discovery?

• 2009 Market size projected to be $4B… [Source: Socha-Gelbmann Electronic Discovery Survey Public Report, 2007]

• Expected 35% Annual Growth through 2011 [Source: Gartner MarketScope for E-Discovery and Litigation Support Vendors, 2007]

• For the Enterprise…
– High Cost of Compliance (far more than the sale of a large software system, typically)
– Very High Cost of Failure (on the order of a small acquisition)

It’s a Growing and Expensive Problem…


Cases Typically Involving E-Discovery

• Environmental Protection / Violations
• Pharmaceutical (Drug) & General Product Liability
• Infringement
• Antitrust
• Fraud
• Shareholder Actions
• Financial (Securities) Violations
• …

For U.S. and Foreign Enterprises Doing Business in the U.S.


What is CIKM?

• Historically, the expanding intersection of three distinct disciplines
– database management technology
– information retrieval
– knowledge management

• Increasingly, the three disciplines have come to share core methodologies and definitions of problems…

Also Briefly…


What are CIKM2008 Problems?

• Good examples include data and text mining, aspects of knowledge discovery, semi-automatic structuring (classification) of information, social graphs and interactions…

• Solutions involve combinations of technology, often applied in critical sequences, that exploit techniques from the three disciplines.

• In many cases, success or failure may depend as much on the data structure or representation of features used as on the algorithm.

Again, Briefly…


What are CIKM-Hard Problems?

• Not AI-Hard (which dominates CIKM-Hard)
– Requiring the equivalent of near-perfect human intelligence
– For example, automated scientific discovery
– Encompassing perception, symbolic abstraction, reasoning, arguments and explanations communicated in extended discourse, etc.

• CIKM-Hard is beyond what we can do today, but almost imaginable (!)
– For example, delivering to a user precisely the information that the user requires in a particular context of work or activity

Also Briefly…


Typical Challenges in E-Discovery

• Lots of data (order of terabytes)
• Little of actual value
• Short amounts of time for processing
• Importance of manual review
• Redundancy; near-redundancy; faux-redundancy
• Embeddings
• Encodings
• Co-mingling / Co-occurrence of data


Typical Goals of Processing / Analysis

• Enumeration / Individuation of Items
• Determining & Defending Information Status
– Privileged vs. non-Privileged
• Identifying Individuals (and Documents) that should be Involved in Depositions
• Establishing a Chain of Custody / Possession or Knowledge of Events at Points in Time
• …


Some Framing Issues in E-Discovery

• Emphasis on Recall (avoiding false negatives; ensuring exhaustive coverage)

• Human-in-the-Loop Processing – from initial formulation of the problem to evaluation (review) of the results

• Absence of Training Data (uniqueness of circumstances in each case)

• Absence of a Standard Practice (including evaluation metrics that translate into success in practical cases)

Distinctions from CIKM Practice
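To make the emphasis on recall concrete, here is a minimal sketch that scores a produced set of documents against a known responsive set. The document IDs and the toy collection are hypothetical and are not taken from the talk; the point is only that a false negative (a responsive document that was not produced) lowers recall, which is the costly error in this setting.

```python
def recall_precision(relevant: set[str], retrieved: set[str]) -> tuple[float, float]:
    """Return (recall, precision) for a retrieved set against a relevant set."""
    true_positives = len(relevant & retrieved)
    recall = true_positives / len(relevant) if relevant else 1.0
    precision = true_positives / len(retrieved) if retrieved else 1.0
    return recall, precision


# Hypothetical toy collection: 4 responsive documents, 6 retrieved.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d1", "d2", "d3", "d5", "d6", "d7"}

recall, precision = recall_precision(relevant, retrieved)
missed = relevant - retrieved  # false negatives: responsive but not produced
print(recall, precision, missed)
```

In practice the responsive set is not known in advance, which is one reason the absence of training data and of standard evaluation practice matters.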


Some Framing Issues in E-Discovery

• Extreme Importance of Context – Social Network / Social Communications; Time; Replication (Protected vs. Public); Status of Agents; Status of Knowledge (before or after critical event); etc.

• Heterogeneous Information Typology, where one encounters text and non-text intimately intertwined and related; structured and non-structured

• Multi-Language Data (in every sense)

• Importance of Non-Textual Information

Distinctions from CIKM Practice, continued


Focus in E-Discovery

• Compliance (exhaustive accountability)
• Argumentation (serving a forensic purpose; information that fits into a narrative)
• Evidence (information whose interpretation is determined by the circumstances of its discovery; contrast with alternative information)
• Explanation (not retrieval; not simple Q-A)

Distinctions from CIKM Practice


Example Case

• Source Data: Approximately 300 GB of E-Mail
• Number of Directories / “People”: ~5,000
• Number of files: ~1,000,000
• Many Attachments, Compressed Archives
• Text, Data, Multiple Human Languages
• Rate of Redundancy / Duplication: ~50%
• Rate of Errors in Individuation: ~10%

Sorry, No Subject-Matter Details!


Ingredients of a Solution

Process Flow and Techniques…

Process flow stages: Defining the Scope (Data); Gathering the Data; Enumerating the Items; Removing “Privileged” Information; Delivering the Data

Phases: Processing; Review; Analysis

Techniques:

• code normalization
• unzipping compressed data
• language ID
• lexical-atom discovery
• NLP
• term EQ-class discovery
• person identification

• indexing (term/feature selection)
• duplicate/near-duplicate ID
• enumeration/individuation
• cross-linking related items
• social network analysis
• clustering (for topic threads)

• filtering
• classification (P/~P)
• topic mapping
• time series analysis
• pseudo-causal modeling
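As an illustration of the duplicate/near-duplicate ID step, the sketch below uses one common approach: a hash of normalized text for exact duplicates and Jaccard similarity over word shingles for near-duplicates. This is an assumption for illustration, not the method used in the talk; the function names, the shingle size `k`, and the sample texts are all hypothetical.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return re.sub(r"\s+", " ", text.strip().lower())


def content_hash(text: str) -> str:
    """Exact-duplicate key: SHA-256 of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Set of k-word shingles taken from the normalized text."""
    words = normalize(text).split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Near-duplicate score: shared shingles divided by all shingles."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)


# Hypothetical sample texts.
doc_a = "Please review the attached contract before Friday."
doc_b = "please   review the attached contract before Friday."
doc_c = "Please review the attached contract before Monday."

same_ab = content_hash(doc_a) == content_hash(doc_b)  # differ only in case/spacing
score_ac = jaccard(doc_a, doc_c)                      # one word differs
print(same_ab, round(score_ac, 2))
```

A similarity threshold for calling two items "near-duplicates" would have to be chosen per case; at the scale of millions of files, pairwise comparison would typically be replaced by a sketching scheme such as MinHash.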


Illustrations of Typical Problems


Counting Instances of a Document

One Instance? Or Three?

“Same” Document, Different Locations, Same User

[Figure: document A on three devices: Office Desktop, Personal Laptop, Storage Device]


Counting Instances of a Document

One Instance? Or Three?

“Same” Document, Different Locations, Different Users

[Figure: the document held by users A, B, and C]


Counting Instances of a Document

One Instance? Or Three?

“Same” Document, Different Contexts

[Figure: three contexts compared (VS.), two labeled “Embedded Attachment”]


Counting Instances of a Document

One Instance? Or …?

Archives (Tape Back-Up Copies)

[Figure: back-up copies at times t1, t2, t3, t4, t5]
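The counting question in these slides can be sketched as a choice of individuation key: the same records yield different counts depending on whether an "instance" is defined by content alone, by content plus user, or by content plus user plus location. The `Instance` record, its fields, and the sample data below are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    content_id: str  # e.g. a hash of the document content
    custodian: str   # the user who holds the copy
    location: str    # the device or store where the copy was found


def count_by(instances, key) -> int:
    """Number of distinct items under a given individuation key."""
    return len({key(i) for i in instances})


# Hypothetical sample: one document, two users, several devices.
instances = [
    Instance("A", "user1", "office desktop"),
    Instance("A", "user1", "personal laptop"),
    Instance("A", "user1", "storage device"),
    Instance("A", "user2", "office desktop"),
]

by_content = count_by(instances, lambda i: i.content_id)
by_custodian = count_by(instances, lambda i: (i.content_id, i.custodian))
by_location = count_by(instances, lambda i: (i.content_id, i.custodian, i.location))
print(by_content, by_custodian, by_location)
```

Adding a back-up timestamp or an embedding context (standalone vs. attachment) to the key would raise the count again, which is why the slides leave the answer open.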


Context Matters

Document Status is not Transitive…

Privileged Communication at time t1

Still Privileged Communication at time t2?

[Figure: communication between A and B at time t1; communication between B and C at time t2]
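A minimal sketch of why status cannot simply be propagated along a chain of messages: status is evaluated per communication, from its participants, rather than inherited from the content. The rule used here (a message is privileged only if every participant is inside a designated circle) is a deliberately simplified assumption for illustration and is not a statement of any legal standard; the `Message` type and the names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    recipients: frozenset[str]
    time: int
    content_id: str


def is_privileged(msg: Message, privileged_circle: set[str]) -> bool:
    """Toy rule (assumption): privileged only if all participants are in the circle."""
    return ({msg.sender} | msg.recipients) <= privileged_circle


# Hypothetical scenario: A and B are inside the circle, C is not.
circle = {"A", "B"}
m1 = Message("A", frozenset({"B"}), time=1, content_id="doc")  # A to B at t1
m2 = Message("B", frozenset({"C"}), time=2, content_id="doc")  # B to C at t2

status_t1 = is_privileged(m1, circle)
status_t2 = is_privileged(m2, circle)
print(status_t1, status_t2)
```

The same `content_id` appears in both messages, yet the two communications receive different statuses under this toy rule.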


Context Matters

Threads, Embedded Text in E-Mail

What is the Content?
What is the Comment?
What is the Role of Sender? Receiver? CC? BCC?

[Figure: three messages compared (VS.): one containing text x; one containing text y with x embedded; one containing text z with y and x embedded]
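One simple way to begin separating "comment" from embedded "content" in a thread is to split a message body into newly written lines and quoted lines. The sketch below assumes the plain-text convention of prefixing quoted lines with ">"; real mail uses many other quoting styles, so this is an illustration only, and `split_quoted` is a hypothetical helper.

```python
def split_quoted(body: str) -> tuple[list[str], list[str]]:
    """Split a plain-text e-mail body into (new lines, quoted lines).

    Assumes quoted text is marked with a leading '>' (any nesting depth).
    """
    new_lines: list[str] = []
    quoted_lines: list[str] = []
    for line in body.splitlines():
        stripped = line.strip()
        if stripped.startswith(">"):
            quoted_lines.append(stripped.lstrip("> "))
        elif stripped:
            new_lines.append(stripped)
    return new_lines, quoted_lines


# Hypothetical message mirroring the slide: z comments on y, which embeds x.
body = "z.z.z.\n> y.y.y.\n>> x.x.x.\n"
new_text, quoted_text = split_quoted(body)
print(new_text, quoted_text)
```

Deciding who authored each quoted layer, and in which role (sender, receiver, CC, BCC), still requires linking the quoted text back to the earlier messages in the thread.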


Signature of an Event

Baseline:
• Time t1…tk
• Topic T
• Count >

[Figure: nodes A1, B1, C1, D1]


Signature of an Event

Event:
• Time tk+1…tk+j
• Topic T
• Count >

[Figure: nodes A1, A2, A3, A4, B1, B2, B3, C1, C2, C3, C4, D1, D2]


Signature of an Event

Event:
• Time tk+j…tk+m
• Topic T
• Count >

[Figure: nodes A1, A2, A3, A4, B1, B2, B3, C1, C2, C3, C4, D1, D2]


Signature of an Event

Post-Event:
• Time tk+m…tn
• Topic T
• Count >

[Figure: nodes A1, B1, C1, D1]
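The baseline / event / post-event pattern can be sketched as a time series of message counts on a topic, with windows flagged when the count rises well above the baseline. The slides do not state the count threshold, so the rule below (baseline mean plus `k` standard deviations) is an assumption, as are the function name and the sample counts.

```python
from statistics import mean, pstdev


def event_windows(counts: list[int], baseline_len: int, k: float = 3.0) -> list[int]:
    """Indices of post-baseline windows whose count exceeds mean + k * std of the baseline."""
    baseline = counts[:baseline_len]
    threshold = mean(baseline) + k * pstdev(baseline)
    return [
        i for i, c in enumerate(counts)
        if i >= baseline_len and c > threshold
    ]


# Hypothetical per-window message counts on topic T:
# four baseline windows, two busy windows, then a return to baseline.
counts = [4, 5, 4, 5, 13, 14, 5, 4]
flagged = event_windows(counts, baseline_len=4)
print(flagged)
```

A fuller signature would also track who is communicating (the nodes in the figures), not just how many messages are exchanged.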


Summary Observations

• The examples show types of problems that are on the edges or outside the focus of CIKM2008 work.

• Especially difficult is the need to integrate and formalize representations for complex context with social behavior over time – along with very fine-grained access to content.


Conclusion

• E-Discovery is here to stay – it will get “worse” before it gets better.

• No single “solution” will prevail – in particular, it’s not about IR or better ranking or finding the “smoking gun.”

• All the technology and techniques we are developing in the fold of CIKM will be needed.

• It’s CIKM-Hard, but not impossible!


The End

Thanks!