October 28, 2008JSE-PR-08-03: CIKM 2008 -- Why E-Discovery is a CIKM-Hard Problem © Copyright 2008, JustSystems Evans Research, Inc. 1
Why E-Discovery is a CIKM-Hard Problem
David A. Evans
JustSystems Evans Research, Inc.
October 28, 2008
JSE-PR-08-03
What is E-Discovery? Briefly…
• The requirement to provide documents that exist in electronic form to a party in a lawsuit (court case).
• Electronic documents include e-mail, text messages, electronic calendars, voicemail, audio files, graphics, photographs, drawings, spreadsheets, CAD files, metadata, animations, files on portable devices and storage media, digital data, etc.
• Note: E-Discovery currently supplements – but is gradually replacing – traditional discovery of “paper” materials. A great deal of E-Discovery practice is grounded in the experiences and expectations of people who are steeped in traditional paper-based (and manual) document discovery.
Why Care about E-Discovery? It’s a Growing and Expensive Problem…
• 2009 Market size projected to be $4B… [Source: Socha-Gelbmann Electronic Discovery Survey Public Report, 2007]
• Expected 35% Annual Growth through 2011 [Source: Gartner MarketScope for E-Discovery and Litigation Support Vendors, 2007]
• For the Enterprise…
  – High Cost of Compliance (Far more than the sale of a large software system, typically)
  – Very High Cost of Failure (On the order of a small acquisition)
Cases Typically Involving E-Discovery
For U.S. and Foreign Enterprises Doing Business in the U.S.
• Environmental Protection / Violations
• Pharmaceutical (Drug) & General Product Liability
• Infringement
• Antitrust
• Fraud
• Shareholder Actions
• Financial (Securities) Violations
• …
What is CIKM? Also Briefly…
• Historically, the expanding intersection of three distinct disciplines
  – database management technology
  – information retrieval
  – knowledge management
• Increasingly, the three disciplines have come to share core methodologies and definitions of problems…
What are CIKM2008 Problems? Again, Briefly…
• Good examples include data and text mining, aspects of knowledge discovery, semi-automatic structuring (classification) of information, social graphs and interactions…
• Solutions involve combinations of technology, often applied in critical sequences, that exploit techniques from the three disciplines.
• In many cases, success or failure may depend as much on the data structure or representation of features used as on the algorithm.
What are CIKM-Hard Problems? Also Briefly…
• Not AI-Hard (which dominates CIKM-Hard)
  – Requiring the equivalent of near-perfect human intelligence
  – For example, automated scientific discovery
  – Encompassing perception, symbolic abstraction, reasoning, arguments and explanations communicated in extended discourse, etc.
• CIKM-Hard is beyond what we can do today, but almost imaginable (!)
  – For example, delivering to a user precisely the information that the user requires in a particular context of work or activity
Typical Challenges in E-Discovery
• Lots of data (order of terabytes)
• Little of actual value
• Short amounts of time for processing
• Importance of manual review
• Redundancy; near-redundancy; faux-redundancy
• Embeddings
• Encodings
• Co-mingling / Co-occurrence of data
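To make the redundancy / near-redundancy distinction concrete, here is a minimal sketch that compares two texts by word shingles and Jaccard similarity. The shingle size, the 0.5 threshold, and the function names are illustrative assumptions, not anything stated in the talk, and the sketch does not attempt to address faux-redundancy.

```python
import re

def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles of a text (lowercased, punctuation ignored)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def relation(doc1: str, doc2: str, near_threshold: float = 0.5) -> str:
    """Classify a pair as duplicate, near-duplicate, or distinct.

    near_threshold is a hypothetical cut-off chosen for illustration only.
    """
    if doc1 == doc2:
        return "duplicate"
    score = jaccard(shingles(doc1), shingles(doc2))
    return "near-duplicate" if score >= near_threshold else "distinct"

print(relation("The meeting is moved to Friday at noon",
               "The meeting is moved to Friday at noon."))
```

Pairwise comparison like this does not scale to terabytes on its own; in practice one would probably need some form of hashing or indexing to limit the pairs compared.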
Typical Goals of Processing / Analysis
• Enumeration / Individuation of Items
• Determining & Defending Information Status
  – Privileged vs. non-Privileged
• Identifying Individuals (and Documents) that should be Involved in Depositions
• Establishing a Chain of Custody / Possession or Knowledge of Events at Points in Time
• …
Some Framing Issues in E-Discovery
Distinctions from CIKM Practice
• Emphasis on Recall (avoiding false negatives; ensuring exhaustive coverage)
• Human-in-the-Loop Processing – from initial formulation of the problem to evaluation (review) of the results
• Absence of Training Data (uniqueness of circumstances in each case)
• Absence of a Standard Practice (including evaluation metrics that translate into success in practical cases)
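The emphasis on recall can be illustrated with the standard definitions: recall = TP / (TP + FN) and precision = TP / (TP + FP). The sketch below uses made-up document IDs to show how a broad search trades precision for recall; the sets and names are hypothetical.

```python
def recall_precision(retrieved: set, relevant: set) -> tuple:
    """Return (recall, precision) for a retrieved set against a relevant set."""
    true_pos = len(retrieved & relevant)
    false_neg = len(relevant - retrieved)   # relevant items that were missed
    false_pos = len(retrieved - relevant)   # retrieved items that are not relevant
    recall = true_pos / (true_pos + false_neg) if relevant else 1.0
    precision = true_pos / (true_pos + false_pos) if retrieved else 1.0
    return recall, precision

# Hypothetical document IDs, for illustration only.
relevant = {"d1", "d2", "d3", "d4"}
narrow_search = {"d1", "d2"}
broad_search = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}

print(recall_precision(narrow_search, relevant))  # misses half of the relevant items
print(recall_precision(broad_search, relevant))   # finds all, at the cost of more review
```

Under a recall-first requirement the broad search is the safer choice, which is one way to see why manual review volume grows.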
Some Framing Issues in E-Discovery
Distinctions from CIKM Practice, continued
• Extreme Importance of Context – Social Network / Social Communications; Time; Replication (Protected vs. Public); Status of Agents; Status of Knowledge (before or after critical event); etc.
• Heterogeneous Information Typology, where one encounters text and non-text intimately intertwined and related; structured and non-structured
• Multi-Language Data (in every sense)
• Importance of Non-Textual Information
Focus in E-Discovery
Distinctions from CIKM Practice
• Compliance (exhaustive accountability)
• Argumentation (serving a forensic purpose; information that fits into a narrative)
• Evidence (information whose interpretation is determined by the circumstances of its discovery; contrast with alternative information)
• Explanation (not retrieval; not simple Q-A)
Example Case
Sorry, No Subject-Matter Details!
• Source Data: Approximately 300 GB of E-Mail
• Number of Directories / “People”: ~5,000
• Number of files: ~1,000,000
• Many Attachments, Compressed Archives
• Text, Data, Multiple Human Languages
• Rate of Redundancy / Duplication: ~50%
• Rate of Errors in Individuation: ~10%
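A back-of-the-envelope reading of these figures, as a rough sketch. Two things here are assumptions rather than facts from the slide: that a gigabyte is 10^9 bytes, and that the ~10% individuation error rate applies to the full set of ~1,000,000 files. All inputs are the approximate values given above, so the results are orders of magnitude only.

```python
GB = 10**9  # assumption: decimal gigabytes

source_bytes = 300 * GB
n_directories = 5_000
n_files = 1_000_000
redundancy_pct = 50             # ~50% redundancy / duplication
individuation_error_pct = 10    # ~10% errors in individuation

avg_file_bytes = source_bytes // n_files            # average size per file
files_per_directory = n_files // n_directories      # average files per "person"
non_redundant_files = n_files * (100 - redundancy_pct) // 100
# Assumption: the error rate is taken over all files, not only the unique ones.
misindividuated_files = n_files * individuation_error_pct // 100

print(avg_file_bytes, files_per_directory, non_redundant_files, misindividuated_files)
```

Even at these rough magnitudes, an individuation error rate of this size would touch on the order of a hundred thousand items, which is far more than anyone would want to correct by hand.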
Ingredients of a Solution: Process Flow and Techniques…
[Diagram: process-flow steps] Defining the Scope (Data); Gathering the Data; Enumerating the Items; Removing “Privileged” Information; Delivering the Data
[Diagram: stage labels] Processing; Review; Analysis
Techniques (three groups, as listed on the slide):
• code normalization; unzipping compressed data; language ID; lexical-atom discovery; NLP; term EQ-class discovery; person identification
• indexing (term/feature selection); duplicate/near-duplicate ID; enumeration/individuation; cross-linking related items; social network analysis; clustering (for topic threads)
• filtering; classification (P/~P); topic mapping; time series analysis; pseudo-causal modeling
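As one small illustration of the social network analysis item, the sketch below counts directed sender-to-recipient edges from message headers and derives how many distinct correspondents each person has. The input shape (sender, list of recipients), the lowercase normalization, and the sample data are assumptions made for the example.

```python
from collections import Counter

def edge_counts(messages):
    """Count directed sender -> recipient edges over (sender, recipients) pairs."""
    edges = Counter()
    for sender, recipients in messages:
        for recipient in recipients:
            # Lowercasing is a crude stand-in for real person identification.
            edges[(sender.lower(), recipient.lower())] += 1
    return edges

def degree(edges):
    """Number of distinct correspondents per person, ignoring edge direction."""
    neighbors = {}
    for src, dst in edges:
        neighbors.setdefault(src, set()).add(dst)
        neighbors.setdefault(dst, set()).add(src)
    return {person: len(others) for person, others in neighbors.items()}

# Hypothetical messages, for illustration only.
sample_messages = [("A", ["B", "C"]), ("B", ["A"]), ("a", ["b"])]
sample_edges = edge_counts(sample_messages)
print(sample_edges)
print(degree(sample_edges))
```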
Illustrations of Typical Problems
Counting Instances of a Document
“Same” Document, Different Locations, Same User
[Figure: three copies labeled A, one each on an Office Desktop, a Personal Laptop, and a Storage Device]
One Instance? Or Three?
Counting Instances of a Document
“Same” Document, Different Locations, Different Users
[Figure: three copies labeled A, B, and C]
One Instance? Or Three?
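The "one instance or three" question depends on what counts as identity. A minimal sketch, assuming byte-identical copies and a SHA-256 content hash: the same items yield different counts depending on whether identity is content alone, content plus user, or content plus user plus location. The policy names and sample items are hypothetical.

```python
import hashlib

def content_key(data: bytes) -> str:
    """Fingerprint the raw bytes; only byte-identical copies share a key."""
    return hashlib.sha256(data).hexdigest()

def count_instances(items, policy: str) -> int:
    """Count items of the form (user, location, data) under an identity policy."""
    seen = set()
    for user, location, data in items:
        key = content_key(data)
        if policy == "content":
            seen.add(key)
        elif policy == "content+user":
            seen.add((key, user))
        elif policy == "content+user+location":
            seen.add((key, user, location))
        else:
            raise ValueError(f"unknown policy: {policy}")
    return len(seen)

# Hypothetical items mirroring the two situations on the slides.
same_user = [("A", "office desktop", b"memo"),
             ("A", "personal laptop", b"memo"),
             ("A", "storage device", b"memo")]
different_users = [("A", "desktop", b"memo"),
                   ("B", "desktop", b"memo"),
                   ("C", "desktop", b"memo")]

for policy in ("content", "content+user", "content+user+location"):
    print(policy, count_instances(same_user, policy), count_instances(different_users, policy))
```

None of these policies is "the" right answer; which one applies is a decision about the case, which is part of what makes enumeration hard.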
Counting Instances of a Document
“Same” Document, Different Contexts
[Figure: three contexts compared (VS.), two of them labeled Embedded Attachment]
One Instance? Or Three?
Counting Instances of a Document
Archives (Tape Back-Up Copies)
[Figure: back-up copies at times t1, t2, t3, t4, t5]
One Instance? Or …?
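Context Matters
Document Status is not Transitive…
[Figure: communication between A and B at time t1, labeled “Privileged Communication at time t1”; communication between B and C at time t2, labeled “Still Privileged Communication at time t2?”]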
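One way to represent the point that status is not transitive is to attach status to each communication rather than to the document. The sketch below uses a deliberately toy rule (a communication is treated as protected only if it is between a listed pair of parties); the rule, the class, and the names are assumptions for illustration and say nothing about how privilege is actually determined in law.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Communication:
    sender: str
    recipient: str
    time: int
    doc_id: str

def is_privileged(comm: Communication, protected_pairs: set) -> bool:
    """Toy rule: status depends on who is communicating, not on the document alone."""
    return frozenset((comm.sender, comm.recipient)) in protected_pairs

# Hypothetical setup: only communications between A and B are protected.
protected_pairs = {frozenset(("A", "B"))}
first = Communication(sender="A", recipient="B", time=1, doc_id="doc-1")
second = Communication(sender="B", recipient="C", time=2, doc_id="doc-1")

print(is_privileged(first, protected_pairs))   # same document, evaluated at t1
print(is_privileged(second, protected_pairs))  # same document, re-evaluated at t2
```

The design choice worth noting is that `doc_id` is identical in both records while the computed status differs, so status cannot simply be stored once per document.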
Context Matters
Threads, Embedded Text in E-Mail
[Figure: three messages compared (VS.): text x alone; text y with x embedded; text z with y and x embedded]
What is the Content? What is the Comment? What is the Role of Sender? Receiver? CC? BCC?
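A minimal sketch of separating "comment" from embedded "content" in a reply, assuming the common plain-text convention of prefixing quoted lines with ">" (one per nesting level). Real mail often uses other quoting styles, so this is only an illustration; the function name and sample body are hypothetical.

```python
def split_comment_and_quoted(body: str):
    """Separate new text from quoted text, tracking quote depth.

    Returns (comment_lines, quoted_lines), where quoted_lines holds
    (depth, text) pairs and depth is the number of leading '>' markers.
    """
    comment, quoted = [], []
    for line in body.splitlines():
        stripped = line.lstrip()
        depth = 0
        while stripped.startswith(">"):
            depth += 1
            stripped = stripped[1:].lstrip()
        if depth == 0:
            if line.strip():
                comment.append(line.strip())
        else:
            quoted.append((depth, stripped))
    return comment, quoted

# Mirrors the third message on the slide: z comments on y, which embeds x.
sample_body = "z.z.z.\n> y.y.y.\n>> x.x.x."
sample_comment, sample_quoted = split_comment_and_quoted(sample_body)
print(sample_comment)
print(sample_quoted)
```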
Signature of an Event
Baseline:
• Time t1 … tk
• Topic T
• Count >
[Figure: nodes A1, B1, C1, D1]
Signature of an Event
Event:
• Time t(k+1) … t(k+j)
• Topic T
• Count >
[Figure: nodes A1, A2, A3, A4; B1, B2, B3; C1, C2, C3, C4; D1, D2]
Signature of an Event
Event:
• Time t(k+j) … t(k+m)
• Topic T
• Count >
[Figure: nodes A1, A2, A3, A4; B1, B2, B3; C1, C2, C3, C4; D1, D2]
Signature of an Event
Post-Event:
• Time t(k+m) … tn
• Topic T
• Count >
[Figure: nodes A1, B1, C1, D1]
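The baseline / event / post-event pattern can be sketched as a simple time-series check: count on-topic messages per time window and flag windows whose count exceeds some multiple of the baseline average. The threshold on the slides is not recoverable from the text, so the `factor` below, the window width, and the sample timestamps are all hypothetical.

```python
from collections import Counter

def counts_per_window(timestamps, window: int) -> Counter:
    """Count messages per fixed-width time window (integer timestamps)."""
    return Counter(t // window for t in timestamps)

def flag_event_windows(counts, baseline_windows: set, factor: float = 2.0):
    """Flag non-baseline windows whose count exceeds factor x the mean baseline count.

    factor is an illustrative assumption, not a value taken from the talk.
    """
    baseline_mean = sum(counts.get(w, 0) for w in baseline_windows) / len(baseline_windows)
    threshold = factor * baseline_mean
    return sorted(w for w, c in counts.items()
                  if w not in baseline_windows and c > threshold)

# Hypothetical timestamps of messages on one topic T.
sample_timestamps = [1, 5, 12, 18,                 # baseline: 2 per window
                     20, 21, 22, 23, 24, 25,       # burst
                     30, 31, 32, 33, 34, 35, 36,   # burst continues
                     41, 47]                       # back to baseline level
sample_counts = counts_per_window(sample_timestamps, window=10)
print(flag_event_windows(sample_counts, baseline_windows={0, 1}))
```

A count-only check like this ignores who is communicating; the slides suggest the signature also involves which people become active, which would mean combining it with the network view.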
Summary Observations
• The examples show types of problems that are at the edges of, or outside, the focus of CIKM2008 work.
• Especially difficult is the need to integrate and formalize representations for complex context with social behavior over time – along with very fine-grained access to content.
Conclusion
• E-Discovery is here to stay – it will get “worse” before it gets better.
• No single “solution” will prevail – in particular, it’s not about IR or better ranking or finding the “smoking gun.”
• All the technology and techniques we are developing in the fold of CIKM will be needed.
• It’s CIKM-Hard, but not impossible!