+ A Case-based Approach to Content Analysis in Cross-Domain Information Sharing
Research supported by the Air Force Research Lab (AFRL), Rome, New York, under
SBIR Phase I (Contract number FA8750-08-C-0110) and
SBIR Phase II (Contract number FA8750-07-C-0117)
Thomas Reichherzer, Badri Lokanathan, Ashwin Ram
Enkia Corp., Atlanta GA
KSCO 2012, Pensacola, FL
+ Presentation Outline
Cross-Domain Information Sharing & Challenges.
Automated Approaches for Reliable Human Review.
Experimental Evaluation.
Discussion and Conclusions.
+ Cross-Domain Information Sharing
Government and industry alike benefit from information sharing.
industry:
develop and expand new partnerships among business partners
public relations
government:
exchange of mission-critical information across different agencies
Freedom of Information Act
Sharing takes place across institutional boundaries or security domains.
information cannot be freely shared
+ Reliable Human Review
Information must be reviewed to remove sensitive content.
released information must be in compliance with non-disclosure policies across security domains
policies guide release of information
Information review is completed by review officers (e.g., FDOs).
reviewer identifies sensitive content in the document to be removed prior to release
review process is time intensive and requires significant human expertise
policies are complex and subject to changes
Reliable Human Review (RHR) presents a significant bottleneck to “just-in-time” information needs.
+ Just-Enough Information Sharing
Problem: Identifying shareable information in documents is time-consuming and laborious.
Security policies are high-level and difficult to capture by rules.
Timely dissemination of appropriate information is critical in crisis situations.
Need:
Tools for assistance with the classification of information across multiple security domains.
Tools to develop and apply security policies.
Proposed Approach:
Assist RHR by automatic text classification of unstructured text.
A combined approach of Natural Language Processing (NLP) and Case-Based Reasoning (CBR) to automate the process of selecting sensitive content.
+ Assisted RHR: Automation Steps
Select documents that have been marked up by RHR.
mark-up indicates sensitive content with respect to release domain
mark-up captures non-disclosure policies
[Diagram: training and recommending workflow around the case base, with RHR feedback]
Feed marked-up documents into a text classifier.
“learn” security policies from mark-up information
mark-up is labeled into categories
train different classifiers for different release domains
Apply classifier to unmarked documents.
use feedback from RHR to adjust classifier
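A minimal sketch of this loop, assuming a per-domain classifier object (all names hypothetical; simple word overlap stands in for the CBR similarity metric described on later slides):

# Hypothetical sketch of the assisted-RHR loop: one classifier per release
# domain, whose "policy" is simply its case base of RHR-marked sentences.

class DomainClassifier:
    def __init__(self, domain):
        self.domain = domain
        self.case_base = []  # (sentence, markup, category) learned from RHR

    def train(self, marked_sentences):
        # marked_sentences: iterable of (sentence, markup, category) tuples
        self.case_base.extend(marked_sentences)

    def recommend(self, sentence):
        # Stand-in retrieval: return the markup/category of the stored case
        # with the largest word overlap (the real system uses CBR similarity).
        def overlap(case):
            return len(set(sentence.split()) & set(case[0].split()))
        best = max(self.case_base, key=overlap, default=None)
        return (best[1], best[2]) if best else None

    def feedback(self, sentence, markup, category):
        # Reviewer corrections become new cases, adjusting future output.
        self.case_base.append((sentence, markup, category))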
+ Sanitizing Unstructured Text Workflow
Analyst opens/authors document in MS Word or Web Editor.
Analyst selects policy to be applied.
Analyst reviews automatically generated markup.
Analyst edits text to enforce selected policy.
+ Policy Creation Workflow
[Diagram: example markup categories ("Defense", "Act of Terrorism") combined into a release policy ("Secret: Defense Acts of Terrorism")]
Analyst reviews existing markup.
Analyst identifies markup categories.
Analyst creates release policies.
+ Problem-Solving Steps
[Diagram: CBR problem-solving cycle over unmarked documents: text analysis, content query, decision-making/recommendation, and learning from user feedback]
Mark-up recommendations are generated at sentence level!
+ How Recommendations Are Generated
1. User gives classifier some examples (sensitive and non-sensitive).
2. User presents a new problem to classifier.
3. Classifier retrieves similar cases.
4. Classifier decides on classification.
5. Classifier gives a recommendation.
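A toy run of these five steps (hypothetical example data; Jaccard word overlap stands in for the structured case similarity defined on later slides):

from collections import Counter

def jaccard(a, b):
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

# Step 1: user gives the classifier labeled examples.
examples = [
    ("troops deployed along the border", "sensitive"),
    ("aircraft intercepted enemy communications", "sensitive"),
    ("the base hosts a public open day", "non-sensitive"),
]

# Step 2: user presents a new problem.
problem = "aircraft operating along the border intercepted communications"

# Step 3: classifier retrieves the k most similar cases.
retrieved = sorted(examples, key=lambda ex: jaccard(problem, ex[0]), reverse=True)[:2]

# Steps 4 and 5: classifier decides by majority vote and recommends.
votes = Counter(label for _, label in retrieved)
print(votes.most_common(1)[0][0])  # -> sensitive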
+ Architecture Overview
[Architecture diagram: the Text Classification System comprises a Document Analyzer, Case Builder, CBR Engine, Classifier/Marker, and Learner; marked-up documents, unmarked documents, and user feedback flow in; cases are stored in Case Libraries & Indices]
+ Building a Case Base
[Training workflow diagram: marked-up documents from a release domain (e.g., Domain A) pass through the Text Analyzer (Sentence Segmenter, Pronoun Resolver, Sentence Parser) and the Case Builder into the CBR Engine, which stores the resulting cases in that domain's Case Library]
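A schematic of this training workflow in code; every stage is a deliberately naive stand-in for the real Sentence Segmenter, Pronoun Resolver, and Sentence Parser:

# Naive stand-ins for the pipeline stages; a real deployment would plug in
# proper NLP components at each step.

def segment_sentences(text):
    # Sentence Segmenter stand-in: split on periods.
    return [s.strip() for s in text.split(".") if s.strip()]

def resolve_pronouns(sentence):
    # Pronoun Resolver stand-in: a real resolver replaces pronouns with
    # their antecedents so each case is self-contained.
    return sentence

def parse(sentence):
    # Sentence Parser stand-in: a real parser yields the constituents that
    # fill the SUBJECT/PREDICATE/OBJECT slots shown on later slides.
    words = sentence.split()
    return {"SUBJECT": words[:1], "PREDICATE": words[1:2], "OBJECT": words[2:]}

def build_cases(document_text, markup_for, case_library):
    # Pair each analyzed sentence (problem) with its RHR markup (solution);
    # markup_for is a hypothetical lookup into the document's mark-up.
    for sentence in segment_sentences(document_text):
        frame = parse(resolve_pronouns(sentence))
        case_library.append({"problem": frame, "solution": markup_for(sentence)})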
+ Generating Recommendations
[Classification workflow diagram: new documents pass through the Document Analyzer; the CBR Engine's Case Retriever matches them against the domain's Case Library; the Marker/Classifier emits marked-up documents; user feedback flows back through the Learner]
+ Input Sentence - Example
BEARCLAW aircraft operating inside Friendlandia along the Narcotica border have intercepted communications indicating the site of a large heroin processing facility at PK848972 approx 4km north of the village of Lago Springo.
information not to be disclosed
+ Case Construction - Parsing
[[NP BEARCLAW aircraft] [VP operating inside Friendlandia along the Narcotica border]] [VP [VP have intercepted] [NP communications] [S [VP indicating] [NP the site of a large heroin processing facility] [PP at PK848972 approx 4km north of the village of Lago Springo]]].
+ Case Construction - Mapping
[[PROBLEM [SUBJECT BEARCLAW aircraft] [SUBJECT_SUB1 operating inside Friendlandia along the Narcotica border] [PREDICATE have intercepted] [OBJECT communications] [OBJECT_SUB1 indicating the site of a large heroin processing facility at PK848972 approx 4km north of the village of Lago Springo]]
[SOLUTION [MARKUP operating Friendlandia Narcotica border] [CLASSIFICATION NDP-1 Category 1?]]]
+ Feature Vector of a Case
Features: SUBJECT, SUBJECT_SUB1, PREDICATE, OBJECT, OBJECT_SUB1, …

[[PROBLEM
[SUBJECT w1 w2 … wn]
[SUBJECT_SUB1 w1 w2 … wn]
[PREDICATE w1 w2 … wn]
[OBJECT w1 w2 … wn]
[OBJECT_SUB1 w1 w2 … wn]]
[SOLUTION [MARKUP w1 w2 … wn] [CLASSIFICATION c]]]
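As a concrete data structure, the BEARCLAW case from the previous slides might be represented as follows (slot names follow the slides; the Python representation itself is illustrative):

from dataclasses import dataclass

@dataclass
class Case:
    problem: dict          # feature name -> list of words
    markup: list           # words the reviewer marked as sensitive
    classification: str    # non-disclosure policy category

bearclaw_case = Case(
    problem={
        "SUBJECT": ["BEARCLAW", "aircraft"],
        "SUBJECT_SUB1": "operating inside Friendlandia along the Narcotica border".split(),
        "PREDICATE": ["have", "intercepted"],
        "OBJECT": ["communications"],
        "OBJECT_SUB1": ("indicating the site of a large heroin processing "
                        "facility at PK848972 approx 4km north of the "
                        "village of Lago Springo").split(),
    },
    markup=["operating", "Friendlandia", "Narcotica", "border"],
    classification="NDP-1 Category 1",
)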
+ Distance between Sentences
S1: aircraft transport troops & equipment.
S2: aircraft fly over enemy territory.
S3: troops are deployed in foreign country.

S1 – S2 sentence comparison:
[[SUBJECT aircraft] [PREDICATE transport] [OBJECT troops & equipment]]
[[SUBJECT aircraft] [PREDICATE fly over] [OBJECT enemy territory]]

S2 – S3 sentence comparison:
[[SUBJECT aircraft] [PREDICATE fly over] [OBJECT enemy territory]]
[[SUBJECT troops] [PREDICATE are deployed in] [OBJECT foreign country]]

S1 – S3 sentence comparison:
[[SUBJECT aircraft] [PREDICATE transport] [OBJECT troops & equipment]]
[[SUBJECT troops] [PREDICATE are deployed in] [OBJECT foreign country]]
+ Distance Metric (Sentences)
Let S1 = <f11, f12, f13, f14, f15> and S2 = <f21, f22, f23, f24, f25> be the feature vectors of two sentences.

D(S1, S2) = k1·sim(f11, f21) + k2·sim(f12, f22) + k3·sim(f13, f23) + k4·sim(f14, f24) + k5·sim(f15, f25)

where k1, k2, k3, k4, k5 are weighting parameters.
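Transcribed directly into Python (the weights k1..k5 are tunable parameters; word_sim is the word-level similarity defined on the next slide):

# Direct transcription of D(S1, S2): a weighted combination of per-feature
# similarities over the five features above.

FEATURES = ["SUBJECT", "SUBJECT_SUB1", "PREDICATE", "OBJECT", "OBJECT_SUB1"]

def feature_sim(f1, f2, word_sim):
    # sim(f1i, f2i): summed word-to-word similarities (next slide).
    return sum(word_sim(w1, w2) for w1 in f1 for w2 in f2)

def D(s1, s2, k, word_sim):
    # s1, s2: dicts mapping feature name -> word list; k: one weight per feature.
    return sum(k[i] * feature_sim(s1.get(f, []), s2.get(f, []), word_sim)
               for i, f in enumerate(FEATURES))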
+ Distance Metric (Words)
If f11 = [w111, w112, …, w11n] and f21 = [w211, w212, …, w21m], then

sim(f11, f21) = sim(w111, w211) + sim(w111, w212) + … + sim(w11n, w21m)

i.e., the sum of word-to-word similarities over all word pairs, with the word-level similarities taken from the Lin Thesaurus.
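A runnable sketch of this word-level similarity; the tiny table below merely stands in for lookups against the Lin Thesaurus:

# sim(f11, f21) as the sum of word-pair similarities. The stand-in table
# plays the role of the Lin Thesaurus so the example runs on its own.

_THESAURUS = {
    frozenset(("troops", "soldiers")): 0.8,
    frozenset(("enemy", "hostile")): 0.7,
}

def word_sim(w1, w2):
    if w1 == w2:
        return 1.0
    return _THESAURUS.get(frozenset((w1, w2)), 0.0)

def feature_sim(f1, f2):
    # Sum over all word pairs, as in the formula above.
    return sum(word_sim(w1, w2) for w1 in f1 for w2 in f2)

print(feature_sim(["aircraft", "troops"], ["aircraft", "soldiers"]))  # 1.8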
+ Performance Evaluation – Data Set
IMDb database.
movie descriptions for selected movies rated PG, PG-13, and R
descriptions were manually classified using 6 different categories:
Category            Number of cases
general violence                 65
graphic                         730
nudity                          498
drug use                        349
dark topic                      298
sexual content                    6
+ Performance Evaluation – Training and Testing
Train classifier using marked-up information from movie descriptions.
Condition A: use all mark-up data (1946 cases).
Condition B: use 90% of the mark-up data, randomly selected (1752 cases).
Error rates:
Under Condition B, the classifier must generalize to cases withheld from training.
             Condition A    Condition B
mis-labeled       30.5%          33.1%
not-labeled        1.6%           1.6%
+ Conclusions
Information sharing is critical; automated methods are needed.
methods need to go beyond keyword checking
Proposed approach captures human expertise in classifying information.
policies are indirectly captured as cases in the case base
Markup and classification are generated at sentence level, not document level.
Direct feedback from reviewers refines and revises the system's classification knowledge.
Scalability is affected by larger training sets.
more examples improve accuracy
more examples slow down classification