+ A Case-based Approach to Content Analysis in Cross-Domain Information Sharing
Research supported by the Air Force Research Lab (AFRL), Rome, New York, under
SBIR Phase I (Contract number FA8750-08-C-0110) and
SBIR Phase II (Contract number FA8750-07-C-0117)
Thomas Reichherzer, Badri Lokanathan, Ashwin Ram
Enkia Corp., Atlanta GA
KSCO 2012, Pensacola, FL
+ Presentation Outline
Cross-Domain Information Sharing & Challenges.
Automated Approaches for Reliable Human Review.
Experimental Evaluation.
Discussion and Conclusions.
+ Cross-Domain Information Sharing
Government and industry alike benefit from information sharing.
industry:
develop and expand new partnerships among business partners
public relations
government:
exchange of mission-critical information across different agencies
Freedom of Information Act
Sharing takes place across institutional boundaries or security domains.
information cannot be freely shared
+ Reliable Human Review
Information must be reviewed to remove sensitive content.
released information must be in compliance with non-disclosure policies across security domains
policies guide release of information
Information review is completed by review officers (e.g., FDOs).
reviewer identifies sensitive content in the document to be removed prior to release
review process is time intensive and requires significant human expertise
policies are complex and subject to changes
Reliable Human Review (RHR) presents a significant bottleneck to “just-in-time” information needs.
+ Just-Enough Information Sharing
Problem: Identifying shareable information in documents is time-consuming and laborious.
Security policies are high-level and difficult to capture by rules.
Timely dissemination of appropriate information is critical in crisis situations.
Need:
Tools for assistance with the classification of information across multiple security domains.
Tools to develop and apply security policies.
Proposed Approach:
Assist RHR by automatic text classification of unstructured text.
A combined approach of Natural Language Processing (NLP) and Case-Based Reasoning (CBR) to automate the process of selecting sensitive content.
+ Assisted RHR: Automation Steps
Select documents that have been marked up by RHR.
mark-up indicates sensitive content with respect to release domain
mark-up captures non-disclosure policies
[Diagram: training and recommending workflow around the case base, with RHR feedback]
Feed marked-up documents into a text classifier.
“learn” security policies from mark-up information
mark-up is labeled into categories
train different classifiers for different release domains
Apply classifier to unmarked documents.
use feedback from RHR to adjust classifier
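A minimal sketch of this loop, assuming a per-domain classifier object (all names hypothetical; simple word overlap stands in for the CBR similarity metric described on later slides):

# Hypothetical sketch of the assisted-RHR loop: one classifier per release
# domain, whose "policy" is simply its case base of RHR-marked sentences.

class DomainClassifier:
    def __init__(self, domain):
        self.domain = domain
        self.case_base = []  # (sentence, markup, category) learned from RHR

    def train(self, marked_sentences):
        # marked_sentences: iterable of (sentence, markup, category) tuples
        self.case_base.extend(marked_sentences)

    def recommend(self, sentence):
        # Stand-in retrieval: return the markup/category of the stored case
        # with the largest word overlap (the real system uses CBR similarity).
        def overlap(case):
            return len(set(sentence.split()) & set(case[0].split()))
        best = max(self.case_base, key=overlap, default=None)
        return (best[1], best[2]) if best else None

    def feedback(self, sentence, markup, category):
        # Reviewer corrections become new cases, adjusting future output.
        self.case_base.append((sentence, markup, category))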
+ Sanitizing Unstructured Text Workflow
Analyst opens/authors document in MS Word or Web Editor.
Analyst selects policy to be applied.
Analyst reviews automatically generated markup.
Analyst edits text to enforce selected policy.
+ Policy Creation Workflow
[Diagram: example markup categories ("Defense", "Act of Terrorism") combined into a release policy ("Secret: Defense Acts of Terrorism")]
Analyst reviews existing markup.
Analyst identifies markup categories.
Analyst creates release policies.
+ Problem-Solving Steps
[Diagram: CBR problem-solving cycle over unmarked documents: text analysis, content query, decision-making/recommendation, and learning from user feedback]
Mark-up recommendations are generated at sentence level!
+ How Recommendations Are Generated
1. User gives classifier some examples (sensitive and non-sensitive).
2. User presents a new problem to classifier.
3. Classifier retrieves similar cases.
4. Classifier decides on classification.
5. Classifier gives a recommendation.
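A toy run of these five steps (hypothetical example data; Jaccard word overlap stands in for the structured case similarity defined on later slides):

from collections import Counter

def jaccard(a, b):
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

# Step 1: user gives the classifier labeled examples.
examples = [
    ("troops deployed along the border", "sensitive"),
    ("aircraft intercepted enemy communications", "sensitive"),
    ("the base hosts a public open day", "non-sensitive"),
]

# Step 2: user presents a new problem.
problem = "aircraft operating along the border intercepted communications"

# Step 3: classifier retrieves the k most similar cases.
retrieved = sorted(examples, key=lambda ex: jaccard(problem, ex[0]), reverse=True)[:2]

# Steps 4 and 5: classifier decides by majority vote and recommends.
votes = Counter(label for _, label in retrieved)
print(votes.most_common(1)[0][0])  # -> sensitive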
+ Architecture Overview
[Architecture diagram: the Text Classification System comprises a Document Analyzer, Case Builder, CBR Engine, Classifier/Marker, and Learner; marked-up documents, unmarked documents, and user feedback flow in; cases are stored in Case Libraries & Indices]
+ Building a Case Base
[Training workflow diagram: marked-up documents from a release domain (e.g., Domain A) pass through the Text Analyzer (Sentence Segmenter, Pronoun Resolver, Sentence Parser) and the Case Builder into the CBR Engine, which stores the resulting cases in that domain's Case Library]
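A schematic of this training workflow in code; every stage is a deliberately naive stand-in for the real Sentence Segmenter, Pronoun Resolver, and Sentence Parser:

# Naive stand-ins for the pipeline stages; a real deployment would plug in
# proper NLP components at each step.

def segment_sentences(text):
    # Sentence Segmenter stand-in: split on periods.
    return [s.strip() for s in text.split(".") if s.strip()]

def resolve_pronouns(sentence):
    # Pronoun Resolver stand-in: a real resolver replaces pronouns with
    # their antecedents so each case is self-contained.
    return sentence

def parse(sentence):
    # Sentence Parser stand-in: a real parser yields the constituents that
    # fill the SUBJECT/PREDICATE/OBJECT slots shown on later slides.
    words = sentence.split()
    return {"SUBJECT": words[:1], "PREDICATE": words[1:2], "OBJECT": words[2:]}

def build_cases(document_text, markup_for, case_library):
    # Pair each analyzed sentence (problem) with its RHR markup (solution);
    # markup_for is a hypothetical lookup into the document's mark-up.
    for sentence in segment_sentences(document_text):
        frame = parse(resolve_pronouns(sentence))
        case_library.append({"problem": frame, "solution": markup_for(sentence)})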
+ Generating Recommendations
[Classification workflow diagram: new documents pass through the Document Analyzer; the CBR Engine's Case Retriever matches them against the domain's Case Library; the Marker/Classifier emits marked-up documents; user feedback flows back through the Learner]
+ Input Sentence - Example
BEARCLAW aircraft operating inside Friendlandia along the Narcotica border have intercepted communications indicating the site of a large heroin processing facility at PK848972 approx 4km north of the village of Lago Springo.
information not to be disclosed
+ Case Construction - Parsing
[[NP BEARCLAW aircraft] [VP operating inside Friendlandia along the Narcotica border]] [VP [VP have intercepted] [NP communications] [S [VP indicating] [NP the site of a large heroin processing facility] [PP at PK848972 approx 4km north of the village of Lago Springo]]].
+ Case Construction - Mapping
[[PROBLEM [SUBJECT BEARCLAW aircraft] [SUBJECT_SUB1 operating inside Friendlandia along the Narcotica border] [PREDICATE have intercepted] [OBJECT communications] [OBJECT_SUB1 indicating the site of a large heroin processing facility at PK848972 approx 4km north of the village of Lago Springo]]
[SOLUTION [MARKUP operating Friendlandia Narcotica border] [CLASSIFICATION NDP-1 Category 1?]]]
+ Feature Vector of a Case
Features: SUBJECT, SUBJECT_SUB1, PREDICATE, OBJECT, OBJECT_SUB1, …

[[PROBLEM
[SUBJECT w1 w2 … wn]
[SUBJECT_SUB1 w1 w2 … wn]
[PREDICATE w1 w2 … wn]
[OBJECT w1 w2 … wn]
[OBJECT_SUB1 w1 w2 … wn]]
[SOLUTION [MARKUP w1 w2 … wn] [CLASSIFICATION c]]]
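As a concrete data structure, the BEARCLAW case from the previous slides might be represented as follows (slot names follow the slides; the Python representation itself is illustrative):

from dataclasses import dataclass

@dataclass
class Case:
    problem: dict          # feature name -> list of words
    markup: list           # words the reviewer marked as sensitive
    classification: str    # non-disclosure policy category

bearclaw_case = Case(
    problem={
        "SUBJECT": ["BEARCLAW", "aircraft"],
        "SUBJECT_SUB1": "operating inside Friendlandia along the Narcotica border".split(),
        "PREDICATE": ["have", "intercepted"],
        "OBJECT": ["communications"],
        "OBJECT_SUB1": ("indicating the site of a large heroin processing "
                        "facility at PK848972 approx 4km north of the "
                        "village of Lago Springo").split(),
    },
    markup=["operating", "Friendlandia", "Narcotica", "border"],
    classification="NDP-1 Category 1",
)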
+ Distance between Sentences
S1: aircraft transport troops & equipment.
S2: aircraft fly over enemy territory.
S3: troops are deployed in foreign country.

S1 – S2 sentence comparison:
[[SUBJECT aircraft] [PREDICATE transport] [OBJECT troops & equipment]]
[[SUBJECT aircraft] [PREDICATE fly over] [OBJECT enemy territory]]

S2 – S3 sentence comparison:
[[SUBJECT aircraft] [PREDICATE fly over] [OBJECT enemy territory]]
[[SUBJECT troops] [PREDICATE are deployed in] [OBJECT foreign country]]

S1 – S3 sentence comparison:
[[SUBJECT aircraft] [PREDICATE transport] [OBJECT troops & equipment]]
[[SUBJECT troops] [PREDICATE are deployed in] [OBJECT foreign country]]
+ Distance Metric (Sentences)
Let S1 = <f11, f12, f13, f14, f15> and S2 = <f21, f22, f23, f24, f25> be the feature vectors of two sentences.

D(S1, S2) = k1·sim(f11, f21) + k2·sim(f12, f22) + k3·sim(f13, f23) + k4·sim(f14, f24) + k5·sim(f15, f25)

where k1, k2, k3, k4, k5 are weighting parameters.
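Transcribed directly into Python (the weights k1..k5 are tunable parameters; word_sim is the word-level similarity defined on the next slide):

# Direct transcription of D(S1, S2): a weighted combination of per-feature
# similarities over the five features above.

FEATURES = ["SUBJECT", "SUBJECT_SUB1", "PREDICATE", "OBJECT", "OBJECT_SUB1"]

def feature_sim(f1, f2, word_sim):
    # sim(f1i, f2i): summed word-to-word similarities (next slide).
    return sum(word_sim(w1, w2) for w1 in f1 for w2 in f2)

def D(s1, s2, k, word_sim):
    # s1, s2: dicts mapping feature name -> word list; k: one weight per feature.
    return sum(k[i] * feature_sim(s1.get(f, []), s2.get(f, []), word_sim)
               for i, f in enumerate(FEATURES))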
+ Distance Metric (Words)
If f11 = [w111, w112, …, w11n] and f21 = [w211, w212, …, w21m], then

sim(f11, f21) = sim(w111, w211) + sim(w111, w212) + … + sim(w11n, w21m)

i.e., the sum of word-to-word similarities over all word pairs, with the word-level similarities taken from the Lin Thesaurus.
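A runnable sketch of this word-level similarity; the tiny table below merely stands in for lookups against the Lin Thesaurus:

# sim(f11, f21) as the sum of word-pair similarities. The stand-in table
# plays the role of the Lin Thesaurus so the example runs on its own.

_THESAURUS = {
    frozenset(("troops", "soldiers")): 0.8,
    frozenset(("enemy", "hostile")): 0.7,
}

def word_sim(w1, w2):
    if w1 == w2:
        return 1.0
    return _THESAURUS.get(frozenset((w1, w2)), 0.0)

def feature_sim(f1, f2):
    # Sum over all word pairs, as in the formula above.
    return sum(word_sim(w1, w2) for w1 in f1 for w2 in f2)

print(feature_sim(["aircraft", "troops"], ["aircraft", "soldiers"]))  # 1.8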
+ Performance Evaluation – Data Set
IMDb database.
movie descriptions for selected movies rated PG, PG-13, and R
descriptions were manually classified using 6 different categories:
Category            Number of cases
general violence                 65
graphic                         730
nudity                          498
drug use                        349
dark topic                      298
sexual content                    6
+ Performance Evaluation – Training and Testing
Train classifier using marked-up information from movie descriptions.
Condition A: use all mark-up data (1946 cases).
Condition B: use 90% of the mark-up data, randomly selected (1752 cases).
Error rates:
Under Condition B, the classifier must generalize to cases withheld from training.
             Condition A    Condition B
mis-labeled       30.5%          33.1%
not-labeled        1.6%           1.6%
+ Conclusions
Information sharing is critical; automated methods are needed.
methods need to go beyond keyword checking
Proposed approach captures human expertise in classifying information.
policies are indirectly captured as cases in the case base
Markup and classification are generated at sentence level, not document level.
Direct feedback from reviewers refines and revises the system's classification knowledge.
Scalability is affected by larger training sets.
more examples improve accuracy
more examples slow down classification