© 2010 IBM Corporation 1 Mass Declassification What If? Jeff Jonas, IBM Distinguished Engineer Chief Scientist, IBM Entity Analytics [email protected] September 23, 2010

© 2010 IBM Corporation

Mass Declassification

What If?

Jeff Jonas, IBM Distinguished Engineer

Chief Scientist, IBM Entity Analytics

[email protected]

September 23, 2010

The Ask

What emerging technology or innovative approaches come to mind … which may have applicability to this task?

Use your imagination. What if?

Not talking about any specific products

Not focusing on the widely available COTS/GOTS technologies (OCR, document management, case management, workflow, etc.)

The Problem at Hand

Volumes may be beyond brute-force human review (at 5 min each = 18,382 FTEs)

Necessitates some form of machine triage:

– Red: A disclosure risk

– Yellow: A possible disclosure risk

– Green: No disclosure risk

Reliable machine triage requires substantially better prediction systems

Even then, advanced means for humans to deal with the remaining large volumes of “possibles” is still required
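
As a rough illustration of the Red/Yellow/Green idea, the sketch below maps a predicted disclosure-risk score to a triage color. The score range, the two cutoffs, and the names `triage`, `RED_AT`, and `GREEN_BELOW` are assumptions made for this example; the slides do not specify how the triage is computed.

```python
from enum import Enum


class Triage(Enum):
    RED = "A disclosure risk"
    YELLOW = "A possible disclosure risk"
    GREEN = "No disclosure risk"


# Hypothetical cutoffs; a real system would tune these against human dispositions.
RED_AT = 0.8
GREEN_BELOW = 0.2


def triage(risk_score: float) -> Triage:
    """Map a predicted disclosure-risk score in [0, 1] to a triage color."""
    if not 0.0 <= risk_score <= 1.0:
        raise ValueError("risk_score must be between 0 and 1")
    if risk_score >= RED_AT:
        return Triage.RED
    if risk_score < GREEN_BELOW:
        return Triage.GREEN
    # Everything in between still needs a human look.
    return Triage.YELLOW
```

The wide middle band is deliberate in this sketch: it reflects the point that a large volume of "possibles" remains for humans even with machine triage.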

Background

Early 1980s: Founded Systems Research & Development (SRD), a custom software consultancy

1989 – 2003: Built numerous systems for Las Vegas casinos including a technology known as Non-Obvious Relationship Awareness (NORA)

2001/2003: Funded by In-Q-Tel

2005: IBM acquires SRD

Cumulatively: I have had a hand in a number of systems holding billions of rows that describe hundreds of millions of entities

Affiliations:

– Member, Markle Foundation Task Force on National Security in the Information Age

– Senior Associate, Center for Strategic and International Studies (CSIS)

– Distinguished Research Faculty (adjunct), Singapore Management University, School of Information Systems

– Member, EPIC advisory board

– Board Member, US Geospatial Intelligence Foundation (USGIF), the GEOINT organizing body

In Today’s Session

Intro to context accumulating systems

Predictions and data points needed for mass declassification

Strawman architecture

Challenges

Q&A

Context Accumulating Systems

From Pixels to Pictures to Insight

Observations

Contextualization

Context

Relevance

Consumer (an analyst, a system, the sensor itself, etc.)

Context, definition of:

Better understanding something by taking into account the things around it.

Without Context

[email protected]

Consequences

Algorithms flat-lining (e.g., alert queues)

Enterprise amnesia on the rise

Overwhelmed by false positives and false negatives? You have seen nothing yet

Not enough humans to fix this with brute force

Risk assessment becomes the risk

Context Accumulation

Trusted Supplier

Job Applicant

Stolen Identity

Known Terrorist

[email protected]

Puzzle Metaphor Primer

Imagine an ever-growing pile of puzzle pieces of varying sizes, shapes and colors

What it represents is unknown – there is no picture on hand

Is it one puzzle, 15 puzzles, or 1,500 puzzles?

Some pieces are duplicates and some are missing

Some pieces are incomplete, low quality, or have been misinterpreted

Some pieces may even be professionally fabricated lies

Until you take the pieces to the table, you don’t know what you are dealing with

How Context Accumulates

With each new observation … one of three assertions is made: 1) un-associated; 2) near like neighbors; or 3) connections

Asserted connections must favor the false negative

New observations sometimes reverse earlier assertions

Some observations produce novel discovery

As the working space expands, computational effort increases

The emerging picture helps focus collection interests

Given sufficient observations, there can come a tipping point

Thereafter, confidence improves while computational effort decreases!!!!
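
The three assertions can be sketched in a few lines. This is a toy model only: observations are plain sets of feature strings, and an observation is connected to an existing entity only when at least `connect_at` features agree, which is one simple way to favor the false negative. The `Entity` class, the function name, and the threshold are assumptions for illustration, not a description of any actual system.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    features: set = field(default_factory=set)


def assert_observation(obs_features, entities, connect_at=2):
    """Place a new observation and return (assertion, entity).

    Toy rule: connect only when at least `connect_at` features overlap,
    so weak evidence leaves the observation as a separate entity.
    """
    best, best_overlap = None, 0
    for entity in entities:
        overlap = len(obs_features & entity.features)
        if overlap > best_overlap:
            best, best_overlap = entity, overlap

    if best is not None and best_overlap >= connect_at:
        best.features |= obs_features  # context accumulates on the entity
        return "connection", best

    new_entity = Entity(set(obs_features))
    entities.append(new_entity)
    if best_overlap > 0:
        return "near like neighbor", new_entity
    return "un-associated", new_entity
```

This sketch does not reverse earlier assertions when later evidence arrives; a fuller version would re-evaluate existing entities as each new observation lands.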

False Negatives Overstate The Universe

[Chart: x-axis Observations, y-axis Unique Identities, with a level labeled True Population]

Counting Is Difficult

File 1: Mark Smith, 6/12/1978, 443-43-0000

File 2: Mark R Smith, (707) 433-0000, DL: 00001234

The Rise and Fall of a Population

[Chart: x-axis Observations, y-axis Unique Identities, with a level labeled True Population]

Data Triangulation

New Record: Mark Randy Smith, 443-43-0000, DL: 00001234

File 1: Mark Smith, 6/12/1978, 443-43-0000

File 2: Mark R Smith, (707) 433-0000, DL: 00001234
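
The triangulation on this slide can be reproduced with a small counting sketch. It assumes, purely for illustration, that two records belong to the same entity when they share an exact SSN or driver's license number; names are ignored, and the field names `ssn` and `dl` and the function `count_entities` are invented here. A production resolver would weigh far more evidence.

```python
def count_entities(records):
    """Count distinct entities, merging records that share an SSN or DL number."""
    parent = list(range(len(records)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    seen = {}  # (field, value) -> index of the first record carrying it
    for i, record in enumerate(records):
        for key in ("ssn", "dl"):
            value = record.get(key)
            if value is None:
                continue
            if (key, value) in seen:
                parent[find(i)] = find(seen[(key, value)])
            else:
                seen[(key, value)] = i
    return len({find(i) for i in range(len(records))})


file_1 = {"name": "Mark Smith", "dob": "6/12/1978", "ssn": "443-43-0000"}
file_2 = {"name": "Mark R Smith", "phone": "(707) 433-0000", "dl": "00001234"}
new_record = {"name": "Mark Randy Smith", "ssn": "443-43-0000", "dl": "00001234"}

print(count_entities([file_1, file_2]))              # 2
print(count_entities([file_1, file_2, new_record]))  # 1
```

File 1 and File 2 share no identifier, so they count as two people until the new record supplies both the SSN and the license number. The count then falls from two to one, even though the number of observations went up.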

Increasing Accuracy and Performance

[Chart: x-axis Observations, y-axis Unique Identities, with a level labeled True Population]

“Expert Counting” is Fundamental to Prediction

Is it 5 people each with 1 account … or is it 1 person with 5 accounts?

If one cannot count … one cannot estimate vector or velocity (direction and speed).

Without vector and velocity … prediction is nearly impossible.

Therefore, if you can’t count, you can’t predict.

Mass Declassification Predictions

Whose equity is it?

Machine triage – disposition

Queue prioritization

Using What Data Points?

For example:

– 450M target documents

– Dirty words

– Previous declassifications

– Previous declassification denials

– FOIAs

– Intellipedia

– Wikipedia

– WikiLeaks

– Deceased persons

– Publicly available accounts/facts

Open Source Discovery/Scoring

“Height of Pakistan’s Mufasa missile.”

– What is 15.5 meters?

– New York Times, Sept 21, 2010, C3: “Pakistan unveils Mufasa 7 Warhead”

– Wikipedia: Mufasa_7_Warhead

Context Accumulation

FOIA, March 2010

Open Source Reference

Dirty Word

Classified – Asserted

Mufasa 7 Warhead

Context Accumulation + Statistics

Document Element | Total | Declass | Class-Default | Class-Asserted
Author: “Billy K” | 4503 | 1600 | 403 | 0
Codeword: “Tomatoe” | 4818 | 4600 | 218 | 0
Classification: “SI/TK/001” | 23 | 22 | 1 | 0
Actors: “Salam Ahmed” | 782 | 700 | 82 | 0

Declassification dispositions … becoming a force multiplier.

The more human dispositions, the more automated dispositions.

Humans | Auto Triage
5,000 | 20
10,000 | 4,000
100,000 | 65,000
1,000,000 | 17,000,000
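
One way to read the element statistics is as evidence for automated triage: an element that humans have almost always declassified, and that has never been asserted as classified, is a weak signal toward green. The sketch below uses the counts as printed on the slide. The thresholds `min_rate` and `min_total` and both function names are assumptions for illustration.

```python
# Counts as printed on the slide:
# (total, declassified, class-default, class-asserted)
STATS = {
    ("Author", "Billy K"): (4503, 1600, 403, 0),
    ("Codeword", "Tomatoe"): (4818, 4600, 218, 0),
    ("Classification", "SI/TK/001"): (23, 22, 1, 0),
    ("Actors", "Salam Ahmed"): (782, 700, 82, 0),
}


def declass_rate(element):
    """Fraction of past documents carrying this element that were declassified."""
    total, declass, _default, _asserted = STATS[element]
    return declass / total


def auto_green_candidate(elements, min_rate=0.9, min_total=100):
    """True only if every element has enough history, a high declass rate,
    and no human-asserted classifications (hypothetical rule)."""
    if not elements:
        return False
    for element in elements:
        total, declass, _default, asserted = STATS[element]
        if total < min_total or asserted > 0 or declass / total < min_rate:
            return False
    return True
```

Each new human disposition updates these counts, which is how the dispositions become a force multiplier: more elements cross the history and rate thresholds, so more documents can be triaged automatically.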

Policy Questions

What related information is already available in the public domain?

– Evidence: Exists in open source

What damage might conceivably result from disclosure and what benefits might ensue?

– Evidence: Same text already released (by same equity holder)

Strawman Architecture

450M Docs

Historical Dispositions

Dirty Words

Etc.

Feature Extraction & Classification

Context Accumulation

Predictions (*)

Workflow System

Dispositions

(*) Recommendations: Equity of, Disposition, Priority
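
The strawman can be sketched end to end in miniature. Everything below is a stand-in chosen for illustration: feature extraction is a plain dirty-word match, the disposition rule and priority scale are hypothetical, and the equity prediction is left as a stub value. None of it is taken from the slides beyond the box names.

```python
import heapq
from dataclasses import dataclass


@dataclass(order=True)
class Recommendation:
    priority: int      # lower number = review sooner (hypothetical scale)
    doc_id: str
    disposition: str   # "red", "yellow", or "green"
    equity: str        # predicted equity holder (stubbed in this sketch)


def extract_features(text, dirty_words):
    """Feature extraction stand-in: which dirty words appear in the text."""
    return set(text.lower().split()) & dirty_words


def predict(doc_id, text, dirty_words, released_texts, equity="unknown"):
    """Produce the three recommendations: equity, disposition, priority."""
    if extract_features(text, dirty_words):
        return Recommendation(2, doc_id, "red", equity)
    if text in released_texts:  # same text already released
        return Recommendation(0, doc_id, "green", equity)
    return Recommendation(1, doc_id, "yellow", equity)


def build_queue(docs, dirty_words, released_texts):
    """Workflow stand-in: a priority queue of recommendations for reviewers."""
    queue = [predict(doc_id, text, dirty_words, released_texts)
             for doc_id, text in docs.items()]
    heapq.heapify(queue)
    return queue
```

To close the loop, a reviewer's release decision would add that document's text to `released_texts`, so that identical text triages green on the next pass.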

Another Idea: Crowd Sourcing

Can you predict specific people with privileges and knowledge … to whom selected documents can be routed for evaluation?

Can you publish machine-triage recommendations to a wiki or other form of internal broadcast for community crowd sourcing?

Another Idea: Better Classification

Using the overall declassification platform to assist in proper classification (real-time)

And, better pre-tagging to assist in future auto-declassification

Challenges

Entity extraction is imperfect

Predictions may still not be good enough, often enough

Some documents are not in English

The user work surface and its distribution

Consequences of an inappropriate release

With super access and super tools, this may call for stronger audit and insider-threat protections

Your contracting cycle and the creation of the system might take until mid-2011 or 2012 or 2013

Closing Thoughts

Contextualization is essential to better prediction

There are not enough humans to ask every question every day

“Human attention directing” systems are critical to the mission

The data must find the data, the relevance must find the user

Worst Case Scenario

Rich context enables better hints for users, results in faster dispositions

Rich context enables improved sequencing of the work

Related Blog Posts

Smart Sensemaking Systems, First and Foremost, Must be Expert Counting Systems

Data Finds Data

Puzzling: How Observations Are Accumulated Into Context

The Fast Last Puzzle Piece

Algorithms At Dead-End: Cannot Squeeze Knowledge Out Of A Pixel

How to Use a Glue Gun to Catch a Liar

It Turns Out Both Bad Data and a Teaspoon of Dirt May Be Good For You

Smart Systems Flip-Flop

Blogging At:

www.JeffJonas.TypePad.com

Information Management

Privacy

National Security

and Triathlons

Questions?

The Problem at Hand

450M documents × 5 min/document = 2.25B minutes

2.25B minutes / 60 = 37.5M hours

37.5M hours / 2,040 = 18,382 FTEs
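
The arithmetic can be checked directly. The constant names are chosen for this example, and the reading of 2,040 as work hours per FTE per year is an assumption; the slide gives only the divisor.

```python
DOCUMENTS = 450_000_000
MINUTES_PER_DOCUMENT = 5
HOURS_PER_FTE = 2040  # divisor from the slide; assumed to be work hours per FTE per year

total_minutes = DOCUMENTS * MINUTES_PER_DOCUMENT  # 2.25 billion minutes
total_hours = total_minutes / 60                  # 37.5 million hours
ftes = total_hours / HOURS_PER_FTE

print(f"{total_minutes:,} minutes")
print(f"{total_hours:,.0f} hours")
print(f"{ftes:,.0f} FTEs")
```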