Cognitive Support for Intelligent Survivability Management CSISM TEAM June 21, 2007

Cognitive Support for Intelligent Survivability Management

CSISM TEAM

June 21, 2007

Outline

• Introduction
• Status, results and plans for technical thrusts
  – Multi-layer reasoning for cyber-defense administration
    • Knowledge representation and rules for system-wide reasoning (OLC)
    • Fast containment response and policies (ILC)
  – Improving defense parameters and strategies by learning augmentation
  – Implementation and integration
• Conclusions

CSISM Introduction and Background

Problem Domain: Self-Regenerative Systems

Our focus: automated interpretation of observations and response selection.

[Figure: level of service vs. time, from the start of a focused attack, for undefended, survivable (3rd gen.) and regenerative systems, relative to the level of service without attack.]

Graceful degradation: Adaptive response limited to static use of diversity and policy; event interpretation and response selection by human experts.

Retain level of service and improve defense: Static and dynamic use of artificial diversity; use of wide-area distribution; automated interpretation of observations and response selection, augmented by learning from past experience.

• Cyber-defense
• Survivable systems
• Automated …
• Self-improving …

Cyber-Defense Decision-Making Landscape

[Figure: three-axis chart positioning 3rd generation (DPASA), SRS Phase 1 and CSISM. Axes: level of automation (considerable human involvement; automated expert behavior; mostly autonomic), scale & complexity of context (single application; DoD relevant information system; virtualization of DoD relevant information system), and generality of scope.]

Challenges

• Goal: Automate the reasoning performed by expert cyber-defense administrators
  – Effective, reusable, easy to port and retarget
• Challenges:
  – Making sense of low-level information (alerts, observations) to drive low-level defense mechanisms (block, isolate, etc.) such that higher-level objectives (survive, continue to operate) are achieved
  – Doing it as well as human experts
  – Additional difficulties
    • Rapid, real-time decision-making and response
    • Uncertainty due to incomplete and imperfect information
    • Widely varying operating conditions (no alerts to 100s of alerts per second)
    • New symptoms and changes in the adversary’s strategy

Approach

– Multi-perspective, multi-hypothesis deliberation
  • Keep all options open: delay the bindings
  • Divide and conquer
– Response selection based on current utility as well as potential adversarial counter-response
  • A simple “match” is insufficient against an intelligent adversary
  • Unpredictability to counter gaming
– Contain while deliberating
  • Buy time
– Learning-based dynamic modification of defense parameters and strategies
  • “Immunity” against repeats and variants

[Figure labels: Interpret; Select Response; ILC; OLC; Learning; Multi-Layer reasoning.]

Knowledge Representation and Rules for System-wide Reasoning

Objectives

• Represent knowledge of cyber-defense
• Allow reasoning about attack and defense, including look-ahead
• Automate most reasoning
• Encode enough detail to estimate relative goodness of alternatives in most situations
• Extract knowledge from Red Team encounters; attempt to generalize
• Separate generic, reusable knowledge from system-specific knowledge

Achievements

• Classification of knowledge
• Classification of reasoning
• Breadth-first:
  • Relationship between alerts, accusations, corruption, flooding, failures
  • Instantiate for DPASA
• Depth-first:
  • DPASA registration protocol
  • Run 6, Nov 2005 Red Team exercise
• Encode knowledge and reasoning
  • 1st-order logic prototype
  • Soar rules and data
  • Representing concepts, instances and relations: use of a common ontology (Adventium’s Netbase)

Kinds of Knowledge

1. Symptomatic: possible explanations for a given anomalous event
   – Both generic and system-specific
2. Relational: constraints that reinforce or eliminate possible explanations
   – Mostly system-specific
3. Teleological: possible attacker goals and actions that may be used to accomplish the goals
   – Mostly generic
4. Reactive: possible defensive countermeasures for a given attack
   – Both generic and system-specific

Focus so far has been on 1, 2, and 4.

Kinds of Reasoning

• Restrictive
  – From observations of past events and knowledge of system properties, deduce good explanations and good defensive responses
  – (the reasoning restricts what is possible)
• Predictive
  – Look ahead, comparing alternatives

Focus so far has been on restrictive reasoning.

Example from Run 11, Nov 2005

[Figure: Server 1 (Linux), Server 2 (Windows), Server 3 (Solaris), Server 4 (Linux), with accusation arrows; one is labeled "accusation: violated protocol".]

Reasoning: Under the most likely assumption (no common-mode failure and exploit of at most one OS), Servers 2 and 3 can’t both be lying, so Server 1 must be corrupt. It’s not restartable, so quarantine it. Note that no information source is completely trusted.
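The restrictive reasoning in this example can be sketched as a small consistency check. The Python below is an illustrative sketch only, not the OLC's Soar or Prover9 encoding. It assumes (the slide only implies this) that Servers 2 and 3 are the accusers and Server 1 is the accused, and that an accusation is explained only if the accuser or the accused is corrupt. The names `SERVERS`, `ACCUSATIONS` and `consistent_hypotheses` are hypothetical.

```python
from itertools import combinations

# Hypothetical encoding of the figure: server id -> operating system.
SERVERS = {1: "Linux", 2: "Windows", 3: "Solaris", 4: "Linux"}

# Assumed accusations as (accuser, accused); the slide only implies these.
ACCUSATIONS = [(2, 1), (3, 1)]


def consistent_hypotheses(servers, accusations):
    """Return every set of corrupt servers that explains all accusations.

    Constraint from the slide's "most likely assumption": at most one OS
    is exploited, so all corrupt servers must share an OS.
    An accusation is explained if the accuser is lying (corrupt) or the
    accused really is corrupt.
    """
    ids = sorted(servers)
    result = []
    for size in range(len(ids) + 1):
        for corrupt in combinations(ids, size):
            if len({servers[s] for s in corrupt}) > 1:
                continue  # would require exploits for more than one OS
            if all(a in corrupt or b in corrupt for a, b in accusations):
                result.append(set(corrupt))
    return result


hypotheses = consistent_hypotheses(SERVERS, ACCUSATIONS)
# Servers that are corrupt in every surviving hypothesis.
certainly_corrupt = set.intersection(*hypotheses)
print(hypotheses)
print(certainly_corrupt)
```

Under these assumptions only hypotheses that include Server 1 survive, which matches the slide's conclusion; the constraints restrict what is possible rather than searching for an attack plan.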

(Simplified) Example from Run 6, Nov 2005

[Figure: Monitors 1 to 4; Client 1 and Client 2 (on the Client 2 LAN); "comm" links; accusations labeled "accusation: no heartbeats".]

Reasoning: All 4 monitors claim to have received communication from one client but accuse another client of not delivering heartbeats. They can’t all be lying. The communication path for some must be OK, so either Client 2 or its LAN is bad. Ping Client 2 to determine which.
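A minimal sketch of this narrowing step, in Python rather than the OLC's Soar rules. The function name, its arguments and the returned dictionary are hypothetical. How the ping result is interpreted is also an assumption: a successful ping is taken to mean the LAN path works (so Client 2 is suspected), and a failed ping is taken to implicate the LAN.

```python
def diagnose_missing_heartbeats(accusing_monitors, all_monitors, ping_ok=None):
    """Narrow the suspects when monitors report missing heartbeats.

    ping_ok is None until Client 2 has been pinged.
    """
    if set(accusing_monitors) != set(all_monitors):
        # Assumption: without unanimous accusations the "they can't all
        # be lying" argument does not apply, so nothing is concluded.
        return {"suspects": None, "action": "gather more evidence"}

    if ping_ok is None:
        # All monitors agree and each heard from the other client, so the
        # monitors and their own paths are presumed OK.
        return {"suspects": ["Client 2", "Client 2 LAN"],
                "action": "ping Client 2"}

    # Assumed interpretation of the ping result (see lead-in).
    suspect = "Client 2" if ping_ok else "Client 2 LAN"
    return {"suspects": [suspect], "action": "select response"}


monitors = ["Monitor 1", "Monitor 2", "Monitor 3", "Monitor 4"]
before_ping = diagnose_missing_heartbeats(monitors, monitors)
after_ping = diagnose_missing_heartbeats(monitors, monitors, ping_ok=False)
print(before_ping)
print(after_ping)
```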

OLC Reasoning Flow

Event Interpretation
Input: event reports and observations. Output: hypotheses (potential conditions explaining the observed state).

1. Reason about bad behavior: create an initial baseline interpretation of the reported event and observation (one entity in the system accuses another).
2. Reason about info. flow: refine the interpretation by considering the potential sources of omission or corruption implied in the accusation.
3. Reason about attacker goal: further refinement; reduce the potential set of failures & corruptions by considering attacker objectives & assumptions.
4. Reason about the context: additional refinement; eliminate candidate failures and corruptions by considering the current scenario or workflow state.

Intermediate candidate hypotheses pass from stage to stage, with a conditional jump to response selection.

Response Selection
Input: hypotheses. Output: response selected for execution.

• Match responses for the candidate hypotheses
• Select responses providing most utility
• Look ahead a fixed number of steps for possible adversary counter-response

Intermediate candidate responses pass from stage to stage, with a conditional jump to response engagement.
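The flow above can be sketched as a pipeline of refinement stages followed by utility-based selection. This Python sketch is an illustration under stated assumptions, not the OLC implementation (which is encoded in Soar). All names, the single stand-in refinement stage, the utility numbers and the one-step counter-response cost are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    entity: str     # component under suspicion
    condition: str  # e.g. "corrupt"


def baseline(event):
    """Reason about bad behavior: the accused or the accuser may be bad."""
    return [Hypothesis(event["accused"], "corrupt"),
            Hypothesis(event["accuser"], "corrupt")]


def drop_corroborated_accuser(hyps, event):
    """Stand-in refinement: a corroborated accuser is no longer blamed."""
    if event.get("corroborated_by"):
        return [h for h in hyps if h.entity != event["accuser"]]
    return hyps


def interpret(event, refinements, enough=1):
    """Run refinement stages until few enough hypotheses remain."""
    hyps = baseline(event)
    for refine in refinements:
        if len(hyps) <= enough:
            break  # conditional jump to response selection
        hyps = refine(hyps, event)
    return hyps


def select_response(hyps, candidates, utility, counter_cost):
    """Match responses to hypotheses, then pick the best net utility.

    counter_cost approximates a one-step look-ahead: the assumed loss if
    the adversary answers with its best counter-response.
    """
    matched = []
    for h in hyps:
        for r in candidates.get(h.condition, []):
            if r not in matched:
                matched.append(r)
    if not matched:
        return None
    return max(matched, key=lambda r: utility[r] - counter_cost[r])


event = {"accuser": "Server 2", "accused": "Server 1",
         "corroborated_by": ["Server 3"]}
hyps = interpret(event, [drop_corroborated_accuser])
response = select_response(
    hyps,
    candidates={"corrupt": ["restart", "quarantine"]},
    utility={"restart": 0.9, "quarantine": 0.8},       # made-up values
    counter_cost={"restart": 0.5, "quarantine": 0.1},  # made-up values
)
print(hyps, response)
```

With these made-up numbers a restart has the higher immediate utility, but the look-ahead term makes quarantine the selected response.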

Rapid Prototyping

• Use an automatic theorem prover
  – “Prover9”, McCune, UNM
  – 1st order
  – Encode restrictive reasoning
• Advantages over Soar:
  – Existing algorithm for deep reasoning
  – Easier to get started
• Disadvantages compared to Soar:
  – Goals are not selected automatically
  – Reasoning algorithm can’t be controlled
  – Non-1st-order reasoning not available

Encoding in Soar

Soar is based on more than 20 years of research into human cognition. It uses pattern-directed inference and hierarchical control to reason in a manner similar to human thinking.

The OLC inference engine will use coherence theory to search for a set of hypotheses that is maximally consistent with the observations and with its experience. We anticipated the need, but our implementation has not yet faced such a situation.

Use of standard ontology and Protégé

Managing the complexity of knowledge acquisition

Use of Herbal to generate Soar rules from a higher-level representation

Conclusion and Next Steps

• A good start:
  • Knowledge and reasoning sufficient for defense of DPASA in some Red Team exercises, e.g., Run 6
  • Rough estimate of coverage:
    • Existing rules would reason about all alerts and defend successfully in roughly half of the Nov 2005 runs in which human operators also defended successfully
    • The 2nd half will be harder
• Needed now:
  • Immediately: rules for flooding; redundant groups; phases of mission
  • Soon: attacker objectives in larger-scale attacks

Fast Containment Response and Policies

Inner Loop Controller (ILC) Objectives

Attempt to contain and correct the problem at the earliest stage possible

• Policy driven: Implement policies and tactics from OLC on a single host.
• Autonomous: High-speed response; can work when disconnected from the OLC by an attack or failure.
• Flexible: Policies can be updated at any time.
• Adaptive: Use learned characteristics of host and monitored services to tune the policy.
• Low impact on mission: Able to back out of defensive decisions when warranted.

[Figure: ILC architecture. Components: Policy DB, Chk Pt DB, HW/OS Watchdog, AppController, AppFactory, ILC, App1, App2, Outer Loop Control, Remote App. Labels: policy layer, sensors, actuators, control, data, instantiate.]

Survey of ILC Work

• Requirements
  – The threat model, performance, range of sensing and response, OLC communications
• Design
  – Study typical applications and recovery needs
• Policies
• First prototype
  – Dynamically configurable rule-based policies
• Plans for integration and testing
  – With the testbed emulating the DPASA survivable JBI
  – As a stand-alone program on a real host

ILC Prototype-1 Architecture

• Java Driver Program
  – Instantiate reasoning components, start load
• System API
  – OLC communications
  – Sensing and response
• Jess Inference Engine
• Policy Modules
  – For each application and service monitored

[Figure: Java Driver; Jess Rule Engine; System API (Java+Jess); policy modules A, B, C, D; saved state files; Jess facts and rules.]

Components of ILC Response

Internal objects used in implementing ILC responses:

• Monitored Service S: status, settings
• Detection Rules for S: problem types
• Problem Instance P
• Evidence E
• Problem Types and Response Policies
• Detection API
• Response API
• Internal timers
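How these objects might fit together is sketched below. This is a hypothetical Python illustration, not the prototype's Jess rules or its System API: the class names mirror the slide, while the heartbeat rule, the field names and the response names are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class MonitoredService:          # "Monitored Service S"
    name: str
    status: Dict[str, float]     # latest sensed values
    settings: Dict[str, float]   # limits that policies may tune


@dataclass
class ProblemInstance:           # "Problem Instance P"
    service: str
    problem_type: str
    evidence: List[str] = field(default_factory=list)  # "Evidence E"


def heartbeat_rule(svc: MonitoredService) -> Optional[ProblemInstance]:
    """Hypothetical detection rule: flag a service whose heartbeat is late."""
    age = svc.status["seconds_since_heartbeat"]
    limit = svc.settings["max_heartbeat_gap"]
    if age > limit:
        note = f"no heartbeat for {age}s (limit {limit}s)"
        return ProblemInstance(svc.name, "unresponsive", [note])
    return None


# Hypothetical mapping from problem type to an ordered response policy.
RESPONSE_POLICY = {"unresponsive": ["restart_service", "notify_olc"]}


def run_ilc_cycle(svc, rules, policy):
    """Apply each detection rule, then look up responses for each problem."""
    planned = []
    for rule in rules:
        problem = rule(svc)
        if problem is not None:
            for action in policy.get(problem.problem_type, []):
                planned.append((problem.problem_type, action))
    return planned


svc = MonitoredService("App1",
                       status={"seconds_since_heartbeat": 12.0},
                       settings={"max_heartbeat_gap": 5.0})
planned = run_ilc_cycle(svc, [heartbeat_rule], RESPONSE_POLICY)
print(planned)
```

Keeping detection rules and response policies as separate tables is one way to let policies be updated at any time without touching the detection logic.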

ILC Status – June 2007

• Requirements and design for ILC
• Working Java Driver
  – Initializes Jess inference engine
  – Remote access to ILC for policy manipulation or remote debugging
• Preliminary System API modules for
  – ILC embedded in emulated test environment
  – Standalone ILC for Linux host
  – Initial ties with learning/adaptation module
• Sample policy modules
  – For SELinux, EFWAgent (typical defense mechanisms)

Next Steps

• Integration with emulated test environment
  – Flesh out API, make compatible with ontology
  – Explore interactions with OLC, e.g. strategies involving dynamic ILC policy changes
  – Complete ties to the learning module
• More sample application policies
  – Explore broader range of behaviors, e.g. nondeterminism
• Standalone testing
  – Install ILC on workstation and/or server and monitor live applications/services
  – Probe ILC response under failures and attacks

Improving Defense Parameters and Strategies

Learning Augmentation: Motivation

• Why learning?
  – Extremely difficult to capture all the complexities of the system, particularly interactions among activities
  – The system is dynamic (static configuration gets out of date)
• CSISM will learn to
  – improve the defensive posture
    • better knowledge (about the attacks or attacker), better policies
  – improve how the system responds to symptoms
    • better connection between response actions and their triggers

Adaptation is the key to survival

Development Plan for Learning in CSISM

1. Responses under normal conditions (Calibration)

2. Situation-dependent responses under attack conditions

3. Multi-stage attacks

Analysis: RegTime by Quad

Quads 0 & 1 are slower than Quads 2 & 3.

Complex domain: human calibration (incorrectly) claimed that Quad 1 was slowest, missing Quad 0.

Analysis: Registration Times by Client Type

caf_plan, chem_haz and maf_plan are slower than other clients.

Complex domain: human calibration (incorrectly) claimed that caf_plan & maf_plan were slowest because of hand-typed password, missing chem_haz.

Step 1: Calibration

• Calibrate the parameters of rules for normal operating conditions
  – Important first step because it learns how to respond to normal conditions
  – Initially, timing parameters from ILC, e.g.
    • Client registration, PSQ server local probes, SELinux enforcement, SELinux flapping, file integrity checks
• Core challenge:
  – Human: + good data; - complex environment; - dynamic system
  – Offline training: + good data; + complex environment; - dynamic system
  – Online training: - unknown data; + complex environment; + dynamic system
  – CSISM’s experimental sandbox: + good data (self-labeled); + complex environment; + dynamic system

Very hard for an adversary to “train” the learner.

Sandbox approach successfully tried in SRS Phase 1.

Step 1: Calibration

• Using the algorithm of Last & Kandel
  – Calculates a membership score for each sample, based on how similar it is to nearby samples (the distance-to-density ratio).
    • If score < threshold, it is an outlier
  – It can make estimates even for multi-modal data.

[Figure: samples (x) plotted by score against a threshold line.]
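The thresholding idea can be illustrated with a simplified stand-in. The Python below is not the Last & Kandel formula: it scores each sample by comparing its mean distance to its `k` nearest neighbours with the typical (median) such distance, a rough distance-to-density ratio. The parameters `k`, `sharpness` and `threshold`, the scoring function and the sample values are all assumptions made for this sketch.

```python
import statistics


def membership_scores(samples, k=2, sharpness=1.0):
    """Score each sample in (0, 1]; isolated samples score low."""
    knn_dist = []
    for i, x in enumerate(samples):
        dists = sorted(abs(x - y) for j, y in enumerate(samples) if j != i)
        knn_dist.append(sum(dists[:k]) / k)  # mean distance to k neighbours

    typical = statistics.median(knn_dist)    # typical local spacing
    scores = []
    for d in knn_dist:
        if typical > 0:
            ratio = d / typical
        else:
            ratio = 0.0 if d == 0 else float("inf")
        # Samples no sparser than typical score 1; sparser ones decay.
        scores.append(1.0 / (1.0 + sharpness * max(0.0, ratio - 1.0)))
    return scores


def outliers(samples, threshold=0.5, **kwargs):
    """Return the samples whose membership score is below the threshold."""
    scores = membership_scores(samples, **kwargs)
    return [x for x, s in zip(samples, scores) if s < threshold]


# Made-up timings with two modes (around 1 s and around 5 s) and one stray value.
times = [1.0, 1.1, 1.2, 1.3, 5.0, 5.1, 5.2, 5.3, 9.0]
flagged = outliers(times)
print(flagged)
```

Because the score depends only on local neighbours, both clusters are treated as normal and only the isolated value is flagged, which is the multi-modal behavior the slide highlights.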

Results for CombOps Registration

[Figure panels: Beta=0.001, Beta=0.0025, Beta=0.005.]

If the threshold were 0.90, then x-values inside the green box would be OK.

Results for all Registration Times

[Figure panels: Beta=0.0001, Beta=0.0005.]

In the demo, you’ll see these two “shoulder” points (Beta=0.0005 panel), indicating upper and lower limits.

As more observations are collected, the estimates become more confident of the range of expected values (i.e. tighter estimates to observations).

Status, Development Plan & Future Steps

1. Responses under normal conditions (Calibration)
   a. Analyze DPASA data (done)
   b. Integrate with ILC (single node) (done)
   c. Add experimentation sandbox (single node)
   d. Calibrate across nodes
2. Situation-dependent responses under attack conditions
3. Multi-stage attacks

Implementation and Integration

Objectives and Assumptions

• Objectives
  – CSISM components should be reusable and portable
    • Maximize genericity, with clear demarcation between specific and generic
    • Standardized representation, generating CSISM internal representations from higher-level specification
  – Evaluation framework should be “system scale”, easy to construct, easy to inject attack effects into, easy to interface with
    • Emulation
• Assumptions
  – Soar can process alerts as fast as they are generated (not to say that the OLC input will not be flooded)
  – The survivable system ensures that alerts make it to the OLC and Learner
  – The survivable system ensures the ILC process runs with higher privilege
  – If the target is not corrupt, OLC’s command will be executed by the survivable system
  – Source IP addresses are not spoofed (can be satisfied by the ADF cards)
• Challenges addressed
  – Standardized representation of concepts, instances and relationships involved in a survivable system
  – Time handling in reasoning and evaluation
  – Thread handling in the reasoning engine

Integration Framework

Achievement Summary

• OLC
  – Reasoning about accusations, information flow, and some context- and protocol-specific situations, covering all alerts in half of the DPASA attack runs
    • A subset of these is exercisable by the emulated testbed; the rest are tested from Soar (apart from rapid prototyping in Prover9)
• ILC
  – Confirmation that reactive response policies for typical defended applications or defense mechanisms can be built from small, reusable rule-based components
• Learning augmentation
  – Calibration: setup and initial example (e.g., registration time)
• Validation framework for CSISM capabilities
  – Emulation of a subset of ODV survivable JBI implemented, ongoing
• Integration
  – OLC-system under test
  – Learner-ILC

Next Steps

• Challenges/obstacles?
  – Consistent set of hypotheses
    • Coherence theory
• Plan for next steps in individual tasks
  – Outlined in earlier sections
• Plan for next steps in integration
  – KR work fully integrated with the OLC and system under test
  – Fuller emulation
  – ILC-system under test integration
  – ILC-OLC and Learning-OLC integration
  – More attack variations and support for red team access
  – Improved viewport into reasoning and metrics

Conclusion

• Good start, gathered momentum
• Preliminary results are promising
  – OLC coverage
  – ILC feasibility
  – Learning insights
• Cross-project integration potential
  – Looked into SPDR in more detail
    • Reasoning about attack plan recognition and OLC bin 3
    • ILC and DRED
    • Same ontological representation
  – Would like to look into
    • Other projects, for example:
      – VICI defense against rootkit to protect the ILC
    • Other issues (e.g. timeliness)
      – Of defense
      – Interference with the timeliness requirements of the system under test
• Evaluation vehicle

Backup Notes

Enforcement Off
no-enforcement.soar

• Current:
  – Interpretation: node reports process-protection off; we note that self-accusation
  – Response selection: enforcement-off self-accusation causes blocking all ADF NICs on that host
• Next step:
  – Treat the self-accusation generically: many alerts will be “self-accusations”, and they will be handled by a single set of rules
  – Response selection will consider other actions like restarting a process, rebooting a host, blocking the NICs or isolating the LAN

Registration
callback.soar, prepare-registartion.soar, reboot.soar, gui-up.soar

• Observation that a client is invited sets up an expectation (that the GUI should appear in the future)
• If the GUI does not appear, that triggers some interpretation (see below)
• Current:
  – An intermediate condition with an ordered prescription for remedies
    • Reboot the client: it’s a client issue that rebooting may fix
    • Re-register from another SM: if there is an SM/DC/AP issue this may solve the problem
    • If all quads are exhausted, try refreshing the AP refs and reinvite
      – If there is a reason to suspect a quad, try isolating that SM before the refresh
• Future:
  – Hypotheses that the client or the inviting SM may be bad, or the path may be bad
  – Restrictive reasoning considering info flow and other incoming events to narrow down or eliminate hypotheses
    • Maximally consistent set of hypotheses
  – Select response based on utilities (and predictive reasoning)
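The current ordered prescription of remedies can be sketched as a simple escalation function. This Python is an illustrative sketch, not the logic of the Soar rule files named above. It assumes one SM per quad, and the function name, its arguments and the action strings are hypothetical.

```python
def next_remedy(rebooted, quads_tried, all_quads, suspect_quad=None):
    """Return the next remedy step(s) when an invited client's GUI never appears.

    rebooted:     whether the client has already been rebooted
    quads_tried:  quads whose SM has already been used for registration
    all_quads:    every quad available (assumption: one SM per quad)
    suspect_quad: a quad there is reason to suspect, if any
    """
    if not rebooted:
        return ["reboot client"]  # it may be a client issue

    remaining = [q for q in all_quads if q not in quads_tried]
    if remaining:
        # An SM/DC/AP issue may be avoided by using another SM.
        return [f"re-register from SM in quad {remaining[0]}"]

    # All quads exhausted: optionally isolate a suspect SM, then refresh.
    steps = []
    if suspect_quad is not None:
        steps.append(f"isolate SM in quad {suspect_quad}")
    steps += ["refresh AP refs", "reinvite client"]
    return steps


quads = [0, 1, 2, 3]
print(next_remedy(False, [], quads))
print(next_remedy(True, [0], quads))
print(next_remedy(True, quads, quads, suspect_quad=2))
```

The planned hypothesis-based approach would replace this fixed ordering with responses chosen by utility.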
