Cognitive Support for Intelligent Survivability Management
CSISM TEAM
June 21, 2007
Outline
• Introduction
• Status, results and plans for technical thrusts
  – Multi-layer reasoning for cyber-defense administration
    • Knowledge representation and rules for system wide reasoning (OLC)
    • Fast containment response and policies (ILC)
  – Improving defense parameters and strategies by learning augmentation
  – Implementation and Integration
• Conclusions
CSISM Introduction and Background
Problem Domain: Self-Regenerative Systems
Our Focus: Automated interpretation of observation and response selection.
[Figure: level of service versus time after the start of a focused attack, comparing undefended, survivable (3rd Gen.), and regenerative systems against the level of service w/o attack.]
Graceful degradation: Adaptive response limited to static use of diversity and policy; Event-interpretation and response selection by human experts.
Retain level of service and improve defense: Static and dynamic use of artificial diversity; Use of wide area distribution; Automated interpretation of observation and response selection, augmented by learning from past experience.
• Cyber-Defense
• Survivable systems
• Automated …
• Self-improving …
Cyber-Defense Decision-Making Landscape
[Figure: decision-making landscape with three axes. Level of automation: considerable human involvement, automated expert behavior, mostly autonomic. Scale & complexity of context: single application, DoD relevant information system, virtualization of DoD relevant information system. Generality of scope. Plotted: 3rd generation (DPASA), SRS Phase 1, CSISM.]
Challenges
• Goal: Automate the reasoning performed by expert cyber-defense administrators
  – Effective, reusable, easy to port and retarget
• Challenges:
  – Making sense of low-level information (alerts, observations) to drive low-level defense mechanisms (block, isolate, etc.) such that higher-level objectives (survive, continue to operate) are achieved
  – Doing it as well as human experts
  – Additional difficulties
    • Rapid and real-time decision-making and response
    • Uncertainty due to incomplete and imperfect information
    • Widely varying operating conditions (no alerts to 100s of alerts per second)
    • New symptoms and changes in adversary’s strategy
Approach
– Multi-perspective multi-hypothesis deliberation
  • Keep all options open: delay the bindings
  • Divide and conquer
– Response selection based on current utility as well as potential adversarial counter-response
  • A simple “match” is insufficient against an intelligent adversary
  • Unpredictability to counter gaming
– Contain while deliberating
  • Buy time
– Learning-based dynamic modification of defense parameters and strategies
  • “Immunity” against repeats and variants
[Figure labels: Interpret, Select Response, ILC, OLC, Learning, Multi-Layer reasoning]
Knowledge Representation and Rules for System-wide Reasoning
Objectives
• Represent knowledge of cyber-defense
• Allow reasoning about attack and defense, including look-ahead
• Automate most reasoning
• Encode enough detail to estimate relative goodness of alternatives in most situations
• Extract knowledge from Red Team encounters; attempt to generalize
• Separate generic, reusable, knowledge from system-specific
Achievements
• Classification of knowledge
• Classification of reasoning
• Breadth-first:
  – Relationship between alerts, accusations, corruption, flooding, failures
  – Instantiate for DPASA
• Depth-first:
  – DPASA registration protocol
  – Run 6, Nov 2005 Red Team exercise
• Encode knowledge and reasoning
  – 1st-order logic prototype
  – Soar rules and data
  – Representing concepts, instances and relations: use of a common ontology (Adventium’s Netbase)
Kinds of Knowledge
1. Symptomatic: possible explanations for a given anomalous event
  – Both generic and system-specific
2. Relational: constraints that reinforce or eliminate possible explanations
  – Mostly system-specific
3. Teleological: possible attacker goals and actions that may be used to accomplish the goals
  – Mostly generic
4. Reactive: possible defensive countermeasures for a given attack
  – Both generic and system-specific
Focus so far has been on 1, 2, and 4
Kinds of Reasoning
• Restrictive
  – From observations of past events and knowledge of system properties, deduce good explanations and good defensive responses
  – (the reasoning restricts what is possible)
• Predictive
  – Look ahead, comparing alternatives
Focus so far has been on restrictive reasoning.
Example from Run 11, Nov 2005
[Figure: Server 1 (Linux), Server 2 (Windows), Server 3 (Solaris), and Server 4 (Linux), connected by accusation arrows; one accusation is labeled "violated protocol".]
Reasoning: Under the most likely assumption (no common-mode failure and exploit of at most one OS), Servers 2 and 3 can’t both be lying, so Server 1 must be corrupt. It’s not restartable, so quarantine it. Note that no information source is completely trusted.
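The Run 11 argument can be sketched as a small consistency check. This is an illustration in Python, not the project's Soar or Prover9 encoding. It assumes (the slide text does not list the arrows) that Servers 2 and 3 are the ones accusing Server 1; the names `OS`, `ACCUSATIONS`, `explanations`, and `certainly_corrupt` are hypothetical.

```python
from itertools import combinations

# Hypothetical encoding of the slide's four servers and their operating systems.
OS = {"server1": "linux", "server2": "windows",
      "server3": "solaris", "server4": "linux"}

# Assumed accusation arrows as (accuser, accused); not stated on the slide.
ACCUSATIONS = [("server2", "server1"), ("server3", "server1")]


def consistent(corrupt):
    """An accusation is explained if the accused is corrupt or the accuser is lying."""
    return all(accused in corrupt or accuser in corrupt
               for accuser, accused in ACCUSATIONS)


def explanations(max_os_exploited=1):
    """Enumerate sets of corrupt hosts that explain every accusation
    while exploiting at most `max_os_exploited` distinct operating systems."""
    hosts = sorted(OS)
    found = []
    for size in range(1, len(hosts) + 1):
        for subset in combinations(hosts, size):
            if len({OS[h] for h in subset}) <= max_os_exploited \
                    and consistent(set(subset)):
                found.append(set(subset))
    return found


def certainly_corrupt(max_os_exploited=1):
    """Hosts that are corrupt in every surviving explanation."""
    found = explanations(max_os_exploited)
    return set.intersection(*found) if found else set()
```

Under the one-OS assumption, Server 1 appears in every surviving explanation. Relaxing the assumption to two exploited operating systems admits the explanation that Servers 2 and 3 are both lying, so no host is certainly corrupt, which is why the slide states its assumption explicitly.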
(Simplified) Example from Run 6, Nov 2005
[Figure: Monitors 1 to 4, Client 1, Client 2, and the Client 2 LAN, with "comm" links and accusations labeled "no heartbeats".]
Reasoning: All 4 monitors claim to have received communication from one client but accuse another client of not delivering heartbeats. They can’t all be lying. The communication path for some must be OK, so either Client 2 or its LAN is bad. Ping Client 2 to determine which.
OLC Reasoning Flow
[Figure: two-stage flow, Event Interpretation followed by Response Selection.]
Event Interpretation (input: event reports and observations; output: hypotheses, i.e. potential conditions explaining the observed state)
• Reason about bad behavior: Create initial baseline interpretation of the reported event and observation: one entity in the system accuses another
• Reason about info. flow: Refine the interpretation by considering the potential sources of omission or corruption implied in the accusation.
• Reason about attacker goal: Further refinement: reduce the potential set of failures & corruptions by considering attacker objectives & assumptions
• Reason about the context: Additional refinement: eliminate candidate failures and corruptions by considering current scenario or workflow state
• Intermediate candidate hypotheses pass from step to step, with a conditional jump to response selection
Response Selection (output: response selected for execution)
• Match responses for the candidate hypotheses
• Select responses providing most utility
• Look ahead a fixed number of steps for possible adversary counter-response
• Intermediate candidate responses pass from step to step, with a conditional jump to response engagement
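The flow above can be sketched as a chain of refinement stages followed by a utility-based choice. This is a minimal Python illustration, not the Soar implementation: the stage functions, the event fields (`accuser`, `accused`, `path`, `path_ok`, `recently_verified`), the response table, and all utility numbers are assumptions made for the example, and the look-ahead is reduced to a single step.

```python
def interpret(event, stages):
    """Run refinement stages in order; leave early once one hypothesis remains
    (the conditional jump to response selection)."""
    hypotheses = set()
    for stage in stages:
        hypotheses = stage(event, hypotheses)
        if len(hypotheses) == 1:
            break
    return hypotheses


def bad_behavior(event, _):
    """Baseline: the accused, the accuser, or the path between them may be at fault."""
    return {("corrupt", event["accused"]),
            ("corrupt", event["accuser"]),
            ("bad_path", event["path"])}


def info_flow(event, hyps):
    """Drop path faults when the information flow is known to be intact."""
    if event.get("path_ok"):
        return {h for h in hyps if h[0] != "bad_path"}
    return hyps


def context(event, hyps):
    """Drop hypotheses about entities the current context vouches for."""
    trusted = event.get("recently_verified", set())
    return {h for h in hyps if h[1] not in trusted}


# Hypothetical responses per hypothesis kind, with made-up utilities.
RESPONSES = {"corrupt": [("quarantine", 5), ("restart", 3)],
             "bad_path": [("reroute", 4)]}
# Hypothetical worst-case loss from the adversary's best counter-response.
COUNTER_LOSS = {"quarantine": 1, "restart": 3, "reroute": 2}


def select_response(hyps):
    """Pick the response with the best utility after a one-step look-ahead."""
    candidates = [(action, target, utility - COUNTER_LOSS[action])
                  for kind, target in sorted(hyps)
                  for action, utility in RESPONSES[kind]]
    return max(candidates, key=lambda c: c[2], default=None)
```

For an event in which `server2` accuses `server1` over a path known to be good, and `server2` was recently verified, `interpret(event, [bad_behavior, info_flow, context])` leaves only the hypothesis that `server1` is corrupt, and `select_response` prefers quarantine over restart because restart loses more to the assumed counter-response.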
Rapid Prototyping
• Use automatic theorem prover
  – “prover9”, McCune, UNM
  – 1st order
  – encode restrictive reasoning
• Advantages over Soar:
  – Existing algorithm for deep reasoning
  – Easier to get started
• Disadvantages compared to Soar:
  – Goals are not selected automatically
  – Reasoning algorithm can’t be controlled
  – Non-1st-order reasoning not available
Encoding in Soar
Soar is based on more than 20 years of research into human cognition. It uses pattern-directed inference and hierarchical control to reason in a manner similar to human thinking.
The OLC inference engine will use coherence theory to search for a set of hypotheses that is maximally consistent with the observations and with its experience. We anticipated the need, but our implementation has not yet faced such a situation.
Use of standard ontology and Protégé
Managing the complexity of knowledge acquisition
Use of Herbal to generate Soar rules from higher level representation
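The idea of searching for a maximally consistent set of hypotheses can be illustrated with a brute-force search. This is a stand-in written in Python, not coherence theory as the OLC would implement it in Soar: it simply picks the conflict-free subset that explains the most observations, preferring smaller subsets on ties. The function name `max_consistent` and the example hypothesis names are hypothetical.

```python
from itertools import combinations


def max_consistent(hypotheses, conflicts, explains, observations):
    """Return the conflict-free subset of hypotheses that explains the most
    observations, preferring fewer hypotheses when coverage is equal.

    conflicts: iterable of frozensets, each a pair that cannot hold together.
    explains:  dict mapping each hypothesis to the set of observations it explains.
    """
    best, best_score = frozenset(), (-1, 0)
    ordered = sorted(hypotheses)
    for size in range(len(ordered) + 1):
        for subset in combinations(ordered, size):
            chosen = frozenset(subset)
            if any(pair <= chosen for pair in conflicts):
                continue  # contains two mutually exclusive hypotheses
            covered = set().union(*(explains[h] for h in chosen)) & observations
            score = (len(covered), -len(chosen))
            if score > best_score:
                best, best_score = chosen, score
    return best
```

The search is exponential in the number of hypotheses, so it only suits the handful of candidates left after restrictive reasoning has pruned the space.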
Conclusion and Next Steps
• A good start:
  – Knowledge and reasoning sufficient for defense of DPASA in some Red Team exercises, e.g., run 6
  – Rough estimate of coverage:
    • Existing rules would reason about all alerts and defend successfully in roughly half of Nov 2005 runs in which human operators also defended successfully
    • 2nd half will be harder
• Needed now:
  – Immediately: rules for flooding; redundant groups; phases of mission
  – Soon: attacker objectives in larger-scale attacks
Fast Containment Response and Policies
Inner Loop Controller (ILC) Objectives
Attempt to contain and correct the problem at the earliest stage possible
• Policy Driven: Implement policies and tactics from OLC on a single host.
• Autonomous: high speed response; can work when disconnected from the OLC by an attack or failure
• Flexible: Policies can be updated at any time
• Adaptive: Use learned characteristics of host and monitored services to tune the policy.
• Low impact on mission: able to back out of defensive decisions when warranted
[Figure: ILC architecture. Labels: ILC, Policy DB, Chk Pt DB, HW/OS Watchdog, AppController, AppFactory (instantiate), App1, App2, Outer Loop Control, Remote App, policy layer, sensors, actuators; legend: Control, Data.]
Survey of ILC Work
• Requirements
  – The threat model, Performance, Range of sensing and response, OLC communications
• Design
  – Study typical applications and recovery needs
• Policies
• First Prototype
  – Dynamically configurable rule-based policies
• Plans for Integration and Testing
  – With the testbed emulating the DPASA survivable JBI
  – As a stand-alone program on real host
ILC Prototype-1 Architecture
• Java Driver Program
  – Instantiate reasoning components, start load
• System API
  – OLC Communications
  – Sensing and Response
• Jess Inference Engine
• Policy Modules
  – For each application and services monitored
[Figure: Java Driver above the Jess Rule Engine and the System API (Java+Jess); policy modules A, B, C, D as Jess facts and rules; saved state files.]
Components of ILC Response
[Figure: internal objects used in implementing ILC responses. Labels: Monitored Service S (Status, Settings); Detection Rules for S; Problem Types; Problem Instance P; Problem Types and Response Policies; Evidence E; Detection API; Response API; internal timers.]
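The objects named in the figure can be sketched as plain data types with one detection rule and one response policy. The prototype itself uses Java and Jess; the Python below is only an illustration, and the field names, the heartbeat rule, the `unresponsive` problem type, and the action names are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class MonitoredService:
    """Monitored Service S with its status and settings."""
    name: str
    status: dict
    settings: dict


@dataclass
class Evidence:
    """Evidence E supporting a problem instance."""
    description: str


@dataclass
class ProblemInstance:
    """Problem Instance P of a given problem type on a service."""
    problem_type: str
    service: str
    evidence: list = field(default_factory=list)


def detect_heartbeat_loss(svc):
    """Hypothetical detection rule: flag a service whose heartbeat is overdue."""
    age = svc.status.get("seconds_since_heartbeat", 0)
    if age > svc.settings.get("heartbeat_timeout", 10):
        return ProblemInstance("unresponsive", svc.name,
                               [Evidence(f"no heartbeat for {age}s")])
    return None


# Hypothetical mapping from problem types to ordered response actions.
RESPONSE_POLICIES = {"unresponsive": ["restart_service", "notify_olc"]}


def ilc_step(svc, rules, policies):
    """Apply every detection rule, then look up responses for each problem found."""
    actions = []
    for rule in rules:
        problem = rule(svc)
        if problem is not None:
            actions.extend((action, problem.service)
                           for action in policies.get(problem.problem_type, []))
    return actions
```

Keeping rules and policies as data passed into `ilc_step` mirrors the stated goal that policies can be updated at any time without changing the driver.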
ILC Status – June 2007
• Requirements and design for ILC
• Working Java Driver
  – Initializes Jess inference engine
  – Remote access to ILC for policy manipulation or remote debugging
• Preliminary System API modules for
  – ILC embedded in emulated test environment
  – Standalone ILC for Linux host
  – Initial ties with learning/adaptation module
• Sample policy modules
  – for SELinux, EFWAgent (typical defense mechanisms)
Next Steps
• Integration with emulated test environment
  – Flesh out API, make compatible with ontology
  – Explore interactions with OLC, e.g. strategies involving dynamic ILC policy changes
  – Complete ties to the learning module
• More sample application policies
  – Explore broader range of behaviors, e.g. nondeterminism
• Standalone Testing
  – Install ILC on workstation and/or server and monitor live applications/services
  – Probe ILC response under failures and attacks
Improving Defense Parameters and Strategies
Learning Augmentation: Motivation
• Why learning?
  – Extremely difficult to capture all the complexities of the system, particularly interactions among activities
  – The system is dynamic (static configuration gets out of date)
• CSISM will learn to
  – improve the defensive posture
    • better knowledge (about the attacks or attacker), better policies
  – improve how the system responds to symptoms
    • better connection between response actions and their triggers
Adaptation is the key to survival
Development Plan for Learning in CSISM
1. Responses under normal conditions (Calibration)
2. Situation-dependent responses under attack conditions
3. Multi-stage attacks
Analysis: RegTime by Quad
Quads 0 & 1 are slower than Quads 2 & 3.
Complex domain: human calibration (incorrectly) claimed that Quad 1 was slowest, missing Quad 0
Analysis: Registration Times by Client Type
caf_plan, chem_haz and maf_plan are slower than other clients
Complex domain: human calibration (incorrectly) claimed that caf_plan & maf_plan were slowest because of hand-typed password, missing chem_haz
Step 1: Calibration
• Calibrate the parameters of rules for normal operating conditions
  – Important first step because it learns how to respond to normal conditions
  – Initially, timing parameters from ILC, e.g.
    • Client Registration, PSQ server local probes, SELinux enforcement, SELinux flapping, File integrity checks
• Core challenge:
  – Human: + Good data; - Complex environment; - Dynamic system
  – Offline Training: + Good data; + Complex environment; - Dynamic system
  – Online Training: - Unknown data; + Complex environment; + Dynamic system
  – CSISM’s Experimental Sandbox: + Good data (self-labeled); + Complex environment; + Dynamic system
Very hard for adversary from “training” the learner!!!
Sandbox approach successfully tried in SRS phase 1
Step 1: Calibration
• Using algorithm of Last & Kandel
  – Calculates a membership score for each sample, based on how similar it is to nearby samples (the distance-to-density ratio).
    • If score < threshold, it is an outlier
  – It can make estimates even for multi-modal data.
[Figure: samples (x) plotted against Score, with a horizontal Threshold line.]
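The thresholding idea can be sketched as follows. This does not reproduce the Last & Kandel formula: the score below is a stand-in that compares each sample's mean distance to its k nearest neighbours with the typical spacing in the data and squashes the ratio into the range (0, 1]. The function names, the parameters `k` and `threshold`, and the scale of `beta` are assumptions for the example, so `beta` here is not comparable to the Beta values on the result slides. It needs more than `k` samples.

```python
import math
import statistics


def membership_scores(samples, k=2, beta=0.5):
    """Stand-in membership score: near 1 for samples in dense regions,
    near 0 for samples far from their nearest neighbours."""
    spacings = []
    for i, x in enumerate(samples):
        dists = sorted(abs(x - y) for j, y in enumerate(samples) if j != i)
        spacings.append(sum(dists[:k]) / k)  # mean distance to k nearest samples
    typical = statistics.median(spacings) or 1.0  # guard against zero spacing
    return [2.0 / (1.0 + math.exp(min(beta * s / typical, 700.0)))
            for s in spacings]


def outliers(samples, threshold=0.5, **kwargs):
    """Samples whose score falls below the threshold."""
    scores = membership_scores(samples, **kwargs)
    return [x for x, s in zip(samples, scores) if s < threshold]
```

Because the score depends only on local spacing, two separate clusters of registration times (for example fast and slow client types) both score as normal, while an isolated value far from either cluster is flagged.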
Results for CombOps Registration
If threshold were 0.90, then x-values inside the green box would be OK
[Plots: Beta=0.001, Beta=0.0025, Beta=0.005]
Results for all Registration times
[Plots: Beta=0.0001, Beta=0.0005]
In the demo, you’ll see these two “shoulder” points, indicating upper and lower limits.
As more observations are collected, the estimates become more confident of the range of expected values (i.e. tighter estimates to observations)
Status, Development Plan & Future steps
1. Responses under normal conditions (Calibration)
  a. Analyze DPASA data (done)
  b. Integrate with ILC (single node) (done)
  c. Add experimentation sandbox (single-node)
  d. Calibrate across nodes
2. Situation-dependent responses under attack conditions
3. Multi-stage attacks
Implementation and Integration
Objectives and Assumptions
• Objectives
  – CSISM Components should be reusable and portable
    • Maximize genericity, and clear demarcation between specific and generics
    • Standardized representation, generating CSISM internal representations from higher level specification
  – Evaluation framework should be “system scale”, easy to construct, easy to inject attack effects into, easy to interface with
    • Emulation
• Assumptions
  – Soar can process alerts as fast as they are generated (not to say that the OLC input will not be flooded)
  – The survivable system ensures that alerts make it to the OLC and Learner
  – The survivable system ensures the ILC process runs with higher privilege
  – If the target is not corrupt, OLC’s command will be executed by the survivable system
  – Source IP addresses are not spoofed (can be satisfied by the ADF cards)
• Challenges Addressed
  – Standardized representation of concepts, instances and relationships involved in a survivable system
  – Time handling in reasoning and evaluation
  – Thread handling in the reasoning engine
Integration Framework
Achievement Summary
• OLC
  – Reasoning about accusations, information flow, and some context and protocol specific situations covering all alerts in half of the DPASA attack runs
    • A subset of these is exercisable by the emulated testbed, the rest are tested from Soar (apart from rapid prototyping in Prover9)
• ILC
  – Confirmation that reactive response policies for typical defended applications or defense mechanisms can be built from small, reusable rule-based components
• Learning Augmentation
  – Calibration: set up and initial example (e.g., registration time)
• Validation framework for CSISM capabilities
  – Emulation of a subset of ODV survivable JBI implemented, ongoing
• Integration
  – OLC-system under test
  – Learner-ILC
Next Steps
• Challenges/obstacles?
  – Consistent set of hypotheses
    • Coherence theory
• Plan for next steps in individual tasks
  – Outlined in earlier sections
• Plan for next steps in Integration
  – KR-work fully integrated with the OLC and system under test
  – Fuller emulation
  – ILC-system under test integration
  – ILC-OLC and Learning-OLC integration
  – More attack variations and support for red team access
  – Improved viewport into reasoning and metrics
Conclusion
• Good start, gathered momentum
• Preliminary results are promising
  – OLC coverage
  – ILC feasibility
  – Learning insights
• Cross-project integration potential
  – Looked into SPDR in more detail
    • Reasoning about attack plan recognition and OLC bin 3
    • ILC and DRED
    • Same ontological representation
  – Would like to look into
    • Other projects, for example:
      – VICI defense against rootkit to protect the ILC
    • Other issues (e.g. timeliness)
      – Of defense
      – Interference with the timeliness requirements of the system under test
• Evaluation vehicle
Backup notes
Enforcement Off
no-enforcement.soar
• Current:
  – Interpretation: node reports process-protection off, we note that self accusation
  – Response selection: enforcement-off self accusation causes blocking all ADF NICs on that host
• Next step:
  – Treat the self accusation generically: many alerts will be “self-accusations”, and they will be handled by a single set of rules
  – Response selection will consider other actions like restarting a process, rebooting a host, blocking the NICs or isolating the LAN
Registration
callback.soar, prepare-registartion.soar, reboot.soar, gui-up.soar
• Observation that a client is invited sets up an expectation (that GUI should appear in the future)
• If the GUI does not appear, that triggers some interpretation (see below)
• Current:
  – An intermediate condition with an ordered prescription for remedies
    • Reboot the client: It’s a client issue that rebooting may fix
    • Re-register from another SM: If there is an SM/DC/AP issue this may solve the problem
    • If all quads exhausted, try refreshing the AP refs and reinvite
      – If there is a reason to suspect a quad, try isolating that SM before refresh
• Future:
  – Hypotheses that the client or the inviting SM may be bad, or the path may be bad
  – Restrictive reasoning considering info flow and other incoming events to narrow or eliminate
    • Maximally consistent set of hypotheses
  – Select response based on utilities (and predictive reasoning)
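The current ordered prescription for a missing GUI can be sketched as a simple escalation loop. This is a Python illustration rather than the Soar rules in the files named above; the function name `remediate`, the step labels, and the `gui_appears_after` callback are hypothetical.

```python
def remediate(gui_appears_after, other_quads, suspect_quad=None):
    """Try remedies in order until the GUI appears; return the steps tried.

    gui_appears_after: callback taking a step label and returning True once
                       the client GUI has appeared after that step.
    other_quads:       quads whose SM can be used to re-register the client.
    suspect_quad:      quad whose SM is isolated before the final refresh, if any.
    """
    steps = ["reboot_client"]
    steps += [f"reregister_from_sm_quad{q}" for q in other_quads]
    if suspect_quad is not None:
        steps.append(f"isolate_sm_quad{suspect_quad}")
    steps.append("refresh_ap_refs_and_reinvite")

    tried = []
    for step in steps:
        tried.append(step)
        if gui_appears_after(step):
            break  # expectation met, stop escalating
    return tried
```

A fixed order like this is easy to audit but ignores which hypothesis is most likely, which is the gap the planned hypothesis-based and utility-based selection is meant to close.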