SRS Architecture Study

Partha Pal, Franklin Webber


Page 1: SRS Architecture Study

SRS Architecture Study

Partha Pal

Franklin Webber

Page 2: SRS Architecture Study

Outline

• Study goals
• SRS Technologies
• Top down
• Bottom up
• Strawman
• Issues, challenges

[Figure: level of service plotted against time, with a marker at the start of a focused attack. Curves: level of service without attack; undefended; survivable (OASIS Dem/Val); regenerative.]

Self-Regenerative Survivable System:

Self: Organic decision making

Regenerative: better than graceful degradation or simple recovery; it means reversing the trend

Page 3: SRS Architecture Study

Study Plan

Goal: understand how to incorporate the new technologies in a distributed information system that not only tolerates the effects of cyber-attacks, but also attempts to stop and reverse the loss of resources and capabilities

• Bottom up: start with the new (SRS) capabilities, build a partial architectural framework, and then see what other capabilities, mechanisms and services are needed to complete an architecture that
  – offers a high level of resistance to attacks (protection),
  – improves visibility of attacker activity and attack effects (detection), and
  – is able to adapt to changes caused by the attacker (reaction)
  – Note: this is a study in the abstract, leading to an abstract architecture that will need a concrete context to realize
• Top down: start with a high watermark survivability architecture, identify where SRS capabilities could benefit it, and reorganize the architecture to integrate the selected capabilities, mechanisms and services
  – Note: if the high watermark is implemented then it provides a concrete context, but “grandfathering” may impact the choice and integration of new capabilities
• Combine and contrast the abstract architecture with the more concrete case to create a Strawman Self-Regenerative Survivable System Architecture
  – Balances the pros and cons of both approaches

3rd generation assumptions are still valid: absolute prevention, and accurate and on-time detection, are impossible to achieve

Page 4: SRS Architecture Study

Summary of SRS Technology Study

Process

• Sent a questionnaire to each original SRS project (i.e., all except Asbestos)
• General outline:
  – Claims
  – Key Capabilities
  – Benefits and Other Distinguishing Factors
  – Assumptions
  – Use Cases and Interface Issues
• Customized for issues we thought especially important or were confused about
• All responded, some very quickly, some needed gentle prodding. Thank you!

General Observations

• Varying degrees of maturity
  – Some projects started with existing technology
• At least half of the projects offer multiple technologies that could be used independently
• Less overlap than we expected: many technologies seem complementary
• Unsurprisingly, not a lot of support for integration

Page 5: SRS Architecture Study

Biologically-Inspired Diversity Projects

• Genesis
  – A toolkit offering a variety of transformations
  – Based on Strata and is portable
• DAWSON
  – A toolkit offering a variety of transformations
  – Based on Windows DLLs
• Comparisons
  – Some overlap in randomization techniques
  – Genesis also offers highly attack-resistant runtime transformations that incur Strata’s overhead
  – DAWSON also offers Windows-specific transformations
  – May be combined but value and difficulty are unclear

Page 6: SRS Architecture Study

Cognitive Immunity and Self-Healing Projects

• Learning and Repair
  – Daikon: learns program constraints from a set of traces
  – Kvasir: monitors program to create traces for Daikon
  – Archie: checks program constraints at runtime
  – Repair Tool: repairs damage to conform to constraints
  – Tools existed before SRS but are being improved
• RMPL (Concurrent Model-Based Execution)
  – A language expressing temporal properties without fully specifying an order of execution, and probabilistic assumptions and choices
  – An executive that plans, dispatches methods and replans when necessary

Page 7: SRS Architecture Study

Cognitive Immunity and Self-Healing, cont’d

• AWDRAT
  – Language to specify behavior (Architectural Model)
  – Language to describe Method Selection Metadata
  – Tools to instrument Java to monitor and control behavior
  – An executive that
    • Detects anomalies by Architectural Differencing
    • Combines other observations to update a Trust Model
    • Selects methods to maximize utility and/or minimize costs
• Cortex
  – A “taste-tester” framework for redundant components
  – Scyllarus: situation assessment
  – CIRCA: generates controllers from models

Page 8: SRS Architecture Study

Cognitive Immunity and Self-Healing, cont’d

• Comparisons
  – Learning and Repair tools are complementary to others
  – Cortex learning by taste testing is also complementary
  – AWDRAT and RMPL address some of the same issues but:
    • AWDRAT is middleware to defend existing applications; RMPL is a language and environment for building new applications
    • Geared to different application domains:
      – RMPL: embedded/autonomous vehicle systems
      – AWDRAT: information processing systems
  – AWDRAT’s Trust Modeling is complementary to others

Page 9: SRS Architecture Study

Granular Scalable Redundancy Projects

• Steward
  – Scalable support for Byzantine fault-tolerant state-machine replication
    • BFT-like protocol for LANs
    • Paxos-like protocol for WANs
    • Library for threshold crypto
• CMU
  – Byzantine fault-tolerant data storage using scalable asynchronous protocols
    • read/write (R/W)
    • query/update (Q/U)
• QuickSilver
  – Tempest (time-critical; probabilistic; SlingShot protocol)
  – QuickSilver (scale to many groups; virtual synchronous protocol)
  – Cayuga (efficient automata for searching publication histories)
  – ChunkySpread (dynamic IP multicast)

Page 10: SRS Architecture Study

Granular Scalable Redundancy, cont’d

• Comparisons among protocols
  – Significantly different attack (fault) models
  – Significantly different assumptions about applications
  – CMU’s Q/U protocol makes the weakest assumptions about the attacker but has more restrictive application than Steward, SlingShot or QuickSilver

Page 11: SRS Architecture Study

Reasoning about Insider Threat Projects

• PMOP
  – Framework for monitoring operator behavior, recognizing and blocking bad actions
• HDSM (High-Dimensional Search and Modeling)
  – Insider Modeler and Analyzer, currently used offline
  – Search engine for high-dimensional space of sensor data
  – Response Engine
• Asbestos
  – New x86 OS with efficient support for trustworthy isolation in hosts and processes running untrusted code

Page 12: SRS Architecture Study

Reasoning about Insider Threat, cont’d

• Comparisons
  – All are complementary to each other
  – PMOP seems to be AWDRAT’s Architectural Differencing applied to operators rather than components
  – HDSM’s search engine is complementary to other SRS technologies but the Response Engine overlaps in scope with AWDRAT executives

Page 13: SRS Architecture Study

Top Down Approach

What can we learn about the architecture of SRS systems by trying to transform a high watermark survivable system into an SRS system?

• DPASA Architecture applied to the JBI Exemplar used in OASIS Dem/Val, and its limitations and shortcomings, as identified by:
  – Developers’ experiences
  – Testing and validation
  – Out-of-lab deployment
  – Multiple red team exercises
• SRS Technologies, and our understanding of their:
  – Capabilities
  – Assumptions
  – Limitations
  – Maturity

Our study found that there is a sizable intersection that pushes the high watermark more towards an SRS system!

• Much better than finding that the technologies do not address the identified problems, or that even if they do, the “self” and “regenerative” aspects bring no gain

These changes are incremental improvements over the current DPASA architecture. Changing the architecture substantially (e.g., implementing JBI CAPI using QuickSilver) without appropriate forethought is not likely to lead to a more survivable system, because the system will lose the well-tested interaction of existing protection, detection and adaptive response mechanisms

Page 14: SRS Architecture Study

Limitations and Shortcomings of the DPASA Architecture

• Recovery supported only for some key components

• Availability seems to be the most attractive target for the adversary

• Interpretation of observations, deduction and decision making require expertise

• More options for adaptive response are needed

• Lack of support for improving the system on the fly

The last three are more tightly interrelated and more SRS oriented, but SRS technologies may help in all but the last one

Page 15: SRS Architecture Study

Improving Recovery

SM and PSQ are redundant and maintain some replicated state. SRS technologies provide supporting infrastructure.

• State:
  – Partially implemented: some clients and some PSQ (those committed to MySQL)
• Connection:
  – Reasonably handled
• Group view:
  – PSQ:
    • View among servers: handled well
    • View of servers from clients: takes a long time
  – SM:
    • Dependent on Spread: could be broken in a bad way
• Improvement possibilities
  – Need “safe” state transfer or carry over
• Can SRS technologies help?
  – Replace Spread transmitter?
  – Implement the (in-memory) data structures maintained by PSQ servers as Q/U objects using the CMU protocol?
  – Clients and DC: use Asbestos for protecting checkpointed state?

[Diagram labels: full recovery; restart with state loss]

Self: who makes the decision to recover (or not to recover), and when?

Regenerative: recovering to “operational” without any other “changes” is still in the realm of “delaying the eventual degradation”

Page 16: SRS Architecture Study

Some Details

[Diagram: SM components q1sm, q2sm, q3sm and q4sm, each with signature verification (Sig Vrfy), voting, and a transmitter (SXMTR), all connected through the Spread GCS.]

q1sm wants to multicast message M: q1sm signs M and hands it to its XMTR, which returns success only if all XMTRs in the group acknowledge receiving M
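The all-acknowledge rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Xmtr` class, the `multicast` function and the shared `KEY` are hypothetical names, the transmitters run in one process, and an HMAC stands in for whatever signature scheme the real SMs use (the slides do not say).

```python
import hashlib
import hmac

# Hypothetical shared key; the slides do not describe the real signing scheme.
KEY = b"demo-key"


def sign(message: bytes) -> bytes:
    """Stand-in for the SM's signature over message M."""
    return hmac.new(KEY, message, hashlib.sha256).digest()


class Xmtr:
    """Toy transmitter for one quad; `up` models whether it can acknowledge."""

    def __init__(self, name: str, up: bool = True):
        self.name = name
        self.up = up
        self.received = []

    def deliver(self, message: bytes, signature: bytes) -> bool:
        # Acknowledge only if reachable and the signature verifies.
        if not self.up or not hmac.compare_digest(sign(message), signature):
            return False
        self.received.append(message)
        return True


def multicast(group, message: bytes) -> bool:
    """Return success only if every XMTR in the group acknowledges M."""
    signature = sign(message)
    acks = [x.deliver(message, signature) for x in group]
    return all(acks)


group = [Xmtr("q1"), Xmtr("q2"), Xmtr("q3"), Xmtr("q4")]
ok_all_up = multicast(group, b"M")      # every transmitter acknowledges
group[2].up = False
ok_one_down = multicast(group, b"M")    # one silent transmitter fails the call
```

In this toy model a single unresponsive transmitter is enough to make the multicast report failure, which shows why an all-acknowledge rule can become an availability concern.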

[Diagram: the SM, PSQ and DC components of the four quads (q1sm to q4sm, q1psq to q4psq, q1dc to q4dc).]

A combination of managed switches and ADF policies defines who can talk to whom, and over which port and protocol

[Diagram: four transmitters labeled SsXMTR over a Steward or QuickSilver transport.]

It is not clear whether the unavailability observed is purely an implementation problem, but switching over to Steward or QuickSilver transport may still be advantageous:

• Maintaining the state machine replication abstraction is advantageous for state recovery

• Simpler XMTR

• Can handle more quads

The way a client’s PSQ messages are handled by our PSQ servers is similar to using CMU’s Q/U protocols: imagine the subscription info as a Q/U object, replicated at each PSQ server, part of which is maintained in memory. One difference is that instead of the client, one PSQ server acts as its proxy.

[Diagram: PSQ servers q1psq, q2psq, q3psq and q4psq, each with signature verification (Sig Vrfy), voting and sockets. Labels: Q/U objects; Q/U client; client’s PSQ request; Q/U protocol and object synching.]

Using the Q/U object abstraction and associated protocol will help state recovery of a restarted PSQ server, because different clients may have interacted with different quads while the recovering quad was down.

Page 17: SRS Architecture Study

Making Availability Compromises More Difficult

Unavailability triggered by corruption:

• Non-redundant and homogeneous perimeter (PIX FW routers)

• Corrupt references

• Attacks on Java: serialization bombs, garbage collection/lease

• SQL injection

Need privileged access on inside host(s)

From outside

[Diagram: the perimeter between the attacker network and the Wing Ops LAN and AMC CONUS LAN (PIX firewalls, hub, WNIDS, ANIDS, CombOPS, MAF, and the quad 4 components). Annotation: not brute-force DoS!]

Perimeter alternatives shown in the diagram:

• Redundant PIX with failover; monitor all legs

• Diversity (costly)

• Taste tester? (a “T tester” beside the PIX, labeled “spl hw”)

• Dynamic diversity using Genesis? The exploit may not be a memory exploit

Page 18: SRS Architecture Study

Availability, cont’d

Corrupt references:

• Q1SM reports Q1’s IP = 127.0.0.1, Q2’s IP = 127.0.0.1, Q3’s IP = 127.0.0.1, Q4’s IP = 127.0.0.1
• Registering client gets 127.0.0.1 for all quads
• Flaw: Q1SM’s unsolicited statement about the other quads’ IP addresses is believed by everybody

Attacks on Java: serialization bombs, garbage collection/lease mechanisms

• Send a serialized packet with a huge number in the size field. The JVM tries to allocate the amount of memory given in the size field and throws an OutOfMemoryError (OOME)
  – Variations: send a number of such packets to consume available memory
  – Packets may or may not be well formed, but to keep the memory allocated “serialization” must succeed
• Tell an RMI server that a client reference has been released
  – Need to guess the client-side refs

Possible countermeasures:

• Create variants of the JVM or other libraries using Genesis or DAWSON tools
• Enforce a size rule?
  – Use AOP to implement a check before allocation?
  – Use the Daikon toolset to learn the max size of serialized packets, enforce it as an invariant and fix when violated?

Code/implementation problem
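The "size rule" idea can be illustrated with a small reader that checks the claimed size before allocating anything. This is a sketch in Python rather than Java, and it assumes a made-up wire format (a 4-byte big-endian length prefix followed by the payload), not the real Java serialization format. `MAX_PAYLOAD`, `OversizedPacket` and `read_packet` are hypothetical names, and the limit is a placeholder for a value that might be learned from traces.

```python
import io
import struct

# Hypothetical limit; the slides suggest learning it from traces (e.g., with Daikon).
MAX_PAYLOAD = 64 * 1024


class OversizedPacket(Exception):
    """Raised when a packet claims more bytes than the size rule allows."""


def read_packet(stream, max_payload: int = MAX_PAYLOAD) -> bytes:
    """Read one length-prefixed packet, rejecting oversized claims before allocating."""
    header = stream.read(4)
    if len(header) != 4:
        raise EOFError("truncated header")
    (size,) = struct.unpack(">I", header)
    if size > max_payload:
        # The check happens before any buffer of `size` bytes is requested.
        raise OversizedPacket(f"claimed size {size} exceeds limit {max_payload}")
    payload = stream.read(size)
    if len(payload) != size:
        raise EOFError("truncated payload")
    return payload


good = io.BytesIO(struct.pack(">I", 5) + b"hello")
bomb = io.BytesIO(struct.pack(">I", 0xFFFFFFFF))  # huge number in the size field
```

Calling `read_packet(good)` returns the payload, while `read_packet(bomb)` raises `OversizedPacket` without trying to allocate four gigabytes.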

Page 19: SRS Architecture Study

Availability, cont’d

SQL Injection

[Diagram: message path Client → PSQ Proxy → PSQ Server → RDB, with the DC alongside. Labels along the path: encrypted message containing data D; session check; encrypted message containing data D; roles and other checks; query created from D by PSQ.]

DPASA has the notion of escrow

MySQL in all quads, but on different OS

Improving Detection and Response

• Add taste tester
  – Two possibilities: at the PSQ level or at the RDB level

Improving Prevention (& Detection)

• Strictly control what is executed on the RDB
  – Vet D
  – Create a white list
• Use diverse DBs (hoping some will behave differently)
  – Can SRS diversity techniques help?
  – Genesis tainting?
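As a rough illustration of "vet D" and "create a white list", the sketch below accepts only named, pre-approved statements and binds the client data as a parameter instead of splicing it into the SQL text. It uses Python's standard `sqlite3` module as a stand-in for the RDB; the table, the query name and the pattern for valid topics are all assumptions made for the example, not details from DPASA.

```python
import re
import sqlite3

# Hypothetical white list: the only statements the PSQ server may run.
ALLOWED_QUERIES = {
    "by_topic": "SELECT body FROM publications WHERE topic = ?",
}
# Hypothetical shape of valid client data D.
TOPIC_PATTERN = re.compile(r"[A-Za-z0-9_.-]{1,64}")


def vet(d: str) -> str:
    """Vet client data D; reject anything outside the expected shape."""
    if not TOPIC_PATTERN.fullmatch(d):
        raise ValueError("rejected client data")
    return d


def run_query(conn, name: str, d: str):
    """Run a white-listed statement with D bound as a parameter."""
    sql = ALLOWED_QUERIES[name]  # KeyError if the statement is not on the white list
    return conn.execute(sql, (vet(d),)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE publications (topic TEXT, body TEXT)")
conn.execute("INSERT INTO publications VALUES ('weather', 'clear skies')")
rows = run_query(conn, "by_topic", "weather")
```

An input such as `"x' OR '1'='1"` fails the vetting step, and even if it passed it would be treated as a literal topic value rather than as SQL.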

[Diagram: the same Client → PSQ Proxy → PSQ Server → RDB path, with taste testers added at the PSQ (T taster PSQ) and at the RDB (T taster RDB); X marks appear on the diagram.]

Cost…

Applicability, Extendibility …

Page 20: SRS Architecture Study

More Organic Decision Making

At what granularity does the cost outweigh the benefits?

Most DPASA-implemented components have some of these in “code”. Should they be made explicit?

Should we add these as architectural elements at key components: SM, PS, PSQ and LC?

[Diagram: the DPASA deployment. Four quads (q1 to q4), each with sm, ps, cor, psq, dc, NIDS and ap components, serve clients on four LANs: ENV LAN (ENIDS, WxHaz, ChemHaz, EDC, JEES), PLANNING LAN (SCRBT, TAP, SWDIST, AODBSVR, TAPDB, PNIDS, AODB, Target, CAF), Wing Ops LAN (WNIDS, CombOPS) and AMC CONUS LAN (ANIDS, MAF).]

• Q1SM invites Combat Ops, but does not see all heartbeats

• Q2SM sees heartbeats from 4 out of 5 Combat Ops components

• Q3SM shows some missing heartbeats from Combat Ops

• Q4SM: same as Q3SM

• GUI up, but cannot subscribe

• No significant alerts in Emerald

Combat Ops got bad references for Q1, Q3 and Q4?

• Most likely not all at once

Try to push right references

• Try refreshing these first

• If that fails, try refreshing with q3 blocked? (DPASA Operators)

Organic Decision Making: within the system, by the system

Issues to be addressed by the architecture:

• Detection
  – Arch differencing
  – Deviation from spec
• Interpretation
  – Models, JHU A-DAGs
• Deductive analysis, hypothesis testing
  – HDSM? Cortex
• Response selection
  – RMPL? Cortex

Page 21: SRS Architecture Study

More Maneuvering Room for Defense

• Beyond restart process, reboot, and graceful degradation (block or isolate, reduce quorum size, etc.)
  – More spares, distributed widely
    • (Scalable redundancy)
  – Restart a variant
    • (Genesis, DAWSON)
  – Reboot a new system
    • (Asbestos?)
  – Change transport
    • (from QuickSilver to SlingShot, accept the weaker guarantees)

SRS technologies provide the infrastructure or mechanisms, but the management?

• Policies, decision making: when to restart a variant, when to reboot with what restrictions, which transport?
• SRS cognitive capabilities (reasoning about the system) will likely fall short in reasoning about SRS technologies

Carrying over state and keys?

Page 22: SRS Architecture Study

Improving the System on the Fly

• Even if improvement-causing changes are identified along with the right time to apply them, the system must be “architected” to take the changes
  – Authorized vs unauthorized changes
  – Risk of automation: a new attack avenue
  – Different kinds of “change”
    • Code changes
      – Restart: state and key issues
    • Policy or configuration changes
      – IP Tables, ADF, rate limiting, size checking
        » Hooks exist, can be done manually
    • Protocol/transport changes

This is an architecture and implementation issue; the solution will likely be dependent on the technologies being used

Page 23: SRS Architecture Study

A Futuristic DPASA++ System

Taste testers: at key service providers such as PSQ (using existing redundancy) and maybe even at the perimeter router.

[Diagram: the DPASA deployment (quads with sm, ps, cor, psq, dc, NIDS and ap components; clients on the ENV LAN, PLANNING LAN, Wing Ops LAN and AMC CONUS LAN), extended with additional quads; quad labels run up to q7.]

Annotations on the diagram:

• More quads (PSQ/SM: scalable redundancy)

• Enhanced SMs: eliminate advisors, more decision support interfaces

• Emerald, Auto-action, Arch Difference, HD Search

• Diverse variants of JVM and libraries

• OS support of isolation: keys, checkpointed data, etc.

• LCs enhanced with Arch Diff and Cognitive Executive

• Use Genesis, DAWSON, Asbestos, RMPL/AWDRAT technologies

Color code (colors are not preserved in this transcript):

• Removal of existing component/feature

• Enhancement of existing component/feature

• Addition of new component/feature

Page 24: SRS Architecture Study

Bottom Up Approach: Self-Regeneration Feedback Loop

[Diagram: a Controller and an Application in a loop. Labeled flows: service, service specification, service deviation, resource allocation.]

“service” may include the app’s

• functional correctness and/or

• quality of service delivery
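A toy version of this loop may make the roles concrete. The sketch below is an assumption-laden illustration, not the study's design: service is reduced to a single number (requests per second), each resource unit is assumed to serve 10 requests per second, and the controller simply adds one unit whenever the measured service falls below the specification. All function names and constants are hypothetical.

```python
def controller_step(spec: float, measured: float, allocation: int,
                    max_allocation: int = 8) -> int:
    """One pass of the loop: compare service to its specification, adjust resources."""
    deviation = spec - measured
    if deviation > 0 and allocation < max_allocation:
        return allocation + 1  # service below spec: allocate one more unit
    return allocation


def application_service(allocation: int, attack_load: float) -> float:
    """Toy application: each unit serves 10 requests/s; an attack removes capacity."""
    return max(0.0, allocation * 10.0 - attack_load)


def run_loop(spec: float, attack_load: float, steps: int = 10):
    """Run the feedback loop and record the service level at each step."""
    allocation = 3
    history = []
    for _ in range(steps):
        service = application_service(allocation, attack_load)
        history.append(service)
        allocation = controller_step(spec, service, allocation)
    return history


history = run_loop(spec=30.0, attack_load=25.0)
```

In this model the service level starts well below the specification once the attack load is applied and climbs back above it as the controller reallocates, which is the "reversing the trend" behavior the regenerative curve is meant to show.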

Page 25: SRS Architecture Study

Feedback Loop Including Resources

[Diagram: an Application, a Controller and three Resources. Labeled flows: service, service specification, service measurement, service deviation, resource configuration, resource allocation. The Controller contains knowledge, analysis and strategy.]

Page 26: SRS Architecture Study

Using SRS Technologies in Feedback Loop

• Service specification:
  – RMPL, AWDRAT, Daikon
• Service measurement:
  – Archie, RMPL, Architectural Differencing, PMOP
• Resource configuration:
  – Genesis, DAWSON, Repair Tool, Cortex, HDSM
• Resource allocation:
  – RMPL, AWDRAT
• Controller:
  – Knowledge: Cortex
  – Analysis: Trust Modeling, HDSM
  – Strategy: RMPL, AWDRAT

Page 27: SRS Architecture Study

Using SRS Technologies for Distributivity

• Self-Regenerative System will likely distribute
  – Application and/or
  – Resources and/or
  – Controller
• For coordinating distributed redundant application services and resources
  – Steward, Q/U, R/W, QuickSilver (virtual synchrony)
• For coordinating distributed redundant controllers
  – SlingShot (probabilistic time-critical)

Page 28: SRS Architecture Study

Design Choices for Feedback Loops

• Hierarchy
  – Loops may be placed within application components, resources, and/or controllers of larger loops
  – Loops may share resources and/or controllers
  – Controllers often share data:
    • Synthesized from lower layers
    • Inherited from higher layers
  – Trade speed for smarts:
    • small loops are fast and dumb; large loops slow and aware
• Coordination
  – Replicated controllers allow easier analysis of defensive properties
  – Autonomous, decentralized controllers reduce the cost of coordination

Page 29: SRS Architecture Study

Example: Multiple Components, Nested and Distributed Controllers, Shared Resources

[Diagram: two Components, three Controllers and five Resources, arranged to show nested and distributed controllers with shared resources.]

Page 30: SRS Architecture Study

Design Rules-of-Thumb

• Use purely local reaction only when accurate self-accusation is possible
  – “Organic” decision-making
  – Examples: if uncaught exception, restart thread; if seg fault, start new variant

• Controller scope should follow some boundary defined by access controls.
  – Examples: a LAN bounded by firewalls

• For every resource, some controller scope should monitor all its uses.
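The first rule's example, "if uncaught exception, restart thread", can be sketched as a small supervisor. This is a minimal illustration using Python's standard `threading` module; `supervise`, the restart limit and the idea of raising an error when the limit is reached are assumptions for the example, not part of the study.

```python
import threading


def supervise(task, max_restarts: int = 3) -> int:
    """Run `task` in a thread; restart it when it dies with an uncaught exception.

    Returns the number of restarts that were needed.
    """
    restarts = 0
    while True:
        failure = []

        def runner():
            try:
                task()
            except Exception as exc:  # the thread itself observes its own failure
                failure.append(exc)

        worker = threading.Thread(target=runner)
        worker.start()
        worker.join()
        if not failure:
            return restarts  # task finished normally
        if restarts == max_restarts:
            # Local reaction is exhausted; surface the problem to the caller.
            raise RuntimeError("giving up after repeated failures") from failure[0]
        restarts += 1


attempts = []


def flaky():
    """Hypothetical task that fails on its first two runs."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ValueError("simulated fault")


restart_count = supervise(flaky)
```

The reaction is purely local because the failing thread reports its own fault, which is the "accurate self-accusation" condition; raising once the restart limit is hit is one possible way to hand the decision to a controller with a wider scope.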

Page 31: SRS Architecture Study

Natural Architectural Fragments

• Use AWDRAT, RMPL, or Cortex as Controller framework
  – entire system or a significant subsystem and/or
  – one object or process

• Use Genesis or DAWSON to create alternate method implementations used in AWDRAT or RMPL

• Use Asbestos to compartmentalize data for multiple clients in Q/U protocol or multiple groups in QuickSilver protocol

• Construct a Unified Communication Service from multicast protocols
  – Runtime selection of alternate communication protocol with different properties

• Apply Learning and Repair technology to other SRS components
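Runtime selection of a communication protocol by its properties could look roughly like the sketch below. The property labels paraphrase how the slides describe QuickSilver and SlingShot; they are not real API flags, and `Transport`, `TRANSPORTS` and `select_transport` are hypothetical names for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transport:
    name: str
    properties: frozenset


# Property labels paraphrase the slides; they are not real API flags.
TRANSPORTS = [
    Transport("QuickSilver", frozenset({"virtual-synchrony", "many-groups"})),
    Transport("SlingShot", frozenset({"time-critical", "probabilistic"})),
]


def select_transport(required, excluded=()):
    """Pick the first transport that has every required property and is not excluded."""
    for transport in TRANSPORTS:
        if transport.name not in excluded and set(required) <= transport.properties:
            return transport
    raise LookupError("no transport satisfies the request")


primary = select_transport({"virtual-synchrony"})
# Fall back to any other transport, accepting whatever guarantees it offers.
fallback = select_transport(set(), excluded={primary.name})
```

Here the fallback call models the "change transport, accept the weaker guarantees" response: the caller drops its property requirements and excludes the transport that is under attack.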

Page 32: SRS Architecture Study

Conclusion

• Various SRS technologies would have allowed improvement to our DPASA system defenses.

• Taken collectively, SRS technologies address most parts of the problem of self-regenerative control.

• Underlying SRS ideas seem sound but many implementations are immature.

• SRS technologies do not show how to distribute and scale self-regenerative control loops.

Page 33: SRS Architecture Study

Backup

Page 34: SRS Architecture Study

Placeholder for Strawman

• Componentization of defense
  – Protection, detection and adaptation
  – Organic decision making
• Unified Communication Service
• Architecture:
  – Organizing defense-enabled components over the UCS substrate
    • Layered vs monolithic
    • Loose confederation vs logical centralization
    • (DPASA is layered and logically centralized)
  – Deliberative inter-component adaptations
