
Page 1: Learning Rules and Clusters for Network Anomaly Detection

Learning Rules and Clusters for Network Anomaly Detection

Philip Chan, Matt Mahoney, Muhammad Arshad

Florida Institute of Technology

Page 2: Learning Rules and Clusters for Network Anomaly Detection

Outline

• Related work in anomaly detection• Rule Learning algorithm: LERAD• Cluster learning algorithm: CLAD• Summary and ongoing work

Page 3

Related Work in Anomaly Detection

• Host-based
  – STIDE (Forrest et al., 96): system calls, instance-based
  – (Lane & Brodley, 99): user commands, instance-based
  – ADMIT (Sequeira & Zaki, 02): user commands, clustering

• Network-based
  – NIDES (SRI, 95): addresses and ports, probabilistic
  – SPADE (Silicon Defense, 01): addresses and ports, probabilistic
  – ADAM (Barbara et al., 01): hybrid anomaly-misuse detection

Page 4

LERAD: Learning Rules for Anomaly Detection

(ICDM 03)

Page 5

Probabilistic Models

• Anomaly detection:
  – P(x | D NoAttacks)
  – Given training data with no attacks, estimate the probability of seeing event x
  – Easier if event x was observed during training
    • actually, since x is normal, we aren’t interested in its likelihood
  – Harder if event x was not observed (zero-frequency problem)
    • we are interested in the likelihood of anomalies

Page 6

Estimating Probability with Zero Frequency

• r = number of unique values in an attribute in the training data
• n = number of instances with the attribute in the training data
• Likelihood of observing a novel value in an attribute is estimated by:

p = r / n

(Witten and Bell, 1991)

Page 7

Anomaly Score

• Likelihood of novel event = p
• During detection, if a novel event (unobserved during training) actually occurs:
  – anomaly score = 1/p [surprise factor]

Page 8

Example

• Training Sequence 1 = a, b, c, d, e, b, f, g, c, h
  – P(NovelLetter) = 8/10
  – Z is observed during detection, anomaly score = 10/8

• Training Sequence 2 = a, a, b, b, b, a, b, b, a, a
  – P(NovelLetter) = 2/10
  – Z is observed during detection, anomaly score = 10/2
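The Witten-Bell estimate and the resulting anomaly score on the two example sequences can be sketched as follows (a toy illustration; the function name is ours, not from the slides):

```python
def novel_value_prob(training):
    """Witten-Bell zero-frequency estimate: p = r / n, where r is the
    number of unique values seen in training and n is the number of
    training instances."""
    r = len(set(training))
    n = len(training)
    return r / n

# Sequence 1: 8 unique letters in 10 instances -> p = 8/10
seq1 = list("abcdebfgch")
p1 = novel_value_prob(seq1)
score1 = 1 / p1   # anomaly score for a novel letter: 10/8 = 1.25

# Sequence 2: 2 unique letters in 10 instances -> p = 2/10
seq2 = list("aabbbabbaa")
p2 = novel_value_prob(seq2)
score2 = 1 / p2   # anomaly score for a novel letter: 10/2 = 5.0
```

As the slide notes, the sequence with few novel values in training (Sequence 2) yields a much higher surprise factor when a novel letter does appear.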

Page 9

Nonstationary Model

• More likely to see a novel value if novel values were seen recently (e.g., during an attack)
• During detection, record when the last novel value was observed
• t_i = number of seconds since the last novel value in attribute A_i
• Anomaly score for A_i: Score_i = t_i / p_i
• Anomaly score for an instance = Σ_i Score_i

Page 10

LEarning Rules for Anomaly Detection (LERAD)

• PHAD uses prior probabilities: P(Z)
• ALAD uses conditional probabilities: P(Z|A)
• More accurate to learn probabilities that are conditioned on multiple attributes: P(Z|A,B,C,…)
• Combinatorial explosion
• Fast algorithm based on sampling

Page 11

Rules in LERAD

• A=a, B=b, C=c, … ⇒ Z ∈ {z1, z2, z3, …}
• If the antecedent is satisfied, the Z attribute has one of the values z1, z2, z3, …
• Unlike association rules, our rules allow a set of values in the consequent
• Unlike classification rules, our rules don’t require a fixed attribute as the consequent

Page 12

Semantics of a Rule

• A=a, B=b, C=c, … ⇒ Z ∈ {z1, z2, z3, …}
• If the antecedent is satisfied but none of the values in the Z attribute is matched, the anomaly score is n/r (similar to PHAD/ALAD)
• r = size of Z (# of unique values in Z)
• n = # of tuples that satisfy the antecedent and have the Z attribute (support)
• P(Z ∉ {z1, z2, z3, …} | A=a, B=b, C=c) ≈ r/n

Page 13

Overview of the Algorithm

• Randomly select pairs of tuples (packets, connections, …) from a sample of the training data
• Create candidate rules based on each pair
• Estimate the score of each candidate rule based on a sample of the training data
• Prune the candidate rules
• Update the consequent and calculate the score for each rule using the entire training set

Page 14

Creating Candidate Rules

• Find the matching attributes; for example, given this randomly selected pair of tuples:
  – <A=1,B=2,C=3,D=4> and <A=1,B=2,C=3,D=6>
• Attributes A, B, and C match
• Create these rules:
  – A=1, B=2 => C=?
  – B=2, C=3 => A=?
  – A=1, C=3 => B=?

Page 15

Estimating Rule Scores

• Randomly pick a sample from the training set to estimate the score (n/r) for each rule
• The consequent of each rule is now estimated:
  – A=1, B=2 ⇒ C ∈ {2, 3, 4}; n/r = 100/3
  – B=2, C=3 ⇒ A ∈ {1, 5}; n/r = 10/2
  – A=1, C=3 ⇒ B ∈ {1, 2, 3, …, 100}; n/r = 200/100
• The larger the score (n/r), the higher the confidence that the rule captures normal behavior

Page 16

Pruning Candidate Rules

• To reduce the amount of time for learning from the entire training set and during detection
• High-scoring rules: more confidence for top rules
• Redundancy check: some rules are not necessary
• Coverage check: minimum set of rules that describe the data

Page 17

Redundancy Check

• Rule 1: A=1, B=2 ⇒ C ∈ {3, 4}
• Rule 2: A=1 ⇒ C ∈ {3, 4}
• Rule 2 is more general than Rule 1, which is redundant and can be removed
• Rule 3: B=2 ⇒ C ∈ {3, 4, 5, 6}
• Rule 2 and Rule 3 don’t overlap
• Rule 4: * ⇒ C ∈ {3, 4, 5, 6}
• Rule 4 is more general than Rule 3, so remove Rule 3

Page 18

Coverage Check

• A rule can cover multiple tuples, but a tuple can only be covered by one rule (the highest-scoring rule)
• Rules are checked in descending order of score
• For each rule in the candidate rule set:
  – mark tuples that are covered by the rule
• Rules that don’t cover any tuples are removed
• Our coverage check includes the redundancy check
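The coverage check can be sketched as follows (a simplified rendering under our own representation: each rule is a dict with a score and a `matches` predicate, which is not how LERAD stores rules):

```python
def coverage_prune(rules, tuples):
    """Visit rules in descending score order; keep a rule only if it
    covers at least one not-yet-covered tuple. A more general,
    higher-scoring rule thereby subsumes redundant ones."""
    covered = set()
    kept = []
    for rule in sorted(rules, key=lambda r: r["score"], reverse=True):
        newly = {i for i, t in enumerate(tuples)
                 if i not in covered and rule["matches"](t)}
        if newly:
            kept.append(rule)
            covered |= newly
    return kept

# Toy example mirroring the redundancy slide: r2 (A=1 => ...) is more
# general than r1 (A=1, B=2 => ...) and here also scores higher
rules = [
    {"name": "r1", "score": 10,
     "matches": lambda t: t["A"] == 1 and t["B"] == 2},
    {"name": "r2", "score": 20,
     "matches": lambda t: t["A"] == 1},
]
tuples = [{"A": 1, "B": 2}, {"A": 1, "B": 5}]
kept = coverage_prune(rules, tuples)
# r2 covers every tuple first, so r1 covers nothing new and is pruned
```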

Page 19

Final Training

• The selected rules are trained on the entire training set: consequent and score are updated
  – A=1, B=2 ⇒ C ∈ {2, 3, 4, 5, 6}; n/r = 100000/5
  – B=2, C=3 ⇒ A ∈ {1, 5}; n/r = 4000/2
• 90% for training the rules
• 10% for validating the rules: rules that cause false alarms are removed (being conservative--the remaining rules are highly predictive)

Page 20

Scoring during Detection

• Score for a matched rule that is violated:

S = t * n/r

where t is the duration since the last time the rule was violated (an anomaly occurred w.r.t. the rule)

• Anomaly score for the tuple = Σ_i S_i
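The detection-time scoring above can be sketched as follows (a minimal illustration under our own rule representation; field names like `last_violation` are ours):

```python
def tuple_score(tup, rules, now):
    """Anomaly score of a tuple: sum of S = t * n / r over the rules
    whose antecedent matches but whose consequent set is violated,
    where t is seconds since that rule was last violated."""
    total = 0.0
    for rule in rules:
        if rule["matches"](tup) and tup[rule["consequent"]] not in rule["values"]:
            t = now - rule["last_violation"]
            total += t * rule["n"] / rule["r"]
            rule["last_violation"] = now  # nonstationary update
    return total

# One rule from the earlier slides: A=1, B=2 => C in {2,3,4}, n/r = 100/3
rule = {"matches": lambda t: t["A"] == 1 and t["B"] == 2,
        "consequent": "C", "values": {2, 3, 4},
        "n": 100, "r": 3, "last_violation": 0.0}
s = tuple_score({"A": 1, "B": 2, "C": 9}, [rule], now=60.0)
# violated 60 s after the previous violation: S = 60 * 100 / 3
```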

Page 21

Attributes Used in LERAD-tcp

• TCP connections are reassembled (similar to ALAD)
• Last 2 bytes of the destination IP address
• 4 bytes of the source IP address
• Source and destination ports
• Duration (from the first packet to the last)
• Length, TCP flags
• First 8 strings in the payload (delimited by space/new line)

Page 22

Attributes Used in LERAD-all

• Attributes used in LERAD-tcp
• UDP and ICMP header fields

Page 23

Experimental Data and Parameters

• DARPA 99 data set
• Training: Week 3; Testing: Weeks 4 & 5
• Training: 35K tuples (LERAD-tcp); 69K tuples (LERAD-all)
• Testing: 178K tuples (LERAD-tcp); 941K tuples (LERAD-all)
• 1,000 pairs of tuples were sampled to form candidate rules (more didn’t help much)
• 100 tuples were sampled to estimate scores for candidate rules (more didn’t help much)

Page 24

Experimental Results

• Average of 5 runs
• 10 false alarms per day
• 201 attacks; 74 “hard-to-detect” attacks (Lippmann, 2000)
• LERAD-tcp: ~117 detections (58%); ~45 “hard-to-detect” (60%)
• LERAD-all: ~112 detections (56%); ~41 “hard-to-detect” (55%)

Page 25

LERAD-all vs. LERAD-tcp

Detections (10 FA/Day):

          LERAD-all only   Both   LERAD-tcp only
  PROBE          7          23          0
  DoS           10          30          8
  R2L            0          30          5
  U2R            0          10         10
  Total         17          93         23

Page 26

Experimental Time Statistics

• Preprocessing: ~7.5 minutes (2.9GB, training set), ~20 minutes (4GB, test set)
• LERAD-tcp: ~6 seconds (4MB, training), ~17 seconds (17MB, testing)
• LERAD-all: ~12 seconds (8MB, training), ~95 seconds (91MB, testing)
• 50-75 learned final rules

Page 27

Results from Mixed Data (RAID 03)

• DARPA 99 data set: attacks are real, background is simulated
• Compared with collected real data
• Artifacts: smaller range of values, little “crud,” values stop growing
• Modified LERAD: 87 detections, 49 (56%) legitimate
• Mixed data: 30 detections, 25 (83%) legitimate

Page 28

CLAD: Clustering for Anomaly Detection

(In Data Mining against Cyber Threats, Kumar et al., 03)

Page 29

Finding Outliers

• Cluster the data points
• Outliers: points in distant and sparse clusters
• Inter-cluster distance: average distance from the rest of the clusters
• Density: number of data points in a fixed-volume cluster

Page 30

CLAD

• Simple, efficient clustering algorithm (large amount of data)
• Clusters with fixed radius
• If a point is within the radius of an existing cluster:
  – Add the point to the cluster
• Else:
  – The point becomes the centroid of a new cluster

Page 31

CLAD Issues

• Distance for discrete attributes
  – Values that are more frequent are likely to be more normal and are considered “closer”
  – Difference in frequency of discrete values
• Power-law distributions: logarithm
• Radius of clusters
  – Select a small random sample
  – Calculate the distance of all pairs
  – Average of the smallest 1%
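The radius heuristic above (sample, all pairwise distances, average of the smallest 1%) can be sketched like this (our own rendering; the sample size and names are illustrative assumptions, not values from the slides):

```python
import itertools
import random

def estimate_radius(points, distance, sample_size=50, frac=0.01, seed=0):
    """Estimate the fixed cluster radius: draw a small random sample,
    compute all pairwise distances, and average the smallest `frac`
    (1% by default) of them."""
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    dists = sorted(distance(a, b)
                   for a, b in itertools.combinations(sample, 2))
    k = max(1, int(len(dists) * frac))
    return sum(dists[:k]) / k

# Toy 1-D data: integer-spaced points, so every pairwise distance >= 1
pts = [float(i) for i in range(200)]
r = estimate_radius(pts, lambda a, b: abs(a - b))
```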

Page 32

Sparse and Dense Regions

• Outliers are in distant and sparse regions
• However, an attack might generate many connections and can make its neighborhood not sparse
• Flag clusters that are (distant and sparse) or (distant and dense)
  – Distant: distance > avg(distance) + sd(distance)
  – Sparse: density < avg(density) – sd(density)
  – Dense: density > avg(density) + sd(density)
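The flagging criterion above can be sketched directly from the thresholds on this slide (a minimal illustration; the dict keys `inter_dist` and `density` are our names):

```python
from statistics import mean, stdev

def flag_outliers(clusters):
    """Flag clusters that are (distant and sparse) or (distant and
    dense), using avg +/- one standard deviation as the thresholds."""
    dists = [c["inter_dist"] for c in clusters]
    dens = [c["density"] for c in clusters]
    d_hi = mean(dists) + stdev(dists)
    s_lo = mean(dens) - stdev(dens)
    s_hi = mean(dens) + stdev(dens)
    return [c for c in clusters
            if c["inter_dist"] > d_hi
            and (c["density"] < s_lo or c["density"] > s_hi)]

# Toy example: three ordinary clusters and one distant, sparse one
clusters = [
    {"inter_dist": 1.0, "density": 10},
    {"inter_dist": 1.1, "density": 11},
    {"inter_dist": 1.0, "density": 10},
    {"inter_dist": 5.0, "density": 1},
]
flagged = flag_outliers(clusters)
```

The "distant and dense" branch is what catches attacks that flood their own neighborhood with similar connections.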

Page 33

Experiments

• Weeks 1, 2, 4, and 5
• No explicit training-testing split; looking for outliers
• A model for each port
• Ports with less than 1% of the traffic are lumped into the “Others” model
• Anomaly scores are normalized in SDs; the “Combined” model simply merges the scores from different models

Page 34

HTTP (Port 80)

[Two scatter plots of Ln(Inter-Cluster Distance) vs. Ln(Count) for HTTP (Port 80): one highlighting clusters with CD > 0.8, the other clusters with CD < 0.2]

Page 35

SMTP (Port 25)

[Two scatter plots of Ln(Inter-Cluster Distance) vs. Ln(Count) for SMTP (Port 25): one highlighting clusters with CD > 0.8, the other clusters with CD < 0.2]

Page 36

Results

  Attack Type   Attacks   Detections (10 FA/Day)
  Probe            28        19 (70%)
  DOS              45        25 (55%)
  R2L              41        15 (37%)
  U2R/Data         37        14 (38%)
  Total           151        74 (49%)

Page 37

LERAD vs. CLAD

  LERAD                                 | CLAD
  Assumes all training data are normal  | Training data can have unlabeled attacks
  Off-line algorithm                    | On-line algorithm
  Concise and comprehensible models     | Harder to explain alerts
  Efficient detection                   | Comparing a large # of centroids

Page 38

Ongoing Work

• On-line, noise-tolerant LERAD
• Applying LERAD to system calls, including arguments
• Tokenizing payload to create features

Page 39

Data Mining for Computer Security Workshop at ICDM03

Melbourne, FL, Nov 19, 2003

www.cs.fit.edu/~pkc/dmsec03/

Page 40

http://www.cs.fit.edu/~pkc/id/

Thank you

Questions?