Learning Rules and Clusters for Network Anomaly Detection
Philip Chan, Matt Mahoney, Muhammad Arshad
Florida Institute of Technology
Outline
• Related work in anomaly detection
• Rule learning algorithm: LERAD
• Cluster learning algorithm: CLAD
• Summary and ongoing work
Related Work in Anomaly Detection
• Host-based
– STIDE (Forrest et al., 96): system calls, instance-based
– (Lane & Brodley, 99): user commands, instance-based
– ADMIT (Sequeira & Zaki, 02): user commands, clustering
• Network-based
– NIDES (SRI, 95): addresses and ports, probabilistic
– SPADE (Silicon Defense, 01): addresses and ports, probabilistic
– ADAM (Barbara et al., 01): hybrid anomaly-misuse detection
LERAD: Learning Rules for Anomaly Detection
(ICDM 03)
Probabilistic Models
• Anomaly detection: P(x | D_NoAttacks)
– Given training data with no attacks, estimate the probability of seeing event x
– Easier if event x was observed during training
  • actually, since x is normal, we aren't interested in its likelihood
– Harder if event x was not observed (zero frequency problem)
  • we are interested in the likelihood of anomalies
Estimating Probability with Zero Frequency
• r = number of unique values in an attribute in the training data
• n = number of instances with the attribute in the training data
• Likelihood of observing a novel value in an attribute is estimated by:
p = r / n
(Witten and Bell, 1991)
Anomaly Score
• Likelihood of a novel event = p
• During detection, if a novel event (unobserved during training) actually occurs:
– anomaly score = 1/p [surprise factor]
Example
• Training Sequence 1 = a, b, c, d, e, b, f, g, c, h
– P(NovelLetter) = 8/10
– Z is observed during detection, anomaly score = 10/8
• Training Sequence 2 = a, a, b, b, b, a, b, b, a, a
– P(NovelLetter) = 2/10
– Z is observed during detection, anomaly score = 10/2
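The two training sequences above can be checked with a short Python sketch. The function names are illustrative and not from the slides, and returning 0 for a value already seen in training is an assumption, since the slides only define the score for novel events.

```python
def novelty_probability(values):
    """Estimate P(novel value) as r / n (Witten and Bell, 1991)."""
    r = len(set(values))  # number of unique values seen in training
    n = len(values)       # number of training instances
    return r / n


def anomaly_score(training_values, observed):
    """Return 1/p for a value never seen in training.

    Assumption: values seen in training score 0.
    """
    if observed in set(training_values):
        return 0.0
    return 1.0 / novelty_probability(training_values)


seq1 = list("abcdebfgch")  # Training Sequence 1: 8 unique letters out of 10
seq2 = list("aabbbabbaa")  # Training Sequence 2: 2 unique letters out of 10
score1 = anomaly_score(seq1, "z")  # 10/8
score2 = anomaly_score(seq2, "z")  # 10/2
```

A novel letter is more surprising after Sequence 2, where new values were rare in training, so it receives the higher score.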
Nonstationary Model
• More likely to see a novel value if novel values were seen recently (e.g., during an attack)
• During detection, record when the last novel value was observed
• t_i = number of seconds since the last novel value in attribute A_i
• Anomaly score for A_i: Score_i = t_i / p_i
• Anomaly score for an instance = Σ_i Score_i
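A minimal sketch of the time-weighted score, assuming timestamps in seconds and a dictionary that tracks when each attribute last showed a novel value. The function name, the argument layout, and the initial values of `last_novel_time` are hypothetical; the slides do not say how the timer is initialized.

```python
def instance_score(novel_attrs, now, last_novel_time, p):
    """Sum t_i / p_i over the attributes that show a novel value.

    novel_attrs: attributes whose value in this instance was unseen in training
    now: timestamp of the instance, in seconds
    last_novel_time: attribute -> time of its last novel value (updated in place)
    p: attribute -> r/n estimated from training
    """
    total = 0.0
    for attr in novel_attrs:
        t = now - last_novel_time[attr]  # seconds since the last novel value
        total += t / p[attr]
        last_novel_time[attr] = now      # restart the timer for this attribute
    return total


p = {"port": 0.01, "flags": 0.5}
last = {"port": 0.0, "flags": 90.0}  # assumed starting times
first = instance_score(["port", "flags"], 100.0, last, p)  # 100/0.01 + 10/0.5
second = instance_score(["port"], 101.0, last, p)          # 1/0.01
```

The second novel port value arrives one second after the first, so it contributes far less. This reflects the nonstationary assumption that novel values tend to arrive in bursts.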
LEarning Rules for Anomaly Detection (LERAD)
• PHAD uses prior probabilities: P(Z)
• ALAD uses conditional probabilities: P(Z|A)
• More accurate to learn probabilities that are conditioned on multiple attributes: P(Z|A,B,C…)
• Combinatorial explosion
• Fast algorithm based on sampling
Rules in LERAD
• A=a, B=b, C=c, ... => Z ∈ {z1, z2, z3, ...}
• If the antecedent is satisfied, the Z attribute has one of the values z1, z2, z3, ...
• Unlike association rules, our rules allow a set of values in the consequent
• Unlike classification rules, our rules don't require a fixed attribute as the consequent
Semantics of a Rule
• A=a, B=b, C=c, ... => Z ∈ {z1, z2, z3, ...}
• If the antecedent is satisfied but the Z attribute matches none of the values in the consequent, the anomaly score is n/r (similar to PHAD/ALAD)
• r = size of the consequent set (# of unique values of Z)
• n = # of tuples that satisfy the antecedent and have the Z attribute (support)
• P(Z ∉ {z1, z2, z3, ...} | A=a, B=b, C=c, ...) = r/n
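One way to represent such a rule in code is sketched below. The `Rule` class and its field names are illustrative assumptions and not taken from LERAD itself; tuples are modeled as plain dictionaries.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    antecedent: dict  # attribute -> required value, e.g. {"A": 1, "B": 2}
    target: str       # consequent attribute Z
    allowed: set      # values of Z observed in training
    n: int            # support: training tuples that satisfy the antecedent

    def matches(self, tup):
        """True if the tuple has Z and satisfies every antecedent condition."""
        return self.target in tup and all(
            tup.get(attr) == value for attr, value in self.antecedent.items()
        )

    def violation_score(self, tup):
        """Return n/r if the rule applies and Z is outside the allowed set."""
        if self.matches(tup) and tup[self.target] not in self.allowed:
            return self.n / len(self.allowed)  # r = size of the allowed set
        return 0.0


rule = Rule({"A": 1, "B": 2}, "C", {2, 3, 4}, n=100)
novel = rule.violation_score({"A": 1, "B": 2, "C": 9})    # violated: 100/3
normal = rule.violation_score({"A": 1, "B": 2, "C": 3})   # consequent matched
skipped = rule.violation_score({"A": 7, "B": 2, "C": 9})  # antecedent not satisfied
```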
Overview of the Algorithm
• Randomly select pairs of tuples (packets, connections, …) from a sample of the training data
• Create candidate rules based on each pair
• Estimate the score of each candidate rule based on a sample of the training data
• Prune the candidate rules
• Update the consequent and calculate the score for each rule using the entire training set
Creating Candidate Rules
• Find the matching attributes; for example, given this randomly selected pair of tuples:
• <A=1,B=2,C=3,D=4> and <A=1,B=2,C=3,D=6>
• Attributes A, B, and C match
• Create these rules:
• A=1, B=2 => C=?
• B=2, C=3 => A=?
• A=1, C=3 => B=?
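The slide's example can be reproduced with the sketch below. It is a simplification that follows the example literally: each matching attribute in turn becomes the consequent and the remaining matching attributes form the antecedent. The function name and the dictionary representation of tuples are assumptions.

```python
def candidate_rules(t1, t2):
    """Build (antecedent, target) candidates from the attributes on which two tuples agree."""
    matching = [a for a in t1 if a in t2 and t1[a] == t2[a]]
    rules = []
    for target in matching:
        # Every other matching attribute goes into the antecedent.
        antecedent = {a: t1[a] for a in matching if a != target}
        rules.append((antecedent, target))
    return rules


pair = ({"A": 1, "B": 2, "C": 3, "D": 4}, {"A": 1, "B": 2, "C": 3, "D": 6})
rules = candidate_rules(*pair)
```

The consequent values are left open at this stage (the `C=?` on the slide); they are filled in when the rule is scored against a sample.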
Estimating Rule Scores
• Randomly pick a sample from the training set to estimate the score (n/r) for each rule
• The consequent of each rule is now estimated
• A=1, B=2 => C ∈ {2, 3, 4}; n/r = 100/3
• B=2, C=3 => A ∈ {1, 5}; n/r = 10/2
• A=1, C=3 => B ∈ {1, 2, 3, ..., 100}; n/r = 200/100
• The larger the score (n/r), the higher the confidence that the rule captures normal behavior
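A sketch of the estimation step, assuming tuples are dictionaries and a candidate rule is an antecedent plus a target attribute. The function name and the tiny sample are made up for illustration.

```python
def estimate_rule(antecedent, target, sample):
    """Fill in the consequent set and compute n/r from a sample of tuples."""
    values = [
        t[target]
        for t in sample
        if target in t and all(t.get(a) == v for a, v in antecedent.items())
    ]
    allowed = set(values)  # consequent: the values of Z seen under this antecedent
    n = len(values)        # support
    r = len(allowed)       # number of unique consequent values
    score = n / r if r else 0.0
    return allowed, n, score


sample = [
    {"A": 1, "B": 2, "C": 2},
    {"A": 1, "B": 2, "C": 3},
    {"A": 1, "B": 2, "C": 3},
    {"A": 1, "B": 2, "C": 4},
    {"A": 1, "B": 7, "C": 9},  # does not satisfy B=2, so it is ignored
    {"A": 1, "B": 2, "C": 2},
]
allowed, n, score = estimate_rule({"A": 1, "B": 2}, "C", sample)
```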
Pruning Candidate Rules
• Goal: reduce the time needed for learning from the entire training set and for detection
• High scoring rules: more confidence for top rules
• Redundancy check: some rules are not necessary
• Coverage check: minimum set of rules that describe the data
Redundancy Check
• Rule 1: A=1, B=2 => C ∈ {3, 4}
• Rule 2: A=1 => C ∈ {3, 4}
• Rule 2 is more general than Rule 1, so Rule 1 is redundant and can be removed
• Rule 3: B=2 => C ∈ {3, 4, 5, 6}
• Rule 2 and Rule 3 don't overlap
• Rule 4: * => C ∈ {3, 4, 5, 6}
• Rule 4 is more general than Rule 3, so remove Rule 3
Coverage Check
• A rule can cover multiple tuples, but a tuple can only be covered by one rule (the highest-scoring rule)
• Rules are checked in descending order of scores
• For each rule in the candidate rule set
– mark tuples that are covered by the rule
• Rules that don't cover any tuples are removed
• Our coverage check includes the redundancy check
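The coverage check can be sketched as a greedy pass over the rules in descending score order. The sketch assumes that a rule "covers" a tuple when the tuple satisfies the antecedent and its consequent value is in the rule's allowed set; the slides do not spell this out, and the dictionary layout and function names are hypothetical.

```python
def satisfies(rule, tup):
    """True if the tuple matches the antecedent and its Z value is allowed."""
    antecedent_ok = all(tup.get(a) == v for a, v in rule["antecedent"].items())
    return antecedent_ok and tup.get(rule["target"]) in rule["allowed"]


def coverage_prune(rules, sample):
    """Keep only rules that cover at least one tuple not yet covered."""
    covered = [False] * len(sample)
    kept = []
    for rule in sorted(rules, key=lambda r: r["score"], reverse=True):
        newly = [i for i, t in enumerate(sample)
                 if not covered[i] and satisfies(rule, t)]
        if newly:  # the rule is the highest-scoring cover for these tuples
            kept.append(rule)
            for i in newly:
                covered[i] = True
    return kept


rule1 = {"antecedent": {"A": 1, "B": 2}, "target": "C", "allowed": {3, 4}, "score": 1.0}
rule2 = {"antecedent": {"A": 1}, "target": "C", "allowed": {3, 4}, "score": 1.5}
sample = [{"A": 1, "B": 2, "C": 3}, {"A": 1, "B": 5, "C": 4}, {"A": 1, "B": 2, "C": 4}]
kept = coverage_prune([rule1, rule2], sample)
```

In this toy sample the more general `rule2` scores higher and covers every tuple first, so the more specific `rule1` covers nothing new and is dropped. This is how the coverage check can subsume the redundancy check.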
Final Training
• The selected rules are trained on the entire training set: consequent and score are updated
• A=1, B=2 => C ∈ {2, 3, 4, 5, 6}; n/r = 100000/5
• B=2, C=3 => A ∈ {1, 5}; n/r = 4000/2
• 90% for training the rules
• 10% for validating the rules: rules that cause false alarms are removed (being conservative; the remaining rules are highly predictive)
Scoring during Detection
• Score for a matched rule that is violated:
S = t * n/r
where t is the time since the rule was last violated (i.e., since the last anomaly with respect to the rule)
• Anomaly score for the tuple = Σ_i S_i
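A sketch of detection-time scoring using the two final rules from the previous slide. The dictionary layout, the `last_violation` field, and its starting value of 0 are assumptions made for illustration.

```python
def tuple_score(rules, tup, now):
    """Sum t * n/r over every rule that the tuple matches but violates."""
    total = 0.0
    for rule in rules:
        target = rule["target"]
        matched = target in tup and all(
            tup.get(a) == v for a, v in rule["antecedent"].items()
        )
        if matched and tup[target] not in rule["allowed"]:
            t = now - rule["last_violation"]  # time since this rule was last violated
            total += t * rule["n"] / len(rule["allowed"])
            rule["last_violation"] = now
    return total


rules = [
    {"antecedent": {"A": 1, "B": 2}, "target": "C",
     "allowed": {2, 3, 4, 5, 6}, "n": 100000, "last_violation": 0.0},
    {"antecedent": {"B": 2, "C": 3}, "target": "A",
     "allowed": {1, 5}, "n": 4000, "last_violation": 0.0},
]
first = tuple_score(rules, {"A": 1, "B": 2, "C": 9}, now=60.0)   # 60 * 100000/5
repeat = tuple_score(rules, {"A": 1, "B": 2, "C": 9}, now=61.0)  # 1 * 100000/5
```

Only the first rule fires here, because the second rule's antecedent (C=3) is not satisfied. The repeated violation one second later scores 60 times lower.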
Attributes Used in LERAD-tcp
• TCP connections are reassembled (similar to ALAD)
• Last 2 bytes of the destination IP address
• 4 bytes of the source IP address
• Source and destination ports
• Duration (from the first packet to the last)
• Length, TCP flags
• First 8 strings in the payload (delimited by space/new line)
Attributes used in LERAD-all
• Attributes used in LERAD-tcp
• UDP and ICMP header fields
Experimental Data and Parameters
• DARPA 99 data set
• Training: Week 3; Testing: Weeks 4 & 5
• Training: 35K tuples (LERAD-tcp); 69K tuples (LERAD-all)
• Testing: 178K tuples (LERAD-tcp); 941K tuples (LERAD-all)
• 1,000 pairs of tuples were sampled to form candidate rules (more didn’t help much)
• 100 tuples were sampled to estimate scores for candidate rules (more didn’t help much)
Experimental Results
• Average of 5 runs
• 10 false alarms per day
• 201 attacks; 74 “hard-to-detect” attacks (Lippmann, 2000)
• LERAD-tcp: ~117 detections (58%); ~45 “hard-to-detect” (60%)
• LERAD-all: ~112 detections (56%); ~41 “hard-to-detect” (55%)
LERAD-all vs. LERAD-tcp
Detections (10 FA/Day)
         LERAD-all only   Both   LERAD-tcp only
PROBE    7                23     0
DoS      10               30     8
R2L      0                30     5
U2R      0                10     10
Total    17               93     23
Experimental Time Statistics
• Preprocessing: ~7.5 minutes (2.9GB, training set), ~20 minutes (4GB, test set)
• LERAD-tcp: ~6 seconds (4MB, training), ~17 seconds (17MB, testing)
• LERAD-all: ~12 seconds (8MB, training), ~95 seconds (91MB, testing)
• 50-75 learned final rules
Results from Mixed Data (RAID 03)
• DARPA 99 data set: attacks are real, background is simulated
• Compared with collected real data
• Artifacts: smaller range of values, little “crud,” values stop growing
• Modified LERAD: 87 detections, 49 (56%) legitimate
• Mixed data: 30 detections, 25 (83%) legitimate
CLAD: Clustering for Anomaly Detection
(In Data Mining against Cyber Threats,
Kumar et al., 03)
Finding Outliers
• Cluster the data points
• Outliers: points in far away and sparse clusters
• Inter-cluster distance: average distance from the rest of the clusters
• Density: number of data points in a fixed-volume cluster
CLAD
• Simple efficient clustering algorithm (large amount of data)
• Clusters with fixed radius
• If a point is within the radius of an existing cluster
– Add the point to the cluster
• Else
– The point becomes the centroid of a new cluster
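The single-pass procedure above can be sketched as follows. The sketch assumes numeric points, Euclidean distance, centroids that stay fixed once created, and assignment to the first cluster found within the radius; the slides do not specify these details, and the function name is illustrative.

```python
import math


def fixed_radius_clusters(points, radius):
    """One pass over the data: join a cluster within `radius`, else start a new one."""
    centroids = []  # the first point of each cluster serves as its centroid
    counts = []     # number of points assigned to each cluster
    for p in points:
        for j, c in enumerate(centroids):
            if math.dist(p, c) <= radius:
                counts[j] += 1
                break
        else:
            # No existing cluster is close enough.
            centroids.append(p)
            counts.append(1)
    return centroids, counts


points = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (0.0, 0.4), (10.2, 10.0)]
centroids, counts = fixed_radius_clusters(points, radius=1.0)
```

Each point is compared against the current centroids only, which keeps the pass cheap on large data but makes the result depend on the order of the points.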
CLAD Issues
• Distance for discrete attributes
– Values that are more frequent are likely to be more normal and are considered “closer”
– Difference in frequency of discrete values
• Power-law distributions: logarithm
• Radius of clusters
– Select a small random sample
– Calculate the distance of all pairs
– Average of the smallest 1%
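The radius heuristic can be sketched as below, assuming numeric points and Euclidean distance. The function name and the `sample_size`, `fraction`, and `seed` parameters are hypothetical.

```python
import itertools
import math
import random


def estimate_radius(points, sample_size, fraction=0.01, seed=0):
    """Average the smallest `fraction` of pairwise distances in a random sample."""
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    dists = sorted(math.dist(a, b) for a, b in itertools.combinations(sample, 2))
    k = max(1, int(len(dists) * fraction))  # keep at least one distance
    return sum(dists[:k]) / k


points = [(float(i), 0.0) for i in range(20)]  # evenly spaced, one unit apart
radius = estimate_radius(points, sample_size=20)
```

With 20 sampled points there are 190 pairs, so the smallest 1% is a single distance: the one-unit gap between neighboring points.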
Sparse and Dense Regions
• Outliers are in distant and sparse regions
• However, an attack might generate many connections and can make its neighborhood not sparse
• (distant and sparse) or (distant and dense)
– Distant: distance > avg(distance) + sd(distance)
– Sparse: density < avg(density) - sd(density)
– Dense: density > avg(density) + sd(density)
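The three thresholds translate directly into code. This sketch assumes one inter-cluster distance and one density (point count) per cluster, and uses the population standard deviation; the slides do not say which variant of the standard deviation is intended, and the function name is illustrative.

```python
from statistics import mean, pstdev


def flag_clusters(distances, densities):
    """Return indices of clusters that are distant and either sparse or dense."""
    distant_above = mean(distances) + pstdev(distances)
    sparse_below = mean(densities) - pstdev(densities)
    dense_above = mean(densities) + pstdev(densities)
    flagged = []
    for i, (dist, dens) in enumerate(zip(distances, densities)):
        distant = dist > distant_above
        if distant and (dens < sparse_below or dens > dense_above):
            flagged.append(i)
    return flagged


distances = [1.0, 1.0, 1.0, 1.0, 10.0]
sparse_case = flag_clusters(distances, [50, 50, 50, 50, 1])  # last cluster: distant and sparse
dense_case = flag_clusters(distances, [5, 5, 5, 5, 500])     # last cluster: distant and dense
```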
Experiments
• Weeks 1, 2, 4, and 5
• No explicit training-testing, looking for outliers
• A model for each port
• Ports with less than 1% traffic are lumped into the “Others” model
• Anomaly scores are normalized in SDs; the “Combined” model simply merges the scores from different models
HTTP (Port 80)
[Figure: Ln(Inter-Distance) (y-axis) vs. Ln(Count) (x-axis); CD > 0.8]
HTTP (Port 80)
[Figure: Ln(Inter-Distance) (y-axis) vs. Ln(Count) (x-axis); CD < 0.2]
SMTP (Port 25)
[Figure: Ln(Inter-Distance) (y-axis) vs. Ln(Count) (x-axis); CD > 0.8]
SMTP (Port 25)
[Figure: Ln(Inter-Distance) (y-axis) vs. Ln(Count) (x-axis); CD < 0.2]
Results
Attack Type   Attacks   Detections (10 FA/Day)
Probe         28        19 (70%)
DOS           45        25 (55%)
R2L           41        15 (37%)
U2R/Data      37        14 (38%)
Total         151       74 (49%)
LERAD vs. CLAD
LERAD                                  CLAD
Assume all training data are normal    Training data can have unlabeled attacks
Off-line algorithm                     On-line algorithm
Concise and comprehensible models      Harder to explain alerts
Efficient detection                    Comparing large # of centroids
Ongoing Work
• On-line, noise-tolerant LERAD
• Applying LERAD to system calls, including arguments
• Tokenizing payload to create features
Data Mining for Computer Security Workshop at ICDM03
Melbourne, FL
Nov 19, 2003
www.cs.fit.edu/~pkc/dmsec03/
http://www.cs.fit.edu/~pkc/id/
Thank you
Questions?