Data Mining Approach for Network Intrusion Detection
Zhen Zhang
Advisor: Dr. Chung-E Wang
04/24/2002
Department of Computer Science
California State University, Sacramento
Outline
Background
– Intrusion Detection: promises and challenges
– Data Mining in IDS: how can it help
Motivation
Approaches, tasks, problems and my contributions
Results
Conclusion and future work
Intrusion Detection - Building a Secure Network
Primary assumptions
– System activities are observable
– Normal and intrusive activities have distinct evidence
Main techniques
– Misuse detection: patterns of well-known attacks
– Anomaly detection: deviation from normal usage
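The two techniques can be contrasted in a minimal sketch. Everything here is an assumption for illustration and not taken from the slides: the pattern list, the function names, and the use of a z-score threshold as the measure of "deviation from normal usage".

```python
from statistics import mean, stdev

# Hypothetical signatures: substrings assumed to mark well-known attacks.
KNOWN_ATTACK_PATTERNS = ["/etc/passwd", "cmd.exe"]


def misuse_detect(event: str) -> bool:
    """Misuse detection: flag an event that matches a known attack pattern."""
    return any(pattern in event for pattern in KNOWN_ATTACK_PATTERNS)


def anomaly_detect(baseline, observed, threshold=3.0) -> bool:
    """Anomaly detection: flag a measurement far from the normal baseline.

    `baseline` holds measurements of normal usage (for example connections
    per second); the z-score threshold is an assumed choice.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return observed != mu
    return abs(observed - mu) / sigma > threshold
```

Misuse detection only fires on patterns it already knows, while the anomaly check needs no signature but depends on how well the baseline represents normal usage.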
Data Mining in IDS
Shortfalls with current IDS (mostly misuse detection)
– Variants: Intrusions change easily and frequently.
– False positives: Alarms raised on legitimate activity make real intrusions difficult to pick out.
– False negatives: Attacks for which there are no known signatures go undetected.
– Data overload: Amount of data grows rapidly.
What is Data Mining
Data Mining: Take data and pull from it patterns or deviations.
Many different types of algorithms: Decision Tree, Link analysis, Clustering, Association, Rule abduction, Deviation Analysis, and Sequence analysis.
Software and Tools:
– MS SQL Server 2000
– Ripper and many others
How can Data Mining help
Variants
– Use anomaly detection, so variants in an exploit code are no great concern.
False positives
– Identify recurring sequences of alarms in order to help identify valid network activity.
False negatives
– Attacks for which signatures have not been developed might be detected.
Data overload
– Data mining plays a vital role.
Summary of my work
Identify objective
– Distinguish network attacks from normal traffic
– New area, several research projects, no commercial products
– Focus on the principle and basic implementation of concepts
Data Collection
Data Pre-processing on tcpdump dataset
Apply data mining on processed data
Investigate results
Software packages used: Visual Basic, Microsoft SQL Server 2000 with Analysis Server, Tcpdump
Data Collection
Tcpdump data (http://iris.cs.uml.edu:8080/)
– Tcpdump was executed on the gateway to capture the traffic between the LAN and external networks, plus broadcast packets within the LAN
– Only headers, no user data
– Filters were used to keep only TCP and UDP packets
– Baseline and 4 simulated attacks
Tcpdump data format
TCP packet
– Time stamp
– Source IP address
– Source port
– Destination IP address
– Destination port
– Flags (SYN, FIN, PUSH, RST, or .)
– Data sequence number of this packet
– Data sequence number of the data expected in return
– Number of bytes of receive buffer space available
– Indication of whether or not the data is urgent
Tcpdump data format
UDP packet
– Time stamp
– Source IP address
– Source port
– Destination IP address
– Destination port
– Length of the packet
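One way to hold the fields listed above is a single record type shared by both protocols. This is only a sketch: the class name, field names, and types are assumptions chosen for illustration, not the layout used in the original work.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class PacketRecord:
    """One tcpdump header line; field names are hypothetical."""
    timestamp: float
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str                     # "tcp" or "udp"
    # TCP-only fields
    flags: Tuple[str, ...] = ()       # e.g. ("SYN",) or ("FIN", "ACK"); "." means none
    seq: Optional[int] = None         # data sequence number of this packet
    ack: Optional[int] = None         # sequence number expected in return
    window: Optional[int] = None      # bytes of receive buffer space available
    urgent: bool = False              # whether the data is marked urgent
    # UDP-only field
    length: Optional[int] = None      # length of the packet

    def endpoints(self):
        """Return ((src_ip, src_port), (dst_ip, dst_port))."""
        return (self.src_ip, self.src_port), (self.dst_ip, self.dst_port)
```

Keeping TCP and UDP in one type makes the later grouping step simpler, at the cost of some fields being unused for each protocol.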
Example data
Example tcpdump data
Data Pre-processing - 80% ~ 90% of the work
Packet level information to connection level
– Group by same source/destination IP/Port
– Use flags, acks to determine status of the connection
» SF, REJ, S0, S1, S2, S3, S4, RSTOSn, RSTRSn, SS, SH, SHR, OOS1, OOS2
– Record start time, duration, protocol
– Calculate bytes in, bytes out, resent rate
– UDP is connectionless, so simply treat each packet as a connection
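The packet-to-connection step above can be sketched as follows. This is a simplified illustration under stated assumptions: packets are plain dicts with hypothetical keys (`time`, `src_ip`, `src_port`, `dst_ip`, `dst_port`, `protocol`, `flags`, `length`), only three of the listed states (SF, REJ, S0) are recognized and everything else is lumped into a placeholder "OTH", and the resent rate is left out.

```python
from collections import defaultdict


def endpoint(pkt, side):
    """Return the (ip, port) pair for side "src" or "dst"."""
    return (pkt[side + "_ip"], pkt[side + "_port"])


def classify_state(pkts, originator):
    """Simplified TCP connection state from the flags seen in each direction."""
    orig_flags, resp_flags, replied = set(), set(), False
    for p in pkts:
        if endpoint(p, "src") == originator:
            orig_flags.update(p["flags"])
        else:
            replied = True
            resp_flags.update(p["flags"])
    if "SYN" not in orig_flags:
        return "OTH"
    if not replied:
        return "S0"    # SYN sent, never answered
    if "RST" in resp_flags and "SYN" not in resp_flags:
        return "REJ"   # SYN met by RST
    if "SYN" in resp_flags and "FIN" in (orig_flags | resp_flags):
        return "SF"    # established and closed normally
    return "OTH"


def summarize_connections(packets):
    """Group packets into connection records, ordered by start time."""
    groups = defaultdict(list)
    for i, p in enumerate(packets):
        if p["protocol"] == "udp":
            key = ("udp", i)  # each UDP packet is its own connection
        else:
            # Both directions of a TCP flow share one key.
            key = tuple(sorted([endpoint(p, "src"), endpoint(p, "dst")]))
        groups[key].append(p)

    connections = []
    for pkts in groups.values():
        pkts.sort(key=lambda p: p["time"])
        first = pkts[0]
        originator = endpoint(first, "src")
        bytes_out = sum(p["length"] for p in pkts if endpoint(p, "src") == originator)
        bytes_in = sum(p["length"] for p in pkts) - bytes_out
        is_tcp = first["protocol"] == "tcp"
        connections.append({
            "start_time": first["time"],
            "duration": pkts[-1]["time"] - first["time"],
            "protocol": first["protocol"],
            "dst_ip": first["dst_ip"],
            "service": first["dst_port"],
            "bytes_out": bytes_out,
            "bytes_in": bytes_in,
            "state": classify_state(pkts, originator) if is_tcp else None,
        })
    return sorted(connections, key=lambda c: c["start_time"])
```

The first packet of each group is assumed to come from the originator, which only holds if the capture contains the start of the connection.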
First round of processing
Intrinsic Features
Establish more information
Same Destination Temporal and Statistical Attributes (last 2 seconds)
– Count_per_dest: # of connections to this destination IP
– REJ_count_per_dest: # of connections that get the flag "REJ"
– S01_count_per_dest: # of connections that send a SYN packet but never get the ACK packet (S0), or receive an ACK on SYN that they never have sent (S1)
– Diff_Services_per_dest: # of unique services
– Diff_Service_Rate: Diff_Services / Count
Establish more information
Same Service Temporal and Statistical Attributes (last 2 seconds)
– Count_per_service: # of connections to this type of service
– REJ_count_per_service: # of connections that get the flag "REJ" (SYN met by RST)
– S01_count_per_service: # of connections that send a SYN packet but never get the ACK packet (S0), or receive an ACK on SYN that they never have sent (S1)
– Diff_Hosts_per_service: # of unique destination hosts
– Diff_Hosts_Rate: Diff_Hosts / Count
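The same-destination and same-service attributes can both be computed by one sliding-window pass. The sketch below is an assumption-laden illustration: connection records are dicts with hypothetical keys `start_time`, `dst_ip`, `service`, and `state`, the list is assumed to be sorted by start time, and the current connection is counted inside its own window so that the rate never divides by zero.

```python
def window_features(connections, key, other, window=2.0):
    """Per-connection statistics over the preceding `window` seconds.

    key   -- field that must match the current connection
             ("dst_ip" for same-destination, "service" for same-service)
    other -- field whose distinct values are counted
             ("service" for Diff_Services, "dst_ip" for Diff_Hosts)
    `connections` must already be sorted by "start_time".
    """
    rows = []
    start = 0
    for i, cur in enumerate(connections):
        # Advance the left edge until it is inside the time window.
        while connections[start]["start_time"] < cur["start_time"] - window:
            start += 1
        recent = [c for c in connections[start:i + 1] if c[key] == cur[key]]
        count = len(recent)  # includes the current connection (assumption)
        diff = len({c[other] for c in recent})
        rows.append({
            "count": count,
            "rej_count": sum(c["state"] == "REJ" for c in recent),
            "s01_count": sum(c["state"] in ("S0", "S1") for c in recent),
            "diff": diff,
            "diff_rate": diff / count,
        })
    return rows
```

Calling `window_features(conns, key="dst_ip", other="service")` gives the same-destination attributes, and swapping the two arguments gives the same-service ones. The `window` parameter makes it easy to try values other than 2 seconds.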
Second round of processing
Same Destination Temporal and Statistical Attributes
Final round of processing
Final, but important
– Reduce data amount
– Remove noise or trivial information
– Reorganize data, add new features if necessary
Challenges
– Hard to tell which data to reduce/remove
– Requires tremendous domain knowledge
– Needs experiments and adjustments
Data Mining
Decision Tree Algorithm
Microsoft SQL Server 2000 Analysis Server
Steps:
– 80% of baseline (normal) dataset as training data
– Use the remaining 20% as validation data, compute misclassification.
– 20% of each of the four intrusion datasets as prediction data, compute misclassification.
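The evaluation steps above can be sketched independently of the tool used. The slides train a decision tree in SQL Server 2000 Analysis Server; the sketch below swaps in a much simpler stand-in model (a majority-vote lookup on one feature) purely to show the split and the misclassification computation. All function names, the record layout, and the choice of `service` as the predictor are assumptions.

```python
import random
from collections import Counter, defaultdict


def split_records(records, train_fraction=0.8, seed=0):
    """Shuffle a copy of the records and split it into (train, validation)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def train_majority_model(train, feature, target):
    """Stand-in for the decision tree: most common target per feature value."""
    by_value = defaultdict(Counter)
    overall = Counter()
    for r in train:
        by_value[r[feature]][r[target]] += 1
        overall[r[target]] += 1
    default = overall.most_common(1)[0][0]
    table = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
    return lambda record: table.get(record[feature], default)


def misclassification_rate(model, records, target):
    """Fraction of records whose predicted target differs from the actual one."""
    wrong = sum(model(r) != r[target] for r in records)
    return wrong / len(records)
```

A model trained only on normal traffic is expected to mispredict intrusion records more often than held-out normal records, and that gap is what the misclassification figures measure.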
Dependency Network
Decision Tree
Apply Data Mining Model to Validate/Predict
Results
% misclassification (by final state)
Normal       149/1510 = 9.86%
Intrusion1   443/2324 = 19.06%
Intrusion2   376/1968 = 19.10%
Intrusion3   386/2011 = 19.19%
Intrusion4   437/2298 = 19.01%
Conclusion and future improvement
Accuracy
– Preliminary experiments using DM on the tcpdump data showed promising results
– Depends on sufficient training data and the right feature set
Performance
– 6 hours on one dataset (628775 records)
Size of time window
– 2 seconds or larger?
Automated process
– Call MSSQL DM and DTS procedures within VB
– Real-time monitor and alarm
References
Intrusion Detection, Rebecca Gurley Bace, Macmillan Technical Publishing, 2000
Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers, 2001
Data Mining with Microsoft SQL Server 2000, Claude Seidman, Microsoft Press, 2001
http://www.cs.columbia.edu/~sal/hpapers/USENIX/usenix.html
http://iris.cs.uml.edu:8080/network.html
http://www-nrg.ee.lbl.gov/. Network Research Group (NRG) of the Information and Computing Sciences Division (ICSD) at Lawrence Berkeley National Laboratory (LBNL) in Berkeley, California.
Thank You!