CHAPTER 3

PROBLEM FORMULATION

3.1 MAIN OBJECTIVE

The main objective of this investigation is to design eCloudIDS, a next-generation generic security framework with a hybrid two-tier expert engine-based IDS for public, private and hybrid cloud computing environments. The framework is intended to safeguard the cloud service provider’s infrastructure together with the cloud service users’ virtual machines (including their data and applications).

3.2 SPECIFIC OBJECTIVES

In order to achieve the main objective it is essential to address the

following security concerns:

Security Concern #1: Who retains the data ownership and control ownership?

Security Concern #2: Who maintains the audit records of the data?

Security Concern #3: What is the mechanism for delivering these audit records to the customer?

Security Concern #4: Does the CSP allow customers, as the real owners of the data, to secure and manage access from end-users (the customers’ clients)?

The eCloudIDS Generic Cloud Security Framework should take into account the behavioral patterns of persons accessing cloud VMs, as well as the abnormalities and inconsistencies introduced by attackers on the CSP’s side, hobbyist hackers accessing the cloud through the Internet, malicious worms, viruses, etc.

In short, the following challenges are considered as the specific objectives for the design of the eCloudIDS Generic Security Framework.

eCloudIDS Objective #1: Logical storage segregation and multi-tenancy security issues

eCloudIDS Objective #2: Identity management issues

eCloudIDS Objective #3: Insider attacks

eCloudIDS Objective #4: Virtualization issues

eCloudIDS Objective #5: Cloud VM auditor management issues

eCloudIDS Objective #6: Hacker attacks

eCloudIDS Objective #7: Signature based attacks caused by Worms, Viruses, Trojans, etc.

3.2.1 eCloudIDS Objective 1: Logical storage segregation and multi-tenancy security issues

Logical storage segregation and multi-tenancy security is one of the primary security concerns in cloud computing and receives major attention from all cloud stakeholders. Multi-tenancy is a distinctive and beneficial property of the cloud because it maximizes the utilization of pooled resources across multiple users, but it also introduces a new class of security threats to the entire cloud infrastructure. Hence, this research work focuses primarily on ways and means of handling logical storage segregation and multi-tenancy security issues.

3.2.2 eCloudIDS Objective 2: Identity management issues

Identity management is an essential component of today’s enterprise computing and influences all aspects of the way businesses run. Given its importance, it can be concluded that cloud computing may not be complete and effective without proper identity management. On the one hand, the traditional identity management portfolio is maturing; on the other hand, it has its own challenges. Identity management for the cloud requires capabilities beyond its current, traditional abilities. Hence, this research work focuses on leveraging identity management capabilities for the cloud environment.

3.2.3 eCloudIDS Objective 3: Insider attacks

Insider attacks have been one of the most common security threats for decades, and they are gaining vigor due to the new characteristics of cloud computing. When an insider attack is performed at the cloud service provider’s premises, the volume and velocity of its effect are many times greater than in a traditional environment. Hence, this research work attempts to prevent insider attacks through suitable mechanisms.

3.2.4 eCloudIDS Objective 4: Virtualization issues

Virtualization is the backbone for achieving the cloud’s objectives, but it also brings a new line of vulnerabilities to the cloud infrastructure. Protecting the cloud infrastructure and protecting the hosted VMs are equally important from the perspectives of both the CSP and the cloud users. Hence, this research work aims to address virtualization security issues to the extent possible.

3.2.5 eCloudIDS Objective 5: Cloud VM auditor management issues

In a traditional environment, a dedicated system’s host (server) OS is the parent process on top of the underlying server hardware. The nature of the cloud gives this scenario a new dimension. In a typical cloud environment, the hypervisor (or cloud native OS) becomes the parent process on top of the underlying server infrastructure, and the host/guest OSs become child processes. Hence, protecting cloud VMs and auditing them regularly or continuously becomes mandatory in a cloud environment. The proposed system focuses on building a security framework that operates on the basis of cloud VM auditing/monitoring mechanisms.

3.2.6 eCloudIDS Objective 6: Hacker attacks

After protecting a cloud infrastructure from the threats and vulnerabilities that derive from the new cloud characteristics, it is also important to consider conventional attacks by hackers and other third parties. Hence, this research aims to handle such hacker attacks.

3.2.7 eCloudIDS Objective 7: Signature based attacks caused by Worms, Viruses, Trojans, etc.

Worms, viruses and Trojans newly designed for cloud environments are increasingly difficult for existing traditional security tools and frameworks to handle. The proposed system aims to capture such new attacks through its machine learning intelligence.

3.3 SUMMARY

Cloud computing has become a key driver for supporting the on-demand needs of corporate IT today. It promises extensive mechanisms for resource sharing, maximum resource utilization, and true elasticity compared to its predecessors. Even while recognizing these advantages, many organizations are reluctant to adopt this next-generation computing model because of its severe security concerns. Business computing today relies on the security of customers and their data, and the public, multi-tenant nature of the cloud makes these security concerns substantial.

Hence, this research work attempts to examine further the problems involved in cloud computing while designing a generic cloud security solution/framework.

CHAPTER 4

METHODOLOGY

4.1 INTRUSION DETECTION SYSTEM

Intrusion detection refers to the process of monitoring the events happening in a computer system or network and examining them for signs of security problems. In its general sense, intrusion detection resembles monitoring systems in other areas, such as burglar alarms and the video-monitoring systems found in banks and stores. Warning systems in civil defense and the military also fall into this functional category. Although the strategies employed by the various monitoring systems differ, the basic idea remains the same. In this context [60] [76] [77], intrusion detection is defined as the process of detecting and responding to malicious activity directed at computing and networking resources.

Intrusion detection is also defined as the process of observing the events occurring in a computer system or network and analyzing them for violations, or imminent threats of violation, of security policies or standard security practices. These violations may be caused by malware such as worms, spyware and viruses, by attackers gaining unauthorized access to systems, by authorized users misusing their privileges, or by flaws that grant an attacker elevated access to the network [62].

An Intrusion Detection System (IDS) is software that automates the intrusion detection process [62]. An IDS monitors network or system events for malicious activities that tend to compromise the confidentiality, integrity, and availability of the network, and sends a report to the management station. An IDS gathers and analyzes information from within a network or a computer to identify possible security breaches, which include both attacks from outside the organization and attacks from within it. It uses a technology known as vulnerability assessment or scanning to assess the security of a computer or a network.

The intrusion detection system acquires data about the information system in order to analyze the security status of that system. The foremost goal of an IDS is to detect security breaches, including both attempted and potential breaches. A simple, typical IDS is shown in Figure 4.1 [78].

Fig. 4.1. A Simple Intrusion Detection System.

An intrusion detection system acts as a detector that processes information coming from the system to be protected. The detector can also launch probes to trigger the audit process, such as requesting version numbers for applications. It makes use of three categories of information: long-term information related to the technique used to detect intrusions (such as a knowledge base of attacks), configuration information that describes the present state of the system, and audit information that describes the events happening on the system. The IDS eliminates surplus information from the audit trail. It presents either a synthetic view of the security-related actions taken during normal usage of the system, or a synthetic view of the current security state of the system. A decision is then made to estimate the probability that these actions or this state can be regarded as indications of an intrusion or of vulnerabilities. Finally, a countermeasure component takes corrective action, either to prevent the actions from being carried out or to return the state of the system to a secure state [78].

To ensure the proper functioning of the IDS, sensors are used to collect data, analyzers to evaluate data, panels to monitor activities, and user interfaces to manipulate configuration settings. The data items handled by the IDS can take the form of packets, system audit records, computed hash values or other data formats. Analyzers receive input from sensors and then determine whether intrusive activity has occurred.

According to Porras and Valdes [79], the efficiency of an intrusion

detection system depends on the following parameters:

• Accuracy: It deals with the proper detection of attacks and the absence of false alarms.

• Performance: It is the rate at which audit events are processed.

• Completeness: It is the property of an intrusion detection system to identify all attacks.

• Fault Tolerance: An intrusion detection system needs to be resilient to attacks, especially denial-of-service attacks [80].

• Timeliness: An intrusion detection system has to complete and deliver its analysis as quickly as possible, in order to enable the security administrator to respond before much damage has been done, and also to prevent the attacker from subverting the audit source or the IDS itself [80].

The errors an IDS can make when automating intrusion detection are classified as false positives and false negatives. False positives are sequences of innocuous events that the IDS erroneously classifies as intrusive, while false negatives are intrusion attempts that the IDS fails to report [81]. Detection of hostile attacks depends on both the number and type of suitable actions [77]. Figure 4.2 describes the series of activities performed by an intrusion detection system.

Fig. 4.2. Intrusion Detection System Activities.
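The notions of false positive and false negative introduced above can be illustrated with a minimal Python sketch. The event labels and IDS verdicts below are hypothetical values chosen only for illustration; they are not results from this work.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, TN, FN for binary labels (True means intrusive)."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p and a:
            tp += 1          # intrusion correctly reported
        elif p and not a:
            fp += 1          # innocuous event classified as intrusive
        elif not p and a:
            fn += 1          # intrusion the IDS failed to report
        else:
            tn += 1          # innocuous event correctly ignored
    return tp, fp, tn, fn

# Hypothetical ground truth and IDS verdicts for six events.
actual    = [True, False, False, True, False, True]
predicted = [True, True,  False, False, False, True]

tp, fp, tn, fn = confusion_counts(actual, predicted)
false_positive_rate = fp / (fp + tn)
false_negative_rate = fn / (fn + tp)
```

In this toy data, one innocuous event is flagged and one intrusion is missed, so both rates come out at one third.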

A common IDS architecture comprises a detection module, which gathers data that contains evidence of intrusions; an analysis engine, which processes the data to identify intrusive activity; and a response component, which reports intrusions. Figure 4.3 illustrates the components of an intrusion detection system.

Fig. 4.3. IDS Components.
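The three components can be sketched as a small pipeline. This is a simplified illustration under assumed names: the `Event` class, the function names, the `auth.log` source and the sample log lines are all hypothetical and do not describe the eCloudIDS implementation.

```python
from dataclasses import dataclass

@dataclass
class Event:
    source: str
    payload: str

def detection_module(raw_lines):
    """Gather raw records and wrap them as events (the sensor role)."""
    return [Event(source="auth.log", payload=line) for line in raw_lines]

def analysis_engine(events, suspicious_terms):
    """Keep the events whose payload contains any suspicious term."""
    return [e for e in events if any(t in e.payload for t in suspicious_terms)]

def response_component(intrusions):
    """Produce one report line per suspected intrusion."""
    return [f"ALERT [{e.source}] {e.payload}" for e in intrusions]

# Hypothetical log lines fed through the three stages in order.
raw = ["login ok user=alice", "failed password for root", "login ok user=bob"]
reports = response_component(
    analysis_engine(detection_module(raw), suspicious_terms=["failed password"])
)
```

Keeping the stages as separate functions mirrors the separation in the figure: the analysis logic can be replaced without touching data collection or reporting.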

Figure 4.4 describes the characteristics of IDS [78] as follows.

• The detection method defines the characteristics of analyzer. It is

categorized on the basis of information being used by IDS. It is

classified as,

o Behavior Based: When the information is about the normal

system behavior.

o Knowledge Based: When the information is in relation to

attacks.

• The behavior on detection defines the response of IDS to attacks.

The IDS is termed as,

o Active: When it reacts to the attack by taking either

corrective or pro-active actions.

o Passive: If it functions only to generate the alarms.

• The audit source location distinguishes the IDS on the basis of

the type of input information analyzed. This input information

can be,

o System log files on a host

o Network packets

o Application logs

o Intrusion detection alerts generated by other IDSs

• The detection paradigm describes the detection mechanism used by the IDS. The IDS can evaluate,

o States

o Transitions

This evaluation can be performed in a non-obtrusive way or by actively stimulating the system to obtain a response.

Fig. 4.4. Characteristics of IDS.

4.1.1 Types of IDS

IDS can be categorized into various types, on the basis of different

monitoring and analysis approaches. IDS can monitor events at three

levels:

• Network

• Host

• Application

IDS can analyze these events using,

• Signature Detection

• Anomaly Detection

Host-based IDS

A Host-based Intrusion Detection System (HIDS) is a class of IDS that resides on a host machine and monitors it. Activities on the host are analyzed at a very fine granularity to determine precisely which processes and users are performing malicious activities on the operating system.

The system characteristics that can be used by HIDS for collection

of data [81] [82] are:

• File System

Activities conducted on the host can be indicated by changes made to the host’s file system. Irregular patterns of file system access and changes to sensitive portions of the file system provide clues for discovering attacks.

• Network Events

To view the data exactly as it will be perceived by the end process, the IDS can intercept all network communications after they have been processed by the network stack and before they are passed on to user-level processes. However, this approach cannot detect attacks launched by a user with terminal access or attacks on the network stack itself.

• System Calls

An IDS can be positioned so that it observes all system calls, which provide very rich data indicating the behavior of a program.

Hence, choosing appropriate system characteristics is the critical

decision in any HIDS design.
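As an illustration of the file system characteristic, the following minimal sketch compares cryptographic digests of watched files against a stored baseline. It is one possible approach under stated assumptions, not the mechanism used in this work; the file names and the temporary directory are hypothetical stand-ins for sensitive files.

```python
import hashlib
import os
import tempfile

def snapshot(paths):
    """Map each path to the SHA-256 digest of its current contents."""
    digests = {}
    for path in paths:
        with open(path, "rb") as f:
            digests[path] = hashlib.sha256(f.read()).hexdigest()
    return digests

def changed_files(baseline, current):
    """Return the paths whose digest differs from the baseline, sorted."""
    return sorted(p for p in baseline if current.get(p) != baseline[p])

# Demonstration on a throwaway directory standing in for sensitive files.
with tempfile.TemporaryDirectory() as d:
    watched = [os.path.join(d, name) for name in ("passwd", "hosts")]
    for path in watched:
        with open(path, "w") as f:
            f.write("original\n")
    baseline = snapshot(watched)

    # Simulate tampering with one of the watched files.
    with open(watched[0], "a") as f:
        f.write("tampered\n")

    modified = [os.path.basename(p)
                for p in changed_files(baseline, snapshot(watched))]
```

Only the file that was altered after the baseline was taken is reported; a deleted file would also be reported, because its digest would be missing from the current snapshot.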

Network-based IDS

Network-based IDS (NIDS), presently the most common commercial product offering, detect attacks by capturing and analyzing the packets that traverse a given network link. A NIDS consists of a set of single-purpose hosts that sniff the network traffic and report attacks to a single management console. Because these hosts run no applications other than the NIDS, they can be more easily secured against attack. These NIDSs have “stealth” modes [61] which make it almost impossible for an attacker to detect their presence.

A NIDS monitors the characteristics of network data to perform intrusion detection. Most NIDS operate by examining the IP and transport layer headers of individual packets, the contents of packets, or some combination of these [81].

Application-based IDS

Application-based IDS monitor the events occurring within an application. They detect attacks by analyzing the application’s log files. By interfacing with an application directly and having significant application knowledge, application-based IDSs are likely to have a fine-grained view of suspicious activity in the application.

Signature-based IDS

Signature-based IDS centers on the use of an expert system to identify intrusions based on a predetermined knowledge base. If properly programmed, it can detect every known attack. This technique is an effective method used in commercial products for detecting attacks.
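A minimal sketch of signature matching against a knowledge base is given below. The two signatures and the sample requests are hypothetical and deliberately simplistic; production signature sets are far larger and more precise.

```python
import re

# Hypothetical knowledge base: signature name mapped to a pattern.
SIGNATURES = {
    "sql_injection": re.compile(r"('|%27)\s*or\s+1=1", re.IGNORECASE),
    "path_traversal": re.compile(r"\.\./"),
}

def match_signatures(request):
    """Return the sorted names of all signatures found in the request."""
    return sorted(name for name, pattern in SIGNATURES.items()
                  if pattern.search(request))

hits_sqli = match_signatures("GET /item?id=1' OR 1=1")
hits_traversal = match_signatures("GET /../../etc/passwd")
hits_clean = match_signatures("GET /index.html")
```

A request that matches no stored pattern passes unnoticed, which is why this approach cannot detect attacks that are absent from the knowledge base.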

Anomaly-based IDS

Anomaly-based IDS finds an attack by identifying anomalous (i.e. unusual) behavior on a host or a network. It rests on the premise that some attackers behave differently from normal users, so their attacks can be detected by systems that identify these differences. Such systems may generate an overwhelming number of false alarms, since normal user and network behavior can vary unpredictably. Anomaly-based IDS can, however, detect never-before-seen attacks.
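One simple way to flag unusual behavior is to compare an observation against a statistical profile of normal activity. The sketch below uses a mean and standard deviation threshold, which is an assumption made for illustration and not the detection method of this work; the login counts and the threshold of 3 are hypothetical.

```python
from statistics import mean, stdev

def build_profile(normal_values):
    """Summarize normal behavior by its mean and sample standard deviation."""
    return mean(normal_values), stdev(normal_values)

def is_anomalous(value, profile, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu, sigma = profile
    return abs(value - mu) > threshold * sigma

# Hypothetical baseline: logins per hour observed during normal operation.
baseline_logins_per_hour = [4, 5, 6, 5, 4, 6, 5, 5]
profile = build_profile(baseline_logins_per_hour)

normal_hour = is_anomalous(6, profile)      # within normal variation
burst_hour = is_anomalous(40, profile)      # far outside the profile
```

Lowering the threshold catches more deviations but also raises more false alarms, which is the trade-off described above.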

4.1.2 Role of IDS in Infrastructure Security

Infrastructure security is the security provided to safeguard infrastructure, predominantly critical infrastructure (infrastructure related to the network). Intrusions and disruptions in one infrastructure can result in unexpected failures in others [83]. A simple example illustrates this. Suppose a system managed by a senior person in an organization holds the list of all tasks to be performed by the subordinates working under that person at different locations. The failure of such a system will affect the working of the whole group.

Each hardware device in an organizational system, such as firewalls, routers, switches, servers, workstations and modems, has security challenges associated with it. Securing all such components is essential because the security of a network depends on its weakest link; if that link is compromised, the functionality of the entire system is affected. Another harsh reality is that the devices within the infrastructure are likely to be accessible from remote devices in one way or another, and every access path to the devices within the infrastructure has inherent vulnerabilities. Intrusion Detection Systems (IDS) and Intrusion Prevention Systems (IPS) are software or hardware devices designed to address the challenges of infrastructure security.

IDS have both passive and active components, although IDS devices are most commonly passive. Packets moving through the network are observed from a monitoring port, the traffic is compared against configured rules, and an alert is raised if anything suspicious is detected [34]. An IDS can detect several types of malicious traffic that would pass through a typical firewall, including network attacks against services, host-based attacks such as unauthorized logins, data-driven attacks on applications, and malware such as viruses, Trojans, and worms. The detection methods used by IDS can be signature-based or anomaly-based [81]. Passive procedures include inspecting system configuration files to reveal inadvisable settings, inspecting password files to identify weak passwords, and inspecting system areas to detect policy violations. Active procedures usually include mechanisms to discover known attack techniques, to reprogram the firewall, to log off users, and to log system responses.

4.1.3 Attacks Commonly Detected by IDSs

Three types of attacks [84] detected and reported by IDSs are,

• Scanning Attacks

• Denial of Service (DoS) Attacks

• Penetration Attacks

4.1.4 Limitations of Intrusion Detection System

One must be aware of the limitations of intrusion detection systems before undertaking an IDS deployment [61] [81] [84].

• Poor scalability of IDSs. Associated problems include the lack of proper integration with other security tools and sophisticated network management systems, the difficulty of investigating numerous alarms generated by different IDS sensors, and the inability to visualize threats at the enterprise level.

• Many IDSs create an enormous number of false positives that waste administrators’ time and may initiate damaging automated responses.

• During heavy network or host activity, IDSs that are supposed to operate as real-time systems may take several minutes before reporting and automatically responding to an attack.

• IDSs usually cannot detect recently published attacks or variants of existing attacks. Once a new attack is posted on the web, an attacker may quickly use it to breach a target network.

• IDSs may be able to stop novice attackers, but the automated responses generated by IDSs are often ineffective against sophisticated attackers. If IDSs are not configured properly, they can interrupt the network by disturbing legitimate traffic.

• IDSs must be monitored by skilled computer security personnel to achieve maximum performance and to understand the implications of what the IDSs detect.

• A substantial amount of personnel resources is needed for the maintenance and analysis of an IDS.

• IDSs themselves cannot be fully protected from attack; hence they are not failsafe.

• Intrusion detection systems cannot readily detect coordinated or cooperative attacks, because they do not offer user interfaces for identifying them.

• An IDS must be part of a framework of computer security measures; it should never be used in isolation.

4.2 MACHINE LEARNING

Machine Learning is a scientific discipline that seeks to answer the question “Can systems be programmed in such a way that they learn by themselves and improve with experience?” [35]. This question applies to many real-life problems, such as predicting which jobs users will apply to, designing autonomous mobile robots that learn to move from their own experience, and developing search engines that automatically customize themselves according to users’ preferences.

Tom M. Mitchell [85] defines machine learning as follows: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E”. Consider a system that classifies emails as spam or not spam. The task T is to classify emails into those two categories, and the experience E is watching the user label emails as spam or not spam. Based on what it learns from watching the user, the system tries to label new emails. The performance measure P is the number of emails it correctly classifies as spam or not spam.

Figure 4.5 [86] shows the architecture of a typical AI agent. The agent perceives the environment and determines which actions it can take, for example by logical reasoning and by calculating the outcomes of those actions. A change in any of the components in the figure is termed learning. Different learning mechanisms are used depending on the subsystem in which the change takes place.

Fig. 4.5. A Simple Architecture of AI.

In computer science, every action that has to be performed is

modeled as a function with sets of inputs and outputs. A machine

learning task estimates this function by observing the sets of inputs and

outputs.

The pictorial representation of machine learning’s general execution

flow is shown in Figure 4.6.

With the rapid growth of the Internet and the ever-increasing complexity of communication protocols, security has become a critical issue. Attackers are developing new and complex attack methods at an alarming rate. CERT reported 8064 new vulnerabilities in 2006, and this number has increased steadily over the past few years. It is therefore necessary to analyze how machine learning (or data mining) methodologies can be used to enhance computer security.

Fig. 4.6. Machine Learning – A General Execution Flow.

Although various strong approaches to achieving security are available, they help to tackle only known attacks, and new attacks always pose a major problem. Prevalent security software requires considerable human effort to spot threats, select characteristic features from them, and encode those features into the application so that it can catch the threats. Machine learning algorithms can make this labor-intensive process more efficient.

Another critical security issue prevalent today is that of insider attacks. An organization’s employee is assumed to be a trusted user. However, a study by CERT [87] showed that insider attacks have caused considerable tangible and intangible losses to various organizations in the recent past.

Machine learning has been one of the promising advances in IT and has been successful in solving complex data classification problems. It constructs programs that improve themselves with observation and experience. The experience E is fed in the form of raw data inputs, and the actual learning happens with the help of algorithms. Machine learning addresses two main issues: learning from the given input data, and making predictions about new data based on what has been learned from experience. Both are difficult and time-consuming for human analysts. Machine learning is thus well suited to problems that otherwise depend on expensive, rare and unreliable human experts.

The task of detecting network intrusions also falls under machine learning, as it involves classifying data into normal and abnormal behavior. Thus, for intrusion detection [87], we have,

• Task: To detect intrusions in an accurate and precise manner.

• Experience: A dataset with instances representing normal and attack data.

• Performance Measure: Accuracy in the correct classification of intrusion events and normal events, together with other statistical metrics including precision, recall, F-measure and the kappa statistic.
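The performance measures listed above can be computed from the four confusion-matrix counts, as in the following sketch. The formulas are the standard definitions (with Cohen’s kappa for the kappa statistic); the counts passed in are hypothetical and are not results from this work.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)           # fraction of alerts that are real
    recall = tp / (tp + fn)              # fraction of intrusions detected
    f_measure = 2 * precision * recall / (precision + recall)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_yes = ((tp + fp) / total) * ((tp + fn) / total)
    p_no = ((tn + fn) / total) * ((tn + fp) / total)
    expected = p_yes + p_no
    kappa = (accuracy - expected) / (1 - expected)

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "kappa": kappa,
    }

# Hypothetical counts for 100 classified events.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

The sketch assumes that every denominator is non-zero; a practical implementation would need to handle the case where no event is predicted or labelled as an intrusion.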

4.2.1 Types of Machine Learning Algorithms

Machine learning algorithms can be classified based on

• Outcome of the algorithm

• Type of input fed during training

Supervised Learning

Supervised learning is the process of inferring a function from labelled data, also known as training data. The training data consist of numerous training examples, each containing a pair of an input vector and a desired output value. The algorithm analyzes the training data and produces a function that is then used to map new examples. An optimal solution is reached when the algorithm correctly determines the labels for unseen instances. Figure 4.7 gives a graphical representation of supervised learning.

Fig. 4.7. Supervised Learning.

The data can be represented as pairs {X, Y}, where the elements of Y are the labels of the corresponding data elements in X. For example, each element x_i ∈ X can be an image (a 400 × 400 pixel black-and-white image) and y_i ∈ Y a binary indicator showing, for instance, whether or not there is a difference in the image x_i. The main goal is then to predict the labels Y_new for a new, unlabelled dataset X_new. Hence, the experience gained from the full dataset of pairs is used to predict other labels.

Many supervised learning algorithms are available, each with its strengths and weaknesses. No single learning algorithm works best in all scenarios.

Unsupervised Learning

Unsupervised learning deals with how systems can learn to represent particular input vectors in a way that reflects the statistical structure or pattern of the overall collection of inputs. In contrast with supervised learning, there are no explicit target output labels or environmental evaluations associated with each input. Many unsupervised learning methods are based on data mining methods used to preprocess data. Figure 4.8 gives a graphical representation of unsupervised learning.

Fig. 4.8. Unsupervised Learning.

The most common form of unsupervised learning is clustering. The goal is to find similarities in the training data, on the assumption that the clusters discovered will match an intuitive classification reasonably well. The algorithm has no names to give to these clusters; it produces them and then assigns new examples to one or another of them. This data-oriented approach works well when appropriate data are available. For example, social information filtering algorithms, such as those Amazon.com uses to recommend books, are based on finding similar groups of people and then allocating new users to those groups [88].

Approaches to unsupervised learning include clustering (e.g. k-means, mixture models), blind signal separation using feature extraction techniques for dimensionality reduction (e.g. principal component analysis), self-organizing maps, etc.

4.2.2 Supervised Machine Learning Algorithms

Support Vector Machines (SVM)

Support vector machines are among the best-performing supervised machine learning algorithms and are used for classification and regression analysis. SVMs deliver state-of-the-art performance in many real-world applications, such as image classification, text categorization, bio-sequence analysis and handwritten character recognition, and are now established as one of the standard tools for machine learning and data mining. The basic SVM takes a set of input data and predicts, for each given input, which of two possible classes it belongs to, making it a non-probabilistic binary linear classifier [90]. SVM classification centers on four basic concepts [91]. They are,

• Separating hyperplane

• Maximum margin hyperplane

• Soft margin

• Kernel function
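For reference, the four concepts listed above correspond to the following standard textbook formulation. The notation is assumed here and does not come from the cited sources: training pairs $(x_i, y_i)$ with $y_i \in \{-1, +1\}$, weight vector $w$, bias $b$, slack variables $\xi_i$, penalty parameter $C$, feature map $\phi$ and dual coefficients $\alpha_i$.

```latex
% Separating hyperplane and the resulting decision rule
w \cdot x + b = 0, \qquad \hat{y}(x) = \operatorname{sign}(w \cdot x + b)

% Maximum margin hyperplane (hard margin)
\min_{w,\,b} \ \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1, \quad i = 1,\dots,n

% Soft margin: slack variables allow some margin violations
\min_{w,\,b,\,\xi} \ \tfrac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0

% Kernel function: inner product in a feature space, and the kernelized decision rule
K(x, x') = \langle \phi(x), \phi(x') \rangle, \qquad
\hat{y}(x) = \operatorname{sign}\Big( \sum_{i=1}^{n} \alpha_i\, y_i\, K(x_i, x) + b \Big)
```

A larger $C$ penalizes margin violations more heavily, while the kernel lets the same linear machinery separate data that are not linearly separable in the original input space.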

K Nearest Neighbors (K-NN)

K-NN classifies objects based on the closest training examples in a given feature space. It is a type of lazy learning technique, in which the function is only approximated locally and all computation is deferred until classification. Because induction is delayed until run time, it is considered a lazy learning or case-based classification technique. Since the training examples need to be in memory at run time, it is sometimes also called memory-based classification. An object is classified by a majority vote of its neighbors and is assigned to the class most common amongst its k nearest neighbors [92].

In K Nearest Neighbors, the learning step is trivial; one simply stores the dataset in the system’s memory. To find the class of an instance q whose attributes are q.Ai, the steps are as follows:

1. Find the k instances in the given dataset that are closest to q.

2. Let these k instances vote to determine the class of q.

Fig. 4.9. Representation of K-NN, with two classes – triangle and circle,

a query point q and the value of k (number of nearest neighbors) as 4.

The naive version of the algorithm is easy to implement by computing the distances from the test example to all stored examples, but its drawback is that it is computationally intensive for large training sets.
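The naive version can be sketched in a few lines of Python. Euclidean distance is assumed as the distance measure, and the training points and class names are hypothetical values chosen to echo the two classes of Figure 4.9.

```python
from collections import Counter
from math import dist

def knn_classify(training, q, k):
    """Classify query point q by majority vote of its k nearest neighbors.

    training is a list of (point, label) pairs.
    """
    # Naive step: compute the distance from q to every stored example.
    nearest = sorted(training, key=lambda item: dist(item[0], q))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training set with two classes.
training = [
    ((1.0, 1.0), "circle"), ((1.5, 2.0), "circle"), ((2.0, 1.0), "circle"),
    ((6.0, 6.0), "triangle"), ((7.0, 7.5), "triangle"), ((6.5, 8.0), "triangle"),
]

label_near_circles = knn_classify(training, q=(2.0, 2.0), k=3)
label_near_triangles = knn_classify(training, q=(6.0, 7.0), k=3)
```

Every query sorts the whole training set, which is the cost noted above. An odd k avoids tied votes in the two-class case; with an even k such as 4, this sketch resolves ties by simply taking the first label returned by the counter.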

Artificial Neural Networks (ANN)

An artificial neural network (ANN) is a computational model inspired by the functioning and neural structure of the human brain. ANNs are used in pattern and sequence recognition systems, data processing, robotics, etc. The multilayer perceptron (MLP) is a feed-forward neural network with one or more layers between the input and output layers. Feed-forward means that data flows in one direction only, from the input layer to the output layer. This type of network is trained with the backpropagation learning algorithm. MLPs are widely used for pattern recognition, classification, prediction and approximation, and can solve problems that are not linearly separable [93] [94].

Fig. 4.10. Artificial Neural Network.
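As a small illustration of the claim that an MLP can solve problems that are not linearly separable, the sketch below evaluates a two-layer feed-forward network on XOR. The weights are chosen by hand for illustration and are not learned with back propagation; the step activation and all names are assumptions of this example.

```python
def step(z):
    # Threshold activation: the neuron fires when its weighted input is positive.
    return 1 if z > 0 else 0

def layer(inputs, weights, biases):
    # One feed-forward layer: each neuron sums its weighted inputs plus a bias.
    return [step(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

# Hand-picked (hypothetical) weights: hidden neuron 0 acts as OR, neuron 1 as AND.
HIDDEN_W = [[1.0, 1.0], [1.0, 1.0]]
HIDDEN_B = [-0.5, -1.5]
# Output neuron fires for "OR and not AND", which is XOR.
OUT_W = [[1.0, -1.0]]
OUT_B = [-0.5]

def mlp_xor(x1, x2):
    hidden = layer([x1, x2], HIDDEN_W, HIDDEN_B)   # input layer -> hidden layer
    return layer(hidden, OUT_W, OUT_B)[0]          # hidden layer -> output layer

for a in (0, 1):
    for b in (0, 1):
        print(a, b, mlp_xor(a, b))
```

No single-layer perceptron can represent XOR, because no straight line separates the inputs (0,1) and (1,0) from (0,0) and (1,1); the hidden layer is what makes it possible.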

4.2.3 Unsupervised Machine Learning Algorithms

Self-Organizing Maps (SOM)

The Self-Organizing Map (SOM) is a neural-network-based model for
analyzing and visualizing high-dimensional data. It defines a mapping from
a high-dimensional input data space onto a two-dimensional array of
neurons. It is a competitive network whose objective is to convert an
input data set of arbitrary dimension into a one- or two-dimensional
topological map. The model was first described by the Finnish professor
Teuvo Kohonen and is therefore also referred to as a Kohonen map. It aims
to discern the underlying structure (e.g. the feature map) of the input
data set by constructing a topology-preserving map that captures the
neighborhood relations of the points in the data set [95] [96].

Fig. 4.11. Self-Organizing Map.

The SOM is commonly used in data compression and pattern recognition. It
is a single-layer feed-forward network in which each source node is
connected to all output neurons. The input dimension is generally higher
than the output dimension.
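The competitive step of a SOM can be sketched as a single training update. This is a minimal illustration, assuming a one-dimensional map of four neurons, three-dimensional inputs, a fixed learning rate and a fixed neighborhood radius; a full SOM would shrink both over many iterations. All names and values are hypothetical.

```python
import math

def best_matching_unit(weights, x):
    # Competition: the neuron whose weight vector is closest to x wins.
    return min(range(len(weights)), key=lambda i: math.dist(weights[i], x))

def som_update(weights, x, learning_rate=0.5, radius=1):
    """One training step on a 1-D map; returns (winner index, new weights)."""
    bmu = best_matching_unit(weights, x)
    updated = []
    for i, w in enumerate(weights):
        if abs(i - bmu) <= radius:
            # The winner and its map neighbors move toward the input,
            # which is what preserves topology on the map.
            w = [wj + learning_rate * (xj - wj) for wj, xj in zip(w, x)]
        else:
            w = list(w)
        updated.append(w)
    return bmu, updated

weights = [[0.0, 0.0, 0.0], [0.5, 0.5, 0.5], [1.0, 1.0, 1.0], [0.2, 0.9, 0.1]]
x = [0.9, 1.0, 0.8]
bmu, new_weights = som_update(weights, x)
print(bmu, new_weights[bmu])
```

Here the three-dimensional input is mapped to a single position on the one-dimensional map (the index of the winning neuron), illustrating that the input dimension is higher than the output dimension.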

K-means Clustering

According to Hartigan, each of the k clusters can be represented by the
mean of the documents assigned to it, which is often termed the centroid
of that cluster [97]. According to Berkhin, two versions of the k-means
algorithm exist [62].

The first version, also known as Forgy’s algorithm [77], is the batch
version. Each major iteration involves:

• Reassigning all the documents to their nearest centroids.

• Recomputing the centroids of the newly assembled groups.

Before the iterations start, k documents are selected as initial
centroids. Iterations continue until some stopping criterion is met.
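The batch version can be sketched as below. This is a minimal illustration, assuming documents are represented as numeric vectors, Euclidean distance is used, the first k points serve as initial centroids, and the stopping criterion is "no assignment changed"; none of these choices is prescribed by the original text.

```python
import math

def mean(points):
    # Component-wise average of a list of equal-length vectors.
    return tuple(sum(c) / len(points) for c in zip(*points))

def kmeans_batch(points, centroids, max_iter=100):
    centroids = list(centroids)
    assignment = None
    for _ in range(max_iter):
        # Step 1: reassign every point to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # stopping criterion: nothing moved
        assignment = new_assignment
        # Step 2: recompute the centroid of each newly assembled group.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = mean(members)
    return centroids, assignment

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, assignment = kmeans_batch(points, centroids=points[:2])
print(assignment)
```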

The second version of the k-means algorithm is termed the online or
incremental version. According to Steinbach et al., it offers better
performance than the batch version in the domain of text documents [62].
Initially, k documents are selected randomly as initial centroids.
Documents are then iteratively allocated to their nearest centroid, and
the centroids are updated after each assignment. Iteration stops when no
more reassignments occur.
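The online version differs from the batch version only in when centroids are refreshed. The sketch below is one possible reading, assuming numeric vectors, Euclidean distance and caller-supplied initial centroids (fixed here rather than random so the run is repeatable); the running sums and counts are implementation details of this example.

```python
import math

def kmeans_online(points, centroids, max_passes=100):
    k, dim = len(centroids), len(points[0])
    centroids = [list(c) for c in centroids]
    sums = [[0.0] * dim for _ in range(k)]   # running sum of members per cluster
    counts = [0] * k                         # number of members per cluster
    assignment = [None] * len(points)

    def refresh(j):
        # Recompute centroid j from its running sum; empty clusters keep theirs.
        if counts[j]:
            centroids[j] = [s / counts[j] for s in sums[j]]

    for _ in range(max_passes):
        changed = False
        for i, p in enumerate(points):
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            old = assignment[i]
            if j == old:
                continue
            if old is not None:
                # Remove the point from its previous cluster and update it at once.
                counts[old] -= 1
                sums[old] = [s - x for s, x in zip(sums[old], p)]
                refresh(old)
            # Add the point to its new cluster and update that centroid at once.
            counts[j] += 1
            sums[j] = [s + x for s, x in zip(sums[j], p)]
            refresh(j)
            assignment[i] = j
            changed = True
        if not changed:
            break  # no more reassignments occur
    return centroids, assignment

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
online_centroids, online_assignment = kmeans_online(points, points[:2])
print(online_assignment)
```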

The centroid vector c of a cluster C of documents is defined as the
average of the term weights of the documents in C.

Divisive Hierarchical Clustering

Divisive algorithms begin with a single cluster containing all documents.
At each iteration the most suitable cluster is split, until a stopping
criterion such as a predetermined number k of clusters is reached. Kaufman
and Rousseeuw devised a method to implement this algorithm [98] [99]. In
this technique, the cluster with the largest diameter is split at each
step. This is the cluster that contains the most distant, i.e. least
similar, pair of documents. Within this cluster, the document with the
least average similarity to the other documents is removed to form a new
singleton cluster. The algorithm then iteratively moves documents from the
cluster being split to the new cluster if they have greater average
similarity to the documents in the new cluster. Because the clustering
quality of this method is not comparable to that of the other algorithms,
a minor modification is made: the least similar pair of documents in the
cluster being split is removed to form two new singleton clusters, and the
remaining documents in the cluster are iteratively assigned to one of the
new clusters according to their average similarity.
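The modified splitting step can be sketched as follows. This is an illustrative reading only: documents are stood in for by plain numbers, and the similarity function (negative absolute difference) is a hypothetical placeholder for a real document similarity such as cosine similarity over term weights.

```python
from itertools import combinations

def split_cluster(docs, similarity):
    """Split one cluster into two, seeded by its least similar pair."""
    # The least similar pair forms two new singleton clusters.
    a, b = min(combinations(range(len(docs)), 2),
               key=lambda pair: similarity(docs[pair[0]], docs[pair[1]]))
    left, right = [docs[a]], [docs[b]]
    # Each remaining document joins the cluster it is more similar to on average.
    for i, d in enumerate(docs):
        if i in (a, b):
            continue
        avg_left = sum(similarity(d, m) for m in left) / len(left)
        avg_right = sum(similarity(d, m) for m in right) / len(right)
        (left if avg_left >= avg_right else right).append(d)
    return left, right

def toy_similarity(x, y):
    # Hypothetical similarity: closer numbers are more similar.
    return -abs(x - y)

docs = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
left, right = split_cluster(docs, toy_similarity)
print(sorted(left), sorted(right))
```

In a full divisive algorithm this step would be applied repeatedly to the cluster with the largest diameter until k clusters exist.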

4.3 IDENTITY MANAGEMENT

4.3.1 Digital Identity Lifecycle

Digital identity is said to be at the heart of many contemporary strategic
modernizations and innovations, in areas ranging from crime, misconduct
and offence to internal and external security and business models. These
require personal information to be disclosed within ubiquitous
environments [100], which raises serious security concerns and discomfort
for users and creates the need for Identity Management (IM) within such
environments.

In the digital world, a person can be represented by sets of data
(attributes) that can be managed by technical means, so-called digital
identities. Depending on the situation and context, only subsets of these
attributes are needed to represent an individual, both in the physical and
the digital world; these subsets are called (digital) partial identities
[101]. An identity management system provides the tools for managing these
partial identities in the digital world.

Identity lifecycle management comprises the processes and technology used
for creating, deleting and managing accounts and entitlement changes and
for tracking policy compliance, including some or all of the following
[102]:

• Provisioning/de-provisioning. Accounts in multiple systems are
automatically created and expired based on data acquired from
authoritative data sources. This reduces the effort of creating and
managing accounts manually, and reduces security risk through the
automatic application of policies.

• Workflow. The automation of steps within the identity lifecycle
management process, comprising approval, notification, escalation and
creation of audit data.

• Administration. The simplification of identity administration,
typically through the deployment of a web-based user administration
console. Such interfaces are frequently used for delegated administration
and possibly even user self-service, in conjunction with workflow.

• Credential management. The management of passwords, certificates and
smart cards.
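The provisioning/de-provisioning item above can be illustrated with a small sketch. It assumes a hypothetical authoritative source (for example an HR record) that maps each user to an optional end date, and a target system that stores an "active" or "expired" state per account; the function name, the data layout and the re-activation policy are all assumptions of this example.

```python
from datetime import date

def sync_accounts(authoritative, accounts, today):
    """authoritative: {user_id: end_date or None}; accounts: {user_id: state}."""
    result = dict(accounts)
    for user_id, end_date in authoritative.items():
        still_valid = end_date is None or end_date >= today
        if still_valid:
            result[user_id] = "active"      # provision (or keep) the account
        elif user_id in result:
            result[user_id] = "expired"     # de-provision: validity has ended
    for user_id in result:
        if user_id not in authoritative:
            result[user_id] = "expired"     # unknown to the authoritative source
    return result

authoritative = {"alice": None, "bob": date(2020, 1, 31), "carol": date(2030, 12, 31)}
accounts = {"bob": "active", "dave": "active"}
synced = sync_accounts(authoritative, accounts, today=date(2024, 1, 1))
print(synced)
```

Because the policy is applied automatically on every synchronization, no manual step is needed to expire the accounts of users who have left.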

Lifecycle management provides an integrated and comprehensive solution for
managing the complete lifecycle of user identities and their related
credentials and entitlements. The lifecycle of an individual in an
organization must be managed in a way that ensures the security of the
organization’s confidential data. The functionality of identity lifecycle
management is divided into two main components: a provisioning component
and an administrative component. The administrative component mainly
specifies delegation rules and provides self-service functions for
changing personal details or making requests. Delegating the rights of an
administrator to a second person is critical in unstable and dynamic
cloud-based settings [10].

Careful design of the digital identity’s complete lifecycle provides an
effective basis for identity management. Currently, the lifecycle is
modeled as a sequential multiphase process that moves from a creation
phase to a termination phase and supports updating and maintenance along
the way. Such a sequential process, however, does not actually meet the
requirements that multiple dependable identities pose [103].

4.3.2 Access-Control Lists

An Access Control List (ACL) governs the ability of a role to perform
actions such as reading, deleting or updating files. ACLs provide a
mechanism for defining network security that restricts and controls access
between users and network resources based on specified rules. An ACL is
built from individual rules that are applied to incoming packets. Identity
Driven Manager (IDM) ACLs are unique in that they are applied to
individual users when they connect to the network and are removed when the
user disconnects. Using IDM ACLs, network administrators can grant users
and groups access permissions directly to the network resources they need
[104].

In access control systems there is a clear separation between policies and
mechanisms. Policies are high-level guidelines that determine how accesses
are controlled and how access decisions are made. Mechanisms, on the other
hand, are low-level hardware and software functions that can be configured
to implement a policy. Security researchers seek to develop access control
mechanisms that are independent of the policy for which they are used.
This is a desirable objective because it permits the reuse of mechanisms
for a variety of security purposes. The choice of access control policy
depends on the specific features of the environment to be protected,
because not every system has the same protection requirements. For
example, access control policies that are strict and essential for some
systems may be unsuitable for environments where users require greater
flexibility.

The access matrix is a conceptual model that specifies the rights that
each subject possesses for each object. There is a row for each subject
and a column for each object. Each cell specifies the access authorized
for the subject in that row to the object in that column. The task of
access control is to ensure that only operations authorized by the access
matrix actually get executed. Note that the access matrix clearly
separates authentication from authorization.

The Access Control List (ACL) is the most popular approach to implementing
the access matrix. Each object is associated with an ACL that indicates,
for each subject in the system, the accesses that the subject is
authorized to execute on the object. An ACL corresponds to a column of the
access matrix and therefore provides convenient access review with respect
to an object. All access to an object can be revoked simply by emptying
its ACL. If all the accesses of a subject are to be revoked, however,
every ACL has to be visited one by one [105].
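The relationship between the access matrix and ACLs can be sketched as below. This is a minimal illustration in which the matrix is stored column-wise as one ACL per object; the subjects, objects and rights are hypothetical.

```python
# Access matrix stored by column: object -> ACL, where an ACL maps
# each subject to the set of rights it holds on that object.
acls = {
    "report.txt": {"alice": {"read", "update"}, "bob": {"read"}},
    "audit.log": {"alice": {"read"}, "carol": {"read", "delete"}},
}

def is_authorized(acls, subject, obj, right):
    # Only operations present in the matrix cell (subject, obj) are allowed.
    return right in acls.get(obj, {}).get(subject, set())

def revoke_object(acls, obj):
    # Revoking all access to an object: just empty its ACL.
    acls[obj] = {}

def revoke_subject(acls, subject):
    # Revoking all access of a subject: every ACL must be visited.
    for acl in acls.values():
        acl.pop(subject, None)

print(is_authorized(acls, "bob", "report.txt", "read"))
print(is_authorized(acls, "bob", "report.txt", "update"))
```

The asymmetry between revoke_object (one assignment) and revoke_subject (a loop over all objects) mirrors the trade-off described above.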

4.3.3 Components of Identity Management

Identity management is a broad administrative area that deals with
identifying individuals in a system, such as an enterprise or a network,
and controlling their access to resources within that system by
associating user rights and restrictions with the established identity.
The passport is a good example of identity management: citizens are
identified by their passport number and nationality, and restrictions
apply, such as not being permitted to enter a country after the
passport’s expiry date.

Identity management refers to the process of employing emerging
technologies to manage information about the identity of users and to
control access to company resources. Its main objective is to increase
productivity and security while reducing the costs associated with
managing users and their attributes, identities and credentials [106].

Traditionally, identity management has focused on managing an
organization’s employees to ensure that their authentication and
authorization information is reliable, consistent and up to date within
the organization’s information systems. This traditional approach
continues to pose several challenges for security architects and
designers, particularly given the large base of legacy systems. However,
the true price of identity management comes into play with consumers and
business partners [107].

The identity management infrastructure has many different components: an
authoritative source, a directory component, an administration component,
a directory integration component, a provisioning component, an access
control component, and a generalized application interfaces component
[100].

4.3.4 Identity Management Objectives

An identity management system is a technology developed mainly to enhance
security: it secures the identification, authentication and authorization
of the users of a system and monitors them by logging each user’s identity
and the data they try to access. The need for identity management arose
when attacks on sensitive and vulnerable data started occurring. Since the
development of the cloud, there has been a gradual shift from traditional
environments to cloud computing environments. By estimation, attacks on a
system come mainly from insiders rather than from outsiders. The identity
manager is built on the cloud to uniquely identify the users of an
application. The objectives and role of the identity manager in the cloud
environment are discussed in the following sections [108].

The objectives of identity management are:

• Uniqueness of Identity: In an organization, the identity manager is
used to give a unique identity to users, employees and customers [108].
When a sensitive record is accessed, this enables the administrator to
easily identify the user accessing the resource.

• Having Online Privacy: Users of the system can be assured of privacy
while accessing resources, interacting with people and saving files
online [109].

• Manage Personal Data: Users are able to manage the level of publicity
of their personal data. Data managed by one user is not affected by other
users, and users can also hide their data from other users [109].

• Delegation: The process of securely assigning a person to a role with
a priority higher than or equal to the user’s authority [109]. Delegation
is easily done when identity management is in use.

• Management of Identity System in a Federated Environment: In a
federated environment, many organizations establish a trust relationship
and allow the identities of users of the cooperating organizations to be
federated. In this case [109], issues such as conflicting identities,
identical user IDs and different login interfaces arise. To address them,
a federated identity manager that implements the organizations’ identity
manager rules should be established in the federated environment.

4.3.5 Role of Identity Management in Cloud Environment

In a cloud environment, resources that are to be published to everyone are
in the public domain by default. When a resource is placed in a public
cloud with no restrictions, it is accessible to everyone inside the cloud.
When a user wants to add a resource to a cloud and keep it private, the
resource should be kept in a space to which only identified and authorized
users have access. The identity manager is introduced in cloud computing
to provide this privacy and security of data [110].

Recently, a new architecture has been introduced, namely Security as a
Service, which provides security for users and providers [111]. The
architecture may include many security measures, among them identity and
access management. The service enables one cloud to work with another
cloud, where the service itself resides on a different cloud. Security is
provided by one cloud, which can be called the security provider, to the
cloud environment that uses the security service.

Unlike in a traditional environment, in a cloud environment many operating
systems run on a single hypervisor. In a traditional environment, the
identity manager can manage identities on all the systems in the
organizational network by creating a main active directory. This active
directory is used to map users to their systems and to identify and
authenticate them on any system in the network. In a cloud environment,
many operating systems may run on a single hypervisor, and a parent active
directory of the users across the hypervisors in the cloud is not yet
established, so users do not have the flexibility to use the same identity
on different hypervisors. An identity manager is therefore required for
the global maintenance of identities. This identity manager should resolve
identity issues between hypervisors and between applications [112].

The identity manager also helps protect users’ private data by enabling a
lock on their data. This lock can be a password, fingerprint reader,
barcode reader, retina identification, etc. Sensitive information such as
bank account numbers and credit card numbers should not be stored in any
cloud environment that lacks protection for private information. When an
identity manager is implemented, all data stored in the cloud has its own
storage space [113]. This storage space could be the public domain, a
user’s private storage, an organization’s storage area (accessible only to
users from that organization), the cloud manager’s storage area, etc. Each
storage area in a cloud has its own level of accessibility, which helps
ensure the protection and privacy of data.

The identity manager has a feature known as delegated administration
[113]. With this feature, a user in the management system can delegate
authority for a role temporarily or permanently. Along with the authority,
the associated privileges are also granted during delegation.
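Temporary and permanent delegation can be sketched as follows. This is a hypothetical data model, not the eCloudIDS implementation: the role names, privilege names and the Delegation class are assumptions made for illustration, with an optional expiry time distinguishing temporary from permanent delegation.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Hypothetical mapping from roles to the privileges they carry.
ROLE_PRIVILEGES = {
    "vm_admin": {"start_vm", "stop_vm"},
    "auditor": {"read_logs"},
}

@dataclass
class Delegation:
    delegator: str
    delegate: str
    role: str
    expires: Optional[datetime] = None  # None means a permanent delegation

    def active(self, now):
        return self.expires is None or now < self.expires

def privileges(user, delegations, now):
    # The privileges of a role are granted together with the delegated authority.
    granted = set()
    for d in delegations:
        if d.delegate == user and d.active(now):
            granted |= ROLE_PRIVILEGES[d.role]
    return granted

delegations = [
    Delegation("alice", "bob", "vm_admin", expires=datetime(2024, 1, 10)),  # temporary
    Delegation("alice", "bob", "auditor"),                                  # permanent
]
print(privileges("bob", delegations, datetime(2024, 1, 5)))
print(privileges("bob", delegations, datetime(2024, 2, 1)))
```

Once the temporary delegation expires, only the privileges of the permanently delegated role remain.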

4.4 SUMMARY

Cloud computing is a modern infrastructure technology that involves many
recent advancements. Designing a security system for the cloud therefore
requires the integration of many modern technologies. Hence, a new
state-of-the-art cloud security framework named eCloudIDS has been
designed, combining an intrusion detection system (IDS), machine learning
and identity management (IDM).
