Autonomic Computing

Omer F. Rana (Cardiff University)

Overview

• Illustrative example:
  – Managing Web Servers
  – Reference to IBM’s AC vision
• Use of SLAs to support system management
  – SLA standards, use of SLA in adaptation
• Approaches to adaptation
  – Stigmergy (social insects)
  – Utility-based approaches
• Toolkits

Recap … AC

• Automating the management of computer resources

• System components more complex
  – Better functionality
  – Hard to appreciate functionality
  – Interaction between components not always obvious

• System admins under increasing pressure to respond to complexity

AC … 2

• Manual tuning
  – Generally script driven (requires updates to configuration files)
  – Error-prone process (requires skilled personnel)
• Automated tuning
  – Try to model behaviour of the system
  – Use this model as a “predictive” tool to determine the likely response from the system
  – Design feedback control mechanisms (and use on-line operation to adjust control)

AC application

Can be applied at two levels:

• Individual component level
  – Make each component more intelligent
  – Provide support infrastructure around this intelligent component
• Interaction level
  – Facilitate better interaction between components in some way
  – Allow “useful” interactions to “emerge”

Four Concepts

• Self-Configuring:
  – Dynamic adaptation to changing environment
  – Addition of new features dynamically
• Self-Healing:
  – Discover, diagnose and react to disruptions
  – Handling failure and isolating a component
• Self-Optimising:
  – Monitor and tune resource utilisation
  – Includes: dynamic partitioning, workload management
• Self-Protecting:
  – Anticipate/identify, detect and protect from attacks
  – Extend existing security infrastructure to achieve this

Relationship to other themes

• Machine Learning and AI
• Knowledge Management (Semantics)
• Coordination Mechanisms and Protocols
• System Administration
• Performance Engineering and Monitoring
• Related Emerging areas
  – Ambient Intelligence
  – Amorphous Computing
  – Computational “Fabrics”

[Figure, from Alan Ganek, IBM: four-stage server provisioning example. Stages: 1. Steady State; 2. Monitor, Detect Surge; 3. Forecast, Provision Servers; 4. Monitor, Remove Servers. Plotted series: Response Time, Actual BOPS, Predicted BOPS, #Active Servers, #Requested Servers.]


Apache Web Server Tuning

• Client-server operation, bounded by the MaxClients and KeepAlive parameters
  – Tuning is equivalent to modifying MaxClients and KeepAlive
• Performance Metrics
  – End-user response time
  – Resource utilisation (CPU and memory)
• Measure parameters on server side
• Over-utilisation == thrashing and potential failure

Basis for Metrics

• Master process + pool of worker processes
• Each worker process handles interaction with a client
• Worker processes limited by MaxClients
• Worker process states: idle, waiting and busy
  – Idle (no TCP connection made)
  – Waiting (waiting for HTTP request from client)
  – Busy (processing request)
• Persistent HTTP/1.1: TCP connection remains open between consecutive HTTP requests (reduces time to set up a connection)
• Persistent connection can be terminated by master or client process, if waiting time exceeds the maximum allowed by KeepAlive

Desired CPU level=0.5, and Memory=0.6

Manual Tuning

Dynamic Workload (additional requests at 20th Control Interval)

Manual Tuning … 2

Dynamic Workload

• To maintain CPU and Memory criteria, it is necessary to tune manually

• Achieved by adjusting MaxClients and KeepAlive parameters

• Dynamic workload (generally unpredictable) requires continuous re-tuning

• Following the changes that result from a dynamic workload can become a continuous process

AutoTune agents

• Autotune Adaptor Bean
  – Interfaces with target system for service level metrics
  – Sets tuning parameters
• Autotune Controller Bean
  – Specifies control strategies (based on data captured)
  – Interacts with system admin to configure control strategy

Manages (1) timer, (2) async events

Can set (1) control and (2) sample intervals

AutoTune Functionality

AutoTune Architecture Data set generator

AutoTune Agent Operations

• Three agents:
  – Feedback controller design
    • Model based controller
    • Linear Quadratic Regulation (LQR) controller
  – Modelling
    • Non-production/testing mode
    • Alters tuning parameters: MaxClients and KeepAlive
    • Records performance metrics: CPU and memory
    • Constructs dynamic model (based on time series)
  – Run-time control
    • Production mode
    • Uses output from controller to dynamically adjust MaxClients and KeepAlive

Modelling agent

• Build a mathematical model of the system
  – Queuing theory
  – Data analysis based
• Mathematical model
  – Requires understanding of inner workings of server
  – May need to know about particular properties (exceptions) of the way the server operates
• Data-based model (“blackbox” approach)
  – Gather data of system in the “wild”
  – Assume a sufficient number of test cases has been covered
• User Input
  – Range of tuning parameters: MaxClients [1,1024]; KeepAlive [1,50]
  – Max delay required for tuning parameters to take effect on the performance metrics: MaxClients (10m); KeepAlive (20m)

Linear Model
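The data-based (“blackbox”) modelling step can be sketched as a least-squares fit. The sketch below is only an illustration under stated assumptions: it fits a scalar first-order model y[k+1] = a·y[k] + b·u[k], where `y` stands for one recorded metric (say CPU utilisation) and `u` for one tuning parameter (say a normalised MaxClients). The function name `fit_first_order` and the scalar model form are hypothetical; the slides do not give the actual model structure used by the modelling agent.

```python
def fit_first_order(y, u):
    """Least-squares fit of y[k+1] = a*y[k] + b*u[k] from recorded samples."""
    n = min(len(y) - 1, len(u))
    s_yy = sum(y[k] * y[k] for k in range(n))
    s_uu = sum(u[k] * u[k] for k in range(n))
    s_yu = sum(y[k] * u[k] for k in range(n))
    s_yn = sum(y[k] * y[k + 1] for k in range(n))  # y[k] against next output
    s_un = sum(u[k] * y[k + 1] for k in range(n))  # u[k] against next output
    det = s_yy * s_uu - s_yu * s_yu
    if abs(det) < 1e-12:
        raise ValueError("data does not vary enough to identify the model")
    # Solve the 2x2 normal equations by Cramer's rule.
    a = (s_yn * s_uu - s_un * s_yu) / det
    b = (s_yy * s_un - s_yu * s_yn) / det
    return a, b


# Synthetic "test mode" run: vary the input and record the output.
u = [((k * 37) % 11) / 10.0 for k in range(50)]
y = [0.0]
for k in range(50):
    y.append(0.8 * y[k] + 0.5 * u[k])
a, b = fit_first_order(y, u)
```

The fit only works if the tuning parameter is varied enough during the test run, which matches the slide's caveat about covering a sufficient number of test cases.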

Feedback Control

• PID (proportional-integral-derivative) control
  – Correct error between a measured process variable and a desired set point
  – Calculate and output a corrective action to adjust the process accordingly

From Wikipedia

http://upload.wikimedia.org/wikipedia/en/4/40/Pid-feedback-nct-int-correct.png

Feedback Control … 2

• Proportional: reaction to current error

• Integral: reaction based on recent error (time based)

• Derivative: reaction based on rate by which error has been changing

• Use a weighted sum of the three modes

• Output as a corrective action to a control element

Proportional Mode

• Responds to a change in the process variable proportional to the current measured error value

• Multiply the error by a constant Kp (proportional gain)

m: output signal; Kp: proportional gain; e: error (expected – actual); PB: proportional band

Integral Mode

• Controller output is proportional to the amount and duration of the error
• Algorithm calculates the accumulated proportional offset over time
• Leads to the controller approaching the required value more quickly, but contributes to system instability and may cause “overshoot”

m: output signal; Ti: integral time; e: error (expected – actual)

Derivative Mode

• Acts as a braking or damping action on the controller response as it overshoots
• Uses the slope of error vs. time (rate of error change)
• Controller may be slower to reach the required point (counters the work of the integral mode)

m: output signal; Td: derivative time; e: error (expected – actual)

Combining the three

• Output(t) = P + I + D

K_p = K; K_i = K/T_i; K_d = K·T_d
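A minimal discrete-time sketch of the combined controller, assuming the standard gain relations K_p = K, K_i = K/T_i and K_d = K·T_d and a fixed sample interval `dt`. The class name `PID` and its parameters are illustrative and do not come from the AutoTune toolkit.

```python
class PID:
    """Discrete PID: output = Kp*e + Ki*integral(e) + Kd*de/dt."""

    def __init__(self, k, t_i, t_d, dt):
        self.kp = k          # proportional gain
        self.ki = k / t_i    # integral gain
        self.kd = k * t_d    # derivative gain
        self.dt = dt
        self.integral = 0.0
        self.prev_error = None

    def step(self, setpoint, measured):
        error = setpoint - measured              # expected - actual
        self.integral += error * self.dt         # accumulated error
        if self.prev_error is None:
            derivative = 0.0                     # no slope on the first sample
        else:
            derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


pid = PID(k=2.0, t_i=1.0, t_d=0.5, dt=1.0)
first = pid.step(setpoint=1.0, measured=0.0)
second = pid.step(setpoint=1.0, measured=0.5)
```

On the second step the error has halved, so the derivative term is negative and damps the output, while the integral term keeps growing until the error reaches zero.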

Run-time Control agent

• Implements an error feedback controller

• Makes use of a (1) desired, and (2) actual system utilisation

• Kp and Ki matrices obtained by the controller design agent

• Controller performance
  – Time to recover from a workload change in the system

[Equation labels: e is the error between the actual and desired value at the kth interval; the summed term is the accumulated error; Kp is the proportional control gain and Ki is the integral control gain, which deals with steady-state error.]
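The error feedback law described by these labels can be sketched as a multi-input, multi-output PI controller. This is an assumption-laden illustration: the control law is taken to be u(k) = Kp·e(k) + Ki·(sum of earlier errors), the gain values are invented, and the class name `PIController` is hypothetical. In the slides the real gains come from the controller design agent.

```python
def mat_vec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


class PIController:
    def __init__(self, kp, ki, desired):
        self.kp = kp                      # 2x2 proportional gain matrix
        self.ki = ki                      # 2x2 integral gain matrix
        self.desired = desired            # e.g. [CPU, memory] targets
        self.accumulated = [0.0] * len(desired)

    def control(self, measured):
        error = [d - m for d, m in zip(self.desired, measured)]
        p_term = mat_vec(self.kp, error)
        i_term = mat_vec(self.ki, self.accumulated)  # errors before interval k
        self.accumulated = [a + e for a, e in zip(self.accumulated, error)]
        # One output per tuning parameter, e.g. [MaxClients, KeepAlive].
        return [p + i for p, i in zip(p_term, i_term)]


ctrl = PIController(kp=[[100.0, 0.0], [0.0, 10.0]],
                    ki=[[10.0, 0.0], [0.0, 1.0]],
                    desired=[0.5, 0.6])
u1 = ctrl.control([0.4, 0.6])   # CPU below target, memory on target
u2 = ctrl.control([0.5, 0.6])   # both on target; integral term remains
```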

Controller Design Agent

• Relies on output of modelling agent

• Aims to minimise a quadratic cost function J(Kp, Ki)
• Q, R are weighting matrices: Q is a 4x4 matrix and R is a 2x2 matrix
• Q = diag(q1,q2,q3,q4), and R = diag(r1,r2)
  – q1=1, q2=2, q3=(1/10^2), q4=(1/2^2) (10% random CPU fluctuation, and 2% memory)
  – r1=(1/50^2), r2=(1/1000^2)
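For orientation, a quadratic cost of this kind usually takes the standard LQR form below. This is a sketch under an assumption, not the slide's own equation: the state x(k) is taken to stack the two-element error vector e(k) (CPU, memory) with the two-element accumulated error v(k), and u(k) is the two-element change to MaxClients and KeepAlive. Under that assumption the diagonal entries q1..q4 weight the four state elements and r1, r2 weight the two control inputs.

```latex
J(K_P, K_I) \;=\; \sum_{k=1}^{\infty}
  \Big( x(k)^{\top} Q \, x(k) \;+\; u(k)^{\top} R \, u(k) \Big),
\qquad
x(k) = \begin{bmatrix} e(k) \\ v(k) \end{bmatrix},
\quad
Q = \operatorname{diag}(q_1, q_2, q_3, q_4),
\quad
R = \operatorname{diag}(r_1, r_2)
```

Larger entries in Q penalise deviation from the desired utilisation more heavily, while larger entries in R penalise aggressive changes to the tuning parameters.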

Implementation

• Undertaken with ABLE: extends the AutoTune agent
• Modelling agent
  – Data generator extends AutotuneController bean (extends the process() method)
  – ApacheAdaptor extends AutotuneAdaptor bean (implements socket connection with Apache Web server)
• Run-time Controller agent
  – Extends the AutotuneController bean
  – Also uses the ApacheAdaptor
• Controller Design agent
  – Extends the AutotuneController bean
  – Extends AutotuneAdaptor to read in model parameters from Modelling agent

Experiment setup

• Linux (v2.2.16), Apache HTTP v1.3.19
• MaxClients and KeepAlive parameters to be dynamically modifiable
• Multiple clients supporting workload generator
  – WAGON (Web trAffic GeneratOr and beNchmark) – Liu et al. (INRIA)
  – Httperf to generate synthetic HTTP requests
  – File access distributions from Webstone 2.5
• Static and Dynamic workloads used
  – Static: Web page requests, session arrivals followed a Poisson distribution (20 sessions/second)
  – Dynamic: Web page requests, session arrivals followed a Poisson distribution (10 sessions/second)
• Control Parameters
  – Control interval (adaptation time): 5 seconds
  – Goal: CPU=0.5 and Memory=0.6

Automatic tuning of Apache Web Server (about 50 control intervals to converge)

With Dynamic Workload (at 20th Interval) – takes 20 intervals to adjust

Types of system components

• Computer Servers

• Web Servers

• Database systems

• Devices
  – Pervasive Computing
  – Ubiquitous Computing

Upgrades and Problem Diagnosis

[Figure: faulty modules]

• Upgrade has 5 new autonomic modules

• Three modules found to be faulty (system reverts to old version)

• Analyse module dependencies

• Analyse log files to infer which of the three modules is the culprit

• Generate a “problem ticket” to software developer

QoS Management

• QoS has been explored in:
  – Computer Networks
    • Bandwidth, delay, packet loss rate and jitter.
  – Multimedia Applications
    • Frame rate and computation resource.
  – Grid Computing
    • Network QoS, computation and storage requirements.

Continue …

• QoS management:
  – Covers a range of different activities, from resource specification, selection and allocation through to resource release.
• QoS system should address the following:
  – Specifying QoS requirements
  – Mapping of QoS requirements to resource capability
  – Negotiating QoS with resource owners
  – Establishing contracts / SLAs with clients
  – Reserving and allocating resources
  – Monitoring parameters associated with QoS sessions
  – Adapting to varying resource quality characteristics
  – Terminating QoS sessions

• User Expectations vs. Resource Management

When is QoS needed?

• Interactive sessions
  – Computation steering (control parameters & data exchange)
  – Interactive visualization (visualization & simulation services)
• Response within a limited time span
• Co-scheduling or co-location support

[Figure from SCIRun, University of Utah]

  – Application QoS: user perception, response time, application security, etc.
  – Middleware QoS: computation, memory and storage
  – Network QoS: bandwidth, packet loss, delay, jitter

What is a Service Level Agreement (SLA) and why is it useful for AC?

• A relationship between a client and provider in the context of a particular capability (service) provision
• Distinguish between:
  – Discovery of a suitable provider (P2P search, directory service)
  – Establishment of an SLA
• Simplest exchange:
  – Client: “Can you do X for me for Y in return?” (SLA-Offer)
  – Provider: “Yes” (SLA-Accept; the alternative is SLA-Reject)
• SLA as a basis to support adaptive behaviour

What is an SLA?

• Counter-offer:
  – Client: SLA-Offer
  – Provider: “No, but I can do Z for Y” (SLA-CounterOffer)
  – Client: “Accept” (SLA-Accept; the alternative is SLA-Reject)
• Renewed offer:
  – Provider: “No”
  – Client: “Can you do Z for me for Y in return?” (a further SLA-Offer after the SLA-CounterOffer)
• Negotiation phase (single or multi-round)
• SLA-Offer dependency
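The offer, counter-offer, accept and reject messages above can be sketched as a single negotiation round. Everything named here (`Offer`, `Provider`, `negotiate`, the price floors) is hypothetical and only illustrates the message flow; it is not the WS-Agreement protocol or any toolkit API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    service: str   # the "X" being requested
    price: float   # the "Y" offered in return


class Provider:
    def __init__(self, price_floor):
        self.price_floor = price_floor   # service -> minimum acceptable price

    def respond(self, offer):
        floor = self.price_floor.get(offer.service)
        if floor is not None and offer.price >= floor:
            return "SLA-Accept", offer
        # "No, but I can do Z for Y": look for another service at that price.
        for service in sorted(self.price_floor):
            if service != offer.service and self.price_floor[service] <= offer.price:
                return "SLA-CounterOffer", Offer(service, offer.price)
        return "SLA-Reject", None


def negotiate(provider, offer, acceptable_services):
    """Return the agreed Offer, or None if no SLA is established."""
    message, proposal = provider.respond(offer)
    if message == "SLA-Accept":
        return proposal
    if message == "SLA-CounterOffer" and proposal.service in acceptable_services:
        return proposal   # client answers with SLA-Accept
    return None           # SLA-Reject from either side


provider = Provider({"X": 10.0, "Z": 5.0})
```

A multi-round negotiation would wrap `negotiate` in a loop in which the client revises its offer after each counter-offer.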

Variations

• Multi-provider SLA
  – Single SLA is divided across multiple providers (e.g. workflow composition)
• SLA dependencies
  – For an SLA to be valid, another SLA has to be agreed (e.g. co-allocation)

What is an SLA?

• Dynamically established and managed relationship between two parties
• Objective is “delivery of a service” by one of the parties in the context of the agreement
• Delivery involves:
  – Functional and non-functional properties of service
• Management of delivery:
  – Roles, rights and obligations of parties involved

Forming the Agreement

• Distinguish between:
  – Agreement itself
  – Mechanisms that lead to the formation of the agreement
• Mechanisms that lead to agreement:
  – Negotiation (single or multi-shot)
  – One-shot creation
  – Policy-based creation of agreements, etc.

SLA Life Cycle

• Identify Provider
  – On completion of a discovery phase
• Define SLA
  – Define what is being requested
• Agree on SLA terms
  – Agree on Service Level Objectives
• Monitor SLA Violation
  – Confirm whether SLOs are being violated
• Destroy SLA
  – Expire SLA
• Penalty for SLA Violation
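The monitoring and penalty stages of the life cycle can be sketched as a simple check of measured values against agreed objectives. The assumptions are loud ones: each SLO is modelled as an upper bound on a named metric, a flat penalty is charged per violated SLO, and a missing measurement counts as a violation. The names `check_slos` and the metric keys are illustrative.

```python
def check_slos(slos, measurements, penalty_per_violation):
    """Return (violated SLO names, total penalty) for one monitoring interval."""
    violated = [
        name
        for name, limit in slos.items()
        # A missing measurement is treated as a violation.
        if measurements.get(name, float("inf")) > limit
    ]
    return violated, penalty_per_violation * len(violated)


slos = {"response_time_ms": 200.0, "error_rate": 0.01}
observed = {"response_time_ms": 250.0, "error_rate": 0.005}
violated, penalty = check_slos(slos, observed, penalty_per_violation=10.0)
```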

WS-Agreement

• Framework for SLA creation: interface conforming to Web Services standards
• Service Client/Provider does not need to be a Web Service
• Provides a two-layered model:
  – Agreement layer: Web Service-based interface to create, represent and monitor agreements
  – Service layer: application-specific layer of service being provided

WS-Agreement

• Agreement Initiator may be Service Consumer or Service Provider

[Figure: two-layer model with the Agreement Layer above the Service Layer]

WS-Agreement Terms

• Agreement
  – Name/ID
  – Context: information about the agreement (initiator, responder, expiration time)
  – Terms Composition
    • Service Terms: information about the service; Service Description Terms (generally, these are domain dependent)
    • Guarantee Terms: information about the service level; Service Level Objectives, qualifying conditions for the agreement to be valid, penalty terms, etc.

From: Viktor Yarmolenko (U Manchester)

WS-Agreement

• Specification for Service Level Agreements
  – Developed through GRAAP WG at the Open Grid Forum
  – WSLA (from IBM): previous effort
• Provides:
  – Schema for agreement terms
  – A very simple protocol (two stage)
  – A state sequence
  – Support for penalty clauses

• No support for negotiation

WS-Agreement Specification Document (GFD.107)

http://GFD.107.pdf/

Data Center Scenario … 1

• Identical servers, dynamically allocated among multiple Web apps
• For each application:
  – Application Manager (performance optimisation) interacting with a Resource Arbiter (server allocation)
  – Optimisation goal (“expected business value”) defined by an “objective function”
• Resource Arbiter goal:
  – Allocate servers to maximise sum of expected business value over all applications
  – Local value functions must share a common scale

Data Center Scenario … 2

• Use of Reinforcement Learning
• Resource Arbiter goal: allocate servers to maximize the sum of expected business value over all applications (assuming a common scale)
• Vi(.): utility curve, an estimate of expected business value (e.g. payments minus penalties)
• Arbiter assigns each application its list of assigned servers

A Hybrid Reinforcement Learning Approach to Autonomic Resource Allocation, Gerald Tesauro et al., Proceedings of ICAC 2006, Dublin, Ireland.
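The arbiter's goal can be sketched as a small optimisation: given one value function per application, find the split of identical servers that maximises the summed value. This sketch uses plain dynamic programming over server counts and is not the reinforcement learning approach of the cited paper; the function `allocate` and the example value curves are hypothetical.

```python
def allocate(value_fns, total_servers):
    """Return (best total value, servers per application).

    value_fns[i](n) is application i's expected business value with n servers,
    all expressed on a common scale.
    """
    # best[s] = (value, allocation) for the applications seen so far,
    # using at most s servers.
    best = [(0.0, [])] * (total_servers + 1)
    for value in value_fns:
        new_best = []
        for s in range(total_servers + 1):
            candidates = [
                (best[s - n][0] + value(n), best[s - n][1] + [n])
                for n in range(s + 1)
            ]
            new_best.append(max(candidates, key=lambda c: c[0]))
        best = new_best
    return best[total_servers]


def app_a(n):
    return 10.0 * min(n, 2)   # saturates after two servers


def app_b(n):
    return 4.0 * n            # linear benefit per server


total_value, servers = allocate([app_a, app_b], total_servers=5)
```

The common scale matters here: the comparison between `app_a` and `app_b` is only meaningful because both return values in the same units.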

Not all SLAs are equal

• Application: events carrying stock trade data
• Customer classes:
  – Gold customers: pay for data
  – Public customers: connected over Internet
• Public customers get less information than Gold
• Gold customers expect reliable delivery
  – Need for acks, increasing overhead in system
• Cannot alter flow rate to tolerate delays
  – But can support “admission” control

Utility: abstract measure of benefit to user (seek to maximize this given available resources)

SLA Classes

Risk-Aware Limited Lookahead Control for Dynamic Resource Provisioning in Enterprise Computing Systems, Dara Kusic and Nagarajan Kandasamy, Proceedings of ICAC 2006, Dublin, Ireland.

Assumes the existence of multiple QoS classes

Control System Architecture

• r_alloc: rate allocated to a flow when it enters the system
• n_alloc: number of consumers (admitted for each class)

Utility-aware Resource Allocation in an Event Processing System, Sumeer Bhola, Mark Astley, Robert Saccone and Michael Ward, Proceedings of ICAC 2006, Dublin, Ireland.

Control System Strategies

• Assumes knowledge of some “good” (ideal) state
• Move system towards the good/ideal state
• Impacted by:
  – Response time (current good state transition)
  – Variability in operational environment (stability of approach)
  – Execution time
  – Discrete domain (tuning options from a finite set)
• Feedback control
  – PID
  – Kalman filter
• Neural network-based control
  – Use of learning approaches
• Rule-based approaches
  – Use of event recognition and triggers

Kalman Filters

• Discrete time linear dynamic systems
• Modelled on a Markov chain (with noise)
• Linear operator applied to state to generate a new state
  – F_k: state transition model applied to the previous state x_{k-1}
  – B_k: control input model applied to the control vector u_k
  – w_k: process noise (normally distributed)

http://upload.wikimedia.org/wikipedia/en/b/b6/Kalman_filter_model.png
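A scalar sketch of one filter step, assuming the standard textbook state model x_k = F·x_{k-1} + B·u_k + w_k built from the terms listed above. The measurement side (observation `z`, observation model `h`, measurement noise variance `r`) is not described in the slides and is added here from the standard Kalman filter formulation; the function name `kalman_step` and the default values are illustrative.

```python
def kalman_step(x, p, u, z, f=1.0, b=0.0, h=1.0, q=1e-3, r=1e-1):
    """One predict/update cycle of a scalar Kalman filter.

    x, p: previous state estimate and its variance
    u, z: control input and new measurement
    q, r: process and measurement noise variances
    """
    # Predict: apply the state transition and control input models.
    x_pred = f * x + b * u
    p_pred = f * p * f + q
    # Update: blend prediction and measurement using the Kalman gain.
    gain = p_pred * h / (h * p_pred * h + r)
    x_new = x_pred + gain * (z - h * x_pred)
    p_new = (1.0 - gain * h) * p_pred
    return x_new, p_new


x_est, p_est = kalman_step(x=0.0, p=1.0, u=0.0, z=1.0, q=0.0, r=1.0)
```

With equal prediction and measurement variance, as in this call, the gain is 0.5 and the new estimate sits halfway between prediction and measurement.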

Differentiated Quality of Service

[Figure, from Joe Bigus (IBM): Silver, Gold and Platinum customers send requests to a SAN Manager, which applies the Silver, Gold or Platinum policy when allocating SAN storage.]

SAN Manager Scenario Overview

• Uses new AbleRuleAgent as rules-based policy manager
• Models multiple quality of service levels (represented by rule sets)
• N systems are defined, each with associated QoS levels
• Requests include system identifier and current utilization
• The SAN Manager:
  – Looks up QoS for that system
  – Invokes the corresponding QoS rule set
  – Rule sets make recommendations that allocations are either unchanged, increased or decreased
  – SAN Manager evaluates recommendations and changes allocations based on total capacity limit


Platinum QoS RuleSet

    // Low allocation
    : if Allocation is Low and Utilization is Low then RecommendedAction = NoAction;
    : if Allocation is Low and Utilization is Normal then RecommendedAction = NoAction;
    : if Allocation is Low and Utilization is High then RecommendedAction = IncreaseAllocation;

    // Normal allocation
    : if Allocation is Normal and Utilization is Low then RecommendedAction = DecreaseAllocation;
    : if Allocation is Normal and Utilization is Normal then RecommendedAction = NoAction;
    : if Allocation is Normal and Utilization is High then RecommendedAction = IncreaseAllocation;

    // High allocation
    : if Allocation is High and Utilization is Low then RecommendedAction = DecreaseAllocation;
    : if Allocation is High and Utilization is Normal then RecommendedAction = DecreaseAllocation;
    : if Allocation is High and Utilization is High then RecommendedAction = Send.Warning_LowMem;
    : if Allocation is positively High and Utilization is positively High then RecommendedAction = Send.Warning_CritMem;
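The rule set and the SAN Manager's final step can be sketched as a lookup table plus a capacity check. This is a crisp simplification under stated assumptions: the fuzzy hedge “positively High” rule is omitted, allocations change by a fixed hypothetical `step`, and decreases are applied before increases so that freed capacity can be reused. The names `PLATINUM_RULES`, `recommend` and `apply_recommendations` are illustrative and are not part of ABLE.

```python
PLATINUM_RULES = {
    # (allocation level, utilization level) -> recommended action
    ("Low", "Low"): "NoAction",
    ("Low", "Normal"): "NoAction",
    ("Low", "High"): "IncreaseAllocation",
    ("Normal", "Low"): "DecreaseAllocation",
    ("Normal", "Normal"): "NoAction",
    ("Normal", "High"): "IncreaseAllocation",
    ("High", "Low"): "DecreaseAllocation",
    ("High", "Normal"): "DecreaseAllocation",
    ("High", "High"): "Send.Warning_LowMem",
}


def recommend(allocation, utilization, rules=PLATINUM_RULES):
    return rules[(allocation, utilization)]


def apply_recommendations(allocations, recommendations, step, capacity):
    """Apply rule-set recommendations without exceeding total capacity."""
    new = dict(allocations)
    for system, action in recommendations.items():
        if action == "DecreaseAllocation":
            new[system] = max(0, new[system] - step)
    for system, action in recommendations.items():
        if action == "IncreaseAllocation" and sum(new.values()) + step <= capacity:
            new[system] += step
    return new


current = {"a": 40, "b": 50}
actions = {"a": recommend("Low", "High"), "b": recommend("High", "Low")}
updated = apply_recommendations(current, actions, step=10, capacity=100)
```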



Dynamic SLA

• Limitations of a single agreement
  – Modifications since agreement was in place
• Cost of doing re-establishment
  – Not fully aware of operating environment
• Flexibility in describing Service Level Objectives
  – Not sure what to ask for (not fully aware of the environment in which operating)
  – Too many violations

Dynamic WS-Agreement

• Case 1: Static Agreement
  – Identify Service Description Terms,
  – Guarantee Terms, and
  – Service Level Objectives (SLOs)
• Case 2: Dynamic Agreement
  – Identify Service Description Terms,
  – Guarantee Terms: defined as ranges or as functions
  – Service Level Objectives: defined as ranges or as functions

From: Viktor Yarmolenko

Function-based SLA (Yarmolenko et al.)

• Express initial SLA-Offer as a function of provider capability




Guarantee terms as functions






SLA Classes

• Guaranteed
  – constraints to be exactly observed
  – SLA is precisely/exactly defined
  – adaptation algorithm/optimization heuristics
• Controlled-load
  – some constraints may be observed
  – Range-oriented SLA
  – optimization heuristics
• Best-effort
  – any resources will do
  – no adaptation support

SLA Adaptation

• Assume total capacity: C = CG + CA + CB (guaranteed, adaptive and best-effort capacity)
• ‘Best effort’ can use the adaptive capacity, as long as it is not used by the ‘guaranteed’ class
• When QoS degrades for ‘guaranteed’, the adaptive capacity is utilized to compensate for the degradation
• ‘Best effort’ can still utilize the remaining adaptive capacity, as long as it is not used by the ‘guaranteed’ class
• When the congested capacity is restored, the adaptive capacity can be used entirely by ‘best effort’
• Before invoking the adaptive function:
  – Ensure that the request at time t does not exceed what was agreed in the SLA
  – Ensure that the total capacity within all SLAs at time t does not exceed CG
• Aim: compensation for QoS degradation for the ‘guaranteed’ class only

[Figure: capacity bars showing how the G, A and B partitions are shared across the stages above]
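The adaptation scheme can be sketched as a function that decides how much adaptive capacity the ‘guaranteed’ class borrows. The model is an assumption made for illustration: degradation is expressed as an amount of guaranteed capacity that is temporarily lost, demand is capped at the agreed guaranteed capacity, and whatever adaptive capacity is not borrowed goes to ‘best effort’. The function name `share_capacity` and its parameters are hypothetical.

```python
def share_capacity(c_g, c_a, c_b, guaranteed_demand, degraded=0.0):
    """Split capacity between the 'guaranteed' and 'best effort' classes.

    c_g, c_a, c_b: guaranteed, adaptive and best-effort capacity
    degraded: guaranteed capacity currently lost (e.g. through congestion)
    """
    agreed = min(guaranteed_demand, c_g)      # never serve beyond the SLA
    usable_g = c_g - degraded
    shortfall = max(0.0, agreed - usable_g)
    borrowed = min(shortfall, c_a)            # adaptive capacity compensates
    return {
        "guaranteed": min(agreed, usable_g) + borrowed,
        "best_effort": c_b + (c_a - borrowed),
    }


normal = share_capacity(50, 20, 30, guaranteed_demand=50)
congested = share_capacity(50, 20, 30, guaranteed_demand=50, degraded=15)
severe = share_capacity(50, 20, 30, guaranteed_demand=50, degraded=30)
```

In the `severe` case the adaptive capacity is exhausted, so the ‘guaranteed’ class still falls short, while ‘best effort’ keeps only its own partition.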

Grid Node

[Figure: a Grid node exposing a Grid QoS service interface; the QoS Grid Service contains the Reservation Manager, Allocation Manager and Policy Manager, placed above the resources]

Main components

• Policy Manager
  – To provide dynamic info about the domain-specific resource characteristics and policy
• Reservation Manager
  – To provide advance/immediate resource reservation
    • Data structure contains reservation entries
    • Interacts with policy manager for resource characteristics
• Allocation Manager
  – To interact with the underlying resource manager for resource allocation (e.g. DSRT, Bandwidth Broker)

[Figure: the client's application contacts a QoS Broker, which performs QoS discovery through UDDIe and establishes SLAs with the QoS services on Grid nodes 1 to 3; each QoS service has reservation, allocation and policy components. Joint work with Argonne National Lab. (Gregor von Laszewski et al.)]

Reservation Approaches

• Resource reservation / allocation based on two strategies:
  – Time-domain: reserve the whole ‘compute’ power of Grid node.
    • Guaranteed exclusive access
  – Resource-domain: reserve a CPU slot of the Grid node.
    • Shared access, guaranteed resource capacity
    • Suitable for lightweight applications/services.

G-QoSM Architecture

[Figure: G-QoSM architecture. Clients (applications, portals, Swing, legacy) use the CoG QoS Broker, which is built on the Java CoG Kit Core with GT2, GT3, UDDIe and reputation handlers plus a QoS handler, and is connected to UDDIe and a reputation service. Through a service agreement the broker reaches the CoG QoS Grid Service (Reservation Manager, Allocation Manager, Policy Manager), which works with resource managers for CPU, network and disk resources.]

G-QoSM

Implementation Status

• References:
  – Rashid Al-Ali, Kaizar Amin, Gregor von Laszewski, Omer Rana and David Walker. An OGSA-Based Quality of Service Framework. Proceedings of the Second International Workshop on Grid and Cooperative Computing (GCC 2003), Shanghai, China, December 2003.
  – Rashid Al-Ali, Omer Rana, David Walker, Sanjay Jha and Shaleeza Sohail. G-QoSM: Grid Service Discovery Using QoS Properties. Computing and Informatics Journal, Special Issue on Grid Computing, 21 (4), 2002.
• The QoS implementation is open source and available for download from the Java CoG site: http://www.globus.org/cog/java

Application Integration

1. Prepare: QoS negotiation Task (returns: Agreement ID)

2. Prepare: QoS job submission Task

3. Submit job to QoS service

QoS Job Submission Task

    // token and qosServiceURL are assumed to be fields of the enclosing class
    private void prepareQosJobSubmissionTask() {
        // create a QoS job submission Task
        Task task = new TaskImpl("myTask", QoS.JOBSUBMISSION);
        task.setAttribute("agreementToken", token);

        // create a remote job specification
        JobSpecification spec = new JobSpecificationImpl();

        // set all the job related parameters
        spec.setExecutable("/rashid/myExecutable");
        spec.setRedirected(false);
        spec.setStdOutput("QosOutput");

        // associate the specification with the task
        task.setSpecification(spec);

        // create a Globus version of the security context
        SecurityContextImpl securityContext = new GlobusSecurityContextImpl();
        securityContext.setCredential(null);
        task.setSecurityContext(securityContext);

        // point the task at the QoS Grid service
        Contact contact = new Contact("myQoScontact");
        ServiceContact service = new ServiceContactImpl(qosServiceURL);
        contact.setServiceContact("QGSurl", service);
        task.setContact(contact);
    }

QoS Task Submission

    /*** QoS: Task Submission to QoS Handler ***/
    private void QosTaskSubmission(Task task) {
        TaskHandler handler = new QoSTaskHandlerImpl();

        // submit the task to the handler
        handler.submit(task);
    }

[Figure: With Globus Toolkit 2; plotted series: Best Effort, Guaranteed]

Web Services Distributed Management (WSDM)

• Management USING Web Services (MUWS)
  – Web services to describe and access manageability of resources
  – Management applications use Web services just like other applications use Web services
• Management OF Web Services (MOWS)
  – An application of Management Using Web Services for the Web Service as the IT resource
• Use Web Services as the distributed computing platform to enable interoperability between managers and manageable resources


Disturbance Benchmarking

From Aaron Brown and Peter Shum (IBM)

• Useful to compare this with performance benchmarks, which we are much more aware of
• Compare with automated testing mechanisms

Behaviours and Interactions

• Interactions not “hard coded”, but expressed as high level objectives, e.g.
  – Maximise this utility function
  – Find a reputable message translation service
• Autonomic Service providers can say “No”
  – Service provision must be consistent with local policy and long term goals

• Policies may be expressed using logic or other formalisms

Emergence and Self-Organisation

• Increased complexity and autonomy implies that “global” coherent behaviours may be hard to specify

• Concept of “Emergence”
• Interactions between autonomous systems that can lead to useful global behaviours
  – How can we constrain each individual element within such a system?
  – How can useful global behaviours be recognised effectively?

Self Organisation

• Self-Organisation is a set of dynamical processes whereby structure or order appears at the global level of a system from the interactions between its lower-level entities. The rules that underlie the behaviour and specify the interactions among the entities are implemented on the basis of local information, without any reference to the global pattern.

Emergence

• A dynamic, non-linear process in which “macro-level” structures form, based on interactions of system parts at the micro-level.

• Such emergence is “novel” – i.e. cannot be easily understood by taking the system apart and looking at the parts (reductionism)

Issues

• Macro-Micro effect
• Novelty
  – Global behaviour is novel
• Coherence
  – Emergence has some sense of identity (i.e. persists over some time)
• Dynamic
  – Emergence arises as the system evolves over time
• Non-Linear
• Distributed/Non-Centralised Control
  – Not possible to control the entire system

Influences

• Social Societies
  – Emerging area of “Socionics”
• Biological Paradigms (Stigmergy)
  – Ant Colonies (Social Insects)
  – Swarms
• Particle Systems (fluidity and elasticity)
  – Chemical reactions
  – Spin Glass theory (due to temperature changes)

Concepts of Utility

• What is considered “important”

• Value assigned to actions and operations

• Utility– Cost– Performance – Availability

• Some kind of “measurable” metric

Utility … 2

• Payoff function
  – assess behaviour of a particular action (reward signal)
• Analysis tool
  – relationship between local utility vs. utility of the community
• Cost function
  – success w.r.t. a particular task
• Trust measure
  – measure of trust in a particular participant

Economic Utility: Metrics “Pyramid”

Utility Optimisation

• Expected Utility – E(x)
• Finite Horizon
• Infinite Horizon
  – Uses a discount factor strictly between 0 and 1
  – Long term rewards less useful
• “U” may be negative
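The two horizons can be sketched numerically. The assumption here is the usual formulation: the finite-horizon utility is the plain sum of per-step utilities, and the infinite-horizon utility weights the utility at step t by gamma**t with 0 < gamma < 1, so that long-term rewards count for less. The symbol `gamma` and both function names are illustrative, and the infinite sum is truncated to the rewards supplied.

```python
def finite_horizon_utility(utilities):
    """Sum of per-step utilities over a fixed number of steps."""
    return sum(utilities)


def discounted_utility(utilities, gamma):
    """Discounted sum: the utility at step t is weighted by gamma**t."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    return sum(u * gamma ** t for t, u in enumerate(utilities))


plain = finite_horizon_utility([1.0, 1.0, 1.0])
discounted = discounted_utility([1.0, 1.0, 1.0], gamma=0.5)
with_penalty = discounted_utility([-2.0, 4.0], gamma=0.5)  # U may be negative
```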

Social Insect Behaviour

• Self-organising Behaviour
  – The idea of simple behaviours interacting in a manner that produces a range of interesting complex behaviours is very useful and exciting for designing complex systems:
    • Positive Feedback (Autocatalytic): Recruitment and Reinforcement
    • Negative Feedback: Saturation, Exhaustion, or Competition
    • Fluctuations and Randomness: Random Walks, Errors, Random Task-Switching etc.
    • Multiple Interactions
• Stigmergic Behaviour
• Waggle and Tremble dances (Bees)

From: Ashish Umre

Stigmergy

• Indirect communication via interaction with environment [Grassé, 59]
  – Sematectonic [Wilson, 75] stigmergy
    • action of agent directly related to problem solving and affects behavior of other agents.
  – Sign-based stigmergy
    • action of agent affects environment not directly related to problem solving activity.

Self-organised behaviour can be characterised by key properties like -

• The creation of spatiotemporal structures in an initially homogeneous medium, e.g. Nest Architectures, foraging trails, or social organisation.

• Multistability - possible coexistence of several stable states

• Existence of Bifurcations when some parameters are varied. (“Snowball effect”).

From: Ashish Umre

What do Ants do?

• A few examples of collective behaviour that have been observed in several species of ants are: regulating nest temperature within limits of 1 °C; forming bridges; raiding particular areas of food; building and protecting their nest; sorting brood and food items; co-operating in carrying large items; emigration of a colony; complex patterns of egg and brood care; finding the shortest routes from nest to a food source; preferentially exploiting the richest available food source; task partitioning and division of labour.

From: Ashish Umre

Ants in Nature

From: Ashish Umre

Adapting to Environment Changes

Pheromone Trails

[Figure: pheromone trail example on a graph with nodes A, B, C, D, E and H, offering two alternative branches with segment lengths d = 0.5 and d = 1.0. At T = 0, 30 ants arrive from each end and split 15/15 between the two branches; at T = 1 the split is 20/10.]

What do Bees do?

• Foraging Behaviour (Waggle Dance)

• Task Partitioning and Division of Labour

• Scout-Recruit Concept (Tremble Dance)

• Group Decision Making and Colony Cooperation

• Regulating Hive temperature

• Communication : Food sources are exploited according to quality and distance from the hive

Waggle Dance

From: Ashish Umre

Wasps

• Pulp foragers, water foragers & builders
• Complex nests
  – Horizontal columns
  – Protective covering
  – Central entrance hole

Pervasive Ants: Resource Discovery in Dynamic and Reconfigurable Networks using Artificial Ants

• Ants continuously explore new solutions
• Pulses (“drumming”) used to update resource tables
  – Drumming belongs to the modulatory communication signal category, and is seen in the European carpenter ants Camponotus herculeanus and C. ligniperda. The worker ants strike the surface of the wooden chambers and galleries in which they live with their mandibles and gasters, producing vibrations that can be perceived by nestmates for 20 centimetres or more. Much of the behaviour is classifiable as direct alarm communication. The behaviour of some categories is “tightened up”: transition probabilities are raised, and hence uncertainty is reduced. Modulatory communication appears to be a primitive phenomenon in ants and other social insects.

• Adaptive to continuous node failure and addition of new nodes and resources, and change in traffic conditions

From: Ashish Umre

Ant-Based Control Introduction

• Ant Based Control (ABC) was introduced to route calls on a circuit-switched telephone network
  – ABC is the first SI (swarm intelligence) routing algorithm for telecommunications networks
  – 1996

R. Schoonderwoerd, O. Holland, J. Bruten, L. Rothkrantz, Ant-based load balancing in telecommunications networks, 1996.

ABC: Overview

• Ant packets are control packets
• Ants discover and maintain routes
  – Pheromone is used to identify routes to each node
  – Pheromone determines path probabilities
• Calls are placed over routes managed by ants
• Each node has a pheromone table maintaining the amount of pheromone for each destination it has seen
  – Pheromone Table is the Routing Table

ABC: Route Maintenance

• Ants are launched regularly to random destinations in the network

• Ants travel to their destination according to the next-hop probabilities at each intermediate node
  – With a small exploration probability an ant will choose a next hop uniformly at random

• Ants are removed from the network when they reach their destination
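The next-hop choice described above can be sketched as follows. The assumptions are mine: the pheromone table row for the ant's destination is a dictionary from neighbour to probability, the exploration probability defaults to a hypothetical 0.05, and `rng` is any object offering `random`, `choice` and `choices` (the standard `random` module or a `random.Random` instance).

```python
import random


def choose_next_hop(probabilities, exploration=0.05, rng=random):
    """Pick the next hop for an ant from one pheromone table row.

    probabilities: neighbour -> routing probability for the ant's destination
    """
    neighbours = list(probabilities)
    if rng.random() < exploration:
        # Exploration: ignore the pheromone and pick uniformly at random.
        return rng.choice(neighbours)
    # Otherwise follow the pheromone-weighted probabilities.
    weights = [probabilities[n] for n in neighbours]
    return rng.choices(neighbours, weights=weights)[0]


row_for_destination = {"node4": 0.7, "node5": 0.2, "node6": 0.1}
hop = choose_next_hop(row_for_destination, rng=random.Random(42))
```

Exploration keeps rarely used routes from being forgotten, which is what lets the scheme react when a better route appears.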

ABC: Routing Probability Update

• Ants traveling from source s to destination d lay s’s pheromone
  – Ants lay a pheromone trail back to their source as they move
  – Pheromone is unidirectional

• When a packet arrives at node n from previous hop r, and having source s, the routing probability to r from n for destination s increases

Ant Algorithm

An ant is launched at node 3 with destination node 2, and has just travelled from node 4 to node 1. This ant will first alter node 1’s table corresponding to node 3 (its source node) by increasing the probability of selection of node 4; it will then select its next node randomly according to the probabilities in the table corresponding to its destination node, node 2.

• Every node has a pheromone table for every destination node in the network
• A node with four neighbours in a 30-node network has 29 pheromone tables with four entries each.

Ants going from node 1 to 3

Updating Pheromone table

• Ants can be launched from any node

• Select next node according to probabilities in the pheromone table for their destination nodes

• When ants arrive at a node – they update the probabilities of that node’s pheromone table (corresponding to their source node)

• Alter table to increase probability pointing to their previous node

• On reaching destination – ants die

Update law

• P = new probability (or pheromone) increase

• Probability can be reduced by operation of normalization (increase in another cell in table)

• Prob. can approach zero but never reaches it

Ant Algorithm

$$r^{i}_{m,s}(t+1) = \frac{r^{i}_{m,s}(t) + \delta r}{1 + \delta r}$$

This equation specifies the new reinforced weight for the relevant node m that corresponds to the ant’s last node (in node i’s table for source s)

$$r^{i}_{l,s}(t+1) = \frac{r^{i}_{l,s}(t)}{1 + \delta r}, \quad l \neq m$$

This equation specifies the weight for all other entries l, which do not correspond to the ant's last node

$$\delta r = \frac{0.25}{\text{age}}$$

This equation specifies the reinforcement parameter that is employed in the first two equations
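The update law can be checked numerically with a short Python sketch. It assumes the usual ABC form: the entry for the ant's previous node becomes (p + delta) / (1 + delta), every other entry becomes p / (1 + delta), and delta is read here as 0.25 / age. The function name `reinforce`, the node labels and the ages are hypothetical.

```python
def reinforce(row, previous_hop, age, k=0.25):
    """Return a new pheromone row after an ant arrives from previous_hop.

    row maps neighbour -> probability in the table for the ant's source.
    delta = k / age is an assumed reading of the reinforcement parameter.
    """
    delta = k / age
    updated = {}
    for neighbour, p in row.items():
        if neighbour == previous_hop:
            # Reinforce the entry pointing back to where the ant came from.
            updated[neighbour] = (p + delta) / (1 + delta)
        else:
            # All other entries shrink, so the row still sums to 1.
            updated[neighbour] = p / (1 + delta)
    return updated


# Hypothetical node with four neighbours and a uniform initial row.
row = {"n2": 0.25, "n4": 0.25, "n5": 0.25, "n6": 0.25}
young = reinforce(row, "n4", age=1)    # short route: large reinforcement
old = reinforce(row, "n4", age=10)     # long or delayed route: small one
```

Because both cases divide by the same factor, the row stays normalised without a separate step, and an entry can approach zero but never reach it.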

From: Ashish Umre

Ageing

• Delta_p changes with the age of the ant

– Age == path length (each hop increases the ant’s age)

– Ants moving along shorter routes have lower age, and so lay more pheromone

– Age == delay of ants at nodes that are congested

– Delayed ants age more quickly

• As the flow rate of ants to neighbours decreases, delayed ants are prevented from affecting the neighbours’ pheromone tables

ABC: Route Selection (Call Placement)

• When a call is originated, a circuit must be established

• The highest probability next hop is followed to the destination from the source

• If no circuit can be established in this way, the call is blocked

• Calls operate independently of ants
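Call placement is deterministic, unlike ant movement: at every node the call follows the single most probable next hop. The Python sketch below illustrates this under stated assumptions: the nested-dictionary table layout and the function name `place_call` are invented for the example, and a call is treated as blocked when a node has no entry for the destination or the greedy path loops back on itself (the slides do not say how failure is detected).

```python
def place_call(tables, source, destination):
    """Follow the highest-probability next hop from source to destination.

    tables[node][destination] maps neighbour -> probability.
    Returns the list of nodes on the circuit, or None if the call is blocked.
    """
    path = [source]
    node = source
    while node != destination:
        row = tables.get(node, {}).get(destination)
        if not row:
            return None            # no routing information: call blocked
        node = max(row, key=row.get)
        if node in path:
            return None            # routing loop: call blocked
        path.append(node)
    return path


# Hypothetical four-node network; only the rows for destination "D" are shown.
tables = {
    "A": {"D": {"B": 0.8, "C": 0.2}},
    "B": {"D": {"A": 0.1, "D": 0.9}},
    "C": {"D": {"A": 0.5, "D": 0.5}},
}
circuit = place_call(tables, "A", "D")

# Tables that point at each other can never reach "D".
looping = {"A": {"D": {"B": 1.0}}, "B": {"D": {"A": 1.0}}}
blocked = place_call(looping, "A", "D")
```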

ABC: Initialization

• Pheromone tables are randomly initialized
• Ants are released onto the network to establish routes
• When routes are sufficiently short, actual calls are placed onto the network
• Calls and ants dynamically interact
• New calls influence the load on nodes, which in turn influences the ants by means of a delay mechanism

Relationship between calls, node utilisation, pheromone tables and ants. An arrow indicates the direction of influence

From: Ashish Umre

Average Packet Delay (With the Algorithm)

From: Ashish Umre

Average Packet Delay(Without Algorithm)

From: Ashish Umre

Packet and Pulse Loss (With the Algorithm)

From: Ashish Umre

Packet and Pulse Loss (Without the Algorithm)

From: Ashish Umre

Design Concerns

• Swarm Intelligent Systems are hard to ‘program’ since the problems are usually difficult to define
– Solutions are emergent in the systems
– Solutions result from behaviors and interactions among and between individual agents

Summary of ABC

• Ants are regularly launched with random destinations
• Ants walk randomly according to probabilities in pheromone tables for their particular destination
• Ants update the probabilities in the pheromone table for the location they were launched from, by increasing the probability of selection of their previous location by subsequent ants.
• The increase in these probabilities is a decreasing function of the age of the ant, and of the original probability.
• This probability increase could also be a function of penalties or rewards the ant has gathered on its way.
• The ants get delayed on parts of the system that are heavily used.
• The ants could eventually be penalised or rewarded as a function of local system utilisation.
• To avoid overtraining through freezing of pheromone trails, some noise can be added to the behaviour of the ants.

Possible Solutions to Create Swarm Intelligence Systems

• Create a catalog of the collective behaviours
• Model how social insects collectively perform tasks
– Use this model as a basis upon which artificial variations can be developed
– Model parameters can be tuned within a biologically relevant range or by adding non-biological factors to the model

What are Ad Hoc Networks?

• Ad hoc networks are
– self-organising multi-hop wireless networks;
– no fixed infrastructure, such as base stations or routers, is required;
– ad hoc networks are rapidly deployable networks;
– all mobile hosts are embedded with packet forwarding capabilities;

From: Ashish Umre

Current Routing Algorithms for Ad hoc Mobile Wireless Networks

• Table-Driven Routing Protocols:
– Destination-Sequenced Distance Vector Routing (DSDV)
– Clusterhead Gateway Switch Routing (CGSR)
– The Wireless Routing Protocol (WRP)

• Source-Initiated On-Demand Routing:
– Ad hoc On-Demand Distance Vector Routing (AODV)
– Dynamic Source Routing (DSR)
– Temporally-Ordered Routing Algorithm (TORA)
– Associativity-Based Routing (ABR)
– Signal Stability Routing (SSR)

From: Ashish Umre

Four Ingredients of Self Organization

• Positive Feedback

• Negative Feedback

• Amplification of Fluctuations - randomness

• Reliance on multiple interactions

Positive Feedback

Positive Feedback reinforces good solutions

• Ants are able to attract more help when a food source is found

• More ants on a trail increases pheromone and attracts even more ants

Negative Feedback

Negative Feedback removes bad or old solutions from the collective memory

• Pheromone Decay

• Distant food sources are exploited last
– Pheromone has less time to decay on closer solutions

Randomness

Randomness allows new solutions to arise and directs current ones

• Ant decisions are random
– Exploration probability

• Food sources are found randomly

• Initially an ant will attempt to follow a random path to “explore” possible food sources

Multiple Interactions

No individual can solve a given problem. Only through the interaction of many can a solution be found

• One ant cannot forage for food; pheromone would decay too fast

• Many ants are needed to sustain the pheromone trail

• More food can be found faster
• “Swarm” behaviour

Stigmergy in Action

This general “Clustering” behaviour is a key theme in such approaches

Ants Agents

• Stigmergy can be operational
– Coordination by indirect interaction is more appealing than direct communication
– Stigmergy reduces (or eliminates) communications between agents

SI Advantages for Routing

SI based algorithms generally enjoy:

• Multipath routing
– Probabilistic routing will send packets all over the network
• Fast route recovery
– Packets can easily be sent to other neighbors by recomputing next-hop probabilities
• Low Complexity
– Little special purpose information must be maintained aside from pheromone/probability information

More SI Advantages for Routing

• Scalability
– As with ant colonies numbering in the millions, SI algorithms can potentially scale across several orders of magnitude
• Distributed Algorithm
– SI based algorithms are inherently distributed

SI Disadvantages for Routing

SI also suffers from:

• Directional Links
– Bidirectional links are generally assumed by using reverse paths
• Novelty
– SI is a relatively new approach to routing. It has not been characterized very well analytically

Pharaoh Ant (Monomorium pharaonis)

• Colony Behaviours
– Multiple Queening
– Nest Conflict and Cooperation
– Migration
– Clustering
• Analogies
– Resource Allocation, Discovery and Sharing
– Adaptive Clustering

From: Ashish Umre

Current Issues in Mobile Agent Technologies

• Application Issues
– Jumping Agents (Shopping, Taxi/Airport)
– Location Sensitive (Bluetooth, HomeRF)
– Profile Oriented
• Deployment Issues
– Is the Infrastructure ready?
• Security Issues
– Physical Mobility
– Logical Mobility

From: Ashish Umre

Mobile Agents

• Generalizing the “ant” based approach as a mobile agent
• A paradigm based on code mobility
– Remote Evaluation
– Code-on-demand (the Java Applet model)
– Peer-2-Peer
• Migrate from one host to another “autonomously”
– “Intelligent Viruses”? (do we really want these?)
– Lead to security nightmares
– Require writing in obscure languages (Tcl, Java etc)
• Provide an interesting paradigm for Grid computing
– Assuming other Grid infrastructure is there

How do they differ from other DC paradigms

• Host supported mobility vs. autonomous migration
– weak vs. strong mobility (Bradshaw and Suri’s work on Nomad, vs. Aglets or Voyager)
• What’s in a message?
– state
– code or data
• How large should a mobile agent be?
• Tracking a mobile agent (forwarders, location service, pheromone trails)
• Host assisted
– state persistence (vs. soft state)
– introspection

The overhyped differences between mobile objects and agents

• Mobile objects do not migrate autonomously
– control transfer issues
• Mobile objects generally part of some application
– limited or no access to a separate execution context
• Mobile object granularity is generally much finer
– agents must carry code to interact with host (context or place)
• Mobile objects do not support a well defined API
– such as moveTo, retract, dispatch etc
• Division of application into agents vs. objects will be different
• Absence of any standard framework

The overhyped reasons for why mobile agents are (apparently) useful

• Reduce network load
• Overcome network latency
• Can encapsulate a protocol
• Can execute autonomously and asynchronously
• Can dynamically adapt their itinerary
• May be heterogeneous
• Are robust and can sustain faults in their environment

and why not …

• all of the above can be done via messaging
• too many security issues to be useful
• unlikely to be supported by host platforms (standardisation has not resulted in anything useful)
• too hard to code, and abstraction is not obvious

Standardisation

• MASIF (Mobile Agent System Interoperability Facility)
– Crystaliz, General Magic, IBM, GMD Fokus, Open Group

• Address interface between agent systems, and not agent applications

• MASIF Aim: Enable mobile agents to travel across various hosts in an open environment

• Support for locating an agent (MAFFinder)

• Released via OMG

MASIF

Standardise on four areas:

• Agent Management
– use of standard operations to manage agents from different vendors
• Agent Transfer
– use of standard operations to create and migrate agents from different agent systems
• Agent and Agent System Naming
– use of standard Syntax and Semantics of parameters
– part of MAFFinder
• Agent System Type and Location Syntax
– use of standard syntax for location
– part of MAFFinder

MASIF … 2

void create_agent (
    in Name agent_name,
    in AgentProfile agent_profile,
    in OctetString agent,
    in string place_name,
    in Arguments arguments,
    in ClassNameList class_names,
    in string code_base,
    in MAFAgentSystem class_provider)
  raises (ClassUnknown, ArgumentInvalid,
          SerializationFailed, MAFExtendedException);

IDL Definition

MASIF … 3

Location find_nearby_agent_system_of_profile(
    in AgentProfile profile)
  raises (EntryNotFound);

void resume_agent(
    in Name agent_name)
  raises (NameInvalid, ResumeFailed);

void list_all_agents_of_authority(
    in Authority authority);

NameList list_all_agents();

Location list_all_places();

IDL Definition

MASIF … 4

interface MAFFinder {
    void register_agent(
        in Name agent_name,
        in Location agent_location,
        in AgentProfile agent_profile)
      raises (NameInvalid);

    void register_agent_system(
        in Name agent_system_name,
        in Location agent_system_location,
        in AgentSystemInfo agent_system_info)


IDL Definition

MASIF … 5

Location lookup_agent(
    in Name agent_name,
    in AgentProfile agent_profile)
  raises (EntryNotFound);

Location register_place(
    in string place_name,
    in Location place_location)


IDL Definition

At each host ...

• An Agent Server
– one or more such servers can co-exist on a particular machine
– an agent server must be identifiable by a unique URL
– must also be able to launch and subsequently support tracking of the agent
• System support for migratable, non-persistent code
– memory, CPU
• System support for handling local security policy
– sandbox, authentication/access control mechanism, certificate verification mechanism, etc

MA Lifecycle

[Figure: mobile agent lifecycle, based on IBM Aglets. An agent (A) is created from a class file and can then be dispatched, retracted, deactivated, activated and disposed.]

Why are they useful in Grids?

• Important code delivery paradigm

• Must operate in the context of existing Grid systems

– may alleviate some issues with mobility

• Support essential needs of Grid computing

– software and protocol updates

– load balancing and migration

– user migration

• Most importantly -- they support a “Demand Oriented” style of computing

– move computation and data “on demand”

– move a limited set of functionality “on demand”

Achieving Parallelism

• Mobile Agents also useful to support parallelism at a coarser granularity

– simultaneous dispatch of agents to multiple sites

– simultaneous dispatch of messages to multiple sites via specialised group formation (aspect of “Spaces” -- formed through multicast groups)

– Integration with existing message passing libraries (MPI or PVM) via the host machine

• Achieved parallelism can be more dynamic

– Agents can decide where to migrate vs. pre-defined message transfer based on MPI or PVM

• May not be useful for “production grade” parallelism

Supporting Mobility

• Object Identity: Killing old object as copy sent to a remote host (address space) -- use of Java garbage collection when no references exist to object

– mobile object pool

• Object Serialisation: what happens to private, transient and state variables -- when to move?

– java.io.Serializable

– serialization of threads?

• State synchronisation and sharing: HORUS -- object server?

• Concurrency through Actors (objects that own their own thread) -- Actors are non-blocking

Explicit Serialization

• Via the Externalizable interface in Java

– must be manually implemented by programmer

– can customise how an object’s fields are mapped to a stream

– means of checkpointing state (includes object’s field values + metadata about class version, and field types)

– Write out all visible states of a thread to a stream, read back state, initiate a thread

• Consider method invocation as a “single” unit of computation

– allow thread read only before or after a method invocation (i.e. no active threads)

• Access to stack variables

– stack variables made part of object’s state

Custom Classloaders

• Can also implement custom classloaders
• Classloader used to:
– dynamically determine which code to migrate
– which code should be released
– how code interacts with the operating environment
• Classloaders are a useful way to extend existing Grid systems
– use of the CoG Java toolkit or OGSA to link to Globus
– interactivity between existing scheduling systems
• Offer class loading features as a Grid Service
– characterised by application features?
• Classloaders take away intelligence from migrating code -- hence not the ideal solution

Write your own Classloader()

• Extend “Primordial Classloader” in Java
– invoked after calling main() method
– Matrix m = new Matrix() ; -- execute “new” bytecode
– System.out.println() -- invoke static reference to class (putstatic, getstatic etc)
• Class loaders enable Java apps (EMACS or Scientific codes) to be dynamically extended
• Byte code verifier - defineClass, ClassFormatError
• Package over-write/addition: java.lang.hackit -- protect system namespace
• Multiple Classloaders can co-exist

Dynamic Itinerary

• A mobile agent may visit a number of hosts
• This itinerary may change over time
– based on data collected at intermediate hosts
– may not return to host machine
• Itinerary may be dictated by a particular host
– agent may override this
• Dynamic itinerary useful in Grid context
– load may not be known beforehand
– hosts may not always be available or reliable
– services may not always be present
– users/experts may migrate

Locating an agent

• Use of proxy
– local proxy to track agent
• Forwarders
– creating a chain of non-persistent forwarders
– pheromone based approaches
• A location service
– event notification service
– query service
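The forwarder approach can be illustrated with a small Python sketch: each host an agent leaves keeps a pointer to the agent's next host, and a lookup walks that chain. The `Host` class, its attributes and the `locate` function are hypothetical names for this example and do not correspond to any particular agent platform.

```python
class Host:
    """A host that remembers where a departed agent went (a forwarder)."""

    def __init__(self, name):
        self.name = name
        self.resident = set()   # ids of agents currently on this host
        self.forward = {}       # agent id -> Host the agent moved to

    def dispatch(self, agent_id, target):
        """Move an agent to target and leave a forwarding pointer behind."""
        self.resident.discard(agent_id)
        self.forward[agent_id] = target
        target.forward.pop(agent_id, None)   # drop any stale pointer
        target.resident.add(agent_id)


def locate(agent_id, start):
    """Follow forwarders from start; return (host name, hops) or None."""
    host, hops, seen = start, 0, set()
    while agent_id not in host.resident:
        if host.name in seen or agent_id not in host.forward:
            return None          # chain is broken or loops
        seen.add(host.name)
        host = host.forward[agent_id]
        hops += 1
    return host.name, hops


h1, h2, h3 = Host("h1"), Host("h2"), Host("h3")
h1.resident.add("a1")
h1.dispatch("a1", h2)
h2.dispatch("a1", h3)
found = locate("a1", h1)
```

Because the forwarders are non-persistent, a pointer that has expired breaks the chain and the lookup fails, which is one reason a separate location service is listed as an alternative.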

Application scenario: Load gathering

• Sensors measure network load
– similar to SNMP
• Report this to an event gateway and monitor this at a given control site
• JAMM system an example
– other work taking place in the Global Grid Forum Network Monitoring group
• Mobile agent may be used to gather load
– carry a schema for gathering parameters
– interact via local host to SNMP gateway
– record local parameters and carry statistics
– pass through a given host to lodge results
– itinerary may be application dependent

Java Agent Measurement and Monitoring (JAMM) - LBNL

JAMM scenario

Load gathering

Application Profiles

• Application categories:
– restrict itinerary
– identify common patterns
• Resource suggestions
– identify common patterns
– resource characteristics
• MA-MA interaction
– used to inform about other resources
– share application requirements
– determine commonality in applications

Load imposed by Mobile Agents

• MA performance becomes an issue
• Issues
– Where should a mobile agent visit next?
– What should the mobile agent carry vs. leave behind?
– How long does a mobile agent spend on a given host?
– How long does it take for a mobile agent to travel from A to B?
• Need for tools that can help gather this data
– Recorded within each agent
– Support for specialised services which gather this
– Data can be queried based on MA authorisation

David Kotz, Guofei Jiang et al. (Dartmouth College)

Fernando Pinel, Omer F. Rana (Cardiff)

Benchmarking

• MA benchmarking efforts also important in this context.
• Benchmarks can be micro-
– create (locally or remotely) and dispatch an agent
– retrieve an agent
– blocking and non-blocking message exchanges
• or macro-
– forwarding
– roaming
– proxy servers

M. Dikaiakos, M. Kyriakou, G. Samaras, "Performance Evaluation of Mobile-agent Middleware: A Hierarchical Approach." In Proceedings of the 5th IEEE International Conference on Mobile Agents, J.P. Picco (ed.), Lecture Notes in Computer Science series, vol. 2240, pages 244-259, Springer, Atlanta, USA, December 2001

Additional uses: Consumer Grids

• More open perspective on Grids

• Individuals and organisations can operate as suppliers of services/resources

• Service providers must be able to:

– Dynamically download software to participate on the Grid

– Varying resource capabilities

– Dynamically determine resource properties

• Resource aware visualisation

– Remotely configure resource

• Mobile agents provide an important abstraction

• Many existing technologies are useful contenders: Peer-2-Peer and Web Services

Resource sharing

• Peer-2-Peer
– CPU sharing (Entropia, Parabon, UD, SETI@HOME)
– File sharing (Napster, Gnutella, Freenet)
• CPU sharing
– Utilisation of free cycles via standard downloads
– Requires upload of data on which to operate
– Generally high redundancy and replication
• File Sharing
– Search for common file types, and support file placement
– Use of indexing or intermediate servers
• Development libraries: JXTA

Resource Sharing … 2

• In MA:
– CPU sharing: migration of mobile agent
– File sharing: migration of associated data and state
• Migration and execution can be more intelligent
• Use of forwarding and location services can be coupled with additional services:
– Work distribution and current state of computation
– Resource events to support migration
• P2P infrastructure also useful:
– Development of itineraries via overlay networks or index servers
– Security issues (?)

File Space Management

• Cache management
– migration support for files (temporary results, configuration etc)
• File space re-ordering
– sharing of directory space across machines
– virtual “file stores”
• Results to common queries
– file placement closer to computation
– file replication to support availability levels

• Managing user and project groups

Common Themes

• Load balancing and migration

• Data capture (especially performance related)

• Trigger and configuration
– set up of execution at remote sites
– updates to execution or changes
– user set up

• Establishing dynamic resource groups

• Resource provisioning beyond regional and national centres

Concerns

• Dealing with licensed software

– proprietary code or data

• Dealing with production codes

– highly tuned performance

– issues of Grid computing are questionable here

• Domain decomposition

– issues in translating large scale codes to mobile agents

– where is the abstraction most suitable/relevant

• Interfaces between Grid systems and Mobile Agent systems

Issues … Swarm/Ant Systems

• Tragedy of the Commons: Self Organisation does not always produce the desired outcome (Thomas Schelling's Micromotives and Macrobehavior):
– El Farol Bar problem
– Sheep Grazing problem

• Some individuals and organizations are more comfortable and more efficient with hierarchical organizations that are more centrally controlled

Issues … 2

• Useful in an “experiment” and “explorative” environment

• System must be “non-conservative” in its approach to experiment and evaluate different system behaviours

El Farol Bar … 2

• Agents select a night (1–7)
– based on expected attendance or reward (from prior experience)
• Agent attends the bar
– Attendance on selected night
– Output of the reward function
• Update agent’s model of the system
• Agents cannot communicate with each other
• Global objective: Maximise cumulative reward of entire system

Tragedy of the Commons

• Self-interested gain of one member of the community is to the detriment of the whole community

• Pasture on which each agent keeps cattle
– Utility increases as the number of animals increases
– Overgrazing affects all agents detrimentally

• Agent needs to decide whether to cooperate or defect

Braess’ Paradox

• Agents traverse a network consisting of a set of nodes
– and a number of connections between the nodes
• Aim: each agent must reach its destination as quickly as possible
– Traffic networks, water supply networks, electrical networks etc
• BP: Addition of an extra link has a detrimental effect on performance
• Introducing a shortest path link in a network that has reached equilibrium

[Figure: two four-node networks with nodes A, B, C and D]

Occurs when a community of agents is unable to coordinate their activities to take advantage of changes in the environment.
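A worked instance makes the paradox concrete. The Python sketch below uses the standard textbook numbers, which are an assumption and are not taken from these slides: 4000 agents travel from A to D; links A-B and C-D take (load / 100) minutes, links A-C and B-D take a fixed 45 minutes, and the added B-C link is free.

```python
def travel_times(n_abd, n_acd, n_abcd):
    """Per-route travel time (minutes) for a split of agents over 3 routes.

    Edge costs are illustrative textbook values:
    A->B and C->D cost load/100, A->C and B->D cost 45, B->C costs 0.
    """
    load_ab = n_abd + n_abcd      # agents using link A->B
    load_cd = n_acd + n_abcd      # agents using link C->D
    return {
        "A-B-D": load_ab / 100 + 45,
        "A-C-D": 45 + load_cd / 100,
        "A-B-C-D": load_ab / 100 + 0 + load_cd / 100,
    }


# Without the B->C link the agents split evenly and nobody gains by switching.
before = travel_times(2000, 2000, 0)

# With the link, A-B-C-D looks faster to each agent, so all agents end up
# on it; at that point no single agent can improve by switching back.
tempting = before["A-B-C-D"]
after = travel_times(0, 0, 4000)
```

Every agent travels for 65 minutes before the link is added and 80 minutes afterwards, even though each individual switch to the new link was locally rational.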

Collective Intelligence (COIN)

• Developed at NASA by Wolpert et al.
• Scalable coordination technique for adaptable, learning based multiagent systems (MAS).

• All agents strive to maximise their local utility function.

• The goal of the system is to maximise the global utility function.

Collective Intelligence (COIN)

Local utility functions are derived from the global utility functions so that:

• Maximisation of local utility functions maximises the global utility function – the global optimum ‘lines up’ with the Nash Equilibrium.

• Local utility functions are learnable: good signal-to-noise ratio for learning algorithms.

• Agents are coordinated indirectly. Emergent behaviour is still possible as agents are not given explicit instructions and behaviour is not predefined.

Adapting Collective Intelligence

• We are aiming to adapt this technique for agents that can be deployed via the internet.

• COIN concentrates on specific applications: coordinating communications satellites, robotic rovers.

• We want to apply this technique dynamically and concentrate on software agents.

LEAF – Learning Agent FIPA Compliant Community Toolkit

• Utility functions assigned dynamically.

• Utility extended to form two separate types: functional utility and performance utility.

• Assignment of multiple utility functions possible.

• Java API provided to support development of FIPA compliant agents.

FIPA - Foundation for Intelligent Physical Agents

• Standards for interoperable agent systems.
• FIPA ACL: conversations consisting of FIPA performatives such as inform, request, query etc.

• Agent management system (AMS) and directory facilitator (DF) part of the FIPA platform.

• LEAF utilises FIPA-OS implementation from Emorphia.

Community Building Kit: LEAF

Four core concepts:
• LEAF agents
• LEAF utility functions
• ESNs
• LEAF tasks

Provides support for:
• JESS based policy description
• Reinforcement learning

LEAF Agent

LEAF: Learning Agent FIPA-Compliant Community Toolkit

Implementation of LEAF is based on FIPA-OS

[Figure: class diagram relating FIPA-OS and LEAF, showing the classes FIPAOSAgent, LeafNode, ESN, Task and LeafTask]

• Coordination: utility functions are assigned to agents by an environment service node (ESN).

[Figure: an ESN assigning utility functions f1 and f2 to agents in a community]

• Multiple utility functions can be assigned

[Figure: the ESN of community a assigns f1 and f2, and the ESN of community b assigns f3; the diagram is labelled "sum f2,f3"]

• Utility functions can have parameters that are not available locally to the agent.

[Figure: an ESN, a community and utility function f1, with O: observable properties and R: remote parameters]


Performance and Functional Utility

• P (performance utility): speed of execution, number of tasks, CPU usage etc.
• F (functional utility): decision making, learning - high level behaviour.

Performance Utility

• Provides a utility measure based on performance engineering related aspects
– Comms metrics: number of messages exchanged, size of message, response time
– Execution metrics: execution time, time to convergence, queue time
– Memory and I/O metrics: memory access time, disk access time
• The effect of implementation decisions (algorithms; languages) and deployment decisions (platforms; networks) can be assessed.

Functional Utility

• Utility based on “problem solving” capability
• Statically defined
– related to service properties (capability based)
– degree of match between task properties and service capability: syntax match (exact match), range match, semantic match (subsumption/subclass)
• Dynamically defined
– related to execution output (MSE)
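One way to picture the statically defined "degree of match" is as a score over the task's properties, where each property can match exactly, fall within a range, or match semantically through a subclass relation. The Python sketch below is purely illustrative: the capability hierarchy, the property names and the choice of "fraction of properties matched" as the score are assumptions, not part of LEAF.

```python
# Hypothetical capability hierarchy (child -> parent), for illustration only.
PARENT = {"LinearSolver": "Solver", "Solver": "Service"}


def is_subclass(child, ancestor):
    """True if child equals ancestor or descends from it in PARENT."""
    while child is not None:
        if child == ancestor:
            return True
        child = PARENT.get(child)
    return False


def match_kind(required, offered):
    """Classify how an offered capability matches a required property."""
    if required == offered:
        return "syntax"                      # exact match
    if isinstance(required, tuple) and isinstance(offered, (int, float)):
        low, high = required
        if low <= offered <= high:
            return "range"                   # value falls inside the range
        return None
    if isinstance(required, str) and isinstance(offered, str):
        if is_subclass(offered, required):
            return "semantic"                # offered is a subclass
    return None


def degree_of_match(task, service):
    """Fraction of task properties that the service matches in any way."""
    matched = sum(
        1 for key, required in task.items()
        if key in service and match_kind(required, service[key]) is not None
    )
    return matched / len(task)


task = {"type": "Solver", "cpus": (2, 8), "os": "linux"}
service = {"type": "LinearSolver", "cpus": 4, "os": "solaris"}
score = degree_of_match(task, service)
```

Here the service matches the task type semantically and the CPU count by range, but not the operating system, so two of the three properties match.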

Utility Function Implementation

• Extend the LocalUtilityFunction abstract class.

• Implement the compute() method.

• Functions have access to remote parameters and observable properties.

Utility Function Implementation

Utility functions

• Global Utility (G) = Σ_i Local Utility (U_i)

• U = (jobs of type X processed)/(jobs of type X submitted)

• U = 1/(idle time)
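The two example local utilities and the global sum can be written out directly. This Python sketch only mirrors the formulas on the slide; the function names, the handling of zero submissions and zero idle time, and the sample job counts are assumptions made for the example.

```python
def throughput_utility(processed, submitted):
    """U = (jobs of type X processed) / (jobs of type X submitted)."""
    if submitted == 0:
        return 0.0  # assumption: no submissions means no utility
    return processed / submitted


def idle_utility(idle_time):
    """U = 1 / (idle time); idle_time must be positive."""
    if idle_time <= 0:
        raise ValueError("idle_time must be positive")
    return 1.0 / idle_time


def global_utility(local_utilities):
    """G = sum over agents i of the local utility U_i."""
    return sum(local_utilities)


# Hypothetical (processed, submitted) job counts for three agents.
jobs = [(8, 10), (3, 4), (5, 5)]
local_values = [throughput_utility(p, s) for p, s in jobs]
G = global_utility(local_values)
```

With these numbers the local utilities are 0.8, 0.75 and 1.0, so an agent that raises its own completion ratio raises G by the same amount.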

Can you consider other utility functions that may be relevant?

For students

Access to utility functions

double computeFunctionalUtility(): computes the sum of all currently assigned functional utility functions.

double computePerformanceUtility(): computes the sum of all currently assigned performance utility functions.

String[] getFunctionalUtilityRequiredProperties(): returns the observable properties required to compute the agent’s functional utility functions.

String[] getPerformanceUtilityRequiredProperties(): returns the observable properties required to compute the agent’s performance utility functions.

Resource management

• The objective is to provide users with on-demand access to resources needed to execute applications.

• Each peer/agent can undertake three different roles: application agent, resource agent, broker agent.

• Multiple roles may be undertaken by the same peer.
• Each peer is an autonomous agent capable of learning within its environment with the goal of local utility maximisation.

Application Agents

• Accept applications from users.
• Decompose applications into tasks.
• Identify suitable resources for task execution, via broker agents.
• Schedule and submit tasks to resource agents.
• Manage dynamic application execution process.
• Coordinated learning may be of benefit in resource selection.

Resource Agents

• Manage access to a particular resource.
• Resources may be computational, visualisation, scientific, or instrumentation based.
• Resource agents allow tasks to be submitted and executed on the resource.
• Coordinated learning may allow resource agents to optimise resource properties, and prioritise tasks.

Broker Agents

• Maintain information about discovered resource agents.

• Offer a matchmaking service, aimed at allowing application agents to discover resource agents.

• Coordinated learning may allow brokers to optimise their matchmaking service.

Agent based resource management

• Previous work used planning based BDI agents within the same framework.

• Current research involves investigating whether agents can benefit from coordinated learning.

• The eventual goal is to integrate the two techniques.

Agent Communities

• Communities are centred on the application/resource type: computational (C), visualisation (V), scientific (S), instrumentation (I) – there can be multiple communities of the same type.

• When an agent joins a community, it is assigned a local utility function.

• The agent learns to optimise this function to benefit the community.

• Agents are allowed to join multiple communities in an attempt to maximise their utility.

Agent Communities

Each community has a global utility function, based on community objectives:

1. Peers acting as application agents process as many applications as possible.

2. Peers acting as resource agents process as many tasks as possible.

3. Peers acting as broker agents facilitate (1) and (2).

Global Utility Functions

where A is the number of applications processed by the community, idle_i is the amount of time agent i spends idle, and c_1, c_2 are constants

Application agent utility functions

where A_a is the number of applications processed by agent a, J_a is the total resource usage time used by a, and c_1, c_2 are constants

Resource agent utility functions

where T_r is the number of tasks processed by resource agent r, idle_r is the total time spent idle by the resource, and c_1, c_2 are constants

Broker agent utility functions

where n resources have been recommended by the resource agent, and Ul(i)Ti is the local utility of the recommended resource at the time of recommendation.

Simulations

• 4 communities – (C,V,S,I)
• 10 resource agents
• 3 application agents
• 1 broker agent
• The current focus is on resource agent learning – joining communities and updating resource properties

• Peers attempt to join communities in order to increase their utility, and will only remain in the community as long as their utility is above a certain threshold.

[Figures: Global Utility and Number of Members plotted against time for the computational, visualisation, storage and instrumentation communities]

Current research objectives

• The aim is to allow peers to form communities, around which the collection of peers is ‘greater than the sum of their parts’.

• Current work involves the engineering of this application, and the evolution of the utility functions to include a greater degree of social context

• Learning is currently very difficult for the agents – need to allow learning algorithms to converge.


Toolkits: ABLE

• ABLE (Agent Building and Learning Environment)

• Supports use of JavaBeans

• Provides a host of pre-built functionality

• Also provides tuning agents for:
– Load Balancing
– System Control function

AbleBeans – Java Agent Building Blocks

[Diagram: two AbleBeans interacting through direct method calls and AbleEvents (Notification Events and Action Events).]

• AbleBean, AbleRemoteBean: a Java interface (local and remote)
• AbleObject: AbleBean instantiation with autonomous thread
• Bean interactions: direct method calls and event passing
• AbleEvents: Notification and Action events with synchronous and asynchronous event handling
• AbleBeanInfo and Customizer required for use in Agent Editor
• Set of core data access and algorithm beans supplied


AbleAgents – Intelligent JavaBeans

[Diagram: an AbleAgent containing AbleBean A, AbleBean B and AbleBean C. A Sensor gets app data from App/Service 1 and an Effector calls an app action on App/Service 2.]

• AbleAgent, AbleRemoteAgent: a Java interface (extends AbleBean)
• Composable: can contain other AbleBeans and AbleAgents
• Sensors and Effectors: allow agents to interface with apps
• Can be distributed, synchronous or asynchronous (autonomous)
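As a rough illustration of the sensor/effector idea, the sketch below composes an agent from a sensor, a chain of processing steps and an effector. It is written in Python rather than Java and does not use the ABLE API: the `Agent` class and its `process` method are hypothetical names, and chaining the contained beans in sequence is an assumption made for the example.

```python
class Agent:
    """Hypothetical agent: sensor -> contained beans -> effector."""

    def __init__(self, sensor, effector, beans):
        self.sensor = sensor          # callable returning application data
        self.effector = effector      # callable applying a decision to the app
        self.beans = list(beans)      # processing steps, applied in order

    def process(self):
        """Run one sense/decide/act cycle and return the decision."""
        data = self.sensor()
        for bean in self.beans:
            data = bean(data)
        self.effector(data)
        return data


# Example application state (invented for the sketch).
app = {"queue_length": 23, "workers": 1}


def read_queue():
    return app["queue_length"]


def set_workers(count):
    app["workers"] = count


agent = Agent(
    sensor=read_queue,
    effector=set_workers,
    beans=[
        lambda queue: queue // 5,            # one worker per five queued jobs
        lambda workers: max(1, min(workers, 8)),  # clamp to an allowed range
    ],
)
```

The application only exposes a read function and an action function; everything between them lives inside the agent, which is the separation the sensor and effector roles are meant to provide.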


ABLE Component Library

• Machine Learning
– Back propagation
– Self organizing maps
– Radial Basis Functions
– TD-Lambda
– Decision Trees
– Naive Bayes

• Machine Reasoning
– Script (procedures)
– Forward / Backward chaining
– Predicate logic (Prolog)
– Rete'-based pattern match
– Fuzzy systems
– Planning (STRIPS)

• Data Access/Analysis
– Text/DB read/write
– Cache, Filter, Transform
– Statistical routines
– Genetic algorithms
– other math analysis

• Agents
– Classification
– Clustering
– Prediction
– Autotune (closed loop control)
– Storage manager (multiple QoS)


ABLE Application Design

[Diagram: layers labelled Application, Agent, Custom Beans (domain-specific), ABLE Core Beans and ABLE Library.]


AbleBean Wrapper Design Pattern

[Diagram: myAlgorithmBean, together with myAlgorithmCustomizer and myAlgorithmBeanInfo, wraps theAlgorithm. Methods shown on the bean: myAlgorithmBean(), init(), process(), setters(), getters(). Methods shown on the algorithm: theAlgorithm(), init(), process(), getters(), setters(). processTimerEvent() also appears.]

• Allows easy integration of existing Java algorithms into the ABLE environment
• Requires creation of 3 Java classes: Bean wrapper, BeanInfo and Customizer
• Bean contains an instance of the algorithm and calls methods on it
• No (or minimal) source changes required in the algorithm class
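The wrapper idea can be sketched independently of ABLE. The example below is in Python rather than Java and covers only the bean wrapper; the BeanInfo and Customizer classes are omitted. `MovingAverage` stands in for an existing algorithm class, and every class and method name here is hypothetical.

```python
class MovingAverage:
    """Stand-in for an existing algorithm that should stay unmodified."""

    def __init__(self, window):
        self.window = window
        self.values = []

    def update(self, x):
        self.values.append(x)
        recent = self.values[-self.window:]
        return sum(recent) / len(recent)


class MovingAverageBean:
    """Wrapper exposing init/process/getters/setters around the algorithm."""

    def __init__(self, window=3):
        self._window = window
        self._input = None
        self._output = None
        self._algorithm = None
        self.init()

    def init(self):
        # (Re)create the wrapped algorithm from the current parameters.
        self._algorithm = MovingAverage(self._window)
        self._output = None

    def set_window(self, window):
        self._window = window
        self.init()

    def get_window(self):
        return self._window

    def set_input(self, value):
        self._input = value

    def get_output(self):
        return self._output

    def process(self):
        # Delegate the real work to the wrapped instance.
        self._output = self._algorithm.update(self._input)
```

The algorithm class is untouched; all framework-facing conventions (parameter setters, an input buffer, a `process()` entry point) live in the wrapper, so the same algorithm can still be used outside the framework.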


Rule Blocks

<type> <name>() using <engine> { ruleList } ;

• Semantically equivalent to Java methods
• Can specify a return data type
• Can use pre-defined or user-defined name
• No formal parameter lists, use global vars
• Specify inference engine via using <engine> clause
• <engine> can be any AbleInferenceEngine Java subclass
• Body of ruleblock contains one or more Rules
• Use setControlParameter() built-in function to set goals, options, etc.
• Ruleblock can have local or shared working memory

ARL Rule Syntax

<ruleLabel> { preConditions } [priority] : <ruleBody>;

• ruleLabel – unique identifier in ruleset
• preConditions – list of Java objects (e.g. TimePeriods)
• priority – used in conflict resolution during inferencing
• Rule body must be one of the ARL rule types
• Example: myRule { weekdaysOnly } [ 3.0 ] : println("wow");
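One simple reading of how preconditions and priorities interact is sketched below: a rule is eligible only if all its preconditions hold, and the highest-priority eligible rule wins. This is an illustrative Python model, not ARL and not an ABLE inference engine; `Rule`, `select_rule` and the `weekdays_only` predicate are hypothetical, and real engines may resolve conflicts differently.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Rule:
    label: str                                    # unique identifier
    preconditions: List[Callable[[dict], bool]]   # all must hold
    priority: float                               # higher wins (assumption)
    body: Callable[[dict], None]


def select_rule(rules: List[Rule], context: dict) -> Optional[Rule]:
    """Return the eligible rule with the highest priority, or None."""
    eligible = [r for r in rules if all(p(context) for p in r.preconditions)]
    if not eligible:
        return None
    return max(eligible, key=lambda r: r.priority)


def weekdays_only(context: dict) -> bool:
    # Monday is 0 and Sunday is 6, as in datetime.date.weekday().
    return context["weekday"] < 5


fired = []
rules = [
    Rule("myRule", [weekdays_only], 3.0, lambda ctx: fired.append("wow")),
    Rule("fallback", [], 1.0, lambda ctx: fired.append("fallback")),
]
```

On a weekday both rules are eligible and `myRule` is chosen because of its priority of 3.0; at the weekend its precondition fails and only `fallback` remains.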

ABLE Rule Templates

• Allow IT Developer or Programmer to create rulesets and templates using WSAD editor
• Minimize external meta-data or artifacts
• Business user can create rules from templates using web-based UI
• Allow easy parameterization of rules and rule logic, with constraints on parameter values
• Reuse existing ABLE data types and ARL syntax

• Allow users to customize rule templates and create new rules
• Variable values are constrained based on ruleset author constraints
• Can generate individual rules or entire rulesets via templates
• Can edit generated rules using same authoring environment

ARL Rule Template Syntax

Ruleset myRuleTemplateExample {
  import com.ibm.myclass.Customer;
  variables {
    Customer customer = new Customer() ; // myclass type
    template Categorical customerLevel = new Categorical("gold", "silver", "platinum");
    template String salesMsg = new String("Thank you for shopping IBM"); // example msg
    template Continuous discountValue = new Continuous(0.01, 0.50); // allow range from 1% to 50%
    Double discount = new Double(0.0) ;
  }
  inputs { customer } ;
  outputs { discount } ;
  void process() {
    Rule1: if (a > b) then println("regular old rule") ;
    Rule2: if (a <= b) then println("another regular old rule") ;
    template myRuleTemplate1: if ( customer.level == customerLevel ) // NOTE: Rule is a template
      then { discount = discountValue ; println( salesMsg ) ; }
  }
}
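The constraint checking implied by the template variables can be sketched as follows. This is a Python model of the idea only, not the ABLE implementation: the `Categorical` and `Continuous` classes below merely borrow the names used in the ruleset, and `instantiate_discount_rule` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Categorical:
    """Constraint: the value must be one of a fixed set of strings."""
    allowed: Tuple[str, ...]

    def validate(self, value: str) -> str:
        if value not in self.allowed:
            raise ValueError(f"{value!r} is not one of {self.allowed}")
        return value


@dataclass(frozen=True)
class Continuous:
    """Constraint: the value must lie in the closed range [low, high]."""
    low: float
    high: float

    def validate(self, value: float) -> float:
        if not (self.low <= value <= self.high):
            raise ValueError(f"{value} is outside [{self.low}, {self.high}]")
        return value


def instantiate_discount_rule(
    level: str,
    discount: float,
    level_constraint: Categorical,
    discount_constraint: Continuous,
) -> Callable[[str], float]:
    """Check the template parameters, then return a concrete rule."""
    level = level_constraint.validate(level)
    discount = discount_constraint.validate(discount)

    def rule(customer_level: str) -> float:
        # Mirrors: if (customer.level == customerLevel) then discount = discountValue
        return discount if customer_level == level else 0.0

    return rule


LEVELS = Categorical(("gold", "silver", "platinum"))
DISCOUNTS = Continuous(0.01, 0.50)
gold_rule = instantiate_discount_rule("gold", 0.10, LEVELS, DISCOUNTS)
```

A business user filling in the template can therefore only produce rules whose parameters fall inside the ranges chosen by the ruleset author; a request for a 75% discount is rejected before any rule is generated.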

Autotune Agent Web-Tuning Scenario

[Diagram: Users, an Apache Web Server (tuning parameters KeepAlive and MaxClients; measured CPU and MEM), a Desired Utilization Level, and an AutoTune Agent with Modeling and Run-time Control components.]

Agent Properties
• Flexible
• Autonomic
• Generic
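The run-time control part of such a scenario can be illustrated with a very small feedback loop. The sketch below is not the AutoTune agent's controller: it is a plain integral controller, the linear `simulated_cpu` plant model is invented for the example, and the gain and starting value are arbitrary assumptions.

```python
def simulated_cpu(max_clients):
    """Toy plant (an assumption): CPU utilisation rises linearly with
    MaxClients and saturates at 100%."""
    return min(1.0, 0.002 * max_clients)


def tune_max_clients(target_cpu, max_clients=150.0, gain=200.0, steps=50,
                     measure=simulated_cpu):
    """Integral control: nudge MaxClients in proportion to the utilisation error.

    Returns the final setting and the (setting, measured CPU) history.
    """
    history = []
    for _ in range(steps):
        cpu = measure(max_clients)
        error = target_cpu - cpu                 # positive: server under-used
        max_clients = max(1.0, max_clients + gain * error)
        history.append((max_clients, cpu))
    return max_clients, history


final_setting, trace = tune_max_clients(target_cpu=0.6)
```

With this toy plant the loop settles at a MaxClients value of about 300, where the simulated CPU utilisation equals the 0.6 target. On a real server, `measure` would be replaced by a monitored CPU reading, and the gain would need to come from a model of the system, which is what the modeling phase is for.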

Design Phase I: System Modeling

iSeries System Administration using ABLE

[Diagram: a three-tier agent hierarchy (SysAdmin Agent, Task/Info Agents, Action Agents). Components shown: SysAdminBrainRuleSet, SysAdminActionsRuleSet, CPUWatcher, DiskWatcher, DiskPredictor, NOJWatcher, FindLargeObjects, FindDuplicateJobs, FindRunawayJobs, Cleanup.]

Performance Prediction using Neural Networks

[Diagram: Monitor Data feeding a Neural Prediction Agent.]

• Web Server running on Windows 2000
• Hit with variable workload, seasonality
• Capture Performance Monitor Data
• Train neural network to predict future response time
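The prediction step can be sketched with a very small network trained on sliding windows of past response times. Everything here is an assumption made for illustration: the data is a synthetic seasonal series rather than captured monitor data, and `TinyMLP` is a hand-written one-hidden-layer network, not the ABLE back propagation bean.

```python
import math
import random


def make_windows(series, width):
    """Pair each run of `width` past values with the value that follows it."""
    return [(series[i:i + width], series[i + width])
            for i in range(len(series) - width)]


class TinyMLP:
    """One tanh hidden layer and a linear output, trained by plain SGD."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def _hidden(self, x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(self.w1, self.b1)]

    def predict(self, x):
        h = self._hidden(x)
        return sum(w * hj for w, hj in zip(self.w2, h)) + self.b2

    def train_step(self, x, target, lr):
        h = self._hidden(x)
        err = sum(w * hj for w, hj in zip(self.w2, h)) + self.b2 - target
        for j, hj in enumerate(h):
            # Back-propagate through tanh using the pre-update output weight.
            grad_h = err * self.w2[j] * (1.0 - hj * hj)
            self.w2[j] -= lr * err * hj
            self.b1[j] -= lr * grad_h
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * grad_h * xi
        self.b2 -= lr * err


def mse(model, samples):
    return sum((model.predict(x) - t) ** 2 for x, t in samples) / len(samples)


# Synthetic "response time" with a 24-sample season (invented data).
series = [1.0 + 0.5 * math.sin(2 * math.pi * t / 24) for t in range(240)]
samples = make_windows(series, width=4)
model = TinyMLP(n_in=4, n_hidden=4)
error_before = mse(model, samples)
for _ in range(100):
    for x, t in samples:
        model.train_step(x, t, lr=0.05)
error_after = mse(model, samples)
```

Comparing `error_before` with `error_after` shows whether training helped on this series. For real monitor data the window width matters a great deal, and a held-out test period would be needed before trusting the predictions.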

WinGamma

• Data analysis toolkit – especially for time series data

• Can support identification of:
– Time series “embedding” dimension
– Level of noise present within data
– Based on the “Gamma” statistic

• Can be used prior to training a neural network
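A minimal sketch of the idea follows, assuming the usual near-neighbour formulation of the Gamma test: for each k, average the squared input distance to the k-th nearest neighbour (delta) and half the squared output difference (gamma), then fit a line through the (delta, gamma) pairs. The intercept estimates the noise variance. This is not WinGamma's code; the function names and the brute-force neighbour search are choices made for the example.

```python
def delay_embed(series, m):
    """Turn a time series into (m previous values) -> (next value) pairs."""
    inputs = [tuple(series[i:i + m]) for i in range(len(series) - m)]
    outputs = [series[i + m] for i in range(len(series) - m)]
    return inputs, outputs


def gamma_test(inputs, outputs, p=10):
    """Return (intercept, slope) of the gamma-versus-delta regression.

    Requires more than p points. Uses an O(n^2) neighbour search, which is
    fine for a sketch but slow for large data sets.
    """
    n = len(inputs)
    deltas = [0.0] * p
    gammas = [0.0] * p
    for i in range(n):
        # Squared distances from point i to every other point, nearest first.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(inputs[i], inputs[j])), j)
            for j in range(n) if j != i
        )
        for k in range(p):
            d2, j = dists[k]
            deltas[k] += d2 / n
            gammas[k] += 0.5 * (outputs[j] - outputs[i]) ** 2 / n
    mean_d = sum(deltas) / p
    mean_g = sum(gammas) / p
    slope = (sum((d - mean_d) * (g - mean_g) for d, g in zip(deltas, gammas))
             / sum((d - mean_d) ** 2 for d in deltas))
    intercept = mean_g - slope * mean_d
    return intercept, slope
```

Running `gamma_test` on embeddings of increasing dimension `m` and looking for the dimension where the intercept stops falling is one way to choose the window width before training a neural network on the same series.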

WEKA: Waikato Environment for Knowledge Analysis

Explorer: building “classifiers”

• Classifiers in WEKA are models for predicting nominal or numeric quantities

• Implemented learning schemes include:
– Decision trees and lists, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression, Bayes’ nets, …

• “Meta”-classifiers include:
– Bagging, boosting, stacking, error-correcting output codes, locally weighted learning, …
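To show what a meta-classifier does, the sketch below implements bagging around a deliberately weak base learner. It is plain Python and does not use WEKA (which is a Java toolkit); the one-feature decision stump, the function names and the toy data set are all assumptions made for the example.

```python
import random
from collections import Counter


def train_stump(samples):
    """Fit a one-feature decision stump: one label below a threshold,
    another at or above it, chosen to minimise training errors."""
    best = None
    thresholds = sorted({x for x, _ in samples})
    labels = sorted({y for _, y in samples})
    for threshold in thresholds:
        for left in labels:
            for right in labels:
                errors = sum((left if x < threshold else right) != y
                             for x, y in samples)
                if best is None or errors < best[0]:
                    best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return lambda x: left if x < threshold else right


def bagging(samples, n_models=11, seed=0):
    """Train each stump on a bootstrap resample; predict by majority vote."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        bootstrap = [rng.choice(samples) for _ in samples]
        models.append(train_stump(bootstrap))

    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]

    return predict


# Toy data: values below 10 are "low", the rest "high".
data = [(x, "low") for x in range(10)] + [(x, "high") for x in range(10, 20)]
classifier = bagging(data)
```

Each stump sees a slightly different resample and so places its threshold slightly differently; the vote averages those differences out, which is the variance reduction that bagging is used for.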

Monitoring Tools

• NWS (Network Weather Service)
– Supports a forecasting model
– Works at “application-level” and not necessarily at the network (resource) level

• NetLogger
– Now supports instrumentation for Globus calls
– Useful data capture process (event based)
– Manage level of data captured

• Specialist support via Apache Web Server
– Messaging and Execution time
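The forecasting approach associated with NWS, running several simple predictors side by side and trusting whichever has accumulated the least error so far, can be sketched as below. This is an illustration of the selection idea only, not NWS code: the three predictors, the five-sample median window and the function names are assumptions.

```python
from statistics import mean, median

# Candidate predictors: each maps the history so far to a next-value guess.
FORECASTERS = {
    "last_value": lambda history: history[-1],
    "running_mean": lambda history: mean(history),
    "sliding_median": lambda history: median(history[-5:]),
}


def forecast(measurements):
    """Replay the history, score every predictor by cumulative absolute
    error, and return (best predictor name, its next forecast).

    Needs at least two measurements so that each predictor can be scored.
    """
    errors = {name: 0.0 for name in FORECASTERS}
    for t in range(1, len(measurements)):
        history = measurements[:t]
        for name, predictor in FORECASTERS.items():
            errors[name] += abs(predictor(history) - measurements[t])
    best = min(errors, key=errors.get)
    return best, FORECASTERS[best](measurements)
```

For a steadily rising series such as `[10, 11, 12, 13, 14, 15]` the last-value predictor accumulates the least error and is selected; for a noisy but stationary series the mean or median would usually win instead, which is why the choice is made dynamically.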

From: Brian Tierney (LBNL)

From: G. Obertelli (UCSB)

Additional Info.

• IBM Autonomic Computing Web site
– http://www.research.ibm.com/autonomic/
• IBM Autonomic Computing Library
– http://www-03.ibm.com/autonomic/library.html
• LEAF project
– http://users.cs.cf.ac.uk/O.F.Rana/leaf/
• DIPSO/FAEHIM project
– http://users.cs.cf.ac.uk/Ali.Shaikhali/faehim/
• WinGamma
– http://www.cs.cf.ac.uk/wingamma/
• WEKA
– http://www.cs.waikato.ac.nz/ml/weka/
• ABLE Toolkit – Tutorial
– http://www.cs.iastate.edu/~colloq/docs/able2_bigus.ppt
