2009.08 grid peer-slides

Yehia El-khatib, Christopher Edwards, Michael Mackay, and Gareth Tyson

Computing Department, Lancaster University, UK

www.ec-gin.eu

• Motivation

• Objective

• System Overview

◦ The GridMAP Service

◦ Monitoring Daemon

• Network Measurement Evaluation

• Conclusions

• Future Work

Why measure the network?

• Public networks are unpredictable.

◦ Heterogeneous components

◦ Independent standards and protocols

◦ IP provides best-effort, “one size fits all” delivery.

• Potential to hinder the performance of any networked application.

• IP networks do not readily provide feedback about their operational performance.

• Hence, numerous network monitoring tools:

◦ management, troubleshooting, and/or pre- and post-deployment probation.

• Stand-alone tools are ad hoc and manual.

Why measure the network in grids?

• Grids are dynamic systems that aggregate resources to run very demanding applications.

• High performance is always expected and hence contention on resources is similarly high.

• Efficient scheduling is only possible with access to accurate resource information.

• Current middleware and Grid Information Systems (GIS) are ineffective and cumbersome.

◦ Information is insufficient and/or outdated

◦ Needs to be gathered from different publishers

• Our aim is to enable schedulers to make more informed decisions on node selection and resource allocation.

• Grid schedulers need a means of learning about changes in the grid:

◦ changes in the availability of remote computational resources

◦ changes in the network path to those resources

• This requires accurate end-to-end measurements to be provided to schedulers.

• GridMAP (Grid Monitoring, Analysis and Prediction) is a distributed system which collects network performance and resource availability information.

• GridMAP uses this information to provide, analyze and predict performance and availability.

• It is made up of:

◦ Grid Service

◦ Monitoring Daemon

• The GridMAP service is a grid service, i.e. a WSRF Web Service that also conforms to the OGSI standard.

• It provides a set of standard interfaces that allow convenient access for schedulers.

• Schedulers can incorporate the retrieved information into job and data allocation processes to adapt automatically to perceived and foreseeable resource and network status.

• A daemon runs on each grid node to measure resource and network performance.

• These measurements are sent to the GridMAP service to be indexed and stored.

[Diagram: Application → Scheduler ("Run job xyz") → GridMAP. Requirements: delay, CPU, memory. GridMAP returns, per node, average and predicted values for delay, CPU and memory.]

[Diagram: Application → Scheduler ("Copy file") → GridMAP. Requirements: delay, throughput, disk space. GridMAP returns, per node, average and predicted values for delay, throughput and disk space.]
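As a rough illustration of the job scenario above, the sketch below shows how a scheduler might filter and rank candidate nodes using the average and predicted values returned by GridMAP. The `NodeReport` record, its field names, the sample figures, and the rule of ranking by predicted delay are all assumptions made for this example; the slides do not define the actual interface or selection policy.

```python
from dataclasses import dataclass


@dataclass
class NodeReport:
    """Hypothetical per-node record, loosely mirroring the table columns."""
    node: str
    delay_avg_ms: float
    delay_pred_ms: float
    cpu_avg_free: float    # fraction of CPU free, 0.0 to 1.0
    cpu_pred_free: float
    mem_avg_mb: float
    mem_pred_mb: float


def select_nodes(reports, max_delay_ms, min_cpu_free, min_mem_mb):
    """Keep nodes whose average AND predicted values meet the job's
    requirements, then rank them by predicted delay (lowest first)."""
    eligible = [
        r for r in reports
        if max(r.delay_avg_ms, r.delay_pred_ms) <= max_delay_ms
        and min(r.cpu_avg_free, r.cpu_pred_free) >= min_cpu_free
        and min(r.mem_avg_mb, r.mem_pred_mb) >= min_mem_mb
    ]
    return sorted(eligible, key=lambda r: r.delay_pred_ms)


# Made-up sample values, for illustration only.
reports = [
    NodeReport("node-a", 9.0, 10.0, 0.60, 0.55, 2048, 1900),
    NodeReport("node-b", 19.0, 35.0, 0.90, 0.90, 4096, 4096),
    NodeReport("node-c", 0.6, 0.7, 0.20, 0.10, 1024, 1024),
]
chosen = select_nodes(reports, max_delay_ms=20.0, min_cpu_free=0.5, min_mem_mb=1024)
print([r.node for r in chosen])
```

Requiring both the average and the predicted value to satisfy each requirement is a conservative choice: a node that looks good on average but is predicted to degrade is excluded.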

• Measurements are accessible via one interface, from one publisher.

• Behind the interface is a distributed application:

◦ distributed repository with automatic replication and no single point of failure, which ensures resilience

◦ makes it possible to afford the demanding costs of storing, indexing and analyzing the measurements
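The idea of one access point in front of a replicated repository can be sketched as follows. This is a toy, in-memory model only: the `Replica` and `MeasurementStore` classes, the write-to-all and read-from-any policy, and the sample record are assumptions for illustration, not a description of how GridMAP is implemented.

```python
class Replica:
    """Hypothetical in-memory store standing in for one repository node."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.records = {}


class MeasurementStore:
    """Single access point that hides a set of replicas."""

    def __init__(self, replicas):
        self.replicas = replicas

    def put(self, key, snapshot):
        # Copy the snapshot to every reachable replica and report the count.
        written = 0
        for rep in self.replicas:
            if rep.up:
                rep.records[key] = snapshot
                written += 1
        if written == 0:
            raise RuntimeError("no replica reachable")
        return written

    def get(self, key):
        # Any live replica holding the key can answer the query.
        for rep in self.replicas:
            if rep.up and key in rep.records:
                return rep.records[key]
        raise KeyError(key)


store = MeasurementStore([Replica("r1"), Replica("r2"), Replica("r3")])
store.put(("node-a", "node-b"), {"rtt_ms": 9.1})
store.replicas[0].up = False            # simulate one replica failing
result = store.get(("node-a", "node-b"))
print(result)
```

Because every live replica holds a copy, the query still succeeds after one replica fails, which is the property the slide refers to as having no single point of failure.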

• Why passive measurements?

◦ Active techniques (e.g. TTCP, iperf, UDPmon) oblige the network to accommodate artificial traffic probes in addition to real traffic, decreasing overall performance.

◦ Passive techniques (e.g. Sting, Synack, IPTraf) are arguably less accurate.

◦ ICMP messaging (e.g. ping, fping, traceroute): it is not uncommon for ICMP to be disabled or treated differently from TCP traffic.

• Best of both worlds: avoid added network overhead without compromising accuracy.

• Why is passive measurement relevant for grid systems?

◦ Grid nodes constantly exchange data sets, job state, result sets, and control signals.

◦ Most if not all grid traffic is TCP-based.

• We exploit such frequent TCP interactions to extract network metrics (RTT, throughput).

• This technique is not viable in systems other than grids, which is partly why other TCP-based passive techniques are usually supplemented with active probes.

• The daemon uses pcap to capture packet headers.

◦ The TCP 3-way handshake is used to measure RTT.

◦ As connections end, throughput is calculated.

• Metrics are calculated for each flow, and stored in a local cache.

• The daemon also measures availability of local resources (such as CPU, memory, etc.).

• On a regular basis, these ‘performance snapshots’ are sent to the GridMAP service.
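The per-flow calculation described above can be sketched without any capture library by working on already-parsed header records. The `Header` tuple, the `LOCAL` address, and the choice to measure throughput from the end of the handshake to the first FIN or RST are assumptions of this example; the slides only state that the handshake yields RTT and that throughput is computed as connections end.

```python
from collections import namedtuple

# Hypothetical, simplified view of a captured TCP header. payload_len would
# be derived from the IP total length minus the IP and TCP header lengths.
Header = namedtuple("Header", "ts src dst flags payload_len")

LOCAL = "10.0.0.1"  # assumed address of the monitored (sending) node


def flow_metrics(headers, local=LOCAL):
    """Return (rtt_seconds, throughput_bytes_per_second) for one flow
    initiated by `local`, using only packet headers seen at that node."""
    syn_ts = synack_ts = end_ts = None
    sent_bytes = 0
    for h in headers:
        if h.src == local and h.flags == {"SYN"} and syn_ts is None:
            syn_ts = h.ts                      # first step of the handshake
        elif h.dst == local and h.flags == {"SYN", "ACK"} and synack_ts is None:
            synack_ts = h.ts                   # second step: the peer's reply
        if h.src == local:
            sent_bytes += h.payload_len        # count outgoing payload only
        if "FIN" in h.flags or "RST" in h.flags:
            end_ts = h.ts                      # connection is closing
            break
    if syn_ts is None or synack_ts is None or end_ts is None:
        return None                            # handshake or close not observed
    rtt = synack_ts - syn_ts
    duration = end_ts - synack_ts
    throughput = sent_bytes / duration if duration > 0 else 0.0
    return rtt, throughput


# Invented trace: timestamps in seconds, two 1460-byte segments sent.
peer = "10.0.0.2"
trace = [
    Header(0.000, LOCAL, peer, {"SYN"}, 0),
    Header(0.020, peer, LOCAL, {"SYN", "ACK"}, 0),
    Header(0.021, LOCAL, peer, {"ACK"}, 0),
    Header(0.500, LOCAL, peer, {"ACK"}, 1460),
    Header(1.000, LOCAL, peer, {"ACK", "PSH"}, 1460),
    Header(1.020, LOCAL, peer, {"FIN", "ACK"}, 0),
]
rtt, throughput = flow_metrics(trace)
print(f"RTT {rtt * 1000:.1f} ms, throughput {throughput:.0f} B/s")
```

On the initiating node, the gap between the outgoing SYN and the incoming SYN-ACK spans one full round trip, so no extra probe traffic is needed to obtain the RTT.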

• Aim: to verify the accuracy of the obtained measurements.

• Setup:

◦ 5 connections of varying lengths.

◦ Trigger 34 iperf probes of different durations (1-500 seconds).

◦ Run the daemon on the sending node.

◦ Compare results against those of ping and iperf.

• Experiment 1: Ethernet connection (1 hop, ~0.57 ms)

• Experiment 2: Local DSL connection (4 hops, ~19 ms)

• Experiment 3: Lancaster → Oxford (12 hops, ~9 ms)

• Experiment 4: Lancaster → Munich

• Experiment 5: Innsbruck → Lancaster

[Figure: results per connection. Panels: Ethernet, Oxford, Munich, Innsbruck]

Note: During the DSL connection test, ping packets did not get through due to disabled ICMP messaging.

[Figure: results per connection. Panels: Ethernet, DSL, Oxford, Munich, Innsbruck]

• On average, our measurements were:

◦ 1.55% away from the minimum ping values and 2.33% away from the mean ping values

◦ 2.20% away from the iperf measurements
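The percentages above express how far the passive measurements fall from a reference tool. One plausible way to compute such a figure is the mean absolute percentage difference, sketched below. The exact formula used in the evaluation is not stated in the slides, and the sample values here are invented rather than taken from the experiments.

```python
def mean_pct_difference(measured, reference):
    """Mean of |measured - reference| / reference, as a percentage."""
    if len(measured) != len(reference) or not measured:
        raise ValueError("need two non-empty sequences of equal length")
    diffs = [abs(m - r) / r * 100.0 for m, r in zip(measured, reference)]
    return sum(diffs) / len(diffs)


# Invented RTT samples in milliseconds, not taken from the experiments.
passive_rtt = [10.2, 19.6, 9.1]
ping_rtt = [10.0, 20.0, 9.0]
deviation = mean_pct_difference(passive_rtt, ping_rtt)
print(f"{deviation:.2f}% away from the reference")
```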

• Daemon works in an entirely passive fashion:

◦ no disruption caused to real traffic

◦ measurements cannot be mistaken for threats such as TCP-SYN floods or DoS attacks

• Independent operation:

◦ no need for peer coordination/synchronization

◦ no reliance on router accounting schemes (e.g. IP accounting, NetFlow)

• Monitoring traffic becomes an automatic process.

• The technique is quite trivial, but provides a powerful viewpoint which results in measurements that directly reflect the experience of grid traffic.

• Development of the GridMAP grid service is ongoing.

• We will expand the range of metrics.

◦ e.g. one-way delay variation is important to virtualization applications

• We then plan to test our technique against more active and passive measurement tools.
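For the one-way delay variation metric mentioned above, a common formulation (in the spirit of the IP packet delay variation metric of RFC 3393) takes the difference between the one-way delays of consecutive packets. The sketch below assumes that per-packet send and receive timestamps are available, which the slides do not describe; the function name and sample timestamps are illustrative only.

```python
def delay_variation(send_ts, recv_ts):
    """Differences between one-way delays of consecutive packets.

    A constant clock offset between sender and receiver cancels out,
    because it appears in both of the delays being subtracted.
    """
    if len(send_ts) != len(recv_ts):
        raise ValueError("timestamp lists must have equal length")
    delays = [r - s for s, r in zip(send_ts, recv_ts)]
    return [later - earlier for earlier, later in zip(delays, delays[1:])]


# Invented timestamps in milliseconds; the receiver clock runs 500 ms ahead.
send = [0, 20, 40, 60]
recv = [510, 532, 549, 570]
variation = delay_variation(send, recv)
print(variation)
```

Since only differences of delays are used, the metric does not depend on the absolute clock offset between the two hosts, as long as that offset stays constant over the measurement.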

Yehia El-khatib (yehia), Christopher Edwards (ce), Michael Mackay (m.mackay), Gareth Tyson (g.tyson)

{yehia, ce, m.mackay, g.tyson} @ comp.lancs.ac.uk

Computing Department, Lancaster University, Lancaster, LA1 4WA, United Kingdom

www.ec-gin.eu
