Network-Aware Clustering of Web Clients Advanced IP Topics Seminar, Fall 2000 Supervisor: Anat Bremler Speaker: Zotenko Elena

Network-Aware Clustering of Web Clients

Advanced IP Topics Seminar, Fall 2000

Supervisor: Anat Bremler

Speaker: Zotenko Elena

Paper

• presentation is based on:
  – Balachander Krishnamurthy and Jia Wang, “On Network-Aware Clustering Of Web Clients”, Proc. of ACM SIGCOMM 2000;
  – Balachander Krishnamurthy and Jia Wang, “On Network-Aware Clustering Of Web Clients”, Technical Report 000101-01-TM, AT&T Labs-Research, January 2000

Agenda

• problem definition

• simple approach for the problem solution

• network-aware approach for the problem solution using information from BGP routing tables

• applications

[Figure: client IP addresses 9.3.5.110, 9.3.5.111 and 12.2.94.30, 12.2.95.30, 12.2.95.33 grouped into two clusters]

Problem Definition

• definition of clustering in our case:
  – a partitioning of a set of IP addresses into non-overlapping groups, such that all IP addresses in a group are topologically close and under common administrative control

net A: 9.3.5/255.255.255

net B: 12.2.94/255.255.254

Simple Approach

• assumes that the 24 most significant bits (MSB) of each IP address identify the network

• groups IP addresses based on the network portion of the IP address

• drawback: the assumption is not always correct because of CIDR:
  – aggregation;
  – sub-netting;
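The simple approach can be sketched in a few lines of Python. This is only an illustration, not the code used in the paper: the function name `cluster_by_24_bits` is hypothetical, and the client addresses are the ones shown in the problem definition figure.

```python
from collections import defaultdict
import ipaddress

def cluster_by_24_bits(addresses):
    """Group IPv4 addresses by their 24 most significant bits."""
    clusters = defaultdict(list)
    for text in addresses:
        ip = ipaddress.IPv4Address(text)
        # Zero the low 8 bits to obtain the assumed /24 network address.
        network = ipaddress.IPv4Address(int(ip) & 0xFFFFFF00)
        clusters[f"{network}/24"].append(text)
    return dict(clusters)

clients = ["9.3.5.110", "9.3.5.111", "12.2.94.30", "12.2.95.30", "12.2.95.33"]
simple_clusters = cluster_by_24_bits(clients)
for prefix, members in simple_clusters.items():
    print(prefix, members)
```

With these addresses the sketch produces three clusters. Net B (12.2.94/255.255.254) is split between 12.2.94.0/24 and 12.2.95.0/24, which is the "one network spans several clusters" kind of error.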

Simple Approach

clusters identified correctly

misidentified clusters:
• one cluster contains several networks;

misidentified clusters:
• one network spans several clusters;

network prefix distribution for BGP routing table snapshot for MAE-West NAP

NAA Overview

• identifies networks based on:
  – BGP routing table snapshots
  – IP dump files

• includes validation and adaptation stages

NAA Overview

stages: prefix table → clustering → validation → self-correction and adaptation

input – network prefixes from:
• BGP routing table snapshots;
• IP dump files from ARIN and NLANR;

output – prefix table:
• contains all prefixes in one format;
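Bringing all prefixes into one format can be sketched as follows. The sketch assumes only two input notations, `address/netmask` (possibly with trailing zero octets dropped, as in 12.2.94/255.255.254) and `address/length`; the formats actually found in the snapshots and dump files may differ, and the helper names are hypothetical.

```python
import ipaddress

def pad_to_four_octets(dotted):
    """Pad a truncated dotted value such as '12.2.94' with trailing zero octets."""
    parts = dotted.split(".")
    return ".".join(parts + ["0"] * (4 - len(parts)))

def normalize_prefix(entry):
    """Convert 'address/netmask' or 'address/length' into an IPv4Network."""
    address, mask = entry.split("/")
    if "." in mask:
        # Dotted netmask, possibly truncated (assumed notation).
        mask = pad_to_four_octets(mask)
    return ipaddress.IPv4Network(f"{pad_to_four_octets(address)}/{mask}", strict=False)

entries = ["9.3.5/255.255.255", "12.2.94/255.255.254", "172.30.0.0/16"]
# A set removes duplicates that appear in several sources.
prefix_table = sorted({normalize_prefix(e) for e in entries})
print([str(p) for p in prefix_table])
```

Storing every entry as an `IPv4Network` gives one canonical form, so the same prefix taken from two sources is kept only once.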

NAA Overview



input:
• prefix table;
• IP addresses for clustering;

output – raw clusters:
• each network prefix represents a cluster;
• each IP address is put into the cluster whose prefix is its longest match;

NAA Overview



example:
• prefix table contains:
  – prefix A: 172.30.0.0/255.255.0.0
  – prefix B: 172.30.110.0/255.255.255.0
• 172.30.110.25 will be assigned to the cluster represented by prefix B;
• 172.30.115.25 will be assigned to the cluster represented by prefix A;
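The longest-match rule in this example can be sketched as below. The linear scan over the prefix table is an assumption made for brevity (a production implementation would more likely use a trie), the function names are hypothetical, and the two host addresses are valid addresses chosen for illustration inside the ranges of prefix B and prefix A.

```python
import ipaddress
from collections import defaultdict

def longest_match(address, prefix_table):
    """Return the most specific prefix containing the address, or None."""
    ip = ipaddress.IPv4Address(address)
    candidates = [net for net in prefix_table if ip in net]
    return max(candidates, key=lambda net: net.prefixlen, default=None)

def raw_clusters(addresses, prefix_table):
    """Assign each address to the cluster named by its longest matching prefix."""
    clusters = defaultdict(list)
    for address in addresses:
        prefix = longest_match(address, prefix_table)
        if prefix is not None:  # addresses with no matching prefix stay unclustered
            clusters[str(prefix)].append(address)
    return dict(clusters)

prefix_a = ipaddress.IPv4Network("172.30.0.0/255.255.0.0")
prefix_b = ipaddress.IPv4Network("172.30.110.0/255.255.255.0")
table = [prefix_a, prefix_b]
clusters = raw_clusters(["172.30.110.25", "172.30.115.25"], table)
print(clusters)
```

The first address matches both prefixes and goes to the longer one (prefix B); the second matches only prefix A.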

NAA Overview



input:
• raw clusters;

estimates the goodness of the raw clusters by cross-checking a small number of clusters (a sample of 1% of the clusters);

NAA Overview



goodness:
• a cluster is too big => it includes IP addresses from different “networks”;
• clusters are too small => several clusters include IP addresses from the same “network”;
• “network” – a group of IP addresses that are topologically close and under common administrative control;

NAA Overview



dynamically changes the clustering according to changes in network topology

NAA Building Prefix Table

ARIN (American Registry for Internet Numbers) IP dump file:
• contains IP addresses registered with ARIN
• on one hand, may contain addresses of non-existent networks, and is thus much larger than any BGP snapshot
• on the other hand, may contain an address entry that covers several networks
• only 1% of clients are clustered based on IP dump files

BGP snapshot taken from AADS NAP:
• publicly available via www.merit.edu/ipma/routing_table
• much smaller than IP dump files;
• contains networks that physically exist and are reachable;

NAA Validation

• cross-check of the clustering based on host names or, if names are unavailable, on paths

• based on the assumption that hosts in the same network:
  – share the same non-trivial suffix in their names
  – share the same last few hops on the paths toward them
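The two assumptions can be turned into simple checks, sketched below. How many trailing name labels count as a "non-trivial suffix" and how many hops count as "the last few" are not stated on the slide, so `labels` and `hops` are assumed parameters; the host and router names are made up.

```python
def name_suffix(hostname, labels=3):
    """Return the last `labels` dot-separated components of a host name."""
    return ".".join(hostname.lower().rstrip(".").split(".")[-labels:])

def share_name_suffix(hostnames, labels=3):
    """True if all host names end in the same suffix."""
    return len({name_suffix(h, labels) for h in hostnames}) == 1

def share_last_hops(paths, hops=2):
    """True if all router paths end in the same `hops` routers."""
    return len({tuple(path[-hops:]) for path in paths}) == 1

# Hypothetical names and paths for hosts in one candidate cluster.
names = ["host1.lab.example.com", "host2.lab.example.com"]
paths = [["r1", "r8", "r9"], ["r2", "r8", "r9"]]
print(share_name_suffix(names), share_last_hops(paths))
```

A cluster whose members fail the applicable check is a candidate for the "contains several networks" case.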

NAA Validation

• why names can be unavailable (50% of clients):
  – the host is behind a firewall
  – the local network acquires dynamic IP addresses via a DHCP server
  – the ISP has not registered any names for its customers

NAA Validation

• validation procedure:
  – sample 1% of the clusters
  – for each cluster, use a modified traceroute to resolve the host name, or the last few hops toward the host, for each IP address in the cluster
  – if a cluster contains hosts from several networks, declare the cluster misidentified
  – if several clusters contain hosts from the same network, declare those clusters misidentified
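A minimal sketch of this procedure follows. The callable `network_of` is a hypothetical stand-in for the modified traceroute: it maps an IP address to a label of the network it appears to belong to (a name suffix or the last hops). The second rule is read here as flagging clusters that share hosts from the same network. The sample fraction is a parameter so that the tiny example can check every cluster.

```python
import random
from collections import defaultdict

def validate(clusters, network_of, fraction=0.01, seed=0):
    """Sample clusters and return (sampled prefixes, misidentified prefixes)."""
    rng = random.Random(seed)
    count = max(1, round(len(clusters) * fraction))
    sampled = rng.sample(sorted(clusters), count)
    misidentified = set()
    owners = defaultdict(set)  # network label -> sampled clusters holding its hosts
    for prefix in sampled:
        labels = {network_of(ip) for ip in clusters[prefix]}
        if len(labels) > 1:  # one cluster contains several networks
            misidentified.add(prefix)
        for label in labels:
            owners[label].add(prefix)
    for prefixes in owners.values():
        if len(prefixes) > 1:  # one network spans several clusters
            misidentified |= prefixes
    return sampled, misidentified

# Hypothetical data: cluster A mixes two networks, B and C split one network.
clusters = {"A": ["10.0.0.1", "10.0.1.1"], "B": ["10.1.0.1"],
            "C": ["10.1.1.1"], "D": ["10.2.0.1", "10.2.0.2"]}
labels = {"10.0.0.1": "net1", "10.0.1.1": "net2", "10.1.0.1": "net3",
          "10.1.1.1": "net3", "10.2.0.1": "net4", "10.2.0.2": "net4"}
sampled, bad = validate(clusters, labels.get, fraction=1.0)
print(sorted(bad))
```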

NAA Validation

about 10% of clusters are misidentified;
one reason for misidentification is the existence of national gateways (e.g. France, Japan): information about the networks behind these gateways is unavailable in routing tables;

about half of the sampled clients have resolvable names

BGP Dynamics And NAA

• BGP routing tables change dynamically due to changes in network topology

• NAA clusters clients based on BGP table snapshots, which may not reflect the current network topology or the network topology at the time when the client IP address was logged


• finding out how BGP dynamics affect NAA clustering:
  – download BGP snapshots daily over a period of n days
  – denote by S[i] the set of prefixes downloaded during day i
  – define the maximum effect as the size of the set of prefixes that change during the entire testing period:

$$\text{maximum effect} = \left|\, \bigcup_{i=1}^{n} S[i] \;\setminus\; \bigcap_{i=1}^{n} S[i] \,\right|$$


• example: testing period of 3 days

[Figure: diagram of the sets S[1], S[2] and S[3], with two regions marked:]
• the set of prefixes that are unchanged during the entire testing period of 3 days;
• the set of prefixes that changed during the testing period of 3 days; the size of this set is the maximum effect;
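The 3-day example can be reproduced with Python sets. The sketch treats a prefix as unchanged when it appears in every daily snapshot and as changed otherwise; the snapshot contents are made-up values, not data from the paper.

```python
def changed_prefixes(snapshots):
    """Return the prefixes that are not present in every daily snapshot."""
    seen = set().union(*snapshots)                     # prefixes seen on any day
    unchanged = set.intersection(*map(set, snapshots)) # prefixes seen on every day
    return seen - unchanged

# Hypothetical snapshots S[1], S[2], S[3].
s1 = {"9.3.5.0/24", "12.2.94.0/23", "172.30.0.0/16"}
s2 = {"9.3.5.0/24", "12.2.94.0/23"}
s3 = {"9.3.5.0/24", "12.2.94.0/23", "192.0.2.0/24"}

changed = changed_prefixes([s1, s2, s3])
maximum_effect = len(changed)
print(maximum_effect, sorted(changed))
```

Here two prefixes appear in only some snapshots, so the maximum effect for this toy period is 2.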


[Figure annotations:]
• number of prefixes in the AADS BGP snapshot on the 4th day of the 14-day testing period
• maximum effect observed so far, over 4 days
• number of prefixes from the AADS BGP snapshot used to identify clusters from the Apache server log
• maximum effect for the prefixes used to identify Apache server log clusters, observed so far; this maximum effect is about 121/3929 = 3% of all client clusters

NAA Adaptation

• although empirical results show that only about 3% of client clusters are affected by changing network topology, an adaptation step can be employed to improve NAA applicability

• run periodic traceroute on sampled clusters
• using the traceroute results:
  – merge clusters that span the same network into one cluster
  – divide a cluster that includes several networks into several clusters
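Both corrections amount to regrouping the sampled addresses by the network they are observed in, as in the sketch below. The callable `network_of` is again a hypothetical stand-in for the traceroute result, and keying the adapted clusters by network label is an assumption of this sketch, not a detail from the paper.

```python
from collections import defaultdict

def adapt(clusters, network_of):
    """Regroup clustered addresses by observed network label."""
    adapted = defaultdict(list)
    for members in clusters.values():
        for ip in members:
            # Hosts of one network end up together (merge);
            # hosts of different networks end up apart (divide).
            adapted[network_of(ip)].append(ip)
    return dict(adapted)

# Hypothetical data: cluster A mixes two networks, B and C split one network.
clusters = {"A": ["10.0.0.1", "10.0.1.1"], "B": ["10.1.0.1"], "C": ["10.1.1.1"]}
labels = {"10.0.0.1": "net1", "10.0.1.1": "net2",
          "10.1.0.1": "net3", "10.1.1.1": "net3"}
adapted = adapt(clusters, labels.get)
print(adapted)
```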

Applications

• positioning of Web caching proxies

caching proxy:
– acts as a server for clients;
– acts as a client for the original server;
– caches frequently accessed resources;

Applications

• filter out hosts with unusual access patterns:
  – caching proxy
  – spider

Applications

client request distribution of a client cluster containing a spider

a spider which issues 99.79% of all requests in the cluster
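One way to flag such a cluster is to compute the share of requests issued by its most active client, as sketched below. The 90% threshold is an assumed value, the client addresses are made up, and a real detector would probably also look at how requests are spread over time.

```python
from collections import Counter

def dominant_client(requests, threshold=0.9):
    """Return (client, share) if one client issues at least `threshold` of the requests."""
    counts = Counter(requests)
    client, hits = counts.most_common(1)[0]
    share = hits / len(requests)
    return (client, share) if share >= threshold else None

# Hypothetical cluster log: one entry (the client IP) per request.
cluster_log = ["10.0.0.1"] * 9979 + ["10.0.0.2"] * 21
suspect = dominant_client(cluster_log)
print(suspect)
```

In this made-up log the first client issues 9979 of 10000 requests, which matches the 99.79% share quoted for the spider.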

Applications

number of requests over time for the entire server log

number of requests over time for a cluster containing a proxy

number of requests over time for a cluster containing a spider

Applications

• identify busy clusters based on metrics such as the number of clients and the number of requests issued

• place a caching proxy in front of each busy cluster
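Selecting busy clusters can be sketched as below. The sketch uses only the request count as the deciding metric, with an assumed cut-off `min_requests`; the slide does not say how the metrics are combined, and the log data is made up.

```python
def busy_clusters(cluster_requests, min_requests):
    """Return (busy cluster prefixes, busiest first; per-cluster metrics)."""
    # cluster_requests maps a cluster prefix to a list with one client IP per request.
    metrics = {
        prefix: {"clients": len(set(reqs)), "requests": len(reqs)}
        for prefix, reqs in cluster_requests.items()
    }
    busy = [p for p, m in metrics.items() if m["requests"] >= min_requests]
    busy.sort(key=lambda p: metrics[p]["requests"], reverse=True)
    return busy, metrics

# Hypothetical per-cluster request logs.
logs = {
    "9.3.5.0/24": ["9.3.5.110"] * 40 + ["9.3.5.111"] * 60,
    "12.2.94.0/23": ["12.2.94.30"] * 5,
    "172.30.0.0/16": ["172.30.115.25"] * 300,
}
busy, metrics = busy_clusters(logs, min_requests=100)
print(busy)
```

Each prefix in `busy` is a candidate location for a caching proxy.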

Applications

THE END
