
1

Lessons Learned Monitoring

Les Cottrell, SLAC

ESnet R&D Advisory Workshop, April 23, 2007, Arlington, Virginia

Partially funded by DOE and by Internet2

2

Uses of Measurements

• Automated problem identification & troubleshooting:
  – Alerts for network administrators, e.g. baselines, bandwidth changes in time series, iperf, SNMP
  – Alerts for systems people, e.g. OS/host metrics
• Forecasts for Grid middleware, e.g. replica manager, data placement
• Engineering, planning, SLAs (set & verify), expectations
• Also (not addressed here):
  – Security: spot anomalies, intrusion detection
  – Accounting

3

History

• PingER (1994), IEPM-BW (2001)
  – E2E, active, regular measurements from the end-user view
  – All hosts owned by individual sites
  – Core mainly centrally designed & developed (homogeneous), with contributions from FNAL, GATech, NIIT (close collaboration)
• Why are you monitoring?
  – Network trouble management, planning, auditing/setting SLAs, and Grid forecasting are very different uses, though they may use the same measurements

4

PingER (1994)

• PingER project originally (1995) for measuring network performance for the US, European and Japanese HEP community; now mainly R&E sites
• Extended this century to measure the Digital Divide:
  – Collaboration with the ICTP Science Dissemination Unit: http://sdu.ictp.it
  – ICFA/SCIC: http://icfa-scic.web.cern.ch/ICFA-SCIC/
• Monitor 44 sites in S. Asia
• Most extensive active E2E monitoring in the world:
  – >120 countries (99% of the world's connected population)
  – >35 monitoring sites in 14 countries
• Uses the ubiquitous ping facility

5

PingER Design Details

• PingER design (1994: no web services, no RRD, security not a big thing, etc.)
  – Simple: no remote software (ping is everywhere), no probe development; monitoring host install is ~0.5 day of sys-admin effort
  – Data centrally gathered, archived and analyzed, so the hard jobs (archiving, analysis, visualization) do NOT require distribution; only one copy
  – Database is flat ASCII files (raw data and analyzed data, one file per pair per day); compression saves a factor of 6 (90 GB) (a minimal sketch of this ping-and-flat-file pattern follows this list)
  – Data available via the web (a lot of use, some uses unexpected, often analysis with Excel)
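
The ping-everywhere, flat-ASCII-file design above is simple enough to sketch. The following is an illustrative sketch only (not PingER's actual Perl code): send a burst of pings, parse loss and RTT, and append one record to an ASCII file named per monitor/remote pair per day. Host names and the file layout are made up, and the output parsing assumes Linux-style ping output.

    # Illustrative sketch only: ping a remote host, parse loss and RTT,
    # and append a record to one flat ASCII file per pair per day.
    import re, subprocess, time, pathlib

    def ping_sample(host, count=10):
        out = subprocess.run(["ping", "-c", str(count), host],
                             capture_output=True, text=True, timeout=60).stdout
        loss = re.search(r"(\d+(?:\.\d+)?)% packet loss", out)
        rtt = re.search(r"= ([\d.]+)/([\d.]+)/([\d.]+)", out)   # min/avg/max (ms)
        return (float(loss.group(1)) if loss else 100.0,
                [float(x) for x in rtt.groups()] if rtt else [None] * 3)

    def record(monitor, remote):
        loss, (mn, avg, mx) = ping_sample(remote)
        day = time.strftime("%Y-%m-%d")
        path = pathlib.Path(f"{monitor}-{remote}-{day}.txt")    # one file per pair per day
        with path.open("a") as f:
            f.write(f"{time.time():.0f} {loss} {mn} {avg} {mx}\n")

    record("monitor.example.org", "remote.example.org")         # hypothetical hosts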

6

PingER Lessons

• Measurement code rewritten twice: once to add extra data, once to document (perldoc), parameterize and simplify installation
• Gathering code (uses LYNX or FTP) pulls from the archive; no major mods in 10 years
• Most of the development effort: download, data analysis, visualization, management
  – New ways to use the data (jitter, out-of-order, duplicates, derived throughput, MOS) all required studying the data, then implementing and integrating
  – Dirty data (pathologies not related to the network) require filtering or filling before analysis
• Had to develop an easy make/install download, instructions, FAQ; new installs still require communication:
  – Pre-requisites, getting name registration, getting cron jobs running, getting the web server running, unblocking, clarifying documentation (often non-native English speakers)
• Documentation (tutorials, help, FAQs), publicity (brochures, papers, maps, presentations/travel), getting funding/proposals
• Monitor availability of (developed tools to simplify/automate):
  – Monitoring sites (hosts stop working: security blocks, hosts replaced, site forgets); nudge contacts
  – Critical remote sites (beacons); choose a new one (automatically updates monitoring sites)
• Validate/update metadata (name, address, institute, lat/long, contact …) in the database (needs easy update)

7

IEPM-BW (2001)

• 40 target hosts in 13 countries
• Bottlenecks vary from 0.5 Mbits/s to 1 Gbits/s
• Traverses ~50 ASes, 15 major Internet providers
• 5 targets at PoPs, the rest at end sites
• Added Sunnyvale for UltraLight
• Covers all US ATLAS tier 0, 1, 2 sites
• Recently added FZK, QAU
• Main author (Connie Logg) retired

8

IEPM Design Details

• IEPM-BW (2001):
  – More focused than PingER: fewer sites (e.g. BaBar collaborators), more intense, more probe tools (iperf, thrulay, pathload, traceroute, owamp, bbftp …), more flexibility
  – Complete code set (measurement, archive, analysis, visualization) at each monitoring site; data distributed
  – Needs a dedicated host
  – Remote sites need code installed
    • Originally executed remotely via ssh, still needed code installed
    • Security, accounts (require training), recovery problems
  – Major changes with time:
    • Use servers rather than ssh for remote hosts
    • Use MySQL for configuration databases rather than requiring Perl scripts
    • Provide management tools for configuration data etc.
    • Add/replace probes

9

IEPM Lessons (1)

• Problems & recommendations:
  – Need the right versions of MySQL, gnuplot, Perl (and modules) installed on hosts
  – All possible failure modes of the probe tools need to be understood and accommodated
  – Timeout everything, clean up hung processes (a minimal sketch follows this list)
  – Keep logfiles for a day or so for debugging
  – Review how processes run with Netflow (mainly manual)
  – Scheduling:
    • Don't run file transfer, iperf, thrulay and pathload at the same time on the same path
    • Limit the duration and frequency of intensive probes so they do not impact the network
  – Host loses a disk, upgrades its OS, loses DNS, applications upgraded (e.g. gnuplot), IEPM database zapped, etc.
    • Need backups
  – Have a local host as a target for sanity checks (e.g. monitoring-host-based issues)
  – Monitor the monitoring host's load (e.g. Ganglia, Nagios …)
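
As an illustration of the "timeout everything, clean up hung processes" recommendation, here is a minimal hedged sketch (not IEPM's actual code) that runs a probe with a hard timeout and kills the whole process group if it hangs; the probe command and host are examples only.

    # Illustrative only: run a probe with a hard timeout and clean up hung children.
    import os, signal, subprocess

    def run_probe(cmd, timeout_s=120):
        # Start the probe in its own session/process group so all children can be killed.
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                                text=True, start_new_session=True)
        try:
            out, _ = proc.communicate(timeout=timeout_s)
            return proc.returncode, out
        except subprocess.TimeoutExpired:
            os.killpg(os.getpgid(proc.pid), signal.SIGKILL)   # clean up the hung process group
            proc.wait()
            return None, ""                                   # signal a timed-out probe

    # Example (hypothetical target host): a 10-second iperf test, capped at 2 minutes.
    rc, output = run_probe(["iperf", "-c", "target.example.org", "-t", "10"])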

10

IEPM Lessons (2)

• Different paths need different probes (performance and interest related)
• Experiences with probes (a lot of work to understand, analyze & compare):
  – OWAMP vs ping: OWAMP needs a server and accurate time; ping gives only round-trip, is available everywhere, may be blocked
  – Traceroute: need to analyze the significance of the results
  – Packet pair separation:
    • ABwE noisy, inaccurate especially on Gbps paths
    • Pathchirp better, pathload best (the most intense, approaches iperf); problems at 10 Gbps; look at pathneck
  – TCP:
    • thrulay gives more information and is more manageable than iperf
    • Need to keep TCP buffers optimized/updated
  – File transfer:
    • Disk-to-disk is close to iperf/thrulay
    • Disk measures the file/disk system, not the network, but is important to the end user
• Adding new hosts is still not easy

11

Other Lessons

• Traceroute is no good for layers 2 & 1
• Packet pair exceeds the timing granularity at 10 Gbps
• Forecasting is hard if the path is congested; need to account for diurnal etc. variations
• A net admin cannot review thousands of graphs each day:
  – Need event detection, alert notification, and diagnosis assistance
• Comparing QoS vs best effort requires adding path reservation
• Keeping TCP buffer parameters optimized is difficult
• Network & configurations are not static
• Passive/Netflow is valuable and complementary

12

PerfSONAR

• Our future focus (for us, the 3rd generation)
• Open source, open community
  – Both end users (LHC, GATech, SLAC, Delaware) and network providers (ESnet, I2, GEANT, European NRENs, Brazil, …)
  – Many developers from multiple fields
  – Requires from the get-go: shared code, documentation, collaboration
  – Hopefully not as dependent on the funding of a single team, so persistent?
• Transparent gathering and storage of measurements, both from NRENs and end users
• Sharing of information across autonomous domains
  – Uses standard formats
  – More comprehensive view
  – AAA to provide protection of sensitive data
  – Reduces debugging time
    • Access to multiple components of the path
    • No need to play telephone tag
• Currently mainly middleware; needs:
  – Data mining and visualization
  – Topology also at layers 1 & 2
  – Forecasting
  – Event detection and event diagnosis

13

Active E2E Monitoring

14

E.g. Using Active IEPM-BW measurements

• Focus on high performance for a few hosts needing to send data to a small number of collaborator sites, e.g. the HEP tiered model
• Makes regular measurements with probe tools:
  – ping (RTT, connectivity), owamp (one-way delay), traceroute (routes)
  – pathchirp, pathload (available bandwidth)
  – iperf (single & multi-stream), thrulay (achievable throughput)
  – Supports bbftp, bbcp (file transfer applications, not network)
• Looking at GridFTP, but it is complex, requiring renewing certificates
  – Choice of probes depends on the importance of the path, e.g.
    • For major paths (tier 0, 1 & some 2) use the full suite
    • For tier 3 use just ping and traceroute
• Running at major HEP sites: CERN, SLAC, FNAL, BNL, Caltech, Taiwan, SNV, to about 40 remote sites
  – http://www.slac.stanford.edu/comp/net/iepm-bw.slac.stanford.edu/slac_wan_bw_tests.html

15

IEPM-BW Measurement Topology

• 40 target hosts in 13 countries
• Bottlenecks vary from 0.5 Mbits/s to 1 Gbits/s
• Traverses ~50 ASes, 15 major Internet providers
• 5 targets at PoPs, the rest at end sites
• Added Sunnyvale for UltraLight
• Adding FZK Karlsruhe
[Map of the measurement topology; labels include Taiwan and TWAREN]

16

Top page

17

Probes: Ping/Traceroute

• Ping still useful:
  – Is the path connected / the node reachable?
  – RTT, jitter, loss
  – Great for low-performance links (e.g. Digital Divide), e.g. AMP (NLANR) / PingER (SLAC)
  – Nothing to install, but blocking is an issue
• OWAMP/I2 similar but one-way:
  – Needs a server installed at the other end and good timers
  – Now built into IEPM-BW
• Traceroute:
  – Needs good visualization (traceanal/SLAC)
  – No use for dedicated λ at layer 1 or 2
    • However, we still want to know the topology of the paths

18

Probes: Packet Pair Dispersion

• Used by pathload, pathchirp, ABwE for available bandwidth
• Send packets with a known separation
• See how the separation changes due to the bottleneck (see the sketch after this list)
• Can be low network intrusion, e.g. ABwE uses only 20 packets/direction and is fast (< 1 sec)
• From the PAM paper, pathchirp is more accurate than ABwE, but:
  – Ten times as long (10 s vs 1 s)
  – More network traffic (~factor of 10)
• Pathload is a factor of 10 more again
  – http://www.pam2005.org/PDF/34310310.pdf
• IEPM-BW now supports ABwE, Pathchirp, Pathload
[Diagram: packets acquire a minimum spacing at the bottleneck; that spacing is preserved on higher-speed links downstream]
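
To make the idea concrete, here is a small illustrative calculation (not any of the tools' actual code): the classic packet-pair estimate takes the bottleneck capacity to be the packet size divided by the dispersion the bottleneck imposes, so at multi-Gbps rates the dispersion that must be resolved shrinks to around a microsecond.

    # Illustrative packet-pair arithmetic: capacity ≈ packet_size / dispersion.
    def capacity_from_dispersion(packet_bytes, dispersion_s):
        return packet_bytes * 8 / dispersion_s          # bits per second

    def dispersion_from_capacity(packet_bytes, capacity_bps):
        return packet_bytes * 8 / capacity_bps          # seconds between packets

    # A 1500-byte packet pair observed 12 microseconds apart implies ~1 Gbit/s.
    print(capacity_from_dispersion(1500, 12e-6))        # ~1.0e9 b/s
    # At 10 Gbit/s the same packets are only ~1.2 microseconds apart,
    # near or beyond the timing resolution of a typical host clock.
    print(dispersion_from_capacity(1500, 10e9))         # ~1.2e-6 s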

19

BUT …

• Packet pair dispersion relies on accurate timing of the inter-packet separation
  – At > 1 Gbps this is getting beyond the resolution of Unix clocks
  – AND 10GE NICs are offloading functions:
    • Interrupt coalescing, Large Send & Receive Offload, TOE
• Need to work with TOE vendors:
  – Turn off offload (Neterion supports multiple channels, can eliminate offload to get more accurate timing in the host)
  – Do timing in the NICs
  – No standards for the interfaces
• Possibly use packet trains, e.g. pathneck

20

Achievable Throughput

• Use TCP or UDP to send as much data as possible, memory to memory, from source to destination
• Tools: iperf (bwctl/I2), netperf, thrulay (from Stas Shalunov/I2), udpmon …
• Pseudo file copy: bbcp also has a memory-to-memory mode to avoid disk/file problems

21

BUT …

• At 10 Gbits/s on a transatlantic path, slow start takes over 6 seconds
  – To get 90% of the measurement in congestion avoidance, need to measure for 1 minute (5.25 GBytes at 7 Gbits/s, today's typical performance)
• Needs scheduling to scale, and even then …
• It's not disk-to-disk or application-to-application
  – So use bbcp, bbftp, or GridFTP

22

AND …

• For testbeds such as UltraLight, UltraScienceNet etc., one has to reserve the path
  – So the measurement infrastructure needs to add the capability to reserve the path (so we need an API to the reservation application)
  – OSCARS from ESnet is developing a web services interface (http://www.es.net/oscars/):
    • For lightweight probes, have a "persistent" capability
    • For more intrusive probes, must reserve just before making the measurement

23

Visualization & Forecasting in the Real World

24

Examples of Real Data

• Seasonal effects: daily & weekly
  – Some are seasonal, others are not
• Events may affect multiple metrics
• Misconfigured windows, new paths, very noisy data
[Time-series plots: Caltech thrulay, Nov 05 – Mar 06, 0–800 Mbps; UToronto iperf, Nov 05 – Jan 06, 0–250 Mbps; UTDallas pathchirp/thrulay/iperf, Mar-10-06 – Mar-20-06, 0–120 Mbps]
• Events can be caused by host or site congestion
• Few route changes result in bandwidth changes (~20%)
• Many significant events are not associated with route changes (~50%)

25

Scatter Plots & Histograms

• Scatter plots: quickly identify correlations between metrics
• Histograms: quickly identify variability or multimodality
[Example plots: RTT (ms) vs throughput (Mbits/s) for thrulay; thrulay (Mbps) vs pathchirp & iperf (Mbps); histograms of pathchirp and thrulay]

26

Changes in network topology (BGP) can result in dramatic changes in performance.

[Figure: snapshot of a traceroute summary table (remote host vs hour) and samples of traceroute trees generated from the table; ABwE measurements, one per minute for 24 hours, Thurs Oct 9 9:00am to Fri Oct 10 9:01am, showing dynamic bandwidth capacity (DBC), cross-traffic (XT) and available bandwidth = DBC − XT in Mbits/s; a drop in performance when the path changed from the original SLAC-CENIC-Caltech to SLAC-ESnet-LosNettos (100 Mbps)-Caltech, then back to the original path; the changes were detected by IEPM-Iperf and ABwE]

Notes:
1. Caltech misrouted via the Los-Nettos 100 Mbps commercial network 14:00-17:00
2. ESnet/GEANT working on routes from 2:00 to 14:00
3. A previous occurrence went unnoticed for 2 months
4. Next step is to auto-detect and notify

27

On the Other Hand

• Route changes may affect the RTT (in yellow)
• Yet have no noticeable effect on available bandwidth or throughput
[Time-series plots of route changes, available bandwidth, and achievable throughput]

28

However …

• Elegant graphics are great for understanding problems, BUT:
  – There can be thousands of graphs to look at (many site pairs, many devices, many metrics)
  – Need automated problem recognition AND diagnosis
• So we are developing tools to reliably detect significant, persistent changes in performance
  – Initially using a simple plateau algorithm to detect step changes (a minimal sketch follows)
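
The slides do not give the plateau algorithm's details, so the following is only a hedged sketch of the general idea under assumed parameters: compare the mean of a recent window against the mean of a longer history window and flag a persistent drop.

    # Illustrative step-change ("plateau") detector with assumed parameters:
    # flag an event when the mean of the last `recent` samples drops more than
    # `threshold` (fractionally) below the mean of the preceding `history` samples.
    def plateau_events(samples, history=50, recent=10, threshold=0.3):
        events = []
        for i in range(history + recent, len(samples) + 1):
            hist = samples[i - history - recent : i - recent]
            rec = samples[i - recent : i]
            h_mean = sum(hist) / history
            r_mean = sum(rec) / recent
            if h_mean > 0 and (h_mean - r_mean) / h_mean > threshold:
                events.append(i - 1)          # index where the drop is confirmed
        return events
        # (a real implementation must also handle missing/dirty data and re-baselining)

    # Example with made-up throughput samples (Mbits/s): a step down from ~900 to ~500.
    data = [900 + (i % 5) for i in range(60)] + [500 + (i % 5) for i in range(30)]
    print(plateau_events(data)[:3])           # first indices where the drop is flagged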

29

Seasonal Effects on Events

• Change in bandwidth (drops) between 19:00 & 22:00 Pacific Time (7:00-10:00 am Pakistan time)
• Causes more anomalous events around this time

30

Forecasting

• Over-provisioned paths should have pretty flat time series
  – Short/local-term smoothing
  – Long-term linear trends
  – Seasonal smoothing
• But seasonal trends (diurnal, weekly) need to be accounted for on about 10% of our paths
• Use Holt-Winters triple exponential weighted moving averages (a minimal sketch follows)
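
The slides only name the method, so here is a hedged sketch of additive Holt-Winters (triple exponential smoothing) with illustrative smoothing constants and a crude initialization; IEPM's own parameters and implementation are not reproduced here.

    # Illustrative additive Holt-Winters: level + trend + seasonal components,
    # producing one-step-ahead forecasts. Constants are example values only.
    def holt_winters(series, season_len, alpha=0.4, beta=0.1, gamma=0.3):
        m = season_len
        level = sum(series[:m]) / m
        trend = (sum(series[m:2 * m]) - sum(series[:m])) / (m * m)
        seasonal = [x - level for x in series[:m]]     # crude seasonal initialization
        forecasts = []
        for t in range(m, len(series)):
            x = series[t]
            last_level = level
            level = alpha * (x - seasonal[t % m]) + (1 - alpha) * (level + trend)
            trend = beta * (level - last_level) + (1 - beta) * trend
            seasonal[t % m] = gamma * (x - level) + (1 - gamma) * seasonal[t % m]
            # one-step-ahead forecast for the next sample's seasonal slot
            forecasts.append(level + trend + seasonal[(t + 1) % m])
        return forecasts

    # Example: synthetic hourly throughput with a diurnal cycle (season of 24 samples).
    import math
    demo = [700 + 150 * math.sin(2 * math.pi * t / 24) for t in range(24 * 7)]
    print(holt_winters(demo, season_len=24)[-3:])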

31

Experimental Alerting

• Have false positives down to a reasonable level (a few per week), so now sending alerts to the developers
• Alerts saved in a database
• Links to traceroutes, event analysis, time series

32

Passive

• Active monitoring
  – Pro: regularly spaced data on known paths; measurements can be made on demand
  – Con: adds traffic to the network; can interfere with real data and measurements
• What about passive monitoring?

33

Netflow et al.

• Switch identifies a flow by src/dst ports and protocol
• Cuts a record for each flow:
  – src, dst, ports, protocol, TOS, start and end time
• Collect the records and analyze (a small per-flow throughput sketch follows this list)
• Can be a lot of data to collect each day, needs a lot of CPU
  – Hundreds of MBytes to GBytes
• No intrusive traffic; sees real traffic, collaborators, applications
• No accounts/passwords/certificates/keys
• No reservations etc.
• Characterize traffic: top talkers, applications, flow lengths etc.
• LHC-OPN requires edge routers to provide Netflow data
• Internet2 backbone:
  – http://netflow.internet2.edu/weekly/
• SLAC:
  – www.slac.stanford.edu/comp/net/slac-netflow/html/SLAC-netflow.html
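
As an illustration (not SLAC's actual Netflow toolchain), a flow record of the kind described above can be modeled and turned into an approximate per-flow throughput, since the record carries byte counts and start/end times; the field names and values below are assumptions.

    # Illustrative model of a Netflow-style record and a per-flow throughput estimate.
    from dataclasses import dataclass

    @dataclass
    class FlowRecord:
        src: str
        dst: str
        src_port: int
        dst_port: int
        protocol: int            # e.g. 6 = TCP
        tos: int
        start_s: float           # flow start time (seconds)
        end_s: float             # flow end time (seconds)
        bytes: int

    def flow_throughput_mbps(rec: FlowRecord) -> float:
        duration = max(rec.end_s - rec.start_s, 1e-3)   # guard against zero-length flows
        return rec.bytes * 8 / duration / 1e6

    # Example: a 2 GByte bulk-data flow lasting 100 s gives ~160 Mbits/s.
    rec = FlowRecord("slac.example.org", "remote.example.org", 5000, 5001, 6, 0,
                     0.0, 100.0, 2_000_000_000)
    print(round(flow_throughput_mbps(rec)))             # ~160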

34

Typical Day's Flows

• Very much work in progress
• Look at the SLAC border
• Typical day:
  – ~28K flows/day
  – ~75 sites with > 100 KB bulk-data flows
  – A few hundred flows > 1 GByte
• Collect records for several weeks
• Filter for 40 major collaborator sites, big (> 100 KBytes) flows, and bulk transport apps/ports (bbcp, bbftp, iperf, thrulay, scp, ftp …)
• Divide by remote site, aggregate parallel streams
• Look at the throughput distribution (a sketch of this aggregation follows)
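
A hedged sketch of the "divide by remote site, aggregate parallel streams" step (not the production code; the record layout and site names are assumptions): flows to the same remote site that overlap in time are treated as one logical transfer and their rates are summed, giving per-site throughput samples for the distribution.

    # Illustrative aggregation of parallel streams per remote site. Each flow is a
    # (remote_site, start_s, end_s, bytes) tuple; overlapping flows to one site
    # are grouped and their rates summed.
    from collections import defaultdict

    def site_throughputs_mbps(flows, min_bytes=100_000):
        by_site = defaultdict(list)
        for site, start, end, nbytes in flows:
            if nbytes >= min_bytes:                        # keep only bulk-data flows
                by_site[site].append((start, end, nbytes))
        rate = lambda s, e, b: b * 8 / max(e - s, 1e-3) / 1e6
        dists = defaultdict(list)
        for site, fl in by_site.items():
            fl.sort()
            group = [fl[0]]
            for f in fl[1:]:
                if f[0] <= max(g[1] for g in group):       # overlaps the current group
                    group.append(f)
                else:                                      # gap: close out the group
                    dists[site].append(sum(rate(*g) for g in group))
                    group = [f]
            dists[site].append(sum(rate(*g) for g in group))
        return dists          # per-site samples of aggregate throughput (Mbits/s)

    # Example: two parallel 1 GByte streams to the same (hypothetical) site over ~100 s.
    demo = [("padova.example.org", 0, 100, 1_000_000_000),
            ("padova.example.org", 1, 101, 1_000_000_000)]
    print(site_throughputs_mbps(demo))                     # ~160 Mbits/s aggregate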

35

Netflow et al.

• Peaks at known capacities and RTTs
  – RTTs might suggest windows are not optimized: peaks at the default OS window size (BW = Window / RTT) (a worked example follows)
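
For example (illustrative numbers only), the window-limited throughput BW = Window / RTT explains such peaks: a default 64 KB TCP window over an 80 ms RTT caps a single stream at roughly 6.6 Mbits/s regardless of the link capacity.

    # Window-limited TCP throughput: BW = window / RTT (illustrative numbers only).
    def window_limited_bw_mbps(window_bytes, rtt_s):
        return window_bytes * 8 / rtt_s / 1e6

    print(window_limited_bw_mbps(65536, 0.080))       # 64 KB window, 80 ms RTT -> ~6.6 Mbits/s
    print(window_limited_bw_mbps(8_388_608, 0.080))   # 8 MB window, same path  -> ~840 Mbits/s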

36

How Many Sites Have Enough Flows?

• In May '05, found 15 sites at the SLAC border with > 1440 flows (one per 30 minutes)
  – Maybe enough for time-series forecasting of seasonal effects
• Three sites (Caltech, BNL, CERN) were actively monitored
• The rest were "free"
• Only 10% of sites have big seasonal effects in the active measurements
• The remainder need fewer flows
• So promising

37

Mining Data for Sites

• Real application use (bbftp) for 4 months
• Gives a rough idea of throughput (and confidence) for 14 sites seen from SLAC

38

Multiple Months

• bbcp throughput from SLAC to Padova
• Fairly stable with time, but large variance
• Many non-network-related factors

39

Netflow Limitations

• Use of dynamic ports makes it harder to detect the application
  – GridFTP, bbcp, bbftp can use fixed ports (but may not)
  – P2P often uses dynamic ports
  – Discriminate the type of flow based on headers, not relying on ports (see the clustering sketch after this list):
    • Types: bulk data, interactive …
    • Discriminators: inter-arrival time, length of flow, packet length, volume of flow
    • Use machine learning / neural nets to cluster flows
    • E.g. http://www.pam2004.org/papers/166.pdf
• Aggregation of parallel flows (needs care, but not difficult)
• Can be used for giving a performance forecast
  – Unclear if it can be used for detecting steps in performance
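
As an illustration of the flow-classification idea (the cited PAM paper's actual method is not reproduced here), a hedged sketch: represent each flow by simple discriminators such as duration, mean packet size and volume, then cluster with a basic k-means; the features, data and cluster count are all made-up choices.

    # Illustrative k-means clustering of flows by behavioral features
    # (duration in s, mean packet size in bytes, volume in MB); made-up data.
    import random

    def kmeans(points, k=2, iters=20):
        random.seed(0)
        centers = random.sample(points, k)
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for p in points:
                i = min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
                clusters[i].append(p)
            centers = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
                       for i, cl in enumerate(clusters)]
        return centers, clusters

    # Bulk-data flows: long, big packets, high volume; interactive flows: the opposite.
    flows = [(120, 1400, 800), (90, 1350, 600), (2, 80, 0.05), (5, 120, 0.1)]
    centers, clusters = kmeans(flows, k=2)
    print(centers)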

40

Conclusions

• Some tools fail at higher speeds
• Throughputs often depend on non-network factors:
  – Host: interface speeds (DSL, 10 Mbps Ethernet, wireless), loads, resource congestion
  – Configurations (window sizes, hosts, number of parallel streams)
  – Applications (disk/file vs memory-to-memory)
• Looking at distributions by site, they are often multi-modal
• Predictions may have large standard deviations
• Need automated assistance to diagnose events

41

In Progress

• Working on Netflow visualization (currently at BNL & SLAC), then work with other LHC sites to deploy
• Add support for pathneck
• Look at other forecasters, e.g. ARMA/ARIMA, maybe Kalman filters, neural nets
• Working on diagnosis of events
  – Multiple metrics, multiple paths
• Signed a collaborative agreement with Internet2 to collaborate with PerfSONAR:
  – Provide web services access to IEPM data
  – Provide analysis, forecasting and event detection for PerfSONAR data
  – Use PerfSONAR (e.g. router) data for diagnosis
  – Provide visualization of PerfSONAR route information
  – Apply to LHCnet
  – Look at layer 1 & 2 information

42

Questions, More Information

• Comparisons of active infrastructures:
  – www.slac.stanford.edu/grp/scs/net/proposals/infra-mon.html
• Some active public measurement infrastructures:
  – www-iepm.slac.stanford.edu/
  – www-iepm.slac.stanford.edu/pinger/
  – e2epi.internet2.edu/owamp/
  – amp.nlanr.net/
• Monitoring tools:
  – www.slac.stanford.edu/xorg/nmtf/nmtf-tools.html
  – www.caida.org/tools/
  – Google for iperf, thrulay, bwctl, pathload, pathchirp
• Event detection:
  – www.slac.stanford.edu/grp/scs/net/papers/noms/noms14224-122705-d.doc

43

Outline

• Deployment, keeping in sync, management, timeouts, killing hung processes, differing host OS/environments
• Implementation:
  – MySQL databases for data and configuration (host, tools, plotting etc.) info
  – Scheduler (prevents measurements backing up)
  – Log files, analyzed for troubles
  – Local target as a sanity check on the monitor