Simplifying Network Troubleshooting in Data Centers Aug 3, 2017 Dinesh G Dutt

Best practices for network troubleshooting


Page 1: Best practices for network troubleshooting


Page 2: Best practices for network troubleshooting

2Cumulus Networks

What’s Changed

Limitations of existing tools and why do we need new ones?

Page 3: Best practices for network troubleshooting

Demo Topology

[Topology diagram. Nodes shown: spine-1, spine-2, spine-3; torc-11, torc-12, torc-21, torc-22, tor-1, tor-2; hostd-11, hostd-21, hosts-11, hosts-21. Addresses shown: 3.0.0.4, 3.0.3.132.]

Page 4: Best practices for network troubleshooting

Multipathing

● The modern data center is completely multipathed

● Traceroute: the truth, not the whole truth

● Linux is acquiring 5-tuple based load balancing in 4.12
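To see why a single traceroute shows only part of the truth on a multipathed fabric, here is a minimal sketch of 5-tuple hashing. It is not the Linux kernel's algorithm: the hash function, the `pick_next_hop` name and the list of spines are assumptions chosen for illustration.

```python
import hashlib

def pick_next_hop(src_ip, dst_ip, proto, src_port, dst_port, next_hops):
    """Pick one equal-cost next hop from a hash of the flow's 5-tuple."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    # Illustrative hash only; real switches and kernels use their own functions.
    return next_hops[int.from_bytes(digest[:4], "big") % len(next_hops)]

spines = ["spine-1", "spine-2", "spine-3"]

# The same flow always maps to the same spine.
flow = ("3.0.0.4", "3.0.3.132", "udp", 40000, 33434)
assert pick_next_hop(*flow, spines) == pick_next_hop(*flow, spines)

# Varying only the source port (as different probes or flows do) spreads
# traffic over several spines, so one probe reveals only one of the paths.
used = {pick_next_hop("3.0.0.4", "3.0.3.132", "udp", port, 33434, spines)
        for port in range(40000, 40064)}
print(sorted(used))
```

An application flow and a traceroute probe rarely share the same 5-tuple, so the path a probe reports is not necessarily the path the application's packets take.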

Page 5: Best practices for network troubleshooting

Network Virtualization

● Tunnels obscure path and reachability

● Incorrect MTU configuration can cause unexplained connectivity problems
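As a worked example of the MTU point, the sketch below computes the largest inner IP packet that fits through a VXLAN tunnel. The header sizes are the standard ones for VXLAN carrying an untagged inner Ethernet frame; the function name is hypothetical.

```python
INNER_ETHERNET = 14  # inner Ethernet header, no VLAN tag
VXLAN_HEADER = 8
UDP_HEADER = 8
OUTER_IPV4 = 20      # no IP options
OUTER_IPV6 = 40

def max_inner_mtu(underlay_mtu, outer_ip_header=OUTER_IPV4):
    """Largest inner IP packet that fits in one underlay packet."""
    overhead = outer_ip_header + UDP_HEADER + VXLAN_HEADER + INNER_ETHERNET
    return underlay_mtu - overhead

print(max_inner_mtu(1500))              # 1450
print(max_inner_mtu(1500, OUTER_IPV6))  # 1430
print(max_inner_mtu(9216))              # 9166
```

If hosts send 1500-byte packets into a tunnel whose underlay MTU is also 1500, small packets get through while full-size ones do not, which is what makes the failure look unexplained.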

Page 6: Best practices for network troubleshooting

Microservices

● Applications are more distributed than ever, so connectivity is even more critical than before

● The short lives of containers mean it’s harder to do post-mortem analysis

Page 7: Best practices for network troubleshooting

Deployment Speed

● Need the ability to make changes with confidence

● The network needs to be immutable to daily dynamic needs

○ In public/private clouds, for example, no adding and deleting of VLANs

Page 8: Best practices for network troubleshooting

Scale

● Automation is key, not a nice-to-have

● Rethink network design and architecture to align with automation

● Network as generic infrastructure

● Push everything not related to connectivity to the edge

○ Security, segmentation endpoints, services

Page 9: Best practices for network troubleshooting

Rise of Whitebox Switching

● Simple, uniform building blocks

● Merchant switching silicon

● Disaggregate hardware and software (NOS), just like servers

Page 10: Best practices for network troubleshooting

10

What’s Not Changed

Page 12: Best practices for network troubleshooting


The Ball of Technologies The Network Admins Deal With

Page 13: Best practices for network troubleshooting


What Network Admins Go To Bat With

Page 15: Best practices for network troubleshooting

What Makes Network Troubleshooting Particularly Difficult?

Network devices are appliances

Lack of a platform approach means the device is generally closed to any non-vendor program

Lack of programmatic access or structured output

CLI screen scraping

Packet forwarding happens in silicon, which has limited troubleshooting capabilities

In comparison to compute, where software is king

Page 16: Best practices for network troubleshooting

In Short...

In short, levels of abstraction have grown in modern data center networks without a corresponding increase in tools that break down the levels

Page 19: Best practices for network troubleshooting

Many Layers to Peel

From network architecture to configuration to diagnostics to forensics

Page 20: Best practices for network troubleshooting

Key Observations

Choose the right architecture to limit or eliminate problems

Use automation to eliminate random errors

Catch problems close to the source

Page 21: Best practices for network troubleshooting

Three Step Process to Network Troubleshooting

Right Architecture => Eliminate errors due to poor design; simplify the design

Right Configuration => Eliminate errors due to complex configuration and complex automation scripts

Right Telemetry => Catch errors that have slipped past despite the rigor, and catch operational drift caused by changes beyond your control, such as aging effects (for example, cable faults)

Page 22: Best practices for network troubleshooting

22

The Appropriate Architecture

Page 23: Best practices for network troubleshooting


Key Takeaways

Modern DC architecture: build large networks out of simple building blocks

Simple building blocks significantly change the complexity of troubleshooting networks

Page 24: Best practices for network troubleshooting

Benefits of Simple, Common Building Blocks

“Google uses a very common set of building blocks across all of its software, so by instrumenting these building blocks Dapper is able to automatically generate a lot of useful trace information without any application involvement.”

- Dapper Paper, 2015

Page 25: Best practices for network troubleshooting

Tackling Cabling Complexity in Clos Networks

● Catch miscabling errors as miscabling errors, not protocol or application errors

Page 26: Best practices for network troubleshooting

Verify Cabling: Prescriptive Topology Manager

[Diagram: SPINE tier with S1 S2 S3 S4; LEAF tier with L1 L2 L3 L4 L5]

graph G {
    S1:p1 -- L1:p1;
    S1:p2 -- L2:p1;
    S1:p3 -- L3:p1;
    S1:p4 -- L4:p1;
    S1:p5 -- L5:p1;
    S2:p1 -- L1:p2;
    S2:p2 -- L2:p2;
    ...
    S4:p5 -- L5:p4;
}

● Define expected topology using DOT language

● Verify connectivity per topology plan using LLDP

● Take dynamically defined actions based on match or mismatch of expected & actual

● https://github.com/CumulusNetworks/ptm
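The idea of checking cabling against a plan can be sketched as follows. This is not PTM's implementation: the parser handles only simple `node:port -- node:port;` lines like those on the slide, and the `observed` dictionary is a hypothetical stand-in for neighbor data that would come from LLDP.

```python
import re

EDGE = re.compile(r"(\w+):(\w+)\s*--\s*(\w+):(\w+)\s*;")

def parse_expected(dot_text):
    """Map each (node, port) to the (peer, peer_port) the plan expects."""
    expected = {}
    for a, a_port, b, b_port in EDGE.findall(dot_text):
        expected[(a, a_port)] = (b, b_port)
        expected[(b, b_port)] = (a, a_port)
    return expected

def check_cabling(expected, observed):
    """Return a list of (local, expected_peer, actual_peer) mismatches."""
    problems = []
    for local, want in sorted(expected.items()):
        got = observed.get(local)  # None means no LLDP neighbor was seen
        if got != want:
            problems.append((local, want, got))
    return problems

plan = """
graph G {
    S1:p1 -- L1:p1;
    S1:p2 -- L2:p1;
}
"""
expected = parse_expected(plan)

# Hypothetical LLDP view in which the two leaf cables on S1 are swapped.
observed = {
    ("S1", "p1"): ("L2", "p1"),
    ("S1", "p2"): ("L1", "p1"),
    ("L1", "p1"): ("S1", "p2"),
    ("L2", "p1"): ("S1", "p1"),
}
for local, want, got in check_cabling(expected, observed):
    print(f"{local}: expected {want}, saw {got}")
```

Reporting the mismatch at this level names the problem as a cabling error, rather than leaving it to surface later as a routing protocol or application failure.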

Page 27: Best practices for network troubleshooting

27

Network Telemetry

Page 28: Best practices for network troubleshooting

What Data Can We Gather?

Logs

Network state: can be configuration or runtime

Page 29: Best practices for network troubleshooting


Logs

Pros

In theory, catch errors and warnings or exceptions

Mature tools now available to handle logs

ELK, Splunk

Cons

Usually box specific.

Errors that require fabric awareness can be hard to catch, and so can’t be easily logged. Example: Duplicate IP address, routing loop
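The duplicate IP example shows why box-local logs fall short: each device sees a perfectly valid address, and the error only appears once the views are combined. A minimal sketch, with invented per-device address tables:

```python
from collections import defaultdict

def find_duplicate_ips(addresses_by_device):
    """Return {ip: sorted list of devices} for IPs configured on 2+ devices."""
    owners = defaultdict(set)
    for device, ips in addresses_by_device.items():
        for ip in ips:
            owners[ip].add(device)
    return {ip: sorted(devs) for ip, devs in owners.items() if len(devs) > 1}

# Hypothetical per-device address tables gathered from the whole fabric.
fabric = {
    "hostd-11": ["3.0.0.4"],
    "hostd-21": ["3.0.3.132"],
    "hosts-11": ["3.0.0.4"],   # same address as hostd-11
}
duplicates = find_duplicate_ips(fabric)
print(duplicates)
```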

Page 30: Best practices for network troubleshooting

Metrics

The good thing about metrics is that there are so many to gather

Brendan Gregg’s USE method is a good yardstick to decide which metrics to gather:

“For every resource, check utilization, saturation, and errors.”

For example, applying USE to network interfaces:

Utilization: Basic RX/TX rates

Saturation: Buffer monitor stats per port

Errors: Drops, errors for RX/TX
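A minimal sketch of applying USE to one interface from two counter samples, for the RX direction only. The counter names (`rx_bytes`, `rx_drops`, `rx_errors`, `buffer_peak`) are hypothetical placeholders, not the fields of any particular switch or kernel API.

```python
def use_for_interface(prev, curr, interval_s, speed_bps, buffer_size):
    """Compute utilization, saturation and errors between two samples."""
    rx_bps = (curr["rx_bytes"] - prev["rx_bytes"]) * 8 / interval_s
    return {
        "utilization": rx_bps / speed_bps,
        "saturation": curr["buffer_peak"] / buffer_size,
        "errors": (curr["rx_drops"] - prev["rx_drops"])
                  + (curr["rx_errors"] - prev["rx_errors"]),
    }

# Invented samples taken one second apart on a 10 Gbps port.
prev = {"rx_bytes": 0, "rx_drops": 10, "rx_errors": 0, "buffer_peak": 0}
curr = {"rx_bytes": 625_000_000, "rx_drops": 12, "rx_errors": 1,
        "buffer_peak": 512}

stats = use_for_interface(prev, curr, interval_s=1,
                          speed_bps=10_000_000_000, buffer_size=2048)
print(stats)
```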

Page 31: Best practices for network troubleshooting

Metrics Usage

For troubleshooting:

For a network operator, network latency is probably the one thing that can be used as an indicator to determine if suboptimal performance is due to the network

Other metrics and mechanisms come into play to isolate the problem in the network

For capacity planning:

Usage and saturation metrics help you decide if you’re reaching network capacity

Page 32: Best practices for network troubleshooting

Metrics Dos and Don’ts

Gather your data as frequently as possible

The practical limit may be how quickly the hardware stats are updated

1-5 seconds is quite possible

Do not use SNMP

The first bullet prevents this anyhow

Do not aggregate data quickly

Use a good TSDB

InfluxDB and Prometheus are the ones I encounter the most
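Why early aggregation hurts: a short burst that saturates a link disappears once the samples are averaged. A small illustration with made-up utilization samples:

```python
# 300 one-second utilization samples (fraction of line rate): a mostly idle
# link with a 3-second burst at line rate. The values are invented.
samples = [0.05] * 300
samples[120:123] = [1.0, 1.0, 1.0]

five_minute_average = sum(samples) / len(samples)
peak = max(samples)

print(f"5-minute average: {five_minute_average:.4f}")
print(f"1-second peak:    {peak:.2f}")
```

The averaged value suggests a link that is about 6% busy, while the raw samples show it was briefly full, which is the detail that matters when chasing drops or latency spikes.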

Page 33: Best practices for network troubleshooting

Packet Capture

Pros

Useful for identifying what sort of traffic is flowing through

For security compliance

For things like IDS

Cons

Relatively expensive to capture all the data flowing through even a single switch (3.2 Tbps and increasing)

sFlow and its cousins are better suited for identifying traffic

Makes the most sense when used reactively, in troubleshooting
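A quick back-of-the-envelope calculation shows the scale of the cost. It assumes, as a worst case, that the switch forwards at its full 3.2 Tbps capacity and that every byte is stored.

```python
SWITCH_CAPACITY_BPS = 3_200_000_000_000  # 3.2 Tbps, the figure quoted above

bytes_per_second = SWITCH_CAPACITY_BPS // 8
bytes_per_hour = bytes_per_second * 3600

print(f"{bytes_per_second / 1e9:.0f} GB per second")  # 400 GB per second
print(f"{bytes_per_hour / 1e15:.2f} PB per hour")     # 1.44 PB per hour
```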

Page 34: Best practices for network troubleshooting

Network State

Properly designed, can be a good balance between packet capture and formal verification to answer questions such as:

Did this change break my network?

Was there a forwarding loop at 10 pm last night?

Show me the changes between 1h and 2h
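Questions like these can be answered if network state is stored as timestamped snapshots. The sketch below is one possible shape for such a store, not the design of any specific product; the class name, the route strings and the integer timestamps are invented for illustration.

```python
import bisect

class StateHistory:
    """Timestamped snapshots of network state (here: a set of routes)."""

    def __init__(self):
        self._times = []
        self._snapshots = []

    def record(self, timestamp, routes):
        # Snapshots are assumed to be recorded in increasing time order.
        self._times.append(timestamp)
        self._snapshots.append(frozenset(routes))

    def at(self, timestamp):
        """State as of the latest snapshot taken at or before `timestamp`."""
        i = bisect.bisect_right(self._times, timestamp)
        return self._snapshots[i - 1] if i else frozenset()

    def changes(self, start, end):
        """Routes added and removed between two points in time."""
        before, after = self.at(start), self.at(end)
        return {"added": sorted(after - before),
                "removed": sorted(before - after)}

history = StateHistory()
history.record(100, {"3.0.0.0/26 via spine-1", "3.0.3.128/26 via spine-1"})
history.record(200, {"3.0.0.0/26 via spine-1", "3.0.3.128/26 via spine-2"})

print(history.changes(150, 250))
```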

Page 35: Best practices for network troubleshooting

Problem Remains

Lots of data can be gathered

Ability to correlate across these is still mostly lacking

E.g., graphs show a drop in interface throughput without showing, at the same time, an annotation indicating that the drop is because a link failed

Building actionable alerts remains elusive

Many customers tell me that they essentially ignore alerts due to the high false positive rate
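The missing annotation amounts to a join between a metric series and an event stream. A minimal sketch, assuming invented sample data and a hypothetical `annotate_drops` helper with arbitrary thresholds:

```python
def annotate_drops(throughput, events, drop_ratio=0.5, window_s=5):
    """Pair each sharp throughput drop with events seen shortly before it.

    throughput: list of (timestamp, bits_per_second), sorted by time
    events:     list of (timestamp, message)
    """
    annotated = []
    for (t0, v0), (t1, v1) in zip(throughput, throughput[1:]):
        if v0 > 0 and v1 < v0 * drop_ratio:
            causes = [msg for ts, msg in events if t1 - window_s <= ts <= t1]
            annotated.append((t1, causes))
    return annotated

# Invented sample data: throughput falls right after a link goes down.
throughput = [(0, 8e9), (10, 8e9), (20, 3e9), (30, 3e9)]
events = [(18, "torc-11 swp1 link down")]

print(annotate_drops(throughput, events))
```

A drop that comes back with an empty list of causes is the more interesting alert, since nothing in the event stream explains it.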

Page 36: Best practices for network troubleshooting

36

Troubleshooting Tools

Page 37: Best practices for network troubleshooting

The Problem

● Network admins are typically contacted for one of two cases:

○ A can’t talk to B OR

○ A can talk to B, but sub-optimally

● Also check for proper network segmentation

Page 38: Best practices for network troubleshooting

What Tool For What Problem?

A can’t talk to B:

Network state may be the most useful thing to identify the problem

A can talk to B, but suboptimally:

This is a performance issue, and the metrics gathered can be used to identify the problem

Check compliance, such as verifying that traffic does not leak across virtual networks (VLAN, VRF or VxLAN):

Network state is the most helpful to answer this question

Page 39: Best practices for network troubleshooting

One Simple Step: Make Servers Discoverable

Enable LLDP on the server

If using lldpd, configure it to send ifname

Add a file called portidsubtype.conf to /etc/lldpd.d with contents: configure lldp portidsubtype ifname

Restart lldpd via sudo systemctl restart lldpd (or equivalent)

With PTM, enables cabling verification to servers too
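For provisioning many servers, the file step above can be scripted. A minimal sketch: the helper name is hypothetical, the directory is a parameter so the example runs without root against a temporary directory, and restarting lldpd is left to the operator.

```python
import tempfile
from pathlib import Path

def write_portid_conf(conf_dir="/etc/lldpd.d"):
    """Write the lldpd snippet that makes the server advertise its ifname."""
    path = Path(conf_dir) / "portidsubtype.conf"
    path.write_text("configure lldp portidsubtype ifname\n")
    return path

# Demonstrate against a throwaway directory instead of /etc/lldpd.d.
with tempfile.TemporaryDirectory() as tmp:
    conf = write_portid_conf(tmp)
    print(conf.name, "->", conf.read_text().strip())
```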

Page 40: Best practices for network troubleshooting

Traceroute family

traceroute, mtr, tracepath, traceroute-paris, traceroute-dublin, scamper

ECMP support: traceroute, traceroute-paris/dublin, scamper

PMTU support: traceroute, tracepath, mtr

NAT detection: traceroute-dublin

IPv6 Support: All except traceroute-dublin

VxLAN support: None

Page 41: Best practices for network troubleshooting


NetQ

Designed for Linux-based networking devices and hosts

An open framework with a paid analysis engine

Users can build their own analysis engine or customize

Designed around the modern data center use case

Simplify codifying validation

Simplify troubleshooting

Codify troubleshooting

Time machine debugging (or DVR) included

Page 42: Best practices for network troubleshooting

42

NetQ Architecture

[Architecture diagram. Labels shown: Ubuntu 16.04, RHEL 7, CentOS 7; multiple nodes marked Q; NetQ Telemetry Server.]

Page 43: Best practices for network troubleshooting

43

NetQ: Fabric Change Log

[Diagram labels: Linux Kernel; L3, L2, VxLAN; NetQ]

Example change log entries: New Route Added, OSPF Neighbor Change, MAC Address Removed

See state now or at any point in the past

Page 44: Best practices for network troubleshooting

44

NetQ: Analysis Engine

• Validate Current State

▪ BGP

▪ OSPF

▪ MTU

▪ mLAG

▪ VxLAN

• Telemetry Server analyzes entire network state

Cumulus Networks Confidential

Page 45: Best practices for network troubleshooting

45

NetQ: Intelligent Visibility

• View remote information

▪ IPs

▪ MACs

▪ OS

▪ System Specs

• Improve Command Outputs

▪ Resolve hostnames in any Linux command

▪ No need for DNS


Page 46: Best practices for network troubleshooting

46

NetQ: Advanced Notification

• NetQ Notifier Service

• Automatically Alert on Check Failures

▪ Syslog

▪ ChatOps (Slack)

▪ ELK

▪ Splunk


Page 47: Best practices for network troubleshooting

47

Page 48: Best practices for network troubleshooting

48

Summary

Page 49: Best practices for network troubleshooting

Summary

Network troubleshooting remains hard for most people

The modern data center has the potential to make troubleshooting both simpler and more complex

Avoiding troubleshooting is better than troubleshooting

The right architecture and configuration models go a long way in addressing this

Correlating across network state, logs and metrics is still beyond the reach of most network operators

Newer tools are on the rise to address this

Page 50: Best practices for network troubleshooting

50

Thank you!

Visit us at cumulusnetworks.com or follow us @cumulusnetworks or slack.cumulusnetworks.com

© 2017 Cumulus Networks. Cumulus Networks, the Cumulus Networks Logo, and Cumulus Linux are trademarks or registered trademarks of Cumulus Networks, Inc. or its affiliates in the U.S. and other countries. Other names may be trademarks of their respective owners. The registered trademark Linux® is used pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a world-wide basis.