BRKCRS-3141 Troubleshooting Cisco Catalyst 3750 3560 and 2960.pdf


1

6/26/2013 Cisco Live 2013

2


On average, how do your network administrators and other network IT professionals spend their time on your access (edge) switches?

3

4


5

6

7

8

9

10

Port connectivity issues arise from NIC, cable, or switch port faults. These problems may be hardware or software related. There are two different traffic-related issues: not passing any traffic, and passing too much.

If no traffic is passing, verify the configuration settings on both devices on the link. Move one of the devices to a known working partner and see whether the problem moves with it.

Too much traffic could indicate a number of issues: broadcast storm, unicast flood, or oversubscription. A sniffer capture is usually necessary to identify the types of packets overwhelming the interface.

Speed/duplex issues are typically seen at the access layer, where user PCs move and change frequently. Where possible, hard code the speed/duplex settings, particularly for connections that should not change, e.g. servers, routers, etc.

11

12

13

%LINK-4-ERROR: FastEthernet0/1 is experiencing errors

The above link error message appears because of the following potential problems:

Excessive alignment errors -> 1 per 100 ms
Excessive FCS errors -> 1 per 100 ms
Excessive TX collisions -> 1 per 100 ms
Late collisions -> 1 per 100 ms

14

Algorithms are used to determine the path (platform specific). Either the MAC address or the IP address is used for path determination. All packets for a given source-destination pair take the same path.
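The "same path for a given source to destination" behavior can be illustrated with a toy hash. This is only a sketch: the real algorithm is platform specific, and the XOR-of-low-byte hash, the function name `pick_link`, and the link count of 4 are assumptions made for illustration, not the switch's actual implementation.

```python
def pick_link(src_mac: str, dst_mac: str, num_links: int) -> int:
    """Toy path selection (assumed hash): XOR the last byte of each MAC,
    then take the result modulo the number of links."""
    src_low = int(src_mac.replace(".", "")[-2:], 16)
    dst_low = int(dst_mac.replace(".", "")[-2:], 16)
    return (src_low ^ dst_low) % num_links


# The same source/destination pair always maps to the same link.
first = pick_link("0007.7d75.88c0", "c471.fe1e.f0c0", 4)
second = pick_link("0007.7d75.88c0", "c471.fe1e.f0c0", 4)
```

Because the result depends only on the address pair, one heavy flow cannot be spread over several links; only a mix of different address pairs balances the load.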

15

Speed and duplex should be fixed on both sides of the link. See CSCtj21335 and the workaround: https://supportforums.cisco.com/docs/DOC-23267

17

TDR = Time Domain Reflectometry; available for copper interfaces up to 1GE speed. Interfaces will be brought down and up when TDR is run on active ports.

© 2012, Cisco Systems, Inc. All rights reserved. 18

19

Packets with IP options, Packets with expired TTL, Glean packets, ARP, Snooping, Software ACLs, SNMP, etc.

20

Capturing process utilization at the “right” moment is key to identifying the cause.

Processes: for example, “show tech” causes the virtual exec process to use some CPU resources.

Traffic forwarding:

Data traffic not forwarded by ASIC

Excessive Control Plane / Management traffic:

DoS attacks (TTL=1)

SVI ping test

Requires inspecting CPU queues and ASIC

*Note: show tech causes the virtual exec process to use some CPU resources.

Sustained CPU utilization below 50% will not cause problems.

Example of a syslog message for high CPU:

002182: *Jul 20 04:23:36: %SYS-1-CPURISINGTHRESHOLD: Threshold: Process CPU Utilization(Total/Intr): 9%/0%, Top 3 processes(Pid/Util): 214/3%, 153/0%, 159/0%

Sorting the output is better than filtering it with “exclude 0.00%”, because the filter will exclude processes that you want to see.

The 2960-S will have a higher CPU utilization that is considered normal.

The “normal” CPU usage depends on the number of members in the stack, routing protocols, spanning tree instances, …

High CPU Utilization?

Looking for lost packets?

Use EEM to take the process usage snapshot at the right time:

event manager applet High_CPU
 event snmp oid 1.3.6.1.4.1.9.9.109.1.1.1.1.6 get-type next entry-op gt entry-val 90 poll-interval 10
 action 1.1 syslog msg "High CPU DETECTED. Please wait - logging Information to flash:show_proc_cpu.txt"
 action 1.2 cli command "enable"
 action 1.3 cli command "end"
 action 1.4 cli command "term exec prompt timestamp"
 action 1.5 cli command "show process cpu sorted | redirect flash:show_proc_cpu.txt"
 action 2.1 syslog msg "Capturing IP Traffics - logging Information to flash:show_ip_traffic.txt"
 action 2.2 cli command "show ip traffic | redirect flash:show_ip_traffic.txt"
 action 3.1 syslog msg "Capturing show tech. Please wait - logging Information to flash:show_tech.txt"
 action 3.2 cli command "show tech | redirect flash:show_tech.txt"
 action 3.3 syslog msg "Self-removing applet from configuration..."
 action 3.4 cli command "configure terminal"
 action 3.5 cli command "no event manager applet High_CPU"
 action 3.6 cli command "end"


21

Each queue has its own tuned buffers and scheduling. These are not configurable, but were highly tested and tweaked before the Catalyst 3750 was released.

rpc: Internal messaging queue
STP: STP messages
ipc: L3 internal messaging queue
Routing protocols: L3 protocols like OSPF
L2 protocol: CDP etc., other than STP
remote console: Stacking slave console
Software forwarding: Fast switching, CPU forwarding
Host to Host functions: Ping, Telnet etc.
Broadcast: L2 broadcast packets
cbt-to-spt: Used by Multicast
IGMP snooping: Used by Multicast
ICMP: Used by IP
Logging: ACL logging and Smart Logging
RPF-fail: Multicast RPF fail
Queue14: Unused
CPU heartbeat: Internal wellness check

For example, the STP queue has more buffers and can handle more STP packets, since one does not want to drop them. Logging, in comparison, is a low-priority activity. The thresholds differ and can vary from 1 to hundreds of packets. They are dynamic and can grow. The threshold numbers may be changed, if required, depending on new feature requirements etc.

The stack ring is capable of reserving bandwidth for priority traffic. This is used to ensure that stack messages work even under heavy user load; the bandwidth is guaranteed by hardware even when users overload the stack. All stack messages are handled in the rpc queue, which is tuned with larger buffers and scheduling.

Flows eligible for CPU forwarding are:

Control plane traffic
Management traffic
TCAM overflow traffic: ACL overflow, MAC entry overflow, routing table overflow
Special protocol flows; these are typically low volume and unofficially supported.

Depth of CPU Qs cannot be modified

The HW (i.e. the port ASIC) will drop on queue congestion

Overload on one CPU queue should not affect other queues.

STP has its own queue (Queue 1); Queue 4 is for the other L2 protocols.

Values are cumulative. Use “clear controllers cpu” or repeat the command multiple times.
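Because the counters are cumulative, what matters is how much each queue grows between two runs of the command. A minimal sketch of that comparison follows; the queue names and counter values are hypothetical numbers typed in by hand, not real switch output, and `queue_deltas` is a helper name invented for this example.

```python
def queue_deltas(before: dict, after: dict) -> dict:
    """Return the per-queue increase between two snapshots of cumulative counters."""
    return {name: after[name] - before.get(name, 0) for name in after}


# Hypothetical counters copied from two runs of the command a few seconds apart.
snapshot_1 = {"stp": 0, "routing protocol": 120, "sw forwarding": 4500}
snapshot_2 = {"stp": 0, "routing protocol": 120, "sw forwarding": 9800}

# Keep only the queues that are still incrementing.
growing = {q: d for q, d in queue_deltas(snapshot_1, snapshot_2).items() if d > 0}
print(growing)
```

A queue with a large but static counter is historical; a queue whose counter keeps growing between snapshots is the one to investigate.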


22

Use the “debug platform cpu-queues” privileged EXEC command to enable debugging of platform central processing unit (CPU) receive queues.

software-fwd-q: Debug packets received by the software forwarding queue.

When running the debug:

Step    Command                                             Purpose
Step 1  configure terminal                                  Enter global configuration mode.
Step 2  no logging console                                  Disable logging to the console terminal.
Step 3  logging buffered 128000                             Enable system message logging to a local buffer, and set the buffer size to 128000 bytes.
Step 4  service timestamps debug datetime msec localtime    Configure the system to apply a timestamp to debugging messages or system logging messages.
Step 5  exit                                                Return to privileged EXEC mode.

BGL-3700-3#sh cd ne
Capability Codes: R - Router, T - Trans Bridge, B - Source Route Bridge
                  S - Switch, H - Host, I - IGMP, r - Repeater, P - Phone,
                  D - Remote, C - CVTA, M - Two-port Mac Relay

Device ID        Local Intrfce     Holdtme    Capability  Platform  Port ID
BGL14-TACLAB-ASW-J08
                 Gig 1/0/2         158             S I    WS-C3550- Fas 0/16
BGL14-TACLAB-ASW-J08
                 Gig 2/0/2         131             S I    WS-C3550- Fas 0/40

BGL-3700-3#sh arp
Protocol  Address          Age (min)  Hardware Addr   Type   Interface
Internet  14.160.38.130           -   c471.fe1e.f0c0  ARPA   Vlan1
Internet  14.160.38.1             1   0007.7d75.88c0  ARPA   Vlan1
BGL-3700-3#

Ping with options from 14.160.38.1

*Mar 1 10:37:33.205 AEDT: SW-FWD-Q:IP packet: Local Port Fwding L3If:Vlan1 L2If:GigabitEthernet2/0/2 DI:0x2F, LT:7, Vlan:1 SrcGPN:56, SrcGID:56, ACLLogIdx:0x0, MacDA:c471.fe1e.f0c0, MacSA: 0007.7d75.88c0 IP_SA:14.160.38.1 IP_DA:14.160.38.130 IP_Proto:1 IP Opts

TPFFD:D8C00038_00010001_00A00076-0000002F_E2C50000_00000000

*Mar 1 10:37:33.205 AEDT: SW-FWD-Q:Consumed by SW-Bridging: Local Port Fwding L3If:Vlan1 L2If:GigabitEthernet2/0/2 DI:0x2F, LT:7, Vlan:1 SrcGPN:56, SrcGID:56, ACLLogIdx:0x0, MacDA:c471.fe1e.f0c0, MacSA: 0007.7d75.88c0 IP_SA:14.160.38.1 IP_DA:14.160.38.130 IP_Proto:1 IP Opts

TPFFD:D8C00038_00010001_00A00076-0000002F_E2C50000_00000000

*Mar 1 10:37:53.765 AEDT: SW-FWD-Q:IP packet: Local Port Fwding L3If:Vlan1 L2If:GigabitEthernet2/0/2 DI:0x2F, LT:7, Vlan:1 SrcGPN:56, SrcGID:56, ACLLogIdx:0x0, MacDA:c471.fe1e.f0c0, MacSA: 0007.7d75.88c0 IP_SA:14.160.38.1 IP_DA:14.160.38.130 IP_Proto:1 IP Opts

TPFFD:D8C00038_00010001_00A00076-0000002F_E2C50000_00000000

A good practice for protecting and monitoring CPU utilization is to configure the process CPU threshold and to configure the switch to control broadcast, multicast, and unicast traffic per interface.


23

Debug traffic received by the CPU. In the case below, “routing-protocol-q” is shown. The packet ingress interface, destination MAC, source MAC, destination IP, and source IP are shown.

24

When the number of free buffers falls below the watermark (32), throttling might occur, resulting in packet drops and slow responsiveness to network management.

25

Receives a copy of the traffic for which an ICMP packet needs to be generated. Hardware forwarding of the packet still occurs

26

(due to throttling mechanism it won’t reach 99%) of which

27

Add featureset influence over CPU --3750X - 22%- 50% (depending on number of switches)

28

Configure traffic storm control to prevent packets from flooding the LAN, which creates excessive traffic and degrades network performance.

29

30

I/O memory is not used for normal packet switching

31

Note: lowest free level since boot up

32

Memory Allocation Failure

A memory allocation failure is the condition where the system has used all available memory (temporarily or permanently), or the memory has fragmented into such small pieces that the switch cannot find a usable available block.

Memory Leak

A memory leak occurs when a process requests or allocates memory and then fails to free (de-allocate) the memory when it has finished with that task. As a result, the memory block is reserved until the system is reloaded. Over time, more and more memory blocks are allocated by that process until there is no free memory available.

33

34

Use caution while running the command; it might cause CPU spikes. Run it multiple times to benchmark.

In the above output:

Total: the total number of buffers in the pool, including used and unused buffers.

Permanent: the permanent number of allocated buffers in the pool. These buffers are always in the pool and cannot be trimmed.

In free list: the number of buffers currently in the pool that are available for use.

Min: the minimum number of buffers that the system should attempt to keep in the free list. If the number of buffers in the free list falls below the min value, the system attempts to create more buffers for that pool.

Max-allowed: the maximum number of buffers that are allowed in the free list.

Hits: the number of buffers that have been requested from the pool. The hits counter provides a mechanism to determine which pool must meet the highest demand for buffers.

Misses: the number of times a buffer was requested and the system detected that the pool needed additional buffers. The misses counter represents the number of times the system has been forced to create additional buffers.

Trims: the number of buffers that the system has trimmed from the pool when the number of buffers in the free list exceeded the number of max-allowed buffers.

35

Created: the number of buffers that have been created in the pool.

Failures: the number of failed buffer requests. When IOS fails to get a Small buffer, it does not drop the packet. It increments the failures counter and falls through to the next-level pool, the Middle buffer, and requests a buffer there. If it fails to get a Middle buffer, it requests the next-level buffer, which is a Big buffer. This process continues until it hits the Huge buffer pool. If it fails to get a Huge buffer, it drops the packet.

No memory: the number of failures caused by insufficient memory to create additional buffers.

Buffer misses are not necessarily a bad thing, as long as the system is able to create additional buffers. The fields to look for in the 'show buffers' output are Failures and No memory. If there are a lot of Failures and No memory counts (constantly incrementing) for any buffer pool, try to narrow down the source of the buffer failure.

You can use 'show memory debug leaks' to detect I/O memory leaks as well. However, remember that the memory leak detector must be invoked multiple times, and only leaks that consistently appear in all reports should be interpreted as leaks. This is especially true for packet buffer leaks.
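The fall-through behavior behind the Failures counter can be sketched as follows. This is a simplified model, not IOS code: it ignores the creation of new buffers on a miss, the function name `get_buffer` is invented for the example, and the intermediate pool names (VeryBig, Large) are assumed from typical 'show buffers' output rather than taken from these notes.

```python
# Pool order from smallest to largest (VeryBig and Large are assumed names).
POOLS = ["Small", "Middle", "Big", "VeryBig", "Large", "Huge"]


def get_buffer(free: dict, failures: dict, start: str = "Small"):
    """Try each pool from `start` upward; count a failure for every empty pool.

    Returns the name of the pool that supplied the buffer, or None if every
    pool up to Huge is empty (the packet is dropped).
    """
    for pool in POOLS[POOLS.index(start):]:
        if free.get(pool, 0) > 0:
            free[pool] -= 1
            return pool
        failures[pool] = failures.get(pool, 0) + 1
    return None


free = {"Small": 0, "Middle": 0, "Big": 1}
failures = {}
served_by = get_buffer(free, failures)   # falls through to the Big pool
dropped = get_buffer(free, failures)     # every pool is now empty
```

In this model a failure in a small pool is harmless on its own; the packet is lost only when the request falls all the way through the Huge pool.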


36

37

38

39

40

41

42

View ASIC stats for the ingress queue (enqueued and dropped) and the supervisor queue. Output is different for the C3750X than for the C3750G.

C2960S does not have ingress Queues

43

44

45

Why have Ingress QoS?

On the C2960-S the smallest configurable policing rate is 16 Kbps; for everything else it is 8 Kbps.

2960-X and 2960-XR will follow a similar model

© 2009, Cisco Systems, Inc. All rights reserved. Presentation_ID.scr 48

49

Assume there’s a policer defined on the interface via a service policy

Transition to slower speed link – packets take longer to egress than ingress

Eg: Gigabit interfaces for Data Center Servers and old IP Phones

Over Subscription : Many interfaces transmitting to one egress interface

50

A small burst from the 10Gig (faster) interface causes congestion on 100Mbps (slower) interface

51

Total Passengers: 538 First Class: 9 Business Class: 80 Economy Upper Deck: 106 Economy Lower Deck: 343

53

Buffers: relative allocation among queues.

Reserved: minimum percentage of buffer reserved by each queue; the extra amount is released to the common buffer.

T1, T2, MAX: flexible thresholds, expressed as a percentage of the nominal queue buffer, which can be used.

Each traffic class maps to a specific queue number and threshold number (T1, T2 or T3=MAX). For example, the orange class maps to Q2-T1 and the violet class to Q2-T3 (Q2-MAX), allowing more violet packets to queue as they can use common pool buffer space.

54

Using “maps”, traffic classes mapped to Queue and threshold

55

56

57

58

59

Note: egress interface speed change at top.

60

Queue numbering is 0-based on this slide, rather than 1-based as on the previous slide (with DSCP mapping). Old IOS versions do not have detailed queue output; it starts in 12.2(46)SE.

61

Besides the “show mls qos interface <intf> statistics” command, use the platform command to get per-interface, per-queue statistics for drops (and for successfully egressed packets).

62

Fixing drops from previous example on Q4 and T1

63

Modify the queue-set from the previous slide to prevent packet drops. The threshold maximum is 3200%.

64

65

66

See Chapter “Configuring SDM Templates” in the Catalyst Switch Configuration Guide for more information

67

68

Content Addressable Memory (CAM)
• Very high speed lookup in large tables
• Binary operation—matches based on 0 or 1 values
• Exact match returns “hit”
• Useful for lookups where lookup key must exactly match a table entry (VLAN + MAC in bridge table)

TCAM Tables
• Ternary Content Addressable Memory (TCAM)
• Very high-speed, fixed latency lookups with wildcarding
• Ternary operation—matches based on 0, 1 or X (don’t care)
• Longest match returns “hit”
• Memory structure broken into groups of “patterns” and associated “masks”
• Masks used to “wildcard” some bits in the patterns
• Useful for lookups where not all fields of lookup key (CEF, ACL lookups)
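The difference between exact (CAM) and ternary (TCAM) matching can be shown with a small software model. This is a sketch under stated assumptions: a real TCAM compares all entries in parallel in hardware, and picking the mask with the most bits set is used here only to emulate the "longest match" outcome. The function names, table contents, and interface names are invented for the example.

```python
def cam_lookup(table: dict, key):
    """Exact match: a hit only if the whole key equals a stored entry."""
    return table.get(key)


def tcam_lookup(entries: list, key: int):
    """Ternary match over (pattern, mask, result) entries.

    Mask bits set to 0 are "don't care". Among all matching entries, the one
    with the most cared-for bits (longest match) wins.
    """
    best = None
    for pattern, mask, result in entries:
        if (key & mask) == (pattern & mask):
            if best is None or bin(mask).count("1") > bin(best[1]).count("1"):
                best = (pattern, mask, result)
    return best[2] if best else None


# Bridge-table style exact lookup on (VLAN, MAC); values are illustrative.
bridge = {(1, "0007.7d75.88c0"): "Gi2/0/2"}

# Route-style ternary lookup: 10.0.0.0/8 and 10.1.0.0/16 as pattern/mask pairs.
routes = [
    (0x0A000000, 0xFF000000, "10.0.0.0/8"),
    (0x0A010000, 0xFFFF0000, "10.1.0.0/16"),
]
```

For example, `tcam_lookup(routes, 0x0A010203)` (10.1.2.3) matches both entries and returns the more specific one, while `cam_lookup` misses as soon as any part of the key differs.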

69

70

71

72

73

Power discovery allows switches and PoE-capable devices to convey power information.

LLDP-MED provides information on how the device is powered (from the line, from a backup source, from an external power source, etc.), power priority (how important is it that this device has power?), and how much power the device needs. NOTE: LLDP-MED only advertises device consumption and is not a negotiation protocol. For a third-party IEEE device to use PoE+ power (> 15.4 W), it needs the IEEE 802.3at LLDP Power-via-MDI protocol.

Cisco Discovery Protocol includes separate TLVs for power requested and power available, allowing the switch and the PoE-capable device to negotiate the power used. Some Cisco IP phones can operate at multiple power settings, lowering their consumption when less power is available.

Using CDP, the PD requests the worst-case power required (including the link loss). An LLDP PD requests only the power required; the PSE has to add the link loss values.

When a powered device connected to a PoE+ port restarts and sends a CDP or LLDP packet with a power TLV, the switch locks to the power-negotiation protocol of that first packet and does not respond to power requests from the other protocol. For example, if the switch is locked to CDP, it does not provide power to devices that send LLDP requests. If CDP is disabled after the switch has locked on it, the switch does not respond to LLDP power requests and can no longer power on any accessories. In this case, you should restart the powered device.

74

In the PoE controllers, there are three separate current thresholds that are used for different purposes: I(cut), I(limit), and I(short).

I(cut) is the threshold at which power is removed from the port if the PD draws more power than allocated (e.g. 15.4 W).

I(limit) is the threshold at which the PoE controller starts to reduce the port voltage in order to control the current; it is not used to remove power from the port completely.

I(short) is the threshold at which the port sees a very fast current spike that must be dealt with immediately; it bypasses all the timers that are used to remove power from the port, and the port shuts down immediately.

An Imax error is reported by the PoE controller of the switch when a PoE PD misbehaves and draws more power (port current) than its specified limit. It is an operating fault, reported after the device is powered up, when Iport > Icut for a time period of Tovld (50-75 milliseconds), where Iport is the port current, Icut is the port cut-off current, and Tovld is the time duration for the overload condition to be reported. The typical values of Icut in a switch vary with the PoE controller components, but they are always within the IEEE range.

A Tstart error is reported when the device violates Tinrush (50 milliseconds): while powering up, the device draws a port inrush current greater than Iinrush (450 mA) for at least Tinrush (50 milliseconds). Tstart is a start-up fault that occurs before the device has even reported Power Good. (Tinrush is the port start-up current monitor time; Iinrush is the start-up current.)
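The Imax condition (Iport > Icut sustained for Tovld) can be expressed as a small check over current samples. This is only an illustration of the rule as stated in these notes: the sampling interval, the 350 mA Icut value, and the function name `imax_fault` are assumptions, and a real PoE controller implements this in hardware.

```python
def imax_fault(samples, icut_ma: float, tovld_ms: float) -> bool:
    """Return True if the port current stays above Icut for at least Tovld.

    samples: list of (time_ms, current_ma) pairs in increasing time order.
    """
    over_since = None
    for time_ms, current_ma in samples:
        if current_ma > icut_ma:
            if over_since is None:
                over_since = time_ms      # overload starts here
            if time_ms - over_since >= tovld_ms:
                return True               # overload lasted long enough
        else:
            over_since = None             # overload ended, reset the timer
    return False


# Hypothetical samples every 10 ms: a sustained overload and a short spike.
sustained = [(0, 300), (10, 400), (20, 400), (30, 400), (40, 400), (50, 400), (60, 400)]
spike = [(0, 300), (10, 400), (20, 300), (30, 400), (40, 300)]
```

With an assumed Icut of 350 mA and Tovld of 50 ms, the sustained trace trips the fault while the short spikes do not, which matches the idea that brief excursions are tolerated.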

75


The workaround is present on the following platforms only: 3750-E, 3560-E, 3750-X, 3560-X, 3560-C, 2960S, 2960, 2960C. The other platforms do not support 2X power mode; on those, the workaround is to use a longer cable. See DDTS CSCsw18530.


“show platform frontend-controller subordinate <number>” displays the statistics of errors received as reported by a subordinate. In the command output, check the state of the subordinate and look for I2C errors. If the I2C errors are non-zero, check whether they are incrementing; if so, reload. If the issue still exists, it could be bad hardware.

76

Debug commands should always be run with care. Use specific debug conditions where available; a debug condition helps restrict the debug output to the specified condition (x/x).

77


78

79

Check Major version

80

81

82

“show switch stack-ring activity” was introduced in 12.2(20)SE.

A new, out-of-the-box switch (one that has not joined a switch stack or has not been manually assigned a stack member number) ships with a default stack member number of 1. When it joins a switch stack, its default stack member number changes to the lowest available member number in the stack.

Stack members in the same switch stack cannot have the same stack member number. Every stack member, including a standalone switch, retains its member number until you manually change the number or unless the number is already being used by another member in the stack.

If you manually change the stack member number by using the “switch current-stack-member-number renumber new-stack-member-number” global configuration command, the new number goes into effect after that stack member resets (or after you use the “reload slot stack-member-number” privileged EXEC command) and only if that number is not already assigned to any other member in the stack.

“show platform sf-asic stat ?” gives more detailed stack statistics for the 3750E.
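The member-number rules in these notes (keep the current number if it is free, otherwise take the lowest available one) can be sketched as a small function. The function name and the maximum of 9 members per stack are assumptions for illustration; the switch performs this assignment itself during stack election.

```python
def joined_member_number(current: int, in_use: set, max_members: int = 9) -> int:
    """Member number a switch ends up with after joining a stack.

    current: the number the switch carries when it joins (1 for a new switch).
    in_use: member numbers already taken in the stack.
    max_members: assumed stack size limit.
    """
    if current not in in_use:
        return current  # the number is free, so the switch keeps it
    for candidate in range(1, max_members + 1):
        if candidate not in in_use:
            return candidate  # lowest available number in the stack
    raise ValueError("no free member number in the stack")
```

For instance, a new switch (number 1) joining a stack that already uses 1, 2 and 3 becomes member 4, while a switch previously numbered 5 joining the same stack keeps 5.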

LED on the port with the corresponding switch number will illuminate

For ex, if the switch is # 4 in the stack, port 4’s LED will light up

83

To stop a stack port flap: switch <> stack port <> {enable | disable}

“show switch stack-ports summary” was introduced in 12.2(50)SE.

“# Changes to LinkOK” = number of times the stack port went into Link OK.

“cable length” is in centimetres.

84

85

86

87

88

Note: this MAC matches the slide with the “show platform forward ...” command example.

89

90

91

92

Use this command to view the egress interface for Layer2 forwarding. In this case egress is Gi1/0/4

93

----- Meeting Notes (4/11/13 15:58) ----- example 1 and 2

94

95

96

97

Notes: ios view of how things should be.

98

The switch does not make forwarding decisions based on ICMP values, but the command requires them. Enter any value in the range 0-255.

99

100

101

102

Some failure scenarios

103

104

105

Note: change date / time to use datetime, and not uptime. For consistency

106

107

109


110

111


112

113

114

The Catalyst 2960/3560/3750 supports four egress queues, which can be configured on a per-interface basis to operate in either 4Q3T or 1P3Q3T mode. Additionally, the Catalyst 2960/3560/3750 supports two queue-sets, allowing certain interfaces to be configured in one manner and others to be configured in a different manner.

The Catalyst 2960/3560/3750 has Queue 1 (not Queue 4) as the optional priority queue; in a converged campus environment it is recommended to enable the priority queue via the “priority-queue out” interface command.

The three remaining egress queues on the Catalyst 2960/3560/3750 are scheduled by a Shaped Round-Robin (SRR) algorithm, which can be configured to operate in shaped mode or in shared mode. In shaped mode, assigned bandwidth is limited to the defined amount; in shared mode, any unused bandwidth is shared among other classes (as needed).

To make the queuing structure consistent with the previously discussed best-practice queuing principles, Queues 2 through 4 should be set to operate in shared mode (which is the default mode of operation on Queues 2 through 4). The ratio of the shared weights determines the relative bandwidth allocations (the absolute values are meaningless). Since the PQ of the Catalyst 2960/3560/3750 is Q1 (not Q4 as in the Catalyst 3550), the entire queuing model can be flipped upside down, with Q2 representing the Critical Data queue, Q3 representing the Best Effort queue, and Q4 representing the Scavenger queue. Therefore, shared weights of 70, 25, and 5 can be assigned to Queues 2, 3, and 4, respectively.
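The statement that only the ratio of the shared weights matters can be checked with a few lines of arithmetic. This is a simplified model that ignores the priority queue and assumes all three queues are busy; the function name is invented for the example.

```python
def shared_bandwidth(weights: dict) -> dict:
    """Convert SRR shared weights into fractions of the shared bandwidth."""
    total = sum(weights.values())
    return {queue: weight / total for queue, weight in weights.items()}


# Weights from the notes, and the same ratio with doubled absolute values.
shares = shared_bandwidth({"Q2": 70, "Q3": 25, "Q4": 5})
doubled = shared_bandwidth({"Q2": 140, "Q3": 50, "Q4": 10})
```

Both calls give Q2 70%, Q3 25% and Q4 5% of the shared bandwidth, which is why weights of 70/25/5 and 140/50/10 would behave identically in this model.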

115

116


117

Note: add X and S.

C3560E(config)#diagnostic monitor test ?
  <1-6>  Test ID Number

  ID   Test Name                           [On-Demand Test Attributes]
  ---  -------------------------------------------
   1   TestPortASICStackPortLoopback       [B*N****]
   2   TestPortASICLoopback                [B*D*R**]
   3   TestPortASICCam                     [B*D*R**]
   4   TestPortASICRingLoopback            [B*D*R**]
   5   TestMicRingLoopback                 [B*D*R**]
   6   TestPortASICMem                     [B*D*R**]
  ---  -------------------------------------------
  WORD  Test ID list (e.g. 1,3-6) or Test Name
  all   Select all test ID
C3560E(config)#

Scheduled Example:

switch(config)# diagnostic schedule switch 1 test 1 on jan 3 2003 23:32
switch(config)# diagnostic schedule switch 1 test 1 daily 14:45
switch(config)# diagnostic schedule switch 1 test all weekly Monday 3:33
switch(config)# diagnostic schedule switch 5 test 1,3-6 daily 23:55

Router# show run
Building configuration...
Current configuration : 4618 bytes
diagnostic schedule switch 1 test 1 on January 3 2003 23:32 cardindex 1
diagnostic schedule switch 1 test 2 daily 14:45 cardindex 1
diagnostic schedule switch 1 test all weekly Monday 3:33 cardindex 1
diagnostic schedule switch 5 test 1,3-6 daily 23:55 cardindex 1


118

119

The bottom two boxes are referencing diag test 2.

120

Here is an example of a failure:

Overall diagnostic result: MAJOR ERROR
Test results: (. = Pass, F = Fail, U = Untested)
1) TestPortAsicStackPortLoopback ---> .
2) TestPortAsicLoopback ------------> F
3) TestPortAsicCam -----------------> .
4) TestPortAsicRingLoopback --------> .
5) TestMicRingLoopback -------------> .
6) TestPortAsicMem ----------------->

121

Note: updated switch support for X and S.

To disable OBFL: no hw-module module [switch-number] logging onboard

To clear all the OBFL data in the flash memory except for uptime and CLI command information: clear logging onboard

- “show logging onboard status” to see if it is running and for which features
- “summary” option contains historical data (compressed)
- “continuous” option contains current data (more detailed)
- “detail” option contains summary and continuous output
- “copy logging onboard module flash:” creates a tar file in flash
- “archive tar /xtract flash:” to extract tar file contents as ASCII text

122

123

124

125

126

Assumes that Mcast routing or IGMP snooping is setup

127

128


Recommended