
Juniper T/M/J

JUNOS troubleshooting basics

Version 0.1

Author:
Department:
Date:
Version: 0.1

Troubleshooting guide for T/M/J JUNOS routers

1 Table of contents

1 Table of contents
2 Introduction
    2.1 Document Objective
    2.2 Scope
    2.3 Document History
    2.4 Related documents
3 Troubleshooting guidelines
    3.1 Basic troubleshooting for all events
    3.2 Common events
        3.2.1 Power supply failure
        3.2.2 Fan failure/Temperature alert
        3.2.3 Device reboot with unknown cause
        3.2.4 Chassis event (component failure)
        3.2.5 Routing-engine
        3.2.6 Link failure
        3.2.7 Management IP unreachable (ICMP)
        3.2.8 In-band Loopback IP unreachable (ICMP)
        3.2.9 BGP neighbor
        3.2.10 ISIS adjacency
        3.2.11 VRRP
        3.2.12 LDP neighbor/MPLS
        3.2.13 PIM neighbor/multicast
    3.3 Non-fault management alarms or undocumented events
        3.3.1 Undocumented event
        3.3.2 Network slow
        3.3.3 Reachability problem
        3.3.4 Complete service/product not working
    3.4 Disaster recovery
    3.5 Hardware maintenance verification

2 Introduction

2.1 Document Objective

This document gives basic instructions for handling certain types of alarms. The basic troubleshooting steps are categorized per event and are valid for JUNOS software running on M/T/J series models.

For most events a reference to the vendor documentation is given, where additional information can be looked up. This vendor documentation is also available in PDF format and should be present at a common location for operational personnel (accompanying this document).

The interpretation of command output can also be looked up in the vendor documentation: go to www.juniper.net and type the command in the search area. All command output reference information can be found there.

2.2 Scope

This document describes the initial troubleshooting for the most common events. It also describes a generic approach per fault.

It assumes the following knowledge and capabilities from the operator:

• Basic topology knowledge of the network (what is core, distribution, access)
• Basic knowledge of Juniper T/M/J hardware (knows the generic architecture of the box and is familiar with the routing engine, FPC, PIC, SIB, etc.)
• Basic knowledge of JUNOS (can log in, can run commands)
• Basic knowledge of BGP/ISIS/LDP/PIM (what these protocols do in general)

2.3 Document History

Version    Reason for Change    Modified by    Date

2.4 Related documents

Vendor documentation at www.juniper.net

3 Troubleshooting guidelines

3.1 Basic troubleshooting for all events

The commands below should be run in all situations:

    show version
    show system uptime
    show log messages | last 100
    show chassis alarms
    show chassis hardware

Details:

• show version -> shows the model and software version you are working on
• show system uptime -> shows the current system uptime and when the system was last configured. The load figures indicate how busy the system is.
• show log messages | last 100 -> shows the last 100 lines of the messages log, i.e. the most recent events on the router
• show chassis alarms -> shows whether any chassis alarms are active on the router
• show chassis hardware -> shows which hardware is present
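
The load figures mentioned for show system uptime can also be checked by a script once the output has been captured as text. The following is a minimal sketch in Python, not an official tool. The sample text is only an illustration (the exact layout differs between JUNOS releases), and the function names and the threshold of 1.0 are assumptions made for this example.

```python
import re

# Illustrative sample only: real `show system uptime` output differs
# between JUNOS releases, so treat this layout as an assumption.
SAMPLE_UPTIME = """\
Current time: 2008-03-01 12:00:00 UTC
System booted: 2008-02-01 08:00:00 UTC (4w1d 04:00 ago)
Last configured: 2008-02-20 09:30:00 UTC (1w3d 02:30 ago) by admin
12:00PM  up 29 days,  4:00, 1 user, load averages: 0.05, 0.10, 0.08
"""

# Matches the three load figures at the end of the uptime line.
LOAD_RE = re.compile(r"load averages?:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)")


def parse_load_averages(text):
    """Return the (1, 5, 15 minute) load averages, or None if not found."""
    match = LOAD_RE.search(text)
    if match is None:
        return None
    return tuple(float(value) for value in match.groups())


def is_busy(text, threshold=1.0):
    """True when the 5-minute load average exceeds the threshold.

    The threshold is a hypothetical value, not a vendor recommendation.
    """
    loads = parse_load_averages(text)
    return loads is not None and loads[1] > threshold


if __name__ == "__main__":
    print(parse_load_averages(SAMPLE_UPTIME))
    print(is_busy(SAMPLE_UPTIME))
```

A suitable threshold depends on the platform; the value used here only demonstrates the comparison.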

3.2 Common events

3.2.1 Power supply failure

Diagnostics:

The commands below should be run in case of a power supply failure. Please note that not all systems have a PEM module.

    show chassis environment
    show chassis environment pem <number>

Impact:

Most chassis have redundant power supplies, so a single failure should not interrupt service.

Common causes:

• Power supply has failed
• External power interrupted

Solution:

• Replace power supply
• Fix external power

Further reference:

For additional information: http://www.juniper.net/techpubs/software/nog/nog-hardware/html/nog-hardwareTOC.html

3.2.2 Fan failure/Temperature alert

Diagnostics:

The command below should be run in case of a fan failure:

    show chassis environment

Impact:

Most chassis have redundant fans. Overheating can occur if the fan is not repaired.

Common causes:

• Fan has failed
• Air filter is dirty
• Housing location is too hot

Solution:

• Replace fan / clean air filters
• Contact housing location

Further reference:

For additional information: http://www.juniper.net/techpubs/software/nog/nog-hardware/html/nog-hardwareTOC.html

3.2.3 Device reboot with unknown cause

Diagnostics:

The command below should be run in case of a reboot with unknown cause:

    show log messages

Impact:

The impact depends on where in the topology the system is located. In general, for systems with an access-related function a short outage has occurred. In the core the impact should be minimal.

Common causes:

• Power failure
• Bug/crash
• Routing engine failure (can be a hard-disk failure on the RE)

Solution:

A case should be opened with the vendor for analysis as soon as possible to establish whether it is a software or a hardware issue.

If failing hardware is the cause, it must be replaced in a service window on redundant systems. On non-redundant systems it must be replaced as soon as possible.

If a component is causing recurring failures, the component should be removed as soon as possible. If it is non-redundant it should be replaced as soon as possible.

If a software bug is causing recurring failures, it should be escalated to the next level of support.

Further reference:

For additional information: http://www.juniper.net/techpubs/software/nog/nog-hardware/html/nog-hardwareTOC.html

3.2.4 Chassis event (component failure)

Diagnostics:

The commands below should be run in case of component failures on a chassis:

    show chassis alarms
    show chassis craft-interface
    show chassis routing-engine
    show chassis fpc

Look in the further reference section for your specific model (and then under the “monitoring model XXX” components section).

Impact:

• PIC -> This will cause interface problems
• FPC -> This will cause multiple PIC problems
• Other components -> Other components will be chassis related (see further reference)

Common causes:

• Hardware failure

Solution:

Replace the hardware via the vendor contract. In most cases there will be a service contract with a 3-hour time-to-fix. Open a ticket with this supplier as soon as possible and let them replace the hardware.

Further reference:

For additional information: http://www.juniper.net/techpubs/software/nog/nog-hardware/html/nog-hardwareTOC.html

3.2.5 Routing-engine

Diagnostics:

The commands below should be run in case of routing-engine failures:

    show chassis alarms
    show chassis routing-engine

Most of the time an RE failure will also cause other alarms (for example BGP/LDP/ISIS restarting).

Impact:

On dual routing-engine systems the backup will take over. A short interruption has occurred.

On single routing-engine systems either another system in the topology takes over, or the machine is down together with all services it provides. In most situations there is no impact beyond the normal fail-over times.

If the backup RE has failed there is no service interruption.

Common causes:

• Hard-disk failure on routing engine
• Hardware failure on routing engine

Solution:

• Primary RE failure
  o Check if the backup RE has taken over; if not, switch over manually via:
        request chassis routing-engine master switch
  o Replace the faulty RE (in a service window in case of redundancy)
• In case of backup RE failure, replace it in a service window

Further reference:

For additional information: http://www.juniper.net/techpubs/software/nog/nog-hardware/html/nog-hardwareTOC.html

3.2.6 Link failure

Diagnostics:

The commands below should be run in case of link failures:

    show interfaces terse
    show interfaces <interface>
    show interfaces <interface> extensive
    monitor interface <X>
    show chassis hardware detail

The monitor interface command can be used if a link is currently transmitting traffic. show chassis hardware detail shows which PIC and SFP are present.
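
When many interfaces are present, captured terse output can be filtered for links that are administratively up but physically down. The Python sketch below illustrates the idea under stated assumptions: the sample text is invented for illustration (real output depends on platform and JUNOS release), and the function name links_down is hypothetical.

```python
# Illustrative sample only; real `show interfaces terse` output depends on
# the platform and JUNOS release.
SAMPLE_TERSE = """\
Interface               Admin Link Proto    Local                 Remote
ge-0/0/0                up    up
ge-0/0/0.0              up    up   inet     192.0.2.1/30
ge-0/0/1                up    down
ge-0/0/2                down  down
lo0                     up    up
"""

STATES = ("up", "down")


def links_down(terse_output):
    """Return interfaces that are administratively up but have link down."""
    failed = []
    for line in terse_output.splitlines():
        fields = line.split()
        # Skip the header, blank lines and continuation lines that do not
        # carry an Admin and Link state in the second and third column.
        if len(fields) < 3 or fields[1] not in STATES or fields[2] not in STATES:
            continue
        name, admin, link = fields[:3]
        if admin == "up" and link == "down":
            failed.append(name)
    return failed


if __name__ == "__main__":
    print(links_down(SAMPLE_TERSE))
```

Interfaces that are administratively down are left out on purpose, since they were disabled deliberately (for example a flapping link awaiting repair).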

Impact:

• Core: most links will be redundant
• Distribution: most links will be redundant
• Access: in most cases there is customer impact

Common causes:

• Fiber failure
• GBIC/XENPAK failure
• PIC failure
• Failure at the other side

Solution:

• If it is a flapping link -> disable the link until repaired
• If it is a fiber cut -> disable the link and enable it after the fiber has been repaired
• If it is a GBIC/XENPAK failure -> replace the GBIC/XENPAK
• If it is a PIC failure -> replace the PIC (see component failure)

If no obvious alarms or related network error conditions are present on the reporting devices, our hardware supplier should be able to help diagnose whether faulty equipment is causing the link failures.

Further reference:

For additional information: http://www.juniper.net/techpubs/software/nog/nog-interfaces/html/nog-interfacesTOC.html

3.2.7 Management IP unreachable (ICMP)

A management IP can be recognized because it is a private IP. It is reachable via the DCN network.

Diagnostics:

The commands below should be run in case of management IP failures (the DCN gateway can be found as backup-router in the configuration):

    # primary RE
    ping <DCN gateway>
    # backup RE
    request routing-engine login other-routing-engine
    ping <DCN gateway>

Also try to ping the IP from a DCN management station to verify what the fault management system reports.

Impact:

This could indicate that the routing engine has failed (see routing engine failure).

Common causes:

• DCN network failure
• Routing engine failure
• Configuration change which affected the DCN interface

Solution:

• Resolve the DCN issue
• Replace the routing engine (see routing engine failure)

Further reference:

N/A

3.2.8 In-band Loopback IP unreachable (ICMP)

An in-band loopback can be recognized because it is a public IP address. There is only one point where we enter the NGN, and from there we reach the monitored nodes via the NGN itself.

Diagnostics:

The commands below should be run in case of in-band loopback IP failures.

Log in to a core (NCR) node inside the NGN and run:

    ping <loopback>
    show route <ip>

If a loopback is not reachable, there must be a major problem with this node: other alarms should be present for this node, or neighboring nodes should report problems.

Impact:

• Access: This could indicate that the node is down and not providing any service
• Core: This could mean that the complete node is down; in the core another system will have taken over

Common causes:

• Power outage at housing location
• Routing issue within the NGN
• Component failure

Solution:

• Fix the component (escalate to supplier)
• When a routing issue in the NGN causes the loopback to be unavailable, it must be escalated to the next level

An in-band loopback down will probably be caused by one of the other events (most likely a component failure or power failure).

Further reference:

N/A

3.2.9 BGP neighbor

There are different kinds of BGP neighbors in the NGN network:

• IBGP neighbors. These can be recognized because the AS number of the neighbor is our own AS number, AS13127. These are always important and should never go down.
• EBGP neighbors. These can be recognized because the remote peer does not have our own AS number. There are a couple of EBGP neighbor types:
  o Peer -> This neighbor is connected via a public exchange. Typically there will be lots of neighbors in this category which are down (because we have so many). As long as only a small number are down, this does not cause any network problem. This type of neighbor will only be present at BR (border) routers. No action should be taken for peers which have been down for less than 24 hours.
  o Transit -> These sessions are also only present at BR routers. We always have multiple transits which back each other up. They provide reachability to the complete internet for us. If a transit is down it must be fixed as soon as possible.
  o Content -> From here we retrieve special content. Currently there is only one: the NOB connected to the MBR (multicast BR). Most of the time these will be redundant setups.
  o Customers -> These are present at IAR1X routers. These will be individual business customers. The customer should be contacted in case of problems.
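
The neighbor types above translate into a simple triage rule. The Python sketch below is one way to write it down, not an operational tool. The AS number 13127 and the 24-hour rule come from this guide. The function name, the type labels and the wording of the returned actions are assumptions, and "follow up" for peers that have been down for 24 hours or longer is an inference, since the guide only states when no action is needed.

```python
LOCAL_AS = 13127  # own AS number, taken from this guide


def triage_bgp_neighbor(peer_as, ebgp_type=None, hours_down=0.0):
    """Map a down BGP neighbor to the action this guide describes.

    ebgp_type is a hypothetical label: "peer", "transit", "content"
    or "customer". It is ignored for IBGP neighbors.
    """
    if peer_as == LOCAL_AS:
        # Same AS number as our own means IBGP.
        return "IBGP: should never be down, investigate now"
    if ebgp_type == "peer":
        if hours_down < 24:
            return "EBGP peer: no action yet (down less than 24 hours)"
        # Inferred: the guide does not spell out the action after 24 hours.
        return "EBGP peer: follow up (down 24 hours or more)"
    if ebgp_type == "transit":
        return "EBGP transit: fix as soon as possible"
    if ebgp_type == "customer":
        return "EBGP customer: contact the customer"
    if ebgp_type == "content":
        return "EBGP content: no specific action given in this guide"
    raise ValueError(f"unknown EBGP neighbor type: {ebgp_type!r}")


if __name__ == "__main__":
    print(triage_bgp_neighbor(13127))
    print(triage_bgp_neighbor(64500, "peer", hours_down=3))
    print(triage_bgp_neighbor(64501, "transit"))
```

The AS numbers 64500 and 64501 in the usage lines come from the range reserved for documentation and only serve as placeholders.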

Diagnostics:

The commands below should be run in case of BGP neighbor events.

Log in at the node reporting the BGP problem:

    show bgp summary
    show bgp neighbor <ip>

For most IBGP failures, other failures which correlate to the BGP event should also be present (e.g. link-down, ISIS).

Impact:

• IBGP session down -> normally systems are connected to two BGP neighbors for redundancy. If both are down this could be service affecting.
• EBGP session down -> for important traffic redundant BGP sessions should be available. If both are down this could have impact on customer reachability.

Common causes:

IBGP:

• ISIS routing problems
• Configuration error on a newly commissioned system
• Remote neighbor failure

EBGP:

• Neighbor router failure
• Neighbor configuration failure

Solution:

Please note that in most cases nothing has to be configured on the reporting node itself to solve BGP neighbor issues.

• Verify the remote router
• Verify that ISIS is running normally

Further reference:

For additional information (look for BGP): http://www.juniper.net/techpubs/software/nog/nog-baseline/html/nog-baselineTOC.html

3.2.10 ISIS adjacency

Diagnostics:

The command below should be run in case of ISIS neighbor failures.

Log in to the node and run:

    show isis adjacency

Impact:

If, from a topology perspective, a system is isolated from all its ISIS neighbors, the system will not be able to provide any services.

Common causes:

• Layer 2 (switch) failure
• Configuration error (MD5 password)
• Remote node failure
• Link-down (possibly because of a component failure)

Solution:

Please note that in most cases nothing has to be done at the reporting router itself to solve an ISIS neighbor failure.

• Verify that there are no ongoing layer 2 problems
• Verify that the remote neighbor is not down

Further reference:

For additional information (look for ISIS): http://www.juniper.net/techpubs/software/nog/nog-baseline/html/nog-baselineTOC.html

3.2.11 VRRP

VRRP is used on SAR routers to provide redundancy to connected hosts. Normally two VRRP routers are present, where one is the master and the other the backup router. When the master fails the backup takes over. When the uplink interface of the master fails, mastership also moves to the backup node.

Diagnostics:

The commands below should be run in case of VRRP problems.

Log in to the VRRP node and run:

    # master VRRP node
    show vrrp brief
    show vrrp track
    # backup VRRP node
    show vrrp brief

Normally a VRRP event is accompanied by another event which is the cause of the VRRP alarm (link down, node down, switch down).

Impact:

Normally the backup router will take over and no service interruption should occur.

Common causes:

• Remote node failure
• Link failure
• Switch failure
• Configuration failure (authentication)

Solution:

• Verify remote neighbor
• Verify layer 2 switch
• Verify uplink of SAR
• Verify local interface status

Further reference:

N/A

3.2.12 LDP neighbor/MPLS

LDP is used in combination with VPNs in the network. VPNs are used, for example, for VOD, VOIP and wholesale traffic.

Diagnostics:

The commands below should be run in case of LDP neighbor failures or MPLS problems.

Log in to the node and run:

    show ldp neighbor
    show ldp session
    show mpls interface

Impact:

Systems should have multiple redundant LDP sessions. If multiple sessions are down it can cause reachability problems within the VPNs and affect VOD and VOIP services.

Common causes:

• Layer 2 (switch) failure
• Link-down (possibly because of a component failure)
• Remote node failure
• Configuration failure

Solution:

Please note that in most cases nothing has to be done at the reporting router itself to solve an MPLS/LDP event.

• Verify that there are no known layer 2 problems
• Verify that the remote neighbor is not down

Further reference:

N/A

3.2.13 PIM neighbor/multicast

Multicast traffic is only forwarded in the network if the PIM, BGP and ISIS protocols work correctly. A special router is present which performs the so-called RP function. This router should be reachable at all times.

Diagnostics:

The commands below should be run in case of PIM or multicast problems.

Log in to the node and run:

    show pim interfaces
    show pim neighbors
    show pim rps
    ping <RP address found above>
    show multicast route

Impact:

Multiple PIM neighbors should be present. If more than one PIM neighbor is down it can mean that no multicast is flowing through the router.

If the RP is not reachable no multicast traffic can be sent to this router.

Common causes:

• ISIS problems
• RP failure
• Remote node failure
• Configuration failure
• BGP problems

Solution:

Please note that in most cases nothing has to be done at the reporting router itself to solve a multicast/PIM event.

• Verify that the remote neighbor is not down
• Verify that the RP is reachable
• Verify that BGP is running as expected

Further reference:

N/A

3.3 Non-fault management alarms or undocumented events

3.3.1 Undocumented event

For undocumented events at least do the following:

• Do the basic checks for all events on the node where you think the problem is
• Check if there is customer impact
• Check if you can find a common node in the topology (consult the NGN network drawing) which causes the problem
• Try to relate the problem to a certain protocol or service
• Use the service documentation to troubleshoot the service itself
• Always inform the next escalation level that something undocumented at platform level has happened, so this document can be improved

3.3.2 Network slow

If there are complaints that the network is “slow”, please check the main network capacity indicators:

• Peering points
• NGN – Cisco backbone interconnection
• Transit interconnection

Check for huge traffic spikes (might be a DDoS attack) or huge traffic declines (might be a routing problem).
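
What counts as a "huge" spike or decline depends on the link. As a minimal sketch in Python, the check can be expressed as a comparison of the current rate against a known baseline for that interconnection. The function name and both factors (3.0 and 0.3) are hypothetical values chosen for illustration, not thresholds taken from this guide or from the vendor.

```python
def classify_traffic(current_bps, baseline_bps, spike_factor=3.0, decline_factor=0.3):
    """Classify a traffic rate against its usual baseline.

    Returns "spike" (possible DDoS attack), "decline" (possible routing
    problem) or "normal". Both factors are hypothetical defaults.
    """
    if baseline_bps <= 0:
        raise ValueError("baseline_bps must be positive")
    ratio = current_bps / baseline_bps
    if ratio >= spike_factor:
        return "spike"
    if ratio <= decline_factor:
        return "decline"
    return "normal"


if __name__ == "__main__":
    # Example rates in bits per second for one interconnection.
    print(classify_traffic(9_000_000_000, 2_000_000_000))
    print(classify_traffic(100_000_000, 2_000_000_000))
    print(classify_traffic(2_200_000_000, 2_000_000_000))
```

The baseline would normally come from the traffic graphs of the same interconnection at a comparable time of day.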

3.3.3 Reachability problem

If it has been verified in the fault management system that no problems are currently present which could cause this, a routing problem may have occurred. This can happen after changes (check which RFCs have been performed) or after certain events (for example, a router which has never been used before has become active after a switchover). Escalate to the next level if there is customer impact.

3.3.4 Complete service/product not working

In this case make sure that all outstanding alarms in the fault management system are confirmed and that they cannot impact the service which is currently not working. It is highly unlikely that a complete service is down without any alarms being raised for it.

3.4 Disaster recovery

Almost all JUNOS based systems in the Tele2-Versatel network have a dual routing-engine setup. It is highly unlikely that we will ever come into a situation where we lose the complete router and all configuration (there is a small number of single RE chassis systems in the network). Below you can find a procedure to recover the configuration in case of disaster.

The configuration of JUNOS based systems (M/T/J) is stored in the network management system JUNOScope:

    Reachable at: http://10.0.129.120:8080/jtk/login
    Login: frops
    Password: frops

To find the latest stored configuration of a device go to:

Configuration -> Repository -> Display -> <select device> -> <latest date>

There are two options to recover the configuration:

Option 1: DCN is up and working, configuration is copied to the FTP server in the DCN (octopus)

    ftp <ip of octopus>
    get <filename>
    quit
    edit
    load override <filename>
    commit synchronize

Option 2: Cut and paste via console

    edit
    load override terminal
    <cut and paste configuration>
    CTRL-D
    commit synchronize

3.5 Hardware maintenance verification

Juniper has an excellent Network Operations Guide available which documents what to do in case of hardware failure, maintenance and replacement. See the further reference below for where to find the hardware documentation (this includes, per component, how to verify correct behavior).

In general always run the following (before and after hardware replacement):

    show chassis alarms
    show chassis hardware
    show chassis <hardware component name>
    show log messages

This will show the present hardware; the specific component command will show the status of the component(s).

Further reference: http://www.juniper.net/techpubs/software/nog/nog-hardware/html/nog-hardwareTOC.html
