Tail-f Systems Whitepaper - Introduction to NETCONF

Tail-f Systems White Paper

Tail-f Systems © 2008

Addressing Network Management Challenges with NETCONF

Executive Summary

At the same time as networks have become large and complex, network operators are under greater pressure to reduce operating costs and eliminate network disruptions. Manual misconfiguration of network devices is a leading cause of network downtime. For example, a few years ago, AT&T experienced an outage that affected most of its network and took two hours to rectify. This was caused by a manual misconfiguration of the OSPF routing protocol in one of AT&T’s backbone routers [1].

By necessity, many service providers and enterprises manage their networks with a battery of independent management interfaces rather than taking a unified approach to network configuration. The NETCONF protocol promises to unify the process of configuration management with a single API. Tail-f Systems delivers a family of products to help build network management systems that fully support the NETCONF standard.

This paper reviews the business cost of outages such as the one AT&T experienced. It also details some of the reasons that network architecture has become so complex, discusses current approaches to the problem and finally presents NETCONF as a powerful building block solution to the network management problem.

Business Implications of Network Outages

The network architectures of network providers (such as telecom companies, mobile operators, internet service providers, and enterprise networks) are growing increasingly complex. They are also becoming increasingly critical, such that the cost of poor implementation or outright failure has grown drastically. Network or service outages lead to lost revenue, organizational inactivity, PR nightmares and the alienation of customers and/or employees, any of which can cost tens or hundreds of thousands of dollars per hour, if not millions, while also creating significant long-term setbacks.

Additionally, the loss of control over customers’ and subscribers’ personal data has become an increasingly central issue over the last few years. This liability, together with the misuse of services (abuse, fraud, worms, viruses, and other malevolent or vulnerable software), places increased demands on the network architecture. Such network security challenges arise from having complex services that can be abused when misconfigured.

Pressures that Lead to Complex Network Architectures

A range of internal and external pressures contribute to the increasingly complex network architectures present in many service provider and enterprise environments. The issues related to the network itself include:
• increased number of network elements
• interaction among heterogeneous network elements
• increased sophistication in each element
• more sophisticated customer and partner demands and services

While satisfying these often disparate demands, the network administration organization must accomplish its tasks quickly, efficiently and inexpensively. For example, if a virus or worm attacks a large service provider, the system administration organization may have to combat the infestation by reconfiguring a network of several thousand routers with complex policies, as quickly as possible, in order to minimize interruption.

Networks typically include a range of elements, each of which is increasingly complex to manage and configure. Required services are also growing more complex, involving ever larger configurations. Also, individual boxes can provide an increasing number of services.

Most network devices are configured to cooperate with other members of the network. The most critical network functions usually require specialized components as well as the basic network building blocks. For example, if a web server is deployed, regardless of its function (for internal use, third-party hosting, web services, or other services), it is normally physically replicated and protected by load balancers to provide resilience and scalability and augmented by firewalls and SSL accelerators to provide security and offload computationally intensive tasks. This increases the network’s complexity as each entity must be individually architected, managed, and upgraded to provide maximum functionality in a changing business environment.

A further cause of complexity is the many ways in which organizations evolve. Examples include organic growth, geographical diversification, and mergers and acquisitions.

As an organization and its network grow, devices are added to increase capacity, reliability, and security. The addition of new offices, points of presence, or data centers results in additional architectural issues which increase the complexity of the organization and the network. For example, company growth and reorganizations may add new sites to the company’s virtual private network which requires complex network reconfiguration by the VPN service provider.

Current Approaches to Network Management

Vendors of complex network devices provide management interfaces (such as CLIs) and custom data stores (such as per-device databases) for their devices. These stovepipe applications help network administrators in the short-term, but do not integrate with other applications. As such, these tools hamper long-term network management because of disparate data stores and custom management solutions.



It is clear that managing each device by hand, even aided by a central management console, is neither scalable nor economical. Common pitfalls include low productivity and human error. The likelihood of such errors is increased by the repetitive nature of the work and by the presence of diverse management interfaces without a common look and feel.

To scale beyond the current situation of network management, service providers should move to a model in which the network, rather than the network elements, is directly managed. As noted by industry experts [3], moving to this abstraction is reminiscent of programmers moving away from assembly language toward higher-level languages. An ideal network management system should provide four crucial properties:
• automation
• consolidation
• standardization
• formalization

By automating tasks that are repetitive, tedious, and easily mishandled, the system administration organization gains flexibility, reliability and efficiency. At the most basic level, management is done by editing configuration files and instructing the device to re-read its configuration. At a higher level, a task can be automated by encapsulating it in a script. The script is invoked when the task must be performed. A conventional stovepipe CLI can provide these functions for a particular device, but cannot do so for multiple devices. Plain scripting can still be problematic, however. At an even higher level, the network information can be structured and made available to tools. The process of scripting can then be highly simplified or automated, as discussed below.

Network management systems (NMS) or operations and business support systems (OSS/BSS) provide a logically central (but possibly physically replicated and/or geographically distributed) management point which consolidates some or all of the network and its associated services. Multiple administrators with multiple roles collaborate on management tasks against such consolidators; the consolidation server then ensures that the tasks are implemented by querying and/or reconfiguring the actual network equipment.

In order to build these management systems, two parallel approaches are available. The first approach is to integrate existing equipment. Automating activity against such a device normally means running scripts against it. These scripts send textual commands to the device and then analyze its textual output. The advantage of this approach is that legacy equipment can be managed, after a fashion. However, there are also strong disadvantages associated with this approach.

The first disadvantage is bit rot: as new network elements are added, old ones retired, and existing ones upgraded, the collection of management scripts must be updated to account for this. Fresh scripts must be written and tested to work against new or updated equipment. The way to get past this is standardization of the management interface in order to make purely textual issues irrelevant.



The second great disadvantage is that all knowledge of the network is embodied implicitly in the collection of scripts rather than explicitly in a data model. While this may seem to be a minor point, the drawback is significant. Consider a network where router300 is a brand A router. The system administrator replaces the router with a brand B router. At this point, all A-specific scripts operating on router300 must be updated to B-specific scripts. A manual update is, as usual, tedious and error prone.

Instead, as the second approach, such updates could be handled by the management server. This is possible if we formalize the network architecture and the network elements into a data model which is used by the management system and system administrators. A formalized system model could be used architecturally for a multitude of tasks, such as implementing and checking network configurations and policies, but also organizationally for tasks such as planning, organizing and optimizing the network. For example, emerging tools for analyzing router misconfigurations could be applied to a formalized network model to detect errors [4].
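The router300 example can be made concrete with a small sketch. It is only an illustration under stated assumptions: the `Router` record, the two renderers, and both command syntaxes are invented here and do not correspond to any real vendor or to a Tail-f product. The point it shows is that, once the device is described in a model, replacing brand A with brand B is a one-field change rather than a rewrite of every script.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Router:
    """One network element as recorded in a (hypothetical) data model."""
    name: str
    brand: str
    hostname: str
    ntp_server: str


# Invented command syntaxes for the two brands; real devices will differ.
def render_brand_a(router: Router) -> list[str]:
    return [f"hostname {router.hostname}", f"ntp server {router.ntp_server}"]


def render_brand_b(router: Router) -> list[str]:
    return [
        f"set system name {router.hostname}",
        f"set system time-source {router.ntp_server}",
    ]


RENDERERS = {"A": render_brand_a, "B": render_brand_b}


def render(router: Router) -> list[str]:
    """Derive device-specific commands from the brand-neutral model entry."""
    return RENDERERS[router.brand](router)


router300 = Router("router300", "A", "core-1", "192.0.2.10")
commands_before = render(router300)

# Swapping the hardware is a one-field change in the model.
router300 = replace(router300, brand="B")
commands_after = render(router300)
```

The knowledge about router300 (its name, host name and time server) lives in one place, and the brand-specific knowledge lives in one renderer per brand, so neither has to be rediscovered from a pile of scripts.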

Traditionally, SNMP has been the standard for remote management and a large body of work has emerged to support it. However, there are several problems with SNMP that make it a less suitable candidate for some management tasks. First, SNMP operates over UDP, which is an unreliable datagram protocol. Second, UDP limits the maximum message size, so large configurations cannot be sent in a single datagram. Both of these properties combine to make SNMP unsuitable for writing configurations. Third, SNMP uses a protocol-specific security mechanism rather than a standard method (such as SSH), which increases administrator workload and complicates the network architecture. Finally, while some vendors provide commit/rollback-like functionality, SNMP lacks a standard commit operation for individual devices or groups of network elements. In practice, therefore, SNMP is rarely used for writing configurations. The new IETF protocol for device management, NETCONF, offers a powerful solution to the problem of network management, as it satisfies all of the criteria of an ideal management system.

What is NETCONF?

NETCONF is an XML-based protocol for automated configuration management. NETCONF was formalized as RFCs 4741 through 4744 by the IETF in December of 2006. As a building block for device management automation, it provides the mechanisms for installing, querying, manipulating and deleting the configurations of network devices. For equipment vendors, NETCONF is a way to outsource the management interface of network elements. For service providers, it is a way to optimize the administrative workflow by moving management intelligence out of the device under management and consolidating it in higher-level applications, such as the consolidation servers discussed previously.

NETCONF exposes a standardized RPC-style API based on XML. The XML requests and responses are sent over a persistent, secure, authenticated transport protocol, such as SSH. See Figure 1 below for examples of NETCONF layers.
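To make the RPC style concrete, the following sketch builds a `<get-config>` request and reads the matching reply using only the Python standard library. The namespace and element names come from the base protocol (RFC 4741); the function names, the sample reply and the `http://example.com/ns/system` payload namespace are assumptions made for illustration, and no transport (SSH session, message framing) is shown.

```python
import xml.etree.ElementTree as ET

NETCONF_NS = "urn:ietf:params:xml:ns:netconf:base:1.0"
NS = {"nc": NETCONF_NS}

# Serialize NETCONF elements in the default namespace instead of "ns0:".
ET.register_namespace("", NETCONF_NS)


def build_get_config(message_id: str, datastore: str = "running") -> str:
    """Return a <get-config> RPC asking for the named datastore."""
    rpc = ET.Element(f"{{{NETCONF_NS}}}rpc", {"message-id": message_id})
    get_config = ET.SubElement(rpc, f"{{{NETCONF_NS}}}get-config")
    source = ET.SubElement(get_config, f"{{{NETCONF_NS}}}source")
    ET.SubElement(source, f"{{{NETCONF_NS}}}{datastore}")
    return ET.tostring(rpc, encoding="unicode")


def extract_data(reply_xml: str, message_id: str) -> ET.Element:
    """Return the <data> element of an <rpc-reply>, or raise on problems."""
    reply = ET.fromstring(reply_xml)
    if reply.get("message-id") != message_id:
        raise ValueError("reply does not match the request")
    if reply.find("nc:rpc-error", NS) is not None:
        raise RuntimeError("device reported an rpc-error")
    data = reply.find("nc:data", NS)
    if data is None:
        raise ValueError("reply carries no <data> element")
    return data


request = build_get_config("101")

# Hand-written sample reply; a real device would return its own data model.
sample_reply = (
    '<rpc-reply message-id="101" '
    'xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">'
    '<data><hostname xmlns="http://example.com/ns/system">core-1</hostname>'
    "</data></rpc-reply>"
)
data = extract_data(sample_reply, "101")
```

Because request and reply are ordinary XML documents correlated by `message-id`, a management tool needs no device-specific text parsing to tell success from failure.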



The use of encryption means that the requests and responses are confidential and tamper-proof. In addition to a secure communication system, NETCONF requires devices to track client identities and enforce permissions associated with identities. This means that devices can be managed over an untrusted wide area network, a distinct advantage compared to other approaches. Configuration over a WAN means that network management can be centralized through consolidation of all management to a single site, and also decentralized, as multiple sites can share device management work.

NETCONF replaces the previous device-specific practice of interrogating a box through its CLI with an XML-based API. NETCONF can be inexpensively provided by devices and straightforwardly used by higher-level management tools. It thus provides a suitable base protocol for management servers.

NETCONF offers a number of advantages over other approaches to network management. Since NETCONF is based on XML and XML Schema, it easily represents complex data, which can be stored in XML databases and manipulated with tools such as XSLT and other XML style sheet processors. Because it runs over SSH, NETCONF also makes remote management both secure and simple.

NETCONF is extensible, future-proof, and reasonably straightforward for vendors to implement, which reduces cost, price and time-to-market. NETCONF sessions begin with a capability discovery phase in which the device (or server) exposes its capabilities to the client. New features can be defined locally, but are defined formally, with rigorous syntax and semantics. If the client detects capabilities it does not know, it ignores them and does not use them.
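The capability discovery phase can be sketched as follows. The `<hello>` structure and the three `urn:ietf:params:netconf:...` URIs are taken from the base protocol; the vendor URI, the session id and the function names are made up for the example, and the client's own `<hello>` is omitted.

```python
import xml.etree.ElementTree as ET

NS = {"nc": "urn:ietf:params:xml:ns:netconf:base:1.0"}

# Sample <hello> as a device might send it; the last capability is invented.
SERVER_HELLO = """<hello xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
  <capabilities>
    <capability>urn:ietf:params:netconf:base:1.0</capability>
    <capability>urn:ietf:params:netconf:capability:candidate:1.0</capability>
    <capability>urn:ietf:params:netconf:capability:confirmed-commit:1.0</capability>
    <capability>http://example.com/ns/vendor-feature</capability>
  </capabilities>
  <session-id>4</session-id>
</hello>"""

# Capabilities this (hypothetical) client knows how to use.
CLIENT_KNOWN = {
    "urn:ietf:params:netconf:base:1.0",
    "urn:ietf:params:netconf:capability:candidate:1.0",
    "urn:ietf:params:netconf:capability:confirmed-commit:1.0",
    "urn:ietf:params:netconf:capability:startup:1.0",
}


def advertised_capabilities(hello_xml: str) -> set[str]:
    """Collect every capability URI listed in a <hello> message."""
    root = ET.fromstring(hello_xml)
    return {
        cap.text.strip()
        for cap in root.findall("nc:capabilities/nc:capability", NS)
        if cap.text
    }


def usable_capabilities(hello_xml: str, known: set[str]) -> set[str]:
    """Keep only capabilities the client understands; ignore the rest."""
    return advertised_capabilities(hello_xml) & known


usable = usable_capabilities(SERVER_HELLO, CLIENT_KNOWN)
```

Here the client silently ignores the vendor extension and does not attempt to use the startup datastore, since the device never advertised it.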

Figure 1: NETCONF Protocol Layers

Finally, NETCONF provides protocol mechanisms for locking configurations and manipulating configurations in bulk. By locking and working on multiple devices at once, a management system built on NETCONF can implement network-wide policies as logical management operations. NETCONF also provides a candidate configuration with a confirmed commit mechanism: after a configured interval, devices automatically revert to their original configuration unless the change has been confirmed by a second, confirming commit. Administrators can use this capability to test configurations that may potentially degrade or disable connectivity. If such an error occurs, the confirming commit does not reach the misconfigured devices. After a timeout, the network automatically reverts to the original working configuration.
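The revert-on-timeout behavior can be modeled in a few lines. This is a simulation, not a protocol implementation: `SimulatedDevice` and its methods are hypothetical, configurations are plain dictionaries, and time is passed in explicitly so the behavior is easy to follow. The 600-second default mirrors the default confirm timeout in RFC 4741.

```python
class SimulatedDevice:
    """In-memory stand-in for a device with candidate and running configs."""

    def __init__(self, running: dict):
        self.running = dict(running)
        self.candidate = dict(running)
        self._rollback = None   # configuration to restore on timeout
        self._deadline = None   # time at which the rollback fires

    def confirmed_commit(self, now: float, timeout: float = 600.0) -> None:
        """Activate the candidate, but only provisionally."""
        if self._rollback is None:
            self._rollback = dict(self.running)
        self.running = dict(self.candidate)
        self._deadline = now + timeout

    def confirm(self, now: float) -> bool:
        """Confirming commit; returns False if nothing is pending any more
        (for example, because the timeout has already fired)."""
        self.tick(now)
        if self._rollback is None:
            return False
        self._rollback = self._deadline = None
        return True

    def tick(self, now: float) -> None:
        """Revert to the saved configuration once the deadline has passed."""
        if self._deadline is not None and now >= self._deadline:
            self.running = self._rollback
            self.candidate = dict(self.running)
            self._rollback = self._deadline = None


# A change that cuts off the management path is never confirmed ...
device = SimulatedDevice({"mtu": 1500})
device.candidate["mtu"] = 9000
device.confirmed_commit(now=0.0, timeout=120.0)
provisional = dict(device.running)

# ... so once the timeout expires the device restores the old configuration.
device.tick(now=120.0)
restored = dict(device.running)
```

If the management system can still reach the device, it calls `confirm` before the deadline and the new configuration stays in place.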

Examples of NETCONF at Work

A management system can use NETCONF to script, automate or routinely carry out some management tasks. A sampling of such tasks follows, with the individual NETCONF steps listed for each.

Task: Install a new network element

1. Query the device to determine the model and software revision.
2. Generate or select an appropriate configuration at the management server.
3. Deposit the concrete configuration document at URL uuu.
4. Ask the device to copy the configuration from uuu.
5. Perform device-specific adjustments as directed by the management server.
6. Make the configuration permanent at the device.
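Steps 4 and 6 of this task map naturally onto the `<copy-config>` operation. The sketch below builds the two requests; it assumes the device advertises the URL and startup capabilities, that the running datastore is the copy target in step 4, and that the configuration URL (a made-up address standing in for uuu) is reachable from the device. The helper names are illustrative only.

```python
import xml.etree.ElementTree as ET

NETCONF_NS = "urn:ietf:params:xml:ns:netconf:base:1.0"
NS = {"nc": NETCONF_NS}


def _q(tag: str) -> str:
    """Qualify a tag name with the NETCONF base namespace."""
    return f"{{{NETCONF_NS}}}{tag}"


def _endpoint(parent, role, datastore=None, url=None):
    """Add a <source> or <target> naming either a datastore or a URL."""
    node = ET.SubElement(parent, _q(role))
    if url is not None:
        ET.SubElement(node, _q("url")).text = url
    else:
        ET.SubElement(node, _q(datastore))


def copy_config(message_id, *, target, source=None, source_url=None):
    """Build a <copy-config> RPC element."""
    rpc = ET.Element(_q("rpc"), {"message-id": str(message_id)})
    operation = ET.SubElement(rpc, _q("copy-config"))
    _endpoint(operation, "target", datastore=target)
    _endpoint(operation, "source", datastore=source, url=source_url)
    return rpc


def install_requests(config_url: str) -> list:
    """Requests for step 4 (fetch from URL) and step 6 (make permanent)."""
    return [
        copy_config(1, target="running", source_url=config_url),
        copy_config(2, target="startup", source="running"),
    ]


# Hypothetical location of the configuration document from step 3.
requests = install_requests("https://mgmt.example.net/configs/new-device.xml")
```

Steps 1 and 5 depend on the device's own data model and are therefore left out of the sketch.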

Task: Simultaneously configure two devices

1. Lock device A, then device B.
2. Modify the configuration of A to refer to B, and the configuration of B to refer to A as appropriate.
3. Commit new configurations on A and B.
4. Release locks on both configurations.
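The ordering of these four steps, and in particular the guarantee that locks are released even when something fails, can be sketched as below. `RecordingSession` is a hypothetical stand-in that only logs the operations it is asked to perform; a real client would send the corresponding `<lock>`, `<edit-config>`, `<commit>`, `<discard-changes>` and `<unlock>` RPCs.

```python
from contextlib import ExitStack


class RecordingSession:
    """Hypothetical stand-in for a NETCONF session; it only logs operations."""

    def __init__(self, name: str, log: list):
        self.name = name
        self.log = log

    def _record(self, operation: str) -> None:
        self.log.append((self.name, operation))

    def lock(self) -> None:
        self._record("lock")

    def unlock(self) -> None:
        self._record("unlock")

    def edit_config(self, config: str) -> None:
        self._record("edit-config")

    def commit(self) -> None:
        self._record("commit")

    def discard_changes(self) -> None:
        self._record("discard-changes")


def configure_pair(a, b, config_a: str, config_b: str) -> None:
    """Apply mutually dependent changes to two devices under lock."""
    with ExitStack() as locks:
        for session in (a, b):                # step 1: lock A, then B
            session.lock()
            locks.callback(session.unlock)    # step 4: always released
        try:
            a.edit_config(config_a)           # step 2: edit both candidates
            b.edit_config(config_b)
            a.commit()                        # step 3: commit both
            b.commit()
        except Exception:
            # Drop uncommitted edits. A commit that already succeeded on one
            # device is not undone here; confirmed commits cover that case.
            a.discard_changes()
            b.discard_changes()
            raise


log = []
configure_pair(
    RecordingSession("A", log), RecordingSession("B", log),
    "<config/>", "<config/>",
)
```

Locks are released in reverse order of acquisition, which is what `ExitStack` does with its registered callbacks.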

Task: Translate a device-independent configuration to become device-specific

1. On the management server, read the device-independent configuration document.
2. Get the device-specific parameters.
3. Select and perform an XSLT transformation to generate a new XML configuration document at URL uuu.
4. Copy the configuration from URL uuu to the device or devices, or perform further transformations.
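The shape of this translation is sketched below. Note the substitution: the Python standard library has no XSLT processor, so the sketch performs the mapping by hand with ElementTree instead of running a style sheet. The document format, the logical port name `uplink` and the device port name `eth0` are all invented for the example.

```python
import xml.etree.ElementTree as ET

# Step 1: device-independent document (invented format, logical port names).
GENERIC_CONFIG = "<interface><name>uplink</name><mtu>1500</mtu></interface>"

# Step 2: device-specific parameters, here a logical-to-physical port map.
PORT_MAP = {"uplink": "eth0"}


def to_device_specific(generic_xml: str, port_map: dict) -> str:
    """Step 3: rewrite logical port names into the device's own names."""
    root = ET.fromstring(generic_xml)
    for name in root.iter("name"):
        name.text = port_map[name.text]
    return ET.tostring(root, encoding="unicode")


device_config = to_device_specific(GENERIC_CONFIG, PORT_MAP)
```

The resulting document is what step 4 would deposit at the URL and copy to the device.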

Task: Troubleshoot multiple devices

1. Lock the running configurations of the devices.
2. Read the configurations to the management server.
3. Unlock the device configurations.
4. Analyze the XML documents to determine problems, verify policy compliance, archive the network state, or similar.
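Step 4 is where structured data pays off: once the configurations are XML documents on the management server, a policy check is a short tree walk. The sketch below assumes an invented configuration format and an invented policy (every interface must use the same MTU); device names and values are sample data.

```python
import xml.etree.ElementTree as ET

# Sample configurations as they might be read in step 2 (invented format).
CONFIGS = {
    "router1": (
        "<config>"
        "<interface><name>eth0</name><mtu>1500</mtu></interface>"
        "<interface><name>eth1</name><mtu>1500</mtu></interface>"
        "</config>"
    ),
    "router2": (
        "<config>"
        "<interface><name>eth0</name><mtu>1400</mtu></interface>"
        "</config>"
    ),
}


def mtu_violations(configs: dict, required_mtu: int) -> list:
    """List (device, interface, mtu) for every interface off policy."""
    violations = []
    for device, xml_text in sorted(configs.items()):
        root = ET.fromstring(xml_text)
        for interface in root.iter("interface"):
            mtu = int(interface.findtext("mtu"))
            if mtu != required_mtu:
                violations.append((device, interface.findtext("name"), mtu))
    return violations


violations = mtu_violations(CONFIGS, required_mtu=1500)
```

The same documents can be archived as the network state or fed to other analysis tools without touching the devices again.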



Tail-f Systems

Tail-f Systems is the leading developer of XML-based network management software for the providers of networking equipment and software. Customers using Tail-f Systems’ technology radically reduce their time-to-market and benefit from carrier-grade implementations of NETCONF, Web, CLI, and SNMP interfaces. Tail-f Systems is an active participant in the NETCONF Working Group within the IETF. Tail-f’s ConfD software (see figure 2 below) provides a complete solution to build on-device network management systems including a NETCONF management interface that supports all base and optional capabilities of the NETCONF standard. Tail-f’s Instant NETCONF Manager enables southbound NETCONF interfaces to be added to EMS/NMS systems. For more information on Tail-f Systems, go to www.tail-f.com.

Conclusion

NETCONF allows network operators to automate the configuration of their networks in a reliable and efficient fashion. NETCONF is being adopted in the telecommunications industry to help reduce operating costs, in data centers to help reduce network outages, and by the military to ensure continuous service and ironclad security.

Figure 2: Tail-f Systems Software Overview



References

[1] BGPexpert newsletter, 2002-10-15; http://www.bgpexpert.com/archive2002q4.php

[2] D. Patterson. A Simple Way to Estimate the Cost of Downtime. In Proceedings LISA'02: Sixteenth System Administration Conference. USENIX Association.

[3] Paul Anderson, quoted at LISA'03 Configuration Management BoF. http://www2.parc.com/csl/members/jthornton/LISA03ConfigMgmtBoF.html

[4] Nick Feamster, Hari Balakrishnan, Detecting BGP Configuration Faults with Static Analysis. In Proc. 2nd Symp. on Networked Systems Design and Implementation (NSDI), Boston, MA, May 2005.

[5] The NETCONF working group, protocol specifications and related documents, can be found at http://www.ops.ietf.org/netconf/.

Tail-f Systems North America 109 S. King Street, Suite 4 Leesburg, VA 20175, USA Phone: +1 703 777 1936

Tail-f Systems Headquarters Klara Norra Kyrkogata 31 SE-111 22, Stockholm, Sweden Phone: +46 8 21 37 40

[email protected] www.tail-f.com
