Anatomy of an IP Router Vahid Tabatabaee Fall 2007

1ENTS689L: Packet Processing and SwitchingAnatomy of an IP Router

Anatomy of an IP Router

Vahid Tabatabaee

Fall 2007


References

Title: Network Processors Architectures, Protocols, and PlatformsAuthor: Panos C. LekkasPublisher: McGraw-Hill

James Aweya, “IP Router Architectures: An Overview”, Nortel Networks, Ottawa, Canada

Florian Brodersen, Alexander Klinetschek, “Anatomy of a High Performance IP router”, Communication Network Seminar 2003/04, Hasso-Plattner-Institute, University of Potsdam, Jan. 2004

Steve Kohalmi, Tim Hale, “Anatomy of an IP Service Edge Switch”, 2002 Quary Technologies.

Cisco Systems CRS-1 router documents.


Basic IP Router Components

Network Interfaces

Processing Modules

Buffering Modules

Interconnection Unit (switch fabric)

The processing and buffering modules may be replicated either fully or partially on the network interfaces.

Path computation, Routing Table Maintenance

Packet Forwarding, Packet Processing,May cache routing table

Transfer Packets btw. Ingress and Egress Interface (Line) Cards


Basic Functions of a Router

Route Processing (Routing Protocols OSPF, RIP, …)

Path ComputationRouting Table

MaintenanceReachability Propagation

Packet Forwarding

Slow Path orControl Plane

Fast Path orData Plane


Packet Forwarding

IP Packet Validation Version Number Header length field Check sum.

Dest. IP address parsing and table lookup. Local delivery in the network. Unicast delivery to an output port. Multicast delivery to a set of output ports

Packet Lifetime Control Adjust the time-to-live (TTL) field A packet with positive TTL is delivered to a local address Packet delivered to output ports has its TTL decremented and rechecked before

forwarding Packet Fragmentation

Check if the packet size is larger than MTU of the network If yes, fragment the packet.


First Generation of Routers

Similar to a typical computer layout.

All functionality is implemented in software.

Single CPU, single Memory, Single Bus!


Problems with first generation routers

Processing speed is limited by the single CPU.

The CPU should process all packets destined to it and those packets that are passing through it.

Major packet processing tasks such as table lookups are memory intensive operations and can not be done faster by simple processor upgrades.

Software implementation is inefficient, since it is a small set of operations repeated on all packets.

Slow path and fast path are implemented on the same CPU. Therefore, slow path can influence the fast path.

The routing table size has grown from 20,000 entries from 1994 to 260,000 entries today.

Moving data from one interface to another can be time consuming that often exceeds the packet processing time. Source http://bgp.potaroo.net/


The routing table lookup speed can not be improved if we use traditional memories.

The conventional bus structure for the interconnection is very inefficient.

Every packet has to pass the bus at least twice.

The whole packet (not just the header) is transferred.

Problems with first generation routers


How fast a router should be?

An OC-48 link data rate : 2.488 Gbps

Packet rate is more important than the data rate.

Bottleneck is caused by the minimum packet size which depends on the technology.

E.g. Packet-over-SONET (PoS): 40 byte IP payload + 6 byte PPP/HDLC overhead:

2.405 Gbps /(8 x 46) = 6.53 MPPS The aggregate packet rate for a 16 port system:

16 x 6.53 = 104.48 MPPS One decision every 9.57 nsec.

SDRAM speed is about 10ns from sequential locations and practically around 20-50 ns.


What is the solution

Take advantage of Parallelism: NIC became more intelligent and took care of most packet

forwarding. We use ASIC in NIC (line cards) for packet classification and

forwarding. Most packets do not go to the CPU card (control card).

Switching Interface: Use switching element to pass packets between line cards

directly and simultaneously.


Modern Switch Based Architecture

Classification and forwarding decisions are done in line cards.

High speed interconnection mechanism (switching) between the line cards.

This provides a fast data path.

Standard CPU (RISC processor) is used for the control plane (slow path).

Hardware and/or software implementation for classification and forwarding in the line card.


Functional Blocks in a Modern Switches

The PHY Interface Responsible for transmitting and receiving information Conversion of the bit stream from digital form to analog signal and vice

versa. Switch Fabric

The router has a bus or a backplane The switch fabric reads packet from input port and routes it to the

output port. Packet processing

Fast path (data path): Handles all operations that are executed in real time on packets (e.g.: framing/parsing, classification, modification, compression/encryption, queueing)

Slow path (control path): Operations executed of the packet flows. (e.g.: add. Resolution, route calculation, update of routing table,…)

Host processing Network management, configuring devices, diagnostics Implemented in software on a CPU


Line cards in Modern Switches

Line card handles packet processing such as:

Classification

Forwarding

Traffic Policing and shaping

Monitoring and Statistics


Data Path Diagram

OpticsCDR &Serdes

Framer/

Mapper

NetworkProcessor

TrafficManager

SwitchInterface

SwitchingElement

SchedulingElement

Egress

Line CardSwitch Card

Ingress

Packet Processing Units

Source: Light Reading Report


Data Path Functions

- Parse- Identify flow- Determine Egress Port- Mark QoS Parameters- Append TM or SF Header

- Police- Manage congestion (WRED)- Queue packets in class-based VOQs- Segment packets into switch cells

- Queues cells in class based VOQs- Flow control TM per class based VOQ - Schedule class based VOQs to egress ports

-Reassemble cells into packets-Shape outgoing traffic-Schedule egress traffic

Network ProcessorNetwork Processor Ingress Traffic ManagerIngress Traffic Manager Switch FabricSwitch Fabric Egress Traffic ManagerEgress Traffic Manager

WRED

Discard

Segmentation + header

TM Scheduler

SF Flow Control

SF Arbiter

Rea

ssem

ble

Egress Scheduler&

ShaperIncoming packets

Class based queueing of outgoing packets

Ingress Line CardIngress Line Card Switch FabricSwitch Fabric Egress Line CardEgress Line Card


Switch Card to Line Card Connection

This connection should pass through the Backplane. Serdes (Serializer-Deserializer) is used for this

connection.Each Serdes signal run over two wires and two pins

(differential mode signal).The speed is usually around 3.125 Gbps.They run some sort of coding (8b/10b encoding)The actual data rate would be around 2.5 Gbps.There are attempts to provide 10 Gbps serdes.


How many serdes do we need?

How fast should be the connection between switch card and line card?

The line speed is not enough.

Switch fabric throughput is less than 100% due to contention.

Network Processor, Traffic manager and switch fabric add their headers.

There is also cell tax.


Speedup

SwitchInterface

SwitchingElement

SchedulingElement

Line CardSwitch Card

LineCard

Elements

RL RTM RSFF

rag

me

nta

tio

n(C

ell

Ta

x)

Tra

ffic

Ma

na

ge

rH

ea

de

r

Sw

itc

h F

ab

ric

He

ad

er

Effective Speedup = RSF/RTM

In the commercial systems, speedup usually refers to RSF/RL.

Higher speedup factor: Increases system design

complexity. Increases power

consumption. Creates signal integrity

issues. Required Speedup factor is

around 2


Redundancy

We have spare switch cards and control cards in the system.

The redundancy models: Passive redundancy (N:1) We have

one inactive switch card in the system that starts to work after failure.

Passive redundancy (1:1, N:N) for each active switch card, we have one inactive card.

Load-Sharing Redundancy (N-1) all cards are active and when a failure happens, performance will degrade gracefully.

Active Redundancy (1+1): Two sets of fabrics carrying the same traffic.

Source: www.idt.com/content/switchblock.jpg


Example

In a 16 port 10Gbps switch with 2X speed up with and N:N redundancy how many 2.5 Gbps serdes do we need?

We need 20 Gbps active and 20 Gbps redundant data rate for each line card.

This means 16 serdes for each line card.

For 16 line cards we need 256 serdes in this system.


Example

What is the effective speed up of this system for 40 byte IP packets if the traffic manager header size is 12 bytes, switch fabric header size is 8 bytes and the payload size of the cell is 52 bytes.

Solution: In slide 9 example we show that there can be 6.53 MPPS (40 byte packets) on an OC-48 line.

Similarly on an OC-192 there can be up to 9.622/(8x46) = 26.15 MPPS. Each packet is encapsulated in one cell, since

40 + 6 < 52 The maximum number of cells that a line card can generate is

(2.5 x 8 Gbps) / ((52+8+12)x8) = 34.722 Effective Speedup is,

Speedup = (34.722/26.15) = 1.33


Traces Per Serdes

Typical LVDS speed is 1.25Gbps For 2.5Gbps we need 2 channels LVDS is differential, i.e. 2 traces per channel LVDS is unidirectional, i.e. 2 for full duplex Full duplex 2.5Gbps, using LVDS requires 8

traces In the previous example we will have 256 x 8 =

2048 traces on the back plane.


A sample Router (Cisco CRS-1)


The line card chassis

8 service cards and 8 physical layer interface module cards



Physical layer Interface Module

Routing Processor (control plane)

Switching Card

16 slot Single-Shelf system

The distributed route processor (DRP) is optional components that provide enhanced routing capabilities.• The DRP contains two symmetric multiprocessors (SMPs), each of which performs routing functions. • Processor-intensive tasks (such as BGP speakers and ISIS) can be offloaded from the route processors (RPs) to the DRPs.


Multishelf Systems

2 to 72 line card shelves1 to 8 fabric card shelves


How to handle packet processing?

Of the shelf CPUThis usually would be a RISC processor.In low end systems it could be a CISC processor.

ASICSpecialized high performance ASIC to handle packet

processing.

Ideal approach for companies such as IBM and intel, since they are manufacturers of Integrated Circuits


Off-the-shelf CPU Systems

Packet processing is implemented in software running on the CPU.

Modifications, upgrades and debugging is accomplished by simple software updates and downloads

Update time much shorter which is good for both user and developer

Not very efficient: spending many clock cycles on tasks not related to packet processing.

Fastest off-the-shelf CPU can handle about 1 gigabit per second.

Trend is to do deeper packet processing (more on this later).


Memory Bottleneck

The pipeline architecture of CPU enables them to perform billions of instructions per second.

However, in order to sustain the pipeline they should fetch data from memory and store it back continuously.

This can be done with very sophisticated multi-level hierarchy of different memory technology, interleaving memory banks.

This requires prohibitive cost, design complexity and power consumption.

Hence typical processor pipeline end-up being often empty, which reduces the system throughput.

Network traffic statistics models are completely different from local traffic on a computer bus. They do not have the same spatial and temporal locality properties. Hence, the typical processor’s cache systems are not effective.


Sup-optimal Instruction Set

The instruction set that we need for packet processing requires specific bit level operations.

These instructions should be done at wire speed.

These instructions are not available as standard instructions of off-the-shelf CPU.

Hence, we have to assemble multiple standard instructions to perform the intended functionality.


Packet processing with ASIC

ASIC typically delivers higher performance.

ASIC is not programmable: Adding new functionality new design Adding new protocol new design New design Costly for both vendor and the user.

ASIC design is very time consuming Design cycle takes 12 to 18 months. If we need some modification we may need to recode the

whole design. Many start-up failures are due to time delay.


ASIC development is costly

Expensive and time-consuming to change. For testing an ASIC you need to design a

system Expensive development tools (design and

verification). Requires ASIC designers (much more

expensive). Tape out of a chip costs around a million dollar.


So is there a middle ground solution?

Can we have a technology that :

Has flexibility of programmable processors

Has high speed of ASICs

Solution is called Network Processor!

Network processor are programmable similar to CPU, but their performance is close to ASIC


Network Processors value proposition

Shorter time to market: Instead of 18 months it takes about 6 months to complete

development cycle of packet processing part. Longer time in market:

New features can be embedded into a deployed network processor based product.

Increased time in market reduces cost of product ownership over the life of product.

Just-in-time delivery of new features: We can modify the design and adding new features in the

field without penalizing the customer. Greater focus on other issues of business management

Most functions are already coded in a standard way Developers can focus on differentiating features


Packet Processing Stages

1. Remove Link Layer Headers and Decryption: Ethernet PPP (Point-to-Point Protocol) Frame PPP over ATM PPP over Ethernet over ATM

2. Identify Ingress Subscriber: To extract information from the link layer protocol header about the owner of the

packet.

3. Filtering: To permit of deny specific traffic flows, based on various attributes of the IP and

higher layers headers.

4. Traffic Classification: To allow different traffic management, QoS, security and routing policies applied to

different types of flows.

5. Traffic Metering, Marking & Policing: To control Peak and Committed Information Rate. To determine PHB in the DiffServ Model (chnaging the priority)


Packet Processing Stages

6. Custom Routing Polices: To direct some traffic through specific paths (internet, VPN, specific destination) Virtual Private Routed Network allows users to network in privacy over their own

routed network using their own private address. Sending suspicious traffic to explicit locations for special processing.

7. NAT & NAPT (Network Address [Port] Translation): Address translation at the source if the user is using a private address space. Static one-to-one with NAT and dynamic many-to-one with NAPT.

8. Route Table Look-up: Best matching prefix look-up on the destination IP address.

9. Enforcing the PHB/ PerFlow (Link Sharing): Priority, WRR, WFQ scheduling, WRED (weighted random early detection).

10. Egress Side Processing (QoS, filtering, encryption, NAT, Egress Subscriber Identification, Traffic Classification, Link Sharing)

11. Statistical Collection


Deep Packet Processing

In deep packet processing we need to look at the contents of the packet not just the header.

Why do we need deep packet processing? Deep packet inspection for firewalls and intrusion detection

systems. Traffic shape or discard P2P traffic

Server load balancing: distribution of traffic among servers

based on the web destination

Network Monitoring and Analysis


Packet Processing Implementation issues

We need multiple table look-ups for each packet. Access to whole packet not just the IP header is necessary. There can be ten of thousands of simultaneously active

subscribers comprising millions of application flows. In a fully loaded Gigabit Ethernet connection about 1.5 million

packets per second must be processed Modern general processors are optimized for numeric computation

rather than processing packets. Memory read and write speeds become bottlenecks. Caching and high-speed memory burst capability does not help,

since packet processing requires: Large tables Short entries Random access queries


How do network processors do this?

Specialized circuitry and micro-engines to perform all generic packet processing functions.

They also usually embed a major programmable module, usually a tailor-made RISC CPU (and sometimes more than one). Real time operating system Handshake communication with other parts of the system


Network Processor Categories

Platform Network Processors objectives: Handle most packet processing functions Minimize the number of components and the hardware cost Optimize the trade-off btw. Performance and flexibility Accelerate software development cycle

Peripheral Network Processors Designed to optimize a specific function Compressor chips IP security


The other side argument

Every single task can be done in wire-speed.

How about multi-tasks at the same time.

What is a realistic scenario to consider?

Challenge of Benchmarking

Programming complexity

Documents

Anatomy of an IP Router Vahid Tabatabaee Fall 2007