

IDT White Paper

Optimum Search Methods for Switch/Router Databases in Access

and Metro Edge Networks

Abstract

Recent years have seen considerable research into fast database search methods for wire-speed packet forwarding and classification in next generation routers. Most of this work focuses on algorithms for network processors combined with external SRAM. Router tables are generally too large to cache, so techniques are used to reduce the number of memory accesses per search. Search algorithms fall into three categories: "exact match" for L2 MAC address translation, "best match" for L3 forwarding, and "multiple field" searches for QoS Classification and Access Control. Most routers, however, currently use Network Search Engines (NSEs) to implement fast searches. This is despite much of the research dismissing NSEs as too clunky for smaller applications and un-scaleable for large router tables. This paper analyses the performance/storage trade-offs of both algorithmic and NSE implementations and shows how the new generation of NSEs offers streamlined and scaleable solutions that compare well with algorithmic methods.

Authors: Michael Miller, CTO: System Technology, IDT USA Sean Cochrane, European Technical Marketing Manager Date: 27th January 2003


CONTENTS

1. Introduction
2. Generic Router Operation
3. Database Search Requirement
   Packet Encapsulation in Mobile Networks
   Packet Encapsulation in Broadband Access Networks
   Search Requirement Summary
4. Review of Database Search Methods
   Direct Index Mapping & Hashing
   Binary (& Patricia) Trie Algorithms
   Multi-Level Trie Algorithms
   Trie Additions/Deletions (Maintenance)
   Multiple Field Search Algorithms
   Heuristic Algorithms
   Network Search Engines
   Algorithm versus NSE Summary
5. Search Method Performance Analysis
   Low cost SRAM Configuration
   Increased Performance SRAM Configuration
   NSE Implementation
   SRAM-NSE Performance Summary
6. References
A1 Appendix: CIDR & Longest Prefix Match


ABBREVIATIONS

3GPP    3rd Generation Partnership Project
AAL     ATM Adaptation Layer
ARP     Address Resolution Protocol
ATM     Asynchronous Transfer Mode
BGP     Border Gateway Protocol
CAM     Content Addressable Memory
DSLAM   Digital Subscriber Line Access Multiplexer
FOM     Foreign Object Model
GERAN   GSM/EDGE Radio Access Network
GPRS    General Packet Radio Service
IMS     Initialisation, Maintenance and Search
IP      Internet Protocol
IPSec   IP Security
IPv4    IP Version 4
IPv6    IP Version 6
IXA     Internet Exchange Architecture
NPU     Network Processor Unit
NSE     Network Search Engine
OSPF    Open Shortest Path First
PPP     Point-to-Point Protocol
QDR     Quad Data Rate
RIP     Routing Information Protocol
RSVP    Resource ReSerVation Protocol
SDK     Software Development Kit
SIP     Session Initiation Protocol
TCAM    Ternary CAM
UMTS    Universal Mobile Telecommunications System
UTRAN   UMTS Terrestrial Radio Access Network
VPN     Virtual Private Network
ZBT     Zero Bus Turn-around


1. Introduction This paper reviews the operation of a generic switch/router and highlights the need for multiple database search functions. These are expressed as either exact-match, best-match or multiple-field searches (depending upon the pattern of wild card fields within the "search key"). The database search requirements for a variety of applications (including Mobile Access, Broadband Access, Metro-Edge and Backbone-Edge equipments) are then established. The paper continues with a review of the main classes of search algorithm and their theoretical performance/storage requirements. These are compared with the basic operation, performance and storage requirements of Network Search Engines (NSEs)1. The issues involved with table maintenance (i.e. how algorithmic approaches often require significant processing to add/delete table entries) are also evaluated.

The information outlined above is then used to analyse the memory access and storage trade-offs for a range of applications using both algorithmic and NSE based search implementations. The limits of both approaches are demonstrated in terms of memory bus utilisation (taking a QDR bus as the reference for the analysis). The document concludes with recommendations on where Algorithmic/SRAM search methods are appropriate, and where NSEs offer advantages. The document also highlights the main trends in NSE technology that promise streamlined and scalable search solutions for a range of applications.

2. Generic Router Operation Traditional "best effort" routers implement packet forwarding. This will remain a basic router function, common to Core, Edge and Access platforms. The intelligence being built into Edge Routers however requires additional processing, and deeper examination of each incoming packet. Figure 1 shows a high level logical flow diagram of a typical Ingress Router. The reliance on several databases to support each of the main processes is depicted.

[Figure 1 shows the logical ingress architecture: the ingress data channel feeds header and content inspection, a forwarding engine, classification and policy administration, IPSec encrypt/decrypt/sign/authenticate functions, header update, the packet buffer, a traffic manager with QoS queues and finally the switch input. These processes draw on supporting databases (L2 table, L3 table, flow cache, ACLs 1 to n, QoS rules/policy, billing data, IPSec SPD and SAD), which are maintained by a control plane processor running ARP, BGP, RIP & OSPF, RSVP and IKE.]

Figure 1: QoS Enabled Ingress Channel (Logical Architecture)

1 Network Search Engines are commonly and incorrectly referred to as CAMs or TCAMs. Network Search Engines are devices that integrate high performance logic and use TCAM or CAM engines to accelerate database searches in high performance intelligent networking applications.


• Packet forwarding determines the L2 link and L3 network addresses for the next hop in the packet's route. Forwarding information is maintained in tables that can be accessed by the L2 & L3 forwarding addresses. The layer 3 routing table is maintained by BGP (Border Gateway Protocol: Ref_1). Additional protocols to track and tabulate various network metrics are also employed, such as RIP (Routing Information Protocol: Ref_2) and OSPF (Open Shortest Path First). The ARP (Address Resolution Protocol) maintains the layer 2 forwarding table (assuming Ethernet is used for the link layer).

• Packet Classification determines the QoS service group of each packet. This assessment is typically based on a wide range of L3 & L4 header fields, including the network source and destination addresses, the ToS byte, the protocol ID, and the TCP source and destination port numbers. In some circumstances classification can be based on L5-7 information, including Cookies and URLs. This so-called "Content Aware" classification allows individual web pages and sessions to be identified.

• Policy & Admission Control is based on comparing the packet's header characteristics against a series of Access Control Lists (ACLs). These define Access Rules for each packet header combination. Admission control procedures then check that the packet's requested service is met by both the terms of the user's contract (under a billing database) and the status of the network.

• Security is a common router requirement. This can be applied at Layer 3 by the use of IPSec (Ref_3). IPSec uses a "Security Policy Database" (SPD) to define (against header field criteria similar to an ACL) how each packet should be treated by the router's security policy. The Security Association Database (SAD) maintains keys and encryption algorithms relevant to each traffic flow.

• A "flow cache" may also be used to exploit the "flow" nature of packet traffic (i.e. an application is likely to send a series of packets showing the same header characteristics). A cache of recently processed headers (and their associated Forwarding, Classification and Policy data) can therefore be used to speed up router operation.

• Flow Caching may also be used to store packet header characteristics described by the Resource ReSerVation Protocol (RSVP: Ref_4). RSVP informs each router in a packet's path how traffic should be processed for a specific flow. Resources within the router must then be reserved for that session.

• Finally, Figure 1 shows a traffic manager and QoS queues. Having classified the packets, separate streams have to be managed for the different QoS categories.

3. Database Search Requirement As channel line rates in core networks increase from OC-48 to OC-768, core routers will need to process hundreds of millions of packets per second. This creates the need for very fast searches of L2 & L3 forwarding tables. In Access & Edge networks the requirement for fast search operations also becomes significant. Whilst line rates are lower (OC-48 and below), increasing "network intelligence" means that each packet typically requires several database searches. This can include operations for packet forwarding, QoS classification, access control and security.

Forwarding and Classification tasks can be divided between Layer 2 to Layer 7 operations, and further divided between Control and Data/User Plane functionality (see figure 2). A typical router will need to strip back these header layers to determine how each packet should be handled. This process is sometimes referred to as "decapsulation". The process of re-attaching new header fields is also required (i.e. encapsulation).


[Figure 2 maps packet header fields to router functions at each layer. Packet fields range from SDH/PDH frames and ATM/AAL or Ethernet headers (L2 Data Link), through IPv4/IPv6 headers with AH/ESP (L3 Network), TCP/UDP and RTP headers (L4 Transport), up to URLs, cookies, content, application and user-name information (L5-7 Session, Presentation & Application). Data plane processes (forwarding, QoS classification, access control, IPSec, billing) and control plane processes (ARP, BGP, RIP/OSPF, RSVP, SIP/SAP, IPSec IKE) operate on the corresponding layers.]

Figure 2: Router Function & Header Fields

Understanding the encapsulation requirements of a system is central to understanding the operational requirements of the supporting switch/router equipments. Complex forms of encapsulation drive the need for detailed packet inspection (i.e. database searches on several header fields). To demonstrate typical requirements in real equipments, the encapsulation and functional requirements of Mobile and Broadband networks are reviewed below.

Packet Encapsulation in Mobile Networks

The 3GPP have defined GTP (the GPRS Tunnelling Protocol: Ref_5 & Ref_6) for tunneling user plane packets between equipments in the mobile infrastructure (i.e. the GGSN, SGSN and RNC/BSC). The Mobile Network accesses external service providers via the GGSN, setting up sessions for individual users on mobile terminals. In practice a terminal may have several simultaneous sessions towards more than one service provider (application server). This could support applications such as Mobile VPNs. These in turn require different Quality of Service levels. Different tunnels are therefore needed to support different QoS levels between nodes.

Tunneling moves the problem of classification and access control to the edges of the IP network. This follows the general model of "Intelligent Edge, Dumb Core" networking. New sessions are classified once at the edge, a new tunnel is created if needed, and IP packet forwarding is used through the network. The "Edges" in this instance are the GGSN and RNC (for UTRAN traffic), and the GGSN and SGSN (for GERAN traffic). More advanced network architectures may move the Classification function to the Base Station itself.

Figure 3 shows the supporting protocol stack defined by the 3GPP for mobile infrastructure. Note that this relates to the "Iu mode" specified for UTRAN traffic only. The GPRS Tunnelling Protocol for the user plane (GTP-U) supports the tunnelling operation described above. User data is tunnelled between the UTRAN, SGSN, and GGSN. User packets are encapsulated by GTP, which is in turn supported by a UDP/IP layer running over a layer 2 technology such as ATM or Ethernet.


[Figure 3 shows the user-plane protocol stack between the UE-MS, UTRAN, SGSN and GGSN, divided into a radio network layer and a transport network layer. The host's application and TCP/UDP/IP layers are carried over PDCP/RLC/MAC and the physical air interface to the UTRAN, then tunnelled in GTP-U over UDP/IP on L2/L1 transport between the UTRAN, SGSN and GGSN (with relay functions at the UTRAN and SGSN). Beyond the GGSN the traffic may continue over PPP, L2TP and IP towards the external network edge.]

Figure 3: Mobile Network Protocol Stack (Based on Ref_5)

[Figure 4 is a table (extracted from Ref_5) mapping functions to the UTRAN, SGSN, GGSN and HLR. Network Access Control functions: registration, authentication and authorisation, admission/access control, message screening, packet terminal adaptation, charging/accounting data collection, and operator determined barring. Packet Routing & Transfer functions: relay, routing (forwarding), address translation and mapping, encapsulation, tunnelling, compression, and ciphering.]

Figure 4: Mobile Equipment Functionality (Extracted from Ref_5)


Figure 4 provides a useful guide to the functional requirements of the different mobile network equipments. It shows how the tasks of admission (access) control, routing and address mapping (forwarding), encapsulation, tunnelling and charging (billing) relate to each platform. For example, the process of authenticating and authorising each session is performed by the SGSN. This equipment collects billing data and checks that users receive the services to which they are entitled. The UTRAN also needs to monitor the state of the radio interfaces and admit/deny services according to network policies.

UTRAN router equipments require relatively small forwarding tables (several thousand entries rather than the 100K+ entries typical of metro edge routers). This is because the wireless infrastructure is essentially private (i.e. each node only needs to know where to send a packet within the enclosed network). Only the GGSN (which connects to external Service Providers) needs access to a full Internet routing table. These equipments will however need to inspect several layers of the packet's encapsulation to determine the service class of each packet and assign it to the correct tunnel (or block transmission if access is to be denied).

The concept of "Mobile IP" (RFC2002) introduces a further encapsulation layer (not shown in figure 3). User packets are tunneled between a user's "home agent" and "care-of" address. As the care-of address changes each time a Mobile Station moves within the network, tunnel set-up/tear-down will be frequent, requiring highly dynamic table management.

One additional feature of RAN applications is the sensitivity of packet traffic to delays. Where speech packets are being transported over low bandwidth connections, minimum delay must be achieved. This is an important consideration when several database searches need to be implemented in a router platform.

In the control plane, SIP was introduced by the 3GPP Release 5 standard (prior to this, SS7 was specified for call signaling). SIP uses "Layer 7" information to establish a flow capable of supporting a multimedia session (Ref_7). A server platform handling the SIP messaging traffic must therefore perform stateful tracking of the SIP transaction at the start of a flow. Once the flow is established, a Flow Cache entry can be created. Once again this process requires highly dynamic management of forwarding and classification tables.

Finally, no discussion on Mobile Networks is complete without some mention of IPv6. This offers the potential for billions of "always on" terminals using their own IPv6 addresses. The wider address structures (128 bits instead of 32 bits for IPv4) increase the complexity of table management and database search operation. Solutions therefore need to be found for handling much wider search keys, often up to 576 bits (for an examination of multiple header fields).

Packet Encapsulation in Broadband Access Networks

Figure 5a shows a protocol stack typical of broadband access networks used in current generation support for DSL services. In this architecture the "Access Network" consists of the DSLAM and ATM switch. These make no examination of packets beyond the ATM header. PPP (Ref_8) connects individual sessions across an ATM transport network, with applications such as VPNs protected by an IPSec layer that is invisible to the Access Network.

In this scenario, Multiprotocol Encapsulation over AAL5 (Ref_9) is one means of supporting VPN traffic over the ATM network. In environments where only a limited number of Virtual Channels (VCs) are available, Logical Link Control (LLC) encapsulation is specified to keep connections separate. Alternatively, each VPN may utilise an individual VC, assigned either permanently or dynamically. These VC based approaches lead to large, highly dynamic (several thousand updates per second) forwarding tables in DSLAM equipments.

Within the protocol model shown in Figure 5a, functions such as QoS Classification and Access Control are performed by the Broadband Remote Access Server (B-RAS). QoS is supported in the ATM access network only by the service levels assigned to the individual VCs.

Future considerations for Broadband networks include merging B-RAS functionality with the DSLAM (i.e. moving IP network edge processing closer to the end user). This allows greater control over service provisioning, highly granular QoS, and simplified (lower cost) network management. Figure 5b shows one possible protocol stack for supporting a merged Access/Edge network. This introduces L2TP (Ref_10) as a means of extending the PPP session through to the VPN Gateway.


[Figure 5a shows the current protocol stack across the Terminal, CPE-NT, DSLAM, ATM Switch, B-RAS and VPN Gateway. The terminal's application/TCP-UDP/IP traffic, protected end-to-end by IPSec, is carried over Ethernet/HPNA/802.11a/b to the CPE-NT, then over PPP/AAL5/ATM on the xDSL physical layer to the DSLAM. The DSLAM and ATM switch forward ATM cells over SONET/SDH, and the B-RAS terminates the PPP/AAL5/ATM layers before forwarding IP over HDLC/SONET-SDH towards the VPN Gateway.]

Figure 5a: Broadband Network Protocol Stack (Current)

[Figure 5b shows a possible future stack across the Terminal, CPE-NT, DSLAM/ISM, Edge Router and VPN Gateway. The PPP session from the CPE-NT (PPP over AAL5/ATM on the xDSL physical layer) is terminated at the DSLAM/ISM and extended through to the VPN Gateway using L2TP over UDP/IP, carried over Ethernet or HDLC/SONET-SDH between the DSLAM/ISM, Edge Router and VPN Gateway.]

Figure 5b: Broadband Network Protocol Stack (Future)

Search Requirement Summary

Figure 6 summarizes the search rate requirements of networking equipments. It shows the line rates relevant to access, metro and core edge deployments. These are translated into peak packet rates (based on a minimum packet size of 40 bytes for sustained IPv4 messaging traffic). The trends in equipment functionality described above are also highlighted in terms of forwarding, classification, access control and billing operations. These are translated into the estimated number of searches required per packet.
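The packet-rate arithmetic behind Figure 6 is straightforward (a sketch using the 40-byte minimum packet assumed above; the line rates and searches-per-packet values below are illustrative examples, not figures from the chart):

```python
# Rough search-rate arithmetic behind Figure 6 (illustrative values only).
LINE_RATES_GBPS = {"OC-3": 0.155, "OC-12": 0.622, "OC-48": 2.5, "OC-192": 10.0, "OC-768": 40.0}
MIN_PACKET_BYTES = 40  # minimum IPv4 packet assumed in the text

for name, gbps in LINE_RATES_GBPS.items():
    packets_per_sec = gbps * 1e9 / (MIN_PACKET_BYTES * 8)
    for searches_per_packet in (1, 4, 16):
        searches_per_sec = packets_per_sec * searches_per_packet
        print(f"{name}: {packets_per_sec/1e6:6.1f} Mpps, "
              f"{searches_per_packet:2d} searches/pkt -> {searches_per_sec/1e6:7.1f} Msps")
```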

[Figure 6 plots searches per packet (0.5 to 16) against switch/router or line card capacity (155 Mbit/s to 40 Gbit/s, roughly 2 to 128 million packets per second), giving contours of 8, 64 and 500 million searches per second (Msps). Functions accumulate from L2 and L3 forwarding through QoS classification, access control, IPSec, billing/counting and content-aware processing, and market segments (CPE, enterprise switch, public access, broadband access, mobile access & corporate CPE, metro edge routers, core edge routers) are placed on the chart.]

Figure 6: Search Rate Requirements for Core, Edge & Access Networks


4. Review of Database Search Methods Several database search algorithms exist for processors using standard SRAM technology. These are reviewed and compared with the use of NSE technology below.

Direct Index Mapping & Hashing

At its simplest level a "search key" can directly address an SRAM. The search key consists of the packet header fields used to locate each database entry. The memory location addressed by the search key would then contain an index (bit field) that can be used to locate the relevant database information. This approach is fast (an index is obtained in a single memory access), but limited by its inefficient use of memory. As a search key of W bits requires 2^W memory locations, keys much beyond 20 bits are impractical.

Hashing algorithms provide a fast method of indexing large tables with wider keys (even where W equates to several hundred bits). Hashing randomises the search key and performs a search on just the first 16 to 20 bits (possibly using a direct index map). One method of hashing/randomising the search key is to use an encryption algorithm (e.g. DES) with fixed keys.

It is possible that two search keys, when randomised, equate to the same bit sequence for the first ~20 bits. They would therefore "collide" (i.e. generate the same index). A hashing algorithm must therefore include further steps to resolve these cases. Typically this involves a direct comparison of the search key against the two (or more) colliding table entries. Whilst resolving these cases slows down a hashing algorithm, the probability of a clash may be quite low. The performance hit may therefore not be significant.
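A minimal sketch of an exact-match hash lookup with collision resolution by key comparison is shown below; the digest function and table size are illustrative stand-ins rather than the scheme used by any particular NPU:

```python
import hashlib

TABLE_BITS = 16                        # direct-index on the first 16 bits of the digest
BUCKETS = 1 << TABLE_BITS

def bucket(key_bytes: bytes) -> int:
    # Randomise the search key (the text suggests e.g. DES with fixed keys;
    # a cryptographic digest is used here purely as a stand-in).
    digest = hashlib.sha1(key_bytes).digest()
    return int.from_bytes(digest[:2], "big") % BUCKETS

table = [[] for _ in range(BUCKETS)]   # each bucket holds (key, result) pairs

def insert(key: bytes, result):
    table[bucket(key)].append((key, result))

def lookup(key: bytes):
    # One memory access to reach the bucket, then one key comparison
    # per colliding entry to resolve the exact match.
    for stored_key, result in table[bucket(key)]:
        if stored_key == key:
            return result
    return None

insert(b"\x00\x1b\x63\x84\x45\xe6", "port 3")   # a 48-bit MAC address
print(lookup(b"\x00\x1b\x63\x84\x45\xe6"))      # -> port 3
```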

Of greater significance is the fact that hashing is limited to "exact match" operations. This means that every bit in the search key must match the search criteria. This is suitable for some router operations, but by no means all. Layer 3 route tables need to support Longest Prefix Match (LPM - see appendix A1). In addition, Classification, Access Control and Security Policy Databases invariably use wild cards. These introduce numerous "don't care" bits to the search key.

Binary (& Patricia) Trie Algorithms

The Binary Search Retrieval algorithm (Figure 7a) allows wild cards to be set for the lower order bits of a search key. A bit-by-bit examination of an incoming search key (e.g. a forwarding address) is performed. Left and right pointers are accessed during each bit examination. These locate the next node in the search. When the search reaches the wild-carded (don't care) portion of the key, the most recent prefix pointer encountered provides the table index. This is referred to as a "best match" search, as multiple pointers that relate to shorter prefixes may be identified along the path. This is demonstrated in Figure 7a by the identification of both P3 and P4 as matching the incoming search key. P3 has the longest prefix and is therefore the "best match".

Binary search tries are memory efficient, but slow. A memory location has to be accessed for each bit in the search key. As most tables are too large to be stored in on-chip cache memory, these accesses typically have to be driven across an external CPU-memory bus. Search performance could be gained by increasing the number of SRAM bus interfaces (to allow parallel examination of multiple search keys). This however adds cost (through pin count) and significant design complexity (through interleaved memory accesses).

The Patricia Trie is a variant of the standard Binary Trie algorithm that reduces the number of trie search nodes (and therefore reduces memory storage requirements even further). Patricia Trie nodes (denoted by A & B in Figure 7b) store "bit position" values, as well as left and right pointers. The bit position refers to the next bit in the search key to be examined. This allows the search operation to skip bit positions (going straight to the bit denoted by the "bit position" field). This eliminates nodes where no decision is effectively made (for example D & E in Figure 7a). The drawback of this approach is that the nodes at the prefix pointers (P1, P2 etc.) have to store the full prefix so that a match can be verified. The search function is also slowed down by having to backtrack through the trie structure. In Figure 7b the search for key 11011 would initially reach P6, identify NO match at P6, then go back to Node B, search P5, identify NO match at P5, and then finally return to P4 before deciding that this value offers the longest prefix match. This approach is memory efficient, but can under worst case conditions be very slow.
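The best-match walk described above can be sketched as follows, using the prefix table of Figure 7 (a simple bit-by-bit binary trie, not the Patricia variant):

```python
class Node:
    def __init__(self):
        self.children = {"0": None, "1": None}
        self.prefix = None            # set if a prefix pointer terminates here

root = Node()

def insert(prefix: str, label: str):
    node = root
    for bit in prefix:
        if node.children[bit] is None:
            node.children[bit] = Node()
        node = node.children[bit]
    node.prefix = label

def longest_prefix_match(key: str):
    node, best = root, None
    for bit in key:
        node = node.children[bit]
        if node is None:
            break
        if node.prefix is not None:   # remember the most recent (longest) match
            best = node.prefix
    return best

# Prefix table from Figure 7
for p, label in [("0110", "P1"), ("01", "P2"), ("101", "P3"),
                 ("1", "P4"), ("1110", "P5"), ("11100", "P6")]:
    insert(p, label)

print(longest_prefix_match("10111"))   # -> P3 (as in Figure 7a)
print(longest_prefix_match("11011"))   # -> P4 (as in Figure 7b)
```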


[Figure 7a shows a binary trie built from the prefix table P1 = 0110*, P2 = 01*, P3 = 101*, P4 = 1*, P5 = 1110*, P6 = 11100*. Trie nodes hold left (0) and right (1) pointers, with prefix pointers attached where entries terminate. The example lookup of key 10111 passes the best matches P4 and then P3, returning P3.]
Figure 7a: Binary Retrieval2 Algorithm

[Figure 7b shows the same prefix table (P1 = 0110*, P2 = 01*, P3 = 101*, P4 = 1*, P5 = 1110*, P6 = 11100*) coded as a Patricia trie. Internal nodes A and B store the bit position to examine next (4 and 5 respectively) as well as left and right pointers, and each prefix pointer stores its full prefix for verification. The example lookup of key 11011 reaches P6, fails verification, backtracks through P5 and finally matches P4.]
Figure 7b: Patricia Retrieval Algorithm

2 The term "trie" is derived from the word "retrieval".


Multi-Level Trie Algorithms

Multiple Level Trie algorithms are a common means of increasing search performance. These examine multiple bits of the search key in a single cycle. Each node stores pointers for all the bit combinations examined, i.e. if �K� bits are examined per cycle, 2K pointers are required for each node (note that K is sometimes referred to as the �stride factor�).

This speeds up search performance, but involves a geometric increase in the amount of SRAM required to store node data. This can be an expensive way of improving search performance. A variable stride factor may also be deployed. Figure 8 shows the memory allocations required for a 16, 8, 8 variable stride Trie search. This is a common configuration for IPv4 forwarding. Figure 9 details the amount of memory this requires.

[Figure 8 shows a (16, 8, 8) variable stride trie for a 32-bit search key. The single level 0 root node holds 64K pointers (K0 = 16), each level 1 node holds 256 pointers (K1 = 8), each level 2 node holds 256 pointers (K2 = 8), and the final pointers index an associative data memory holding the next hop data.]

Figure 8: Variable Stride Trie Structure (16, 8, 8)

Note: the variable stride factors K0, K1, and K2 describe the number of bits of the 32 bit search key examined at each level (or cycle) of the search.

The number of pointers needed per node = 2^K0 for level 0, 2^K1 for level 1, and 2^K2 for level 2. At level zero there is only a single (root) node, so it makes sense to make this as large as possible (e.g. 2^16).

The number of nodes at each level is determined by the number of pointers in the previous level (the next closest level to the root). This is shown as the number of "geometric nodes" in figure 9 (due to its rapid expansion at each level). In practice however, the number of nodes at each level will not exceed the number of table entries (N). With small tables N determines the overall size of the Trie table. For comparison a similar calculation is provided for a 16, 4, 4, 4, 4 structure in figure 10.
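The storage arithmetic of Figures 9 and 10 can be reproduced with a few lines of code (a sketch assuming 32-bit trie words and the number of nodes per level capped at the number of table entries, as in the tables):

```python
def trie_storage_mbits(strides, table_entries, word_bits=32):
    """Storage for a multi-level trie: each node holds 2**stride pointers;
    the number of nodes per level is the geometric fan-out, capped at N."""
    total_words = 0
    geometric_nodes = 1                   # a single root node at level 0
    for stride in strides:
        actual_nodes = min(geometric_nodes, table_entries)
        total_words += actual_nodes * (2 ** stride)
        geometric_nodes *= 2 ** stride    # uncapped fan-out into the next level
    return total_words * word_bits / 1e6

print(trie_storage_mbits((16, 8, 8), 4096))         # ~69.2 Mbits (Figure 9)
print(trie_storage_mbits((16, 4, 4, 4, 4), 4096))   # ~10.5 Mbits (Figure 10)
```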


Detailed Calculation: No. of Table Entries = 4,096; Cycles = 3

                     K0        K1          K2
Stride Factor        16        8           8
Cumulative Stride    16        24          32
Node Size            65,536    256         256
Geometric Nodes      1         65,536      16,777,216
Actual Nodes         1         4,096       4,096
Words                65,536    1,048,576   1,048,576
Words Used           65,536    1,048,576   1,048,576
Word Size            32        32          32
Mbits Used           2.1       33.6        33.6
Total Mbits          2.1       35.7        69.2

Figure 9: Detailed Storage Calculation for (16, 8, 8) Trie Structure

Detailed Calculation: No. of Table Entries = 4,096; Width = 32; Cycles = 5

                     K0        K1         K2         K3          K4
Strides              1         1          1          1           1
Stride Factor        16        4          4          4           4
Cumulative Stride    16        20         24         28          32
Node Size            65,536    16         16         16          16
Geometric Nodes      1         65,536     1,048,576  16,777,216  268,435,456
Actual Nodes         1         4,096      4,096      4,096       4,096
Words                65,536    65,536     65,536     65,536      65,536
Words Used           65,536    65,536     65,536     65,536      65,536
Word Size            32        32         32         32          32
Mbits Used           2.1       2.1        2.1        2.1         2.1
Total Mbits          2.1       4.2        6.3        8.4         10.5

Figure 10: Detailed Storage Calculation for (16, 4, 4, 4, 4) Trie Structure

Trie Additions/Deletions (Maintenance)

If an IPv4 address is added with a /12 prefix (see appendix A1 for discussion on CIDR), then all entries in the root node identified by the first 12 bits of the IP address need to be assigned pointers. This equates to 2^(16-12) = 16 locations. On this basis a /8 prefix requires 256 location updates and a /16 prefix just 1 entry update (i.e. this occurs on the boundary of level 0). Data from the MAE-EAST routing table (www.merit.edu) shows that prefixes shorter than /8 are not used, so the possibility of a /7 prefix requiring 512 location updates (or worse) never occurs.
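The number of root-node locations touched per update follows directly from the level-0 stride (a sketch assuming the 16-bit root stride of the 16, 8, 8 structure described above):

```python
ROOT_STRIDE = 16   # bits examined at level 0 of the (16, 8, 8) trie

def root_updates(prefix_len: int) -> int:
    """Root-node locations written when a prefix no longer than the root
    stride is added or deleted; longer prefixes touch a single root entry."""
    return 2 ** (ROOT_STRIDE - min(prefix_len, ROOT_STRIDE))

for plen in (8, 12, 16, 24):
    print(f"/{plen}: {root_updates(plen)} root-node update(s)")
# /8: 256, /12: 16, /16: 1, /24: 1 (plus updates within the level 1 node)
```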

For the 16, 8, 8 structure shown previously, prefixes between /16 and /24 will require a single update to the root node, and between 1 and 256 updates to the relevant level 1 node. Likewise prefixes between /25 and /32 require single updates to the root and relevant level 1 node, and between 0 and 255 updates to the level 2 node.

Care should also be taken not to overwrite pointers relating to existing table entries (with different prefixes). A memory map for the level 0 (root) node of the 16, 8, 8 trie is shown in figure 11. The location relating to 128.65.122.107/16 has to store two prefix pointers (for both the /16 and /8 entries). Simply deleting the /8 pointer might seem attractive, as this value will never be returned (assuming Longest Prefix Match). This cannot be done, however, as the /16 entry may be deleted at some time in the future, which would leave the /8 entry as the required search result. Each location must therefore be able to store separate pointers for all /8 to /16 possibilities.


[Figure 11 shows the level 0 (root) node memory map, spanning locations 00000000 00000000 to 11111111 11111111. The location 10000000 01000001 must hold prefix pointers for both the /8 entry (128.65.155.107/8) and the /16 entry (128.65.122.107/16).]

Figure 11: Memory Map for Level 0 (Root) Node (16 bit stride factor)

In relatively static networks 256 memory updates per table addition/deletion may not be an issue. BGP additions typically run at around 3K entries per second. This translates to just 6M memory accesses per second (~2% of a QDR bus running at 250MHz). For other more dynamic applications table updates may be an issue. The allocation of VLAN traffic to Switched VCs (Ref_9), or tunnel set-up/teardown in Mobile Networks, may result in 100K+ table updates per second. This translates to more than 25% of a QDR bus's capacity.

In summary, the use of large stride factors (e.g. 16, 8, 8) provides a 3 stage look-up for an IPv4 address, compared to the 5 cycles required for a 16, 4, 4, 4, 4 structure. This speed improvement is however purchased at the expense of a seven-fold increase in the SRAM storage requirements (70Mbits instead of 10Mbits), and a 16-fold increase in maintenance cycles (256 locations compared to just 16).

Multiple Field Search Algorithms

Trie algorithms are further complicated when applied to the multiple field searches demanded by QoS Classification, Access Control and IPSec Policy administration. Multiple Field searches involve the placement of wild card (don't care) conditions across more than one part of the search key. This is inherently more complex than a single field search where the wild cards are limited to the lower order bits of the search key (as with LPM). This complexity is demonstrated below.

Figure 12 shows a "Hierarchical Search Algorithm" operating on just two fields. A set of rules made up of two fields F1 and F2 are coded into the trie structure. The algorithm scans through the Field 1 nodes (marked F1) to identify those that involve a jump to an F2 node (denoted by the arrows). A jump to an F2 node may (in theory) occur for every bit in F1. This means that an F2 node structure may be searched for every bit in F1.

Assuming: W1 = # bits in F1

W2 = # bits in F2

Then Maximum # of Cycles = W1*W2

Note that if the wild card bits are limited to a single field of W1+W2 bits, the total number of cycles equates to just W1+W2. The introduction of two wild card fields to the search key increases the number of cycles to W1*W2. This is a considerable increase in search complexity, which continues to grow as further fields are introduced to the search key (i.e. W1*W2*W3 ...).


[Figure 12 shows a hierarchical trie for a two-field rule set. Rules R1 to R5 are each defined by a pair of prefixes on fields F1 and F2; F1 trie nodes branch on the F1 bits and jump, where a rule exists, into separate F2 tries. The example lookup of key (110, 00010) is traced through the F1 trie and the associated F2 tries to the matching rules.]

Figure 12: Hierarchical Search for Multiple Fields

A 5-tuple classification may involve several search fields (# cycles: W1 x W2 x W3 x W4 x W5). The theoretical worst-case search time may therefore run into millions of memory accesses per classification. Alternative algorithms reduce search time, but succeed only in moving the "complexity" to excessive storage.
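For example, taking the usual IPv4 5-tuple field widths (32-bit source and destination addresses, 16-bit source and destination ports, and an 8-bit protocol field):

```python
# Worst-case hierarchical-trie cycles for an IPv4 5-tuple
# (source IP, destination IP, source port, destination port, protocol).
field_widths = (32, 32, 16, 16, 8)
cycles = 1
for w in field_widths:
    cycles *= w
print(cycles)   # 2,097,152 memory accesses in the theoretical worst case
```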

Heuristic Algorithms

More efficient multiple field search algorithms are currently being researched. Typically these reduce storage and/or access cycles by limiting search key configurations. These techniques require a thorough understanding of the classification and/or access rules used in an application, and often use pre-calculated tables. An example of one such algorithm is described in figure 13.

The IPv4 5-tuple search key is broken down into 16 (or 8) bit chunks. Variations in each chunk are represented by a 2, 3 or 4 bit "Equivalency ID". The functions f1, f2 and f3, and the indexing tables, can therefore be configured to produce a unique index for each rule that needs to be encoded. These techniques are used in the "Recursive Flow Classification" algorithm (Ref_11).

In certain conditions this technique can offer fast classification for relatively small storage. It is however limited by the number of rules that can be encoded (only a handful of variations in each 16 bit chunk). The number of rules and their format therefore has to be closely examined for each application. Further work is also required on how IPv6 5-tuples can be managed. A further limitation is the significant pre-processing of the tables and functions (potentially thousands of cycles per table addition). This limits the approach to static tables. Dynamic table management (typical of many next generation equipments) requires fast table updates.
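The sketch below illustrates the chunk-and-combine idea behind Figure 13 on a deliberately tiny example (two 8-bit chunks and three made-up rules); it is not Gupta's Recursive Flow Classification implementation, only a minimal model of the pre-computed equivalence-ID tables and combining table:

```python
from itertools import product

# Illustrative 16-bit keys split into two 8-bit chunks; each rule gives a
# (value, mask) pair per chunk (mask bit 1 = compare, 0 = don't care).
RULES = [
    ("R1", [(0x0A, 0xFF), (0x00, 0x00)]),   # chunk0 == 0x0A, chunk1 wildcard
    ("R2", [(0x0A, 0xFF), (0x50, 0xF0)]),   # chunk0 == 0x0A, chunk1 in 0x50-0x5F
    ("R3", [(0x00, 0x00), (0x50, 0xF0)]),   # chunk0 wildcard, chunk1 in 0x50-0x5F
]

def matching_rules(chunk_idx, value):
    return frozenset(name for name, fields in RULES
                     if (value & fields[chunk_idx][1]) == fields[chunk_idx][0])

# Pre-processing: one direct-index table per chunk mapping value -> equivalence ID.
chunk_tables, id_to_set = [], []
for c in range(2):
    sets, table = {}, []
    for v in range(256):
        s = matching_rules(c, v)
        table.append(sets.setdefault(s, len(sets)))
    chunk_tables.append(table)
    id_to_set.append({eqid: s for s, eqid in sets.items()})

# Pre-processing: combining table over (ID, ID) pairs -> first matching rule.
combine = {}
for i0, i1 in product(id_to_set[0], id_to_set[1]):
    live = id_to_set[0][i0] & id_to_set[1][i1]
    combine[(i0, i1)] = next((n for n, _ in RULES if n in live), None)

def classify(key16):
    c0, c1 = (key16 >> 8) & 0xFF, key16 & 0xFF
    return combine[(chunk_tables[0][c0], chunk_tables[1][c1])]  # 3 table reads

print(classify(0x0A55))   # -> R1 (first rule in priority order that matches)
print(classify(0x0B55))   # -> R3
print(classify(0x0B66))   # -> None
```

As the text notes, the costly part is the pre-processing: adding or deleting a rule means rebuilding the equivalence tables, which limits the approach to static rule sets.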


[Figure 13 shows the IPv4 5-tuple (W = 104 bits) split into six 16-bit chunks and one 8-bit chunk. Each chunk directly indexes a 2^16-word table that returns a 2- or 3-bit Equivalency ID; functions f1, f2 and f3 then combine the IDs through further direct-index tables to produce the final index. Number of chunks ~ W/16; storage (words) ~ (2^16)*W/16; search time ~ W/16 memory accesses.]

Figure 13: Heuristic Classification Algorithm

Network Search Engines

Network Search Engines (NSEs) are massively parallel database search devices. Executing database searches and constructing databases in a Network Search Engine differ considerably from the previously discussed methods. Sophisticated Network Search Engines can implement multiple databases with different application-specific configurations. An example of various database configurations is shown in Figure 14.

[Figure 14 shows a single Network Search Engine partitioned into example databases: 256K entries x 36 bits wide, 128K entries x 72 bits wide, 64K entries x 144 bits wide, and 16K entries x 576 bits wide.]

Figure 14: Example Database Configurations


Each database contains a number of entries (depth) and each entry in a single database has a fixed number of bits (width). For example, a database with 256K entries, whose entries each contain 36 bits, is said to be 256K entries deep and 36-bits wide. In "binary Network Search Engine" designs, each bit in each entry can have one of two states, 0 or 1. In more contemporary "ternary Network Search Engine" designs, each bit in each entry can have one of three states, 0, 1 or X ("don't care"). The type of application using the database normally dictates "binary" or "ternary" database requirements.

Database Example

Address   Database Entry
3         1 1 0 1 1 0 0 0
4         0 X 1 0 1 0 0 1
5         1 0 1 0 X 0 1 0
6         1 1 0 X 1 0 1 1
7         1 X 1 1 1 1 0 0
8         0 1 1 0 1 1 0 1
9         1 0 0 X 1 1 1 0

Search Key: 1 X 0 1 1 0 0 0   ->   Search Result: address 3

Figure 15: Database Search Example

Database entries, as shown in Figure 15, are implemented in a flat database structure, and not in a hierarchical or tree structure as in some previously discussed algorithms. Additionally, when this database is searched, the entire database is searched simultaneously. Figure 15 also shows a search key, which in ternary Network Search Engines can also contain bits set to 0, 1 and X. The Network Search Engine simultaneously compares the search key with every entry in the database, resulting in an extremely high performance search. The search result corresponds to the address of the matching database entry. Search performance for high performance Network Search Engines can effectively reach one search per clock cycle. Note that in general, binary Network Search Engines use search keys and database entries that contain two states (0 or 1), while ternary Network Search Engines use search keys and database entries that contain three states (0, 1 or X). Executing searches on a Network Search Engine results in high performance due to the massively parallel comparison of the key with selected database entries. Executing searches is also computationally simple due to the fact that no tree walking or collision detection is required.
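A software model of the ternary compare in Figure 15 is sketched below (in a real NSE every entry is compared in the same clock cycle; the loop here only mimics the outcome, with the lowest matching address taking priority):

```python
# Software model of the ternary compare in Figure 15. Entries and the search
# key are strings of '0', '1' and 'X'.
DATABASE = {
    3: "11011000",
    4: "0X101001",
    5: "1010X010",
    6: "110X1011",
    7: "1X111100",
    8: "01101101",
    9: "100X1110",
}

def bit_match(entry_bit, key_bit):
    return entry_bit == "X" or key_bit == "X" or entry_bit == key_bit

def nse_search(key):
    for address in sorted(DATABASE):              # lowest address wins
        entry = DATABASE[address]
        if all(bit_match(e, k) for e, k in zip(entry, key)):
            return address
    return None                                   # no match

print(nse_search("1X011000"))   # -> 3, as in Figure 15
```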


Algorithm versus NSE Summary

It is useful to compare the various database search algorithms, with respect to the number of access cycles per search and the amount of storage required per database. The access cycles indicate bus utilization that can limit system performance, while storage indicates the compactness of the database.

Application                               Algorithm            RAM Access Cycles per Search   Storage (Cells)

L2 Ethernet, Flow Caching, IPSec SAD      Direct Index         1                              32*2^W
                                          Hash                 Hash + 1, 2, 3 ...             N*(W + 96)
                                          Binary NSE           W/32 + 1                       N*W*10/6

L2 ATM VPI/VCI, L3 IPv4/6 (LPM)           Patricia Trie        W*W                            32*N
                                          Multi-ary Trie       W/K                            32*N*W*2^(K-1)
                                          Ternary NSE          W/32 + 1                       N*W

QoS Classification, Access Control List,  Hierarchical Trie    W^d                            d*W*N
IPSec SPD                                 Heuristic            W/16 + 4                       8*(2^16)*W/16
                                          NSE with Re-Issue*   W/32 + 1 x (#ACL)              N*W

• W = Width of Search Key
• N = Number of search table entries
• K = Trie Search Stride Factor
• d = Number of search key fields
• #ACL = Number of access control lists

Figure 16: Search Method Comparison Table

* Re-Issue is a special function built into advanced IDT NSEs. This allows multiple Access Control Lists to be searched without reloading the search key. This can be significant for IPv6, where the search keys can consist of several hundred bits.
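As an illustration only, the Figure 16 expressions can be evaluated directly; the key width and table size below are example values, and the cycle count quoted for hashing is a representative figure rather than a bound:

```python
import math

def cycles_and_storage(method, W, N, K=4, num_acl=1, d=2):
    """Access cycles per search and storage (cells) per the Figure 16 formulas."""
    if method == "direct_index":   return 1, 32 * 2 ** W
    if method == "hash":           return 2, N * (W + 96)        # "hash + 1, 2, 3 ..."
    if method == "binary_nse":     return W // 32 + 1, N * W * 10 // 6
    if method == "multi_ary_trie": return math.ceil(W / K), 32 * N * W * 2 ** (K - 1)
    if method == "ternary_nse":    return W // 32 + 1, N * W
    if method == "heuristic":      return W // 16 + 4, 8 * (2 ** 16) * W // 16
    if method == "nse_reissue":    return W // 32 + num_acl, N * W
    raise ValueError(method)

# Example: a 128-bit IPv6 LPM table with 4K entries
for m in ("multi_ary_trie", "ternary_nse"):
    cycles, cells = cycles_and_storage(m, W=128, N=4096)
    print(f"{m:15s}: {cycles:3d} cycles/search, {cells/1e6:7.1f} M cells")
```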


5. Search Method Performance Analysis This section models a router platform with relatively small table dimensions, typical of the access network. This could support an IP Base Station, RNC, DSLAM, or Low End Multiple Service platform. A mixture of IPv6 and IPv4 traffic is assumed, with the worst case conditions determined by IPv6. A mixture of table requirements is assumed (including ATM and Ethernet forwarding, IP forwarding, Classification, Access Control and IPSec). The model calculates the total database storage (in terms of either 18M SRAM devices or 4M Network Search Engines). QDR bus access cycles are calculated to support the overall performance of the router. Calculations are made for both algorithmic and NSE approaches, using the equations shown in Figure 16. The performance limit analysed is that of the QDR bus against a range of aggregate/line bit rates. 100% QDR bus efficiency is assumed.

The 10G line rate would be more typical of a Metro/Core router. These would require larger tables, and therefore might not reach the performance limits shown below. An additional constraint that may also reduce the effective performance limit is that of table maintenance.
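As a rough cross-check of the bus arithmetic used in Figures 17 to 19 (a sketch assuming the 66-byte minimum IPv6 packet and the cumulative cycles-per-packet figures from the tables, with 100% bus efficiency as stated above):

```python
QDR_LIMIT_MHZ = 250          # million accesses/s at 100% bus efficiency

def qdr_maccesses(line_rate_gbps, min_pkt_bytes, cycles_per_packet):
    """Million QDR bus accesses per second implied by a traffic level."""
    packets_per_sec = line_rate_gbps * 1e9 / (min_pkt_bytes * 8)
    return packets_per_sec * cycles_per_packet / 1e6

# IPv6 forwarding chain in the K=4 SRAM model: ~39 access cycles per packet
for rate in (0.622, 2.5, 5.0, 10.0):
    m = qdr_maccesses(rate, min_pkt_bytes=66, cycles_per_packet=39)
    flag = "OK" if m <= QDR_LIMIT_MHZ else "exceeds QDR limit"
    print(f"{rate:6.3f} Gbit/s -> {m:6.0f} M accesses/s ({flag})")
```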

Low cost SRAM Configuration

Figure 17 indicates that at 2.5G line/switch capacity, SRAM offers a viable low cost solution for forwarding. Even at this relatively low aggregate traffic rate, however, access control and IPSec are beyond the scope of SRAM based algorithms. QoS Classification is just feasible assuming that a heuristic type of algorithm can be applied. This may not be the case, as these techniques are limited to relatively small tables (a few hundred entries) and static table maintenance.

Application           Entries N  Width W  Search   Access Cycles     Min Pkt  Million QDR Bus Accesses/s       M Total   18M SRAM
                      (K)                 Method   Each   Packet     (Bytes)  0.6G   2.5G   5.0G   10.0G       DB Cells  Devices
IPSec SAD             12.0       160      TRIE     40     192        66       226    909    1818   3636        232       13
IPSec SPD             3.0        296      HEUR     19     152        66       179    720    1439   2879        109       7
ACL 2, 3, 4 ...       3.0        888      HEUR     56     133        66       157    630    1259   2519        99        6
ACL 1                 1.0        296      HEUR     19     77         66       91     365    729    1458        70        4
QoS Classification    2.0        296      HEUR     19     58         66       68     275    549    1098        60        4
L3 IPv6 Forwarding    4.1        128      TRIE     32     39         66       46     185    369    739         50        3
L3 IPv4 Forwarding    4.1        32       TRIE     8      15         40       29     117    234    469         16        1
L2 Ethernet Switch    4.1        48       HASH     4      4          40       8      31     63     125         8         1
L2 ATM Switch         4.1        28       TRIE     7      7          53       10     41     83     165         7         1

Solution: SRAM, K = 4, bus width 32, 18M cells per device. Max bus IO: 250 / 333 MHz.

Figure 17: Low cost SRAM Configuration3

This table also analyses throughput only. The latency on an individual packet may be determined by a worst case sequence of events that includes a missing entry in the forwarding table, a subsequent table update, and further searches of the Classification and Access Control tables.

3 The term "Bus Width" refers to the bit width of the data transfer for each read/write cycle. This assumes a QDR burst of 2 transfers across a 16 bit bus. As trie searches require a read cycle to complete before the next operation can commence, the burst capability of the QDR bus does not speed up the overall performance (as it requires sequential addressing). The later analysis of the NSE approach shows that most of the cycles consumed involve the transfer of key data to the NSE. This can make use of the burst write capability.

Key to the shading used in the original figure: within QDR limit (250 MHz); within QDR limit (333 MHz); exceeds QDR limit (333 MHz).


A further reduction in SRAM usage could be achieved with a Patricia Trie approach. This would however result in considerably lower speed performance, as the backtracking effects of the search algorithm would significantly increase the SRAM access cycle requirements.

Increased Performance SRAM Configuration

Figure 18 shows that some performance gains can be achieved by increasing the Trie algorithm stride factor. These minimal improvements are however achieved by a large increase in the amount of SRAM used to store the search tables.

Application           Entries N  Width W  Search   Access Cycles     Min Pkt  Million QDR Bus Accesses/s       M Total   18M SRAM
                      (K)                 Method   Each   Packet     (Bytes)  0.6G   2.5G   5.0G   10.0G       DB Cells  Devices
IPSec SAD             12.0       160      TRIE     20     153        66       180    722    1444   2888        1437      80
IPSec SPD             3.0        296      HEUR     19     133        66       156    627    1255   2509        454       26
ACL 2, 3, 4 ...       3.0        888      HEUR     56     114        66       134    537    1075   2150        444       25
ACL 1                 1.0        296      HEUR     19     58         66       68     272    545    1089        415       24
QoS Classification    2.0        296      HEUR     19     39         66       45     182    365    729         405       23
L3 IPv6 Forwarding    4.1        128      TRIE     16     20         66       23     92     185    369         395       22
L3 IPv4 Forwarding    4.1        32       TRIE     4      8          40       16     63     125    250         126       8
L2 Ethernet Switch    4.1        48       HASH     4      4          40       8      31     63     125         59        4
L2 ATM Switch         4.1        28       TRIE     4      4          53       5      21     41     83          59        4

Solution: SRAM, K = 8, bus width 32, 18M cells per device. Max bus IO: 250 / 333 MHz.

Figure 18: Increased Performance SRAM Configuration

NSE Implementation

Figure 19 shows how an NSE alleviates the bottleneck associated with the QDR bus. These performance figures are based on IDT's family of Network Search Engines. These offer advanced features that improve bus and packet processor efficiency when compared to other search engines. A further bottleneck occurs in systems attempting deep packet classification at 10G (OC192).

Application           Entries N  Width W  Search   Access Cycles     Min Pkt  Million QDR Bus Accesses/s       M Total   NSE
                      (K)                 Method   Each   Packet     (Bytes)  0.6G   2.5G   5.0G   10.0G       DB Cells  Devices
IPSec SAD             12.0       160      NSE      6      36         66       42     170    341    682         7.3       2
IPSec SPD             3.0        296      NSE      1      30         66       35     142    284    568         5.4       2
ACL 2, 3, 4 ...       3.0        888      NSE      3      29         66       34     137    275    549         4.5       1
ACL 1                 1.0        296      NSE      10     26         66       31     123    246    492         1.9       1
QoS Classification    2.0        296      NSE      10     16         66       19     76     152    303         1.6       1
L3 IPv6 Forwarding    4.1        128      NSE      5      6          66       7      28     57     114         1.0       1
L3 IPv4 Forwarding    4.1        32       NSE      2      4          40       8      31     63     125         0.4       1
L2 Ethernet Switch    4.1        48       NSE      2      2          40       4      16     31     63          0.3       1
L2 ATM Switch         4.1        28       NSE      1      1          53       1      6      12     24          0.1       1

Solution: NSE, K = N/A, bus width 32, 4.6M cells per device. Max bus IO: 250 / 333 MHz.

Figure 19: NSE Configuration using IDT Network Search Engine


SRAM-NSE Performance Summary

Figure 20 translates the QDR bus limits described previously onto the search requirements of different applications. It should be noted that the bus limit may be lower than that shown if larger tables are used, latency becomes a limiting factor, and/or highly dynamic table updates need to be sustained.

[Figure 20 repeats the axes of Figure 6 (searches per packet against switch/router or line card capacity, with the 8, 64 and 500 Msps contours and the market segments from CPE and enterprise switch through broadband/mobile access to metro and core edge routers) and overlays two limit lines: the QDR bus limit for SRAM-based searching at 250 MHz memory access, and the much higher QDR bus limit for an NSE at 250 MHz access.]

Figure 20: Search Performance Limits of QDR SRAM and NSE4

This analysis indicates that IPv6 places considerable strain on all trie based table searches, and that the Multiple Field search algorithms for Access Control and Classification pose significant issues (whether implementing IPv4 or IPv6). Even at relatively low aggregate traffic levels the QDR bus will become a limiting factor for these functions.

The other main conclusion, indicated by the SRAM/NSE device and cell usage columns, is that "transistor for transistor" NSEs (based on TCAM technology) are very efficient at performing search functions. The absolute minimum SRAM storage requirement (ignoring performance) is to store at least the search key and associated mask data. This requires 2 SRAM cells (12 transistors) per bit of the search key. An NSE cell stores both data and mask bits, and only introduces an additional 4 transistors to perform the comparison function (an overhead of just 33%). This compares extremely well with the geometric increases in storage needed to increase the performance of algorithmic methods. In applications that go above the QDR SRAM bus limit shown above, this outweighs the advantages of using "commodity SRAM".

4 NSE refers to IDT's Network Search Engine family with dedicated QDR/LA-1 interface.


6. References

1. A Border Gateway Protocol 4 (BGP-4): RFC 1771.
2. RIP Version 2: RFC 1723.
3. Security Architecture for the Internet Protocol: RFC 2401.
4. Resource ReSerVation Protocol (RSVP): RFC 2205.
5. GPRS Service Description, Stage 2 (Release 5): 3GPP TS 23.060 V5.2.0 (2002-06).
6. GPRS Tunneling Protocol (GTP) across the Gn and Gp Interface (Release 5): 3GPP TS 29.060.
7. SIP: Session Initiation Protocol: RFC 2543.
8. The Point-to-Point Protocol (PPP): RFC 1661.
9. Multiprotocol Encapsulation over ATM Adaptation Layer 5: RFC 2684.
10. Layer Two Tunneling Protocol "L2TP": RFC 2661.
11. Algorithms for Routing Lookups and Packet Classification: Pankaj Gupta, December 2000.
12. PATRICIA - Practical Algorithm to Retrieve Information Coded in Alphanumeric: Morrison, Journal of the ACM, 15(4), pp. 514-534, October 1968.


A1 Appendix: CIDR & Longest Prefix Match CIDR was introduced to ease the shortage of IPv4 address space (as a "stop gap" until the deployment of IPv6). RFC1519 defines its operation. CIDR uses an address prefix to mask off lower order IPv4 address bits. This creates sub-nets that can be identified from just the higher order bits. The example below, in Figure A1, shows how the top 21 bits of address 197.106.104.0/21 locate nodes 104 to 111.
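The masking and longest prefix match choice illustrated in Figure A1 can be checked with a few lines (a sketch using Python's standard ipaddress module; the host address is an arbitrary example):

```python
import ipaddress

subnet_21 = ipaddress.ip_network("197.106.104.0/21")
subnet_24 = ipaddress.ip_network("197.106.107.0/24")

# The /21 masks the low 11 bits, covering third-octet values 104 to 111
print(subnet_21[0], "-", subnet_21[-1])      # 197.106.104.0 - 197.106.111.255

# A host in the re-homed 107 subnet matches both entries;
# longest prefix match selects the more specific /24 route.
addr = ipaddress.ip_address("197.106.107.25")
matches = [n for n in (subnet_21, subnet_24) if addr in n]
best = max(matches, key=lambda n: n.prefixlen)
print(best)                                  # 197.106.107.0/24
```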

[Figure A1 illustrates the prefix masking operation and longest prefix match. The entry 197.106.104.0/21 (21-bit prefix, 11 don't-care bits) covers nodes 104 to 111; when node 107 re-homes, a more specific entry 197.106.107.0/24 (24-bit prefix, 8 don't-care bits) is added and takes the highest priority, while the /21 entry retains the lowest priority.]

Figure A1: CIDR & LPM

Whilst CIDR allows several L3 forwarding table entries to be compressed to a single entry, it also introduces the problem of LPM (Longest Prefix Match). This occurs when a node leaves its subnet and reconnects at a different point in the Internet. This is shown above, where "107" connects to a different point, and therefore requires a separate routing table entry. This results in a multiple match, as the database search function will still match the previous sub-net conditions. The match with the Longest Prefix must therefore be determined.

IDT's NSE family supports "positional priority" or "weighted" coding, where entries are loaded into blocks of common prefix length. The blocks are positioned according to the prefix length, so that a decoder can identify the address with the highest value as being the correct entry. As prefixes may vary from /8 to /29, up to 21 entry blocks need to be reserved for this. It is likely that these blocks may on occasion fill up, and a memory reshuffle then becomes necessary.