
LTE Performance/4G Solutions

Generic Release

Alcatel-Lucent Proprietary

Subject: LTE Troubleshooting Guide - Generic Release
Date: June 21, 2012
LTE Performance Department, Murray Hill, NJ

ENGINEER’S NOTES

Data throughput problems have been reported since LTE deployment in December. The customer is complaining about slow data and/or video downloading, slow web browsing, choppy video, unusable gaming, etc. Because of the E2E complexity of the LTE network, these problems could be:

• With a specific UE manufacturer/Laptop setup.

• In the RF Downlink, Uplink, or both.

• Somewhere in the EATN (Ethernet Access Transport Network).

• Somewhere in the Core Network (MME/SGW/PGW/Routers/Switches/Backhaul Transport, MiddleWare boxes).

The purpose of this paper is to document the approach that has been used in the lab and LTE trials to troubleshoot LTE throughput issues. This same approach can be applied to troubleshoot today’s field issues.

Figure 1- Monitor Trace Points for LTE Troubleshooting.


1 Data Collection.

The basic debugging approach is to capture user bearer traffic at key points in the network (see Figure 1). The objective is to isolate packet drops and latency, the 2 major causes of throughput issues, to a tractable network segment/element. In the ALU lab and field we use the PacketView Pro [1] sniffer, which is the core sniffer used in our PDM [2] (Packet Data Monitoring) system, and which has a GPS unit for accurate timestamping. This document is a more generic release version for VzW of an internal ALU troubleshooting document. Where possible, we discuss using WireShark [3] as a sniffer in place of PacketView Pro and PDM (WireShark is a popular freeware product). From past experience, the trace points we focus on are:

1. First at the UE (MP 1 in Figure 1). This trace is the most important and is the simplest to collect. It verifies the laptop, UE, and distant Applications Server configurations (i.e., TCP settings, MTU settings, etc.) and the profile signature of packet drops (if any). It also gives us an accurate measure of ping round trip time. System (including backhaul) latency has a major impact on LTE throughput. Section 2 provides charts of TCP throughput expectations as a function of backhaul delay and packet drop. This trace should be taken stationary, at ~27 dB SNR (peak cell conditions), ~500 ft into the main lobe of an LTE sector.

For smartphones based on operating systems such as Android, there are tcpdump apps that can be downloaded and used to trace directly on the UE. These traces can be copied to a microSD card for laptop processing with WireShark.
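As a minimal sketch of such a capture (assuming a rooted UE with a tcpdump binary installed; the cellular interface name varies by device and is only an example here):

# Capture full frames on the cellular interface to the SD card.
# rmnet0 is illustrative; check the UE's actual interface name first.
tcpdump -i rmnet0 -s 0 -w /sdcard/ue_trace.pcap

The resulting .pcap file can then be copied off the microSD card and opened in WireShark on the laptop.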

2. Second in the local MTSO (MP 2 in Figure 1) where the EATN (Ethernet Access Transport Network) [4] terminates on the 2 7750 Aggregate Routers. We use VLAN span mirrors on each 7750. Each VLAN mirror aggregates all circuits that connect to the Cisco CSR1, and each VLAN mirror uses an IP-Filter (the MOP is provided as a separate document) to restrict ingress/egress traffic to only the eNodeBs being studied. These traces cover the GTP-U (User plane) port 2152 tunnels between the SGW and eNodeB. The WireShark sniffer will have to connect to both 7750 Aggregate routers at the same time (with MPLS we see traffic involved with a particular eNb on both routers) [5]. [When the PDM system is used this traffic is GPS time stamped, so that very accurate latency analysis is possible when a dual trace analysis is done using this trace and a UE trace. In the case of WireShark on Microsoft platforms, good timestamp accuracy is difficult to achieve.] Because of the high-speed/low-speed mismatch between the transport and the EATN loop [6], end-to-end packet drop issues, if they occur, usually happen somewhere in the EATN between the Aggregate routers (7750s) and the UEs. Multiple LAG 10GigE circuits feeding (low speed) 50 Mbps STS1 Sonet circuits present a challenge to bearer traffic integrity.

[1] Klos Technologies, http://www.klos.com
[2] PDM is documented in internal ALU websites; external overviews are available.
[3] The latest version can be obtained from www.wireshark.org.
[4] If the port mirrors are put on the EATN side of the 7750 pair then the IP-Filter addresses will be those of the eNodeB 7705 and 7750.
[5] The WireShark sniffer will not de-tunnel the GTP tunnels; to make working with GTP easier use an end-user MTU of 1428 for IPv4 - this will result in a 1 GTP packet <-> 1 TCP packet association.
[6] This was also true for 1xEV-DO where the default Cisco 10K and Juniper M40e aggregate routers came with wireline default egress queue configurations feeding the T1 lines going down to the cells. This drop issue is not new with LTE.
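A workaround related to footnote 5, sketched here with tshark (WireShark's command-line companion) rather than taken from the original MOP: force UDP port 2152 to be dissected as GTP so the tunneled IP/TCP headers are at least visible in the decode. The file name is illustrative.

# Force GTP dissection of the user-plane tunnels in a 7750 trace.
# Equivalent to "Decode As..." in the WireShark GUI.
tshark -r 7750_trace.pcap -d udp.port==2152,gtp

Even with the GTP decode forced, per-stream TCP analysis of the inner payload remains awkward, which is why the MTU-1428 trick in footnote 5 is useful.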


3. If needed, the third capture (MP 3 in Figure 1) would be taken on the 2 Juniper EX8216 Access routers as data leaves the MTSO for the PGW. The port mirrors would be set up in a similar manner as for the EATN traces (note that the VLANs for MLS1 and MLS2 - the paired Access routers - are different). Comparing the EATN/Aggregate traces against the Access traces allows us to isolate problems to LTE components within the MTSO. Because of the volume of Access router/SGW traffic and the GigE limitation on WireShark sniffer laptops, these traces might have to be made in the early morning Maintenance Window when traffic volume drops [7]. While rare, we have seen cases where there are hardware problems in the MTSO (e.g., bad IO cards), and where config issues and middleware boxes (firewall/proxy/traffic shaper) are known to have caused problems with bearer traffic. The 2 traces at the 2 ends of the MTSO should trap these kinds of issues.

4. The last capture (MP 4 in Figure 1) would be at the PGW site (which is usually not at the originating MTSO). This should be external to the PGW, on a connecting switch to the Application Server. This trace, with the Juniper trace at the originating MTSO, will isolate packet drop and latency issues to the backhaul transport or PGW. Note that if this trace is after the NAT (Network Address Translation) box at the PGW, TCP port numbers, TCP Sequence/Acknowledgement numbers, IP addresses, and IP/TCP checksums will change following the NAT translations [8]. This will make it tricky to match the packet trace at the MTSO with the trace at the PGW site. Issues in the transport backhaul and PGW site are rare. If issues occur here, they could be caused by middleware boxes, Application Server config problems, or even switch settings (e.g., GigE flow control is off by default on some switches).
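One way to correlate packets across the NAT, offered as a sketch rather than an established procedure: the IP Identification field is normally left unchanged by NAT, so it can serve as a join key between the two traces (file names are illustrative; verify on a small sample that your NAT really preserves ip.id):

# Extract (IP ID, capture time) pairs from each trace and join on the ID.
tshark -r mtso_trace.pcap -T fields -e ip.id -e frame.time_epoch | sort > mtso.txt
tshark -r pgw_trace.pcap -T fields -e ip.id -e frame.time_epoch | sort > pgw.txt
join mtso.txt pgw.txt | head    # each line: <ip.id> <time at MTSO> <time at PGW>

IP IDs repeat over time, so this only works over short trace windows; it is a triage aid, not a precise latency tool.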

2 Throughput Expectations.

2.1 Forward Link (FTP GET).

Controlled lab tests have been executed to gain a better understanding of the impact of packet drop rates and additional latency [9] over and above those already existing in an operational LTE system. Packet latency was evenly inserted in both directions of the bearer path, and packet drops were injected only in the downlink (Internet Data Server to Client laptop) direction [10]. NOTE- test traffic flows were generated using an XP DOS window and the FTP Client command line prompt. We don't use freeware packages such as SG Speed Test because we need tight control for repeatable tests.

[7] Unlike the VLAN span/IP-Filter on the 7750 routers, the only IP address that can be applied on the Juniper is that of the SGW. This will have a negligible traffic reduction impact on the VLAN span mirror since all eNb traffic will pass through this filter.
[8] Split TCP circuits are not used for NAT protection because HTTPS, SSH and other secure circuits cannot span split TCP connections.
[9] A lab 10 MHz LTE system was used (ALU eNodeB, SGW and PGW). The UE was an LG G13 connected to a Panasonic MK3 CF-52 running Microsoft Windows XP. The Application Server was a 64-bit W2008 blade. The packet drops and delays noted in this section are in addition to those inherent within the LTE system. All tests used a single UE cabled to an eNodeB (i.e., perfect RF conditions - ~27 dB SNR). The delay-drop injector was a Fedora Core 13 platform using Netem to inject delay and drop. All connections were GigE. The standard DOS window FTP client was used for FTP transfers. eNb load 3.02 was used for the FTP GET tests. eNb load 2.01 was used for the HTTP and FTP PUT tests, which also used a 0x7FFFF window for the large window tests.
[10] All backhaul circuits were GigE. If a site is served by a single STS-1 channel, expect the maximum forward link throughput under 0.0% drop and 0 ms delay conditions to be in the low 40 Mbps range. 3G1X and DO technologies will be sharing the STS-1 channel, and MPLS encapsulation is used between the 7705 cell and 7750 MTSO boxes.


[Chart: FTP Throughput (Mbps) vs. % PER for the 0x80000 FTP TCP Window; curves for 0, 10, 30, 50, 60, 80, and 100 ms added delay.]

Figure 2- 0x80000 TCP Window- FTP GET Throughput.

0x80000 TCP Window - Throughput chart (Mbps)

Delay (ms)   Drop 0.00%  0.01%  0.03%  0.05%  0.07%  0.10%
  0           58.1       38.7   22.8   18.7   16.5   14.3
 10           57.1       38.1   19.5   17.9   14.0   12.0
 30           55.4       29.9   15.6   11.0   10.6    8.3
 50           45.7       23.2   14.7    9.3    8.6    7.9
 60           40.5       16.9    9.9    7.9    6.6    6.0
 80           33.8       14.6    9.0    8.5    5.54   4.8
100           28.9       10.7    9.1    6.8    4.5    4.5

Table 1- Source Data for Figure 2.

[Chart: FTP Throughput (Mbps) vs. % PER for the 0xFFFF FTP TCP Window; curves for 0, 10, 30, 50, 60, 80, and 100 ms added delay.]

Figure 3- 0xFFFF TCP Window- FTP GET Throughput.


0xFFFF TCP Window - Throughput chart (Mbps)

Delay (ms)   Drop 0.00%  0.01%  0.03%  0.05%  0.07%  0.10%
  0           15.2       15.0   14.4   14.2   12.9   11.7
 10           12.9       12.7   11.9   11.4   10.8   10.2
 30            8.6        8.5    7.9    7.4    7.1    6.9
 50            6.5        6.3    6.0    5.7    5.3    5.3
 60            5.8        5.7    5.4    5.2    4.4    4.5
 80            4.8        4.6    4.5    4.2    4.1    4.0
100            4.0        3.9    3.8    3.4    3.4    3.0

Table 2- Source Data for Figure 3.

Figure 2 and Figure 3 show the impacts of packet delay and drop on FTP GET transfers. 2 TCP Window sizes were used: 0x80000 (524,288) [11] and 0xFFFF (65,535 - the Windows XP and Windows 2K default for Ethernet interfaces [12]).
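These numbers line up with the standard window-limited TCP bound; a back-of-envelope check (using the ~28 ms lab baseline RTT from Section 2.5, an assumption rather than a measurement from these tables):

Max Throughput ~= (TCP Window * 8 bits) / RTT

For the 0xFFFF window with 30 ms of injected delay (RTT ~= 28 + 30 = 58 ms):
(65,535 * 8) / 0.058 sec ~= 9.0 Mbps, close to the 8.6 Mbps measured in Table 2.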

FTP Throughput Impairment    0xFFFF (65,535 bytes) Window   0x80000 (524,288 bytes) Window
0 ms delay/0.00% drop        15.2 Mbps                      58.1 Mbps
100 ms delay/0.00% drop       4.0 (-74%)                    28.9 (-50%)
0 ms delay/0.10% drop        11.7 (-23%)                    14.3 (-75%)
100 ms delay/0.10% drop       3.0 (-80%)                     4.5 (-92%)

Table 3- Packet Delay-Drop Impairments to FTP Throughput.

2.2 FTP Traffic Takeaway (forward transfer).

• For FTP traffic, a large TCP Window is needed to fill the available RF channel.

• Latency has the biggest impact on FTP throughput, regardless of TCP window size.

• For large TCP windows, packet drops have a sharp impact.

2.3 Reverse Link (FTP PUT [13]).

Figure 4 graphs the results of adding delay and packet drops to FTP PUTs. These transfers were also generated using an XP DOS window and the FTP command line prompt. [NOTE- this table was generated with the Golden Data Base variable maxSIRtargetForFractionalPowerCtrl set to 4, which for a 10 MHz system gives a maximum reverse link throughput of 8.17 Mbps. Current Golden Data Bases use a value of 10-19. For the value 19, the maximum reverse link throughput would be ~17 Mbps. A scale factor of 2 will roughly correct Table 4 and Figure 4.]

0x7FFFF TCP Window - Throughput chart (Mbps)

Delay (ms)   Drop 0.00%  0.01%  0.03%  0.10%
  0           8.17       7.81   7.79   7.33
 30           7.71       7.56   7.13   5.87
100           4.77       4.47   3.88   2.31

Table 4- Source data for Figure 4 (FTP PUT).

[11] The W2008 FTP server produces an extremely choppy TCP window at the FTP server when a 0x7FFFF TCP window is requested by the UE/Client. This choppiness goes away when the requested window is increased to 0x80000 bytes.
[12] 3G1X and 1xEV-DO systems used the PPP protocol stack, which defaulted to a 16,384 byte TCP window size.
[13] eNb load 2.1, Client TCP Window 0x7FFFF; the W2008 FTP server used auto-tuning.


Figure 4- 0x7FFFF TCP Window- FTP PUT Throughput [14].

2.4 FTP Traffic Takeaway (reverse transfer).

• When the maxSIRtargetForFractionalPowerCtrl eNb translation is set to 4 (older eNb releases used this Golden Data Base value to avoid reverse link co-channel interference issues), the maximum reverse link throughput is around 8 Mbps. Other values can be 10 and 19, the latter of which gives a maximum throughput of around 17 Mbps.

• The controlling TCP Window is at the W2008 server, not the UE Client. The same results would be obtained if the UE Client had a TCP Window of 0xFFFF.

• The W2008 server used auto-tuning. The effective TCP Window at the UE Client (not shown here in figures) ranged from 65,535 (0 ms delay, 0.00% packet drop) to 131,070 (100 ms delay, 0.1% packet drop). Microsoft auto-tuning is not predictable and cannot be controlled - depending on network conditions, different results can be obtained.

2.5 Scaling Field Test Results to Lab Tests.

In the lab a simple ping round trip time is approximately 28 ms [15]. The latency times used in these tests are in addition to this measurement. In the field, if the ping RTT to the FTP Server is, for example, 88 ms, then use the 60 ms latency curves in Figure 2 and Figure 3 to determine expected FTP throughput for a corresponding packet loss rate. In this example, the expected throughput for a 0.00% packet loss rate should be 40.5 Mbps or 5.8 Mbps, depending on whether the TCP Window size was 0x80000 or 0xFFFF.
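A quick way to collect the field RTT, sketched with the standard Linux ping summary line (the server address is the example FTP server from the SGW checks later in this document, not a requirement):

# Average RTT over 10 pings; repeat over several fresh UE connections
# and keep the lowest average (see the note in footnote 15).
ping -c 10 198.224.220.254 | tail -1 | awk -F'/' '{print "avg RTT:", $5, "ms"}'

Subtract the ~28 ms lab baseline from that average to pick the added-latency curve (88 - 28 = 60 ms in the example above).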

2.6 Traffic Spiking & FTP server behavior (some spiking in DO, much more with LTE).

The higher data rates and reverse link TCP ACK grouping (caused by transmission of multiple TCP ACKs in a single LTE reverse link RF frame) are causing LTE traffic spiking on the forward link. TCP parameter settings and FTP server type can have a significant impact on performance.

2.6.1 0x7FFFF TCP Window, W2008 FTP server.

Figure 5 is a throughput curve of a 0x7FFFF TCP Window FTP GET with 0 ms delay and a 0.00% packet drop rate. The bucket size is 1 ms (the picket-fence time over which data is collected and used to generate a throughput rate). The chart on the left shows the throughput bursting at the W2008 FTP server. TCP ACKs are received over roughly a 0.036 second period (see the chart on the right - the red curve represents the TCP ACK pulses coming up from the eNb) before the server responds with a burst of download data. Note from the left-hand chart that for 2 ms (each blue dot is a sample slot of 1 ms) there is a sustained 1 Gbps burst from the W2008 server, followed by a 1 ms burst at 200 Mbps.

[14] maxSIRtargetForFractionalPowerCtrl set to 4.
[15] ***With eNb LA3 load e511 there were scheduler changes that give different UE-eNb ping times (they can range from 25+ ms to 60 ms). Make several UE connections, run 10 ping tests per connection, and use the lowest average ping time for the above scaling to the throughput graphs.***

[Chart: Throughput (Mbps) vs. % PER for the 0x7FFFF TCP Window FTP PUT; curves for 0, 30, and 100 ms added delay.]


The number of back-to-back packets is around 180, which represents a 270,000-byte burst at GigE line rate. This queuing behavior (the server does not immediately respond to a burst of TCP ACKs) is particular to the Windows W2008 server; other platforms could behave the same way.

Figure 5- Example of LTE Traffic Spiking when using W2008 Server.

This kind of data bursting into an LTE backhaul network can cause problems with ingress/egress queues and policers. The result can be traffic clipping during the bursts even when the long-term throughput is relatively low. These packet drops will impact throughput over and above any latency impact.
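Rough arithmetic on the burst described above shows why short policer sampling windows get tripped (a sanity check, not a measured result):

Burst drain time at GigE = (270,000 bytes * 8 bits) / 1,000,000,000 bps ~= 2.2 ms

A policer sampling over ~1 ms therefore sees an instantaneous rate of ~1 Gbps during the burst, even though the long-term average over a second may be only tens of Mbps.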

Figure 6- 1-second throughput curve at UE and TCP window at W2008 server.

Figure 6 shows the 1-second throughput curve at the UE and the TCP Window at the W2008 FTP server. The TCP window is very choppy, which reflects the traffic spiking shown in Figure 5.

2.6.2 0x80000 TCP Window, W2008 FTP server.

Figure 7 shows the change in TCP Window behavior when a 0x80000 TCP window is negotiated with the W2008 server. A 1-byte change in the negotiated TCP window size makes a dramatic change in the TCP window behavior at the server and dramatically reduces the traffic spiking caused by the W2008 server.

Figure 7- TCP Window 0x80000 W2008.

[Charts for Figures 5-7 (axis data not reproduced): Figure 5 - two 1 ms-bucket throughput panels titled "7ffff 0 ms delay 0.0% packet drop- XP/W2008 0.001 throughput bucket"; Figure 6 - panels "7FFFF 0 ms delay 0.0% packet drops- XP/W2008 Throughput" and "7FFFF 0 ms delay 0.0% packet drops- XP/W2008 TCP Window"; Figure 7 - TCP Window at server (bytes) and Throughput at UE (Kbps) vs. time.]


2.6.3 0x80000 TCP window, customer site FTP server.

Figure 8 is the TCP window curve measured at a customer FTP server site after a 0x80000 TCP window had been negotiated by the Client/UE. For unknown reasons the window never opened up beyond 100,000 bytes and the end-to-end performance was reduced to less than 13 Mbps. This underscores the importance of certifying any Applications Server used for LTE testing.

Figure 8- 0x80000 TCP window with Customer server.

3 Setup Procedures.

3.1 UE/Laptop Trace.

For lab testing we generally use a Panasonic CF-52 MK3 that has 3.4 GB of usable RAM. This laptop can be used both to connect to the LTE UE and in the backhaul to sniff the backhaul components. When 2 backhaul elements have to be traced simultaneously (e.g., the 2 7750 EBH Aggregate routers), we use the on-board Intel GigE interface for one of the 7750 switches and a Startech EC1000S PCIx [16] card (for example) for the second 7750. With WireShark, monitor the Status Bar counters (located at the bottom of the WireShark GUI). If WireShark cannot keep up with the trace, a Dropped packet count will show on this bar (otherwise Dropped will not appear). If there are drops, try Capture->Interfaces->Options and set the buffer size (memory capture) large enough for the trace. If WireShark is still overloaded, use a capture tool like dumpcap, which simply collects packets without real-time analysis or a fancy GUI display. Then analyze the dumpcap ".pcap" file with WireShark.
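A minimal dumpcap invocation along those lines (interface name and file size are examples only):

# Capture full frames with no GUI overhead; split the output into ~100 MB files.
dumpcap -i eth0 -s 0 -b filesize:100000 -w /tmp/ue_trace.pcap

The -b ring-buffer option keeps individual files small enough for WireShark to load comfortably afterwards.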

NTP is not very accurate on Microsoft platforms (it can be off by hundreds of milliseconds). WireShark timing is more accurate on a Linux platform.

For XP and W2K there are 2 important registry settings:

• HKEY_LOCAL_MACHINE->SYSTEM->CurrentControlSet->Services->Tcpip->Parameters->TcpWindowSize = DWORD 0x80000.

• HKEY_LOCAL_MACHINE->SYSTEM->CurrentControlSet->Services->Tcpip->Parameters->Tcp1323Opts = DWORD 0x1

These settings set the TCP receive window, enable Large Windows, and turn off Timestamping. The MTU defaults to 1500 bytes - be sure HKEY_LOCAL_MACHINE->SYSTEM->CurrentControlSet->Services->Tcpip->Parameters->Interfaces->{LTE interface}->MTU is not set to an obsolete value (at one time LTE required an MTU size of 1428). Also, if the IPSec Client is enabled in Control_panel->network_connections->local_area_connection->properties, disable it. It overrides the default MTU with an IPSec value.

[16] Older PCMCIA cards are not as efficient as PCIx - PCIx should give maximum performance.




Note- Registry entries in HKEY_LOCAL_MACHINE->SYSTEM->CurrentControlSet->Services->AFD->Parameters will override the above TcpWindowSize if DefaultReceiveWindow or DefaultSendWindow is set. In this case the Client will override the Tcpip settings, impose reverse link flow control when DefaultReceiveWindow is reached, and restrict the far-end negotiated TCP Window to DefaultSendWindow. These 2 AFD DWORDs should NOT be set.

Windows 7, along with Vista and W2008, uses auto-tuning. This feature ignores RFC1323 (the Tcp1323Opts and TcpWindowSize parameters in the Windows registry) and attempts to dynamically control the TCP Window during the data transfer. If McAfee is installed on the UE Client or W2008 server, it will disable auto-tuning and limit the TCP Window size to 0xFFFF. McAfee should be uninstalled on Windows 7, Vista and W2008 platforms [17].

Optionally one can download "SG TCP Optimizer" from speedguide.net to change TCP settings using the Registry tab on the GUI. Remove the DefaultReceiveWindow and DefaultSendWindow AFD entries and update the Tcpip->Parameters entries. Note that the MTU setting is an interface parameter, not a global TCP parameter, and must be set for each UE interface. Reboot the laptop after any registry changes. Then start WireShark and, using the Windows FTP Client command in the DOS window, execute an FTP GET and, separately, an FTP PUT.

3.1.1 What to look for.

3.1.1.1 TCP Stack configurations (laptop and server).

Examine the SYN/SYN-ACK/ACK messages for the TCP circuit setup using WireShark. The TCP Receive Window/Large Windows key and MSS will be in the SYN from the Client and Applications Server. The Client should advertise 0xFFFF with a Window Scale of 4 - this gives a TCP window size of 1,048,560 - but the Client will reverse link flow control the TCP window used by the Server to 0x80000 = 524,288 (look in the TCP ACK decode from the client; the "window size value" will be 0x8000). [Reverse flow control behavior is Microsoft; Linux (Android) behavior is unknown.]
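The window-scale arithmetic behind those numbers:

Effective window = advertised window * 2^scale
0xFFFF * 2^4 = 1,048,560 bytes (the Client's advertised maximum)
0x8000 * 2^4 = 0x80000 = 524,288 bytes (the value the Client holds the Server to)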

The Client should advertise an MSS of 1460 (MTU of 1500 [18]). The wrong TCP Receive Window will have a major impact on throughput (see Figure 2 and Figure 3). To activate auto-tuning on Windows 7 the Application Server must advertise Large Windows (Tcp1323Opts=1) [19]. Verify that the Server is using large windows by looking for the Window Scale field in the Server SYN or SYN/ACK message.

3.1.1.2 Which throughput curve to use.

To verify whether or not you are getting “expected” throughput, you have to use the correct throughput curve in Figure 2 and Figure 3.

Run a ping and record the round trip time. Refer to the curves in Figure 2 or Figure 3. These curves give expected throughput as a function of packet loss. Note- the UE must be in optimal RF (including MIMO reception) to use these curves- ~27 dB SNR.

[NOTE- refer to Section 2.5 for the revised procedure to estimate the ping time to be used. eNb LA3 load e511 changed the UE <-> eNb round trip ping timing; 28 ms is no longer the constant ping time one will see.]

3.1.1.3 Check for packet drops- forward direction (Application Server to Client).

Make a 100 MB FTP GET from the Applications Server with WireShark running. After the run, generate an IO Graph as per Figure 10. If there are SACKs (indicators from the Client to the Server that there were dropped packets), there will be red dots on the graph where the drops took place during the trace.

[17] We have replaced McAfee with Microsoft Security Essentials (free from Microsoft) and a 3rd party freeware anti-virus package, Avira, with no impact to W7 throughput.
[18] The SGW MTU fragmentation issue was fixed in Oct 2010 in 2.0R6. Prior to that an MTU of 1428 was required.
[19] This is generally the default.



If the drop interval is relatively steady and low rate (e.g., 1 or 2 packet drops every 3-5 or more seconds), this indicates physical facility damage or a marginal queue overflow (the latter occurs on low latency circuits) somewhere in the backhaul in the forward direction. This causes the TCP window to fluctuate in a sawtooth pattern: when it grows large enough, packets drop on a queue and the SACKs then cause the TCP window to drop by ½.

If there are clumps of packet drops (10 or more in one group), then 2 possibilities are:

• There is a policer or queue overflow (large latency circuits [20]) somewhere in the backhaul. LTE data transfer can be very spiky because of the high data rates and the burst profile on the reverse link out of the eNodeB. Policers can have sample intervals that are too short (on the order of 0.001 seconds or less), which means that when an LTE data burst takes place in the forward direction, clumps of IP packets are clipped by the policer because the instantaneous data rate is calculated as too high (e.g., 10 Gbps in the case of a 10 Gbps link to an Applications Server). Queues between high speed circuits and low speed circuits (e.g., where the multiple 10GigE LAG circuits in the MTSO feed down to STS1 circuits [multiples of 50 Mb]) can easily overflow because of the traffic pulses. This is the classic wireless traffic packet drop case.

• Bad hardware or damaged fiber. Drops in these cases tend to be exponential with gross traffic across the device: at low gross traffic rates there may be no drops or very few drops, while at saturation rates the drops can be orders of magnitude larger.

The size of the dropout is measured by entering tcp.options.sack in the Filter window at the top of the WireShark GUI. This will filter for SACKs. Open the first SACK in a sequence, go to Transmission Control Protocol->Options->SACK, and there will be a SACK range of missing bytes. Divide by the MTU size for the number of packets involved in the dropout. This will tell whether there is a single packet drop issue or a multiple (burst) packet drop issue.
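The same per-SACK sizing can be scripted instead of clicked through - a rough sketch only (it reads the first SACK block per packet; packets carrying multiple SACK blocks would need extra parsing, and an MSS of 1460 is assumed):

# Print the size of each reported hole in packets.
# -Y is the display-filter option on current tshark; 1.x-era builds use -R.
tshark -r ue_trace.pcap -Y tcp.options.sack_le \
  -T fields -e tcp.options.sack_le -e tcp.options.sack_re |
  awk '{ printf "missing ~%d packet(s)\n", ($2-$1)/1460 }'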

Error counters on backhaul equipment (e.g., RMON stats in Routers, Switches, PGW, SGW, etc.) should catch these problems, whether the problem is caused by damaged facilities (resulting in collisions, CRC errors, etc.), or queue/policer drops (queue drop statistics).

There will be many SACK messages for each dropped packet. The SACK messages have to be opened up to determine the number of packet drops reported by a SACK string. This is tedious, but it is the only way to calculate the PER rate that places you on the throughput curves of Figure 2 and Figure 3. Note- a back-to-back string of packet drops has the same impact on the TCP window reduction as a single packet drop. To use the Figure 2 and Figure 3 curves, a shortcut is to count a contiguous sequence of SACKs as a single packet drop.

3.1.1.4 Check for packet drops- reverse direction (Client to Application Server).

Make a 50 MB FTP PUT. Follow the procedure of the above paragraph. This will identify drops in the Client-Application Server direction. Under perfect conditions (ideal RF, no extra latency, no packet drops) a maximum of 8 Mbps should be achieved when maxSIRtargetForFractionalPowerCtrl =4.

Use the procedure of the previous section to get the reverse link PER rate to match the throughput against Table 4.

3.1.1.5 WireShark IO Graphs.

Figure 9 shows the IO Graph options (Statistics->IO Graphs on the WireShark options bar) for WireShark version 1.4.6 that will chart the TCP Window and duplicate ACKs (which are a good approximation of dropped packets for queue overflow situations). Note that where there are packet drops the TCP Window shrinks.

[20] Large latency backhaul circuits slow the TCP stack's reaction time - TCP NAKs come back too slowly to prevent the TCP stack from overdriving the network.


This particular trace used a 0xFFFF TCP window and had 100 ms delay and a 0.03% packet drop rate set in the simulator. There were 10 packet drops, which is shown by the 10 dips in the TCP window curve and in the red "dot" plots for the "tcp.analysis.duplicate_ack" curve. The throughput was 3.75 Mbps (not shown) and there were 35,100 downlink packets (also not shown). As a rough approximation, the downlink packet count for an FTP GET is approximately 2/3 of the total packet count.

This trace was taken by a sniffer right at the FTP server. This is where a trace has to be made to verify correct operation (meaning correct TCP window size and smooth window profile).

To activate this chart, place the cursor over a TCP packet of interest (i.e., one whose IP addresses are those of the UE/laptop client and the FTP server) and then enter the IO Graph metrics.

For policer, shaper, and bad hardware issues, there will be multiple drops at the same point in time; however, the drop times and TCP window impact will still be seen here.

Also, when the TCP server receives a SACK (indicating dropped packets) it will cut the TCP window by ½. This is not quite shown correctly in Figure 9 because of WireShark granularity, but the indicators are sufficient to indicate the trouble spots.

A word about WireShark timestamp accuracy. WireShark obtains packet timestamps from the PCAP tracer module (in the case of Microsoft, WinPcap). When WinPcap is initiated, it makes a Microsoft system call to get the absolute system time, which is CMOS based and has an accuracy of 10 ms. It subsequently accesses the 64-bit Pentium TSC (Time Stamp Counter) register, which is not actually a clock register but rather counts CPU clock ticks. So if one is running with a 2 GHz clock, the TSC will start at 0 (boot time) and count at a 2,000,000,000-ticks-per-second rate.

WinPcap gets an estimate of the PC crystal clock rate (in this case, 2 GHz) from the registry and extrapolates the original System time using the TSC counter and 2 GHz chip value.

There are several ramifications of this approach: a) the crystal changes frequency as a function of temperature and crystal voltage, causing drift; b) WinPcap does not go back during a trace and re-synchronize with the CMOS or System clock; c) there is no startup calibration of the PC crystal to see how much it differs from the ideal clock value (no two 2 GHz crystals clock at the same rate; the actual rate is a function of how the crystal is cut); and d) for multiple core chipsets, each CPU has its own, non-synchronized, TSC counter. The result is drift from the original calibration.
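To put a number on the drift (an illustration; actual crystal tolerances vary by laptop): a crystal that is off by 20 ppm accumulates

20 x 10^-6 * 60 sec = 1.2 ms

of timestamp error over a 1-minute trace - already larger than the per-packet latencies one would like to measure across trace points.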

In short, this makes it difficult to get long-term, synchronized packet timestamps across multiple WireShark traces (i.e., at the UE/Client and FTP server) that can be used for packet latency measurements.

Figure 9- IO Graph – trace at Server – TCP Window & packet drops.


Figure 10 is a trace taken at the client laptop. Note the ip.len and ip.dst == 10.176.207.171 rule for the first curve; without it, WireShark would add uplink and downlink throughputs by default.

Figure 10- IO Graph - trace at UE - Throughput & packet drops (different trace).

3.2 MTSO 7750 traces.

A MOP for setting up the VLAN span mirrors and IP-Filters for these traces is provided as a separate document. Note that the IP addresses in the Filters are the SGW and eNodeB IP addresses. The IP-Filters are required because of the large traffic volumes with LTE. Multiple GigE links are associated with the 2 7750 Aggregate routers, and MPLS will dynamically route traffic across the 2 7750 routers in real time. Our sniffer platforms have GigE copper interfaces - we do not have 10 GigE trace devices. If, for some reason, the port mirrors were on the MPLS (EATN) side of the 7750, then these IP addresses would be those of the eNodeB 7705 and 7750 router. When doing a 7750 trace, it is very useful to do a simultaneous UE trace. This lets one compare the 7750 and UE traces for per-packet latency measurements.

The most accurate WireShark measurements will be achieved using a Linux based WireShark platform that has been NTP synchronized.

3.2.1 What to look for.

Many eNodeB sites share a single STS-1 Sonet channel with co-located 3G1X and 1xEV-DO cells. MPLS tunneling and the single STS-1 link alone will restrict potential LTE throughput to less than 50 Mbps. When QoS is considered, LTE bearer traffic is always assigned the lowest priority on this circuit (along with 1xEV-DO bearer traffic).

3.2.1.1 STS-1 Saturation.

If a UE is in good RF, the RF link can deliver a higher data rate than the bandwidth available on the STS-1 circuit. This can cause queuing at the MTSO where the 7750 Aggregate routers feed the EATN, instead of at the eNodeB, and can cause packet drops on the ingress queues of the EATN. If a dual trace analysis (UE and 7750 Aggregate Routers) shows dropped packets, the drop point is most likely where the IP packets enter the EATN from the 7750 Aggregate routers (the LTE eNodeB sites have 512,000-byte ingress queues for each UE bearer traffic channel, which is more than enough to prevent drops there).


This should be verifiable by checking EATN performance counters for drops. Note that LTE drops here will accompany 1xEV-DO drops since these technologies share the same QoS queue on the STS-1 channel.

3.2.1.2 1675/1678 Configurations.

Past configurations of the 1675/1678 permitted dropped packets where the 7750 Aggregate Routers feed the Sonet equipment. MOP Reference Number 10-026 was shared with VzW to correct this issue. If there are drops in the ALU EATN, the ALU Sonet ring should be checked to see if this MOP was implemented.

Note also that we are dealing with a (MTSO) 7750 => 1678 => TSS 1850 => TSS 320 => TSS 1850 => TSS5 => UNI device => 7705 (eNb) chain where the 7750/1678/7705 equipment is customer owned, and the remaining equipment could be owned by a 3rd party optical carrier. This could require placing an additional sniffer at the eNb site to measure the MTSO – eNb transport span. We are seeing issues with 3rd party transport providers with other customers.

3.3 MTSO Juniper Traces.

As in Section 3.2 (MTSO 7750 traces), port mirror arrangements on the 2 Juniper EX8216 MLS Access routers are required. Attachment B is the MOP to be used to set up these port mirrors and IP-Filters. In this case the IP filter is that of the SGW in the MTSO. This trace, combined with the 7750 traces, is used to isolate drop/latency problems to the MTSO.

3.3.1 What to look for.

Use the IP address of the UE as the GTP de-tunneling filter on the PDM sniffers. There should be no packet drops and negligible latency (a few milliseconds at most) between these 2 traces. If this is not the case, then there is a problem in the MTSO, and OA&M debugging tools on the various MTSO components (MME, SGW, Routers, etc.) will have to be used to track down exactly where the issues are. [NOTE- WireShark does not de-tunnel. Use an end-user MTU of 1428 to force a 1 GTP packet <-> 1 user payload packet encapsulation. It is harder to scan for Selective ACKs this way using WireShark, but it can be done.]

3.4 PGW Traces.

The port mirror at the PGW site can be put either on the PGW side facing the SGW or on the PGW side facing the Applications Server. If on the SGW side, the MOP shown in Attachment B is used. If on the Applications Server side, then the IP address in the IP-Filter should be that of the Applications Server. The first sniffer location to be tried should be on the Applications Server side of the PGW; this reduces the data rate the sniffer has to deal with.

3.4.1 What to look for.

3.4.1.1 Packet Latency.

The UE-eNb round trip latency for a standard ping will be roughly 28 ms in near-peak cell conditions (again, refer to Section 2.5 for an update on this number). The PGW/MTSO link will be the next source of high latency. This is caused by propagation delay and little can be done about it [21]. When an analysis is done using the MTSO Juniper Access router trace and the PGW trace, the latency curve should be very smooth (no spikes or sawtooth patterns) and should be basically the same for the uplink and downlink directions. A sawtooth pattern could indicate that the backhaul is under-engineered (traffic is saturating the transport).

3.4.1.2 Packet Drops.

We have seen issues where the PGW has ingress/egress queues that are too short to handle a 512KB TCP window. This results in dropped packets in the PGW. PGW counters will quickly reveal this issue and the PGW can be reconfigured to eliminate this problem.

[21] Propagation time through fiber or copper, on the order of 2.0x10^8 and 2.3x10^8 m/sec respectively, will further increase latency. For fiber this equates to 1 msec per 125 miles.


The PGW is also said to use policers. If packets drop in the PGW, it could also be a policer issue. Again, PGW counters will reveal this issue.

4 System Configurations/Parameters.

If there are suspect eNodeB configuration issues, then the following document should be used for an audit: "FDD eNb LTE Parameters User Guide", LTE/IRC/APP/027426, Volume 4: Radio Resource Management. A list of parameters to be verified, as well as some other backhaul element checks, is given in Appendix C.

5 Linux Flood Ping. (NOTE- ALU supports ping8.0 for internal use only- use the "fast ping" options on vendor routers for this functionality.) (NOTE- ALU is trying to get this feature moved back into the latest Fedora Core version of iputils. Availability time unknown.)

The flood option of the Linux ping command can be used to load down circuits in search of layer 1 problems. Ping8.0 is a modified version of the Linux ping command that can be used for this purpose (the modification sets a timer at the end of the ping run to allow outstanding Echo Replies to be received before the ping command exits- this gives an accurate drop count; the official Linux version exits immediately and ping responses in flight are counted as "drops") [22].

The command has to be run as root: ./ping8.0 -f -c 100000 -s 1472 -l 20 xxx.xxx.xxx.xxx. Here -f is the flood option, -c is the ping count, -s is the ping payload size (1472 gives an IP packet size of 1500), -l is a preload factor (outstanding pings not yet acknowledged), and xxx.xxx.xxx.xxx is the target address. The larger the -l option, the larger the load on the system; larger -s values also increase the load.

A Linux laptop with a GigE port should be used. Configure a 7750 access port on the same VLAN as the end device to be tested (SGW, PGW, eNb). The port should be GigE copper. Give the Linux laptop a static IP address and gateway such that it can reach the far device.

5.1 ALU PGW, SGW.

The SGW has 2 bearer path IP addresses and the PGW has one. The SGW and PGW both have default squelch levels for incoming Echo Requests to prevent CPU overload (the current default is 6000 Echo Requests per second, above which the SGW/PGW will drop the Echo Requests) [23]. The following 2 command strings will generate a large traffic flow and still generate fewer than 6000 Echo Requests per second:

• ./ping8.0 -f -c 100000 -s 1472 -l 7 xxx.xxx.xxx.xxx
• ./ping8.0 -f -c 100000 -s 5000 -l 20 xxx.xxx.xxx.xxx (approx 200 Mbps)
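As a sanity check on staying under the squelch (back-of-envelope arithmetic, ignoring IP fragmentation overhead for the -s 5000 case, where the on-wire packet is 5028 bytes following the document's 1472 -> 1500 convention):

6000 Echo Requests/sec * 5028 bytes * 8 bits ~= 241 Mbps ceiling

so the ~200 Mbps figure quoted above sits safely below the 6000/sec limit.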

5.2 Starent PGW.

• ./ping8.0 -f -c 100000 -s 1472 -l 20 xxx.xxx.xxx.xxx
• ./ping8.0 -f -c 100000 -s 5000 -l 10 xxx.xxx.xxx.xxx (approx 135 Mbps)

5.3 eNb.

The eNb does not permit a sustained flood ping. The following command will work, but ping counts above 10 will cause drops.

• ./ping8.0 -f -c 10 -s 1472 -l 10 xxx.xxx.xxx.xxx

A sniffer placed at the eNb shows that when an all-GigE core is used to connect to the eNb, the transfer rate seen by the eNb is approximately 240 Mbps. When a Sonet core is introduced, the transfer rate drops to that of the STS1 bundle.

[22] See the PDM leaf at http://navigator.web.alcatel-lucent.com/. The modified ping command can be downloaded from there. This command will run on all versions of Linux from Fedora Core 8 and beyond.
[23] See Urgent Product Warning 3880-Issue 01, "CPU-protection violations in network ports may cause protocol instability", 2009-07-03.


To extend the test, write a script as follows:

while true
do
    ./ping8.0 -f -c 10 -s 1472 -l 10 xxx.xxx.xxx.xxx
    sleep 2
done


Appendix A- 7750 Port Mirror and IP-Filter MOP.

The 7750 MOP is being issued as a separate document, "Alcatel-Lucent Services, MOP (Method of Procedure), Mirror Service with IP-Filter match MoP, 7750 SR Integration Services".


Appendix B- Juniper EX8216 MLS Port Mirror MOP.

These notes were provided by a member of an ALU team that provides LTE Network Assessments in customer markets. These surveys involve setting up SGW VLAN mirrors on the 2 Juniper EX8216

MLS devices in the MTSO and at the PGW site. Quote:

Juniper ePC MLS VLAN Analyzer / Mirror

Configure Analyzer Destination Port:
config
set interfaces <destination port> description "Mirror Destination"
set interfaces <destination port> unit 0 family ethernet-switching

Configure Analyzer:
set ethernet-switching-options analyzer <analyzer name> input egress interface VLAN <SGW VLAN>
set ethernet-switching-options analyzer <analyzer name> output interface <destination port>

Save Changes:
show | compare
commit check
commit sync and-quit
show analyzer

NOTE: You only need to capture the egress traffic from the VLAN. SGW VLANs are different on each of the Juniper ePC MLS 8216s. The reason we capture just the VLAN instead of the entire port to the SGW is that multiple VLANs traverse the LAG between the ePC MLS and the SGW, and the 20 Gbps LAG exceeds the MP's physical line rate of 1 Gbps.

Disable Analyzer:
deactivate ethernet-switching-options analyzer <analyzer name>


Appendix C- eNodeB Parameters and Backhaul Element Checks.

eNodeB Downlink Parameters

Ensure the following downlink parameters are correctly configured in the eNodeB database file.

• dlMCSTransitionTable
• transmissionMode
• dlSinrThresholdBetweenOLMimoAndTxDiv
• dlFullCLMimoMode
• dlSinrThresholdBetweenCLMimoAndTxDiv
• dlSinrThresholdBetweenCLMimoTwoLayersAndOneLayer
• LteCell::Spare0 (bits 0-8)
• CellL2DLConf::alphaFairnessFactor
• referenceSignalPower
• pBCHPowerOffset
• pCFICHPowerOffset
• pHICHPowerOffset
• pDCCHPowerOffsetSymbol1
• pDCCHPowerOffsetSymbol2and3
• paOffsetPdsch
• pDCCHPowerControlMaxPowerIncrease
• pDCCHPowerControlMaxPowerDecrease

eNodeB Uplink Parameters

Ensure the following uplink parameters are correctly configured in the eNodeB database file.

• ulSchedPropFairAlphaFactor
• sEcorrInit
• sEcorrMin
• sEcorrMax
• sEcorrStep
• maxHARQtx
• p0NominalPUCCH
• p0uePUCCH
• sIRTargetforReferencePUCCHFormat
• p0NominalPUSCH
• p0UePUSCH
• pUSCHPowerControlAlphaFactor
• pathLossNominal
• initialSIRtargetValueForPUSCHnonSemiStaticUsers
• sIRtargetCorrectionFactorTableForPUSCHsemiStaticUsers
• minSIRtargetForFractionalPowerCtrl
• initialSIRtargetValueForPUSCHSemiStaticUsers
• maxSIRtargetForFractionalPowerCtrl

Additionally, we need to check the MTU size in the eNodeB database:

grep mTU /data/ncagentd/database.xml
<mTU>1500</mTU>

Network Element Configuration Audit

eNodeB

To determine the throughput going into the eNodeB, dump the "enetstats" every 10 seconds. Example:

/wal_ubm/pltf/wal> enetstats


Toward the bottom of this output you will see "Packet" counts. Calculate the throughput (assuming a 1428-byte packet size) by:

Packet Count Difference = (Packet count) - (Packet count 10 sec earlier)
Throughput = (Packet Count Difference * 1428 * 8 bits) / 10 sec

7705

Check to confirm that the port is configured as full-duplex on the 7705 ports. Also verify MTU size > 1428. Example:

7705> A:TULYOK13T1A-P-AL-0023-01# show port 1/5/7 detail
A:TULYOK13T1A-P-AL-0023-01# show port 1/5/8 detail

7750

1) Match the eNodeB and find the eNodeB telecom interface name. Example:

7750-01# show router arp | match 023
172.25.192.88   6c:be:e9:6c:5f:64  00h00m00s  Oth[I]  to-BTS0023-7705
172.25.192.89   00:25:ba:30:41:61  03h17m17s  Dyn[I]  to-BTS0023-7705   <- 7705 IP
10.253.140.76   00:00:01:00:00:23  00h00m00s  Oth[I]  L3-Telecom-eNodeB0023
10.253.140.77   00:11:3f:c8:28:95  00h00m00s  Sta[I]  L3-Telecom-eNodeB0023

2) Find the Port/SapId of the eNodeB. Example:

7750-01# show router interface L3-Telecom-eNodeB0023
===============================================================================
Interface Table (Router: Base)
===============================================================================
Interface-Name         Adm  Opr(v4/v6)  Mode  Port/SapId      IP-Address        PfxState
-------------------------------------------------------------------------------
L3-Telecom-eNodeB0023  Up   Up/Down     IES   spoke-23:3102*  10.253.140.76/31  n/a
-------------------------------------------------------------------------------
Interfaces : 1
===============================================================================
* indicates that the corresponding row element may have been truncated.

3) Spoke ping to the 7705 and check MTU size. Example:

7750-01# oam sdp-ping 23 resp-sdp 23 count 50          /* NOTE: can NOT do rapid pings to eNodeB; 3-6 ms OK */
7750-01# oam sdp-mtu 23 size-inc 1500 1530 step 1      /* check MTU size to spoke; usually good to 1524 */

NOTE: Repeat these commands from 7750-02.


SGW

1) Traceroute the telecom of the eNodeB:

A:TULYOK1391A-L-AL-SGWx-01# traceroute 10.253.140.47
traceroute to 10.253.140.47, 30 hops max, 40 byte packets
 1  10.253.140.8 (10.253.140.8)  1.34 ms  0.942 ms  1.08 ms
 2  10.119.137.224 (10.119.137.224)  1.27 ms  1.08 ms  1.28 ms
 3  10.119.136.225 (10.119.136.225)  1.74 ms  1.70 ms  1.73 ms
 4  0.0.0.0  * * *
 5  0.0.0.0  * * *
 6  10.253.140.47 (10.253.140.47)  8.14 ms  7.33 ms  8.32 ms

2) Rapid ping the 7750's telecom port for the eNodeB's gateway address. Get this IP address from the "show router interface L3-Telecom-eNodeB0023" output on the 7750 (not the SGW), dumped above. Example:

SGWx-01# ping 10.253.140.76 count 600 rapid size 1400

3) Ping the FTP server used by the customer. Example:

SGWx-01# ping router 2 198.224.220.254 count 600 rapid size 1400
PING 198.224.220.254 56 data bytes
64 bytes from 198.224.220.254: icmp_seq=1 ttl=252 time=6.25ms.
64 bytes from 198.224.220.254: icmp_seq=2 ttl=252 time=6.20ms.
64 bytes from 198.224.220.254: icmp_seq=3 ttl=252 time=6.16ms.
64 bytes from 198.224.220.254: icmp_seq=4 ttl=252 time=6.23ms.
64 bytes from 198.224.220.254: icmp_seq=5 ttl=252 time=6.21ms.

4) Traceroute to the FTP server. Example:

traceroute 198.224.220.254, 30 hops max, 40 byte packets
 1  10.253.140.8 (10.253.140.8)  1.23 ms  0.916 ms  0.928 ms
 2  10.119.137.224 (10.119.137.224)  1.35 ms !N
 2  0.0.0.0  *
 2  10.119.137.224 (10.119.137.224)  1.52 ms !N

5) Rapid ping to the PGWs. Example:

ping 66.174.15.130 router 2 count 600 rapid size 1400
ping 66.174.20.194 router 2 count 600 rapid size 1400
ping 66.174.27.194 router 2 count 600 rapid size 1400
ping 66.174.38.224 router 2 count 600 rapid size 1400
ping 66.174.43.34 router 2 count 600 rapid size 1400
ping 66.174.62.194 router 2 count 600 rapid size 1400
ping 66.174.210.226 router 2 count 600 rapid size 1400
ping 66.174.219.226 router 2 count 600 rapid size 1400

6) Check the egress (toward the eNodeB) throughput of the call.

7) Find the APN being used for this call (usually the one with a large "S1U DL bytes"). Dump this again after 10 seconds and calculate the throughput by:


Example:

SGWx-01# show mobile-gateway serving bearer-context detail imsi 311480000025718

"S1U DL bytes" difference = ("S1U DL bytes") - ("S1U DL bytes" from previous 10 second dump)
Throughput = (("S1U DL bytes" diff) * 8 bits) / 10 sec

Output Example:

IMSI             : 311480000025718
APN              : vzwinternet.mnc480.mcc311.gprs
Bearer Id        : 7
Bearer type      : Default
Up Time          : 0d 00:38:34
QCI/ARP          : 9/10
QoS UL MBR       : 0 Kbps
QoS DL MBR       : 0 Kbps
QoS UL GBR       : 0 Kbps
QoS DL GBR       : 0 Kbps
S5 PGW Data TEID : 0x8005c032
S5 SGW Data TEID : 0x4b00647
S5 PGW Data addr : 66.174.43.34
S5 SGW Data addr : 199.223.102.193
S1U eNodeB TEID  : 0x823
S1U SGW TEID     : 0x4b00667
S1U eNodeB addr  : 10.253.140.77
S1U SGW address  : 10.253.140.0
S5 UL packets    : 822695
S5 UL bytes      : 143262573
S1U DL packets   : 973848
S1U DL bytes     : 1207191022
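The arithmetic above, scripted as a quick sketch (the first byte count matches the example output above; the second sample is made up for illustration):

# Throughput from two "S1U DL bytes" samples taken 10 seconds apart.
awk -v b1=1207191022 -v b2=1267191022 \
  'BEGIN { print (b2 - b1) * 8 / 10 / 1e6, "Mbps" }'    # 48 Mbps for these samples

The same pattern works for the eNodeB enetstats packet counts (multiply the packet-count difference by 1428 * 8 instead).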

MME

1) Dump (display) the UE context for a specified IMSI:

lss# cd logs
lss# ueadmin_cli -o dump -i 311480000025718

2) Run the call trace for a specific IMSI on the MME:

calltrc_cli -o start -i 311480000025718
tail -f proto*.log | /opt/MME/sbin/msgFlow -i 311480000025718

1678 (or 1675)

Request the customer to check whether a 1 Gig upgrade card has been installed in the 1678 and whether the "traffic descriptors" have been re-configured.

Traffic descriptors:
Cli Mib ## trafficdescriptor show
Cli Mib ## interface showall

To see the VLAN flow for the eNodeB in question (to identify the trafficDescriptors assigned to a flow):
Cli Mib ## etsflowunidir show vlan x
Note: x is the VLAN in question.