Monday, March 4, 2013

Dropped Frames on Linux Server

A few weeks ago we got an alert from LogicMonitor, our monitoring provider, about dropped TCP frames on one of our servers. The alert looked like this:

The host Server-X is experiencing an unusual number of failed TCP connections, probably incoming connections...This could be caused by incorrect application backlog parameters, or by incorrect OS TCP listen queue settings.

What?

This server is part of a general pool of boxes that sit behind a load balancer. It was the only box on which we were receiving such errors, and the errors didn't appear to actually be affecting network performance. A quick check on the server using ifconfig showed 3 dropped frames at the time of the alert. We acked it and moved on, deciding it was a one-off. Unfortunately, the alert has continued to fire on and off since then. It got put on the back burner because we had higher-priority issues to deal with, but eventually I had to come back to it and see what the deal was.

First things first: the server is running CentOS 6.3, kernel version 2.6.32, and we're only using one of its NICs, eth0. I ran ifconfig again, which showed this:


[root@Server-X log]# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:25:90:63:1E:4C 
          inet addr: x.x.x.x  Bcast:x.x.x.x  Mask:x.x.x.x
          inet6 addr: fe80::225:90ff:fe63:1e4c/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1289983660 errors:24 dropped:0 overruns:0 frame:24
          TX packets:1814707795 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:468136124692 (435.9 GiB)  TX bytes:1564900472125 (1.4 TiB)
          Interrupt:16 Memory:fbce0000-fbd00000

So the error count has increased: ifconfig now shows 24 RX errors, all of them frame errors (note that the dropped counter itself is still 0). There are still no collisions, and no obvious issues with the network traffic or the NIC itself. A quick lookup on the alert told me that these kinds of errors, receive (RX) errors to be precise, are usually caused by the sending device. I traced the server's MAC address to the switch and port it was plugged into and took a look. The switch is an HP ProCurve:


SW5# show interfaces 38

 Status and Counters - Port Counters for port 38                     

  Name  :                                                                
  MAC Address      : 009c02-b3e79a
  Link Status      : Up 
  Totals (Since boot or last clear) :                                   
   Bytes Rx        : 3,973,978,986      Bytes Tx        : 1,199,254,183    
   Unicast Rx      : 3,495,818,889      Unicast Tx      : 1,644,050,925    
   Bcast/Mcast Rx  : 678,069            Bcast/Mcast Tx  : 727,023,856      
  Errors (Since boot or last clear) :                                   
   FCS Rx          : 0                  Drops Tx        : 7,948,285        
   Alignment Rx    : 0                  Collisions Tx   : 0                
   Runts Rx        : 0                  Late Colln Tx   : 0                
   Giants Rx       : 0                  Excessive Colln : 0                
   Total Rx Errors : 0                  Deferred Tx     : 0                
  Others (Since boot or last clear) :                                   
   Discard Rx      : 0                  Out Queue Len   : 0                
   Unknown Protos  : 0                
  Rates (5 minute weighted average) :
   Total Rx  (bps) : 15941488            Total Tx  (bps) : 7623272  
   Unicast Rx (Pkts/sec) : 1524           Unicast Tx (Pkts/sec) : 944      
   B/Mcast Rx (Pkts/sec) : 0              B/Mcast Tx (Pkts/sec) : 65       
   Utilization Rx  : 01.59 %          Utilization Tx  : 00.76 %

As you can see, the switch itself isn't under much load, though the port does show a fair number of Tx drops. Those are drops since boot, though, so to get an accurate picture of what's happening right now I had to do a clear statistics interface 38, which clears the stats on that interface for the current session only. Once the counters were back at 0, I waited. And waited. And waited. And saw no new drops. Mind you, I didn't see the error counters incrementing on the server either, so it didn't prove anything, but I started to wonder whether the problem might simply be the server itself.
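To keep an eye on the server-side counters while waiting, something simple like this does the trick (eth0 and the 5-second interval are just what I'd use; adjust to taste):

# poll the RX/TX error and drop counters every 5 seconds
watch -n 5 "ifconfig eth0 | grep -E 'errors|dropped'"
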

Since I wasn't getting anything conclusive off of the switch, I moved back to considering the server. I spent some time researching TCP queue settings and such, and found that they generally haven't been a problem on modern Linux, so they were likely not the culprit. That's when I ran across this handy site that addressed my exact error message. Then I hit the standard issue I tend to have when troubleshooting: the tool I want to use doesn't work for some reason. This time tcptrace simply wasn't available through the package manager, and when I tried to build it from source it barfed all over me, so I decided it was best not to continue down that road. I ultimately wound up creating a micro-instance in AWS, installing tcptrace from a package there, and scp'ing the tcpdump capture from my server over to that instance. Crazy, right? Except that what I found was pretty illuminating.
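For anyone wanting to reproduce the workflow, it was roughly this; the filenames and the AWS hostname are placeholders:

# on the problem server: capture full frames for a few minutes, then Ctrl-C
tcpdump -i eth0 -s 0 -w /tmp/serverx.pcap

# ship the capture off to the throwaway AWS micro-instance
scp /tmp/serverx.pcap user@aws-micro:/tmp/

# on the AWS instance: summarize each TCP connection (-b is brief output)
tcptrace -b /tmp/serverx.pcap
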

There were a number of TCP resets happening in communications between my server and the load balancer. In general they looked like this:

164: ip-x-x-x-x-1.ec2.internal:52489 - ip-x-x-x-x.ec2.internal:10021 (lo2lp)      2>    1<  (reset)
324: ip-x-x-x-x.ec2.internal:44979 - ip-x-x-x-x.ec2.internal:10041 (xw2xx)      1>    1<  (reset)

In case you're not familiar with TCP resets, here's a quick crash course. In a normal TCP session, a SYN flag starts the connection and a FIN flag ends it. It's like the opening hellos and closing good-byes of a civilized conversation. If a session can't be closed cleanly with that FIN exchange, one of the hosts resets the connection instead. A reset is essentially an immediate close of the session, like hanging up on a telemarketer. Resets can happen for a number of reasons. If a client sends a SYN to a port the server is not listening on, the server responds with a reset (RST). If a host gets no response to the frames it sends, then after a certain number of retries it assumes a network failure and sends a reset. Resets are good in these scenarios, since they don't keep a dead connection open.
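Incidentally, if you'd rather watch for resets live than dig through a capture afterwards, tcpdump can filter on the RST flag directly (interface name assumed):

# show only packets with the RST flag set
tcpdump -i eth0 'tcp[tcpflags] & tcp-rst != 0'
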

So, somewhere in the communication between my server and the load balancer, one of them sent a reset. Unfortunately the tcptrace output doesn't make clear which machine sent it, but I'm leaning towards the server. There are also corresponding rx_short_length_errors reported by ethtool. A short length error sounds obvious (i.e. the frame is "too short"), but it's not obvious what that means in practical terms. What causes short frames, or runts? The interfaces on the load balancer show no errors, no collisions, no CRCs.
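If you want to check your own NIC for the same counter, it comes out of the driver's extended statistics; something like this pulls it (eth0 assumed):

# dump NIC driver statistics and filter for the short-frame counters
ethtool -S eth0 | grep -i short
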

I tried Googling the term rx_short_length_errors and the top hits all referred to a specific Ethernet controller, the Intel 82574L, which is exactly the NIC used in all of the servers in this particular group. Out of curiosity I logged in to the remaining servers and checked their NIC stats and found, much to my surprise, similar RX errors. In fact, I found more on those other boxes than on the one I had been alerted for! I verified that all of the servers are being monitored for the same metrics, but whereas this particular box is showing an alert for these drops, the others are not. This obviously doesn't give me a ton of confidence in our monitoring solution.
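Checking the rest of the pool was just a quick loop from my workstation; the hostnames below are placeholders:

# compare the short-frame counter across the pool
for h in server-1 server-2 server-3; do
    echo "== $h =="
    ssh $h "ethtool -S eth0 | grep rx_short_length_errors"
done
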

There is a newer driver for the network card we use that I could try, to see if it resolves the issue, but at this point I'm not sure there really is an issue to resolve. I have no demonstrable network problems (latency, packet loss, or anything else) mappable to these servers, and given the amount of traffic they pass, the number of errors is tiny compared to the number of frames actually processed. I'll have to discuss it with my team, but I'm inclined not to go any further with updating the driver.
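If we do decide to go that route, the first step would just be confirming what's on the box now; ethtool reports the driver name and version (on the 82574L this should be the e1000e driver):

# show the current driver, version, and firmware for the NIC
ethtool -i eth0
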
