Thursday, September 25, 2014

Troubleshooting Haproxy

I received an email from our R&D team stating that they were seeing fewer HTTP requests for a particular client/campaign than they expected, on the order of ~30% less. The engineer who reported it took a look at the haproxy stats page and recorded that, for a given time period, the frontend was logging 1,688,298 sessions vs 1,186,079 sessions on the backend. I was asked to look into the cause of this discrepancy.

I took a look at the logs and felt like they weren't giving me much insight into what was happening between the frontend and backend. The logs were also being sent to /var/log/messages, so they were pretty noisy to look through. I checked haproxy.cfg, edited the logging to send it to its own file, and also added the log global parameter to the backend section. After a reload of haproxy I went back to the logs. The logs looked the same, just in a different spot. After making some inquiries on the haproxy IRC channel I learned that haproxy was indeed giving some information about the backend connections. The log lines were of a format I wasn't accustomed to, and there is actually a very detailed section on logging in the Haproxy documentation that explains what I was looking at.
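For reference, the change amounted to something like the sketch below. The facility, file path, and backend name here are illustrative rather than our exact config, and the split into a separate file happens on the syslog side:

# haproxy.cfg (sketch): emit logs via syslog on the local2 facility
global
    log 127.0.0.1 local2

defaults
    log     global
    option  httplog

backend webservers
    log global    # the parameter added to the backend section, as mentioned above

# /etc/rsyslog.d/haproxy.conf (sketch): route local2 to its own file
# (assumes rsyslog is accepting UDP syslog on 127.0.0.1)
local2.*    /var/log/haproxy.log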

Here's an example of the numbers I was working with (this is after the reload):

Frontend
- 3434009 requests total
- 79263 3xx errors
- 1283482 4xx errors
- 30396 5xx errors

Backend
- 79306 3xx errors
- 136 4xx errors
- 30396 5xx errors

Those are numbers gathered from the stats page. I compared these against numbers I was getting from grepping haproxy.log. The numbers didn't match up across the board.


Status code    Count
206            6
304            221635
400            230
404            11
405            60
408            21
500            15058
503            30396
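The per-status counts above were pulled from haproxy.log itself. Assuming the syslog-prefixed httplog format shown in the sample lines later in this post, the status code lands in the 11th whitespace-separated field, so a breakdown like this is one awk away:

# tally HTTP status codes recorded in the haproxy log
awk '{print $11}' /var/log/haproxy.log | sort | uniq -c | sort -rn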

The 503 errors matched up with what I was seeing in the logs, at 30396. That number makes sense in terms of what the 503 error means. According to w3.org, a 503 error means:

The server is currently unable to handle the request due to a temporary overloading or maintenance of the server. The implication is that this is a temporary condition which will be alleviated after some delay. If known, the length of the delay MAY be indicated in a Retry-After header. If no Retry-After is given, the client SHOULD handle the response as it would for a 500 response.

The 503 lines from the logs look like this:

Sep 24 16:39:21 localhost haproxy[8562]: someip:54196 [24/Sep/2014:16:39:13.668] public webservers/srv1 8/8268/-1/-1/8276 503 109 - - SC-- 454/454/168/5/+5 0/37 {Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36|http://abunchofstuff} "GET /?requesthere HTTP/1.1"

According to the documentation, the "SC" flags describe the state of the session when it was terminated. The first character is the event that caused the session to end, and "S" means the server explicitly aborted or refused the connection; srv1, for whatever reason, refused the connection. The second character is the state of the TCP or HTTP session at the moment it closed, and "C" means haproxy was still waiting for the connection to the backend server to complete. These errors all occurred during a narrow window, 16:39-16:40. For roughly a minute there was some issue with requests getting through to the backend. To compare, I grepped /var/log/messages (the original logging location) for 503s. That log went back as far as Sept 21st, and the only occurrences of 503s were in that same window. This looks like a temporary glitch.
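To confirm how tightly those 503s were clustered, something along these lines (same field assumption as the count above) buckets them by minute:

# group 503 responses by the minute they were logged
awk '$11 == 503 {print substr($3, 1, 5)}' /var/log/haproxy.log | sort | uniq -c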

I moved on to the 4xx errors, as these were higher in number and of more immediate concern to the team. The 4xx errors also had the biggest discrepancy in terms of basic recording. Referring to the stats above, the total number of reported 4xx errors for the front and back end combined is 1283618. The haproxy log only shows 322 4xx errors. That's a huge discrepancy, and those missing log lines make it very difficult to determine why such a large number of requests are getting 4xx responses. The majority of those requests aren't even reaching the backend; they're being discarded at the load balancer level. Here's an example of the kinds of 4xx errors being recorded:

Sep 23 14:33:40 localhost haproxy[3048]: someip:21425 [23/Sep/2014:14:33:40.398] public public/<NOSRV> -1/-1/-1/-1/190 400 187 - - PR-- 85/85/0/0/0 0/0 {|} "<BADREQ>"
Sep 23 14:58:47 localhost haproxy[3047]: 38.81.65.42:49558 [23/Sep/2014:14:58:47.857] public webservers/srv2 17/0/0/2/141 400 265 - - ---- 85/85/6/4/0 0/0 {Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)|} "GET / HTTP/1.1"
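For what it's worth, the 322 total can be re-derived straight from the log with the same field assumption as earlier:

# count every 4xx response recorded in the log
awk '$11 ~ /^4[0-9][0-9]$/ {n++} END {print n+0}' /var/log/haproxy.log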

Knowing what we expect these requests to look like, it makes sense that some of them are being met with various 4xx errors. Malformed syntax and requests for resources that don't exist are usually the work of people probing for a vulnerability. For example, 404s are often accompanied by "GET /phpmyadmin/scripts/setup.php HTTP/1.1" or simply "GET / HTTP/1.1", people trying to ascertain whether we're doing something stupid like running default php installs or allowing directory listings. A few hundred of these is not alarming; in fact I'd expect a whole lot more. The stats page, though, says we're returning 4xx errors on a ton more connection requests. Where are those being logged?

I stumbled across this post and it gave me a hint: we had dontlognull turned on. I turned it off, reloaded haproxy, and the numbers immediately started to match up. The upshot was that there was actually nothing wrong with our delivery. The numbers on the stats page count all requests, regardless of outcome, and were causing undue alarm. Once you set aside the requests in which no data was transferred, any remaining issues with responses from the web servers are application-specific, not hardware or haproxy.
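The fix itself is a one-line config change; roughly (section placement is illustrative):

defaults
    # before: connections that transferred no data were omitted from the log
    #option dontlognull
    # after: explicitly disable it so those sessions show up in the log
    no option dontlognull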
