Thursday, September 25, 2014

Troubleshooting Haproxy

I received an email from our R&D team stating that they were seeing less HTTP requests for a particular client/campaign than they expected on the order of ~30% less. He took a look at the stats on the stats page for haproxy and recorded that for a given time period the frontend was logging 1,688,298 sessions vs 1,186,079 sessions on the backend. I was asked to look into the cause for this discrepancy.

I took a look at the logs and felt like it was giving me much insight into what was happening between the frontend and backend. The logs were also being sent to /var/log/messages so it was pretty noisy to look through. I checked haproxy.cfg and edited the logging to send it to its own file, and also added the log global parameter to the backend section. After a reload of haproxy I went back to the logs. The logs looked the same, although in a different spot. After making some inquires on the haproxy IRC channel I learned that haproxy was indeed giving some information about the backend connections. The log lines were of a format I wasn't accustomed to, and there was actually a very detailed section on logging in the Haproxy documentation that explained what I was looking at.

Here's an example of the numbers I was working with (this is after the reload):

Frontend
- 3434009 requests total
- 79263 3xx errors
- 1283482 4xx errors
- 30396 5xx errors

Backend
- 79306 3xx errors
- 136 4xx errors
- 30396 5xx errors

Those are numbers gathered from the stats page. I compared these against numbers I was getting from grepping haproxy.log. The numbers didn't match up across the board.


Error
Counts
206
6
304
221635
400
230
404
11
405
60
408
21
500
15058
503
30396

The 503 errors matched up with what I was recording in the logs at 30396. This numbers make sense in terms of what the 503 error means. According to w3.org, a 503 error means: 

The server is currently unable to handle the request due to a temporary overloading or maintenance of the server. The implication is that this is a temporary condition which will be alleviated after some delay. If known, the length of the delay MAY be indicated in a Retry-After header. If no Retry-After is given, the client SHOULD handle the response as it would for a 500 response.

The 503 lines from the logs look like this:

Sep 24 16:39:21 localhost haproxy[8562]: someip:54196 [24/Sep/2014:16:39:13.668] public webservers/srv1 8/8268/-1/-1/8276 503 109 - - SC-- 454/454/168/5/+5 0/37 {Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36|http://abunchofstuff} "GET /?requesthere HTTP/1.1"

According to the documentation, the SC flags represent the status of the TCP connection at connection termination.