Monday, May 7, 2018

HAProxy DNS Resolvers

It's...been a while, no?

I waffled about making a new post and trying to do this on the regular again, because honestly the scope of my current position is so much more...well, breadth-y than my old job. I started blogging because I wanted to document the things I was learning and running across; the volume of those things has exploded with my current work. I'm moving in the micro-services world now, baby! Firehose central. You should see how many notes I have in Evernote now.

Enough exposition. You don't care about what I'm doing anyway. You wanna know what I have to say about HAProxy.

We have HAProxy docker containers running on CoreOS EC2 instances in an autoscale group. We're using a combo of the HAProxy multibinder, confd, and etcd to handle seamless reloads and enable dynamic configuration changes (all things that 1.8 does, but that 1.7, the version we're running, didn't do at the time of deployment). Before you get all impressed, this isn't something I set up; it existed before I came on-board.

HAProxy is set up with multiple front-ends listening on different ports. Most of the back-ends that they attach to are AWS ELBs. ELBs can (and do) change IPs from time to time, and when that happens HAProxy keeps pointing at the stale IP and requests to that back-end just fail. Quick primer on how HAProxy does DNS (which is something I had to look up): HAProxy does a DNS lookup when it starts, using whatever nameserver is set in /etc/resolv.conf, and that's it. If the IP of one of your back-end servers changes, you have to reload HAProxy to force it to refresh DNS, unless you configure a resolvers section. A very simple example of such a configuration is here. The relevant portions of that config are

resolvers awsvpc
    nameserver vpc 172.31.0.2:53
.......
server read01 read01.XXXXXXXXXXXX.ap-northeast-1.rds.amazonaws.com:3306 check port 3306 resolvers awsvpc inter 2000 fall 5


where nameserver takes, at a minimum, a friendly name for your nameserver and its IP:port (you add one nameserver line per DNS server). In AWS, you can find out the nameserver to use by checking /etc/resolv.conf.
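To flesh that out a bit, here's a rough sketch of what a fuller resolvers section can look like. The awsvpc/vpc names come from the example above; the timer values are illustrative, and the retry/hold directives are what control how long HAProxy trusts the last DNS answer:

resolvers awsvpc
    nameserver vpc 172.31.0.2:53   # the VPC DNS server from /etc/resolv.conf
    resolve_retries 3              # retries before a resolution is considered failed
    timeout retry   1s             # time to wait between retries
    hold valid      30s            # how long a valid answer is trusted before re-resolving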

The second part of this is that the server line in your back-end config needs to include a "check" parameter. This is because health checks are what trigger the dynamic DNS lookups. Now HAProxy tries to resolve your back-end servers at startup and every time a health check runs. By default, health checks are triggered every 2000ms (2 seconds).
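Putting those pieces together, a back-end ends up looking something like the sketch below. The back-end name, server name, and hostname here are placeholders rather than our actual config; the important bits are check, inter/fall, and the resolvers reference.

backend appx_back_end
    # check enables health checks, which are what trigger re-resolution;
    # inter 2000 = check every 2s, fall 5 = mark DOWN after 5 failed checks
    server appx api.stage.example.internal:443 check inter 2000 fall 5 resolvers awsvpc resolve-prefer ipv4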

First thing I did was to set up a quick test bed to make sure that nothing I was going to do would cause an issue with traffic to HAProxy. We only have a production instance of it, filtering both staging and production traffic. Once I verified that the DNS flip worked and did not disrupt traffic, I was good to go to set it up on our real servers.

I started off with the staging back-ends—which already had health checks enabled—so all I had to do was add the resolvers block to the config and the resolvers parameter to the server lines. I applied my changes and waited and observed no issues, so I considered it good to go. As soon as I tried it on one of our prod back-ends we started getting Pingdom alerts about that particular site being unavailable. It was very flappy and intermittent, but seemed to be directly related. I rolled back my changes and the flapping stopped, so that seemed pretty conclusive to me.

I checked the logs to see what I could see, and found many entries that looked like this:

May  4 17:47:29 haproxy[594]: Server stg_appx_back_end/appx is going DOWN for maintenance (DNS timeout status). 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
May  4 17:47:29 haproxy[594]: Server stg_appx_back_end/appx ('api.stage.company.io') is UP/READY (resolves again).

This was happening for all of the back-ends that had health checks and resolvers added. I simply hadn't noticed them because we had no Pingdom alerts on staging endpoints. Some other observations:

  • they were happening every few seconds, which would mean if they were true we literally would not be able to serve any traffic
  • they were resolving within the same timestamp as the reported down event

I ran a tcpdump on the host to see what DNS lookups were being done, and at no point did I see any DNS failures or timeouts. I did, however, see many more DNS lookups than I was expecting. Given the settings I'd used for HAProxy (hold timer set to 30s, interval left at the default 2 seconds), I'd expect a DNS lookup to happen every 30 seconds based on the documentation, where the formula is
[hold value] + ([hold value] % [inter])
30 + (30 % 2) = 30 + 0 = 30


Instead, I was seeing:

18:04:23.648381 IP x.x.x.x.ec2.internal.38377 > x.x.x.x.ec2.internal.domain: 36306+ A? api.stage.internal.company.domain. (63)
18:04:23.648729 IP x.x.x.x.ec2.internal.domain > x.x.x.x.ec2.internal.38377: 36306 3/0/0 A 1.1.1.1, A 2.2.2.2, A 3.3.3.3 (111)
18:04:25.168288 IP x.x.x.x.ec2.internal.38377 > x.x.x.x.ec2.internal.domain: 60645+ A? api.stage.internal.company.domain. (63)
18:04:25.168697 IP x.x.x.x.ec2.internal.domain > x.x.x.x.ec2.internal.38377: 60645 3/0/0 A 3.3.3.3, A 1.1.1.1, A 2.2.2.2 (111)

And another query would start for the same A records at 18:04:27. And so on and so on. 

The use cases I've seen documented for adding resolvers to 1.7 address ELBs specifically, but I'm left to wonder if there's some issue with the multiple IP addresses an ELB can have (depending on how many AZs/subnets you've set for it). A DNS query brings back a list of A records, as seen above, and according to this post HAProxy takes the first IP address it receives as the answer. If every time it makes a query it receives the IPs in a different order (as shown above), then it has to assume that the host changed and update itself accordingly. Then the hold timer doesn't come into play, because the last check came back as something other than invalid. Of course, the default hold periods for other responses are also 30 seconds, so that doesn't exactly pan out, but it's as close as I've gotten to an explanation for why this configuration causes so much rapid change.

I originally started down the road of adding resolvers to 1.7 because it seemed like a simpler solution than moving to 1.8 (which we would do eventually, but this was meant to be a quick band-aid to a known problem). Unfortunately the "quick" fix wasn't quick after all, and I may as well have been looking into moving to 1.8 to begin with.
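For completeness, the 1.8-style fix would look roughly like the sketch below: a server-template line provisions a handful of server slots from a single DNS name, so multiple A records from an ELB can each fill their own slot instead of fighting over one. Again, the names and the slot count are illustrative, not our actual config:

backend appx_back_end
    # creates slots appx1..appx3, filled from whatever A records the resolver returns
    server-template appx 3 api.stage.example.internal:443 check inter 2000 fall 5 resolvers awsvpc resolve-prefer ipv4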