Monday, May 7, 2018

HAProxy DNS Resolvers

It's...been a while, no?

I waffled about making a new post and trying to do this on the regular again, because honestly the scope of my current position is so much more...well, breadth-y than my old job. I started blogging because I wanted to document the things I was learning and running across; the volume of those things has exploded with my current work. I'm moving in the micro-services world now, baby! Firehose central. You should see how many notes I have in Evernote now.

Enough exposition. You don't care about what I'm doing anyway. You wanna know what I have to say about HAProxy.

We have HAProxy docker containers running on CoreOS EC2 instances in an autoscale group. We're using a combo of the HAProxy multibinder, confd, and etcd to handle seamless reloads and enable dynamic configuration changes (all things that 1.8 can do natively, but that 1.7, the version we're running, couldn't at the time of deployment). Before you get all impressed, this isn't something I set up; it existed before I came on-board.

HAProxy is set up with multiple front-ends listening on different ports. Most of the back-ends that they attach to are AWS ELBs. ELBs can (and do) change IPs from time to time, and when that happens HAProxy can't resolve the back-end server name and just fails. Quick primer on how HAProxy does DNS (which is something I had to look up): HAProxy does a DNS lookup when it starts, using whatever nameserver is set in /etc/resolv.conf, and that's it. If one of your back-end servers changes IP you have to reload HAProxy to force it to refresh DNS—that is, unless you configure a resolvers section. A very simple example of such a configuration is here. The relevant portions of that config are

resolvers awsvpc
    nameserver vpc 10.0.0.2:53    # placeholder IP and port; use the VPC resolver from /etc/resolv.conf

server read01 read01.example.com:3306 check port 3306 resolvers awsvpc inter 2000 fall 5

where nameserver takes the following parameters (at a minimum): a friendly name for your nameserver, and the IP (and port) of said nameserver. In AWS, you can find out the nameserver to use by checking /etc/resolv.conf.

The second part of this is that the server line in your back-end config needs to include a "check" parameter. This is because health checks are what trigger the dynamic DNS lookups. Now HAProxy tries to resolve your back-end servers at startup and every time a health check runs. By default, health checks are triggered every 2000ms (2 seconds).
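Putting the two pieces together, a back-end stanza would look roughly like this (the resolver IP, server address, and section names are placeholders, not our real config):

```
resolvers awsvpc
    nameserver vpc 10.0.0.2:53      # the VPC resolver from /etc/resolv.conf (placeholder IP)
    hold valid 30s                  # trust a valid answer for 30s before re-resolving

backend app_back_end
    server app01 my-elb.example.com:443 check inter 2000 fall 5 resolvers awsvpc
```

The check parameter is what ties it all together: no health checks, no runtime DNS lookups.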

First thing I did was to set up a quick test bed to make sure that nothing I was going to do would cause an issue with traffic to HAProxy. We only have a production instance of it, filtering both staging and production traffic. Once I verified that the DNS flip worked and did not disrupt traffic, I was good to go to set it up on our real servers.

I started off with the staging back-ends—which already had health checks enabled—so all I had to do was add the resolvers block to the config and the resolvers parameter to the server lines. I applied my changes and waited and observed no issues, so I considered it good to go. As soon as I tried it on one of our prod back-ends we started getting Pingdom alerts about that particular site being unavailable. It was very flappy and intermittent, but seemed to be directly related. I rolled back my changes and the flapping stopped, so that seemed pretty conclusive to me.

I checked the logs to see what I could see, and found many entries that looked like this

May  4 17:47:29 haproxy[594]: Server stg_appx_back_end/appx is going DOWN for maintenance (DNS timeout status). 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
May  4 17:47:29 haproxy[594]: Server stg_appx_back_end/appx ('') is UP/READY (resolves again).

This was happening for all of the back-ends that had health checks and resolvers added. I simply hadn't noticed because we had no Pingdom alerts on staging endpoints. Some other observations:

  • they were happening every few seconds, which, if true, would mean we literally could not serve any traffic
  • they were resolving within the same timestamp as the reported down event

I ran a tcpdump on the host to see what DNS lookups were being done, and at no point did I see any DNS failures or timeouts. I did, however, see many more DNS lookups than I was expecting. Given the settings I'd used for HAProxy (hold timer set to 30s, interval left at the default 2 seconds), I'd expect a DNS lookup to happen every 30 seconds based on the documentation, where the formula is
[hold value] + ([hold value] % [inter])
30 + (30 % 2) = 30 + 0 = 30
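Sanity-checking that arithmetic in the shell, with the values from my config:

```shell
hold=30   # "hold valid" period in seconds
inter=2   # health check interval in seconds
echo $(( hold + hold % inter ))   # prints 30: the expected seconds between DNS lookups
```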

Instead I was seeing

18:04:23.648381 IP x.x.x.x.ec2.internal.38377 > x.x.x.x.ec2.internal.domain: 36306+ A? (63)
18:04:23.648729 IP x.x.x.x.ec2.internal.domain > x.x.x.x.ec2.internal.38377: 36306 3/0/0 A, A, A (111)
18:04:25.168288 IP x.x.x.x.ec2.internal.38377 > x.x.x.x.ec2.internal.domain: 60645+ A? (63)
18:04:25.168697 IP x.x.x.x.ec2.internal.domain > x.x.x.x.ec2.internal.38377: 60645 3/0/0 A, A, A (111)

And another query would start for the same A records at 18:04:27. And so on and so on. 

The use cases I've seen documented for adding resolvers to 1.7 address ELBs specifically, but I'm left wondering if there's some issue with the multiple IP addresses ELBs can have (depending on how many AZs/subnets you've set for them). A DNS query brings back a list of A records, as seen above, and according to this post HAProxy takes the first IP address it receives as the answer. If every query returns the IPs in a different order (as shown above), then HAProxy has to assume the host changed and update itself accordingly. The hold timer then doesn't come into play because the last check came back as something other than invalid. Of course, the default hold periods for other responses are also 30 seconds, so that doesn't exactly pan out, but it's as close as I've gotten to an explanation for why this configuration causes so much rapid change.
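To make the theory concrete, here's a toy simulation (IPs are made up): if the resolver rotates its answer section round-robin and you only ever keep the first A record, the "server" appears to change on every single query.

```shell
ips="10.0.1.10 10.0.2.20 10.0.3.30"
for q in 1 2 3; do
  set -- $ips                      # split the answer list into $1 $2 $3
  echo "query $q: first answer = $1"
  ips="$2 $3 $1"                   # rotate the answers, like round-robin DNS
done
```

Each "query" reports a different first answer, even though the full set of A records never changed.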

I originally started down the road of adding resolvers to 1.7 because it seemed like a simpler solution than moving to 1.8 (which we would do eventually, but this was meant to be a quick band-aid to a known problem). Unfortunately the "quick" fix wasn't quick after all, and I may as well have been looking into moving to 1.8 to begin with. 

Wednesday, January 4, 2017

EC2 Not Booting Main Course, With a Little LVM Sauce on the Side

One of our devs came to me to report that he couldn't log in to an EC2 instance that was acting as our FTP server. He'd tried to log in to investigate a customer report that they couldn't upload something. Before anything else, note the big failure here: the customer should never be the first person to tell you that your system is broken or down. If they are, your monitoring has failed.

I checked the monitoring tab for the instance (after verifying that I also could not SSH to the instance) and saw that CPU was pegged solidly at 100%—for like 3 days.

Way bad.

This is an EC2 instance that time forgot, so little information was known about it. My first thought was to create an AMI before we did anything else so that we wouldn't lose everything. I started the AMI creation but it hung for a long time. I deregistered it to stop the AMI creation process, and tried rebooting the instance. A reboot failed, but a stop/start worked and AWS showed the instance as running. It was still inaccessible via SSH though. One more stop/start to be sure, and time to dig deeper.

I was at a loss as to why the instance was inaccessible. I eventually found the system log in AWS and saw an entry of mountall: Event failed. This led me to understand that the issue was with mounting some storage (though it was unclear which). The instance had 3 EBS volumes attached: an 8GB root volume as /dev/sda1, a 60GB volume as /dev/sdf, and a 1TB volume as /dev/sdg. I followed Amazon's very useful guide for attaching secondary volumes: I detached the root volume from the instance and attached it to another instance, where I could investigate /etc/fstab. I found two suspect entries: one mounting /dev/sdb and one for an LVM mount. I commented both out, reattached the volume to the original instance, and tried again. Boot successful.
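The edit itself is trivial once the root volume is mounted on the helper instance. Here's a sketch run against a scratch copy of fstab (the entries are illustrative, not the real file):

```shell
# build a scratch copy of the problem fstab
mkdir -p /tmp/rescue/etc
cat > /tmp/rescue/etc/fstab <<'EOF'
/dev/sda1                  /     ext4 defaults 0 1
/dev/sdb                   /mnt  auto defaults 0 0
/dev/customer-ebsVG/custLV /data ext4 defaults 0 0
EOF
# comment out the two entries that were hanging the boot
sed -i -e 's|^/dev/sdb|#/dev/sdb|' -e 's|^/dev/customer-ebsVG|#/dev/customer-ebsVG|' /tmp/rescue/etc/fstab
grep -c '^#' /tmp/rescue/etc/fstab    # prints 2
```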

Now able to log in to the instance, I next had to reattach the volumes in question. I tried a simple sudo mount -a which failed with the message "mount: special device /dev/customer-ebsVG/custLV does not exist". That tells me that the problem lies with the LVM volume. I haven't touched LVM in years, so I had to refresh myself on commands and syntax.


ubuntu@host:~$ sudo lvdisplay
  Couldn't find device with uuid 'p9JTHn-YoKf-64IE-UnY5-K4v3-u1OQ-0GwFg3'.
  --- Logical volume ---
  LV Name                /dev/customer-ebsVG/custLV
  VG Name                customer-ebsVG
  LV UUID                xvFLZV-EltK-cubd-bh4s-VGec-oM2a-Vk1AV3
  LV Write Access        read/write
  LV Status              NOT available
  LV Size                1.39 TiB
  Current LE             364534
  Segments               2
  Allocation             inherit
  Read ahead sectors     auto

This confirms that LVM is in play, and tells me a little about the overall config including how big it should be.


ubuntu@host:~$ sudo pvdisplay
  Couldn't find device with uuid 'p9JTHn-YoKf-64IE-UnY5-K4v3-u1OQ-0GwFg3'.
  Couldn't find device with uuid 'p9JTHn-YoKf-64IE-UnY5-K4v3-u1OQ-0GwFg3'.
  --- Physical volume ---
  PV Name               /dev/sdg
  VG Name               customer-ebsVG
  PV Size               1.00 TiB / not usable 4.00 MiB
  Allocatable           yes (but full)
  PE Size               4.00 MiB
  Total PE              262143
  Free PE               0
  Allocated PE          262143
  PV UUID               9OrOr5-Jm7u-DYay-8Xbb-L4Kz-6b84-Z3LwvL

  Couldn't find device with uuid 'p9JTHn-YoKf-64IE-UnY5-K4v3-u1OQ-0GwFg3'.
  --- Physical volume ---
  PV Name               unknown device
  VG Name               customer-ebsVG
  PV Size               399.97 GiB / not usable 2.00 MiB
  Allocatable           yes (but full)
  PE Size               4.00 MiB
  Total PE              102391
  Free PE               0
  Allocated PE          102391
  PV UUID               p9JTHn-YoKf-64IE-UnY5-K4v3-u1OQ-0GwFg3

  "/dev/sdf" is a new physical volume of "60.00 GiB"
  --- NEW Physical volume ---
  PV Name               /dev/sdf
  VG Name
  PV Size               60.00 GiB
  Allocatable           NO
  PE Size               0
  Total PE              0
  Free PE               0
  Allocated PE          0
  PV UUID               Wgzu2H-JF0Y-BgsV-ONI3-ONkT-pt9f-1RceL9

This tells me which physical devices are included in the LVM volume group, and how large they are. As you can see, the volume group in question, customer-ebsVG, has two disks of different sizes allocated to it, but only one of them is currently visible. A 400GB device is missing.

I looked through the EBS volumes in the AWS GUI to see if perhaps one was simply not attached. I found a few 400GB volumes but after attaching them determined that they were not the droids I was looking for. Here's where it gets a little weird (shout-out to "Stuff They Don't Want You To Know"): remember in fstab there was also an entry for mounting a device /dev/sdb? So, I hadn't mounted it yet, figuring I needed to concentrate on the LVM device. At a loss in trying to find this mysterious VG member, I went ahead and manually mounted /dev/sdb. It shows up in df -h:

ubuntu@host:~$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             7.9G  2.3G  5.3G  30% /
none                  1.9G  124K  1.9G   1% /dev
none                  1.9G     0  1.9G   0% /dev/shm
none                  1.9G   60K  1.9G   1% /var/run
none                  1.9G     0  1.9G   0% /var/lock
/dev/sdb              394G  199M  374G   1% /mnt

Yep, /dev/sdb happens to also be 400GB. Is this the missing volume? If so, why is it mounted as a formatted disk instead of part of the LVM? Alright, whatever, I run pvdisplay again to see if this has magically solved the problem. It hasn't.

sudo vgchange -a y customer-ebsVG

The hope is that for some reason the volume group was simply marked as inactive, and that doing this will change that. It doesn't. I wonder then if maybe /dev/sdb is not supposed to be part of the volume group any more. Maybe it was, and it was removed, and LVM needs to be cleaned up. I tried sudo vgchange -a y customer-ebsVG --partial, and it activated the volume group, but there was no data. It seemed like everything I needed was on the missing device.

At this point I started looking into ways to repair/recover LVM. I found the backup files in /etc/lvm/backup, and from there could verify that the missing device was indeed /dev/sdb. sudo blkid showed me the UUIDs of the devices again, and I considered that perhaps the issue was that /dev/sdb had a different UUID than LVM was expecting, and that I might need to change the UUID. I started Googling how to do this. At the same time I reached out to a colleague to get his input. Sometimes explaining the situation to someone else clarifies things for you, and sometimes they'll spot something you missed.

I found some very basic instructions for changing the UUID here. I hesitated to pull the trigger because if I made things worse I wasn't sure how we'd recover, and I wasn't entirely sure what damage I might do. I talked with the dev to go over a recovery scenario, in which we'd have to create a new volume and recreate the directory structure for our clients, which we luckily had a roadmap for. Recreating the directory structure would take less than an hour, so it seemed like we were headed down that road.

My colleague in the meantime had found this Red Hat page, which ultimately had the same instructions I'd found, but in more detail. He hesitated as I did, but ultimately pulled the trigger and ran pvcreate --uuid "p9JTHn-YoKf-64IE-UnY5-K4v3-u1OQ-0GwFg3" --restorefile /etc/lvm/archive/ /dev/sdb, which put the LVM metadata back in order and brought the volume group back up.

So, not only did I get a little refresher course in LVM out of our issue (as well as some practice troubleshooting AWS), I also got yet another reminder about the importance of sometimes just pulling the trigger and not being so afraid of being wrong or messing up. I had the solution to the problem in my hand, but my fear of making a mistake stopped me from seeing it through.

Wednesday, April 22, 2015

Proxy ARP and the ASA

We got an ASA 5545-X to replace our 5505 (finally!). My task is to do the upgrade. I thought my biggest challenge would be translating the old school NAT statements (IOS 8.2 and older) into the new syntax of the 8.3+ releases.

So my big plan was to simply configure the new ASA the same as the old one with a few differences—namely the IPs assigned to the inside and outside interfaces. That way I could add the new ASA to the network, hanging off of our core switch just like the old one, and test that the config (L2L and IPsec VPN, NAT translation, routing, etc.) worked. It'd look a little like this when I was done:

I tend to really, really dwell on things before I implement them though. Sometimes it's a bad thing, as in it takes me longer to complete a task, and other times it saves my bacon. This would be one of those times, because before I just threw the firewall on the network I stopped and thought, In what way could this backfire and cause disruption to the production network? More specifically I was wondering if there was any way that the new ASA would/could respond to requests meant for the old ASA. That's when I began to think about proxy arp, and developed a splitting headache (totally coincidental, I'm sure). I popped onto the Cisco support forums to craft a lengthy post, and when I didn't get responses fast enough for my liking I went to the Cisco IRC channel and asked. I got the following replies: 

Random1: never use proxy arp
for anything

Random2: turn it off

You win some, you lose some on IRC, amirite? I lost and got the IRC-equivalent of "Nick Burns, Your Company's Computer Guy". I tried to ask some clarifying questions and got radio silence, so I went out and did a little more reading on my own. It's not a hard concept in the end, but few people seem to have the ability to clearly explain how it works and its implications, so I'm gonna try because I can do soooo much better. That was sarcasm.

Here's the rundown on proxy arp:

It exists on networking devices to let them answer ARP requests on behalf of hosts on interfaces/networks other than the one on which the request was received. Confusing? Yeah, try reading about it on Cisco's site. Basically, if for some reason your host ARPs for its destination host directly instead of for the default gateway, the router sees that ARP request and, if it has an interface routed to the destination's network, responds with its own MAC address and forwards the traffic along.

ASAs do the same sort of thing, but instead of being based on a routing table it's based on NAT statements (whether static or dynamic) and aliases. In a lot of standard ASA configurations, especially for smaller businesses, you'll have an IP address on your outside interface that also serves as your global NAT address for internal IPs. This is a scenario in which you need proxy ARP enabled. I checked our existing ASA config with the show run all command and sure enough, the noproxyarp option is disabled, which means proxy ARP is enabled. That makes sense, of course. We're running a standard config, and if we didn't have proxy ARP on, we would not be able to pass traffic to any of our publicly accessible hosts (like web servers), and VPN wouldn't work either. Clearly the suggestion to "turn it off" was pretty short-sighted.

Now, this is not the case in every situation. If your upstream routes to your ASA—say, your ISP routes to your outside interface instead of ARPing for it—then you can turn proxy ARP off. If your NAT addresses live on a different network from the IP of your interface, you can turn it off then as well. That's not the case in our house, though.
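For the record, the knob itself is a one-liner (interface name assumed to be "outside"; only safe in the routed scenarios above):

```
sysopt noproxyarp outside
```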

What does this mean for me? It means my very simple plan is no longer so simple. I can turn off proxy arp and not put in any NAT statements on the new ASA and test that I can at least reach it from outside, but that won't help with the more complex scenarios such as testing NAT translation and routing to the internal network and across networks. I'm either going to have to configure it and pull the trigger, hoping it works, or figure something else out. Maybe time to check out GNS3 again?

Thursday, March 5, 2015

Wget Mystery

This is just a quick snippet of some one-off weirdness. I was trying to deploy the Community Edition of Aerospike via Chef. I created a recipe and tested it out with Vagrant successfully before pushing it up to our Chef server. I then attempted to deploy the recipe to our 8-node Aerospike cluster and got error messages on the extract step. The recipe looks like this:

aerospike_tar_directory = "#{Chef::Config[:file_cache_path]}"
aerospike_tar = "#{aerospike_tar_directory}/aerospike-server-community-3.5.3-el6.tgz"
aerospike_temp_dir = Dir.mktmpdir

remote_file aerospike_tar do
  source node[:aerospike][:url]
  mode 00644
end

directory aerospike_temp_dir do
  owner 'root'
  group 'root'
  mode '0755'
end

execute 'extract_aerospike' do
  command "tar zxf #{aerospike_tar}"
  cwd aerospike_temp_dir
  action :run
end

user 'aerospike'

directory node[:aerospike][:dir] do
  owner 'aerospike'
  mode '0755'
  recursive true
end

script 'move_aerospike' do
  action :run
  cwd aerospike_temp_dir
  interpreter 'bash'
  code <<-EOH
  mv #{aerospike_temp_dir}/aerospike-server-community-*/* #{node[:aerospike][:dir]}
  EOH
end

execute 'install_aerospike' do
  command "cd #{node[:aerospike][:dir]}; ./asinstall; chown -R aerospike:aerospike #{node[:aerospike][:dir]}; rm #{node[:aerospike][:dir]}/*.rpm"
end

service 'aerospike' do
  action [:start, :enable]
end
The error message I was receiving indicated that the downloaded file, aerospike.tgz, was not in fact a tarball. Sure enough, I tested it and it was a text file with HTML enclosed; the HTML said the resource wasn't found. I tested directly from my local machine and confirmed that hitting that same URL on my Mac got me a properly formatted .tgz file. I ran wget on the server and was confused to find that it did not download the file I was expecting. It instead downloaded a file with a different name altogether. I repeated this process a couple of times to verify that I wasn't hallucinating.

A little research led me to the flag --trust-server-names. This flag matters when there are redirects on the resource (which there were in this case): by default, wget names the downloaded file after the last component of the original URL, while with --trust-server-names it uses the last component of the final, redirected URL. Since the download URL here apparently ended in el6, wget without that flag threw a file named el6 in my directory instead of the expected aerospike-server-community-*.tgz file.
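A quick sketch of that naming logic using shell parameter expansion (URLs are hypothetical stand-ins for the real download endpoint and its redirect target):

```shell
original="http://example.com/download/artifact/el6"
final="http://example.com/files/aerospike-server-community-3.5.3-el6.tgz"
# wget names the file after the last path component of:
echo "default:              ${original##*/}"    # el6
echo "--trust-server-names: ${final##*/}"       # aerospike-server-community-3.5.3-el6.tgz
```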

My Chef recipe still isn't exactly working, but I've solved one small problem at least.

Thursday, September 25, 2014

Troubleshooting Haproxy

I received an email from one of our R&D engineers stating that he was seeing fewer HTTP requests than expected for a particular client/campaign—on the order of ~30% fewer. He took a look at the HAProxy stats page and recorded that, for a given time period, the frontend logged 1,688,298 sessions vs. 1,186,079 sessions on the backend. I was asked to look into the cause of this discrepancy.

I took a look at the logs and didn't feel like they were giving me much insight into what was happening between the frontend and backend. The logs were also being sent to /var/log/messages, so they were pretty noisy to look through. I edited haproxy.cfg to send the logs to their own file, and also added the log global parameter to the backend section. After a reload of HAProxy I went back to the logs. They looked the same, just in a different spot. After making some inquiries on the HAProxy IRC channel, I learned that HAProxy was indeed giving some information about the backend connections. The log lines were of a format I wasn't accustomed to, and there is actually a very detailed section on logging in the HAProxy documentation that explains what I was looking at.
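For reference, the relevant bits of the logging change looked something like this (the syslog facility and section names here are illustrative, not our exact config):

```
global
    log 127.0.0.1 local2        # send logs to a dedicated syslog facility

defaults
    option httplog              # full HTTP log format

backend webservers
    log global                  # make the backend inherit the global log target
```

A matching rsyslog rule then routes the local2 facility to its own file instead of /var/log/messages.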

Here's an example of the numbers I was working with (this is after the reload):

Frontend:
- 3434009 requests total
- 79263 3xx responses
- 1283482 4xx responses
- 30396 5xx responses

Backend:
- 79306 3xx responses
- 136 4xx responses
- 30396 5xx responses

Those are numbers gathered from the stats page. I compared these against numbers I was getting from grepping haproxy.log. The numbers didn't match up across the board.


The 503 errors matched what I was recording in the logs, at 30396. These numbers make sense in terms of what the 503 error means. According to the HTTP/1.1 spec, a 503 error means:

The server is currently unable to handle the request due to a temporary overloading or maintenance of the server. The implication is that this is a temporary condition which will be alleviated after some delay. If known, the length of the delay MAY be indicated in a Retry-After header. If no Retry-After is given, the client SHOULD handle the response as it would for a 500 response.

The 503 lines from the logs look like this:

Sep 24 16:39:21 localhost haproxy[8562]: someip:54196 [24/Sep/2014:16:39:13.668] public webservers/srv1 8/8268/-1/-1/8276 503 109 - - SC-- 454/454/168/5/+5 0/37 {Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36|http://abunchofstuff} "GET /?requesthere HTTP/1.1"

According to the documentation, those SC-- characters are the session termination state: the first character ('S' here) indicates the session ended because of a failure or abort on the server side, and the second ('C') indicates it happened while HAProxy was still waiting for the connection to the server to establish. In other words, HAProxy couldn't connect to the backend server, which lines up with the 503s.

Monday, July 28, 2014

You Touched It Last, And Now You Own This Cluster @#$% (XenServer Dom0 Internal Error)

I was asked to spin up a vm and install WordPress. Piece of cake, LAMP then WP then done (with a solemn nod to my Golden Rule). Except...when I tried to create the VM I got this message:


This is a four-host XenServer cluster that I have barely touched other than to do what I was trying to do here: spin up a vm. Back in the heady days of yore when we were a bursting at the seams team of five, this cluster belonged to Someone Else. Someone Else is gone, and it's now one of those un-loved, you-touched-it-you-own-it systems. You know what I'm talking about.

So, great. I hit the Googles because, while I know enough to get the gist of the error message (some block device for my vm is attaching to dom 0 which is the native domain, Ground Zero for Xen), I had no idea what to do about it. This is apparently a common error message as I found a bunch of posts about it. Unfortunately in most cases the OP would come back and say, "Hey, I rebooted it! All good now!". This made me smack my head and groan because ideally there'd be a solution that did not require a reboot, this being a prod server and all. I wasn't alone in this, but no one had a clear answer. The few posts that did attempt to fix the error didn't go into enough detail for someone not intimately familiar with the inner-workings of Xen's virtualization stack. For example, there were a lot of suggestions to simply run xenstore-ls and use that output to delete the offending device. Have you ever run xenstore-ls? There's a lot of info, and finding anything online that tells you what exactly you're looking at is pretty much impossible.

I tried cobbling together a bunch of commands from various sites and trying to skim the admin guide, but eventually gave up under the sheer weight of information overflow and started considering the reboot. I looked into live migration (or whatever Xen's version of that is) because in theory we should be able to move guests around for just this kind of thing. This is what I saw:

You know what I didn't want to do? Start trying to solve yet another problem in the midst of the one I was currently struggling with. That meant a reboot was definitely going to take some prod vms down. As a last ditch effort I hit the Citrix forums.

I lucked out. Sometimes you can go to these forums and wait forever for someone to respond to a post, but I got a response fairly quickly here. Score 1 for Citrix. The poster suggested I do the following:

xe vbd-list vm-uuid=<UUID of VM>
xe vbd-unplug uuid=<UUID from previous command>
xe vbd-destroy uuid=<UUID from first command>

I ran those commands and tried to start the vm up again. Guess who came back to join the party?

Looks like I'm diving down that rabbit hole after all. 

I checked the Networking tab for the VM as per this guide and found that, in comparison to other vms, this one had three Networks attached instead of two. I deleted the third and tried again. This time I received a bootloader error: "The bootloader for this VM returned an error -- did the VM installation succeed? Error from bootloader: no bootable disk". This makes sense if I think about it, because I did run a command with the word "destroy" in it a few minutes ago.

I deleted the vm and started over. This time I only got the network error message, not the Dom0 message, so that's a positive sign. I deleted Network 3 again (I will probably have to find out what this is and why it's being added to new VMs), and I finally got the VM to start up. Yay. Now I can actually do what I was supposed to be doing all this time. Progress!

Tuesday, April 1, 2014

Zabbix Performance

Email and monitoring are two services that I believe can and should be outsourced if you don't have the people power to dedicate to them fully. Like so many technologies they can be fairly simple to get up and running, but how often is the out-of-the-box install sufficient for your business? You can install Exchange fairly easily, and there are countless tutorials and articles out there to guide you through it, but what happens when it breaks or when some new functionality needs to be added? Monitoring systems love to advertise with catchy slogans like "Get your entire infrastructure monitored in less than 10 minutes". That's true...if all you're monitoring are basic Linux stats like disk space, cpu, memory, etc. Need to add JMX queries, or monitor a custom app? Now it's time to roll up your sleeves and get to work.

We were using LogicMonitor to monitor our entire infrastructure when I came onboard almost two years ago (my Old Boss had brought it in). Everything we needed to keep an eye on our production platform (every device, Java app, file size, and custom query) was handled by this app. My New Old Boss (who started about 6 months ago after my Old Boss quit, and has since left as well) came in and wanted to chuck the system. He was a big advocate of "if you can do it in-house, do it". LogicMonitor wasn't perfect, and we had our share of problems with it, but in hindsight a large part of that pain was that we hadn't followed through with setup. We were using default values and triggers for a lot of things, and they created noise. We didn't tune the metrics we received from LogicMonitor's setup team to match our environment. Rather than invest the time to learn the system we had, we scrapped it and went with Zabbix.

I hate Zabbix. I hated it from the beginning. We didn't have much of a burn-in of the product; it got installed, we started "testing" it, and suddenly it was our production monitoring platform for all of the new equipment we were putting into production. This was part of a major rollout as we were introducing a new platform, so it all became one and the same. One of my biggest complaints about Zabbix was that it wasn't user-friendly. Adding a host and its metrics to Zabbix involved a pretty unnatural series of steps, as did adding users/groups and setting up alerts. For example, to set up a host with alerting on particular metrics you have to first add the host, then create items, then create triggers for the items, then create actions for the triggers. You can create host groups to aggregate hosts of like purpose, and you can create a template to group items, but you can't apply a template to a host group—you still have to apply them to hosts individually. When I raised concerns about the complexity of its front-end—because really, a GUI should not be more difficult to use than creating the same thing from the command line in, say, Nagios—New Old Boss explained that Zabbix was "expert-friendly".

Months later, New Old Boss has moved on and I've inherited Zabbix, which I pretty much let New Old Boss handle while he was here, and what a bag of trouble it has turned into. "Bag of trouble" is of course polite blog speak for what I really want to call it.