Wednesday, April 22, 2015

Proxy ARP and the ASA

We got an ASA 5545-X to replace our 5505 (finally!). My task is to do the upgrade. I thought my biggest challenge would be translating the old-school NAT statements (8.2 and older code) into the new syntax of the 8.3+ releases.
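For anyone who hasn't made that jump yet, here's a rough idea of the difference, using made-up addresses and object names (10.0.0.10 inside, 203.0.113.10 as the public IP). On 8.2 and older, a one-to-one translation plus interface PAT looked something like this:

static (inside,outside) 203.0.113.10 10.0.0.10 netmask 255.255.255.255
nat (inside) 1 10.0.0.0 255.255.255.0
global (outside) 1 interface

On 8.3+ the same thing is expressed with network objects, something like:

object network web-server
 host 10.0.0.10
 nat (inside,outside) static 203.0.113.10
!
object network inside-net
 subnet 10.0.0.0 255.255.255.0
 nat (inside,outside) dynamic interface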



So my big plan was to simply configure the new ASA the same as the old with a few differences—namely the IPs assigned to the inside and outside interfaces. That way I could add the new ASA to the network, hanging off of our core switch just like the old one, and test that the config (L2L and IPsec VPN, NAT translation, routing, etc.) works. It'd look a little like this when I was done:

[diagram: the new ASA hanging off the core switch alongside the old one]

I tend to really, really dwell on things before I implement them though. Sometimes it's a bad thing, as in it takes me longer to complete a task, and other times it saves my bacon. This would be one of those times, because before I just threw the firewall on the network I stopped and thought, In what way could this backfire and cause disruption to the production network? More specifically I was wondering if there was any way that the new ASA would/could respond to requests meant for the old ASA. That's when I began to think about proxy arp, and developed a splitting headache (totally coincidental, I'm sure). I popped onto the Cisco support forums to craft a lengthy post, and when I didn't get responses fast enough for my liking I went to the Cisco IRC channel and asked. I got the following replies: 

Random1: never use proxy arp
for anything

Random2: turn it off

You win some, you lose some on IRC, amirite? I lost and got the IRC equivalent of "Nick Burns, Your Company's Computer Guy". I tried to ask some clarifying questions and got radio silence, so I went out and did a little more reading on my own. It's not a hard concept in the end, but few people seem to be able to clearly explain how it works and what its implications are, so I'm gonna try because I can do soooo much better. That was sarcasm.

Here's the rundown on proxy arp:

It exists on networking devices to let them answer ARP requests for hosts that live on interfaces/networks other than the one the request came in on. Confusing? Yeah, try reading about it on Cisco's site. Basically, if for some reason your host at 192.168.1.10 ARPs for its destination host at 12.2.3.4 directly instead of for its default gateway, the router sees that ARP request, and if it has a route to 12.2.3.4 out another interface it answers the request with its own MAC address and forwards the traffic along.
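On a plain IOS router you can see and toggle this per interface (proxy arp is on by default). Something like this, with the interface name just an example and the output trimmed:

router# show ip interface GigabitEthernet0/0 | include Proxy ARP
  Proxy ARP is enabled

router(config)# interface GigabitEthernet0/0
router(config-if)# no ip proxy-arp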

ASAs do the same sort of thing, but instead of being driven by the routing table it's driven by NAT statements, whether static or dynamic, and aliases. In a lot of standard ASA configurations, especially for smaller businesses, you'll have an IP address on your outside interface that also serves as your global NAT address for internal IPs. That's a scenario in which you need proxy arp enabled. I checked our existing ASA config with the show run all command and sure enough, the interfaces show no sysopt noproxyarp, which means proxy arp is enabled. That makes sense of course. We're doing a standard config, and if we didn't have proxy arp on we would not be able to pass traffic to any of our publicly accessible hosts (like web servers), and VPN wouldn't work either. Clearly the suggestion to "turn it off" was pretty short-sighted.
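If you want to check your own box, this is roughly what I was looking at; show run all exposes the defaults that a plain show run hides (interface names are examples):

asa# show running-config all sysopt
no sysopt noproxyarp inside
no sysopt noproxyarp outside

The double negative is what trips people up: "no sysopt noproxyarp outside" means proxy arp is enabled on the outside interface, while "sysopt noproxyarp outside" would disable it interface-wide.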

Now, this is not the case in every situation. If you're routing from your upstream to your ASA—say, your ISP routes your public block to your outside interface instead of ARPing for each address—then you can turn proxy arp off. If your NATs live on a different network from the IP of your interface, you can turn it off then as well. That's not the case in our house, though.
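Worth noting: on 8.3+ you don't have to kill proxy arp for a whole interface to handle those cases, because individual NAT rules take a no-proxy-arp keyword. A sketch with made-up object names, the kind of line you'd use for VPN identity NAT:

nat (inside,outside) source static LOCAL-NET LOCAL-NET destination static REMOTE-NET REMOTE-NET no-proxy-arp route-lookup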

What does this mean for me? It means my very simple plan is no longer so simple. I can turn off proxy arp and leave the NAT statements off of the new ASA and test that I can at least reach it from outside, but that won't help with the more complex scenarios such as testing NAT translation and routing to the internal network and across networks. I'm either going to have to configure it and pull the trigger, hoping it works, or figure something else out. Maybe it's time to check out GNS3 again?

Thursday, March 5, 2015

Wget Mystery

This is just a quick snippet of some one-off weirdness. I was trying to deploy the Community Edition of Aerospike via Chef. I created a recipe and tested it out with Vagrant successfully before pushing it up to our Chef server. I then attempted to deploy the recipe to our 8-node Aerospike cluster and got error messages on the extract step. The recipe looks like this:

# Where the tarball gets cached, plus a scratch directory to unpack it in
aerospike_tar_directory = Chef::Config[:file_cache_path]
aerospike_tar = "#{aerospike_tar_directory}/aerospike-server-community-3.5.3-el6.tgz"
aerospike_temp_dir = Dir.mktmpdir

# Download the tarball from the URL in the node attributes
remote_file aerospike_tar do
  source node[:aerospike][:url]
  mode '0644'
end

# Scratch directory to extract into
directory aerospike_temp_dir do
  owner 'root'
  group 'root'
  mode '0755'
end

# Unpack the tarball into the scratch directory
execute 'extract_aerospike' do
  command "tar zxf #{aerospike_tar}"
  cwd aerospike_temp_dir
  action :run
end

# Service account that will own the install
user 'aerospike' do
end

# Final install location
directory node[:aerospike][:dir] do
  owner 'aerospike'
  mode '0755'
  recursive true
end

# Move the extracted release into place
script 'move_aerospike' do
  action :run
  cwd aerospike_temp_dir
  interpreter 'bash'
  code <<-EOH
  mv #{aerospike_temp_dir}/aerospike-server-community-*/* #{node[:aerospike][:dir]}
  EOH
end

# Run the vendor install script, fix ownership, and clean up the RPMs
execute 'install_aerospike' do
  command "cd #{node[:aerospike][:dir]}; ./asinstall; chown -R aerospike:aerospike #{node[:aerospike][:dir]}; rm #{node[:aerospike][:dir]}/*.rpm"
end

# Start the service and enable it at boot
service 'aerospike' do
  action [:start, :enable]
end


The error message I was receiving indicated that the downloaded file, aerospike.tgz, was not in fact a tarball. Sure enough, I tested it and it was a text file with HTML in it, and the HTML said the resource wasn't found. I tested the same url directly from my Mac and got a properly formatted .tgz file. I then ran wget on the server and was confused to see that it did not download the file I was expecting; it instead downloaded a file with a different name altogether. I repeated this process a couple of times to verify that I wasn't hallucinating.
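A quick sanity check that would have saved me a step: the file command identifies contents regardless of extension, so a bad download shows up immediately. On the broken file it reported something like this instead of "gzip compressed data":

file aerospike-server-community-3.5.3-el6.tgz
aerospike-server-community-3.5.3-el6.tgz: HTML document text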

A little research led me to the flag --trust-server-names. This flag matters when the resource sits behind one or more redirects (which it did in this case). By default wget names the downloaded file after the last component of the URL you asked for; with --trust-server-names it uses the file name from the final redirection URL instead. Thus, if I did wget http://aerospike.com/download/server/latest/artifact/el6, without that flag wget would throw a file named el6 in my directory instead of the expected aerospike-server-community-*.tgz file.
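So the behavior looks like this (the URL is the real download link from above; the exact file name will vary by release):

# default: saved under the last component of the URL you typed
wget http://aerospike.com/download/server/latest/artifact/el6
# -> el6

# follow the redirects and keep the server's file name
wget --trust-server-names http://aerospike.com/download/server/latest/artifact/el6
# -> aerospike-server-community-3.5.3-el6.tgz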

My Chef recipe still isn't exactly working, but I've solved one small problem at least.

Thursday, September 25, 2014

Troubleshooting Haproxy

I received an email from our R&D team stating that they were seeing fewer HTTP requests for a particular client/campaign than they expected, on the order of ~30% fewer. One of the engineers looked at the haproxy stats page and recorded that for a given time period the frontend was logging 1,688,298 sessions vs. 1,186,079 sessions on the backend. I was asked to look into the cause of this discrepancy.

I took a look at the logs and felt like I wasn't getting much insight into what was happening between the frontend and backend. The logs were also being sent to /var/log/messages, so it was pretty noisy to look through. I checked haproxy.cfg and edited the logging to send it to its own file, and also added the log global parameter to the backend section. After a reload of haproxy I went back to the logs. They looked the same, just in a different spot. After making some inquiries on the haproxy IRC channel I learned that haproxy was indeed giving me information about the backend connections. The log lines were of a format I wasn't accustomed to, and there is actually a very detailed section on logging in the Haproxy documentation that explains what I was looking at.
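For reference, the change amounted to something like this, assuming rsyslog and the local2 facility (my choices for the example, not anything haproxy requires):

# haproxy.cfg
global
    log 127.0.0.1 local2

backend webservers
    log global

# /etc/rsyslog.d/haproxy.conf (rsyslog's UDP input has to be enabled too:
#   $ModLoad imudp and $UDPServerRun 514)
local2.*    /var/log/haproxy.log
# add local2.none to the /var/log/messages rule to stop the duplicates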

Here's an example of the numbers I was working with (this is after the reload):

Frontend
- 3434009 requests total
- 79263 3xx responses
- 1283482 4xx responses
- 30396 5xx responses

Backend
- 79306 3xx responses
- 136 4xx responses
- 30396 5xx responses

Those are numbers gathered from the stats page. I compared these against numbers I was getting from grepping haproxy.log. The numbers didn't match up across the board.
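For anyone curious, the counts below came from something like this; with this log format the status code happens to land in the 11th whitespace-separated field, so double-check the field number against your own lines:

awk '{print $11}' /var/log/haproxy.log | sort | uniq -c | sort -k2n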


Status code   Count
206               6
304          221635
400             230
404              11
405              60
408              21
500           15058
503           30396

The 503 count matched what I was seeing in the logs: 30396 in both places. That number makes sense in terms of what a 503 actually means. According to w3.org, a 503 error means:

The server is currently unable to handle the request due to a temporary overloading or maintenance of the server. The implication is that this is a temporary condition which will be alleviated after some delay. If known, the length of the delay MAY be indicated in a Retry-After header. If no Retry-After is given, the client SHOULD handle the response as it would for a 500 response.

The 503 lines from the logs look like this:

Sep 24 16:39:21 localhost haproxy[8562]: someip:54196 [24/Sep/2014:16:39:13.668] public webservers/srv1 8/8268/-1/-1/8276 503 109 - - SC-- 454/454/168/5/+5 0/37 {Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36|http://abunchofstuff} "GET /?requesthere HTTP/1.1"

According to the documentation, those flags ("SC--" in this line) describe the session state at termination: the "S" means the TCP session was aborted or refused by the server, and the "C" means haproxy was still waiting for the connection to the server to be established when that happened. In other words, haproxy never got a usable connection to the backend server and generated the 503 itself, which lines up with the counts above.

Monday, July 28, 2014

You Touched It Last, And Now You Own This Cluster @#$% (XenServer Dom0 Internal Error)

I was asked to spin up a vm and install WordPress. Piece of cake, LAMP then WP then done (with a solemn nod to my Golden Rule). Except...when I tried to create the VM I got this message:


[screenshot: XenServer error dialog reporting a Dom0 internal error while attaching the VM's block device]

Oh.

This is a four-host XenServer cluster that I have barely touched other than to do what I was trying to do here: spin up a vm. Back in the heady days of yore when we were a bursting at the seams team of five, this cluster belonged to Someone Else. Someone Else is gone, and it's now one of those un-loved, you-touched-it-you-own-it systems. You know what I'm talking about.

So, great. I hit the Googles because, while I knew enough to get the gist of the error message (some block device for my vm is attached to dom0, the control domain and Ground Zero for Xen), I had no idea what to do about it. This is apparently a common error message, as I found a bunch of posts about it. Unfortunately in most cases the OP would come back and say, "Hey, I rebooted it! All good now!" This made me smack my head and groan, because ideally there'd be a solution that did not require a reboot, this being a prod server and all. I wasn't alone in this, but no one had a clear answer. The few posts that did attempt to fix the error didn't go into enough detail for someone not intimately familiar with the inner workings of Xen's virtualization stack. For example, there were a lot of suggestions to simply run xenstore-ls and use that output to delete the offending device. Have you ever run xenstore-ls? There's a lot of output, and finding anything online that tells you exactly what you're looking at is pretty much impossible.

I tried cobbling together a bunch of commands from various sites and skimming the admin guide, but eventually gave up under the sheer weight of information overload and started considering the reboot. I looked into live migration (XenMotion, in Citrix parlance) because in theory we should be able to move guests around for just this kind of thing. This is what I saw:


[screenshot: yet another error, this time from the migration attempt]

You know what I didn't want to do? Start trying to solve yet another problem in the midst of the one I was currently struggling with. That meant a reboot was definitely going to take some prod vms down. As a last ditch effort I hit the Citrix forums.

I lucked out. Sometimes you can go to these forums and wait forever for someone to respond to a post, but I got a response fairly quickly here. Score 1 for Citrix. The poster suggested I do the following:

xe vbd-list vm-uuid=<UUID of VM>
xe vbd-unplug uuid=<UUID from previous command>
xe vbd-destroy uuid=<UUID from first command>
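(If you need the VM's UUID for that first command, xe vm-list will give it to you; the name-label here is just whatever your VM is called.)

xe vm-list name-label=<name of VM> params=uuid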


I ran those commands and tried to start the vm up again. Guess who came back to join the party?

[screenshot: the same Dom0 internal error, back again]

Looks like I'm diving down that rabbit hole after all. 

I checked the Networking tab for the VM as per this guide and found that, in comparison to other vms, this one had three Networks attached instead of two. I deleted the third and tried again. This time I received a bootloader error: "The bootloader for this VM returned an error -- did the VM installation succeed? Error from bootloader: no bootable disk". This makes sense if I think about it, because I did run a command with the word "destroy" in it a few minutes ago.

I deleted the vm and started over. This time I only got the network error message, not the Dom0 message, so that's a positive sign. I deleted Network 3 again (I will probably have to find out what this is and why it's being added to new VMs), and I finally got the VM to start up. Yay. Now I can actually, you know...do what I was supposed to be doing all this time. Progress!

Tuesday, April 1, 2014

Zabbix Performance

Email and monitoring are two services that I believe can and should be outsourced if you don't have the people power to dedicate to them fully. Like so many technologies they can be fairly simple to get up and running, but how often is the out-of-the-box install sufficient for your business? You can install Exchange fairly easily, and there are countless tutorials and articles out there to guide you through it, but what happens when it breaks or when some new functionality needs to be added? Monitoring systems love to advertise with catchy slogans like "Get your entire infrastructure monitored in less than 10 minutes". That's true...if all you're monitoring are basic Linux stats like disk space, cpu, memory, etc. Need to add JMX queries, or monitor a custom app? Now it's time to roll up your sleeves and get to work.

We were using LogicMonitor to monitor our entire infrastructure when I came onboard almost two years ago (my Old Boss had brought it in). Every device, Java app, file size, and custom query we needed to keep an eye on our production platform was handled by this app. My New Old Boss (who started about 6 months ago after my Old Boss quit, and has since left as well) came in and wanted to chuck the system. He was a big advocate of "if you can do it in-house, do it". LogicMonitor wasn't perfect, and we had our share of problems with it, but in hindsight a large part of that pain was that we hadn't followed through with setup. We were using default values and triggers for a lot of things, and they created noise. We never tuned the metrics we received from LogicMonitor's setup team to match our environment. Rather than invest the time to learn the system we had, we scrapped it and went with Zabbix.

I hate Zabbix. I hated it from the beginning. We didn't have much of a burn-in of the product; it got installed, we started "testing" it, and suddenly it was our production monitoring platform for all of the new equipment we were putting into production. This was part of a major rollout as we were introducing a new platform, so it all became one and the same. One of my biggest complaints about Zabbix was that it wasn't user-friendly. Adding a host and its metrics to Zabbix was a pretty unnatural multi-step process, as was adding users/groups and setting up alerts. For example, to set up a host with alerting or particular metrics you have to first add the host, then create items, then create triggers for the items, then create actions for the triggers. You can create host groups to aggregate hosts of like purpose, and you can create a template to group items, but you can't apply a template to a host group—you still have to apply templates to hosts individually. When I raised concerns about the complexity of its front-end—because really, a GUI should not be more difficult to use than, say, creating the same thing from the command line in Nagios—New Old Boss explained that Zabbix was "expert-friendly".

Months later, New Old Boss has moved on and I've inherited Zabbix, which I pretty much let New Old Boss handle while he was here, and what a bag of trouble it has turned out to be. "Bag of trouble" is of course polite blog speak for what I really want to call it.

Monday, February 3, 2014

What's Up With RAID?

Recently we received an alert that we were running out of disk space on our Exchange server. The first thought was "add more space". This is an older box, an HP DL360 G5, with a RAID HBA. After looking at its configuration details more closely we found out that the chassis was actually full. We were using a standard setup: two drives in RAID 1 for the OS and four drives for Exchange itself, three of them in a RAID 5 array with the 4th sitting in the array as a "hot spare". So there was no way to really expand the size of the array without doing some very intrusive and time-intensive things like adding larger drives one by one and rebuilding the array. There are no backups.

Now, let me be clear: I know that the lack of backups is crazy. What you see here is something that I think happens at a lot of smaller companies that are mainly Linux shops. Operations, IT—whatever you want to call it—focuses on hiring Linux admins, and Linux admins treat Windows like the plague. No one wants to be in charge of it, no one wants to touch it. This was the case at my old company, and remains the case here.

Our team made the decision to convert the hot spare into a full member of the array and expand it. Since we were using an HP P400i, we met the requirements to do this, so we did. Our manager was not pleased when he found out. He was concerned that we had taken away the hot spare, leaving us open to potential disaster if a disk failed. We rushed to assure him that a RAID 5 array can lose one disk without the whole server falling over, but he was still not exactly psyched that we no longer had that spare.

In the meantime, while the expansion and extension of the array had gone fine (the HP RAID utility now showed the additional drive space as available), the new space was not visible to the OS. This should not have been the case; the additional disk space should have shown up as unallocated in Disk Management, and we should have been able to extend the logical drive in Windows to use the remaining free space. Up until that point I had been going on historical knowledge of HP, the P400i card in particular, and the utilities HP provides for managing the array. Now I had to delve deeper into the technology, and what was supposed to be a quick fix turned into an interesting discovery.
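For the record, once Windows does see the new space, the extension itself is the easy part: either Disk Management (on newer versions of Windows) or diskpart. A sketch of what we expected to do, where the volume number is just an example:

diskpart
DISKPART> rescan
DISKPART> list volume
DISKPART> select volume 2
DISKPART> extend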

Apparently while I was traipsing around in the land of Linux here, where we used mainly commodity hardware and made use of LVM and software RAID for all of our servers except the legacy HPs—which were only being used for Windows services—RAID 5 had fallen out of favor. In my searches I happened across a Spiceworks thread wherein one of the posters opined loudly and often that RAID 5 was THE WORST choice for modern systems and likely to cause problems instead of mitigating them. This came as news to me. RAID 5 was the de facto standard for every implementation I did as a consultant with the outsourced IT provider I used to work for. They had a blueprint for how systems were to be deployed, and it was non-negotiable. This was only a few years ago. In light of all this I followed some links to find out more.

The basic workings of RAID 5 aren't a mystery. Parity is the key to its resiliency. For every stripe of data, parity is calculated with an XOR operation and written to one of the disks; the parity is spread across all of the disks, so any one disk can fail and the array can rebuild from the remaining data and parity once you add a replacement in. This all sounds good on paper, so why the dire warnings? Well, it apparently has more to do with the reliability of drives, UREs, and failure rates. URE is an acronym for unrecoverable read error. I'll admit that I was not versed on this particular concept, so let me explain it here in case you need a refresher.
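A toy example with two data disks and one parity disk makes the recovery obvious:

disk1:  1 0 1 1
disk2:  0 1 1 0
parity: 1 1 0 1    (disk1 XOR disk2)

lose disk2, then: disk1 XOR parity = 0 1 1 0  -> disk2 rebuilt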

Disks fail. Over time a standard spindle-based disk (as opposed to an SSD, which has its own wear specs) will start to be unable to read some of its data. In general you don't notice this, especially in some kind of RAID setup, because the spot is marked as bad and the world keeps on spinning. No pun intended. The rate folks seem to agree on for consumer SATA drives is one unrecoverable read error per 10^14 bits read, and 10^14 bits is roughly 12.5TB of data. If you have a 2TB array and read the whole thing six times over the course of its existence, statistically you can expect to hit about one URE. The idea, then, is that when a disk in a large RAID 5 array fails, the odds of hitting a URE somewhere during the rebuild are uncomfortably high. In the best-case scenario you simply lose some data; in the worst-case scenario the entire rebuild borks and you're left without an array, praying that you have a backup.

Now, these warnings seem to pertain to large-capacity (2TB+) SATA drives more than anything else. From what I've read SAS drives are less prone (rated around one URE per 10^16 bits), and the more drives you have in your array, the higher your chances of encountering a URE during a RAID recovery. On top of that, a hot spare is apparently not always a good thing. I'd always sort of shrugged at the idea of a hot spare with RAID 5; this is in fact the first setup I've encountered where anyone actually designated one. If a disk fails and you have monitoring and alerting set up properly, the chance of a second one failing in the time it takes you to replace the first seems trivial. Another article I read actually called the concept of a hot spare dangerous when combined with RAID 5 and URE factors. If a system automatically starts to rebuild an array onto a spare, as is the case with our P400i, you can end up with a complete array meltdown without even knowing a rebuild was under way. In that author's scenario a drive fails in a 3-drive array, the hot spare comes online and the array starts to rebuild, then hits a URE and the rebuild fails altogether even though no second drive failure actually occurred. So now not only do you have a down server, you had no warning about it and no chance to intervene. Intervening in this case would mean taking a backup of your data (or verifying an existing backup) before starting the rebuild, and being prepared to restore data if necessary.

Of course this made me start to second-guess our setup and evaluate the options. In general I think the warnings about RAID 5 aren't quite as dire as they're made out to be, at least in our situation. We are in fact using RAID 5 with four drives (or will be once the expansion finishes), but they are all 146GB SAS drives, nowhere near the big capacities being cited in the examples out there. So while the alarm bells aren't exactly ringing over our particular setup, it was valuable to learn about some of the other gotchas possible with RAID 5, and a future migration to another RAID level, especially as we look at increasing disk capacity for growth, may be worth planning for.

Wednesday, November 27, 2013

The Long and Dirty Story of Me and Perforce (with a cameo from our friends Backup and VMware)

This will read somewhat as a comedy. Considering how much time I have spent on a task that was meant to be trivial (or at least should have been trivial)...well, I guess I am laughing too, but it's more one of those shaking-my-head-in-dismay laughs.

I was tasked with making an image of our Perforce server. Okay, I've done this before. I used Mondo Rescue at my previous job to do this very thing, so I'm not anticipating trouble. First things first: I log in to get the lay of the land, because this box isn't/wasn't under the jurisdiction of Ops and I've never been on it before. I check it out and find that it's running RHEL 4. Oh dear. Our other boxes are CentOS and at least version 5, so this machine has not seen love in a long time. Out of curiosity I start poking around, checking out the specs, wanting to know what kind of hardware we're dealing with. Dmidecode tells me it's a virtual machine. Not only that, but VMware. We're running XenServer now, but no worries. We still have a couple of vSphere Clients installed on our limited supply of Windows boxes. Looks like we have one VMware host in the data center, so let's hit it.
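The dmidecode check, for reference; on a VMware guest the product name gives it away:

dmidecode -s system-product-name
VMware Virtual Platform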

Except...the perforce machine isn't on that host. Oh dear again. That means we have an undocumented VMware server somewhere. To the Googles!

First try was looking to see if there's any way to ascertain the host's name/IP from the guest itself. No such luck. Apparently this functionality is locked down for security reasons. Next stop: this fellow wrote a handy little VMware scanner tool that pings your network for live hosts and then makes a call to VMware's API to test whether the other end of the ping is indeed a VMware server. Brilliant. Found the little sucker easily with this tool. IP in hand, I went to log in...and found that none of the standard passwords worked. Cue a montage of me trying every single password in our password file until I stumbled across the one that works. Very efficient, I know.

Finally I'm able to connect to the host. At this point I figured I could simply clone the vm in question and call it a day. I could even use the cloned vm to work on converting it to XenServer. Here is where I ran into one of the many limitations of free ESXi vs the paid versions of VMware: you can't clone a running vm in ESXi. You can't make a clone, period. You can export to OVF, but you have to power the guest down first. When you're talking about a perforce server in heavy use by your development team, not to mention a host that hasn't been powered down in who knows how long, people get nervous, with good reason. Suddenly the scope increases. Before you can power down the guest and export to OVF, you have to take a backup and verify that it works so that you can recover the data if things go south.

Did you notice that I said "take a backup"? One thing I've noticed over the years is that you will rarely come across a corporate Windows environment that doesn't have some kind of backup going on, be it Backup Exec or even the built-in backup utility that Windows servers have. Linux environments? For some reason backups take a backseat and more often than not are left to a handful of scripts scattered here and there that get written and cronned, never to be heard from again. Such was the case here. There was in fact a cronjob that was supposed to be backing up the perforce checkpoint, journal, and versioned files, but it hadn't reliably run since December.
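For context, a job like that typically boils down to p4d -jc, which cuts a checkpoint and rotates the journal. A minimal sketch of the checkpoint/journal half with placeholder paths (the versioned files still have to be copied off separately):

# nightly checkpoint + journal rotation, compressed
30 2 * * * /usr/local/bin/p4d -r /perforce/root -J /perforce/journal -jc -z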