Wednesday, October 19, 2011

Arp, Arp, ARRGGHH (or why the CLI beats the GUI every...freaking...time)

Ready for a long one?

Our building recently had a scheduled power outage. I hate those. We don't have DRAC on our servers, so an outage means I have to physically come in and power up each server. Thankfully this was the first such outage during my stint at my company, and I hope it doesn't prove to be a frequent occurrence.

So I wander in intending to hit the power buttons and be done in 30 minutes tops. I should know better by now than to think that anything will ever turn out as simple as it sounds in theory; perhaps I'm an eternal optimist? At any rate, it became quickly apparent that this was not going to be an in and out operation (insert well-known phrase from The Office here :) ) when the first Linux server came up and stalled during boot due to an entry in fstab. It wasn't a critical mount; it was just a CIFS mount to a NAS, so I didn't imagine it would stall the boot process altogether if it couldn't make the connection. That is exactly what was happening, though. The Windows servers came up without a hitch, and my remaining Linux boxes all hung.
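
For reference, fstab lets you mark a network mount so it can't hang the boot. A minimal sketch of such an entry, with a made-up server name, share, and credentials file (whether nofail actually saves the boot depends on your distro's init scripts):

    # Hypothetical NAS mount. _netdev defers the mount until the network
    # is up, and nofail tells the boot not to abort if the mount fails.
    //nas01/share  /mnt/nas  cifs  credentials=/etc/cifs-creds,_netdev,nofail  0  0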


Of course I wasn't simply staring at the servers waiting for them to come up. I made sure that the firewall and switch were powered up and went to my desk to boot up my computer and test network connectivity. At about the time that I saw that the servers were not booting properly, I also found that I could not get out to the WAN. I could access the LAN with no problem, but pings to my gateway were intermittent, with more failures than successes. I did the basics: reset my network adapter, checked cables, rebooted the network equipment. No success. At this point I had two problems: servers not booting, and no network connectivity. Because the Linux servers seemed to be failing at initializing the Ethernet adapters, I determined that I should work on the network problem first, and that perhaps the booting problem would be resolved as a result.

Thus began the troubleshooting:

  • Pinging the IP of my switch. No problem. 
  • Pinging other hosts on the subnet. No problem.
  • Pinging the default gateway. Intermittent packet loss. 
  • Pinging hosts on other subnets. No replies. 
Okay, so clearly there's something going wrong with the routing. The switch is fine, the subnet is fine, but I can't traverse subnets. Traversing subnets is the work of the firewall, so the problem must be in the firewall. Did the reboot erase some configuration? Nah, I distinctly remember making a change earlier that week and applying it via the ASDM, so the memory had been saved recently. Let's check anyway. And while we're at it, let's check the switch and see if there are any errors being shown on the up-link port. Maybe there's some duplex mismatch going on; that could explain the intermittent nature of the failures. I accessed the GUI for the switch and found no errors or weird behavior from the up-link. Everything showed normal. I looked for a diagnostic that would let me ping the gateway from the switch, and the only thing their diagnostic utility did was check the cable! Useless, really, since it didn't even do that properly. I had no way to run ping tests from the GUI. This limitation is ultimately what led to the solution. But first, the firewall.
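
(For the record, the checks I wanted the GUI to give me look something like this on a switch with an IOS-style CLI. Our switch isn't Cisco and its syntax differs; port 3 is the uplink from this story and the gateway IP is made up:)

    switch> show interfaces GigabitEthernet0/3   ! CRC errors? duplex mismatch?
    switch> ping 192.168.1.1                     ! the ping test the GUI wouldn't do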

Here's where I learned a valuable lesson. There are two ways to connect to an ASA: direct console connection, or a terminal connection through ssh or telnet. If you're having difficulty with your network, a direct console connection is pretty much your best bet. No problem, right? Connect a laptop and get going. Sure, no problem if you have a laptop that's almost 10 years old. Laptops don't come with serial ports anymore. The last serial port I saw on a laptop was the Thinkpad I used while at Whole Foods. We have Thinkpads here, and they don't have serial ports. I had to go to our graveyard of desktops, and even then my pickings were slim because most of our desktops don't have serial ports either. I finally found one and went through the rigmarole of hooking up a keyboard, mouse, and monitor, only to find that Hyperterminal was missing! It's times like these that I could kiss my Droid 2. I did some incredible Googling on my phone and found an article that helped me install Hyperterminal pretty easily from an XP disc. That done, I was able to get into the ASA and make sure nothing funky had happened: show run to make sure the config hadn't changed in some weird way, show interface Ethernet # to check for CRC errors. I turned on logging temporarily, and frankly the output to the console was not only difficult to read, it didn't tell me anything more detailed than that traffic from the DMZ was being denied access to the inside network due to an ACL. This is pretty standard, since nothing in our DMZ is allowed to initiate a connection.
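
(The ASA side of that, roughly. These are standard ASA commands, but the interface name is just an example; yours is whatever your inside interface happens to be:)

    ciscoasa# show running-config           ! eyeball the config for surprises
    ciscoasa# show interface Ethernet0/1    ! CRC errors? input errors?
    ciscoasa# configure terminal
    ciscoasa(config)# logging enable
    ciscoasa(config)# logging console informational   ! the firehose I mentioned
    ciscoasa(config)# no logging console              ! ...and off again when done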

Pings from the ASA to the inside network were successful, pings to the outside world were successful, but pings to the switch were intermittent. clear arp got me a series of successful pings for a few seconds, then droppage. Clearly the issue was communication between the switch and the ASA. I swapped cables, used a different Inside interface on the firewall, tried a different port on the switch: same behavior, so it wasn't glitchy hardware.
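
(What that looked like from the ASA prompt, with a made-up switch IP:)

    ciscoasa# ping 192.168.1.2    ! the switch: mostly timeouts
    ciscoasa# clear arp
    ciscoasa# ping 192.168.1.2    ! clean replies for a few seconds...
    ciscoasa# ping 192.168.1.2    ! ...then back to timeouts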

I really wanted to test ping from the switch itself, so I dug up some documentation and telnetted to it. I checked the interfaces: no CRC errors here either. I began my ping tests and saw the same behavior as on the ASA, where pings to the subnet were fine but pings to the default gateway failed. Weird. Now what? Well, I'd cleared the arp table on the firewall; maybe I ought to try doing that on the switch. As with the firewall, that cleared it up for a short period and I got about 5 successful ping tests of 4 pings each, but then it dropped again. I did a show arp for giggles, and here's where things started to make sense. The arp table showed the wrong port responding for the default gateway. My up-link was on port 3, and it showed a response coming from port 36, the port the DMZ connection was plugged into. Did I have something on the DMZ that thought it had the same IP as the firewall? Impossible: everything is statically addressed, right? And there's no way it would have grabbed it through DHCP, an IP address on a different subnet. I checked the firewall config and verified that noproxyarp had been set. But that was pretty clearly the problem, because when I unplugged the connection to the DMZ, traffic started flowing again. Reconnected it: screeching halt. I reset the arp cache, and every time the same thing happened: it would initially show the correct entry, and then change to the wrong port. So now what?
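
(Roughly what I was seeing. The output format is invented, since every vendor prints this differently, and the MAC addresses and gateway IP are made up; the port numbers are the real ones from this story:)

    switch# clear arp
    switch# show arp
      192.168.1.1    0011.2233.4455    Port 3     <-- correct: the uplink
    ! ...a few seconds later...
    switch# show arp
      192.168.1.1    0011.2233.9999    Port 36    <-- the DMZ cable answering
                                                      for the gateway's IP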

Then it hit me. show vlan on the switch, and there it was. The DMZ port on the firewall was connected to a port on the switch that was tagged for VLAN 1, or the inside VLAN, instead of for the DMZ. I moved the connection to a DMZ-tagged port and everything started working as it was supposed to. 
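
(I fixed it by physically moving the cable, but on an IOS-style switch the config-side equivalent would be roughly this; again our switch's syntax differs, and the DMZ VLAN number here is made up:)

    switch# show vlan                    ! port 36 listed under VLAN 1, the
                                         ! inside VLAN, instead of the DMZ
    switch# configure terminal
    switch(config)# interface GigabitEthernet0/36
    switch(config-if)# switchport access vlan 20   ! 20 = hypothetical DMZ VLAN
    switch(config-if)# end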

At this point I was beyond done and wanted to go home, so I didn't think too much about it. I got the servers booted (one using Knoppix to edit /etc/fstab first, the others simply because they could initialize their NICs now) and chalked this whole thing up to me being careless and plugging the firewall into the wrong port. Something about this nagged at me, though, because I didn't start messing with the plugs until the problem had already presented itself. The mystery was solved the next day when others reported that they were having issues connecting to the internet. It turns out that the switch reverted to some older configuration when it rebooted. Why it does this is a mystery. We typically use the GUI to make changes (not that there have been many to make), and there's an 'Apply' button on each screen to commit them. Apparently this isn't enough to make the changes persist through a reboot; instead you have to get into the CLI and write the config to memory. Very weird, and terribly annoying.
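
(The moral, in command form. On most CLI-managed gear, the ASA included, the running config and the saved startup config are two different things, and only the latter survives a reboot. Exact syntax varies by vendor:)

    switch# copy running-config startup-config
    ! or, on many boxes, the short form:
    switch# write memory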

That's how I spent my Sunday. Good times. 
