Sunday, May 5, 2013

Intermittent VPN Connectivity Issues

...and the art of troubleshooting

My brother-in-law, who is a network engineer for a large university and in charge of networking interns, once commented that troubleshooting is a lost art form. I couldn't agree more; I definitely struggle with it at times. There are two main stumbling blocks that tend to trip me up.

The first is getting distracted by red herrings. Have you ever looked at the logs on a machine when you're not actively troubleshooting? They're full of errors and warnings. If the server is working up to snuff, you never look at these logs too closely. Most of us have monitoring turned on anyway so we don't tend to look at the logs until something goes wrong. It then becomes a matter of trying to determine which, if any, of those errors and warnings are related to your problem. It's easy to see something and think that it could be the reason you're having issues, and you suddenly find yourself chasing down something that isn't related at all. Of course this is also a pretty strong argument for reviewing the logs every now and then simply as a matter of practice.

The second and sometimes more difficult problem is getting information. I think it's common knowledge that to effectively troubleshoot an issue, one of the first and most important steps is to gather information about the problem. If the problem is one that you as the administrator have discovered yourself, the information-gathering is much more straightforward. When it's been reported to you by someone else, it can be challenging to get what you need to proceed in a smart way.

How hard this is depends a great deal on the person doing the reporting. I have found that there are two categories of people you talk to when an issue arises. There's the person who knows nothing about the technology and doesn't care; they just want you to fix it. Getting information from this person is a little like pulling teeth because they are likely impatient and don't want to spend the time talking to you about it. Just get it done, man! For example, someone might report that the internet is down. They don't go into detail about what made them think this until you drag it out of them, and then you find that they actually can't send or receive email, or that they're using some intranet application, and it has nothing to do with the internet at all.

Then there's the second type of person, who is either technical or thinks they're technical. You'd think this would be a good person to troubleshoot for. The problem is that this type of person often comes to you with a preconceived idea of what the problem might be, which makes information-gathering challenging: they only want to talk about what they think the issue is, leaving no room for you to determine whether that is indeed the problem. When you start to ask standard questions they tend to get defensive, wondering why you don't simply treat the issue they've identified.

That being said, we recently had an issue with remote access VPN connectivity to our data center.
We had a setup that looked like this:

The ASA hung off a layer 2 switch. The default gateway for the network was a VIP, and the VIP in turn had a default gateway that pointed to the ASA, so any traffic on the network that had to go outside the network passed through the ASA. We then redid our network, getting rid of the Cisco switches and moving to HP ProCurves. The ASA is now directly connected to the HP core switch that is the default gateway for the network, using the same IP as the old gateway, and the HP switch uses the ASA as its default gateway. 

We didn't change any configuration on the ASA itself, and the only thing we did for the HP was change its IP to the old gateway IP and tag/untag the ports connected to the ASA as necessary. Despite this, we suddenly started seeing intermittent issues with remote access VPN connectivity. Sometimes you'd connect to the VPN and be unable to reach internal hosts, e.g. RDP to a Windows box or SSH to a Linux box. In fact, at times I was unable to even ping the HP switch. Disconnecting the VPN and reconnecting would sometimes resolve the issue. Other times you'd connect and there'd be no connectivity problems at all.

At first it was a symptom that I personally was not experiencing. I'd never had any issue with the VPN connection, and only my boss had reported having this problem. When I tried to get information from him about it he was vague, and didn't have many specifics such as exactly which server he had tried to get to, what he was doing, when it happened, etc. A few days later, though, another colleague on the operations team reported that he was also having issues with the VPN. His initial report was that he could not get email while on the VPN, though he later added that he too was having problems SSH'ing to servers. He also said that he sometimes lost connectivity in the middle of an SSH session.

Around the same time we started getting alerts for the HP. It was discarding a large number of frames without errors, i.e. there were no CRC errors to report. CRC errors would usually indicate a speed or duplex mismatch, but those counters were clear on both devices. Additionally, the drops showed up only on the Tx side of the HP, with no matching errors on the ASA.

With this kind of intermittent behavior my first thought turned to DHCP or MAC address table issues (e.g. a loop in the network). I checked the DHCP configuration on the ASA, which points to our domain controller and hands out IP addresses from within the same general scope as internal clients (a no-no, but it is what it is). I verified that the IP I had was unique in the network, and then did an ARP table lookup on both the switch and the firewall to make sure there weren't duplicate entries for my IP, the switch, or one of the servers behind the firewall. So, that ruled out DHCP and ARP issues.
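The duplicate check on the ARP tables can be sketched in a few lines of Python. This is only an illustration, not what I actually ran: it assumes the ARP entries have already been pulled off the device and parsed into (IP, MAC) pairs, and the function name and sample addresses are made up.

```python
from collections import defaultdict


def find_duplicate_ips(arp_entries):
    """Return {ip: sorted MACs} for every IP that maps to more than one MAC."""
    macs_by_ip = defaultdict(set)
    for ip, mac in arp_entries:
        # Normalize case so the same MAC written two ways is not a false hit.
        macs_by_ip[ip].add(mac.lower())
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}


# Hypothetical entries; real ones would come from the switch and the firewall.
sample_arp = [
    ("10.0.0.1", "00:11:22:33:44:55"),
    ("10.0.0.50", "aa:bb:cc:00:00:01"),
    ("10.0.0.50", "AA:BB:CC:00:00:02"),
]

for ip, macs in find_duplicate_ips(sample_arp).items():
    print(f"{ip} is claimed by {len(macs)} MACs: {', '.join(macs)}")
```

An empty result for both devices is what lets you cross ARP conflicts off the list.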

Coincidentally, I started to experience the same kind of intermittent issue that my coworkers were having. This was good because it allowed me to take detailed notes on what was going on when it happened. I noticed that every time it happened to me I had the same IP address. I checked the routing table and, sure enough, there was a static route for the IP I was using. L2L VPNs add their own static route into the routing table, but this one was persistent and had been manually entered for some reason. As soon as I removed this entry, the connection issues stopped.
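In hindsight, this is a check that could be automated: look for static routes whose destination falls inside the range handed out to VPN clients. Below is a rough Python sketch of that idea. It assumes the static route destinations have already been extracted as CIDR strings and that the VPN range can be written as a single CIDR block; the function name, the routes, and the pool are all hypothetical.

```python
import ipaddress


def routes_inside_pool(static_routes, pool_cidr):
    """Return the static route destinations that sit inside the VPN pool."""
    pool = ipaddress.ip_network(pool_cidr)
    hits = []
    for destination in static_routes:
        # strict=False tolerates entries written with host bits set.
        network = ipaddress.ip_network(destination, strict=False)
        if network.subnet_of(pool):
            hits.append(destination)
    return hits


# Hypothetical data: a default route, a stray host route, and an unrelated subnet.
static_routes = ["0.0.0.0/0", "10.0.0.57/32", "192.168.5.0/24"]
vpn_pool = "10.0.0.48/28"

for route in routes_inside_pool(static_routes, vpn_pool):
    print(f"static route {route} overlaps the VPN pool {vpn_pool}")
```

Any hit is worth a second look, since a client that is handed that address will have its return traffic follow the static route instead of the VPN session.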

Simple, right? 
