Tuesday, October 2, 2012

The Trouble with VLANs

VLANs are one of those networking technologies that are deceptively simple. You think you understand how they work, what they do, and implementing them seems like a basic task, but you can quickly find yourself staring at a non-functioning segment of your network asking yourself, "What the hell happened?"

I'm pretty familiar with VLANs. I know that end devices like PCs attach to access ports because they're not 802.1Q-aware, that you connect switches with trunk ports to allow multiple VLANs to pass through, and that you need a Layer 3 device to communicate between VLANs. I've done router-on-a-stick and sub-interfaces. When I worked for Large Retail Grocer, PCI came roaring onto the scene, gnashing its teeth and making demands. We had to segment out our store networks. Like most people at the time, we had flat classful networks in place. VLANs seemed like the logical way to separate out our point-of-sale (POS) traffic from the rest of the network. We invested in a bunch of HP ProCurves to replace our unmanaged switches, planned out how we wanted our VLANs to look, subnetted the /24s, and went to work. Honestly, it was pretty simple work too...until we ran into a difficulty that will be pretty familiar to any network admin. See, we were a Cisco shop and had a Layer 3 Catalyst switch at our regional office that all of the store HPs needed to hook back into. And that wasn't working at all.

After a lot of back and forth with our national office (during which time they gave us plenty of grief for using HP in the first place instead of Cisco at the access layer), we figured out it had to do with the tagged/untagged thing that HP uses for their VLANs. Do I remember exactly what the problem and solution were? Nope, and at the time I recall that not only did I not really understand, I didn't care about the details. I had been traveling all over our region, working overnights and much of the following days as well, and what I wanted was for this to simply work. When we figured out that untagging a VLAN on the trunk port on the HP enabled it to talk to the Cisco, that was good enough for me.

VLANs are, for better or worse, the kind of thing that small businesses don't tend to bother with. No one cares if Accounting and HR have their resources available on the same subnet. They control access via domains and GPOs and ACLs and such. You can see that folder labeled "Salaries", but you can't open it. Good enough. In my previous place of employment we had a couple of VLANs strictly for traffic control, not security. Communication between them wasn't all that complicated to understand or configure.

Fast-forward to my current position.
VLANs and my surface knowledge of them are coming back to bite me in the rear but good. And guess who's having to deal with HP and Cisco interoperability again? Here's the rub: there are two VLANs in play. One connects the network to the outside world; it is the uplink to the ISP's router and runs across an access port. Switches in this network are sometimes connected using a trunk link, and other times connected using an access port. My first "What the...?!" moment came when I saw that some of the devices didn't have the same VLAN IDs configured. For example:



I always figured that VLAN IDs mattered. If you have VLAN 13 on the switch, but no VLAN 13 configured on the ASA, how in the world would traffic pass? Same thing with the switch and the load balancer (that cylindrical object on the left there). I've never seen mismatched VLAN IDs in a network, and I'd just assumed that it wouldn't work. However, it does work. Learning why required that I step back, forget everything I thought I knew about VLANs, and start over.

Probably the best way to illustrate what I've come to learn is to trace a packet through the network. Let's assume that the switch is just a layer 2 device, no routing or anything. The ISP's router is the gateway, the ASA's outside interface has an IP address in the same subnet as the ISP's router, and there's a NAT'd device behind the firewall, like an email server or something, that the outside world wants to connect to. 

A frame comes in to the ISP's router, encapsulating a packet with destination IP 33.44.55.66. The router recognizes the destination network as a directly attached one, so it sends out an ARP request to learn the MAC address that goes with that IP, so it can forward the packet onward and upward. The ARP request enters the port on the switch. The port looks like this:

interface GigabitEthernet0/26
 description WAN Connection
 switchport access vlan 13
 speed nonegotiate


VLAN 13 is the PVID (port VLAN ID). This doesn't affect the actual frame, since access ports don't tag frames and don't expect tagged frames either, but the switch keeps this info in mind when it determines which ports to forward out of. An ARP request is a broadcast, so the switch floods it out of every other port in VLAN 13; for a unicast frame, it would instead look in its MAC address table and send the frame out only the port where it has seen the destination MAC address. Either way, the frame leaves on an access port with PVID 13 and goes over the link to the ASA. The ASA receives the frame on its outside port, which is an access port assigned to VLAN 1. The frame has no tag information, so when the ASA receives it, it maps it to VLAN 1 since that's the VLAN of the port it came in on. Neither end of that connection cares what VLAN the access port on the other side is in. This is why you can have different VLAN IDs on different devices and traffic will still work. It obviously doesn't make sense to do that, and it makes management a potential nightmare, but there you have it. That's how that works.
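To make the access-port behavior above concrete, here is a small Python sketch. It is a toy model, not how any real switch or ASA is implemented: the `Frame` and `AccessPort` names are made up for illustration, and dropping tagged frames on an access port is a simplifying assumption (real platforms vary). The point is only that the VLAN ID never travels on the wire between two access ports, so each side classifies the frame with its own PVID.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Frame:
    """Hypothetical, stripped-down Ethernet frame."""
    dst_mac: str
    payload: str
    tag: Optional[int] = None  # 802.1Q VLAN ID; None means untagged


@dataclass
class AccessPort:
    """Hypothetical access port: one VLAN (the PVID), no tagging."""
    name: str
    pvid: int

    def send(self, frame: Frame) -> Frame:
        # Access ports always transmit untagged frames.
        return Frame(frame.dst_mac, frame.payload, tag=None)

    def receive(self, frame: Frame) -> Optional[int]:
        # Assumption for this sketch: tagged frames are simply dropped.
        if frame.tag is not None:
            return None
        # An untagged frame is classified into the port's own PVID.
        return self.pvid


# The switch side sits in VLAN 13, the ASA side in VLAN 1.
switch_port = AccessPort("switch access port", pvid=13)
asa_port = AccessPort("ASA access port", pvid=1)

arp_request = Frame("ff:ff:ff:ff:ff:ff", "who has 33.44.55.66?")
on_wire = switch_port.send(arp_request)   # no VLAN info leaves the switch
vlan_on_asa = asa_port.receive(on_wire)   # the ASA applies its own PVID

print(on_wire.tag)    # None
print(vlan_on_asa)    # 1
```

The switch thinks of the frame as VLAN 13 traffic and the ASA thinks of it as VLAN 1 traffic, and neither ever finds out about the disagreement.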

I hear you asking though: what about trunk ports? Well, this is where things really got interesting. As far as I was concerned, trunk ports carried multiple vlans, which were tagged to let the other switch know what vlan they came from. It's so much more than that though. Let's say you have an access port on one end and a trunk port on the other end. Or, let's say the switch needed to forward that frame to a trunk port instead of an access port in order to get to its destination. What happens then?

The switch again has kept track of the PVID of the ingress port for the frame, which would be VLAN 13 in this case. We'll pretend the switch has a trunk port connected to another switch's trunk port. This untagged frame will reach the trunk port, and when the trunk port sends it out over its link it will tag the frame to match the PVID from whence it came, provided the trunk port is allowed to carry that VLAN. When the frame reaches the second switch it will have an 802.1Q tag that identifies it as belonging to VLAN 13. The exception to this behavior is if the PVID of the incoming frame matches the native VLAN of the trunk port (this is what happens in default configurations when everything is riding on VLAN 1). In that case the frame is kept untagged and assumed to be part of the native VLAN for that trunk port. The other side of that trunk link will receive an untagged frame and will associate it with its own native VLAN. Another possibility is that the frame arrived at the switch already tagged (on another trunk, say) and its VLAN ID matches the native VLAN of the trunk port it has to leave on. In that case the trunk port will actually strip the VLAN ID information from the frame and send it untagged.
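The trunk rules in the paragraph above boil down to two small decisions, sketched here in Python. This is a simplified model under my own assumptions, not vendor code: the function names `trunk_egress` and `trunk_ingress` are hypothetical, and the allowed-VLAN check is applied the same way to native and tagged traffic for simplicity.

```python
from typing import Optional, Set


def trunk_egress(vlan: int, native_vlan: int, allowed: Set[int]) -> Optional[str]:
    """How a frame that the switch has classified into `vlan` leaves a trunk port."""
    if vlan not in allowed:
        return None              # the trunk doesn't carry this VLAN: not sent
    if vlan == native_vlan:
        return "untagged"        # native VLAN traffic goes out without a tag
    return f"tagged {vlan}"      # everything else gets an 802.1Q tag


def trunk_ingress(tag: Optional[int], native_vlan: int, allowed: Set[int]) -> Optional[int]:
    """Which VLAN a frame arriving on a trunk port is placed in."""
    vlan = native_vlan if tag is None else tag   # untagged means native VLAN
    return vlan if vlan in allowed else None


# A frame from the VLAN 13 access port leaving a trunk whose native VLAN is 1:
print(trunk_egress(13, native_vlan=1, allowed={1, 13}))    # tagged 13

# The default case, everything riding on VLAN 1:
print(trunk_egress(1, native_vlan=1, allowed={1, 13}))     # untagged

# An untagged frame arriving at the far end lands in that end's native VLAN,
# whatever the sender thought it was:
print(trunk_ingress(None, native_vlan=99, allowed={1, 13, 99}))  # 99
```

The last call is the trunk version of the mismatched access ports from earlier: an untagged frame carries no VLAN information, so the receiving side fills it in from its own configuration.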

Confused yet? If you go to Cisco's Learning Network, the forums are inundated with questions about vlans. It's actually quite difficult to find a definitive step-by-step explanation. I'm still not 100% positive that I've got it right, but at least I'm thinking about it the right way now.

