Friday, August 15, 2014

The case of the missing static IP address

This is what it's supposed to look like.
Ah, the joys of late nights at work. It's quiet enough to hear a server bearing spin, at least until you walk outside to go to the bathroom and the alarm goes off. It's good to work with smart, conscientious, security-minded people who do things like arm building alarms while IT is still working in the building. That's all right, though - at least I know nobody's going to take me by surprise while I'm here. The mad sprint down the stairs to the security panel so I can move around the building without the police showing up in short order (and I mean short - I could walk to the police station in under 15 minutes if I were so inclined) is just a nice exercise-inspiring perk of the job.
 
What made tonight particularly interesting was the maintenance I decided to perform. We have a small VMware ESX infrastructure at work and, for the past few weeks, it's been acting up a bit. While digging into the problem, I noticed that our stack switch was registering a number of discarded inbound packets; some Google-fu suggested that the problem could be due to a combination of factors, but most of them pointed at the network drivers for both the guests and the Broadcom-equipped ESX hosts. Since the virtual machines were configured with bog-standard Intel E1000 cards, and since VMware's documentation suggested that the VMXNET3 virtual NIC has a higher performance envelope, I decided to swap virtual NICs as well. Having more than a little experience with hardware-independent restores, I knew that changing virtual NICs in SQL and Active Directory servers was non-trivial - in my personal experience, I've found the following instructions useful:
  1. Reboot the machine into either Directory Services Restore Mode or Safe Mode with Networking.
  2. Apply the original network settings to the new NIC(s).
  3. Reboot and enjoy!
So that's what I did on our VMs this time as well. Upon rebooting, however, I noticed that my servers wouldn't keep their static IP addresses. This hit the SQL servers especially hard, for whatever reason; the problem was considerably less common among file servers and domain controllers. Instead of keeping its static address, the server would arbitrarily assign itself a 169.254.x.x (APIPA) address. Interestingly, I wasn't alone - in fact, this has apparently been a problem with ESX-hosted Windows servers for a while, if the 2006 date stamp on the start of that thread is any indication. Even weirder, other servers would keep the assigned IP address but randomly drop the gateway.
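
If you want a quick way to spot this symptom across a list of servers, here is a minimal sketch in Python. It only checks whether an address falls in the IPv4 link-local range (169.254.0.0/16), which is where Windows self-assigned addresses come from. The function name and the sample addresses are made up for illustration; nothing here comes from the VMware articles.

```python
import ipaddress


def is_self_assigned(address: str) -> bool:
    """Return True if address is an IPv4 link-local (169.254.0.0/16) address."""
    ip = ipaddress.ip_address(address)
    return ip.version == 4 and ip.is_link_local


# Hypothetical addresses, for illustration only.
for candidate in ("169.254.17.42", "192.168.10.25"):
    print(candidate, is_self_assigned(candidate))
```

A server reporting an address that passes this check has most likely lost its static configuration and fallen back to automatic addressing.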
 
The good news, if you want to call it that, is that these issues are common enough for VMware to offer KB articles on them (1016878 and 2012646).
What I ended up doing on the affected machines was closer to the spirit of the second article (2012646) than the first. As VMware suggested, I opened HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\Tcpip\Parameters\Interfaces on a working physical server with a static IP address and compared it against the same key on a misbehaving VM. Once I isolated which GUID corresponded to the network interface on the VM that needed to be configured, I found that the following registry values were missing:
  • Name: IPAddress
    Type: Multi-String Value (REG_MULTI_SZ)
    Data: Corresponds to each IP address used by the server, appears to be comma-delimited.
  • Name: DefaultGateway
    Type: Multi-String Value (REG_MULTI_SZ)
    Data: Corresponds to the IP address of the default gateway for the NIC. I don't use multiple gateways on any of my NICs, but NameServer and IPAddress appear to be comma-delimited, so I would assume this one would be as well.
  • Name: DefaultGatewayMetric
    Type: Multi-String Value (REG_MULTI_SZ)
    Data: 0
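
The comparison above can be sketched in Python, assuming you would rather script the check than eyeball it in regedit. This is only an illustration: `missing_values` and `read_interface_values` are hypothetical helper names, the sample data is invented, and the reader function relies on the standard-library `winreg` module, which exists only on Windows. The check treats a value as missing when it is absent or contains nothing but blank strings.

```python
# The three values found missing on the misbehaving VM.
REQUIRED_VALUES = ("IPAddress", "DefaultGateway", "DefaultGatewayMetric")


def missing_values(interface_values: dict) -> list:
    """Return the required value names that are absent or effectively empty."""
    missing = []
    for name in REQUIRED_VALUES:
        data = interface_values.get(name)
        # REG_MULTI_SZ data arrives as a list of strings; normalise
        # anything else so the emptiness check below stays simple.
        if data is None:
            data = []
        elif not isinstance(data, (list, tuple)):
            data = [data]
        if not any(str(item).strip() for item in data):
            missing.append(name)
    return missing


def read_interface_values(guid: str) -> dict:
    """Read every value under one interface GUID key (Windows only)."""
    import winreg  # standard library, available only on Windows

    path = (r"SYSTEM\CurrentControlSet\services\Tcpip\Parameters\Interfaces"
            + "\\" + guid)
    values = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, path) as key:
        value_count = winreg.QueryInfoKey(key)[1]
        for index in range(value_count):
            name, data, _value_type = winreg.EnumValue(key, index)
            values[name] = data
    return values


# Hypothetical sample data standing in for a healthy host and a broken VM.
working = {
    "IPAddress": ["192.168.10.25"],
    "DefaultGateway": ["192.168.10.1"],
    "DefaultGatewayMetric": ["0"],
}
broken = {"EnableDHCP": 0}

print(missing_values(working))
print(missing_values(broken))
```

On a real server you would pass the result of `read_interface_values` for the relevant GUID into `missing_values`. The sketch only reports what is missing; it deliberately does not write to the registry.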
After manually assigning the values above, I installed the hotfix recommended in KB 1016878 and rebooted each affected VM a few times to make sure the changes stuck. I'm happy to report that, at least so far, everything appears to be more or less stable.
