Remove Node from Hyper-V Failover Cluster

We have done this a few times in different ways with slightly different success rates. We were recently tasked with removing three old nodes from a six-node cluster, so I revisited best practices.

What did I find? Nada. Isn’t it wonderful when Microsoft and Google both let you down at the same time? In fairness, there are a few decent videos covering Hyper-V failover clusters, but almost all of the documentation deals with adding nodes or creating clusters. We needed to remove or delete.

  • Robert McMillen’s YouTube channel came the closest.
  • O’Reilly Learning is another decent resource.
  • Pluralsight also has several courses on Hyper-V.

The core issue is that we couldn’t find decent documentation of what each cluster action actually does. Failover clusters also support multiple workloads: we were interested in Hyper-V, but the majority of failover clustering material seems to cover SQL Server.

The one piece of documentation I could find from Microsoft is specific to PowerShell and provides almost no information on how the command actually works.

https://docs.microsoft.com/en-us/powershell/module/failoverclusters/remove-clusternode?view=win10-ps

The main Hyper-V clustering page seems to be this one.

https://docs.microsoft.com/en-us/windows-server/failover-clustering/failover-clustering-overview

Once again, the only mention of removing a cluster node was specific to VMware, which I find very amusing.

The two posts that we tried to work off of are:

https://community.spiceworks.com/topic/335420-removing-a-node-from-a-hyper-v-cluster

http://www.msserverpro.com/remove-node-windows-server-2016-hyper-v-cluster-destroy-cluster-procedure/

So, armed with almost nothing, we ventured off.

There are a couple of concerns. First, we need to migrate the VMs to other nodes. Second, some of the storage disks are owned by the nodes we want to delete.

We first attempted to migrate our VMs manually using live migration, but a couple of them appeared to fall back to the same node. If this happens to you, the node is likely failing migration, but the errors aren’t obvious; check Cluster Events in Failover Cluster Manager.

It also wasn’t clear how to change ownership of the disks. We saw a Move option on some of our disks, but not all of them showed it. This points to another issue that we will cover later. As a note, all of our disks are Cluster Shared Volumes, commonly referred to as CSVs.
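If you prefer scripting the migration over clicking through Failover Cluster Manager, something like this sketch works with the FailoverClusters module; the node and VM names are placeholders for your own:

```powershell
# List the roles still owned by the node we want to empty
Get-ClusterGroup | Where-Object OwnerNode -eq "HV-NODE4"

# Live-migrate a specific VM role to a surviving node
Move-ClusterVirtualMachineRole -Name "VM-APP01" -Node "HV-NODE1" -MigrationType Live

# If a VM keeps falling back, check the cluster's operational event log
Get-WinEvent -LogName "Microsoft-Windows-FailoverClustering/Operational" -MaxEvents 50 |
    Where-Object LevelDisplayName -in "Error","Warning" |
    Format-Table TimeCreated, Id, Message -AutoSize
```

The event log query surfaces the same entries you would see under Cluster Events in the GUI, which is where the real migration failure reasons hide.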

Our process:

  • In Failover Cluster Manager, go to the Nodes tree and right-click the node you wish to remove. Select Pause > Drain Roles. If the drain fails, resume the node; it stays paused even after a failure.
  • From More Actions, select Stop Cluster Service.
  • From More Actions, select Evict.
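For reference, the same three steps can be done with the FailoverClusters PowerShell module; the node name below is a placeholder:

```powershell
# 1. Pause the node and drain its roles (VMs live-migrate, CSV ownership moves)
Suspend-ClusterNode -Name "HV-NODE4" -Drain -Wait

# If the drain fails, resume the node -- it otherwise stays paused
# Resume-ClusterNode -Name "HV-NODE4" -Failback Immediate

# 2. Stop the cluster service on the node
Stop-ClusterNode -Name "HV-NODE4"

# 3. Evict the node from the cluster
Remove-ClusterNode -Name "HV-NODE4"
```

`-Wait` makes `Suspend-ClusterNode` block until the drain finishes, so you know it succeeded before moving on.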

The documentation we could find suggests there is more than one way to do this; it sounds like simply stopping the cluster service will also drain roles. We were uncomfortable taking that chance, so we followed the steps above.

Regarding draining roles: this will not only live-migrate your VMs to other nodes, it will also change ownership of your CSVs.
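If a CSV is still owned by the node after draining, or the Move option is missing in the GUI as it was for some of ours, ownership can also be moved from PowerShell; the disk and node names here are placeholders:

```powershell
# Show which node currently owns each CSV
Get-ClusterSharedVolume | Format-Table Name, OwnerNode, State

# Move one CSV to a node that is staying in the cluster
Move-ClusterSharedVolume -Name "Cluster Disk 2" -Node "HV-NODE1"
```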

Some failures we ran into:

  1. Failed migrations usually point to local media, for example a mounted DVD image or a drive that is local to the host. Correct the issue and try again.
  2. We had one issue with a Generation 2 VM (UEFI firmware) that had Secure Boot enabled, which prevented the VM from migrating. We used the PowerShell cmdlet Set-VMFirmware to disable Secure Boot, which allowed the VM to migrate. Note that the VM must be shut down to run this command.
  3. We had an issue with one of our CSVs that wasn’t immediately evident from the logs, which, by the way, are almost worthless. Thank you, Microsoft. The problem was that one of our CSVs was only presented to one of our nodes. We used the Nimble Connection Manager to add the missing volume to one of the nodes we were keeping, which resolved the issue. This was a little more involved for us because we hadn’t granted connection access to this volume on all of our servers. Just a note that you may need to make a change at the SAN level.
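The fixes for the first two failures can both be scripted. A sketch, with placeholder VM names; as noted above, the VM must be off before changing its firmware settings:

```powershell
# Failure 1: eject any mounted ISO/DVD that pins the VM to its current host
Get-VMDvdDrive -VMName "VM-APP01" | Set-VMDvdDrive -Path $null

# Failure 2: disable Secure Boot on a Generation 2 VM (VM must be shut down)
Stop-VM -Name "VM-APP01"
Set-VMFirmware -VMName "VM-APP01" -EnableSecureBoot Off
Start-VM -Name "VM-APP01"
```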

The last thing we did to clean up the servers after they were evicted from the cluster was to remove their connections to the SAN. This process will differ depending on your SAN.
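For an iSCSI SAN like ours, the Windows-side cleanup looks roughly like this; the target IQN and portal address are placeholders, and your vendor tooling (Nimble Connection Manager in our case) may handle some of this for you:

```powershell
# List current iSCSI sessions to confirm what the host is connected to
Get-IscsiSession | Format-Table TargetNodeAddress, IsConnected

# Disconnect from the SAN target and remove the target portal
Disconnect-IscsiTarget -NodeAddress "iqn.2007-11.com.nimblestorage:vol1" -Confirm:$false
Remove-IscsiTargetPortal -TargetPortalAddress "10.0.0.50" -Confirm:$false
```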

That’s it. I hope this helps someone else.