Server virtualisation in its current state is pretty much done. The market is mature and critical tier one workloads are absolutely at home on tier one virtualised platforms such as VMware and Hyper-V. Virtualisation tends to be a very useful technology in larger organisations for what most administrators would say are the wrong reasons.
People, management especially, tend to think they can run anything in a virtual estate. People come up with great ideas like virtualising cluster nodes and “test configurations” introducing single points of failure in order to get the job done quickly. “We can fix it later” is the usual cry. Getting a VM guest stood up is easy. Orders of magnitude quicker than getting bare metal hosts bought and racked. Politics at work don’t help either.
One of the killer selling points of virtualisation is that when a physical cluster node needs fixing, upgrading or taking out of service, it is a trivial matter to just migrate the host's virtual machines onto another cluster node. Virtualisation platforms on the whole work so well that management tend to view them as real metal, and this is where the problems start to seep in. Compromises creep in because management force the issue, ignorant of what the impact on the virtualisation platform will be.
These forced implementations tend to be things that break the whole VM-to-host independence and cause problems that impact not only the servers and services but ultimately your reputation with other groups and customers.
A classic example of this behaviour is when DBAs or other managers want a new application or database cluster stood up quickly because: “The installation engineer arrives tomorrow, we need it yesterday.” This time scale doesn't really allow for bare metal purchase and racking.
Running clusters as virtualised guests is an age-old issue that has dogged virtualised platforms for several generations of virtualisation. The faux clusters usually utilise shared SCSI bus technology, with their nodes sitting on different hosts. “Big wow,” you may say, but it has a direct and detrimental effect on the ability to manage the underlying host cluster.
Once the faux cluster is up and live, it breaks one of the main tenets of virtualisation. It has the effect of tying specific virtual guests to specific host nodes and preventing migration.
No longer can you migrate virtual machines between hosts without first powering off the faux cluster node on the host in question.
Hardware errors can no longer be easily fixed whilst remaining transparent to the VM users. You now need approval from the virtual cluster owner to power down the offending node before you can do any work on the host.
In big business, where everything is change controlled to the smallest degree, changes that require host outages can cost several hundred pounds by the time all the work is completed. Failing a piece of work because there was a faux cluster node on the host is seen as a big issue.
Sure, the odd virtualised cluster is perhaps acceptable, but when you start to scale to hundreds of hosts, thousands of guests and several faux clusters, these issues start to become a real pain for the administrator who has to work around them. Before planning any work on a host you need to check whether there are any faux cluster nodes on it. All these little issues start to add up.
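That pre-work check is the sort of thing worth scripting once the estate gets big. What follows is only a minimal sketch in Python, assuming you can export your inventory as a list of guests, each with a host name and a flag saying whether the guest uses a shared SCSI bus. The `Guest` record, its field names, the host names and the sample data are all made up for illustration and are not tied to any vendor's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guest:
    # Hypothetical inventory record; the field names are assumptions,
    # not a real hypervisor schema.
    name: str
    host: str
    shared_scsi_bus: bool  # True when the guest is a faux cluster node


def blockers_for_host(guests, host):
    """Return the names of guests on `host` that cannot be live migrated
    and so need powering off (with owner approval) before host work."""
    return sorted(g.name for g in guests if g.host == host and g.shared_scsi_bus)


# Made-up sample inventory, for illustration only.
INVENTORY = [
    Guest("web01", "esx-a", False),
    Guest("sqlclu-n1", "esx-a", True),
    Guest("sqlclu-n2", "esx-b", True),
    Guest("app07", "esx-c", False),
]

if __name__ == "__main__":
    for host in ("esx-a", "esx-c"):
        blockers = blockers_for_host(INVENTORY, host)
        if blockers:
            print(f"{host}: needs owner approval for {', '.join(blockers)}")
        else:
            print(f"{host}: clear to migrate")
```

Run against a real inventory export before raising the change, a check like this at least means the faux cluster node is found at planning time rather than halfway through the work.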
Another limiting factor is the use of “special” networks that are only available on one host, typically because they function as an internal heartbeat. Usually this is sheer laziness disguised as “It’s only a test.” All the guests that need this “test” network are then tied to the single host in question. It is easy enough to do properly but people get lazy.
The upshot is that, yet again, it hampers the underlying host and adds yet more pain points, and the host is not able to function as a normal cluster member. An internal-only network also means that should the host fail, those machines won't be able to start up properly elsewhere, as one of the networks will be absent.
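Guests pinned by a host-local network can be spotted the same way. The sketch below is again hypothetical Python, assuming you have two exports to hand: which networks each host offers and which networks each guest needs. The function names, host names and network names are invented for the example.

```python
def eligible_hosts(required_networks, host_networks):
    """Return the hosts that offer every network the guest needs."""
    needed = set(required_networks)
    return sorted(h for h, nets in host_networks.items() if needed <= set(nets))


def pinned_guests(guest_networks, host_networks):
    """Map each guest that has fewer than two eligible hosts to those hosts.

    Such a guest has nowhere to migrate to, and nowhere to restart
    if its one eligible host fails.
    """
    pinned = {}
    for guest, nets in guest_networks.items():
        hosts = eligible_hosts(nets, host_networks)
        if len(hosts) < 2:
            pinned[guest] = hosts
    return pinned


# Made-up sample data: "test-heartbeat" exists on one host only.
HOST_NETWORKS = {
    "esx-a": {"prod", "backup", "test-heartbeat"},
    "esx-b": {"prod", "backup"},
    "esx-c": {"prod", "backup"},
}
GUEST_NETWORKS = {
    "web01": {"prod"},
    "testclu-n1": {"prod", "test-heartbeat"},
    "testclu-n2": {"prod", "test-heartbeat"},
}

if __name__ == "__main__":
    for guest, hosts in sorted(pinned_guests(GUEST_NETWORKS, HOST_NETWORKS).items()):
        print(f"{guest} can only run on: {', '.join(hosts) or 'no host at all'}")
```

Anything this reports is a guest that will stay down if its one eligible host dies, which is a useful list to put in front of whoever insisted it was only a test.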
So, given these issues, what can you as an admin do? To be honest, your options are limited. The key to making progress is helping management to understand the impact of what they are asking. For the ones that are not really techie, that may involve explaining that placing these virtualised cluster solutions on hypervisor platforms means they are making things difficult for themselves.
I know techies are not always comfortable going against management wishes but if you don't talk about these issues they will persist. Without management backing when you say no, you will get railroaded into it.
As to technical solutions, even at scale they are limited. Good design and keeping away from cutting corners is a first step - but as to stopping virtual clusters ending up on the machine? No one seems to have the answer. One of the crazier ideas I’ve seen bandied around is a cluster purely for virtualisation, but I for sure wouldn't want to try and organise putting that into maintenance mode.
One glimmer of hope is that hypervisor manufacturers are aware of this issue, but I haven't had time to examine in detail the technology needed to manage virtualised clusters whilst being able to maintain guest-to-host independence.
As to how they work and how efficient they are, only time will tell. ®