Hyper-V Horror Story
Written by Tim Wray   
Thursday, 13 December 2012 10:24

I don't have a lot of faith in most virtualization in production environments lately. Of course, I have not used VMware. I have never even tried it out in testing. I know this is the virtualization product that most IT shops use, and maybe I'll give it a shot eventually. This, though, is a story about Microsoft Windows Server 2008 R2 (w/SP1) running Hyper-V, Microsoft's server virtualization platform.


I do maintain at least one Dell PowerEdge R510 in my daily activities, and it runs the Hyper-V component of Windows Server 2008 R2 SP1. This is a little story about this server and a battle I waged with it last week.


We have three child VMs on this particular server: two run Windows Server 2003 R2 and one runs Windows Server 2008 R2. I have had no trouble whatsoever from the Server 2008 VM and one of the Server 2003 VMs (and they continue to do well), but the other one is evil, and as of this writing it is retired.


This particular Server 2003 VM was host to some software that collects data from automation PLCs on our production floor and stores it in a Microsoft SQL Server 2005 database, also hosted on the same VM.


The issue, which started at some point in the past three months, was that once we started heavy production on this line, the VM would stop passing IP traffic reliably. There were no errors of any sort in the event logs on either the host or the child VM, and the adapter continued to show as connected in the child VM. Some traffic initiated from within the child would still get out and back, but if you tried to ping the server from elsewhere on the network, it would not respond. Occasionally the child VM could not even ping outbound to the PLC that was needed to trigger and feed the data into the database, which, in production, is a major issue.


Rebooting the child VM would fix the issue, but it usually returned anywhere from 8 hours to 3 days later.
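
For anyone chasing a similar intermittent failure, a small reachability logger run from another machine on the network can at least timestamp when the VM stops answering pings, which makes it easier to line the outages up with production load. The following is a minimal Python sketch and was not part of the troubleshooting described here; the function names, the 30-second interval, and the reliance on the system `ping` command's exit code are all assumptions made for illustration.

```python
import subprocess
import sys
import time
from datetime import datetime


def ping_once(host, timeout_s=2):
    """Send one ICMP echo via the system ping command; True if it exits with 0.

    Assumption: Windows or Linux style ping flags. Other platforms (macOS, for
    example) interpret the timeout flag differently. The exit code is only a
    rough signal, so treat the result as a hint rather than proof.
    """
    if sys.platform.startswith("win"):
        cmd = ["ping", "-n", "1", "-w", str(int(timeout_s * 1000)), host]
    else:
        cmd = ["ping", "-c", "1", "-W", str(int(timeout_s)), host]
    result = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode == 0


def monitor(host, interval_s=30, rounds=None, probe=ping_once, sleep=time.sleep):
    """Probe host repeatedly; print and return each UP/DOWN state change.

    rounds=None loops until interrupted. probe and sleep are injectable so the
    loop can be exercised without a network.
    """
    changes = []
    previous = None
    done = 0
    while rounds is None or done < rounds:
        reachable = probe(host)
        if reachable != previous:
            # Record only transitions, so the output stays short over days.
            stamp = datetime.now().isoformat(timespec="seconds")
            print(stamp, host, "UP" if reachable else "DOWN")
            changes.append((stamp, reachable))
            previous = reachable
        done += 1
        if rounds is None or done < rounds:
            sleep(interval_s)
    return changes
```

Calling `monitor("192.0.2.10")` (a placeholder address) and redirecting the output to a file would leave one line per outage start and recovery. Only state changes are logged, so a failure that shows up anywhere from 8 hours to 3 days after a reboot does not get buried in thousands of identical lines.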


Naturally, I checked the physical network, but nothing else on the rather large infrastructure was exhibiting any routing or traffic flow problems, so I pretty quickly got back into Hyper-V to poke around. I also started the expected online research, which took me to this Microsoft patch, KB2263829, and also to this blog posting, which mentioned that other people had encountered a similar issue. Theirs differed only in that the network adapter in the child VM would actually disconnect, or disconnect and reconnect, whereas mine stayed connected according to the event log data on the child VM.


Finally, after losing annoying amounts of data and thinking the issue was fixed about three times, only to have it come back, I pulled a spare Dell PowerEdge 1850 I had on a shelf for backup and moved the services and databases off the VM onto that physical server. Problem solved.


Looking back at the situation, I honestly never found the actual issue, which is irritating. The other VMs on that server kept providing their services properly while this was going on, which makes me want to blame that particular OS install. On the other hand, the other VMs don't have the traffic load the failing one did: the database is not only being fed 24 hours a day, it is also used heavily for reporting on that data via PHP on our intranet.


Unless I find something more online, I will not know what caused this issue, but I am leaning toward the large traffic load on the child VM. I will report back in this article if I see this behavior from the other VMs, although I don't expect to.