Last month I posted about an error in Windows Server 2008 that caused me to age 10 years in 10 minutes. Over a month later, I finally “fixed” that server: I re-installed Windows on it. You can stop reading here if the details are of no concern to you.
The weird thing is that the server continued to, well, SERVE the whole time it was in that compromised state, so the users didn’t know anything was wrong. In the meantime my ass was puckered so tight I was pulling the fabric of my seat right up into my ass, leaving little rosebuds everywhere I sat.
I posted some questions to a few forums I frequent and even got a response from a well-known Windows IT guru who suggested an off-the-wall fix that might or might not work. It didn’t. In theory it should have, but when I tried it, the error message was something along the lines of “Nice try.”
So here’s what happened:
One of the updates to Windows Server 2008 that came out on Patch Tuesday in October 2010 required a reboot to complete its installation. Windows waited patiently and kept popping up that little window reminding me that it needed to reboot, and I held off and held off until I could do it outside of work hours so as not to impact anyone’s (OK, EVERYONE’S) work.
When that one update (and I still don’t know which one it was) came up from the reboot, it failed to notify the system that it was complete. Because that update was still pending and was still requesting a reboot, the system obliged by initiating a shutdown. When it came back up, the update again failed to notify the system that it was complete, requested another reboot, and so on and so forth.
Sitting at home, watching a ping window, I saw the server power off, then come back up and start responding. At that point I disconnected and did whatever I did that night. I did not know at the time that the server was rebooting every four minutes, all night and all morning, until I arrived.
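In hindsight, if I had left the ping running and actually logged it, the loop would have been obvious. Here is a minimal sketch of that idea in Python. It is not what I ran that night. The function names and the sample data are made up for illustration, and it assumes you already have a list of (timestamp, is_up) ping results from somewhere.

```python
from datetime import datetime, timedelta


def count_reboots(samples):
    """Return the timestamps where the host went from not answering to answering.

    `samples` is a list of (timestamp, is_up) tuples in time order.
    Each down-to-up transition is treated as one completed reboot.
    """
    reboots = []
    previous_up = None
    for timestamp, is_up in samples:
        if previous_up is False and is_up:
            reboots.append(timestamp)
        previous_up = is_up
    return reboots


def average_interval(reboots):
    """Average time between consecutive reboots, or None if fewer than two."""
    if len(reboots) < 2:
        return None
    return (reboots[-1] - reboots[0]) / (len(reboots) - 1)


# Made-up data: one ping every 10 seconds for 12 minutes, with the host
# down for the first 60 seconds of every 4-minute cycle.
start = datetime(2010, 10, 13, 22, 0, 0)  # arbitrary illustrative date
samples = [
    (start + timedelta(seconds=t), (t % 240) >= 60)
    for t in range(0, 720, 10)
]

reboots = count_reboots(samples)
print(len(reboots))               # 3
print(average_interval(reboots))  # 0:04:00
```

One clean reboot is expected after patching. More than one down-to-up transition in the same evening would have been my cue to stay logged in.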
I found some advice on the Internet, but it applied to Windows Vista. Since Windows Server 2008 is the server version of Vista (and Windows Server 2008 R2 is the Windows 7 version), I thought I would give it a try. Couldn’t make it any worse, could I? Well, yeah, I could.
I booted up from the Windows disc and went to “Repair My Computer” and found that the only option available to me was to restore from an image created with Windows Server Backup… which I never bothered to set up, because I was using Symantec Backup Exec to back up the server. I rebooted and this time went to the command-line recovery console (cue dramatic theme).
At the command prompt, I navigated to C:\Windows and then into the WinSXS folder. I later found out that WinSXS stands for Windows Side-By-Side, and that it keeps cached versions of DLLs and other files from every single piece of software you’ve installed, so that when that software runs, it can use “its own” version of each library for compatibility’s sake. In that folder is a file called pending.xml, which in this case was about 7 MB. That’s a lot of text surrounded by angle brackets. The advice we got was to delete that file. Since the advice came from the Internet, I decided to rename it to pending.xml.old instead, just in case.
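In the recovery console this was just a plain rename at the command prompt, since there is no scripting to speak of in there. But the “set it aside, don’t delete it” habit is worth spelling out. Below is a rough Python sketch of the same idea. The helper names `set_aside` and `restore` are my own invention, not anything that ships with Windows.

```python
from pathlib import Path


def set_aside(path, suffix=".old"):
    """Rename a file by appending `suffix` instead of deleting it."""
    source = Path(path)
    backup = source.with_name(source.name + suffix)
    # Refuse to clobber an earlier backup; on some platforms a rename
    # would silently overwrite it.
    if backup.exists():
        raise FileExistsError(f"{backup} already exists; not overwriting it")
    source.rename(backup)
    return backup


def restore(backup_path, suffix=".old"):
    """Undo set_aside() by stripping `suffix` from the file name."""
    backup = Path(backup_path)
    if not backup.name.endswith(suffix):
        raise ValueError(f"{backup} does not end with {suffix!r}")
    original = backup.with_name(backup.name[: -len(suffix)])
    if original.exists():
        raise FileExistsError(f"{original} already exists; not overwriting it")
    backup.rename(original)
    return original
```

Calling `set_aside("pending.xml")` would leave a pending.xml.old sitting next to where the original was, and `restore("pending.xml.old")` puts it back. Keeping the old file around turned out to matter later in this story.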
I rebooted the computer and was able to log in, but Server Manager didn’t work. Nothing that relied on the WinRM service worked. Nothing that relied on the Windows Installer worked. Nothing that relied on the Windows Package Installer worked… but everything that was already set up, installed and running worked. This included the Symantec Backup Exec agent, so the server was still being backed up every night.
At the same time, I was having issues with Backup Exec (version 12.5d) not “seeing” my SharePoint 2010 farm on the network, so it wouldn’t back that up, either. After getting on the phone with tech support and trying a few things, they asked if I was running SharePoint 2010. I said yes, and they said that 12.5d didn’t support SPS2010 and I would have to upgrade to Backup Exec 2010 R2, which did. Fortunately I have my software extortion paid up and the upgrade was at no additional cost. I downloaded all 2.8 GB of it, mounted the ISO and installed it over top of the old BE 12.5d installation. It went off without a hitch, but then all the agents needed to be updated. All the servers’ agents updated except for the one on this problem server. On the phone with Symantec tech support again, we were unable to uninstall the agent, so we went through and did it manually. Most.Excruciating.Task.Ever. Once it was uninstalled, we tried to install the newer version and it just would not take. We were looking at a Symantec KB article when something caught my eye: WinSXS. Ahh shit, remember when I said that the installer and the package installer were affected? Now I had the added urgency that my file server was not being backed up anymore.
Over the next couple of days I migrated the accounting system off to its own virtual machine on another host, and then I also migrated my MDT2010 and WSUS virtual machine, which was hosted on this machine, to another host, so the only thing left on this machine was the file shares for this office. Re-installing Windows in theory shouldn’t affect the D drive… in theory. I connected a 2TB drive via USB and started up Robocopy to make a 1:1 copy of all the data on the D drive, just in case. If you ever wondered how long it takes to copy 1.5 terabytes of data over a USB 2.0 connection, I can tell you: 28 hours.
Now that all the apps & VMs were migrated off and a good copy of the data was on a separate drive, it was time to try a few things out. If worst came to worst, I could just re-install Windows (and update it to Windows Server 2008 R2), but if I could fix it… well, that would be great, too.
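For the curious, that 28 hours works out to a pretty unimpressive average. Here is the back-of-the-envelope arithmetic as a small Python sketch. It assumes decimal terabytes (10^12 bytes), and it compares against USB 2.0’s nominal 480 Mbit/s signalling rate, which real transfers never reach. The function names are just for illustration.

```python
TB = 10**12  # decimal terabyte, in bytes (an assumption; drives are sold this way)


def average_throughput_mb_s(total_bytes, hours):
    """Average transfer rate in MB/s (decimal megabytes) for a finished copy."""
    return total_bytes / (hours * 3600) / 1e6


def hours_to_copy(total_bytes, mb_per_s):
    """Estimated hours needed to copy `total_bytes` at a steady `mb_per_s`."""
    return total_bytes / (mb_per_s * 1e6) / 3600


rate = average_throughput_mb_s(1.5 * TB, 28)
print(f"{rate:.1f} MB/s")  # 14.9 MB/s

# USB 2.0 nominal signalling rate: 480 Mbit/s = 60 MB/s before any overhead.
usb2_nominal_mb_s = 480 / 8
print(f"{rate / usb2_nominal_mb_s:.0%} of nominal")  # 25% of nominal
```

So the copy averaged roughly a quarter of what the USB 2.0 spec sheet promises. `hours_to_copy` is the same sum turned around, handy for guessing how long the next one will take before you start it on a Friday afternoon.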
I booted up into the recovery console, navigated back to that WinSXS folder, renamed pending.xml.old back to pending.xml and rebooted. No change. I restarted the computer again and this time it went back to the “Configuring Updates Stage 3 of 3 0% Do not turn off your computer” screen, then click, reboot, then back to the same screen, and the cycle continued. Now that I had the server back to its “original” condition, I booted back to the recovery console from the Windows Server 2008 disc and tried to invoke DISM, the Deployment Image Servicing and Management tool. It did not exist. I swapped discs to a bootable WinPE disc I had made in MDT2010 a while ago, exited out to the command prompt and typed DISM again, and this time I got the help text.
I entered the following command: “DISM /image:C:\Windows /Cleanup-Image /RevertPendingActions”. Ooh, that sounds like it should work! It failed, though: it said it could not access the image at C:\Windows. After some further research on the DISM command I re-tried it, this time providing C:\ as the image rather than C:\Windows. This time I got some results!
"The command specified is unknown or not supported when running DISM.exe against a Windows Vista with Service Pack 1 or a Windows Server 2008 target image"
Well shit. So much for that. I broke it back out of the loop by renaming pending.xml to pending.xml.old again and removed all the DFS targets from the server (oops, forgot about those after the backup), and then left it overnight to replicate and get it all out of its system.
I came back in the next day (Saturday) and as soon as I opened the door I heard an alarm buzzer coming from the server room. Not good. I opened the server room door and was not greeted by any red blinky lights, just green ones. So if it wasn’t a UPS, and it wasn’t a loose power connector, it could only be one thing: a RAID controller. Sure enough, this same server that I’d had all this trouble with had a bad hard drive in its RAID1 array for the OS. I went back out to NCIX to pick up a replacement hard drive, came back, popped it in and let the array rebuild itself. Strangely, there ARE red LEDs in the drive carriers, but the bad drive didn’t make its LED light up, even though the controller triggered the alarm and showed the drive as failed and the array as degraded.
After that, I gave up and pulled out the Windows Server 2008 R2 disc. I did a fresh installation on that C drive and sure enough, the D drive was left intact and all I had to do was set up the shares and DFS. In the end, I may as well blame it on Vista sucking, but most likely that initial error a month ago was because it tried to write to a bad sector on the hard drive.