Last year (last January, to be exact) the company I work for expanded by acquiring another small firm, and that brought with it a new office. That meant I had to provision services and configure some computers. I came up here and brought with me a desktop, a laptop, and a firewall; later I would send up a server.
Everything was working, but logons and file operations were slow, since everything lived on the far side of a WAN connection. No biggie; we would soon get a server up there as a DC with a Global Catalog on it for logons, plus “local” file & printer shares.
In April I came back and brought with me a hand-me-down server (2003) and a hand-me-down printer (a real redheaded step-child office if there ever was one!). I configured them, configured the network, connected the printer and took off.
About a week later the server died. I diagnosed over the phone that it was the power supply, and rather than travel over for 5 hours & a ferry ride and then have to stay over just to replace a $100 power supply, I had them take it to a local computer store and have them replace it. Everything worked, they brought it back, fired it up, and about a day after that one of the hard drives stopped working, possibly due to the power supply failure. Back it went to the local computer store, and this time they called me, as the drives in this server were old and they didn’t have any similar ones.
When I expressed my disbelief that a power supply could cause a hard drive failure he said “You mean you didn’t hear about that power supply?” I professed my ignorance and he told me that it had melted down and was on the verge of catching fire (inside the power supply) and that one of the wires had melted through, causing a short and stopping the flow of electricity. Yikes! How bad would that have looked if I burned the building down?!
I had them put in a new drive, let the RAID array rebuild (it was a RAID1 mirror), then pull the other old drive, replace it with a matching one, let it rebuild, then extend the partition to use up the full space.
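For the curious, the rebuild-and-extend dance looked roughly like this from the command line (assuming a hardware RAID1 mirror, so Windows just sees one disk; the volume number here is a made-up example, and note that on Server 2003 diskpart won’t extend the system/boot volume, only data volumes):

```
C:\> diskpart
DISKPART> list volume        (find the volume sitting on the mirrored disk)
DISKPART> select volume 1    (example number; pick the volume to grow)
DISKPART> extend             (grow the volume into the new free space)
```

The controller handles the mirror rebuild on its own; diskpart only comes into play once both new, larger drives are in and there’s unallocated space to claim.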
It went back into service and about a week later stopped replicating with the rest of the domain, and developed some other issues that would have required new RAM and a re-install of Windows, assuming I could get it started again at all. At this point I cut my losses and lobbied for a new server. Management OK’d it and I replaced it with an HP ProLiant tower server and Windows Server 2008 x64.
I took the new server over in June, configured it, and got it set up as a Domain Controller. Everything was working, and I came home. About a week later, it dropped off the network. I called over and had someone plug in a monitor and see what was on the screen. Bad news: one of the hard drives in the C: array was bad (seriously, this was a brand-new server). Fortunately I had ordered a spare drive with this system, so I had someone a little more comfortable with computers open the case, take out the bad drive, put the new one in, and fire it up. The RAID1 rebuilt and the error went away… or so I thought.
At the same time, there was a Windows Update that kept failing to install, over and over and over again. The error code pointed me to a bunch of forum posts and articles about a crypto file that had become corrupted. This could happen due to a bad hard drive and/or a hard drive replacement, exactly what I had just done. The fix? Re-install Windows, or if you’re lucky, a repair install. The server was functioning, other than this update failing, so I left it, as I didn’t have the time or the software/discs/tools with me to do it on the spot.
Sometime over the next few months I noticed that the new server wasn’t replicating AD partitions properly, and I saw some very odd event messages on some of my other servers about an unfinished DCPROMO event. That was very disturbing. In November of last year I discovered that my CA was in a different office, one that this red-headed step-office did not have any access to. The new server I installed over there had never been able to contact the CA and get a Domain Controller certificate with which to sign its updates, so the rest of the network ignored it… until it exceeded the tombstone lifetime. At that point, well, the whole fucking thing was a write-off again.
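If you want to catch this kind of rot before the tombstone lifetime runs out, repadmin and dcdiag are your friends. Something like this, run from any DC (the DC name here is a placeholder), would have shown the failing inbound replication long before things went terminal:

```
C:\> repadmin /replsummary          (replication health summary for every DC)
C:\> repadmin /showrepl REMOTE-DC   (per-partition replication status for one DC)
C:\> dcdiag /test:replications      (broader DC health check, including replication)
```

A scheduled task emailing the /replsummary output once a week would have turned this months-long surprise into a one-week annoyance.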
Just after that in early December, my “primary” DC in head office (the one that held all the FSMO roles and SHOULD have been the CA but wasn’t) had a catastrophic failure. This caused grief and panic but in the end I was able to save it by converting it to a virtual machine and hosting it on one of my newer Windows Server 2008 x64 servers. Getting everything sorted out took most of December, all the while waiting and calling and emailing and threatening my supplier wondering where the replacement server was. It finally showed up in early January.
That failure, coupled with the tombstoned server in the remote office, caused a lot of weird issues around the whole network. I had ordered a “new” firewall to upgrade the one in the remote network and give it full connectivity to all the offices, but it had to wait until I could get over there to install it.
I installed it last night, and as if by magic, as soon as it contacted the CA things started working better… but not everything. It was already tombstoned. The other DCs wouldn’t talk to it or replicate with it. I was thinking I would have to demote it, disjoin it from the domain, and re-install Windows anyway, so I tried to demote it. It wouldn’t work. There were database changes on the system that had to be replicated to the other servers, but the other servers wouldn’t accept them because it was tombstoned, and it suggested demoting and re-promoting… which it wouldn’t do because there were database changes that had to be replicated to the other servers… a vicious circle.
Late last night I had to physically disconnect the server from the network and do a dcpromo /forceremoval on it. Not the best scenario, but all I had left. Once it was done and rebooted, it was back to a pristine, non-domain-connected condition. From there I had to connect to one of the DCs back in Head Office and manually clean up the Active Directory database from the command line using NTDSUTIL. The exact steps to follow I found on Petri.co.il. It’s one of those things where if you’ve done it so many times that you’ve memorized the steps, you should probably just stop fucking around with computers because you keep breaking them. :)
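For reference, the whole procedure went roughly like this (server names are placeholders, and the domain/site/server numbers are whatever the list commands show you — this is the same walkthrough Petri documents):

```
(on the dead DC, disconnected from the network)
C:\> dcpromo /forceremoval

(then, on a healthy DC, remove the stale metadata)
C:\> ntdsutil
ntdsutil: metadata cleanup
metadata cleanup: connections
server connections: connect to server HEALTHY-DC
server connections: quit
metadata cleanup: select operation target
select operation target: list domains
select operation target: select domain 0
select operation target: list sites
select operation target: select site 0
select operation target: list servers in site
select operation target: select server 0
select operation target: quit
metadata cleanup: remove selected server
```

After that you still want to sweep DNS and Sites & Services for leftover records pointing at the old DC before re-promoting it under the same name.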
I renamed it, rebooted, joined the domain, rebooted, promoted it to be a domain controller and rebooted again. Then I left for the night.
Today, things are going smoothly. There are some DFS issues to work out, but DNS, AD, Group Policy, logon scripts, and all that sorta stuff are all working rock-solidly.