Pucker factor: 9
We got eight new computers at work, all identical Dell Optiplexes that are going to one department. Generally what happens in situations like this is that one machine is opened up, started up, and configured with its apps installed, and then I take a Ghost snapshot of the hard drive and push that image out to the other machines using Ghostcast Server. That way we end up with eight identical machines, and then scripts and Group Policy further refine the settings and restrictions on those machines based on where they are going and who is going to be using them.
Since these are going into a controlled environment where we want to absolutely minimize any downtime caused by people surfing the net on them and exposing them to drive-by downloads and other forms of crapware, we lock them down pretty tight.
On that note, I've been playing with the Microsoft Shared Computer Toolkit and it's pretty cool. You can lock down a machine so tight that it squeaks when it tries to fart. It's also geared towards computers that are operating alone, and not part of a domain. There's a whole chapter related to using the MSCT in a domain environment and I read over that this morning. Basically what you need to do is set the initial security settings on the machine (or the machine prior to imaging it in this case) and then use the included Administrative Template for Group Policy rather than the Shared Computer Toolkit interface.
So after talking it over with the other network admins this morning, I created a new Group Policy on our domain and called it “%machinename% Experimental Group Policy” and applied it to the machine name that I was working with. That way the changes and restrictions and lockdowns that I was experimenting with would ONLY be applied to that computer. That's where I made the fatal error.
In Windows Server 2003 SP1 and the 'new' Group Policy Management Console SP1, when you create a new policy, its security filtering defaults to the Authenticated Users group (practically everyone). In this case, the ACL said Authenticated Users and machinename-01. I went about locking down machinename-01 and testing it, not realizing that the changes I was making were affecting the entire domain, in every country we operate in. Bad. Very bad.
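For anyone wondering why one leftover entry in that list matters so much, here's a rough Python sketch of how the filtering behaves. It's a toy model, not Microsoft's actual logic: the function names and the membership table are made up for illustration, and it assumes the policy is linked somewhere that puts every account in scope. The point is that the list is an "any of these" match, so adding machinename-01 next to Authenticated Users narrows nothing.

```python
# Toy model of GPO security filtering. All names here are hypothetical.
AUTHENTICATED_USERS = "Authenticated Users"


def effective_groups(principal, memberships):
    """Return the principal's own name plus every group it belongs to."""
    # Assumption: every account in this model has logged on to the domain,
    # so users and computers alike count as Authenticated Users.
    groups = set(memberships.get(principal, ()))
    return groups | {principal, AUTHENTICATED_USERS}


def gpo_applies(security_filter, principal, memberships):
    """The policy applies if ANY entry in the filter matches the principal."""
    return bool(set(security_filter) & effective_groups(principal, memberships))


# Made-up directory contents for the example.
memberships = {
    "machinename-01": ["Domain Computers"],
    "some-user": ["Domain Users"],
}

# The list as it stood: the default entry plus the test machine.
as_created = ["Authenticated Users", "machinename-01"]
# The list as it was meant to be: the test machine only.
as_intended = ["machinename-01"]

for name in ("machinename-01", "some-user"):
    print(name,
          gpo_applies(as_created, name, memberships),
          gpo_applies(as_intended, name, memberships))
```

With the default entry left in place, some-user gets the policy too; take it out and only machinename-01 does.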
I realized that it was locked down too tight for one of our critical applications to work, so I backed off, and then backed off some more, testing each step to make sure it worked. After a few rounds of that, I noticed it was getting late and went for lunch. Second fatal error. By the time I got back from lunch, the changes had replicated to all the other servers and were trickling down to client machines.
I got an email from a user asking why their homepage had changed in Internet Explorer, but I was just getting back from lunch and ready to crack back into the testing of this new machine and didn't really clue in. I hit the Windows key on my keyboard to bring up the Start Menu... and it was blank. I had my last few programs opened, Internet Explorer and Outlook up at the top where they belong, but the only thing on the right-hand pane of the Start Menu was Administrative Tools. No Control Panel, no My Computer, no My Documents, no nothing. I thought to myself “that's weird, I don't remember making any changes to MY machine...” and even went so far as to ask the other admins who was pulling my leg. No one fessed up, so I tried to open Group Policy Management Console to check it and change it back, and got a Windows Critical Error with the message “Access to the Microsoft Management Console has been disabled. Please see your Network Administrator”. Not good. I AM the network administrator, don't tell me to go ask myself! OK, well, I'll VNC the console of the PDC... Log in there, hit the Start button... and it's empty. To quote Ralphie Parker: “Only I didn't say "Fudge." I said THE word, the big one, the queen-mother of dirty words, the "F-dash-dash-dash" word!”
That's when the email about the changed homepage popped back into my mind. A frenzied attempt to get into GPMC via any DC in the datacenter and a phone call from another admin, who had gone offsite about 20 minutes before, all happened at once. He was not amused when I told him what happened. We hit up Google with a passion, looking for a way to “un-fuck” ourselves. We found a couple of things: registry keys, some obscure MS command-line tools, and ultimately a newsgroup post describing the exact situation we found ourselves in, which is what saved our (mine especially) bacon. Someone had done exactly the same thing as me. His solution? He was lucky. As was I. The offsite location that the other admin was at had not been updated yet due to a slow WAN link. Getting in there, making the change to the GPO and saving it gave that copy a newer timestamp, so it replicated ITSELF back to the network here rather than being overwritten by the “bad” GPO. If that had not happened, I would probably have been on the phone with Microsoft for most of the night while the rest of the guys made plans to roll back the entire AD to a previous state.
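The "newest save wins" behaviour that rescued things can be sketched in a few lines of Python. This is a deliberately simplified model of what happened here and nothing more: real replication between domain controllers looks at more than a single timestamp, and the class, the function and the numbers below are all invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpoCopy:
    settings: str
    modified: int  # last-saved time, as a plain number for the example


def replicate(a, b):
    """Simplified rule: both sides end up with the most recently saved copy."""
    return a if a.modified >= b.modified else b


bad = GpoCopy("locked-down", modified=100)  # the edits made at head office
stale = GpoCopy("original", modified=50)    # offsite DC, not yet updated

# Left alone, replication catches up and the bad copy overwrites the stale one.
print(replicate(stale, bad).settings)

# Editing and saving on the offsite DC first gives that copy the newest stamp,
# so it is the one that spreads back to everyone else.
fixed = GpoCopy("repaired", modified=200)
print(replicate(fixed, bad).settings)
```

The whole rescue hinged on there still being one copy somewhere that could be saved later than the bad one.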
We waited five minutes and then I got antsy, so I did a gpupdate /force on my machine, and once it was refreshed, I hit the Start button and everything was back to normal. After that I relaxed a little, but kept searching for a solution in case it ever happened again (not bloody likely) or it happened to someone else who asked me for help.
I found a message thread in Usenet/Google Groups from someone who had done the same thing I did. The solution he found was the same thing that saved my ass: one of the other domain controllers hadn't updated yet. If it had, he would have been screwed (as would I).
This could have been one of those COLOSSAL fuckups that define a career (or at least the downward trajectory of one) had it not been for a slow WAN link. It's one of those mistakes you only make once, as the fear of it actually happening again, for real, is SO GREAT that it will make you pause and check the settings every friggin time you go into Group Policy Management Console for the rest of your life.