Posts Tagged ‘Snapshot’

Updated: Recover from a USN Rollback WITHOUT Demoting and Promoting your DC

October 27th, 2009 by Paul Sterley | 1 Comment | Filed in Backup and Restore, ESXi, IIS, In the Windows Box, Virtualization, Windows Server

What’s a USN Rollback? That’s when you’ve restored an Active Directory DC in a multiple-DC environment using a method that is not Active Directory-aware. Examples include Ghost images, VMware or Hyper-V snapshots, or other imaging or volume-level restore methods.

Why is that a problem? A very good detailed explanation is available here, but the basic idea is that AD keeps track of which servers it has replicated with and when. If a DC is rolled back in a way that is not compatible with that record-keeping, the affected DC will disable inbound and outbound replication, and refuse to replicate with the other DCs.
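One way to picture that record-keeping is the toy model below. It is only a sketch and not how AD is actually implemented: the class `ToyDC`, its fields, and the return strings are made up for illustration. The assumption is that each partner remembers the highest update sequence number (USN) it has seen per database identity (the "invocation ID"), that an image-level restore winds the USN back while keeping the same identity, and that an AD-aware restore gives the database a new identity.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyDC:
    """Hypothetical stand-in for a DC: a database identity plus a change counter."""
    name: str
    invocation_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    highest_usn: int = 0
    # What this DC remembers about its partners: invocation ID -> highest USN seen.
    seen: Dict[str, int] = field(default_factory=dict)

    def replicate_from(self, source: "ToyDC") -> str:
        """Pull from source and report 'ok' or 'rollback-detected'."""
        remembered = self.seen.get(source.invocation_id, 0)
        if source.highest_usn < remembered:
            # Same database identity, but an older USN than already acknowledged.
            return "rollback-detected"
        self.seen[source.invocation_id] = source.highest_usn
        return "ok"


def demo_rollback() -> List[str]:
    dc1, dc2 = ToyDC("DC1"), ToyDC("DC2")
    results = []

    dc2.highest_usn = 100                      # normal operation
    results.append(dc1.replicate_from(dc2))

    dc2.highest_usn = 60                       # snapshot/image revert: same identity, older USN
    results.append(dc1.replicate_from(dc2))

    dc2.invocation_id = uuid.uuid4().hex       # AD-aware restore: new database identity
    results.append(dc1.replicate_from(dc2))
    return results


if __name__ == "__main__":
    print(demo_rollback())
```

In this model the second replication attempt is the one that gets refused, and the third goes through because the partner treats the restored database as a new replication source.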

Here’s a related article by the same author as the above post, which led me to my solution this evening. My article expands on the second option provided, but goes into the mechanics of it, and the associated difficulties.

According to Microsoft’s Knowledge Base article on the subject, recovering from this situation entails forcibly demoting the DC, cleaning up the AD, and then (optionally) promoting it again. If the DC in question has no other roles, or just a couple of basic ones such as a print server, this might be the best way to go, if you’re familiar with such things as seizing FSMO roles and performing metadata cleanup in Active Directory after an unsuccessful DC demotion.

** Update: Read on for more details about how this all works, but make sure you check the update at the bottom of the article for the easier method I successfully tested!

However, if you’re not familiar with these things, or you have other applications on the server which might be affected (IIS, in particular, is very sensitive to the permissions changes associated with DC promotion), this might create a very large amount of havoc on your server.

Your saving grace, if you have one, is a System State backup from before the USN rollback occurred. If you don’t have a backup of JUST the System State, perhaps you can restore an entire image to another server, boot it, and create one.

If you have or can create one of these, your solution becomes much simpler. You just need to boot your server in Directory Services Restore Mode, restore the System State, DO NOT mark any part of your restore as authoritative, and reboot.

After the reboot, you might need to remove the flags AD has set, which have disabled inbound and outbound replications. The commands for this are:

repadmin /options [YourServerName] -disable_inbound_repl
repadmin /options [YourServerName] -disable_outbound_repl

Note: This looks like you are disabling replication, but what you are actually doing is putting a minus sign (-) before the disable option, which removes that option and so re-enables replication. I know, it’s counter-intuitive, but trust me on this one – or go check the syntax yourself.
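If the plus/minus syntax is hard to keep straight, it can help to think of the server's options as a set of flags, where "+" adds a flag and "-" removes it. The sketch below models only that idea; the function names are hypothetical, the upper-casing is a convenience of the model, and none of this is repadmin's own code. `IS_GC` appears simply as an example of an unrelated flag that should be left alone.

```python
from typing import FrozenSet, Iterable


def apply_option(flags: FrozenSet[str], change: str) -> FrozenSet[str]:
    """Apply one '+FLAG' or '-FLAG' change to a set of option flags."""
    sign, name = change[:1], change[1:].upper()
    if sign == "+":
        return flags | {name}   # set the flag
    if sign == "-":
        return flags - {name}   # clear the flag
    raise ValueError(f"expected '+FLAG' or '-FLAG', got {change!r}")


def apply_options(flags: FrozenSet[str], changes: Iterable[str]) -> FrozenSet[str]:
    """Apply several changes in order."""
    for change in changes:
        flags = apply_option(flags, change)
    return flags


if __name__ == "__main__":
    # State of the rollback victim: both "disable" flags are set.
    blocked = frozenset({"IS_GC", "DISABLE_INBOUND_REPL", "DISABLE_OUTBOUND_REPL"})
    fixed = apply_options(blocked, ["-disable_inbound_repl", "-disable_outbound_repl"])
    print(sorted(fixed))
```

Clearing both "disable" flags leaves every other option untouched, which is all the two repadmin commands above are meant to do.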

Of course, you need the Support Tools installed to get the repadmin utility. Once you run those commands, your server will start replicating again, and the more up-to-date DC(s) will override the old, out of date information your USN Rollback victim was holding onto.

There are some extra difficulties associated with the above plan:
1. If you have to restore a server image to create that System State backup, and you restore to different hardware, things could get a little messy. Is it messier than demoting, seizing FSMO roles, performing metadata cleanup, promoting, and cleaning up the fallout from your installed apps? You’ll have to decide on that one.

2. This requires you to have an extra server (or two, if you want to restore more than one DC to create a stable lab environment from which to back up the System State) lying around. Do you have those resources available?

I was facing this issue today, and all of the above became MUCH simpler for me when I realized I could use the Doyenz Test Lab to sort all of this out. I did NOT have a System State backup from before the USN Rollback, but I HAVE been running backups into the Doyenz system since before the problem began.

Here is what I did:
1. Created a backup of the System State

a. Restored a copy of the affected server in the Doyenz Test Lab. I specifically restored from the date BEFORE the USN Rollback happened. It was easy to find this by looking at the date of the last successful replication with repadmin on the affected server.
b. Performed a System State backup using NTBackup (you can do this with WBAdmin on Windows Server 2008).
c. Zipped the backup file and sent to an FTP server.
d. Shut down the restored server.

2. Performed a test run to make sure this was going to work, without affecting the live servers.

a. Using the Doyenz Portal, I selected last night’s backup and restored it for both servers.
b. I booted the primary DC (the one with the FSMO roles) first.
c. Attached the second (USN Rollback victim) server to the first one in the Lab, and booted it.
d. Pulled the System State backup down from the FTP site onto the affected server.
e. Rebooted the affected server into Directory Services Restore Mode.
f. Restored the System State on the affected server.
g. Rebooted the affected server into Normal Mode.
h. Used the repadmin commands to remove the replication blocks.
i. Forced replication using AD Sites and Services.

3. Verified successful replication.

a. Created a user account on one DC in the Test Lab, forced replication, and checked for the account on the other DC.
b. Deleted the user account on the other DC, and checked it on the first DC.

4. Tested the touchy, sensitive web applications that are running on the affected server.

5. Shut down the servers in the test lab.

After this successful test, I notified the users of pending late-night downtime, and repeated the above steps, this time on the live, production server and with great confidence in the outcome. Sure enough, I restored the AD replication functionality of the server with minimal downtime, without crossing my fingers, holding my breath, and hoping against hope that it would work and not trash the server.

What is more, since the production server is a virtual server, and I have VPN access to the virtual host, I was able to perform the entire operation from my home office, 30 miles away. I didn’t swap any tapes, set up any lab hardware, or drive to the server site late at night. I did the whole thing in comfortable clothes with a 2-liter bottle of Ruby Red Squirt, Winamp playing “Save Me” by Queen, and my devoted cat purring on my lap.

What could be better than that?

Update: It was very handy to be able to do the above scenario, but what is even handier is that I was able to find a significantly simpler method. So much simpler, I wonder why it did not occur to me sooner, and why Microsoft doesn’t have this listed in their KB article.

I set this problem up in a lab scenario again, and this time rather than do a complicated restore of an earlier version of the machine, I simply:

  • Performed a System State backup of the machine (in its broken, non-replicating condition).
  • Booted it into Directory Services Restore Mode.
  • Restored the System State backup, carefully NOT selecting the option to make it authoritative.
  • Rebooted, and ran the above repadmin commands to re-enable replication.

After that, I was able to trigger another replication, and it worked just fine.


Updated: Trouble with Hyper-V’s Snapshot Feature

January 30th, 2009 by Paul Sterley | 1 Comment | Filed in Hyper-V, Virtualization

I’ve just received this update from my friend, who I will call “Spleen” here to protect the “innocent”.

 

When you make a snapshot you go into “differencing disk” territory.  Well, it turns out that even if you delete all your snapshots, those differencing disks could hang around.  They get cleaned up “automatically” when you leave the machine turned off long enough.  (Yeah, that’s how you activate that feature: sit around and wait to see whether it decides to start up.)

 

Long story short, before you know it your 40 GB VM happens to occupy 400 GB on disk.  You’re out of space, and of course rumor has it (I haven’t seen it yet myself to confirm or deny) that “applying” a differencing disk to the base disk to make it go away requires as much free disk space as the sum of the base disk and the differencing disk.

 

Of course, you notice this when your differencing disks have soaked up ALL your free space, so unless you happen to have 50% of your hard drive’s entire capacity taken up by something else, you’re in deep, deep shit.
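To put numbers on that, here is a small worked sketch. It takes the rumored rule at face value (free space needed equals base disk plus differencing disk), which remains unconfirmed, and the drive sizes are hypothetical figures chosen to match the 40 GB VM that has grown to 400 GB.

```python
def merge_space_needed_gb(base_gb: float, diff_gb: float) -> float:
    """Free space needed under the rumored rule: base disk plus differencing disk."""
    return base_gb + diff_gb


def shortfall_gb(capacity_gb: float, used_gb: float,
                 base_gb: float, diff_gb: float) -> float:
    """How much space must still be freed before the merge could start (0 if none)."""
    free_gb = capacity_gb - used_gb
    return max(0.0, merge_space_needed_gb(base_gb, diff_gb) - free_gb)


def can_merge(capacity_gb: float, used_gb: float,
              base_gb: float, diff_gb: float) -> bool:
    """True when the drive already has enough free space for the merge."""
    return shortfall_gb(capacity_gb, used_gb, base_gb, diff_gb) == 0


if __name__ == "__main__":
    # 40 GB base disk plus 360 GB of differencing disks = 400 GB on disk.
    print(merge_space_needed_gb(40, 360))   # space the merge would need
    print(shortfall_gb(500, 400, 40, 360))  # hypothetical 500 GB drive
    print(can_merge(1000, 400, 40, 360))    # hypothetical 1 TB drive
```

On the hypothetical 500 GB drive there is only 100 GB free against 400 GB needed, so 300 GB would have to be moved elsewhere first. That is the "new drive twice the size" situation for a VM that has filled its drive.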

 

The way out, of course, is shuffling dozens or even hundreds of gigabytes of data all around hither and yon, until you have enough free space to fix your problem.  (Ironically, this is where having two or more  VMs on a single drive will save your bacon.  If you only had one VM, and it filled the drive, you’re going to need a new drive that’s twice the size or so…)

 

I’m about to look into manually forcing it to apply the disks.  You do this by finding out what the precise chain-order of the differencing disks is, and you take the first differencing disk (i.e. the one right “below” the base disk in the chain) and rename it from .avhd to .vhd, and then you use the “Edit Disk” feature in Hyper-V to squish the two disks together.  Then you watch to see whether you have enough free disk for this to succeed, and if it does then you win because you’ve just made some fresh empty space.  Yay!

 

(This information is unverified, and comes from here: http://itproctology.blogspot.com/2008/06/how-to-manually-merge-hyper-v-snapshots.html )
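Working out the chain order is the fiddly part, so here is a rough sketch of that step alone. It assumes you have already written down, for each differencing disk, which file is its parent (for instance from what "Edit Disk" or "Inspect Disk" reports); the file names are hypothetical, and the code does not touch any disks.

```python
from typing import Dict, List


def chain_from_base(parent_of: Dict[str, str]) -> List[str]:
    """Order disks from the base disk down to the newest differencing disk.

    parent_of maps each differencing disk to its parent disk.
    """
    children = set(parent_of)
    bases = set(parent_of.values()) - children
    if len(bases) != 1:
        raise ValueError("expected exactly one base disk")

    # Invert the mapping; a parent with two children means a branching tree.
    child_of = {parent: child for child, parent in parent_of.items()}
    if len(child_of) != len(parent_of):
        raise ValueError("snapshot tree branches; this sketch handles one chain only")

    chain = [bases.pop()]
    while chain[-1] in child_of:
        chain.append(child_of[chain[-1]])
    if len(chain) != len(parent_of) + 1:
        raise ValueError("some disks are not connected to the base disk")
    return chain


if __name__ == "__main__":
    parents = {
        "web01_B.avhd": "web01_A.avhd",
        "web01_A.avhd": "web01.vhd",
        "web01_C.avhd": "web01_B.avhd",
    }
    print(chain_from_base(parents))
```

The second entry in the returned list is the disk right "below" the base disk, which is the first one to rename and merge under the procedure described above.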

 

Seems like a shit-ton of work just because there’s no button there that says “please actually do this incredibly important task for me RIGHT FSCKING NOW because this is an important VM that really can’t sit around all weekend turned off while I pull my ass hairs and wonder whether some service will decide completely on its own to do what I need, or not, and why.”

 

My day would have been chock-full of work, start to end, if I had not discovered this issue late one evening when some people complained that some of the servers had stopped responding, and I was awakened by the pages.  I cleared a meager 10 GB of space that was unused stuff and went to sleep knowing that would get us through until 8 am.  (That technique ultimately helped us limp through until 6 pm, when we were allowed to shut down the VMs.)

 

Hyper-V is definitely at around the “Version 2.0” phase: it does some stuff, and it does some stuff really well.  But the warts are so vastly terrible, you can go blind just wondering what the hell happened there.  You know, sort of like Internet Explorer 2.0 was.

 

 

UPDATE:
I thought about some simple math last night / this morning, regarding how exporting a VM is kinda slow and takes up a lot of disk space.  Like, 10 or 30 GB average for our machines.  (10 is more normal while we are building them up, before users get on them.)

 

I bet that if my team had done an “export” instead of a snapshot every single time we actually did a snapshot, and then gone through the trouble of restoring on the rare occasions we needed to “apply” a snapshot, we would have used 10% of the time and 10% of the overall disk space that digging out of a snapshot hole has caused us.

 

Furthermore, all our “wasted time” would have happened _before_ deployment, not during deployment.

 

Changing gears only slightly, you can apparently make backups of your VHD if you know what you’re doing, then at a later date tell Hyper-V “drop that hard drive from the image and use this one (an old copy) instead”.  Bam – now we’re effectively ghosting.  And copying a VHD to a backup folder may even be significantly faster than exporting the whole VM.

 

Now, there may be a downside of having to shut down the VM to export it, and I’m certain you need to shut down the VM in order to just file-copy it.  And, this requires research/training to achieve proficiency and confidence that you can do it without fscking up.

 

But you MUST shut down a machine to collapse out an applied snapshot anyway, and that will probably always be slower than copying an old VHD from the backup location.  Snapshots keep biting us in the ass; I think it’s time to give shut-down-and-ghost a try instead.
 

Further Update:
I have a 300+GB bloated VM (should be more like 30 to 50 GB) that is merging 5 separate differencing disks at the speed of an arthritic old man frozen to death in a glacier without his walker, and every so often I take a screenshot of the files in the snapshots folder and save it.

That will allow us to look more precisely at the “A + B free disk space required” problem.
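Screenshots work, but the same record can be kept as text. The sketch below is one possible way to do it, not something from the original merge: it lists the .vhd and .avhd files in a folder with their sizes in bytes, so repeated runs show the differencing disks shrinking or disappearing. The function names and the demo file names are hypothetical.

```python
import os
import tempfile
import time
from typing import Dict, Optional, Tuple


def disk_file_sizes(folder: str,
                    extensions: Tuple[str, ...] = (".vhd", ".avhd")) -> Dict[str, int]:
    """Return {file name: size in bytes} for disk files directly inside folder."""
    sizes = {}
    for entry in os.scandir(folder):
        if entry.is_file() and entry.name.lower().endswith(extensions):
            sizes[entry.name] = entry.stat().st_size
    return sizes


def log_line(sizes: Dict[str, int], when: Optional[float] = None) -> str:
    """Format one timestamped line; 'when' is seconds since the epoch (now if None)."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(when))
    parts = [f"{name}={size}" for name, size in sorted(sizes.items())]
    return stamp + " " + " ".join(parts)


def demo() -> Dict[str, int]:
    """Build a throwaway folder with fake disk files and measure it."""
    with tempfile.TemporaryDirectory() as folder:
        for name, size in (("vm.vhd", 10), ("vm_A.avhd", 3), ("notes.txt", 5)):
            with open(os.path.join(folder, name), "wb") as handle:
                handle.write(b"x" * size)
        return disk_file_sizes(folder)


if __name__ == "__main__":
    print(log_line(demo()))
```

Run against the real snapshots folder every few minutes, with the output appended to a file, this would give a size-over-time record to compare against the "A + B" free-space claim.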

 
