Archive for the ‘In the Windows Box’ Category

Run CHKDSK /F at ROCKET SPEED without rebooting your server

July 18th, 2010 by Paul Sterley | No Comments | Filed in Backup and Restore, Hardware, In the Windows Box, Management Software, Windows Server

CHKDSK can’t fix a volume when someone or something is using it.

Normally, when you run CHKDSK and you want to fix something, you run the command, it tells you that it cannot gain exclusive access to the disk, and asks if you want to schedule it for the next reboot. You say yes, reboot the server, and then CHKDSK gets to work halfway through the next server boot. The problem is, all of the services of that server, like AD/DHCP/DNS, etc, and any shared folders on other volumes are also offline during this time. This is very inconvenient.

Looking a little closer at what constitutes a file handle that locks CHKDSK from fixing the volume: 

  • If a service is running (QuickBooks Database Server Manager, for example) and is looking at the volume, CHKDSK is hands-off.
  • If a user has a file open, then CHKDSK is hands-off.
  • If you have Windows Explorer open on the server looking at the volume, CHKDSK is hands-off.
  • If you have a command prompt open and have changed directory to anything on the volume, CHKDSK is hands-off.
  • If you even have a folder on that volume shared on the server, CHKDSK cannot fix it without dismounting the file system.
  • If you carefully make sure that NONE of these are true, and if I haven’t missed any, you can actually run CHKDSK with the /F switch while your server is still running!

Here are some reasons you’d want to do this – and there’s one unexpected and very important one in there.

  • You could fix one volume while leaving the others accessible.
  • You could still have DNS/DHCP/PDC/Exchange services while the data volume is being repaired (if your Exchange database is on a different volume).
  • If this is a physical server, and you don’t have iLO or DRAC to remotely view the screen, running CHKDSK in this manner will allow you to watch the process run and check in on it from time to time, without having to be physically in front of the server.
  • Here’s the REALLY BIG ONE, and it is so dang big, I am simply amazed that I have not heard about this before:
  • IT IS FASTER! We’re not talking about 2x, or even 4x. It is ROCKET-FAST.

 I was fixing a server in single-mode (halfway through Windows boot), and it took 2.5 DAYS to fix the security descriptors on about three million files. I was forced to interrupt it to let the users back in.

I am now experimenting on another server that I restored the entire volume to (broken security descriptors and all). I made sure nothing had locks on the volume, and ran the CHKDSK /F with Windows up and running – and it has now fixed 2.4 million files in about 31 minutes! It may even be done with the 6.8 million files on this server before I finish writing, editing, and posting this blog entry (OK, maybe not quite that fast).

This other server I am experimenting with is a physical server, where the other was virtual – but this server is running 7200 RPM SATA disks compared to the 15K SAS disks in the virtual server, and it’s a generation older. I know that physical servers run a bit faster than virtual but not THIS MUCH faster. No way.
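To put a rough number on "ROCKET-FAST", here is a back-of-the-envelope sketch in Python that uses only the figures quoted above. The variable names are mine, the inputs are approximate ("about three million", "about 31 minutes"), and the two runs were on different hardware, so treat the result as indicative rather than a benchmark.

```python
# Rough throughput comparison built from the figures quoted in this post.
# The two runs were on different servers, so this is only an indication.

boot_time_files = 3_000_000      # "about three million files"
boot_time_hours = 2.5 * 24       # 2.5 days of boot-time CHKDSK

online_files = 2_400_000         # "2.4 million files"
online_hours = 31 / 60           # "about 31 minutes" with Windows running

boot_time_rate = boot_time_files / boot_time_hours   # files per hour
online_rate = online_files / online_hours            # files per hour
speedup = online_rate / boot_time_rate

print(f"Boot-time CHKDSK: {boot_time_rate:,.0f} files/hour")
print(f"Online CHKDSK:    {online_rate:,.0f} files/hour")
print(f"Speedup:          roughly {speedup:.0f}x")
```

Even allowing for generous rounding in the inputs, the gap is around two orders of magnitude, which is far more than a physical-versus-virtual difference would explain.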

The production virtual server still has half its file system needing to be fixed, and I intend to put this new development to the test during the next downtime window. I will post my results.

So what about those shares? Don’t want to delete and recreate them?

Try this MS KB document (Article ID: 125996) on for size. Export your shares before deleting them, run the CHKDSK, and then re-import your shares in 5 minutes plus a reboot.

Update: It is not necessary to export and delete your shares. CHKDSK prompts to force a dismount on the volume (rather than scheduling for the reboot) when you have shared folders, but no services or other file locks.


What to do when you KNOW your CHKDSK /R operation is going to take a VERY long time to run.

July 18th, 2010 by Paul Sterley | No Comments | Filed in Backup and Restore, Hardware, In the Windows Box, Management Software, Windows Server

You suspect file system problems. You run CHKDSK _without_ the /R switch, which runs in read only mode. It checks the disk and tells you that you have over six million security descriptors that need to be replaced with the default ones.

You’re not sure if your server will come up OK when done fixing all of this.
You don’t know how long it is going to take to fix.

Well, take my word for it: you don’t want to find out the hard way that it takes too long. I am running this scenario on the following:

Dell PowerEdge R710 server with:

  • PERC 6/i SAS RAID card with 256 MB cache
  • Dual quad-core 2.25 GHz processors
  • 16 GB memory
  • Six 600GB 15K SAS disks in a RAID5 with the default stripe size.
  • I am running VMware ESXi 4.0 Update 1.
  • The guest OS is Windows 2003 R2 SP2. It is the only VM running, with 4 CPUs allocated.

I ran the CHKDSK in read only mode and it documented 6,864,384 files with bad security descriptors.
I started running CHKDSK with the /R switch and recorded the following:
The process fixes approximately 67,150 descriptors per hour, or 1,611,675 per day.
That means it will require 4.3 days to complete.
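The estimate above is simple division, and it is worth redoing with your own numbers before you commit to a downtime window. A minimal sketch in Python, assuming the repair rate stays constant for the whole run (the function name is mine, not part of any tool):

```python
def estimate_days(items_to_fix, fixed_per_hour):
    """Estimate CHKDSK run time in days, assuming a constant repair rate."""
    hours = items_to_fix / fixed_per_hour
    return hours / 24

bad_descriptors = 6_864_384   # reported by the read-only CHKDSK pass
observed_rate = 67_150        # descriptors fixed per hour (approximate)

days = estimate_days(bad_descriptors, observed_rate)
print(f"Estimated run time: {days * 24:.0f} hours, or about {days:.1f} days")
```

To get the rate for your own server, note the file number CHKDSK is reporting at two points in time and divide the difference by the elapsed hours.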

I know it’s a bad idea to interrupt CHKDSK while it is in progress, but there is no way in hell the customer is going to allow me 4.3 days of downtime. It’s just not going to happen.
So I thought about CHKDSK for a while, and came up with this:

Stage 1 works with the files themselves. CHKDSK reads each file’s record in the master file table (MFT) and checks that the record is internally consistent.

Stage 2 works with the indexes, which are essentially the directories. This is where CHKDSK looks at where the files are “supposed” to be, as indicated by the “map” it is looking at. It checks that every file listed in a directory actually exists, and that every file is listed in at least one directory.

Stage 3 works with the security descriptors on the files and folders.

Stage 1 and stage 2 are the most dangerous stages. This is where, if interrupted, the files or indexes could become irrecoverably corrupted, and we’d be very unhappy campers.

Stage 3 is, in my opinion, an area of less danger. The files and the indexes are OK; it’s just checking security descriptors and fixing them if needed.

I took a calculated risk and rebooted the server when it was working on file # 422,000 or thereabouts. It seemed more or less happy. I ran CHKDSK in read only mode again, and after checking Stage 1 and Stage 2 without errors, it started reporting bad security descriptors again on Stage 3 at file # 422,000.
Maybe I dodged a bullet, or maybe interrupting CHKDSK in Stage 3 is not as bad as it could be.
Anyway, rebooting during a CHKDSK operation is bad news, and to be avoided if possible. So, this article offers you a way to find out how long your CHKDSK operation might take, or avoid that risk altogether.
I offer you an alternate solution that does NOT involve setting a CHKDSK flag, rebooting the production server, and hoping for the best.

This method is outlined very roughly like this:

  1. Take a full volume backup (including the errors) of the production server using ShadowProtect or other disk-based backup system.
  2. Restore this backup to an alternate or loaner server.
  3. Fix the file system on the loaner server (giving you a rough idea of the time it would take on the production server).
  4. Run a full backup of the fixed temporary server’s data volume.
  5. At this point, you have a choice:
       a. It didn’t take very long, so go ahead and run it on the production server, or
       b. Proceed with this alternate method.

While you have been fixing the file system on the temporary server, users have been modifying files on the primary server. So:

  1. Use Robocopy with the /MIR and /COPY:DATSO switches, etc. to synchronize the changes from the production server to the temporary server (users must be offline and not making changes during this time).
  2. Restore this backup to the production server. (Users are offline during this time).
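The Robocopy step can be sketched as a small Python helper that assembles the command line. This assumes the DATSO flags mentioned above are passed through Robocopy's /COPY: option, and the retry settings, paths, and log file name are purely illustrative choices of mine. The script only prints the command; it does not run it.

```python
def build_robocopy_args(source, destination, log_file):
    """Return a Robocopy argument list that mirrors source onto destination."""
    return [
        "robocopy",
        source,
        destination,
        "/MIR",          # mirror the tree; also DELETES destination files missing from source
        "/COPY:DATSO",   # copy Data, Attributes, Timestamps, Security (ACLs), and Owner
        "/R:1",          # retry a failed copy once (illustrative value)
        "/W:1",          # wait one second between retries (illustrative value)
        f"/LOG:{log_file}",
    ]

if __name__ == "__main__":
    # Hypothetical paths: production data share as the source, loaner volume as the destination.
    args = build_robocopy_args(r"\\PRODSERVER\D$\Data", r"E:\Data", r"C:\Temp\sync.log")
    print(" ".join(args))
```

Because /MIR deletes anything on the destination that is not on the source, double-check which side is which before running the printed command.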

 The drawbacks:

  • It involves moving the data all over the place repeatedly, which takes a lot of time and network bandwidth.
  • It requires two separate backup locations so you don’t overwrite your only backup.
  • It relies entirely on the integrity of the file system on the temporary server.
  • Once the restore has begun, you CANNOT interrupt it the way you can (even if you shouldn’t) interrupt the CHKDSK.

 The benefits:

  • Depending on data size and number of files that need to be fixed, the amount of downtime required for synchronizing changes and restoring the volume might be significantly less than letting the CHKDSK run.
  • No more interruptions of CHKDSK if the users won’t let you fix it all in one sitting.
  • No-risk CHKDSK. How many times have you run CHKDSK /R and wondered if your file system would mount when it was done?

 

There are some aspects of this I would like to discuss before they come up in the comments:

Q: What if the customer has only one server, and it’s SBS?
A: Well, now that’s tricky. It is still possible to do this, but it gets complicated. You’d have to restore that volume to similar hardware (great if it is a virtual server), because you’d be restoring the OS as well, so that the permissions wouldn’t get trashed. So then you’d have two servers with the same name, same IP address, same domain, etc. This is not an insurmountable problem. All you need is a $69 broadband router to put between them, and change the IP address on your temporary server. That will significantly slow down file operations, and in light of the other issues I am about to cover, this might not be worth it.

Q: What if there are other things on that volume (Exchange, other databases, etc) besides files?
A: Well, now you’ll have to make a choice on how you want to handle that. You could do something like this:

  1. Dismount the database and copy it off before you do the restore, then copy it back afterward.
  2. Back up the databases separately using other tools, and restore them afterward.
  3. After having fixed all of the files on the temporary server and having synched them with Robocopy, delete all of the files on the production server, run the CHKDSK to fix the remaining issues (should run VERY quickly with all of the files gone), and then do a file-by-file restore (which will be VERY slow), and then of course you’ll have to fix the NTFS permissions.

Q: What if the customer does not have an alternate (temporary server)?
A: Seriously? <rant> Come on now. It really amazes me how many IT consulting companies, large and small, do not have usable loaner servers to put at client sites in an emergency.

I run an IT consulting company. Me. I’m a one-man show at this point. I have THREE loaner servers I can bring to bear if needed. I have a half-dozen extra hard disks lying around to help configure these servers as needed. If I can afford this, so can your company. It simply requires dedication to your customers instead of squeezing every dollar you can out of your customers.

Server1: 2U compact low-noise rack-mount white-box running an Intel motherboard, quad-core 2.5 GHz proc, 8 GB RAM, and a couple of 1 TB SATA disks. No RAID. It’s loaded with VMware ESXi that boots from a USB stick. This machine cost me about $800 to build. It’s handy to have around to run labs on, when not being used for a loaner server.

Server2: Micro-ATX Tower white-box running an Intel motherboard, quad-core 2.5 GHz proc, 8 GB RAM, and three 1 TB SATA disks. No RAID. This machine cost me about $600 to build. This one doubles as a gaming PC for when my gaming friends come over.

Server3: HP Proliant DL320 G3 1U rack-mount w/onboard SATA RAID, 2 disks max. It’s an older 32-bit machine, but it has 4 GB of RAM and I swapped out the two 80 GB SATA disks it had with two 1 TB SATA disks. This machine was given to me by a customer who retired it. This one doubles as a dedicated UT2004 server for when my gaming friends come over.

These may not be super-impressive machines, but as loaner servers in a pinch, they are very flexible. I can configure them with software mirroring for fault tolerance, or I can configure them striped for capacity (I just make sure to back up the data incrementally every hour while in use). They have enough RAM to run an SBS 2008 server and enough CPU to run two or three virtual guests if needed. One of these machines ran my entire server infrastructure (SBS 2008 and Windows 2008) for two weeks last year when I had an air conditioning issue.

So if your customer does not have a spare server lying around, maybe you can come up with something with your own resources. </rant>

Really, you have to look at the particulars of your situation and decide if this is a good idea for you. Still, it’s one more option to put in your tool belt.


Remote Desktop: The /console or /admin switch does NOT always get you the “real” desktop.

July 2nd, 2010 by Paul Sterley | No Comments | Filed in Antivirus Software, In the Windows Box, Symantec, Windows Server

Today, I was troubleshooting a problem with Symantec Antivirus. Specifically, I was trying to stop a monthly scan that had been running for more than 24 hours, and was having an issue with quarantined items in the “xfer” folder.

My problem: Even though the scan settings were configured to allow me to stop/pause/snooze scans, I could not find a method of doing so. The Scan Progress dialog was not up, and I could not bring it up using the product GUI.

I tried using the “/admin” switch to connect to the server in question, but I did not see the scan dialog on the screen as expected.

However, since this was a virtual server, I tried another angle. I used the VMware Infrastructure Client to look at the ACTUAL DESKTOP of the server, which was not logged in.

When I logged in, it showed me the scan dialog and I was able to stop/pause/snooze the scan.

Curious, and wanting to understand more about what happened, I then connected again with the /admin switch, without having logged off of the actual desktop. It locked the desktop I was looking at in the VI Client, and it showed me the scan dialog.

So I have learned something new about RDP today. Session 0, or “the console session” is NOT actually quite the same thing as the actual desktop of the server. If the actual desktop is logged off, you don’t get things, even with the /admin switch, that you do when “standing in front of the server” or viewing the console with VMware Infrastructure Client.

I’ll have to file this one away for future reference. When wanting to make sure I catch console pop-ups, make sure the actual, real desktop of the server is logged on if possible.


Updated: Configure PPTP on a Watchguard Firebox Using RADIUS Authentication and Windows 2008

January 17th, 2010 by Paul Sterley | 4 Comments | Filed in Firewall Configuration, In the Windows Box, Windows Server

This article covers the steps to configure a Watchguard Firebox to pass authentication traffic for PPTP VPN connections to a RADIUS server running on Windows Server. The first part of the document covers Fireware 10.2 and Windows 2008. Legacy technologies can be found at the bottom of the article.

Usage Scenario: You wish to have the Firebox terminate the VPN connection, but still pass the authentication through to your Active Directory server instead of using static Firebox user accounts.

Note: Fireware has Active Directory and LDAP authentication methods, but these cannot be used for PPTP VPN authentication as of version 10.2.12. These can be used with MUVPN, which requires IPSEC Client software to be loaded on the connecting workstation.

Benefits of having the firewall terminate a PPTP VPN:

  • It is not necessary to have more than one IP address on the Firebox’s external interface.
  • It is not necessary to set up 1:1 NAT, which would put your server on a different outgoing IP address from the rest of the network (this is a good thing from a “keep it simple” perspective).
  • You can reboot the server without dropping your VPN connection – you cannot authenticate while it is rebooting, but if you are already connected, you will stay connected.
  • PPTP tunnels terminated by the Firebox are generally faster and more reliable than when terminated by a Windows server.
  • It is not necessary to load any software on the connecting workstation; it’s built into Windows.

 

Configure the Firewall:

 

1.       Open the Policy Manager.

2.       Configure RADIUS Authentication:

a.       Click Setup -> Authentication -> Authentication Servers.

b.      Click the RADIUS tab.

c.       Check to enable the RADIUS server.

d.      Type the IP address of the Windows 2008 server and set the port to 1812.

e.      Type a “secret” and confirm it. Take note of this in your network documentation, as you will need it later to configure Windows 2008, and possibly even later still, when you change things on the network. Try to use a secure secret here.

f.        Click OK to close the Authentication Servers dialog. 

3.       Create the PPTP VPN Policy:

a.       Click VPN -> Mobile VPN -> PPTP.

b.      Check the box to Activate Mobile VPN with PPTP.

c.       Check the box to use RADIUS authentication.

d.      Require 128-bit Encryption (I think this is optional, but why would you leave it off?).

e.      Add an IP address pool.

Note: It would be a very good idea to create a DHCP exclusion matching this IP address pool, both to avoid IP conflicts due to DHCP, and to remind you that you have assigned these addresses when you go looking for an available static IP address later. If you have an IP address spreadsheet (hopefully you do), add it there as well. Documentation is key to an organized network.

f.        Click OK. 

4.       Create an Access Rule to allow VPN traffic:

a.       Click Edit -> Add Policy.

b.      Expand Packet Filters and double-click the “Any” filter.

c.       Change the name to “Any-RUVPN” (or something else that is descriptive to you).

d.      Remove “Any-Trusted” from the “From” area.

e.      Click Add-> Add User, select type “PPTP” and “Group”, double-click PPTP-Users, and click OK.

f.        Click Add-> Add other -> Network IP, add your internal network subnet, and click OK -> OK.

g.       Remove “Any-External” from the “To” area.

h.      Click Add-> Add other -> Network IP, add your internal network subnet, and click OK -> OK.

i.         Click Add-> Add User, select type “PPTP” and “Group”, double-click PPTP-Users, and click OK.

Note: We have just created a bi-directional rule that allows traffic in both directions over the PPTP VPN. Your rule should have “PPTP-Users” and your internal subnet in both the “From” and the “To” areas.

j.        Click OK to close the policy properties dialog. 

5.       (Important!) Configure DNS on the Firebox:

a.       Click Network -> Configuration and go to the WINS/DNS tab.

b.      Enter the DNS servers for your network.

Note: The DNS settings are important for your VPN client to obtain the DNS server automatically from the firewall when the VPN connects. Unfortunately, as of Fireware 10.2, the DNS suffix is not passed to the VPN client, so you will need to include that in the VPN connection’s advanced properties on the workstation.

6.       Upload your config to your firewall. 
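The DHCP exclusion note in step 3 is easy to get wrong by one or two addresses. Here is a minimal sketch in Python that lists any VPN pool addresses sitting inside the DHCP scope without an exclusion covering them. The function name and every address in the example are hypothetical; plug in your own scope and exclusion ranges.

```python
import ipaddress

def unprotected_addresses(vpn_pool, scope_start, scope_end, exclusions):
    """Return VPN pool addresses that DHCP could still hand out.

    vpn_pool   -- list of address strings assigned to the PPTP pool
    exclusions -- list of (first, last) address string pairs excluded in DHCP
    """
    start = ipaddress.ip_address(scope_start)
    end = ipaddress.ip_address(scope_end)
    excluded = [(ipaddress.ip_address(lo), ipaddress.ip_address(hi))
                for lo, hi in exclusions]
    at_risk = []
    for text in vpn_pool:
        addr = ipaddress.ip_address(text)
        in_scope = start <= addr <= end
        is_excluded = any(lo <= addr <= hi for lo, hi in excluded)
        if in_scope and not is_excluded:
            at_risk.append(str(addr))
    return at_risk

if __name__ == "__main__":
    # Hypothetical example: five pool addresses, but the exclusion only covers three.
    pool = [f"192.168.1.{n}" for n in range(200, 205)]
    print(unprotected_addresses(pool, "192.168.1.100", "192.168.1.250",
                                [("192.168.1.200", "192.168.1.202")]))
```

An empty list means every pool address is either outside the DHCP scope or covered by an exclusion.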

Configure Windows 2008:

1.       Prerequisites:

a.       Network Policy and Access Services

b.      Windows Firewall disabled or configured to allow RADIUS traffic on port 1812. 

2.       Ensure that NPS is installed and started. 

3.       Create a Security Group:

a.       Create a security Group on your AD domain controller with a name that is descriptive to you (VPNUsers, for example) and populate it with users who will have VPN access. 

4.       Open the Server Manager. 

5.       Tell Windows about the RADIUS Client:

a.       Expand Roles -> Network Policy and Access Services -> NPS (Local) -> RADIUS Clients and Servers, and select RADIUS Clients.

b.      Right-Click RADIUS Clients and select New RADIUS Client.

c.       Check the box to enable the RADIUS Client.

d.      Type a friendly name (Firebox) for the RADIUS Client.

e.      Add the IP address of the Firebox.

f.        Select RADIUS Standard from the Vendor Name list.

g.       Choose the “Manual” radio button.

h.      Type and confirm the “secret” you entered into the Firebox config in the “Configure the Firewall” section.

i.         Make sure both checkboxes at the bottom of the dialog are unchecked and click OK. 

6.       Configure a RADIUS Authentication Policy:

a.       Expand Roles -> Network Policy and Access Services -> NPS (Local) -> Policies -> Network Policies.

b.      Right-Click Network Policies and select New.

c.       Type a Policy name that will be descriptive to you (RUVPN Connections, for example).

d.      Leave the “Type of network access server” set to “Unspecified” and click Next.

e.      Click the Add button and double-click “Windows Groups” in the Conditions list.

f.        Click the Add Groups button and type or search for the VPN users group you created earlier.

g.       Click OK -> OK, which should bring you back to the Specify Conditions dialog.

h.      Click the Next button to get to the Specify Access Permission dialog.

i.         Leave “Access granted” selected and click Next.

j.        Ensure that MS-CHAP-v2 and MS-CHAP are selected, and click Next.

k.       Click Next again without configuring any constraints.

l.         In the left Windows pane, select Standard under RADIUS Attributes.

m.    Remove any existing attributes and click Add.

n.      Double-click Filter-ID.

o.      Click the Add button.

p.      Type “PPTP-Users” (case sensitive) into the “String” field and click OK.

q.      Click OK and Close to get back to the Configure Settings dialog.

r.        Select Encryption under Routing and Remote Access, and uncheck “No Encryption”.

s.       Click Next -> Finish.

t.        Right-click your new policy and select “Move Up” repeatedly until it is first in the list.

Test your configuration:

1.       Set up a workstation outside the firewall with PPTP VPN.

2.       Connect to the VPN with a user who exists in the VPN users group you created in AD.

3.       Once the VPN is running, test access to network resources.

Note: It is possible to be connected to the VPN, but still have no resource access if you did not configure the access policy properly, so be sure to test this.
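If the VPN will not connect at all, a quick first check from the outside workstation is whether the Firebox is answering on PPTP's control port, TCP 1723. A minimal sketch in Python follows; the function name is mine and the address in the example is a documentation placeholder. It only tests the TCP control channel, so it says nothing about the GRE traffic (IP protocol 47) that carries the tunnel itself, or about RADIUS.

```python
import socket

PPTP_CONTROL_PORT = 1723  # PPTP's control channel runs over TCP port 1723

def tcp_port_open(host, port=PPTP_CONTROL_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False

if __name__ == "__main__":
    # 203.0.113.10 is a placeholder; use your Firebox's external IP address.
    print(tcp_port_open("203.0.113.10"))
```

A result of False points at the firewall policy or something upstream blocking the port. A result of True with a failing login points back at the RADIUS and NPS configuration.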

 

Update:

If you have an older Firebox running WSM 7.x, and wish to use PPTP terminated by the firewall, with RADIUS authenticated by a Windows 2008 server, use these instructions for the firewall side:

Note: You will need to adjust the policy in NPS on the Windows 2008 server to use “pptp_users” instead of “PPTP-Users”. This changed between WSM and Fireware.

 

Configure a legacy Firebox (WSM 7.x) for Remote User PPTP:

1.       Open Policy Manager and select Setup -> Firewall Authentication.

2.       Select the radio button for RADIUS Server -> OK -> OK.

3.       Enter the IP address of the Windows server running RADIUS (NPS on Windows 2008, or IAS on older versions).

4.       Change the Port number to 1812 and enter your shared secret -> OK

5.       Click Network -> Remote User -> PPTP tab.

6.       Check the checkboxes for Activate Remote User and Use Radius Authentication.

7.       Click the Add button, select Host IP Address and enter the first IP address you allocated for use by the Firebox -> OK.

8.       Repeat this until all of your allocated IP addresses have been entered.

Note: You can copy/paste into the IP address field.

Note: You may wish to enable logging here if you have any difficulty getting this to work.

9.       Click OK.

 

Configure a legacy Firebox Access Rule for RUVPN:

1.       Add a service to allow traffic from VPN Users:

a.       Click Edit -> Add Service. Expand Packet Filters and select “Any”.

b.      Click the Add button. Change the name to “Any-RUVPN”.

Note: If you change this name, I recommend against using spaces.

c.       On the Incoming tab, select “Enabled and Allowed” from the selection list.

d.      Click the Add button in the “From” area and add the “pptp_users” group.

Note: If the “pptp_users” group is not available to be selected here, you can click “Add other”, drop down and select “Radius User or Group” and type pptp_users in. I had to do this with a Firebox. Once I had uploaded the config and firmware to the firebox, then pulled down a fresh config file from the firebox, the pptp_users that I had typed in became the special Firebox group and took on the icon with the two heads with a red thing behind them, indicating that it recognized the special group. Your mileage may vary.

e.      Click the Add button in the “To” area and add “Trusted”.

f.        Go to the Outgoing tab.

g.       Add “Trusted” to the “From” area and “pptp_users” to the “To” area.

h.      Finish the rule and upload the configuration to the Firebox.

 

 

 

If you have a Windows 2003 server and wish to use IAS for RADIUS authentication for a Watchguard Firebox, here are the steps:

Install and Configure IAS on Windows 2003:

 

Note: You must either disable SMB Signing or use Firebox Software version 7.30-B2938 or later!

 

1.       In Add/Remove programs -> Windows Components -> Networking Services, check “Internet Authentication Service” and finish the wizard.

2.       Open the Services applet and stop, then restart the IAS service. Refresh the screen and ensure that the service continues to show “running” status. Some applications (the Symantec antivirus management console, for example) interfere with IAS by using port 1812. If this is the case you will need to configure IAS on a different server.

3.       Open Administrative Tools -> Internet Authentication Service and select Radius Clients in the left pane.

4.       Click Action -> New Radius Client. Enter “Firebox” for the friendly name.

Note: If you change this name, I recommend against using spaces or non-alpha characters.

5.       Enter the Trusted IP address of the Firebox for the Client Address and click Next.

6.       Verify that RADIUS Standard is the selected protocol.

7.       Enter and confirm a “shared secret” of your choice.

Note: I recommend Uppercase, Lowercase, and Numbers – but not non-alpha characters.

8.       Verify that RADIUS Standard is the selected Client-Vendor.

9.       Verify that the box for “Request must contain the Message Authenticator attribute” is NOT checked, and click Finish.

10.   Select Remote Access Policies and click Action -> New Remote Access Policy.

11.   Select the option for “Set up a custom policy”.

12.   Enter VPNUsers for the friendly name of the policy.

Note: If you change this name, I recommend against using spaces or non-alpha characters.

13.   Click Next -> Add -> select Windows-Groups -> Add -> Add -> select your VPNUsers group -> OK -> OK -> Next.

14.   Select the radio button for “Grant remote access permission” -> Next.

15.   Click the Edit Profile button -> Authentication tab.

16.   Verify that the checkboxes for “Microsoft Encrypted Authentication version 2 (MS-CHAP v2)” and MS-CHAP are checked.

17.   Go to the Encryption Tab and clear the check box next to “No Encryption”.

18.   Click the Advanced tab and remove “Framed-Protocol” and “Service-Type”.

19.   Click Add -> Filter-Id -> Add -> verify that “string” is selected and type “pptp_users” into the attribute field.

Note: For Fireware Pro 8.2 the string must be set to “PPTP-Users” (case sensitive).

Note: Other documentation may suggest that you type something else here, like your group name. DON’T. The Firebox wants to see “pptp_users” or “PPTP-Users” in this attribute, just as it is typed here – lowercase, underscore or hyphen and all.

20.   Click whatever combination of OK, Next, and/or Finish is required to complete the config. If it prompts you to view help topics, say no.
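Since the Filter-Id string is case sensitive and differs between Firebox software families, it can help to keep the values from this article in one place. The following Python sketch is just a lookup table; the keys are labels I chose for illustration, and the values are exactly the strings given in the steps above.

```python
# Filter-Id strings as given in this article, keyed by Firebox software family.
# The key names are illustrative labels, not official product identifiers.
FILTER_ID_BY_FIREBOX_SOFTWARE = {
    "WSM 7.x": "pptp_users",
    "Fireware Pro 8.2": "PPTP-Users",
    "Fireware 10.2": "PPTP-Users",
}

def filter_id_matches(software, configured_value):
    """Return True if configured_value is exactly the string the Firebox expects."""
    expected = FILTER_ID_BY_FIREBOX_SOFTWARE[software]
    return configured_value == expected   # exact, case-sensitive comparison

if __name__ == "__main__":
    print(filter_id_matches("Fireware 10.2", "pptp-users"))   # wrong case
    print(filter_id_matches("WSM 7.x", "pptp_users"))
```

For any software version not listed here, check what that version expects rather than guessing.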

 


Updated: Recover from a USN Rollback WITHOUT Demoting and Promoting your DC

October 27th, 2009 by Paul Sterley | 1 Comment | Filed in Backup and Restore, ESXi, IIS, In the Windows Box, Virtualization, Windows Server

What’s a USN Rollback? That’s when you’ve restored an Active Directory DC in a multiple DC environment using a method that is not Active-Directory Aware. Examples include Ghost images, VMware or Hyper-V snapshots, or other imaging or volume-level restore methods.

Why is that a problem? A very good detailed explanation is available here, but the basic idea is that AD keeps track of which servers it has replicated with and when, and if a DC is rolled back in a way that is not compatible with the record-keeping, the affected DC will disable inbound and outbound replication, and refuse to replicate with the other DCs.

Here’s a related article by the same author as the above post, which led me to my solution this evening. My article expands on the second option provided, but goes into the mechanics of it, and the associated difficulties.

According to Microsoft’s Knowledge Base article on the subject, recovering from this situation entails forcibly demoting the DC, cleaning up the AD, and then (optionally) promoting it again. If the DC in question has no other roles, or just a couple of basic ones such as a print server, this might be the best way to go, if you’re familiar with such things as seizing FSMO roles and performing metadata cleanup in Active Directory after an unsuccessful DC demotion.

** Update: Read on for more details about how this all works, but make sure you check the update at the bottom of the article for the easier method I successfully tested!

However, if you’re not familiar with these things, or you have other applications on the server which might be affected (IIS, in particular, is very sensitive to the permissions changes associated with DC promotion), this might create a very large amount of havoc on your server.

Your saving grace, if you have one, is a System State backup from before the USN rollback occurred. If you don’t have a backup of JUST the System State, perhaps you can restore an entire image to another server, boot it, and create one.

If you have or can create one of these, your solution becomes much simpler. You just need to boot your server in Directory Services Restore Mode, restore the System State, DO NOT mark any part of your restore as authoritative, and reboot.

After the reboot, you might need to remove the flags AD has set, which have disabled inbound and outbound replications. The commands for this are:

repadmin /options [YourServerName] -disable_inbound_repl
repadmin /options [YourServerName] -disable_outbound_repl

Note: This looks like you are disabling replication, but the minus sign (-) before each option removes that “disable” flag, which re-enables replication. I know, it’s counter-intuitive, but trust me on this one – or go check the syntax yourself.

Of course, you need the Support Tools installed to get the repadmin utility. Once you run those commands, your server will start replicating again, and the more up-to-date DC(s) will override the old, out of date information your USN Rollback victim was holding onto.
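The plus/minus convention in those repadmin commands is easy to get backwards, so here is a minimal sketch in Python that builds the command lines from a server name. The function names and the server name DC02 are hypothetical, and the script only prints the commands; it does not run repadmin.

```python
def repadmin_option_command(server, flag, set_flag):
    """Build a 'repadmin /options' command that sets (+) or clears (-) a flag."""
    sign = "+" if set_flag else "-"
    return ["repadmin", "/options", server, f"{sign}{flag}"]

def reenable_replication_commands(server):
    """Commands that clear both 'disable replication' flags on the given DC."""
    return [
        repadmin_option_command(server, "disable_inbound_repl", set_flag=False),
        repadmin_option_command(server, "disable_outbound_repl", set_flag=False),
    ]

if __name__ == "__main__":
    # DC02 is a placeholder for the name of the affected domain controller.
    for command in reenable_replication_commands("DC02"):
        print(" ".join(command))
```

Clearing a "disable" flag is what turns replication back on, which is why both commands carry a minus sign.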

There are some extra difficulties associated with the above plan:
1. If you have to restore a server image to create that System State backup, and you restore to different hardware, things could get a little messy. Is it messier than demoting, seizing FSMO roles, performing metadata cleanup, promoting, and cleaning up the fallout from your installed apps? You’ll have to decide on that one.

2. This requires having an extra server (or two, if you want to restore more than one DC to create a stable lab environment from which to back up the System State) lying around. Do you have those resources available?

I was facing this issue today, and all of the above became MUCH simpler for me when I realized I could use the Doyenz Test Lab to sort all of this out. I did NOT have a System State backup from before the USN Rollback, but I HAVE been running backups into the Doyenz system since before the problem began.

Here is what I did:
1. Created a backup of the System State

a. Restored a copy of the affected server in the Doyenz Test Lab. I specifically restored from the date BEFORE the USN Rollback happened. It was easy to find this by looking at the date of the last successful replication with repadmin on the affected server.
b. Performed a System State backup using NTBackup (you can do this with WBAdmin on Windows 2008).
c. Zipped the backup file and sent to an FTP server.
d. Shut down the restored server.

2. Performed a test run to make sure this was going to work, without affecting the live servers.

a. Using the Doyenz Portal, I selected last night’s backup and restored it for both servers.
b. I booted the primary DC (the one with the FSMO roles) first.
c. Attached the second (USN Rollback victim) server to the first one in the Lab, and booted it.
d. Pulled the System State backup down from the FTP site onto the affected server.
e. Rebooted the affected server into Directory Services Restore Mode.
f. Restored the System State on the affected server.
g. Rebooted the affected server into Normal Mode.
h. Used the repadmin commands to remove the replication blocks.
i. Forced replication using AD Sites and Services.

3. Verified successful replication.

a. Created a user account on one DC in the Test Lab, forced replication, and checked for the account on the other DC.
b. Deleted the user account on the other DC, and checked it on the first DC.

4. Tested the touchy sensitive web applications that are running on the affected server.

5. Shut down the servers in the test lab.

After this successful test, I notified the users of pending late-night downtime, and repeated the above steps, this time on the live, production server and with great confidence of the outcome. Sure enough, I restored the AD replication functionality of the server with minimal downtime, without crossing my fingers, holding my breath, and hoping against hope that it would work and not trash the server.

What is more, since the production server is a virtual server, and I have VPN access to the virtual host, I was able to perform the entire operation from my home office, 30 miles away. I didn’t swap any tapes, set up any lab hardware, or drive to the server site late at night. I did the whole thing in comfortable clothes with a 2-liter bottle of Ruby Red Squirt, Winamp playing “Save Me” by Queen, and my devoted cat purring on my lap.

What could be better than that?

Update: It was very handy to be able to do the above scenario, but what is even handier is that I was able to find a significantly simpler method. So much simpler, I wonder why it did not occur to me sooner, and why Microsoft doesn’t have this listed in their KB article.

I set this problem up in a lab scenario again, and this time rather than do a complicated restore of an earlier version of the machine, I simply:

  • Performed a System State backup of the machine (in its broken, non-replicating condition).
  • Booted it into Directory Services Restore Mode.
  • Restored the System State backup, carefully NOT selecting the option to make it authoritative.
  • Rebooted, and ran the above repadmin commands to re-enable replication.

After that, I was able to trigger another replication, and it worked just fine.
