You suspect file system problems. You run CHKDSK _without_ the /F or /R switch, so it runs in read-only mode. It checks the disk and tells you that you have over six million security descriptors that need to be replaced with the default ones.
You’re not sure if your server will come up OK when done fixing all of this.
You don’t know how long it is going to take to fix.
Well, take my word for it: you don’t want to find out the hard way that it takes too long. I ran this scenario on the following:
Dell PowerEdge R710 server with:
- PERC 6/i SAS RAID card with 256 MB cache
- Dual quad-core 2.25 GHz processors
- 16 GB memory
- Six 600 GB 15K SAS disks in a RAID 5 with the default stripe size.
- I am running VMware ESXi 4.0 Update 1.
- The guest OS is Windows 2003 R2 SP2. It is the only VM running, with 4 CPUs allocated.
I ran CHKDSK in read-only mode and it reported 6,864,384 files with bad security descriptors.
I started running CHKDSK with the /R switch and recorded the following:
The process fixes approximately 67,150 descriptors per hour, or about 1.61 million per day.
At that rate, fixing all 6,864,384 descriptors will take about 4.3 days.
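If you want to run the same arithmetic on your own numbers, here is a minimal Python sketch. The two constants are the figures from my server above; the function name is just something I made up for illustration, and it assumes the repair rate stays steady for the whole run, which may not hold on your hardware.

```python
# Figures from the read-only CHKDSK pass and the observed repair rate above.
BAD_DESCRIPTORS = 6_864_384
FIXED_PER_HOUR = 67_150  # observed and approximate


def estimate_repair_time(total_items, items_per_hour):
    """Return (hours, days) needed to process total_items at a steady rate."""
    if items_per_hour <= 0:
        raise ValueError("items_per_hour must be positive")
    hours = total_items / items_per_hour
    return hours, hours / 24


hours, days = estimate_repair_time(BAD_DESCRIPTORS, FIXED_PER_HOUR)
print(f"Estimated repair time: {hours:.1f} hours ({days:.1f} days)")
# prints "Estimated repair time: 102.2 hours (4.3 days)"
```

To get your own rate, note the file number CHKDSK is working on at two points in time and divide the difference by the hours elapsed.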
I know it’s a bad idea to interrupt CHKDSK while it is in progress, but there is no way in hell the customer is going to allow me 4.3 days of downtime. It’s just not going to happen.
So I thought about CHKDSK for a while, and came up with this:
Stage 1 works with the files themselves. CHKDSK examines each file’s record in the master file table (MFT) and checks it for internal consistency, which tells it whether the file’s metadata is likely to be messed up.
Stage 2 works with the indexes, which are essentially the NTFS directories. This is where CHKDSK looks at where the files are “supposed” to be, as indicated by the “map” in the directory indexes. Then it checks that every file listed there actually has a valid file record, and that every file record is listed in some directory.
Stage 3 works with the security descriptors on the files and folders.
Stage 1 and stage 2 are the most dangerous stages. This is where, if interrupted, the files or indexes could become irrecoverably corrupted, and we’d be very unhappy campers.
Stage 3 is, in my opinion, an area of less danger. The files and the indexes are OK; it’s just checking security descriptors and fixing them if needed.
I took a calculated risk and rebooted the server when it was working on file # 422,000 or thereabouts. It seemed more or less happy. I ran CHKDSK in read-only mode again, and after getting through Stage 1 and Stage 2 without errors, it started reporting bad security descriptors again in Stage 3 at file # 422,000.
Maybe I dodged a bullet, or maybe interrupting CHKDSK in Stage 3 is not as bad as it could be.
Anyway, rebooting during a CHKDSK operation is bad news, and to be avoided if possible. So, this article offers you a way to find out how long your CHKDSK operation might take, or avoid that risk altogether.
I offer you an alternate solution that does NOT involve setting a CHKDSK flag, rebooting the production server, and hoping for the best.
This method is outlined very roughly like this:
- Take a full volume backup (including the errors) of the production server using ShadowProtect or other disk-based backup system.
- Restore this backup to an alternate or loaner server.
- Fix the file system on the loaner server (giving you a rough idea of the time it would take on the production server).
- Run a full backup of the fixed temporary server’s data volume.
- At this point, you have a choice:
a. It didn’t take very long, so go ahead and run it on the production server, or
b. Proceed with this alternate method.
While you have been fixing the file system on the temporary server, users have been modifying files on the primary server. So:
- Use Robocopy with /MIR, /COPY:DATSO, and so on to copy those changes from the production server to the temporary server. (Users must be offline and not making changes during this time.)
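- Restore the temporary server’s backup to the production server. Make sure it is a backup taken after the sync, so it includes the users’ changes. (Users are offline during this time.)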
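For the Robocopy step, something like the following Python sketch can build the command line. The server name, share path, and log path are hypothetical placeholders, and the retry settings are my own assumption rather than anything I tested for this article. It defaults to a dry run (/L), because /MIR deletes files on the destination that no longer exist on the source, and you want to see what it plans to do before you let it loose.

```python
def build_robocopy_command(source, destination, log_file=None, dry_run=True):
    """Build a robocopy argument list that mirrors source onto destination."""
    cmd = [
        "robocopy", source, destination,
        "/MIR",          # mirror the tree, including deletions on the destination
        "/COPY:DATSO",   # Data, Attributes, Timestamps, Security (ACLs), Owner
        "/R:1", "/W:1",  # assumption: retry once, wait one second between retries
    ]
    if dry_run:
        cmd.append("/L")  # list what would be copied without changing anything
    if log_file:
        cmd.append(f"/LOG:{log_file}")
    return cmd


# Hypothetical paths: production share as source, temporary server volume as destination.
cmd = build_robocopy_command(
    r"\\PRODSRV\D$\Shares", r"D:\Shares", log_file=r"C:\Temp\sync.log"
)
print(" ".join(cmd))
```

Once the dry-run log looks sane, call it again with dry_run=False and hand the list to subprocess.run on the temporary server. Keep in mind that Robocopy returns non-zero exit codes even when it succeeds (values below 8 mean nothing failed), so don't treat every non-zero code as an error.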
The drawbacks:
- It involves moving the data all over the place repeatedly, which takes a lot of time and network bandwidth.
- It requires two separate backup locations so you don’t overwrite your only backup.
- It relies entirely on the integrity of the file system on the temporary server.
- Once the restore has begun, you CANNOT interrupt it the way you can (even if you shouldn’t) interrupt the CHKDSK.
The benefits:
- Depending on data size and number of files that need to be fixed, the amount of downtime required for synchronizing changes and restoring the volume might be significantly less than letting the CHKDSK run.
- No more interruptions of CHKDSK if the users won’t let you fix it all in one sitting.
- No-risk CHKDSK. How many times have you run CHKDSK /R and wondered if your file system would mount when it was done?
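To weigh the first benefit against the drawbacks, you can put rough numbers on both paths. This is a minimal sketch under stated assumptions: the CHKDSK figures are from my server, but the volume size, restore speed, and final sync time below are hypothetical values I picked purely for illustration. Measure your own before deciding.

```python
def chkdsk_downtime_hours(items_to_fix, items_per_hour):
    """Downtime if CHKDSK is left to run to completion on the production server."""
    return items_to_fix / items_per_hour


def restore_downtime_hours(volume_gb, restore_gb_per_hour, sync_hours):
    """Downtime for the alternate method: final Robocopy sync plus volume restore."""
    return sync_hours + volume_gb / restore_gb_per_hour


# Observed on my server (see the figures earlier in the article).
chkdsk_hours = chkdsk_downtime_hours(6_864_384, 67_150)

# Hypothetical: 1,500 GB volume, 200 GB/hour restore, 2 hours for the final sync.
alternate_hours = restore_downtime_hours(1500, 200, 2)

print(f"CHKDSK in place:  {chkdsk_hours:.1f} hours of downtime")
print(f"Sync and restore: {alternate_hours:.1f} hours of downtime")
```

Only the final sync and the restore count as downtime here. The initial backup, the restore to the loaner, and the CHKDSK on the loaner all happen while users keep working.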
There are some aspects of this I would like to discuss before they come up in the comments:
Q: What if the customer has only one server, and it’s SBS?
A: Well, now that’s tricky. It is still possible to do this, but it gets complicated. You’d have to restore that volume to similar hardware (great if it is a virtual server), because you’d be restoring the OS as well, so that the permissions wouldn’t get trashed. So then you’d have two servers with the same name, same IP address, same domain, etc. This is not an insurmountable problem. All you need is a $69 broadband router to put between them, and change the IP address on your temporary server. That will significantly slow down file operations, and in light of the other issues I am about to cover, this might not be worth it.
Q: What if there are other things on that volume (Exchange, other databases, etc) besides files?
A: Well, now you’ll have to make a choice on how you want to handle that. You could do something like this:
- Dismount the database and copy it off before you do the restore, then copy it back afterward.
- Back up the databases separately using other tools, and restore them afterward.
- After fixing all of the files on the temporary server and syncing them with Robocopy, delete all of the files on the production server and run CHKDSK to fix the remaining issues (it should run VERY quickly with all of the files gone). Then do a file-by-file restore (which will be VERY slow), and then of course you’ll have to fix the NTFS permissions.
Q: What if the customer does not have an alternate (temporary server)?
A: Seriously? <rant> Come on now. It really amazes me how many IT consulting companies, large and small, do not have usable loaner servers to put at client sites in an emergency.
I run an IT consulting company. Me. I’m a one-man show at this point. I have THREE loaner servers I can bring to bear if needed. I have a half-dozen extra hard disks lying around to help configure these servers as needed. If I can afford this, so can your company. It simply requires dedication to your customers instead of squeezing every dollar you can out of your customers.
Server1: 2U compact low-noise rack-mount white-box running an Intel motherboard, quad-core 2.5 GHz proc, 8 GB RAM, and a couple of 1 TB SATA disks. No RAID. It’s loaded with VMware ESXi that boots from a USB stick. This machine cost me about $800 to build. It’s handy to have around to run labs on, when not being used for a loaner server.
Server2: Micro-ATX Tower white-box running an Intel motherboard, quad-core 2.5 GHz proc, 8 GB RAM, and three 1 TB SATA disks. No RAID. This machine cost me about $600 to build. This one doubles as a gaming PC for when my gaming friends come over.
Server3: HP ProLiant DL320 G3 1U rack-mount w/onboard SATA RAID, 2 disks max. It’s an older 32-bit machine, but it has 4 GB of RAM and I swapped out the two 80 GB SATA disks it had for two 1 TB SATA disks. This machine was given to me by a customer who retired it. This one doubles as a dedicated UT2004 server for when my gaming friends come over.
These may not be super-impressive machines, but as loaner servers in a pinch, they are very flexible. I can configure them with software mirroring for fault tolerance, or I can configure them striped for capacity (I just make sure to back up the data incrementally every hour while in use). They have enough RAM to run an SBS 2008 server and enough CPU to run two or three virtual guests if needed. One of these machines ran my entire server infrastructure (SBS 2008 and Windows 2008) for two weeks last year when I had an air conditioning issue.
So if your customer does not have a spare server lying around, maybe you can come up with something with your own resources. </rant>
Really, you have to look at the particulars of your situation and decide if this is a good idea for you. Still, it’s one more option to put in your tool belt.