Archive for the ‘Backup and Restore’ Category

Internet-Based Offsite Backup: Check Out CloudBerry Backup

July 29th, 2010 by Paul Sterley | 1 Comment | Filed in Backup and Restore

There are, of course, many ways to send your data into the cloud. More of them appear every day. Today, I’m going to talk to you about one combination of backup software, cloud storage, and cloud storage management software that I am putting in place for one of my customers.

The local backup software of choice in this case is StorageCraft ShadowProtect. It is writing a continuous incremental backup to disk, and it works well. That just leaves the problem of offsite backup.

This customer has about 1.5 TB of data. That narrows down the playing field of internet-based offsite storage considerably, due to pricing. With that much data, it becomes necessary to use inexpensive storage. I selected Amazon S3 as the most likely affordable option, and chose a NAS device which supports Amazon S3 to store the backups.

Then I discovered the flaw in my plan. The NAS device has a very limited interface for Amazon S3, and does not warn me when it fails to upload a file because that file exceeds the 5 GB limit imposed by Amazon. I had to start looking around for other options.

I became aware of CloudBerry Backup, by CloudBerry Lab. I’m putting it through its paces now, and so far it seems to be a pretty good tool for this job. In addition to Amazon S3, it supports several other cloud-based storage systems, but I’m working with Amazon right now, so that’s what I’ll talk about in this article.

CloudBerry Backup has a small footprint, is easy to install and configure, and does not use a lot of resources on the server. It has a lot of configuration options. The most recent version allows you to upload files that are located at network paths, so I can run this on a utility server and use it to upload data stored on a NAS device.

Most importantly, it seems to be a robust utility for sending the data to the cloud, making sure it gets there, and making sure it gets there my way. That last bit is the most important. When I first cranked up the NAS and told it to start uploading, the users got pretty upset because suddenly all of their bandwidth was being used up by the NAS. There was no throttling capability. I couldn’t pause the backup and wait for the evening, either. It was an all or nothing proposition.

CloudBerry Backup has bandwidth throttling built in, and a Pause Backup button right where you might go looking for one. (Quick note about the Pause function: when the backup was paused, the Speed indicator kept jumping around as if it were still uploading data and calculating speed, but the “Files Uploaded” counter did not increment. That was creepy. Not sure what that was about.)
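To get a feel for what a throttle setting costs in calendar time, here is a rough Python sketch. The 1.5 TB figure is this customer's data size; the throttle rates are made-up examples, and the math assumes decimal terabytes and ignores protocol overhead, compression, and retries.

```python
def upload_days(total_bytes: float, throttle_mbit_per_s: float) -> float:
    """Days needed to push total_bytes through a link capped at the given rate."""
    bytes_per_second = throttle_mbit_per_s * 1_000_000 / 8
    return total_bytes / bytes_per_second / 86_400  # 86,400 seconds per day


DATA_BYTES = 1.5e12  # about 1.5 TB, from the article (decimal TB assumed)

if __name__ == "__main__":
    for mbit in (1, 2, 5):  # hypothetical throttle settings, not measured values
        days = upload_days(DATA_BYTES, mbit)
        print(f"{mbit} Mbit/s cap: about {days:.0f} days for the initial upload")
```

Plug in your own uplink speed and throttle value; the point is only that the initial seed of a large backup set is measured in days or weeks, so being able to pause and throttle matters.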

There are also more advanced options for setting the process priority and the number of simultaneous threads to use per backup plan. You can create a single backup plan with all of your data in it, or several backup plans with different schedules.

Wait, what about the 5 GB file size? In “Advanced Mode”, CloudBerry Backup splits large files into “Chunks” of whatever size you specify. Unfortunately, this means you can’t download and re-assemble your data from the cloud without the CloudBerry Backup software, so you’re locked into a single tool – but that’s often the case anyway. The software does not expire, and CloudBerry Lab is talking about releasing a standalone tool for chunk re-assembly anyway.

I figured the chunking thing would make the file structure on Amazon completely incomprehensible, and that I would need to set up multiple buckets for the different servers I was uploading backups from, each with its own folder pair and schedule, etc – but I was wrong. I could have done that, but it wasn’t necessary. CBB does a great job of making the folder structure on Amazon understandable, even with the chunking going on. The best part of the chunking is, if your file is smaller than the chunk size you have specified, it doesn’t get chunked. That makes it a lot easier for verifying files if there aren’t many that are larger than your specified chunk size. The down side is that even though Amazon supports 5 GB files, CBB currently has a maximum chunk size of 1 GB. Oh well, you win some and you lose some.
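The chunking rule described above (split only when the file is larger than the chunk size) can be sketched in a few lines of Python. This is my own illustration of the idea, not CloudBerry's actual on-disk or on-S3 format; the function name and the sizes are assumptions.

```python
from typing import List, Tuple

GIB = 1024 ** 3


def plan_chunks(file_size: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs that cover a file of file_size bytes.

    A file no larger than chunk_size comes back as a single piece,
    which corresponds to "it doesn't get chunked".
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if file_size <= chunk_size:
        return [(0, file_size)]
    return [
        (offset, min(chunk_size, file_size - offset))
        for offset in range(0, file_size, chunk_size)
    ]


if __name__ == "__main__":
    # A hypothetical 6 GiB backup image with the 1 GB maximum chunk size.
    print(len(plan_chunks(6 * GIB, 1 * GIB)), "chunks")
```

With a 1 GB chunk size, every piece stays well under Amazon's 5 GB per-file limit, which is the whole reason for chunking here.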

Other cool things it can do include e-mail notifications on success or failure, deleting files from the destination that no longer exist on the source (including a number of days to keep them), keeping versions of files on the destination, encrypting the data, compressing the data, and it even includes an option for Amazon’s new “Reduced Redundancy Storage” for an additional price break.

So what happens when bad things happen?

You can choose from several logging levels, and where to put the log files, so that’s a good start.

After setting a large backup in motion, I decided to be mean to the software. I pulled the plug on the internet connection. It was down for maybe 2-3 minutes as the cable modem booted and re-trained. When the connection came back up, CBB resumed uploading as if nothing had happened. Not bad.

Then I rebooted the server. When the server came back up, things were not quite as I expected. I opened the software, and it showed a backup running, but it didn’t seem to be getting anywhere. I gave it quite a bit of time, but it did not progress. The Pause button was available but not functional. Eventually, I clicked the Stop Backup button, which did work, and then the Start Backup button, at which point it started recalculating things and uploading files again. This bears more testing, as I wouldn’t want to have to remember to check the CBB software after every reboot.

For my next mean trick, I told the CloudBerry Backup service to restart (not using the app interface, I went into services.msc). It restarted in about a half second, and the console didn’t even seem to notice, it just kept plugging away. Hmm. Then I stopped the service and left it off. The software kept running. OK, that’s weird. I guess the service is just a function for starting the software when the user is not logged in.

I went into Task Manager and ended task on CBBackupPlan.exe. That brought the console up short. At that point it seemed to be in pretty much the same state as when I rebooted. I stopped the backup again, started it again, and it once again counted up the files and started transferring them.

OK, so there’s a bit of streamlining needed in this product – but hey, for the great low price, it’s a pretty good setup. It’s cheap, it’s easy to use, it has a lot of configuration options, and it’s somewhat resilient to abuse.

There’s another module called CloudBerry Explorer which has some good features for managing your offsite storage. At the moment, it doesn’t seem to have inherited the Network Shares feature, so I can’t really use it, but I’m sure that’s coming soon.

Run CHKDSK /F at ROCKET SPEED without rebooting your server

July 18th, 2010 by Paul Sterley | No Comments | Filed in Backup and Restore, Hardware, In the Windows Box, Management Software, Windows Server

CHKDSK can’t fix a volume when someone or something is using it.

Normally, when you run CHKDSK and you want to fix something, you run the command, it tells you that it cannot gain exclusive access to the disk, and asks if you want to schedule it for the next reboot. You say yes, reboot the server, and then CHKDSK gets to work halfway through the next server boot. The problem is, all of the services of that server, like AD/DHCP/DNS, etc, and any shared folders on other volumes are also offline during this time. This is very inconvenient.

Looking a little closer at what constitutes a file handle that locks CHKDSK from fixing the volume: 

  • If a service is running (QuickBooks Database Server Manager, for example) and is looking at the volume, CHKDSK is hands-off.
  • If a user has a file open, then CHKDSK is hands-off.
  • If you have Windows Explorer open on the server looking at the volume, CHKDSK is hands-off.
  • If you have a command prompt open and have changed directory to anything on the volume, CHKDSK is hands-off.
  • If you even have a folder on that volume shared on the server, CHKDSK cannot fix it without dismounting the file system.
  • If you carefully make sure that NONE of these are true, and if I haven’t missed any, you can actually run CHKDSK with the /F switch while your server is still running!

Here are some reasons you’d want to do this – and there’s one unexpected and very important one in there.

  • You could fix one volume while leaving the others accessible.
  • You could still have DNS/DHCP/PDC/Exchange services while the data volume is being repaired (if your Exchange database is on a different volume).
  • If this is a physical server, and you don’t have iLO or DRAC to remotely view the screen, running CHKDSK in this manner will allow you to watch the process run and check in on it from time to time, without having to be physically in front of the server.
  • Here’s the REALLY BIG ONE, and it is so dang big, I am simply amazed that I have not heard about this before:
  • IT IS FASTER! We’re not talking about 2x, or even 4x. It is ROCKET-FAST.

 I was fixing a server in single-mode (halfway through Windows boot), and it took 2.5 DAYS to fix the security descriptors on about three million files. I was forced to interrupt it to let the users back in.

I am now experimenting on another server that I restored the entire volume to (broken security descriptors and all). I made sure nothing had locks on the volume, and ran the CHKDSK /F with Windows up and running – and it has now fixed 2.4 million files in about 31 minutes! It may even be done with the 6.8 million files on this server before I finish writing, editing, and posting this blog entry (OK, maybe not quite that fast).
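For the skeptics, here is the arithmetic behind "ROCKET-FAST" as a small Python sketch. It uses only the rough figures quoted above (about three million files in 2.5 days at boot time, 2.4 million files in about 31 minutes with Windows running), so treat the result as a ballpark, not a benchmark.

```python
def files_per_minute(files: float, minutes: float) -> float:
    """Average number of files processed per minute."""
    return files / minutes


# Boot-time CHKDSK on the virtual server: ~3 million files in 2.5 days.
offline_rate = files_per_minute(3_000_000, 2.5 * 24 * 60)

# CHKDSK /F with Windows up and running: 2.4 million files in ~31 minutes.
online_rate = files_per_minute(2_400_000, 31)

speedup = online_rate / offline_rate

if __name__ == "__main__":
    print(f"boot-time run: {offline_rate:,.0f} files/minute")
    print(f"online run:    {online_rate:,.0f} files/minute")
    print(f"speedup:       roughly {speedup:.0f}x")
```

The two runs were on different hardware, so the ratio mixes several effects, but it is far too large to blame on hardware alone.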

This other server I am experimenting with is a physical server, where the other was virtual – but this server is running 7200 RPM SATA disks compared to the 15K SAS disks in the virtual server. It’s a generation older. I know that physical servers run a bit faster than virtual, but not THIS MUCH faster. No way.

The production virtual server still has half its file system needing to be fixed, and I intend to put this new development to the test during the next downtime window. I will post my results.

So what about those shares? Don’t want to delete and recreate them?

Try this MS KB document (Article ID: 125996) on for size. Export your shares before deleting them, run the CHKDSK, and then re-import your shares in 5 minutes plus a reboot.

Update: It is not necessary to export and delete your shares. CHKDSK prompts to force a dismount on the volume (rather than scheduling for the reboot) when you have shared folders, but no services or other file locks.


What to do when you KNOW your CHKDSK /R operation is going to take a VERY long time to run.

July 18th, 2010 by Paul Sterley | No Comments | Filed in Backup and Restore, Hardware, In the Windows Box, Management Software, Windows Server

You suspect file system problems. You run CHKDSK _without_ the /R switch, which runs in read only mode. It checks the disk and tells you that you have over six million security descriptors that need to be replaced with the default ones.

You’re not sure if your server will come up OK when done fixing all of this.
You don’t know how long it is going to take to fix.

Well, take my word for it: you don’t want to find out the hard way that it is too long. I am running this scenario on the following:

Dell PowerEdge R710 server with:

  • PERC 6/i SAS RAID card with 256 MB cache
  • Dual quad-core 2.25 GHz processors
  • 16 GB memory
  • Six 600GB 15K SAS disks in a RAID5 with the default stripe size.
  • I am running VMware ESXi 4.0 Update 1.
  • The guest OS is Windows 2003 R2 SP2. It is the only VM running, with 4 CPUs allocated.

I ran the CHKDSK in read only mode and it documented 6,864,384 files with bad security descriptors.
I started running CHKDSK with the /R switch and recorded the following:
The process fixes approximately 67,150 descriptors per hour, or 1,611,675 per day.
That means it will require 4.3 days to complete.
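If you want to run the same estimate for your own server, the arithmetic is simple enough to script. This Python sketch uses the numbers above; the helper for deriving a rate from two progress readings is my own addition, and it assumes the repair rate stays roughly constant.

```python
def rate_from_samples(count_start: int, count_end: int, hours: float) -> float:
    """Items fixed per hour, from two progress readings taken `hours` apart."""
    return (count_end - count_start) / hours


def days_to_finish(total_items: int, items_per_hour: float) -> float:
    """Estimated days to process total_items at a constant hourly rate."""
    return total_items / items_per_hour / 24


BAD_DESCRIPTORS = 6_864_384  # reported by the read-only CHKDSK pass
FIXED_PER_HOUR = 67_150      # observed repair rate

if __name__ == "__main__":
    days = days_to_finish(BAD_DESCRIPTORS, FIXED_PER_HOUR)
    print(f"Estimated run time: {days:.1f} days")
```

Take two readings of the file counter some time apart, feed them to rate_from_samples, and you have your own rate to plug in.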

I know it’s a bad idea to interrupt CHKDSK while it is in progress, but there is no way in hell the customer is going to allow me 4.3 days of downtime. It’s just not going to happen.
So I thought about CHKDSK for a while, and came up with this:

Stage 1 works with the files themselves. CHKDSK reads each file’s record in the master file table (MFT) and checks it for internal consistency, to see if there is a likelihood that the file is messed up.

Stage 2 works with the indexes. This is where CHKDSK looks at where the files are “supposed” to be in the disk, as indicated by the “map” it is looking at. Then it goes and looks to see if the files are actually where they are supposed to be.

Stage 3 works with the security descriptors on the files and folders.

Stage 1 and stage 2 are the most dangerous stages. This is where, if interrupted, the files or indexes could become irrecoverably corrupted, and we’d be very unhappy campers.

Stage 3 is, in my opinion, an area of less danger. The files and the indexes are OK; it’s just checking security descriptors and fixing them if needed.

I took a calculated risk and rebooted the server when it was working on file # 422,000 or thereabouts. It seemed more or less happy. I ran CHKDSK in read only mode again, and after checking Stage 1 and Stage 2 without errors, it started reporting bad security descriptors again on Stage 3 at file # 422,000.
Maybe I dodged a bullet, or maybe interrupting CHKDSK in Stage 3 is not as bad as it could be.
Anyway, rebooting during a CHKDSK operation is bad news, and to be avoided if possible. So, this article offers you a way to find out how long your CHKDSK operation might take, or avoid that risk altogether.
I offer you an alternate solution that does NOT involve setting a CHKDSK flag, rebooting the production server, and hoping for the best.

This method is outlined very roughly like this:

  1. Take a full volume backup (including the errors) of the production server using ShadowProtect or other disk-based backup system.
  2. Restore this backup to an alternate or loaner server.
  3. Fix the file system on the loaner server (giving you a rough idea of the time it would take on the production server).
  4. Run a full backup of the fixed temporary server’s data volume.
  5. At this point, you have a choice:
       a. It didn’t take very long, so go ahead and run it on the production server, or
       b. Proceed with this alternate method.

While you have been fixing the file system on the temporary server, users have been modifying files on the primary server. So:

  1. Use Robocopy with the /MIR and /COPY:DATSO switches, etc. to synchronize the changes from the production server to the temporary server (users must be offline and not making changes during this time).
  2. Restore this backup to the production server. (Users are offline during this time).
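As a rough illustration of the Robocopy step, here is a small Python helper that builds the command line. The paths are hypothetical, the retry and wait values are just my habit, and I have included /L (list only) so that the first run is a dry run that copies and deletes nothing.

```python
from typing import List


def robocopy_mirror_args(source: str, destination: str, dry_run: bool = True) -> List[str]:
    """Build a Robocopy argument list that mirrors source onto destination."""
    args = [
        "robocopy", source, destination,
        "/MIR",         # mirror the tree, including deletions on the destination
        "/COPY:DATSO",  # data, attributes, timestamps, security (ACLs), owner
        "/R:1",         # retry a failed copy once
        "/W:1",         # wait one second between retries
    ]
    if dry_run:
        args.append("/L")  # list what would happen without changing anything
    return args


if __name__ == "__main__":
    # Hypothetical paths: production share as the source, loaner volume as the target.
    print(" ".join(robocopy_mirror_args(r"\\PRODSERVER\Data", r"D:\Data")))
```

On Windows you could hand this list to subprocess.run. Keep in mind that /MIR deletes anything on the destination that is not on the source, so double-check the direction, and that Robocopy uses non-zero exit codes for successful copies (values below 8 are not failures).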

 The drawbacks:

  • It involves moving the data all over the place repeatedly, which takes a lot of time and network bandwidth.
  • It requires two separate backup locations so you don’t overwrite your only backup.
  • It relies entirely on the integrity of the file system on the temporary server.
  • Once the restore has begun, you CANNOT interrupt it the way you can (even if you shouldn’t) interrupt the CHKDSK.

 The benefits:

  • Depending on data size and number of files that need to be fixed, the amount of downtime required for synchronizing changes and restoring the volume might be significantly less than letting the CHKDSK run.
  • No more interruptions of CHKDSK if the users won’t let you fix it all in one sitting.
  • No-risk CHKDSK. How many times have you run CHKDSK /R and wondered if your file system would mount when it was done?

 

There are some aspects of this I would like to discuss before they come up in the comments:

Q: What if the customer has only one server, and it’s SBS?
A: Well, now that’s tricky. It is still possible to do this, but it gets complicated. You’d have to restore that volume to similar hardware (great if it is a virtual server), because you’d be restoring the OS as well, so that the permissions wouldn’t get trashed. So then you’d have two servers with the same name, same IP address, same domain, etc. This is not an insurmountable problem. All you need is a $69 broadband router to put between them, and change the IP address on your temporary server. That will significantly slow down file operations, and in light of the other issues I am about to cover, this might not be worth it.

Q: What if there are other things on that volume (Exchange, other databases, etc) besides files?
A: Well, now you’ll have to make a choice on how you want to handle that. You could do something like this:

  1. Dismount the database and copy it off before you do the restore, then copy it back afterward.
  2. Back up the databases separately using other tools, and restore them afterward.
  3. After having fixed all of the files on the temporary server and having synched them with Robocopy, delete all of the files on the production server, run the CHKDSK to fix the remaining issues (should run VERY quickly with all of the files gone), and then do a file-by-file restore (which will be VERY slow), and then of course you’ll have to fix the NTFS permissions.

Q: What if the customer does not have an alternate (temporary server)?
A: Seriously? <rant> Come on now. It really amazes me how many IT consulting companies, large and small, do not have usable loaner servers to put at client sites in an emergency.

I run an IT consulting company. Me. I’m a one-man show at this point. I have THREE loaner servers I can bring to bear if needed. I have a half-dozen extra hard disks lying around to help configure these servers as needed. If I can afford this, so can your company. It simply requires dedication to your customers instead of squeezing every dollar you can out of your customers.

Server1: 2U compact low-noise rack-mount white-box running an Intel motherboard, quad-core 2.5 GHz proc, 8 GB RAM, and a couple of 1 TB SATA disks. No RAID. It’s loaded with VMware ESXi that boots from a USB stick. This machine cost me about $800 to build. It’s handy to have around to run labs on, when not being used for a loaner server.

Server2: Micro-ATX Tower white-box running an Intel motherboard, quad-core 2.5 GHz proc, 8 GB RAM, and three 1 TB SATA disks. No RAID. This machine cost me about $600 to build. This one doubles as a gaming PC for when my gaming friends come over.

Server3: HP Proliant DL320 G3 1U rack-mount w/onboard SATA RAID, 2 disks max. It’s an older 32-bit machine, but it has 4 GB of RAM and I swapped out the two 80 GB SATA disks it had with two 1 TB SATA disks. This machine was given to me by a customer who retired it. This one doubles as a dedicated UT2004 server for when my gaming friends come over.

These may not be super-impressive machines, but as loaner servers in a pinch, they are very flexible. I can configure them with software mirroring for fault tolerance, or I can configure them striped for capacity (I just make sure to back up the data incrementally every hour while in use). They have enough RAM to run an SBS 2008 server and enough CPU to run two or three virtual guests if needed. One of these machines ran my entire server infrastructure (SBS 2008 and Windows 2008) for two weeks last year when I had an air conditioning issue.

So if your customer does not have a spare server lying around, maybe you can come up with something with your own resources. </rant>

Really, you have to look at the particulars of your situation and decide if this is a good idea for you. Still, it’s one more option to put in your tool belt.


Updated: QNAP -> Amazon S3: The Bucket does not exist, or you do not have access to it.

June 18th, 2010 by Paul Sterley | 2 Comments | Filed in Backup and Restore, Management Software

I have just lost my faith in the universe (again).

I’m working with a QNAP Turbo NAS TS-410U. I’m setting up Remote Replication to Amazon S3, for offsite backups.

When I researched the QNAP product line, I found with some degree of confidence that the TS-410U would support Amazon S3 replication. So I told my customer to purchase the $700 unit, plus four $100 hard disks to plug into it. She did, and the unit arrived.

Before we get into the Amazon S3 thing, I need to rant a little bit. Are you ready?

<rant>

You can always tell when a product design team consists of semi-intelligent MONKEYS that have been fed marijuana laced with crystal meth. I’m not talking about the TS-410U here. It seems to be a good product. So far, despite the troubles I am about to go into with the firmware and the Amazon S3, I like the TS-410U. No, when I accuse QNAP of employing baked, methed-out monkeys, I’m talking about the team who either designed, or decided which generic product to license to fill the need for, the rack-mount rail system.

To say that the rail system is not tool-less is an understatement, and a distraction from the core problem. The thing shipped with about 6 bags of screws of different sizes and shapes. All in all, there were about 80 screws. I wish I were exaggerating. Naturally, one of the screw bags had split open (the one containing the tiniest screws) and they were all over the inside of the box. There were instructions, but they were in Engrish, of course, and very vague. I’m very glad that I have mechanical aptitude, or I’d have struggled with the rail kit for much longer than I did, and would have been quite flabbergasted. Eventually I figured out which screws to put where, and assembled the rails, only to find that the rails would not collapse all of the way. The ends of a couple of the screws protruded too far into the channel, and stopped the inner rail from going all of the way into the outer rail. Aha! So THAT is what the washers were for. I got this now. Much too much time later, after much sweat and frustration, I racked the QNAP.

Then I spent 3 minutes snapping two pairs of tool-less rail kits from Dell into the rack for the other servers I was setting up that day.

I mean, really. I would happily pay $50 – $100 more for a QNAP TS-410U, if the dang thing came with tool-less rails. QNAP, are you paying attention?

OK, just had to get that over with.

</rant>

OK, Amazon S3. Signed up, created an account, copied and pasted the Access Key and Private Access Key. Logged into the AWS Console, created a Bucket. Checked the access permissions on the bucket. Created a folder inside the bucket.

Thus armed, I logged into the Administration console of the QNAP. I looked around for the Amazon S3 stuff. Couldn’t find it. Not here, not there, not anywhere. I could not find it, Sam-I-Am! Looking around online, I found a helpful article from QNAP on how to set up the Amazon S3 replication. It says to go to Backup -> Remote Replication -> Amazon S3. But it’s not there. The header at the top of the article lists the models that these instructions are supposed to work with. The TS-410U is NOT in there. What’s up with that? I was SURE that the TS-410U supported Amazon S3.

Further investigation found a firmware update that includes the Amazon S3 component. My QNAP was many versions behind on the firmware. I wonder how long it’s been sitting on a shelf in a warehouse. I downloaded the update and tried to use it. The QNAP said it failed. It did not like the update file. I downloaded another copy from another mirror site. That one worked OK. The thing updated in about 5 minutes.

OK, so now I have an Amazon S3 tab in Remote Replication. I went through the wizard, defined the profile, gave it the access keys, gave it the Bucket name and the folder name, clicked the Test button, and…

…it failed.  “The Bucket does not exist, or you do not have access to it.”   Uh huh.   Someone stole mah bukkit.

OK, let’s get down to troubleshooting. Is it time sensitive? It’s been an hour since I created the bucket. It can’t still be initializing. Is it case sensitive? It may be, but I typed it right. Did I screw up the ACLs? No, the permissions look right. Hmm. Well, maybe it’s a bug in the testing system. I’ll finish the wizard without a successful test and see if it runs. Does it?  Nope. FAIL!

This went on for some time before I once again resorted to the old standby, Google.

I didn’t find anything with the specific error message, in quotes, which was my primary inspiration for writing this article. I didn’t even find anything clearly laying out for me what had to be done. What I did find was that it was possible to create buckets and folders with a plug-in for Firefox called S3Fox. I remembered seeing this utility mentioned in that QNAP article as well, in the section about synchronizing your S3 files via web browser. Something went “clunk” in my brain.

I downloaded Firefox (no, don’t faint in amazement that I did not already have it installed, really). Just to poke briefly at the 9-year-old arrogant haters out there who may have been conditioned by their separatist, home-schooling, plant-eating parents, let me just say right now that I UNCHECKED the box for making Firefox my default browser. Not everyone shares your opinions and preferences. Get over it.

Then I downloaded the S3Fox plug-in. My goal here was to double-check whether mah bukkit was working or not, using another utility which utilizes the same access and authentication method as the QNAP, so that when/if the QNAP tech support guy ever called me back, I could tell him that my bucket was configured just fine, thank you, and that the problem had to be on the QNAP.

I was able to successfully log into my bucket using S3Fox, and was able to upload, download, and delete files and folders. Then I had an intuitive leap.

I knew that the QNAP was not happy with my bucket.

I knew that S3Fox liked my bucket just fine.

I knew that S3Fox had the ability to create buckets.

I knew that the QNAP guys liked S3Fox.

So I deleted my bucket, then recreated it using S3Fox. Logged back into the QNAP, tried the test again, and…

…Success!

The QNAP implementation of Amazon S3 Remote Replication can use a bucket I created using S3Fox, but CANNOT work with one that I created using Amazon’s own AWS Management Console. Who’d have thought it?

Update:

Goaded by Eugene, my masochistic software tester friend, I tested this again, to make sure it wasn’t a one-time error. I logged into the Amazon AWS console, created a new bucket, logged into the QNAP, started the wizard, and…

…it worked. OK, now that I have started down this dark path, forever will it dominate my destiny (or at least until I figure out whether it really was a one-time break or some other thing).

So I re-traced my earlier steps EXACTLY. You see, when I created my first bucket, I did something a little bit different. Read on:

Logged into the Amazon console.
Clicked the “Create Bucket” button.
Typed a 5-character name in the “Bucket Name” field, consisting of three capital letters followed by two numbers.

Now here is where I diverged from the path when I did the test that succeeded where I expected it to fail. This time, just like the first time, instead of clicking the “Create” button, I clicked the “Set Up Logging” button instead. This takes me to another page where I can check a box to enable logging and specify a bucket for the logs to go into, and a prefix for the log files to have.

THEN I clicked the “Create” button, which created the bucket, along with some logging attributes.
I finished by adding a folder with the same character pattern as all of the others.

Once again, (and I tested it more than once this time), the QNAP device cannot work with the bucket that was created with the logging service added into it.

Knowing that Eugene would be disappointed if I did not pursue this to its grisly conclusion, I then added logging to the folder that worked properly when I created it without logging. This did not affect the ability to access it from the QNAP.

Sensing that I was not out of the woods yet, I persisted. I then created another bucket without the logging, then immediately added logging to it, without first connecting with the QNAP. The QNAP then failed to connect to this bucket.

Of course, I could not stop there. Then I had to remove the logging from the bucket, and check the QNAP again: Still no luck.

There are probably some other test sequences I could run, for example, create a bucket with S3Fox, then go to the AWS console and add logging before connecting with the QNAP. However, I am not as masochistic as Eugene, so I’ll tell him that if he wants to log in and play with this, he’s welcome to.

My conclusion: If you create a bucket _and add logging to it_ using the AWS console before connecting to it with the QNAP, it will fail. If you create a bucket _without logging_, then connect to it with the QNAP, then add logging to it afterward, it will probably keep working.
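If you would rather check a bucket for logging from a script than click through the console, something like this Python sketch should do it. It assumes you pass in an S3 client that exposes get_bucket_logging, as the boto3 S3 client does; the bucket name in the usage note is hypothetical. Remember that in my tests a bucket that had logging added and then removed still failed, so a "no logging" answer is a hint, not a guarantee.

```python
def bucket_has_logging(s3_client, bucket_name: str) -> bool:
    """Return True if server access logging is currently enabled on the bucket.

    s3_client is expected to behave like boto3.client("s3"): its
    get_bucket_logging response contains a "LoggingEnabled" key only
    when logging is turned on.
    """
    response = s3_client.get_bucket_logging(Bucket=bucket_name)
    return "LoggingEnabled" in response
```

With boto3 installed and credentials configured, usage would look like bucket_has_logging(boto3.client("s3"), "my-backup-bucket").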

I do hope that the QNAP guy does call back, so I can let him know about this product bug.


The BrainPool Discusses Extended Warranties for Aging Servers

April 10th, 2010 by Paul Sterley | No Comments | Filed in Backup and Restore, Hardware, Virtualization

Recently, a discussion took place on the BrainPool distribution group that contained good comments and perspectives about extending the warranty on an aging server. I felt that there was significant value in this conversation, so I have pasted it below in its entirety. If you’d rather just have the salient points, I have enumerated them here:

 

Pros:

  • Fast replacement of hardware when a clear and definable failure occurs.
  • Availability of replacement hardware that may not easily be found elsewhere.
  • Peace of mind for business owners that in the event of a failure, a mechanism exists for fast recovery.

Cons:

  • Expense that might be better spent elsewhere, for example the purchase of a new server.
  • False sense of security which could lead to downtime when the aging server fails.

Mitigating Factors:

  • In a virtual environment with a Storage Area Network, it is fast and easy to boot a VM on another host when one fails.
  • If there is identical hardware available, a parts replacement warranty may not be needed.

Extenuating Circumstances:

  • In some business environments, redundancy and high availability are absolute requirements. In these cases, the above considerations probably do not apply, as there is an overriding business requirement for warranty/service contracts and failover hardware.

 

The discussion commences here:

 

Lynn says:

Occasionally, I really question whether it is prudent to recommend extending warranties for aging hardware.  The cost is pretty extreme (a recent Dell quote tells me this), to the point you could buy new hardware for the cost of extending the warranty for a couple of older servers.  I understand the concept of covering your own a$$, but with VM’s, disk based backups, relatively cheap SATA storage, etc – you can get a down system back online in pretty quick order, and work on the bad hardware in the background, without sacrificing much in the way of performance.    Does anyone have any thoughts on this?  I mean, it’s one thing if it was an ESXi box and you’ve got multiple guest systems running on there, but a file server, even an email server (aka – pretty much any single purpose server you might have) – I’m much more on the fence about that sort of thing than I used to be.  Thoughts?

 

Joe says:

The big picture is that a company can say “I’m covered” for ~$650 a year.  Now if the server dies, how much can they be out?  I know things don’t hard fail like they did 10 years ago, but still, a new motherboard / processor / power supply and you’ve spent more than the insurance would have cost you, and it’s on someone else to show up with the parts.  If you’re white boxing it, it isn’t as big a deal.  For Dell/HP/IBM/etc there are times when getting a specific part means you have to turn to eBay and hope for a quick ship.

 

Normally I say buy it, but only in 1 year increments.  Remember, Murphy reads these emails.
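
Joe's insurance argument can be put into rough numbers. The following is only a minimal sketch: the ~$650 annual figure comes from the discussion above, while the out-of-pocket repair cost and the function name are hypothetical values chosen purely for illustration. It computes the yearly failure probability above which the warranty is, on average, cheaper than paying for repairs yourself.

```python
def break_even_failure_probability(annual_warranty_cost, out_of_pocket_repair_cost):
    """Return the yearly failure probability at which the expected
    out-of-pocket repair cost equals the annual warranty cost."""
    if out_of_pocket_repair_cost <= 0:
        raise ValueError("repair cost must be positive")
    return annual_warranty_cost / out_of_pocket_repair_cost


warranty = 650.0   # per year, the figure quoted in the discussion
repair = 2000.0    # hypothetical: motherboard + processor + power supply

p = break_even_failure_probability(warranty, repair)
print(f"Warranty pays off on average if yearly failure odds exceed {p:.1%}")
```

This deliberately ignores the cost of downtime and the parts-availability problem raised elsewhere in the thread, both of which would tilt the result further toward buying the warranty.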

 

Larry says:

I always recommend a warranty extension if the server falls into the “production server” category and the client does not want to upgrade to a new system or there is not a business reason to upgrade.  I also found that if you call CDW, you can get the warranty for considerably less than you can from HP. In my case, I was able to extend the warranty for 2 years on a system for less cost than HP would have charged for a single year.

 

Ellis says:

I believe these need to be viewed as an insurance policy, and as such they become a business decision, not a technical one.

 

One extenuating circumstance I have run into with both HP and Dell is that you might not be able to buy a replacement part for an old system, yet you can get that part if you have a warranty.  Trying to locate replacement parts for a production system on eBay is not my idea of a responsible way to run a business.

 

Jason says:

+1 for Ellis’ comments.  If you have a production server that is mission critical, not having a service contract is insane in my opinion.  The only time I would consider otherwise is if you have an identical spare server you can use as an organ donor.  I have a client with an old Dell 2650 server that Dell will no longer write maintenance agreements on.  They are just about broke, so I suggested they buy an identical server off eBay for $75 and use it as an organ donor.  Plus, even though their server is old, the SBS 2003 install on it is really stable.

 

If Dell would write a maintenance contract for it, I would have told them to buy one.  When doo doo hits the fan, having hardware support is always nice (just in case), and without a support contract, you don’t get that.

 

Paul says:

With virtualization as a tool in the belt, I am having some customers keep old servers around instead of tossing them. With a restore and P2V if necessary, an older server could run the newer server’s roles for a short time while the new one is being fixed. This of course has to be looked at individually – can the old server really do the job even for a short time? Is it capable, even if slower? Does it have enough disk space, or can it do the job of serving some stuff, and the rest can be restored to USB disk if needed for the short term? Given the amount of time needed for recovery of the new server, is it worth the time/effort to restore to a recovery server? 

 

Joe considers:

There is one twist, and that’s if you’re running a VM-based setup.  If you have 2-3 physical servers and a SAN, the only service contract you need is on the SAN.  Oversize the RAM in the host servers (there is no such thing as too much), and migrate the VMs to another physical host if one fails.

 

In a single server instance, yes, hardware contracts are mandatory.  In a multi-server setup, you can run a box until it dies as long as all the data is on the SAN, and you have resources to move the VMs over.
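
Whether "you have resources to move the VMs over" can be checked ahead of time. The sketch below is a rough capacity test under stated assumptions: the host and VM RAM figures are hypothetical, the function name is made up for this example, and it only compares totals, so it does not prove that each individual VM fits on a specific surviving host.

```python
def survives_single_host_failure(host_ram_gb, vm_ram_gb):
    """Aggregate check: can the remaining hosts hold all VM RAM
    if the largest host fails? Ignores per-VM placement."""
    if len(host_ram_gb) < 2:
        return False  # a single host has nowhere to fail over to
    remaining_capacity = sum(host_ram_gb) - max(host_ram_gb)
    return sum(vm_ram_gb) <= remaining_capacity


# Hypothetical figures: three 64 GB hosts running 96 GB of VMs in total.
hosts = [64, 64, 64]
vms = [16, 16, 24, 32, 8]
print(survives_single_host_failure(hosts, vms))      # three hosts
print(survives_single_host_failure(hosts[:2], vms))  # only two hosts
```

With three hosts the VMs still fit after losing one; with only two hosts they do not, which is the case where the hardware contract on the hosts matters again.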

 

Ellis questions:

Aren’t you making some assumptions about customer expectations and service level agreements in your comment?  In some environments, a loss of redundancy would be considered cause for immediate action.

 

Joe answers:

Yep, I’m assuming a lot, and it will vary depending on the client.  The upside is that in most instances, downtime is less than 10 minutes, rather than a wait of up to 4 hours that is only possible if the vendor has the part in your local area.  I’ve had Dell drive up parts from Portland for a server that was down.  Had that server been a VM, I could have moved it and not involved the hardware vendor until the client was working again.  This setup allows the client to downgrade from a 24/7 4-hour contract to a next-business-day (NBD) contract at a huge savings.  Granted, this involves a SAN that may or may not be in the budget, and the SAN would need that 24/7 2- or 4-hour contract.

 

If a 4-hour outage isn’t acceptable, then they should be on a hardware replacement cycle where this discussion is moot.

 

Ellis clarifies:

I think you’re missing my point.  Your recovery plan, supported by the hardware architecture you describe, is exactly what the customer would hope for.  My point is that once the system is brought back online, there is still an outage to recover from: the system is no longer redundant.  In that case isn’t having a warranty good practice (although sometimes more expensive than we would like) to complete the resolution?

 

Paul says:

If the server in question is fairly recent, it might be a good idea to still have that warranty on it, but how much does the warranty cost, versus the replacement part? If quick recovery and business productivity have been achieved, and now we are looking at repairing the failed system, we don’t necessarily need the quick response offered by the warranty. If the failed host server is so old that parts are not readily available, then perhaps it would be time for replacement of that host, rather than repair.
