Archive for the ‘Backup and Restore’ Category

The BrainPool Discusses Extended Warranties for Aging Servers

April 10th, 2010 by Paul Sterley | No Comments | Filed in Backup and Restore, Hardware, Virtualization

Recently, a discussion took place on the BrainPool distribution group that contained good comments and perspectives about extending the warranty on an aging server. I felt that there was significant value in this conversation, so I have pasted it below in its entirety. If you’d rather just have the salient points, I have enumerated them here:

 

Pros:

·         Fast replacement of hardware when a clear and definable failure occurs.

·         Availability of replacement hardware that may not easily be found elsewhere.

·         Peace of mind for business owners that in the event of a failure, a mechanism exists for fast recovery.

 

Cons:

·         Expense that might be better spent elsewhere, for example the purchase of a new server.

·         False sense of security which could lead to downtime when the aging server fails.

 

Mitigating Factors:

·         In a virtual environment with a Storage Area Network, it is fast and easy to boot a VM on another host when one fails.

·         If there is identical hardware available, a parts replacement warranty may not be needed.

 

Extenuating Circumstances:

·         In some business environments, redundancy and high availability are absolute requirements. In these cases, the above considerations probably do not apply, as there is an overriding business requirement for warranty/service contracts and failover hardware.

 

The discussion commences here:

 

Lynn says:

Occasionally, I really question whether it is prudent to recommend extending warranties for aging hardware.  The cost is pretty extreme (a recent Dell quote tells me this), to the point you could buy new hardware for the cost of extending the warranty for a couple of older servers.  I understand the concept of covering your own a$$, but with VM’s, disk based backups, relatively cheap SATA storage, etc – you can get a down system back online in pretty quick order, and work on the bad hardware in the background, without sacrificing much in the way of performance.    Does anyone have any thoughts on this?  I mean, it’s one thing if it was an ESXi box and you’ve got multiple guest systems running on there, but a file server, even an email server (aka – pretty much any single purpose server you might have) – I’m much more on the fence about that sort of thing than I used to be.  Thoughts?

 

Joe says:

The big picture is that a company can say “I’m covered” for ~$650 a year.  Now if the server dies, how much can they be out?  I know things don’t hard fail like they did 10 years ago, but still, a new motherboard / processor / power supply and you’ve spent more than the insurance would have cost you, and it’s on someone else to show up with the parts.  If you’re white boxing it, it isn’t as big a deal.  For Dell/HP/IBM/etc there are times when getting a specific part means you have to turn to eBay and hope for a quick ship.

 

Normally I say buy it, but only in 1 year increments.  Remember, Murphy reads these emails.

 

Larry says:

I always recommend a warranty extension if the server falls into the “production server” category and the client does not want to upgrade to a new system or there is not a business reason to upgrade.  I also found that if you call CDW, you can get the warranty for considerably less than you can from HP. In my case, I was able to extend the warranty for 2 years on a system for less cost than HP would have charged for a single year.

 

Ellis says:

I believe these need to be viewed as an insurance policy, and as such they become a business decision, not a technical one.

 

One extenuating circumstance I have run into with both HP and Dell is that you might not be able to buy a replacement part for an old system, yet you can get that part if you have a warranty.  Trying to locate replacement parts for a production system on eBay is not my idea of a responsible way to run a business.

 

Jason says:

+1 for Ellis’ comments.  If you have a production server that is mission critical, not having a service contract is insane in my opinion.  The only time I would consider otherwise is if you have an identical spare server you can use as an organ donor.  I have a client with an old Dell 2650 server that Dell will no longer write maintenance agreements on.  They are just about broke, so I suggested they buy an identical server off Ebay for $75 and use it as an organ donor.  Plus, even though their server is old, the SBS 2003 install on it is really stable. 

 

If Dell would write a maintenance contract for it, I would have told them to do this.  When doo doo hits the fan, having hardware support is always nice (just in case), and without a support contract, you don’t get that. 

 

Paul says:

With virtualization as a tool in the belt, I am having some customers keep old servers around instead of tossing them. With a restore and P2V if necessary, an older server could run the newer server’s roles for a short time while the new one is being fixed. This of course has to be looked at individually – can the old server really do the job even for a short time? Is it capable, even if slower? Does it have enough disk space, or can it do the job of serving some stuff, and the rest can be restored to USB disk if needed for the short term? Given the amount of time needed for recovery of the new server, is it worth the time/effort to restore to a recovery server? 

 

Joe considers:

There is one twist, and that’s if you’re running a VM based setup.  If you have 2-3 physical servers, and a SAN, the only service contract you need is on the SAN.  Oversize the RAM in the host servers, no such thing as too much, and migrate the VMs to another physical host if one fails.

 

In a single server instance, yes, hardware contracts are mandatory.  In a multi-server setup, you can run a box until it dies as long as all the data is on the SAN, and you have resources to move the VMs over.

 

Ellis questions:

Aren’t you making some assumptions about customer expectations and service level agreements in your comment?  In some environments, a loss of redundancy would be considered cause for immediate action.

 

Joe answers:

Yep, I’m assuming a lot, and it will vary depending on the client.  The upside is that in most instances, downtime is less than 10 minutes rather than waiting a full 4 hours, if the vendor has the part in your local area.  I’ve had Dell drive up parts from Portland for a server that was down.  Had that server been a VM, I could have moved it and not involved the hardware vendor until the client was working again.  This setup allows the client to downgrade from a 24/7 4 hour contract to a NBD contract at a huge savings.  Granted this involves a SAN that may or may not be in the budget, and the SAN would need that 24/7 2 or 4 hour contract.

 

If a 4 hour outage isn’t acceptable then they should be on a hardware replacement cycle where this discussion is moot.  

 

Ellis clarifies:

I think you’re missing my point.  Your recovery plan, supported by the hardware architecture you describe, is exactly what the customer would hope for.  My point is that once the system is brought back online, there is still an outage to recover from: the system is no longer redundant.  In that case isn’t having a warranty good practice (although sometimes more expensive than we would like) to complete the resolution?

 

Paul says:

If the server in question is fairly recent, it might be a good idea to still have that warranty on it, but how much does the warranty cost, versus the replacement part? If quick recovery and business productivity has been achieved, and now we are looking at repairing the failed system, we don’t necessarily need the quick response offered by the warranty. If the failed host server is so old that parts are not readily available, then perhaps it would be time for replacement of that host, rather than repair.

Possible workaround when your ESXi server runs out of space on the datastore

March 10th, 2010 by Paul Sterley | No Comments | Filed in Backup and Restore, ESXi, Hardware, Hyper-V

Scenario:
You have a virtual machine running on ESXi, and either the disk is thin-provisioned, or you have one or more snapshots. The datastore runs out of space, and the VM goes down. You are unable to boot the VM because there is not enough free space on the datastore.

When you allocate memory to a VM and boot it, ESXi creates a “swapfile” on the datastore using an amount of space equivalent to the amount of RAM you allocated. By default, ESXi is configured to place this swapfile in the same folder (on the same datastore) as the VM.

Thus although the datastore might have 3.75 GB free, when you attempt to boot the server that you have allocated 8 GB of RAM to, it will not boot.
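
The arithmetic behind that failure can be sketched in a few lines of Python. This is only an illustration, not part of any VMware tool: the function names are hypothetical, and the `reserved_ram_gb` parameter is my addition, on the assumption that a memory reservation shrinks the swapfile by the reserved amount. With no reservation, the swapfile is as large as the allocated RAM, as described above.

```python
def swapfile_size_gb(configured_ram_gb, reserved_ram_gb=0.0):
    """Space the swapfile needs: configured RAM minus any memory reservation."""
    return max(configured_ram_gb - reserved_ram_gb, 0.0)


def can_power_on(free_space_gb, configured_ram_gb, reserved_ram_gb=0.0):
    """True if the datastore has enough free space for the VM's swapfile."""
    return free_space_gb >= swapfile_size_gb(configured_ram_gb, reserved_ram_gb)


# The example from the text: 3.75 GB free, 8 GB of RAM allocated to the VM.
print(can_power_on(3.75, 8))  # False
```

The same check explains why moving the swapfile to a second datastore helps: only the free space on the datastore that holds the swapfile matters for this test.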

 

Solution:
If you have more than one datastore available, you can go into the vSphere Client, configuration tab, and configure the virtual machine swapfile location. Place the swapfiles on the second datastore.

If you don’t have more than one datastore, perhaps you can add one. If you have a NAS device that supports NFS, you can use that. If the onboard SATA controller on your server is supported by ESXi, you can add a cheap SATA disk to use for your swapfile location (and a good backup location) while you sort this issue out.

Once you have done this, you can boot the server and run a backup from within the OS.

Once you have a full backup, you can delete the VM to free up space. If you ran out of room due to snapshots, you can create a new VM and start restoring your backup right away. If you ran out of room due to a thin provisioned disk that exceeded the datastore size, you will obviously need to make your datastore larger before proceeding with the restore.

Other ways you can recover from this situation:
1. Add disks to the server and extend the datastore to use them, so the datastore gets larger.

2. Move one or more of the VMDK files to the second datastore and edit your VM configuration to use the disk(s) in the new location.

How you can prevent this situation:
1. When using thin provisioning, make sure the disks will still fit on the datastore even if they grow to their full potential size. If you want to use some of the available space while your VMDK files are still small, go right ahead, but make sure you can either delete or move the less important machines on short notice, and monitor your disk usage!

2. Leave plenty of extra room. Put more physical space in the server than you’re ever likely to need. Disks are cheap.
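
A quick way to apply the first rule is to add up the worst case. The sketch below is a rough planning aid with made-up numbers; `overcommit_gb` and its parameters are hypothetical names, and it assumes you know the provisioned (maximum) size of every thin disk on the datastore plus the swapfile space the running VMs will need.

```python
def overcommit_gb(datastore_capacity_gb, provisioned_disk_sizes_gb, swap_reserve_gb=0):
    """GB by which fully grown thin disks (plus swapfiles) would exceed capacity.

    A result of zero or less means everything still fits.
    """
    worst_case = sum(provisioned_disk_sizes_gb) + swap_reserve_gb
    return worst_case - datastore_capacity_gb


# Hypothetical datastore: three thin disks and 16 GB of swapfiles.
disks = [200, 150, 100]
print(overcommit_gb(500, disks, swap_reserve_gb=16))  # -34: fits, 34 GB to spare
print(overcommit_gb(400, disks, swap_reserve_gb=16))  # 66: overcommitted by 66 GB
```

Snapshots are not counted here; if you keep them, they need their own allowance on top of this total.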

 

P.S. I am sure that this same concept, or parts of it, can be applied to Hyper-V virtual hosts. However, I am not familiar enough with Hyper-V to give specifics.

Product Review: R1Soft’s CDP Server

December 8th, 2009 by Paul Sterley | No Comments | Filed in Backup and Restore

I recently evaluated the Windows version of the R1Soft CDP Server 2.0 product. What follows is a basic write-up of the points and features that seemed relevant and important to me. Your needs may be different. For a full description of the product, click here.

For their full documentation set, click here.

Summary:

In my opinion this is a great product for local backup to disk. However, it has no good provisions for rotation of storage devices and offsite backup for disaster recovery. They did give it a go with the Archive module, but I feel that they fell short of the mark with this.

The only way to get a good offsite backup with full capabilities is to stop the CDP service, back up the CDP Server including its databases, system state, etc. to portable media, and take that offsite. In a disaster, you’ll spend some time recovering your recovery server first.

Overview:

·         The product is installed on a server that is not one of those that you will be backing up.

·         A disk is defined for the storage container. This disk cannot be rotated with other disks and taken offsite. It must remain present.

·         Agents are installed on each server that you wish to back up.

·         Backups are scheduled.

·         E-mail notifications can be scheduled, which include a summary of the history screen.

·         Individual file/folder restore is done via the CDP server console and is pretty easy.

·         Bare Metal restore is accomplished by booting a server from CD and controlling it from the CDP Server console. There are other methods as well, but this is the most straightforward.

·         Archives to Zip files can be scheduled via the console, as long as you are not using encryption. The target for these can be FTP, SFTP, or CIFS.

 

During my evaluation of the product, there were several points that I could not find information about in their documentation, so I submitted technical support incidents. I just got the answers back from them. I can’t say I’m happy about any of them.

1.       When you “Archive” information from the data storage container, which allows you to send it off to an FTP server or something, you can no longer use the R1Soft graphical interface to work with that archive. From that point it becomes a Zip file that you can manually open up and copy data out of. So we could not use an Archive to do a bare-metal restore, for example.

2.       If you choose to “encrypt” (password protect) your storage, then you cannot schedule an Archive job. The software does not store the encryption password. Archives can then only be manually done.

3.       It is not possible to rotate data storage media. R1Soft writes to a disk as a container to store the data. It makes a database there, and it wants the same database to be available at all times. So the only way to get an offsite backup of the data container as an intact, whole backup that you can use the GUI to restore from is to stop the R1Soft services, back up the entire CDP Server to removable media, and take that offsite. That means restoring from one of these will involve first recovering the R1Soft server from that backup.

4.       The current version of CDP Server (2) involves one central server with agents installed on other servers to back them up. The pricing is excellent. However, version 3, which is due out in 2010, changes this model. Each server will have its own copy of the software and will back up to standalone databases that can be copied around. This will improve the offsite storage capability. The “Enterprise” edition will still have the capability to have a central server with agents for the backup targets. It is unknown at this time what the pricing will be like for either option.

Conclusion:

Within its limits, the R1Soft CDP Server 2.0 product performs well and provides a very cost effective (at this time) local backup solution for companies with multiple servers.

However, the lack of off-site disaster recovery functionality makes this a product that I am unlikely to recommend to customers, unless I have some other independent option for offsite disaster recovery.

Further, the fact that the architecture (and pricing) will change significantly in the next version, due out within a year, gives me pause. I am hesitant to roll out a backup system based on this architecture and pricing, with the probability that in less than a year, I will either have to change the backup model completely, or pay significantly more for the “enterprise” edition that will include the backup model that is being offered at such a good price now.

During my evaluation, when I found something that was not intuitive, or an interface that seemed a little clunky, I reminded myself of the great pricing and the benefits of needing only a lightweight agent installed on each server. Finding out that within a year I will either have to abandon the benefits of the agent or pay a higher cost for an “Enterprise” level product puts those rough edges and minor defects in an entirely different light.

Updated: Recover from a USN Rollback WITHOUT Demoting and Promoting your DC

October 27th, 2009 by Paul Sterley | 3 Comments | Filed in Backup and Restore, ESXi, IIS, In the Windows Box, Virtualization, Windows Server

What’s a USN Rollback? That’s when you’ve restored an Active Directory DC in a multiple DC environment using a method that is not Active-Directory Aware. Examples include Ghost images, VMware or Hyper-V snapshots, or other imaging or volume-level restore methods.

Why is that a problem? A very good detailed explanation is available here, but the basic idea is that AD keeps track of which servers it has replicated with and when, and if a DC is rolled back in a way that is not compatible with the record-keeping, the affected DC will disable inbound and outbound replication, and refuse to replicate with the other DCs.

Here’s a related article by the same author as the above post, which led me to my solution this evening. My article expands on the second option provided, but goes into the mechanics of it, and the associated difficulties.

According to Microsoft’s Knowledge Base article on the subject, recovering from this situation entails forcibly demoting the DC, cleaning up the AD, and then (optionally) promoting it again. If the DC in question has no other roles, or just a couple of basic ones such as a print server, this might be the best way to go, if you’re familiar with such things as seizing FSMO roles and performing metadata cleanup in Active Directory after an unsuccessful DC demotion.

** Update: Read on for more details about how this all works, but make sure you check the update at the bottom of the article for the easier method I successfully tested!

However, if you’re not familiar with these things, or you have other applications on the server which might be affected (IIS, in particular, is very sensitive to the permissions changes associated with DC promotion), this might create a very large amount of havoc on your server.

Your saving grace, if you have one, is a System State backup from before the USN rollback occurred. If you don’t have a backup of JUST the System State, perhaps you can restore an entire image to another server, boot it, and create one.

If you have or can create one of these, your solution becomes much simpler. You just need to boot your server in Directory Services Restore Mode, restore the System State, DO NOT mark any part of your restore as authoritative, and reboot.

After the reboot, you might need to remove the flags AD has set, which have disabled inbound and outbound replications. The commands for this are:

repadmin /options [YourServerName] -disable_inbound_repl
repadmin /options [YourServerName] -disable_outbound_repl

Note: This looks like you are disabling replication, but what you are actually doing is putting a minus sign (-) before the disable option, which clears that option and so re-enables replication. I know, it’s counter-intuitive, but trust me on this one – or go check the syntax yourself.
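
The sign convention can be sketched in Python: a plus sign sets an option on the DC, and a minus sign clears it. The helper below only builds the command line and does not run anything; `repadmin_options_args` is a hypothetical name and `DC02` is a placeholder server name.

```python
def repadmin_options_args(server, option, set_flag):
    """Build the argument list for `repadmin /options`.

    A leading '+' sets the named option on the DC; a leading '-' clears it.
    """
    sign = "+" if set_flag else "-"
    return ["repadmin", "/options", server, f"{sign}{option}"]


# Clearing both "disable" options is what re-enables replication.
for opt in ("DISABLE_INBOUND_REPL", "DISABLE_OUTBOUND_REPL"):
    print(" ".join(repadmin_options_args("DC02", opt, set_flag=False)))
```

Run as written, this prints the two commands shown above with `DC02` in place of your server name.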

Of course, you need the Support Tools installed to get the repadmin utility. Once you run those commands, your server will start replicating again, and the more up-to-date DC(s) will override the old, out of date information your USN Rollback victim was holding onto.

There are some extra difficulties associated with the above plan:
1. If you have to restore a server image to create that System State backup, and you restore to different hardware, things could get a little messy. Is it messier than demoting, seizing FSMO roles, performing metadata cleanup, promoting, and cleaning up the fallout from your installed apps? You’ll have to decide on that one.

2. This requires you to have an extra server (or two, if you want to restore more than one DC to create a stable lab environment from which to back up the System State) lying around. Do you have those resources available?

I was facing this issue today, and all of the above became MUCH simpler for me when I realized I could use the Doyenz Test Lab to sort all of this out. I did NOT have a System State backup from before the USN Rollback, but I HAVE been running backups into the Doyenz system since before the problem began.

Here is what I did:
1. Created a backup of the System State

a. Restored a copy of the affected server in the Doyenz Test Lab. I specifically restored from the date BEFORE the USN Rollback happened. It was easy to find this by looking at the date of the last successful replication with repadmin on the affected server.
b. Performed a System State backup using NTBackup (you can do this with WBAdmin on Windows 2008).
c. Zipped the backup file and sent to an FTP server.
d. Shut down the restored server.

2. Performed a test run to make sure this was going to work, without affecting the live servers.

a. Using the Doyenz Portal, I selected last night’s backup and restored it for both servers.
b. I booted the primary DC (the one with the FSMO roles) first.
c. Attached the second (USN Rollback victim) server to the first one in the Lab, and booted it.
d. Pulled the System State backup down from the FTP site onto the affected server.
e. Rebooted the affected server into Directory Services Restore Mode.
f. Restored the System State on the affected server.
g. Rebooted the affected server into Normal Mode.
h. Used the repadmin commands to remove the replication blocks.
i. Forced replication using AD Sites and Services.

3. Verified successful replication.

a. Created a user account on one DC in the Test Lab, forced replication, and checked for the account on the other DC.
b. Deleted the user account on the other DC, and checked it on the first DC.

4. Tested the touchy sensitive web applications that are running on the affected server.

5. Shut down the servers in the test lab.

After this successful test, I notified the users of pending late-night downtime, and repeated the above steps, this time on the live, production server and with great confidence of the outcome. Sure enough, I restored the AD replication functionality of the server with minimal downtime, without crossing my fingers, holding my breath, and hoping against hope that it would work and not trash the server.

What is more, since the production server is a virtual server, and I have VPN access to the virtual host, I was able to perform the entire operation from my home office, 30 miles away. I didn’t swap any tapes, set up any lab hardware, or drive to the server site late at night. I did the whole thing in comfortable clothes with a 2-liter bottle of Ruby Red Squirt, Winamp playing “Save Me” by Queen, and my devoted cat purring on my lap.

What could be better than that?

Update: It was very handy to be able to do the above scenario, but what is even handier is that I was able to find a significantly simpler method. So much simpler, I wonder why it did not occur to me sooner, and why Microsoft doesn’t have this listed in their KB article.

I set this problem up in a lab scenario again, and this time rather than do a complicated restore of an earlier version of the machine, I simply:

  • Performed a System State backup of the machine (in its broken, non-replicating condition).
  • Booted it into Directory Services Restore Mode.
  • Restored the System State backup, carefully NOT selecting the option to make it authoritative.
  • Rebooted, and ran the above repadmin commands to re-enable replication.

After that, I was able to trigger another replication, and it worked just fine.

This is a test of the Windows Backup system on VMware ESXi. This is only a test.

July 30th, 2009 by Paul Sterley | 2 Comments | Filed in Backup and Restore, ESXi, In the Windows Box, Virtualization, Windows Server

Summary:
Triggered by an excessive heat wave, I used the built-in Windows Backup to do a test restore of my production virtual servers from their usual VMware ESXi host to a smaller, more portable machine that lives in an air-conditioned room.
The servers will run there until the heat wave dissipates, whereupon I will reverse the procedure and move them back to their usual home.

The restore process was incredibly easy. This is a demonstration of how portable and flexible virtual servers are, and how well the built-in Windows Backup works with virtualization.

I can now say with a high level of confidence that virtual servers, backed up with a local VSS-based disk backup solution and coupled with an offsite backup solution, are a great way to go. My scenario was a simple problem with a simple solution, but this power and flexibility can easily be applied in many different situations.

The Full Story:
If you live in the Western Washington area, you know we’re having a crazy heat wave.

Many businesses have servers tucked away in closets, kitchen areas, and other little nooks and crannies, without air conditioning. Mine is one of them. I strongly recommend air conditioning to my customers, and it is with some embarrassment that I admit that I have not implemented it myself – but I have never needed it before. My company’s servers are in a steel enclosure in a 675 square foot garage. Usually it stays quite cool, verified by the thermal monitoring unit attached to my battery backup system. If the temperature gets too high, the battery backup sends a shutdown command to the servers so they are not damaged by the heat.

Several of my customers have had thermal shutdown issues the last few days. Today it was my turn. I happened to be sitting at my workstation when the e-mail arrived, telling me that I had 3 minutes to correct the situation before things started shutting down.

I started by logging into the battery backup unit and adjusting the threshold up a few degrees to give me time to work. Next I walked down to the server rack and opened its door to allow more air flow to the servers. The thermal monitor is just inside the door, right next to the air intake holes in the front of the server. The third step I took was to shut down one of the servers in the rack – a virtual server running Windows Home Server, which backs up my workstations. Since I don’t store data on workstations, it’s OK to go a few days without backing them up.

Back in my air-conditioned office, I logged into the battery backup management web page and saw that it had gone up to 91 degrees while I was working, but was now back to 90. I watched it for a few minutes. It stayed at 90. Still too hot.

Sitting back and thinking about my options, I considered fans – but the entire room was very hot. Fans would only push the hot air around, and I’ve heard horror stories and seen pictures of server rooms which had burned down due to electrical fires starting from cheap fans that weren’t designed for a 24/7 duty cycle.

I considered moving the server to my office – but the server is very noisy, being a rack-mount server with small fans moving very quickly. However, my servers are virtual, running on VMware ESXi, so they should be very portable…        …and an idea was formed.

One of the great benefits of virtualization is that you can put your virtual machine on any hardware that is supported by the host operating system, which in my case is VMware ESXi. That makes backup and restore very simple. You don’t have to be concerned with hard disk controller drivers and other such obstacles to a smooth restore operation.

I’ve been evangelizing these virtues for over a year now, and using the technology myself. I decided to use this unfortunate heat wave as an opportunity to perform a real-world test of the technology I have been talking about. I decided to do a last-minute backup of my server, move the backup device to a smaller, quieter machine in my office, and restore the backup. I would run it in my office until temperatures reach sane levels again, and then reverse the procedure.

I warned the users that the server was going down for a while. I stopped the incoming e-mail service, and forced a “backup now” on the SBS 2008 and Windows 2008 servers that form my infrastructure. That took about 1/2 hour. I am using the built-in Windows Backup, and it is performing disk-based incremental backups. Then I shut down the “guest” operating systems, and finally shut down the host server.

Again I walked down to the server rack and disconnected the external hard disk that I store my local backups on. It was nearly hot enough to burn my fingers. I carried it up to my office and plugged it into the generic white-box server ($800) that I use to run lab experiments. This machine would also make an excellent loaner ESXi server if one of my customers experienced a server failure. It has a single quad-core 2.5GHz CPU, 8GB RAM, and 1.5 TB of disk space.

I attached the USB stick that boots VMware ESXi on that host, booted it up, and configured its networking (2 minutes).

Next step, I created two guest virtual machines with the same disk sizes as the machines I was going to restore. I had to allocate less memory, so the servers might run a little slower. Then I attached the virtual disks on the backup device to the appropriate VMs, and finally mapped the SBS2008 and Windows 2008 DVDs to the new virtual machines and configured them to boot from DVD.

I booted up the SBS2008 server first. It booted from DVD, and I used the menus on the DVD to start a Full Computer Restore, using the backups that it found automatically when it searched the attached disks. I chose the correct date/time of the backup to restore, verified that all of the volumes were present, and told it to begin.

I didn’t have to flounder around looking for hard disk controller drivers, making floppy disks or putting drivers on USB. I set to work on the second server, which is less critical to my business, and had similar results with that one. Not wanting to cause the first restore to slow down, I brought the second server to the final prompt to begin the restore, and waited for the first one to complete.

The restore was the easiest full-server restore I have ever done, with the best results. After the restore, I booted the server, and it was off and running without a backward glance.

The first server, which runs 90% of my business, was restored and running less than 2 hours after shutting down for the move. A backup queuing mail service had received and stored my e-mail while it was down, so I didn’t miss a single message. The second server, running my blog site, followed soon after.

I did have three very small hiccups:
1. Windows detected the hardware change (probably the CPU chip) and required re-activation, but it worked automatically – two mouse clicks and a few seconds took care of it.
2. Because I forgot to set the date/time properly on the destination ESXi host, my SBS2008 server’s clock got set wrong and that caused authentication problems for a few minutes until I figured out what was going on and corrected it.
3. The DHCP Server service on my SBS did not start because I was running an open-source DHCP server during the downtime to keep everything connected to the network. I just had to stop the one and start the other.
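
The second hiccup is worth a pre-flight check before booting a restored domain controller on a different host. The sketch below assumes the authentication problems came from Kerberos clock-skew limits, which I did not confirm at the time; Active Directory's default tolerance is 5 minutes, and a domain policy may change it. The function name and the sample times are hypothetical.

```python
from datetime import datetime, timedelta

# Default Kerberos tolerance in Active Directory; a domain policy may change it.
MAX_SKEW = timedelta(minutes=5)


def skew_too_large(host_time, reference_time, max_skew=MAX_SKEW):
    """True if the two clocks differ by more than the allowed skew."""
    return abs(host_time - reference_time) > max_skew


# Hypothetical readings: a trusted reference clock versus the ESXi host clock.
reference = datetime(2009, 7, 30, 14, 0, 0)
print(skew_too_large(datetime(2009, 7, 30, 14, 3, 0), reference))  # False
print(skew_too_large(datetime(2009, 7, 30, 15, 0, 0), reference))  # True
```

In practice the simplest fix is the one I should have applied up front: set the date and time on the destination host before powering on the guests.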

Compared with the kind of difficulties I would normally expect with this kind of full server restore to different hardware, this was a piece of cake.

I can now say with a high level of confidence that virtual servers, backed up with a local VSS-based disk backup solution and coupled with an offsite backup solution, are a great way to go. My scenario was a simple problem with a simple solution, but this power and flexibility can easily be applied in many different situations.
