Archive for the ‘ESXi’ Category

Change the MAC address in an ESXi VM

April 8th, 2010 by Paul Sterley | No Comments | Filed in ESXi, Hardware, Migration, Not in the Windows Box, Virtualization, Windows Server

 

Last year, I messed around with changing the MAC address in an ESXi VM to avoid some problems with a license manager app that binds to the MAC address of the NIC when you install and license it. I was unsuccessful in my attempt to change the MAC address then. The MAC address in the VM Settings window refuses to allow you to set the first six digits, requiring you to keep the vendor code of a VMware NIC.

 

The NIC driver inside Windows, however, has an option to set the MAC address. Go to the NIC adapter properties, Advanced tab, and select NetworkAddress.
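
As a rough illustration of the value you would type into that field, here is a minimal Python sketch. It assumes the NetworkAddress property wants 12 hexadecimal digits with no separators, which is how the drivers I have seen behave; check your own driver. The function names are hypothetical helpers, not part of any Windows or VMware tool.

```python
import re


def to_network_address(mac: str) -> str:
    """Strip common separators and return 12 uppercase hex digits.

    Assumption: the driver's NetworkAddress field takes the MAC
    without dashes, colons, or dots.
    """
    digits = re.sub(r"[-:.\s]", "", mac).upper()
    if not re.fullmatch(r"[0-9A-F]{12}", digits):
        raise ValueError(f"not a valid MAC address: {mac!r}")
    return digits


def vendor_prefix(mac: str) -> str:
    """Return the first six hex digits, i.e. the vendor code."""
    return to_network_address(mac)[:6]


if __name__ == "__main__":
    # Example MAC copied from the old server (made-up value).
    print(to_network_address("00-1a-2b-3c-4d-5e"))
```

The `vendor_prefix` helper just pulls out the first six digits, which is the part the VM Settings window will not let you change.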

 

[Screenshot: NIC adapter properties, Advanced tab, NetworkAddress setting]

The server now thinks it has the MAC address you specified.

It works for IPCONFIG /ALL, it works for workstations that ping the server and then check their ARP cache, and it works for FlexLM, the aforementioned license manager software.

 

NOTE: I tested this in Windows 2003 SP2 on ESXi 3.5 and 4.0, and Windows 2008 SBS Edition on ESXi 4.0.

 

Further note: If you do this, it is critical that you NEVER light up the network card in your old server, which has the same MAC address, on the same physical segment of your network. It will bring down connectivity to your new server.

It is possible that you can avoid this problem and continue to use your old server in some other capacity if you change its MAC address, or if you install a different NIC. If the NIC with the matching MAC address is onboard though, you may still have trouble connecting to your new server from your old one. This bears testing at some future date when I have way too much time on my hands.


Possible workaround when your ESXi server runs out of space on the datastore

March 10th, 2010 by Paul Sterley | No Comments | Filed in Backup and Restore, ESXi, Hardware, Hyper-V

Scenario:
You have a virtual machine running on ESXi, and either the disk is thin-provisioned, or you have one or more snapshots. The datastore runs out of space, and the VM goes down. You are unable to boot the VM because there is not enough free space on the datastore.

When you allocate memory to a VM and boot it, ESXi creates a “swapfile” on the datastore using an amount of space equivalent to the amount of RAM you allocated. By default, ESXi is configured to place this swapfile in the same folder (on the same datastore) as the VM.

Thus although the datastore might have 3.75 GB free, when you attempt to boot the server that you have allocated 8 GB of RAM to, it will not boot.
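
The arithmetic is simple enough to sketch in a few lines of Python. This assumes, as described above, that the swapfile is the same size as the RAM allocated to the VM; the function name is hypothetical.

```python
def swapfile_shortfall_gb(free_gb: float, ram_gb: float) -> float:
    """Extra free space (in GB) needed before the swapfile can be created.

    Assumption: swapfile size equals the RAM allocated to the VM.
    Returns 0.0 when the datastore already has enough room.
    """
    return max(ram_gb - free_gb, 0.0)


if __name__ == "__main__":
    # The example from the text: 3.75 GB free, 8 GB of RAM allocated.
    print(swapfile_shortfall_gb(3.75, 8))
```

For the numbers in the example, the datastore is 4.25 GB short, so the VM will not boot until that much space is freed or the swapfile goes somewhere else.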

 

Solution:
If you have more than one datastore available, you can go into the vSphere Client, configuration tab, and configure the virtual machine swapfile location. Place the swapfiles on the second datastore.

If you don’t have more than one datastore, perhaps you can add one. If you have a NAS device that supports NFS, you can use that. If the onboard SATA controller on your server is supported by ESXi, you can add a cheap SATA disk to use for your swapfile location (and a good backup location) while you sort this issue out.

Once you have done this, you can boot the server and run a backup from within the OS.

Once you have a full backup, you can delete the VM to free up space. If you ran out of room due to snapshots, you can create a new VM and start restoring your backup right away. If you ran out of room due to a thin provisioned disk that exceeded the datastore size, you will obviously need to make your datastore larger before proceeding with the restore.

Other ways you can recover from this situation:
1. Add disks to the server and extend the datastore to use them, so the datastore gets larger.

2. Move one or more of the VMDK files to the second datastore and edit your VM configuration to use the disk(s) in the new location.

How you can prevent this situation:
1. When allocating space with thin provisioning, make sure each disk will still fit on the datastore if it grows to its full provisioned size. If you want to use some of the available space while your VMDK files are still small, go right ahead – but make sure you can either delete or move the less important machines on short notice – and monitor your disk usage!

2. Leave plenty of extra room. Put more physical space in the server than you’re ever likely to need. Disks are cheap.
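
Here is a minimal Python sketch of that capacity check. It assumes the worst case is every thin disk fully grown plus one swapfile per VM equal to its RAM; snapshots are not counted, so treat the result as optimistic. The function names and figures are made up for illustration.

```python
def worst_case_usage_gb(provisioned_gb, ram_gb):
    """Space needed if every thin disk fills up and every VM is powered on.

    provisioned_gb: full provisioned size of each virtual disk, in GB
    ram_gb:         RAM allocated to each VM, in GB (one swapfile each)
    """
    return sum(provisioned_gb) + sum(ram_gb)


def headroom_gb(capacity_gb, provisioned_gb, ram_gb):
    """Datastore capacity left over in the worst case (negative = overcommitted)."""
    return capacity_gb - worst_case_usage_gb(provisioned_gb, ram_gb)


if __name__ == "__main__":
    # Hypothetical 500 GB datastore holding two VMs.
    print(headroom_gb(500, [200, 250], [8, 4]))   # fits, with a little room
    print(headroom_gb(500, [300, 250], [8, 4]))   # overcommitted
```

A negative result means the datastore can run out of space once the disks grow, which is exactly the situation described above.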

 

P.S. I am sure that this same concept, or parts of it, can be applied to Hyper-V virtual hosts. However, I am not familiar enough with Hyper-V to give specifics.


Updated: Recover from a USN Rollback WITHOUT Demoting and Promoting your DC

October 27th, 2009 by Paul Sterley | 1 Comment | Filed in Backup and Restore, ESXi, IIS, In the Windows Box, Virtualization, Windows Server

What’s a USN Rollback? That’s when you’ve restored an Active Directory DC in a multiple DC environment using a method that is not Active-Directory Aware. Examples include Ghost images, VMware or Hyper-V snapshots, or other imaging or volume-level restore methods.

Why is that a problem? A very good detailed explanation is available here, but the basic idea is that AD keeps track of which servers it has replicated with and when, and if a DC is rolled back in a way that is not compatible with the record-keeping, the affected DC will disable inbound and outbound replication, and refuse to replicate with the other DCs.
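
To make the record-keeping idea concrete, here is a toy Python model. It is a simplification of my understanding, not how AD is implemented: each DC has an update counter (the USN) and a database identity (the invocation ID), and a partner remembers the highest USN it has seen per identity. The class names and numbers are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DomainController:
    name: str
    invocation_id: str   # identity of this copy of the AD database
    highest_usn: int = 0  # counter that grows with every local change


@dataclass
class Partner:
    # invocation_id -> highest USN already received from that database
    watermarks: dict = field(default_factory=dict)

    def replicate_from(self, dc: DomainController) -> str:
        seen = self.watermarks.get(dc.invocation_id, 0)
        if dc.highest_usn < seen:
            # The DC claims to know less than it already told us: rolled back.
            return "rollback detected"
        self.watermarks[dc.invocation_id] = dc.highest_usn
        return "ok"


if __name__ == "__main__":
    partner = Partner()
    dc = DomainController("DC2", invocation_id="A", highest_usn=1000)
    print(partner.replicate_from(dc))   # normal replication

    dc.highest_usn = 800                # image or snapshot restore
    print(partner.replicate_from(dc))   # same identity, lower USN

    dc.invocation_id = "B"              # AD-aware restore gets a new identity
    print(partner.replicate_from(dc))   # treated as a fresh database
```

The last case is the point of an AD-aware restore: because the restored database presents a new identity, its lower USN does not contradict anything the partner has recorded.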

Here’s a related article by the same author as the above post, which led me to my solution this evening. My article expands on the second option provided, but goes into the mechanics of it, and the associated difficulties.

According to Microsoft’s Knowledge Base article on the subject, recovering from this situation entails forcibly demoting the DC, cleaning up the AD, and then (optionally) promoting it again. If the DC in question has no other roles, or just a couple of basic ones such as a print server, this might be the best way to go, if you’re familiar with such things as seizing FSMO roles and performing metadata cleanup in Active Directory after an unsuccessful DC demotion.

** Update: Read on for more details about how this all works, but make sure you check the update at the bottom of the article for the easier method I successfully tested!

However, if you’re not familiar with these things, or you have other applications on the server which might be affected (IIS, in particular, is very sensitive to the permissions changes associated with DC promotion), this might create a very large amount of havoc on your server.

Your saving grace, if you have one, is a System State backup from before the USN rollback occurred. If you don’t have a backup of JUST the System State, perhaps you can restore an entire image to another server, boot it, and create one.

If you have or can create one of these, your solution becomes much simpler. You just need to boot your server in Directory Services Restore Mode, restore the System State, DO NOT mark any part of your restore as authoritative, and reboot.

After the reboot, you might need to remove the flags AD has set, which have disabled inbound and outbound replications. The commands for this are:

repadmin /options [YourServerName] -disable_inbound_repl
repadmin /options [YourServerName] -disable_outbound_repl

Note: This looks like you are disabling replication, but what you are actually doing is putting a minus sign (-) before the disable option, which removes that option and so re-enables replication. I know, it’s counter-intuitive, but trust me on this one – or go check the syntax yourself.
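
If the minus sign still feels backwards, this small Python model may help. It only mimics the idea of a set of option flags where a leading minus removes a flag and a leading plus adds one; it does not call repadmin, and the plus form is included as an assumption for symmetry.

```python
def apply_option(options: set, change: str) -> set:
    """Return a new option set after applying '+NAME' or '-NAME'."""
    sign, name = change[0], change[1:].upper()
    if sign == "+":
        return options | {name}   # set the flag
    if sign == "-":
        return options - {name}   # clear the flag
    raise ValueError(f"expected +NAME or -NAME, got {change!r}")


if __name__ == "__main__":
    # State of the USN Rollback victim: both replication blocks are set.
    options = {"DISABLE_INBOUND_REPL", "DISABLE_OUTBOUND_REPL"}
    for change in ("-disable_inbound_repl", "-disable_outbound_repl"):
        options = apply_option(options, change)
    print(options)  # no blocks left, so replication is enabled again
```

Removing a "disable" flag is what turns replication back on, which is why the commands read the way they do.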

Of course, you need the Support Tools installed to get the repadmin utility. Once you run those commands, your server will start replicating again, and the more up-to-date DC(s) will override the old, out of date information your USN Rollback victim was holding onto.

There are some extra difficulties associated with the above plan:
1. If you have to restore a server image to create that System State backup, and you restore to different hardware, things could get a little messy. Is it messier than demoting, seizing FSMO roles, performing metadata cleanup, promoting, and cleaning up the fallout from your installed apps? You’ll have to decide on that one.

2. This requires you to have an extra server (or two, if you want to restore more than one DC to create a stable lab environment from which to back up the System State) lying around. Do you have those resources available?

I was facing this issue today, and all of the above became MUCH simpler for me when I realized I could use the Doyenz Test Lab to sort all of this out. I did NOT have a System State backup from before the USN Rollback, but I HAVE been running backups into the Doyenz system since before the problem began.

Here is what I did:
1. Created a backup of the System State

a. Restored a copy of the affected server in the Doyenz Test Lab. I specifically restored from the date BEFORE the USN Rollback happened. It was easy to find this by looking at the date of the last successful replication with repadmin on the affected server.
b. Performed a System State backup using NTBackup (you can do this with WBAdmin on Windows 2008).
c. Zipped the backup file and sent to an FTP server.
d. Shut down the restored server.

2. Performed a test run to make sure this was going to work, without affecting the live servers.

a. Using the Doyenz Portal, I selected last night’s backup and restored it for both servers.
b. I booted the primary DC (the one with the FSMO roles) first.
c. Attached the second (USN Rollback victim) server to the first one in the Lab, and booted it.
d. Pulled the System State backup down from the FTP site onto the affected server.
e. Rebooted the affected server into Directory Services Restore Mode.
f. Restored the System State on the affected server.
g. Rebooted the affected server into Normal Mode.
h. Used the repadmin commands to remove the replication blocks.
i. Forced replication using AD Sites and Services.

3. Verified successful replication.

a. Created a user account on one DC in the Test Lab, forced replication, and checked for the account on the other DC.
b. Deleted the user account on the other DC, and checked it on the first DC.

4. Tested the touchy, sensitive web applications that are running on the affected server.

5. Shut down the servers in the test lab.

After this successful test, I notified the users of pending late-night downtime, and repeated the above steps, this time on the live, production server and with great confidence of the outcome. Sure enough, I restored the AD replication functionality of the server with minimal downtime, without crossing my fingers, holding my breath, and hoping against hope that it would work and not trash the server.

What is more, since the production server is a virtual server, and I have VPN access to the virtual host, I was able to perform the entire operation from my home office, 30 miles away. I didn’t swap any tapes, set up any lab hardware, or drive to the server site late at night. I did the whole thing in comfortable clothes with a 2-liter bottle of Ruby Red Squirt, Winamp playing “Save Me” by Queen, and my devoted cat purring on my lap.

What could be better than that?

Update: It was very handy to be able to do the above scenario, but what is even handier is that I was able to find a significantly simpler method. So much simpler, I wonder why it did not occur to me sooner, and why Microsoft doesn’t have this listed in their KB article.

I set this problem up in a lab scenario again, and this time rather than do a complicated restore of an earlier version of the machine, I simply:

  • Performed a System State backup of the machine (in its broken, non-replicating condition).
  • Booted it into Directory Services Restore Mode.
  • Restored the System State backup, carefully NOT selecting the option to make it authoritative.
  • Rebooted, and ran the above repadmin commands to re-enable replication.

After that, I was able to trigger another replication, and it worked just fine.


This is a test of the Windows Backup system on VMware ESXi. This is only a test.

July 30th, 2009 by Paul Sterley | 2 Comments | Filed in Backup and Restore, ESXi, In the Windows Box, Virtualization, Windows Server

Summary:
Triggered by an excessive heat wave, I used the built-in Windows Backup to do a test restore of my production virtual servers from their usual VMware ESXi host to a smaller, more portable machine that lives in an air-conditioned room.
The servers will run there until the heat wave dissipates, whereupon I will reverse the procedure and move them back to their usual home.

The restore process was incredibly easy. This is a demonstration of how portable and flexible virtual servers are, and how well the built-in Windows Backup works with virtualization.

I can now say with a high level of confidence that running virtual servers, backed up with a local VSS-based disk backup solution and coupled with an offsite backup solution, is a great way to go. My scenario was a simple problem with a simple solution, but this power and flexibility can easily be applied in many different situations.

The Full Story:
If you live in the Western Washington area, you know we’re having a crazy heat wave.

Many businesses have servers tucked away in closets, kitchen areas, and other little nooks and crannies, without air conditioning. Mine is one of them. I strongly recommend air conditioning to my customers, and it is with some embarrassment that I admit that I have not implemented it myself – but I have never needed it before. My company’s servers are in a steel enclosure in a 675 square foot garage. Usually it stays quite cool, verified by the thermal monitoring unit attached to my battery backup system. If the temperature gets too high, the battery backup sends a shutdown command to the servers so they are not damaged by the heat.

Several of my customers have had thermal shutdown issues the last few days. Today it was my turn. I happened to be sitting at my workstation when the e-mail arrived, telling me that I had 3 minutes to correct the situation before things started shutting down.

I started by logging into the battery backup unit and adjusting the threshold up a few degrees to give me time to work. Next I walked down to the server rack and opened its door to allow more air flow to the servers. The thermal monitor is just inside the door, right next to the air intake holes in the front of the server. The third step I took was to shut down one of the servers in the rack – a virtual server running Windows Home Server, which backs up my workstations. Since I don’t store data on workstations, it’s OK to go a few days without backing them up.

Back in my air-conditioned office, I logged into the battery backup management web page and saw that it had gone up to 91 degrees while I was working, but was now back to 90. I watched it for a few minutes. It stayed at 90. Still too hot.

Sitting back and thinking about my options, I considered fans – but the entire room was very hot. Fans would only push the hot air around, and I’ve heard horror stories and seen pictures of server rooms which had burned down due to electrical fires starting from cheap fans that weren’t designed for a 24/7 duty cycle.

I considered moving the server to my office – but the server is very noisy, being a rack-mount server with small fans moving very quickly. However, my servers are virtual, running on VMware ESXi, so they should be very portable… and an idea was formed.

One of the great benefits of virtualization is that you can put your virtual machine on any hardware that is supported by the host operating system, which in my case is VMware ESXi. That makes backup and restore very simple. You don’t have to be concerned with hard disk controller drivers and other such obstacles to a smooth restore operation.

I’ve been evangelizing these virtues for over a year now, and using the technology myself. I decided to use this unfortunate heat wave as an opportunity to perform a real-world test of the technology I have been talking about. I decided to do a last-minute backup of my server, move the backup device to a smaller, quieter machine in my office, and restore the backup. I would run it in my office until temperatures reach sane levels again, and then reverse the procedure.

I warned the users that the server was going down for a while. I stopped the incoming e-mail service, and forced a “backup now” on the SBS 2008 and Windows 2008 servers that form my infrastructure. That took about 1/2 hour. I am using the built-in Windows Backup, and it is performing disk-based incremental backups. Then I shut down the “guest” operating systems, and finally shut down the host server.

Again I walked down to the server rack and disconnected the external hard disk that I store my local backups on. It was nearly hot enough to burn my fingers. I carried it up to my office and plugged it into the generic white-box server ($800) that I use to run lab experiments. This machine would also make an excellent loaner ESXi server if one of my customers experienced a server failure. It has a single quad-core 2.5GHz CPU, 8GB RAM, and 1.5 TB of disk space.

I attached the USB stick that boots VMware ESXi on that host, booted it up, and configured its networking (2 minutes).

Next step, I created two guest virtual machines with the same disk sizes as the machines I was going to restore. I had to allocate less memory, so the servers might run a little slower. Then I attached the virtual disks on the backup device to the appropriate VMs, and finally mapped the SBS2008 and Windows 2008 DVDs to the new virtual machines and configured them to boot from DVD.

I booted up the SBS2008 server first. It booted from DVD, and I used the menus on the DVD to start a Full Computer Restore, using the backups that it found automatically when it searched the attached disks. I chose the correct date/time of the backup to restore, verified that all of the volumes were present, and told it to begin.

[Two screenshots of the restore process]

I didn’t have to flounder around looking for hard disk controller drivers, making floppy disks or putting drivers on USB. I set to work on the second server, which is less critical to my business, and had similar results with that one. Not wanting to cause the first restore to slow down, I brought the second server to the final prompt to begin the restore, and waited for the first one to complete.

The restore was the easiest full-server restore I have ever done, with the best results. After the restore, I booted the server, and it was off and running without a backward glance.

The first server, which runs 90% of my business, was restored and running within 2 hours of shutting down for the move. A backup queuing mail service had received and stored my e-mail while it was down, so I didn’t miss a single message. The second server, running my blog site, followed soon after.

I did have three very small hiccups:
1. Windows detected the hardware change (probably the CPU chip) and required re-activation, but it worked automatically – two mouse clicks and a few seconds took care of it.
2. Because I forgot to set the date/time properly on the destination ESXi host, my SBS2008 server’s clock got set wrong and that caused authentication problems for a few minutes until I figured out what was going on and corrected it.
3. The DHCP Server service on my SBS did not start because I was running an open-source DHCP server during the downtime to keep everything connected to the network. I just had to stop the one and start the other.
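
On the second hiccup: my assumption is that the authentication problems came from the usual Windows domain clock tolerance of about five minutes, so a quick sanity check of the host clock before booting the guests would have caught it. A minimal Python sketch, with a hypothetical function name and made-up times:

```python
from datetime import datetime, timedelta

# Assumption: the domain tolerates roughly five minutes of clock difference.
MAX_SKEW = timedelta(minutes=5)


def clock_skew_ok(host_time: datetime, reference_time: datetime,
                  max_skew: timedelta = MAX_SKEW) -> bool:
    """True if the host clock is within max_skew of the reference clock."""
    return abs(host_time - reference_time) <= max_skew


if __name__ == "__main__":
    reference = datetime(2009, 7, 30, 14, 0, 0)   # a trusted clock
    host = datetime(2009, 7, 30, 15, 2, 0)        # ESXi host set wrong
    print(clock_skew_ok(host, reference))
```

In practice the fix is simply to set the date and time on the destination ESXi host before starting the restored servers.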

Compared with the kind of difficulties I would normally expect with this kind of full server restore to different hardware, this was a piece of cake.

I can now say with a high level of confidence that running virtual servers, backed up with a local VSS-based disk backup solution and coupled with an offsite backup solution, is a great way to go. My scenario was a simple problem with a simple solution, but this power and flexibility can easily be applied in many different situations.


Dell and ESXi – Hardware Monitoring? Good Luck.

April 7th, 2009 by Paul Sterley | 6 Comments | Filed in ESXi, Hardware, Virtualization

Note: The rant contained in this post is probably only relevant for a short period of time. I’m sure that Dell and VMware will make this better. At least I hope so. And I hope they don’t make it better ONLY for brand new servers. I hope they fix it for servers that are six months old too.

My Task: Get monitoring/management alerts for hardware status such as RAID volumes, physical disks, fans, power supplies, etc, for a Dell PowerEdge 2950 III server, purchased less than 6 months ago.

ESXi 3.5 update 4 has the Dell CIM agents and things built into it, I am told. I am also told that OpenManage 6.0.3 can talk to these agents directly. However, nobody can tell me exactly how this works. Can you install it on a VM and then point it to the ESXi management IP? Do you still need Dell IT Assistant, or does it still rely on configuring SNMP traps (a task I enjoy about as much as whacking myself in the shin with a rubber mallet)? Nobody at Dell seems to know. To be fair, u4 was only released yesterday. Nobody at Dell seems to have been trained on this yet. They were even surprised to learn that OM 6.0.3 had been released. Eventually one of them told me that 6.0.3 only works with the brand new Generation11 servers. Lovely.

For “older” servers, it’s even more fun. I did hours of research. I downloaded OpenManage Management Station, which includes IT Assistant. The readme file states clearly that 64-bit Windows 2008 is supported – but when the installer runs the prerequisite check, it tells me that “IT Assistant cannot be installed on a system running a Microsoft(R) Windows(R) x64 operating system.” What?! There are a ton of other prerequisites too. SQL Express, Java, some portion of Visual Studio (which will trigger a 450MB Windows Update for the entire VS SP1, which will fail and need to be installed manually). Then you need the ESXi Remote Command Line Utility, which in turn requires ActivePerl. You really wanted to install all of that junk on your SBS server, didn’t you?

I gave this one final shot. I actually installed SQL, Java, some Visual Studio thing, SNMP services, the ESXi RCLI, and even ActivePerl. I jumbled all of that crud onto my beautiful, uncluttered, stable server (snapshot first) and started going through the Dell PDF that tells how to enable SNMP on ESXi (msmpa02.pdf, page 10).

I got as far as executing the Perl script, and got this error:
Changing community list to: public…
Failed : fault.RestrictedVersion.summary

OK, that’s it. I am done. Forget it.

So much for the altruistic statement on Dell’s website that says:
“Virtualization is a key path to simplifying IT. Dell and VMware are committed to making virtualization accessible to the mainstream. It shouldn’t be just for the largest datacenters. It shouldn’t be complicated. It shouldn’t require an army of consultants.”

That’s very nice politics but I don’t see it happening. When VMware and Dell pull this together well enough that I don’t need 538MB of junk from different vendors, a bunch of command line scripting, SNMP configuration, and lots of figuring things out, then I will be interested in working out how to get alerts when hardware events happen.

The VI client has all of the health status indicators right there. It would probably be 50 lines of code to have ESXi send SMTP notifications when any of those dots goes yellow or red. VMware needs to write that into ESXi – but they won’t, because they want people to buy the full Virtual Infrastructure for $3000.
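
To show roughly what I mean, here is a Python sketch of the notification half of the job. It is not ESXi code and does not use any VMware API: the sensor names and states are assumed to arrive as a plain dictionary, and the addresses and mail host are placeholders.

```python
import smtplib
from email.message import EmailMessage

ALERT_STATES = {"yellow", "red"}


def build_alert(statuses: dict, sender: str, recipient: str):
    """Build an e-mail listing every sensor that is yellow or red.

    statuses maps a sensor name to "green", "yellow", or "red"
    (a hypothetical input; reading it from the host is not shown).
    Returns None when everything is green.
    """
    bad = {name: state for name, state in statuses.items()
           if state in ALERT_STATES}
    if not bad:
        return None
    msg = EmailMessage()
    msg["Subject"] = f"Hardware alert: {len(bad)} sensor(s) need attention"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(f"{name}: {state}"
                              for name, state in sorted(bad.items())))
    return msg


def send_alert(msg: EmailMessage, smtp_host: str, port: int = 25) -> None:
    """Hand the message to an SMTP server (host name is a placeholder)."""
    with smtplib.SMTP(smtp_host, port) as smtp:
        smtp.send_message(msg)


if __name__ == "__main__":
    alert = build_alert({"Fan 1": "green", "PSU 2": "red"},
                        "esxi@example.com", "admin@example.com")
    if alert is not None:
        print(alert)  # call send_alert(alert, "mail.example.com") to deliver
```

The hard part, of course, is getting those status values out of the host in the first place, which is the piece I wish were built in.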
