VMware’s ESX 3.0 was released a bit more than three years ago. While ESX 2.5 was a solid virtualization platform, ESX 3.0 pushed server virtualization into territory where small and large businesses alike could really sink their teeth into it.
The new high-availability features in ESX 3.0 were a huge draw to many businesses seeking better uptime, and the refined centralized management offered by VirtualCenter 2.0 was compelling. Support for a wider set of hardware such as iSCSI SANs also allowed high-end functionality at a lower price.
Now that we’re three years down the road, many of these initial adopters of ESX 3.0 are starting to replace their hosts with new ones and preparing to upgrade to vSphere 4.0.
That seems to be leaving a lot of server admins staring at a stack of three-year-old virtualization hosts that aren’t yet finished doing their jobs. Sure, they might not be quite fast enough to go the distance with increased production loads, and you might like to have some more performance headroom, but it’s always a painful decision to turn off a bunch of expensive servers and not do anything with them.
Instead of tossing their old hosts in a dumpster, many enterprises are opting to reuse them. Some turn them into development clusters to separate dev loads from production loads. Some make them available for testing and training. My favorite use is as the seed hardware for a warm site. Even if the old hardware can’t run your full production load at full speed, having some immediately available capacity when the production site fails is far better than having none — and it bridges the gap between the disaster itself and the arrival of replacement hardware on site.
Assuming that business continuity is important to your organization and you have multiple offices or a sufficiently large campus, building a warm site is a great use of your hardware. It certainly isn’t free and there are a number of common pitfalls that you’ll want to steer clear of, but it’s definitely a worthy endeavor if downtime costs you money.
It may be that, to be useful, a warm site would cost more than you can currently afford to spend on it. In that case it’s better to save your pennies and do it correctly than to implement something that won’t accomplish your organization’s goals.
For example, if you run a Fibre Channel SAN with no iSCSI connectivity and don’t have the tremendous luck of having dark fiber running to your warm site, implementing SAN replication might be out of the question without hardware such as an FCIP gateway or software such as EMC’s RepliStor. If you’re in this boat, be sure to consider these factors the next time you’re weighing an upgrade to or replacement of your current SAN.
On the other hand, users of devices such as NetApp filers may only need to add SnapMirror licensing, and users of EqualLogic PeerStorage arrays (also sold under the Dell brand) have everything they need already. No matter what your SAN, to perform SAN-to-SAN replication you’re going to need a second one.
If performing SAN-to-SAN replication is out of the question, you still have options. There are several good host-based replication software packages available that will run on the ESX hosts and do direct host-to-host replication. These include Vizioncore vReplicator and NSI DoubleTake for VI. They are usually licensed per VM rather than per host, which can make them unattractive depending upon the number of guests you want to replicate. The big caveat here is that you will need a large amount of directly attached storage on the old hosts that are being moved across to the warm site. (If they had been attached to your production SAN, they may no longer have any disks in them.)
No matter how you decide to do it, your storage configuration — whether it involves SAN or host-based replication — is the most important part of the warm site design and should not be treated lightly.
For example, let’s say your initial calculations show that you’re going to need two T1s’ worth of bandwidth (3.0Mbps) to replicate an estimated 25GB of storage deltas per 24-hour period to maintain whatever RPO you’ve set. But it turns out you actually need to move 35GB per day to meet that RPO — a difference of roughly one more T1 circuit. Depending on your bandwidth costs, that small difference could cost as much as an entirely new SAN or a few new virtualization hosts over three years’ time.
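To see where those figures come from, here’s a quick back-of-the-envelope sketch in Python. The function name, the 24-hour replication window, and the decimal-gigabyte conversion are my own assumptions for illustration; plug in your own churn estimates and RPO window.

```python
# Rough replication bandwidth estimate (a sketch, not a sizing tool).
# Assumes decimal gigabytes and a sustained, evenly spread transfer
# across the whole replication window.

def required_mbps(gb_per_day, window_hours=24.0):
    """Sustained link speed (Mbps) needed to move gb_per_day within the window."""
    bits = gb_per_day * 8 * 1000**3            # GB per day -> bits per day
    return bits / (window_hours * 3600) / 1e6  # bits per second -> Mbps

T1_MBPS = 1.544
for daily_gb in (25, 35):
    mbps = required_mbps(daily_gb)
    print(f"{daily_gb} GB/day needs about {mbps:.2f} Mbps ({mbps / T1_MBPS:.1f} T1s)")
```

Run it and 25GB per day works out to roughly 2.3Mbps (inside two T1s), while 35GB per day needs about 3.2Mbps, which is where the extra circuit comes in.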
So if estimating your replication bandwidth needs is so important, there must be a tried-and-true way of doing it, right? Not really. There are some tricks to determine how much data is turning over on your VMs, but you can’t always trust what they tell you.
The first and easiest method is to use VMware’s built-in snapshot functionality. Take a snapshot of every VM you want to replicate, wait a period of time equal to what you’d like your replication period to be based on your RPO, then examine the snapshot files on your VMFS volumes to see how big they are. (Note: Be sure you have enough free space on your VMFS volumes before you do this.) That figure is roughly how much data has changed on those VMs in that period. If you do this at several times during different parts of your production day and month, you should get a reasonably good idea of how quickly your data is changing.
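If you’d rather not total those files up by hand, a small script can do it for you. This is only a sketch: it assumes your VMFS volumes are visible under /vmfs/volumes (as they are from the ESX service console) and that the snapshot redo logs follow the usual "-delta.vmdk" naming convention.

```python
#!/usr/bin/env python
# Sketch: sum the sizes of VMware snapshot delta files to gauge how much
# data has changed since the snapshots were taken. Adjust VMFS_ROOT if your
# datastores are mounted elsewhere.

import os

VMFS_ROOT = "/vmfs/volumes"   # assumed mount point for your datastores

total = 0
for dirpath, dirnames, filenames in os.walk(VMFS_ROOT):
    for name in filenames:
        if name.endswith("-delta.vmdk"):
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            total += size
            print("%8d MB  %s" % (size // 2**20, path))

print("Total snapshot growth: %.1f GB" % (total / float(2**30)))
```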
However, that’s not all there is to it. Depending on your SAN platform, your SAN may replicate data in larger blocks than VMware’s snapshot files allocate. Thus, a single change to a 1KB file within a VM may be seen as a change to a 16MB block on your SAN — essentially magnifying the amount of data that needs to move by more than 16,000 times. A difference of that magnitude would be a fairly rare occurrence, but it shows that you can’t easily predict actual replication volumes from snapshot sizes alone.
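Here’s a rough worst-case calculation of that effect. The 16MB replication block size is just an example, and the assumption that every small write lands in a different block is deliberately pessimistic.

```python
# Sketch of the worst case: many small, scattered guest writes, each of
# which dirties an entire SAN replication block. The block size and write
# pattern below are illustrative assumptions, not vendor figures.

BLOCK = 16 * 2**20   # assumed replication block size: 16 MB

def worst_case_replicated(write_sizes):
    """Assume every write lands in a different replication block."""
    actual = sum(write_sizes)            # bytes the guest actually changed
    dirtied = len(write_sizes) * BLOCK   # bytes the SAN may have to ship
    return actual, dirtied

actual, dirtied = worst_case_replicated([1024] * 100)   # 100 scattered 1 KB writes
print("Guest changed %.0f KB; SAN may ship %.0f MB (about %.0fx inflation)"
      % (actual / 1024.0, dirtied / float(2**20), dirtied / float(actual)))
```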
To combat this problem and generally increase the amount of data your WAN can carry, using some form of WAN accelerator that includes deduplication technology is a wise move. Examples of such products include Cisco’s WAAS and Riverbed’s Steelhead. Both platforms have their own strengths and weaknesses, but they operate in much the same way. They optimize the WAN data flow through intelligent re-windowing and other TCP enhancements, but they also retain a remote cache of what has previously been sent over the WAN link.
In the event that they get a cache hit (a packet that has the same data payload as one seen previously), that packet is not re-sent. Instead, just a pointer to that packet’s payload is sent to the device on the other end of the circuit. In the example of a 1KB change requiring 16MB of data transmission, a WAN accelerator could essentially nullify the problem.
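Conceptually, the cache behaves something like the toy sketch below: fingerprint each payload, and if the far end has already seen that fingerprint, send only the fingerprint. This is purely an illustration of the idea, not how WAAS or Steelhead are implemented internally.

```python
# Toy payload cache: re-sent data is replaced by a short fingerprint.

import hashlib

class DedupSender:
    def __init__(self):
        self.seen = set()                    # fingerprints already sent across the WAN

    def send(self, payload):
        digest = hashlib.sha1(payload).hexdigest()
        if digest in self.seen:
            return ("ref", digest)           # cache hit: ship a pointer, not the payload
        self.seen.add(digest)
        return ("data", digest, payload)     # first sighting: ship the full payload

sender = DedupSender()
payload = b"A" * 1460                        # one packet's worth of data
print(sender.send(payload)[0])               # "data" -> full payload crosses the WAN
print(sender.send(payload)[0])               # "ref"  -> only a 40-character fingerprint does
```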
Licensing is another consideration: one option is to purchase VMware vSphere Essentials for the warm site. It offers a much more limited feature set than your production licensing, but it can still start and run your VMs.
Another issue to consider is whether you want to implement VMware’s Site Recovery Manager (SRM). SRM requires SAN-to-SAN replication on arrays that support it (most do), and it is somewhat expensive. However, if being able to test your recovery plan frequently and having a completely automated failover process is important to you, implementing SRM is certainly worth a close look. It’s also worth noting that vSphere 4.0 support for SRM likely won’t be available until later this year.
If you do it right, reusing retired hardware is a great idea.
Taking advantage of retired hardware to build a warm-standby datacenter is a fantastic use of resources and builds in backup computational capacity you’ll be happy to have if you ever need it. However, blindly building a warm site without a plan — regardless of how much extra hardware you have kicking around — isn’t likely to work out well in the long run.
Failing to do any of several things — set goals properly, consider storage resources, keep WAN bandwidth in mind, or take into account software licensing limitations — will almost certainly make the exercise more expensive and less effective than it could be.
Notwithstanding all of these challenges, today’s virtualization technology, coupled with modern storage and networking technology, makes building always-on standby failover capacity far easier than it has ever been. If your organization places a high value on uptime, now is the time to dip a toe in the water and give it a try.