One key to keeping your business on its feet in a disaster is anticipating the sometimes cascading effects a catastrophe can have on your IT operation.
Take Miami-Dade County, for example. When a hurricane hit southern Florida in 1992, the county’s data centre lost power. Diesel generators had overheated when well water ran out because high winds had broken water mains and lowered the water table. IT managers later had air-cooled generators installed.
One of the problems with disaster recovery, experts say, is that although most companies have plans for common scenarios — weather-related emergencies, headquarters lockouts and massive power outages — those plans aren’t regularly tested or communicated to end users. In fact, in a recent survey of 283 Computerworld (U.S.) readers, 81 per cent of the respondents said their organizations have disaster recovery plans. But 71 per cent of the respondents at companies with plans said the plans hadn’t been exercised in the past year.
It takes forethought to avoid a business shutdown during a disaster. Experts and users agree that there are steps you can take to increase your chances of coming through the most common disasters unscathed.
Weather-related emergencies
“If you look at why facilities fail (during weather disasters), it’s all pretty predictable. They call it an act of God, and I call it an act of stupidity,” says Ken Brill, executive director of The Uptime Institute in Santa Fe, N.M. Hurricanes threaten Miami-Dade County’s data centre every year from June through November, yet IT managers still struggle with getting everyone to understand the importance of disaster planning. “The challenge we always have is to make sure the staff is completely involved and we have participation,” says Ruben Lopez, director of the enterprise technology services department for the county.
Miami-Dade County gives itself a 56-hour window to test its disaster recovery plan each year by cutting over to its alternate data centre and restoring data. It uses the time to find deficiencies and later corrects them. “Business continuity and disaster recovery preparedness is all about figuring out what your deficiencies are and how you’re going to fix them. It’s not about how to get an A+ on paper,” says Joe Torres, disaster recovery coordinator for Miami-Dade County. He points out that it’s not the people he’s testing during a disaster recovery exercise but the plan — “because you can’t depend on the people being available.”
“You’re going to give them a book with instructions, and they need to be able to follow that,” Torres says. One step Miami-Dade has taken in that direction is to consider call-tree software that could help employees contact key managers in an emergency.
Walter Hatten, senior vice-president and technical services manager at Hancock Bank in Gulfport, Miss., has focused on consolidating his server farm and creating a redundant communications network for an area of the country that gets hit or brushed by a hurricane every three and a half years. The 100-branch bank, with headquarters on the Gulf of Mexico, is consolidating 500 servers onto a Linux-based mainframe to reduce recovery time in a disaster.
“Just the sheer magnitude of rebuilding 500 servers puts us at risk for not being able to do it quickly enough,” says Hatten, who chose Linux for its open standard and scalability. He says the mainframe will offer greater speed for recovery of data, reducing the amount of time it would take to restore data from days to hours.
Headquarters lockouts
Maria Herrera is chief technology officer at Patton Boggs LLP, a Washington-based law firm with 400 attorneys specializing in international trade law. Because of the firm’s proximity to the U.S. Capitol building, one constant concern is a building lockout brought on by terrorist threats, she says. Herrera has set up duplicate operating environments in several remote offices and has contracted with two disaster recovery vendors: SunGard Data Systems Inc. in Wayne, Pa., for server recovery and workstation services, and AmeriVault Corp. in Waltham, Mass., for data backup.
In January, AmeriVault installed its CentralControl interface on desktops and an agent on each of Patton Boggs’ servers. After completing an initial full backup of all data, AmeriVault now performs daily incremental backups of deltas, or changes, to disaster recovery centres in Waltham and Philadelphia.
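The idea behind an incremental backup, copying only what has changed since the last run, can be illustrated with a minimal Python sketch. It is not AmeriVault's implementation, which the article does not describe; the function names and the choice of SHA-256 content hashes to detect changes are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def snapshot(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its content digest."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_since(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """List files that are new or modified relative to the previous snapshot.

    Hypothetical helper: only these files would need to be sent
    in the next incremental backup.
    """
    return sorted(
        name for name, digest in current.items()
        if previous.get(name) != digest
    )
```

In this sketch, the snapshot taken after the initial full backup is stored, and each daily run compares a fresh snapshot against it to decide which files to transmit. Deleted files are ignored here; a real product would have to track them as well.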
In an emergency, data restores can be performed remotely, even from home, by administrators using a point-and-click function on a Web portal provided by AmeriVault, or data can be shipped on tape for large restores.
“Every month or couple of months, we access several documents and download them from AmeriVault to test the system,” says Herrera. During full testing, she spends 16 hours recovering full data sets. “We’re able to restore everything within the firm in about 10 hours,” she says. Herrera also suggests involving all IT personnel in the disaster recovery testing process, because in an emergency, you never know who might be available to help. She has trained employees in all four satellite offices around the country on disaster recovery procedures. SunGard also has several facilities where IT personnel and lawyers can meet to continue work in the event of a headquarters lockout, Herrera says.
Officials at Mizuho Capital Markets Corp., a subsidiary of the world’s second-largest financial services firm, Mizuho Financial Group Inc. in Tokyo, say that some of the most effective disaster recovery tools are the simplest. For example, when a protest kept employees from entering the firm’s Times Square headquarters late last year, IT managers passed out laminated business cards with a directory of managers’ home phone numbers.
Doug Lilly, a senior telecommunications technologist at the Delaware Department of Technology and Information, says his agency has three data centres that support about 20,000 state employees. The department uses EMC Corp.’s Symmetrix Remote Data Facility to replicate data among the data centres. It also uses backup software from Oceanport, N.J.-based CommVault Systems Inc. as a central management tool.
“If this site were bombed . . . we’d have servers running to replace them, but we’d still have to restore data from tapes,” Lilly says. “CommVault’s software transfers between 60GB and 65GB of data per hour. It would be a few hours before we got people up online.” Lilly’s IT team also keeps a copy of disaster recovery procedures at home. “Team leaders notify everyone, and we carry cell phones and BlackBerries that are on redundant networks,” he says. “It’s a pretty unified messaging platform . . . that ties data, voice, fax and video into one application. They can get hold of us anytime, anywhere.”
Massive power outages
Edward Koplin, an engineer at Jack Dale Associates PC, an engineering firm in Baltimore, says a lack of disaster testing is the No. 1 cause of data centre failures during a blackout. Koplin suggests that companies test their diesel generators often and at full load for as long as they’re expected to be in use during a blackout.
The Uptime Institute’s Brill adds to that advice: Always prepare for a blackout with at least two more generators than needed, and test them by literally pulling the plug. “I would test it for as long as I expected it to work under load. I’d do that at least every two or three years. And I would run it in the summer,” Brill says.
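Brill's rule of thumb, two more generators than the load requires, reduces to simple arithmetic. The Python sketch below is one way to express it; the function name, the parameters and the load and capacity figures in the usage note are hypothetical and do not come from the article.

```python
import math


def generators_to_install(peak_load_kw: float,
                          unit_capacity_kw: float,
                          spares: int = 2) -> int:
    """Return how many generators to install: enough to carry the peak
    load, plus a number of spare units (two by default, per the rule of
    thumb quoted above)."""
    if peak_load_kw <= 0 or unit_capacity_kw <= 0:
        raise ValueError("load and capacity must be positive")
    if spares < 0:
        raise ValueError("spares cannot be negative")
    # Round up: a fractional generator still means one more whole unit.
    needed = math.ceil(peak_load_kw / unit_capacity_kw)
    return needed + spares
```

For an assumed 900 kW peak load served by 400 kW units, three generators are needed to carry the load, so `generators_to_install(900, 400)` returns 5.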
Jim Rittas, a security administrator responsible for networking at Mizuho, says the company can now perform full data restores after blackouts or other disasters in an hour instead of two days because it now mirrors its data to a New Jersey office that’s also an active work site. “The other thing we did was diversify our Internet connections. Internet connections now flow in and out of New York and New Jersey, where we only had one in New York before,” Rittas says.
Needham, Mass.-based research firm TowerGroup recommends turning parts of disaster recovery or business continuity data centres into profit centres by going with an active/active operations model. Traditionally, companies have set up an active primary data centre and an unmanned backup site. An active/active model eliminates the need for IT staffers to relocate in a disaster because they’re permanently stationed at the disaster recovery site, which is also used to run active business applications.
Integrating disaster recovery IT assets and personnel into operations budgets across geographically dispersed data centres will also help blur the line between disaster recovery and operations spending. It’s best to have a complete copy of your data in an alternate site at all times, “not just some of it,” says Wayne Schletter, associate director of global technology at Mizuho Capital Markets. “You don’t want to be piecing things together after something happens. You just want to be ready to go.”
Sidebar
Tips for coping with disasters
— Choose vendors that are proactive and don’t require prodding to upgrade or test your disaster recovery plan.
— Don’t test people; test your disaster recovery plan. People come and go. Make the plan easy to follow and use.
— After a disaster, don’t count on employees being willing to fly to alternate work sites.
— Distribute key disaster recovery personnel across many geographic locations.
— Turn disaster recovery data centres into active work sites.
— Disaster recovery plans are living, breathing things. Keep them up to date and make sure employees are well versed in them.
— Seek vendors with plenty of longevity and geographically dispersed offices for disaster recovery.
— Make sure portals to your outsourcing vendor are dedicated or have enough bandwidth to handle multiple companies seeking fast restores.
— Make sure that not just your vendor but you understand how to back up and restore systems.
— Verify that backup tapes can restore data.
— Train and involve all IT personnel in the disaster recovery process.
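One of the tips above, verifying that backups can actually restore data, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than any vendor's tool: it assumes the original files and a test restore are both reachable as directories, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_restore(source_root: Path, restored_root: Path) -> list[str]:
    """Return relative paths that are missing from the restored copy
    or whose contents differ from the source. An empty list means
    every source file was restored intact."""
    problems = []
    for src in sorted(source_root.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_root)
        dst = restored_root / rel
        if not dst.is_file() or sha256_of(src) != sha256_of(dst):
            problems.append(rel.as_posix())
    return problems
```

Running a check like this after each test restore gives a concrete list of files to investigate, rather than relying on the backup job's own success message.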