It’s one of the unwritten laws of physics: At some time or another, everybody screws up.
But when IT pros make mistakes, they don’t mess around. Entire buildings go dark. Web sites disappear. Companies grind to a halt. Because if you’re going to mess up, you might as well make it count.
“I always tell my guys, hey, you’re gonna do stupid stuff,” says Rich Casselberry, director of IT operations at Enterasys, a networking systems vendor. “It’s OK to do something stupid if you have the wrong information. But if you do something stupid because you’re stupid, that’s a problem. The trick is to not flip out, which only makes it worse, or try to hide it. You need to figure out how to keep it from happening again.”
Sure, some of these mishaps are amusing in retrospect. But don’t laugh too hard. We know you’ve probably done worse.
True IT confession No. 1: The case of the mysterious invisible backup
Our first tale of misadventure involves a longtime IT pro who doesn’t want his real name used, so we’ll just call him Hard Luck Harry.
Harry had his share of mishaps when he started out a decade ago at a major networking equipment maker in the Northeast. There was the time he changed an environment variable and broke his company’s financial apps, earning an e-mail from his boss ordering him to “never hack on this system again.” Or the time he crashed the company’s core ERP system by overwriting /dev/tty. And after he accidentally ripped the company’s T1 lines out of the wall with his pager, Harry says, he was banned from ever setting foot in the telecom closet again.
But the worst one happened after Harry installed an Emerald tape backup system. Did he bother to read the manual? Please. This was child’s play. Just load install.exe and let the software do its thing.
It seemed to work perfectly. Four hours later, the first backup completed and everything looked fine.
Fast-forward six months. Harry gets a call late one night at home from one of his work pals. That night’s backup tape is completely blank, the friend tells him. Worse, the last four weeks of backups are also blank.
As Harry soon discovered, that particular backup program installs in demo mode by default. Demo mode looked exactly like real mode and even took the same amount of time as an actual backup, but nothing ever got written to tape — a fact that was noted in the manual, which Harry might have seen had he read it.
Fortunately, the company used ADP for payroll processing. ADP shipped back historical payroll records, so the firm lost only a week’s worth of data. The bad news? Harry was up until 3 a.m. manually stuffing payroll envelopes, along with his boss, the VP of finance, the entire payroll department, and the company’s brand-new CIO, whom he met for the first time that night.
“I got to say, I was pretty popular,” he jokes. “I think the only reason they didn’t fire me was by that point they had gotten so used to me screwing up, they realized I couldn’t do anything right.”
Lessons learned? 1. Test the restores, not the backups, says Harry. “No one cares if the backup works; they care if the restore does.” 2. Think before you type. 3. Remove your pager (or BlackBerry) before entering the telecom closet, just to be safe.
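Harry’s first lesson is easy to preach and easy to skip. Below is a minimal Python sketch of what a nightly restore spot-check might look like; the archive path, the sample file, and the tar-based backup format are hypothetical stand-ins, not anything Harry’s shop actually ran.

```python
"""Spot-check last night's backup by actually restoring something from it.

Hypothetical setup: a tar archive at /backups/nightly.tar that should contain
the file payroll/current/employees.db. Adjust both for a real environment.
"""
import subprocess
import sys
import tempfile
from pathlib import Path

BACKUP_ARCHIVE = Path("/backups/nightly.tar")    # hypothetical backup target
SAMPLE_FILE = "payroll/current/employees.db"     # hypothetical file to restore


def verify_restore() -> bool:
    """Restore one sample file into a scratch directory and make sure it's real."""
    # A demo-mode or failed backup often leaves an empty (or missing) archive.
    if not BACKUP_ARCHIVE.exists() or BACKUP_ARCHIVE.stat().st_size == 0:
        print("Backup archive is missing or empty -- the job 'succeeded' but wrote nothing.")
        return False

    with tempfile.TemporaryDirectory() as scratch:
        # Pull a single known file back out of the archive.
        result = subprocess.run(
            ["tar", "-xf", str(BACKUP_ARCHIVE), "-C", scratch, SAMPLE_FILE],
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            print(f"Restore failed: {result.stderr.strip()}")
            return False

        restored = Path(scratch) / SAMPLE_FILE
        if not restored.exists() or restored.stat().st_size == 0:
            print("Restore 'succeeded' but the file is empty -- treat the backup as bad.")
            return False

    print("Spot-check restore succeeded.")
    return True


if __name__ == "__main__":
    sys.exit(0 if verify_restore() else 1)
```

Run something like this right after the backup job finishes, and a demo-mode “success” gets caught the next morning instead of six months later.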
True IT confession No. 2: Sometimes it takes a janitor to clean up an IT mess
Late one night in 1997, Josh Stephens was working all alone at his console at a large Midwestern telecom company. Stephens was making changes to the Cisco Catalyst switches at the telco’s main customer call center, which was located several states away. That’s when the spanning tree protocols hit the fan.
“I’m still not sure exactly how I did it, but I caused some sort of broadcast storm and STP freak-out that locked up not only the switch I was working on but every single switch in that facility,” he says. That broadcast storm brought down hundreds of call center users, stranding many of them in the middle of customer calls.
Worse, the switches were “locked hard,” requiring a physical power-off and a slow methodical plan to bring them back online, one at a time. The datacenter was hundreds of miles away and had no on-site IT staff, so Stephens did the next best thing: He called maintenance.
“I ended up finding a janitor that had keys to all of my LAN closets and I talked him through (a) which devices were the Catalyst switches, and (b) how to power them off,” he says. “I also promised him he wouldn’t get fired for helping me.”
Though the call center was down for more than an hour, nobody ever found out why or who was behind the glitch, says Stephens, who is now VP of technology and Head Geek (yes, that’s the actual title) for SolarWinds, a maker of network management software.
Lessons learned? 1. Don’t make changes without scheduling a window for them, even if the changes seem minor, says Stephens. 2. Never conduct a change control event without IT resources near the gear you’re changing. 3. Be nice to the janitors. One day they might save your assets.
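Stephens’ first two lessons boil down to the same habit: take a baseline before you touch anything, and check it again before you walk away. The Python sketch below is one hedged way to do that; the switch addresses and the ping-based reachability test are illustrative assumptions, not details from the actual outage.

```python
"""Before-and-after reachability check for a switch change window.

The management addresses below are hypothetical; the point is the ritual of a
baseline snapshot, the change, and a comparison before you call it done.
"""
import subprocess

# Hypothetical management addresses of the Catalyst switches in one facility.
CALL_CENTER_SWITCHES = ["10.20.1.1", "10.20.1.2", "10.20.1.3"]


def reachable(host: str, timeout_s: int = 2) -> bool:
    """Return True if a single ICMP echo to `host` gets a reply (Linux ping flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def snapshot() -> dict[str, bool]:
    """Record which switches answer right now."""
    return {host: reachable(host) for host in CALL_CENTER_SWITCHES}


if __name__ == "__main__":
    before = snapshot()
    input("Baseline taken. Make the change, then press Enter to re-check... ")
    after = snapshot()

    lost = [h for h in CALL_CENTER_SWITCHES if before[h] and not after[h]]
    if lost:
        print(f"ALERT: switches unreachable after change: {', '.join(lost)}")
        print("Roll back now, while someone is still near the gear.")
    else:
        print("All switches still reachable; close the change window.")
```

It won’t stop a broadcast storm, but it does mean you find out the facility is dark while you still have a change window open and someone near the gear.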
True IT confession No. 3: Put your hands up and step away from the terminal
One of the unavoidable facts of tech life is that when managers are given administrative rights to complex systems, bad things tend to happen.
Back in the late ’80s, Johanna Rothman was director of development for a small, distributed process systems maker in the Boston area. Company management insisted on mandatory overtime for everyone, Rothman included. After three months of this, Rothman and her team were cranky and exhausted — a recipe for disaster.
“One night at 9 p.m., I realize we have a bunch of files to be deleted,” she says. “I’m on a Unix system, and the system won’t let me delete them — I’m not root. Well, I’m the Director. I have the root password. I log in as root. I start rm -r — the recursive delete — from the directory I know is the right directory. I know this.”
Except she wasn’t in the right directory. The recursive delete tore through far more than the files she meant to remove, and soon she was on the phone with the system administrator.
“He says, ‘Move away from the keyboard. I’m coming in to start the restore.’ I say, ‘I can help. Where are the tapes?’ He says, ‘Go away. Just leave. I don’t need more of your help.'”
The restore takes two days. Rothman says she slept in late on both days and told everyone else on her team to do the same. She also left voicemail apologies to all the developers.
“I think the only reason I didn’t get fired is because management was too busy with the crisis to realize what a mess I’d made,” says Rothman, who now runs her own IT consulting group and keeps a safe distance from Unix root directories.
Lessons learned? 1. There is no reason for anyone higher than the level of manager to have the root password, says Rothman. 2. Too much overtime makes people tired and stupid. The more tired they are, the stupider they get.
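Rothman’s story is ultimately about who holds the root password, but a little paranoia in the tooling helps too. Below is a hedged Python sketch of a guarded recursive delete that refuses to run outside an expected base directory and defaults to a dry run; the “safe zone” path is purely hypothetical.

```python
"""Guarded recursive delete: refuse to touch anything outside a known safe zone,
and default to a dry run that only lists what would be removed.

ALLOWED_BASE is a hypothetical example path, not anything from Rothman's shop.
"""
import shutil
import sys
from pathlib import Path

ALLOWED_BASE = Path("/data/builds/scratch").resolve()  # hypothetical safe zone


def guarded_rmtree(target: str, dry_run: bool = True) -> None:
    """Recursively delete `target` only if it resolves inside ALLOWED_BASE."""
    path = Path(target).resolve()

    # The guard that would have saved the evening: an rm -r launched from the
    # wrong directory fails loudly instead of eating the tree.
    if ALLOWED_BASE not in path.parents and path != ALLOWED_BASE:
        sys.exit(f"Refusing to delete {path}: it is outside {ALLOWED_BASE}")

    if dry_run:
        for item in sorted(path.rglob("*")):
            print(f"would delete: {item}")
        print("Dry run only; rerun with dry_run=False to actually delete.")
        return

    shutil.rmtree(path)
    print(f"Deleted {path}")


if __name__ == "__main__":
    # Default to checking the current directory, which is exactly the case
    # that goes wrong when you are one level higher than you think you are.
    guarded_rmtree(sys.argv[1] if len(sys.argv) > 1 else ".")
```

It doesn’t fix the root-password policy, but it turns a late-night rm -r in the wrong directory into an error message instead of a two-day restore.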
Read four more accounts of IT operations gone bad in Seven tales of IT foul up – Part 2