The problem was baffling if not downright eerie: Employee PCs were spontaneously rebooting for seemingly no reason at all.
After fruitlessly trying to resolve the mystery, the organization called in Tony Fortunato, a Georgetown, Ont.-based consultant whose company is benignly called The Technology Firm. Actually, he’s a network detective who specializes in tracking down what others can’t find.
A ghost in the network? A competitor’s revenge? Nothing so mysterious: The problem was a firewall with a TCP driver bug.
In this case, Fortunato’s ally was that most humble of tools, a protocol analyzer.
Increasingly automated software and hardware tools with network self-healing capabilities are being touted by their makers as good enough to prevent most outages. But a number of experts we interviewed suggest common sense is just as important.
“I have yet to find a tool that’s comprehensive enough to be able to say, ‘Here’s the problem and here’s the solution,’” says Fortunato. But a number of them, he adds, are good enough to say where one should start looking.
Surprisingly, with a veritable cornucopia of performance and network management tools on the market, many enterprise-sized clients admitted they were running networks on a shoestring with the barest of diagnostic tools, says Debra Curtis, vice-president of research in Gartner Inc.’s IT operations management group. They call her for advice after realizing the danger they’re in.
Curtis breaks down network diagnostic tools into three broad categories:
• Enterprise network management suites with root-cause analysis capabilities, such as EMC Corp.’s Ionix for IT Operations Intelligence (formerly called Smarts), CA Technologies Inc.’s Spectrum Service Assurance, Hewlett-Packard’s Network Node Manager and IBM Corp.’s Tivoli Network Manager among others.
• Network performance appliances with traffic analysis and x-flow monitoring. Arguably the largest range of diagnostic tools, they include familiar products from CA’s NetQoS, NetScout Systems Inc., Fluke Networks’ OptiView series and SolarWinds. She also mentioned Opnet Technology Inc.’s troubleshooting application, ACE Analyst, and SevOne Inc.’s PAS (Performance Application Solution).
• Protocol analyzers such as NetScout’s Sniffer, Network Instruments LLC’s Observer and the open source Wireshark.
Experts emphasize the need to store performance data because the key to discovering the cause of a problem is sometimes found in historical data. In addition, carefully watching performance data – even if the monitoring is automated – can provide clues that help prevent problems.
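As a rough illustration of why stored history matters, the sketch below compares recent measurements against a baseline built from older ones and flags readings that stray too far from it. It is a minimal example under stated assumptions, not how any of the products named here work: the function name, the three-standard-deviation threshold and the latency figures are all hypothetical.

```python
from statistics import mean, stdev


def find_anomalies(history, recent, threshold=3.0):
    """Return (index, value) pairs from `recent` that sit more than
    `threshold` standard deviations away from the mean of `history`."""
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any different value is a deviation.
        return [(i, v) for i, v in enumerate(recent) if v != baseline]
    return [
        (i, v)
        for i, v in enumerate(recent)
        if abs(v - baseline) / spread > threshold
    ]


# Hypothetical round-trip times in milliseconds (illustrative data only).
history_ms = [20, 22, 19, 21, 20, 23, 21, 20]
recent_ms = [21, 22, 95, 20]

anomalies = find_anomalies(history_ms, recent_ms)
print(anomalies)
```

Without the stored history there is nothing to compare the 95 ms reading against, which is the experts' point.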
Interestingly, Ron Groux, product manager for Fluke Networks Canada, says administrators used to favour x-flow analysis tools. Increasingly, however, they are returning to packet analysis tools, particularly for time-sensitive applications such as VoIP and video.
On the other hand, Fortunato complains that in his experience few companies have policies to take advantage of the tools they have – for example, delegating someone to check the port or alarm monitors regularly for anomalies that could give advance notice of trouble. More than half the time he has been brought in to solve a network issue, the problem had persisted for quite a while but gone unchecked.
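A regular check of port monitors can be as simple as comparing two readings of each port's error counter and noting which ones have grown. The sketch below assumes the counters have already been collected into dictionaries; the port names and numbers are hypothetical, and how the counters are gathered is left out.

```python
def error_deltas(previous, current):
    """Compare two polls of per-port error counters and return the
    ports whose error count increased, with the size of the increase."""
    growing = {}
    for port, count in current.items():
        # A port with no earlier reading has no baseline, so treat it as unchanged.
        before = previous.get(port, count)
        if count > before:
            growing[port] = count - before
    return growing


# Hypothetical counters from two consecutive checks (illustrative data only).
poll_monday = {"Gi0/1": 0, "Gi0/2": 14, "Gi0/3": 3}
poll_tuesday = {"Gi0/1": 0, "Gi0/2": 260, "Gi0/3": 3}

flagged = error_deltas(poll_monday, poll_tuesday)
print(flagged)
```

A port whose counter keeps climbing between checks is the sort of anomaly that gives advance notice before users start losing connectivity.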
“Every tool is helpful,” Fortunato insists. However, too many organizations don’t invest enough time training network staff to take advantage of the tools they have. “They just feel the tool is going to magically identify and fix their issue without putting much time into it.”
“One of the biggest problems is not that anyone designs a network to fail,” says Bob Laliberte, senior analyst at the Enterprise Strategy Group, “but once you have everything in, if you don’t have adequate visibility it could leave you exposed such that a single failure could cause an outage.” Again, look for tools that gather enough data to perform some form of root-cause analysis and narrow down what staff have to look for.
This can, however, create a side problem. With an increasing number of performance management tools pulling in data, wireless LAN information plus the x-flow data from routers and switches, administrators could face an overflow of statistics.
“IT organizations are not only drowning in data, they’re drowning in tools,” says Steven Shalita, vice-president of marketing at NetScout. One of the disadvantages of choosing best-of-breed IT solutions is the wealth of management software that comes with each appliance or application, he says, which is one reason why some organizations try to standardize on vendors. One advantage of standardizing is fewer diagnostic tools.
Andrew McAusland, CEO of Montreal-based KnowledgeOne, a distributor of time-sensitive online training courses to 80 countries, knows the advantage of standardizing on one supplier. His company’s data centre is filled with Cisco Systems Inc. switches connected to providers in other countries, many of which also have Cisco gear. For troubleshooting, the company uses multiple tiers of Cisco software, which give it visibility into other providers’ networks. “It takes 60 to 90 seconds to trace where the problem is,” he says. “We can’t tell them what box it is, but can say ‘It is within this IP set.’”
John Bell, Cisco’s vice-president of systems engineering, notes that network engineers used to focus on solving a problem, often at the cost of dropping connectivity to the user. Today, best practice calls for restoring service first, then looking for the cause.
More often than not, though, problems can be solved without technology.
“The first three rules are ‘Check the cable, check the cable, check the cable,’” says Josh Stephens, vice-president of technology at SolarWinds. It’s a rule even he forgot when retained by a company to solve an apparently unsolvable outage. “I went in and looked at configurations and provisioning and all sorts of high level things.” Then, walking around the office, he saw a three-foot patch cord that didn’t look right. It wasn’t. It had a bad connector.
SIDEBAR: Tony’s Toolbox
There was the steel company that had the oddest network crashes. After touring the facility, Fortunato realized the outages coincided with an overhead crane going in and out of the factory. It turned out the crane’s operator needed a wireless tablet to do his job, which caused a spanning tree problem. The company’s network staff didn’t put two and two together because they were at their desks looking at data.
Then there was the auto manufacturing plant whose network would randomly crash. Spying an unused Ethernet cable sitting on an arc welder, Fortunato learned a supervisor had put it on the machine to get it off the floor. Unfortunately, it was an unshielded cable, so every time the welder was used it caused errors on the line and users lost connectivity.