Add data centre maintenance and monitoring to the list of jobs that artificial intelligence (AI) can perform.
AI experts constantly say that to apply AI successfully, you first need a lot of data to train it. And nowhere has more data than data centres. First, hyperscale data centre operators applied AI to automate the detection of problems with the servers, switches, and cooling hardware on their racks. Now a Toronto-based managed service provider is offering its clients an AI assistant that can do the same thing.
Park Place Technologies has been a full-service data centre maintenance and monitoring business for 27 years. During that time, if its clients needed help solving a problem they’d call a 1-800 number and describe the failure taking place. According to Sean Sears, vice-president of Canada for Park Place, that method was effective enough to win a 97 per cent customer satisfaction rate. But now clients are being switched over to ParkView, an AI-backed data centre triage service that can predict problems well before a customer would even think about picking up the phone.
“We have clients that have gotten to the point where they’re saying, ‘I’m just giving you a pass like a full-time employee,’” Sears says. “‘When there’s a failure then you’ll know what to do and what to repair and I’ll just get the ticket in the morning.’”
ParkView is built on BMC Software’s TrueSight AIOps platform. That software is normally intended for only the largest enterprise data centres, but Park Place is able to host it and divide the cost among its more than 1,000 clients. That allows the MSP to offer an enterprise-grade data centre infrastructure management (DCIM) service to smaller firms. Clients get ParkView’s benefits, from fault detection to monitoring and alert-driven responses, by installing a lightweight user agent that sends an encrypted stream of data back to Park Place.
Park Place receives only data pertaining to the machines’ performance, emphasizes Paul Mercina, director of product management at Park Place. No data that would be confidential to a client or its customers is exchanged for this purpose. Still, by ingesting all that machine data, ParkView is getting smarter every day.
“Think about all the customers and the data we are aggregating and what we can learn there,” he explains. “We get the benefits of multiple environments and that machine learning gets applied.”
Data centre uptime has typically been represented as well above 99 per cent by those in the business. A data centre with “five nines,” or 99.999 per cent reliability, would be down for only about five minutes in an entire year. But this doesn’t tell the whole story. The reliability measurement is determined by the amount of time the data centre is receiving power, so downtime in this sense refers only to the time when the equipment is literally without electricity. A far more common cause of disrupted service is human error. By using AI to predict problems before they occur, data centre operators can schedule fixes and preventative maintenance during planned downtime, avoiding unexpected interruptions.
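The arithmetic behind that five-minute figure is easy to check. The short Python sketch below converts an availability percentage into the downtime it implies over a year. The function name and the choice of a 365-day year are assumptions made for illustration, not anything Park Place or other operators publish.

```python
# Minutes in a 365-day year (leap years ignored for simplicity).
MINUTES_PER_YEAR = 365 * 24 * 60


def annual_downtime_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year implied by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Compare a few commonly quoted availability levels.
for pct in (99.0, 99.9, 99.999):
    minutes = annual_downtime_minutes(pct)
    print(f"{pct} per cent availability: about {minutes:.1f} minutes of downtime per year")
```

At 99.999 per cent the result is a little over five minutes, while a flat 99 per cent allows more than three and a half days of downtime in a year.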
Google has been doing this since 2014, after it acquired AI startup DeepMind. It released a white paper in May 2014 about using neural networks to optimize its data centre operations and reduce energy consumption. What began as a “20 per cent project,” in which Google employees spend time on work outside of their job description, saw Jim “Boy Genius” Gao design an AI system for measuring data centre performance that has been deployed across the company.
In Montreal, wholesale data centre operator Root Data Center announced in December 2017 that it was using AI to reduce downtime. According to Root, this was a first for the wholesale data centre market, which sells rack space to other technology vendors offering cloud services.
Working with Litbit, Root collects sight and sound information that is used to determine if power generators are functioning properly. The AI-powered software creates a baseline of what normal operations sound like and then can detect sounds that indicate something is amiss.
“It will listen to the generator to determine if something is off,” president and CEO of Root, AJ Byers, explained to IT World Canada in December. “Potentially something that you could hear with your ear.”
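The general idea of learning a baseline and flagging departures from it can be illustrated with a very small Python sketch. This is not Litbit’s or Root’s method, which the companies have not detailed here. It is a simple statistical stand-in that assumes each sound sample has already been reduced to a single loudness reading in decibels, and every name, threshold, and number in it is hypothetical.

```python
from statistics import mean, stdev


def build_baseline(normal_levels):
    """Summarize readings recorded during normal operation as (mean, standard deviation)."""
    return mean(normal_levels), stdev(normal_levels)


def is_anomalous(level, baseline, threshold=3.0):
    """Flag a reading that sits more than `threshold` standard deviations from the baseline mean."""
    mu, sigma = baseline
    if sigma == 0:
        # A perfectly flat baseline: any different reading counts as a departure.
        return level != mu
    return abs(level - mu) / sigma > threshold


# Hypothetical generator sound levels (dB) captured while everything was running normally.
normal_readings = [71.8, 72.1, 72.4, 71.9, 72.0, 72.3, 72.2, 71.7]
baseline = build_baseline(normal_readings)

# New hypothetical readings to check against the baseline.
for reading in (72.1, 72.5, 78.9):
    status = "investigate" if is_anomalous(reading, baseline) else "normal"
    print(f"{reading} dB: {status}")
```

A production system would work with far richer audio and video features than a single loudness number, but the pattern is the same: record what normal looks like, then flag what falls outside it.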
At Park Place, all new customers are being onboarded with ParkView, and the company is encouraging its existing customers to move over to the AI-assisted service. ParkView is compatible with servers, storage, and networking equipment from Tier 1 OEM vendors, according to the company. Installing the user agent in the data centre can help avoid unanticipated downtime scenarios.
Sears recalls one customer that saw a failure on Canada Day. Park Place technicians could have gone in to fix the problem, but the IT staff of the client wasn’t reachable. So the downtime lasted for 20 hours.
With ParkView, whatever caused the failure would likely have been repaired well before the national holiday.
“When you have a failure in the data centre without this technology, then how do you know who to call?” Mercina says. “There are multiple service providers you might reach out to, but this technology sorts through this and identifies the root cause of the failure.”
As an example, he says, the facility’s HVAC unit might be malfunctioning and causing multiple server errors. ParkView could diagnose that ‘outside the box’ problem as well.
It just needs the right data.