The Transaction Processing Performance Council (TPC) should add an availability metric to its set of database performance benchmarks, Microsoft researchers plan to argue at the TPC conference being held this week in Seattle.
Such a metric “could provide guidance in [database] system design, and also in the picking of the technology,” said Microsoft researcher Yantao Li, who, along with colleague Charles Levine, will make the case for the test at the TPC’s Third International Conference on Performance Evaluation and Benchmarking (TPCTC 2011). It would also reflect the growing importance of having database systems remain in operation 24 hours a day, seven days a week.
The TPC’s TPC-E benchmark measures the performance and scalability of online transaction processing (OLTP) systems by simulating an average load at a financial brokerage firm. But it doesn’t take into account how long these database systems can run, or how quickly they can be running again should they stop working for some reason. So researchers at the Microsoft Research Lab propose adding another metric: the time taken for a system to recover after it goes down.
Databases can stop working for all sorts of reasons. Planned downtime could involve the application of patches or hardware maintenance. Unplanned downtime can happen due to software bugs, equipment malfunction, power outages and human errors.
This test would wrap into a single metric both how long a database can run, on average, without going down and how fast it gets back up to speed once a disabling problem is fixed. It would combine a system’s mean time between failures (MTBF) with its mean time to recovery (MTTR), also called mean time to repair, which is the time it takes the system to return to full operation.
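A common textbook way to relate the two figures is steady-state availability, MTBF / (MTBF + MTTR). Whether the proposed TPC metric takes exactly this form is an assumption; the short Python sketch below only illustrates how MTBF and MTTR trade off against each other. The function names and the sample figures are hypothetical.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time a system is up.

    Uses the conventional formula MTBF / (MTBF + MTTR). This is an
    assumed form, not necessarily the metric proposed to the TPC.
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_minutes(avail: float) -> float:
    """Expected minutes of downtime per 365-day year at a given availability."""
    return (1.0 - avail) * 365 * 24 * 60


if __name__ == "__main__":
    # Hypothetical system: fails once every 1,000 hours, takes 1 hour to recover.
    a = availability(mtbf_hours=1000.0, mttr_hours=1.0)
    print(f"availability: {a:.6f}")
    print(f"downtime per year: {annual_downtime_minutes(a):.1f} minutes")
```

Under this formula, halving recovery time improves availability about as much as doubling the time between failures, which is why a benchmark would need to measure both.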
To measure MTTR, the researchers propose extending the TPC’s System Under Test (SUT) to look at all the components of the database system, not only the primary servers but also standby servers and connectivity between system components. System cost could also be calculated alongside the availability metric, allowing potential buyers to balance how much availability they’d want for a new system against how much they’d be willing to pay.
The work came about as part of Microsoft’s internal engineering tests. “We thought that there was some good original work here that would be of interest to a wider audience,” Levine said.
Microsoft itself has used the benchmark for its own SQL Server-based systems. One test had a configuration consisting of a primary server and a standby server. Both were Dell PE 2950s with 16 gigabytes of memory and two Intel 2.66GHz quad-core processors running Microsoft Windows Server 2008 and Microsoft SQL Server 2008.
In this configuration, a typical failover, where work gets handed over to the standby server, would take 41 seconds, the researchers report.
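To put the 41-second figure in context, the Python sketch below estimates how many such failovers would fit inside a yearly downtime budget. It assumes each failover counts as 41 seconds of full downtime and that no other outages occur; the availability targets and function names are illustrative, not taken from the researchers' work.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 365-day year, ignoring leap days


def downtime_budget_seconds(target_availability: float) -> float:
    """Seconds of downtime per year permitted by an availability target."""
    if not 0.0 < target_availability <= 1.0:
        raise ValueError("target availability must be in (0, 1]")
    return (1.0 - target_availability) * SECONDS_PER_YEAR


def max_failovers_per_year(target_availability: float,
                           failover_seconds: float) -> int:
    """Whole number of failovers that fit within the yearly downtime budget."""
    if failover_seconds <= 0:
        raise ValueError("failover time must be positive")
    return int(downtime_budget_seconds(target_availability) // failover_seconds)


if __name__ == "__main__":
    FAILOVER_SECONDS = 41.0  # typical failover time reported for the test setup
    for target in (0.999, 0.9999, 0.99999):  # illustrative targets
        n = max_failovers_per_year(target, FAILOVER_SECONDS)
        print(f"{target:.3%} availability allows about {n} failovers per year")
```

Under these assumptions, a 99.99 percent target leaves room for 76 failovers a year, while a 99.999 percent target leaves room for only 7.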
“We are looking for ideas like this,” said Raghunath Nambiar, a Cisco performance strategist and co-chairman of the conference. He noted that the purpose of the conference is to field new ideas on how to expand TPC’s benchmarks.
After the presentation, the TPC will evaluate whether the community would be interested in having such a metric incorporated into TPC-E. “It is something that could be done relatively quickly, because it is an incremental addition to an existing model,” Levine said.