While the labour shortage in IT continues to loom, including a dwindling base of DBAs (database administrators), the amount of data itself is growing faster than ever, with no end in sight. IBM Corp. and some of its competitors are looking to ease the IT labour burden by making database systems more self-responsive and better able to learn to function more efficiently. Pat Selinger, an IBM Fellow and vice-president of database integration, spoke recently with InfoWorld (US) Senior Editor Tom Sullivan about autonomic computing, gaining control of data before it gains control of us, and how query optimization can make a company more compelling.
InfoWorld: What exactly does IBM mean by autonomic computing?
Selinger: I heard a phrase used by Alan Ganek, who is now the head of our autonomic computing division. He said that autonomic computing compares to the human autonomic nervous system, which keeps my heart beating even when I’m not thinking about it. That’s really how you’d like your systems to perform. You’d like them to do all of these things without your having to think about them, without having to tell them to do this or that.
InfoWorld: What role does the database play in IBM’s autonomic computing strategy?
Selinger: I think it’s a wonderful example, because databases are part operating system, part programming language and compiler; they’re part of a lot of different aspects of a system. The only thing they don’t involve is hardware. And we have examples of autonomic computing in hardware as well, where an AS/400 phones home when it’s feeling ill, for instance. When it detects that it is doing too many retries on a disk, it concludes the disk bearings are going, or whatever happens to disks, and it actually makes a phone call back to the IBM location. Depending on what the customer opted to have happen, maybe a new disk comes, or a service technician comes, or whatever.
InfoWorld: What are the autonomic attributes of DB2?
Selinger: When it comes to LEO [Learning Optimizer, a query learning technology], you want it to adjust to the data. First, we keep statistics about data items. We keep statistics about how many ZIP codes there are, how many states. As part of LEO, we started noticing correlations, relationships between those two pieces of information. I think there are about 40,000 ZIP codes, something like that, and 50 states. The labs here are in ZIP code 95141, and we know that 95141 is only in California, so we know that if you apply the 95141 predicate to a query, also applying the California predicate is going to do nothing. It’s not going to reduce the answer at all, because all the 95141s are in California. So, when you notice that kind of detail, you can remember it and take advantage of it, so you process the query better the next time. That’s an example of the kind of learning we expect LEO to be able to do. And, of course, there are much more complicated relationships that you couldn’t just write down but that we learn by data mining. Data mining might discover for you that when someone asks one question, these are the right access paths to use to get at the data, and that when another question of about the same form is asked, there might be an entirely different way to access the data that really is beneficial.
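To make the idea concrete, here is a minimal sketch of LEO-style learning under a simple selectivity model; the function and variable names are hypothetical illustrations, not DB2 internals. A traditional optimizer multiplies the selectivities of the ZIP-code and state predicates as if they were independent; a learning optimizer compares that estimate against the cardinality actually observed at run time and remembers a correction factor for the predicate combination.

```python
# Illustrative sketch of LEO-style selectivity learning. All names here are
# hypothetical; this is not DB2's actual implementation.

# Per-column statistics the optimizer keeps: ~40,000 ZIP codes, 50 states.
selectivity = {"zip = 95141": 1 / 40000, "state = 'CA'": 1 / 50}

# Correction factors learned from earlier executions, keyed by predicate set.
corrections = {}

def raw_estimate(table_rows, predicates):
    """Textbook estimate: assume predicates are independent and multiply."""
    est = table_rows
    for p in predicates:
        est *= selectivity[p]
    return est

def estimate_rows(table_rows, predicates):
    """Apply any correction learned for this predicate combination."""
    key = frozenset(predicates)
    return raw_estimate(table_rows, predicates) * corrections.get(key, 1.0)

def learn(table_rows, predicates, actual_rows):
    """After execution, remember how far off the independence assumption was."""
    raw = raw_estimate(table_rows, predicates)
    if raw > 0:
        corrections[frozenset(predicates)] = actual_rows / raw

rows = 1_000_000
preds = ["zip = 95141", "state = 'CA'"]
print(estimate_rows(rows, preds))   # 0.5 -- independence badly underestimates
learn(rows, preds, actual_rows=25)  # observed: every 95141 row is in California
print(estimate_rows(rows, preds))   # 25.0 -- the correlation has been learned
```

Because the California predicate filters nothing once the ZIP-code predicate has been applied, the corrected estimate steers the optimizer toward access paths sized for the real result, which is the kind of adjustment Selinger describes.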
InfoWorld: Database companies have talked for the last couple of years about the idea of self-healing databases and, in IBM’s case, self-healing servers. Is this all moving toward that longer-term goal?
Selinger: We’ve already got a shortage in IT skills today, and at the rate we’re producing data, the only way to keep those people from being totally overloaded is to make our systems more autonomic, more responsive to the behaviour and the events occurring in the system, so they can self-adjust and self-tune. That frees up a DBA to do the things that still need human brains and human talent. The more the computer can take over what doesn’t need manual attention, the nicer job a DBA has, for one thing, and the better off an organization is, because it isn’t making that shortage worse.
InfoWorld: What is IBM doing to reduce the ratio of DBAs needed to manage the explosive growth of data?
Selinger: We’ve got a cost-based query optimizer that reflects about 25 years of very careful, detailed modelling of the data, the ways we access the data, the repertoire of actions we can take on the data, and so forth. All of that is aimed at getting customers’ applications online faster, so that they don’t have to spend time worrying about access paths. The customer can just write the query and it will run well. Take a typical mixed application workload with, say, a thousand queries in it; perhaps about 10 percent, or 100 queries, would need some DBA attention with DB2. In contrast, we’ve been told by consultants that Oracle would take manual effort on about 600 of those. When you look at what that manual effort requires – you have to fuss with the query, you have to decide whether you need more access paths to the data, and so forth – it’s anywhere from one or two hours to a whole afternoon per query. Then you take those 500 extra queries between DB2 and Oracle at, say, a half day each, and that is 250 DBA-days, which is a whole other DBA. Then you look at another data point, which says the data in the world is doubling every few years – pick your timeframe. If we don’t do something to manage the DBA-to-data ratio, we’re going to need twice as many DBAs next year as this year, and twice as many again the year after, until we use up the entire population of the earth trying to manage the data we’re producing. It’s not putting DBAs out of work; it’s helping them survive.
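Her arithmetic, worked through with the figures quoted above (all of them rough estimates rather than benchmarks), looks like this:

```python
# Back-of-the-envelope version of the DBA arithmetic above, using the rough
# figures quoted in the answer.
workload_queries = 1000
db2_needs_tuning = 100          # ~10 percent need DBA attention with DB2
oracle_needs_tuning = 600       # consultants' estimate for the same workload
effort_days_per_query = 0.5     # one to two hours, up to a whole afternoon

extra_queries = oracle_needs_tuning - db2_needs_tuning   # 500
extra_dba_days = extra_queries * effort_days_per_query   # 250 DBA-days
print(extra_dba_days)  # ~250 working days: roughly one extra full-time DBA-year
```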
InfoWorld: When you have more people understanding less of the data at their fingertips, fewer people educated about databases, and more and more data, what are the implications?
Selinger: Well, another impact is that as we look toward more terabyte and petabyte databases, the in-depth understanding that DBAs have today of the data, and of the kinds of questions people ask about that data, is going to change. In this changing world, out-of-the-blue requests from people who are not IT experts are becoming more and more frequent. I have a leading-edge customer today that is letting its mailing-campaign marketers access the database and ask any question they want. It is making that data available, and these people don’t know SQL, they don’t know how the data is organized. They can ask some questions that are pretty intense, and sometimes they make mistakes. And they don’t go to SQL school for a year to learn how to write queries. So we have to do an ever-better job of taking the average – or even less-than-average – quality of query and, through our query-rewriting technologies, our cost-based query optimization, and now LEO, improving the access to data. Ultimately, that becomes a competitive advantage for the customer that is using it. Looked at in that context, people are going to understand less and less of the data they have at their fingertips, and fewer database-educated people are going to be touching that growing, petabyte-scale amount of data, so there are a couple of different exploding dimensions. This is the kind of technology we are bringing to bear on that problem. That is my perspective on this, and we have a number of initiatives going on in autonomic computing, a lot of them related to databases and some of them not. This is one of the pieces.
InfoWorld: How can query optimization be a competitive advantage for enterprises?
Selinger: What you see with the promise of LEO is that we can make adjustments that further reduce that manual intervention. The pocket we’re really going after is reducing cost of ownership and time-to-market, because if I have to spend that extra DBA-year building my application before I can release it and let people really use it in production, I can’t respond as fast as the competition. So I can make a company more compelling by reducing the manual labour involved in bringing an application into production.
Also, there are a variety of monitoring capabilities where you want to be able to watch the usage. For example, we have one available today in DB2 where you can say “parallelism degree any.” That says, “Mr. Optimizer, you know what, I am not going to go run experiments and waste my time as an administrator figuring out whether two degrees of parallelism or eight is best; I’m going to say ‘any’ and let you pick.” What we do in the database engine is look at the system environment, the query, and the data, and decide what the best degree of parallelism is. So we can save someone a whole bunch of time by using our detailed knowledge of the data and of the system’s behaviour at the moment, and making the choice that is good for a particular query.
InfoWorld: How flexible is that? For instance, does it change from query to query?
Selinger: Oh, absolutely. You can declare the parallelism degree at any level. You can say “for this query,” or “for this application,” or “for this database.” In most situations people probably wouldn’t elect “for this database”; they’d choose “for this application” or “for this query.” A classic example is database loading. You can take “parallelism degree any” for database loading and we will pick a good number of agents to work on loading the data for you. That will depend on your disk configuration, the number of processors on your machine, and so forth.
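As a concrete illustration, the snippet below sketches the kind of statements an application might issue to hand that choice to DB2. `SET CURRENT DEGREE` is the DB2 special register behind “parallelism degree any,” but treat the exact syntax and scoping here as assumptions; they vary by DB2 platform and version, and the table name is invented for the example.

```python
# Hypothetical illustration of scoping the parallelism choice in DB2.
# The SQL reflects DB2's CURRENT DEGREE special register; verify the exact
# syntax against the documentation for your platform and version.

# Per application/session: every statement on this connection lets the
# optimizer pick the degree of parallelism.
session_scope = "SET CURRENT DEGREE = 'ANY'"

# Per query: set the register around a single statement, then restore it.
query_scope = [
    "SET CURRENT DEGREE = 'ANY'",
    "SELECT state, COUNT(*) FROM customers GROUP BY state",
    "SET CURRENT DEGREE = '1'",  # back to serial execution
]
```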
InfoWorld: What is the timeframe for LEO?
Selinger: LEO is a research project, so what it is producing is not something that will be delivered tomorrow. It’s more in the range of three years from now.