It’s long been said that close only counts in horseshoes, but close is also good enough for many data queries, especially when you’re trying to draw an answer from petabytes of data.
That’s the logic behind the Infobright Approximate Query (IAQ) solution for large-scale data environments, which the Toronto-based company introduced today at Mobile World Congress in Barcelona, Spain.
Infobright president and CEO Don DeLoach said the rapid, exponential growth of data that enterprises are experiencing is just getting started, and that the large data lake architectures being built actually inhibit the ability to get needed information fast enough. “The idea that you might have central data lake architecture is something that does not account for immediate insight into that data or interrogation of that data.”
IAQ uses a statistical modelling approach to render results for complex datasets, generating knowledge by analyzing vertical segments of the data. When data is loaded, intelligent algorithms evaluate it and generate that knowledge, said DeLoach. “The thesis we are drawing on is not all queries require an exact answer.” IAQ can be overlaid on top of traditional relational environments as well as environments such as Hadoop, Spark, Teradata, Cassandra, MongoDB, and others.
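Infobright hasn’t published IAQ’s internals, but the general trade DeLoach describes, answering a query from per-segment statistics instead of raw rows, can be sketched. Everything below is an illustrative assumption: the SegmentStats layout, the uniform-distribution guess inside each segment, and the function names are hypothetical, not the product’s actual design.

```python
# Hypothetical sketch: estimate how many rows satisfy a range predicate
# using only per-segment summary statistics, never touching raw rows.
# Assumes values are roughly uniform within each segment's [min, max];
# none of this is Infobright's actual IAQ implementation.
from dataclasses import dataclass

@dataclass
class SegmentStats:
    count: int      # rows in this vertical segment of the column
    min_val: float  # smallest value observed in the segment
    max_val: float  # largest value observed in the segment

def approx_count_between(segments: list[SegmentStats], lo: float, hi: float) -> int:
    """Estimate COUNT(*) WHERE lo <= value <= hi from metadata alone."""
    estimate = 0.0
    for s in segments:
        if s.max_val < lo or s.min_val > hi:
            continue                       # segment cannot contain a match
        span = (s.max_val - s.min_val) or 1.0
        overlap = min(hi, s.max_val) - max(lo, s.min_val)
        estimate += s.count * max(overlap, 0.0) / span
    return round(estimate)

segments = [SegmentStats(65536, 0, 50), SegmentStats(65536, 40, 200)]
print(approx_count_between(segments, 45, 100))  # instant, approximate
```

The answer is inexact, but it comes back without decompressing or scanning any data, which is the point of not requiring an exact answer.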
Founded nine years ago in Toronto by four Polish mathematicians, Infobright has built its business on a column-based data store designed for machine-generated data: as data is ingested and interrogated, the store builds a knowledge grid, a statistical metadata model of the dataset, explained DeLoach. Combined with tight compression, this allows Infobright to drastically reduce I/O requirements. It’s also designed not to require the specialized skillset of a database administrator.
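Infobright has publicly described the knowledge grid as per-pack metadata, such as min/max values and row counts over blocks of rows, that lets the engine skip compressed packs that cannot affect a result. A minimal sketch of that pack-elimination idea follows; the names and three-way classification are illustrative, based on those published descriptions rather than the actual code.

```python
# Rough sketch of knowledge-grid-style pack elimination for the
# predicate "value BETWEEN lo AND hi". Pack ranges stand in for the
# min/max metadata recorded at load time; names are illustrative.

def classify_packs(pack_ranges: dict[int, tuple[float, float]],
                   lo: float, hi: float):
    """Decide which compressed packs need real I/O for a range query."""
    irrelevant, relevant, suspect = [], [], []
    for pack_id, (pmin, pmax) in pack_ranges.items():
        if pmax < lo or pmin > hi:
            irrelevant.append(pack_id)  # no row can match: skip entirely
        elif lo <= pmin and pmax <= hi:
            relevant.append(pack_id)    # all rows match: answer from counts
        else:
            suspect.append(pack_id)     # must decompress and scan this pack
    return irrelevant, relevant, suspect

packs = {0: (10, 40), 1: (60, 100), 2: (90, 200)}
print(classify_packs(packs, 50, 120))   # ([0], [1], [2])
```

Only the “suspect” packs cost real I/O, which, combined with compression, is where the drastic reduction in I/O comes from.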
Infobright is installed at most mobile service providers around the world, he said, and they account for a large part of the company’s business. Other segments include digital media, online advertising, and financial services. DeLoach said the world is moving to a new paradigm for data stores as big data systems such as Hadoop become mainstream. With the massive amounts of data being collected, he said, the traditional data lake is not going to be feasible or fast enough.
Not needing an exact answer every time speeds up querying and increases an organization’s efficiency. DeLoach likens it to a detective asking a series of questions: only the answer to the ultimate question needs to be correct; the answers that get there just need to be close enough.
An actual business example might be an enterprise looking to understand how and why a network intrusion occurred. “Troubleshooting is essentially an investigation.” The investigation could require more than a dozen queries and might take days; that could be cut considerably using the IAQ approach, said DeLoach. One of the key impediments to using a cloud-based data lake architecture is that it prohibits fast remediation as the number of queries mounts against an increasingly large volume of data. “That’s a big limitation.”
Infobright is essentially creating an abstraction layer of metadata so that it doesn’t have to deal with all of the data at once to get answers. The idea of using inexact data to get an answer is not new, noted Robin Bloor, chief analyst at The Bloor Group; sampling is done all the time in statistical analysis. “It’s not as radical as you might suppose.”
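Sampling, the familiar technique Bloor is contrasting with the metadata approach, is easy to show: estimate an answer from a small uniform sample and scale it up. This is a generic statistics illustration with synthetic data, not anything specific to Infobright.

```python
# Generic sampling estimate: approximate a count over a large dataset
# by scanning a small uniform sample. Synthetic data for illustration.
import random

random.seed(42)
population = [random.gauss(100, 15) for _ in range(1_000_000)]

sample = random.sample(population, 10_000)       # scan 1% of the rows
rate = sum(1 for v in sample if v > 130) / len(sample)
estimate = rate * len(population)

exact = sum(1 for v in population if v > 130)    # the full scan
print(f"estimated: {estimate:,.0f}  exact: {exact:,}")
```

The sample gives a usable answer from one percent of the work; the difference Bloor points to is that an approach like IAQ derives its estimates from a precomputed statistical model rather than rescanning a sample for each query.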
What has changed is the machine learning algorithms required when pushing a query against petabytes of data over Hadoop. “Normally you can get an answer back in minutes when it used to take hours.”
How fast a query needs to be completed really depends on an organization’s business processes, said Bloor. “There’s a technology war between banks to get a transaction done before everyone else. They’re looking at nanosecond differences,” he said. “That’s one end of the spectrum.”
At the other end is analyzing all of the data from particle collisions in the Large Hadron Collider, said Bloor. “If you don’t get an answer in a day, it doesn’t matter. It matters that you can process everything.” Most enterprises doing analytics fall right in the middle, with predictive analytics expected to operate at human speed, about one-tenth of a second. That’s not achievable if the heap of data is too large, he said, and Infobright is able to chew through large amounts of data fast.
“No one takes the same approach,” Bloor said. Its mathematical approach is what makes IAQ better than sampling, and it’s something the company could demonstrate to mathematicians and get their heads nodding. “That’s what makes the difference. It’s not a gimmick.”