HP has launched an open-source big data predictive analytics platform that it says will bring machine learning capability to Canadian companies.
The firm has launched HP Haven Predictive Analytics under its Haven big data banner. The new version of the software includes an enhanced version of R, the statistical modelling language commonly used in big data analytics. Distributed R enables developers to address large data sets distributed across multiple clusters, and to perform calculations on them as if they were a single entity.
The new version of R is syntactically identical to standard R, and works with the same tools, such as RStudio, the popular R development environment. It does add extensions for working with distributed data, though, said Sunil Venkayala, senior technical product manager for the Big Data Business Group within HP Software.
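To illustrate the programming model, here is a minimal sketch of a Distributed R session. It assumes the open-source `distributedR` package is installed and a cluster is configured; the primitives shown (`darray`, `foreach`, `splits`, `update`) follow the package's published API, though exact signatures may vary by version:

```r
# Sketch: doubling every element of a distributed array with Distributed R.
# Assumes the distributedR package and a running worker cluster.
library(distributedR)

distributedR_start()                    # start the worker processes

# A 9x9 dense array split into 3x3 partitions, every element initialized to 1
A <- darray(dim = c(9, 9), blocks = c(3, 3), data = 1)

# foreach runs the function on each partition in parallel;
# splits() hands a worker its local partition, update() writes it back
foreach(i, 1:npartitions(A), function(a = splits(A, i)) {
  a <- a * 2
  update(a)
})

# Gather the array and sum its elements (81 elements doubled: 162)
sum(getpartition(A))

distributedR_shutdown()
```

The point of the model is that the `foreach` body runs where the data lives, so the familiar R syntax scales across nodes without the programmer shipping data around by hand.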
“We initially started looking at what language they should start to implement, and found R as the language with a lot of mindshare among scientists,” he said. “One of the core strengths of R is its community. There are almost two million users and six thousand packages developed in R. But R wasn’t designed to scale, originally.”
Distributed R is the open-source part of the product, but HP has designed it to work natively with Vertica, the proprietary analytics database that it purchased in 2011 and has since made a part of the Haven architecture. Vertica is a massively parallel columnar database, designed for the kinds of SQL queries typically used in relational databases.
HP has built native connectors between Vertica and Distributed R that enable developers to run Distributed R queries from within the database. They also provide the statistical language with fast access to data, executives said.
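In practice, the connector layer lets an R session pull a Vertica table straight into a distributed array. The sketch below is hedged: the `HPdata` package and its `db2darray` loader follow HP's published connector documentation, but the exact signature may differ by release, and the DSN and table names here are hypothetical:

```r
# Sketch: loading a Vertica table into a Distributed R darray via the
# native connector. Package and function names (HPdata, db2darray) follow
# HP's connector documentation; the DSN and table are hypothetical examples.
library(distributedR)
library(HPdata)

distributedR_start()

# Pull the numeric columns of a Vertica table into a distributed array.
# "VerticaDSN" is an ODBC data source name configured for the database.
patients <- db2darray(tableName = "patient_metrics", dsn = "VerticaDSN")

dim(patients)          # inspect distributed dimensions without centralizing data

distributedR_shutdown()
```

Because the load is partitioned on the database side, the data arrives already split across the Distributed R workers rather than passing through a single client process.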
The increased scale of the Distributed R solution is what gives this latest software combination its predictive analytics capabilities. It can be used to identify trends in data scaling up to the petabyte level, and then draw inferences about future developments, said Venkayala. Standard R won't scale beyond a certain number of records, he said, arguing that it simply isn't enough for many predictive analytics and machine learning applications.
“In one of our healthcare customers, they have patient data with thousands of attributes and millions of transactions,” he said. “When they’re trying to build a model, running that model on the whole base of data is important because there might be false negatives. They want to use all of the data that they have. But with Standard R it’s not feasible.”
Although Distributed R is designed to integrate natively with Vertica, it isn't yet tuned to work with the Haven OnDemand solution, released in December, which provides cloud-based big data analytics processing. This may be coming in the future, though. The company is working on an intelligent service bus that will tie together more closely the different Haven platforms, including IDOL, a processing system for unstructured data (such as text and social media posts).