IBM finally sidelined a key technology in the big data movement last week, putting its efforts instead behind a newer competitor. The firm is adding Apache Spark to its portfolio of open source large-scale data processing software, overshadowing the long-standing MapReduce system.
The company, which calls Spark the most important open source project in a decade, has vowed to embed the technology into its analytics and commerce platforms, and also to offer Spark as a service on its own public cloud infrastructure. Big Blue will also donate its SystemML machine learning technology to the Spark open source movement. Why?
Spark is a general purpose framework for data processing, designed to run applications that process data across a cluster of many different computers at once. This solves a couple of common problems associated with processing large amounts of data.
Firstly, very large sets of data can take a long time to move across the network to a single computer tasked with processing them. And secondly, some large data applications, such as machine learning, require all of the data to be held in memory at once. That’s very difficult for a single computer to do when you are talking about terabytes of the stuff. That’s why Spark has been described, among other things, as a useful tool for machine learning applications, which typically require large amounts of empirical data.
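To make the in-memory point concrete, here is a minimal sketch using Spark’s Scala API (the file path, data format and local cluster settings are illustrative assumptions, not details from IBM’s announcement). The dataset is loaded once, cached in memory across the cluster’s machines, and then rescanned repeatedly, which is the access pattern iterative machine learning algorithms depend on.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object InMemorySketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("in-memory-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Load a (hypothetical) large file of comma-separated numbers; Spark
    // splits it into partitions processed in parallel across the cluster.
    val points = sc.textFile("hdfs:///data/points.csv")
      .map(_.split(",").map(_.toDouble))
      .cache() // keep the parsed records in cluster memory for reuse

    // An iterative job, typical of machine learning, can now rescan the
    // cached data on every pass without rereading it from disk.
    var total = 0.0
    for (_ <- 1 to 10) {
      total = points.map(_.sum).reduce(_ + _)
    }
    println(s"total = $total")
    sc.stop()
  }
}
```

Because every pass after the first reads from memory rather than disk, the cost per iteration drops sharply, the property that made Spark attractive for machine learning in the first place.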
Historically, the go-to technology for many people dealing with large data sets across clusters of computers was MapReduce, which is the technology that distributes processing jobs for the large data processing platform Hadoop.
Hadoop, which is also a product of the Apache Foundation, is supported by various vendors including IBM and HP. IBM’s distribution of Hadoop, based on the Apache open source distribution, is called IOP.
Dirk deRoos, worldwide technical sales leader for big data analytics platforms at IBM, argues that Spark is outpacing MapReduce as a tool for Hadoop. It has a more expressive API for programmers, he said, making it possible for them to do a wider variety of things with their data processing. This means Spark can be used for different kinds of jobs, he added.
“While MapReduce is very good at batch processing applications that fit within the strict Map and Reduce model, Spark is much more flexible,” he said.
“Spark can be used for batch applications, but also interactive applications (i.e. where a user is asking questions, like SQL queries, and expects results back within a few seconds or less),” he continued. It can also be used for near-real-time applications, such as processing data that streams across a network.
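The streaming case deRoos mentions is worth a short sketch of its own (again in Scala; the host, port and one-second batch interval are assumptions made for the example). Spark Streaming chops a live feed into small batches and applies ordinary Spark operations to each one, so the same programming model covers batch, interactive and near-real-time work.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    // Streaming needs at least two local threads: one to receive
    // the data and one to process it.
    val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1)) // one batch per second

    // Read lines as they arrive over a TCP socket (host and port are
    // placeholders for whatever is producing the stream).
    val lines = ssc.socketTextStream("localhost", 9999)

    // Count words within each one-second batch using ordinary Spark operations.
    val counts = lines.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.print() // emit each batch's word counts

    ssc.start()
    ssc.awaitTermination()
  }
}
```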
Shortcomings in MapReduce might well have influenced Google to effectively abandon the technology a year ago. Last June, it announced that it would replace MapReduce with a new cloud analytics system it had built itself, called Cloud Dataflow.
IBM may be focusing its efforts on Spark, but it won’t abandon MapReduce. It will continue to ship that technology as long as the Apache open source project includes it in Hadoop, but IBM is now folding Spark into its own Hadoop distribution too. Spark can be used both for Hadoop projects and for non-Hadoop projects.
Like Hadoop, Spark will be available in IBM’s completely open source distribution, Open Platform with Apache Hadoop (IOP). However, it will also be bundled into other application frameworks produced by IBM.
The cloud implementation, which deRoos calls Spark as a Service, is in beta on Bluemix, IBM’s cloud environment built on the open source Cloud Foundry project. Bluemix is designed to help developers mix and match different applications online, he said.
“Where Spark is well suited for machine learning applications, this makes it possible to fairly easily integrate machine learning capabilities into Bluemix applications that work with data,” deRoos concluded.