SAN FRANCISCO — Apache’s Hadoop technologies are becoming critical in helping enterprises manage vast amounts of data, with users ranging from NASA to Twitter to Netflix increasing their reliance on the open source distributed computing platform.
Hadoop has gathered momentum as a mechanism for dealing with big data, in which enterprises seek to derive value from the rapidly growing amounts of data in their computer systems. Recognizing Hadoop's potential, users are both deploying the existing platform and building their own technologies to complement the Hadoop stack.
Hadoop’s corporate usage now and in the future
NASA expects Hadoop to handle large data loads in projects such as its Square Kilometer Array sky-imaging effort, which will churn out 700TB of data per second once built in the next decade. The data systems will include Hadoop, as well as technologies such as Apache OODT (Object Oriented Data Technology), to cope with the massive loads, says Chris Mattmann, a senior computer scientist at NASA.
Twitter is a big user of Hadoop. “All of the relevance products [offering personalized recommendations to users] have some interaction with Hadoop,” says Oscar Boykin, a Twitter data scientist. The company has been using Hadoop for about four years and has even developed Scalding, a Scala library intended to make it easy to write Hadoop MapReduce jobs; it is built on top of the Cascading Java library, which is designed to abstract away Hadoop’s complexity.
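To see the kind of boilerplate Scalding removes, here is a minimal word-count job in the spirit of the example Scalding's own documentation uses; the input and output paths are hypothetical arguments supplied at launch.

```scala
import com.twitter.scalding._

// A minimal Scalding job: count words in a text file.
// Scalding compiles this into Cascading flows, which in turn
// run as Hadoop MapReduce jobs.
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))                       // read lines from HDFS
    .flatMap('line -> 'word) { line: String =>
      line.toLowerCase.split("\\s+")            // split each line into words
    }
    .groupBy('word) { _.size }                  // count occurrences per word
    .write(Tsv(args("output")))                 // write tab-separated results
}
```

The equivalent raw MapReduce program would need separate mapper and reducer classes plus job-configuration code; Scalding jobs like this one are typically launched through Hadoop's tool runner via com.twitter.scalding.Tool.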
Hadoop subprojects include MapReduce, a software framework for processing large data sets on compute clusters; HDFS (Hadoop Distributed File System), which provides high-throughput access to application data; and Common, which offers utilities to support the other Hadoop subprojects.

Movie rental service Netflix has begun using Apache ZooKeeper, a Hadoop-related technology for configuration management. “We use it for all kinds of things: distributed locks, some queuing, and leader election” for prioritizing service activity, says Jordan Zimmerman, a senior platform engineer at Netflix. “We open-sourced a client for ZooKeeper that I wrote called Curator”; the client serves as a library for developers to connect to ZooKeeper.
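As a flavor of what that looks like in practice, here is a minimal sketch of Curator's distributed-lock recipe, written against the org.apache.curator package names the project adopted after moving to Apache; the ZooKeeper hosts and lock path are hypothetical.

```scala
import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.framework.recipes.locks.InterProcessMutex
import org.apache.curator.retry.ExponentialBackoffRetry

object DistributedLockExample extends App {
  // Connect to a ZooKeeper ensemble (hypothetical hosts), retrying
  // on connection loss with exponential backoff: 1s base, 3 retries max.
  val client = CuratorFrameworkFactory.newClient(
    "zk1:2181,zk2:2181,zk3:2181",
    new ExponentialBackoffRetry(1000, 3))
  client.start()

  // A distributed lock backed by a ZooKeeper znode; only one process
  // across the cluster can hold it at a time.
  val lock = new InterProcessMutex(client, "/locks/my-resource")
  lock.acquire()
  try {
    // ...work that must not run concurrently across the cluster...
  } finally {
    lock.release()
  }
  client.close()
}
```

Curator's other recipes, such as LeaderLatch for the leader election Zimmerman mentions, follow the same pattern: construct a recipe object over the client and let Curator handle retries and connection management.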
The Tagged social network is using Hadoop technology for data analytics, processing about half a terabyte of new data daily, says Rich McKinley, Tagged’s senior data engineer. Hadoop is being applied to tasks beyond the capacity of its Greenplum database, which is still in use at Tagged: “We’re looking toward doing more with Hadoop just for scale.”
Although they laud Hadoop, users see issues that need fixing, such as deficiencies in reliability and job tracking. Tagged’s McKinley notes a problem with latency: “The time to get data in is quite quick and then, of course, I think everybody’s big complaint is the high latency for doing your queries.” Tagged has used Apache Hive, another Hadoop-related project, for ad hoc queries. “That can take several minutes to get a result that in Greenplum would return in a couple of seconds.” Using Hadoop is cheaper than using Greenplum, though.
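For context, an ad hoc Hive query of the sort McKinley describes can be issued over JDBC like any SQL query; in this sketch the host, credentials, table, and columns are all hypothetical, and the latency he complains about comes from Hive compiling the statement into MapReduce jobs.

```scala
import java.sql.DriverManager

object HiveAdHocQuery extends App {
  // Register the HiveServer2 JDBC driver and connect (hypothetical host).
  Class.forName("org.apache.hive.jdbc.HiveDriver")
  val conn = DriverManager.getConnection(
    "jdbc:hive2://hive-gateway:10000/default", "analyst", "")

  // A typical ad hoc aggregation; Hive plans this as one or more
  // MapReduce jobs, which is why results take minutes, not seconds.
  val stmt = conn.createStatement()
  val rs = stmt.executeQuery(
    "SELECT dt, COUNT(*) AS views FROM page_views GROUP BY dt")
  while (rs.next()) {
    println(s"${rs.getString(1)}\t${rs.getLong(2)}")
  }
  conn.close()
}
```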
What’s in store for Hadoop 2.0
Hadoop 1.0 was released late in 2011, featuring strong authentication via Kerberos and support for the HBase database. The release also keeps individual users from taking down clusters by placing constraints on MapReduce. But a new version is on the horizon: Hortonworks CTO Eric Baldeschwieler has provided a road map for Hadoop that includes the upcoming 2.0 release. (Hortonworks has been a contributor to Apache Hadoop.) Version 2.0, which went into an alpha release phase earlier this year, “has an end-to-end rewrite of the MapReduce layer and a pretty complete rewrite of all the storage logic and the HDFS layer as well,” Baldeschwieler says.
Hadoop 2.0 focuses on scale and innovation, with YARN (next-generation MapReduce) and federation capabilities. YARN will let users plug in their own compute models, so they will no longer have to stick to MapReduce. “We’re really looking forward to the community inventing many new ways of using Hadoop,” Baldeschwieler says. Expected uses include real-time applications and machine-learning algorithms. Scalable, pluggable storage is also planned.
Always-on capabilities in Version 2.0 will enable clusters with no downtime. General availability of Hadoop 2.0 is expected within a year.