LAS VEGAS – EMC Corp. is packaging the open source Hadoop analytics engine into its Greenplum data centre appliance to help companies analyze the massive amount of unstructured data being created online.
The company unveiled its integration with Apache Software Foundation’s Hadoop software on Monday as it kicked off its annual EMC World 2011 conference in Las Vegas. The Greenplum HD Data Computing Appliance will allow enterprises to process both structured and unstructured data sets within a single box.
EMC said the Hadoop platform has gained in popularity over the last several years with Internet companies such as Facebook, Twitter and eBay using the engine to crunch unstructured data sets. But now, the company is integrating the open source software into Greenplum hardware to make it easier for customers to deploy the product and jump onto the open source data analysis bandwagon.
Luke Lonergan, co-founder of Greenplum and the chief technology officer of EMC’s data computing division, called the move a “simplicity play” for enterprises looking to bring unstructured data sets into the existing analytics systems they already use.
“A lot of data that comes from transaction systems is in a structured format in those systems, but have no meaning to the people who aggregate information,” he said. Other unstructured data sets could include system log files, feedback from customer support centres, and social media data.
EMC said the Greenplum hardware appliance will eliminate overall complexity and the need for any specialized software or hardware products to run Hadoop. It is also specifically optimized to work with the open source engine as soon as it is plugged in.
By providing a stack around Hadoop, Lonergan said, companies can do all their work around big data in one platform, including real-time deep analytics and massive scale-out storage capabilities.
Scott Yara, co-founder of Greenplum and the vice-president of products at EMC’s data computing division, said the appliances will be geared toward organizations looking to get a handle on “behavioural analytics.”
“How can I tell who the good and loyal customers are,” he said. “Also, how can I get a sense of fraudulent and negligent behaviour?”
In addition to offering a community-based version of the analytics software, EMC will commercialize Hadoop and distribute its own version of the software in an enterprise-focused release called Greenplum HD Enterprise edition.
The enterprise software will be aimed at corporate data centres and come with advanced data management features such as snapshots and wide area replication. The product will also offer simplified data loading and access using a native network file system interface and end-to-end management features such as automated node failure detection, multi-site management and rolling upgrades.
The aforementioned Greenplum HD Community edition was touted as a “100 per cent open source certified and supported version of the Apache Hadoop stack,” which will comprise HDFS, MapReduce, ZooKeeper, Hive and HBase. The company also vows to take a proactive, open source friendly stance with this offering and will contribute everything back to the Apache community.
EMC has also brought together an ecosystem of 12 companies that will offer analytics and data transfer capabilities: Concurrent, CSC, Datameer, Informatica, Jaspersoft, Karmasphere, MicroStrategy, Pentaho, SAS, SnapLogic, Talend and VMware.
James Markarian, executive vice-president and chief technology officer of Informatica, said that with all the information being created on social media sites, figuring out an enterprise’s online sentiment has become a daunting task.
“Without Hadoop, there’s no way to gather insight from all this information,” he said.
EMC rival NetApp Inc. also launched a Hadoop product of its own on Monday, unveiling the E-Series Platform of storage arrays aimed at high performance analytics workloads such as Hadoop.
Rounding out the announcements on the first day of EMC’s annual user conference was an addition to EMC’s Isilon scale-out NAS hardware portfolio. The new Isilon IQ 108NL can scale to more than 15 petabytes of information and will provide the storage foundation for companies playing in the big data analytics space, EMC said.