Hadoop MapReduce challenges in the Enterprise
Platform) On the performance measure: to be most useful in a robust enterprise environment, a MapReduce job should start in sub-millisecond time, but job startup in the current open source MapReduce implementation is measured in seconds.
Praveen) MapReduce is meant for batch processing, not for online transactions. The output of a MapReduce job can be fed to a separate system for online processing. That is not to say there is no scope for improving MapReduce job performance.
Platform) The current Hadoop MapReduce implementation does not provide such capabilities. As a result, for each MapReduce job, a customer has to assign a dedicated cluster to run that particular application, one at a time.
Platform) Each cluster is dedicated to a single MapReduce application, so if a user has multiple applications, s/he has to run them serially on the same resource or buy another cluster for the additional application.
Praveen) Apache Hadoop has a pluggable scheduler architecture, with Capacity, Fair and FIFO Schedulers available; the FIFO Scheduler is the default. Schedulers allow multiple applications and multiple users to share the cluster at the same time.
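For illustration, here is a minimal sketch of swapping the default FIFO scheduler for the Fair Scheduler using the Hadoop 1.x `mapred.jobtracker.taskScheduler` property. In practice this key is set once in `mapred-site.xml` on the JobTracker rather than in application code:

```java
import org.apache.hadoop.conf.Configuration;

public class SchedulerSwap {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Replace the default FIFO scheduler with the Fair Scheduler so
        // concurrent jobs from multiple users share the cluster's slots.
        // Normally this property lives in mapred-site.xml on the JobTracker.
        conf.set("mapred.jobtracker.taskScheduler",
                 "org.apache.hadoop.mapred.FairScheduler");

        // To use the Capacity Scheduler instead:
        // conf.set("mapred.jobtracker.taskScheduler",
        //          "org.apache.hadoop.mapred.CapacityTaskScheduler");
    }
}
```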
Platform) Current Hadoop MapReduce implementations derived from open source are not equipped to address the dynamic resource allocation required by various applications.
Platform) Customers also need support for workloads that have different characteristics or are written in different programming languages. For instance, some applications may be data intensive, such as MapReduce applications written in Java, while others may be CPU intensive, such as Monte Carlo simulations often written in C++; a runtime engine must be designed to support both simultaneously.
Praveen) NextGen MapReduce allows for dynamic allocation of resources. Currently only memory-based requests are supported, but the framework can be extended to other dimensions such as CPU and disk in the future.
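As a rough sketch of what a resource ask looks like under the new model, here is a container request written against the later Hadoop 2.x YARN client API (the API shape, memory size and vcore count are illustrative assumptions, not part of the original discussion):

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ResourceAsk {
    public static void main(String[] args) {
        AMRMClient<ContainerRequest> amClient = AMRMClient.createAMRMClient();
        amClient.init(new YarnConfiguration());
        amClient.start();

        // Each container request carries a Resource capability. Memory is
        // the dimension honoured today; the vcores field shows where a CPU
        // dimension slots in as the model is extended.
        Resource capability = Resource.newInstance(1024 /* MB */, 1 /* vcores */);
        Priority priority = Priority.newInstance(0);

        // null node and rack lists mean "place it anywhere in the cluster".
        amClient.addContainerRequest(
            new ContainerRequest(capability, null, null, priority));
    }
}
```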
Platform) As mentioned in part 2 of this blog series, the single JobTracker in the current Hadoop implementation is not separated from the resource manager; as a result, the JobTracker does not provide sufficient resource management functionality to allow dynamic lending and borrowing of available IT resources.
Praveen) NextGen MapReduce splits resource management and task scheduling into separate components: a cluster-wide ResourceManager that hands out containers, and a per-application ApplicationMaster that schedules tasks within them.
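To make the split concrete, here is a skeleton of that handshake, again using the later Hadoop 2.x client API for illustration; the host and tracking-URL values are placeholders:

```java
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AppMasterSkeleton {
    public static void main(String[] args) throws Exception {
        AMRMClient<ContainerRequest> am = AMRMClient.createAMRMClient();
        am.init(new YarnConfiguration());
        am.start();

        // Resource management side: register with the ResourceManager.
        // Host, port and tracking URL are placeholder values.
        am.registerApplicationMaster("am-host.example.com", 0, "");

        // Task scheduling side: the ApplicationMaster itself decides what
        // runs in each granted container; the ResourceManager never sees
        // individual map or reduce tasks.
        // ... am.allocate(...) heartbeat loop, launch containers ...

        am.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
        am.stop();
    }
}
```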
To summarize, NextGen MapReduce addresses some of the concerns raised by Platform, but it will take some time for it to stabilize and become production-ready.