Hadoop MapReduce challenges in the Enterprise
Platform) On the performance measure: to be most useful in a robust enterprise environment, a MapReduce job should start in sub-millisecond time, but job startup in the current open source MapReduce implementation is measured in seconds.
Praveen) MapReduce is meant for batch processing, not for online transactions. The output of a MapReduce job can be fed to a separate system for online processing. That is not to say there is no scope for improving MapReduce job performance.
Platform) The current Hadoop MapReduce implementation does not provide such capabilities. As a result, for each MapReduce job, a customer has to assign a dedicated cluster to run that particular application, one at a time.
Platform) Each cluster is dedicated to a single MapReduce application, so if a user has multiple applications, s/he has to run them serially on the same resource or buy another cluster for the additional application.
Praveen) Apache Hadoop has a pluggable scheduler architecture, with Capacity, Fair and FIFO Schedulers available; the FIFO Scheduler is the default. Schedulers allow multiple applications and multiple users to share the cluster at the same time.
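For illustration, here is a minimal sketch of swapping the default FIFO scheduler for the Fair Scheduler using the Hadoop 1.x `mapred.jobtracker.taskScheduler` property. In practice this key is set once in `mapred-site.xml` on the JobTracker rather than in application code:

```java
import org.apache.hadoop.conf.Configuration;

public class SchedulerSwap {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Replace the default FIFO scheduler with the Fair Scheduler so
        // concurrent jobs from multiple users share the cluster's slots.
        // Normally this property lives in mapred-site.xml on the JobTracker.
        conf.set("mapred.jobtracker.taskScheduler",
                 "org.apache.hadoop.mapred.FairScheduler");

        // To use the Capacity Scheduler instead:
        // conf.set("mapred.jobtracker.taskScheduler",
        //          "org.apache.hadoop.mapred.CapacityTaskScheduler");
    }
}
```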
Platform) Current Hadoop MapReduce implementations derived from open source are not equipped to address the dynamic resource allocation required by various applications.
Platform) Customers also need support for workloads that have different characteristics or are written in different programming languages. For instance, some applications may be data intensive, such as MapReduce applications written in Java, while others may be CPU intensive, such as Monte Carlo simulations often written in C++; a runtime engine must be designed to support both simultaneously.
Praveen) NextGen MapReduce allows for dynamic allocation of resources. Currently only memory-based requests are supported, but the framework can be extended to other dimensions such as CPU and disk in the future.
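As a rough sketch of what a resource ask looks like under the new model, here is a container request written against the later Hadoop 2.x YARN client API (the API shape, memory size and vcore count are illustrative assumptions, not part of the original discussion):

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ResourceAsk {
    public static void main(String[] args) {
        AMRMClient<ContainerRequest> amClient = AMRMClient.createAMRMClient();
        amClient.init(new YarnConfiguration());
        amClient.start();

        // Each container request carries a Resource capability. Memory is
        // the dimension honoured today; the vcores field shows where a CPU
        // dimension slots in as the model is extended.
        Resource capability = Resource.newInstance(1024 /* MB */, 1 /* vcores */);
        Priority priority = Priority.newInstance(0);

        // null node and rack lists mean "place it anywhere in the cluster".
        amClient.addContainerRequest(
            new ContainerRequest(capability, null, null, priority));
    }
}
```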
Platform) As mentioned in part 2 of this blog series, the single JobTracker in the current Hadoop implementation is not separated from the resource manager; as a result, the JobTracker does not provide sufficient resource management functionality to allow dynamic lending and borrowing of available IT resources.
Praveen) NextGen MapReduce splits resource management and task scheduling into separate components: a cluster-wide ResourceManager that hands out containers, and a per-application ApplicationMaster that schedules tasks within them.
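To make the split concrete, here is a skeleton of that handshake, again using the later Hadoop 2.x client API for illustration; the host and tracking-URL values are placeholders:

```java
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AppMasterSkeleton {
    public static void main(String[] args) throws Exception {
        AMRMClient<ContainerRequest> am = AMRMClient.createAMRMClient();
        am.init(new YarnConfiguration());
        am.start();

        // Resource management side: register with the ResourceManager.
        // Host, port and tracking URL are placeholder values.
        am.registerApplicationMaster("am-host.example.com", 0, "");

        // Task scheduling side: the ApplicationMaster itself decides what
        // runs in each granted container; the ResourceManager never sees
        // individual map or reduce tasks.
        // ... am.allocate(...) heartbeat loop, launch containers ...

        am.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
        am.stop();
    }
}
```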
To summarize, NextGen MapReduce addresses some of the concerns raised by Platform, but it will take some time for it to stabilize and become production-ready.