Installation and configuration of Apache Oozie
Often there is a requirement to run a group of dependent data processing jobs, and we might also want to run some of them at regular intervals. This is where Apache Oozie fits in. Here are some nice articles (1, 2, 3, 4) on how to use Oozie.
Apache Oozie has three components: a workflow engine to run a DAG of actions, a coordinator (similar to a cron job or scheduler) and a bundle to batch a group of coordinators. Azkaban from LinkedIn is similar to Oozie; here are some articles (1, 2) comparing the two.
Installing and configuring Oozie is not straightforward, not only because of the documentation, but also because the release includes only the source code and not the binaries. The code has to be downloaded, the dependencies installed and the binaries built. It's a somewhat tedious process, so this blog assumes that Hadoop has already been installed and configured. Here is the official documentation on how to build and install Oozie.
So, here are the steps to install and configure Oozie:
- Make sure the build requirements are met: a Unix box (tested on Mac OS X and Linux), Java JDK 1.6+, Maven 3.0.1+, Hadoop 0.20.2+ and Pig 0.7+.
- Download a release containing the source code from the Apache Oozie site and extract it.
- Execute the below command to start the build. During the build, the required jars are downloaded, so it might take some time depending on the network bandwidth. Make sure that there are no errors in the build process.
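    bin/mkdistro.sh -DskipTests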
- Once the build is complete, the binary file oozie-4.0.0.tar.gz should be present in the folder where the Oozie code was extracted. Extract the tar.gz file; this will create a folder called oozie-4.0.0.
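For example, assuming the tarball is in the current directory:

    tar -xzvf oozie-4.0.0.tar.gz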
- Create a libext/ folder inside the oozie-4.0.0 folder and copy the commons-configuration-*.jar, ext-2.2.zip, hadoop-client-*.jar and hadoop-core-*.jar files into it. The Hadoop jars need to be copied from the Hadoop installation folder.
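A rough sketch of these steps, assuming Oozie was extracted to oozie-4.0.0/, Hadoop is installed under $HADOOP_HOME and ext-2.2.zip has already been downloaded (the paths and version patterns are placeholders; adjust them to your setup):

    cd oozie-4.0.0
    mkdir libext
    # Hadoop jars from the Hadoop installation folder
    cp $HADOOP_HOME/hadoop-core-*.jar libext/
    cp $HADOOP_HOME/hadoop-client-*.jar libext/
    cp $HADOOP_HOME/lib/commons-configuration-*.jar libext/
    # ExtJS library used by the Oozie web console
    cp /path/to/ext-2.2.zip libext/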
When Oozie is started without the commons-configuration-*.jar in the libext/ folder, the below exception shows up in the catalina.out log file, which is the reason for including that jar.
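    java.lang.NoClassDefFoundError: org/apache/commons/configuration/Configuration
        at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<init>(DefaultMetricsSystem.java:37)
        at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<clinit>(DefaultMetricsSystem.java:34)
        at org.apache.hadoop.security.UgiInstrumentation.create(UgiInstrumentation.java:51)
        at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:217)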
- Prepare the war file using the below command. The oozie.war file should then be present in the oozie-4.0.0/oozie-server/webapps folder.
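    bin/oozie-setup.sh prepare-war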
- Create the Oozie database schema using the below command:
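    bin/ooziedb.sh create -sqlfile oozie.sql -run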
- Now it is time to start the Oozie service, which runs in Tomcat:
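    bin/oozied.sh start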
- Check the Oozie log file logs/oozie.log to ensure Oozie started properly. Then run the below command to check the status of Oozie, or go to the Oozie web console at http://localhost:11000/oozie instead:
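    bin/oozie admin -oozie http://localhost:11000/oozie -status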
- Now, the Oozie client has to be installed by extracting the
oozie-client-4.0.0.tar.gz. This will create a folder called
oozie-client-4.0.0.
With the Oozie service running and the Oozie client installed, it is time to run some simple workflows to make sure Oozie works fine. Oozie comes with a bunch of examples in oozie-examples.tar.gz. Here are the steps:
- Extract oozie-examples.tar.gz and, in all the job.properties files, change the port number on which the NameNode listens (the Oozie examples default to 8020, while Hadoop defaults to 9000). Similarly, the JobTracker port has to be modified (the Oozie examples default to 8021, while Hadoop defaults to 9001).
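As an illustration, the map-reduce example's job.properties might look like this after the edits (the property names are the ones used by the Oozie examples; localhost and the port numbers assume a local pseudo-distributed Hadoop):

    nameNode=hdfs://localhost:9000
    jobTracker=localhost:9001
    queueName=default
    examplesRoot=examples
    oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/map-reduce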
- In the Hadoop installation, add the below to the conf/core-site.xml file. Check the Oozie documentation for more information on what these parameters mean.
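    <property>
        <name>hadoop.proxyuser.training.hosts</name>
        <value>localhost</value>
    </property>
    <property>
        <name>hadoop.proxyuser.training.groups</name>
        <value>training</value>
    </property>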
- Make sure that HDFS and MR are started and running properly.
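One quick way to verify this (assuming a pseudo-distributed Hadoop 1.x setup) is to run jps and check that the daemons are up:

    jps
    # expect NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker in the output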
- Copy the examples folder into HDFS using the below command:
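    bin/hadoop fs -put /home/training/tmp/examples/ examples/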
- Now run the Oozie map-reduce example as below:
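    oozie job -oozie http://localhost:11000/oozie -config /home/training/tmp/examples/apps/map-reduce/job.properties -run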
- The status of the job can be checked using the below command, replacing the job id with the one returned by the previous command:
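    oozie job -oozie http://localhost:11000/oozie -info 14-20090525161321-oozie-tucu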
In the upcoming blogs, we will see how to write some simple workflows and schedule tasks in Oozie.