Installation and configuration of Apache Oozie
Often there is a requirement to run a group of dependent data processing jobs, and we might also want to run some of them at regular intervals. This is where Apache Oozie fits in. Here are some nice articles (1, 2, 3, 4) on how to use Oozie.
Apache Oozie has three components: a workflow engine to run a DAG of actions, a coordinator (similar to a cron job or scheduler) and a bundle to batch a group of coordinators. Azkaban from LinkedIn is similar to Oozie; here are some articles (1, 2) comparing the two.
Installing and configuring Oozie is not straightforward, not only because of the documentation, but also because the release includes only the source code and not the binaries. The code has to be downloaded, the dependencies installed and the binaries built. It's a somewhat tedious process, so this blog assumes that Hadoop has already been installed and configured. Here is the official documentation on how to build and install Oozie.
So, here are the steps to install and configure Oozie:
- Make sure the build requirements are met: a Unix box (tested on Mac OS X and Linux), Java JDK 1.6+, Maven 3.0.1+, Hadoop 0.20.2+ and Pig 0.7+.
- Download a release containing the source code from the Apache Oozie site and extract it.
- Execute the below command to start the build. During the build, the required jars are downloaded, so it might take some time depending on the network bandwidth. Make sure that there are no errors in the build process.
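    bin/mkdistro.sh -DskipTests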
- Once the build is complete, the binary file oozie-4.0.0.tar.gz should be present in the folder where the Oozie code was extracted. Extract the tar.gz file; this will create a folder called oozie-4.0.0.
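For example, assuming the tarball is in the current directory:

    tar -xzvf oozie-4.0.0.tar.gz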
- Create a libext/ folder inside the oozie-4.0.0 folder and copy the commons-configuration-*.jar, ext-2.2.zip, hadoop-client-*.jar and hadoop-core-*.jar files into it. The Hadoop jars need to be copied from the Hadoop installation folder.
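A rough sketch of these steps, assuming Oozie was extracted to oozie-4.0.0/, Hadoop is installed under $HADOOP_HOME and ext-2.2.zip has already been downloaded (the paths and version patterns are placeholders; adjust them to your setup):

    cd oozie-4.0.0
    mkdir libext
    # Hadoop jars from the Hadoop installation folder
    cp $HADOOP_HOME/hadoop-core-*.jar libext/
    cp $HADOOP_HOME/hadoop-client-*.jar libext/
    cp $HADOOP_HOME/lib/commons-configuration-*.jar libext/
    # ExtJS library used by the Oozie web console
    cp /path/to/ext-2.2.zip libext/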
When Oozie is started without the commons-configuration-*.jar in the libext/ folder, the below exception shows up in the catalina.out log file, which is the reason for including that jar.
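    java.lang.NoClassDefFoundError: org/apache/commons/configuration/Configuration
        at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<init>(DefaultMetricsSystem.java:37)
        at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<clinit>(DefaultMetricsSystem.java:34)
        at org.apache.hadoop.security.UgiInstrumentation.create(UgiInstrumentation.java:51)
        at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:217)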
- Prepare the war file using the below command. The oozie.war file should then be present in the oozie-4.0.0/oozie-server/webapps folder.
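    bin/oozie-setup.sh prepare-war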
- Create the Oozie database schema using the below command:
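    bin/ooziedb.sh create -sqlfile oozie.sql -run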
- Now it is time to start the Oozie service, which runs in Tomcat:
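    bin/oozied.sh start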
- Check the Oozie log file logs/oozie.log to ensure Oozie started properly. Then run the below command to check the status of Oozie, or go to the Oozie web console at http://localhost:11000/oozie instead:
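    bin/oozie admin -oozie http://localhost:11000/oozie -status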
- Now, the Oozie client has to be installed by extracting the
oozie-client-4.0.0.tar.gz. This will create a folder called
oozie-client-4.0.0.
With the Oozie service running and the Oozie client installed, it is time to run some simple workflows to make sure Oozie works fine. Oozie comes with a bunch of examples in oozie-examples.tar.gz. Here are the steps:
- Extract oozie-examples.tar.gz and, in all the job.properties files, change the port number on which the NameNode listens (the Oozie examples default to 8020, while Hadoop defaults to 9000). Similarly, the JobTracker port has to be modified (the Oozie examples default to 8021, while Hadoop defaults to 9001).
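As an illustration, the map-reduce example's job.properties might look like this after the edits (the property names are the ones used by the Oozie examples; localhost and the port numbers assume a local pseudo-distributed Hadoop):

    nameNode=hdfs://localhost:9000
    jobTracker=localhost:9001
    queueName=default
    examplesRoot=examples
    oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/map-reduce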
- In the Hadoop installation, add the below to the conf/core-site.xml file. Check the Oozie documentation for more information on what these parameters mean.
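    <property>
        <name>hadoop.proxyuser.training.hosts</name>
        <value>localhost</value>
    </property>
    <property>
        <name>hadoop.proxyuser.training.groups</name>
        <value>training</value>
    </property>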
- Make sure that HDFS and MR are started and running properly.
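One quick way to verify this (assuming a pseudo-distributed Hadoop 1.x setup) is to run jps and check that the daemons are up:

    jps
    # expect NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker in the output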
- Copy the examples folder into HDFS using the below command:
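    bin/hadoop fs -put /home/training/tmp/examples/ examples/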
- Now run the Oozie map-reduce example as below:
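    oozie job -oozie http://localhost:11000/oozie -config /home/training/tmp/examples/apps/map-reduce/job.properties -run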
- The status of the job can be checked using the below command, replacing the job id with the one returned by the previous command:
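    oozie job -oozie http://localhost:11000/oozie -info 14-20090525161321-oozie-tucu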
In the upcoming blogs, we will see how to write some simple workflows and schedule tasks in Oozie.