Apache Tez - Beyond Hadoop (MR)
Thanks to Hadoop Tips
Apache Pig and Hive are higher level abstracts on top of MR (MapReduce). PigLatin scripts of Pig and HiveQL queries of Hive are converted into a DAG of MR jobs, with the first MR job (5) reading from the input the last MR job (2) writing to the output. One of the problem with this approach is that the temporary data between the MR jobs (as in the case of 11 and 9) is written to HDFS (by 11) and read from HDFS (by 9) which leads to inefficiency. Not only this, multiple MR jobs will also lead to initialization over head.
Apache Pig and Hive are higher level abstracts on top of MR (MapReduce). PigLatin scripts of Pig and HiveQL queries of Hive are converted into a DAG of MR jobs, with the first MR job (5) reading from the input the last MR job (2) writing to the output. One of the problem with this approach is that the temporary data between the MR jobs (as in the case of 11 and 9) is written to HDFS (by 11) and read from HDFS (by 9) which leads to inefficiency. Not only this, multiple MR jobs will also lead to initialization over head.
With the ClickStream analysis mentioned in the blog
earlier, to find out the `top 3 urls visited by users whose age is less
than 16` Hive results in a DAG of 3 MR jobs, while Pig results in a DAG
of 5 MR jobs.
Apache Tez (Tez means Speed in Hindi)
is a generalized data processing paradigm which tries to address some
of the challenges mentioned above. Using Tez along with Hive/Pig, a
single HiveQL or PigLatin will be converted into a single Tez Job and
not as a DAG of MR jobs as is happening now. In Tez, data processing is
represented as a graph with the vertices in
the graph representing processing of data and edges representing
movement of data between the processing.
With Tez the output of a MR job can be directly streamed to the next MR
job without writing to HDFS. If there are is any failure, the tasks
from the last checkpoint will be executed. In the above DAG, (5) will
read the input from the disk and (2) will write the output to the disk.
The data between the remaining vertices can happen in-memory or can be
streamed over the network or can be written to the disk for the sake of
checkpointing.
According to the Tez developers in the below video, Tez is for
frameworks like Pig, Hive and not for the end users to directly code to
Tez.
HortonWorks had been aggressively promoting Tez as a replacement to
Hadoop runtime and is also publishing a series of articles on the same (1, 2, 3, 4). A DAG of MR jobs would also run efficiently on a Tez runtime than on a Hadoop runtime.
Work is in progress to run Pig and Hive on the Tez runtime. Here are the JIRAs (Pig and Hive)
for the same. Need to wait and see how far it is accepted by the
community. In the future we should be able to run Pig and Hive on Hadoop
and Tez runtime by just switching a flag in the configuration files.
Comments
Post a Comment
thank you for your feedback