
Oozie shell script action


One thing that has always been tricky about Oozie workflows is executing bash scripts. Hadoop is built to be massively parallel, so the architecture behaves very differently than you might expect.

When an Oozie workflow executes a shell action, it receives resources from your JobTracker or YARN on any of the nodes in your cluster. This means that using a local path for your file will not work, since local storage exists only on your edge node. If the job happened to spawn on your edge node it would work, but any other time it would fail, and the node assignment is effectively random.

To get around this, I found it best to keep the files I needed (including the .sh scripts) in HDFS, either in a lib directory or in the same location as my workflow.

Here is a good way to approach what you are trying to achieve.

<shell xmlns="uri:oozie:shell-action:0.1">
    <exec>hive.sh</exec>
    <file>/user/lib/hive.sh#hive.sh</file>
    <file>ETL_file1.hql#hivescript</file>
</shell>

One thing you will notice is that the exec is just hive.sh, since we assume the file will be moved into the base directory where the shell action runs.

To make sure that last note holds, you must include the file's HDFS path; this forces Oozie to distribute that file with the action. In your case, the Hive script launcher should only be coded once and simply fed different files. Since we have a one-to-many relationship, hive.sh should be kept in a lib directory and not distributed with every workflow.
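For context, here is a sketch of how that shell action might sit inside a complete workflow. The `${jobTracker}`/`${nameNode}` placeholders and the action/node names are assumptions for illustration, not from the original post:

```xml
<workflow-app name="etl-wf" xmlns="uri:oozie:workflow:0.4">
    <start to="run-hive-sh"/>
    <action name="run-hive-sh">
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>hive.sh</exec>
            <!-- shared launcher, pulled from the lib directory -->
            <file>/user/lib/hive.sh#hive.sh</file>
            <!-- per-workflow hql, shipped from the workflow directory -->
            <file>ETL_file1.hql#hivescript</file>
        </shell>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Shell action failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
```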

Lastly you see the line:

<file>ETL_file1.hql#hivescript</file>

This line does two things. Before the # we have the location of the file. It is just the file name, since we should distribute our distinct Hive files with our workflows:

user/directory/workflow.xml
user/directory/ETL_file1.hql

and the node running the .sh will have this distributed to it automagically. Lastly, the part after the # is the name we assign the file inside of the .sh script. This gives you the ability to reuse the same script over and over, simply feeding it different files.
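Since the same launcher gets reused with different .hql files, hive.sh itself can stay tiny. A minimal sketch, assuming the Hive CLI is on the worker node's PATH (in the real script the last line would simply be `run_hql "$@"`):

```shell
#!/bin/bash
# Sketch of a reusable hive.sh launcher (hypothetical; kept once in /user/lib).
# Oozie materialises <file>ETL_file1.hql#hivescript</file> into the container's
# working directory as ./hivescript, so the launcher always reads that fixed
# name no matter which .hql file a given workflow shipped alongside it.

run_hql() {
    local hql="${1:-hivescript}"    # default to the symlink name from <file>
    if [ ! -f "$hql" ]; then
        echo "error: Hive script '$hql' not found in $(pwd)" >&2
        return 1
    fi
    hive -f "$hql"                  # assumes the Hive CLI is on the worker's PATH
}
```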

HDFS directory notes:

If the file is nested inside the same directory as the workflow, then you only need to specify child paths:

user/directory/workflow.xml
user/directory/hive/ETL_file1.hql

Would yield:

<file>hive/ETL_file1.hql#hivescript</file>

But if the path is outside of the workflow directory you will need the full path:

user/directory/workflow.xml
user/lib/hive.sh

would yield:

<file>/user/lib/hive.sh#hive.sh</file>

I hope this helps everyone.


From

http://oozie.apache.org/docs/3.3.0/DG_ShellActionExtension.html#Shell_Action_Schema_Version_0.2

If you keep your shell script and Hive script together in some folder within the workflow directory, then you can execute them.

See the command in sample

<exec>${EXEC}</exec>
<argument>A</argument>
<argument>B</argument>
<file>${EXEC}#${EXEC}</file> <!-- Copy the executable to the compute node's current working directory -->

You can write whatever commands you want in the file.

You can also use the Hive action directly:

http://oozie.apache.org/docs/3.3.0/DG_HiveActionExtension.html
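A Hive action along those lines might look like this. This is a sketch against the hive-action 0.2 schema; the action name and the `${jobTracker}`/`${nameNode}` placeholders are assumptions:

```xml
<action name="run-etl-hql">
    <hive xmlns="uri:oozie:hive-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <!-- Oozie resolves this path relative to the workflow directory -->
        <script>ETL_file1.hql</script>
    </hive>
    <ok to="end"/>
    <error to="fail"/>
</action>
```

The upside is that Oozie then launches Hive for you, so no hive.sh wrapper is needed at all.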