Passing script files on hdfs to impala-shell
Impala shell can accept query text from STDIN. As described here, option -f
-f query_file or --query_file=query_file
query_file=path_to_query_file
Passes a SQL query from a file. Multiple statements must be semicolon (;) delimited. In Impala 2.3 and higher, you can specify a filename of - to represent standard input. This feature makes it convenient to use impala-shell as part of a Unix pipeline where SQL statements are generated dynamically by other tools.
So in your case, your shell script can simply do something like
$ hdfs dfs -cat <hdfs_file_name> | impala-shell -i <impala_daemon> -f -
If you have the fixed number of queries, or you can collect (cat) them into one file, then you can pass the name of this file(s) as a parameter out of the <action>
using <capture-output/>
tag:
$ hdfs hdfs -cat /user/impala/sql/custom_script_name.sql
CREATE TABLE default.t1(n INT);INSERT INTO default.t1 VALUES(1);
$ hdfs hdfs -cat /oozie/shell/prepare-implala-sql.sh
#!/bin/bashecho HDFS_IMPALA_SCRIPT:/user/impala/sql/custom_script_name.sql
$ hdfs hdfs -cat /user/oozie/workflow/wf_impala_env/wf_impala_env.xml
<workflow-app name="wf_impala_env" xmlns="uri:oozie:workflow:0.5"> <start to="a1"/> <kill name="a0"> <message>Error: [${wf:errorMessage(wf:lastErrorNode())}]</message> </kill> <action name="a1"> <shell xmlns="uri:oozie:shell-action:0.2"> <job-tracker>${resourceManager}</job-tracker> <name-node>${nameNode}</name-node> <exec>bash</exec> <argument>prepare-implala-sql.sh</argument> <file>/oozie/shell/prepare-implala-sql.sh#prepare-implala-sql.sh</file> <capture-output/> </shell> <ok to="a2"/> <error to="a0"/> </action> ...
And then use it in Impala step as a <file>
parameter:
... <action name="a2"> <shell xmlns="uri:oozie:shell-action:0.2"> <job-tracker>${resourceManager}</job-tracker> <name-node>${nameNode}</name-node> <exec>impala-shell</exec> <argument>-i</argument> <argument>${impalad}</argument> <argument>-f</argument> <argument>query.sql</argument> <env-var>PYTHON_EGG_CACHE=./myeggs</env-var> <file>${wf:actionData("a1")["HDFS_IMPALA_SCRIPT"]}#query.sql</file> <capture-output/> </shell> <ok to="a99"/> <error to="a0"/> </action> <end name="a99"/></workflow-app>
Just don't forget about PYTHON_EGG_CACHE for impala-shell
(or bash
-> impala-shell
).