Curl, Kerberos authenticated file copy on hadoop
WebHDFS alone does not offer a copy operation in its interface. The WebHDFS interface provides lower-level file system primitives. A copy operation is a higher-level application that uses those primitive operations to accomplish its work.
The implementation of hdfs dfs -cp
against a webhdfs:
URL essentially combines op=OPEN and op=CREATE calls to complete the copy. You could potentially re-implement a subset of that logic in your script. If you want to pursue that direction, the CopyCommands
class is a good starting point in the Apache Hadoop codebase for seeing how that works.
Here is a starting point for how this could work. There is an existing file at /hello1 that we want to copy to /hello2. This script calls curl
to open /hello1 and pipes the output to another curl
command, which creates /hello2, using stdin as the input source.
> hdfs dfs -ls /hello*-rw-r--r-- 3 cnauroth supergroup 6 2017-07-06 09:15 /hello1> curl -sS -L 'http://localhost:9870/webhdfs/v1/hello1?op=OPEN' |> curl -sS -L -X PUT -d @- 'http://localhost:9870/webhdfs/v1/hello2?op=CREATE&user.name=cnauroth'> hdfs dfs -ls /hello*-rw-r--r-- 3 cnauroth supergroup 6 2017-07-06 09:15 /hello1-rw-r--r-- 3 cnauroth supergroup 5 2017-07-06 09:20 /hello2
But my requirement is to connect from an external unix box, automated kerberos login into hdfs and then move the files within hdfs, hence the curl.
Another option could be a client-only Hadoop installation on your external host. You would have an installation of the Hadoop software and the same configuration files from the Hadoop cluster, and then you could issue the hdfs dfs -cp
commands instead of running curl
commands against HDFS.