Curl, Kerberos authenticated file copy on hadoop Curl, Kerberos authenticated file copy on hadoop curl curl

Curl, Kerberos authenticated file copy on hadoop


WebHDFS alone does not offer a copy operation in its interface. The WebHDFS interface provides lower-level file system primitives. A copy operation is a higher-level application that uses those primitive operations to accomplish its work.

The implementation of hdfs dfs -cp against a webhdfs: URL essentially combines op=OPEN and op=CREATE calls to complete the copy. You could potentially re-implement a subset of that logic in your script. If you want to pursue that direction, the CopyCommands class is a good starting point in the Apache Hadoop codebase for seeing how that works.

Here is a starting point for how this could work. There is an existing file at /hello1 that we want to copy to /hello2. This script calls curl to open /hello1 and pipes the output to another curl command, which creates /hello2, using stdin as the input source.

> hdfs dfs -ls /hello*-rw-r--r--   3 cnauroth supergroup          6 2017-07-06 09:15 /hello1> curl -sS -L 'http://localhost:9870/webhdfs/v1/hello1?op=OPEN' |>     curl -sS -L -X PUT -d @- 'http://localhost:9870/webhdfs/v1/hello2?op=CREATE&user.name=cnauroth'> hdfs dfs -ls /hello*-rw-r--r--   3 cnauroth supergroup          6 2017-07-06 09:15 /hello1-rw-r--r--   3 cnauroth supergroup          5 2017-07-06 09:20 /hello2

But my requirement is to connect from an external unix box, automated kerberos login into hdfs and then move the files within hdfs, hence the curl.

Another option could be a client-only Hadoop installation on your external host. You would have an installation of the Hadoop software and the same configuration files from the Hadoop cluster, and then you could issue the hdfs dfs -cp commands instead of running curl commands against HDFS.


I don't know what distribution you use, if you use Cloudera, try using BDR (Backup, Data recovery module) using REST APIs.

I used it to copy the files/folders within hadoop cluster and across hadoop clusters, it works against encrypted zones(TDE) as well