Curl, Kerberos authenticated file copy on hadoop

shell hadoop curl webhdfs

WebHDFS alone does not offer a copy operation in its interface. The WebHDFS interface provides lower-level file system primitives. A copy operation is a higher-level application that uses those primitive operations to accomplish its work.

The implementation of hdfs dfs -cp against a webhdfs: URL essentially combines op=OPEN and op=CREATE calls to complete the copy. You could potentially re-implement a subset of that logic in your script. If you want to pursue that direction, the CopyCommands class is a good starting point in the Apache Hadoop codebase for seeing how that works.

Here is a starting point for how this could work. There is an existing file at /hello1 that we want to copy to /hello2. This script calls curl to open /hello1 and pipes the output to another curl command, which creates /hello2, using stdin as the input source.

> hdfs dfs -ls /hello*
-rw-r--r--   3 cnauroth supergroup          6 2017-07-06 09:15 /hello1
> curl -sS -L 'http://localhost:9870/webhdfs/v1/hello1?op=OPEN' |
>     curl -sS -L -X PUT -d @- 'http://localhost:9870/webhdfs/v1/hello2?op=CREATE&user.name=cnauroth'
> hdfs dfs -ls /hello*
-rw-r--r--   3 cnauroth supergroup          6 2017-07-06 09:15 /hello1
-rw-r--r--   3 cnauroth supergroup          5 2017-07-06 09:20 /hello2
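On a Kerberos-secured cluster, the same pipe can probably be adapted by letting curl authenticate with SPNEGO (--negotiate -u :) instead of passing user.name. The following is only a sketch under assumptions: the NAMENODE address and the function names webhdfs_url and webhdfs_copy are made up for illustration, and a valid Kerberos ticket is assumed to exist already. It also swaps -d for --data-binary, because -d strips newlines from the input, which would explain why /hello2 in the listing above is one byte shorter than /hello1.

```shell
# Hypothetical NameNode address; override it in the environment for a real cluster.
NAMENODE="${NAMENODE:-http://localhost:9870}"

# Build the WebHDFS URL for a path and an operation (helper name is illustrative).
webhdfs_url() {
  printf '%s/webhdfs/v1%s?op=%s' "$NAMENODE" "$1" "$2"
}

# Copy $1 to $2 by streaming op=OPEN output into op=CREATE.
# --negotiate -u :  authenticates via SPNEGO with the current Kerberos ticket.
# --data-binary @-  sends stdin unmodified (it is buffered in memory first).
webhdfs_copy() {
  curl -sS -L --negotiate -u : "$(webhdfs_url "$1" OPEN)" |
    curl -sS -L --negotiate -u : -X PUT \
      -H 'Content-Type: application/octet-stream' \
      --data-binary @- "$(webhdfs_url "$2" CREATE)"
}
```

After obtaining a ticket (for example with kinit -kt and a keytab), the copy from the example would be webhdfs_copy /hello1 /hello2. Because curl holds the whole body in memory before sending it, this approach is better suited to small files.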

But my requirement is to connect from an external Unix box, log in to HDFS automatically via Kerberos, and then move the files within HDFS, hence the curl.

Another option could be a client-only Hadoop installation on your external host. You would have an installation of the Hadoop software and the same configuration files from the Hadoop cluster, and then you could issue the hdfs dfs -cp commands instead of running curl commands against HDFS.
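With a client-only installation, the automated login and the copy might look roughly like the sketch below. The function name hdfs_copy_with_keytab is hypothetical, and the keytab path and principal in the usage line are placeholders, not values from the question.

```shell
# Log in non-interactively from a keytab, then copy within HDFS.
# Arguments: keytab path, Kerberos principal, source path, destination path.
hdfs_copy_with_keytab() {
  # kinit -kt obtains a ticket from the keytab without prompting for a password.
  kinit -kt "$1" "$2" || return 1
  # Runs against the cluster described by the local Hadoop configuration files.
  hdfs dfs -cp "$3" "$4"
}
```

A call could look like hdfs_copy_with_keytab /path/to/user.keytab user@EXAMPLE.COM /hello1 /hello2. If the goal is a move rather than a copy, hdfs dfs -mv takes the same source and destination arguments.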

I don't know which distribution you use. If you use Cloudera, try BDR (the Backup and Disaster Recovery module) through its REST APIs.

I used it to copy files and folders both within a Hadoop cluster and across Hadoop clusters. It works against encryption zones (TDE) as well.
