Batch rename in hadoop


If you don't want to write Java code for this, I think using the HDFS command-line interface is your best bet:

mv in Hadoop

hadoop fs -mv URI [URI …] <dest>

You can get the paths using a small one liner:

% hadoop fs -ls /user/foo/bar | awk  '!/^d/ {print $8}'
/user/foo/bar/blacklist
/user/foo/bar/books-eng
...

The awk filter removes directories from the output. Now you can put these files into a variable:

% files=$(hadoop fs -ls /user/foo/bar | awk  '!/^d/ {print $8}')

and rename each file:

% for f in $files; do hadoop fs -mv "$f" "$f.lzo"; done

You can also use awk to filter the files by other criteria. The following should exclude files whose listing line matches the regex nolzo. It is untested, but it shows how you can write flexible filters this way.

% files=$(hadoop fs -ls /user/foo/bar | awk  '!/^d|nolzo/ {print $8}' )
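If you need several such filters, the awk expression can be wrapped in a small shell function. This is only a sketch: the name filter_paths and its single exclude-regex argument are made up for illustration, and it assumes the path is the 8th field of the listing, as in the one-liners above.

```shell
# filter_paths: read `hadoop fs -ls`-style lines on stdin and print the path
# (8th field) of every entry that is not a directory and does not match the
# exclude regex passed as $1. Lines with fewer than 8 fields, such as the
# "Found N items" header, are skipped. An empty $1 disables the exclude test.
filter_paths() {
  awk -v exclude="$1" '
    NF >= 8 && !/^d/ && (exclude == "" || $0 !~ exclude) { print $8 }
  '
}
```

It would be used as files=$(hadoop fs -ls /user/foo/bar | filter_paths 'nolzo'). Like the one-liner, it breaks on paths that contain spaces.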

To test whether it works, replace the hadoop command with echo:

% for f in $files; do echo "$f" "$f.lzo"; done
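To avoid keeping two copies of the loop in sync, the dry run and the real rename can share one function that takes the command to run as an argument. This is a rough sketch; add_suffix is a hypothetical helper name, not part of Hadoop.

```shell
# add_suffix: for each path read from stdin (one per line), run
#   <command...> <path> <path><suffix>
# $1 is the suffix; the remaining arguments are the command to run,
# e.g. `echo` for a dry run or `hadoop fs -mv` for the real rename.
add_suffix() {
  suffix=$1
  shift
  while IFS= read -r path; do
    [ -n "$path" ] || continue              # skip blank lines
    # </dev/null keeps the command from consuming the list on stdin
    "$@" "$path" "$path$suffix" </dev/null
  done
}
```

A dry run would be printf '%s\n' $files | add_suffix .lzo echo, and the real rename printf '%s\n' $files | add_suffix .lzo hadoop fs -mv.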

Edit: Updated examples to use awk instead of sed for more reliable output.

The "right" way to do it is probably to use the HDFS Java API. However, using the shell is probably faster and more flexible for most jobs.


When I had to rename many files, I searched for an efficient solution and came across this question and thi-duong-nguyen's remark that renaming many files is very slow. I implemented a Java solution for batch rename operations, which I can highly recommend since it is orders of magnitude faster. The basic idea is to use the rename() method of org.apache.hadoop.fs.FileSystem:

Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://master:8020");
FileSystem dfs = FileSystem.get(conf);
dfs.rename(from, to);

where from and to are org.apache.hadoop.fs.Path objects. The easiest way is to create a list of files to be renamed (including their new name) and feed this list to the Java program.
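Such a list can be produced with the same shell tools used earlier in this thread. The sketch below assumes a tab-separated "old path, new path" format with one pair per line. That format and the name make_mapping are assumptions for illustration, so check what input format your Java program actually expects.

```shell
# make_mapping: read one path per line on stdin and print "old<TAB>new",
# where new is the old path with the suffix given as $1 appended.
make_mapping() {
  suffix=$1
  while IFS= read -r path; do
    [ -n "$path" ] || continue              # skip blank lines
    printf '%s\t%s\n' "$path" "$path$suffix"
  done
}
```

For example, hadoop fs -ls /user/foo/bar | awk '!/^d/ {print $8}' | make_mapping .lzo prints one mapping line per file, which can then be piped into the rename program.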

I have published the complete implementation, which reads such a mapping from STDIN. It renamed 100 files in less than four seconds (the same time was required to rename 7000 files!), while the hdfs dfs -mv based approach described above requires 4 minutes to rename 100 files.


We created a utility for bulk renaming of files in HDFS: https://github.com/tenaris/hdfs-rename. The tool is limited, but if you want, you can contribute improvements such as recursive renaming, awk-style regex syntax and so on.