How to rename a huge number of files in Hadoop/Spark?

hadoop



You need to call FileSystem.get inside the VoidFunction too.

The driver needs a FileSystem to get the list of files, but each worker also needs a FileSystem to do the renaming. The driver cannot pass its FileSystem to the workers, because it is not Serializable. But the workers can get their own FileSystem just fine.
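
For example, a minimal sketch in the Java API (not the asker's actual code; the filePaths RDD, the sc context, and the "renamed-" naming scheme are all assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.api.java.JavaRDD;

// sc is an existing JavaSparkContext; filePaths holds one source path per record
JavaRDD<String> filePaths = sc.textFile("/tmp/files-to-rename.txt");

// the lambda becomes Spark's VoidFunction<String>, so FileSystem.get runs on the workers
filePaths.foreach(line -> {
    FileSystem fs = FileSystem.get(new Configuration());
    Path src = new Path(line);
    fs.rename(src, new Path(src.getParent(), "renamed-" + src.getName()));
});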

In the Scala API you could use RDD.foreachPartition to write the code in a way that you only do FileSystem.get once per partition, instead of once per line. It is probably available in the Java API as well.
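
It is available in the Java API as JavaRDD.foreachPartition; a rough sketch, reusing the hypothetical filePaths RDD and naming scheme from above:

// one FileSystem per partition instead of one per record
filePaths.foreachPartition(iter -> {
    FileSystem fs = FileSystem.get(new Configuration());
    while (iter.hasNext()) {
        Path src = new Path(iter.next());
        fs.rename(src, new Path(src.getParent(), "renamed-" + src.getName()));
    }
});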


The problem is that you are trying to serialize the ghfs object. If you use mapPartitions and recreate the ghfs object in each partition you will be able to run your code with just a couple of minor changes.
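
A sketch of what that could look like with the Java API's mapPartitions (Spark 2.x signature, where the function returns an Iterator); the filePaths RDD and the rename scheme are assumptions, and FileSystem.get stands in for however the asker originally constructed ghfs:

import java.util.ArrayList;
import java.util.List;

JavaRDD<String> results = filePaths.mapPartitions(iter -> {
    // recreate the FileSystem on the executor, once per partition,
    // instead of serializing the driver's ghfs object
    FileSystem fs = FileSystem.get(new Configuration());
    List<String> renamed = new ArrayList<>();
    while (iter.hasNext()) {
        Path src = new Path(iter.next());
        Path dst = new Path(src.getParent(), "renamed-" + src.getName());
        renamed.add(src + " -> " + fs.rename(src, dst));
    }
    return renamed.iterator();
});
results.count();  // mapPartitions is lazy, so force it to run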


I would recommend just renaming them with the FileSystem class like you were, but in a non-MapReduce context (just in the driver); it's not a big deal to rename 100k files. If it's too slow, then you can attempt to multithread it (see the sketch after the snippet below). Just do something like

FileSystem fileSystem = FileSystem.get(new Configuration());
FileStatus[] files = fileSystem.listStatus(new Path("/path/to/directory"));  // directory with the files to rename
for (FileStatus file : files) {
    Path source = file.getPath();
    // pick whatever new name you need for each file
    Path target = new Path(source.getParent(), "renamed-" + source.getName());
    fileSystem.rename(source, target);
}
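
If that serial loop really is too slow, a rough way to multithread it in the driver could look like this (a sketch only; it assumes the renames are independent and that the FileSystem client tolerates concurrent calls, which the HDFS client generally does):

import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

ExecutorService pool = Executors.newFixedThreadPool(16);  // thread count is a guess, tune it
for (FileStatus file : files) {
    Path source = file.getPath();
    Path target = new Path(source.getParent(), "renamed-" + source.getName());
    pool.submit(() -> {
        try {
            fileSystem.rename(source, target);
        } catch (IOException e) {
            // log and decide whether to retry failed renames
            e.printStackTrace();
        }
    });
}
pool.shutdown();
pool.awaitTermination(1, TimeUnit.HOURS);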

Btw, the error you're getting in Spark is because Spark requires the objects it ships to the executors to implement Serializable, which FileSystem does not.


I can't confirm this, but it would seem that every rename in HDFS involves the NameNode, since it tracks the full directory structure and node locations of files (confirmation link), meaning renames can't be done efficiently in parallel. As per this answer, renaming is a metadata-only operation, so it should be very fast when run serially.