How to rename a huge number of files in Hadoop/Spark?

hadoop



You need to call FileSystem.get inside the VoidFunction too.

The driver needs a FileSystem to get the list of files, but each worker also needs a FileSystem to do the renaming. The driver cannot pass its FileSystem to the workers, because it is not Serializable. But the workers can get their own FileSystem just fine.
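
For example, a minimal sketch in the Java API (not the asker's actual code; the filePaths RDD, the sc context, and the "renamed-" naming scheme are all assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.api.java.JavaRDD;

// sc is an existing JavaSparkContext; filePaths holds one source path per record
JavaRDD<String> filePaths = sc.textFile("/tmp/files-to-rename.txt");

// the lambda becomes Spark's VoidFunction<String>, so FileSystem.get runs on the workers
filePaths.foreach(line -> {
    FileSystem fs = FileSystem.get(new Configuration());
    Path src = new Path(line);
    fs.rename(src, new Path(src.getParent(), "renamed-" + src.getName()));
});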

In the Scala API you could use RDD.foreachPartition to write the code in a way that you only do FileSystem.get once per partition, instead of once per line. It is probably available in the Java API as well.
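
It is available in the Java API as JavaRDD.foreachPartition; a rough sketch, reusing the hypothetical filePaths RDD and naming scheme from above:

// one FileSystem per partition instead of one per record
filePaths.foreachPartition(iter -> {
    FileSystem fs = FileSystem.get(new Configuration());
    while (iter.hasNext()) {
        Path src = new Path(iter.next());
        fs.rename(src, new Path(src.getParent(), "renamed-" + src.getName()));
    }
});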


The problem is that you are trying to serialize the ghfs object. If you use mapPartitions and recreate the ghfs object in each partition you will be able to run your code with just a couple of minor changes.
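
A sketch of what that could look like with the Java API's mapPartitions (Spark 2.x signature, where the function returns an Iterator); the filePaths RDD and the rename scheme are assumptions, and FileSystem.get stands in for however the asker originally constructed ghfs:

import java.util.ArrayList;
import java.util.List;

JavaRDD<String> results = filePaths.mapPartitions(iter -> {
    // recreate the FileSystem on the executor, once per partition,
    // instead of serializing the driver's ghfs object
    FileSystem fs = FileSystem.get(new Configuration());
    List<String> renamed = new ArrayList<>();
    while (iter.hasNext()) {
        Path src = new Path(iter.next());
        Path dst = new Path(src.getParent(), "renamed-" + src.getName());
        renamed.add(src + " -> " + fs.rename(src, dst));
    }
    return renamed.iterator();
});
results.count();  // mapPartitions is lazy, so force it to run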


I would recommend just renaming them with the FileSystem class like you were, but in a non-MapReduce context (just in the driver); it's not a big deal to rename 100k files. If it's too slow, then you can attempt to multithread it (see the sketch after the snippet below). Just do something like

FileSystem fileSystem = FileSystem.get(new Configuration());
FileStatus[] files = fileSystem.listStatus(new Path("/path/to/directory"));  // directory with the files to rename
for (FileStatus file : files) {
    Path source = file.getPath();
    // pick whatever new name you need for each file
    Path target = new Path(source.getParent(), "renamed-" + source.getName());
    fileSystem.rename(source, target);
}
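
If that serial loop really is too slow, a rough way to multithread it in the driver could look like this (a sketch only; it assumes the renames are independent and that the FileSystem client tolerates concurrent calls, which the HDFS client generally does):

import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

ExecutorService pool = Executors.newFixedThreadPool(16);  // thread count is a guess, tune it
for (FileStatus file : files) {
    Path source = file.getPath();
    Path target = new Path(source.getParent(), "renamed-" + source.getName());
    pool.submit(() -> {
        try {
            fileSystem.rename(source, target);
        } catch (IOException e) {
            // log and decide whether to retry failed renames
            e.printStackTrace();
        }
    });
}
pool.shutdown();
pool.awaitTermination(1, TimeUnit.HOURS);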

Btw, the error you're getting in Spark is because Spark requires the objects it ships to the executors to implement Serializable, which FileSystem does not.


I can't confirm this, but it would seem that every rename in HDFS involves the NameNode, since it tracks the full directory structure and node locations of files (confirmation link), meaning renames can't be done efficiently in parallel. As per this answer, renaming is a metadata-only operation, so it should be very fast when run serially.