
Save a Spark RDD using mapPartitions with iterator


A few things:

  • Never call Iterator.size if you plan to use the data later. Scala Iterators are TraversableOnce: the only way to compute an Iterator's size is to traverse all of its elements, and after that there is no more data left to read (see the first sketch after this list).
  • Don't use transformations like mapPartitions for side effects. If you want to perform some kind of IO, use actions like foreach / foreachPartition instead; relying on a transformation for IO is bad practice and doesn't guarantee that a given piece of code will be executed exactly once (second sketch below).
  • A local path inside an action or transformation is a path on the particular worker that executes it. If you want to write directly to the client machine, fetch the data first with collect or toLocalIterator. It is usually better, though, to write to distributed storage and fetch the data later (third sketch below).
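
To see the first point concretely, here is a minimal, self-contained Scala sketch of the pitfall and one workaround (note that Iterator.duplicate buffers elements, so it costs memory on large partitions):

    // Calling size consumes the iterator.
    val it = Iterator(1, 2, 3)
    val n  = it.size            // traverses everything: n == 3
    println(it.hasNext)         // false -- the data is gone

    // If both the count and the elements are needed, duplicate first.
    val (forCount, forData) = Iterator(1, 2, 3).duplicate
    val count = forCount.size   // consumes only the first copy
    val items = forData.toList  // the second copy is still intact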
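
For the second point, a minimal sketch of per-partition IO done with an action instead of mapPartitions; rdd and the output path are hypothetical placeholders, and the output is assumed to go to HDFS:

    import java.nio.charset.StandardCharsets
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.TaskContext

    rdd.foreachPartition { iter =>
      // Runs on the worker: one file per partition on shared storage.
      val fs  = FileSystem.get(new Configuration())
      val out = fs.create(new Path(s"/tmp/output/part-${TaskContext.getPartitionId()}"))
      try {
        // Traverse the iterator exactly once -- no Iterator.size beforehand.
        iter.foreach { record =>
          out.write((record.toString + "\n").getBytes(StandardCharsets.UTF_8))
        }
      } finally {
        out.close()
      }
    }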
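
And for the third point, a sketch of fetching results back to the client machine, assuming they fit through the driver; rdd and the local path are placeholders:

    import java.io.PrintWriter

    val writer = new PrintWriter("/tmp/local-output.txt")  // path on the driver/client
    try {
      // toLocalIterator streams one partition at a time instead of
      // collecting the whole RDD into driver memory at once.
      rdd.toLocalIterator.foreach(record => writer.println(record))
    } finally {
      writer.close()
    }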


Java 7 provides the means to watch directories:

https://docs.oracle.com/javase/tutorial/essential/io/notification.html

The idea is to create a watch service and register it with the directory of interest, specifying the events you care about (file creation, deletion, and so on). While watching, you will be notified of any such events and can then take whatever action you want.
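
A minimal Scala sketch of that loop, assuming a hypothetical local directory /data/incoming (WatchService watches local file systems, not HDFS):

    import java.nio.file.{FileSystems, Paths, StandardWatchEventKinds => K}

    val dir     = Paths.get("/data/incoming")
    val watcher = FileSystems.getDefault.newWatchService()
    dir.register(watcher, K.ENTRY_CREATE, K.ENTRY_DELETE, K.ENTRY_MODIFY)

    // Block until events arrive, handle them, then re-arm the key.
    var valid = true
    while (valid) {
      val key = watcher.take()             // blocks until an event is available
      key.pollEvents().forEach { event =>
        println(s"${event.kind()}: ${event.context()}")  // react to the event here
      }
      valid = key.reset()                  // false once the directory is inaccessible
    }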

You will have to depend heavily on the Java HDFS API wherever applicable, since WatchService only covers local directories.
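
For example, a sketch of polling an HDFS directory with the Hadoop FileSystem API; the path is a placeholder:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Picks up core-site.xml / hdfs-site.xml from the classpath.
    val fs = FileSystem.get(new Configuration())
    fs.listStatus(new Path("/data/incoming")).foreach { status =>
      println(s"${status.getPath}: ${status.getLen} bytes, modified ${status.getModificationTime}")
    }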

Run the program in the background, since it waits for events forever. (You can add logic to quit once you have done whatever you need.)

Alternatively, shell scripting can also help here.

Be aware of the HDFS coherency model while reading files: content written to an open file is not guaranteed to be visible to readers until the writer flushes or closes it.
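
A small sketch of that caveat, with a placeholder path:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs  = FileSystem.get(new Configuration())
    val out = fs.create(new Path("/data/incoming/file.txt"))
    out.writeBytes("partial record")
    // A concurrent reader is only guaranteed to see data up to the last
    // hflush/hsync; bytes from a plain write() alone may not be visible yet.
    out.hflush()
    out.close()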

Hope this gives you some ideas.