Pass directories not files to hadoop-streaming?


I guess you need to investigate writing a custom InputFormat to which you can pass the root directory. It would create a split for each customer, and the record reader for each split would then walk that customer's directory and push the file contents to your mappers.
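As a rough illustration of that idea, here is a plain-Python sketch of the logic only. It is not Hadoop's InputFormat or RecordReader API, and the function names, customer names, and file names are all made up for the example: one "split" per customer directory, and a reader that walks the split and emits each line it finds.

```python
import os
import tempfile


def customer_splits(root):
    """One 'split' per customer: each immediate subdirectory of root."""
    return sorted(
        os.path.join(root, name)
        for name in os.listdir(root)
        if os.path.isdir(os.path.join(root, name))
    )


def read_records(split_dir):
    """Walk one customer's directory tree and yield (path, line) records."""
    for current_dir, subdirs, files in os.walk(split_dir):
        subdirs.sort()  # keep the traversal order deterministic
        for file_name in sorted(files):
            path = os.path.join(current_dir, file_name)
            with open(path) as handle:
                for line in handle:
                    yield path, line.rstrip("\n")


# Hypothetical layout: <root>/<customer>/<day>/events.log
with tempfile.TemporaryDirectory() as root:
    layout = (("customer_a", "day1"), ("customer_a", "day2"), ("customer_b", "day1"))
    for customer, day in layout:
        day_dir = os.path.join(root, customer, day)
        os.makedirs(day_dir)
        with open(os.path.join(day_dir, "events.log"), "w") as handle:
            handle.write("first\nsecond\n")

    splits = customer_splits(root)
    records_per_customer = {
        os.path.basename(split): len(list(read_records(split))) for split in splits
    }
```

In a real job the same two responsibilities would live in Java classes on the Hadoop side; the sketch only shows how the work divides between creating splits and reading records.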


Hadoop accepts glob patterns (wildcards, not full regular expressions) in input paths. I haven't experimented with complex patterns, but the simple placeholders ? and * do work.

So in your case, I think the following input path will work:

file:///mnt/logs/Customer_Name/*/*

The last asterisk might not be needed, as all the files in the final directory are automatically added as input paths.
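To get a feel for what such a pattern selects, here is a small local sketch that uses Python's glob module as a stand-in. Treat it as an approximation: Hadoop expands globs itself, with similar but not identical rules, and the day1/day2 directories and log file names below are invented for the example.

```python
import glob
import os
import tempfile


def build_sample_tree(root):
    """Create a hypothetical layout: Customer_Name/<day>/<log file>."""
    for day in ("day1", "day2"):
        day_dir = os.path.join(root, "Customer_Name", day)
        os.makedirs(day_dir)
        for name in ("a.log", "b.log"):
            with open(os.path.join(day_dir, name), "w") as handle:
                handle.write("sample line\n")


def files_matching(root, pattern):
    """Expand a glob; for each matched directory, add the files directly inside it."""
    found = []
    for match in glob.glob(os.path.join(root, pattern)):
        if os.path.isdir(match):
            # Mimics the behaviour described above for a directory given as input.
            found.extend(
                os.path.join(match, entry)
                for entry in os.listdir(match)
                if os.path.isfile(os.path.join(match, entry))
            )
        else:
            found.append(match)
    return sorted(found)


with tempfile.TemporaryDirectory() as root:
    build_sample_tree(root)
    with_star = files_matching(root, "Customer_Name/*/*")
    without_star = files_matching(root, "Customer_Name/*")
    relative = [os.path.relpath(path, root) for path in with_star]
```

With this layout both patterns end up selecting the same four files, which matches the point about the trailing asterisk. Note that each * covers exactly one directory level, so logs nested deeper than that would need an extra /* in the pattern.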