pig load udf for loading files from several sub directories


Because your folder structure has a fixed depth, I think it's as simple as passing a glob pattern as your input path:

A = LOAD 'maildir/*/inbox/1.txt' USING PigStorage('\t') AS (f1,f2,f3);

You probably don't need to create your own UDF for this. The built-in PigStorage loader should be able to handle the files, assuming they are in some delimited format (the above example assumes 3 fields, tab delimited).

If there are multiple txt files in each inbox, use *.txt rather than 1.txt. Finally, if the maildir root directory is not in your user's home directory, you should use the absolute path to the folder, say /data/maildir/*/inbox/*.txt
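
To make those two variations concrete, here is a minimal sketch that reuses the inbox folder from the LOAD statement above. The /data/maildir location is only the example path from this answer. The chararray types and the alias names are assumptions added for illustration, so adjust them to your actual data.

```pig
-- Load every .txt file in each inbox; the relative path is resolved
-- against your home directory on the cluster.
inbox_rel = LOAD 'maildir/*/inbox/*.txt' USING PigStorage('\t')
            AS (f1:chararray, f2:chararray, f3:chararray);

-- Same load with an absolute path, for when maildir lives elsewhere.
-- '/data/maildir' is the example location, not a required one.
inbox_abs = LOAD '/data/maildir/*/inbox/*.txt' USING PigStorage('\t')
            AS (f1:chararray, f2:chararray, f3:chararray);

-- Look at a few rows to confirm the glob matched the files you expected.
first_rows = LIMIT inbox_abs 10;
DUMP first_rows;
```

You would normally keep only one of the two LOAD statements. If the pattern matches no files, the job should fail when it runs, which is a quick way to catch a wrong path.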