Pig Latin: Load multiple files from a date range (part of the directory structure)


As zjffdu said, the path expansion is done by the shell. One common way to solve your problem is to use Pig parameters (which is a good way to make your script more reusable anyway):

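shell (bash expands the range inside the command substitution, tr joins the dates with commas, and the double quotes keep the result as one parameter; unquoted, the range would expand into three separate arguments):

pig -f script.pig -param input="/user/training/test/{$(echo {20100810..20100812} | tr ' ' ',')}"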

script.pig:

temp = LOAD '$input' USING SomeLoader() AS (...);


Pig is processing your file name pattern using the Hadoop file glob utilities, not the shell's glob utilities. Hadoop's are documented in the Javadoc for FileSystem.globStatus. As you can see there, Hadoop does not support the '..' operator for a range. It seems to me you have two options: either write out the {date1,date2,date3,...,dateN} list by hand, which is probably the way to go if this is a rare use case, or write a wrapper script which generates that list for you. Building such a list from a date range should be a trivial task for the scripting language of your choice. For my application, I've gone with the generated-list route, and it's working fine (CDH3 distribution).
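A wrapper along those lines could look roughly like the sketch below. It is only an illustration, not the script used above: the function name date_glob is made up, dates are assumed to be in YYYYMMDD form, and it relies on GNU date (the -d option), so BSD/macOS date would need different flags.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a Hadoop-style glob list {d1,d2,...,dN}
# covering every day from $1 to $2 inclusive (both given as YYYYMMDD).
# Assumption: GNU date is available (uses the -d option).
date_glob() {
    local current="$1" end="$2" list=""
    while [ "$current" -le "$end" ]; do
        # Append the current day, adding a comma only after the first entry.
        list="${list:+$list,}$current"
        # Advance one calendar day; -u sidesteps daylight-saving oddities.
        current=$(date -u -d "$current +1 day" +%Y%m%d)
    done
    printf '{%s}\n' "$list"
}

# Example call (commented out so the sketch runs without Pig installed).
# The double quotes keep the shell from touching the generated braces:
# pig -f script.pig -param input="/user/training/test/$(date_glob 20100810 20100812)"
```

Because the days are stepped with date arithmetic rather than by counting integers, the range can cross month boundaries. The sketch assumes a range of at least two days; how Hadoop treats an empty or one-element list is not checked here.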


I ran across this answer when I was having trouble creating a file glob in a script and then passing it as a parameter into a Pig script.

None of the current answers applied to my situation, but I did find a general answer that might be helpful here.

In my case, the shell was expanding the glob and then passing the expanded result into the script, which understandably broke the Pig parser.

Surrounding the glob in double quotes protects it from being expanded by the shell and passes it as-is into the command.

WON'T WORK:

$ pig -f my-pig-file.pig -p INPUTFILEMASK='/logs/file{01,02,06}.log' -p OTHERPARAM=6

WILL WORK:

$ pig -f my-pig-file.pig -p INPUTFILEMASK="/logs/file{01,02,06}.log" -p OTHERPARAM=6
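To see what the shell does to a brace pattern before Pig ever runs, a small sketch like the following can help. The count_args function is a hypothetical helper, not part of Pig; it only reports how many arguments it received. Note that in plain bash single quotes also suppress brace expansion, so if a quoted form still fails in your setup, the cause may lie somewhere other than the interactive shell (that part is not verified here).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the number of arguments the shell passed in.
count_args() { echo "$#"; }

# Unquoted: bash brace expansion splits the word into one argument per
# alternative before the command is invoked.
count_args INPUTFILEMASK=/logs/file{01,02,06}.log      # prints 3 in bash

# Double-quoted: the braces reach the command untouched, as one argument,
# which is what a Pig parameter holding a Hadoop glob needs.
count_args INPUTFILEMASK="/logs/file{01,02,06}.log"    # prints 1
```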

I hope this saves someone some pain and agony.