Grep across multiple files in Hadoop Filesystem
This is a hadoop "filesystem", not a POSIX one, so try this:
hadoop fs -ls /apps/hdmi-technology/b_dps/real-time | awk '{print $8}' | \
while read f
do
  hadoop fs -cat $f | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo $f
done
This should work, but it is serial and so may be slow. If your cluster can take the heat, we can parallelize:
hadoop fs -ls /apps/hdmi-technology/b_dps/real-time | awk '{print $8}' | \
  xargs -n 1 -I ^ -P 10 bash -c \
  "hadoop fs -cat ^ | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo ^"
Notice the -P 10 option to xargs: it sets how many files we download and search in parallel. Start low and increase the number until you saturate disk I/O or network bandwidth, whichever is the bottleneck in your configuration.
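You can get a feel for this xargs pattern locally, with no cluster involved, before pointing it at HDFS. The directory and file names below are invented for the demo, and plain grep on local files stands in for hadoop fs -cat:

```shell
# Local dry run of the parallel-grep pattern; no Hadoop required.
demo=$(mktemp -d)
printf 'hello\n' > "$demo/a.txt"
printf 'bcd4bc3e1380a56108f486a4fffbc8dc\n' > "$demo/b.txt"
printf 'noise\n' > "$demo/c.txt"
# List the files, then grep each one in up to 2 parallel workers,
# printing the names of the files that contain the pattern:
ls "$demo"/*.txt | xargs -n 1 -I ^ -P 2 sh -c \
  'grep -q bcd4bc3e1380a56108f486a4fffbc8dc ^ && echo ^'
```

Only b.txt should be printed; raising -P changes how many greps run at once, not the result.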
EDIT: Given that you're on SunOS (which is slightly brain-dead), try this:
hadoop fs -ls /apps/hdmi-technology/b_dps/real-time | awk '{print $8}' | while read f; do hadoop fs -cat $f | grep bcd4bc3e1380a56108f486a4fffbc8dc >/dev/null && echo $f; done
Using hadoop fs -cat (or the more generic hadoop fs -text) might be feasible if you just have two 1 GB files. For 100 files, though, I would use the streaming API, because it can run ad-hoc queries without resorting to a full-fledged MapReduce job. E.g., in your case, create a script get_filename_for_pattern.sh:
#!/bin/bash
grep -q $1 && echo $mapreduce_map_input_file
cat >/dev/null # ignore the rest
Note that you have to read the whole input in order to avoid getting java.io.IOException: Stream closed exceptions; that is what the trailing cat >/dev/null is for.
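You can sanity-check the script locally before submitting a job: streaming normally exports mapreduce_map_input_file for each map task, so we fake it here with a made-up path:

```shell
# Write the mapper script to a temp file and exercise it on fake stdin.
script=$(mktemp)
cat > "$script" <<'EOF'
#!/bin/bash
grep -q $1 && echo $mapreduce_map_input_file
cat >/dev/null # ignore the rest
EOF
chmod +x "$script"
# Streaming would set mapreduce_map_input_file; we fake it (placeholder path):
printf 'xx\nbcd4bc3e1380a56108f486a4fffbc8dc\nyy\n' | \
  mapreduce_map_input_file=/fake/input/part-00000 \
  "$script" bcd4bc3e1380a56108f486a4fffbc8dc
# prints: /fake/input/part-00000
```

If the pattern is absent from stdin, the script prints nothing and still drains its input, which is exactly the behavior the streaming job relies on.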
Then issue the commands
hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
  -Dstream.non.zero.exit.is.failure=false \
  -files get_filename_for_pattern.sh \
  -numReduceTasks 1 \
  -mapper "get_filename_for_pattern.sh bcd4bc3e1380a56108f486a4fffbc8dc" \
  -reducer "uniq" \
  -input /apps/hdmi-technology/b_dps/real-time/* \
  -output /tmp/files_matching_bcd4bc3e1380a56108f486a4fffbc8dc

hadoop fs -cat /tmp/files_matching_bcd4bc3e1380a56108f486a4fffbc8dc/*
In newer distributions, mapred streaming should work instead of hadoop jar $HADOOP_HOME/hadoop-streaming.jar. In the latter case you have to set $HADOOP_HOME correctly for the jar to be found (or provide the full path to it directly).
For simpler queries you don't even need a script; you can provide the command to the -mapper parameter directly. But for anything slightly complex it's preferable to use a script, because getting the escaping right can be a chore.
If you don't need a reduce phase, provide the symbolic NONE parameter to the -reducer option (or just use -numReduceTasks 0). But in your case it's useful to have a reduce phase so that the output is consolidated into a single file.
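The consolidation works because streaming sorts the mapper output before it reaches the single reducer: a file whose pattern occurs in several input splits is echoed once per matching split, and after the sort those duplicates sit adjacent, so uniq collapses them. Locally that reduce step looks like this (paths invented):

```shell
# Simulated mapper output: one line per matching split, so the same file
# may appear more than once, in arbitrary order across map tasks.
printf '/data/f2\n/data/f1\n/data/f2\n' | sort | uniq
# prints:
# /data/f1
# /data/f2
```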