Counting the number of compressed files in HDFS
The hdfs dfs -count command gives the following information:
- Directory count
- File count
- Content size
- Path name
For example, I get the following output on my /tmp/ folder:

CMD> hdfs dfs -count /tmp/
14 33 193414280395 /tmp
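If the raw byte values are hard to read, newer Hadoop releases also accept an -h flag on -count for human-readable sizes. A hedged sketch, since flag availability depends on your Hadoop version:

CMD> hdfs dfs -count -h /tmp/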
Using this command, you can't directly get the count of .snappy files. If you run it with a wildcard like this:
CMD> hdfs dfs -count -v /tmp/*.snappy
You will get output like this:
DIR_COUNT   FILE_COUNT   CONTENT_SIZE   PATHNAME
0           1            4623           /tmp/Links.txt.snappy
0           1            190939         /tmp/inclusions_00000005.snappy
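Each matched file shows up as its own row, so if you want a single number you can total the FILE_COUNT column yourself. A minimal sketch using awk over the output above:

hdfs dfs -count /tmp/*.snappy | awk '{files += $2} END {print files}'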
To get the count of .snappy files, you can also execute the following commands:

Get the count of .snappy files directly under a folder: Just execute the hadoop fs -ls command. For example, to get the number of .snappy files under the /user/data folder, just execute:

hadoop fs -ls /user/data/*.snappy | wc -l
Recursively get the count of all the .snappy files under a folder: Execute the hadoop fsck command. For example:

hadoop fsck /user/data/ -files | grep ".snappy" | wc -l
EDIT: All files greater than 30 MB

If you want to find all the files with a size greater than or equal to 30 MB (30 * 1024 * 1024 = 31457280 bytes), you need to execute the following command:
hadoop fsck /user/data -files | grep ".snappy" | gawk '{if ($2 ~ /^[0-9]+$/ && $2>=31457280) print $1,$2;}'
This prints $1 as the file name and $2 as the size of the file in bytes.
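For context, the fields line up because hdfs fsck -files prints one line per file that begins with the path followed by its size in bytes. An illustrative line (the exact wording can differ between Hadoop versions):

/user/data/inclusions_00000005.snappy 190939 bytes, 1 block(s):  OK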
If you want the count of such files, just pipe the output to wc -l as shown below:
hadoop fsck /user/data -files | grep ".snappy" | gawk '{if ($2 ~ /^[0-9]+$/ && $2>=31457280) print $1,$2;}' | wc -l
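A small variation on the same pipeline, if you want the combined size of those large files as well as their count (a sketch reusing the same field positions):

hadoop fsck /user/data -files | grep ".snappy" | gawk '{if ($2 ~ /^[0-9]+$/ && $2>=31457280) {count++; bytes+=$2}} END {print count, bytes}'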