
Counting the number of compressed files in hdfs


The hdfs dfs -count command gives the following information:

  • Directory count
  • File count
  • Content size
  • Path name

For example, I get the following output for my /tmp/ folder:

CMD> hdfs dfs -count /tmp/
          14           33       193414280395 /tmp

Using this command, you can get the count of .snappy files like this:

CMD> hdfs dfs -count -v /tmp/*.snappy

You will get output like this:

DIR_COUNT   FILE_COUNT       CONTENT_SIZE PATHNAME
        0            1               4623 /tmp/Links.txt.snappy
        0            1             190939 /tmp/inclusions_00000005.snappy
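Each matching file gets its own row, so if you also want the totals across all matching files, you can sum the FILE_COUNT and CONTENT_SIZE columns with awk. This is just a sketch, assuming the same /tmp/*.snappy example paths (without -v, so there is no header row to skip):

hdfs dfs -count /tmp/*.snappy | awk '{files += $2; bytes += $3} END {print files " files, " bytes " bytes"}'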

To get the count of .snappy files, you can also execute the following commands:

  • Get the count of .snappy files directly under a folder:

    Just execute the hadoop fs -ls command. For example, to get the number of .snappy files under the /user/data folder, execute:

    hadoop fs -ls /user/data/*.snappy | wc -l
  • Recursively get the count of all the .snappy files under a folder:

    Execute the hadoop fsck command (an alternative using a recursive listing is sketched after this list). For example:

    hadoop fsck /user/data/ -files | grep ".snappy" | wc -l
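
As an alternative to fsck, a recursive directory listing through the shell works as well. This is a sketch assuming the same /user/data example folder and that the files end with the .snappy suffix:

hadoop fs -ls -R /user/data | grep '\.snappy$' | wc -l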

EDIT: All files greater than 30 MB

If you want to find all the files with a size greater than or equal to 30 MB (30 * 1024 * 1024 = 31457280 bytes), execute the following command:

hadoop fsck /user/data -files | grep ".snappy" | gawk '{if ($2 ~ /^[0-9]+$/ && $2>=31457280) print $1,$2;}'

This will print $1 as the file name and $2 as the size of the file in bytes.

If you want the count of the files, then just pipe it to wc -l as shown below:

hadoop fsck /user/data -files | grep ".snappy" | gawk '{if ($2 ~ /^[0-9]+$/ && $2>=31457280) print $1,$2;}' | wc -l
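
If the threshold changes often, a small variation avoids hard-coding the byte value. This is a sketch that passes the limit in via gawk's -v option and shell arithmetic, using the same example /user/data path:

hadoop fsck /user/data -files | grep ".snappy" | gawk -v min=$((30*1024*1024)) '{if ($2 ~ /^[0-9]+$/ && $2 >= min) print $1, $2}'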