
Counting the number of compressed files in hdfs


The hdfs dfs -count command gives the following information:

  • Directory count
  • File count
  • Content size
  • Path name

For example, I get the following output for my /tmp/ folder:

CMD> hdfs dfs -count /tmp/
          14           33       193414280395 /tmp

Using this command, you can get the count of .snappy files like this:

CMD> hdfs dfs -count -v /tmp/*.snappy

You will get output like this:

DIR_COUNT   FILE_COUNT       CONTENT_SIZE PATHNAME
        0            1               4623 /tmp/Links.txt.snappy
        0            1             190939 /tmp/inclusions_00000005.snappy
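Each matching file gets its own row, so if you also want the totals across all matching files, you can sum the FILE_COUNT and CONTENT_SIZE columns with awk. This is just a sketch, assuming the same /tmp/*.snappy example paths (without -v, so there is no header row to skip):

hdfs dfs -count /tmp/*.snappy | awk '{files += $2; bytes += $3} END {print files " files, " bytes " bytes"}'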

To get the count of .snappy files, you can also execute the following commands:

  • Get the count of .snappy files directly under a folder:

    Just execute the hadoop fs -ls command. For example, to get the number of .snappy files under the /user/data folder, execute:

    hadoop fs -ls /user/data/*.snappy | wc -l
  • Recursively get the count of all the .snappy files under a folder:

    Execute the hadoop fsck command (an alternative using a recursive listing is sketched after this list). For example:

    hadoop fsck /user/data/ -files | grep ".snappy" | wc -l
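
As an alternative to fsck, a recursive directory listing through the shell works as well. This is a sketch assuming the same /user/data example folder and that the files end with the .snappy suffix:

hadoop fs -ls -R /user/data | grep '\.snappy$' | wc -l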

EDIT: All files greater than 30 MB

If you want to find all the files with a size greater than or equal to 30 MB (30 * 1024 * 1024 = 31457280 bytes), execute the following command:

hadoop fsck /user/data -files | grep ".snappy" | gawk '{if ($2 ~ /^[0-9]+$/ && $2>=31457280) print $1,$2;}'

This will print $1 as the file name and $2 as the size of the file in bytes.

If you want the count of the files, then just pipe it to wc -l as shown below:

hadoop fsck /user/data -files | grep ".snappy" | gawk '{if ($2 ~ /^[0-9]+$/ && $2>=31457280) print $1,$2;}' | wc -l
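
If the threshold changes often, a small variation avoids hard-coding the byte value. This is a sketch that passes the limit in via gawk's -v option and shell arithmetic, using the same example /user/data path:

hadoop fsck /user/data -files | grep ".snappy" | gawk -v min=$((30*1024*1024)) '{if ($2 ~ /^[0-9]+$/ && $2 >= min) print $1, $2}'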