Performant way of displaying the number of unique column entries in a set of files? unix



You can count the number of unique values in the fifth field in a single pass with awk:

awk '{if (!seen[$5]++) ++ctr} END {print ctr}'

This builds an associative array keyed on the fifth-field values and increments the ctr variable the first time each value is seen (seen[$5]++ is zero, so !seen[$5]++ is true, only on the first occurrence). The END rule prints the final value of the counter.
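
For example, with a hypothetical whitespace-separated file data.txt whose fifth field takes the values a, b, a and c:

printf 'x x x x a\nx x x x b\nx x x x a\nx x x x c\n' > data.txt
awk '{if (!seen[$5]++) ++ctr} END {print ctr}' data.txt

This prints 3, since three distinct values occur in the fifth field.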

With GNU awk, you can alternatively just check the length of the associative array at the end:

awk '{seen[$5]++} END {print length(seen)}'
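
Here length(seen) returns the number of elements in the array, which is a GNU extension; POSIX only defines length for strings. Run on the hypothetical data.txt from above, this also prints 3:

gawk '{seen[$5]++} END {print length(seen)}' data.txt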


Benjamin has supplied the good oil, but depending on just how much data is to be stored in the array, it may pay to pass the data to wc anyway:

awk '!_[$5]++' file | wc -l
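
The pattern !_[$5]++ is true only the first time a given fifth-field value appears, and since it has no action awk prints the matching line, so wc -l receives exactly one line per distinct value. On the hypothetical data.txt from above, this likewise prints 3:

awk '!_[$5]++' data.txt | wc -l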


The shortest and fastest version I could come up with using awk, not far from @BenjaminW's earlier answer. I think it is a bit faster (the difference would only matter on very large files) because the uniqueness test is made earlier in the process:

awk '!E[$5]++{c++}END{print c}' YourFile

This works with all awk versions.
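
Since the question asks about a set of files, note that any of these one-liners accepts several files at once; the array persists across all of them, so the result is the number of values unique across the whole set. With hypothetical files file1 and file2:

awk '!E[$5]++{c++}END{print c}' file1 file2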