How to gather characters usage statistics in text file using Unix commands?
This should do what you're looking for:
cat inputfile | sed 's/\(.\)/\1\n/g' | sort | uniq -c
The premise is that the sed
puts each character in the file onto a line by itself, then the usual sort | uniq -c
sequence strips out all but one of each unique character that occurs, and provides counts of how many times each occurred.
Also, you could append | sort -n
to the end of the whole sequence to sort the output by how many times each character occurred. Example:
$ echo hello | sed 's/\(.\)/\1\n/g' | sort | uniq -c | sort -n 1 1 e 1 h 1 o 2 l
This will do it:
#!/usr/bin/perl -n## charcounts - show how many times each code point is used# Tom Christiansen <tchrist@perl.com>use open ":utf8";++$seen{ ord() } for split //;END { for my $cp (sort {$seen{$b} <=> $seen{$a}} keys %seen) { printf "%04X %d\n", $cp, $seen{$cp}; }}
Run on itself, that program produces:
$ charcounts /tmp/charcounts | head0020 460065 200073 18006E 15000A 14006F 120072 110074 100063 90070 9
If you want the literal character and/or name of the character, too, that’s easy to add.
If you want something more sophisticated, this program figures out characters by Unicode property. It may be enough for your purposes, and if not, you should be able to adapt it.
#!/usr/bin/perl## unicats - show character distribution by Unicode character property# Tom Christiansen <tchrist@perl.com>use strict;use warnings qw<FATAL all>;use open ":utf8";my %cats;our %Prop_Table;build_prop_table();if (@ARGV == 0 && -t STDIN) { warn <<"END_WARNING";$0: reading UTF-8 character data directly from your tty\tSo please type stuff...\t and then hit your tty's EOF sequence when done.END_WARNING} while (<>) { for (split(//)) { $cats{Total}++; if (/\p{ASCII}/) { $cats{ASCII}++ } else { $cats{Unicode}++ } my $gcat = get_general_category($_); $cats{$gcat}++; my $subcat = get_general_subcategory($_); $cats{$subcat}++; } } my $width = length $cats{Total};my $mask = "%*d %s\n";for my $cat(qw< Total ASCII Unicode >) { printf $mask, $width => $cats{$cat} || 0, $cat; }print "\n";my @catnames = qw[ L Lu Ll Lt Lm Lo N Nd Nl No S Sm Sc Sk So P Pc Pd Ps Pe Pi Pf Po M Mn Mc Me Z Zs Zl Zp C Cc Cf Cs Co Cn];#for my $cat (sort keys %cats) {for my $cat (@catnames) { next if length($cat) > 2; next unless $cats{$cat}; my $prop = length($cat) == 1 ? ( " " . q<\p> . $cat ) : ( q<\p> . "{$cat}" . "\t" ) ; my $desc = sprintf("%-6s %s", $prop, $Prop_Table{$cat}); printf $mask, $width => $cats{$cat}, $desc;} exit;sub get_general_category { my $_ = shift(); return "L" if /\pL/; return "S" if /\pS/; return "P" if /\pP/; return "N" if /\pN/; return "C" if /\pC/; return "M" if /\pM/; return "Z" if /\pZ/; die "not reached one: $_";} sub get_general_subcategory { my $_ = shift(); return "Lu" if /\p{Lu}/; return "Ll" if /\p{Ll}/; return "Lt" if /\p{Lt}/; return "Lm" if /\p{Lm}/; return "Lo" if /\p{Lo}/; return "Mn" if /\p{Mn}/; return "Mc" if /\p{Mc}/; return "Me" if /\p{Me}/; return "Nd" if /\p{Nd}/; return "Nl" if /\p{Nl}/; return "No" if /\p{No}/; return "Pc" if /\p{Pc}/; return "Pd" if /\p{Pd}/; return "Ps" if /\p{Ps}/; return "Pe" if /\p{Pe}/; return "Pi" if /\p{Pi}/; return "Pf" if /\p{Pf}/; return "Po" if /\p{Po}/; return "Sm" if /\p{Sm}/; return "Sc" if /\p{Sc}/; return "Sk" if /\p{Sk}/; return "So" if /\p{So}/; return "Zs" if /\p{Zs}/; return "Zl" if /\p{Zl}/; return "Zp" if /\p{Zp}/; return "Cc" if /\p{Cc}/; return "Cf" if /\p{Cf}/; return "Cs" if /\p{Cs}/; return "Co" if /\p{Co}/; return "Cn" if /\p{Cn}/; die "not reached two: <$_> " . sprintf("U+%vX", $_);}sub build_prop_table { for my $line (<<"End_of_Property_List" =~ m{ \S .* \S }gx) { L Letter Lu Uppercase_Letter Ll Lowercase_Letter Lt Titlecase_Letter Lm Modifier_Letter Lo Other_Letter M Mark (combining characters, including diacritics) Mn Nonspacing_Mark Mc Spacing_Mark Me Enclosing_Mark N Number Nd Decimal_Number (also Digit) Nl Letter_Number No Other_Number P Punctuation Pc Connector_Punctuation Pd Dash_Punctuation Ps Open_Punctuation Pe Close_Punctuation Pi Initial_Punctuation (may behave like Ps or Pe depending on usage) Pf Final_Punctuation (may behave like Ps or Pe depending on usage) Po Other_Punctuation S Symbol Sm Math_Symbol Sc Currency_Symbol Sk Modifier_Symbol So Other_Symbol Z Separator Zs Space_Separator Zl Line_Separator Zp Paragraph_Separator C Other (means not L/N/P/S/Z) Cc Control (also Cntrl) Cf Format Cs Surrogate (not usable) Co Private_Use Cn UnassignedEnd_of_Property_List my($short_prop, $long_prop) = $line =~ m{ \b ( \p{Lu} \p{Ll} ? ) \s + ( \p{Lu} [\p{L&}_] + ) \b }x; $Prop_Table{$short_prop} = $long_prop; }}
For example:
$ unicats book.txt2357232 Total2357199 ASCII 33 Unicode1604949 \pL Letter 74455 \p{Lu} Uppercase_Letter1530485 \p{Ll} Lowercase_Letter 9 \p{Lo} Other_Letter 10676 \pN Number 10676 \p{Nd} Decimal_Number 19679 \pS Symbol 10705 \p{Sm} Math_Symbol 8365 \p{Sc} Currency_Symbol 603 \p{Sk} Modifier_Symbol 6 \p{So} Other_Symbol 111899 \pP Punctuation 2996 \p{Pc} Connector_Punctuation 6145 \p{Pd} Dash_Punctuation 11392 \p{Ps} Open_Punctuation 11371 \p{Pe} Close_Punctuation 79995 \p{Po} Other_Punctuation 548529 \pZ Separator 548529 \p{Zs} Space_Separator 61500 \pC Other 61500 \p{Cc} Control
As far as using *nix commands, the answer above is good, but it doesn't get usage stats.
However, if you actually want stats (like the rarest used, median, most used, etc) on the file, this Python should do it.
def get_char_counts(fname): f = open(fname) usage = {} for c in f.read(): if c not in usage: usage.update({c:1}) else: usage[c] += 1 return usage