How to gather characters usage statistics in text file using Unix commands? How to gather characters usage statistics in text file using Unix commands? unix unix

How to gather characters usage statistics in text file using Unix commands?


This should do what you're looking for:

cat inputfile | sed 's/\(.\)/\1\n/g' | sort | uniq -c

The premise is that the sed puts each character in the file onto a line by itself, then the usual sort | uniq -c sequence strips out all but one of each unique character that occurs, and provides counts of how many times each occurred.

Also, you could append | sort -n to the end of the whole sequence to sort the output by how many times each character occurred. Example:

$ echo hello |  sed 's/\(.\)/\1\n/g' | sort | uniq -c | sort -n  1   1 e  1 h  1 o  2 l


This will do it:

#!/usr/bin/perl -n## charcounts - show how many times each code point is used# Tom Christiansen <tchrist@perl.com>use open ":utf8";++$seen{ ord() } for split //;END {    for my $cp (sort {$seen{$b} <=> $seen{$a}} keys %seen) {        printf "%04X %d\n", $cp, $seen{$cp};    }}

Run on itself, that program produces:

$ charcounts /tmp/charcounts | head0020 460065 200073 18006E 15000A 14006F 120072 110074 100063 90070 9

If you want the literal character and/or name of the character, too, that’s easy to add.

If you want something more sophisticated, this program figures out characters by Unicode property. It may be enough for your purposes, and if not, you should be able to adapt it.

#!/usr/bin/perl## unicats - show character distribution by Unicode character property# Tom Christiansen <tchrist@perl.com>use strict;use warnings qw<FATAL all>;use open ":utf8";my %cats;our %Prop_Table;build_prop_table();if (@ARGV == 0 && -t STDIN) {    warn <<"END_WARNING";$0: reading UTF-8 character data directly from your tty\tSo please type stuff...\t and then hit your tty's EOF sequence when done.END_WARNING} while (<>) {    for (split(//)) {        $cats{Total}++;        if (/\p{ASCII}/) { $cats{ASCII}++   }         else             { $cats{Unicode}++ }         my $gcat   = get_general_category($_);        $cats{$gcat}++;        my $subcat = get_general_subcategory($_);        $cats{$subcat}++;    } } my $width = length $cats{Total};my $mask = "%*d %s\n";for my $cat(qw< Total ASCII Unicode >) {     printf $mask, $width => $cats{$cat} || 0, $cat; }print "\n";my @catnames = qw[    L Lu Ll Lt Lm Lo    N Nd Nl No    S Sm Sc Sk So    P Pc Pd Ps Pe Pi Pf Po    M Mn Mc Me    Z Zs Zl Zp    C Cc Cf Cs Co Cn];#for my $cat (sort keys %cats) {for my $cat (@catnames) {    next if length($cat) > 2;    next unless $cats{$cat};    my $prop = length($cat) == 1                  ? ( " " . q<\p> .   $cat          )                 : (       q<\p> . "{$cat}" . "\t" )             ;    my $desc = sprintf("%-6s %s", $prop, $Prop_Table{$cat});    printf $mask, $width => $cats{$cat}, $desc;} exit;sub get_general_category {    my $_ = shift();    return "L" if /\pL/;    return "S" if /\pS/;    return "P" if /\pP/;    return "N" if /\pN/;    return "C" if /\pC/;    return "M" if /\pM/;    return "Z" if /\pZ/;    die "not reached one: $_";} sub get_general_subcategory {    my $_ = shift();    return "Lu" if /\p{Lu}/;    return "Ll" if /\p{Ll}/;    return "Lt" if /\p{Lt}/;    return "Lm" if /\p{Lm}/;    return "Lo" if /\p{Lo}/;    return "Mn" if /\p{Mn}/;    return "Mc" if /\p{Mc}/;    return "Me" if /\p{Me}/;    return "Nd" if /\p{Nd}/;    return "Nl" if /\p{Nl}/;    return "No" if /\p{No}/;    return "Pc" if /\p{Pc}/;    return "Pd" if /\p{Pd}/;    return "Ps" if /\p{Ps}/;    return "Pe" if /\p{Pe}/;    return "Pi" if /\p{Pi}/;    return "Pf" if /\p{Pf}/;    return "Po" if /\p{Po}/;    return "Sm" if /\p{Sm}/;    return "Sc" if /\p{Sc}/;    return "Sk" if /\p{Sk}/;    return "So" if /\p{So}/;    return "Zs" if /\p{Zs}/;    return "Zl" if /\p{Zl}/;    return "Zp" if /\p{Zp}/;    return "Cc" if /\p{Cc}/;    return "Cf" if /\p{Cf}/;    return "Cs" if /\p{Cs}/;    return "Co" if /\p{Co}/;    return "Cn" if /\p{Cn}/;    die "not reached two: <$_> " . sprintf("U+%vX", $_);}sub build_prop_table {     for my $line (<<"End_of_Property_List" =~ m{ \S .* \S }gx) {       L           Letter       Lu          Uppercase_Letter       Ll          Lowercase_Letter       Lt          Titlecase_Letter       Lm          Modifier_Letter       Lo          Other_Letter       M           Mark  (combining characters, including diacritics)       Mn          Nonspacing_Mark       Mc          Spacing_Mark       Me          Enclosing_Mark       N           Number       Nd          Decimal_Number (also Digit)       Nl          Letter_Number       No          Other_Number       P           Punctuation       Pc          Connector_Punctuation       Pd          Dash_Punctuation       Ps          Open_Punctuation       Pe          Close_Punctuation       Pi          Initial_Punctuation (may behave like Ps or Pe depending on usage)       Pf          Final_Punctuation (may behave like Ps or Pe depending on usage)       Po          Other_Punctuation       S           Symbol       Sm          Math_Symbol       Sc          Currency_Symbol       Sk          Modifier_Symbol       So          Other_Symbol       Z           Separator       Zs          Space_Separator       Zl          Line_Separator       Zp          Paragraph_Separator       C           Other (means not L/N/P/S/Z)       Cc          Control (also Cntrl)       Cf          Format       Cs          Surrogate   (not usable)       Co          Private_Use       Cn          UnassignedEnd_of_Property_List            my($short_prop, $long_prop) = $line =~ m{                 \b                  ( \p{Lu}  \p{Ll}   ? )                 \s +                  ( \p{Lu} [\p{L&}_] + )                \b            }x;            $Prop_Table{$short_prop} = $long_prop;    }}

For example:

$ unicats book.txt2357232 Total2357199 ASCII     33 Unicode1604949  \pL   Letter  74455 \p{Lu}   Uppercase_Letter1530485 \p{Ll}   Lowercase_Letter      9 \p{Lo}   Other_Letter  10676  \pN   Number  10676 \p{Nd}   Decimal_Number  19679  \pS   Symbol  10705 \p{Sm}   Math_Symbol   8365 \p{Sc}   Currency_Symbol    603 \p{Sk}   Modifier_Symbol      6 \p{So}   Other_Symbol 111899  \pP   Punctuation   2996 \p{Pc}   Connector_Punctuation   6145 \p{Pd}   Dash_Punctuation  11392 \p{Ps}   Open_Punctuation  11371 \p{Pe}   Close_Punctuation  79995 \p{Po}   Other_Punctuation 548529  \pZ   Separator 548529 \p{Zs}   Space_Separator  61500  \pC   Other  61500 \p{Cc}   Control


As far as using *nix commands, the answer above is good, but it doesn't get usage stats.

However, if you actually want stats (like the rarest used, median, most used, etc) on the file, this Python should do it.

def get_char_counts(fname):    f = open(fname)    usage = {}    for c in f.read():        if c not in usage:            usage.update({c:1})        else:            usage[c] += 1    return usage