Command line utility to print statistics of numbers in Linux

This is a breeze with R. For a file that looks like this:

1
2
3
4
5
6
7
8
9
10

Use this:

R -q -e "x <- read.csv('nums.txt', header = F); summary(x); sd(x[ , 1])"

To get this:

       V1       
 Min.   : 1.00  
 1st Qu.: 3.25  
 Median : 5.50  
 Mean   : 5.50  
 3rd Qu.: 7.75  
 Max.   :10.00  
[1] 3.02765
  • The -q flag squelches R's startup licensing and help output
  • The -e flag tells R you'll be passing an expression from the terminal
  • x is a data.frame - a table, basically. It's a structure that accommodates multiple vectors/columns of data, which is a little peculiar if you're just reading in a single vector. This has an impact on which functions you can use.
  • Some functions, like summary(), naturally accommodate data.frames. If x had multiple fields, summary() would provide the above descriptive stats for each.
  • But sd() can only take one vector at a time, which is why I index x for that command (x[ , 1] returns the first column of x). You could use apply(x, MARGIN = 2, FUN = sd) to get the SDs for all columns, as in the sketch after this list.
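
A minimal sketch of that multi-column case, assuming a hypothetical two-column file nums2.csv:

R -q -e "x <- read.csv('nums2.csv', header = F); summary(x); apply(x, MARGIN = 2, FUN = sd)"

Here apply() walks over the columns (MARGIN = 2) and calls sd() on each, so you get one standard deviation per column.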


Using "st" (https://github.com/nferraz/st)

$ st numbers.txt
N    min   max   sum   mean  stddev
10   1     10    55    5.5   3.02765

Or:

$ st numbers.txt --transpose
N      10
min    1
max    10
sum    55
mean   5.5
stddev 3.02765

(DISCLAIMER: I wrote this tool :))
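
st is a Perl tool; a minimal install sketch, assuming it is published on CPAN as App::St (check the project README for the authoritative instructions):

cpanm App::St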


For the average, median, and standard deviation you can use awk; this will generally be faster than the R solutions, since awk avoids R's startup cost. For instance, the following prints the average:

awk '{a+=$1} END{print a/NR}' myfile

(NR is a built-in awk variable holding the number of records read so far, so in the END block it equals the total line count. $1 is the first whitespace-separated field of the line; $0 is the whole line, which would also work here because awk's numeric conversion only takes the leading number, but $1 is the safer and more explicit choice. END means the following block is executed after the whole file has been processed. Uninitialized awk variables start at zero, so an explicit BEGIN{a=0} is optional.)
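
The median and standard deviation mentioned above work in the same spirit. A sketch: the first line computes the population standard deviation in a single pass (scale the variance by NR/(NR-1) before the square root if you want the sample version), and the second sorts first, then picks the middle value (or the average of the middle two):

awk '{s+=$1; ss+=$1*$1} END{m=s/NR; print sqrt(ss/NR-m*m)}' myfile
sort -n myfile | awk '{a[NR]=$1} END{print (NR%2 ? a[(NR+1)/2] : (a[NR/2]+a[NR/2+1])/2)}'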

Here is a simple awk script which provides more detailed statistics (it takes a CSV file as input; change FS for other field separators):

#!/usr/bin/awk -f
BEGIN {
    FS=",";
}
{
    a += $1;
    b[++i] = $1;
}
END {
    m = a/NR; # mean
    for (i in b)
    {
        d += (b[i]-m)^2;
        e += (b[i]-m)^3;
        f += (b[i]-m)^4;
    }
    va = d/NR; # variance
    sd = sqrt(va); # standard deviation
    sk = (e/NR)/sd^3; # skewness
    ku = (f/NR)/sd^4-3; # standardized kurtosis
    print "N,sum,mean,variance,std,SEM,skewness,kurtosis"
    print NR "," a "," m "," va "," sd "," sd/sqrt(NR) "," sk "," ku
}
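
To run it, save the script and make it executable (the filename stats.awk is just for illustration):

chmod +x stats.awk
./stats.awk data.csv

Equivalently, skip the shebang and call awk -f stats.awk data.csv.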

It is straightforward to add min/max to this script, but it is just as easy to pipe through sort and head/tail:

sort -n myfile | head -n1
sort -n myfile | tail -n1
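
If you would rather not sort twice, the same min/max can be computed in one awk pass; a sketch:

awk 'NR==1{min=max=$1} $1<min{min=$1} $1>max{max=$1} END{print min, max}' myfile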