
Numpy histogram of large arrays


As you said, it's not that hard to roll your own. You'll need to set up the bins yourself and reuse them as you iterate over the file. The following ought to be a decent starting point:

import numpy as np

datamin = -5
datamax = 5
numbins = 20
mybins = np.linspace(datamin, datamax, numbins)
myhist = np.zeros(numbins-1, dtype='int32')

for i in range(100):
    d = np.random.randn(1000, 1)
    htemp, jnk = np.histogram(d, mybins)
    myhist += htemp

I'm guessing performance will be an issue with such large files; the overhead of calling histogram once per line could make this too slow. @doug's suggestion of a generator seems like a good way to address that problem.
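
To make that concrete, here is a hedged sketch of the chunked approach: a generator yields blocks of the file so only one block is in memory at a time, while the fixed bins accumulate counts across blocks. The file name, chunk size, and one-value-per-line layout below are assumptions, not details from the question.

import numpy as np

FILENAME = "data_file.txt"   # assumed file name
CHUNK_ROWS = 100000          # assumed chunk size

def read_chunks(path, chunk_rows):
    # Yield the file's values one block at a time, so the whole
    # dataset never has to sit in memory at once.
    block = []
    with open(path) as fh:
        for line in fh:
            block.append(float(line))
            if len(block) == chunk_rows:
                yield np.asarray(block)
                block = []
        if block:
            yield np.asarray(block)

mybins = np.linspace(-5, 5, 20)
myhist = np.zeros(len(mybins) - 1, dtype=np.int64)

for chunk in read_chunks(FILENAME, CHUNK_ROWS):
    htemp, _ = np.histogram(chunk, mybins)
    myhist += htemp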


Here's a way to bin your values directly:

import numpy as NP

column_of_values = NP.random.randint(10, 99, 10)

# set the bin values:
bins = NP.array([0.0, 20.0, 50.0, 75.0])

binned_values = NP.digitize(column_of_values, bins)

'binned_values' is an index array, containing the index of the bin to which each value in column_of_values belongs.

'bincount' will give you (obviously) the bin counts:

NP.bincount(binned_values)
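
One caveat if you accumulate these counts chunk by chunk: the length of bincount's output depends on the largest bin index that actually occurs, so passing minlength keeps every chunk's count array the same shape. A minimal sketch (the chunking here is simulated with random data):

import numpy as NP

bins = NP.array([0.0, 20.0, 50.0, 75.0])
totals = NP.zeros(len(bins) + 1, dtype=NP.int64)

# stand-in for chunks read from the real file
for chunk in (NP.random.randint(10, 99, 10) for _ in range(5)):
    binned = NP.digitize(chunk, bins)
    # minlength pads the result so every chunk contributes len(bins) + 1 counts
    totals += NP.bincount(binned, minlength=len(bins) + 1)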

Given the size of your data set, using Numpy's 'loadtxt' to build a generator might be useful:

data_array = NP.loadtxt("data_file.txt", delimiter=",")

def fnx():
    # yield one column of the loaded array at a time
    for i in range(0, data_array.shape[1]):
        yield data_array[:, i]


Binning with a Fenwick Tree (very large dataset; percentile boundaries needed)

I'm posting a second answer to the same question since this approach is very different, and addresses different issues.

What if you have a VERY large dataset (billions of samples), and you don't know ahead of time WHERE your bin boundaries should be? For example, maybe you want to bin things up into quartiles or deciles.

For small datasets, the answer is easy: load the data into an array, sort it, then read off the value at any given percentile by jumping to the index that percentage of the way through the array.
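
For the small-dataset case, that really is just a sort and an index lookup (a minimal sketch; np.percentile does the same job with interpolation):

import numpy as np

data = np.random.randn(10000)     # small enough to hold in memory
data.sort()

# read off the quartiles by jumping to the right index
for pct in (25, 50, 75):
    idx = int(pct / 100 * (len(data) - 1))
    print(pct, data[idx])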

For large datasets, where the memory needed to hold the array is not practical (not to mention the time to sort)... then consider using a Fenwick Tree, aka a "Binary Indexed Tree".

I think these only work for positive integer data, so you'll at least need to know enough about your dataset to shift (and possibly scale) your data before you tabulate it in the Fenwick Tree.

I've used this to find the median of a 100 billion sample dataset, in reasonable time and very comfortable memory limits. (Consider using generators to open and read the files, as per my other answer; that's still useful.)
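
Here is a minimal sketch of that idea (the class and its method names are my own, not from any particular library): each shifted/scaled integer sample is tabulated into a Fenwick tree of counts, and a percentile query then becomes a walk down the tree to find the smallest value whose cumulative count reaches the requested rank.

import random

class FenwickCounter:
    # Counts of integer samples in [0, size), with O(log n) updates and rank queries.
    def __init__(self, size):
        self.size = size
        self.tree = [0] * (size + 1)   # 1-based internal array
        self.total = 0

    def add(self, value, count=1):
        # Record `count` occurrences of the integer `value`.
        i = value + 1
        while i <= self.size:
            self.tree[i] += count
            i += i & (-i)
        self.total += count

    def value_at_rank(self, rank):
        # Smallest value v such that at least `rank` samples are <= v (rank is 1-based).
        pos = 0
        remaining = rank
        step = 1
        while step * 2 <= self.size:
            step *= 2
        while step:
            nxt = pos + step
            if nxt <= self.size and self.tree[nxt] < remaining:
                remaining -= self.tree[nxt]
                pos = nxt
            step //= 2
        return pos   # 0-based value

# Example: stream samples in, then read off the quartiles and median.
fc = FenwickCounter(1000)                # samples assumed to lie in [0, 1000)
for _ in range(100000):
    fc.add(random.randint(0, 999))
for q in (0.25, 0.5, 0.75):
    print(q, fc.value_at_rank(int(q * fc.total)))

The memory here grows with the range of the (shifted, scaled) integer values rather than with the number of samples, which is what makes the 100-billion-sample case workable.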

More on Fenwick Trees: