
Numpy histogram of large arrays


As you said, it's not that hard to roll your own. You'll need to set up the bins yourself and reuse them as you iterate over the file. The following ought to be a decent starting point:

import numpy as np

datamin = -5
datamax = 5
numbins = 20
mybins = np.linspace(datamin, datamax, numbins)
myhist = np.zeros(numbins-1, dtype='int32')

for i in range(100):
    d = np.random.randn(1000, 1)
    htemp, jnk = np.histogram(d, mybins)
    myhist += htemp

I'm guessing performance will be an issue with such large files; the overhead of calling histogram once per line could make this too slow. @doug's suggestion of a generator seems like a good way to address that problem.
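
To make that concrete, here is a hedged sketch of the chunked approach: a generator yields blocks of the file so only one block is in memory at a time, while the fixed bins accumulate counts across blocks. The file name, chunk size, and one-value-per-line layout below are assumptions, not details from the question.

import numpy as np

FILENAME = "data_file.txt"   # assumed file name
CHUNK_ROWS = 100000          # assumed chunk size

def read_chunks(path, chunk_rows):
    # Yield the file's values one block at a time, so the whole
    # dataset never has to sit in memory at once.
    block = []
    with open(path) as fh:
        for line in fh:
            block.append(float(line))
            if len(block) == chunk_rows:
                yield np.asarray(block)
                block = []
        if block:
            yield np.asarray(block)

mybins = np.linspace(-5, 5, 20)
myhist = np.zeros(len(mybins) - 1, dtype=np.int64)

for chunk in read_chunks(FILENAME, CHUNK_ROWS):
    htemp, _ = np.histogram(chunk, mybins)
    myhist += htemp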


Here's a way to bin your values directly:

import numpy as NP

column_of_values = NP.random.randint(10, 99, 10)

# set the bin values:
bins = NP.array([0.0, 20.0, 50.0, 75.0])

binned_values = NP.digitize(column_of_values, bins)

'binned_values' is an index array, containing the index of the bin to which each value in column_of_values belongs.

'bincount' will give you (obviously) the bin counts:

NP.bincount(binned_values)
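
One caveat if you accumulate these counts chunk by chunk: the length of bincount's output depends on the largest bin index that actually occurs, so passing minlength keeps every chunk's count array the same shape. A minimal sketch (the chunking here is simulated with random data):

import numpy as NP

bins = NP.array([0.0, 20.0, 50.0, 75.0])
totals = NP.zeros(len(bins) + 1, dtype=NP.int64)

# stand-in for chunks read from the real file
for chunk in (NP.random.randint(10, 99, 10) for _ in range(5)):
    binned = NP.digitize(chunk, bins)
    # minlength pads the result so every chunk contributes len(bins) + 1 counts
    totals += NP.bincount(binned, minlength=len(bins) + 1)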

Given the size of your data set, using Numpy's 'loadtxt' to build a generator might be useful:

data_array = NP.loadtxt("data_file.txt", delimiter=",")

def fnx():
    # yield one column of the loaded array at a time
    for i in range(0, data_array.shape[1]):
        yield data_array[:, i]


Binning with a Fenwick Tree (very large dataset; percentile boundaries needed)

I'm posting a second answer to the same question since this approach is very different, and addresses different issues.

What if you have a VERY large dataset (billions of samples), and you don't know ahead of time WHERE your bin boundaries should be? For example, maybe you want to bin things up into quartiles or deciles.

For small datasets, the answer is easy: load the data into an array, sort it, then read off the value at any given percentile by jumping to the index that percentage of the way through the array.
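
For the small-dataset case, that really is just a sort and an index lookup (a minimal sketch; np.percentile does the same job with interpolation):

import numpy as np

data = np.random.randn(10000)     # small enough to hold in memory
data.sort()

# read off the quartiles by jumping to the right index
for pct in (25, 50, 75):
    idx = int(pct / 100 * (len(data) - 1))
    print(pct, data[idx])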

For large datasets, where the memory needed to hold the array is not practical (not to mention the time to sort)... then consider using a Fenwick Tree, aka a "Binary Indexed Tree".

I think these only work for positive integer data, so you'll at least need to know enough about your dataset to shift (and possibly scale) your data before you tabulate it in the Fenwick Tree.

I've used this to find the median of a 100 billion sample dataset, in reasonable time and very comfortable memory limits. (Consider using generators to open and read the files, as per my other answer; that's still useful.)
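
Here is a minimal sketch of that idea (the class and its method names are my own, not from any particular library): each shifted/scaled integer sample is tabulated into a Fenwick tree of counts, and a percentile query then becomes a walk down the tree to find the smallest value whose cumulative count reaches the requested rank.

import random

class FenwickCounter:
    # Counts of integer samples in [0, size), with O(log n) updates and rank queries.
    def __init__(self, size):
        self.size = size
        self.tree = [0] * (size + 1)   # 1-based internal array
        self.total = 0

    def add(self, value, count=1):
        # Record `count` occurrences of the integer `value`.
        i = value + 1
        while i <= self.size:
            self.tree[i] += count
            i += i & (-i)
        self.total += count

    def value_at_rank(self, rank):
        # Smallest value v such that at least `rank` samples are <= v (rank is 1-based).
        pos = 0
        remaining = rank
        step = 1
        while step * 2 <= self.size:
            step *= 2
        while step:
            nxt = pos + step
            if nxt <= self.size and self.tree[nxt] < remaining:
                remaining -= self.tree[nxt]
                pos = nxt
            step //= 2
        return pos   # 0-based value

# Example: stream samples in, then read off the quartiles and median.
fc = FenwickCounter(1000)                # samples assumed to lie in [0, 1000)
for _ in range(100000):
    fc.add(random.randint(0, 999))
for q in (0.25, 0.5, 0.75):
    print(q, fc.value_at_rank(int(q * fc.total)))

The memory here grows with the range of the (shifted, scaled) integer values rather than with the number of samples, which is what makes the 100-billion-sample case workable.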

More on Fenwick Trees: