How do you deal with missing data using numpy/scipy?


If you are willing to consider another library, pandas (http://pandas.pydata.org/) is built on top of numpy and, amongst many other things, provides:

Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form

I've been using it for almost one year in the financial industry where missing and badly aligned data is the norm and it really made my life easier.
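To give a rough idea of what that looks like, here is a minimal sketch. The series names, labels and values are made up for illustration; the calls used (`isna`, `dropna`, `fillna`, `mean`) are ordinary pandas methods, and pandas represents missing floats as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two series with different labels; "b" is missing in the first.
prices = pd.Series([1.0, np.nan, 3.0], index=["a", "b", "c"])
volumes = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Arithmetic aligns on labels; a label missing on either side yields NaN.
product = prices * volumes

n_missing = int(product.isna().sum())  # count the missing entries
dropped = product.dropna()             # discard the missing entries
filled = product.fillna(0.0)           # replace the missing entries with 0.0
mean_value = product.mean()            # reductions skip NaN by default
```

Only label "c" has a value in both series, so it is the only entry that survives `dropna()`.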


I also question the problem with masked arrays. Here are a couple of examples:

    import numpy as np

    data = np.ma.masked_array(np.arange(10))
    data[5] = np.ma.masked         # Mask a specific value
    data[data > 6] = np.ma.masked  # Mask any value greater than 6

    # Same thing done at initialization time
    init_data = np.arange(10)
    data = np.ma.masked_array(init_data, mask=(init_data > 6))
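Once the mask is in place, computations skip the masked entries. A small sketch of that, assuming (purely for illustration) that -999 is the sentinel your data uses for a missing value:

```python
import numpy as np

# Hypothetical readings where -999 marks a missing value.
raw = np.array([1.0, 2.0, -999.0, 4.0, -999.0])
data = np.ma.masked_values(raw, -999.0)

n_valid = data.count()          # number of unmasked entries
mean_valid = data.mean()        # masked entries are ignored
filled = data.filled(0.0)       # plain ndarray, masked slots set to 0.0
valid_only = data.compressed()  # 1-D ndarray of the unmasked values only
```

`filled` and `compressed` are handy when you need to hand a plain ndarray to code that does not know about masks.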


Masked arrays are the answer, as DopplerShift describes. For quick and dirty use, you can use fancy indexing with boolean arrays:

    >>> import numpy as np
    >>> data = np.arange(10)
    >>> valid_idx = data % 2 == 0  # pretend that odd elements are missing
    >>> # Get non-missing data
    >>> data[valid_idx]
    array([0, 2, 4, 6, 8])

You can now use valid_idx as a quick mask on other data as well:

    >>> comparison = np.arange(10) + 10
    >>> comparison[valid_idx]
    array([10, 12, 14, 16, 18])
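The same trick works when missing values are stored as NaN rather than picked by a made-up rule. A minimal sketch, assuming float data with NaN marking the gaps (the array contents are invented for the example):

```python
import numpy as np

# Hypothetical float data with NaN marking missing entries.
data = np.array([1.0, np.nan, 3.0, np.nan, 5.0])

valid_idx = ~np.isnan(data)     # True where a value is present
clean = data[valid_idx]         # keep only the non-missing values

# Reuse the same boolean mask on another array of the same length.
other = np.arange(5) + 10
other_clean = other[valid_idx]

# For simple reductions, the nan-aware functions avoid the mask altogether.
nan_mean = np.nanmean(data)
```

Note that this only works for float arrays, since integer arrays cannot hold NaN.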