How to split a pandas time-series by NAN values


You can use numpy.split and then filter the resulting list. Here is one example assuming that the column with the values is labeled "value":

events = np.split(df, np.where(np.isnan(df.value))[0])

# removing NaN entries
events = [ev[~np.isnan(ev.value)] for ev in events if not isinstance(ev, np.ndarray)]

# removing empty DataFrames
events = [ev for ev in events if not ev.empty]

You will have a list with all the events separated by the NaN values.
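To make the mechanics concrete, here is a minimal self-contained sketch of the same splitting logic on a bare NumPy array (the toy values are made up for illustration):

```python
import numpy as np

# Toy data: two runs of values separated by NaNs
values = np.array([1.0, 2.0, np.nan, np.nan, 3.0, 4.0, 5.0, np.nan])

# Split at every NaN position; each piece after the first starts with a NaN
pieces = np.split(values, np.where(np.isnan(values))[0])

# Strip the NaNs inside each piece, then drop pieces that became empty
events = [p[~np.isnan(p)] for p in pieces]
events = [p for p in events if p.size > 0]

print([p.tolist() for p in events])  # [[1.0, 2.0], [3.0, 4.0, 5.0]]
```

The same indices drive the DataFrame version above; the `isinstance` guard and `.empty` filter there play the role of the empty-piece filter here.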


Note: this answer is for pandas<0.25.0. If you're using 0.25.0 or greater, see the answer below by thesofakillers.


I found an efficient solution for very large and sparse datasets: in my case, hundreds of thousands of rows with only a dozen or so brief segments of data between NaN values. I (ab)used the internals of pandas.SparseIndex, a feature meant to compress sparse datasets in memory.

Given some data:

import pandas as pd
import numpy as np

# 10 days at per-second resolution, starting at midnight Jan 1st, 2011
rng = pd.date_range('1/1/2011', periods=10 * 24 * 60 * 60, freq='S')
dense_ts = pd.Series(np.nan, index=rng, dtype=np.float64)

# Three blocks of non-null data throughout timeseries
dense_ts[500:510] = np.random.randn(10)
dense_ts[12000:12015] = np.random.randn(15)
dense_ts[20000:20050] = np.random.randn(50)

Which looks like:

2011-01-01 00:00:00   NaN
2011-01-01 00:00:01   NaN
2011-01-01 00:00:02   NaN
2011-01-01 00:00:03   NaN
                       ..
2011-01-10 23:59:56   NaN
2011-01-10 23:59:57   NaN
2011-01-10 23:59:58   NaN
2011-01-10 23:59:59   NaN
Freq: S, Length: 864000, dtype: float64

We can find the blocks efficiently and easily:

# Convert to sparse then query index to find block locations
sparse_ts = dense_ts.to_sparse()
block_locs = zip(sparse_ts.sp_index.blocs, sparse_ts.sp_index.blengths)

# Map the sparse blocks back to the dense timeseries
# (note: .iloc slices are end-exclusive, so start + length - 1 drops the
# final point of each block; use start + length to keep the whole block)
blocks = [dense_ts.iloc[start:(start + length - 1)]
          for (start, length) in block_locs]

Voila:

[2011-01-01 00:08:20    0.531793
 2011-01-01 00:08:21    0.484391
 2011-01-01 00:08:22    0.022686
 2011-01-01 00:08:23   -0.206495
 2011-01-01 00:08:24    1.472209
 2011-01-01 00:08:25   -1.261940
 2011-01-01 00:08:26   -0.696388
 2011-01-01 00:08:27   -0.219316
 2011-01-01 00:08:28   -0.474840
 Freq: S, dtype: float64,
 2011-01-01 03:20:00   -0.147190
 2011-01-01 03:20:01    0.299565
 2011-01-01 03:20:02   -0.846878
 2011-01-01 03:20:03   -0.100975
 2011-01-01 03:20:04    1.288872
 2011-01-01 03:20:05   -0.092474
 2011-01-01 03:20:06   -0.214774
 2011-01-01 03:20:07   -0.540479
 2011-01-01 03:20:08   -0.661083
 2011-01-01 03:20:09    1.129878
 2011-01-01 03:20:10    0.791373
 2011-01-01 03:20:11    0.119564
 2011-01-01 03:20:12    0.345459
 2011-01-01 03:20:13   -0.272132
 Freq: S, dtype: float64,
 2011-01-01 05:33:20    1.028268
 2011-01-01 05:33:21    1.476468
 2011-01-01 05:33:22    1.308881
 2011-01-01 05:33:23    1.458202
 2011-01-01 05:33:24   -0.874308
                              ..
 2011-01-01 05:34:02    0.941446
 2011-01-01 05:34:03   -0.996767
 2011-01-01 05:34:04    1.266660
 2011-01-01 05:34:05   -0.391560
 2011-01-01 05:34:06    1.498499
 2011-01-01 05:34:07   -0.636908
 2011-01-01 05:34:08    0.621681
 Freq: S, dtype: float64]


For anyone looking for a non-deprecated (pandas>=0.25.0) version of bloudermilk's answer, after a bit of digging in the pandas sparse source code, I came up with the following. I tried to keep it as similar as possible to their answer so you can compare:

Given some data:

import pandas as pd
import numpy as np

# 10 days at per-second resolution, starting at midnight Jan 1st, 2011
rng = pd.date_range('1/1/2011', periods=10 * 24 * 60 * 60, freq='S')

# NaN data interspersed with 3 blocks of non-NaN data
dense_ts = pd.Series(np.nan, index=rng, dtype=np.float64)
dense_ts[500:510] = np.random.randn(10)
dense_ts[12000:12015] = np.random.randn(15)
dense_ts[20000:20050] = np.random.randn(50)

Which looks like:

2011-01-01 00:00:00   NaN
2011-01-01 00:00:01   NaN
2011-01-01 00:00:02   NaN
2011-01-01 00:00:03   NaN
2011-01-01 00:00:04   NaN
                       ..
2011-01-10 23:59:55   NaN
2011-01-10 23:59:56   NaN
2011-01-10 23:59:57   NaN
2011-01-10 23:59:58   NaN
2011-01-10 23:59:59   NaN
Freq: S, Length: 864000, dtype: float64

We can find the blocks efficiently and easily:

# Convert to sparse then query index to find block locations
# different way of converting to sparse in pandas>=0.25.0
sparse_ts = dense_ts.astype(pd.SparseDtype('float'))

# we need to use .values.sp_index.to_block_index() in this version of pandas
block_index = sparse_ts.values.sp_index.to_block_index()
block_locs = zip(block_index.blocs, block_index.blengths)

# Map the sparse blocks back to the dense timeseries
blocks = [
    dense_ts.iloc[start:(start + length - 1)]
    for (start, length) in block_locs
]

Voila:

> blocks
[2011-01-01 00:08:20    0.092338
 2011-01-01 00:08:21    1.196703
 2011-01-01 00:08:22    0.936586
 2011-01-01 00:08:23   -0.354768
 2011-01-01 00:08:24   -0.209642
 2011-01-01 00:08:25   -0.750103
 2011-01-01 00:08:26    1.344343
 2011-01-01 00:08:27    1.446148
 2011-01-01 00:08:28    1.174443
 Freq: S, dtype: float64,
 2011-01-01 03:20:00    1.327026
 2011-01-01 03:20:01   -0.431162
 2011-01-01 03:20:02   -0.461407
 2011-01-01 03:20:03   -1.330671
 2011-01-01 03:20:04   -0.892480
 2011-01-01 03:20:05   -0.323433
 2011-01-01 03:20:06    2.520965
 2011-01-01 03:20:07    0.140757
 2011-01-01 03:20:08   -1.688278
 2011-01-01 03:20:09    0.856346
 2011-01-01 03:20:10    0.013968
 2011-01-01 03:20:11    0.204514
 2011-01-01 03:20:12    0.287756
 2011-01-01 03:20:13   -0.727400
 Freq: S, dtype: float64,
 2011-01-01 05:33:20   -1.409744
 2011-01-01 05:33:21    0.338251
 2011-01-01 05:33:22    0.215555
 2011-01-01 05:33:23   -0.309874
 2011-01-01 05:33:24    0.753737
 2011-01-01 05:33:25   -0.349966
 2011-01-01 05:33:26    0.074758
 2011-01-01 05:33:27   -1.574485
 2011-01-01 05:33:28    0.595844
 2011-01-01 05:33:29   -0.670004
 2011-01-01 05:33:30    1.655479
 2011-01-01 05:33:31   -0.362853
 2011-01-01 05:33:32    0.167355
 2011-01-01 05:33:33    0.703780
 2011-01-01 05:33:34    2.633756
 2011-01-01 05:33:35    1.898891
 2011-01-01 05:33:36   -1.129365
 2011-01-01 05:33:37   -0.765057
 2011-01-01 05:33:38    0.279869
 2011-01-01 05:33:39    1.388705
 2011-01-01 05:33:40   -1.420761
 2011-01-01 05:33:41    0.455692
 2011-01-01 05:33:42    0.367106
 2011-01-01 05:33:43    0.856598
 2011-01-01 05:33:44    1.920748
 2011-01-01 05:33:45    0.648581
 2011-01-01 05:33:46   -0.606784
 2011-01-01 05:33:47   -0.246285
 2011-01-01 05:33:48   -0.040520
 2011-01-01 05:33:49    1.422764
 2011-01-01 05:33:50   -1.686568
 2011-01-01 05:33:51    1.282430
 2011-01-01 05:33:52    1.358482
 2011-01-01 05:33:53   -0.998765
 2011-01-01 05:33:54   -0.009527
 2011-01-01 05:33:55    0.647671
 2011-01-01 05:33:56   -1.098435
 2011-01-01 05:33:57   -0.638245
 2011-01-01 05:33:58   -1.820668
 2011-01-01 05:33:59    0.768250
 2011-01-01 05:34:00   -1.029975
 2011-01-01 05:34:01   -0.744205
 2011-01-01 05:34:02    1.627130
 2011-01-01 05:34:03    2.058689
 2011-01-01 05:34:04   -1.194971
 2011-01-01 05:34:05    1.293214
 2011-01-01 05:34:06    0.029523
 2011-01-01 05:34:07   -0.405785
 2011-01-01 05:34:08    0.837123
 Freq: S, dtype: float64]
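As a quick sanity check of the block-index technique on a small, made-up Series (note that `.iloc` slicing is end-exclusive, so `start + length` keeps every point of a run):

```python
import numpy as np
import pandas as pd

# Toy series: two non-NaN runs surrounded by NaNs
s = pd.Series([np.nan, 1.0, 2.0, np.nan, np.nan, 3.0, 4.0, 5.0, np.nan])

# NaN is the default fill value for a float SparseDtype, so the
# sparse index tracks exactly the non-NaN runs
sparse = s.astype(pd.SparseDtype('float'))
block_index = sparse.values.sp_index.to_block_index()

# .iloc is end-exclusive, so start + length keeps the whole run
blocks = [s.iloc[start:start + length]
          for start, length in zip(block_index.blocs, block_index.blengths)]

print([b.tolist() for b in blocks])  # [[1.0, 2.0], [3.0, 4.0, 5.0]]
```

This is a sketch of the same internals the answers above rely on; `sp_index` and `to_block_index()` are not part of the documented public API, so they may shift between pandas versions.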