How to split a pandas time-series by NAN values
You can use numpy.split
and then filter the resulting list. Here is one example assuming that the column with the values is labeled "value"
:
events = np.split(df, np.where(np.isnan(df.value))[0])# removing NaN entriesevents = [ev[~np.isnan(ev.value)] for ev in events if not isinstance(ev, np.ndarray)]# removing empty DataFramesevents = [ev for ev in events if not ev.empty]
You will have a list with all the events separated by the NaN
values.
Note, this answer is for pandas<0.25.0, if you're using 0.25.0 or greater see this answer by thesofakillers
I found an efficient solution for very large and sparse datasets. In my case, hundreds of thousands of rows with only a dozen or so brief segments of data between NaN
values. I (ab)used the internals of pandas.SparseIndex
, which is a feature to help compress sparse datasets in memory.
Given some data:
import pandas as pdimport numpy as np# 10 days at per-second resolution, starting at midnight Jan 1st, 2011rng = pd.date_range('1/1/2011', periods=10 * 24 * 60 * 60, freq='S')dense_ts = pd.Series(np.nan, index=rng, dtype=np.float64)# Three blocks of non-null data throughout timeseriesdense_ts[500:510] = np.random.randn(10)dense_ts[12000:12015] = np.random.randn(15)dense_ts[20000:20050] = np.random.randn(50)
Which looks like:
2011-01-01 00:00:00 NaN2011-01-01 00:00:01 NaN2011-01-01 00:00:02 NaN2011-01-01 00:00:03 NaN ..2011-01-10 23:59:56 NaN2011-01-10 23:59:57 NaN2011-01-10 23:59:58 NaN2011-01-10 23:59:59 NaNFreq: S, Length: 864000, dtype: float64
We can find the blocks efficiently and easily:
# Convert to sparse then query index to find block locationssparse_ts = dense_ts.to_sparse()block_locs = zip(sparse_ts.sp_index.blocs, sparse_ts.sp_index.blengths)# Map the sparse blocks back to the dense timeseriesblocks = [dense_ts.iloc[start:(start + length - 1)] for (start, length) in block_locs]
Voila:
[2011-01-01 00:08:20 0.531793 2011-01-01 00:08:21 0.484391 2011-01-01 00:08:22 0.022686 2011-01-01 00:08:23 -0.206495 2011-01-01 00:08:24 1.472209 2011-01-01 00:08:25 -1.261940 2011-01-01 00:08:26 -0.696388 2011-01-01 00:08:27 -0.219316 2011-01-01 00:08:28 -0.474840 Freq: S, dtype: float64, 2011-01-01 03:20:00 -0.147190 2011-01-01 03:20:01 0.299565 2011-01-01 03:20:02 -0.846878 2011-01-01 03:20:03 -0.100975 2011-01-01 03:20:04 1.288872 2011-01-01 03:20:05 -0.092474 2011-01-01 03:20:06 -0.214774 2011-01-01 03:20:07 -0.540479 2011-01-01 03:20:08 -0.661083 2011-01-01 03:20:09 1.129878 2011-01-01 03:20:10 0.791373 2011-01-01 03:20:11 0.119564 2011-01-01 03:20:12 0.345459 2011-01-01 03:20:13 -0.272132 Freq: S, dtype: float64, 2011-01-01 05:33:20 1.028268 2011-01-01 05:33:21 1.476468 2011-01-01 05:33:22 1.308881 2011-01-01 05:33:23 1.458202 2011-01-01 05:33:24 -0.874308 .. 2011-01-01 05:34:02 0.941446 2011-01-01 05:34:03 -0.996767 2011-01-01 05:34:04 1.266660 2011-01-01 05:34:05 -0.391560 2011-01-01 05:34:06 1.498499 2011-01-01 05:34:07 -0.636908 2011-01-01 05:34:08 0.621681 Freq: S, dtype: float64]
For anyone looking for a non-deprecated (pandas>=0.25.0) version of bloudermilk's answer, after a bit of digging in the pandas sparse source code, I came up with the following. I tried to keep it as similar as possible to their answer so you can compare:
Given some data:
import pandas as pdimport numpy as np# 10 days at per-second resolution, starting at midnight Jan 1st, 2011rng = pd.date_range('1/1/2011', periods=10 * 24 * 60 * 60, freq='S')# NaN data interspersed with 3 blocks of non-NaN datadense_ts = pd.Series(np.nan, index=rng, dtype=np.float64)dense_ts[500:510] = np.random.randn(10)dense_ts[12000:12015] = np.random.randn(15)dense_ts[20000:20050] = np.random.randn(50)
Which looks like:
2011-01-01 00:00:00 NaN2011-01-01 00:00:01 NaN2011-01-01 00:00:02 NaN2011-01-01 00:00:03 NaN2011-01-01 00:00:04 NaN ..2011-01-10 23:59:55 NaN2011-01-10 23:59:56 NaN2011-01-10 23:59:57 NaN2011-01-10 23:59:58 NaN2011-01-10 23:59:59 NaNFreq: S, Length: 864000, dtype: float64
We can find the blocks efficiently and easily:
# Convert to sparse then query index to find block locations# different way of converting to sparse in pandas>=0.25.0sparse_ts = dense_ts.astype(pd.SparseDtype('float'))# we need to use .values.sp_index.to_block_index() in this version of pandasblock_locs = zip( sparse_ts.values.sp_index.to_block_index().blocs, sparse_ts.values.sp_index.to_block_index().blengths,)# Map the sparse blocks back to the dense timeseriesblocks = [ dense_ts.iloc[start : (start + length - 1)] for (start, length) in block_locs]
Voila
> blocks[2011-01-01 00:08:20 0.092338 2011-01-01 00:08:21 1.196703 2011-01-01 00:08:22 0.936586 2011-01-01 00:08:23 -0.354768 2011-01-01 00:08:24 -0.209642 2011-01-01 00:08:25 -0.750103 2011-01-01 00:08:26 1.344343 2011-01-01 00:08:27 1.446148 2011-01-01 00:08:28 1.174443 Freq: S, dtype: float64, 2011-01-01 03:20:00 1.327026 2011-01-01 03:20:01 -0.431162 2011-01-01 03:20:02 -0.461407 2011-01-01 03:20:03 -1.330671 2011-01-01 03:20:04 -0.892480 2011-01-01 03:20:05 -0.323433 2011-01-01 03:20:06 2.520965 2011-01-01 03:20:07 0.140757 2011-01-01 03:20:08 -1.688278 2011-01-01 03:20:09 0.856346 2011-01-01 03:20:10 0.013968 2011-01-01 03:20:11 0.204514 2011-01-01 03:20:12 0.287756 2011-01-01 03:20:13 -0.727400 Freq: S, dtype: float64, 2011-01-01 05:33:20 -1.409744 2011-01-01 05:33:21 0.338251 2011-01-01 05:33:22 0.215555 2011-01-01 05:33:23 -0.309874 2011-01-01 05:33:24 0.753737 2011-01-01 05:33:25 -0.349966 2011-01-01 05:33:26 0.074758 2011-01-01 05:33:27 -1.574485 2011-01-01 05:33:28 0.595844 2011-01-01 05:33:29 -0.670004 2011-01-01 05:33:30 1.655479 2011-01-01 05:33:31 -0.362853 2011-01-01 05:33:32 0.167355 2011-01-01 05:33:33 0.703780 2011-01-01 05:33:34 2.633756 2011-01-01 05:33:35 1.898891 2011-01-01 05:33:36 -1.129365 2011-01-01 05:33:37 -0.765057 2011-01-01 05:33:38 0.279869 2011-01-01 05:33:39 1.388705 2011-01-01 05:33:40 -1.420761 2011-01-01 05:33:41 0.455692 2011-01-01 05:33:42 0.367106 2011-01-01 05:33:43 0.856598 2011-01-01 05:33:44 1.920748 2011-01-01 05:33:45 0.648581 2011-01-01 05:33:46 -0.606784 2011-01-01 05:33:47 -0.246285 2011-01-01 05:33:48 -0.040520 2011-01-01 05:33:49 1.422764 2011-01-01 05:33:50 -1.686568 2011-01-01 05:33:51 1.282430 2011-01-01 05:33:52 1.358482 2011-01-01 05:33:53 -0.998765 2011-01-01 05:33:54 -0.009527 2011-01-01 05:33:55 0.647671 2011-01-01 05:33:56 -1.098435 2011-01-01 05:33:57 -0.638245 2011-01-01 05:33:58 -1.820668 2011-01-01 05:33:59 0.768250 2011-01-01 05:34:00 -1.029975 2011-01-01 05:34:01 -0.744205 2011-01-01 05:34:02 1.627130 2011-01-01 05:34:03 2.058689 2011-01-01 05:34:04 -1.194971 2011-01-01 05:34:05 1.293214 2011-01-01 05:34:06 0.029523 2011-01-01 05:34:07 -0.405785 2011-01-01 05:34:08 0.837123 Freq: S, dtype: float64]