Calculate average of every x rows in a table and create new table

python-3.x


You can create an artificial grouping key using df.index//2 (or, as @DSM pointed out, np.arange(len(df))//2 so that it works for any index) and then use groupby:

df.groupby(np.arange(len(df))//2).mean()
Out[13]: 
      a     b     c     d
0   3.0  30.5  31.5  35.0
1   7.0  35.0  21.5  25.0
2  11.0  37.5  41.5  38.5
3  15.0  10.0  16.0  18.5
4  19.0  15.5  27.0  38.0


You can approach this problem using df.rolling() to create a rolling average and then grab every second element using iloc:

df = df.rolling(2).mean()
df = df.iloc[::2, :]

Note that the first observation will be missing (the rolling window is not yet complete at the top), so make sure your data is sorted the way you need it.
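Since rolling(2).mean() writes each window's average onto the window's last row, slicing from position 1 instead of 0 keeps only the complete windows and avoids the leading NaN. A small sketch (the example frame is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': [2, 4, 6, 8], 'b': [50, 11, 33, 37]})

# rolling(2).mean() places each pair's average on the pair's last row,
# so row 0 is NaN; slicing from 1 keeps only the complete windows.
out = df.rolling(2).mean().iloc[1::2]
print(out)
#      a     b
# 1  3.0  30.5
# 3  7.0  35.0
```

The result keeps the original (odd) row labels; chain a .reset_index(drop=True) if you want a fresh 0..n index.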


The NumPythonic way would be to extract the data as a NumPy array with df.values, reshape to a 3D array with 2 elements along axis=1 and the column count along axis=2, perform the average reduction along axis=1, and finally convert back to a DataFrame, like so -

pd.DataFrame(df.values.reshape(-1,2,df.shape[1]).mean(1))

As it turns out, you can use NumPy's very efficient tool np.einsum to do this average-reduction as a combination of sum-reduction and scaling-down, like so -

pd.DataFrame(np.einsum('ijk->ik',df.values.reshape(-1,2,df.shape[1]))/2.0)
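As a sanity check that the sum-then-scale einsum really matches the reshape-and-mean route, you can compare the two on a small array (the array here is illustrative):

```python
import numpy as np

# Illustrative data: 12 rows, 2 columns.
a = np.arange(24, dtype=float).reshape(12, 2)

pairs = a.reshape(-1, 2, a.shape[1])            # shape (6, 2, 2): consecutive row pairs
via_mean = pairs.mean(1)                        # average within each pair
via_einsum = np.einsum('ijk->ik', pairs) / 2.0  # sum over axis 1, then scale down

assert np.allclose(via_mean, via_einsum)
```

The einsum spec 'ijk->ik' simply drops index j, i.e. sums over axis 1; dividing by the pair size 2 turns that sum into the mean.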

Please note that the proposed approaches assume that the number of rows is divisible by 2.
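If the row count is not a multiple of 2, one option (an assumed policy, not part of the original answer) is to trim the trailing remainder before reshaping, so the leftover row is dropped:

```python
import numpy as np
import pandas as pd

# Illustrative 5-row frame; 5 is not divisible by 2.
df = pd.DataFrame(np.arange(10).reshape(5, 2), columns=['a', 'b'])

# Trim to the largest multiple of 2 so the reshape is valid; the last row is dropped.
n = (len(df) // 2) * 2
out = pd.DataFrame(df.values[:n].reshape(-1, 2, df.shape[1]).mean(1),
                   columns=df.columns)
print(out)
#      a    b
# 0  1.0  2.0
# 1  5.0  6.0
```

The groupby approach does not need this trimming: its last group simply contains fewer rows and is averaged as-is.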

Also, as noted by @DSM, to preserve the column names you need to add columns=df.columns when converting back to a DataFrame, i.e. -

pd.DataFrame(...,columns=df.columns)

Sample run -

>>> df
    0   1   2   3
0   2  50  25  26
1   4  11  38  44
2   6  33  16  25
3   8  37  27  25
4  10  28  48  32
5  12  47  35  45
6  14   8  16   7
7  16  12  16  30
8  18  22  39  29
9  20   9  15  47
>>> pd.DataFrame(df.values.reshape(-1,2,df.shape[1]).mean(1))
    0     1     2     3
0   3  30.5  31.5  35.0
1   7  35.0  21.5  25.0
2  11  37.5  41.5  38.5
3  15  10.0  16.0  18.5
4  19  15.5  27.0  38.0
>>> pd.DataFrame(np.einsum('ijk->ik',df.values.reshape(-1,2,df.shape[1]))/2.0)
    0     1     2     3
0   3  30.5  31.5  35.0
1   7  35.0  21.5  25.0
2  11  37.5  41.5  38.5
3  15  10.0  16.0  18.5
4  19  15.5  27.0  38.0

Runtime tests -

In this section, let's test all three approaches listed thus far for performance, including @ayhan's solution with groupby.

In [24]: A = np.random.randint(0,9,(200,50))

In [25]: df = pd.DataFrame(A)

In [26]: %timeit df.groupby(df.index//2).mean() # @ayhan's solution
1000 loops, best of 3: 1.61 ms per loop

In [27]: %timeit pd.DataFrame(df.values.reshape(-1,2,df.shape[1]).mean(1))
1000 loops, best of 3: 317 µs per loop

In [28]: %timeit pd.DataFrame(np.einsum('ijk->ik',df.values.reshape(-1,2,df.shape[1]))/2.0)
1000 loops, best of 3: 266 µs per loop