
What is the Big O Complexity of Reversing the Order of Columns in Pandas DataFrame?


I don't know how Pandas implements this, but I did test it empirically. I ran the following code (in a Jupyter notebook) to test the speed of the operation:

    import pandas as pd

    def get_dummy_df(n):
        return pd.DataFrame({'a': [1,2]*n, 'b': [4,5]*n, 'c': [7,8]*n})

    df = get_dummy_df(100)
    print(df.shape)
    %timeit df_r = df[df.columns[::-1]]

    df = get_dummy_df(1000)
    print(df.shape)
    %timeit df_r = df[df.columns[::-1]]

    df = get_dummy_df(10000)
    print(df.shape)
    %timeit df_r = df[df.columns[::-1]]

    df = get_dummy_df(100000)
    print(df.shape)
    %timeit df_r = df[df.columns[::-1]]

    df = get_dummy_df(1000000)
    print(df.shape)
    %timeit df_r = df[df.columns[::-1]]

    df = get_dummy_df(10000000)
    print(df.shape)
    %timeit df_r = df[df.columns[::-1]]

The output was:

    (200, 3)
    1000 loops, best of 3: 419 µs per loop
    (2000, 3)
    1000 loops, best of 3: 425 µs per loop
    (20000, 3)
    1000 loops, best of 3: 498 µs per loop
    (200000, 3)
    100 loops, best of 3: 2.66 ms per loop
    (2000000, 3)
    10 loops, best of 3: 25.2 ms per loop
    (20000000, 3)
    1 loop, best of 3: 207 ms per loop

As you can see, in the first three cases the fixed overhead of the operation dominates (400-500 µs), but from the fourth case onward the time becomes proportional to the amount of data, increasing by an order of magnitude each time.

So, assuming the time must also be proportional to the number of columns, it seems that we are dealing with O(m*n).
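For completeness, here is a minimal sketch (my own addition, not part of the test above) that sweeps the column dimension at a fixed row count, to check the assumed proportionality to the number of columns:

    import numpy as np
    import pandas as pd

    # Sweep the number of columns at a fixed row count; if the reversal
    # is O(m*n), these timings should grow roughly linearly with m.
    for m in (10, 100, 1000, 10000):
        df = pd.DataFrame(np.zeros((1000, m)))
        %timeit df[df.columns[::-1]]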


The Big O complexity (as of Pandas 0.24) is O(m*n), where m is the number of columns and n is the number of rows. Note that this is when using the DataFrame.__getitem__ method (aka []) with an Index (see the relevant code; other key types would also trigger a copy).
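One informal way to see that this path materializes a full copy rather than a view (a sketch, assuming a single-dtype frame where to_numpy() can expose the underlying block without copying):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'a': [1, 2], 'b': [4, 5], 'c': [7, 8]})
    df_r = df[df.columns[::-1]]

    # Disjoint memory indicates __getitem__ copied the data.
    print(np.shares_memory(df.to_numpy(), df_r.to_numpy()))  # False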

Here is a helpful stack trace:

    <ipython-input-4-3162cae03863>(2)<module>()
          1 columns = df.columns[::-1]
    ----> 2 df_reversed = df[columns]

      pandas/core/frame.py(2682)__getitem__()
       2681             # either boolean or fancy integer index
    -> 2682             return self._getitem_array(key)
       2683         elif isinstance(key, DataFrame):

      pandas/core/frame.py(2727)_getitem_array()
       2726             indexer = self.loc._convert_to_indexer(key, axis=1)
    -> 2727             return self._take(indexer, axis=1)
       2728

      pandas/core/generic.py(2789)_take()
       2788                                    axis=self._get_block_manager_axis(axis),
    -> 2789                                    verify=True)
       2790         result = self._constructor(new_data).__finalize__(self)

      pandas/core/internals.py(4539)take()
       4538         return self.reindex_indexer(new_axis=new_labels, indexer=indexer,
    -> 4539                                     axis=axis, allow_dups=True)
       4540

      pandas/core/internals.py(4421)reindex_indexer()
       4420             new_blocks = self._slice_take_blocks_ax0(indexer,
    -> 4421                                                      fill_tuple=(fill_value,))
       4422         else:

      pandas/core/internals.py(1254)take_nd()
       1253             new_values = algos.take_nd(values, indexer, axis=axis,
    -> 1254                                        allow_fill=False)
       1255         else:

    > pandas/core/algorithms.py(1658)take_nd()
       1657     import ipdb; ipdb.set_trace()
    -> 1658     func = _get_take_nd_function(arr.ndim, arr.dtype, out.dtype, axis=axis,
       1659                                  mask_info=mask_info)
       1660     func(arr, indexer, out, fill_value)

The func call on L1660 in pandas/core/algorithms.py ultimately calls a Cython function with O(m*n) complexity. This is where the original data is copied into out, which contains a copy of the original data in reversed order.

    inner_take_2d_axis0_template = """\
        cdef:
            Py_ssize_t i, j, k, n, idx
            %(c_type_out)s fv

        n = len(indexer)
        k = values.shape[1]

        fv = fill_value

        IF %(can_copy)s:
            cdef:
                %(c_type_out)s *v
                %(c_type_out)s *o

            #GH3130
            if (values.strides[1] == out.strides[1] and
                values.strides[1] == sizeof(%(c_type_out)s) and
                sizeof(%(c_type_out)s) * n >= 256):

                for i from 0 <= i < n:
                    idx = indexer[i]
                    if idx == -1:
                        for j from 0 <= j < k:
                            out[i, j] = fv
                    else:
                        v = &values[idx, 0]
                        o = &out[i, 0]
                        memmove(o, v, <size_t>(sizeof(%(c_type_out)s) * k))
                return

        for i from 0 <= i < n:
            idx = indexer[i]
            if idx == -1:
                for j from 0 <= j < k:
                    out[i, j] = fv
            else:
                for j from 0 <= j < k:
                    out[i, j] = %(preval)svalues[idx, j]%(postval)s
    """

Note that the above template function has a path that uses memmove (the path taken in this case, because we are mapping from int64 to int64 and the output dimensions are identical, as we are merely switching the indexes). memmove is still O(n), proportional to the number of bytes it has to copy, although likely faster than writing to the indexes directly.
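As an illustration only (this is not the actual pandas code), the generated Cython above amounts to the following plain-NumPy take along axis 0; whether the inner copy is a memmove or an explicit loop, the total number of elements copied is n*k:

    import numpy as np

    def take_2d_axis0_sketch(values, indexer, fill_value=0):
        # Mirrors the template above: copy row indexer[i] of `values`
        # into row i of `out`. Total work: n rows * k columns = O(n*k).
        n, k = len(indexer), values.shape[1]
        out = np.empty((n, k), dtype=values.dtype)
        for i, idx in enumerate(indexer):
            if idx == -1:
                out[i, :] = fill_value   # missing label -> fill value
            else:
                out[i, :] = values[idx]  # bulk row copy (the memmove branch)
        return out

    values = np.arange(12).reshape(4, 3)
    print(take_2d_axis0_sketch(values, np.array([3, 2, 1, 0])))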


I ran an empirical test using the big_O fitting library here.

Note: all tests have been conducted by sweeping the independent variable across 6 orders of magnitude, i.e.

  • rows from 10 to 10^6 against a constant column count of 3,
  • columns from 10 to 10^6 against a constant row count of 10.

The result shows that the complexity of the column-reversal operation df[df.columns[::-1]] on the DataFrame is

  1. Cubic: O(n^3), where n is the number of rows
  2. Cubic: O(n^3), where n is the number of columns

Prerequisites: You will need to install the big_o package using the terminal command pip install big_o

Code

    import big_o
    import pandas as pd
    import numpy as np

    SWEEP_LOG10 = 6
    COLUMNS = 3
    ROWS = 10

    def build_df(rows, columns):
        # Isolate the creation of the DataFrame from the inversion operation.
        narray = np.zeros(rows*columns).reshape(rows, columns)
        df = pd.DataFrame(narray)
        return df

    def flip_columns(df):
        return df[df.columns[::-1]]

    def get_row_df(n, m=COLUMNS):
        return build_df(1*10**n, m)

    def get_column_df(n, m=ROWS):
        return build_df(m, 1*10**n)

    # infer the big_o of the columns[::-1] operation vs. rows
    best, others = big_o.big_o(flip_columns, get_row_df, min_n=1, max_n=SWEEP_LOG10,
                               n_measures=SWEEP_LOG10, n_repeats=10)

    # print results
    print('Measuring .columns[::-1] complexity against rapid increase in # rows')
    print('-'*80 + '\nBig O() fits: {}\n'.format(best) + '-'*80)

    for class_, residual in others.items():
        print('{:<60s}  (res: {:.2G})'.format(str(class_), residual))
    print('-'*80)

    # infer the big_o of the columns[::-1] operation vs. columns
    best, others = big_o.big_o(flip_columns, get_column_df, min_n=1, max_n=SWEEP_LOG10,
                               n_measures=SWEEP_LOG10, n_repeats=10)

    # print results
    print()
    print('Measuring .columns[::-1] complexity against rapid increase in # columns')
    print('-'*80 + '\nBig O() fits: {}\n'.format(best) + '-'*80)

    for class_, residual in others.items():
        print('{:<60s}  (res: {:.2G})'.format(str(class_), residual))
    print('-'*80)

Results

    Measuring .columns[::-1] complexity against rapid increase in # rows
    --------------------------------------------------------------------------------
    Big O() fits: Cubic: time = -0.017 + 0.00067*n^3
    --------------------------------------------------------------------------------
    Constant: time = 0.032                                        (res: 0.021)
    Linear: time = -0.051 + 0.024*n                               (res: 0.011)
    Quadratic: time = -0.026 + 0.0038*n^2                         (res: 0.0077)
    Cubic: time = -0.017 + 0.00067*n^3                            (res: 0.0052)
    Polynomial: time = -6.3 * x^1.5                               (res: 6)
    Logarithmic: time = -0.026 + 0.053*log(n)                     (res: 0.015)
    Linearithmic: time = -0.024 + 0.012*n*log(n)                  (res: 0.0094)
    Exponential: time = -7 * 0.66^n                               (res: 3.6)
    --------------------------------------------------------------------------------

    Measuring .columns[::-1] complexity against rapid increase in # columns
    --------------------------------------------------------------------------------
    Big O() fits: Cubic: time = -0.28 + 0.009*n^3
    --------------------------------------------------------------------------------
    Constant: time = 0.38                                         (res: 3.9)
    Linear: time = -0.73 + 0.32*n                                 (res: 2.1)
    Quadratic: time = -0.4 + 0.052*n^2                            (res: 1.5)
    Cubic: time = -0.28 + 0.009*n^3                               (res: 1.1)
    Polynomial: time = -6 * x^2.2                                 (res: 16)
    Logarithmic: time = -0.39 + 0.71*log(n)                       (res: 2.8)
    Linearithmic: time = -0.38 + 0.16*n*log(n)                    (res: 1.8)
    Exponential: time = -7 * 1^n                                  (res: 9.7)
    --------------------------------------------------------------------------------