Inconsistent behavior of jitted function
Ah, that's because in your "failing case" the df["z"].values returns a copy of what is stored in the 'z' column of df. It has nothing to do with the numba function:
    >>> import pandas as pd
    >>> import numpy as np
    >>> df = pd.DataFrame([[0, 3, 0, 1, np.nan]], columns=['v', 'y', 'v', 'x', 'z'])
    >>> np.shares_memory(df['z'].values, df['z'])
    False
While in the "working case" it's a view into the 'z' column:
    >>> df = pd.DataFrame([[0, 3, 1, np.nan]], columns=['v', 'y', 'x', 'z'])
    >>> np.shares_memory(df['z'].values, df['z'])
    True
NB: It's actually quite funny that this check works, because the copy is made when you do df['z'], not when you access the .values.
The take-away here is that you cannot expect that indexing a DataFrame or accessing the .values of a Series will always return a view, so updating the column in-place may not change the values of the original. Duplicate column names are not the only thing that can cause this. When the values property returns a copy and when it returns a view is not always clear (except for pd.Series, where it's always a view). But these are just implementation details, so it's never a good idea to rely on a specific behavior here. The only guarantee that .values makes is that it returns a numpy.ndarray containing the same values.
However, it's pretty easy to avoid that problem by simply returning the modified z column from the function:
    import numba as nb
    import numpy as np
    import pandas as pd

    @nb.njit
    def f_(n, x, y, z):
        for i in range(n):
            z[i] = x[i] * y[i]
        return z  # this is new
Then assign the result of the function to the column:
    >>> df = pd.DataFrame([[0, 3, 0, 1, np.nan]], columns=['v', 'y', 'v', 'x', 'z'])
    >>> df['z'] = f_(df.shape[0], df["x"].values, df["y"].values, df["z"].values)
    >>> df
       v  y  v  x    z
    0  0  3  0  1  3.0

    >>> df = pd.DataFrame([[0, 3, 1, np.nan]], columns=['v', 'y', 'x', 'z'])
    >>> df['z'] = f_(df.shape[0], df["x"].values, df["y"].values, df["z"].values)
    >>> df
       v  y  x    z
    0  0  3  1  3.0
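A variation on the same idea, in case you'd rather not pass df["z"].values to the function at all: allocate the output buffer yourself and assign the result back. This is only a minimal sketch. The names multiply_into and update_z are made up for illustration, and the loop is a plain-Python stand-in for the jitted f_ so that the example runs without numba.

```python
import numpy as np
import pandas as pd


def multiply_into(n, x, y, out):
    # Plain-Python stand-in for the jitted f_ (hypothetical name).
    # Writes x[i] * y[i] into out and returns the buffer.
    for i in range(n):
        out[i] = x[i] * y[i]
    return out


def update_z(df):
    # Allocate our own float64 buffer instead of writing into df["z"].values,
    # so it does not matter whether pandas hands out a view or a copy.
    out = np.empty(len(df), dtype=np.float64)
    df["z"] = multiply_into(len(df), df["x"].to_numpy(), df["y"].to_numpy(), out)
    return df


# Frame with a duplicated column name ('v') and frame with unique names.
dup = pd.DataFrame([[0, 3, 0, 1, np.nan]], columns=['v', 'y', 'v', 'x', 'z'])
uniq = pd.DataFrame([[0, 3, 1, np.nan]], columns=['v', 'y', 'x', 'z'])

print(update_z(dup)["z"].tolist())
print(update_z(uniq)["z"].tolist())
```

Both frames end up with the product in 'z', because the result is assigned explicitly rather than relying on an in-place write reaching the DataFrame.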
In case you're interested in what currently happens in your specific case (as I mentioned, we're talking about implementation details here, so don't take this as given; it's just the way it's implemented now): a DataFrame stores the columns that have the same dtype together in a multidimensional NumPy array. This can be seen if you access the blocks attribute (deprecated because the internal storage may change in the near future):
    >>> df = pd.DataFrame([[0, 3, 0, 1, np.nan]], columns=['v', 'y', 'v', 'x', 'z'])
    >>> df.blocks
    {'float64':
         z
     0 NaN,
     'int64':
        v  y  v  x
     0  0  3  0  1}
Normally it's very easy to create a view into that block by translating the column name to the column index of the corresponding block. However, if you have a duplicate column name, accessing an arbitrary column cannot be guaranteed to return a view. For example, if you want to access 'v', it has to index the int64 block with indices 0 and 2:
    >>> df = pd.DataFrame([[0, 3, 0, 1, np.nan]], columns=['v', 'y', 'v', 'x', 'z'])
    >>> df['v']
       v  v
    0  0  0
Technically it would be possible to return the non-duplicated columns as views (and in this case even the duplicated column, for example by using Int64Block[::2], but that's a very special case...). Pandas opts for the safe option and always returns a copy if there are duplicate column names (which makes sense if you think about it: why should indexing one column return a view and another a copy?). The indexing of the DataFrame has an explicit check for duplicate columns and treats them differently (resulting in copies):
    def _getitem_column(self, key):
        """ return the actual column """

        # get column
        if self.columns.is_unique:
            return self._get_item_cache(key)

        # duplicate columns & possible reduce dimensionality
        result = self._constructor(self._data.get(key))
        if result.columns.is_unique:
            result = result[key]

        return result
The columns.is_unique check is the important line here. It's True for your "working case" but False for the "failing case".
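The view-versus-copy difference when selecting columns out of a block can be reproduced with plain NumPy. The following is only a sketch: the array named block is a hypothetical stand-in for the int64 block and does not claim to mirror pandas' actual internal layout. It shows that selecting one column or a strided slice gives a view, while selecting columns by an index list such as [0, 2] gives a copy.

```python
import numpy as np

# Hypothetical stand-in for the int64 block: one row, columns v, y, v, x.
block = np.array([[0, 3, 0, 1]], dtype=np.int64)

single = block[:, 1]      # basic indexing of one column: a view
strided = block[:, ::2]   # basic slicing with a step (columns 0 and 2): a view
fancy = block[:, [0, 2]]  # integer-list ("fancy") indexing: a copy

print(np.shares_memory(block, single))
print(np.shares_memory(block, strided))
print(np.shares_memory(block, fancy))

# Writing through the copy leaves the block untouched ...
fancy[0, 0] = 99
print(block[0, 0])

# ... while writing through the view changes it.
strided[0, 0] = 42
print(block[0, 0])
```

An in-place write only reaches the original data when the array you were handed is a view, which is why returning the result and assigning it back is the more robust pattern.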