Do the individual Series contained within a DataFrame maintain their own index?
This looks like either a bug or an unintended consequence of Python object identities. Prior to the assignment we can see that the indices are the same object:
In [175]:
import pandas as pd
df = pd.DataFrame(dict(A=[1, 2, 3]))
df

Out[175]:
   A
0  1
1  2
2  3

In [176]:
print(id(df.index))
print(id(df['A']))
print(id(df['A'].index))
a = df.A
a

132848496
135123240
132848496

Out[176]:
0    1
1    2
2    3
Name: A, dtype: int64
Now if we modify our reference, the indices become distinct objects, while a and df['A'] are still the same object:
In [177]:
a.index = a.index + 1
print(a)
print(id(a))
print(id(df.A))
print()
print(df)
print(id(df.A.index))
print(id(a.index))

1    1
2    2
3    3
Name: A, dtype: int64
135123240
135123240

   A
0  1
1  2
2  3
135125144
135125144
But now df.index is distinct from df['A'].index and a.index:
In [181]:
print(id(df.index))
print(id(a.index))
print(id(df['A'].index))

132848496
135124808
135124808
Personally I'd consider this an unintended consequence: once you take the reference a to column 'A', it is hard to say what the original df should do when you start to mutate that reference, and I bet this is even harder to catch than the usual SettingWithCopyWarning.
To avoid this, it's best to call copy() to make a deep copy, so that any mutations don't affect the original df:
In [183]:
df = pd.DataFrame(dict(A=[1, 2, 3]))
a = df['A'].copy()
a.index = a.index + 1
print(a)
print(df['A'])
print(df['A'].index)
print(df.index)
print()
print(id(df['A']))
print(id(a))
print(id(df['A'].index))
print(id(a.index))

1    1
2    2
3    3
Name: A, dtype: int64
0    1
1    2
2    3
Name: A, dtype: int64
RangeIndex(start=0, stop=3, step=1)
RangeIndex(start=0, stop=3, step=1)

135125984
135165376
135165544
135125816
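Since the id() values differ from run to run, the same check can be written by comparing labels instead. This is only a minimal sketch; the variable names copy_labels, frame_labels and column_labels are made up for illustration and are not part of the original session.

```python
import pandas as pd

df = pd.DataFrame(dict(A=[1, 2, 3]))

# Take an independent copy of the column before touching its index.
a = df["A"].copy()
a.index = a.index + 1

# Labels carried by the copy after the shift.
copy_labels = list(a.index)
# Labels still seen through the frame and through its column.
frame_labels = list(df.index)
column_labels = list(df["A"].index)

print(copy_labels, frame_labels, column_labels)
```

If the copy is truly independent, only copy_labels should show the shifted values.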
It's a game of references (pointers): each DataFrame has its own index array, and the Series in the DataFrame hold references to that same index array.
When a.index = a.index + 1 is executed, the reference held by the Series is changed, so a.index is the same object as df.A.index, which is now different from df.index.
Now if you clear the df item cache (a private method, so it may not exist in every pandas version), the Series is reset:
print(df.A.index)
df._clear_item_cache()
print(df.A.index)
By default, the index of a Series inside a DataFrame cannot be changed independently of the DataFrame, but holding a reference to the Series provides a workaround for rebinding its index.
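The reference behaviour described in this answer can be imitated with a toy model in plain Python. This is only a sketch under stated assumptions: ToySeries, ToyFrame and clear_cache are hypothetical names invented here, and the code does not reproduce how pandas is actually implemented. It only shows how a cached column object that shares an index reference with its frame can drift apart from it.

```python
class ToySeries:
    """Hypothetical stand-in for a column; not pandas code."""

    def __init__(self, values, index):
        self.values = values
        self.index = index  # stores a reference, not a copy


class ToyFrame:
    """Hypothetical stand-in for a frame with a per-column cache."""

    def __init__(self, columns, index):
        self.index = index
        self._columns = columns
        self._cache = {}

    def __getitem__(self, name):
        # Build the column object once, then hand out the cached one.
        if name not in self._cache:
            self._cache[name] = ToySeries(self._columns[name], self.index)
        return self._cache[name]

    def clear_cache(self):
        self._cache.clear()


frame = ToyFrame({"A": [1, 2, 3]}, index=[0, 1, 2])
a = frame["A"]
shared_before = a.index is frame.index           # True: one shared index

a.index = [i + 1 for i in a.index]               # rebinds a.index to a new list

same_series = frame["A"] is a                    # True: the cache returns a
column_follows_a = frame["A"].index is a.index   # True
frame_unchanged = frame.index == [0, 1, 2]       # True: the frame's index is intact

frame.clear_cache()
reset_after_clear = frame["A"].index is frame.index  # True: column rebuilt
```

In this model, rebinding a.index only changes the attribute on the cached column object, which is why the frame's own index is untouched and why dropping the cache brings the column back in line.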