
Why is blindly using df.copy() a bad idea to fix the SettingWithCopyWarning?


Here are my two cents on this, with a very simple example of why the warning is important.

So assume that I create a df such as:

    import pandas as pd

    x = pd.DataFrame(list(zip(range(4), range(4))), columns=['a', 'b'])
    print(x)

       a  b
    0  0  0
    1  1  1
    2  2  2
    3  3  3

Now I want to create a new dataframe based on a subset of the original and modify it, such as:

    q = x.loc[:, 'a']

Now, this is a slice of the original, and whatever I do on it will affect x:

    q += 2
    print(x)  # checking x again, wow! it changed!

       a  b
    0  2  0
    1  3  1
    2  4  2
    3  5  3

This is what the warning is telling you: you are working on a slice, so everything you do on it will be reflected in the original DataFrame.

Now, using .copy(), q won't be a slice of the original, so doing an operation on q won't affect x:

    x = pd.DataFrame(list(zip(range(4), range(4))), columns=['a', 'b'])
    print(x)

       a  b
    0  0  0
    1  1  1
    2  2  2
    3  3  3

    q = x.loc[:, 'a'].copy()
    q += 2
    print(x)  # oh, x did not change because q is a copy now

       a  b
    0  0  0
    1  1  1
    2  2  2
    3  3  3

And by the way, a copy just means that q will be a new object in memory, whereas a slice shares the same underlying object in memory as the original.

IMO, using .copy() is very safe. As an example, df.loc[:, 'a'] returns a slice but df.loc[df.index, 'a'] returns a copy. Jeff told me that this was unexpected behavior and that : and df.index should behave the same as indexers in .loc[], but calling .copy() on either one will return a copy; better be safe. So use .copy() if you don't want to affect the original dataframe.
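If you want to see which case you are in, a quick check is whether the result still shares memory with the original. This is a minimal sketch (not from the original answer), assuming NumPy-backed columns; the exact output depends on your pandas version, since copy-on-write in newer versions changes these rules:

    import numpy as np
    import pandas as pd

    x = pd.DataFrame(list(zip(range(4), range(4))), columns=['a', 'b'])

    sliced = x.loc[:, 'a']          # historically a view into x
    indexed = x.loc[x.index, 'a']   # historically a copy
    copied = x.loc[:, 'a'].copy()   # always an independent copy

    for name, obj in [("x.loc[:, 'a']", sliced),
                      ("x.loc[x.index, 'a']", indexed),
                      ("x.loc[:, 'a'].copy()", copied)]:
        # shares_memory tells us whether writing to obj could touch x
        shared = np.shares_memory(obj.to_numpy(), x['a'].to_numpy())
        print(name, "shares memory with x:", shared)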

Now, using .copy() returns a deep copy of the DataFrame, which is a very safe approach to avoid the phone call you are talking about.

But using df.is_copy = None is just a trick that does not copy anything, which is a very bad idea: you will still be working on a slice of the original DataFrame.

One more thing that people tend not to know:

df[columns] may return a view.

df.loc[indexer, columns] may also return a view, but it almost always does not in practice. Emphasis on the may here.
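A small illustration of that may, assuming an older pandas without copy-on-write (newer versions behave differently, which is the whole point of the hedging):

    import pandas as pd

    df = pd.DataFrame({'a': [0, 1, 2], 'b': [3, 4, 5]})

    col = df['a']          # single label: may be a view of df (version-dependent)
    cols = df[['a', 'b']]  # list of labels: in practice a fresh copy

    col.iloc[0] = 99       # older pandas writes through to df; newer versions may not
    cols.iloc[0, 0] = -1   # typically warns and does not touch df

    print(df)              # whether df.loc[0, 'a'] is now 99 depends on your pandas version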


While the other answers provide good information about why one shouldn't simply ignore the warning, I think your original question has not been answered yet.

@thn points out that using copy() completely depends on the scenario at hand. When you want the original data to be preserved, you use .copy(); otherwise you don't. If you are using copy() to circumvent the SettingWithCopyWarning, you are ignoring the fact that you may be introducing a logical bug into your software. As long as you are absolutely certain that this is what you want to do, you are fine.
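To make such a logical bug concrete, here is a hypothetical sketch (the function and column names are made up): the intent is to update the original frame, and inserting .copy() silences the warning while quietly breaking that intent:

    import pandas as pd

    def discount_expensive_items(df):
        # Intent: lower the price of expensive rows in the original df.
        expensive = df[df['price'] > 100].copy()  # .copy() silences the warning...
        expensive['price'] *= 0.9                 # ...but only the copy is changed
        # The caller's df is untouched: the discount is silently lost.

    items = pd.DataFrame({'price': [50, 150, 200]})
    discount_expensive_items(items)
    print(items)  # prices unchanged -> logical bug, and no warning to point at it

The warning was pointing at exactly this kind of situation; the real fix here would be a single items.loc[items['price'] > 100, 'price'] *= 0.9 on the original frame.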

However, when using .copy() blindly you may run into another issue, one that is no longer really pandas-specific but occurs every time you copy data.

I slightly modified your example code to make the problem more apparent:

    import numpy as np
    import pandas as pd
    from memory_profiler import profile
    @profile
    def foo():
        df = pd.DataFrame(np.random.randn(2 * 10 ** 7))

        d1 = df[:]
        d1 = d1.copy()

    if __name__ == '__main__':
        foo()

When using memory_profiler one can clearly see that .copy() doubles our memory consumption:

    > python -m memory_profiler demo.py

    Filename: demo.py

    Line #    Mem usage    Increment   Line Contents
    ================================================
         4   61.195 MiB    0.000 MiB   @profile
         5                             def foo():
         6  213.828 MiB  152.633 MiB       df = pd.DataFrame(np.random.randn(2 * 10 ** 7))
         7
         8  213.863 MiB    0.035 MiB       d1 = df[:]
         9  366.457 MiB  152.594 MiB       d1 = d1.copy()

This relates to the fact that there is still a reference (df) pointing to the original data frame. Thus, df is not cleaned up by the garbage collector and is kept in memory.

When you are using this code in a production system, you may or may not get a MemoryError depending on the size of the data you are dealing with and your available memory.
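If the copy is really needed, you can at least let the original frame be reclaimed by dropping the last reference to it once the copy exists. A minimal sketch, assuming nothing else in the program still needs df (note that both frames are still alive while the copy itself is being made, as the edit below shows):

    import numpy as np
    import pandas as pd

    def foo():
        df = pd.DataFrame(np.random.randn(2 * 10 ** 7))
        d1 = df[:].copy()   # take the copy first...
        del df              # ...then drop the only remaining reference to the original
        # from here on only one ~150 MiB frame has to stay alive
        return d1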

To conclude, it is not a wise idea to use .copy() blindly: not just because you may introduce a logical bug into your software, but also because it may expose runtime dangers such as a MemoryError.


Edit: Even if you do df = df.copy() and can ensure that there are no other references to the original df, copy() is still evaluated before the assignment, meaning that for a short time both data frames will be in memory.

Example (notice that you cannot see this behavior in the memory summary):

    > mprof run -T 0.001 demo.py

    Line #    Mem usage    Increment   Line Contents
    ================================================
         7     62.9 MiB      0.0 MiB   @profile
         8                             def foo():
         9    215.5 MiB    152.6 MiB       df = pd.DataFrame(np.random.randn(2 * 10 ** 7))
        10    215.5 MiB      0.0 MiB       df = df.copy()

But if you visualise memory consumption over time, at 1.6s both data frames are in memory:

[Figure: mprof plot of memory usage over time; at about 1.6 s both data frames are in memory.]


EDIT:

After our comment exchange and from reading around a bit (I even found @Jeff's answer), I may be bringing owls to Athens, but the pandas docs contain this code example:

Sometimes a SettingWithCopy warning will arise at times when there’s no obvious chained indexing going on. These are the bugs that SettingWithCopy is designed to catch! Pandas is probably trying to warn you that you’ve done this:

    def do_something(df):
        foo = df[['bar', 'baz']]  # Is foo a view? A copy? Nobody knows!
        # ... many lines here ...
        foo['quux'] = value  # We don't know whether this will modify df or not!
        return foo

That may be an easily avoided problem for an experienced user/developer, but pandas is not only for the experienced...
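One way to remove that ambiguity is to state the intent explicitly with .copy(), which is also what the docs recommend; a sketch reusing the bar/baz/quux names from the snippet above:

    import pandas as pd

    def do_something(df):
        foo = df[['bar', 'baz']].copy()        # explicitly an independent copy
        # ... many lines here ...
        foo['quux'] = foo['bar'] + foo['baz']  # clearly modifies only foo, never df
        return foo

    df = pd.DataFrame({'bar': [1, 2], 'baz': [3, 4]})
    out = do_something(df)
    print(df.columns.tolist())  # ['bar', 'baz'] - the original is untouched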

Still, you probably will not get a phone call in the middle of the night on a Sunday about this, but it may damage your data integrity in the long run if you don't catch it early.
Also, as Murphy's law states, the most time-consuming and complex data manipulation you do WILL be performed on a copy that gets discarded before it is used, and you will spend hours trying to debug it!

Note: all of this is hypothetical, because the very description in the docs is a hypothesis based on the probability of (unfortunate) events... SettingWithCopy is a new-user-friendly warning which exists to warn users of potentially unintended and unwanted behavior in their code.


There is also this issue from 2014.
The code that causes the warning in that case looks like this:

    from pandas import DataFrame

    # create example dataframe:
    df = DataFrame({'column1': ['a', 'a', 'a'], 'column2': [4, 8, 9]})
    df

    # assign string to 'column1':
    df['column1'] = df['column1'] + 'b'
    df
    # it works just fine - no warnings

    # now remove one line from dataframe df:
    df = df[df['column2'] != 8]
    df

    # adding string to 'column1' gives warning:
    df['column1'] = df['column1'] + 'c'
    df

And jreback made some comments on the matter:

You are in fact setting a copy.

You prob don't care; it is mainly to address situations like:

df['foo'][0] = 123... 

which sets the copy (and thus is not visible to the user)

This operation makes df now point to a copy of the original

df = df [df['column2']!=8]

If you don't care about the 'original' frame, then its ok

If you are expecting that the

df['column1'] = df['columns'] + 'c'

would actually set the original frame (they are both called 'df' here, which is confusing), then you would be surprised.

and

(this warning is mainly for new users to avoid setting the copy)

Finally he concludes:

Copies don't normally matter except when you are then trying to set them in a chained manner.
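In practice, the chained form jreback quotes is exactly what a single-step .loc assignment avoids; a minimal sketch of the two side by side:

    import pandas as pd

    df = pd.DataFrame({'foo': [1, 2, 3]})

    # Chained assignment: two indexing steps, the second may hit a temporary copy.
    # Depending on the pandas version this works, warns, or silently does nothing.
    df['foo'][0] = 123

    # Single-step assignment: one call that always targets df itself.
    df.loc[0, 'foo'] = 123
    print(df)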

From the above we can draw these conclusions:

  1. SettingWithCopyWarning has a meaning and there are (as presented by jreback) situations in which this warning matters and the complications may be avoided.
  2. The warning is mainly a "safety net" for newer users, to make them pay attention to what they are doing and to the fact that it may cause unexpected behavior in chained operations. Thus a more advanced user can turn off the warning (from jreback's answer):

    pd.set_option('mode.chained_assignment', None)

or you could do:

df.is_copy = False
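If, after all of the above, you still decide to silence the warning, the option route ages better than poking at is_copy (which newer pandas versions deprecate and remove). A sketch using the documented option name, preferably scoped with a context manager:

    import pandas as pd

    df = pd.DataFrame({'a': [1, 2, 3]})

    # Scoped: silence the warning only around code you have already vetted.
    # (pd.set_option('mode.chained_assignment', None) would do it globally.)
    with pd.option_context('mode.chained_assignment', None):
        sub = df[df['a'] > 1]
        sub['a'] = 0   # would normally warn; silenced only inside this block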