why should I make a copy of a data frame in pandas


This expands on Paul's answer. In pandas, indexing a DataFrame can return a view on the initial DataFrame rather than an independent copy. When that happens (the classic behavior, with Copy-on-Write not enabled), changing the subset also changes the initial DataFrame. So use .copy() whenever you want to be sure the initial DataFrame does not change. Consider the following code:

    from pandas import DataFrame

    df = DataFrame({'x': [1, 2]})
    df_sub = df[0:1]   # slice: first row only
    df_sub.x = -1      # assign through the subset
    print(df)

You'll get:

       x
    0 -1
    1  2

In contrast, the following leaves df unchanged:

    df_sub_copy = df[0:1].copy()
    df_sub_copy.x = -1
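The copy case can be checked programmatically rather than by reading printed output. This is only a sketch that is not part of the original answer: the `original` snapshot variable is an assumption added for comparison, and the un-copied slice is deliberately left out because whether its change propagates depends on the pandas version and settings.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2]})
original = df.copy()          # snapshot of df, used only for comparison

df_sub_copy = df[0:1].copy()  # explicit, independent copy of the first row
df_sub_copy["x"] = -1         # modify the copy only

# df still equals its snapshot because the change was made on a copy.
print(df.equals(original))    # True
```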


Because if you don't make a copy, the data (including the index) can still be modified elsewhere, even if you assign the DataFrame to a different name: the new name refers to the same object.

For example:

    df2 = df
    func1(df2)
    func2(df)

func1 can modify df by modifying df2, so to avoid that:

    df2 = df.copy()
    func1(df2)
    func2(df)
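The two snippets above can be made runnable as follows. This is a minimal sketch: the answer does not define `func1`, so the body used here (overwriting column `x` in place) is a hypothetical stand-in, and the names `df_a`, `df_b`, `alias` and `independent` are illustrative only.

```python
import pandas as pd

def func1(frame):
    # Hypothetical stand-in: modifies its argument in place.
    frame["x"] = 0

# Case 1: plain assignment only creates a second name for the same object.
df_a = pd.DataFrame({"x": [1, 2]})
alias = df_a
func1(alias)
print(alias is df_a)         # True
print(df_a["x"].tolist())    # [0, 0]

# Case 2: .copy() creates an independent object, so the original survives.
df_b = pd.DataFrame({"x": [1, 2]})
independent = df_b.copy()
func1(independent)
print(df_b["x"].tolist())    # [1, 2]
```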


It is worth mentioning that whether a copy or a view is returned depends on the kind of indexing.

The pandas documentation says:

Returning a view versus a copy

The rules about when a view on the data is returned are entirely dependent on NumPy. Whenever an array of labels or a boolean vector are involved in the indexing operation, the result will be a copy. With single label / scalar indexing and slicing, e.g. df.ix[3:6] or df.ix[:, 'A'], a view will be returned.
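One way to probe the rule quoted above is to ask NumPy whether two frames overlap in memory. This is a rough sketch under several assumptions: `shares_data` is a hypothetical helper, not a pandas function; `.iloc` is used because the `.ix` indexer from the quoted passage was removed in pandas 1.0; and `to_numpy()` may itself copy (for example with mixed dtypes), so a `False` result is not conclusive in general. The slice result is printed without a stated expectation because it can vary with the pandas version.

```python
import numpy as np
import pandas as pd

def shares_data(a, b):
    """Hypothetical helper: True if the two frames' values overlap in memory."""
    return np.shares_memory(a.to_numpy(), b.to_numpy())

df = pd.DataFrame({"x": [1, 2, 3]})

by_slice = df.iloc[0:2]      # positional slice
by_mask = df[df["x"] > 1]    # boolean vector: result is a copy
explicit = df.copy()         # explicit copy

print("slice shares memory:", shares_data(df, by_slice))
print("mask shares memory: ", shares_data(df, by_mask))   # False
print("copy shares memory: ", shares_data(df, explicit))  # False
```

Whatever the slice reports, calling `.copy()` remains the only way to be certain the result is independent of the original.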

