Selecting/excluding sets of columns in pandas [duplicate]
There is a new index method called difference. It returns the original columns, with the columns passed as argument removed.
Here, the result is used to remove columns B and D from df:
df2 = df[df.columns.difference(['B', 'D'])]
Note that it's a set-based method: duplicate column names will cause issues, and the result is sorted by default, so the column order may change.
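To illustrate the ordering caveat, here is a minimal sketch assuming pandas 0.24 or later, where Index.difference accepts a sort argument. The toy dataframe and its column names are made up for the example; passing sort=False should keep the remaining columns in their original order.

```python
import pandas as pd

# Toy dataframe whose columns are deliberately not in alphabetical order
df = pd.DataFrame([[1, 2, 3, 4]], columns=["D", "C", "B", "A"])

# Default behaviour: the resulting index is sorted
sorted_cols = df.columns.difference(["C"])

# sort=False: the remaining columns keep the order they had in df
ordered_cols = df.columns.difference(["C"], sort=False)

print(list(sorted_cols))
print(list(ordered_cols))

# Select the columns without reordering them
df2 = df[ordered_cols]
```

If the original column order matters downstream, sort=False is probably the safer choice.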
Advantage over drop: you don't create a copy of the entire dataframe when you only need the list of columns. For instance, in order to drop duplicates on a subset of columns:
# may create a copy of the dataframe
subset = df.drop(['B', 'D'], axis=1).columns
# does not create a copy of the dataframe
subset = df.columns.difference(['B', 'D'])
df = df.drop_duplicates(subset=subset)
Another option, without dropping or filtering in a loop:
import numpy as np
import pandas as pd

# Create a dataframe with columns A, B, C and D
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))

# include the columns you want
df[df.columns[df.columns.isin(['A', 'B'])]]

# or more simply include columns:
df[['A', 'B']]

# exclude columns you don't want
df[df.columns[~df.columns.isin(['C','D'])]]

# or even simpler since 0.24
# with the caveat that it reorders columns alphabetically
df[df.columns.difference(['C', 'D'])]
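The isin approach can also be wrapped in a small helper. This is only a sketch: exclude_columns is a hypothetical name, not a pandas function, and the toy dataframe is made up. Because it uses a boolean mask rather than a set operation, it should preserve both the column order and any duplicate column names.

```python
import pandas as pd


def exclude_columns(frame, names):
    """Return frame without the columns listed in names (hypothetical helper)."""
    # Boolean mask: True for every column that is NOT in names
    keep = ~frame.columns.isin(names)
    # Positional boolean selection keeps order and duplicate labels intact
    return frame.loc[:, keep]


# Toy dataframe with a duplicated column name "A"
df = pd.DataFrame([[1, 2, 3, 4]], columns=["B", "A", "A", "C"])

result = exclude_columns(df, ["C"])
print(list(result.columns))
```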