How to drop columns which have the same value in all rows via pandas or spark dataframe?
What we can do is use nunique to calculate the number of unique values in each column of the DataFrame, and drop the columns which only have a single unique value:
In [285]:
nunique = df.nunique()
cols_to_drop = nunique[nunique == 1].index
df.drop(cols_to_drop, axis=1)

Out[285]:
   index   id   name  data1
0      0  345  name1      3
1      1   12  name2      2
2      5    2  name6      7
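For reference, here is a minimal sketch of what the example df might look like, reconstructed from the outputs shown in these answers; the exact names and values of the constant columns (value, value2, value3, val5) are assumptions:

import numpy as np
import pandas as pd

# Hypothetical reconstruction of the example DataFrame; value, value2,
# value3 and val5 are the constant columns every approach should drop.
df = pd.DataFrame({'index': [0, 1, 5],
                   'id': [345, 12, 2],
                   'name': ['name1', 'name2', 'name6'],
                   'value': [1, 1, 1],
                   'value2': [2, 2, 2],
                   'value3': [3, 3, 3],
                   'data1': [3, 2, 7],
                   'val5': [0, 0, 0]})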
Another way is to just diff the numeric columns, take the abs values, and sum them:
In [298]:
cols = df.select_dtypes([np.number]).columns
diff = df[cols].diff().abs().sum()
df.drop(diff[diff == 0].index, axis=1)

Out[298]:
   index   id   name  data1
0      0  345  name1      3
1      1   12  name2      2
2      5    2  name6      7
Another approach is to use the property that the standard deviation is zero for a column whose values are all the same:
In [300]:
cols = df.select_dtypes([np.number]).columns
std = df[cols].std()
cols_to_drop = std[std == 0].index
df.drop(cols_to_drop, axis=1)

Out[300]:
   index   id   name  data1
0      0  345  name1      3
1      1   12  name2      2
2      5    2  name6      7
Actually the above can be done in a one-liner:
In [306]:
df.drop(df.std()[(df.std() == 0)].index, axis=1)

Out[306]:
   index   id   name  data1
0      0  345  name1      3
1      1   12  name2      2
2      5    2  name6      7
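Note that in newer pandas releases df.std() no longer silently skips non-numeric columns such as name, so the one-liner may raise; a hedged adaptation that restricts the computation to numeric columns:

# Restrict std() to numeric columns explicitly, then drop the zero-variance ones.
std = df.std(numeric_only=True)
df = df.drop(std[std == 0].index, axis=1)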
A simple one-liner (Python):
df = df[[i for i in df if len(set(df[i])) > 1]]
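A related pandas-native variant (an addition, not part of the answer above) based on nunique, which also drops an all-NaN column by counting NaN as a single value:

# Keep only the columns with more than one distinct value, counting NaN as a value.
df = df.loc[:, df.nunique(dropna=False) > 1]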
Another solution is to set_index on the columns which are not compared, then compare the first row (selected by iloc) with the whole DataFrame using eq, and finally use boolean indexing:
df1 = df.set_index(['index','id','name'])
print (~df1.eq(df1.iloc[0]).all())
value     False
value2    False
value3    False
data1      True
val5      False
dtype: bool

print (df1.loc[:, (~df1.eq(df1.iloc[0]).all())].reset_index())
   index   id   name  data1
0      0  345  name1      3
1      1   12  name2      2
2      5    2  name6      7
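None of the answers above cover the Spark part of the question. A minimal PySpark sketch (an assumption, not from the answers above) is to count distinct values per column in one pass and drop the columns with exactly one distinct value:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical Spark DataFrame with one constant column ('value').
sdf = spark.createDataFrame(
    [(0, 345, 'name1', 1, 3), (1, 12, 'name2', 1, 2), (5, 2, 'name6', 1, 7)],
    ['index', 'id', 'name', 'value', 'data1'])

# Count distinct values in every column with a single aggregation.
counts = sdf.agg(*[F.countDistinct(c).alias(c) for c in sdf.columns]).collect()[0].asDict()

# Drop the columns with exactly one distinct value (countDistinct ignores nulls).
cols_to_drop = [c for c, n in counts.items() if n == 1]
sdf = sdf.drop(*cols_to_drop)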