How to "select distinct" across multiple data frame columns in pandas?


You can use the drop_duplicates method to get the unique rows in a DataFrame:

In [29]: df = pd.DataFrame({'a':[1,2,1,2], 'b':[3,4,3,5]})

In [30]: df
Out[30]:
   a  b
0  1  3
1  2  4
2  1  3
3  2  5

In [32]: df.drop_duplicates()
Out[32]:
   a  b
0  1  3
1  2  4
3  2  5

You can also provide the subset keyword argument if you only want to use certain columns to determine uniqueness. See the docstring.
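
As a minimal sketch of the subset argument, reusing the small frame from the session above: only column a decides whether a row counts as a duplicate, and the keep argument controls which occurrence survives. The variable names by_a and by_a_last are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1, 2], 'b': [3, 4, 3, 5]})

# Only column 'a' is compared; the first row seen for each value is kept.
by_a = df.drop_duplicates(subset=['a'])

# keep='last' retains the last row seen for each value instead.
by_a_last = df.drop_duplicates(subset=['a'], keep='last')

print(by_a)
print(by_a_last)
```

Note that the original index labels are preserved, so the result may have gaps in its index.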


I've tried several solutions. The first was:

a_df=np.unique(df[['col1','col2']], axis=0)

This works well for non-object data, though note that it returns a NumPy array rather than a DataFrame. Another way, which avoids the error raised for object-dtype columns, is to use drop_duplicates():

a_df=df.drop_duplicates(['col1','col2'])[['col1','col2']]
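
A small sketch of the same idea on string (object-dtype) columns. The sample data and the extra column named other are made up for illustration. Here the two columns are selected first and then de-duplicated, which should give the same rows as the one-liner above; reset_index(drop=True) is optional and just renumbers the result.

```python
import pandas as pd

# Hypothetical sample data; 'col1' and 'col2' hold strings (object dtype).
df = pd.DataFrame({
    'col1': ['x', 'x', 'y', 'y'],
    'col2': ['p', 'p', 'p', 'q'],
    'other': [10, 20, 30, 40],
})

# Keep only the two columns of interest, then drop repeated rows.
distinct = df[['col1', 'col2']].drop_duplicates().reset_index(drop=True)

print(distinct)
```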

You can also use SQL to do this, but it was very slow in my case:

from pandasql import sqldf

q = """SELECT DISTINCT col1, col2 FROM df;"""
pysqldf = lambda q: sqldf(q, globals())
a_df = pysqldf(q)


To solve a similar problem, I'm using groupby:

print(f"Distinct entries: {len(df.groupby(['col1', 'col2']))}")

Whether that's appropriate will depend on what you want to do with the result, though (in my case, I just wanted the equivalent of COUNT DISTINCT as shown).
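
For the COUNT DISTINCT case, here is a rough sketch of two ways to get the same number, plus a per-combination row count. The sample data is made up; ngroups and size() are standard GroupBy members.

```python
import pandas as pd

# Hypothetical sample data with three distinct (col1, col2) pairs.
df = pd.DataFrame({
    'col1': ['x', 'x', 'y', 'y'],
    'col2': [1, 1, 1, 2],
})

# Number of distinct pairs, via the groupby object.
n_groupby = df.groupby(['col1', 'col2']).ngroups

# The same count, via drop_duplicates.
n_dedup = len(df.drop_duplicates(['col1', 'col2']))

# How many rows fall into each distinct pair.
per_pair = df.groupby(['col1', 'col2']).size()

print(n_groupby, n_dedup)
print(per_pair)
```

One caveat: by default groupby leaves out rows whose key contains NaN (dropna=True), so the two counts can differ when the key columns have missing values.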