
Filter Pyspark dataframe column with None value


You can use Column.isNull / Column.isNotNull:

df.where(col("dt_mvmt").isNull())df.where(col("dt_mvmt").isNotNull())

If you simply want to drop NULL values, you can use na.drop with the subset argument:

df.na.drop(subset=["dt_mvmt"])

Equality-based comparisons with NULL won't work, because in SQL NULL is undefined, so any attempt to compare it with another value returns NULL:

sqlContext.sql("SELECT NULL = NULL").show()## +-------------+## |(NULL = NULL)|## +-------------+## |         null|## +-------------+sqlContext.sql("SELECT NULL != NULL").show()## +-------------------+## |(NOT (NULL = NULL))|## +-------------------+## |               null|## +-------------------+

The only valid way to compare a value with NULL is IS / IS NOT, which are equivalent to the isNull / isNotNull method calls.
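
For contrast with the queries above, IS / IS NOT do produce a real boolean rather than NULL (a sketch, still using sqlContext):

sqlContext.sql("SELECT NULL IS NULL").show()       # single row: true
sqlContext.sql("SELECT NULL IS NOT NULL").show()   # single row: false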


Try simply using the isNotNull function.

df.filter(df.dt_mvmt.isNotNull()).count()
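
Note that filter and where are aliases in the DataFrame API, so this is equivalent to:

df.where(df.dt_mvmt.isNotNull()).count()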


To obtain entries whose values in the dt_mvmt column are not null, we have

df.filter("dt_mvmt is not NULL")

and for entries which are null, we have

df.filter("dt_mvmt is NULL")