Filter PySpark dataframe column with None value
You can use Column.isNull / Column.isNotNull:

from pyspark.sql.functions import col

df.where(col("dt_mvmt").isNull())
df.where(col("dt_mvmt").isNotNull())
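To see these in action, here is a minimal self-contained sketch; the SparkSession setup and the two-row sample data are illustrative assumptions, not part of the original answer:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

## Hypothetical sample: Python None becomes SQL NULL in the DataFrame
df = spark.createDataFrame(
    [(1, "2016-03-27"), (2, None)],
    ["id", "dt_mvmt"],
)

df.where(col("dt_mvmt").isNull()).show()     ## only the row with id=2
df.where(col("dt_mvmt").isNotNull()).show()  ## only the row with id=1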
If you want to simply drop NULL values you can use na.drop with the subset argument:
df.na.drop(subset=["dt_mvmt"])
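For example, with the illustrative df from above (na.drop and dropna are aliases for the same operation):

## Drops the id=2 row, whose dt_mvmt is NULL
df.na.drop(subset=["dt_mvmt"]).show()
## Equivalent spelling:
df.dropna(subset=["dt_mvmt"]).show()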
Equality-based comparisons with NULL won't work, because in SQL NULL is undefined, so any attempt to compare it with another value returns NULL:
sqlContext.sql("SELECT NULL = NULL").show()
## +-------------+
## |(NULL = NULL)|
## +-------------+
## |         null|
## +-------------+

sqlContext.sql("SELECT NULL != NULL").show()
## +-------------------+
## |(NOT (NULL = NULL))|
## +-------------------+
## |               null|
## +-------------------+
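The same three-valued logic applies in the DataFrame API, which is why a naive equality test against None silently matches nothing; a sketch of the pitfall, reusing the illustrative df from above:

## Builds the predicate "dt_mvmt = NULL" / "NOT (dt_mvmt = NULL)", which
## evaluates to NULL for every row, so neither filter returns any rows
df.where(col("dt_mvmt") == None).count()  ## 0
df.where(col("dt_mvmt") != None).count()  ## 0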
The only valid way to compare a value with NULL is IS / IS NOT, which are equivalent to the isNull / isNotNull method calls.
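A quick check that IS / IS NOT really do return proper booleans for NULL (the aliases is_null and is_not_null are added here only to give readable column names):

row = sqlContext.sql(
    "SELECT NULL IS NULL AS is_null, NULL IS NOT NULL AS is_not_null"
).first()
## Row(is_null=True, is_not_null=False)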
To obtain entries whose values in the dt_mvmt column are not null we have
df.filter("dt_mvmt is not NULL")
and for entries which are null we have
df.filter("dt_mvmt is NULL")