Differences between null and NaN in Spark? How to deal with it?

python



A null value represents "no value" or "nothing"; it is not even an empty string or zero. It can be used to indicate that nothing useful exists.

NaN stands for "Not a Number"; it is usually the result of a mathematical operation that doesn't make sense, e.g. 0.0/0.0.
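To see the two side by side, here is a minimal sketch (the column names and data are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One row carries a null (None in Python), the other a NaN
df = spark.createDataFrame([(1.0, None), (float('nan'), 2.0)], ("a", "b"))
df.show()
# +---+----+
# |  a|   b|
# +---+----+
# |1.0|null|
# |NaN| 2.0|
# +---+----+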

One possible way to handle null values is to remove them with:

df.na.drop()
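By default drop() removes a row if any column contains a null (and, for numeric columns, NaN as well). To restrict the check to particular columns, you can pass a subset (the column name "a" here is illustrative):

df.na.drop(subset=["a"])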

Or you can change them to an actual value (here I used 0) with:

df.na.fill(0)
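fill() also accepts a dict, so each column can get its own replacement value; a sketch with illustrative column names:

# Fill missing values in "a" with 0 and in "b" with "unknown"
df.na.fill({"a": 0, "b": "unknown"})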

Another way would be to select the rows where a specific column is null for further processing:

df.where(col("a").isNull())df.where(col("a").isNotNull())

Rows with NaN can also be selected using the equivalent method:

from pyspark.sql.functions import isnan

df.where(isnan(col("a")))
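The two conditions can also be combined when a column may contain both kinds of missing values; for example, to keep only rows where "a" is neither null nor NaN (column name illustrative):

from pyspark.sql.functions import col, isnan

df.where(col("a").isNotNull() & ~isnan(col("a")))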


You can distinguish your NaN values using the function isnan, as in this example:

>>> df = spark.createDataFrame([(1.0, float('nan')), (float('nan'), 2.0)], ("a", "b"))>>> df.select(isnan("a").alias("r1"), isnan(df.a).alias("r2")).collect()[Row(r1=False, r2=False), Row(r1=True, r2=True)]

The difference is in the type of the object that generates the value. NaN (Not a Number) is an old-fashioned way to deal with the "missing value for a number": you can think of having all the numbers (..., -2, -1, 0, 1, 2, ...) and needing an extra value for error cases such as 0.0/0.0. You want 0.0/0.0 to give you a number, but which number? Since no existing number fits, a new value called NaN was created, and it is still of a numeric type.

None is used for the void, the absence of an element; it is even more abstract, because within the numeric type you have, besides the NaN value, the None value. The None value is present in the sets of values of all the types, not just numeric ones.
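The distinction is easy to see in plain Python, where NaN is a float value while None has its own type:

import math

nan = float('nan')

print(type(nan))        # <class 'float'>
print(nan == nan)       # False -- NaN is not equal to anything, not even itself
print(math.isnan(nan))  # True

print(type(None))       # <class 'NoneType'>
print(None is None)     # True -- None is compared by identity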


You can deal with it using this code (note that notnull here is the pandas function, so this applies to a pandas DataFrame):

import pandas

df = df.where(pandas.notnull(df), None)

The code will convert every NaN value into None; when such a pandas DataFrame is passed to spark.createDataFrame, those entries become null.
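If you want to stay inside Spark instead of going through pandas, a sketch of the same NaN-to-null conversion using DataFrame.replace (assuming a Spark version where replace accepts None as the replacement value):

# Replace NaN with null in every numeric column of a Spark DataFrame
df = df.replace(float('nan'), None)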

Below is the reference link:

Link