How to filter out rows with NaN values in Hive?

sql hadoop hive nan hue

Hive relies on Java (plus SQL-specific semantics for Null and friends), and Java honors the IEEE standard for number semantics. Which means that... NaN is tricky.

Quoting that post...

(Float.NaN == Float.NaN) always returns false.
In fact, if you look at the JDK implementation of Float.isNaN(), a number is not-a-number if it is not equal to itself (which makes sense because a number should be equal to itself).
The same holds for Double.NaN

So, there is no point in showing you how to use the (undocumented) Hive function called reflect2, which allows you to invoke raw Java methods on Hive columns, i.e.

where v1 is not null and not reflect2(v1, "isNaN")

...because -- in theory -- you can simply state:

where v1 is not null and v1=v1

Disclaimer -- I have seen cases where the Hive optimizer makes aggressive "optimizations" and produces wrong results.
In other words, if the simple v1=v1 clause does not filter out the NaN values as expected, then look into reflect2...

Edit -- indeed, the optimizer appears to ignore the v1=v1 clause in some versions of Hive (see comments) so a more devious formula is necessary:

v1 +1.0 > v1 should work... except when rounding errors make either abs(v1) << 1 or abs(v1) >> 1
other "numeric" tricks will fail similarly in edge cases, especially when v1 =0.0

In the end, the most robust approach appears to try cast(v1 as String) <>'NaN' (because all possible NaN values are displayed as "NaN" even if they are not strictly "equal" in the arithmetical sense).

Side note about reflect2 -- you can see that it is indeed not mentioned in the official Hive doc, while reflect is mentioned (and even has a specific Wiki entry). But it has been implemented as early as Hive V0.11 cf. Hive-4025

Edit -- Java "reflection" is now disabled by default for ODBC / JDBC / Hue connections (see comments), and cannot be re-enabled when using security plug-ins such as ranger or Sentry. So its usage is restricted to the (deprecated) hive CLI.

sql hadoop hive nan hue

You can handle NaN as

SELECT SUM(CAST(IF(v1 ='NaN', 0, v1)) as Double) FROM hivedb.tb1

sql hadoop hive nan hue

Not sure if this applies in many cases, but in Hive 3 I'm getting:

select float('NaN') = float('NaN')

returns True

So in theory:

select * from hivedb.tb1 where v1 <> float('NaN');

should accomplish this

CodeHunter

How to filter out rows with NaN values in Hive?

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last