Why is Apache Orc RecordReader.searchArgument() not filtering correctly?

I encountered the same issue, and I think it was fixed by changing

.equals("x", Type.LONG,

to

.equals("x", PredicateLeaf.Type.LONG

With this change, the reader seems to return only the batches that contain the relevant rows, not only the rows we asked for.

I know this question is old, but maybe the answer is useful for someone. (I also just saw that mac wrote a comment a few hours ago saying basically the same thing, but I think a separate answer is more visible.)

ORC internally separates the data into so-called "row groups" (10,000 rows each by default), and each row group has its own index. The search argument is only used to skip row groups in which no row can match it. It does NOT filter out individual rows. The index can even report that a row group matches the search argument when not a single row in it actually does, because the row group index consists mainly of the min and max values of each column in that row group.

So you will have to iterate over the returned rows and skip the ones that do not match your search criteria.
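To make the behaviour described above concrete, here is a minimal, library-free sketch. It does not use ORC's API: `RowGroupPruningSketch`, `RowGroup`, `mightContain`, `readWithSearchArgument` and `filterRows` are all hypothetical names invented for illustration, and the row groups are tiny so the effect is easy to see. It only models the idea that a min/max index can skip whole groups, and that the caller still needs a per-row check afterwards.

```java
import java.util.ArrayList;
import java.util.List;

/** Library-free model of min/max row-group pruning; NOT ORC's actual API. */
class RowGroupPruningSketch {

    /** Hypothetical stand-in for one row group plus its min/max index. */
    static final class RowGroup {
        final long[] values;
        final long min;
        final long max;

        RowGroup(long... values) {
            this.values = values;
            long lo = Long.MAX_VALUE;
            long hi = Long.MIN_VALUE;
            for (long v : values) {
                lo = Math.min(lo, v);
                hi = Math.max(hi, v);
            }
            this.min = lo;
            this.max = hi;
        }

        /** True if a row equal to target could exist, judging by min/max only. */
        boolean mightContain(long target) {
            return min <= target && target <= max;
        }
    }

    /** What the search argument alone achieves: whole groups are kept or skipped. */
    static List<Long> readWithSearchArgument(List<RowGroup> groups, long target) {
        List<Long> rows = new ArrayList<>();
        for (RowGroup g : groups) {
            if (!g.mightContain(target)) {
                continue; // the whole group is skipped
            }
            for (long v : g.values) {
                rows.add(v); // every row of a kept group is returned
            }
        }
        return rows;
    }

    /** The extra per-row check the caller still has to apply. */
    static List<Long> filterRows(List<Long> rows, long target) {
        List<Long> out = new ArrayList<>();
        for (long v : rows) {
            if (v == target) {
                out.add(v);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<RowGroup> groups = List.of(
            new RowGroup(1, 2, 3),  // min 1, max 3: skipped for target 5
            new RowGroup(4, 5, 6),  // really contains 5
            new RowGroup(2, 9));    // range covers 5, but no row equals 5
        List<Long> returned = readWithSearchArgument(groups, 5);
        System.out.println(returned);                // [4, 5, 6, 2, 9]
        System.out.println(filterRows(returned, 5)); // [5]
    }
}
```

The third group is the false positive mentioned above: its min/max range includes 5, so it is read in full even though no row in it matches. With the real reader the same pattern applies: keep the search argument so that row groups can be skipped, then test each row of every returned batch against the same condition.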
