Why is Apache Orc RecordReader.searchArgument() not filtering correctly?


I encountered the same issue, and I think it was rectified by changing

.equals("x", Type.LONG,

to

.equals("x", PredicateLeaf.Type.LONG,

With this change, the reader seems to return only the batches that contain the relevant rows, rather than only the rows we asked for.


I know this question is old, but maybe the answer is useful for someone. (And I just saw that mac wrote a comment saying basically the same as me a few hours ago, but I think a separate answer is more visible.)

ORC internally separates the data into so-called "row groups" (10,000 rows each by default), and each row group has its own indices. The search argument is only used to skip row groups in which no row can match it. It does NOT filter out individual rows. The indices may even report that a row group matches a search argument when not a single row in it actually does. This is because the row group indices mainly consist of the min and max values of each column in the row group.
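To illustrate why a row group can be kept without containing a single match, here is a minimal sketch that simulates min/max pruning in plain Java. It does not use the ORC library: the class `RowGroupPruning`, its methods, and the tiny three-row "row groups" are hypothetical stand-ins for the per-row-group column statistics described above.

```java
import java.util.ArrayList;
import java.util.List;

class RowGroupPruning {
    /** Min/max statistics of one column in one row group (hypothetical stand-in for ORC's row index). */
    static final class Stats {
        final long min;
        final long max;
        Stats(long min, long max) { this.min = min; this.max = max; }
    }

    /** Computes the min and max over the values of one row group. */
    static Stats statsOf(long[] rowGroup) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (long v : rowGroup) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        return new Stats(min, max);
    }

    /** A row group can only be skipped if the target lies outside [min, max]. */
    static boolean mayContain(Stats s, long target) {
        return s.min <= target && target <= s.max;
    }

    /** Returns the indices of the row groups that survive pruning for "x == target". */
    static List<Integer> survivingGroups(long[][] rowGroups, long target) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < rowGroups.length; i++) {
            if (mayContain(statsOf(rowGroups[i]), target)) {
                kept.add(i);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        long[][] groups = {
            {1, 2, 3},    // range [1, 3]: skipped for target 5
            {4, 5, 6},    // range [4, 6]: kept, and it does contain 5
            {0, 10, 20},  // range [0, 20]: kept, although no row equals 5
        };
        System.out.println(survivingGroups(groups, 5L)); // prints [1, 2]
    }
}
```

The third group is the false positive: 5 lies between its min and max, so the statistics alone cannot rule it out.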

So you will have to iterate over the returned rows and skip the ones that do not match your search criteria.
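The row-level check could look roughly like the sketch below. It is written against plain arrays so that it runs on its own: `values` and `size` are hypothetical stand-ins for what, in ORC's Java API, would be the `vector` array of a `LongColumnVector` and the `size` field of a `VectorizedRowBatch`. Null handling and the `isRepeating` flag are left out here and would need to be considered with real batches.

```java
import java.util.ArrayList;
import java.util.List;

class RowLevelFilter {
    /**
     * Returns the row positions within one batch whose value equals the target.
     * Only the first `size` entries are valid; the rest of the array is unused capacity.
     */
    static List<Integer> matchingRows(long[] values, int size, long target) {
        List<Integer> matches = new ArrayList<>();
        for (int row = 0; row < size; row++) {
            if (values[row] == target) {
                matches.add(row); // keep this row; all others are skipped
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // A batch with capacity 8 of which only the first 5 rows are filled.
        long[] batch = {0, 5, 10, 5, 20, 0, 0, 0};
        int size = 5;
        System.out.println(matchingRows(batch, size, 5L)); // prints [1, 3]
    }
}
```

The search argument stays worthwhile even with this extra loop, because it lets the reader skip whole row groups before any rows are materialized.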