Why is querying Parquet files slower than text files in Hive?


First, I would like to point out that it is virtually impossible to answer your question with the details given.

A few points:

  • measuring elapsed time in a distributed environment is not a reliable way to determine whether something is slow (if many queries are running and competing for resources, you are not measuring what you think you are measuring)

  • without the actual table definitions and the queries run against those tables, the problem is impossible to reproduce

  • not providing the number of rows in the table and the cardinality of its individual fields does not help either
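
The numbers asked for in the last point are cheap to collect. As a rough sketch in Python, assuming a sample of the table has been exported as a list of dict rows (the `profile_columns` helper and the sample data are hypothetical, only there to show what "row count and cardinality" means in practice):

```python
def profile_columns(rows):
    """Return the row count and the number of distinct values per column."""
    distinct = {}
    for row in rows:
        for column, value in row.items():
            # Collect the distinct values seen for each column.
            distinct.setdefault(column, set()).add(value)
    cardinality = {column: len(values) for column, values in distinct.items()}
    return len(rows), cardinality


# Hypothetical sample rows; a real table would be sampled or profiled in Hive itself.
sample = [
    {"id": 1, "country": "DE", "status": "ok"},
    {"id": 2, "country": "DE", "status": "ok"},
    {"id": 3, "country": "US", "status": "ok"},
    {"id": 4, "country": "US", "status": "ok"},
]

row_count, cardinality = profile_columns(sample)
print(row_count)    # 4
print(cardinality)  # {'id': 4, 'country': 2, 'status': 1}
```

A column like `id` (every value distinct) and a column like `status` (one value) behave very differently under the encodings Parquet uses, which is why these figures matter for the question.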

In general, querying Parquet is much faster than querying text files, because Parquet uses several techniques that speed up read operations. A few of them:

  • compression
  • run length encoding
  • dictionary encoding
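
To make the last two items concrete, here is a toy illustration in Python of dictionary encoding followed by run-length encoding on a single low-cardinality column. It is only a sketch of the idea: the function names and the sample column are made up, and Parquet's actual on-disk encodings are more involved than this.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with a small integer index into a list of distinct values."""
    dictionary = []
    index_of = {}
    indices = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


# Hypothetical column with only three distinct values.
column = ["DE", "DE", "DE", "US", "US", "DE", "FR", "FR", "FR", "FR"]

dictionary, indices = dictionary_encode(column)
runs = run_length_encode(indices)

print(dictionary)  # ['DE', 'US', 'FR']
print(indices)     # [0, 0, 0, 1, 1, 0, 2, 2, 2, 2]
print(runs)        # [(0, 3), (1, 2), (0, 1), (2, 4)]
```

Ten strings shrink to a three-entry dictionary plus four runs. The gain depends on the data: a column where every value is distinct gets nothing from either step, which is one reason the cardinality of the fields matters.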

Depending on the use case, some of the parameters behind these features can be tuned to fit the workload.