Which would be a quicker (and better) tool for querying data stored in the Parquet format - Spark SQL, Athena or ElasticSearch?


Based on the information you've provided, I am going to make several assumptions:

  1. You are on AWS (hence ElasticSearch and Athena being options). Therefore, I will steer you to AWS documentation.
  2. As you have pre-defined and indexed filters, you have well-ordered, structured data.

Going through the options listed:

  1. Spark SQL - If you are already considering Spark and you are already on AWS, then you can leverage Amazon Elastic MapReduce (EMR); a minimal PySpark sketch follows this list.
  2. AWS Athena (serverless SQL querying, based on Presto) - Athena is a powerful tool. It lets you query data stored on S3, which is quite cost-effective. However, building workflows in Athena can require a bit of work, as you will spend a lot of time managing files on S3. Historically, Athena could only produce CSV output, so it often worked best as the final stage in a big data pipeline. However, with support for CTAS statements, it can now output data in formats such as Parquet with a choice of compression algorithms; see the CTAS sketch after this list.
  3. ElasticSearch (search engine) - It is not really a query tool, so it is likely not part of the core of this pipeline.
  4. Redis (key-value DB) - Redis is an in-memory key-value data store. It is generally used to serve small bits of information rapidly to applications, in use cases such as caching and session management. Therefore, it does not seem to fit your use case. If you want some hands-on experience with Redis, I recommend Try Redis.
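
For item 1, a minimal PySpark sketch of what a Spark SQL query over Parquet looks like; the S3 path, view name, and column names (`s3://my-bucket/events/`, `events`, `event_type`) are placeholders for illustration, not anything from the question:

```python
from pyspark.sql import SparkSession

# On EMR a configured SparkSession usually already exists; this builds
# one for a standalone script.
spark = SparkSession.builder.appName("parquet-query").getOrCreate()

# Read the Parquet files directly from S3 (placeholder path).
df = spark.read.parquet("s3://my-bucket/events/")

# Expose the DataFrame as a temporary view so plain SQL can be used.
df.createOrReplaceTempView("events")

# Placeholder query: count rows per event type.
result = spark.sql("""
    SELECT event_type, COUNT(*) AS n
    FROM events
    GROUP BY event_type
    ORDER BY n DESC
""")
result.show()
```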
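
For item 2, a hedged sketch of the CTAS route via boto3 (the database, table, and bucket names are placeholders): the CTAS statement asks Athena to write its result set back to S3 as compressed Parquet rather than the default CSV result file.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# CTAS: materialize a query result as Snappy-compressed Parquet on S3.
# Database, table, and bucket names below are placeholders.
ctas = """
CREATE TABLE analytics.events_parquet
WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    external_location = 's3://my-bucket/events-parquet/'
) AS
SELECT event_type, event_time
FROM analytics.events_raw
"""

response = athena.start_query_execution(
    QueryString=ctas,
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
print(response["QueryExecutionId"])
```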

I would also look into Amazon Redshift.

For further reading, see Big Data Analytics Options on AWS.

As @Damien_The_Unbeliever recommended, there will be no substitute for your own prototyping and benchmarking.


Athena is not limited to CSV. In fact, using compressed binary formats like Parquet is a best practice with Athena, because it substantially reduces query times and cost. I have used AWS Kinesis Firehose, Lambda functions, and Glue crawlers to convert text data to a compressed binary format for querying via Athena (a crawler sketch follows below). When I have had issues with processing large data volumes, the cause was forgetting to raise the default Athena limits set for the account. I have a friend who processes gigantic volumes of utility data for predictive analytics; he did encounter scaling problems with Athena, but that was in its early days.
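
To illustrate the Glue crawler step, a hedged boto3 sketch (the crawler name, IAM role, database, and S3 path are all placeholders, and an existing IAM role for Glue is assumed); the crawler infers the schema of the converted Parquet files and registers a table that Athena can then query:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Point a crawler at the S3 prefix holding the converted Parquet files
# so the inferred table lands in the Glue Data Catalog (placeholders).
glue.create_crawler(
    Name="parquet-logs-crawler",
    Role="GlueServiceRole",  # assumes an existing IAM role for Glue
    DatabaseName="logs",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/logs-parquet/"}]},
)

# Run it once; when it finishes, Athena can query the new table.
glue.start_crawler(Name="parquet-logs-crawler")
```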

I also work with ElasticSearch and Kibana as a text search engine, and we use the AWS Log Analytics "solution" based on ElasticSearch and Kibana. I like both. Athena is best for working with huge volumes of log data, because it is more economical to work with it in a compressed binary format: a terabyte of JSON text data reduces to approximately 30 GB or less in Parquet format. Our developers are more productive when they use ElasticSearch/Kibana to analyze problems in their log files, because ElasticSearch and Kibana are so easy to use. The Curator Lambda function that controls log retention times, which is part of the AWS Centralized Logging solution, is also very convenient.