
Getting Elasticsearch Data into HDFS Easily


I have an example in PySpark that does some basic queries, but when I try to pull an entire index (a 100 GB daily generated index) into an RDD, I get out-of-memory errors.

Spark doesn't allocate much memory to your jobs by default, so yes, when dealing with that much data, you'll get OOM errors.

Here are the key properties you should be concerned with, along with their defaults (a sketch of overriding them follows the list).

  • spark.dynamicAllocation.enabled - false
  • spark.executor.instances - 2
  • spark.executor.memory - 1g
  • spark.driver.cores - 1

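As a minimal sketch, assuming you are launching the application yourself rather than inheriting an existing session, these can be raised through the SparkSession builder; the sizes below are placeholders, not recommendations for any particular cluster:

    from pyspark.sql import SparkSession

    # Illustrative overrides of the defaults listed above; choose sizes
    # that fit your cluster, these numbers are only placeholders.
    spark = (
        SparkSession.builder
        .appName("es-to-hdfs")
        .config("spark.executor.instances", "8")   # default: 2
        .config("spark.executor.memory", "8g")     # default: 1g
        .config("spark.driver.cores", "1")         # default: 1
        .getOrCreate()
    )

The same settings can also be passed to spark-submit with --conf, or set in spark-defaults.conf.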
If your Spark jobs are running under YARN cluster management, you also need to consider your YARN container sizes. When running in cluster mode, the Application Master will be the Spark driver container. In my experience, unless your Spark code calls collect() to send data back through the driver, the driver itself doesn't need that much memory.

I would first try increasing the executor memory, and then the number of executors. If you enable dynamic allocation, you can consider not specifying an executor count at all, although it still starts from a lower bound of executors.
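For reference, a sketch of the dynamic allocation route, assuming a YARN cluster with the external shuffle service available; the bounds below are placeholders:

    from pyspark.sql import SparkSession

    # Let Spark scale the number of executors up and down instead of
    # fixing a count up front.
    spark = (
        SparkSession.builder
        .appName("es-to-hdfs-dynamic")
        .config("spark.dynamicAllocation.enabled", "true")
        .config("spark.dynamicAllocation.minExecutors", "2")    # lower bound it starts from
        .config("spark.dynamicAllocation.maxExecutors", "20")
        .config("spark.shuffle.service.enabled", "true")        # needed for dynamic allocation on YARN
        .config("spark.executor.memory", "8g")
        .getOrCreate()
    )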

ES-Hadoop provides several connectors for extracting data, but which one you use mostly comes down to preference. If you know SQL, use Hive. Pig is simpler to run than Spark. Spark is very memory-heavy, which might not work well in some clusters.
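Since the question is in PySpark, here is a rough sketch using the Spark connector from ES-Hadoop to read an index into a DataFrame and land it on HDFS as Parquet. The node address, index name, and output path are placeholders, and the elasticsearch-hadoop (or elasticsearch-spark) jar has to be on the classpath:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("es-index-to-parquet")
        .config("spark.es.nodes", "es-node-1")   # hypothetical Elasticsearch host
        .config("spark.es.port", "9200")
        .getOrCreate()
    )

    # Read the whole index through the ES-Hadoop Spark SQL data source.
    df = (
        spark.read
        .format("org.elasticsearch.spark.sql")
        .option("es.resource", "logs-2023.10.01")   # hypothetical daily index name
        .load()
    )

    # Write it to HDFS in a format other Hadoop tools can read.
    df.write.mode("overwrite").parquet("hdfs:///data/es/logs-2023.10.01")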

You mention NiFi in your comments, but that is still a Java process and just as prone to OOM errors. Unless you have a NiFi cluster, you'll have a single process somewhere pulling 100 GB through FlowFiles on disk before writing to HDFS.

If you need a snapshot of a whole index, Elasticsearch supports HDFS as a snapshot repository (the repository-hdfs plugin). I'm unsure what data format the snapshot uses, though, or whether Hadoop processes can read it.
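Assuming the repository-hdfs plugin is installed on every Elasticsearch node, registering the repository and taking a snapshot is a pair of REST calls; the hosts, paths, and names below are placeholders:

    import requests

    ES = "http://es-node-1:9200"   # hypothetical Elasticsearch endpoint

    # Register an HDFS-backed snapshot repository.
    requests.put(f"{ES}/_snapshot/hdfs_repo", json={
        "type": "hdfs",
        "settings": {
            "uri": "hdfs://namenode:8020",
            "path": "/elasticsearch/snapshots",
        },
    })

    # Snapshot a single index into that repository.
    requests.put(
        f"{ES}/_snapshot/hdfs_repo/logs_snapshot?wait_for_completion=true",
        json={"indices": "logs-2023.10.01"},
    )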