
Getting Elasticsearch Data into HDFS Easily


I have an example in PySpark that does some basic queries, but when I try to pull an entire index (a 100 GB daily generated index) into an RDD, I get out-of-memory errors.

Spark doesn't allocate much memory to your jobs by default, so yes, when dealing with that much data, you'll get OOM errors.

Here are the key properties you should be concerned with, along with their defaults (a sketch of overriding them follows the list).

  • spark.dynamicAllocation.enabled - false
  • spark.executor.instances - 2
  • spark.executor.memory - 1g
  • spark.driver.cores - 1

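As a minimal sketch, assuming you are launching the application yourself rather than inheriting an existing session, these can be raised through the SparkSession builder; the sizes below are placeholders, not recommendations for any particular cluster:

    from pyspark.sql import SparkSession

    # Illustrative overrides of the defaults listed above; choose sizes
    # that fit your cluster, these numbers are only placeholders.
    spark = (
        SparkSession.builder
        .appName("es-to-hdfs")
        .config("spark.executor.instances", "8")   # default: 2
        .config("spark.executor.memory", "8g")     # default: 1g
        .config("spark.driver.cores", "1")         # default: 1
        .getOrCreate()
    )

The same settings can also be passed to spark-submit with --conf, or set in spark-defaults.conf.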
If your Spark jobs are running under YARN cluster management, you also need to consider your YARN container sizes. When running in cluster mode, the Application Master will be the Spark driver container. In my experience, unless your Spark code calls collect() to send data back through the driver, the driver itself doesn't need that much memory.

I would first try increasing the executor memory, and then the number of executors. If you enable dynamic allocation, you can consider not specifying an executor count at all, although it still starts from a lower bound of executors.
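For reference, a sketch of the dynamic allocation route, assuming a YARN cluster with the external shuffle service available; the bounds below are placeholders:

    from pyspark.sql import SparkSession

    # Let Spark scale the number of executors up and down instead of
    # fixing a count up front.
    spark = (
        SparkSession.builder
        .appName("es-to-hdfs-dynamic")
        .config("spark.dynamicAllocation.enabled", "true")
        .config("spark.dynamicAllocation.minExecutors", "2")    # lower bound it starts from
        .config("spark.dynamicAllocation.maxExecutors", "20")
        .config("spark.shuffle.service.enabled", "true")        # needed for dynamic allocation on YARN
        .config("spark.executor.memory", "8g")
        .getOrCreate()
    )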

ES-Hadoop provides several connectors for extracting data, but which one you use mostly comes down to preference. If you know SQL, use Hive. Pig is simpler to run than Spark. Spark is very memory-heavy, which might not work well in some clusters.
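Since the question is in PySpark, here is a rough sketch using the Spark connector from ES-Hadoop to read an index into a DataFrame and land it on HDFS as Parquet. The node address, index name, and output path are placeholders, and the elasticsearch-hadoop (or elasticsearch-spark) jar has to be on the classpath:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("es-index-to-parquet")
        .config("spark.es.nodes", "es-node-1")   # hypothetical Elasticsearch host
        .config("spark.es.port", "9200")
        .getOrCreate()
    )

    # Read the whole index through the ES-Hadoop Spark SQL data source.
    df = (
        spark.read
        .format("org.elasticsearch.spark.sql")
        .option("es.resource", "logs-2023.10.01")   # hypothetical daily index name
        .load()
    )

    # Write it to HDFS in a format other Hadoop tools can read.
    df.write.mode("overwrite").parquet("hdfs:///data/es/logs-2023.10.01")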

You mention NiFi in your comments, but that is still a Java process and just as prone to OOM errors. Unless you have a NiFi cluster, you'll have a single process somewhere pulling 100 GB through FlowFiles on disk before writing to HDFS.

If you need a snapshot of a whole index, Elasticsearch supports HDFS as a snapshot repository (the repository-hdfs plugin). I'm unsure what data format the snapshot uses, though, or whether Hadoop processes can read it.
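Assuming the repository-hdfs plugin is installed on every Elasticsearch node, registering the repository and taking a snapshot is a pair of REST calls; the hosts, paths, and names below are placeholders:

    import requests

    ES = "http://es-node-1:9200"   # hypothetical Elasticsearch endpoint

    # Register an HDFS-backed snapshot repository.
    requests.put(f"{ES}/_snapshot/hdfs_repo", json={
        "type": "hdfs",
        "settings": {
            "uri": "hdfs://namenode:8020",
            "path": "/elasticsearch/snapshots",
        },
    })

    # Snapshot a single index into that repository.
    requests.put(
        f"{ES}/_snapshot/hdfs_repo/logs_snapshot?wait_for_completion=true",
        json={"indices": "logs-2023.10.01"},
    )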