Spark 2.2.0 FileOutputCommitter

I have been hit by this issue too. Spark discourages the use of DirectFileOutputCommitter because it can lead to data loss under race conditions, and committer algorithm version 2 does not help much either.
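
For reference, here is a minimal sketch of how that version 2 setting is applied; the Hadoop config key is passed through Spark's `spark.hadoop.` prefix, and the app name is a placeholder:

```scala
// Minimal sketch: opting into FileOutputCommitter algorithm version 2.
// As noted above, this reduces (but does not remove) the rename overhead.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3-write-example") // placeholder name
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()
```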

I have also tried saving the data to S3 with gzip instead of snappy compression, which gave some benefit.
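
As a sketch of that change (assuming Parquet output; the paths are placeholders), the codec can be switched per write:

```scala
// Sketch: writing with gzip instead of the default snappy codec.
// `spark` is the SparkSession from the sketch above; paths are placeholders.
val df = spark.read.parquet("s3://my-bucket/input/")

df.write
  .option("compression", "gzip") // Parquet's default is snappy
  .parquet("s3://my-bucket/output/")
```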

The real issue here is that Spark first writes the output to s3://<output_directory>/_temporary/0 and then copies the data from the temporary location to the final output path. This copy is quite slow on S3 (generally around 6 MB/s), so with a lot of data you will see a considerable slowdown.

The alternative is to write to HDFS first and then use DistCp / S3DistCp to copy the data to S3, as in the sketch below.
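
A sketch of that pattern, assuming an EMR cluster where `s3-dist-cp` is available; all paths and bucket names are placeholders:

```scala
// Sketch: stage the output on HDFS, where the _temporary rename is a cheap
// metadata operation, then copy the finished files to S3 in one pass.
// `df` is a DataFrame as in the sketches above.
df.write.mode("overwrite").parquet("hdfs:///tmp/staging/output/")

// Then copy to S3, e.g. as an EMR step or from the master node:
//   s3-dist-cp --src hdfs:///tmp/staging/output/ --dest s3://my-bucket/output/
```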

Also, you could look at the solution Netflix provided; I haven't evaluated it myself.

EDIT:

Spark 2.4 has largely solved the problem of slow S3 writes. I have found that the S3 write performance of Spark 2.4 with Hadoop 2.8 on the latest EMR release (5.24) is almost on par with HDFS writes (a config sketch follows the links below).

See these documents:

  1. https://aws.amazon.com/blogs/big-data/improve-apache-spark-write-performance-on-apache-parquet-formats-with-the-emrfs-s3-optimized-committer/

  2. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3-performance.html
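
As a sketch based on the EMR release guide linked above, the EMRFS S3-optimized committer is controlled by a single flag (it is already enabled by default from EMR 5.20 onward; it is shown explicitly here for clarity):

```scala
// Sketch: explicitly enabling the EMRFS S3-optimized committer for
// Parquet writes on EMR (on by default starting with EMR 5.20).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("emr-optimized-committer") // placeholder name
  .config("spark.sql.parquet.fs.optimized.committer.optimization-enabled", "true")
  .getOrCreate()
```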