
S3 parallel read and write performance?


Yes, S3 is slower than HDFS, but it's interesting to look at why, and how to mitigate the impact. Key thing: if you are reading a lot more data than you write, then read performance is critical; the S3A connector in Hadoop 2.8+ really helps there, as it was tuned for reading Parquet/ORC files based on traces of real benchmarks. Write performance also suffers, and the more data you generate the worse it gets. People complain about that, when they should really be worrying about the fact that, without special effort, you may actually end up with invalid output. That's generally the more important issue; it's just less obvious.

Read performance

Reading from S3 suffers due to

  • Bandwidth between S3 and your VM. The more you pay for an EC2 VM, the more network bandwidth you get, and the better off you are.
  • Latency of HEAD/GET/LIST requests, especially all those used in the work to make the object store look like a filesystem with directories. This can particularly hurt the partitioning phase of a query, when all the source files are listed and the ones to actually read are identified.
  • The cost of seek() being awful if the HTTP connection for a read is aborted and a new one renegotiated. Without a connector which has optimised seek() for this, ORC and Parquet input suffers badly. The S3A connector in Hadoop 2.8+ does precisely this if you set fs.s3a.experimental.fadvise to random (see the configuration sketch below).

Spark will split up the work on a file if the format is splittable and the compression codec used is also splittable (gz isn't, snappy is). It will do it on the block size, which is something you can configure/tune for a specific job (fs.s3a.block.size). If more than one client reads the same file, then yes, you get some overload of the disk IO to that file, but generally it's minor compared to the rest. One little secret: for multipart-uploaded files, reading the separate parts seems to avoid this, so upload and download with the same configured block size.
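To make that concrete, here is a minimal sketch (Scala, assuming Spark running on Hadoop 2.8+) of passing those read-side S3A properties through Spark. The property names are the ones above; the values are only illustrative, not recommendations.

    import org.apache.spark.sql.SparkSession

    // spark.hadoop.* properties are copied into the Hadoop Configuration,
    // which is how the S3A connector picks them up.
    val spark = SparkSession.builder()
      .appName("s3a-read-tuning")
      // seek()-friendly IO policy for columnar formats like ORC/Parquet
      .config("spark.hadoop.fs.s3a.experimental.fadvise", "random")
      // split/"block" size S3A reports for input files, in bytes (128 MB here, purely an example)
      .config("spark.hadoop.fs.s3a.block.size", "134217728")
      .getOrCreate()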

Write performance

Write performance suffers from

  • Caching of some/many MB of data in blocks before upload, with the upload not starting until the write is completed. With S3A on Hadoop 2.8+, set fs.s3a.fast.upload = true to upload blocks as they are written (see the sketch after this list).
  • Network upload bandwidth, again a function of the VM type you pay for.
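The same pattern as before works for the write-side switch; a sketch only, again with illustrative values. fs.s3a.fast.upload.buffer is an extra Hadoop 2.8+ option I'm including here to say where pending blocks get buffered.

    import org.apache.spark.sql.SparkSession

    val sparkForWrites = SparkSession.builder()
      .appName("s3a-write-tuning")
      // upload data in blocks as it is written, rather than waiting for the file to be closed
      .config("spark.hadoop.fs.s3a.fast.upload", "true")
      // where pending blocks are buffered before upload: "disk", "array" or "bytebuffer"
      .config("spark.hadoop.fs.s3a.fast.upload.buffer", "disk")
      .getOrCreate()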

Commit performance and correctness

When output is committed by rename() of the files written to a temporary location, each object is copied to its final path at about 6-10 MB/s, because a "rename" in S3 is really a server-side copy plus a delete. At that rate, committing 10 GB of output spends roughly 17-28 minutes in the copy alone.

A bigger issue is that the rename-based commit mechanism handles inconsistent directory listings, and failures of tasks during the commit process, very badly. You cannot safely use S3 as a direct destination of work with the normal commit-by-rename algorithm without something to give you a consistent view of the store (consistent EMRFS, s3mper, S3Guard).
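For S3Guard specifically, the wiring looks roughly like this (Hadoop 2.9+/3.x). This is a sketch only, with a placeholder DynamoDB table name and region; in practice these settings would normally live in core-site.xml.

    import org.apache.hadoop.conf.Configuration

    // Sketch: S3Guard backed by DynamoDB for consistent listings.
    val conf = new Configuration()
    conf.set("fs.s3a.metadatastore.impl",
      "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore")
    conf.set("fs.s3a.s3guard.ddb.table", "my-s3guard-table")   // placeholder table name
    conf.set("fs.s3a.s3guard.ddb.region", "us-east-1")         // placeholder region
    conf.set("fs.s3a.s3guard.ddb.table.create", "true")        // create the table if missing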

For maximum performance and safe committing of work, you need an output committer optimised for S3. Databricks have their own thing there; Apache Hadoop 3.1 adds its own "S3A committers" (a sketch of the relevant settings is below); EMR now apparently has something here too.
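To give a flavour of what switching to the Hadoop 3.1 S3A committers involves, here is a hedged sketch of the core Hadoop-side settings; the "directory" staging committer is just one of the options, and Spark additionally needs its cloud committer bindings on the classpath, which I'm not showing.

    import org.apache.hadoop.conf.Configuration

    // Sketch: route output committing for the s3a:// scheme to the S3A committer factory.
    val commitConf = new Configuration()
    commitConf.set("mapreduce.outputcommitter.factory.scheme.s3a",
      "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
    commitConf.set("fs.s3a.committer.name", "directory")   // or "partitioned" / "magic"
    // what to do if the destination directory already contains data
    commitConf.set("fs.s3a.committer.staging.conflict-mode", "append")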

See A zero rename committer for the details on that problem, after which, hopefully, you'll either move to a safe commit mechanism or use HDFS as the destination of your work.