AWS connection timeout when running Spark job on EMR


TLDR: The property you need to set is fs.s3.maxConnections in the emrfs-site.xml configuration file. It defaults to 50. We were getting exactly the same error/stack trace as you, so I set it to 5000, which fixed the problem and had no ill effects.
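
On EMR, emrfs-site.xml properties are usually supplied through the cluster's configuration JSON under the emrfs-site classification, rather than by editing the file by hand. Below is a minimal sketch of generating that JSON. The helper name emrfs_max_connections_config is made up for illustration, and 5000 is just the value that worked for us, not a recommendation.

```python
import json


def emrfs_max_connections_config(max_connections=5000):
    """Build an EMR configuration list that sets fs.s3.maxConnections.

    Hypothetical helper: the classification and property name come from
    the answer above; the function itself is only an illustration.
    """
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return [
        {
            "Classification": "emrfs-site",
            # EMR expects property values as strings.
            "Properties": {"fs.s3.maxConnections": str(max_connections)},
        }
    ]


if __name__ == "__main__":
    # Save the output to a file and pass it when creating the cluster.
    print(json.dumps(emrfs_max_connections_config(), indent=2))
```

The printed JSON can be saved to a file and passed at cluster creation time, for example with the --configurations option of aws emr create-cluster.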

From what I can tell, the root cause is InputFormat implementations that do not properly use try...finally to ensure that connections get closed when exceptions are thrown. Each connection leaked this way is never returned to the pool, so the pool eventually runs dry. Notably, older versions of Hive, including v1.2.1 that Spark is compiled against, exhibit this bug. Hive 2.x massively refactors OrcInputFormat, though I haven't verified that the bug is fixed, nor do I know if/when/how you can compile Spark against Hive 2.x.
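
To illustrate the kind of leak described above, here is a toy sketch in Python. It is not the Hive or EMRFS code: TinyPool, read_leaky and read_safe are hypothetical stand-ins, and the "pool" is just a semaphore. It only shows why a missing try...finally eventually produces a pool timeout.

```python
import threading


class PoolTimeout(Exception):
    """Raised when no connection becomes free in time."""


class TinyPool:
    """Toy stand-in for an HTTP connection pool (hypothetical)."""

    def __init__(self, size):
        self._slots = threading.Semaphore(size)

    def acquire(self, timeout=0.01):
        if not self._slots.acquire(timeout=timeout):
            raise PoolTimeout("timeout waiting for connection from pool")
        return object()  # placeholder for a real connection

    def release(self, conn):
        self._slots.release()


def read_leaky(pool, fail):
    conn = pool.acquire()
    if fail:
        raise RuntimeError("read failed")  # conn is never released
    pool.release(conn)


def read_safe(pool, fail):
    conn = pool.acquire()
    try:
        if fail:
            raise RuntimeError("read failed")
    finally:
        pool.release(conn)  # released even when the read throws


def failures_until_exhausted(reader, pool_size, attempts):
    """Return how many failed reads it takes to exhaust the pool, or None."""
    pool = TinyPool(pool_size)
    for i in range(attempts):
        try:
            reader(pool, fail=True)
        except PoolTimeout:
            return i
        except RuntimeError:
            pass  # the read error itself is expected here
    return None


if __name__ == "__main__":
    print(failures_until_exhausted(read_leaky, pool_size=2, attempts=10))
    print(failures_until_exhausted(read_safe, pool_size=2, attempts=10))
```

With the leaky reader, the pool is exhausted after as many failed reads as it has slots; with the try...finally version it never is. A bigger pool only postpones the point at which the leaky version runs out.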

The workaround increases the size of the connection pool, as suggested in another answer, but both the property name and its location are quite different from those of the "classic" S3 filesystems (s3/s3a/s3n). Of course, this isn't documented anywhere, and I had to decompile the emrfs jar to tease it out...


I don't use EMRFS, but I do know the other Spark/Hadoop S3 clients all use a pool of HTTP connections for their requests to S3, and "timeout waiting for pool" messages invariably mean "pool isn't big enough". See if you can find out what the emrfs options are for increasing that pool size. You will need at least one connection for every worker thread running in your process, and I'd double it in the hope that emrfs parallelises block uploads the way the s3a client does.
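
The rule of thumb above is simple arithmetic. As a rough sketch, with the function name and the default factor of 2 being illustrative assumptions rather than anything EMRFS documents:

```python
def suggested_pool_size(worker_threads, headroom_factor=2):
    """Rule-of-thumb pool size: one connection per worker thread, doubled.

    Hypothetical helper; the factor of 2 is a guess at headroom for
    parallel block uploads, not a measured value.
    """
    if worker_threads < 1:
        raise ValueError("worker_threads must be at least 1")
    if headroom_factor < 1:
        raise ValueError("headroom_factor must be at least 1")
    return worker_threads * headroom_factor


if __name__ == "__main__":
    # e.g. a process running 8 worker threads
    print(suggested_pool_size(8))
```

This gives a lower bound for a healthy process. It does not account for leaked connections, which can exhaust a pool of any fixed size given enough failures.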