Why is HDFS throwing LeaseExpiredException in a Hadoop cluster (AWS EMR)?


I resolved the issue. Let me explain in detail.

Exceptions observed -

  1. LeaseExpiredException - from the HDFS side.
  2. FileNotFoundException - from the Hive side (when the Tez execution engine executes the DAG)

Problem scenario -

  1. We had just upgraded Hive from version 0.13.0 to 2.1.0. Everything worked fine with the previous version, with zero runtime exceptions.

Approaches tried to resolve the issue -

  1. My first thought was that two threads were working on the same piece of data because of NN intelligence. But with speculative execution disabled by the settings below

    set mapreduce.map.speculative=false;
    set mapreduce.reduce.speculative=false;

that was not possible.

  2. Then I increased the limit from 1000 to 100000 for the settings below -

    SET hive.exec.max.dynamic.partitions=100000;
    SET hive.exec.max.dynamic.partitions.pernode=100000;

that also didn't work.

  3. The third thought was that, within the same process, what one mapper had created was being deleted by another mapper/reducer. But we didn't find any such entries in the HiveServer2 or Tez logs.

  4. Finally, the root cause turned out to lie in the application layer code itself. The hive-exec-2.1.0 version introduced a new configuration property

    "hive.exec.stagingdir":".hive-staging"

Description of the above property -

Directory name that will be created inside table locations in order to support HDFS encryption. This is replaces ${hive.exec.scratchdir} for query results with the exception of read-only tables. In all cases ${hive.exec.scratchdir} is still used for other temporary files, such as job plans.
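
To see what a session is actually using, Hive prints a property's current value when `SET` is given the name without an assignment. A minimal check, assuming you run it from the same Hive or Beeline session as the failing job:

```sql
-- No assignment: Hive shows the current value instead of changing it.
SET hive.exec.stagingdir;
SET hive.exec.scratchdir;
```

Per the description above, a relative value such as `.hive-staging` means the staging directory is created inside the table location.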

So if there are concurrent jobs in the application layer code (ETL) performing operations (rename/delete/move) on the same table, this problem may occur.

In our case, two concurrent jobs were running "INSERT OVERWRITE" on the same table, which led to the metadata file of one mapper being deleted, and that caused this issue.
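
As a rough sketch of that scenario, the two jobs might look like the statements below. The table names (`target_table`, `source_a`, `source_b`) are hypothetical and only illustrate the overlap; they are not taken from the actual ETL code.

```sql
-- Hypothetical tables: target_table, source_a, source_b.

-- Job A (session 1)
INSERT OVERWRITE TABLE target_table
SELECT * FROM source_a;

-- Job B (session 2), started while job A is still running.
-- With a relative hive.exec.stagingdir, both jobs stage their output
-- under the location of target_table, so one job's overwrite/cleanup
-- can remove files the other job still expects to find.
INSERT OVERWRITE TABLE target_table
SELECT * FROM source_b;
```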

Resolution -

  1. Move the metadata file location outside the table location (the table lives in S3).
  2. Disable HDFS encryption (as mentioned in the description of the stagingdir property).
  3. Change your application layer code to avoid the concurrency issue.
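
For the first option, one possible approach is to point `hive.exec.stagingdir` at a path outside the table location. This is a sketch under assumptions: the path `/tmp/hive/.hive-staging` is hypothetical, so substitute a location your jobs can write to.

```sql
-- Assumption: /tmp/hive/.hive-staging is a hypothetical path, not from the original setup.
-- An absolute path keeps staging files out of the table location.
SET hive.exec.stagingdir=/tmp/hive/.hive-staging;
```

This only moves the staging files. Jobs that overwrite the same table at the same time may still interfere with each other's final output, so the third option is worth considering as well.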

Related question - Why hive_staging file is missing in AWS EMR