
ALS.checkpointInterval and SparkContext.setCheckpointDir


SparkContext.setCheckpointDir sets the global checkpoint directory. It is not limited to ALS or any other specific algorithm, but it is required for RDD.checkpoint to work.
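As a minimal sketch of how the global setting interacts with RDD.checkpoint (the object name, helper name, and the /tmp path are placeholders for illustration, not taken from the question):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddCheckpointSketch {
  // Sets the global checkpoint directory, checkpoints a small RDD,
  // and reports whether the RDD ended up checkpointed.
  def checkpointed(sc: SparkContext, dir: String): Boolean = {
    sc.setCheckpointDir(dir)                      // global, not tied to any algorithm
    val rdd = sc.parallelize(1 to 100).map(_ * 2)
    rdd.checkpoint()                              // only marks the RDD; fails if no directory is set
    rdd.count()                                   // an action triggers the actual write
    rdd.isCheckpointed
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-checkpoint-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical local path; on a cluster this should be a shared location such as HDFS.
      println(checkpointed(sc, "/tmp/spark-checkpoints"))
    } finally {
      sc.stop()
    }
  }
}
```

Calling rdd.checkpoint() before any directory has been set raises an exception, which is why the directory has to be configured first.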

ALS.checkpointInterval is an algorithm-specific property and doesn't affect any global settings. From the ML docs:

Param for set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations.

Putting these two things together:

  • the two settings work in completely different contexts and have different consequences
  • both are required for proper checkpointing in ALS. If the checkpoint directory is not set, ALS won't checkpoint even if the checkpoint interval is set:

    val shouldCheckpoint: Int => Boolean = (iter) =>
      sc.checkpointDir.isDefined && checkpointInterval != -1 && (iter % checkpointInterval == 0)
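In practice this means configuring both settings before fitting. A rough sketch with the spark.ml ALS estimator follows; the object name, column names, interval, and directory path are assumptions chosen for illustration:

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

object AlsCheckpointSketch {
  // Builds an ALS estimator that requests a checkpoint every `interval` iterations.
  def buildAls(interval: Int): ALS =
    new ALS()
      .setMaxIter(20)
      .setCheckpointInterval(interval) // algorithm-specific setting
      .setUserCol("userId")            // hypothetical column names
      .setItemCol("itemId")
      .setRatingCol("rating")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("als-checkpoint-sketch")
      .master("local[*]")
      .getOrCreate()
    try {
      // Global setting; without it the interval above has no effect.
      spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints") // hypothetical path

      val als = buildAls(interval = 5)
      println(s"dir=${spark.sparkContext.getCheckpointDir}, interval=${als.getCheckpointInterval}")
    } finally {
      spark.stop()
    }
  }
}
```

The estimator would then be fitted on a ratings DataFrame with matching columns. If the setCheckpointDir call is omitted, the predicate above is always false and training runs without checkpoints.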