Custom TensorFlow Keras optimizer
Update: TF2.2 forced me to clean up all implementations - so now they can be used as a reference for TF best practices. Also added a section below on `_get_hyper` vs. `_set_hyper`.
I've implemented Keras AdamW in all major TF & Keras versions - I invite you to examine `optimizers_v2.py`. Several points:
- You should inherit `OptimizerV2`, which is actually what you linked; it's the latest and current base class for `tf.keras` optimizers.
- You are correct in (1) - this is a documentation mistake; the methods are private, as they aren't meant to be used by the user directly.
- `apply_gradients` (or any other method) is only overridden if the default doesn't accomplish what's needed for a given optimizer; in your linked example, it's just a one-liner add-on to the original.
- "So, it seems that a `_create_slots` method must be defined in an optimizer subclass if that subclass does not override `apply_gradients`" - the two are unrelated; it's coincidental.
- What is the difference between `_resource_apply_dense` and `_resource_apply_sparse`? The latter deals with sparse layers - e.g. `Embedding` - and the former with everything else; example.
- When should I use `_create_slots()`? When defining trainable `tf.Variable`s; example: weights' first and second order moments (e.g. Adam). It uses `add_slot()`. A minimal sketch tying these methods together is shown right below this list.
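
To tie the points above together, here is a minimal sketch of an `OptimizerV2` subclass - plain SGD with momentum rather than AdamW; the class name `MinimalSGDM`, the slot name, and the hyperparameters are made up for illustration. It only shows where `_create_slots` / `add_slot`, `_resource_apply_dense`, and `_resource_apply_sparse` fit, and is not a polished implementation:

```python
import tensorflow as tf

class MinimalSGDM(tf.keras.optimizers.Optimizer):  # OptimizerV2 under the hood in TF2
    """Illustrative SGD-with-momentum; a sketch, not a reference implementation."""

    def __init__(self, learning_rate=0.01, momentum=0.9, name="MinimalSGDM", **kwargs):
        super(MinimalSGDM, self).__init__(name, **kwargs)
        self._set_hyper("learning_rate", kwargs.get("lr", learning_rate))
        self._set_hyper("momentum", momentum)

    def _create_slots(self, var_list):
        # One extra tf.Variable ("slot") per model weight; Adam would add two ('m', 'v')
        for var in var_list:
            self.add_slot(var, "momentum_buffer")  # zero-initialized by default

    def _resource_apply_dense(self, grad, var, apply_state=None):
        # Dense case: every element of `var` is updated
        # (assumes a plain float/tensor learning rate, not a LearningRateSchedule)
        lr = self._get_hyper("learning_rate", var.dtype.base_dtype)
        mom = self._get_hyper("momentum", var.dtype.base_dtype)
        buf = self.get_slot(var, "momentum_buffer")
        buf_t = buf.assign(mom * buf + grad)
        return var.assign_sub(lr * buf_t)

    def _resource_apply_sparse(self, grad, var, indices, apply_state=None):
        # Sparse case (e.g. Embedding): only the rows in `indices` are touched;
        # `indices` arrives de-duplicated (see _resource_apply_sparse_duplicate_indices)
        lr = self._get_hyper("learning_rate", var.dtype.base_dtype)
        mom = self._get_hyper("momentum", var.dtype.base_dtype)
        buf = self.get_slot(var, "momentum_buffer")
        new_rows = mom * tf.gather(buf, indices) + grad
        buf.scatter_update(tf.IndexedSlices(new_rows, indices))
        return var.scatter_sub(tf.IndexedSlices(lr * new_rows, indices))

    def get_config(self):
        config = super(MinimalSGDM, self).get_config()
        config.update({
            "learning_rate": self._serialize_hyperparameter("learning_rate"),
            "momentum": self._serialize_hyperparameter("momentum"),
        })
        return config
```

Once defined, it drops in like any built-in optimizer, e.g. `model.compile(optimizer=MinimalSGDM(1e-2), loss="mse")`.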
`_get_hyper` vs. `_set_hyper`: they enable setting and getting Python literals (`int`, `str`, etc.), callables, and tensors. They exist largely for convenience: anything set via `_set_hyper` can be retrieved via `_get_hyper`, avoiding repeating boilerplate code. I dedicated a Q&A to it here.
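
As a rough, self-contained illustration of that round trip (the `DemoOpt` class and its `beta` hyperparameter are invented for this example; a plain float learning rate is assumed):

```python
import tensorflow as tf

class DemoOpt(tf.keras.optimizers.Optimizer):
    """Toy subclass showing only the _set_hyper / _get_hyper round trip."""

    def __init__(self, learning_rate=0.01, beta=0.9, name="DemoOpt", **kwargs):
        super(DemoOpt, self).__init__(name, **kwargs)
        # _set_hyper accepts Python literals, tensors, or callables alike
        self._set_hyper("learning_rate", learning_rate)
        self._set_hyper("beta", beta)

    def _resource_apply_dense(self, grad, var, apply_state=None):
        # _get_hyper hands back the stored value cast to the requested dtype
        # (calling it first if it's a plain callable) - no per-hyper boilerplate
        lr_t = self._get_hyper("learning_rate", var.dtype.base_dtype)
        beta_t = self._get_hyper("beta", var.dtype.base_dtype)
        return var.assign_sub(lr_t * beta_t * grad)
```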
- Yes, this looks to be a documentation error. The preceding underscore names are the correct methods to override. Related is the non-Keras `Optimizer`, which has these all defined but not implemented in the base class: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/optimizer.py
```python
def _create_slots(self, var_list):
    """Create all slots needed by the variables.

    Args:
      var_list: A list of `Variable` objects.
    """
    # No slots needed by default
    pass

def _resource_apply_dense(self, grad, handle):
    """Add ops to apply dense gradients to the variable `handle`.

    Args:
      grad: a `Tensor` representing the gradient.
      handle: a `Tensor` of dtype `resource` which points to the variable
        to be updated.

    Returns:
      An `Operation` which updates the value of the variable.
    """
    raise NotImplementedError()

def _resource_apply_sparse(self, grad, handle, indices):
    """Add ops to apply sparse gradients to the variable `handle`.

    Similar to `_apply_sparse`, the `indices` argument to this method has
    been de-duplicated. Optimizers which deal correctly with non-unique
    indices may instead override `_resource_apply_sparse_duplicate_indices`
    to avoid this overhead.

    Args:
      grad: a `Tensor` representing the gradient for the affected indices.
      handle: a `Tensor` of dtype `resource` which points to the variable
        to be updated.
      indices: a `Tensor` of integral type representing the indices for
        which the gradient is nonzero. Indices are unique.

    Returns:
      An `Operation` which updates the value of the variable.
    """
    raise NotImplementedError()
```
- I don't know about `apply_dense`. For one thing, if you do override it, the code mentions that a per-replica DistributionStrategy could be "dangerous":
```python
# TODO(isaprykin): When using a DistributionStrategy, and when an
# optimizer is created in each replica, it might be dangerous to
# rely on some Optimizer methods.  When such methods are called on a
# per-replica optimizer, an exception needs to be thrown.  We do
# allow creation per-replica optimizers however, because the
# compute_gradients()->apply_gradients() sequence is safe.
```