How to apply gradient clipping in TensorFlow?



Gradient clipping needs to happen after computing the gradients, but before applying them to update the model's parameters. In your example, both of those things are handled by the AdamOptimizer.minimize() method.
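For reference, a minimal sketch of that one-step form (TF 1.x; the names cost and learning_rate are assumptions that match the snippet below) would be:

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
train_op = optimizer.minimize(cost)  # computes and applies gradients in one call, leaving no room to clip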

In order to clip your gradients, you'll need to explicitly compute, clip, and apply them as described in this section of TensorFlow's API documentation. Specifically, you'll need to substitute the call to the minimize() method with something like the following:

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
gvs = optimizer.compute_gradients(cost)
capped_gvs = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gvs]
train_op = optimizer.apply_gradients(capped_gvs)
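If it helps, here is a self-contained toy version of the same pattern, assuming TensorFlow 1.x; the scalar variable and quadratic cost are invented purely to make the snippet runnable:

import tensorflow as tf  # assumes TensorFlow 1.x (or tf.compat.v1 with eager execution disabled)

# Toy problem: a single trainable scalar and a quadratic cost, just to exercise the pattern.
w = tf.Variable(5.0)
cost = tf.square(w - 1.0)

optimizer = tf.train.AdamOptimizer(learning_rate=0.1)
gvs = optimizer.compute_gradients(cost)
capped_gvs = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gvs]
train_op = optimizer.apply_gradients(capped_gvs)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(100):
        sess.run(train_op)
    print(sess.run(w))  # moves toward 1.0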


Despite what seems to be the popular approach, you probably want to clip the whole gradient by its global norm:

optimizer = tf.train.AdamOptimizer(1e-3)
gradients, variables = zip(*optimizer.compute_gradients(loss))
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
optimize = optimizer.apply_gradients(zip(gradients, variables))
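To make concrete what tf.clip_by_global_norm does, here is a small eager-mode (TF 2.x) demo with made-up gradient values; the second return value is the pre-clipping global norm, which can be useful to log during training:

import tensorflow as tf

# Two toy "gradients" purely for illustration.
g1 = tf.constant([3.0, 4.0])   # norm 5
g2 = tf.constant([0.0, 12.0])  # norm 12
# global norm = sqrt(5**2 + 12**2) = 13; with clip_norm=6.5 every tensor is scaled by 0.5.
clipped, global_norm = tf.clip_by_global_norm([g1, g2], 6.5)
print(global_norm.numpy())           # 13.0
print([g.numpy() for g in clipped])  # [[1.5, 2.0], [0.0, 6.0]]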

Clipping each gradient matrix individually is also possible, but it changes their relative scale:

optimizer = tf.train.AdamOptimizer(1e-3)
gradients, variables = zip(*optimizer.compute_gradients(loss))
gradients = [
    None if gradient is None else tf.clip_by_norm(gradient, 5.0)
    for gradient in gradients]
optimize = optimizer.apply_gradients(zip(gradients, variables))
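Continuing the made-up values from the previous sketch (TF 2.x eager), this shows how per-tensor clipping with tf.clip_by_norm changes the relative scale between gradients, unlike global-norm clipping:

import tensorflow as tf

# Same toy "gradients" as above; per-tensor clipping treats each one independently.
g1 = tf.constant([3.0, 4.0])   # norm 5  -> untouched by clip_norm=6.5
g2 = tf.constant([0.0, 12.0])  # norm 12 -> scaled down to norm 6.5
clipped = [tf.clip_by_norm(g, 6.5) for g in (g1, g2)]
print([g.numpy() for g in clipped])  # [[3.0, 4.0], [0.0, 6.5]]
# The g2/g1 norm ratio changes from 12/5 to 6.5/5, whereas global-norm clipping preserves it.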

In TensorFlow 2, a tape computes the gradients, the optimizers come from Keras, and we don't need to store the update op because it runs automatically without passing it to a session:

optimizer = tf.keras.optimizers.Adam(1e-3)
# ...
with tf.GradientTape() as tape:
  loss = ...
variables = ...
gradients = tape.gradient(loss, variables)
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
optimizer.apply_gradients(zip(gradients, variables))
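For a fuller picture, here is a self-contained TF 2 sketch of one clipped training step; the tiny model, random data, and clip value of 5.0 are assumptions for illustration only:

import tensorflow as tf

# Toy setup invented for illustration: one dense layer fit to random data.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.Adam(1e-3)
x = tf.random.normal([32, 4])
y = tf.random.normal([32, 1])

with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(model(x) - y))

gradients = tape.gradient(loss, model.trainable_variables)
gradients, global_norm = tf.clip_by_global_norm(gradients, 5.0)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))

Keras optimizers can also clip for you via their clipnorm and clipvalue constructor arguments (and, in newer versions, global_clipnorm), which avoids handling the gradients manually in simple cases.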


This is actually explained properly in the documentation:

Calling minimize() takes care of both computing the gradients and applying them to the variables. If you want to process the gradients before applying them you can instead use the optimizer in three steps:

  • Compute the gradients with compute_gradients().
  • Process the gradients as you wish.
  • Apply the processed gradients with apply_gradients().

And in the example they provide, they use these three steps:

# Create an optimizer.
opt = GradientDescentOptimizer(learning_rate=0.1)

# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(loss, <list of variables>)

# grads_and_vars is a list of tuples (gradient, variable).  Do whatever you
# need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(MyCapper(gv[0]), gv[1]) for gv in grads_and_vars]

# Ask the optimizer to apply the capped gradients.
opt.apply_gradients(capped_grads_and_vars)

Here MyCapper is any function that caps your gradient. The list of useful functions (other than tf.clip_by_value()) is here.
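As a concrete sketch, MyCapper could be as simple as a thin wrapper around tf.clip_by_value; the ±1.0 threshold here is just an illustrative assumption:

def MyCapper(grad):
    # Element-wise cap to [-1, 1]; pass through None gradients untouched.
    return None if grad is None else tf.clip_by_value(grad, -1.0, 1.0)

tf.clip_by_norm and tf.clip_by_global_norm (discussed above) are among the other clipping functions that could be used here instead.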