While working through this derivation is useful for understanding how the update rules for a deep neural network arise, doing so by hand quickly becomes unwieldy for large networks and complex architectures. Fortunately, TensorFlow 2 handles the computation of these gradients automatically. During the initialization of the model, each gradient is computed as an intermediate node between tensors and operations in the graph; as an example, see Figure 3.4:
The left side of the preceding figure shows a cost function C computed from the output of a Rectified Linear Unit (ReLU), a type of neuron function we'll cover later in this chapter, which in turn is computed by multiplying a weight vector by an input x and adding a bias term b. On the right, you can see that this graph has been augmented by TensorFlow to compute all the intermediate gradients required for backpropagation as part of the overall control flow.
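To make the augmented graph concrete, here is a minimal pure-Python sketch that computes the same kind of intermediate gradients by hand. The squared-error form of C, the function name forward_and_gradients, and the numeric values are assumptions chosen for illustration; the figure itself does not specify them.

```python
def relu(v):
    """Rectified Linear Unit: passes positive values, zeroes out the rest."""
    return max(0.0, v)


def forward_and_gradients(w, x, b, target):
    """Forward pass plus the intermediate gradients for C = (ReLU(w.x + b) - target)^2.

    The squared-error cost is an assumption made for this sketch.
    """
    # Forward pass through the graph: z -> a -> C
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    a = relu(z)
    C = (a - target) ** 2

    # Local gradients, one per edge of the graph
    dC_da = 2.0 * (a - target)
    da_dz = 1.0 if z > 0 else 0.0   # derivative of ReLU
    dz_dw = list(x)                 # dz/dw_i = x_i
    dz_db = 1.0

    # Combine the local gradients with the chain rule
    dC_dz = dC_da * da_dz
    dC_dw = [dC_dz * g for g in dz_dw]
    dC_db = dC_dz * dz_db
    return C, dC_dw, dC_db


# Hypothetical inputs: z = 0.5, a = 0.5, C = 0.25
C, dC_dw, dC_db = forward_and_gradients(w=[0.5, -1.0], x=[2.0, 1.0], b=0.5, target=1.0)
print(C, dC_dw, dC_db)
```

Each local gradient here corresponds to one of the extra nodes that an automatic differentiation system adds to the graph, which is why we never have to write them out ourselves.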
After storing these intermediate values, the task of combining them into a complete gradient through recursive operation, as shown in the calculation in Figure 3.4, falls to the GradientTape API. Under the hood, TensorFlow uses a method called reverse-mode automatic differentiation to compute gradients: it holds the dependent variable (the output y) fixed and recursively computes the required gradients backwards to the beginning of the network.
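The idea of recording operations and then walking backwards from the output can be sketched in a few lines of plain Python. This is a toy illustration under simplifying assumptions (scalar values, only addition and multiplication), not a description of TensorFlow's actual implementation; the class name Tape and its methods are hypothetical.

```python
class Tape:
    """Toy record of operations for reverse-mode differentiation (scalars only)."""

    def __init__(self):
        self.values = []    # value of every node, indexed by node id
        self.records = []   # (output_id, [(input_id, local_gradient), ...])

    def variable(self, value):
        """Register a value and return its node id."""
        self.values.append(float(value))
        return len(self.values) - 1

    def add(self, a, b):
        out = self.variable(self.values[a] + self.values[b])
        self.records.append((out, [(a, 1.0), (b, 1.0)]))
        return out

    def mul(self, a, b):
        out = self.variable(self.values[a] * self.values[b])
        # d(a*b)/da = b and d(a*b)/db = a
        self.records.append((out, [(a, self.values[b]), (b, self.values[a])]))
        return out

    def gradient(self, target, source):
        """Hold `target` fixed and propagate gradients backwards to `source`."""
        grads = [0.0] * len(self.values)
        grads[target] = 1.0
        # Replay the records in reverse order of creation
        for out, inputs in reversed(self.records):
            for inp, local in inputs:
                grads[inp] += grads[out] * local
        return grads[source]


tape = Tape()
x = tape.variable(3.0)
y = tape.mul(x, x)   # y = x^2
z = tape.mul(y, y)   # z = y^2 = x^4
print(tape.gradient(z, x), tape.gradient(z, y), tape.gradient(y, x))
```

Because the records are appended in the order the operations run, replaying them in reverse guarantees that a node's gradient is complete before it is passed on to its inputs.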
For example, let's consider a neural network of the following form:
If we want to compute the derivative of the output y with respect to an input x, we need to repeatedly substitute the outermost expression.
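As an illustration of this repeated substitution, assume a hypothetical three-layer composition. The symbols f_1, f_2, f_3 and the intermediate values h_1, h_2 are placeholders chosen for this sketch and are not necessarily the chapter's notation.

```latex
% Hypothetical network: three nested functions
h_1 = f_1(x), \qquad h_2 = f_2(h_1), \qquad y = f_3(h_2)

% Starting from the outermost function and substituting inwards,
% the chain rule gives one factor per layer:
\frac{dy}{dx}
  = \frac{dy}{dh_2} \cdot \frac{dh_2}{dh_1} \cdot \frac{dh_1}{dx}
  = f_3'(h_2) \, f_2'(h_1) \, f_1'(x)
```

Each factor depends only on values already computed in the forward pass, which is what makes it possible to store them once and reuse them.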
Thus, to compute the desired gradient, we just need to traverse the graph from top to bottom, storing each intermediate gradient as we calculate it. These values are stored on a record, referred to as a tape in reference to early computers in which information was stored on magnetic tape, which is then used to replay the values for calculation. The alternative would be to use forward-mode automatic differentiation, computing from bottom to top. This requires two passes instead of one (one for each branch feeding into the final value), but is conceptually simpler to implement and doesn't require the storage memory of reverse mode. More importantly, though, reverse mode mimics the derivation of backpropagation that I described earlier.
The tape (also known as the Wengert tape, after one of its developers) is actually a data structure that you can access in the TensorFlow Core API. As an example, import the core library:
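To see why forward mode needs one pass per input but no stored record, consider this small sketch based on dual numbers, a common way to implement forward-mode differentiation. The Dual class and the function f are hypothetical examples, not part of TensorFlow.

```python
class Dual:
    """A value paired with its derivative with respect to one chosen input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        # Sum rule
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        # Product rule
        return Dual(self.value * other.value,
                    self.value * other.deriv + self.deriv * other.value)


def f(a, b):
    """Hypothetical function with two input branches: f = a*b + a."""
    return a * b + a


# One forward pass per input: seed the input of interest with derivative 1.0
df_da = f(Dual(3.0, 1.0), Dual(4.0, 0.0)).deriv   # b + 1 at b = 4
df_db = f(Dual(3.0, 0.0), Dual(4.0, 1.0)).deriv   # a at a = 3
print(df_da, df_db)
```

The derivative travels alongside the value, so nothing has to be stored for later replay, but a function with many inputs (such as a network with many weights) would need as many passes as it has inputs. Reverse mode obtains all of them in a single backward traversal.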
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
The tape is then available through the tf.GradientTape class, used as a context manager, with which you can evaluate gradients with respect to intermediate values within the graph:
x = tf.ones((2, 2))
with tf.GradientTape() as t:
    t.watch(x)
    y = tf.reduce_sum(x)
    z = tf.multiply(y, y)
# Use the tape to compute the derivative of z with respect to the
# intermediate value y.
dz_dy = t.gradient(z, y)
# Note that the resulting derivative is 2*y = 2*sum(x) = 8.
assert dz_dy.numpy() == 8.0
By default, the memory resources used by GradientTape() are released once gradient() is called; however, you can also use the persistent argument to keep these results, so that gradient() can be called more than once on the same tape:
x = tf.constant(3.0)
with tf.GradientTape(persistent=True) as t:
    t.watch(x)
    y = x * x
    z = y * y
dz_dx = t.gradient(z, x)  # 108.0 (4*x^3 at x = 3)
dy_dx = t.gradient(y, x)  # 6.0 (2*x at x = 3)
del t  # drop the reference to the tape to release its resources
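The two values in the comments above can be cross-checked without TensorFlow by approximating each derivative numerically. The following is a minimal sketch using central finite differences; the helper name central_difference and the step size h are assumptions chosen for illustration.

```python
def central_difference(f, x, h=1e-4):
    """Approximate df/dx at x using a symmetric finite difference."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


def y_of_x(x):
    return x * x          # y = x^2


def z_of_x(x):
    y = x * x
    return y * y          # z = y^2 = x^4


dz_dx = central_difference(z_of_x, 3.0)   # close to 4 * 3^3 = 108
dy_dx = central_difference(y_of_x, 3.0)   # close to 2 * 3 = 6
print(dz_dx, dy_dx)
```

A numerical check like this is only approximate and needs two function evaluations per input, which is one reason automatic differentiation is preferred for training, but it is a handy sanity test when a gradient looks suspicious.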
Now that you've seen how TensorFlow computes gradients in practice to evaluate backpropagation, let's return to the details of how the backpropagation technique evolved over time in response to challenges in practical implementation.