
Saturday, February 26, 2022

The shortfalls of backpropagation

While the backpropagation procedure provides a way to update the interior weights of the network in a principled way, it has several shortcomings that make deep networks difficult to use in practice. One is the problem of vanishing gradients. In our derivation of the backpropagation formulas, you saw that the gradients for weights deeper in the network are products of successive partial derivatives from higher layers. In our example, we used the sigmoid function; if we plot the value of the sigmoid and its first derivative, we can see a potential problem:


As the value of the sigmoid function increases or decreases towards the extremes (0 or 1, representing being either "off" or "on"), the value of its gradient vanishes to near zero. This means that the updates to w and b, which are products of these gradients from the hidden activation functions y, shrink towards zero, so the weights change little between iterations and the parameters of the hidden layer neurons change very slowly during backpropagation. Clearly, one problem here is that the sigmoid function saturates; choosing another nonlinearity might therefore circumvent this problem (this is indeed one of the solutions that was proposed, the ReLU, as we'll cover later).
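To make the saturation concrete, here is a minimal NumPy sketch (not taken from the original text) that evaluates the sigmoid and its derivative, sigmoid'(x) = sigmoid(x)(1 - sigmoid(x)), at a few points, and then shows how quickly an upper bound on the gradient shrinks when one such factor is multiplied in per layer during backpropagation:

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, squashing its input into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: sigmoid(x) * (1 - sigmoid(x)), maximal (0.25) at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

# Near the extremes the sigmoid saturates and its gradient is effectively zero.
for x in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(f"x = {x:+5.1f}   sigmoid = {sigmoid(x):.5f}   gradient = {sigmoid_grad(x):.6f}")

# Backpropagation multiplies one such factor per layer, so even at the
# maximum gradient of 0.25 the signal shrinks geometrically with depth.
depth = 10
print(f"upper bound on the gradient after {depth} sigmoid layers: {0.25 ** depth:.2e}")
```

Even at its maximum value of 0.25 (at x = 0), ten stacked sigmoid layers bound the gradient by roughly 1e-6, which is why the earliest layers of a deep sigmoid network learn so slowly.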

Another problem is more subtle, and has to do with how the network utilizes its available free parameters. As you saw in Chapter 1, An Introduction to Generative AI: "Drawing" Data from Models, the posterior probability of a variable can be computed as the product of a likelihood and a prior distribution. We can see deep neural networks as a graphical representation of this kind of probability: the output of a neuron, depending upon its parameters, is a product of all the input values and the distributions on those inputs (the priors). A problem occurs when those values become tightly coupled. As an illustration, consider the competing hypotheses for a headache:


If a patient has cancer, the evidence is so overwhelming that whether they have a cold or not provides no additional value; in essence, the values of the two prior hypotheses become coupled because of the influence of one. This makes it intractable to compute the relative contribution of different parameters, particularly in a deep network; we will cover this problem in our discussion of Restricted Boltzmann Machines and Deep Belief Networks in Chapter 4, Teaching Networks to Generate Digits. As we will describe in more detail in that chapter, a 2006 study showed how to counteract this effect, and was one of the first demonstrations of tractable inference in deep neural networks, a breakthrough that relied upon a generative model that produced images of hand-drawn digits.
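To make this "explaining away" effect concrete, the following short Python sketch uses made-up probabilities (all numbers are purely illustrative assumptions, not values from the text) to show that, given a headache alone, a cold looks likely, but once cancer is observed the posterior for a cold collapses back to its prior:

```python
# Toy "explaining away" calculation with hypothetical probabilities.
p_cancer, p_cold = 0.01, 0.20                     # independent priors on the two causes

def p_headache(cancer, cold):
    """Likelihood of a headache given each combination of causes (illustrative values)."""
    if cancer:
        return 0.95          # cancer alone essentially explains the headache
    return 0.60 if cold else 0.05

# Posterior over the two causes given an observed headache,
# via brute-force enumeration of the joint distribution.
joint = {}
for ca in (True, False):
    for co in (True, False):
        prior = (p_cancer if ca else 1 - p_cancer) * (p_cold if co else 1 - p_cold)
        joint[(ca, co)] = prior * p_headache(ca, co)

evidence = sum(joint.values())
p_cold_given_h = (joint[(True, True)] + joint[(False, True)]) / evidence
p_cold_given_h_and_cancer = joint[(True, True)] / (joint[(True, True)] + joint[(True, False)])

print(f"P(cold | headache)         = {p_cold_given_h:.3f}")
print(f"P(cold | headache, cancer) = {p_cold_given_h_and_cancer:.3f}")
# Once cancer is known, the headache no longer provides evidence for a cold:
# the posterior for cold falls back to its prior of 0.20.
```

With these numbers, a headache alone makes a cold quite probable (about 0.72), but conditioning on cancer drops the probability of a cold back to its prior of 0.20: the two hypotheses have become coupled through the observed evidence, which is the coupling described above.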

Beyond these concerns, other challenges to the more widespread adoption of neural networks in the 1990s and early 2000s were the availability of alternative methods for classification and regression tasks, such as Support Vector Machines, Gradient and Stochastic Gradient Boosting Models, Random Forests, and even penalized regression methods such as LASSO and Elastic Net.

While, in theory, deep neural networks had potentially greater representational power than these models, since they built hierarchical representations of the input data through successive layers, in contrast to the "shallow" representation given by a single transformation such as a regression weight or a decision tree, in practice the challenges of training deep networks made these "shallow" methods more attractive for practical applications. This was coupled with the fact that larger networks required tuning thousands or even millions of parameters, demanding large-scale matrix calculations that were infeasible until cheap compute resources, including GPUs and TPUs especially suited to rapid matrix calculations, became available from cloud vendors and made these experiments practical.

Now that we've covered the basics of training simple network architectures, let's turn to more complex models that will form the building blocks of many of the generative models in the rest of the book: CNNs and sequence models (RNNs, LSTMs, and others).

