Now let's summarize the preceding solution: we need to find the optimal parameters $w$ and $b$ such that the input and output satisfy the linear relationship $y^{(i)} = w x^{(i)} + b$, $i \in [1, n]$. However, because of the observation errors $\epsilon$, we must sample a data set $D = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), (x^{(3)}, y^{(3)}), \ldots, (x^{(n)}, y^{(n)})\}$ containing a sufficient number of samples, and find the parameters $w$ and $b$ that minimize the mean squared error $L = \frac{1}{n} \sum_{i=1}^{n} (w x^{(i)} + b - y^{(i)})^2$.
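As a minimal sketch of the mean squared error defined above, the following plain-Python function evaluates $L$ for a candidate pair of parameters. The function name `mse` and the sample points are hypothetical and chosen only for illustration; they are generated from the line $y = 2x + 1$ without noise.

```python
def mse(w, b, points):
    """Mean squared error of the line y = w*x + b over a list of (x, y) pairs."""
    total = 0.0
    for x, y in points:
        total += (w * x + b - y) ** 2
    return total / len(points)

# Hypothetical noise-free samples taken from y = 2x + 1
points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]

perfect = mse(2.0, 1.0, points)  # the true parameters give zero error
worse = mse(1.0, 0.0, points)    # any other parameters give a larger error
print(perfect, worse)
```

The closer the candidate $w$ and $b$ are to the values that generated the data, the smaller the returned error.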
For a single-input neuron model, only two samples are needed to obtain the exact solution of the equations by the elimination method. An exact solution derived from a closed-form formula like this is called an analytical solution. However, with many data points ($n \gg 2$) and observation errors, there is probably no analytical solution that satisfies all the equations. We can only use numerical optimization methods to obtain an approximate numerical solution. Why is it called optimization? Because computers calculate very quickly, we can use their computing power to "search" and "try" many times, thereby reducing the error $L$ step by step. The simplest optimization method is brute-force search or random experiment. For example, to find the most suitable $w$ and $b$, we can randomly sample values of $w$ and $b$ from the real number space and calculate the error $L$ of the corresponding model. From the set of all experimental errors $\{L\}$, we pick the smallest error $L$; its corresponding $w$ and $b$ are the optimal parameters we are looking for.
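The random experiment described above can be sketched as follows. Since a program cannot sample uniformly from the whole real line, this sketch assumes the parameters lie in the bounded range [-10, 10]; that range, the function names, the fixed seed, and the noisy sample points are all hypothetical choices made for illustration.

```python
import random

def mse(w, b, points):
    """Mean squared error of the line y = w*x + b over (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in points) / len(points)

def random_search(points, trials=20000, low=-10.0, high=10.0, seed=0):
    """Try random (w, b) pairs and keep the pair with the smallest error."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    best_w, best_b, best_loss = None, None, float("inf")
    for _ in range(trials):
        w = rng.uniform(low, high)
        b = rng.uniform(low, high)
        loss = mse(w, b, points)
        if loss < best_loss:
            best_w, best_b, best_loss = w, b, loss
    return best_w, best_b, best_loss

# Hypothetical noisy samples scattered around y = 2x + 1
points = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2), (3.0, 6.8), (4.0, 9.1)]
best_w, best_b, best_loss = random_search(points)
print(best_w, best_b, best_loss)
```

Even with 20,000 trials the result is only approximate, and the number of trials needed grows quickly as more parameters are added.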
This brute-force algorithm is simple and straightforward, but it is extremely inefficient for large-scale, high-dimensional optimization problems. Gradient descent is the most commonly used optimization algorithm in neural network training. Combined with the parallel acceleration capability of powerful graphics processing unit (GPU) chips, it is well suited to optimizing neural network models on massive data.
Naturally, it is also suitable for optimizing our simple linear neuron model. Since the gradient descent algorithm is the core algorithm of deep learning, we will first apply it to solve simple neuron models and then detail its application to neural networks in Chapter 7.
With the concept of the derivative, if we want to find the maximum and minimum values of a function, we can simply set the derivative to 0 and find the corresponding values of the independent variable, that is, the stationary points, and then check the type of each stationary point. Taking the function $f(x) = x^2 \cdot \sin(x)$ as an example, we can plot the function and its derivative on the interval $x \in [-10, 10]$, where the blue solid line is $f(x)$ and the yellow dashed line is $\frac{df(x)}{dx}$, as shown in Figure 2-5. It can be seen that the points where the derivative (dashed line) is 0 are the stationary points, and the local maxima and minima of $f(x)$ all occur at stationary points.
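To make this concrete, here is a minimal sketch that locates stationary points of $f(x) = x^2 \cdot \sin(x)$ on $[-10, 10]$ by scanning for sign changes of the derivative and refining each one by bisection. The derivative follows from the product rule; the helper names, grid size, and tolerance are assumptions made for illustration.

```python
import math

def f(x):
    return x ** 2 * math.sin(x)

def df(x):
    # Product rule: d/dx [x^2 sin x] = 2x sin x + x^2 cos x
    return 2 * x * math.sin(x) + x ** 2 * math.cos(x)

def bisect_root(g, lo, hi, tol=1e-10):
    """Refine a root of g in [lo, hi], assuming g changes sign on the interval."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(lo) * g(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def sign_change_roots(g, a=-10.0, b=10.0, steps=2000):
    """Scan [a, b] on a uniform grid and refine every sign change of g."""
    roots = []
    h = (b - a) / steps
    for i in range(steps):
        lo, hi = a + i * h, a + (i + 1) * h
        if g(lo) * g(hi) < 0:
            roots.append(bisect_root(g, lo, hi))
    return roots

roots = sign_change_roots(df)
for r in roots:
    print(f"x = {r:.4f}, f(x) = {f(r):.4f}")
```

Note that $x = 0$ is also a stationary point, but the derivative touches zero there without changing sign, so a sign-change scan does not report it. It is neither a maximum nor a minimum, which is why the type of each stationary point still has to be checked.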
The gradient of a function is defined as the vector of partial derivatives of the function with respect to each independent variable. Consider a function of two variables, $z = f(x, y)$, whose graph is a surface in three-dimensional space. The partial derivative of the function with respect to the independent variable $x$ is written $\frac{\partial z}{\partial x}$, the partial derivative with respect to the independent variable $y$ is written $\frac{\partial z}{\partial y}$, and the gradient $\nabla f$ is the vector $\left(\frac{\partial z}{\partial x}, \frac{\partial z}{\partial y}\right)$. Let's look at a specific function, $f(x, y) = -(\cos^2 x + \cos^2 y)^2$. As shown in Figure 2-6, the length of each red arrow in the plane represents the modulus of the gradient vector, and the direction of the arrow represents the direction of the gradient vector. It can be seen that the arrows always point in the direction in which the function value increases. The steeper the function surface, the longer the arrow and the larger the modulus of the gradient.
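The following sketch computes the gradient of $f(x, y) = -(\cos^2 x + \cos^2 y)^2$ in two ways: analytically with the chain rule, and numerically with central finite differences. It then takes a small step along the gradient to illustrate that the function value increases. The evaluation point, step sizes, and function names are hypothetical choices for illustration.

```python
import math

def f(x, y):
    return -(math.cos(x) ** 2 + math.cos(y) ** 2) ** 2

def grad_f(x, y):
    """Analytic gradient (df/dx, df/dy) obtained with the chain rule."""
    u = math.cos(x) ** 2 + math.cos(y) ** 2
    # d/dx cos^2(x) = -sin(2x), so df/dx = -2u * (-sin(2x)) = 2u * sin(2x)
    return (2 * u * math.sin(2 * x), 2 * u * math.sin(2 * y))

def numeric_grad(func, x, y, h=1e-6):
    """Central finite-difference approximation of the gradient."""
    dfdx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    dfdy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return (dfdx, dfdy)

x0, y0 = 0.5, 1.0  # arbitrary evaluation point
analytic = grad_f(x0, y0)
numeric = numeric_grad(f, x0, y0)
print(analytic, numeric)

# A small step along the gradient should increase the function value
step = 0.01
stepped = f(x0 + step * analytic[0], y0 + step * analytic[1])
print(f(x0, y0), stepped)
```

Agreement between the analytic and numeric gradients is a common sanity check on a hand-derived formula.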
From the preceding example, we can see intuitively that the gradient of the function always points in the direction in which the function value increases. The opposite direction of the gradient should therefore point in the direction in which the function value decreases.
$$x' = x - \eta \cdot \nabla f \quad (2.1)$$
To take advantage of this property, we just need to follow the preceding equation to iteratively update $x$. Then we can get smaller and smaller function values. $\eta$ is used to scale the gradient vector; it is known as the learning rate and is generally set to a small value, such as 0.01 or 0.001. In particular, for one-dimensional functions, the preceding vector form can be written in scalar form:
$$x' = x - \eta \cdot \frac{dy}{dx}$$
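Each time $x$ is updated with the preceding formula, the function value $y'$ at $x'$ is likely to be smaller than the function value $y$ at $x$, provided the learning rate is small enough. Repeating the update several times therefore drives the function value down step by step.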
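As a minimal sketch of the scalar update, the code below applies it to $f(x) = x^2 \cdot \sin(x)$. The starting point $x = 4$, the learning rate 0.01, and the number of steps are hypothetical choices for illustration; a different starting point may lead to a different local minimum.

```python
import math

def f(x):
    return x ** 2 * math.sin(x)

def df(x):
    # Product rule: d/dx [x^2 sin x] = 2x sin x + x^2 cos x
    return 2 * x * math.sin(x) + x ** 2 * math.cos(x)

def gradient_descent_1d(x0, lr=0.01, steps=200):
    """Repeatedly apply x' = x - lr * df(x) and record f(x) after each step."""
    x = x0
    history = [f(x)]
    for _ in range(steps):
        x = x - lr * df(x)
        history.append(f(x))
    return x, history

x_final, history = gradient_descent_1d(x0=4.0)
print(f"x = {x_final:.4f}, f(x) went from {history[0]:.4f} to {history[-1]:.4f}")
```

From this starting point the iterates settle near $x \approx 5.087$, a local minimum of $f$ where the derivative is zero. Gradient descent only finds a nearby minimum, not necessarily the smallest value on the whole interval.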
The method of optimizing parameters by formula (2.1) is called the gradient descent algorithm. It calculates the gradient $\nabla f$ of the function $f$ and iteratively updates the parameters to obtain a numerical solution for the parameters at which $f$ reaches its minimum value. Note that in deep learning the model input is generally denoted by $x$, and the parameters to be optimized are generally denoted by $\theta$, $w$, and $b$.
Now we will apply the gradient descent algorithm to calculate the optimal parameters $w^*$ and $b^*$ for the problem posed at the beginning of this section. Here the mean squared error function is minimized:
$$L = \frac{1}{n} \sum_{i=1}^{n} (w x^{(i)} + b - y^{(i)})^2$$
The model parameters that need to be optimized are $w$ and $b$, so we update them iteratively using the following equations:
$$w' = w - \eta \frac{\partial L}{\partial w}$$
$$b' = b - \eta \frac{\partial L}{\partial b}$$
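The updates for $w$ and $b$ can be sketched in plain Python as follows. Differentiating the mean squared error with the chain rule gives $\frac{\partial L}{\partial w} = \frac{2}{n} \sum_{i=1}^{n} (w x^{(i)} + b - y^{(i)}) x^{(i)}$ and $\frac{\partial L}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} (w x^{(i)} + b - y^{(i)})$. The function names, sample points, initial values, learning rate, and step count are hypothetical choices for illustration.

```python
def mse(w, b, points):
    """Mean squared error of the line y = w*x + b over (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in points) / len(points)

def gradients(w, b, points):
    """Partial derivatives of the mean squared error with respect to w and b."""
    n = len(points)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in points) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in points) / n
    return grad_w, grad_b

def fit(points, lr=0.01, steps=5000, w=0.0, b=0.0):
    """Gradient descent on w and b, starting from the given initial values."""
    for _ in range(steps):
        grad_w, grad_b = gradients(w, b, points)
        # Update both parameters from gradients evaluated at the same (w, b)
        w = w - lr * grad_w
        b = b - lr * grad_b
    return w, b

# Hypothetical noisy samples scattered around y = 2x + 1
points = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2), (3.0, 6.8), (4.0, 9.1)]
w_fit, b_fit = fit(points)
print(w_fit, b_fit, mse(w_fit, b_fit, points))
```

For these sample points the iteration converges to the least-squares line, approximately $w = 1.99$ and $b = 1.04$. Both gradients are computed before either parameter is changed, so the two updates use the same point $(w, b)$.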