An adult brain contains about 100 billion neurons. Each neuron receives input signals through its dendrites and transmits output signals through its axon. The neurons are interconnected to form a huge neural network, which makes up the human brain, the basis of perception and consciousness. Figure 2-1 shows a typical biological neuron structure. In 1943, the neurophysiologist Warren McCulloch and the mathematical logician Walter Pitts proposed a mathematical model of artificial neural networks to simulate the mechanism of biological neurons. This research was further developed by the American psychologist Frank Rosenblatt into the perceptron model, which is also the cornerstone of modern deep learning.
Starting from the structure of biological neurons, we will revisit the exploration of scientific pioneers and gradually unveil the mystery of automatic learning machines.
First, we can abstract the neuron model into the mathematical structure shown in Figure 2-2. The neuron input vector $x = [x_1, x_2, x_3, \ldots, x_n]^T$ maps to the output $y$ through a function $f_\theta: x \rightarrow y$, where $\theta$ represents the parameters of the function $f$. Consider a simplified case, the linear transformation $f(x) = w^T x + b$. The expanded form is
$$f(x) = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b$$
The preceding calculation logic can be intuitively shown in Figure 2-2.
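As a minimal sketch of the linear neuron above, the following NumPy snippet evaluates $f(x) = w^T x + b$ for a three-input neuron. The function name `neuron` and the numeric values of `x`, `w`, and `b` are hypothetical, chosen only for illustration.

```python
import numpy as np

def neuron(x, w, b):
    """Linear neuron: weighted sum of the inputs plus a bias, f(x) = w^T x + b."""
    return np.dot(w, x) + b

# Hypothetical input, weights, and bias (not taken from the text)
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -1.0, 2.0])
b = 0.1

y = neuron(x, w, b)  # 0.5*1 + (-1.0)*2 + 2.0*3 + 0.1
print(y)
```

Fixing `w` and `b` fixes the behavior of the neuron: the same input always produces the same output.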
The parameters $\theta = \{w_1, w_2, w_3, \ldots, w_n, b\}$ determine the state of the neuron: once those parameters are fixed, the processing logic of the neuron is fully determined. When the number of input nodes is $n = 1$ (single input), the neuron model simplifies further to
$$y = wx + b$$
Then we can plot the change of y as a function of x as shown in Figure 2-3. As the input signal x increases, the output also increases linearly. Here parameter w can be understood as the slope of the straight line, and b is the bias of the straight line.
For a given neuron, the mapping $f$ between $x$ and $y$ is unknown but fixed. Two points determine a straight line, so to estimate $w$ and $b$ we only need to sample any two data points $(x^{(1)}, y^{(1)})$ and $(x^{(2)}, y^{(2)})$ from the straight line in Figure 2-3, where the superscript indicates the data point number:
$$y^{(1)} = w x^{(1)} + b$$
$$y^{(2)} = w x^{(2)} + b$$
If $(x^{(1)}, y^{(1)}) \neq (x^{(2)}, y^{(2)})$, we can solve the preceding equations for $w$ and $b$. Let's consider a specific example: $x^{(1)} = 1$, $y^{(1)} = 1.567$, $x^{(2)} = 2$, $y^{(2)} = 3.043$. Substituting these numbers into the preceding formulas gives
$$1.567 = w \cdot 1 + b$$
$$3.043 = w \cdot 2 + b$$
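This is a system of two linear equations in two unknowns, of the kind we learned to solve in junior high or high school. The analytical solution is easily found by elimination: subtracting the first equation from the second gives $w = 3.043 - 1.567 = 1.476$, and substituting back into the first equation gives $b = 1.567 - 1.476 = 0.091$.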
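The same two-equation system can also be solved numerically. The sketch below writes it in matrix form, with each row of the matrix `A` holding $[x^{(i)}, 1]$, and calls `np.linalg.solve`. The variable names are illustrative assumptions; the data points are the ones from the example.

```python
import numpy as np

# Each row is [x^(i), 1], so that A @ [w, b] equals the vector of y values
A = np.array([[1.0, 1.0],
              [2.0, 1.0]])
y = np.array([1.567, 3.043])

# Solve the 2x2 linear system for the slope w and the bias b
w, b = np.linalg.solve(A, y)
print(f"w = {w:.3f}, b = {b:.3f}")
```

The solve only works when the two sampled points have different $x$ values, because otherwise the matrix `A` is singular.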
You can see that we only need two different data points to solve exactly for the parameters of a single-input linear neuron model. For a linear neuron model with $N$ inputs, we only need to sample $N + 1$ different data points. It seems that linear neuron models can be solved perfectly. So what is wrong with the preceding method? Any sampled point may contain an observation error. We assume that the observation error variable $\epsilon$ follows a normal distribution $\mathcal{N}(\mu, \sigma^2)$ with mean $\mu$ and variance $\sigma^2$. The samples then follow
$$y = wx + b + \epsilon, \quad \epsilon \sim \mathcal{N}(\mu, \sigma^2)$$
Once observation error is introduced, even a model as simple as a linear one can suffer a large estimation bias if only two data points are sampled. As shown in Figure 2-4, the data points all have observation errors. If the estimation is based on the two blue rectangular data points, the estimated blue dotted line deviates considerably from the true orange straight line. To reduce the estimation bias introduced by observation errors, we can sample multiple data points $D = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), (x^{(3)}, y^{(3)}), \ldots, (x^{(n)}, y^{(n)})\}$ and then find a "best" straight line, one that minimizes the sum of errors between all sampling points and the straight line.
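To make the effect of observation error concrete, the following sketch draws noisy samples from a known line and then estimates the line from only two of them. The true parameters `w_true` and `b_true`, the noise level `sigma`, the sampling range, and the random seed are all hypothetical values chosen for illustration, and the noise is assumed to have zero mean.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (arbitrary) for reproducibility

# Hypothetical "true" line and noise level, not taken from the text
w_true, b_true = 1.5, 0.1
sigma = 0.5

# Sample 100 inputs and add zero-mean Gaussian observation error to the outputs
x = rng.uniform(-10.0, 10.0, size=100)
noise = rng.normal(0.0, sigma, size=100)
y = w_true * x + b_true + noise

def fit_two_points(x1, y1, x2, y2):
    """Return (w, b) of the line through two points; requires x1 != x2."""
    w = (y2 - y1) / (x2 - x1)
    b = y1 - w * x1
    return w, b

# Estimate the line from the first two noisy samples only
w_hat, b_hat = fit_two_points(x[0], y[0], x[1], y[1])
print(f"true:      w = {w_true:.3f}, b = {b_true:.3f}")
print(f"two-point: w = {w_hat:.3f}, b = {b_hat:.3f}")
```

Rerunning with different seeds or different pairs of points gives noticeably different estimates, which is the instability that motivates using many samples.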
Because of observation errors, there may be no straight line that passes exactly through all the sampling points in $D$. Therefore, we hope to find a "good" straight line that is close to all sampling points. How do we measure "good" and "bad"? A natural idea is to use the mean squared error (MSE) between the predicted value $w x^{(i)} + b$ and the true value $y^{(i)}$ over all sampling points as the total error, that is
$$L = \frac{1}{n} \sum_{i=1}^{n} \left( w x^{(i)} + b - y^{(i)} \right)^2$$
We then search for the parameters $w^*$ and $b^*$ that minimize the total error $L$. The straight line corresponding to the minimal total error is the optimal straight line we are looking for, that is
$$w^*, b^* = \arg\min_{w, b} \frac{1}{n} \sum_{i=1}^{n} \left( w x^{(i)} + b - y^{(i)} \right)^2$$
Here n represents the number of sampling points.
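The total error can be computed directly from its definition. The sketch below defines a hypothetical helper `mse` and, as one convenient way to obtain a minimizer, uses NumPy's closed-form least-squares fit `np.polyfit`. This is a shortcut chosen for illustration and not necessarily the optimization method the text goes on to develop. The true parameters, noise level, and seed are again assumed values.

```python
import numpy as np

def mse(w, b, x, y):
    """Mean squared error between the predictions w*x + b and the observed y."""
    return np.mean((w * x + b - y) ** 2)

rng = np.random.default_rng(0)  # fixed seed (arbitrary) for reproducibility

# Hypothetical true line and noise level, not taken from the text
w_true, b_true, sigma = 1.5, 0.1, 0.5
x = rng.uniform(-10.0, 10.0, size=100)
y = w_true * x + b_true + rng.normal(0.0, sigma, size=100)

# Least-squares fit of a degree-1 polynomial; returns [slope, intercept]
w_star, b_star = np.polyfit(x, y, deg=1)

print(f"fitted: w = {w_star:.3f}, b = {b_star:.3f}")
print(f"MSE at fitted parameters: {mse(w_star, b_star, x, y):.4f}")
print(f"MSE at true parameters:   {mse(w_true, b_true, x, y):.4f}")
```

On the sampled data, the fitted line has an MSE no larger than that of the true line, because it is chosen to minimize the error on exactly these points.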