educative.io

Back Propagation - A Beginner's Guide to Deep Learning

Could you please derive dE/db_3, dE/db_2, and dE/db_1, in the same way you derived the derivatives of the error w.r.t. w_5 and w_1 (i.e., with a complete explanation)?
Thanks in advance.

Hi @Vikrant !!
Sure! Let’s calculate the derivatives of the error with respect to the biases $b_3$, $b_2$, and $b_1$ using the chain rule. We’ll follow the same approach we used for the derivative of the error with respect to $w_5$.

First, let’s define the error function $\xi$ (also known as the loss function), which measures the difference between the target output and the actual output of the neural network. For simplicity, we’ll use the mean squared error (MSE):

$$\xi = \frac{1}{2}(target_y - out_y)^2$$

Where:
$target_y$ is the target output.
$out_y$ is the actual output (the output of the last layer of the neural network).
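The loss above is simple enough to compute directly. A minimal sketch in Python (the function names are my own, not from the course):

```python
# Mean squared error for a single output, with the 1/2 factor used in the text
# so that the derivative w.r.t. out_y is simply -(target_y - out_y).
def mse(target_y, out_y):
    return 0.5 * (target_y - out_y) ** 2

def mse_derivative(target_y, out_y):
    # d(xi)/d(out_y)
    return -(target_y - out_y)
```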

Now, we’ll calculate the derivatives step by step:

  1. $\frac{\partial \xi}{\partial b_3}$:

$$\frac{\partial \xi}{\partial b_3} = \frac{\partial \xi}{\partial out_y} \cdot \frac{\partial out_y}{\partial net_y} \cdot \frac{\partial net_y}{\partial b_3}$$

From the previous calculations:
$$\frac{\partial \xi}{\partial out_y} = -(target_y - out_y)$$
$$\frac{\partial out_y}{\partial net_y} = out_y(1 - out_y)$$
$$\frac{\partial net_y}{\partial b_3} = 1$$

So, combining these results:
$$\frac{\partial \xi}{\partial b_3} = -(target_y - out_y) \cdot out_y(1 - out_y) \cdot 1$$
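A quick way to trust this product is to check it against a finite-difference estimate. A sketch for a single sigmoid output unit, with illustrative values for $out_h$, $w_5$, $b_3$, and $target_y$ (they are not from the course):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative values (not from the course): hidden output, weight, bias, target.
out_h, w5, b3, target_y = 0.6, 0.4, 0.1, 1.0

def forward(b):
    net_y = w5 * out_h + b   # net input to the output unit
    return sigmoid(net_y)

out_y = forward(b3)
# Analytic gradient from the chain rule above:
grad_b3 = -(target_y - out_y) * out_y * (1 - out_y)

# Central-difference estimate: (xi(b3+eps) - xi(b3-eps)) / (2*eps)
eps = 1e-6
xi = lambda b: 0.5 * (target_y - forward(b)) ** 2
numeric = (xi(b3 + eps) - xi(b3 - eps)) / (2 * eps)
assert abs(grad_b3 - numeric) < 1e-7
```

The analytic and numerical gradients agree to several decimal places, which is the standard sanity check for a hand-derived backprop formula.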

  2. $\frac{\partial \xi}{\partial b_2}$:

$$\frac{\partial \xi}{\partial b_2} = \frac{\partial \xi}{\partial out_h} \cdot \frac{\partial out_h}{\partial net_h} \cdot \frac{\partial net_h}{\partial b_2}$$

Where $out_h$ is the output of the hidden layer, and $net_h$ is the net input to the hidden layer (before the activation function is applied).

From the previous calculations, we already have:
$$\frac{\partial \xi}{\partial out_h} = -(target_y - out_y) \cdot out_y(1 - out_y) \cdot w_5$$
$$\frac{\partial out_h}{\partial net_h} = out_h(1 - out_h)$$
$$\frac{\partial net_h}{\partial b_2} = 1$$

So, combining these results:
$$\frac{\partial \xi}{\partial b_2} = -(target_y - out_y) \cdot out_y(1 - out_y) \cdot w_5 \cdot out_h(1 - out_h) \cdot 1$$

  3. $\frac{\partial \xi}{\partial b_1}$:

$$\frac{\partial \xi}{\partial b_1} = \frac{\partial \xi}{\partial out_x} \cdot \frac{\partial out_x}{\partial net_x} \cdot \frac{\partial net_x}{\partial b_1}$$

Where $out_x$ is the output of the first layer, and $net_x$ is its net input.

From the previous calculations:
$$\frac{\partial \xi}{\partial out_x} = -(target_y - out_y) \cdot out_y(1 - out_y) \cdot w_5 \cdot out_h(1 - out_h) \cdot w_2$$
$$\frac{\partial out_x}{\partial net_x} = out_x(1 - out_x)$$
$$\frac{\partial net_x}{\partial b_1} = 1$$

So, combining these results:
$$\frac{\partial \xi}{\partial b_1} = -(target_y - out_y) \cdot out_y(1 - out_y) \cdot w_5 \cdot out_h(1 - out_h) \cdot w_2 \cdot out_x(1 - out_x) \cdot 1$$
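The three chain-rule products share a common structure: each layer's gradient reuses the gradient from the layer above, multiplied by the incoming weight and the local sigmoid derivative. A sketch assuming one unit per layer (the input weight $w_1$ and all numeric values are illustrative, not from the course):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative single-unit-per-layer network; w1 is an assumed input weight,
# w2 and w5 mirror the weights named in the derivation above.
inp, target_y = 0.8, 1.0
w1, b1 = 0.2, 0.05   # input   -> x-layer
w2, b2 = 0.3, 0.10   # x-layer -> h-layer
w5, b3 = 0.4, 0.15   # h-layer -> y-layer

# Forward pass
out_x = sigmoid(w1 * inp + b1)
out_y_hidden = out_h = sigmoid(w2 * out_x + b2)
out_y = sigmoid(w5 * out_h + b3)

# Backward pass: each bias gradient builds on the one from the layer above,
# matching the three combined results derived in the text.
grad_b3 = -(target_y - out_y) * out_y * (1 - out_y)
grad_b2 = grad_b3 * w5 * out_h * (1 - out_h)
grad_b1 = grad_b2 * w2 * out_x * (1 - out_x)
```

Note how `grad_b2` and `grad_b1` are one extra `weight * sigmoid'` factor each; this reuse of upstream gradients is exactly what makes backpropagation efficient.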

Finally, we can use these derivatives to update the biases with the gradient descent update rule:

$$b_3 = b_3 - \eta \frac{\partial \xi}{\partial b_3}$$
$$b_2 = b_2 - \eta \frac{\partial \xi}{\partial b_2}$$
$$b_1 = b_1 - \eta \frac{\partial \xi}{\partial b_1}$$

Where:
$\eta$ is the learning rate, which controls the step size in the update process.
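In code, the update rule is a one-liner; here is a minimal sketch, where the learning rate of 0.5 and the numeric values are arbitrary illustrative choices:

```python
eta = 0.5  # illustrative learning rate

def update(bias, grad):
    # Gradient descent step: move against the gradient.
    return bias - eta * grad

# A negative gradient means the loss decreases as the bias grows,
# so the update increases the bias.
new_b3 = update(0.15, -0.10)
```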
I hope it helps. Happy Learning :blush:


Hi @Javeria_Tariq, thanks for the explanation. Could you help me with the formatting? There is a backward slash in all the expressions, which makes the explanation hard to understand. Do you mind explaining the notation, e.g. (\xi), and the rest of the whole example?

Apologies for the confusion with the formatting. Let me reformat the example and explain the notation used:

Let’s consider a simple neural network with three layers: an input layer ($x$), a hidden layer ($h$), and an output layer ($y$). The output of the neural network is denoted $out_y$, and the target output (ground truth) is denoted $target_y$. The error function $\xi$ measures the difference between the target output and the actual output of the neural network; in this example, we’ll use the mean squared error (MSE) as the error function:

$$\xi = \frac{1}{2}(target_y - out_y)^2$$

The bias gradients are computed exactly as derived above, and each bias is then updated with:

$$b_3 = b_3 - \eta \frac{\partial \xi}{\partial b_3}, \quad b_2 = b_2 - \eta \frac{\partial \xi}{\partial b_2}, \quad b_1 = b_1 - \eta \frac{\partial \xi}{\partial b_1}$$
Where η is the learning rate, which controls the step size of the weight update process. The learning rate is a hyperparameter that needs to be chosen carefully, as it can affect the convergence and stability of the learning process.