Backpropagation is an algorithm to calculate the derivative (gradient) of the cost function with respect to the parameters of the neural network.
Backpropagation, also called backprop or the backward pass, gets its name from the order of computation: after the forward pass computes the output of the network from the current parameters, backprop calculates the derivative (gradient) of the cost function with respect to those parameters in the reverse order of the forward pass, from the output layer back to the first layer. Hence the name: the gradients propagate backwards.
Computational Graph
A computational graph breaks the neural network (both forward propagation and backpropagation) down into smaller building blocks, which can then be calculated step by step. This graph is a directed acyclic graph (DAG), and the output of each node is calculated from the outputs of the previous nodes. So the outputs of previous nodes can be reused to calculate the outputs of the next nodes, which makes the computation more efficient.
See Computational Graph for more details.
In the case of backpropagation, the computational graph corresponds to breaking the chain rule of the derivative down into its composite functions, and then calculating the derivatives from the outermost function (the loss function) back to the innermost ones (the parameters of the neural network).
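As a small worked example (with generic symbols, not tied to the specific network in the figures below): for a single neuron with $z = wx + b$, activation $a = \sigma(z)$, and loss $L(a)$, the chain rule decomposes as

$$
\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w},
$$

and backpropagation evaluates these factors starting from the outermost one, $\frac{\partial L}{\partial a}$, and moving inwards toward the parameter.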
Without the computational graph, we would have to calculate the derivative of the loss function with respect to each parameter one by one, which is very inefficient. With the computational graph, we instead start from the loss function (the end of the graph) and calculate the derivative of the loss with respect to the output of each function, then keep going backward to calculate the derivative of the loss with respect to each parameter in the network. In this way, we reuse the previously computed derivatives to calculate the next ones, which makes the computation much more efficient.
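Here is a minimal sketch of that reuse, assuming a toy scalar graph $x \to z = wx + b \to a = \sigma(z) \to L = (a - y)^2$ (made-up values, not the network from the figure below). The intermediate gradients `dL_da` and `dL_dz` are computed once and then reused for every parameter upstream of them:

```python
import math

# Toy computational graph: x -> z = w*x + b -> a = sigmoid(z) -> L = (a - y)^2
x, y = 1.5, 0.0   # made-up input and target
w, b = 0.8, 0.1   # made-up parameters

# Forward pass: each node's output is stored so the backward pass can reuse it
z = w * x + b
a = 1.0 / (1.0 + math.exp(-z))
L = (a - y) ** 2

# Backward pass: start at the loss node and walk the graph in reverse
dL_da = 2 * (a - y)          # computed once ...
dL_dz = dL_da * a * (1 - a)  # ... reused here ...
dL_dw = dL_dz * x            # ... and reused again for both parameters
dL_db = dL_dz * 1.0

print(dL_dw, dL_db)
```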
Note: In backpropagation, we compute the gradient of the loss function with respect to each parameter for every sample, and then we average those gradients to get the mean gradient for that parameter over all samples. That mean gradient is what the gradient descent algorithm uses to update the parameters at each iteration (step).
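In symbols, if $m$ is the number of samples and $\mathcal{L}^{(i)}$ is the loss on sample $i$ (notation assumed here), the gradient used by gradient descent for a parameter $w$ is

$$
\frac{\partial J}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial \mathcal{L}^{(i)}}{\partial w}.
$$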
Let’s go through this algorithm step by step using an example of a simple neural network like the one below:
Forward Propagation
We move forward (Left to Right) from the input node to the output node.
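The figure isn't reproduced here, so the sketch below just assumes a tiny fully connected network with 2 inputs, 2 sigmoid hidden units, and 1 sigmoid output, with made-up weights; the point is only the left-to-right order of the computation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed shapes (the actual network is in the figure): 2 inputs -> 2 hidden -> 1 output
x  = np.array([0.5, -1.2])            # input vector
W1 = np.array([[0.1, 0.4],
               [-0.3, 0.2]])          # hidden-layer weights (2x2)
b1 = np.array([0.0, 0.1])             # hidden-layer biases
W2 = np.array([[0.7, -0.5]])          # output-layer weights (1x2)
b2 = np.array([0.2])                  # output-layer bias

# Forward pass: left to right, layer by layer
z1 = W1 @ x + b1      # hidden pre-activations
a1 = sigmoid(z1)      # hidden activations
z2 = W2 @ a1 + b2     # output pre-activation
y_hat = sigmoid(z2)   # network output
print(y_hat)
```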
Loss Function
We continue to move forward (Left to Right) from the output node to the loss node.
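Continuing the same sketch, here is the loss node assuming a squared-error loss for a single sample (the loss actually used in the worked example may differ):

```python
import numpy as np

y     = np.array([1.0])    # made-up target value for this sample
y_hat = np.array([0.62])   # network output from the forward pass above

# Squared-error loss; the 1/2 factor simplifies the derivative in the backward pass
L = 0.5 * np.sum((y_hat - y) ** 2)
print(L)
```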
Backpropagation Algorithm
Now we move backwards (Right to Left), from the loss node to the input node.
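Here is a sketch of the backward pass for the same assumed two-layer network and squared-error loss; it repeats the forward pass so the block runs on its own, then walks the graph from the loss node back toward the parameters, reusing each intermediate gradient along the way.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Same assumed network as the forward-pass sketch: 2 inputs -> 2 hidden -> 1 output
x  = np.array([0.5, -1.2])
y  = np.array([1.0])
W1 = np.array([[0.1, 0.4], [-0.3, 0.2]]); b1 = np.array([0.0, 0.1])
W2 = np.array([[0.7, -0.5]]);             b2 = np.array([0.2])

# Forward pass (the cached intermediate values are needed by the backward pass)
z1 = W1 @ x + b1;  a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; y_hat = sigmoid(z2)
L = 0.5 * np.sum((y_hat - y) ** 2)

# Backward pass: right to left, from the loss node toward the inputs
dL_dyhat = y_hat - y                        # dL/da2 for the squared-error loss
dL_dz2   = dL_dyhat * y_hat * (1 - y_hat)   # sigmoid'(z2) = a2 * (1 - a2)
dL_dW2   = np.outer(dL_dz2, a1)             # gradient for the output weights
dL_db2   = dL_dz2                           # gradient for the output bias
dL_da1   = W2.T @ dL_dz2                    # propagate back through W2
dL_dz1   = dL_da1 * a1 * (1 - a1)           # through the hidden sigmoid
dL_dW1   = np.outer(dL_dz1, x)              # gradient for the hidden weights
dL_db1   = dL_dz1                           # gradient for the hidden bias

print(dL_dW1, dL_db1, dL_dW2, dL_db2)
```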