This chapter mainly covers how to compute derivatives, and the concept and steps of the backpropagation algorithm.
Remember: Stochastic Gradient Descent
Our update equation:
$$
\theta^{new} = \theta^{old} - \alpha \nabla_{\theta} J(\theta)
$$
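A minimal NumPy sketch of one update step, assuming a toy objective $J(\theta) = \frac{1}{2} \|\theta\|^2$ whose gradient is simply $\theta$ (the objective and the names `grad_J`, `alpha` are illustrative, not from the lecture):

```python
import numpy as np

# Toy objective J(theta) = 0.5 * ||theta||^2, whose gradient is simply theta
# (an assumption made only to keep the example self-contained).
def grad_J(theta):
    return theta

theta = np.array([1.0, -2.0, 0.5])   # theta_old
alpha = 0.1                          # learning rate

# One update: theta_new = theta_old - alpha * grad_J(theta_old)
theta = theta - alpha * grad_J(theta)
print(theta)                         # [ 0.9  -1.8   0.45]
```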
How can we compute the gradient?
- By hand
- The backpropagation algorithm
Gradients
Skip this part.
Jacobian Matrix: Generalization of the Gradient
Given a function with $m$ outputs and $n$ inputs:
$$
\bold{f}(\bold{x}) = [f_1(x_1, ..., x_n), ..., f_m(x_1, ..., x_n)]
$$
Its Jacobian is an $m \times n$ matrix of partial derivatives:
$$
\frac{\partial \bold{f}}{\partial \bold{x}} = \left[
\begin{matrix}
\frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n}
\end{matrix}
\right]
$$
$$
\left( \frac{\partial \bold{f}}{\partial \bold{x}} \right)_{ij} = \frac{\partial f_i}{\partial x_j}
$$
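For concreteness, here is a small sketch with a hand-written Jacobian for an assumed toy function (`f` and `jacobian_f` are hypothetical examples, not anything from the lecture):

```python
import numpy as np

# Toy function with m = 2 outputs and n = 3 inputs:
#   f(x) = [x1 * x2,  x2 + x3]
# Its Jacobian (df_i / dx_j) is the 2 x 3 matrix
#   [[x2, x1, 0],
#    [0,  1,  1]]
def f(x):
    return np.array([x[0] * x[1], x[1] + x[2]])

def jacobian_f(x):
    return np.array([[x[1], x[0], 0.0],
                     [0.0,  1.0,  1.0]])

x = np.array([2.0, 3.0, -1.0])
print(f(x))             # [6. 2.]
print(jacobian_f(x))    # [[ 3.  2.  0.]  [ 0.  1.  1.]]
```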
Example
$$
\bold{h} = f(\bold{z}), \quad h_i = f(z_i)
$$
$$
\left( \frac{\partial \bold{h}}{\partial \bold{z}} \right)_{ij} = \frac{\partial h_i}{\partial z_j} = \frac{\partial}{\partial z_j} f(z_i) =
\left\{
\begin{array}{ll}
f'(z_i) & \text{if } i = j \\
0 & \text{otherwise}
\end{array}
\right.
$$
So the Jacobian is a diagonal matrix.
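A quick sketch of that diagonal structure, assuming the elementwise function is `tanh` (an arbitrary choice for illustration):

```python
import numpy as np

# Assume the elementwise nonlinearity f is tanh; its derivative is 1 - tanh(z)^2.
f = np.tanh
f_prime = lambda z: 1.0 - np.tanh(z) ** 2

z = np.array([0.5, -1.0, 2.0])
h = f(z)

# The Jacobian dh/dz is diagonal: (dh/dz)_ij = f'(z_i) if i == j, else 0.
J = np.diag(f_prime(z))
print(J)
```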
Other Jacobians
$$
\frac{\partial}{\partial \bold{x}}(\bold{W} \bold{x} + \bold{b}) = \bold{W}
$$
$$
\frac{\partial}{\partial \bold{b}}(\bold{W} \bold{x} + \bold{b}) = \bold{I}
$$
$$
\frac{\partial}{\partial \bold{u}}(\bold{u}^T \bold{h}) = \bold{h}^T
$$
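These identities can be sanity-checked numerically; the sketch below uses small random tensors and one-sided perturbations (all variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
b = rng.normal(size=3)
x = rng.normal(size=4)
u = rng.normal(size=3)
h = rng.normal(size=3)
eps = 1e-6

# d(Wx + b)/dx = W: perturb each x_j and record how the output moves.
J = np.zeros((3, 4))
for j in range(4):
    dx = np.zeros(4)
    dx[j] = eps
    J[:, j] = ((W @ (x + dx) + b) - (W @ x + b)) / eps
print(np.allclose(J, W, atol=1e-4))   # True

# d(u^T h)/du = h^T: perturbing u_i changes the scalar by roughly h_i * eps.
g = np.array([((u + eps * np.eye(3)[i]) @ h - u @ h) / eps for i in range(3)])
print(np.allclose(g, h, atol=1e-4))   # True
```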
Chain rule
Skip this part.
Shape convention
The result of differentiation should always have the same shape as the variable we differentiate with respect to (the denominator). So use the Jacobian form as much as possible during the derivation, and reshape to follow the shape convention at the end.
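A small sketch of this workflow for the scalar $s = \bold{u}^T f(\bold{W}\bold{x} + \bold{b})$, assuming $f = \tanh$ (the concrete forward pass and all names here are illustrative): the intermediate gradient is derived in Jacobian form, and the final result is written in the shape of $\bold{W}$ itself as an outer product.

```python
import numpy as np

rng = np.random.default_rng(0)
n_out, n_in = 3, 4
W = rng.normal(size=(n_out, n_in))
b = rng.normal(size=n_out)
x = rng.normal(size=n_in)
u = rng.normal(size=n_out)

# Forward: z = Wx + b, h = tanh(z), s = u^T h.
z = W @ x + b
h = np.tanh(z)
s = u @ h

# Jacobian form: ds/dz = u^T diag(tanh'(z)), an n_out-dimensional row vector.
delta = u * (1.0 - np.tanh(z) ** 2)

# Shape convention: ds/dW must have the same shape as W (n_out x n_in),
# which here is the outer product delta x^T.
dW = np.outer(delta, x)
print(dW.shape == W.shape)   # True
```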
Backpropagation
In short, we compute the value of a function step by step, which is called forward propagation, and then traverse those steps in reverse order to compute the gradients. This is essentially a systematic application of the chain rule. The core idea is that:
$\text{Downstream gradient} = \text{Local gradient} \times \text{Upstream gradient}$
And when a node receives multiple upstream gradients (because it feeds several downstream nodes), we sum them up, as in the sketch below.
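A tiny worked example on the toy graph $f = x \cdot y + x$ (chosen only for illustration), where $x$ feeds two nodes and therefore receives two gradients that get summed:

```python
# Toy computational graph: f = x * y + x.
x, y = 2.0, 3.0

# Forward propagation: evaluate the graph step by step.
a = x * y          # a = 6.0
f = a + x          # f = 8.0

# Backward propagation: downstream gradient = local gradient * upstream gradient.
df_df = 1.0                    # gradient at the output node
df_da = 1.0 * df_df            # '+' passes the upstream gradient through
df_dx_via_add = 1.0 * df_df    # branch x -> '+'
df_dx_via_mul = y * df_da      # branch x -> '*', local gradient is y
df_dy = x * df_da              # local gradient of '*' w.r.t. y is x

# x has two outgoing edges, so its gradients from both branches are summed.
df_dx = df_dx_via_add + df_dx_via_mul
print(df_dx, df_dy)            # 4.0 2.0
```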
Manual Gradient checking: Numeric Gradient
Generally, we can use $\frac{f(x + \Delta x) - f(x)}{\Delta x}$ to estimate the gradient. In practice, the two-sided (central) form $\frac{f(x + \Delta x) - f(x - \Delta x)}{2 \Delta x}$ is better: its error shrinks as $O(\Delta x^2)$ rather than $O(\Delta x)$.
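A minimal gradient-check sketch using the central difference (`numeric_grad` and the quadratic test function are hypothetical, chosen only so the expected gradient $2x$ is known in closed form):

```python
import numpy as np

def numeric_grad(f, x, eps=1e-5):
    """Central-difference estimate of the gradient of a scalar function f at x."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        g[i] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return g

# Check a hand-derived gradient: for f(x) = sum(x_i^2), grad f(x) = 2x.
x = np.array([1.0, -2.0, 0.5])
print(np.allclose(numeric_grad(lambda v: np.sum(v ** 2), x), 2 * x))  # True
```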