This chapter mainly covers how to compute derivatives, and the concept and steps of the backpropagation algorithm.
Remember: Stochastic Gradient Descent
Our update equation:
$$
\theta^{new} = \theta^{old} - \alpha \nabla_{\theta} J(\theta)
$$
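A minimal NumPy sketch of one update step, assuming a toy objective $J(\theta) = \frac{1}{2} \|\theta\|^2$ whose gradient is simply $\theta$ (the objective and the names `grad_J`, `alpha` are illustrative, not from the lecture):

```python
import numpy as np

# Toy objective J(theta) = 0.5 * ||theta||^2, whose gradient is simply theta
# (an assumption made only to keep the example self-contained).
def grad_J(theta):
    return theta

theta = np.array([1.0, -2.0, 0.5])   # theta_old
alpha = 0.1                          # learning rate

# One update: theta_new = theta_old - alpha * grad_J(theta_old)
theta = theta - alpha * grad_J(theta)
print(theta)                         # [ 0.9  -1.8   0.45]
```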
How can we compute the gradient?
- By hand
- The backpropagation algorithm
Gradients
Skip this part.
Jacobian Matrix: Generalization of the Gradient
Given a function with $m$ outputs and $n$ inputs:
$$
\bold{f}(\bold{x}) = [f_1(x_1, ..., x_n), ..., f_m(x_1, ..., x_n)]
$$
Its Jacobian is an $m \times n$ matrix of partial derivatives:
$$
\frac{\partial \bold{f}}{\partial \bold{x}} = \left[
\begin{matrix}
\frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n}
\end{matrix}
\right]
$$
$$
\left( \frac{\partial \bold{f}}{\partial \bold{x}} \right)_{ij} = \frac{\partial f_i}{\partial x_j}
$$
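For concreteness, here is a small sketch with a hand-written Jacobian for an assumed toy function (`f` and `jacobian_f` are hypothetical examples, not anything from the lecture):

```python
import numpy as np

# Toy function with m = 2 outputs and n = 3 inputs:
#   f(x) = [x1 * x2,  x2 + x3]
# Its Jacobian (df_i / dx_j) is the 2 x 3 matrix
#   [[x2, x1, 0],
#    [0,  1,  1]]
def f(x):
    return np.array([x[0] * x[1], x[1] + x[2]])

def jacobian_f(x):
    return np.array([[x[1], x[0], 0.0],
                     [0.0,  1.0,  1.0]])

x = np.array([2.0, 3.0, -1.0])
print(f(x))             # [6. 2.]
print(jacobian_f(x))    # [[ 3.  2.  0.]  [ 0.  1.  1.]]
```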
Example
$$
\bold{h} = f(\bold{z}), \quad h_i = f(z_i)
$$
$$
\left( \frac{\partial \bold{h}}{\partial \bold{z}} \right)_{ij} = \frac{\partial h_i}{\partial z_j} = \frac{\partial}{\partial z_j} f(z_i) =
\left\{
\begin{array}{ll}
f'(z_i) & \text{if } i = j \\
0 & \text{otherwise}
\end{array}
\right.
$$
So the Jacobian is a diagonal matrix.
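A quick sketch of that diagonal structure, assuming the elementwise function is `tanh` (an arbitrary choice for illustration):

```python
import numpy as np

# Assume the elementwise nonlinearity f is tanh; its derivative is 1 - tanh(z)^2.
f = np.tanh
f_prime = lambda z: 1.0 - np.tanh(z) ** 2

z = np.array([0.5, -1.0, 2.0])
h = f(z)

# The Jacobian dh/dz is diagonal: (dh/dz)_ij = f'(z_i) if i == j, else 0.
J = np.diag(f_prime(z))
print(J)
```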
Other Jacobians
$$
\frac{\partial}{\partial \bold{x}}(\bold{W} \bold{x} + \bold{b}) = \bold{W}
$$
$$
\frac{\partial}{\partial \bold{b}}(\bold{W} \bold{x} + \bold{b}) = \bold{I}
$$
$$
\frac{\partial}{\partial \bold{u}}(\bold{u}^T \bold{h}) = \bold{h}^T
$$
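These identities can be sanity-checked numerically; the sketch below uses small random tensors and one-sided perturbations (all variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
b = rng.normal(size=3)
x = rng.normal(size=4)
u = rng.normal(size=3)
h = rng.normal(size=3)
eps = 1e-6

# d(Wx + b)/dx = W: perturb each x_j and record how the output moves.
J = np.zeros((3, 4))
for j in range(4):
    dx = np.zeros(4)
    dx[j] = eps
    J[:, j] = ((W @ (x + dx) + b) - (W @ x + b)) / eps
print(np.allclose(J, W, atol=1e-4))   # True

# d(u^T h)/du = h^T: perturbing u_i changes the scalar by roughly h_i * eps.
g = np.array([((u + eps * np.eye(3)[i]) @ h - u @ h) / eps for i in range(3)])
print(np.allclose(g, h, atol=1e-4))   # True
```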
Chain rule
Skip this part.
Shape convention
The result of differentiation should always have the same shape as the variable we differentiate with respect to (the denominator). So use the Jacobian form as much as possible during the derivation, and reshape to follow the shape convention at the end.
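A small sketch of this workflow for the scalar $s = \bold{u}^T f(\bold{W}\bold{x} + \bold{b})$, assuming $f = \tanh$ (the concrete forward pass and all names here are illustrative): the intermediate gradient is derived in Jacobian form, and the final result is written in the shape of $\bold{W}$ itself as an outer product.

```python
import numpy as np

rng = np.random.default_rng(0)
n_out, n_in = 3, 4
W = rng.normal(size=(n_out, n_in))
b = rng.normal(size=n_out)
x = rng.normal(size=n_in)
u = rng.normal(size=n_out)

# Forward: z = Wx + b, h = tanh(z), s = u^T h.
z = W @ x + b
h = np.tanh(z)
s = u @ h

# Jacobian form: ds/dz = u^T diag(tanh'(z)), an n_out-dimensional row vector.
delta = u * (1.0 - np.tanh(z) ** 2)

# Shape convention: ds/dW must have the same shape as W (n_out x n_in),
# which here is the outer product delta x^T.
dW = np.outer(delta, x)
print(dW.shape == W.shape)   # True
```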
Backpropagation
In short, we compute the value of a function step by step, which is called forward propagation, and then traverse those steps in reverse order to compute the gradients. This is essentially a systematic application of the chain rule. The core idea is that:
$\text{Downstream gradient} = \text{Local gradient} \times \text{Upstream gradient}$
And when a node receives multiple upstream gradients (because it feeds several downstream nodes), we sum them up, as in the sketch below.
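A tiny worked example on the toy graph $f = x \cdot y + x$ (chosen only for illustration), where $x$ feeds two nodes and therefore receives two gradients that get summed:

```python
# Toy computational graph: f = x * y + x.
x, y = 2.0, 3.0

# Forward propagation: evaluate the graph step by step.
a = x * y          # a = 6.0
f = a + x          # f = 8.0

# Backward propagation: downstream gradient = local gradient * upstream gradient.
df_df = 1.0                    # gradient at the output node
df_da = 1.0 * df_df            # '+' passes the upstream gradient through
df_dx_via_add = 1.0 * df_df    # branch x -> '+'
df_dx_via_mul = y * df_da      # branch x -> '*', local gradient is y
df_dy = x * df_da              # local gradient of '*' w.r.t. y is x

# x has two outgoing edges, so its gradients from both branches are summed.
df_dx = df_dx_via_add + df_dx_via_mul
print(df_dx, df_dy)            # 4.0 2.0
```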
Manual Gradient checking: Numeric Gradient
Generally, we can use $\frac{f(x + \Delta x) - f(x)}{\Delta x}$ to estimate the gradient. In practice, the two-sided (central) form $\frac{f(x + \Delta x) - f(x - \Delta x)}{2 \Delta x}$ is better: its error shrinks as $O(\Delta x^2)$ rather than $O(\Delta x)$.
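A minimal gradient-check sketch using the central difference (`numeric_grad` and the quadratic test function are hypothetical, chosen only so the expected gradient $2x$ is known in closed form):

```python
import numpy as np

def numeric_grad(f, x, eps=1e-5):
    """Central-difference estimate of the gradient of a scalar function f at x."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        g[i] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return g

# Check a hand-derived gradient: for f(x) = sum(x_i^2), grad f(x) = 2x.
x = np.array([1.0, -2.0, 0.5])
print(np.allclose(numeric_grad(lambda v: np.sum(v ** 2), x), 2 * x))  # True
```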