MiniFlow – 自制玩具版 TensorFlow – 肥叉烧 feichashao.com

Contents

1 定义神经元
2 运算顺序 topological sort
3 Forward Pass
4 优化参数 - Gradient Descent
5 Backpropagation
6 参考文档

神经网络由一个个神经元组成，简单而言，每个神经元会对输入进行一些简单运算，得到输出。而训练一个网络，则是调整网络中的参数，使得输出接近学习样本的期望输出。
比如，假设有一个网络只计算 y = w * x, 初始值 w=5。把 x=10 放进这个网络，会得到 y=50. 而我们预期 y=20. 通过训练这个网络，w 将会被调整为 w=2, 输出 y=20 则符合(接近)预期。

在实际应用中，一个网络会有不少参数，我们需要一个方法来调整这些参数。按我的理解，这个“调整参数”的过程就是机器学习的过程。

一个常见的神经网络，会由 Input layer, Hidden layer 和 output layer 组成。每层 layer 由众多神经元组成，Input/output layer 的神经元通常仅存储一个数值，本身不会进行运算。Hidden layer 的神经元则会对输入进行运算，将输出传给下一层。

By Glosser.ca - Own work, Derivative of File:Artificial neural network.svg, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=24913461

Udacity CarND-term1 Lesson 5 让我们自己去实现一个 Miniflow, 相当于一个玩具版的 Tensorflow. 通过实现 Miniflow, 可以体会到 Forward Pass 和 Backpropagation 的过程。

定义神经元

首先可以定义一个Neuron的类，之后 Input/Output/Add/Sigmoid/linear 之类的神经元也可以继承这个类。Neuron 包含__init__， forward, 和 backward 这几个函数，以及一个 value 存放它自己的值。

在__init__时，它会获取一个 inboound_neurons 的列表，这个列表记录每个输入神经元。随后，__init__会把自身神经元加到每个输入神经元的 outbound_neurons 列表中。

## 源码来自 Udacity CarND-term1 Lesson 5 ##
class Neuron:
def __init__(self, inbound_neurons=[]):
# Neurons from which this Node receives values
self.inbound_neurons = inbound_neurons
# Neurons to which this Node passes values
self.outbound_neurons = []
# A calculated value
self.value = None
# Add this node as an outbound node on its inputs.
for n in self.inbound_neurons:
n.outbound_neurons.append(self)

# These will be implemented in a subclass.
def forward(self):
"""
Forward propagation.

Compute the output value based on `inbound_neurons` and
store the result in self.value.
"""
raise NotImplemented

运算顺序 topological sort

有了神经元，且神经元之间可以连接，我们需要决定数据的传输顺序。比如，下面网路的运算顺序可以为 A->B->C->D->E

可以使用Kahn's Algorithm [7]来对拓朴进行排序，决定计算的先后顺序。

## 源码来自 Udacity CarND-term1 Lesson 5 ##
def topological_sort(feed_dict):
"""
Sort generic nodes in topological order using Kahn's Algorithm.

`feed_dict`: A dictionary where the key is an `Input` node and the value is the respective value feed to that node.

Returns a list of sorted nodes.
"""

input_neurons = [n for n in feed_dict.keys()]

G = {}
neurons = [n for n in input_neurons]
while len(neurons) > 0:
n = neurons.pop(0)
if n not in G:
G[n] = {'in': set(), 'out': set()}
for m in n.outbound_neurons:
if m not in G:
G[m] = {'in': set(), 'out': set()}
G[n]['out'].add(m)
G[m]['in'].add(n)
neurons.append(m)

L = []
S = set(input_neurons)
while len(S) > 0:
n = S.pop()

if isinstance(n, Input):
n.value = feed_dict[n]

L.append(n)
for m in n.outbound_neurons:
G[n]['out'].remove(m)
G[m]['in'].remove(n)
# if no other incoming edges add to S
if len(G[m]['in']) == 0:
S.add(m)
return L

Forward Pass

Forward Pass 是把输入的数据从前往后计算，得出最终结果的过程。

每一个神经元只需做好自己的运算即可。比如，做加法的神经元 Add, 将两个输入加起来即可得到输出。

在代码中，把两个input neuron的值加起来，就能得到当前Neuron的值(输出到下一个Neuron的值).

## 源码来自 Udacity CarND-term1 Lesson 5 ##
class Add(Neuron):
def __init__(self, x, y):
# You could access `x` and `y` in forward with
# self.inbound_neurons[0] (`x`) and self.inbound_neurons[1] (`y`)
Neuron.__init__(self, [x, y])

def forward(self):
"""
Set the value of this neuron (`self.value`) to the sum of it's inbound_nodes.

Your code here!
"""
self.value = self.inbound_neurons[0].value + self.inbound_neurons[1].value

优化参数 - Gradient Descent

重点来了。通过 Forward Pass，可以将输入的值进行一系列计算得到最终结果，但是这个结果是我们所预期的吗？通常不。所以，可以通过调整神经元的参数，使得输出的结果符合预期。

还是开头的例子，我们希望能从w=5调整为w=2，得到我们预期的输出。那么，计算机怎么知道w=2就是想要的参数呢？

定义 cost
要调整参数，首先要知道，实际结果跟预期结果相差有多远，否则我们难以评估调整的效果。这个“相差多远”可以称为Cost.

比如，可以设定Cost为 MSE (Mean Square Error)。那么w=5时，cost=(50-20)^2/1=900. w=2时，cost=(20-20)^2/1=0.

用 gradient Descent 逐步贴近
站在上帝视角，我们知道w=2时cost最小，但是身在其中的程序怎么知道w=2可以得到最小cost呢？

用数学推导的方法，我们知道：
cost=(20-wx)^{2}
x=5

要使得cost最小，易得：
w=2

在这个简单的网络中，只有一个神经元，可以用数学的方法来求出cost最小时，w的取值。

但是，当网络变得复杂之后，数学分析的方法就不奏效了。首先，数学分析的复杂度非常高，而且组合成的最终函数不一定是凸函数，不一定有唯一的最小解。这时候，数学分析几乎不可实现。

于是，可以通过“逐步贴近”的方法，找到前往最小值的方向，前进一小步，再找方向，再前进一小步，如此反复，可以逐步找到最优解。

这个方向怎样找呢？可以通过求导，找到这个函数在当前位置的斜率(Gradient)，然后沿着这个斜率往下走。这个按着斜率往下走的方法，叫做 Gradient Descent.

Backpropagation

一个复杂的网络，对于特定输入值，求关于每个变量(比如w)的斜率，可以利用 chain rule 来求偏微分。

利用 chain rule, 可以求得每一个参数对于f的gradient. 在 x=-2, y=5, z=-4 的输入下，x对结果f的gradient是-4，也就是说，按照这个趋势(gradient)，x增加1,f的输出会增加-4. 同理，z对结果f的gradient是3,也就是说，按照这个趋势(gradient)，z增加1，f的输出会增加3.

对于Backpropagation，一些运算有特定的性质。比如，"add gate"会将output(右边)的gradient，分发到所有input上。"max gate"则只将output的gradient分给其中一个input，其他input只能得到0. “multiply gate”起到交换作用，比如上图中，y的gradient是2*x=6, x的gradient是2*y=-8.

体现到代码中，Backpropagation就是反过来，将output的所有gradient求和，乘上当前Neuron的gradient, 反向传递到 input 中去。

## 源码来自 Udacity CarND-term1 Lesson 5 ##
class Sigmoid(Layer):
"""
Represents a layer that performs the sigmoid activation function.
"""
def __init__(self, layer):
# The base class constructor.
Layer.__init__(self, [layer])

def _sigmoid(self, x):
"""
This method is separate from `forward` because it
will be used with `backward` as well.

`x`: A numpy array-like object.
"""
return 1. / (1. + np.exp(-x))

def forward(self):
"""
Perform the sigmoid function and set the value.
"""
input_value = self.inbound_layers[0].value
self.value = self._sigmoid(input_value)

def backward(self):
"""
Calculates the gradient using the derivative of
the sigmoid function.
"""
# Initialize the gradients to 0.
self.gradients = {n: np.zeros_like(n.value) for n in self.inbound_layers}

# Cycle through the outputs. The gradient will change depending
# on each output, so the gradients are summed over all outputs.
for n in self.outbound_layers:
# Get the partial of the cost with respect to this layer.
grad_cost = n.gradients[self]
"""
TODO: Your code goes here!

Set the gradients property to the gradients with respect to each input.

NOTE: See the Linear layer and MSE layer for examples.
"""
sigmoid = self.value
self.gradients[self.inbound_layers[0]] += sigmoid * (1-sigmoid) * grad_cost
## 对所有output反馈的gradient作求和运算。之所以要求和，是因为“对这个input作的修改，会影响到它后续输出的所有值”.

参考文档

[1] https://brilliant.org/wiki/backpropagation/
[2] https://brilliant.org/wiki/artificial-neural-network/
[3] https://stats.stackexchange.com/questions/212619/why-is-gradient-descent-required
[4] https://www.quora.com/In-mathematical-optimization-why-would-someone-use-gradient-descent-for-a-convex-function-Why-wouldnt-they-just-find-the-derivative-of-this-function-and-look-for-the-minimum-in-the-traditional-way
[5] https://en.wikipedia.org/wiki/Gradient_descent
[6] http://cs231n.github.io/optimization-2/
[7] https://en.wikipedia.org/wiki/Topological_sorting#Kahn.27s_algorithm