From Restricted Boltzmann Machine to Deep Neural Network -- Missing Links Explained
LaTeX version:
https://www.sharelatex.com/project/52295b43e77a8bec1401f6bc
1. Introduction
Deep neural networks (DNNs) have been a hot topic in the speech processing community in recent years [1]. People are eager to learn about the various theories behind DNNs. However, those who are not familiar with the background knowledge of probabilistic graphical models [2], Markov chain Monte Carlo (MCMC) methods [3], variational methods [4][5], etc., may find the related publications literally too deep to follow. Here I document my footprints in figuring out some fundamental theories of DNNs. This note revolves around the paper written by Hinton et al. in 2006 [6], which I think is the pivotal reference for understanding DNNs. I strongly recommend that readers at least scan through most of the references co-authored by Hinton before reading this note.
Below I will begin with the training algorithm of the restricted Boltzmann machine (RBM), i.e. contrastive divergence. Then I will investigate why initializing a DNN with RBMs is justifiable.
2. Background Overview
2.1. Terminology
The constantly changing notation is probably the first obstacle for readers trying to follow the various references on RBMs and DNNs. Below I try to unify these notations and clarify the connections among them.
Model parameters
In general, $\theta$ stands for the weights and bias terms of any Boltzmann machine with arbitrary structure. When we focus on the restricted Boltzmann machine, only the weights $W$ between the visible and hidden layers are taken into account:
$$
W=\{w_{ij}|\mbox{ i: visible node index, j: hidden node index}\}.
$$
Empirical data distribution
As stated in [7], the empirical data distribution "is a uniform mixture of delta distributions, one for each training point":
$$
\begin{aligned}
P_{data}(v) &\equiv P^0(v) \equiv p_0(v)\\
&= \frac{1}{N}\Sigma_{n=1}^N \delta (v-v_n)
\end{aligned}
$$
where $N$ is the number of training instances. The subscript/superscript "0" implies this is the initial distribution of the visible random variables, before applying Gibbs sampling.
Model distribution
By definition, the probability distribution function of an RBM is given by
$$
\begin{aligned}
P_{model}(v) &\equiv P_\theta^\infty(v) \equiv p(v;\theta)\\
&= \Sigma_h P_{model}(v,h) = \Sigma_h \frac{e^{-E(v,h)}}{\Sigma_{\tilde{v},\tilde{h}} e^{-E(\tilde{v},\tilde{h})}}
\end{aligned}
$$
where
$$
E(v,h) = -v'Wh.
$$
Again, the superscript $\infty$ implies this is the ultimate distribution of the visible random variables, after running Gibbs sampling to equilibrium.
2.2. Gibbs sampling on RBM
[Edited: 01/22/2017] Below I will give a head-first explanation of Gibbs sampling on RBMs, which is crucial for understanding:
- Why $P_{model}$ can be estimated by Gibbs sampling in contrastive divergence learning;
- Why we can unroll an RBM into a deep network and reverse the direction of propagation.
The process of Gibbs sampling, or the family of Markov chain Monte Carlo methods in general, can be imagined as a particle wandering in the space consisting of all possible outcomes of a set of random variables. How this particle "wanders" is governed by the underlying probability distribution over these random variables. We can thus estimate this underlying distribution by observing the history of outcomes the particle has visited.
It is no coincidence that this family of methods is called Markov chain Monte Carlo -- this particle moving from one possible "outcome" to the next is just like transitioning on a Markov chain from one "state" to the next.
Under the world view of probabilistic graphical models, an RBM with $V$ visible nodes and $H$ hidden nodes is just a probability distribution over $V+H$ random variables with bipartite dependencies, i.e. dependencies only between visible-hidden pairs. Thanks to this special structure, it is valid to perform Gibbs sampling on all hidden variables at once given the visible variables, and vice versa.
Assume first that we are given an RBM with linear perceptrons and weight matrix $W$. Let us set the initial value of the visible random variables $v$ to $v^0$, drawn from the training data set $D$. As stated in [8], each step of the Gibbs sampling is as follows:
1. Multiply the sample of the visible random vector $v$ by the weight matrix $W$ to obtain the sample of the hidden random vector $h$:
$$
h=vW.
$$
2. Multiply the sample of the hidden random vector $h$ by the transposed weight matrix to obtain the "new" sample of the visible random vector $v$:
$$
v=hW',
$$
where $W'$ is the transpose of $W$.
The full process of Gibbs sampling is thus nothing more than repeatedly multiplying an initial vector $v^0$ by $W$ and $W'$.
Now let us change our perspective: we view the nodes $v$ as the states of a Markov chain. The above Gibbs sampling is then exactly the state transition process of a time-homogeneous Markov chain with the state initialized as $v^0$ and the state transition matrix $T$ defined as
$$
T=WW'.
$$
The state transition process is given by
$$
v^t = v^{t-1}T,
$$
where $t$ is the iteration index. Note that if we look from the perspective of the hidden nodes $h$, the Gibbs sampling is also a state transition process, but with $T=W'W$ instead. This perspective is further illustrated in Fig. 1.
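Under the linear simplification, this equivalence is just associativity of matrix multiplication: two half-steps through $W$ and $W'$ equal one application of $T=WW'$. A tiny sanity check in Python (toy sizes of my own choosing):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(6, 4))     # toy weight matrix
v0 = rng.normal(size=6)         # initial visible vector

h = v0 @ W                      # half-step: visible -> hidden
v1 = h @ W.T                    # half-step: hidden -> visible

T = W @ W.T                     # the "state transition matrix" above
assert np.allclose(v1, v0 @ T)  # identical, by associativity
```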
Note that the above interpretation still holds with non-linear activation functions -- it simply generalizes the Markov chain transitions from linear to non-linear ones.
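As a concrete non-linear instance, in an RBM with binary stochastic units each half-step applies a sigmoid and then samples (this is the form used in Section 3 and in [10]). Below is a minimal sketch; the toy sizes, the omission of bias terms, and all names are illustrative assumptions of mine, matching the energy $E(v,h)=-v'Wh$ from Section 2.1.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W):
    """One full step of alternating Gibbs sampling on a binary RBM
    (bias terms omitted, matching E(v,h) = -v'Wh in Section 2.1)."""
    # Sample all hidden units at once given v (bipartite structure).
    p_h = sigmoid(v @ W)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    # Sample all visible units at once given h.
    p_v = sigmoid(h @ W.T)
    v_new = (rng.random(p_v.shape) < p_v).astype(float)
    return v_new, h

W = rng.normal(scale=0.1, size=(6, 4))        # toy 6-visible, 4-hidden RBM
v = rng.integers(0, 2, size=6).astype(float)  # v^0 drawn from the "data"
for t in range(1000):                         # prolonged Gibbs sampling
    v, h = gibbs_step(v, W)
# The sequence of visited v's asymptotically samples from P_model(v).
```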
3. RBM Training
3.1. ML training using gradient descent
In some of Hinton's most recent publications [6][10], the gradient of the log probability for ML training is simply written as
$$
\frac{\partial \log P_{model}(v)}{\partial w_{ij}} = <v_i h_j>_{P_{data}(v, h)} - <v_i h_j>_{P_{model}(v, h)}
$$
where $< ... >_P$ stands for taking the expectation under some distribution $P$. Note how I expanded the subscript compared to [10] for completeness. It seems that $v$ is just one sample drawn from the training data set $D$, which is true for stochastic gradient descent or contrastive divergence learning.
However, the original derivation (i.e. the more formal formulation) was in fact done for ordinary gradient descent, whose gradient is the average of the gradients over all the instances in $D$ [7][8][9]:
$$
\begin{aligned}
&<\frac{\partial \log P_{model}(v)}{\partial w_{ij}}>_{P_{data}(v)}\\
&=\frac{1}{N} \Sigma_{v\in{D}} \frac{\partial \log P_{model}(v)}{\partial w_{ij}}\\
&= <v_i h_j>_{P_{data}(v, h)} - <v_i h_j>_{P_{model}(v, h)}.
\end{aligned}
$$
With $P_{model}(v)$ given in Section 2.1, we can calculate the partial derivative:
$$
\begin{aligned}
&\frac{1}{N} \Sigma_{v\in{D}} \frac{\partial \log P_{model}(v)}{\partial w_{ij}}\\
&= \frac{1}{N} \Sigma_{v\in{D}} \frac{\partial \log \frac{\Sigma_h e^{-E(v,h)}}{\Sigma_{\tilde{v},\tilde{h}} e^{-E(\tilde{v},\tilde{h})}}}{\partial w_{ij}}\\
&= \frac{1}{N} \Sigma_{v\in{D}} \frac{\partial \log \Sigma_h e^{-E(v,h)}}{\partial w_{ij}} - \frac{1}{N} \Sigma_{v\in{D}} \frac{\partial \log \Sigma_{\tilde{v},\tilde{h}} e^{-E(\tilde{v},\tilde{h})}}{\partial w_{ij}}\\
&= \frac{1}{N} \Sigma_{v\in{D}} \frac{\Sigma_h [v_i h_j e^{-E(v,h)}]}{\Sigma_{\tilde{h}} e^{-E(v,\tilde{h})}} - \frac{1}{N} \Sigma_{v\in{D}} \frac{\Sigma_{\tilde{v},\tilde{h}} [\tilde{v}_i \tilde{h}_j e^{-E(\tilde{v},\tilde{h})}]}{\Sigma_{\hat{v},\hat{h}} e^{-E(\hat{v},\hat{h})}}\\
&= \frac{1}{N} \Sigma_{v\in{D}} \Sigma_h [v_i h_j \frac{e^{-E(v,h)}}{\Sigma_{\tilde{h}} e^{-E(v,\tilde{h})}}] - \frac{1}{N} \Sigma_{v\in{D}} \Sigma_{\tilde{v},\tilde{h}} [\tilde{v}_i \tilde{h}_j \frac{e^{-E(\tilde{v},\tilde{h})}}{\Sigma_{\hat{v},\hat{h}} e^{-E(\hat{v},\hat{h})}}].
\end{aligned}
$$
Note that the 2nd term does not contain the variable $v$, so averaging it with $\frac{1}{N}\Sigma_{v\in D}$ changes nothing and the factor can be omitted.
By comparing the expectation and summation terms in the above two equations, it is now clear that
$$
\begin{aligned}
P_{data}(v,h) &=
\begin{cases}
\frac{1}{N} \frac{e^{-E(v,h)}}{\Sigma_{\tilde{h}} e^{-E(v,\tilde{h})}} & \text{, if }v\in D\\
0 & \text{, otherwise}\\
\end{cases}\\
&=
\begin{cases}
\frac{1}{N} \frac{P_{model}(v,h)}{P_{model}(v)} & \text{, if }v\in D\\
0 & \text{, otherwise}\\
\end{cases}\\
&=
\begin{cases}
\frac{1}{N} P_{model}(h|v) & \text{, if }v\in D\\
0 & \text{, otherwise}\\
\end{cases}\\
&= \frac{1}{N}\Sigma_{n=1}^N P_{model}(h|v) \delta(v-v_n) \\
&= P_{model}(h|v)P_{data}(v)
\end{aligned}
$$
Also,
$$
P_{model}(v,h) = \frac{e^{-E(v,h)}}{\Sigma_{\tilde{v},\tilde{h}} e^{-E(\tilde{v},\tilde{h})}}
$$
which is identical to the definition of the probability distribution function of the RBM.
Assuming binary hidden variables, $P_{model}(h|v)$ can be computed simply by passing the data through the RBM, thanks to the structure of the RBM:
$$
P_{model}(h_j=1|v) = \sigma (\Sigma_i v_i w_{ij})
$$
where $\sigma(x)$ is the logistic sigmoid function $1/(1+\exp(-x))$ [10]; the full conditional $P_{model}(h|v)$ then follows by checking all possible configurations of $h$. (In fact, for $<v_i h_j>_{P_{data}(v,h)}$ even that is unnecessary: since $h_j$ is binary, the expectation collapses to $\frac{1}{N}\Sigma_{v\in D} v_i P_{model}(h_j=1|v)$.)
On the other hand, calculating $P_{model}(v,h)$ requires exhaustively going through all possible configurations of $\{v,h\}$, which is exponentially expensive for binary visible nodes and outright intractable when the visible nodes are continuous random variables.
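To make both expectation terms concrete, here is a sketch that computes the exact ML gradient of a tiny binary RBM by brute-force enumeration -- exactly the exhaustive computation that stops being feasible for realistic models. The sizes, data, and names are illustrative assumptions of mine:

```python
import itertools
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def exact_gradient(W, data):
    """Exact gradient <v_i h_j>_data - <v_i h_j>_model for a tiny
    binary RBM with E(v,h) = -v'Wh, by brute-force enumeration."""
    V, H = W.shape
    # Positive phase: the sum over h collapses to P(h_j=1|v) = sigmoid(vW).
    pos = np.mean([np.outer(v, sigmoid(v @ W)) for v in data], axis=0)
    # Negative phase: enumerate every joint configuration (v, h).
    configs = [(np.array(v, float), np.array(h, float))
               for v in itertools.product([0, 1], repeat=V)
               for h in itertools.product([0, 1], repeat=H)]
    weights = np.array([np.exp(v @ W @ h) for v, h in configs])  # e^{-E(v,h)}
    probs = weights / weights.sum()                              # P_model(v,h)
    neg = sum(p * np.outer(v, h) for p, (v, h) in zip(probs, configs))
    return pos - neg

W = np.zeros((3, 2))                           # toy 3-visible, 2-hidden RBM
data = np.array([[1., 0., 1.], [1., 1., 0.]])  # two "training" vectors
grad = exact_gradient(W, data)                 # ascend: W += lr * grad
```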
3.2. Contrastive divergence learning
In contrastive divergence learning, proposed by Hinton in 2002 [8], the above criterion for gradient descent is simplified in two aspects. First, observe that
$$
\begin{aligned}
<\frac{\partial \log P_{model}(v)}{\partial w_{ij}}>_{P_{data}(v)} &=\frac{1}{N} \Sigma_{v\in{D}} \frac{\partial \log P_{model}(v)}{\partial w_{ij}}\\
&=\frac{\partial \frac{1}{N} \Sigma_{v\in{D}} \log P_{model}(v)}{\partial w_{ij}}\\
&=\frac{\partial <\log P_{model}(v)>_{P_{data}(v)}}{\partial w_{ij}}
\end{aligned}
$$
This is why Hinton came up with the explanation that "Maximizing the log likelihood of the data (averaged over the data distribution) is equivalent to minimizing the Kullback-Leibler (KL) divergence between the data distribution, $P^0$, and the equilibrium distribution over the visible variables, $P^\infty_\theta$, that is produced by prolonged Gibbs sampling from the generative model" [8]:
$$
\begin{aligned}
&\arg\min_\theta KL(P^0\parallel P_\theta^\infty)\\
&= \arg\min_\theta [\Sigma_v P^0(v)\log P^0(v) - \Sigma_v P^0(v) \log P_\theta^\infty(v)]\\
&= \arg\max_\theta [\Sigma_v P^0(v) \log P_\theta^\infty(v)]\\
&= \arg\max_\theta [\Sigma_{v\in{D}} \log P_\theta^\infty(v)]\\
&= \arg\max_\theta [<\log P_{model}(v)>_{P_{data}(v)}]
\end{aligned}
$$
Based on this interpretation, we can generalize the concept and modify the criterion to minimize the KL divergence between $P^0$ and $P^t$ instead, where $P^t$ is the distribution after only $t$ steps of Gibbs sampling.
Second, as I mentioned previously, contrastive divergence learning is in fact a "stochastic" learning method. Instead of calculating $<v_i h_j>_{P_{model}(v,h)}$, contrastive divergence learning only takes the instance of $v_i h_j$ after $t$ steps of Gibbs sampling. Likewise, $<v_i h_j>_{P_{data}(v,h)}$ is replaced by the instance of $v_i h_j$ at the $0^{th}$ iteration of Gibbs sampling:
$$
h^0 = v^0 W.
$$
Theoretically, if we run sufficiently many iterations of Gibbs sampling, the instance of $v_i h_j$ is effectively drawn from the equilibrium distribution of the Markov chain, i.e. $P_{model}$. In practice, however, we usually run Gibbs sampling for only one step in contrastive divergence learning. More discussion can be found in [8][9].
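Putting the two simplifications together, one CD-1 weight update might look like the sketch below (binary units, no biases). Using the hidden probabilities rather than sampled states in the statistics follows a common recommendation in [10], but it is an implementation choice of mine here, not the only valid one.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, v0, lr=0.1):
    """One contrastive divergence (CD-1) update for a binary RBM:
    a single Gibbs step instead of sampling all the way to equilibrium."""
    # 0th iteration: hidden statistics from the data vector v^0.
    p_h0 = sigmoid(v0 @ W)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # One Gibbs step: reconstruct v^1, then infer the h^1 probabilities.
    v1 = (rng.random(W.shape[0]) < sigmoid(h0 @ W.T)).astype(float)
    p_h1 = sigmoid(v1 @ W)
    # Replace the two expectations by single-instance estimates.
    return W + lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))

W = rng.normal(scale=0.01, size=(6, 4))        # toy 6-visible, 4-hidden RBM
v0 = rng.integers(0, 2, size=6).astype(float)  # one training instance
W = cd1_update(W, v0)
```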
4. From RBM to DNN
4.1. Deep Networks at a Glance
The terminology around various deep networks is often mixed and confusing. We illustrate the definitions given by Hinton et al. [11][12] in Fig. 2. First, the structure of the deep neural network (DNN) is in fact exactly the same as what has been called a "neural network" or "multi-layer perceptron" (MLP) for decades. Second, the deep belief network (DBN) is a generative model with undirected edges between the top two layers, and directed edges pointing toward the visible layer between the remaining layers. Last but not least, the deep Boltzmann machine (DBM) is literally a multi-layer RBM.
Figure 2. Deep neural network (DNN), deep belief network (DBN) and deep Boltzmann machine (DBM).
Note that in Fig. 2 the number of nodes is the same in all layers. This configuration is just for the sake of the conceptual discussion below. Also just for discussion purposes, below I will use the term DBN for a generative neural network, not necessarily with undirected edges between the top two layers.
4.2. Unrolling an RBM, reversibility and complementary prior
A constantly mentioned concept of DNN initialization with RBMs is that one can "unroll" an RBM into a DNN with an infinite number of layers [6][13]. With the insight from Section 2.2 that performing Gibbs sampling on an RBM resembles the state transition process of a Markov chain, this unrolling concept is not hard to understand. All we have to do is rearrange Fig. 1 so that the transitions run from bottom to top, and the resulting structure is a DNN with infinitely many layers, as illustrated in the left half of Fig. 3.
What is more, since the resulting neural network can still be seen as a Markov chain (or two Markov chains, if you view the visible and hidden nodes separately), and the Markov chain resulting from Gibbs sampling satisfies detailed balance, a.k.a. reversibility [3][13], we can reverse the direction of weight propagation from bottom-up to top-down. We have thus conceptually transformed the DNN with infinite layers into a DBN, also with infinite layers, as illustrated in the right half of Fig. 3.
Figure 3. Unrolling an RBM, based on the Gibbs sampling procedure illustrated in Fig. 1, into a DNN with infinitely many layers; and reversing the DNN into a DBN.
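For the skeptical reader, the detailed balance claim can be checked numerically on a tiny binary RBM: the visible-to-visible kernel of one Gibbs step, $T(v\to\tilde{v}) = \Sigma_h P(h|v)P(\tilde{v}|h)$, satisfies $P(v)T(v\to\tilde{v}) = P(\tilde{v})T(\tilde{v}\to v)$. A brute-force sketch, with toy sizes of my own choosing:

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
V, H = 3, 2
W = rng.normal(size=(V, H))

vs = [np.array(v, float) for v in itertools.product([0, 1], repeat=V)]
hs = [np.array(h, float) for h in itertools.product([0, 1], repeat=H)]

# Joint P(v,h) proportional to e^{-E(v,h)} with E(v,h) = -v'Wh.
joint = np.array([[np.exp(v @ W @ h) for h in hs] for v in vs])
joint /= joint.sum()
p_v = joint.sum(axis=1)                 # marginal P(v)
p_h = joint.sum(axis=0)                 # marginal P(h)
p_h_given_v = joint / p_v[:, None]      # P(h|v), rows sum to 1
p_v_given_h = joint.T / p_h[:, None]    # P(v|h), rows sum to 1

# Visible-to-visible kernel of one full Gibbs step.
T = p_h_given_v @ p_v_given_h           # T[v, v'] = sum_h P(h|v) P(v'|h)

# Detailed balance: P(v) T(v -> v') is symmetric in (v, v').
flow = p_v[:, None] * T
assert np.allclose(flow, flow.T)
```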
The reversibility between DBN and DNN was explained in [6] with the so-called "complementary prior". Later, in [13], Teh et al. further explained this term. But it seems simpler to view the reversal of edge directions as factorizing the same joint probability distribution according to different Bayesian network structures:
$$
\begin{aligned}
P(v^0, h^0, v^1, h^1, \ldots) &= P(v^0)P(h^0|v^0)P(v^1|h^0)P(h^1|v^1)\cdots\\
&= P(v^0|h^0)P(h^0)P(v^1|h^0)P(h^1|v^1)\cdots\\
&= P(v^0|h^0)P(h^0|v^1)P(v^1)P(h^1|v^1)\cdots\\
&= \cdots
\end{aligned}
$$
4.3. Layer-by-Layer initialization of DNN with RBM
Now we are ready to see why the weights of a DNN can be initialized with an RBM. In [6][13], the proposed DBN training algorithm was justified with the concept of variational bounds. But, again, it seems simpler to view the whole process as follows.
Let us begin with the DBN in Fig. 3. Since this DBN is transformed from an RBM, which we assume has already been trained on some data, the log likelihood of the data, $\log P(v)$, should already have been maximized "to the best we could for now". If we take the first hidden layer $h^0$ into consideration, the likelihood of the data can be written as
$$
\begin{aligned}
P_{model}(v_i=1) &= \Sigma_{h^0} P_{model}(v_i=1, h^0)\\
&= \Sigma_{h^0} P_{model}(h^0)P_{model}(v_i=1|h^0)\\
&= \Sigma_{h^0} P_{model}(h^0)\sigma(\Sigma_j w_{ij} h^0_j).
\end{aligned}
$$
Now, if we wish to push the likelihood even higher, we can either polish $W$ or improve $P_{model}(h^0)$. But in a DBN transformed from an already-trained RBM, the weights between every pair of layers are copies of the same weight matrix, which has already been optimized by gradient descent, contrastive divergence, or their stochastic versions. Therefore, as in [6], we turn to optimizing $P_{model}(h^0)$ as follows:
- We fix the weight matrix between the first and second layers, and "untie" it from all the remaining weight matrices, which means all the weights above the 2nd layer are now free parameters.
- The sub-DBN above the 2nd layer of the original DBN is again equivalent to an RBM whose visible layer is $h^0$. Hence we can apply the RBM training algorithm to maximize the likelihood of $h^0$, i.e. $P_{model}(h^0)$ in the equation above.
By iteratively applying the above steps up the DBN, layer by layer, the likelihood of the original data set, $P_{model}(v)$, can theoretically be increased further. A sketch of this greedy procedure follows below.
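Here is a compact sketch of the greedy layer-by-layer procedure, with a CD-1 update as in Section 3.2: train an RBM on the data, fix its weights, propagate the data one layer up, and train the next RBM on the resulting hidden activations. All sizes, data, and hyper-parameters are illustrative assumptions of mine.

```python
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, v0, lr=0.1):
    """One CD-1 update for a binary RBM (same sketch as in Section 3.2)."""
    p_h0 = sigmoid(v0 @ W)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    v1 = (rng.random(W.shape[0]) < sigmoid(h0 @ W.T)).astype(float)
    return W + lr * (np.outer(v0, p_h0) - np.outer(v1, sigmoid(v1 @ W)))

def train_rbm(data, num_hidden, epochs=10, lr=0.1):
    """Train one RBM on `data` with CD-1."""
    W = rng.normal(scale=0.01, size=(data.shape[1], num_hidden))
    for _ in range(epochs):
        for v0 in data:
            W = cd1_update(W, v0, lr)
    return W

# Greedy layer-by-layer pretraining: each new RBM is trained on the
# previous layer's hidden activations, i.e. the sub-DBN above the
# current layer is treated as a fresh RBM.
data = rng.integers(0, 2, size=(100, 8)).astype(float)  # toy binary data
hidden_sizes = [6, 5, 4]                                # illustrative widths
weights, layer_input = [], data
for size in hidden_sizes:
    W = train_rbm(layer_input, size)
    weights.append(W)                         # fix this layer's weights
    layer_input = sigmoid(layer_input @ W)    # propagate data one layer up
# `weights` now initializes a DNN, to be fine-tuned with back-propagation
# after adding a softmax output layer, as described below and in [12].
```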
Finally, as stated in [12], we simply reverse the DBN derived above, add a softmax layer on top, and run the traditional back-propagation training algorithm for neural networks. The resulting DNN is what we want.
5. Conclusion
In this document, I tried to clarify some of the ambiguities that people may run into while reading the important references on RBMs and DNNs. Due to the commentary nature of this document, it is strongly suggested that you at least read some of Hinton's papers and have some basic idea about RBMs and DNNs before reading it. I believe this document can save a lot of time and effort for those who are dying to figure out all the equations in Hinton's papers and understand the basic theories of deep learning. (Or you might not care at all, because you just want to run some existing libraries and get the job done.)
References
[1] Hinton, Geoffrey, et al. "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups." Signal Processing Magazine, IEEE 29.6 (2012): 82-97.
[2] Koller, Daphne, and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. The MIT Press, 2009.
[3] "MCMC expalined" [online],\\http://www.youtube.com/playlist?list=PL0E34F5354AA5952E
[4] Neal, Radford M., and Geoffrey E. Hinton. "A view of the EM algorithm that justifies incremental, sparse, and other variants." Learning in graphical models. Springer Netherlands, 1998. 355-368.
[5] Beal, Matthew James. Variational algorithms for approximate Bayesian inference. Diss. University of London, 2003.
[6] Hinton, Geoffrey E., Simon Osindero, and Yee-Whye Teh. "A fast learning algorithm for deep belief nets." Neural computation 18.7 (2006): 1527-1554.
[7] Sutskever, Ilya, and Tijmen Tieleman. "On the convergence properties of contrastive divergence." International Conference on Artificial Intelligence and Statistics. 2010.
[8] Hinton, Geoffrey E. "Training products of experts by minimizing contrastive divergence." Neural computation 14.8 (2002): 1771-1800.
[9] Carreira-Perpinan, Miguel A., and Geoffrey E. Hinton. "On contrastive divergence learning." Artificial Intelligence and Statistics. Vol. 2005. 2005.
[10] Hinton, Geoffrey. "A practical guide to training restricted Boltzmann machines." Momentum 9.1 (2010).
[11] Salakhutdinov, Ruslan, and Geoffrey E. Hinton. "Deep Boltzmann machines." International Conference on Artificial Intelligence and Statistics. 2009.
[12] Hinton, Geoffrey, et al. "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups." Signal Processing Magazine, IEEE 29.6 (2012): 82-97.
[13] Teh, Yee Whye, Geoffrey E. Hinton, and Simon Osindero. "Setting the Stage: Complementary Priors and Variational Bounds." Deep Learning Workshop, NIPS, December 6, 2007.