# Neural Networks as Gaussian Processes

Interesting Idea…

### Gaussian Process

A Gaussian process (GP) is a probability distribution over functions $f(x)$ such that the values of $f$ evaluated at any finite set of points are jointly Gaussian.

A GP $f(x)$ is fully specified by a mean function $\mu(x)$ and a covariance function (or kernel) $k(x, x')$:
\begin{align} \mu(x) &=\mathbb E[f(x)] \\ k(x,x') &=\mathbb E[(f(x) - \mu(x))(f(x')-\mu(x'))] \end{align}
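To make the definition concrete, the sketch below draws a few sample functions from a zero-mean GP evaluated on a grid of points. The squared-exponential kernel, its `length_scale`, the `jitter` term, and the function names are illustrative assumptions, not something the definition above prescribes.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel (an assumed choice of k) between rows of A and B."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-0.5 * np.maximum(sq_dists, 0.0) / length_scale**2)

def sample_gp_prior(X, n_samples=3, jitter=1e-6, seed=0):
    """Draw n_samples vectors f(X) ~ N(0, K(X, X)) using a Cholesky factor of K."""
    rng = np.random.default_rng(seed)
    K = rbf_kernel(X, X) + jitter * np.eye(len(X))  # jitter keeps K numerically positive definite
    L = np.linalg.cholesky(K)
    return (L @ rng.standard_normal((len(X), n_samples))).T

X = np.linspace(-3.0, 3.0, 50)[:, None]  # 50 one-dimensional inputs
samples = sample_gp_prior(X)             # shape (3, 50): one row per sampled function
```

Each row of `samples` is one draw of $[f(x_1), \ldots, f(x_{50})]$; plotting the rows against `X` shows smooth random curves whose wiggliness is controlled by `length_scale`.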

### Gaussian Process Regression

Assume $y=f(\boldsymbol x)+\epsilon$, where $f\sim \mathcal{GP}(\mu,k)$ and $\epsilon\sim \mathcal N(0,\sigma_n^2)$ is independent observation noise, and suppose we have observations $\{(\boldsymbol{x}_{i}, y_{i})\}_{i=1}^{n}$ with $\boldsymbol{x}_{i} \in \mathbb{R}^{p}$ and $y_{i} \in \mathbb{R}$. By the definition of a GP, the function values at the observed inputs satisfy
$$\left[f\left(\boldsymbol{x}_{1}\right), f\left(\boldsymbol{x}_{2}\right), \ldots, f\left(\boldsymbol{x}_{n}\right)\right]^{\mathrm{T}} \sim \mathcal{N}(\boldsymbol{\mu}_X, K_X)$$
where $\boldsymbol{\mu}_X=[\mu\left(\boldsymbol{x}_{1}\right), \mu\left(\boldsymbol{x}_{2}\right), \ldots, \mu\left(\boldsymbol{x}_{n}\right)]^{\mathrm{T}}$, and $K_X$ is an $n \times n$ matrix with $[K_X]_{ij}=k(\boldsymbol x_i, \boldsymbol x_j)$.

To predict at unseen inputs $Z=[\boldsymbol{z}_1,\boldsymbol{z}_2,\ldots,\boldsymbol{z}_m]^{\mathrm{T}}$, let $f_*=f(Z)$ and $\boldsymbol y=[y_1,\ldots,y_n]^{\mathrm{T}}$. Then $\boldsymbol y$ and $f_*$ are jointly Gaussian:
$$\left[\begin{array}{c} \boldsymbol{y} \\ f_{*} \end{array}\right] \sim \mathcal{N}\left(\left[\begin{array}{l} \boldsymbol{\mu}(X) \\ \boldsymbol{\mu}(Z) \end{array}\right],\left[\begin{array}{cl} K(X, X)+\sigma_{n}^{2} \mathbf{I} & K(Z, X)^{\mathrm{T}} \\ K(Z, X) & K(Z, Z) \end{array}\right]\right)$$
where $\boldsymbol{\mu}(X)=\boldsymbol{\mu}_X$, $\boldsymbol{\mu}(Z)=\left[\mu\left(\boldsymbol{z}_{1}\right), \ldots, \mu\left(\boldsymbol{z}_{m}\right)\right]^{\mathrm{T}}$, and $K(X, X)=K_X$.

$K(Z, X)$ is an $m \times n$ matrix with $[K(Z, X)]_{ij}=k(\boldsymbol z_i, \boldsymbol x_j)$.

$K(Z, Z)$ is an $m \times m$ matrix with $[K(Z, Z)]_{ij}=k(\boldsymbol z_i, \boldsymbol z_j)$.

By the conditioning property of the multivariate Gaussian, we have $p\left(f_{*} \mid X, \boldsymbol{y}, Z\right)=\mathcal{N}(\hat{\boldsymbol{\mu}}, \hat{\Sigma})$, where
$$\begin{aligned} &\hat{\boldsymbol{\mu}}=K(X, Z)^{\mathrm{T}}\left(K(X, X)+\sigma_{n}^{2} \mathbf{I}\right)^{-1}(\boldsymbol{y}-\boldsymbol{\mu}(X))+\boldsymbol{\mu}(Z) \\ &\hat{\Sigma}=K(Z, Z)-K(X, Z)^{\mathrm{T}}\left(K(X, X)+\sigma_{n}^{2} \mathbf{I}\right)^{-1} K(X, Z) \end{aligned}$$
and $K(X, Z)=K(Z, X)^{\mathrm{T}}$. Finally, adding the observation noise, the predictive distribution of $\boldsymbol y_{*}=f_*+\boldsymbol\epsilon$ is
$$p\left(\boldsymbol y_{*} \mid X, \boldsymbol{y}, Z\right)=\mathcal{N}(\hat{\boldsymbol{\mu}}, \hat{\Sigma}+\sigma^2_n \mathbf I)$$
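The posterior formulas translate almost line by line into NumPy. The following is a minimal sketch assuming a zero mean function ($\mu \equiv 0$) and a squared-exponential kernel; the names `rbf_kernel` and `gp_predict`, the toy data, and the noise level are hypothetical choices for illustration. A Cholesky factorization is used instead of an explicit matrix inverse, which is a common numerical choice rather than something the derivation requires.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel (an assumed choice of k) between rows of A and B."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-0.5 * np.maximum(sq_dists, 0.0) / length_scale**2)

def gp_predict(X, y, Z, kernel, noise_var):
    """Return (mu_hat, Sigma_hat, covariance of y_*) for a zero-mean GP prior."""
    K_xx = kernel(X, X) + noise_var * np.eye(len(X))     # K(X, X) + sigma_n^2 I
    K_zx = kernel(Z, X)                                  # K(Z, X), shape (m, n)
    L = np.linalg.cholesky(K_xx)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # (K(X, X) + sigma_n^2 I)^{-1} y
    V = np.linalg.solve(L, K_zx.T)                       # L^{-1} K(X, Z)
    mu_hat = K_zx @ alpha
    Sigma_hat = kernel(Z, Z) - V.T @ V
    return mu_hat, Sigma_hat, Sigma_hat + noise_var * np.eye(len(Z))

# Toy data: noisy samples of sin(x) at five inputs (illustrative only).
rng = np.random.default_rng(0)
X = np.linspace(-2.0, 2.0, 5)[:, None]
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(5)
Z = np.linspace(-2.5, 2.5, 20)[:, None]
mu_hat, Sigma_hat, Sigma_y = gp_predict(X, y, Z, rbf_kernel, noise_var=0.01)
```

The diagonal of `Sigma_hat` gives the pointwise posterior variance of $f_*$; it shrinks near the training inputs and grows back toward the prior variance away from them.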

### Neural Networks as Gaussian Processes

#### Lindeberg-Lévy CLT

Suppose $\left\{X_{1}, \ldots, X_{n}\right\}$ is a sequence of i.i.d. random variables with $\mathbb{E}\left[X_{i}\right]=\mu$ and $\operatorname{Var}\left[X_{i}\right]=\sigma^{2}<\infty$, and let $\bar{X}_{n}=\frac{1}{n}\sum_{i=1}^{n}X_i$ be the sample mean. Then as $n$ approaches infinity, the random variables $\sqrt{n}\left(\bar{X}_{n}-\mu\right)$ converge in distribution to a normal $\mathcal{N}\left(0, \sigma^{2}\right)$:
$$\sqrt{n}\left(\bar{X}_{n}-\mu\right) \stackrel{d}{\rightarrow} \mathcal{N}\left(0, \sigma^{2}\right) .$$
(From Wikipedia)
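A quick numerical check of the statement, as a sketch: draw many independent samples of size $n$ from a clearly non-Gaussian distribution and look at the rescaled sample means. The exponential distribution (for which $\mu=1$ and $\sigma^2=1$), the sample sizes, and the seed are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 500, 4000

# Exponential(1) is skewed, with mean mu = 1 and variance sigma^2 = 1.
draws = rng.exponential(scale=1.0, size=(trials, n))

# One value of sqrt(n) * (sample mean - mu) per trial.
scaled = np.sqrt(n) * (draws.mean(axis=1) - 1.0)

print("empirical mean:", scaled.mean())      # should be close to 0
print("empirical variance:", scaled.var())   # should be close to sigma^2 = 1
```

A histogram of `scaled` should look close to a standard normal density even though the individual draws are far from Gaussian.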

#### Single Layer Neural Network

Consider a neural network with a single hidden layer: input $x\in \mathbb R^{d_{in}}$, hidden activations $x^1\in \mathbb R^{N_1}$, and output $z^1\in \mathbb R^{d_{out}}$, with activation function $\phi$:
$$z_{i}^{1}(x)=b_{i}^{1}+\sum_{j=1}^{N_{1}} W_{i j}^{1} x_{j}^{1}(x), \quad x_{j}^{1}(x)=\phi\left(b_{j}^{0}+\sum_{k=1}^{d_{i n}} W_{j k}^{0} x_{k}\right)$$
The weights and biases are drawn independently as $W_{i j}^{l} \sim \mathcal{N}\left(0, \sigma_{w}^{2} / N_{l}\right)$ and $b_{i}^{l} \sim \mathcal{N}\left(0, \sigma_{b}^{2}\right)$.

Taking $N_1 \to \infty$ yields some useful properties. Define $\tilde W_{i j}^{1}=\sqrt{ N_1}W_{i j}^{1} \sim \mathcal{N}\left(0, \sigma_{w}^{2}\right)$. Then
$$z_{i}^{1}(x)=b_{i}^{1}+\sqrt{ N_1}\cdot\frac{1}{N_1}\sum_{j=1}^{N_{1}} \tilde W_{i j}^{1} x_{j}^{1}(x)\stackrel{d}{\rightarrow} \text{ Gaussian}$$
because the terms $\tilde W_{i j}^{1} x_{j}^{1}(x)$ are i.i.d. across $j$ with zero mean, so (given finite variance) the CLT applies to their sample mean. Likewise, by the multivariate CLT, we have
$$\{z_{i}^{1}(x^{(1)}),z_{i}^{1}(x^{(2)}),\ldots,z_{i}^{1}(x^{(k)})\}\stackrel{d}{\rightarrow} \text{joint Gaussian}$$
for any finite set of inputs $x^{(1)},\ldots,x^{(k)}$. Hence $z_i^1$ is a Gaussian process with mean $\mu^1$ and kernel $K^1$. The mean vanishes because $b_i^1$ and $W_{ij}^1$ are zero-mean and $W_{ij}^1$ is independent of $x_j^1$:
$$\mu^1(x)= \mathbb E\left[z_{i}^{1}(x)\right]=\mathbb E\left[b_{i}^{1}+\sum_{j=1}^{N_{1}} W_{i j}^{1} x_{j}^{1}(x)\right]=0$$
For the kernel,
\begin{align} K^1(x,x')&\equiv\mathbb E\left[z_{i}^{1}(x)z_{i}^{1}(x')\right]\\ &=\mathbb E\left[\left(b_{i}^{1}+\sum_{j=1}^{N_{1}} W_{i j}^{1} x_{j}^{1}(x)\right)\left(b_{i}^{1}+\sum_{j=1}^{N_{1}} W_{i j}^{1} x_{j}^{1}(x')\right)\right]\\ &=\mathbb E\left[\left(b_{i}^{1}\right)^2+b_{i}^{1}\sum_{j=1}^{N_{1}} W_{i j}^{1} x_{j}^{1}(x)+b_{i}^{1}\sum_{j=1}^{N_{1}} W_{i j}^{1} x_{j}^{1}(x')+\left(\sum_{j=1}^{N_{1}} W_{i j}^{1} x_{j}^{1}(x)\right)\left(\sum_{j=1}^{N_{1}} W_{i j}^{1} x_{j}^{1}(x')\right)\right]\\ &=\mathbb E\left[\left(b_{i}^{1}\right)^2\right]+\mathbb E\left[\left(\sum_{j=1}^{N_{1}} W_{i j}^{1} x_{j}^{1}(x)\right)\left(\sum_{j=1}^{N_{1}} W_{i j}^{1} x_{j}^{1}(x')\right)\right]\\ &=\sigma_b^2+\mathbb E\left[\sum_{j=1}^{N_{1}}\sum_{j'=1}^{N_{1}} W_{i j}^{1} x_{j}^{1}(x) W_{i j'}^{1} x_{j'}^{1}(x')\right]\\ &=\sigma_b^2+\mathbb E\left[\sum_{j=1}^{N_{1}} \left(W_{i j}^{1}\right)^2 x_{j}^{1}(x)x_{j}^{1}(x')\right]\quad(W_{i j}^{1}\text{ and }W_{i j'}^{1}\text{ are independent and zero-mean when }j\neq j')\\ &=\sigma_b^2+\mathbb E\left[\sum_{j=1}^{N_{1}} \frac{\sigma_{w}^{2}}{N_1} x_{j}^{1}(x)x_{j}^{1}(x')\right]\\ &=\sigma_b^2+\sigma_{w}^{2}\mathbb E\left[x_{j}^{1}(x)x_{j}^{1}(x')\right]\\ &\equiv\sigma_b^2+\sigma_{w}^{2}C(x,x') \end{align}
The cross terms involving $b_i^1$ vanish because the bias is zero-mean and independent of the weights and activations, and the last expectation does not depend on $j$ because the $x_j^1$ are identically distributed. Thus, a single-layer neural network can be described as a Gaussian process $z_{i}^{1} \sim \mathcal{GP}\left(\mu^{1}, K^{1}\right)$.
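The kernel formula can be checked by simulation. The sketch below samples many independent random networks with one hidden layer, estimates $\mathbb E[z^1(x)z^1(x')]$ empirically, and compares it with $\sigma_b^2+\sigma_w^2 C(x,x')$, where $C$ is estimated by sampling the Gaussian pre-activations directly. The `tanh` activation, the variance $\sigma_w^2/d_{in}$ for the first-layer weights, the two input points, and all sizes are assumptions made for illustration.

```python
import numpy as np

def sample_outputs(x_pair, n_hidden, n_nets, sigma_w, sigma_b, rng):
    """z^1(x) and z^1(x') for n_nets independent random tanh networks (scalar output)."""
    d_in = x_pair.shape[1]
    W0 = rng.normal(0.0, sigma_w / np.sqrt(d_in), size=(n_nets, n_hidden, d_in))
    b0 = rng.normal(0.0, sigma_b, size=(n_nets, n_hidden, 1))
    W1 = rng.normal(0.0, sigma_w / np.sqrt(n_hidden), size=(n_nets, 1, n_hidden))
    b1 = rng.normal(0.0, sigma_b, size=(n_nets, 1, 1))
    hidden = np.tanh(W0 @ x_pair.T + b0)   # x^1, shape (n_nets, n_hidden, 2)
    return (W1 @ hidden + b1)[:, 0, :]     # z^1, shape (n_nets, 2)

def theoretical_kernel(x_pair, sigma_w, sigma_b, rng, n_mc=200_000):
    """sigma_b^2 + sigma_w^2 * C(x, x'), with C estimated by Monte Carlo."""
    d_in = x_pair.shape[1]
    # Covariance of the first-layer pre-activations at x and x'.
    K0 = sigma_b**2 + sigma_w**2 * (x_pair @ x_pair.T) / d_in
    pre = rng.multivariate_normal(np.zeros(2), K0, size=n_mc)
    act = np.tanh(pre)
    C = act.T @ act / n_mc
    return sigma_b**2 + sigma_w**2 * C

rng = np.random.default_rng(0)
x_pair = np.array([[1.0, 0.5, -0.5], [0.2, -1.0, 0.8]])  # two inputs x, x'
sigma_w, sigma_b = 1.0, 0.5

z = sample_outputs(x_pair, n_hidden=200, n_nets=5000,
                   sigma_w=sigma_w, sigma_b=sigma_b, rng=rng)
K_emp = z.T @ z / len(z)   # empirical E[z^1(x) z^1(x')], using the zero mean
K_theory = theoretical_kernel(x_pair, sigma_w, sigma_b, rng)
print(K_emp)
print(K_theory)
```

The two matrices should agree up to Monte Carlo error. Note that the covariance identity holds for any hidden width; the large-width limit is what makes the joint distribution Gaussian.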

#### Deep Neural Network

Consider a neural network with $L$ layers, defined by
$$\begin{array}{ll} z_{i}^{l}(x)=b_{i}^{l}+\sum_{j=1}^{N_{l}} W_{i j}^{l} x_{j}^{l}(x) & 1 \leq l \leq L \\ x_{j}^{l}(x)=\phi\left(z_{j}^{l-1}(x)\right) & 2 \leq l \leq L \end{array}$$
with $x^1$ defined as in the single-layer case. When $N_{l} \rightarrow \infty$, the same argument shows that $\{z_{i}^{l}(x^{(1)}),z_{i}^{l}(x^{(2)}),\ldots,z_{i}^{l}(x^{(k)})\}$ is jointly Gaussian, so $z_{i}^{l} \sim \mathcal{G P}\left(0, K^{l}\right)$, where
\begin{align} K^l(x,x')&\equiv\mathbb E\left[z_{i}^{l}(x)z_{i}^{l}(x')\right]\\ &=\mathbb E\left[\left(b_{i}^{l}+\sum_{j=1}^{N_{l}} W_{i j}^{l} x_{j}^{l}(x)\right)\left(b_{i}^{l}+\sum_{j=1}^{N_{l}} W_{i j}^{l} x_{j}^{l}(x')\right)\right]\\ &=\mathbb E\left[\left(b_{i}^{l}\right)^2+b_{i}^{l}\sum_{j=1}^{N_{l}} W_{i j}^{l} x_{j}^{l}(x)+b_{i}^{l}\sum_{j=1}^{N_{l}} W_{i j}^{l} x_{j}^{l}(x')+\left(\sum_{j=1}^{N_{l}} W_{i j}^{l} x_{j}^{l}(x)\right)\left(\sum_{j=1}^{N_{l}} W_{i j}^{l} x_{j}^{l}(x')\right)\right]\\ &=\mathbb E\left[\left(b_{i}^{l}\right)^2\right]+\mathbb E\left[\left(\sum_{j=1}^{N_{l}} W_{i j}^{l} x_{j}^{l}(x)\right)\left(\sum_{j=1}^{N_{l}} W_{i j}^{l} x_{j}^{l}(x')\right)\right]\\ &=\sigma_b^2+\mathbb E\left[\sum_{j=1}^{N_{l}}\sum_{j'=1}^{N_{l}} W_{i j}^{l} x_{j}^{l}(x) W_{i j'}^{l} x_{j'}^{l}(x')\right]\\ &=\sigma_b^2+\mathbb E\left[\sum_{j=1}^{N_{l}} \left(W_{i j}^{l}\right)^2 x_{j}^{l}(x)x_{j}^{l}(x')\right]\quad(W_{i j}^{l}\text{ and }W_{i j'}^{l}\text{ are independent and zero-mean when }j\neq j')\\ &=\sigma_b^2+\mathbb E\left[\sum_{j=1}^{N_{l}} \frac{\sigma_{w}^{2}}{N_l} x_{j}^{l}(x)x_{j}^{l}(x')\right]\\ &=\sigma_b^2+\sigma_{w}^{2}\mathbb E\left[x_{j}^{l}(x)x_{j}^{l}(x')\right]\\ &=\sigma_b^2+\sigma_{w}^{2}\mathbb E_{z_j^{l-1}\sim\mathcal{GP}(0,K^{l-1})}\left[\phi(z_{j}^{l-1}(x))\phi(z_{j}^{l-1}(x'))\right] \end{align}
The expectation is taken over the GP governing $z_j^{l-1}$, but since the integrand involves only $z_j^{l-1}(x)$ and $z_j^{l-1}(x')$, it is equivalent to integrating against the joint distribution of these two values alone,
$$K^{l}\left(x, x^{\prime}\right)=\sigma_{b}^{2}+\sigma_{w}^{2}\int d z\, d z^{\prime}\, \phi(z) \phi\left(z^{\prime}\right) \mathcal{N}\left(\left[\begin{array}{l} z \\ z^{\prime} \end{array}\right] ; 0, \left[\begin{array}{ll} K^{l-1}(x, x) & K^{l-1}\left(x, x^{\prime}\right) \\ K^{l-1}\left(x^{\prime}, x\right) & K^{l-1}\left(x^{\prime}, x^{\prime}\right) \end{array}\right]\right)$$
Hence $K^l(x, x')$ depends only on $K^{l-1}(x, x')$, $K^{l-1}(x, x)$, and $K^{l-1}(x', x')$, so there exists a function $F_\phi$, determined by the activation $\phi$, such that
$$K^{l}\left(x, x^{\prime}\right)=\sigma_{b}^{2}+\sigma_{w}^{2} F_{\phi}\left(K^{l-1}\left(x, x^{\prime}\right), K^{l-1}(x, x), K^{l-1}\left(x^{\prime}, x^{\prime}\right)\right)$$
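For the ReLU activation, $F_\phi$ has a well-known closed form (the degree-1 arc-cosine kernel): with $\theta=\arccos\left(K(x,x')/\sqrt{K(x,x)K(x',x')}\right)$, $F_\phi=\frac{1}{2\pi}\sqrt{K(x,x)K(x',x')}\left(\sin\theta+(\pi-\theta)\cos\theta\right)$. The sketch below iterates the recursion on a whole Gram matrix at once. The base case $K^0(x,x')=\sigma_b^2+\sigma_w^2\,x\cdot x'/d_{in}$, the function names, and the default values of `sigma_w` and `sigma_b` are assumptions made for illustration; the code also assumes every diagonal entry of the kernel is strictly positive.

```python
import numpy as np

def relu_expectation(K):
    """E[relu(z) relu(z')] for every pair of inputs, given the Gram matrix K of z."""
    std = np.sqrt(np.diag(K))
    norm = np.outer(std, std)                  # sqrt(K(x, x) K(x', x'))
    cos_theta = np.clip(K / norm, -1.0, 1.0)   # clip guards against round-off
    theta = np.arccos(cos_theta)
    return norm * (np.sin(theta) + (np.pi - theta) * cos_theta) / (2.0 * np.pi)

def nngp_kernel(X, depth, sigma_w=1.4, sigma_b=0.1):
    """Apply K^l = sigma_b^2 + sigma_w^2 * F_relu(K^{l-1}) for `depth` layers."""
    d_in = X.shape[1]
    K = sigma_b**2 + sigma_w**2 * (X @ X.T) / d_in   # assumed base case K^0
    for _ in range(depth):
        K = sigma_b**2 + sigma_w**2 * relu_expectation(K)
    return K

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # three toy inputs
K3 = nngp_kernel(X, depth=3)
print(K3)
```

The resulting matrix can be passed directly as $K(X, X)$ in the GP regression formulas above, which is how the infinitely wide network is used for prediction.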

### References

[1] Rasmussen, Carl Edward, and Christopher K. I. Williams. Gaussian Processes for Machine Learning. Vol. 1. Cambridge: MIT Press, 2006.

[2] Lee, Jaehoon, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep Neural Networks as Gaussian Processes. ICLR, 2018.

[3] STATS 403, Lecture Slides. Duke Kunshan University