Eric Jang: 2016

Tuesday, November 8, 2016

Tutorial: Categorical Variational Autoencoders using Gumbel-Softmax

In this post, I discuss our recent paper, Categorical Reparameterization with Gumbel-Softmax, which introduces a simple technique for training neural networks with discrete latent variables. I'm really excited to share this because (1) I believe it will be quite useful for a variety of Machine Learning research problems, (2) this is my first published paper ever (on Arxiv, and submitted to a NIPS workshop and ICLR as well).
The TLDR; if you want categorical features in your neural nets, just let sample = softmax((logits+gumbel noise)/temperature), and then backprop as usual using your favorite automatic differentiation software (e.g. TensorFlow, Torch, Theano).
You can find the code for this article here

Introduction

One of the main themes in Deep Learning is to “let the neural net figure out all the intermediate features”. For example: training convolutional neural networks results in the self-organization of a feature detector hierarchy, while Neural Turing Machines automatically “discover” copying and sorting algorithms.
The workhorse of Deep Learning is the backpropagation algorithm, which uses dynamic programming to compute parameter gradients of the network. These gradients are then used to minimize the optimization objective via gradient descent. In order for this to work, all of the layers in our neural network — i.e. our learned intermediate features — must be continuous-valued functions.
What happens if we want to learn intermediate representations that are discrete? Many "codes" we want to learn are fundamentally discrete - musical notes on a keyboard, object classes (“kitten”, “balloon”, “truck”), and quantized addresses (“index 423 in memory”).

We can use stochastic neural networks, where each layer compute the parameters of some (discrete) distribution, and its forward pass consists of taking a sample from that parametric distribution. However, the difficulty is that we can’t backpropagate through samples. As shown below, there is a stochastic node (blue circle) in between $f(z)$ and $\theta$.

Left: in continuous neural nets, you can use backprop to compute parameter gradients. Right: backpropagation is not possible through stochastic nodes.

Gumbel-Softmax Distribution

The problem of backpropagating through stochastic nodes can be circumvented if we can re-express the sample $z \sim p_\theta(z)$, such that gradients can flow from $f(z)$ to $\theta$ without encountering stochastic nodes. For example, samples from the normal distribution $z \sim \mathcal{N}(\mu,\sigma)$ can be re-written as $z = \mu + \sigma \cdot \epsilon$, where $\epsilon \sim \mathcal{N}(0,1)$. This is also known as the “reparameterization trick”, and is commonly used to train variational autoencoders with Gaussian latent variables.

The Gumbel-Softmax distribution is reparameterizable, allowing us to avoid the stochastic node during backpropagation.
The main contribution of this work is a “reparameterization trick” for the categorical distribution. Well, not quite – it’s actually a re-parameterization trick for a distribution that we can smoothly deform into the categorical distribution. We use the Gumbel-Max trick, which provides an efficient way to draw samples $z$ from the Categorical distribution with class probabilities $\pi_i$:
$$ \DeclareMathOperator*{\argmax}{arg\,max} z = \verb|one_hot|\left(\argmax_{i}{\left[ g_i + \log \pi_i \right]}\right) $$
argmax is not differentiable, so we simply use the softmax function as a continuous approximation of argmax:
$$ y_i = \frac{\text{exp}((\log(\pi_i)+g_i)/\tau)}{\sum_{j=1}^k \text{exp}((\log(\pi_j)+g_j)/\tau)} \qquad \text{for } i=1, ..., k. $$
Hence, we call this the Gumbel-SoftMax distribution*. $\tau$ is a temperature parameter that allows us to control how closely samples from the Gumbel-Softmax distribution approximate those from the categorical distribution. As $\tau \to 0$, the softmax becomes an argmax and the Gumbel-Softmax distribution becomes the categorical distribution. During training, we let $\tau > 0$ to allow gradients past the sample, then gradually anneal the temperature $\tau$ (but not completely to 0, as the gradients would blow up).
Below is an interactive widget that draws samples from the Gumbel-Softmax distribution. Keep in mind that samples are vectors, and a one-hot vector (i.e. one of the elements is 1.0 and the others are 0.0) corresponds to a discrete category. Click "re-sample" to generate a new sample, and try dragging the slider and see what samples look like when the temperature $\tau$ is small!

1.0

TensorFlow Implementation

Using this technique is extremely simple, and only requires 12 lines of Python code:

Despite its simplicity, Gumbel-Softmax works surprisingly well - we benchmarked it against other stochastic gradient estimators for a couple tasks and Gumbel-Softmax outperformed them for both Bernoulli (K=2) and Categorical (K=10) latent variables. We can also use it to train semi-supervised classification models much faster than previous approaches. See our paper for more details.

Categorical VAE with Gumbel-Softmax

To demonstrate this technique in practice, here's a categorical variational autoencoder for MNIST, implemented in less than 100 lines of Python + TensorFlow code.
In standard Variational Autoencoders, we learn an encoding function that maps the data manifold to an isotropic Gaussian, and a decoding function that transforms it back to the sample. The data manifold is projected into a Gaussian ball; this can be hard to interpret if you are trying to learn the categorical structure within your data.
First, we declare the encoding network:

Next, we sample from the Gumbel-Softmax posterior and decode it back into our MNIST image.

Variational autoencoders minimizes reconstruction error of the data by maximizing an expectedlower bound (ELBO) on the likelihood of the data, under a generative model $p_\theta(x)$. For a derivation, see this tutorial on variational methods.
$$\log p_\theta(x) \geq \mathbb{E}_{q_\phi(y|x)}[\log p_\theta(x|y)] - KL[q_\phi(y|x)||p_\theta(y)]$$
Finally, we run train our VAE:

...and, that's it! Now we can sample randomly from our latent categorical code and decode it back into MNIST images:

Code can be found here. Thank you for reading, and let me know if you find this technique useful!

Acknowledgements

I'm sincerely grateful to my co-authors, Shane Gu and Ben Poole for collaborating with me on this work. Shane introduced me to the Gumbel-Max trick back in August, and supplied the framework for comparing Gumbel-Softmax with existing stochastic gradient estimators. Ben suggested and implemented the semi-supervised learning aspect of the paper, did the math derivations in the Appendix, and helped me a lot with editing the paper. Finally, thanks to Vincent Vanhoucke and the Google Brain team for encouraging me to pursue this idea.
*Chris J. Maddison, Andriy Mnih, and Yee Whye Teh at Deepmind have discovered this technique independently and published their own paper on it - they call it the “Concrete Distribution”. We only found out about each other’s work right as we were submitting our papers to conferences (oops!). If you use this technique in your work, please cite both of our papers! They deserve just as much credit.

23 Apr 2017: Update - Chris Maddison and I integrated these distributions into TensorFlow's Distributions sub-framework. Here's a code example of how to implement a categorical VAE using the distributions API.

Tuesday, September 6, 2016

Riemann Summation and Physics Simulation are Statistically Biased

Riemann summation is the simplest method for numerically approximating integrals. in which you chop up the domain into equally sized segments and add a bunch of rectangles together.

$$\int f(x) dx \sim \sum_i^k f(x) \Delta x$$

This technique also comes up in Euler integration for simulating physics:

$$
\begin{align*}
\Delta x & = \int \dot{x}(t) dt \\
& \sim \dot{x}(t) \Delta t
\end{align*}
$$

These are all special cases of "Deterministic Quadratures", where the integration step size $\Delta x$ is non-random ("deterministic") and we are summing up a bunch of quadrilateral pieces ("Quadratures") to approximate areas.

This post interprets the numerical integration as a special case of stratified sampling, and shows that deterministic quadrature rules are statistically biased estimators.

Suppose we want to compute $E_X[f(X)] = \int_x p(x)\cdot f(x) dx$ via a Monte Carlo estimator.

A plain Monte Carlo Estimator is given by $\mu = \frac{1}{n}\sum_i^n f(X_i)$, and has variance $\frac{1}{n}\sigma^2$ where $\sigma^2=Var[X_i]$.

Stratified sampling (see previous blog post) introduces a stratification variable $Y$ with a discrete distribution of $k$ strata with probability densities $p(y_1),p(y_2),...,p(y_k)$, respectively. A common choice is to assign each $p(y_1) = p(y_2) = ... = p(y_k) = 1/k$, and arrange the strata in a grid. In this case, $y_i$ correspond to the corners of the strata and $P(X|Y=y_i)$ corresponds to a uniform distribution over that square ($X_i = Y_i + 0.1*\text{rand}()$).

The variance of this estimator is $\sum_i^k p(y_i)^2 \cdot V_i$, where $V_i$ is the variance of $\mathbb{E}[X|Y=y_i]$ for strata $i$.

Suppose we let $\mathbb{E}[X|Y=y_i]$ as $y_i$ - that is, we just sample at the corners where $y_i$ sampels are.

The good news is that $V_i = 0$, and we've reduced the variance of the total estimator to zero. The bad news is that the estimator is now biased because our per-strata estimators $\mathbb{E}[X|Y=y_i]$ are biased (our estimator never uses information from the interiors/edges of the squares). The estimator is also inconsistent for finite strata.

Does this figure remind you of anything? This estimator behaves identically to a Riemann Summation in 2D!

What does this mean?

If you ever take a Riemann sum with fixed step size, your integral is biased!
If you are monitoring mean ocean temperatures and place your sensors in an regular grid across the water (like above), your estimator is biased!
If you are recording video at a too-low frequency and pick up aliasing artifacts, your estimates are biased!
If you use a fixed timestep in a physics simulation, such as Euler or Runge-Kutta methods, your space integrals are biased!

The takeaway here is that *all* deterministic quadrature rules are statistically biased. that's okay though - the expected value of biased estimators *do* converge to the true expectation as you crank the number of strata to infinity (e.g. $\Delta x \to 0$).

A "less-biased" physics simulator ought to incorporate random time steps (for instance, sampled between 0 and MAX_TIME_STEP). To reduce variance, one might generate multiple samples, compute a nonlinear update, then average out the integration.

Fairly obvious in hindsight, but I found it interesting to think about the relationship between stratified sampling and deterministic quadrature :)

Monday, September 5, 2016

Monte Carlo Variance Reduction Techniques in Julia

Monte Carlo methods are a simple yet powerful family of techniques for performing numerical integration in high-dimensional spaces.

This post is a tutorial on a few variance reduction techniques that can be used to reduce the noisiness of your estimators: antithetic sampling, control variates, and stratified sampling.

Code on Github

The Problem

Our goal is to find the expected value of the random variable $X$:

$$X = f(u) = \exp(\sin(8 u_1^2 + 3 u_1)+ \sin(11 u_2))$$

where $u$ is a 2D point sampled uniformly from the unit square, as shown below:

f(x,y) = exp(sin(8x.^2+3x)+sin(11y))
plot(linspace(0,1,100),linspace(0,1,100),f,st=:contourf,title="f(X)",size=(400,400))

The expectation can be re-written as an integral over the unit square.

$$
\mathbb{E}[X] = \int_0^1 \int_0^1 \exp(\sin(8 u_1^2 + 3 u_1)+ \sin(11 u_2)) du_1 du_2
$$

Monte Carlo is useful in scenarios where a closed-form integral does not exist, or we do not have access to the analytical form of the function (for example, it could be a black box program that the NSA hands to us). In this case we can integrate it analytically, but let's pretend we can't :)

Basic Monte Carlo Estimator

The basic Monte Carlo estimator $\hat{\mu}$ gives a straightforward answer: "evaluate $x_i=f(u_i)$ at $n$ random values of $U$, and then take the average among all your samples." (to make comparison easier with the estimators I will introduce later on, I use $2n$ samples instead of $n$).

$$
\hat{\mu} = \frac{1}{2n}\sum_i^{2n} x_i \approx \mathbb{E}[X]
$$

Here is the estimator in code :


function plainMC(N)
    u1 = rand(2*N)
    u2 = rand(2*N)
    return u1,u2,f(u1,u2),[]
end

The plot below shows 100 samples and their values, colored according to the value of $f(X)$ at that location.


u1,u2,x,ii=plainMC(N)
scatter(u1,u2,markercolor=x,
        xlabel="u1",ylabel="u2",
        title="Plain MC Samples",
        size=(400,400),label="")

The variance of this (2n-sample) estimator is given by:

$$
\newcommand{\Var}{\operatorname{Var}}
\begin{align*}
\Var[\hat{\mu}] & = \Var[ \frac{1}{2n}\sum_i^{2n} X_i ] \\
& = (\frac{1}{2n})^2\sum_i^{2n} \Var [X_i] \\
& = \frac{\sigma^2}{2n}
\end{align*}
$$

Where $\sigma^2$ is the variance of a single sample $X_i$ and $n$ is the total number of samples.

The variance of our estimator is $\mathbb{E}[(\hat{\mu}-\mu)^2]$, or in other words, the "expected squared error" with respect to the true expectation $\mu$.

We could decrease variance by simply cranking up our sample count $n$, but this is computationally expensive. While variance is proportional to $\frac{1}{n}$, the standard deviation (which has the same units as $\mu$ and is proportional to the confidence interval of our estimator) decreases at $\frac{1}{\sqrt{n}}$, a far slower rate. This is highly unsatisfactory.

Instead, we will use variance reduction techniques to construct estimators with the same expected value as our MC estimator, and still achieve a lower variance with the same number of samples.

Antithetic Sampling

Antithetic sampling is a slight extension to the basic Monte Carlo estimator. The idea is to pair each sample $X_i$ with an "antithetic sample" $Y_i$ such that $X_i$ and $Y_i$ are identically distributed, but negatively correlated. Our single-sample estimate is given by:

$$
\hat{\mu} = \frac{X_i + Y_i}{2}
$$

$\mathbb{E}[X_i] = \mathbb{E}[Y_i]$ so this estimator has the same expected value as the plain MC estimator.

$$
\begin{align*}
\Var[\hat{\mu}] & = \frac{1}{4}\Var[X_i + Y_i] \\
& = \frac{1}{4}(\Var[X_i] + \Var[Y_i] + 2 \text{Cov}(X_i,Y_i)) && \Var[X_i] = \Var[Y_i] = \sigma^2 \\
& = \frac{1}{2}(\sigma^2 + \sigma^2\beta) && \beta = \text{Corr}(X_i,Y_i) = \frac{\text{Cov}(X_i,Y_i)}{\sigma^2}
\end{align*}
$$

If we sample $N$ of these pairs (for a total of $2N$ samples), the variance is $\frac{1}{2n}(\sigma^2 + \sigma^2\beta)$, and it's easy to see that this estimator has a strictly lower variance than the plain Monte Carlo as long as $\beta$ is negative.

Unfortunately, there isn't a general method for generating good antithetic samples. If the random sample of interest $X_i = f(U_i)$ is a transformation of a uniform random variable $U_i \sim \text{Uniform}(0,1)$, we can try "flipping" the input and letting $Y_i = f(1-U_i)$. $1-U_i$ is distributed identically to $U_i$, and assuming $f(U_i)$ is monotonic, then $Y_i$ will be large when $X_i$ is small, and vice versa.

However, this monotonicity assumption does not hold in general for 2D functions. In fact, for our choice of $f(U)$, it is only slightly negatively correlated ($\beta \approx -0.07$).

function antithetic(N)
    u1_x = rand(N); u2_x = rand(N);
    u1_y = 1-u1_x; u2_y = 1-u2_x;
    x = f(u1_x,u2_x);
    y = f(u1_y,u2_y);
    return [u1_x; u1_y], [u2_x; u2_y], [x;y], (u1_x,u2_x,x,u1_y,u2_y,y)
end

Notice how each $X_i$ sample ("+") is paired with an antithetic sample $Y_i$ ("x") mirrored across the origin.


u1,u2,x,ii=antithetic(N)
u1_x,u2_x,x,u1_y,u2_y,y = ii
scatter(u1_x,u2_x,
        m=(10,:cross),markercolor=x,
        size=(400,400),label="X samples",
        xlabel="u1",ylabel="u2",title="Antithetic Sampling")
scatter!(u1_y,u2_y,
        m=(10,:xcross),markercolor=y,
        label="antithetic samples")

Control Variates

Control Variates is similar to Antithetic Sampling, in that we will be pairing every sample $X_i$ with a sample of $Y_i$ (the "control variate"), and exploiting correlations between $X_i$ and $Y_i$ in order to decrease variance.

However, $Y_i$ is no longer sampled from the same distribution as $X_i$. Typically $Y_i$ is some random variable that is computationally easy to generate (or is generated as a side effect) from the generative process for $X_i$. For instance, the weather report that forecasts temperature $X$ also happens to predict rainfall $Y$. We can use rainfall measurements $Y_i$ to improve our estimates of "average temperature" $\mathbb{E}[X]$.

Suppose control variate $Y$ has a known expected value $\bar{\mu}$. The single-sample estimator is given by

$$
\hat{\mu} = X_i - c (Y_i - \mathbb{E}[Y_i])
$$

Where $c$ is some constant. In expectation, $Y_i$ samples are cancelled out by $\bar{\mu}$ so this estimator is unbiased.

It's variance is

$$
\begin{align*}
\Var[\hat{\mu}] & = \Var[X_i - c (Y_i - \mathbb{E}[Y_i])] \\
& = \Var[X_i - c Y_i] && \mathbb{E}[Y_i] \text{ is not random, and has zero variance} \\
& = \Var[X_i] + c^2 \Var[Y_i] + 2cCov(X_i,Y_i)
\end{align*}
$$

Simple calculus tells us that the value of $c$ that minimizes $Var[\hat{\mu}]$ is

$$
c^* = -\frac{Cov(X_i,Y_i)}{\Var(Y_i)}
$$

so the best variance we can achieve is

$$
\Var[\hat{\mu}] = \Var[X_i] - \frac{Cov(X_i,Y_i)^2}{\Var(Y_i)}
$$

The variance of our estimator is reduced as long as $Cov(X_i,Y_i)^2 \neq 0$. Unlike antithetic sampling, control variates can be either positively or negatively correlated with $X_i$.

The catch here is that $c^*$ will need to be estimated. There are a couple ways:

Forget optimality, just let $c^* = 1$! This naive approach still works better than basic Monte Carlo because it exploits covariance structure.
Estimate $c^*$ from samples $X_i$, $Y_i$. Doing so introduces a dependence between $c^*$ and $Y_i$, which makes the estimator biased, but in practice this bias is small compared to the standard error of the estimate.
An unbiased alternative to (2) is to compute $c^*$ from a small set of $m < n$ "pilot samples", and then throw the pilot samples away before collecting the real $n$ samples.

In the case of our $f(u)$, we know that the modes of the distribution are scattered in the corners of the unit square. In polar coordinates centered at $(0.5,0.5)$, that's about $k \frac{\pi}{2} + \frac{\pi}{4}, k=0,1,2,3$. Let's choose $Y = -cos(4\theta)$, where $\theta$ is uniform about the unit circle.


cv(u1,u2) = -cos(4*atan((u2-.5)./(u1-.5)))
pcv = plot(linspace(0,1,100),linspace(0,1,100),cv,st=:contourf,title="Y",xlabel="u1",ylabel="u2")
plot(pcv,p0,layout=2)

$$
\begin{align*}
\mathbb{E}_U[Y] & = \int_0^1 \int_0^1 -\cos(4\theta) du_1 du_2 \\
& = \int_0^1 \int_0^1 - \frac{1-\tan^2(2\theta)}{1+\tan^2(2\theta)} du_1 du_2 \\
& = \int_0^1 \int_0^1 - \frac{1-(\frac{2\tan \theta}{1 - \tan^2(\theta)})^2}{1+(\frac{2\tan \theta}{1 - \tan^2(\theta)})^2} du_1 du_2 && \tan(\theta) = \frac{y}{x} \\
& = \pi-3
\end{align*}
$$

In practice though, we don't have an analytic way to compute $E[Y]$, but we can estimate it in an unbiased manner from our pilot samples.


function controlvariates(N)
    # generate pilot samples
    npilot = 30
    u1 = rand(npilot); u2 = rand(npilot)
    x = f(u1,u2)
    y = cv(u1,u2)
    β=corr(x,y)
    c = -β/var(y)
    μ_y = mean(y) # estimate mean from pilot samples
    # real samples
    u1 = rand(2*N); u2 = rand(2*N)
    x = f(u1,u2)
    y = cv(u1,u2)
    return u1,u2,x+c*(y-μ_y),(μ_y, β, c)
end

We can verify from our pilot samples that the covariance is quite large:


βs=[]
for i=1:1000
    u1,u2,x,ii=controlvariates(100)
    βs = [βs; ii[2]]
end
mean(βs) # 0.367405

Stratified Sampling

The idea behind stratified sampling is very simple: instead of estimating $E[X]$ over the domain $U$, we break up the domain into $K$ strata, estimate the conditional expectation $X$ over each strata separately, then average our per-strata estimates.

In the above picture, the sample space is divided into 50 strata going across the horizontal axis and there are 2 points placed in each strata. There are lots of ways to stratify - you can stratify in a 1D grid, 2D grid, concentric circles with varying areas, whatever you like.

We're not restricted to chopping up $U$ into little fiefdoms as above (although it is a natural choice). Instead, we introduce a "stratification variable" $Y$ with discrete values $y_1,...,y_k$. Our estimators for each strata correspond to the conditional expectations $\mathbb{E}[X|Y=y_i] \text{ for } i=1...K$.

The estimator is given by a weighted combination of strata estimates with the respective probability of each strata:


function stratified(N)
    # N strata, 2 samples per strata = 2N samples
    u1 = Float64[]
    u2 = Float64[]
    for y=range(0,1/N,N)
        for i=1:2
            append!(u1,[y+1/N*rand()])
            append!(u2,[rand()])
        end
    end
    return u1,u2,f(u1,u2),[]
end

$$
\hat{mu} = \sum_i^k p(y_i) \mathbb{E}[X|Y=y_i]
$$

The variance of this estimator is

$$
\sum_i^k p(y_i)^2 (V_i)
$$

Where $V_i$ is the variance of estimator $i$.

For simplicity, let's assume:

We stratify along an evenly-spaced grid, i.e. $p(y_i) = \frac{1}{k}$
We are using a simple MC estimator within each strata, $V_i = \frac{\sigma_i^2}{n_i}$, with $n_i = \frac{2n}{k}$ samples each (so the total number of samples is $2n$, the same as the plain MC estimator). $\sigma_i^2$ is the variance of an individual sample within each strata.

Then the variance is

$$
\sum_i^k (\frac{1}{k} p(y_i)) (\frac{k}{2n}\sigma_i^2) = \frac{1}{2n} \sum_i^k p(y_i) \sigma_i^2 \leq \frac{1}{2n} \sigma^2
$$

Intuitively, the variance within a single grid cell $\sigma_i^2$ should be smaller than the variance of the plain MC estimator, because samples are closer together.

Here's a proof that $\sum_i^k p(y_i)\sigma_i^2 \leq \sigma^2$:

Let $\mu_i = \mathbb{E}[X|Y=y_i], \sigma_i^2 = \Var[X|Y]$

$$
\begin{align*}
\mathbb{E}[X^2] & = \mathbb{E}_Y[\mathbb{E}_X[X^2|Y]] && \text{Tower property} \\
& = \sum_i^k p(y_i) \mathbb{E}[X^2|Y=y_i] \\
& = \sum_i^k p(y_i) (\sigma_i^2 + \mu_i^2) && \text{via } \Var[X|Y] = \mathbb{E}[X^2|Y] - \mathbb{E}[X|Y=y_i]^2 \\
\mathbb{E}[X] & = \mathbb{E}_Y[\mathbb{E}_X[X|Y]] \\
& =\sum_i^k p(y_i) \mu_i \\
\sigma^2 & = \mathbb{E}[X^2] - \mathbb{E}[X]^2 \\
& =\sum_i^k p(y_i) \sigma_i^2 + \big[\sum_i^k p(y_i) \mu_i^2 - (\sum_i^k p(y_i) \mu_i)^2\big]
\end{align*}
$$

Note that the term

$$
\sum_i^k p(y_i) \mu_i^2 - (\sum_i^k p(y_i) \mu_i)^2
$$

corresponds to the variance of a categorical distribution that takes value $\mu_i$ with probability $p(y_i)$. This term is strictly non-negative, and therefore the inequality $\sum_i^k \frac{k}{2n}\sigma_i^2 \leq \sigma^2$ holds.

This yields an interesting observation: when we do stratified sampling, we are eliminating the variance between strata means. This implies that we get the most gains by choosing strata such that samples vary greatly between strata.

Here's a comparison of the empirical vs. theoretical variance with respect to number of strata:


# theoretical variance
strat_var = zeros(50)
for N=1:50
    sigma_i_2 = zeros(N)
    y = range(0,1/N,N)
    for i=1:N
        x1 = y[i].+1/N*rand(100)
        x2 = rand(100)
        sigma_i_2[i]=var(f(x1,x2))
    end
    n_i = 2
    strat_var[N] = sum((1/(N))^2*(1/n_i*sigma_i_2))    
end
s = estimatorVar(stratified)
plot(1:50,s,label="Empirical Variance",title="Stratified Sampling")
plot!(1:50,strat_var,label="Theoretical Variance")

Remarks:

Stratified Sampling can actually be combined with other variance reduction techniques - for instance, when computing the conditional estimate within a strata, you are free to use antithetic sampling, control variates within the strata. You can even do stratified sampling within your strata! This makes sense if your sampling domain is already partitioned into a bounding volume hierarchy, such as a KD-Tree.
Antithetic sampling can be regarded as a special case of stratified sampling with 2 strata. The inter-strata variance you are eliminating here is the self-symmetry in the distribution of $X$ (the covariance between one half and another half of the samples).

Discussion

Here's a plot comparing the standard deviations of these estimators as a function of sample size. Keep in mind that your choice of variance reduction technique should depend on the problem, and no method is objectively better than the others. This is a very simple toy problem so all estimators perform pretty well.

I'm currently working on a follow-up tutorial on Importance Sampling. The explanation is a bit long, so I'm keeping it separate from this tutorial.

A toy example is all well and good, but where does variance reduction come up in modern scientific computing?

Scientific Data Analysis

At a very fundamental level, variance in scientific data either suggests an underlying noisy process or errors. Variance degrades the statistical significance of findings, so it's desirable to conduct experiments and measurements such that the variance of findings is low.

Quantitative Finance

In finance, we often simulate expected returns of portfolios by sampling over all possible movements in the market. Certain rare events (for example, a terrorist attack) are highly unlikely but have such a big impact that on the economy that we want to account for it in our expectation. Modeling these "black swan" events are of considerable interest to financial & actuarial industries.

Deep Learning

In minibatch stochastic gradient descent (the "hammer" of Deep Learning), we compute the parameter update using the expected gradient of the total loss across the training dataset. It takes too long to consider every point in the dataset, so we estimate gradients via stochastic minibatch. This is nothing more than a Monte Carlo estimate.

$$E_x[\nabla_\theta J(x;\theta)] = \int p(x) \nabla_\theta J(x;\theta) dx \approx \frac{1}{m}\sum_i^m \nabla_\theta J(x_i;\theta)$$

The less variance we have, the sooner our training converges.

Computer Graphics

In physically based rendering, we are trying to compute the total incoming light arriving at each camera pixel. The vast majority of paths contribute zero radiance, thus making our samples noisy. Decreasing variance allows us to "de-speckle" the pictures.

$$
\begin{align*}
\mathcal{L}(x,\omega_o) & = \mathcal{L}_e(x,\omega_o) + \int_\Omega{f_r(x,\omega_i,\omega_o)\mathcal{L}(x^\prime,\omega_i)(\omega_o \cdot n_x) d\omega_i} \\
& \approx \mathcal{L}_e(x,\omega_o) + \frac{1}{m}\sum_i^m{f_r(x,\omega_i,\omega_o)\mathcal{L}(x^\prime,\omega_i)(\omega_o \cdot n_x)}
\end{align*}
$$

Sunday, August 7, 2016

A Beginner's Guide to Variational Methods: Mean-Field Approximation

Variational Bayeisan (VB) Methods are a family of techniques that are very popular in statistical Machine Learning. VB methods allow us to re-write statistical inference problems (i.e. infer the value of a random variable given the value of another random variable) as optimization problems (i.e. find the parameter values that minimize some objective function).

This inference-optimization duality is powerful because it allows us to use the latest-and-greatest optimization algorithms to solve statistical Machine Learning problems (and vice versa, minimize functions using statistical techniques).

This post is an introductory tutorial on Variational Methods. I will derive the optimization objective for the simplest of VB methods, known as the Mean-Field Approximation. This objective, also known as the Variational Lower Bound, is exactly the same one used in Variational Autoencoders (a neat paper which I will explain in a follow-up post).

Preliminaries and Notation
Problem formulation
Variational Lower Bound for Mean-field Approximation
Forward KL vs. Reverse KL
Connections to Deep Learning

Preliminaries and Notation

This article assumes that the reader is familiar with concepts like random variables, probability distributions, and expectations. Here's a refresher if you forgot some stuff. Machine Learning & Statistics notation isn't standardized very well, so it's helpful to be really precise with notation in this post:

Uppercase $X$ denotes a random variable
Uppercase $P(X)$ denotes the probability distribution over that variable
Lowercase $x \sim P(X)$ denotes a value $x$ sampled ($\sim$) from the probability distribution $P(X)$ via some generative process.
Lowercase $p(X)$ is the density function of the distribution of $X$. It is a scalar function over the measure space of $X$.
$p(X=x)$ (shorthand $p(x)$) denotes the density function evaluated at a particular value $x$.

Many academic papers use the terms "variables", "distributions", "densities", and even "models" interchangeably. This is not necessarily wrong per se, since $X$, $P(X)$, and $p(X)$ all imply each other via a one-to-one correspondence. However, it's confusing to mix these words together because their types are different (it doesn't make sense to sample a function, nor does it make sense to integrate a distribution).

We model systems as a collection of random variables, where some variables ($X$) are "observable", while other variables ($Z$) are "hidden". We can draw this relationship via the following graph:

The edge drawn from $Z$ to $X$ relates the two variables together via the conditional distribution $P(X|Z)$.

Here's a more concrete example: $X$ might represent the "raw pixel values of an image", while $Z$ is a binary variable such that $Z=1$ "if $X$ is an image of a cat".

$X = $

$P(Z=1) = 1$ (definitely a cat)

$X= $

$P(Z=1) = 0$ (definitely not a cat)

$X = $

$P(Z=1) = 0.1$ (sort of cat-like)

Bayes' Theorem gives us a general relationship between any pair of random variables:

$$p(Z|X) = \frac{p(X|Z)p(Z)}{p(X)}$$

The various pieces of this are associated with common names:

$p(Z|X)$ is the posterior probability: "given the image, what is the probability that this is of a cat?" If we can sample from $z \sim P(Z|X)$, we can use this to make a cat classifier that tells us whether a given image is a cat or not.

$p(X|Z)$ is the likelihood: "given a value of $Z$ this computes how "probable" this image $X$ is under that category ({"is-a-cat" / "is-not-a-cat"}). If we can sample from $x \sim P(X|Z)$, then we generate images of cats and images of non-cats just as easily as we can generate random numbers. If you'd like to learn more about this, see my other articles on generative models: [1], [2].

$p(Z)$ is the prior probability. This captures any prior information we know about $Z$ - for example, if we think that 1/3 of all images in existence are of cats, then $p(Z=1) = \frac{1}{3}$ and $p(Z=0) = \frac{2}{3}$.

Hidden Variables as Priors

This is an aside for interested readers. Skip to the next section to continue with the tutorial.

The previous cat example presents a very conventional example of observed variables, hidden variables, and priors. However, it's important to realize that the distinction between hidden / observed variables is somewhat arbitrary, and you're free to factor the graphical model however you like.

We can re-write Bayes' Theorem by swapping the terms:

$$\frac{p(Z|X)p(X)}{p(Z)} = p(X|Z)$$

The "posterior" in question is now $P(X|Z)$.

Hidden variables can be interpreted from a Bayesian Statistics framework as prior beliefs attached to the observed variables. For example, if we believe $X$ is a multivariate Gaussian, the hidden variable $Z$ might represent the mean and variance of the Gaussian distribution. The distribution over parameters $P(Z)$ is then a prior distribution to $P(X)$.

You are also free to choose which values $X$ and $Z$ represent. For example, $Z$ could instead be "mean, cube root of variance, and $X+Y$ where $Y \sim \mathcal{N}(0,1)$". This is somewhat unnatural and weird, but the structure is still valid, as long as $P(X|Z)$ is modified accordingly.

You can even "add" variables to your system. The prior itself might be dependent on other random variables via $P(Z|\theta)$, which have prior distributions of their own $P(\theta)$, and those have priors still, and so on. Any hyper-parameter can be thought of as a prior. In Bayesian statistics, it's priors all the way down.

Problem Formulation

The key problem we are interested in is posterior inference, or computing functions on the hidden variable $Z$. Some canonical examples of posterior inference:

Given this surveillance footage $X$, did the suspect show up in it?
Given this twitter feed $X$, is the author depressed?
Given historical stock prices $X_{1:t-1}$, what will $X_t$ be?

We usually assume that we know how to compute functions on likelihood function $P(X|Z)$ and priors $P(Z)$.

The problem is, for complicated tasks like above, we often don't know how to sample from $P(Z|X)$ or compute $p(X|Z)$. Alternatively, we might know the form of $p(Z|X)$, but the corresponding computation is so complicated that we cannot evaluate it in a reasonable amount of time. We could try to use sampling-based approaches like MCMC, but these are slow to converge.

Variational Lower Bound for Mean-field Approximation

The idea behind variational inference is this: let's just perform inference on an easy, parametric distribution $Q_\phi(Z|X)$ (like a Gaussian) for which we know how to do posterior inference, but adjust the parameters $\phi$ so that $Q_\phi$ is as close to $P$ as possible.

This is visually illustrated below: the blue curve is the true posterior distribution, and the green distribution is the variational approximation (Gaussian) that we fit to the blue density via optimization.

What does it mean for distributions to be "close"? Mean-field variational Bayes (the most common type) uses the Reverse KL Divergence to as the distance metric between two distributions.

$$KL(Q_\phi(Z|X)||P(Z|X)) = \sum_{z \in Z}{q_\phi(z|x)\log\frac{q_\phi(z|x)}{p(z|x)}}$$

Reverse KL divergence measures the amount of information (in nats, or units of $\frac{1}{\log(2)}$ bits) required to "distort" $P(Z)$ into $Q_\phi(Z)$. We wish to minimize this quantity with respect to $\phi$.

By definition of a conditional distribution, $p(z|x) = \frac{p(x,z)}{p(x)}$. Let's substitute this expression into our original $KL$ expression, and then distribute:

$$
\begin{align}
KL(Q||P) & = \sum_{z \in Z}{q_\phi(z|x)\log\frac{q_\phi(z|x)p(x)}{p(z,x)}} && \text{(1)} \\
& = \sum_{z \in Z}{q_\phi(z|x)\big(\log{\frac{q_\phi(z|x)}{p(z,x)}} + \log{p(x)}\big)} \\
& = \Big(\sum_{z}{q_\phi(z|x)\log{\frac{q_\phi(z|x)}{p(z,x)}}}\Big) + \Big(\sum_{z}{\log{p(x)}q_\phi(z|x)}\Big) \\
& = \Big(\sum_{z}{q_\phi(z|x)\log{\frac{q_\phi(z|x)}{p(z,x)}}}\Big) + \Big(\log{p(x)}\sum_{z}{q_\phi(z|x)}\Big) && \text{note: $\sum_{z}{q(z)} = 1 $} \\
& = \log{p(x)} + \Big(\sum_{z}{q_\phi(z|x)\log{\frac{q_\phi(z|x)}{p(z,x)}}}\Big) \\
\end{align}
$$

To minimize $KL(Q||P)$ with respect to variational parameters $\phi$, we just have to minimize $\sum_{z}{q_\phi(z|x)\log{\frac{q_\phi(z|x)}{p(z,x)}}}$, since $\log{p(x)}$ is fixed with respect to $\phi$. Let's re-write this quantity as an expectation over the distribution $Q_\phi(Z|X)$.

$$
\begin{align}
\sum_{z}{q_\phi(z|x)\log{\frac{q_\phi(z|x)}{p(z,x)}}} & = \mathbb{E}_{z \sim Q_\phi(Z|X)}\big[\log{\frac{q_\phi(z|x)}{p(z,x)}}\big]\\
& = \mathbb{E}_Q\big[ \log{q_\phi(z|x)} - \log{p(x,z)} \big] \\
& = \mathbb{E}_Q\big[ \log{q_\phi(z|x)} - (\log{p(x|z)} + \log(p(z))) \big] && \text{(via $\log{p(x,z)=p(x|z)p(z)}$) }\\
& = \mathbb{E}_Q\big[ \log{q_\phi(z|x)} - \log{p(x|z)} - \log(p(z))) \big] \\
\end{align} \\
$$

Minimizing this is equivalent to maximizing the negation of this function:

$$
\begin{align}
\text{maximize } \mathcal{L} & = -\sum_{z}{q_\phi(z|x)\log{\frac{q_\phi(z|x)}{p(z,x)}}} \\
& = \mathbb{E}_Q\big[ -\log{q_\phi(z|x)} + \log{p(x|z)} + \log(p(z))) \big] \\
& = \mathbb{E}_Q\big[ \log{p(x|z)} + \log{\frac{p(z)}{ q_\phi(z|x)}} \big] && \text{(2)} \\
\end{align}
$$

In literature, $\mathcal{L}$ is known as the variational lower bound, and is computationally tractable if we can evaluate $p(x|z), p(z), q(z|x)$. We can further re-arrange terms in a way that yields an intuitive formula:

$$
\begin{align*}
\mathcal{L} & = \mathbb{E}_Q\big[ \log{p(x|z)} + \log{\frac{p(z)}{ q_\phi(z|x)}} \big] \\
& = \mathbb{E}_Q\big[ \log{p(x|z)} \big] + \sum_{Q}{q(z|x)\log{\frac{p(z)}{ q_\phi(z|x)}}} && \text{Definition of expectation} \\
& = \mathbb{E}_Q\big[ \log{p(x|z)} \big] - KL(Q(Z|X)||P(Z)) && \text{Definition of KL divergence} && \text{(3)}
\end{align*}
$$

If sampling $z \sim Q(Z|X)$ is an "encoding" process that converts an observation $x$ to latent code $z$, then sampling $x \sim Q(X|Z)$ is a "decoding" process that reconstructs the observation from $z$.

It follows that $\mathcal{L}$ is the sum of the expected "decoding" likelihood (how good our variational distribution can decode a sample of $Z$ back to a sample of $X$), plus the KL divergence between the variational approximation and the prior on $Z$. If we assume $Q(Z|X)$ is conditionally Gaussian, then prior $Z$ is often chosen to be a diagonal Gaussian distribution with mean 0 and standard deviation 1.

Why is $\mathcal{L}$ called the variational lower bound? Substituting $\mathcal{L}$ back into Eq. (1), we have:

$$
\begin{align*}
KL(Q||P) & = \log p(x) - \mathcal{L} \\
\log p(x) & = \mathcal{L} + KL(Q||P) && \text{(4)}
\end{align*}
$$

The meaning of Eq. (4), in plain language, is that $p(x)$, the log-likelihood of a data point $x$ under the true distribution, is $\mathcal{L}$, plus an error term $KL(Q||P)$ that captures the distance between $Q(Z|X=x)$ and $P(Z|X=x)$ at that particular value of $X$.

Since $KL(Q||P) \geq 0$, $\log p(x)$ must be greater than $\mathcal{L}$. Therefore $\mathcal{L}$ is a lower bound for $\log p(x)$. $\mathcal{L}$ is also referred to as evidence lower bound (ELBO), via the alternate formulation:

$$
\mathcal{L} = \log p(x) - KL(Q(Z|X)||P(Z|X)) = \mathbb{E}_Q\big[ \log{p(x|z)} \big] - KL(Q(Z|X)||P(Z))
$$

Note that $\mathcal{L}$ itself contains a KL divergence term between the approximate posterior and the prior, so there are two KL terms in total in $\log p(x)$.

Forward KL vs. Reverse KL

KL divergence is not a symmetric distance function, i.e. $KL(P||Q) \neq KL(Q||P)$ (except when $Q \equiv P$) The first is known as the "forward KL", while the latter is "reverse KL". So why do we use Reverse KL? This is because the resulting derivation would require us to know how to compute $p(Z|X)$, which is what we'd like to do in the first place.

I really like Kevin Murphy's explanation in the PML textbook, which I shall attempt to re-phrase here:

Let's consider the forward-KL first. As we saw from the above derivations, we can write KL as the expectation of a "penalty" function $\log \frac{p(z)}{q(z)}$ over a weighing function $p(z)$.

$$
\begin{align*}
KL(P||Q) & = \sum_z p(z) \log \frac{p(z)}{q(z)} \\
& = \mathbb{E}_{p(z)}{\big[\log \frac{p(z)}{q(z)}\big]}\\
\end{align*}
$$

The penalty function contributes loss to the total KL wherever $p(Z) > 0$. For $p(Z) > 0$, $\lim_{q(Z) \to 0} \log \frac{p(z)}{q(z)} \to \infty$. This means that the forward-KL will be large wherever $Q(Z)$ fails to "cover up" $P(Z)$.

Therefore, the forward-KL is minimized when we ensure that $q(z) > 0$ wherever $p(z)> 0$. The optimized variational distribution $Q(Z)$ is known as "zero-avoiding" (density avoids zero when $p(Z)$ is zero).

Minimizing the Reverse-KL has exactly the opposite behavior:

$$
\begin{align*}
KL(Q||P) & = \sum_z q(z) \log \frac{q(z)}{p(z)} \\
& = \mathbb{E}_{p(z)}{\big[\log \frac{q(z)}{p(z)}\big]}
\end{align*}
$$

If $p(Z) = 0$, we must ensure that the weighting function $q(Z) = 0$ wherever denominator $p(Z) = 0$, otherwise the KL blows up. This is known as "zero-forcing":

So in summary, minimizing forward-KL "stretches" your variational distribution $Q(Z)$ to cover over the entire $P(Z)$ like a tarp, while minimizing reverse-KL "squeezes" the $Q(Z)$ under $P(Z)$.

It's important to keep in mind the implications of using reverse-KL when using the mean-field approximation in machine learning problems. If we are fitting a unimodal distribution to a multi-modal one, we'll end up with more false negatives (there is actually probability mass in $P(Z)$ where we think there is none in $Q(Z)$).

Connections to Deep Learning

Variational methods are really important for Deep Learning. I will elaborate more in a later post, but here's a quick spoiler:

Deep learning is really good at optimization (specifically, gradient descent) over very large parameter spaces using lots of data.
Variational Bayes give us a framework with which we can re-write statistical inference problems as optimization problems.

Combining Deep learning and VB Methods allow us to perform inference on extremely complex posterior distributions. As it turns out, modern techniques like Variational Autoencoders optimize the exact same mean-field variational lower-bound derived in this post!

Thanks for reading, and stay tuned!

Sunday, July 24, 2016

Why Randomness is Important for Deep Learning

This afternoon I attempted to explain to my mom why Randomness is important for Deep Learning without using any jargon from probability, statistics, or machine learning.

The exercise was partially successful. Maybe. I still don't think she knows what Deep Learning is, besides that I am a big fan of it and use it for my job.

I'm a big fan of Deep Learning

This post is a slightly more technical version of the explanation I gave my mom, with the hope that it will help Deep Learning practitioners to better understand what is going on in neural networks.

If you are getting started into Deep Learning research, you may discover that there are a whole bunch of seemingly arbitrary techniques used to train neural networks, with very little theoretical justification besides "it works". For example: dropout regularization, adding gradient noise,, asynchronous stochastic descent.

What do all these hocus pocus techniques have in common? They incorporate randomness!

Random noise actually is crucial for getting DNNs to work well:

Random noise allows neural nets to produce multiple outputs given the same instance of input.
Random noise limits the amount of information flowing through the network, forcing the network to learn meaningful representations of data.
Random noise provides "exploration energy" for finding better optimization solutions during gradient descent.

Single Input, Multiple Output

Suppose you are training a deep neural network (DNN) to classify images.

For each cropped region, the network learns to convert an image into a number representing a class label, such as “dog” or "person".

That’s all very well and good, and these kind of DNNs do not require randomness in their inference model. After all, any image of a dog should be mapped to the “dog” label, and there is nothing random about that.

Now suppose you are training a deep neural network (DNN) to play Go (game). In the case of the image below, the DNN has to make the first move.

If you used the same deterministic strategy described above, you will find that this network fails to give good results. Why? Because there is no single “optimal starting move” - for each possible stone placement on the board, there is a equally good stone placement on the other side board, via rotational symmetry. There are multiple best answers.

If the network is deterministic and only capable of picking one output per input, then the optimization process will force the network to choose the move that averages all the best answers, which is smack-dab in the middle of the board. This behavior is highly undesirable, as the center of the board is generally regarded as a bad starting move.

Hence, randomness is important when you want the network to be able to output multiple possibilities given the same input, rather than generating the same output over and over again. This is crucial when there are underlying symmetries in the action space - incorporating randomness literally helps us break out of the "stuck between two bales of hay" scenario.

Similarly, if we are training a neural net to compose music or draw pictures, we don’t want it to always draw the same thing or play the same music every time it is given a blank sheet of paper. We want some notion of “variation”, "surprise", and “creativity” in our generative models.

One approach to incorporating randomness into DNNs is to keep the network deterministic, but have its outputs be the parameters of a probability distribution, which we can then draw samples from using conventional sampling methods to generate “stochastic outputs”.

Deepmind's AlphaGo utilized this principle: given an image of a Go board, it outputs the probability of winning given each possible move. The practice of modeling a distribution after the output of the network is commonly used in other areas of Deep Reinforcement Learning.

Randomness and Information Theory

During my first few courses in probability & statistics, I really struggled to understand the physical meaning of randomness. When you flip a coin, where does this randomness come from? Is randomness just deterministic chaos? Is it possible for something to be actually random?

To be honest, I still don't fully understand.

Information theory offers a definition of randomness that is grounded enough to use without being kept wide awake at night: "randomness" is nothing more than the "lack of information".

More specifically, the amount of information in an object is the length (e.g. in bits, kilobytes, etc.) of the shortest computer program needed to fully describe it. For example, the first 1 million digits of $\pi = 3.14159265....$ can be represented as a string of length 1,000,002 characters, but it can be more compactly represented using 70 characters, via an implementation of the Leibniz Formula:

The above program is nothing more than a compressed version of a million digits of $\pi$. A more concise program could probably express the first million digits of $\pi$ in far fewer bits.

Under this interpretation, randomness is "that which cannot be compressed". While the first million digits of $\pi$ can be compressed and are thus not random, empirical evidence suggests (but has not proven) that $\pi$ itself is normal number, and thus the amount of information encoded in $\pi$ is infinite.

Consider a number $a$ that is equal to the first trillion digits of $\pi$, $a = 3.14159265...$. If we add to that a uniform random number $r$ that lies in the range $(-0.001,0.001)$, we get a number that ranges in between $3.14059...$ and $3.14259...$. The resulting number $a + r$ now only carries ~three digits worth of information, because the process of adding random noise destroyed any information carried after the hundredth's decimal place.

Limiting Information in Neural Nets

What does this definition of randomness have to deal with randomness?

Another way randomness is incorporated into DNNs is by injecting noise directly into the network itself, rather than using the DNN to model a distribution. This makes the task "harder" to learn as the network has to overcome these internal "perturbations".

Why on Earth would you want to do this? The basic intuition is that noise limits the amount of information you can pass through a channel.

Consider an autoencoder, a type of neural network architecture that attempts to learn an efficient encoding of data by "squeezing" the input into fewer dimensions in the middle and re-constituting the original data at the other end. A diagram is shown below:

During inference, inputs flow from the left through the nodes of the network and come out the other side, just like a pipe.

If we consider a theoretical neural network that operates on real numbers (rather than floating point numbers), then without noise in the network, every layer of the DNN actually has access to infinite information bandwidth.

Even though we are squeezing representations (the pipe) into fewer hidden units, the network could still learn to encode the previous layer's data into the decimal point values without actually learning any meaningful features. In fact, we could represent all the information in the network with a single number. This is undesirable.

By limiting the amount of information in a network, we force it to learn compact representations of input features. Several ways to go about this:

Variational autoencoders (VAE) add Gaussian noise to the hidden layer. This noise destroys "excess information," forcing the network to learn compact representations of data.
Closely related to VAE noise (possibly equivalent?) to this is the idea of Dropout Regularization - randomly zeroing out some fraction of units during training. Like the VAE, dropout noise forces the network to learn useful information under limited bandwidth.
Deep Networks with Stochastic Depth - similar idea to dropout, but at a per-layer level rather than per-unit level.
There's a very interesting paper called Binarized Neural Networks that uses binary weights and activations in the inference pass, but real-valued gradients in the backward pass. The source of noise comes from the fact that the gradient is a noisy version of the binarized gradient. While BinaryNets are not necessarily more powerful than regular DNNs, individual units can only encode one bit of information, which regularizes against two features being squeezed into a single unit via floating point encoding.

More efficient compression schemes mean better generalization at test-time, which explains why dropout works so well for over-fitting. If you decide to use regular autoencoders over variational autoencoders, you must use a stochastic regularization trick such as dropout to control how many bits your compressed features should be, otherwise you will likely over-fit.

I think VAEs are objectively superior because they are easy to implement and allow you to specify exactly how many bits of information are passing through each layer.

Exploration "Energy"

DNNs are usually trained via some variant of gradient descent, which basically amounts to finding the parameter update that "rolls downhill" along some loss function. When you get to the bottom of the deepest valley, you've found the best possible parameters for your neural net.

The problem with this approach is that neural network loss surfaces have a lot of local minima and plateaus. It's easy to get stuck in a small dip or a flat portion where the slope is already zero (local minima) but you are not done yet.

The third interpretation of how randomness assists Deep Learning models is based on the idea of the exploration.

Because the datasets used to train DNNs are huge, it’s too expensive to compute the gradient across terabytes of data for every single gradient descent step. Instead, we use stochastic gradient descent (SGD), where we just compute the average gradient across a small subset of the dataset chosen randomly from the dataset.

In evolution, if the success of a species is modeled by random variable $X$, then random mutation or noise increases the variance of $X$ - its progeny could either be far better off (adaptation, poison defense) or far worse off (lethal defects, sterility).

In numerical optimization, this "genetic mutability" is called "thermodynamic energy", or Temperature that allows the parameter update trajectory to not always "roll downhill", but occasionally bounce out of a local minima or "tunnel through" hills.

This is all deeply related to the Exploration-vs.-Exploitation tradeoff as formulated in RL. Training a purely deterministic DNN with zero gradient noise has zero exploitation capabilities - it converges straight to the nearest local minima, however shallow.

Using stochastic gradients (either via small minibatches or literally adding noise to the gradients themselves) is an effective way to allow the optimization to do a bit of "searching" and "bouncing" out of weak local minima. Asynchronous stochastic gradient descent, in which many machines are performing gradient descent in parallel, is another possible source of noise.

This "thermodynamic" energy ensures symmetry-breaking during early stages of training, to ensure that all the gradients in a layer are not synchronized to the same values. Not only does noise perform symmetry breaking in action space of the neural net, but noise also performs symmetry breaking in the parameter space of the neural net.

Closing Thoughts

I find it really interesting that random noise actually helps Artificial Intelligence algorithms to avoid over-fitting and explore the solution space during optimization or Reinforcement Learning. It raises interesting philosophical questions on whether the inherent noisiness of our neural code is a feature, not a bug.

One theoretical ML research question I am interested in is whether all these neural network training tricks are actually variations of some general regularization theorem. Perhaps theoretical work on compression will be really useful for understanding this.

It would be interesting to check the information capacity of various neural networks relative to hand-engineered feature representations, and see how that relates to overfitting tendency or quality of gradients. It's certainly not trivial to measure the information capacity of a network with dropout or trained via SGD, but I think it can be done. For example, constructing a database of synthetic vectors whose information content (in bits, kilobytes, etc) is exactly known, and seeing how networks of various sizes, in combination with techniques like dropout, deal with learning a generative model of that dataset.

Saturday, July 16, 2016

What product breakthroughs will recent advances in deep learning enable?

This is re-posted from a Quora answer I wrote on on 6/11/16.

Deep Learning refers to a class of machine learning (ML) techniques that combine the following:

Large neural networks (millions of free parameters)
High performance computing ( thousands of processors running in parallel)
Big Data (e.g. millions of color images or recorded chess games)

Deep learning techniques currently achieve state of the art performance in a multitude of problem domains (vision, audio, robotics, natural language processing, to name a few). Recent advances in Deep Learning also incorporate ideas from statistical learning [1,2], reinforcement learning (RL) [3], and numerical optimization . For a broad survey of the field, see [9,10].

In no particular order, here are some product categories made possible with today's deep learning techniques:

customized data compression
compressive sensing
data-driven sensor calibration
offline AI
human-computer interaction
gaming, artistic assistants
unstructured data mining
voice synthesis

Customized data compression

Suppose you are designing a video conferencing app, and want to come up with a lossy encoding scheme to reduce the number of packets you need to send over the Internet.

You could use an off-the-shelf codec like H.264, but H.264 is not optimal because it is calibrated for generic video - anything from cat videos to feature films to clouds. It would be nice if instead we had a video codec that was optimized for specifically FaceTime videos. We can save even more bytes than a generic algorithm if we take advantage of the fact that most of the time, there is a face in the center of the screen. However, designing such an encoding scheme is tricky:

How do we specify where the face is positioned, how much eyebrow hair the subject has, what color their eyes are, the shape of their jaw?
What if their hair is covering one of their eyes?
What if there are zero or multiple faces in the picture?

Deep learning can be applied here. Auto-encoders are a type of neural network whose output is merely a copy of the input data. Learning this "identity mapping" would be trivial if it weren't for the fact that the hidden layers of the auto-encoder are chosen to be smaller than the input layer. This "information bottleneck" forces the auto-encoder to learn an compressed representation of the data in the hidden layer, which is then decoded back to the original form by the remaining layers in the network.

Through end-to-end training, auto-encoders and other deep learning techniques *adapt* to the specific nuances of your data. Unlike principal components analysis, the encoding and decoding steps are not limited to affine (linear) transformations. PCA learns an "encoding linear transform", while auto-encoders learn a "encoding program".

This makes neural nets far more powerful, and allows for complex, domain-specific compression; anything from storing a gazillion selfies on Facebook, to faster YouTube video streaming, to scientific data compression, to reducing the space needed for your personal iTunes library. Imagine if your iTunes library learned a "country music" auto-encoder just to compress your personal music collection!

Compressive sensing

Compressive sensing is closely related to the decoding aspects of lossy compression. Many interesting signals have a particular structure to them - that is, the distribution of signals is not completely arbitrary. This means that we don't actually have to sample at the Nyquist limit in order to obtain a perfect reconstruction of the signal, as long our decoding algorithm can properly exploit the underlying structure.

Deep learning is applicable here because we can use neural networks to learn the sparse structure without manual feature engineering. Some product applications:

Super-resolution algorithms (waifu2X)- literally an "enhance" button like those from CSI Miami
using WiFi radio wave interference to see people through walls (MIT Wi-Vi)
interpreting 3D structure of an object given incomplete observations (such as a 2D image or partial occlusion
more accurate reconstructions from sonar / LIDAR data

Data-driven sensor calibration

Good sensors and measurement devices often rely on expensive, precision-manufactured components.

Take digital cameras, for example. Digital cameras assume the glass lens is of a certain "nice" geometry. When taking a picture, the onboard processor solves the light transport equations through the lens to compute the final image.

If the lens is scratched, or warped or shaped like a bunny (instead of a disc) these assumptions are broken and the images no longer turn out well. Another example: our current decoding models used in MRI and EEG assume the cranium is a perfect sphere in order to keep the math manageable [4]. This sort of works, but sometimes we miss the location of a tumor by a few mm. More accurate photographic and MRI imaging ought to compensate for geometric deviation, whether they result from underlying sources or manufacturing defects.

Fortunately, deep learning allows us to calibrate our decoding algorithms with data.

Instead of a one-size-fits-all decoding model (such as a Kalman filter), we can express more complex biases specifically tuned to each patient or each measuring device. If our camera lens is scratched, we can train the decoding software to implicitly compensate for the altered geometry. This means we no longer have to manufacture and align sensors with utmost precision, and this saves a lot of money.

In some cases, we can do away with hardware completely and let the decoding algorithm compensate for that; the Columbia Computational Photography lab has developed a kind of camera that doesn't have a lens. Software-defined imaging, so to speak.

Offline AI

Being able to run AI algorithms without Internet is crucial for apps that have low latency requirements (I.e. self driving cars & robotics) or do not have reliable connectivity (smartphone apps for traveling).

Deep Learning is especially suitable for this. After the training phase, neural networks can run the feed forward step very quickly. Furthermore, it is straightforward to shrink down large neural nets into small ones, until they are portable enough to run on a smartphone (at the expense of some accuracy).

Google has already done this in their offline camera translation feature in Google Translate App [6].

Some other possibilities:

Intelligent assistants (e.g. Siri) that retain some functionality even when offline.
wilderness survival app that tells you if that plant is poison ivy, or whether those mushrooms are safe to eat
small drones with on-board TPU chips [11] that can perform simple obstacle avoidance and navigation

Human-computer interaction

Deep Neural Networks are the first kind of models that can really see and hear our worldwith an acceptable level of robustness. This opens up a lot of possibilities for Human-Computer Interaction.

Cameras can now be used to read sign language and read books aloud to people. In fact, deep neural networks can now describe to us in full sentences what they see [12]. Baidu's DuLight project is enabling visually-impaired people to see the world around them through a sight-to-speech earpiece.

Dulight--Eyes for visually impaired

We are not limited to vision-based HCI. Deep learning can help calibrate EEG interfaces for paraplegics to interact with computers more rapidly, or provide more accurate decoding tech for projects like Soli [7].

Gaming

Games are computationally challenging because they run physics simulation, AI logic, rendering, and multiplayer interaction together in real time. Many of these components have at least O(N^2) in complexity, so our current algorithms have hit their Moore's ceiling.

Deep learning pushes the boundaries on what games are capable of in several ways.

Obviously, there's the "game AI" aspect. In current video games, AI logic for non-playable characters (NPC) are not much more than a bunch of if-then-else statements tweaked to imitate intelligent behavior. This is not clever enough for advanced gamers, and leads to somewhat unchallenging character interaction in single-player mode. Even in multiplayer, a human player is usually the smartest element in the game loop.

This changes with Deep Learning. Google Deepmind's AlphaGo has shown us that Deep Neural Networks, combined with policy gradient learning, are powerful enough to beat the strongest of human players at complex games like Go. The Deep Learning techniques that drive AlphaGo may soon enable NPCs that can exploit the player's weaknesses and provide a more engaging gaming experience. Game data from other players can be sent to the cloud for training the AI to learn from its own mistakes.

Another application of deep learning in games is physics simulation. Instead of simulating fluids and particles from first principles, perhaps we can turn the nonlinear dynamics problem into a regression problem. For instance, if we train a neural net to learn the physical rules that govern fluid dynamics, we can evaluate it quickly during gameplay without having to perform large-scale solutions to Navier stokes equations in real time.

In fact, this has been done already by Ladicky and Jeong 2015 [8].

For VR applications that must run at 90 FPS minimum, this may be the only viable approach given current hardware constraints.

Third, deep generative modeling techniques can be used to create unlimited, rich procedural content - fauna, character dialogue, animation, music, perhaps the narrative of the game itself. This is an area that is just starting to be explored by games like No Man's Sky, which could potentially make games with endless novel content.

To add a cherry on top, Deep Neural nets are well suited for parallel mini-batched evaluation, which means that AI logic for a 128 NPCs or 32 water simulations might be evaluated simultaneously on a single graphics card.

Artistic Assistants

Given how well neural networks perceive images, audio, and text, it's no surprise that they also work when we use them to draw paintings [13], compose music [14], and write fiction [15].

People have been trying to get computers to compose music and paint pictures for ages, but deep learning is the first one that actually generates "good results". There are already several apps in the App Store that implement these algorithms for giggles, but soon we may see them as assistive generators/filters in professional content creation software.

Data Mining from Unstructured Data

Deep learning isn't at the level where it can extract the same amount of information humans can from web pages, but the vision capabilities of deep neural nets are good enough for allowing machines to understand more than just hypertext.

For instance:

Parsing events from scanned flyers
identifying which products on EBay are the same
determining consumer sentiment from webcam
extracting blog content from pages without RSS feeds
integrate photo information into valuing financial instruments, insurance policies, and credit scores.

Voice synthesis

Generative modeling techniques have come far enough and there is sufficient data out there that it is only a matter of time before someone makes an app that reads aloud to you in Morgan Freeman's or Scarlet Johansen's voice. At Vanguard, my voice is my password.

Bonus: more products

Adaptive OS / Network stack scheduling - scheduling threads and processes in an OS is a NP hard problem. We don't have a very satisfactory solution to this right now, and scheduling algorithms in modern operating systems, filesystems, and TCP/IP implementations are all fairly simple. Perhaps if a small neural net could be used to adapt to a user's particular scheduling patterns (frame this as an RL problem), we would decrease scheduling overhead incurred by the OS. This might make a lot of sense inside of data centers where the savings can really scale.
Colony counting & cell tracking for microscopy software (for wet lab research)
The strategy of "replacing simulation with machine learning" has been useful in the fields of drug design too, presenting enormous speed ups in finding which compounds are helpful or toxic [untethiner 2015].