## Monday, January 2, 2017

### Summary of NIPS 2016

The 30th annual Neural Information Processing Systems (NIPS) conference took place in Barcelona in early December. In this post, I share my thoughts on the conference. Let me know your thoughts in the comments section!

Basic outline:

• State of AI Research and trends in 2016
• Talks and papers that I liked
• Predictions for 2017
• Miscellaneous notes

## State of Artificial Intelligence Research: 2016 Trends

Academic, industry, and public interest in Artificial Intelligence (A.I.) is taking off like a goddamn rocket. There was a 50% percent increase in NIPS registrations from last year:

It's not a coincidence that Nvidia, the literal arms-dealer of deep learning, has had a good year in the stock market.

2016 saw the continued success of Deep Learning: superhuman Go playing, superhuman speech transcription, superhuman translation, superhuman lip reading, and maybe-superhuman driving ability (no public benchmarks for self-driving cars yet). These accomplishments are especially remarkable, considering that many AI researchers used to believe that solving the relevant problems required solving general intelligence itself.

The research topics that are in vogue right now (GANs, RL, generative models, unsupervised learning, robotics, alternative ways to train DNNs) are important problems to tackle before we can build general AI. However, they haven't created significant commercial value in industry yet, in ways that couldn't plausibly be substituted with traditional supervised learning.

Transfer learning, domain adaptation and semi-supervised learning alleviate the data-hungry requirements of deep learning, and are starting to work really well.

The volume and quality of published work in 2016 was staggering, and it's hard to come away from the conference feeling like there are any ideas left to work on. Many papers and posters I saw at the conference had multiple groups converging on similar ideas.

### GANs

I can’t imagine a higher compliment for Ian Goodfellow than Yann Lecun declaring Generative Adversarial Networks (GANs) to be “the biggest breakthrough in Machine Learning in the last 1-2 decades.”

For those of you who are unfamiliar with GANs, here's a simple writeup explaining them. GANs were initially developed for generating images, but researchers are exploring their applications in reinforcement learning, domain adaptation, security ML, compression, and more. A more appropriate terminology would be “adversarial methods”, because the network we are trying to train need not be "generative" - it could be a deterministic classifier or policy instead.

Why are GANs so cool? Before they became popular, many researchers formulated learning as an optimization objective, i.e. “minimize a loss or maximize a reward". GANs break away from that paradigm; instead, adversarial methods solve for a Nash equilibria between at least two “agents”, both of which are allowed to co-adapt to each other.

The research community cares about this because some loss functions are often difficult to specify directly. In many over-parameterized problems, the optimization process will “short-circuit” design flaws in the objective and yield a result that might minimize the objective, but isn't quite what we really wanted.

For example, if we wanted to minimize some error for image compression/reconstruction, often what we find is that a naive choice of error metric (e.g. euclidean distance to the ground truth label) results in qualitatively bad results. The design flaw is that we don’t have good perceptual similarity metrics for images that are universally applicable for the space of all images. GANs use a second “adversarial” network learn an optimal implicit distance function (in theory).

Ian Goodfellow gave a very nice tutorial on GANs. The main challenge for GANs right now is preventing degenerate equilibria between the discriminator and generator. These issues often manifest themselves as mode collapse (the generator learns a single tight mode over a single sample) or the discriminator "overpowering" the generator. I'm not worried about this - I think with time, intuitive tricks and recipes from experienced researchers will make them easier to use. People used to think that only Geoff Hinton could train DNNs, and that ended up not being the case.

Here's a quote from Maciej Cegłowski's an eminently hilarious, bookmark-bar-tier blog post about AI existential alarmism:

With no way to define intelligence (except just pointing to ourselves), we don't even know if it's a quantity that can be maximized. For all we know, human-level intelligence could be a tradeoff. Maybe any entity significantly smarter than a human being would be crippled by existential despair, or spend all its time in Buddha-like contemplation.

Does intelligent life arise from optimization? Or is it an equilibrium? Or to quote Yann, “a bunch of hacks that just happen to work?” AI research is not ready to answer these questions yet, but I am optimistic that research on GANs will tease out what kinds of problems are better-suited for expressing as an equilibrium-finding vs. minimization problem.

### Deep RL

People usually regard Reinforcement Learning (RL) as pertaining to playing video games and controlling robots, but RL methods are far more general than that - any ML problem with dependencies across space or time can benefit from RL techniques (i.e. sequential pixels). RL allows us to think about non-stationary data distributions as "environments" that respond to an agent's behavior. RL is applied creatively to a lot of ML problems nowadays, such as translation.

Pieter Abbeel and John Schulman gave a very nice tutorial on policy gradient methods, and John gave a talk on the “nuts and bolts” of Deep RL research. The nuts and bolts talk is jam-packed with great advice that would have taken me years to figure out on my own, and I highly recommend it for anyone who is working on RL projects.

Curiously, a recent paper by Chelsea Finn showed that certain classes of inverse RL problems have a mathematically equivalent GAN formulation!

Someone asked the Deep RL panelists what some of the most challenging problems were in RL. Rich Sutton and Raia Hadsell mentioned learning of subtasks/subgoals, i.e. hierarchical representations of the task at hand. John Schulman mentioned an interesting comment about how the Bellman discount factor in the discrete MDP setting makes it difficult to plan more than >100 time steps into the future, even when r=0.99. Finally, Chelsea Finn discussed sample complexity: it’s annoying that we have to learn every RL task from scratch, and some kind of inter-task transfer learning is needed. After all, humans can quickly learn to play new games because we've played other games before.

### Bayesian Deep Learning

I think one of the most promising areas of research right now is the marriage of deep neural nets with graphical models,  referred to some as “Bayesian deep learning.” It’s probably the closest thing to a theoretical understanding of deep learning that generalizes to how neural nets are used in practice.

Kyle Cranmer gave a phenomenal talk on how likelihood-free inference methods were used to find evidence of the Higgs Boson. A lot of it went over my head, but what little I understood from it was that they used a variational model to approximate the simulation of a 11-story particle detector at atomic scale:

There were also a whole bunch of interesting variational inference techniques presented at the conference:
• Symmetrized Variational Inference - a way to avoid mode collapse with Gaussian VAEs
• Rejection Sampling Variational inference - reparameterization trick for a rejection sampler, allowing us to train gamma, beta, Dirichlet and many other variational distributions! This is huge.
• Robust Inference with Variational Bayes - improving the robustness of a variational posterior to choice of prior and likelihood.
• ELBO surgery - ELBO is the most common variational lower bound used to train VAEs, and this paper proposes an alternative way to write the ELBO: split up the KL(q(z|x)|p(z)) term into an index-code mutual information term and marginal KL term KL(q(z)|p(z)). Experiments show that the latter term is actually quite large, which suggests that we should focus on improving priors. This work is quite important because it provides a easy-to-implement debugging tool for anyone training VAEs.
• Operator Variational Inference, Normalizing Flows, and Real-NVP enable richer families of posterior approximations.
• Variational lossy autoencoders

### Open Research

There were a couple exciting software announcements at NIPS this year. First was OpenAI's announcement of Universe, a software interface that bridges all kinds of software to OpenAI Gym. This means we can bring RL to interact with environments such as the full-featured Chrome Browser, or desktop applications like Photoshop. Think of this as "AppleScript on steroids".

Deepmind released Deepmind Lab for FPS + 3D agent research in their Labyrinth environment, and earlier this year, Microsoft released Project Malmo for training AI agents in Minecraft. These are really interesting environments, and typically outside the capabilities of a grad student to implement on their own.

These are just a few of the high-profile software releases; many other tech companies and academic institutions are releasing benchmark software, deep learning frameworks, and data all the time.

People often cite 1) compute, 2) data, and 3) algorithms as the reason why DL is taking off, and I think we should add 4) the unprecedented accessibility of ML research. Tons of people are already using OpenAI gym for RL projects, and many research codes are being made public for other people to build on top of. The quality and scope of research benefits dramatically from students not having to write thousands of lines of support code, and instead focus on their research problem.

It’s great that large tech companies are making such great software available to the broader community, and this kind of work that is more impactful than any research paper I saw at the conference this year.

## Talks and Papers that I Liked

I had nowhere enough mental capacity to see all the papers and talks at NIPs, but I did write down some talks, papers, and posters that I came away really inspired by. Here’s a non-exhaustive list.

• Andrew Ng gave a practical tutorial on how to debug ML research projects. This is a must-watch for beginners in the field. Here is a video of a similar talk he gave at the Deep Learning summer school.
• Fixed Points of Belief Propagation - the authors formulate the convergence condition of belief propagation (BP) as the solution of polynomial equations. This makes me wonder if these methods could be used to bridge nonlinear dynamical systems with statistical inference.
• Deep Networks with Stochastic Depth - layer-wise dropout allows for training ResNets with >1000 layers.
• Layer Norm - like batch norm, layer norm is used to prevent covariance shift between successive deep layers. The advantage is that layer norm does not introduce gradient variance from other minibatch elements.
• Deep Successor Reinforcement Learning - I really liked this paper because it gets at model-based + model-free ideas. The motivation is that model-free methods like DQN take a long time to learn adequate low-level features because we are often trying to train a CNN to “see” using a single value scalar that is not directly related to image features. The idea is to learn “successor features” in an unsupervised manner (using rich supervision provided by model prediction error), and then separately learn a simple linear value function on top of these features. The paper is short and well-written.
• The Intelligent Biosphere by Drew Purves (couldn't find the talk slides). This talk was a bit of an oddball among the other mathy ones at NIPS (I thought the drawings were distracting and failed to capture the nuance of what Drew was saying), but I think the message is crucial for AI researchers who want to take a step back and think about what it means for a system to be “intelligent”. The basic idea is that “intelligence” extends beyond brains; something as simple as self-replicating RNA exhibit intelligent behavior at the evolutionary scale. The natural world is fractal, cyclic, and fuzzy, as opposed to the simplicity of our current simulated environments for benchmarking AI. The implication is that there may be challenges learning in these environments that don’t carry over from Atari games. One interesting comment Drew made was that in a biosphere, every organism is a resource to another organism. That is, learning and adaptation of each organism is not independent of other organisms (reminds me of GANs)! If you’re interested in these environment + intelligence ideas, similar arguments have been made in ethology and neuroethology texts.
• Input Convex Neural Networks - the authors presented a very interesting follow-up workshop poster at the Deep RL workshop (I can't find a paper), where they were able to solve continuous Q functions by parameterizing the Q-function to be convex with respect to the action vector of the agent (and non-convex with respect to the input state). This allows for efficient DQN policies in continuous action domains.
• Using Fast Weights to Attend to the Recent Past - basically an alternative to LSTM that seems to work a lot better.
• Phased LSTM - a way to learn an adaptive time scale for LSTM architectures, where the hidden state is updated at a learned frequency. This is ideal for learning from data with arbitrary sampling rates, such as from event-based vision sensors (which naturally solve some hard computer vision problems like aliasing and bandwidth constraints). It also happens that this Zurich lab happens to make a really awesome event-based camera and I’m really excited to see how they apply ML to it.
• Residual Networks Behave Like Ensembles of Relatively Shallow Networks - the title says it all. I thought the experiments could have been a lot better, but regardless, this work could potentially have a huge impact in new ways to train very deep ResNets quickly.
• Graph Convolutional Networks - generalization of convolutions to allow DNN processing on graph data structures (the kind with edges and vertices). This work is important for getting DL methods to learn approximate solutions to NP-hard graph problems.
• Minimizing Quadratic Functions in Constant Time - I did a double-take when I saw the title of this poster. I don’t understand the math very well, but the algorithm is very simple: subsample a $n \times n$ matrix to a $k \times k$ matrix and solve it. A simple scaling of the resulting minimized value yields an approximation to the minimized objective of the original system! In problems where we need to branch-and-bound a large search space (an optimal portfolio), we might find the value of a solution more informative than the solution itself.

## Predictions for 2017

• SGD + backprop will still be the best method for training deep neural nets in 2017.
• Really good transfer learning, domain adaptation, and/or multitask learning will be one of the major accomplishments in 2017 (this was suggested by Andrew Ng during his tutorial).
• Ensemble methods (e.g. training 10 versions of a network and averaging the predictions) will squeeze out 2-5% performance on all kinds of ML tasks.
• Widespread use of model-based + model-free RL, e.g. Reinforcement Learning with Unsupervised Auxiliary Tasks as a way to alleviate sample-complexity of model-free methods.

## Miscellaneous Notes

Some of these are paraphrased opinions of panelists and speakers. If you think I misrepresented your opinions in any way, please don't hesitate to reach out and I will make a correction!

• Boston Dynamics displayed a series of very entertaining and impressive robotics demos. Their basic approach appears to be high precision hardware that behaves faithfully to a simulated planning model, plus a bit of robust control to handle the imperfections. Machine Learning doesn’t seem to be an important priority for their approach to robotic control, but I think it's good to have uncorrelated research approaches. It's hard to say which one will end up working best. God, I hope Alphabet doesn’t sell them off. Their stuff is really cool.
• Apple has publicly announced that they will open up their research culture. I think a product-driven research culture has the potential to align research towards important problems (rather than maximizing citations, as is done in academia).
• Certain prop trading firms are using neural nets to do price forecasting. I had previously thought that all trading firms were too concerned about model interpretability to use neural nets.
• Speaking of hedge funds, many top hedge funds and trading shops came to NIPS to run career booths, but there was a surprising lack of interest from attendees (attendees were more interested in the likes of Apple, Facebook, Deepmind, Google, etc). At a regular college career fair in the East Coast, these roles are reversed. The talent pool at NIPS seems to be far more interested in tech and open-ended research than making money.
• Deers, despite being “herbivores”, will sometimes eat the heads of baby birds for extra calcium (this was from Drew Purves talk). Life rejects your reductionist labels, human!
• Engineering Principles From Stable and Developing Brains (Saket Navlakha) - interestingly enough, pruning unused neuron units seems to work better for learning than growing additional neurons.
• Yoshua Bengio - much better generalization (train vs. test error) happens when telling generative models what their latent codes should be.
• There was a great question asked during the Brains and Bits panel about why we are focusing so much intelligence research on humans, instead of simpler animals like octopi and birds. Yoshua says that focusing on simple models might cause researchers to overlook more general principles of intelligence. I agree with the argument, but not it’s premise - I think it is our AI algorithms that are simple, not the octopi and birds. I believe that we should build an octopi AI before a human one (see the "whole iguana" approach).
• Yann Lecun thinks model interpretability is overrated: we have lots of interpretable algorithms, but we still lack the basic principles with which to build a general-purpose learning system that can navigate the complexities of our natural world. I agree wholeheartedly with this - requiring interpretability unnecessarily confines us to look for solutions in models that we understand bottom-up. If history of scientific discovery is any clue, that is not always the case. In fact, serendipitous discovery sometimes invalidates our bottom-up assumptions! (e.g. "neural nets don't work").
• Yann is not a fan of the log-likelihood metric. Log-likelihood is evaluated under a particular model (e.g. PixelCNN decoder), and if the model is not good to begin with, then log-likelihood is somewhat meaningless.
• Andrew Saxe asked a great question at the Brain and Bits workshop: “what are fringe topics that might be important to solving intelligence?” Demis Hassabis replied that people don't regard the study of imagination, dreaming, consciousness as very scientific, but they may become serious practical questions within 5 years. Terry Sejnowski replied that understanding sleep is crucial, as it is very much a computational phenomenon rather than a torpid one. Yoshua wonders about what it means to "understand something", i.e. neural nets and other models can learn computations and statistical correlations, but how should semantic meaning be extracted?

I didn't attend any talks on meta-learning (e.g. synthetic gradients, learning-to-learn, hypernetworks), and a great many other research themes, but others have blogged about their experiences at NIPS. Check these articles out to piece together a fuller picture!

NIPS 2016 by Martin Thoma
50 things I learned at NIPS 2016 by Andreas Stuhlmüller

Thank you for reading, and happy new year!

## Tuesday, November 8, 2016

### Tutorial: Categorical Variational Autoencoders using Gumbel-Softmax

In this post, I discuss our recent paper, Categorical Reparameterization with Gumbel-Softmax, which introduces a simple technique for training neural networks with discrete latent variables. I'm really excited to share this because (1) I believe it will be quite useful for a variety of Machine Learning research problems, (2) this is my first published paper ever (on Arxiv, and submitted to a NIPS workshop and ICLR as well).
The TLDR; if you want categorical features in your neural nets, just let sample = softmax((logits+gumbel noise)/temperature), and then backprop as usual using your favorite automatic differentiation software (e.g. TensorFlow, Torch, Theano).

## Introduction

One of the main themes in Deep Learning is to “let the neural net figure out all the intermediate features”. For example: training convolutional neural networks results in the self-organization of a feature detector hierarchy, while Neural Turing Machines automatically “discover” copying and sorting algorithms.
The workhorse of Deep Learning is the backpropagation algorithm, which uses dynamic programming to compute parameter gradients of the network. These gradients are then used to minimize the optimization objective via gradient descent. In order for this to work, all of the layers in our neural network — i.e. our learned intermediate features — must be continuous-valued functions.
What happens if we want to learn intermediate representations that are discrete? Many "codes" we want to learn are fundamentally discrete - musical notes on a keyboard, object classes (“kitten”, “balloon”, “truck”), and quantized addresses (“index 423 in memory”).
We can use stochastic neural networks, where each layer compute the parameters of some (discrete) distribution, and its forward pass consists of taking a sample from that parametric distribution. However, the difficulty is that we can’t backpropagate through samples. As shown below, there is a stochastic node (blue circle) in between $f(z)$ and $\theta$.
Left: in continuous neural nets, you can use backprop to compute parameter gradients. Right: backpropagation is not possible through stochastic nodes.

## Gumbel-Softmax Distribution

The problem of backpropagating through stochastic nodes can be circumvented if we can re-express the sample $z \sim p_\theta(z)$, such that gradients can flow from $f(z)$ to $\theta$ without encountering stochastic nodes. For example, samples from the normal distribution $z \sim \mathcal{N}(\mu,\sigma)$ can be re-written as $z = \mu + \sigma \cdot \epsilon$, where $\epsilon \sim \mathcal{N}(0,1)$. This is also known as the “reparameterization trick”, and is commonly used to train variational autoencoders with Gaussian latent variables.

The Gumbel-Softmax distribution is reparameterizable, allowing us to avoid the stochastic node during backpropagation.
The main contribution of this work is a “reparameterization trick” for the categorical distribution. Well, not quite – it’s actually a re-parameterization trick for a distribution that we can smoothly deform into the categorical distribution. We use the Gumbel-Max trick, which provides an efficient way to draw samples $z$ from the Categorical distribution with class probabilities $\pi_i$:
$$\DeclareMathOperator*{\argmax}{arg\,max} z = \verb|one_hot|\left(\argmax_{i}{\left[ g_i + \log \pi_i \right]}\right)$$
argmax is not differentiable, so we simply use the softmax function as a continuous approximation of argmax:
$$y_i = \frac{\text{exp}((\log(\pi_i)+g_i)/\tau)}{\sum_{j=1}^k \text{exp}((\log(\pi_j)+g_j)/\tau)} \qquad \text{for } i=1, ..., k.$$
Hence, we call this the Gumbel-SoftMax distribution*. $\tau$ is a temperature parameter that allows us to control how closely samples from the Gumbel-Softmax distribution approximate those from the categorical distribution. As $\tau \to 0$, the softmax becomes an argmax and the Gumbel-Softmax distribution becomes the categorical distribution. During training, we let $\tau > 0$ to allow gradients past the sample, then gradually anneal the temperature $\tau$ (but not completely to 0, as the gradients would blow up).
Below is an interactive widget that draws samples from the Gumbel-Softmax distribution. Keep in mind that samples are vectors, and a one-hot vector (i.e. one of the elements is 1.0 and the others are 0.0) corresponds to a discrete category. Click "re-sample" to generate a new sample, and try dragging the slider and see what samples look like when the temperature $\tau$ is small!
1.0

## TensorFlow Implementation

Using this technique is extremely simple, and only requires 12 lines of Python code:

Despite its simplicity, Gumbel-Softmax works surprisingly well - we benchmarked it against other stochastic gradient estimators for a couple tasks and Gumbel-Softmax outperformed them for both Bernoulli (K=2) and Categorical (K=10) latent variables. We can also use it to train semi-supervised classification models much faster than previous approaches. See our paper for more details.

## Categorical VAE with Gumbel-Softmax

To demonstrate this technique in practice, here's a categorical variational autoencoder for MNIST, implemented in less than 100 lines of Python + TensorFlow code.
In standard Variational Autoencoders, we learn an encoding function that maps the data manifold to an isotropic Gaussian, and a decoding function that transforms it back to the sample. The data manifold is projected into a Gaussian ball; this can be hard to interpret if you are trying to learn the categorical structure within your data.
First, we declare the encoding network:

Next, we sample from the Gumbel-Softmax posterior and decode it back into our MNIST image.

Variational autoencoders minimizes reconstruction error of the data by maximizing an expectedlower bound (ELBO) on the likelihood of the data, under a generative model $p_\theta(x)$. For a derivation, see this tutorial on variational methods.
$$\log p_\theta(x) \geq \mathbb{E}_{q_\phi(y|x)}[\log p_\theta(x|y)] - KL[q_\phi(y|x)||p_\theta(y)]$$
Finally, we run train our VAE:

...and, that's it! Now we can sample randomly from our latent categorical code and decode it back into MNIST images:
Code can be found here. Thank you for reading, and let me know if you find this technique useful!

## Acknowledgements

I'm sincerely grateful to my co-authors, Shane Gu and Ben Poole for collaborating with me on this work. Shane introduced me to the Gumbel-Max trick back in August, and supplied the framework for comparing Gumbel-Softmax with existing stochastic gradient estimators. Ben suggested and implemented the semi-supervised learning aspect of the paper, did the math derivations in the Appendix, and helped me a lot with editing the paper. Finally, thanks to Vincent Vanhoucke and the Google Brain team for encouraging me to pursue this idea.
*Chris J. Maddison, Andriy Mnih, and Yee Whye Teh at Deepmind have discovered this technique independently and published their own paper on it - they call it the “Concrete Distribution”. We only found out about each other’s work right as we were submitting our papers to conferences (oops!). If you use this technique in your work, please cite both of our papers! They deserve just as much credit.

23 Apr 2017: Update - Chris Maddison and I integrated these distributions into TensorFlow's Distributions sub-framework. Here's a code example of how to implement a categorical VAE using the distributions API.

## Tuesday, September 6, 2016

### Riemann Summation and Physics Simulation are Statistically Biased

Riemann summation is the simplest method for numerically approximating integrals. in which you chop up the domain into equally sized segments and add a bunch of rectangles together.

$$\int f(x) dx \sim \sum_i^k f(x) \Delta x$$

This technique also comes up in Euler integration for simulating physics:

\begin{align*} \Delta x & = \int \dot{x}(t) dt \\ & \sim \dot{x}(t) \Delta t \end{align*}

These are all special cases of "Deterministic Quadratures", where the integration step size $\Delta x$ is non-random ("deterministic") and we are summing up a bunch of quadrilateral pieces ("Quadratures") to approximate areas.

This post interprets the numerical integration as a special case of stratified sampling, and shows that deterministic quadrature rules are statistically biased estimators.

Suppose we want to compute $E_X[f(X)] = \int_x p(x)\cdot f(x) dx$ via a Monte Carlo estimator.

A plain Monte Carlo Estimator is given by $\mu = \frac{1}{n}\sum_i^n f(X_i)$, and has variance $\frac{1}{n}\sigma^2$ where $\sigma^2=Var[X_i]$.

Stratified sampling (see previous blog post) introduces a stratification variable $Y$ with a discrete distribution of $k$ strata with probability densities  $p(y_1),p(y_2),...,p(y_k)$, respectively. A common choice is to assign each $p(y_1) = p(y_2) = ... = p(y_k) = 1/k$, and arrange the strata in a grid. In this case, $y_i$ correspond to the corners of the strata and $P(X|Y=y_i)$ corresponds to a uniform distribution over that square ($X_i = Y_i + 0.1*\text{rand}()$).

The variance of this estimator is $\sum_i^k p(y_i)^2 \cdot V_i$, where $V_i$ is the variance of $\mathbb{E}[X|Y=y_i]$ for strata $i$.

Suppose we let $\mathbb{E}[X|Y=y_i]$ as $y_i$ - that is, we just sample at the corners where $y_i$ sampels are.

The good news is that $V_i = 0$, and we've reduced the variance of the total estimator to zero. The bad news is that the estimator is now biased because our per-strata estimators $\mathbb{E}[X|Y=y_i]$ are biased (our estimator never uses information from the interiors/edges of the squares). The estimator is also inconsistent for finite strata.

Does this figure remind you of anything? This estimator behaves identically to a Riemann Summation in 2D!

What does this mean?

• If you ever take a Riemann sum with fixed step size, your integral is biased!
• If you are monitoring mean ocean temperatures and place your sensors in an regular grid across the water (like above), your estimator is biased!
• If you are recording video at a too-low frequency and pick up aliasing artifacts, your estimates are biased!
• If you use a fixed timestep in a physics simulation, such as Euler or Runge-Kutta methods, your space integrals are biased!

The takeaway here is that *all* deterministic quadrature rules are statistically biased. that's okay though - the expected value of biased estimators *do* converge to the true expectation as you crank the number of strata to infinity (e.g. $\Delta x \to 0$).

A "less-biased" physics simulator ought to incorporate random time steps (for instance, sampled between 0 and MAX_TIME_STEP). To reduce variance, one might generate multiple samples, compute a nonlinear update, then average out the integration.

Fairly obvious in hindsight, but I found it interesting to think about the relationship between stratified sampling and deterministic quadrature :)

## Monday, September 5, 2016

### Monte Carlo Variance Reduction Techniques in Julia

Monte Carlo methods are a simple yet powerful family of techniques for performing numerical integration in high-dimensional spaces.

This post is a tutorial on a few variance reduction techniques that can be used to reduce the noisiness of your estimators: antithetic samplingcontrol variates, and stratified sampling.

Code on Github

## The Problem

Our goal is to find the expected value of the random variable $X$:

$$X = f(u) = \exp(\sin(8 u_1^2 + 3 u_1)+ \sin(11 u_2))$$

where $u$ is a 2D point sampled uniformly from the unit square, as shown below:

f(x,y) = exp(sin(8x.^2+3x)+sin(11y))
plot(linspace(0,1,100),linspace(0,1,100),f,st=:contourf,title="f(X)",size=(400,400))


The expectation can be re-written as an integral over the unit square.

$$\mathbb{E}[X] = \int_0^1 \int_0^1 \exp(\sin(8 u_1^2 + 3 u_1)+ \sin(11 u_2)) du_1 du_2$$

Monte Carlo is useful in scenarios where a closed-form integral does not exist, or we do not have access to the analytical form of the function (for example, it could be a black box program that the NSA hands to us). In this case we can integrate it analytically, but let's pretend we can't :)

## Basic Monte Carlo Estimator

The basic Monte Carlo estimator $\hat{\mu}$ gives a straightforward answer: "evaluate $x_i=f(u_i)$ at $n$ random values of $U$, and then take the average among all your samples." (to make comparison easier with the estimators I will introduce later on, I use $2n$ samples instead of $n$).

$$\hat{\mu} = \frac{1}{2n}\sum_i^{2n} x_i \approx \mathbb{E}[X]$$

Here is the estimator in code :

function plainMC(N)
u1 = rand(2*N)
u2 = rand(2*N)
return u1,u2,f(u1,u2),[]
end


The plot below shows 100 samples and their values, colored according to the value of $f(X)$ at that location.


u1,u2,x,ii=plainMC(N)
scatter(u1,u2,markercolor=x,
xlabel="u1",ylabel="u2",
title="Plain MC Samples",
size=(400,400),label="")


The variance of this (2n-sample) estimator is given by:

\newcommand{\Var}{\operatorname{Var}} \begin{align*} \Var[\hat{\mu}] & = \Var[ \frac{1}{2n}\sum_i^{2n} X_i ] \\ & = (\frac{1}{2n})^2\sum_i^{2n} \Var [X_i] \\ & = \frac{\sigma^2}{2n} \end{align*}

Where $\sigma^2$ is the variance of a single sample $X_i$ and $n$ is the total number of samples.

The variance of our estimator is $\mathbb{E}[(\hat{\mu}-\mu)^2]$, or in other words, the "expected squared error" with respect to the true expectation $\mu$.

We could decrease variance by simply cranking up our sample count $n$, but this is computationally expensive. While variance is proportional to $\frac{1}{n}$, the standard deviation (which has the same units as $\mu$ and is proportional to the confidence interval of our estimator) decreases at $\frac{1}{\sqrt{n}}$, a far slower rate. This is highly unsatisfactory.

Instead, we will use variance reduction techniques to construct estimators with the same expected value as our MC estimator, and still achieve a lower variance with the same number of samples.

## Antithetic Sampling

Antithetic sampling is a slight extension to the basic Monte Carlo estimator. The idea is to pair each sample $X_i$ with an "antithetic sample" $Y_i$ such that $X_i$ and $Y_i$ are identically distributed, but negatively correlated. Our single-sample estimate is given by:

$$\hat{\mu} = \frac{X_i + Y_i}{2}$$

$\mathbb{E}[X_i] = \mathbb{E}[Y_i]$ so this estimator has the same expected value as the plain MC estimator.

\begin{align*} \Var[\hat{\mu}] & = \frac{1}{4}\Var[X_i + Y_i] \\ & = \frac{1}{4}(\Var[X_i] + \Var[Y_i] + 2 \text{Cov}(X_i,Y_i)) && \Var[X_i] = \Var[Y_i] = \sigma^2 \\ & = \frac{1}{2}(\sigma^2 + \sigma^2\beta) && \beta = \text{Corr}(X_i,Y_i) = \frac{\text{Cov}(X_i,Y_i)}{\sigma^2} \end{align*}

If we sample $N$ of these pairs (for a total of $2N$ samples), the variance is $\frac{1}{2n}(\sigma^2 + \sigma^2\beta)$, and it's easy to see that this estimator has a strictly lower variance than the plain Monte Carlo as long as $\beta$ is negative.

Unfortunately, there isn't a general method for generating good antithetic samples. If the random sample of interest $X_i = f(U_i)$ is a transformation of a uniform random variable $U_i \sim \text{Uniform}(0,1)$, we can try "flipping" the input and letting $Y_i = f(1-U_i)$. $1-U_i$ is distributed identically to $U_i$, and assuming $f(U_i)$ is monotonic, then $Y_i$ will be large when $X_i$ is small, and vice versa.

However, this monotonicity assumption does not hold in general for 2D functions. In fact, for our choice of $f(U)$, it is only slightly negatively correlated ($\beta \approx -0.07$).

function antithetic(N)
u1_x = rand(N); u2_x = rand(N);
u1_y = 1-u1_x; u2_y = 1-u2_x;
x = f(u1_x,u2_x);
y = f(u1_y,u2_y);
return [u1_x; u1_y], [u2_x; u2_y], [x;y], (u1_x,u2_x,x,u1_y,u2_y,y)
end


Notice how each $X_i$ sample ("+") is paired with an antithetic sample $Y_i$ ("x") mirrored across the origin.


u1,u2,x,ii=antithetic(N)
u1_x,u2_x,x,u1_y,u2_y,y = ii
scatter(u1_x,u2_x,
m=(10,:cross),markercolor=x,
size=(400,400),label="X samples",
xlabel="u1",ylabel="u2",title="Antithetic Sampling")
scatter!(u1_y,u2_y,
m=(10,:xcross),markercolor=y,
label="antithetic samples")


## Control Variates

Control Variates is similar to Antithetic Sampling, in that we will be pairing every sample $X_i$ with a sample of $Y_i$ (the "control variate"), and exploiting correlations between $X_i$ and $Y_i$ in order to decrease variance.

However, $Y_i$ is no longer sampled from the same distribution as $X_i$. Typically $Y_i$ is some random variable that is computationally easy to generate (or is generated as a side effect) from the generative process for $X_i$. For instance, the weather report that forecasts temperature $X$ also happens to predict rainfall $Y$. We can use rainfall measurements $Y_i$ to improve our estimates of  "average temperature" $\mathbb{E}[X]$.

Suppose control variate $Y$ has a known expected value $\bar{\mu}$. The single-sample estimator is given by

$$\hat{\mu} = X_i - c (Y_i - \mathbb{E}[Y_i])$$

Where $c$ is some constant. In expectation, $Y_i$ samples are cancelled out by $\bar{\mu}$ so this estimator is unbiased.

It's variance is

\begin{align*} \Var[\hat{\mu}] & = \Var[X_i - c (Y_i - \mathbb{E}[Y_i])] \\ & = \Var[X_i - c Y_i] && \mathbb{E}[Y_i] \text{ is not random, and has zero variance} \\ & = \Var[X_i] + c^2 \Var[Y_i] + 2cCov(X_i,Y_i) \end{align*}

Simple calculus tells us that the value of $c$ that minimizes $Var[\hat{\mu}]$ is

$$c^* = -\frac{Cov(X_i,Y_i)}{\Var(Y_i)}$$

so the best variance we can achieve is

$$\Var[\hat{\mu}] = \Var[X_i] - \frac{Cov(X_i,Y_i)^2}{\Var(Y_i)}$$

The variance of our estimator is reduced as long as $Cov(X_i,Y_i)^2 \neq 0$. Unlike antithetic sampling, control variates can be either positively or negatively correlated with $X_i$.

The catch here is that $c^*$ will need to be estimated. There are a couple ways:

1. Forget optimality, just let $c^* = 1$! This naive approach still works better than basic Monte Carlo because it exploits covariance structure.
2. Estimate $c^*$ from samples $X_i$, $Y_i$. Doing so introduces a dependence between $c^*$ and $Y_i$, which makes the estimator biased, but in practice this bias is small compared to the standard error of the estimate.
3. An unbiased alternative to (2) is to compute $c^*$ from a small set of $m < n$ "pilot samples", and then throw the pilot samples away before collecting the real $n$ samples.

In the case of our $f(u)$, we know that the modes of the distribution are scattered in the corners of the unit square. In polar coordinates centered at $(0.5,0.5)$, that's about $k \frac{\pi}{2} + \frac{\pi}{4}, k=0,1,2,3$. Let's choose $Y = -cos(4\theta)$, where $\theta$ is uniform about the unit circle.


cv(u1,u2) = -cos(4*atan((u2-.5)./(u1-.5)))
pcv = plot(linspace(0,1,100),linspace(0,1,100),cv,st=:contourf,title="Y",xlabel="u1",ylabel="u2")
plot(pcv,p0,layout=2)



\begin{align*} \mathbb{E}_U[Y] & = \int_0^1 \int_0^1 -\cos(4\theta) du_1 du_2 \\ & = \int_0^1 \int_0^1 - \frac{1-\tan^2(2\theta)}{1+\tan^2(2\theta)} du_1 du_2 \\ & = \int_0^1 \int_0^1 - \frac{1-(\frac{2\tan \theta}{1 - \tan^2(\theta)})^2}{1+(\frac{2\tan \theta}{1 - \tan^2(\theta)})^2} du_1 du_2 && \tan(\theta) = \frac{y}{x} \\ & = \pi-3 \end{align*}

In practice though, we don't have an analytic way to compute $E[Y]$, but we can estimate it in an unbiased manner from our pilot samples.


function controlvariates(N)
# generate pilot samples
npilot = 30
u1 = rand(npilot); u2 = rand(npilot)
x = f(u1,u2)
y = cv(u1,u2)
β=corr(x,y)
c = -β/var(y)
μ_y = mean(y) # estimate mean from pilot samples
# real samples
u1 = rand(2*N); u2 = rand(2*N)
x = f(u1,u2)
y = cv(u1,u2)
return u1,u2,x+c*(y-μ_y),(μ_y, β, c)
end


We can verify from our pilot samples that the covariance is quite large:

βs=[]
for i=1:1000
u1,u2,x,ii=controlvariates(100)
βs = [βs; ii[2]]
end
mean(βs) # 0.367405



## Stratified Sampling

The idea behind stratified sampling is very simple: instead of estimating $E[X]$ over the domain $U$, we break up the domain into $K$ strata, estimate the conditional expectation $X$ over each strata separately, then average our per-strata estimates.

In the above picture, the sample space is divided into 50 strata going across the horizontal axis and there are 2 points placed in each strata. There are lots of ways to stratify - you can stratify in a 1D grid, 2D grid, concentric circles with varying areas, whatever you like.

We're not restricted to chopping up $U$ into little fiefdoms as above (although it is a natural choice). Instead, we introduce a "stratification variable" $Y$ with discrete values $y_1,...,y_k$. Our estimators for each strata correspond to the conditional expectations $\mathbb{E}[X|Y=y_i] \text{ for } i=1...K$.

The estimator is given by a weighted combination of strata estimates with the respective probability of each strata:


function stratified(N)
# N strata, 2 samples per strata = 2N samples
u1 = Float64[]
u2 = Float64[]
for y=range(0,1/N,N)
for i=1:2
append!(u1,[y+1/N*rand()])
append!(u2,[rand()])
end
end
return u1,u2,f(u1,u2),[]
end


$$\hat{mu} = \sum_i^k p(y_i) \mathbb{E}[X|Y=y_i]$$

The variance of this estimator is

$$\sum_i^k p(y_i)^2 (V_i)$$

Where $V_i$ is the variance of estimator $i$.

For simplicity, let's assume:

1. We stratify along an evenly-spaced grid, i.e. $p(y_i) = \frac{1}{k}$
2. We are using a simple MC estimator within each strata, $V_i = \frac{\sigma_i^2}{n_i}$, with $n_i = \frac{2n}{k}$ samples each (so the total number of samples is $2n$, the same as the plain MC estimator). $\sigma_i^2$ is the variance of an individual sample within each strata.

Then the variance is

$$\sum_i^k (\frac{1}{k} p(y_i)) (\frac{k}{2n}\sigma_i^2) = \frac{1}{2n} \sum_i^k p(y_i) \sigma_i^2 \leq \frac{1}{2n} \sigma^2$$

Intuitively, the variance within a single grid cell $\sigma_i^2$ should be smaller than the variance of the plain MC estimator, because samples are closer together.

Here's a proof that $\sum_i^k p(y_i)\sigma_i^2 \leq \sigma^2$:

Let $\mu_i = \mathbb{E}[X|Y=y_i], \sigma_i^2 = \Var[X|Y]$

\begin{align*} \mathbb{E}[X^2] & = \mathbb{E}_Y[\mathbb{E}_X[X^2|Y]] && \text{Tower property} \\ & = \sum_i^k p(y_i) \mathbb{E}[X^2|Y=y_i] \\ & = \sum_i^k p(y_i) (\sigma_i^2 + \mu_i^2) && \text{via } \Var[X|Y] = \mathbb{E}[X^2|Y] - \mathbb{E}[X|Y=y_i]^2 \\ \mathbb{E}[X] & = \mathbb{E}_Y[\mathbb{E}_X[X|Y]] \\ & =\sum_i^k p(y_i) \mu_i \\ \sigma^2 & = \mathbb{E}[X^2] - \mathbb{E}[X]^2 \\ & =\sum_i^k p(y_i) \sigma_i^2 + \big[\sum_i^k p(y_i) \mu_i^2 - (\sum_i^k p(y_i) \mu_i)^2\big] \end{align*}

Note that the term

$$\sum_i^k p(y_i) \mu_i^2 - (\sum_i^k p(y_i) \mu_i)^2$$

corresponds to the variance of a categorical distribution that takes value $\mu_i$ with probability $p(y_i)$. This term is strictly non-negative, and therefore the inequality  $\sum_i^k \frac{k}{2n}\sigma_i^2 \leq \sigma^2$ holds.

This yields an interesting observation: when we do stratified sampling, we are eliminating the variance between strata means. This implies that we get the most gains by choosing strata such that samples vary greatly between strata.

Here's a comparison of the empirical vs. theoretical variance with respect to number of strata:


# theoretical variance
strat_var = zeros(50)
for N=1:50
sigma_i_2 = zeros(N)
y = range(0,1/N,N)
for i=1:N
x1 = y[i].+1/N*rand(100)
x2 = rand(100)
sigma_i_2[i]=var(f(x1,x2))
end
n_i = 2
strat_var[N] = sum((1/(N))^2*(1/n_i*sigma_i_2))
end
s = estimatorVar(stratified)
plot(1:50,s,label="Empirical Variance",title="Stratified Sampling")
plot!(1:50,strat_var,label="Theoretical Variance")


Remarks:

• Stratified Sampling can actually be combined with other variance reduction techniques - for instance, when computing the conditional estimate within a strata, you are free to use antithetic sampling, control variates within the strata. You can even do stratified sampling within your strata! This makes sense if your sampling domain is already partitioned into a bounding volume hierarchy, such as a KD-Tree.
• Antithetic sampling can be regarded as a special case of stratified sampling with 2 strata. The inter-strata variance you are eliminating here is the self-symmetry in the distribution of $X$ (the covariance between one half and another half of the samples).

## Discussion

Here's a plot comparing the standard deviations of these estimators as a function of sample size. Keep in mind that your choice of variance reduction technique should depend on the problem, and no method is objectively better than the others. This is a very simple toy problem so all estimators perform pretty well.

I'm currently working on a follow-up tutorial on Importance Sampling. The explanation is a bit long, so I'm keeping it separate from this tutorial.

A toy example is all well and good, but where does variance reduction come up in modern scientific computing?

### Scientific Data Analysis

At a very fundamental level, variance in scientific data either suggests an underlying noisy process or errors. Variance degrades the statistical significance of findings, so it's desirable to conduct experiments and measurements such that the variance of findings is low.

### Quantitative Finance

In finance, we often simulate expected returns of portfolios by sampling over all possible movements in the market. Certain rare events (for example, a terrorist attack) are highly unlikely but have such a big impact that on the economy that we want to account for it in our expectation. Modeling these "black swan" events are of considerable interest to financial & actuarial industries.

### Deep Learning

In minibatch stochastic gradient descent (the "hammer" of Deep Learning), we compute the parameter update using the expected gradient of the total loss across the training dataset. It takes too long to consider every point in the dataset, so we estimate gradients via stochastic minibatch. This is nothing more than a Monte Carlo estimate.

$$E_x[\nabla_\theta J(x;\theta)] = \int p(x) \nabla_\theta J(x;\theta) dx \approx \frac{1}{m}\sum_i^m \nabla_\theta J(x_i;\theta)$$

The less variance we have, the sooner our training converges.

### Computer Graphics

In physically based rendering, we are trying to compute the total incoming light arriving at each camera pixel. The vast majority of paths contribute zero radiance, thus making our samples noisy. Decreasing variance allows us to "de-speckle" the pictures.

\begin{align*} \mathcal{L}(x,\omega_o) & = \mathcal{L}_e(x,\omega_o) + \int_\Omega{f_r(x,\omega_i,\omega_o)\mathcal{L}(x^\prime,\omega_i)(\omega_o \cdot n_x) d\omega_i} \\ & \approx \mathcal{L}_e(x,\omega_o) + \frac{1}{m}\sum_i^m{f_r(x,\omega_i,\omega_o)\mathcal{L}(x^\prime,\omega_i)(\omega_o \cdot n_x)} \end{align*}

## Sunday, August 7, 2016

### A Beginner's Guide to Variational Methods: Mean-Field Approximation

Variational Bayeisan (VB) Methods are a family of techniques that are very popular in statistical Machine Learning. VB methods allow us to re-write statistical inference problems (i.e. infer the value of a random variable given the value of another random variable) as optimization problems (i.e. find the parameter values that minimize some objective function).

This inference-optimization duality is powerful because it allows us to use the latest-and-greatest optimization algorithms to solve statistical Machine Learning problems (and vice versa, minimize functions using statistical techniques).

This post is an introductory tutorial on Variational Methods. I will derive the optimization objective for the simplest of VB methods, known as the Mean-Field Approximation. This objective, also known as the Variational Lower Bound, is exactly the same one used in Variational Autoencoders (a neat paper which I will explain in a follow-up post).

1. Preliminaries and Notation
2. Problem formulation
3. Variational Lower Bound for Mean-field Approximation
4. Forward KL vs. Reverse KL
5. Connections to Deep Learning

## Preliminaries and Notation

This article assumes that the reader is familiar with concepts like random variables, probability distributions, and expectations. Here's a refresher if you forgot some stuff. Machine Learning & Statistics notation isn't standardized very well, so it's helpful to be really precise with notation in this post:
• Uppercase $X$ denotes a random variable
• Uppercase $P(X)$ denotes the probability distribution over that variable
• Lowercase $x \sim P(X)$ denotes a value $x$ sampled ($\sim$) from the probability distribution $P(X)$ via some generative process.
• Lowercase $p(X)$ is the density function of the distribution of $X$. It is a scalar function over the measure space of $X$.
• $p(X=x)$ (shorthand $p(x)$) denotes the density function evaluated at a particular value $x$.

Many academic papers use the terms "variables", "distributions", "densities", and even "models" interchangeably. This is not necessarily wrong per se, since $X$, $P(X)$, and $p(X)$ all imply each other via a one-to-one correspondence. However, it's confusing to mix these words together because their types are different (it doesn't make sense to sample a function, nor does it make sense to integrate a distribution).

We model systems as a collection of random variables, where some variables ($X$) are "observable", while other variables ($Z$) are "hidden". We can draw this relationship via the following graph:

The edge drawn from $Z$ to $X$ relates the two variables together via the conditional distribution $P(X|Z)$.

Here's a more concrete example: $X$ might represent the "raw pixel values of an image", while $Z$ is a binary variable such that $Z=1$ "if $X$ is an image of a cat".

$X =$
$P(Z=1) = 1$ (definitely a cat)

$X=$
$P(Z=1) = 0$ (definitely not a cat)

$X =$
$P(Z=1) = 0.1$ (sort of cat-like)

Bayes' Theorem gives us a general relationship between any pair of random variables:

$$p(Z|X) = \frac{p(X|Z)p(Z)}{p(X)}$$

The various pieces of this are associated with common names:

$p(Z|X)$ is the posterior probability: "given the image, what is the probability that this is of a cat?" If we can sample from $z \sim P(Z|X)$, we can use this to make a cat classifier that tells us whether a given image is a cat or not.

$p(X|Z)$ is the likelihood: "given a value of $Z$  this computes how "probable" this image $X$ is under that category ({"is-a-cat" / "is-not-a-cat"}). If we can sample from $x \sim P(X|Z)$, then we generate images of cats and images of non-cats just as easily as we can generate random numbers. If you'd like to learn more about this, see my other articles on generative models: [1], [2].

$p(Z)$ is the prior probability. This captures any prior information we know about $Z$ - for example, if we think that 1/3 of all images in existence are of cats, then $p(Z=1) = \frac{1}{3}$ and $p(Z=0) = \frac{2}{3}$.

### Hidden Variables as Priors

This is an aside for interested readers. Skip to the next section to continue with the tutorial.

The previous cat example presents a very conventional example of observed variables, hidden variables, and priors. However, it's important to realize that the distinction between hidden / observed variables is somewhat arbitrary, and you're free to factor the graphical model however you like.

We can re-write Bayes' Theorem by swapping the terms:

$$\frac{p(Z|X)p(X)}{p(Z)} = p(X|Z)$$

The "posterior" in question is now $P(X|Z)$.

Hidden variables can be interpreted from a Bayesian Statistics framework as prior beliefs attached to the observed variables. For example, if we believe $X$ is a multivariate Gaussian, the hidden variable $Z$ might represent the mean and variance of the Gaussian distribution. The distribution over parameters $P(Z)$ is then a prior distribution to $P(X)$.

You are also free to choose which values $X$ and $Z$ represent. For example, $Z$ could instead be "mean, cube root of variance, and $X+Y$ where $Y \sim \mathcal{N}(0,1)$". This is somewhat unnatural and weird, but the structure is still valid, as long as $P(X|Z)$ is modified accordingly.

You can even "add" variables to your system. The prior itself might be dependent on other random variables via $P(Z|\theta)$, which have prior distributions of their own $P(\theta)$, and those have priors still, and so on. Any hyper-parameter can be thought of as a prior. In Bayesian statistics, it's priors all the way down.

## Problem Formulation

The key problem we are interested in is posterior inference, or computing functions on the hidden variable $Z$. Some canonical examples of posterior inference:
• Given this surveillance footage $X$, did the suspect show up in it?
• Given this twitter feed $X$, is the author depressed?
• Given historical stock prices $X_{1:t-1}$, what will $X_t$ be?

We usually assume that we know how to compute functions on likelihood function $P(X|Z)$ and priors $P(Z)$.

The problem is, for complicated tasks like above, we often don't know how to sample from $P(Z|X)$ or compute $p(X|Z)$. Alternatively, we might know the form of $p(Z|X)$, but the corresponding computation is so complicated that we cannot evaluate it in a reasonable amount of time. We could try to use sampling-based approaches like MCMC, but these are slow to converge.

## Variational Lower Bound for Mean-field Approximation

The idea behind variational inference is this: let's just perform inference on an easy, parametric distribution $Q_\phi(Z|X)$ (like a Gaussian) for which we know how to do posterior inference, but adjust the parameters $\phi$ so that $Q_\phi$ is as close to $P$ as possible.

This is visually illustrated below: the blue curve is the true posterior distribution, and the green distribution is the variational approximation (Gaussian) that we fit to the blue density via optimization.

What does it mean for distributions to be "close"? Mean-field variational Bayes (the most common type) uses the Reverse KL Divergence to as the distance metric between two distributions.

$$KL(Q_\phi(Z|X)||P(Z|X)) = \sum_{z \in Z}{q_\phi(z|x)\log\frac{q_\phi(z|x)}{p(z|x)}}$$

Reverse KL divergence measures the amount of information (in nats, or units of $\frac{1}{\log(2)}$ bits) required to "distort" $P(Z)$ into $Q_\phi(Z)$. We wish to minimize this quantity with respect to $\phi$.

By definition of a conditional distribution, $p(z|x) = \frac{p(x,z)}{p(x)}$. Let's substitute this expression into our original $KL$ expression, and then distribute:

\begin{align} KL(Q||P) & = \sum_{z \in Z}{q_\phi(z|x)\log\frac{q_\phi(z|x)p(x)}{p(z,x)}} && \text{(1)} \\ & = \sum_{z \in Z}{q_\phi(z|x)\big(\log{\frac{q_\phi(z|x)}{p(z,x)}} + \log{p(x)}\big)} \\ & = \Big(\sum_{z}{q_\phi(z|x)\log{\frac{q_\phi(z|x)}{p(z,x)}}}\Big) + \Big(\sum_{z}{\log{p(x)}q_\phi(z|x)}\Big) \\ & = \Big(\sum_{z}{q_\phi(z|x)\log{\frac{q_\phi(z|x)}{p(z,x)}}}\Big) + \Big(\log{p(x)}\sum_{z}{q_\phi(z|x)}\Big) && \text{note: \sum_{z}{q(z)} = 1 } \\ & = \log{p(x)} + \Big(\sum_{z}{q_\phi(z|x)\log{\frac{q_\phi(z|x)}{p(z,x)}}}\Big) \\ \end{align}

To minimize $KL(Q||P)$ with respect to variational parameters $\phi$, we just have to minimize $\sum_{z}{q_\phi(z|x)\log{\frac{q_\phi(z|x)}{p(z,x)}}}$, since $\log{p(x)}$ is fixed with respect to $\phi$. Let's re-write this quantity as an expectation over the distribution $Q_\phi(Z|X)$.

\begin{align} \sum_{z}{q_\phi(z|x)\log{\frac{q_\phi(z|x)}{p(z,x)}}} & = \mathbb{E}_{z \sim Q_\phi(Z|X)}\big[\log{\frac{q_\phi(z|x)}{p(z,x)}}\big]\\ & = \mathbb{E}_Q\big[ \log{q_\phi(z|x)} - \log{p(x,z)} \big] \\ & = \mathbb{E}_Q\big[ \log{q_\phi(z|x)} - (\log{p(x|z)} + \log(p(z))) \big] && \text{(via \log{p(x,z)=p(x|z)p(z)}) }\\ & = \mathbb{E}_Q\big[ \log{q_\phi(z|x)} - \log{p(x|z)} - \log(p(z))) \big] \\ \end{align} \\

Minimizing this is equivalent to maximizing the negation of this function:

\begin{align} \text{maximize } \mathcal{L} & = -\sum_{z}{q_\phi(z|x)\log{\frac{q_\phi(z|x)}{p(z,x)}}} \\ & = \mathbb{E}_Q\big[ -\log{q_\phi(z|x)} + \log{p(x|z)} + \log(p(z))) \big] \\ & = \mathbb{E}_Q\big[ \log{p(x|z)} + \log{\frac{p(z)}{ q_\phi(z|x)}} \big] && \text{(2)} \\ \end{align}

In literature, $\mathcal{L}$ is known as the variational lower bound, and is computationally tractable if we can evaluate $p(x|z), p(z), q(z|x)$. We can further re-arrange terms in a way that yields an intuitive formula:

\begin{align*} \mathcal{L} & = \mathbb{E}_Q\big[ \log{p(x|z)} + \log{\frac{p(z)}{ q_\phi(z|x)}} \big] \\ & = \mathbb{E}_Q\big[ \log{p(x|z)} \big] + \sum_{Q}{q(z|x)\log{\frac{p(z)}{ q_\phi(z|x)}}} && \text{Definition of expectation} \\ & = \mathbb{E}_Q\big[ \log{p(x|z)} \big] - KL(Q(Z|X)||P(Z)) && \text{Definition of KL divergence} && \text{(3)} \end{align*}

If sampling $z \sim Q(Z|X)$ is an "encoding" process that converts an observation $x$ to latent code $z$, then sampling $x \sim Q(X|Z)$ is a "decoding" process that reconstructs the observation from $z$.

It follows that $\mathcal{L}$ is the sum of the expected "decoding" likelihood (how good our variational distribution can decode a sample of $Z$ back to a sample of $X$), plus the KL divergence between the variational approximation and the prior on $Z$. If we assume $Q(Z|X)$ is conditionally Gaussian, then prior $Z$ is often chosen to be a diagonal Gaussian distribution with mean 0 and standard deviation 1.

Why is $\mathcal{L}$ called the variational lower bound? Substituting $\mathcal{L}$ back into Eq. (1), we have:

\begin{align*} KL(Q||P) & = \log p(x) - \mathcal{L} \\ \log p(x) & = \mathcal{L} + KL(Q||P) && \text{(4)} \end{align*}

The meaning of Eq. (4), in plain language, is that $p(x)$, the log-likelihood of a data point $x$ under the true distribution, is $\mathcal{L}$, plus an error term $KL(Q||P)$ that captures the distance between $Q(Z|X=x)$ and $P(Z|X=x)$ at that particular value of $X$.

Since $KL(Q||P) \geq 0$, $\log p(x)$ must be greater than $\mathcal{L}$. Therefore $\mathcal{L}$ is a lower bound for $\log p(x)$. $\mathcal{L}$ is also referred to as evidence lower bound (ELBO), via the alternate formulation:

$$\mathcal{L} = \log p(x) - KL(Q(Z|X)||P(Z|X)) = \mathbb{E}_Q\big[ \log{p(x|z)} \big] - KL(Q(Z|X)||P(Z))$$

Note that $\mathcal{L}$ itself contains a KL divergence term between the approximate posterior and the prior, so there are two KL terms in total in $\log p(x)$.

## Forward KL vs. Reverse KL

KL divergence is not a symmetric distance function, i.e. $KL(P||Q) \neq KL(Q||P)$ (except when $Q \equiv P$) The first is known as the "forward KL", while the latter is "reverse KL". So why do we use Reverse KL? This is because the resulting derivation would require us to know how to compute $p(Z|X)$, which is what we'd like to do in the first place.

I really like Kevin Murphy's explanation in the PML textbook, which I shall attempt to re-phrase here:

Let's consider the forward-KL first. As we saw from the above derivations, we can write KL as the expectation of a "penalty" function $\log \frac{p(z)}{q(z)}$ over a weighing function $p(z)$.

\begin{align*} KL(P||Q) & = \sum_z p(z) \log \frac{p(z)}{q(z)} \\ & = \mathbb{E}_{p(z)}{\big[\log \frac{p(z)}{q(z)}\big]}\\ \end{align*}

The penalty function contributes loss to the total KL wherever $p(Z) > 0$. For $p(Z) > 0$, $\lim_{q(Z) \to 0} \log \frac{p(z)}{q(z)} \to \infty$. This means that the forward-KL will be large wherever $Q(Z)$ fails to "cover up" $P(Z)$.

Therefore, the forward-KL is minimized when we ensure that $q(z) > 0$ wherever $p(z)> 0$. The optimized variational distribution $Q(Z)$ is known as "zero-avoiding" (density avoids zero when $p(Z)$ is zero).

Minimizing the Reverse-KL has exactly the opposite behavior:

\begin{align*} KL(Q||P) & = \sum_z q(z) \log \frac{q(z)}{p(z)} \\ & = \mathbb{E}_{p(z)}{\big[\log \frac{q(z)}{p(z)}\big]} \end{align*}

If $p(Z) = 0$, we must ensure that the weighting function $q(Z) = 0$ wherever denominator $p(Z) = 0$, otherwise the KL blows up. This is known as "zero-forcing":

So in summary, minimizing forward-KL "stretches" your variational distribution $Q(Z)$ to cover over the entire $P(Z)$ like a tarp, while minimizing reverse-KL "squeezes" the $Q(Z)$ under $P(Z)$.

It's important to keep in mind the implications of using reverse-KL when using the mean-field approximation in machine learning problems. If we are fitting a unimodal distribution to a multi-modal one, we'll end up with more false negatives (there is actually probability mass in $P(Z)$ where we think there is none in $Q(Z)$).

## Connections to Deep Learning

Variational methods are really important for Deep Learning. I will elaborate more in a later post, but here's a quick spoiler:
1. Deep learning is really good at optimization (specifically, gradient descent) over very large parameter spaces using lots of data.
2. Variational Bayes give us a framework with which we can re-write statistical inference problems as optimization problems.
Combining Deep learning and VB Methods allow us to perform inference on extremely complex posterior distributions. As it turns out, modern techniques like Variational Autoencoders optimize the exact same mean-field variational lower-bound derived in this post!

Thanks for reading, and stay tuned!

## Sunday, July 24, 2016

### Why Randomness is Important for Deep Learning

This afternoon I attempted to explain to my mom why Randomness is important for Deep Learning without using any jargon from probability, statistics, or machine learning.

The exercise was partially successful. Maybe. I still don't think she knows what Deep Learning is, besides that I am a big fan of it and use it for my job.

 I'm a big fan of Deep Learning

This post is a slightly more technical version of the explanation I gave my mom, with the hope that it will help Deep Learning practitioners to better understand what is going on in neural networks.

If you are getting started into Deep Learning research, you may discover that there are a whole bunch of seemingly arbitrary techniques used to train neural networks, with very little theoretical justification besides "it works". For example: dropout regularization, adding gradient noise,, asynchronous stochastic descent.

What do all these hocus pocus techniques have in common? They incorporate randomness!

Random noise actually is crucial for getting DNNs to work well:

1. Random noise allows neural nets to produce multiple outputs given the same instance of input.
2. Random noise limits the amount of information flowing through the network, forcing the network to learn meaningful representations of data.
3. Random noise provides "exploration energy" for finding better optimization solutions during gradient descent.

## Single Input, Multiple Output

Suppose you are training a deep neural network (DNN) to classify images.

For each cropped region, the network learns to convert an image into a number representing a class label, such as “dog” or "person".

That’s all very well and good, and these kind of DNNs do not require randomness in their inference model. After all, any image of a dog should be mapped to the “dog” label, and there is nothing random about that.

Now suppose you are training a deep neural network (DNN) to play Go (game). In the case of the image below, the DNN has to make the first move.

If you used the same deterministic strategy described above, you will find that this network fails to give good results. Why? Because there is no single “optimal starting move” - for each possible stone placement on the board, there is a equally good stone placement on the other side board, via rotational symmetry. There are multiple best answers.

If the network is deterministic and only capable of picking one output per input, then the optimization process will force the network to choose the move that averages all the best answers, which is smack-dab in the middle of the board. This behavior is highly undesirable, as the center of the board is generally regarded as a bad starting move.

Hence, randomness is important when you want the network to be able to output multiple possibilities given the same input, rather than generating the same output over and over again. This is crucial when there are underlying symmetries in the action space - incorporating randomness literally helps us break out of the "stuck between two bales of hay" scenario.

Similarly, if we are training a neural net to compose music or draw pictures, we don’t want it to always draw the same thing or play the same music every time it is given a blank sheet of paper. We want some notion of “variation”, "surprise", and “creativity” in our generative models.

One approach to incorporating randomness into DNNs is to keep the network deterministic, but have its outputs be the parameters of a probability distribution, which we can then draw samples from using conventional sampling methods to generate “stochastic outputs”.

Deepmind's AlphaGo utilized this principle: given an image of a Go board, it outputs the probability of winning given each possible move. The practice of modeling a distribution after the output of the network is commonly used in other areas of Deep Reinforcement Learning.

## Randomness and Information Theory

During my first few courses in probability & statistics, I really struggled to understand the physical meaning of randomness. When you flip a coin, where does this randomness come from? Is randomness just deterministic chaos? Is it possible for something to be actually random

To be honest, I still don't fully understand.

Information theory offers a definition of randomness that is grounded enough to use without being kept wide awake at night: "randomness" is nothing more than the "lack of information".

More specifically, the amount of information in an object is the length  (e.g. in bits, kilobytes, etc.)  of the shortest computer program needed to fully describe it. For example, the first 1 million digits of $\pi = 3.14159265....$ can be represented as a string of length 1,000,002 characters, but it can be more compactly represented using 70 characters, via an implementation of the Leibniz Formula:

The above program is nothing more than a compressed version of a million digits of $\pi$. A more concise program could probably express the first million digits of $\pi$ in far fewer bits.

Under this interpretation, randomness is "that which cannot be compressed". While the first million digits of $\pi$ can be compressed and are thus not random, empirical evidence suggests (but has not proven) that $\pi$ itself is normal number, and thus the amount of information encoded in $\pi$ is infinite.

Consider a number $a$ that is equal to the first trillion digits of $\pi$, $a = 3.14159265...$. If we add to that a uniform random number $r$ that lies in the range $(-0.001,0.001)$, we get a number that ranges in between $3.14059...$ and $3.14259...$. The resulting number $a + r$ now only carries ~three digits worth of information, because the process of adding random noise destroyed any information carried after the hundredth's decimal place.

## Limiting Information in Neural Nets

What does this definition of randomness have to deal with randomness?

Another way randomness is incorporated into DNNs is by injecting noise directly into the network itself, rather than using the DNN to model a distribution. This makes the task "harder" to learn as the network has to overcome these internal "perturbations".

Why on Earth would you want to do this? The basic intuition is that noise limits the amount of information you can pass through a channel.

Consider an autoencoder, a type of neural network architecture that attempts to learn an efficient encoding of data by "squeezing" the input into fewer dimensions in the middle and re-constituting the original data at the other end. A diagram is shown below:

During inference, inputs flow from the left through the nodes of the network and come out the other side, just like a pipe.

If we consider a theoretical neural network that operates on real numbers (rather than floating point numbers), then without noise in the network, every layer of the DNN actually has access to infinite information bandwidth.

Even though we are squeezing representations (the pipe) into fewer hidden units, the network could still learn to encode the previous layer's data into the decimal point values without actually learning any meaningful features. In fact, we could represent all the information in the network with a single number. This is undesirable.

By limiting the amount of information in a network, we force it to learn compact representations of input features. Several ways to go about this:

• Variational autoencoders (VAE) add Gaussian noise to the hidden layer. This noise destroys "excess information," forcing the network to learn compact representations of data.
• Closely related to VAE noise (possibly equivalent?) to this is the idea of Dropout Regularization - randomly zeroing out some fraction of units during training. Like the VAE, dropout noise forces the network to learn useful information under limited bandwidth.
• Deep Networks with Stochastic Depth - similar idea to dropout, but at a per-layer level rather than per-unit level.
• There's a very interesting paper called Binarized Neural Networks that uses binary weights and activations in the inference pass, but real-valued gradients in the backward pass. The source of noise comes from the fact that the gradient is a noisy version of the binarized gradient. While BinaryNets are not necessarily more powerful than regular DNNs, individual units can only encode one bit of information, which regularizes against two features being squeezed into a single unit via floating point encoding.
More efficient compression schemes mean better generalization at test-time, which explains why dropout works so well for over-fitting. If you decide to use regular autoencoders over variational autoencoders, you must use a stochastic regularization trick such as dropout to control how many bits your compressed features should be, otherwise you will likely over-fit.

I think VAEs are objectively superior because they are easy to implement and allow you to specify exactly how many bits of information are passing through each layer.

## Exploration "Energy"

DNNs are usually trained via some variant of gradient descent, which basically amounts to finding the parameter update that "rolls downhill" along some loss function. When you get to the bottom of the deepest valley, you've found the best possible parameters for your neural net.

The problem with this approach is that neural network loss surfaces have a lot of local minima and plateaus. It's easy to get stuck in a small dip or a flat portion where the slope is already zero (local minima) but you are not done yet.

The third interpretation of how randomness assists Deep Learning models is based on the idea of the exploration.

Because the datasets used to train DNNs are huge, it’s too expensive to compute the gradient across terabytes of data for every single gradient descent step. Instead, we use stochastic gradient descent (SGD), where we just compute the average gradient across a small subset of the dataset chosen randomly from the dataset.

In evolution, if the success of a species is modeled by random variable $X$, then random mutation or noise increases the variance of $X$ - its progeny could either be far better off (adaptation, poison defense) or far worse off (lethal defects, sterility).

In numerical optimization, this "genetic mutability" is called "thermodynamic energy", or Temperature that allows the parameter update trajectory to not always "roll downhill", but occasionally bounce out of a local minima or "tunnel through" hills.

This is all deeply related to the Exploration-vs.-Exploitation tradeoff as formulated in RL. Training a purely deterministic DNN with zero gradient noise has zero exploitation capabilities - it converges straight to the nearest local minima, however shallow.

Using stochastic gradients (either via small minibatches or literally adding noise to the gradients themselves) is an effective way to allow the optimization to do a bit of "searching" and "bouncing" out of weak local minima. Asynchronous stochastic gradient descent, in which many machines are performing gradient descent in parallel, is another possible source of noise.

This "thermodynamic" energy ensures symmetry-breaking during early stages of training, to ensure that all the gradients in a layer are not synchronized to the same values. Not only does noise perform symmetry breaking in action space of the neural net, but noise also performs symmetry breaking in the parameter space of the neural net.

## Closing Thoughts

I find it really interesting that random noise actually helps Artificial Intelligence algorithms to avoid over-fitting and explore the solution space during optimization or Reinforcement Learning. It raises interesting philosophical questions on whether the inherent noisiness of our neural code is a feature, not a bug.

One theoretical ML research question I am interested in is whether all these neural network training tricks are actually variations of some general regularization theorem. Perhaps theoretical work on compression will be really useful for understanding this.

It would be interesting to check the information capacity of various neural networks relative to hand-engineered feature representations, and see how that relates to overfitting tendency or quality of gradients. It's certainly not trivial to measure the information capacity of a network with dropout or trained via SGD, but I think it can be done. For example, constructing a database of synthetic vectors whose information content (in bits, kilobytes, etc) is exactly known, and seeing how networks of various sizes, in combination with techniques like dropout, deal with learning a generative model of that dataset.