Monday, November 20, 2017

Expressivity, Trainability, and Generalization in Machine Learning

Update 11/29: I'm looking for translators to help translate this post into different languages, particularly Chinese (中文), Spanish (Español), Korean (한국어), Russian (ру́сский язы́к), and Japanese (日本語). Please email me at <myfirstname><mylastname>2004<at>gmail.com! 

When I read Machine Learning papers, I ask myself whether the contributions of the paper fall under improvements to 1) Expressivity 2) Trainability, and/or 3) Generalization. I learned this categorization from my colleague Jascha Sohl-Dickstein at Google Brain, and the terminology is also introduced in this paper. I have found this categorization effective in thinking about how individual research papers (especially on the theoretical side) tie subfields of AI research (e.g. robotics, generative models, NLP) together into one big picture [1].

In this blog post, I discuss how these concepts tie into current (Nov 2017) machine learning research on Supervised Learning, Unsupervised Learning, and Reinforcement Learning. I consider Generalization to be comprised of two categories -- “weak” and “strong” generalization -- and I will discuss them separately. This table summarizes my opinion on where things are:




Expressivity

What computations can this model perform?
Trainability

How easy is it to fit the model to data?
Weak Generalization

How well does the model perform on small perturbations of its data distribution?
Strong Generalization

How well does the model perform on a large shift in data distribution?
Supervised Learning

Learn using human-provided labels. Predict those labels.
Easy



OK

Getting There

Hard
Unsupervised Learning

Learn without human-provided labels. Predict those labels.
Easy

OK

Getting There

Hard

Reinforcement Learning

Learn to maximize reward in an environment.
Easy

Hard

Mega-hard

Mega-hard



I’m deeply grateful to Jascha Sohl-Dickstein and Ben Poole for providing feedback and edits to this post, as well as Marcin Moczulski for helpful discussions about trainability of RNNs. 

This post covers a broad spectrum of research at a fairly opinionated level. I emphasize that any factual errors here are my own and do not reflect opinions of my colleagues and proofreaders. Feel free to contribute feedback in the comments or send me an email if you’d like to discuss / suggest edits - I’m here to learn. 

Expressivity

What computations can this model perform?

Measures of expressivity characterize the complexity of functions that can possibly be computed by a parametric function such as a neural net. Deep neural networks are exponentially expressive with respect to their depth, which means that moderately-sized neural networks are sufficiently expressive for most supervised, unsupervised, and RL problems being researched today [2]. One piece of evidence for this is that deep neural nets are capable of memorizing very large datasets.

Neural nets can express all sorts of things: continuous, complex, discrete, and even stochastic variables. Thanks to research in generative modeling and Bayesian Deep Learning in the last few years, deep neural networks have been leveraged to construct probabilistic neural networks have produced incredible generative modeling results.

Recent breakthroughs in generative modeling illustrate the profound expressivity of neural networks: neural nets can output extremely complex data manifolds (audio, images) that are nearly indistinguishable from real data. Here is the output on a recent GAN architecture proposed by researchers at NVIDIA:
 

It’s not perfect yet (notice the warped background), but we are getting very close. Similarly, in audio synthesis, audio generated samples from the latest WaveNet models sound like real people.

Unsupervised learning is not limited to generative modeling. Some researchers like Yann Lecun are rebranding unsupervised learning as “predictive learning”, where the model infers the past, imputes the present, or predicts the future. However, since much of unsupervised learning has focused on predicting extremely complex joint distributions (images from past/future, audio), I think generative modeling is a reasonably good benchmark to measure expressivity in the unsupervised domain.

Neural nets seem to be expressive enough for RL as well. Fairly small networks (2 conv layers and 2 fully-connected layers) are powerful enough to solve Atari and Mujoco control tasks (though their trainability leaves much to be desired - see the next section).

On its own, expressivity is not a very interesting question - we can always increase it by adding more layers, more connectivity, and so on. The challenge is making neural nets expressive enough for both the training and test set, while keeping the training difficulty manageable. For example, 2D convolutions seem to be required to make image classification models generalize, even though a deep fully-connected network has enough capacity to memorize a training set.

Expressivity is the easiest problem to deal with (add more layers!), but also simultaneously the most mysterious: we don’t have good way of measuring how much expressivity (and of what kind) is demanded by a given task. What kinds of problems will demand neural networks that are orders of magnitude larger than the ones we use today? Why will they require so much compute? Are our current neural networks expressive enough to implement intelligence resembling that of a human? Does solving generalization require vastly more expressive models?
The brain has many orders of magnitude more “neuronal units” (1e11) than the number units in our large neural networks (Inception-ResNet-V2 has roughly 25e6 ReLU units). This comparison is already drastic, even when we give Relus the benefit of the doubt that they are even remotely comparable to biological neurons. A single biological neuron -- with its various neurotransmitters, dendritic forest integrating input from 10000 other neurons in a time-varying integration -- are also incredibly expressive. A locust implements its collision detection system in one neuron and it already flies better than any drone system we have ever built. Where is all this expressivity coming from and going to? How much more expressivity do we need?


Trainability

Given a sufficiently expressive space of models, can we find a good model?

A Machine Learning model is any computer program that has some of its functionality learned from data. During “learning”, we search for a reasonably good model that utilizes knowledge from the data to make decisions, out of a (potentially huge) space of models. This search process is usually formulated as solving an optimization problem over the space of models.



Several Varieties of Optimization

A common approach, especially in deep learning, is to define some scalar metric that evaluates the “goodness” of a model. Then use a numerical optimization technique to maximize said “goodness” (or equivalently, “minimize badness”).

A concrete example: minimizing the average cross-entropy error is a standard way to train neural networks to classify images. This is done in the hopes that when the model x dataset have been minimized on cross entropy loss, it will also do what we actually want, e.g. classify images correctly with a certain degree of precision and recall on test images. Often the evaluation metric cannot be optimized directly (the most obvious reason being that we don’t have access to the test dataset) but a surrogate function like cross-entropy on a training set can.

Searching for good models (training) ultimately amounts to optimization -- no two ways about it! But … objectives are sometimes hard to specify. A classic scenario in supervised learning is in image down-sampling, where it’s difficult to define a single scalar quantity that aligns exactly with how humans perceive “perceptual loss” from a particular downsampling algorithm. In a similar manner, super-resolution and image synthesis are also difficult, because we have a hard time writing down the “goodness” as a maximization objective. Imagine writing down a function that tells you how “photoreal” an image is! The debate on how to measure quality of generative models (for images, audio) rages on to this day.

The most popular technique in recent years that addresses this problem is a co-adaptation approach, which formulate the optimization problem as solving for the equilibrium point between two non-stationary distributions that are evolving in tandem [3]. An intuitive analogy to explain why this is “natural” would be to compare this to ecological evolution of a predator species and a prey species. At first, predators get smarter so they can catch prey effectively. Then, the prey get smarter in order to evade predators. The species co-evolve against each other, and the end result is that both species become more intelligent.

Generative Adversarial Networks operate according to a similar principle, in order to avoid having to define explicit perceptual loss objectives. Similarly, competitive self-play in reinforcement learning employs this principle to learn rich behaviors. Although the optimization objective is now implicitly specified, it’s still an optimization problem, and machine learning practitioners can re-use familiar tools like deep neural networks and SGD.

Evolution strategies typically consider optimization-as-a-simulation. The user specifies some dynamical system over a population of models, and at each timestep of the simulation, the population is updated according to the rules of the dynamical system. The models may or may not interact with each other. The simulation runs forward in time, and hopefully the dynamics of the system induce the population to eventually converge around “good models”.

For a great tutorial on ES in the context of RL, see A Visual Guide to Evolution Strategies by David Ha (the “References and Further Reading” section is really great!).

Current State of Affairs

Trainability is basically solved for feedforward networks and “direct” objectives encountered in supervised learning (I make this claim empirically, not theoretically). Some popular breakthroughs published in 2015 (Batch Norm, ResNets, Good Init) are widely used today to make training of feedforward networks super trainable. In fact, deep nets with hundreds of layers can now minimize training error of large classification datasets to zero. See this paper for a comprehensive survey on the hardware and algorithmic infrastructure underlying modern DNNs.

RNNs are still pretty tricky, but the research community is making a lot of progress, to the point where it is no longer crazy to drop an LSTM into a complex robotic policy and expect it to “just work”. That’s pretty incredible, considering that as late as 2014 not many people had confidence in trainability RNNs and in the previous decades there was a lot of work showing that RNNs are horrible to train. There’s evidence that a whole bunch of RNN architectures are equivalently expressive, and any differences in performance are due to some architectures being easier to optimize than others [4].

In unsupervised learning, the model outputs are often (but not always!) bigger -- for instance, 1024 x 1024 pixels, gigantic sequences for speech and text. This unfortunately makes them harder to train.

One of the big breakthroughs in 2017 was that GANs are now dramatically easier to train. The most popular improvements have been simple modifications to the original Jensen-Shannon divergence objective: least squares, absolute deviation with margin, and Wasserstein distance (1, 2). The recent work by NVIDIA extends Wasserstein-based GANs to be make the model less sensitive to a wide variety of hyperparameters like BatchNorm parameters, architecture choice, and so on. Stability is incredibly important for practical and industrial applications - it’s what gives us the confidence that it will be compatible with our future research ideas or applications. Taken together, these results are exciting because they indicate that our generator neural networks are plenty expressive to generate the right images, and the bottleneck that was holding them back was one of trainability. Trainability might still be a bottleneck - unfortunately with neural nets it’s hard to tell if a model is simply not expressive enough and/or we haven’t trained it enough.

Latent discrete variable inference within neural nets was also previously difficult to train due to monte-carlo gradient estimators having high variance. However it’s starting to make a comeback in recent years in all sorts of architectures, from GANS to Language Modeling to Memory-Augmented Neural Networks to Reinforcement Learning. Discrete representations are very useful from an expressivity standpoint and it’s great that we can now train them fairly reliably.

Tragically, Deep Reinforcement Learning is still quite behind when it comes to pure trainability, without even considering the generalization aspect. Reinforcement learning is difficult because for environments with more than 1 timestep, we are searching for a model that then performs optimization (of the reward) at inference time. There is an inner optimization process where the model induces optimal control, and an outer optimization loop that learns this optimal model, using only a database of what the agent saw.

Recently I added one extra dimension to a continuous robotic control task, and the performance of my RL algorithm dropped from >80% to 10%. Not only is RL training hard, but it’s also unreliable! Because optimization landscapes are so stochastic, we can’t even get the same results with different random seeds, so the strategy is to report a distribution of reward curves over multiple trials, with different random seeds. Different implementations of the same algorithm perform differently across environments, so RL scores reported in literature are to be taken with a handful of salt.

This is a travesty! Trainability in RL is still very much unsolved, because we still cannot scale up our problems by a little bit and expect the same learning to do the same thing 10 times out of 10.

If we treat RL as a pure optimization problem (and worry about generalization and complex tasks later), the problem is still intractable. Let’s say there’s an environment where a sparse reward is given only at the end of the episode (e.g. babysitting a child where the babysitter is paid after the parents return home). The number of actions (and corresponding outcomes of the environment) grows exponentially with the duration of the episode, but only a few of these action sequences corresponds to a success. 

Therefore, estimating the policy gradient at any point in the optimization landscape of models requires exponentially many samples in the action space before a useful learning signal is obtained. This is as bad as trying to compute a Monte Carlo expectation of a probability distribution (over all action sequences) where the mass is concentrated at a “delta distribution” (see diagram below). When there is vanishing overlap between the proposal distribution (exploration) and reward distribution, finite-sample Monte Carlo estimates of policy gradients simply will not work, no matter how many samples you collect.


Furthermore, if the data distribution is non-stationary (as in the case of off-policy learning algorithms with a replay buffer), collecting “bad data” can introduce unstable feedback loops into the outer optimization process.

Here’s the same idea from an optimization perspective rather than a Monte Carlo estimation perspective: without any priors over the state space (such as an understanding of the world or an explicit instruction provided to the agent), the optimization landscape looks like “swiss cheese” -- small “holes” of convex optima surrounded by vast plateaus of parameter space where policy gradient information is useless. This means that the entire model space basically has zero information (because the learning signal is effectively uniform across the space of models, since the non-zero area is vanishingly small).





Without developing good representations that we can do learning on top of, we might as well just cycle through random seeds and sample random policies until we happen upon a good model that lands in one of these swiss cheese holes. The fact that this is a surprisingly strong baseline in RL suggests that our optimization landscapes may very well look like this.

I believe that RL benchmarks like Atari and Mujoco, while interesting from a pure optimization standpoint, do not really push the limits of machine learning, so much as just solve for a single monolithic policy that optimizes performance for a fairly sterile environment. There is little selective pressure encouraging the model to “generalize”, which makes the problem a pure optimization problem, and not really a hard ML problem.

It’s not that I like to complicate the trainability problems of RL with issues of generalization (it certainly does not make things easy to debug!), but I think learning understanding of the environment and understanding the task is the only way that will make reinforcement learning tractable for real-world robotic problems.

Contrast this with supervised learning and unsupervised learning, we can obtain learning signals cheaply, no matter where we are in the model search space. The proposal distribution for minibatch gradients has nonzero overlap with the distribution of gradients. If we are using SGD with minibatch size=1, then the probability of sampling the transition with a useful learning signal is at worst 1/N where N is the size of the dataset (so learning is guaranteed after each epoch). We can brute-force our way to a good solution by simply throwing a lot of compute and data at the problem. Furthermore, improving perceptual generalization of lower layers may actually have a variance-reducing effect via “bootstrapping” on top of lower-level features.

In order to solve RL problems with high dimensionality and complexity, we must consider generalization and general perceptual understanding before tackling the numerical optimization problem. To attempt otherwise is a foolish waste of compute and data (though it’s still useful to answer exactly how much we can brute force). We need to get to a point where every data point provides a non-zero number of bits to the RL algorithm, and have an easy way to importance-sample gradients when the task is very complex (without gathering exponentially more data). Only then, is it reasonable to assume we can succeed at brute-forcing the problem.

Learning from demonstration, imitation learning, inverse reinforcement learning, and interfacing with natural language instructions perhaps provide ways to quickly bring the starting policy to a point that obtains some learning signal, or shape the search space in a way such that each episode gives a non-zero amount of information to the policy (for example, the environment provides a reward of 0 but observations yield some information that is helpful to the inductive biases of the model’s planning module).

To summarize: trainability is: easy for supervised learning, is harder-but-getting-there for unsupervised learning, and is still terribly broken for reinforcement learning.

Generalization


Generalization is the most profound of the 3 problems, and is the heart of Machine Learning itself. Loosely speaking, it is how well the model performs on test dataset when it is trained on a training dataset.

There are two distinctive scenarios when talking about generalization: 1) the training and test data are sampled from the same distribution (and we need to learn this distribution from the training data alone) or 2) training and test data are drawn from different distributions (and we need to generalize from the train to the test distribution). Let’s call (1) “weak generalization” and (2) “strong generalization. This way of classifying generalization could be also referred to as “interpolation vs. extrapolation”, or “robustness vs. understanding”.

Weak Generalization: Two Landscapes

How well does the model perform on small perturbations of its data distribution?
In “weak generalization”, we usually assume the training and test data samples are drawn from the same distribution. However, in real world settings, there is almost always some difference between the train and test distributions, even in the large-sample limit.

These differences can come from sensor noise, changes in ambient light conditions, gradual wear and tear of objects, variations in lighting conditions (maybe it was a cloudy day when the photographer gathered the test set images). Another situation is that differences can be generated by an adversary. Adversarial perturbations are barely perceptible to our human vision, so we might consider adversarial examples to be drawn from the “same distribution”.

Therefore, it’s helpful in practice to think about “weak generalization” as evaluating on a “perturbed” version of the training distribution:



A perturbed testing data distribution might also induce a perturbed optimization landscape (lowest point is best).



The fact that we don’t know the testing distribution ahead of time presents some difficulties for optimization. If we are too aggressive in optimizing the training landscape (the sharp global minima on the left of the blue curve), we end up with a model that is sub-optimal with respect test data (sharp local minima on the red curve). Here, we’ve overfit to the training distribution or training data samples, and failed to generalize to the perturbed test distribution.

“Regularization” is any technique we employ to prevent overfitting. We don’t have any prior information on what the test perturbation is, so often the best we can usually do is try to train on random perturbations versions of the training distribution, in hopes that these perturbations cover the test distribution. Stochastic gradient descent, dropout, weight noise, activation noise, data augmentation are commonly-used regularizers in Deep Learning. In reinforcement learning, randomizing simulation parameters makes the training more robust. In his talk at ICLR 2017, Chiyuan Zhang made the comment that he sees regularization as “anything that makes training harder” (rather than the conventional view of “limiting model capacity”). Basically, making things *harder* to optimize somehow improves generalization.

This is really disturbing -- our methods of “generalizing” are quite crude, amounting to “optimizer lobotomy”. We basically throw a wrench into the optimizer and hope it messes training just enough to prevent overfitting. Furthermore, improving trainability is can make you pay a price in generalization! This way of viewing (weak) generalization certainly complicates how we proceed with trainability research.

But if better optimizers overfit, then how do we explain why some optimizers seem to decrease both training and test error? The reality is that any combination of optimization landscape and optimizer strikes a balance between 1) finding a better region of models and 2) overfitting to a specific solution, and we don’t have good methods for controlling this balance.

The most challenging test of weak generalization is probably that of adversarial attacks, where the perturbations are coming from an “adversary” that provides the worst-possible perturbation to your data point, so that your model performs badly with high probability. We still don’t have any deep learning approaches that are very robust to adversarial examples, but my gut feeling is that it is eventually solvable [5].

There’s some theoretical work on applying information theory to show that neural networks apparently go through a phase transition during the training process in which the model switches from “memorizing” data to “compressing” data. This theory is starting to pick up steam, though there is ongoing academic debate as to whether this theory is valid. Keep your eyes peeled for this one - the intuition of “memorization” and “compression” is compelling.

Strong Generalization: Natural Manifold

In tests of “strong generalization”, the model is evaluated on a completely different data distribution than the test, but with data coming from the same underlying manifold (or generative process).




How could one could ever learn a good model for the test distribution if it is “completely different” from the training distribution? The resolution for this is that these “glimpses” of data actually come from the same manifold of “natural data”. There can still be a lot of information overlap between train and test distributions, as long as they come from the same underlying generative process.

The space of observable data in the world can be described as a very high-dimensional, continuously varying “natural manifold”. The tuple (video clip of a pair of cymbals crashing together, the sound of crashing cymbals) is a point on this manifold. The tuple (video clip of a pair of cymbals crashing together, the sound of frogs croaking)is not - those joint data points are simply are inconsistent with our reality. As you can see, even though the manifold is positively gigantic, it’s also highly structured. For example, physical laws like gravity are obeyed in all the data that we observe, objects don’t blip in and out of existence, and so on.

Strong Generalization can be thought of as how much of this “super-manifold” is captured by a particular model that is only trained on a teensy sampling of data points from the manifold. Note that an image classifier does not need to discover Maxwell’s Equations -- it merely needs to have an understanding of reality that is consistent with data on the manifold.

Modern classification models trained on ImageNet are arguably OK at strong generalization - principles like edges, contours, and objects are indeed understood by models trained on ImageNet, which is why it’s so popular to transfer these pre-trained weights to other datasets for few-shot & metric learning. It could be a lot better though: classifiers trained on ImageNet don’t work universally, few-shot learning still is unsolved, and they are still susceptible to adversarial examples. Obviously our models don’t understand what they are looking at, to the same degree that humans do. But it’s a start.

Similar to weak generalization, the test distribution can be sampled adversarially in a way that induces maximum discrepancy between the training and testing distribution. AlphaGo Zero is my favorite example of this: at test time, it observes data from human players that is completely different than its training distribution (it has never “seen” a human before). Furthermore, the human is using all of their intelligence to actively lead AlphaGo to regimes that are unobserved in the training data. Even though AlphaGo doesn’t explicitly understand anything about abstract mathematics, opponent psychology, or what the color green means, it clearly understood enough about the world to outwit a human player within a narrow domain. If an AI system is robust against a skilled human adversary, I consider it to have sufficient strong generalization capability.

Sadly, Reinforcement Learning research has largely ignored the problem of strong generalization. Most benchmarks are static environments, with very little perceptual richness (e.g. the humanoid doesn’t understand its world or what its body even looks like, beyond some floating joint positions that lead to some reward).

I really believe that solving generalization is key to solving RL trainability. The more our learning system “understands” about the world, the better able it is to obtain learning signals, perhaps with fewer samples. This is why few-shot learning, imitation learning, learning-to-learn is important: it moves us away from brute force solutions where variance is high and information is low.

I believe two things are required to achieve stronger forms of generalization:

First, we need models that actively deduce the fundamental laws of the world from observation and experimentation. Symbolic reasoning and causal inference seem like ripe topics to study, but any kind of unsupervised learning is likely to help. I am reminded of humankind’s long quest to understand the motion of celestial bodies by deriving physical laws of the universe using a system of logical reasoning (mathematics). It’s interesting to note that before the Copernican Revolution humans probably relied on Bayesian heuristics at first (“superstitions”), and these “Bayesian” models were discarded once we discovered classical mechanics.

Our model-based machine learning methods (models that attempt to “predict” aspects of their environment) right now are “pre-Copernican” in the sense that they only interpolate based on very shallow statistical superstitions, rather than coming up with deep, general principles to explain and extrapolate data that can be millions of light years away or many timesteps into the future. Note that humans didn’t really need a firm grasp on probability theory to derive deterministic celestial mechanics, which begs the question of whether there are ways to do machine learning and causal inference without an explicit statistical framework.

One way to drastically reduce the complexity burden is to make our learning systems more adaptive. We need to go beyond optimizing models that predict or act in static ways. Instead, we need to be optimizing models that can thinkremember, and learn, all in real-time.

Second, we need to throw enough diversity of data at the problem, so that the model can be pushed to develop these abstract representations. An environment needs to be rich enough to force the right representations to develop (though AlphaGo Zero does raise questions on how much of the natural manifold an agent really needs to experience). Without these constraints, the problem is fundamentally underspecified and the chance that we accidentally discover the right solution is nil. Perhaps humans would never have gained intelligence if they could not stand up on their hind legs and wonder about why stars moving in such strange, elliptical patterns.

I wonder if the Trisolarian civilization (from the book "The Three Body Problem") evolved to such a high level of technological advancement because their livelihoods depended on their physical understanding of complex celestial mechanics. Maybe we need to introduce some celestial motion into our Mujoco & Bullet environments :)


Footnotes


[1] There are some research areas that don’t fit as neatly into the framework of expressivity, trainability, and generalization. For example, interpretability research seeks to understand why a model provides a particular answer. Not only is this expected by ML customers and policymakers in some high-stakes fields (medicine, law enforcement), but these can shed light on generalization: if we discover that the model is providing diagnoses in a way that differs a lot from how a human medical professional would arrive at those conclusions, it could mean that there are edge cases where the model’s deductive process will fail to generalize. Finding out whether your model learned the right thing is even stronger having low test error! Differential privacy is another constraint that we sometimes demand from our ML models. These topics are out of the scope of this blog post.

[2] A hand-wavy explanation for why this is the case: A fully-connected layer of size N followed by a ReLU nonlinearity can chop up a vector space into N piecewise-linear pieces. Adding a second ReLU layer further subdivides space into N more pieces, resulting in N^2 piecewise linear regions in input space, 3 layers is N^3, and so on. For detailed analysis, see Raghu et al. 2017.

[3] This is sometimes called a Multi-level optimization problem. However, this implies an “outer” and “inner” optimization loop, whereas co-adaptation may occur simultaneously. e.g. concurrent processes on a single machine communicating asynchronously, or species evolving continuously w.r.t each other in an ecosystem. In these cases, there is no clear “outer” and “inner” optimization loop.

[4] seq2seq with attention achieved SOTA when they were first introduced, but I suspect that the advantage it confers is one of trainability, not expressivity or generalization. Maybe seq2seq without attention can do just as well if it is properly initialized.

[5] One idea to protect against adversarial methods, without solving strong generalization: make it extremely expensive to compute adversarial perturbations. Models and data are partially black-box. At each call to the model during inference time, randomly pick a model from an ensemble of trained models, and serve that model to the adversary without telling them which model they got. The models are trained independently of each other and can even adopt different architectures. This makes it difficult to compute finite-difference gradients, as f(x+dx) - f(x) can have arbitrarily high variance. Furthermore, gradients between successive gradient computations will still have high variance because different pairs of models can be sampled. Another way to improve things would be to use multi-modal data (video, multi-view, images + sound), so that it is difficult to perturb inputs in a way that preserves consistency of inputs.

Monday, January 2, 2017

Summary of NIPS 2016

The 30th annual Neural Information Processing Systems (NIPS) conference took place in Barcelona in early December. In this post, I share my thoughts on the conference. Let me know your thoughts in the comments section!

Basic outline:

  • State of AI Research and trends in 2016
  • Talks and papers that I liked
  • Predictions for 2017
  • Miscellaneous notes

State of Artificial Intelligence Research: 2016 Trends


Academic, industry, and public interest in Artificial Intelligence (A.I.) is taking off like a goddamn rocket. There was a 50% percent increase in NIPS registrations from last year:


It's not a coincidence that Nvidia, the literal arms-dealer of deep learning, has had a good year in the stock market.




2016 saw the continued success of Deep Learning: superhuman Go playing, superhuman speech transcription, superhuman translation, superhuman lip reading, and maybe-superhuman driving ability (no public benchmarks for self-driving cars yet). These accomplishments are especially remarkable, considering that many AI researchers used to believe that solving the relevant problems required solving general intelligence itself.

The research topics that are in vogue right now (GANs, RL, generative models, unsupervised learning, robotics, alternative ways to train DNNs) are important problems to tackle before we can build general AI. However, they haven't created significant commercial value in industry yet, in ways that couldn't plausibly be substituted with traditional supervised learning. 

Transfer learning, domain adaptation and semi-supervised learning alleviate the data-hungry requirements of deep learning, and are starting to work really well.

The volume and quality of published work in 2016 was staggering, and it's hard to come away from the conference feeling like there are any ideas left to work on. Many papers and posters I saw at the conference had multiple groups converging on similar ideas. 

GANs


I can’t imagine a higher compliment for Ian Goodfellow than Yann Lecun declaring Generative Adversarial Networks (GANs) to be “the biggest breakthrough in Machine Learning in the last 1-2 decades.” 

For those of you who are unfamiliar with GANs, here's a simple writeup explaining them. GANs were initially developed for generating images, but researchers are exploring their applications in reinforcement learning, domain adaptation, security ML, compression, and more. A more appropriate terminology would be “adversarial methods”, because the network we are trying to train need not be "generative" - it could be a deterministic classifier or policy instead.

Why are GANs so cool? Before they became popular, many researchers formulated learning as an optimization objective, i.e. “minimize a loss or maximize a reward". GANs break away from that paradigm; instead, adversarial methods solve for a Nash equilibria between at least two “agents”, both of which are allowed to co-adapt to each other.

The research community cares about this because some loss functions are often difficult to specify directly. In many over-parameterized problems, the optimization process will “short-circuit” design flaws in the objective and yield a result that might minimize the objective, but isn't quite what we really wanted. 

For example, if we wanted to minimize some error for image compression/reconstruction, often what we find is that a naive choice of error metric (e.g. euclidean distance to the ground truth label) results in qualitatively bad results. The design flaw is that we don’t have good perceptual similarity metrics for images that are universally applicable for the space of all images. GANs use a second “adversarial” network learn an optimal implicit distance function (in theory).

Ian Goodfellow gave a very nice tutorial on GANs. The main challenge for GANs right now is preventing degenerate equilibria between the discriminator and generator. These issues often manifest themselves as mode collapse (the generator learns a single tight mode over a single sample) or the discriminator "overpowering" the generator. I'm not worried about this - I think with time, intuitive tricks and recipes from experienced researchers will make them easier to use. People used to think that only Geoff Hinton could train DNNs, and that ended up not being the case.

Here's a quote from Maciej Cegłowski's an eminently hilarious, bookmark-bar-tier blog post about AI existential alarmism:

With no way to define intelligence (except just pointing to ourselves), we don't even know if it's a quantity that can be maximized. For all we know, human-level intelligence could be a tradeoff. Maybe any entity significantly smarter than a human being would be crippled by existential despair, or spend all its time in Buddha-like contemplation.

Does intelligent life arise from optimization? Or is it an equilibrium? Or to quote Yann, “a bunch of hacks that just happen to work?” AI research is not ready to answer these questions yet, but I am optimistic that research on GANs will tease out what kinds of problems are better-suited for expressing as an equilibrium-finding vs. minimization problem. 


Deep RL


People usually regard Reinforcement Learning (RL) as pertaining to playing video games and controlling robots, but RL methods are far more general than that - any ML problem with dependencies across space or time can benefit from RL techniques (i.e. sequential pixels). RL allows us to think about non-stationary data distributions as "environments" that respond to an agent's behavior. RL is applied creatively to a lot of ML problems nowadays, such as translation.

Pieter Abbeel and John Schulman gave a very nice tutorial on policy gradient methods, and John gave a talk on the “nuts and bolts” of Deep RL research. The nuts and bolts talk is jam-packed with great advice that would have taken me years to figure out on my own, and I highly recommend it for anyone who is working on RL projects.

Curiously, a recent paper by Chelsea Finn showed that certain classes of inverse RL problems have a mathematically equivalent GAN formulation!

Someone asked the Deep RL panelists what some of the most challenging problems were in RL. Rich Sutton and Raia Hadsell mentioned learning of subtasks/subgoals, i.e. hierarchical representations of the task at hand. John Schulman mentioned an interesting comment about how the Bellman discount factor in the discrete MDP setting makes it difficult to plan more than >100 time steps into the future, even when r=0.99. Finally, Chelsea Finn discussed sample complexity: it’s annoying that we have to learn every RL task from scratch, and some kind of inter-task transfer learning is needed. After all, humans can quickly learn to play new games because we've played other games before.


Bayesian Deep Learning


I think one of the most promising areas of research right now is the marriage of deep neural nets with graphical models,  referred to some as “Bayesian deep learning.” It’s probably the closest thing to a theoretical understanding of deep learning that generalizes to how neural nets are used in practice.

Kyle Cranmer gave a phenomenal talk on how likelihood-free inference methods were used to find evidence of the Higgs Boson. A lot of it went over my head, but what little I understood from it was that they used a variational model to approximate the simulation of a 11-story particle detector at atomic scale:



There were also a whole bunch of interesting variational inference techniques presented at the conference:
  • Symmetrized Variational Inference - a way to avoid mode collapse with Gaussian VAEs
  • Rejection Sampling Variational inference - reparameterization trick for a rejection sampler, allowing us to train gamma, beta, Dirichlet and many other variational distributions! This is huge. 
  • Robust Inference with Variational Bayes - improving the robustness of a variational posterior to choice of prior and likelihood.
  • ELBO surgery - ELBO is the most common variational lower bound used to train VAEs, and this paper proposes an alternative way to write the ELBO: split up the KL(q(z|x)|p(z)) term into an index-code mutual information term and marginal KL term KL(q(z)|p(z)). Experiments show that the latter term is actually quite large, which suggests that we should focus on improving priors. This work is quite important because it provides a easy-to-implement debugging tool for anyone training VAEs.
  • Operator Variational Inference, Normalizing Flows, and Real-NVP enable richer families of posterior approximations.
  • Variational lossy autoencoders

Open Research


There were a couple exciting software announcements at NIPS this year. First was OpenAI's announcement of Universe, a software interface that bridges all kinds of software to OpenAI Gym. This means we can bring RL to interact with environments such as the full-featured Chrome Browser, or desktop applications like Photoshop. Think of this as "AppleScript on steroids".

Deepmind released Deepmind Lab for FPS + 3D agent research in their Labyrinth environment, and earlier this year, Microsoft released Project Malmo for training AI agents in Minecraft. These are really interesting environments, and typically outside the capabilities of a grad student to implement on their own.

These are just a few of the high-profile software releases; many other tech companies and academic institutions are releasing benchmark software, deep learning frameworks, and data all the time. 

People often cite 1) compute, 2) data, and 3) algorithms as the reason why DL is taking off, and I think we should add 4) the unprecedented accessibility of ML research. Tons of people are already using OpenAI gym for RL projects, and many research codes are being made public for other people to build on top of. The quality and scope of research benefits dramatically from students not having to write thousands of lines of support code, and instead focus on their research problem.

It’s great that large tech companies are making such great software available to the broader community, and this kind of work that is more impactful than any research paper I saw at the conference this year. 

Talks and Papers that I Liked


I had nowhere enough mental capacity to see all the papers and talks at NIPs, but I did write down some talks, papers, and posters that I came away really inspired by. Here’s a non-exhaustive list.

  • Andrew Ng gave a practical tutorial on how to debug ML research projects. This is a must-watch for beginners in the field. Here is a video of a similar talk he gave at the Deep Learning summer school.
  • Fixed Points of Belief Propagation - the authors formulate the convergence condition of belief propagation (BP) as the solution of polynomial equations. This makes me wonder if these methods could be used to bridge nonlinear dynamical systems with statistical inference.
  • Deep Networks with Stochastic Depth - layer-wise dropout allows for training ResNets with >1000 layers.
  • Layer Norm - like batch norm, layer norm is used to prevent covariance shift between successive deep layers. The advantage is that layer norm does not introduce gradient variance from other minibatch elements.
  • Deep Successor Reinforcement Learning - I really liked this paper because it gets at model-based + model-free ideas. The motivation is that model-free methods like DQN take a long time to learn adequate low-level features because we are often trying to train a CNN to “see” using a single value scalar that is not directly related to image features. The idea is to learn “successor features” in an unsupervised manner (using rich supervision provided by model prediction error), and then separately learn a simple linear value function on top of these features. The paper is short and well-written.
  • The Intelligent Biosphere by Drew Purves (couldn't find the talk slides). This talk was a bit of an oddball among the other mathy ones at NIPS (I thought the drawings were distracting and failed to capture the nuance of what Drew was saying), but I think the message is crucial for AI researchers who want to take a step back and think about what it means for a system to be “intelligent”. The basic idea is that “intelligence” extends beyond brains; something as simple as self-replicating RNA exhibit intelligent behavior at the evolutionary scale. The natural world is fractal, cyclic, and fuzzy, as opposed to the simplicity of our current simulated environments for benchmarking AI. The implication is that there may be challenges learning in these environments that don’t carry over from Atari games. One interesting comment Drew made was that in a biosphere, every organism is a resource to another organism. That is, learning and adaptation of each organism is not independent of other organisms (reminds me of GANs)! If you’re interested in these environment + intelligence ideas, similar arguments have been made in ethology and neuroethology texts.
  • Input Convex Neural Networks - the authors presented a very interesting follow-up workshop poster at the Deep RL workshop (I can't find a paper), where they were able to solve continuous Q functions by parameterizing the Q-function to be convex with respect to the action vector of the agent (and non-convex with respect to the input state). This allows for efficient DQN policies in continuous action domains. 
  • Using Fast Weights to Attend to the Recent Past - basically an alternative to LSTM that seems to work a lot better.
  • Phased LSTM - a way to learn an adaptive time scale for LSTM architectures, where the hidden state is updated at a learned frequency. This is ideal for learning from data with arbitrary sampling rates, such as from event-based vision sensors (which naturally solve some hard computer vision problems like aliasing and bandwidth constraints). It also happens that this Zurich lab happens to make a really awesome event-based camera and I’m really excited to see how they apply ML to it.
  • Residual Networks Behave Like Ensembles of Relatively Shallow Networks - the title says it all. I thought the experiments could have been a lot better, but regardless, this work could potentially have a huge impact in new ways to train very deep ResNets quickly.
  • Graph Convolutional Networks - generalization of convolutions to allow DNN processing on graph data structures (the kind with edges and vertices). This work is important for getting DL methods to learn approximate solutions to NP-hard graph problems.
  • Minimizing Quadratic Functions in Constant Time - I did a double-take when I saw the title of this poster. I don’t understand the math very well, but the algorithm is very simple: subsample a $n \times n$ matrix to a $k \times k$ matrix and solve it. A simple scaling of the resulting minimized value yields an approximation to the minimized objective of the original system! In problems where we need to branch-and-bound a large search space (an optimal portfolio), we might find the value of a solution more informative than the solution itself.

Predictions for 2017


  • SGD + backprop will still be the best method for training deep neural nets in 2017.
  • Really good transfer learning, domain adaptation, and/or multitask learning will be one of the major accomplishments in 2017 (this was suggested by Andrew Ng during his tutorial).
  • Ensemble methods (e.g. training 10 versions of a network and averaging the predictions) will squeeze out 2-5% performance on all kinds of ML tasks. 
  • Widespread use of model-based + model-free RL, e.g. Reinforcement Learning with Unsupervised Auxiliary Tasks as a way to alleviate sample-complexity of model-free methods.

Miscellaneous Notes


    Some of these are paraphrased opinions of panelists and speakers. If you think I misrepresented your opinions in any way, please don't hesitate to reach out and I will make a correction!

    • Boston Dynamics displayed a series of very entertaining and impressive robotics demos. Their basic approach appears to be high precision hardware that behaves faithfully to a simulated planning model, plus a bit of robust control to handle the imperfections. Machine Learning doesn’t seem to be an important priority for their approach to robotic control, but I think it's good to have uncorrelated research approaches. It's hard to say which one will end up working best. God, I hope Alphabet doesn’t sell them off. Their stuff is really cool.
    • Apple has publicly announced that they will open up their research culture. I think a product-driven research culture has the potential to align research towards important problems (rather than maximizing citations, as is done in academia).
    • Certain prop trading firms are using neural nets to do price forecasting. I had previously thought that all trading firms were too concerned about model interpretability to use neural nets.
    • Speaking of hedge funds, many top hedge funds and trading shops came to NIPS to run career booths, but there was a surprising lack of interest from attendees (attendees were more interested in the likes of Apple, Facebook, Deepmind, Google, etc). At a regular college career fair in the East Coast, these roles are reversed. The talent pool at NIPS seems to be far more interested in tech and open-ended research than making money. 
    • Deers, despite being “herbivores”, will sometimes eat the heads of baby birds for extra calcium (this was from Drew Purves talk). Life rejects your reductionist labels, human!
    • Engineering Principles From Stable and Developing Brains (Saket Navlakha) - interestingly enough, pruning unused neuron units seems to work better for learning than growing additional neurons.
    • Yoshua Bengio - much better generalization (train vs. test error) happens when telling generative models what their latent codes should be.
    • There was a great question asked during the Brains and Bits panel about why we are focusing so much intelligence research on humans, instead of simpler animals like octopi and birds. Yoshua says that focusing on simple models might cause researchers to overlook more general principles of intelligence. I agree with the argument, but not it’s premise - I think it is our AI algorithms that are simple, not the octopi and birds. I believe that we should build an octopi AI before a human one (see the "whole iguana" approach).
    • Yann Lecun thinks model interpretability is overrated: we have lots of interpretable algorithms, but we still lack the basic principles with which to build a general-purpose learning system that can navigate the complexities of our natural world. I agree wholeheartedly with this - requiring interpretability unnecessarily confines us to look for solutions in models that we understand bottom-up. If history of scientific discovery is any clue, that is not always the case. In fact, serendipitous discovery sometimes invalidates our bottom-up assumptions! (e.g. "neural nets don't work").
    • Yann is not a fan of the log-likelihood metric. Log-likelihood is evaluated under a particular model (e.g. PixelCNN decoder), and if the model is not good to begin with, then log-likelihood is somewhat meaningless.
    • Andrew Saxe asked a great question at the Brain and Bits workshop: “what are fringe topics that might be important to solving intelligence?” Demis Hassabis replied that people don't regard the study of imagination, dreaming, consciousness as very scientific, but they may become serious practical questions within 5 years. Terry Sejnowski replied that understanding sleep is crucial, as it is very much a computational phenomenon rather than a torpid one. Yoshua wonders about what it means to "understand something", i.e. neural nets and other models can learn computations and statistical correlations, but how should semantic meaning be extracted?

    Further Reading


    I didn't attend any talks on meta-learning (e.g. synthetic gradients, learning-to-learn, hypernetworks), and a great many other research themes, but others have blogged about their experiences at NIPS. Check these articles out to piece together a fuller picture!

    NIPS 2016 by Martin Thoma
    50 things I learned at NIPS 2016 by Andreas Stuhlmüller


    Thank you for reading, and happy new year!