## Monday, January 2, 2017

### Summary of NIPS 2016

The 30th annual Neural Information Processing Systems (NIPS) conference took place in Barcelona in early December. In this post, I share my thoughts on the conference. Let me know your thoughts in the comments section!

Basic outline:

• State of AI Research and trends in 2016
• Talks and papers that I liked
• Predictions for 2017
• Miscellaneous notes

## State of Artificial Intelligence Research: 2016 Trends

Academic, industry, and public interest in Artificial Intelligence (A.I.) is taking off like a goddamn rocket. There was a 50% percent increase in NIPS registrations from last year:

It's not a coincidence that Nvidia, the literal arms-dealer of deep learning, has had a good year in the stock market.

2016 saw the continued success of Deep Learning: superhuman Go playing, superhuman speech transcription, superhuman translation, superhuman lip reading, and maybe-superhuman driving ability (no public benchmarks for self-driving cars yet). These accomplishments are especially remarkable, considering that many AI researchers used to believe that solving the relevant problems required solving general intelligence itself.

The research topics that are in vogue right now (GANs, RL, generative models, unsupervised learning, robotics, alternative ways to train DNNs) are important problems to tackle before we can build general AI. However, they haven't created significant commercial value in industry yet, in ways that couldn't plausibly be substituted with traditional supervised learning.

Transfer learning, domain adaptation and semi-supervised learning alleviate the data-hungry requirements of deep learning, and are starting to work really well.

The volume and quality of published work in 2016 was staggering, and it's hard to come away from the conference feeling like there are any ideas left to work on. Many papers and posters I saw at the conference had multiple groups converging on similar ideas.

### GANs

I can’t imagine a higher compliment for Ian Goodfellow than Yann Lecun declaring Generative Adversarial Networks (GANs) to be “the biggest breakthrough in Machine Learning in the last 1-2 decades.”

For those of you who are unfamiliar with GANs, here's a simple writeup explaining them. GANs were initially developed for generating images, but researchers are exploring their applications in reinforcement learning, domain adaptation, security ML, compression, and more. A more appropriate terminology would be “adversarial methods”, because the network we are trying to train need not be "generative" - it could be a deterministic classifier or policy instead.

Why are GANs so cool? Before they became popular, many researchers formulated learning as an optimization objective, i.e. “minimize a loss or maximize a reward". GANs break away from that paradigm; instead, adversarial methods solve for a Nash equilibria between at least two “agents”, both of which are allowed to co-adapt to each other.

The research community cares about this because some loss functions are often difficult to specify directly. In many over-parameterized problems, the optimization process will “short-circuit” design flaws in the objective and yield a result that might minimize the objective, but isn't quite what we really wanted.

For example, if we wanted to minimize some error for image compression/reconstruction, often what we find is that a naive choice of error metric (e.g. euclidean distance to the ground truth label) results in qualitatively bad results. The design flaw is that we don’t have good perceptual similarity metrics for images that are universally applicable for the space of all images. GANs use a second “adversarial” network learn an optimal implicit distance function (in theory).

Ian Goodfellow gave a very nice tutorial on GANs. The main challenge for GANs right now is preventing degenerate equilibria between the discriminator and generator. These issues often manifest themselves as mode collapse (the generator learns a single tight mode over a single sample) or the discriminator "overpowering" the generator. I'm not worried about this - I think with time, intuitive tricks and recipes from experienced researchers will make them easier to use. People used to think that only Geoff Hinton could train DNNs, and that ended up not being the case.

Here's a quote from Maciej Cegłowski's an eminently hilarious, bookmark-bar-tier blog post about AI existential alarmism:

With no way to define intelligence (except just pointing to ourselves), we don't even know if it's a quantity that can be maximized. For all we know, human-level intelligence could be a tradeoff. Maybe any entity significantly smarter than a human being would be crippled by existential despair, or spend all its time in Buddha-like contemplation.

Does intelligent life arise from optimization? Or is it an equilibrium? Or to quote Yann, “a bunch of hacks that just happen to work?” AI research is not ready to answer these questions yet, but I am optimistic that research on GANs will tease out what kinds of problems are better-suited for expressing as an equilibrium-finding vs. minimization problem.

### Deep RL

People usually regard Reinforcement Learning (RL) as pertaining to playing video games and controlling robots, but RL methods are far more general than that - any ML problem with dependencies across space or time can benefit from RL techniques (i.e. sequential pixels). RL allows us to think about non-stationary data distributions as "environments" that respond to an agent's behavior. RL is applied creatively to a lot of ML problems nowadays, such as translation.

Pieter Abbeel and John Schulman gave a very nice tutorial on policy gradient methods, and John gave a talk on the “nuts and bolts” of Deep RL research. The nuts and bolts talk is jam-packed with great advice that would have taken me years to figure out on my own, and I highly recommend it for anyone who is working on RL projects.

Curiously, a recent paper by Chelsea Finn showed that certain classes of inverse RL problems have a mathematically equivalent GAN formulation!

Someone asked the Deep RL panelists what some of the most challenging problems were in RL. Rich Sutton and Raia Hadsell mentioned learning of subtasks/subgoals, i.e. hierarchical representations of the task at hand. John Schulman mentioned an interesting comment about how the Bellman discount factor in the discrete MDP setting makes it difficult to plan more than >100 time steps into the future, even when r=0.99. Finally, Chelsea Finn discussed sample complexity: it’s annoying that we have to learn every RL task from scratch, and some kind of inter-task transfer learning is needed. After all, humans can quickly learn to play new games because we've played other games before.

### Bayesian Deep Learning

I think one of the most promising areas of research right now is the marriage of deep neural nets with graphical models,  referred to some as “Bayesian deep learning.” It’s probably the closest thing to a theoretical understanding of deep learning that generalizes to how neural nets are used in practice.

Kyle Cranmer gave a phenomenal talk on how likelihood-free inference methods were used to find evidence of the Higgs Boson. A lot of it went over my head, but what little I understood from it was that they used a variational model to approximate the simulation of a 11-story particle detector at atomic scale:

There were also a whole bunch of interesting variational inference techniques presented at the conference:
• Symmetrized Variational Inference - a way to avoid mode collapse with Gaussian VAEs
• Rejection Sampling Variational inference - reparameterization trick for a rejection sampler, allowing us to train gamma, beta, Dirichlet and many other variational distributions! This is huge.
• Robust Inference with Variational Bayes - improving the robustness of a variational posterior to choice of prior and likelihood.
• ELBO surgery - ELBO is the most common variational lower bound used to train VAEs, and this paper proposes an alternative way to write the ELBO: split up the KL(q(z|x)|p(z)) term into an index-code mutual information term and marginal KL term KL(q(z)|p(z)). Experiments show that the latter term is actually quite large, which suggests that we should focus on improving priors. This work is quite important because it provides a easy-to-implement debugging tool for anyone training VAEs.
• Operator Variational Inference, Normalizing Flows, and Real-NVP enable richer families of posterior approximations.
• Variational lossy autoencoders

### Open Research

There were a couple exciting software announcements at NIPS this year. First was OpenAI's announcement of Universe, a software interface that bridges all kinds of software to OpenAI Gym. This means we can bring RL to interact with environments such as the full-featured Chrome Browser, or desktop applications like Photoshop. Think of this as "AppleScript on steroids".

Deepmind released Deepmind Lab for FPS + 3D agent research in their Labyrinth environment, and earlier this year, Microsoft released Project Malmo for training AI agents in Minecraft. These are really interesting environments, and typically outside the capabilities of a grad student to implement on their own.

These are just a few of the high-profile software releases; many other tech companies and academic institutions are releasing benchmark software, deep learning frameworks, and data all the time.

People often cite 1) compute, 2) data, and 3) algorithms as the reason why DL is taking off, and I think we should add 4) the unprecedented accessibility of ML research. Tons of people are already using OpenAI gym for RL projects, and many research codes are being made public for other people to build on top of. The quality and scope of research benefits dramatically from students not having to write thousands of lines of support code, and instead focus on their research problem.

It’s great that large tech companies are making such great software available to the broader community, and this kind of work that is more impactful than any research paper I saw at the conference this year.

## Talks and Papers that I Liked

I had nowhere enough mental capacity to see all the papers and talks at NIPs, but I did write down some talks, papers, and posters that I came away really inspired by. Here’s a non-exhaustive list.

• Andrew Ng gave a practical tutorial on how to debug ML research projects. This is a must-watch for beginners in the field. Here is a video of a similar talk he gave at the Deep Learning summer school.
• Fixed Points of Belief Propagation - the authors formulate the convergence condition of belief propagation (BP) as the solution of polynomial equations. This makes me wonder if these methods could be used to bridge nonlinear dynamical systems with statistical inference.
• Deep Networks with Stochastic Depth - layer-wise dropout allows for training ResNets with >1000 layers.
• Layer Norm - like batch norm, layer norm is used to prevent covariance shift between successive deep layers. The advantage is that layer norm does not introduce gradient variance from other minibatch elements.
• Deep Successor Reinforcement Learning - I really liked this paper because it gets at model-based + model-free ideas. The motivation is that model-free methods like DQN take a long time to learn adequate low-level features because we are often trying to train a CNN to “see” using a single value scalar that is not directly related to image features. The idea is to learn “successor features” in an unsupervised manner (using rich supervision provided by model prediction error), and then separately learn a simple linear value function on top of these features. The paper is short and well-written.
• The Intelligent Biosphere by Drew Purves (couldn't find the talk slides). This talk was a bit of an oddball among the other mathy ones at NIPS (I thought the drawings were distracting and failed to capture the nuance of what Drew was saying), but I think the message is crucial for AI researchers who want to take a step back and think about what it means for a system to be “intelligent”. The basic idea is that “intelligence” extends beyond brains; something as simple as self-replicating RNA exhibit intelligent behavior at the evolutionary scale. The natural world is fractal, cyclic, and fuzzy, as opposed to the simplicity of our current simulated environments for benchmarking AI. The implication is that there may be challenges learning in these environments that don’t carry over from Atari games. One interesting comment Drew made was that in a biosphere, every organism is a resource to another organism. That is, learning and adaptation of each organism is not independent of other organisms (reminds me of GANs)! If you’re interested in these environment + intelligence ideas, similar arguments have been made in ethology and neuroethology texts.
• Input Convex Neural Networks - the authors presented a very interesting follow-up workshop poster at the Deep RL workshop (I can't find a paper), where they were able to solve continuous Q functions by parameterizing the Q-function to be convex with respect to the action vector of the agent (and non-convex with respect to the input state). This allows for efficient DQN policies in continuous action domains.
• Using Fast Weights to Attend to the Recent Past - basically an alternative to LSTM that seems to work a lot better.
• Phased LSTM - a way to learn an adaptive time scale for LSTM architectures, where the hidden state is updated at a learned frequency. This is ideal for learning from data with arbitrary sampling rates, such as from event-based vision sensors (which naturally solve some hard computer vision problems like aliasing and bandwidth constraints). It also happens that this Zurich lab happens to make a really awesome event-based camera and I’m really excited to see how they apply ML to it.
• Residual Networks Behave Like Ensembles of Relatively Shallow Networks - the title says it all. I thought the experiments could have been a lot better, but regardless, this work could potentially have a huge impact in new ways to train very deep ResNets quickly.
• Graph Convolutional Networks - generalization of convolutions to allow DNN processing on graph data structures (the kind with edges and vertices). This work is important for getting DL methods to learn approximate solutions to NP-hard graph problems.
• Minimizing Quadratic Functions in Constant Time - I did a double-take when I saw the title of this poster. I don’t understand the math very well, but the algorithm is very simple: subsample a $n \times n$ matrix to a $k \times k$ matrix and solve it. A simple scaling of the resulting minimized value yields an approximation to the minimized objective of the original system! In problems where we need to branch-and-bound a large search space (an optimal portfolio), we might find the value of a solution more informative than the solution itself.

## Predictions for 2017

• SGD + backprop will still be the best method for training deep neural nets in 2017.
• Really good transfer learning, domain adaptation, and/or multitask learning will be one of the major accomplishments in 2017 (this was suggested by Andrew Ng during his tutorial).
• Ensemble methods (e.g. training 10 versions of a network and averaging the predictions) will squeeze out 2-5% performance on all kinds of ML tasks.
• Widespread use of model-based + model-free RL, e.g. Reinforcement Learning with Unsupervised Auxiliary Tasks as a way to alleviate sample-complexity of model-free methods.

## Miscellaneous Notes

Some of these are paraphrased opinions of panelists and speakers. If you think I misrepresented your opinions in any way, please don't hesitate to reach out and I will make a correction!

• Boston Dynamics displayed a series of very entertaining and impressive robotics demos. Their basic approach appears to be high precision hardware that behaves faithfully to a simulated planning model, plus a bit of robust control to handle the imperfections. Machine Learning doesn’t seem to be an important priority for their approach to robotic control, but I think it's good to have uncorrelated research approaches. It's hard to say which one will end up working best. God, I hope Alphabet doesn’t sell them off. Their stuff is really cool.
• Apple has publicly announced that they will open up their research culture. I think a product-driven research culture has the potential to align research towards important problems (rather than maximizing citations, as is done in academia).
• Certain prop trading firms are using neural nets to do price forecasting. I had previously thought that all trading firms were too concerned about model interpretability to use neural nets.
• Speaking of hedge funds, many top hedge funds and trading shops came to NIPS to run career booths, but there was a surprising lack of interest from attendees (attendees were more interested in the likes of Apple, Facebook, Deepmind, Google, etc). At a regular college career fair in the East Coast, these roles are reversed. The talent pool at NIPS seems to be far more interested in tech and open-ended research than making money.
• Deers, despite being “herbivores”, will sometimes eat the heads of baby birds for extra calcium (this was from Drew Purves talk). Life rejects your reductionist labels, human!
• Engineering Principles From Stable and Developing Brains (Saket Navlakha) - interestingly enough, pruning unused neuron units seems to work better for learning than growing additional neurons.
• Yoshua Bengio - much better generalization (train vs. test error) happens when telling generative models what their latent codes should be.
• There was a great question asked during the Brains and Bits panel about why we are focusing so much intelligence research on humans, instead of simpler animals like octopi and birds. Yoshua says that focusing on simple models might cause researchers to overlook more general principles of intelligence. I agree with the argument, but not it’s premise - I think it is our AI algorithms that are simple, not the octopi and birds. I believe that we should build an octopi AI before a human one (see the "whole iguana" approach).
• Yann Lecun thinks model interpretability is overrated: we have lots of interpretable algorithms, but we still lack the basic principles with which to build a general-purpose learning system that can navigate the complexities of our natural world. I agree wholeheartedly with this - requiring interpretability unnecessarily confines us to look for solutions in models that we understand bottom-up. If history of scientific discovery is any clue, that is not always the case. In fact, serendipitous discovery sometimes invalidates our bottom-up assumptions! (e.g. "neural nets don't work").
• Yann is not a fan of the log-likelihood metric. Log-likelihood is evaluated under a particular model (e.g. PixelCNN decoder), and if the model is not good to begin with, then log-likelihood is somewhat meaningless.
• Andrew Saxe asked a great question at the Brain and Bits workshop: “what are fringe topics that might be important to solving intelligence?” Demis Hassabis replied that people don't regard the study of imagination, dreaming, consciousness as very scientific, but they may become serious practical questions within 5 years. Terry Sejnowski replied that understanding sleep is crucial, as it is very much a computational phenomenon rather than a torpid one. Yoshua wonders about what it means to "understand something", i.e. neural nets and other models can learn computations and statistical correlations, but how should semantic meaning be extracted?