Friday, December 28, 2018

Uncertainty: a Tutorial

A PDF version of this post can be found here.
Chinese translation by Xiaoyi Yin

Notions of uncertainty are tossed around in conversations around AI safety, risk management, portfolio optimization, scientific measurement, and insurance. Here are a few examples of colloquial use:

  • "We want machine learning models to know what they don't know.''
  • "An AI responsible for diagnosing patients and prescribing treatments should tell us how confident it is about its recommendations.''
  • "Significant figures in scientific calculations represent uncertainty in measurements.''
  • "We want autonomous agents to explore areas where they are uncertain (about rewards or predictions) so that they may discover sparse rewards.''
  • "In portfolio optimization, we want to maximize returns while limiting risk.''
  • "US equity markets finished disappointingly in 2018 due to increased geopolitical uncertainty.''

What exactly then, is uncertainty? 

Uncertainty measures reflect the amount of dispersion of a random variable. In other words, it is a scalar measure of how "random" a random variable is. In finance, it is often referred to as risk.

There is no single formula for uncertainty because there are many different ways to measure dispersion: standard deviation, variance, value-at-risk (VaR), and entropy are all appropriate measures. However, it's important to keep in mind that a single scalar number cannot paint a full picture of "randomness'', as that would require communicating the entire random variable itself! 

Nonetheless, it is helpful to collapse randomness down to a single number for the purposes of optimization and comparison. The important thing to remember is that "more uncertainty'' is usually regarded as "less good'' (except in simulated RL experiments).

Types of Uncertainty

Statistical machine learning concerns itself with the estimation of models $p(\theta|\mathcal{D})$, which in turn estimate unknown random variables $p(y|x)$. Multiple forms of uncertainty come into play here. Some notions of uncertainty describe inherent randomness that we should expect (e.g. outcome of a coin flip) while others describe our lack of confidence about our best guess of the model parameters.

To make things more concrete, let's consider a recurrent neural network (RNN) that predicts the amount of rainfall today from a sequence of daily barometer readings. A barometer measures atmospheric pressure, which often drops when its about to rain. Here's a diagram summarizing the rainfall prediction model along with different kinds of uncertainty.

Uncertainty can be understood from a simple machine learning model that attempts to predict daily rainfall from a sequence of barometer readings. Aleatoric uncertainty is irreducible randomness that arises from the data collection process. Epistemic uncertainty reflects confidence that our model is making the correct predictions. Finally, out-of-distribution errors arise when the model sees an input that differs from its training data (e.g. temperature of the sun, other anomalies).

Aleatoric Uncertainty

Aleatoric Uncertainty draws its name from the Latin root aleatorius, which means the incorporation of chance into the process of creation. It describes randomness arising from the data generating process itself; noise that cannot be eliminated by simply drawing more data. It is the coin flip whose outcome you cannot know.

In our rainfall prediction analogy, aleatoric noise arises from imprecision of the barometer. There are also important variables that the data collection setup does not observe: How much rainfall was there yesterday? Are we measuring barometric pressure in the present day, or the last ice age? These unknowns are inherent to our data collection setup, so collecting more data from that system does not absolve us of this uncertainty.

Aleatoric uncertainty propagates from the inputs to the model predictions. Consider a simple model $y = 5x$, which takes in normally-distributed input $x \sim \mathcal{N}(0,1)$. In this case, $y \sim \mathcal{N}(0, 5)$, so the aleatoric uncertainty of the predictive distribution can be described by $\sigma=5$. Of course, predictive aleatoric uncertainty is more challenging to estimate when the random structure of the input data $x$ is not known.

One might think that because aleatoric uncertainty is irreducible, one cannot do anything about it and so we should just ignore it. No! One thing to watch out for when training models is to choose an output representation capable of representing aleatoric uncertainty correctly. A standard LSTM does not emit probability distributions, so attempting to learn the outcome of a coin flip would just converge to the mean. In contrast, models for language generation emit a sequence of categorical distributions (words or characters), which can capture the inherent ambiguity in sentence completion tasks. 

Epistemic Uncertainty

"Good models are all alike; every bad model is wrong in its own way."

Epistemic Uncertainty is derived from the Greek root epistēmē, which pertains to knowledge about knowledge. It measures our ignorance of the correct prediction arising from our ignorance of the correct model parameters.

Below is a plot of a Gaussian Process Regression model on some toy 1-dimensional dataset. The confidence intervals reflect epistemic uncertainty; the uncertainty is zero for training data (red points), and as we get farther away from training points, the model ought to assign higher standard deviations to the predictive distribution. Unlike aleatoric uncertainty, epistemic uncertainty can be reduced by gathering more data and "ironing out" the regions of inputs where the model lacks knowledge.

1-D Gaussian Process Regression Model showcasing epistemic uncertainty for inputs outside its training set.
There is a rich line of inquiry connecting Deep Learning to Gaussian Processes. The hope is that we can extend the uncertainty-awareness properties of GPs with the representational power of neural networks. Unfortunately, GPs are challenging to scale to the uniform stochastic minibatch setting for large datasets, and they have fallen out of favor among those working on large models and datasets.

If one wants maximum flexibility in choosing their model family, a good alternative to estimating uncertainty is to use ensembles, which is just a fancy way of saying "multiple independently learned models''. While GP models analytically define the predictive distribution, ensembles can be used to compute the empirical distribution of predictions.

Any individual model will make some errors due to randomized biases that occur during the training process. Ensembling is powerful because other models in the ensembles tend to expose the idiosyncratic failures of a single model while agreeing with the correctly inferred predictions.

How do we sample models randomly to construct an ensemble? In Ensembling with bootstrap aggregation, we start with a training dataset of size $N$ and sample $M$ datasets of size $N$ from the original training set (with replacement, so each dataset does not span the entire dataset). The $M$ models are trained on their respective datasets and their resulting predictions collectively form an empirical predictive distribution.

If training multiple models is too expensive, it is also possible to use Dropout training to approximate a model ensemble. However, introducing dropout involves an extra hyperparameter and can compromise single model performance (often unacceptable for real world applications where calibrated uncertainty estimation is secondary to accuracy). 

Therefore, if one has access to plentiful computing resources (as one does at Google), it is often easier to just re-train multiple copies of a model. This also yields the benefits of ensembling without hurting performance. This is the approach taken by the Deep Ensembles paper. The authors of this paper also mention that the random training dynamics induced by differing weight initializations was sufficient to introduce a diverse set of models without having to resort to reducing the training set diversity via bootstrap aggregation. From a practical engineering standpoint, it's smart to bet on risk estimation methods that do not get in the way of the model's performance or whatever other ideas the researcher wants to try.

Out-of-Distribution Uncertainty

For our rainfall predictor, what if instead of feeding in the sequence of barometer readings, we fed in the temperature of the sun? Or a sequence of all zeros? Or barometer readings from a sensor that reports in different units? The RNN will happily compute away and give us a prediction, but the result will likely be meaningless.

The model is totally unqualified to make predictions on data generated via a different procedure than the one used to create the training set. This is a failure mode that is often overlooked in benchmark-driven ML research, because we typically assume that the training, validation, and test sets consist entirely of clean i.i.d data. 

Determining whether inputs are "valid'' is a serious problem for deploying ML in the wild, and is known as the Out of Distribution (OoD) problem. OoD is also synonymous with model misspecification error and anomaly detection.

Besides its obvious importance for hardening ML systems, anomaly detection models are an intrinsically useful technology. For instance, we might want to build a system that monitors a healthy patient's vitals and alerts us when something goes wrong without necessarily having seen that pattern of pathology before. Or we might be managing the "health" of a datacenter and want to know whenever unusual activity occurs (disks filling up, security breaches, hardware failures, etc.)

Since OoD inputs only occur at test-time, we should not presume to know the distribution of anomalies the model encounters. This is what makes OoD detection tricky - we have to harden a model against inputs it never sees during training! This is exactly the standard attack scenario described in Adversarial Machine Learning.

There are two ways to handle OoD inputs for a machine learning model: 1) catch the bad inputs before we even put them through the model 2) let the "weirdness'' of model predictions imply to us that the input was probably malformed.

In the first approach, we assume nothing about the downstream ML task, and simply consider the problem of whether an input is in the training distribution or not. This is exactly what discriminators in Generative Adversarial Networks (GANs) are supposed to do. However, a single discriminator is not completely robust because it is only good for discriminating between the true data distribution and whatever the generator's distribution is; it can give arbitrary predictions for an input that lies in neither distribution. 

Instead of a discriminator, we could build a density model of the in-distribution data, such as a kernel density estimator or fitting a Normalizing Flow to the data. Hyunsun Choi and I investigated this in our recent paper on using modern generative models to do OoD detection.

The second approach to OoD detection involves using the predictive (epistemic) uncertainty of the task model to tell us when inputs are OoD. Ideally, malformed inputs to a model ought to generate "weird'' predictive distribution $p(y|x)$. For instance, Hendrycks and Gimpel showed that the maximum softmax probability (the predicted class) for OoD inputs tends to be lower than that of in-distribution inputs. Here, uncertainty is inversely proportional to the "confidence'' as modeled by the max sofmax probability. Models like Gaussian Processes give us these uncertainty estimates by construction, or we could compute epistemic uncertainty via Deep Ensembles.

In reinforcement learning, OoD inputs are actually assumed to be a good thing, because it represents inputs from the world that the agent does not know how to handle yet. Encouraging the policy to find its own OoD inputs implements "intrinsic curiosity'' to explore regions the model predicts poorly in. This is all well and good, but I do wonder what would happen if such curiousity-driven agents are deployed in real world settings where sensors break easily and other experimental anomalies happen. How does a robot distinguish between "unseen states" (good) and "sensors breaking" (bad)? Might that result in agents that learn to interfere with their sensory mechanisms to generate maximum novelty?

Who Will Watch the Watchdogs?

As mentioned in the previous section, one way to defend ourselves against OoD inputs is to set up  a likelihood model that "watchdogs" the inputs to a model. I prefer this approach because it de-couples the problem of OoD inputs from epistemic and aleatoric uncertainty in the task model. It makes things easy to analyze from an engineering standpoint.

But we should not forget that the likelihood model is also a function approximator, possibly with its own OoD errors! We show in our recent work on Generative Ensembles (and also showed in concurrent work by DeepMind), that under a CIFAR likelihood model, natural images from SVHN can actually be more likely than the in-distribution CIFAR images themselves!

Likelihood estimation involves a function approximator that can itself be susceptible to OoD inputs. A likelihood model of CIFAR assigns higher probabilities to SVHN images than CIFAR test images!

However, all is not lost! It turns out that the epistemic uncertainty of likelihood models is an excellent OoD detector for the likelihood model itself. By bridging epistemic uncertainty estimation with density estimation, we can use ensembles of likelihood models to protect machine learning models against OoD inputs in a model-agnostic way.

Calibration: the Next Big Thing?

A word of warning: just because a model is able to spit out a confidence interval for a prediction doesn't mean that the confidence interval actually reflects the actual probabilities of outcomes in reality! 

Confidence intervals (e.g. $2\sigma$) implicitly assume that your predictive distribution is Gaussian-distributed, but if the distribution you're trying to predict is multi-modal or heavy-tailed, then your model will not be well calibrated!

Suppose our rainfall RNN tells us that there will be $\mathcal{N}(4, 1)$ inches of rain today. If our model is calibrated, then if we were to repeat this experiment over and over again under identical conditions (possibly re-training the model each time), we really would observe empirical rainfall to be distributed exactly $\mathcal{N}(4, 1)$. 

Machine Learning models developed by academia today mostly optimize for test accuracy or some fitness function. Researchers are not performing model selection by deploying the model in repeated identical experiments and measuring calibration error, so unsurprisingly, our models tend to be poorly calibrated.

Going forward, if we are to trust ML systems deployed in the real world (robotics, healthcare, etc.), I think a much more powerful way to "prove our models understand the world correctly'' is to test them for statistical calibration. Good calibration also implies good accuracy, so it would be a strictly higher bar to optimize against. 

Should Uncertainty be Scalar?

As useful as they are, scalar uncertainty measures will never be as informative as the random variables they describe. I find methods like particle filtering and Distributional Reinforcement Learning very cool because they are algorithms that operate on entire distributions, freeing us from resorting to simple normal distributions to keep track of uncertainty. Instead of shaping ML-based decision making with a single scalar of "uncertainty", we can now query the full structure of distributions when deciding what to do. 

The Implicit Quantile Networks paper (Dabney et al.) has a very nice discussion on how to construct "risk-sensitive agents'' from a return distribution. In some environments, one might favor an opportunitistic policy that prefers to explore the unknown, while in other environments unknown things may be unsafe and should be avoided. The choice of risk measure essentially determines how to map the distribution of returns to a scalar quantity that can be optimized against. All risk measures can be computed from the distribution, so predicting full distributions enables us to combine multiple definitions of risk easily. Furthermore, supporting flexible predictive distributions seems like a good way to improve model calibration.

Performance of various risk measures on Atari games as reported by the IQN paper.

Risk measures are a deeply important research topic to financial asset managers. The vanilla Markowitz portfolio objective minimizes a weighted variance of portfolio returns $\frac{1}{2}\lambda w^T \Sigma w$. However, variance is an unintuitive choice of "risk'' in financial contexts: most investors don't mind returns exceeding expectations, but rather wish to minimize the probability of small or negative returns. For this reason, risk measures like Value-at-Risk, Shortfall Probability, and Target Semivariance, which only pay attention to the likelihood of "bad'' outcomes, are more useful objectives to optimize.

Unfortunately, they are also more difficult to work with analytically. My hope is that research into distributional RL, Monte Carlo methods, and flexible generative models will allow us to build differentiable relaxations of risk measures that can play nicely with portfolio optimizers. If you work in finance, I highly recommend reading the IQN paper's "Risks in Reinforcement Learning" section.


Here's a recap of the main points of this post:
  • Uncertainty/risk measures are scalar measures of "randomness''. Collapsing a random variable to a single number is done for optimization and mathematical convenience.
  • Predictive uncertainty can be decomposed into aleatoric (irreducible noise arising from data collection process), epistemic  (ignorance about true model), and out-of-distribution (at test time, inputs may be malformed).
  • Epistemic uncertainty can be mitigated by softmax prediction thresholding or ensembling.
  • Instead of propagating OoD uncertainty to predictions, we can use a task-agnostic filtering mechanism that safeguards against "malformed inputs''.
  • Density models are a good choice for filtering inputs at test time. However, it's important to recognize that density models are merely approximations of the true density function, and are themselves susceptible to out-of-distribution inputs.
  • Self-plug:Generative Ensembles reduce epistemic uncertainty of likelihood models so they can be used to detect OoD inputs. 
  • Calibration is important and underappreciated in research models.
  • Some algorithms (Distributional RL) extend ML algorithms to models that emit flexible distributions, which provides more information than a single risk measure.

Further Reading

Friday, November 30, 2018

Machine Learning Memes

A periodically-updated list of my favorite Deep Learning memes. Enjoy!

content warning: may contain crude humor.

Wednesday, August 8, 2018

Dijkstra's in Disguise

You can find a PDF version of this blog post here.

A weighted graph is a data structure consisting of some vertices and edges, and each edge has an associated cost of traversal. Let's suppose we want to compute the shortest distance from vertex $u$ to every other vertex $v$ in the graph, and we express this cost function as $\mathcal{L}_u(v)$.

For example, if each edge in this graph has cost $1$, $\mathcal{L}_u(v) = 3$.

Dijkstra's, Bellman-Ford, Johnson's, Floyd-Warshall are good algorithms for solving the shortest paths problem. They all share the principle of relaxation, whereby costs are initially overestimated for all vertices and gradually corrected for using a consistent heuristic on edges (the term "relaxation" in the context of graph traversal is not be confused with "relaxation" as used in an optimization context, e.g. integer linear programs). The heuristic can be expressed in plain language as follows:

It turns out that many algorithms I've encountered in my computer graphics, finance, and reinforcement learning studies are all variations of this relaxation principle in disguise. It's quite remarkable (embarrassing?) that so much of my time has been spent on such a humble technique taught in introductory computer science courses!

This blog post is a gentle tutorial on how all these varied CS topics are connected. No prior knowledge of finance, reinforcement learning, or computer graphics is needed. The reader should be familiar with undergraduate probability theory, introductory calculus, and be willing to look at some math equations. I've also sprinkled in some insights and questions that might be interesting to the AI research audience, so hopefully there's something for everybody here.


Here's a quick introduction to Bellman-Ford, which is actually easier to understand than the famous Dijkstra's Algorithm.

Given a graph with $N$ vertices and costs $\mathcal{E}(s, v)$ associated with each directed edge $s \to v$, we want to find the cost of the shortest path from a source vertex $u$ to each other vertex $v$. The algorithm proceeds as follows: The cost to reach $u$ from itself is initialized to $0$, and all the other vertices have distances initialized to infinity. 

The relaxation step (described in the previous section) is performed across all edges in any order for each iteration. The correct distances from $u$ are guaranteed to have propagated completely to all vertices after $N-1$ iterations, since the longest of the shortest paths contain at most $N$ unique vertices. If the relaxation condition indicates there are still yet shorter paths after $N$ iterations, it implies the presence of a cycle whose total cost is negative. You can find a nice animation of the Bellman-Ford algorithm here.

Below is the pseudocode:

Currency Arbitrage

Admittedly, all this graph theory seems sort of abstract and boring at first. But would it still be boring if I told you that efficiently detecting negative cycles in graphs is a multi-billion dollar business

The foreign exchange (FX) market, where one currency is traded for another, is the largest market in the world, with about 5 trillion USD being traded every day. This market determines the exchange rate for local currencies when you travel abroad. Let's model a currency exchange's order book (the ledger of pending transactions) as a graph:
  • Each vertex represents a currency (e.g. JPY, USD, BTC).
  • Each directed edge represents the conversion of currency $A$ to currency $B$.

An arbitrage opportunity exists if the product of exchange rates in a cycle exceeds $1$, which means that you can start with 1 unit of currency $A$, trade your way around the graph back to currency $A$, and then end up with more than 1 unit of $A$!

To see how this is related to the Bellman-Ford algorithm, let each currency pair $(A, B)$ with conversion rate $\frac{B}{A}$ be represented as a directed edge from $A$ to $B$ with edge weight $\mathcal{E}(A,B) = \log \frac{A}{B}$. Rearranging the terms,

The above algebra shows that if the sum of edge weights in a cycle is negative, it is equivalent to the product of exchange rates exceeding $1$. The Bellman-Ford algorithm can be directly applied to detect currency arbitrage opportunities! This also applies to all fungible assets in general, but currencies tend to be the most strongly-connected vertices in the graph representing the financial markets.

In my sophomore year of college, I caught the cryptocurrency bug and set out to build an automated arbitrage bot for scraping these opportunities in exchanges. Cryptocurrencies - being unregulated speculative digital assets - are ripe for cross-exchange arbitrage opportunities:

  • Inter-exchange transaction costs are low (assets are ironically centralized into hot and cold wallets).
  • Lots of speculative activity, whose bias generates lots of mispricing.
  • Exchange APIs expose much more order book depth and require no license to trade cryptos. With a spoonful of Python and a little bit of initial capital, you can trade nearly any crypto you want across dozens of exchanges..
Now we have a way to automatically detect mispricings in markets and end up with more money than we started with. Do we have a money printing machine yet? 

Not so fast! A lot of things can still go wrong. Exchange rates fluctuate over time and other people are competing for the same trade, so the chances of executing all legs of the arbitrage are by no means certain. 

Execution of trading strategies is an entire research area on its own, and can be likened to crossing a frozen lake as quickly as possible. Each intermediate currency position, or "leg'', in an arbitrage strategy is like taking a cautious step forward. One must be able to forecast the stability of each step and know what steps proceed after, or else one can get "stuck'' holding a lot of a currency that gives out like thin ice and becomes worthless. Often the profit opportunity is not big enough to justify the risk of crossing that lake.

Simply taking the greedy minimum among all edge costs does not take into account the probability of various outcomes happening in the market. The right way to structure this problem is to think about edge weights being random variables that change over time. In order to compute the expected cost, we need to integrate over all possible path costs that can manifest. Hold this thought, as we will need to introduce some more terminology in the next few sections.

While the arbitrage system I implemented was capable of detecting arb opportunities, I never got around to fully automating the execution and order confirmation subsystems. Unfortunately, I got some coins stolen and lost interest in cryptos shortly after. To execute arb opportunities quickly and cheaply I had to keep small BTC/LTC/DOGE positions in each exchange, but sometimes exchanges would just vanish into thin air. Be careful of what you wish for, or you just might find your money "decentralized'' from your wallet! 

Directional Shortest-Path

Let's introduce another cost function, the directional shortest path $\mathcal{L}_u(v, s \to v)$, that computes the shortest path from $u$ to $v$, where the last traversed edge is from $s \to v$. Just like making a final stop at the bathroom $s$ before boarding an airplane $v$.

Note that the original shortest path cost $\mathcal{L}_u(v)$ is equivalent to the smallest directional shortest path cost among all of $v$'s neighboring vertices, i.e. $\mathcal{L}_u(v) = \min_{s} \mathcal{L}_u(v, s \to v)$ 

Shortest-path algorithms typically associate edges with costs, and the objective is to minimize the total cost. This is also equivalent to trying to maximize the negative cost of the path, which we call $\mathcal{Q}_u = -\mathcal{L}_u(v)$. Additionally, we can re-write this max-reduction as a sum-reduction, where each $\mathcal{Q}_u$ term is multiplied by an indicator function that is $1$ when its $\mathcal{Q}_u$ term is the largest and $0$ otherwise.

Does this remind you of any well-known algorithm? 

If you guessed "Q-Learning", you are absolutely right! 


Reinforcement learning (RL) problems entail an agent interacting with its environment such that the total expected reward $R$ it receives is maximized over a multi-step (maybe infinite) decision process. In this setup, the agent will be unable to take further actions or receive additional rewards after transitioning to a terminal (absorbing) state.

There are many ways to go about solving RL problems, and we'll discuss just one kind today: value-based algorithms, attempt to recover a value function $Q(s,a)$ that computes the maximum total reward an agent can possibly obtain if it takes an action $a$ at state $s$.

Wow, what a mouthful! Here's a diagram of what's going on along with an annotated mathematical expression.

Re-writing the shortest path relaxation procedure in terms of a directional path cost recovers the Bellman Equality, which underpins the Q-Learning algorithm. It's no coincidence that Richard Bellman of Bellman-Ford is also the same Richard Bellman of the Bellman Equality! Q-learning is a classic example of dynamic programming. 

For those new to Reinforcement Learning, it's easiest to understand Q-Learning in the context of an environment that yields a reward only at the terminal transition:

  • The value of state-action pairs $(s_T, a_T)$ that transition to a terminal state are easy to learn - it is just the sparse reward received as the episode ends, since the agent can't do anything afterwards. 
  • Once we have all those final values, the value for $(s_{T-1}, a_{T-1})$ leading to those states are "backed up'' (backwards through time) to the states that transition to them. 
  • This continues all the way to the state-action pairs $(s_1, a_1)$ encountered at the beginning of episodes. 

Handling Randomness in Shortest-Path Algorithms

Remember the "thin ice'' analogy from currency arbitrage? Let's take a look at how modern RL algorithms are able to handle random path costs. 

In RL, the agent's policy distribution $\pi(a|s)$ is a conditional probability distribution over actions, specifying how the agent behaves randomly in response to observing some state $s$. In practice, policies are made to be random in order to facilitate exploration of environments whose dynamics and set of states are unknown (e.g. imagine the RL agent opens its eyes for the first time and must learn about the world before it can solve a task). Since the agent's sampling of action $a \sim \pi(a|s)$ from the policy distribution are immediately followed by computation of environment dynamics $s^\prime = f(s, a)$, it's equivalent to view randomness as coming from a stochastic policy distribution or stochastic transition dynamics. We redefine a notion of Bellman consistency for expected future returns:

By propagating expected values, Q-learning allows for shortest-path algorithms to essentially be aware of the expected path length, and take transition probabilities of dynamics/policies into account.

Modern Q-Learning

This section discusses some recent breakthroughs in RL research, such as Q-value overestimation, Softmax Temporal Consistency, Maximum Entropy Reinforcement Learning, and Distributional Reinforcement Learning. These cutting-edge concepts are put into the context of shortest-path algorithms as discussed previously. If any of these sound interesting and you're willing to endure a bit more math jargon, read on -- otherwise, feel free to skip to the next section on computer graphics.

Single-step Bellman backups during Q-learning turn out to be rather sensitive to random noise, which can make training unstable. Randomness can come from imperfect optimization over actions during the Bellman Update, poor function approximation in the model, random label noise (e.g. human error in assigning labels to a robotic dataset), stochastic dynamics, or uncertain observations (partial observability). All of these can violate the Bellman Equality, which may cause learning to diverge or get stuck in a poor local minima.

Sources of noise that arise in Q-learning which violate the hard Bellman Equality.
A well-known problem among RL practitioners is that Q-learning suffers from over-estimation; during off-policy training, predicted Q-values climb higher and higher but the agent doesn't get better at solving the task. Why does this happen?

Even if $Q_\theta$ is an unbiased estimator of the true value function, any variance in the estimate is converted into upward bias during the Bellman update. A sketch of the proof: assuming Q values are uniformly or normally distributed about the true value function, the Fisher–Tippett–Gnedenko theorem tells us that applying the max operator over multiple normally-distributed variables is mean-centered around a Gumbel distribution with a positive mean. Therefore the updated Q function, after the Bellman update is performed, will obtain some positively skewed bias! One way to deal with this is double Q-learning, which re-evaluates the optimal next-state action value using an i.i.d $Q$ function. Assuming Q-value noise is independent of the max action, the use of a i.i.d Q function for scoring the best actions makes max-Q estimation unbiased again.

Dampening Q values can also be accomplished crudely by decreasing the discount factor (0.95 is common for environments like Atari), but $\gamma$ is kind of a hack as it is not a physically meaningful quantity in most environments.

Yet another way to decrease overestimation of Q values is to "smooth'' the greediness of the max-operator during the Bellman backup, by taking some kind of weighted average over Q values, rather than a hard max that only considers the best expected value. In discrete action spaces with $K$ possible actions, the weighted average is also known as a "softmax'' with a temperature parameter:

$$\verb|softmax|(x, \tau) = \mathbf{w}^T \mathbf{x}$$


$$\mathbf{w}_i = \frac{e^{\mathbf{x}_i/\tau}}{\sum_{j=1}^{K}{e^{\mathbf{x}_j/\tau}}}$$

Intuitively, the "softmax'' can be thought of as a confidence penalty on how likely we believe $\max Q(s^\prime, a^\prime)$ to be the actual expected return at the next time step.  Larger temperatures in the softmax drag the mean away from the max value, resulting in more pessimistic (lower) Q values. Because of this temeprature-controlled softmax, our reward objective is no longer simply to "maximize expected total reward''; rather, it is more similar to "maximizing the top-k expected rewards''. In the infinite-temperature limit, all Q-values are averaged equally and the softmax becomes a mean, corresponding to the return of a completely random policy. Hold that thought, as this detail will be visited again when we discuss computer graphics!

This modification to the standard Hard-Max Bellman Equality is known as Softmax Temporal Consistency. In continuous action spaces, the backup through an entire episode can be thought of as repeatedly backing up expectations over integrals.

By introducing a confidence penalty as an implicit regularization term, our optimization objective is no longer optimizing for the cumulative expected reward from the environment. In fact, if the policy distribution has the form of a Boltzmann Distribution:

$$\pi(a|s) \sim \exp Q(s, a)$$

This softmax regularization has a very explicit, information-theoretic interpretation: it is the optimal solution for the Maximum-Entropy RL objective:

$$\pi_{\mathrm{MaxEnt}}^* = \arg\!\max_{\pi} \mathbb{E}_{\pi}\left[ \sum_{t=0}^T r_t + \mathcal{H}(\pi(\cdot | \mathbf{s}_t)) \right]$$

An excellent explanation for the maximum entropy principle is reproduced below from Brian Ziebart's PhD thesis:

When given only partial information about a probability distribution, $\tilde{P}$, typically many different distributions, $P$, are capable of matching that information. For example, many distributions have the same mean value. The principle of maximum entropy resolves the ambiguity of an under-constrained distribution by selecting the single distribution that has the least commitment to any particular outcome while matching the observational constraints imposed on the distribution.

This is nothing more than "Occam's Razor'' in the parlance of statistics. The Maximum Entropy Principle is a framework for limiting overfitting in RL models, as it limits the amount of information (in nats) contained by the policy. The more entropy a distribution has, the less information it contains, and therefore the less "assumptions'' about the world it makes. The combination of Softmax Temporal Consistency with Boltzmann Policies is known as Soft Q-Learning.

To draw a connection back to currency arbitrage and the world of finance, limiting the number of assumptions in a model is of paramount importance to quantiatiative researchers at hedge funds, since hundreds of millions of USD could be at stake. Quants have developed a rather explicit form of Occam's Razor by tending to rely on models with as few statistical priors as possible, such as Linear models and Gaussian Process Regression with simple kernels.

Although Soft Q-Learning can regularize against model complexity, updates are still backed up over single timesteps. It is often more effective to integrate rewards with respect to a "path'' of samples actually sampled at data collection time, than backing up expected Q values one edge at a time and hoping that softmax temporal consistency remains consistent well when accumulating multiple backups. 

Work from Nachum et al. 2017, O’Donoghue et al. 2016, Schulman et al. 2017 explore the theoretical connections between multi-step return optimization objectives (policy-based) and temporal consistency (value-based) objectives. The use of a multi-step return can be thought of as a path-integral solution to marginalizing out random variables occuring during a multi-step decision process (such as random non-Markovian dynamics). In fact, long before Deep RL research became popular, control theorists have been using path integrals for optimal control to tackle the problem of integrating multi-step stochastic dynamics [1, 2]. A classic example is the use of the Viterbi Algorithm in stochastic planning.

Once trained, the value function $Q(s,a)$ implies a sequence of actions an agent must do in order to maximize expected reward (this sequence does not have to be unique). In order for the $Q$ function to be correct, it must also implicitly capture knowledge about the expected dynamics that occur along the sequence of actions. It's quite remarkable that all this "knowledge of the world and one's own behavior'' can be captured into a single scalar.

However, this representational compactness can also be a curse!

Soft Q-learning and PGQ/PCL successfully back up expected values over some return distribution, but it's still a lot to ask of a neural network to capture all the knowledge about expected future dynamics, marginalize all the randomness into a single statistic. 

We may be interested in propagating other statistics like variance, skew, and kurtosis of the value distribution. What if we did Bellman backups over entire distributions, without having to throw away the higher-order moments? 

This actually recovers the motivation of Distributional Reinforcement Learning, in which "edges'' in the shortest path algorithm propagate distributions over values rather than collapsing everything into a scalar. The main contribution of the seminal Bellemare et al. 2017 paper is defining an algebra that generalizes the Bellman Equality to operate on distributions rather than scalar statistics of them. Unlike the path-integral approach to Q-value estimation, this framework avoids marginalization error by passing richer messages in the single-step Bellman backups. 

Soft-Q learning, PGQ/PCL, and Distributional Reinforcement Learning are "probabilistically aware'' reinforcement learning algorithms. They appear to be tremendously beneficial in practice, and I would not be surprised if by next year it becomes widely accepted that these techniques are the "physically correct'' thing to do, and hard-max Q-learning (as done in standard RL evaluations) is discarded. Given that multi-step Soft-Q learning (PCL) and Distributional RL take complementary approaches to propagating value distributions, I'm also excited to see whether the approaches can be combined (e.g. policy gradients over distributional messages).

Physically-Based Rendering

Ray tracing is not slow, computers are. -- James Kajiya

A couple of the aforementioned RL works make heavy use of the terminology "path integrals''. Do you know where else path integrals and the need for "physical correctness'' arise? Computer graphics!

Whether it is done by an illustrator's hand or a computer, the problem of rendering asks "Given a scene and some light sources, what is the image that arrives at a camera lens?''. Every rendering procedure -- from the first abstract cave painting to Disney's modern Hyperion renderer, is a depiction of light transported from the world to the eye of the observer.

Here are some examples of the enormous strides rendering technology has made in the last 20 years:

From top left, clockwise: Big City Overstimulation by Gleb Alexandrov. Pacific Rim, Uprising. The late Peter Cushing resurrected for a Star Wars movie. Remove Henry's Cavill's mustache to re-shoot some scenes because he needs the mustache for another movie.

Photorealistic rendering algorithms are made possible thanks to accurate physical models of how light behaves and interacts with the natural world, combined with the computational resources to actually represent the natural world in a computer. For instance, a seemingly simple object like a butterfly wing has an insane amount of geometric detail, and light interacts with this geometry to produce some macroscopic effect like iridescence.

Light transport involves far too many calculations for a human to do by hand, so the old master painters and illustrators came up with a lot of rules about how light behaves and interacts with everyday scenes and objects. Here are some examples of these rules:

  • Cold light has a warm shadow, warm light has a cool shadow.
  • Light travels through tree leaves, resulting in umbras that are less "hard" than a platonic sphere or a rock.
  • Clear water and bright daylight result in caustics.
  • Light bounces off flat water like a billiard ball with a perfectly reflected incident angle, but choppy water turns white and no longer behaves like a mirror.

You can get quite far on a big bag of heuristics like these. Here are some majestic paintings from the Hudson River School (19th century).

Albert Bierstadt, Scenery in the Grand Tetons, 1865-1870
Albert Bierstadt, Among the Sierra Nevada Mountains, California, 1868
Mortimer Smith: Winter Landscape, 1878
However, a lot of this painterly understanding -- though breathtaking -- was non-rigorous and physically inaccurate. Scaling this up to animated sequences was also very laborious. It wasn't until 1986, with the independent discovery of the rendering equation by David Immel et al. and James Kajiya, that we obtained physically-based rendering algorithms.

Of course, the scene must obey the conservation of energy transport: the electromagnetic energy being fed into the scene (via radiating objects) must equal the total amount of electromagnetic energy being absorbed, reflected, or refracted in the scene. Here is the rendering equation explained in an annotated equation:

A Monte Carlo estimator is a method for estimating high-dimensional integrals, by simply taking the expectation over many independent samples of an unbiased estimator. Path-tracing is the simplest Monte-Carlo approximation possible to the rendering equation. I've borrowed some screenshots from Disney's very excellent tutorial on production path tracing to explain how "physically-based rendering'' works. 

Initially, the only thing visible to the camera is the light source. Let there be light!
A stream of photons is emitted from the light and strikes a surface (in this case, a rock). It can be absorbed into non-visible energy, reflected off the object, or refracted into the object.
Any reflected or refracted light is emitted from the surface and continues in another random direction, and the process repeats until there are no photons left or it is absorbed by the camera lens.
This process is repeated ad infinum for many rays until the inflow vs. outflow of photons reaches equilibrium or the artist decides that the computer has been rendering for long enough. The total light contribution to a surface is a path integral over all these light bounce paths.
This equation has applications beyond entertainment: the inverse problem is studied in astrophysics simulations (given observed radiance of a supernovae, what are the properties of its nuclear reactions?), and the neutron transport problem. In fact, Monte Carlo methods for solving integral equations were developed for studying fissile reactions for the Manhattan Project! The rendering integral is also an Inhomogeneous Fredholm equations of the second kind, which have the general form:

$${\displaystyle \varphi (t)=f(t)+\lambda \int _{a}^{b}K(t,s)\varphi (s)\,\mathrm {d} s.}$$

Take another look at the rendering equation. Déjà vu, anyone?

Once again, path tracing is nothing more than the Bellman-Ford heuristic encountered in shortest-path algorithms! The rendering integral is taken over the $4\pi$ steradian's of surface area on a unit sphere, which cover all directions an incoming light ray can come from. If we interpret this area integration probabilistically, this is nothing more than the expectation (mean) over directions sampled uniformly from a sphere.

This equation takes the same form as the high-temperature softmax limit for Soft Q-learning! Recall that as $\tau \to \infty$, softmax converges to an expectation over a uniform distribution, i.e. a policy distribution with maximum entropy and no information. Light rays have no agency, they merely bounce around the scene like RL agents taking completely random actions! 

The astute reader may wonder whether there is also a corresponding "hard-max'' version of rendering, just as hard-max Bellman Equality is to the Soft Bellman Equality in Q-learning. 

The answer is yes! The recursive raytracing algorithm (invented before path-tracing, actually) was a non-physical approximation of light transport that assumes the largest of lighting contributions reflected off a surface comes from one of the following light sources:

  1. Emitting material
  2. Direct exposure to light sources
  3. Strongly reflected light (i.e. surface is a mirror)
  4. Strongly refracted light (i.e. surface is made of glass or water).

In the case of reflected and refracted light, recursive trace rays are branched out to perform further ray intersection, usually terminating at some fixed depth.

Raytracing approximation to the rendering equation.

Because ray tracing only considers the maximum contribution directions, it is not able to model indirect light, such as light bouncing off a bright wall and bleeding into an adjacent wall. Although these contributions are minor in today setups like Cornell Boxes, they play a dominant role in rendering pictures of snow, flesh, and food. 

Below is a comparison of a ray-traced image and a path-traced image. The difference is like night and day:

Prior work has drawn connections between light transport and value-based reinforcement learning, and in fact Dahm and Keller 2017 leverage Q-learning to learn optimal selection of "ray bounce actions'' to accelerate importance sampling in path tracing. Much of the physically-based rendering literature considers the problem of optimal importance sampling to minimize variance of the path integral estimators, resulting in less "noisy'' images. 

For more information on physically-based rendering, I highly recommend Benedikt Bitterli's interactive tutorial on 2D light transport, Pat Hanrahan's book chapter on Monte Carlo Path Tracing, and the authoritative PBR textbook.

Summary and Questions

We have 3 very well-known algorithms (currency arbitrage, Q-learning, path tracing) that independently discovered the principle of relaxation used in shortest-path algorithms such as Dijkstra's and Bellman-Ford. Remarkably, each of these disparate fields of study discovered notions of hard and soft optimality, which is relevant in the presence of noise or high-dimensional path integrals. Here is a table summarizing the equations we explored:

These different fields have quite a lot of ideas that could be cross-fertilized. Just to toss some ideas out there (a request for research, if you will):

  • There has been some preliminary work on using optimal control to reduce sample complexity of path tracing algorithms. Can sampling algorithms used in rendering be leveraged for reinforcement learning?
  • Path tracing integrals are fairly expensive because states and actions are continuous and each bounce requires ray-intersecting a geometric data structure. What if we do light transport simulations on a point cloud with a precomputed visibility matrix between all points, and use that as an approximation for irradiance caching / final-gather?
  • Path tracing is to Soft Q-Learning as Photon Mapping is to ...?
  • Has anyone ever tried using the Maximum Entropy principle as a regularization framework for financial trading strategies?
  • The selection of a proposal distribution for importance-sampled Monte Carlo rendering could utilize Boltzmann Distributions with soft Q-learning. This is nice because the proposal distribution over recursive ray directions has infinite support by construction, and Soft Q-learning can be used to tune random exploration of light rays.
  •  Is there a distributional RL interpretation of path tracing, such as polarized path tracing?
  • Given the equivalence between Q Learning and shortest path algorithms, it's interesting to note that in Deep RL research, we carefully initialize weights but leave the Q-function values fairly arbitrary. However, all shortest-path algorithms rely on initializing costs to negative infinity, so that costs being propagated during relaxation correspond to actually realizable paths. Why aren't we initializing all function values to negative-valued numbers?


I'm very grateful to Austin Chen, Deniz Oktay, Ofir Nachum, and Vincent Vanhoucke for proofreading and providing feedback to this post. All typos/factual errors are my own; please write to me if you spot additional errors. And finally, thank you for reading!

Thursday, June 21, 2018

Bots & Thoughts from ICRA2018

The 35th International Conference on Robotics and Automation took place from May 21-25. I had a fantastic time attending my first ICRA: here is a brief thematic overview of my conference experience, research areas I’m most excited about, and cool robots I saw being demoed.

  • Great conference. Well-organized, thought-provoking talks, very chill and not too corporate or hyped like some Machine Learning conferences (NIPS, ICLR).
  • At a meta level, most of the papers presented here will never see the light of user-facing products. I think the technology gap between academia and industry labs is only going to increase in the coming years, and this should warrant a real existential crisis for academic robotics research.
  • Research contributions in robotics (mostly from industry) are starting to make a difference in the real world. Deep Learning-based perception is a real game-changer. Lots of interest and funding opportunities from Big Tech, VC firms, startups, and even nation-states.
  • I came into the conference prepared to learn why control theorists are so skeptical about Deep Learning, and challenge my own biases as a Deep RL researcher. I came out of the conference rather disappointed in traditional approaches to robotics and an even stronger conviction in end-to-end learning, big data, and deep learning. Oops.

The Conference

ICRA 2018 was extremely well-organized. Brisbane is a beautiful, clean, and tourist-friendly city, and the conference venue was splendid. Some statistics:
We received over 2500 submissions, a new record, from 61 countries.

The 10 most popular keywords, in descending order, were: Deep Learning in Robotics and Automation, Motion and Path Planning, Learning and Adaptive Systems, Localization, SLAM, Multi-Robot Systems, Optimization and Optimal Control, Autonomous Vehicle Navigation, Mapping, Aerial Systems - Applications.

From the very large number of high quality papers we selected 1030 for presentation which represents an acceptance rate of 40.6%.

I really enjoyed talking to companies at the sponsor booths and learning what kinds of problems their robotic solutions solved in the real world. This was much more educational to me than the booths at NIPS 2017, where it felt very corporate and mostly staffed by recruiters (photo below for comparison).

Taken at NIPS 2017. Also, snacks at NIPS were not as good as those at ICRA.

There was abundant food & refreshments during poster sessions and breaks. The conference schedule was a neat little pamphlet that fit inside the registration badge (along with tickets to any ancillary events), and the full program was given to us digitally on a 32GB usb drive. I appreciated that all the paper presentations were done as spotlight videos uploaded to YouTube. This helps to make research much more accessible to folks who don't have the means to travel to Australia. Many thanks to Alex Zelinsky (General Chair) and Peter Corke (Program Chair) for such a well-run multi-track conference.

As per usual, the robotics community is very non-diverse, and I hope that we as a community can take up stronger diversity & inclusion efforts soon, especially given the socioeconomic stakes of this technology.

It’s no longer socially appropriate to just solve hard problems in 2018, researchers now need to think of societal & ethical issues when building such powerful technology. On the bright side, this is a strong sign that our research matters!

Real World Robots vs. Academic Robotics

Rodney Brooks gave an opening plenary talk in which he hated on academia, hated on the first-generation Roomba, hated on deep learning, and especially hated on Human-Robot Interaction (HRI) research.

I loved it. Polarizing opinions -- even downright offensive opinions -- spark the fire of insightful discourse and distributed consensus. Here’s the basic gist of his talk (emphasis that these are his views and not necessarily mine):

  • Three tremendous economic forces looming on the horizon: 1) aging population 2) climate change 3) urbanization (people moving to big cities).
  • These forces will demand new labor for fixing, growing, manufacturing, and assisting, all while demographic inversion (#1) creates a massive labor shortage (it’s already arrived in some countries). 
  • Cheap labor from China is a thing of the past. 
  • The politically charged “robots taking our jobs” rhetoric is neither helpful nor accurate: those jobs are not desirable. Many factories in China have 15-16% labor turnover -- per month! Rod showed the following picture of a meat processing plant and asked a poignant question: “Would you aspire for your children to be working in these jobs?”

  • Robotics & automation is the only answer that can sustain our desired standard of living.
  • It’s quite evident that Rod does not think much of academic robotics. Great pioneers in computer science (Lovelace, Englebart, Jobs given as examples) were not concerned with the petty stakes of getting tenure, getting papers accepted at conferences, what other people thought of their work, or how hard / impossible their goals were. 
  • As members of the academic rat race (attending an academic conference), it's important to keep things in perspective and realize that customers and end-users of robotics do not even know that ICRA exists. 
  • Customers -- not being roboticists -- will demand features that might even compromise functionality. Usually you just have to give in to their unreasonable demands. Customers who open up Roombas never read the manual! Customers demand that their Roombas go up and down in straight lines, even when random trajectories clean better! 
  • A surprisingly critical view of Human-Robotic-Interaction research from out of nowhere. Rod Brooks claims “If you rely on statistics (p-values) to prove your idea is good, it’s not good” and “most HRI research is not reproducible.” He has a pretty savage invitation for someone to go and try to re-implement famous HRI papers, then tweak some nuisance variables (like age of the user or ordering of questions) and in order to obtain the opposite conclusion.
  • “Enough papers. Go and invent great stuff.” 

I thought it was a great talk. Rod is quick to throw a wet blanket on new technologies like deep learning and self-driving cars, but his experience in real-world robotics is unparalleled and he is quite hard on his own work (“we had no idea what we were doing with the 1st-gen Roomba; we were lucky to have survived”), which is refreshing to hear from a tech CEO.

It’s worth understanding where Rod’s pessimism comes from, and perhaps taking it as an invitation to prove his 2018 technology timeline wrong. For instance, he predicts that dextrous hands will be generally available by 2030-2040. What kinds of breakthroughs would put us “ahead of schedule”?

Other talks at ICRA about real-world robotic applications were much less sardonic, but the subtext remained the same. In the Real World™ , people are just not that interested in outperforming benchmarks, or demonstrating how novel their ideas are, or whether your robot uses deep neural networks or not. People just want things to work, and for algorithms to be scalable to real-world data and uncontrolled environments.

Show Me the Robots!

Matthew Dunbabin gave an awesome talk on COTSBot, an underwater autonomous vehicle that uses Deep Learning-based detection network to identify Crown Of Thorns Starfish on the seabed, and then injects the starfish with a lethal saline solution. This prevents starfish infestations from eating too much live coral.

Previously, COTS management required human divers to manually inject each arm of the starfish, which was extremely tedious. More critically, this requires dextrous autonomous manipulation -- carefully injecting each tentacle, lifting up starfish to get the one underneath -- something that robots cannot do yet in 2018.

The game-changer was the development of a saline solution that could kill a Starfish with a single injection, which absolved the robot of requiring human-level manipulation skills.

On the perception side, it was interesting to learn that pre-training the starfish detection system on Youtube videos ended up not being that helpful, due to a large domain shift from “centered glamour shots” of YouTube cameras and the real-world perceptual data (with murky / low visibility, moonlit conditions).

Projects like COTSBot are deeply meaningful to the health of the ecosystem, but all the same, the video clip of the robot autonomously locating a starfish and jamming a syringe into it drew a nervous chuckle from the audience.

CSIRO uses these cool legged robots for patrolling large swaths of grassland for things like environmental surveys.

Along similar veins, Agility Robotics and Anybotics are starting to make pretty interesting legged data-gathering platforms. 
The Oil & Gas industry is a particularly interesting customer for these kinds of robots. As it turns out, oil rigs are similar to home robotics in several ways: 
  • The environment is designed for human usage. It’s currently not cost-effective to re-design homes & oil rigs around robots, so robots must instead work in a anthropocentric environment. 
  • Legged navigation needed to get around.
  • The one exception is underwater monitoring and repair, where the lack of safe human alternatives means that oil folks are willing to re-design the task to better suit robots. For example, designing modules that are replaceable, modular units rather than having human divers perform repairs underwater.

Here’s a neat touch sensor that uses a camera to measure contact forces via optical dispersion of the little holes in rubber.

I’m excited about using high-resolution cameras to implement haptic & touch sensing, and I think this technology can scale to whole-body “skin sensors”. Why now?
  • Wiring an entire epithelium is hard. I think stretchable optical waveguides and consolidating many bundles of light into a few camera sensors are a potential solution to scaling high-resolution touch and force sensing with minimal electronic overhead.
  • Planning contacts under an analytical Model-Predictive-Control framework (i.e. “the old way of doing robotics”) is harder if the exterior is soft. Even if the sensors were packed onto the robot somehow, roboticists would not know how to deal with that volume of data & the complexity of geometry in real-time.
  • Machine Learning can be the breakthrough technology to learn force sensing from raw, highly unstructured, uncalibrated “squish” data.

I predicted last year that event-based cameras would be a big deal, and I’m still quite excited about this technology. I think that a lot of robotics perception problems simply go away if the loop is ridiculously fast.

The fabrication capabilities of academic labs are rather disappointing. Many robotics projects are motivated by some bold idea (let's make self-healing robots, let's make robots that have flexible exoskeletons like cockroaches, let's make robots that grow), but the concept is executed on crude hardware and limited by material science. A lot of these hardware projects would be WAY more useful if they were miniaturized 10X into a small form factor, which makes me wonder if the highest-impact thing that would benefit all robotics projects is better manufacturing & miniaturization capability. We should be thinking of robots as Swiss watches, not Dynamixel servos. 

Speaking of Swiss Watches, the Queensland Zoo brought out some native Australian wildlife for us to look at and pet. I’m always humbled by the complexity and sheer majesty of nature’s designs of autonomous systems; they put us roboticists to shame.

Do you know what an amazing machine a ribosome is? That’s how real robotics is done.  

Robotics & Venture Capital

The inevitable demographic demand for robotic technology has drawn VCs in like sharks to chummed water.

There was a workshop panel on “investing in robotics” where deep-tech VCs fielded some questions about their firm’s strategy and what they look for in portfolio companies. Some notes from this session:

  • First-world governments like Australia, Singapore, Canada, and China are eager to invest in robotics. Unclear where USA’s AI strategy is under the current administration.
  • The most common question asked by VCs during the Startup Pitch competition was “tell us more about the technology you have”. I think VCs want to see a company that has one or more deep technology moats.
  • I asked whether robotic hardware would eventually become commoditized, with the majority of margins coming from robotic software. The response I got back was that the tradeoff between software/hardware moats is cyclic: new hardware companies (e.g. deep learning chips) emerge when software squeezes everything it can out of existing hardware. I don't really agree here - I think software will be the dominant differentiating factor in all manner of robotic platforms.
  • An audience member astutely pointed out the opportunity for a service industry surrounding robotics: “We keep talking about automobiles, but nobody is talking about gas stations”. A former colleague of mine mentioned that he has a friend interested in clothing for robots to wear.
  • Rodney Brooks had a "fireside chat" at this workshop, in which he lamented a continuous struggle in dealing with customers who don’t understand their own robotics needs. He recounted a war story from the early days of Rethink robotics:
RB: “Look, the basic idea of the Baxter robot is to use force sensing to replace precision actuation (which is very expensive). This saves you -- the customer -- a ton of money.”
Customer: “But why doesn’t your robot have precision like my current robot?”
RB: “That level of precision is not necessary for a great deal of tasks like pick-and-placing”
Customer: “But the robot I’m using already has precision! Why can’t you do that too?”
RB: “Fine, we’ll make a robot called Sawyer that has the precision you want, and it’ll end up costing 2 times more.”
Customer: “Now I am happy.”

  • Australia lacks a competitive funding ecosystem -- that is -- many VC firms competing against each other to secure deal flow. Back in the day, NICTA was the funding agency for research and commercialization, which was not founder-friendly. A competitive VC ecosystem makes starting a startup much more attractive, which in turn brings in founder talent from other countries.
  • Chinese robotics companies have done a rather impressive job cloning US robots for much cheaper price points, such as the AUBO-i5 (clone of UR3 by Universal Robots), and Laikago (clone of Spot Mini by Boston Dynamics). Coupled with China’s manufacturing expertise, I think this could be the force that commoditizes robot hardware.
There was a little “Robotics Startup Pitch” competition where startups pitched their companies to these VCs to get some exposure. Some pitches sounded like they came out of an ICO whitepaper, but there were a few promising companies with real technology in the mix.

My favorite company was Hebi Robotics, which was a spinoff out of CMU’s snake robot lab. They ended up commercializing the “snake segment” into a modular actuator that enables researchers to put together low-volume robot prototypes quickly, much like LEGOs. The robots you can build with these kits are ideal for AI & Machine Learning researchers: arms, hexapods, mobile bases with manipulation capabilities... 

The Spectre of Deep Learning

A spectre is haunting Robotics: the spectre of Deep Learning...

Raia Hadsell’s excellent plenary talk contained a thought-provoking slide (made by Vincent Vanhoucke in 2016):

This is the gist of the message:
  1. Deep learning-based algorithms + huge amounts of data have been steamrolling classical speech recognition, classical computer vision, and classical machine translation benchmarks. We’ve basically thrown away decades of “old pipelines” in favor of big neural nets. 
  2. Will the same thing happen to classical robotics?

Given the obvious historical trope, it was quite surprising how hostile many roboticists are to the idea of throwing everything out and replacing it with neural nets. Today, the robotics community is pretty intellectually divided into two camps.

  1. "Traditional" robotics based on control theory with analytic models and probabilistic planning.
  2. The deep learning folks, who dispense with analytical understanding and just throw massive compute and data at the problem and let "Software 2.0" figure out a program from the data. 

I think a big reason these camps can’t understand each others' perspective is that they are solving different robotics tasks with different requirements, so they end up talking past each other when arguing for the merits of their approaches.

The “traditional” approach to robotics is popular in control, where a desired (low-dimensional) state must be realized. Safety is a hard constraint. Tasks include learning dynamic gaits on a quadruped, planning to avoid bumping into things, flying things, balancing heavy things that should not fall down.

However, this line of research largely avoids the problem of perception and solving diverse tasks at a semantic level, largely delegating it to a state estimation problem with an off-the-shelf perception stack or even ignoring the problem of performing tasks altogether. 

I asked a student from a well-known quadruped lab how they weighed their dynamic stability cost function with a task-specific cost function and got a blank stare: “Uh... what task?”

The "deep learning way" of doing robotics has only become popular recently, and derives its roots from computer vision researchers extending perception models to also perform control. These approaches excel at handling unstructured environments, unreliable state estimation, and generalization (e.g. unseen objects) quite well. 

However, because they are entirely data-driven, they often fail to generalize in ways that analytical methods handle trivially. By construction, it’s easy to see how an IK solver with a correctly specified model will always be better than a neural net approximation of that function.

As an employee of the Church of Deep Learning™, I came into the conference prepared to question my faith and learn the perspective of other labs and their approaches to solving real world problems.

I ended up being sorely disappointed. Coming out of the conference, I’m more convinced than ever that Deep Learning based control of robotics is the only approach that will ever scale to unstructured environments within the next decade.

Why I believe strongly in Deep Learning-based approaches to robotics warrants a much longer blog post for the future, but for now I’ll say this:

  • Deep Learning is accessible - I mentioned earlier that Hebi Robotics is enabling non-roboticists to build research hardware. Deep Learning does the same thing, but for control software. You don’t need a strong mathematical foundation anymore to train neural networks. Just concatenate some layers, gather some data -- hey presto -- your robot supports a new end-effector! 
  • Deep RL Research is scalable - a RNN technique pioneered in NLP space could be immediately applied to improving RNNs used in a robotics context. Deep Learning research feels much more collaborative, because all these people working on diverse problems now speak the same language. One can now cast SVMs, linear models, optimization, probabilistic inference, and even cognitive neuroscience all into the common language of neural nets. A DL non-believer was complaining to me that we were heading towards a research monoculture, similar to how every roboticist was working on SLAM in the early 2000's. If this "monoculture" enables new scales of collaboration and bridging of learning theory across disciplines, then I’m all for it!
  • The problems one must solve to bring lab robotics to the real world are so challenging that data-driven approaches are the only way to make them tractable. Analytical approaches require too many compromises on robot design, too much hand-tuning for the specific system.
Finally, there are reasonable, diplomatic folks who believe in marrying the benefits of both data-driven learning and analytical task-specific knowledge, but this reminds me of computer vision people who believed in fine-tuning SVMs on top of the last layer of a deep neural net back in 2015 to “get the best of both worlds.”

Data, though expensive to gather, is always right.

Much of robotics is built on the mindset of obtaining geometry, exact localization, planning around exact dynamics. However, one exciting avenue of Deep Reinforcement Learning is that learning control from vision enables control strategies that only require gaze heuristics and optical flow, rather than precise tracking and acting in a localized Euclidean space. 

Mandyam Srinivasan gave a very interesting keynote talk on biologically-inspired aerial robotics in which they modeled honeybee’s abilities for estimating distance based on optical flow. It turns out that bees don't really estimate states and forces like robots do, they just map visual motion cues to wings beating faster and slower, and everything turns out fine.

In terms of sensors, I think moving beyond precise (and expensive!) IMUs and joint encoders and instead, sensing everything from vision is the way to go. All other sensors can probably be mapped to vision too (such as optical waveguides!), and maybe Convnets and RNNs can handle the rest.

Here’s another cool example of how data-driven software can even replace expensive force sensing. Remember those skin sensors I was talking about earlier? According to the work shown in the poster below, it turns out you can just use a RNN to predict those surface contact "force transients" simply by the amount of “feedback” the motors in your arm feel when they bump into something.

Looking Forward

I attended a workshop ("Grand Scientific Challenges for the Robot Companion of the Future") where the panel of distinguished roboticists were asked what they thought were the grand challenges & questions for robotics. Here were some of the responses:
  • Energy management (power consumption, batteries)
  • Predictions & mirror neurons
  • What is the generic representation of actions?
  • An understanding of Machines vs Life (this was Oussama Khatib)
  • Wearable exoskeleton, as if it were part of the body
  • Human-computer hybrid - cheap memory capacity
I'd like to add 3 additional technologies that I believe could be game-changing for robotics:

  1. Artificial Materials: synthesizing polymers with self-healing abilities, the ability to co-locate actuation, sensing, and computation in the same volume.
  2. Artificial Muscles: Modular, electrically or chemically actuated, and at the millimeter scale. 
  3. Artificial Life: Large-scale ecosystems of agents struggling to not die, compete for resources, and reproduce.

Will follow up on these in later blog posts...