Eric Jang

Robots Must Be Ephemeralized

2021-09-20T16:36:00.003-07:00

There is a subfield of robotics research called “sim-to-real” (sim2real) whereby one attempts to solve a robotic task in simulation, and then get a real robot to do the same thing in the real world. My team at Google utilizes Sim2Real techniques extensively in pretty much every domain we study, including locomotion and navigation and manipulation.

The arguments for doing robotic research in simulation are generally well-known in the community: more statistical reproducibility, less concern about safety issues during learning, avoiding the operational complexity of maintaining thousands of robots that wear down at differing rates. Sim2Real is utilized heavily on quadruped and five-finger hand platforms, because at present, such hardware can only be operated for a few hundred trials before it starts to wear down or break. When the dynamics of the system start to vary from episode to episode, learning becomes even more difficult.

In a previous blog post, I also discussed how iterating in simulation solves some tricky problems around new code changes invalidating old data. Simulation makes this a non-issue because it is relatively cheap to re-generate your dataset every time you change the code.

Despite significant sim2real advances in the last decade, I must confess that three years ago, I was still somewhat ideologically opposed to doing robotics research in simulation, on the grounds that we should revel in the richness and complexity of real data, as opposed to perpetually staying in the safe waters of simulation.

Following those beliefs, I worked on a three-year-long robotics project where our team eschewed simulation and focused the majority of our time on iterating in the real world (mea culpa). That project was a success, and the paper will be presented at the 2021 Conference on Robot Learning. However, in the process, I learned some hard lessons that completely reversed my stance on sim2real and offline policy evaluation. I now believe that offline evaluation technology is no longer optional if you are studying general-purpose robots, and I have pivoted my research workflows to rely much more heavily on these methods. In this blog post, I outline why it is tempting for roboticists to iterate directly in real life, and how the difficulty of evaluating general-purpose robots will eventually force us to increasingly rely on offline evaluation techniques such as simulation.

Two Flavors of Sim2Real

I’m going to assume the reader is familiar with basic sim2real techniques. If not, please check out this RSS’2020 workshop website for tutorial videos. There are broadly two ways to formalize sim2real problems.

One approach is to create an “adapter” that transforms simulated sensor readings to resemble real data as much as possible, so that a robot trained in simulation behaves indistinguishably in both simulation and real. Progress on generative modeling techniques such as GANs has enabled this to work even for natural images.

Another formulation of the sim2real problem is to train simulated robots under lots of randomized conditions. In becoming robust under varied conditions, the simulated policy can treat the real world as just another instance under the training distribution. OpenAI’s Dactyl took this “domain randomization” approach, and was able to get the robot to manipulate a Rubik’s cube without ever doing policy learning on real data.

Both the domain adaptation and domain randomization approaches in practice yield similar results when transferred to real, so their technical differences are not super important. The takeaway is that the policy is learned and evaluated on simulated data, then deployed in real with fingers crossed.

The Case For Iterating Directly In Real

Three years ago, my primary arguments against sim were related to the richness of data available to real vs simulated robots:

Reality is messy and complicated. It takes regular upkeep and effort to maintain neatness for a desk or bedroom or apartment. Meanwhile, robot simulations tend to be neat and sterile by default, with not a lot of “messiness” going on. In simulation, you must put in extra work to increase disorder, whereas in the real world, entropy increases for free. This acts as a forcing function for roboticists to focus on the scalable methods that can handle the complexity of the real world.
Some things are inherently difficult to simulate - in the real world, you can have robots interact with all manner of squishy toys and articulated objects and tools. Bringing those objects into a simulation is incredibly difficult. Even if one uses photogrammetry technology to scan objects, one still needs to set-dress objects in the scene to make a virtual world resemble a real one. Meanwhile, in the real world one can collect rich and diverse data by simply grabbing the nearest household object - no coding required.
Bridging the “reality gap” is a hard research problem (often requiring training high-dimensional generative models), and it’s hard to know whether these models are helping until one is running actual robot policies in the real world anyway. It felt more pragmatic to focus on direct policy learning on the test setting, where one does not have to wonder whether their training distribution differs from their test distribution.

To put those beliefs into context, at the time, I had just finished working on Grasp2Vec and Time-Contrastive-Networks, both of which leveraged rich real-world data to learn interesting representations. The neat thing about these papers was that we could train these models on whatever object (Grasp2Vec) or video demonstration (TCN) the researcher felt like mixing into the training data, and scale up the system without writing a single line of code. For instance, if you want to gather a teleoperated demonstration of a robot playing with a Rubik’s cube, you simply need to buy a Rubik’s cube from a store and put it into the robot workspace. In simulation, you would have to model a simulated equivalent of a Rubik’s cube that twists and turns just like a real one - this can be a multi-week effort just to align the physical dynamics correctly. It didn’t hurt that the models “just worked”: there wasn’t much iteration needed on the modeling front for us to start seeing cool generalization.

There were two more frivolous reasons I didn’t like sim2real:

Aesthetics: Methods that learn in simulation often rely on crutches that are only possible in simulation, not real. For example, using millions of trials with an online policy-gradient method (PPO, TRPO) or the ability to reset the simulation over and over again. As someone who is inspired by the sample efficiency of humans and animals, and who believes in the LeCake narrative of using unsupervised learning algorithms on rich data, relying on a “simulation crutch” to learn feels too ham-handed. A human doesn’t need to suffer a fatal accident to learn how to drive a car.

A “no-true-Scotsman” bias: I think there is a tendency for people who spend all their time iterating in simulation to forget the operational complexity of the real world. Truthfully, I may have just been envious of others who were publishing 3-4 papers a year on new ideas in simulated domains, while I was spending time answering questions like “why is the gripper closing so slowly?”

Suffering From Success: Evaluating General Purpose Robots

So how did I change my mind? Many researchers at the intersection of ML and Robotics are working towards the holy grail of “generalist robots that can do anything humans ask them”. Once you have the beginnings of such a system, you start to notice a host of new research problems you didn’t think of before, and this is how I came to realize that I was wrong about simulation.

In particular, there is a “Problem of Success”: how do we go about improving such generalist robots? If the success rate is middling, say, 50%, how do we accurately evaluate a system that can generalize to thousands or millions of operating conditions? The feeling of elation that a real robot has learned to do hundreds of things -- perhaps even things that people didn’t train them for -- is quickly overshadowed by uncertainty and dread of what to try next.

Let’s consider, for example, a generalist cooking robot - perhaps a bipedal humanoid that one might deploy in any home kitchen to cook any dish, including Wozniak’s Coffee Test (A machine is required to enter an average American home and figure out how to make coffee: find the coffee machine, find the coffee, add water, find a mug, and brew the coffee by pushing the proper buttons).

In research, a common metric we’d like to know is the average success rate - what is the overall success rate of the robot at performing a number of different tasks around the kitchen?

In order to estimate this quantity, we must average over the set of all things the robot is supposed to generalize to, by sampling different tasks, different starting configurations of objects, different environments, different lighting conditions, and so on.

For a single scenario, it takes a substantial number of trials to measure success rates with single digit precision:

The standard error of an estimated binomial success probability is given by sqrt(P*(1-P)/N), where P is the sample mean and N is the sample size. If your empirical mean of the success rate is 50% under N=5000 samples, this equation tells you that the standard error is 0.007. A more intuitive way to understand this is in terms of a confidence interval: there is a 95% epistemic probability that the true mean, which may not be exactly 50%, lies within the range [50 - 1.4, 50 + 1.4].
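To make this napkin math easy to rerun for other sample sizes, here is a minimal Python sketch of the formula above. The function names and the normal-approximation multiplier z=1.96 for a 95% interval are illustrative assumptions, not something taken from a particular library.

```python
import math

def binomial_standard_error(p_hat: float, n: int) -> float:
    """Standard error of an estimated success rate: sqrt(P * (1 - P) / N)."""
    return math.sqrt(p_hat * (1.0 - p_hat) / n)

def normal_confidence_interval(p_hat: float, n: int, z: float = 1.96):
    """Approximate interval p_hat +/- z * SE (z=1.96 assumed for ~95%)."""
    half_width = z * binomial_standard_error(p_hat, n)
    return p_hat - half_width, p_hat + half_width

def trials_needed(p_hat: float, half_width: float, z: float = 1.96) -> int:
    """Rough number of trials so that z * SE is at most half_width."""
    return math.ceil(z ** 2 * p_hat * (1.0 - p_hat) / half_width ** 2)

if __name__ == "__main__":
    # Worst case p = 0.5, which maximizes the standard error.
    for n in (30, 300, 3000, 5000):
        low, high = normal_confidence_interval(0.5, n)
        print(n, round(binomial_standard_error(0.5, n), 4), (round(low, 3), round(high, 3)))
```

Calling `trials_needed(0.5, 0.01)` gives a feel for how quickly the trial count grows as the desired interval shrinks.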

5000 trials is a lot of work! Rarely do real robotics experiments do anywhere near 300 or even 3000 evaluations to measure task success.

From Vincent Vanhoucke’s blog post, here is a table drawing a connection from your sample size (under the worst case of p=50%, which maximizes standard error) to the number of significant digits you can report:

Depending on the length of the task, it could take all day or all week or all month to run one experiment. Furthermore, until robots are sufficiently capable of resetting their own workspaces, a human supervisor needs to reset the workspace over and over again as one goes through the evaluation tasks.

One consequence of these napkin calculations is that pushing the frontier of robotic capability requires a series of incremental advances (e.g. 1% at a time) with extremely costly evaluation (5000 episodes per iteration), or a series of truly quantum advances that are so large in magnitude that it takes very few samples to know that the result is significant. Going from “not working at all” to “kind of working” is one example of a large statistical leap, but in general it is hard to pull these out of the hat over and over again.

Techniques like A/B testing can help reduce the variance of estimating whether one model is better than another one, but they still do not address the problem of the sample complexity of evaluation growing exponentially with the diversity of conditions the ML models are expected to generalize to.

What about a high-variance, unbiased estimator? One approach would be to sample a location at random, then a task at random, and then an initial scene configuration at random, and then aggregate thousands of such trials into a single “overall success estimator”. This is tricky to work with because it does not help the researcher drill into problems where learning under one set of conditions causes catastrophic forgetting of another. Furthermore, if the number of training tasks is many times larger than the number of evaluation samples and task successes are not independent, then there will be high variance in the overall success estimate.

What about evaluating general robots with a biased, low-variance estimator of the overall task success? We could train a cooking robot to make millions of dishes, but only evaluate on a few specific conditions - for example, measuring the robot’s ability to make banana bread and using that as an estimator for its ability to do all the other tasks. Catastrophic forgetting is still a problem - if the success rate of making banana bread is inversely correlated with the success rate of making stir-fry, then you may be crippling the robot in ways that you are no longer measuring. Even if that isn’t a problem, having to collect 5000 trials limits the number of experiments one can evaluate on any given day. Also, you end up with a lot of surplus banana bread.

The following is a piece of career advice, rather than a scientific claim: in general you should strive to be in a position where your productivity bottleneck is the number of ideas you can come up with in a single day, rather than some physical constraint that limits you to one experiment per day. This is true in any scientific field, whether it be in biology or robotics.

Lesson: Scaling up in reality is fast because it requires little to no additional coding, but once you have a partially working system, careful empirical evaluation in real life becomes increasingly difficult as you increase the generality of the system.

Ephemeralization

In his 2011 essay Software is Eating The World, venture capitalist Marc Andreessen pointed out that more and more of the value chain in every sector of the world was being captured by software companies. In the ensuing decade, Andreessen has refined his idea further to point out that “Software Eating The World” is a continuation of a technological trend, Ephemeralization, that precedes even the computer age. From Wikipedia:

Ephemeralization, a term coined by R. Buckminster Fuller in 1938, is the ability of technological advancement to do "more and more with less and less until eventually you can do everything with nothing,"

Consistent with this theme, I believe the solution to scaling up generalist robotics is to push as much of the iteration loop into software as possible, so that the researcher is freed from the sheer slowness of having to iterate in the real world.

Andreessen has posed the question of how future markets and industries might change when everybody has access to such massive leverage via “infinite compute”. ML researchers know that “infinite” is a generous approximation - it still costs 12M USD to train a GPT-3 level language model. However, Andreessen is directionally correct - we should dare to imagine a near future where compute power is practically limitless to the average person, and let our careers ride this tailwind of massive compute expansion. Compute and informational leverage are probably still the fastest growing resources in the world.

Software is also eating research. I used to work in a biology lab at UCSF, where only a fraction of postdoc time was spent thinking about the science and designing experiments. The majority of time was spent pipetting liquids into PCR plates, making gel media, inoculating petri dishes, and generally moving liquids around between test tubes. Today, it is possible to run a number of “standard biology protocols” in the cloud, and one could conceivably spend most of their time focusing on the high-brow experiment design and analysis rather than manual labor.

Imagine a near future where instead of doing experiments on real mice, we instead simulate a highly accurate mouse behavioral model. If such models turn out to be accurate, then medical science will be revolutionized overnight by virtue of researchers being able to launch massive-scale studies with billions of simulated mouse models. A single lab might be able to replicate a hundred years of mouse behavioral studies practically overnight. A scientist working on a laptop from a coffee shop might be able to design a drug, run clinical trials on it using a variety of cloud services, and get it FDA approved all from her laptop. When this happens, Fuller’s prediction will come true and it really will seem as if we can do “everything with nothing”.

Ephemeralization for Robotics

The most obvious way to ephemeralize robot learning in software is to make simulations that resemble reality as closely as possible. Simulators are not perfect - they still suffer from the reality gap and data richness problems that originally made me skeptical of iterating in simulation. But, having worked on general purpose robots directly in the real world, I now believe that people who want high-growth careers should actively seek workflows with the highest leverage, even if it means putting in the legwork to make a simulation as close to reality as possible.

There may be ways to ephemeralize robotic evaluation without having to painstakingly hand-design Rubik’s cubes and human behavior into your physics engine. One solution is to use machine learning to learn world models from data, and have the policy interact with the world model instead of the real world for evaluation. If learning high-dimensional generative models is too hard, there are off-policy evaluation methods and offline hyperparameter selection methods that don’t necessarily require simulation infrastructure. The basic intuition is that if you have a value function for a good policy, you can use it to score other policies on your real world validation datasets. The downside to these methods is that they often require finding a good policy or value function to begin with, and are only accurate for ranking policies up to the level of the aforementioned policy itself. A Q(s,a) function for a policy with a 70% success rate can tell you if your new model is performing around 70% or 30%, but is not effective at telling you whether you will get 95% (since these models don’t know what they don’t know). Some preliminary research suggests that extrapolation can be possible, but it has not yet been demonstrated at the scale of evaluating general-purpose robots on millions of different conditions.
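The "score other policies with a value function" intuition can be sketched in a few lines of Python. Everything below is a toy assumption for illustration: `toy_q` stands in for a learned Q(s,a), the two policies are hand-written functions, and states and actions are plain floats rather than images or joint commands.

```python
from statistics import mean

def score_policy(q_fn, policy, validation_states):
    """Average Q(s, policy(s)) over a fixed set of logged validation states."""
    return mean([q_fn(s, policy(s)) for s in validation_states])

def toy_q(state, action):
    # Hypothetical stand-in for a learned Q function: prefers action = 2 * state.
    return -(action - 2.0 * state) ** 2

def policy_a(state):
    return 2.0 * state   # agrees with what toy_q prefers

def policy_b(state):
    return 0.5 * state   # disagrees with toy_q

validation_states = [0.1, 0.5, 1.0, 1.5]
candidates = {"policy_a": policy_a, "policy_b": policy_b}

# Rank candidate policies by their offline score, best first.
ranking = sorted(
    candidates,
    key=lambda name: score_policy(toy_q, candidates[name], validation_states),
    reverse=True,
)
```

Note that the ranking is only as trustworthy as the Q function itself, which is the limitation described above.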

What are some alternatives to more realistic simulators? Much like the “lab in the cloud” business, there are some emerging cloud-hosted benchmarks such as AI2Thor and MPI’s Real Robot Challenge, where researchers can simply upload their code and get back results. The robot cloud provider handles all of the operational aspects of physical robots, freeing the researcher to focus on software.

One drawback of these setups is that these hosted platforms are designed for repeatable, resettable experiments, and do not have the diversity that general purpose robots would be exposed to.

Alternatively, one could follow the Tesla Autopilot approach and deploy their research code in “shadow mode” across a fleet of robots in the real world, where the model only makes predictions but does not make control decisions. This exposes evaluation to high-diversity data that cloud benchmarks don’t have, but suffers from the long-term credit assignment problem. How do we know whether a predicted action is good or not if the agent isn’t allowed to take those actions?

For these reasons, I think data-driven realistic simulation gets the best of both worlds - you get the benefits of real world diverse data and the ability to evaluate simulated long-term outcomes. Even if you are relying heavily on real-world evaluations via a hosted cloud robotics lab or a fleet running Shadow Mode, having a complementary software-only evaluation that provides additional signal can only help with saving costs and time.

I suspect that a practical middle ground is to combine multiple signals from offline metrics to predict success rate: leveraging simulation to measure success rates, training world models or value functions to help predict what will happen in “imagined rollouts”, adapting simulation images to real-like data with GANs, and using old-fashioned data science techniques (logistic regression) to study the correlations between these offline metrics and real evaluated success. As we build more general AI systems that interact with the real world, I predict that there will be cottage industries dedicated to building simulators for sim2real evaluation, and data scientists who build bespoke models for guessing the result of expensive real-world evaluations.
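As a rough illustration of the logistic regression idea, the sketch below fits a tiny model that maps two offline metrics to the probability of real-world success. The feature choice (simulated success rate, mean Q value), the synthetic numbers, and the function names are all made up for the example, and the plain gradient-descent fit is just one simple way to do it.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(features, labels, lr=0.5, steps=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, d = len(features), len(features[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j in range(d):
                grad_w[j] += err * x[j] / n
            grad_b += err / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias

def predict_success(weights, bias, x):
    """Predicted probability that a model with offline metrics x succeeds in real."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

# Synthetic toy data: [sim success rate, mean Q value] per model checkpoint,
# paired with whether the checkpoint passed a (hypothetical) real-world eval.
offline_metrics = [[0.9, 0.8], [0.8, 0.7], [0.7, 0.9], [0.6, 0.6],
                   [0.4, 0.3], [0.3, 0.2], [0.2, 0.3], [0.1, 0.1]]
real_success = [1, 1, 1, 1, 0, 0, 0, 0]

weights, bias = fit_logistic(offline_metrics, real_success)
```

With real data, the fitted weights give a first look at which offline metrics actually track real evaluated success.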

Separately from how ephemeralization drives down the cost of evaluating robots in the real world, there is the effect of ephemeralization driving down the cost of robot hardware itself. It used to be that robotics labs could only afford a couple of expensive robot arms from Kuka and Franka. Each robot would cost hundreds of thousands of dollars, because they had precisely engineered encoders and motors that enabled millimeter-level precision. Nowadays, you can buy some cheap servos from AliExpress.com for a few hundred dollars, glue them to some metal plates, and control them in a closed-loop manner using a webcam and a neural network running on a laptop.

Instead of relying on precise hardware position control, the arm moves based purely on vision and hand-eye coordination. All the complexity has been migrated from hardware to software (and machine learning). This technology is not mature enough yet for factories and automotive companies to replace their precision machines with cheap servos, but the writing is on the wall: software is coming for hardware, and this trend will only accelerate.

Acknowledgements

Thanks to Karen Yang, Irhum Shafkat, Gary Lai, Jiaying Xu, Casey Chu, Vincent Vanhoucke, Kanishka Rao for reviewing earlier drafts of this essay.

ML Mentorship: Some Q/A about RL

2021-07-30T13:36:00.016-07:00

One of my ML research mentees is following OpenAI's Spinning Up in RL tutorials (thanks to the nice folks who put that guide together!). She emailed me some good questions about the basics of Reinforcement Learning, and I wanted to share some of my replies on my blog in case it helps further other students' understanding of RL.

The classic Sutton and Barto diagram of RL

Your “How to Understand ML Papers Quickly” blog post recommended asking ourselves “what loss supervises the output predictions” when reading ML papers. However, in SpinningUp, it mentions that “minimizing the ‘loss’ function has no guarantee whatsoever of improving expected return” and “loss function means nothing.” In this case, what should we look for instead when reading DRL papers if not the loss function?

Policy optimization algorithms like PPO train by minimizing some loss, which in the most naive implementation is the (negative) expected return at the current policy's parameters. So in reference to my blog post, this is the "policy gradient loss" that supervises the current policy's predictions.

It so happens that this loss function is defined with respect to data $\mathcal{D}(\pi^i)$ sampled by the *current* policy, rather than data sampled i.i.d from a fixed / offline dataset as commonly done in supervised learning. So if you change the policy from $\pi^i \to \pi^{i+1}$, then re-computing the policy gradient loss for $\pi^{i+1}$ requires collecting some new environment data $\mathcal{D}(\pi^{i+1})$ with $\pi^{i+1}$. Computing the loss function has special requirements (you have to annoyingly gather new data every time you update), but at the end of the day it is still a loss that supervises the training of a neural net, given parameters and data.
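Here is a minimal sketch of that "gather new data every update" loop, using a two-armed bandit and a softmax policy so that it fits in pure Python. The bandit, the reward function, and all function names are assumptions made for illustration; this is the naive REINFORCE-style surrogate, not PPO.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def collect_data(logits, reward_fn, n, rng):
    """Sample D(pi): actions drawn from the *current* policy, with rewards."""
    probs = softmax(logits)
    actions = rng.choices(range(len(logits)), weights=probs, k=n)
    return [(a, reward_fn(a)) for a in actions]

def policy_gradient_loss(logits, data):
    """Naive surrogate loss: -mean(log pi(a) * reward) over on-policy data."""
    probs = softmax(logits)
    return -sum(math.log(probs[a]) * r for a, r in data) / len(data)

def policy_gradient_step(logits, data, lr):
    """One gradient-descent step on the surrogate loss w.r.t. the logits."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, r in data:
        for k in range(len(logits)):
            indicator = 1.0 if k == a else 0.0
            # d/d logit_k of (-log pi(a) * r) for a softmax policy
            grad[k] -= (indicator - probs[k]) * r / len(data)
    return [l - lr * g for l, g in zip(logits, grad)]

def train(reward_fn, iterations=200, batch=64, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0]
    for _ in range(iterations):
        data = collect_data(logits, reward_fn, batch, rng)  # fresh D(pi^i)
        logits = policy_gradient_step(logits, data, lr)     # pi^i -> pi^{i+1}
    return logits

# Hypothetical bandit: arm 1 pays 1.0, arm 0 pays nothing.
final_logits = train(lambda a: 1.0 if a == 1 else 0.0)
```

The key structural point is inside `train`: the batch is thrown away and re-collected after every update, because the old batch was sampled from a policy that no longer exists.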

On "loss function means nothing": the Spinning Up docs are correct in saying that the loss you minimize is not actually the evaluated performance of the policy, in the same way that minimizing cross entropy loss maximizes accuracy while not telling you what the accuracy is. In a similar vein, the loss value for $\pi^i, \mathcal{D}(\pi^i)$ is decreased after a policy gradient update. You can assume that if your new policy sampled the exact same trajectory as before, the resultant reward would be the same, but your loss would be lower. Vice versa, if your new policy samples a different trajectory, you can probably assume that there will be a monotonic increase in reward as a result of taking each policy gradient step (assuming step size is correct and that you could re-evaluate the loss under a sufficiently large distribution).

However, you don't know how much decrease in loss translates to increase in reward, due to non-linear sensitivity between parameters and outputs, and further non-linear sensitivity between outputs and rewards returned by the environment. A simple illustrative example of this: a fine-grained manipulation task with sparse rewards, where the episode return is 1 if all actions are done within a 1e-3 tolerance, and 0 otherwise. A policy update might result in each of the actions improving the tolerance from 1e-2 to 5e-3, and this policy achieves a lower "loss" according to some Q function, but still has the same reward when re-evaluated in the environment.
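The sparse-reward example can be written out directly. In this sketch the "loss" is a hypothetical mean squared action error standing in for whatever smooth signal the critic provides, and the five-action episode is an arbitrary choice.

```python
def sparse_return(action_errors, tolerance=1e-3):
    """Episode return: 1 only if every action is within tolerance, else 0."""
    return 1.0 if all(abs(e) <= tolerance for e in action_errors) else 0.0

def surrogate_loss(action_errors):
    """Hypothetical smooth training signal: mean squared action error."""
    return sum(e * e for e in action_errors) / len(action_errors)

errors_before_update = [1e-2] * 5   # each action off by 1e-2
errors_after_update = [5e-3] * 5    # each action improved to 5e-3

loss_improved = surrogate_loss(errors_after_update) < surrogate_loss(errors_before_update)
reward_unchanged = sparse_return(errors_before_update) == sparse_return(errors_after_update)
```

Both flags come out true here: the update halves the error and lowers the loss, yet the environment still returns zero because 5e-3 is outside the 1e-3 tolerance.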

Thus, when training RL it is not uncommon to see the actor loss go down but the reward stay flat, or vice versa (the actor loss stays flat but the reward goes up). It's usually not a great sign to see the actor loss blow up though!

Why in DRL, people frequently set up algorithms to optimize the undiscounted return, but use discount factors in estimating value functions?

See https://stats.stackexchange.com/questions/221402/understanding-the-role-of-the-discount-factor-in-reinforcement-learning. In addition to avoiding infinite sums from a mathematical perspective, the discount factor actually serves as an important hyperparameter when tuning RL agents. It biases the optimization landscape so that agents prefer the same reward sooner rather than later. Finishing an episode sooner also allows agents to see more episodes, which indirectly improves the amount of search and exploration a learning algorithm can do. Additionally, discounting produces a symmetry-breaking effect that further reduces the search space. In a sparse reward environment with $\gamma=1$ (no discounting), an agent would be equally happy to do nothing on the first step and then complete the task vs. do the task straight away. Discounting makes the task easier to learn because the agent can learn that there is only one preferable action at the first step.
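The symmetry-breaking effect is easy to check numerically. The two reward sequences below are a made-up two-step sparse-reward episode: one where the agent completes the task immediately, and one where it idles for a step first.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over the episode."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

act_now = [1.0, 0.0]         # complete the task on the first step
wait_then_act = [0.0, 1.0]   # do nothing, then complete the task

# With gamma = 1 the two plans tie, so nothing favors acting early.
tie_without_discount = discounted_return(act_now, 1.0) == discounted_return(wait_then_act, 1.0)

# With gamma < 1 acting immediately is strictly better.
prefers_acting_now = discounted_return(act_now, 0.99) > discounted_return(wait_then_act, 0.99)
```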

In model-based RL, why embedding planning loops into policies makes model bias less of a problem?

Here is an example that might illustrate how planning helps:

Given a good Q function $Q(s,a)$, you can recover a policy $\pi(a|s)$ by performing a search procedure $\arg\max_a Q(s,a)$ to recover the action with the best expected (discounted) future return. A search algorithm like grid search is computationally expensive, but guaranteed to work because it will cover all the possibilities.

Imagine instead of search, you use a neural network "actor" to amortize the "search" process into a single pass through a neural network. This is what Actor-Critic algorithms do: they learn a critic and use the critic to learn an actor, which performs "amortized search over the argmax $Q(s,a)$".

Whenever you can use brute force search on the critic instead of an actor, it is better to do so. This is because an actor network (amortized search) can make mistakes, while brute force is slow but will not make a mistake.
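A small sketch of the comparison, under toy assumptions: `toy_q` is a made-up critic whose best action equals the state, the action grid is a coarse discretization of [-1, 1], and `imperfect_actor` is a stand-in for an amortized policy with a systematic error.

```python
def grid_search_action(q_fn, state, candidate_actions):
    """Brute-force argmax_a Q(s, a) over an explicit list of candidate actions."""
    return max(candidate_actions, key=lambda a: q_fn(state, a))

def toy_q(state, action):
    # Hypothetical critic: the best action equals the state.
    return -(action - state) ** 2

def imperfect_actor(state):
    # Hypothetical amortized policy that is consistently off by 0.3.
    return state + 0.3

candidate_actions = [i / 10 for i in range(-10, 11)]  # grid over [-1, 1]
state = 0.5

searched_action = grid_search_action(toy_q, state, candidate_actions)
actor_action = imperfect_actor(state)
search_is_at_least_as_good = toy_q(state, searched_action) >= toy_q(state, actor_action)
```

The search is only exhaustive with respect to the grid, so in continuous action spaces its guarantee is limited by the grid resolution.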

The above example illustrates the simplest example of a 1-step planning algorithm, where "planning" is actually synonymous with "search". You can think about the act of searching for the best action with respect to $Q(s, a)$ as being equivalent to "planning for the best future outcome", where $Q(s,a)$ evaluates your plan.

Now imagine you have a perfect model of dynamics, $p(s'|s,a)$, and an okay-ish Q function where it has function approximation errors in some places. Instead of just selecting the best Q value and action at a given state, the agent can now consider the future state and consider the Q values that one encounters at the next set of actions. By using a plan and an "imagined rollout" of the future, the agent can query $Q(s,a)$ along every state in the trajectory, and potentially notice inconsistencies with Q functions. For instance, Q might be high at the beginning of the episode but low at the end of the episode despite taking the greedy action at each state. This would immediately tell you that the Q function is unreliable for some states in the trajectory.

A well-trained Q function should respect the Bellman equality, so if you have a Q function and a good dynamics model, then you can actually check your Q function for self-consistency at inference time to make sure it satisfies the Bellman equality, even before taking any actions.
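As a rough sketch of what such a self-consistency check might compute, one can evaluate a Bellman residual using the dynamics model $p(s'|s,a)$. The reward function $r(s,a,s')$ and the greedy max over next actions are assumptions made here for illustration, and $\gamma$ is the discount factor:

$$
\delta(s,a) = Q(s,a) - \mathbb{E}_{s' \sim p(s'|s,a)}\left[ r(s,a,s') + \gamma \max_{a'} Q(s',a') \right]
$$

A residual near zero along an imagined rollout suggests the Q function is self-consistent there, while a large $|\delta(s,a)|$ flags states where the estimate may be unreliable.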

One way to think of a planning module is that it "wraps" a value function $Q_\pi(s,a)$ and gives you a slightly better version of the policy, since it uses search to consider more possibilities than the neural-net amortized policy $\pi(a|s)$. You can then take the trajectory data generated by the better policy and use that to further improve your search amortizer, which yields the "minimal policy improvement technique" perspective from Ferenc Huszár.

When talking about data augmentation for model-free methods, what is the difference between “augment[ing] real experiences with fictitious ones in updating the agent” and “us[ing] only fictitious experience for updating the agent”?

If you have a perfect world model, then all you need is to train an agent on "imaginary rollouts" and then it will be exactly equivalent to training the agent on the real experience. In robotics this is really nice because you can train purely in "mental simulation" without having to wear down your robots. Model-Ensemble TRPO is a straightforward paper that tries these ideas.

Of course in practice, no one ever learns a perfect world model, so it's common to use the fictitious (imagined) experience as a supplemental experience to real interaction. The real interactions data provide some grounding in reality for both the imagination model and the policy training.

How to choose the baseline (function b) in policy gradients?

The baseline should be chosen to minimize the variance of gradients while keeping the estimate of the learning signal unbiased. Here is a talk that covers that stuff in more detail: https://www.youtube.com/watch?v=ItI_gMuT5hw. You can also google terms like "variance reduction policy gradient" and "control variates reinforcement learning". I have a blog post on variance reduction, which also discusses control variates: https://blog.evjang.com/2016/09/variance-reduction-part1.html

Consider episode returns for 3 actions = [1, 10, 100]. Clearly the third action is by far the best, but if you take a naive policy gradient, you end up increasing the likelihood of the bad actions too! Typically $b=V(s)$ is sufficient, because it turns the $Q(s,a)-V(s)$ into advantage $A(s,a)$, which has the desired effect of increasing the likelihood of good actions, keeping the likelihood of neutral actions the same, and decreasing the likelihood of bad actions. Here is a paper that applies an additional control variate on top of advantage estimation to further reduce variance.
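The effect of the baseline on this three-action example can be computed directly. The sketch assumes a uniform policy over the three actions, so that $V(s)$ is just the mean return. The weights below are the scalars that multiply the gradient of the log-probability of each sampled action.

```python
def gradient_weights(returns, baseline):
    """Per-action weight on grad log pi(a|s): return minus baseline."""
    return [r - baseline for r in returns]

returns = [1.0, 10.0, 100.0]

# Naive policy gradient: no baseline, so every sampled action gets pushed up.
naive_weights = gradient_weights(returns, 0.0)

# b = V(s): under an assumed uniform policy this is the mean return.
value_baseline = sum(returns) / len(returns)
advantage_weights = gradient_weights(returns, value_baseline)
```

With the baseline, the two weaker actions get negative weights and only the third is reinforced, which is the behavior described above.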

How to better understand target policy smoothing in TD3?

In actor-critic methods, both the Q function and actor are neural networks, so it can be very easy to use gradient descent to find a region of high curvature in the Q function where the value is very high. You can think of the actor as a generator and a critic as a discriminator, and the actor learns to "adversarially exploit" regions of curvature in the critic so as to maximize the Q value without actually emitting meaningful actions.

All three of the tricks in TD3 are designed to mitigate the problem of the actor adversarially selecting an action with a pathologically high Q value. Target policy smoothing adds noise to the action that is fed into the target Q network, which prevents the "search" from finding exact areas of high curvature. Like Trick 1, it helps make the Q function estimates more conservative, thus reducing the likelihood of choosing over-estimated Q values.
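A minimal sketch of how the smoothed target might be computed for a one-dimensional action. The function names, the scalar action space, and the default noise scale, noise clip, and action bounds are assumptions for illustration. The minimum over two target critics corresponds to Trick 1 (clipped double-Q).

```python
import random

def clip(x, low, high):
    return max(low, min(high, x))

def smoothed_target_action(target_actor, next_state, rng,
                           noise_std=0.2, noise_clip=0.5,
                           act_low=-1.0, act_high=1.0):
    """Target action plus clipped Gaussian noise, clipped to the action range."""
    noise = clip(rng.gauss(0.0, noise_std), -noise_clip, noise_clip)
    return clip(target_actor(next_state) + noise, act_low, act_high)

def td_target(reward, done, next_state, target_actor,
              target_q1, target_q2, rng, gamma=0.99):
    """Bellman target using the smoothed action and the smaller of two critics."""
    next_action = smoothed_target_action(target_actor, next_state, rng)
    next_q = min(target_q1(next_state, next_action),
                 target_q2(next_state, next_action))
    return reward + gamma * (1.0 - done) * next_q
```

Because the critic is trained against targets evaluated at slightly perturbed actions, a narrow spike in Q is averaged out rather than reinforced.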

A Note on Categorizing RL Algorithms

RL is often taught in a taxonomic layout, as it helps to classify algorithms based on whether they are "model based vs. model-free", "on-policy vs. off-policy", "supervised vs. unsupervised". But these categorizations are illusory, much like the Spoon in the Matrix. There are actually many different frameworks and schools of thought that allow one to independently derive the same RL algorithms, and they cannot always be neatly classified and separated from each other.

For example, it is possible to derive actor critic algorithms from both on-policy and off-policy perspectives.

Starting from off-policy methods, you have DQN, which uses the inductive bias of Bellman Equality to learn optimal policies via dynamic programming. Then you can extend DQN to continuous actions via an actor network, which arrives at DDPG.

Starting from on-policy methods, you have REINFORCE, which is the vanilla policy gradient algorithm. You can add a value function as a control variate, and this requires learning a critic network. This again re-derives something like PPO or DDPG.

So is DDPG an on-policy or off-policy algorithm? Depending on the frequency with which you update the critic vs. the actor, it starts to look more like an on-policy or an off-policy update. My colleague Shane has a good treatment of the subject in his Interpolated Policy Gradients paper.

Stonks are What You Can Get Away With: NFTs and Financial Nihilism

2021-06-19T16:59:00.011-07:00

Eric Jang, "Ten Apes", Jun 19 2021. NFT "drop" coming soon.

Andy Warhol once said, “Art is what you can get away with.” I interpret the quote as a nihilistic take on “beauty is in the eye of the beholder” — a urinal you found in the junkyard can be considered art, so long as you convince someone to buy it, or showcase it in a museum. All that matters is what other people see in it and what buyers are willing to pay.

The 2020s equivalent of Warhol paintings are Non-Fungible Tokens (NFTs). In this essay I’ll explain what NFTs are by motivating them with some interesting real-world problems. Then I’ll discuss why the NFT craze for digital art generates so much ideologically contentious debate. Finally, I’ll discuss some parallels between artistic and financial nihilism, and how this might serve as a framework for thinking about wildly speculative markets.

Explaining NFTs using Counterfeit Goods

Suppose you want to buy a Birkin bag or some other luxury brand item. An unauthorized seller — perhaps someone who needs some emergency cash — is willing to sell you a Birkin bag. They offer you a good discount, relative to the price the authorized retailer would charge you. But how can you be sure they aren’t selling you a fake? Counterfeits for these items are very high quality, and the average Birkin customer probably can’t tell the difference between a real and a fake.

One way to avoid counterfeits is to only purchase items from an authorized retailer, e.g. a trusted Hermès store. But this is not practical because it prevents people from selling or giving away their bags. If you leave your bag to someone in your will, then its authenticity is no longer guaranteed.

So we have the market need: how does a seller pass on or sell a luxury item? How does a buyer ensure that they are buying an authentic item?

One possible answer is for Hermès to print out a list of secret serial numbers, perhaps sewn inside the bag, that declare whether a bag is legit or not. Owners receive a serial number when they buy the bag. But this is not a strong deterrent. A counterfeiter could just buy a real bag and then copy its serial number into many fake bags.

What if Hermès maintains a public website of who owns which bag? Any time a bag changes ownership, this ledger needs to be updated. By recording a unique owner for each unique serial number, this solves the problem of counterfeiters simply duplicating serial numbers. The process shifts from verifying properties to verifying transactions and owners.

These approaches would work, but also have a centralized point of failure: If the Hermès website goes down, nobody can trade bags anymore. Hermès is a big company and has the resources to protect their website against DDOS attacks and other cybersecurity threat vectors, but smaller luxury brands might not have a state-of-the-art security department. If they are not careful, their security could be breached by hackers or an unscrupulous sysadmin. Also, if Hermès stops operating as a company in 25 years, who will maintain the ledger of ownership? If it is a third party company, can we trust them not to abuse that power? Even in the unlikely event that the central point of failure never makes a mistake, it’s still mildly annoying to require Hermès to get involved every time a bag changes hands.

What if you could verify transactions and owners, without a centralized party? This is where Non-Fungible Tokens, or NFTs, come in. In 2008, someone published a landmark paper on how to build a decentralized ledger of who owns what. This ledger is called a "blockchain". A blockchain is a record of the consensus state of the world, following some agreed-upon protocol that is known to everyone. The remarkable thing about blockchains is that they are decentralized (no central point of failure), and resilient to malicious actors in the network. Distributed consensus is reached by each individual contributing some resource like money, hash rate, or computer storage. So long as a large fraction of resources in the network are controlled by well-behaved actors, the integrity of the blockchain remains secure. Depending on the protocol, the fraction of resources that malicious actors would need to control to break this guarantee typically ranges from one-third to just over a half.

There are many blockchains out there. The details of how their consensus protocols are implemented are fascinating but beyond the scope of this essay. The important thing to know is that the base technology underlying NFTs and cryptocurrencies is a formal protocol that allows people to come to an agreement on who owns what without having to involve a trusted third party (e.g. Hermès, an escrow agent, your bank, or your government). Theoretically speaking, blockchains allow shared consensus in a trustless society.

NFTs are like a paper deed of ownership, but instead of paper the certificate is digital. And unlike a paper deed an NFT cannot be forged. NFTs contain a unique “serial number” that is publicly viewable, but only one person can be said to “possess” that serial number on the blockchain, much like how home addresses are public but registered to a single owner by the recording office. To see how NFTs solve the Birkin bag counterfeit problem, let’s suppose Hermès publicly declares the following for all to hear:

“Owners of True Birkin bags will be issued a digital certificate of authenticity represented by an NFT”

As a buyer, you can be quite confident that the bag is authentic if the seller also owns the NFT, and you can verify that the NFT was indeed originally created by Hermès by looking up its public transaction history. During a transaction, the seller simply gives the buyer the bag and tells the blockchain to re-assign ownership of the NFT to the buyer’s digital identifier. If the payment is done in cryptocurrency, the escrow can even be performed using a smart contract without a centralized party (the seller publishes a contract saying “If a specific buyer’s wallet address sends me X USDC in 24 hours, send the NFT to them and send the cash to me.”)
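The mechanics of minting, transferring, checking provenance, and escrow can be sketched with a toy in-memory ledger. This is only an illustration: it is a single Python object, not a blockchain, so there is no consensus protocol, no cryptographic signatures, and no 24-hour deadline on the escrow. Every name in it (`ToyLedger`, `escrow_swap`, the addresses, the token id, the prices) is hypothetical.

```python
class ToyLedger:
    """Single-process stand-in for a blockchain: no consensus, no signatures."""

    def __init__(self):
        self.owner = {}    # token_id -> current owner address
        self.history = {}  # token_id -> list of (sender, recipient) transfers
        self.cash = {}     # address -> stablecoin balance

    def mint(self, issuer, token_id):
        """Create a new token; the first history entry records who issued it."""
        if token_id in self.owner:
            raise ValueError("token already exists")
        self.owner[token_id] = issuer
        self.history[token_id] = [(None, issuer)]

    def transfer(self, sender, recipient, token_id):
        """Re-assign ownership; only the current owner may do this."""
        if self.owner.get(token_id) != sender:
            raise PermissionError("only the current owner can transfer")
        self.owner[token_id] = recipient
        self.history[token_id].append((sender, recipient))

    def minted_by(self, token_id):
        """Provenance check: who originally created this token?"""
        return self.history[token_id][0][1]

    def escrow_swap(self, seller, buyer, token_id, price):
        """Atomic swap: either the NFT and the payment both move, or neither does."""
        if self.owner.get(token_id) != seller:
            raise PermissionError("seller does not own the token")
        if self.cash.get(buyer, 0) < price:
            raise ValueError("buyer cannot cover the price")
        self.cash[buyer] -= price
        self.cash[seller] = self.cash.get(seller, 0) + price
        self.transfer(seller, buyer, token_id)


ledger = ToyLedger()
ledger.mint("hermes", "birkin-001")               # brand issues the certificate
ledger.transfer("hermes", "alice", "birkin-001")  # first retail sale
ledger.cash["bob"] = 9000
ledger.escrow_swap("alice", "bob", "birkin-001", price=8000)  # resale to bob
```

A buyer in this sketch would check that `ledger.owner[token_id]` is the seller and that `ledger.minted_by(token_id)` is the brand before paying. Copying the serial number does nothing for a counterfeiter, because the ledger only ever lists one owner per token.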

NFTs provide the means to implement digital scarcity, but there still needs to be a way to pair it with a real-world item in the “analog” world. A seller could still bypass the security of NFTs by selling you an NFT with a fake Birkin bag. However, for every fake bag you want to sell, you need to purchase a real NFT and the real bag that comes with it. After you sell the NFT with the fake bag, you are left with a real bag with no NFT! Subsequently, the market value of the real bag drops because buyers will be highly suspicious of a seller who says "this is a real bag, I don't have the NFT because I just sold it with a fake bag." While NFTs are not sure proof of a physical Birkin bag's authenticity, they all but ruin the economic incentives of counterfeiting.

What about luxury consumable goods? You could buy NFT-certified Wagyu beef, sell the NFT with some cheaper steak, and then eat the real Wagyu beef - it doesn’t matter what other people think you're eating. However, NFT transactions are public, so a grocery shopper would be quite suspicious of a food NFT that has changed hands outside of the typical supply chain addresses. For NFTs paired with physical goods, each “unusual” transaction significantly adds to counterfeit risk, which diminishes the economic incentives to counterfeiters. This is especially true for consumable, perishable goods.

Authenticity is useful, even outside of Veblen goods. You can imagine using NFTs to implement anonymous digital identity verification (a 30B market by 2024), or shipping them with food products like meat, where the customer cares a lot about the provenance of the product. In Taiwan, there is an ongoing scandal where a bunch of US-imported pork has been passed off as “domestic pork” and nobody can trust their butchers anymore.

In the most general case, NFTs can be used to implement provenance tracking of both physical and digital assets - an increasingly important need in our modern age of disinformation. Where did this photo of a politician come from? Who originally produced this audio clip?

The Riddle of Intangible Value

NFTs make a lot of sense for protecting the authenticity of luxury goods or implementing single sign-on or tracking the provenance of meat products, but that’s not what they’re primarily used for today. Rather, most people sell NFTs for digital art. Here are some early examples of art NFTs, called “Cryptopunks”. Each punk is a 24x24 RGB image.

One of these recently sold for 17M USD in an auction. At first glance, this is perplexing. The underlying digital content - some pixels stored in a file - is freely accessible to anyone. Why would anyone pay so much for a certificate of authenticity on something that anyone can enjoy for free? Is the buyer the one that gets punked?

It’s easy to dismiss this behavior as poor taste colliding with the arbitrarily large disposable income of rich people, in particular crypto millionaires that swap crypto assets with other crypto millionaires. While this may be true, I think it’s far more interesting to ask “what worldview would cause a rational person to bet $17M on a certificate for a 24x24x3 set of pixel values”?

Historically, the lion’s share of rewards for digital content has been owned by distribution technology like Spotify or content aggregators like Facebook, and then split with the management company. The creatives themselves are paid pittances, and do not share in the financialization of their labor. The optimist case for NFT art is as follows: NFTs are decentralized, which means any artist with an internet connection can draw up financial contracts for their art on their own terms. If NFTs revolutionize the business model of digital art, and if the future of art is mostly digital, then the first art NFTs to ever be issued might accrue significant cultural relevance, and that’s why they command such high speculative prices.

Valuing art based on cultural relevance might be a bit absurd, but why is the Mona Lisa “The Mona Lisa”? da Vinci arguably made “better” paintings from a technical standpoint. It's because of intangible value. The Mona Lisa is valuable because of its cultural proximity to important events and people in history, and the mimetic desire of other humans. In fact, it was a relatively obscure painting until 1911, when it was stolen from the Louvre and became a source of national shame overnight.

All art, from your child’s first finger painting, to an antique heirloom passed down generations, to a “masterpiece” like the Mona Lisa, is valued this way. It is valuable simply because others deem it valuable.

NFTs are the digital equivalent of buying a banana duct-taped to a wall; you are betting that in the future, that statement of ownership on some blockchain will be historically significant, which you can presumably trade in for cash or clout or both. But buyer beware: things get philosophically tricky when applying the theory of “intangible value” to digital information and artwork where the cost of replication goes to zero.

I can think of two ways to look at how one values NFTs for digital art. One perspective is that in a world full of fake Birkin bags and products sourced from ethically dubious places, the only thing of value is the certificate of authenticity. The cultural and mimetic value of content has transferred entirely to the provenance certificate, and not the pixels themselves (which can be copied for free). If art’s value is derived from the cultural relevance it represents and its proximity to important people, then the most sensible way to make high art would not be to improve one’s painting skills, but to schmooze with a lot of famous people and insert oneself into important events in history, and issue scarce status symbols for the bourgeoisie. Warhol did exactly that.

The alternate view is that if a perfect copy can be made of some pixels, then it is not really a counterfeit at all, and therefore the NFT secures nothing of actual value. Is it meaningful to ascribe a certificate of authenticity to something that can be perfectly replicated? Is “authenticity” of a stream of 0s and 1s meaningless? There is certainly utility in verifying the source of some information, but anyone can mint an NFT for the same information.

In summary, the pro-NFT crowd values the intangible “collector’s scarcity and cultural relevance”. The anti-NFT crowd focuses on tangible value - how much real value does this secure? Both are reasonable frameworks to value things, and you can end up with wildly different conclusions.

Artistic and Financial Nihilism: One and The Same?

Convince enough people that a urinal is valuable, and it becomes an investment grade asset. This is no longer merely a matter of art philosophy - when you invest in an index fund, you are essentially reinforcing the market’s current belief of valuations. When people bid up the price of TSLA or GME to stratospheric valuations, the index fund must re-adjust their market-weighted holdings to reflect those prices, creating further money inflows to the asset and thus a self-fulfilling prophecy. As it turns out, the art-of-investing is much like investing-in-art. As I have suggested in the title of this essay and borrowed from Warhol (who probably borrowed it from Marshall McLuhan), stonks are what you can get away with.

We are starting to see this valuation framework being applied to the equities market today, where price movements are dominated by narratives about where the price is going and what other people are willing to pay for it, especially with meme stocks like GME and AMC. Many retail investors don’t really care about whether GME’s price is justified by their corporate earnings - they simply buy at any cost. This financial nihilism - where intrinsic value is unknowable and all that matters is what other people think - is a worldview often encountered in Gen Z retail traders and a surprising number of professional traders I know. Perhaps the midwit meme is really true.

This is definitely a cause for some concern, but at the same time, I think value investors should keep an open mind that what first seems like irrational behavior might have a method to its madness. If you have an irrational force acting in the markets, like shareholders who refuse to sell or lend their stock, a discounted cash flow model for AMC or GME stops being very predictive of share price. By reflexivity, that will have impacts on future cash flows! In a similar fashion, present-day frameworks for thinking about business and value do not account for the disruptive force of technology. That’s why I find NFTs so fascinating - they are an intersection of finance, art, technology, and the nihilistic framework of valuation that is so prevalent in our society today.

What is rational behavior for an investor, anyway? Is it “standard behavior” as measured against the population average? How do you tell apart standard behavior from a collective delusion? Perhaps the luxury bag makers, Ryan Cohens, and Andy Warhols of the world understand it best: Convince the world to believe in your values, and you will be the sanest person on the planet. For fifteen minutes, at least.

Acknowledgements

Thanks to Cati Grasso, Sam Hoffman, Phúc Lê, Chung Kang Wang, Jerry Suh, and Ellen Jiang for comments and feedback on drafts of this post.

Sovereign Arcade: Currency as High-Margin Infrastructure

2021-05-26T17:02:00.007-07:00

This essay is about how the powerful want to become countries, and the implications of cryptocurrencies on the sovereignty of nations. I’m not an economics expert: please leave a comment if I have made any errors.

Money allows goods, services, and everything else under the sun to be assigned a value using the same unit of measurement. Without money, society reverts to bartering, which is highly inefficient. You may need plumbing services but have nothing that the plumber wants, so your toilet remains clogged. By acting as a measure of value everyone agrees on, money facilitates frictionless economic collaboration between people.

Foreign monetary policy is surprisingly simple to understand when viewed through the lens of power and control. Nation states get nervous when other nation states get too powerful, and controlling the currency is a form of power.

To see why this is the case, let’s consider a Gaming Arcade (yes, like Chuck E. Cheese) as a miniature model of a “Nation State”. To participate inside the “arcade economy”, you are to swap your outside money (USD) for arcade tokens.

Arcades are like mini nation-states: they issue their own currency, encourage spending with state-owned enterprises, and have a one-sided currency exchange to prevent money outflows.

The coins are a store of value that facilitate a one-way transaction with the Nation-State: you get to play an arcade game, and in return you get some entertainment value and some tickets, which we call “wages”.

The tickets are another store of value that can facilitate another one-way transaction: converting them into prizes. Prizes can be a stuffed animal or something else of value. Typically, the cost of winning a prize at an arcade is many multiples of what it would cost to just buy the prize at an outside store. The arcade captures that price difference as their profit.

Money’s most important feature requirement is that it is a *stable* measure of value. Too much inflation, and people stop saving money. Too much deflation, and people and companies aren’t incentivized to spend money (for example, employing people). Imagine if tomorrow, an arcade coin could let you play a game for two rounds instead of one, and the day after, you could play for four rounds! Well, no one would want to play arcade games today anymore.

The arcade imposes many kinds of draconian capital controls, and in many ways resembles an extreme form of State Capitalism:

All transactions are with state-owned enterprises (the arcade games) and must be conducted using state currencies (coins and tickets). You can’t start a business that takes people’s coins or tickets within the arcade.
The state can hand out valuable coins at virtually zero cost without worrying about inflation - every coin they issue is backed by a round of a coin-operated game, of which they have near-infinite supply. They can’t hand out infinite tickets though, because that would either require backing it up with more prizes, or devaluing each ticket so that more tickets are needed to buy the same prize.
You can bring outside money into the arcade, but you can’t convert coins, tickets, or prizes into money to take out.

Controlling the currency supply is indeed a very powerful business to be in, which is why arcades prefer to issue their own currency and keep money from leaving their borders.

Governments are just like arcades. They prefer their citizens and trading partners to use a currency they control, because it gives them a lever with which they can influence spending behavior. If country A uses country B’s currency instead, then country B’s currency supply shenanigans can actually influence saving and spending behavior of country A. This can pose a threat to the sovereignty of a nation (a fancy way to say “control over its people”).

After World War II, the US Dollar became the world’s reserve currency, which means that it’s the currency used for the majority of international trade. The USA wants the world to buy oil with US dollars, and we go to great lengths to enforce it with various forms of soft and hard power. The US dollar is backed by oil (petrodollar theory), and this “dollars-are-oil rule” in turn is enforced by US military might.

Governments print money all the time to pay for short-term needs like building bridges and COVID relief. However, too much of this can be a dangerous thing. The government gets what it wants in the short term, but more money chasing the same amount of goods will cause businesses to raise prices, causing inflation. Countries like Venezuela and Turkey that print too much of their own currency experience a runaway feedback loop where money supply and prices skyrocket, and then no one trusts the government currency as a stable source of value anymore.

The USA is not like other countries in this regard; controlling the world’s reserve currency gives the USA the ability to print money like no other country can. The US government owing 28 trillion USD of debt is like the Arcade owing you a trillion game coins. Yes, it is a lot of coins - maybe the arcade doesn’t even have a trillion coins to give you. But the arcade knows that you know that it’s in the best interest of everyone to not try and collect all those coins right away, because the arcade would go bankrupt, and then the coins you asked for would be worthless.

Is this sketchy? Absolutely. Most other countries absolutely hate this power dynamic. Especially China. The USA calls China a currency manipulator for devaluing the yuan, but will turn around and do the exact same thing by printing dollars. China does not want to be subject to the whims of US monetary policy, so they are working very hard to establish the yuan as the currency of exchange in international trade. Everyone wants to be the arcade operator, not the arcade player.

Large Companies as Nation-States

Nation-states not only have to worry about the currencies of other nation-states, but increasingly, large global corporations as well. Any business that gets big enough starts to think about the currency game, since currency is a form of high-margin infrastructure.

AliPay is a mobile wallet made by an affiliate company of Alibaba. It’s basically backed by an SQL table saying how much money each AliPay user has. It would be very easy for AliPay to print money - all they have to do is bump up some number in a row in the SQL table. As long as users are able to redeem their AliPay balance on something of equivalent value, Alibaba’s accounts remain solvent and they can get away with this. In fact, many of their users shop on Alibaba’s e-commerce properties anyway, so Alibaba doesn’t even need to have 100% cash reserves to back up all entries in their SQL table. Users can redeem their balances by paying for Alibaba goods, which Alibaba presumably can acquire for less than the price the user pays.
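As a toy illustration of "bumping up some number in a row", here is a sketch using Python's built-in `sqlite3` module. The table name, column names, users, and amounts are all made up, and this says nothing about how any real payment company stores balances. It only shows how a single `UPDATE` creates balances with no new cash behind them.

```python
import sqlite3

# Hypothetical wallet ledger held in an in-memory database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE balances (user TEXT PRIMARY KEY, amount INTEGER NOT NULL)")
db.executemany("INSERT INTO balances VALUES (?, ?)", [("alice", 100), ("bob", 50)])

cash_reserves = 150  # real money the operator holds; starts fully backed


def total_liabilities(conn):
    """Sum of all user balances, i.e. what the operator owes its users."""
    return conn.execute("SELECT SUM(amount) FROM balances").fetchone()[0]


def reserve_ratio(conn, reserves):
    """Fraction of user balances that is backed by real cash."""
    return reserves / total_liabilities(conn)


ratio_before = reserve_ratio(db, cash_reserves)

# "Printing money": bump one number in one row, with no new cash behind it.
db.execute("UPDATE balances SET amount = amount + 50 WHERE user = ?", ("alice",))
ratio_after = reserve_ratio(db, cash_reserves)
```

In this sketch the reserve ratio falls from 1.0 to 0.75. The operator stays solvent only as long as users redeem balances for goods it can source cheaply instead of withdrawing cash all at once.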

Of course, outright printing money incurs the wrath of the Sovereign Arcade. Alibaba was severely punished for merely suggesting that they could do a better job than China’s banks. Facebook tried to challenge the dollar by introducing a token backed with other countries’ reserve currencies, and the idea was slapped down so hard that FB had to rename the project and start over. In contrast, the US government is happy to approve crypto tokens backed using the US dollar, because ultimately the US government controls the underlying resource.

There are clever ways to build high margin infrastructure without crossing the money-printing line. Any large institution with a monopoly over a high-margin resource can essentially mint debt for free, effectively printing currency like an arcade does with its coins. The resource can be a lot of things - coffee, cloud computing credits, energy, user data. In the case of a nation-state, the resource is simply violence and enforcement of the law.

As of 2019, Starbucks had 1.6B USD of gift cards in circulation, which puts it above the national GDP of about 20 countries. Like the arcade coins, Starbucks gift cards are only redeemable for limited things: scones and coffee. Starbucks can essentially mint Starbucks gift cards for free, and this doesn’t suffer from inflation because each gift card is backed by future coffee, which Starbucks can also make at a low marginal cost. You can even use Starbucks cards internationally, which makes “Star-Bucks” more convenient than current foreign currency exchange protocols.

As long as account balances are used to redeem a resource that the company can acquire cheaply (e.g. gift cards for coffee, gift cards for cloud computing, advertising credits), a large company could also practice “currency manipulation” by arbitrarily raising monetary balances in their SQL tables.

The Network State

Yet another threat to the sovereign power is decentralized rogue nations, made possible by cryptocurrency. At the heart of cryptocurrency’s rise is a social problem in our modern, globalized society: how do we trust our sovereigns to actually be good stewards of our property? Banking executives who overleveraged risky investments got bailed out in 2008 by the US government. The USA printed a lot of money in 2020 to bail out those impacted by COVID-19 economic shutdowns. Every few weeks, we hear about data breaches in the news. A lot of Americans are losing trust in their institutions to protect their bank accounts, their privacy, and their economic interests.

Even so, most Americans still take the power of the dollar for granted: 1) our spending power remains stable and 2) the number we see in our bank accounts is ours to spend. We have American soft and hard diplomacy to thank for that. But in less stable countries, capital controls can be rather extreme: a bank may simply decide one day that you can’t withdraw more than 1 USD per day. Or some government can decide that you’re a criminal and freeze your assets entirely.

Cryptocurrency offers a simple answer: You can’t trust the sovereign, or the bank, or any central authority to maintain the SQL table of who owns what. Instead, everyone cooperatively maintains the record of ownership in a decentralized, trustless way. For those of you who aren’t familiar with how this works, I recommend this 26-minute video by 3Blue1Brown.

To use the arcade analogy, cryptocurrency would be like a group of teenagers going to the arcade, and instead of converting their money into arcade coins, they pool it together to buy prizes from outside. They bring their own games (Nintendo Switches or whatever), and then swap prizes with each other based on who wins. They get the fun value of hanging out with friends and playing games and prizes, while cutting the arcade operator out.

The decentralized finance (DeFi) ecosystem has grown a lot in the last few years. In the first few years of crypto, all you could do was send Bitcoin and other Altcoins to each other. Today, you can swap currencies in decentralized exchanges, take out flash loans, buy distressed debt at a discount, provide liquidity as a market maker, perform no-limit betting on prediction markets, pay a foreigner with USD-backed stablecoins, and cryptographically certify authenticity of luxury goods.

Balaji Srinivasan predicts that as decentralized finance projects continue to grow, a large group of individuals with a shared sense of values and territory will congregate on the internet and declare themselves citizens of a “Network State”. It sounds fantastical at first, but many of us already live in Proto-Network states. We do our work on computers, talk to people over the internet, shop for goods online, and spend leisure time in online communities like Runescape and such. It makes sense for a geographically distributed economy to adopt a digital-native currency that transcends borders.

Network states will have the majority of their assets located on the internet, with a small amount of physical property distributed around the world for our worldly needs. The idea of a digital rogue nation is less far-fetched than you might think. If you walk into a Starbucks or McDonalds or a Google Office or an Apple Store anywhere in the world, there is a feeling of cultural consistency, a familiar ambience. In fact, Starbucks gets pretty close: you go there to eat and work and socialize and pay for things with Starbucks gift cards.

A network state might have geographically distributed physical locations that have a consistent culture, with most of its assets and culture in the cloud. Pictured: Algebraist coffee, a new entrant into the luxury coffee brand space

A network state could have a national identity independent of physical location. I see no reason why a "Texan" couldn’t enjoy ranching and brisket and big cars and football anywhere in the world.

Balaji is broadly optimistic that existing sovereigns will tolerate or even facilitate network states, by offering them economic development zones and tax incentives to establish their physical embodiments within their borders, in exchange for the innovation and capital they attract.

I am not quite so optimistic - the fact that US persons can now pseudonymously perform economic activities with anyone in the world (including sanctioned countries) without the US government knowing, using a currency that the US government cannot control, is a terrifying prospect to the sovereign. The world’s governments greatly underestimate the degree to which future decentralized economies will upset the existing world order and its power structures. Any one government can make it difficult for cryptocurrency businesses to get big, but as long as some countries are permissive towards it, it’s hard to put that genie back into the bottle and prevent the emergence of a new digital economy.

Crypto Whales

I think the biggest threat to the emergence of a network state is not existing sovereigns, but rather the power imbalance of early stakeholders versus new adopters.

At the time of writing, there are nearly 100 Bitcoin billionaires and 7062 Bitcoin wallets that own more than 10M each. This isn’t even counting the other cryptocurrencies or DeFi wealth locked in Ethereum - the other day, someone bought up nearly a billion dollars of the meme currency DOGE. We mostly have no idea who these people are - they walk amongst us, and are referred to as “whales”.

A billionaire’s taxes substantially alter state budget planning in smaller states, so politicians actually go out of their way to appease billionaires (e.g. Illinois with Ken Griffin). If crypto billionaires colluded, they could institute quite a lot of political change at local and maybe even national levels.

China has absolutely zero chill when it comes to any challenge to their sovereignty, so it was not surprising at all that they recently cracked down on domestic use of cryptocurrency. However, by shutting their miners down, I believe China is losing a strategic advantage in their quest to unseat America as the world superpower. A lot of crypto billionaires reside in China, having operated large mining pools and developed the world’s mining hardware early on. I think the smart move for China would have been to allow their miners to operate, but force them to sell their crypto holdings for digital yuan. This would peg crypto to the yuan, and also allow China to stockpile crypto reserves in case the world starts to use it more as a reserve currency.

There’s a chance that crypto might even overtake the Yuan as the challenger to reserve currency, because it’s easier to acquire in countries with strict capital controls (e.g. Venezuela, Argentina, Zimbabwe). If I were China, I’d hedge against both possibilities and try to control both.

Controlling miners has power implications far beyond stockpiling of crypto wealth. Miners play an important role in the market microstructure of cryptocurrency - they have the ability to see all potential transactions before they get permanently appended to the blockchain. The assets minted by miners are virtually untraceable. One way a Network State could be compromised is if China smuggled several crypto whales into these fledgling nations that are starting to adopt Bitcoin, and then used their influence over Bitcoin reserves, tax revenues, and market microstructure to punish those who spoke out against China.

The more serious issue than China’s hypothetical influence over Bitcoin monetary policy is the staggering inequality of crypto wealth distribution. Presently, 2% of wallets control over 95% of Bitcoin. Many people are already uncomfortable with the majority of Bitcoins being owned by a handful of mining operators and Silicon Valley bros and other agents of tech inequality. Institutions fail violently when inequality is high - people will drop the existing ledger of balances and install a new one (such as Bitcoin). If people decide to form a new network state, why should they adopt a currency that would make these tech bros the richest members of their society? Would you want your richest citizen to be someone who bet their life savings on DOGE? Would you trust this person’s judgement or capacity for risk management?

Like any currency, Bitcoin and Ethereum face adoption risk if the majority of assets are held by people who lack the leadership to deploy capital effectively on behalf of society. Unless crypto billionaires vow to not spend the majority of their wealth (like Satoshi has seemingly done), or demonstrate a remarkable level of leadership and altruism towards growing the crypto economy (like Vitalik Buterin has done), the inequality aspect will remain a large barrier to the formation of stable network states.

Summary

A gaming arcade is a miniature model of a nation-state. Controlling the supply and right to issue currency is lucrative.
Large businesses with high-margin infrastructure can essentially mint debt, much like printing money.
Cryptocurrencies will create “Network States” that challenge existing nation-states. But they will not prosper if they set up their richest citizens as ones who won the “early adopter” lottery.

Science and Engineering for Learning Robots

2021-03-14T14:30:00.008-07:00

This is the text version of a talk I gave on March 12, 2021, at the Brown University Robotics Symposium. As always, all views are my own, and do not represent those of my employer.

I'm going to talk about why I believe end-to-end Machine Learning is the right approach for solving robotics problems, and invite the audience to think about a couple interesting open problems that I don't know how to solve yet.

I'm a research scientist at Robotics at Google. This is my first full-time job out of school, but I actually started my research career doing high school science fairs. I volunteered at UCSF doing wet lab experiments with telomeres, and it was a lot of pipetting and only a fraction of the time was spent thinking about hypotheses and analyzing results. I wanted to become a deep sea marine biologist when I was younger, but after pipetting several 96-well plates (and messing them up) I realized that software-defined research was faster to iterate on and freed me up to do more creative, scientific work.

I got interested in brain simulation and machine learning (thanks to Andrew Ng's Coursera Course) in 2012. I did volunteer research at a neuromorphic computing lab at Stanford and did some research at Brown on biological spiking neuron simulation in tadpoles. Neuromorphic hardware is the only plausible path to real-time, large-scale biophysical neuron simulation on a robot, but, much like wet-lab research, it is rather slow to iterate on. It was also a struggle to learn even simple tasks, which made me pivot to artificial neural networks, which were starting to work much better at a fraction of the computational cost. In 2015 I watched Sergey Levine's talk on Guided Policy Search and remember thinking to myself, "oh my God, this is what I want to work on".

The Deep Learning Revolution

We've seen a lot of progress in Machine Learning in the last decade, especially in end-to-end machine learning, also known as deep learning. Consider a task like audio transcription: classically, we would chop up the audio clip into short segments, detect phonemes, aggregate phonemes into words, words into sentences, and so on. Each of these stages is a separate software module with distinct inputs and outputs, and these modules might involve some degree of machine learning. The idea of deep learning is to fuse all these stages together into a single learning problem, where there are no distinct stages, just the end-to-end prediction task from raw data. With a lot of data and compute, such end-to-end systems vastly outperform the classical pipelined approach. We've seen similar breakthroughs in vision and natural language processing, to the extent that all state-of-the-art systems for these domains are pretty much deep learning models.

Robotics has for many decades operated under a modularized software pipeline, where first you estimate state, then plan, then perform control to realize your plan. The question our team at Google is interested in studying is whether the end-to-end advances we've seen in other domains hold for robotics as well.

Software 2.0

When it comes to thinking about the tradeoff between hand-coded, pipelined approaches versus end-to-end learning, I like Andrej Karpathy's abstraction of Software 1.0 vs Software 2.0: Software 1.0 is where a human explicitly writes down instructions for some information processing. Such instructions (e.g. in C++) are passed through a compiler that generates the low level instructions of what the computer actually executes. When building Software 2.0, you don't write the program - you give a set of inputs and outputs and it's the ML system's job to find the best program that satisfies your input-output description. You can think of ML as a "higher order compiler that takes data and gives you programs".
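To make the contrast concrete, here is a minimal sketch. The temperature-conversion task, the function names, and the use of a closed-form least-squares line fit as the "compiler" are all illustrative assumptions on my part, not something taken from Andrej's talks.

```python
# Software 1.0: a human writes the rule down explicitly.
def fahrenheit_v1(celsius):
    return celsius * 9.0 / 5.0 + 32.0


# Software 2.0 (toy version): we only supply input-output examples and let an
# optimizer search for the program. Here the "program space" is just lines
# y = w * x + b, and the search is closed-form least squares.
def compile_v2(inputs, outputs):
    n = len(inputs)
    mean_x = sum(inputs) / n
    mean_y = sum(outputs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(inputs, outputs))
    var = sum((x - mean_x) ** 2 for x in inputs)
    w = cov / var
    b = mean_y - w * mean_x
    return lambda x: w * x + b


examples_in = [-40.0, 0.0, 37.0, 100.0]
examples_out = [-40.0, 32.0, 98.6, 212.0]
fahrenheit_v2 = compile_v2(examples_in, examples_out)
```

Both functions agree on new inputs, but only the first one was written by hand. Real Software 2.0 searches a vastly larger program space (neural network weights) with gradient descent rather than a one-line formula.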

The gradual or not-so-gradual subsumption of software 1.0 code into software 2.0 is inevitable - one might start by tuning some coefficients here and there, then you might optimize over one of several code branches to run, and before you know it, the system actually consists of an implicit search procedure over many possible sub-programs. The hypothesis is that as we increase availability of compute and data, we will be able to automatically do more and more search over programs to find the optimal routine. Of course, there is always a role for Software 1.0 - we need it for things like visualization and data management. All of these ideas are covered in Andrej's talks and blog posts, so I encourage you to check those out.

How Much Should We Learn in Robotics?

End-to-end learning has yet to outperform the classical control-theory approaches in some tasks, so within the robotics community there is still an ideological divide on how much learning should actually be done.

On one hand, you have classical robotics approaches, which break down the problem into three stages: perception, planning, and control. Perception is about determining the state of the world, planning is about high level decision making around those states, and control is about applying specific motor outputs so that you achieve what you want. Many of the ideas we explore in deep reinforcement learning today (meta-learning, imitation learning, etc.) have already been studied in classical robotics under different terminology (e.g. system identification). The key difference is that classical robotics deals with smaller state spaces, whereas end-to-end approaches fuse perception, planning, and control into a single function approximation problem. There's also a middle ground where one can attempt to use hand-coded constructs from classical robotics as a prior, and then use data to adapt the system to reality. According to Bayesian decision making theory, the stronger prior you have, the less data (evidence) you need to construct a strong posterior belief.

I happen to fall squarely on the far side of the spectrum - the end-to-end approach. I'll discuss why I believe strongly in these approaches.

Three reasons for end-to-end learning

First, it's worked for other domains, so why shouldn't it work for robotics? If there is something about robotics that makes this decidedly not the case, it would be super interesting to understand what makes robotics unique. As an existence proof, our lab and other labs have already built a few real-world systems that are capable of doing manipulation and navigation from end-to-end pixel-to-control. Shown on the left is our grasping system, Qt-Opt, which essentially performs grasping using only monocular RGB, the current arm pose, and end-to-end function approximation. It can grasp objects it's never seen before. We've also had success on door opening and manipulation from imitation learning.

Fused Perception-to-Action in Nature

Secondly, there are often many shortcuts one can take to solve specific tasks, without having to build a unified perception-planning-control stack that is general across all tasks. Mandyam Srinivasan's lab has done cool experiments getting honeybees to fly and perch inside small holes, with a spiral pattern painted on the wall. They found that bees will decelerate as they approach the target by the simple heuristic of keeping the rate of image expansion (the spiral) constant. They found that if you artificially increase or decrease the rate of expansion by spinning the spiral clockwise or counterclockwise, the honeybee will predictably speed up or slow down. This is Nature's elegant solution to a control problem: visually-guided odometry is computationally cheaper and less error prone than having to detect where the target is in world frame, plan a trajectory, and so on. It may not be a general framework for planning and control, but it is sufficient for accomplishing what honeybees need to do.
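Here is a rough sketch of why such a heuristic produces a smooth landing. It rests on modeling assumptions of my own rather than the lab's actual model: I treat "rate of image expansion" as the relative expansion rate, which for a head-on approach is speed divided by distance, and I model the spinning spiral as a constant bias added to the perceived rate. The function name and parameter values are hypothetical.

```python
def landing_speeds(start_distance, target_rate, spiral_bias=0.0, dt=0.01, steps=300):
    """Simulate an approach that holds the perceived expansion rate constant.

    Perceived rate = speed / distance + spiral_bias, so the controller picks
    speed = (target_rate - spiral_bias) * distance at every step.
    """
    distance = start_distance
    speeds = []
    for _ in range(steps):
        speed = max(0.0, (target_rate - spiral_bias) * distance)
        speeds.append(speed)
        distance -= speed * dt  # simple Euler integration
    return speeds, distance


baseline, final_distance = landing_speeds(start_distance=1.0, target_rate=2.0)
faked_expansion, _ = landing_speeds(1.0, 2.0, spiral_bias=0.5)     # spiral adds expansion
faked_contraction, _ = landing_speeds(1.0, 2.0, spiral_bias=-0.5)  # spiral subtracts it
```

Under these assumptions the speed is always proportional to the remaining distance, so the simulated bee slows down automatically as it nears the target without ever estimating where the target is. Adding apparent expansion makes it fly slower, and removing it makes it fly faster.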

Okay, maybe honeybees can use end-to-end approaches, but what about humans? Do we need a more general perception-planning-control framework for human problems? Maybe, but we also use many shortcuts for decision making. Take ball catching: we don't catch falling objects by solving ODEs or planning, we instead employ a gaze heuristic - as long as an object stays in the same point in your field of view, you will eventually intersect with the object's trajectory. Image taken from Henry Brighton's talk on Robust decision making in uncertain environments.

The Trouble With Defining Anything

Third, we tend to describe decision making processes with words. Words are pretty much all we have to communicate with one another, but they are inconsistent with how we actually make decisions. I like to describe this as an intelligence "iceberg"; the surface of the iceberg is how we think our brain ought to make decisions, but the vast majority of intelligent capability is submerged from view, inaccessible to our consciousness and incompressible into simple language like English. That is why we are capable of performing intelligent feats like perception and dextrous manipulation, but struggle to articulate how we actually perform them in short sentences. If it were easy to articulate in clear unambiguous language, we could just type up those words into a computer program and not have to use machine learning for anything. Words about intelligence are lossy compression, and a lossy representation of a program is not sufficient to implement the full thing.

Consider a simple task of identifying the object in the image on the left (a cow). A human might attempt to string some word-based reasoning together to justify why this is a cow: "you see the context (an open field), you see a nose, you see ears, and black-and-white spots, and maybe the most likely object that has all these parts is a cow".

This is a post-hoc justification, and not actually a full description of how our perception system registers whether something is a cow or not. If you take an actual system capable of recognizing cows with great accuracy (e.g. a convnet) and inspect its salient neurons and channels that respond strongly to cows, you will find a strange looking feature map that is hard to put into words. We can't define anything in reality with human-readable words or code with the level of precision needed for interacting with reality, so we must use raw sensory data - grounded in reality - to figure out the decision-making capabilities we want.

Cooking is Not Software 1.0

Our obsession with focusing on the top half of the intelligence iceberg biases us towards the Software 1.0 way of programming, where we take a hard problem and attempt to describe it - using words - as the composition of smaller problems. There is also a tendency for programmers to think of general abstractions for their code, via ontologies that organize words with other words. Reality has many ways to defy your armchair view of what cows are and how robotic skills ought to be organized to accomplish tasks in an object-oriented manner.

Cooking is one of the holy grails of robotic tasks, because environments are open-ended and there is a lot of dextrous manipulation involved. Cooking analogies abound in programming tutorials - here is an example of making breakfast with asynchronous programming. It's tempting to think that you can build a cooking robot by simply breaking down the multi-stage cooking task into sub-tasks and individual primitive skills.

Sadly, even the most trivial of steps abounds with complexity. Consider the simple task of spreading jam on some toast.

The software 1.0 programmer approaches this problem by breaking down the task into smaller, reusable routines. Maybe you think to yourself, first I need a subroutine for holding the slice of toast in place with the robot fingers, then I need a subroutine to spread jam on the toast.

Spreading jam on toast entails three subroutines: a subroutine for scooping the jam with the knife, depositing the lump of jam on the toast, then spreading it evenly.

Here is where the best laid plans go awry. A lot of things can happen in reality at any stage that would prevent you from moving onto the next stage. What if the toaster wasn't plugged in and you're starting with untoasted bread? What if you get the jam on the knife but in the process break something on the robot and you aren't checking to make sure everything is fine before proceeding to the next subroutine? What if there isn't enough jam in the jar? What if you're on the last slice of bread in the loaf and the crust side is facing up?

The prospect of writing custom code to handle the ends of the bread loaf (literal edge cases) ought to give one pause as to whether this approach is scalable to unstructured environments like kitchens - you end up with a million lines of code that essentially capture the state machine of reality. Reality is chaotic - even if you had a perfect perception system, simply managing reality at the planning level quickly becomes intractable. Learning based approaches give us hope of managing this complexity by accumulating all these edge cases in data, and letting the end-to-end objective (getting some jam on the toast) and Software 2.0 compiler figure out how to handle all the edge cases. My belief in end-to-end learning is not because I think ML has unbounded capability, but rather that the alternative approach where we capture all of reality into a giant hand-coded state machine is utterly hopeless.

Here is a video where I am washing and cutting strawberries and putting them on some cheesecake. A roboticist that spends too much time in the lab and not the kitchen might prescribe a program that (1) "holds strawberry", (2) "cut strawberry", (3) "pick-and-place on cheesecake", but if you watch the video frame by frame, there are a lot of other manipulation tasks that happen in the meantime - opening and closing containers with one or two hands, pushing things out of the way, inspecting for quality. To use the Intelligence Iceberg analogy: the recipe and high level steps are the surface ice, but the submerged bulk are all the little micro-skills the hands need to do to open containers and adapt to reality. I believe the most dangerous conceit in robotics is to design elegant programming ontologies on a whiteboard, and ignore the subtleties of reality and what its data tells you.

There are a few links I want to share highlighting the complexity of reality. I enjoyed this recent article on Quanta Magazine about the trickiness of defining life. This is not merely a philosophical question; people at NASA are planning a Mars expedition to collect soil samples and answer whether life ever existed on Mars. This mission requires clarity on the definition of life. Just like it's hard to define intelligent capabilities in precise language, so it is to define life. These two words may as well be one and the same.

Klaus Greff's talk on What Are Objects? raises some interesting questions about the fuzziness of words. Obviously we want our perception systems to recognize objects so that we may manipulate and plan around them. But as the talk points out, defining what is and is not an object can be quite tricky (is a hole an object? Is the frog prince defined by what he once was, or what he looks like now?).

I've also written a short story on the trickiness of defining even simple classes like "teacups".

I worked on a project with Coline Devin where we used data and Software 2.0 to learn a definition of objects without any human labels. We use a grasping system to pick up stuff and define objects as "that which is graspable". Suppose you have a bin of objects and pick one of them up. The object is now removed from the bin and maybe the other objects have shifted around the bin a little. You can also easily look at whatever is in your hand. We then design an embedding architecture and train it using the following assumption about reality: the pre-grasp objects embedding minus the post-grasp objects embedding should equal the embedding of whatever you picked up. This allowed us to bootstrap a completely self-supervised instance grasping system from a grasping system without ever relying on labels. This is by no means a comprehensive definition of "object" (see Klaus's talk) but I think it's a pretty good one.
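The embedding-arithmetic assumption can be written down as a consistency check. The sketch below uses a plain squared error and hand-made three-dimensional vectors as stand-ins. The function name, the toy vectors, and the choice of squared error are assumptions for illustration, and the objective used in the actual project may differ.

```python
def embedding_arithmetic_loss(pre_grasp, post_grasp, grasped_object):
    """Squared error between (pre_grasp - post_grasp) and grasped_object."""
    return sum(
        (pre - post - obj) ** 2
        for pre, post, obj in zip(pre_grasp, post_grasp, grasped_object)
    )


# Toy additive embeddings: a scene embedding is the sum of its objects' embeddings.
mug = [1.0, 0.0, 2.0]
spoon = [0.0, 3.0, 1.0]
scene_before = [m + s for m, s in zip(mug, spoon)]  # bin holds mug and spoon
scene_after = spoon                                  # mug was grasped and removed

consistent = embedding_arithmetic_loss(scene_before, scene_after, mug)
inconsistent = embedding_arithmetic_loss(scene_before, scene_after, spoon)
```

The loss is zero when the claimed object matches what actually left the bin, and positive otherwise. In a learned system the embeddings would come from neural networks over images, and minimizing a loss of this flavor is what pushes them toward this additive structure.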

Science and Engineering of End-to-End ML

End-to-end learning is a wonderful principle for building robotic systems, but it is not without its practical challenges and execution risks. Deep neural nets are opaque black box function approximators, which makes debugging them at scale challenging. This requires discipline in both engineering and science, and often the roboticist needs to make a choice as to whether to solve an engineering problem or a scientific one.

This is what a standard workflow looks like for end-to-end robotics. You start by collecting some data, cleaning it, then designing the input and output specification. You fit a model to the data, validate it offline with some metrics like mean-squared error or accuracy, then deploy it in the real world and see if it continues to work as well as it did on your validation sets. You might iterate on the model and validation via some kind of automated hyperparameter tuning.

Most ML PhDs spend all their time on the model training and validation stages of the pipeline. RL PhDs have a slightly different workflow, where they think a bit more about data collection via the exploration problem. But most RL research also happens in simulation, where there is no need to do data cleaning and the feature and label specification is provided to you via the benchmark's design.

While it's true that advancing learning methods is the primary point of ML, I think this behavior is the result of perverse academic incentives.

There is a vicious tendency for papers to put down old ideas and hype up new ones in the pursuit of "technical novelty". The absurdity of all this is that if we ever found that an existing algorithm works super well on harder and harder problems, it would have a hard time getting published in academic conferences. Reviewers operate under the assumption that our ML algorithms are never good enough.

In contrast, production ML usually emphasizes everything else in the pipeline. Researchers on Tesla's Autopilot team have found that in general, 10x'ing your data on the same model architecture outperforms any incremental modeling improvement in the last few years. As Ilya Sutskever says, most incremental algorithm improvements are just data in disguise. Researchers at quantitative trading funds do not change models drastically: they spend their time finding novel data sources that add additional predictive signal. By focusing on large-scale problems, you get a sense of where the real bottlenecks are. You should only work on innovating new learning algorithms if you have reason to believe that that is what is holding your system back.

Here are some examples of real problems I've run into in building end-to-end ML systems. When you collect data on a robot, certain aspects of the code get baked into the data. For instance, the tuning of the IK solver or the acceleration limits on the joints. A few months later, the code on the robot controllers might have changed in subtle ways, like maybe the IK solver was swapped with a different solver. This happens a lot in a place like Google where multiple people work on a single codebase. But because assumptions of the v0 solver were baked into the training data, you now have a train-test mismatch and the ML policy no longer works as well.

Consider an imitation learning task where you collect some demonstrations, and then predict actions (labels) from states (features). An important unit test to perform before you even start training a model is to check whether a robot that replays the exact labels in order can actually solve the task (for an identical initialization as the training data). This check is important because the way you design your labels might make assumptions that don't necessarily hold at test-time.
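A minimal sketch of such a replay check is below. `ToyReachEnv` is a hypothetical one-dimensional stand-in for a real robot environment, and its clipping behavior is an assumed example of a controller detail that labels can silently violate.

```python
class ToyReachEnv:
    """Hypothetical 1-D reaching task: move from `start` to within `tol` of `goal`."""

    def __init__(self, start, goal, max_step=0.5, tol=1e-6):
        self.start, self.goal = start, goal
        self.max_step, self.tol = max_step, tol
        self.position = start

    def reset(self):
        self.position = self.start
        return self.position

    def step(self, action):
        # The controller clips commands, as a real joint limit might.
        clipped = max(-self.max_step, min(self.max_step, action))
        self.position += clipped
        return self.position

    def solved(self):
        return abs(self.position - self.goal) <= self.tol


def replay_solves_task(env, actions):
    """Replay logged action labels open-loop from the demo's initial state."""
    env.reset()
    for action in actions:
        env.step(action)
    return env.solved()


env = ToyReachEnv(start=0.0, goal=2.0)
good_labels = [0.5, 0.5, 0.5, 0.5]  # respects the controller's step limit
bad_labels = [1.0, 1.0]             # same net motion on paper, but gets clipped
```

Here `replay_solves_task(env, good_labels)` passes while `replay_solves_task(env, bad_labels)` fails, even though both label sequences sum to the same displacement. If the replay fails, no amount of model tuning will fix the policy, because the labels themselves do not solve the task.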

I've found data management to be one of the most crucial aspects of debugging real world robotic systems. Recently I found a "data bug" where there was a demonstration of the robot doing nothing for 5 minutes straight - the operator probably left the recording running without realizing it. Even though the learning code was fine, noisy data like this can be catastrophic for learning performance.
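A cheap sanity filter would have caught that particular bug. The sketch below assumes each demonstration is stored as a list of joint-position vectors. The storage format, the threshold, and the demo names are all hypothetical.

```python
def is_idle(joint_trajectory, min_total_motion=1e-3):
    """Flag a demonstration whose joints barely move over the whole recording."""
    total_motion = 0.0
    for previous, current in zip(joint_trajectory, joint_trajectory[1:]):
        total_motion += sum(abs(c - p) for p, c in zip(previous, current))
    return total_motion < min_total_motion


demos = {
    "reach_cup": [[0.0, 0.0], [0.1, 0.05], [0.2, 0.10]],
    "left_running": [[0.3, 0.3]] * 500,  # robot sat still the whole time
}
kept = {name: traj for name, traj in demos.items() if not is_idle(traj)}
```

Checks like this belong in the data pipeline itself, so that every new batch of demonstrations is screened before it reaches training.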

As roboticists we all want to see in our lifetime robots doing holy grail tasks like tidying our homes and cooking in the kitchen. Our existing systems, whether you work on Software 1.0 or Software 2.0 approaches, are far away from that goal. Instead of spending our time researching how to re-solve a task a little bit better than an existing approach, we should be using our existing robotic capabilities to collect new data for tasks we can't solve yet.

There is a delicate balance in choosing between understanding ML algorithms better, versus pushing towards a longer term goal of qualitative leaps in robotic capability. I also acknowledge that the deep learning revolution for robotics needs to begin with solving the easier tasks and then eventually working its way up to the harder problems. One way to accomplish both good science and long term robotics is to understand how existing algorithms break down in the face of harder data and tougher generalization demands encountered in new tasks.

Interesting Problems

Hopefully I've convinced you that end-to-end learning is full of opportunities to really get robotics right, but also rife with practical challenges. I want to highlight two interesting problems that I think are deeply important to pushing this field forward, not just for robotics but for any large-scale ML system.

A typical ML research project starts from a fixed dataset. You code up and train a series of ML experiments, then you publish a paper once you're happy with one of the experiments. These codebases are not very large and don't get maintained beyond the duration of the project, so you can move quickly and scrappily with little to no version control or regression testing.

Consider how this would go for a "lifelong learning" system for robotics, where you are collecting data and never throwing it away. You start the project with some code that generates a dataset (Data v1). Then you train a model with some more code, which compiles a Software 2.0 program (ckpt.v1.a). Then you use that model to collect more data (Data v2), and concatenate your datasets together (Data v1 + Data v2) to then train another model, and use that to collect a third dataset (Data v3), and so on. All the while you might be publishing papers on the intermediate results.

The tricky thing here is that the behavior of Software 1.0 and Software 2.0 code is now baked into each round of data collection, and the Software 2.0 code has assumptions from all prior data and code baked into it. The dependency graph between past versions of code and your current system becomes quite complex to reason about.

This only gets trickier if you are running multiple experiments and generating multiple Software 2.0 binaries in parallel, and collecting with all of those.

Let's examine what code gets baked into a collected dataset. It is a combination of Software 1.0 code (IK solver, logging schema) and Software 2.0 code (a model checkpoint). The model checkpoint itself is the distillation of a ML experiment, which consists of more Software 1.0 code (Featurization, Training code) and Data, which in turn depends on its own Software 1.0 and 2.0 code, and so on.
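One way to picture this recursion is as a provenance graph. The sketch below is a hypothetical data structure, not a description of any tooling we actually use. The class names, field names, and version strings such as `robot@a1` are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Checkpoint:
    name: str
    training_code_version: str  # Software 1.0: featurization, training code
    training_data: Tuple["Dataset", ...]


@dataclass(frozen=True)
class Dataset:
    name: str
    robot_code_version: str  # Software 1.0: IK solver, logging schema
    collected_with: Optional[Checkpoint] = None  # Software 2.0, if any


def code_versions_baked_in(dataset):
    """Collect every Software 1.0 version this dataset transitively depends on."""
    versions = {dataset.robot_code_version}
    ckpt = dataset.collected_with
    if ckpt is not None:
        versions.add(ckpt.training_code_version)
        for parent in ckpt.training_data:
            versions |= code_versions_baked_in(parent)
    return versions


data_v1 = Dataset("Data v1", robot_code_version="robot@a1")
ckpt_v1 = Checkpoint("ckpt.v1.a", "train@b1", (data_v1,))
data_v2 = Dataset("Data v2", robot_code_version="robot@a2", collected_with=ckpt_v1)
ckpt_v2 = Checkpoint("ckpt.v2.a", "train@b2", (data_v1, data_v2))
data_v3 = Dataset("Data v3", robot_code_version="robot@a3", collected_with=ckpt_v2)
```

Calling `code_versions_baked_in(data_v3)` returns all five code versions, even though only one of them was running on the robot when Data v3 was collected. After only three rounds, the third dataset already carries assumptions from every earlier piece of code.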

Here's the open problem I'd like to pose to the audience: how can we verify correctness of lifelong learning systems (accumulating data, changing code), while ensuring experiments are reproducible and bug free? Version control software and continuous integration testing are indispensable for team collaboration on large codebases. What would the Git of Software 2.0 look like?

Here are a couple ideas on how to mitigate the difficulty of lifelong learning. The flywheel of an end-to-end learning system involves converting data to a model checkpoint, then a model checkpoint to predictions, and model predictions to a final real world evaluation number. That eval also gets converted into data. It's critical to test these four components separately to ensure there are no regressions - if one of these breaks, so does everything else.

Another strategy is to use Sim2Real, where you train everything in simulation and develop a lightweight fine-tuning procedure for transferring the system to reality. We rely on this technique heavily at Google and I've heard this is OpenAI's strategy as well. In simulation, you can transmute compute into data, so data is relatively cheap and you don't have to worry about handling old data. Every time you change your Software 1.0 code, you can just re-simulate everything from scratch and you don't have to deal with ever-increasing data heterogeneity. You might still have to manage some data dependencies for real world data, because typically sim2real methods require training a CycleGAN.

Compiling Software 2.0 Capable of Lifelong Learning

When people use the phrase "lifelong learning" there are really two definitions. One is about lifelong dataset accumulation, and concatenating prior datasets to train systems that do new capabilities. Here, we may re-compile the Software 2.0 over and over again.

A stronger version of "lifelong learning" is to attempt to train systems that learn on their own and never need to have their Software 2.0 re-compiled. You can think about this as a task that runs for a very long time.

Many of the robotic ML models we build in our lab have goldfish memories - they make all their decisions from a single instant in time. They are, by construction, incapable of remembering what the last action they took was or what happened 10 seconds ago. But there are plenty of tasks where it's useful to remember:

An AI that can watch a movie (>170k images) and give you a summary of the plot.
An AI that is conducting experimental research, and it needs to remember hundreds of prior experiments to build up its hypotheses and determine what to try next.
An AI therapist that should remember the context of all your prior conversations (say, around 100k words).
A robot that is cooking and needs to leave something in the oven for several hours and then resume the recipe afterwards.

Memory and learning over long time periods require some degree of selective memory and attention. We don't know how to select which moments in a sequence are important, so we must acquire that by compiling a Software 2.0 program. We can train a neural network to fit some task objective to the full "lifetime" of the model, and let the model figure out how it needs to selectively remember within that lifetime in order to solve the task.

However, this presents a big problem: in order to optimize this objective, you need to run forward predictions over every step in the lifetime. If you are using backpropagation to train your networks, then you also need to run a similar number of steps in reverse. If you have N data elements and the lifetime is T steps long, the computational cost of learning is between O(NT) and O(NT^2), depending on whether you use RNNs, Transformers, or something in between. Even though a selective attention mechanism might be an efficient way to perform long-term memory and learning, the act of finding that program via Software 2.0 compilation is very expensive because we have to consider full sequences.
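To get a feel for the gap between the two ends of that range, here is a back-of-the-envelope sketch. It ignores constants, model width, and the backward pass, and the function name is hypothetical. The lifetime length is borrowed from the movie example above.

```python
def sequence_training_cost(num_sequences, lifetime_steps, architecture):
    """Rough operation count for one pass over the data, ignoring constants."""
    if architecture == "rnn":
        per_sequence = lifetime_steps       # one recurrent update per step
    elif architecture == "transformer":
        per_sequence = lifetime_steps ** 2  # every step attends to every step
    else:
        raise ValueError(f"unknown architecture: {architecture!r}")
    return num_sequences * per_sequence


movie_frames = 170_000  # lifetime length T from the movie example above
rnn_cost = sequence_training_cost(1, movie_frames, "rnn")
attention_cost = sequence_training_cost(1, movie_frames, "transformer")
```

In this crude count, full attention over a single movie-length lifetime costs T times more than the recurrent pass, a factor of 170,000, before even multiplying by the number of training sequences N.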

Train on Short Sequences and It Just Works

The optimistic take is that we can just train on shorter sequences, and it will just generalize to longer sequences at test time. Maybe you can train selective attention on short sequences, and then couple that with a high capacity external memory. Ideas from Neural Program Induction and Neural Turing Machines seem relevant here. Alternatively, you can use ideas from Q-learning to essentially do dynamic programming across time and avoid having to ingest the full sequence into memory (R2D2).

Hierarchical Computation

Another approach is to fuse multiple time steps into a single one, potentially repeating this trick over and over again until you have effectively O(log(T)) computation cost instead of O(T) cost. This can be done in both forward and backward passes - clockwork RNNs and Dilated Convolutions used in WaveNet are good examples of this. A variety of recent sub-quadratic attention improvements to Transformers (Block Sparse Transformers, Performers, Reformers, etc.) can be thought of as special cases of this as well.
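The logarithmic scaling is easiest to see in the dilated-convolution case. The sketch below counts how many layers a stack needs before its receptive field spans a whole sequence, assuming dilations that double at every layer (1, 2, 4, ...) in the style of WaveNet. The function name and the default kernel size of 2 are assumptions for illustration, and the count is of depth only, not of total compute.

```python
def dilated_layers_needed(sequence_length, kernel_size=2):
    """Layers needed for a stack with dilations 1, 2, 4, ... to span the sequence."""
    receptive_field, dilation, layers = 1, 1, 0
    while receptive_field < sequence_length:
        # Each layer widens the receptive field by (kernel_size - 1) * dilation.
        receptive_field += (kernel_size - 1) * dilation
        dilation *= 2
        layers += 1
    return layers


short_clip = dilated_layers_needed(1024)
full_movie = dilated_layers_needed(170_000)
```

With kernel size 2 the receptive field doubles per layer, so 1024 steps need 10 layers and a 170,000-step lifetime needs only 18. The path between the first and last time step grows with log(T) rather than T.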

Parallel Evolution

Maybe we do need to just bite the bullet and optimize over the full sequences, but use embarrassingly parallel algorithms to amortize the time complexity (by distributing it across space). Rather than serially running forward-backward on the same model over and over again, you could imagine testing multiple lifelong learning agents simultaneously and choosing the best-of-K agents after T time has elapsed.
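A toy version of best-of-K selection looks like the sketch below. `run_lifetime` is a hypothetical stand-in for an entire agent lifetime, with a single random "gene" and a made-up reward. Threads are used only to keep the example short, and a real system would spread lifetimes across machines.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def run_lifetime(seed, lifetime_steps=1000):
    """Stand-in for one agent's full lifetime; returns (total_reward, seed)."""
    rng = random.Random(seed)
    step_size = rng.uniform(0.0, 1.0)  # the agent's only "gene"
    position, target, total_reward = 0.0, 10.0, 0.0
    for _ in range(lifetime_steps):
        position += step_size * (target - position) * 0.01
        total_reward -= abs(target - position)
    return total_reward, seed


def best_of_k(seeds):
    # Lifetimes are independent of each other, so they can run side by side.
    with ThreadPoolExecutor(max_workers=len(seeds)) as pool:
        results = list(pool.map(run_lifetime, seeds))
    return max(results)


best_reward, best_seed = best_of_k(range(8))
```

The wall-clock time is one lifetime regardless of K, since no agent ever waits on another. The price is K times the hardware.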

If you're interested in these problems, here's some concrete advice for how to get started. Start by looking up the existing literature in the field, pick one of these papers, and see if you can re-implement it from scratch. This is a great way to learn and make sure you have the necessary coding chops to get ML systems working well. Then ask yourself, how well does the algorithm handle harder problems? At what point does it break down? Finally, rather than thinking about incremental improvements to existing algorithms and benchmarks, constantly be thinking of harder benchmarks and new capabilities.

Summary

Three reasons why I believe in end-to-end ML for robotics: (1) it worked for other domains, (2) fusing perception and control is a nice way to simplify decision making for many tasks, and (3) we can't define anything precisely so we need to rely on reality (via data) to tell us what to do.
When it comes to improving our learning systems, think about the broader pipeline, not just the algorithmic and mathy learning part.
Challenge: how do we do version control for Lifelong Learning systems?
Challenge: how do we compile Software 2.0 that does Lifelong Learning? How can we optimize for long-term memory and learning without having to optimize over full lifetimes?

Don't Mess with Backprop: Doubts about Biologically Plausible Deep Learning

2021-02-13T13:45:00.010-08:00

Spanish translation (“Traducción a Español”)

Biologically Plausible Deep Learning (BPDL) is an active research field at the intersection of Neuroscience and Machine Learning, studying how we can train deep neural networks with a "learning rule" that could conceivably be implemented in the brain.

The line of reasoning that typically motivates BPDL is as follows:

A Deep Neural Network (DNN) can learn to perform perception tasks that biological brains are capable of (such as detecting and recognizing objects).
If activation units and their weights are to DNNs as what neurons and synapses are to biological brains, then what is backprop (the primary method for training deep neural nets) analogous to?
If learning rules in brains are not implemented using backprop, then how are they implemented? How can we achieve similar performance to backprop-based update rules while still respecting biological constraints?

A nice overview of the ways in which backprop is not biologically plausible can be found here, along with various algorithms that propose fixes.

My somewhat contrarian opinion is that designing biologically plausible alternatives to backprop is the wrong question to be asking. The motivating premise of BPDL makes a faulty assumption: that layer activations are neurons and weights are synapses, and therefore learning-via-backprop must have a counterpart or alternative in biological learning.

Despite the name and their impressive capabilities on various tasks, DNNs actually have very little to do with biological neural networks. One of the great errors in the field of Machine Learning is that we ascribe too much biological meaning to our statistical tools and optimal control algorithms. It leads to confusion from newcomers, who ascribe entirely different meaning to "learning", "evolutionary algorithms", and so on.

DNNs are a sequence of linear operations interspersed with nonlinear operations, applied sequentially to real-valued inputs - nothing more. They are optimized via gradient descent, and gradients are computed efficiently using a dynamic programming scheme known as backprop. Note that I didn't use the word "learning"!

Dynamic programming is the ninth wonder of the world [1], and in my opinion one of the top three achievements of Computer Science. Backprop has linear time-complexity in network depth, which makes it extraordinarily hard to beat from a computational cost perspective. Many BPDL algorithms don't do better than backprop, because they try to take an efficient optimization scheme and shoehorn in an update mechanism with additional constraints.
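To illustrate why backprop counts as dynamic programming, here is a minimal sketch on a toy scalar network $h_{i+1} = \tanh(w_i h_i)$ (an assumption made purely for illustration). The backward sweep caches one "upstream" gradient and reuses it at every layer, so all weight gradients come out of a single pass that is linear in depth. The finite-difference helper is only there as a sanity check.

```python
import math


def forward(weights, x):
    """Run a scalar chain h_{i+1} = tanh(w_i * h_i) and cache every activation."""
    activations = [x]
    for w in weights:
        activations.append(math.tanh(w * activations[-1]))
    return activations


def backprop(weights, x):
    """Return d(output)/d(w_i) for every layer using one backward sweep."""
    activations = forward(weights, x)
    grads = [0.0] * len(weights)
    upstream = 1.0  # d(output)/d(output)
    for i in reversed(range(len(weights))):
        h_in, h_out = activations[i], activations[i + 1]
        local = 1.0 - h_out ** 2                  # derivative of tanh at this layer
        grads[i] = upstream * local * h_in        # reuse the cached upstream gradient
        upstream = upstream * local * weights[i]  # pass the gradient one layer down
    return grads


def finite_difference(weights, x, i, eps=1e-6):
    """Numerical check of one gradient entry (needs two forward passes per weight)."""
    up = list(weights)
    down = list(weights)
    up[i] += eps
    down[i] -= eps
    return (forward(up, x)[-1] - forward(down, x)[-1]) / (2 * eps)


if __name__ == "__main__":
    ws = [0.5, -1.2, 0.8, 1.5]
    print(backprop(ws, 0.7))
    print([finite_difference(ws, 0.7, i) for i in range(len(ws))])
```

The finite-difference route costs a forward pass per weight, i.e. quadratic in depth for this chain, while the backward sweep does a constant amount of work per layer.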

If the goal is to build a biologically plausible learning mechanism, there's no reason that units in Deep Neural Networks should be one-to-one with biological neurons. Trying to emulate a DNN with models of biological neurons feels backwards; like trying to emulate the Windows OS with a human brain. It's hard and a human brain can't simulate Windows well.

Instead, let's do the emulation the other way around: optimizing a function approximator to implement a biologically plausible learning rule. The recipe is straightforward:

Build a biologically plausible model of a neural network with model neurons and synaptic connections. Neurons communicate with each other using spike trains, rate coding, or gradients, and respect whatever constraints you deem to be "sufficiently biologically plausible". It has parameters that need to be trained.
Use computer-aided search to design a biologically plausible learning rule for these model neurons. For instance, each neuron's feedforward behavior and local update rules can be modeled as a decision from an artificial neural network.
Update the function approximator so that the biological model produces the desired learning behavior. We could train the neural networks via backprop.
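The three steps above can be sketched on a deliberately tiny example. Everything here is an assumption for illustration: the "biological model" is a single linear neuron, the learning rule is a two-coefficient local update rather than a neural network, and the outer loop uses random-perturbation hill climbing as a dependency-free stand-in for training via backprop.

```python
import random


def inner_loop(theta, w_true, xs):
    """Train one model neuron y = w * x with the local rule; return its final squared error."""
    w = 0.0
    for x in xs:
        y = w * x        # post-synaptic activity
        t = w_true * x   # teaching signal available at the synapse
        # Local update: uses only pre-synaptic x, post-synaptic y, and target t.
        w += theta[0] * x * t + theta[1] * x * y
    return (w - w_true) ** 2


def meta_loss(theta, tasks):
    """Average post-training error of the rule across several tasks."""
    return sum(inner_loop(theta, w_true, xs) for w_true, xs in tasks) / len(tasks)


def make_tasks(seed=0, num_samples=50):
    """Four toy regression tasks that differ only in the hidden weight w_true."""
    rng = random.Random(seed)
    return [(w_true, [rng.uniform(-1.0, 1.0) for _ in range(num_samples)])
            for w_true in (-2.0, -1.0, 1.0, 2.0)]


def search_rule(tasks, steps=500, sigma=0.05, seed=1):
    """Outer loop: hill-climb the rule's two coefficients to lower the meta-loss."""
    rng = random.Random(seed)
    theta = (0.0, 0.0)
    best = meta_loss(theta, tasks)
    for _ in range(steps):
        candidate = (theta[0] + rng.gauss(0.0, sigma),
                     theta[1] + rng.gauss(0.0, sigma))
        loss = meta_loss(candidate, tasks)
        if loss < best:  # accept only improvements
            theta, best = candidate, loss
    return theta, best


if __name__ == "__main__":
    tasks = make_tasks()
    print("no learning:", meta_loss((0.0, 0.0), tasks))
    print("delta rule :", meta_loss((0.5, -0.5), tasks))
    print("searched   :", search_rule(tasks))
```

Setting the coefficients to (η, -η) recovers the classic delta rule, which gives a reference point for whatever the outer search finds. The outer optimizer never has to be biologically plausible; only the rule it outputs does.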

The choice of function approximator we use to find our learning rule is irrelevant - what we care about at the end of the day is answering how a biological brain is able to learn hard tasks like perception, while respecting known constraints like the fact that biological neurons don't store all activations in memory and only employ local learning rules. We should leverage Deep Learning's ability to find good function approximators, and direct that towards finding good biological learning rules.

The insight that we should (artificially) learn to (biologically) learn is not a new idea, but it is one that I think is not yet obvious to the neuroscience + AI community. Meta-Learning, or "Learning to Learn", is a field that has emerged in recent years, which formulates the act of acquiring a system capable of performing learning behavior (potentially superior to gradient descent) as an optimization problem in its own right. If meta-learning can find us more sample efficient or superior or robust learners, why can't it find us rules that respect biological learning constraints? Indeed, recent work [1, 2, 3, 4, 5] shows this to be the case. You can indeed use backprop to train a separate learning rule superior to naïve backprop.

I think the reason that many researchers have not really caught onto this idea (that we should emulate biologically plausible circuits with a meta-learning approach) is that until recently, compute power wasn't quite strong enough to train both a meta-learner and a learner. It still requires substantial computing power and research infrastructure to set up a meta-optimization scheme, but tools like JAX make it considerably easier now.

A true biology purist might argue that finding a learning rule using gradient descent and backprop is not an "evolutionarily plausible learning rule", because evolution clearly lacks the ability to perform dynamic programming or even gradient computation. But this can be amended by making the meta-learner evolutionarily plausible. For instance, the mechanism with which we select good function approximators does not need to rely on backprop at all. Alternatively, we could formulate a meta-meta problem whereby the selection process itself obeys rules of evolutionary selection, but the selection process is found using, once again, backprop.

Don't mess with backprop!

Footnotes

[1] The eighth wonder being, of course, compound interest.

How to Understand ML Papers Quickly

2021-01-25T21:39:00.006-08:00

My ML mentees often ask me some variant of the question “how do you choose which papers to read from the deluge of publications flooding Arxiv every day?”

The nice thing about reading most ML papers is that you can cut through the jargon by asking just five simple questions. I try to answer these questions as quickly as I can when skimming papers.

1) What are the inputs to the function approximator?

E.g. a 224x224x3 RGB image with a single object roughly centered in the view.

2) What are the outputs of the function approximator?

E.g. a 1000-long vector corresponding to the class of the input image.

Thinking about inputs and outputs to the system in a method-agnostic way lets you take a step back from the algorithmic jargon and consider whether other fields have developed methods that might work here using different terminology. I find this approach especially useful when reading Meta-Learning papers.

By thinking about a ML problem first as a set of inputs and desired outputs, you can reason whether the input is even sufficient to predict the output. Without this exercise you might accidentally set up a ML problem where the output can't possibly be determined by the inputs. The result might be a ML system that performs predictions in a way that is problematic for society.

3) What loss supervises the output predictions? What assumptions about the world does this particular objective make?

ML models are formed from combining biases and data. Sometimes the biases are strong, other times they are weak. To make a model generalize better, you need to add more biases or add more unbiased data. There is no free lunch.

An example: many optimal control algorithms make the assumption of a stationary episodic data generation procedure that is a Markov Decision Process (MDP). In an MDP, “state” and “action” map via the environment’s transition dynamics (which may be stochastic) to “a next-state, reward, and whether the episode is over or not”. This structure, though very general, can be used to formulate a loss that allows learning Q values to follow the Bellman Equation.
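As a hedged illustration of how the MDP structure turns into a loss, the sketch below uses a hypothetical four-state chain with deterministic transitions (a toy assumption, not a claim about any particular paper). The Bellman target is `reward + GAMMA * max Q(next_state, ·)`, with the bootstrap term dropped when the episode is over, and the loss is the squared gap between Q and that target.

```python
GAMMA = 0.9
NUM_STATES = 4      # states 0..3; state 3 is terminal
ACTIONS = (-1, +1)  # move left or right along the chain


def step(state, action):
    """Toy deterministic transition: returns (next_state, reward, done)."""
    next_state = min(max(state + action, 0), NUM_STATES - 1)
    done = next_state == NUM_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def bellman_target(q, state, action):
    """Right-hand side of the Bellman equation for one (state, action) pair."""
    next_state, reward, done = step(state, action)
    bootstrap = 0.0 if done else max(q[next_state].values())
    return reward + GAMMA * bootstrap


def bellman_loss(q):
    """Mean squared Bellman error over all non-terminal (state, action) pairs."""
    errors = [(q[s][a] - bellman_target(q, s, a)) ** 2
              for s in range(NUM_STATES - 1) for a in ACTIONS]
    return sum(errors) / len(errors)


def fit_q(num_sweeps=50):
    """Repeatedly overwrite Q with its Bellman target until the loss vanishes."""
    q = {s: {a: 0.0 for a in ACTIONS} for s in range(NUM_STATES)}
    for _ in range(num_sweeps):
        for s in range(NUM_STATES - 1):
            for a in ACTIONS:
                q[s][a] = bellman_target(q, s, a)
    return q


if __name__ == "__main__":
    q = fit_q()
    print(q, bellman_loss(q))
```

The assumptions the loss bakes in are visible in the code: the next state depends only on the current state and action, the dynamics do not change over time, and episodes terminate.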

4) Once trained, what is the model able to generalize to, in regards to input/output pairs it hasn’t seen before?

Due to the information captured in the data or the architecture of the model, the ML system may generalize fairly well to inputs it has never seen before. In recent years we are seeing more and more ambitious levels of generalization, so when reading papers I watch out for any surprising generalization capabilities and where they come from (data, bias, or both).

There is a lot of noise in the field about better inductive biases, like causal reasoning or symbolic methods or object-centric representations. These are important tools for building robust and reliable ML systems and I get that the line separating structured data vs. model biases can be blurry. That being said, it baffles me how many researchers think that the way to move ML forward is to reduce the amount of learning and increase the amount of hard-coded behavior.

We do ML precisely because there are things we don't know how to hard-code. As Machine Learning researchers, we should focus our work on making learning methods better, and leave the hard-coding and symbolic methods to the Machine Hard-Coding Researchers.

5) Are the claims in the paper falsifiable?

Papers that make claims that cannot be falsified are not within the realm of science.

P.S. for additional hot takes and mentorship for aspiring ML researchers, sign up for my free office hours. I've been mentoring students over Google Video Chat most weekends for 7 months now and it's going great.

Software and Hardware for General Robots

2020-11-28T23:53:00.022-08:00

Disclaimer, these are just my opinions and not necessarily those of my employer or robotics colleagues.

2021-04-23: If you liked this post, you may be interested in a more recent blog post I wrote on why I believe in end-to-end learning for robots.

Hacker News Discussion

Moravec's Paradox describes the observation that our AI systems can solve "adult-level cognitive" tasks like chess-playing or passing text-based intelligence tests fairly easily, while basic sensorimotor skills like crawling around or grasping objects - things one-year old children can do - are very difficult.

Anyone who has tried to build a robot to do anything will realize that Moravec's Paradox is not a paradox at all, but rather a direct corollary of our physical reality being so irredeemably complex and constantly demanding. Modern humans traverse millions of square kilometers in their lifetime, a labyrinth full of dangers and opportunities. If we had to consciously process and deliberate all the survival-critical sensory inputs and motor decisions like we do moves in a game of chess, we would have probably been selected out of the gene pool by Darwinian evolution. Evolution has optimized our biology to perform sensorimotor skills in a split second and make it feel easy.

Another way to appreciate this complexity is to adjust your daily life to a major motor disability, like losing fingers or trying to get around San Francisco without legs.

Software for General Robots

The difficulty of sensorimotor problems is especially apparent to people who work in robotics and get their hands dirty with the messiness of "the real world". What are the consequences of an irredeemably complex reality on how we build software abstractions for controlling robots?

One of my pet peeves is when people who do not have sufficient respect for Moravec's Paradox propose a programming model where high-level robotic tasks ("make me dinner") can be divided into sequential or parallel computations with clearly defined logical boundaries: wash rice, de-frost meat, get the plates, set the table, etc. These sub-tasks can be in turn broken down further. When a task cannot be decomposed further because there are too many edge cases for conventional software to handle ("does the image contain a cat?"), we can attempt to invoke a Machine Learning model as "magic software" for that capability.

This way of thinking - symbolic logic that calls upon ML code - arises from engineers who are used to clinging to the tidiness of Software 1.0 abstractions and programming tutorials that use cooking analogies.

Do you have any idea how much intelligence goes into a task like "fetching me a snack", at the very lowest levels of motor skill? Allow me to illustrate. I recorded a short video of me opening a package of dates and annotated it with all the motor sub-tasks I performed in the process.

https://youtu.be/b1lysnGFpqI

In the span of 36 seconds, I counted about 14 motor and cognitive skills. They happened so quickly that I didn't consciously notice them until I went back and analyzed the video, frame by frame.

Here are some of the things I did:

Leverage past experience opening this sort of package to understand material properties and how much force to apply.
Constantly adapt my strategy in response to unforeseen circumstances (Ziploc not giving)
Adjusting grasp when slippage occurs
Devising an ad-hoc Weitlaner Retractor with thumb knuckles to increase force on the Ziploc.

As a roboticist, it's humbling to watch videos of animals making decisions so quickly and then watch our own robots struggle to do the simplest things. We even have to speed up the robot video 4x-8x to prevent the human watcher from getting bored!

With this video in mind, let's consider where we currently are in the state of robotic manipulation. In the last decade or so, multiple research labs have used deep learning to develop robotic systems that can perform any-object robotic grasping from vision. Grasping is an important problem because in order to manipulate objects, one must usually first grasp them. It took the Google Robotics and X teams 2-3 years to develop our own system, QT-Opt. This was a huge research achievement because it was a general method that worked on pretty much any object and, in principle, could be used to learn other tasks.

Some people think that this capability to pick up objects can be wrapped in a simple programmatic API and then used to bootstrap us to human-level manipulation. After all, hard problems are just composed of simpler problems, right?

I don't think it's quite so simple. The high-level API call "pick_up_object()" implies a clear semantic boundary between when the robot grasping begins and when it ends. If you re-watch the video above, how many times do I perform a grasp? It's not clear to me at all where you would slot those function calls. Here is a survey if you are interested in participating in a poll of "how many grasps do you see in this video", whose results I will update in this blog post.

If we need to solve 13 additional manipulation skills just to open a package of dates, and each one of these capabilities takes 2-3 years to build, then we are a long, long way from making robots that match the capabilities of humans. Never mind that there isn't a clear strategy for how to integrate all these behaviors together into a single algorithmic routine. Believe me, I wish reality were simple enough that complex robotic manipulation could be done mostly in Software 1.0. However, as we move beyond pick-and-place towards dexterous and complex tasks, I think we will need to completely rethink how we integrate different capabilities in robotics.

As you might note from the video, the meaning of a "grasp" is somewhat blurry. Biological intelligence was not specifically evolved for grasping - rather, hands and their behaviors emerged from a few core drives: regulate internal and external conditions, find snacks, replicate.

None of this is to say that our current robot platforms and the Software 1.0 programming models are useless for robotics research or applications. A general purpose function pick_up_object() can still be combined with "Software 1.0 code" into a reliable system worth billions of dollars in value to Amazon warehouses and other logistics fulfillment centers. General pick-and-place for any object in any unstructured environment remains an unsolved, valuable, and hard research problem.

Hardware for General Robots

What robotic hardware do we require in order to "open a package of dates"?

Willow Garage was one of the pioneers in home robots, showing that a teleoperated PR2 robot could be used to tidy up a room (note that two arms are needed here for more precise placement of pillows). Tidying tasks like this are made up of many pick-and-place operations.

https://youtu.be/o7JH3UWO6I0

This video was made in 2008. That was 12 years ago! It's sobering to think of how much time has passed and how little the needle has seemingly moved. Reality is hard.

The Stretch is a simple telescoping arm attached to a vertical gantry. It can do things like pick up objects, wipe planar surfaces, and open drawers.

https://youtu.be/2msVU0ygrqM

However, futurist beware! A common source of hype for people who don't think enough about physical reality is to watch demos of robots doing useful things in one home, and then conclude that the same robots are ready to do those tasks in any home.

The Stretch video shows the robot pulling open a dryer door (left-swinging) and retrieving clothes from it. The video is a bit deceptive - I think the camera physically cannot see the interior of the dryer, so even though a human can teleoperate the robot to do the task, it would run into serious difficulty when ensuring that the dryer has been completely emptied.

Here is a picture of my own dryer, which has a right-swinging door close to a wall. I'm not sure if the Stretch actually can fit in this tight space, but the PR2 definitely would not be able to open this door without the base getting in the way.

Reality's edge cases are often swept under the rug when making robot demo videos, which usually show the robot operating in an optimal environment that the robot is well-suited for. But the full range of tasks humans do in the home is vast. Neither the PR2 nor the Stretch can crouch under a table to pick up lint off the floor, change a lightbulb while standing on a chair, fix caulking in a bathroom, open mail with a letter opener, move dishes from the dishwasher to the high cabinets, break down cardboard boxes for the recycle bin, or go outside and retrieve the mail.

And of course, they can't even open a Ziploc package of dates. If you think that was complex, here is a first-person video of me chopping strawberries, washing utensils, and decorating a cheesecake. This was recorded with a GoPro strapped to my head. Watch each time my fingers twitch - each one is a separate manipulation task!

https://youtu.be/_Hd7JkOo0B8

We often talk about a future where robots do our cooking for us, but I don't think it's possible with any hardware on the market today. The only viable hardware for a robot meant to do any task in human spaces is an adult-sized humanoid, with two arms, two legs, and five fingers on each hand.

As I discussed for Software 1.0 in robotics, there is an enormous space of robot morphologies that can still provide value to research and commercial applications. That doesn't change the fact that no alternative hardware can do all the things a humanoid can in a human-centric space. Agility Robotics is one of the companies that gets it on the hardware design front. People who build physical robots use their hands a lot - could you imagine the robot you are building assembling a copy of itself?

Why Don't We Just Design Environments to be More Robot-Friendly?

A compromise is to co-design the environment with the robot to avoid infeasible tasks like those above. This can simplify both the hardware and software problems. Common examples I hear incessantly go like this:

Dishwashing machines are better than a bimanual robot washing dishes in the sink, and a dryer is a more efficient machine than a human hanging out clothes to air-dry.
Airplanes are better at transporting humans than birds
We built cars and roads, not faster horses
Wheels can bear more weight and are more energetically efficient than legs.

In the home robot setting, we could design special dryer machine doors that the robot can open easily, or have custom end-effectors (tools) for each task instead of a five-fingered hand. We could go as far as to have the doors be motorized and open themselves with a remote API call, so the robot doesn't even need to open the dryer on its own.

At the far end of this axis, why even bother with building a robot? We could re-imagine the design of homes themselves to be a single ASRS system that brings you whatever you need from any location in the house like a Dumbwaiter (except it would work horizontally and vertically). This would dispense with the need to have a robot walking around in your home.

This pragmatic line of thinking is fine for commercial applications, but as a human being and a scientist, it feels a bit like a concession of defeat that we cannot make robots do tasks the way humans do. Let's not forget the Science Fiction dreams that inspired so many of us down this career path - it is not about doing the tasks better, it is about doing everything humans can. A human can wash dishes and dry clothes by hand, so a truly general-purpose robot should be able to as well. For many people, this endeavor is as close as we can get to Biblical Creation: “Let Us make man in Our image, after Our likeness, to rule over the fish of the sea and the birds of the air, over the livestock, and over all the earth itself and every creature that crawls upon it.”

Yes, we've built airplanes to fly people around. Airplanes are wonderful flying machines. But to build a bird, which can do a million things and fly? That, in my mind, is the true spirit of general purpose robotics.

My Criteria for Reviewing Papers

2020-09-27T11:27:00.013-07:00

Xiaoyi Yin (尹肖贻) has kindly translated this post into Chinese (中文)

Accept-or-reject decisions for the NeurIPS 2020 conference are out, with 9454 submissions and 1900 accepted papers (20% acceptance rate). Congratulations to everyone (regardless of acceptance decision) for their hard work in doing good research!

It's common knowledge among machine learning (ML) researchers that acceptance decisions at NeurIPS and other conferences are something of a weighted dice roll. In this silly theatre we call "Academic Publishing" (a mostly disjoint concept from research, by the way), reviews are all over the place because each reviewer favors different things in ML papers. Here are some criteria that a reviewer might care about:

Correctness: This is the bare minimum for a scientific paper. Are the claims made in the paper scientifically correct? Did the authors take care not to train on the test set? If an algorithm was proposed, do the authors convincingly show that it works for the reasons they stated?

New Information: Your paper has to contribute new knowledge to the field. This can take the form of a new algorithm, or new experimental data, or even just a different way of explaining an existing concept. Even survey papers should contain some nuggets of new information, such as a holistic view unifying several independent works.

Proper Citations: a related work section that articulates connections to prior work and why your work is novel. Some reviewers will reject papers that don't tithe prior work adequately, or aren't sufficiently distinguished from it.

SOTA results: It's common to see reviewers demand that papers (1) propose a new algorithm and (2) achieve state-of-the-art (SOTA) on a benchmark.

More than "Just SOTA": No reviewer will penalize you for achieving SOTA, but some expect more than just beating the benchmark, such as one or more of the criteria in this list. Some reviewers go as far as to bash the "SOTA-chasing" culture of the field, which they deem to be "not very creative" and "incremental".

Simplicity: Many researchers profess to favor "simple ideas". However, the difference between "your simple idea" and "your trivial extension to someone else's simple idea" is not always so obvious.

Complexity: Some reviewers deem papers that don't present any new methods or fancy math proofs as "trivial" or "not rigorous".

Clarity & Understanding: Some reviewers care about the mechanistic details of proposed algorithms and furthering understanding of ML, not just achieving better results. This is closely related to "Correctness".

Is it "Exciting"?: Julian Togelius (AC for NeurIPS '20) mentions that many papers he chaired were simply not very exciting. Only Julian can know what he deems "exciting", but I suppose he means having "good taste" in choosing research problems and solutions.

Sufficiently Hard Problems: Some reviewers reject papers for evaluating on datasets that are too simple, like MNIST. "Sufficiently hard" is a moving goalpost, with the implicit expectation that as the field develops better methods the benchmarks have to get harder to push unsolved capabilities. Also, SOTA methods on simple benchmarks are not always SOTA on harder benchmarks that are closer to real world applications. Thankfully my most cited paper was written at a time when it was still acceptable to publish on MNIST.

Is it Surprising? Even if a paper demonstrates successful results, a reviewer might claim that they are unsurprising or "obvious". For example, papers applying standard object recognition techniques to a novel dataset might be argued to be "too easy and straightforward" given that the field expects supervised object recognition to be mostly solved (this is not really true, but the benchmarks don't reflect that).

I really enjoy papers that defy intuitions, and I personally strive to write surprising papers.

Some of my favorite papers in this category do not achieve SOTA or propose any new algorithms at all.

Is it Real? Closely related to "sufficiently hard problems". Some reviewers think that games are a good testbed to study RL, while others (typically from the classical robotics community) think that Mujoco Ant and a real robotic quadruped are entirely different problems; algorithmic comparisons on the former tell us nothing about the same set of experiments on the latter.

Does Your Work Align with Good AI Ethics? Some view the development of ML technology as a means to build a better society, and discourage papers that don't align with their AI ethics. The required "Broader Impact" statements in NeurIPS submissions this year are an indication that the field is taking this much more seriously. For example, if you submit a paper that attempts to infer criminality from only facial features or perform autonomous weapon targeting, I think it's likely your paper will be rejected regardless of what methods you develop.

Different reviewers will prioritize different aspects of the above, and many of these criteria are highly subjective (e.g. problem taste, ethics, simplicity). For each of the criteria above, it's possible to come up with counterexamples of highly-cited or impactful ML papers that don't meet that criterion but possibly meet others.

My Criteria

I wanted to share my criteria for how I review papers. When it comes to recommending accept/reject, I mostly care about Correctness and New Information. Even if I think your paper is boring and unlikely to be an actively researched topic in 10 years, I will vote to accept it as long as your paper helped me learn something new that I didn't think was already stated elsewhere.

Some more specific examples:

If you make a claim about humanlike exploration capabilities in RL in your introduction and then propose an algorithm to do something like that, I'd like to see substantial empirical justification that the algorithm is indeed similar to what humans do.
If your algorithm doesn't achieve SOTA, that's fine with me. But I would like to see a careful analysis of why your algorithm doesn't achieve it.
When papers propose new algorithms, I prefer to see that the algorithm is better than prior work. However, I will still vote to accept if the paper presents a factually correct analysis of why it doesn't do better than prior work.
If you claim that your new algorithm works better because of reason X, I would like to see experiments that show that it isn't because of alternate hypotheses X1, X2.

Correctness is difficult to verify. Many metric learning papers were proposed in the last 5 years and accepted at prestigious conferences, only for Musgrave et al. '20 to point out that the experimental methodologies of these papers were not consistent.

I should get off my high horse and say that I'm part of the circus too. I've reviewed papers for 10+ conferences and workshops and I can honestly say that I only understood 25% of papers from just reading them. An author puts in tens or hundreds of hours into designing and crafting a research paper and the experimental methodology, and I only put in a few hours in deciding whether it is "correct science". Rarely am I able to approach a paper with the level of mastery needed to rigorously evaluate correctness.

A good question to constantly ask yourself is: "what experiment would convince me that the results are due to the authors' explanation and not some alternate hypothesis? Did the authors check that hypothesis?"

I believe that we should accept all "adequate" papers, and more subjective things like "taste" and "simplicity" should be reserved for paper awards, spotlights, and oral presentations. I don't know if everyone should adopt these criteria, but I think it's helpful to at least be transparent as a reviewer on how I make accept/reject decisions.

Opportunities for Non-Traditional Researchers

If you're interested in getting mentorship for learning how to read, critique, and write papers better, I'd like to plug my weekly office hours, which I hold on Saturday mornings over Google Meet. I've been mentoring about 6 people regularly over the last 3 months and it's working out pretty well.

Anyone who is not in a traditional research background (not currently in an ML PhD program) can reach out to me to book an appointment. You can think of this like visiting your TA's office hours for help with your research work. Here are some of the services I can offer, completely pro bono:

If you have trouble understanding a paper I can try to read it with you and offer my thoughts on it as if I were reviewing it.
If you're very very new to the field and don't even know where to begin I can offer some starting exercises like reading / summarizing papers, re-producing existing papers, and so on.
I can try to help you develop a good taste of what kinds of problems to work on, how to de-risk ambitious ideas, and so on.
Advice on software engineering aspects of research. I've been coding for over 10 years; I've picked up some opinions on how to get things done quickly.
Asking questions about your work as if I was a visitor at your poster session.
Helping you craft a compelling story for a paper you want to write.

No experience is required; all that you need to bring to the table is a desire to become better at doing research. The acceptance rate for my office hours is literally 100% so don't be shy!

Chaos and Randomness

2020-09-13T11:17:00.006-07:00

For want of a nail the shoe was lost.
For want of a shoe the horse was lost.
For want of a horse the rider was lost.
For want of a rider the message was lost.
For want of a message the battle was lost.
For want of a battle the kingdom was lost.
And all for the want of a horseshoe nail.

- For Want of a Nail

Was the kingdom lost due to random chance? Or was it the inevitable outcome resulting from sensitive dependence on initial conditions? Does the difference even matter? Here is a blog post about Chaos and Randomness with Julia code.

Preliminaries

Consider a real vector space $X$ and a function $f: X \to X$ on that space. If we repeatedly apply $f$ to a starting vector $x_1$, we get a sequence of vectors known as an orbit $x_1, x_2, ... ,f^n(x_1)$.

For example, the logistic map is defined as

function logistic_map(r, x)
    r*x*(1-x)
end

Here is a plot of successive applications of the logistic map for r=3.25. We can see that the system settles into an oscillation between two values, ~0.495 and ~0.812. (At r=3.5 the orbit instead settles into a cycle of four values.)
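
To make the oscillation concrete, here is a minimal sketch in plain Python (rather than the Julia used in this post) that iterates the map and prints the tail of the orbit. The helper name `orbit` and the starting value 0.1 are assumptions chosen for illustration. The values ~0.495 and ~0.812 are the two points of the period-2 cycle at r=3.25; at r=3.5 the tail cycles through four values instead.

```python
def logistic_map(r, x):
    """One application of the logistic map."""
    return r * x * (1.0 - x)

def orbit(r, x0, n):
    """Return [x0, f(x0), ..., f^(n-1)(x0)] for the logistic map."""
    xs = [x0]
    for _ in range(n - 1):
        xs.append(logistic_map(r, xs[-1]))
    return xs

# Discard the transient and look only at the last few iterates.
tail_2cycle = orbit(3.25, 0.1, 2000)[-4:]
tail_4cycle = orbit(3.5, 0.1, 2000)[-4:]
print([round(x, 3) for x in tail_2cycle])
print([round(x, 3) for x in tail_4cycle])
```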

Definition of Chaos

There is surprisingly no universally accepted mathematical definition of Chaos. For now we will present a commonly used characterization by Devaney:

We can describe an orbit $x_1, x_2, ... ,f^n(x_1)$ as *chaotic* if:

1. The orbit is not asymptotically periodic, meaning that it never starts repeating, nor does it approach an orbit that repeats (e.g. $a, b, c, a, b, c, a, b, c...$).
2. The maximum Lyapunov exponent $\lambda$ is greater than 0. This means that if you place another trajectory starting near this orbit, the separation between the two grows roughly like $e^{\lambda n}$ after $n$ steps. A positive $\lambda$ implies that two trajectories will diverge exponentially quickly away from each other. If $\lambda<0$, then the distance between trajectories would shrink exponentially quickly. This is the basic definition of "Sensitive Dependence on Initial Conditions (SDIC)", also colloquially understood as the "butterfly effect".

Note that (1) intuitively follows from (2), because the Lyapunov exponent of an orbit that approaches a periodic orbit would be $<0$, which contradicts the SDIC condition.
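
As a rough numerical check of the Lyapunov condition, here is a minimal sketch in plain Python. It estimates $\lambda$ for the logistic map as the orbit average of $\log|f'(x)|$, with $f'(x) = r(1-2x)$. The function name, starting point, and iteration counts are assumptions for illustration, not part of any library.

```python
import math

def logistic_map(r, x):
    return r * x * (1.0 - x)

def lyapunov_exponent(r, x0=0.1, n_transient=1000, n_steps=20000):
    """Estimate the Lyapunov exponent as the orbit average of log|f'(x)|."""
    x = x0
    for _ in range(n_transient):
        x = logistic_map(r, x)
    total = 0.0
    for _ in range(n_steps):
        # f'(x) = r * (1 - 2x); the floor avoids log(0) if x lands exactly on 0.5.
        total += math.log(max(abs(r * (1.0 - 2.0 * x)), 1e-300))
        x = logistic_map(r, x)
    return total / n_steps

print(lyapunov_exponent(3.25))  # negative: nearby orbits converge onto the 2-cycle
print(lyapunov_exponent(4.0))   # positive: nearby orbits diverge
```

For r=4 the exact value is known to be $\ln 2$, which gives a convenient sanity check on the estimate.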

We can also define the map $f$ itself to be chaotic if there exists an invariant (trajectories cannot leave) subset $\tilde{X} \subset X$, where the following three conditions hold:

Sensitivity to Initial Conditions, as mentioned before.
Topological mixing (every point in orbits in $\tilde{X}$ approaches any other point in $\tilde{X}$).
Dense periodic orbits (every point in $\tilde{X}$ is arbitrarily close to a periodic orbit). At first, this is a bit of a head-scratcher given that we previously defined an orbit to be chaotic if it *didn't* approach a periodic orbit. The way to reconcile this is to think about the subspace $\tilde{X}$ being densely covered by periodic orbits, but they are all unstable so the chaotic orbits get bounced around $\tilde{X}$ for all eternity, never settling into an attractor but also unable to escape $\tilde{X}$.

Note that SDIC actually follows from the second two conditions. If these unstable periodic orbits cover the set $\tilde{X}$ densely and orbits also cover the set densely while not approaching the periodic ones, then intuitively the only way for this to happen is if all periodic orbits are unstable (SDIC).

These are by no means the only ways to define chaos. The DynamicalSystems.jl package has excellent documentation on several computationally tractable definitions of chaos.

Chaos in the Logistic Family

Incidentally, the logistic map exhibits chaos for most values of r between 3.56995 and 4.0. We can generate the bifurcation diagram quickly thanks to Julia's de-vectorized style of numeric programming.

using Plots  # assumed plotting package; provides plot() with seriestype = :scatter

rs = [2.8:0.01:3.3; 3.3:0.001:4.0]
x0s = 0.1:0.1:0.6
N = 2000 # orbit length
x = zeros(length(rs), length(x0s), N)
# for each value of r (across rows)
for k = 1:length(rs)
    # initialize starting conditions
    x[k, :, 1] = x0s
    # for each starting condition, iterate the map N-1 times
    for i = 1:length(x0s)
        for j = 1:N-1
            x[k, i, j+1] = logistic_map(rs[k], x[k, i, j])
        end
    end
end
plot(rs, x[:, :, end], markersize=2, seriestype = :scatter, title = "Bifurcation Diagram (Logistic Map)")

We can see how starting values y1=0.1, y2=0.2, ...y6=0.6 all converge to the same value, oscillate between two values, then start to bifurcate repeatedly until chaos emerges as we increase r.
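
The same progression can be read off numerically without a plot. Below is a minimal sketch in plain Python that counts how many distinct values an orbit visits once the transient has died out. The function name `attractor_size`, the rounding to 6 decimals, and the sample r values are assumptions for illustration.

```python
def logistic_map(r, x):
    return r * x * (1.0 - x)

def attractor_size(r, x0=0.1, n_transient=2000, n_samples=64, decimals=6):
    """Count the distinct values an orbit visits after the transient dies out."""
    x = x0
    for _ in range(n_transient):
        x = logistic_map(r, x)
    seen = set()
    for _ in range(n_samples):
        seen.add(round(x, decimals))
        x = logistic_map(r, x)
    return len(seen)

# One fixed point, then a 2-cycle, then a 4-cycle, then many values in the chaotic regime.
for r in (2.9, 3.25, 3.5, 3.9):
    print(r, attractor_size(r))
```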

Spatial Precision Error + Chaos = Randomness

What happens to our understanding of the dynamics of a chaotic system when we can only know the orbit values with some finite precision? For instance, x=0.76399 or x=0.7641 but we only observe x=0.764 in either case.

We can generate 1000 starting conditions that are identical up to our measurement precision, and observe the histogram of where the system ends up after n=1000 iterations of the logistic map.

Let's pretend this is a probabilistic system and ask the question: what are the conditional distributions of $p(x_n|x_0)$, where $n=1000$, for different levels of measurement precision?

Once the measurement error grows beyond roughly $O(10^{-8})$, we start to observe the entropy of the state evolution rapidly increasing. Even though we know that the underlying dynamics are deterministic, measurement uncertainty (a form of aleatoric uncertainty) can expand exponentially quickly due to SDIC. This results in $p(x_n|x_0)$ appearing to be a complicated probability distribution, even generating "long tails".
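
The experiment can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the post does not say which r was used, so r=3.9 is a guess at a chaotic setting, and `final_states` and `spread` are hypothetical helper names. Each sample is a starting condition that would be observed as the same value at the given measurement precision.

```python
import random

def logistic_map(r, x):
    return r * x * (1.0 - x)

def final_states(x_observed, precision, n_steps, r=3.9, n_samples=1000, seed=0):
    """Evolve starting points that are indistinguishable at the given precision."""
    rng = random.Random(seed)
    finals = []
    for _ in range(n_samples):
        # Uniform measurement error around the observed value (an assumption).
        x = x_observed + rng.uniform(-precision / 2, precision / 2)
        for _ in range(n_steps):
            x = logistic_map(r, x)
        finals.append(x)
    return finals

def spread(values):
    """Width of the range covered by the final states."""
    return max(values) - min(values)

# Coarser precision leaves a wider cloud of possible outcomes after the same number of steps.
for precision in (1e-3, 1e-8, 1e-14):
    print(precision, spread(final_states(0.764, precision, n_steps=50)))
```

Replacing `spread` with a histogram of the returned values gives an empirical picture of $p(x_n|x_0)$ for each precision level.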

I find it interesting that the "multi-modal, probabilistic" nature of $p(x_n|x_0)$ collapses to a simple uni-modal distribution when measurement precision is sufficiently high to mitigate chaotic effects for $n=1000$. In machine learning we concern ourselves with learning fairly rich probability distributions, even going as far as to learn transformations of simple distributions into more complicated ones.

But what if we are being over-zealous with using powerful function approximators to model $p(x_n|x_0)$? For cases like the above, we are discarding the inductive bias that $p(x_n|x_0)$ arises from a simple source of noise (uniform measurement error) coupled with a chaotic "noise amplifier". Classical chaos on top of measurement error will indeed produce Indeterminism, but does that mean we can get away with treating $p(x_n|x_0)$ as purely random?

I suspect the apparent complexity of many "rich" probability distributions we encounter in the wild is more often than not just chaos+measurement error (e.g. weather). If so, how can we leverage that knowledge to build more useful statistical learning algorithms and draw inferences?

We already know that chaos and randomness are nearly equivalent from the perspective of computational distinguishability. Did you know that you can use chaos to send secret messages? This is done by having Alice and Bob synchronize a chaotic system $x$ with the same initial state $x_0$, and then Alice sends a message $0.001*signal + x$. Bob merely evolves the chaotic system $x$ on his own and subtracts it to recover the signal. Chaos has also been used to design pseudo-random number generators.
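
Here is a toy sketch of the masking idea in plain Python, assuming both parties can reproduce the carrier exactly from a shared initial state. The logistic map with r=3.9 as the chaotic system, the message values, and the shared state are all hypothetical choices for illustration; this is not a secure scheme.

```python
def logistic_map(r, x):
    return r * x * (1.0 - x)

def chaotic_carrier(x0, n, r=3.9):
    """Generate n values of a chaotic orbit from the shared initial state x0."""
    xs, x = [], x0
    for _ in range(n):
        x = logistic_map(r, x)
        xs.append(x)
    return xs

SCALE = 0.001                        # the 0.001 factor from the text
signal = [1.0, 0.0, 1.0, 1.0, 0.0]   # hypothetical message
shared_x0 = 0.4242                   # hypothetical shared initial state

# Alice hides the small signal under the much larger chaotic carrier.
carrier_alice = chaotic_carrier(shared_x0, len(signal))
transmitted = [SCALE * s + x for s, x in zip(signal, carrier_alice)]

# Bob evolves the same system on his own and subtracts it.
carrier_bob = chaotic_carrier(shared_x0, len(signal))
recovered = [(y - x) / SCALE for y, x in zip(transmitted, carrier_bob)]
print([round(s) for s in recovered])
```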

Free Office Hours for Non-Traditional ML Researchers

2020-06-20T12:35:00.011-07:00

Xiaoyi Yin (尹肖贻) has kindly translated this post into Chinese (中文)

This post was prompted by a tweet I saw from my colleague, Colin:

I'm currently a researcher at Google with a "non-traditional background", where non-traditional background means "someone who doesn't have a PhD". People usually get PhDs so they can get hired for jobs that require that credential. In the case of AI/ML, this might be to become a professor at a university, or land a research scientist position at a place like Google, or sometimes even both.

At Google it's possible to become a researcher without having a PhD, although it's not very easy. There are two main paths [1]:

One path is to join an AI Residency Program, a fixed-term job at a non-university institution (FAANG companies, AI2, etc.) that aims to jump-start a research career in ML/AI. However, these residencies usually last just 1 year, which is not long enough to really "prove yourself" as a researcher.

Another path is to start as a software engineer (SWE) in an ML-focused team and build your colleagues' trust in your research abilities. This was the route I took: I joined Google in 2016 as a software engineer in the Google Brain Robotics team. Even though I was a SWE by title, it made sense to focus on the "most important problem", which was to think really hard about why the robots weren't doing what we wanted and train deep neural nets in an attempt to fix those problems. One research project led to another, and now I just do research + publications all the time.

As the ML/AI publishing field has grown exponentially in the last few years, it has gotten harder to break into research (see Colin's tweet). Top PhD programs like BAIR usually require students to have a publication at a top conference like ICML, ICLR, NeurIPS before they even apply. I'm pretty sure I would not have been accepted to any PhD programs if I were graduating from college today, and would have probably ended up taking a job offer in quantitative finance instead.

The uphill climb gets even steeper for aspiring researchers with non-traditional backgrounds; they are competing with no shortage of qualified PhD students. As Colin alludes to, it is also getting harder for internationals to work at American technology companies and learn from American schools, thanks to our administration's moronic leadership.

The supply-demand curves for ML/AI labor are getting quite distorted. On one hand, we have a tremendous global influx of people wanting to solve hard engineering problems and contribute to scientific knowledge and share it openly with the world. On the other hand, there seems to be a shortage of formal training:

1. A research mentor to learn the academic lingo and academic customs from, and more importantly, how to ask good questions and design experiments to answer them.
2. Company environments where software engineers are encouraged to take bold risks and lead their own research (and not just support researchers with infra).

Free Office Hours

I can't do much for (2) at the moment, but I can definitely help with (1). To that end, I'm offering free ML research mentorship to aspiring researchers from non-traditional backgrounds via email and video conferencing.

I'm most familiar with applied machine learning, robotics, and generative modeling, so I'm most qualified to offer technical advice in these areas. I have a bunch of tangential interests like quantitative finance, graphics, and neuroscience. Regardless of technical topic, I can help with academic writing and de-risking ambitious projects and choosing what problems to work on. I also want to broaden my horizons and learn more from you.

If you're interested in using this resource, send me an email at <myfirstname><mylastname><2004><at><g****.com>. In your email, include:

Your resume
What you want to get out of advising
A cool research idea you have in a couple sentences

Some more details on how these office hours will work:

1. Book weekly or bi-weekly Google Meet [2] calls to check up on your work and ask questions, with 15 minute time slots scheduled via Google Calendar.
2. The point of these office hours is not to answer "how do I get a job at Google Research", but to fulfill an advisor-like role in lieu of a PhD program. If you are further along in your research career we can discuss career paths and opportunities a little bit, but mostly I just want to help people with (1).
3. I'm probably not going to write code or run experiments for you.
4. I don't want to be that PI that slaps their name on all of their students' work - most advice I give will be given freely with no strings attached. If I make a significant contribution to your work or spend > O(10) hours working with you towards a publishable result, I may request being a co-author on a publication.
5. I reserve the right to decline meetings if I feel that it is not a productive use of my time or if other priorities take hold.
6. I cannot tell you about unpublished work that I'm working on at Google or any Google-confidential information.
7. I'm not offering ML consultation for businesses, so your research work has to be unrelated to your job.
8. To re-iterate point number 2 once more, I'm less interested in giving career advice and more interested in teaching you how to design experiments, how to cite and write papers, and how to communicate research effectively.

What do I get out of this? First, I get to expand my network. Second, I can only personally run so many experiments by myself so this would help me grow my own research career. Third, I think the supply of mentorship opportunities offered by academia is currently not scalable, and this is a bit of an experiment on my part to see if we can do better. I'd like to give aspiring researchers opportunities similar to the ones I had 4 years ago that allowed me to break into the field.

Footnotes

[1] Chris Olah has a great essay on some additional options and pros and cons of non-traditional education.

[2] Zoom complies with Chinese censorship requests, so as a statement of protest I avoid using Zoom when possible.

Three Questions that Keep Me Up at Night

2020-04-01T19:03:00.004-07:00

A Google interview candidate recently asked me: "What are three big science questions that keep you up at night?" This was a great question because one's answer reveals so much about one's intellectual interests - here are mine:

Q1: Can we imitate "thinking" from only observing behavior?

Suppose you have a large fleet of autonomous vehicles with human operators driving them around diverse road conditions. We can observe the decisions made by the human, and attempt to use imitation learning algorithms to map robot observations to the steering decisions that the human would take.

However, we can't observe what the homunculus is thinking directly. Humans read road text and other signage to interpret what they should and should not do. Humans plan more carefully when doing tricky maneuvers (parallel parking). Humans feel rage and drowsiness and translate those feelings into behavior.

Let's suppose we have a large car fleet and our dataset is so massive and perpetually growing that we cannot train on it faster than we are collecting new data. If we train a powerful black-box function approximator to learn the mapping from robot observation to human behavior [1], and we use active-learning techniques like DAgger to combat false negatives, will that be enough to acquire these latent information processing capabilities? Can the car learn to think like a human, and how much?

Inferring low-dimensional unobserved states from behavior is a well-studied technique in statistical modeling. In recent years, meta-reinforcement learning algorithms have increased the capability of agents to change their behavior in the presence of new information. However, no one has applied this principle to the scale and complexity of "human-level thinking and reasoning variables". If we use basic black-box function approximators (ConvNets, ResNets, Transformers, etc.), will it be enough? Or will it still fail even with a million lifetimes worth of driving data?

In other words, can simply predicting human behavior lead to a model that can learn to think like a human?

One cannot draw a hard line between "thinking" and "pattern matching", but loosely speaking I'd want to see such learned latent variables reflect basic deductive and inductive reasoning capabilities. For example, a logical proposition formulated as a steering problem: "Turn left if it is raining; right otherwise".

This could also be addressed via other high-data environments:

Observing trader orders on markets and seeing if we can recover the trader's deductive reasoning and beliefs about the future. See if we can observe rational thought (if not rational behavior).
Recovering intent and emotions and desire from social network activity.

Q2: What is the computationally cheapest "organic building block" of an Artificial Life simulation that could lead to human-level AGI?

Many AI researchers, myself included, believe that competitive survival of "living organisms" is the only true way to implement general intelligence.

If you lack some mental power like deductive reasoning, another agent might exploit the reality to its advantage to out-compete you for resources.

If you don't know how to grasp an object, you can't bring food to your mouth. Intelligence is not merely a byproduct of survival; I would even argue that it is Life and Death itself from which all semantic meaning we perceive in the world arises (the difference between a "stable grasp" and an "unstable grasp").

How does one realize an A-Life research agenda? It would be prohibitively expensive to implement large-scale evolution with real robots, because we don't know how to get robots to self-replicate as living organisms do. We could use synthetic biology technology, but we don't know how to write complex software for cells yet and even if we could, it would probably take billions of years for cells to evolve into big brains. A less messy compromise is to implement A-Life in silico and evolve thinking critters in there.

We'd want the simulation to be fast enough to simulate armies of critters. Warfare was a great driver of innovation. We also want the simulation to be rich and open-ended enough to allow for ecological niches and tradeoffs between mental and physical adaptations (a hand learning to grasp objects).

Therein lies the big question: if the goal is to replicate the billions of years of evolutionary progress leading up to where we are today, what are the basic pieces of the environment that would be just good enough?

Chemistry? Cells? Ribosomes? I certainly hope not.
How do nutrient cycles work? Resources need to be recycled from land to critters and back for there to be ecological change.
Is the discovery of fire important for evolutionary progression of intelligence? If so, do we need to simulate heat?
What about sound and acoustic waves?
Is a rigid-body simulation of MuJoCo humanoids enough? Probably not, if articulated hands end up being crucial.
Is Minecraft enough?
Does the mental substrate need to be embodied in the environment and subject to the physical laws of the reality? Our brains certainly are, but it would be bad if we had to simulate neural networks in MuJoCo.
Is conservation of energy important? If we are not careful, it can be possible through evolution for agents to harvest free energy from their environment.

In the short story Crystal Nights by Greg Egan, simulated "Crabs" are built up of organic blocks that they steal from other Crabs. Crabs "reproduce" by assembling a new crab out of parts, like LEGO. But the short story left me wanting for more implementation details...

Q3: Loschmidt's Paradox and What Gives Rise to Time?

I recently read The Order of Time by Carlo Rovelli and being a complete Physics newbie, finished the book feeling more confused and mystified than when I had started.

The second law of thermodynamics, $\Delta{S} \geq 0$, states that the entropy of an isolated system never decreases with time. That is the only physical law that requires time to "flow" forwards; all other physical laws have Time-Symmetry: they hold even if time were flowing backwards. In other words, T-Symmetry in a physical system implies conservation of entropy.

Microscopic phenomena (laws of mechanics on position, acceleration, force, electric field, Maxwell's equations) exhibit T-Symmetry. Macroscopic phenomena (gases dispersing in a room, people going about their lives), on the other hand, are T-Asymmetric. It is perhaps an adaptation to macroscopic reality being T-Asymmetric that our conscious experience itself has evolved to become aware of time passing. Perhaps bacteria do not need to know about time...

But if macroscopic phenomena are comprised of nothing more than countless microscopic phenomena, where the heck does entropy really come from?

Upon further Googling, I learned that this question is known as Loschmidt's Paradox. One resolution that I'm partially satisfied with is to consider that if we take all microscopic collisions to be driven by QM, then there really is no such thing as "T-symmetric" interactions, and thus microscopic interactions are actually T-asymmetric. A lot of the math becomes simpler to analyze if we consider a single pair of particles obeying randomized dynamics (whereas in Statistical Mechanics we are only allowed to assume that about a population of particles).

Even if we accept that macroscopic time originates from a microscopic equivalent of entropy, this still raises the question of what the origin of microscopic entropy (time) is.

Unfortunately, many words in English do not help to divorce my subjective, casual understanding of time from a more precise, formal understanding. Whenever I think of microscopic phenomena somehow "causing" macroscopic phenomena or the cause of time (entropy) "increasing", my head gets thrown for a loop. So much T-asymmetry is baked into our language!

I'd love to know of resources to gain a complete understanding of what we know and don't know, and perhaps a new language to think about Causality from a physics perspective.

If you have thoughts on these questions, or want to share your own big science questions that keep you up at night, let me know in the comments or on Twitter! #3sciencequestions

Selected Quotes from "The Dark Ages of AI Panel Discussion"

2019-12-25T20:10:00.004-08:00

In 1984, a panel at the AAAI conference discussed whether the field was approaching an "AI Winter". Mitch Waldrop wrote a transcript of the discussion, and much of it reads exactly like something written 35 years into the future.

Below are some quotes from the transcript that I found impressive, as they describe the feelings of many an AI researcher today and how the public views AI, despite all the advances in computing and software since 1984. 👇

"People make essentially no distinction between computers, broadly defined, and Artificial Intelligence... as far as they're concerned, there is no difference; they're just worried about the impact of very capable, smart computers" - Mitch Waldrop

"The computer is not only a mythic emblem for this bright, high-technology future, it's a mythic symbol for much of the anxiety that people have about their own society." - Mitch Waldrop

"A second anxiety, what you might call the 'Frankenstein Anxiety', is the fear of being replaced, of becoming superfluous..." - Mitch Waldrop

"Modern Times Anxiety: People becoming somehow, because of computers, just a cog in the vast, faceless machine; the strong sense of helplessness, that we really have no control over our lives" - Mitch Waldrop

"The problem is not a matter of imminent deadlines or lack of space or lack of time... the real problem is that what reporters see as real issues in the world are very different from what the AI community sees as real issues." - Mitch Waldrop

"If we expect physicists to be concerned about arms control and chemists to be concerned about toxic waste, it's probably reasonable to expect AI people to be concerned about the human impact of these technologies" - Mitch Waldrop

"It [Doomsday] is already here. There is no content in this conference" - Bob Wilensky

"What I heard was that only completed scientific work was going to be accepted. This is a horrible concept - no new unformed ideas, no incremental work building on previous work" - Roger Schank

"When I first got into this field twenty years ago, I used to explain to people what I did, and they would already say, 'you mean computers can't do that already?' They'll always believe that." - Roger Schank

"Big business has a very serious role in this country. Among other things, they get to determine what's 'in' and what's 'out' in the government." - Roger Schank

"I got scared when big business started getting into this - Schlumberger, Xerox, HP, Texas Instruments, GTE, Amoco, Exxon, they were all making investments - they all have AI groups. And you find out that those people weren't trained in AI." - Roger Schank

"It's easier to go into a startup... [or] a big company... than to go into a university and try to organize an AI lab, which is just as hard to do now as it ever was. But if we don't do that, we will find that we are in the 'Dark Ages' of AI" - Roger Schank

"The first [message] is incumbent upon AI because we have promised so much, to produce. We must produce working systems. Some of you must devote yourselves to doing that. It is also the case that some of you had better commit to doing science." - Roger Schank

"If it turns out that our AI conference isn't the place to discuss science, then we better start finding a place where we can discuss science, because this show for all the venture capitalists is very nice." - Roger Schank

"the notion of cognition as computation is going to have extraordinary importance to the philosophy and psychology of the next generation. And for well or ill, this notion has affected some of the deepest aspects of our self-image." - B. Chandrasekaran

"symbol-level theories, which may even be right, are being mistaken for knowledge-level theories" - B. Chandrasekaran

"My hope is that AI will evolve more like biotech in the sense that certain technologies will be spun off, and researchers will remain and extremely interesting progress will be made" - B. Chandrasekaran

"I have encountered people who have a science fiction view of the world and think that computers now can do just about anything... these people have a feeling that computers can do wonderful things, but if you ask them how exactly could an AI program help in work, they don't have the sense that within a week or two they could be replaced or that computers can come in and do a much better job than they do in work." - John McDermott

"There have been a number of technologies that have run into dead ends, like dirigibles and external combustion engines. And there have been other ones, like television, and in fact, the telephone system itself, which took between twenty and forty years to go from being laboratory possibilities to actual commercial successes. Do you really think that AI is going to become a commercial success in the next 10-15 years?" - Audience member

"They [lay people] seem to have a vague idea that great things can happen, have sublime confidence... but when it gets down to the nitty-gritty, they tend to be pretty unimaginative and have pretty low expectations as to what can be done." - Mitch Waldrop

"It seems that academic AI people tend to blame everyone but themselves when it comes to problems of AI in terms of relationship to the general society." - Audience member

Differentiable Path Tracing on the GPU/TPU

2019-11-28T21:50:00.001-08:00

You can download a PDF (typeset in LaTeX) of this blog post here.

Jupyter Notebook Code on GitHub: https://github.com/ericjang/pt-jax

This blog post is a tutorial on implementing path tracing, a physically-based rendering algorithm, in JAX. This code runs on the CPU, GPU, and Google Cloud TPU, and is implemented in a way that also makes it end-to-end differentiable. You can compute gradients of the rendered pixels with respect to geometry, materials, whatever your heart desires.

I love JAX because it is equally suited for pedagogy and high-performance computing. We will implement a path tracer for a single pixel in numpy-like syntax, slap a jax.vmap operator on it, and JAX automatically converts our code to render multiple pixels with SIMD instructions! You can do the same thing for multiple devices using jax.pmap. If that isn't magic, I don't know what is. At the end of the tutorial you will not only know how to render a Cornell Box, but also understand geometric optics and radiometry from first principles.

The figure below, borrowed from a previous post from this blog, explains at a high level the light simulator we're about to implement:

I divide this tutorial into two parts: 1) implementing geometry-related functions like ray-scene intersection and normal estimation, and 2) the "light transport" part where we discuss how to accumulate radiance arriving at an imaginary camera sensor.

JAX and Matplotlib (and a bit of calculus and probability) are the only required dependencies for this tutorial:

import jax.numpy as np
from jax import jit, grad, vmap, random, lax
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

JAX is essentially a drop-in replacement for numpy, with the exception that operations are all functional (no indexing assignment) and the user must manually pass around an explicit rng_key to generate random numbers. Here is a short list of JAX gotchas if you are coming to JAX as a numpy user.

Part I: Geometry

The vast majority of rendering software represents scene geometry as a collection of surface primitives that form "meshes". 3D modeling software forms meshes using quadrilateral faces, and then the rendering software converts the quads to triangles under the hood. Collections of meshes are composed together to form entire objects and scenes. For this tutorial we're going to use an unorthodox geometry representation and we'll need to implement a few helper functions to manipulate it.

Differentiable Scene Intersection with Distance Fields

Rendering requires computing intersection points $y$ in the scene with ray $\omega_i$, and usually involves traversing a highly-optimized spatial data structure called a bounding volume hierarchy (BVH). $y$ can be expressed as a parametric equation of the origin point $x$ and raytracing direction $\omega_i$, and the goal is to find the distance $t$:

$y = x + t \cdot \omega_i$

There is usually a lot of branching logic in BVH traversal algorithms, which makes it harder to implement efficiently on accelerator hardware like GPUs and TPUs. Instead, let's use raymarching on signed distance fields to find the intersection point $y$. I first learned of this geometry modeling technique when Inigo "IQ" Quilez, a veritable wizard of graphics programming, gave a live coding demo at Pixar about how he modeled vegetation in the "Brave" movie. Raymarching is the primary technique used by the ShaderToy.com community to implement cool 3D movies using only instructions available to WebGL fragment shaders.

A signed distance field over position $p$ specifies "the distance you can move in any direction without coming into contact with the object". For example, here is the signed distance field for a plane that passes through the origin and is perpendicular to the y-axis.

def sdFloor(p):
    # p is an array [x, y, z]; its height above the plane y = 0 is the distance.
    return p[1]

To find the intersection distance $t$, the raymarching algorithm iteratively increments $t$ by step sizes equal to the signed distance field of the scene (so we never pass through an object). This iteration happens until $t$ "leaves the scene" or the distance field shrinks to zero (we have collided with an object). For the plane distance, we see from the diagram below that stepping forward using the distance field allows us to get arbitrarily close to the plane without ever passing through it.

def raymarch(ro, rd, sdf_fn, max_steps=10):
    t = 0.0
    for i in range(max_steps):
        p = ro + t*rd
        t = t + sdf_fn(p)
    return t
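
To see the convergence concretely, here is a dependency-free sketch in plain Python (tuples instead of JAX arrays, which is a substitution made only for this illustration). The ray origin and direction are assumed values. A ray starting one unit above the floor and heading down at 45 degrees should hit the plane at distance $\sqrt{2}$, and the marched $t$ approaches that value from below.

```python
import math

def sd_floor(p):
    """Signed distance to the plane y = 0 (p is an (x, y, z) tuple)."""
    return p[1]

def raymarch(ro, rd, sdf_fn, max_steps=10):
    """Advance t along the ray ro + t*rd by the distance field at each step."""
    t = 0.0
    for _ in range(max_steps):
        p = tuple(o + t * d for o, d in zip(ro, rd))
        t = t + sdf_fn(p)
    return t

ro = (0.0, 1.0, 0.0)                  # one unit above the floor
inv_sqrt2 = 1.0 / math.sqrt(2.0)
rd = (inv_sqrt2, -inv_sqrt2, 0.0)     # unit vector pointing 45 degrees downward
t_hit = raymarch(ro, rd, sd_floor)
print(t_hit, math.sqrt(2.0))
```

Each step closes a fixed fraction of the remaining gap, so with only 10 steps the estimate is already within about 1e-5 of the true distance.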

Signed distance fields combined with raymarching have a number of nice mathematical properties. The most important one is that unlike analytical ray-shape intersection, raymarching does not require re-deriving an analytical solution for intersecting points for every primitive shape we wish to add to the scene. Triangles are also general, but they require a lot of memory to store expressive scenes. In my opinion, signed distance fields strike a good balance between memory budget and geometric expressiveness.

Similar to ResNet architectures in Deep Learning, the raymarching algorithm is a form of "unrolled iterative inference" of the same signed distance field. If we are trying to differentiate through the signed distance function (for instance, trying to approximate it with a neural network), this representation may be favorable to gradient descent algorithms.

Building Up Our Scene

The first step is to implement the signed distance field for the scene of interest. The naming and programming conventions in this tutorial are heavily inspired by the stylistic conventions of the ShaderToy demoscene community. One such convention is to define hard-coded enums for each object, so we can associate intersection points with their nearest object. The values are arbitrary; you can substitute them with your favorite numbers if you like.

OBJ_NONE=0.0
OBJ_FLOOR=0.1
OBJ_CEIL=.2
OBJ_WALL_RD=.3
OBJ_WALL_WH=.4
OBJ_WALL_GR=.5
OBJ_SHORT_BLOCK=.6
OBJ_TALL_BLOCK=.7
OBJ_LIGHT=1.0
OBJ_SPHERE=0.9

Computing a ray-scene intersection should therefore return an object id and an associated distance, for which we define a helper function to zip up those two numbers.

def df(obj_id, dist):
    return np.array([obj_id, dist])

Next, we'll define the distance field for a box (source: https://www.iquilezles.org/www/articles/distfunctions/distfunctions.htm).

def udBox(p, b):
    # b = half-widths
    return length(np.maximum(np.abs(p)-b,0.0))
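To build some intuition for what udBox returns, here is a small standard-library sketch of the same formula evaluated at three points. The name `ud_box` and the test points are my own illustrative choices; note that the distance is unsigned, so it is zero everywhere inside the box.

```python
import math

def ud_box(p, b):
    # Unsigned distance from point p to an origin-centered box with half-widths b.
    q = [max(abs(pi) - bi, 0.0) for pi, bi in zip(p, b)]
    return math.sqrt(sum(qi * qi for qi in q))

half = (1.0, 1.0, 1.0)
inside = ud_box((0.2, -0.3, 0.5), half)  # inside the box: 0.0
face = ud_box((3.0, 0.0, 0.0), half)     # nearest feature is the +x face
corner = ud_box((2.0, 2.0, 2.0), half)   # nearest feature is a corner
print(inside, face, corner)
```

The face case measures distance along a single axis, while the corner case picks up a contribution from all three axes, which is why the clamped per-axis offsets are combined with a Euclidean length.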

Rotating, translating, and scaling an object described by a signed distance field is done by applying the inverse transformation to the input point of the distance function. For example, if we want to rotate one of the boxes in the scene by an angle of $\theta$, we rotate its argument $p$ by $-\theta$ instead.

def rotateX(p,a):
    # We won't be using rotateX for this tutorial.
    c = np.cos(a); s = np.sin(a)
    px,py,pz=p[0],p[1],p[2]
    return np.array([px,c*py-s*pz,s*py+c*pz])

def rotateY(p,a):
    c = np.cos(a); s = np.sin(a)
    px,py,pz=p[0],p[1],p[2]
    return np.array([c*px+s*pz,py,-s*px+c*pz])

def rotateZ(p,a):
    c = np.cos(a); s = np.sin(a)
    px,py,pz=p[0],p[1],p[2]
    return np.array([c*px-s*py,s*px+c*py,pz])
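The inverse-transform trick is easy to get backwards, so here is a minimal sketch that checks it on a unit box rotated by 45 degrees about the Y axis. It reuses the rotateY convention from above in plain Python; `rotated_box` and the query points are hypothetical names and values chosen for illustration.

```python
import math

def rotate_y(p, a):
    # Same convention as rotateY in the tutorial.
    c, s = math.cos(a), math.sin(a)
    px, py, pz = p
    return (c * px + s * pz, py, -s * px + c * pz)

def ud_box(p, b):
    q = [max(abs(pi) - bi, 0.0) for pi, bi in zip(p, b)]
    return math.sqrt(sum(qi * qi for qi in q))

def rotated_box(p, theta):
    # Rotate the box by +theta by rotating the query point by -theta.
    return ud_box(rotate_y(p, -theta), (1.0, 1.0, 1.0))

theta = math.pi / 4
# After the rotation, the box corner (1, 0, 1) sits at (sqrt(2), 0, 0).
on_corner = rotated_box((math.sqrt(2.0), 0.0, 0.0), theta)
outside = rotated_box((2.0, 0.0, 0.0), theta)
print(on_corner, outside)
```

The first query lands on the rotated corner, so its distance is zero up to rounding, and the second is the remaining gap of $2 - \sqrt{2}$ along the x-axis.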

Another cool property of signed distance fields is that you can compute the union of two solids with a simple np.minimum operation. By the definition of a distance field, if you take a step size equal to the smaller of the two distances in any direction, you are still guaranteed not to intersect with anything. The following method, short for "Union Operation", joins two distance fields by comparing their distance property.

def opU(a,b):
    if a[1] < b[1]:
        return a
    else:
        return b

Unfortunately, the JAX compiler complains when combining both grad and jit operators through conditional logic like the one above. So we need to write things a little differently to preserve differentiability:

def opU(a,b):
    condition = np.tile(a[1,None]<b[1,None], [2])
    return np.where(condition, a, b)
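To see how the union carries object ids along with distances, here is a small sketch of a two-object scene (a floor and a sphere). It uses the plain-Python conditional form of opU rather than the np.where form needed for JAX, and `sd_sphere` is a hypothetical helper that does not appear in the tutorial code.

```python
import math

OBJ_FLOOR = 0.1   # same arbitrary ids as in the tutorial
OBJ_SPHERE = 0.9

def df(obj_id, dist):
    return (obj_id, dist)

def op_u(a, b):
    # Union: keep whichever (id, distance) pair is closer.
    return a if a[1] < b[1] else b

def sd_sphere(p, center, radius):
    # Hypothetical helper: signed distance to a sphere.
    return math.dist(p, center) - radius

def sd_scene(p):
    res = df(OBJ_FLOOR, p[1])
    res = op_u(res, df(OBJ_SPHERE, sd_sphere(p, (0.0, 1.0, 0.0), 0.5)))
    return res

near_floor = sd_scene((3.0, 0.1, 0.0))   # far from the sphere, just above the floor
near_sphere = sd_scene((0.0, 1.6, 0.0))  # just above the top of the sphere
print(near_floor, near_sphere)
```

The returned id tells the renderer which material to look up, while the returned distance is the safe step size for raymarching.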

Now we have all the requisite pieces to build the signed distance field for the Cornell Box, which we call sdScene. Recall from the previous section that the distance field for an axis-aligned plane is just the height along that axis. We can use this principle to build infinite planes that comprise the walls, floor, and ceiling of the Cornell Box.

def sdScene(p):
    # p is [3,]
    px,py,pz=p[0],p[1],p[2]
    # floor
    obj_floor = df(OBJ_FLOOR, py) # py = distance from y=0
    res = obj_floor
    # ceiling
    obj_ceil = df(OBJ_CEIL, 4.-py)
    res = opU(res,obj_ceil)
    # backwall
    obj_bwall = df(OBJ_WALL_WH, 4.-pz)
    res = opU(res,obj_bwall)
    # leftwall
    obj_lwall = df(OBJ_WALL_RD, px-(-2))
    res = opU(res,obj_lwall)
    # rightwall
    obj_rwall = df(OBJ_WALL_GR, 2-px)
    res = opU(res,obj_rwall)
    # light
    obj_light = df(OBJ_LIGHT, udBox(p - np.array([0,3.9,2]), np.array([.5,.01,.5])))
    res = opU(res,obj_light)
    # tall block
    bh = 1.3
    p2 = rotateY(p- np.array([-.64,bh,2.6]),.15*np.pi)
    d = udBox(p2, np.array([.6,bh,.6]))
    obj_tall_block = df(OBJ_TALL_BLOCK, d)
    res = opU(res,obj_tall_block)
    # short block
    bw = .6
    p2 = rotateY(p- np.array([.65,bw,1.7]),-.1*np.pi)
    d = udBox(p2, np.array([bw,bw,bw]))
    obj_short_block = df(OBJ_SHORT_BLOCK, d)
    res = opU(res,obj_short_block)
    return res

Notice that we model the light source on the ceiling as a thin rectangular prism with half-widths $(0.5, 0.5)$ along the x and z axes. All numbers are expressed in SI units, so this implies a 1 meter x 1 meter light and a 4m x 4m Cornell Box (this is a big scene!). The size of the light will become relevant later when we compute quantities like emitted radiance.

Computing Surface Normals

In rendering we frequently need to compute the normals of geometric surfaces. In ShaderToy programs, the most common algorithm for computing normals is to take a finite-difference approximation of the gradient of the distance field, $\nabla_p d(p)$, and then normalize that vector to obtain an approximate normal.

def calcNormalFiniteDifference(p):
    # derivative approximation via central differences
    eps = 0.001
    dx=np.array([eps,0,0])
    dy=np.array([0,eps,0])
    dz=np.array([0,0,eps])
    # extract just the distance component (index 1) of each sdScene result
    nor = np.array([
        sdScene(p+dx)[1] - sdScene(p-dx)[1],
        sdScene(p+dy)[1] - sdScene(p-dy)[1],
        sdScene(p+dz)[1] - sdScene(p-dz)[1],
    ])
    return normalize(nor)

Note that this requires six separate evaluations of the sdScene function! As it turns out, JAX can give us analytical normals basically for free via its auto-differentiation capabilities. The backward pass has the same computational complexity as the forward pass, so a single forward-and-backward evaluation replaces the six forward evaluations, making autodiff gradients several times cheaper than finite-differencing. Neat!

def dist(p):
    # return the distance-component only
    return sdScene(p)[1]

def calcNormalWithAutograd(p):
    return normalize(grad(dist)(p))
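If you want to convince yourself that the two approaches agree, a sphere is a convenient test case because its exact gradient is known in closed form, which stands in here for what grad would return. The sketch below is plain Python with hypothetical names (`sd_sphere`, `normal_fd`) and compares the finite-difference normal against the exact one.

```python
import math

def sd_sphere(p):
    # Signed distance to a unit sphere at the origin (illustrative scene).
    return math.sqrt(sum(x * x for x in p)) - 1.0

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

def normal_fd(sdf, p, eps=1e-3):
    # Central differences: two SDF evaluations per axis, six in total.
    g = []
    for axis in range(3):
        hi = list(p)
        lo = list(p)
        hi[axis] += eps
        lo[axis] -= eps
        g.append(sdf(hi) - sdf(lo))
    return normalize(g)

p = (0.6, 0.0, 0.8)        # a point on the unit sphere
n_fd = normal_fd(sd_sphere, p)
n_exact = normalize(p)     # for a sphere, the gradient is p / |p|
err = max(abs(a - b) for a, b in zip(n_fd, n_exact))
print(n_fd, err)
```

The finite-difference normal carries a small truncation error that depends on eps, whereas a gradient obtained by differentiation has no such step-size parameter to tune.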

Cosine-Weighted Sampling

We also require the ability to sample scattering rays around some local surface normal, for when we choose recursive rays to scatter. All the objects in the scene are assigned "Lambertian BRDFs", which means that they are matte in reflectance properties and their apparent brightness to an observer is the same regardless of viewing angle. For Lambertian materials, it is much more effective to sample from a cosine-weighted distribution because it allows two cosine-related probability terms (from the sampling and from the BRDF) to cancel out. The motivation for this will become apparent in Part II of the tutorial, but here is the code up front.

def sampleCosineWeightedHemisphere(rng_key, n):
    rng_key, subkey = random.split(rng_key)
    u = random.uniform(subkey,shape=(2,),minval=0,maxval=1)
    u1, u2 = u[0], u[1]
    uu = normalize(np.cross(n, np.array([0.,1.,1.])))
    vv = np.cross(uu,n)
    ra = np.sqrt(u2)
    rx = ra*np.cos(2*np.pi*u1)
    ry = ra*np.sin(2*np.pi*u1)
    rz = np.sqrt(1.-u2)
    rr = rx*uu+ry*vv+rz*n
    return normalize(rr)

Here's a quick 3D visualization to see whether our implementation is doing something reasonable:

from mpl_toolkits.mplot3d import Axes3D
nor = normalize(np.array([[1.,1.,0.]]))
nor = np.tile(nor,[1000,1])
rng_key = random.split(RNG_KEY, 1000)
rd = vmap(sampleCosineWeightedHemisphere)(rng_key, nor)
fig = plt.figure()
ax = fig.add_subplot(121, projection='3d')
ax.scatter(rd[:,0],rd[:,2],rd[:,1])
ax = fig.add_subplot(122)
ax.scatter(rd[:,0],rd[:,1])
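A numerical check complements the scatter plot. For a cosine-weighted distribution the expected value of $\cos \theta$ (the component of the sample along the normal) is $2/3$, versus $1/2$ for uniform hemisphere sampling. The sketch below only samples that normal component, using Python's standard random module instead of JAX's RNG; the function names are illustrative.

```python
import math
import random

def sample_cos_theta_cosine_weighted(rng):
    # With u ~ U[0, 1), rz = sqrt(1 - u) is the component along the normal
    # (the same rz as in sampleCosineWeightedHemisphere).
    return math.sqrt(1.0 - rng.random())

def sample_cos_theta_uniform(rng):
    # For directions uniform on the hemisphere, cos(theta) is itself uniform.
    return rng.random()

rng = random.Random(0)
n = 20000
mean_cosine = sum(sample_cos_theta_cosine_weighted(rng) for _ in range(n)) / n
mean_uniform = sum(sample_cos_theta_uniform(rng) for _ in range(n)) / n
print(mean_cosine, mean_uniform)  # expected values are 2/3 and 1/2
```

The higher mean confirms that cosine-weighted samples cluster around the normal, which is exactly where a Lambertian surface receives most of its light.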

Camera Model

For each pixel we want to render, we need to associate it with a ray direction rd and a ray origin ro. The most basic camera model for computer graphics is a pinhole camera, shown below:

The following code sets up a pinhole camera with focal distance of 2.2 meters:

N=150 # width of image plane in pixels
xs=np.linspace(0,1,N) # N pixel coordinates in [0, 1]
us,vs = np.meshgrid(xs,xs)
uv = np.vstack([us.flatten(),vs.flatten()]).T
# normalize pixel locations to -1,1
p = np.concatenate([-1+2*uv, np.zeros((N*N,1))], axis=1)
# Render a pinhole camera.
eye = np.tile(np.array([0,2.,-3.5]),[p.shape[0],1])
look = np.array([[0,2.0,0]]) # look straight ahead
w = vmap(normalize)(look - eye)
up = np.array([[0,1,0]]) # up axis of world
u = vmap(normalize)(np.cross(w,up))
v = vmap(normalize)(np.cross(u,w))
d=2.2 # focal distance
rd = vmap(normalize)(p[:,0,None]*u + p[:,1,None]*v + d*w)
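The vectorized code above can be hard to read at a glance, so here is the same construction for a single pixel in plain Python. `pinhole_ray` is a hypothetical helper name, and the eye, look, and focal distance values are copied from the block above.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def pinhole_ray(px, py, eye, look, up=(0.0, 1.0, 0.0), d=2.2):
    # px, py are image-plane coordinates in [-1, 1].
    w = normalize(tuple(a - b for a, b in zip(look, eye)))  # forward axis
    u = normalize(cross(w, up))                             # horizontal axis
    v = normalize(cross(u, w))                              # vertical axis
    return normalize(tuple(px * ui + py * vi + d * wi
                           for ui, vi, wi in zip(u, v, w)))

eye, look = (0.0, 2.0, -3.5), (0.0, 2.0, 0.0)
center = pinhole_ray(0.0, 0.0, eye, look)  # ray through the image center
top = pinhole_ray(0.0, 1.0, eye, look)     # ray through the top edge
print(center, top)
```

The center ray points straight along the forward axis, and the top-edge ray tilts upward with slope $1/d$, so a larger focal distance gives a narrower field of view.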

If you wanted to render an orthographic projection, you can simply set all ray direction values to point straight forward along the Z-axis, instead of all originating from the same eye point: rd = np.array([0, 0, 1]).

N=150 # width of image plane in pixels
xs=np.linspace(0,1,N) # N pixel coordinates in [0, 1]
us,vs = np.meshgrid(xs,xs)
us = (2*us-1)
vs *= 2
uv = np.vstack([us.flatten(),vs.flatten()]).T # NxN image grid
eye = np.concatenate([uv, np.zeros((N*N,1))], axis=1)*2
rd = np.zeros_like(eye) + np.array([[0, 0, 1]])

An orthographic camera is what happens when you stretch the focal distance to infinity. That will yield an image like this:

Part II: Light Simulation

With our scene defined and basic geometric functions set up, we can finally get to the fun part of implementing light transport. This part of the tutorial is agnostic to the geometry representation described in Part I, so you can actually follow along with whatever programming language and geometry representation you like (raymarching, triangles, etc).

Radiometry From First Principles

Before we learn the path tracing algorithm, it is illuminating to first understand the underlying physical phenomena being simulated. Radiometry is a mathematical framework for measuring electromagnetic radiation. Not only can it be used to render pretty pictures, but it can also be used to understand heat and energy propagated in straight lines within closed systems (e.g. blackbody radiation). What we are ultimately interested in are human perceptual color quantities, but to get them first we will simulate the physical quantities (Watts) and then convert them to lumens and RGB values.

This section borrows some figures from the PBRT webpage on Radiometry. I highly recommend reading that page before proceeding, but I also summarize the main points you need to know here.

You can actually derive the laws of radiometry from first principles, using only the principle of conservation of energy: within a closed system, the total amount of energy being emitted is equal to the total amount of energy being absorbed.

Consider a small sphere of radius $r$ emitting 60 Watts of electromagnetic power into a larger enclosing sphere of radius $R$. We know that the bigger sphere must be absorbing 60 Watts of energy, but because it has a larger surface area ($4\pi R^2$), the incoming energy density per unit area is a factor of $\frac{R^2}{r^2}$ smaller.

We call this "area density of flux" irradiance (abbreviated $E$) if it is arriving at a surface, and radiant exitance (abbreviated $M$) if it is leaving a surface. The SI unit for these quantities is Watts per square meter.

Figure Source: http://www.pbr-book.org/3ed-2018/Color_and_Radiometry/Radiometry.html
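To make the two-sphere example concrete, here is the arithmetic as a short script. The radii are assumed values picked for illustration; only the 60 Watt figure comes from the example above.

```python
import math

power = 60.0      # watts emitted by the inner sphere
r, R = 0.1, 2.0   # illustrative radii in meters (assumed values)

exitance = power / (4.0 * math.pi * r ** 2)    # M, leaving the small sphere
irradiance = power / (4.0 * math.pi * R ** 2)  # E, arriving at the big sphere
print(exitance, irradiance, exitance / irradiance)
```

The ratio of the two densities is $\frac{R^2}{r^2}$ regardless of the power, which is the familiar inverse-square falloff.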

Now let's consider a slightly different scene in the figure below, where a small flat surface with area $A$ emits a straight beam of light onto the floor. On the left, the emitting and receiving surfaces have the same area, $A = A_1$, so the irradiance equals radiant exitance $E = M$. On the right, the beam of light shines on the floor at an angle $\theta$, which causes the projection $A_2$ to be larger. Calculus and trigonometry tell us that as we shrink the area $A \to 0$, the area of the projected light $A_2$ approaches $\frac{A}{\cos \theta}$. Because flux must be conserved, the irradiance of $A_2$ must be $E = M \cos \theta$, where $\theta$ is the angle between the surface normal and light direction. This is known as "Lambert's Law".

Figure Source: http://www.pbr-book.org/3ed-2018/Color_and_Radiometry/Radiometry.html

In the above examples, the scenes were simple or symmetric enough that we did not have to think about what direction light is coming from when computing the irradiance of a surface. However, if we want to simulate light in a complex scene, we will need to compute irradiance by integrating light over many possible directions. For non-transparent surfaces, this set of directions forms a hemisphere surrounding the point of interest, oriented along the surface normal.

Radiance extends the measure of irradiance to also depend on the solid angle of incident light. Solid angles are just extensions of 2D angles to 3D spheres (and hemispheres). You can recover irradiance and power from radiance by integrating over solid angle and then over area, respectively:

Radiance $L = \frac{\partial^2 \Phi}{\partial \Omega \, \partial A \cos \theta}$ measures flux per unit projected area $A \cos \theta$ per unit solid angle $\Omega$ (Figure 5.10).
Irradiance $E = \frac{\partial \Phi}{\partial A}$ is the integral of radiance, weighted by $\cos \theta$, over solid angles $\Omega$.
Power $\Phi$ is the integral of irradiance over area $A$.

A nice property of radiance is that it is conserved along rays through empty space. We have the incoming radiance $L_i$ from direction $\omega_i$ to point $x$ equal to the outgoing radiance $L_o$ from some other point $y$, in the reverse direction $-\omega_i$. $y$ is the intersection of origin $x$ along ray $\omega_i$ with the scene geometry.

$ L_i(x, \omega_i) = L_o(y, -\omega_i) $

It's important to note that although incoming and outgoing radiance are conserved along empty space, we still need to respect Lambert's Law when computing an irradiance at a surface.

Different Ways to Integrate Radiance

You may remember from calculus class that it is sometimes easier to compute integrals by changing the integration variable. The same concept holds in rendering: we'll use three different integration methods in building a computationally efficient path tracer. In this section I will draw some material directly from the PBRTv3 online textbook, which you can find here: http://www.pbr-book.org/3ed-2018/Color_and_Radiometry/Working_with_Radiometric_Integrals.html

I was a teaching assistant for the graduate graphics course at Brown for 2 years, and by far the most common mistake made in the path tracing project assignments was an insufficient understanding of the calculus that goes into correctly integrating radiometric quantities.

Integrating Over Solid Angle

As mentioned before, in order to compute irradiance $E(x, n)$ at a surface point $x$ with normal $n$, we need to take Lambert's law into account, because there is a "spreading out" of flux density that occurs when light sources are facing the surface at an angle.

$E(x, n) = \int_\Omega d\omega L_i(x, \omega) |\cos \theta| = \int_\Omega d\omega L_i(x, \omega) |\omega \cdot n| $

One way to estimate this integral is a single-sample Monte Carlo Estimator, where we sample a single ray direction $\omega_i$ uniformly from the hemisphere, and evaluate the radiance for that direction. In expectation over $\omega_i$, the estimator computes the correct integral.

$\omega_i \sim \Omega $

$\hat{E}(x, n) = L_i(x, \omega_i) |\omega_i \cdot n| \frac{1}{p(\omega_i)} $

Integrating Over Projected Solid Angle

Due to Lambert's law, we should never sample outgoing rays perpendicular to the surface normal because the projected area $\frac{A}{\cos \theta}$ approaches infinity, so the radiance contribution to that area is zero.

We can avoid sampling these "wasted" rays by weighting the probability of sampling a ray according to Lambert's law - in other words, by sampling from a cosine-weighted distribution over the hemisphere $H^2$. This requires us to perform a change of variables, and integrate with respect to the projected solid angle $d\omega^\perp = |\cos \theta| d\omega$.

This is where the cosine-weighted hemisphere sampling function we implemented earlier will come in handy.

$ E(x, n) = \int_{H^2} d\omega^\perp L_i(x, \omega^\perp) $

The cosine term in the integral means that the contribution to irradiance is higher as the incoming light direction becomes more closely aligned with the surface normal (that is, more perpendicular to the surface).
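Here is a minimal sketch comparing the two single-sample estimators in the simplest possible setting: constant incoming radiance $L$ from every direction, for which the exact irradiance is $\pi L$. This toy setup is my own assumption for illustration, and only $\cos \theta$ needs to be sampled because the integrand does not depend on azimuth.

```python
import math
import random

L = 1.0  # constant incoming radiance; exact irradiance is pi * L

def estimate_uniform(rng):
    # Uniform hemisphere sampling: cos(theta) ~ U[0, 1), p(omega) = 1 / (2*pi).
    cos_theta = rng.random()
    return L * cos_theta * (2.0 * math.pi)

def estimate_cosine_weighted(rng):
    # Cosine-weighted sampling: p(omega) = cos(theta) / pi, so the cosine cancels.
    cos_theta = math.sqrt(1.0 - rng.random())
    pdf = cos_theta / math.pi
    return L * cos_theta / pdf

rng = random.Random(0)
n = 20000
uniform = [estimate_uniform(rng) for _ in range(n)]
cosine = [estimate_cosine_weighted(rng) for _ in range(n)]
mean_uniform = sum(uniform) / n
mean_cosine = sum(cosine) / n
print(mean_uniform, mean_cosine, math.pi)
```

Both estimators agree in expectation, but in this toy case every cosine-weighted sample evaluates to the same value, so its variance is zero. In a real scene the radiance varies with direction, so the variance is reduced rather than eliminated.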

Integrating Over Light Area

If the light source subtends a very small solid angle on the hemisphere, we will need to sample a lot of random outgoing rays before we find one that intersects the light source. For small or directional light sources, it is far more computationally efficient to integrate over the area of the light, rather than the hemisphere.

Figure Source: http://www.pbr-book.org/3ed-2018/Color_and_Radiometry/Working_with_Radiometric_Integrals.html

If we perform a change in variables from differential solid angle $d\omega$ to differential area $dA$, we must compensate for the change in volume.

$ d\omega = \frac{dA \cos \theta_o}{r^2} $

I won't go through the derivation in this tutorial, but the interested reader can find it here: https://www.cs.princeton.edu/courses/archive/fall10/cos526/papers/zimmerman98.pdf. Substituting the above equation into the irradiance integral, we have:

$ E(x, n) = \int_{A} L \cos \theta_i \frac{dA \cos \theta_o}{r^2} $

where $L$ is the emitted radiance of the light coming from the implied direction $-\omega$, which has an angular offset of $\theta_o$ from the light surface's surface normal. The corresponding single-sample Monte Carlo estimator is given by sampling a point on the area light, rather than a direction on the hemisphere. The probability $p(p)$ of sampling the point $p$ on an area $A$ is usually given by a uniform $\frac{1}{A}$.

$p \sim A $

$\omega = \frac{p-x}{\left\lVert {p-x} \right\rVert} $

$r^2 = \left\lVert {p-x} \right\rVert ^2 $

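$\hat{E}(x, n) = \frac{1}{p(p)}\frac{L}{r^2} |\omega \cdot n| |-\omega \cdot n_p| $

where $n_p$ is the surface normal of the light at the sampled point $p$, so that $|\omega \cdot n| = \cos \theta_i$ and $|-\omega \cdot n_p| = \cos \theta_o$.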
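As a sketch of the area estimator in action, the script below estimates the irradiance at a floor point directly beneath the center of a 1 m x 1 m light that hangs 3.9 m above it and faces straight down. The geometry is a simplification I chose so that both cosines equal $h / r$; the Monte Carlo estimate is compared against a midpoint-rule quadrature of the same integrand rather than against a closed-form answer.

```python
import math
import random

L = 1.0      # emitted radiance of the light (assumed constant)
h = 3.9      # light height above the receiving point, in meters
half = 0.5   # the light is a 1 m x 1 m square centered above the point
area = (2 * half) ** 2

def integrand(px, pz):
    # Both normals are vertical, so cos(theta_i) = cos(theta_o) = h / r.
    r2 = px * px + pz * pz + h * h
    cos_theta = h / math.sqrt(r2)
    return L * cos_theta * cos_theta / r2

def estimate_mc(n, rng):
    # Uniform area sampling: p(p) = 1 / area.
    total = 0.0
    for _ in range(n):
        px = rng.uniform(-half, half)
        pz = rng.uniform(-half, half)
        total += integrand(px, pz) * area
    return total / n

def estimate_grid(m):
    # Midpoint-rule quadrature over the light, used as a reference value.
    step = 2 * half / m
    cells = [-half + (i + 0.5) * step for i in range(m)]
    return sum(integrand(x, z) for x in cells for z in cells) * step * step

mc = estimate_mc(5000, random.Random(0))
ref = estimate_grid(100)
print(mc, ref)
```

Because every sample lands on the light, none of them are wasted, and the integrand varies by only a few percent across the light, so the estimate converges quickly.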

Making Rendering Computationally Tractable with Path Integrals

The rendering equation describes the outgoing radiance $L_o(x, \omega_o)$ from point $x$ along ray $\omega_o$.

$ L_o(x, \omega_o) = L_e(x, \omega_o) + \int_{\Omega} f_r(x, \omega_i, \omega_o) L_i(x, \omega_i) |\omega_i \cdot n| d\omega_i $

where $L_e(x, \omega_o)$ is emitted radiance, $f_r(x, \omega_i, \omega_o)$ is the BRDF (material properties), $L_i(x, \omega_i)$ is incoming radiance, and $|\omega_i \cdot n|$ is the attenuation of light coming in at an incident angle with surface normal $n$. The integral is with respect to solid angle on a hemisphere.

How do we go about implementing this on a computer? Evaluating the incoming light to a point requires integrating over an infinite number of directions, and for each of these directions, we have to recursively evaluate the incoming light to those points. Our computers simply cannot do this.

Fortunately, path tracing provides a tractable way to approximate this scary integral. Instead of integrating over the hemisphere $\Omega$, we can sample a random direction $\omega_i \sim \Omega$, and the probability-weighted contribution from that single ray is an unbiased, single-sample Monte Carlo estimator of the rendering equation.

$ \omega_i \sim \Omega $

$\hat{L}_o(x, \omega_o) = L_e(x, \omega_o) + \frac{1}{p(\omega_i)} f_r(x, \omega_i, \omega_o) L_i(x, \omega_i) |\omega_i \cdot n(x)| $

We still need to deal with infinite recursion. In most real-world scenarios, a photon only bounces around a few times before it is absorbed, so we can truncate the depth or use an unbiased technique like Russian Roulette sampling. We recursively trace the $L_i(x, \omega_i)$ function until we hit the termination condition, which results in a computation cost that is linear in the depth.
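To illustrate the difference between the two termination strategies, here is a toy sketch in an assumed "furnace" scene where every surface emits the same radiance and reflects half of what it receives, so the recursion collapses to a scalar and the exact answer is a geometric series. The constants and function names are hypothetical and chosen for illustration.

```python
import random

L_E = 1.0      # radiance emitted by every surface (assumed "furnace" scene)
ALBEDO = 0.5   # fraction of light reflected at each bounce

def trace_truncated(depth, max_depth=3):
    # Fixed-depth recursion: drops all paths longer than max_depth bounces.
    radiance = L_E
    if depth < max_depth:
        radiance += ALBEDO * trace_truncated(depth + 1, max_depth)
    return radiance

def trace_roulette(rng, q=0.5):
    # Continue with probability q and divide by q to keep the estimate unbiased.
    radiance = L_E
    if rng.random() < q:
        radiance += ALBEDO * trace_roulette(rng, q) / q
    return radiance

exact = L_E / (1.0 - ALBEDO)     # geometric series: 2.0
truncated = trace_truncated(0)   # 1 + 0.5 + 0.25 + 0.125 = 1.875
rng = random.Random(0)
n = 20000
roulette = sum(trace_roulette(rng) for _ in range(n)) / n
print(exact, truncated, roulette)
```

Truncation systematically loses the energy carried by long paths, whereas Russian Roulette trades that bias for extra variance while still terminating with probability one.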

A Naive Path Tracer

Below is the code for a naive path tracer, which is more or less a direct translation of the equation above.

def trace(ro, rd, depth):
    p = intersect(ro, rd)
    n = calcNormal(p)
    radiance = emittedRadiance(p, ro)
    if depth < 3:
        # Uniform hemisphere sampling
        rd2 = sampleUniformHemisphere(n)
        Li = trace(p, rd2, depth+1)
        # cosine term uses the sampled incoming direction rd2
        radiance += brdf(p, rd, rd2)*Li*np.dot(rd2, n)
    return radiance

We assume a 25 Watt square light fixture at the top of the Cornell Box that acts as a diffuse area light and only emits light from one side of the plane. Diffuse lights have uniform spatial and directional radiance distribution; this is also known as a "Lambertian Emitter", and it has a closed-form solution for its emitted radiance from any direction:

LIGHT_POWER = np.array([25, 25, 25]) # Watts
LIGHT_AREA = 1.

def emittedRadiance(p, ro):
    return LIGHT_POWER / (np.pi * LIGHT_AREA)

The $\pi$ term is a little surprising at first, but you can find a derivation of where it comes from here: https://computergraphics.stackexchange.com/questions/3621/total-emitted-power-of-diffuse-area-light.
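For reference, here is a short sketch of that derivation, assuming a one-sided Lambertian emitter of area $A$ whose radiance $L$ is the same at every point and in every direction. Integrating radiance over the emitting hemisphere and over the area of the light gives the total power:

$ \Phi = \int_A \int_{H^2} L \cos \theta \, d\omega \, dA = L A \int_0^{2\pi} \int_0^{\pi/2} \cos \theta \sin \theta \, d\theta \, d\phi = \pi L A $

so $L = \frac{\Phi}{\pi A}$. With $\Phi = 25$ Watts and $A = 1$ square meter, this gives $L \approx 7.96$ Watts per square meter per steradian for each color channel.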

Normally we'd have to track radiance for every visible wavelength, but we can obtain a good approximation of the entire spectral power distribution by tracking radiance for just a few wavelengths of light. According to tristimulus theory, it is actually possible to represent all human-perceivable colors with 3 numbers, such as XYZ or RGB color bases. For simplicity, we'll only compute radiance values for R, G, B wavelengths in this tutorial. The brdf term corresponds to material properties. This is a simple scene in which all materials are Lambertian, meaning that the direction of the incident and exitant angles don't matter, so the brdf reflects incident radiance by multiplying its R, G, B values. Here are the BRDFs we use for various objects in the scene, expressed in the RGB basis:

lightDiffuseColor = np.array([0.2,0.2,0.2])
leftWallColor = np.array([.611, .0555, .062]) * 1.5
rightWallColor = np.array([.117, .4125, .115]) * 1.5
whiteWallColor = np.array([255, 239, 196]) / 255

We can make our path tracer more efficient by switching the integration variable to the projected solid angle $d\omega_i |\cos \theta|$. As discussed in the last section, this has the benefit of importance-sampling the solid angles that are proportionally larger due to Lambert's law, and as an added bonus we can drop the evaluation of the cosine term.

def trace(ro, rd, depth):
    p = intersect(ro, rd)
    n = calcNormal(p)
    radiance = emittedRadiance(p, ro)
    if depth < 3:
        # Cosine-weighted hemisphere sampling
        rd2 = sampleCosineWeightedHemisphere(n)
        Li = trace(p, rd2, depth+1)
        radiance += brdf(p, rd, rd2)*Li
    return radiance

Reducing Variance by Splitting Up Indirect Lighting

The above estimator is correct and will get you the right result in expectation, but ends up being a high-variance estimator because the samples only have nonzero radiance when one or more of the path intersections intersects the emissive geometry. If you are trying to render a scene that is illuminated by a geometrically small light source -- a candle in a dark room perhaps -- the vast majority of path samples will never intersect the candle, and subsequently these samples will be sort of wasted. The image will appear very grainy and dark.

Luckily, the area integration trick we discussed a few sections back comes to our rescue. In graphics, we actually know where the light surfaces are ahead of time, so we can integrate over the emissive surface instead of integrating over the receiving surface's solid angles. We do this by performing a change of variables $d\omega = \frac{dA \cos \theta_o}{r^2}$.

To implement this trick, we can split up indirect lighting reflecting off point $p$ into two separate calculations: (1) direct lighting from a light source bouncing off of $p$, and (2) indirect lighting from a non-light source reflecting off of $p$. Notice that we have to modify the recursive trace term to ignore emittedRadiance from any lights it encounters, except for the case where light leaves the emitter and enters the eye directly (which is when depth=0). This is because for each point $p$ in the path, we are already accounting for an extra path that goes from an area light directly to $p$. We don't want to double count such paths!

def trace(ro, rd, depth):
    p = intersect(ro, rd)
    n = calcNormal(p)
    radiance = 0.0
    if depth == 0:
        # Integration over solid angle (eye ray)
        radiance = emittedRadiance(p, ro)
    # Direct Lighting Term
    pA, M, pdf_A = sampleAreaLight()
    n_light = calcNormal(pA)
    if visibilityTest(p, pA):
        square_distance = np.sum(np.square(pA - p))
        w_i = normalize(pA - p)
        dw_da = np.dot(n_light, -w_i)/square_distance # dw/dA
        # divide by pdf_A, the probability density of the sampled light point
        radiance += (brdf(p, rd, w_i) * np.dot(n, w_i) * M) * dw_da / pdf_A
    # Indirect Lighting Term
    if depth < 3:
        # Integration over cosine-weighted solid angle
        rd2 = sampleCosineWeightedHemisphere(n)
        Li = trace(p, rd2, depth+1)
        radiance += brdf(p, rd, rd2)*Li
    return radiance

The sampleAreaLight() function samples a point $p$ on an area light with emitted radiance $M$ and also computes the probability of choosing that sample (for a uniform emitter, it's just one over the area).

The cool thing about this path tracer implementation is that it features three different ways to integrate irradiance: solid angles, projected solid angles, and area light surfaces. Calculus is useful!

Ignoring Photometry

Photometry is the study of how we convert radiometric quantities (the outputs of the path tracer) to the color quantities perceived by the human visual system. For this tutorial we will do a crude approximation of the radiometric-to-photometric conversion by simply clipping the values of each R, G, B radiance to a maximum of 1, and displaying the result directly in matplotlib.

And voila! We get a beautifully path-traced image of a Cornell Box. Notice how colors from the walls "bleed" onto adjacent walls, and the shadows cast by the boxes are "soft".

Performance Benchmarks: P100 vs. TPUv2

Copying data between accelerators (TPU, GPU) and host chips (CPU) is very slow, so we'll try to compile the path tracing code into as few XLA calls from Python as possible. We can do this by applying the jax.jit operator to the entire trace() function, so the rendering happens completely on the accelerator. Because trace is a recursive function, we need to tell the XLA compiler that we are actually compiling it with a statically fixed depth of 3, so that XLA can unroll the loop and make it non-recursive. The vmap call then transforms the function into a vectorized version.

trace = jit(trace, static_argnums=(3,)) # optional
render_fn = lambda rng_key, ro, rd : trace(rng_key, ro, rd, 0)
vec_render_fn = vmap(render_fn)

According to jax.local_device_count(), a Google Cloud TPU has 8 cores. The code above only performs SIMD vectorization across 1 device, so we can also parallelize across multiple TPU cores using JAX's pmap operator to get an additional speed boost.

# vec_render_fn = vmap(render_fn)
vec_render_fn = jax.soft_pmap(render_fn)

How fast does this path tracer run? I benchmarked the performance of (1) a manually-vectorized Numpy implementation, (2) a vmap-vectorized single-pixel implementation, and (3) a manually-vectorized JAX implementation (almost identical in syntax to numpy). Jitting the recursive trace function was very slow to compile (and occasionally even crashed my notebook kernel), so I also implemented a version where the recursion happens in Python but the loop body of trace (direct lighting, emission, sampling rays) is executed on the accelerator.

The plot below shows that JAX code is much slower to run on the first sample because the just-in-time compilation has to compile and fuse all the necessary XLA operations. I wouldn't read too carefully into this plot (especially when comparing GPU vs. TPU) because when I was doing these experiments I encountered a huge amount of variance in compile times. Numpy doesn't have any JIT compilation overhead, so it runs much faster for a single sample, even on the CPU.

What about a multi-sample render? After the XLA kernels have been compiled, subsequent calls to the trace function are very fast.

We see that there's a trade-off between compilation time and runtime: the more we compile, the faster things run when performing many samples. I haven't tuned the code to favor any accelerator in particular, and this is the first time I've measured TPU and GPU performance under a reasonable path tracing workload. Path tracing is an embarrassingly parallel workload (on the pixel level and image sample level), so it should be quite possible to get a linear speedup from using more TPU cores. My code currently does not do that because each pmap'ed worker is blocked on rendering an entire image sample. If you have suggestions on how to accelerate the code further, I'd love to hear from you.

Summary

In this blog post we derived the principles of physically based rendering from scratch, and implemented a differentiable path tracer in pure JAX. There are three kinds of radiometric integrals (solid angle, projected solid angle, and area) that come up in a basic implementation of a path tracer, and we used all three to implement a path tracer that separates direct lighting contributions from area lights from indirect lighting bouncing off non-light surfaces.

JAX provides us with a lot of useful features to implement this:

You can write a one-pixel path tracer and vmap it into a vectorized version without sacrificing performance.
You can parallelize trivially across devices using pmap.
Code runs on GPU and TPU without modifications.
Analytical surface normals of signed distance fields provided by automatic differentiation.
Lightweight enough to run in a Jupyter/Colaboratory notebook, making it ideal for trying out graphics research ideas without getting bogged down by software engineering abstractions.

There are still some sharp bits with JAX because graphics and rendering workloads are not its first-class customers. Still, I think there is a lot of promise and future work to be done with combining the programmatic expressivity of modern deep learning frameworks with the field of graphics.

We didn't explore the differentiability of this path tracer, but rest assured that the combination of ray-marching and Monte Carlo path integration makes everything tractable. Stay tuned for the next part of the tutorial, when we mix differentiation of this path tracer with neural networks and machine learning.

Acknowledgements

Thanks to Luke Metz, Jonathan Tompson, Matt Pharr for interesting discussion a few years ago when I wrote the first version of this code in TensorFlow. Many thanks to Peter Hawkins, James Bradbury, and Stephan Hoyer for teaching me more about JAX and XLA. Thanks to Yining Karl Li for entertaining my dumb rendering questions and Vincent Vanhoucke for catching typos.

Fun Facts

Jim Kajiya's first path tracer took 7 hours to render a 256x256 image on a 280,000 USD IBM computer. By comparison, this renderer takes about 10 seconds to render an image of similar size, and you can run it for free with Google's free hosted colab notebooks that come with JAX pre-installed.
I didn't discuss photometry much in this tutorial, but it turns out that the SI unit of luminous intensity, the candela, is the only SI base unit related to a biological process (the human visual system).
Check out my blog post on normalizing flows for more info on how "conservation of probability mass" is employed in deep learning research!
OpenDR was one of the first general-purpose differentiable renderers, and was technically innovative enough to merit publishing in ECCV 2014. It's remarkable to see how easy writing a differentiable renderer has become with modern deep learning frameworks like JAX, PyTorch, and TensorFlow.

Robinhood, Leverage, and Lemonade

2019-11-06T07:18:00.004-08:00

DISCLAIMER: NO INVESTMENT OR LEGAL ADVICE
The Content is for informational purposes only, you should not construe any such information or other material as legal, tax, investment, financial, or other advice. Investing involves risk, please consult a financial professional before making an investment.

Robinhood is a zero-commission brokerage that was founded in 2013. It has a beautiful mobile user interface that game-ifies the gambling of your life savi—, er, makes it seamless for millennials to buy and sell stocks.

I wrote on Quora in Dec 2014 on why lowering the barrier to entry to this extent can cause retail investors to make trades without knowing what they are doing. That post turned out to be rather prescient, for reasons I’ll explain below.

One of the ways Robinhood makes money is via margin lending: they loan you some extra money to invest in the stock market with, and later you pay back the loan with some interest (currently about 5%).

If you are in the business of lending money, not only do you have to safeguard your brokerage system against technological vulnerabilities (e.g. C++ memory leaks that expose users’ trades), but you also need to defend against financial vulnerabilities, which are portfolios that expose the lender or its customers to an irresponsible amount of investment risk.

In the last few months it has come to light [1, 2, 3, 4, 5] that there are some serious financial vulnerabilities in Robinhood’s margin lending platform, whereby it is possible for users to borrow much, much more money from Robinhood than they are supposed to.

Reddit Discussion

These users subsequently gamble huge amounts of borrowed money away in a coin toss, leaving Robinhood in a very bad spot, perhaps even at odds with Regulation T laws (I am not a lawyer, just speculating here).

“Leverage” is one of the most important concepts to understand in finance, and when used judiciously, is a net positive for everyone involved. It is important for everyone to understand how credit works, and how much leverage is too much. Borrowing more money than you can afford to pay back can take many forms, whether it is taking on college debt, credit card debt, or raising VC money.

Here’s a tutorial on “financial leverage” in the form of a story about lemonade:

Lemonade Leverage

It’s a hot summer, and you decide to start a lemonade stand to make some money. You have 100€, with which you can buy enough ingredients to make 120€ of lemonade for the summer. Your “return on investment”, or ROI, for the summer is 20%, since you ended up with 20% more money than you started with.

You also figure that if you had another 200€, enough people want lemonade that you could sell three times as much lemonade and make 360€. But you don’t have 200€ to spare! What do you do?

You could use the 120€ to build a slightly bigger lemonade operation next year. Assuming you could get a 20% ROI again next summer, you end up with 144€. But it will be many years before you even have 300€! By this time next year, lemonade might be out of fashion and kids might be juuling at home watching Netflix instead. You would much prefer to scale up your lemonade operation now, while you are confident that you can sell lemonade at a "profit margin" of 20%.

Fortunately, your friend “Britney Banker” is very wealthy and can lend you 200€. Britney Banker doesn’t have your entrepreneurial spirit, so she lacks the ability to get a 20% ROI on her own money. She offers to give you 200€ today, in exchange for you giving her 210€ at the end of the year -- an interest rate of 5%. Your “capital leverage ratio” is 100 / 200 = 1:2, because for every euro you own, Britney is willing to lend you 2€.

If things turn out well, you sell 360€ worth of lemonade, pay Britney back 210€, and pocket the remaining 150€. Starting with 100€, you were able to use borrowed money to “magnify” your return to 50%.

However, if you spend all 300€ making lemonade and fail to sell any of it before it spoils and becomes worthless, you would be in a very sticky situation! You would have worthless lemonade and a 210€ debt to Britney. This is far worse than if you had lost only your own 100€, because at least you wouldn’t owe anyone anything afterwards. So just as 1:2 leverage may amplify your gains from 20% → 50%, it may also amplify your potential losses from -100% → -310%!

The only reason why Britney is willing to lend you the money in the first place is that Britney thinks this outcome (you losing all of the borrowed money on top of your own assets) is unlikely. If Britney thought that you were less reliable, she might offer you a smaller leverage ratio (e.g. 1 : 1.5).
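The arithmetic in this story is easy to check with a few lines of Python. This is only a sketch: `leveraged_roi` is a name I made up for illustration, and it assumes the whole pot (your money plus the loan) earns the same return and that the loan is repaid in full with interest at the end of the summer.

```python
def leveraged_roi(equity, borrowed, gross_return, interest_rate):
    """Return on the owner's equity after repaying the loan with interest.

    gross_return is the fractional return on all invested money,
    e.g. 0.20 for a good summer or -1.0 if the lemonade spoils.
    """
    invested = equity + borrowed
    proceeds = invested * (1.0 + gross_return)
    repayment = borrowed * (1.0 + interest_rate)
    return (proceeds - repayment - equity) / equity


# Good summer, without and with 1:2 leverage.
print(leveraged_roi(100, 0, 0.20, 0.05))
print(leveraged_roi(100, 200, 0.20, 0.05))
# Spoiled lemonade, without and with 1:2 leverage.
print(leveraged_roi(100, 0, -1.0, 0.05))
print(leveraged_roi(100, 200, -1.0, 0.05))
```

Up to floating-point rounding, the four calls give 0.2, 0.5, -1.0 and -3.1, which are the 20%, 50%, -100% and -310% figures from the story.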

Lemonade Coupons

Suppose you make a big batch of lemonade (with Britney’s money) and then go door to door selling lemonade, but instead of giving customers a delicious drink right away, you give them a “deep-in-the-lemonade covered call option”. You take their money up front, and give them a coupon that allows them to “buy” a lemonade for free (0€).

The "call option" is referred to as "covered" because you actually have the lemonade to go with the coupon, it's just that you're holding onto the lemonade until the buyer actually redeems the coupon.

You then go back to Britney and say “I have 360€ of lemonade that I’ve made but haven’t sold, and 360€ in cash from selling lemonade options to customers, and as for debts there’s 200€ I’ve borrowed from you. That’s 520€ in net assets, so can I please borrow 1040€?”.

Britney says “sure, that’s a 1:2 leverage ratio”, and writes you a check for 1040€, again with 5% interest. But Britney has made a tragic mistake here! The 360€ in lemonade she counted as your assets are not really yours to spend, because you actually owe them in obligations to customers.
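Britney's bookkeeping mistake can be written down as a toy calculation. The helper `net_assets` below is hypothetical, and I am treating each outstanding coupon as a liability equal to the value of the lemonade it can be redeemed for.

```python
def net_assets(lemonade, cash, loans, coupons_outstanding=0.0):
    """Assets minus liabilities; outstanding coupons are lemonade owed to customers."""
    return lemonade + cash - loans - coupons_outstanding


# Figures from the story after the first round of coupon sales.
apparent = net_assets(lemonade=360, cash=360, loans=200)
actual = net_assets(lemonade=360, cash=360, loans=200, coupons_outstanding=360)

print(apparent, actual)          # what Britney sees vs. what you really own
print(2 * apparent, 2 * actual)  # loan she would offer at 1:2 leverage in each case
```

Ignoring the coupons makes the net assets look like 520€ instead of 160€, so a lender applying the same 1:2 rule hands over 1040€ where 320€ would have been the limit.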

With 1240€ in borrowed assets (200€ + 1040€), you are now leveraged over 1:12!

You repeat this process again, turning 1040€ cash into 1248€ of lemonade, selling an additional 1248€ of deep-in-the-lemonade options. You now have 1608€ of lemonade, and 1608€ in cash, and 1240€ of debt, for net assets of 1608 + 1608 - 1240 = 1976€.

You go back to Britney and ask to borrow another 3952€, with 5% interest. Again, because Britney is forgetting to account for the 1608€ in lemonade “debt” that you may have to deliver to coupon-holders, she thinks that the leverage is still 1:2. You repeat this process one more time, and your new total position is 6k€ in lemonade, 6k€ in cash, 5k€ net debt.

If you were to successfully deliver 6k€ of lemonade, you would make 1k€ in profit, starting from only 100€ of your own cash. A 1000% return sounds too good to be true, right? That’s because it is.

One hot summer day, all of the coupon holders decide to exercise their coupons at the same time. You realize that your lemonade stand can’t actually fulfill 6k€ in lemonade orders and you are in way over your head. Desperate, you attempt to pivot and come up with a Billy McFarland-esque scheme to buy lemonade from a local grocery and dilute it with some water. But due to inexperience with food handling operations, you accidentally contaminate half the batch, and are left with only 3k€ of lemonade. You have 6k€ cash but still owe 3k€ in lemonade and 5k€ in cash. Your 1k€ profit opportunity has now become a 2k€ DEBT (ROI of -2100%), and we haven't even factored in the interest! Because the debtors (lemonade coupon holders and Britney Banker) must be paid regardless of whether you successfully make lemonade or not, your leverage has an asymmetric payoff - the downsides are twice as bad as the upside!

I wish I could say that this story was fictional, but to the best of my understanding this is more or less what /u/ControlTheNarrative and others attempted to do on Robinhood. Substitute "AMD stock" for "lemonade", and "deep-in-the-money covered call option" for "lemonade coupon". Theoretically, Robinhood shouldn't allow you to buy options on margin, but /u/ControlTheNarrative was very clever to use covered call options, which meant that he bought AMD stock with margin (valid) and then created cash and in-the-money AMD call options (sort of like creating matter and antimatter from nothing). Robinhood failed to detect the "antimatter", allowing /u/ControlTheNarrative to mask his "debt", thereby doubling his apparent net assets.

Ok, where did /u/ControlTheNarrative go wrong? It might be possible to still turn a profit by investing the vast amount of leverage in a “safe asset”, right? This seems unlikely: Robinhood’s interest rate of 5% far exceeds the risk-free rate of 1.88% currently offered by a 1-year Treasury note. In other words, it only makes sense to use Robinhood's leverage when you have the ability to deliver annualized returns that exceed 5%. When one has limited assets and a risky investment opportunity, they should instead carefully choose leverage so that they do not end up owing 10x their net worth should they encounter a stroke of bad luck.

Instead of trying to find an investment that minimizes risk while maintaining >5% return, /u/ControlTheNarrative proceeded to then take his enormous leverage and bet all of that on a coin toss: out-of-the-money (OTM) put options against Apple (remember that he is able to buy these options with leveraged cash because it has been "laundered" using covered call options).

Unfortunately for him, Apple proceeded to beat performance expectations for earnings, and subsequently the OTM options became worthless!

Guh!

Acknowledgements

Thanks to Ted Xiao and Daniel Ho for insightful discussion. We had a good laugh. I found the following links helpful in my research:

Normalizing Flows in 100 Lines of JAX

2019-07-06T12:10:00.000-07:00

JAX is a great linear algebra + automatic differentiation library for fast experimentation with and teaching machine learning. Here is a lightweight example, in just 75 lines of JAX, of how to implement Real-NVP.

This post is based off of a tutorial on normalizing flows I gave at the ICML workshop on Invertible Neural Nets and Normalizing Flows. I've already written about how to implement your own flows in TensorFlow using TensorFlow Probability's Bijector API, so to make things interesting I wanted to show how to implement Real-NVP a different way.

By the end of this tutorial you'll be able to reproduce this figure of a normalizing flow "bending" samples from a 2D Normal distribution to samples from the "Two Moons" dataset. Real-NVP forms the basis of a lot of flow-based architectures (as of 2019), so this is a good template to start learning from.

If you are not already familiar with flows at a high level, please check out the 2-part tutorial: [part 1] [part 2], as this tutorial just focuses on how to implement flows in JAX. You can find all the code along with the slides for my talk here.

Install Dependencies

There are just a few dependencies required to reproduce this tutorial. We'll be running everything on the CPU, though you can also build the GPU-enabled versions of JAX if you have the requisite hardware.

pip install --upgrade jax jaxlib scikit-learn matplotlib

Toy Dataset

Scikit-Learn comes with some toy datasets that are useful for small scale density models.

from sklearn import cluster, datasets, mixture
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

n_samples = 2000
noisy_moons = datasets.make_moons(n_samples=n_samples, noise=.05)
X, y = noisy_moons
X = StandardScaler().fit_transform(X)

Affine Coupling Layer in JAX

TensorFlow probability defines an object-oriented API for building flows, where a "TransformedDistribution" object is given a base "Distribution" object along with a "Bijector" object that implements the invertible transformation. In pseudocode, it goes something like this:

class TransformedDistribution(Distribution):
    def sample(self):
        x = self.base_distribution.sample()
        return self.bijector.forward(x)

    def log_prob(self, y):
        x = self.bijector.inverse(y)
        ildj = self.bijector.inverse_log_det_jacobian(y)
        return self.base_distribution.log_prob(x) + ildj

However, programming in JAX takes on a functional programming philosophy where functions are stateless and classes are eschewed. That's okay: we can still build a similar API in a functional way. To make everything end-to-end differentiable via JAX's grad() operator, it's convenient to put the parameters that we want gradients for as the first argument of every function. Here are the sample and log_prob implementations of the base distribution.

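import jax.numpy as np
from jax import random

rng = random.PRNGKey(0)  # global PRNG key used by the snippets below; seed 0 is an arbitrary choice

def sample_n01(N):
    D = 2
    return random.normal(rng, (N, D))

def log_prob_n01(x):
    # Log-density of a standard Normal, summed over the last axis
    return np.sum(-np.square(x)/2 - np.log(np.sqrt(2*np.pi)), axis=-1)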

Below are the forward and inverse functions of Real-NVP, which operates on minibatches (we could also re-implement this to operate over vectors, and use JAX's vmap operator to auto-batch it). Because we are dealing with 2D data, the masking scheme for Real-NVP is very simple: we just switch the masked variable every other flow via the "flip" parameter.

def nvp_forward(net_params, shift_and_log_scale_fn, x, flip=False):
    d = x.shape[-1]//2
    x1, x2 = x[:, :d], x[:, d:]
    if flip:
        x2, x1 = x1, x2
    shift, log_scale = shift_and_log_scale_fn(net_params, x1)
    y2 = x2*np.exp(log_scale) + shift
    if flip:
        x1, y2 = y2, x1
    y = np.concatenate([x1, y2], axis=-1)
    return y

def nvp_inverse(net_params, shift_and_log_scale_fn, y, flip=False):
    d = y.shape[-1]//2
    y1, y2 = y[:, :d], y[:, d:]
    if flip:
        y1, y2 = y2, y1
    shift, log_scale = shift_and_log_scale_fn(net_params, y1)
    x2 = (y2-shift)*np.exp(-log_scale)
    if flip:
        y1, x2 = x2, y1
    x = np.concatenate([y1, x2], axis=-1)
    return x, log_scale

The "forward" NVP transformation takes in a callable shift_and_log_scale_fn (an arbitrary neural net that takes the masked variables as inputs), applies it to recover the shift and log scale parameters, transforms the un-masked inputs, and then stitches the masked scalar and the transformed scalar back together in the right order. The inverse does the opposite.

Here are the corresponding sampling (forward) and log-prob (inverse) implementations for a single RealNVP coupling layer. The ILDJ term is computed directly, as it is just the (negative) sum of the log_scale terms.

def sample_nvp(net_params, shift_log_scale_fn, base_sample_fn, N, flip=False):
    x = base_sample_fn(N)
    return nvp_forward(net_params, shift_log_scale_fn, x, flip=flip)

def log_prob_nvp(net_params, shift_log_scale_fn, base_log_prob_fn, y, flip=False):
    x, log_scale = nvp_inverse(net_params, shift_log_scale_fn, y, flip=flip)
    ildj = -np.sum(log_scale, axis=-1)
    return base_log_prob_fn(x) + ildj
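If you want to convince yourself that the coupling layer really is invertible and that the ILDJ is just the negative sum of log_scale, here is a quick numerical check. It is a sketch in plain NumPy rather than JAX so it runs anywhere. `toy_shift_and_log_scale_fn`, `coupling_forward`, `coupling_inverse` and `numerical_ildj` are names I made up for this check, and the toy function stands in for the neural net.

```python
import numpy as np

def toy_shift_and_log_scale_fn(params, x1):
    # Hypothetical stand-in for the neural net: any function of x1 works.
    return params["w_shift"] * x1, np.tanh(params["w_scale"] * x1)

def coupling_forward(params, fn, x, flip=False):
    d = x.shape[-1] // 2
    x1, x2 = x[:, :d], x[:, d:]
    if flip:
        x1, x2 = x2, x1
    shift, log_scale = fn(params, x1)
    y2 = x2 * np.exp(log_scale) + shift
    parts = [y2, x1] if flip else [x1, y2]
    return np.concatenate(parts, axis=-1)

def coupling_inverse(params, fn, y, flip=False):
    d = y.shape[-1] // 2
    y1, y2 = y[:, :d], y[:, d:]
    if flip:
        y1, y2 = y2, y1
    shift, log_scale = fn(params, y1)
    x2 = (y2 - shift) * np.exp(-log_scale)
    parts = [x2, y1] if flip else [y1, x2]
    return np.concatenate(parts, axis=-1), log_scale

def numerical_ildj(params, fn, y_point, flip=False, eps=1e-5):
    """log|det d(inverse)/dy| at one point of shape (D,), by central differences."""
    D = y_point.shape[0]
    jac = np.zeros((D, D))
    for j in range(D):
        dy = np.zeros(D)
        dy[j] = eps
        x_plus, _ = coupling_inverse(params, fn, (y_point + dy)[None, :], flip)
        x_minus, _ = coupling_inverse(params, fn, (y_point - dy)[None, :], flip)
        jac[:, j] = (x_plus[0] - x_minus[0]) / (2 * eps)
    return np.log(np.abs(np.linalg.det(jac)))

params = {"w_shift": 0.7, "w_scale": -1.3}  # arbitrary illustrative values
x = np.random.default_rng(0).normal(size=(5, 2))
for flip in (False, True):
    y = coupling_forward(params, toy_shift_and_log_scale_fn, x, flip=flip)
    x_rec, log_scale = coupling_inverse(params, toy_shift_and_log_scale_fn, y, flip=flip)
    analytic = -np.sum(log_scale, axis=-1)
    numeric = np.array([numerical_ildj(params, toy_shift_and_log_scale_fn, yi, flip) for yi in y])
    print(flip, np.allclose(x_rec, x), np.allclose(analytic, numeric, atol=1e-6))
```

Both checks should come out True for either masking order: the Jacobian of the inverse is triangular, so its determinant is just the product of the exp(-log_scale) terms on the diagonal.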

What should we use for our shift_and_log_scale_fn? I've found that for 2D data + NVP, wide and shallow neural nets tend to train more stably. We'll use some JAX helper libraries to build a function that initializes the parameters and callable function for an MLP with two hidden layers (512 units each) and ReLU activations.

# Note: in newer JAX releases these modules live under jax.example_libraries.
from jax.experimental import stax # neural network library
from jax.experimental.stax import Dense, Relu # neural network layers

def init_nvp():
    D = 2
    net_init, net_apply = stax.serial(
        Dense(512), Relu, Dense(512), Relu, Dense(D))
    in_shape = (-1, D//2)
    out_shape, net_params = net_init(rng, in_shape)
    def shift_and_log_scale_fn(net_params, x1):
        s = net_apply(net_params, x1)
        return np.split(s, 2, axis=1)
    return net_params, shift_and_log_scale_fn

Stacking Coupling Layers

TensorFlow Probability's object-oriented API is convenient because it allows us to "stack" multiple TransformedDistributions on top of each other for more expressive - yet tractable - transformations.

dist1 = TransformedDistribution(base_dist, bijector1)
dist2 = TransformedDistribution(dist1, bijector2)
dist2.sample() # member variables reference dist1, which references base_dist

For "bipartite" flows like Real-NVP which leave some variables untouched, it is critical to be able to stack multiple flows so that all variables get a chance to be "transformed".

Here's the functional way to do the same thing in JAX. We have a function "init_nvp_chain" that returns neural net parameters, callable shift_and_log_scale_fns, and masking parameters for each flow. We then pass this big bag of parameters to the sample_nvp_chain function.

In log_prob_nvp_chain, there is an iteration loop that overrides log_prob_fn, which is initially set to base_log_prob_fn. This is to accomplish similar semantics to how TransformedDistribution.log_prob is defined with respect to the log_prob function of the base distribution beneath it. Python variable binding can be a bit tricky at times, and it's easy to make a mistake here that results in an infinite loop. The solution is to make a function generator (make_log_prob_fn) that returns a function with the correct base log_prob_fn bound to the log_prob_nvp argument. Thanks to David Bieber for pointing this fix out to me.

def init_nvp_chain(n=2):
    flip = False
    ps, configs = [], []
    for i in range(n):
        p, f = init_nvp()
        ps.append(p)
        configs.append((f, flip))
        flip = not flip
    return ps, configs

def sample_nvp_chain(ps, configs, base_sample_fn, N):
    x = base_sample_fn(N)
    for p, config in zip(ps, configs):
        shift_log_scale_fn, flip = config
        x = nvp_forward(p, shift_log_scale_fn, x, flip=flip)
    return x

def make_log_prob_fn(p, log_prob_fn, config):
    shift_log_scale_fn, flip = config
    return lambda x: log_prob_nvp(p, shift_log_scale_fn, log_prob_fn, x, flip=flip)

def log_prob_nvp_chain(ps, configs, base_log_prob_fn, y):
    log_prob_fn = base_log_prob_fn
    for p, config in zip(ps, configs):
        log_prob_fn = make_log_prob_fn(p, log_prob_fn, config)
    return log_prob_fn(y)
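The variable-binding pitfall is easy to reproduce in plain Python, without any flows involved. In this sketch, `compose_buggy`, `compose_fixed` and `make_composed` are illustrative names of my own. The buggy version builds the lambda inside the loop, so the name `f` is only looked up when the lambda is finally called.

```python
def compose_buggy(fns, base_fn):
    f = base_fn
    for g in fns:
        # BUG: `f` is looked up at call time, by which point it refers to
        # this very lambda, so calling it recurses without terminating.
        f = lambda x: g(f(x))
    return f

def make_composed(g, inner):
    # `inner` is bound when make_composed is called, so each layer keeps its own base.
    return lambda x: g(inner(x))

def compose_fixed(fns, base_fn):
    f = base_fn
    for g in fns:
        f = make_composed(g, f)
    return f

add_one = lambda x: x + 1
double = lambda x: 2 * x

composed = compose_fixed([add_one, double], base_fn=lambda x: x)
print(composed(3))  # 8, i.e. double(add_one(3))

try:
    compose_buggy([add_one, double], base_fn=lambda x: x)(3)
except RecursionError:
    print("buggy version recursed without terminating")
```

The factory function plays the same role as make_log_prob_fn: each call creates a fresh scope, so every layer closes over the function beneath it rather than over whatever the loop variable ends up pointing to.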

Training Real-NVP

Finally, we are ready to train this thing!

We initialize our Real-NVP with 4 affine coupling layers (each variable is transformed twice) and define the optimization objective to be the model's negative log-likelihood over minibatches (more precisely, cross entropy).

from jax.experimental import optimizers  # moved to jax.example_libraries in newer JAX releases
from jax import jit, grad
import numpy as onp

ps, cs = init_nvp_chain(4)

def loss(params, batch):
    return -np.mean(log_prob_nvp_chain(params, cs, log_prob_n01, batch))

opt_init, opt_update, get_params = optimizers.adam(step_size=1e-4)

Next, we declare a single optimization step where we retrieve the current optimizer state, compute gradients with respect to our big list of Real-NVP parameters, and then update our parameters. The cool thing about JAX is that we can "jit" (just-in-time compile) the step function to a single XLA op so that the entire optimization step happens without returning back to the (relatively slow) Python interpreter. We could even JIT the entire optimization process if we wanted to!

@jit
def step(i, opt_state, batch):
    params = get_params(opt_state)
    g = grad(loss)(params, batch)
    return opt_update(i, g, opt_state)

iters = int(1e4)
data_generator = (X[onp.random.choice(X.shape[0], 100)] for _ in range(iters))
opt_state = opt_init(ps)
for i in range(iters):
    opt_state = step(i, opt_state, next(data_generator))
ps = get_params(opt_state)

Animation

Here's the code snippet that will visualize each of the 4 affine coupling layers transforming samples from the base Normal distribution, in sequence. Is it just me, or does anyone else find themselves constantly having to Google "How to make a Matplotlib animation?"

from matplotlib import animation, rc
from IPython.display import HTML, Image

x = sample_n01(1000)
values = [x]
for p, config in zip(ps, cs):
    shift_log_scale_fn, flip = config
    x = nvp_forward(p, shift_log_scale_fn, x, flip=flip)
    values.append(x)

# First set up the figure, the axis, and the plot element we want to animate
xlim, ylim = [-3, 3], [-3, 3]  # assumed plot limits (not defined in this snippet originally); adjust to taste
fig, ax = plt.subplots()
ax.set_xlim(xlim)
ax.set_ylim(ylim)
y = values[0]
paths = ax.scatter(y[:, 0], y[:, 1], s=10, color='red')

def animate(i):
    # Linearly interpolate between the outputs of consecutive layers, 48 frames per layer
    l = i//48
    t = (float(i%48))/48
    y = (1-t)*values[l] + t*values[l+1]
    paths.set_offsets(y)
    return (paths,)

anim = animation.FuncAnimation(fig, animate, frames=48*len(cs), interval=1, blit=False)
anim.save('anim.gif', writer='imagemagick', fps=60)

Tips for Training Likelihood Models

2019-07-05T13:04:00.004-07:00

This is a tutorial on common practices in training generative models that optimize likelihood directly, such as autoregressive models and normalizing flows. Deep generative modeling is a fast-moving field, so I hope for this to be a newcomer-friendly introduction to the basic evaluation terminology used consistently across research papers, especially when it comes to modeling more complicated distributions like RGB images. This is a more in-depth version of the tutorial lecture I gave on normalizing flows at ICML.

This tutorial discusses the most mathematically straightforward of generative models (tractable density estimation models), and covers some important design considerations when choosing how to model image pixels. By the end of this post, you will know how to quantitatively compare likelihood models, even ones that differ drastically in architecture and the way pixels are modeled.

Divergence Minimization: A General Framework for Generative Modeling

The goal of generative modeling (all of statistical machine learning, really) is to take data sampled from some (possibly conditional) probability distribution $p(x)$ and learn a model $p_\theta(x)$ that approximates $p(x)$. Modeling allows us to extrapolate insight beyond the raw data we are given. Here are some versatile things you can do with generative models:

Draw new samples from $p(x)$
Learn hierarchical latent variables $z$ that explain the observations $x$
Intervene on latent variables to examine the interventional distributions $p_\theta(x|do(z))$. Note that this will only work properly if your conditional distribution models the correct causal relationship $z \to x$ and we assume ignorability.
Interrogate the likelihood of a new data point $x^\prime$ under our model distribution to detect anomalies

Modeling conditional distributions has an even broader set of direct applications, since we can interpret classification and regression problems as learning generative models:

Machine Translation $p(\text{translated english sentence}|\text{french sentence})$
Captioning $p(\text{caption}|\text{image})$
Regression objectives like minimizing mean-squared error $\min \frac{1}{2}(x-\mu)^2$ are mathematically equivalent to maximum log-likelihood estimation of a Gaussian with diagonal covariance: $\max -\frac{1}{2} (x-\mu)^2$.
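Here is a small numerical check of the last point, using plain NumPy. It is a sketch under the assumption of a unit-variance Gaussian, and `gaussian_nll` and `half_squared_error` are helper names I made up. The two objectives differ only by a constant that does not depend on $\mu$, so the same $\mu$ minimizes both.

```python
import numpy as np

def gaussian_nll(x, mu, sigma=1.0):
    """Negative log-density of x under N(mu, sigma^2), summed over data points."""
    return np.sum(0.5 * ((x - mu) / sigma) ** 2 + np.log(sigma) + 0.5 * np.log(2 * np.pi))

def half_squared_error(x, mu):
    return np.sum(0.5 * (x - mu) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=1000)  # synthetic data, made up for this check

mus = np.linspace(0.0, 4.0, 401)               # candidate values of mu
nll = np.array([gaussian_nll(x, m) for m in mus])
mse = np.array([half_squared_error(x, m) for m in mus])

offset = nll - mse                              # should be the same constant for every mu
best_mu_nll = mus[np.argmin(nll)]
best_mu_mse = mus[np.argmin(mse)]
print(np.allclose(offset, offset[0]), best_mu_nll, best_mu_mse, x.mean())
```

Both minimizers land on the grid point closest to the sample mean, which is the maximum likelihood estimate of a Gaussian mean.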

In order to get $p_\theta(x)$ to match $p(x)$, we first have to come up with the notion of a distance between the two distributions. In statistics, it is more common to devise a weaker notion of “distance” called a divergence measure, which unlike a metric distance, is not symmetric ($D(p, q) \neq D(q, p)$). Once we have a formal divergence measure between distributions, we can attempt to minimize it via optimization.

There are many, many divergences $D(p_\theta || p)$ that we can formulate, and these are often chosen to suit the generative modeling algorithm. Here are just a few:

Maximum Mean Discrepancy (MMD)
Jensen-Shannon Divergence (JSD)
Kullback-Leibler divergence (KLD)
Reverse KLD
Kernelized Stein discrepancy (KSD)
Bregman Divergence
Hyvärinen score
Chi-Squared Divergence
Alpha Divergence

Divergences between two distributions, unlike metrics, need not be symmetric. In the limit of infinite data and compute, all these divergences arrive at the same answer, that is, $D(p_\theta || p) = 0$ iff $p_\theta \equiv p$. Note that these divergences are distinct from perceptual evaluation metrics like Inception Score or Fréchet Inception Distance, which do not guarantee convergence to the same result in the high-data limit (but are useful metrics if you care about visual quality of images).

However, most experiments see a finite amount of data and compute, so the choice of metric matters and can actually change the qualitative behavior of what generative distribution $p_\theta(x)$ ends up being learned. For example, if the target density $p$ is multi-modal and the model distribution $q$ is not expressive enough, minimizing forward KL $D_{KL}(p||q)$ will learn mode-covering behavior while minimizing reverse KL $D_{KL}(q||p)$ results in mode-dropping behavior. See this blog post for an explanation why.
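The mode-covering vs. mode-dropping behavior can be seen in a toy experiment. This is a sketch in plain NumPy with made-up numbers: a bimodal target discretized on a grid, a single-Gaussian model family, and brute-force search over (mean, std) in place of gradient descent.

```python
import numpy as np

x = np.linspace(-6.0, 6.0, 241)  # discretization grid

def discretized_gaussian(mean, std):
    w = np.exp(-0.5 * ((x - mean) / std) ** 2)
    return w / w.sum()

def kl(a, b):
    """KL(a || b) in nats for strictly positive probability vectors."""
    return float(np.sum(a * (np.log(a) - np.log(b))))

# Bimodal target: equal mixture of two narrow bumps at -2 and +2.
p = 0.5 * discretized_gaussian(-2.0, 0.5) + 0.5 * discretized_gaussian(2.0, 0.5)

# The model family is a single bump; search its parameters by brute force.
candidates = [(m, s) for m in np.linspace(-3.0, 3.0, 61)
              for s in np.linspace(0.3, 3.0, 28)]
forward_best = min(candidates, key=lambda ms: kl(p, discretized_gaussian(*ms)))
reverse_best = min(candidates, key=lambda ms: kl(discretized_gaussian(*ms), p))

print("forward KL (mean, std):", forward_best)
print("reverse KL (mean, std):", reverse_best)
```

The forward-KL fit sits between the two modes with a large standard deviation (covering both), while the reverse-KL fit locks onto one of the modes with a small standard deviation (dropping the other).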

Thinking about generative modeling in the framework of divergence minimization is useful because it allows us to map what properties we want from our generative models to our choice of divergence objective in a principled way. It may be an implicit density model (GANs) where sampling is tractable but log-probabilities are not available, or an energy-based model where sampling is not available but (unnormalized) log-probabilities are tractable.

This blog post will cover models trained and evaluated using the most straightforward metric: the Kullback-Leibler Divergence. These models include Autoregressive Models, Normalizing Flows, and Variational Autoencoders (approximately). Optimizing KLD is equivalent to optimizing log-probability, and we'll derive why in the next section!

Average Log-Probability and Compression

We want to model $p(x)$, the probability distribution for some data-generating stochastic process. We typically assume that sampling from a sufficiently large dataset is approximately the same thing as sampling from the true data-generating process. For instance, sampling an image from the MNIST dataset is equivalent to drawing a sample from the true handwriting process that created the MNIST dataset.

Given a test set of images $x^1,...,x^N$ sampled i.i.d from $p(x)$, and a likelihood model $p_\theta$ parameterized by $\theta$, we want to maximize the following objective:

$$
\mathcal{L}(\theta) = \frac{1}{N}\sum_{i=1}^{N}\log p_\theta(x^i) \sim \int p(x) \log p_\theta(x) dx = -H(p, p_\theta)
$$

The average log-probability is a Monte Carlo estimator of the negative cross entropy between the true likelihood $p$ and model likelihood $p_\theta$, because we are not able to actually enumerate over all $x^i$. In plain language, this translates to "maximize average likelihood of data", or equivalently, "minimize cross-entropy between true distribution and model distribution".

With a little algebra, the negative cross-entropy can be re-written in terms of KL divergence (relative entropy) and absolute entropy of $p$:

$$-H(p, p_\theta) = -KL(p, p_\theta) - H(p)$$

Shannon’s Source Coding Theorem (1948) tells us that entropy $H(p)$ is the lower bound on average code length for any code you can construct to communicate samples from $p(x)$ losslessly. More entropy means more "randomness", which cannot be compressed. In particular, when we use the natural logarithm $\log_e$ to compute entropy, it takes on the "natural units of information", or nats. When computing entropy in $\log_2$, the resulting units are the familiar "bit". The $H(p)$ term is independent of $\theta$, so maximizing $\mathcal{L}(\theta)$ is really just equivalent to minimizing $KL(p, p_\theta)$. That is why maximum likelihood is also known as minimizing KL divergence.

The KL divergence $KL(p, p_\theta)$, or relative entropy, is the number of "extra nats" you would need to encode data from $p(x)$ using an entropy coding scheme based on $p_\theta(x)$. Therefore, our Monte Carlo estimator $\mathcal{L}(\theta)$ of negative cross entropy is also expressed in nats.

Putting the two together, the cross entropy is nothing more than the average code length required to communicate samples from $p$, using a codebook based on $p_\theta$. We pay a "base fee" of $H(p)$ nats no matter what (the optimal code), and we pay an additional "fine" of $KL(p, p_\theta)$ nats for any deviations of $p_\theta$ from $p$.
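The "base fee plus fine" decomposition is easy to verify on a toy discrete distribution. The two distributions below are made up for illustration. Everything is computed in nats and converted to bits at the end.

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])  # "true" distribution (made up)
q = np.array([0.25, 0.25, 0.25, 0.25])   # model distribution (made up)

entropy = -np.sum(p * np.log(p))          # H(p): the base fee, in nats
cross_entropy = -np.sum(p * np.log(q))    # H(p, q): average code length under q's codebook
kl = np.sum(p * np.log(p / q))            # KL(p, q): the fine for using q instead of p

nats_to_bits = 1.0 / np.log(2.0)
print(entropy * nats_to_bits, kl * nats_to_bits, cross_entropy * nats_to_bits)
```

In bits this works out to a base fee of 1.75, a fine of 0.25, and a total of 2 bits per sample, which is what you would pay for a fixed-length 2-bit code over four symbols.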

We can compare cross entropies of two different models in a very interpretable way: suppose model $\theta_1$ has average likelihood $\mathcal{L}(\theta_1)$ and model $\theta_2$ has average likelihood $\mathcal{L}(\theta_2)$. Subtracting $\mathcal{L}(\theta_2) - \mathcal{L}(\theta_1)$ causes the entropy terms $H(p)$ to cancel out, resulting in $KL(p, p_{\theta_1})-KL(p, p_{\theta_2})$. This quantity is the "reduction of penalty nats you need to pay when switching from code $p_{\theta_1}$ to code $p_{\theta_2}$".

Expressivity, optimization, and generalization are three important properties of a good generative model, and likelihoods offer an interpretable metric with which to debug these properties in our models. If a generative model cannot memorize the training set, it suggests there are difficulties with optimization (getting stuck) or expressivity (underfitting).

The CIFAR10 image dataset has 50000 training samples, so we know that a model that memorizes the data perfectly will assign a probability mass of exactly 1/50000 to each image in the training dataset, thereby achieving a negative cross entropy of $\log_2(\frac{1}{50000})$, or -15.6 bits per image (this is independent of how many pixels there are per image!). Of course, we usually don’t want our generative models to overfit to such extremes, but it’s useful to keep this upper bound in mind as a sanity check when debugging your generative model.
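For reference, the memorization bound is a one-liner. The helper name below is my own, and the per-dimension figure assumes 32x32x3 = 3072 values per image.

```python
import math

def memorization_log_likelihood_bits(num_training_examples):
    """Average log2-likelihood per image for a model that puts mass 1/N on each training image."""
    return math.log2(1.0 / num_training_examples)

bits_per_image = memorization_log_likelihood_bits(50000)
bits_per_dim = bits_per_image / 3072  # spread over the 3072 values of a 32x32 RGB image

print(round(bits_per_image, 2), bits_per_dim)
```

The per-image value is about -15.61 bits. If the training log-likelihood of your model is far below this, the model has not come close to memorizing the training set.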

Comparing the difference between training and test likelihoods can tell us if the networks are memorizing the training set or learning something that generalizes to the test set, or whether there are semantically meaningful modes in the data that the model fails to capture.

Which Distribution Should You Use For Modeling Image Pixels?

There are plenty of ways to parameterize an image. For instance, you can represent an image via a 3D scene that is projected (rendered) into 2D. Or you can parameterize images as vector representations of sketches (like SVG graphics), or Laplacian Pyramids, or even motor torques for a robotic arm that subsequently paint a picture. However, for simplicity, researchers typically model image likelihoods as the joint distribution over RGB pixels - it is a general-purpose digital format that has proven effective for capturing natural data in the visible electromagnetic spectrum.

Each RGB pixel is encoded by a uint8 integer, which can take on 256 possible values. Thus, an image with 3072 pixels and 256 possible values per pixel can take on $256^{3072}$ possible values. There are a finite number of images, which means we could technically represent images using a single $256^{3072}$-sided die. But this number is too large to be represented in memory! Even modeling 3 uint8-encoded pixels jointly as a Categorical results in $256^3=16777216$ possible categories, which is unwieldy even for modern computers. To make things computationally tractable, we must "factorize" the likelihood for the whole image into a combination of conditionally independent pixel-wise distributions. One easy factorization is to make each pixel likelihood independent of one another:

$$ p(x_1, ..., x_N) = p(x_1)p(x_2)...p(x_N)$$

This is also known as a mean-field decoder (see this comment for where the name “mean-field” comes from). Each pixel-wise distribution has its own tractable density or mass function.

Another choice is to make the pixel likelihood autoregressive, where each conditional distribution has its own tractable density or mass function.

$$ p(x_1, ..., x_N) = p(x_1)p(x_2|x_1)...p(x_N|x_1,...,x_{N-1})$$
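To make the difference between the two factorizations concrete, here is a toy "image" with three binary pixels and an arbitrary joint distribution (random numbers, purely illustrative). The autoregressive factorization reproduces the joint exactly via the chain rule, while the mean-field factorization is a valid distribution but in general a different one.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2))
joint /= joint.sum()                   # arbitrary joint p(x1, x2, x3)

p1 = joint.sum(axis=(1, 2))            # p(x1)
p12 = joint.sum(axis=2)                # p(x1, x2)
p2_given_1 = p12 / p1[:, None]         # p(x2 | x1)
p3_given_12 = joint / p12[:, :, None]  # p(x3 | x1, x2)

p2 = joint.sum(axis=(0, 2))            # marginal p(x2)
p3 = joint.sum(axis=(0, 1))            # marginal p(x3)

autoregressive = np.zeros_like(joint)
mean_field = np.zeros_like(joint)
for a, b, c in itertools.product(range(2), repeat=3):
    autoregressive[a, b, c] = p1[a] * p2_given_1[a, b] * p3_given_12[a, b, c]
    mean_field[a, b, c] = p1[a] * p2[b] * p3[c]

print(np.allclose(autoregressive, joint), np.allclose(mean_field, joint))
```

The first check passes and the second does not: independence is a modeling assumption that buys tractability at the cost of ignoring correlations between pixels.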

We still have to figure out how to model each conditional distribution though. Here are some common choices along with an example paper that used them:

Bernoulli probabilities over each channel (DRAW)
256-way Categorical distribution over each channel (PixelRNN, Image Transformer)
Continuous density on de-quantized data (Real-NVP)
Discretized logistic mixture (PixelCNN++, Image Transformer)

Pixel Values as Bernoulli Emission Probabilities

The MNIST, FashionMNIST, and NotMNIST datasets are good choices to start with when debugging your likelihood models:

Those datasets can be stored completely in computer RAM
They do not require a lot of architecture tuning (allowing you to focus on the algorithmic aspects)
Small generative models for these datasets can train on modest hardware, such as a modern laptop lacking a GPU.

It is common to choose conditional pixel likelihoods $ p(x_i)$ to be modeled as Bernoulli random variables. For binarized data where pixel values are only 0 or 1 (heads or tails), this is fine.

Example of a binarized MNIST image. Binarized digits are recognizable, but not so much for natural images.

However, MNIST and its friends are encoded as floating point values in the range [0, 1], where the 256 integers are normalized to lie between these boundaries. There is an expressivity problem here, because Bernoulli variables cannot sample values between 0 and 1!

For papers that train on non-binarized MNIST, we must instead interpret the encoded values as emission probabilities for corresponding Bernoulli variables, i.e. if we see a pixel value of 0.9, it actually represents a Bernoulli likelihood of the pixel being 1, not the sample value itself. The optimization objective is to minimize the cross entropy between predicted probability distribution (parameterized by a scalar emission probability), and the stored emission probability in the data. The cross-entropy of two Bernoullis with emission probabilities $p(1)$ and $p_\theta(1)$ is given by:

$$H(p, p_\theta) = -\left[(1-p(1)) log (1-p_\theta(1)) + p(1) log p_\theta(1)\right]$$

Remember from the earlier section in this post that minimizing this cross entropy results in the same objective as maximizing likelihood! The average log-likelihood (relative entropy) of these toy image datasets is usually reported in units of nats.
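As a rough illustration of the Bernoulli objective above, here is a minimal NumPy sketch. The function name `bernoulli_cross_entropy`, the clipping constant `eps`, and the tiny 2x2 arrays are all made up for this example; they are not taken from any of the papers mentioned.

```python
import numpy as np

def bernoulli_cross_entropy(p_data, p_model, eps=1e-7):
    """Cross entropy H(p, p_theta) in nats, summed over all pixels."""
    p_model = np.clip(p_model, eps, 1.0 - eps)  # avoid log(0)
    per_pixel = -(p_data * np.log(p_model)
                  + (1.0 - p_data) * np.log(1.0 - p_model))
    return per_pixel.sum()

# Hypothetical 2x2 "image" of stored emission probabilities and a model prediction.
p_data = np.array([[0.9, 0.1], [0.0, 1.0]])
p_model = np.array([[0.8, 0.2], [0.1, 0.9]])

nll_nats = bernoulli_cross_entropy(p_data, p_model)
# The cross entropy is smallest when the model reproduces the stored
# emission probabilities; it then equals the entropy of the data.
best_nats = bernoulli_cross_entropy(p_data, p_data)
print(nll_nats, best_nats)
```

Note that even a perfect model does not reach zero here, because the stored probabilities themselves carry entropy whenever they are not exactly 0 or 1.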

The DRAW paper (Gregor et al. 2015) extends this idea to modeling per-channel colors. However, there is a serious drawback to interpreting color pixel data as emission probabilities. When we sample from our generative model, we get noisy, speckly images rather than natural-looking coherent images. Here’s a Python code snippet that reproduces this problem:

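import matplotlib.pyplot as plt
import numpy as np
import tensorflow_datasets as tfds

# Download CIFAR10 and grab a single training image as a uint8 NumPy array.
builder = tfds.builder("cifar10")
builder.download_and_prepare()
datasets = builder.as_dataset()
np_datasets = tfds.as_numpy(datasets)
img = next(iter(np_datasets['train']))['image']

# Treat each normalized channel value as a Bernoulli emission probability and sample from it.
sample = np.random.binomial(1, p=img.astype(np.float32) / 255)

# Show the original image next to the (speckly) Bernoulli sample.
fig, arr = plt.subplots(1, 2)
arr[0].imshow(img)
arr[1].imshow((sample * 255).astype(np.uint8))
plt.show()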

Interpreting pixel values as ‘emission probabilities’ results in unrealistic-looking samples - while it is an O.K. assumption for handwritten digits and sprites, it doesn't work for larger-scale, natural images. Papers that do use Bernoulli decoders will often showcase the emission probabilities (e.g. in a reconstruction or imputation task) rather than actual samples.

Pixel Values as Categorical Distributions

Larger color datasets (SVHN, CIFAR10, CelebA, ImageNet) are encoded in 8-bit RGB color (each channel is a uint8 integer that ranges in value from 0 to 255, inclusive).

Instead of interpreting their pixel values as Bernoulli emission probabilities, we can attempt to model the distribution over actual uint8 pixel values encoded in the image. One of the simplest choices is a 256-way categorical distribution.

Sampling from categorical distributions allows the generative model to sample images rather than emission probabilities.

For color images, it is common to report cross entropies of individual pixels in log base 2, instead of log base e. If a test set with 3072 sub-pixels per image (32×32 pixels × 3 color channels) has an average log-likelihood of $-H(p, q)$ nats per image, the “bits-per-pixel” is just $H(p, q)\div \log (2)\div 3072$.

This metric is motivated by the average-likelihood-as-compression interpretation we discussed earlier: for a pixel that is typically encoded using 8 bits, a generative model $p_\theta$ that achieves, say, 3 bits per pixel lets us devise a lossless entropy coding scheme that compresses the entire dataset with an average length of 3 bits per pixel.
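The unit conversion is easy to get wrong, so here is a small sketch of it. The helper name `bits_per_pixel` is made up for this example, and the input is assumed to be a negative log-likelihood (a positive number) in nats per image.

```python
import numpy as np

def bits_per_pixel(nll_nats_per_image, num_subpixels):
    """Convert a negative log-likelihood in nats per image to bits per sub-pixel."""
    return nll_nats_per_image / np.log(2.0) / num_subpixels

num_subpixels = 32 * 32 * 3  # 3072 channel values in a CIFAR10-sized image

# Sanity check: a uniform 256-way categorical model assigns probability 1/256
# to every channel value, which should cost 8 bits per sub-pixel.
uniform_nll_nats = num_subpixels * np.log(256.0)
uniform_bpp = bits_per_pixel(uniform_nll_nats, num_subpixels)
print(uniform_bpp)
```

Any model worth reporting should land below the 8 bits that the raw uint8 encoding already uses.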

At the time of this writing, the best generative model for Cifar10, Sparse Transformers, achieves a test likelihood of 2.80 bits per pixel. As a point of comparison, PNG and WebP -- widely used algorithms for lossless image compression -- achieve about 5.87 and 4.61 bits on Cifar10 images, respectively (PNG achieves 5.72 bpp if you don’t count the extra bytes like headers and CRC checksums).

This is quite exciting, because it suggests that machine learning can be used for better content-aware entropy-encoding schemes than existing compression schemes. Efficient lossless compression can be used to improve hashing algorithms, make your downloads faster, and improve your Zoom calls, and all of that technology is probably quite feasible today.

Stochastic De-Quantization for Continuous Density Models

If we optimize a continuous density model (such as a mixture of Gaussians) to maximize log-likelihood on discrete data, this can result in a degenerate solution where the model assigns the same density spike to each of the possible discrete values {0, ..., 255}. Even with an infinitely large dataset, the model can achieve arbitrarily high likelihoods by simply squeezing the spikes narrower and narrower.

To address this problem, it is quite common to de-quantize the data by adding noise to the integer pixel values. One such transformation is given by $y = x + u$, where $u$ is a sample from the uniform distribution $U(0,1)$. The first paper that I am aware of that motivates stochastic de-quantization for density modeling is Uria et al. 2013, and it has since become common practice in Dinh et al., 2014, Salimans et al., 2017, and the works that built on top of these papers.

Left: optimizing density models on discrete data can result in a degenerate solution where the model assigns a probability spike on a finite set of discrete values. Stochastic de-quantization is often applied so that we learn likelihood models on continuous data.

A discrete model assigns probability mass over an interval, while a continuous model assigns a density function. Let $P(x)$ and $p(x)$ represent the discrete probability masses and continuous densities of the true data distribution, and let $P_\theta(x)$ and $p_\theta(x)$ represent the same for the model density. We’ll derive below why optimizing the continuous likelihood model $p_\theta(y)$ over de-quantized data $y$ results in optimizing the lower-bound of the actual discrete probability model $P_\theta(x)$:

Integrating the density over a unit interval gives us the total mass implied by the density function:

$$ P_\theta(x) = \int_0^1 p_\theta(x+u) du $$

Our model likelihood objective is trained on de-quantized data sampled from the true data distribution:

$$ \mathbb{E}_{y \sim p}\left[ \log p_\theta(y) \right]$$

By definition of expectation:

$$ = \int p(y) \log p_\theta(y) dy $$

Expanding the integral over $y = x + u$ into a sum over the discrete values $x$ and an integral over the noise $u$,

$$ = \sum_x P(x) \int_0^1 \log p_\theta(x+u) du $$
$$ = \mathbb{E}_{x \sim P} \int_0^1 \log p_\theta(x+u) du $$

Via Jensen’s inequality (for the uniform variable u),

$$ \leq \mathbb{E}_{x \sim P} \log \int_0^1 p_\theta(x+u) du $$
$$ = \mathbb{E}_{x \sim P} \log P_\theta(x) $$
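To make the bound concrete, here is a numerical sanity check for a single pixel value. It assumes a toy continuous model, a Gaussian with made-up parameters `MU` and `SIGMA`, standing in for $p_\theta$; nothing about it comes from the papers cited above.

```python
import math
import numpy as np

MU, SIGMA = 128.2, 0.5  # hypothetical parameters of a Gaussian density p_theta

def log_density(y):
    """log p_theta(y) for the toy Gaussian model."""
    return (-0.5 * ((y - MU) / SIGMA) ** 2
            - math.log(SIGMA * math.sqrt(2.0 * math.pi)))

def cdf(y):
    """Gaussian CDF, used to integrate the density exactly."""
    return 0.5 * (1.0 + math.erf((y - MU) / (SIGMA * math.sqrt(2.0))))

x = 128  # an observed integer pixel value

# Exact discrete log-mass: log P_theta(x) = log of the integral of p_theta over [x, x+1].
log_mass = math.log(cdf(x + 1) - cdf(x))

# De-quantized objective E_u[log p_theta(x + u)] with u ~ U(0, 1),
# approximated with a midpoint rule instead of random samples.
u = (np.arange(10_000) + 0.5) / 10_000
dequantized_objective = log_density(x + u).mean()

print(dequantized_objective, log_mass)
```

The first number should come out smaller than the second, and the difference is the gap introduced by Jensen's inequality.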

A recent paper, Flow++, proposes using a learned de-quantization random variable to improve the tightness of the variational bound. The intuition here is that a single importance-sampled noise variate from $q(u|x)$ results in a lower-variance estimate of the integral $\int_0^1 p_\theta(x+u) du$ than a single sample from a uniform(0, 1) distribution. Because the de-quantization noise is different, one consequence of this work is that density models with different architectures and different de-quantization strategies cannot be compared in a controlled manner.

One way to compare Flow++ and uniformly de-quantized generative models fairly is to permit researchers to use whatever variational bound they like at training time, but standardize the evaluation of likelihood at evaluation time to be some tight multi-sample bound. The intuition here is that as we integrate over more samples, we get a better approximation of the true log-likelihood of the corresponding discrete model $P_\theta(x)$.

For instance, we could report the multi-sample bound from a fixed U(0, 1) de-quantization distribution, as commonly done in VAE literature with multi-sample IWAE bounds. A discussion of VAEs and IWAE bounds is out of the scope of this tutorial, and will be covered in the future.
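Here is a hedged sketch of what such a multi-sample evaluation could look like for one pixel value with uniform de-quantization noise. The Gaussian stand-in for $p_\theta$, its parameters, and the number of trials are assumptions made for illustration only.

```python
import math
import numpy as np

MU, SIGMA = 128.2, 0.5  # hypothetical Gaussian density model p_theta

def density(y):
    return (np.exp(-0.5 * ((y - MU) / SIGMA) ** 2)
            / (SIGMA * math.sqrt(2.0 * math.pi)))

def cdf(y):
    return 0.5 * (1.0 + math.erf((y - MU) / (SIGMA * math.sqrt(2.0))))

x = 128
log_mass = math.log(cdf(x + 1) - cdf(x))  # exact log P_theta(x)

rng = np.random.default_rng(0)
num_trials = 5_000
bounds = {}
for k in (1, 8, 64):
    u = rng.uniform(size=(num_trials, k))           # k noise samples per trial
    estimate = np.log(density(x + u).mean(axis=1))  # log (1/k) sum_j p_theta(x + u_j)
    bounds[k] = estimate.mean()                     # average bound over trials

print(bounds, log_mass)
```

With k = 1 this is the usual single-sample training bound; as k grows, the averaged bound should approach the exact log-mass from below.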

Side Note: Data Preprocessing for Normalizing Flows

Normalizing Flows are a family of generative models that “transform” a base probability distribution into a more complicated probability distribution.

Normalizing Flows learn transformations that have tractable inverses and Jacobian determinants. Being able to compute these two quantities efficiently allows us to calculate the transformed distribution’s log-density, using the change-of-variables rule for $y = f(x)$:

$$ \log p(y) = \log p(x) - \log |\det J(f)(x)| $$
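As a minimal worked example of this rule, consider a one-dimensional affine flow applied to a standard normal base distribution. The `scale` and `shift` values are arbitrary, and a real flow would of course stack many learned transformations.

```python
import numpy as np

def standard_normal_logpdf(x):
    return -0.5 * x ** 2 - 0.5 * np.log(2.0 * np.pi)

# Hypothetical one-dimensional affine flow y = f(x) = scale * x + shift.
scale, shift = 3.0, -1.0

def flow_logpdf(y):
    x = (y - shift) / scale                   # tractable inverse f^{-1}(y)
    log_det_jacobian = np.log(np.abs(scale))  # log |det J(f)(x)| for an affine map
    return standard_normal_logpdf(x) - log_det_jacobian

def normal_logpdf(y, mean, std):
    """Closed-form log-density of N(mean, std^2), used only as a reference."""
    return (-0.5 * ((y - mean) / std) ** 2
            - np.log(std) - 0.5 * np.log(2.0 * np.pi))

y = np.linspace(-5.0, 5.0, 11)
matches = np.allclose(flow_logpdf(y), normal_logpdf(y, mean=shift, std=abs(scale)))
print(matches)
```

The flow's log-density agrees with the closed-form Gaussian because stretching the base distribution by `scale` spreads the same mass over a larger volume, which is what the Jacobian term accounts for.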

The vast majority of Normalizing Flows operate over continuous density functions (thus requiring the volume-tracking Jacobian determinant term), though there is some recent research on “discrete flows” that learn to transform probability mass functions rather than transforming density (Tran et al. 2019, Hoogeboom et al. 2019). We won’t be discussing these flows much in this blog post, but suffice it to say that they work by devising bijective discrete transformations of discrete base distributions.

In addition to the stochastic de-quantization mentioned earlier, there are a couple additional tricks employed when training normalizing flows for image data.

Empirically, re-scaling the data from the range [0, 256] to the unit interval [0, 1] prior to maximum likelihood estimation helps stabilize training, as neural network biases are usually centered around zero.

To prevent boundary issues where a sample from the base distribution could get mapped to a point outside of the re-scaled boundary (0, 1) by the flow, we can transform the re-scaled data to the range $(-\infty, \infty)$ via the logit function (the inverse of the logistic sigmoid function).

We can think of these re-scaling and logistic transforms as "preprocessing" flows at the beginning of the model, where just like any other bijector, we have to account for the change in volume induced by the transformation.

The important thing to realize here is that for evaluation purposes, pixel densities should always be computed in the continuous interval [0, 256], so that we can compare likelihoods from flows and autoregressive models over the same data (up to the variational gap induced by the stochastic dequantization).
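The bookkeeping for these preprocessing steps can be sketched as follows. This is only an illustration: the padding `LAM`, the standard normal "model", and the random fake pixels are assumptions, and a trained flow would replace `base_logpdf`.

```python
import numpy as np

LAM = 0.05  # hypothetical boundary padding, chosen only for illustration

def preprocess(y):
    """Map de-quantized pixels y in [0, 256] to logit space.

    Returns z and log |dz/dy|, the log-volume change of the preprocessing.
    """
    s = LAM + (1.0 - 2.0 * LAM) * (y / 256.0)  # re-scale into [LAM, 1 - LAM]
    z = np.log(s) - np.log1p(-s)               # logit, the inverse of the sigmoid
    log_det = (np.log(1.0 - 2.0 * LAM) - np.log(256.0)
               - np.log(s) - np.log1p(-s))
    return z, log_det

def base_logpdf(z):
    """Standard normal density in logit space (a stand-in for a trained flow)."""
    return -0.5 * z ** 2 - 0.5 * np.log(2.0 * np.pi)

def logpdf_in_pixel_space(y):
    # The map here goes from data y to latent z, so the log-volume term is added.
    z, log_det = preprocess(y)
    return base_logpdf(z) + log_det

rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=3072)  # fake uint8 sub-pixels
y = x + rng.uniform(size=x.shape)    # stochastic de-quantization
bits_per_dim = -logpdf_in_pixel_space(y).mean() / np.log(2.0)

# Sanity check: the density over y should integrate to (almost) 1 on [0, 256].
grid = (np.arange(200_000) + 0.5) * (256.0 / 200_000)
total_mass = np.exp(logpdf_in_pixel_space(grid)).mean() * 256.0
print(bits_per_dim, total_mass)
```

Dropping the `log_det` term would report a density in logit space, which is not comparable to numbers reported by models that work directly on [0, 256].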

Here is a diagram showing a standard normalizing flow for RGB images, with the original discrete data on the left and the base density (can be a Gaussian, a logistic, or whatever your favorite tractable density is) on the right.

Generative model likelihoods are typically reported in de-quantized space (green). Starting from Dinh et al. 2016, many flow-based models re-scale pixels to $[\lambda, 1-\lambda]$ and apply the logit function (inverse sigmoid) to help with numerical stability on boundary conditions.

Discretized Logistic Mixture Likelihood

One drawback of modeling pixels with categorical distributions is that a categorical cross entropy loss cannot tell us that a pixel value of 127 is closer to 128 than it is to 0. For an observed pixel value $p$, the gradient of the categorical cross entropy is constant with respect to pixel intensity (since the loss treats the categories as un-ordered). Although the cross entropy gradient is non-zero, it is said to be “sparse” because it does not provide information on how close (in pixel intensity) we are to the target distribution. Ideally, we would like gradients to be larger in magnitude when the predicted intensity is far away from the observed value, and smaller when the model is close.

A more serious problem with modeling pixels as categorical distributions is that if we choose to represent more than 256 categories, we’d be in trouble. For example, we might want to model the R, G, and B pixels jointly (256^3 categories!) or model higher-precision pixel encodings than uint8 for HDR images. We’d quickly run out of memory attempting to store the projection matrices needed to map neural net activations to logits for that many categories.

Two concurrent papers, Inverse Autoregressive Flow and PixelCNN++, solve these problems by introducing a probability distribution for modeling RGB pixels as ordinal data, for which gradient information from the cross entropy loss can push pixels in the right direction while still preserving a discrete probability model.

We can model continuous pixel probability densities via a mixture of logistics, which is a continuous distribution. To recover probability mass for discrete pixel values, we can use a convenient property of the logistic distribution: its CDF is the sigmoid function. By subtracting two sigmoids, CDF(x+0.5) - CDF(x-0.5), we can recover the total probability mass lying within half a unit of the integer pixel value x.

For example, the probability of a pixel having value=127 is modeled as the probability mass lying between 126.5 and 127.5 for a continuous mixture of logistic distributions. The probability model must also account for edge cases such that CDF(0-0.5) is 0 and CDF(255+0.5) is 1, as is required of probability distributions.
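Here is a simplified sketch of this construction for a single sub-pixel. It works directly in probability space on the 0..255 scale with made-up mixture parameters; it is not the PixelCNN++ implementation, which (as far as I know) works in log-space for numerical stability.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def discretized_logistic_mixture_pmf(weights, means, scales):
    """Probability mass over pixel values 0..255 under a mixture of logistics."""
    x = np.arange(256)[:, None]                  # shape (256, 1)
    upper = sigmoid((x + 0.5 - means) / scales)  # CDF(x + 0.5) per component
    lower = sigmoid((x - 0.5 - means) / scales)  # CDF(x - 0.5) per component
    # Edge cases: mass below -0.5 is folded into pixel 0,
    # and mass above 255.5 is folded into pixel 255.
    lower[0, :] = 0.0
    upper[255, :] = 1.0
    return ((upper - lower) * weights).sum(axis=1)

# Hypothetical 3-component mixture for a single sub-pixel.
weights = np.array([0.5, 0.3, 0.2])
means = np.array([30.0, 127.0, 250.0])
scales = np.array([5.0, 10.0, 3.0])

pmf = discretized_logistic_mixture_pmf(weights, means, scales)
print(pmf.sum(), pmf[127], pmf[0])
```

Because neighboring bins share a CDF boundary, the masses telescope and sum to one, and nudging a component mean toward an observed value increases that value's probability smoothly, which is the ordinal gradient signal a plain categorical lacks.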

Representing pixels in this way also affords the luxury of being able to handle a lot more categories, which means that PixelCNN++ can model the R, G, and B pixel channels at once. The caveat here is that you must tune the number of mixture components adequately (on Cifar-10, 5 seems to be enough).

Analogous to how Tran et al. 2019 devise discrete flows on top of categorical distributions, Hoogeboom et al. 2019 devise discrete flows for ordinal data by using this discretized logistic mixture likelihood as the base distribution. This gets the best of both worlds: we get to use normalizing flows for tractable inverses and sampling, while avoiding having to optimize a de-quantized likelihood objective (which may incur a variational lower bound penalty on the likelihood). Both are very exciting papers that I hope to write about more in the future!

Perplexity

Log-likelihood is also a common metric for evaluating generative models in the language modeling domain. A discrete alphabet without ordering makes Categorical distributions the most natural choice for density modeling.

One quirk of the natural language processing (NLP) field is that language model likelihoods are often evaluated in units of Perplexity, which is given by $2^{H(p, q)}$. The log of the inverse of perplexity, $\log_2 2^{-H(p, q)}$, is nothing more than average log-likelihood $-H(p, q)$. Perplexity is an intuitive concept since inverse probability is just the "branching factor" of a random variable, or the weighted average number of choices a random variable has. The relationship between perplexity and log-likelihood is so straightforward that some papers (Image Transformer) actually use the word “perplexity” interchangeably with log-likelihoods.
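A small sketch of the perplexity computation, assuming we already have the probabilities a model assigned to each observed token (the arrays below are made up for illustration):

```python
import numpy as np

def perplexity(probs_of_observed_tokens):
    """Perplexity 2**H, where H is the average negative log2-probability per token."""
    cross_entropy_bits = -np.mean(np.log2(probs_of_observed_tokens))
    return 2.0 ** cross_entropy_bits

# A model that spreads its mass uniformly over a 50-symbol alphabet has a
# "branching factor" of 50 at every step.
uniform_probs = np.full(1000, 1.0 / 50.0)
uniform_ppl = perplexity(uniform_probs)

# Hypothetical probabilities that a better model assigned to observed tokens.
model_probs = np.array([0.5, 0.25, 0.125, 0.25])
model_ppl = perplexity(model_probs)

print(uniform_ppl, model_ppl)
```

The second model averages 2 bits per token, so its perplexity is about 4: it is as uncertain as a fair choice among four options.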

Closing Thoughts

In this blog post we derived the relationship between maximizing average log-likelihood and compression. We also mentioned several modeling choices between discrete and continuous likelihood models for individual pixels.

There is a broader question of whether likelihood is even the right quantity to be measuring / optimizing. At NIPS 2016 (now known as the NeurIPS conference), I recall there being a pretty lively debate in the generative modeling workshop where people were debating whether optimizing tractable-likelihood models was even a good idea.

It turns out that optimizing and evaluating against likelihood was a good idea after all, because since then researchers have figured out how to build and scale up much more flexible likelihood models while keeping them computationally tractable. Models like Glow, GPT-2, WaveNet, and Image Transformer are trained with likelihood and can generate image, audio and text samples with stunning quality. On the other hand, one might argue that at the end of the day, generative modeling needs to be coupled to performance on an actual task, such as classification accuracy when the model is fine-tuned on a labeled dataset. My colleague Niki Parmar says the following about images vs text likelihood models:

On text, there is generally a pattern where better likelihood leads to better performance on downstream tasks like GLUE. On images, I've heard from other practitioners that pixel prediction doesn't work as a pre-training task for downstream tasks like image classification. It could be because pixels don't mean much in terms of representations as compared to word-pieces or words in text. It's an open question but I find it fascinating that representation learning in images is quite different, almost difficult to establish and measure.

In a future blog post, I’ll build on top of this tutorial and discuss the evaluation of generative models that optimize variational lower bounds on log-likelihood (e.g. Variational Autoencoders, importance-weighted autoencoders).

Acknowledgements

Many thanks to Dustin Tran, Niki Parmar, and Vincent Vanhoucke for reviewing drafts of this blog post. As always, thank you for reading!

Lessons from AI Research Projects: The First 3 Years

2019-05-23T17:32:00.002-07:00

Translations: 中文

I've been at Google Brain robotics (now referred to as Robotics @ Google) for nearly 3 years. It's helpful to reflect, from time to time, on the scientific, engineering and personal productivity takeaways gleaned from working on large research projects. Every researcher's unique experiences and experimentation can potentially become their personal competitive edge for thinking about new problems in unique ways. Here are mine (so far).

These are ordered chronologically (earliest work first), so that the reader can see how my past experiences shape my current biases and beliefs (orange = first author).

Categorical Reparameterization with Gumbel-Softmax

The importance of a work environment that encourages serendipitous discovery and 20% time (the inspiration for Gumbel-Softmax came to me in a water cooler conversation I was having with Shane Gu).
Research on very basic techniques (e.g. generative modeling) can have a huge impact through various downstream applications.
The simplest method to implement is the one that gets cited the most.

End-to-End Learning of Semantic Grasping

The notion of a "class label" is meaningless, and is the wrong way to tackle goal-conditioned grasping.
ML can help robotics, but robotics can also help ML (i.e. retroactive labeling via present poses).
The importance of moving fast, investing in visualization and analysis tools (e.g. notebooks) that do not require a robot.

Time Contrastive Networks

All you need is high-quality data and a contrastive loss. Pierre Sermanet is fond of saying, tongue-in-cheek, that these two things will get us to AGI.
Dream big.

Deep Reinforcement Learning for Vision-Based Robotic Grasping

The importance of a fast prototyping environment and quick experiment turnaround times.
Q-Learning works and scales pretty well.

QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation

Most people don’t really care how QT-Opt is trained; they are excited about what a trained QT-Opt system can do.
All you need is scale, compute, and data.

Grasp2Vec: Learning Object Representations from Self-Supervised Grasping

Magical things can happen if you focus on innovations in better-structured data, instead of better algorithms (all you need is high-quality data and a contrastive loss).
The notion of a class label is meaningless.
Good reward functions are a very nice piece of "Software 2.0" infrastructure: modular functionality, quick to verify for correctness, and does not impose strong assumptions on upstream or downstream computations (in contrast to RL algorithms).
More on Twitter.

Generative Ensembles for Robust Anomaly Detection

Thinking deeply about the nature of the OoD problem and different types of uncertainty.
The OoD problem is ill-posed, but still useful for practical applications.
OoD and generalization are two sides of the same coin.
I spent 10 days in Jeju mentoring DL camp students. Every day I woke up, ate 3 meals in the same cafeteria downstairs, had no meetings, and thought really hard about the research problem. This monastic working environment was tremendously useful for my creative "flow".

Watch, Try, Learn: Meta-Learning from Demonstrations and Rewards

Optimal control theory says that we need RL to make robots work, but you can get surprisingly far with the original Deep Learning recipe: supervised learning + lots of data + architecture tuning.
Meta-Learning is all about pushing the burden of learning into the prior.
Generative modeling (e.g. principled approaches to density estimation, being able to fit multi-modal distributions) is important for scaling up robotics.
More on Twitter.

General Lessons from Deep RL + Robotics

I am increasingly of the opinion that the biggest wins in making an ML system work come from high-quality data. Many researchers in sub-fields of ML do not prioritize the choice of data when looking for ways to improve on benchmarks. Deep RL on real robots is a great way to do ML research, because the researcher is forced to gather their own dataset and contend with how data biases generalization outcomes.
Robotics is full-stack ML (gathering and serializing custom data, building a custom data pipeline, training and evaluation binaries, inference on a real robotic system), which increases iteration times & decreases opportunities for spontaneous creativity and discovery. Robotics projects tend to take ~1 FTE year to finish, while most DL papers can be completed in 2-3 months. One of the most important things to me right now is figuring out how we can achieve the same iteration speeds in robotics as achieved in other deep learning domains.
Best software engineering practices for de-risking Deep RL engineering are in their early days. How to keep a full-stack dev environment flexible and fast to iterate on (scientific, creative risk) while keeping technical debt from bubbling over (execution risk)? My colleagues and I designed Tensor2Robot to solve a lot of our large-scale ML + robotics problems, but this is just the beginning.

The scope of this post is limited to my own research projects. Of course, there are papers that I didn't work on that inspire my views tremendously. I'll mention those in a follow-up blog post.

Fun with Snapchat's Gender Swapping Filter

2019-05-12T18:36:00.004-07:00

Snapchat's new gender-bending filter is a source of endless fun and laughs at parties. The results are very pleasing to look at. As someone who is used to working with machine learning algorithms, it's almost magical how robust this feature is.

I was so duly impressed that I signed up for Snapchat and fiddled around with it this morning to try and figure out what's going on under the hood and how I might break it.

N.B, this is not a serious exercise in reverse-engineering Snapchat's IPA file or studying how other apps engineer similar features; it's just some basic hypothesis testing into when it works and when it doesn't, plus a little narcissistic bathroom selfie fun.

Initial Observations

The center picture is a standard bathroom selfie. To the left is the "male" filter, and on the right the "female" filter.

The first thing most users probably notice is that the app works in real time, works with a few different face angles, and does not require an internet connection to run. Hair behaves very naturally when wearing a beanie.

Here's a rotating profile shot. The app seems to detect whether the face is pointing in a permissible orientation, and only if that boolean is satisfied does the filter get applied.

Gender swap works in a variety of lighting conditions, though the hair does not seem to cast shadows.

Damn! I look cute.

Here was an example that I thought was really cool - the hair captures the directional key lighting.

Occlusion Tests

Ok, it works pretty well. Can we get it to fail? The app detects when the face is in the wrong pose, but what if there are things occluding the face? Do those occluding objects get "transformed" too?

The answer is yes. Below is a test where I slide an object across my face. The app works when half the face is occluded, but it seems like if too much of the face is blocked, the "should I face swap" bit is set to False.

Here's vertical occlusion, where the bit seems to depend on "what percentage of the face real estate is occluded" rather than what important semantic features (e.g. eyes, lips) are occluded. Right before the app decides that the "should I face swap" should switch to "False", you can see the blurring of the white bottle. Also, my hair turns blonde as I center the bottle in view.

Very interesting. This suggests to me that there is definitely some machine learning going on here, and it's picking up on some statistical artifact of the data it was trained on. Do blondes tend to make more makeup tutorials or something?

I partially covered my face in a black charcoal masque, and things seemed pretty stable. The female filter does lighten the masque a bit. It's pretty easy to tell from this GIF that the "face swap" feature is confined to a rectangular region that tracks the head (note the sharp cutoff of the hair as it gets to my shoulders).

The filter stops working once I cover the rest of my face in the masque. Interestingly enough, the ovoid regions of my uncovered skin seem to be detected as faces, and the app proceeds to perform the style transform on that region. You can see the head and face templates flickering in and out like some kind of Junji Ito horror story.

Peeling off the masque is surprisingly stable.

Hair Layer

I was most impressed by the realism of the hair, so I wanted to figure out whether there were any hair mesh models used for dynamic lighting, or whether it was all machine-learning based.

The hair seems to be rendered as the topmost layer (like a Photoshop layer), but unlike your basic puppy ear/tongue filter, this hair layer has an alpha channel that is partially transparent. If you look closely there is also a clear segmentation mask for the hair that allows the face to show through. Snapchat is probably doing head tracking to figure out where the head is, computing the 2D alpha mask for the hair.

How does it work? A guess

At first glance, my mind jumped to some sort of CycleGAN architecture that maps the distribution of male faces to female faces, and vice versa. The dataset would be the billions of selfies Snap has, er, not deleted in the last 8 years.

This does raise a lot of questions though:

Are they training truly unpaired image translation? That would be incredibly impressive, given that CycleGAN is bonkers and shouldn't even work in the first place. I would bet they have an unpaired alignment objective that is regularized by a limited dataset of ground-truth pairs, such as pairs of images of male/female siblings, or even a hand-designed gender transform that acts as data augmentation (e.g. making the jawline rounder can be done without machine learning).
The hair and face transforms seem to be synthesized independently, given that they occupy different layers (or perhaps synthesized together and separated into different layers right before rendering). This is also the first instance I've seen of GANs being used to render the alpha channel. I am a bit dubious of whether the hair is even generated by a GAN at all. On one hand, there is clearly some smooth function that switches out highlights and hair colors as a function of the positioning of an occluding object, suggesting that colors are probably learned partially from data. On the other hand, the hair is so stable that I have a hard time believing it is synthesized completely with a GAN generator. I have seen a few examples of other East Asian male face swaps with similar hairdos, suggesting that maybe there is a large-ish template library of hairdos (that is refined with some ML model).
How do Snap's ML engineers know whether a CycleGAN has converged for such an enormous dataset?
How do they get these neural nets to run with such limited compute budgets? What sorts of image resolutions are they generating on the fly?

If it indeed is a CycleGAN, then applying the male filter to a female-filtered image of me should recover the original image, right?

The image is mostly scale invariant, but as we zoom in pretty close, the face does resemble mine more. I would guess that there is a preprocessing step that crops and resizes the canonical face image prior to feeding it to a neural net.
There are also probably other subroutines in the filter like jaw resizing that don't use a CycleGAN, but whose addition would cause the M2F and F2M filters to no longer be exact inverses of each other.

Implications of Technology

I have a friend who does drag. It's a lot of work! I'm excited for technology like this, because it will make it easier for makeup artists, cosplayers, and drag artists to experiment with new ideas and identities cheaply and quickly.

Technology such as face and voice changing enables a wider gap between public Internet personas and the real people managing those characters. This isn't necessarily a bad thing: if you are born a man but are passionate about being a cute anime girl on the internet, who are we to judge? Will gender fluidity & drag culture become more normalized in society as our daily social media normalize gender-bending?

The future is quite exciting.

What I Cannot Control, I Do not Understand

2019-03-10T15:10:00.004-07:00

Xiaoyi Yin has graciously translated this blog post to 中文.

I often hear the remark around the proverbial AI watering hole that there are no examples of reinforcement learning (RL) deployed in commercial settings that couldn’t be replaced by simpler algorithms.

This is somewhat true. If one takes RL to mean “neural networks trained with DQN / PPO / Soft-Actor Critic etc.”, then indeed, there are no commercial products (yet!) whose success relies on Deep RL algorithmic breakthroughs in the last 5 years [1].

However, if one interprets “reinforcement learning” to mean the notion of “learning from repeated trial and error”, then commercial applications abound, especially in pharmaceuticals, finance, TV show recommendations, and other endeavors based on scientific experimentation and intervention.

I’ll explain in this post how Reinforcement Learning is a general approach to solving the Causal Inference problem, the desiderata of nearly all machine learning systems. In this sense, many high-impact problems are already tackled using ideas from RL, but under different terminology and engineering processes.

Doctor, Won’t You Help Me Live Longer

Let’s suppose you are a doctor tasked with helping your patients live longer. You know a thing or two about data science, so you fit a model on a lot of patient records to predict life expectancy, and make a shocking finding: people who drink red wine every day have a 90% likelihood of living over 80 years, compared to the base probability of 50% for non drinkers.

In the parlance of causal inference, you’ve found the following observational distribution:

p(patient lives > 80 yrs | patient drinks red wine daily) = .9

Furthermore, your model has high accuracy on holdout datasets, which increases your confidence that your model has discovered the secret to longevity. Elated, you start telling your patients to drink red wine daily. After all, as a doctor, it is insufficient to predict; we must also prescribe! And what’s not to like about living longer and drinking red wine on the daily?

Many decades later, you follow up on your patients and -- with great disappointment -- observe the following interventional distribution:

p(patient lives > 80 yrs | do(patient drinks red wine daily)) = .5

The life expectancy of patients on the red wine has not increased! What gives?

Finding the Causal Model

The core problem here lies in confounding variables. When we decided to prescribe red wine to patients based on the observational model, we made a strong hypothesis about the causality diagram:

The directed edges between these random variables denote causality, which can also be thought of as "the arrow of time". Changing the value of the “Drinks Red Wine” variable ought to have an effect on “Lives > 80 years”, but changing “Lives > 80 years” has no effect on drinking red wine.

If this causal diagram was correct, then our intervention should have increased the lifespan of patients. But the actual experiment does not support this, so we must reject this hypothetical causal model and reach for alternative hypotheses to explain the data. Perhaps there are one or more variables that cause a higher propensity of red wine drinking, AND living longer, thus correlating those variables together?

We make the educated guess that a confounding variable might be that wealthy people tend to simultaneously live longer and drink more wine. Combing through the data again, we find that P(drinks red wine | is wealthy) = 0.9 and P(lives > 80 | is wealthy) = 1.0. So our hypothesis now takes the form:

If our understanding of the world is correct, then do(is wealthy) should make people live > 80 years and drink more red wine. And indeed, we find that once we give patients $1M cash infusions to make them wealthy (by USA standards), they end up living longer and drinking red wine daily (this is a hypothetical result, fabricated for the sake of this blog post).
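
The gap between the observational and interventional distributions can be reproduced in a few lines of simulation. The sketch below is a toy model, not an analysis of real data: it assumes that half of all patients are wealthy, that P(drinks red wine | not wealthy) = 0.1, and that non-wealthy patients never live past 80. Only P(drinks red wine | is wealthy) = 0.9 and P(lives > 80 | is wealthy) = 1.0 come from the example above, and the function names are made up for illustration.

```python
import random

def sample_patient(rng, do_wine=None):
    """Sample (wealthy, drinks_wine, lives_past_80) from an assumed toy causal model."""
    wealthy = rng.random() < 0.5  # assumption: half of patients are wealthy
    if do_wine is None:
        # Observational regime: wine drinking depends on wealth.
        drinks_wine = rng.random() < (0.9 if wealthy else 0.1)
    else:
        # Interventional regime: do(wine) cuts the wealth -> wine edge.
        drinks_wine = do_wine
    # Assumption: longevity is decided by wealth alone (1.0 vs. 0.0).
    lives_past_80 = rng.random() < (1.0 if wealthy else 0.0)
    return wealthy, drinks_wine, lives_past_80

def p_long_life_given_wine(n, seed, do_wine=None):
    """Fraction of wine drinkers who live past 80, under either regime."""
    rng = random.Random(seed)
    patients = (sample_patient(rng, do_wine) for _ in range(n))
    outcomes = [lived for _, wine, lived in patients if wine]
    return sum(outcomes) / len(outcomes)

observational = p_long_life_given_wine(100_000, seed=0)
interventional = p_long_life_given_wine(100_000, seed=0, do_wine=True)
print(observational, interventional)
```

Under these assumptions the observational estimate lands near .9 and the interventional one near .5, because forcing everyone to drink wine does nothing to change who is wealthy.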

RL as Automated Causal Inference

ML models are increasingly used to drive decision making in recommender systems, self-driving cars, pharmaceutical R&D, and experimental physics. In many cases, we desire an outcome event $y$, for which we attempt to learn a model $p(y|x_1, \ldots, x_N)$ and then choose inputs $x_1, \ldots, x_N$ to maximize $p(y|x_1, \ldots, x_N)$.

It should be quite obvious from the previous medical example that to avoid causality when building decision-making systems is to risk overfitting models that are not useful for prescribing intervention. Suppose we automated the causal model discovery process in the following manner:

1) Fit an observational model p(y|x_1, x_2, … x_N) to the data.
2) Assume the observational model captures the causal model. Prescribe an intervention do(x_i) that maximizes p(y|x_1, x_2, … x_N), and gather a new dataset in which 50% of the samples receive the intervention on x_i and 50% do not.
3) Fit an observational model p(y|x_i) to the new data.
4) Repeat steps 1-3 until the observational model matches the interventional model: p(y|do(x_i)) = p(y|x_i).

To return to the red wine case study as a test case:

You would initially have p(live > 80 years | drink red wine daily) = .9.
Upon gathering a new dataset, you would obtain p(live > 80 years | do(drink red wine daily)) = .5. Model is not converged, but at least your observational model no longer believes that drinking red wine explains living longer. Furthermore, it now pays attention to the right variable, that p(live > 80 years | is_wealthy) = 1.
The subsequent iteration of this procedure then finds that p(live > 80 years | do(is wealthy)) = 1, so we are done.

The act of gathering a randomized trial (the 50% split of intervention vs. non-intervention) and re-training a new observational model is one of the most powerful ways to do general causal inference, because it uses data from reality (which “knows” the true causal model) to stamp out incorrect hypotheses.
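
As a rough sketch of such a randomized trial, the snippet below assigns half of a simulated population to do(x = True) and the other half to do(x = False), then compares outcome rates between the two arms for each candidate variable. The toy world (wealth causes both wine drinking and longevity, with a 50% base rate of wealth) and every name in it are assumptions made for illustration.

```python
import random

def simulate(rng, do=None):
    """Sample one patient; `do` maps a variable name to a forced value."""
    do = do or {}
    # Assumed toy world: wealth -> wine and wealth -> longevity, nothing else.
    wealthy = do["wealthy"] if "wealthy" in do else rng.random() < 0.5
    wine = do["wine"] if "wine" in do else rng.random() < (0.9 if wealthy else 0.1)
    return {"wealthy": wealthy, "wine": wine, "long_life": wealthy}

def trial_effect(rng, var, n):
    """Difference in P(long_life) between the treated and control arms."""
    treated = [simulate(rng, {var: True}) for _ in range(n // 2)]
    control = [simulate(rng, {var: False}) for _ in range(n // 2)]
    rate = lambda rows: sum(r["long_life"] for r in rows) / len(rows)
    return rate(treated) - rate(control)

rng = random.Random(0)
effects = {var: trial_effect(rng, var, 20_000) for var in ("wine", "wealthy")}
print(effects)
```

In this toy world the estimated effect of do(wine) is close to zero while the effect of do(wealthy) is one, which is the signal that rules out the first causal hypothesis and supports the second.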

Repeatedly training observational models and suggesting interventions is what RL algorithms are all about: solving optimal control for sequential decision-making problems. Control is the operative word here - the true test of whether an agent understands its environment is whether it can solve it.

For ML models whose predictions are used to infer interventions (so as to manipulate some downstream random variable), I argue that the overfitting problem is nothing more than a causal inference problem. This also explains why RL tends to be much harder as a machine learning problem than supervised learning - not only are there fewer bits of supervision per observation, but the RL agent must also figure out the causal, interventionist distribution required to behave optimally.

One salient case of “overfitting” arises in RL algorithms that can, in theory, be trained “offline” -- that is, learning entirely from off-policy data without gathering new data samples from the environment. However, without periodically gathering new experience from the environment, agents can overfit to finite-size datasets or dataset imbalances, and propose interventions that do not generalize past their offline data. The best way to check if an agent is “learning the right thing” is to deploy it in the world and verify its hypotheses under the interventionist distribution. Indeed, for our robotic grasping research at Google, we often find that fine-tuning with “online” experience improves performance substantially. This is equivalent to re-training an observational model on new data p(grasp success | do(optimal_action)).

Production "RL"

The A/B testing framework often used in production engineering is a manual version of the "automated causal inference" pipeline, where a random 50% of users (assumed to be identically distributed) are shown one intervention and the other 50% are shown the control.

This is the cornerstone of data-driven decision making, and is used widely at hedge funds, Netflix, StitchFix, Google, Walmart, and so on. Although this process has humans in the loop (specifically for proposing interventions and choosing the stopping criterion), there are many related nuances to these methodologies that also arise in RL literature like data non-stationarity, the difficulty of obtaining truly randomized experiments, and long-term credit assignment. I’m just starting to learn about causal inference myself, and hope that in the next few years there will be more cross-fertilization of ideas between the RL, Data Science, and Causal Inference research communities.

For a more technical introduction to Causal Inference, see this great blog series by Ferenc Huszar.

[1] A footnote on why I think RL hasn’t had much commercial deployment yet. Feel free to clue me in if there are indeed companies using RL in production that I don’t know about!

In order for a company to be justified in adopting RL technology, the problem at hand needs to be 1) commercially useful 2) feasible for current Deep RL algorithms 3) the marginal utility of optimal control must be worth the technical risks of Deep RL.

Let’s consider deep image understanding by comparison: 1) everything from surveillance to self-driving cars to FaceID is highly commercially interesting 2) current models are highly accurate and scale well to a variety of image datasets 3) the models generally work as expected and do not require great expertise to train and deploy.

As for RL, it doesn’t take a great imagination to realize that general RL algorithms would eventually enable robots to learn skills entirely on their own, or help companies make complex financial decisions like stock buybacks and hiring, or enable far richer NPC behavior in games. Unfortunately, these problem domains don’t meet criteria (2) - the technology simply isn’t ready and requires many more years of R&D.

For problems where RL is plausible, it is difficult to justify being the first user of a technology whose marginal utility to your problem of choice is unproven. Example problems might include datacenter cooling or air traffic control. Even for domains where RL has been shown clearly to work (e.g. low-dimensional control or pixel-level control), RL still requires a lot of research skill to build a working system.

Meta-Learning in 50 Lines of JAX

2019-02-21T08:06:00.001-08:00

Github repo here: https://github.com/ericjang/maml-jax

Adaptive behavior in humans and animals occurs at many time scales: when I use a new shower handle for the first time, it takes me a few seconds to figure out how to adjust the water temperature to my liking. Upon reading a news article, I obtain new information that I didn't have before. More difficult skills, such as mastering a musical instrument, are acquired over a lifetime of deliberate practice.

Learning is hardly restricted to animal-level intelligence; it can be found in every living creature. Multi-cellular developmental programs are highly plastic and can even store epigenetic “memories” between generations. At the longest time-scales, evolution itself can be thought of as “learning” on the genomic level, whereby favorable genetic codes are discovered and remembered over the course of many generations. At the shortest of timescales, a single ion channel activating in response to a stimulus can also be thought of as “learning”, as it is an adaptive, stateful response to the environment. Biological intelligence blurs the boundaries between “behavior” (responding to the environment), “learning” (acquiring information about the world in order to improve fitness), and “optimization” (improving fitness).

The focus of Machine Learning (ML) is to imbue computers with the ability to learn from data, so that they may accomplish tasks that humans have difficulty expressing in pure code. However, what most ML researchers call “learning” right now is but a very small subset of the vast range of behavioral adaptability encountered in biological life! Deep Learning models are powerful, but require a large amount of data and many iterations of stochastic gradient descent (SGD). This learning procedure is time-consuming and once a deep model is trained, its behavior is fairly rigid; at deployment time, one cannot really change the behavior of the system (e.g. correcting mistakes) without an expensive retraining process. Can we build systems that can learn faster, and with less data?

“Meta-learning”, one of the most exciting ML research topics right now, addresses this problem by optimizing a model not just for the ability to “predict well”, but also the ability to “learn well”. Although Meta-Learning has attracted a lot of research attention in recent years, related ideas and algorithms have been around for some time (see Hugo Larochelle's slides and Lilian Weng’s blog post for an excellent overview of related concepts).

This blog post won’t cover all the possible ways in which one can build a meta-learning system; instead, this is a practical tutorial on how to get your feet wet in meta-learning research. Specifically, I'll show you how to implement the MAML meta-learning algorithm in about 50 lines of Python code, using Google's awesome JAX library.

You can find a self-contained Jupyter notebook here reproducing this tutorial.

An Operator Perspective on Learning and Meta-Learning

“Meta-learning” is used in so many different research contexts nowadays that it's difficult to communicate to other researchers what I’m exactly working on when I say “Meta-Learning”. A source of this confusion stems from the blurred semantics between “optimization”, “learning”, “adaptation”, “memory”, and how these terms can be employed in wildly different applications.

This section is my attempt to make the definition of “learning” and “meta-learning” more mathematically precise, and explain why seemingly different algorithms are all branded as “meta-learning” these days. Feel free to skip to the next section if you want to dive straight into the MAML+JAX coding tutorial.

We define a learning operator $f : F_\theta \to F_\theta$ as a function that improves a model function $f_\theta$ with respect to some task. A common learning operator used in deep learning and reinforcement learning literature is the stochastic gradient descent algorithm, with respect to a loss function. In standard DL contexts, learning occurs over hundreds of thousands or even millions of gradient steps, but generally, “learning” can also occur on shorter timescales (conditioning) or longer ones (hyperparameter search). In addition to explicit optimization, learning can also be implemented implicitly via a dynamical system (recurrent neural networks conditioning on the past) or probabilistic inference.

A meta-learning operator $f_o(f_i(f_\theta))$ is a composite operator of two learning operators: an “inner loop” $f_i \in F_i$ and an “outer loop” $f_o \in F_o$. Furthermore, $f_i$ is a model itself, and $f_o : F_i \to F_i$ is an operator over the inner learning rule $f_i$. In other words, $f_o$ learns the learning rule $f_i$, and $f_i$ learns a model for a given task, where we define “task” to be a self-contained family of problems for which $f_i$ can adequately update $f_\theta$ to solve. At meta-training time, $f_o$ is applied to select for $f_i$ across a variety of training tasks. At meta-test time, we evaluate the generalization properties of $f_i$ and $f_\theta$ to holdout tasks.

The choice of $f_o$ and $f_i$ depends largely on the problem domain. In architecture search literature (also called “learning to learn”), $f_i$ is a relatively slow training procedure of a neural network from scratch, while $f_o$ can be a neural controller, random search algorithm, or a Gaussian Process Bandit.

A wide variety of machine learning problems can be formulated in terms of meta-learning operators. In (meta) imitation learning (or goal-conditioned reinforcement learning), $f_i$ is used to relay instructions to the RL agent, such as conditioning on a task embedding or human demonstrations. In meta-reinforcement learning (MRL), $f_i$ instead implements a “fast reinforcement learning” algorithm by which an agent improves itself after trying the task a couple of times. It’s worth re-iterating here that I don’t see a distinction between “learning” and “conditioning”, because they both rely on inputs that are supplied at test time (i.e. “new information provided by the environment”).

MAML is a meta-learning algorithm that implements $f_i$ via SGD, i.e. $\theta := \theta - \alpha \nabla_{\theta}(\mathcal{L}(\theta))$. This SGD update is differentiable with respect to $\theta$, allowing $f_o$ to effectively optimize $f_i$ via backpropagation without requiring many additional parameters to express $f_i$.
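
To make the differentiability claim concrete, here is the chain rule applied to a single inner step. This is the standard derivation, written with $\theta'$ as shorthand for the updated parameters (notation introduced here, not in the text above):

```latex
\theta' = \theta - \alpha \nabla_\theta \mathcal{L}(\theta)
\qquad\Longrightarrow\qquad
\nabla_\theta \, \mathcal{L}(\theta')
  = \left(I - \alpha \nabla^2_\theta \mathcal{L}(\theta)\right) \nabla_{\theta'} \mathcal{L}(\theta')
```

The Hessian term $\nabla^2_\theta \mathcal{L}(\theta)$ is what makes this a gradient of a gradient, which is the quantity automatic differentiation computes when $f_o$ backpropagates through $f_i$.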

Exploring JAX: Gradients

We begin the tutorial by importing JAX’s numpy drop-in and the gradient operator, grad.

import jax.numpy as np
from jax import grad

The gradient operator grad transforms a python function into another function that computes the gradients. Here, we compute first, second, and third order derivatives of $e^x$ and $x^2$:

f = lambda x : np.exp(x)
g = lambda x : np.square(x)
print(grad(f)(1.)) # = e^{1}
print(grad(grad(f))(1.))
print(grad(grad(grad(f)))(1.))

print(grad(g)(2.)) # 2x = 4
print(grad(grad(g))(2.)) # second derivative of x^2 = 2
print(grad(grad(grad(g)))(2.)) # third derivative of x^2 = 0

Exploring JAX: Auto-Vectorization with vmap

Now let’s consider a toy regression problem in which we try to learn the function $f_\theta(x) = \sin(x)$ with a neural network. The goal here is to get familiar with defining and training models. JAX provides some lightweight helper functions to make it easy to set up a neural network.

from jax import vmap # for auto-vectorizing functions
from functools import partial # for use with vmap
from jax import jit # for compiling functions for speedup
from jax.experimental import stax # neural network library
from jax.experimental.stax import Conv, Dense, MaxPool, Relu, Flatten, LogSoftmax # neural network layers
import matplotlib.pyplot as plt # visualization

We’ll define a simple neural network with 2 hidden layers. We’ve specified an in_shape of (-1, 1), which means that the model takes in a variable-size batch dimension, and has a feature dimension of 1 scalar (since this is a 1-D regression task). JAX’s helper libraries all take on a functional API (unlike TensorFlow, which maintains a graph state), so we get back a function that initializes parameters and a function that applies the forward pass of the network. These callables return lists and tuples of numpy arrays - a simple and flat data structure for storing network parameters.

# Use stax to set up network initialization and evaluation functions
net_init, net_apply = stax.serial(
   Dense(40), Relu,
   Dense(40), Relu,
   Dense(1)
)
in_shape = (-1, 1,)
out_shape, net_params = net_init(in_shape)

Next, we define the model loss to be Mean-Squared Error (MSE) across a batch of inputs.

def loss(params, inputs, targets):
   # Computes average loss for the batch
   predictions = net_apply(params, inputs)
   return np.mean((targets - predictions)**2)

We evaluate the untrained, randomly initialized network across a range of inputs:

# batch the inference across K=100
xrange_inputs = np.linspace(-5,5,100).reshape((100, 1)) # (k, 1)
targets = np.sin(xrange_inputs)
predictions = vmap(partial(net_apply, net_params))(xrange_inputs)
losses = vmap(partial(loss, net_params))(xrange_inputs, targets) # per-input loss
plt.plot(xrange_inputs, predictions, label='prediction')
plt.plot(xrange_inputs, losses, label='loss')
plt.plot(xrange_inputs, targets, label='target')
plt.legend()

As expected, at random initialization, the model’s predictions (blue) are totally off the target function (green).
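
For readers new to vmap, a minimal standalone sketch may help: vmap takes a function written for a single example and returns one that maps over a leading batch axis, giving the same result as a Python loop. The function squared_error and the data here are made up for illustration, and jax.numpy is imported as jnp rather than np so that the snippet stays self-contained.

```python
import jax.numpy as jnp
from jax import vmap

def squared_error(x, y):
    """Loss for a single (input, target) pair of scalars."""
    return (jnp.sin(x) - y) ** 2

xs = jnp.linspace(-5.0, 5.0, 8)   # a batch of 8 scalar inputs
ys = jnp.zeros(8)                 # dummy targets

# Explicit Python loop over the batch, one example at a time.
looped = jnp.stack([squared_error(x, y) for x, y in zip(xs, ys)])

# The same computation, vectorized over axis 0 of both arguments.
batched = vmap(squared_error)(xs, ys)

print(batched.shape, bool(jnp.allclose(looped, batched)))
```

The practical benefit is that the per-example function stays easy to read and test, while the batched version is what actually runs.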

Let’s train the network via gradient descent. JAX’s random number generator is set up differently than Numpy’s, so to initialize network parameters we’ll use the original Numpy library (onp) to generate random numbers. We’ll also import the tree_multimap utility to easily manipulate collections of per-parameter gradients (for TensorFlow users, this is analogous to nest.map_structure for Tensors).

import numpy as onp
from jax.experimental import optimizers
from jax.tree_util import tree_multimap # Element-wise manipulation of collections of numpy arrays

We initialize the parameters and optimizer, and run the curve fitting for 100 steps. Note that adding the @jit decorator to the “step” function uses XLA to compile the entire training step into machine code, along with optimizations like fused accelerator kernels, memory and layout optimization. TensorFlow itself also uses XLA for accelerating statically defined graphs. XLA makes the computation very fast and amenable to hardware acceleration because the entire thing can be executed without returning to a Python interpreter (or Graph interpreter in the case of TensorFlow sans XLA). The code in this tutorial will just work on CPU/GPU/TPU.

opt_init, opt_update = optimizers.adam(step_size=1e-2)
opt_state = opt_init(net_params)
# Define a compiled update step
@jit
def step(i, opt_state, x1, y1):
   p = optimizers.get_params(opt_state)
   g = grad(loss)(p, x1, y1)
   return opt_update(i, g, opt_state)

for i in range(100):
   opt_state = step(i, opt_state, xrange_inputs, targets)
net_params = optimizers.get_params(opt_state)

Evaluating our network again, we see that the sinusoid curve has been correctly approximated.

This result is nothing to write home about, but in just a moment we’ll re-use a lot of these functions to implement MAML.

Exploring JAX: Checking MAML Numerics

When implementing ML algorithms, it’s important to unit-test implementations against test cases where the true values can be computed analytically. The following example does this for MAML on a toy objective $g$. Note that by default JAX computes gradients with respect to the first argument of the function.

# gradients of gradients test for MAML
# check numerics
g = lambda x, y : np.square(x) + y
x0 = 2.
y0 = 1.
print('grad(g)(x0) = {}'.format(grad(g)(x0, y0))) # 2x = 4
print('x0 - grad(g)(x0) = {}'.format(x0 - grad(g)(x0, y0))) # x - 2x = -2
def maml_objective(x, y):
   # apply one inner gradient step to x, then evaluate g at the updated point
   return g(x - grad(g)(x, y), y)
print('maml_objective(x,y)={}'.format(maml_objective(x0, y0))) # x**2 + 1 = 5
print('x0 - grad(maml_objective)(x,y) = {}'.format(x0 - grad(maml_objective)(x0, y0))) # x - (2x) = -2.
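
The expected values in the comments can be worked out by hand. The derivation below uses $x'$ for the inner-updated input (notation introduced here) and an implicit inner step size of 1, matching the code:

```latex
g(x, y) = x^2 + y, \qquad \frac{\partial g}{\partial x} = 2x \\
x' = x - \frac{\partial g}{\partial x} = x - 2x = -x \\
\text{maml\_objective}(x, y) = g(x', y) = (-x)^2 + y = x^2 + y \\
\frac{\partial}{\partial x}\,\text{maml\_objective}(x, y)
  = 2x' \cdot \frac{\partial x'}{\partial x} = 2(-x)(-1) = 2x \\
x_0 = 2,\; y_0 = 1: \quad \text{maml\_objective}(x_0, y_0) = 5, \qquad x_0 - 2x_0 = -2
```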

Implementing MAML with JAX

Now let’s extend our sinusoid regression task to a multi-task problem, in which the sinusoid function can have varying phases and amplitudes. This task was proposed in the MAML paper as a way to illustrate how MAML works on a toy problem. Below are some points sampled from two different tasks, divided into “train” (used to compute the inner loss) and “validation” splits (sampled from the same task, used to compute the outer loss).

Suppose a task loss function $\mathcal{L}$ is defined with respect to model parameters $\theta$, input features $X$, output labels $Y$. Let $x_1, y_1$ and $x_2, y_2$ be identically distributed task instance data sampled from $X, Y$. Then MAML optimizes the following:

$\mathcal{L}(\theta - \alpha \nabla_\theta \mathcal{L}(\theta, x_1, y_1), x_2, y_2)$

MAML’s inner update operator is just gradient descent on the regression loss. The outer loss, maml_loss, is simply the original loss applied after the inner_update operator has been applied. One interpretation of the MAML objective is that it is a differentiable estimate of a cross-validation loss with respect to a learner. Meta-training results in an inner_update that minimizes the cross-validation loss.

def inner_update(p, x1, y1, alpha=.1):
   grads = grad(loss)(p, x1, y1)
   inner_sgd_fn = lambda g, state: (state - alpha*g)
   return tree_multimap(inner_sgd_fn, grads, p)

def maml_loss(p, x1, y1, x2, y2):
   p2 = inner_update(p, x1, y1)
   return loss(p2, x2, y2)
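
As a sanity check that can be verified by hand, here is the same inner/outer structure on a two-parameter linear model. Everything in this sketch is an assumption made for illustration: the linear model, the data, and the use of jax.tree_util.tree_map in place of tree_multimap. As far as I know, recent JAX releases no longer provide tree_multimap, and tree_map accepts several trees instead, but check this against your installed version.

```python
import jax.numpy as jnp
from jax import grad
from jax.tree_util import tree_map

def loss(params, x, y):
    """Mean-squared error of a toy linear model y = w * x + b."""
    w, b = params
    return jnp.mean((w * x + b - y) ** 2)

def inner_update(params, x1, y1, alpha=0.1):
    """One SGD step on the inner (train) split."""
    grads = grad(loss)(params, x1, y1)
    return tree_map(lambda p, g: p - alpha * g, params, grads)

def maml_loss(params, x1, y1, x2, y2):
    """Loss on the validation split after adapting on the train split."""
    return loss(inner_update(params, x1, y1), x2, y2)

params = (jnp.array(0.0), jnp.array(0.0))   # (w, b)
x1 = jnp.array([1.0, 2.0]); y1 = 2.0 * x1   # inner split
x2 = jnp.array([3.0]);      y2 = 2.0 * x2   # outer split

adapted_loss = maml_loss(params, x1, y1, x2, y2)
meta_grads = grad(maml_loss)(params, x1, y1, x2, y2)  # differentiates through inner_update
print(adapted_loss, meta_grads)
```

By hand: the inner gradients at (w, b) = (0, 0) are (-10, -6), so one step with alpha = 0.1 gives (1.0, 0.6), and the outer loss is (3.6 - 6)^2 = 5.76, down from 36 before adaptation.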

In each iteration of optimizing the MAML objective, we sample a single new task, then sample separate sets of input features and labels for the training and validation splits.

opt_init, opt_update = optimizers.adam(step_size=1e-3) # this LR seems to be better than 1e-2 and 1e-4
out_shape, net_params = net_init(in_shape)
opt_state = opt_init(net_params)

@jit
def step(i, opt_state, x1, y1, x2, y2):
   p = optimizers.get_params(opt_state)
   g = grad(maml_loss)(p, x1, y1, x2, y2)
   l = maml_loss(p, x1, y1, x2, y2)
   return opt_update(i, g, opt_state), l
K=20

np_maml_loss = []

# Adam optimization
for i in range(20000):
   # define the task
   A = onp.random.uniform(low=0.1, high=.5)
   phase = onp.random.uniform(low=0., high=np.pi)
   # meta-training inner split (K examples)
   x1 = onp.random.uniform(low=-5., high=5., size=(K,1))
   y1 = A * onp.sin(x1 + phase)
   # meta-training outer split (1 example). Like cross-validating with respect to one example.
   x2 = onp.random.uniform(low=-5., high=5.)
   y2 = A * onp.sin(x2 + phase)
   opt_state, l = step(i, opt_state, x1, y1, x2, y2)
   np_maml_loss.append(l)
   if i % 1000 == 0:
       print(i)
net_params = optimizers.get_params(opt_state)

At meta-training time, the network learns to “quickly adapt” to x1, y1 in order to minimize cross-validation error on a new set of points x2. At deployment time (shown in the plot above), when we have a new task (new amplitude and phase not seen at training time), the model can apply the inner_update operator to fit the target sinusoid much faster and with fewer data samples than simply re-training the parameters with SGD.

Why is inner_update a more effective learning rule than retraining with SGD on a new dataset? The magic here is that by training in a multi-task setting, the inner_update operator has generalized across tasks into a learning rule that is specially adapted for sinusoid regression tasks. In the standard data regime of deep learning, generalization is obtained from many examples of a single task (e.g. RL, image classification). In meta-learning, generalization is obtained from a few examples each from many tasks, and a shared learning rule is learned for the task distribution.

# batch the inference across K=100
targets = np.sin(xrange_inputs)
predictions = vmap(partial(net_apply, net_params))(xrange_inputs)
plt.plot(xrange_inputs, predictions, label='pre-update predictions')
plt.plot(xrange_inputs, targets, label='target')

x1 = onp.random.uniform(low=-5., high=5., size=(K,1))
y1 = 1. * onp.sin(x1 + 0.)

for i in range(1,5):
   net_params = inner_update(net_params, x1, y1)
   predictions = vmap(partial(net_apply, net_params))(xrange_inputs)
   plt.plot(xrange_inputs, predictions, label='{}-shot predictions'.format(i))
plt.legend()

Batching MAML Gradients Across Tasks with vmap

We can compute the MAML gradients across multiple tasks at once to reduce the variance of gradients of the learning operator. This was proposed in the MAML paper, and is analogous to how increasing minibatch size of standard SGD reduces variance of the parameter gradients (leading to more efficient learning).

Thanks to the vmap operator, we can automatically transform our single-task MAML implementation into a “batched version” that operates across tasks. From a software engineering & testing perspective, vmap is extremely nice because the "task-batched" MAML implementation simply re-uses code from the non-task batched MAML algorithm, without losing any vectorization benefits. This means that when unit-testing code, we can test the single-task MAML algorithm for numerical correctness, then scale up to a more advanced batched version (e.g. for handling harder tasks such as robotic learning) for efficiency.

# vmapped version of maml loss.
# returns scalar for all tasks.
def batch_maml_loss(p, x1_b, y1_b, x2_b, y2_b):
   task_losses = vmap(partial(maml_loss, p))(x1_b, y1_b, x2_b, y2_b)
   return np.mean(task_losses)
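
Binding the parameters with partial is one way to keep them fixed while mapping over tasks; vmap's in_axes argument is another. The sketch below uses a made-up one-parameter task_loss (not the network from this post) to show the in_axes form, where None marks an argument that is shared across the batch rather than mapped.

```python
import jax.numpy as jnp
from jax import vmap

def task_loss(w, x, y):
    """Mean-squared error of a toy one-parameter model y = w * x on one task."""
    return jnp.mean((w * x - y) ** 2)

w = 2.0                                              # shared across all tasks
x_b = jnp.ones((4, 5))                               # 4 tasks, 5 points each
y_b = jnp.arange(4.0)[:, None] * jnp.ones((4, 5))    # task k has constant target k

# in_axes=(None, 0, 0): do not map over w, map over axis 0 of x_b and y_b.
per_task = vmap(task_loss, in_axes=(None, 0, 0))(w, x_b, y_b)
print(per_task, jnp.mean(per_task))
```

Each task's loss is (2 - k)^2, so the per-task vector is [4, 1, 0, 1] and the batch loss is its mean.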

Below is a function that samples a batch of tasks, where outer_batch_size is the number of tasks we meta-train on in each step, and inner_batch_size is the number of data points per-task.

def sample_tasks(outer_batch_size, inner_batch_size):
   # Select amplitude and phase for the task
   As = []
   phases = []
   for _ in range(outer_batch_size):
       As.append(onp.random.uniform(low=0.1, high=.5))
       phases.append(onp.random.uniform(low=0., high=np.pi))
   def get_batch():
       xs, ys = [], []
       for A, phase in zip(As, phases):
           x = onp.random.uniform(low=-5., high=5., size=(inner_batch_size, 1))
           y = A * onp.sin(x + phase)
           xs.append(x)
           ys.append(y)
       return np.stack(xs), np.stack(ys)
   x1, y1 = get_batch()
   x2, y2 = get_batch()
   return x1, y1, x2, y2

Now for the training loop, which strongly resembles the previous single-task one. As you can see, gradient-based meta-learning has to deal with two kinds of variance: that of intra-task gradients for the inner loss, and that of inter-task gradients for the outer loss.

opt_init, opt_update = optimizers.adam(step_size=1e-3)
out_shape, net_params = net_init(in_shape)
opt_state = opt_init(net_params)

# vmapped version of maml loss.
# returns scalar for all tasks.
def batch_maml_loss(p, x1_b, y1_b, x2_b, y2_b):
   task_losses = vmap(partial(maml_loss, p))(x1_b, y1_b, x2_b, y2_b)
   return np.mean(task_losses)

@jit
def step(i, opt_state, x1, y1, x2, y2):
   p = optimizers.get_params(opt_state)
   g = grad(batch_maml_loss)(p, x1, y1, x2, y2)
   l = batch_maml_loss(p, x1, y1, x2, y2)
   return opt_update(i, g, opt_state), l

np_batched_maml_loss = []
K=20
for i in range(20000):
   x1_b, y1_b, x2_b, y2_b = sample_tasks(4, K)
   opt_state, l = step(i, opt_state, x1_b, y1_b, x2_b, y2_b)
   np_batched_maml_loss.append(l)
   if i % 1000 == 0:
       print(i)
net_params = optimizers.get_params(opt_state)

When we plot the MAML objective as a function of training step, we see that the batched MAML trains much faster (as a function of gradient steps) and also has lower variance during training.

Conclusions

In this tutorial we explored the MAML algorithm and reproduced the Sinusoid regression task from the paper in about 50 lines of Python code. I was very pleasantly surprised to find how easy grad, vmap, and jit made it to implement MAML, and I am excited to continue using JAX for my own meta-learning research.

So, what are the distinctions between “optimization”, “learning”, “adaptation”, and “memory”? I believe they are all equivalent, because it is possible to implement memory capabilities with optimization techniques (MAML) and vice versa (e.g. RNN-based meta reinforcement learning). In reinforcement learning, imitating a teacher, conditioning on a user-specified goal, or recovering from a failure can all use the same machinery.

Thinking about precise definitions of “learning” and “meta-learning”, and attempting to reconcile them with the capabilities of biological intelligence have led me to realize that every process in Life itself, spanning molecular reaction to behavioral adaptation to genetic evolution, is nothing more than learning happening at many time scales. I’ll have much more to say on the topic of Artificial Life and Machine Learning in the future, but for now, thank you for reading this humble tutorial on fitting sinusoidal functions!

Acknowledgements

Thanks to Matthew Johnson for helping to proofread this post and helping me to resolve JAX questions.

Thoughts on the BagNet Paper

2019-02-05T23:35:00.003-08:00

Some thoughts on the interesting BagNet paper (accepted at ICLR 2019) currently being circulated around the Machine Learning Twitter Community.

Disclaimer: I wasn't a reviewer of this paper for ICLR. I think it was worthy of acceptance to the conference, and hope it prompts further investigation by the research community. Please feel free to email me if you spot any mistakes / misunderstandings in this post.

Paper Summary:

Deep Convolutional Networks (CNNs) work by aggregating local features via learned convolutions followed by spatial pooling. Successive application of these "convolutional layers" results in a "hierarchy of features" that integrate low-level information across a wide spatial extent to form high-level information.

As for algorithmic solutions, those aboard the deep learning hype train (myself included) believe that current deep CNNs perform global integration of information. There is a hand-wavy notion that intelligent visual understanding requires "seeing the forest for the trees."

In the BagNet paper, the authors find that for the ImageNet classification task, the following algorithm (BagNet) works surprisingly well (86% Top-5 accuracy) in comparison to the deep AlexNet model (84.7% accuracy):

1) Chop up the input image into 33x33 patches.

2) Run each patch through a deep net (1x1 convolutions) to get a class vector.

3) Add up the resulting class vectors spatially (across all patches).

4) Predict the class with the most counts.

By way of analogy, it suggests that for image classification, you don't need a non-linear model to integrate a bunch of local features into a global representation, you just need to "count a bunch of trees to guess that it's a forest".
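
The four steps can be sketched as follows. This is only a schematic of the bag-of-local-features idea as summarized above, not the authors' implementation: the patch stride, the helper names, and the toy per-patch classifier (which simply averages color channels in place of a trained deep net) are all assumptions.

```python
import numpy as np

def bag_of_patches_predict(image, patch_classifier, patch=33, stride=8):
    """Sum per-patch class evidence over all patches and pick the top class."""
    height, width, _ = image.shape
    total = 0.0
    for top in range(0, height - patch + 1, stride):
        for left in range(0, width - patch + 1, stride):
            crop = image[top:top + patch, left:left + patch]  # step 1: one local patch
            total = total + patch_classifier(crop)            # steps 2-3: classify, accumulate
    return int(np.argmax(total)), total                       # step 4: class with most evidence

def toy_patch_classifier(crop):
    # Hypothetical stand-in for the deep net: one "class" per color channel.
    return crop.mean(axis=(0, 1))

image = np.zeros((64, 64, 3))
image[..., 1] = 1.0  # an all-green image
predicted_class, evidence = bag_of_patches_predict(image, toy_patch_classifier)
print(predicted_class, evidence)
```

Note that the patches never interact except through the final sum, which is what makes the per-patch evidence easy to inspect.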

Some other experimental conclusions:

BagNet works slightly better when using 33x33 patches compared to 17x17 patches (80%). So deep nets do extract useful spatial information (9x9 vs. 17x17 vs. 33x33), just perhaps not to the global spatial extent we might have previously imagined (e.g. 112x112, 224x224).
Spatially distinct features from the BagNet model do not interact beyond the bagging step. This raises the question of whether most of the "power" of deep nets comes from merely examining local features. Are Deep Nets just BagNets? This would be quite concerning if that were the case!
VGG appears to approximate BagNets quite well (though I am a bit skeptical about the authors' methodology of showing this), while DenseNets and ResNets appear to be doing something totally different from BagNets (the authors explain in the rebuttal that this may come from "(1) a more non-linear classifier on top of the local features or (2) larger local feature sizes").

Thoughts & Questions

Regardless of your beliefs on whether CNNs can/should take us all the way to Artificial General Intelligence or not, this paper offers a neat bit of evidence that we can build surprisingly powerful image classification models by only examining local features. It is often more helpful to tackle applied problems with a more interpretable model, and I'm glad to see such models doing surprisingly well for certain problems.

BagNet seems quite similar in principle to Generalized Additive Models, which predate Deep Learning quite a bit. The basic idea of GAMs is to combine non-linear univariate features (i.e. $f(x_i)$ where each $x_i$ is a pixel and $f$ is a neural net) into simple, interpretable features so that the marginal predictive distribution with respect to each variable can be interrogated. I'm particularly excited about ideas like Lou et al. which relax GAMs to support pairwise interactions between univariate feature extractors (2D marginals are still interpretable to humans).

The authors do not claim this explicitly, but it's easy to skim the paper quickly and think "DNNs suck; they are nothing more than BagNets". That's not actually the case (and the authors' experiments suggest this).

One counterexample: adversarial examples are clear instances where local modifications (sometimes a single pixel) can change global feature representations. So it is clear that global shape integration is happening for test inputs. The remaining question is whether global shape integration is happening where we think it should happen, and on which tasks this happens. As someone who is deeply interested in AGI, I find ImageNet much less interesting now, precisely because it can be solved with models that have little global understanding of images.

The authors also say this much themselves, that we need harder tasks that require global shape integration.

Generative modeling of images (e.g. GANs) is a task where it's quite clear that linear interactions between patch features are insufficient to model the unconditional joint distribution across pixels. Or consider my favorite RL task, Life on Earth, in which agents clearly need to perform spatial reasoning to solve problems like chasing prey and running away from predators. It would be fun to design an artificial life setup and see if organisms using bag-of-features perception can actually compete with organisms that use non-linear global integration (I doubt it).

If we train a model that should do better by integrating global information (i.e. classification), and it ends up just overfitting to local features, then this is a truly interesting result - it means that we need an optimization objective that does not allow models to cheat in this way. I think "Life-on-Earth" is a great task for this, though I hope to find one that is less computationally intensive :)

Finally, a word on interpretability vs. causal inference. In the near term, I could see BagNet being useful for self-driving cars, where the parallelizability of considering each patch separately would give even better speedups for large images. Everyone wants ML models on self-driving cars to be interpretable, right? But there is also the psychological question of whether a human would prefer to get in a car that drives with a black box CNN that is "accurate, uninterpretable, and maybe wrong", or whether they want a car that makes decisions using Bag-of-Features: "accurate, interpretable, and definitely wrong". Lobbying for interpretability (as used by BagNet) seems to be at odds with demands for "causal inference" and "program induction" by means of achieving better generalizable machine learning, because a strong assumption of causal inference is that your model can express the true causal distribution. I'm curious how members of the community think we should reconcile this difference.

Update (Feb 9): There is a more positive way to look at these methods for better causal inference. Methods like BagNet can serve as a very useful sanity check when designing end-to-end systems (like robotics, self-driving cars): if your deep net is not performing much better than a system only examining local statistical regularities (like BagNet), it is a good sign that your model may still yet benefit from better global information integration. One might even consider jointly optimizing BagNet and Advantage(DeepNet, BagNet) so that the DeepNet must explicitly extract strictly better information than what BagNet does. I have been thinking of how to better verify our ML systems for robotics and building such "null hypothesis" models can be a good way to check that they aren't learning something silly.

Uncertainty: a Tutorial

2018-12-28T10:14:00.000-08:00

A PDF version of this post can be found here.
Chinese translation by Xiaoyi Yin

Notions of uncertainty are tossed around in conversations about AI safety, risk management, portfolio optimization, scientific measurement, and insurance. Here are a few examples of colloquial use:

"We want machine learning models to know what they don't know."
"An AI responsible for diagnosing patients and prescribing treatments should tell us how confident it is about its recommendations."
"Significant figures in scientific calculations represent uncertainty in measurements."
"We want autonomous agents to explore areas where they are uncertain (about rewards or predictions) so that they may discover sparse rewards."
"In portfolio optimization, we want to maximize returns while limiting risk."
"US equity markets finished disappointingly in 2018 due to increased geopolitical uncertainty."

What exactly then, is uncertainty?

Uncertainty measures reflect the amount of dispersion of a random variable. In other words, an uncertainty measure is a scalar measure of how "random" a random variable is. In finance, it is often referred to as risk.

There is no single formula for uncertainty because there are many different ways to measure dispersion: standard deviation, variance, value-at-risk (VaR), and entropy are all appropriate measures. However, it's important to keep in mind that a single scalar number cannot paint a full picture of "randomness'', as that would require communicating the entire random variable itself!

Nonetheless, it is helpful to collapse randomness down to a single number for the purposes of optimization and comparison. The important thing to remember is that "more uncertainty'' is usually regarded as "less good'' (except in simulated RL experiments).
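To make the "many ways to measure dispersion" point concrete, here is a minimal sketch that collapses the same samples into several different scalars. The function name `dispersion_summary`, the 5% tail level, and the fixed histogram bin width are all choices made for this illustration, not anything prescribed by the post.

```python
import math
import random
import statistics

def dispersion_summary(samples, tail=0.05, bin_width=0.5):
    """Collapse a list of samples into several scalar dispersion measures."""
    ordered = sorted(samples)
    n = len(ordered)
    # Histogram entropy in nats, using fixed-width bins (depends on bin_width).
    counts = {}
    for x in ordered:
        key = math.floor(x / bin_width)
        counts[key] = counts.get(key, 0) + 1
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return {
        "std": statistics.pstdev(ordered),
        "variance": statistics.pvariance(ordered),
        # Empirical quantile (the basis of value-at-risk): only a `tail`
        # fraction of the samples fall below this value.
        "tail_quantile": ordered[int(tail * n)],
        "entropy": entropy,
    }

rng = random.Random(0)
narrow = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
wide = [rng.gauss(0.0, 3.0) for _ in range(10_000)]
summary_narrow = dispersion_summary(narrow)
summary_wide = dispersion_summary(wide)
```

All four numbers agree that `wide` is "more random" than `narrow`, but each one answers a slightly different question, which is why no single formula wins.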

Types of Uncertainty

Statistical machine learning concerns itself with the estimation of models $p(\theta|\mathcal{D})$, which in turn estimate unknown random variables $p(y|x)$. Multiple forms of uncertainty come into play here. Some notions of uncertainty describe inherent randomness that we should expect (e.g. outcome of a coin flip) while others describe our lack of confidence about our best guess of the model parameters.

To make things more concrete, let's consider a recurrent neural network (RNN) that predicts the amount of rainfall today from a sequence of daily barometer readings. A barometer measures atmospheric pressure, which often drops when it's about to rain. Here's a diagram summarizing the rainfall prediction model along with different kinds of uncertainty.

Uncertainty can be understood from a simple machine learning model that attempts to predict daily rainfall from a sequence of barometer readings. Aleatoric uncertainty is irreducible randomness that arises from the data collection process. Epistemic uncertainty reflects confidence that our model is making the correct predictions. Finally, out-of-distribution errors arise when the model sees an input that differs from its training data (e.g. temperature of the sun, other anomalies).

Aleatoric Uncertainty

Aleatoric Uncertainty draws its name from the Latin root aleatorius, which means the incorporation of chance into the process of creation. It describes randomness arising from the data generating process itself; noise that cannot be eliminated by simply drawing more data. It is the coin flip whose outcome you cannot know.

In our rainfall prediction analogy, aleatoric noise arises from imprecision of the barometer. There are also important variables that the data collection setup does not observe: How much rainfall was there yesterday? Are we measuring barometric pressure in the present day, or the last ice age? These unknowns are inherent to our data collection setup, so collecting more data from that system does not absolve us of this uncertainty.

Aleatoric uncertainty propagates from the inputs to the model predictions. Consider a simple model $y = 5x$, which takes in normally-distributed input $x \sim \mathcal{N}(0,1)$. In this case, $y \sim \mathcal{N}(0, 5^2)$ (that is, variance $25$), so the aleatoric uncertainty of the predictive distribution can be described by the standard deviation $\sigma=5$. Of course, predictive aleatoric uncertainty is more challenging to estimate when the random structure of the input data $x$ is not known.
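A quick Monte Carlo check of this example: push samples of $x$ through $y = 5x$ and measure the spread of the outputs. The sample size and seed are arbitrary choices for the sketch.

```python
import random
import statistics

rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(50_000)]  # x ~ standard normal
ys = [5.0 * x for x in xs]                          # the model y = 5x
y_std = statistics.pstdev(ys)                       # should land close to 5
```

When the input distribution is not known in closed form, the same sample-and-propagate trick still works, as long as you can draw representative inputs.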

One might think that because aleatoric uncertainty is irreducible, one cannot do anything about it and so we should just ignore it. No! One thing to watch out for when training models is to choose an output representation capable of representing aleatoric uncertainty correctly. A standard LSTM does not emit probability distributions, so attempting to learn the outcome of a coin flip would just converge to the mean. In contrast, models for language generation emit a sequence of categorical distributions (words or characters), which can capture the inherent ambiguity in sentence completion tasks.

Epistemic Uncertainty

"Good models are all alike; every bad model is wrong in its own way."

Epistemic Uncertainty is derived from the Greek root epistēmē, which pertains to knowledge about knowledge. It measures our ignorance of the correct prediction arising from our ignorance of the correct model parameters.

Below is a plot of a Gaussian Process Regression model on some toy 1-dimensional dataset. The confidence intervals reflect epistemic uncertainty; the uncertainty is zero for training data (red points), and as we get farther away from training points, the model ought to assign higher standard deviations to the predictive distribution. Unlike aleatoric uncertainty, epistemic uncertainty can be reduced by gathering more data and "ironing out" the regions of inputs where the model lacks knowledge.

1-D Gaussian Process Regression Model showcasing epistemic uncertainty for inputs outside its training set.

There is a rich line of inquiry connecting Deep Learning to Gaussian Processes. The hope is that we can extend the uncertainty-awareness properties of GPs with the representational power of neural networks. Unfortunately, GPs are challenging to scale to the uniform stochastic minibatch setting for large datasets, and they have fallen out of favor among those working on large models and datasets.

If one wants maximum flexibility in choosing their model family, a good alternative to estimating uncertainty is to use ensembles, which is just a fancy way of saying "multiple independently learned models''. While GP models analytically define the predictive distribution, ensembles can be used to compute the empirical distribution of predictions.

Any individual model will make some errors due to randomized biases that occur during the training process. Ensembling is powerful because other models in the ensembles tend to expose the idiosyncratic failures of a single model while agreeing with the correctly inferred predictions.

How do we sample models randomly to construct an ensemble? In ensembling with bootstrap aggregation, we start with a training dataset of size $N$ and sample $M$ datasets of size $N$ from the original training set (with replacement, so each resampled dataset typically omits some of the original examples). The $M$ models are trained on their respective datasets and their resulting predictions collectively form an empirical predictive distribution.
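Here is a minimal sketch of bootstrap aggregation, using a least-squares line as a stand-in for whatever model family you actually care about. The helper names (`fit_line`, `bootstrap_ensemble`, `predictive_std`) and the toy dataset are assumptions made for illustration.

```python
import random
import statistics

def fit_line(points):
    """Ordinary least squares for y = a*x + b on a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def bootstrap_ensemble(data, num_models, rng):
    """Train `num_models` models, each on N points drawn with replacement."""
    models = []
    for _ in range(num_models):
        resampled = [rng.choice(data) for _ in range(len(data))]
        models.append(fit_line(resampled))
    return models

def predictive_std(models, x):
    """Spread of the ensemble's predictions at input x."""
    return statistics.pstdev([a * x + b for a, b in models])

rng = random.Random(0)
data = []
for _ in range(30):
    x = rng.random()  # training inputs live in [0, 1)
    data.append((x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.3)))

models = bootstrap_ensemble(data, num_models=20, rng=rng)
spread_near = predictive_std(models, 0.5)   # inside the training range
spread_far = predictive_std(models, 10.0)   # far outside it
```

The ensemble members disagree much more at `x = 10.0` than at `x = 0.5`, which is the empirical analogue of the widening confidence band in the GP plot.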

If training multiple models is too expensive, it is also possible to use Dropout training to approximate a model ensemble. However, introducing dropout involves an extra hyperparameter and can compromise single model performance (often unacceptable for real world applications where calibrated uncertainty estimation is secondary to accuracy).

Therefore, if one has access to plentiful computing resources (as one does at Google), it is often easier to just re-train multiple copies of a model. This also yields the benefits of ensembling without hurting performance. This is the approach taken by the Deep Ensembles paper. The authors of this paper also mention that the random training dynamics induced by differing weight initializations were sufficient to introduce a diverse set of models without having to resort to reducing the training set diversity via bootstrap aggregation. From a practical engineering standpoint, it's smart to bet on risk estimation methods that do not get in the way of the model's performance or whatever other ideas the researcher wants to try.

Out-of-Distribution Uncertainty

For our rainfall predictor, what if instead of feeding in the sequence of barometer readings, we fed in the temperature of the sun? Or a sequence of all zeros? Or barometer readings from a sensor that reports in different units? The RNN will happily compute away and give us a prediction, but the result will likely be meaningless.

The model is totally unqualified to make predictions on data generated via a different procedure than the one used to create the training set. This is a failure mode that is often overlooked in benchmark-driven ML research, because we typically assume that the training, validation, and test sets consist entirely of clean i.i.d data.

Determining whether inputs are "valid'' is a serious problem for deploying ML in the wild, and is known as the Out of Distribution (OoD) problem. OoD is also synonymous with model misspecification error and anomaly detection.

Besides its obvious importance for hardening ML systems, anomaly detection models are an intrinsically useful technology. For instance, we might want to build a system that monitors a healthy patient's vitals and alerts us when something goes wrong without necessarily having seen that pattern of pathology before. Or we might be managing the "health" of a datacenter and want to know whenever unusual activity occurs (disks filling up, security breaches, hardware failures, etc.)

Since OoD inputs only occur at test-time, we should not presume to know the distribution of anomalies the model encounters. This is what makes OoD detection tricky - we have to harden a model against inputs it never sees during training! This is exactly the standard attack scenario described in Adversarial Machine Learning.

There are two ways to handle OoD inputs for a machine learning model: 1) catch the bad inputs before we even put them through the model 2) let the "weirdness'' of model predictions imply to us that the input was probably malformed.

In the first approach, we assume nothing about the downstream ML task, and simply consider the problem of whether an input is in the training distribution or not. This is exactly what discriminators in Generative Adversarial Networks (GANs) are supposed to do. However, a single discriminator is not completely robust because it is only good for discriminating between the true data distribution and whatever the generator's distribution is; it can give arbitrary predictions for an input that lies in neither distribution.

Instead of a discriminator, we could build a density model of the in-distribution data, such as a kernel density estimator or fitting a Normalizing Flow to the data. Hyunsun Choi and I investigated this in our recent paper on using modern generative models to do OoD detection.

The second approach to OoD detection involves using the predictive (epistemic) uncertainty of the task model to tell us when inputs are OoD. Ideally, malformed inputs to a model ought to generate a "weird" predictive distribution $p(y|x)$. For instance, Hendrycks and Gimpel showed that the maximum softmax probability (the predicted class) for OoD inputs tends to be lower than that of in-distribution inputs. Here, uncertainty is inversely proportional to the "confidence" as modeled by the max softmax probability. Models like Gaussian Processes give us these uncertainty estimates by construction, or we could compute epistemic uncertainty via Deep Ensembles.
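The max-softmax idea can be sketched in a few lines. The threshold of 0.5 and the example logits below are made up for illustration; in practice the threshold would be tuned on held-out data.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def max_softmax_score(logits):
    """Confidence proxy: probability assigned to the predicted class."""
    return max(softmax(logits))

def looks_out_of_distribution(logits, threshold=0.5):
    """Flag an input whose predicted-class probability is below threshold."""
    return max_softmax_score(logits) < threshold

confident_logits = [6.0, 0.5, -1.0]  # one class clearly dominates
flat_logits = [0.2, 0.1, 0.0]        # the classifier is hedging

flagged_confident = looks_out_of_distribution(confident_logits)
flagged_flat = looks_out_of_distribution(flat_logits)
```

Only the flat prediction gets flagged. Note that this is a heuristic: nothing stops a network from being confidently wrong on an OoD input.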

In reinforcement learning, OoD inputs are actually assumed to be a good thing, because they represent inputs from the world that the agent does not know how to handle yet. Encouraging the policy to find its own OoD inputs implements "intrinsic curiosity" to explore regions the model predicts poorly in. This is all well and good, but I do wonder what would happen if such curiosity-driven agents are deployed in real world settings where sensors break easily and other experimental anomalies happen. How does a robot distinguish between "unseen states" (good) and "sensors breaking" (bad)? Might that result in agents that learn to interfere with their sensory mechanisms to generate maximum novelty?

Who Will Watch the Watchdogs?

As mentioned in the previous section, one way to defend ourselves against OoD inputs is to set up a likelihood model that "watchdogs" the inputs to a model. I prefer this approach because it de-couples the problem of OoD inputs from epistemic and aleatoric uncertainty in the task model. It makes things easy to analyze from an engineering standpoint.

But we should not forget that the likelihood model is also a function approximator, possibly with its own OoD errors! We show in our recent work on Generative Ensembles (as does concurrent work by DeepMind) that under a CIFAR likelihood model, natural images from SVHN can actually be more likely than the in-distribution CIFAR images themselves!

Likelihood estimation involves a function approximator that can itself be susceptible to OoD inputs. A likelihood model of CIFAR assigns higher probabilities to SVHN images than CIFAR test images!

However, all is not lost! It turns out that the epistemic uncertainty of likelihood models is an excellent OoD detector for the likelihood model itself. By bridging epistemic uncertainty estimation with density estimation, we can use ensembles of likelihood models to protect machine learning models against OoD inputs in a model-agnostic way.
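As a toy illustration of using ensemble disagreement as the watchdog signal, the sketch below fits several 1-D Gaussian density models and scores an input by the variance of its log-likelihood across them. Everything here is an assumption made for the sake of a small runnable example: the Gaussian models stand in for deep generative models, bootstrap resampling stands in for independently trained networks, and the score is just one plausible choice.

```python
import math
import random
import statistics

def fit_gaussian(samples):
    """Fit a 1-D Gaussian density model; returns (mean, std)."""
    return statistics.fmean(samples), statistics.pstdev(samples)

def log_likelihood(x, model):
    mean, std = model
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def ensemble_disagreement(x, models):
    """Variance of log p(x) across the ensemble; higher suggests OoD."""
    return statistics.pvariance([log_likelihood(x, m) for m in models])

rng = random.Random(0)
train = [rng.gauss(0.0, 1.0) for _ in range(200)]
# Each ensemble member sees a different bootstrap resample of the data.
models = [
    fit_gaussian([rng.choice(train) for _ in range(len(train))])
    for _ in range(10)
]
in_dist_score = ensemble_disagreement(0.0, models)  # typical training input
far_score = ensemble_disagreement(8.0, models)      # far from anything seen
```

The members agree closely on the density of a typical input and disagree sharply about an input far from the training data, so the disagreement itself can be thresholded without knowing anything about the downstream task.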

Calibration: the Next Big Thing?

A word of warning: just because a model is able to spit out a confidence interval for a prediction doesn't mean that the confidence interval reflects the actual probabilities of outcomes in reality!

Confidence intervals (e.g. $2\sigma$) implicitly assume that your predictive distribution is Gaussian-distributed, but if the distribution you're trying to predict is multi-modal or heavy-tailed, then your model will not be well calibrated!

Suppose our rainfall RNN tells us that there will be $\mathcal{N}(4, 1)$ inches of rain today. If our model is calibrated, then if we were to repeat this experiment over and over again under identical conditions (possibly re-training the model each time), we really would observe empirical rainfall to be distributed exactly $\mathcal{N}(4, 1)$.
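One simple way to probe calibration is to check how often outcomes land inside the model's stated interval. The sketch below simulates two hypothetical worlds for a model that always predicts a mean of 4 inches with a standard deviation of 1: one where rainfall really has that spread, and one where the true spread is twice as large. The simulated data and the helper name `interval_coverage` are assumptions for illustration.

```python
import random

def interval_coverage(outcomes, mean, std, num_std=2.0):
    """Fraction of observed outcomes inside mean +/- num_std * std."""
    inside = sum(1 for y in outcomes if abs(y - mean) <= num_std * std)
    return inside / len(outcomes)

rng = random.Random(0)
predicted_mean, predicted_std = 4.0, 1.0

# World 1: rainfall really is spread the way the model claims.
calibrated_world = [rng.gauss(4.0, 1.0) for _ in range(20_000)]
# World 2: rainfall is twice as variable as the model claims.
noisier_world = [rng.gauss(4.0, 2.0) for _ in range(20_000)]

coverage_calibrated = interval_coverage(calibrated_world, predicted_mean, predicted_std)
coverage_noisier = interval_coverage(noisier_world, predicted_mean, predicted_std)
```

A calibrated Gaussian prediction should capture roughly 95% of outcomes within $2\sigma$. In the noisier world the same interval captures only about 68%, so the model is overconfident even though its mean prediction is exactly right.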

Machine Learning models developed by academia today mostly optimize for test accuracy or some fitness function. Researchers are not performing model selection by deploying the model in repeated identical experiments and measuring calibration error, so unsurprisingly, our models tend to be poorly calibrated.

Going forward, if we are to trust ML systems deployed in the real world (robotics, healthcare, etc.), I think a much more powerful way to "prove our models understand the world correctly'' is to test them for statistical calibration. Good calibration also implies good accuracy, so it would be a strictly higher bar to optimize against.

Should Uncertainty be Scalar?

As useful as they are, scalar uncertainty measures will never be as informative as the random variables they describe. I find methods like particle filtering and Distributional Reinforcement Learning very cool because they are algorithms that operate on entire distributions, freeing us from resorting to simple normal distributions to keep track of uncertainty. Instead of shaping ML-based decision making with a single scalar of "uncertainty", we can now query the full structure of distributions when deciding what to do.

The Implicit Quantile Networks paper (Dabney et al.) has a very nice discussion on how to construct "risk-sensitive agents" from a return distribution. In some environments, one might favor an opportunistic policy that prefers to explore the unknown, while in other environments unknown things may be unsafe and should be avoided. The choice of risk measure essentially determines how to map the distribution of returns to a scalar quantity that can be optimized against. All risk measures can be computed from the distribution, so predicting full distributions enables us to combine multiple definitions of risk easily. Furthermore, supporting flexible predictive distributions seems like a good way to improve model calibration.

Performance of various risk measures on Atari games as reported by the IQN paper.

Risk measures are a deeply important research topic to financial asset managers. The vanilla Markowitz portfolio objective minimizes a weighted variance of portfolio returns $\frac{1}{2}\lambda w^T \Sigma w$. However, variance is an unintuitive choice of "risk'' in financial contexts: most investors don't mind returns exceeding expectations, but rather wish to minimize the probability of small or negative returns. For this reason, risk measures like Value-at-Risk, Shortfall Probability, and Target Semivariance, which only pay attention to the likelihood of "bad'' outcomes, are more useful objectives to optimize.
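To see why variance can be a misleading notion of risk, the sketch below compares two simulated return streams that share exactly the same downside but differ in their upside. The sample-based definitions of the three risk measures, the 5% level, and the synthetic returns are simplifying assumptions for illustration.

```python
import random
import statistics

def value_at_risk(returns, level=0.05):
    """Loss threshold exceeded in only a `level` fraction of outcomes."""
    ordered = sorted(returns)
    return -ordered[int(level * len(ordered))]

def shortfall_probability(returns, target=0.0):
    """Fraction of outcomes that fall below the target return."""
    return sum(1 for r in returns if r < target) / len(returns)

def target_semivariance(returns, target=0.0):
    """Mean squared deviation below the target; upside is ignored."""
    return sum(min(r - target, 0.0) ** 2 for r in returns) / len(returns)

rng = random.Random(0)
mean_return = 0.05
base = [rng.gauss(mean_return, 0.10) for _ in range(10_000)]
# Same downside as `base`, but every above-average outcome is stretched upward.
lucky = [mean_return + 2 * (r - mean_return) if r > mean_return else r for r in base]

variance_base = statistics.pvariance(base)
variance_lucky = statistics.pvariance(lucky)
```

Variance labels `lucky` as the riskier portfolio purely because of its fatter upside, while value-at-risk, shortfall probability, and target semivariance are identical for the two, since they only look at the bad outcomes.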

Unfortunately, they are also more difficult to work with analytically. My hope is that research into distributional RL, Monte Carlo methods, and flexible generative models will allow us to build differentiable relaxations of risk measures that can play nicely with portfolio optimizers. If you work in finance, I highly recommend reading the IQN paper's "Risks in Reinforcement Learning" section.

Summary

Here's a recap of the main points of this post:

Uncertainty/risk measures are scalar measures of "randomness''. Collapsing a random variable to a single number is done for optimization and mathematical convenience.
Predictive uncertainty can be decomposed into aleatoric (irreducible noise arising from data collection process), epistemic (ignorance about true model), and out-of-distribution (at test time, inputs may be malformed).
Epistemic uncertainty can be mitigated by softmax prediction thresholding or ensembling.
Instead of propagating OoD uncertainty to predictions, we can use a task-agnostic filtering mechanism that safeguards against "malformed inputs''.
Density models are a good choice for filtering inputs at test time. However, it's important to recognize that density models are merely approximations of the true density function, and are themselves susceptible to out-of-distribution inputs.
Self-plug: Generative Ensembles reduce epistemic uncertainty of likelihood models so they can be used to detect OoD inputs.
Calibration is important and underappreciated in research models.
Some algorithms (Distributional RL) extend ML algorithms to models that emit flexible distributions, which provides more information than a single risk measure.

Machine Learning Memes

2018-11-30T21:42:00.004-08:00

A periodically-updated list of my favorite Deep Learning memes. Enjoy!

content warning: may contain crude humor.

Caption: The Gary Marcus/Yoshua Bengio debate. (Thanks Jackie Kay for sending me this)

Dijkstra's in Disguise

2018-08-08T01:09:00.001-07:00

You can find a PDF version of this blog post here.

A weighted graph is a data structure consisting of some vertices and edges, and each edge has an associated cost of traversal. Let's suppose we want to compute the shortest distance from vertex $u$ to every other vertex $v$ in the graph, and we express this cost function as $\mathcal{L}_u(v)$.

For example, if each edge in this graph has cost $1$, $\mathcal{L}_u(v) = 3$.

Dijkstra's, Bellman-Ford, Johnson's, and Floyd-Warshall are good algorithms for solving the shortest paths problem. They all share the principle of relaxation, whereby costs are initially overestimated for all vertices and gradually corrected for using a consistent heuristic on edges (the term "relaxation" in the context of graph traversal is not to be confused with "relaxation" as used in an optimization context, e.g. integer linear programs). The heuristic can be expressed in plain language as follows:
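Relaxation: if the best known cost of reaching some vertex $s$ from $u$, plus the cost of the edge $s \to v$, is lower than the best known cost of reaching $v$, then lower the estimate for $v$ accordingly. As a sketch of this standard rule in symbols, writing $\mathcal{E}(s, v)$ for the cost of the edge $s \to v$ (the notation used in the Bellman-Ford section):

$$\mathcal{L}_u(v) \leftarrow \min\big(\mathcal{L}_u(v),\; \mathcal{L}_u(s) + \mathcal{E}(s, v)\big)$$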

It turns out that many algorithms I've encountered in my computer graphics, finance, and reinforcement learning studies are all variations of this relaxation principle in disguise. It's quite remarkable (embarrassing?) that so much of my time has been spent on such a humble technique taught in introductory computer science courses!

This blog post is a gentle tutorial on how all these varied CS topics are connected. No prior knowledge of finance, reinforcement learning, or computer graphics is needed. The reader should be familiar with undergraduate probability theory, introductory calculus, and be willing to look at some math equations. I've also sprinkled in some insights and questions that might be interesting to the AI research audience, so hopefully there's something for everybody here.

Bellman-Ford

Here's a quick introduction to Bellman-Ford, which is actually easier to understand than the famous Dijkstra's Algorithm.

Given a graph with $N$ vertices and costs $\mathcal{E}(s, v)$ associated with each directed edge $s \to v$, we want to find the cost of the shortest path from a source vertex $u$ to each other vertex $v$. The algorithm proceeds as follows: The cost to reach $u$ from itself is initialized to $0$, and all the other vertices have distances initialized to infinity.

The relaxation step (described in the previous section) is performed across all edges in any order for each iteration. The correct distances from $u$ are guaranteed to have propagated completely to all vertices after $N-1$ iterations, since the longest of the shortest paths contains at most $N$ unique vertices (and therefore at most $N-1$ edges). If the relaxation condition still finds shorter paths on the $N$-th iteration, it implies the presence of a cycle whose total cost is negative. You can find a nice animation of the Bellman-Ford algorithm here.

Below is the pseudocode:
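A minimal Python sketch of the procedure described above, assuming vertices are numbered $0$ to $N-1$ and edges are given as `(s, v, cost)` tuples (both representation choices are made here for illustration):

```python
import math

def bellman_ford(num_vertices, edges, source):
    """Shortest-path costs from `source`.

    edges: list of (s, v, cost) tuples for directed edges s -> v.
    Returns (distances, has_negative_cycle), where the flag reports a
    negative-cost cycle reachable from the source.
    """
    dist = [math.inf] * num_vertices
    dist[source] = 0.0
    # Relax every edge, N - 1 times at most.
    for _ in range(num_vertices - 1):
        updated = False
        for s, v, cost in edges:
            if dist[s] + cost < dist[v]:
                dist[v] = dist[s] + cost
                updated = True
        if not updated:
            break  # nothing changed, so the distances have converged
    # One extra pass: any further improvement means a negative cycle.
    has_negative_cycle = any(dist[s] + cost < dist[v] for s, v, cost in edges)
    return dist, has_negative_cycle

edges = [(0, 1, 4.0), (0, 2, 1.0), (2, 1, 2.0), (1, 3, 1.0)]
distances, negative_cycle = bellman_ford(4, edges, source=0)
```

On this small graph, `distances` comes out as `[0.0, 3.0, 1.0, 4.0]`: the detour through vertex 2 beats the direct edge to vertex 1.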

Currency Arbitrage

Admittedly, all this graph theory seems sort of abstract and boring at first. But would it still be boring if I told you that efficiently detecting negative cycles in graphs is a multi-billion dollar business?

The foreign exchange (FX) market, where one currency is traded for another, is the largest market in the world, with about 5 trillion USD being traded every day. This market determines the exchange rate for local currencies when you travel abroad. Let's model a currency exchange's order book (the ledger of pending transactions) as a graph:

Each vertex represents a currency (e.g. JPY, USD, BTC).
Each directed edge represents the conversion of currency $A$ to currency $B$.

An arbitrage opportunity exists if the product of exchange rates in a cycle exceeds $1$, which means that you can start with 1 unit of currency $A$, trade your way around the graph back to currency $A$, and then end up with more than 1 unit of $A$!

To see how this is related to the Bellman-Ford algorithm, let each currency pair $(A, B)$ with conversion rate $\frac{B}{A}$ be represented as a directed edge from $A$ to $B$ with edge weight $\mathcal{E}(A,B) = \log \frac{A}{B}$. Rearranging the terms,
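Consider a cycle of currencies $A_1 \to A_2 \to \dots \to A_k \to A_1$, and write $r_i$ for the conversion rate along the $i$-th edge, so that each edge weight is $\mathcal{E}_i = -\log r_i$ (the indexing and the symbol $r_i$ are introduced here for illustration). The algebra then goes:

$$\sum_{i=1}^{k} \mathcal{E}_i < 0 \iff -\sum_{i=1}^{k} \log r_i < 0 \iff \log \prod_{i=1}^{k} r_i > 0 \iff \prod_{i=1}^{k} r_i > 1$$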

The above algebra shows that if the sum of edge weights in a cycle is negative, it is equivalent to the product of exchange rates exceeding $1$. The Bellman-Ford algorithm can be directly applied to detect currency arbitrage opportunities! This also applies to all fungible assets in general, but currencies tend to be the most strongly-connected vertices in the graph representing the financial markets.
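Here is a minimal sketch of arbitrage detection along these lines. It assumes exchange rates arrive as a dictionary mapping `(from, to)` currency pairs to a rate, ignores fees and order sizes, and starts every vertex at cost zero so that a negative cycle anywhere in the graph is found. The rates themselves are invented for the example.

```python
import math

def has_arbitrage(rates):
    """Detect a cycle whose product of exchange rates exceeds 1.

    rates: dict mapping (from_currency, to_currency) to the number of
    units of `to_currency` received per unit of `from_currency`.
    """
    currencies = sorted({c for pair in rates for c in pair})
    # Edge weight is -log(rate), so a profitable cycle has negative total cost.
    edges = [(a, b, -math.log(rate)) for (a, b), rate in rates.items()]
    # Starting every vertex at 0 acts like a virtual source connected to all.
    dist = {c: 0.0 for c in currencies}
    for _ in range(len(currencies) - 1):
        for a, b, w in edges:
            if dist[a] + w < dist[b]:
                dist[b] = dist[a] + w
    # Small tolerance guards against floating-point noise in the logs.
    return any(dist[a] + w < dist[b] - 1e-12 for a, b, w in edges)

# 0.9 * 130 * 0.009 = 1.053 > 1, so this loop is profitable.
mispriced = {("USD", "EUR"): 0.9, ("EUR", "JPY"): 130.0, ("JPY", "USD"): 0.009}
# 0.9 * 130 * 0.008 = 0.936 < 1, so it is not.
fair = {("USD", "EUR"): 0.9, ("EUR", "JPY"): 130.0, ("JPY", "USD"): 0.008}

found_mispriced = has_arbitrage(mispriced)
found_fair = has_arbitrage(fair)
```

This only tells you that an opportunity exists. Recovering the actual cycle requires tracking predecessors during relaxation, which is omitted here.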

In my sophomore year of college, I caught the cryptocurrency bug and set out to build an automated arbitrage bot for scraping these opportunities in exchanges. Cryptocurrencies - being unregulated speculative digital assets - are ripe for cross-exchange arbitrage opportunities:

Inter-exchange transaction costs are low (assets are ironically centralized into hot and cold wallets).
Lots of speculative activity, whose bias generates lots of mispricing.
Exchange APIs expose much more order book depth and require no license to trade cryptos. With a spoonful of Python and a little bit of initial capital, you can trade nearly any crypto you want across dozens of exchanges.

Now we have a way to automatically detect mispricings in markets and end up with more money than we started with. Do we have a money printing machine yet?

Not so fast! A lot of things can still go wrong. Exchange rates fluctuate over time and other people are competing for the same trade, so the chances of executing all legs of the arbitrage are by no means certain.

Execution of trading strategies is an entire research area on its own, and can be likened to crossing a frozen lake as quickly as possible. Each intermediate currency position, or "leg", in an arbitrage strategy is like taking a cautious step forward. One must be able to forecast the stability of each step and know what steps come after, or else one can get "stuck" holding a lot of a currency that gives out like thin ice and becomes worthless. Often the profit opportunity is not big enough to justify the risk of crossing that lake.

Simply taking the greedy minimum among all edge costs does not take into account the probability of various outcomes happening in the market. The right way to structure this problem is to think about edge weights being random variables that change over time. In order to compute the expected cost, we need to integrate over all possible path costs that can manifest. Hold this thought, as we will need to introduce some more terminology in the next few sections.

While the arbitrage system I implemented was capable of detecting arb opportunities, I never got around to fully automating the execution and order confirmation subsystems. Unfortunately, I got some coins stolen and lost interest in cryptos shortly after. To execute arb opportunities quickly and cheaply I had to keep small BTC/LTC/DOGE positions in each exchange, but sometimes exchanges would just vanish into thin air. Be careful of what you wish for, or you just might find your money "decentralized'' from your wallet!

Directional Shortest-Path

Let's introduce another cost function, the directional shortest path $\mathcal{L}_u(v, s \to v)$, that computes the shortest path from $u$ to $v$, where the last traversed edge is from $s \to v$. Just like making a final stop at the bathroom $s$ before boarding an airplane $v$.

Note that the original shortest path cost $\mathcal{L}_u(v)$ is equivalent to the smallest directional shortest path cost among all of $v$'s neighboring vertices, i.e. $\mathcal{L}_u(v) = \min_{s} \mathcal{L}_u(v, s \to v)$

Shortest-path algorithms typically associate edges with costs, and the objective is to minimize the total cost. This is also equivalent to trying to maximize the negative cost of the path, which we call $\mathcal{Q}_u = -\mathcal{L}_u(v)$. Additionally, we can re-write this max-reduction as a sum-reduction, where each $\mathcal{Q}_u$ term is multiplied by an indicator function that is $1$ when its $\mathcal{Q}_u$ term is the largest and $0$ otherwise.
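One consistent way to spell these steps out is sketched below. It uses the edge cost $\mathcal{E}(s, v)$ from the Bellman-Ford section, and it assumes a directional form $\mathcal{Q}_u(v, s \to v) = -\mathcal{L}_u(v, s \to v)$ and a symbol $t$ for the predecessors of $s$, both introduced here for illustration:

$$
\begin{aligned}
\mathcal{L}_u(v, s \to v) &= \mathcal{E}(s, v) + \min_{t} \mathcal{L}_u(s, t \to s) \\
\mathcal{Q}_u(v, s \to v) &= -\mathcal{E}(s, v) + \max_{t} \mathcal{Q}_u(s, t \to s) \\
&= -\mathcal{E}(s, v) + \sum_{t} \mathbb{1}\Big[t = \arg\max_{t'} \mathcal{Q}_u(s, t' \to s)\Big]\, \mathcal{Q}_u(s, t \to s)
\end{aligned}
$$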

Does this remind you of any well-known algorithm?

If you guessed "Q-Learning", you are absolutely right!

Q-Learning

Reinforcement learning (RL) problems entail an agent interacting with its environment such that the total expected reward $R$ it receives is maximized over a multi-step (maybe infinite) decision process. In this setup, the agent will be unable to take further actions or receive additional rewards after transitioning to a terminal (absorbing) state.

There are many ways to go about solving RL problems, and we'll discuss just one kind today: value-based algorithms, which attempt to recover a value function $Q(s,a)$ that computes the maximum total reward an agent can possibly obtain if it takes an action $a$ at state $s$.

Wow, what a mouthful! Here's a diagram of what's going on along with an annotated mathematical expression.

Re-writing the shortest path relaxation procedure in terms of a directional path cost recovers the Bellman Equality, which underpins the Q-Learning algorithm. It's no coincidence that Richard Bellman of Bellman-Ford is also the same Richard Bellman of the Bellman Equality! Q-learning is a classic example of dynamic programming.

For those new to Reinforcement Learning, it's easiest to understand Q-Learning in the context of an environment that yields a reward only at the terminal transition:

The values of state-action pairs $(s_T, a_T)$ that transition to a terminal state are easy to learn: each is just the sparse reward received as the episode ends, since the agent can't do anything afterwards.
Once we have all those final values, they are "backed up" (backwards through time) to the state-action pairs $(s_{T-1}, a_{T-1})$ that transition to them.
This continues all the way to the state-action pairs $(s_1, a_1)$ encountered at the beginning of episodes.
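The backward propagation described in these steps can be watched directly on a tiny example. The sketch below runs Bellman backups over a known, deterministic three-state chain, so it is Q-value iteration with a known model rather than the sample-based Q-learning update. The chain environment, the `stay`/`right` actions, and the discount factor of 0.9 are all assumptions made for illustration.

```python
GAMMA = 0.9            # discount factor (an assumption for this sketch)
NUM_STATES = 3         # non-terminal states 0, 1, 2
TERMINAL = 3           # entering this state ends the episode
ACTIONS = ("stay", "right")

def step(state, action):
    """Deterministic chain: 'right' moves one state closer to the terminal."""
    next_state = state + 1 if action == "right" else state
    reward = 1.0 if next_state == TERMINAL else 0.0  # reward only at the end
    return next_state, reward

def bellman_sweep(q):
    """One synchronous backup of every (state, action) pair."""
    new_q = {}
    for s in range(NUM_STATES):
        for a in ACTIONS:
            next_state, reward = step(s, a)
            if next_state == TERMINAL:
                future = 0.0  # nothing can be earned after termination
            else:
                future = max(q[(next_state, b)] for b in ACTIONS)
            new_q[(s, a)] = reward + GAMMA * future
    return new_q

q = {(s, a): 0.0 for s in range(NUM_STATES) for a in ACTIONS}
history = []  # value of moving right from the start state after each sweep
for _ in range(3):
    q = bellman_sweep(q)
    history.append(q[(0, "right")])
```

The start state learns nothing for the first two sweeps: the terminal reward has to travel back one transition per sweep, just like distances spreading outward one edge per iteration in Bellman-Ford.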

Handling Randomness in Shortest-Path Algorithms

Remember the "thin ice" analogy from currency arbitrage? Let's take a look at how modern RL algorithms are able to handle random path costs.

In RL, the agent's policy distribution $\pi(a|s)$ is a conditional probability distribution over actions, specifying how the agent behaves randomly in response to observing some state $s$. In practice, policies are made to be random in order to facilitate exploration of environments whose dynamics and set of states are unknown (e.g. imagine the RL agent opens its eyes for the first time and must learn about the world before it can solve a task). Since the agent's sampling of action $a \sim \pi(a|s)$ from the policy distribution is immediately followed by computation of environment dynamics $s^\prime = f(s, a)$, it's equivalent to view randomness as coming from a stochastic policy distribution or stochastic transition dynamics. We redefine a notion of Bellman consistency for expected future returns:

By propagating expected values, Q-learning allows for shortest-path algorithms to essentially be aware of the expected path length, and take transition probabilities of dynamics/policies into account.
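One common way to write such a consistency condition is shown below. It is an assumed form using the symbols $r(s, a)$ and $\gamma$ for reward and discount, and the expectation runs over whichever source of randomness (dynamics or policy) is in play:

$$Q(s, a) = \mathbb{E}_{s^\prime, a^\prime}\left[ r(s, a) + \gamma\, Q(s^\prime, a^\prime) \right], \qquad a^\prime \sim \pi(\cdot \mid s^\prime)$$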

Modern Q-Learning

This section discusses some recent breakthroughs in RL research, such as Q-value overestimation, Softmax Temporal Consistency, Maximum Entropy Reinforcement Learning, and Distributional Reinforcement Learning. These cutting-edge concepts are put into the context of shortest-path algorithms as discussed previously. If any of these sound interesting and you're willing to endure a bit more math jargon, read on -- otherwise, feel free to skip to the next section on computer graphics.

Single-step Bellman backups during Q-learning turn out to be rather sensitive to random noise, which can make training unstable. Randomness can come from imperfect optimization over actions during the Bellman Update, poor function approximation in the model, random label noise (e.g. human error in assigning labels to a robotic dataset), stochastic dynamics, or uncertain observations (partial observability). All of these can violate the Bellman Equality, which may cause learning to diverge or get stuck in a poor local minimum.

Sources of noise that arise in Q-learning which violate the hard Bellman Equality.

A well-known problem among RL practitioners is that Q-learning suffers from over-estimation; during off-policy training, predicted Q-values climb higher and higher but the agent doesn't get better at solving the task. Why does this happen?

Even if $Q_\theta$ is an unbiased estimator of the true value function, any variance in the estimate is converted into upward bias during the Bellman update. A sketch of the proof: assuming Q values are uniformly or normally distributed about the true value function, the Fisher–Tippett–Gnedenko theorem tells us that the maximum over multiple normally-distributed variables tends toward a Gumbel distribution with a positive mean. Therefore the updated Q function, after the Bellman update is performed, will pick up a positive bias! One way to deal with this is double Q-learning, which re-evaluates the optimal next-state action value using an i.i.d. $Q$ function. Assuming Q-value noise is independent of the max action, the use of an i.i.d. $Q$ function for scoring the best actions makes max-Q estimation unbiased again.
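A small simulation makes the bias visible. This is a minimal sketch under stated assumptions: every true Q-value is zero, the estimates carry independent Gaussian noise, and the function name and defaults are hypothetical.

```python
import random


def max_q_bias(num_actions=10, noise_std=1.0, num_trials=20000, seed=0):
    """Compare a single noisy max against a double-estimator max.

    All true Q-values are 0, so any nonzero average is pure bias.
    Returns (average single-estimator max, average double-estimator value).
    """
    rng = random.Random(seed)
    single_total = 0.0
    double_total = 0.0
    for _ in range(num_trials):
        # Two independent, unbiased, noisy estimates of the same Q-values.
        q_a = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        q_b = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        # Single estimator: select and evaluate with the same noisy values.
        single_total += max(q_a)
        # Double estimator: select the action with q_a, evaluate it with q_b.
        best = max(range(num_actions), key=lambda i: q_a[i])
        double_total += q_b[best]
    return single_total / num_trials, double_total / num_trials
```

The first returned number sits well above zero even though every individual estimate is unbiased, while the second stays close to zero because the noise used for selection is independent of the noise used for evaluation.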

Dampening Q values can also be accomplished crudely by decreasing the discount factor (0.95 is common for environments like Atari), but $\gamma$ is kind of a hack as it is not a physically meaningful quantity in most environments.

Yet another way to decrease overestimation of Q values is to "smooth" the greediness of the max-operator during the Bellman backup, by taking some kind of weighted average over Q values, rather than a hard max that only considers the best expected value. In discrete action spaces with $K$ possible actions, the weighted average is also known as a "softmax" with a temperature parameter:

$$\verb|softmax|(x, \tau) = \mathbf{w}^T \mathbf{x}$$

where

$$\mathbf{w}_i = \frac{e^{\mathbf{x}_i/\tau}}{\sum_{j=1}^{K}{e^{\mathbf{x}_j/\tau}}}$$

Intuitively, the "softmax" can be thought of as a confidence penalty on how likely we believe $\max Q(s^\prime, a^\prime)$ to be the actual expected return at the next time step. Larger temperatures in the softmax drag the mean away from the max value, resulting in more pessimistic (lower) Q values. Because of this temperature-controlled softmax, our reward objective is no longer simply to "maximize expected total reward"; rather, it is more similar to "maximizing the top-k expected rewards". In the infinite-temperature limit, all Q-values are averaged equally and the softmax becomes a mean, corresponding to the return of a completely random policy. Hold that thought, as this detail will be visited again when we discuss computer graphics!

This modification to the standard Hard-Max Bellman Equality is known as Softmax Temporal Consistency. In continuous action spaces, the backup through an entire episode can be thought of as repeatedly backing up expectations over integrals.
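The discrete softmax backup can be sketched in a few lines. The function names are hypothetical, and the maximum is subtracted before exponentiating purely for numerical stability (it does not change the weights).

```python
import math


def softmax_weights(x, tau):
    """Weights w_i proportional to exp(x_i / tau), computed stably."""
    m = max(x)
    exps = [math.exp((xi - m) / tau) for xi in x]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_value(x, tau):
    """Temperature-weighted average w^T x of the values in x."""
    w = softmax_weights(x, tau)
    return sum(wi * xi for wi, xi in zip(w, x))
```

For example, with Q-values `[1.0, 2.0, 3.0]`, a very small `tau` gives a value close to the hard max of 3, and a very large `tau` gives a value close to the plain mean of 2.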

By introducing a confidence penalty as an implicit regularization term, our optimization objective is no longer optimizing for the cumulative expected reward from the environment. In fact, if the policy distribution has the form of a Boltzmann Distribution:

$$\pi(a|s) \sim \exp Q(s, a)$$

This softmax regularization has a very explicit, information-theoretic interpretation: it is the optimal solution for the Maximum-Entropy RL objective:

$$\pi_{\mathrm{MaxEnt}}^* = \arg\!\max_{\pi} \mathbb{E}_{\pi}\left[ \sum_{t=0}^T r_t + \mathcal{H}(\pi(\cdot | \mathbf{s}_t)) \right]$$

An excellent explanation for the maximum entropy principle is reproduced below from Brian Ziebart's PhD thesis:

When given only partial information about a probability distribution, $\tilde{P}$, typically many different distributions, $P$, are capable of matching that information. For example, many distributions have the same mean value. The principle of maximum entropy resolves the ambiguity of an under-constrained distribution by selecting the single distribution that has the least commitment to any particular outcome while matching the observational constraints imposed on the distribution.

This is nothing more than "Occam's Razor" in the parlance of statistics. The Maximum Entropy Principle is a framework for limiting overfitting in RL models, as it limits the amount of information (in nats) contained by the policy. The more entropy a distribution has, the less information it contains, and therefore the fewer "assumptions" about the world it makes. The combination of Softmax Temporal Consistency with Boltzmann Policies is known as Soft Q-Learning.
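To make the entropy argument concrete, here is a minimal sketch of a Boltzmann policy over a discrete action set together with its entropy in nats. The temperature argument `tau` is an added assumption (the expression above corresponds to `tau = 1`), and both function names are hypothetical.

```python
import math


def boltzmann_policy(q_values, tau=1.0):
    """Action probabilities proportional to exp(Q / tau)."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / tau) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]


def entropy_nats(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

Raising `tau` flattens the policy and pushes its entropy toward the maximum of `log(K)` nats for `K` actions, which is the regime where the policy commits to the fewest assumptions.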

To draw a connection back to currency arbitrage and the world of finance, limiting the number of assumptions in a model is of paramount importance to quantitative researchers at hedge funds, since hundreds of millions of USD could be at stake. Quants have developed a rather explicit form of Occam's Razor by tending to rely on models with as few statistical priors as possible, such as Linear models and Gaussian Process Regression with simple kernels.

Although Soft Q-Learning can regularize against model complexity, updates are still backed up over single timesteps. It is often more effective to integrate rewards with respect to a "path" of samples actually sampled at data collection time than to back up expected Q values one edge at a time and hope that softmax temporal consistency holds up well when accumulating multiple backups.

Work from Nachum et al. 2017, O’Donoghue et al. 2016, and Schulman et al. 2017 explores the theoretical connections between multi-step return optimization objectives (policy-based) and temporal consistency (value-based) objectives. The use of a multi-step return can be thought of as a path-integral solution to marginalizing out random variables occurring during a multi-step decision process (such as random non-Markovian dynamics). In fact, long before Deep RL research became popular, control theorists had been using path integrals for optimal control to tackle the problem of integrating multi-step stochastic dynamics [1, 2]. A classic example is the use of the Viterbi Algorithm in stochastic planning.
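As a point of reference, a plain multi-step return along one sampled path can be computed as below. This is only the generic n-step return, not the PCL or PGQ objectives from the cited papers, and the function name and default discount are assumptions.

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """Discounted sum of observed rewards plus a discounted bootstrap value.

    rewards: rewards actually observed along the sampled path, in time order.
    bootstrap_value: value estimate at the state reached after the last reward.
    """
    g = bootstrap_value
    # Fold the path from the end back to the start.
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

With a single reward this reduces to the usual one-step backup target; with a longer list, more of the target comes from rewards that were actually sampled and less from the bootstrapped estimate.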

Once trained, the value function $Q(s,a)$ implies a sequence of actions an agent must do in order to maximize expected reward (this sequence does not have to be unique). In order for the $Q$ function to be correct, it must also implicitly capture knowledge about the expected dynamics that occur along the sequence of actions. It's quite remarkable that all this "knowledge of the world and one's own behavior" can be captured into a single scalar.

However, this representational compactness can also be a curse!

Soft Q-learning and PGQ/PCL successfully back up expected values over some return distribution, but it's still a lot to ask of a neural network to capture all the knowledge about expected future dynamics and marginalize all the randomness into a single statistic.

We may be interested in propagating other statistics like variance, skew, and kurtosis of the value distribution. What if we did Bellman backups over entire distributions, without having to throw away the higher-order moments?

This actually recovers the motivation of Distributional Reinforcement Learning, in which "edges" in the shortest path algorithm propagate distributions over values rather than collapsing everything into a scalar. The main contribution of the seminal Bellemare et al. 2017 paper is defining an algebra that generalizes the Bellman Equality to operate on distributions rather than scalar statistics of them. Unlike the path-integral approach to Q-value estimation, this framework avoids marginalization error by passing richer messages in the single-step Bellman backups.
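The idea of backing up a whole distribution can be illustrated with a deliberately simplified representation: a list of `(value, probability)` atoms that are free to move. This is not the categorical projection used by Bellemare et al.; the representation and all function names are assumptions made for this sketch.

```python
def backup_distribution(reward, gamma, next_dist):
    """Apply z -> reward + gamma * z to every atom of a discrete return distribution."""
    return [(reward + gamma * v, p) for v, p in next_dist]


def mix(branches):
    """Mix return distributions over a stochastic transition.

    branches: list of (branch_probability, distribution) pairs.
    """
    out = []
    for branch_prob, dist in branches:
        out.extend((v, branch_prob * p) for v, p in dist)
    return out


def mean(dist):
    return sum(v * p for v, p in dist)


def variance(dist):
    mu = mean(dist)
    return sum(p * (v - mu) ** 2 for v, p in dist)
```

If a transition leads with equal probability to a state whose return is certainly 0 and a state whose return is certainly 10, then with reward 1 and discount 0.9 the backed-up distribution has mean 5.5, which matches the scalar backup, while also carrying a variance of 20.25 that the scalar backup discards.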

Soft-Q learning, PGQ/PCL, and Distributional Reinforcement Learning are "probabilistically aware" reinforcement learning algorithms. They appear to be tremendously beneficial in practice, and I would not be surprised if by next year it becomes widely accepted that these techniques are the "physically correct" thing to do, and hard-max Q-learning (as done in standard RL evaluations) is discarded. Given that multi-step Soft-Q learning (PCL) and Distributional RL take complementary approaches to propagating value distributions, I'm also excited to see whether the approaches can be combined (e.g. policy gradients over distributional messages).

Physically-Based Rendering

Ray tracing is not slow, computers are. -- James Kajiya

A couple of the aforementioned RL works make heavy use of the terminology "path integrals". Do you know where else path integrals and the need for "physical correctness" arise? Computer graphics!

Whether it is done by an illustrator's hand or a computer, the problem of rendering asks "Given a scene and some light sources, what is the image that arrives at a camera lens?". Every rendering procedure, from the first abstract cave painting to Disney's modern Hyperion renderer, is a depiction of light transported from the world to the eye of the observer.

Here are some examples of the enormous strides rendering technology has made in the last 20 years:

From top left, clockwise: Big City Overstimulation by Gleb Alexandrov. Pacific Rim: Uprising. The late Peter Cushing resurrected for a Star Wars movie. Removing Henry Cavill's mustache to re-shoot some scenes, because he needs the mustache for another movie.

Photorealistic rendering algorithms are made possible thanks to accurate physical models of how light behaves and interacts with the natural world, combined with the computational resources to actually represent the natural world in a computer. For instance, a seemingly simple object like a butterfly wing has an insane amount of geometric detail, and light interacts with this geometry to produce some macroscopic effect like iridescence.

Light transport involves far too many calculations for a human to do by hand, so the old master painters and illustrators came up with a lot of rules about how light behaves and interacts with everyday scenes and objects. Here are some examples of these rules:

Cold light has a warm shadow, warm light has a cool shadow.
Light travels through tree leaves, resulting in umbras that are less "hard" than a platonic sphere or a rock.
Clear water and bright daylight result in caustics.
Light bounces off flat water like a billiard ball with a perfectly reflected incident angle, but choppy water turns white and no longer behaves like a mirror.

You can get quite far on a big bag of heuristics like these. Here are some majestic paintings from the Hudson River School (19th century).

Albert Bierstadt, Scenery in the Grand Tetons, 1865-1870

Albert Bierstadt, Among the Sierra Nevada Mountains, California, 1868

Mortimer Smith: Winter Landscape, 1878

However, a lot of this painterly understanding -- though breathtaking -- was non-rigorous and physically inaccurate. Scaling this up to animated sequences was also very laborious. It wasn't until 1986, with the independent discovery of the rendering equation by David Immel et al. and James Kajiya, that we obtained physically-based rendering algorithms.

Of course, the scene must obey the conservation of energy transport: the electromagnetic energy being fed into the scene (via radiating objects) must equal the total amount of electromagnetic energy being absorbed, reflected, or refracted in the scene. Here is the rendering equation explained in an annotated equation:
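In one common notation (the symbols are assumed here: $L_o$ is outgoing radiance, $L_e$ emitted radiance, $L_i$ incoming radiance, $f_r$ the surface scattering function, $\mathbf{n}$ the surface normal, and $\Omega$ the set of incoming directions), the rendering equation reads:

$$L_o(\mathbf{x}, \omega_o) = L_e(\mathbf{x}, \omega_o) + \int_{\Omega} f_r(\mathbf{x}, \omega_i, \omega_o)\, L_i(\mathbf{x}, \omega_i)\, (\omega_i \cdot \mathbf{n})\, \mathrm{d}\omega_i$$

The incoming radiance $L_i$ at one point is the outgoing radiance $L_o$ of whatever surface lies in direction $\omega_i$, which is what makes the equation recursive.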

A Monte Carlo estimator is a method for estimating high-dimensional integrals, by simply taking the expectation over many independent samples of an unbiased estimator. Path-tracing is the simplest Monte-Carlo approximation possible to the rendering equation. I've borrowed some screenshots from Disney's very excellent tutorial on production path tracing to explain how "physically-based rendering" works.
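The basic recipe can be sketched on a one-dimensional integral, chosen only so the answer is easy to check by hand; the same averaging applies unchanged in high dimensions. The function name and defaults are hypothetical.

```python
import random


def mc_integrate(f, a, b, num_samples=100000, seed=0):
    """Estimate the integral of f over [a, b] by averaging f at uniform random points."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        total += f(rng.uniform(a, b))
    # Mean of f under the uniform density, rescaled by the interval length.
    return (b - a) * total / num_samples
```

For instance, `mc_integrate(lambda x: x * x, 0.0, 1.0)` should land close to the exact value of 1/3, with an error that shrinks roughly as one over the square root of the number of samples.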

Initially, the only thing visible to the camera is the light source. Let there be light!

A stream of photons is emitted from the light and strikes a surface (in this case, a rock). It can be absorbed into non-visible energy, reflected off the object, or refracted into the object.

Any reflected or refracted light is emitted from the surface and continues in another random direction, and the process repeats until there are no photons left or it is absorbed by the camera lens.

This process is repeated ad infinitum for many rays until the inflow vs. outflow of photons reaches equilibrium or the artist decides that the computer has been rendering for long enough. The total light contribution to a surface is a path integral over all these light bounce paths.

This equation has applications beyond entertainment: the inverse problem is studied in astrophysics simulations (given observed radiance of a supernova, what are the properties of its nuclear reactions?), and the neutron transport problem. In fact, Monte Carlo methods for solving integral equations were developed for studying fissile reactions for the Manhattan Project! The rendering integral is also an inhomogeneous Fredholm equation of the second kind, which has the general form:

$${\displaystyle \varphi (t)=f(t)+\lambda \int _{a}^{b}K(t,s)\varphi (s)\,\mathrm {d} s.}$$
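An equation of this form can be estimated by a random walk that closely mirrors path tracing: evaluate the source term, then with some probability take another "bounce" to a random point and accumulate its weighted source term. The sketch below assumes uniform sampling on $[a, b]$, a fixed continuation probability, and a kernel small enough for the underlying series to converge; all names are hypothetical.

```python
import random


def estimate_phi(t, f, kernel, lam, a, b, rng, continue_prob=0.5):
    """One random-walk sample of phi(t) = f(t) + lam * integral_a^b K(t, s) phi(s) ds."""
    total = f(t)
    weight = 1.0
    # Each loop iteration is one more "bounce"; stopping is random (Russian roulette).
    while rng.random() < continue_prob:
        s = rng.uniform(a, b)
        # Integrand factor divided by (uniform sampling density * continuation probability).
        weight *= lam * kernel(t, s) * (b - a) / continue_prob
        total += weight * f(s)
        t = s
    return total


def average_estimate(t, f, kernel, lam, a, b, num_samples=20000, seed=0):
    """Average many independent random-walk samples."""
    rng = random.Random(seed)
    samples = (estimate_phi(t, f, kernel, lam, a, b, rng) for _ in range(num_samples))
    return sum(samples) / num_samples
```

As a sanity check, with $f = 1$, $K = 1$, $\lambda = 0.5$ on $[0, 1]$ the exact solution is the constant $\varphi = 2$, and the averaged estimate should land near that value.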

Take another look at the rendering equation. Déjà vu, anyone?

Once again, path tracing is nothing more than the Bellman-Ford heuristic encountered in shortest-path algorithms! The rendering integral is taken over the $4\pi$ steradians of surface area on a unit sphere, which cover all directions an incoming light ray can come from. If we interpret this area integration probabilistically, this is nothing more than the expectation (mean) over directions sampled uniformly from a sphere.

This equation takes the same form as the high-temperature softmax limit for Soft Q-learning! Recall that as $\tau \to \infty$, softmax converges to an expectation over a uniform distribution, i.e. a policy distribution with maximum entropy and no information. Light rays have no agency, they merely bounce around the scene like RL agents taking completely random actions!
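The "integral over the sphere equals a mean over uniformly sampled directions" view can be sketched directly. The sampling scheme (uniform height plus uniform azimuth) and the function names are choices made for this illustration.

```python
import math
import random


def sample_unit_sphere(rng):
    """Uniform random direction on the unit sphere: uniform z in [-1, 1], uniform azimuth."""
    z = rng.uniform(-1.0, 1.0)
    phi = rng.uniform(0.0, 2.0 * math.pi)
    r = math.sqrt(max(0.0, 1.0 - z * z))
    return (r * math.cos(phi), r * math.sin(phi), z)


def integrate_over_sphere(f, num_samples=50000, seed=0):
    """Estimate the integral of f over all 4*pi steradians as 4*pi times the mean of f."""
    rng = random.Random(seed)
    total = sum(f(sample_unit_sphere(rng)) for _ in range(num_samples))
    return 4.0 * math.pi * total / num_samples
```

Integrating the clamped cosine `max(0, z)` this way should give a value near $\pi$, the familiar result for a cosine-weighted hemisphere. Every direction is weighted equally here, which is the "maximum entropy policy" reading of a light bounce.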

The astute reader may wonder whether there is also a corresponding "hard-max" version of rendering, just as hard-max Bellman Equality is to the Soft Bellman Equality in Q-learning.

The answer is yes! The recursive raytracing algorithm (invented before path-tracing, actually) was a non-physical approximation of light transport that assumes the largest of lighting contributions reflected off a surface comes from one of the following light sources:

Emitting material
Direct exposure to light sources
Strongly reflected light (i.e. surface is a mirror)
Strongly refracted light (i.e. surface is made of glass or water).

In the case of reflected and refracted light, recursive trace rays are branched out to perform further ray intersection, usually terminating at some fixed depth.

Raytracing approximation to the rendering equation.

Because ray tracing only considers the maximum contribution directions, it is not able to model indirect light, such as light bouncing off a bright wall and bleeding into an adjacent wall. Although these contributions are minor in toy setups like Cornell Boxes, they play a dominant role in rendering pictures of snow, flesh, and food.

Below is a comparison of a ray-traced image and a path-traced image. The difference is like night and day:

Prior work has drawn connections between light transport and value-based reinforcement learning, and in fact Dahm and Keller 2017 leverage Q-learning to learn optimal selection of "ray bounce actions" to accelerate importance sampling in path tracing. Much of the physically-based rendering literature considers the problem of optimal importance sampling to minimize variance of the path integral estimators, resulting in less "noisy" images.

For more information on physically-based rendering, I highly recommend Benedikt Bitterli's interactive tutorial on 2D light transport, Pat Hanrahan's book chapter on Monte Carlo Path Tracing, and the authoritative PBR textbook.

Summary and Questions

We have 3 very well-known algorithms (currency arbitrage, Q-learning, path tracing) that independently discovered the principle of relaxation used in shortest-path algorithms such as Dijkstra's and Bellman-Ford. Remarkably, each of these disparate fields of study discovered notions of hard and soft optimality, which are relevant in the presence of noise or high-dimensional path integrals. Here is a table summarizing the equations we explored:

These different fields have quite a lot of ideas that could be cross-fertilized. Just to toss some ideas out there (a request for research, if you will):

There has been some preliminary work on using optimal control to reduce sample complexity of path tracing algorithms. Can sampling algorithms used in rendering be leveraged for reinforcement learning?
Path tracing integrals are fairly expensive because states and actions are continuous and each bounce requires ray-intersecting a geometric data structure. What if we do light transport simulations on a point cloud with a precomputed visibility matrix between all points, and use that as an approximation for irradiance caching / final-gather?
Path tracing is to Soft Q-Learning as Photon Mapping is to ...?
Has anyone ever tried using the Maximum Entropy principle as a regularization framework for financial trading strategies?
The selection of a proposal distribution for importance-sampled Monte Carlo rendering could utilize Boltzmann Distributions with soft Q-learning. This is nice because the proposal distribution over recursive ray directions has infinite support by construction, and Soft Q-learning can be used to tune random exploration of light rays.
Is there a distributional RL interpretation of path tracing, such as polarized path tracing?
Given the equivalence between Q Learning and shortest path algorithms, it's interesting to note that in Deep RL research, we carefully initialize weights but leave the Q-function values fairly arbitrary. However, all shortest-path algorithms rely on initializing path costs to infinity (equivalently, negative costs to negative infinity), so that costs being propagated during relaxation correspond to actually realizable paths. Why aren't we initializing all function values to negative-valued numbers?
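For the last question in the list, it helps to see where the initialization enters a shortest-path routine. The following is a minimal Bellman-Ford sketch, assuming a graph given as `(u, v, cost)` edge triples with no negative cycles; the function name and graph encoding are hypothetical.

```python
import math


def bellman_ford(num_nodes, edges, source):
    """Shortest-path costs from source; edges is a list of (u, v, cost) triples."""
    # Every node starts as "unreachable"; only the source has a known cost.
    cost = [math.inf] * num_nodes
    cost[source] = 0.0
    # At most num_nodes - 1 rounds of relaxation are needed without negative cycles.
    for _ in range(num_nodes - 1):
        for u, v, c in edges:
            if cost[u] + c < cost[v]:
                cost[v] = cost[u] + c  # relaxation: a cheaper realizable path was found
    return cost
```

Because an unreachable node keeps an infinite cost, a finite cost only ever appears once some real path has delivered it. In the negative-cost (Q-value) view, the same initialization is negative infinity everywhere except at the source.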

Acknowledgements

I'm very grateful to Austin Chen, Deniz Oktay, Ofir Nachum, and Vincent Vanhoucke for proofreading and providing feedback to this post. All typos/factual errors are my own; please write to me if you spot additional errors. And finally, thank you for reading!

Eric Jang

Robots Must Be Ephemeralized

Two Flavors of Sim2Real

The Case For Iterating Directly In Real

Suffering From Success: Evaluating General Purpose Robots

Ephemeralization

Ephemeralization for Robotics

Acknowledgements

ML Mentorship: Some Q/A about RL

A Note on Categorizing RL Algorithms

Stonks are What You Can Get Away With: NFTs and Financial Nihilism

Explaining NFTs using Counterfeit Goods

The Riddle of Intangible Value

Artistic and Financial Nihilism: One and The Same?

Acknowledgements

Sovereign Arcade: Currency as High-Margin Infrastructure

Large Companies as Nation-States

The Network State

Crypto Whales

Summary

Further reading and Acknowledgements

Science and Engineering for Learning Robots

The Deep Learning Revolution

Software 2.0

How Much Should We Learn in Robotics?

Three reasons for end-to-end learning

Fused Perception-to-Action in Nature

The Trouble With Defining Anything

Cooking is Not Software 1.0

Science and Engineering of End-to-End ML

Interesting Problems

Compiling Software 2.0 Capable of Lifelong Learning

Train on Short Sequences and It Just Works

Hierarchical Computation

Parallel Evolution

Summary

Don't Mess with Backprop: Doubts about Biologically Plausible Deep Learning

How to Understand ML Papers Quickly

Software and Hardware for General Robots

My Criteria for Reviewing Papers

My Criteria

Opportunities for Non-Traditional Researchers

Chaos and Randomness

Preliminaries

Definition of Chaos

Chaos in the Logistic Family

Spatial Precision Error + Chaos = Randomness

Free Office Hours for Non-Traditional ML Researchers

Three Questions that Keep Me Up at Night

Selected Quotes from "The Dark Ages of AI Panel Discussion"

Differentiable Path Tracing on the GPU/TPU

Part I: Geometry

Differentiable Scene Intersection with Distance Fields

Building Up Our Scene

Computing Surface Normals

Cosine-Weighted Sampling

Camera Model

Part II: Light Simulation

Radiometry From First Principles

Different Ways to Integrate Radiance

Integrating Over Solid Angle

Integrating Over Projected Solid Angle

Integrating Over Light Area

Making Rendering Computationally Tractable with Path Integrals

A Naive Path Tracer

Reducing Variance by Splitting Up Indirect Lighting

Ignoring Photometry

Performance Benchmarks: P100 vs. TPUv2

Summary

Acknowledgements

Fun Facts

Robinhood, Leverage, and Lemonade

Lemonade Leverage

Lemonade Coupons

Acknowledgements

Normalizing Flows in 100 Lines of JAX

Install Dependencies

Toy Dataset

Affine Coupling Layer in JAX

Stacking Coupling Layers