## Monday, September 20, 2021

### Robots Must Be Ephemeralized

There is a subfield of robotics research called “sim-to-real” (sim2real), in which one attempts to solve a robotic task in simulation and then get a real robot to do the same thing in the real world. My team at Google uses sim2real techniques extensively in pretty much every domain we study, including locomotion, navigation, and manipulation.

The arguments for doing robotic research in simulation are generally well-known in the community: more statistical reproducibility, less concern about safety issues during learning, avoiding the operational complexity of maintaining thousands of robots that wear down at differing rates. Sim2real is utilized heavily on quadruped and five-fingered hand platforms because, at present, such hardware can only be operated for a few hundred trials before it starts to wear down or break. When the dynamics of the system vary from episode to episode, learning becomes even more difficult.

In a previous blog post, I also discussed how iterating in simulation solves some tricky problems around new code changes invalidating old data. Simulation makes this a non-issue because it is relatively cheap to re-generate your dataset every time you change the code.

Despite significant sim2real advances in the last decade, I must confess that three years ago, I was still somewhat ideologically opposed to doing robotics research in simulation, on the grounds that we should revel in the richness and complexity of real data, as opposed to perpetually staying in the safe waters of simulation.

Following those beliefs, I worked on a three-year-long robotics project where our team eschewed simulation and focused the majority of our time on iterating in the real world (mea culpa). That project was a success, and the paper will be presented at the 2021 Conference on Robot Learning (CoRL). However, in the process, I learned some hard lessons that completely reversed my stance on sim2real and offline policy evaluation. I now believe that offline evaluation technology is no longer optional if you are studying general-purpose robots, and I have pivoted my research workflows to rely much more heavily on these methods. In this blog post, I outline why it is tempting for roboticists to iterate directly in real life, and how the difficulty of evaluating general-purpose robots will eventually force us to rely increasingly on offline evaluation techniques such as simulation.

### Two Flavors of Sim2Real

I’m going to assume the reader is familiar with basic sim2real techniques. If not, please check out this RSS’2020 workshop website for tutorial videos. There are broadly two ways to formalize sim2real problems.

One approach is to create an “adapter” that transforms simulated sensor readings to resemble real data as much as possible, so that a robot trained in simulation behaves indistinguishably in both simulation and real. Progress on generative modeling techniques such as GANs has enabled this to work even for natural images.

Another formulation of the sim2real problem is to train simulated robots under lots of randomized conditions. In becoming robust under varied conditions, the simulated policy can treat the real world as just another instance under the training distribution. OpenAI’s Dactyl took this “domain randomization” approach and was able to get the robot to manipulate a Rubik’s cube without ever doing policy learning on real data.

Both the domain adaptation and domain randomization approaches in practice yield similar results when transferred to real, so their technical differences are not super important. The takeaway is that the policy is learned and evaluated on simulated data, then deployed in real with fingers crossed.
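To make the domain randomization idea concrete, here is a minimal sketch in Python. The parameter names and ranges are invented for illustration and are not taken from Dactyl or any real sim2real system:

```python
import random

def sample_episode_params():
    """Sample a fresh set of simulation parameters for each training episode.
    Parameter names and ranges here are illustrative placeholders only."""
    return {
        "friction": random.uniform(0.5, 1.5),
        "object_mass_kg": random.uniform(0.05, 0.5),
        "motor_gain": random.uniform(0.8, 1.2),
        "camera_hue_shift": random.uniform(-0.1, 0.1),
    }

def train(policy_update, num_episodes=1000):
    """Every episode runs under different conditions, so the learned policy
    must become robust enough to treat the real world as one more sample."""
    for _ in range(num_episodes):
        params = sample_episode_params()
        policy_update(params)  # collect rollouts and update under these params
```

The key design choice is that the randomization happens per episode, so the policy never gets to overfit to any single instantiation of the physics.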

### The Case For Iterating Directly In Real

Three years ago, my primary arguments against sim were related to the richness of data available to real vs simulated robots:
1. Reality is messy and complicated. It takes regular upkeep and effort to maintain neatness for a desk or bedroom or apartment. Meanwhile, robot simulations tend to be neat and sterile by default, with not a lot of “messiness” going on. In simulation, you must put in extra work to increase disorder, whereas in the real world, entropy increases for free. This acts as a forcing function for roboticists to focus on the scalable methods that can handle the complexity of the real world.
2. Some things are inherently difficult to simulate - in the real world, you can have robots interact with all manner of squishy toys and articulated objects and tools. Bringing those objects into a simulation is incredibly difficult. Even if one uses photogrammetry technology to scan objects, one still needs to set-dress objects in the scene to make a virtual world resemble a real one. Meanwhile, in the real world one can collect rich and diverse data by simply grabbing the nearest household object - no coding required.
3. Bridging the “reality gap” is a hard research problem (often requiring training high-dimensional generative models), and it’s hard to know whether these models are helping until one is running actual robot policies in the real world anyway. It felt more pragmatic to focus on direct policy learning in the test setting, where one does not have to wonder whether the training distribution differs from the test distribution.

To put those beliefs into context, at the time, I had just finished working on Grasp2Vec and Time-Contrastive-Networks, both of which leveraged rich real-world data to learn interesting representations. The neat thing about these papers was that we could train these models on whatever object (Grasp2Vec) or video demonstration (TCN) the researcher felt like mixing into the training data, and scale up the system without writing a single line of code. For instance, if you want to gather a teleoperated demonstration of a robot playing with a Rubik’s cube, you simply need to buy a Rubik’s cube from a store and put it into the robot workspace. In simulation, you would have to model a simulated equivalent of a Rubik’s cube that twists and turns just like a real one - this can be a multi-week effort just to align the physical dynamics correctly. It didn’t hurt that the models “just worked”; there wasn’t much iteration needed on the modeling front before we started seeing cool generalization.

There were two more frivolous reasons I didn’t like sim2real:

Aesthetics: Methods that learn in simulation often rely on crutches that are only possible in simulation, not in the real world - for example, using millions of trials with an online policy-gradient method (PPO, TRPO), or resetting the simulation over and over again. As someone who is inspired by the sample efficiency of humans and animals, and who believes in the LeCake narrative of using unsupervised learning algorithms on rich data, relying on a “simulation crutch” to learn feels too ham-handed. A human doesn’t need to suffer a fatal accident to learn how to drive a car.

A “no-true-Scotsman” bias: I think there is a tendency for people who spend all their time iterating in simulation to forget the operational complexity of the real world. Truthfully, I may have just been envious of others who were publishing 3-4 papers a year on new ideas in simulated domains, while I was spending time answering questions like “why is the gripper closing so slowly?”

### Suffering From Success: Evaluating General Purpose Robots

So how did I change my mind? Many researchers at the intersection of ML and Robotics are working towards the holy grail of “generalist robots that can do anything humans ask them”. Once you have the beginnings of such a system, you start to notice a host of new research problems you didn’t think of before, and this is how I came to realize that I was wrong about simulation.

In particular, there is a “Problem of Success”: how do we go about improving such generalist robots? If the success rate is middling, say, 50%, how do we accurately evaluate a system that can generalize to thousands or millions of operating conditions? The feeling of elation that a real robot has learned to do hundreds of things -- perhaps even things that people didn’t train them for -- is quickly overshadowed by uncertainty and dread of what to try next.

Let’s consider, for example, a generalist cooking robot - perhaps a bipedal humanoid that one might deploy in any home kitchen to cook any dish, including Wozniak’s Coffee Test (A machine is required to enter an average American home and figure out how to make coffee: find the coffee machine, find the coffee, add water, find a mug, and brew the coffee by pushing the proper buttons).

In research, a common metric we’d like to know is the average success rate - what is the overall success rate of the robot at performing a number of different tasks around the kitchen?

In order to estimate this quantity, we must average over the set of all things the robot is supposed to generalize to, by sampling different tasks, different starting configurations of objects, different environments, different lighting conditions, and so on.

For a single scenario, it takes a substantial number of trials to measure success rates to within a percentage point:

The standard deviation of a binomial parameter is given by sqrt(P*(1-P)/N), where P is the sample mean and N is the sample size. If your empirical mean of the success rate is 50% under N=5000 samples, this equation tells you that the standard error is 0.007. A more intuitive way to understand this is in terms of a confidence interval: there is a 95% epistemic probability that the true mean, which may not be exactly 50%, lies within the range [50 - 1.4, 50 + 1.4] percentage points.
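As a quick sanity check, here is the arithmetic above in a few lines of Python:

```python
import math

def binomial_stderr(p, n):
    """Standard error of an empirical success rate p measured over n trials."""
    return math.sqrt(p * (1 - p) / n)

se = binomial_stderr(0.5, 5000)   # ≈ 0.00707
ci_half_width = 1.96 * se         # 95% CI half-width ≈ 0.0139, i.e. ±1.4 points
print(f"95% CI: [{50 - 100 * ci_half_width:.1f}%, {50 + 100 * ci_half_width:.1f}%]")
# → 95% CI: [48.6%, 51.4%]
```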

5000 trials is a lot of work! Rarely do real robotics experiments run even 300 evaluations, let alone 3000, to measure task success.

Vincent Vanhoucke’s blog post has a table drawing a connection from your sample size (under the worst case of p=50%, which maximizes the standard error) to the number of significant digits you can report.

Depending on the length of the task, it could take all day or all week or all month to run one experiment. Furthermore, until robots are sufficiently capable of resetting their own workspaces, a human supervisor needs to reset the workspace over and over again as one goes through the evaluation tasks.

One consequence of these napkin calculations is that pushing the frontier of robotic capability requires either a series of incremental advances (e.g. 1% at a time) with extremely costly evaluation (5000 episodes per iteration), or a series of breakthrough advances so large in magnitude that very few samples are needed to know the result is significant. Going from “not working at all” to “kind of working” is one example of a large statistical leap, but in general it is hard to pull these off over and over again.

Techniques like A/B testing can help reduce the variance of estimating whether one model is better than another one, but it still does not address the problem of the sample complexity of evaluation growing exponentially with the diversity of conditions the ML models are expected to generalize to.

What about a high-variance, unbiased estimator? One approach would be to sample a location at random, then a task at random, and then an initial scene configuration at random, and then aggregate thousands of such trials into a single “overall success estimator”. This is tricky to work with because it does not help the researcher drill into problems where learning under one set of conditions causes catastrophic forgetting of another. Furthermore, if the number of training tasks is many times larger than the number of evaluation samples and task successes are not independent, then the overall success estimate will have high variance.

What about evaluating general robots with a biased, low-variance estimator of the overall task success? We could train a cooking robot to make millions of dishes, but only evaluate on a few specific conditions - for example, measuring the robot’s ability to make banana bread and using that as an estimator for its ability to do all the other tasks. Catastrophic forgetting is still a problem - if the success rate of making banana bread is inversely correlated with the success rate of making stir-fry, then you may be crippling the robot in ways that you are no longer measuring. Even if that isn’t a problem, having to collect 5000 trials limits the number of experiments one can evaluate on any given day. Also, you end up with a lot of surplus banana bread.

The following is a piece of career advice, rather than a scientific claim: in general you should strive to be in a position where your productivity bottleneck is the number of ideas you can come up with in a single day, rather than some physical constraint that limits you to one experiment per day. This is true in any scientific field, whether it be in biology or robotics.

Lesson: Scaling up in reality is fast because it requires little to no additional coding, but once you have a partially working system, careful empirical evaluation in real life becomes increasingly difficult as you increase the generality of the system.

### Ephemeralization

In his 2011 essay Software is Eating The World, venture capitalist Marc Andreessen pointed out that more and more of the value chain in every sector of the world was being captured by software companies. In the ensuing decade, Andreessen has refined his idea further to point out that “Software Eating The World” is a continuation of a technological trend, Ephemeralization, that precedes even the computer age. From Wikipedia:

Ephemeralization, a term coined by R. Buckminster Fuller in 1938, is the ability of technological advancement to do "more and more with less and less until eventually you can do everything with nothing."

Consistent with this theme, I believe the solution to scaling up generalist robotics is to push as much of the iteration loop into software as possible, so that the researcher is freed from the sheer slowness of having to iterate in the real world.

Andreessen has posed the question of how future markets and industries might change when everybody has access to such massive leverage via “infinite compute”. ML researchers know that “infinite” is a generous approximation - it still costs 12M USD to train a GPT-3 level language model. However, Andreessen is directionally correct - we should dare to imagine a near future where compute power is practically limitless to the average person, and let our careers ride this tailwind of massive compute expansion. Compute and informational leverage are probably still the fastest growing resources in the world.

Software is also eating research. I used to work in a biology lab at UCSF, where only a fraction of postdoc time was spent thinking about the science and designing experiments. The majority of time was spent pipetting liquids into PCR plates, making gel media, inoculating petri dishes, and generally moving liquids around between test tubes. Today, it is possible to run a number of “standard biology protocols” in the cloud, and one could conceivably spend most of their time focusing on the high-brow experiment design and analysis rather than manual labor.

Imagine a near future where instead of doing experiments on real mice, we instead simulate a highly accurate mouse behavioral model. If such models turn out to be accurate, then medical science will be revolutionized overnight by virtue of researchers being able to launch massive-scale studies with billions of simulated mouse models. A single lab might be able to replicate a hundred years of mouse behavioral studies practically overnight. A scientist working from a coffee shop might be able to design a drug, run clinical trials on it using a variety of cloud services, and get it FDA-approved, all from her laptop. When this happens, Fuller’s prediction will come true and it really will seem as if we can do “everything with nothing”.

### Ephemeralization for Robotics

The most obvious way to ephemeralize robot learning in software is to make simulations that resemble reality as closely as possible. Simulators are not perfect - they still suffer from the reality gap and data richness problems that originally made me skeptical of iterating in simulation. But, having worked on general purpose robots directly in the real world, I now believe that people who want high-growth careers should actively seek workflows with highest leverage, even if it means putting in the legwork to make a simulation as close to reality as possible.

There may be ways to ephemeralize robotic evaluation without having to painstakingly hand-design Rubik’s cubes and human behavior into your physics engine. One solution is to use machine learning to learn world models from data, and have the policy interact with the world model instead of the real world for evaluation. If learning high-dimensional generative models is too hard, there are off-policy evaluation methods and offline hyperparameter selection methods that don’t necessarily require simulation infrastructure. The basic intuition is that if you have a value function for a good policy, you can use it to score other policies on your real-world validation datasets. The downside to these methods is that they often require finding a good policy or value function to begin with, and they are only accurate for ranking policies up to the level of the aforementioned policy itself. A Q(s,a) function for a policy with a 70% success rate can tell you if your new model is performing around 70% or 30%, but is not effective at telling you whether you will get 95% (since these models don’t know what they don’t know). Some preliminary research suggests that extrapolation can be possible, but it has not yet been demonstrated at the scale of evaluating general-purpose robots on millions of different conditions.
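The idea of scoring candidate policies with a fixed value function can be sketched in a few lines. Everything here - the 1-D "states", the stand-in critic, and the two policies - is a toy invented for illustration; real off-policy evaluation methods are considerably more involved:

```python
import numpy as np

def score_policy(q_fn, policy, validation_states):
    """Estimate a policy's quality by averaging Q(s, policy(s)) over held-out
    real-world states. Only meaningful up to the competence of the policy that
    Q was trained on -- it cannot reliably extrapolate beyond that level."""
    return float(np.mean([q_fn(s, policy(s)) for s in validation_states]))

# Toy 1-D setup: the critic rewards actions close to the state value.
q_fn = lambda s, a: -abs(s - a)   # stand-in for a learned critic
good_policy = lambda s: s + 0.1   # nearly optimal under this critic
bad_policy = lambda s: s + 2.0    # systematically wrong

states = np.linspace(-1.0, 1.0, 101)  # "validation dataset" of real states
assert score_policy(q_fn, good_policy, states) > score_policy(q_fn, bad_policy, states)
```

No real-world rollouts are needed to produce this ranking - that is the entire appeal, and also why the ranking is only as trustworthy as the critic itself.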

What are some alternatives to more realistic simulators? Much like the “lab in the cloud” business, there are some emerging cloud-hosted benchmarks such as AI2Thor and MPI’s Real Robot Challenge, where researchers can simply upload their code and get back results. The robot cloud provider handles all of the operational aspects of physical robots, freeing the researcher to focus on software.

One drawback of these setups is that these hosted platforms are designed for repeatable, resettable experiments, and do not have the diversity that general purpose robots would be exposed to.

Alternatively, one could follow the Tesla Autopilot approach and deploy their research code in “shadow mode” across a fleet of robots in the real world, where the model only makes predictions but does not make control decisions. This exposes evaluation to high-diversity data that cloud benchmarks don’t have, but suffers from the long-term credit assignment problem. How do we know whether a predicted action is good or not if the agent isn’t allowed to take those actions?

For these reasons, I think data-driven realistic simulation gets the best of both worlds - you get the benefits of diverse real-world data and the ability to evaluate simulated long-term outcomes. Even if you are relying heavily on real-world evaluations via a hosted cloud robotics lab or a fleet running in shadow mode, a complementary software-only evaluation provides additional signal that can only help save costs and time.

I suspect that a practical middle ground is to combine multiple signals from offline metrics to predict success rate: leveraging simulation to measure success rates, training world models or value functions to help predict what will happen in “imagined rollouts”, adapting simulation images to real-like data with GANs, and using old-fashioned data science techniques (logistic regression) to study the correlations between these offline metrics and real evaluated success. As we build more general AI systems that interact with the real world, I predict that there will be cottage industries dedicated to building simulators for sim2real evaluation, and data scientists who build bespoke models for guessing the result of expensive real-world evaluations.
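As a sketch of the "old-fashioned data science" step, here is a plain gradient-descent logistic regression (no ML library) that predicts real-world success from one offline metric. The data is synthetic and the single-metric setup is a deliberate simplification:

```python
import numpy as np

def fit_logistic(X, y, lr=0.5, steps=2000):
    """Gradient-descent logistic regression: predict real-world success (y)
    from offline metrics (X), e.g. a simulated success rate per policy."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))      # predicted success probability
        w -= lr * X.T @ (p - y) / len(y)      # gradient of the logistic loss
    return w

# Synthetic illustration: real success is correlated with sim success rate.
rng = np.random.default_rng(0)
sim_rate = rng.uniform(0, 1, 200)
real_success = (rng.uniform(0, 1, 200) < sim_rate).astype(float)
X = np.stack([np.ones_like(sim_rate), sim_rate], axis=1)  # bias + metric
w = fit_logistic(X, real_success)
assert w[1] > 0  # a higher sim success rate predicts higher real success
```

In practice one would stack several offline metrics (sim success, value estimates, GAN-adapted rollout scores) as columns of `X` and use the fitted model to triage which policies deserve an expensive real-world evaluation.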

Separately from how ephemeralization drives down the cost of evaluating robots in the real world, there is the effect of ephemeralization driving down the cost of robot hardware itself. It used to be that robotics labs could only afford a couple of expensive robot arms from KUKA and Franka. Each robot would cost hundreds of thousands of dollars, because they had precisely engineered encoders and motors that enabled millimeter-level precision. Nowadays, you can buy some cheap servos from AliExpress for a few hundred dollars, bolt them to some metal plates, and control them in a closed-loop manner using a webcam and a neural network running on a laptop.

Instead of relying on precise hardware position control, the arm moves based purely on vision and hand-eye coordination. All the complexity has been migrated from hardware to software (and machine learning). This technology is not mature enough yet for factories and automotive companies to replace their precision machines with cheap servos, but the writing is on the wall: software is coming for hardware, and this trend will only accelerate.

### Acknowledgements

Thanks to Karen Yang, Irhum Shafkat, Gary Lai, Jiaying Xu, Casey Chu, Vincent Vanhoucke, Kanishka Rao for reviewing earlier drafts of this essay.

## Friday, July 30, 2021

### ML Mentorship: Some Q/A about RL

One of my ML research mentees is following OpenAI's Spinning Up in RL tutorials (thanks to the nice folks who put that guide together!). She emailed me some good questions about the basics of Reinforcement Learning, and I wanted to share some of my replies on my blog in case they further other students' understanding of RL.

 The classic Sutton and Barto diagram of RL

Your “How to Understand ML Papers Quickly” blog post recommended asking ourselves “what loss supervises the output predictions” when reading ML papers. However, in SpinningUp, it mentions that “minimizing the ‘loss’ function has no guarantee whatsoever of improving expected return” and “loss function means nothing.” In this case, what should we look for instead when reading DRL papers if not the loss function?

Policy optimization algorithms like PPO train by minimizing some loss, which in the most naive implementation is the (negative) expected return at the current policy's parameters. So in reference to my blog post, this is the "policy gradient loss" that supervises the current policy's predictions.

It so happens that this loss function is defined with respect to data $\mathcal{D}(\pi^i)$ sampled by the *current* policy, rather than data sampled i.i.d from a fixed / offline dataset as commonly done in supervised learning. So if you change the policy from $\pi^i \to \pi^{i+1}$, then re-computing the policy gradient loss for $\pi^{i+1}$ requires collecting some new environment data $\mathcal{D}(\pi^{i+1})$ with $\pi^{i+1}$. Computing the loss function has special requirements (you have to annoyingly gather new data every time you update), but at the end of the day it is still a loss that supervises the training of a neural net, given parameters and data.
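To make this concrete, here is a minimal sketch of such a naive policy-gradient "loss" for a discrete-action softmax policy. The shapes and the toy numbers are invented for illustration; the crucial caveat is that `returns` must come from rollouts sampled by the current policy:

```python
import numpy as np

def pg_loss(logits, actions, returns):
    """Naive policy-gradient 'loss': negative return-weighted log-likelihood
    of the actions the CURRENT policy just sampled. After each parameter
    update, fresh rollouts are needed before this can be recomputed."""
    # log-softmax over the action dimension
    logp = logits - np.log(np.sum(np.exp(logits), axis=1, keepdims=True))
    chosen = logp[np.arange(len(actions)), actions]  # log pi(a_t | s_t)
    return -np.mean(chosen * returns)

# Toy batch: 2 timesteps, 2 actions, uniform policy (all-zero logits).
loss = pg_loss(np.zeros((2, 2)), np.array([0, 1]), np.array([1.0, 2.0]))
# Every log-prob is log(0.5), so loss = -mean([log(0.5)*1, log(0.5)*2]) ≈ 1.04
assert abs(loss - 1.0397) < 1e-3
```

Note that minimizing this quantity only pushes up the likelihood of return-weighted actions; as discussed above, its numerical value is not the policy's evaluated performance.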

On "loss function means nothing": the Spinning Up docs are correct in saying that the loss you minimize is not actually the evaluated performance of the policy, in the same way that minimizing cross entropy loss maximizes accuracy while not telling you what the accuracy is. In a similar vein, the loss value for $\pi^i, \mathcal{D}(\pi^i)$ is decreased after a policy gradient update. You can assume that if your new policy sampled the exact same trajectory as before, the resultant reward would be the same, but your loss would be lower. Vice versa, if your new policy samples a different trajectory, you can probably assume that there will be a monotonic increase in reward as a result of taking each policy gradient step (assuming step size is correct and that you could re-evaluate the loss under a sufficiently large distribution).

However, you don't know how much decrease in loss translates to increase in reward, due to non-linear sensitivity between parameters and outputs, and further non-linear sensitivity between outputs and rewards returned by the environment. A simple illustrative example of this: a fine-grained manipulation task with sparse rewards, where the episode return is 1 if all actions are done within a 1e-3 tolerance, and 0 otherwise. A policy update might result in each of the actions improving the tolerance from 1e-2 to 5e-3, and this policy achieves a lower "loss" according to some Q function, but still has the same reward when re-evaluated in the environment.

Thus, when training RL it is not uncommon to see the actor loss go down but the reward stay flat, or vice versa (the actor loss stays flat but the reward goes up). It's usually not a great sign to see the actor loss blow up though!

Why do people in DRL frequently set up algorithms to optimize the undiscounted return, but use discount factors when estimating value functions?

See https://stats.stackexchange.com/questions/221402/understanding-the-role-of-the-discount-factor-in-reinforcement-learning. In addition to avoiding infinite sums from a mathematical perspective, the discount factor actually serves as an important hyperparameter when tuning RL agents. It biases the optimization landscape so that agents prefer the same reward sooner rather than later. Finishing an episode sooner also allows agents to see more episodes, which indirectly improves the amount of search and exploration a learning algorithm can do. Additionally, discounting produces a symmetry-breaking effect that further reduces the search space. In a sparse-reward environment with $\gamma=1$ (no discounting), an agent would be equally happy to do nothing on the first step and then complete the task as to do the task straight away. Discounting makes the task easier to learn because the agent can learn that there is only one preferable action at the first step.
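Here is a tiny worked example of that symmetry-breaking effect, using a sparse-reward episode that pays 1 on completion and 0 otherwise:

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over an episode."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

act_now = discounted_return([0, 0, 1], gamma=0.9)        # finish in 3 steps: 0.81
idle_first = discounted_return([0, 0, 0, 1], gamma=0.9)  # waste one step: 0.729

assert act_now > idle_first   # discounting breaks the tie in favor of acting now
# With gamma = 1, both behaviors earn the same return, so neither is preferred:
assert discounted_return([0, 0, 1], 1.0) == discounted_return([0, 0, 0, 1], 1.0)
```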

In model-based RL, why does embedding planning loops into policies make model bias less of a problem?

Here is an example that might illustrate how planning helps:

Given a good Q function $Q(s,a)$, you can recover a policy $\pi(a|s)$ by performing the search procedure $\arg\max_a Q(s,a)$ to find the action with the best expected (discounted) future return. A search algorithm like grid search is computationally expensive, but guaranteed to work because it covers all the possibilities.

Imagine that instead of search, you use a neural network "actor" to amortize the "search" process into a single pass through a neural network. This is what Actor-Critic algorithms do: they learn a critic, and use the critic to learn an actor that performs an amortized search over $\arg\max_a Q(s,a)$.

Whenever you can use brute-force search on the critic instead of an actor, it is better to do so. This is because an actor network (amortized search) can make mistakes, while brute-force search is slow but does not make mistakes.

The above example illustrates the simplest example of a 1-step planning algorithm, where "planning" is actually synonymous with "search". You can think about the act of searching for the best action with respect to $Q(s, a)$ as being equivalent to "planning for the best future outcome", where $Q(s,a)$ evaluates your plan.

Now imagine you have a perfect model of the dynamics, $p(s'|s,a)$, and an okay-ish Q function that has function approximation errors in some places. Instead of just selecting the best Q value and action at a given state, the agent can now consider future states and the Q values encountered at the next set of actions. By using a plan and an "imagined rollout" of the future, the agent can query $Q(s,a)$ along every state in the trajectory and potentially notice inconsistencies in the Q function. For instance, Q might be high at the beginning of the episode but low at the end of the episode despite taking the greedy action at each state. This would immediately tell you that the Q function is unreliable for some states in the trajectory.

A well-trained Q function should respect the Bellman equality, so if you have a Q function and a good dynamics model, then you can actually check your Q function for self-consistency at inference time to make sure it satisfies the Bellman equality, even before taking any actions.
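Here is a toy sketch of such a self-consistency check on a three-state chain MDP. The MDP, reward function, and Q tables are all invented for illustration:

```python
# Chain MDP: states 0 -> 1 -> 2 (terminal). Action 1 moves forward, action 0
# stays put. Reward 1 for stepping from state 1 into the terminal state.
GAMMA = 0.9
dynamics = lambda s, a: min(s + a, 2)
reward = lambda s, a: 1.0 if (s == 1 and a == 1) else 0.0

def bellman_residuals(Q, trajectory):
    """How far Q deviates from the Bellman equation along an imagined rollout,
    using the (perfect) model and reward instead of real environment steps."""
    out = []
    for s, a in trajectory:
        s_next = dynamics(s, a)
        target = reward(s, a) + GAMMA * max(Q[(s_next, b)] for b in (0, 1))
        out.append(abs(Q[(s, a)] - target))
    return out

# A self-consistent (optimal) Q table versus one with a corrupted entry.
Q_good = {(0, 0): 0.81, (0, 1): 0.9, (1, 0): 0.9, (1, 1): 1.0,
          (2, 0): 0.0, (2, 1): 0.0}
Q_bad = {**Q_good, (1, 1): 0.2}

rollout = [(0, 1), (1, 1)]
assert max(bellman_residuals(Q_good, rollout)) < 1e-9   # consistent everywhere
assert max(bellman_residuals(Q_bad, rollout)) > 0.5     # flags the bad state
```

A large residual at some state tells you, before taking a single real action, that the Q function should not be trusted there.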

One way to think of a planning module is that it "wraps" a value function $Q_\pi(s,a)$ and gives you a slightly better version of the policy, since it uses search to consider more possibilities than the neural-net amortized policy $\pi(a|s)$. You can then take the trajectory data generated by the better policy and use that to further improve your search amortizer, which yields the "minimal policy improvement technique" perspective from Ferenc Huszár.

When talking about data augmentation for model-free methods, what is the difference between “augment[ing] real experiences with fictitious ones in updating the agent” and “us[ing] only fictitious experience for updating the agent”?

If you have a perfect world model, then all you need is to train an agent on "imaginary rollouts" and then it will be exactly equivalent to training the agent on the real experience. In robotics this is really nice because you can train purely in "mental simulation" without having to wear down your robots. Model-Ensemble TRPO is a straightforward paper that tries these ideas.

Of course, in practice no one ever learns a perfect world model, so it's common to use the fictitious (imagined) experience as a supplement to real interaction. The real interaction data provides some grounding in reality for both the imagination model and the policy training.

How to choose the baseline (function b) in policy gradients?

The baseline should be chosen to minimize the variance of the gradients while keeping the estimate of the learning signal unbiased. Here is a talk that covers this material in more detail: https://www.youtube.com/watch?v=ItI_gMuT5hw. You can also google terms like "variance reduction policy gradient" and "control variates reinforcement learning". I have a blog post on variance reduction, which also discusses control variates: https://blog.evjang.com/2016/09/variance-reduction-part1.html

Consider episode returns for 3 actions = [1, 10, 100]. Clearly the third action is by far the best, but if you take a naive policy gradient, you end up increasing the likelihood of the bad actions too! Typically $b=V(s)$ is sufficient, because it turns $Q(s,a)$ into the advantage $A(s,a) = Q(s,a)-V(s)$, which has the desired effect of increasing the likelihood of good actions, keeping the likelihood of neutral actions the same, and decreasing the likelihood of bad actions. Here is a paper that applies an additional control variate on top of advantage estimation to further reduce variance.
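Here is the [1, 10, 100] example worked out numerically, using the batch mean as a stand-in estimate of $V(s)$:

```python
import numpy as np

returns = np.array([1.0, 10.0, 100.0])   # episode returns for 3 sampled actions

naive_weights = returns                   # raw REINFORCE weights: every action,
                                          # including the worst, gets pushed up
baseline = returns.mean()                 # b = V(s), estimated here by the mean
advantages = returns - baseline           # A(s,a) = Q(s,a) - V(s)

assert (naive_weights > 0).all()          # naive PG reinforces all three actions
assert advantages[0] < 0 < advantages[2]  # advantage pushes the bad action down
assert np.isclose(advantages.sum(), 0.0)  # a mean baseline centers the weights
```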

How to better understand target policy smoothing in TD3?

In actor-critic methods, both the Q function and actor are neural networks, so it can be very easy to use gradient descent to find a region of high curvature in the Q function where the value is very high. You can think of the actor as a generator and a critic as a discriminator, and the actor learns to "adversarially exploit" regions of curvature in the critic so as to maximize the Q value without actually emitting meaningful actions.

All three of the tricks in TD3 are designed to mitigate the problem of the actor adversarially selecting an action with a pathologically high Q value. By adding noise to the input of the target Q network, target policy smoothing prevents the "search" from finding exact areas of high curvature. Like clipped double-Q learning (Trick 1 in TD3), it helps make the Q function estimates more conservative, thus reducing the likelihood of choosing over-estimated Q values.

#### A Note on Categorizing RL Algorithms

RL is often taught in a taxonomic layout, as it helps to classify algorithms based on whether they are "model-based vs. model-free", "on-policy vs. off-policy", or "supervised vs. unsupervised". But these categorizations are illusory, much like the spoon in The Matrix. There are actually many different frameworks and schools of thought from which one can independently derive the same RL algorithms, and they cannot always be neatly classified and separated from each other.

For example, it is possible to derive actor critic algorithms from both on-policy and off-policy perspectives.

Starting from off-policy methods, you have DQN, which uses the inductive bias of the Bellman equation to learn optimal policies via dynamic programming. You can then extend DQN to continuous actions via an actor network, which arrives at DDPG.

Starting from on-policy methods, you have REINFORCE, the vanilla policy gradient algorithm. You can add a value function as a control variate, which requires learning a critic network. This again re-derives something like PPO or DDPG.

So is DDPG an on-policy or off-policy algorithm? Depending on the frequency with which you update the critic vs. the actor, it starts to look more like an on-policy or an off-policy update. My colleague Shane has a good treatment of the subject in his Interpolated Policy Gradients paper.

## Saturday, June 19, 2021

### Stonks are What You Can Get Away With: NFTs and Financial Nihilism

 Eric Jang, "Ten Apes", Jun 19 2021. NFT "drop" coming soon.

Andy Warhol once said, “Art is what you can get away with.” I interpret the quote as a nihilistic take on “beauty is in the eye of the beholder” — a urinal you found in the junkyard can be considered art, so long as you convince someone to buy it, or showcase it in a museum. All that matters is what other people see in it and what buyers are willing to pay.

The 2020’s equivalent of Warhol paintings are Non-Fungible-Tokens (NFTs). In this essay I’ll explain what NFTs are by motivating them with some interesting real-world problems. Then I’ll discuss why the NFT craze for digital art generates so much ideologically contentious debate. Finally, I’ll discuss some parallels between artistic and financial nihilism, and how this might serve as a framework for thinking about wildly speculative markets.

### Explaining NFTs using Counterfeit Goods

Suppose you want to buy a Birkin bag or some other luxury brand item. An unauthorized seller — perhaps someone who needs some emergency cash — is willing to sell you a Birkin bag. They offer you a good discount, relative to the price the authorized retailer would charge you. But how can you be sure they aren’t selling you a fake? Counterfeits for these items are very high quality, and the average Birkin customer probably can’t tell the difference between a real and a fake.

One way to avoid counterfeits is to only purchase items from an authorized retailer, e.g. a trusted Hermès store. But this is not practical because it prevents people from selling or giving away their bags. If you leave your bag to someone in your will, then its authenticity is no longer guaranteed.

So we have the market need: how does a seller pass on or sell a luxury item? How does a buyer ensure that they are buying an authentic item?

One possible answer is for Hermès to print out a list of secret serial numbers, perhaps sewn inside the bag, that declare whether a bag is legit or not. Owners receive a serial number when they buy the bag. But this is not a strong deterrent. A counterfeiter could just buy a real bag and then copy its serial number into many fake bags.

What if Hermès maintains a public website of who owns which bag? Any time a bag changes ownership, this ledger needs to be updated. By recording a unique owner for each unique serial number, this solves the problem of counterfeiters simply duplicating serial numbers. The process shifts from verifying properties to verifying transactions and owners.
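The centralized ledger idea can be sketched in a few lines (a hypothetical toy model, not Hermès' actual system): each serial number maps to exactly one owner, so duplicated serial numbers cannot acquire extra owners, and a sale is just a recorded transfer.

```python
ledger = {}  # serial number -> current owner (toy centralized registry)

def register(serial, owner):
    """Issue a brand-new serial. Copying an existing serial fails."""
    if serial in ledger:
        raise ValueError("serial already registered")
    ledger[serial] = owner

def transfer(serial, seller, buyer):
    """Record a sale. Only the current registered owner can transfer."""
    if ledger.get(serial) != seller:
        raise ValueError("seller does not own this serial")
    ledger[serial] = buyer

register("BK-0001", "alice")         # the brand issues the bag to alice
transfer("BK-0001", "alice", "bob")  # alice sells the bag to bob
# A counterfeiter who copies "BK-0001" can neither register it again nor
# transfer it, because they are not the recorded owner.
```

The blockchain approach discussed next keeps exactly this ownership mapping, but replaces the single party maintaining the table with a decentralized consensus protocol.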

These approaches would work, but also have a centralized point of failure: If the Hermès website goes down, nobody can trade bags anymore. Hermès is a big company and has the resources to protect their website against DDOS attacks and other cybersecurity threat vectors, but smaller luxury brands might not have a state-of-the-art security department. If they are not careful, their security could be breached by hackers or an unscrupulous sysadmin. Also, if Hermès stops operating as a company in 25 years, who will maintain the ledger of ownership? If it is a third party company, can we trust them not to abuse that power? Even in the unlikely event that the central point of failure never makes a mistake, it’s still mildly annoying to require Hermès to get involved every time a bag changes hands.

What if you could verify transactions and owners without a centralized party? This is where Non-Fungible Tokens, or NFTs, come in. In 2008, someone published a landmark paper on how to build a decentralized ledger of who owns what. This ledger is called a "blockchain". A blockchain is a record of the consensus state of the world, following some agreed-upon protocol that is known to everyone. The remarkable thing about blockchains is that they are decentralized (no central point of failure) and resilient to malicious actors in the network. Distributed consensus is reached by each participant contributing some resource like money, hash rate, or computer storage. So long as a large fraction of the resources in the network is controlled by well-behaved actors, the integrity of the blockchain remains secure. The fraction required typically varies from one-third to just over a half.

There are many blockchains out there. The details of how their consensus protocols are implemented are fascinating but beyond the scope of this essay. The important thing to know is that the base technology underlying NFTs and cryptocurrencies is a formal protocol that allows people to come to an agreement on who owns what without having to involve a trusted third party (e.g. Hermès, an escrow agent, your bank, or your government). Theoretically speaking, blockchains allow shared consensus in a trustless society.

NFTs are like a paper deed of ownership, but instead of paper the certificate is digital. And unlike a paper deed an NFT cannot be forged. NFTs contain a unique “serial number” that is publicly viewable, but only one person can be said to “possess” that serial number on the blockchain, much like how home addresses are public but registered to a single owner by the recording office. To see how NFTs solve the Birkin bag counterfeit problem, let’s suppose Hermès publicly declares the following for all to hear:

“Owners of True Birkin bags will be issued a digital certificate of authenticity represented by an NFT”

As a buyer, you can be quite confident that the bag is authentic if the seller also owns the NFT, and you can verify that the NFT was indeed originally created by Hermès by looking up its public transaction history. During a transaction, the seller simply gives the buyer the bag and tells the blockchain to re-assign ownership of the NFT to the buyer's digital identifier. If the payment is done in cryptocurrency, the escrow can even be performed by a smart contract without a centralized party (the seller publishes a contract: "If a specific buyer's wallet address sends me X USDC within 24 hours, send the NFT to them and the cash to me.")
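The escrow logic can be sketched as follows (Python standing in for a smart-contract language, with hypothetical names and amounts): either the right buyer pays the full price before the deadline and both legs of the swap settle atomically, or nothing changes hands.

```python
import time

class EscrowContract:
    """Toy escrow: the NFT goes to the buyer and the cash to the seller
    if and only if the right buyer pays the full price before expiry."""
    def __init__(self, seller, nft_id, price_usdc, buyer, expires_at):
        self.seller, self.buyer = seller, buyer
        self.nft_id, self.price_usdc = nft_id, price_usdc
        self.expires_at = expires_at
        self.settled = False

    def pay(self, sender, amount_usdc, now=None):
        now = time.time() if now is None else now
        if self.settled or now > self.expires_at:
            raise RuntimeError("offer expired or already settled")
        if sender != self.buyer or amount_usdc != self.price_usdc:
            raise RuntimeError("wrong buyer or wrong amount")
        self.settled = True  # atomic settlement: both legs or neither
        return {"nft_to": self.buyer, "usdc_to": self.seller}

offer = EscrowContract("hermes_seller", "nft-123", 9000, "buyer-wallet",
                       expires_at=1_000_000)
result = offer.pay("buyer-wallet", 9000, now=999_999)
```

On a real chain this atomicity is enforced by the protocol itself, which is what removes the need for a trusted escrow agent.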

NFTs provide the means to implement digital scarcity, but there still needs to be a way to pair them with an item in the "analog" world. A seller could still bypass the security of NFTs by selling you an NFT with a fake Birkin bag. However, for every fake bag you want to sell, you need to purchase a real NFT and the real bag that comes with it. After you sell the NFT with the fake bag, you are left with a real bag and no NFT! The market value of that real bag then drops, because buyers will be highly suspicious of a seller who says "this is a real bag, I don't have the NFT because I just sold it with a fake bag." While NFTs are not ironclad proof of a physical Birkin bag's authenticity, they all but ruin the economic incentives of counterfeiting.

What about luxury consumable goods? You could buy NFT-certified Wagyu beef, sell the NFT with some cheaper steak, and then eat the real Wagyu beef - it doesn’t matter what other people think you're eating. However, NFT transactions are public, so a grocery shopper would be quite suspicious of a food NFT that has changed hands outside of the typical supply chain addresses. For NFTs paired with physical goods, each “unusual” transaction significantly adds to counterfeit risk, which diminishes the economic incentives to counterfeiters. This is especially true for consumable, perishable goods.

Authenticity is useful even outside of Veblen goods. You can imagine using NFTs to implement anonymous digital identity verification (a 30B market by 2024), or pairing them with food products like meat, where the customer cares a lot about the provenance of the product. In Taiwan, there is an ongoing scandal in which US-imported pork has been passed off as "domestic pork", and now nobody can trust their butchers anymore.

In the most general case, NFTs can be used to implement provenance tracking of both physical and digital assets - an increasingly important need in our modern age of disinformation. Where did this photo of a politician come from? Who originally produced this audio clip?

### The Riddle of Intangible Value

NFTs make a lot of sense for protecting the authenticity of luxury goods or implementing single sign-on or tracking the provenance of meat products, but that’s not what they’re primarily used for today. Rather, most people sell NFTs for digital art. Here are some early examples of art NFTs, called “Cryptopunks”. Each punk is a 24x24 RGB image.

One of these recently sold for 17M USD in an auction. At first glance, this is perplexing. The underlying digital content - some pixels stored in a file - is freely accessible to anyone. Why would anyone pay so much for a certificate of authenticity on something that anyone can enjoy for free? Is the buyer the one that gets punked?

It’s easy to dismiss this behavior as poor taste colliding with the arbitrarily large disposable income of rich people, in particular crypto millionaires that swap crypto assets with other crypto millionaires. While this may be true, I think it’s far more interesting to ask “what worldview would cause a rational person to bet \$17M on a certificate for a 24x24x3 set of pixel values”?

Historically, the lion's share of rewards for digital content has gone to distribution technology like Spotify or content aggregators like Facebook, with the remainder split with the management company. The creatives themselves are paid a pittance, and do not share in the financialization of their labor. The optimist's case for NFT art is as follows: NFTs are decentralized, which means any artist with an internet connection can draw up financial contracts for their art on their own terms. If NFTs revolutionize the business model of digital art, and if the future of art is mostly digital, then the first art NFTs ever issued might accrue significant cultural relevance, and that's why they command such high speculative prices.

Valuing art based on cultural relevance might be a bit absurd, but why is the Mona Lisa “The Mona Lisa”? da Vinci arguably made “better” paintings from a technical standpoint. It's because of intangible value. The Mona Lisa is valuable because of its cultural proximity to important events and people in history, and the mimetic desire of other humans. In fact, it was a relatively obscure painting until 1911, when it was stolen from the Louvre and became a source of national shame overnight.

All art, from your child's first finger painting, to an antique heirloom passed down through generations, to a "masterpiece" like the Mona Lisa, is valued this way. It is valuable simply because others deem it valuable.

NFTs are the digital equivalent of buying a banana duct-taped to a wall; you are betting that in the future, that statement of ownership on some blockchain will be historically significant, which you can presumably trade in for cash or clout or both. But buyer beware: things get philosophically tricky when applying the theory of "intangible value" to digital information and artwork, where the cost of replication goes to zero.

I can think of two ways to look at how one values NFTs for digital art. One perspective is that in a world full of fake Birkin bags and products sourced from ethically dubious places, the only thing of value is the certificate of authenticity. The cultural and mimetic value of content has transferred entirely to the provenance certificate, and not the pixels themselves (which can be copied for free). If art’s value is derived from the cultural relevance it represents and its proximity to important people, then the most sensible way to make high art would not be to improve one’s painting skills, but to schmooze with a lot of famous people and insert oneself into important events in history, and issue scarce status symbols for the bourgeoisie. Warhol did exactly that.

The alternate view is that if a perfect copy can be made of some pixels, then it is not really a counterfeit at all, and therefore the NFT secures nothing of actual value. Is it meaningful to ascribe a certificate of authenticity to something that can be perfectly replicated? Is “authenticity” of a stream of 0s and 1s meaningless? There is certainly utility in verifying the source of some information, but anyone can mint an NFT for the same information.

In summary, the pro-NFT crowd values the intangible: collector's scarcity and cultural relevance. The anti-NFT crowd focuses on tangible value: how much real value does this actually secure? Both are reasonable frameworks for valuing things, and they can lead to wildly different conclusions.

### Artistic and Financial Nihilism: One and The Same?

Convince enough people that a urinal is valuable, and it becomes an investment grade asset. This is no longer merely a matter of art philosophy - when you invest in an index fund, you are essentially reinforcing the market’s current belief of valuations. When people bid up the price of TSLA or GME to stratospheric valuations, the index fund must re-adjust their market-weighted holdings to reflect those prices, creating further money inflows to the asset and thus a self-fulfilling prophecy. As it turns out, the art-of-investing is much like investing-in-art. As I have suggested in the title of this essay and borrowed from Warhol (who probably borrowed it from Marshall McLuhan), stonks are what you can get away with.

We are starting to see this valuation framework being applied to the equities market today, where price movements are dominated by narratives about where the price is going and what other people are willing to pay for it, especially with meme stocks like GME and AMC. Many retail investors don’t really care about whether GME’s price is justified by their corporate earnings - they simply buy at any cost. This financial nihilism - where intrinsic value is unknowable and all that matters is what other people think - is a worldview often encountered in Gen Z retail traders and a surprising number of professional traders I know. Perhaps the midwit meme is really true.

This is definitely cause for some concern, but at the same time, I think value investors should keep an open mind: what first seems like irrational behavior might have a method to its madness. If an irrational force is acting in the markets, like shareholders who refuse to sell or lend their stock, a discounted cash flow model for AMC or GME stops being very predictive of share price. By reflexivity, that will have impacts on future cash flows! In a similar fashion, present-day frameworks for thinking about business and value do not account for the disruptive force of technology. That's why I find NFTs so fascinating: they are an intersection of finance, art, technology, and the nihilistic framework of valuation that is so prevalent in our society today.

What is rational behavior for an investor, anyway? Is it "standard behavior" as measured against the population average? How do you tell apart standard behavior from a collective delusion? Perhaps the luxury bag makers, Ryan Cohens, and Andy Warhols of the world understand it best: convince the world to believe in your values, and you will be the sanest person on the planet. For fifteen minutes, at least.

### Acknowledgements

Thanks to Cati Grasso, Sam Hoffman, Phúc Lê, Chung Kang Wang, Jerry Suh, and Ellen Jiang for comments and feedback on drafts of this post.

## Wednesday, May 26, 2021

### Sovereign Arcade: Currency as High-Margin Infrastructure

This essay is about how the powerful want to become countries, and the implications of cryptocurrencies on the sovereignty of nations. I’m not an economics expert: please leave a comment if I have made any errors.

Money allows goods, services, and everything else under the sun to be assigned a value using the same unit of measurement. Without money, society reverts to bartering, which is highly inefficient. You may need plumbing services but have nothing that the plumber wants, so your toilet remains clogged. By acting as a measure of value everyone agrees on, money facilitates frictionless economic collaboration between people.

Foreign monetary policy is surprisingly simple to understand when viewed through the lens of power and control. Nation states get nervous when other nation states get too powerful, and controlling the currency is a form of power.

To see why this is the case, let's consider a Gaming Arcade (yes, like Chuck E. Cheese) as a miniature model of a "Nation State". To participate in the "arcade economy", you must swap your outside money (USD) for arcade tokens.

Arcades are like mini nation-states: they issue their own currency, encourage spending with state-owned enterprises, and have a one-sided currency exchange to prevent money outflows.

The coins are a store of value that facilitate a one-way transaction with the Nation-State: you get to play an arcade game, and in return you get some entertainment value and some tickets, which we call “wages”.

The tickets are another store of value that can facilitate another one-way transaction: converting them into prizes. Prizes can be a stuffed animal or something else of value. Typically, the cost of winning a prize at an arcade is many multiples of what it would cost to just buy the prize at an outside store. The arcade captures that price difference as their profit.

Money's most important requirement is that it be a *stable* measure of value. Too much inflation, and people stop saving money. Too much deflation, and people and companies aren't incentivized to spend money (for example, on employing people). Imagine if tomorrow, an arcade coin let you play a game for two rounds instead of one, and the day after, four rounds! Well, no one would want to spend their coins on arcade games today anymore.

The arcade imposes many kinds of draconian capital controls, and in many ways resembles an extreme form of State Capitalism:
• All transactions are with state-owned enterprises (the arcade games) and must be conducted using state currencies (coins and tickets). You can’t start a business that takes people’s coins or tickets within the arcade.
• The state can hand out valuable coins at virtually zero cost without worrying about inflation - every coin they issue is backed by a round of a coin-operated game, of which they have near-infinite supply. They can’t hand out infinite tickets though, because that would either require backing it up with more prizes, or devaluing each ticket so that more tickets are needed to buy the same prize.
• You can bring outside money into the arcade, but you can’t convert coins, tickets, or prizes into money to take out.

Controlling the currency supply is a very powerful business to be in, which is why arcades prefer to issue their own currency and keep money from leaving their borders.

Governments are just like arcades. They prefer their citizens and trading partners to use a currency they control, because it gives them a lever with which they can influence spending behavior. If country A uses country B’s currency instead, then country B’s currency supply shenanigans can actually influence saving and spending behavior of country A. This can pose a threat to the sovereignty of a nation (a fancy way to say “control over its people”).

After World War II, the US Dollar became the world’s reserve currency, which means that it’s the currency used for the majority of international trade. The USA wants the world to buy oil with US dollars, and we go to great lengths to enforce it with various forms of soft and hard power. The US dollar is backed by oil (petrodollar theory), and this “dollars-are-oil rule” in turn is enforced by US military might.

Governments print money all the time to pay for short-term needs like building bridges and COVID relief. However, too much of this can be dangerous. The government gets what it wants in the short term, but more money chasing the same amount of goods causes businesses to raise prices, i.e. inflation. Countries like Venezuela and Turkey that print too much of their own currency experience a runaway feedback loop where the money supply and prices skyrocket, and then no one trusts the government currency as a stable store of value anymore.

The USA is not like other countries in this regard; controlling the world’s reserve currency gives the USA the ability to print money like no other country can. The US government owing 28 trillion USD of debt is like the Arcade owing you a trillion game coins. Yes, it is a lot of coins - maybe the arcade doesn’t even have a trillion coins to give you. But the arcade knows that you know that it’s in the best interest of everyone to not try and collect all those coins right away, because the arcade would go bankrupt, and then the coins you asked for would be worthless.

Is this sketchy? Absolutely. Most other countries absolutely hate this power dynamic. Especially China. The USA calls China a currency manipulator for devaluing the yuan, but will turn around and do the exact same thing by printing dollars. China does not want to be subject to the whims of US monetary policy, so they are working very hard to establish the yuan as the currency of exchange in international trade. Everyone wants to be the arcade operator, not the arcade player.

### Large Companies as Nation-States

Nation-states not only have to worry about the currencies of other nation-states, but increasingly, large global corporations as well. Any businesses that get big enough start to think about the currency game, since currency is a form of high-margin infrastructure.

AliPay is a mobile wallet made by an affiliate company of Alibaba. It's basically backed by an SQL table saying how much money each AliPay user has. It would be very easy for AliPay to print money - all they have to do is bump up a number in some row of the SQL table. As long as users are able to redeem their AliPay balance for something of equivalent value, Alibaba's accounts remain solvent and they can get away with this. In fact, many of their users shop on Alibaba's e-commerce properties anyway, so Alibaba doesn't even need 100% cash reserves to back up all entries in their SQL table. Users can redeem their balances by paying for Alibaba goods, which Alibaba presumably can acquire for less than the price the user pays for them.

Of course, outright printing money incurs the wrath of the Sovereign Arcade. Alibaba was severely punished for merely suggesting that they could do a better job than China’s banks. Facebook tried to challenge the dollar by introducing a token backed with other countries’ reserve currencies, and the idea was slapped down so hard that FB had to rename the project and start over. In contrast, the US government is happy to approve crypto tokens backed using the US dollar, because ultimately the US government controls the underlying resource.

There are clever ways to build high margin infrastructure without crossing the money-printing line. Any large institution with a monopoly over a high-margin resource can essentially mint debt for free, effectively printing currency like an arcade does with its coins. The resource can be a lot of things - coffee, cloud computing credits, energy, user data. In the case of a nation-state, the resource is simply violence and enforcement of the law.

As of 2019, Starbucks had 1.6B USD of gift cards in circulation, which puts it above the national GDP of about 20 countries. Like the arcade coins, Starbucks gift cards are only redeemable for limited things: scones and coffee. Starbucks can essentially mint Starbucks gift cards for free, and this doesn’t suffer from inflation because each gift card is backed by future coffee which Starbucks can also make at a marginal cost. You can even use Starbucks cards internationally, which makes “Star-Bucks” more convenient than current foreign currency exchange protocols.

As long as account balances are used to redeem a resource that the company can acquire cheaply (e.g. gift cards for coffee, gift cards for cloud computing, advertising credits), a large company could also practice “currency manipulation” by arbitrarily raising monetary balances in their SQL tables.

### The Network State

Yet another threat to the sovereign power is decentralized rogue nations, made possible by cryptocurrency. At the heart of cryptocurrency’s rise is a social problem in our modern, globalized society: how do we trust our sovereigns to actually be good stewards of our property? Banking executives who overleveraged risky investments got bailed out in 2008 by the US government. The USA printed a lot of money in 2020 to bail out those impacted by COVID-19 economic shutdowns. Every few weeks, we hear about data breaches in the news. A lot of Americans are losing trust in their institutions to protect their bank accounts, their privacy, and their economic interests.

Even so, most Americans still take the power of the dollar for granted: 1) our spending power remains stable and 2) the number we see in our bank accounts is ours to spend. We have American soft and hard diplomacy to thank for that. But in less stable countries, capital controls can be rather extreme: a bank may simply decide one day that you can’t withdraw more than 1 USD per day. Or some government can decide that you’re a criminal and freeze your assets entirely.

Cryptocurrency offers a simple answer: You can’t trust the sovereign, or the bank, or any central authority to maintain the SQL table of who owns what. Instead, everyone cooperatively maintains the record of ownership in a decentralized, trustless way. For those of you who aren’t familiar with how this works, I recommend this 26-minute video by 3Blue1Brown.

To use the arcade analogy, cryptocurrency would be like a group of teenagers going to the arcade, and instead of converting their money into arcade coins, they pool it together to buy prizes from outside. They bring their own games (Nintendo Switches or whatever), and then swap prizes with each other based on who wins. They get the fun value of hanging out with friends and playing games and prizes, while cutting the arcade operator out.

The decentralized finance (DeFi) ecosystem has grown a lot in the last few years. In the first few years of crypto, all you could do was send Bitcoin and other Altcoins to each other. Today, you can swap currencies in decentralized exchanges, take out flash loans, buy distressed debt at a discount, provide liquidity as a market maker, perform no-limit betting on prediction markets, pay a foreigner with USD-backed stablecoins, and cryptographically certify authenticity of luxury goods.

Balaji Srinivasan predicts that as decentralized finance projects continue to grow, a large group of individuals with a shared sense of values and territory will congregate on the internet and declare themselves citizens of a “Network State”. It sounds fantastical at first, but many of us already live in Proto-Network states. We do our work on computers, talk to people over the internet, shop for goods online, and spend leisure time in online communities like Runescape and such. It makes sense for a geographically distributed economy to adopt a digital-native currency that transcends borders.

Network states will have the majority of their assets located on the internet, with a small amount of physical property distributed around the world for our worldly needs. The idea of a digital rogue nation is less far-fetched than you might think. If you walk into a Starbucks or McDonalds or a Google Office or an Apple Store anywhere in the world, there is a feeling of cultural consistency, a familiar ambience. In fact, Starbucks gets pretty close: you go there to eat and work and socialize and pay for things with Starbucks gift cards.

A network state might have geographically distributed physical locations that have a consistent culture, with most of its assets and culture in the cloud. Pictured: Algebraist coffee, a new entrant into the luxury coffee brand space

A network state could have a national identity independent of physical location. I see no reason why a "Texan" couldn’t enjoy ranching and brisket and big cars and football anywhere in the world.

Balaji is broadly optimistic that existing sovereigns will be tolerant or even facilitate network states, by offering them economic development zones and tax incentives to establish their physical embodiments within their borders, in exchange for the innovation and capital they attract.

I am not quite so optimistic. The fact that US persons can now pseudonymously perform economic activities with anyone in the world (including sanctioned countries) without the US government knowing, using a currency that the US government cannot control, is a terrifying prospect to the sovereign. The world's governments greatly underestimate the degree to which future decentralized economies will upset the existing world order and power structures. Any one government can make life difficult for cryptocurrency businesses trying to get big, but as long as some countries are permissive towards them, it's hard to put that genie back into the bottle and prevent the emergence of a new digital economy.

### Crypto Whales

I think the biggest threat to the emergence of a network state is not existing sovereigns, but rather the power imbalance of early stakeholders versus new adopters.

At the time of writing, there are nearly 100 Bitcoin billionaires and 7062 Bitcoin wallets that hold more than $10M each. This isn't even counting other cryptocurrencies or DeFi wealth locked in Ethereum - the other day, someone bought up nearly a billion dollars of the meme currency DOGE. We mostly have no idea who these people are - they walk amongst us, and are referred to as "whales".

A billionaire’s taxes substantially alter state budget planning in smaller states, so politicians actually go out of their way to appease billionaires (e.g. Illinois with Ken Griffin). If crypto billionaires colluded, they could institute quite a lot of political change at local and maybe even national levels.

China has absolutely zero chill when it comes to any challenge to their sovereignty, so it was not surprising at all that they recently cracked down on domestic use of cryptocurrency. However, by shutting their miners down, I believe China is losing a strategic advantage in their quest to unseat America as the world superpower. A lot of crypto billionaires reside in China, having operated large mining pools and developing the world’s mining hardware early on. I think the smart move for China would have been to allow their miners to operate, but force them to sell their crypto holdings for digital yuan. This would peg crypto to the yuan, and also allow China to stockpile crypto reserves in case the world starts to use it more as a reserve currency.

There’s a chance that crypto might even overtake the yuan as the main challenger to the dollar’s reserve-currency status, because it’s easier to acquire in countries with strict capital controls (e.g. Venezuela, Argentina, Zimbabwe). If I were China, I’d hedge against both possibilities and try to control both.

Controlling miners has power implications far beyond stockpiling of crypto wealth. Miners play an important role in the market microstructure of cryptocurrency - they have the ability to see all potential transactions before they get permanently appended to the blockchain. The assets minted by miners are virtually untraceable. One way a Network State could be compromised is if China smuggled several crypto whales into these fledgling nations that are starting to adopt Bitcoin, and then used their influence over Bitcoin reserves, tax revenues, and market microstructure to punish those who spoke out against China.

The more serious issue than China’s hypothetical influence over Bitcoin monetary policy is the staggering inequality of crypto wealth distribution. Presently, 2% of wallets control over 95% of Bitcoin. Many people are already uncomfortable with the majority of Bitcoins being owned by a handful of mining operators and Silicon Valley bros and other agents of tech inequality. Institutions fail violently when inequality is high - people will drop the existing ledger of balances and install a new one (such as Bitcoin). If people decide to form a new network state, why should they adopt a currency that would make these tech bros the richest members of their society? Would you want your richest citizen to be someone who bet their life savings on DOGE? Would you trust this person’s judgement or capacity for risk management?

Like any currency, Bitcoin and Ethereum face adoption risk if the majority of assets are held by people who lack the leadership to deploy capital effectively on behalf of society. Unless crypto billionaires vow to not spend the majority of their wealth (like Satoshi has seemingly done), or demonstrate a remarkable level of leadership and altruism towards growing the crypto economy (like Vitalik Buterin has done), the inequality aspect will remain a large barrier to the formation of stable network states.

### Summary

1. A gaming arcade is a miniature model of a nation-state. Controlling the supply and right to issue currency is lucrative.
2. Large businesses with high-margin infrastructure can essentially mint debt, much like printing money.
3. Cryptocurrencies will create “Network States” that challenge existing nation-states. But they will not prosper if they set up their richest citizens as ones who won the “early adopter” lottery.

### Further reading and Acknowledgements

I highly recommend Lyn Alden’s essay on the history of the US dollar, the fraying petrodollar system, and the future of reserve currency.

Thanks to Austin Chen and Melody Cao for providing feedback on earlier drafts.

## Sunday, March 14, 2021

### Science and Engineering for Learning Robots

This is the text version of a talk I gave on March 12, 2021, at the Brown University Robotics Symposium. As always, all views are my own, and do not represent those of my employer.

I'm going to talk about why I believe end-to-end Machine Learning is the right approach for solving robotics problems, and invite the audience to think about a couple interesting open problems that I don't know how to solve yet.

I'm a research scientist at Robotics at Google. This is my first full-time job out of school, but I actually started my research career doing high school science fairs. I volunteered at UCSF doing wet lab experiments with telomeres, and it was a lot of pipetting and only a fraction of the time was spent thinking about hypotheses and analyzing results. I wanted to become a deep sea marine biologist when I was younger, but after pipetting several 96-well plates (and messing them up) I realized that software-defined research was faster to iterate on and freed me up to do more creative, scientific work.

I got interested in brain simulation and machine learning (thanks to Andrew Ng's Coursera course) in 2012. I did volunteer research at a neuromorphic computing lab at Stanford and did some research at Brown on biological spiking neuron simulation in tadpoles. Neuromorphic hardware is the only plausible path to real-time, large-scale biophysical neuron simulation on a robot, but, much like wet-lab research, it is rather slow to iterate on. It also struggled to learn even simple tasks, which made me pivot to artificial neural networks, which were starting to work much better at a fraction of the computational cost. In 2015 I watched Sergey Levine's talk on Guided Policy Search and remember thinking to myself, "oh my God, this is what I want to work on".

## The Deep Learning Revolution

We've seen a lot of progress in Machine Learning in the last decade, especially in end-to-end machine learning, also known as deep learning. Consider a task like audio transcription: classically, we would chop up the audio clip into short segments, detect phonemes, aggregate phonemes into words, words into sentences, and so on. Each of these stages is a separate software module with distinct inputs and outputs, and these modules might involve some degree of machine learning. The idea of deep learning is to fuse all these stages together into a single learning problem, where there are no distinct stages, just the end-to-end prediction task from raw data. With a lot of data and compute, such end-to-end systems vastly outperform the classical pipelined approach. We've seen similar breakthroughs in vision and natural language processing, to the extent that all state-of-the-art systems for these domains are pretty much deep learning models.

Robotics has for many decades operated under a modularized software pipeline, where first you estimate state, then plan, then perform control to realize your plan. The question our team at Google is interested in studying is whether the end-to-end advances we've seen in other domains hold for robotics as well.

## Software 2.0

When it comes to thinking about the tradeoff between hand-coded, pipelined approaches versus end-to-end learning, I like Andrej Karpathy's abstraction of Software 1.0 vs Software 2.0: Software 1.0 is where a human explicitly writes down instructions for some information processing. Such instructions (e.g. in C++) are passed through a compiler that generates the low-level instructions the computer actually executes. When building Software 2.0, you don't write the program - you give a set of inputs and outputs, and it's the ML system's job to find the best program that satisfies your input-output description. You can think of ML as a "higher order compiler that takes data and gives you programs".

The gradual or not-so-gradual subsumption of software 1.0 code into software 2.0 is inevitable - one might start by tuning some coefficients here and there, then you might optimize over one of several code branches to run, and before you know it, the system actually consists of an implicit search procedure over many possible sub-programs. The hypothesis is that as we increase availability of compute and data, we will be able to automatically do more and more search over programs to find the optimal routine. Of course, there is always a role for Software 1.0 - we need it for things like visualization and data management. All of these ideas are covered in Andrej's talks and blog posts, so I encourage you to check those out.
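The first step of that subsumption - a hand-tuned coefficient becoming a fitted one - can be sketched in a few lines (a toy illustration with made-up numbers, not any production system):

```python
# Software 1.0: an engineer hand-tunes the gain.
def controller_v1(error):
    KP = 0.8  # tuned by trial and error
    return KP * error

# Software 2.0, step one of subsumption: search for the gain that best fits
# examples of desired behavior (closed-form least squares for action = k * error).
def fit_gain(errors, target_actions):
    num = sum(e * a for e, a in zip(errors, target_actions))
    den = sum(e * e for e in errors)
    return num / den

def controller_v2(error, k):
    return k * error

# The data now plays the role the engineer used to play:
errors = [1.0, 2.0, -1.0, 0.5]
targets = [0.9, 1.8, -0.9, 0.45]  # implicitly encode k = 0.9
k = fit_gain(errors, targets)
```

From here it is a short conceptual step to optimizing over branches, then over whole sub-programs.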

## How Much Should We Learn in Robotics?

End-to-end learning has yet to outperform the classical control-theory approaches in some tasks, so within the robotics community there is still an ideological divide on how much learning should actually be done.

On one hand, you have classical robotics approaches, which break down the problem into three stages: perception, planning, and control. Perception is about determining the state of the world, planning is about high level decision making around those states, and control is about applying specific motor outputs so that you achieve what you want. Many of the ideas we explore in deep reinforcement learning today (meta-learning, imitation learning, etc.) have already been studied in classical robotics under different terminology (e.g. system identification). The key difference is that classical robotics deals with smaller state spaces, whereas end-to-end approaches fuse perception, planning, and control into a single function approximation problem. There's also a middle ground where one can attempt to use hand-coded constructs from classical robotics as a prior, and then use data to adapt the system to reality. According to Bayesian decision-making theory, the stronger your prior, the less data (evidence) you need to construct a strong posterior belief.
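That prior-versus-data tradeoff can be made concrete with a toy Beta-Bernoulli example (the numbers are illustrative and have nothing to do with any particular robot): with a confident prior, a handful of observations barely moves the posterior; with a weak prior, the same data dominates.

```python
# Beta-Bernoulli posterior mean: a Beta(a, b) prior acts like
# a + b pseudo-observations mixed in with the real data.

def posterior_mean(prior_a, prior_b, successes, failures):
    return (prior_a + successes) / (prior_a + prior_b + successes + failures)

# Strong prior (100 pseudo-observations, prior mean 0.9) vs. weak uniform
# prior, each updated with the same 2 successes and 2 failures:
strong = posterior_mean(90, 10, 2, 2)  # stays near 0.9
weak = posterior_mean(1, 1, 2, 2)      # pulled to 0.5 by the data
```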

I happen to fall squarely on the far side of the spectrum - the end-to-end approach. I'll discuss why I believe strongly in these approaches.

## Three reasons for end-to-end learning

First, it's worked for other domains, so why shouldn't it work for robotics? If there is something about robotics that makes this decidedly not the case, it would be super interesting to understand what makes robotics unique. As an existence proof, our lab and other labs have already built a few real-world systems that are capable of doing manipulation and navigation from end-to-end pixel-to-control. Shown on the left is our grasping system, QT-Opt, which essentially performs grasping using only monocular RGB, the current arm pose, and end-to-end function approximation. It can grasp objects it's never seen before. We've also had success on door opening and manipulation from imitation learning.

## Fused Perception-to-Action in Nature

Secondly, there are often many shortcuts one can take to solve specific tasks, without having to build a unified perception-planning-control stack that is general across all tasks. Mandyam Srinivasan's lab has done cool experiments getting honeybees to fly and perch inside small holes, with a spiral pattern painted on the wall. They found that bees decelerate as they approach the target via the simple heuristic of keeping the rate of image expansion (of the spiral) constant. They also found that if you artificially increase or decrease the rate of expansion by spinning the spiral clockwise or counterclockwise, the honeybee will predictably speed up or slow down. This is Nature's elegant solution to a control problem: visually-guided odometry is computationally cheaper and less error-prone than detecting where the target is in the world frame, planning a trajectory, and so on. It may not be a general framework for planning and control, but it is sufficient for accomplishing what honeybees need to do.
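The heuristic can be sketched as a toy control law (my own simplification, not Srinivasan et al.'s actual model): for a fixed-size target, the relative rate of image expansion scales as v / d, so holding it constant forces speed to shrink in proportion to distance, producing a smooth landing with no explicit planning.

```python
# Toy 1-D simulation of the constant-expansion-rate landing heuristic.
# d is distance to the target, v is approach speed.

def simulate_landing(d0, expansion_rate=2.0, dt=0.01, steps=300):
    d = d0
    traj = []
    for _ in range(steps):
        v = expansion_rate * d   # pick the speed that keeps v / d constant
        d = max(d - v * dt, 0.0)
        traj.append((d, v))
    return traj

traj = simulate_landing(d0=1.0)
final_d, final_v = traj[-1]      # the bee coasts in, slowing smoothly
```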

Okay, maybe honeybees can use end-to-end approaches, but what about humans? Do we need a more general perception-planning-control framework for human problems? Maybe, but we also use many shortcuts for decision making. Take ball catching: we don't catch falling objects by solving ODEs or planning, we instead employ a gaze heuristic - as long as an object stays in the same point in your field of view, you will eventually intersect with the object's trajectory. Image taken from Henry Brighton's talk on Robust decision making in uncertain environments.

## The Trouble With Defining Anything

Third, we tend to describe decision making processes with words. Words are pretty much all we have to communicate with one another, but they are inconsistent with how we actually make decisions. I like to describe this as an intelligence "iceberg"; the surface of the iceberg is how we think our brain ought to make decisions, but the vast majority of intelligent capability is submerged from view, inaccessible to our consciousness and incompressible into simple language like English. That is why we are capable of performing intelligent feats like perception and dexterous manipulation, but struggle to articulate how we actually perform them in short sentences. If it were easy to articulate in clear, unambiguous language, we could just type up those words into a computer program and not have to use machine learning for anything. Words about intelligence are lossy compression, and a lossy representation of a program is not sufficient to implement the full thing.

Consider a simple task of identifying the object in the image on the left (a cow). A human might attempt to string some word-based reasoning together to justify why this is a cow: "you see the context (an open field), you see a nose, you see ears, and black-and-white spots, and maybe the most likely object that has all these parts is a cow".

This is a post-hoc justification, and not actually a full description of how our perception system registers whether something is a cow or not. If you take an actual system capable of recognizing cows with great accuracy (e.g. a convnet) and inspect its salient neurons and channels that respond strongly to cows, you will find a strange-looking feature map that is hard to put into words. We can't define anything with human-readable words or code at the level of precision needed for interacting with reality, so we must use raw sensory data - grounded in reality - to figure out the decision-making capabilities we want.

## Cooking is Not Software 1.0

Our obsession with focusing on the top half of the intelligence iceberg biases us towards the Software 1.0 way of programming, where we take a hard problem and attempt to describe it - using words - as the composition of smaller problems. There is also a tendency for programmers to think of general abstractions for their code, via ontologies that organize words with other words. Reality has many ways to defy your armchair view of what cows are and how robotic skills ought to be organized to accomplish tasks in an object-oriented manner.

Cooking is one of the holy grails of robotic tasks, because environments are open-ended and there is a lot of dexterous manipulation involved. Cooking analogies abound in programming tutorials - here is an example of making breakfast with asynchronous programming. It's tempting to think that you can build a cooking robot by simply breaking down the multi-stage cooking task into sub-tasks and individual primitive skills.

Sadly, even the most trivial of steps abounds with complexity. Consider the simple task of spreading jam on some toast.

The software 1.0 programmer approaches this problem by breaking down the task into smaller, reusable routines. Maybe you think to yourself, first I need a subroutine for holding the slice of toast in place with the robot fingers, then I need a subroutine to spread jam on the toast.

Spreading jam on toast entails three subroutines: a subroutine for scooping the jam with the knife, depositing the lump of jam on the toast, then spreading it evenly.

Here is where the best laid plans go awry. A lot of things can happen in reality at any stage that would prevent you from moving onto the next stage. What if the toaster wasn't plugged in and you're starting with untoasted bread? What if you get the jam on the knife but in the process break something on the robot and you aren't checking to make sure everything is fine before proceeding to the next subroutine? What if there isn't enough jam in the jar? What if you're on the last slice of bread in the loaf and the crust side is facing up?

The prospect of writing custom code to handle the ends of the bread loaf (literal edge cases) ought to give one pause as to whether this approach is scalable to unstructured environments like kitchens - you end up with a million lines of code that essentially capture the state machine of reality. Reality is chaotic - even if you had a perfect perception system, simply managing reality at the planning level quickly becomes intractable. Learning-based approaches give us hope of managing this complexity by accumulating all these edge cases in data, and letting the end-to-end objective (getting some jam on the toast) and the Software 2.0 compiler figure out how to handle them. My belief in end-to-end learning is not because I think ML has unbounded capability, but rather that the alternative approach, where we capture all of reality in a giant hand-coded state machine, is utterly hopeless.

Here is a video where I am washing and cutting strawberries and putting them on some cheesecake. A roboticist that spends too much time in the lab and not the kitchen might prescribe a program that (1) "holds strawberry", (2) "cut strawberry", (3) "pick-and-place on cheesecake", but if you watch the video frame by frame, there are a lot of other manipulation tasks that happen in the meantime - opening and closing containers with one or two hands, pushing things out of the way, inspecting for quality. To use the Intelligence Iceberg analogy: the recipe and high level steps are the surface ice, but the submerged bulk are all the little micro-skills the hands need to do to open containers and adapt to reality. I believe the most dangerous conceit in robotics is to design elegant programming ontologies on a whiteboard, and ignore the subtleties of reality and what its data tells you.

There are a few links I want to share highlighting the complexity of reality. I enjoyed this recent article on Quanta Magazine about the trickiness of defining life. This is not merely a philosophical question; people at NASA are planning a Mars expedition to collect soil samples and answer whether life ever existed on Mars. This mission requires clarity on the definition of life. Just as it's hard to define intelligent capabilities in precise language, it is hard to define life. These two words may as well be one and the same.

Klaus Greff's talk on What Are Objects? raises some interesting questions about the fuzziness of words. Obviously we want our perception systems to recognize objects so that we may manipulate and plan around them. But as the talk points out, defining what is and is not an object can be quite tricky (is a hole an object? Is the frog prince defined by what he once was, or what he looks like now?).

I've also written a short story on the trickiness of defining even simple classes like "teacups".

I worked on a project with Coline Devin where we used data and Software 2.0 to learn a definition of objects without any human labels. We used a grasping system to pick up stuff and defined objects as "that which is graspable". Suppose you have a bin of objects and pick one of them up. The object is now removed from the bin, and maybe the other objects have shifted around the bin a little. You can also easily look at whatever is in your hand. We then design an embedding architecture and train it with the following assumption about reality: the embedding of the pre-grasp objects, minus the embedding of the post-grasp objects, should equal the embedding of whatever you picked up. This allowed us to bootstrap a completely self-supervised instance grasping system from a grasping system, without ever relying on labels. This is by no means a comprehensive definition of "object" (see Klaus's talk), but I think it's a pretty good one.
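A minimal sketch of that consistency constraint as a loss, using toy 2-D "embeddings" (the real system learns convnet embeddings end-to-end; everything here is for illustration):

```python
# Constraint: embed(pre-grasp scene) - embed(post-grasp scene) ≈ embed(object in hand)

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def consistency_loss(phi_pre, phi_post, phi_hand):
    # Squared error between the scene-embedding difference and the in-hand embedding.
    diff = [a - b for a, b in zip(phi_pre, phi_post)]
    return sum((d - h) ** 2 for d, h in zip(diff, phi_hand))

# If scene embeddings behave additively over objects, the loss is zero:
apple, banana = [1.0, 0.0], [0.0, 1.0]
phi_pre = add(apple, banana)  # bin contains both objects
phi_post = banana             # the apple was picked up
loss = consistency_loss(phi_pre, phi_post, apple)
```

Minimizing this loss over many grasps is what pressures the encoder to carve scenes into graspable "objects" without labels.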

## Science and Engineering of End-to-End ML

End-to-end learning is a wonderful principle for building robotic systems, but it is not without its practical challenges and execution risks. Deep neural nets are opaque black box function approximators, which makes debugging them at scale challenging. This requires discipline in both engineering and science, and often the roboticist needs to make a choice as to whether to solve an engineering problem or a scientific one.

This is what a standard workflow looks like for end-to-end robotics. You start by collecting some data, cleaning it, then designing the input and output specification. You fit a model to the data, validate it offline with some metrics like mean-squared error or accuracy, then deploy it in the real world and see if it continues to work as well on your validation sets. You might iterate on the model and validation via some kind of automated hyperparameter tuning.

Most ML PhDs spend all their time on the model training and validation stages of the pipeline. RL PhDs have a slightly different workflow, where they think a bit more about data collection via the exploration problem. But most RL research also happens in simulation, where there is no need to do data cleaning and the feature and label specification is provided to you via the benchmark's design.

While it's true that advancing learning methods is the primary point of ML, I think this behavior is the result of perverse academic incentives.

There is a vicious tendency for papers to put down old ideas and hype up new ones in the pursuit of "technical novelty". The absurdity of all this is that if we ever found that an existing algorithm worked super well on harder and harder problems, it would have a hard time getting published in academic conferences. Reviewers operate under the assumption that our ML algorithms are never good enough.

In contrast, production ML usually emphasizes everything else in the pipeline. Researchers on Tesla's Autopilot team have found that in general, 10x'ing your data on the same model architecture outperforms any incremental modeling improvement in the last few years. As Ilya Sutskever says, most incremental algorithm improvements are just data in disguise. Researchers at quantitative trading funds do not change models drastically: they spend their time finding novel data sources that add additional predictive signal. By focusing on large-scale problems, you get a sense of where the real bottlenecks are. You should only work on innovating new learning algorithms if you have reason to believe that that is what is holding your system back.

Here are some examples of real problems I've run into in building end-to-end ML systems. When you collect data on a robot, certain aspects of the code get baked into the data. For instance, the tuning of the IK solver or the acceleration limits on the joints. A few months later, the code on the robot controllers might have changed in subtle ways, like maybe the IK solver was swapped with a different solver. This happens a lot in a place like Google where multiple people work on a single codebase. But because assumptions of the v0 solver were baked into the training data, you now have a train-test mismatch and the ML policy no longer works as well.

Consider an imitation learning task where you collect some demonstrations, and then predict actions (labels) from states (features). An important unit test to perform before you even start training a model is to check whether a robot that replays the exact labels in order can actually solve the task (for an identical initialization as the training data). This check is important because the way you design your labels might make assumptions that don't necessarily hold at test-time.
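Such a replay check might look like the following sketch; the environment, demo format, and success criterion here are hypothetical stand-ins for a real robot stack:

```python
class ToyReachEnv:
    """1-D toy environment: state is a position, actions are displacements."""
    def reset(self, initial_state):
        self.state = initial_state
        return self.state

    def step(self, action):
        self.state += action
        return self.state

def replay_check(env, demo, success_fn):
    """Replay the recorded actions from the recorded initial state and
    report whether that alone solves the task."""
    state = env.reset(demo["initial_state"])
    for action in demo["actions"]:
        state = env.step(action)
    return success_fn(state)

# A demo whose actions should carry the robot from 0.0 to the goal at 1.0:
demo = {"initial_state": 0.0, "actions": [0.5, 0.3, 0.2]}
ok = replay_check(ToyReachEnv(), demo, lambda s: abs(s - 1.0) < 1e-6)
```

If this check fails on real demonstrations, no amount of model tuning will fix it - the label design itself is broken.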

I've found data management to be one of the most crucial aspects of debugging real world robotic systems. Recently I found a "data bug" where there was a demonstration of the robot doing nothing for 5 minutes straight - the operator probably left the recording running without realizing it. Even though the learning code was fine, noisy data like this can be catastrophic for learning performance.
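A minimal data-hygiene filter of the kind that would have flagged that episode (the thresholds and episode format are invented for illustration):

```python
def is_degenerate(episode, motion_eps=1e-3, min_moving_frac=0.05):
    """episode: list of joint-position vectors, one per timestep.
    Flags episodes where almost no timestep shows meaningful motion."""
    if len(episode) < 2:
        return True
    moving = sum(
        1
        for prev, cur in zip(episode, episode[1:])
        if max(abs(a - b) for a, b in zip(prev, cur)) > motion_eps
    )
    return moving / (len(episode) - 1) < min_moving_frac

idle = [[0.0, 0.0]] * 500  # operator left the recorder running
active = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.1], [0.3, 0.2]]
```

Running cheap checks like this at ingestion time is far less painful than diagnosing a mysteriously bad policy weeks later.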

As roboticists we all want to see in our lifetime robots doing holy grail tasks like tidying our homes and cooking in the kitchen. Our existing systems, whether you work on Software 1.0 or Software 2.0 approaches, are far away from that goal. Instead of spending our time researching how to re-solve a task a little bit better than an existing approach, we should be using our existing robotic capabilities to collect new data for tasks we can't solve yet.

There is a delicate balance in choosing between understanding ML algorithms better, versus pushing towards a longer term goal of qualitative leaps in robotic capability. I also acknowledge that the deep learning revolution for robotics needs to begin with solving the easier tasks and then eventually working its way up to the harder problems. One way to accomplish both good science and long term robotics is to understand how existing algorithms break down in the face of harder data and tougher generalization demands encountered in new tasks.

## Interesting Problems

Hopefully I've convinced you that end-to-end learning is full of opportunities to really get robotics right, but also rife with practical challenges. I want to highlight two interesting problems that I think are deeply important to pushing this field forward, not just for robotics but for any large-scale ML system.

A typical ML research project starts from a fixed dataset. You code up and train a series of ML experiments, then you publish a paper once you're happy with one of the experiments. These codebases are not very large and don't get maintained beyond the duration of the project, so you can move quickly and scrappily with little to no version control or regression testing.

Consider how this would go for a "lifelong learning" system for robotics, where you are collecting data and never throwing it away. You start the project with some code that generates a dataset (Data v1). Then you train a model with some more code, which compiles a Software 2.0 program (ckpt.v1.a). Then you use that model to collect more data (Data v2), and concatenate your datasets together (Data v1 + Data v2) to then train another model, and use that to collect a third dataset (Data v3), and so on. All the while you might be publishing papers on the intermediate results.

The tricky thing here is that the behavior of Software 1.0 and Software 2.0 code is now baked into each round of data collection, and the Software 2.0 code has assumptions from all prior data and code baked into it. The dependency graph between past versions of code and your current system becomes quite complex to reason about.

This only gets trickier if you are running multiple experiments and generating multiple Software 2.0 binaries in parallel, and collecting with all of those.

Let's examine what code gets baked into a collected dataset. It is a combination of Software 1.0 code (IK solver, logging schema) and Software 2.0 code (a model checkpoint). The model checkpoint itself is the distillation of a ML experiment, which consists of more Software 1.0 code (Featurization, Training code) and Data, which in turn depends on its own Software 1.0 and 2.0 code, and so on.
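One way to make that dependency graph explicit is to record immutable lineage metadata with every dataset and checkpoint (a sketch - the field names are invented, not any real tool):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Artifact:
    name: str
    code_version: str    # Software 1.0 baked in: commit of the collection/training code
    parents: tuple = ()  # datasets and checkpoints this artifact was built from

def ancestry(artifact):
    """Every artifact transitively baked into this one, depth-first."""
    names = []
    for parent in artifact.parents:
        names.append(parent.name)
        names.extend(ancestry(parent))
    return names

# The flywheel from the text: data v1 -> ckpt v1.a -> data v2 -> ...
data_v1 = Artifact("data_v1", code_version="abc123")
ckpt_v1a = Artifact("ckpt_v1a", code_version="def456", parents=(data_v1,))
data_v2 = Artifact("data_v2", code_version="789abc", parents=(ckpt_v1a,))
```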

Here's the open problem I'd like to pose to the audience: how can we verify correctness of lifelong learning systems (accumulating data, changing code), while ensuring experiments are reproducible and bug free? Version control software and continuous integration testing is indispensable for team collaboration on large codebases. What would the Git of Software 2.0 look like?

Here are a couple ideas on how to mitigate the difficulty of lifelong learning. The flywheel of an end-to-end learning system involves converting data to a model checkpoint, then a model checkpoint to predictions, and model predictions to a final real world evaluation number. That eval also gets converted into data. It's critical to test these four components separately to ensure there are no regressions - if one of these breaks, so does everything else.
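A sketch of what stage-isolated regression tests could look like, with toy stand-ins for each of the four stages (a real pipeline would test against frozen data fixtures and golden checkpoints):

```python
# data -> checkpoint -> predictions -> eval number -> more data

def featurize(raw):                     # stage 1: data -> training examples
    return [(x, 2.0 * x) for x in raw]  # toy labeling rule: y = 2x

def train(examples):                    # stage 2: examples -> "checkpoint"
    # Toy model: least-squares slope of a line through the origin.
    num = sum(x * y for x, y in examples)
    den = sum(x * x for x, _ in examples)
    return {"slope": num / den}

def predict(ckpt, x):                   # stage 3: checkpoint -> prediction
    return ckpt["slope"] * x

def evaluate(ckpt, examples):           # stage 4: predictions -> eval metric (MSE)
    return sum((predict(ckpt, x) - y) ** 2 for x, y in examples) / len(examples)

# Frozen fixtures allow each stage to be regression-tested on its own:
fixture_examples = featurize([1.0, 2.0, 3.0])
fixture_ckpt = train(fixture_examples)
```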

Another strategy is to use Sim2Real, where you train everything in simulation and develop a lightweight fine-tuning procedure for transferring the system to reality. We rely on this technique heavily at Google and I've heard this is OpenAI's strategy as well. In simulation, you can transmute compute into data, so data is relatively cheap and you don't have to worry about handling old data. Every time you change your Software 1.0 code, you can just re-simulate everything from scratch and you don't have to deal with ever-increasing data heterogeneity. You might still have to manage some data dependencies for real world data, because typically sim2real methods require training a CycleGAN.

## Compiling Software 2.0 Capable of Lifelong Learning

When people use the phrase "lifelong learning" there are really two definitions. One is about lifelong dataset accumulation, and concatenating prior datasets to train systems with new capabilities. Here, we may re-compile the Software 2.0 program over and over again.

A stronger version of "lifelong learning" is to attempt to train systems that learn on their own and never need to have their Software 2.0 re-compiled. You can think about this as a task that runs for a very long time.

Many of the robotic ML models we build in our lab have goldfish memories - they make all their decisions from a single instant in time. They are, by construction, incapable of remembering what the last action they took was, or what happened 10 seconds ago. But there are plenty of tasks where it's useful to remember:

• An AI that can watch a movie (>170k images) and give you a summary of the plot.
• An AI that is conducting experimental research, and it needs to remember hundreds of prior experiments to build up its hypotheses and determine what to try next.
• An AI therapist that should remember the context of all your prior conversations (say, around 100k words).
• A robot that is cooking and needs to leave something in the oven for several hours and then resume the recipe afterwards.

Memory and learning over long time periods requires some degree of selective memory and attention. We don't know how to select which moments in a sequence are important, so we must acquire that by compiling a Software 2.0 program. We can train a neural network to fit some task objective to the full "lifetime" of the model, and let the model figure out how it needs to selectively remember within that lifetime in order to solve the task.

However, this presents a big problem: in order to optimize this objective, you need to run forward predictions over every step in the lifetime. If you are using backpropagation to train your networks, then you also need to run a similar number of steps in reverse. If you have N data elements and the lifetime is T steps long, the computational cost of learning is between O(NT) and O(NT^2), depending on whether you use RNNs, Transformers, or something in between. Even though a selective attention mechanism might be an efficient way to perform long-term memory and learning, the act of finding that program via Software 2.0 compilation is very expensive because we have to consider full sequences.
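As a back-of-envelope check of those two regimes (constant factors omitted; purely illustrative):

```python
# Cost of one pass of Software 2.0 compilation over N lifetimes of T steps:
#   RNN-style recurrence:        O(N * T)
#   full-attention Transformer:  O(N * T**2)

def training_cost(n_examples, lifetime_steps, model="rnn"):
    if model == "rnn":
        return n_examples * lifetime_steps
    return n_examples * lifetime_steps ** 2

# Making the lifetime 10x longer costs 10x for an RNN, 100x for full attention:
rnn_ratio = training_cost(1, 1000) / training_cost(1, 100)
attn_ratio = training_cost(1, 1000, "transformer") / training_cost(1, 100, "transformer")
```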

## Train on Short Sequences and It Just Works

The optimistic take is that we can just train on shorter sequences, and it will just generalize to longer sequences at test time. Maybe you can train selective attention on short sequences, and then couple that with a high capacity external memory. Ideas from Neural Program Induction and Neural Turing Machines seem relevant here. Alternatively, you can use ideas from Q-learning to essentially do dynamic programming across time and avoid having to ingest the full sequence into memory (R2D2).

## Hierarchical Computation

Another approach is to fuse multiple time steps into a single one, potentially repeating this trick over and over again until you have effectively O(log(T)) computation cost instead of O(T) cost. This can be done in both forward and backward passes - clockwork RNNs and Dilated Convolutions used in WaveNet are good examples of this. A variety of recent sub-quadratic attention improvements to Transformers (Block Sparse Transformers, Performers, Reformers, etc.) can be thought of as special cases of this as well.
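The receptive-field arithmetic behind the O(log(T)) claim, sketched for the standard dilated-convolution scheme used in WaveNet: doubling the dilation at each layer means a stack of L layers with kernel size 2 sees 2^L timesteps, so covering a window of T steps takes O(log T) layers.

```python
def receptive_field(num_layers, kernel_size=2):
    """Receptive field of a stack of 1-D convolutions whose dilation
    doubles at every layer (1, 2, 4, ...)."""
    rf = 1
    for layer in range(num_layers):
        dilation = 2 ** layer
        rf += (kernel_size - 1) * dilation
    return rf

# 10 layers suffice to cover a 1024-step history:
window = receptive_field(10)
```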

## Parallel Evolution

Maybe we do need to just bite the bullet and optimize over the full sequences, but use embarrassingly parallel algorithms to amortize the time complexity (by distributing it across space). Rather than serially running forward-backward on the same model over and over again, you could imagine testing multiple lifelong learning agents simultaneously and choosing the best-of-K agents after T time has elapsed.
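A toy sketch of that best-of-K selection; the "agents" here are random stand-ins for full training runs, which in practice would be distributed across machines:

```python
import random

def run_agent(seed, T):
    """Stand-in for one lifelong-learning run: returns a final score."""
    rng = random.Random(seed)  # deterministic per-agent randomness
    return sum(rng.random() for _ in range(T))

def best_of_k(K, T):
    """Run K agents for T steps each and keep the highest scorer."""
    scores = {seed: run_agent(seed, T) for seed in range(K)}
    best_seed = max(scores, key=scores.get)
    return best_seed, scores[best_seed]

best_seed, best_score = best_of_k(K=8, T=100)
```

The serial cost stays O(T), while the O(K) population cost is paid in parallel hardware rather than wall-clock time.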

If you're interested in these problems, here's some concrete advice for how to get started. Start by looking up the existing literature in the field, pick one of these papers, and see if you can re-implement it from scratch. This is a great way to learn and make sure you have the necessary coding chops to get ML systems working well. Then ask yourself, how well does the algorithm handle harder problems? At what point does it break down? Finally, rather than thinking about incremental improvements to existing algorithms and benchmarks, constantly be thinking of harder benchmarks and new capabilities.

## Summary

• Three reasons why I believe in end-to-end ML for robotics: (1) it worked for other domains, (2) fusing perception and control is a nice way to simplify decision making for many tasks, (3) we can't define anything precisely, so we need to rely on reality (via data) to tell us what to do.
• When it comes to improving our learning systems, think about the broader pipeline, not just the algorithmic and mathy learning part.
• Challenge: how do we do version control for Lifelong Learning systems?
• Challenge: how do we compile Software 2.0 that does Lifelong Learning? How can we optimize for long-term memory and learning without having to optimize over full lifetimes?