Wednesday, May 26, 2021

Sovereign Arcade: Currency as High-Margin Infrastructure

This essay is about how the powerful want to become countries, and the implications of cryptocurrencies for the sovereignty of nations. I'm not an economics expert: please leave a comment if I have made any errors.

Money allows goods, services, and everything else under the sun to be assigned a value using the same unit of measurement. Without money, society reverts to bartering, which is highly inefficient. You may need plumbing services but have nothing that the plumber wants, so your toilet remains clogged. By acting as a measure of value everyone agrees on, money facilitates frictionless economic collaboration between people.

Foreign monetary policy is surprisingly simple to understand when viewed through the lens of power and control. Nation states get nervous when other nation states get too powerful, and controlling the currency is a form of power.

To see why this is the case, let's consider a Gaming Arcade (yes, like Chuck E. Cheese) as a miniature model of a “Nation State”. To participate in the “arcade economy”, you must swap your outside money (USD) for arcade tokens.

Arcades are like mini nation-states: they issue their own currency, encourage spending with state-owned enterprises, and have a one-sided currency exchange to prevent money outflows.


The coins are a store of value that facilitate a one-way transaction with the Nation-State: you get to play an arcade game, and in return you get some entertainment value and some tickets, which we call “wages”.

The tickets are another store of value that can facilitate another one-way transaction: converting them into prizes. Prizes can be a stuffed animal or something else of value. Typically, the cost of winning a prize at an arcade is many multiples of what it would cost to just buy the prize at an outside store. The arcade captures that price difference as their profit.

Money’s most important requirement is that it be a *stable* measure of value. Too much inflation, and people stop saving money. Too much deflation, and people and companies aren’t incentivized to spend money (for example, on employing people). Imagine if tomorrow, an arcade coin could let you play a game for two rounds instead of one, and the day after, you could play for four rounds! Well, no one would want to play arcade games today anymore.

The arcade imposes many kinds of draconian capital controls, and in many ways resembles an extreme form of State Capitalism:
  • All transactions are with state-owned enterprises (the arcade games) and must be conducted using state currencies (coins and tickets). You can’t start a business that takes people’s coins or tickets within the arcade.
  • The state can hand out valuable coins at virtually zero cost without worrying about inflation - every coin they issue is backed by a round of a coin-operated game, of which they have near-infinite supply. They can’t hand out infinite tickets though, because that would either require backing it up with more prizes, or devaluing each ticket so that more tickets are needed to buy the same prize.
  • You can bring outside money into the arcade, but you can’t convert coins, tickets, or prizes into money to take out.

Controlling the currency supply is indeed a very powerful business to be in, which is why arcades prefer to issue their own currency and keep money from leaving their borders.

Governments are just like arcades. They prefer their citizens and trading partners to use a currency they control, because it gives them a lever with which they can influence spending behavior. If country A uses country B’s currency instead, then country B’s currency supply shenanigans can actually influence saving and spending behavior of country A. This can pose a threat to the sovereignty of a nation (a fancy way to say “control over its people”).

After World War II, the US Dollar became the world’s reserve currency, which means that it’s the currency used for the majority of international trade. The USA wants the world to buy oil with US dollars, and we go to great lengths to enforce it with various forms of soft and hard power. The US dollar is backed by oil (petrodollar theory), and this “dollars-are-oil rule” in turn is enforced by US military might.

Governments print money all the time to pay for short-term needs like building bridges and COVID relief. However, too much of this can be a dangerous thing. The government gets what it wants in the short term, but more money chasing the same amount of goods will cause businesses to raise prices, causing inflation. Countries like Venezuela and Turkey that print too much of their own currency experience a runaway feedback loop where money supply and prices skyrocket, and then no one trusts the government currency as a stable store of value anymore.

The USA is not like other countries in this regard; controlling the world’s reserve currency gives the USA the ability to print money like no other country can. The US government owing 28 trillion USD of debt is like the Arcade owing you a trillion game coins. Yes, it is a lot of coins - maybe the arcade doesn’t even have a trillion coins to give you. But the arcade knows that you know that it’s in the best interest of everyone to not try and collect all those coins right away, because the arcade would go bankrupt, and then the coins you asked for would be worthless. 

Is this sketchy? Absolutely. Most other countries absolutely hate this power dynamic. Especially China. The USA calls China a currency manipulator for devaluing the yuan, but will turn around and do the exact same thing by printing dollars. China does not want to be subject to the whims of US monetary policy, so they are working very hard to establish the yuan as the currency of exchange in international trade. Everyone wants to be the arcade operator, not the arcade player.

Large Companies as Nation-States


Nation-states not only have to worry about the currencies of other nation-states, but increasingly, large global corporations as well. Any businesses that get big enough start to think about the currency game, since currency is a form of high-margin infrastructure.

AliPay is a mobile wallet made by an affiliate company of Alibaba. It’s basically backed by an SQL table saying how much money each AliPay user has. It would be very easy for AliPay to print money - all they have to do is bump up some number in a row of the SQL table. As long as users are able to redeem their AliPay balance for something of equivalent value, Alibaba’s accounts remain solvent and they can get away with this. In fact, many of their users shop on Alibaba’s e-commerce properties anyway, so Alibaba doesn’t even need 100% cash reserves to back up all the entries in their SQL table. Users can redeem their balances by paying for Alibaba goods, which Alibaba presumably can acquire for less than the price the user pays.
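To make the mechanism concrete, here is a toy sketch of a centralized ledger, written as a plain Python dictionary rather than a real database. Every name and number here is invented for illustration; this is not AliPay's actual schema or behavior.

```python
# Toy centralized ledger: the operator's table *is* the money supply.
# All names and figures are made up for illustration.
balances = {"alice": 100_00, "bob": 250_00}   # user balances, in cents
cash_reserves = 350_00                        # real cash actually backing the table

def transfer(src: str, dst: str, amount: int) -> None:
    assert balances[src] >= amount
    balances[src] -= amount
    balances[dst] += amount

def mint(user: str, amount: int) -> None:
    # "Printing money" is just bumping a row. The table stays internally
    # consistent even though reserves no longer cover the sum of balances.
    balances[user] += amount

mint("alice", 50_00)
print(sum(balances.values()) > cash_reserves)  # True: a fractional reserve
```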

Of course, outright printing money incurs the wrath of the Sovereign Arcade. Alibaba was severely punished for merely suggesting that they could do a better job than China’s banks. Facebook tried to challenge the dollar by introducing a token backed with other countries’ reserve currencies, and the idea was slapped down so hard that FB had to rename the project and start over. In contrast, the US government is happy to approve crypto tokens backed using the US dollar, because ultimately the US government controls the underlying resource.

There are clever ways to build high margin infrastructure without crossing the money-printing line. Any large institution with a monopoly over a high-margin resource can essentially mint debt for free, effectively printing currency like an arcade does with its coins. The resource can be a lot of things - coffee, cloud computing credits, energy, user data. In the case of a nation-state, the resource is simply violence and enforcement of the law.

As of 2019, Starbucks had 1.6B USD of gift cards in circulation, which puts it above the national GDP of about 20 countries. Like the arcade coins, Starbucks gift cards are only redeemable for limited things: scones and coffee. Starbucks can essentially mint Starbucks gift cards for free, and this doesn’t suffer from inflation because each gift card is backed by future coffee which Starbucks can also make at a marginal cost. You can even use Starbucks cards internationally, which makes “Star-Bucks” more convenient than current foreign currency exchange protocols.

As long as account balances are used to redeem a resource that the company can acquire cheaply (e.g. gift cards for coffee, gift cards for cloud computing, advertising credits), a large company could also practice “currency manipulation” by arbitrarily raising monetary balances in their SQL tables.


The Network State


Yet another threat to the sovereign power is decentralized rogue nations, made possible by cryptocurrency. At the heart of cryptocurrency’s rise is a social problem in our modern, globalized society: how do we trust our sovereigns to actually be good stewards of our property? Banking executives who overleveraged risky investments got bailed out in 2008 by the US government. The USA printed a lot of money in 2020 to bail out those impacted by COVID-19 economic shutdowns. Every few weeks, we hear about data breaches in the news. A lot of Americans are losing trust in their institutions to protect their bank accounts, their privacy, and their economic interests.

Even so, most Americans still take the power of the dollar for granted: 1) our spending power remains stable and 2) the number we see in our bank accounts is ours to spend. We have American soft and hard diplomacy to thank for that. But in less stable countries, capital controls can be rather extreme: a bank may simply decide one day that you can’t withdraw more than 1 USD per day. Or some government can decide that you’re a criminal and freeze your assets entirely.

Cryptocurrency offers a simple answer: You can’t trust the sovereign, or the bank, or any central authority to maintain the SQL table of who owns what. Instead, everyone cooperatively maintains the record of ownership in a decentralized, trustless way. For those of you who aren’t familiar with how this works, I recommend this 26-minute video by 3Blue1Brown.

To use the arcade analogy, cryptocurrency would be like a group of teenagers going to the arcade, and instead of converting their money into arcade coins, they pool it together to buy prizes from outside. They bring their own games (Nintendo Switches or whatever), and then swap prizes with each other based on who wins. They get the fun of hanging out with friends, playing games, and winning prizes, while cutting the arcade operator out.

The decentralized finance (DeFi) ecosystem has grown a lot in the last few years. In the first few years of crypto, all you could do was send Bitcoin and other Altcoins to each other. Today, you can swap currencies in decentralized exchanges, take out flash loans, buy distressed debt at a discount, provide liquidity as a market maker, perform no-limit betting on prediction markets, pay a foreigner with USD-backed stablecoins, and cryptographically certify authenticity of luxury goods.

Balaji Srinivasan predicts that as decentralized finance projects continue to grow, a large group of individuals with a shared sense of values and territory will congregate on the internet and declare themselves citizens of a “Network State”. It sounds fantastical at first, but many of us already live in Proto-Network states. We do our work on computers, talk to people over the internet, shop for goods online, and spend leisure time in online communities like Runescape and such. It makes sense for a geographically distributed economy to adopt a digital-native currency that transcends borders.

Network states will have the majority of their assets located on the internet, with a small amount of physical property distributed around the world for our worldly needs. The idea of a digital rogue nation is less far-fetched than you might think. If you walk into a Starbucks or McDonalds or a Google Office or an Apple Store anywhere in the world, there is a feeling of cultural consistency, a familiar ambience. In fact, Starbucks gets pretty close: you go there to eat and work and socialize and pay for things with Starbucks gift cards. 

A network state might have geographically distributed physical locations that have a consistent culture, with most of its assets and culture in the cloud. Pictured: Algebraist coffee, a new entrant into the luxury coffee brand space

A network state could have a national identity independent of physical location. I see no reason why a "Texan" couldn’t enjoy ranching and brisket and big cars and football anywhere in the world.


Balaji is broadly optimistic that existing sovereigns will tolerate or even facilitate network states, offering them economic development zones and tax incentives to establish their physical embodiments within their borders, in exchange for the innovation and capital they attract.

I am not quite so optimistic - the fact that US persons can now pseudonymously perform economic activities with anyone in the world (including sanctioned countries) without the US government knowing, using a currency that the US government cannot control, is a terrifying prospect to the sovereign. The world’s governments greatly underestimate the degree to which future decentralized economies will upset the existing world order and power structures. Any one government can make life difficult for cryptocurrency businesses trying to get big, but as long as some countries are permissive towards it, it’s hard to put that genie back in the bottle and prevent the emergence of a new digital economy.

Crypto Whales


I think the biggest threat to the emergence of a network state is not existing sovereigns, but rather the power imbalance of early stakeholders versus new adopters.

At the time of writing, there are nearly 100 Bitcoin billionaires and 7062 Bitcoin wallets that hold more than 10M USD each. This isn’t even counting the other cryptocurrencies or the DeFi wealth locked in Ethereum - the other day, someone bought nearly a billion dollars of the meme currency DOGE. We mostly have no idea who these people are - they walk amongst us, and are referred to as “whales”.

A billionaire’s taxes substantially alter state budget planning in smaller states, so politicians actually go out of their way to appease billionaires (e.g. Illinois with Ken Griffin). If crypto billionaires colluded, they could institute quite a lot of political change at local and maybe even national levels.

China has absolutely zero chill when it comes to any challenge to their sovereignty, so it was not surprising at all that they recently cracked down on domestic use of cryptocurrency. However, by shutting their miners down, I believe China is losing a strategic advantage in their quest to unseat America as the world superpower. A lot of crypto billionaires reside in China, having operated large mining pools and developed the world’s mining hardware early on. I think the smart move for China would have been to allow their miners to operate, but force them to sell their crypto holdings for digital yuan. This would peg crypto to the yuan, and also allow China to stockpile crypto reserves in case the world starts to use it more as a reserve currency.

There’s a chance that crypto might even overtake the yuan as the main challenger for reserve currency status, because it’s easier to acquire in countries with strict capital controls (e.g. Venezuela, Argentina, Zimbabwe). If I were China, I’d hedge against both possibilities and try to control both.

Controlling miners has power implications far beyond the stockpiling of crypto wealth. Miners play an important role in the market microstructure of cryptocurrency - they have the ability to see all potential transactions before they get permanently appended to the blockchain. The assets minted by miners are virtually untraceable. One way a Network State could be compromised is if China smuggled several crypto whales into these fledgling nations that are starting to adopt Bitcoin, and then used their influence over Bitcoin reserves, tax revenues, and market microstructure to punish those who spoke out against China.

The more serious issue than China’s hypothetical influence over Bitcoin monetary policy is the staggering inequality of crypto wealth distribution. Presently, 2% of wallets control over 95% of Bitcoin. Many people are already uncomfortable with the majority of Bitcoins being owned by a handful of mining operators and Silicon Valley bros and other agents of tech inequality. Institutions fail violently when inequality is high - people will drop the existing ledger of balances and install a new one (such as Bitcoin). If people decide to form a new network state, why should they adopt a currency that would make these tech bros the richest members of their society? Would you want your richest citizen to be someone who bet their life savings on DOGE? Would you trust this person’s judgement or capacity for risk management?

Like any currency, Bitcoin and Ethereum face adoption risk if the majority of assets are held by people who lack the leadership to deploy capital effectively on behalf of society. Unless crypto billionaires vow to not spend the majority of their wealth (like Satoshi has seemingly done), or demonstrate a remarkable level of leadership and altruism towards growing the crypto economy (like Vitalik Buterin has done), the inequality aspect will remain a large barrier to the formation of stable network states.

Summary

  1. A gaming arcade is a miniature model of a nation-state. Controlling the supply and right to issue currency is lucrative.
  2. Large businesses with high-margin infrastructure can essentially mint debt, much like printing money.
  3. Cryptocurrencies will create “Network States” that challenge existing nation-states. But they will not prosper if they set up their richest citizens as ones who won the “early adopter” lottery.

Further reading and Acknowledgements


I highly recommend Lyn Alden’s essay on the history of the US dollar, the fraying petrodollar system, and the future of reserve currency.

Thanks to Austin Chen and Melody Cao for providing feedback on earlier drafts.










Sunday, March 14, 2021

Science and Engineering for Learning Robots

This is the text version of a talk I gave on March 12, 2021, at the Brown University Robotics Symposium. As always, all views are my own, and do not represent those of my employer.

I'm going to talk about why I believe end-to-end Machine Learning is the right approach for solving robotics problems, and invite the audience to think about a couple interesting open problems that I don't know how to solve yet.

I'm a research scientist at Robotics at Google. This is my first full-time job out of school, but I actually started my research career doing high school science fairs. I volunteered at UCSF doing wet lab experiments with telomeres; it was a lot of pipetting, and only a fraction of the time was spent thinking about hypotheses and analyzing results. I wanted to become a deep sea marine biologist when I was younger, but after pipetting several 96-well plates (and messing them up) I realized that software-defined research was faster to iterate on and freed me up to do more creative, scientific work.

I got interested in brain simulation and machine learning (thanks to Andrew Ng's Coursera course) in 2012. I did volunteer research at a neuromorphic computing lab at Stanford and did some research at Brown on biological spiking neuron simulation in tadpoles. Neuromorphic hardware is the only plausible path to real-time, large-scale biophysical neuron simulation on a robot, but, much like wet-lab research, it is rather slow to iterate on. It was also a struggle to learn even simple tasks, which made me pivot to artificial neural networks, which were starting to work much better at a fraction of the computational cost. In 2015 I watched Sergey Levine's talk on Guided Policy Search and remember thinking to myself, "oh my God, this is what I want to work on".

The Deep Learning Revolution

We've seen a lot of progress in Machine Learning in the last decade, especially in end-to-end machine learning, also known as deep learning. Consider a task like audio transcription: classically, we would chop up the audio clip into short segments, detect phonemes, aggregate phonemes into words, words into sentences, and so on. Each of these stages is a separate software module with distinct inputs and outputs, and these modules might involve some degree of machine learning. The idea of deep learning is to fuse all these stages together into a single learning problem, where there are no distinct stages, just the end-to-end prediction task from raw data. With a lot of data and compute, such end-to-end systems vastly outperform the classical pipelined approach. We've seen similar breakthroughs in vision and natural language processing, to the extent that all state-of-the-art systems for these domains are pretty much deep learning models.

Robotics has for many decades operated under a modularized software pipeline, where first you estimate state, then plan, then perform control to realize your plan. The question our team at Google is interested in studying is whether the end-to-end advances we've seen in other domains hold for robotics as well.

Software 2.0

When it comes to thinking about the tradeoff between hand-coded, pipelined approaches versus end-to-end learning, I like Andrej Karpathy's abstraction of Software 1.0 vs Software 2.0: Software 1.0 is where a human explicitly writes down instructions for some information processing. Such instructions (e.g. in C++) are passed through a compiler that generates the low-level instructions the computer actually executes. When building Software 2.0, you don't write the program - you give a set of inputs and outputs, and it's the ML system's job to find the best program that satisfies your input-output description. You can think of ML as a "higher order compiler that takes data and gives you programs".
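As a toy illustration of the distinction, here is a minimal sketch using a made-up spam-filtering task (the example is mine, not one of Karpathy's):

```python
import numpy as np

# Software 1.0: a human writes the rule down explicitly.
def is_spam_v1(email: str) -> bool:
    return "free money" in email.lower()

# Software 2.0: a human supplies input/output pairs, and an optimizer
# (here, logistic regression trained by gradient descent) finds the program.
emails = ["FREE MONEY now!!!", "lunch at noon?", "claim your free money", "meeting notes"]
labels = np.array([1.0, 0.0, 1.0, 0.0])
feats = np.array([[e.lower().count("free"), e.count("!")] for e in emails], dtype=float)

w, b = np.zeros(2), 0.0
for _ in range(500):                       # the "2.0 compiler": search over programs
    p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
    err = p - labels
    w -= 0.1 * feats.T @ err / len(labels)
    b -= 0.1 * err.mean()

def is_spam_v2(email: str) -> bool:
    x = np.array([email.lower().count("free"), email.count("!")], dtype=float)
    return bool(1.0 / (1.0 + np.exp(-(x @ w + b))) > 0.5)
```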

The gradual or not-so-gradual subsumption of software 1.0 code into software 2.0 is inevitable - one might start by tuning some coefficients here and there, then you might optimize over one of several code branches to run, and before you know it, the system actually consists of an implicit search procedure over many possible sub-programs. The hypothesis is that as we increase availability of compute and data, we will be able to automatically do more and more search over programs to find the optimal routine. Of course, there is always a role for Software 1.0 - we need it for things like visualization and data management. All of these ideas are covered in Andrej's talks and blog posts, so I encourage you to check those out.

How Much Should We Learn in Robotics?

End-to-end learning has yet to outperform the classical control-theory approaches in some tasks, so within the robotics community there is still an ideological divide on how much learning should actually be done.

On one hand, you have classical robotics approaches, which break down the problem into three stages: perception, planning, and control. Perception is about determining the state of the world, planning is about high level decision making around those states, and control is about applying specific motor outputs so that you achieve what you want. Many of the ideas we explore in deep reinforcement learning today (meta-learning, imitation learning, etc.) have already been studied in classical robotics under different terminology (e.g. system identification). The key difference is that classical robotics deals with smaller state spaces, whereas end-to-end approaches fuse perception, planning, and control into a single function approximation problem. There's also a middle ground where one can attempt to use hand-coded constructs from classical robotics as a prior, and then use data to adapt the system to reality. According to Bayesian decision theory, the stronger the prior you have, the less data (evidence) you need to construct a strong posterior belief.

I happen to fall squarely on the far side of the spectrum - the end-to-end approach. I'll discuss why I believe strongly in these approaches.

Three reasons for end-to-end learning

First, it's worked for other domains, so why shouldn't it work for robotics? If there is something about robotics that makes this decidedly not the case, it would be super interesting to understand what makes robotics unique. As an existence proof, our lab and other labs have already built a few real-world systems that are capable of doing manipulation and navigation from end-to-end pixel-to-control. Shown on the left is our grasping system, Qt-Opt, which essentially performs grasping using only monocular RGB, the current arm pose, and end-to-end function approximation. It can grasp objects it's never seen before. We've also had success on door opening and manipulation from imitation learning.

Fused Perception-to-Action in Nature

Secondly, there are often many shortcuts one can take to solve specific tasks, without having to build a unified perception-planning-control stack that is general across all tasks. Mandyam Srinivasan's lab has done cool experiments getting honeybees to fly and perch inside small holes, with a spiral pattern painted on the wall. They found that bees decelerate as they approach the target via the simple heuristic of keeping the rate of image expansion (the spiral) constant. They found that if you artificially increase or decrease the rate of expansion by spinning the spiral clockwise or counterclockwise, the honeybee will predictably speed up or slow down. This is Nature's elegant solution to a control problem: visually-guided odometry is computationally cheaper and less error prone than having to detect where the target is in the world frame, plan a trajectory, and so on. It may not be a general framework for planning and control, but it is sufficient for accomplishing what honeybees need to do.
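A rough sketch of how little machinery such a shortcut needs. The target expansion rate and gain below are arbitrary numbers I picked for illustration, not values from the honeybee experiments:

```python
def landing_speed_controller(image_expansion_rate: float, speed: float,
                             target_rate: float = 0.5, gain: float = 0.2) -> float:
    """Toy version of the honeybee heuristic: hold the rate of image expansion
    constant. As the target gets closer, the same speed produces faster
    expansion, so this rule naturally decelerates the agent toward a soft landing."""
    error = image_expansion_rate - target_rate
    return max(0.0, speed - gain * error)
```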

Okay, maybe honeybees can use end-to-end approaches, but what about humans? Do we need a more general perception-planning-control framework for human problems? Maybe, but we also use many shortcuts for decision making. Take ball catching: we don't catch falling objects by solving ODEs or planning trajectories; instead, we employ a gaze heuristic - as long as an object stays at the same point in your field of view, you will eventually intersect with the object's trajectory. Image taken from Henry Brighton's talk on Robust decision making in uncertain environments.

The Trouble With Defining Anything

Third, we tend to describe decision making processes with words. Words are pretty much all we have to communicate with one another, but they are inconsistent with how we actually make decisions. I like to describe this as an intelligence "iceberg"; the surface of the iceberg is how we think our brain ought to make decisions, but the vast majority of intelligent capability is submerged from view, inaccessible to our consciousness and incompressible into simple language like English. That is why we are capable of performing intelligent feats like perception and dextrous manipulation, but struggle to articulate how we actually perform them in short sentences. If it were easy to articulate in clear unambiguous language, we could just type up those words into a computer program and not have to use machine learning for anything. Words about intelligence are lossy compression, and a lossy representation of a program is not sufficient to implement the full thing.

Consider a simple task of identifying the object in the image on the left (a cow). A human might attempt to string some word-based reasoning together to justify why this is a cow: "you see the context (an open field), you see a nose, you see ears, and black-and-white spots, and maybe the most likely object that has all these parts is a cow".

This is a post-hoc justification, and not actually a full description of how our perception system registers whether something is a cow or not. If you take an actual system capable of recognizing cows with great accuracy (e.g. a convnet) and inspect the salient neurons and channels that respond strongly to cows, you will find a strange looking feature map that is hard to put into words. We can't define anything with human-readable words or code at the level of precision needed for interacting with reality, so we must use raw sensory data - grounded in reality - to figure out the decision-making capabilities we want.

Cooking is Not Software 1.0

Our obsession with focusing on the top half of the intelligence iceberg biases us towards the Software 1.0 way of programming, where we take a hard problem and attempt to describe it - using words - as the composition of smaller problems. There is also a tendency for programmers to think of general abstractions for their code, via ontologies that organize words with other words. Reality has many ways to defy your armchair view of what cows are and how robotic skills ought to be organized to accomplish tasks in an object-oriented manner.

Cooking is one of the holy grails of robotic tasks, because environments are open-ended and there is a lot of dextrous manipulation involved. Cooking analogies abound in programming tutorials - here is an example of making breakfast with asynchronous programming. It's tempting to think that you can build a cooking robot by simply breaking down the multi-stage cooking task into sub-tasks and individual primitive skills.

Sadly, even the most trivial of steps abounds with complexity. Consider the simple task of spreading jam on some toast.

The software 1.0 programmer approaches this problem by breaking down the task into smaller, reusable routines. Maybe you think to yourself, first I need a subroutine for holding the slice of toast in place with the robot fingers, then I need a subroutine to spread jam on the toast.

Spreading jam on toast entails three subroutines: a subroutine for scooping the jam with the knife, depositing the lump of jam on the toast, then spreading it evenly.

Here is where the best laid plans go awry. A lot of things can happen in reality at any stage that would prevent you from moving onto the next stage. What if the toaster wasn't plugged in and you're starting with untoasted bread? What if you get the jam on the knife but in the process break something on the robot and you aren't checking to make sure everything is fine before proceeding to the next subroutine? What if there isn't enough jam in the jar? What if you're on the last slice of bread in the loaf and the crust side is facing up?

The prospect of writing custom code to handle the ends of the bread loaf (literal edge cases) ought to give one pause as to whether this approach is scalable to unstructured environments like kitchens - you end up with a million lines of code that essentially capture the state machine of reality. Reality is chaotic - even if you had a perfect perception system, simply managing reality at the planning level quickly becomes intractable. Learning based approaches give us hope of managing this complexity by accumulating all these edge cases in data, and letting the end-to-end objective (getting some jam on the toast) and the Software 2.0 compiler figure out how to handle them. My belief in end-to-end learning is not because I think ML has unbounded capability, but rather that the alternative approach, where we capture all of reality in a giant hand-coded state machine, is utterly hopeless.

Here is a video where I am washing and cutting strawberries and putting them on some cheesecake. A roboticist that spends too much time in the lab and not the kitchen might prescribe a program that (1) "holds strawberry", (2) "cut strawberry", (3) "pick-and-place on cheesecake", but if you watch the video frame by frame, there are a lot of other manipulation tasks that happen in the meantime - opening and closing containers with one or two hands, pushing things out of the way, inspecting for quality. To use the Intelligence Iceberg analogy: the recipe and high level steps are the surface ice, but the submerged bulk are all the little micro-skills the hands need to do to open containers and adapt to reality. I believe the most dangerous conceit in robotics is to design elegant programming ontologies on a whiteboard, and ignore the subtleties of reality and what its data tells you.

There are a few links I want to share highlighting the complexity of reality. I enjoyed this recent article in Quanta Magazine about the trickiness of defining life. This is not merely a philosophical question; people at NASA are planning a Mars expedition to collect soil samples and answer whether life ever existed on Mars. This mission requires clarity on the definition of life. Just as it's hard to define intelligent capabilities in precise language, so it is hard to define life. These two words may as well be one and the same.

Klaus Greff's talk on What Are Objects? raises some interesting questions about the fuzziness of words. Obviously we want our perception systems to recognize objects so that we may manipulate and plan around them. But as the talk points out, defining what is and is not an object can be quite tricky (is a hole an object? Is the frog prince defined by what he once was, or what he looks like now?).

I've also written a short story on the trickiness of defining even simple classes like "teacups".

I worked on a project with Coline Devin where we used data and Software 2.0 to learn a definition of objects without any human labels. We use a grasping system to pick up stuff and define objects as "that which is graspable". Suppose you have a bin of objects and pick one of them up. The object is now removed from the bin, and maybe the other objects have shifted around the bin a little. You can also easily look at whatever is in your hand. We then design an embedding architecture and train it using the following assumption about reality: the embedding of the objects before the grasp, minus the embedding of the objects after the grasp, should equal the embedding of whatever was picked up. This allowed us to bootstrap a completely self-supervised instance grasping system from a grasping system without ever relying on labels. This is by no means a comprehensive definition of "object" (see Klaus's talk), but I think it's a pretty good one.
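A minimal sketch of that arithmetic constraint, assuming hypothetical encoder functions `embed_scene` and `embed_patch`; the plain squared-error objective is my simplification, not the exact training loss we used:

```python
import numpy as np

def object_arithmetic_loss(embed_scene, embed_patch,
                           pre_grasp_img, post_grasp_img, grasped_object_img) -> float:
    """Encourage: embedding(bin before grasp) - embedding(bin after grasp)
    ≈ embedding(the object that was removed). The encoders are assumed to be
    neural networks trained to minimize this quantity over many grasps."""
    pre = embed_scene(pre_grasp_img)
    post = embed_scene(post_grasp_img)
    grasped = embed_patch(grasped_object_img)
    return float(np.sum((pre - post - grasped) ** 2))
```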

Science and Engineering of End-to-End ML

End-to-end learning is a wonderful principle for building robotic systems, but it is not without its practical challenges and execution risks. Deep neural nets are opaque black box function approximators, which makes debugging them at scale challenging. This requires discipline in both engineering and science, and often the roboticist needs to make a choice as to whether to solve an engineering problem or a scientific one.

This is what a standard workflow looks like for end-to-end robotics. You start by collecting some data, cleaning it, then designing the input and output specification. You fit a model to the data, validate it offline with some metrics like mean-squared error or accuracy, then deploy it in the real world and see if it continues to work as well on your validation sets. You might iterate on the model and validation via some kind of automated hyperparameter tuning.

Most ML PhDs spend all their time on the model training and validation stages of the pipeline. RL PhDs have a slightly different workflow, where they think a bit more about data collection via the exploration problem. But most RL research also happens in simulation, where there is no need to do data cleaning and the feature and label specification is provided to you via the benchmark's design.

While it's true that advancing learning methods is the primary point of ML, I think this behavior is the result of perverse academic incentives.

There is a vicious tendency for papers to put down old ideas and hype up new ones in the pursuit of "technical novelty". The absurdity of all this is that if we ever found that an existing algorithm works super well on harder and harder problems, it would have a hard time getting published in academic conferences. Reviewers operate under the assumption that our ML algorithms are never good enough.

In contrast, production ML usually emphasizes everything else in the pipeline. Researchers on Tesla's Autopilot team have found that in general, 10x'ing your data on the same model architecture outperforms any incremental modeling improvement in the last few years. As Ilya Sutskever says, most incremental algorithm improvements are just data in disguise. Researchers at quantitative trading funds do not change models drastically: they spend their time finding novel data sources that add additional predictive signal. By focusing on large-scale problems, you get a sense of where the real bottlenecks are. You should only work on innovating new learning algorithms if you have reason to believe that that is what is holding your system back.

Here are some examples of real problems I've run into in building end-to-end ML systems. When you collect data on a robot, certain aspects of the code get baked into the data. For instance, the tuning of the IK solver or the acceleration limits on the joints. A few months later, the code on the robot controllers might have changed in subtle ways, like maybe the IK solver was swapped with a different solver. This happens a lot in a place like Google where multiple people work on a single codebase. But because assumptions of the v0 solver were baked into the training data, you now have a train-test mismatch and the ML policy no longer works as well.

Consider an imitation learning task where you collect some demonstrations, and then predict actions (labels) from states (features). An important unit test to perform before you even start training a model is to check whether a robot that replays the exact labels in order can actually solve the task (for an initialization identical to the training data). This check is important because the way you design your labels might make assumptions that don't necessarily hold at test time.
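A sketch of what that unit test could look like; the environment and demonstration interfaces (`env.reset`, `demo.actions`, `success_fn`) are placeholders for whatever your logging format actually provides:

```python
def replay_labels_check(env, demo, success_fn) -> bool:
    """Reset to the demo's initial conditions and replay the recorded action
    labels open-loop. If even this cannot solve the task, the labels bake in
    assumptions (timing, controller state, etc.) that will not hold at test time."""
    env.reset(initial_state=demo.initial_state)
    for action in demo.actions:   # the exact labels the model would be trained on
        env.step(action)
    return success_fn(env)
```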

I've found data management to be one of the most crucial aspects of debugging real world robotic systems. Recently I found a "data bug" where there was a demonstration of the robot doing nothing for 5 minutes straight - the operator probably left the recording running without realizing it. Even though the learning code was fine, noisy data like this can be catastrophic for learning performance.
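A sanity filter that would have caught that episode might look like the sketch below; representing an episode as a [T, action_dim] array and the particular threshold are assumptions on my part:

```python
import numpy as np

def is_idle_episode(actions, min_std: float = 1e-3) -> bool:
    """Flag demonstrations where the robot barely moves, e.g. a recording left
    running by accident. `actions` is a [T, action_dim] array of commands."""
    return bool(np.all(np.std(np.asarray(actions), axis=0) < min_std))
```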

As roboticists we all want to see in our lifetime robots doing holy grail tasks like tidying our homes and cooking in the kitchen. Our existing systems, whether you work on Software 1.0 or Software 2.0 approaches, are far away from that goal. Instead of spending our time researching how to re-solve a task a little bit better than an existing approach, we should be using our existing robotic capabilities to collect new data for tasks we can't solve yet.

There is a delicate balance in choosing between understanding ML algorithms better, versus pushing towards a longer term goal of qualitative leaps in robotic capability. I also acknowledge that the deep learning revolution for robotics needs to begin with solving the easier tasks and then eventually working its way up to the harder problems. One way to accomplish both good science and long term robotics is to understand how existing algorithms break down in the face of harder data and tougher generalization demands encountered in new tasks.

Interesting Problems

Hopefully I've convinced you that end-to-end learning is full of opportunities to really get robotics right, but also rife with practical challenges. I want to highlight two interesting problems that I think are deeply important to pushing this field forward, not just for robotics but for any large-scale ML system.

A typical ML research project starts from a fixed dataset. You code up and train a series of ML experiments, then you publish a paper once you're happy with one of the experiments. These codebases are not very large and don't get maintained beyond the duration of the project, so you can move quickly and scrappily with little to no version control or regression testing.

Consider how this would go for a "lifelong learning" system for robotics, where you are collecting data and never throwing it away. You start the project with some code that generates a dataset (Data v1). Then you train a model with some more code, which compiles a Software 2.0 program (ckpt.v1.a). Then you use that model to collect more data (Data v2), and concatenate your datasets together (Data v1 + Data v2) to then train another model, and use that to collect a third dataset (Data v3), and so on. All the while you might be publishing papers on the intermediate results.

The tricky thing here is that the behavior of Software 1.0 and Software 2.0 code is now baked into each round of data collection, and the Software 2.0 code has assumptions from all prior data and code baked into it. The dependency graph between past versions of code and your current system becomes quite complex to reason about.

This only gets trickier if you are running multiple experiments and generating multiple Software 2.0 binaries in parallel, and collecting data with all of them.

Let's examine what code gets baked into a collected dataset. It is a combination of Software 1.0 code (IK solver, logging schema) and Software 2.0 code (a model checkpoint). The model checkpoint itself is the distillation of a ML experiment, which consists of more Software 1.0 code (Featurization, Training code) and Data, which in turn depends on its own Software 1.0 and 2.0 code, and so on.
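One lightweight way to keep this dependency graph explicit is to record, for every dataset and checkpoint, which code version and which upstream artifacts produced it. A sketch, where the field names and the example chain are my own invention:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Artifact:
    """A dataset or model checkpoint, together with its provenance."""
    name: str                    # e.g. "data_v2" or "ckpt_v1_a"
    code_version: str            # commit hash of the Software 1.0 code that built it
    parents: List["Artifact"] = field(default_factory=list)   # what it was built from

data_v1 = Artifact("data_v1", code_version="abc123")
ckpt_v1_a = Artifact("ckpt_v1_a", code_version="def456", parents=[data_v1])
data_v2 = Artifact("data_v2", code_version="abc130", parents=[ckpt_v1_a])
# data_v2 transitively depends on the IK solver, logging schema, and training
# code that existed when data_v1 was collected and ckpt_v1_a was trained.
```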

Here's the open problem I'd like to pose to the audience: how can we verify correctness of lifelong learning systems (accumulating data, changing code), while ensuring experiments are reproducible and bug free? Version control software and continuous integration testing is indispensable for team collaboration on large codebases. What would the Git of Software 2.0 look like?

Here are a couple ideas on how to mitigate the difficulty of lifelong learning. The flywheel of an end-to-end learning system involves converting data to a model checkpoint, then a model checkpoint to predictions, and model predictions to a final real world evaluation number. That eval also gets converted into data. It's critical to test these four components separately to ensure there are no regressions - if one of these breaks, so does everything else.
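A sketch of what testing the four stages in isolation might look like; every function name, shape, and threshold below is a placeholder for whatever your pipeline actually exposes:

```python
def test_data_to_checkpoint(train_fn, tiny_dataset):
    # Training on a tiny fixture should run end to end and reduce the loss.
    ckpt, loss_history = train_fn(tiny_dataset, steps=50)
    assert loss_history[-1] < loss_history[0]

def test_checkpoint_to_predictions(policy_fn, ckpt, recorded_observation):
    # A frozen observation should map to an action of the right shape and bounds.
    action = policy_fn(ckpt, recorded_observation)
    assert action.shape == (7,) and abs(action).max() <= 1.0   # 7-DoF arm assumed

def test_predictions_to_eval(eval_fn, scripted_policy):
    # A hand-scripted policy with a known success rate pins down the evaluator.
    assert abs(eval_fn(scripted_policy, episodes=20) - 0.5) < 0.2

def test_eval_to_data(collect_fn, scripted_policy):
    # Data collection should produce well-formed, non-empty episodes.
    episodes = collect_fn(scripted_policy, episodes=2)
    assert all(len(ep.actions) > 0 for ep in episodes)
```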

Another strategy is to use Sim2Real, where you train everything in simulation and develop a lightweight fine-tuning procedure for transferring the system to reality. We rely on this technique heavily at Google and I've heard this is OpenAI's strategy as well. In simulation, you can transmute compute into data, so data is relatively cheap and you don't have to worry about handling old data. Every time you change your Software 1.0 code, you can just re-simulate everything from scratch and you don't have to deal with ever-increasing data heterogeneity. You might still have to manage some data dependencies for real world data, because typically sim2real methods require training a CycleGAN.

Compiling Software 2.0 Capable of Lifelong Learning

When people use the phrase "lifelong learning" there are really two definitions. One is about lifelong dataset accumulation, and concatenating prior datasets to train systems that do new capabilities. Here, we may re-compile the Software 2.0 over and over again.

A stronger version of "lifelong learning" is to attempt to train systems that learn on their own and never need to have their Software 2.0 re-compiled. You can think about this as a task that runs for a very long time.

Many of the robotic ML models we build in our lab have goldfish memories - they make all their decisions from a single instant in time. They are, by construction, incapable of remembering what the last action they took was or what happened 10 seconds ago. But there are plenty of tasks where it's useful to remember:

  • An AI that can watch a movie (>170k images) and give you a summary of the plot.
  • An AI that is conducting experimental research, and it needs to remember hundreds of prior experiments to build up its hypotheses and determine what to try next.
  • An AI therapist that should remember the context of all your prior conversations (say, around 100k words).
  • A robot that is cooking and needs to leave something in the oven for several hours and then resume the recipe afterwards.

Memory and learning over long time periods requires some degree of selective memory and attention. We don't know how to select which moments in a sequence are important, so we must acquire that by compiling a Software 2.0 program. We can train a neural network to fit some task objective to the full "lifetime" of the model, and let the model figure out how it needs to selectively remember within that lifetime in order to solve the task.

However, this presents a big problem: in order to optimize this objective, you need to run forward predictions over every step in the lifetime. If you are using backpropagation to train your networks, then you also need to run a similar number of steps in reverse. If you have N data elements and the lifetime is T steps long, the computational cost of learning is between O(NT) and O(NT^2), depending on whether you use RNNs, Transformers, or something in between. Even though a selective attention mechanism might be an efficient way to perform long-term memory and learning, the act of finding that program via Software 2.0 compilation is very expensive because we have to consider full sequences.

Train on Short Sequences and It Just Works

The optimistic take is that we can just train on shorter sequences, and it will just generalize to longer sequences at test time. Maybe you can train selective attention on short sequences, and then couple that with a high capacity external memory. Ideas from Neural Program Induction and Neural Turing Machines seem relevant here. Alternatively, you can use ideas from Q-learning to essentially do dynamic programming across time and avoid having to ingest the full sequence into memory (R2D2).

Hierarchical Computation

Another approach is to fuse multiple time steps into a single one, potentially repeating this trick over and over again until information only has to traverse O(log(T)) sequential steps instead of O(T). This can be done in both forward and backward passes - clockwork RNNs and the dilated convolutions used in WaveNet are good examples of this. A variety of recent sub-quadratic attention improvements to Transformers (Block Sparse Transformers, Performers, Reformers, etc.) can be thought of as special cases of this as well.
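A toy sketch of the hierarchical idea: if each layer looks back twice as far as the previous one, roughly log2(T) layers are enough for the top of the stack to cover the whole sequence. This is pure numpy, with a simple averaging step standing in for a learned layer:

```python
import numpy as np

def dilated_stack(x: np.ndarray, num_layers: int) -> np.ndarray:
    """x: [T, d] sequence. Each layer mixes each timestep with the timestep
    `dilation` steps earlier, and the dilation doubles per layer, so the
    receptive field after L layers spans roughly 2^L timesteps."""
    h = x
    for layer in range(num_layers):
        dilation = 2 ** layer
        shifted = np.roll(h, dilation, axis=0)
        shifted[:dilation] = 0.0        # causal: nothing leaks in from the future
        h = 0.5 * (h + shifted)         # stand-in for a learned fusion layer
    return h

T, d = 1024, 8
out = dilated_stack(np.random.randn(T, d), num_layers=10)   # 2^10 = 1024 steps covered
```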

Parallel Evolution

Maybe we do need to just bite the bullet and optimize over the full sequences, but use embarrassingly parallel algorithms to amortize the time complexity (by distributing it across space). Rather than serially running forward-backward on the same model over and over again, you could imagine testing multiple lifelong learning agents simultaneously and choosing the best of K agents after T time has elapsed.
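A sketch of best-of-K selection; `run_agent_lifetime` is a placeholder for however one trains and scores a single lifelong-learning agent, and is assumed to return an object with a `final_score` field:

```python
from concurrent.futures import ProcessPoolExecutor

def best_of_k(run_agent_lifetime, k: int = 16, lifetime_steps: int = 1_000_000):
    """Run K independent agents in parallel and keep the best one. Wall-clock
    time stays roughly one lifetime; total compute is K lifetimes."""
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(run_agent_lifetime, range(k), [lifetime_steps] * k))
    return max(results, key=lambda r: r.final_score)
```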

If you're interested in these problems, here's some concrete advice for how to get started. Start by looking up the existing literature in the field, pick one of these papers, and see if you can re-implement it from scratch. This is a great way to learn and make sure you have the necessary coding chops to get ML systems working well. Then ask yourself, how well does the algorithm handle harder problems? At what point does it break down? Finally, rather than thinking about incremental improvements to existing algorithms and benchmarks, constantly be thinking of harder benchmarks and new capabilities.

Summary

  • Three reasons why I believe in end-to-end ML for robotics: (1) it worked for other domains, (2) fusing perception and control is a nice way to simplify decision making for many tasks, (3) we can't define anything precisely, so we need to rely on reality (via data) to tell us what to do.
  • When it comes to improving our learning systems, think about the broader pipeline, not just the algorithmic and mathy learning part.
  • Challenge: how do we do version control for Lifelong Learning systems?
  • Challenge: how do we compile Software 2.0 that does Lifelong Learning? How can we optimize for long-term memory and learning without having to optimize over full lifetimes?

Saturday, February 13, 2021

Don't Mess with Backprop: Doubts about Biologically Plausible Deep Learning

Biologically Plausible Deep Learning (BPDL) is an active research field at the intersection of Neuroscience and Machine Learning, studying how we can train deep neural networks with a "learning rule" that could conceivably be implemented in the brain.

The line of reasoning that typically motivates BPDL is as follows:

  1. A Deep Neural Network (DNN) can learn to perform perception tasks that biological brains are capable of (such as detecting and recognizing objects).
  2. If activation units and their weights are to DNNs what neurons and synapses are to biological brains, then what is backprop (the primary method for training deep neural nets) analogous to?
  3. If learning rules in brains are not implemented using backprop, then how are they implemented? How can we achieve similar performance to backprop-based update rules while still respecting biological constraints?

A nice overview of the ways in which backprop is not biologically plausible can be found here, along with various algorithms that propose fixes.

My somewhat contrarian opinion is that designing biologically plausible alternatives to backprop is the wrong question to be asking. The motivating premise of BPDL makes a faulty assumption: that layer activations are neurons and weights are synapses, and that learning-via-backprop must therefore have a counterpart or alternative in biological learning.

Despite the name and their impressive capabilities on various tasks, DNNs actually have very little to do with biological neural networks. One of the great errors in the field of Machine Learning is that we ascribe too much biological meaning to our statistical tools and optimal control algorithms. This leads to confusion for newcomers, who ascribe entirely different meanings to "learning", "evolutionary algorithms", and so on.

DNNs are a sequence of linear operations interspersed with nonlinear operations, applied sequentially to real-valued inputs - nothing more. They are optimized via gradient descent, and gradients are computed efficiently using a dynamic programming scheme known as backprop. Note that I didn't use the word "learning"!

Dynamic programming is the ninth wonder of the world [1], and in my opinion one of the top three achievements of Computer Science. Backprop has linear time-complexity in network depth, which makes it extraordinarily hard to beat from a computational cost perspective. Many BPDL algorithms don't do better than backprop, because they try to take an efficient optimization scheme and shoehorn in an update mechanism with additional constraints.
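To make the linear-time claim concrete, here is a bare-bones reverse sweep through a chain of scalar tanh layers. The gradient flowing into each layer is computed once from the layer above and reused, which is exactly the dynamic programming trick; this is a toy sketch, not a general autodiff system:

```python
import numpy as np

def forward(ws, x):
    """Chain of layers h_{i+1} = tanh(w_i * h_i), scalar for simplicity."""
    hs = [x]
    for w in ws:
        hs.append(np.tanh(w * hs[-1]))
    return hs

def backprop(ws, hs, dloss_dout):
    """One backward pass costs O(depth): the gradient entering layer i is built
    once from layer i+1's gradient, instead of re-deriving each weight's effect
    on the output from scratch (which would cost O(depth^2))."""
    grads, delta = [], dloss_dout
    for w, h_in, h_out in zip(reversed(ws), reversed(hs[:-1]), reversed(hs[1:])):
        delta = delta * (1.0 - h_out ** 2)   # back through tanh
        grads.append(delta * h_in)           # dL/dw for this layer
        delta = delta * w                    # pass down to the layer below
    return list(reversed(grads))

ws = [0.5, -1.2, 0.8]
hs = forward(ws, x=2.0)
grads = backprop(ws, hs, dloss_dout=1.0)
```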

If the goal is to build a biologically plausible learning mechanism, there's no reason that units in Deep Neural Networks should be one-to-one with biological neurons. Trying to emulate a DNN with models of biological neurons feels backwards, like trying to emulate the Windows OS with a human brain: it's hard, and a human brain can't simulate Windows well.

Instead, let's do the emulation the other way around: optimizing a function approximator to implement a biologically plausible learning rule. The recipe is straightforward:

  1. Build a biologically plausible model of a neural network with model neurons and synaptic connections. Neurons communicate with each other using spike trains, rate coding, or gradients, and respect whatever constraints you deem to be "sufficiently biologically plausible". It has parameters that need to be trained.
  2. Use computer-aided search to design a biologically plausible learning rule for these model neurons. For instance, each neuron's feedforward behavior and local update rules can be modeled as a decision from an artificial neural network.
  3. Update the function approximator so that the biological model produces the desired learning behavior. We could train the neural networks via backprop. 
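Here is a heavily simplified sketch of this recipe in JAX: the "biological" model is a single neuron whose synapses update with a purely local rule parameterized by theta, and the outer loop tunes theta with ordinary backprop through the unrolled inner loop. The particular form of the local rule, the tasks, and all constants are my own toy choices, not a claim about what a plausible rule actually looks like:

```python
import jax
import jax.numpy as jnp

def local_update(theta, w, x, y):
    """Inner-loop rule: only locally available signals (presynaptic activity x,
    postsynaptic activity y_hat, a scalar error) are used; no backward pass
    through the inner model."""
    y_hat = jnp.tanh(w @ x)
    return w + theta[0] * (y - y_hat) * x + theta[1] * y_hat * x

def meta_loss(theta, tasks):
    """How well does the local rule learn, averaged over a set of regression tasks?"""
    total = 0.0
    for xs, ys in tasks:
        w = jnp.zeros(xs.shape[1])
        for x, y in zip(xs, ys):            # inner loop uses only the local rule
            w = local_update(theta, w, x, y)
        total += jnp.mean((jnp.tanh(xs @ w) - ys) ** 2)
    return total / len(tasks)

# Outer loop: step 3 of the recipe - plain backprop (autodiff) through the
# unrolled inner loop, adjusting the learning rule rather than the weights.
key = jax.random.PRNGKey(0)
tasks = []
for _ in range(4):
    key, k1, k2 = jax.random.split(key, 3)
    w_true = jax.random.normal(k1, (5,))
    xs = jax.random.normal(k2, (20, 5))
    tasks.append((xs, jnp.tanh(xs @ w_true)))

theta = jnp.array([0.1, 0.0])
grad_fn = jax.grad(meta_loss)
for _ in range(200):
    theta = theta - 0.05 * grad_fn(theta, tasks)
```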

The choice of function approximator we use to find our learning rule is irrelevant - what we care about at the end of the day is answering how a biological brain is able to learn hard tasks like perception, while respecting known constraints such as the fact that biological neurons don't store all activations in memory and only employ local learning rules. We should leverage Deep Learning's ability to find good function approximators, and direct that towards finding good biological learning rules.

The insight that we should (artificially) learn to (biologically) learn is not a new idea, but it is one that I think is not yet obvious to the neuroscience + AI community. Meta-Learning, or "Learning to Learn", is a field that has emerged in recent years, and it formulates the acquisition of a learning algorithm (one potentially superior to gradient descent) as a learning problem in its own right. If meta-learning can find us more sample efficient or superior or robust learners, why can't it find us rules that respect biological learning constraints? Indeed, recent work [1, 2, 3, 4, 5] shows this to be the case. You can indeed use backprop to train a separate learning rule superior to naïve backprop.

I think the reason that many researchers have not really caught onto this idea (that we should emulate biologically plausible circuits with a meta-learning approach) is that until recently, compute power wasn't quite strong enough to both train a meta-learner and a learner. It still requires substantial computing power and research infrastructure to set up a meta-optimization scheme, but tools like JAX make it considerably easier now.

A true biology purist might argue that finding a learning rule using gradient descent and backprop is not an "evolutionarily plausible learning rule", because evolution clearly lacks the ability to perform dynamic programming or even gradient computation. But this can be amended by making the meta-learner evolutionarily plausible. For instance, the mechanism with which we select good function approximators does not need to rely on backprop at all. Alternatively, we could formulate a meta-meta problem whereby the selection process itself obeys the rules of evolutionary selection, but that selection process is found using, once again, backprop.

Don't mess with backprop!


Footnotes

[1] The eighth wonder being, of course, compound interest.

Monday, January 25, 2021

How to Understand ML Papers Quickly

My ML mentees often ask me some variant of the question "how do you choose which papers to read from the deluge of publications flooding Arxiv every day?"

The nice thing about reading most ML papers is that you can cut through the jargon by asking just five simple questions. I try to answer these questions as quickly as I can when skimming papers.

1) What are the inputs to the function approximator?

E.g. a 224x224x3 RGB image with a single object roughly centered in the view. 

2) What are the outputs of the function approximator?

E.g. a 1000-long vector corresponding to the class of the input image.

Thinking about inputs and outputs to the system in a method-agnostic way lets you take a step back from the algorithmic jargon and consider whether other fields have developed methods that might work here under different terminology. I find this approach especially useful when reading Meta-Learning papers.

By thinking about an ML problem first as a set of inputs and desired outputs, you can reason about whether the input is even sufficient to predict the output. Without this exercise you might accidentally set up an ML problem where the output can't possibly be determined by the inputs. The result might be an ML system that makes predictions in ways that are problematic for society.

3) What loss supervises the output predictions? What assumptions about the world does this particular objective make?

ML models are formed by combining biases and data. Sometimes the biases are strong, other times they are weak. To make a model generalize better, you need to add more biases or add more unbiased data. There is no free lunch.

An example: many optimal control algorithms assume a stationary, episodic data-generation procedure that is a Markov Decision Process (MDP). In an MDP, a "state" and an "action" map via the environment's transition dynamics to "a next state, a reward, and whether the episode is over". This structure, though very general, can be used to formulate a loss that trains Q-values to satisfy the Bellman Equation.
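
As a concrete illustration (my own sketch, not tied to any specific paper), here is roughly what a one-step Bellman-error loss looks like. Notice how much of the MDP assumption is baked into these few lines; q_fn can be any function approximator mapping (params, state) to per-action Q-values.

# Minimal sketch of a one-step Bellman-error (Q-learning) loss. Only meaningful
# under the MDP assumptions described above.
import jax
import jax.numpy as jnp

def td_loss(params, target_params, q_fn, batch, gamma=0.99):
    s, a, r, s_next, done = batch                                 # transitions sampled from the MDP
    q_sa = q_fn(params, s)[jnp.arange(a.shape[0]), a]             # Q(s, a)
    q_next = jnp.max(q_fn(target_params, s_next), axis=-1)        # max_a' Q_target(s', a')
    target = r + gamma * (1.0 - done) * jax.lax.stop_gradient(q_next)
    return jnp.mean((q_sa - target) ** 2)                         # squared Bellman error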

4) Once trained, what is the model able to generalize to, in regards to input/output pairs it hasn’t seen before?

Due to the information captured in the data or the architecture of the model, the ML system may generalize fairly well to inputs it has never seen before. In recent years we have seen more and more ambitious levels of generalization, so when reading papers I watch for any surprising generalization capabilities and where they come from (data, bias, or both).

There is a lot of noise in the field about better inductive biases, like causal reasoning or symbolic methods or object-centric representations. These are important tools for building robust and reliable ML systems and I get that the line separating structured data vs. model biases can be blurry. That being said, it baffles me how many researchers think that the way to move ML forward is to reduce the amount of learning and increase the amount of hard-coded behavior. 

We do ML precisely because there are things we don't know how to hard-code. As Machine Learning researchers, we should focus our work on making learning methods better, and leave the hard-coding and symbolic methods to the Machine Hard-Coding Researchers. 

5) Are the claims in the paper falsifiable? 

Papers that make claims that cannot be falsified are not within the realm of science. 


P.S. for additional hot takes and mentorship for aspiring ML researchers, sign up for my free office hours. I've been mentoring students over Google Video Chat most weekends for 7 months now and it's going great. 

Saturday, November 28, 2020

Software and Hardware for General Robots

Disclaimer: these are just my opinions and not necessarily those of my employer or robotics colleagues.

2021-04-23: If you liked this post, you may be interested in a more recent blog post I wrote on why I believe in end-to-end learning for robots.

Hacker News Discussion

Moravec's Paradox describes the observation that our AI systems can solve "adult-level cognitive" tasks like chess-playing or passing text-based intelligence tests fairly easily, while basic sensorimotor skills like crawling around or grasping objects - things one-year-old children can do - remain very difficult.

Anyone who has tried to build a robot to do anything will realize that Moravec's Paradox is not a paradox at all, but rather a direct corollary of our physical reality being so irredeemably complex and constantly demanding. Modern humans traverse millions of square kilometers in their lifetime, a labyrinth full of dangers and opportunities. If we had to consciously process and deliberate all the survival-critical sensory inputs and motor decisions like we do moves in a game of chess, we would have probably been selected out of the gene pool by Darwinian evolution. Evolution has optimized our biology to perform sensorimotor skills in a split second and make it feel easy. 

Another way to appreciate this complexity is to adjust your daily life to a major motor disability, like losing fingers or trying to get around San Francisco without legs.

Software for General Robots

The difficulty of sensorimotor problems is especially apparent to people who work in robotics and get their hands dirty with the messiness of "the real world". What are the consequences of an irredeemably complex reality on how we build software abstractions for controlling robots? 

One of my pet peeves is when people who do not have sufficient respect for Moravec's Paradox propose a programming model where high-level robotic tasks ("make me dinner") can be divided into sequential or parallel computations with clearly defined logical boundaries: wash rice, defrost meat, get the plates, set the table, etc. These sub-tasks can in turn be broken down further. When a task cannot be decomposed further because there are too many edge cases for conventional software to handle ("does the image contain a cat?"), we can attempt to invoke a Machine Learning model as "magic software" for that capability.

This way of thinking - symbolic logic that calls upon ML code - comes from engineers who cling to the tidiness of Software 1.0 abstractions and from programming tutorials that use cooking analogies.
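
To make the programming model I'm criticizing concrete, here is roughly what it looks like in code. Every function below is hypothetical, and the tidy stubs are exactly the abstraction I'm skeptical of.

# A caricature of the "Software 1.0 calls ML magic" programming model.
# Every function here is hypothetical; the stubs only show the shape of the
# abstraction, not a working robot stack.

def object_detector(image, category):
    return []                      # stand-in for the "magic" learned model

def wash_rice(robot): pass
def defrost_meat(robot): pass
def set_table(robot, plate): pass
def grasp(robot, detection): return detection

def get_plates(robot):
    image = robot.camera.capture()                            # hypothetical robot API
    detections = object_detector(image, category="plate")     # ML behind a tidy boundary
    return [grasp(robot, d) for d in detections]

def make_dinner(robot):
    # A clean, sequential decomposition - exactly the part Moravec's Paradox breaks.
    wash_rice(robot)
    defrost_meat(robot)
    for plate in get_plates(robot):
        set_table(robot, plate)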

Do you have any idea how much intelligence goes into a task like "fetching me a snack", at the very lowest levels of motor skill? Allow me to illustrate. I recorded a short video of me opening a package of dates and annotated it with all the motor sub-tasks I performed in the process.


In the span of 36 seconds, I counted about 14 motor and cognitive skills. They happened so quickly that I didn't consciously notice them until I went back and analyzed the video, frame by frame. 

Here are some of the things I did:
  • Leveraging past experience opening this sort of package to understand material properties and how much force to apply.
  • Constantly adapting my strategy in response to unforeseen circumstances (the Ziploc not giving).
  • Adjusting my grasp when slippage occurs.
  • Devising an ad-hoc Weitlaner Retractor with my thumb knuckles to increase force on the Ziploc.
As a roboticist, I find it humbling to watch videos of animals making decisions so quickly and then watch our own robots struggle to do the simplest things. We even have to speed up the robot video 4x-8x to prevent the human watcher from getting bored!

With this video in mind, let's consider where we currently are in the state of robotic manipulation. In the last decade or so, multiple research labs have used deep learning to develop robotic systems that can perform any-object robotic grasping from vision. Grasping is an important problem because in order to manipulate objects, one must usually first grasp them. It took the Google Robotics and X teams 2-3 years to develop our own system, QT-Opt. This was a huge research achievement because it was a general method that worked on pretty much any object and, in principle, could be used to learn other tasks. 

Some people think that this capability to pick up objects can be wrapped in a simple programmatic API and then used to bootstrap us to human-level manipulation. After all, hard problems are just composed of simpler problems, right? 

I don't think it's quite so simple. The high-level API call "pick_up_object()" implies a clear semantic boundary between when the robot grasping begins and when it ends. If you re-watch the video above, how many times do I perform a grasp? It's not clear to me at all where you would slot those function calls. Here is a survey if you are interested in participating in a poll of "how many grasps do you see in this video", whose results I will update in this blog post.

If we need to solve 13 additional manipulation skills just to open a package of dates, and each one of these capabilities takes 2-3 years to build, then we are a long, long way from making robots that match the capabilities of humans. Never mind that there isn't a clear strategy for how to integrate all these behaviors into a single algorithmic routine. Believe me, I wish reality were simple enough that complex robotic manipulation could be done mostly in Software 1.0. However, as we move beyond pick-and-place towards dexterous and complex tasks, I think we will need to completely rethink how we integrate different capabilities in robotics.

As you might note from the video, the meaning of a "grasp" is somewhat blurry. Biological intelligence was not specifically evolved for grasping - rather, hands and their behaviors emerged from a few core drives:  regulate internal and external conditions, find snacks, replicate.

None of this is to say that our current robot platforms and the Software 1.0 programming models are useless for robotics research or applications. A general purpose function pick_up_object() can still be combined with "Software 1.0 code" into a reliable system worth billions of dollars in value to Amazon warehouses and other logistics fulfillment centers. General pick-and-place for any object in any unstructured environment remains an unsolved, valuable, and hard research problem.

Hardware for General Robots

What robotic hardware do we require in order to "open a package of dates"?

Willow Garage was one of the pioneers in home robots, showing that a teleoperated PR2 robot could be used to tidy up a room (note that two arms are needed here for more precise placement of pillows). These are made up of many pick-and-place operations.
 

This video was made in 2008. That was 12 years ago! It's sobering to think of how much time has passed and how little the needle has seemingly moved. Reality is hard. 

The Stretch is a simple telescoping arm attached to a vertical gantry. It can do things like pick up objects, wipe planar surfaces, and open drawers.



However, futurists beware! A common source of hype for people who don't think enough about physical reality is to watch demos of robots doing useful things in one home, and then conclude that the same robots are ready to do those tasks in any home.

The Stretch video shows the robot pulling open a dryer door (left-swinging) and retrieving clothes from it. The video is a bit deceptive - I think the camera physically cannot see the interior of the dryer, so even though a human can teleoperate the robot to do the task, it would have serious difficulty ensuring that the dryer has been completely emptied.

Here is a picture of my own dryer, which has a right-swinging door and sits close to a wall. I'm not sure whether the Stretch can fit in this tight space, but the PR2 definitely would not be able to open this door without its base getting in the way.




Reality's edge cases are often swept under the rug when making robot demo videos, which usually show the robot operating in an optimal environment it is well-suited for. But the full range of tasks humans do in the home is vast. Neither the PR2 nor the Stretch can crouch under a table to pick up lint off the floor, change a lightbulb while standing on a chair, fix caulking in a bathroom, open mail with a letter opener, move dishes from the dishwasher to the high cabinets, break down cardboard boxes for the recycle bin, or go outside and retrieve the mail.

And of course, they can't even open a Ziploc package of dates. If you think that was complex, here is a first-person video of me chopping strawberries, washing utensils, and decorating a cheesecake. This was recorded with a GoPro strapped to my head. Watch each time my fingers twitch - each one is a separate manipulation task!



We often talk about a future where robots do our cooking for us, but I don't think it's possible with any hardware on the market today. The only viable hardware for a robot meant to do any task in human spaces is an adult-sized humanoid, with two arms, two legs, and five fingers on each hand.

Just as I argued for Software 1.0 in robotics, there is still an enormous space of robot morphologies that can provide value to research and commercial applications. That doesn't change the fact that alternative hardware can't do all the things a humanoid can in a human-centric space. Agility Robotics is one of the companies that gets it on the hardware design front. People who build physical robots use their hands a lot - could you imagine the robot you are building assembling a copy of itself?


Why Don't We Just Design Environments to be More Robot-Friendly?

A compromise is to co-design the environment with the robot to avoid infeasible tasks like above. This can simplify both the hardware and software problems. Common examples I hear incessantly go like this:
  1. Washing machines are better than a bimanual robot washing dishes in the sink, and a dryer is a more efficient machine than a human hanging out clothes to air-dry.
  2. Airplanes are better at transporting humans than birds.
  3. We built cars and roads, not faster horses.
  4. Wheels can bear more weight and are more energetically efficient than legs.
In the home robot setting, we could design special dryer doors that the robot can open easily, or have custom end-effectors (tools) for each task instead of a five-fingered hand. We could go as far as to have the doors be motorized and open themselves with a remote API call, so the robot doesn't even need to open the dryer on its own.
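
For instance, a co-designed dryer might simply expose its door over a network API so the robot never has to manipulate it at all. A hypothetical sketch (the endpoints and robot API are made up for illustration):

# Hypothetical sketch of environment co-design: the appliance exposes an API so
# the robot never manipulates the door. Endpoint and payloads are invented.
import requests

def empty_dryer(robot, dryer_host="http://dryer.local"):
    requests.post(f"{dryer_host}/door/open", timeout=5.0)    # motorized door opens itself
    clothes = robot.retrieve_all(location="dryer_drum")      # hypothetical robot API
    requests.post(f"{dryer_host}/door/close", timeout=5.0)
    return clothes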

At the far end of this axis, why even bother with building a robot? We could re-imagine the design of homes themselves as a single automated storage and retrieval system (ASRS) that brings you whatever you need from any location in the house, like a dumbwaiter (except it would work horizontally as well as vertically). This would dispense with the need to have a robot walking around in your home.

This pragmatic line of thinking is fine for commercial applications, but as a human being and a scientist, it feels a bit like a concession of defeat that we cannot make robots do tasks the way humans do. Let's not forget the Science Fiction dreams that inspired so many of us down this career path - it is not about doing the tasks better, it is about doing everything humans can. A human can wash dishes and dry clothes by hand, so a truly general-purpose robot should be able to as well. For many people, this endeavor is as close as we can get to Biblical Creation: "Let Us make man in Our image, after Our likeness, to rule over the fish of the sea and the birds of the air, over the livestock, and over all the earth itself and every creature that crawls upon it."

Yes, we've built airplanes to fly people around. Airplanes are wonderful flying machines. But to build a bird, which can do a million things and fly? That, in my mind, is the true spirit of general purpose robotics.

Sunday, September 27, 2020

My Criteria for Reviewing Papers

Xiaoyi Yin (尹肖贻) has kindly translated this post into Chinese (中文)

Accept-or-reject decisions for the NeurIPS 2020 conference are out, with 9454 submissions and 1900 accepted papers (20% acceptance rate). Congratulations to everyone (regardless of acceptance decision) for their hard work in doing good research!

It's common knowledge among machine learning (ML) researchers that acceptance decisions at NeurIPS and other conferences are something of a weighted dice roll. In this silly theatre we call "Academic Publishing" -- a mostly disjoint concept from research, by the way -- reviews are all over the place because each reviewer favors different things in ML papers. Here are some criteria that a reviewer might care about:

Correctness: This is the bare minimum for a scientific paper. Are the claims made in the paper scientifically correct? Did the authors take care not to train on the test set? If an algorithm was proposed, do the authors convincingly show that it works for the reasons they stated? 

New Information: Your paper has to contribute new knowledge to the field. This can take the form of a new algorithm, or new experimental data, or even just a different way of explaining an existing concept. Even survey papers should contain some nuggets of new information, such as a holistic view unifying several independent works.

Proper Citations: A related work section that articulates connections to prior work and explains why your work is novel. Some reviewers will reject papers that don't tithe prior work adequately, or aren't sufficiently distinguished from it.

SOTA results: It's common to see reviewers demand that papers (1) propose a new algorithm and (2)  achieve state-of-the-art (SOTA) on a benchmark. 

More than "Just SOTA": No reviewer will penalize you for achieving SOTA, but some expect more than just beating the benchmark, such as one or more of the criteria in this list. Some reviewers go as far as to bash the "SOTA-chasing" culture of the field, which they deem to be "not very creative" and "incremental". 

Simplicity: Many researchers profess to favor "simple ideas". However, the difference between "your simple idea" and "your trivial extension to someone else's simple idea" is not always so obvious.

Complexity: Some reviewers deem papers that don't present any new methods or fancy math proofs as "trivial" or "not rigorous".

Clarity & Understanding: Some reviewers care about the mechanistic details of proposed algorithms and furthering understanding of ML, not just achieving better results. This is closely related to "Correctness".

Is it "Exciting"?: Julian Togelius (AC for NeurIPS '20) mentions that many papers he chaired were simply not very exciting. Only Julian can know what he deems "exciting", but I suppose he means having "good taste" in choosing research problems and solutions. 




Sufficiently Hard Problems: Some reviewers reject papers for evaluating on datasets that are too simple, like MNIST. "Sufficiently hard" is a moving goalpost, with the implicit expectation that as the field develops better methods, the benchmarks have to get harder to push on unsolved capabilities. Also, SOTA methods on simple benchmarks are not always SOTA on harder benchmarks that are closer to real-world applications. Thankfully, my most cited paper was written at a time when it was still acceptable to publish on MNIST.

Is it Surprising? Even if a paper demonstrates successful results, a reviewer might claim that they are unsurprising or "obvious". For example, papers applying standard object recognition techniques to a novel dataset might be argued to be "too easy and straightforward" given that the field expects supervised object recognition to be mostly solved (this is not really true, but the benchmarks don't reflect that). 

I really enjoy papers that defy intuitions, and I personally strive to write surprising papers. 

Some of my favorite papers in this category do not achieve SOTA or propose any new algorithms at all:

  1. Approximating CNNs with Bag-of-local-Features models works surprisingly well on ImageNet
  2. Understanding Deep Learning Requires Rethinking Generalization.
  3. A Metric Learning Reality Check
  4. Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations 
  5. Adversarial Spheres

Is it Real? Closely related to "sufficiently hard problems". Some reviewers think that games are a good testbed to study RL, while others (typically from the classical robotics community) think that Mujoco Ant and a real robotic quadruped are entirely different problems; algorithmic comparisons on the former tell us nothing about the same set of experiments on the latter.

Does Your Work Align with Good AI Ethics? Some view the development of ML technology as a means to build a better society, and discourage papers that don't align with their AI ethics. The required "Broader Impact" statements in NeurIPS submissions this year are an indication that the field is taking this much more seriously. For example, if you submit a paper that attempts to infer criminality from only facial features or perform autonomous weapon targeting, I think it's likely your paper will be rejected regardless of what methods you develop.

Different reviewers will prioritize different aspects of the above, and many of these criteria are highly subjective (e.g. problem taste, ethics, simplicity). For each of the criteria above, it's possible to come up with counterexamples of highly-cited or impactful ML papers that don't meet that criterion but possibly meet others.


My Criteria

I wanted to share my criteria for how I review papers. When it comes to recommending accept/reject, I mostly care about Correctness and New Information. Even if I think your paper is boring and unlikely to be an actively researched topic in 10 years, I will vote to accept it as long as your paper helped me learn something new that I didn't think was already stated elsewhere. 

Some more specific examples:

  • If you make a claim about humanlike exploration capabilities in RL in your introduction and then propose an algorithm to do something like that, I'd like to see substantial empirical justification that the algorithm is indeed similar to what humans do.
  • If your algorithm doesn't achieve SOTA, that's fine with me. But I would like to see a careful analysis of why your algorithm falls short.
  • When papers propose new algorithms, I prefer to see that the algorithm is better than prior work. However, I will still vote to accept if the paper presents a factually correct analysis of why it doesn't do better than prior work. 
  • If you claim that your new algorithm works better because of reason X, I would like to see experiments that show that it isn't because of alternate hypotheses X1, X2. 
Correctness is difficult to verify. Many metric learning papers were proposed in the last 5 years and accepted at prestigious conferences, only for Musgrave et al. '20 to point out that the experimental methodology across these papers was not consistent.

I should get off my high horse and say that I'm part of the circus too. I've reviewed papers for 10+ conferences and workshops, and I can honestly say that I only understood about 25% of the papers from just reading them. An author puts tens or hundreds of hours into designing and crafting a research paper and its experimental methodology, and I only put in a few hours deciding whether it is "correct science". Rarely am I able to approach a paper with the level of mastery needed to rigorously evaluate correctness.

A good question to constantly ask yourself is: "what experiment would convince me that the author's explanations are correct and not due to some alternate hypothesis? Did the authors check that hypothesis?"

I believe that we should accept all "adequate" papers, and that more subjective things like "taste" and "simplicity" should be reserved for paper awards, spotlights, and oral presentations. I don't know if everyone should adopt these criteria, but I think it's helpful to at least be transparent as a reviewer about how I make accept/reject decisions.

Opportunities for Non-Traditional Researchers

If you're interested in getting mentorship for learning how to read, critique, and write papers better, I'd like to plug my weekly office hours, which I hold on Saturday mornings over Google Meet. I've been mentoring about 6 people regularly over the last 3 months and it's working out pretty well. 

Anyone who is not in a traditional research background (not currently in an ML PhD program) can reach out to me to book an appointment. You can think of this like visiting your TA's office hours for help with your research work. Here are some of the services I can offer, completely pro bono:

  • If you have trouble understanding a paper I can try to read it with you and offer my thoughts on it as if I were reviewing it.
  • If you're very, very new to the field and don't even know where to begin, I can offer some starting exercises like reading / summarizing papers, reproducing existing papers, and so on.
  • I can try to help you develop a good taste of what kinds of problems to work on, how to de-risk ambitious ideas, and so on.
  • Advice on software engineering aspects of research. I've been coding for over 10 years; I've picked up some opinions on how to get things done quickly.
  • Asking questions about your work as if I was a visitor at your poster session.
  • Helping you craft a compelling story for a paper you want to write.
No experience is required, all that you need to bring to the table is a desire to become better at doing research. The acceptance rate for my office hours is literally 100% so don't be shy!