Saturday, November 28, 2020

Software and Hardware for General Robots

Disclaimer: these are just my opinions and not necessarily those of my employer or robotics colleagues.

2021-04-23: If you liked this post, you may be interested in a more recent blog post I wrote on why I believe in end-to-end learning for robots.

Hacker News Discussion

Moravec's Paradox describes the observation that our AI systems can solve "adult-level cognitive" tasks like chess-playing or passing text-based intelligence tests fairly easily, while basic sensorimotor skills like crawling around or grasping objects - things one-year-old children can do - are very difficult for them.

Anyone who has tried to build a robot to do anything will realize that Moravec's Paradox is not a paradox at all, but rather a direct corollary of our physical reality being so irredeemably complex and constantly demanding. Modern humans traverse millions of square kilometers in their lifetime, a labyrinth full of dangers and opportunities. If we had to consciously process and deliberate all the survival-critical sensory inputs and motor decisions like we do moves in a game of chess, we would have probably been selected out of the gene pool by Darwinian evolution. Evolution has optimized our biology to perform sensorimotor skills in a split second and make it feel easy. 

Another way to appreciate this complexity is to adjust your daily life to a major motor disability, like losing fingers or trying to get around San Francisco without legs.

Software for General Robots

The difficulty of sensorimotor problems is especially apparent to people who work in robotics and get their hands dirty with the messiness of "the real world". What are the consequences of an irredeemably complex reality on how we build software abstractions for controlling robots? 

One of my pet peeves is when people who do not have sufficient respect for Moravec's Paradox propose a programming model where high-level robotic tasks ("make me dinner") can be divided into sequential or parallel computations with clearly defined logical boundaries: wash rice, defrost meat, get the plates, set the table, etc. These sub-tasks can in turn be broken down further. When a task cannot be decomposed further because there are too many edge cases for conventional software to handle ("does the image contain a cat?"), we can attempt to invoke a Machine Learning model as "magic software" for that capability.

This way of thinking - symbolic logic that calls upon ML code - arises from engineers who are used to clinging to the tidiness of Software 1.0 abstractions and programming tutorials that use cooking analogies. 
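To make the pattern concrete, here is a minimal sketch of the kind of program this mindset produces: tidy symbolic control flow, with a learned model called only at the point where hand-written rules give up. Every name here (`ml_is_rice_clean`, `make_dinner`, the `observation` dictionary) is hypothetical and invented for illustration, and the "model" is a stub rather than a real network.

```python
from typing import Callable, Dict, List

Observation = Dict[str, bool]


def ml_is_rice_clean(observation: Observation) -> bool:
    """Hypothetical stand-in for a learned perception model ("magic software").

    A real system would run a neural network on camera images; this stub
    simply reads a flag so that the example is runnable.
    """
    return observation.get("rice_looks_clean", False)


def wash_rice(log: List[str], observation: Observation) -> None:
    # Symbolic control flow wrapped around a single ML call.
    if not ml_is_rice_clean(observation):
        log.append("wash rice")


def defrost_meat(log: List[str], observation: Observation) -> None:
    log.append("defrost meat")


def get_plates(log: List[str], observation: Observation) -> None:
    log.append("get the plates")


def set_table(log: List[str], observation: Observation) -> None:
    log.append("set the table")


def make_dinner(observation: Observation) -> List[str]:
    """Run the sub-tasks in a fixed order, as if their boundaries were clean."""
    steps: List[Callable[[List[str], Observation], None]] = [
        wash_rice,
        defrost_meat,
        get_plates,
        set_table,
    ]
    log: List[str] = []
    for step in steps:
        step(log, observation)
    return log


print(make_dinner({}))
print(make_dinner({"rice_looks_clean": True}))
```

The sketch looks reasonable precisely because each function hides all of the motor skill inside a single line like `log.append("wash rice")`. That hidden part is where nearly all of the difficulty lives.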

Do you have any idea how much intelligence goes into a task like "fetching me a snack", at the very lowest levels of motor skill? Allow me to illustrate. I recorded a short video of me opening a package of dates and annotated it with all the motor sub-tasks I performed in the process.

In the span of 36 seconds, I counted about 14 motor and cognitive skills. They happened so quickly that I didn't consciously notice them until I went back and analyzed the video, frame by frame. 

Here are some of the things I did:
  • Leveraged past experience opening this sort of package to understand material properties and how much force to apply.
  • Constantly adapted my strategy in response to unforeseen circumstances (Ziploc not giving).
  • Adjusted my grasp when slippage occurred.
  • Devised an ad-hoc Weitlaner Retractor with my thumb knuckles to increase force on the Ziploc.
As a roboticist, it's humbling to watch videos of animals making decisions so quickly and then watch our own robots struggle to do the simplest things. We even have to speed up the robot video 4x-8x to prevent the human watcher from getting bored! 

With this video in mind, let's consider where we currently are in the state of robotic manipulation. In the last decade or so, multiple research labs have used deep learning to develop robotic systems that can perform any-object robotic grasping from vision. Grasping is an important problem because in order to manipulate objects, one must usually first grasp them. It took the Google Robotics and X teams 2-3 years to develop our own system, QT-Opt. This was a huge research achievement because it was a general method that worked on pretty much any object and, in principle, could be used to learn other tasks. 

Some people think that this capability to pick up objects can be wrapped in a simple programmatic API and then used to bootstrap us to human-level manipulation. After all, hard problems are just composed of simpler problems, right? 

I don't think it's quite so simple. The high-level API call "pick_up_object()" implies a clear semantic boundary between when the robot grasping begins and when it ends. If you re-watch the video above, how many times do I perform a grasp? It's not clear to me at all where you would slot those function calls. Here is a survey if you are interested in participating in a poll of "how many grasps do you see in this video", whose results I will update in this blog post. 
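One way to see why the boundary is hard to pin down is to try writing the rule that decides when a grasp starts and stops. The sketch below is a toy, not a real grasp detector: the `count_grasps` function, the threshold rule, and the force trace are all made up for illustration and are not measurements from the video. It counts a "grasp" each time a grip-force signal rises above a threshold.

```python
from typing import List


def count_grasps(grip_force: List[float], threshold: float) -> int:
    """Count how many times the force rises from below to at-or-above threshold."""
    grasps = 0
    holding = False
    for force in grip_force:
        if force >= threshold and not holding:
            # Rising edge: by this definition, a new grasp begins here.
            grasps += 1
            holding = True
        elif force < threshold:
            # Falling below the threshold ends the current grasp.
            holding = False
    return grasps


# Made-up grip-force trace in arbitrary units (illustrative only).
trace = [0.0, 0.6, 0.9, 0.4, 0.8, 0.3, 0.1, 0.7, 0.9, 0.0]

print(count_grasps(trace, threshold=0.5))  # 3
print(count_grasps(trace, threshold=0.2))  # 2
```

The same motion yields three grasps under one threshold and two under another, so the answer to "how many times did I call pick_up_object()?" depends on an arbitrary definition rather than on anything in the motion itself.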

If we need to solve 13 additional manipulation skills just to open a package of dates, and each one of these capabilities takes 2-3 years to build, then we are a long, long way from making robots that match the capabilities of humans. Never mind that there isn't a clear strategy for how to integrate all these behaviors together into a single algorithmic routine. Believe me, I wish reality were simple enough that complex robotic manipulation could be done mostly in Software 1.0. However, as we move beyond pick-and-place towards dexterous and complex tasks, I think we will need to completely rethink how we integrate different capabilities in robotics.
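As a back-of-the-envelope check on that claim, the snippet below multiplies out the numbers from the text. It assumes, purely for illustration, that the skills are developed one after another with no shared progress between them; the function name `serial_effort_years` is made up here.

```python
from typing import Tuple


def serial_effort_years(num_skills: int, years_per_skill: Tuple[int, int]) -> Tuple[int, int]:
    """Return (low, high) total years if each skill is built sequentially."""
    low, high = years_per_skill
    return num_skills * low, num_skills * high


# 13 remaining skills at 2-3 years each, per the estimate in the text.
print(serial_effort_years(13, (2, 3)))  # (26, 39)
```

Under those assumptions the total is roughly 26 to 39 years for a single package of dates, before counting any of the work needed to integrate the skills.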

As you might note from the video, the meaning of a "grasp" is somewhat blurry. Biological intelligence was not specifically evolved for grasping - rather, hands and their behaviors emerged from a few core drives: regulate internal and external conditions, find snacks, replicate.

None of this is to say that our current robot platforms and the Software 1.0 programming models are useless for robotics research or applications. A general purpose function pick_up_object() can still be combined with "Software 1.0 code" into a reliable system worth billions of dollars in value to Amazon warehouses and other logistics fulfillment centers. General pick-and-place for any object in any unstructured environment remains an unsolved, valuable, and hard research problem.

Hardware for General Robots

What robotic hardware do we require in order to "open a package of dates"?

Willow Garage was one of the pioneers in home robots, showing that a teleoperated PR2 robot could be used to tidy up a room (note that two arms are needed here for more precise placement of pillows). Tidying tasks like this are made up of many pick-and-place operations.

This video was made in 2008. That was 12 years ago! It's sobering to think of how much time has passed and how little the needle has seemingly moved. Reality is hard. 

The Stretch is a simple telescoping arm attached to a vertical gantry. It can do things like pick up objects, wipe planar surfaces, and open drawers.

However, futurist beware! A common source of hype among people who don't think enough about physical reality is watching demos of robots doing useful things in one home, and then concluding that the same robots are ready to do those tasks in any home.

The Stretch video shows the robot pulling open a dryer door (left-swinging) and retrieving clothes from it. The video is a bit deceptive - I think the camera physically cannot see the interior of the dryer, so even though a human can teleoperate the robot to do the task, it would run into serious difficulty when ensuring that the dryer has been completely emptied.

Here is a picture of my own dryer, which has a right-swinging door close to a wall. I'm not sure if the Stretch can actually fit in this tight space, but the PR2 definitely would not be able to open this door without the base getting in the way. 

Reality's edge cases are often swept under the rug when making robot demo videos, which usually show the robot operating in an optimal environment that the robot is well-suited for. But the full range of tasks humans do in the home is vast. Neither the PR2 nor the Stretch can crouch under a table to pick up lint off the floor, change a lightbulb while standing on a chair, fix caulking in a bathroom, open mail with a letter opener, move dishes from the dishwasher to the high cabinets, break down cardboard boxes for the recycle bin, or go outside and retrieve the mail. 

And of course, they can't even open a Ziploc package of dates. If you think that was complex, here is a first-person video of me chopping strawberries, washing utensils, and decorating a cheesecake. This was recorded with a GoPro strapped to my head. Watch each time my fingers twitch - each one is a separate manipulation task!

We often talk about a future where robots do our cooking for us, but I don't think it's possible with any hardware on the market today. The only viable hardware for a robot meant to do any task in human spaces is an adult-sized humanoid, with two arms, two legs, and five fingers on each hand. 

As I discussed with Software 1.0 in robotics, there is an enormous space of robot morphologies that can still provide value to research and commercial applications. That doesn't change the fact that no alternative hardware can do all the things a humanoid can in a human-centric space. Agility Robotics is one of the companies that gets it on the hardware design front. People who build physical robots use their hands a lot - could you imagine the robot you are building assembling a copy of itself? 

Why Don't We Just Design Environments to be More Robot-Friendly?

A compromise is to co-design the environment with the robot to avoid infeasible tasks like those above. This can simplify both the hardware and software problems. Common examples I hear incessantly go like this:
  1. Dishwashers are better than a bimanual robot washing dishes in the sink, and a dryer is more efficient than a human hanging out clothes to air-dry.
  2. Airplanes are better at transporting humans than birds are.
  3. We built cars and roads, not faster horses.
  4. Wheels can bear more weight and are more energetically efficient than legs.
In the home robot setting, we could design special dryer machine doors that the robot can open easily, or have custom end-effectors (tools) for each task instead of a five-fingered hand. We could go as far as to have the doors be motorized and open themselves with a remote API call, so the robot doesn't even need to open the dryer on its own.

At the far end of this axis, why even bother with building a robot? We could re-imagine the design of homes themselves to be a single ASRS (automated storage and retrieval system) that brings you whatever you need from any location in the house, like a dumbwaiter (except it would work horizontally and vertically). This would dispense with the need to have a robot walking around in your home.

This pragmatic line of thinking is fine for commercial applications, but as a human being and a scientist, it feels a bit like a concession of defeat that we cannot make robots do tasks the way humans do. Let's not forget the Science Fiction dreams that inspired so many of us down this career path - it is not about doing the tasks better, it is about doing everything humans can. A human can wash dishes and dry clothes by hand, so a truly general-purpose robot should be able to as well. For many people, this endeavor is as close as we can get to Biblical Creation: “Let Us make man in Our image, after Our likeness, to rule over the fish of the sea and the birds of the air, over the livestock, and over all the earth itself and every creature that crawls upon it.” 

Yes, we've built airplanes to fly people around. Airplanes are wonderful flying machines. But to build a bird, which can do a million things and fly? That, in my mind, is the true spirit of general purpose robotics.