Eric Jang: May 2019

I've been at Google Brain robotics (now referred to as Robotics @ Google) for nearly 3 years. It's helpful to reflect, from time to time, on the scientific, engineering and personal productivity takeaways gleaned from working on large research projects. Every researcher's unique experiences and experimentation can potentially become their personal competitive edge for thinking about new problems in unique ways. Here are mine (so far).

These are ordered chronologically (earliest work first), so that the reader can see how my past experiences shape my current biases and beliefs (orange = first author).

Categorical Reparameterization with Gumbel-Softmax

The importance of a work environment that encourages serendipitous discovery and 20% time (the inspiration for Gumbel-Softmax came to me in a water cooler conversation I was having with Shane Gu).
Research on very basic techniques (e.g. generative modeling) can have a huge impact through various downstream applications.
The simplest method to implement is the one that gets cited the most.
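
To make the "simple to implement" point concrete, here is a minimal sketch of a Gumbel-Softmax sampling step in plain Python. It is an illustration under stated assumptions, not the paper's reference code: the function name, the clamping constant, and the example logits are all made up for this example.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature, rng):
    """Draw one relaxed (soft) one-hot sample over the categories in `logits`."""
    eps = 1e-12  # assumed clamp so both logarithms below stay finite
    perturbed = []
    for logit in logits:
        u = min(max(rng.random(), eps), 1.0 - eps)
        gumbel_noise = -math.log(-math.log(u))  # Gumbel(0, 1) via inverse transform
        perturbed.append((logit + gumbel_noise) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    largest = max(perturbed)
    exps = [math.exp(v - largest) for v in perturbed]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
sample = gumbel_softmax_sample([1.0, 0.5, -1.0], temperature=0.5, rng=rng)
print(sample)  # a length-3 probability vector
```

Lower temperatures push the sample toward a one-hot vector, while higher temperatures flatten it toward uniform.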

End-to-End Learning of Semantic Grasping

The notion of a "class label" is meaningless, and is the wrong way to tackle goal-conditioned grasping.
ML can help robotics, but robotics can also help ML (i.e. retroactive labeling via present poses).
The importance of moving fast and investing in visualization and analysis tools (e.g. notebooks) that do not require a robot.

Time Contrastive Networks

All you need is high-quality data and a contrastive loss. Pierre Sermanet is fond of saying, tongue-in-cheek, that these two things will get us to AGI.
Dream big.

Deep Reinforcement Learning for Vision-Based Robotic Grasping

The importance of a fast prototyping environment and quick experiment turnaround times.
Q-Learning works and scales pretty well.

QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation

Most people don’t really care how QT-Opt is trained; they are excited about what a trained QT-Opt system can do.
All you need is scale, compute, and data.

Grasp2Vec: Learning Object Representations from Self-Supervised Grasping

Magical things can happen if you focus on innovations in better-structured data, instead of better algorithms (all you need is high-quality data and a contrastive loss).
The notion of a class label is meaningless.
Good reward functions are a very nice piece of "Software 2.0" infrastructure: they offer modular functionality, are quick to verify for correctness, and do not impose strong assumptions on upstream or downstream computations (in contrast to RL algorithms).
More on Twitter.

Generative Ensembles for Robust Anomaly Detection

Thinking deeply about the nature of the OoD problem and different types of uncertainty.
The OoD problem is ill-posed, but still useful for practical applications.
OoD and generalization are two sides of the same coin.
I spent 10 days in Jeju mentoring DL camp students. Every day I woke up, ate 3 meals in the same cafeteria downstairs, had no meetings, and thought really hard about the research problem. This monastic working environment was tremendously useful for my creative "flow".

Watch, Try, Learn: Meta-Learning from Demonstrations and Rewards

Optimal control theory says that we need RL to make robots work, but you can get surprisingly far with the original Deep Learning recipe: supervised learning + lots of data + architecture tuning.
Meta-Learning is all about pushing the burden of learning into the prior.
Generative modeling (e.g. principled approaches to density estimation, being able to fit multi-modal distributions) is important for scaling up robotics.
More on Twitter.

General Lessons from Deep RL + Robotics

I am increasingly of the opinion that the biggest wins in making an ML system work come from high-quality data. Many researchers in sub-fields of ML do not prioritize the choice of data when looking for ways to improve on benchmarks. Deep RL on real robots is a great way to do ML research, because the researcher is forced to gather their own dataset and contend with how data biases generalization outcomes.
Robotics is full-stack ML (gathering and serializing custom data, building a custom data pipeline, training and evaluation binaries, inference on a real robotic system), which increases iteration times & decreases opportunities for spontaneous creativity and discovery. Robotics projects tend to take ~1 FTE year to finish, while most DL papers can be completed in 2-3 months. One of the most important things to me right now is figuring out how we can achieve the same iteration speeds in robotics as achieved in other deep learning domains.
Best software engineering practices for de-risking Deep RL engineering are in their early days. How to keep a full-stack dev environment flexible and fast to iterate on (scientific, creative risk) while keeping technical debt from bubbling over (execution risk)? My colleagues and I designed Tensor2Robot to solve a lot of our large-scale ML + robotics problems, but this is just the beginning.

The scope of this post is limited to my own research projects. Of course, there are papers that I didn't work on but that inspire my views tremendously. I'll mention those in a follow-up blog post.

Snapchat's new gender-bending filter is a source of endless fun and laughs at parties. The results are very pleasing to look at. As someone who is used to working with machine learning algorithms, it's almost magical how robust this feature is.

I was so impressed that I signed up for Snapchat and fiddled around with it this morning to try to figure out what's going on under the hood and how I might break it.

N.B.: this is not a serious exercise in reverse-engineering Snapchat's IPA file or studying how other apps engineer similar features; it's just some basic hypothesis testing into when it works and when it doesn't, plus a little narcissistic bathroom selfie fun.

Initial Observations

The center picture is a standard bathroom selfie. To the left is the "male" filter, and on the right the "female" filter.

The first thing most users probably notice is that the app works in real time, works with a few different face angles, and does not require an internet connection to run. Hair behaves very naturally when wearing a beanie.

Here's a rotating profile shot. The app seems to detect whether the face is pointing in a permissible orientation, and only if that boolean is satisfied does the filter get applied.

Gender swap works in a variety of lighting conditions, though the hair does not seem to cast shadows.

Damn! I look cute.

Here was an example that I thought was really cool - the hair captures the directional key lighting.

Occlusion Tests

Ok, it works pretty well. Can we get it to fail? The app detects when the face is in the wrong pose, but what if there are things occluding the face? Do those occluding objects get "transformed" too?

The answer is yes. Below is a test where I slide an object across my face. The app works when half the face is occluded, but it seems like if too much of the face is blocked, the "should I face swap" bit is set to False.

Here's vertical occlusion, where the bit seems to depend on "what percentage of the face real estate is occluded" rather than what important semantic features (e.g. eyes, lips) are occluded. Right before the app decides that the "should I face swap" should switch to "False", you can see the blurring of the white bottle. Also, my hair turns blonde as I center the bottle in view.

Very interesting. This suggests to me that there is definitely some machine learning going on here, and it's picking up on some statistical artifact of the data it was trained on. Do blondes tend to make more makeup tutorials or something?

I partially covered my face in a black charcoal masque, and things seemed pretty stable. The female filter does lighten the masque a bit. It's pretty easy to tell from this GIF that the "face swap" feature is confined to a rectangular region that tracks the head (note the sharp cutoff of the hair as it gets to my shoulders).

The filter stops working once I cover the rest of my face in the masque. Interestingly enough, the ovoid regions of my uncovered skin seem to be detected as faces, and the app proceeds to perform the style transform on that region. You can see the head and face templates flickering in and out like some kind of Junji Ito horror story.

Peeling off the masque is surprisingly stable.
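
The gating behavior seen in these occlusion tests can be sketched as a simple area-based check. This is only a guess at the logic, not Snapchat's implementation: the function name, the boolean-mask representation, and the 0.6 threshold are all assumptions made for illustration.

```python
def should_apply_filter(face_mask, occluder_mask, pose_ok=True,
                        max_occluded_fraction=0.6):
    """Return True if the face swap should run on this frame.

    face_mask and occluder_mask are same-sized 2D lists of booleans marking
    pixels that belong to the face and to an occluding object, respectively.
    The decision uses only the occluded share of face area, not which
    semantic features (eyes, lips) happen to be covered.
    """
    if not pose_ok:
        return False  # face is not in a permissible orientation
    face_pixels = 0
    occluded_pixels = 0
    for face_row, occ_row in zip(face_mask, occluder_mask):
        for is_face, is_occluded in zip(face_row, occ_row):
            if is_face:
                face_pixels += 1
                if is_occluded:
                    occluded_pixels += 1
    if face_pixels == 0:
        return False  # no face detected at all
    return occluded_pixels / face_pixels <= max_occluded_fraction


# A 4x4 "face" with an object covering the left half of it.
face = [[True] * 4 for _ in range(4)]
half_covered = [[col < 2 for col in range(4)] for _ in range(4)]
print(should_apply_filter(face, half_covered))
```

A feature-aware gate would instead weight pixels near the eyes or lips more heavily, which is not what the tests above suggest.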

Hair Layer

I was most impressed by the realism of the hair, so I wanted to figure out whether there were any hair mesh models used for dynamic lighting, or whether it was all machine-learning based.

The hair seems to be rendered as the topmost layer (like a Photoshop layer), but unlike your basic puppy ear/tongue filter, this hair layer has an alpha channel that is partially transparent. If you look closely, there is also a clear segmentation mask for the hair that allows the face to show through. Snapchat is probably doing head tracking to figure out where the head is, and then computing a 2D alpha mask for the hair.
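
A partially transparent top layer like this is typically blended with the standard "over" operator: output = alpha * hair + (1 - alpha) * base. Here is a minimal sketch, assuming RGB values in [0, 1] and a per-pixel alpha mask. The function names and toy pixel values are hypothetical, and nothing here is confirmed to be how Snapchat renders the hair.

```python
def composite_over(hair_rgb, hair_alpha, base_rgb):
    """Blend one hair pixel over one opaque base pixel."""
    return tuple(hair_alpha * h + (1.0 - hair_alpha) * b
                 for h, b in zip(hair_rgb, base_rgb))


def composite_layer(hair_layer, alpha_mask, base_image):
    """Apply the blend to every pixel of same-sized 2D images."""
    return [
        [composite_over(h, a, b) for h, a, b in zip(hair_row, alpha_row, base_row)]
        for hair_row, alpha_row, base_row in zip(hair_layer, alpha_mask, base_image)
    ]


# One row of three pixels: fully transparent, half transparent, fully opaque hair.
hair = [[(0.2, 0.1, 0.0)] * 3]
alpha = [[0.0, 0.5, 1.0]]
face = [[(0.8, 0.6, 0.5)] * 3]
print(composite_layer(hair, alpha, face))
```

Where alpha is 0 the face shows through untouched, which matches the segmentation-mask effect described above.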

How does it work? A guess

At first glance, my mind jumped to some sort of CycleGAN architecture that maps the distribution of male faces to female faces, and vice versa. The dataset would be the billions of selfies Snap has, er, not deleted in the last 8 years.

This does raise a lot of questions though:

Are they training truly unpaired image translation? That would be incredibly impressive, given that CycleGAN is bonkers and shouldn't even work in the first place. I would bet they have an unpaired alignment objective that is regularized by a limited dataset of ground-truth pairs, such as pairs of images of male/female siblings, or even a hand-designed gender transform that acts as data augmentation (e.g. making the jawline rounder can be done without machine learning).
The hair and face transforms seem to be synthesized independently, given that they occupy different layers (or perhaps synthesized together and separated into different layers right before rendering). This is also the first instance I've seen of GANs being used to render the alpha channel. I am a bit dubious of whether the hair is even generated by a GAN at all. On one hand, there is clearly some smooth function that switches out highlights and hair colors as a function of the positioning of an occluding object, suggesting that colors are probably learned partially from data. On the other hand, the hair is so stable that I have a hard time believing it is synthesized completely with a GAN generator. I have seen a few examples of other East Asian male face swaps with similar hairdos, suggesting that maybe there is a large-ish template library of hairdos (that is refined with some ML model).
How do Snap's ML engineers know whether a CycleGAN has converged for such an enormous dataset?
How do they get these neural nets to run with such limited compute budgets? What sorts of image resolutions are they generating on the fly?

If it indeed is a CycleGAN, then applying the male filter to a female-filtered image of me should recover the original image, right?

The image is mostly scale invariant, but as we zoom in pretty close, the face does resemble mine more. I would guess that there is a preprocessing step that crops and resizes the canonical face image prior to feeding it to a neural net.
There are also probably other subroutines in the filter like jaw resizing that don't use a CycleGAN, but whose addition would cause the M2F and F2M filters to no longer be exact inverses of each other.
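
The round-trip reasoning above can be illustrated with a toy cycle-consistency check. The "filters" below are stand-in arithmetic functions on a list of pixel intensities, not neural nets, and every name is hypothetical. The point is only that two exactly inverse mappings give near-zero reconstruction error, while inserting a non-invertible step (here a clamp, standing in for something like jaw resizing) breaks the inverse.

```python
def m2f(pixels):
    """Toy stand-in for a learned male-to-female mapping."""
    return [p + 0.2 for p in pixels]


def f2m(pixels):
    """Toy stand-in for the learned inverse mapping."""
    return [p - 0.2 for p in pixels]


def clamp01(pixels):
    """Toy non-invertible post-processing step."""
    return [min(max(p, 0.0), 1.0) for p in pixels]


def cycle_error(x, forward, backward):
    """Mean absolute difference between x and backward(forward(x))."""
    reconstructed = backward(forward(x))
    return sum(abs(r - o) for r, o in zip(reconstructed, x)) / len(x)


image = [0.1, 0.5, 0.9]
exact = cycle_error(image, m2f, f2m)
with_extra_step = cycle_error(image, lambda p: clamp01(m2f(p)), f2m)
print(exact, with_extra_step)
```

In this toy case only the brightest pixel is affected by the clamp, so the round trip fails for part of the image while the rest is recovered.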

Implications of Technology

I have a friend who does drag. It's a lot of work! I'm excited for technology like this, because it will make it easier for makeup artists, cosplayers, and drag artists to experiment with new ideas and identities cheaply and quickly.

Technology such as face and voice changing enables a wider gap between public Internet personas and the real people managing those characters. This isn't necessarily a bad thing: if you are born a man but are passionate about being a cute anime girl on the internet, who are we to judge? Will gender fluidity & drag culture become more normalized in society as our daily social media normalize gender-bending?

The future is quite exciting.

Eric Jang

Thursday, May 23, 2019

Lessons from AI Research Projects: The First 3 Years

Sunday, May 12, 2019

Fun with Snapchat's Gender Swapping Filter