Sunday, May 12, 2019

Fun with Snapchat's Gender Swapping Filter

Snapchat's new gender-bending filter is a source of endless fun and laughs at parties. The results are very pleasing to look at. As someone who is used to working with machine learning algorithms, it's almost magical how robust this feature is.

I was so duly impressed that I signed up for Snapchat and fiddled around with it this morning to try and figure out what's going on under the hood and how I might break it.

N.B, this is not a serious exercise in reverse-engineering Snapchat's IPA file or studying how other apps engineer similar features; it's just some basic hypothesis testing into when it works and when it doesn't, plus a little narcissistic bathroom selfie fun.


Initial Observations

The center picture is a standard bathroom selfie. To the left is the "male" filter, and on the right the "female" filter.



The first thing most users probably notice is that the app works in real time, works with a few different face angles, and does not require an internet connection to run. Hair behaves very naturally when wearing a beanie.


Here's a rotating profile shot. The app seems to detect whether the face is pointing in a permissible orientation, and only if that boolean is satisfied does the filter get applied.



Gender swap works in a variety of lighting conditions, though the hair does not seem to cast shadows.



Damn! I look cute.

Here was an example that I thought was really cool - the hair captures the directional key lighting.



Occlusion Tests

Ok, it works pretty well. Can we get it to fail? The app detects when the face is in the wrong pose, but what if there are things occluding the face? Do those occluding objects get "transformed" too?

The answer is yes. Below is a test where I slide an object across my face. The app works when half the face is occluded, but it seems like if too much of the face is blocked, the "should I face swap" bit is set to False.


Here's vertical occlusion, where the bit seems to depend on "what percentage of the face real estate is occluded" rather than what important semantic features (e.g. eyes, lips) are occluded. Right before the app decides that the "should I face swap" should switch to "False", you can see the blurring of the white bottle. Also, my hair turns blonde as I center the bottle in view.

Very interesting. This suggests to me that there definitely some machine learning going on here, and it's picking up on some statistical artifact of the data it was trained on. Do blondes tend to make more makeup tutorials or something?



I partially covered my face in a black charcoal masque, and things seemed pretty stable. The female filter does lighten the masque a bit. It's pretty easy to tell from this GIF that the "face swap" feature is confined to a rectangular region that tracks the head (note the sharp cutoff of the hair as it gets to my shoulders).


The filter stops working once I cover the rest of my face in the masque. Interestingly enough, the ovoid regions of my uncovered skin seem to be detected as faces, and the app proceeds to perform the style transform on that region. You can see the head and face templates flickering in and out like some kind of Junji Ito horror story.



Peeling off the masque is surprisingly stable.


Hair Layer

I was most impressed by the realism of the hair, so I wanted to figure out whether there were any hair mesh models used for dynamic lighting, or whether it was all machine-learning based.



The hair seems to be rendered as the topmost layer (like a Photoshop layer), but unlike your basic puppy ear/tongue filter, this hair layer has an alpha channel that is partially transparent. If you look closely there is also a clear segmentation mask for the hair that allows the face to show through. Snapchat is probably doing head tracking to figure out where the head is, computing the 2D alpha mask for the hair.


How does it work? A guess

At first glance, my mind jumped to some sort of CycleGAN architecture that maps the distribution of male faces to female faces, and vice versa. The dataset would be the billions of selfies Snap has, er, not deleted in the last 8 years.

This does raise a lot of questions though:

  • Are they training truly unpaired image translation? That would be incredibly impressive, given that CycleGAN is bonkers and shouldn't even work in the first place. I would bet they have an unpaired alignment objective that is regularized by a limited dataset of ground-truth pairs, such as pairs of images of male/female siblings, or even a hand-designed gender transform that acts as data augmentation (e.g. making the jawline rounder can be done without machine learning). 
  • The hair and face transforms seem to be synthesized independently, given that they occupy different layers (or perhaps synthesized together and separated into different layers right before rendering). This is also the first instance I've seen of GANs being used to render the alpha channel. I am a bit dubious of whether the hair is even generated by a GAN at all. One one hand, there is clearly some smooth function that switches out highlights and hair colors as a function of the positioning of an occluding object, suggesting that colors are probably learned partially from data. On the other hand, the hair is so stable that I have a hard time believing it is synthesized completely with a GAN generator. I have seen a few examples of other East Asian male face swaps with similar hairdos, suggesting that maybe there is a large-ish template library of haridos (that is refined with some ML model).
  • How do Snap's ML engineers know whether a CycleGAN has converged for such an enormous dataset?
  • How do they get these neural nets to run with such limited compute budgets? What sorts of image resolutions are they generating on the fly?
  • If it indeed is a CycleGAN, then applying the male filter to a female-filtered image of me should recover the original image, right? 





  • The image is mostly scale invariant, but as we zoom in pretty close, the face does resemble mine more. I would guess that there is a preprocessing step that crops and resizes the canonical face image prior to feeding it to a neural net.
  • There are also probably other subroutines in the filter like jaw resizing that don't use a CycleGAN, but whose addition would cause the M2F and F2M filters to no longer be exact inverses of each other.



Implications of Technology

I have a friend who does drag. It's a lot of work! I'm excited for technology like this, because it will make it easier for makeup artists, cosplayers, and drag artists to experiment with new ideas and identities cheaply and quickly.

Technology such as face and voice changing enables a wider gap between public Internet personas and the real people managing those characters. This isn't necessarily a bad thing: if you are born a man but are passionate about being a cute anime girl on the internet, who are we to judge? Will gender fluidity & drag culture will become more normalized in society as our daily social media normalize gender-bending?

The future is quite exciting.