Comments on Eric Jang: Tutorial: Categorical Variational Autoencoders using Gumbel-Softmax

2017-11-29T16:25:40.357-08:00

We used a categorical KL instead of the KL of the Gumbel-Softmax distribution. Maddison et al. 2017 (Concrete Distribution) use the latter, which involves tau.
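A minimal sketch of the categorical KL term described in this reply, assuming a uniform prior with probability 1/K per category. The uniform prior and the helper name `categorical_kl_to_uniform` are assumptions for illustration, not code from the tutorial. The term depends only on the class probabilities q(y|x), so tau never appears in it.

```python
import math


def categorical_kl_to_uniform(q):
    """KL(q || p) for a categorical q and a uniform prior p over K = len(q) classes.

    Hypothetical helper: the uniform prior is an assumption of this sketch.
    """
    k = len(q)
    # KL = sum_i q_i * (log q_i - log(1/K)); terms with q_i == 0 contribute 0.
    return sum(qi * (math.log(qi) + math.log(k)) for qi in q if qi > 0.0)


q_flat = [0.25, 0.25, 0.25, 0.25]    # matches the prior, so the KL is zero
q_peaked = [0.97, 0.01, 0.01, 0.01]  # far from the prior, so the KL is positive
q_one_hot = [1.0, 0.0, 0.0, 0.0]     # deterministic q gives the maximum, log K

kl_flat = categorical_kl_to_uniform(q_flat)
kl_peaked = categorical_kl_to_uniform(q_peaked)
kl_one_hot = categorical_kl_to_uniform(q_one_hot)
```

No temperature argument is needed anywhere in this function, which is the difference from the KL of the relaxed (Gumbel-Softmax / Concrete) density.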

2017-10-02T03:19:41.112-07:00

Thanks Eric, this is very exciting work!

I had a question about the loss function. The latent loss here seems to be the KL-divergence between two categorical distributions (without tau). However, the density of the GS distribution in your paper involves tau. Does it simplify somehow in the KL divergence, or did you use the categorical distribution for the loss function instead?

Thanks a lot in advance!

2017-08-07T14:01:10.313-07:00

This comment has been removed by the author.

2017-05-22T18:30:33.254-07:00

What are the pros and cons of using q(y|x) as discrete distribution?

2017-04-23T15:13:49.731-07:00

Sure. Please send me an email (you can find it on my website, evjang.com)

2017-04-23T15:12:18.058-07:00

Sure. The Discrete VAE paper (https://arxiv.org/abs/1609.02200), despite its name, is not the first paper to implement variational autoencoders with stochastic discrete latent variables. Prior work includes NVIL (https://arxiv.org/pdf/1402.0030), DARN, and MuProp. Discrete VAE presents a model that technically counts as a VAE, but its forward pass is not equivalent to the model described in the other papers. In Discrete VAE, the forward sampling is autoregressive through each binary unit, which allows every discrete choice to be marginalized out tractably in the backward pass. Because the forward pass is different, the optimization objective is different, which makes it harder to compare (we are optimizing different models). The non-discrete Gumbel-Softmax relaxation technically also optimizes a different model, but since it's merely a relaxation of the original model, we can still evaluate it the same way.

Whereas DARN, MuProp, NVIL, and Straight-Through Gumbel-Softmax present ways to train the same forward model, Discrete VAE optimizes a new objective altogether. It's an open question what the "right forward pass" is, but it is hard to compare Discrete VAE with other work since they have different forward passes and optimization strategies.
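To make the "relaxation" and "Straight-Through" wording concrete, here is a rough standard-library sketch of the forward pass only. The function names and the example probabilities are assumptions for illustration. In an autodiff framework, the straight-through variant would use the one-hot vector in the forward pass and the gradient of the soft sample in the backward pass; plain Python cannot show that part.

```python
import math
import random


def sample_gumbel(rng):
    # g = -log(-log(u)) with u ~ Uniform(0, 1); clamp u away from 0 to avoid log(0).
    u = max(rng.random(), 1e-12)
    return -math.log(-math.log(u))


def gumbel_softmax_sample(logits, tau, rng):
    """Relaxed sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = [(l + sample_gumbel(rng)) / tau for l in logits]
    m = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def discretize(y_soft):
    """One-hot vector at the argmax of the relaxed sample (forward pass of ST)."""
    k = max(range(len(y_soft)), key=lambda i: y_soft[i])
    return [1.0 if i == k else 0.0 for i in range(len(y_soft))]


rng = random.Random(0)
logits = [math.log(p) for p in (0.1, 0.6, 0.3)]  # example class probabilities
y_soft = gumbel_softmax_sample(logits, tau=0.5, rng=rng)
y_hard = discretize(y_soft)
```

Here `y_soft` is a point on the simplex that approaches one-hot as tau shrinks, while `y_hard` is the exactly discrete sample that the original forward model would see.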

2017-04-14T17:15:42.601-07:00

Thank you Eric for this detailed introduction. Super helpful!

-Yixing

2017-03-31T11:47:10.131-07:00

p_theta(y) prior here in KL divergence code is just a categorical with equal 1/K probabilities, right?

2017-03-06T15:48:35.377-08:00

Hi Eric,

I'm a student researcher at MIT working on applying GANs to a discrete data set. I was wondering if there's any chance we could hop on a call so you could explain this methodology to me further?

2017-01-25T02:06:06.606-08:00

This comment has been removed by a blog administrator.

2016-12-23T13:56:42.009-08:00

More stuff about gumbel-sigmoid here - https://github.com/yandexdataschool/gumbel_lstm/blob/master/demo_gumbel_sigmoid.ipynb

@Ero Gol
afaik, unlike max, argmax (the index of the maximum) has a zero/NA gradient by definition, since infinitesimally small changes in the vector won't change the index of the maximum unless two elements are exactly equal.
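A quick numerical check of that point, as a sketch rather than anything from the tutorial: finite differences through argmax come out as exactly zero, while a temperature softmax (the relaxation) gives a usable nonzero slope. The logits, step size, and helper names are arbitrary choices for illustration.

```python
import math


def argmax(v):
    return max(range(len(v)), key=lambda i: v[i])


def softmax(v, tau=1.0):
    m = max(v)  # subtract the max for numerical stability
    exps = [math.exp((x - m) / tau) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def finite_diff(f, v, i, eps=1e-6):
    """Forward-difference estimate of d f(v) / d v[i]."""
    bumped = list(v)
    bumped[i] += eps
    return (f(bumped) - f(v)) / eps


logits = [1.0, 2.5, 0.3]

# The winning index does not move under a tiny perturbation, so every slope is 0.
d_argmax = [finite_diff(argmax, logits, i) for i in range(len(logits))]

# Slope of the relaxed "probability of class 1" with respect to each logit.
d_soft = [
    finite_diff(lambda v: softmax(v, tau=0.5)[1], logits, i)
    for i in range(len(logits))
]
```

With these values `d_argmax` is all zeros, while `d_soft` is positive for the winning logit and negative for the others, so only the relaxed version passes a learning signal back to the logits.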

2016-11-17T05:02:05.809-08:00

I still don't see why we cannot train the same network by forcing the latent space to be a one-hot vector for the example above. If backprop is the problem, you can flow the error through the argmax node and still learn the parameters. Could you give more details on what the differentiating factor of your method is? Also, could you explain what the z values are in the last figure?

2016-11-14T18:01:16.403-08:00

Hi Eric,

Agree with the posters above me -- great tutorial!

I was wondering how this would be applied to my use case: suppose I have two dense real-valued vectors, and I want to train a VAE s.t. the latent features are categorical and the original and decoded vectors are close together in terms of cosine similarity. I'm guessing that I have to change the first term of the ELBO function, since `p_x.log_prob(x)` isn't what I care about (is that right?). Any thoughts on what the modified version would be?
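One possible reading of this question, sketched under assumptions: replace the reconstruction log-likelihood with the cosine similarity between the input and the decoded vector, and keep the KL term as a penalty. The names `cosine_similarity` and `surrogate_objective` and the weight `beta` are hypothetical, and the result is a surrogate objective rather than a true ELBO, since cosine similarity is not in general a log-likelihood.

```python
import math


def cosine_similarity(a, b, eps=1e-12):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # eps guards against division by zero for all-zero vectors.
    return dot / max(norm_a * norm_b, eps)


def surrogate_objective(x, x_decoded, kl, beta=1.0):
    """Quantity to maximize: reconstruction similarity minus a weighted KL penalty.

    Hypothetical stand-in for the ELBO; `kl` is the latent KL term computed elsewhere.
    """
    return cosine_similarity(x, x_decoded) - beta * kl


x = [0.2, -1.0, 0.5]
same_direction = [0.4, -2.0, 1.0]   # scaled copy of x, cosine similarity 1
orthogonal = [1.0, 0.2, 0.0]        # dot product with x is 0

score_good = surrogate_objective(x, same_direction, kl=0.1)
score_bad = surrogate_objective(x, orthogonal, kl=0.1)
```

Because cosine similarity ignores vector length, a decoder trained this way is free to get the scale of the reconstruction wrong, which may or may not matter for the use case.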

Thanks

+1

2016-11-14T11:44:57.018-08:00

2016-11-10T17:55:47.527-08:00

If categories are large, you will need a more efficient encoding of samples from the categorical distribution than one-hot vectors, otherwise you will have a rank>1e6 matrix multiply. A reparameterization trick for other encodings of vectors might be worth pursuing.
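As a small illustration of why the encoding matters, here is a sketch with hypothetical names and a toy K. For a hard (discrete) sample, multiplying a one-hot vector by a K x D weight matrix gives the same result as looking up a single row, so the forward pass can carry an integer index instead of a length-K vector. This does not by itself cover the relaxed samples, which are dense vectors over all K categories.

```python
import math
import random


def gumbel_max_index(logits, rng):
    """Draw a categorical sample as an integer index via the Gumbel-max trick."""
    best_i, best_v = 0, -math.inf
    for i, l in enumerate(logits):
        u = max(rng.random(), 1e-12)      # clamp to avoid log(0)
        v = l - math.log(-math.log(u))    # logit plus Gumbel(0, 1) noise
        if v > best_v:
            best_i, best_v = i, v
    return best_i


def one_hot_matmul(index, weights):
    """Dense route: build a length-K one-hot vector and multiply, O(K * D) work."""
    k, d = len(weights), len(weights[0])
    one_hot = [1.0 if i == index else 0.0 for i in range(k)]
    return [sum(one_hot[i] * weights[i][j] for i in range(k)) for j in range(d)]


def row_lookup(index, weights):
    """Index route: read one row, O(D) work."""
    return list(weights[index])


rng = random.Random(0)
K, D = 1000, 4  # toy sizes; the comment above has K around 1e6 in mind
weights = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(K)]
logits = [0.0] * K  # uniform categorical, purely for the example

idx = gumbel_max_index(logits, rng)
dense = one_hot_matmul(idx, weights)
sparse = row_lookup(idx, weights)
```

Both routes return the same D-dimensional vector, but only the dense one touches all K rows.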

2016-11-09T22:39:55.198-08:00

Great tutorial!

I am wondering what happens if the number of categories is extremely large, e.g., 1 million. Do gradients need to be calculated for all of the \pi's and g's when there are that many?

Thank you, and looking forward to hearing from you.

2016-11-09T11:05:43.432-08:00

Could you please compare your model to the model called "Discrete Variational Autoencoder" and give some thoughts on the differences and similarities between the two?

https://arxiv.org/abs/1609.02200