Comments on Eric Jang: Tutorial: Categorical Variational Autoencoders using Gumbel-Softmax
Feed updated 2017-04-25T22:24:44.582-07:00 (13 comments, newest first)

Eric, 2017-04-23T15:13:49.731-07:00
Sure. Please send me an email (you can find it on my website, evjang.com).

Eric, 2017-04-23T15:12:18.058-07:00
Sure. The Discrete VAE paper (https://arxiv.org/abs/1609.02200), despite its name, is not the first paper to implement discrete variational autoencoders with stochastic discrete latent variables. Prior work includes NVIL (https://arxiv.org/pdf/1402.0030), DARN, and MuProp. Discrete VAE presents a model that technically counts as a VAE, but its forward pass is not equivalent to the model described in the other papers. In Discrete VAE, the forward sampling is autoregressive through each binary unit, which allows every discrete choice to be marginalized out tractably in the backward pass. Because the forward pass is different, the optimization objective is different, which makes the results harder to compare (we are optimizing different models). The non-discrete Gumbel-Softmax relaxation also technically results in optimizing a different model, but since it is merely a relaxation of the original model, we can still evaluate it the same way.

Whereas DARN, MuProp, NVIL, and Straight-Through Gumbel-Softmax present ways to train the same forward model, Discrete VAE optimizes a new objective altogether.
It's an open question what the "right forward pass" is, but this makes it hard to compare Discrete VAE with other work, since they have different forward passes and optimization strategies.

Unknown, 2017-04-14T17:15:42.601-07:00
Thank you Eric for this detailed introduction. Super helpful!

-Yixing

Gökçen Eraslan, 2017-03-31T11:47:10.131-07:00
The p_theta(y) prior in the KL divergence code is just a categorical with equal 1/K probabilities, right?

Unknown, 2017-03-06T15:48:35.377-08:00
Hi Eric,

I'm a student researcher at MIT doing work on the application of GANs to a discrete data set. I was wondering if there's any chance we could hop on a call so you could explain this methodology to me further?

Brandy Lehmann, 2017-01-25T02:06:06.606-08:00
This comment has been removed by a blog administrator.

Unknown, 2016-12-23T13:56:42.009-08:00
More stuff about gumbel-sigmoid here - https://github.com/yandexdataschool/gumbel_lstm/blob/master/demo_gumbel_sigmoid.ipynb

@Ero Gol
afaik, unlike max, argmax (the index of the maximum) will have a zero/NA gradient by definition, since infinitesimally small changes in the vector won't change the index of the maximum unless there are two exactly equal elements.

Ero Gol, 2016-11-17T05:02:05.809-08:00
I still don't see why we cannot train the same network by forcing the latent space to be a one-hot vector for the example above. If backprop is the problem, you can flow the error through the argmax node and still learn the parameters. Could you give more details on what the differentiating factor of your method is?
Also, could you explain what the z values in the last figure are?

Unknown, 2016-11-14T18:01:16.403-08:00
Hi Eric,

Agree with the posters above me -- great tutorial!

I was wondering how this would be applied to my use case: suppose I have two dense real-valued vectors, and I want to train a VAE such that the latent features are categorical and the original and decoded vectors are close together in terms of cosine similarity. I'm guessing that I have to change the first term of the ELBO function, since `p_x.log_prob(x)` isn't what I care about (is that right?). Any thoughts on what the modified version would be?

Thanks

Unknown, 2016-11-14T11:44:57.018-08:00
+1

Eric, 2016-11-10T17:55:47.527-08:00
If categories are large, you will need a more efficient encoding of samples from the categorical distribution than one-hot vectors, otherwise you will have a rank>1e6 matrix multiply. A reparameterization trick for other encodings of vectors might be worth pursuing.

Unknown, 2016-11-09T22:39:55.198-08:00
Great tutorial!

I am wondering what happens if the number of categories is extremely large, e.g., 1 million. Do gradients need to be calculated for all of the \pi's and g's when there are that many?

Thank you, and looking forward to hearing from you.

Viktor Yanush, 2016-11-09T11:05:43.432-08:00
Could you please compare your model to the model called "Discrete Variational Autoencoder" and give some thoughts on the differences and similarities between the models?

https://arxiv.org/abs/1609.02200
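
Two questions in the thread above (why an argmax node passes no gradient, and what the KL term looks like against a uniform prior) can be checked numerically. The following is a minimal sketch in plain Python, not the tutorial's code: the function names are hypothetical, the uniform 1/K prior is an assumption taken from the question in the thread, and the finite-difference nudge stands in for a real gradient computation.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_gumbel(k, rng):
    """Draw k i.i.d. Gumbel(0, 1) samples as -log(-log(U)), U ~ Uniform(0, 1)."""
    eps = 1e-20  # guards against log(0)
    return [-math.log(-math.log(rng.random() + eps) + eps) for _ in range(k)]


def gumbel_softmax_sample(logits, noise, temperature):
    """Relaxed one-hot sample: softmax((logits + noise) / temperature)."""
    return softmax([l + g for l, g in zip(logits, noise)], temperature)


def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=lambda i: values[i])


def kl_to_uniform(q):
    """KL(q || Uniform(K)) = sum_i q_i * (log q_i - log(1/K)).

    Assumes the prior puts probability 1/K on each of the K categories.
    """
    k = len(q)
    return sum(qi * (math.log(qi) - math.log(1.0 / k)) for qi in q if qi > 0.0)


rng = random.Random(0)
logits = [1.0, 0.5, -0.3]
nudged = [logits[0] + 1e-3] + logits[1:]  # small change to the first logit
noise = sample_gumbel(len(logits), rng)   # shared noise for both evaluations

# argmax is piecewise constant: the nudge does not change the selected index,
# so no learning signal reaches the logits through it.
index_change = argmax(nudged) - argmax(logits)

# The relaxed sample lies on the simplex and responds to the same nudge.
y = gumbel_softmax_sample(logits, noise, temperature=0.5)
y_nudged = gumbel_softmax_sample(nudged, noise, temperature=0.5)

# The KL term is zero for a uniform q and positive for a peaked q.
kl_uniform = kl_to_uniform([1 / 3, 1 / 3, 1 / 3])
kl_peaked = kl_to_uniform(softmax(logits))

print(index_change, y, y_nudged, kl_uniform, kl_peaked)
```

Lowering `temperature` pushes the relaxed sample closer to a one-hot vector, at the cost of larger and noisier gradients. The sketch still touches all K entries on every sample, which is the cost raised in the question about very large category counts.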