Comments on Eric Jang: Tutorial: Categorical Variational Autoencoders using Gumbel-Softmax
Last updated: 2017-01-17. 7 comments.

Unknown, 2016-12-23:
More about Gumbel-sigmoid here: https://github.com/yandexdataschool/gumbel_lstm/blob/master/demo_gumbel_sigmoid.ipynb

@Ero Gol
AFAIK, unlike max, argmax (the index of the maximum) has a zero or undefined gradient by definition, since infinitesimally small changes in the vector won't change the index of the maximum unless two elements are exactly equal.

Ero Gol, 2016-11-17:
I still don't see why we cannot train the same network by forcing the latent code to be a one-hot vector in the example above. If backprop is the problem, you can pass the error through the argmax node and still learn the parameters. Could you give more detail on what differentiates your method? Also, could you explain what the z values in the last figure are?

Unknown, 2016-11-14:
Hi Eric,

Agree with the posters above me -- great tutorial!

I was wondering how this would be applied to my use case: suppose I have two dense real-valued vectors, and I want to train a VAE such that the latent features are categorical and the original and decoded vectors are close together in terms of cosine similarity. I'm guessing that I have to change the first term of the ELBO function, since `p_x.log_prob(x)` isn't what I care about (is that right?). Any thoughts on what the modified version would be?

Thanks

Unknown, 2016-11-14:
+1

Eric, 2016-11-10:
If the number of categories is large, you will need a more efficient encoding of samples from the categorical distribution than one-hot vectors; otherwise you will have a rank>1e6 matrix multiply. A reparameterization trick for other encodings of vectors might be worth pursuing.

Unknown, 2016-11-09:
Great tutorial!

I am wondering what happens if the number of categories is extremely large, e.g., 1 million. Do gradients need to be calculated for all the π's and g's in that case?

Thank you, and looking forward to hearing from you.

Viktor Yanush, 2016-11-09:
Could you please compare your model to the model called "Discrete Variational Autoencoder" and give some thoughts on the differences and similarities between the models?

https://arxiv.org/abs/1609.02200
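
On the argmax question raised in the thread above, a minimal sketch in plain Python may help. It compares a hard one-hot argmax sample with a softmax relaxation when one logit is bumped by a small amount. The function names, the logits, the fixed noise values, and the temperature are all illustrative assumptions, not code from the tutorial.

```python
import math
import random


def sample_gumbel(rng, eps=1e-20):
    """Draw one Gumbel(0, 1) sample by inverse transform sampling."""
    u = rng.random()
    return -math.log(-math.log(u + eps) + eps)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def hard_sample(logits, noise):
    """Gumbel-Max: one-hot vector at the argmax of the perturbed logits."""
    perturbed = [l + g for l, g in zip(logits, noise)]
    k = perturbed.index(max(perturbed))
    return [1.0 if i == k else 0.0 for i in range(len(perturbed))]


def soft_sample(logits, noise, temperature):
    """Gumbel-Softmax: softmax of the perturbed logits over the temperature."""
    return softmax([(l + g) / temperature for l, g in zip(logits, noise)])


# Illustrative values (assumptions). The noise is fixed so the comparison is
# reproducible; in practice each entry would come from sample_gumbel.
logits = [1.0, 0.5, -0.2]
noise = [0.3, -0.1, 0.7]
delta = 1e-4
bumped = [logits[0] + delta] + logits[1:]

# Finite differences: how much does each output move when one logit moves?
hard_change = [b - a for a, b in zip(hard_sample(logits, noise),
                                     hard_sample(bumped, noise))]
soft_change = [b - a for a, b in zip(soft_sample(logits, noise, 1.0),
                                     soft_sample(bumped, noise, 1.0))]

print("hard:", hard_change)  # the one-hot output does not move
print("soft:", soft_change)  # the relaxed output shifts slightly

# Drawing fresh noise instead of the fixed values above:
rng = random.Random(0)
print("one Gumbel draw:", sample_gumbel(rng))
```

The hard sample is piecewise constant in the logits, so the finite difference is exactly zero and carries no learning signal. The relaxed sample moves smoothly, which is what lets gradients reach the logits. Lowering the temperature brings the relaxed sample closer to one-hot, typically at the cost of higher-variance gradients.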
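
For the cosine-similarity use case asked about above, one possible modification is to swap the log-likelihood term for a cosine-based reconstruction score while keeping a KL penalty against a uniform categorical prior. The sketch below is an assumption for illustration only, not the author's answer: the result is a surrogate objective rather than a true ELBO, and the names `cosine_similarity`, `kl_to_uniform`, `surrogate_objective`, and `kl_weight` are hypothetical.

```python
import math


def cosine_similarity(a, b, eps=1e-12):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)


def kl_to_uniform(q, eps=1e-20):
    """KL(q || uniform) for a categorical distribution q over len(q) classes."""
    k = len(q)
    return sum(p * (math.log(p + eps) + math.log(k)) for p in q)


def surrogate_objective(x, x_decoded, q, kl_weight=1.0):
    """Quantity to maximize: reconstruction similarity minus a KL penalty.

    Assumption: cosine similarity stands in for the log-likelihood term, so
    this is no longer a lower bound on log p(x).
    """
    return cosine_similarity(x, x_decoded) - kl_weight * kl_to_uniform(q)


# Hypothetical inputs: an original vector, its decoding, and the encoder's
# categorical posterior over four classes.
x = [0.2, -1.0, 0.5]
x_decoded = [0.25, -0.9, 0.4]
q = [0.7, 0.1, 0.1, 0.1]
print(surrogate_objective(x, x_decoded, q))
```

Whether this trade-off between similarity and the KL term behaves well in training is an open question here; `kl_weight` is exposed only so the balance can be tuned.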