tag:blogger.com,1999:blog-8429657563266398562017-12-15T03:19:58.867-08:00Eric JangTechnology, A.I., CareersEricnoreply@blogger.comBlogger13125tag:blogger.com,1999:blog-842965756326639856.post-36544027851033562802017-11-20T13:31:00.002-08:002017-11-29T16:16:43.430-08:00Expressivity, Trainability, and Generalization in Machine Learning<i>Update 11/29: I'm looking for translators to help translate this post into different languages, particularly Chinese (中文), Spanish (Español), Korean (한국어), Russian (ру́сский язы́к), and Japanese (日本語). Please email me at <myfirstname><mylastname>2004<at>gmail.com! </i><br /><br />When I read Machine Learning papers, I ask myself whether the contributions of the paper fall under improvements to 1) Expressivity 2) Trainability, and/or 3) Generalization. I learned this categorization from my colleague <a href="http://www.sohldickstein.com/">Jascha Sohl-Dickstein</a> at Google Brain, and the terminology is also introduced in <a href="https://arxiv.org/pdf/1606.05336v4.pdf">this paper</a>. I have found this categorization effective in thinking about how individual research papers (especially on the theoretical side) tie subfields of AI research (e.g. robotics, generative models, NLP) together into one big picture [1].<br /><div><br /></div><div>In this blog post, I discuss how these concepts tie into current (Nov 2017) machine learning research on Supervised Learning, Unsupervised Learning, and Reinforcement Learning. I consider Generalization to be comprised of two categories -- “weak” and “strong” generalization -- and I will discuss them separately. This table summarizes my opinion on where things are:</div><div><br /></div><div><span id="docs-internal-guid-f7b52e42-d766-623c-9d26-e60a4941fb3c"><span style="font-family: inherit;"><br /></span></span><br /><div dir="ltr" style="margin-left: -4.5pt;"><table style="border-collapse: collapse; border: none;"><colgroup><col width="122"></col><col width="109"></col><col width="115"></col><col width="130"></col><col width="154"></col></colgroup><tbody><tr style="height: 0pt;"><td style="border-bottom: solid #000000 1pt; border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-top: solid #000000 1pt; padding: 5pt 5pt 5pt 5pt; vertical-align: top;"><br /></td><td style="border-bottom: solid #000000 1pt; border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-top: solid #000000 1pt; padding: 5pt 5pt 5pt 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">Expressivity</span></div><br /><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-style: italic; vertical-align: baseline; white-space: pre-wrap;">What computations can this model perform?</span></div></td><td style="border-bottom: solid #000000 1pt; border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-top: solid #000000 1pt; padding: 5pt 5pt 5pt 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">Trainability</span></div><br /><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-style: italic; vertical-align: baseline; white-space: pre-wrap;">How easy is it to fit the model to data?</span></div></td><td style="border-bottom: solid #000000 1pt; border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-top: solid #000000 1pt; padding: 5pt 5pt 5pt 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">Weak Generalization</span></div><br /><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-style: italic; vertical-align: baseline; white-space: pre-wrap;">How well does the model perform on small perturbations of its data distribution?</span></div></td><td style="border-bottom: solid #000000 1pt; border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-top: solid #000000 1pt; padding: 5pt 5pt 5pt 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">Strong Generalization</span></div><br /><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-style: italic; vertical-align: baseline; white-space: pre-wrap;">How well does the model perform on a large shift in data distribution?</span></div></td></tr><tr style="height: 0pt;"><td style="border-bottom: solid #000000 1pt; border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-top: solid #000000 1pt; padding: 5pt 5pt 5pt 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">Supervised Learning</span></div><br /><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-style: italic; vertical-align: baseline; white-space: pre-wrap;">Learn using human-provided labels. Predict those labels.</span></div></td><td style="border-bottom: solid #000000 1pt; border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-top: solid #000000 1pt; padding: 5pt 5pt 5pt 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">Easy</span></div><br /><br /><br /></td><td style="border-bottom: solid #000000 1pt; border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-top: solid #000000 1pt; padding: 5pt 5pt 5pt 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">OK</span></div><br /></td><td style="border-bottom: solid #000000 1pt; border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-top: solid #000000 1pt; padding: 5pt 5pt 5pt 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">Getting There</span></div><br /></td><td style="border-bottom: solid #000000 1pt; border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-top: solid #000000 1pt; padding: 5pt 5pt 5pt 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">Hard</span></div></td></tr><tr style="height: 0pt;"><td style="border-bottom: solid #000000 1pt; border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-top: solid #000000 1pt; padding: 5pt 5pt 5pt 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">Unsupervised Learning</span></div><br /><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-style: italic; vertical-align: baseline; white-space: pre-wrap;">Learn without human-provided labels. Predict those labels.</span></div></td><td style="border-bottom: solid #000000 1pt; border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-top: solid #000000 1pt; padding: 5pt 5pt 5pt 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">Easy</span></div><br /></td><td style="border-bottom: solid #000000 1pt; border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-top: solid #000000 1pt; padding: 5pt 5pt 5pt 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">OK</span></div><br /><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;"> </span></div></td><td style="border-bottom: solid #000000 1pt; border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-top: solid #000000 1pt; padding: 5pt 5pt 5pt 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;"> Getting There</span></div><br /></td><td style="border-bottom: solid #000000 1pt; border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-top: solid #000000 1pt; padding: 5pt 5pt 5pt 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">Hard</span></div><br /></td></tr><tr style="height: 0pt;"><td style="border-bottom: solid #000000 1pt; border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-top: solid #000000 1pt; padding: 5pt 5pt 5pt 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">Reinforcement Learning</span></div><br /><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-style: italic; vertical-align: baseline; white-space: pre-wrap;">Learn to maximize reward in an environment.</span></div></td><td style="border-bottom: solid #000000 1pt; border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-top: solid #000000 1pt; padding: 5pt 5pt 5pt 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">Easy</span></div><br /></td><td style="border-bottom: solid #000000 1pt; border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-top: solid #000000 1pt; padding: 5pt 5pt 5pt 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.2; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">Hard </span></div><br /></td><td style="border-bottom: solid #000000 1pt; border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-top: solid #000000 1pt; padding: 5pt 5pt 5pt 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">Mega-hard</span></div><br /></td><td style="border-bottom: solid #000000 1pt; border-left: solid #000000 1pt; border-right: solid #000000 1pt; border-top: solid #000000 1pt; padding: 5pt 5pt 5pt 5pt; vertical-align: top;"><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;">Mega-hard</span></div><br /><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;"> </span></div></td></tr></tbody></table></div><span id="docs-internal-guid-f7b52e42-d766-623c-9d26-e60a4941fb3c"><span style="font-family: inherit;"></span></span><br /><div><span id="docs-internal-guid-f7b52e42-d766-623c-9d26-e60a4941fb3c"><span style="font-family: inherit;"><span style="font-family: inherit;"><br /></span></span></span></div><span id="docs-internal-guid-f7b52e42-d766-623c-9d26-e60a4941fb3c"><span style="font-family: inherit;"></span>I’m deeply grateful to Jascha Sohl-Dickstein and Ben Poole for providing feedback and edits to this post, as well as Marcin Moczulski for helpful discussions about trainability of RNNs. </span></div><div><br /></div><div>This post covers a broad spectrum of research at a fairly opinionated level. I emphasize that any factual errors here are my own and do not reflect opinions of my colleagues and proofreaders. Feel free to contribute feedback in the comments or send me an email if you’d like to discuss / suggest edits - I’m here to learn. </div><div><span style="font-family: "arial"; font-size: 20pt; white-space: pre-wrap;"><br /></span></div><div><h2>Expressivity</h2><i>What computations can this model perform?</i><br /><br /><a href="https://arxiv.org/abs/1606.05336">Measures of expressivity characterize</a> the complexity of functions that can possibly be computed by a parametric function such as a neural net. Deep neural networks are exponentially <a href="https://arxiv.org/pdf/1606.05336.pdf">expressive with respect to their depth</a>, which means that moderately-sized neural networks are sufficiently expressive for most supervised, unsupervised, and RL problems being researched today [2]. One piece of evidence for this is that deep neural nets are capable of <a href="https://arxiv.org/abs/1611.03530">memorizing very large datasets</a>. <br /><br />Neural nets can express all sorts of things: continuous, complex, discrete, and even stochastic variables. Thanks to research in generative modeling and <a href="http://bayesiandeeplearning.org/">Bayesian Deep Learning</a> in the last few years, deep neural networks have been leveraged to construct probabilistic neural networks have produced incredible generative modeling results.<br /><br />Recent breakthroughs in generative modeling illustrate the profound expressivity of neural networks: neural nets can output extremely complex data manifolds (audio, images) that are <a href="https://arxiv.org/abs/1703.10717">nearly indistinguishable from real data</a>. Here is the output on a recent <a href="https://arxiv.org/abs/1704.00028">GAN architecture</a> proposed by researchers at NVIDIA: <br /><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-7fLNI6VGuH0/WhJH2uMty2I/AAAAAAAAG6o/hX27kG8P2I4bvkg2d1sFfG0mnSOE7WIMACLcBGAs/s1600/img000000.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="1024" data-original-width="1024" height="200" src="https://4.bp.blogspot.com/-7fLNI6VGuH0/WhJH2uMty2I/AAAAAAAAG6o/hX27kG8P2I4bvkg2d1sFfG0mnSOE7WIMACLcBGAs/s200/img000000.png" width="200" /></a></div> <a href="https://2.bp.blogspot.com/-nAvfhkSdpI4/WhJH2sYhvbI/AAAAAAAAG6k/VRpiFYR4bHsi9vOP9zvn1tI-EZIm9HflACLcBGAs/s1600/img000004.png" imageanchor="1" style="clear: left; display: inline !important; margin-bottom: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="1024" data-original-width="1024" height="200" src="https://2.bp.blogspot.com/-nAvfhkSdpI4/WhJH2sYhvbI/AAAAAAAAG6k/VRpiFYR4bHsi9vOP9zvn1tI-EZIm9HflACLcBGAs/s200/img000004.png" width="200" /></a><a href="https://2.bp.blogspot.com/-_MNlSNKiyUA/WhJH2hy4U_I/AAAAAAAAG6g/-iUQxAMossIKjPCQxoWCIDGU_HF8gllIgCLcBGAs/s1600/img000014.png" imageanchor="1" style="clear: left; display: inline !important; margin-bottom: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="1024" data-original-width="1024" height="200" src="https://2.bp.blogspot.com/-_MNlSNKiyUA/WhJH2hy4U_I/AAAAAAAAG6g/-iUQxAMossIKjPCQxoWCIDGU_HF8gllIgCLcBGAs/s200/img000014.png" width="200" /></a><br /><br />It’s not perfect yet (notice the warped background), but we are getting very close. Similarly, in audio synthesis, audio generated samples from the latest WaveNet models <a href="https://deepmind.com/blog/wavenet-launches-google-assistant/">sound like real people</a>.<br /><div><br /></div>Unsupervised learning is not limited to generative modeling. Some researchers like Yann Lecun are rebranding unsupervised learning as “predictive learning”, where the model infers the past, imputes the present, or predicts the future. However, since much of unsupervised learning has focused on predicting extremely complex joint distributions (images from past/future, audio), I think generative modeling is a reasonably good benchmark to measure expressivity in the unsupervised domain.<br /><br />Neural nets seem to be expressive enough for RL as well. Fairly small networks (2 conv layers and 2 fully-connected layers) are powerful enough to solve <a href="https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf">Atari and Mujoco control tasks</a> (though their trainability leaves much to be desired - see the next section).<br /><br />On its own, expressivity is not a very interesting question - we can always increase it by adding more layers, more connectivity, and so on. The challenge is making neural nets expressive enough for both the training and test set, while keeping the training difficulty manageable. For example, 2D convolutions seem to be required to make image classification models generalize, even though a deep fully-connected network has enough capacity to memorize a training set.<br /><br />Expressivity is the easiest problem to deal with (add more layers!), but also simultaneously the most mysterious: we don’t have good way of measuring how much expressivity (and of what kind) is demanded by a given task. What kinds of problems will demand neural networks that are orders of magnitude larger than the ones we use today? Why will they require so much compute? Are our current neural networks expressive enough to implement intelligence resembling that of a human? Does solving generalization require vastly more expressive models?<br />The brain has many orders of magnitude more “neuronal units” (1e11) than the number units in our large neural networks (Inception-ResNet-V2 has roughly 25e6 ReLU units). This comparison is already drastic, even when we give Relus the benefit of the doubt that they are even remotely comparable to biological neurons. A single biological neuron -- with its various neurotransmitters, dendritic forest integrating input from 10000 other neurons in a time-varying integration -- are also incredibly expressive. A locust implements its collision detection system in one neuron and it already flies better than any drone system we have ever built. Where is all this expressivity coming from and going to? How much more expressivity do we need?<br /><br /><h2><br />Trainability</h2><i>Given a sufficiently expressive space of models, can we find a good model?</i><br /><div><br /></div><div>A Machine Learning model is any computer program that has some of its functionality learned from data. During “learning”, we search for a reasonably good model that utilizes knowledge from the data to make decisions, out of a (potentially huge) space of models. This search process is usually formulated as solving an optimization problem over the space of models. <br /><br /><br /><a href="https://3.bp.blogspot.com/-X2cwIFvBBz4/WhJOnb6IeMI/AAAAAAAAG7A/KajbOxjk698ObBUD83DET0HmRwyM6MqzwCLcBGAs/s1600/model%2Bspace%2B%25281%2529.png"><img border="0" src="https://3.bp.blogspot.com/-X2cwIFvBBz4/WhJOnb6IeMI/AAAAAAAAG7A/KajbOxjk698ObBUD83DET0HmRwyM6MqzwCLcBGAs/s400/model%2Bspace%2B%25281%2529.png" /></a><br /><h3>Several Varieties of Optimization</h3>A common approach, especially in deep learning, is to define some scalar metric that evaluates the “goodness” of a model. Then use a numerical optimization technique to maximize said “goodness” (or equivalently, “minimize badness”). <br /><br />A concrete example: minimizing the average cross-entropy error is a standard way to train neural networks to classify images. This is done in the hopes that when the model x dataset have been minimized on cross entropy loss, it will also do what we actually want, e.g. classify images correctly with a certain degree of precision and recall on test images. Often the evaluation metric cannot be optimized directly (the most obvious reason being that we don’t have access to the test dataset) but a surrogate function like cross-entropy on a training set can. <br /><br />Searching for good models (training) ultimately amounts to optimization -- no two ways about it! But … objectives are sometimes hard to specify. A classic scenario in supervised learning is in image down-sampling, where it’s difficult to define a single scalar quantity that aligns exactly with how humans perceive “perceptual loss” from a particular downsampling algorithm. In a similar manner, super-resolution and image synthesis are also difficult, because we have a hard time writing down the “goodness” as a maximization objective. Imagine writing down a function that tells you how “photoreal” an image is! The debate on how to measure quality of generative models (for images, audio) <a href="https://arxiv.org/abs/1511.01844">rages on</a> to <a href="https://arxiv.org/abs/1705.05263">this day</a>.</div><div><br />The most popular technique in recent years that addresses this problem is a co-adaptation approach, which formulate the optimization problem as solving for the equilibrium point between two non-stationary distributions that are evolving in tandem [3]. An intuitive analogy to explain why this is “natural” would be to compare this to ecological evolution of a predator species and a prey species. At first, predators get smarter so they can catch prey effectively. Then, the prey get smarter in order to evade predators. The species co-evolve against each other, and the end result is that both species become more intelligent. <br /><br /><a href="https://arxiv.org/abs/1406.2661">Generative Adversarial Networks</a> operate according to a similar principle, in order to avoid having to define explicit perceptual loss objectives. Similarly, competitive self-play in reinforcement learning employs this principle to learn <a href="https://deepmind.com/blog/alphago-zero-learning-scratch/">rich</a> <a href="https://blog.openai.com/competitive-self-play/">behaviors</a>. Although the optimization objective is now implicitly specified, it’s still an optimization problem, and machine learning practitioners can re-use familiar tools like deep neural networks and SGD.<br /><br />Evolution strategies typically consider optimization-as-a-simulation. The user specifies some dynamical system over a population of models, and at each timestep of the simulation, the population is updated according to the rules of the dynamical system. The models may or may not interact with each other. The simulation runs forward in time, and hopefully the dynamics of the system induce the population to eventually converge around “good models”.<br /><br />For a great tutorial on ES in the context of RL, see <a href="http://blog.otoro.net/2017/10/29/visual-evolution-strategies/">A Visual Guide to Evolution Strategies by David Ha</a> (the “References and Further Reading” section is really great!).<br /><br /><h3>Current State of Affairs</h3>Trainability is basically solved for feedforward networks and “direct” objectives encountered in supervised learning (I make this claim empirically, not theoretically). Some popular breakthroughs published in 2015 (<a href="https://arxiv.org/abs/1502.03167">Batch Norm</a>, <a href="https://arxiv.org/abs/1512.03385">ResNets</a>, <a href="https://arxiv.org/abs/1511.06422">Good Init</a>) are widely used today to make training of feedforward networks super trainable. In fact, deep nets with hundreds of layers can now minimize <a href="https://arxiv.org/abs/1611.03530">training error of large classification datasets to zero</a>. See <a href="https://arxiv.org/abs/1703.09039">this paper</a> for a comprehensive survey on the hardware and algorithmic infrastructure underlying modern DNNs. <br /><br />RNNs are still pretty tricky, but the research community is making a lot of progress, to the point where it is no longer crazy to <a href="https://arxiv.org/abs/1707.02920">drop an LSTM into a complex robotic policy</a> and expect it to “just work”. That’s pretty incredible, considering that as late as 2014 not many people had confidence in trainability RNNs and in the previous decades there was a <a href="http://www.worldscientific.com/doi/abs/10.1142/S0218488598000094">lot</a> of <a href="http://ieeexplore.ieee.org/abstract/document/279181/">work</a> showing that RNNs are horrible to train. There’s evidence that a whole bunch of RNN architectures are <a href="https://arxiv.org/abs/1611.09913">equivalently expressive</a>, and any differences in performance are due to some architectures being <a href="https://arxiv.org/abs/1707.05589">easier to optimize than others</a> [4].<br /><br />In unsupervised learning, the model outputs are often (but not always!) bigger -- for instance, 1024 x 1024 pixels, gigantic sequences for speech and text. This unfortunately makes them harder to train. <br /><br />One of the big breakthroughs in 2017 was that GANs are now dramatically easier to train. The most popular improvements have been simple modifications to the original Jensen-Shannon divergence objective: <a href="https://arxiv.org/abs/1611.04076">least squares</a>, <a href="https://arxiv.org/abs/1609.03126">absolute deviation with margin</a>, and Wasserstein distance (<a href="https://arxiv.org/abs/1701.07875">1</a>, <a href="https://arxiv.org/abs/1704.00028">2</a>). The <a href="https://arxiv.org/abs/1710.10196">recent work by NVIDIA</a> extends Wasserstein-based GANs to be make the model less sensitive to a wide variety of hyperparameters like BatchNorm parameters, architecture choice, and so on. Stability is incredibly important for practical and industrial applications - it’s what gives us the confidence that it will be compatible with our future research ideas or applications. Taken together, these results are exciting because they indicate that our generator neural networks are plenty expressive to generate the right images, and the bottleneck that was holding them back was one of trainability. Trainability might still be a bottleneck - unfortunately with neural nets it’s hard to tell if a model is simply not expressive enough and/or we haven’t trained it enough.<br /><br />Latent discrete variable inference within neural nets was also previously difficult to train due to monte-carlo gradient estimators having high variance. However it’s starting to make a comeback in recent years in all sorts of architectures, from GANS to Language Modeling to Memory-Augmented Neural Networks to Reinforcement Learning. Discrete representations are very useful from an expressivity standpoint and it’s great that we can now train them <a href="https://arxiv.org/pdf/1703.07370.pdf">fairly reliably</a>.<br /><br />Tragically, Deep Reinforcement Learning is still quite behind when it comes to pure trainability, without even considering the generalization aspect. Reinforcement learning is difficult because for environments with more than 1 timestep, we are searching for a model that then performs optimization (of the reward) at inference time. There is an inner optimization process where the model induces optimal control, and an outer optimization loop that learns this optimal model, using only a database of what the agent saw. <br /><br />Recently I added one extra dimension to a continuous robotic control task, and the performance of my RL algorithm dropped from >80% to 10%. Not only is RL training hard, but it’s also unreliable! Because optimization landscapes are so stochastic, we can’t even get the same results with <a href="https://arxiv.org/abs/1709.06560">different random seeds</a>, so the strategy is to report a distribution of reward curves over multiple trials, with different random seeds. Different implementations of the same algorithm perform differently across environments, so RL scores reported in literature are to be taken with a handful of salt.<br /><br />This is a travesty! Trainability in RL is still very much unsolved, because we still cannot scale up our problems by a little bit and expect the same learning to do the same thing 10 times out of 10. <br /><br />If we treat RL as a pure optimization problem (and worry about generalization and complex tasks later), the problem is still intractable. Let’s say there’s an environment where a sparse reward is given only at the end of the episode (e.g. babysitting a child where the babysitter is paid after the parents return home). The number of actions (and corresponding outcomes of the environment) grows exponentially with the duration of the episode, but only a few of these action sequences corresponds to a success. </div><div><br /></div><div>Therefore, estimating the policy gradient at any point in the optimization landscape of models requires exponentially many samples in the action space before a useful learning signal is obtained. This is as bad as trying to compute a Monte Carlo expectation of a probability distribution (over all action sequences) where the mass is concentrated at a “delta distribution” (see diagram below). When there is vanishing overlap between the proposal distribution (exploration) and reward distribution, finite-sample Monte Carlo estimates of policy gradients simply will not work, no matter how many samples you collect.</div><div><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-k2BMeyh0dXM/WhJPWVcj6ZI/AAAAAAAAG7I/kCd6igkh7W4gAfAmjAoupibM3NJ6ZNeTQCLcBGAs/s1600/proposal%2Bdistribution%2B%25281%2529.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="457" data-original-width="850" height="344" src="https://1.bp.blogspot.com/-k2BMeyh0dXM/WhJPWVcj6ZI/AAAAAAAAG7I/kCd6igkh7W4gAfAmjAoupibM3NJ6ZNeTQCLcBGAs/s640/proposal%2Bdistribution%2B%25281%2529.png" width="640" /></a></div><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><br /></div>Furthermore, if the data distribution is non-stationary (as in the case of off-policy learning algorithms with a replay buffer), collecting “bad data” can introduce unstable feedback loops into the outer optimization process. <br /><br />Here’s the same idea from an optimization perspective rather than a Monte Carlo estimation perspective: without any priors over the state space (such as an understanding of the world or an explicit instruction provided to the agent), the optimization landscape looks like “swiss cheese” -- small “holes” of convex optima surrounded by vast plateaus of parameter space where policy gradient information is useless. This means that the entire model space basically has zero information (because the learning signal is effectively uniform across the space of models, since the non-zero area is vanishingly small).<br /><br /><br /><img height="201" src="https://lh5.googleusercontent.com/TAHeGX7yeLhl_grgpYkGjkmzkxITKTJ_dBpku1jVg3vvD4JLAkznidkEPmM0lkefNVp0vNucyrIdYr5hjnrte5_9u3r2ORIw2GvjO-0xDLxQIqtfmF4jLrjdRC2XU5iFm-ivG50P" width="640" /><br /><br /><br />Without developing good representations that we can do learning on top of, we might as well just cycle through random seeds and sample random policies until we happen upon a good model that lands in one of these swiss cheese holes. The fact that this is a surprisingly strong baseline in RL suggests that our optimization landscapes may very well look like this.<br /><br />I believe that RL benchmarks like Atari and Mujoco, while interesting from a pure optimization standpoint, do not really push the limits of machine learning, so much as just solve for a single monolithic policy that optimizes performance for a fairly sterile environment. There is little selective pressure encouraging the model to “generalize”, which makes the problem a pure optimization problem, and not really a hard ML problem.<br /><br />It’s not that I like to complicate the trainability problems of RL with issues of generalization (it certainly does not make things easy to debug!), but I think learning understanding of the environment and understanding the task is the only way that will make reinforcement learning tractable for real-world robotic problems. <br /><br />Contrast this with supervised learning and unsupervised learning, we can obtain learning signals cheaply, no matter where we are in the model search space. The proposal distribution for minibatch gradients has nonzero overlap with the distribution of gradients. If we are using SGD with minibatch size=1, then the probability of sampling the transition with a useful learning signal is at worst 1/N where N is the size of the dataset (so learning is guaranteed after each epoch). We can brute-force our way to a good solution by simply throwing a lot of compute and data at the problem. Furthermore, improving perceptual generalization of lower layers may actually have a variance-reducing effect via “bootstrapping” on top of lower-level features.<br /><br />In order to solve RL problems with high dimensionality and complexity, we must consider generalization and general perceptual understanding before tackling the numerical optimization problem. To attempt otherwise is a foolish waste of compute and data (though it’s still useful to answer exactly how much we can brute force). We need to get to a point where every data point provides a non-zero number of bits to the RL algorithm, and have an easy way to importance-sample gradients when the task is very complex (without gathering exponentially more data). Only then, is it reasonable to assume we can succeed at brute-forcing the problem. <br /><br />Learning from demonstration, imitation learning, inverse reinforcement learning, and interfacing with natural language instructions perhaps provide ways to quickly bring the starting policy to a point that obtains some learning signal, or shape the search space in a way such that each episode gives a non-zero amount of information to the policy (for example, the environment provides a reward of 0 but observations yield some information that is helpful to the inductive biases of the model’s planning module).<br /><br />To summarize: trainability is: easy for supervised learning, is harder-but-getting-there for unsupervised learning, and is still terribly broken for reinforcement learning.</div><div><h2></h2><h2>Generalization</h2><br />Generalization is the most profound of the 3 problems, and is the heart of Machine Learning itself. Loosely speaking, it is how well the model performs on test dataset when it is trained on a training dataset. <br /><br />There are two distinctive scenarios when talking about generalization: 1) the training and test data are sampled from the same distribution (and we need to learn this distribution from the training data alone) or 2) training and test data are drawn from different distributions (and we need to generalize from the train to the test distribution). Let’s call (1) “weak generalization” and (2) “strong generalization. This way of classifying generalization could be also referred to as “interpolation vs. extrapolation”, or “robustness vs. understanding”.</div><div><br /><h3>Weak Generalization: Two Landscapes</h3></div><div><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: "arial"; font-size: 11pt; font-style: italic; vertical-align: baseline; white-space: pre-wrap;">How well does the model perform on small perturbations of its data distribution?</span></div>In “weak generalization”, we usually assume the training and test data samples are drawn from the same distribution. However, in real world settings, there is almost always some difference between the train and test distributions, even in the large-sample limit. <br /><br />These differences can come from sensor noise, changes in ambient light conditions, gradual wear and tear of objects, variations in lighting conditions (maybe it was a cloudy day when the photographer gathered the test set images). Another situation is that differences can be <a href="https://blog.openai.com/adversarial-example-research/">generated by an adversary</a>. Adversarial perturbations are barely perceptible to our human vision, so we might consider adversarial examples to be drawn from the “same distribution”.<br /><br />Therefore, it’s helpful in practice to think about “weak generalization” as evaluating on a “perturbed” version of the training distribution:<br /><br /><img src="https://lh3.googleusercontent.com/Af81JXmvnuJ_NVBvfwFtQenROMj1S753UnccXWJKpbTQfS6YG1mnnsQxUCU8eF772OHKhpfTpGv8PnatPx2PlXYCBlK2RvBvxe20f_f086hRXfWdY-urXbh6tnQrTu8ygoTrJFvw" /><br /><br />A perturbed testing data distribution might also induce a perturbed optimization landscape (lowest point is best).<br /><img src="https://lh4.googleusercontent.com/vWK-3FsiCA1dWYn7Ye2S0K1asfmx4wKuVnBBXwrDosQjvMnCuVQ02hb2H6V7PZ3u2VEzuZtYNO3BBy356YZQJrykVzsWP2WgR28YFYqLvwYt4_h3go6vruRMnzdtUzTXMjLHrRYd" /><br /><br /><br />The fact that we don’t know the testing distribution ahead of time presents some difficulties for optimization. If we are too aggressive in optimizing the training landscape (the sharp global minima on the left of the blue curve), we end up with a model that is sub-optimal with respect test data (sharp local minima on the red curve). Here, we’ve overfit to the training distribution or training data samples, and failed to generalize to the perturbed test distribution.<br /><br />“Regularization” is any technique we employ to prevent overfitting. We don’t have any prior information on what the test perturbation is, so often the best we can usually do is try to train on random perturbations versions of the training distribution, in hopes that these perturbations cover the test distribution. Stochastic gradient descent, dropout, weight noise, activation noise, data augmentation are commonly-used regularizers in Deep Learning. In reinforcement learning, randomizing simulation parameters makes the training more robust. In his talk at ICLR 2017, Chiyuan Zhang made the comment that he sees regularization as “anything that makes training harder” (rather than the conventional view of “limiting model capacity”). Basically, making things *harder* to optimize somehow improves generalization. <br /><br />This is really disturbing -- our methods of “generalizing” are quite crude, amounting to “optimizer lobotomy”. We basically throw a wrench into the optimizer and hope it messes training just enough to prevent overfitting. Furthermore, improving trainability is can make you pay a price in generalization! This way of viewing (weak) generalization certainly complicates how we proceed with trainability research.<br /><br />But if better optimizers overfit, then how do we explain why some optimizers seem to decrease both training and test error? The reality is that any combination of optimization landscape and optimizer strikes a balance between 1) finding a better region of models and 2) overfitting to a specific solution, and we don’t have good methods for controlling this balance.<br /><br />The most challenging test of weak generalization is probably that of adversarial attacks, where the perturbations are coming from an “adversary” that provides the worst-possible perturbation to your data point, so that your model performs badly with high probability. We still don’t have any deep learning approaches that are very <a href="https://arxiv.org/abs/1707.07397">robust to adversarial examples</a>, but my gut feeling is that it is eventually solvable [5].<br /><br />There’s some theoretical work on applying information theory to show that neural networks apparently go through a phase transition during the training process in which the model <a href="https://arxiv.org/abs/1703.00810">switches from “memorizing” data to “compressing” data</a>. This theory is starting to pick up steam, though there is ongoing academic debate as to <a href="https://openreview.net/forum?id=ry_WPG-A-&noteId=ry_WPG-A-">whether this theory is valid</a>. Keep your eyes peeled for this one - the intuition of “memorization” and “compression” is compelling.<br /><br /><h3>Strong Generalization: Natural Manifold</h3>In tests of “strong generalization”, the model is evaluated on a completely different data distribution than the test, but with data coming from the same underlying manifold (or generative process). <br /><br /><br /><img src="https://lh3.googleusercontent.com/UMoZDFKJa9B1kHSfduYAucdcfEv2PbA-8GfS0oj9wwWdl-Zv6-7Tw_Lq9ZzSol7s4xrKiJ4FzV5vdGGlnNOXFcGiCL0L5ul1JhX0pRNynXjOniY5RScg-UJCDBvfl9Vu-zWbVOzE" /><br /><br />How could one could ever learn a good model for the test distribution if it is “completely different” from the training distribution? The resolution for this is that these “glimpses” of data actually come from the same manifold of “natural data”. There can still be a lot of information overlap between train and test distributions, as long as they come from the same underlying generative process. <br /><br />The space of observable data in the world can be described as a very high-dimensional, continuously varying “natural manifold”. The tuple (video clip of a pair of cymbals crashing together, the sound of crashing cymbals) is a point on this manifold. The tuple (video clip of a pair of cymbals crashing together, the sound of frogs croaking)is not - those joint data points are simply are inconsistent with our reality. As you can see, even though the manifold is positively gigantic, it’s also highly structured. For example, physical laws like gravity are obeyed in all the data that we observe, objects don’t blip in and out of existence, and so on.<br /><br />Strong Generalization can be thought of as how much of this “super-manifold” is captured by a particular model that is only trained on a teensy sampling of data points from the manifold. Note that an image classifier does not need to discover Maxwell’s Equations -- it merely needs to have an understanding of reality that is consistent with data on the manifold.<br /><br />Modern classification models trained on ImageNet are arguably OK at strong generalization - principles like edges, contours, and objects are indeed understood by models trained on ImageNet, which is why it’s so popular to transfer these pre-trained weights to other datasets for few-shot & metric learning. It could be a lot better though: classifiers trained on ImageNet don’t work universally, few-shot learning still is unsolved, and they are still susceptible to adversarial examples. Obviously our models don’t understand what they are looking at, to the same degree that humans do. But <a href="https://arxiv.org/abs/1706.01427">it’s a start</a>.<br /><br />Similar to weak generalization, the test distribution can be sampled adversarially in a way that induces maximum discrepancy between the training and testing distribution. AlphaGo Zero is my favorite example of this: at test time, it observes data from human players that is completely different than its training distribution (it has never “seen” a human before). Furthermore, the human is using all of their intelligence to actively lead AlphaGo to regimes that are unobserved in the training data. Even though AlphaGo doesn’t explicitly understand anything about abstract mathematics, opponent psychology, or what the color green means, it clearly understood enough about the world to outwit a human player within a narrow domain. If an AI system is robust against a skilled human adversary, I consider it to have sufficient strong generalization capability. <br /><br />Sadly, Reinforcement Learning research has largely ignored the problem of strong generalization. Most benchmarks are static environments, with very little perceptual richness (e.g. the humanoid doesn’t understand its world or what its body even looks like, beyond some floating joint positions that lead to some reward). <br /><br />I really believe that solving generalization is key to solving RL trainability. The more our learning system “understands” about the world, the better able it is to obtain learning signals, perhaps with fewer samples. This is why few-shot learning, imitation learning, learning-to-learn is important: it moves us away from brute force solutions where variance is high and information is low.<br /><br /><b>I believe two things are required to achieve stronger forms of generalization: </b><br /><br />First, we need models that actively deduce the fundamental laws of the world from observation and experimentation. Symbolic reasoning and <a href="https://en.wikipedia.org/wiki/Causal_inference">causal inference</a> seem like ripe topics to study, but any kind of unsupervised learning is likely to help. I am reminded of humankind’s long quest to understand the motion of celestial bodies by deriving physical laws of the universe using a system of logical reasoning (mathematics). It’s interesting to note that before the <a href="https://en.wikipedia.org/wiki/Copernican_Revolution">Copernican Revolution</a> humans probably relied on Bayesian heuristics at first (“superstitions”), and these “Bayesian” models were discarded once we discovered classical mechanics.<br /><br />Our model-based machine learning methods (models that attempt to “predict” aspects of their environment) right now are “pre-Copernican” in the sense that they only interpolate based on very shallow statistical superstitions, rather than coming up with deep, general principles to explain and extrapolate data that can be millions of light years away or many timesteps into the future. Note that humans didn’t really need a firm grasp on probability theory to derive deterministic celestial mechanics, which begs the question of whether there are ways to do machine learning and causal inference without an explicit statistical framework.</div><div><br /></div><div>One way to drastically reduce the complexity burden is to make our learning systems more adaptive. We need to go beyond optimizing models<i> that predict or</i><i> act</i> in static ways<i>. </i>Instead, we need to be optimizing models that can <i>think</i>, <i>remember</i>, and <i>learn</i>, all in real-time.</div><div><br />Second, we need to throw enough diversity of data at the problem, so that the model can be pushed to develop these abstract representations. An environment needs to be rich enough to force the right representations to develop (though AlphaGo Zero does raise questions on how much of the natural manifold an agent really needs to experience). Without these constraints, the problem is fundamentally underspecified and the chance that we accidentally discover the right solution is nil. Perhaps humans would never have gained intelligence if they could not stand up on their hind legs and wonder about why stars moving in such <a href="https://en.wikipedia.org/wiki/Dynamics_of_the_celestial_spheres">strange, elliptical patterns</a>. <br /><br />I wonder if the Trisolarian civilization (from the book "The Three Body Problem") evolved to such a high level of technological advancement because their livelihoods depended on their physical understanding of complex celestial mechanics. Maybe we need to introduce some celestial motion into our Mujoco & Bullet environments :)</div><div><br /></div><br /><h2>Footnotes</h2><br />[1] There are some research areas that don’t fit as neatly into the framework of expressivity, trainability, and generalization. For example, interpretability research seeks to understand why a model provides a particular answer. Not only is this expected by ML customers and policymakers in some high-stakes fields (medicine, law enforcement), but these can shed light on generalization: if we discover that the model is providing diagnoses in a way that differs a lot from how a human medical professional would arrive at those conclusions, it could mean that there are edge cases where the model’s deductive process will fail to generalize. Finding out whether your model learned the right thing is even stronger having low test error! Differential privacy is another constraint that we sometimes demand from our ML models. These topics are out of the scope of this blog post.<br /><br />[2] A hand-wavy explanation for why this is the case: A fully-connected layer of size N followed by a ReLU nonlinearity can chop up a vector space into N piecewise-linear pieces. Adding a second ReLU layer further subdivides space into N more pieces, resulting in N^2 piecewise linear regions in input space, 3 layers is N^3, and so on. For detailed analysis, see Raghu et al. 2017.<br /><br />[3] This is sometimes called a <a href="https://en.wikipedia.org/wiki/Bilevel_optimization">Multi-level optimization problem</a>. However, this implies an “outer” and “inner” optimization loop, whereas co-adaptation may occur simultaneously. e.g. concurrent processes on a single machine communicating asynchronously, or species evolving continuously w.r.t each other in an ecosystem. In these cases, there is no clear “outer” and “inner” optimization loop.<br /><br />[4] <a href="https://arxiv.org/abs/1409.0473">seq2seq with attention</a> achieved SOTA when they were first introduced, but I suspect that the advantage it confers is one of trainability, not expressivity or generalization. Maybe seq2seq without attention can do just as well if it is properly initialized.<br /><br />[5] One idea to protect against adversarial methods, without solving strong generalization: make it extremely expensive to compute adversarial perturbations. Models and data are partially black-box. At each call to the model during inference time, randomly pick a model from an ensemble of trained models, and serve that model to the adversary without telling them which model they got. The models are trained independently of each other and can even adopt different architectures. This makes it difficult to compute finite-difference gradients, as f(x+dx) - f(x) can have arbitrarily high variance. Furthermore, gradients between successive gradient computations will still have high variance because different pairs of models can be sampled. Another way to improve things would be to use multi-modal data (video, multi-view, images + sound), so that it is difficult to perturb inputs in a way that preserves consistency of inputs.</div><div><span style="font-family: "arial"; font-size: 11pt; vertical-align: baseline; white-space: pre-wrap;"><br /></span></div>Ericnoreply@blogger.com6tag:blogger.com,1999:blog-842965756326639856.post-19097478221009129482017-01-02T14:24:00.001-08:002017-01-02T18:21:43.043-08:00Summary of NIPS 2016The 30th annual Neural Information Processing Systems (NIPS) conference took place in Barcelona in early December. In this post, I share my thoughts on the conference. Let me know your thoughts in the comments section!<br /><br />Basic outline:<br /><br /><ul><li>State of AI Research and trends in 2016</li><li>Talks and papers that I liked</li><li>Predictions for 2017</li><li>Miscellaneous notes</li></ul><h2></h2><h2></h2><h2>State of Artificial Intelligence Research: 2016 Trends</h2><div><br /></div><div>Academic, industry, and public interest in Artificial Intelligence (A.I.) is taking off like a goddamn <a href="http://www.rocketai.org/">rocket</a>. There was a 50% percent increase in NIPS registrations from last year:</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-TJn1-d2s7R0/WGqtte4sGzI/AAAAAAAAFMs/OVkPQBCVRyoY15AsGtyrQRONrSVrXYvugCLcB/s1600/nips2016_registrations.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="395" src="https://3.bp.blogspot.com/-TJn1-d2s7R0/WGqtte4sGzI/AAAAAAAAFMs/OVkPQBCVRyoY15AsGtyrQRONrSVrXYvugCLcB/s640/nips2016_registrations.png" width="640" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: left;">It's not a coincidence that Nvidia, the literal arms-dealer of deep learning, has had a good year in the stock market.</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-2-0ft23MuKE/WGquLIFAERI/AAAAAAAAFMw/aM1FPsAaVBYr-B6W4CtDNFNdQUpW7LEQwCLcB/s1600/nvda.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="203" src="https://3.bp.blogspot.com/-2-0ft23MuKE/WGquLIFAERI/AAAAAAAAFMw/aM1FPsAaVBYr-B6W4CtDNFNdQUpW7LEQwCLcB/s640/nvda.png" width="640" /></a></div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">2016 saw the continued success of Deep Learning: <a href="https://deepmind.com/research/alphago/">superhuman Go playing</a>, <a href="https://arxiv.org/abs/1610.05256">superhuman speech transcription</a>, <a href="https://research.googleblog.com/2016/11/zero-shot-translation-with-googles.html">superhuman translation</a>, <a href="https://www.youtube.com/watch?v=5aogzAUPilE">superhuman lip reading</a>, and <a href="https://waymo.com/">maybe-superhuman driving ability</a> (no public benchmarks for self-driving cars yet). These accomplishments are especially remarkable, considering that many AI researchers used to believe that solving the relevant problems required solving general intelligence itself.</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">The research topics that are in vogue right now (GANs, RL, generative models, unsupervised learning, robotics, alternative ways to train DNNs) are important problems to tackle before we can build general AI. However, they haven't created significant commercial value in industry yet, in ways that couldn't plausibly be substituted with traditional supervised learning. </div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">Transfer learning, domain adaptation and semi-supervised learning alleviate the data-hungry requirements of deep learning, and are starting to work really well.</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">The volume and quality of published work in 2016 was staggering, and it's hard to come away from the conference feeling like there are any ideas left to work on. Many papers and posters I saw at the conference had multiple groups converging on similar ideas. </div><div class="separator" style="clear: both; text-align: left;"><br /></div><h3 style="clear: both; text-align: left;">GANs</h3><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">I can’t imagine a higher compliment for Ian Goodfellow than Yann Lecun declaring <a href="https://arxiv.org/abs/1406.2661">Generative Adversarial Networks</a> (GANs) to be “the biggest breakthrough in Machine Learning in the last 1-2 decades.” </div><br />For those of you who are unfamiliar with GANs, here's <a href="http://blog.evjang.com/2016/06/generative-adversarial-nets-in.html">a simple writeup explaining them</a>. GANs were initially developed for generating images, but researchers are exploring their applications in reinforcement learning, domain adaptation, security ML, compression, and more. A more appropriate terminology would be “adversarial methods”, because the network we are trying to train need not be "generative" - it could be a deterministic classifier or policy instead.<br /><div><div><br /></div><div>Why are GANs so cool? Before they became popular, many researchers formulated learning as an optimization objective, i.e. “minimize a loss or maximize a reward". GANs break away from that paradigm; instead, adversarial methods solve for a Nash equilibria between at least two “agents”, both of which are allowed to co-adapt to each other.<br /><br />The research community cares about this because some loss functions are often difficult to specify directly. In many over-parameterized problems, the optimization process will “short-circuit” design flaws in the objective and yield a result that might minimize the objective, but isn't quite what we really wanted. </div><div><br /></div><div>For example, if we wanted to minimize some error for image compression/reconstruction, often what we find is that a naive choice of error metric (e.g. euclidean distance to the ground truth label) results in qualitatively bad results. The design flaw is that we don’t have good perceptual similarity metrics for images that are universally applicable for the space of all images. GANs use a second “adversarial” network learn an optimal implicit distance function (in theory).<br /><br />Ian Goodfellow gave a <a href="http://www.iangoodfellow.com/slides/2016-12-04-NIPS.pdf">very nice tutorial on GANs</a>. The main challenge for GANs right now is preventing degenerate equilibria between the discriminator and generator. These issues often manifest themselves as mode collapse (the generator learns a single tight mode over a single sample) or the discriminator "overpowering" the generator. I'm not worried about this - I think with time, intuitive tricks and recipes from experienced researchers will make them easier to use. People used to think that only Geoff Hinton could train DNNs, and that <a href="http://yyue.blogspot.com/2015/01/a-brief-overview-of-deep-learning.html">ended up not being the case</a>.</div></div><div><br /></div><div>Here's a quote from Maciej Cegłowski's an eminently hilarious, bookmark-bar-tier <a href="http://idlewords.com/talks/superintelligence.htm">blog post</a> about AI existential alarmism:<br /><br /><i>With no way to define intelligence (except just pointing to ourselves), we don't even know if it's a quantity that can be maximized. For all we know, human-level intelligence could be a tradeoff. Maybe any entity significantly smarter than a human being would be crippled by existential despair, or spend all its time in Buddha-like contemplation.</i><br /><br />Does intelligent life arise from optimization? Or is it an equilibrium? Or to quote Yann, “a bunch of hacks that just happen to work?” AI research is not ready to answer these questions yet, but I am optimistic that research on GANs will tease out what kinds of problems are better-suited for expressing as an equilibrium-finding vs. minimization problem. </div><h3></h3><h3><br />Deep RL</h3><div><br /></div>People usually regard <a href="https://en.wikipedia.org/wiki/Reinforcement_learning">Reinforcement Learning</a> (RL) as pertaining to playing video games and controlling robots, but RL methods are far more general than that - any ML problem with dependencies across space or time can benefit from RL techniques (i.e. sequential pixels). RL allows us to think about non-stationary data distributions as "environments" that respond to an agent's behavior. RL is applied creatively to a lot of ML problems nowadays, such as <a href="https://arxiv.org/abs/1607.07086">translation</a>.<br /><div><br />Pieter Abbeel and John Schulman gave <a href="http://people.eecs.berkeley.edu/~pabbeel/nips-tutorial-policy-optimization-Schulman-Abbeel.pdf">a very nice tutorial on policy gradient methods</a>, and John gave a talk on the “<a href="http://rll.berkeley.edu/deeprlcourse/docs/nuts-and-bolts.pdf">nuts and bolts” of Deep RL research</a>. The nuts and bolts talk is jam-packed with great advice that would have taken me years to figure out on my own, and I highly recommend it for anyone who is working on RL projects.</div><div><br /></div><div>Curiously, a <a href="https://arxiv.org/abs/1611.03852">recent paper</a> by Chelsea Finn showed that certain classes of inverse RL problems have a mathematically equivalent GAN formulation!<br /><br />Someone asked the Deep RL panelists what some of the most challenging problems were in RL. Rich Sutton and Raia Hadsell mentioned learning of subtasks/subgoals, i.e. hierarchical representations of the task at hand. John Schulman mentioned an interesting comment about how the Bellman discount factor in the discrete MDP setting makes it difficult to plan more than >100 time steps into the future, even when r=0.99. Finally, Chelsea Finn discussed sample complexity: it’s annoying that we have to learn every RL task from scratch, and some kind of inter-task transfer learning is needed. After all, humans can quickly learn to play new games because we've played other games before.<br /><br /><br /><h3>Bayesian Deep Learning</h3><br />I think one of the most promising areas of research right now is the marriage of deep neural nets with graphical models, referred to some as “Bayesian deep learning.” It’s probably the closest thing to a theoretical understanding of deep learning that generalizes to how neural nets are used in practice.</div><div><br /></div><div>Kyle Cranmer gave a <a href="https://figshare.com/articles/NIPS_2016_Keynote_Machine_Learning_Likelihood_Free_Inference_in_Particle_Physics/4291565/1">phenomenal talk</a> on how likelihood-free inference methods were used to find evidence of the Higgs Boson. A lot of it went over my head, but what little I understood from it was that they used a variational model to approximate the simulation of a 11-story particle detector at atomic scale:</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-XoC-iZAmcik/WGrZnHkSoOI/AAAAAAAAFNE/E4gGBsm0dv48Dm1dkXQz5_7HTzau31BjACLcB/s1600/cern.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="482" src="https://3.bp.blogspot.com/-XoC-iZAmcik/WGrZnHkSoOI/AAAAAAAAFNE/E4gGBsm0dv48Dm1dkXQz5_7HTzau31BjACLcB/s640/cern.png" width="640" /></a></div><div><br /></div><div><br /></div><div>There were also a whole bunch of interesting variational inference techniques presented at the conference:</div><div><ul><li><a href="http://approximateinference.org/accepted/Moore2016.pdf">Symmetrized Variational Inference</a> - a way to avoid mode collapse with Gaussian VAEs</li><li><a href="https://arxiv.org/abs/1610.05683">Rejection Sampling Variational inference</a> - reparameterization trick for a rejection sampler, allowing us to train gamma, beta, Dirichlet and many other variational distributions! This is huge. </li><li><a href="https://arxiv.org/abs/1512.02578">Robust Inference with Variational Bayes</a> - improving the robustness of a variational posterior to choice of prior and likelihood.</li><li><a href="http://approximateinference.org/accepted/HoffmanJohnson2016.pdf">ELBO surgery</a> - ELBO is the most common variational lower bound used to train VAEs, and this paper proposes an alternative way to write the ELBO: split up the KL(q(z|x)|p(z)) term into an index-code mutual information term and marginal KL term KL(q(z)|p(z)). Experiments show that the latter term is actually quite large, which suggests that we should focus on improving priors. This work is quite important because it provides a easy-to-implement debugging tool for anyone training VAEs.</li><li><a href="https://arxiv.org/abs/1610.09033">Operator Variational Inference</a>, <a href="https://arxiv.org/abs/1505.05770">Normalizing Flows</a>, and <a href="https://arxiv.org/abs/1605.08803">Real-NVP</a> enable richer families of posterior approximations.</li><li><a href="https://arxiv.org/abs/1611.02731">Variational lossy autoencoder</a>s</li></ul></div><div><br /><div><h3>Open Research</h3><div><br /></div><div>There were a couple exciting software announcements at NIPS this year. First was OpenAI's announcement of <a href="https://universe.openai.com/">Universe</a>, a software interface that bridges all kinds of software to OpenAI Gym. This means we can bring RL to interact with environments such as the <a href="http://alpha.openai.com/miniwob/index.html">full-featured Chrome Browser</a>, or desktop applications like Photoshop. Think of this as "<a href="https://developer.apple.com/library/content/documentation/AppleScript/Conceptual/AppleScriptX/AppleScriptX.html">AppleScript</a> on steroids".</div><div><br /></div><div>Deepmind released <a href="https://deepmind.com/blog/open-sourcing-deepmind-lab/">Deepmind Lab</a> for FPS + 3D agent research in their Labyrinth environment, and earlier this year, Microsoft released <a href="http://blogs.microsoft.com/next/2016/03/13/project-malmo-using-minecraft-build-intelligent-technology/">Project Malmo</a> for training AI agents in Minecraft. These are really interesting environments, and typically outside the capabilities of a grad student to implement on their own.</div><div><br /></div><div>These are just a few of the high-profile software releases; many other tech companies and academic institutions are releasing benchmark software, deep learning frameworks, and data all the time. </div><div><br /></div><div>People often cite 1) compute, 2) data, and 3) algorithms as the reason why DL is taking off, and I think we should add 4) the unprecedented <i>accessibility</i> of ML research. Tons of people are already using OpenAI gym for RL projects, and many research codes are being made public for other people to build on top of. The quality and scope of research benefits dramatically from students not having to write thousands of lines of support code, and instead focus on their research problem.</div><div><br /></div><div>It’s great that large tech companies are making such great software available to the broader community, and this kind of work that is more impactful than any research paper I saw at the conference this year. </div><div><h2></h2><h2></h2><h2>Talks and Papers that I Liked</h2></div><div><br /></div>I had nowhere enough mental capacity to see all the papers and talks at NIPs, but I did write down some talks, papers, and posters that I came away really inspired by. Here’s a non-exhaustive list.<br /><br /><ul><li>Andrew Ng gave a <a href="https://www.dropbox.com/s/dyjdq1prjbs8pmc/NIPS2016%20-%20Pages%202-6%20(1).pdf?dl=0">practical tutorial</a> on how to debug ML research projects. This is a must-watch for beginners in the field. Here is a <a href="https://www.youtube.com/watch?v=F1ka6a13S9I">video of a similar talk</a> he gave at the Deep Learning summer school.</li><li><a href="https://arxiv.org/pdf/1605.06451.pdf">Fixed Points of Belief Propagation</a> - the authors formulate the convergence condition of belief propagation (BP) as the solution of polynomial equations. This makes me wonder if these methods could be used to bridge nonlinear dynamical systems with statistical inference.</li><li><a href="https://arxiv.org/abs/1603.09382">Deep Networks with Stochastic Depth</a> - layer-wise dropout allows for training ResNets with >1000 layers.</li><li><a href="https://arxiv.org/abs/1607.06450">Layer Norm</a> - like batch norm, layer norm is used to prevent covariance shift between successive deep layers. The advantage is that layer norm does not introduce gradient variance from other minibatch elements.</li><li><a href="https://arxiv.org/abs/1606.02396">Deep Successor Reinforcement Learning</a> - I really liked this paper because it gets at model-based + model-free ideas. The motivation is that model-free methods like DQN take a long time to learn adequate low-level features because we are often trying to train a CNN to “see” using a single value scalar that is not directly related to image features. The idea is to learn “successor features” in an unsupervised manner (using rich supervision provided by model prediction error), and then separately learn a simple linear value function on top of these features. The paper is short and well-written.</li><li>The Intelligent Biosphere by Drew Purves (couldn't find the talk slides). This talk was a bit of an oddball among the other mathy ones at NIPS (I thought the drawings were distracting and failed to capture the nuance of what Drew was saying), but I think the message is crucial for AI researchers who want to take a step back and think about what it means for a system to be “intelligent”. The basic idea is that “intelligence” extends beyond brains; something as simple as self-replicating RNA exhibit intelligent behavior at the evolutionary scale. The natural world is fractal, cyclic, and fuzzy, as opposed to the simplicity of our current simulated environments for benchmarking AI. The implication is that there may be challenges learning in these environments that don’t carry over from Atari games. One interesting comment Drew made was that in a biosphere, every organism is a resource to another organism. That is, learning and adaptation of each organism is not independent of other organisms (reminds me of GANs)! If you’re interested in these environment + intelligence ideas, similar arguments have been made in <a href="http://s3.amazonaws.com/academia.edu.documents/30958431/maclennan-final.pdf?AWSAccessKeyId=AKIAJ56TQJRTWSMTNPEA&Expires=1483341349&Signature=ZbX3MEDN4Fi4IcBNlzBC1qKvcks%3D&response-content-disposition=inline%3B%20filename%3DSynthetic_ethology_A_new_tool_for_invest.pdf">ethology</a> and <a href="http://www.scholarpedia.org/article/Neuroethology">neuroethology</a> texts.</li><li><a href="https://arxiv.org/abs/1609.07152">Input Convex Neural Networks</a> - the authors presented a very interesting follow-up workshop poster at the Deep RL workshop (I can't find a paper), where they were able to solve <a href="https://arxiv.org/abs/1603.00748">continuous Q functions</a> by parameterizing the Q-function to be convex with respect to the action vector of the agent (and non-convex with respect to the input state). This allows for efficient DQN policies in continuous action domains. </li><li><a href="https://arxiv.org/abs/1610.06258">Using Fast Weights to Attend to the Recent Past</a> - basically an alternative to LSTM that seems to work a lot better.</li><li><a href="https://arxiv.org/abs/1610.09513">Phased LSTM</a> - a way to learn an adaptive time scale for LSTM architectures, where the hidden state is updated at a learned frequency. This is ideal for learning from data with arbitrary sampling rates, such as from <a href="http://rpg.ifi.uzh.ch/research_dvs.html">event-based vision sensors</a> (which naturally solve some hard computer vision problems like aliasing and bandwidth constraints). It also happens that this Zurich lab happens to make a <a href="http://inilabs.com/products/dynamic-vision-sensors/">really awesome event-based camera</a> and I’m really excited to see how they apply ML to it.</li><li><a href="https://arxiv.org/abs/1605.06431">Residual Networks Behave Like Ensembles of Relatively Shallow Networks</a> - the title says it all. I thought the experiments could have been a lot better, but regardless, this work could potentially have a huge impact in new ways to train very deep ResNets quickly.</li><li><a href="https://tkipf.github.io/graph-convolutional-networks/">Graph Convolutional Networks</a> - generalization of convolutions to allow DNN processing on graph data structures (the kind with edges and vertices). This work is important for getting DL methods to learn approximate solutions to NP-hard graph problems.</li><li><a href="https://arxiv.org/abs/1608.07179">Minimizing Quadratic Functions in Constant Time</a> - I did a double-take when I saw the title of this poster. I don’t understand the math very well, but the algorithm is very simple: subsample a $n \times n$ matrix to a $k \times k$ matrix and solve it. A simple scaling of the resulting minimized value yields an approximation to the minimized objective of the original system! In problems where we need to branch-and-bound a large search space (an optimal portfolio), we might find the value of a solution more informative than the solution itself.</li></ul><div><h2></h2><h2></h2><h2>Predictions for 2017</h2><div><br /></div><ul><li>SGD + backprop will still be the best method for training deep neural nets in 2017.</li><li>Really good transfer learning, domain adaptation, and/or multitask learning will be one of the major accomplishments in 2017 (this was suggested by Andrew Ng during his tutorial).</li><li>Ensemble methods (e.g. training 10 versions of a network and averaging the predictions) will squeeze out 2-5% performance on all kinds of ML tasks. </li><li>Widespread use of model-based + model-free RL, e.g. <a href="https://arxiv.org/pdf/1611.05397">Reinforcement Learning with Unsupervised Auxiliary Tasks</a> as a way to alleviate sample-complexity of model-free methods.</li></ul><div><div><h2></h2><h2></h2><h2>Miscellaneous Notes</h2><ul></ul></div><div><div><br /></div><div>Some of these are paraphrased opinions of panelists and speakers. If you think I misrepresented your opinions in any way, please don't hesitate to <a href="http://evjang.com/about.html">reach out</a> and I will make a correction!</div><div><br /></div><ul><li>Boston Dynamics displayed a series of very entertaining and impressive robotics demos. Their basic approach appears to be high precision hardware that behaves faithfully to a simulated planning model, plus a bit of robust control to handle the imperfections. Machine Learning doesn’t seem to be an important priority for their approach to robotic control, but I think it's good to have uncorrelated research approaches. It's hard to say which one will end up working best. God, I hope <a href="https://www.bloomberg.com/news/articles/2016-03-17/google-is-said-to-put-boston-dynamics-robotics-unit-up-for-sale">Alphabet doesn’t sell them off</a>. Their stuff is really cool.</li><li>Apple has publicly announced that they will open up their research culture. I think a product-driven research culture has the potential to align research towards important problems (rather than maximizing citations, as is done in academia).</li><li>Certain prop trading firms are using neural nets to do price forecasting. I had previously thought that all trading firms were too concerned about model interpretability to use neural nets.</li><li>Speaking of hedge funds, many top hedge funds and trading shops came to NIPS to run career booths, but there was a surprising lack of interest from attendees (attendees were more interested in the likes of Apple, Facebook, Deepmind, Google, etc). At a regular college career fair in the East Coast, these roles are reversed. The talent pool at NIPS seems to be far more interested in tech and open-ended research than making money. </li><li>Deers, despite being “herbivores”, will sometimes eat the heads of baby birds for extra calcium (this was from Drew Purves talk). Life rejects your reductionist labels, human!</li><li>Engineering Principles From Stable and Developing Brains (Saket Navlakha) - interestingly enough, pruning unused neuron units seems to work better for learning than growing additional neurons.</li><li>Yoshua Bengio - much better generalization (train vs. test error) happens when telling generative models what their latent codes should be.</li><li>There was a great question asked during the Brains and Bits panel about why we are focusing so much intelligence research on humans, instead of simpler animals like octopi and birds. Yoshua says that focusing on simple models might cause researchers to overlook more general principles of intelligence. I agree with the argument, but not it’s premise - I think it is our AI algorithms that are simple, not the octopi and birds. I believe that we should build an octopi AI before a human one (see the "<a href="https://books.google.com/books?id=usu6EZj79pgC&pg=PA13&lpg=PA13&dq=whole+iguana&source=bl&ots=jOntBGYtAB&sig=Eiwkxw_GeM2B9h_Y9KOsSbF5CJs&hl=en&sa=X&ved=0ahUKEwjdjLyUsKTRAhVCHGMKHZ_1B3sQ6AEINzAH#v=onepage&q=whole%20iguana&f=false">whole iguana</a>" approach).</li><li>Yann Lecun thinks model interpretability is overrated: we have lots of interpretable algorithms, but we still lack the basic principles with which to build a general-purpose learning system that can navigate the complexities of our natural world. I agree wholeheartedly with this - requiring interpretability unnecessarily confines us to look for solutions in models that we understand bottom-up. If history of scientific discovery is any clue, that is not always the case. In fact, serendipitous discovery sometimes invalidates our bottom-up assumptions! (e.g. "neural nets don't work").</li><li>Yann is not a fan of the log-likelihood metric. Log-likelihood is evaluated under a particular model (e.g. PixelCNN decoder), and if the model is not good to begin with, then log-likelihood is somewhat meaningless.</li><li>Andrew Saxe asked a great question at the Brain and Bits workshop: “what are fringe topics that might be important to solving intelligence?” Demis Hassabis replied that people don't regard the study of imagination, dreaming, consciousness as very scientific, but they may become serious practical questions within 5 years. Terry Sejnowski replied that understanding sleep is crucial, as it is very much a computational phenomenon rather than a torpid one. Yoshua wonders about what it means to "understand something", i.e. neural nets and other models can learn computations and statistical correlations, but how should semantic meaning be extracted?</li></ul><div><br /></div><h2>Further Reading</h2><div><br /></div><div>I didn't attend any talks on meta-learning (e.g. synthetic gradients, learning-to-learn, hypernetworks), and a great many other research themes, but others have blogged about their experiences at NIPS. Check these articles out to piece together a fuller picture!</div><div><br /></div><div><a href="https://medium.com/@elluba/nips-2016-cake-rocket-ai-gans-and-the-style-transfer-debate-708c46438053#.5990vq0st">NIPS 2016: cake, Rocket AI, GANs and the style transfer debate</a> by Luba Elliott</div><div><a href="http://blog.aylien.com/highlights-nips-2016/">Highlights of NIPS 2016: Adversarial Learning, Meta-learning and more</a> by Aylien</div><div><a href="https://martin-thoma.com/nips-2016/">NIPS 2016</a> by Martin Thoma</div><div><a href="https://blog.ought.com/nips-2016-875bb8fadb8c">50 things I learned at NIPS 2016</a> by Andreas Stuhlmüller</div><div><a href="http://www.deeplearningweekly.com/blog/deep-learning-2016-the-year-in-review">Deep Learning 2016: the Year in Review</a> by Jan Bussieck</div><div><br /></div><div><br /></div><div>Thank you for reading, and happy new year!</div></div></div></div></div></div>Ericnoreply@blogger.com17tag:blogger.com,1999:blog-842965756326639856.post-19738113672691286692016-11-08T23:00:00.003-08:002017-04-23T14:55:09.687-07:00Tutorial: Categorical Variational Autoencoders using Gumbel-Softmax<div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-i3w5W_x06i8/WP0ineAGrNI/AAAAAAAAFVM/dV7NnpXaoHwmIsTwDWv7WHRWFX9Uc283gCLcB/s1600/Screen%2BShot%2B2017-04-23%2Bat%2B11.53.42%2BPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="133" src="https://4.bp.blogspot.com/-i3w5W_x06i8/WP0ineAGrNI/AAAAAAAAFVM/dV7NnpXaoHwmIsTwDWv7WHRWFX9Uc283gCLcB/s320/Screen%2BShot%2B2017-04-23%2Bat%2B11.53.42%2BPM.png" width="320" /></a></div><br /><br />In this post, I discuss our recent paper, <a href="https://arxiv.org/abs/1611.01144">Categorical Reparameterization with Gumbel-Softmax</a>, which introduces a simple technique for training neural networks with discrete latent variables. I'm really excited to share this because (1) I believe it will be quite useful for a variety of Machine Learning research problems, (2) this is my first published paper ever (on Arxiv, and submitted to a NIPS workshop and ICLR as well).<br /><b>The TLDR;</b> if you want categorical features in your neural nets, just let <code>sample = softmax((logits+gumbel noise)/temperature)</code>, and then backprop as usual using your favorite automatic differentiation software (e.g. TensorFlow, Torch, Theano).<br />You can find the code for this article <a href="https://github.com/ericjang/gumbel-softmax/blob/master/Categorical%20VAE.ipynb">here</a><br /><h2>Introduction</h2>One of the main themes in Deep Learning is to “let the neural net figure out all the intermediate features”. For example: training convolutional neural networks results in the self-organization of a feature detector hierarchy, while Neural Turing Machines automatically “discover” copying and sorting algorithms.<br />The workhorse of Deep Learning is the backpropagation algorithm, which uses dynamic programming to compute parameter gradients of the network. These gradients are then used to minimize the optimization objective via gradient descent. In order for this to work, all of the layers in our neural network — i.e. our learned intermediate features — must be continuous-valued functions.<br />What happens if we want to learn intermediate representations that are discrete? Many "codes" we want to learn are fundamentally discrete - musical notes on a keyboard, object classes (“kitten”, “balloon”, “truck”), and quantized addresses (“index 423 in memory”).<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-W3RoaXU3VVc/WCLGD1TsISI/AAAAAAAAFL4/YFb1Dlxcnp4acDW5KDEeBW6HMk8BQjLuQCLcB/s1600/7998802501_f4633002de_b.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="https://2.bp.blogspot.com/-W3RoaXU3VVc/WCLGD1TsISI/AAAAAAAAFL4/YFb1Dlxcnp4acDW5KDEeBW6HMk8BQjLuQCLcB/s400/7998802501_f4633002de_b.jpg" width="302" /></a></div>We can use stochastic neural networks, where each layer compute the parameters of some (discrete) distribution, and its forward pass consists of taking a sample from that parametric distribution. However, the difficulty is that we can’t backpropagate through samples. As shown below, there is a stochastic node (blue circle) in between $f(z)$ and $\theta$.<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-fpP3tK8x5Xk/WCKuHOhS7qI/AAAAAAAAFKs/hGfcGjBC3msRhQczwPMoiWhfIOtnGL1CACEw/s1600/gumbel1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://4.bp.blogspot.com/-fpP3tK8x5Xk/WCKuHOhS7qI/AAAAAAAAFKs/hGfcGjBC3msRhQczwPMoiWhfIOtnGL1CACEw/s320/gumbel1.png" width="177" /></a></div>Left: in continuous neural nets, you can use backprop to compute parameter gradients. Right: backpropagation is not possible through stochastic nodes.<br /><h2>Gumbel-Softmax Distribution</h2>The problem of backpropagating through stochastic nodes can be circumvented if we can re-express the sample $z \sim p_\theta(z)$, such that gradients can flow from $f(z)$ to $\theta$ without encountering stochastic nodes. For example, samples from the normal distribution $z \sim \mathcal{N}(\mu,\sigma)$ can be re-written as $z = \mu + \sigma \cdot \epsilon$, where $\epsilon \sim \mathcal{N}(0,1)$. This is also known as the “reparameterization trick”, and is commonly used to train variational autoencoders with Gaussian latent variables.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-UygVd5cfAKA/WCKuJ2BqmEI/AAAAAAAAFKw/zSjMGxopb74_-FeXt7tTslrM0_PTKKvlQCEw/s1600/gumbel3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://4.bp.blogspot.com/-UygVd5cfAKA/WCKuJ2BqmEI/AAAAAAAAFKw/zSjMGxopb74_-FeXt7tTslrM0_PTKKvlQCEw/s320/gumbel3.png" width="175" /></a></div>The Gumbel-Softmax distribution is reparameterizable, allowing us to avoid the stochastic node during backpropagation.<br /><b>The main contribution of this work is a “reparameterization trick” for the categorical distribution.</b> Well, not quite – it’s actually a re-parameterization trick for a distribution that we can <i>smoothly deform</i> into the categorical distribution. We use the Gumbel-Max trick, which provides an efficient way to draw samples $z$ from the Categorical distribution with class probabilities $\pi_i$:<br />$$ \DeclareMathOperator*{\argmax}{arg\,max} z = \verb|one_hot|\left(\argmax_{i}{\left[ g_i + \log \pi_i \right]}\right) $$ <br />argmax is not differentiable, so we simply use the softmax function as a continuous approximation of argmax:<br />$$ y_i = \frac{\text{exp}((\log(\pi_i)+g_i)/\tau)}{\sum_{j=1}^k \text{exp}((\log(\pi_j)+g_j)/\tau)} \qquad \text{for } i=1, ..., k. $$ <br />Hence, we call this the Gumbel-<i>Soft</i>Max distribution*. $\tau$ is a temperature parameter that allows us to control how closely samples from the Gumbel-Softmax distribution approximate those from the categorical distribution. As $\tau \to 0$, the softmax becomes an argmax and the Gumbel-Softmax distribution becomes the categorical distribution. During training, we let $\tau > 0$ to allow gradients past the sample, then gradually anneal the temperature $\tau$ (but not completely to 0, as the gradients would blow up).<br />Below is an interactive widget that draws samples from the Gumbel-Softmax distribution. Keep in mind that samples are vectors, and a one-hot vector (i.e. one of the elements is 1.0 and the others are 0.0) corresponds to a discrete category. Click "re-sample" to generate a new sample, and try dragging the slider and see what samples look like when the temperature $\tau$ is small!<br /><script src="https://cdnjs.cloudflare.com/ajax/libs/Chart.js/2.3.0/Chart.bundle.min.js" type="text/javascript"></script> <canvas height="200" id="myChart" width="300"></canvas><script>var ctx = document.getElementById("myChart"); var myChart = new Chart(ctx, { type: 'bar', data: { labels: [1,2,3,4,5,6,7,8], datasets: [{ label: 'Probability', data: [0,0,0,0,0,0,0,0], backgroundColor: '' , borderColor: '', borderWidth: 1 }] }, options: { scales: { yAxes: [{ ticks: { beginAtZero:true, max: 1.0, min: 0, } }] } } }); var gumbels = [0,0,0,0,0,0,0,0] var tau = 1.0 function sample() { for (var i=0; i< 8; i++) { gumbels[i] = -Math.log(-Math.log(Math.random())) } } function gumbel_softmax(logits,tau) { var exp = logits.map(function(x,i){ return Math.exp((x + gumbels[i] )/tau) }) var sum = exp.reduce(function(a, b) { return a + b; },0) var y = exp.map(function(x){ return x/sum }) return y } function updatePlot(new_tau) { tau = new_tau var logits = [3,1.2,4,-0.3,0.2,1.0,3.6,4.05] myChart.data.datasets[0].data = gumbel_softmax(logits,tau) // choose new colors var colors = [ [0.6, 'rgba(54, 162, 235, 0.2)', 'rgba(54, 162, 235, 1)'], [1.2,'rgba(0, 204, 102, 0.2)','rgba(0, 204, 102, 1)'], [1.8,'rgba(255, 159, 64, 0.2)','rgba(255, 159, 64, 1)'], [2.4,'rgba(255, 206, 86, 0.2)','rgba(255, 206, 86, 1)'], [3.1,'rgba(255, 99, 132, 0.2)' ,'rgba(255,99,132,1)'] ] for (var i=0; i<colors.length; i++) { if (tau < colors[i][0]) { myChart.data.datasets[0].backgroundColor = colors[i][1] myChart.data.datasets[0].borderColor = colors[i][2] break } } myChart.update() } function reSample() { sample() updatePlot(tau) } reSample() </script><span id="valBox"></span><span id="slider_value">1.0</span><input id="myslider" max="3.0" min="0.1" oninput="updatePlot(this.value)" step="0.1" type="range" /><button onclick="reSample()">re-sample</button> <script> var slider = document.getElementById('myslider'); slider.addEventListener('change', function() { var val = document.getElementById('slider_value'); val.innerHTML = slider.value; }); </script> <br /><br /><h2>TensorFlow Implementation</h2>Using this technique is extremely simple, and only requires 12 lines of Python code:<br /><script src="https://gist.github.com/ericjang/1001afd374c2c3b7752545ce6d9ed349.js"></script> <br />Despite its simplicity, Gumbel-Softmax works surprisingly well - we benchmarked it against other stochastic gradient estimators for a couple tasks and Gumbel-Softmax outperformed them for both Bernoulli (K=2) and Categorical (K=10) latent variables. We can also use it to train semi-supervised classification models much faster than previous approaches. See <a href="https://arxiv.org/abs/1611.01144">our paper</a> for more details.<br /><h2>Categorical VAE with Gumbel-Softmax</h2>To demonstrate this technique in practice, here's a categorical variational autoencoder for MNIST, implemented in less than 100 lines of Python + TensorFlow code.<br />In standard <a href="https://arxiv.org/abs/1312.6114">Variational Autoencoders</a>, we learn an encoding function that maps the data manifold to an isotropic Gaussian, and a decoding function that transforms it back to the sample. The data manifold is projected into a Gaussian ball; this can be hard to interpret if you are trying to learn the categorical structure within your data.<br />First, we declare the encoding network: <br /><script src="https://gist.github.com/ericjang/0ab1531dcd521390960635bff79a44b1.js"></script> <br />Next, we sample from the Gumbel-Softmax posterior and decode it back into our MNIST image.<br /><script src="https://gist.github.com/ericjang/19c94c216ed7bbe21d7142e3fcdc8afa.js"></script> <br />Variational autoencoders minimizes reconstruction error of the data by maximizing an expectedlower bound (ELBO) on the likelihood of the data, under a generative model $p_\theta(x)$. For a derivation, see this <a href="http://blog.evjang.com/2016/08/variational-bayes.html">tutorial on variational methods</a>.<br />$$\log p_\theta(x) \geq \mathbb{E}_{q_\phi(y|x)}[\log p_\theta(x|y)] - KL[q_\phi(y|x)||p_\theta(y)]$$ <script src="https://gist.github.com/ericjang/79f2ad7f01d94858712d561baf0be908.js"></script> <br />Finally, we run train our VAE: <br /><script src="https://gist.github.com/ericjang/d30ed6e2528594a6b4421f45459abe74.js"></script> <br />...and, that's it! Now we can sample randomly from our latent categorical code and decode it back into MNIST images: <br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-WGtMZR2EHMw/WCLDmrq0OAI/AAAAAAAAFLs/OdWPcjaZFBc56pHbIIH6_tIOlpoHkKu1gCLcB/s1600/code.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="https://1.bp.blogspot.com/-WGtMZR2EHMw/WCLDmrq0OAI/AAAAAAAAFLs/OdWPcjaZFBc56pHbIIH6_tIOlpoHkKu1gCLcB/s400/code.png" width="400" /></a></div>Code can be found <a href="https://github.com/ericjang/gumbel-softmax/blob/master/Categorical%20VAE.ipynb">here</a>. Thank you for reading, and let me know if you find this technique useful!<br /><h2>Acknowledgements</h2>I'm sincerely grateful to my co-authors, Shane Gu and Ben Poole for collaborating with me on this work. Shane introduced me to the Gumbel-Max trick back in August, and supplied the framework for comparing Gumbel-Softmax with existing stochastic gradient estimators. Ben suggested and implemented the semi-supervised learning aspect of the paper, did the math derivations in the Appendix, and helped me a lot with editing the paper. Finally, thanks to Vincent Vanhoucke and the Google Brain team for encouraging me to pursue this idea.<br />*Chris J. Maddison, Andriy Mnih, and Yee Whye Teh at Deepmind have discovered this technique independently and published their own paper on it - they call it the <a href="https://arxiv.org/abs/1611.00712">“Concrete Distribution”</a>. We only found out about each other’s work right as we were submitting our papers to conferences (oops!). If you use this technique in your work, please cite both of our papers! They deserve just as much credit.<br /><br /><i>23 Apr 2017: Update - Chris Maddison and I integrated these distributions into TensorFlow's <a href="https://www.tensorflow.org/versions/r0.12/api_docs/python/contrib.distributions/">Distributions</a> sub-framework. <a href="https://github.com/ericjang/gumbel-softmax/blob/master/gumbel_softmax_vae_v2.ipynb">Here's a code example</a> of how to implement a categorical VAE using the distributions API.</i>Ericnoreply@blogger.com20tag:blogger.com,1999:blog-842965756326639856.post-18517255843527726702016-09-06T21:36:00.004-07:002016-09-06T22:37:28.539-07:00Riemann Summation and Physics Simulation are Statistically BiasedRiemann summation is the simplest method for numerically approximating integrals. in which you chop up the domain into equally sized segments and add a bunch of rectangles together.<br /><br />$$\int f(x) dx \sim \sum_i^k f(x) \Delta x$$<br /><br class="Apple-interchange-newline" /><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-xY1s4OvpKVQ/V8-PQWbdEqI/AAAAAAAAFJc/N3PmEpKRrt4IvGzE4iN_yTFmMgiBPpE7gCLcB/s1600/riemann%2Bbias.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="158" src="https://3.bp.blogspot.com/-xY1s4OvpKVQ/V8-PQWbdEqI/AAAAAAAAFJc/N3PmEpKRrt4IvGzE4iN_yTFmMgiBPpE7gCLcB/s400/riemann%2Bbias.png" width="400" /></a></div><br /><div><br /></div>This technique also comes up in Euler integration for simulating physics:<br /><br />$$<br />\begin{align*}<br />\Delta x & = \int \dot{x}(t) dt \\<br />& \sim \dot{x}(t) \Delta t<br />\end{align*}<br />$$<br /><br />These are all special cases of "Deterministic Quadratures", where the integration step size $\Delta x$ is non-random ("deterministic") and we are summing up a bunch of quadrilateral pieces ("<a href="https://en.wikipedia.org/wiki/Quadrature_(mathematics)">Quadratures</a>") to approximate areas.<br /><br />This post interprets the numerical integration as a special case of stratified sampling, and shows that deterministic quadrature rules are statistically biased estimators.<br /><br />Suppose we want to compute $E_X[f(X)] = \int_x p(x)\cdot f(x) dx$ via a Monte Carlo estimator.<br /><br />A plain Monte Carlo Estimator is given by $\mu = \frac{1}{n}\sum_i^n f(X_i)$, and has variance $\frac{1}{n}\sigma^2$ where $\sigma^2=Var[X_i]$.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-0ZOJKuhhZqE/V835jmFLasI/AAAAAAAAFJY/hUmDjJtlaR8oH2CesO4olhApWSIDTcnjwCPcB/s1600/plain_mc.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="https://1.bp.blogspot.com/-0ZOJKuhhZqE/V835jmFLasI/AAAAAAAAFJY/hUmDjJtlaR8oH2CesO4olhApWSIDTcnjwCPcB/s400/plain_mc.png" width="400" /></a></div><br /><br />Stratified sampling (see <a href="http://blog.evjang.com/2016/09/variance-reduction-part1.html">previous blog post</a>) introduces a stratification variable $Y$ with a discrete distribution of $k$ strata with probability densities $p(y_1),p(y_2),...,p(y_k)$, respectively. A common choice is to assign each $p(y_1) = p(y_2) = ... = p(y_k) = 1/k$, and arrange the strata in a grid. In this case, $y_i$ correspond to the corners of the strata and $P(X|Y=y_i)$ corresponds to a uniform distribution over that square ($X_i = Y_i + 0.1*\text{rand}()$).<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-4cd05MWgoZw/V8-TO9W6s0I/AAAAAAAAFJk/XTImoTwgQRAsY0Q0VWCHXHFDNugTWKMWwCLcB/s1600/stratified.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="https://1.bp.blogspot.com/-4cd05MWgoZw/V8-TO9W6s0I/AAAAAAAAFJk/XTImoTwgQRAsY0Q0VWCHXHFDNugTWKMWwCLcB/s400/stratified.png" width="400" /></a></div><br /><br />The variance of this estimator is $\sum_i^k p(y_i)^2 \cdot V_i$, where $V_i$ is the variance of $\mathbb{E}[X|Y=y_i]$ for strata $i$.<br /><br />Suppose we let $\mathbb{E}[X|Y=y_i]$ as $y_i$ - that is, we just sample at the corners where $y_i$ sampels are.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-EMrnAxY-Kqo/V8-U7VqXF1I/AAAAAAAAFJs/i-8nfKZwsNk_Ouq-iIce8yjYYbbwWTwcACLcB/s1600/stratified1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="https://1.bp.blogspot.com/-EMrnAxY-Kqo/V8-U7VqXF1I/AAAAAAAAFJs/i-8nfKZwsNk_Ouq-iIce8yjYYbbwWTwcACLcB/s400/stratified1.png" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: left;">The good news is that $V_i = 0$, and we've reduced the variance of the total estimator to zero. The bad news is that the estimator is now biased because our per-strata estimators $\mathbb{E}[X|Y=y_i]$ are biased (our estimator never uses information from the interiors/edges of the squares). The estimator is also inconsistent for finite strata.</div><br />Does this figure remind you of anything? This estimator behaves identically to a Riemann Summation in 2D!<br /><br />What does this mean?<br /><br /><ul><li>If you ever take a Riemann sum with fixed step size, your integral is biased! </li><li>If you are monitoring mean ocean temperatures and place your sensors in an regular grid across the water (like above), your estimator is biased! </li><li>If you are recording video at a too-low frequency and <a href="https://www.youtube.com/watch?v=R-IVw8OKjvQ">pick up aliasing artifacts</a>, your estimates are biased!</li><li>If you use a fixed timestep in a physics simulation, such as Euler or <a href="https://en.wikipedia.org/wiki/Runge%E2%80%93Kutta_methods">Runge-Kutta</a> methods, your space integrals are biased!</li></ul><br />The takeaway here is that *all* deterministic quadrature rules are statistically biased. that's okay though - the expected value of biased estimators *do* converge to the true expectation as you crank the number of strata to infinity (e.g. $\Delta x \to 0$).<br /><br />A "less-biased" physics simulator ought to incorporate random time steps (for instance, sampled between 0 and MAX_TIME_STEP). To reduce variance, one might generate multiple samples, compute a nonlinear update, then average out the integration.<br /><br /><br />Fairly obvious in hindsight, but I found it interesting to think about the relationship between stratified sampling and deterministic quadrature :)Ericnoreply@blogger.com3tag:blogger.com,1999:blog-842965756326639856.post-60968315191597735402016-09-05T17:32:00.000-07:002016-09-06T20:21:03.706-07:00Monte Carlo Variance Reduction Techniques in JuliaMonte Carlo methods are a simple yet powerful family of techniques for performing numerical integration in high-dimensional spaces.<br /><br />This post is a tutorial on a few variance reduction techniques that can be used to reduce the noisiness of your estimators: <a href="https://en.wikipedia.org/wiki/Antithetic_variates">antithetic sampling</a>, <a href="https://en.wikipedia.org/wiki/Control_variates">control variates</a>, and <a href="https://en.wikipedia.org/wiki/Stratified_sampling">stratified sampling</a>.<br /><br /><i><a href="https://github.com/ericjang/variance_reduction/blob/master/antithetic_cv_stratified.ipynb">Code on Github</a></i><br /><i><br /></i><br /><h2>The Problem</h2><i><br /></i>Our goal is to find the expected value of the random variable $X$:<br /><div><div><br />$$X = f(u) = \exp(\sin(8 u_1^2 + 3 u_1)+ \sin(11 u_2))$$<br /><br />where $u$ is a 2D point sampled uniformly from the unit square, as shown below:<br /><br /><pre><code>f(x,y) = exp(sin(8x.^2+3x)+sin(11y))<br />plot(linspace(0,1,100),linspace(0,1,100),f,st=:contourf,title="f(X)",size=(400,400))<br /></code></pre><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-MlxmpSFpkQQ/V84KEUhR0AI/AAAAAAAAFJM/MzVh1KybCJUh4na7_Hx_KzT9_i7VK3zTgCLcB/s1600/function.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="https://3.bp.blogspot.com/-MlxmpSFpkQQ/V84KEUhR0AI/AAAAAAAAFJM/MzVh1KybCJUh4na7_Hx_KzT9_i7VK3zTgCLcB/s400/function.png" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">The expectation can be re-written as an integral over the unit square.</div><div class="separator" style="clear: both; text-align: left;"><br /></div>$$<br />\mathbb{E}[X] = \int_0^1 \int_0^1 \exp(\sin(8 u_1^2 + 3 u_1)+ \sin(11 u_2)) du_1 du_2<br />$$<br /><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">Monte Carlo is useful in scenarios where a closed-form integral does not exist, or we do not have access to the analytical form of the function (for example, it could be a black box program that the NSA hands to us). In this case we <i>can</i> integrate it analytically, but let's pretend we can't :)</div><div class="separator" style="clear: both; text-align: left;"><br /></div><h2>Basic Monte Carlo Estimator</h2><br />The basic Monte Carlo estimator $\hat{\mu}$ gives a straightforward answer: "evaluate $x_i=f(u_i)$ at $n$ random values of $U$, and then take the average among all your samples." (to make comparison easier with the estimators I will introduce later on, I use $2n$ samples instead of $n$).<br /><br />$$<br />\hat{\mu} = \frac{1}{2n}\sum_i^{2n} x_i \approx \mathbb{E}[X]<br />$$<br /><div><br /></div><div>Here is the estimator in code :<br /><pre><code><br />function plainMC(N)<br /> u1 = rand(2*N)<br /> u2 = rand(2*N)<br /> return u1,u2,f(u1,u2),[]<br />end<br /></code></pre><br />The plot below shows 100 samples and their values, colored according to the value of $f(X)$ at that location.<br /><br /><pre><code><br />u1,u2,x,ii=plainMC(N)<br />scatter(u1,u2,markercolor=x,<br /> xlabel="u1",ylabel="u2",<br /> title="Plain MC Samples",<br /> size=(400,400),label="")<br /></code></pre><br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-0ZOJKuhhZqE/V835jmFLasI/AAAAAAAAFIw/6Nud5M85Yd0JTVfDSLzOIqbreD_m05DQgCLcB/s1600/plain_mc.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="https://4.bp.blogspot.com/-0ZOJKuhhZqE/V835jmFLasI/AAAAAAAAFIw/6Nud5M85Yd0JTVfDSLzOIqbreD_m05DQgCLcB/s400/plain_mc.png" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: center;"><br /></div>The variance of this (2n-sample) estimator is given by:</div><div><br />$$<br />\newcommand{\Var}{\operatorname{Var}}<br />\begin{align*}<br />\Var[\hat{\mu}] & = \Var[ \frac{1}{2n}\sum_i^{2n} X_i ] \\<br />& = (\frac{1}{2n})^2\sum_i^{2n} \Var [X_i] \\<br />& = \frac{\sigma^2}{2n}<br />\end{align*}<br />$$</div><div><br /></div><div><div>Where $\sigma^2$ is the variance of a single sample $X_i$ and $n$ is the total number of samples.</div></div><div><br /><div>The variance of our estimator is $\mathbb{E}[(\hat{\mu}-\mu)^2]$, or in other words, the "expected squared error" with respect to the true expectation $\mu$. </div><div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-JRtTWEcVqEI/V82p7I6QipI/AAAAAAAAFHg/U-iUCCWVj5Qknm3G4db4pOIS4_5qo1WxACLcB/s1600/plainMC.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="266" src="https://4.bp.blogspot.com/-JRtTWEcVqEI/V82p7I6QipI/AAAAAAAAFHg/U-iUCCWVj5Qknm3G4db4pOIS4_5qo1WxACLcB/s400/plainMC.png" width="400" /></a></div><br /><br /></div><div>We could decrease variance by simply cranking up our sample count $n$, but this is computationally expensive. While variance is proportional to $\frac{1}{n}$, the standard deviation (which has the same units as $\mu$ and is proportional to the confidence interval of our estimator) decreases at $\frac{1}{\sqrt{n}}$, a far slower rate. This is highly unsatisfactory.</div><div><br />Instead, we will use variance reduction techniques to construct estimators with the <i>same </i>expected value as our MC estimator, and still achieve a lower variance with the same number of samples.</div></div><div><br /></div></div><h2>Antithetic Sampling</h2><br />Antithetic sampling is a slight extension to the basic Monte Carlo estimator. The idea is to pair each sample $X_i$ with an "antithetic sample" $Y_i$ such that $X_i$ and $Y_i$ are identically distributed, but negatively correlated. Our single-sample estimate is given by:<br /><br />$$<br />\hat{\mu} = \frac{X_i + Y_i}{2}<br />$$<br /><br />$\mathbb{E}[X_i] = \mathbb{E}[Y_i]$ so this estimator has the same expected value as the plain MC estimator.<br /><br />$$<br />\begin{align*}<br />\Var[\hat{\mu}] & = \frac{1}{4}\Var[X_i + Y_i] \\<br />& = \frac{1}{4}(\Var[X_i] + \Var[Y_i] + 2 \text{Cov}(X_i,Y_i)) && \Var[X_i] = \Var[Y_i] = \sigma^2 \\<br />& = \frac{1}{2}(\sigma^2 + \sigma^2\beta) && \beta = \text{Corr}(X_i,Y_i) = \frac{\text{Cov}(X_i,Y_i)}{\sigma^2}<br />\end{align*}<br />$$<br /><br />If we sample $N$ of these pairs (for a total of $2N$ samples), the variance is $\frac{1}{2n}(\sigma^2 + \sigma^2\beta)$, and it's easy to see that this estimator has a strictly lower variance than the plain Monte Carlo as long as $\beta$ is negative.<br /><br />Unfortunately, there isn't a general method for generating good antithetic samples. If the random sample of interest $X_i = f(U_i)$ is a transformation of a uniform random variable $U_i \sim \text{Uniform}(0,1)$, we can try "flipping" the input and letting $Y_i = f(1-U_i)$. $1-U_i$ is distributed identically to $U_i$, and assuming $f(U_i)$ is monotonic, then $Y_i$ will be large when $X_i$ is small, and vice versa.<br /><br />However, this monotonicity assumption does not hold in general for 2D functions. In fact, for our choice of $f(U)$, it is only <i>slightly</i> negatively correlated ($\beta \approx -0.07$).<br /><br /><pre><code>function antithetic(N)<br /> u1_x = rand(N); u2_x = rand(N);<br /> u1_y = 1-u1_x; u2_y = 1-u2_x;<br /> x = f(u1_x,u2_x);<br /> y = f(u1_y,u2_y);<br /> return [u1_x; u1_y], [u2_x; u2_y], [x;y], (u1_x,u2_x,x,u1_y,u2_y,y)<br />end<br /></code></pre><br />Notice how each $X_i$ sample ("+") is paired with an antithetic sample $Y_i$ ("x") mirrored across the origin.<br /><br /><pre><code><br />u1,u2,x,ii=antithetic(N)<br />u1_x,u2_x,x,u1_y,u2_y,y = ii<br />scatter(u1_x,u2_x,<br /> m=(10,:cross),markercolor=x,<br /> size=(400,400),label="X samples",<br /> xlabel="u1",ylabel="u2",title="Antithetic Sampling")<br />scatter!(u1_y,u2_y,<br /> m=(10,:xcross),markercolor=y,<br /> label="antithetic samples")<br /></code></pre><br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-bTCXfOdWtF8/V836D6K38zI/AAAAAAAAFI0/HBc1nVoW1WgoFHyTqyD9Ftwbx4B07g2QQCLcB/s1600/antithetic.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="https://3.bp.blogspot.com/-bTCXfOdWtF8/V836D6K38zI/AAAAAAAAFI0/HBc1nVoW1WgoFHyTqyD9Ftwbx4B07g2QQCLcB/s400/antithetic.png" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: center;"><br /></div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-xktnmgj_nfM/V82qAE5xW_I/AAAAAAAAFHk/Ts249a7KWvoMcXB5h_qZSNCpkcKVTaiqwCLcB/s1600/antithetic_curve.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="266" src="https://2.bp.blogspot.com/-xktnmgj_nfM/V82qAE5xW_I/AAAAAAAAFHk/Ts249a7KWvoMcXB5h_qZSNCpkcKVTaiqwCLcB/s400/antithetic_curve.png" width="400" /></a></div><br /><br /></div><h2>Control Variates</h2><br />Control Variates is similar to Antithetic Sampling, in that we will be pairing every sample $X_i$ with a sample of $Y_i$ (the "control variate"), and exploiting correlations between $X_i$ and $Y_i$ in order to decrease variance.<br /><br />However, $Y_i$ is no longer sampled from the same distribution as $X_i$. Typically $Y_i$ is some random variable that is computationally easy to generate (or is generated as a side effect) from the generative process for $X_i$. For instance, the weather report that forecasts temperature $X$ also happens to predict rainfall $Y$. We can use rainfall measurements $Y_i$ to improve our estimates of "average temperature" $\mathbb{E}[X]$.<br /><br />Suppose control variate $Y$ has a known expected value $\bar{\mu}$. The single-sample estimator is given by<br /><br />$$<br />\hat{\mu} = X_i - c (Y_i - \mathbb{E}[Y_i])<br />$$<br /><br />Where $c$ is some constant. In expectation, $Y_i$ samples are cancelled out by $\bar{\mu}$ so this estimator is unbiased. <br /><br />It's variance is<br /><br />$$<br />\begin{align*}<br />\Var[\hat{\mu}] & = \Var[X_i - c (Y_i - \mathbb{E}[Y_i])] \\<br />& = \Var[X_i - c Y_i] && \mathbb{E}[Y_i] \text{ is not random, and has zero variance} \\<br />& = \Var[X_i] + c^2 \Var[Y_i] + 2cCov(X_i,Y_i)<br />\end{align*}<br />$$<br /><br />Simple calculus tells us that the value of $c$ that minimizes $Var[\hat{\mu}]$ is<br /><br />$$<br />c^* = -\frac{Cov(X_i,Y_i)}{\Var(Y_i)}<br />$$<br /><br />so the best variance we can achieve is<br /><br />$$<br />\Var[\hat{\mu}] = \Var[X_i] - \frac{Cov(X_i,Y_i)^2}{\Var(Y_i)}<br />$$<br /><br />The variance of our estimator is reduced as long as $Cov(X_i,Y_i)^2 \neq 0$. Unlike antithetic sampling, control variates can be either positively or negatively correlated with $X_i$.<br /><br />The catch here is that $c^*$ will need to be estimated. There are a couple ways:<br /><br /><ol><li>Forget optimality, just let $c^* = 1$! This naive approach still works better than basic Monte Carlo because it exploits covariance structure.</li><li>Estimate $c^*$ from samples $X_i$, $Y_i$. Doing so introduces a dependence between $c^*$ and $Y_i$, which makes the estimator biased, but in practice this bias is small compared to the standard error of the estimate.</li><li>An unbiased alternative to (2) is to compute $c^*$ from a small set of $m < n$ "pilot samples", and then throw the pilot samples away before collecting the real $n$ samples.</li></ol><div><br /></div><div>In the case of our $f(u)$, we know that the modes of the distribution are scattered in the corners of the unit square. In polar coordinates centered at $(0.5,0.5)$, that's about $k \frac{\pi}{2} + \frac{\pi}{4}, k=0,1,2,3$. Let's choose $Y = -cos(4\theta)$, where $\theta$ is uniform about the unit circle. </div><div><br /><pre><code><br />cv(u1,u2) = -cos(4*atan((u2-.5)./(u1-.5)))<br />pcv = plot(linspace(0,1,100),linspace(0,1,100),cv,st=:contourf,title="Y",xlabel="u1",ylabel="u2")<br />plot(pcv,p0,layout=2)<br /></code><br /></pre><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-jcGC6Q9s04c/V84J_G6rgEI/AAAAAAAAFJI/UYJq5VqKgv0lJsEWooOizovU95TKhPxRwCEw/s1600/cv_y.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://2.bp.blogspot.com/-jcGC6Q9s04c/V84J_G6rgEI/AAAAAAAAFJI/UYJq5VqKgv0lJsEWooOizovU95TKhPxRwCEw/s640/cv_y.png" width="640" /></a></div><br /><br />$$<br />\begin{align*}<br />\mathbb{E}_U[Y] & = \int_0^1 \int_0^1 -\cos(4\theta) du_1 du_2 \\<br />& = \int_0^1 \int_0^1 - \frac{1-\tan^2(2\theta)}{1+\tan^2(2\theta)} du_1 du_2 \\<br />& = \int_0^1 \int_0^1 - \frac{1-(\frac{2\tan \theta}{1 - \tan^2(\theta)})^2}{1+(\frac{2\tan \theta}{1 - \tan^2(\theta)})^2} du_1 du_2 && \tan(\theta) = \frac{y}{x} \\<br />& = \pi-3<br />\end{align*}<br />$$<br /><div><br /></div><div>In practice though, we don't have an analytic way to compute $E[Y]$, but we can estimate it in an unbiased manner from our pilot samples.</div><div><br /></div><div><pre><code><br />function controlvariates(N)<br /> # generate pilot samples<br /> npilot = 30<br /> u1 = rand(npilot); u2 = rand(npilot)<br /> x = f(u1,u2)<br /> y = cv(u1,u2)<br /> β=corr(x,y)<br /> c = -β/var(y)<br /> μ_y = mean(y) # estimate mean from pilot samples<br /> # real samples<br /> u1 = rand(2*N); u2 = rand(2*N)<br /> x = f(u1,u2)<br /> y = cv(u1,u2)<br /> return u1,u2,x+c*(y-μ_y),(μ_y, β, c)<br />end<br /></code></pre><br /><br /><br /></div><div>We can verify from our pilot samples that the covariance is quite large: <br /><pre><code><br />βs=[]<br />for i=1:1000<br /> u1,u2,x,ii=controlvariates(100)<br /> βs = [βs; ii[2]]<br />end<br />mean(βs) # 0.367405<br /></code><br /></pre><br /><h2>Stratified Sampling</h2><br />The idea behind stratified sampling is very simple: instead of estimating $E[X]$ over the domain $U$, we break up the domain into $K$ strata, estimate the conditional expectation $X$ over each strata separately, then average our per-strata estimates.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-C8fDd7u4j1A/V836N1HtxuI/AAAAAAAAFI4/F7_k-4WJ0EUQwJkHgiQIZKaeQdXyIGWFACLcB/s1600/stratified_points.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="https://3.bp.blogspot.com/-C8fDd7u4j1A/V836N1HtxuI/AAAAAAAAFI4/F7_k-4WJ0EUQwJkHgiQIZKaeQdXyIGWFACLcB/s400/stratified_points.png" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><br />In the above picture, the sample space is divided into 50 strata going across the horizontal axis and there are 2 points placed in each strata. There are lots of ways to stratify - you can stratify in a 1D grid, 2D grid, concentric circles with varying areas, whatever you like.<br /><br />We're not restricted to chopping up $U$ into little fiefdoms as above (although it is a natural choice). Instead, we introduce a "stratification variable" $Y$ with discrete values $y_1,...,y_k$. Our estimators for each strata correspond to the conditional expectations $\mathbb{E}[X|Y=y_i] \text{ for } i=1...K$.<br /><br />The estimator is given by a weighted combination of strata estimates with the respective probability of each strata:<br /><br /><pre><code><br />function stratified(N)<br /> # N strata, 2 samples per strata = 2N samples<br /> u1 = Float64[]<br /> u2 = Float64[]<br /> for y=range(0,1/N,N)<br /> for i=1:2<br /> append!(u1,[y+1/N*rand()])<br /> append!(u2,[rand()])<br /> end<br /> end<br /> return u1,u2,f(u1,u2),[]<br />end<br /></code></pre><br />$$<br />\hat{mu} = \sum_i^k p(y_i) \mathbb{E}[X|Y=y_i]<br />$$<br /><br /><br />The variance of this estimator is<br /><br />$$<br />\sum_i^k p(y_i)^2 (V_i)<br />$$<br /><br />Where $V_i$ is the variance of estimator $i$.<br /><br />For simplicity, let's assume:<br /><br /><ol><li>We stratify along an evenly-spaced grid, i.e. $p(y_i) = \frac{1}{k}$</li><li>We are using a simple MC estimator within each strata, $V_i = \frac{\sigma_i^2}{n_i}$, with $n_i = \frac{2n}{k}$ samples each (so the total number of samples is $2n$, the same as the plain MC estimator). $\sigma_i^2$ is the variance of an individual sample within each strata.</li></ol><br />Then the variance is<br /><br />$$<br />\sum_i^k (\frac{1}{k} p(y_i)) (\frac{k}{2n}\sigma_i^2) = \frac{1}{2n} \sum_i^k p(y_i) \sigma_i^2 \leq \frac{1}{2n} \sigma^2<br />$$<br /><br />Intuitively, the variance within a single grid cell $\sigma_i^2$ should be smaller than the variance of the plain MC estimator, because samples are closer together.<br /><br />Here's a proof that $\sum_i^k p(y_i)\sigma_i^2 \leq \sigma^2$:<br /><br />Let $\mu_i = \mathbb{E}[X|Y=y_i], \sigma_i^2 = \Var[X|Y]$<br /><br />$$<br />\begin{align*}<br />\mathbb{E}[X^2] & = \mathbb{E}_Y[\mathbb{E}_X[X^2|Y]] && \text{Tower property} \\<br />& = \sum_i^k p(y_i) \mathbb{E}[X^2|Y=y_i] \\<br />& = \sum_i^k p(y_i) (\sigma_i^2 + \mu_i^2) && \text{via } \Var[X|Y] = \mathbb{E}[X^2|Y] - \mathbb{E}[X|Y=y_i]^2 \\<br />\mathbb{E}[X] & = \mathbb{E}_Y[\mathbb{E}_X[X|Y]] \\<br />& =\sum_i^k p(y_i) \mu_i \\<br />\sigma^2 & = \mathbb{E}[X^2] - \mathbb{E}[X]^2 \\<br />& =\sum_i^k p(y_i) \sigma_i^2 + \big[\sum_i^k p(y_i) \mu_i^2 - (\sum_i^k p(y_i) \mu_i)^2\big]<br />\end{align*}<br />$$<br /><br />Note that the term<br /><br />$$<br />\sum_i^k p(y_i) \mu_i^2 - (\sum_i^k p(y_i) \mu_i)^2<br />$$<br /><br />corresponds to the variance of a categorical distribution that takes value $\mu_i$ with probability $p(y_i)$. This term is strictly non-negative, and therefore the inequality $\sum_i^k \frac{k}{2n}\sigma_i^2 \leq \sigma^2$ holds.<br /><br />This yields an interesting observation: when we do stratified sampling, we are <i>eliminating the variance between strata means</i>. This implies that we get the most gains by choosing strata such that samples vary greatly <i>between</i> strata.<br /><br />Here's a comparison of the empirical vs. theoretical variance with respect to number of strata:<br /><br /><pre><code><br /># theoretical variance<br />strat_var = zeros(50)<br />for N=1:50<br /> sigma_i_2 = zeros(N)<br /> y = range(0,1/N,N)<br /> for i=1:N<br /> x1 = y[i].+1/N*rand(100)<br /> x2 = rand(100)<br /> sigma_i_2[i]=var(f(x1,x2))<br /> end<br /> n_i = 2<br /> strat_var[N] = sum((1/(N))^2*(1/n_i*sigma_i_2)) <br />end<br />s = estimatorVar(stratified)<br />plot(1:50,s,label="Empirical Variance",title="Stratified Sampling")<br />plot!(1:50,strat_var,label="Theoretical Variance")<br /></code></pre><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-kvsgjCZ8_AE/V83Lh2nkS4I/AAAAAAAAFIE/E3_kag5GdrIvO29RCrzJ-Sc7WnSTilzQACLcB/s1600/stratified_plot.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="266" src="https://4.bp.blogspot.com/-kvsgjCZ8_AE/V83Lh2nkS4I/AAAAAAAAFIE/E3_kag5GdrIvO29RCrzJ-Sc7WnSTilzQACLcB/s400/stratified_plot.png" width="400" /></a></div><br /><br /><b>Remarks:</b><br /><br /><ul><li>Stratified Sampling can actually be combined with other variance reduction techniques - for instance, when computing the conditional estimate within a strata, you are free to use antithetic sampling, control variates within the strata. You can even do <a href="https://en.wikipedia.org/wiki/Monte_Carlo_integration#Recursive_stratified_sampling">stratified sampling within your strata</a>! This makes sense if your sampling domain is already partitioned into a bounding volume hierarchy, such as a KD-Tree.</li><li>Antithetic sampling can be regarded as a special case of stratified sampling with 2 strata. The inter-strata variance you are eliminating here is the self-symmetry in the distribution of $X$ (the covariance between one half and another half of the samples).</li></ul><h2>Discussion</h2><br />Here's a plot comparing the standard deviations of these estimators as a function of sample size. Keep in mind that your choice of variance reduction technique should depend on the problem, and no method is objectively better than the others. This is a very simple toy problem so all estimators perform pretty well.<br /><div><br /></div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-8lT2MTUr7Aw/V834-RhwjPI/AAAAAAAAFIo/DiSf1gIc1KUtK096-Xcurm95srbnxHIpwCLcB/s1600/comparison.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="266" src="https://3.bp.blogspot.com/-8lT2MTUr7Aw/V834-RhwjPI/AAAAAAAAFIo/DiSf1gIc1KUtK096-Xcurm95srbnxHIpwCLcB/s400/comparison.png" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: left;">I'm currently working on a follow-up tutorial on Importance Sampling. The explanation is a bit long, so I'm keeping it separate from this tutorial.</div><br />A toy example is all well and good, but where does variance reduction come up in modern scientific computing?<br /><div><h3></h3><h3>Scientific Data Analysis</h3></div><div><br />At a very fundamental level, variance in scientific data either suggests an underlying noisy process or errors. Variance degrades the statistical significance of findings, so it's desirable to conduct experiments and measurements such that the variance of findings is low.</div><h3>Quantitative Finance</h3><span style="font-size: small; font-weight: normal;"><br /></span><span style="font-size: small; font-weight: normal;">In finance, we often simulate expected returns of portfolios by sampling over all possible movements in the market. Certain rare events (for example, a terrorist attack) are highly unlikely but have such a big impact that on the economy that we want to account for it in our expectation. Modeling these "black swan" events are of considerable interest to financial & actuarial industries.</span><br /><h3><div class="separator" style="clear: both; font-size: medium; font-weight: normal; text-align: center;"><a href="https://2.bp.blogspot.com/-e5_6at_ksf0/V8y6BRJayjI/AAAAAAAAFGU/dXVHIjo89J0DUHPC4JoybYye1WXm50mfgCLcB/s1600/var_reduction%2B%25281%2529.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="232" src="https://2.bp.blogspot.com/-e5_6at_ksf0/V8y6BRJayjI/AAAAAAAAFGU/dXVHIjo89J0DUHPC4JoybYye1WXm50mfgCLcB/s640/var_reduction%2B%25281%2529.png" width="640" /></a></div><div class="separator" style="clear: both; font-size: medium; font-weight: normal; text-align: center;"><br /></div></h3><h3></h3><h3>Deep Learning</h3><br />In minibatch stochastic gradient descent (the "hammer" of Deep Learning), we compute the parameter update using the expected gradient of the total loss across the training dataset. It takes too long to consider every point in the dataset, so we estimate gradients via stochastic minibatch. This is nothing more than a Monte Carlo estimate.<br /><br /><div>$$E_x[\nabla_\theta J(x;\theta)] = \int p(x) \nabla_\theta J(x;\theta) dx \approx \frac{1}{m}\sum_i^m \nabla_\theta J(x_i;\theta)$$</div><div><br /></div><div>The less variance we have, the sooner our training converges.<br /><br /></div><h3>Computer Graphics</h3><br />In physically based rendering, we are trying to compute the total incoming light arriving at each camera pixel. The vast majority of paths contribute zero radiance, thus making our samples noisy. Decreasing variance allows us to "de-speckle" the pictures.<br /><br />$$<br />\begin{align*}<br />\mathcal{L}(x,\omega_o) & = \mathcal{L}_e(x,\omega_o) + \int_\Omega{f_r(x,\omega_i,\omega_o)\mathcal{L}(x^\prime,\omega_i)(\omega_o \cdot n_x) d\omega_i} \\<br />& \approx \mathcal{L}_e(x,\omega_o) + \frac{1}{m}\sum_i^m{f_r(x,\omega_i,\omega_o)\mathcal{L}(x^\prime,\omega_i)(\omega_o \cdot n_x)}<br />\end{align*}<br />$$<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://www.mitsuba-renderer.org/devblog/wp-content/uploads/2012/09/integrator_bdpt_path.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://www.mitsuba-renderer.org/devblog/wp-content/uploads/2012/09/integrator_bdpt_path.jpg" height="274" width="320" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><ol></ol></div>Ericnoreply@blogger.com10tag:blogger.com,1999:blog-842965756326639856.post-34518927793065848532016-08-07T23:50:00.002-07:002017-01-12T13:21:28.751-08:00A Beginner's Guide to Variational Methods: Mean-Field ApproximationVariational Bayeisan (VB) Methods are a family of techniques that are very popular in statistical Machine Learning. VB methods allow us to re-write <i>statistical inference</i> problems (i.e. infer the value of a random variable given the value of another random variable) as <i>optimization </i>problems (i.e. find the parameter values that minimize some objective function).<br /><br />This inference-optimization duality is powerful because it allows us to use the latest-and-greatest optimization algorithms to solve statistical Machine Learning problems (and vice versa, minimize functions using statistical techniques).<br /><br />This post is an introductory tutorial on Variational Methods. I will derive the optimization objective for the simplest of VB methods, known as the Mean-Field Approximation. This objective, also known as the<i> Variational Lower Bound</i>, is exactly the same one used in <a href="https://arxiv.org/abs/1312.6114" target="_blank">Variational Autoencoders</a> (a neat paper which I will explain in a follow-up post).<br /><br /><h2><span style="font-weight: normal;">Table of Contents</span></h2><ol><li>Preliminaries and Notation</li><li>Problem formulation</li><li>Variational Lower Bound for Mean-field Approximation</li><li>Forward KL vs. Reverse KL</li><li>Connections to Deep Learning</li></ol><h2 id="aprelim"><span style="font-weight: normal;">Preliminaries and Notation</span></h2><div style="font-size: medium; font-weight: normal;"><div style="font-size: medium; font-weight: normal;"><span style="font-size: small; font-weight: normal;"><br /></span><span style="font-size: small; font-weight: normal;">This article assumes that the reader is familiar with concepts like random variables, probability distributions, and expectations. <a href="https://www.khanacademy.org/math/probability/random-variables-topic">Here's a refresher</a> if you forgot some stuff. Machine Learning & Statistics notation isn't standardized very well, so it's helpful to be really precise with notation in this post:</span><br /><ul style="font-size: medium; font-weight: normal;"><li><span style="font-size: small; font-weight: normal;">Uppercase $X$ denotes a random variable</span></li><li><span style="font-size: small; font-weight: normal;">Uppercase $P(X)$ denotes the probability distribution over that variable</span></li><li><span style="font-size: small; font-weight: normal;">Lowercase $x \sim P(X)$ denotes a value $x$ sampled ($\sim$) from the probability distribution $P(X)$ via some generative process.</span></li><li><span style="font-size: small; font-weight: normal;">Lowercase $p(X)$ is the density function of the distribution of $X$. It is a scalar function over the measure space of $X$.</span></li><li><span style="font-size: small; font-weight: normal;">$p(X=x)$ (shorthand $p(x)$) denotes the density function evaluated at a particular value $x$. </span></li></ul><span style="font-size: small; font-weight: normal;"><br /></span><span style="font-size: small; font-weight: normal;">Many academic papers use the terms "variables", "distributions", "densities", and even "models" interchangeably. This is not necessarily wrong per se, since $X$, $P(X)$, and $p(X)$ all imply each other via a one-to-one correspondence. </span><span style="font-size: small; font-weight: normal;">However, it's confusing to mix these words together because their types are different (it doesn't make sense to <i>sample</i> a function, nor does it make sense to <i>integrate</i> a distribution). </span><br /><br /></div></div><div>We model systems as a collection of random variables, where some variables ($X$) are "observable", while other variables ($Z$) are "hidden". We can draw this relationship via the following graph:<br /><br /><div style="text-align: center;"> <a href="https://4.bp.blogspot.com/-v7hpdnGKg-s/V6fQ69Fc6pI/AAAAAAAAFEY/lfwBnJ-QJ_sQJwtasuw7GjPC8kTsEczLACLcB/s1600/Untitled%2Bpresentation.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://4.bp.blogspot.com/-v7hpdnGKg-s/V6fQ69Fc6pI/AAAAAAAAFEY/lfwBnJ-QJ_sQJwtasuw7GjPC8kTsEczLACLcB/s1600/Untitled%2Bpresentation.png" /></a></div><div><br /></div><br />The edge drawn from $Z$ to $X$ relates the two variables together via the conditional distribution $P(X|Z)$.<br /><div><br /></div><div>Here's a more concrete example: $X$ might represent the "raw pixel values of an image", while $Z$ is a binary variable such that $Z=1$ "if $X$ is an image of a cat".</div><br /><div class="separator" style="clear: both; text-align: center;">$X = $<a href="http://f.tqn.com/y/cats/1/S/6/V/4/cat-deadmouse2081x1446.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://f.tqn.com/y/cats/1/S/6/V/4/cat-deadmouse2081x1446.jpg" height="222" width="320" /></a></div><div class="separator" style="clear: both; text-align: center;">$P(Z=1) = 1$ (definitely a cat)</div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: center;">$X= $<a href="http://fampz.abroaderperspect.netdna-cdn.com/wp-content/uploads/2014/07/1433_1pthmk_sliced_bread__001.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://fampz.abroaderperspect.netdna-cdn.com/wp-content/uploads/2014/07/1433_1pthmk_sliced_bread__001.jpg" height="220" width="320" /></a></div><div class="separator" style="clear: both; text-align: center;">$P(Z=1) = 0$ (definitely not a cat)</div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: center;">$X = $ <a href="https://s-media-cache-ak0.pinimg.com/564x/9f/fb/b7/9ffbb78f65b5d829a91ca082211f116c.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://s-media-cache-ak0.pinimg.com/564x/9f/fb/b7/9ffbb78f65b5d829a91ca082211f116c.jpg" width="226" /></a></div><div class="separator" style="clear: both; text-align: center;">$P(Z=1) = 0.1$ (sort of cat-like)</div><div class="separator" style="clear: both; text-align: center;"><br /></div><div><a href="https://en.wikipedia.org/wiki/Bayes%27_theorem" target="_blank">Bayes' Theorem</a> gives us a general relationship between any pair of random variables:<br /><br />$$p(Z|X) = \frac{p(X|Z)p(Z)}{p(X)}$$<br /><br />The various pieces of this are associated with common names:<br /><br />$p(Z|X)$ is the <b>posterior probability</b>: "given the image, what is the probability that this is of a cat?" If we can sample from $z \sim P(Z|X)$, we can use this to make a cat classifier that tells us whether a given image is a cat or not.<br /><br />$p(X|Z)$ is the <b>likelihood</b>: "given a value of $Z$ this computes how "probable" this image $X$ is under that category ({"is-a-cat" / "is-not-a-cat"}). If we can sample from $x \sim P(X|Z)$, then we generate images of cats and images of non-cats just as easily as we can generate random numbers. If you'd like to learn more about this, see my other articles on generative models: <a href="http://blog.evjang.com/2016/06/generative-adversarial-nets-in.html" target="_blank">[1]</a>, <a href="http://blog.evjang.com/2016/06/understanding-and-implementing.html" target="_blank">[2]</a>.<br /><br />$p(Z)$ is the <b>prior probability</b>. This captures any prior information we know about $Z$ - for example, if we think that 1/3 of all images in existence are of cats, then $p(Z=1) = \frac{1}{3}$ and $p(Z=0) = \frac{2}{3}$.<br /><br /><div><h3 id="ahidden"><span style="font-weight: normal;">Hidden Variables as Priors</span></h3><i><br /></i><i>This is an aside for interested readers. Skip to the <a href="http:/#aproblem" target="_blank">next section</a> to continue with the tutorial.</i><br /><br />The previous cat example presents a very conventional example of observed variables, hidden variables, and priors. However, it's important to realize that the distinction between hidden / observed variables is somewhat arbitrary, and you're free to factor the graphical model however you like.<br /><br />We can re-write Bayes' Theorem by swapping the terms:<br /><br />$$\frac{p(Z|X)p(X)}{p(Z)} = p(X|Z)$$<br /><br />The "posterior" in question is now $P(X|Z)$.<br /><br />Hidden variables can be interpreted from a <a href="https://en.wikipedia.org/wiki/Bayesian_statistics" target="_blank">Bayesian Statistics</a> framework as <i>prior beliefs</i> attached to the observed variables. For example, if we believe $X$ is a multivariate Gaussian, the hidden variable $Z$ might represent the mean and variance of the Gaussian distribution. The distribution over parameters $P(Z)$ is then a <i>prior</i> distribution to $P(X)$.<br /><br />You are also free to choose which values $X$ and $Z$ represent. For example, $Z$ could instead be "mean, cube root of variance, and $X+Y$ where $Y \sim \mathcal{N}(0,1)$". This is somewhat unnatural and weird, but the structure is still valid, as long as $P(X|Z)$ is modified accordingly.<br /><br />You can even "add" variables to your system. The prior itself might be dependent on other random variables via $P(Z|\theta)$, which have prior distributions of their own $P(\theta)$, and those have priors still, and so on. Any hyper-parameter can be thought of as a prior. In Bayesian statistics, <a href="https://en.wikipedia.org/wiki/Turtles_all_the_way_down" target="_blank">it's priors all the way down</a>.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-p1V5dS0-yws/V6ffRxDSOcI/AAAAAAAAFEo/_2smfh67Biod4th84eiZo9vk-wVP6xfwQCLcB/s1600/Untitled%2Bpresentation%2B%25281%2529.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="313" src="https://4.bp.blogspot.com/-p1V5dS0-yws/V6ffRxDSOcI/AAAAAAAAFEo/_2smfh67Biod4th84eiZo9vk-wVP6xfwQCLcB/s320/Untitled%2Bpresentation%2B%25281%2529.png" width="320" /></a></div><br /></div><br /><h2 id="aproblem"><span style="font-weight: normal;">Problem Formulation</span></h2><br />The key problem we are interested in is <i>posterior inference</i>, or computing functions on the hidden variable $Z$. Some canonical examples of posterior inference:<br /><ul><li><i>Given this surveillance footage $X$, did the suspect show up in it?</i></li><li><i>Given this twitter feed $X$, is the author depressed?</i></li><li><i>Given historical stock prices $X_{1:t-1}$, what will $X_t$ be?</i></li></ul><br />We usually assume that we know how to compute functions on likelihood function $P(X|Z)$ and priors $P(Z)$.<br /><br />The problem is, for complicated tasks like above, we often don't know how to sample from $P(Z|X)$ or compute $p(X|Z)$. Alternatively, we might know the form of $p(Z|X)$, but the corresponding computation is so complicated that we cannot evaluate it in a reasonable amount of time. We could try to use sampling-based approaches like <a href="https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo" target="_blank">MCMC</a>, but these are slow to converge.<br /><br /><h2 id="aL"><span style="font-weight: normal;">Variational Lower Bound for Mean-field Approximation</span></h2></div><div><br />The idea behind variational inference is this: let's just perform inference on an easy, parametric distribution $Q_\phi(Z|X)$ (like a Gaussian) for which we know how to do posterior inference, but adjust the parameters $\phi$ so that $Q_\phi$ is as close to $P$ as possible.<br /><br />This is visually illustrated below: the blue curve is the true posterior distribution, and the green distribution is the variational approximation (Gaussian) that we fit to the blue density via optimization.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-OCU72-Cp5lg/V6fxbBAV4oI/AAAAAAAAFE4/BMcR5OYwZqwARnqFnm3I9I_S46O-IH-uQCLcB/s1600/Untitled%2Bpresentation%2B%25282%2529.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="187" src="https://4.bp.blogspot.com/-OCU72-Cp5lg/V6fxbBAV4oI/AAAAAAAAFE4/BMcR5OYwZqwARnqFnm3I9I_S46O-IH-uQCLcB/s400/Untitled%2Bpresentation%2B%25282%2529.png" width="400" /></a></div><br /><br /><div>What does it mean for distributions to be "close"? Mean-field variational Bayes (the most common type) uses the Reverse KL Divergence to as the distance metric between two distributions.</div><br />$$KL(Q_\phi(Z|X)||P(Z|X)) = \sum_{z \in Z}{q_\phi(z|x)\log\frac{q_\phi(z|x)}{p(z|x)}}$$<br /><div><br /></div>Reverse KL divergence measures the amount of information (in nats, or units of $\frac{1}{\log(2)}$ bits) required to "distort" $P(Z)$ into $Q_\phi(Z)$. We wish to minimize this quantity with respect to $\phi$.<br /><br />By definition of a conditional distribution, $p(z|x) = \frac{p(x,z)}{p(x)}$. Let's substitute this expression into our original $KL$ expression, and then distribute:<br /><br />$$<br />\begin{align}<br />KL(Q||P) & = \sum_{z \in Z}{q_\phi(z|x)\log\frac{q_\phi(z|x)p(x)}{p(z,x)}} && \text{(1)} \\<br />& = \sum_{z \in Z}{q_\phi(z|x)\big(\log{\frac{q_\phi(z|x)}{p(z,x)}} + \log{p(x)}\big)} \\<br />& = \Big(\sum_{z}{q_\phi(z|x)\log{\frac{q_\phi(z|x)}{p(z,x)}}}\Big) + \Big(\sum_{z}{\log{p(x)}q_\phi(z|x)}\Big) \\<br />& = \Big(\sum_{z}{q_\phi(z|x)\log{\frac{q_\phi(z|x)}{p(z,x)}}}\Big) + \Big(\log{p(x)}\sum_{z}{q_\phi(z|x)}\Big) && \text{note: $\sum_{z}{q(z)} = 1 $} \\<br />& = \log{p(x)} + \Big(\sum_{z}{q_\phi(z|x)\log{\frac{q_\phi(z|x)}{p(z,x)}}}\Big) \\<br />\end{align}<br />$$<br /><br />To minimize $KL(Q||P)$ with respect to variational parameters $\phi$, we just have to minimize $\sum_{z}{q_\phi(z|x)\log{\frac{q_\phi(z|x)}{p(z,x)}}}$, since $\log{p(x)}$ is fixed with respect to $\phi$. Let's re-write this quantity as an expectation over the distribution $Q_\phi(Z|X)$.<br /><br />$$<br />\begin{align}<br />\sum_{z}{q_\phi(z|x)\log{\frac{q_\phi(z|x)}{p(z,x)}}} & = \mathbb{E}_{z \sim Q_\phi(Z|X)}\big[\log{\frac{q_\phi(z|x)}{p(z,x)}}\big]\\<br />& = \mathbb{E}_Q\big[ \log{q_\phi(z|x)} - \log{p(x,z)} \big] \\<br />& = \mathbb{E}_Q\big[ \log{q_\phi(z|x)} - (\log{p(x|z)} + \log(p(z))) \big] && \text{(via $\log{p(x,z)=p(x|z)p(z)}$) }\\<br />& = \mathbb{E}_Q\big[ \log{q_\phi(z|x)} - \log{p(x|z)} - \log(p(z))) \big] \\<br />\end{align} \\<br />$$<br /><br />Minimizing this is equivalent to <i>maximizing</i> the negation of this function:<br /><br />$$<br />\begin{align}<br />\text{maximize } \mathcal{L} & = -\sum_{z}{q_\phi(z|x)\log{\frac{q_\phi(z|x)}{p(z,x)}}} \\<br />& = \mathbb{E}_Q\big[ -\log{q_\phi(z|x)} + \log{p(x|z)} + \log(p(z))) \big] \\<br />& = \mathbb{E}_Q\big[ \log{p(x|z)} + \log{\frac{p(z)}{ q_\phi(z|x)}} \big] && \text{(2)} \\<br />\end{align}<br /> $$<br /><br />In literature, $\mathcal{L}$ is known as the <i>variational lower bound</i>, and is computationally tractable if we can evaluate $p(x|z), p(z), q(z|x)$. We can further re-arrange terms in a way that yields an intuitive formula:<br /><br />$$<br />\begin{align*}<br />\mathcal{L} & = \mathbb{E}_Q\big[ \log{p(x|z)} + \log{\frac{p(z)}{ q_\phi(z|x)}} \big] \\<br />& = \mathbb{E}_Q\big[ \log{p(x|z)} \big] + \sum_{Q}{q(z|x)\log{\frac{p(z)}{ q_\phi(z|x)}}} && \text{Definition of expectation} \\<br />& = \mathbb{E}_Q\big[ \log{p(x|z)} \big] - KL(Q(Z|X)||P(Z)) && \text{Definition of KL divergence} && \text{(3)}<br />\end{align*}<br />$$<br /><br />If sampling $z \sim Q(Z|X)$ is an "encoding" process that converts an observation $x$ to latent code $z$, then sampling $x \sim Q(X|Z)$ is a "decoding" process that reconstructs the observation from $z$.<br /><br />It follows that $\mathcal{L}$ is the sum of the expected "decoding" likelihood (how good our variational distribution can decode a sample of $Z$ back to a sample of $X$), plus the KL divergence between the variational approximation and the prior on $Z$. If we assume $Q(Z|X)$ is conditionally Gaussian, then prior $Z$ is often chosen to be a diagonal Gaussian distribution with mean 0 and standard deviation 1.<br /><br />Why is $\mathcal{L}$ called the variational lower bound? Substituting $\mathcal{L}$ back into Eq. (1), we have:<br /><br />$$<br />\begin{align*}<br />KL(Q||P) & = \log p(x) - \mathcal{L} \\<br />\log p(x) & = \mathcal{L} + KL(Q||P) && \text{(4)}<br />\end{align*}<br />$$<br /><br />The meaning of Eq. (4), in plain language, is that $p(x)$, the log-likelihood of a data point $x$ under the true distribution, is $\mathcal{L}$, plus an error term $KL(Q||P)$ that captures the distance between $Q(Z|X=x)$ and $P(Z|X=x)$ at that particular value of $X$.<br /><br />Since $KL(Q||P) \geq 0$, $\log p(x)$ must be greater than $\mathcal{L}$. Therefore $\mathcal{L}$ is a <i>lower bound</i> for $\log p(x)$. $\mathcal{L}$ is also referred to as evidence lower bound (ELBO), via the alternate formulation:<br /><br />$$<br />\mathcal{L} = \log p(x) - KL(Q(Z|X)||P(Z|X)) = \mathbb{E}_Q\big[ \log{p(x|z)} \big] - KL(Q(Z|X)||P(Z))<br />$$<br /><br />Note that $\mathcal{L}$ itself contains a KL divergence term between the approximate posterior and the prior, so there are two KL terms in total in $\log p(x)$.<br /><br /></div><div><h2 id="afr"><span style="font-weight: normal;">Forward KL vs. Reverse KL</span></h2><div><br />KL divergence is <i>not</i> a symmetric distance function, i.e. $KL(P||Q) \neq KL(Q||P)$ (except when $Q \equiv P$) The first is known as the "forward KL", while the latter is "reverse KL". So why do we use Reverse KL? This is because the resulting derivation would require us to know how to compute $p(Z|X)$, which is what we'd like to do in the first place.<br /><div><br />I really like Kevin Murphy's explanation in the <a href="https://www.cs.ubc.ca/~murphyk/MLbook/" target="_blank">PML textbook</a>, which I shall attempt to re-phrase here:<br /><br />Let's consider the forward-KL first. As we saw from the above derivations, we can write KL as the expectation of a "penalty" function $\log \frac{p(z)}{q(z)}$ over a weighing function $p(z)$.<br /><br />$$<br />\begin{align*}<br />KL(P||Q) & = \sum_z p(z) \log \frac{p(z)}{q(z)} \\<br />& = \mathbb{E}_{p(z)}{\big[\log \frac{p(z)}{q(z)}\big]}\\<br />\end{align*}<br />$$<br /><br />The penalty function contributes loss to the total KL wherever $p(Z) > 0$. For $p(Z) > 0$, $\lim_{q(Z) \to 0} \log \frac{p(z)}{q(z)} \to \infty$. This means that the forward-KL will be large wherever $Q(Z)$ fails to "cover up" $P(Z)$.<br /><br />Therefore, the forward-KL is minimized when we ensure that $q(z) > 0$ wherever $p(z)> 0$. The optimized variational distribution $Q(Z)$ is known as "zero-avoiding" (density avoids zero when $p(Z)$ is zero).<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-oNG_y0SUvg4/V6ghC_9q9MI/AAAAAAAAFFs/q4kH7gxJIrQqBsyM-wIUJ_xNeeys2nZqACLcB/s1600/forward-KL.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="296" src="https://3.bp.blogspot.com/-oNG_y0SUvg4/V6ghC_9q9MI/AAAAAAAAFFs/q4kH7gxJIrQqBsyM-wIUJ_xNeeys2nZqACLcB/s640/forward-KL.png" width="640" /></a></div><br /><br />Minimizing the Reverse-KL has exactly the opposite behavior:<br /><br />$$<br />\begin{align*}<br />KL(Q||P) & = \sum_z q(z) \log \frac{q(z)}{p(z)} \\<br />& = \mathbb{E}_{p(z)}{\big[\log \frac{q(z)}{p(z)}\big]}<br />\end{align*}<br />$$<br /><div><br /></div>If $p(Z) = 0$, we must ensure that the weighting function $q(Z) = 0$ wherever denominator $p(Z) = 0$, otherwise the KL blows up. This is known as "zero-forcing":<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-fHQrw49f-GI/V6ghG9Lv2YI/AAAAAAAAFFw/KUHaZF9Xu-8W4nIUJQZp0T_4nbY63Tz0gCLcB/s1600/reverse-KL.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="338" src="https://3.bp.blogspot.com/-fHQrw49f-GI/V6ghG9Lv2YI/AAAAAAAAFFw/KUHaZF9Xu-8W4nIUJQZp0T_4nbY63Tz0gCLcB/s640/reverse-KL.png" width="640" /></a></div><br />So in summary, minimizing forward-KL "stretches" your variational distribution $Q(Z)$ to cover <b>over</b> the entire $P(Z)$ like a tarp, while minimizing reverse-KL "squeezes" the $Q(Z)$ <b>under</b> $P(Z)$.<br /><br />It's important to keep in mind the implications of using reverse-KL when using the mean-field approximation in machine learning problems. If we are fitting a unimodal distribution to a multi-modal one, we'll end up with more false negatives (there is actually probability mass in $P(Z)$ where we think there is none in $Q(Z)$).<br /><br /></div><h2 id="adl"><span style="font-weight: normal;">Connections to Deep Learning</span></h2><br />Variational methods are really important for Deep Learning. I will elaborate more in a later post, but here's a quick spoiler:<br /><ol><li>Deep learning is really good at optimization (specifically, gradient descent) over very large parameter spaces using lots of data.</li><li>Variational Bayes give us a framework with which we can re-write statistical inference problems as optimization problems.</li></ol>Combining Deep learning and VB Methods allow us to perform inference on <i>extremely </i>complex posterior distributions. As it turns out, modern techniques like Variational Autoencoders optimize the exact same mean-field variational lower-bound derived in this post!<br /><br />Thanks for reading, and stay tuned!<br /><br /><br /><div style="font-family: "times new roman";"><div style="margin: 0px;"><div style="font-family: "times new roman";"><br /></div><div style="font-family: "times new roman";"><br /></div></div></div></div></div></div>Ericnoreply@blogger.com18tag:blogger.com,1999:blog-842965756326639856.post-89550201106278999572016-07-24T23:56:00.000-07:002016-09-03T17:12:56.805-07:00Why Randomness is Important for Deep LearningThis afternoon I attempted to explain to my mom why <a href="https://en.wikipedia.org/wiki/Randomness" target="_blank">Randomness</a> is important for <a href="https://en.wikipedia.org/wiki/Deep_learning" target="_blank">Deep Learning</a> without using any jargon from probability, statistics, or machine learning.<br /><br />The exercise was partially successful. Maybe. I still don't think she knows what Deep Learning is, besides that I am a <i>big fan</i> of it and use it for my job.<br /><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="http://www.nasa.gov/centers/glenn/images/content/508159main_1944_woodfan_946-710.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="http://www.nasa.gov/centers/glenn/images/content/508159main_1944_woodfan_946-710.jpg" height="300" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">I'm a big fan of Deep Learning</td></tr></tbody></table><br />This post is a slightly more technical version of the explanation I gave my mom, with the hope that it will help Deep Learning practitioners to better understand what is going on in neural networks.<br /><br />If you are getting started into Deep Learning research, you may discover that there are a whole bunch of seemingly arbitrary techniques used to train neural networks, with very little theoretical justification besides "it works". For example: dropout regularization, adding gradient noise,, asynchronous stochastic descent.<br /><br />What do all these hocus pocus techniques have in common? They incorporate randomness!<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://25.media.tumblr.com/ced892e7d72771e32ef50b905e8da33c/tumblr_mx2am9n0h61rioxyio1_500.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://25.media.tumblr.com/ced892e7d72771e32ef50b905e8da33c/tumblr_mx2am9n0h61rioxyio1_500.gif" height="320" width="320" /></a></div><br />Random noise actually is crucial for getting DNNs to work well:<br /><br /><ol><li>Random noise allows neural nets to produce multiple outputs given the same instance of input.</li><li>Random noise limits the amount of information flowing through the network, forcing the network to learn meaningful representations of data. </li><li>Random noise provides "exploration energy" for finding better optimization solutions during gradient descent.</li></ol><h2> Single Input, Multiple Output</h2><div><br /></div><div>Suppose you are training a deep neural network (DNN) to classify images.<br /><br /><br /><div style="-webkit-text-stroke-width: 0px; color: black; font-family: "Times New Roman"; font-size: medium; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 1; word-spacing: 0px;"></div><br /><div class="separator" style="-webkit-text-stroke-width: 0px; clear: both; color: black; font-family: "Times New Roman"; font-size: medium; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; margin: 0px; orphans: auto; text-align: center; text-indent: 0px; text-transform: none; white-space: normal; widows: 1; word-spacing: 0px;"><a href="https://devblogs.nvidia.com/wp-content/uploads/2014/11/imagenet_example.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="281" src="https://devblogs.nvidia.com/wp-content/uploads/2014/11/imagenet_example.png" style="cursor: move;" width="400" /></a></div><br /></div><div>For each cropped region, the network learns to convert an image into a number representing a class label, such as “dog” or "person".<br /><br />That’s all very well and good, and these kind of DNNs do not require randomness in their inference model. After all, any image of a dog should be mapped to the “dog” label, and there is nothing random about that.<br /><br />Now suppose you are training a deep neural network (DNN) to play <a href="https://en.wikipedia.org/wiki/Go_(game)">Go (game)</a>. In the case of the image below, the DNN has to make the first move.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://upload.wikimedia.org/wikipedia/commons/thumb/9/9b/Blank_Go_board.svg/2000px-Blank_Go_board.svg.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://upload.wikimedia.org/wikipedia/commons/thumb/9/9b/Blank_Go_board.svg/2000px-Blank_Go_board.svg.png" width="320" /></a></div><div><br /></div><div><br /></div><div>If you used the same deterministic strategy described above, you will find that this network fails to give good results. Why? Because there is no single “optimal starting move” - for each possible stone placement on the board, there is a equally good stone placement on the other side board, via rotational symmetry. There are <i>multiple </i>best answers.<br /><div><br />If the network is deterministic and only capable of picking one output per input, then the optimization process will force the network to choose the move that averages all the best answers, which is smack-dab in the middle of the board. This behavior is highly undesirable, as the center of the board is generally regarded as a bad starting move.<br /><br />Hence, randomness is important when you want the network to be able to output multiple possibilities given the same input, rather than generating the same output over and over again. This is crucial when there are underlying symmetries in the action space - incorporating randomness literally helps us break out of the "<a href="https://en.wikipedia.org/wiki/Buridan%27s_ass">stuck between two bales of hay</a>" scenario.<br /><br />Similarly, if we are training a neural net to compose music or draw pictures, we don’t want it to always draw the same thing or play the same music every time it is given a blank sheet of paper. We want some notion of “variation”, "surprise", and “creativity” in our generative models.<br /><br />One approach to incorporating randomness into DNNs is to keep the network deterministic, but have its outputs be the parameters of a probability distribution, which we can then draw samples from using conventional sampling methods to generate “stochastic outputs”.<br /><br />Deepmind's AlphaGo utilized this principle: given an image of a Go board, it outputs the probability of winning given each possible move. The practice of modeling a distribution after the output of the network is commonly used in other areas of Deep Reinforcement Learning.</div><div><br /><h2>Randomness and Information Theory</h2></div><div><br />During my first few courses in probability & statistics, I really struggled to understand the <i>physical meaning </i>of randomness. When you flip a coin, where does this <i>randomness</i> come from? Is randomness just <a href="https://en.wikipedia.org/wiki/Chaos_theory" target="_blank">deterministic chaos</a>? Is it possible for something to be actually <i>random</i>? </div><div><br /></div><div>To be honest, I still don't fully understand. </div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://i.ytimg.com/vi/QravXR5Wm-Q/maxresdefault.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="https://i.ytimg.com/vi/QravXR5Wm-Q/maxresdefault.jpg" width="400" /></a></div><div><br /><a href="https://en.wikipedia.org/wiki/Information_theory" target="_blank">Information theory</a> offers a definition of randomness that is grounded enough to use without being kept wide awake at night: "randomness" is nothing more than the "lack of information".<br /><br />More specifically, the amount of <i>information </i>in an object is the length (e.g. in bits, kilobytes, etc.) of the shortest computer program needed to fully describe it. For example, the first 1 million digits of $\pi = 3.14159265....$ can be represented as a string of length 1,000,002 characters, but it can be more compactly represented using 70 characters, via an implementation of the Leibniz Formula:<br /><br /><script src="https://gist.github.com/ericjang/ba22adbd9f13ca1fcfc40ed8c520f566.js"></script> The above program is nothing more than a <i>compressed</i> version of a million digits of $\pi$. A more concise program could probably express the first million digits of $\pi$ in far fewer bits.<br /><br />Under this interpretation, <i>randomness</i> is "that which cannot be compressed". While the first million digits of $\pi$ can be compressed and are thus not random, empirical evidence suggests (but has not proven) that $\pi$ itself is <a href="https://en.wikipedia.org/wiki/Normal_number" target="_blank">normal number</a>, and thus the amount of information encoded in $\pi$ is infinite.<br /><br />Consider a number $a$ that is equal to the first trillion digits of $\pi$, $a = 3.14159265...$. If we add to that a uniform random number $r$ that lies in the range $(-0.001,0.001)$, we get a number that ranges in between $3.14059...$ and $3.14259...$. The resulting number $a + r$ now only carries ~three digits worth of information, because the process of adding random noise destroyed any information carried after the hundredth's decimal place.<br /><br /><h2>Limiting Information in Neural Nets</h2><br />What does this definition of randomness have to deal with randomness?<br /><br />Another way randomness is incorporated into DNNs is by injecting noise directly into the network itself, rather than using the DNN to model a distribution. This makes the task "harder" to learn as the network has to overcome these internal "perturbations".<br /><br />Why on Earth would you want to do this? The basic intuition is that noise limits the amount of information you can pass through a channel.<br /><br />Consider an autoencoder, a type of neural network architecture that attempts to learn an efficient encoding of data by "squeezing" the input into fewer dimensions in the middle and re-constituting the original data at the other end. A diagram is shown below:<br /><br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://nghiaho.com/wp-content/uploads/2012/12/autoencoder_network1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://nghiaho.com/wp-content/uploads/2012/12/autoencoder_network1.png" height="233" width="640" /></a></div><br /><br />During inference, inputs flow from the left through the nodes of the network and come out the other side, just like a pipe.<br /><br />If we consider a theoretical neural network that operates on real numbers (rather than floating point numbers), then without noise in the network, every layer of the DNN actually has access to infinite information bandwidth.<br /><br />Even though we are squeezing representations (the pipe) into fewer hidden units, the network could still learn to encode the previous layer's data into the decimal point values without actually learning any meaningful features. In fact, we could represent all the information in the network with a single number. This is undesirable.<br /><div><br /></div>By limiting the amount of information in a network, we force it to learn <i>compact representations</i> of input features. Several ways to go about this:<br /><br /><ul><li><a href="https://arxiv.org/abs/1312.6114" target="_blank">Variational autoencoders (VAE)</a> add Gaussian noise to the hidden layer. This noise destroys "excess information," forcing the network to learn compact representations of data. </li><li>Closely related to VAE noise (possibly equivalent?) to this is the idea of <a href="http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf" target="_blank">Dropout Regularization</a> - randomly zeroing out some fraction of units during training. Like the VAE, dropout noise forces the network to learn useful information under limited bandwidth. </li><li><a href="https://arxiv.org/abs/1603.09382" target="_blank">Deep Networks with Stochastic Depth</a> - similar idea to dropout, but at a per-layer level rather than per-unit level.</li><li>There's a very interesting paper called <a href="http://arxiv.org/abs/1602.02830" target="_blank">Binarized Neural Networks</a> that uses binary weights and activations in the inference pass, but real-valued gradients in the backward pass. The source of noise comes from the fact that the gradient is a noisy version of the binarized gradient. While BinaryNets are not necessarily more powerful than regular DNNs, individual units can only encode one bit of information, which regularizes against two features being squeezed into a single unit via floating point encoding. </li></ul><div>More efficient compression schemes mean better generalization at test-time, which explains why dropout works so well for over-fitting. If you decide to use regular autoencoders over variational autoencoders, you <i>must</i> use a stochastic regularization trick such as dropout to control how many bits your compressed features should be, otherwise you will likely over-fit. </div><div><br /></div><div>I think VAEs are objectively superior because they are easy to implement and allow you to specify exactly how many bits of information are passing through each layer.</div><div><br /></div><h2>Exploration "Energy"</h2></div><br />DNNs are usually trained via some variant of gradient descent, which basically amounts to finding the parameter update that "rolls downhill" along some loss function. When you get to the bottom of the deepest valley, you've found the best possible parameters for your neural net.<br /><br />The problem with this approach is that neural network loss surfaces have a lot of local minima and plateaus. It's easy to get stuck in a small dip or a flat portion where the slope is already zero (local minima) but you are not done yet.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-SxVce9sBxZg/V5WrK03V8GI/AAAAAAAAFDk/LPUZzgHGyHY3AEXXuImI2zFrrhfCAnfowCLcB/s1600/23.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="235" src="https://2.bp.blogspot.com/-SxVce9sBxZg/V5WrK03V8GI/AAAAAAAAFDk/LPUZzgHGyHY3AEXXuImI2zFrrhfCAnfowCLcB/s400/23.gif" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: center;"><br /></div>The third interpretation of how randomness assists Deep Learning models is based on the idea of the exploration.<br /><br />Because the datasets used to train DNNs are huge, it’s too expensive to compute the gradient across terabytes of data for every single gradient descent step. Instead, we use <i>stochastic gradient descent </i>(SGD), where we just compute the average gradient across a small subset of the dataset chosen randomly from the dataset.<br /><br />In evolution, if the success of a species is modeled by random variable $X$, then random mutation or noise increases the variance of $X$ - its progeny could either be far better off (adaptation, poison defense) or far worse off (lethal defects, sterility).<br /><br />In numerical optimization, this "genetic mutability" is called "thermodynamic energy", or Temperature that allows the parameter update trajectory to not always "roll downhill", but occasionally bounce out of a local minima or "tunnel through" hills.<br /><br />This is all deeply related to the Exploration-vs.-Exploitation tradeoff as formulated in RL. Training a purely deterministic DNN with zero gradient noise has zero exploitation capabilities - it converges straight to the nearest local minima, however shallow.<br /><br />Using stochastic gradients (either via small minibatches or <a href="https://arxiv.org/abs/1511.06807" target="_blank">literally adding noise to the gradients themselves</a>) is an effective way to allow the optimization to do a bit of "searching" and "bouncing" out of weak local minima. Asynchronous stochastic gradient descent, in which many machines are performing gradient descent in parallel, is another possible source of noise.<br /><br />This "thermodynamic" energy ensures symmetry-breaking during early stages of training, to ensure that all the gradients in a layer are not synchronized to the same values. Not only does noise perform symmetry breaking in <i>action space </i>of the neural net, but noise also performs symmetry breaking in the <i>parameter space</i> of the neural net.<br /><br /><h2>Closing Thoughts</h2></div><div><br />I find it really interesting that random noise actually <i>helps </i>Artificial Intelligence algorithms to avoid over-fitting and explore the solution space during optimization or Reinforcement Learning. It raises interesting philosophical questions on whether the <a href="http://www.scholarpedia.org/article/Neuronal_noise" target="_blank">inherent noisiness</a> of our neural code is a feature, not a bug. </div><div><br /></div><div>One theoretical ML research question I am interested in is whether all these neural network training tricks are actually variations of some general regularization theorem. Perhaps <a href="http://www.eecs.tufts.edu/~dsculley/papers/compressionAndVectors.pdf" target="_blank">theoretical work on compression</a> will be really useful for understanding this. </div><div><br /></div><div>It would be interesting to check the information capacity of various neural networks relative to hand-engineered feature representations, and see how that relates to overfitting tendency or quality of gradients. It's certainly not trivial to measure the information capacity of a network with dropout or trained via SGD, but I think it can be done. For example, constructing a database of synthetic vectors whose information content (in bits, kilobytes, etc) is exactly known, and seeing how networks of various sizes, in combination with techniques like dropout, deal with learning a generative model of that dataset.</div>Ericnoreply@blogger.com12tag:blogger.com,1999:blog-842965756326639856.post-8142726103628692612016-07-16T11:22:00.002-07:002016-07-16T11:22:17.274-07:00What product breakthroughs will recent advances in deep learning enable?<i>This is re-posted from a </i><i><a href="https://www.quora.com/What-product-breakthroughs-will-recent-advances-in-deep-learning-enable/answer/Eric-Jang" target="_blank">Quora answer</a> </i><i>I wrote on on 6/11/16</i>.<br /><br />Deep Learning refers to a class of machine learning (ML) techniques that combine the following:<br /><ul><li>Large neural networks (millions of free parameters)</li><li>High performance computing ( thousands of processors running in parallel)</li><li>Big Data (e.g. millions of color images or recorded chess games)</li></ul>Deep learning techniques currently achieve state of the art performance in a multitude of problem domains (vision, audio, robotics, natural language processing, to name a few). Recent advances in Deep Learning also incorporate ideas from statistical learning [1,2], reinforcement learning (RL) [3], and numerical optimization . For a broad survey of the field, see [9,10].<br /><br />In no particular order, here are some product categories made possible with today's deep learning techniques: <div><ul><li>customized data compression</li><li>compressive sensing</li><li>data-driven sensor calibration</li><li>offline AI</li><li>human-computer interaction</li><li>gaming, artistic assistants</li><li>unstructured data mining</li><li>voice synthesis</li></ul></div><div><h2>Customized data compression</h2>Suppose you are designing a video conferencing app, and want to come up with a lossy encoding scheme to reduce the number of packets you need to send over the Internet. </div><div><br /></div><div>You could use an off-the-shelf codec like H.264, but H.264 is not optimal because it is calibrated for generic video - anything from cat videos to feature films to clouds. It would be nice if instead we had a video codec that was optimized for specifically FaceTime videos. We can save even more bytes than a generic algorithm if we take advantage of the fact that most of the time, there is a face in the center of the screen. However, designing such an encoding scheme is tricky:</div><div><ul><li>How do we specify where the face is positioned, how much eyebrow hair the subject has, what color their eyes are, the shape of their jaw? </li><li>What if their hair is covering one of their eyes? </li><li>What if there are zero or multiple faces in the picture?</li></ul></div><div><br /><img src="https://qph.ec.quoracdn.net/main-qimg-d1a17f35eff08180a8d1bbb254659186-c?convert_to_webp=true" /><br /><br />Deep learning can be applied here. Auto-encoders are a type of neural network whose output is merely a copy of the input data. Learning this "identity mapping" would be trivial if it weren't for the fact that the hidden layers of the auto-encoder are chosen to be smaller than the input layer. This "information bottleneck" forces the auto-encoder to learn an compressed representation of the data in the hidden layer, which is then decoded back to the original form by the remaining layers in the network.</div><div><br /><img src="https://qph.ec.quoracdn.net/main-qimg-284d655682a7894c481155154feec2ae-c?convert_to_webp=true" /><br /><br /></div><div>Through end-to-end training, auto-encoders and other deep learning techniques *adapt* to the specific nuances of your data. Unlike principal components analysis, the encoding and decoding steps are not limited to affine (linear) transformations. PCA learns an "encoding linear transform", while auto-encoders learn a "encoding program".<br /><br />This makes neural nets far more powerful, and allows for complex, domain-specific compression; anything from storing a gazillion selfies on Facebook, to faster YouTube video streaming, to scientific data compression, to reducing the space needed for your personal iTunes library. Imagine if your iTunes library learned a "country music" auto-encoder just to compress your personal music collection!<br /><h2><br />Compressive sensing</h2>Compressive sensing is closely related to the decoding aspects of lossy compression. Many interesting signals have a particular structure to them - that is, the distribution of signals is not completely arbitrary. This means that we don't actually have to sample at the Nyquist limit in order to obtain a perfect reconstruction of the signal, as long our decoding algorithm can properly exploit the underlying structure.<br /><br />Deep learning is applicable here because we can use neural networks to learn the sparse structure without manual feature engineering. Some product applications:<br /><ul><li>Super-resolution algorithms (waifu2X)- literally an "enhance" button like those from CSI Miami</li><li>using WiFi radio wave interference to see people through walls (MIT Wi-Vi)</li><li>interpreting 3D structure of an object given incomplete observations (such as a 2D image or partial occlusion</li><li>more accurate reconstructions from sonar / LIDAR data</li></ul><h2>Data-driven sensor calibration</h2>Good sensors and measurement devices often rely on expensive, precision-manufactured components.<br /><br />Take digital cameras, for example. Digital cameras assume the glass lens is of a certain "nice" geometry. When taking a picture, the onboard processor solves the light transport equations through the lens to compute the final image.</div><div><br /><img src="https://qph.ec.quoracdn.net/main-qimg-31acb0c436be2884a845dacf399ef888-c?convert_to_webp=true" /><br /><br />If the lens is scratched, or warped or shaped like a bunny (instead of a disc) these assumptions are broken and the images no longer turn out well. Another example: our current decoding models used in MRI and EEG assume the cranium is a perfect sphere in order to keep the math manageable [4]. This sort of works, but sometimes we miss the location of a tumor by a few mm. More accurate photographic and MRI imaging ought to compensate for geometric deviation, whether they result from underlying sources or manufacturing defects.<br /><br />Fortunately, deep learning allows us to calibrate our decoding algorithms with data.<br /><br />Instead of a one-size-fits-all decoding model (such as a Kalman filter), we can express more complex biases specifically tuned to each patient or each measuring device. If our camera lens is scratched, we can train the decoding software to implicitly compensate for the altered geometry. This means we no longer have to manufacture and align sensors with utmost precision, and this saves a lot of money.<br /><br />In some cases, we can do away with hardware completely and let the decoding algorithm compensate for that; the Columbia Computational Photography lab has developed a kind of camera that doesn't have a lens. Software-defined imaging, so to speak.</div><div><br /></div><div><br /><img src="https://qph.ec.quoracdn.net/main-qimg-df23947cf2ac2f0828a41011b08ad34a-c?convert_to_webp=true" /><br /><br /><h2>Offline AI</h2>Being able to run AI algorithms without Internet is crucial for apps that have low latency requirements (I.e. self driving cars & robotics) or do not have reliable connectivity (smartphone apps for traveling).<br /><br />Deep Learning is especially suitable for this. After the training phase, neural networks can run the feed forward step very quickly. Furthermore, it is straightforward to shrink down large neural nets into small ones, until they are portable enough to run on a smartphone (at the expense of some accuracy).<br /><br />Google has already done this in their offline camera translation feature in Google Translate App [6].</div><div><br /><img src="https://qph.ec.quoracdn.net/main-qimg-89e40936cca44a7332f06e371a8f841d?convert_to_webp=true" /><br /><br />Some other possibilities:<br /><ul><li>Intelligent assistants (e.g. Siri) that retain some functionality even when offline.</li><li>wilderness survival app that tells you if that plant is poison ivy, or whether those mushrooms are safe to eat</li><li>small drones with on-board TPU chips [11] that can perform simple obstacle avoidance and navigation</li></ul><br /><h2>Human-computer interaction</h2><br />Deep Neural Networks are the first kind of models that can really see and hear our worldwith an acceptable level of robustness. This opens up a lot of possibilities for Human-Computer Interaction.<br /><br />Cameras can now be used to read sign language and read books aloud to people. In fact, deep neural networks can now describe to us in full sentences what they see [12]. Baidu's DuLight project is enabling visually-impaired people to see the world around them through a sight-to-speech earpiece.<br /><br /><a href="https://www.youtube.com/watch?v=Xe5RcJ1JY3c">Dulight--Eyes for visually impaired</a><br /><br />We are not limited to vision-based HCI. Deep learning can help calibrate EEG interfaces for paraplegics to interact with computers more rapidly, or provide more accurate decoding tech for projects like Soli [7].<br /><h2><br />Gaming</h2><br />Games are computationally challenging because they run physics simulation, AI logic, rendering, and multiplayer interaction together in real time. Many of these components have at least O(N^2) in complexity, so our current algorithms have hit their Moore's ceiling.<br /><br />Deep learning pushes the boundaries on what games are capable of in several ways.<br /><br />Obviously, there's the "game AI" aspect. In current video games, AI logic for non-playable characters (NPC) are not much more than a bunch of if-then-else statements tweaked to imitate intelligent behavior. This is not clever enough for advanced gamers, and leads to somewhat unchallenging character interaction in single-player mode. Even in multiplayer, a human player is usually the smartest element in the game loop.<br /><br />This changes with Deep Learning. Google Deepmind's AlphaGo has shown us that Deep Neural Networks, combined with policy gradient learning, are powerful enough to beat the strongest of human players at complex games like Go. The Deep Learning techniques that drive AlphaGo may soon enable NPCs that can exploit the player's weaknesses and provide a more engaging gaming experience. Game data from other players can be sent to the cloud for training the AI to learn from its own mistakes.<br /><br />Another application of deep learning in games is physics simulation. Instead of simulating fluids and particles from first principles, perhaps we can turn the nonlinear dynamics problem into a regression problem. For instance, if we train a neural net to learn the physical rules that govern fluid dynamics, we can evaluate it quickly during gameplay without having to perform large-scale solutions to Navier stokes equations in real time.<br /><br />In fact, this has been done already by Ladicky and Jeong 2015 [8].<br /><img src="https://qph.ec.quoracdn.net/main-qimg-8ba3aa308bf06a0793500e1192365152?convert_to_webp=true" /><br /><br />For VR applications that must run at 90 FPS minimum, this may be the only viable approach given current hardware constraints.<br /><br />Third, deep generative modeling techniques can be used to create unlimited, rich procedural content - fauna, character dialogue, animation, music, perhaps the narrative of the game itself. This is an area that is just starting to be explored by games like No Man's Sky, which could potentially make games with endless novel content.</div><div><br /><img src="https://qph.ec.quoracdn.net/main-qimg-861d0965c2b849f3a18dada8c1b03e6a?convert_to_webp=true" /><br /><br />To add a cherry on top, Deep Neural nets are well suited for parallel mini-batched evaluation, which means that AI logic for a 128 NPCs or 32 water simulations might be evaluated simultaneously on a single graphics card.<br /><br /><h2>Artistic Assistants</h2><br />Given how well neural networks perceive images, audio, and text, it's no surprise that they also work when we use them to draw paintings [13], compose music [14], and write fiction [15].<br /><img src="https://qph.ec.quoracdn.net/main-qimg-b65542083447c7b57aa8ff9517e1823d?convert_to_webp=true" /><br /><br />People have been trying to get computers to compose music and paint pictures for ages, but deep learning is the first one that actually generates "good results". There are already several apps in the App Store that implement these algorithms for giggles, but soon we may see them as assistive generators/filters in professional content creation software.<br /><br /><h2>Data Mining from Unstructured Data</h2><br />Deep learning isn't at the level where it can extract the same amount of information humans can from web pages, but the vision capabilities of deep neural nets are good enough for allowing machines to understand more than just hypertext.<br /><br />For instance:<br /><ul><li>Parsing events from scanned flyers</li><li>identifying which products on EBay are the same</li><li>determining consumer sentiment from webcam</li><li>extracting blog content from pages without RSS feeds</li><li>integrate photo information into valuing financial instruments, insurance policies, and credit scores.</li></ul><h2>Voice synthesis</h2><br />Generative modeling techniques have come far enough and there is sufficient data out there that it is only a matter of time before someone makes an app that reads aloud to you in Morgan Freeman's or Scarlet Johansen's voice. At Vanguard, my voice is my password.<br /><h2><br />Bonus: more products</h2><ul><li>Adaptive OS / Network stack scheduling - scheduling threads and processes in an OS is a NP hard problem. We don't have a very satisfactory solution to this right now, and scheduling algorithms in modern operating systems, filesystems, and TCP/IP implementations are all fairly simple. Perhaps if a small neural net could be used to adapt to a user's particular scheduling patterns (frame this as an RL problem), we would decrease scheduling overhead incurred by the OS. This might make a lot of sense inside of data centers where the savings can really scale.</li><li>Colony counting & cell tracking for microscopy software (for wet lab research)</li><li>The strategy of "replacing simulation with machine learning" has been useful in the fields of drug design too, presenting enormous speed ups in finding which compounds are helpful or toxic [untethiner 2015].</li></ul><br /><h2>References</h2><div class="qtext_para" style="color: #333333; font-family: Georgia, Times, "Times New Roman", serif; font-size: 15px; line-height: 21px; margin-bottom: 1em; padding: 0px;">[1] <span class="qlink_container"><a class="external_link" href="https://arxiv.org/abs/1312.6114" rel="noopener nofollow" style="background-attachment: initial; background-clip: initial; background-image: url("data:image/svg+xml,%3C%3Fxml%20version%3D%221.0%22%20encoding%3D%22UTF-8%22%20standalone%3D%22no%22%3F%3E%0A%3Csvg%20width%3D%2214px%22%20height%3D%2214px%22%20viewBox%3D%220%200%2014%2014%22%20version%3D%221.1%22%20xmlns%3D%22http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%22%20xmlns%3Axlink%3D%22http%3A%2F%2Fwww.w3.org%2F1999%2Fxlink%22%20xmlns%3Asketch%3D%22http%3A%2F%2Fwww.bohemiancoding.com%2Fsketch%2Fns%22%3E%0A%20%20%20%20%3C!--%20Generator%3A%20Sketch%203.5.2%20(25235)%20-%20http%3A%2F%2Fwww.bohemiancoding.com%2Fsketch%20--%3E%0A%20%20%20%20%3Ctitle%3Eexternal_link%3C%2Ftitle%3E%0A%20%20%20%20%3Cdesc%3ECreated%20with%20Sketch.%3C%2Fdesc%3E%0A%20%20%20%20%3Cdefs%3E%3C%2Fdefs%3E%0A%20%20%20%20%3Cg%20id%3D%22Page-1%22%20stroke%3D%22none%22%20stroke-width%3D%221%22%20fill%3D%22none%22%20fill-rule%3D%22evenodd%22%20sketch%3Atype%3D%22MSPage%22%3E%0A%20%20%20%20%20%20%20%20%3Cg%20id%3D%22external_link%22%20sketch%3Atype%3D%22MSLayerGroup%22%20fill%3D%22%23ccc%22%3E%0A%20%20%20%20%20%20%20%20%20%20%20%20%3Cpath%20d%3D%22M10%2C8%20C9.447%2C8%209%2C8.448%209%2C9%20L9%2C12%20L2%2C12%20L2%2C5%20L5%2C5%20C5.553%2C5%206%2C4.552%206%2C4%20C6%2C3.448%205.553%2C3%205%2C3%20L1%2C3%20C0.447%2C3%200%2C3.448%200%2C4%20L0%2C13%20C0%2C13.552%200.447%2C14%201%2C14%20L10%2C14%20C10.553%2C14%2011%2C13.552%2011%2C13%20L11%2C9%20C11%2C8.448%2010.553%2C8%2010%2C8%20L10%2C8%20Z%22%20id%3D%22Shape%22%20sketch%3Atype%3D%22MSShapeGroup%22%3E%3C%2Fpath%3E%0A%20%20%20%20%20%20%20%20%20%20%20%20%3Cpath%20d%3D%22M13%2C0%20L8%2C0%20C7.447%2C0%207%2C0.448%207%2C1%20C7%2C1.552%207.447%2C2%208%2C2%20L10.586%2C2%20L4.293%2C8.293%20C3.902%2C8.684%203.902%2C9.316%204.293%2C9.707%20C4.488%2C9.902%204.744%2C10%205%2C10%20C5.256%2C10%205.512%2C9.902%205.707%2C9.707%20L12%2C3.414%20L12%2C6%20C12%2C6.552%2012.447%2C7%2013%2C7%20C13.553%2C7%2014%2C6.552%2014%2C6%20L14%2C1%20C14%2C0.448%2013.553%2C0%2013%2C0%20L13%2C0%20Z%22%20id%3D%22Shape%22%20sketch%3Atype%3D%22MSShapeGroup%22%3E%3C%2Fpath%3E%0A%20%20%20%20%20%20%20%20%3C%2Fg%3E%0A%20%20%20%20%3C%2Fg%3E%0A%3C%2Fsvg%3E"); background-origin: initial; background-position: right 0.3em; background-repeat: no-repeat; background-size: 10.5px; color: #2b6dad; padding-right: 15px; text-decoration: none;" target="_blank">[1312.6114] Auto-Encoding Variational Bayes</a></span></div><div class="qtext_para" style="color: #333333; font-family: Georgia, Times, "Times New Roman", serif; font-size: 15px; line-height: 21px; margin-bottom: 1em; padding: 0px;">[2] <span class="qlink_container"><a class="external_link" href="https://arxiv.org/pdf/1603.05106v2.pdf" rel="noopener nofollow" style="background-attachment: initial; background-clip: initial; background-image: url("data:image/svg+xml,%3C%3Fxml%20version%3D%221.0%22%20encoding%3D%22UTF-8%22%20standalone%3D%22no%22%3F%3E%0A%3Csvg%20width%3D%2214px%22%20height%3D%2214px%22%20viewBox%3D%220%200%2014%2014%22%20version%3D%221.1%22%20xmlns%3D%22http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%22%20xmlns%3Axlink%3D%22http%3A%2F%2Fwww.w3.org%2F1999%2Fxlink%22%20xmlns%3Asketch%3D%22http%3A%2F%2Fwww.bohemiancoding.com%2Fsketch%2Fns%22%3E%0A%20%20%20%20%3C!--%20Generator%3A%20Sketch%203.5.2%20(25235)%20-%20http%3A%2F%2Fwww.bohemiancoding.com%2Fsketch%20--%3E%0A%20%20%20%20%3Ctitle%3Eexternal_link%3C%2Ftitle%3E%0A%20%20%20%20%3Cdesc%3ECreated%20with%20Sketch.%3C%2Fdesc%3E%0A%20%20%20%20%3Cdefs%3E%3C%2Fdefs%3E%0A%20%20%20%20%3Cg%20id%3D%22Page-1%22%20stroke%3D%22none%22%20stroke-width%3D%221%22%20fill%3D%22none%22%20fill-rule%3D%22evenodd%22%20sketch%3Atype%3D%22MSPage%22%3E%0A%20%20%20%20%20%20%20%20%3Cg%20id%3D%22external_link%22%20sketch%3Atype%3D%22MSLayerGroup%22%20fill%3D%22%23ccc%22%3E%0A%20%20%20%20%20%20%20%20%20%20%20%20%3Cpath%20d%3D%22M10%2C8%20C9.447%2C8%209%2C8.448%209%2C9%20L9%2C12%20L2%2C12%20L2%2C5%20L5%2C5%20C5.553%2C5%206%2C4.552%206%2C4%20C6%2C3.448%205.553%2C3%205%2C3%20L1%2C3%20C0.447%2C3%200%2C3.448%200%2C4%20L0%2C13%20C0%2C13.552%200.447%2C14%201%2C14%20L10%2C14%20C10.553%2C14%2011%2C13.552%2011%2C13%20L11%2C9%20C11%2C8.448%2010.553%2C8%2010%2C8%20L10%2C8%20Z%22%20id%3D%22Shape%22%20sketch%3Atype%3D%22MSShapeGroup%22%3E%3C%2Fpath%3E%0A%20%20%20%20%20%20%20%20%20%20%20%20%3Cpath%20d%3D%22M13%2C0%20L8%2C0%20C7.447%2C0%207%2C0.448%207%2C1%20C7%2C1.552%207.447%2C2%208%2C2%20L10.586%2C2%20L4.293%2C8.293%20C3.902%2C8.684%203.902%2C9.316%204.293%2C9.707%20C4.488%2C9.902%204.744%2C10%205%2C10%20C5.256%2C10%205.512%2C9.902%205.707%2C9.707%20L12%2C3.414%20L12%2C6%20C12%2C6.552%2012.447%2C7%2013%2C7%20C13.553%2C7%2014%2C6.552%2014%2C6%20L14%2C1%20C14%2C0.448%2013.553%2C0%2013%2C0%20L13%2C0%20Z%22%20id%3D%22Shape%22%20sketch%3Atype%3D%22MSShapeGroup%22%3E%3C%2Fpath%3E%0A%20%20%20%20%20%20%20%20%3C%2Fg%3E%0A%20%20%20%20%3C%2Fg%3E%0A%3C%2Fsvg%3E"); background-origin: initial; background-position: right 0.3em; background-repeat: no-repeat; background-size: 10.5px; color: #2b6dad; padding-right: 15px; text-decoration: none;" target="_blank">https://arxiv.org/pdf/160<wbr></wbr>3.05106...</a></span></div><div class="qtext_para" style="color: #333333; font-family: Georgia, Times, "Times New Roman", serif; font-size: 15px; line-height: 21px; margin-bottom: 1em; padding: 0px;">[3] <span class="qlink_container"><a class="external_link" href="http://karpathy.github.io/2016/05/31/rl/" rel="noopener" style="background-attachment: initial; background-clip: initial; background-image: url("data:image/svg+xml,%3C%3Fxml%20version%3D%221.0%22%20encoding%3D%22UTF-8%22%20standalone%3D%22no%22%3F%3E%0A%3Csvg%20width%3D%2214px%22%20height%3D%2214px%22%20viewBox%3D%220%200%2014%2014%22%20version%3D%221.1%22%20xmlns%3D%22http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%22%20xmlns%3Axlink%3D%22http%3A%2F%2Fwww.w3.org%2F1999%2Fxlink%22%20xmlns%3Asketch%3D%22http%3A%2F%2Fwww.bohemiancoding.com%2Fsketch%2Fns%22%3E%0A%20%20%20%20%3C!--%20Generator%3A%20Sketch%203.5.2%20(25235)%20-%20http%3A%2F%2Fwww.bohemiancoding.com%2Fsketch%20--%3E%0A%20%20%20%20%3Ctitle%3Eexternal_link%3C%2Ftitle%3E%0A%20%20%20%20%3Cdesc%3ECreated%20with%20Sketch.%3C%2Fdesc%3E%0A%20%20%20%20%3Cdefs%3E%3C%2Fdefs%3E%0A%20%20%20%20%3Cg%20id%3D%22Page-1%22%20stroke%3D%22none%22%20stroke-width%3D%221%22%20fill%3D%22none%22%20fill-rule%3D%22evenodd%22%20sketch%3Atype%3D%22MSPage%22%3E%0A%20%20%20%20%20%20%20%20%3Cg%20id%3D%22external_link%22%20sketch%3Atype%3D%22MSLayerGroup%22%20fill%3D%22%23ccc%22%3E%0A%20%20%20%20%20%20%20%20%20%20%20%20%3Cpath%20d%3D%22M10%2C8%20C9.447%2C8%209%2C8.448%209%2C9%20L9%2C12%20L2%2C12%20L2%2C5%20L5%2C5%20C5.553%2C5%206%2C4.552%206%2C4%20C6%2C3.448%205.553%2C3%205%2C3%20L1%2C3%20C0.447%2C3%200%2C3.448%200%2C4%20L0%2C13%20C0%2C13.552%200.447%2C14%201%2C14%20L10%2C14%20C10.553%2C14%2011%2C13.552%2011%2C13%20L11%2C9%20C11%2C8.448%2010.553%2C8%2010%2C8%20L10%2C8%20Z%22%20id%3D%22Shape%22%20sketch%3Atype%3D%22MSShapeGroup%22%3E%3C%2Fpath%3E%0A%20%20%20%20%20%20%20%20%20%20%20%20%3Cpath%20d%3D%22M13%2C0%20L8%2C0%20C7.447%2C0%207%2C0.448%207%2C1%20C7%2C1.552%207.447%2C2%208%2C2%20L10.586%2C2%20L4.293%2C8.293%20C3.902%2C8.684%203.902%2C9.316%204.293%2C9.707%20C4.488%2C9.902%204.744%2C10%205%2C10%20C5.256%2C10%205.512%2C9.902%205.707%2C9.707%20L12%2C3.414%20L12%2C6%20C12%2C6.552%2012.447%2C7%2013%2C7%20C13.553%2C7%2014%2C6.552%2014%2C6%20L14%2C1%20C14%2C0.448%2013.553%2C0%2013%2C0%20L13%2C0%20Z%22%20id%3D%22Shape%22%20sketch%3Atype%3D%22MSShapeGroup%22%3E%3C%2Fpath%3E%0A%20%20%20%20%20%20%20%20%3C%2Fg%3E%0A%20%20%20%20%3C%2Fg%3E%0A%3C%2Fsvg%3E"); background-origin: initial; background-position: right 0.3em; background-repeat: no-repeat; background-size: 10.5px; color: #2b6dad; padding-right: 15px; text-decoration: none;" target="_blank">Deep Reinforcement Learning: Pong from Pixels</a></span></div><div class="qtext_para" style="color: #333333; font-family: Georgia, Times, "Times New Roman", serif; font-size: 15px; line-height: 21px; margin-bottom: 1em; padding: 0px;">[4] <span class="qlink_container"><a class="external_link" href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3790855/" rel="noopener nofollow" style="background-attachment: initial; background-clip: initial; background-image: url("data:image/svg+xml,%3C%3Fxml%20version%3D%221.0%22%20encoding%3D%22UTF-8%22%20standalone%3D%22no%22%3F%3E%0A%3Csvg%20width%3D%2214px%22%20height%3D%2214px%22%20viewBox%3D%220%200%2014%2014%22%20version%3D%221.1%22%20xmlns%3D%22http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%22%20xmlns%3Axlink%3D%22http%3A%2F%2Fwww.w3.org%2F1999%2Fxlink%22%20xmlns%3Asketch%3D%22http%3A%2F%2Fwww.bohemiancoding.com%2Fsketch%2Fns%22%3E%0A%20%20%20%20%3C!--%20Generator%3A%20Sketch%203.5.2%20(25235)%20-%20http%3A%2F%2Fwww.bohemiancoding.com%2Fsketch%20--%3E%0A%20%20%20%20%3Ctitle%3Eexternal_link%3C%2Ftitle%3E%0A%20%20%20%20%3Cdesc%3ECreated%20with%20Sketch.%3C%2Fdesc%3E%0A%20%20%20%20%3Cdefs%3E%3C%2Fdefs%3E%0A%20%20%20%20%3Cg%20id%3D%22Page-1%22%20stroke%3D%22none%22%20stroke-width%3D%221%22%20fill%3D%22none%22%20fill-rule%3D%22evenodd%22%20sketch%3Atype%3D%22MSPage%22%3E%0A%20%20%20%20%20%20%20%20%3Cg%20id%3D%22external_link%22%20sketch%3Atype%3D%22MSLayerGroup%22%20fill%3D%22%23ccc%22%3E%0A%20%20%20%20%20%20%20%20%20%20%20%20%3Cpath%20d%3D%22M10%2C8%20C9.447%2C8%209%2C8.448%209%2C9%20L9%2C12%20L2%2C12%20L2%2C5%20L5%2C5%20C5.553%2C5%206%2C4.552%206%2C4%20C6%2C3.448%205.553%2C3%205%2C3%20L1%2C3%20C0.447%2C3%200%2C3.448%200%2C4%20L0%2C13%20C0%2C13.552%200.447%2C14%201%2C14%20L10%2C14%20C10.553%2C14%2011%2C13.552%2011%2C13%20L11%2C9%20C11%2C8.448%2010.553%2C8%2010%2C8%20L10%2C8%20Z%22%20id%3D%22Shape%22%20sketch%3Atype%3D%22MSShapeGroup%22%3E%3C%2Fpath%3E%0A%20%20%20%20%20%20%20%20%20%20%20%20%3Cpath%20d%3D%22M13%2C0%20L8%2C0%20C7.447%2C0%207%2C0.448%207%2C1%20C7%2C1.552%207.447%2C2%208%2C2%20L10.586%2C2%20L4.293%2C8.293%20C3.902%2C8.684%203.902%2C9.316%204.293%2C9.707%20C4.488%2C9.902%204.744%2C10%205%2C10%20C5.256%2C10%205.512%2C9.902%205.707%2C9.707%20L12%2C3.414%20L12%2C6%20C12%2C6.552%2012.447%2C7%2013%2C7%20C13.553%2C7%2014%2C6.552%2014%2C6%20L14%2C1%20C14%2C0.448%2013.553%2C0%2013%2C0%20L13%2C0%20Z%22%20id%3D%22Shape%22%20sketch%3Atype%3D%22MSShapeGroup%22%3E%3C%2Fpath%3E%0A%20%20%20%20%20%20%20%20%3C%2Fg%3E%0A%20%20%20%20%3C%2Fg%3E%0A%3C%2Fsvg%3E"); background-origin: initial; background-position: right 0.3em; background-repeat: no-repeat; background-size: 10.5px; color: #2b6dad; padding-right: 15px; text-decoration: none;" target="_blank">http://www.ncbi.nlm.nih.g<wbr></wbr>ov/pmc/...</a></span></div><div class="qtext_para" style="color: #333333; font-family: Georgia, Times, "Times New Roman", serif; font-size: 15px; line-height: 21px; margin-bottom: 1em; padding: 0px;">[5] <span class="qlink_container"><a class="external_link" href="http://web.media.mit.edu/~gordonw/courses/ComputationalPlenopticImaging/CPICourseNotes.pdf" rel="noopener nofollow" style="background-attachment: initial; background-clip: initial; background-image: url("data:image/svg+xml,%3C%3Fxml%20version%3D%221.0%22%20encoding%3D%22UTF-8%22%20standalone%3D%22no%22%3F%3E%0A%3Csvg%20width%3D%2214px%22%20height%3D%2214px%22%20viewBox%3D%220%200%2014%2014%22%20version%3D%221.1%22%20xmlns%3D%22http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%22%20xmlns%3Axlink%3D%22http%3A%2F%2Fwww.w3.org%2F1999%2Fxlink%22%20xmlns%3Asketch%3D%22http%3A%2F%2Fwww.bohemiancoding.com%2Fsketch%2Fns%22%3E%0A%20%20%20%20%3C!--%20Generator%3A%20Sketch%203.5.2%20(25235)%20-%20http%3A%2F%2Fwww.bohemiancoding.com%2Fsketch%20--%3E%0A%20%20%20%20%3Ctitle%3Eexternal_link%3C%2Ftitle%3E%0A%20%20%20%20%3Cdesc%3ECreated%20with%20Sketch.%3C%2Fdesc%3E%0A%20%20%20%20%3Cdefs%3E%3C%2Fdefs%3E%0A%20%20%20%20%3Cg%20id%3D%22Page-1%22%20stroke%3D%22none%22%20stroke-width%3D%221%22%20fill%3D%22none%22%20fill-rule%3D%22evenodd%22%20sketch%3Atype%3D%22MSPage%22%3E%0A%20%20%20%20%20%20%20%20%3Cg%20id%3D%22external_link%22%20sketch%3Atype%3D%22MSLayerGroup%22%20fill%3D%22%23ccc%22%3E%0A%20%20%20%20%20%20%20%20%20%20%20%20%3Cpath%20d%3D%22M10%2C8%20C9.447%2C8%209%2C8.448%209%2C9%20L9%2C12%20L2%2C12%20L2%2C5%20L5%2C5%20C5.553%2C5%206%2C4.552%206%2C4%20C6%2C3.448%205.553%2C3%205%2C3%20L1%2C3%20C0.447%2C3%200%2C3.448%200%2C4%20L0%2C13%20C0%2C13.552%200.447%2C14%201%2C14%20L10%2C14%20C10.553%2C14%2011%2C13.552%2011%2C13%20L11%2C9%20C11%2C8.448%2010.553%2C8%2010%2C8%20L10%2C8%20Z%22%20id%3D%22Shape%22%20sketch%3Atype%3D%22MSShapeGroup%22%3E%3C%2Fpath%3E%0A%20%20%20%20%20%20%20%20%20%20%20%20%3Cpath%20d%3D%22M13%2C0%20L8%2C0%20C7.447%2C0%207%2C0.448%207%2C1%20C7%2C1.552%207.447%2C2%208%2C2%20L10.586%2C2%20L4.293%2C8.293%20C3.902%2C8.684%203.902%2C9.316%204.293%2C9.707%20C4.488%2C9.902%204.744%2C10%205%2C10%20C5.256%2C10%205.512%2C9.902%205.707%2C9.707%20L12%2C3.414%20L12%2C6%20C12%2C6.552%2012.447%2C7%2013%2C7%20C13.553%2C7%2014%2C6.552%2014%2C6%20L14%2C1%20C14%2C0.448%2013.553%2C0%2013%2C0%20L13%2C0%20Z%22%20id%3D%22Shape%22%20sketch%3Atype%3D%22MSShapeGroup%22%3E%3C%2Fpath%3E%0A%20%20%20%20%20%20%20%20%3C%2Fg%3E%0A%20%20%20%20%3C%2Fg%3E%0A%3C%2Fsvg%3E"); background-origin: initial; background-position: right 0.3em; background-repeat: no-repeat; background-size: 10.5px; color: #2b6dad; padding-right: 15px; text-decoration: none;" target="_blank">http://web.media.mit.edu/<wbr></wbr>~gordon...</a></span></div><div class="qtext_para" style="color: #333333; font-family: Georgia, Times, "Times New Roman", serif; font-size: 15px; line-height: 21px; margin-bottom: 1em; padding: 0px;">[6] <span class="qlink_container"><a class="external_link" href="https://research.googleblog.com/2015/07/how-google-translate-squeezes-deep.html" rel="noopener nofollow" style="background-attachment: initial; background-clip: initial; background-image: url("data:image/svg+xml,%3C%3Fxml%20version%3D%221.0%22%20encoding%3D%22UTF-8%22%20standalone%3D%22no%22%3F%3E%0A%3Csvg%20width%3D%2214px%22%20height%3D%2214px%22%20viewBox%3D%220%200%2014%2014%22%20version%3D%221.1%22%20xmlns%3D%22http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%22%20xmlns%3Axlink%3D%22http%3A%2F%2Fwww.w3.org%2F1999%2Fxlink%22%20xmlns%3Asketch%3D%22http%3A%2F%2Fwww.bohemiancoding.com%2Fsketch%2Fns%22%3E%0A%20%20%20%20%3C!--%20Generator%3A%20Sketch%203.5.2%20(25235)%20-%20http%3A%2F%2Fwww.bohemiancoding.com%2Fsketch%20--%3E%0A%20%20%20%20%3Ctitle%3Eexternal_link%3C%2Ftitle%3E%0A%20%20%20%20%3Cdesc%3ECreated%20with%20Sketch.%3C%2Fdesc%3E%0A%20%20%20%20%3Cdefs%3E%3C%2Fdefs%3E%0A%20%20%20%20%3Cg%20id%3D%22Page-1%22%20stroke%3D%22none%22%20stroke-width%3D%221%22%20fill%3D%22none%22%20fill-rule%3D%22evenodd%22%20sketch%3Atype%3D%22MSPage%22%3E%0A%20%20%20%20%20%20%20%20%3Cg%20id%3D%22external_link%22%20sketch%3Atype%3D%22MSLayerGroup%22%20fill%3D%22%23ccc%22%3E%0A%20%20%20%20%20%20%20%20%20%20%20%20%3Cpath%20d%3D%22M10%2C8%20C9.447%2C8%209%2C8.448%209%2C9%20L9%2C12%20L2%2C12%20L2%2C5%20L5%2C5%20C5.553%2C5%206%2C4.552%206%2C4%20C6%2C3.448%205.553%2C3%205%2C3%20L1%2C3%20C0.447%2C3%200%2C3.448%200%2C4%20L0%2C13%20C0%2C13.552%200.447%2C14%201%2C14%20L10%2C14%20C10.553%2C14%2011%2C13.552%2011%2C13%20L11%2C9%20C11%2C8.448%2010.553%2C8%2010%2C8%20L10%2C8%20Z%22%20id%3D%22Shape%22%20sketch%3Atype%3D%22MSShapeGroup%22%3E%3C%2Fpath%3E%0A%20%20%20%20%20%20%20%20%20%20%20%20%3Cpath%20d%3D%22M13%2C0%20L8%2C0%20C7.447%2C0%207%2C0.448%207%2C1%20C7%2C1.552%207.447%2C2%208%2C2%20L10.586%2C2%20L4.293%2C8.293%20C3.902%2C8.684%203.902%2C9.316%204.293%2C9.707%20C4.488%2C9.902%204.744%2C10%205%2C10%20C5.256%2C10%205.512%2C9.902%205.707%2C9.707%20L12%2C3.414%20L12%2C6%20C12%2C6.552%2012.447%2C7%2013%2C7%20C13.553%2C7%2014%2C6.552%2014%2C6%20L14%2C1%20C14%2C0.448%2013.553%2C0%2013%2C0%20L13%2C0%20Z%22%20id%3D%22Shape%22%20sketch%3Atype%3D%22MSShapeGroup%22%3E%3C%2Fpath%3E%0A%20%20%20%20%20%20%20%20%3C%2Fg%3E%0A%20%20%20%20%3C%2Fg%3E%0A%3C%2Fsvg%3E"); background-origin: initial; background-position: right 0.3em; background-repeat: no-repeat; background-size: 10.5px; color: #2b6dad; padding-right: 15px; text-decoration: none;" target="_blank">How Google Translate squeezes deep learning onto a phone</a></span></div><div class="qtext_para" style="color: #333333; font-family: Georgia, Times, "Times New Roman", serif; font-size: 15px; line-height: 21px; margin-bottom: 1em; padding: 0px;">[7] <span class="qlink_container"><a class="external_link" href="https://atap.google.com/soli/" rel="noopener" style="background-attachment: initial; background-clip: initial; background-image: url("data:image/svg+xml,%3C%3Fxml%20version%3D%221.0%22%20encoding%3D%22UTF-8%22%20standalone%3D%22no%22%3F%3E%0A%3Csvg%20width%3D%2214px%22%20height%3D%2214px%22%20viewBox%3D%220%200%2014%2014%22%20version%3D%221.1%22%20xmlns%3D%22http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%22%20xmlns%3Axlink%3D%22http%3A%2F%2Fwww.w3.org%2F1999%2Fxlink%22%20xmlns%3Asketch%3D%22http%3A%2F%2Fwww.bohemiancoding.com%2Fsketch%2Fns%22%3E%0A%20%20%20%20%3C!--%20Generator%3A%20Sketch%203.5.2%20(25235)%20-%20http%3A%2F%2Fwww.bohemiancoding.com%2Fsketch%20--%3E%0A%20%20%20%20%3Ctitle%3Eexternal_link%3C%2Ftitle%3E%0A%20%20%20%20%3Cdesc%3ECreated%20with%20Sketch.%3C%2Fdesc%3E%0A%20%20%20%20%3Cdefs%3E%3C%2Fdefs%3E%0A%20%20%20%20%3Cg%20id%3D%22Page-1%22%20stroke%3D%22none%22%20stroke-width%3D%221%22%20fill%3D%22none%22%20fill-rule%3D%22evenodd%22%20sketch%3Atype%3D%22MSPage%22%3E%0A%20%20%20%20%20%20%20%20%3Cg%20id%3D%22external_link%22%20sketch%3Atype%3D%22MSLayerGroup%22%20fill%3D%22%23ccc%22%3E%0A%20%20%20%20%20%20%20%20%20%20%20%20%3Cpath%20d%3D%22M10%2C8%20C9.447%2C8%209%2C8.448%209%2C9%20L9%2C12%20L2%2C12%20L2%2C5%20L5%2C5%20C5.553%2C5%206%2C4.552%206%2C4%20C6%2C3.448%205.553%2C3%205%2C3%20L1%2C3%20C0.447%2C3%200%2C3.448%200%2C4%20L0%2C13%20C0%2C13.552%200.447%2C14%201%2C14%20L10%2C14%20C10.553%2C14%2011%2C13.552%2011%2C13%20L11%2C9%20C11%2C8.448%2010.553%2C8%2010%2C8%20L10%2C8%20Z%22%20id%3D%22Shape%22%20sketch%3Atype%3D%22MSShapeGroup%22%3E%3C%2Fpath%3E%0A%20%20%20%20%20%20%20%20%20%20%20%20%3Cpath%20d%3D%22M13%2C0%20L8%2C0%20C7.447%2C0%207%2C0.448%207%2C1%20C7%2C1.552%207.447%2C2%208%2C2%20L10.586%2C2%20L4.293%2C8.293%20C3.902%2C8.684%203.902%2C9.316%204.293%2C9.707%20C4.488%2C9.902%204.744%2C10%205%2C10%20C5.256%2C10%205.512%2C9.902%205.707%2C9.707%20L12%2C3.414%20L12%2C6%20C12%2C6.552%2012.447%2C7%2013%2C7%20C13.553%2C7%2014%2C6.552%2014%2C6%20L14%2C1%20C14%2C0.448%2013.553%2C0%2013%2C0%20L13%2C0%20Z%22%20id%3D%22Shape%22%20sketch%3Atype%3D%22MSShapeGroup%22%3E%3C%2Fpath%3E%0A%20%20%20%20%20%20%20%20%3C%2Fg%3E%0A%20%20%20%20%3C%2Fg%3E%0A%3C%2Fsvg%3E"); background-origin: initial; background-position: right 0.3em; background-repeat: no-repeat; background-size: 10.5px; color: #2b6dad; padding-right: 15px; text-decoration: none;" target="_blank">Project Soli</a></span></div><div class="qtext_para" style="color: #333333; font-family: Georgia, Times, "Times New Roman", serif; font-size: 15px; line-height: 21px; margin-bottom: 1em; padding: 0px;">[8] <span class="qlink_container"><a class="external_link" href="https://graphics.ethz.ch/publications/papers/paperJeo15a.php" rel="noopener" style="background-attachment: initial; background-clip: initial; background-image: url("data:image/svg+xml,%3C%3Fxml%20version%3D%221.0%22%20encoding%3D%22UTF-8%22%20standalone%3D%22no%22%3F%3E%0A%3Csvg%20width%3D%2214px%22%20height%3D%2214px%22%20viewBox%3D%220%200%2014%2014%22%20version%3D%221.1%22%20xmlns%3D%22http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%22%20xmlns%3Axlink%3D%22http%3A%2F%2Fwww.w3.org%2F1999%2Fxlink%22%20xmlns%3Asketch%3D%22http%3A%2F%2Fwww.bohemiancoding.com%2Fsketch%2Fns%22%3E%0A%20%20%20%20%3C!--%20Generator%3A%20Sketch%203.5.2%20(25235)%20-%20http%3A%2F%2Fwww.bohemiancoding.com%2Fsketch%20--%3E%0A%20%20%20%20%3Ctitle%3Eexternal_link%3C%2Ftitle%3E%0A%20%20%20%20%3Cdesc%3ECreated%20with%20Sketch.%3C%2Fdesc%3E%0A%20%20%20%20%3Cdefs%3E%3C%2Fdefs%3E%0A%20%20%20%20%3Cg%20id%3D%22Page-1%22%20stroke%3D%22none%22%20stroke-width%3D%221%22%20fill%3D%22none%22%20fill-rule%3D%22evenodd%22%20sketch%3Atype%3D%22MSPage%22%3E%0A%20%20%20%20%20%20%20%20%3Cg%20id%3D%22external_link%22%20sketch%3Atype%3D%22MSLayerGroup%22%20fill%3D%22%23ccc%22%3E%0A%20%20%20%20%20%20%20%20%20%20%20%20%3Cpath%20d%3D%22M10%2C8%20C9.447%2C8%209%2C8.448%209%2C9%20L9%2C12%20L2%2C12%20L2%2C5%20L5%2C5%20C5.553%2C5%206%2C4.552%206%2C4%20C6%2C3.448%205.553%2C3%205%2C3%20L1%2C3%20C0.447%2C3%200%2C3.448%200%2C4%20L0%2C13%20C0%2C13.552%200.447%2C14%201%2C14%20L10%2C14%20C10.553%2C14%2011%2C13.552%2011%2C13%20L11%2C9%20C11%2C8.448%2010.553%2C8%2010%2C8%20L10%2C8%20Z%22%20id%3D%22Shape%22%20sketch%3Atype%3D%22MSShapeGroup%22%3E%3C%2Fpath%3E%0A%20%20%20%20%20%20%20%20%20%20%20%20%3Cpath%20d%3D%22M13%2C0%20L8%2C0%20C7.447%2C0%207%2C0.448%207%2C1%20C7%2C1.552%207.447%2C2%208%2C2%20L10.586%2C2%20L4.293%2C8.293%20C3.902%2C8.684%203.902%2C9.316%204.293%2C9.707%20C4.488%2C9.902%204.744%2C10%205%2C10%20C5.256%2C10%205.512%2C9.902%205.707%2C9.707%20L12%2C3.414%20L12%2C6%20C12%2C6.552%2012.447%2C7%2013%2C7%20C13.553%2C7%2014%2C6.552%2014%2C6%20L14%2C1%20C14%2C0.448%2013.553%2C0%2013%2C0%20L13%2C0%20Z%22%20id%3D%22Shape%22%20sketch%3Atype%3D%22MSShapeGroup%22%3E%3C%2Fpath%3E%0A%20%20%20%20%20%20%20%20%3C%2Fg%3E%0A%20%20%20%20%3C%2Fg%3E%0A%3C%2Fsvg%3E"); background-origin: initial; background-position: right 0.3em; background-repeat: no-repeat; background-size: 10.5px; color: #2b6dad; padding-right: 15px; text-decoration: none;" target="_blank">Data-driven Fluid Simulations using Regression Forests</a></span></div><div class="qtext_para" style="color: #333333; font-family: Georgia, Times, "Times New Roman", serif; font-size: 15px; line-height: 21px; margin-bottom: 1em; padding: 0px;">[9] <span class="qlink_container"><a class="external_link" href="http://www.nature.com/nature/journal/v521/n7553/full/nature14539.html" rel="noopener nofollow" style="background-attachment: initial; background-clip: initial; background-image: url("data:image/svg+xml,%3C%3Fxml%20version%3D%221.0%22%20encoding%3D%22UTF-8%22%20standalone%3D%22no%22%3F%3E%0A%3Csvg%20width%3D%2214px%22%20height%3D%2214px%22%20viewBox%3D%220%200%2014%2014%22%20version%3D%221.1%22%20xmlns%3D%22http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%22%20xmlns%3Axlink%3D%22http%3A%2F%2Fwww.w3.org%2F1999%2Fxlink%22%20xmlns%3Asketch%3D%22http%3A%2F%2Fwww.bohemiancoding.com%2Fsketch%2Fns%22%3E%0A%20%20%20%20%3C!--%20Generator%3A%20Sketch%203.5.2%20(25235)%20-%20http%3A%2F%2Fwww.bohemiancoding.com%2Fsketch%20--%3E%0A%20%20%20%20%3Ctitle%3Eexternal_link%3C%2Ftitle%3E%0A%20%20%20%20%3Cdesc%3ECreated%20with%20Sketch.%3C%2Fdesc%3E%0A%20%20%20%20%3Cdefs%3E%3C%2Fdefs%3E%0A%20%20%20%20%3Cg%20id%3D%22Page-1%22%20stroke%3D%22none%22%20stroke-width%3D%221%22%20fill%3D%22none%22%20fill-rule%3D%22evenodd%22%20sketch%3Atype%3D%22MSPage%22%3E%0A%20%20%20%20%20%20%20%20%3Cg%20id%3D%22external_link%22%20sketch%3Atype%3D%22MSLayerGroup%22%20fill%3D%22%23ccc%22%3E%0A%20%20%20%20%20%20%20%20%20%20%20%20%3Cpath%20d%3D%22M10%2C8%20C9.447%2C8%209%2C8.448%209%2C9%20L9%2C12%20L2%2C12%20L2%2C5%20L5%2C5%20C5.553%2C5%206%2C4.552%206%2C4%20C6%2C3.448%205.553%2C3%205%2C3%20L1%2C3%20C0.447%2C3%200%2C3.448%200%2C4%20L0%2C13%20C0%2C13.552%200.447%2C14%201%2C14%20L10%2C14%20C10.553%2C14%2011%2C13.552%2011%2C13%20L11%2C9%20C11%2C8.448%2010.553%2C8%2010%2C8%20L10%2C8%20Z%22%20id%3D%22Shape%22%20sketch%3Atype%3D%22MSShapeGroup%22%3E%3C%2Fpath%3E%0A%20%20%20%20%20%20%20%20%20%20%20%20%3Cpath%20d%3D%22M13%2C0%20L8%2C0%20C7.447%2C0%207%2C0.448%207%2C1%20C7%2C1.552%207.447%2C2%208%2C2%20L10.586%2C2%20L4.293%2C8.293%20C3.902%2C8.684%203.902%2C9.316%204.293%2C9.707%20C4.488%2C9.902%204.744%2C10%205%2C10%20C5.256%2C10%205.512%2C9.902%205.707%2C9.707%20L12%2C3.414%20L12%2C6%20C12%2C6.552%2012.447%2C7%2013%2C7%20C13.553%2C7%2014%2C6.552%2014%2C6%20L14%2C1%20C14%2C0.448%2013.553%2C0%2013%2C0%20L13%2C0%20Z%22%20id%3D%22Shape%22%20sketch%3Atype%3D%22MSShapeGroup%22%3E%3C%2Fpath%3E%0A%20%20%20%20%20%20%20%20%3C%2Fg%3E%0A%20%20%20%20%3C%2Fg%3E%0A%3C%2Fsvg%3E"); background-origin: initial; background-position: right 0.3em; background-repeat: no-repeat; background-size: 10.5px; color: #2b6dad; padding-right: 15px; text-decoration: none;" target="_blank">http://www.nature.com/nat<wbr></wbr>ure/jou...</a></span></div><div class="qtext_para" style="color: #333333; font-family: Georgia, Times, "Times New Roman", serif; font-size: 15px; line-height: 21px; margin-bottom: 1em; padding: 0px;">[10] <span class="qlink_container"><a class="external_link" data-tooltip="attached" href="http://people.idsia.ch/~juergen/deep-learning-overview.html" rel="noopener nofollow" style="background-attachment: initial; background-clip: initial; background-image: url("data:image/svg+xml,%3C%3Fxml%20version%3D%221.0%22%20encoding%3D%22UTF-8%22%20standalone%3D%22no%22%3F%3E%0A%3Csvg%20width%3D%2214px%22%20height%3D%2214px%22%20viewBox%3D%220%200%2014%2014%22%20version%3D%221.1%22%20xmlns%3D%22http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%22%20xmlns%3Axlink%3D%22http%3A%2F%2Fwww.w3.org%2F1999%2Fxlink%22%20xmlns%3Asketch%3D%22http%3A%2F%2Fwww.bohemiancoding.com%2Fsketch%2Fns%22%3E%0A%20%20%20%20%3C!--%20Generator%3A%20Sketch%203.5.2%20(25235)%20-%20http%3A%2F%2Fwww.bohemiancoding.com%2Fsketch%20--%3E%0A%20%20%20%20%3Ctitle%3Eexternal_link%3C%2Ftitle%3E%0A%20%20%20%20%3Cdesc%3ECreated%20with%20Sketch.%3C%2Fdesc%3E%0A%20%20%20%20%3Cdefs%3E%3C%2Fdefs%3E%0A%20%20%20%20%3Cg%20id%3D%22Page-1%22%20stroke%3D%22none%22%20stroke-width%3D%221%22%20fill%3D%22none%22%20fill-rule%3D%22evenodd%22%20sketch%3Atype%3D%22MSPage%22%3E%0A%20%20%20%20%20%20%20%20%3Cg%20id%3D%22external_link%22%20sketch%3Atype%3D%22MSLayerGroup%22%20fill%3D%22%23ccc%22%3E%0A%20%20%20%20%20%20%20%20%20%20%20%20%3Cpath%20d%3D%22M10%2C8%20C9.447%2C8%209%2C8.448%209%2C9%20L9%2C12%20L2%2C12%20L2%2C5%20L5%2C5%20C5.553%2C5%206%2C4.552%206%2C4%20C6%2C3.448%205.553%2C3%205%2C3%20L1%2C3%20C0.447%2C3%200%2C3.448%200%2C4%20L0%2C13%20C0%2C13.552%200.447%2C14%201%2C14%20L10%2C14%20C10.553%2C14%2011%2C13.552%2011%2C13%20L11%2C9%20C11%2C8.448%2010.553%2C8%2010%2C8%20L10%2C8%20Z%22%20id%3D%22Shape%22%20sketch%3Atype%3D%22MSShapeGroup%22%3E%3C%2Fpath%3E%0A%20%20%20%20%20%20%20%20%20%20%20%20%3Cpath%20d%3D%22M13%2C0%20L8%2C0%20C7.447%2C0%207%2C0.448%207%2C1%20C7%2C1.552%207.447%2C2%208%2C2%20L10.586%2C2%20L4.293%2C8.293%20C3.902%2C8.684%203.902%2C9.316%204.293%2C9.707%20C4.488%2C9.902%204.744%2C10%205%2C10%20C5.256%2C10%205.512%2C9.902%205.707%2C9.707%20L12%2C3.414%20L12%2C6%20C12%2C6.552%2012.447%2C7%2013%2C7%20C13.553%2C7%2014%2C6.552%2014%2C6%20L14%2C1%20C14%2C0.448%2013.553%2C0%2013%2C0%20L13%2C0%20Z%22%20id%3D%22Shape%22%20sketch%3Atype%3D%22MSShapeGroup%22%3E%3C%2Fpath%3E%0A%20%20%20%20%20%20%20%20%3C%2Fg%3E%0A%20%20%20%20%3C%2Fg%3E%0A%3C%2Fsvg%3E"); background-origin: initial; background-position: right 0.3em; background-repeat: no-repeat; background-size: 10.5px; color: #2b6dad; padding-right: 15px; text-decoration: none;" target="_blank">Deep Learning in Neural Networks: An Overview</a></span></div><div class="qtext_para" style="color: #333333; font-family: Georgia, Times, "Times New Roman", serif; font-size: 15px; line-height: 21px; margin-bottom: 1em; padding: 0px;">[11] <span class="qlink_container"><a class="external_link" data-tooltip="attached" href="https://cloudplatform.googleblog.com/2016/05/Google-supercharges-machine-learning-tasks-with-custom-chip.html" rel="noopener nofollow" style="background-attachment: initial; background-clip: initial; background-image: url("data:image/svg+xml,%3C%3Fxml%20version%3D%221.0%22%20encoding%3D%22UTF-8%22%20standalone%3D%22no%22%3F%3E%0A%3Csvg%20width%3D%2214px%22%20height%3D%2214px%22%20viewBox%3D%220%200%2014%2014%22%20version%3D%221.1%22%20xmlns%3D%22http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%22%20xmlns%3Axlink%3D%22http%3A%2F%2Fwww.w3.org%2F1999%2Fxlink%22%20xmlns%3Asketch%3D%22http%3A%2F%2Fwww.bohemiancoding.com%2Fsketch%2Fns%22%3E%0A%20%20%20%20%3C!--%20Generator%3A%20Sketch%203.5.2%20(25235)%20-%20http%3A%2F%2Fwww.bohemiancoding.com%2Fsketch%20--%3E%0A%20%20%20%20%3Ctitle%3Eexternal_link%3C%2Ftitle%3E%0A%20%20%20%20%3Cdesc%3ECreated%20with%20Sketch.%3C%2Fdesc%3E%0A%20%20%20%20%3Cdefs%3E%3C%2Fdefs%3E%0A%20%20%20%20%3Cg%20id%3D%22Page-1%22%20stroke%3D%22none%22%20stroke-width%3D%221%22%20fill%3D%22none%22%20fill-rule%3D%22evenodd%22%20sketch%3Atype%3D%22MSPage%22%3E%0A%20%20%20%20%20%20%20%20%3Cg%20id%3D%22external_link%22%20sketch%3Atype%3D%22MSLayerGroup%22%20fill%3D%22%23ccc%22%3E%0A%20%20%20%20%20%20%20%20%20%20%20%20%3Cpath%20d%3D%22M10%2C8%20C9.447%2C8%209%2C8.448%209%2C9%20L9%2C12%20L2%2C12%20L2%2C5%20L5%2C5%20C5.553%2C5%206%2C4.552%206%2C4%20C6%2C3.448%205.553%2C3%205%2C3%20L1%2C3%20C0.447%2C3%200%2C3.448%200%2C4%20L0%2C13%20C0%2C13.552%200.447%2C14%201%2C14%20L10%2C14%20C10.553%2C14%2011%2C13.552%2011%2C13%20L11%2C9%20C11%2C8.448%2010.553%2C8%2010%2C8%20L10%2C8%20Z%22%20id%3D%22Shape%22%20sketch%3Atype%3D%22MSShapeGroup%22%3E%3C%2Fpath%3E%0A%20%20%20%20%20%20%20%20%20%20%20%20%3Cpath%20d%3D%22M13%2C0%20L8%2C0%20C7.447%2C0%207%2C0.448%207%2C1%20C7%2C1.552%207.447%2C2%208%2C2%20L10.586%2C2%20L4.293%2C8.293%20C3.902%2C8.684%203.902%2C9.316%204.293%2C9.707%20C4.488%2C9.902%204.744%2C10%205%2C10%20C5.256%2C10%205.512%2C9.902%205.707%2C9.707%20L12%2C3.414%20L12%2C6%20C12%2C6.552%2012.447%2C7%2013%2C7%20C13.553%2C7%2014%2C6.552%2014%2C6%20L14%2C1%20C14%2C0.448%2013.553%2C0%2013%2C0%20L13%2C0%20Z%22%20id%3D%22Shape%22%20sketch%3Atype%3D%22MSShapeGroup%22%3E%3C%2Fpath%3E%0A%20%20%20%20%20%20%20%20%3C%2Fg%3E%0A%20%20%20%20%3C%2Fg%3E%0A%3C%2Fsvg%3E"); background-origin: initial; background-position: right 0.3em; background-repeat: no-repeat; background-size: 10.5px; color: #2b6dad; padding-right: 15px; text-decoration: none;" target="_blank">Google supercharges machine learning tasks with TPU custom chip</a></span></div><div class="qtext_para" style="color: #333333; font-family: Georgia, Times, "Times New Roman", serif; font-size: 15px; line-height: 21px; margin-bottom: 1em; padding: 0px;">[12] <span class="qlink_container"><a class="external_link" href="http://arxiv.org/pdf/1410.1090" rel="noopener nofollow" style="background-attachment: initial; background-clip: initial; background-image: url("data:image/svg+xml,%3C%3Fxml%20version%3D%221.0%22%20encoding%3D%22UTF-8%22%20standalone%3D%22no%22%3F%3E%0A%3Csvg%20width%3D%2214px%22%20height%3D%2214px%22%20viewBox%3D%220%200%2014%2014%22%20version%3D%221.1%22%20xmlns%3D%22http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%22%20xmlns%3Axlink%3D%22http%3A%2F%2Fwww.w3.org%2F1999%2Fxlink%22%20xmlns%3Asketch%3D%22http%3A%2F%2Fwww.bohemiancoding.com%2Fsketch%2Fns%22%3E%0A%20%20%20%20%3C!--%20Generator%3A%20Sketch%203.5.2%20(25235)%20-%20http%3A%2F%2Fwww.bohemiancoding.com%2Fsketch%20--%3E%0A%20%20%20%20%3Ctitle%3Eexternal_link%3C%2Ftitle%3E%0A%20%20%20%20%3Cdesc%3ECreated%20with%20Sketch.%3C%2Fdesc%3E%0A%20%20%20%20%3Cdefs%3E%3C%2Fdefs%3E%0A%20%20%20%20%3Cg%20id%3D%22Page-1%22%20stroke%3D%22none%22%20stroke-width%3D%221%22%20fill%3D%22none%22%20fill-rule%3D%22evenodd%22%20sketch%3Atype%3D%22MSPage%22%3E%0A%20%20%20%20%20%20%20%20%3Cg%20id%3D%22external_link%22%20sketch%3Atype%3D%22MSLayerGroup%22%20fill%3D%22%23ccc%22%3E%0A%20%20%20%20%20%20%20%20%20%20%20%20%3Cpath%20d%3D%22M10%2C8%20C9.447%2C8%209%2C8.448%209%2C9%20L9%2C12%20L2%2C12%20L2%2C5%20L5%2C5%20C5.553%2C5%206%2C4.552%206%2C4%20C6%2C3.448%205.553%2C3%205%2C3%20L1%2C3%20C0.447%2C3%200%2C3.448%200%2C4%20L0%2C13%20C0%2C13.552%200.447%2C14%201%2C14%20L10%2C14%20C10.553%2C14%2011%2C13.552%2011%2C13%20L11%2C9%20C11%2C8.448%2010.553%2C8%2010%2C8%20L10%2C8%20Z%22%20id%3D%22Shape%22%20sketch%3Atype%3D%22MSShapeGroup%22%3E%3C%2Fpath%3E%0A%20%20%20%20%20%20%20%20%20%20%20%20%3Cpath%20d%3D%22M13%2C0%20L8%2C0%20C7.447%2C0%207%2C0.448%207%2C1%20C7%2C1.552%207.447%2C2%208%2C2%20L10.586%2C2%20L4.293%2C8.293%20C3.902%2C8.684%203.902%2C9.316%204.293%2C9.707%20C4.488%2C9.902%204.744%2C10%205%2C10%20C5.256%2C10%205.512%2C9.902%205.707%2C9.707%20L12%2C3.414%20L12%2C6%20C12%2C6.552%2012.447%2C7%2013%2C7%20C13.553%2C7%2014%2C6.552%2014%2C6%20L14%2C1%20C14%2C0.448%2013.553%2C0%2013%2C0%20L13%2C0%20Z%22%20id%3D%22Shape%22%20sketch%3Atype%3D%22MSShapeGroup%22%3E%3C%2Fpath%3E%0A%20%20%20%20%20%20%20%20%3C%2Fg%3E%0A%20%20%20%20%3C%2Fg%3E%0A%3C%2Fsvg%3E"); background-origin: initial; background-position: right 0.3em; background-repeat: no-repeat; background-size: 10.5px; color: #2b6dad; padding-right: 15px; text-decoration: none;" target="_blank">http://arxiv.org/pdf/1410<wbr></wbr>.1090</a></span></div><div class="qtext_para" style="color: #333333; font-family: Georgia, Times, "Times New Roman", serif; font-size: 15px; line-height: 21px; margin-bottom: 1em; padding: 0px;">[13] <span class="qlink_container"><a class="external_link" data-tooltip="attached" href="http://arxiv.org/abs/1508.06576" rel="noopener nofollow" style="background-attachment: initial; background-clip: initial; background-image: url("data:image/svg+xml,%3C%3Fxml%20version%3D%221.0%22%20encoding%3D%22UTF-8%22%20standalone%3D%22no%22%3F%3E%0A%3Csvg%20width%3D%2214px%22%20height%3D%2214px%22%20viewBox%3D%220%200%2014%2014%22%20version%3D%221.1%22%20xmlns%3D%22http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%22%20xmlns%3Axlink%3D%22http%3A%2F%2Fwww.w3.org%2F1999%2Fxlink%22%20xmlns%3Asketch%3D%22http%3A%2F%2Fwww.bohemiancoding.com%2Fsketch%2Fns%22%3E%0A%20%20%20%20%3C!--%20Generator%3A%20Sketch%203.5.2%20(25235)%20-%20http%3A%2F%2Fwww.bohemiancoding.com%2Fsketch%20--%3E%0A%20%20%20%20%3Ctitle%3Eexternal_link%3C%2Ftitle%3E%0A%20%20%20%20%3Cdesc%3ECreated%20with%20Sketch.%3C%2Fdesc%3E%0A%20%20%20%20%3Cdefs%3E%3C%2Fdefs%3E%0A%20%20%20%20%3Cg%20id%3D%22Page-1%22%20stroke%3D%22none%22%20stroke-width%3D%221%22%20fill%3D%22none%22%20fill-rule%3D%22evenodd%22%20sketch%3Atype%3D%22MSPage%22%3E%0A%20%20%20%20%20%20%20%20%3Cg%20id%3D%22external_link%22%20sketch%3Atype%3D%22MSLayerGroup%22%20fill%3D%22%23ccc%22%3E%0A%20%20%20%20%20%20%20%20%20%20%20%20%3Cpath%20d%3D%22M10%2C8%20C9.447%2C8%209%2C8.448%209%2C9%20L9%2C12%20L2%2C12%20L2%2C5%20L5%2C5%20C5.553%2C5%206%2C4.552%206%2C4%20C6%2C3.448%205.553%2C3%205%2C3%20L1%2C3%20C0.447%2C3%200%2C3.448%200%2C4%20L0%2C13%20C0%2C13.552%200.447%2C14%201%2C14%20L10%2C14%20C10.553%2C14%2011%2C13.552%2011%2C13%20L11%2C9%20C11%2C8.448%2010.553%2C8%2010%2C8%20L10%2C8%20Z%22%20id%3D%22Shape%22%20sketch%3Atype%3D%22MSShapeGroup%22%3E%3C%2Fpath%3E%0A%20%20%20%20%20%20%20%20%20%20%20%20%3Cpath%20d%3D%22M13%2C0%20L8%2C0%20C7.447%2C0%207%2C0.448%207%2C1%20C7%2C1.552%207.447%2C2%208%2C2%20L10.586%2C2%20L4.293%2C8.293%20C3.902%2C8.684%203.902%2C9.316%204.293%2C9.707%20C4.488%2C9.902%204.744%2C10%205%2C10%20C5.256%2C10%205.512%2C9.902%205.707%2C9.707%20L12%2C3.414%20L12%2C6%20C12%2C6.552%2012.447%2C7%2013%2C7%20C13.553%2C7%2014%2C6.552%2014%2C6%20L14%2C1%20C14%2C0.448%2013.553%2C0%2013%2C0%20L13%2C0%20Z%22%20id%3D%22Shape%22%20sketch%3Atype%3D%22MSShapeGroup%22%3E%3C%2Fpath%3E%0A%20%20%20%20%20%20%20%20%3C%2Fg%3E%0A%20%20%20%20%3C%2Fg%3E%0A%3C%2Fsvg%3E"); background-origin: initial; background-position: right 0.3em; background-repeat: no-repeat; background-size: 10.5px; color: #2b6dad; padding-right: 15px; text-decoration: none;" target="_blank">[1508.06576] A Neural Algorithm of Artistic Style</a></span></div><div class="qtext_para" style="color: #333333; font-family: Georgia, Times, "Times New Roman", serif; font-size: 15px; line-height: 21px; margin-bottom: 1em; padding: 0px;">[14] <span class="qlink_container"><a class="external_link" data-tooltip="attached" href="http://www.hexahedria.com/2015/08/03/composing-music-with-recurrent-neural-networks/" rel="noopener nofollow" style="background-attachment: initial; background-clip: initial; background-image: url("data:image/svg+xml,%3C%3Fxml%20version%3D%221.0%22%20encoding%3D%22UTF-8%22%20standalone%3D%22no%22%3F%3E%0A%3Csvg%20width%3D%2214px%22%20height%3D%2214px%22%20viewBox%3D%220%200%2014%2014%22%20version%3D%221.1%22%20xmlns%3D%22http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%22%20xmlns%3Axlink%3D%22http%3A%2F%2Fwww.w3.org%2F1999%2Fxlink%22%20xmlns%3Asketch%3D%22http%3A%2F%2Fwww.bohemiancoding.com%2Fsketch%2Fns%22%3E%0A%20%20%20%20%3C!--%20Generator%3A%20Sketch%203.5.2%20(25235)%20-%20http%3A%2F%2Fwww.bohemiancoding.com%2Fsketch%20--%3E%0A%20%20%20%20%3Ctitle%3Eexternal_link%3C%2Ftitle%3E%0A%20%20%20%20%3Cdesc%3ECreated%20with%20Sketch.%3C%2Fdesc%3E%0A%20%20%20%20%3Cdefs%3E%3C%2Fdefs%3E%0A%20%20%20%20%3Cg%20id%3D%22Page-1%22%20stroke%3D%22none%22%20stroke-width%3D%221%22%20fill%3D%22none%22%20fill-rule%3D%22evenodd%22%20sketch%3Atype%3D%22MSPage%22%3E%0A%20%20%20%20%20%20%20%20%3Cg%20id%3D%22external_link%22%20sketch%3Atype%3D%22MSLayerGroup%22%20fill%3D%22%23ccc%22%3E%0A%20%20%20%20%20%20%20%20%20%20%20%20%3Cpath%20d%3D%22M10%2C8%20C9.447%2C8%209%2C8.448%209%2C9%20L9%2C12%20L2%2C12%20L2%2C5%20L5%2C5%20C5.553%2C5%206%2C4.552%206%2C4%20C6%2C3.448%205.553%2C3%205%2C3%20L1%2C3%20C0.447%2C3%200%2C3.448%200%2C4%20L0%2C13%20C0%2C13.552%200.447%2C14%201%2C14%20L10%2C14%20C10.553%2C14%2011%2C13.552%2011%2C13%20L11%2C9%20C11%2C8.448%2010.553%2C8%2010%2C8%20L10%2C8%20Z%22%20id%3D%22Shape%22%20sketch%3Atype%3D%22MSShapeGroup%22%3E%3C%2Fpath%3E%0A%20%20%20%20%20%20%20%20%20%20%20%20%3Cpath%20d%3D%22M13%2C0%20L8%2C0%20C7.447%2C0%207%2C0.448%207%2C1%20C7%2C1.552%207.447%2C2%208%2C2%20L10.586%2C2%20L4.293%2C8.293%20C3.902%2C8.684%203.902%2C9.316%204.293%2C9.707%20C4.488%2C9.902%204.744%2C10%205%2C10%20C5.256%2C10%205.512%2C9.902%205.707%2C9.707%20L12%2C3.414%20L12%2C6%20C12%2C6.552%2012.447%2C7%2013%2C7%20C13.553%2C7%2014%2C6.552%2014%2C6%20L14%2C1%20C14%2C0.448%2013.553%2C0%2013%2C0%20L13%2C0%20Z%22%20id%3D%22Shape%22%20sketch%3Atype%3D%22MSShapeGroup%22%3E%3C%2Fpath%3E%0A%20%20%20%20%20%20%20%20%3C%2Fg%3E%0A%20%20%20%20%3C%2Fg%3E%0A%3C%2Fsvg%3E"); background-origin: initial; background-position: right 0.3em; background-repeat: no-repeat; background-size: 10.5px; color: #2b6dad; padding-right: 15px; text-decoration: none;" target="_blank">Composing Music With Recurrent Neural Networks</a></span></div><div class="qtext_para" style="color: #333333; font-family: Georgia, Times, "Times New Roman", serif; font-size: 15px; line-height: 21px; padding: 0px;">[15] <span class="qlink_container"><a class="external_link" href="https://www.robinsloan.com/notes/writing-with-the-machine/" rel="noopener nofollow" style="background-attachment: initial; background-clip: initial; background-image: url("data:image/svg+xml,%3C%3Fxml%20version%3D%221.0%22%20encoding%3D%22UTF-8%22%20standalone%3D%22no%22%3F%3E%0A%3Csvg%20width%3D%2214px%22%20height%3D%2214px%22%20viewBox%3D%220%200%2014%2014%22%20version%3D%221.1%22%20xmlns%3D%22http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%22%20xmlns%3Axlink%3D%22http%3A%2F%2Fwww.w3.org%2F1999%2Fxlink%22%20xmlns%3Asketch%3D%22http%3A%2F%2Fwww.bohemiancoding.com%2Fsketch%2Fns%22%3E%0A%20%20%20%20%3C!--%20Generator%3A%20Sketch%203.5.2%20(25235)%20-%20http%3A%2F%2Fwww.bohemiancoding.com%2Fsketch%20--%3E%0A%20%20%20%20%3Ctitle%3Eexternal_link%3C%2Ftitle%3E%0A%20%20%20%20%3Cdesc%3ECreated%20with%20Sketch.%3C%2Fdesc%3E%0A%20%20%20%20%3Cdefs%3E%3C%2Fdefs%3E%0A%20%20%20%20%3Cg%20id%3D%22Page-1%22%20stroke%3D%22none%22%20stroke-width%3D%221%22%20fill%3D%22none%22%20fill-rule%3D%22evenodd%22%20sketch%3Atype%3D%22MSPage%22%3E%0A%20%20%20%20%20%20%20%20%3Cg%20id%3D%22external_link%22%20sketch%3Atype%3D%22MSLayerGroup%22%20fill%3D%22%23ccc%22%3E%0A%20%20%20%20%20%20%20%20%20%20%20%20%3Cpath%20d%3D%22M10%2C8%20C9.447%2C8%209%2C8.448%209%2C9%20L9%2C12%20L2%2C12%20L2%2C5%20L5%2C5%20C5.553%2C5%206%2C4.552%206%2C4%20C6%2C3.448%205.553%2C3%205%2C3%20L1%2C3%20C0.447%2C3%200%2C3.448%200%2C4%20L0%2C13%20C0%2C13.552%200.447%2C14%201%2C14%20L10%2C14%20C10.553%2C14%2011%2C13.552%2011%2C13%20L11%2C9%20C11%2C8.448%2010.553%2C8%2010%2C8%20L10%2C8%20Z%22%20id%3D%22Shape%22%20sketch%3Atype%3D%22MSShapeGroup%22%3E%3C%2Fpath%3E%0A%20%20%20%20%20%20%20%20%20%20%20%20%3Cpath%20d%3D%22M13%2C0%20L8%2C0%20C7.447%2C0%207%2C0.448%207%2C1%20C7%2C1.552%207.447%2C2%208%2C2%20L10.586%2C2%20L4.293%2C8.293%20C3.902%2C8.684%203.902%2C9.316%204.293%2C9.707%20C4.488%2C9.902%204.744%2C10%205%2C10%20C5.256%2C10%205.512%2C9.902%205.707%2C9.707%20L12%2C3.414%20L12%2C6%20C12%2C6.552%2012.447%2C7%2013%2C7%20C13.553%2C7%2014%2C6.552%2014%2C6%20L14%2C1%20C14%2C0.448%2013.553%2C0%2013%2C0%20L13%2C0%20Z%22%20id%3D%22Shape%22%20sketch%3Atype%3D%22MSShapeGroup%22%3E%3C%2Fpath%3E%0A%20%20%20%20%20%20%20%20%3C%2Fg%3E%0A%20%20%20%20%3C%2Fg%3E%0A%3C%2Fsvg%3E"); background-origin: initial; background-position: right 0.3em; background-repeat: no-repeat; background-size: 10.5px; color: #2b6dad; padding-right: 15px; text-decoration: none;" target="_blank">Writing with the machine</a></span></div></div>Ericnoreply@blogger.com9tag:blogger.com,1999:blog-842965756326639856.post-63416963029878879362016-07-11T06:34:00.002-07:002016-09-04T09:23:58.819-07:00How to Get an Internship<i>Update: 9/3/2016 - Denis Tarasov of HSE in Moscow, Russia has kindly translated this article into Russian - <a href="https://habrahabr.ru/post/309044/">read it here</a>. I welcome translations into other languages!</i><br /><div style="background-color: white; font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt;"><span class="im" style="color: #500050;"></span></div><br />About a year ago, I wrote a <a href="http://blog.evjang.com/2016/06/my-internship-experiences-at-pixar.html">blog post</a> about my various internship experiences. It ended up being quite popular with recruiters, and actually helped me to land my full-time job at Google.<br /><br />I've also been getting emails from students seeking internship advice. Every time I get one of these, my ego approximately doubles in size. Thank you.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://s2.quickmeme.com/img/6e/6e5a8cad178e65f58dc01219891da8a953d94d97ef0a712feb3c2d8430e45c4c.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://s2.quickmeme.com/img/6e/6e5a8cad178e65f58dc01219891da8a953d94d97ef0a712feb3c2d8430e45c4c.jpg" height="266" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div>In this post, I'll share my strategy for landing tech internships. I've wanted to write this for some time now, but I've been hesitant to posture some kind of "magic recipe" when a lot of my own success was mostly due to luck.<br /><br />I'm just another fresh graduate trying to <a href="http://3.bp.blogspot.com/-kJyrRAEO4pc/Tn2tbIStz6I/AAAAAAAAF3A/vJ1mwJ4YIoY/s1600/Graduate_004Pyxurz.jpg" target="_blank">figure</a> <a href="http://littlefun.org/uploads/520be02ac856117033000007_736.jpg" target="_blank">things</a> <a href="http://memesvault.com/wp-content/uploads/Confused-Baby-Meme-Blank-20.jpg" target="_blank">out</a>, and here's what I believe in:<br /><br /><h2><b>#1 Work on Side Projects</b></h2><br />You don't need<i> </i>to be at Google to work on the kinds of problems that Google interns do, nor do you need to work at a hedge fund to learn about finance. Pursue those interests on your own!<br /><br /><div>Want to try animation? Here are some project ideas:<br /><br /><ul><li>Make a 30 second short film in <a href="http://www.autodesk.com/education/free-software/all" target="_blank">Autodesk Maya</a> (free for students) or <a href="https://www.blender.org/" target="_blank">Blender 3D</a> (free for everybody)</li><li>Do a <a href="http://www.11secondclub.com/" target="_blank">11 Second Club</a> animation. </li><li>Make something cool with Pixar's own <a href="https://renderman.pixar.com/view/renderman" target="_blank">Renderman software</a> (free for non-commercial use). I'll bet less than 1% of the resumes that Pixar receives from students list experience with Renderman.</li><li>Draw something on <a href="http://shadertoy.com/" target="_blank">ShaderToy</a>.</li><li>Implement a physically-based rendering algorithm.</li></ul></div><div><br /></div><div>Want to be a software engineer?</div><ul><li>Make an Android / iOS app from scratch (Android learning curve is easier). </li><li>Learn how to use <a href="http://aws.amazon.com/free" target="_blank">Amazon Web Services</a> or <a href="http://cloud.google.com/" target="_blank">Google Cloud Platform</a>. </li><li>Open source your work. A Managing Director at D. E. Shaw once told me that "<a href="http://github.com/" target="_blank">Github</a> is the new resume".</li><li>Check out <a href="https://news.ycombinator.com/show" target="_blank">Show HN</a> to see what projects other folks are working on.</li></ul><br />Finance:<br /><br /><ul><li>Participate in a <a href="https://www.kaggle.com/" target="_blank">Kaggle competition</a>. Get your first-hand experience with overfitting.</li><li>Do some financial market research on <a href="https://www.quantopian.com/" target="_blank">Quantopian</a>. This is the kind of work that real quants do all day. </li><li>Contribute to <a href="https://github.com/twosigma" target="_blank">open source projects</a> like Beaker and Satellite. Who knows, you might even impress someone inside the company.</li></ul><br />Working on side projects accomplishes several objectives simultaneously:<br /><div><ul><li>It builds your brand (see #2).</li><li>It shows the hiring committee that you are willing to hone your craft on your own time, instead of merely trading your time for their money and status.</li><li>It's a low-risk way to find out if you're <i>actually </i>interested in the field.</li><li>In the process of building stuff, you might re-discover important theoretical and engineering challenges that professionals grapple with. In my sophomore year, I wrote a <a href="https://github.com/ericjang/cryptocurrency_arbitrage" target="_blank">Bitcoin- arbitrage bot</a> in Python. Bitcoin exchanges list the price and volume of all open limit orders in the book, while actual financial markets do not. This results in a very fundamental difference in the way <a href="https://en.wikipedia.org/wiki/Market_impact" target="_blank">Market Impact</a> is treated, and gave me something interesting to talk about during my Two Sigma interviews. What I learned was super elementary, but still more practical experience than most candidates.</li></ul></div><br />Don't worry about your projects being impressive or even novel - just focus on improving your skills and exercising your creativity. A little bit of experience using a company's products and technologies will give you a <b>huge </b>edge over other candidates.<br /><br />Start as early as you can. The job application process doesn't begin during the fall recruiting season; it begins as soon as you want it to.<br /><div><br /><h2><b>#2 Make Your Own Website</b></h2><br />Here's a secret: the more you market yourself, the more recruiters will reach out to you. Building your own personal website will make you extremely visible.<br /><br />Your website is basically a resume in long-form, but also functions as your personal brand. Here are some screenshots of other people's sites:<br /><div><br /></div><div class="separator" style="clear: both; text-align: center;"></div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-jEbsbsLzemo/V4KdgWFX23I/AAAAAAAAFDQ/hVgvJibszq0KdXHNYVGvjXMPFOzsLt0GwCLcB/s1600/portfolios.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="554" src="https://4.bp.blogspot.com/-jEbsbsLzemo/V4KdgWFX23I/AAAAAAAAFDQ/hVgvJibszq0KdXHNYVGvjXMPFOzsLt0GwCLcB/s640/portfolios.png" width="640" /></a></div><br /></div><div><br /></div><div>Your website should accomplish several things:<br /><ul><li>Make it easy for recruiters to come across your portfolio via Google Search.</li><li>Reveal your personality in ways that a 1-page resume cannot. In particular, it's a great opportunity to showcase aesthetic sense and visual creativity.</li><li>If you're attractive, you should add a profile picture. A candid, smiling face is an extremely effective way to connect with a hiring committee.</li></ul>Platforms like Github Pages, Google App Engine, Wordpress, Weebly let you set up a website for free. Domain names are cheap - as little as $10 a year.<br /><div><br /></div>In addition to showcasing your coding projects, you should list a description of your work in a way that is accessible to people who can't read code. Better yet, write blog posts and tutorials for your projects - what you did and how you did it. Your site will get a lot more visibility if people find it <i>useful</i>.</div><div><br /></div><div>The story you tell through your website - the first impression that you make - is of utmost importance. Do it right, and recruiters will come like ants to a picnic. </div></div><br /><h2>#3 Study CS</h2><br />If you're not sure what you want to do in the long term, choose skills and experiences that give you the most flexibility in the future. I recommend studying some kind of math + CS degree (if you're more interested in research roles) or a illustration + CS double major (if you're more interested in joining the entertainment industry).<br /><br />I started my undergraduate education thinking I would study neuroscience, because "I could learn CS by myself." This was a big mistake:<br /><br /><ul><li>My resume got passed over in resume screens because I listed "neuroscience" as my major. I eventually got through by begging a Google recruiter to give me a chance with the phone interview. Afterwards, I switched to Applied Math-CS.</li><li>Getting good at CS requires lots of practice. School is a good place to do it.</li><li>Neuroscience in the classroom has not caught up to neuroscience in the lab. Cutting edge research is pretty much optogenetics or computational (which is more CS + math + physics than neuroscience anyway).</li></ul><br />More on the last point: I discovered that neuroscience students who knew how to program in MATLAB got to work directly on high-level research questions and interpret experimental data. Students who didn't ended up doing grunt work in the lab - dissecting tiny brains, pipetting liquids, and relying on others to code analysis routines for them.<br /><br />Neuroscience is not the only field that is being disrupted by technology; we will be seeing more "software-defined research" in the coming years. For better or worse, the scientists, doctors, lawyers of the future will all be programmers.<br /><br />Why is math important? Math gives you additional flexibility to break into hard-tech research roles, if you so desire. It's really hard to transition directly into an industry research team (such as Google Research or Microsoft Research) with only a CS undergrad degree.<br /><br />Even though I was able to get more exposure to math at my Two Sigma internship, I was unsuccessful at getting a quant research internship because my background typecasts me into software engineering roles. It is also my own grievous fault for not being better at math.<br /><br />If you want to work in film or games or even a product management role at a tech company, then studying math makes less sense; you should study illustration instead. I've noticed that at Pixar, many Technical Directors want to contribute more to story and art direction, but find themselves pigeonholed into specific roles (they have one "car guy", one "vegetation shading girl", and so on).<br /><br />Being good at illustration will help you break into more creative roles like Art Director or Story Artist. It's also flexible - illustrators are needed everywhere, from design to comics to games. <a href="https://www.supergiantgames.com/games/transistor/" target="_blank">Illustration + CS is a potent skillset</a>.<br /><br />Candidly, math is safer, more flexible, and more lucrative than illustration. It is also future-proof in ways that other valuable degrees (such as design, law, and business) are not. That said, I find art incredibly valuable and continue practicing it as a hobby.<br /><br />In any case, study CS. It will feed you and pay off your student debts and open so many doors. Don't be discouraged if you find CS difficult, or if your classmates seem to be way better at it than you. It wasn't until my third attempt to learn programming that things started to stick in my head.<br /><br />Stick with CS, and the <a href="http://facebook.com/" target="_blank">sky's</a> <a href="http://uber.com/" target="_blank">the</a> <a href="http://snapchat.com/" target="_blank">limit</a>.<br /><h2><b><br /></b></h2><h2><b>#4 Seek Diverse, Contrarian Experiences</b></h2><br />Your coursework, extracurriculars, and internship experiences will have a big impact on your creative process. Diverse experiences enable you to approach problems differently than others, which will make you unique and harder to replace.<br /><br />Pursue courses outside your major and let them inspire your projects. I don't mean this in the sense of "combining fields for the sake of mixing your interests together," like some contrived Egyptology-Physics senior thesis (just a hypothetical example, no offense to those who do this).<br /><br />Instead, ideas from one field might lead to a real <i>competitive advantage</i> in another. For instance:<br /><br /><ul><li>It's been said that Reed College's Calligraphy Class was a formative experience in Steve Jobs's design-minded vision for Apple products.</li></ul><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="http://www.reed.edu/reed_magazine/june2011/articles/eliot_circular/images/582_reynolds2.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="http://www.reed.edu/reed_magazine/june2011/articles/eliot_circular/images/582_reynolds2.jpg" height="301" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">source: reed.edu</td></tr></tbody></table><br /><ul><li>John Lasseter and Ed Catmull believed that 3D computer graphics was not just a fancy artistic medium, but the future of animation itself. They were right.</li></ul><br /><div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://i.ytimg.com/vi/C-L-WA-nQzI/maxresdefault.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="300" src="https://i.ytimg.com/vi/C-L-WA-nQzI/maxresdefault.jpg" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Pixar's The Adventures of André and Wally B.</td></tr></tbody></table><ul><li>Here is an <a href="https://vimeo.com/47049144" target="_blank">elegant and beautiful explanation</a> of a Math proof using interpretive dance. Sometimes difficult concepts become strikingly clear when the right diagram is drawn.</li></ul><div><br /></div><div>Here's a personal anecdote: I did several years of computational neuroscience research in college, which shaped the way I think about debugging complicated simulations in Machine Learning. Inspired by this, I pitched a project idea to a ML professor at my school. He thought it was a terrible idea. I went ahead and <a href="https://github.com/ericjang/tdb" target="_blank">built it anyway</a>, and it actually got me my current job. </div><div><br /></div><div>Diverse experiences help you to discover original or even <a href="https://www.amazon.com/Zero-One-Notes-Startups-Future/dp/0804139296" target="_blank">contrarian ideas</a>. Find something that only you believe to be true. If you're right, the upside is enormous. </div><h2><b><br /></b></h2><h2><b>#5 Plan your next 10 years</b></h2><div>Everybody's got dreams.<br /><br />Some people dream of creating <a href="https://en.wikipedia.org/wiki/Artificial_general_intelligence" target="_blank">Strong AI</a>, some want to make it to the Forbes 30 under 30 list, some want to be parents by the age of 32, some just want to make it to tomorrow.<br /><br />It's really important, even as a college student applying for internships, to reflect on what you want and where you want to be in the long-term. Time is so precious; don't waste any time at a job that isn't growing the skills you want. It's okay to be unsure of what you want to do with your life, but at least write down a list of life/career trajectories that you think will make you happy.<br /><div></div></div><div><br /></div><div>Every so often, re-evaluate your long-term goals and whether the position you're in is taking you there or growing the skills that you want. Some questions to ask yourself:</div><div><ul><li>How will I pay off my student debt?</li><li>Can I see myself doing pure software engineering (frontend, backend, mobile apps) for the remainder of my career? </li><li>How long do I see myself working at my current employer?</li><li>Do I want to transition into more math-y roles like ML research or quantitative finance?</li><li>Do I want to transition into a product management or leadership role?</li><li>Do I want to start my own company someday? Am I okay exchanging coding and making stuff, for the privilege of running a company?</li><li>Do I want to become a Venture Capitalist someday?</li><li>If I plan to have kids by the time I'm 32 - where do I want to be? Who do I want to be with?</li><li>If I keep doing this, will I be happy in ten years? </li></ul></div><div><br /></div><div>Finally, when making plans, don't take your physical, mental, or financial health for granted - have a backup plan in case your best laid plans go awry.</div><div><br /></div><div><b><br /></b></div><div><b>######################### PART 2 ##########################</b></div><div><b><br /></b></div><div>95% of playing the internship game is what I've listed above. The remaining 5% is the actual interview process.</div><div><br /></div><h2><b>#6 Skip the Resume Screen</b></h2><br />The first stage of most internship applications is a resume screen. The recruiter, who must sift through a huge stack of applications, glances at your resume for about six seconds, then either recycles it or sends you a follow up email.<br /><br />SIX SECONDS! That's just enough time do pattern matching for brand-name schools, tech company names, and what programming languages you know. The recruiter will also make a snap judgment just based on how neat and pretty your resume looks. Consequently, resume screens are pretty noisy when it comes to judging inexperienced college students.<br /><br />Fortunately, there are a couple ways to skip the resume screen entirely:<br /><ul><li>If you get a referral from someone inside the company, recruiters will consider your application more carefully. If your resume is not horrible to look at, you'll almost certainly make it to the next stage. I was lucky enough to get referrals for Pixar and Two Sigma. However, these are stories for another day ;)</li><li>If you are an underrepresented minority (URM) in Technology, companies are bending over backwards to get you to pass their interviews. At conferences like <a href="http://ghc.anitaborg.org/">Grace Hopper</a>, you can actually get a free pass out of the resume screening and the phone screen, and do on-the-spot whiteboard interviews with companies like Apple, Facebook, Google, Pinterest, etc. This improves the odds of landing an internship <i>dramatically</i>. A classmate of mine actually got an internship offer from Apple, on the spot, with only her resume (no interview or anything). Reach out to your computer science department and ask if they would sponsor your attendance.</li><li>Reach out to engineers directly through your school alumni network, and ask them to refer you. Don't be shy - it's very little work on their part and they will get nice a referral bonus if you succeed. The worst thing that could happen is that they ignore you, which doesn't cost you anything.</li></ul><div><br /></div><div>It goes without saying that your resume should be on point: everything perfectly aligned and legible with zero typos. Tailor each resume for the company that you are applying to.</div><div><br /></div><div>Tech companies that visit college campuses will often hold resume review sessions for students (Yelp, Microsoft, Google do this). This is super useful, and you should use this resource even if it's with a company you don't want to work for. Not surprisingly, tech recruiters give better industry-specific advice than college career counselors. </div><div><br /></div><div>If at all possible, skip the resume screen. In fact, if you have an offer deadline coming up, companies will often fast-track you straight to the on-site interview. The resume screen and phone interviews are just qualifiers for the on-site, which pretty much solely determines whether you make it in or not. <a href="https://medium.com/the-mission/career-advice-no-one-tells-you-8be1bcd330cb#.uly1b1h4j" target="_blank">Don't go through the front door</a>.</div><h2><br /><b>#7 Phone and On-Site Interviews</b></h2><br />After the noisy resume screen, you are the master of your own fate. Typically there are one or two phone interviews followed by an on-site 5-hour interview. The phone interviews are like miniature versions of on-site interviews, where you write code on a Google Doc or Etherpad.<br /><br />All that matters at this point is how well you solve the coding challenges. If you do solve the problems quickly and correctly, and your behavior doesn't set off any <a href="https://www.quora.com/What-are-some-of-the-biggest-red-flags-in-an-interviewee-1" target="_blank">red flags</a>, you'll probably get the job.<br /><br />My experience is that difficulty of the interview is roughly correlated with firm's selectivity and salary. The hardest interviews I've had were with Google Deepmind, D. E. Shaw, Two Sigma, Quora, and Vatic Labs (startups interviews tend to be pretty rigorous because their hiring decisions are riskier).<br /><br />Google and Facebook were about medium in difficulty. I didn't interview for Pixar's software engineering role, so that interview was all behavioral and very easy. I've heard that Jane Street interviews are the hardest technically (apparently very popular among MIT students).<br /><br /><a href="https://www.amazon.com/Cracking-Coding-Interview-Programming-Questions/dp/098478280X" target="_blank">Cracking the Coding Interview</a> is the only book you'll ever need. The practice problems are about the right level of difficulty for all software engineering roles I've ever interviewed with, and the advice is superb.<br /><br />Finance firms like D.E. Shaw and Jane Street like to ask more math-oriented questions. I recommend these three books (in decreasing order of difficulty):<br /><br /><ul><li><a href="https://www.amazon.com/Practical-Guide-Quantitative-Finance-Interviews/dp/1438236662/ref=sr_1_1?s=books&ie=UTF8&qid=1468189193&sr=1-1&keywords=a+practical+guide+to+quantitative+finance+interviews" target="_blank">A Practical Guide to Quantitative Finance Interviews</a></li><li><a href="https://www.amazon.com/Quant-Interview-Questions-Answers-Second/dp/0987122827/ref=sr_1_1?s=books&ie=UTF8&qid=1468189168&sr=1-1&keywords=quant+job+interview+questions+and+answers" target="_blank">Quant Job Interview Questions and Answers</a></li><li><a href="https://www.amazon.com/Heard-Street-Quantitative-Questions-Interviews/dp/0970055234" target="_blank">Heard on the Street</a></li></ul><br />Preparing for whiteboard interviews is like studying for the SATs - a complete waste of time, but important enough that you gotta do it. There are <a href="https://triplebyte.com/" target="_blank">some</a> <a href="https://www.starfighters.io/" target="_blank">startups</a> that are trying to disrupt the broken interview system, but I am uncertain if they will ever be successful.<br /><br />On the behavioral side: be humble, be confident, smile a lot, ask good questions. Wear smart casual. Here's a trick to smiling often: every few seconds, imagine that the interviewer just extended you a job offer.<br /><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://pixst.files.wordpress.com/2011/03/mg_5573.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="213" src="https://pixst.files.wordpress.com/2011/03/mg_5573.jpg" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">"Congratulations, you got the job!"</td></tr></tbody></table><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="http://66.media.tumblr.com/71baa1677e3855544b060472b7a33c7e/tumblr_mnmjtxIH3W1s7mpy3o1_500.gif" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="http://66.media.tumblr.com/71baa1677e3855544b060472b7a33c7e/tumblr_mnmjtxIH3W1s7mpy3o1_500.gif" height="224" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">"Congratulations, you got the job!"</td></tr></tbody></table><br /><h2><b>#8 Be Old</b></h2><div class="separator" style="clear: both; text-align: center;"><br /></div>It's WAY easier to get internships as a rising junior or senior in college.<br /><br />Interning at Google/Facebook as a first-year is pretty rare, so don't beat yourself up if you don't get an internship right away. A lot of tech companies screen out first-years as a matter of policy.<br /><br />Some finance firms only hire rising college seniors as interns because they're fiercely protective of their IP and don't want other firms poaching their interns next summer.<br /><div><br /></div>The school you go to matters, but if you take the time to build a personal brand and list of side projects, it matters less and less. The same goes for age.<br /><br /><div><h2><b>#9 I got the internship. What do I do?</b></h2></div><div><br />Congrats! Your internship is an opportunity, not an entitlement.<br /><br />These companies are investing in your personal growth and learning, so you should work hard and learn as much as possible. You owe it to the company whose name pads your resume, you owe it to the people who vouched for you in the hiring process, and most of all, you owe it to the candidates who were just as qualified as you, but didn't get the job.<br /><br />My internship offers were all very competitive so I didn't negotiate (I was also saving that social capital for full-time negotiation). You can try to negotiate your internship offers if you want, though.<br /><div><br /></div><h2><b>#10 I didn't get an internship this summer. What do I do?</b></h2><br />Great! You can spend the summer working on exactly what you want to work on. Most interns don't even get this luxury.<br /><br /><ul><li>Create deadlines for yourself as if a manager assigned them to you. </li><li>Have meetings with your imaginary manager where you discuss your progress. </li><li>Show up to "work" on time.</li><li>Get some unemployed friends together and work in a team. Heck, not having a job lined up is the perfect opportunity to <a href="http://playbook.samaltman.com/" target="_blank">start your own company</a>.</li><li>Write a blog post about it. Show your future employers what a fucking awesome employee you would be if you had the opportunity.</li></ul><br />If money is an issue, there are still a few options. You can seek out an UTRA with your university, take up a low-stress part-time job (summer RA, babysitting).<br /><br /><h2><b>#11 Closing Thoughts</b></h2><br /><ul><li>Build your own personal brand through side projects, website, writing.</li><li>Optimize your career decisions for learning and personal growth. </li><li>Work really hard.</li></ul><br /><br />Best of luck, and thank you for reading.<br /><br /></div></div>Ericnoreply@blogger.com30tag:blogger.com,1999:blog-842965756326639856.post-18697135037568498322016-06-25T05:44:00.000-07:002016-06-25T05:59:07.369-07:00Adversarial Exploration Policies for Robust Model Learning<i style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.2px; line-height: 18.48px;">This post was first published on 05/14/16, and has since been migrated to Blogger.</i><span style="background-color: white; color: #666666; font-family: "trebuchet ms" , "trebuchet" , "verdana" , sans-serif; font-size: 13.2px; line-height: 18.48px;"> </span><br /><br />Brown University requires S.c.B students to take a <i>capstone</i> course that "studies a current topic in depth to produce a culminating artifact such as a paper of software project".<br /><br />For my capstone project, I developed a more efficient sampling approach for use in learning POMDP models with Deep Learning. It's still a work in progress, but the results are pretty promising.<br /><br /><h3>Abstract:</h3><br />Deep neural networks can be applied to model-based learning problems over continuous state-action spaces $S \times A$. By training a prediction network $\hat{f}_\tau : S \times A \to S$ on saved trajectory data, we can approximate the true transition function $f$ of the underlying Markov decision processes. $\hat{f}_\tau$ can then be used within optimal control and planning algorithms to ``predict the future''.<br /><br />Robustness of $\hat{f}_\tau$ is crucial. If the robot (such as an autonomous vehicle) spends most of its exploration time in a small region of $S \times A$, then $\hat{f}_\tau$ may not be accurate in regions that the robot does not encounter often (such as collision trajectories). However, gathering enough training data to fully characterize $f$ over $S \times A$ is very time-consuming, and tends to result in many redundant samples.<br /><br />In this work, I propose exploring $S \times U$ using an ``adversarial policy'' $\pi_\rho : S \to A$ that guides the robot into states and actions that maximize model loss. Policy parameters $\rho$ and model parameters $\tau$ are optimized in an alternating minimax game via stochastic gradient descent. Robot simulation experiments demonstrate that adversarial exploration policies improve model robustness with respect to the time the robot spends sampling the environment.<br /><br /><h3>Links:</h3><ul><li><a href="https://drive.google.com/file/d/0BwPM4VwL8ZOAYWFhbm1rTlJiUjA/view?usp=drive_web">PDF of the writeup</a></li><li><a href="http://evjang-nb.blogspot.com/">Project rearch notebook in blog format</a></li><li><a href="https://github.com/ericjang/adaptive-e2c%22">Source code on Github</a></li><li><a href="https://github.com/ericjang/e2c">E2C vanilla implementation</a></li><li><a href="https://github.com/ericjang/boxbot">BoxBot simulator</a></li></ul><br /><br />Ericnoreply@blogger.com6tag:blogger.com,1999:blog-842965756326639856.post-3387429842333801632016-06-25T05:38:00.003-07:002016-07-10T22:49:02.876-07:00Understanding and Implementing Deepmind's DRAW Model<i style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.2px; line-height: 18.48px;">This post was first published on 2/27/16, and has since been migrated to Blogger.</i><span style="background-color: white; color: #666666; font-family: "trebuchet ms" , "trebuchet" , "verdana" , sans-serif; font-size: 13.2px; line-height: 18.48px;"> </span><br /><span style="background-color: white; color: #666666; font-family: "trebuchet ms" , "trebuchet" , "verdana" , sans-serif; font-size: 13.2px; line-height: 18.48px;"><br /></span><img src="http://i.imgur.com/eyFG9c8.png" width="100%" /> <br /><br />This blog post is a summary of Google Deepmind's paper <a href="http://arxiv.org/abs/1502.04623">DRAW: A Recurrent Neural Network For Image Generation</a>. Also included is a tutorial on implementing it from scratch in <a href="https://www.tensorflow.org/">TensorFlow</a>.<br />The paper can seem a bit intimidating at first, but it is surprisingly intuitive once you look at it as a small (but elegant) extension to existing generative modeling techniques. Thanks to the TensorFlow framework, the implementation only takes 158 lines of Python code.<br /><br />Below are some outputs produced by my DRAW implementation. The images on the left have the optional "attention mechanism" turned on, and the images on the right were generated with attention turned off.<br /><br /><table class="u-full-width"> <thead><tr> <th>DRAW w/ Attention</th> <th>DRAW w/o Attention</th> </tr></thead> <tbody><tr> <td><img src="http://i.imgur.com/XfAkXPw.gif" width="100%" /></td> <td><img src="http://i.imgur.com/qQUToOy.gif" width="100%" /></td> </tr></tbody></table>It may not look like much, but the images are completely synthetic - we've trained a neural network to transform random noise into an endless stream of images that it has never even seen before. In a sense, it is "dreaming up" the images.<br /><br />The full source code can be found <a href="https://github.com/ericjang/draw">on Github</a>.<br /><br /><h2>A Story</h2><br />Imagine that you crash land in the Sahara desert and meet a young boy who asks you to "draw him a sheep".<br /><br />You draw a mutton chop in the sand with a stick.<br /><br />"Now draw me a sample from the <a href="https://en.wikipedia.org/wiki/Normal_distribution">normal distribution</a>," he says.<br /><br />Sure - you bust open a Python shell on your laptop and type <code>import numpy; numpy.random.normal()</code><br />"0.3976147272555", you tell him.<br /><br />He asks you to draw another sample, so you run your code again and this time you get -0.27625249497.<br /><br />"Now draw me a sample from the distribution of pictures where the camera operator forgot to take the cap off the lens"<br /><br />You blink a couple times, then realize that there is only one such picture to choose from, so you draw him a black rectangle.<br /><br />"Now draw me a sample from the distribution of pictures of houses."<br /><br />"Erm... what?"<br /><br />"Yes, houses! 400 pixels wide by 300 pixels tall, in RGB colorspace please. Don't put it in a box."<br /><h2>Problem Statement</h2>If you've ever written a computer program that used randomness, you probably learned how to sample the uniform probability distribution by calling <code>rand()</code> in your favorite programming language. If you were asked how to get a random number over the range $(0,2)$, or perhaps a random point in a unit square (2 dimensions) rather than a line (1 dimension), you could probably figure out how to do that too.<br /><br />Now what if we were asked to "draw a sample from the distribution of valid <a href="https://en.wikipedia.org/wiki/Covariance_matrix">covariance matrices</a>?" We might actually care about this if we were trying to simulate an investment strategy on a portfolio of stocks. The equations <a href="https://en.wikipedia.org/wiki/Inverse-Wishart_distribution">get a lot harder</a>, but it turns out we can <a href="http://www.mathworks.com/help/stats/iwishrnd.html">still do it</a>.<br /><br />Now consider the distribution of house pictures (let's call it $P$) that the young boy asked us to sample from. $P$ is defined over a 400 pixel by 300 pixel space (120,000 pixels times 3 color channels = 360,000 numbers). Here are some samples drawn from that distribution via Reddit (these are real houses) <a href="https://www.blogger.com/blogger.g?blogID=842965756326639856#ref_houses">[4]</a>.<br /><br /><img src="http://i.imgur.com/H8OGv2c.png" width="100%" /> <br /><br />We'd like to be able to draw samples $c \sim P$, or basically generate new pictures of houses that may not even exist in real life. However, the distribution of pixel values over these kinds of images is <i>very</i> complex. Sampling the correct images needs to account for many long-range spatial and semantic correlations, such as:<br /><ul><li>Adjacent pixels tend to be similar (a property of the distribution of natural images)</li><li>The top half of the image is usually sky-colored</li><li>The roof and body of the house probably have different colors</li><li>Any windows are likely to be reflecting the color of the sky</li><li>The house has to be structurally stable</li><li>The house has to be made out of materials like brick, wood, and cement rather than cotton or meat</li></ul>... and so on. We need to specify these relationships via equations and code, even if implicitly rather than analytically (writing down the 360,000-dimensional equations for $P$).<br />Trying to write down the equations that describe "house pictures" may seem insane, but in the field of Machine Learning, this is an area of intense research effort known as <a href="https://en.wikipedia.org/wiki/Generative_model">Generative Modeling</a>.<br /><br /><h2>Generative Modeling Techniques</h2><br />Formally, generative models allow us to create observation data out of thin air by sampling from the joint distribution over observation data and class labels. That is, when you sample from a generative distribution, you get back a tuple consisting of an image and a class label (e.g. "border collie"). This is in contrast to <a href="https://www.quora.com/What-are-the-differences-between-generative-and-discriminative-machine-learning">discriminative models</a>, which can only sample from the distribution of class labels, conditioned on observations (you need to supply it with the image for it to tell you what it is). Generative models are much harder to create than discriminative models.<br /><br />There have been some awesome papers in the last couple of years that have improved generative modeling performance by combining them with Deep Learning techniques. Many researchers believe that breakthroughs in generative models are key to solving ``unsupervised learning'', and maybe even <a href="http://qr.ae/RUPTUl">general AI</a>, so this stuff gets a lot of attention.<br /><ul><li> Goodfellow et al. 2014 (with Yoshua Bengio's group in Montréal) came up with <a href="http://arxiv.org/abs/1406.2661">Generative Adversarial Networks</a> (GAN), where a neural network is trained to directly learn the <i>sampler function</i> for the joint distribution. It is interactively pitted against another neural network that tries to reject samples from the generator while accepting samples from the "true" distribution <a href="https://www.blogger.com/blogger.g?blogID=842965756326639856#ref_genadv">[1]</a>.<br /> </li><li> Gregor et al. 2014 (with Google Deepmind) proposed <a href="http://arxiv.org/pdf/1310.8499v2.pdf">Deep AutoRegressive Networks</a> (DARN). The idea is to use an autoencoder to learn some high-level features of some input, such as handwritten character images. The images can be reconstituted as a weighted combination of high-level features (e.g. individual pen strokes). To generate an entire image, we sample feature weights one at a time, conditioned on all the previous weights we have already chosen (ancestral sampling). We then let the decoding side of the autoencoder to recover the image from the feature vector <a href="https://www.blogger.com/blogger.g?blogID=842965756326639856#ref_darn">[2]</a>.<br /> </li><li> Kingma and Welling 2013 (Universiteit van Amsterdam) describe <a href="http://arxiv.org/pdf/1312.6114v10.pdf">Variational AutoEncoders</a> (VAE). Like the DARN architecture, we sample some feature representation (learned via autoencoder) generatively, then decode it into a meaningful observation. However, sampling this feature representation might be challenging if the feature interdependencies are complicated (or possibly cyclic). We use <a href="https://en.wikipedia.org/wiki/Variational_Bayesian_methods">Variational methods</a> to get around this: instead of trying to sample from the feature vector distribution $P$, we come up with a friendlier, tractable distribution $Q$ (such as a multivariate Gaussian) and instead sample from that. We vary the parameters $\theta$ of $Q$ so that $Q$'s shape is as similar to $P$ as possible.<br /> </li><li> The combination of Deep Learning and Variational Bayes begs the question of whether we can also combine Deep Learning with VB's antithesis: MCMC methods. There are some groups doing work on this, but I'm not caught-up in this area. <a href="https://www.blogger.com/blogger.g?blogID=842965756326639856#ref_mcmc">[3]</a>.<br /> </li></ul><h3></h3><h2>DRAW: Core Ideas</h2><br />Let's revisit the challenge of drawing houses.<br /><br />The joint distribution we are interested in is $P(C)$. Joint distributions are typically written in terms of 2 or more variables, i.e. $P(X,Y)$, but distributions $P(C)$ over one multi-dimensional variable $C$ can also be thought of as joint distributions (think of $C$ as the concatenation of $X$ and $Y$).<br />The previous section discussed how neural networks can do all the hard work for us by either (1) learning the sampler function (GAN), (2) learning the directed Bayes net (DARN) or (3) learning the variational parameters (VAE). However, it's still asking a lot of our neural nets to figure out such a complex distribution. The feature space is massive ($\mathbb{R}^{360,000}$), and we might need tons of data to make this work.<br /><br />This is where the DRAW paper comes in. <b>The main contribution of the DRAW is that it extends Variational AutoEncoders with (1) "progressive refinement" and (2) "spatial attention".</b> These two mechanisms greatly reduce the complexity burden that the autoencoder needs to learn, thereby allowing its generative capabilities handle larger, more complex distributions, like natural images.<br /><h4></h4><h3>Core Idea 1: Progressive Refinement</h3><br />Intuitively, it's easier to ask our neural network to merely "improve the image" rather than "finish the image in one shot". Human artists work by iterating on their canvas, and infer from their drawing what to fix and what to paint next.<br /><br />If $P(C)$ is a joint distribution, we can rewrite it as $P(C)=P(A|B)P(B)$, where $P(A|B)$ is the conditional distribution and $P(B)$ is the prior distribution. We can think of $P(B)$ as another joint distribution, so $P(B)=P(B|D)P(D)$. So priors can have their own priors, which can have their own priors... it's turtles all the way down.<br /><br />Now returning to the DRAW model, "progressive refinement" is simply breaking up our joint distribution $P(C)$ over and over again, resulting in a chain of latent variables $C_1, C_2, ... C_{T-1}$ to a new "observed variable distribution" $P(C_T)$, our house image.<br /><div><br /></div>$$P(C)=P(C_T|C_{T-1})P(C_{T-1}|C_{T-2})...P(C_1|C_0)P(0)$$<br /><div><br /> <!-- canvas picture --><img src="http://i.imgur.com/STegpLp.png" style="display: block; margin: auto;" /> <br /><br />The trick is to sample from the "iterative refinement" distribution $P(C_t|C_{t-1})$ several times rather than straight-up sampling from $P(C)$. To reuse the house images example, sampling $P(C_1|C_0)$ might be only responsible for getting the sky right, sampling $P(C_2|C_1)$ might be only responsible for capturing proper windows and doors (after we've gotten the sky right), and so on.<br />In the case of the DRAW model, $P(C_t|C_{t-1})$ is the same distribution for all $t$, so we can compactly represent this as the following recurrence relation (if not, then we have a Markov Chain instead of a recurrent network) <a href="https://www.blogger.com/blogger.g?blogID=842965756326639856#ref_rnn">[5]</a>.<br /><br /><img src="http://i.imgur.com/N6metx3.png" style="display: block; margin: auto;" /> <br /><h4></h4><h3>Core Idea 2: Spatial Attention</h3><br />Progressive refinement simplifies things by splitting up the generative process over a temporal domain. Can we do something similar in a spatial domain?<br /><br />Yes! In addition to only asking the neural network to improve the image <i>a little bit</i> on each step, we can further reduce the compexity burden by asking it to only improve a <i>small region</i> of the image at a time. The neural network can devote its "mental resources" to one local patch rather than having to think think about the distribution of improvement across the entire canvas. Because "drawings over small regions" are easier to generate than "drawings over large regions", we should expect things to get easier for the network.<br /><br />Easier:<br /><br /><img src="http://i.imgur.com/YQbvWqr.png" height="400" width="360" /><br /><br />Harder: the above face is in there somewhere! Artwork by <a href="https://www.youtube.com/watch?v=MGbvhyTZXfs">Kim Jung gi</a>.<br /><br /><img height="289" src="https://magorographics.files.wordpress.com/2015/07/34-1kim-jung-gi.jpg" width="640" /> <br /><br />Of course, there are some subtleties here: how big should the attention patch be? Should the "pen-strokes" be sharp or blurry? How do we get the attention mechanism to be able to support both wide, broad strokes and little precise refinements?<br /><br />Fear not. As we will see later, these are dynamic parameters that will be learned by the DRAW model.<br /><h2>The Model</h2>The DRAW model uses recurrent networks (RNNs) that run for $T$ steps of progressive refinement. Here is the neural network, where the RNNs have been unrolled across time - everything is feedforward now.<br /><img src="http://i.imgur.com/ldrcxhH.png" width="120%" /> To keep things concrete, here are the shapes of the tensors being passed along the network: <code>A</code> and <code>B</code> are the image width and height respectively (so the image is a B x A matrix), and <code>N</code> is the width of the attention window (in pixels). It is standard practice to flatten 2D image data entirely into the second dimension (index=1) because this construction generalizes better when we want to use our neural network on other data formats.<br /><br /><table class="u-full-width"> <thead><tr> <th>Tensor</th> <th>Shape</th> </tr></thead> <tbody><tr> <td>$x$</td> <td><code>(batch_size, B*A)</code></td> </tr><tr> <td>$\hat{x}$</td> <td><code>(batch_size,B*A)</code></td> </tr><tr> <td>$r_t$</td> <td><code>(batch_size,2*N*N)</code></td> </tr><tr> <td>$h_t^{enc}$</td> <td><code>(batch_size,enc_size)</code></td> </tr><tr> <td>$z_t$</td> <td><code>(batch_size,z_size)</code></td> </tr><tr> <td>$h_t^{dec}$</td> <td><code>(batch_size,dec_size)</code></td> </tr><tr> <td>$write(h_t^{dec})$</td> <td><code>(batch_size,B*A)</code></td> </tr></tbody></table><br />$h_t^{enc}$, $z_t$, and $h_t^{dec}$ form the "Variational Autoencoder" component of the model. Images seen by the network are "compressed" into features by the encoding layer. The reason why we want to use an autoencoder is because it's too difficult to sample images in pixel space directly. We want to find a more compact representation for the images we are trying to generate, and instead sample from those representations before transforming them back into images.<br /><br />To sample $h^{enc}$ generatively, we pray to the Variational Bayes gods and hope that there is some $(\mu,\Sigma)$ that parameterizes a <code>z_size</code>-dimensional multivariate Gaussian $Q$ in a way such that $Q$ looks kinda like $P(h_{enc})$. We choose Gaussian because it's well-studied and easy to sample from, but you could have also chosen a multivariate Gamma, Uniform, or other distributions as long as you can match it to $P(h^{enc})$.<br /><br />Expressing this model is easy in TensorFlow. The syntax is similar to manipulating Numpy arrays, so we practically can copy equations right out of the paper:<br /><br /><img src="http://i.imgur.com/7CLlD1e.png" style="display: block; margin: auto;" width="50%" /> <script src="https://gist.github.com/ericjang/f6393b86282bddd31d71.js"></script> <br /><h3>Note about Weight Sharing</h3>Because the model is recurrent, we want to be using the same network weights for all our linear layers and LSTM layers at each time step. <code>tf.variable_scope(reuse=True)</code> is used to share the parameters of the network.<br /><br />Annoyingly though, we have to manually disable <code>reuse=True</code> on the first time step, otherwise trying to create a new variable with <code>tf.get_variable(reuse=True)</code> raises an "under-share" exception. That's what the global variable <code>DO_SHARE</code> is for: weight sharing is turned off for t=0, then we switch it on permanently.<br /><h3>Reading</h3>The implementation of <code>read</code> without attention is super easy: simply concatenate the image with the error image (difference between the image and the current reconstruction. I highly recommend implementing the non-attentive version of DRAW first, in order to check that all the other components work.<br /><br /><script src="https://gist.github.com/ericjang/369c4d01b81542604822.js"></script> To implement read-with-attention, we use a basic linear layer to compute the 5 scalar parameters that we use to set up the read filterbanks $F_x$ and $F_y$. I made a short helper function <code>linear</code> to help with other places in the model where affine transformations $W$ are needed.<br /><script src="https://gist.github.com/ericjang/63cb2235060c2225592b.js"></script> <br />You might be wondering - why do we compute $log(\sigma)$, $\log(\gamma)$, etc. rather than just have the linear layer spit out $\sigma$ and $\gamma$?. I think this is so that we guarantee $\sigma=exp(log(\sigma))$ and $\gamma=exp(log(\gamma))$ are positive. We could just use <code>tf.min</code> and clamp the values, but it doesn't differentiate as well so our error gradients would propagate slower if we did that.<br /><br />Once we obtain the filterbank parameters, we need to actually set up the filters. $F_X$ and $F_Y$ each specify $N$ 1-dimensional Gaussian bumps. These bumps are identical except that they are translated (mean-shifted) from each other at intervals of $\delta$ pixels, and the whole grid is centered at $(g_Y,g_X)$. Spacing these filters out means that we are downsampling the image, rather than just zooming really close.<br /><br />The extracted value for attention grid pixel $(i,j)$ is the convolution of the image with a 2D Gaussian bump whose $x$ and $y$ components are the respective $i$ and $j$ rows in $F_X$ and $F_Y$. Thus, reading with attention extracts a N*N patch from the image patch and N*N from the error patch).<br /><br />If you didn't follow what the last couple paragraphs said, that's okay: it's much easier to explain what the filterbanks are doing with a picture:<br /><br /><img src="http://i.imgur.com/5T5wJrK.png" width="100%" /> Here's the code to build up the filterbank matrices $F_X$, $F_Y$.<br /><script src="https://gist.github.com/ericjang/77857ce72b1cc76ee990.js"></script> <br /><h3>Batch Multiply vs. Vectorized Multiplication</h3>In batched operations, we usually have tensors with the shape <code>(batch_size,img_size)</code>. But the recipe calls for $F_YxF_X^T$, which is a matrix-matrix-matrix multiplication. How do we perform the multiplication when our $x$ is a vector?<br /><br />As it turns out, TensorFlow provides a convenient <code>tf.batch_matmul()</code> function that lets us do exactly this. We temporarily reshape our $x$ vectors into <code>(batch_size,B,A)</code>, apply batch multiplication, then reshape it back to a 2D matrix again.<br /><br />Here is the read function with attention, using batch multiplication:<br /><script src="https://gist.github.com/ericjang/5a49da4efdab031c9d8b.js"></script> <br />There's also a way to do this without <code>tf.batch_matmul</code>, and that is to just vectorize the multiplications. It's slightly verbose, but useful. Here is how you do it:<br /><script src="https://gist.github.com/ericjang/d647d632c39ab22d6fad.js"></script> <br /><h3>Encoder</h3>In TensorFlow, you build recurrent nets by specifying the time-unrolled graph manually (as drawn in the diagram above).<br /><br />The constructor <code>tf.models.rnn.rnn_cell.LSTMCell(output_size, input_size)</code> returns to you a function, <code>lstm_op</code> corresponding to the update equations for an LSTM layer. You pass in the current state and input to <code>lstm_op</code>, and it returns the output and a new LSTM state (for which you are responsible for passing back into the next call to <code>lstm_op</code>).<br /><pre class="code">output, new_state = lstm_op(input, current_state)</pre><script src="https://gist.github.com/ericjang/8dbcf722f429bb0112c4.js"></script> <br /><h3>Sampler</h3>Given one instance of an encoded representation $h_t^{enc}$ of a "6" image, we'd like to map that to the distribution $P$ of all "6" images, or at least a variational approximation $Q \sim P$ of it. In principle, if we choose the correct $(\mu$, $\Sigma)$ to characterize $Q$, then we can sample other instances of "encoded 6 images". Note that the samples would all have random variations, but nonetheless still be images of "6".<br /><br />If we save a copy of $(\mu,\Sigma)$, then we can create as many samples from $Q$ as we want using <code>numpy.random.multivariate_normal()</code>. In fact, the way to use DRAW as a generative model is to cut out the first half of the model (read, encode, and sampleQ), and just pass our own $z \sim \mathcal{N}(\mu,\Sigma)$ straight to the second half of the network. The decoder and writer transforms the random normals back into an image.<br /><img src="http://i.imgur.com/JbS9yxn.png" width="100%" /> <br />Radical.<br /><h3>Sampler Reparameterization Trick</h3>There's one problem: if we're training the DRAW architecture end-to-end via gradient descent, how the heck do we backpropagate loss gradients through a function like <code>numpy.random.multivariate_normal()</code>?<br /><br />The answer proposed in Kingma's Variational Autoencoders paper is hilariously simple: just rewrite $z=\mathcal{N}(\mu,\sigma)$ as $z=\mu+\mathcal{N}(0,1)*\sigma$. Now we can sort of take derivatives: $\partial{z}/\partial{\mu}=1, \partial{z}/\partial{\sigma}=e$ where $e \sim N(0,1)$.<br /><img src="http://i.imgur.com/dvjYkQp.png" width="100%" /> <br />In <a href="https://www.blogger.com/blogger.g?blogID=842965756326639856#ref_vae_talk">Karol Gregor's talk [6]</a>, $e$ is described as "added noise" that limits the information passed between the encoder and decoder layers (otherwise $\mu,\Sigma$, being made up of real numbers, would carry an infinite number of bits). Somehow this prevents similar $z$ being decoded into wildly different images, but I don't fully understand the info-theory interpretation quite yet.<br /><script src="https://gist.github.com/ericjang/72055c383fc047e07039.js"></script> <br /><h3>Decoder</h3>The variable names in the paper make it a little confusing as to what exactly is being encoded and decoded at each layer. It's clear that $x_t, r_t, c_t$ are are in "image space", and $h^{enc}_t$ refers to the encoded feature representation of an image.<br /><br />However, my understanding is that $h^{dec}_t$ is NOT the decoded image representation, but rather it resides in the same feature space as $h^{enc}_t$. The superscript "dec" is misleading because it actually refers to the "decoding of $z$ back into encoded image space". I think this is so because passing data through <code>sampleQ</code> is implicitly adding an additional layer of encoding.<br /><br />It then follows that the actual transformation from "encoded image space" back to "image space" is done implicitly by the <code>write</code> layer. I'd love to hear an expert opinion on this matter though <a href="https://www.blogger.com/blogger.g?blogID=842965756326639856#ref_hdec">[7]</a>.<br /><script src="https://gist.github.com/ericjang/7934bae43b4c2473adcc.js"></script> <br /><h3>Writing with Attention</h3>During writing, we apply the opposite filter transformation we had during reading. Instead of focusing a large image down to a small "read patch", we now transform a small "write patch" to a larger region spanning the entire canvas. If we make the write patch pure white, we can visualize where the write filterbanks cause the model to attend to in the image. This can be useful for debugging or visualizing the DRAW model in action. To visualize the same thing with read filterbanks, just apply them as if they were write filters.<br /><script src="https://gist.github.com/ericjang/a19d3ec85f08663dabe8.js"></script> <br /><h3>Model Variables</h3>It's usually a good idea to double-check the parameters you're optimizing over, and that variables are being properly shared, before switching training on. Here are the variables (learnable parameters) of the DRAW model and their respective tensor shapes.<br /><script src="https://gist.github.com/anonymous/9cdbf7acad3edc7544a4.js"></script> <br /><pre class="code"> read/w:0 : TensorShape([Dimension(256), Dimension(5)])<br /> read/b:0 : TensorShape([Dimension(5)])<br /> encoder/LSTMCell/W_0:0 : TensorShape([Dimension(562), Dimension(1024)])<br /> encoder/LSTMCell/B:0 : TensorShape([Dimension(1024)])<br /> mu/w:0 : TensorShape([Dimension(256), Dimension(10)])<br /> mu/b:0 : TensorShape([Dimension(10)])<br /> sigma/w:0 : TensorShape([Dimension(256), Dimension(10)])<br /> sigma/b:0 : TensorShape([Dimension(10)])<br /> decoder/LSTMCell/W_0:0 : TensorShape([Dimension(266), Dimension(1024)])<br /> decoder/LSTMCell/B:0 : TensorShape([Dimension(1024)])<br /> writeW/w:0 : TensorShape([Dimension(256), Dimension(25)])<br /> writeW/b:0 : TensorShape([Dimension(25)])<br /> write/w:0 : TensorShape([Dimension(256), Dimension(5)])<br /> write/b:0 : TensorShape([Dimension(5)])<br /></pre>Looks good!<br /><h3>Loss Function</h3>The DRAW model's loss function is $L_x+L_z$, or the sum of a reconstruction loss (how good the picture looks) and a latent loss for our choice of $z$ (a measure of how bad our variational approximation is of the true latent distribution). Because we are using binary cross entropy for $L_x$, we need our MNIST image values to go from 0-1 (binarized) rather than -0.5 to 0.5 (mean-normalized). The latent loss is the the standard reverse-KL diverence used in variational methods.<br />I don't have a full understanding of this yet, so maybe I'll ask here: why are we allowed to just combine reverse-KL with binary cross entropy? If we only minimize $L_x$, it looks like we can still obtain good drawings even though the latent loss increases. This suggests that KL and binary cross entropy are apples and oranges, and we can change the cost function to $Lx+\lambda*Ly$ where $\lambda$ allows us to prioritize one kind of loss over another (or even alternate optimizing $L_x$ and $L_z$ in an EM-like fashion).<br /><br /><script src="https://gist.github.com/ericjang/b01328d4e67d3c283c12.js"></script> One very annoying bug that I encountered was that while DRAW without attention trained without problems, adding attention suddenly caused numerical underflow problems to occur during training. The training would start off just fine, but soon all of the gradients would explode into NaNs.<br />The problem was that I was passing near-zero values to the <code>log</code> function used in binary cross entropy, even though $x_{recons}=\sigma(c_T)$ has an asymptote at 0. The fix was to just add a small positive epsilon-value to what I was passing into <code>log</code>, to really make sure it was strictly positive.<br /><h3>Optimization</h3>The code for setting up an Adam Optimizer to minimize the loss function is not super interesting to look at, so I leave you to take a look at the <a href="https://github.com/ericjang/draw">full code on Github</a>. It's common in Deep Learning to apply momentum to the optimizer and clip gradients that are too large, so I utilized those heuristics in my implementation (but they are not necesary). Here is a plot of the training error for DRAW with attention:<br /><img src="http://i.imgur.com/f2n0n4F.png" width="100%" /><br />Here's an example of MNIST being generated after training. Notice how the attention mechanism "traces" out the curves of the letter, rather than "blotting" it in gradually (DRAW without attention).<br /><img src="http://i.imgur.com/XfAkXPw.gif" width="80%" /> <br /><h2>Closing Thoughts</h2>DRAW improves generative modeling capabilities of Variational AutoEncoders by "splitting" the complexity of the task across the temporal and spatial domains (progressive refinement and attention). TensorFlow made this a joy to implement, and I'm excited what other exotic differentiable programs will come out of the Deep Learning woodwork. Here are some miscellaneous thoughts:<br /><ul><li>It's quite remarkable that even though this model perceives and manipulates images, there was no use of <a href="https://en.wikipedia.org/wiki/Convolutional_neural_network">convolutional neural nets</a>. It's almost as if RNNs equipped with attention can somehow do spatial tasks that much-larger convnets are so good at. I'm reminded of how Baidu actually uses convnets in their speech recognition models (convolution across time rather than space), rather than LSTMs (as Google does). Is there some sort of representational duality between recurrent and convolutional architectures?</li><li>Although the <code>sampleQ</code> block in the DRAW model is implemented using a Variational AutoEncoder, theoretically we could replace that module with one of the other generative modeling techniques (GAN, DARN), so long as we are still able to backpropagate loss gradients through it.</li></ul>Please don't hesitate to reach out to me via email if there are factual inaccuracies/typos in this blog post, or something that I can clarify better. Thank you for reading!<br /><h2>Footnotes</h2><ol><li id="ref_genadv"> My <a href="http://blog.evjang.com/2016/06/generative-adversarial-nets-in.html">previous blog post</a> explains Generative Adversarial Networks in more detail (it uses sorting and stratified sampling as I had not learned about the reparameterization trick at the time). </li><li id="ref_darn"> The prefix "auto" comes from the use of an autoencoder and the suffix "regressive" comes from the correlation matrix of high-level features used in the conditional distribution. Karol Gregor's <a href="http://techtalks.tv/talks/deep-autoregressive-networks/60884/">tech talk on DARN</a> explains the idea very well.<br /> </li><li id="ref_mcmc"> <a href="http://helper.ipam.ucla.edu/publications/gss2012/gss2012_10791.pdf">Introduction to MCMC for Deep Learning</a> and <a href="http://arxiv.org/pdf/1506.04557v3.pdf">Learning Deep Generatie Models with Doubly Stochastic MCMC</a><br /> </li><li id="ref_houses"> I highly recommend checking out the "HousePorn" subreddit (SFW) <a href="https://www.reddit.com/r/Houseporn/">https://www.reddit.com/r/Houseporn/</a><br /> </li><li id="ref_rnn"> "Progressive refinement" is why recurrent neural networks generally have far greater expressive power than feedforward networks of the same size: each step of the RNN's state activation is only responsible for learning the "progressive refinement" distribution (or next word in a full sentence), whereas a model that isn't recurrent has to figure an entire sequence out in one shot. Cool!<br /> </li><li id="ref_vae_talk"> <a href="https://www.youtube.com/watch?v=P78QYjWh5sM&list=PLE6Wd9FR--EfW8dtjAuPoTuPcqmOV53Fu&index=3">Karol Gregor's guest lecture</a> in Oxford's Machine Learning class taught by Nando de Freitas. I highly recommend this lecture series, as it contains more advanced Deep Learning techniques (heavily influenced by Deepmind's research areas).<br /> </li><li id="ref_hdec"> This is a somewhat simplistic reduction: I think that the representation that a layer ends up learning probably has a lot to do with how much size it is allocated. There's no reason why the feature representation $h^{dec}_t$ learns would resemble that of $h^{enc}_t$ if their layer sizes are different.<br /> </li></ol><br /></div>Ericnoreply@blogger.com21tag:blogger.com,1999:blog-842965756326639856.post-78441749610828235592016-06-25T05:28:00.001-07:002016-06-25T05:59:07.374-07:00Generative Adversarial Nets in TensorFlow (Part I)<i>This post was first published on 12/29/15, and has since been migrated to Blogger.</i> <br /><br />This is a tutorial on implementing Ian Goodfellow's <a href="http://arxiv.org/abs/1406.2661">Generative Adversarial Nets</a> paper in TensorFlow. Adversarial Nets are a fun little Deep Learning exercise that can be done in ~80 lines of Python code, and exposes you (the reader) to an active area of deep learning research (as of 2015): Generative Modeling!<br /><br /><a href="https://github.com/ericjang/genadv_tutorial/blob/master/genadv1.ipynb">Code on Github</a><br /><h3>Scenario: Counterfeit Money</h3>To help explain the motivations behind the paper, here's a hypothetical scenario:<br /><b>D</b>anielle is a teller at a bank, and one of her job responsibilities is to discriminate between real money and counterfeit money. <b>G</b>eorge is a crook and is trying to make some counterfeit money, becase free money would be pretty radical.<br /><br />Let's simplify things a bit and assume the only distinguishing feature of currency is one unique number, $X$, printed on the each bill. These numbers are randomly sampled from a probability distribution, whose density function $p_{data}$ is only known to the Treasury (i.e. neither Danielle nor George know the function). For convenience, this tutorial uses $p_{data}$ to refer to both the distribution and the density function (though semantically a distribution and its density function are not the same).<br /><br />George's goal is to generate samples $x^\prime$ from $p_{data}$, so his counterfeit currency is indistinguishable from "real" currency. You might ask: how can George generate samples from $p_{data}$ if he doesn't know $p_{data}$ in the first place?<br /><br />We can create computationally indistinguishable samples without understanding the "true" underlying generative process [1]. The underlying generative process is the method that the Treasury itself is using to generate samples of $X$ - perhaps some efficient algorithm for sampling $p_{data}$ that relies on the analytical formula for the pdf.<br /><br />We can think of this algorithm as the "natural (function) basis", the straightforward method the Treasury would actually use to print our hypothetical currency. However, a (continuous) function can be represented as a combination of a different set of basis functions; George can express the same sampler algorithm in a "neural network basis" or "Fourier basis" or other basis that can be used to build a <a href="https://en.wikipedia.org/wiki/Universal_approximator">universal approximator</a>. From an outsider's perspective, the samplers are computationally indistinguishable, and yet George's model doesn't reveal to him the structure of the "natural" sampler basis or the analytical formula of $p_{data}$.<br /><h3>Background: Discriminative vs. Generative Models</h3>Let $X$, $Y$ be the "observed" and "target" random variables. The joint distribution for $X$ and $Y$ is $P(X,Y)$, which we can think of as a probability density over 2 (possibly dependent) variables.<br /><br />A <a href="https://en.wikipedia.org/wiki/Discriminative_model">Discriminative</a> model allows us to evaluate the conditional probability $P(Y|X)$. For instance, given a vector of pixel values $x$, what is the probability that $Y=6$? (where "6" corresponds to the categorical class label for "tabby cat"). <a href="https://www.tensorflow.org/versions/master/tutorials/mnist/beginners/index.html">MNIST LeNet</a>, <a href="http://nbviewer.ipython.org/github/BVLC/caffe/blob/master/examples/00-classification.ipynb">AlexNet</a>, and other classifiers are examples of a discriminative models.<br /><br /><div class="row"><img src="http://deeplearning.net/tutorial/_images/mylenet.png" width="90%" /></div><br />On the other hand, a <a href="https://en.wikipedia.org/wiki/Generative_model">Generative</a> model can allows us to evaluate the joint probability $P(X,Y)$. This means that we can propose value pairs $(X,Y)$ and do rejection-sampling to obtain samples $x$,$y$ from $P(X,Y)$. Another way to put this is that with the right generative model, we can convert a random number from $[0,1]$ <i>into a picture of a rabbit</i>. That's awesome.<br /><br /><img src="http://i.imgur.com/whX120f.jpg" /> <br /><br />Of course, generative models are much harder to construct than discriminative models, and both are active areas of research in statistics and machine learning.<br /><h3>Generative Adversarial Networks</h3>Goodfellow's paper proposes a very elegant way to teach neural networks a generative model for any (continuous) probability density function. We build two neural networks $D$ (Danielle) and $G$ (George), and have them play an adversarial cat-and-mouse game: $G$ is a generator and attempts to counterfeit samples from $p_{data}$ and $D$ is a decider that tries to not get fooled. We train them simultaneously, so that they both improve by competing with each other. At convergence, we hope that $G$ has learned to sample exactly from $p_{data}$, at which point $D(x)=0.5$ (random guessing).<br /><br />Advesarial Nets have been used to great success to synthesize the following from thin air:<br /><br /><img height="400" src="https://github.com/aleju/cat-generator/raw/master/images/random_color_256_g32upc.jpg?raw=true" width="400" /><br /><a href="https://github.com/aleju/cat-generator">Cat Faces</a><br /><img src="http://soumith.ch/eyescream/images/church_outdoor.png" height="400" width="75%%" /><br /><a href="http://soumith.ch/eyescream/">Churches</a><br /><img height="400" src="https://raw.githubusercontent.com/mattya/chainer-DCGAN/master/sample4.png" width="399" /><br /><a href="https://github.com/mattya/chainer-DCGAN">Anime Characters</a><br /><br />In this tutorial we won't be doing anything nearly as amazing, but hopefully you'll come away with a better fundamental understanding of adversarial nets.<br /><h3>Implementation</h3>We'll be training a neural network to sample from the simple 1-D normal distribution $\mathcal{N}(-1,1)$<br /><br /><img src="https://i.imgur.com/LL6bvIF.png" /> <br /><br />Let $D$,$G$ be small 3-layer perceptrons, each with a meager 11 hidden units in total. $G$ takes as input a single sample of a noise distribution: $z \sim \text{uniform}(0,1)$. We want $G$ to map points $z_1,z_2,...z_M$ to $x_1,x_2,...x_M$, in such a way that mapped points $x_i=G(z_i)$ cluster densely where $p_{data}(X)$ is dense. Thus, G takes in $z$ and generates fake data $x^\prime$.<br /><img src="https://i.imgur.com/fWx86k5.png" width="90%" /> <br />Meanwhile, the discriminator $D$, takes in input $x$ and outputs a likelihood of the input belonging to $p_{data}$.<br />Let $D_1$ and $D_2$ be copies of $D$ (they share the same parameters so $D_1(x)=D_2(x)$). The input to $D_1$ is a single sample of the legitimate data distribution: $x \sim p_{data}$, so when optimizing the decider we want the quantity $D_1(x)$ to be maximized. $D_2$ takes as input $x^\prime$ (the fake data generated by $G$), so when optimizing $D$ we want to $D_2(x^\prime)$ to be <i>minimized</i>. The value function for $D$ is:<br />$$ \log(D_1(x))+\log(1-D_2(G(z))) $$ <br />Here's the Python code:<br /><pre><code>batch=tf.Variable(0)<br />obj_d=tf.reduce_mean(tf.log(D1)+tf.log(1-D2))<br />opt_d=tf.train.GradientDescentOptimizer(0.01)<br /> .minimize(1-obj_d,global_step=batch,var_list=theta_d)<br /></code></pre>The reason we go through the trouble of specifying copies $D_1$ and $D_2$ is that in TensorFlow, we need one copy of $D$ to process the input $x$ and another copy to process the input $G(z)$; the same section of the computational graph can't be re-used for different inputs.<br />When optimizing $G$, we want the quantity $D_2(X^\prime)$ to be maximized (successfully fooling $D$). The value function for $G$ is:<br />$$ \log(D_2(G(z))) $$ <br /><pre><code>batch=tf.Variable(0)<br />obj_g=tf.reduce_mean(tf.log(D2))<br />opt_g=tf.train.GradientDescentOptimizer(0.01)<br /> .minimize(1-obj_g,global_step=batch,var_list=theta_g)<br /></code></pre>Instead of optimizing with one pair $(x,z)$ at a time, we update the gradient based on the average of $M$ loss gradients computed for $M$ different $(x,z)$ pairs. The stochastic gradient estimated from a minibatch is closer to the true gradient across the training data.<br />The training loop is straightforward:<br /><pre><code># Algorithm 1, GoodFellow et al. 2014<br />for i in range(TRAIN_ITERS):<br /> x= np.random.normal(mu,sigma,M) # sample minibatch from p_data<br /> z= np.random.random(M) # sample minibatch from noise prior<br /> sess.run(opt_d, {x_node: x, z_node: z}) # update discriminator D<br /> z= np.random.random(M) # sample noise prior<br /> sess.run(opt_g, {z_node: z}) # update generator G<br /></code></pre><h4>Manifold Alignment</h4>Following the above recipe naively will not lead to good results, because we are sampling $p_{data}$ and $\text{uniform}(0,1)$ independently each iteration. Nothing is enforcing that adjacent points in the $Z$ domain are being mapped to adjacent points in the $X$ domain; in one minibatch, we might train $G$ to map $0.501 \to -1.1$, $0.502 \to 0.01$, and $0.503 \to -1.11$. The mapping arrows cross each other too much, making the transformation very bumpy. What's worse, the next minibatch might map $0.5015 \to 1.1$, $0.5025 \to -1.1$, and $0.504 \to 1.01$. This implies a completely different mapping $G$ from the previous minibatch, so the optimizer will fail to converge.<br /><img src="https://i.imgur.com/Efm3xOr.png" /> <br />To remedy this, we want to minimize the total length of the arrows taking points from $Z$ to $X$, because this will make the transformation as smooth as possible and easier to learn. Another way of saying this is that the "vector bundles" carrying $Z$ to $X$ should be correlated between minibatches. <br /><br />First, we'll stretch the domain of $Z$ to the same size of $X$. The normal distribution centered at -1 has most of its probability mass lying between $[-5,5]$, so we should sample $Z$ from $\text{uniform}(-5,5)$. Doing this means $G$ no longer needs to learn how to "stretch" the domain $[0,1]$ by a factor of 10. The less $G$ has to learn, the better. <br />Next, we'll align the samples of $Z$ and $X$ within a minibatch by sorting them both from lowest to highest.<br /><br />Instead of sampling $Z$ via <code>np.random.random(M).sort()</code>, we'll use via <a herf="https://en.wikipedia.org/wiki/Stratified_sampling" href="https://www.blogger.com/null">stratified sampling</a> - we generate $M$ equally spaced points along the domain and then jitter the points randomly. This preserves sorted order and also increases the representativeness the entire training space. We then match our stratified, sorted $Z$ samples to our sorted $X$ samples.<br /><img src="https://i.imgur.com/Zx4bq1N.png" /> <br />Of course, for higher dimensional problems it's not so straightforward to align the input space $Z$ with the target space $X$, since sorting points doesn't really make sense in 2D and higher. However, the notion of minimizing the transformation distance between the $Z$ and $X$ manifolds still holds <a href="https://www.blogger.com/blogger.g?blogID=842965756326639856#ref2">[2]</a>. <br /><br />The modified algorithm is as follows:<br /><pre><code>for i in range(TRAIN_ITERS):<br /> x= np.random.normal(mu,sigma,M).sort()<br /> z= np.linspace(-5.,5.,M)+np.random.random(M)*.01 # stratified<br /> sess.run(opt_d, {x_node: x, z_node: z})<br /> z= np.linspace(-5.,5.,M)+np.random.random(M)*.01<br /> sess.run(opt_g, {z_node: z})<br /></code></pre><br />This step was crucial for me to get this example working: when dealing with random noise as input, failing to align the transformation map properly will give rise to a host of other problems, like massive gradients that kill ReLU units early on, plateaus in the objective function, or performance not scaling with minibatch size.<br /><h4>Pretraining D</h4>The original algorithm runs $k$ steps of gradient descent on $D$ for every step of gradient descent on $G$. I found it more helpful to pre-train $D$ a larger number of steps prior to running the adversarial net, using a mean-square error (MSE) loss function to fit $D$ to $p_{data}$. This loss function is nicer to optimize than the log-likelihood objective function for $D$ (since the latter has to deal with stuff from $G$). It's easy to see that $p_{data}$ is the optimal likelihood decision surface for its own distribution.<br /><br />Here is the decision boundary at initialization.<br /><br /><img src="https://i.imgur.com/pjae94K.png" /> <br />After pretraining:<br /><img src="https://i.imgur.com/z6Rwx24.png" /> <br />Close enough, lol.<br /><h4>Other Troubleshooting Comments</h4><ul><li>Using too many parameters in the model often leads to overfitting, but in my case making the network too big failed to even converge under the minimax objective - the network units saturated too quickly from large gradients. Start with a shallow, tiny network and only add extra units/layers if you feel that the size increase is absolutely necessary.</li><li>I started out using ReLU units but the units kept saturating (possibly due to manifold alignment issues). The Tanh activation function seemed to work better.</li><li>I also had to tweak the learning rate a bit to get good results.</li></ul><h3>Results</h3>Before training, here is $p_{data}$, the pre-trained decision surface of $D$, and the generative distribution $p_g$.<br />. <img src="https://i.imgur.com/gEWYhMG.png" /> <br />Here's the loss function as a function of training iterations.<br /><img src="https://i.imgur.com/UUSQMf2.png" /> <br />After training, $p_g$ approximates $p_{data}$, and the discriminator is about uniformly confused ($D=0.5$) for all values of $X$.<br /><img src="https://i.imgur.com/O9TOVQy.png" /> <br />And there you have it! $G$ has learned how to sample (approximately) from $p_{data}$ to the extent that $D$ can no longer separate the "reals" from the forgeries.<br /><h3>Footnotes</h3><ol><li id="ref1"> Here's a more vivid example of computational indistinguishability: suppose we train a super-powerful neural network to sample from the distribution of cat faces.<br /> The underlying (generative) data distribution for "true" cat faces involves 1) a cat being born and 2) somebody eventually taking a photograph of the cat. Clearly, our neural network isn't learning this particular generative process, as no real cats are involved. However, if our neural network can generate pictures that we cannot distinguish from "authentic" cat photos (armed with polynomial-time computational resources), then in some ways these photos are "just as legitimate" as regular cat photos.<br /> This is worth pondering over, in the context of the Turing Test, Cryptography, and fake Rolexes.<br /> </li><li id="ref2"> Excessive mapping crossover can be viewed from the perspective of "overfitting", where the learned regression/classification function has been distorted in "contradictory" ways by data samples (for instance, a picture of a cat classified as a dog). Regularization techniques implicitly discourage excessive "mapping crossover", but do not explicitly implement a sorting algorithm to ensure the learned transformation is continuous / aligned in $X$-space with respect to $Z$-space. Such a sorting mechanism might be very useful for improving training speeds... <br /> </li></ol>Ericnoreply@blogger.com44tag:blogger.com,1999:blog-842965756326639856.post-34253076290058424272016-06-25T05:20:00.001-07:002016-06-25T05:58:55.664-07:00My Internship Experiences at Pixar, Google, and Two Sigma<i>This post was first published on 08/17/15, and has since been migrated to Blogger.</i><br /><i><br /></i>Over the last three summers, I've had the incredible privilege of interning with Pixar Animation Studios, Google Inc, and Two Sigma Investments. I wanted to summarize my experiences working in three very different industries (animation, technology, finance). CS/Applied Math students have a lot of career options open to them, and I hope this post will provide some extra perspective on what opportunities are available.<br /><br />In a later post, I'll share some tips for landing tech internships.<br /><br />I do not represent Pixar, Google, or Two Sigma, and all views expressed here are my own.<br /><br /><h3>Pixar</h3><br/><div class="separator" style="clear: both; text-align: center;"><a href="http://i.imgur.com/JGlmc57.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://i.imgur.com/JGlmc57.jpg" height="306" width="400" /></a></div><br />Filmmaking, in its highest form, is the expression and manipulation of human emotion. Pixar is second to none at filmmaking, and their best works have such emotional power they can move an entire theatre to tears in under 240 seconds (see Up's "Married Life montage").<br /><br />I interned at Pixar in 2013 as part of the Pixar Undergraduate Program (PUP). This was a 10-week educational program where we learned about Pixar's movie-making process and developed our skills in modeling, shading, lighting, etc.<br /><br />"Fun and Laughter" is how I would describe the studio. I spent the first few weeks of my internship in an awestruck stupor, wondering 1) how the hell I had managed to get accepted into the program and 2) finding excuses to wander down to the atrium so I could catch sightings of legendary people.<br /><br />My favorite aspect of Pixar was the opportunity to learn from the masters. When researching ideas for my PUP final project, the DP of Finding Nemo taught me about lighting underwater scenes and jellyfish. The guy who created the render-time vegetation in Brave taught me how to see the world's shapes as literal mathematical curves, noise, and fractals.<br /><br />Pixar does some pretty hardcore research on the cutting edge of computer graphics (Discrete Exterior Calculus for Geometry Processing, realtime global illumination on the GPU), and their in-house animation tools are years ahead of any commercially available software. For a smallish company (1200?), their technology infrastructure is very robust.<br /><br />One thing I really appreciate about Pixar is how trusting the company was with us interns' access to intellectual property (IP) and trade secrets. Our access privileges were basically equivalent to full-time employees, allowing us interns to get a clear picture of what it's like to pursue a career in animation. At Google and Two Sigma, interns are kept in an "intern container", making it harder to discern the realities of full-time life.<br /><br />My primary concern with coming back full-time there are a lot of forces in the entertainment industry that pressure studios to compromise a work of art for cheap entertainment and box office returns. Pixar would rather postpone the production of a film than ship a version that sucks, which says a lot about the quality of its people and their dedication to founding values. However, even Pixar has a bottom line. I don't know how I feel about Toy Story 4 being slated for production. Given the volatile nature of the industry these days, there is some risk involved with pursuing animation as a career.<br /><br />Even so, I think Pixar is one of those places that nobody ever regrets working for. For those of you who don't care much for watching movies, or think that storytelling isn't as meaningful to society as making self-driving cars or growing charity endowments, I wanted to end this section by providing a brief anecdote:<br /><br />At the climax of Toy Story 3, Woody and his friends are about to perish in a trash incinerator. Realizing the futility of their struggle, Buzz reaches out to Jessie, and one by one, the toys join hands to face the flames.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://i.imgur.com/JfGiraE.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://i.imgur.com/JfGiraE.png" height="171" width="400" /></a></div><br /><br />I actually shed tears from the <i>sheer awe</i> of how a bunch of polygons could be contorted to simultaneously express the <i>finality of the situation and everlasting friendship</i>. The B flat minor "Bolero Effect" in the background helped too. In those moments, I forgot that I was watching a bunch of computer marionettes dance according to the 100th revision of a script. I was a kid again, and Woody and Buzz were real.<br /><br />If that isn't magic, then I don't know what is.<br /><br />For a broader picture of Pixar beyond my own personal experiences, I encourage you to read Ed Catmull's "Creativity Inc."<br /><br /><h3>Google</h3><br />It's hard for me to make any sweeping generalizations about Google, on account of how big and diverse it is. I would say that Google is like the most amazing and colorful kiddie playground you've ever seen, with C.H.O.A.M headquarters (and a dozen other things) buried underneath.<br /><br />Very few people leave Google because of how goddamned <i>comfortable</i> it is. Full-time benefits at Google beat Pixar and Two Sigma. If you care most about raising a family, work life balance (i.e. work the least hours) and job security, you really can't go wrong with Google. Here is a picture of me eating from an endless supply of baklava.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://i.imgur.com/ZNZmv3G.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://i.imgur.com/ZNZmv3G.jpg" height="286" width="400" /></a></div><br />I worked on the iOS Play Books team and implementing features for the iOS client. I also did some stuff for the Security Team that I'm not allowed to disclose (20% projects are rarer nowadays, but still available if you take the initiative). My work was incredibly fun, and I was lucky to be assigned an interesting, user-facing project that eventually saw the light of day.<br /><br />Well, that depends if anybody actually uses iOS Play Books...<br /><br />Google has the best tech infrastructure of any firm I know (even among other big tech companies). Developers have root access, code review is streamlined, unit testing comes with progress bars, and compilation is blaze-ingly fast. Everything, from onboarding to recruiting to Techstop is fast and can elastically scale. Every summer, Google onboards 2000 interns in a matter of weeks, and the workforce subsequently grows by 10%. Even with this increased load on infrastructure, outages are rare.<br /><br />Concerns: I'm scared of becoming a small cog and spending the rest of my career drinking smoothies from Slice Cafe and being unchallenged by my work. There are certainly some roles at the company that would make me feel that way, but also plenty of teams that I would jump at the chance to work for. In particular, I'm interested in some more open-ended research positions.<br /><br />For more perspectives on what full-time Google is like, I recommend reading comments on Hacker News that discuss Google. Lots of current and ex-Googlers hang out there.<br /><br /><h3>Two Sigma</h3><br/>Two Sigma is a hedge fund whose core business is "algorithmic trading". In a nutshell, TS uses technology and mathematics to forecast stock prices, and build systems to automatically execute strategies to realize ludicrous profit.<br /><br />Two Sigma is like the math/CS equivalent of Pixar's art/CS dream team. TS employs a small but high concentration of talent; there are literally International Math Olympiad medalists left and right, and the majority of people I ate lunch with were Phds. If you draw a directed graph between companies, where edges represent employee migration, you'll see lots of edges going from Goldman, Microsoft, Google, and Facebook to Two Sigma. There are very few edges going the other way.<br /><br />The hedge fund business is lucrative. I don't have enough data, but I'm pretty sure full-time engineering/quant comp at firms like Two Sigma/Shaw/Citadel are slightly better (sometimes a lot better) than those of tech companies in the bay area.<br /><br />Overall, I think the firm has better career growth opportunities than its competitors. By nature of being a smaller company (700+), the average Two Sigma employee has more responsibilities than the average Google employee, which I like a lot. Company culture is a bit more serious - median dress code ranges from business casual / smart casual, and there are no scooters or scooter-riders lying around. I like that too.<br /><br />This summer I joined the Quantiative Applications Engineering team and worked on a project related to portfolio optimization. I had been looking for more exposure to mathematics in my work, so that I wouldn't be pigeonholed as a software engineer for the rest of my career. Finance/math problems interest me more than UX.<br /><br />Office floor plans are quite similar to Google Building 47 (that I worked in), minus the bespoke decor, the chinese food cafe and the nice lady that served tea from 3-5pm every day. I forgot to take pictures of the office, so I'll just throw in a picture of the Brooklyn Bridge that I took on one of our intern events.<br /><div class="separator" style="clear: both; text-align: center;"><a href="http://i.imgur.com/sQJLBfU.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://i.imgur.com/sQJLBfU.jpg" height="400" width="300" /></a></div><br />The internship program was really well-run: Two Sigma only takes a handful of interns each summer, so they took us on a lot of fun intern outings. Pixar's intern classes are small too, but departments run on different schedules so I mostly just hung out with the other PUPs.<br /><br />Two Sigma is a secretive firm, and there is little useful information about it floating on the public Internet. To learn more about the hedge fund industry, I recommend reading "More Money than God" by Sebastian Mallaby.<br /><br /><h3>Personal Rankings</h3>Below is a table of company factors that I've ranked Pixar, Google, and Two Sigma by, with "1" being the best. Keep in mind that these rankings are not objective realities, and that people prioritize different factors. For example, the "Culture" rankings are scored according to "which company's culture I thrive best in", not "which company I think has the best culture".<br /><br /> <table> <thead><tr> <th> </th> <th><div style="text-align: left;">Pixar</div></th> <th><div style="text-align: left;">Google</div></th> <th><div style="text-align: left;">Two Sigma</div></th> </tr></thead> <tbody> <tr> <td><b>Compensation</b></td> <td>3</td> <td>2</td> <td>1</td> </tr> <tr> <td><b>Career Growth</b></th> <td>2</td> <td>3</td> <td>1</td> </tr> <tr> <td><b>Culture</b></td> <td>1</td> <td>3</td> <td>2</td> </tr> <tr> <td><b>Perks</b></td> <td>2</td> <td>1</td> <td>3</td> </tr> <tr> <td><b>Facilities</b></td> <td>2</td> <td>1</td> <td>3</td> </tr> <tr> <td><b>Hours</b></td> <td>3</td> <td>1</td> <td>2</td> </tr> </tbody></table> <br /><h3>Trivia</h3><br /><table> <thead><tr> <th> </th> <th><div style="text-align: left;">Pixar</div></th> <th><div style="text-align: left;">Google</div></th> <th><div style="text-align: left;">Two Sigma</div></th> </tr></thead> <tbody><tr> <td><b>Team</b></td> <td>Pixar Undergraduate Program</td> <td>iOS Play Books</td> <td>Quantitative Applications Engineering</td> </tr><tr> <td><b>Most Memorable Experience</b></td> <td>During orientation, we were shown an early screening of "The Blue Umbrella". I thought to myself, "we're in on the secret..."</td> <td>Food trucks that one does not need to pay for</td> <td>Breathtaking amounts of money flashing across my screen</td> </tr><tr> <td><b>My Typical Lunch</b></td> <td>Luxo Cafe</td> <td>Jia's, then Charlies, then Yoshka's, followed by Slice ...</td> <td>Go to Chinatown with Chinese co-workers. Practice my mandarin</td> </tr><tr> <td><b>What I did in my spare time</b></td> <td>Browse the intranet for juicy info</td> <td>Browse the intranet for juicy info</td> <td>Browse the intranet for juicy info</td> </tr></tbody></table><br /><br /><h3>Closing Thoughts</h3><br />Even now, I'm still a bit bewildered at how fortunate I was to intern at these awesome places, full of the best of the best of the best. It would not have been possible without well-placed recommendations from friends, mentors, co-workers who believed in me, and I am forever grateful to them.<br /><br />I have my concerns about Pixar/Google/Two Sigma, but I believe I can either solve or navigate around the problems. Realistically, no company is perfect and anyone who says otherwise is either a beneficiary of the status quo, lying, or just plain naive. I'd definitely consider going back to any of these companies full-time (contingent on them wanting me back!).<br /><br />So, what happens next? I'll be graduating in the Spring of 2016 with my Masters in CS, and I am currently looking for full-time positions (at other companies too). Deciding on a full-time job is a serious matter, so I'm keeping an open mind and taking my time deciding.<br /><br />Thank you for reading.Ericnoreply@blogger.com57