Sunday, July 24, 2016

Why Randomness is Important for Deep Learning

This afternoon I attempted to explain to my mom why Randomness is important for Deep Learning without using any jargon from probability, statistics, or machine learning.

The exercise was partially successful. Maybe. I still don't think she knows what Deep Learning is, besides that I am a big fan of it and use it for my job.

I'm a big fan of Deep Learning

This post is a slightly more technical version of the explanation I gave my mom, with the hope that it will help Deep Learning practitioners to better understand what is going on in neural networks.

If you are getting started in Deep Learning research, you may discover that there are a whole bunch of seemingly arbitrary techniques used to train neural networks, with very little theoretical justification besides "it works". For example: dropout regularization, adding gradient noise, asynchronous stochastic gradient descent.

What do all these hocus pocus techniques have in common? They incorporate randomness!


Random noise actually is crucial for getting DNNs to work well:

  1. Random noise allows neural nets to produce multiple outputs given the same instance of input.
  2. Random noise limits the amount of information flowing through the network, forcing the network to learn meaningful representations of data. 
  3. Random noise provides "exploration energy" for finding better optimization solutions during gradient descent.

Single Input, Multiple Output


Suppose you are training a deep neural network (DNN) to classify images.




For each cropped region, the network learns to convert an image into a number representing a class label, such as “dog” or “person”.

That’s all very well and good, and this kind of DNN does not require randomness in its inference model. After all, any image of a dog should be mapped to the “dog” label, and there is nothing random about that.

Now suppose you are training a deep neural network (DNN) to play the game of Go. In the case of the image below, the DNN has to make the first move.



If you used the same deterministic strategy described above, you would find that this network fails to give good results. Why? Because there is no single “optimal starting move” - for each possible stone placement on the board, there is an equally good stone placement on the other side of the board, via rotational symmetry. There are multiple best answers.

If the network is deterministic and only capable of picking one output per input, then the optimization process will force the network to choose the move that averages all the best answers, which is smack-dab in the middle of the board. This behavior is highly undesirable, as the center of the board is generally regarded as a bad starting move.
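A tiny numpy sketch of why a deterministic net trained with a squared-error loss ends up averaging equally good answers (the board coordinates below are made up for illustration):

```python
import numpy as np

# Two equally good opening moves for the same board position, as
# (row, col) coordinates on a 19x19 board (0-indexed, so 9,9 is the center).
targets = np.array([[3.0, 3.0],
                    [15.0, 15.0]])

# A deterministic network can emit only one output for this input, and the
# output minimizing the average squared error over both targets is their mean.
best_single_output = targets.mean(axis=0)
print(best_single_output)  # [9. 9.] -- smack in the middle of the board
```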

Hence, randomness is important when you want the network to be able to output multiple possibilities given the same input, rather than generating the same output over and over again. This is crucial when there are underlying symmetries in the action space - incorporating randomness literally helps us break out of the "stuck between two bales of hay" scenario.

Similarly, if we are training a neural net to compose music or draw pictures, we don’t want it to always draw the same thing or play the same music every time it is given a blank sheet of paper. We want some notion of “variation”, "surprise", and “creativity” in our generative models.

One approach to incorporating randomness into DNNs is to keep the network deterministic, but have its outputs be the parameters of a probability distribution, which we can then draw samples from using conventional sampling methods to generate “stochastic outputs”.
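As a minimal sketch of this idea (the scores and shapes below are made up for illustration): the deterministic network produces one score per legal move, we squash the scores into a categorical distribution with a softmax, and then sample from it.

```python
import numpy as np

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    z = logits - np.max(logits)
    p = np.exp(z)
    return p / p.sum()

# Hypothetical per-move scores output by a deterministic network;
# note the two symmetric moves receive identical scores.
logits = np.array([1.2, 0.3, 1.2, -0.5])
probs = softmax(logits)

# Stochastic output: each call can return a different move, and the two
# symmetric moves are chosen equally often instead of being averaged.
move = np.random.choice(len(probs), p=probs)
print(probs, move)
```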

Deepmind's AlphaGo utilized this principle: given an image of a Go board, it outputs the probability of winning given each possible move. The practice of modeling a distribution after the output of the network is commonly used in other areas of Deep Reinforcement Learning.

Randomness and Information Theory


During my first few courses in probability & statistics, I really struggled to understand the physical meaning of randomness. When you flip a coin, where does this randomness come from? Is randomness just deterministic chaos? Is it possible for something to be actually random?

To be honest, I still don't fully understand. 


Information theory offers a definition of randomness that is grounded enough to use without being kept wide awake at night: "randomness" is nothing more than the "lack of information".

More specifically, the amount of information in an object is the length (e.g. in bits, kilobytes, etc.) of the shortest computer program needed to fully describe it. For example, the first 1 million digits of $\pi = 3.14159265....$ can be represented as a string of length 1,000,002 characters, but it can be more compactly represented using 70 characters, via an implementation of the Leibniz Formula:
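A rough Python sketch of the idea (not the original 70-character program, and far too slow to actually produce a million digits):

```python
# Approximate pi via the Leibniz series: pi/4 = 1 - 1/3 + 1/5 - 1/7 + ...
# Illustrative only: this series converges much too slowly to recover a
# million digits in practice, but it shows how a short program can stand
# in for a very long string of digits.
print(4 * sum((-1)**k / (2*k + 1) for k in range(10**7)))
```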

The above program is nothing more than a compressed version of a million digits of $\pi$. A more concise program could probably express the first million digits of $\pi$ in far fewer bits.

Under this interpretation, randomness is "that which cannot be compressed". While the first million digits of $\pi$ can be compressed and are thus not random, empirical evidence suggests (but has not proven) that $\pi$ itself is a normal number, and thus the amount of information encoded in $\pi$ is infinite.

Consider a number $a$ that is equal to the first trillion digits of $\pi$, $a = 3.14159265...$. If we add to that a uniform random number $r$ that lies in the range $(-0.001,0.001)$, we get a number that lies between $3.14059...$ and $3.14259...$. The resulting number $a + r$ now only carries ~three digits worth of information, because the process of adding random noise destroyed any information carried beyond the hundredths decimal place.
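A quick numerical check of this, with the digits of $\pi$ truncated for brevity:

```python
import random

a = 3.141592653589793             # stand-in for "many digits of pi"
r = random.uniform(-0.001, 0.001)

# Everything past roughly the third decimal place is scrambled by the
# noise; only "3.14" reliably survives.
print(a + r)
```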

Limiting Information in Neural Nets


What does this definition of randomness have to do with neural networks?

Another way randomness is incorporated into DNNs is by injecting noise directly into the network itself, rather than using the DNN to model a distribution. This makes the task "harder" to learn as the network has to overcome these internal "perturbations".

Why on Earth would you want to do this? The basic intuition is that noise limits the amount of information you can pass through a channel.

Consider an autoencoder, a type of neural network architecture that attempts to learn an efficient encoding of data by "squeezing" the input into fewer dimensions in the middle and re-constituting the original data at the other end. A diagram is shown below:




During inference, inputs flow from the left through the nodes of the network and come out the other side, just like a pipe.

If we consider a theoretical neural network that operates on real numbers (rather than floating point numbers), then without noise in the network, every layer of the DNN actually has access to infinite information bandwidth.

Even though we are squeezing representations (the pipe) into fewer hidden units, the network could still learn to encode the previous layer's data into the decimal point values without actually learning any meaningful features. In fact, we could represent all the information in the network with a single number. This is undesirable.
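A toy illustration of this failure mode: with unbounded precision, two "features" can be packed into the decimals of a single number and recovered exactly (ordinary floats only approximate this, which is part of why the trick is impractical but not impossible):

```python
# Pack two 3-digit "features" into one real number, then unpack them.
a, b = 123, 456
packed = a / 1e3 + b / 1e6                      # 0.123456
a_recovered = int(packed * 1e3)                 # 123
b_recovered = int(round(packed * 1e6)) % 1000   # 456
print(packed, a_recovered, b_recovered)
```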

By limiting the amount of information in a network, we force it to learn compact representations of input features. Several ways to go about this:

  • Variational autoencoders (VAE) add Gaussian noise to the hidden layer. This noise destroys "excess information," forcing the network to learn compact representations of data. 
  • Closely related (and possibly equivalent?) to VAE noise is the idea of Dropout Regularization - randomly zeroing out some fraction of units during training. Like the VAE, dropout noise forces the network to learn useful information under limited bandwidth; a small sketch of both kinds of noise follows this list.  
  • Deep Networks with Stochastic Depth - similar idea to dropout, but at a per-layer level rather than per-unit level.
  • There's a very interesting paper called Binarized Neural Networks that uses binary weights and activations in the inference pass, but real-valued gradients in the backward pass. The source of noise comes from the fact that the gradient is a noisy version of the binarized gradient. While BinaryNets are not necessarily more powerful than regular DNNs, individual units can only encode one bit of information, which regularizes against two features being squeezed into a single unit via floating point encoding. 

More efficient compression schemes mean better generalization at test time, which explains why dropout works so well against over-fitting. If you decide to use regular autoencoders over variational autoencoders, you must use a stochastic regularization trick such as dropout to control how many bits your compressed features should carry, otherwise you will likely over-fit.
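To make the first two bullets above concrete, here is a minimal numpy sketch of both kinds of noise injection applied to a vector of hidden activations. The sizes and noise scales are arbitrary, and this is not a faithful VAE (a real VAE learns the per-unit mean and variance and adds a KL penalty to the loss); it just shows where the noise enters.

```python
import numpy as np

h = np.random.randn(8)                 # hidden-layer activations

# VAE-style noise: treat h as the mean of a Gaussian and add noise with a
# fixed standard deviation (a real VAE learns sigma and penalizes it via KL).
sigma = 0.5
h_vae = h + sigma * np.random.randn(*h.shape)

# Dropout: zero out half the units at random and rescale the survivors
# ("inverted dropout") so the expected activation stays the same.
keep_prob = 0.5
mask = (np.random.rand(*h.shape) < keep_prob) / keep_prob
h_dropout = h * mask

print(h_vae)
print(h_dropout)
```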

I think VAEs are objectively superior because they are easy to implement and allow you to specify exactly how many bits of information are passing through each layer.

Exploration "Energy"


DNNs are usually trained via some variant of gradient descent, which basically amounts to finding the parameter update that "rolls downhill" along some loss function. When you get to the bottom of the deepest valley, you've found the best possible parameters for your neural net.

The problem with this approach is that neural network loss surfaces have a lot of local minima and plateaus. It's easy to get stuck in a small dip or a flat portion where the slope is already zero (a shallow local minimum or plateau), even though you are not done yet.




The third interpretation of how randomness assists Deep Learning models is based on the idea of exploration.

Because the datasets used to train DNNs are huge, it’s too expensive to compute the gradient across terabytes of data for every single gradient descent step. Instead, we use stochastic gradient descent (SGD), where we just compute the average gradient across a small minibatch of examples chosen at random from the dataset.
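In numpy-flavored pseudocode, one SGD step might look like the following; the model is just linear regression so the gradient stays easy to write down, and all constants are placeholders.

```python
import numpy as np

def sgd_step(w, X, y, batch_size=32, lr=0.01):
    # Pick a random subset of the dataset instead of touching all of it.
    idx = np.random.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    # Average gradient of the squared error over the minibatch only --
    # a noisy but cheap estimate of the full-dataset gradient.
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size
    return w - lr * grad

# Toy data: y = X @ w_true plus a little noise.
X = np.random.randn(10000, 5)
w_true = np.arange(1.0, 6.0)
y = X @ w_true + 0.1 * np.random.randn(10000)

w = np.zeros(5)
for _ in range(2000):
    w = sgd_step(w, X, y)
print(w)   # approaches w_true despite never seeing the whole dataset at once
```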

In evolution, if the success of a species is modeled by random variable $X$, then random mutation or noise increases the variance of $X$ - its progeny could either be far better off (adaptation, poison defense) or far worse off (lethal defects, sterility).

In numerical optimization, this "genetic mutability" is called "thermodynamic energy", or Temperature, which allows the parameter update trajectory to not always "roll downhill", but to occasionally bounce out of a local minimum or "tunnel through" hills.

This is all deeply related to the Exploration-vs.-Exploitation tradeoff as formulated in RL. Training a purely deterministic DNN with zero gradient noise has zero exploration capabilities - it converges straight to the nearest local minimum, however shallow.

Using stochastic gradients (either via small minibatches or literally adding noise to the gradients themselves) is an effective way to allow the optimization to do a bit of "searching" and "bouncing" out of weak local minima. Asynchronous stochastic gradient descent, in which many machines are performing gradient descent in parallel, is another possible source of noise.
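A sketch of the "literally adding noise to the gradients" option, in the spirit of the gradient-noise trick mentioned at the top of the post; the annealing schedule and constants here are placeholders.

```python
import numpy as np

def noisy_gradient_step(w, grad, lr, step, eta=0.1, gamma=0.55):
    # Anneal the noise variance over time: lots of exploration early on,
    # settling down as training converges.
    sigma = np.sqrt(eta / (1.0 + step) ** gamma)
    noisy_grad = grad + sigma * np.random.randn(*grad.shape)
    return w - lr * noisy_grad
```

This additive noise composes with the minibatch noise from the previous sketch; both turn the descent direction into a noisy estimate that can bounce out of shallow minima.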

This "thermodynamic" energy ensures symmetry-breaking during early stages of training, to ensure that all the gradients in a layer are not synchronized to the same values. Not only does noise perform symmetry breaking in action space of the neural net, but noise also performs symmetry breaking in the parameter space of the neural net.

Closing Thoughts


I find it really interesting that random noise actually helps Artificial Intelligence algorithms to avoid over-fitting and explore the solution space during optimization or Reinforcement Learning. It raises interesting philosophical questions on whether the inherent noisiness of our neural code is a feature, not a bug. 

One theoretical ML research question I am interested in is whether all these neural network training tricks are actually variations of some general regularization theorem. Perhaps theoretical work on compression will be really useful for understanding this. 

It would be interesting to check the information capacity of various neural networks relative to hand-engineered feature representations, and see how that relates to overfitting tendency or quality of gradients. It's certainly not trivial to measure the information capacity of a network with dropout or trained via SGD, but I think it can be done. For example, constructing a database of synthetic vectors whose information content (in bits, kilobytes, etc) is exactly known, and seeing how networks of various sizes, in combination with techniques like dropout, deal with learning a generative model of that dataset.

Saturday, July 16, 2016

What product breakthroughs will recent advances in deep learning enable?

This is re-posted from a Quora answer I wrote on 6/11/16.

Deep Learning refers to a class of machine learning (ML) techniques that combine the following:
  • Large neural networks (millions of free parameters)
  • High performance computing (thousands of processors running in parallel)
  • Big Data (e.g. millions of color images or recorded chess games)
Deep learning techniques currently achieve state-of-the-art performance in a multitude of problem domains (vision, audio, robotics, natural language processing, to name a few). Recent advances in Deep Learning also incorporate ideas from statistical learning [1,2], reinforcement learning (RL) [3], and numerical optimization. For a broad survey of the field, see [9,10].

In no particular order, here are some product categories made possible with today's deep learning techniques: 
  • customized data compression
  • compressive sensing
  • data-driven sensor calibration
  • offline AI
  • human-computer interaction
  • gaming, artistic assistants
  • unstructured data mining
  • voice synthesis

Customized data compression

Suppose you are designing a video conferencing app, and want to come up with a lossy encoding scheme to reduce the number of packets you need to send over the Internet. 

You could use an off-the-shelf codec like H.264, but H.264 is not optimal because it is calibrated for generic video - anything from cat videos to feature films to clouds. It would be nice if instead we had a video codec optimized specifically for FaceTime videos. We can save even more bytes than a generic algorithm if we take advantage of the fact that most of the time, there is a face in the center of the screen. However, designing such an encoding scheme is tricky:
  • How do we specify where the face is positioned, how much eyebrow hair the subject has, what color their eyes are, the shape of their jaw? 
  • What if their hair is covering one of their eyes? 
  • What if there are zero or multiple faces in the picture?



Deep learning can be applied here. Auto-encoders are a type of neural network whose output is merely a copy of the input data. Learning this "identity mapping" would be trivial if it weren't for the fact that the hidden layers of the auto-encoder are chosen to be smaller than the input layer. This "information bottleneck" forces the auto-encoder to learn a compressed representation of the data in the hidden layer, which is then decoded back to the original form by the remaining layers in the network.
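As a rough illustration of the bottleneck idea (not a production codec), here is a tiny numpy auto-encoder trained on synthetic data that secretly has only 8 degrees of freedom; every size and constant below is an arbitrary placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic inputs: 256-dimensional vectors generated from only 8
# underlying factors, so an 8-unit bottleneck can capture them.
latent = rng.normal(size=(1000, 8))
mixing = rng.normal(size=(8, 256))
X = np.tanh(latent @ mixing)

d_in, d_code = 256, 8
W_enc = rng.normal(scale=0.1, size=(d_in, d_code))
W_dec = rng.normal(scale=0.1, size=(d_code, d_in))

lr = 0.01
for step in range(2000):
    batch = X[rng.choice(len(X), size=64, replace=False)]

    # Forward pass: squeeze through the bottleneck, then reconstruct.
    code = np.tanh(batch @ W_enc)      # (64, 8)   compressed representation
    recon = code @ W_dec               # (64, 256) reconstruction
    err = recon - batch

    # Backward pass: gradients of the mean squared reconstruction error.
    grad_dec = code.T @ err / len(batch)
    code_grad = (err @ W_dec.T) * (1 - code**2)   # backprop through tanh
    grad_enc = batch.T @ code_grad / len(batch)

    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

print("reconstruction MSE:", np.mean((np.tanh(X @ W_enc) @ W_dec - X)**2))
```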



Through end-to-end training, auto-encoders and other deep learning techniques *adapt* to the specific nuances of your data. Unlike principal components analysis, the encoding and decoding steps are not limited to affine (linear) transformations. PCA learns an "encoding linear transform", while auto-encoders learn an "encoding program".

This makes neural nets far more powerful, and allows for complex, domain-specific compression; anything from storing a gazillion selfies on Facebook, to faster YouTube video streaming, to scientific data compression, to reducing the space needed for your personal iTunes library. Imagine if your iTunes library learned a "country music" auto-encoder just to compress your personal music collection!


Compressive sensing

Compressive sensing is closely related to the decoding aspects of lossy compression. Many interesting signals have a particular structure to them - that is, the distribution of signals is not completely arbitrary. This means that we don't actually have to sample at the Nyquist limit in order to obtain a perfect reconstruction of the signal, as long as our decoding algorithm can properly exploit the underlying structure.

Deep learning is applicable here because we can use neural networks to learn the sparse structure without manual feature engineering. Some product applications:
  • Super-resolution algorithms (waifu2X)- literally an "enhance" button like those from CSI Miami
  • using WiFi radio wave interference to see people through walls (MIT Wi-Vi)
  • interpreting 3D structure of an object given incomplete observations (such as a 2D image or partial occlusion)
  • more accurate reconstructions from sonar / LIDAR data

Data-driven sensor calibration

Good sensors and measurement devices often rely on expensive, precision-manufactured components.

Take digital cameras, for example. Digital cameras assume the glass lens is of a certain "nice" geometry. When taking a picture, the onboard processor solves the light transport equations through the lens to compute the final image.



If the lens is scratched, warped, or shaped like a bunny (instead of a disc), these assumptions are broken and the images no longer turn out well. Another example: our current decoding models used in MRI and EEG assume the cranium is a perfect sphere in order to keep the math manageable [4]. This sort of works, but sometimes we miss the location of a tumor by a few mm. More accurate photographic and MRI imaging ought to compensate for geometric deviations, whether they result from underlying sources or manufacturing defects.

Fortunately, deep learning allows us to calibrate our decoding algorithms with data.

Instead of a one-size-fits-all decoding model (such as a Kalman filter), we can express more complex biases specifically tuned to each patient or each measuring device. If our camera lens is scratched, we can train the decoding software to implicitly compensate for the altered geometry. This means we no longer have to manufacture and align sensors with utmost precision, and this saves a lot of money.

In some cases, we can do away with hardware completely and let the decoding algorithm compensate for that; the Columbia Computational Photography lab has developed a kind of camera that doesn't have a lens. Software-defined imaging, so to speak.




Offline AI

Being able to run AI algorithms without an Internet connection is crucial for apps that have low latency requirements (i.e. self-driving cars & robotics) or do not have reliable connectivity (smartphone apps for traveling).

Deep Learning is especially suitable for this. After the training phase, neural networks can run the feed forward step very quickly. Furthermore, it is straightforward to shrink down large neural nets into small ones, until they are portable enough to run on a smartphone (at the expense of some accuracy).

Google has already done this in their offline camera translation feature in Google Translate App [6].



Some other possibilities:
  • Intelligent assistants (e.g. Siri) that retain some functionality even when offline.
  • wilderness survival app that tells you if that plant is poison ivy, or whether those mushrooms are safe to eat
  • small drones with on-board TPU chips [11] that can perform simple obstacle avoidance and navigation

Human-computer interaction


Deep Neural Networks are the first kind of model that can really see and hear our world with an acceptable level of robustness. This opens up a lot of possibilities for Human-Computer Interaction.

Cameras can now be used to read sign language and read books aloud to people. In fact, deep neural networks can now describe to us in full sentences what they see [12]. Baidu's DuLight project is enabling visually-impaired people to see the world around them through a sight-to-speech earpiece.

Dulight--Eyes for visually impaired

We are not limited to vision-based HCI. Deep learning can help calibrate EEG interfaces for paraplegics to interact with computers more rapidly, or provide more accurate decoding tech for projects like Soli [7].


Gaming


Games are computationally challenging because they run physics simulation, AI logic, rendering, and multiplayer interaction together in real time. Many of these components have at least O(N^2) complexity, so our current algorithms have hit their Moore's ceiling.

Deep learning pushes the boundaries on what games are capable of in several ways.

Obviously, there's the "game AI" aspect. In current video games, AI logic for non-player characters (NPCs) is not much more than a bunch of if-then-else statements tweaked to imitate intelligent behavior. This is not clever enough for advanced gamers, and leads to somewhat unchallenging character interaction in single-player mode. Even in multiplayer, a human player is usually the smartest element in the game loop.

This changes with Deep Learning. Google Deepmind's AlphaGo has shown us that Deep Neural Networks, combined with policy gradient learning, are powerful enough to beat the strongest of human players at complex games like Go. The Deep Learning techniques that drive AlphaGo may soon enable NPCs that can exploit the player's weaknesses and provide a more engaging gaming experience. Game data from other players can be sent to the cloud for training the AI to learn from its own mistakes.

Another application of deep learning in games is physics simulation. Instead of simulating fluids and particles from first principles, perhaps we can turn the nonlinear dynamics problem into a regression problem. For instance, if we train a neural net to learn the physical rules that govern fluid dynamics, we can evaluate it quickly during gameplay without having to solve the Navier-Stokes equations at scale in real time.

In fact, this has been done already by Ladicky and Jeong 2015 [8].


For VR applications that must run at 90 FPS minimum, this may be the only viable approach given current hardware constraints.

Third, deep generative modeling techniques can be used to create unlimited, rich procedural content - fauna, character dialogue, animation, music, perhaps the narrative of the game itself. This is an area that is just starting to be explored by games like No Man's Sky, which could potentially make games with endless novel content.



To add a cherry on top, Deep Neural nets are well suited for parallel mini-batched evaluation, which means that AI logic for 128 NPCs or 32 water simulations might be evaluated simultaneously on a single graphics card.

Artistic Assistants


Given how well neural networks perceive images, audio, and text, it's no surprise that they also work when we use them to draw paintings [13], compose music [14], and write fiction [15].


People have been trying to get computers to compose music and paint pictures for ages, but deep learning is the first approach that actually generates "good results". There are already several apps in the App Store that implement these algorithms for giggles, but soon we may see them as assistive generators/filters in professional content creation software.

Data Mining from Unstructured Data


Deep learning isn't at the level where it can extract the same amount of information humans can from web pages, but the vision capabilities of deep neural nets are good enough for allowing machines to understand more than just hypertext.

For instance:
  • Parsing events from scanned flyers
  • identifying which products on EBay are the same
  • determining consumer sentiment from webcam
  • extracting blog content from pages without RSS feeds
  • integrating photo information into valuing financial instruments, insurance policies, and credit scores.

Voice synthesis


Generative modeling techniques have come far enough, and there is sufficient data out there, that it is only a matter of time before someone makes an app that reads aloud to you in Morgan Freeman's or Scarlett Johansson's voice. At Vanguard, my voice is my password.


Bonus: more products

  • Adaptive OS / Network stack scheduling - scheduling threads and processes in an OS is an NP-hard problem. We don't have a very satisfactory solution to this right now, and scheduling algorithms in modern operating systems, filesystems, and TCP/IP implementations are all fairly simple. Perhaps if a small neural net were used to adapt to a user's particular scheduling patterns (frame this as an RL problem), we would decrease the scheduling overhead incurred by the OS. This might make a lot of sense inside of data centers, where the savings can really scale.
  • Colony counting & cell tracking for microscopy software (for wet lab research)
  • The strategy of "replacing simulation with machine learning" has been useful in the field of drug design too, presenting enormous speed-ups in finding which compounds are helpful or toxic [Unterthiner 2015].

References

Monday, July 11, 2016

How to Get an Internship

Update: 9/3/2016 - Denis Tarasov of HSE in Moscow, Russia has kindly translated this article into Russian - read it here. I welcome translations into other languages!

About a year ago, I wrote a blog post about my various internship experiences. It ended up being quite popular with recruiters, and actually helped me to land my full-time job at Google.

I've also been getting emails from students seeking internship advice. Every time I get one of these, my ego approximately doubles in size. Thank you.


In this post, I'll share my strategy for landing tech internships. I've wanted to write this for some time now, but I've been hesitant to posture some kind of "magic recipe" when a lot of my own success was mostly due to luck.

I'm just another fresh graduate trying to figure things out, and here's what I believe in:

#1 Work on Side Projects


You don't need to be at Google to work on the kinds of problems that Google interns do, nor do you need to work at a hedge fund to learn about finance. Pursue those interests on your own!

Want to try animation? Here are some project ideas:

  • Make a 30 second short film in Autodesk Maya (free for students) or Blender 3D (free for everybody)
  • Do an 11 Second Club animation. 
  • Make something cool with Pixar's own Renderman software (free for non-commercial use). I'll bet less than 1% of the resumes that Pixar receives from students list experience with Renderman.
  • Draw something on ShaderToy.
  • Implement a physically-based rendering algorithm.

Want to be a software engineer?
  • Make an Android / iOS app from scratch (Android learning curve is easier). 
  • Learn how to use Amazon Web Services or Google Cloud Platform
  • Open source your work. A Managing Director at D. E. Shaw once told me that "Github is the new resume".
  • Check out Show HN to see what projects other folks are working on.

Finance:

  • Participate in a Kaggle competition. Get your first-hand experience with overfitting.
  • Do some financial market research on Quantopian. This is the kind of work that real quants do all day. 
  • Contribute to open source projects like Beaker and Satellite. Who knows, you might even impress someone inside the company.

Working on side projects accomplishes several objectives simultaneously:
  • It builds your brand (see #2).
  • It shows the hiring committee that you are willing to hone your craft on your own time, instead of merely trading your time for their money and status.
  • It's a low-risk way to find out if you're actually interested in the field.
  • In the process of building stuff, you might re-discover important theoretical and engineering challenges that professionals grapple with. In my sophomore year, I wrote a Bitcoin arbitrage bot in Python. Bitcoin exchanges list the price and volume of all open limit orders in the book, while actual financial markets do not. This results in a very fundamental difference in the way Market Impact is treated, and gave me something interesting to talk about during my Two Sigma interviews. What I learned was super elementary, but it was still more practical experience than most candidates had.

Don't worry about your projects being impressive or even novel - just focus on improving your skills and exercising your creativity. A little bit of experience using a company's products and technologies will give you a huge edge over other candidates.

Start as early as you can. The job application process doesn't begin during the fall recruiting season; it begins as soon as you want it to.

#2 Make Your Own Website


Here's a secret: the more you market yourself, the more recruiters will reach out to you. Building your own personal website will make you extremely visible.

Your website is basically a resume in long-form, but also functions as your personal brand. Here are some screenshots of other people's sites:



Your website should accomplish several things:
  • Make it easy for recruiters to come across your portfolio via Google Search.
  • Reveal your personality in ways that a 1-page resume cannot. In particular, it's a great opportunity to showcase aesthetic sense and visual creativity.
  • You should add an attractive profile picture of yourself. Putting a candid, smiling face will help people recognize you and put a face to your list of impressive accomplishments.
Platforms like Github Pages, Google App Engine, Wordpress, Weebly let you set up a website for free. Domain names are cheap - as little as $10 a year.

In addition to showcasing your coding projects, you should list a description of your work in a way that is accessible to people who can't read code. Better yet, write blog posts and tutorials for your projects - what you did and how you did it. Your site will get a lot more visibility if people find it useful.

The story you tell through your website - the first impression that you make - is of utmost importance. Do it right, and recruiters will come like ants to a picnic. 

#3 Study CS


If you're not sure what you want to do in the long term, choose skills and experiences that give you the most flexibility in the future. I recommend studying some kind of math + CS degree (if you're more interested in research roles) or an illustration + CS double major (if you're more interested in joining the entertainment industry).

I started my undergraduate education thinking I would study neuroscience, because "I could learn CS by myself." This was a big mistake:

  • My resume got passed over in resume screens because I listed "neuroscience" as my major. I eventually got through by begging a Google recruiter to give me a chance with the phone interview. Afterwards, I switched to Applied Math-CS.
  • Getting good at CS requires lots of practice. School is a good place to do it.
  • Neuroscience in the classroom has not caught up to neuroscience in the lab. Cutting edge research is pretty much optogenetics or computational (which is more CS + math + physics than neuroscience anyway).

More on the last point: I discovered that neuroscience students who knew how to program in MATLAB got to work directly on high-level research questions and interpret experimental data. Students who didn't ended up doing grunt work in the lab - dissecting tiny brains, pipetting liquids, and relying on others to code analysis routines for them.

Neuroscience is not the only field that is being disrupted by technology; we will be seeing more "software-defined research" in the coming years. For better or worse, the scientists, doctors, lawyers of the future will all be programmers.

Why is math important? Math gives you additional flexibility to break into hard-tech research roles, if you so desire. It's really hard to transition directly into an industry research team (such as Google Research or Microsoft Research) with only a CS undergrad degree.

Even though I was able to get more exposure to math at my Two Sigma internship, I was unsuccessful at getting a quant research internship because my background typecasts me into software engineering roles. It is also my own grievous fault for not being better at math.

If you want to work in film or games or even a product management role at a tech company, then studying math makes less sense; you should study illustration instead. I've noticed that at Pixar, many Technical Directors want to contribute more to story and art direction, but find themselves pigeonholed into specific roles (they have one "car guy", one "vegetation shading girl", and so on).

Being good at illustration will help you break into more creative roles like Art Director or Story Artist. It's also flexible - illustrators are needed everywhere, from design to comics to games. Illustration + CS is a potent skillset.

Candidly, math is safer, more flexible, and more lucrative than illustration. It is also future-proof in ways that other valuable degrees (such as design, law, and business) are not. That said, I find art incredibly valuable and continue practicing it as a hobby.

In any case, study CS. It will feed you and pay off your student debts and open so many doors. Don't be discouraged if you find CS difficult, or if your classmates seem to be way better at it than you. It wasn't until my third attempt to learn programming that things started to stick in my head.

Stick with CS, and the sky's the limit.


#4 Seek Diverse, Contrarian Experiences


Your coursework, extracurriculars, and internship experiences will have a big impact on your creative process. Diverse experiences enable you to approach problems differently than others, which will make you unique and harder to replace.

Pursue courses outside your major and let them inspire your projects. I don't mean this in the sense of "combining fields for the sake of mixing your interests together," like some contrived Egyptology-Physics senior thesis (just a hypothetical example, no offense to those who do this).

Instead, ideas from one field might lead to a real competitive advantage in another. For instance:

  • It's been said that Reed College's Calligraphy Class was a formative experience in Steve Jobs's design-minded vision for Apple products.

  • John Lasseter and Ed Catmull believed that 3D computer graphics was not just a fancy artistic medium, but the future of animation itself. They were right.

Pixar's The Adventures of AndrĂ© and Wally B.
  • Here is an elegant and beautiful explanation of a Math proof using interpretive dance. Sometimes difficult concepts become strikingly clear when the right diagram is drawn.

Here's a personal anecdote: I did several years of computational neuroscience research in college, which shaped the way I think about debugging complicated simulations in Machine Learning. Inspired by this, I pitched a project idea to a ML professor at my school. He thought it was a terrible idea. I went ahead and built it anyway, and it actually got me my current job. 

Diverse experiences help you to discover original or even contrarian ideas. Find something that only you believe to be true. If you're right, the upside is enormous. 


#5 Plan your next 10 years

Everybody's got dreams.

Some people dream of creating Strong AI, some want to make it to the Forbes 30 under 30 list, some want to be parents by the age of 32, some just want to make it to tomorrow.

It's really important, even as a college student applying for internships, to reflect on what you want and where you want to be in the long-term. Time is so precious; don't waste any time at a job that isn't growing the skills you want. It's okay to be unsure of what you want to do with your life, but at least write down a list of life/career trajectories that you think will make you happy.

Every so often, re-evaluate your long-term goals and whether the position you're in is taking you there or growing the skills that you want. Some questions to ask yourself:
  • How will I pay off my student debt?
  • Can I see myself doing pure software engineering (frontend, backend, mobile apps) for the remainder of my career? 
  • How long do I see myself working at my current employer?
  • Do I want to transition into more math-y roles like ML research or quantitative finance?
  • Do I want to transition into a product management or leadership role?
  • Do I want to start my own company someday? Am I okay exchanging coding and making stuff, for the privilege of running a company?
  • Do I want to become a Venture Capitalist someday?
  • If I plan to have kids by the time I'm 32 - where do I want to be? Who do I want to be with?
  • If I keep doing this, will I be happy in ten years? 

Finally, when making plans, don't take your physical, mental, or financial health for granted - have a backup plan in case your best laid plans go awry.


######################### PART 2 ##########################

95% of playing the internship game is what I've listed above. The remaining 5% is the actual interview process.

#6 Skip the Resume Screen


The first stage of most internship applications is a resume screen. The recruiter, who must sift through a huge stack of applications, glances at your resume for about six seconds, then either recycles it or sends you a follow up email.

SIX SECONDS! That's just enough time to do pattern matching for brand-name schools, tech company names, and what programming languages you know. The recruiter will also make a snap judgment just based on how neat and pretty your resume looks. Consequently, resume screens are pretty noisy when it comes to judging inexperienced college students.

Fortunately, there are a couple ways to skip the resume screen entirely:
  • If you get a referral from someone inside the company, recruiters will consider your application more carefully. If your resume is not horrible to look at, you'll almost certainly make it to the next stage. I was lucky enough to get referrals for Pixar and Two Sigma. However, these are stories for another day ;)
  • If you are an underrepresented minority (URM) in Technology, companies are bending over backwards to get you to pass their interviews. At conferences like Grace Hopper, you can actually get a free pass out of the resume screening and the phone screen, and do on-the-spot whiteboard interviews with companies like Apple, Facebook, Google, Pinterest, etc. This improves the odds of landing an internship dramatically. A classmate of mine actually got an internship offer from Apple, on the spot, with only her resume (no interview or anything).  Reach out to your computer science department and ask if they would sponsor your attendance.
  • Reach out to engineers directly through your school alumni network, and ask them to refer you. Don't be shy - it's very little work on their part and they will get a nice referral bonus if you succeed. The worst thing that could happen is that they ignore you, which doesn't cost you anything.

It goes without saying that your resume should be on point: everything perfectly aligned and legible with zero typos. Tailor each resume for the company that you are applying to.

Tech companies that visit college campuses will often hold resume review sessions for students (Yelp, Microsoft, Google do this). This is super useful, and you should use this resource even if it's with a company you don't want to work for. Not surprisingly, tech recruiters give better industry-specific advice than college career counselors. 

If at all possible, skip the resume screen. In fact, if you have an offer deadline coming up, companies will often fast-track you straight to the on-site interview. The resume screen and phone interviews are just qualifiers for the on-site, which pretty much solely determines whether you make it in or not. Don't go through the front door.


#7 Phone and On-Site Interviews


After the noisy resume screen, you are the master of your own fate. Typically there are one or two phone interviews followed by an on-site 5-hour interview. The phone interviews are like miniature versions of on-site interviews, where you write code on a Google Doc or Etherpad.

All that matters at this point is how well you solve the coding challenges. If you do solve the problems quickly and correctly, and your behavior doesn't set off any red flags, you'll probably get the job.

My experience is that the difficulty of the interview is roughly correlated with the firm's selectivity and salary. The hardest interviews I've had were with Google Deepmind, D. E. Shaw, Two Sigma, Quora, and Vatic Labs (startup interviews tend to be pretty rigorous because their hiring decisions are riskier).

Google and Facebook were about medium in difficulty. I didn't interview for Pixar's software engineering role, so that interview was all behavioral and very easy. I've heard that Jane Street interviews are the hardest technically (apparently very popular among MIT students).

Cracking the Coding Interview is the only book you'll ever need. The practice problems are about the right level of difficulty for all software engineering roles I've ever interviewed with, and the advice is superb.

Finance firms like D.E. Shaw and Jane Street like to ask more math-oriented questions. I recommend these three books (in decreasing order of difficulty):


Preparing for whiteboard interviews is like studying for the SATs - a complete waste of time, but important enough that you gotta do it. There are some startups that are trying to disrupt the broken interview system, but I am uncertain if they will ever be successful.

On the behavioral side: be humble, be confident, smile a lot, ask good questions. Wear smart casual. Here's a trick to smiling often: every few seconds, imagine that the interviewer just extended you a job offer.

"Congratulations, you got the job!"
"Congratulations, you got the job!"

#8 Be Old


It's WAY easier to get internships as a rising junior or senior in college.

Interning at Google/Facebook as a first-year is pretty rare, so don't beat yourself up if you don't get an internship right away. A lot of tech companies screen out first-years as a matter of policy.

Some finance firms only hire rising college seniors as interns because they're fiercely protective of their IP and don't want other firms poaching their interns next summer.

The school you go to matters, but if you take the time to build a personal brand and list of side projects, it matters less and less. The same goes for age.

#9 I got the internship. What do I do?


Congrats! Your internship is an opportunity, not an entitlement.

These companies are investing in your personal growth and learning, so you should work hard and learn as much as possible. You owe it to the company whose name pads your resume, you owe it to the people who vouched for you in the hiring process, and most of all, you owe it to the candidates who were just as qualified as you, but didn't get the job.

My internship offers were all very competitive so I didn't negotiate (I was also saving that social capital for full-time negotiation). You can try to negotiate your internship offers if you want, though.

#10 I didn't get an internship this summer. What do I do?


Great! You can spend the summer working on exactly what you want to work on. Most interns don't even get this luxury.

  • Create deadlines for yourself as if a manager assigned them to you. 
  • Have meetings with your imaginary manager where you discuss your progress. 
  • Show up to "work" on time.
  • Get some unemployed friends together and work in a team. Heck, not having a job lined up is the perfect opportunity to start your own company.
  • Write a blog post about it. Show your future employers what a fucking awesome employee you would be if you had the opportunity.

If money is an issue, there are still a few options. You can seek out a UTRA with your university, or take up a low-stress part-time job (summer RA, babysitting).

#11 Closing Thoughts


  • Build your own personal brand through side projects, website, writing.
  • Optimize your career decisions for learning and personal growth. 
  • Work really hard.


Best of luck, and thank you for reading.