Sunday, April 1, 2018

Aesthetically Pleasing Learning Rates

By Eric Jang, Colin Raffel, Ishaan Gulrajani, Diogo Moitinho de Almeida

In this blog post, we abandoned all pretense of theoretical rigor and used pixel values from natural images as learning rate schedules.




The learning rate schedule is an important hyperparameter to choose when training neural nets. Set the learning rate too high, and the loss function may fail to converge. Set the learning rate too low, and the model may take a long time to train, or even worse, overfit.

The optimal learning rate is tied to how smooth the gradient of the loss function is: it scales inversely with the gradient's Lipschitz constant (the “smoothness constant”, in fancy words). The smoother the function, the larger the learning rate we are allowed to take without the optimization “blowing up”.
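To make that rule of thumb concrete, here is the standard descent lemma (a textbook fact we quote for context, not something derived in this post): if the gradient of the loss \(f\) is \(L\)-Lipschitz, then plugging a gradient step with learning rate \(\eta\) into the smoothness bound gives

\[
f\big(x - \eta \nabla f(x)\big) \;\le\; f(x) - \eta\Big(1 - \tfrac{L\eta}{2}\Big)\,\|\nabla f(x)\|^2 ,
\]

so any \(\eta \le 1/L\) is guaranteed to decrease the loss, and smoother losses (smaller \(L\)) tolerate larger steps.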

What is the right learning rate schedule? Exponential decay? Linear decay? Piecewise constant? Ramp up then down? Cyclic? Warm restarts? How big should the batch size be? How does this all relate to generalization (what ML researchers care about)?

Tragically, in the non-convex, messy world of deep neural networks, all theoretical convergence guarantees are off. Often those guarantees rely on a restrictive set of assumptions, and then the theory-practice gap is written off by showing that it also “empirically works well” at training neural nets.

Fortunately, the research community has spent thousands of GPU years establishing empirical best practices for the learning rate:



Given that theoretically-motivated learning rate scheduling is really hard, we ask, "why not use a learning rate schedule which is at least aesthetically pleasing?" Specifically, we scan across a nice-looking image one pixel at a time, and use the pixel intensities as our learning rate schedule.

We begin with a few observations:

  • Optimization of non-convex objectives seems to benefit from injecting “temporally correlated noise” into the parameters we are trying to optimize. Accelerated stochastic gradient descent methods also exploit temporally correlated gradient directions via momentum. Stretching this analogy a bit, we note that in reinforcement learning, auto-correlated noise seems to be beneficial for state-dependent exploration (1, 2, 3, 4). 
  • Several recent papers (5, 6, 7) suggest that waving learning rates up and down is good for deep learning.
  • Pixel values from natural images have both of the above properties. When reshaped into a 1-D signal, an image waves up and down in a random manner, sort of like Brownian motion. Natural images also tend to be lit from above, which lends itself to a decaying signal as the image gets darker toward the bottom.

We compared several learning rate schedules on MNIST and CIFAR-10 classification benchmarks, training each model for about 100K steps. Here's what we tried:
  • baseline: The default learning rate schedules provided by the GitHub repository.
  • fixed: 3e-4 with the Momentum optimizer.
  • andrej: 3e-4 with the Adam optimizer.
  • cyclic: Cyclic learning rates according to the following code snippet:
import tensorflow as tf

# Triangular cyclic learning rate: sweep linearly between base_lr and max_lr,
# completing one full up-and-down cycle every 2 * step_size steps.
base_lr = 1e-5
max_lr = 1e-2
step_size = 1000
step = tf.cast(global_step, tf.float32)  # global_step: the training step counter
cycle = tf.floor(1 + step / (2 * step_size))
x = tf.abs(step / step_size - 2 * cycle + 1)
learning_rate = base_lr + (max_lr - base_lr) * tf.maximum(0., 1. - x)


  • image-based learning rates using the following code:
import numpy as np
import tensorflow as tf
from PIL import Image

base_lr = 1e-5
max_lr = 1e-2
im = Image.open(path_to_file)  # path to the candidate image
# Total number of training steps; _NUM_IMAGES and FLAGS come from the training script.
num_steps = _NUM_IMAGES['train'] * FLAGS.train_epochs / FLAGS.batch_size
# Shrink the image so that width * height * 3 channels is roughly num_steps.
w, h = im.size
f = np.sqrt(w * h * 3 / num_steps)
im = im.resize((int(float(w) / f), int(float(h) / f)))
# Flatten into a 1-D signal of pixel intensities in [0, 1].
im = np.array(im).flatten().astype(np.float32) / 255
im_t = tf.constant(im)
# Index the signal by training step, clamping once we run out of pixels.
step = tf.minimum(global_step, im.size - 1)
pixel_value = im_t[step]
learning_rate = base_lr + (max_lr - base_lr) * pixel_value
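As a quick sanity check, here is a minimal standalone sketch of the same image-to-schedule mapping (our addition, not part of the training code above; it assumes only numpy, Pillow, and matplotlib, and "mona_lisa.jpg" / num_steps are placeholder values). It computes the schedule on the CPU and plots it, so you can eyeball a candidate image before spending GPU hours:

import numpy as np
from PIL import Image
import matplotlib.pyplot as plt

base_lr, max_lr = 1e-5, 1e-2
num_steps = 100000                  # placeholder: total training steps
im = Image.open("mona_lisa.jpg")    # placeholder: any candidate image

# Shrink the image so that width * height * 3 channels is roughly num_steps.
w, h = im.size
f = np.sqrt(w * h * 3 / num_steps)
im = im.resize((int(w / f), int(h / f)))

# Flatten row-major into a 1-D signal of pixel intensities in [0, 1].
pixels = np.asarray(im, dtype=np.float32).flatten() / 255.0

# Map intensities to learning rates and plot the resulting schedule.
schedule = base_lr + (max_lr - base_lr) * pixels
plt.plot(schedule)
plt.xlabel("training step")
plt.ylabel("learning rate")
plt.title("image-based learning rate schedule")
plt.show()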


Candidate Images


We chose some very aesthetically pleasing images for our experiments.

alexnet.jpg

bad_mnist.jpg (MNIST training image labeled as a 4)

get_out.jpg

hinton.jpg

mona_lisa.jpg

team_kakashi.jpg

puppy.jpg


Which one gives the best learning rate?


Results

Here are the top-1 accuracies on the CIFAR-10 validation set. All learning rate schedules use the Momentum optimizer (except andrej, which uses Adam).


The default learning rate schedules provided by the GitHub repo are quite strong, beating all of our alternative learning rate schedules.

The Mona Lisa and puppy images turn out to be pretty good schedules, even better than cyclic learning rates and Andrej Karpathy’s favorite 3e-4 with Adam. The "bad MNIST" digit appears to be a pretty dank learning rate schedule too, just edging out Geoff’s portrait (you’ll have to imagine the error bars on your own). All learning rates perform about equally well on MNIST.

The fixed learning rate of 3e-4 is quite bad (unless one uses the Adam optimizer). Our experiments suggest that pretty much any learning rate schedule can outperform a fixed one, so if you ever see or think about writing a paper with a constant learning rate, just use literally any schedule instead. Even a silly one. And then cite this blog post.

Future Work

  • Does there exist a “divine” natural image whose learning rate schedule results in low test error among a wide range of deep learning tasks? 
  • It would also be interesting to see if all images of puppies produce good learning rate schedules. We think it is very likely, since all puppers are good boys. 
  • Stay tuned for “Aesthetically Pleasing Parameter Noise for Reinforcement Learning” and “Aesthetically Pleasing Random Seeds for Neural Architecture Search”.

Acknowledgements:

We thank Geoff Hinton for providing the misclassified MNIST image, Vincent Vanhoucke for reviewing this post, and Crossroads Cafe for providing us with refreshments.








Friday, February 23, 2018

Teacup



06/23/2018: Xiaoyi Yin (尹肖贻) has translated this post into Chinese. Thanks Xiaoyi!


Once upon a time, there was a machine learning researcher who tried to teach a child what a "teacup" was.

"Hullo mister. What do you do?" inquires the child.

"Hi there, child! I'm a machine learning scientist. My life ambition is to create 'Artificial General Intelligence', which is a computer that can do everything a human --"

The child completely disregards this remark, as children often do, and asks a question that has been troubling him all day:

"Mister, what's a teacup? My teacher Ms. Johnson used that word today but I don't know it."

The scientist is appalled that a fellow British citizen does not know what a teacup is, so he pulls out his phone and shows the child a few pictures:



"Oh..." says the child. "A teacup is anything that's got flowers on it, right? Like this?"



The child is alarmingly proficient at using a smartphone.

"No, that's not a teacup," says the scientist. "Here are some more teacups, this time without the flowers."



The child's face crinkles up with thought, then un-crinkles almost immediately - he's found a new pattern.

"Ok, a teacup is anything where there is an ear-shaped hole facing to the right - after all, there is something like that in every one of the images!"

He pulls up a new image to display what he thinks a teacup is, giggling because he thinks ears are funny.


"No, that's an ear. A teacup and ear are mutually exclusive concepts. Let's do some data augmentation. These are all teacups too!"


The scientist rambles on,

"Now I am going to show you some things that are not teacups! This should force your discriminatory boundary to ignore features that teacups and other junk have in common ... does this help?"



"Okay, I think I get it now. A teacup is anything with an holder thing, and is also empty. So these are not teacups:"



"Not quite, the first two are teacups too. And teacups are actually supposed to contain tea."

The child is now confused.

"but what happens if a teacup doesn't have tea but has fizzy drink in it? What if ... what if ... you cut a teacup in halfsies, so it can't hold tea anymore?" His eyes go wide as saucers as he says this, as if cutting teacups is the most scandalous thing he has ever heard of.


"Err... hopefully most of your training data doesn't have teacups like that. Or chowder bowls with one handle, for that matter."

The scientist also mutters something about "stochastic gradient descent being Bayesian" but fortunately the kid doesn't hear him say this.

The child thinks long and hard, iterating over the images again and again.

"I got it! There is no pattern, a teacup is merely any one of the following pictures:"



"Well... if you knew nothing about the world I could see how you arrived at that conclusion... but what if I said that you had some prior about how object classes ought to vary across form and rendering style and --"

"But Mister, what's a prior?"

"A prior is whatever you know beforehand about the distribution over random teacups ... err... never mind. Can you find an explanation for teacups that doesn't require memorizing 14 images? The smaller the explanation, the better."

"But how should I know how small the explanation of teacup ought to be?", asks the child.

"Oh," says the scientist. He slinks away, defeated.

Tuesday, January 23, 2018

Doing a Concurrent Masters at Brown

This is intended as a reference for students who are interested in the Concurrent Bachelor's/Masters program. If you are not a current or prospective undergraduate student at Brown University, the following post won't be relevant to you. 

A few Brown University students have been emailing me about the Concurrent Bachelor's/Masters (CM) degree and whether it would make sense for them to apply for this program. Brown doesn't offer a whole lot of information or resources on this topic (very few students do this), so I'd like to share my perspective as someone who went through the process (I graduated in May 2016 with a ScB in APMA-CS and a MSc in CS). This is not official advice - rules for the CM program may have changed since I graduated.

Background


Universities like UC Berkeley allow undergraduates to graduate early, provided that they have satisfied all their degree requirements. Some students who complete their undergrad in 3 years (6 semesters) use the leftover year to do their "5th-year" Masters degree at the school, thus getting a Bachelors and Masters degree in 4 years.

Brown has 5th-year Masters programs too (the CS dept has a popular one), but undergraduates generally cannot graduate early (graduating in 7 semesters is possible but rarely granted).

The Concurrent Masters degree does, however, permit one to graduate with a Masters in 8 or 9 semesters.

Strangely, CM doesn't seem to be advertised much at Brown - there weren't any guides or resources or other students to talk to for planning my schedule around CM (guidance counselors and Meiklejohns don't really encourage or know about unorthodox paths like these).

How to plan for CM


The CM application requirements can be found here.
  1. During your First-Year (or beginning of Sophomore year at the latest), draw up your course plan for all 8 semesters to meet CM requirements. It will probably be re-arranged a lot (especially upper-level classes) each semester, but every set of courses you pick should keep you on track to meeting CM requirements.
  2. At the beginning of the Spring semester of Junior Year, bring your partially-completed CM application to your department chair, and show them how you are on track to fulfilling the requirements. Have them examine your application to see that your courses do indeed qualify and you are in good academic standing (i.e. you will also fulfill your intended undergrad degree requirements by graduation).
  3. Get recommendation letters from professors and the dept chair. You will need a lot of them - 3 within concentration, 2 outside concentration.
  4. Bring your packet (with rec letters) to the Dean of the College, who is in charge of CM review process.
  5. The applications are reviewed by the academic standing committee by April of your Junior year. You need to meet the course requirements, have the approval of your dept. chair, have good letters of reference, and say something fairly reasonable in your letter to the committee. From that point, approval is somewhat automatic.
  6. The CM course schedule is approved Junior year, but is contingent on classes that may not actually be offered your senior year. You will probably submit amendment forms to the application during your Senior year. They should be approved as long as they are reasonable substitutions.
  7. I recommend finishing your capstone requirements and 2nd writing requirement during your junior year. This removes a lot of constraints from the schedule optimization problem. 

What courses did you take?


I had a pretty unorthodox curriculum at Brown and basically stretched the "open curriculum" interpretation as far as I could to (barely) satisfy my degree and concurrent masters requirements. I didn't take many intro-level CS courses and substituted those requirements with upper-div math and CS courses. Towards the end, the CS department chair got pretty annoyed with all the substitutions I was making; bless twd@ for being so patient with me. Here's my 4-year course schedule:


Fall 2012
  • CLPS005A - Seminar: Computing as done in Brains and Computers
  • APMA0350 - Methods of Applied Math I
  • ENGN1930N - Intro to MRI and Neuroimaging
  • CLPS1492 - Computational Cognitive Neuroscience
  • CSCI1450 - Intro to Probability and Computing

Spring 2013
  • LITR1010A - Advanced Fiction
  • NEUR1680 - Computational Neuroscience
  • CSCI1280 - Intermediate Animation
  • NEUR2160 - Neurochemistry and Behavior
  • APMA1720 - Monte Carlo Simulation with Applications to Finance

Fall 2013
  • CSCI1230 - Introduction to Computer Graphics
  • MATH1530 - Abstract Algebra
  • NEUR1970 - Independent Study
  • ENGN1630 - Digital Electronics Systems Design

Spring 2014
  • CSCI1480 - Building Intelligent Robots
  • CSCI2240 - Interactive Computer Graphics
  • APMA1740 - Recent Applications of Probability and Statistics
  • NEUR1970 - Individual Independent Study

Fall 2014
  • CSCI2420 - Probabilistic Graphical Models
  • CSCI1680 - Computer Networks
  • CSCI2951B - Data-Driven Vision and Graphics
  • PHYS1410 - Quantum Mechanics A
  • CSCI0081 - TA Apprenticeship: Full Credit

Spring 2015
  • APMA2821V - Neural Dynamics: Theory and Modeling
  • APMA1360 - Topics in Chaotic Dynamics
  • CSCI1970 - Individual Independent Study
  • ECON1720 - Corporate Finance
  • ILLUS-2028 - Painting II (RISD)

Fall 2015
  • CSCI0510 - Models of Computation
  • MUSC1100 - Introduction to Composition
  • TAPS0220 - Persuasive Communication
  • APMA1170 - Introduction to Computational Linear Algebra
  • CSCI2980-HYS - Reading and Research (Masters Project)

Spring 2016
  • CSCI2980-JFH - Reading and Research (Masters Project)
  • POLS1740 - The Politics of Food
  • POLS1824f - Meritocracy
  • CSCI1670 - Operating Systems
  • CSCI1951G - Optimization Methods in Finance

ScB requirements:

Core
  • Mathematics: MATH1530 instead of MATH0350; MATH0540 waived (AP test)
  • Applied Mathematics: APMA0350; APMA1360 in lieu of APMA0360; APMA1170
  • Core Computer Science: (CSCI2980-HCI, CSCI1670) in lieu of (CSCI15, CSCI16); CSCI1450 (math); CSCI1680 in lieu of CS33 (systems); CSCI0510 (math) (Fall 2015)

Additional Requirements
  • 3 1000-level CS courses: CSCI1480; ENGN1630; CSCI1970 (approved pair waived via TA credit)
  • 3 1000-level APMA courses: pair of APMA1720 + APMA1740; APMA2821V
  • Capstone course: CSCI2980-HUG


And here's how I filled out the CM requirements. Note that degree requirements are subject to change and the courses I filled out may not be valid for current Brown students.

Is it Worth It?

Pros:

  • Some entry-level roles in quantitative finance and Machine Learning strongly prefer candidates with at least a Masters degree.
  • Saves tuition compared to doing a 5th-year Masters.
  • In the Bay Area (California), having a Masters degree can help you negotiate a better interest rate on a mortgage. 

Cons:

  • Way more work compared to doing a 5th-year Masters, mostly because of the 10-course breadth requirements.
  • Being spread pretty thinly across many classes makes retaining information harder. You need to take an average of 4+ classes every semester, and the 10-course breadth requirements have to be completed before you submit your application.
  • Maintaining a social life with this course load is tricky.
I do not recommend doing CM just for the sake of getting a Masters degree - a Masters degree isn't that helpful in the big picture of things, and you should only do it if it would require minor changes to the course plan you are already pursuing or if it is vital to your career.

Three other students in the CS department (two CS-Math concentrators and one other CS-APMA concentrator) did CM in the class of 2016. We all enjoyed taking hard CS/Math classes and would have probably taken the schedules we had anyway.