Sunday, April 1, 2018

Aesthetically Pleasing Learning Rates

By Eric Jang, Colin Raffel, Ishaan Gulrajani, Diogo Moitinho de Almeida

In this blog post, we abandoned all pretense of theoretical rigor and used pixel values from natural images as learning rate schedules.




The learning rate schedule is an important hyperparameter to choose when training neural nets. Set the learning rate too high, and the loss function may fail to converge. Set the learning rate too low, and the model may take a long time to train, or even worse, overfit.

The optimal learning rate is tied to how smooth the gradient of the loss function is: it scales inversely with the gradient's Lipschitz constant (the “smoothness constant”, in fancy words). The smoother the function, the larger the learning rate we are allowed to take without the optimization “blowing up”.
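To make that rule of thumb concrete, here is the standard descent lemma (a textbook fact we quote for context, not something derived in this post): if the gradient of the loss \(f\) is \(L\)-Lipschitz, then plugging a gradient step with learning rate \(\eta\) into the smoothness bound gives

\[
f\big(x - \eta \nabla f(x)\big) \;\le\; f(x) - \eta\Big(1 - \tfrac{L\eta}{2}\Big)\,\|\nabla f(x)\|^2 ,
\]

so any \(\eta \le 1/L\) is guaranteed to decrease the loss, and smoother losses (smaller \(L\)) tolerate larger steps.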

What is the right learning rate schedule? Exponential decay? Linear decay? Piecewise constant? Ramp up then down? Cyclic? Warm restarts? How big should the batch size be? How does this all relate to generalization (what ML researchers care about)?

Tragically, in the non-convex, messy world of deep neural networks, all theoretical convergence guarantees are off. Often those guarantees rely on a restrictive set of assumptions, and then the theory-practice gap is written off by showing that it also “empirically works well” at training neural nets.

Fortunately, the research community has spent thousands of GPU years establishing empirical best practices for the learning rate:



Given that theoretically-motivated learning rate scheduling is really hard, we ask, "why not use a learning rate schedule which is at least aesthetically pleasing?" Specifically, we scan across a nice-looking image one pixel at a time, and use the pixel intensities as our learning rate schedule.

We begin with a few observations:

  • Optimization of non-convex objectives seems to benefit from injecting “temporally correlated noise” into the parameters we are trying to optimize. Accelerated stochastic gradient descent methods also exploit temporally correlated gradient directions via momentum. Stretching this analogy a bit, we note that in reinforcement learning, auto-correlated noise seems to be beneficial for state-dependent exploration (1, 2, 3, 4). 
  • Several recent papers (5, 6, 7) suggest that waving learning rates up and down is good for deep learning.
  • Pixel values from natural images have both of the above properties. When reshaped into a 1-D signal, an image waves up and down in a random manner, sort of like Brownian motion. Natural images also tend to be lit from above, which lends itself to a decaying signal as the image gets darker toward the bottom.

We compared several learning rate schedules on MNIST and CIFAR-10 classification benchmarks, training each model for about 100K steps. Here's what we tried:
  • baseline: The default learning rate schedules provided by the GitHub repository.
  • fixed: 3e-4 with the Momentum optimizer.
  • andrej: 3e-4 with the Adam optimizer.
  • cyclic: Cyclic learning rates according to the following code snippet:
import tensorflow as tf

# Triangular cyclic learning rate: sweep linearly between base_lr and max_lr,
# completing one full up-and-down cycle every 2 * step_size steps.
base_lr = 1e-5
max_lr = 1e-2
step_size = 1000
step = tf.cast(global_step, tf.float32)  # global_step: the training step counter
cycle = tf.floor(1 + step / (2 * step_size))
x = tf.abs(step / step_size - 2 * cycle + 1)
learning_rate = base_lr + (max_lr - base_lr) * tf.maximum(0., 1. - x)


  • image-based learning rates using the following code:
import numpy as np
import tensorflow as tf
from PIL import Image

base_lr = 1e-5
max_lr = 1e-2
im = Image.open(path_to_file)  # path to the candidate image
# Total number of training steps; _NUM_IMAGES and FLAGS come from the training script.
num_steps = _NUM_IMAGES['train'] * FLAGS.train_epochs / FLAGS.batch_size
# Shrink the image so that width * height * 3 channels is roughly num_steps.
w, h = im.size
f = np.sqrt(w * h * 3 / num_steps)
im = im.resize((int(float(w) / f), int(float(h) / f)))
# Flatten into a 1-D signal of pixel intensities in [0, 1].
im = np.array(im).flatten().astype(np.float32) / 255
im_t = tf.constant(im)
# Index the signal by training step, clamping once we run out of pixels.
step = tf.minimum(global_step, im.size - 1)
pixel_value = im_t[step]
learning_rate = base_lr + (max_lr - base_lr) * pixel_value
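As a quick sanity check, here is a minimal standalone sketch of the same image-to-schedule mapping (our addition, not part of the training code above; it assumes only numpy, Pillow, and matplotlib, and "mona_lisa.jpg" / num_steps are placeholder values). It computes the schedule on the CPU and plots it, so you can eyeball a candidate image before spending GPU hours:

import numpy as np
from PIL import Image
import matplotlib.pyplot as plt

base_lr, max_lr = 1e-5, 1e-2
num_steps = 100000                  # placeholder: total training steps
im = Image.open("mona_lisa.jpg")    # placeholder: any candidate image

# Shrink the image so that width * height * 3 channels is roughly num_steps.
w, h = im.size
f = np.sqrt(w * h * 3 / num_steps)
im = im.resize((int(w / f), int(h / f)))

# Flatten row-major into a 1-D signal of pixel intensities in [0, 1].
pixels = np.asarray(im, dtype=np.float32).flatten() / 255.0

# Map intensities to learning rates and plot the resulting schedule.
schedule = base_lr + (max_lr - base_lr) * pixels
plt.plot(schedule)
plt.xlabel("training step")
plt.ylabel("learning rate")
plt.title("image-based learning rate schedule")
plt.show()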


Candidate Images


We chose some very aesthetically pleasing images for our experiments.

alexnet.jpg

bad_mnist.jpg (MNIST training image labeled as a 4)

get_out.jpg

hinton.jpg

mona_lisa.jpg

team_kakashi.jpg

puppy.jpg


Which one gives the best learning rate?


Results

Here are the top-1 accuracies on the CIFAR-10 validation set. All learning rate schedules use the Momentum optimizer (except andrej, which uses Adam).


The default learning rate schedules provided by the GitHub repo are quite strong, beating all of our alternative learning rate schedules.

The Mona Lisa and puppy images turn out to be pretty good schedules, even better than cyclic learning rates and Andrej Karpathy’s favorite 3e-4 with Adam. The "bad MNIST" digit appears to be a pretty dank learning rate schedule too, just edging out Geoff’s portrait (you’ll have to imagine the error bars on your own). All learning rates perform about equally well on MNIST.

The fixed learning rate of 3e-4 is quite bad (unless one uses the Adam optimizer). Our experiments suggest that pretty much any learning rate schedule can outperform a fixed one, so if you ever see or think about writing a paper with a constant learning rate, just use literally any schedule instead. Even a silly one. And then cite this blog post.

Future Work

  • Does there exist a “divine” natural image whose learning rate schedule results in low test error among a wide range of deep learning tasks? 
  • It would also be interesting to see if all images of puppies produce good learning rate schedules. We think it is very likely, since all puppers are good boys. 
  • Stay tuned for “Aesthetically Pleasing Parameter Noise for Reinforcement Learning” and “Aesthetically Pleasing Random Seeds for Neural Architecture Search”.

Acknowledgements:

We thank Geoff Hinton for providing the misclassified MNIST image, Vincent Vanhoucke for reviewing this post, and Crossroads Cafe for providing us with refreshments.








Friday, February 23, 2018

Teacup



06/23/2018: Xiaoyi Yin (尹肖贻) has translated this post into Chinese. Thanks Xiaoyi!


Once upon a time, there was a machine learning researcher who tried to teach a child what a "teacup" was.

"Hullo mister. What do you do?" inquires the child.

"Hi there, child! I'm a machine learning scientist. My life ambition is to create 'Artificial General Intelligence', which is a computer that can do everything a human --"

The child completely disregards this remark, as children often do, and asks a question that has been troubling him all day:

"Mister, what's a teacup? My teacher Ms. Johnson used that word today but I don't know it."

The scientist is appalled that a fellow British citizen does not know what a teacup is, so he pulls out his phone and shows the child a few pictures:



"Oh..." says the child. "A teacup is anything that's got flowers on it, right? Like this?"



The child is alarmingly proficient at using a smartphone.

"No, that's not a teacup," says the scientist. "Here are some more teacups, this time without the flowers."



The child's face crinkles up with thought, then un-crinkles almost immediately - he's found a new pattern.

"Ok, a teacup is anything where there is an ear-shaped hole facing to the right - after all, there is something like that in every one of the images!"

He pulls up a new image to display what he thinks a teacup is, giggling because he thinks ears are funny.


"No, that's an ear. A teacup and ear are mutually exclusive concepts. Let's do some data augmentation. These are all teacups too!"


The scientist rambles on,

"Now I am going to show you some things that are not teacups! This should force your discriminatory boundary to ignore features that teacups and other junk have in common ... does this help?"



"Okay, I think I get it now. A teacup is anything with an holder thing, and is also empty. So these are not teacups:"



"Not quite, the first two are teacups too. And teacups are actually supposed to contain tea."

The child is now confused.

"but what happens if a teacup doesn't have tea but has fizzy drink in it? What if ... what if ... you cut a teacup in halfsies, so it can't hold tea anymore?" His eyes go wide as saucers as he says this, as if cutting teacups is the most scandalous thing he has ever heard of.


"Err... hopefully most of your training data doesn't have teacups like that. Or chowder bowls with one handle, for that matter."

The scientist also mutters something about "stochastic gradient descent being Bayesian" but fortunately the kid doesn't hear him say this.

The child thinks long and hard, iterating over the images again and again.

"I got it! There is no pattern, a teacup is merely any one of the following pictures:"



"Well... if you knew nothing about the world I could see how you arrived at that conclusion... but what if I said that you had some prior about how object classes ought to vary across form and rendering style and --"

"But Mister, what's a prior?"

"A prior is whatever you know beforehand about the distribution over random teacups ... err... never mind. Can you find an explanation for teacups that doesn't require memorizing 14 images? The smaller the explanation, the better."

"But how should I know how small the explanation of teacup ought to be?", asks the child.

"Oh," says the scientist. He slinks away, defeated.

Tuesday, January 23, 2018

Doing a Concurrent Masters at Brown

This is intended as a reference for students who are interested in the Concurrent Bachelor's/Masters program. If you are not a current or prospective undergraduate student at Brown University, the following post won't be relevant to you. 

A few Brown University students have been emailing me about the Concurrent Bachelor's/Masters (CM) degree and whether it would make sense for them to apply for this program. Brown doesn't offer a whole lot of information or resources on this topic (very few students do this), so I'd like to share my perspective as someone who went through the process (I graduated in May 2016 with a ScB in APMA-CS and a MSc in CS). This is not official advice - rules for the CM program may have changed since I graduated.

Background


Universities like UC Berkeley allow undergraduates to graduate early, provided that they have satisfied all their degree requirements. Some students who complete their undergrad in 3 years (6 semesters) use the leftover year to do their "5th-year" Masters degree at the school, thus getting a Bachelors and Masters degree in 4 years.

Brown has 5th-year Masters programs too (the CS dept has a popular one), but undergraduates generally cannot graduate early (graduating in 7 semesters is possible but rarely granted).

The Concurrent Masters degree does, however, permit one to graduate with a Masters in 8 or 9 semesters.

Strangely, CM doesn't seem to be advertised much at Brown - there weren't any guides or resources or other students to talk to for planning my schedule around CM (guidance counselors and Meiklejohns don't really encourage or know about unorthodox paths like these).

How to plan for CM


The CM application requirements can be found here.
  1. During your First-Year (or beginning of Sophomore year at the latest), draw up your course plan for all 8 semesters to meet CM requirements. It will probably be re-arranged a lot (especially upper-level classes) each semester, but every set of courses you pick should keep you on track to meeting CM requirements.
  2. At the beginning of the Spring semester of Junior Year, bring your partially-completed CM application to your department chair, and show them how you are on track to fulfilling the requirements. Have them examine your application to see that your courses do indeed qualify and you are in good academic standing (i.e. you will also fulfill your intended undergrad degree requirements by graduation).
  3. Get recommendation letters from professors and the dept chair. You will need a lot of them - 3 within concentration, 2 outside concentration.
  4. Bring your packet (with rec letters) to the Dean of the College, who is in charge of CM review process.
  5. The applications are reviewed by the academic standing committee by April of your Junior year. You need to meet the course requirements, have the approval of your dept. chair, have good letters of reference, and say something fairly reasonable in your letter to the committee. From that point, approval is somewhat automatic.
  6. The CM course schedule is approved Junior year, but is contingent on classes that may not actually be offered your senior year. You will probably submit amendment forms to the application during your Senior year. They should be approved as long as they are reasonable substitutions.
  7. I recommend finishing your capstone requirements and 2nd writing requirement during your junior year. This removes a lot of constraints from the schedule optimization problem. 

What courses did you take?


I had a pretty unorthodox curriculum at Brown and basically stretched the "open curriculum" interpretation as far as I could to (barely) satisfy my degree and concurrent masters requirements. I didn't take many intro-level CS courses and substituted those requirements with upper-div math and CS courses. Towards the end, the CS department chair got pretty annoyed with all the substitutions I was making; bless twd@ for being so patient with me. Here's my 4-year course schedule:


Fall 2012
  • CLPS005A - Seminar: Computing as done in Brains and Computers
  • APMA0350 - Methods of Applied Math I
  • ENGN1930N - Intro to MRI and Neuroimaging
  • CLPS1492 - Computational Cognitive Neuroscience
  • CSCI1450 - Intro to Probability and Computing

Spring 2013
  • LITR1010A - Advanced Fiction
  • NEUR1680 - Computational Neuroscience
  • CSCI1280 - Intermediate Animation
  • NEUR2160 - Neurochemistry and Behavior
  • APMA1720 - Monte Carlo Simulation with Applications to Finance

Fall 2013
  • CSCI1230 - Introduction to Computer Graphics
  • MATH1530 - Abstract Algebra
  • NEUR1970 - Independent Study
  • ENGN1630 - Digital Electronics Systems Design

Spring 2014
  • CSCI1480 - Building Intelligent Robots
  • CSCI2240 - Interactive Computer Graphics
  • APMA1740 - Recent Applications of Probability and Statistics
  • NEUR1970 - Individual Independent Study

Fall 2014
  • CSCI2420 - Probabilistic Graphical Models
  • CSCI1680 - Computer Networks
  • CSCI2951B - Data-Driven Vision and Graphics
  • PHYS1410 - Quantum Mechanics A
  • CSCI0081 - TA Apprenticeship: Full Credit

Spring 2015
  • APMA2821V - Neural Dynamics: Theory and Modeling
  • APMA1360 - Topics in Chaotic Dynamics
  • CSCI1970 - Individual Independent Study
  • ECON1720 - Corporate Finance
  • ILLUS-2028 - Painting II (RISD)

Fall 2015
  • CSCI0510 - Models of Computation
  • MUSC1100 - Introduction to Composition
  • TAPS0220 - Persuasive Communication
  • APMA1170 - Introduction to Computational Linear Algebra
  • CSCI2980-HYS - Reading and Research (Masters Project)

Spring 2016
  • CSCI2980-JFH - Reading and Research (Masters Project)
  • POLS1740 - The Politics of Food
  • POLS1824f - Meritocracy
  • CSCI1670 - Operating Systems
  • CSCI1951G - Optimization Methods in Finance

ScB requirements:

Core
  • Mathematics: MATH1530 instead of MATH0350; MATH0540 waived (AP test)
  • Applied Mathematics: APMA0350; APMA1360 in lieu of APMA0360; APMA1170
  • Core Computer Science: (CSCI2980-HCI, CSCI1670) in lieu of (CSCI15, CSCI16); CSCI1450 (math); CSCI1680 in lieu of CS33 (systems); CSCI0510 (math) (Fall 2015)

Additional Requirements
  • 3 1000-level CS courses: CSCI1480; ENGN1630; CSCI1970 (approved pair waived via TA credit)
  • 3 1000-level APMA courses: pair of APMA1720 + APMA1740; APMA2821V
  • Capstone course: CSCI2980-HUG


And here's how I filled out the CM requirements. Note that degree requirements are subject to change and the courses I filled out may not be valid for current Brown students.

Is it Worth It?

Pros:

  • Some entry-level roles in quantitative finance and Machine Learning strongly prefer candidates with at least a Masters degree.
  • Saves tuition compared to doing a 5th-year Masters.
  • In the Bay Area (California), having a Masters degree can help you negotiate a better interest rate on a mortgage. 

Cons:

  • Way more work compared to doing a 5th-year Masters, mostly because of the 10-course breadth requirements.
  • Being spread pretty thinly across many classes makes retaining information harder. You need to take an average of 4+ classes every semester, and the 10-course breadth requirements have to be completed before you submit your application.
  • Maintaining a social life with this course load is tricky.
I do not recommend doing CM just for the sake of getting a Masters degree - a Masters degree isn't that helpful in the big picture of things, and you should only do it if it would require minor changes to the course plan you are already pursuing or if it is vital to your career.

Three other students in the CS department (two CS-Math concentrators and one other CS-APMA concentrator) did CM in the class of 2016. We all enjoyed taking hard CS/Math classes and would have probably taken the schedules we had anyway.