Sunday, September 27, 2020

My Criteria for Reviewing Papers

Xiaoyi Yin (尹肖贻) has kindly translated this post into Chinese (中文)

Accept-or-reject decisions for the NeurIPS 2020 conference are out, with 9454 submissions and 1900 accepted papers (20% acceptance rate). Congratulations to everyone (regardless of acceptance decision) for their hard work in doing good research!

It's common knowledge among machine learning (ML) researchers that acceptance decisions at NeurIPS and other conferences are something of a weighted dice roll. In this silly theatre we call "Academic Publishing" -- a mostly disjoint concept from research, by the way -- reviews are all over the place because each reviewer favors different things in ML papers. Here are some criteria that a reviewer might care about:

Correctness: This is the bare minimum for a scientific paper. Are the claims made in the paper scientifically correct? Did the authors take care not to train on the test set? If an algorithm was proposed, do the authors convincingly show that it works for the reasons they stated? 

New Information: Your paper has to contribute new knowledge to the field. This can take the form of a new algorithm, or new experimental data, or even just a different way of explaining an existing concept. Even survey papers should contain some nuggets of new information, such as a holistic view unifying several independent works.

Proper Citations: A related work section that articulates connections to prior work and why your work is novel. Some reviewers will reject papers that don't cite prior work adequately, or that aren't sufficiently distinguished from it.

SOTA results: It's common to see reviewers demand that papers (1) propose a new algorithm and (2)  achieve state-of-the-art (SOTA) on a benchmark. 

More than "Just SOTA": No reviewer will penalize you for achieving SOTA, but some expect more than just beating the benchmark, such as one or more of the criteria in this list. Some reviewers go as far as to bash the "SOTA-chasing" culture of the field, which they deem to be "not very creative" and "incremental". 

Simplicity: Many researchers profess to favor "simple ideas". However, the difference between "your simple idea" and "your trivial extension to someone else's simple idea" is not always so obvious.

Complexity: Some reviewers deem papers that don't present any new methods or fancy math proofs as "trivial" or "not rigorous".

Clarity & Understanding: Some reviewers care about the mechanistic details of proposed algorithms and furthering understanding of ML, not just achieving better results. This is closely related to "Correctness".

Is it "Exciting"?: Julian Togelius (AC for NeurIPS '20) mentions that many papers he chaired were simply not very exciting. Only Julian can know what he deems "exciting", but I suppose he means having "good taste" in choosing research problems and solutions. 

Sufficiently Hard Problems: Some reviewers reject papers for evaluating on datasets that are too simple, like MNIST. "Sufficiently hard" is a moving goalpost, with the implicit expectation that as the field develops better methods, the benchmarks have to get harder to push unsolved capabilities. Also, SOTA methods on simple benchmarks are not always SOTA on harder benchmarks that are closer to real-world applications. Thankfully, my most cited paper was written at a time when it was still acceptable to publish on MNIST.

Is it Surprising? Even if a paper demonstrates successful results, a reviewer might claim that they are unsurprising or "obvious". For example, papers applying standard object recognition techniques to a novel dataset might be argued to be "too easy and straightforward" given that the field expects supervised object recognition to be mostly solved (this is not really true, but the benchmarks don't reflect that). 

I really enjoy papers that defy intuitions, and I personally strive to write surprising papers. 

Some of my favorite papers in this category do not achieve SOTA or propose any new algorithms at all:

  1. Approximating CNNs with Bag-of-local-Features models works surprisingly well on ImageNet
  2. Understanding Deep Learning Requires Rethinking Generalization.
  3. A Metric Learning Reality Check
  4. Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations 
  5. Adversarial Spheres

Is it Real? Closely related to "sufficiently hard problems". Some reviewers think that games are a good testbed to study RL, while others (typically from the classical robotics community) think that MuJoCo Ant and a real robotic quadruped are entirely different problems; algorithmic comparisons on the former tell us nothing about the same set of experiments on the latter.

Does Your Work Align with Good AI Ethics? Some view the development of ML technology as a means to build a better society, and discourage papers that don't align with their AI ethics. The required "Broader Impact" statements in NeurIPS submissions this year are an indication that the field is taking this much more seriously. For example, if you submit a paper that attempts to infer criminality from only facial features or perform autonomous weapon targeting, I think it's likely your paper will be rejected regardless of what methods you develop.

Different reviewers will prioritize different aspects of the above, and many of these criteria are highly subjective (e.g. problem taste, ethics, simplicity). For each of the criteria above, it's possible to come up with counterexamples of highly-cited or impactful ML papers that don't meet that criterion but possibly meet others.


My Criteria

I wanted to share my criteria for how I review papers. When it comes to recommending accept/reject, I mostly care about Correctness and New Information. Even if I think your paper is boring and unlikely to be an actively researched topic in 10 years, I will vote to accept it as long as your paper helped me learn something new that I didn't think was already stated elsewhere. 

Some more specific examples:

  • If you make a claim about humanlike exploration capabilities in RL in your introduction and then propose an algorithm to do something like that, I'd like to see substantial empirical justification that the algorithm is indeed similar to what humans do.
  • If your algorithm doesn't achieve SOTA, that's fine with me. But I would like to see a careful analysis of where your algorithm falls short and why.
  • When papers propose new algorithms, I prefer to see that the algorithm is better than prior work. However, I will still vote to accept if the paper presents a factually correct analysis of why it doesn't do better than prior work. 
  • If you claim that your new algorithm works better because of reason X, I would like to see experiments that show that it isn't because of alternate hypotheses X1, X2. 
Correctness is difficult to verify. Many metric learning papers were proposed in the last 5 years and accepted at prestigious conferences, only for Musgrave et al. '20 to point out that the experimental methodology across these papers was not consistent.

I should get off my high horse and say that I'm part of the circus too. I've reviewed papers for 10+ conferences and workshops, and I can honestly say that I fully understood only about 25% of papers from just reading them. An author puts tens or hundreds of hours into designing and crafting a research paper and its experimental methodology, while I put in only a few hours to decide whether it is "correct science". Rarely am I able to approach a paper with the level of mastery needed to rigorously evaluate correctness.

A good question to constantly ask yourself is: "what experiment would convince me that the author's explanations are correct and not due to some alternate hypothesis? Did the authors check that hypothesis?"

I believe that we should accept all "adequate" papers, and that more subjective things like "taste" and "simplicity" should be reserved for paper awards, spotlights, and oral presentations. I don't know if everyone should adopt these criteria, but I think it's helpful to at least be transparent as a reviewer about how I make accept/reject decisions. 

Opportunities for Non-Traditional Researchers

If you're interested in getting mentorship for learning how to read, critique, and write papers better, I'd like to plug my weekly office hours, which I hold on Saturday mornings over Google Meet. I've been mentoring about 6 people regularly over the last 3 months and it's working out pretty well. 

Anyone who is not in a traditional research background (not currently in an ML PhD program) can reach out to me to book an appointment. You can think of this like visiting your TA's office hours for help with your research work. Here are some of the services I can offer, completely pro bono:

  • If you have trouble understanding a paper I can try to read it with you and offer my thoughts on it as if I were reviewing it.
  • If you're very, very new to the field and don't even know where to begin, I can offer some starting exercises like reading/summarizing papers, reproducing existing papers, and so on.
  • I can try to help you develop a good taste of what kinds of problems to work on, how to de-risk ambitious ideas, and so on.
  • Advice on software engineering aspects of research. I've been coding for over 10 years; I've picked up some opinions on how to get things done quickly.
  • Asking questions about your work as if I was a visitor at your poster session.
  • Helping you craft a compelling story for a paper you want to write.
No experience is required; all you need to bring to the table is a desire to become better at doing research. The acceptance rate for my office hours is literally 100%, so don't be shy!

Sunday, September 13, 2020

Chaos and Randomness

For want of a nail the shoe was lost.

For want of a shoe the horse was lost.

For want of a horse the rider was lost.

For want of a rider the message was lost.

For want of a message the battle was lost.

For want of a battle the kingdom was lost.

And all for the want of a horseshoe nail.

For Want of a Nail


Was the kingdom lost due to random chance? Or was it the inevitable outcome resulting from sensitive dependence on initial conditions? Does the difference even matter? Here is a blog post about Chaos and Randomness with Julia code.


Preliminaries


Consider a real vector space $X$ and a function $f: X \to X$ on that space. If we repeatedly apply $f$ to a starting vector $x_1$, we get a sequence of vectors known as an orbit: $x_1, x_2, \ldots, f^n(x_1)$. 

For example, the logistic map is defined as 

# The logistic map: x_{n+1} = r * x_n * (1 - x_n)
function logistic_map(r, x)
    r * x * (1 - x)
end

Here is a plot of successive applications of the logistic map for r=3.5. We can see that, after a brief transient, the system settles into a periodic cycle (at r=3.5 the attractor is a period-4 cycle). 
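
Here's a minimal sketch of how such a plot can be generated (assuming the Plots.jl package; any plotting library would do), reusing the logistic_map function above:

using Plots

r = 3.5
n = 100
orbit = zeros(n)
orbit[1] = 0.1   # starting value x_1 (an arbitrary choice)
for i in 1:n-1
    orbit[i+1] = logistic_map(r, orbit[i])
end
plot(1:n, orbit, marker=:circle, xlabel="iteration", ylabel="x",
    title="Logistic map orbit, r=$r", legend=false)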


Definition of Chaos

There is surprisingly no universally accepted mathematical definition of Chaos. For now we will present a commonly used characterization by Devaney: 

We can describe an orbit $x_1, x_2, \ldots, f^n(x_1)$ as *chaotic* if:

  1. The orbit is not asymptotically periodic, meaning that it never starts repeating, nor does it approach an orbit that repeats (e.g. $a, b, c, a, b, c, a, b, c...$).
  2. The maximum Lyapunov exponent $\lambda$ is greater than 0. This means that if you place another trajectory starting near this orbit, the separation between the two will grow at a rate of roughly $e^\lambda$ per iteration. A positive $\lambda$ implies that two trajectories will diverge exponentially quickly away from each other. If $\lambda<0$, then the distance between trajectories would shrink exponentially quickly. This is the basic definition of "Sensitive Dependence on Initial Conditions (SDIC)", also colloquially understood as the "butterfly effect".

Note that (1) intuitively follows from (2), because the Lyapunov exponent of an orbit that approaches a periodic orbit would be $<0$, which contradicts the SDIC condition.
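
As a concrete (if crude) illustration, here is a sketch of estimating the maximum Lyapunov exponent of the logistic map by averaging $\log|f'(x_i)|$ along the orbit, where $f'(x) = r(1-2x)$. This is a toy estimator (fixed orbit length, no transient removal), not a robust implementation:

function lyapunov_estimate(r, x0; n=10_000)
    # average log|f'(x)| along the orbit, with f'(x) = r*(1 - 2x)
    x = x0
    acc = 0.0
    for _ in 1:n
        x = logistic_map(r, x)
        acc += log(abs(r * (1 - 2x)))
    end
    acc / n
end

lyapunov_estimate(3.5, 0.1)   # negative: the orbit settles onto a periodic attractor
lyapunov_estimate(3.9, 0.1)   # positive: chaotic regime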

We can also define the map $f$ itself to be chaotic if there exists an invariant (trajectories cannot leave) subset $\tilde{X} \subset X$, where the following three conditions hold:

  1. Sensitivity to Initial Conditions, as mentioned before.
  2. Topological mixing (iterates of any region of $\tilde{X}$ eventually overlap any other region of $\tilde{X}$).
  3. Dense periodic orbits (every point in $\tilde{X}$ is arbitrarily close to a periodic orbit). At first, this is a bit of a head-scratcher given that we previously defined an orbit to be chaotic if it *didn't* approach a periodic orbit. The way to reconcile this is to think about the subspace $\tilde{X}$ being densely covered by periodic orbits, but they are all unstable so the chaotic orbits get bounced around $\tilde{X}$ for all eternity, never settling into an attractor but also unable to escape $\tilde{X}$.
Note that SDIC actually follows from the latter two conditions. If periodic orbits cover the set $\tilde{X}$ densely while chaotic orbits also cover it densely without ever approaching them, then intuitively the only way for this to happen is if all of the periodic orbits are unstable, so nearby trajectories get pushed apart (SDIC).



These are by no means the only ways to define chaos. The DynamicalSystems.jl package has excellent documentation on several computationally tractable definitions of chaos.

Chaos in the Logistic Family


Incidentally, the logistic map exhibits chaos for most values of r between 3.56995 and 4.0. We can generate the bifurcation diagram quickly thanks to Julia's de-vectorized style of numerical programming (explicit loops are fast).

using Plots

rs = [2.8:0.01:3.3; 3.3:0.001:4.0]
x0s = 0.1:0.1:0.6
N = 2000 # orbit length
x = zeros(length(rs), length(x0s), N)
# for each value of r (across rows)
for k = 1:length(rs)
    # initialize starting conditions
    x[k, :, 1] = x0s
    for i = 1:length(x0s)
        for j = 1:N-1
            x[k, i, j+1] = logistic_map(rs[k], x[k, i, j])
        end
    end
end
plot(rs, x[:, :, end], markersize=2, seriestype = :scatter, title = "Bifurcation Diagram (Logistic Map)")

We can see how the six starting values (0.1 through 0.6) all converge to the same value, then oscillate between two values, then bifurcate repeatedly until chaos emerges as we increase r.

Spatial Precision Error + Chaos = Randomness

What happens to our understanding of the dynamics of a chaotic system when we can only know the orbit values with some finite precision? For instance, the true state might be x=0.76399 or x=0.7641, but we only observe x=0.764 in either case.

We can generate 1000 starting conditions that are identical up to our measurement precision, and observe the histogram of where the system ends up after n=1000 iterations of the logistic map.
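
Here is roughly what that experiment might look like in Julia (a sketch; the parameter r=3.9 and the precision ε=1e-3 are my own hypothetical choices, and I assume Plots.jl for the histogram):

using Plots

r, n = 3.9, 1000          # r chosen in the chaotic regime (hypothetical choice)
x0 = 0.764                # the observed (rounded) starting value
ε = 1e-3                  # measurement precision
# 1000 starting conditions that are indistinguishable at this precision
starts = x0 .+ ε .* (rand(1000) .- 0.5)
finals = map(starts) do x
    for _ in 1:n
        x = logistic_map(r, x)
    end
    x
end
histogram(finals, bins=50, xlabel="x_n", ylabel="count",
    title="p(x_n | x_0) at precision ε=$ε", legend=false)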


Let's pretend this is a probabilistic system and ask the question: what are the conditional distributions of $p(x_n|x_0)$, where $n=1000$, for different levels of measurement precision?

At less than $O(10^{-8})$ precision, we start to observe the entropy of the state evolution rapidly increasing. Even though we know that the underlying dynamics are deterministic, measurement uncertainty (a form of aleatoric uncertainty) can expand exponentially quickly due to SDIC. This results in $p(x_n|x_0)$ appearing to be a complicated probability distribution, even generating "long tails".
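
A crude way to quantify this is to sweep over precision levels and compute a histogram-based entropy estimate of the endpoint distribution. The sketch below does exactly that; the parameters (r=3.9, x0=0.764, the bin count, and the precision levels) are my own arbitrary choices, so the precision at which entropy starts to rise will depend on them:

function orbit_endpoint(r, x, n)
    for _ in 1:n
        x = logistic_map(r, x)
    end
    x
end

# crude plug-in entropy estimate from a histogram over [0, 1]
function histogram_entropy(samples; nbins=100)
    counts = zeros(nbins)
    for s in samples
        counts[clamp(ceil(Int, s * nbins), 1, nbins)] += 1
    end
    p = counts ./ length(samples)
    -sum(q * log(q) for q in p if q > 0)
end

r, n, x0 = 3.9, 1000, 0.764
for ε in (1e-12, 1e-10, 1e-8, 1e-6, 1e-4)
    starts = x0 .+ ε .* (rand(1000) .- 0.5)
    H = histogram_entropy(orbit_endpoint.(r, starts, n))
    println("precision $ε  =>  entropy ≈ $(round(H, digits=3))")
end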

I find it interesting that the "multi-modal, probabilistic" nature of $p(x_n|x_0)$ vanishes to a simple uni-modal distribution when the measurement precision is sufficiently high to mitigate chaotic effects for $n=1000$. In machine learning we concern ourselves with learning fairly rich probability distributions, even going as far as to learn transformations of simple distributions into more complicated ones. 

But what if we are being overzealous in using powerful function approximators to model $p(x_n|x_0)$? For cases like the above, we are discarding the inductive bias that $p(x_n|x_0)$ arises from a simple source of noise (uniform measurement error) coupled with a chaotic "noise amplifier". Classical chaos on top of measurement error will indeed produce indeterminism, but does that mean we can get away with treating $p(x_n|x_0)$ as purely random?

I suspect that the apparent complexity of many "rich" probability distributions we encounter in the wild is more often than not just chaos + measurement error (e.g. weather). If so, how can we leverage that knowledge to build more useful statistical learning algorithms and draw better inferences?

We already know that chaos and randomness are nearly equivalent from the perspective of computational distinguishability. Did you know that you can use chaos to send secret messages? This is done by having Alice and Bob synchronize a chaotic system $x$ with the same initial state $x_0$, and then Alice sends a message $0.001*signal + x$. Bob merely evolves the chaotic system $x$ on his own and subtracts it to recover the signal. Chaos has also been used to design pseudo-random number generators.
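
Here's a toy sketch of that scheme, following the description above. The logistic map with r=3.9 as the shared chaotic system and a sine wave as the message are my own arbitrary choices; real chaotic-communication schemes are more sophisticated than simply sharing an initial state:

r, x0 = 3.9, 0.2                      # shared chaotic system and initial state
signal = sin.(0.1 .* (1:200))         # the message Alice wants to send

# Alice: evolve the chaotic carrier and add the attenuated signal
carrier = zeros(length(signal))
carrier[1] = x0
for i in 1:length(signal)-1
    carrier[i+1] = logistic_map(r, carrier[i])
end
transmitted = 0.001 .* signal .+ carrier

# Bob: evolve the same chaotic system from the shared x0 and subtract it
bobs_carrier = zeros(length(signal))
bobs_carrier[1] = x0
for i in 1:length(signal)-1
    bobs_carrier[i+1] = logistic_map(r, bobs_carrier[i])
end
recovered = (transmitted .- bobs_carrier) ./ 0.001
maximum(abs.(recovered .- signal))    # ≈ 0, up to floating point error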