## Sunday, September 27, 2020

### My Criteria for Reviewing Papers

Accept-or-reject decisions for the NeurIPS 2020 conference are out, with 9454 submissions and 1900 accepted papers (20% acceptance rate). Congratulations to everyone (regardless of acceptance decision) for their hard work in doing good research!

It's common knowledge among machine learning (ML) researchers that acceptance decisions at NeurIPS and other conferences are something of a weighted dice roll. In this silly theatre we call "Academic Publishing"  -- a mostly disjoint concept from research by the way --,  reviews are all over the place because each reviewer favors different things in ML papers. Here are some criteria that a reviewer might care about:

Correctness: This is the bare minimum for a scientific paper. Are the claims made in the paper scientifically correct? Did the authors take care not to train on the test set? If an algorithm was proposed, do the authors convincingly show that it works for the reasons they stated?

New Information: Your paper has to contribute new knowledge to the field. This can take the form of a new algorithm, or new experimental data, or even just a different way of explaining an existing concept. Even survey papers should contain some nuggets of new information, such as a holistic view unifying several independent works.

Proper Citations: a related work section that articulates connections to prior work and why your work is novel. Some reviewers will reject papers that don't tithe prior work adequately, or isn't sufficiently distinguished from it.

SOTA results: It's common to see reviewers demand that papers (1) propose a new algorithm and (2)  achieve state-of-the-art (SOTA) on a benchmark.

More than "Just SOTA": No reviewer will penalize you for achieving SOTA, but some expect more than just beating the benchmark, such as one or more of the criteria in this list. Some reviewers go as far as to bash the "SOTA-chasing" culture of the field, which they deem to be "not very creative" and "incremental".

Simplicity: Many researchers profess to favor "simple ideas". However, the difference between "your simple idea" and "your trivial extension to someone else's simple idea" is not always so obvious.

Complexity: Some reviewers deem papers that don't present any new methods or fancy math proofs as "trivial" or "not rigorous".

Clarity & Understanding: Some reviewers care about the mechanistic details of proposed algorithms and furthering understanding of ML, not just achieving better results. This is closely related to "Correctness".

Is it "Exciting"?: Julian Togelius (AC for NeurIPS '20) mentions that many papers he chaired were simply not very exciting. Only Julian can know what he deems "exciting", but I suppose he means having "good taste" in choosing research problems and solutions.

Sufficiently Hard Problems: Some reviewers reject papers for evaluating on datasets that are too simple, like MNIST. "Sufficiently hard" is a moving goal post, with the implicit expectation that as the field develops better methods the benchmarks have to get harder to push unsolved capabilities. Also, SOTA methods on simple benchmarks are not always SOTA on harder benchmarks that are closer to real world applications. Thankfully my most cited paper was written at a time where it was still acceptable to publish on MNIST.

Is it Surprising? Even if a paper demonstrates successful results, a reviewer might claim that they are unsurprising or "obvious". For example, papers applying standard object recognition techniques to a novel dataset might be argued to be "too easy and straightforward" given that the field expects supervised object recognition to be mostly solved (this is not really true, but the benchmarks don't reflect that).

I really enjoy papers that defy intuitions, and I personally strive to write surprising papers.

Some of my favorite papers in this category do not achieve SOTA or propose any new algorithms at all:

Is it Real? Closely related to "sufficiently hard problems". Some reviewers think that games are a good testbed to study RL, while others (typically from the classical robotics community) think that Mujoco Ant and a real robotic quadruped are entirely different problems; algorithmic comparisons on the former tell us nothing about the same set of experiments on the latter.

Does Your Work Align with Good AI Ethics? Some view the development of ML technology as a means to build a better society, and discourage papers that don't align with their AI ethics. The required "Broader Impact" statements in NeurIPS submissions this year are an indication that the field is taking this much more seriously. For example, if you submit a paper that attempts to infer criminality from only facial features or perform autonomous weapon targeting, I think it's likely your paper will be rejected regardless of what methods you develop.

Different reviewers will prioritize different aspects of the above, and many of these criteria are highly subjective (e.g. problem taste, ethics, simplicity). For each of the criteria above, it's possible to come up with counterexamples of highly-cited or impactful ML papers that don't meet that criteria but possibly meet others.

## My Criteria

I wanted to share my criteria for how I review papers. When it comes to recommending accept/reject, I mostly care about Correctness and New Information. Even if I think your paper is boring and unlikely to be an actively researched topic in 10 years, I will vote to accept it as long as your paper helped me learn something new that I didn't think was already stated elsewhere.

Some more specific examples:

• If you make a claim about humanlike exploration capabilities in RL in your introduction and then propose an algorithm to do something like that, I'd like to see substantial empirical justification that the algorithm is indeed similar to what humans do.
• If your algorithm doesn't achieve SOTA, that's fine with me. But I would like to see a careful analysis of why your algorithm doesn't achieve it and why.
• When papers propose new algorithms, I prefer to see that the algorithm is better than prior work. However, I will still vote to accept if the paper presents a factually correct analysis of why it doesn't do better than prior work.
• If you claim that your new algorithm works better because of reason X, I would like to see experiments that show that it isn't because of alternate hypotheses X1, X2.
Correctness is difficult to verify. Many metric learning papers were proposed in the last 5 years and accepted at prestigious conferences, only for Musgrave et al. '20 to point out that the experimental methodology between these papers were not consistent.

I should get off my high horse and say that I'm part of the circus too. I've reviewed papers for 10+ conferences and workshops and I can honestly say that I only understood 25% of papers from just reading them. An author puts in tens or hundreds of hours into designing and crafting a research paper and the experimental methodology, and I only put in a few hours in deciding whether it is "correct science". Rarely am I able to approach a paper with the level of mastery needed to rigorously evaluate correctness.

A good question to constantly ask yourself is: "what experiment would convince me that the author's explanations are correct and not due to some alternate hypothesis? Did the authors check that hypothesis?"

I believe that we should accept all "adequate" papers, and more subjective things like "taste" and "simplicity" should be reserved for paper awards, spotlights, and oral presentations. I don't know if everyone should adopt this criteria, but I think it's helpful to at least be transparent as a reviewer on how I make accept/reject decisions.

If you're interested in getting mentorship for learning how to read, critique, and write papers better, I'd like to plug my weekly office hours, which I hold on Saturday mornings over Google Meet. I've been mentoring about 6 people regularly over the last 3 months and it's working out pretty well.

Anyone who is not in a traditional research background (not currently in an ML PhD program) can reach out to me to book an appointment. You can think of this like visiting your TA's office hours for help with your research work. Here are some of the services I can offer, completely pro bono:

• If you have trouble understanding a paper I can try to read it with you and offer my thoughts on it as if I were reviewing it.
• If you're very very new to the field and don't even know where to begin I can offer some starting exercises like reading / summarizing papers, re-producing existing papers, and so on.
• I can try to help you develop a good taste of what kinds of problems to work on, how to de-risk ambitious ideas, and so on.
• Advice on software engineering aspects of research. I've been coding for over 10 years; I've picked up some opinions on how to get things done quickly.
• Helping you craft a compelling story for a paper you want to write.
No experience is required, all that you need to bring to the table is a desire to become better at doing research. The acceptance rate for my office hours is literally 100% so don't be shy!

## Sunday, September 13, 2020

### Chaos and Randomness

For want of a nail the shoe was lost.

For want of a shoe the horse was lost.

For want of a horse the rider was lost.

For want of a rider the message was lost.

For want of a message the battle was lost.

For want of a battle the kingdom was lost.

And all for the want of a horseshoe nail.

Was the kingdom lost due to random chance? Or was it the inevitable outcome resulting from sensitive dependence on initial conditions? Does the difference even matter? Here is a blog post about Chaos and Randomness with Julia code.

## Preliminaries

Consider a real vector space $X$ and a function $f: X \to X$ on that space. If we repeatedly apply $f$ to a starting vector $x_1$, we get a sequence of vectors known as an orbit $x_1, x_2, ... ,f^n(x_1)$.

For example, the logistic map is defined as

function logistic_map(r, x)
r*x*(1-x)
end

Here is a plot of successive applications of the logistic map for r=3.5. We can see that the system constantly oscillates between two values, ~0.495 and ~0.812.

## Definition of Chaos

There is surprisingly no universally accepted mathematical definition of Chaos. For now we will present a commonly used characterization by Devaney:

We can describe an orbit $x_1, x_2, ... ,f^n(x_1)$ as *chaotic* if:

1. The orbit is not asymptotically periodic, meaning that it never starts repeating, nor does it approach an orbit that repeats (e.g. $a, b, c, a, b, c, a, b, c...$).
2. The maximum Lyapunov exponent $\lambda$ is greater than 0. This means that if you place another trajectory starting near this orbit, it will diverge at a rate $e^\lambda$. A positive $\lambda$ implies that two trajectories will diverge exponentially quickly away from each other. If $\lambda<0$, then the distance between trajectories would shrink exponentially quickly. This is the basic definition of "Sensitive Dependence to Initial Conditions (SDIC)", also colloquially understood as the "butterfly effect".

Note that (1) intuitively follows from (2), because the Lyapunov exponent of an orbit that approaches a periodic orbit would be $<0$, which contradicts the SDIC condition.

We can also define the map $f$ itself to be chaotic if there exists an invariant (trajectories cannot leave) subset $\tilde{X} \subset X$, where the following three conditions hold:

1. Sensitivity to Initial Conditions, as mentioned before.
2. Topological mixing (every point in orbits in $\tilde{X}$ approaches any other point in $\tilde{X}$).
3. Dense periodic orbits (every point in $\tilde{X}$ is arbitrarily close to a periodic orbit). At first, this is a bit of a head-scratcher given that we previously defined an orbit to be chaotic if it *didn't* approach a periodic orbit. The way to reconcile this is to think about the subspace $\tilde{X}$ being densely covered by periodic orbits, but they are all unstable so the chaotic orbits get bounced around $\tilde{X}$ for all eternity, never settling into an attractor but also unable to escape $\tilde{X}$.
Note that SDIC actually follows from the second two conditions. If these unstable periodic orbits cover the set $\tilde{X}$ densely and orbits also cover the set densely while not approaching the periodic ones, then intuitively the only way for this to happen is if all periodic orbits are unstable (SDIC).

These are by no means the only way to define chaos. The DynamicalSystems.jl package has an excellent documentation on several computationally tractable definitions of chaos.

## Chaos in the Logistic Family

Incidentally, the logistic map exhibits chaos for most of the values of r from values 3.56995 to 4.0. We can generate the bifurcation diagram quickly thanks to Julia's de-vectorized way of numeric programming.

rs = [2.8:0.01:3.3; 3.3:0.001:4.0]
x0s = 0.1:0.1:0.6
N = 2000 # orbit length
x = zeros(length(rs), length(x0s), N)
# for each starting condtion (across rows)
for k = 1:length(rs)
# initialize starting condition
x[k, :, 1] = x0s
for i = 1:length(x0s)
for j = 1:N-1
x[k, i, j+1] = logistic_map((r=rs[k] , x=x[k, i, j])...)
end
end
end
plot(rs, x[:, :, end], markersize=2, seriestype = :scatter, title = "Bifurcation Diagram (Logistic Map)")

We can see how starting values y1=0.1, y2=0.2, ...y6=0.6 all converge to the same value, oscillate between two values, then start to bifurcate repeatedly until chaos emerges as we increase r.

## Spatial Precision Error + Chaos = Randomness

What happens to our understanding of the dynamics of a chaotic system when we can only know the orbit values with some finite precision? For instance,  x=0.76399 or x=0.7641 but we only observe x=0.764 in either case.

We can generate 1000 starting conditions that are identical up to our measurement precision, and observe the histogram of where the system ends up after n=1000 iterations of the logistic map.

Let's pretend this is a probabilistic system and ask the question: what are the conditional distributions of $p(x_n|x_0)$, where $n=1000$, for different levels of measurement precision?

At less than $O(10^{-8})$ precision, we start to observe the entropy of the state evolution rapidly increasing. Even though we know that the underlying dynamics are deterministic, measurement uncertainty (a form of aleotoric uncertainty) can expand exponentially quickly due to SDIC. This results in $p(x_n|x_0)$ appearing to be a complicated probability distribution, even generating "long tails".

I find it interesting that the "multi-modal, probabilistic" nature of $p(x_n|x_0)$ vanishes to a simple uni-modal distribution when measurement is sufficiently high to mitigate chaotic effects for $n=1000$. In machine learning we concern ourselves with learning fairly rich probability distributions, even going as far as to learn transformations of simple distributions into more complicated ones.

But what if we are being over-zealous with using powerful function approximators to model $p(x_n|x_0)$? For cases like the above, we are discarding the inductive bias that $p(x_n|x_0)$ arises from a simple source of noise (uniform measurement error) coupled with a chaotic "noise amplifier". Classical chaos on top of measurement error will indeed produce Indeterminism, but does that mean we can get away with treating $p(x_n|x_0)$ as purely random?

I suspect the apparent complexity of many "rich" probability distributions we encounter in the wild are more often than not just chaos+measurement error (e.g. weather). If so, how can we leverage that knowledge to build more useful statistical learning algorithms and draw inferences?

We already know that chaos and randomness are nearly equivalent from the perspective of computational distinguishability. Did you know that you can use chaos to send secret messages? This is done by having Alice and Bob synchronize a chaotic system $x$ with the same initial state $x_0$, and then Alice sends a message $0.001*signal + x$. Bob merely evolves the chaotic system $x$ on his own and subtracts it to recover the signal. Chaos has also been used to design pseudo-random number generators.

## Saturday, June 20, 2020

### Free Office Hours for Non-Traditional ML Researchers

This post was prompted by a tweet I saw from my colleague, Colin:

I'm currently a researcher at Google with a "non-traditional background", where non-traditional background means "someone who doesn't have a PhD". People usually get PhDs so they can get hired for jobs that require that credential. In the case of AI/ML, this might be to become a professor at a university, or land a research scientist position at a place like Google, or sometimes even both.

At Google it's possible to become a researcher without having a PhD, although it's not very easy. There are a two main paths [1]:

One path is to join an AI Residency Program, which are fixed-term jobs from non-university institution (FAANG companies, AI2, etc.) that aim to jump-start a research career in ML/AI. However, these residencies are usually just 1 year long and are not long enough to really "prove yourself" as a researcher.

Another path is to start as a software engineer (SWE) in an ML-focused team and build your colleagues' trust in your research abilities. This was the route I took: I joined Google in 2016 as a software engineer in the Google Brain Robotics team. Even though I was a SWE by title, it made sense to focus on the "most important problem", which was to think really hard about why the robots weren't doing what we wanted and train deep neural nets in an attempt to fix those problems. One research project led to another, and now I just do research + publications all the time.

As the ML/AI publishing field has grown exponentially in the last few years, it has gotten harder to break into research (see Colin's tweet). Top PhD programs like BAIR usually require students to have a publication at a top conference like ICML, ICLR, NeurIPS before they even apply. I'm pretty sure I would not have been accepted to any PhD programs if I were graduating from college today, and would have probably ended up taking a job offer in quantitative finance instead.

The uphill climb gets even steeper for aspiring researchers with non-traditional backgrounds; they are competing with no shortage of qualified PhD students. As Colin alludes to, it is also getting harder for internationals to work at American technology companies and learn from American schools, thanks to our administration's moronic leadership.

The supply-demand curves for ML/AI labor are getting quite distorted. On one hand, we have a tremendous global influx of people wanting to solve hard engineering problems and contribute to scientific knowledge and share it openly with the world. On the other hand, there seems to be a shortage of formal training:
1. A research mentor to learn the academic lingo and academic customs from, and more importantly, how to ask good questions and design experiments to answer them.
2. Company environments where software engineers are encouraged to take bold risks and lead their own research (and not just support researchers with infra).

Free Office Hours

I can't do much for (2) at the moment, but I can definitely help with (1). To that end, I'm offering free ML research mentorship to aspiring researchers from non-traditional backgrounds via email and video conferencing.

I'm most familiar with applied machine learning, robotics, and generative modeling, so I'm most qualified to offer technical advice in these areas. I have a bunch of tangential interests like quantitative finance, graphics, and neuroscience. Regardless of technical topic, I can help with academic writing and de-risking ambitious projects and choosing what problems to work on. I also want to broaden my horizons and learn more from you.

If you're interested in using this resource, send me an email at <myfirstname><mylastname><2004><at><g****.com>. In your email, include:
2. What you want to get out of advising
3. A cool research idea you have in a couple sentences
Some more details on how these office hours will work:
1. Book weekly or bi-weekly Google Meet [2] calls to check up on your work and ask questions, with 15 minute time slots scheduled via Google Calendar.
2. The point of these office hours is not to answer "how do I get a job at Google Research", but to fulfill an advisor-like role in lieu of a PhD program. If you are farther along your research career we can discuss career paths and opportunities a little bit, but mostly I just want to help people with (1).
3. I'm probably not going to write code or run experiments for you.
4. I don't want to be that PI that slaps their name on all of their student's work - most advice I give will be given freely with no strings attached. If I make a significant contribution to your work or spend > O(10) hours working with you towards a publishable result, I may request being a co-author on a publication.
5. I reserve the right to decline meetings if I feel that it is not a productive use of my time or if other priorities take hold.
6. I cannot tell you about unpublished work that I'm working on at Google or any Google-confidential information.
7. I'm not offering ML consultation for businesses, so your research work has to be unrelated to your job.
8. To re-iterate point number 2 once more, I'm less interested in giving career advice and more interested in teaching you how to design experiments, how to cite and write papers, and communicating research effectively.
What do I get out of this? First, I get to expand my network. Second, I can only personally run so many experiments by myself so this would help me grow my own research career. Third, I think the supply of mentorship opportunities offered by academia is currently not scalable, and this is a bit of an experiment on my part to see if we can do better. I'd like to give aspiring researchers similar opportunities that I had 4 years ago that allowed me to break into the field.

Footnotes
[1] Chris Olah has a great essay on some additional options and pros and cons of non-traditional education.
[2] Zoom complies with Chinese censorship requests, so as a statement of protest I avoid using Zoom when possible.

## Wednesday, April 1, 2020

### Three Questions that Keep Me Up at Night

A Google interview candidate recently asked me: "What are three big science questions that keep you up at night?" This was a great question because one's answer reveals so much about one's intellectual interests - here are mine:

Q1: Can we imitate "thinking" from only observing behavior?

Suppose you have a large fleet of autonomous vehicles with human operators driving them around diverse road conditions. We can observe the decisions made by the human, and attempt to use imitation learning algorithms to map robot observations to the steering decisions that the human would take.

However, we can't observe what the homunculus is thinking directly. Humans read road text and other signage to interpret what they should and should not do. Humans plan more carefully when doing tricky maneuvers (parallel parking). Humans feel rage and drowsiness and translate those feelings into behavior.

Let's suppose we have a large car fleet and our dataset is so massive and perpetually growing that we cannot train it faster than we are collecting new data. If we train a powerful black-box function approximator to learn the mapping from robot observation to human behavior [1], and we use active-learning techniques like DAgger to combat false negatives, will that be enough to acquire these latent information processing capabilities? Can the car learn to think like a human, and how much?

Inferring low-dimensional unobserved states from behavior is a well-studied technique in statistical modeling. In recent years, meta-reinforcement learning algorithms have increased the capability of agents to change their behavior in the presence of new information. However, no one has applied this principle to the scale and complexity of "human-level thinking and reasoning variables". If we use basic black-box function approximators (ConvNets, ResNets, Transformers, etc.), will it be enough? Or will it still fail even with a million lifetimes worth of driving data?

In other words, can simply predicting human behavior lead to a model that can learn to think like a human?

One cannot draw a hard line between "thinking" and "pattern matching", but loosely speaking I'd want to see such learned latent variables reflect basic deductive and inductive reasoning capabilities. For example, a logical proposition formulated as a steering problem: "Turn left if it is raining; right otherwise".

This could also be addressed via other high-data environments:

• Observing trader orders on markets and seeing if we can recover the trader's deductive reasoning and beliefs about the future. See if we can observe rational thought (if not rational behavior).
• Recovering intent and emotions and desire from social network activity.

Q2: What is the computationally cheapest "organic building block" of an Artificial Life simulation that could lead to human-level AGI?

Many AI researchers, myself included, believe that competitive survival of "living organisms" is the only true way to implement general intelligence.

If you lack some mental power like deductive reasoning, another agent might exploit the reality to its advantage to out-compete you for resources.

If you don't know how to grasp an object, you can't bring food to your mouth. Intelligence is not merely a byproduct of survival; I would even argue that it is Life and Death itself from which all semantic meaning we perceive in the world arises (the difference between a "stable grasp" and an "unstable grasp").

How does one realize an A-Life research agenda? It would be prohibitively expensive to implement large-scale evolution with real robots, because we don't know how to get robots to self-replicate as living organisms do. We could use synthetic biology technology, but we don't know how to write complex software for cells yet and even if we could, it would probably take billions of years for cells to evolve into big brains. A less messy compromise is to implement A-Life in silico and evolve thinking critters in there.

We'd want the simulation to be fast enough to simulate armies of critters. Warfare was a great driver of innovation. We also want the simulation to be rich and open-ended enough to allow for ecological niches and tradeoffs between mental and physical adaptations (a hand learning to grasp objects).

Therein lies the big question: if the goal is to replicate the billions of years of evolutionary progress leading up to where we are today, what are the basic pieces of the environment that would be just good enough?

• Chemistry? Cells? Ribosomes? I certainly hope not.
• How do nutrient cycles work? Resources need to be recycled from land to critters and back for there to be ecological change.
• Is the discovery of fire important for evolutionary progression of intelligence? If so, do we need to simulate heat?
• What about sound and acoustic waves?
• Is a rigid-body simulation of MuJoCo humanoids enough? Probably not, if articulated hands end up being crucial.
• Is Minecraft enough?
• Does the mental substrate need to be embodied in the environment and subject to the physical laws of the reality? Our brains certainly are, but it would be bad if we had to simulate neural networks in MuJoCo.
• Is conservation of energy important? If we are not careful, it can be possible through evolution for agents to harvest free energy from their environment.

In the short story Crystal Nights by Greg Egan, simulated "Crabs" are built up of organic blocks that they steal from other Crabs. Crabs "reproduce" by assembling a new crab out of parts, like LEGO. But the short story left me wanting for more implementation details...

Q3: Loschmidt's Paradox and What Gives Rise to Time?

I recently read The Order of Time by Carlo Rovelli and being a complete Physics newbie, finished the book feeling more confused and mystified than when I had started.

The second law of thermodynamics, $\Delta{S} > 0$, states that entropy increases with time. That is the only physical law that is requires time "flow" forwards; all other physical laws have Time-Symmetry: they hold even if time was flowing backwards. In other words, T-Symmetry in a physical system implies conservation of entropy.

Microscopic phenomena (laws of mechanics on position, acceleration, force, electric field, Maxwell's equations) exhibit T-Symmetry. Macroscopic phenomena (gases dispersing in a room, people going about their lives), on the other hand, are T-Asymmetric. It is perhaps an adaptation to macroscopic reality being T-Asymmetric that our conscious experience itself has evolved to become aware of time passing. Perhaps bacteria do not need to know about time...

But if macroscopic phenomena are comprised of nothing more than countless microscopic phenomena, where the heck does entropy really come from?

Upon further Googling, I learned that this question is known as Loschmidt's Paradox. One resolution that I'm partially satisfied with is to consider that if we take all microscopic collisions to be driven by QM, then there really is no such thing as "T-symmetric" interactions, and thus microscopic interactions are actually T-asymmetric. A lot of the math becomes simpler to analyze if we consider a single pair of particles obeying randomized dynamics (whereas in Statistical Mechanics we are only allowed to assume that about a population of particles).

Even if we accept that macroscopic time originates from a microscopic equivalent of entropy, this still begs the question of what the origin of microscopic entropy (time) is.

Unfortunately, many words in English do not help to divorce my subjective, casual understanding of time from a more precise, formal understanding. Whenever I think of microscopic phenomena somehow "causing" macroscopic phenomena or the cause of time (entropy) "increasing", my head gets thrown for a loop. So much T-asymmetry is baked into our language!

I'd love to know of resources to gain a complete understanding of what we know and don't know, and perhaps a new language to think about Causality from a physics perspective

If you have thoughts on these questions, or want to share your own big science questions that keep you up at night, let me know in the comments or on Twitter! #3sciencequestions

## Wednesday, December 25, 2019

### Selected Quotes from "The Dark Ages of AI Panel Discussion"

In 1984, a panel at the AAAI conference discussed whether the field was approaching an "AI Winter". Mitch Waldrop wrote a transcript of the discussion, and much of it reads exactly like something written 35 years into the future.

Below are some quotes from the transcript that I found impressive, as they describe the feelings of many an AI researcher today and how the public views AI, despite all the advances in computing and software since 1984. 👇

"People make essentially no distinction between computers, broadly defined, and Artificial Intelligence... as far as they're concerned, there is no difference; they're just worried about the impact of very capable, smart computers" - Mitch Waldrop

"The computer is not only a mythic emblem for this bright, high-technology future, it's a mythic symbol for much of the anxiety that people have about their own society." - Mitch Waldrop

"A second anxiety, what you might call the 'Frankenstein Anxiety', is the fear of being replaced, of becoming superfluous..." - Mitch Waldrop

"Modern Times Anxiety: People becoming somehow, because of computers, just a cog in the vast, faceless machine; the strong sense of helplessness, that we really have no control over our lives" - Mitch Waldrop

"The problem is not a matter of imminent deadlines or lack of space or lack of time... the real problem is that what reporters see as real issues in the world are very different from what the AI community sees as real issues." - Mitch Waldrop

"If we expect physicists to be concerned about arms control and chemists to be concerned about toxic waste, it's probably reasonble to expect AI people to be concerned about the human impact of these technologies" - Mitch Waldrop

"It [Doomsday] is already here. There is no content in this conference" - Bob Wilensky

"What I heard was that only completed scientific work was going to be accepted. This is a horrible concept - no new unformed ideas, no incremental work building on previous work" - Roger Schank

"When I first got into this field twenty years ago, I used to explain to people what I did, and they would already say, 'you mean computers can't do that already?' They'll always believe that." - Roger Schank

"Big business has a very serious role in this country. Among other things, they get to determine what's 'in' and what's 'out' in the government." - Roger Schank

"I got scared when big business started getting into this - Schlumberger, Xerox, HP, Texas Instruments, GTE, Amico, Exxcon, they were all making investments - they all have AI groups. And you find out that, thoise people weren't trained in AI." - Roger Schank

"It's easier to go into a startup... [or] a big company... than to go into a university and try to organize an AI lab, which is just as hard to do now as it ever was. But if we don't do that, we will find that we are in the 'Dark Ages' of AI"  - Roger Schank

"The first [message] is incumbent upon AI because we have promised so much, to produce. We must produce working systems. Some of you must devote yourselves to doing that. It is also the case that some of you had better commit to doing science."  - Roger Schank

"If it turns out that our AI conference isn't the place to discuss science, then we better start finding a place where we can discuss science, because this show for all the venture capitalists is very nice."  - Roger Schank

"the notion of cognition as computation is going to have extraordinary importance to the philosophy and psychology of the next generation. And for well or ill, this notion has affected some of the deepest aspects of our self-image." - B. Chandrasekaran

"symbol-level theories, which may even be right, are being mistaken for knowledge-level theories"  - B. Chandrasekaran

"My hope is that AI will evolve more like biotech in the sense that certain technologies will be spun off, and researchers will remain and extremely interesting progress will be made"  - B. Chandrasekaran

"I have encountered people who have a science fiction view of the world and think that computers now can do just about anything... these people have a feeling that computers can do wonderful things, but if you ask them how exactly could an AI program help in work, they don't have the sense that within a week or two they could be replaced or that computers can come in and do a much better job than they do in work." - John McDermott

"There have been a number of technologies that have run into dead ends, like dirigibles and external combustion engines. And there have been other ones, like television, and in fact, the telephone system itself, which took between twenty and forty years to go from being laboratory possibilities to actual commercial successes. Do you really think that IA is going to become a commercial success in the next 10-15 years?" - Audience member

"They [lay people] seem to have a vague idea that great things can happen, have sublime confidence... but when it gets down to the nitty-gritty, they tend to be pretty unimaginative and have pretty low expectations as to what can be done." - Mitch Waldrop

"It seems that academic AI people tend to blame everyone but themselves when it comes to problems of AI in terms of relationship to the general society." - Audience member

## Thursday, November 28, 2019

### Differentiable Path Tracing on the GPU/TPU

You can download a PDF (typset in LaTeX) of this blog post here.
Jupyter Notebook Code on GitHub: https://github.com/ericjang/pt-jax

This blog post is a tutorial on implementing path tracing, a physically-based rendering algorithm, in JAX. This code runs on the CPU, GPU, and Google Cloud TPU, and is implemented in a way that also makes it end-to-end differentiable. You can compute gradients of the rendered pixels with respect to geometry, materials, whatever your heart desires.

I love JAX because it is equally suited for pedagogy and high-performance computing. We will implement a path tracer for a single pixel in numpy-like syntax, slap a jax.vmap operator on it, and JAX automatically converts our code to render multiple pixels with SIMD instructions! You can do the same thing for multiple devices using jax.pmap. If that isn't magic, I don't know what is. At the end of the tutorial you will not only know how to render a Cornell Box, but also understand geometric optics and radiometry from first principles.

The figure below, borrowed from a previous post from this blog, explains at a high level the light simulator we're about to implement:

I divide this tutorial into two parts: 1) implementing geometry-related functions like ray-scene intersection and normal estimation, and 2) the "light transport" part where we discuss how to accumulate radiance arriving at an imaginary camera sensor.

JAX and Matplotlib (and a bit of calculus and probability) are the only required dependencies for this tutorial:

import jax.numpy as np
from jax import jit, grad, vmap, random, lax
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

JAX is essentially a drop-in replacement for numpy, with the exception that operations are all functional (no indexing assignment) and the user must manually pass around an explicit rng_key to generate random numbers. Here is a short list of JAX gotchas if you are coming to JAX as a numpy user.

### Part I: Geometry

The vast majority of rendering software represents scene geometry as a collection of surface primitives that form "meshes". 3D modeling software form meshes using quadrilaterial faces, and then the rendering software converts the quads to triangles under the hood. Collections of meshes are composed together to form entire objects and scenes. For this tutorial we're going to use an unorthodox geometry representation and we'll need to implement a few helper functions to manipulate them.

#### Differentiable Scene Intersection with Distance Fields

Rendering requires computing intersection points $y$ in the scene with ray $\omega_i$, and usually involves traversing a highly-optimized spatial data structure called a bounding volume hierarchy (BVH). $y$ can be expressed as a parametric equation of the origin point $x$ and raytracing direction $\omega_i$, and the goal is to find the distance $t$:

$\hat{y} = x + t \cdot \omega_i$

There is usually a lot of branching logic in BVH traversal algorithms, which makes it harder to implement efficiently on accelerator hardware like GPUs and TPUs. Instead, let's use raymarching on signed distance fields to find the intersection point $y$. I first learned of this geometry modeling technique when Inigo "IQ" Quilez, a veritable wizard of graphics programming, gave a live coding demo at Pixar about how he modeled vegetation in the "Brave" movie. Raymarching is the primary technique used by the ShaderToy.com community to implement cool 3D movies using only instructions available to WebGL fragment shaders.

A signed distance field over position $p$ specifies "the distance you can move in any direction without coming into contact with the object". For example, here is the signed distance field for a plane that passes through the origin and is perpendicular to the y-axis.

def sdFloor(p):
return p.y

To find the intersection distance $t$, the raymarching algorithm iteratively increments $t$ by step sizes equal to the signed distance field of the scene (so we never pass through an object). This iteration happens until $t$ "leaves the scene'" or the distance field shrinks to zero (we have collided with an object). For the plane distance, we see from the diagram below that stepping forward using the distance field allows us to get arbitrarily close to the plane without ever passing through it.

def raymarch(ro, rd, sdf_fn, max_steps=10):
t = 0.0
for i in range(max_steps):
p = ro + t*rd
t = t + sdf_fn(p)
return t

Signed distance fields combined with raymarching have a number of nice mathematical properties. The most important one is that unlike analytical ray-shape intersection, raymarching does not require re-deriving an analytical solution for intersecting points for every primitive shape we wish to add to the scene. Triangles are also general, but they require a lot of memory to store expressive scenes. In my opinion, signed distance fields strike a good balance between memory budget and geometric expressiveness.

Similar to ResNet architectures in Deep Learning, the raymarching algorithm is a form of "unrolled iterative inference" of the same signed distance field. If we are trying to differentiate through the signed distance function (for instance, trying to approximate it with a neural network), this representation may be favorable to gradient descent algorithms.

#### Building Up Our Scene

The first step is to implement the signed distance field for the scene of interest. The naming and programming conventions in this tutorial are heavily inspired by stylistic conventions used by ShaderToy DemoScene community. One such convention is to define hard-coded enums for each object, so we can associate intersection points to their nearest object. The values are arbitrary; you can substitute them with your favorite numbers if you like.

OBJ_NONE=0.0
OBJ_FLOOR=0.1
OBJ_CEIL=.2
OBJ_WALL_RD=.3
OBJ_WALL_WH=.4
OBJ_WALL_GR=.5
OBJ_SHORT_BLOCK=.6
OBJ_TALL_BLOCK=.7
OBJ_LIGHT=1.0
OBJ_SPHERE=0.9

Computing a ray-scene intersection should therefore return an object id and an associated distance, for which we define a helper function to zip up those two numbers.

def df(obj_id, dist):
return np.array([obj_id, dist])

Next, we'll define the distance field for a box (source: https://www.iquilezles.org/www/articles/distfunctions/distfunctions.htm).

def udBox(p, b):
# b = half-widths
return length(np.maximum(np.abs(p)-b,0.0))

Rotating, translating, and scaling an object implied by a signed distance field is done by performing the inverse operation to the input point to the distance function. For example, if we want to rotate one of the boxes in the scene by an angle of $\theta$, we rotate its argument $p$ by $-\theta$ instead.

def rotateX(p,a):
# We won't be using rotateX for this tutorial.
c = np.cos(a); s = np.sin(a);
px,py,pz=p[0],p[1],p[2]
return np.array([px,c*py-s*pz,s*py+c*pz])

def rotateY(p,a):
c = np.cos(a); s = np.sin(a);
px,py,pz=p[0],p[1],p[2]
return np.array([c*px+s*pz,py,-s*px+c*pz])

def rotateZ(p,a):
c = np.cos(a); s = np.sin(a);
px,py,pz=p[0],p[1],p[2]
return np.array([c*px-s*py,s*px+c*py,pz])

Another cool property of signed distance fields is that you can compute the union of two solids with a simple np.minimum operation. By the definition of a distance field, if you take a step size equal to the smaller of the two distances in either direction, you are still guaranteed not to intersect with anything. The following method, short for "Union Operation", joins to distance fields by comparing their distance property.

def opU(a,b):
if a[1] < b[1]:
return a
else:
return b

Unfortunately, the JAX compiler complains when combining both grad and jit operators through conditional logic like the one above. So we need to write things a little differently to preserve differentiability:

def opU(a,b):
condition = np.tile(a[1,None]<b[1,None], [2])
return np.where(condition, a, b)

Now we have all the requisite pieces to build the signed distance field for the Cornell Box, which we call sdScene. Recall from the previous section that the distance field for an axis-aligned plane is just the height along that axis. We can use this principle to build infinite planes that comprise the walls, floor, and ceiling of the Cornell Box.

def sdScene(p):
# p is [3,]
px,py,pz=p[0],p[1],p[2]
# floor
obj_floor = df(OBJ_FLOOR, py) # py = distance from y=0
res = obj_floor
# ceiling
obj_ceil = df(OBJ_CEIL, 4.-py)
res = opU(res,obj_ceil)
# backwall
obj_bwall = df(OBJ_WALL_WH, 4.-pz)
res = opU(res,obj_bwall)
# leftwall
obj_lwall = df(OBJ_WALL_RD, px-(-2))
res = opU(res,obj_lwall)
# rightwall
obj_rwall = df(OBJ_WALL_GR, 2-px)
res = opU(res,obj_rwall)
# light
obj_light = df(OBJ_LIGHT, udBox(p - np.array([0,3.9,2]), np.array([.5,.01,.5])))
res = opU(res,obj_light)
# tall block
bh = 1.3
p2 = rotateY(p- np.array([-.64,bh,2.6]),.15*np.pi)
d = udBox(p2, np.array([.6,bh,.6]))
obj_tall_block = df(OBJ_TALL_BLOCK, d)
res = opU(res,obj_tall_block)
# short block
bw = .6
p2 = rotateY(p- np.array([.65,bw,1.7]),-.1*np.pi)
d = udBox(p2, np.array([bw,bw,bw]))
obj_short_block = df(OBJ_SHORT_BLOCK, d)
res = opU(res,obj_short_block)
return res

Notice that we model the light source on the ceiling as a rectangular prism with half-widths $(0.5, 0.5)$. All numbers are expressed in SI units, so this implies a 1 meter x 1 meter light, and a big 4m x 4m Cornell box (this is a big scene!). The size of the light will become relevant later when we compute quantitites like emitted radiance.

#### Computing Surface Normals

In rendering we need to frequently compute the normals of geometric surfaces. In ShaderToy programs, the most common algorithm used to compute normals is a finite-difference gradient approximation of the distance field $\nabla_p d(p)$, and then normalize that vector to obtain an approximate normal.

def calcNormalFiniteDifference(p):
# derivative approximation via midpoint rule
eps = 0.001
dx=np.array([eps,0,0])
dy=np.array([0,eps,0])
dz=np.array([0,0,eps])
# extract just the distance component
nor = np.array([
sdScene(p+dx) - sdScene(p-dx),
sdScene(p+dy) - sdScene(p-dy),
sdScene(p+dz) - sdScene(p-dz),
])
return normalize(nor)

Note that this requires six separate evaluations to the sdScene function! As it turns out, JAX can give us analytical normals basically for free via its auto-differentiation capabilities. The backward pass has the same computational complexity as the forward pass, resulting in autodiff gradients being 6x faster than finite-differencing. Neat!

def dist(p):
# return the distance-component only
return sdScene(p)[1]

#### Cosine-Weighted Sampling

We require is the ability to sample scattering rays around some local surface normal, for when we choose recursive rays to scatter. All the objects in the scene are assigned "Lambertian BRDFs'', which mean that they are matte in reflectance properties and the apparent brightness to an observer is the same regardless of viewing angle. For Lambertian materials, it is much more effective to sample from a cosine-weighted distribution because it allows two cosine-related probability terms (from the sampling and from the BRDF) to cancel out. The motivation for this will become apparent in Part II of the tutorial, but here is the code up front.

def sampleCosineWeightedHemisphere(rng_key, n):
rng_key, subkey = random.split(rng_key)
u = random.uniform(subkey,shape=(2,),minval=0,maxval=1)
u1, u2 = u[0], u[1]
uu = normalize(np.cross(n, np.array([0.,1.,1.])))
vv = np.cross(uu,n)
ra = np.sqrt(u2)
rx = ra*np.cos(2*np.pi*u1)
ry = ra*np.sin(2*np.pi*u1)
rz = np.sqrt(1.-u2)
rr = rx*uu+ry*vv+rz*n
return normalize(rr)

Here's a quick 3D visualization to see whether our implementation is doing something reasonable:

from mpl_toolkits.mplot3d import Axes3D
nor = normalize(np.array([[1.,1.,0.]]))
nor = np.tile(nor,[1000,1])
rng_key = random.split(RNG_KEY, 1000)
rd = vmap(sampleCosineWeightedHemisphere)(rng_key, nor)
fig = plt.figure()
ax.scatter(rd[:,0],rd[:,2],rd[:,1])
ax.scatter(rd[:,0],rd[:,1])

#### Camera Model

For each pixel we want to render, we need to associate it with a ray direction rd and a ray origin ro. The most basic camera model for computer graphics is a pinhole camera, shown below:

The following code sets up a pinhole camera with focal distance of 2.2 meters:

N=150 # width of image plane
xs=np.linspace(0,1,N) # 10 pixels
us,vs = np.meshgrid(xs,xs)
uv = np.vstack([us.flatten(),vs.flatten()]).T
# normalize pixel locations to -1,1
p = np.concatenate([-1+2*uv, np.zeros((N*N,1))], axis=1)
# Render a pinhole camera.
eye = np.tile(np.array([0,2.,-3.5]),[p.shape[0],1])
look = np.array([[0,2.0,0]]) # look straight ahead
w = vmap(normalize)(look - eye)
up = np.array([[0,1,0]]) # up axis of world
u = vmap(normalize)(np.cross(w,up))
v = vmap(normalize)(np.cross(u,w))
d=2.2 # focal distance
rd = vmap(normalize)(p[:,0,None]*u + p[:,1,None]*v + d*w)

If you wanted to render an orthographic projection, you can simply set all ray direction values to point straight forward along the Z-axis, instead of all originating from the same eye point: rd = np.array([0, 0, 1]).

N=150 # width of image plane
xs=np.linspace(0,1,N) # 10 pixels
us,vs = np.meshgrid(xs,xs)
us = (2*us-1)
vs *= 2
uv = np.vstack([us.flatten(),vs.flatten()]).T # 10x10 image grid
eye = np.concatenate([uv, np.zeros((N*N,1))], axis=1)*2
rd = np.zeros_like(eye) + np.array([[0, 0, 1]])

An orthographic camera is what happens when you stretch the focal distance to infinity. That will yield an image like this:

### Part II: Light Simulation

With our scene defined and basic geometric functions set up, we can finally get to the fun part of implementing light transport. This part of the tutorial is agnostic to the geometry representation described in Part I, so you can actually follow along with whatever programming language and geometry representation you like (raymarching, triangles, etc).

Before we learn the path tracing algorithm, it is illuminating to first understand the underlying physical phenomena being simulated. Radiometry is a mathematical framework for measuring electromagnetic radiation. Not only can it be used to render pretty pictures, but it can also be used to understand heat and energy propagated in straight lines within closed systems (e.g. blackbody radiation). What we are ultimately interested in are human perceptual color quantities, but to get them first we will simulate the physical quantities (Watts) and then convert them to lumens and RGB values.

This section borrows some figures from the PBRT webpage on Radiometry. I highly recommend reading that page before proceeding, but I also summarize the main points you need to know here.

You can actually derive the laws of radiometry from first principles, using only the principle of conservation of energy: within a closed system, the total amount of energy being emitted is equal to the total amount of energy being absorbed.

Consider a small sphere of radius $r$ emitting 60 Watts of electromagnetic power into a larger enclosing sphere of radius $R$. We know that the bigger sphere must be absorbing 60 Watts of energy, but because it has a larger surface area ($4\pi R^2$), the incoming energy density per unit area is a factor of $\frac{R^2}{r^2}$ smaller.

We call this "area density of flux'' irradiance (abbreviated $E$) if it is arriving at a surface, and radiant exitance (abbreviated $M$) if it is leaving a surface. The SI unit for these quantities are Watts per square meter.

Now let's consider a slightly different scene in the figure below, where a small flat surface with area $A$ emits a straight beam of light onto the floor. On the left, the emitting and receiving surfaces have the same area, $A = A_1$, so the irradiance equals radiant exitance $E = M$. On the right, the beam of light shines on the floor at an angle $\theta$, which causes the projection $A_2$ to be larger. Calculus and trigonometry tell us that as we shrink the area $A \to 0$, the area of the projected light $A_2$ approaches $\frac{A}{\cos \theta}$. Because flux must be conserved, the irradiance of $A_2$ must be $E = M \cos \theta$, where $\theta$ is the angle between the surface normal and light direction. This is known as "Lambert's Law''.

In the above examples, the scenes were simple or symmetric enough that we did not have to think about what direction light is coming from when computing the irradiance of a surface. However, if we want to simulate light in a complex scene, we will need to compute irradiance by integrating light over many possible directions. For non-transparent surfaces, this set of directions forms a hemisphere surrounding the point of interest, and is perpendicular to the surface normal.

Radiance extends the measure of irradiance to also depend on the solid angle of incident light. Solid angles are just extensions of 2D angles to 3D spheres (and hemispheres). You can recover irradiance and power by integrating out angle and area of irradiance, respectively:

• Radiance $L = \frac{\partial^2 \Phi}{\partial \Omega \partial A \cos \theta}$ measures flux per projected unit area $A \cos \theta$ per unit solid angle (Figure 5.10) $\Omega$.
• Irradiance $E = \frac{\partial \Phi}{\partial A \cos \theta}$ is the integral of radiance over solid angles $\Omega$.
• Power $\Phi$ is the integral of irradiance over projected area $A$.

A nice property of radiance is that it is conserved along rays through empty space. We have the incoming radiance $L_i$ from direction $\omega_i$ to point $x$ equal to the outgoing radiance $L_o$ from some other point $y$, in the reverse direction $-\omega_i$. $y$ is the intersection of origin $x$ along ray $\omega_i$ with the scene geometry.

$L_i(x, \omega_i) = L_o(y, -\omega_i)$

It's important to note that although incoming and outgoing radiance are conserved along empty space, we still need to respect Lambert's Law when computing an irradiance at a surface.

### Different Ways to Integrate Radiance

You may remember from calculus class that it is sometimes easier to compute integrals by changing the integration variable. The same concept holds in rendering: we'll use three different integration methods in building a computationally efficient path tracer. In this section I will draw some material directly from the PBRTv3 online textbook, which you can find here: http://www.pbr-book.org/3ed-2018/Color_and_Radiometry/Working_with_Radiometric_Integrals.html

I was a teaching assistant for the graduate graphics course for 2 years at Brown and by far the most common mistake made in the path tracing project assignments were insufficient understanding of the calculus that went into correctly integrating radiometric quantities.

#### Integrating Over Solid Angle

As mentioned before, in order to compute irradiance $E(x, n)$ at a surface point $x$ with normal $n$, we need to take Lambert's rule into account, because there is a "spreading out'' of flux density that occurs when light sources are facing at an angle.

$E(x, n) = \int_\Omega d\omega L_i(x, \omega) |\cos \theta| = \int_\Omega d\omega L_i(x, \omega) |\omega \cdot n|$

One way to estimate this integral is a single-sample Monte Carlo Estimator, where we sample a single ray direction $\omega_i$ uniformly from the hemisphere, and evaluate the radiance for that direction. In expectation over $\omega_i$, the estimator computes the correct integral.

$\omega_i \sim \Omega$
$\hat{E}(x, n) = L_i(x, \omega_i) |\omega \cdot n| \frac{1}{p(\omega_i)}$

#### Integrating Over Projected Solid Angle

Due to Lambert's law, we should never sample outgoing rays perpendicular to the surface normal because the projected area $\frac{A}{\cos \theta}$ approaches infinity, so the radiance contribution to that area is zero.

We can avoid sampling these "wasted'' rays by weighting the probability of sampling a ray according to Lambert's law - in other words, a cosine-weighted distribution $H^2$ along the hemisphere. This requires us to perform a change of variables, and integrate with respect to the projected solid angle $d\omega^\perp = |\cos \theta| d\omega$.

This is where the cosine-weighted hemisphere sampling function we implemented earlier will come in handy.

$E(x, n) = \int_{H^2} d\omega^\perp L_i(x, \omega^\perp)$

The cosine term in the integral means that the contribution to irradiance is higher as the light source becomes more perpendicular to the light.

#### Integrating Over Light Area

If the light source subtends a very small solid angle on the hemisphere, we will need to sample a lot of random outgoing rays before we find one that intersects the light source. For small or directional light sources, it is far more computationally efficient to integrate over the area of the light, rather than the hemisphere.

If we perform a change in variables from differential solid angle $d\omega$ to differential area $dA$, we must compensate for the change in volume.

$d\omega = \frac{dA \cos \theta_o}{r^2}$

I won't go through the derivation in this tutorial, but the interested reader can find it here: https://www.cs.princeton.edu/courses/archive/fall10/cos526/papers/zimmerman98.pdf. Substituting the above equation into the irradiance integral, we have:

$E(x, n) = \int_{A} L \cos \theta_i \frac{dA \cos \theta_o}{r^2}$

where $L$ is the emitted radiance of the light coming from the implied direction $-\omega$, which has an angular offset of $\theta_o$ from the light surface's surface normal. The corresponding single-sample Monte Carlo estimator is given by sampling a point on the area light, rather than a direction on the hemisphere. The probability $p(p)$ of sampling the point $p$ on an area $A$ is usually given by a uniform $\frac{1}{A}$.

$p \sim A$
$\omega = \frac{p-x}{\left\lVert {p-x} \right\rVert}$
$r^2 = \left\lVert {p-x} \right\rVert ^2$
$\hat{E}(x, n) = \frac{1}{p(p)}\frac{L}{r^2} |\omega \cdot x| |-\omega \cdot n|$

### Making Rendering Computationally Tractable with Path Integrals

The rendering equation describes the outgoing radiance $L_o(x, \omega_o)$ from point $x$ along ray $\omega_o$.

$L_o(x, \omega_o) = L_e(x, \omega_o) + \int_{\Omega} f_r(x, \omega_i, \omega_o) L_i(x, \omega_i) (-\omega_i \cdot n) d\omega_i$

where $L_e(x, \omega_o)$ is emitted radiance, $f_r(x, \omega_i, \omega_o)$ is the BRDF (material properties), $L_i(x, \omega_i)$ is incoming radiance, $(-\omega_i \cdot n)$ is the attenuation of light coming in at an incident angle with surface normal $n$. The integral is with respect to solid angle on a hemisphere.

How do we go about implementing this on a computer? Evaluating the incoming light to a point requires integrating over an infinite number of directions, and for each of these directions, we have to recursively evaluate the incoming light to those points. Our computers simply cannot do this.

Fortunately, path tracing provides a tractable way to approximate this scary integral.  Instead of integrating over the hemisphere $\Omega$, we can sample a random direction $w_i \sim \Omega$, and the probability-weighted contribution from that single ray is an unbiased, single-sample monte carlo estimator for Eq. 1.

$\omega_i \sim \Omega$
$\hat{L}_o(x, \omega_o) = L_e(x, \omega_o) + \frac{1}{p(\omega_i)} f_r(x, \omega_i, \omega_o) L_i(x, \omega_i) (-\omega_i \cdot n(x))$

We still need to deal with infinite recursion. In most real-world scenarios, a photon only bounces around a few times before it is absorbed, so we can truncate the depth or use a more unbiased technique like Russian Roulette sampling. We recursively trace the $L_i(x, \omega_i)$ function until we hit the termination condition, which results in a linear computation cost with respect to depth.

### A Naive Path Tracer

Below is the code for a naive path tracer, which is more or less a direct translation of the equation above.

def trace(ro, rd, depth):
p = intersect(ro, rd)
n = calcNormal(p)
if depth < 3:
# Uniform hemisphere sampling
rd2 = sampleUniformHemisphere(n)
Li = trace(p, rd2, depth+1)
radiance += brdf(p, rd, rd2)*Li*np.dot(rd, n)

We assume a 25 Watt square light fixture at the top of the Cornell Box that acts as a diffuse area light and only emits light from one side of the plane. Diffuse lights have uniform spatial and directional radiance distribution; this is also known as a "Lambertian Emitter'', and it has a closed-form solution for its emitted radiance from any direction:

LIGHT_POWER = np.array([25, 25, 25]) # Watts
LIGHT_AREA = 1.
return LIGHT_POWER / (np.pi * LIGHT_AREA)

The $\pi$ term is a little surprising at first, but you can find the derivation here for where it comes from: https://computergraphics.stackexchange.com/questions/3621/total-emitted-power-of-diffuse-area-light.

Normally we'd have to track radiance for every visible wavelength, but we can obtain a good approximation of the entire spectral power distribution by tracking radiance for just a few wavelengths of light. According to tristimulus theory, it is actually possible to represent all human-perceivable colors with 3 numbers, such as XYZ or RGB color bases. For simplicity, we'll only compute radiance values for R, G, B wavelengths in this tutorial. The brdf term corresponds to material properties. This is a simple scene in which all materials are Lambertian, meaning that the direction of the incident and exitant angles don't matter, so the brdf reflects incident radiance by multiplying its R, G, B values. Here are the BRDFs we use for various objects in the scene, expressed in the RGB basis:

lightDiffuseColor = np.array([0.2,0.2,0.2])
leftWallColor = np.array([.611, .0555, .062]) * 1.5
rightWallColor = np.array([.117, .4125, .115]) * 1.5
whiteWallColor = np.array([255, 239, 196]) / 255

We can make our path tracer more efficient by switching the integration variable to the projected solid angle $d\omega_i |\cos \theta|$. As discussed in the last section, this has the benefit of importance-sampling the solid angles that are proportionally larger due to Lambert's law, and as an added bonus we can drop the evaluation of the cosine term.

def trace(ro, rd, depth):
p = intersect(ro, rd)
n = calcNormal(p)
if depth < 3:
# Cosine-weighted hemisphere sampling
rd2 = sampleCosineWeightedHemisphere(n)
Li = trace(p, rd2, depth+1)

#### Reducing Variance by Splitting Up Indirect Lighting

The above estimator is correct and will get you the right result in expectation, but ends up being a high-variance estimator because the samples only have nonzero radiance when one or more of the path intersections intersects the emissive geometry. If you are trying to render a scene that is illuminated by a geometrically small light source -- a candle in a dark room perhaps -- the vast majority of path samples will never intersect the candle, and subsequently these samples will be sort of wasted. The image will appear very grainy and dark.

Luckily, the area integration trick we discussed a few sections back comes to our rescue. In graphics, we actually know where the light surfaces are ahead of time, so we can integrate over the emissive surface instead of integrating over the receiving surface's solid angles. We do this by performing a change of variables $d\omega = \frac{dA \cos \theta_o}{r^2}$.

To implement this trick, we can split up indirect lighting reflecting off point $p$ into two separate calculations: (1) direct lighting a the light source bouncing off of $p$, and (2) indirect lighting from a non-light source reflecting off of $p$. Notice that we have to modify the recursive trace term to ignore emittedRadiance from any lights it encounters, except for the case where light leaves the emitter and enters the eye directly (which is when depth=0). This is because for each point $p$ in the path, we are already accounting for an extra path that goes from an area light directly to $p$. We don't want to double count such paths!

def trace(ro, rd, depth):
p = intersect(ro, rd)
n = calcNormal(p)
if depth == 0:
# Integration over solid angle (eye ray)
# Direct Lighting Term
pA, M, pdf_A = sampleAreaLight()
n_light = calcNormal(pA)
if visibilityTest(p, pA):
square_distance = np.sum(np.square(pA - p))
w_i = normalize(pA - p)
dw_da = np.dot(n_light, -w_i)/square_distance  # dw/dA
radiance += (brdf(p, rd, w_i) * np.dot(n, w_i) * M) * dw_da
# Indirect Lighting Term
if depth < 3:
# Integration over cosine-weighted solid angle
rd2 = sampleCosineWeightedHemisphere(n)
Li = trace(p, rd2, depth+1)

The sampleAreaLight() function samples a point $p$ on an area light with emitted radiance $M$ and also computes the probability of choosing that sample (for a uniform emitter, it's just one over the area).

The cool thing about this path tracer implementation is that it features three different ways to integrate irradiance: solid angles, projected solid angles, and area light surfaces. Calculus is useful!

### Ignoring Photometry

Photometry is the study of how we convert radiometric quantities (the outputs of the path tracer) to the color quantities perceived by the human visual system. For this tutorial we will do a crude approximation of the radiometric-to-photometric  by simply clipping the values of each R, G, B radiance to a maximum of 1, and display the result directly in matplotlib.

And voila! We get a beautifully path-traced image of a Cornell Box. Notice how colors from the walls "bleed" onto adjacent walls, and the shadows cast by the boxes are "soft".

### Performance Benchmarks: P100 vs. TPUv2

Copying data between accelerators (TPU, GPU) and host chips (CPU) is very slow, so we'll try to compile the path tracing code into as few XLA calls from Python as possible. We can do this by applying the jax.jit operator to the entire trace() function, so the rendering happens completely on the accelerator. Because trace is a recursive function, we need to tell the XLA compiler that we are actually compiling it with a statically fixed depth of 3, so that XLA can unroll the loop and make it non-recursive. The vmap call then transforms the function into a vectorized version.

trace = jit(trace, static_argnums=(3,)) # optional
render_fn = lambda rng_key, ro, rd : trace(rng_key, ro, rd, 0)
vec_render_fn = vmap(render_fn)

According to jax.local_device_count(), a Google Cloud TPU has 8 cores. The code above only performs SIMD vectorization across 1 device, so we can also parallelize across multiple TPU cores using JAX's pmap operator to get an additional speed boost..

# vec_render_fn = vmap(render_fn)
vec_render_fn = jax.soft_pmap(render_fn)

How fast does this path tracer run? I benchmarked the performance of a (1) manually-vectorized Numpy implementation, (2) a  vmap-vectorized single-pixel implementation, and (3) a manually-vectorized JAX implementation (almost identical in syntax to numpy). Jitting the recursive trace function was very slow to compile (occasionally even crashed my notebook kernel), so I also implemented a version where the recursion happens in Python but the loop body of trace  (direct lighting, emission, sampling rays) are executed on the accelerator.

The plot below shows that JAX code is much slower to run on the first sample because the just-in-time compilation has to compile and fuse all the necessary XLA operations. I wouldn't read too carefully into this plot (especially when comparing GPU vs. TPU) because when I was doing these experiments I encountered a huge amount of variance in compile times. Numpy doesn't have any JIT compilation overhead, so it runs much faster for a single sample, even on the CPU.

What about a multi-sample render? After the XLA kernels have been compiled, subsequent calls to the trace function are very fast.

We see that there's a trade-off between compilation time and runtime: the more we compile, the faster things run when performing many samples. I haven't tuned the code to favor any accelerator in particular, and this is the first time I've measured TPU and GPU performance under a reasonable path tracing workload. Path tracing is an embarrassingly parallel workload (on the pixel level and image sample level), so it should be quite possible to get a linear speedup from using more TPU cores. My code currently does not do that because each pmap'ed worker is blocked on rendering an entire image sample. If you have suggestions on how to accelerate the code further, I'd love to hear from you.

### Summary

In this blog post we derived the principles of physically based rendering from scratch, and implemented a differentiable path tracer in pure JAX. There are three kinds of radiometric integrals (solid angle, projected solid angle, and area) that come up in a basic implementations of a path tracer and we used all three to implement a path tracer that separates direct lighting contributions from area lights separately from indirect lighting bouncing from non-light surfaces.

JAX provides us with a lot of useful features to implement this:
• You can write a one-pixel path tracer and vmap it into a vectorized version without sacrificing performance. You can parallelize trivially across devices using pmap.
• Code runs on GPU and TPU without modifications.
• Analytical surface normals of signed distance fields provided by automatic differentiation.
• Lightweight enough to run in a Jupyter/Colaboratory notebook, making it ideal for trying out graphics research ideas without getting bogged down by software engineering abstractions.
There are still some sharp bits with JAX because graphics and rendering workloads are not its first-class customers. Still, I think there is a lot of promise and future work to be done with combining the programmatic expressivity of modern deep learning frameworks with the field of graphics.

We didn't explore the differentiability of this path tracer, but rest assured that the combination of ray-marching and Monte Carlo path integration makes everything tractable.  Stay tuned for the next part of the tutorial, when we mix differentiation of this path tracer with neural networks and machine learning.

### Acknowledgements

Thanks to Luke Metz, Jonathan Tompson, Matt Pharr for interesting discussion a few years ago when I wrote the first version of this code in TensorFlow. Many thanks to Peter Hawkins, James Bradbury, and Stephan Hoyer for teaching me more about JAX and XLA. Thanks to Yining Karl Li for entertaining my dumb rendering questions and Vincent Vanhoucke for catching typos.

### Fun Facts

• Jim Kajiya's first path tracer took 7 hours to render a 256x256 image on a 280,000 USD IBM computer. By comparison, this renderer takes about 10 seconds to render an image of similar size, and you can run it for free with Google's free hosted colab notebooks that come with JAX pre-installed.
• I didn't discuss photometry much in this tutorial, but it turns out that the SI unit of photometric density, the candela, is the only SI base unit related to a biological process (human vision system).
• Check out my blog post on normalizing flows for more info on how "conservation of probability mass'' is employed in deep learning research!
• OpenDR was one of the first general-purpose differentiable renderers, and was technically innovative enough to merit publishing in ECCV 2014. It's remarkable to see how easy writing a differentiable renderer has become with modern deep learning frameworks like JAX, Pytorch, and TensorFlow.