(This is retold from memory, some of the details might be slightly off).
The date is April 14th, 2017. For most of the past month, a distributed team of engineers from UC Berkeley, Lawrence Berkeley National Labs, Intel, MIT and Julia Computing have been working day and night on the codebase of the Celeste project. The goal of the Celeste application is to crunch 55,000 GB (i.e. 55TB) of astronomical images and produce the most up to date catalog of astronomical objects (stars and galaxies at that point), along with precise estimates (and, uniquely, well defined uncertainty estimates) of their positions, shapes and colors. Our goal was ambitious. We'd apply variational inference to a problem four orders of magnitude larger than previously attempted. At peak, there'd be 1.5 million separate threads running on over 9,000 separate machines. And we'd attempt to do all of this, not in a traditional HPC programming language like C++ or Fortran, but in Julia, a high-level dynamic programming language with all of its dynamic features including garbage collection.
A year prior, Celeste required several minutes to process even a single light source. Through improvements and some basic optimization, we'd gotten this down to a few seconds. Already almost good enough for the current data set we had (at this performance level, a few hundred machines would be able to crunch through the data set in a few days), but not to demonstrate that we could handle the data coming from the next generation telescope (LSST). For that we want to demonstrate that we could scale to the entire machine and more importantly make good use of it. To demonstrate this capability, we'd set ourselves a goal: Scale the application to the entirety of the Cori supercomputer at LBNL (at the time the world's fifth largest computer) and perform at least 10^15 double precision floating point operations per second (one petaflop per second, or about 1/10th of the absolute best performance ever achieved on this machine on code that just tried to multiply some giant matrices). If we managed to achieve this, we would submit our result as an entry for consideration to the annual Gordon Bell prize. The deadline for this entry was the evening of April 15th (a Saturday).
Against this background, I found myself coming to the office (at the time a rented desk across the street from Harvard Business School in Boston) on April 14th. My collaborators were doing the same. In consideration of the upcoming Gordon Bell deadline, the system administrators at LBNL had set aside the entire day to allow different groups to run at the scale of the full machine (generally the machine is shared among many much smaller jobs - it is very rare to get the whole machine to yourself). Each group was allocated an hour and we'd be up second. We took a final look at the plan for the day. Which configurations we'd run (we'd start with the full scale run for 15-20 minutes to prove scalability and then kick off several smaller runs in parallel to measure how our system scaled up with the number of nodes), who'd be responsible for submitting the jobs, looking at the output, collecting the results, etc. We weren't sure what would happen. At the beginning of the week we were able to do a smaller scale run on a thousand nodes. At the time our extrapolated performance was a few hundred teraflops. Not enough. After a week of very little sleep, I'd made some additional enhancements: a lot of general improvements to the compiler, enabling the compiler to call into a vendor-provided runtime library for computing the exponential function in a vectorized function and a complete change to the way we were vectorizing the frequently executed parts of the code (which consisted of a few thousand lines at the source level). I had tested these changes on a few hundred nodes and things looked promising, but we had never tested these changes on anything close to the scale we were attempting to run at.
A few minutes before noon (which was our designated starting time) we got word that the previous group was about to wrap up. We gathered on a video call anxiously watching a chat window for the signal that the machine was released to us and just like that, it was "Go. Go. Go." time. With the stroke of a button, the script was launched and 9,000 machines in the lower levels of the NERSC building in the Berkeley hills roared into action. More than 3MW of power, the equivalent power consumption of a small village, kicked into action to keep these machines supplied.
After an anxious few minutes, a call out on the video call "Something's wrong. We're not getting any results". We were seeing threads check in that they were ready to start processing data, but no data processing ever happened. "Alright, kill it and try again. We'll look through the logs and figure out what happened." After a few minutes, while the second run was starting up, it had become clear that three of the 1.5 million threads had crashed at startup. Worse, the log entries looked something like:
signal (11): Segmentation fault: 11
unknown function (ip: 0x12096294f)
unknown function (ip: 0xffffffffffffffff)
Not only had it crashed, but the crash had corrupted the stack and there was no indication what had caused the problem. "If this also crashes, what do we do?" "We can go back to the version from Monday, but we know it's not fast enough." "There's no way we'll be able to debug this before our time slot runs out. And even if we could, we just can't have the machine sitting idle while we do." "Go see if the next group is ready to run. If so, we'll try to reclaim our remaining time at the end of the day". And so, after 20 minutes on the machine we released it to the next group and set to debugging. If we couldn't figure it out we'd run the Monday version.
We had about five hours to figure out what the problem was. About an hour and a half in, through some guesswork and careful examination of the logs, we were able to correlate the hex numbers in the stack trace to assembly instructions in the binary. None of the three crashes were in the same place, but a pattern emerged: two of the locations pointed right after calls to the new vendor library we had enabled. A hastily sent email to the vendor's engineering team was met with disbelief, the library was frequently used in applications with up to tens or hundreds of thousands of threads. On the first call from each thread, the library would dynamically adjust itself depending on the CPU it was running on.
Looking at the disassembly, this code was clearly designed to be thread safe, and we couldn't see any obvious errors. Nevertheless, the pattern fit. The library did its own stack manipulation (since it was designed to not use stack in most cases, which would explain the unwinder's inability to give us a good backtrace), the crash happened very early in the program before any data was processed (which would be consistent with being in the initialization routine). Through some very careful binary surgery, we patched out the initialization routine, hardcoded the correct implementation and did a small scale test run on testbed system (after all the main system was currently serving other groups). Nothing crashed, but that didn't mean much - the scale of the testbed didn't even reach the scale of our test runs.
We got the machine back at the end of the day as scheduled. A binary had been built with our hack and we were ready for one last Hail Mary attempt. Thirty minutes on the clock until the machine was returned back to regular use. Once again, the 9,000 nodes roared into action, but this time - it worked. Within a minute our performance metric showed that not only was it working, but we had indeed hit our goal - we clocked in at 1.54 petaflops. There was audible relief all around on the call. These results in hand, we were able to negotiate for an additional, smaller scale, slot the next day to run our scaling experiments, and we still had to finish writing the paper afterwards, but the hard part was over.
I hope this account gives some insight into the experience. The Celeste project was one of the most ambitious projects I've ever been a part of, trying to push the envelope in so many different ways: statistical methods, parallel computing, programming languages, astronomy. Many improvements to Julia over the two years following this run were the direct result of experience gathered trying to make this happen - it'd be much easier this time around after all these improvements. A few weeks later I got an email that there had indeed been a bug in the initialization routine, which had now been fixed. The fact that nobody else had ever run into it was probably merely a question of scale. At 1.5 million running threads, even one-in-a-million can happen every single time. In the end running on a supercomputer is just like running on a regular computer, except a million times larger and a million times more stressful.