Programming | Probably Dance

May 1, 2026

Apple is Holding my Pictures Hostage Until I Accept Their New Terms of Service

It started off a few months ago, when my iPad suddenly couldn’t play videos any more that it had recorded. It would show the first frame, but hitting the play button wouldn’t do anything. I googled around but couldn’t find anything. I had recently installed an update, so I figured I’ll just wait for the next update to fix things.

But the next update didn’t fix things, and it turns out the reason is actually a little dystopian: Apple has deleted my local copies of my videos and will only give them back if I sign their new terms of service.

Read the rest of this entry »

Leave a comment

March 1, 2026

Word Map – A Game About Hill Climbing and Stepping Stones

I really like Semantle because I noticed that progress is similar to the description in Why Greatness Cannot Be Planned: The Myth of the Objective by Kenneth Stanley. The objective function is very clear and you can hill-climb your way to the top, but hill-climbing is actually very difficult in high-dimensional spaces, so you need to explore to find stepping stones and then exploit based on those stepping stones. I have long looked for a game where I can practice that behavior and Semantle almost was that game, so I decided to evolve it to make it easier to practice doing the right behaviors.

The result is Word Map. Give it a play.

Read the rest of this entry »

3 Comments

February 10, 2026

How Programmers Spend Their Time

I submitted a tiny patch to flash attention. The necessary typing for the change takes less ten seconds, but the overall change took more than ten hours So where does the time go?

It started when coworker had a bug where cudnn attention would crash randomly. We looked at his unreleased changes and concluded that they couldn’t possibly cause this, so we suspected that we had a lingering bug that was exposed by making harmless changes to related code.

Step 1, a few hours: My coworker tried to figure this out just by running the code repeatedly, trying out various theories. The bug was hard to reproduce so this took hours without much progress.

Step 2, 1 hour: I thought this is a good reason to try out compute sanitizer. It would be easiest to just run it on our existing tests to see if it finds any issues without my coworker’s changes. But the tests run on another box because they require certain GPUs, which means you have to run the tests through some layers. Unfortunately compute sanitizer really wants to be in charge of the program, so we have to convince those layers to let compute sanitizer run the whole thing. It keeps on failing and we can’t figure out why, until eventually I suspect that the issue is that the tests run in a sandbox, and the sandbox is strict enough that it breaks compute sanitizer somehow. This turned out to be true and we probably wasted an hour together.

Read the rest of this entry »

Leave a comment

January 31, 2026

How LLMs Keep on Getting Better

If you look at the source code of a modern open source LLM, it looks very similar to the transformer described in the “Attention is all you need” paper from 2017. It’s just a stack of exactly three components: attention blocks, matmuls, and norm layers. The big algorithmic changes, like Mamba 2 or linear attention variants, aren’t really used yet. But look closer and almost everything has changed in the details.

The story of how LLMs keep on getting better is one of pushing for big and little improvements in a hundred different directions. Turns out hill climbing can get you to a really good place if you just climb along enough dimensions. This makes it hard to notice changes as they’re happening because they’re so small, so lets look at the last two years and see how many small changes there were to add up to the big improvements we saw.

Read the rest of this entry »

4 Comments

September 28, 2025

Avalanche Studios NYC Retrospective – An Ambitious Company Ruined by Bad Development Practices

I’ve wanted to write about this since the studio closed a year ago. Now that Contraband is also canceled I think it’s time, especially since Contraband was one of the big reasons why I left the company. The blog post turned out much bigger than expected though. There was a lot to get off my chest…

I worked at Avalanche Studios NYC from July 2012 to December 31st 2019, seven and a half years. I wasn’t there for most of Contraband’s development but it was obvious early on that it was going to be a very difficult project. If anything I’m surprised it lasted that long before being canceled.

The studio was born out of ambition. It failed because it could not deliver on that ambition. So this will necessarily be negative. But we had a good run and made two good games. I have so many memories and thoughts that I need to get written down somewhere, so lets celebrate the good and talk about the troubles.

Read the rest of this entry »

9 Comments

June 19, 2025

Revisiting Knuth’s “Premature Optimization” Paper

The most famous quote from Knuth’s paper “Structured Programming with go to Statements” is this:

There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.

People always use this quote wrong, and to get a feeling for that we just have to look at the original paper, and the context in which it was written. The paper isn’t actually about optimization. It’s about when we have to use goto statements because structured programming couldn’t express certain things at the time. Or at least it couldn’t do so efficiently, requiring extra checks, and that’s why Knuth has to talk about performance: The topic he is addressing is “Can we get rid of goto statements without sacrificing performance?”

Read the rest of this entry »

3 Comments

May 31, 2025

I’m Open-Sourcing my Custom Benchmark GUI

I think one of the reasons why I was able to do good performance work over the years is that at some point I started taking benchmarking seriously enough to write my own library. I used to use Google Benchmark, which is a fine library, but at some point you realize that you need a GUI to really scale up benchmarks¹. Here is a github link, and this video gives a quick intro:

The main problems it tries to address is:

Getting good numbers by running benchmarks repeatedly, visualizing them in context, picking a single opinionated good visualization, handling noise and even adding a bit of well-justified noise, and being careful about what statistics to do on the numbers.
Dealing with the inevitable combinatorial explosion of benchmarks when you want to try different data structures (min-max heap vs interval heap vs binary heap) with different operations (make_heap, push, pop) on different types (int vs string), different compilers, debug build vs release build, different variants of the code (e.g. trying loop unrolling), different input lengths etc. The full combinatorial explosion might be millions or billions of possible benchmarks. I want to be able to get a first impression for a subset in a few minutes. And then if I want less noisy results I can let it run overnight. And then I can try a new variation and visualize it together with the overnight results in under a minute.
Various ergonomic issues. Making it easy to select which numbers are together on the screen. Having the numbers as a graph first, CSV second. Being robust to the code crashing halfway through a long run: Record the partial results and be able to resume the same run. Making it easy to attach a profiler to one specific benchmark that I’m interested in.

This sounds complicated, and I have to admit that this is very much an app written by a programmer for a programmer, but the whole point of a GUI is that I can make this both more powerful and easier to use at the same time. In fact I think the patterns might be more widely useful for people who do slow-running experiments of other kinds (like training a ML model).

Read the rest of this entry »

Leave a comment

February 8, 2025

Why Does Integer Addition Approximate Float Multiplication?

Here is a rough approximation of float multiplication (source):

float rough_float_multiply(float a, float b) {
    constexpr uint32_t bias = 0x3f76d000;
    return bit_cast<float>(bit_cast<uint32_t>(a) + bit_cast<uint32_t>(b) - bias);
}

We’re casting the floats to ints, adding them, adjusting the exponent, and returning as float. If you think about it for a second you will realize that since the float contains the exponent, this won’t be too wrong: You can multiply two numbers by adding their exponents. So just with the exponent-addition you will be within a factor of 2 of the right result. But this will actually be much better and get within 7.5% of the right answer. Why?

Read the rest of this entry »

8 Comments

October 7, 2024

Initial CUDA Performance Surprises

I am somehow very late to learning CUDA. I didn’t even know until recently that CUDA is just C++ with a small amount of extra stuff. If I had known that there is so little friction to learning it, I would have checked it out much earlier. But if you come in with C++ habits, you’ll write suboptimal code, so here are some lessons I had to learn to get things to run fast.

Memory Coalescing

If you have multiple threads operating on an array in C++, you probably want to iterate like this:

std::vector<T> vec = ...;
size_t per_thread = vec.size() / num_threads;
T * my_slice = vec.data() + per_thread * my_thread_i;
for (size_t i = 0; i < per_thread; ++i) {
    do_something(my_slice[i]);
}

Meaning each thread iterates over a contiguous chunk of memory. In CUDA this is going to be slow because you want the threads to load memory together. So if thread 0 loads bytes 0 to 15, then you want thread 1 to load bytes 16 to 31 and thread 2 to load bytes 32 to 47 etc. So the loop instead has to look like this:

T * data = ...;
size_t num_elements = ...;
for (int i = my_thread_i; i < num_elements; i += num_threads) {
    do_something(data[i]);
}

This is called “memory coalescing” where adjacent threads use adjacent memory. On a loop with a small body (dot product) this is 3x faster.

Read the rest of this entry »

Leave a comment

April 9, 2024

How I use LLMs to program

Studies have shown that LLMs help novice programmers more than experienced programmers. This matches my experience. At work I see that interns or new hires have some LLM window open almost all the time. I use them maybe once a week. But you could say the same thing about Stack Overflow. I used it all the time when I started programming. Now I use it occasionally. While it’s easy to point at their obvious issues, I think they are also clearly a net-positive on average. So how do LLMs help me?

Big plus: Languages that I don’t use as often

I don’t often write SQL statements. I can obviously write the simple ones, but SQL is a language that has all the features you could ever possibly want, and I don’t know how to use them and don’t know how to google for them. So I ask a LLM. Similarly for javascript/css/html programming. I used to hate doing web frontend work, now it’s not so bad because LLMs can help me get out of the tricky edge cases.

I have also used LLMs to translate functionality from one language to another. E.g. if I know what a function is called in C++ but I can’t find an equivalent one in the standard library of another language, an LLM will often do a decent first pass of rewriting the C++ function in the other language.

Small minus: The code is overly generic

Read the rest of this entry »

Leave a comment

Category: Programming

Memory Coalescing

Big plus: Languages that I don’t use as often

Small minus: The code is overly generic