How LLMs Keep on Getting Better

by Malte Skarupke

If you look at the source code of a modern open source LLM, it looks very similar to the transformer described in the “Attention is all you need” paper from 2017. It’s just a stack of exactly three components: attention blocks, matmuls, and norm layers. The big algorithmic changes, like Mamba 2 or linear attention variants, aren’t really used yet. But look closer and almost everything has changed in the details.

The story of how LLMs keep on getting better is one of pushing for big and little improvements in a hundred different directions. Turns out hill climbing can get you to a really good place if you just climb along enough dimensions. This makes it hard to notice changes as they’re happening because they’re so small, so let’s look at the last two years and see how many small changes added up to the big improvements we saw.

Big Visible Changes

  • models now “think” before giving an answer
  • models use “tools” like web search or writing Python programs
  • models have much longer context windows
  • the scaffolding around models is better (e.g. Claude Code or “deep research”)
  • models understand images and generate them

Big Invisible Changes

  • Mixture of Experts – Run giant models but only use a fraction for each token
  • Better GPUs – More memory and faster, especially at lower precision
  • Better data – people curate their training data much more now

The main point of this blog post is that we got many, many small improvements, so it’ll necessarily be long and shallow to go through it all:

Thinking Models

Models can now expend tokens to think out loud, which improves their final answer. This doesn’t look that complicated when you use it, but it required adding a new training phase of “reinforcement learning” which feels a bit more like traditional AI than neural networks do. You no longer just propagate a loss to predict the next token, you have to come up with good problems that teach the network the behaviors you want. I know very little about it. I liked that LLMs were based on text. Fewer worries about them having wrong objectives and wiping out humanity when all they do is predict the next token. But this reinforcement learning sure makes them better, e.g. at coding.

RLHF was a precursor, then OpenAI had an existence proof in the form of o1, and then everyone else fast-followed because it turns out there were many ways of doing this. Deepseek R1 is the most famous one, and they did make a genuine algorithmic improvement in GRPO. But if you look at the size of the step improvement of GRPO over PPO (which came out in 2017), it really isn’t a large change. That’ll be a theme. A lot of this is down to finding good problems to train on, which we’ll also see in the “better data” section below.
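To give a flavor of how small the GRPO change is: the core trick is to sample a group of answers for the same prompt and use the group’s own mean and standard deviation as the baseline, instead of training a separate value model like PPO does. A rough sketch of just that part, leaving out the clipped policy-gradient loss and the KL penalty:

```python
import torch

def grpo_advantages(rewards):
    """rewards: (num_prompts, group_size) scores for a group of sampled answers
    to the same prompt. The advantage is just the within-group z-score."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-4)

# e.g. 8 attempts at one math problem, scored 1 if the final answer was right
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]])
print(grpo_advantages(rewards))  # correct attempts get a positive advantage
```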

Tool Use

Two years ago we were talking about emergent abilities as models scale up. Then we just started giving them more abilities directly. LLMs started using tools like “web search”. And instead of trying to do math in token-space they just write little Python programs and run them for you. These tools let LLMs compensate for their weak spots. Instead of having to make up next tokens for an answer it doesn’t know, a model can just google it for you. And Python is just better at math than LLMs are, so they no longer make basic mistakes.
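The loop around the model isn’t complicated either. Roughly like this, with made-up function names (the real protocols differ in the details):

```python
def answer_with_tools(model, user_message, tools, max_steps=10):
    """Let the model call tools until it produces a final answer.
    `model.generate` and the reply object are hypothetical stand-ins."""
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        reply = model.generate(messages, tools=tools)
        if reply.tool_call is None:
            return reply.text  # the model answered directly
        # Run the requested tool (e.g. web_search or run_python) and feed the
        # result back so the model can continue with real data instead of
        # making up the next tokens.
        result = tools[reply.tool_call.name](**reply.tool_call.arguments)
        messages.append({"role": "tool", "name": reply.tool_call.name,
                         "content": str(result)})
    return "gave up after too many tool calls"
```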

Longer Context Windows

So many changes led to this. Remember that Llama 3 had a context length of 8192 tokens. And then Llama 3.1 had a context length of 128k tokens. That particular one was mostly better understanding of how to scale up RoPE. But there were also new extensions like YaRN. And then newer models have even longer context lengths. For a while it seemed like all the big labs were releasing one paper after another on how to get a million token context window. You also get small differences like how Deepseek applies its position embedding to only part of the query and key vectors (and leaves the rest without position embedding) or how GPT-OSS alternates between layers with small sliding windows and layers with full attention. Just different people trying different things.
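For reference, here is roughly what RoPE looks like and where the scaling knob sits. This is one common layout (implementations differ in how they pair up dimensions), and real long-context schemes like Llama 3.1’s scale different frequencies by different amounts rather than using a single factor:

```python
import torch

def rope_inv_freq(head_dim, base=10000.0, scale=1.0):
    # One inverse frequency per pair of dimensions. scale > 1 stretches
    # positions, which is (roughly) the position-interpolation trick for
    # fitting longer contexts into the frequencies the model was trained on.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    return inv_freq / scale

def apply_rope(x, positions, inv_freq):
    # x: (seq, head_dim). Rotate each pair of dims by a position-dependent angle.
    angles = positions[:, None].float() * inv_freq[None, :]  # (seq, head_dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```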

And when you do run out of the long context of these models, they can now compact it and you can keep going. Which in practice just means summarizing the important bits and discarding the details. Unfortunately not much has been published on the details.

Train Using More GPUs

One problem with the long context window is that during training you just can’t fit all the activations into GPU memory. So people got really into splitting the training across as many GPUs as possible. This isn’t new, but there were dozens of little and big inventions for this, like Ring Attention and fused matmul/networking kernels.

Google released the Jax Scaling book with lots of techniques, Huggingface did their own take on this with the Ultrascale Playbook. The latter says “Reading Time: 2-4 days” which is optimistic. And after reading that you will still only have a surface-level understanding of what it says. This stuff is really difficult and you’ll tank performance a few times by e.g. sharding FSDP across too many GPUs before getting it right.
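As a taste of what the simple end of this looks like: wrapping a model in FSDP with the sharding kept within a node, so the all-gathers stay on fast intra-node links. This is a sketch only; real setups add mixed precision, activation checkpointing, auto-wrap policies and more, and MyTransformer is a stand-in:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

dist.init_process_group("nccl")  # launched via torchrun, one process per GPU
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = MyTransformer().cuda()  # hypothetical model
model = FSDP(
    model,
    # Shard parameters within a node and replicate across nodes, instead of
    # sharding across every GPU in the job. This is one of the knobs that
    # decides whether you tank performance or not.
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,
)
```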

KV Cache Memory Improvements

The long context length is still a big memory problem, so models found other ways to save memory. GQA is an easy way to decrease the KV-cache size: several query heads share one set of keys and values. Deepseek went more aggressive with MLA. PagedAttention helps with inference. And of course people also compressed their KV caches to smaller data types, which brings us to the next section.
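But first, the back-of-the-envelope math for why GQA helps so much. The numbers here are made up for illustration, not taken from any particular model:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    # Keys and values for every layer, head and position (bfloat16 = 2 bytes)
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

# 64 query heads with one KV head each, vs. GQA with 8 shared KV heads
full_mha = kv_cache_bytes(layers=64, kv_heads=64, head_dim=128, seq_len=128_000)
gqa      = kv_cache_bytes(layers=64, kv_heads=8,  head_dim=128, seq_len=128_000)
print(full_mha / 2**30, "GiB vs", gqa / 2**30, "GiB")  # ~250 GiB vs ~31 GiB
```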

Smaller Data Types

Another way to save memory is to use smaller data types. Instead of float32 use bfloat16. Instead of bfloat16 use float8, or why not just use FP4? We got both good hardware support for smaller data types and also algorithmic improvements (still happening) to make models robust to the loss of precision. I mean FP4 is a crazy data type in that I can enumerate all the possible values: 0, 0.5, 1, 1.5, 2, 3, 4, 6 (plus the same numbers negative). It’s really a testament to how robust neural networks have gotten that this works at all. Ten years ago neural networks were unstable by default and you had to try many seeds to get anything working (remember that we didn’t even know how to properly initialize linear layers until 2015) and now they’re so robust that you can throw crazy low-precision data types at them and they still work. GPT-OSS uses FP4. Most of the stability improvements were not in the last two years, but the smaller data types were. You see considerations for which data type to use all over the big papers, e.g. Deepseek thought very carefully about this.
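To make concrete what using FP4 means, here is a sketch of round-to-nearest “fake” quantization to those eight values with a per-block scale. Real FP4 formats (e.g. the MXFP4 that GPT-OSS uses for its MoE weights) pack two values per byte and use shared power-of-two scales, so treat this as the idea only:

```python
import torch

# All non-negative values representable in FP4 (E2M1), as enumerated above
FP4_VALUES = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quantize_fp4(x, block_size=32):
    """Round each block of weights to the nearest FP4 value after scaling the
    block so its largest magnitude maps to 6. Assumes x.numel() is divisible
    by block_size."""
    values = FP4_VALUES.to(x.device)
    shape = x.shape
    x = x.reshape(-1, block_size)
    scale = x.abs().amax(dim=1, keepdim=True) / 6.0
    scale = torch.where(scale == 0, torch.ones_like(scale), scale)
    scaled = x / scale
    # nearest representable magnitude, keep the sign
    idx = (scaled.abs().unsqueeze(-1) - values).abs().argmin(dim=-1)
    return (values[idx] * scaled.sign() * scale).reshape(shape)
```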

Better Hardware

We also got better hardware. B200s gave us very fast FP4 performance. But mostly we got more memory. The H100 had 80GB of memory, the H200 has 141GB, the B200 has 180GB and the B300 has 288GB. Look at my sections above for why people want this. (also as an aside, the PagedAttention paper I linked above talks about using an A100 with 40GB of memory. That seems so small now, just over two years later…)

And then everyone started using TPUs, hardware that was built specifically for neural networks. This is less of a big deal than you’d think because Nvidia GPUs are now also mostly neural network machines, but it did make things cheaper than if there had been no competition.

Also networking got faster. And Nvidia released the NVL72 which is 72 GPUs connected together with really fast networking, to make all these many-GPU training jobs run better. This again required lots of little improvements to take advantage of, and to run robustly.

More Efficient Algorithms

Flash Attention 3 came out and was better and more complicated. Everyone is anxiously waiting for the FA4 paper.

At the same time matrix multiplication became even more crazy. Since these GPUs are now mostly giant matmul machines, you’d think that it would be easy to make them do a matrix multiplication. But no, a fast matmul requires crazy code and it’s still improving all the time.

And then of course you have to fuse that with networking now, so that while your matmul works on the next block, the same kernel can do networking with all the other GPUs in your cluster to combine the results of the previous block with results from a different GPU. Because it’s not optimal to do a matmul and then do networking, like we did two years ago. You want to do both at the same time.

Also megakernels are maybe a thing now? I haven’t seen them used in open-source models yet.

Luckily torch.compile also became good in the last two years. Often you can write reasonable code and the compiler will turn it into efficient code. Which at least makes it easier to try out the latest papers.
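Using it really is as simple as it sounds (with MyTransformer as a stand-in for your model):

```python
import torch

model = MyTransformer().cuda()   # hypothetical model
compiled = torch.compile(model)  # traces and fuses kernels on the first calls
tokens = torch.randint(0, 50_000, (1, 2048), device="cuda")
out = compiled(tokens)           # subsequent calls reuse the compiled kernels
```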

Mixture of Experts

Another thing you can do is just not run the whole model for every token. E.g. in GPT-OSS 120B you actually only have about 5B active parameters for each token. The matmuls are split into “experts” and you only do a subset for each token, decided at runtime. This sounds easy but required algorithmic improvements to work at training time. Backpropagation alone won’t do it anymore, you need to encourage the model to use all the experts at training time (typically with an extra load-balancing loss). Also we saw lots of experimentation with hyperparameters, like how many experts, what fraction of experts is active (usual numbers range from about 3% in Kimi K2 to 25% in Grok), whether there are shared experts and how many, how exactly the routing works… And obviously there had to be algorithmic improvements to make this efficient at runtime, which is still very much ongoing.
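A minimal sketch of the routing itself, leaving out the load-balancing loss and all of the efficiency work (real implementations never loop over experts like this):

```python
import torch
import torch.nn.functional as F

def moe_layer(x, router_w, experts, top_k=4):
    """x: (tokens, dim), router_w: (dim, num_experts), experts: list of MLPs.
    Each token goes to its top_k experts and their outputs are mixed by the
    renormalized router probabilities."""
    probs = F.softmax(x @ router_w, dim=-1)       # (tokens, num_experts)
    weights, chosen = probs.topk(top_k, dim=-1)   # which experts per token
    weights = weights / weights.sum(dim=-1, keepdim=True)
    out = torch.zeros_like(x)
    for slot in range(top_k):
        for e, expert in enumerate(experts):
            mask = chosen[:, slot] == e
            if mask.any():
                w = weights[mask, slot].unsqueeze(-1)
                out[mask] += w * expert(x[mask])
    return out
```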

Larger Tokenizers

The vocabulary size of these models keeps on going up. Apparently that makes them better somehow. Llama 2 had 32k tokens in its vocabulary, Llama 3 had 128k, GPT-OSS has 201k. This means the embedding layer and the un-embedding layer are a significant fraction of the active 5B params in that model. The hidden dimension of GPT-OSS is 2880, and 201k*2880 ≈ 580m parameters each for the embedding and unembedding layers, for a combined total of 1.16B. Meaning more than 20% of the active params are just there to go from token indices to hidden dimension and back.

Slower Scaling

Models are not getting bigger at the same speed as they used to. Deepseek V3 came out a year ago with 671B total params, out of which 37B are active for each token, and Kimi K2.5 has 1T total params out of which 32B are active for each token. Gone are the days when the number of params multiplies by 10. And even then, the big models are MoE now. I don’t think anyone has gone bigger than Llama 3’s 405B active params, and that came out 1.5 years ago.

Since we can train on very large numbers of GPUs now, each of which has enormous amounts of memory, I don’t think the limit here is ability any more. (like it would have been two years ago) Everyone can figure out how to train giant models now. I’d guess the limits are given by diminishing returns, and by high hardware prices.

Distilling Models

One way that models actually got smaller is through distillation. We saw this with Claude Opus and Sonnet. Anthropic trained a really big model, Opus, and then trained a smaller model, Sonnet, to imitate it. This makes the models cheaper and faster to run while only losing a little bit of quality.
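The textbook recipe is to train the small model to match the big model’s distribution over next tokens instead of just the single correct token. I have no idea what Anthropic actually does, but the basic loss looks like this:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    # Cross-entropy of the teacher's soft token distribution against the
    # student's, which is the KL divergence up to a constant
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    return -(t_probs * s_logprobs).sum(dim=-1).mean() * temperature ** 2
```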

Attention Sinks

Attention always had weird effects where the model seemed to pay a lot of attention to the first token in the sequence. Eventually the theory for this became that this happens when there are no important tokens, so the first token acts as a “sink” when nothing needs to be attended to. Recently people added explicit sinks to their attention layers (GPT-OSS): an extra learned logit that competes in the softmax, so if no real token gets enough weight, the sink soaks up the attention mass and the real attention scores shrink toward zero. And Qwen noticed that you can get the same benefits by putting one more gate after attention. Apparently this just makes the model straight-up better along all dimensions at the cost of minimal extra compute, because the model has to compensate for less weirdness.
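A rough sketch of the sink idea (not GPT-OSS’s actual kernel, which folds the learned per-head sink directly into its flash attention implementation):

```python
import torch
import torch.nn.functional as F

def attention_with_sink(q, k, v, sink_logit):
    """Softmax attention with one extra learned logit that competes in the
    softmax but contributes no value. When no real token scores highly, the
    sink soaks up the attention mass and the real weights shrink toward zero."""
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5  # (..., q_len, k_len)
    sink = sink_logit * torch.ones_like(scores[..., :1])   # one extra column
    weights = F.softmax(torch.cat([scores, sink], dim=-1), dim=-1)
    return weights[..., :-1] @ v                           # drop the sink column
```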

Better Data

The Olmo papers are always great, and in them you can see clearly how better data became a focus. OLMo 2 talked about various architectural decisions, algorithmic improvements, training stability, and yes, also data. But read OLMo 3 in comparison and it’s all about training data. Once again dozens of improvements. Details about gathering, deduplicating, filtering, deciding the order… And then the whole thing again for reinforcement learning problems, plus iterating on what problems work… Reading all these many pages on data quality makes me think that this must account for a big part of the difference between other models, too. (Claude and Gemini come to mind)

Synthetic Data

Turns out you can use LLMs to generate training data for other LLMs. This is most obvious for reinforcement learning problems where you need to generate lots of problems. There were some early papers about how synthetic data is really bad, and then more work made it not so. The tl;dr version of it seems to be “keep on iterating on the synthetic data until it’s really good.”

Better Optimizers

When you train a model you have to use your loss-gradients to update the model somehow. This is the job of the “optimizer”. We got the first good optimizers ten years ago and they’re one of the big reasons why neural networks started getting good then. Right now we’re in a second phase of getting better optimizers. Apparently people are now speedrunning the training of LLMs to a certain quality. What took 45 minutes two years ago now takes under 2 minutes. (half of this is due to better optimizers) If you can reach a given quality faster, the same training budget gets you to a better quality overall by the end of the training.
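For context, this is the whole job of the classic AdamW optimizer, the baseline the newer ones (Muon and friends) are improving on:

```python
import torch

def adamw_step(param, grad, m, v, t, lr=3e-4, beta1=0.9, beta2=0.95,
               eps=1e-8, weight_decay=0.1):
    """One in-place AdamW update. m and v are running averages of the gradient
    and the squared gradient, t is the step count (starting at 1)."""
    m.mul_(beta1).add_(grad, alpha=1 - beta1)
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    m_hat = m / (1 - beta1 ** t)       # bias correction
    v_hat = v / (1 - beta2 ** t)
    param.mul_(1 - lr * weight_decay)  # decoupled weight decay
    param.addcdiv_(m_hat, v_hat.sqrt().add_(eps), value=-lr)
```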

Learning Rate Schedules

This is a surprising point in that you’d have thought that we figured out what learning rates to use ten years ago. But almost every paper now talks about their learning rate schedules and they’re all a little different. These schedules are actually still pretty simple, so I wouldn’t be surprised if we see more improvements here. (this has to co-evolve with the optimizers and data that’s being used)
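For example, a lot of recent papers describe some variant of a warmup-stable-decay shape. The step counts here are made up; real runs tune all of them:

```python
def learning_rate(step, max_lr=3e-4, min_lr=3e-5,
                  warmup_steps=2_000, stable_steps=50_000, decay_steps=10_000):
    if step < warmup_steps:                    # linear warmup from zero
        return max_lr * step / warmup_steps
    if step < warmup_steps + stable_steps:     # hold at the peak
        return max_lr
    done = min(1.0, (step - warmup_steps - stable_steps) / decay_steps)
    return max_lr + done * (min_lr - max_lr)   # decay down to the floor
```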

Better Scaffolding

We got Deep Research and Claude Code. These were enabled by long context windows and tool use and by reinforcement learning, but they also just allow the models to do a better job than the old call-and-response. Now you can tell a model to do something and it just goes and does it. There was no way for models to do this two years ago.

Big Areas I Can’t Cover

When there are dozens of directions that models improve into, there are some big areas that I can’t cover because I know little about them and because they would be too big on their own:

Better Finetuning

I mentioned RLHF, but I don’t think that is even used any more. Llama uses DPO instead and there have been more papers since. As I mentioned with the “Better Data” point above, recent papers now spend a lot of time talking about how they finetuned the models after pretraining (a term which means “read lots of text and predict the next token in all of it”) is finished. It’s too much to cover.

Multimodal Models

Models can now generate pictures and videos and sounds. I take so many pictures of things now and ask models about them. My impression is that writing about these areas would be twice as long as this whole blog post again. Luckily I know very little about all the improvements that led to that, so I won’t talk about them, but given the pace of improvements of e.g. image generation, it’s clear that they also went through dozens of improvements.

Inference Improvements

People started using speculative decoding, predicting multiple tokens at once (e.g. for the little google search AI snippets where cheap inference is important), and I’ve seen the headlines for various papers about how to better assign requests to hardware to get better batching and caching. I didn’t read any of them.
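The basic idea of speculative decoding is simple enough to sketch, though: a small draft model guesses a few tokens, the big model checks them all in a single forward pass, and you keep the prefix they agree on. A greedy-verification sketch with hypothetical model APIs (real systems verify against sampled distributions, not just the argmax):

```python
import torch

def speculative_step(draft, target, tokens, k=4):
    """One decoding step: append the drafted tokens the target model agrees
    with, plus the target's own token at the first disagreement. `draft` and
    `target` are stand-ins that return logits of shape (batch, seq, vocab)."""
    proposal = tokens.clone()
    for _ in range(k):  # the cheap model drafts k tokens one by one
        nxt = draft(proposal)[:, -1].argmax(-1, keepdim=True)
        proposal = torch.cat([proposal, nxt], dim=1)
    # What the big model would have predicted at each drafted position,
    # all computed in one forward pass
    target_preds = target(proposal)[:, -k - 1:-1].argmax(-1)
    drafted = proposal[:, -k:]
    agree = (drafted == target_preds).long().cumprod(dim=1)  # 1s until mismatch
    n_accept = int(agree.sum())
    correction = target_preds[:, n_accept:n_accept + 1]  # empty if all accepted
    return torch.cat([tokens, drafted[:, :n_accept], correction], dim=1)
```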

Summary and Outlook

AI is weird in that the chat interface looks very similar to two years ago, and if you look at a model’s code it looks very similar to two years ago, but in the details everything has been hill climbing in many small improvements to make better models. Does any individual improvement make a big difference? Would models be much worse without e.g. explicit attention sinks? No, but it all adds up. And sometimes enough small improvements allow a step change in capabilities, like the longer context did.

More papers come out than anyone can possibly keep up with (even just reading the headlines or the abstracts), and I only looked at the ones that made it into released models and that I remembered. But other areas haven’t stood still, even if no big models use their improvements. State-space models and linear attention have also been hill-climbing. I would not be surprised if they’re better than transformers soon (it would be a classic example of the theory of a cheaper, worse thing disrupting a more expensive, better thing by slowly improving). Or maybe those mixture-of-depths or H-Net approaches get adopted. And for some reason papers keep on coming out about how much better RNNs are getting. There are so many different approaches that you don’t see in LLMs yet, but have a chance of being adopted. When the next big thing comes out, it’ll probably be years in the making.

And of course even within transformers there are dozens more directions to explore. Big ones that come to mind are multiple residual streams, generalized attention, even more aggressive compression to smaller data types, more complicated attention. This architecture is not done improving. Even if every single one of these is a small step, it’ll add up.

I used to think that we need some algorithmic breakthroughs to make LLMs really good and get over their weird flaws. (where they’re really good at many things and then make the stupidest mistakes at other times) Now I think we are at a good enough starting point where we can hill-climb our way out of this. I’d be surprised if we didn’t see some big steps in addition to the many small steps, but I no longer think it’s necessary. The overall pace of improvements has just been so good.