Ideas for a Programming Language Part 1: Source Code in Database

by Malte Skarupke

My main motivation for wanting to write a programming language is that I believe I can make programmers much more productive. Somewhere in the archives of unfinished blog posts for this website I have a very long post about which language features and coding practices I think contribute to programmer productivity (and which harm it), but I believe that in the long term the single most important contributor is the quality of the tools that we use to program.

And “tools” is a very vague word. It includes everything from source control to text editors to test frameworks to debuggers to refactoring helpers to code completion to compiler warning messages to static analysis to profilers, diff, make and many more categories. All of this surrounding infrastructure that we use around coding actually has a much bigger impact on productivity than which language we pick. Which makes sense, because those tools are usually written by someone not directly involved in the language, and if they took the time to write the tool, you can bet it has a big impact on their productivity. So I want to make a language that will have the best tools ever. I think that’s actually not too difficult, because it seems that nobody has had that ambition before.

And for that my language will store its source code in a database instead of text files.

Source code in database (SCID) is not a new idea, and everyone seems to think it’s a good idea; I haven’t found anyone who doubts that. There’s an obvious trap where people immediately want to make the source code not editable as text, just because it’s no longer stored as text. But once you get over that (text input simply works too well) and decide to keep text input and only change the storage, everyone sees that there are real benefits here.

And the main benefit of having a smarter storage is that tools become easier to write. The text only gets parsed once, when it is typed in, and once it is stored, every other tool immediately has full information about your source code. You can write the fastest compiler ever. You can write perfect autocomplete. You can write a diff tool that’s smart and fast and correct. The list goes on. It’s surprising how many of our tools could be improved a lot if we simply had more information available.

The reason why these points are not as strong as they could be is that there is a workaround: You can always just generate a database. We do this a lot. If you’re reading this at work, chances are that at least one core of your machine is currently busy building a database for your source code. And that gets you pretty close to having good tools. But while I believe that that can help, building a database automatically can never be as good as having the source data be in a database to begin with:

  • If you generate a database, a cold start takes ages. This fact alone means that some tools will never get written.
  • You also can’t be as fast when reacting to changes, because you probably have to rebuild too much. People are modifying a different data structure than your internals, and keeping the two in sync is probably much more work than just regenerating everything all the time.
  • In theory you could be as precise as SCID, but in practice autocomplete tools always make silly mistakes. People get sloppy when the database is “only” used for autocomplete and isn’t actually the authority on your source code.

These downsides are not big problems for existing tools, but that’s only because tools for which these would be big problems will never get written. I can write better tools if the database is always there. If it’s not always there, you have to have fallback options, or you may have to freeze the screen for a minute while refreshing your database. (Hello, Visual Studio.) If I can rely on always having correct information, I can be more confident in making automatic changes. Flint is an example where somebody just gave up and wrote the best thing they could with flawed information, because getting exact information takes too long. It’s also nice to have predictable performance. If something always takes 10ms, I can run it whenever you type a letter. If something usually takes 10ms but every now and then takes a second, I cannot. The more predictable my data is, the more predictable my performance will be, and generated data is not very predictable. If someone ever takes the approach that Flint has taken in my language, I will consider that a defect that I have to fix at the language level.

The other big reason why Source Code in Database hasn’t happened yet is that everything else works with text. Text editors, source control, diff tools, publishing software (such as this blog): the world works with text files.

And for me that’s just a hump I have to get over, and I’m willing to do it. I am not sad that I can’t use my existing text editor, because I’d write an IDE for my language anyway. But I do plan to maintain a library that translates the database to text and back, which could be used to write plugins for existing editors. That brings you back into the territory of generated databases though, so I do not want it to be the default mode of interacting with the language.

Source control will be a problem, but it’s probably solvable: just store binary files and provide a custom diff tool, I guess. That sounds terrible, but the main reason you probably have bad experiences with binary diff tools is that you can’t go in after the fact and fix things when the diff tool inevitably gets something wrong. Since the input here will still be text (including in the diff tool), that should not be a problem. Speaking of diff tools: it’s a shame that existing diff tools won’t immediately work with my language. But you can write better diff tools anyway if you have the AST. (See ydiff.)
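For what it’s worth, git already has a hook for roughly this setup: mark the database files as binary, but register a textconv driver that renders them as text before diffing. A minimal sketch, where the .scdb extension and the scid-to-text command are made-up names standing in for the database format and the database-to-text library mentioned above:

    # .gitattributes: route source database files through a custom diff driver
    *.scdb diff=scid

Then tell git how to render that format as text when diffing:

    git config diff.scid.textconv scid-to-text

With that in place, git diff shows a readable text diff even though the stored file is binary.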

And I think the problems with other tools can be solved too. It’s a real issue that existing tools won’t work, but ultimately I don’t care that much.

With those downsides out of the way, I find that once you start thinking with source code in database, you come up with a lot of things that you have wanted to do for ages but never did, because there is too much overhead in writing tools. For example, if I want to write a tool that tells me what global variables are used in a piece of code, that’s a lot of work today. With source code in a database I think it will be easy: simply write a query for all used variables and filter by how they are declared.
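To make that concrete, here is a minimal sketch of what such a query could look like, written against a hypothetical in-memory schema. None of these type or field names are real; the point is that the database already knows how every variable was declared, so “which globals does this code touch?” reduces to a plain filter:

    #include <iostream>
    #include <string>
    #include <vector>

    enum class DeclarationKind { Local, Argument, Member, Global };

    // the database stores every variable use already resolved to its declaration
    struct VariableUse
    {
        std::string name;
        DeclarationKind declared_as;
    };

    struct FunctionRecord
    {
        std::string name;
        std::vector<VariableUse> used_variables;
    };

    // no parsing and no symbol resolution has to happen first:
    // the answer is a filter over data that is already there
    std::vector<std::string> used_globals(const FunctionRecord & function)
    {
        std::vector<std::string> result;
        for (const VariableUse & use : function.used_variables)
        {
            if (use.declared_as == DeclarationKind::Global)
                result.push_back(use.name);
        }
        return result;
    }

    int main()
    {
        FunctionRecord update = { "update_physics",
            { { "dt", DeclarationKind::Argument },
              { "g_gravity", DeclarationKind::Global },
              { "g_frame_count", DeclarationKind::Global } } };
        for (const std::string & name : used_globals(update))
            std::cout << name << '\n'; // prints g_gravity, g_frame_count
    }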

One tool that should be easy to write for C++ but would be slow as hell is one that finds out where a variable goes: Let’s say I have an int member and I want to find out everything that ever accesses it, where the int gets passed to, and where those places then pass it on. This is easy to do in C++ today: replace the int with a struct that has no implicit conversions, then follow the compiler errors and keep replacing ints with your struct until you are done. This takes anywhere from an hour to a day, but after that you have a good picture. I believe that using SCID I could write a tool that does this automatically in seconds. I’ve had to do this twice in the past two years, because I had to turn two int32s that were used very widely in the engine into int64s and there couldn’t be any mistakes. While I did this I noticed how useful the information from this process was. If I could automate it, I would use it all the time.
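For readers who haven’t seen this trick, here is roughly what the manual version looks like, as a compilable sketch. TrackedId and legacy_function are illustrative names; the real migration repeats the commented-out steps at every use site the compiler flags:

    #include <cstdint>
    #include <iostream>

    // wrap the int in a struct with no implicit conversions; every place
    // that treats it as a plain integer becomes a compiler error to follow
    struct TrackedId
    {
        explicit TrackedId(int64_t value) : value(value) {}
        int64_t value;
    };

    int64_t legacy_function(int64_t id) { return id + 1; }

    int main()
    {
        TrackedId id(42);
        // int64_t copy = id;   // error: no implicit conversion, found a use
        // legacy_function(id); // error: widen this signature, then repeat
        std::cout << legacy_function(id.value) << '\n'; // explicit escape hatch
    }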

Another thing we could do is make the executable just an extension of the database. It should be fully connected to the database, and all information about it should be made available. For starters, code hot reloading would work 100% of the time, even in inlined functions. It’s not like we can’t track where things end up. When we come across a case where hot reloading doesn’t work, we figure out what information we need to keep in order to make it work, and we keep that information. My theory for why these things don’t work currently is that we have this ecosystem of a text file, an executable and a pdb, all using different systems. It’s a bunch of complexity that comes out of the fact that we have this text file in the middle, which is not an extensible file format. If the core of your system is not composable and extensible, the things built around it won’t be able to talk to each other, because they have nothing in common. Could you create an ecosystem where, if I am writing a tool and need information from the build system, that information is as easily accessible as the information from the pdb and from source control? If you have a text file at the core, that’s going to be a lot more work, because you can bet these tools have no common interfaces at all.

Once your executable is just an extension of your database, you start thinking that you’d actually like to keep a lot more information connected to it. Like where the code ends up in the executable. Or all the places where your function gets inlined. Or all the places where I made the compiler change its mind by placing the inline keyword. How much did I bloat the code (in bytes)? Did my change cause other optimizations to be lost? That information exists or can be generated.

I am also fairly certain that at some point this will influence the language itself. I will be able to make design decisions that I would only come up with because I have a smarter storage. Nothing revolutionary (I expect it will be possible to backport those ideas to other languages), but working with this for a while will change my thinking.

And there will be a lot of tiny improvements to the IDE: Hungarian notation as an option, solving disagreements about differentStyles of variable_naming, getting rid of whitespace problems, named parameters as a view option instead of a language feature, and probably more that will come up as I am writing the IDE.

Point being: I am already thinking about a bunch of new things that I would never have thought about before. I think source code in database will make writing tools a lot easier and more reliable.

But it’s the sum of all these things that will make a difference. I will have a faster compiler than any existing language, and that will make you a few percent more productive. I will have the best diff tool ever, and that will make you a little more productive. I will give you a ton more information about your source code, and that will make you a little more productive. And the product of all these tiny improvements will be a huge gain in productivity. I honestly think that I could be twice as productive as I am today.
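As a rough illustration of how such small multipliers compound: fifteen independent improvements of 5% each multiply out to 1.05^15 ≈ 2.1, which is about the doubling claimed above.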

The main thing that will stop this from happening is that we are pretty good at pushing the local maximum of using text files further and further. There is nothing fundamentally stopping people from writing awesome things with generated ASTs and generated databases. I think there is a global maximum out there that does not involve text files, but we keep discovering that we’re not even at the local maximum for text files yet. (Also, I think that as soon as I demo that this has real benefits, old languages will support SCID. For example, C++ could just allow a new file extension, .cppdb in addition to .cpp, which is a database, and then the language defines how module includes work between legacy text files and the new file format.)

Still, my plan is this: when I write my language, I will store the source code only in a database. That will be more work initially, but then I will leapfrog whatever gains other people have made in the meantime using text files.

The next article will be about rethinking the call stack. It won’t be as drastic as you think, but we do a lot of work just because the call stack metaphor often doesn’t match what we are doing.