Ideas for a Programming Language Part 1: Source Code in Database

by Malte Skarupke

My main motivation for wanting to write a programming language is that I believe I can make programmers much more productive. Somewhere in the archives of unfinished blog posts for this website I have a very long post about which language features and coding practices I think contribute to programmer productivity (and which harm it), but I believe that in the long term the single most important contributor is the quality of the tools that we use to program.

And “tools” is a very vague word. It includes everything from source control to text editors to test frameworks to debuggers to refactoring helpers to code completion to compiler warning messages to static analysis to profilers, diff, make and many more categories. All of this surrounding infrastructure that we use around coding actually has a much bigger impact on productivity than which language we pick. Which makes sense, because those tools are usually written by someone not directly involved in the language, and if they took the time to write the tool, you can bet it has a big impact on their productivity. So I want to make a language that will have the best tools ever. I think that’s actually not too difficult, because it seems that nobody has had that ambition before.

And for that my language will store its source code in a database instead of text files.

Source code in database (SCID) is not a new idea, and everyone seems to think it’s a good idea; I haven’t found anyone who doubts that. There’s an obvious trap where people immediately want to make the source code not editable as text, just because it’s no longer stored as text. But once you get over that (text input simply works too well) and decide to keep text input and only change the storage, everyone sees that there are real benefits here.

And the main benefit of having a smarter storage is that tools become easier to write. The text only gets parsed once, when it is typed in, and once it is stored, every other tool immediately has full information about your source code. You can write the fastest compiler ever. You can write perfect autocomplete. You can write a diff tool that’s smart and fast and correct. The list goes on. It’s surprising how many of our tools could be improved a lot if we simply had more information available.

The reason why these points are not as strong as they could be is that there is a workaround: You can always just generate a database. We do this a lot. If you’re reading this at work, chances are that at least one core of your machine is currently busy building a database for your source code. And that gets you pretty close to having good tools. But while I believe that that can help, building a database automatically can never be as good as having the source data be in a database to begin with:

  • If you generate a database, a cold start takes ages. This fact alone means that some tools will never get written.
  • You also can’t be as fast when reacting to changes, because you probably have to rebuild too much. People are modifying a different data structure than your internals, and keeping the two in sync is probably much more work than just regenerating everything all the time.
  • In theory you could be as precise as SCID, but in practice autocomplete tools always make silly mistakes. People get sloppy when the database is “only” used for autocomplete and isn’t actually the authority on your source code.

These downsides are not big problems for existing tools, but that’s only because tools for which these would be big problems will never get written. I can write better tools if the database is always there. If it’s not always there, you have to have fallback options, or you may have to freeze the screen for a minute while refreshing your database. (Hello, Visual Studio.) If I can rely on always having correct information, I can be more confident in making automatic changes. Flint is an example where somebody just gave up and wrote the best thing they could with flawed information, because getting exact information takes too long. It’s also nice to have predictable performance. If something always takes 10ms, I can run it whenever you type a letter. If something usually takes 10ms but every now and then takes a second, I cannot. The more predictable my data is, the more predictable my performance will be, and generated data is not very predictable. If someone ever takes the approach that Flint has taken in my language, I will consider that a defect that I have to fix at the language level.

The other big reason why Source Code in Database hasn’t happened yet is that everything else works with text. Text editors, source control, diff tools, publishing software (such as this blog): the world works with text files.

And for me that’s just a hump I have to get over, and I’m willing to do it. I am not sad that I can’t use my existing text editor, because I’d write an IDE for my language anyway. But I do plan to maintain a library that translates the database to text and back, which could be used to write plugins for existing editors. That brings you back into the territory of generated databases though, so I do not want it to be the default mode of interacting with the language.

Source control will be a problem, but it’s probably solvable: just store binary files and provide a custom diff tool, I guess. That sounds terrible, but the main reason you probably have bad experiences with binary diff tools is that you can’t go in after the fact and fix things when the diff tool inevitably gets something wrong. Since the input here will still be text (including in the diff tool), that should not be a problem. Speaking of diff tools: it’s a shame that existing diff tools won’t immediately work with my language. But you can write better diff tools anyway if you have the AST. (See ydiff.)
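For what it’s worth, git already has a hook for roughly this setup: mark the database files as binary, but register a textconv driver that renders them as text before diffing. A minimal sketch, where the .scdb extension and the scid-to-text command are made-up names standing in for the database format and the database-to-text library mentioned above:

    # .gitattributes: route source database files through a custom diff driver
    *.scdb diff=scid

Then tell git how to render that format as text when diffing:

    git config diff.scid.textconv scid-to-text

With that in place, git diff shows a readable text diff even though the stored file is binary.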

And I think the problems with other tools can be solved too. It’s a real issue that existing tools won’t work, but ultimately I don’t care that much.

With those downsides out of the way, I find that once you start thinking with source code in database, you come up with a lot of things that you have wanted to do for ages but never did, because there is too much overhead in writing tools. For example, if I want to write a tool that tells me what global variables are used in a piece of code, that’s a lot of work today. With source code in a database I think it will be easy: simply write a query for all used variables and filter by how they are declared.
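To make that concrete, here is a minimal sketch of what such a query could look like, written against a hypothetical in-memory schema. None of these type or field names are real; the point is that the database already knows how every variable was declared, so “which globals does this code touch?” reduces to a plain filter:

    #include <iostream>
    #include <string>
    #include <vector>

    enum class DeclarationKind { Local, Argument, Member, Global };

    // the database stores every variable use already resolved to its declaration
    struct VariableUse
    {
        std::string name;
        DeclarationKind declared_as;
    };

    struct FunctionRecord
    {
        std::string name;
        std::vector<VariableUse> used_variables;
    };

    // no parsing and no symbol resolution has to happen first:
    // the answer is a filter over data that is already there
    std::vector<std::string> used_globals(const FunctionRecord & function)
    {
        std::vector<std::string> result;
        for (const VariableUse & use : function.used_variables)
        {
            if (use.declared_as == DeclarationKind::Global)
                result.push_back(use.name);
        }
        return result;
    }

    int main()
    {
        FunctionRecord update = { "update_physics",
            { { "dt", DeclarationKind::Argument },
              { "g_gravity", DeclarationKind::Global },
              { "g_frame_count", DeclarationKind::Global } } };
        for (const std::string & name : used_globals(update))
            std::cout << name << '\n'; // prints g_gravity, g_frame_count
    }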

One tool that should be easy to write for C++ but would be slow as hell is one that finds out where a variable goes: Let’s say I have an int member and I want to find out everything that ever accesses it, where the int gets passed to, and where those places then pass it on. This is easy to do in C++ today: replace the int with a struct that has no implicit conversions, then follow the compiler errors and keep replacing ints with your struct until you are done. This takes anywhere from an hour to a day, but after that you have a good picture. I believe that using SCID I could write a tool that does this automatically in seconds. I’ve had to do this twice in the past two years, because I had to turn two int32s that were used very widely in the engine into int64s and there couldn’t be any mistakes. While I did this I noticed how useful the information from this process was. If I could automate it, I would use it all the time.
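For readers who haven’t seen this trick, here is roughly what the manual version looks like, as a compilable sketch. TrackedId and legacy_function are illustrative names; the real migration repeats the commented-out steps at every use site the compiler flags:

    #include <cstdint>
    #include <iostream>

    // wrap the int in a struct with no implicit conversions; every place
    // that treats it as a plain integer becomes a compiler error to follow
    struct TrackedId
    {
        explicit TrackedId(int64_t value) : value(value) {}
        int64_t value;
    };

    int64_t legacy_function(int64_t id) { return id + 1; }

    int main()
    {
        TrackedId id(42);
        // int64_t copy = id;   // error: no implicit conversion, found a use
        // legacy_function(id); // error: widen this signature, then repeat
        std::cout << legacy_function(id.value) << '\n'; // explicit escape hatch
    }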

Another thing we could do is make the executable just an extension of the database. It should be fully connected to the database, and all information about it should be made available. For starters, code hot reloading would work 100% of the time, even in inlined functions. It’s not like we can’t track where things end up. When we come across a case where hot reloading doesn’t work, we figure out what information we need to keep in order to make it work, and we keep that information. My theory for why these things don’t work currently is that we have this ecosystem of a text file, an executable and a pdb, all using different systems. It’s a bunch of complexity that comes out of the fact that we have this text file in the middle, which is not an extensible file format. If the core of your system is not composable and extensible, the things built around it won’t be able to talk to each other, because they have nothing in common. Could you create an ecosystem where, if I am writing a tool and need information from the build system, that information is as easily accessible as the information from the pdb and from source control? If you have a text file at the core, that’s going to be a lot more work, because you can bet these tools have no common interfaces at all.

Once your executable is just an extension of your database, you start thinking that you’d actually like to keep a lot more information connected to it. Like where the code ends up in the executable. Or all the places where your function gets inlined. Or all the places where I made the compiler change its mind by placing the inline keyword. How much did I bloat the code (in bytes)? Did my change cause other optimizations to be lost? That information exists or can be generated.

I am also fairly certain that at some point this will influence the language itself. I will be able to make design decisions that I would only come up with because I have a smarter storage. Nothing revolutionary (I expect it will be possible to backport those ideas to other languages), but working with this for a while will change my thinking.

And there will be a lot of tiny improvements to the IDE: Hungarian notation as an option, solving disagreements about differentStyles of variable_naming, getting rid of whitespace problems, named parameters as a view option instead of a language feature, and probably more that will come up as I am writing the IDE.

Point being: I am already thinking about a bunch of new things that I would never have thought about before. I think source code in database will make writing tools a lot easier and more reliable.

But it’s the sum of all these things that will make a difference. I will have a faster compiler than any existing language, and that will make you a few percent more productive. I will have the best diff tool ever, and that will make you a little more productive. I will give you a ton more information about your source code, and that will make you a little more productive. And the product of all these tiny improvements will be a huge gain in productivity. I honestly think that I could be twice as productive as I am today.
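As a rough illustration of how such small multipliers compound: fifteen independent improvements of 5% each multiply out to 1.05^15 ≈ 2.1, which is about the doubling claimed above.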

The main thing that will stop this from happening is that we are pretty good at pushing the local maximum of using text files further and further. There is nothing fundamentally stopping people from writing awesome things with generated ASTs and generated databases. I think there is a global maximum out there that does not involve text files, but we keep discovering that we’re not even at the local maximum for text files yet. (Also, I think that as soon as I demo that this has real benefits, old languages will support SCID. For example, C++ could just allow a new file extension, .cppdb in addition to .cpp, which is a database, and then the language defines how module includes work between legacy text files and the new file format.)

Still, my plan is this: when I write my language, I will store the source code only in a database. That will be more work initially, but then I will leapfrog whatever gains other people have made in the meantime using text files.

The next article will be about rethinking the call stack. It won’t be as drastic as you think, but we do a lot of work just because the call stack metaphor often doesn’t match what we are doing.