Ideas for a Programming Language Part 5: More Flexible Types
by Malte Skarupke
Type systems tend to be too rigid. And dynamic typing just throws the baby out with the bathwater. So here are a few things I want:
1. I want to add information to an object at compile time without having to give it a new type. Is an integer in seconds or kilometers or kilometers per hour? Is a string SQL escaped or URL escaped or just plain text? Is a pointer known to not be zero? I also want to write logic based on this: if I multiply seconds with kilometers per hour, I want a result that makes sense. Most importantly, I want to do this easily. There is nothing stopping me from having a “url_escaped_string” type in C++, but then that string doesn’t work with any of my existing functions. I can add an implicit conversion operator, but then I lose my compile time information as soon as I use a function that accepts only one type of string. Or everything has to be a template and everything becomes duck typing.
So what I envision is that every object has runtime properties and compile time properties. So a string might have a runtime property called size(), and it might have a compile time property called language() where I can set the language to be “SQL” or “URL” or similar and could then make sure at compile time that everything is escaped correctly. An int would not have any runtime properties, but it would still have compile time properties like its unit of measurement. We’ll start with that and then we’ll see what other features we’d want from that. (like functions being able to write conditions for their input parameters where they check the compile time properties of those objects and similar)
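This can be approximated in C++ today by tracking unit exponents in template parameters. A minimal sketch, with invented names (Quantity, Seconds, MetersPerSecond), of how multiplying seconds by meters per second can yield meters at compile time:

```cpp
#include <cassert>

// Hypothetical sketch: attach a unit to a number at compile time via
// template parameters holding the exponent of each base unit.
template <int MetersExp, int SecondsExp>
struct Quantity
{
    double value;
};

using Seconds = Quantity<0, 1>;
using Meters = Quantity<1, 0>;
using MetersPerSecond = Quantity<1, -1>;

// Multiplying quantities adds the exponents, so the result "makes sense":
// (m/s) * s = m
template <int M1, int S1, int M2, int S2>
Quantity<M1 + M2, S1 + S2> operator*(Quantity<M1, S1> a, Quantity<M2, S2> b)
{
    return { a.value * b.value };
}
```

The difference to what I want is that Quantity is a genuinely new type, so none of your existing functions that take a double work on it without being rewritten.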
2. I want multi-stage initialization. Sometimes you need to create a bunch of empty shells and then initialize them all at once later. Or sometimes you need to initialize in three, four or more stages. In C++ I tend to avoid doing this because it can lead to really confusing code and is considered an anti-pattern. It shouldn’t be an anti-pattern. There should be a way to do it hygienically. I want to say that an object is only half initialized, and that you’re not allowed to call certain functions on it yet. I want to also be able to say that a function will finish initializing it and I am allowed to call all functions on the object after that function has finished. I think this will replace constructors. Constructors need special rules, special syntax even in some languages, and have to initialize every member, even if they just put an empty default value in there so that the destructor will not be confused. Multi-stage initialization seems a better approach. Then everything is just normal functions, and the object changes types without changing data.
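One way to sketch this in current C++ is to encode the stage in a template parameter, so that “the object changes types without changing data” becomes returning a new wrapper around the same members. All names here (Connection, Stage and the member functions) are invented for illustration:

```cpp
#include <cassert>
#include <string>
#include <utility>

enum class Stage { Empty, Named, Ready };

// The stage is part of the type; calling a function from the wrong
// stage fails to compile (the static_asserts only fire if the function
// is actually used at that stage).
template <Stage S>
struct Connection
{
    std::string host;
    int port = 0;

    // Each stage transition is a normal function, not a constructor.
    Connection<Stage::Named> set_host(std::string h) &&
    {
        static_assert(S == Stage::Empty, "host already set");
        return { std::move(h), port };
    }
    Connection<Stage::Ready> set_port(int p) &&
    {
        static_assert(S == Stage::Named, "set the host first");
        return { std::move(host), p };
    }
    // Only a fully initialized connection can be used.
    std::string address() const
    {
        static_assert(S == Stage::Ready, "not fully initialized");
        return host + ":" + std::to_string(port);
    }
};
```

The data never moves between stages in the language I imagine; here it has to be handed from one wrapper type to the next, which is exactly the kind of ceremony that makes this an anti-pattern in C++.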
3. I want to have interfaces with optional functions and optional data. Where an object can implement an interface and choose to not implement certain functions. If it doesn’t implement a function, I get a compiler error if I try to call that function on that object through the interface. This is inspired by the talk “Generic Programming Must Go” by Andrei Alexandrescu where he has a great example for this in trying to come up with a generic allocator interface. In generic programming we often check for capabilities of objects. In Alexandrescu’s example he checks for example if an allocator is able to expand a previous allocation or if you have to re-allocate. The syntax for that is always ugly. It’s better in D than in C++, which is why Alexandrescu was happy with a solution that involves checking capabilities, but I think interfaces with optional functions on them could work better. Mainly to get guaranteed consistency and documentation of what’s available.
4. I want to separate data layout from struct description. For example: If I have the same data once in SOA form and once in AOS form, I want to be able to call the same functions on it without code duplication. This is inspired by Jonathan Blow’s language, but I think his solution is a bit too specific. In his solution if I change the object, it has to change everywhere. I think a better separation might be that interfaces are the default way of programming, and I can bind data to those interfaces. For example I can easily imagine having five different implementations of storing strings. From small string optimization to copy-on-write ref-counted to immutable ref-counted to substring pointers into an immutable buffer (when parsing) to compile time constant strings. I can do all of that in C++, and people have done all of that, but you tend to quickly end up with second class objects. Where if you want the full functionality you have to use std::string, or a string_view type, and if your object is neither of those, you’ll get code duplication. And string_view is a great step forward because it’s so cheap to convert to, but I think a generic interface with optional members (see above) might be better, because if your type can provide more functionality, you don’t want to degrade to string_view. Or if my string is just a { uint32 offset, uint32 size } in a mmapped file, I would like to keep it like that and not have to create a string_view. The annoying part isn’t that I have to create a string_view, but that whoever had the pointer to that offset and size now also has to be changed. In the end you end up mirroring the entire structure of the file in memory. (or you have to do in-place pointer patching, which means you have to add enough padding to fit pointers)
5. I want to use runtime information to constrain what functions people can call at compile time. One example from the above that would benefit from this is the multi-stage initialization. Let’s say I have a container that has objects in it at multiple stages of initialization. Some are uninitialized, some are fully initialized. And now I want to iterate over that container. I also want compile time guarantees that I am not accidentally doing something to an uninitialized object that I am only allowed to do to a fully initialized object. The container might know that it contains both types of objects, or it might not keep that information around, in any case we have to check at runtime which state the current object is in. So I need to make the initialization stage available at runtime. There are several ways of doing that and I don’t think that the language should prescribe one, so let’s just pick one and say that for this case I decide to put a runtime enum onto the object that mirrors the compile time property. If we have that, we might as well get rid of the compile time property, because we don’t want to have to keep two pieces of data in sync. Now the code always has to check at runtime how initialized my object is. And I want the same guarantees from that as if I had done the check at compile time. Meaning depending on what the result of the runtime check is, I am allowed to call different functions on the object at compile time. The only thing the compiler has to do is keep track of what properties the current scope has checked about the object. If the current scope is one in which the initialization state returned “fully initialized,” then the compiler will allow me to call more functions on the object. If the current scope is one where I have never checked, or the check returned false, the compiler will not allow me to call all functions on the object.
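The closest existing C++ analogue I know of is std::variant, where the initialization stage is runtime data (the variant index) and std::visit hands each branch a narrowed type, so calling name() on the uninitialized alternative simply doesn’t compile. A sketch with invented types:

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <variant>
#include <vector>

struct Uninitialized {};
struct Initialized
{
    std::string name;
};

// The runtime "how initialized is this?" enum is the variant's index.
using Slot = std::variant<Uninitialized, Initialized>;

std::string describe(const Slot& slot)
{
    // The runtime check (visit) determines which branch is compiled
    // against which type; s.name is only allowed in the Initialized branch.
    return std::visit(
        [](const auto& s) -> std::string {
            if constexpr (std::is_same_v<std::decay_t<decltype(s)>, Initialized>)
                return s.name;
            else
                return "<uninitialized>";
        },
        slot);
}
```

What I want on top of this is the flow-sensitive version: an ordinary if on an ordinary enum member, after which the compiler widens the set of callable functions in that scope.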
I think some of these will give a great deal of flexibility, and will reduce some of the pains of static typing. I also think some of these can happen transparently. Remember that I plan to store the source code in a database, not in a text file. So I don’t have to always have all information on the screen. It is totally possible for me to create an interface where you right click on an integer to get to its properties to change its unit of measurement. That may be a bad example, because units of measurement might be a property that you want to see all the time (it may be more important than the actual type of the object: often you care more that something is in meters or kilometers than whether it’s an int or a float), but I expect that people will want to add lots of compile time properties to things. Like “what are the min- and max values of this property in our editor?” It may turn out to be a giant mess, but we’ll try and if it becomes a mess we’ll see what steps have to be taken. And since the source code is stored in a database, the complexity will be easier to deal with, because it will for example be trivial for you to find all the places where a property is assigned to, and then you can ask what unit tests cover that assignment code, put a breakpoint on the line and run the unit test to see what happens.
One concern that I can imagine people have might be that I’m going to make type systems even more complicated. But I think type systems, for example in C++, are too simple right now. That may sound crazy, “C++ has a type system that’s too simple,” but I think it’s true. Look at a few headers of your favourite boost library. The code will look nearly insane and will be very difficult to understand and debug, but what it’s actually doing is usually pretty simple. It does things like “can I write this object using iostreams? If yes, compile this code, otherwise compile that code” and it’s just that in the C++ type system, checking things like that is incredibly complicated. So what people are doing is simple, but the way they are doing it is very complicated. That suggests to me that the tool they are using is not appropriate for the job. In fact the C++ people know this: They are currently adding concepts, which will make the type system even more complicated, but I think that adding concepts will make template code so much simpler to write that a lot more people will do it. (and if you are someone who tries to stay away from template code, I would be willing to bet that you have literally hundreds of workarounds to get around cases where the type system is insufficient. It’s just that those workarounds are difficult to spot, because they may not look like workarounds. But believe me, your life could be so much easier)
So sometimes you have to add complexity to make things simpler. It’s like with storing the source code in a database. That makes big parts of the language a lot more complex, but it will make writing tools much simpler. And I will be able to add information that’s not expressed in the text. So you add complexity over here to gain simplicity over there, and you just have to keep balancing your global complexity budget as well as your various local complexity budgets and make sure that you’re not going too far to one side or the other. So I’ll make types more complicated 🙂
Multi-stage initialization reminds me of something that I’ve wanted before: objects that behave like state machines. You don’t always strictly get closer to being fully initialized, you can go backwards and sideways. Like if you have a “visited” flag on an object that you only use while you are doing some recursive operation, but you want it to be some other value during some other operation. In C++ you could use a union, but there’s no safety there, and no association with what member functions are available at the time.
Or having object states like mutable/immutable. Like a string object that can be appended to, and while appending has no guarantee of contiguous storage, then you switch it to “finished” and it reallocates it contiguously, but then you can switch it back to mutable at some other point when you know it’s safe to do so.
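Lacking real state transitions, current C++ can at least model the two states as two types, so each state only offers its own operations; the names here are invented, and a real typestate feature would avoid the copies between states:

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

struct FinishedString;

// Appendable state: no guarantee of contiguous storage while building.
struct BuildingString
{
    std::vector<std::string> chunks;
    void append(std::string s) { chunks.push_back(std::move(s)); }
    FinishedString finish() &&; // explicit transition to the immutable state
};

// Finished state: reallocated contiguously, read-only API.
struct FinishedString
{
    std::string data;
    const std::string& view() const { return data; }
    // Switching back to mutable is another explicit transition.
    BuildingString reopen() && { return { { std::move(data) } }; }
};

FinishedString BuildingString::finish() &&
{
    std::string joined;
    for (auto& c : chunks)
        joined += c;
    return { std::move(joined) };
}
```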
Per-thread object states are also kind of interesting. Like if you lock a mutex and suddenly you are allowed to call functions/use data from the guarded object.
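This one can be approximated in C++ today by making the guard the only path to the object (Guarded and Guard are invented names); unlike in Rust, nothing stops you from stashing a raw pointer past the unlock:

```cpp
#include <cassert>
#include <mutex>
#include <utility>

// The guarded object is private; the only way to reach it is through
// the guard returned by lock(), so holding the lock and being allowed
// to call functions on the object coincide.
template <typename T>
class Guarded
{
    std::mutex mutex;
    T value;

public:
    explicit Guarded(T v) : value(std::move(v)) {}

    struct Guard
    {
        std::unique_lock<std::mutex> lock; // unlocks when the guard dies
        T* object;
        T* operator->() { return object; }
        T& operator*() { return *object; }
    };

    Guard lock() { return { std::unique_lock(mutex), &value }; }
};
```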
—
The unanswered question for me with “string is an interface, there are many implementations” is… how do you do codegen? Say you have five different types that work like that, and some code that uses all five: you have a combinatoric explosion of possibilities. C++ says “compile each one that is actually used” but that has some quirks like not being sure about the code size. So maybe you’d say that you pick a specific implementation when you care about the data format (when it’s in a struct or something) and otherwise there’s one “canonical” implementation that you pass around (pointer + length for strings, say). But maybe that doesn’t work for every interface.
Even weirder is if you have those five types and put them all in a struct. If you separate data layout from that struct description, you have a combinatorial explosion of ways to represent that struct as well. When is that compiled in and when is that solved with runtime dynamism? If you support persisting that struct in different representations, then you might want runtime dynamism in the loading code, but then converting to a specific format for the rest of the code… lots to think about.
A lot of these features, as it happens, are things that Rust either has or wants to have.
For instance: the “lock mutex -> more operations allowed” is achieved by having the mutex wrap around the object in question. Locking it returns a RAII guard type that you can use to access the guarded object, and the borrow checker verifies that you don’t access the object after you give the guard back.
Partially-initializable types aren’t yet here, and may never be, simply because a partially-initialized type rather by definition violates a bunch of type invariants; since Rust is very explicit about naming guarantees (which let the compiler make more assumptions and thus optimize more aggressively), the different possible states of partial initialization would all have to be explicitly named, and state transitions defined using method syntax, so any kind of partial initialization would be a lot of work. OTOH, that lends itself easily to other kinds of type-states, which is seen as necessary for Rust since the whole affine typing thing means that there are often times when you just want to rip parts out of an object constructed by someone else, to use yourself.
I like the idea of types as a state machine. You would have different segments in your class definition that you could turn on and off depending on what functions have been called on the object. And then multi-stage initialization is a special case.
As for the combinatorial explosion of codegen:
For generating functions I plan to follow the C++ model as it’s going to be with concepts, meaning just keep instantiating the function whenever there is a new implementation of the concept. (I think this is how traits or interfaces work in Rust, but I haven’t used Rust yet so I’m a bit vague on its concepts)
This will lead to all of the same problems that C++ has with compile time, but I think it might be OK. For one, C++ templates are slower than they need to be. And if you debug why things compile slowly, they often don’t have a good reason for compiling slowly other than that nobody has profiled them yet. It’s surprisingly easy to optimize the compile time of many libraries once you have tools that tell you about where the compiler is spending time.
Also if you have five string implementations (and I think a lot of large C++ projects probably will have this) you will probably find a lot of code duplication anyway. As in there are probably ten implementations of the startswith() and endswith() functions. I’m confident that if I actually took the time to check at work, I might find fifty implementations of endswith(), where most of them are just loose code fragments in some random function because somebody needed to check a file extension. These won’t be in named, reusable functions. But despite all of this duplication, if I wanted to optimize compile time at work, I wouldn’t start there. This is not the reason why C++ compiles slowly. The reason why C++ compiles slowly is mainly header bloat and overcomplicated templates because simple things are difficult to do.
When you just use templates to generate a bunch of code, they are actually surprisingly fast. They compile slowly when you use them to do compile time logic like “if this object is size 0 and can be inherited from, use the empty base class optimization, otherwise make it a normal member.”
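For the record, the empty base class optimization dance mentioned above looks roughly like this; the code generation is trivial, it’s the selection logic that costs compile time (AllocHolder and the allocator types are invented):

```cpp
#include <cassert>
#include <type_traits>

// General case: keep the allocator as a member.
template <typename Alloc, typename = void>
struct AllocHolder
{
    Alloc alloc;
    Alloc& get() { return alloc; }
};

// If the allocator is empty and can be inherited from, inherit from it
// so that it takes up no space in the holder.
template <typename Alloc>
struct AllocHolder<Alloc,
    std::enable_if_t<std::is_empty_v<Alloc> && !std::is_final_v<Alloc>>>
    : Alloc
{
    Alloc& get() { return *this; }
};

struct EmptyAlloc {};
struct StatefulAlloc { int arena_id = 7; };

static_assert(sizeof(AllocHolder<EmptyAlloc>) == 1); // base folded away
```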
Also I expect that when we run into problems with this (we will, probably) it will be possible to come up with solutions. For example I can imagine writing a wrapper class that uses reflection to generate a vtable for any interface that you pass to it. Then that wrapper class would also implement the interface, so you could pass it to the function. And then it should be possible to sort functions by how much they contribute to your overall compile time, and tell the compiler to use that wrapper class for a function where it’s expensive to instantiate it many times. You might even have one “fast compile” build that uses the wrapper class, and a “full optimization” build that instantiates the function separately for each implementation.
The important part is not that that is going to be the solution, but that I make it possible to come up with solutions like this one. In C++ writing something like that would be much more difficult. But I am thinking of more fully integrating things like compiler, buildsystem and IDE so that you can easily visualize which functions take a long time to compile or which functions contribute to code bloat.
Struct layout is a different question though. Whenever you have an implementation, it will have to decide on one data layout. I actually probably won’t allow interfaces as struct members. But you could for example have a wrapper class like I mentioned above. And you could model it after boost.any or std::function or after vtables. The language won’t prescribe one method, but I’ll probably provide a few common methods of wrapping interfaces in a library. So if your interface is called “Callable” then you can’t have a struct member of type Callable. But you could have a “VtableWrapper[Callable]” as a struct member, because the VtableWrapper would be a concrete type with exactly known size 8 and known implementations of all the functions of the interface.
So if you’re implementing an interface, and the interface says that there needs to be a member of type Callable, your struct cannot actually have a member of type Callable. But it could either have a specific implementation (like say a function pointer) or a wrapper type like std::function that has one implementation that can handle generic data.
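A hand-rolled miniature of that std::function-style wrapper, to make the “concrete type with exactly known size” point concrete (CallableWrapper and Task are invented names, and it’s specialized to int(int) to stay short):

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Type-erasing wrapper: one vtable-backed implementation is generated
// per wrapped type, behind a pointer of fixed size.
class CallableWrapper
{
    struct Interface
    {
        virtual int call(int) = 0;
        virtual ~Interface() = default;
    };
    template <typename F>
    struct Impl : Interface
    {
        F f;
        explicit Impl(F func) : f(std::move(func)) {}
        int call(int x) override { return f(x); }
    };
    std::unique_ptr<Interface> impl;

public:
    template <typename F>
    CallableWrapper(F f) : impl(std::make_unique<Impl<F>>(std::move(f))) {}
    int operator()(int x) { return impl->call(x); }
};

// A struct cannot hold "a Callable", but it can hold the wrapper,
// which is a concrete member of known size.
struct Task
{
    CallableWrapper on_done;
};
```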
A number of these ideas are well-explored in the PLT/functional programming sphere, here’s what they’re called:
1. Refinement types
3. In Haskell, people just use small interfaces (typeclasses are I think nicer to use than OOP interfaces, and encourage many small ones)
4. Typeclasses can do this well
5. Dependent types
Thanks, I had not known about refinement types before! They do sound a lot like what I was thinking about.
I did know about dependent types and type classes, but I didn’t know enough about them to be sure that they are what I’m talking about. My first few drafts of this article talked a lot about dependent typing, but then I felt like a fraud because I hadn’t used them and wasn’t sure if I was talking about the right thing, so I removed the term entirely.
The experience that type classes lead to smaller interfaces is very interesting. I am always fascinated by what differences in approach, like type classes vs interfaces, lead to. I should really write some Haskell again to get a feeling for things like this.