Databases are Fucking Stupid

by Malte Skarupke

I’ve been trying to do some database programming and goddamnit do databases make my life difficult. I’m using PostgreSQL but it doesn’t really matter what you use. Let me get the most minor complaint out of the way: Why do I have to talk to databases using text?

Every time that I want to store a value in a database I have to convert it to a string, and the database has to convert it back from a string. This is slow, lossy, leads to bugs (SQL injection), and is completely unnecessary. But I can accept that bit of stupidity.
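To make the injection point concrete, here’s a rough sketch in Python, using the stdlib sqlite3 module rather than PostgreSQL and a made-up `users` table: building the statement by string concatenation splices the value into the query text, while parameterized queries at least keep the value out of the statement.

```python
import sqlite3

# Sketch of why the text protocol leads to injection bugs, and how
# parameterized queries avoid them. Uses stdlib sqlite3, not PostgreSQL;
# the table and column names are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

evil = "x'); DROP TABLE users; --"

# Building the statement by string concatenation is the bug: the value
# gets spliced into the query text and reparsed as SQL.
# conn.execute("INSERT INTO users (name) VALUES ('" + evil + "')")

# Parameterized queries keep the value out of the query text entirely.
conn.execute("INSERT INTO users (name) VALUES (?)", (evil,))
row = conn.execute("SELECT name FROM users").fetchone()
print(row[0] == evil)  # True: the awkward string survives intact
```

The value still travels as text under the hood, which is the complaint, but at least the parser can’t be tricked by the data.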

My bigger issue is that databases don’t know how to do their job.

Specifically, I have a problem where I want to do a nested query that filters based on whether a certain ID exists in another table. I have an index on that ID in the other table, but PostgreSQL decides to do a Seq Scan over the entire table anyway. Which is O(n), as opposed to O(log n) for an Index Scan. Why does it do that? I have no idea. It doesn’t tell me. That decision by PostgreSQL slows my query down to take several minutes where it would otherwise take a few seconds at most. And it’s only going to get slower as my database grows.
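To put numbers on that difference, here’s a little sketch in Python that has nothing PostgreSQL-specific in it, just comparison counting: a sequential scan touches every row, while an index lookup is essentially a binary search over sorted keys.

```python
# A rough sketch of the asymptotic gap the planner is choosing between.
# A sequential scan touches every row; a b-tree index lookup behaves
# like a binary search over sorted keys.
def seq_scan_steps(keys, target):
    # O(n): walk every key until we find the target
    steps = 0
    for k in keys:
        steps += 1
        if k == target:
            break
    return steps

def index_scan_steps(keys, target):
    # O(log n): binary search over the sorted keys, counting halvings
    steps = 0
    lo, hi = 0, len(keys)
    while lo < hi:
        steps += 1
        mid = (lo + hi) // 2
        if keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return steps

keys = list(range(1_000_000))  # pretend this is the indexed ID column
print(seq_scan_steps(keys, 999_999))    # 1000000 comparisons
print(index_scan_steps(keys, 999_999))  # 20 comparisons
```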

When you’re trying to find out what’s going on with a database, one piece of advice you’ll come across is to let PostgreSQL re-analyze your table so that it can make smarter decisions. Why does it have to analyze your table in the first place? Because it doesn’t know what it’s doing and you can’t tell it what to do. There is a black box decision making process somewhere in PostgreSQL that decides which algorithms to use for your query, based on estimates derived from that analyze step. And that black box process goes wrong for me, so maybe re-running the analyze step helps. (or it doesn’t, in my case)
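For what it’s worth, the same idea exists in miniature in Python’s stdlib sqlite3, where you can at least inspect what ANALYZE collected. The table and data here are made up:

```python
import sqlite3

# ANALYZE samples the table and stores statistics that the query
# planner's cost estimates depend on. In sqlite those statistics land
# in sqlite_stat1, so you can at least look at what the planner thinks
# it knows. Schema is made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER)")
conn.execute("CREATE INDEX idx_customer ON orders (customer_id)")
conn.executemany(
    "INSERT INTO orders VALUES (?)",
    [(i % 100,) for i in range(10_000)],
)
conn.execute("ANALYZE")
stats = conn.execute("SELECT tbl, idx, stat FROM sqlite_stat1").fetchall()
print(stats)  # e.g. [('orders', 'idx_customer', '10000 100')]
```

PostgreSQL stores its equivalent statistics in the pg_stats view; the part that stays a black box is how those numbers turn into a plan choice.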

I’m hosting my database on Amazon AWS, and when I was frustrated with the performance of my query I decided to see how it would perform on bigger instances. So I used one of the bigger machines that Amazon has, a db.m3.2xlarge, and my query runs more than a thousand times faster, even though Amazon only estimates that machine to be eight times faster than the db.m1.small I was using before. So why the speedup? Because suddenly PostgreSQL decides to use the index that it should have used from the beginning. It doesn’t tell me why it changed its mind, but it will go back to the stupid method if I go back to using the small instance.

So maybe the database thinks that it doesn’t have enough memory on the small instance. I disagree, though, because the Amazon AWS console shows me that half of the RAM on that machine is unused. Doing more research, I come across this article, which has lots of good performance advice but also says such beautiful things as this explanation of a memory-related option:

shared_buffers = <num> — Editing this option is the simplest way to improve the performance of your database server. The default is pretty low for most modern hardware. General wisdom says that this should be set to roughly 25% of available RAM on the system. Like most of the options I will outline here you will simply need to try them at different levels (both up and down) and see how well it works on your particular system. Most people find that setting it larger than a third starts to degrade performance.

Which is obviously completely useless. It doesn’t explain anything: not what this option does, not when you can expect it to improve performance and when you can’t, and not how you would even find out whether changing the value worked. Why is 25% a good value, and why is 33% where it starts to go slow on most machines? How can I tell when I am getting into the slow area? The official PostgreSQL documentation is not much more helpful. But since my database isn’t using the available RAM, I try changing the value anyway and set it to 50% of available RAM. (it’s at 25% by default on Amazon RDS) And nothing changes. The database doesn’t even use any more memory. So I have no idea if that setting does anything at all, and I have no way of finding out. I could try different values, but it takes several minutes to try each one because the server has to restart, so I won’t do that. After all, I never even knew whether unused memory was the problem in the first place.

When you search through the documentation and StackOverflow and random mailing lists, all you find is useless stuff like this. Someone’s index didn’t work because he had it sorted the wrong way: sometimes it’s important that you sort in ascending order, and other times you should pick descending order. (but usually it doesn’t matter) Somebody else’s index didn’t work because PostgreSQL used to not like it when you were looking for a four byte integer in an index on an eight byte integer. Another person had problems because the database thought that file accesses were more expensive than they actually were on his machine.

Whenever you find a solution to a problem, it is only a solution to that specific problem, and there is no way to generalize it or to use tools that could give you information about several performance problems at once. The best tool you have in PostgreSQL is the EXPLAIN ANALYZE command, which tells you what decisions the solver made. But if the solver makes the wrong decisions, you have to guess at what you could do to change its mind. There is all this imprecise, handwavy knowledge out there where people guess the parameters for their databases based on rules of thumb developed over years of tweaking databases, and there are never any hard facts or measurements or explanations. Should I try the solution from some other guy who changed the estimates that PostgreSQL uses for the cost of accessing the disk? I have no idea.
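For comparison, here’s what the equivalent tool looks like in Python’s stdlib sqlite3, with a hypothetical one-column table. EXPLAIN QUERY PLAN is sqlite’s rough analog of PostgreSQL’s EXPLAIN: it tells you whether an index was used, but just like EXPLAIN ANALYZE it doesn’t tell you why.

```python
import sqlite3

# Sketch: ask sqlite's planner what it intends to do with a query.
# Table and index names are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER)")
conn.execute("CREATE INDEX idx_id ON t (id)")
# Each plan row is (id, parent, notused, detail); detail is free text,
# which is exactly the problem: you get the decision, not the reasoning.
plan = conn.execute("EXPLAIN QUERY PLAN SELECT * FROM t WHERE id = 5").fetchall()
for row in plan:
    print(row[3])  # e.g. "SEARCH t USING COVERING INDEX idx_id (id=?)"
```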

So I end up tweaking config files to influence a black box query optimizer that makes the wrong decision but doesn’t tell me why it makes any decisions, all of that simply because I have to use a backwards syntax which doesn’t allow me to just write code against the database. If I could just write code that does a binary search, I would have had this problem solved within seconds. In fact there wouldn’t be a need for a black box solver or this magic analyze step that you regularly have to run over your database. And if it turns out that the index doesn’t fit into memory and that a binary search would be slow, I would like to be able to find that out myself, thank you. (but in my case even a slow binary search that has to go to disk for every btree node would be vastly faster than the algorithm that PostgreSQL chooses)
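That hand-written version really is only a few lines. A sketch in Python, with all names made up; in practice the sorted IDs would be fetched from the other table once:

```python
import bisect

# "Just let me write the binary search myself": filter rows by whether
# their ID exists in another table, using a sorted list of that table's
# IDs. Everything here is a made-up stand-in for the real tables.
other_table_ids = sorted([3, 7, 7, 42, 1000])  # IDs from the other table

def id_exists(sorted_ids, target):
    # classic binary search membership test, O(log n)
    i = bisect.bisect_left(sorted_ids, target)
    return i < len(sorted_ids) and sorted_ids[i] == target

rows = [(1, 3), (2, 5), (3, 42), (4, 99)]  # (row_id, foreign_id)
kept = [r for r in rows if id_exists(other_table_ids, r[1])]
print(kept)  # [(1, 3), (3, 42)]
```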

The best analogy that I have for the advice around PostgreSQL is shamanism. Because that’s what you develop when you have to deal with a black box. Every now and then they find a herb that actually does heal, but mostly it’s rules of thumb that are impossible to generalize or to use as any kind of basis for formalizing the knowledge. Every solution only works for one problem, and if the solution doesn’t work you haven’t learned anything except that that solution doesn’t work. I don’t want advice that tells me to “try changing random_page_cost.” (from here) How about giving me a way to measure what’s wrong, so that I can find out that random_page_cost is too high and be certain that lowering it will fix it? And if lowering it doesn’t fix it, I want to be able to run tests to find out why not. Maybe after changing random_page_cost it turns out that shared_buffers is too low. It would be nice to know that for sure, so that I can make informed decisions. I would also like to know whether changing a variable has an impact on other queries. But in PostgreSQL you can try changing a variable, and if it does nothing you haven’t learned anything. The scientific revolution happened four hundred years ago, but people are still creating areas where you have to work using the methods of shamanism.

Luckily there are now many NoSQL databases. Because the fundamental problem is that I have to work in this backwards language where I can’t just say what I want to do but instead have to set up the environment and carefully compose my statements in such a way as to confuse the solver the least. And if it doesn’t do the right thing it doesn’t tell me what’s missing and I can’t just write a binary search myself. And NoSQL promises that I don’t have to use that language.

But NoSQL is a prime lesson in throwing out the baby with the bathwater. Because most NoSQL databases are not relational. So now I can’t run my query at all. Thank you. NoSQL databases are pretty much just data stores with faster search and support for concurrency. That’s not technically true, because NoSQL means anything that doesn’t use SQL, but usually people mean document databases when they say NoSQL. And that’s cool and all, and it’s good to have learned about the existence of these, because I will use them to store many things, but they don’t solve my problem.

I have half a mind to just write my own database. Unfortunately getting it to a good quality could take years. But just the energy savings that you would get if people didn’t turn their ints into strings only for them to be turned back into ints would probably solve global warming.

So now I can do one of two things: a) buy a bigger database server, or b) try NoSQL. I think if I store some redundant information I can also solve this specific issue using Amazon DynamoDB, which is NoSQL. So I will try that as a learning exercise. All DynamoDB requests are sent as JSON, so I will still have to turn floats into strings only for them to be turned back into floats on the server. Not a good start, but I can accept that bit of stupidity…