Databases are Fucking Stupid
by Malte Skarupke
I’ve been trying to do some database programming and goddamnit do databases make my life difficult. I’m using PostgreSQL but it doesn’t really matter what you use. Let me get the most minor complaint out of the way: Why do I have to talk to databases using text?
Every time that I want to store a value in a database I have to convert it to a string and the database has to convert it back from a string. This is slow, lossy, leads to bugs (SQL injection) and is completely unnecessary. But I can accept that bit of stupidity.
My bigger issue is that databases don’t know how to do their job.
Specifically, I have a problem where I want to do a nested query that filters based on whether a certain ID exists in another table. I have an index on that ID in the other table, but PostgreSQL decides to do a Seq Scan over the entire table anyway. Which is O(n), as opposed to O(log n) for an Index Scan. Why does it do that? I have no idea. It doesn’t tell me. That decision by PostgreSQL slows my query down to several minutes where it would otherwise take a few seconds at most. And it’s only going to get slower as my database grows.
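To give a sense of the shape of the problem, here is a minimal sketch; the table and column names are made up for illustration and are not my actual schema:

    -- hypothetical schema: keep "orders" rows only if their customer_id exists in "customers"
    CREATE INDEX customers_id_idx ON customers (id);  -- the index I would expect the planner to use

    SELECT *
    FROM orders o
    WHERE EXISTS (SELECT 1 FROM customers c WHERE c.id = o.customer_id);

    -- prefixing the query with EXPLAIN prints the plan that PostgreSQL picked;
    -- for my real query it shows a Seq Scan over the other table instead of an Index Scan
    EXPLAIN SELECT *
    FROM orders o
    WHERE EXISTS (SELECT 1 FROM customers c WHERE c.id = o.customer_id);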
When you’re trying to find out what’s going on with a database, one piece of advice you’ll come across is to let PostgreSQL re-analyze your table so that it can make smarter decisions. Why does it have to analyze your table in the first place? Because it doesn’t know what it’s doing and you can’t tell it what to do. There is a black box decision-making process somewhere in PostgreSQL that decides which algorithms to use for your query, based on estimates that it derives from the analyze step. And that black box process goes wrong for me, so maybe re-running the analyze step helps. (or it doesn’t, in my case)
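For reference, the incantation itself is a one-liner, and about the only visible trace of what it did is in the statistics views (same made-up table as above):

    ANALYZE customers;

    -- the gathered statistics land in pg_stats, and pg_stat_user_tables
    -- records when the table was last analyzed
    SELECT relname, last_analyze, last_autoanalyze
    FROM pg_stat_user_tables
    WHERE relname = 'customers';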
I’m hosting my database on Amazon AWS, and when I was frustrated with the performance of my query I decided to see how it would perform on bigger instances. So I used one of the bigger machines that Amazon has, a db.m3.2xlarge, and my query runs more than a thousand times faster, even though Amazon only estimates that machine to be eight times faster than the db.m1.small I was using before. So why the speedup? Because suddenly PostgreSQL decides to use the index that it should have used from the beginning. It doesn’t tell me why it changed its mind, but it will go back to the stupid method if I go back to using the small instance.
So maybe the database thinks that it doesn’t have enough memory on the small instance. I disagree, though, because the Amazon AWS console shows me that half of the RAM on that machine is not used. Doing more research, I come across this article, which has lots of good performance advice but also says such beautiful things as this explanation of a memory-related option:
shared_buffers = <num> — Editing this option is the simplest way to improve the performance of your database server. The default is pretty low for most modern hardware. General wisdom says that this should be set to roughly 25% of available RAM on the system. Like most of the options I will outline here you will simply need to try them at different levels (both up and down ) and see how well it works on your particular system. Most people find that setting it larger than a third starts to degrade performance.
Which is obviously completely useless. It doesn’t explain anything: not what this does, not when you can expect it to improve performance and when you can’t, and not how you would even find out whether changing the value worked. Why is 25% a good value, and why is 33% where it starts to go slow on most machines? How can I tell when I am going into the slow area? The official PostgreSQL documentation is not much more helpful. But since my database isn’t using the available RAM I try changing the value anyway and set it to 50% of available RAM. (it’s at 25% by default on Amazon RDS) And nothing changes. The database doesn’t even use any more memory. So I have no idea if that setting does anything at all, and I have no way of finding out. I could try different values, but it takes several minutes to try a value because the server has to restart, so I won’t do that. After all, I never even knew whether the problem was that the memory isn’t being used.
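For what it’s worth, this is what checking and changing the setting looks like on a self-managed PostgreSQL install; on Amazon RDS you change it through a parameter group instead, and either way the server has to restart before it takes effect. The 4GB value is just an example:

    SHOW shared_buffers;   -- what the server is currently using

    -- in postgresql.conf (self-managed install); takes effect after a restart
    -- shared_buffers = 4GB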
When you search through the documentation and StackOverflow and random mailing lists, all you find is useless stuff like this. Someone’s index didn’t work because he had it sorted the wrong way, and sometimes it’s important that you sort in ascending order, and other times you should pick descending order. (but usually it doesn’t matter) Somebody else’s index didn’t work because PostgreSQL used to not like it when you were looking for a four byte integer in an index on an eight byte integer. Another person had problems because the database thought that file accesses were more expensive than they actually were on his machine. Whenever you find a solution to a problem, it is only a solution to that specific problem, and there is no way to generalize or to use tools that could give you information about several performance problems. The best tool you have in PostgreSQL is the EXPLAIN ANALYZE command, which tells you what decisions the solver made. But if the solver makes the wrong decisions, you have to guess at what you could do to change its mind. There is all this imprecise, handwavy knowledge out there where people guess the parameters for their databases based on rules of thumb that they developed during years of tweaking databases, and there are never any hard facts or measurements or explanations. Should I try the solution from some other guy who changed the estimates that PostgreSQL uses for the cost of accessing the disk? I have no idea.
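The closest thing to a controlled experiment is overriding the planner for a single session, which at least tells you whether the index would have helped if it had been chosen, even though it still doesn’t tell you why it wasn’t chosen. (enable_seqscan is a real PostgreSQL setting; the query is the made-up example from above)

    -- discourage sequential scans for this session only (a diagnostic, not a fix)
    SET enable_seqscan = off;

    EXPLAIN ANALYZE SELECT *
    FROM orders o
    WHERE EXISTS (SELECT 1 FROM customers c WHERE c.id = o.customer_id);

    RESET enable_seqscan;  -- back to the default afterwards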
So I end up tweaking config files to influence a black box query optimizer that makes the wrong decision but doesn’t tell me why it makes any of its decisions, and all of that simply because I have to use a backwards syntax which doesn’t allow me to just write the query logic myself. If I could just write code that does a binary search I would have had this problem solved within seconds. In fact there wouldn’t be a need for a black box solver or this magic analyze step that you regularly have to run over your database. And if it turns out that the index doesn’t fit into memory and that a binary search would be slow, I would like to be able to find that out myself, thank you. (but in my case even a slow binary search that has to go to disk for every B-tree node would be vastly faster than the algorithm that PostgreSQL chooses)
The best analogy that I have for the advice around PostgreSQL is shamanism. Because that’s what you develop when you have to deal with a black box. Every now and then they find a herb that actually does heal, but mostly it’s rules of thumb that are impossible to generalize or to use as any kind of basis for formalizing the knowledge. Every solution only works for one problem, and if the solution doesn’t work you haven’t learned anything except that that solution doesn’t work. I don’t want advice that tells me to “try changing random_page_cost.” (from here) How about giving me a way to measure what’s wrong, so that I can find out that random_page_cost is too high and be certain that lowering it would fix it. And if lowering it doesn’t fix it, I want to be able to run tests to find out why it didn’t. Maybe after changing random_page_cost it turns out that shared_buffers is too low. It would be nice to know that for sure so that I can make informed decisions. I would also like to know if changing variables has an impact on other queries. But in PostgreSQL you can try changing a variable, and if it does nothing you haven’t learned anything. The scientific revolution happened four hundred years ago, but people are still creating areas where you have to work using the methods of shamanism.
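To be concrete about what “try changing random_page_cost” even amounts to: the value can at least be changed for a single session to see whether the plan flips, before touching the config file. (again using the made-up example query from earlier)

    SHOW random_page_cost;        -- the default is 4

    SET random_page_cost = 1.1;   -- pretend random disk access is nearly as cheap as sequential access
    EXPLAIN ANALYZE SELECT *
    FROM orders o
    WHERE EXISTS (SELECT 1 FROM customers c WHERE c.id = o.customer_id);
    RESET random_page_cost;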
Luckily there are now many NoSQL databases. Because the fundamental problem is that I have to work in this backwards language where I can’t just say what I want to do but instead have to set up the environment and carefully compose my statements in such a way as to confuse the solver the least. And if it doesn’t do the right thing it doesn’t tell me what’s missing and I can’t just write a binary search myself. And NoSQL promises that I don’t have to use that language.
But NoSQL is a prime lesson in throwing out the baby with the bathwater. Because most NoSQL databases are not relational. So now I can’t run my query at all. Thank you. NoSQL databases are pretty much just data stores with faster search and support for concurrency. That’s not technically true because NoSQL means anything that’s not using SQL, but people usually mean document databases when they say NoSQL. And that’s cool and all, and it’s good to have learned that these exist because I will use them to store many things, but they don’t solve my problem.
I have half a mind to just write my own database. Unfortunately, getting it to a good quality could take years. But just the energy savings that you would get if people didn’t turn their ints into strings only for them to be turned back into ints would probably solve global warming.
So now I can do one of two things: a) buy a bigger database server, or b) try NoSQL. I think if I store some redundant information I can also solve this specific issue using Amazon DynamoDB, which is NoSQL. So I will try that as a learning exercise. All DynamoDB requests are sent using JSON, so I will still have to turn floats into strings only for them to be turned back to floats on the server. Not a good start, but I can accept that bit of stupidity…
It would appear that your statement that “but it doesn’t really matter what you use” isn’t entirely correct.
See, e.g. http://dba.stackexchange.com/questions/68996/can-i-get-some-rationale-from-the-optimzer-why-it-chose-a-certain-explain-plan
Apparently, if you are interested in investing some serious time, you should be able to make the Oracle Optimizer at least tell you why it did what it did.
That’s awesome. That would have saved me a lot of useless research.
But the fundamental problem is still that the language SQL is backwards. Instead of saying what I want to do I have to configure the system so that it decides on its own to do what I want to do. It sounds like that configuration is going to be a lot easier in Oracle because I at least get feedback, but still I’m going to try NoSQL databases first.
A more fitting abstraction than SQL would be one where I can create an index and then call functions using that index as an object. Which is how containers work in C++: if I create a std::multimap I have find, count, lower_bound, upper_bound and equal_range. If I create a std::unordered_multimap I only have the operations find, count and equal_range. And std::vector doesn’t offer any of those operations, so I have to fall back to the linear-time generic std::find. In SQL I can call all operations on all tables, and then it does different things depending on how the database is configured. In C++ I can look at the documentation of the functions that I call and have guarantees about the worst-case complexity of my code. In SQL I do not.
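To illustrate with the same made-up table from the post: the statement below is legal whether or not an index exists, and nothing in the statement tells you which complexity you are going to get.

    -- with an index on customer_id this is roughly O(log n);
    -- without one it is a full table scan, and the statement looks exactly the same
    SELECT * FROM orders WHERE customer_id = 42;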
So it still mostly doesn’t matter what system you use. As long as SQL is your interface to the database, you’ll need to run magic analyze steps and tweak global variables. But it sure is nice that Oracle makes it easier to figure out what you have to tweak.
Well, as far as “easier” goes, I really doubt it. Not much is easy on Oracle, and my personal guess is that a lot of hacks would be much easier than actually trying to understand a 10053 trace 🙂
Anyway: while this was a nice rant, I think you’re doing an injustice to SQL databases. My guess is that 90% of the query plans the optimizer produces are, in 90% of the cases, better than what 90% of programmers could produce by writing the access pattern by hand.
That’s not to say that a NoSQL solution that still offers proper ACID (http://en.wikipedia.org/wiki/ACID) support wouldn’t make sense in quite a few scenarios.
cheers!
I don’t think this is limited to databases, though they’re an egregious example. Modern CPUs: start using fancy new instructions and your code gets slower; pay Intel $1,000 for a program that tells you what’s wrong. OSes: a resource you didn’t know existed, much less know is limited, is exhausted for some reason; search around until you find a program that uses an undocumented API that tells you how many of this resource a specific process uses. Languages: inliner gives up or goes berserk; good luck figuring out why. Or a dynamic language has a memory leak and now you have to learn how the runtime actually works.
I don’t think I’ve ever seriously used a piece of someone else’s code without needing to peek at its source for some reason sooner or later. That’s one of the reasons why I am wary of “full stack” solutions: if it’s running in a dynamic language behind two proxies interfacing with a message queue running in a VM on a server in the cloud somewhere, I will not have any agency unless I actually understand each part and why it’s there and have some visibility into it. SQL databases are a wonderful example of how “sufficiently smart” doesn’t exist in real life and how every abstraction is leaky, but they aren’t necessarily more black-box than anything else that a software developer has to interact with. (Which is really more of an indictment of modern software/computer engineering than a defense of database systems.)
It’s the law of leaky abstractions all the way down, but still there is a difference between good abstractions and bad abstractions. The “use the wrong instruction” thing for example is often not a problem thanks to optimizing compilers. If I divide by a constant the optimizer will turn that into a multiplication by a constant instead. If I use MAD when a multiply followed by an add would have been faster, I want the optimizer to change that as well. Currently they are a bit hesitant when it comes to new instructions, but I believe that will change. Already Clang is optimizing the assembly of a bunch of my SIMD code where Visual Studio 2010 was scared to touch it.
I bet people have written about how you can tell a good abstraction from a bad abstraction. And I bet “being able to figure out why the abstraction leaks when it leaks” is one of the top things. As well as being able to work around the abstraction when it does the wrong thing. Ideally without having to break the interface.
Basically an abstraction that assumes that it will never leak is a bad abstraction. SQL is one such abstraction.
They could certainly add a bunch of features to make it easier to deal with abstraction leaks, though. For example they could allow me to operate directly on indexes instead of tables, and then my problem is solved. Yes, it would expose more of the underlying behavior of how this magic select statement works, but not to a terrible degree. And it would still be an optional feature, so if I trust the black box optimizer I don’t have to use it.
One problem that I see, though, is that SQL is not a language that’s easy to extend elegantly. (see the documentation of the select statement, which would have to be extended even more for pretty much every new feature) And any extension has to be made by the database software. Users can’t extend SQL. Maybe that’s a solvable problem as well, but it could also be a reason to move away from SQL.
“Use the wrong instruction” was actually referring to a performance regression that appeared on AVX code generated by VC++ when it was run on a Haswell CPU. Modern CPUs are optimized to run existing code faster, while modern compilers are optimized to create new code that runs faster on modern CPUs, and those goals do not always align. But that’s sort of the nature of the beast when it comes to performance.
Basically every framework or language that I’ve ever used has assumed, at some level, that it will not be a leaky abstraction. Because doing it “correctly”—having continuous granularity where every level of the layered abstraction allows you to descend to the next level if necessary—is a crazy amount of work and is often in conflict with the goal of simplifying code. For example, I don’t think that being able to drop down to asm in a C compiler is the right escape hatch for a lot of problems. Same with GUI frameworks exposing a handle to the underlying OS object if my problem is some weird event propagation behavior.
I guess what I’m saying is that “okay, something doesn’t work, time for source-level debugging” is a pretty bad status quo, but it *is* the status quo, and not just for databases. Part of the unique problem with databases is that every vendor has some different way to try to discover what’s going on, and none of them work as well as a real debugger for the database engine would.
Thank you for your post. I am feeling better now
use REDIS!!!
Good question. PostgreSQL has some very readable documentation on the EXPLAIN command that explains the planner’s estimates, and it also illustrates why having a planner is advantageous.
Should the estimates not match reality, as apparently in your case, Postgres has the EXPLAIN ANALYZE command, which will execute the query & report actual costs for comparison.
Comparing EXPLAIN ANALYZE actual costs to EXPLAIN planned costs (looking for where planned and actual row counts differ) will show reasonably directly which steps were mis-planned and which statistics need to be updated.
https://www.postgresql.org/docs/current/using-explain.html
https://www.postgresql.org/docs/current/using-explain.html#USING-EXPLAIN-ANALYZE
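For example, with a made-up table (the SET STATISTICS step is just one possible remedy when row estimates turn out to be far off):

    -- estimated costs and row counts only; the query is not executed
    EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

    -- executes the query and prints actual times and row counts next to the estimates
    EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;

    -- if a node's estimated row count is wildly different from the actual one,
    -- refreshing or increasing the statistics for that column is one fix
    ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 500;
    ANALYZE orders;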