SQL (Structured Query Language) is used in various forms for data storage and other relational data operations. SQL is how you “talk” to a database to store or retrieve data. If you’ve done any development at all, you have probably at least heard of SQL, or even gotten your hands dirty working with it. As a language in its own right, it has its own caveats that make it more or less efficient for real-life use cases.
There are multiple SQL standards, and each implementation has its own features and caveats. This article focuses on the basics of optimization that apply to every major, modern SQL database. Certain things are universally good, others universally bad, and unfortunately both are near-universally overlooked by developers, new and experienced alike.
Establishing Use Case
What is your database being used for? Optimizing the database schema is one of the most efficient ways to control the speed of queries. This isn’t always an option, but you can often move certain bits of data into new views or separate tables to reduce the computational cost of accessing a database in a way it wasn’t meant to be accessed. Scope out the tasks your product will perform and how the data can be stored to reduce unnecessary joins and extraneous data.
Are you primarily reading, writing, or equally reading and writing data when working with SQL? If you are reading 99% of the time, who cares about write time (within reason) as long as the reads are fast? On the other hand, if you need to read and write equally, you have to take a more balanced approach. Knowing how the database is used shapes everything you do with it.
Data Storage
The more data in a given table, the longer it takes to access. It can be advantageous to split tables into smaller, saner units, or to combine small tables into a larger one to avoid joins. You have to know the use case before making this decision: splitting tables means more upkeep to keep the data consistent (and slower writes in most cases), while combining tables means more overhead either to populate the combined data or to retrieve it.
This example is pretty trivial, but assume we have the following table for our program:
CREATE TABLE employees (
    pid INT NOT NULL AUTO_INCREMENT,
    name NVARCHAR(255),
    bday DATE,
    social NVARCHAR(11),
    phone NVARCHAR(12),
    PRIMARY KEY (pid)
);
And we have the following queries:
SELECT name, bday, social FROM employees;
SELECT name, phone FROM employees;
If we never use “phone” together with “bday” and “social”, we could split this table. For our trivial example, this is a pointless optimization, but it matters more and more as a table accumulates columns. By ensuring the “pid” matches in our child tables (for instance, you would not set AUTO_INCREMENT on the other tables and would set the value explicitly), we ensure that a given employee is easy to match across the tables.
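A minimal sketch of such a split, with a hypothetical employee_phones table holding the contact data (the table and column names beyond the original example are illustrative):

```sql
-- Sensitive fields stay in the parent table; contact info moves out.
CREATE TABLE employees (
    pid INT NOT NULL AUTO_INCREMENT,
    name NVARCHAR(255),
    bday DATE,
    social NVARCHAR(11),
    PRIMARY KEY (pid)
);

-- Note: no AUTO_INCREMENT here. pid is hard set to the matching
-- employees.pid so rows line up across the tables.
CREATE TABLE employee_phones (
    pid INT NOT NULL,
    phone NVARCHAR(12),
    PRIMARY KEY (pid)
);
```

Each of the two original queries now touches only the table it needs, at the cost of keeping the pid values in sync on insert.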
Views and Temporary Tables
Views and temporary tables can also be employed to offload some of this work, but they come with their own caveats. I try to avoid them because they have a non-negligible performance impact in most SQL implementations, but sometimes the convenience of a view or temporary table substantially outweighs the cost, especially for complex queries and operations. They can also impact resource usage, but views add security by granting certain accounts access to a slice of data without exposing the full table.
An example of a view being set up for security purposes:
CREATE VIEW item_count AS ( SELECT COUNT( items ) AS myCount FROM inventory );
GRANT SELECT ON item_count TO apiuser;
Our apiuser account can get its necessary count of items, which stays up to date without the results being stored in another table somewhere. Unlike simply granting read-only access to the underlying table, the account has no insight into anything else. This creates a much cleaner, more maintainable database and code base with negligible overhead.
Temporary tables, whether true SQL temporary tables or scratch tables, provide security as well. You can limit permissions for access to data and shape how the data is populated without giving everything in a table away and without using a view. A user or process can write to these as well which makes them more useful than a view for certain tasks. These are usually best left for staging work, or for caching data which will be further processed before being bulk inserted into the database. For instance, if you have a live database which caches data from another database in bulk, you could put that data in a temporary table to process into the final table to avoid blind insertion or database locks. This makes the end application faster and less error prone since all the interim processing is done somewhere else before the bulk changes are committed.
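A sketch of that staging workflow, using MySQL-style syntax with illustrative table names (inventory_staging is hypothetical):

```sql
-- Scratch copy with the same structure as the live table.
CREATE TABLE inventory_staging LIKE inventory;

-- ... bulk load the raw cached data into inventory_staging ...

-- Do the interim processing here, away from the live table.
-- For example, drop rows that would fail validation:
DELETE FROM inventory_staging WHERE items IS NULL;

-- Commit the cleaned rows in one bulk operation, keeping any
-- locking on the live table as short as possible.
INSERT INTO inventory
SELECT * FROM inventory_staging;

DROP TABLE inventory_staging;
```

The live table is only touched by the final insert, so the application never sees half-processed data.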
Reducing Data
Sometimes, you just don’t have access to the raw database or the ability to change anything about it. Reducing the data you request reduces the overall cost of an SQL transaction.
The first query anyone learns in SQL is:
SELECT * FROM [table];
This is an easy query, but if you only need 3 columns, why retrieve every single one? At best you’re throwing away extra data; at worst, you’re making your query take significantly longer. If you query a table with 200 columns and only need 3, you’re complicating everything about your process. This is the cheapest, easiest optimization any programmer can make.
For instance:
SELECT name, phone, address FROM [table];
This will run substantially faster on non-trivially sized databases when the table contains more columns. Explicitly defining the columns also means the server does not need to expand the full column list from the schema to know what to return. This becomes more noticeable with multiple nested select statements and extremely complex queries.
Joins
Joins are extremely important in SQL. Without joins, you would need either multiple tables with largely redundant information, or else massive tables with more and more columns, to accomplish anything. Joins are powerful when used correctly, but they are easy to abuse.
The most common issue I see is thinking of joins strictly “left to right”. This is a valid and intuitive approach most of the time, but it easily causes issues. Sometimes a right join will solve the problem faster and more easily than a left join. Knowing which join to use, and when, also reduces the total number of joins necessary. I see this quite often with certain products’ databases, since they may map things directionally (table1 has a key which maps to table2, which has another key mapping to table3, but you need to start at table3 for the given process and need table1 in its entirety).
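For that directional case, the two forms below are equivalent; the table and key names are hypothetical, following the table1/table2/table3 chain above:

```sql
-- Starting from table3 (where the process begins) and walking back:
SELECT t1.name, t3.total
FROM table3 t3
LEFT JOIN table2 t2 ON t2.key3 = t3.id
LEFT JOIN table1 t1 ON t1.key2 = t2.id;

-- The same result written "left to right" from table1 requires
-- right joins to preserve every table3 row:
SELECT t1.name, t3.total
FROM table1 t1
RIGHT JOIN table2 t2 ON t1.key2 = t2.id
RIGHT JOIN table3 t3 ON t2.key3 = t3.id;
```

Neither is inherently faster, but picking the direction that matches the process usually reads better and avoids tacking on extra joins just to “get back” to the starting table.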
SQL joins are computationally expensive for the database. If your product is bottlenecked at the database level, it can be worth using multiple simple queries and putting the data together or otherwise processing it on the front-end. It’s not really ideal, but sometimes you have no choice, especially with shared databases. This tends to be a wicked Faustian bargain.
Indexing
Indexing is an often used trick to speed up a database with virtually no noticeable overhead. A database index is very similar to the index of a book which points to where specific content starts. An index has to be useful to be functional though.
An index on a random key assigned to a column which is never used is a complete waste. Assigning a standard index to large amounts of sequential data will also hurt performance. A cookbook indexed by ingredient count is going to be a bit of a waste.
You have to know what the goal of your table is and why the data is arranged the way it is, as well as the use case for the data so you know specifically what to index on. Indexes can end up having an impact on your server if you decide to just index every single column in a given table. The MySQL documentation has a great section on indexes which is largely applicable to other forms of SQL.
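Continuing the earlier employees example, and assuming lookups are almost always by name, a single targeted index is all it takes (the index name is arbitrary):

```sql
-- Index only the column the use case actually filters on.
CREATE INDEX idx_employees_name ON employees (name);

-- This lookup can now seek through the index instead of
-- scanning the whole table:
SELECT pid FROM employees WHERE name = 'Jane Doe';
```

Every additional index costs space and slows writes, which is exactly why indexing every column backfires.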
Boolean Logic
Boolean logic is essential to coding in general. The way you order comparisons can affect the speed of a query. Most SQL implementations will use some form of short-circuit evaluation (Boolean logic and short-circuit evaluation in Lua). If something is going to be false 90% of the time, make it one of the first things evaluated in a WHERE clause. For stacks of comparisons using OR, put the least computationally intense ones first. If the whole condition is already decidedly true or false, there’s no reason to keep chomping at the data.
An example:
SELECT COUNT( pid )
FROM employees
WHERE
( salary > 100000 )
OR ( salary > 50000 AND title NOT LIKE 'Manager%' )
OR ( salary > 30000 AND title NOT LIKE 'Manager%' AND department LIKE '%Retail%' )
This type of technique isn’t a one-size-fits-all approach, but it can be extremely useful for scaling down queries. It can also be less useful with truly random data. As you get used to working with your data, you’ll see patterns which make this whole approach much more fruitful and practical. The general principle of putting the most computationally intense conditions at the end remains useful regardless.
Execution Analysis
This same logic can be abstracted to general queries. If you’re combining tables, start with the smallest that effectively reduces the data. If two tables contain the same data in different forms for what you need, pick the one which is going to be lighter. For instance, if you have an “employees” table which contains all employees including managers, and a “managers” table, why would you use the general employees table when looking for a manager? This isn’t always possible or practical, but it is something I keep in mind at every step. If you need to find something which begins with “CAT”, why would you search for WHERE value LIKE '%CAT%'? These sorts of mistakes can make queries take exponentially longer to execute.
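To illustrate the “CAT” case (the products table and value column are hypothetical, and an index on value is assumed):

```sql
-- The leading wildcard defeats any index on `value`,
-- forcing a scan of every row:
SELECT name FROM products WHERE value LIKE '%CAT%';

-- A prefix search matches the same "begins with CAT" intent
-- and can seek through the index instead:
SELECT name FROM products WHERE value LIKE 'CAT%';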
Use profiling tools to analyze queries and see what’s wrong. The tools will help you find issues like those listed above, but the analysis won’t make much sense without understanding the basics and knowing what to do about a given issue. When I track down queries with a profiling tool, it’s usually after I’ve done general optimizations but before I’ve gotten too in depth. The numbers can feel meaningless without having done a little of the legwork first.
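The most basic of these tools is built into the database itself. In MySQL (other databases have close equivalents), prefixing a query with EXPLAIN shows the planned execution without running it:

```sql
EXPLAIN SELECT name, phone
FROM employees
WHERE name LIKE 'CAT%';
```

The key columns to read in the output are the access type (ALL means a full table scan), which index, if any, was chosen under "key", and the estimated "rows" examined. Those numbers are what make problems like a missing index or a leading wildcard visible.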
Levels of Optimization
Your level of access affects the level of optimization available. The more access you have to the database and its implementation, the more control you have over the entire optimization process. It’s easier to effect change from high above, where you can see the big picture, than from the ground level.
Optimizing the entire database around what you’re doing will get you the most bang for your buck, but if all you have is partial, read-only access, you are a bit out of luck. If a system is locked down, learn why, and see if there’s a sane way to get more control without putting the company at risk. When working with third parties and the like, you can cache data to gain more control at the expense of costlier sync operations. When done correctly, this can be nearly transparent to the end user.
There are various levels of optimization, from the lowest to the highest. The lowest-level optimizations benefit every level, so you can almost always feel secure in adding them. Tweaking a select statement to skip columns whose data you don’t need is never going to backfire.
Putting It All Together
Optimizations at every level contribute to the overall quality of your SQL queries. Reduce useless data, apply database-level optimizations where possible and where they make sense, and think about how your queries are used to decide how to optimize. If you understand the bigger picture behind the database, you can focus your effort on what actually matters. Noting which tables are indexed, and why, can help clear up queries so they run much faster.
SQL has its own caveats, and the right few tricks can speed up an application several-fold. Use execution analysis when you are stuck, but only after you have addressed the basics of what a specific query is doing and why. These few tricks may not make you a DBA, but they’ll make your code much more efficient.
Featured image by Colossus Cloud from Pixabay