There comes a point, somewhere around 100K+ rows in your SQL database, when ordinary queries become really slow. This is most noticeable with joins, aggregations, or complex filtering.
## Query optimisation
I am going to talk about this using the following DB as an example:
```mermaid
erDiagram
    MINECRAFT_SERVERS {
        int ID PK
        string Name
        string ServerName
    }
    PLAYER_VISITS {
        int ID PK
        string PlayerIP
        string ServerURL FK
        string Action
        string Location
    }
    %% Relationships
    MINECRAFT_SERVERS ||--o{ PLAYER_VISITS : "has visits"
```
And for our example we want to run the following query:
```sql
SELECT
    m.*,
    COUNT(DISTINCT pv.PlayerIP) AS UniquePlayerCount,
    COUNT(CASE WHEN pv.Action IS NOT NULL THEN 1 ELSE NULL END) AS ActionCount
FROM MinecraftServers AS m
LEFT JOIN PlayerVisits AS pv
    ON pv.ServerURL = m.ServerName
GROUP BY m.ID;
```
This might run fine on your 100-row database, but at 100K+ rows it'll suck. I ran this query in phpMyAdmin, selecting only the top 25 rows: it took **3 minutes 48 seconds**. You don't want to use a query that takes this long to finish.
Let's look at this next query:
```sql
SELECT
    m.Name,
    m.ID,
    m.ServerName,
    IFNULL(pv_stats.UniquePlayerCount, 0) AS UniquePlayerCount,
    IFNULL(pv_stats.ActionCount, 0) AS ActionCount
FROM MinecraftServers AS m
LEFT JOIN (
    SELECT
        ServerURL,
        COUNT(DISTINCT PlayerIP) AS UniquePlayerCount,
        COUNT(CASE WHEN Action IS NOT NULL THEN 1 ELSE NULL END) AS ActionCount
    FROM PlayerVisits
    WHERE Location IN ('Overworld', 'Nether')
    GROUP BY ServerURL
) AS pv_stats ON pv_stats.ServerURL = m.ServerName;
```
This query is a big improvement. Before I tell you how long it took, I'll go over the improvements that were made.
- **Take what you need**: with `m.Name, m.ID, m.ServerName` we only select the fields we are actually going to use, instead of pulling all the data from `MinecraftServers`. This is better for the DB, and also for the memory and storage of the application that requested the data.
- **More filtering**: we made sure to filter for players from the Overworld and Nether. This also lowers the number of rows the DB has to check.
- **IFNULL over CASE**: `IFNULL` is cheaper to evaluate than `CASE`, because `CASE` is more complex, which we don't need for this query.
- **Subqueries**: by moving the aggregation of `UniquePlayerCount` and `ActionCount` into the `pv_stats` subquery, we know the aggregation happens before the join. That way we only join the summarized data, not the whole `PlayerVisits` table.
This updated query took **17 seconds** on the same dataset.
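You can try the subquery rewrite on a toy dataset to convince yourself it produces sensible results. Here is a minimal sketch using Python's built-in `sqlite3` as a stand-in for MySQL; the sample rows are made up for illustration, and the schema follows the diagram above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE MinecraftServers (ID INTEGER PRIMARY KEY, Name TEXT, ServerName TEXT);
CREATE TABLE PlayerVisits (ID INTEGER PRIMARY KEY, PlayerIP TEXT,
                           ServerURL TEXT, Action TEXT, Location TEXT);
INSERT INTO MinecraftServers VALUES (1, 'Cool', 'coolserver'), (2, 'Nono', 'nonoserver');
INSERT INTO PlayerVisits VALUES
  (1, '192.168.1.1', 'coolserver', 'Build', 'Nether'),
  (2, '192.168.1.2', 'coolserver', 'Dig',   'Overworld'),
  (3, '192.168.1.3', 'nonoserver', NULL,    'End'),
  (4, '192.168.1.2', 'coolserver', NULL,    'Overworld');
""")

# The optimised shape: aggregate first in a subquery, then join the summary.
rows = con.execute("""
SELECT m.Name, m.ID, m.ServerName,
       IFNULL(pv_stats.UniquePlayerCount, 0) AS UniquePlayerCount,
       IFNULL(pv_stats.ActionCount, 0)       AS ActionCount
FROM MinecraftServers AS m
LEFT JOIN (
    SELECT ServerURL,
           COUNT(DISTINCT PlayerIP) AS UniquePlayerCount,
           COUNT(CASE WHEN Action IS NOT NULL THEN 1 ELSE NULL END) AS ActionCount
    FROM PlayerVisits
    WHERE Location IN ('Overworld', 'Nether')
    GROUP BY ServerURL
) AS pv_stats ON pv_stats.ServerURL = m.ServerName
ORDER BY m.ID
""").fetchall()

print(rows)
# → [('Cool', 1, 'coolserver', 2, 2), ('Nono', 2, 'nonoserver', 0, 0)]
```

Note how `nonoserver` still appears with zeroed counts thanks to the `LEFT JOIN` plus `IFNULL`, even though all of its visits were filtered out by the `Location` check.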
## Table optimization

### Indexes
You can optimize your DB tables by using indexes. You should use these on the columns you access a lot, so for the example from before I'd create the following indexes:
For the `PlayerVisits` table:

```sql
CREATE INDEX idx_pv ON PlayerVisits (ServerURL(191), Location);
```
For the `MinecraftServers` table:

```sql
CREATE INDEX idx_ms ON MinecraftServers (ServerName);
```
Why does this help? Indexes let the database avoid scanning the whole table: it only needs to read through the index.
For example, let's take the following `PlayerVisits` table:
| ID | PlayerIP    | ServerURL   | Action | Location  |
|----|-------------|-------------|--------|-----------|
| 1  | 192.168.1.1 | coolserver  | Build  | Nether    |
| 2  | 192.168.1.2 | coolserver  | Dig    | Overworld |
| 3  | 192.168.1.3 | nonoserver  | NULL   | End       |
| 4  | 192.168.1.4 | otherserver | Build  | Nether    |
| 5  | 192.168.1.2 | coolserver  | NULL   | Overworld |
### Without an index

- If we were to search for `ServerURL = 'coolserver'`, the database would scan every row to check whether `ServerURL` matches `'coolserver'`. This is also known as a full table scan.
- As the table gets larger, the scan becomes slower.
### With an index on ServerURL

If we were to add an index to the `ServerURL` column, the database would create a sorted structure that could look something like this:
```
coolserver  -> [Row 1, Row 2, Row 5]
nonoserver  -> [Row 3]
otherserver -> [Row 4]
```
In this case, a lookup like `WHERE ServerURL = ?` can simply consult this structure to find the matching rows faster. You can compare it to a dictionary in programming: the key is the value you want to find, and the value tells you in which rows to find it. A full table scan, which is what you get without an index, just loops over the whole table.
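The dictionary analogy can be made concrete. This is a rough Python sketch (real indexes are B-trees on disk, not hash maps, but the access pattern is the same idea):

```python
# Toy "table": a list of (ID, PlayerIP, ServerURL) rows.
table = [
    (1, "192.168.1.1", "coolserver"),
    (2, "192.168.1.2", "coolserver"),
    (3, "192.168.1.3", "nonoserver"),
    (4, "192.168.1.4", "otherserver"),
    (5, "192.168.1.2", "coolserver"),
]

# Full table scan: touches every row, even the ones that don't match.
def scan(server_url):
    return [row for row in table if row[2] == server_url]

# "Index": ServerURL -> matching rows, built once up front.
index = {}
for row in table:
    index.setdefault(row[2], []).append(row)

def lookup(server_url):
    return index.get(server_url, [])  # one dict probe instead of a scan

print(lookup("coolserver"))  # same rows as scan("coolserver"), found faster
```

The scan does work proportional to the table size on every query; the index pays that cost once when it is built, then answers each lookup in (roughly) constant time.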
Indexes can also cover multiple columns. If we were to create an index on the `ServerURL`, `Location` combination, it would look something like this:
```
(coolserver, Nether)    -> [Row 1]
(coolserver, Overworld) -> [Row 2, Row 5]
(nonoserver, End)       -> [Row 3]
(otherserver, Nether)   -> [Row 4]
```
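In the same toy model, a multi-column index is just a map keyed on a tuple of the columns (again a sketch of the idea, not real index internals):

```python
# Toy rows mirroring the PlayerVisits table above: (ID, ServerURL, Location).
visits = [
    (1, "coolserver", "Nether"),
    (2, "coolserver", "Overworld"),
    (3, "nonoserver", "End"),
    (4, "otherserver", "Nether"),
    (5, "coolserver", "Overworld"),
]

# Composite index: (ServerURL, Location) -> list of row IDs.
composite = {}
for row_id, server_url, location in visits:
    composite.setdefault((server_url, location), []).append(row_id)

print(composite[("coolserver", "Overworld")])  # → [2, 5]
```

A query filtering on both columns, like the `WHERE Location IN (...)` plus the join condition in the optimised query, can then be answered from this one structure.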
### Important notes

- Indexes only speed up `SELECT` queries; they actually slow down `INSERT`, `UPDATE` and `DELETE`, since every write also has to update the index.
- Indexes require extra storage space in the DB.
### Final Notes

- **Monitor query performance**: always test the performance of your queries with and without indexes. Use tools like `EXPLAIN` in MySQL to analyse the execution plan of your queries and make sure that indexes are actually being used.
- **Only index what you need**: indexing too many columns can negatively impact performance, so only index columns that are often used in `JOIN`, `WHERE`, or `ORDER BY` clauses.
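You can see the execution plan change in a plain script. MySQL's `EXPLAIN` needs a running server, so this sketch uses SQLite's equivalent, `EXPLAIN QUERY PLAN`; the index mirrors `idx_pv` from above, minus the `(191)` prefix length, which is MySQL-specific:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE PlayerVisits (
    ID INTEGER PRIMARY KEY, PlayerIP TEXT, ServerURL TEXT,
    Action TEXT, Location TEXT)""")

query = "SELECT * FROM PlayerVisits WHERE ServerURL = ?"

# Without an index: the plan is a full table scan.
(before,) = [r[3] for r in con.execute("EXPLAIN QUERY PLAN " + query, ("coolserver",))]
print(before)  # e.g. "SCAN PlayerVisits"

# With the index: the plan switches to an index search.
con.execute("CREATE INDEX idx_pv ON PlayerVisits (ServerURL, Location)")
(after,) = [r[3] for r in con.execute("EXPLAIN QUERY PLAN " + query, ("coolserver",))]
print(after)   # e.g. "SEARCH PlayerVisits USING INDEX idx_pv (ServerURL=?)"
```

The exact wording differs between SQLite versions and between SQLite and MySQL, but the thing to look for is the same: a `SCAN` means the index is not being used for that query.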
### Results

The two indexes from the top of the Indexes section saved my 1st, unoptimized query **215 seconds** and my optimised query **4 seconds** (total 1st: **15s**, total 2nd: **13s**). Then you might ask the question: which measure was more effective? The indexes. But when you take into account that they make other actions slower, I'd suggest optimizing your queries first.