Once your SQL database grows past roughly 100K rows, queries that used to feel instant can become really slow. This is most noticeable with joins, aggregations, and complex filtering.

Query optimisation

I am going to talk about this using the following DB as an example:

    erDiagram
    MINECRAFT_SERVERS {
        int ID PK
        string Name
        string ServerURL
    }
    PLAYER_VISITS {
        int ID PK
        string PlayerIP
        string ServerURL FK
        string Action
        string Location
    }

    %% Relationships
    MINECRAFT_SERVERS ||--o{ PLAYER_VISITS : "has visits"
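
In SQL, that schema could be sketched as follows. This DDL is not from the original tables; the column types and lengths are assumptions, and the table names follow the spelling used in the queries below:

```sql
-- Hypothetical DDL matching the diagram above
CREATE TABLE MinecraftServers (
    ID        INT AUTO_INCREMENT PRIMARY KEY,
    Name      VARCHAR(255) NOT NULL,
    ServerURL VARCHAR(255) NOT NULL
);

CREATE TABLE PlayerVisits (
    ID        INT AUTO_INCREMENT PRIMARY KEY,
    PlayerIP  VARCHAR(45)  NOT NULL,  -- long enough for IPv6
    ServerURL VARCHAR(255) NOT NULL,  -- references MinecraftServers.ServerURL
    Action    VARCHAR(64)  NULL,      -- NULL when the visit had no action
    Location  VARCHAR(32)  NULL
);
```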

And for our example we want to run the following query:

SELECT 
    m.*,
    COUNT(DISTINCT pv.PlayerIP) AS UniquePlayerCount,
    COUNT(CASE WHEN pv.Action IS NOT NULL THEN 1 ELSE NULL END) AS ActionCount
FROM MinecraftServers AS m
LEFT JOIN PlayerVisits AS pv
    ON pv.ServerURL = m.ServerURL
GROUP BY m.ID;

This might run fine on a 100-row database, but at 100K+ rows it'll crawl. I ran this query in phpMyAdmin, selecting only the top 25 rows, and it took 3 minutes 48 seconds. You don't want a query that takes this long to finish.

Let's look at this next query:

SELECT 
    m.Name,
    m.ID,
    m.ServerURL,
    IFNULL(pv_stats.UniquePlayerCount, 0) AS UniquePlayerCount,
    IFNULL(pv_stats.ActionCount, 0) AS ActionCount
FROM MinecraftServers AS m
LEFT JOIN (
    SELECT 
        ServerURL,
        COUNT(DISTINCT PlayerIP) AS UniquePlayerCount,
        COUNT(CASE WHEN Action IS NOT NULL THEN 1 ELSE NULL END) AS ActionCount
    FROM PlayerVisits
    WHERE Location IN ('Overworld', 'Nether')
    GROUP BY ServerURL
) AS pv_stats ON pv_stats.ServerURL = m.ServerURL;

This query is a big improvement. Before I tell you how long it took, let's go over what changed.

  1. Take only what you need: with m.Name, m.ID, m.ServerURL we select only the fields we are actually going to use. This is much better than pulling every column from MinecraftServers: the DB reads less data, and the application that requested it uses less memory.
  2. More filtering: we only count visits from the Overworld and the Nether, which lowers the number of rows the DB has to process.
  3. IFNULL instead of CASE: IFNULL(x, 0) is a simpler expression than a full CASE, and it is all we need here to turn the NULLs produced by the LEFT JOIN (servers with no visits) into 0.
  4. Subqueries: by moving the aggregation of UniquePlayerCount and ActionCount into the pv_stats subquery, the grouping happens before the join. We then join only the summarised rows, not the whole PlayerVisits table.
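
As a small aside (an equivalent rewrite, not something the measured query used): COUNT already ignores NULLs on its own, so the CASE expression in the subquery can be shortened to a plain column count:

```sql
-- Equivalent to COUNT(CASE WHEN Action IS NOT NULL THEN 1 ELSE NULL END)
COUNT(Action) AS ActionCount
```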

This updated query took 17 seconds on the same dataset.

Table optimisation

Indexes

You can optimise your DB tables by using indexes. You should use them on the columns you access a lot, so for the example from before I'd create the following indexes:

For the PlayerVisits table:

CREATE INDEX idx_pv ON PlayerVisits (ServerURL(191), Location);

The (191) prefix indexes only the first 191 characters of ServerURL; on older MySQL/InnoDB configurations this keeps a utf8mb4 column (up to 4 bytes per character) within the 767-byte index key limit.

For the MinecraftServers table:

CREATE INDEX idx_ms ON MinecraftServers (ServerURL);

Why does this help? Indexes let the database avoid scanning the whole table: instead of reading every row, it only walks through the index.

For example, letā€™s take the following PlayerVisits table:

| ID | PlayerIP    | ServerURL   | Action | Location  |
|----|-------------|-------------|--------|-----------|
| 1  | 192.168.1.1 | coolserver  | Build  | Nether    |
| 2  | 192.168.1.2 | coolserver  | Dig    | Overworld |
| 3  | 192.168.1.3 | nonoserver  | NULL   | End       |
| 4  | 192.168.1.4 | otherserver | Build  | Nether    |
| 5  | 192.168.1.2 | coolserver  | NULL   | Overworld |

Without an index

  • If we were to search for ServerURL = 'coolserver', the database would scan every row to check whether ServerURL matches 'coolserver'. This is known as a full table scan.
  • As the table gets larger, the scan becomes slower.

With an index on ServerURL

If we were to add an index to the ServerURL column, the database would build a sorted structure that could look something like this:

coolserver -> [Row 1, Row 2, Row 5]
nonoserver -> [Row 3]
otherserver -> [Row 4]

In this case, a lookup like WHERE ServerURL = ? can jump straight to the matching entry. You can compare it to a dictionary in programming: the key is the value you want to find, and the value is the list of rows where it appears. A full table scan, which is what you get without an index, just loops over the whole table instead.

Indexes can also span multiple columns. If we were to create an index on the (ServerURL, Location) combination, it would look something like this:

(coolserver, Nether) -> [Row 1]
(coolserver, Overworld) -> [Row 2, Row 5]
(nonoserver, End) -> [Row 3]
(otherserver, Nether) -> [Row 4]
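
That two-column structure corresponds to a composite index on (ServerURL, Location). A query filtering on both columns, like this hypothetical one against the example table, can be answered from those index entries directly:

```sql
-- Served by an index on (ServerURL, Location):
SELECT COUNT(*)
FROM PlayerVisits
WHERE ServerURL = 'coolserver'
  AND Location  = 'Overworld';
```

Note that MySQL uses composite indexes left to right: a filter on ServerURL alone can still use this index, but a filter on only Location cannot.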

Important note

  1. Indexes only speed up SELECT; they actually slow down INSERT, UPDATE and DELETE, since every write also has to update the index.
  2. Indexes require extra storage space in the DB.
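
One practical consequence of point 1 (a general MySQL technique, not something measured in this article): when bulk-loading a large number of rows, it can be faster to drop the index first and rebuild it once at the end, using the visits index from above as an example:

```sql
-- Drop the index before a big bulk load...
DROP INDEX idx_pv ON PlayerVisits;

-- ...run the bulk INSERTs here...

-- ...then rebuild it once, instead of updating it on every row:
CREATE INDEX idx_pv ON PlayerVisits (ServerURL(191), Location);
```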

Final Notes

  1. Monitor Query Performance: Always test the performance of your queries with and without indexes. Use tools like EXPLAIN in MySQL to analyse the execution plan of your queries and make sure that indexes are being used properly.
  2. Only Index What You Need: Indexing too many columns can negatively impact performance, so only index columns that are often used in JOIN, WHERE, or ORDER BY.
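
For example, prefixing a query with EXPLAIN asks MySQL for its execution plan instead of running it (the exact output columns vary by MySQL version; this is a sketch of how to read it, not captured output):

```sql
EXPLAIN
SELECT 
    m.*,
    COUNT(DISTINCT pv.PlayerIP) AS UniquePlayerCount
FROM MinecraftServers AS m
LEFT JOIN PlayerVisits AS pv
    ON pv.ServerURL = m.ServerURL
GROUP BY m.ID;
```

In the output, a type of ALL on the PlayerVisits row means a full table scan; after creating an index on ServerURL you should see type ref with the index name in the key column instead.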

Results

The two indexes from the Indexes section above shaved 215 seconds off my first, unoptimised query and 4 seconds off my optimised one (first query total: 15s; second: 13s). So which measure was more effective? The indexes. But once you take into account that they slow down other operations, I'd suggest optimising your queries first.