5

Problem

I have some assets in my application which are updated from time to time in an asynchronous fashion.

The example I'm gonna use here is Vehicles. There are two tables:

  • Vehicles: holds information about the vehicles themselves
  • VehicleUpdates: holds information about all updates that happened for that vehicle.

The relevant parts of the table structure are:

CREATE TABLE `Vehicles` (
 `id` varchar(50) NOT NULL,
 `organizationId` varchar(50) NOT NULL,
 `plate` char(7) NOT NULL,
 `vehicleInfo` json DEFAULT NULL,
 `createdAt` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
 `updatedAt` timestamp NULL DEFAULT NULL ON UPDATE CURRENT_TIMESTAMP,
 PRIMARY KEY (`id`),
 UNIQUE KEY `unq_Vehicles_orgId_plate_idx` (`organizationId`,`plate`) USING BTREE,
 KEY `Vehicles_createdAt_idx` (`createdAt`),
);
CREATE TABLE `VehicleUpdates` (
 `id` varchar(50) NOT NULL,
 `organizationId` varchar(50) NOT NULL,
 `vehiclePlate` char(7) NOT NULL,
 `status` varchar(15) NOT NULL,
 `createdAt` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
 `updatedAt` timestamp NULL DEFAULT NULL ON UPDATE CURRENT_TIMESTAMP,
 PRIMARY KEY (`id`),
 KEY `VehicleUpdates_orgId_vhclPlt_createdAt_idx` (`organizationId`,`vehiclePlate`,`createdAt`) USING BTREE
);

Now I have a new requirement in which I must return the latest update information along side the vehicle information itself.

Groupwise MAX Solutions

After digging a little, I've found this blog article. I then tried to use the suggested "uncorrelated subquery" approach, since it's deemed the best one:

Uncorrelated subquery

SELECT vu1.*
 FROM VehicleUpdates AS vu1
 JOIN
 ( SELECT vehiclePlate, organizationId, MAX(createdAt) AS createdAt
 FROM VehicleUpdates
 GROUP BY organizationId, vehiclePlate
 ) AS vu2 USING (organizationId, vehiclePlate, createdAt);

This query has an average execution time of 275 ms in my production database.

I thought that was too slow, so I decided to give the "LEFT JOIN" approach a shot:

The Duds: LEFT JOIN

SELECT vu1.*
 FROM VehicleUpdates AS vu1
 LEFT JOIN VehicleUpdates AS vu2 ON vu1.organizationId = vu2.organizationId and vu1.vehiclePlate = vu2.vehiclePlate
 AND vu2.createdAt > vu1.createdAt
 WHERE vu2.id IS NULL;

This one performed way better, with an average execution time of 40 ms. Good enough for me.

Then I needed to run this query as part of the query on the Vehicles table.

Current results

The following query would satisfy my requirements:

SELECT v.*, vu1.*
FROM Vehicles AS v
LEFT JOIN VehicleUpdates AS vu1 
 ON v.plate = vu1.vehiclePlate 
 AND v.organizationId = vu1.organizationId
LEFT JOIN VehicleUpdates AS vu2 
 ON vu1.organizationId = vu2.organizationId 
 AND vu1.vehiclePlate = vu2.vehiclePlate
 AND vu2.createdAt > vu1.createdAt
WHERE vu2.id IS NULL;

The problem is that it takes 20 s (!) to run. Huge problem!

But I never do a full table scan on production. The query is always limited to a single organizationId and it's paged, so I return at most 100 lines per page, so I ran the following query:

SELECT v.*, vu1.*
FROM Vehicles AS v
LEFT JOIN VehicleUpdates AS vu1 
 ON v.plate = vu1.vehiclePlate 
 AND v.organizationId = vu1.organizationId
LEFT JOIN VehicleUpdates AS vu2 
 ON vu1.organizationId = vu2.organizationId 
 AND vu1.vehiclePlate = vu2.vehiclePlate
 AND vu2.createdAt > vu1.createdAt
WHERE vu2.id IS NULL
 and v.organizationId = '<some organization ID>'
LIMIT 100;

Now it takes from 750 ms to 11 s to run, depending on how many vehicles are associated with. Not good enough.

Running explain for the query above got me:

"select_type" | "table" | "type" | "possible_keys" | "key" | "key_len" | "ref" | "rows" | "filtered" | "Extra"
SIMPLE | v | ref | unq_Vehicles_orgId_plate_idx,Vehicles_orgId_status_idx | unq_Vehicles_orgId_plate_idx | "202" | const | 30 | 100 |
SIMPLE | vu1 | ALL | | | | | 263171 | 100 | Using where; Using join buffer (Block Nested Loop)
SIMPLE | vu2 | ref | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "173" | vu1.organizationId,vu1.vehiclePlate | 10 | 10 | Using where; Not exists; Using index

What strikes me is that the vu1 table is running a full table scan, even though the left-most table Vehicles is being filtered using an indexed column organizationId, which is also indexed in VehicleUpdates.

So I decided to give the "uncorrelated subquery" another try and ran:

SELECT v.*, vu.*
FROM Vehicles AS v
LEFT JOIN (
 SELECT vu1.*
 FROM VehicleUpdates AS vu1
 JOIN
 ( SELECT vehiclePlate, organizationId, MAX(createdAt) AS createdAt
 FROM VehicleUpdates
 GROUP BY organizationId, vehiclePlate
 ) AS vu2 USING (organizationId, vehiclePlate, createdAt)
) AS vu 
 ON vu.organizationId = v.organizationId 
 AND vu.vehiclePlate = v.plate
WHERE v.organizationId = '<SOME ORGANIZATION ID>'
LIMIT 100;

This time the execution time varies from 1.4 s to 13 s, depending on how many entries there was in the Vehicles table for a given organizationId. Unacceptable for my application.

Running explain got me:

| "select_type" | "table" | "type" | "possible_keys" | "key" | "key_len" | "ref" | "rows" | "filtered" | "Extra"
| PRIMARY | v | ALL | | | | | 14456 | 100 |
| PRIMARY | <derived3> | ALL | | | | | 29289 | 100 | Using where
| PRIMARY | vu1 | ref | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "327" | vu2.organizationId,vu2.vehiclePlate,vu2.createdAt | 1 | 100 | Using where
| DERIVED | VehicleUpdates | range | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "323" | | 29289 | 100 | Using index for group-by

Current Results - UPDATED

I noticed that adding the specific organizationId clause can increase the performance.

LEFT JOIN

Running:

SELECT v.*, vu1.*
FROM Vehicles AS v
LEFT JOIN VehicleUpdates AS vu1 
 ON v.plate = vu1.vehiclePlate 
 AND v.organizationId = vu1.organizationId
 AND vu1.organizationId = '<SOME ORGANIZATION ID>' -- <--------
LEFT JOIN VehicleUpdates AS vu2 
 ON vu1.organizationId = vu2.organizationId 
 AND vu1.vehiclePlate = vu2.vehiclePlate
 AND vu2.createdAt > vu1.createdAt
WHERE vu2.id IS NULL
 and v.organizationId = '<SOME ORGANIZATION ID>' -- <-----------
LIMIT 100;

I get execution times varying from 65 ms (acceptable) to 2.5 s (not acceptable).

Uncorrelated query

Putting a organizationId = '<SOME ORGANIZATION ID>' clause in the "main" query and the join external subquery:

SELECT v.*, vu.*
FROM Vehicles AS v
LEFT JOIN (
 SELECT vu1.*
 FROM VehicleUpdates AS vu1
 JOIN
 ( SELECT vehiclePlate, organizationId, MAX(createdAt) AS createdAt
 FROM VehicleUpdates
 GROUP BY organizationId, vehiclePlate
 ) AS vu2 ON vu1.organizationId = vu2.organizationId
 and vu1.vehiclePlate = vu2.vehiclePlate
 and vu1.createdAt = vu2.createdAt
 WHERE organizationId = '<SOME ORGANIZATION ID>' -- <--------
 ) AS vu 
 ON vu.organizationId = v.organizationId 
 AND vu.vehiclePlate = v.plate
where
 v.organizationId = '<SOME ORGANIZATION ID>' -- <---------
LIMIT 100;

I get execution times varying from 450 ms (not acceptable) to 900 ms (not acceptable).

Putting a organizationId = '<SOME ORGANIZATION ID>' clause in the "main" query and the join internal subquery:

SELECT v.*, vu.*
FROM Vehicles AS v
LEFT JOIN (
 SELECT vu1.*
 FROM VehicleUpdates AS vu1
 JOIN
 ( SELECT vehiclePlate, organizationId, MAX(createdAt) AS createdAt
 FROM VehicleUpdates
 WHERE organizationId = '<SOME ORGANIZATION ID>' -- <--------
 GROUP BY organizationId, vehiclePlate
 ) AS vu2 ON vu1.organizationId = vu2.organizationId
 and vu1.vehiclePlate = vu2.vehiclePlate
 and vu1.createdAt = vu2.createdAt
 ) AS vu 
 ON vu.organizationId = v.organizationId 
 AND vu.vehiclePlate = v.plate
where
 v.organizationId = '<SOME ORGANIZATION ID>' -- <---------
LIMIT 100;

I get execution times varying from 225 ms (acceptable) to 500 ms (not acceptable).


Are there any better ways of handling such query?

Database Information

  • MySQL
  • Version: 5.7.23-log (Amazon RDS)
  • Engine: InnoDB
asked Feb 13, 2019 at 17:00
7
  • 3
    organizationId is VARCHAR(50) in one table and VARCHAR(100) in the other. Fix that for start and run your tests again. Commented Feb 13, 2019 at 17:23
  • Good eye! I missed that. I'll update the question with the results from the updated schema. Commented Feb 13, 2019 at 17:41
  • Thanks for putting my blog to the test. I have suspected that changes to the Optimizer have messed with my conclusions. Did you run each timing test twice, and take the second result? (The first gets I/O done; the second sees things cached.) Commented Feb 14, 2019 at 1:27
  • The final two queries are at risk from a different direction: a derived table in a LEFT JOIN. It may have to re-evaluate that many times. Can you change to JOIN without ruining the output? Commented Feb 14, 2019 at 1:29
  • Did you grab the Handler STATUS values? A brief note on it is here . Commented Feb 14, 2019 at 1:35

2 Answers 2

2

I feel so stupid! Just found the problem.

For some reason, Vehicles and VehicleUpdates had different charsets (utf8mb4 and utf8, respectively) in production.

That's the reason why the EXPLAIN result for the "uncorrelated subquery" approach had a full table scan in one of its steps:

| "select_type" | "table" | "type" | "possible_keys" | "key" | "key_len" | "ref" | "rows" | "filtered" | "Extra"
| PRIMARY | v | ALL | | | | | 14456 | 100 |
| PRIMARY | <derived3> | ALL | | | | | 29289 | 100 | Using where
| PRIMARY | vu1 | ref | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "327" | vu2.organizationId,vu2.vehiclePlate,vu2.createdAt | 1 | 100 | Using where
| DERIVED | VehicleUpdates | range | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "323" | | 29289 | 100 | Using index for group-by

After converting VehicleUpdates to utf8mb4, the EXPLAIN result is:

| "select_type" | "table" | "type" | "possible_keys" | "key" | "key_len" | "ref" | "rows" | "filtered" | "Extra"
| PRIMARY | v | ref | Vehicles_orgId_status_idx | Vehicles_orgId_status_idx | "202" | const | 188 | 100 |
| PRIMARY | <derived2> | ref | <auto_key1> | <auto_key1> | "230" | v.plate,v.organizationId | 10 | 100 |
| PRIMARY | vu1 | ref | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "234" | v.organizationId,vu2.vehiclePlate,vu2.createdAt | 1 | 100 | Using where
| DERIVED | VehicleUpdates | ref | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "202" | const | 24090 | 100 | Using where; Using index

Likewise, the "LEFT JOIN" approach execution plan changed from:

| "select_type" | "table" | "type" | "possible_keys" | "key" | "key_len" | "ref" | "rows" | "filtered" | "Extra"
| SIMPLE | v | ref | unq_Vehicles_orgId_plate_idx,Vehicles_orgId_status_idx | unq_Vehicles_orgId_plate_idx | "202" | const | 30 | 100 |
| SIMPLE | vu1 | ALL | | | | | 263171 | 100 | Using where; Using join buffer (Block Nested Loop)
| SIMPLE | vu2 | ref | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "173" | vu1.organizationId,vu1.vehiclePlate | 10 | 10 | Using where; Not exists; Using index

To:

| "select_type" | "table" | "type" | "possible_keys" | "key" | "key_len" | "ref" | "rows" | "filtered" | "Extra"
| SIMPLE | v | ref | Vehicles_orgId_status_idx | Vehicles_orgId_status_idx | "202" | const | 188 | 100 |
| SIMPLE | vu1 | ref | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "230" | v.organizationId,v.plate | 9 | 100 |
| SIMPLE | vu2 | ref | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "230" | vu1.organizationId,vu1.vehiclePlate | 9 | 10 | Using where; Not exists; Using index

So, the performance now for the different queries are:

LEFT JOIN

SELECT v.*, vu1.*
FROM Vehicles AS v
LEFT JOIN VehicleUpdates AS vu1 
 ON v.plate = vu1.vehiclePlate 
 AND v.organizationId = vu1.organizationId
LEFT JOIN VehicleUpdates AS vu2 
 ON vu1.organizationId = vu2.organizationId 
 AND vu1.vehiclePlate = vu2.vehiclePlate
 AND vu2.createdAt > vu1.createdAt
where v.organizationId = '<SOME ORGANIZATION ID>'
 AND vu2.id IS NULL
LIMIT 100;

Always runs bellow 50 ms.

Uncorrelated subquery without WHERE in the inner query:

SELECT v.*, vu1.* 
FROM Vehicles AS v
LEFT JOIN (
 VehicleUpdates AS vu1
 JOIN
 ( SELECT vehiclePlate, organizationId, MAX(createdAt) AS createdAt
 FROM VehicleUpdates
 GROUP BY organizationId, vehiclePlate
 ) AS vu2 ON vu1.organizationId = vu2.organizationId
 and vu1.vehiclePlate = vu2.vehiclePlate
 and vu1.createdAt = vu2.createdAt
 )
 ON vu1.organizationId = v.organizationId 
 AND vu1.vehiclePlate = v.plate
where
 v.organizationId = '<SOME ORGANIZATION ID>'
LIMIT 100;

Runs on average in 300 ms.

Uncorrelated subquery with WHERE in the inner query:

SELECT v.*, vu1.* 
FROM Vehicles AS v
LEFT JOIN (
 VehicleUpdates AS vu1
 JOIN
 ( SELECT vehiclePlate, organizationId, MAX(createdAt) AS createdAt
 FROM VehicleUpdates
 WHERE organizationId = '<SOME ORGANIZATION ID>' -- <-- HERE
 GROUP BY organizationId, vehiclePlate
 ) AS vu2 ON vu1.organizationId = vu2.organizationId
 and vu1.vehiclePlate = vu2.vehiclePlate
 and vu1.createdAt = vu2.createdAt
 )
 ON vu1.organizationId = v.organizationId 
 AND vu1.vehiclePlate = v.plate
where
 v.organizationId = '<SOME ORGANIZATION ID>'
LIMIT 100;

Also always runs bellow 50 ms.


I decided to stick with the "LEFT JOIN" approach because it allows me to create a view to represent the inner query, so I can simplify the query for returning the vehicles.

I cannot do this with the "uncorrelated subquery", because it requires the WHERE organizationId = '<ORGANIZATION ID>' clause in the inner query, so the view would not be as efficient.

answered Feb 14, 2019 at 18:11
2
  • 1
    @RickJames looks like theses results go against what could be observed before 5.7: LEFT JOIN has a better performance than the Uncorrelated Subquery. Commented Feb 14, 2019 at 18:38
  • Hmmm -- Thanks for discovering that and pointing it out. Commented Feb 14, 2019 at 18:51
0

You are definitely on the right track with putting the organizationId into strategic parts of your query.

Try this slightly modified version of your "uncorrelated subquery" query:

SELECT v.*, vu1.*
FROM Vehicles AS v
LEFT JOIN (
 VehicleUpdates AS vu1
 JOIN
 ( SELECT vehiclePlate, organizationId, MAX(createdAt) AS createdAt
 FROM VehicleUpdates
 WHERE organizationId = 'A100'
 GROUP BY organizationId, vehiclePlate
 ) AS vu2 ON vu1.organizationId = vu2.organizationId
 and vu1.vehiclePlate = vu2.vehiclePlate
 and vu1.createdAt = vu2.createdAt
 )
 ON vu1.organizationId = v.organizationId 
 AND vu1.vehiclePlate = v.plate
where
 v.organizationId = 'A100'
LIMIT 100;

This version of the query avoids the intermediate SELECT * on the LEFT JOINed table, and so not have to lose speed with an intermediate result.

The only difference between my query and yours is I dropped the SELECT vu1.* FROM and AS vu from the query, and the updated all the occurrences of vu to vu1.

It is useful little trick that help out in situations where you need to use an INNER JOIN on a table that needs to be LEFT JOINed to the main query, which is exactly your case.

answered Feb 13, 2019 at 23:27
2
  • I didn't know this was possible. It seems like it sped up the queries with less results (from ~ 190 ms to ~ 150ms), but had no effect on the ones with more results (still 500ms). Commented Feb 14, 2019 at 13:50
  • Yeah, I was shocked when I first realized that this was possible a couple years back. Handy thing though. As for the speed, is it possible to make (organizationId,vehiclePlate, createdAt) the primary key of VehicleUpdates? If so, it should reduce the speed loss caused by the secondary lookup that is occurring for all of your query attempts as a result of the SELECT vu.*. Alternatively, you can try adding all the remaining columns to your index in that table. Making it your PK is better, but adding the columns may work as well. Commented Feb 14, 2019 at 16:31

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.