Problem
I have some assets in my application which are updated from time to time in an asynchronous fashion.
The example I'm gonna use here is Vehicles
. There are two tables:
Vehicles
: holds information about the vehicles themselvesVehicleUpdates
: holds information about all updates that happened for that vehicle.
The relevant parts of the table structure are:
CREATE TABLE `Vehicles` (
`id` varchar(50) NOT NULL,
`organizationId` varchar(50) NOT NULL,
`plate` char(7) NOT NULL,
`vehicleInfo` json DEFAULT NULL,
`createdAt` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`updatedAt` timestamp NULL DEFAULT NULL ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
UNIQUE KEY `unq_Vehicles_orgId_plate_idx` (`organizationId`,`plate`) USING BTREE,
KEY `Vehicles_createdAt_idx` (`createdAt`),
);
CREATE TABLE `VehicleUpdates` (
`id` varchar(50) NOT NULL,
`organizationId` varchar(50) NOT NULL,
`vehiclePlate` char(7) NOT NULL,
`status` varchar(15) NOT NULL,
`createdAt` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`updatedAt` timestamp NULL DEFAULT NULL ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
KEY `VehicleUpdates_orgId_vhclPlt_createdAt_idx` (`organizationId`,`vehiclePlate`,`createdAt`) USING BTREE
);
Now I have a new requirement in which I must return the latest update information along side the vehicle information itself.
Groupwise MAX Solutions
After digging a little, I've found this blog article. I then tried to use the suggested "uncorrelated subquery" approach, since it's deemed the best one:
Uncorrelated subquery
SELECT vu1.*
FROM VehicleUpdates AS vu1
JOIN
( SELECT vehiclePlate, organizationId, MAX(createdAt) AS createdAt
FROM VehicleUpdates
GROUP BY organizationId, vehiclePlate
) AS vu2 USING (organizationId, vehiclePlate, createdAt);
This query has an average execution time of 275 ms
in my production database.
I thought that was too slow, so I decided to give the "LEFT JOIN" approach a shot:
The Duds: LEFT JOIN
SELECT vu1.*
FROM VehicleUpdates AS vu1
LEFT JOIN VehicleUpdates AS vu2 ON vu1.organizationId = vu2.organizationId and vu1.vehiclePlate = vu2.vehiclePlate
AND vu2.createdAt > vu1.createdAt
WHERE vu2.id IS NULL;
This one performed way better, with an average execution time of 40 ms
. Good enough for me.
Then I needed to run this query as part of the query on the Vehicles
table.
Current results
The following query would satisfy my requirements:
SELECT v.*, vu1.*
FROM Vehicles AS v
LEFT JOIN VehicleUpdates AS vu1
ON v.plate = vu1.vehiclePlate
AND v.organizationId = vu1.organizationId
LEFT JOIN VehicleUpdates AS vu2
ON vu1.organizationId = vu2.organizationId
AND vu1.vehiclePlate = vu2.vehiclePlate
AND vu2.createdAt > vu1.createdAt
WHERE vu2.id IS NULL;
The problem is that it takes 20 s
(!) to run. Huge problem!
But I never do a full table scan on production. The query is always limited to a single organizationId
and it's paged, so I return at most 100 lines per page, so I ran the following query:
SELECT v.*, vu1.*
FROM Vehicles AS v
LEFT JOIN VehicleUpdates AS vu1
ON v.plate = vu1.vehiclePlate
AND v.organizationId = vu1.organizationId
LEFT JOIN VehicleUpdates AS vu2
ON vu1.organizationId = vu2.organizationId
AND vu1.vehiclePlate = vu2.vehiclePlate
AND vu2.createdAt > vu1.createdAt
WHERE vu2.id IS NULL
and v.organizationId = '<some organization ID>'
LIMIT 100;
Now it takes from 750 ms
to 11 s
to run, depending on how many vehicles are associated with. Not good enough.
Running explain
for the query above got me:
"select_type" | "table" | "type" | "possible_keys" | "key" | "key_len" | "ref" | "rows" | "filtered" | "Extra"
SIMPLE | v | ref | unq_Vehicles_orgId_plate_idx,Vehicles_orgId_status_idx | unq_Vehicles_orgId_plate_idx | "202" | const | 30 | 100 |
SIMPLE | vu1 | ALL | | | | | 263171 | 100 | Using where; Using join buffer (Block Nested Loop)
SIMPLE | vu2 | ref | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "173" | vu1.organizationId,vu1.vehiclePlate | 10 | 10 | Using where; Not exists; Using index
What strikes me is that the vu1
table is running a full table scan, even though the left-most table Vehicles
is being filtered using an indexed column organizationId
, which is also indexed in VehicleUpdates
.
So I decided to give the "uncorrelated subquery" another try and ran:
SELECT v.*, vu.*
FROM Vehicles AS v
LEFT JOIN (
SELECT vu1.*
FROM VehicleUpdates AS vu1
JOIN
( SELECT vehiclePlate, organizationId, MAX(createdAt) AS createdAt
FROM VehicleUpdates
GROUP BY organizationId, vehiclePlate
) AS vu2 USING (organizationId, vehiclePlate, createdAt)
) AS vu
ON vu.organizationId = v.organizationId
AND vu.vehiclePlate = v.plate
WHERE v.organizationId = '<SOME ORGANIZATION ID>'
LIMIT 100;
This time the execution time varies from 1.4 s
to 13 s
, depending on how many entries there was in the Vehicles
table for a given organizationId
. Unacceptable for my application.
Running explain
got me:
| "select_type" | "table" | "type" | "possible_keys" | "key" | "key_len" | "ref" | "rows" | "filtered" | "Extra"
| PRIMARY | v | ALL | | | | | 14456 | 100 |
| PRIMARY | <derived3> | ALL | | | | | 29289 | 100 | Using where
| PRIMARY | vu1 | ref | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "327" | vu2.organizationId,vu2.vehiclePlate,vu2.createdAt | 1 | 100 | Using where
| DERIVED | VehicleUpdates | range | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "323" | | 29289 | 100 | Using index for group-by
Current Results - UPDATED
I noticed that adding the specific organizationId
clause can increase the performance.
LEFT JOIN
Running:
SELECT v.*, vu1.*
FROM Vehicles AS v
LEFT JOIN VehicleUpdates AS vu1
ON v.plate = vu1.vehiclePlate
AND v.organizationId = vu1.organizationId
AND vu1.organizationId = '<SOME ORGANIZATION ID>' -- <--------
LEFT JOIN VehicleUpdates AS vu2
ON vu1.organizationId = vu2.organizationId
AND vu1.vehiclePlate = vu2.vehiclePlate
AND vu2.createdAt > vu1.createdAt
WHERE vu2.id IS NULL
and v.organizationId = '<SOME ORGANIZATION ID>' -- <-----------
LIMIT 100;
I get execution times varying from 65 ms
(acceptable) to 2.5 s
(not acceptable).
Uncorrelated query
Putting a organizationId = '<SOME ORGANIZATION ID>'
clause in the "main" query and the join external subquery:
SELECT v.*, vu.*
FROM Vehicles AS v
LEFT JOIN (
SELECT vu1.*
FROM VehicleUpdates AS vu1
JOIN
( SELECT vehiclePlate, organizationId, MAX(createdAt) AS createdAt
FROM VehicleUpdates
GROUP BY organizationId, vehiclePlate
) AS vu2 ON vu1.organizationId = vu2.organizationId
and vu1.vehiclePlate = vu2.vehiclePlate
and vu1.createdAt = vu2.createdAt
WHERE organizationId = '<SOME ORGANIZATION ID>' -- <--------
) AS vu
ON vu.organizationId = v.organizationId
AND vu.vehiclePlate = v.plate
where
v.organizationId = '<SOME ORGANIZATION ID>' -- <---------
LIMIT 100;
I get execution times varying from 450 ms
(not acceptable) to 900 ms
(not acceptable).
Putting a organizationId = '<SOME ORGANIZATION ID>'
clause in the "main" query and the join internal subquery:
SELECT v.*, vu.*
FROM Vehicles AS v
LEFT JOIN (
SELECT vu1.*
FROM VehicleUpdates AS vu1
JOIN
( SELECT vehiclePlate, organizationId, MAX(createdAt) AS createdAt
FROM VehicleUpdates
WHERE organizationId = '<SOME ORGANIZATION ID>' -- <--------
GROUP BY organizationId, vehiclePlate
) AS vu2 ON vu1.organizationId = vu2.organizationId
and vu1.vehiclePlate = vu2.vehiclePlate
and vu1.createdAt = vu2.createdAt
) AS vu
ON vu.organizationId = v.organizationId
AND vu.vehiclePlate = v.plate
where
v.organizationId = '<SOME ORGANIZATION ID>' -- <---------
LIMIT 100;
I get execution times varying from 225 ms
(acceptable) to 500 ms
(not acceptable).
Are there any better ways of handling such query?
Database Information
- MySQL
- Version: 5.7.23-log (Amazon RDS)
- Engine: InnoDB
2 Answers 2
I feel so stupid! Just found the problem.
For some reason, Vehicles
and VehicleUpdates
had different charsets (utf8mb4
and utf8
, respectively) in production.
That's the reason why the EXPLAIN
result for the "uncorrelated subquery" approach had a full table scan in one of its steps:
| "select_type" | "table" | "type" | "possible_keys" | "key" | "key_len" | "ref" | "rows" | "filtered" | "Extra"
| PRIMARY | v | ALL | | | | | 14456 | 100 |
| PRIMARY | <derived3> | ALL | | | | | 29289 | 100 | Using where
| PRIMARY | vu1 | ref | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "327" | vu2.organizationId,vu2.vehiclePlate,vu2.createdAt | 1 | 100 | Using where
| DERIVED | VehicleUpdates | range | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "323" | | 29289 | 100 | Using index for group-by
After converting VehicleUpdates
to utf8mb4
, the EXPLAIN
result is:
| "select_type" | "table" | "type" | "possible_keys" | "key" | "key_len" | "ref" | "rows" | "filtered" | "Extra"
| PRIMARY | v | ref | Vehicles_orgId_status_idx | Vehicles_orgId_status_idx | "202" | const | 188 | 100 |
| PRIMARY | <derived2> | ref | <auto_key1> | <auto_key1> | "230" | v.plate,v.organizationId | 10 | 100 |
| PRIMARY | vu1 | ref | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "234" | v.organizationId,vu2.vehiclePlate,vu2.createdAt | 1 | 100 | Using where
| DERIVED | VehicleUpdates | ref | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "202" | const | 24090 | 100 | Using where; Using index
Likewise, the "LEFT JOIN" approach execution plan changed from:
| "select_type" | "table" | "type" | "possible_keys" | "key" | "key_len" | "ref" | "rows" | "filtered" | "Extra"
| SIMPLE | v | ref | unq_Vehicles_orgId_plate_idx,Vehicles_orgId_status_idx | unq_Vehicles_orgId_plate_idx | "202" | const | 30 | 100 |
| SIMPLE | vu1 | ALL | | | | | 263171 | 100 | Using where; Using join buffer (Block Nested Loop)
| SIMPLE | vu2 | ref | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "173" | vu1.organizationId,vu1.vehiclePlate | 10 | 10 | Using where; Not exists; Using index
To:
| "select_type" | "table" | "type" | "possible_keys" | "key" | "key_len" | "ref" | "rows" | "filtered" | "Extra"
| SIMPLE | v | ref | Vehicles_orgId_status_idx | Vehicles_orgId_status_idx | "202" | const | 188 | 100 |
| SIMPLE | vu1 | ref | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "230" | v.organizationId,v.plate | 9 | 100 |
| SIMPLE | vu2 | ref | VehicleUpdates_orgId_vhclPlt_createdAt_idx | VehicleUpdates_orgId_vhclPlt_createdAt_idx | "230" | vu1.organizationId,vu1.vehiclePlate | 9 | 10 | Using where; Not exists; Using index
So, the performance now for the different queries are:
LEFT JOIN
SELECT v.*, vu1.*
FROM Vehicles AS v
LEFT JOIN VehicleUpdates AS vu1
ON v.plate = vu1.vehiclePlate
AND v.organizationId = vu1.organizationId
LEFT JOIN VehicleUpdates AS vu2
ON vu1.organizationId = vu2.organizationId
AND vu1.vehiclePlate = vu2.vehiclePlate
AND vu2.createdAt > vu1.createdAt
where v.organizationId = '<SOME ORGANIZATION ID>'
AND vu2.id IS NULL
LIMIT 100;
Always runs bellow 50 ms
.
Uncorrelated subquery without WHERE in the inner query:
SELECT v.*, vu1.*
FROM Vehicles AS v
LEFT JOIN (
VehicleUpdates AS vu1
JOIN
( SELECT vehiclePlate, organizationId, MAX(createdAt) AS createdAt
FROM VehicleUpdates
GROUP BY organizationId, vehiclePlate
) AS vu2 ON vu1.organizationId = vu2.organizationId
and vu1.vehiclePlate = vu2.vehiclePlate
and vu1.createdAt = vu2.createdAt
)
ON vu1.organizationId = v.organizationId
AND vu1.vehiclePlate = v.plate
where
v.organizationId = '<SOME ORGANIZATION ID>'
LIMIT 100;
Runs on average in 300 ms
.
Uncorrelated subquery with WHERE in the inner query:
SELECT v.*, vu1.*
FROM Vehicles AS v
LEFT JOIN (
VehicleUpdates AS vu1
JOIN
( SELECT vehiclePlate, organizationId, MAX(createdAt) AS createdAt
FROM VehicleUpdates
WHERE organizationId = '<SOME ORGANIZATION ID>' -- <-- HERE
GROUP BY organizationId, vehiclePlate
) AS vu2 ON vu1.organizationId = vu2.organizationId
and vu1.vehiclePlate = vu2.vehiclePlate
and vu1.createdAt = vu2.createdAt
)
ON vu1.organizationId = v.organizationId
AND vu1.vehiclePlate = v.plate
where
v.organizationId = '<SOME ORGANIZATION ID>'
LIMIT 100;
Also always runs bellow 50 ms
.
I decided to stick with the "LEFT JOIN" approach because it allows me to create a view to represent the inner query, so I can simplify the query for returning the vehicles.
I cannot do this with the "uncorrelated subquery", because it requires the WHERE organizationId = '<ORGANIZATION ID>'
clause in the inner query, so the view would not be as efficient.
-
1@RickJames looks like theses results go against what could be observed before 5.7: LEFT JOIN has a better performance than the Uncorrelated Subquery.Henrique Barcelos– Henrique Barcelos2019年02月14日 18:38:38 +00:00Commented Feb 14, 2019 at 18:38
-
Hmmm -- Thanks for discovering that and pointing it out.Rick James– Rick James2019年02月14日 18:51:32 +00:00Commented Feb 14, 2019 at 18:51
You are definitely on the right track with putting the organizationId
into strategic parts of your query.
Try this slightly modified version of your "uncorrelated subquery" query:
SELECT v.*, vu1.*
FROM Vehicles AS v
LEFT JOIN (
VehicleUpdates AS vu1
JOIN
( SELECT vehiclePlate, organizationId, MAX(createdAt) AS createdAt
FROM VehicleUpdates
WHERE organizationId = 'A100'
GROUP BY organizationId, vehiclePlate
) AS vu2 ON vu1.organizationId = vu2.organizationId
and vu1.vehiclePlate = vu2.vehiclePlate
and vu1.createdAt = vu2.createdAt
)
ON vu1.organizationId = v.organizationId
AND vu1.vehiclePlate = v.plate
where
v.organizationId = 'A100'
LIMIT 100;
This version of the query avoids the intermediate SELECT *
on the LEFT JOINed table, and so not have to lose speed with an intermediate result.
The only difference between my query and yours is I dropped the SELECT vu1.* FROM
and AS vu
from the query, and the updated all the occurrences of vu
to vu1
.
It is useful little trick that help out in situations where you need to use an INNER JOIN on a table that needs to be LEFT JOINed to the main query, which is exactly your case.
-
I didn't know this was possible. It seems like it sped up the queries with less results (from ~ 190 ms to ~ 150ms), but had no effect on the ones with more results (still 500ms).Henrique Barcelos– Henrique Barcelos2019年02月14日 13:50:06 +00:00Commented Feb 14, 2019 at 13:50
-
Yeah, I was shocked when I first realized that this was possible a couple years back. Handy thing though. As for the speed, is it possible to make
(organizationId,vehiclePlate, createdAt)
the primary key ofVehicleUpdates
? If so, it should reduce the speed loss caused by the secondary lookup that is occurring for all of your query attempts as a result of theSELECT vu.*
. Alternatively, you can try adding all the remaining columns to your index in that table. Making it your PK is better, but adding the columns may work as well.Willem Renzema– Willem Renzema2019年02月14日 16:31:06 +00:00Commented Feb 14, 2019 at 16:31
organizationId
isVARCHAR(50)
in one table andVARCHAR(100)
in the other. Fix that for start and run your tests again.LEFT JOIN
. It may have to re-evaluate that many times. Can you change toJOIN
without ruining the output?