I need a "little" help with an SQL query (MySQL).
I have the following tables:
COURIERS
table:
+------------+ | COURIER_ID | +------------+
DELIVERIES
table:
+-------------+------------+------------+ | DELIVERY_ID | COURIER_ID | START_DATE | +-------------+------------+------------+
ORDERS
table:
+----------+-------------+-------------+ | ORDER_ID | DELIVERY_ID | FINISH_DATE | +----------+-------------+-------------+
COORDINATES
table:
+-------------+-----+-----+------+ | DELIVERY_ID | LAT | LNG | DATE | +-------------+-----+-----+------+
In the real database I have more columns in each table, but for this example the above columns are enough.
What do I need?
An SQL query that returns all couriers [
COURIER_ID
], their last delivery [DELIVERY_ID
] (based on lastSTART_DATE
), the delivery's last coordinate [LAT
andLNG
] (based on lastDATE
) and the remaining orders count (total of orders of the last delivery that have noFINISH_DATE
).A courier can have no deliveries, in this case I want
DELIVERY_ID
= NULL,LAT
= NULL andLNG
= NULL in the result.A delivery can have no coordinates, in this case I want
LAT
= NULL andLNG
= NULL in the result.
What was I able to do?
SELECT c.`COURIER_ID`,
d.`DELIVERY_ID`,
r.`LAT`,
r.`LNG`,
(SELECT COUNT(DISTINCT `ORDER_ID`)
FROM `ORDERS`
WHERE `DELIVERY_ID` = d.`DELIVERY_ID`
AND `FINISH_DATE` IS NULL) AS REMAINING_ORDERS
FROM `COURIERS` AS c
LEFT JOIN `DELIVERIES` AS d USING (`COURIER_ID`)
LEFT JOIN `COORDINATES` AS r ON r.`DELIVERY_ID` = d.`DELIVERY_ID`
WHERE (CASE WHEN
(SELECT MAX(`START_DATE`)
FROM `DELIVERIES`
WHERE `COURIER_ID` = c.`COURIER_ID`) IS NULL THEN d.`START_DATE` IS NULL ELSE d.`START_DATE` =
(SELECT MAX(`START_DATE`)
FROM `DELIVERIES`
WHERE `COURIER_ID` = c.`COURIER_ID`) END)
AND (CASE WHEN
(SELECT MAX(`DATE`)
FROM `COORDINATES`
WHERE `DELIVERY_ID` = d.`DELIVERY_ID`) IS NULL THEN r.`DATE` IS NULL ELSE r.`DATE` =
(SELECT MAX(`DATE`)
FROM `COORDINATES`
WHERE `DELIVERY_ID` = d.`DELIVERY_ID`) END)
GROUP BY c.`COURIER_ID`
ORDER BY d.`START_DATE` DESC
The problem is that this query is very slow (from 5 to 20 seconds) when I have over 5k COORDINATES
and it does not returns all couriers sometimes.
3 Answers 3
According to your description, you are only interested in each courier's delivery with the latest start date (if one exists) and each delivery's latest coordinates (if one exists). So, you should be able to simplify things by limiting those tables before joining.
Ignoring the remaining order count column for now...
SELECT c.`COURIER_ID`,
last_delivery_by_courier.`DELIVERY_ID`,
last_coordinates_by_delivery.`LAT`,
last_coordinates_by_delivery.`LNG`
FROM `COURIERS` AS c
LEFT JOIN (
SELECT d.`COURIER_ID` AS `COURIER_ID`,
MAX(d.`DELIVERY_ID`) AS `DELIVERY_ID`
FROM `DELIVERIES` AS d
GROUP BY d.`COURIER_ID`
ORDER BY d.`START_DATE` DESC
LIMIT 1
) AS last_delivery_by_courier
ON last_delivery_by_courier.`COURIER_ID` = c.`COURIER_ID`
LEFT JOIN (
SELECT coord.`DELIVERY_ID` AS `DELIVERY_ID`,
AVG(coord.`LAT`) AS `LAT`,
AVG(coord.`LNG`) AS `LNG`
FROM `COORDINATES` AS coord
GROUP BY coord.`DELIVERY_ID`
ORDER BY coord.`DATE` DESC
LIMIT 1
) AS last_coordinates_by_delivery
ON last_coordinates_by_delivery.`DELIVERY_ID` = last_delivery_by_courier.`DELIVERY_ID`
ORDER BY last_delivery_by_courier.`START_DATE` DESC
Those are uncorrelated subqueries, so you should be able to verify each subquery individually first.
As for the remaining orders... I'm not a fan of subselects in attributes, as they clutter the SELECT
clause horribly. Let's try to move the subselect into the FROM
clause instead.
SELECT c.`COURIER_ID`,
last_delivery_by_courier.`DELIVERY_ID`,
last_coordinates_by_delivery.`LAT`,
last_coordinates_by_delivery.`LNG`,
order_count.`REMAINING_ORDERS`
FROM `COURIERS` AS c
LEFT JOIN (
SELECT d.`COURIER_ID` AS `COURIER_ID`,
MAX(d.`DELIVERY_ID`) AS `DELIVERY_ID`
FROM `DELIVERIES` AS d
GROUP BY d.`COURIER_ID`
ORDER BY d.`START_DATE` DESC
LIMIT 1
) AS last_delivery_by_courier
ON last_delivery_by_courier.`COURIER_ID` = c.`COURIER_ID`
LEFT JOIN (
SELECT coord.`DELIVERY_ID` AS `DELIVERY_ID`,
AVG(coord.`LAT`) AS `LAT`,
AVG(coord.`LNG`) AS `LNG`
FROM `COORDINATES` AS coord
GROUP BY coord.`DELIVERY_ID`
ORDER BY coord.`DATE` DESC
LIMIT 1
) AS last_coordinates_by_delivery
ON last_coordinates_by_delivery.`DELIVERY_ID` = last_delivery_by_courier.`DELIVERY_ID`
LEFT JOIN (
SELECT `ORDERS`.`DELIVERY_ID` AS `DELIVERY_ID`,
COUNT(DISTINCT(`ORDER_ID`)) AS `REMAINING_ORDERS`
FROM `ORDERS`
WHERE `ORDERS`.`FINISH_DATE` IS NULL
GROUP BY `ORDERS`.`DELIVERY_ID`
) AS order_count
ON last_delivery_by_courier.`DELIVERY_ID` = order_count.`DELIVERY_ID`
ORDER BY last_delivery_by_courier.`START_DATE` DESC
-
\$\begingroup\$ Thanks, I will try your answer and I come back to tell if it worked. \$\endgroup\$Lucas NN– Lucas NN2014年09月05日 05:33:52 +00:00Commented Sep 5, 2014 at 5:33
Distinct
In a properly designed database, this DISTINCT
should not be needed:
(SELECT COUNT(DISTINCT `ORDER_ID`)
FROM `ORDERS`
If multiple different orders can have the same ORDER_ID
, we're gonna have a bad time.
Aliases
c
, d
, r
... these are bad table aliases. Why? Because Mr. Maintainer has to go through your entire query to know what those mean, as they are not self-evident. Your table names are pretty short, but if you really need to shorten them further... perhaps something like crier
, deliv
, coord
would be better.
Join syntax
This USING
command:
FROM `COURIERS` AS c
LEFT JOIN `DELIVERIES` AS d USING (`COURIER_ID`)
USING
is not standard SQL syntax, when it comes to joins. Sure, it can be a nice shorthand in MySQL, but generally it is preferred to use standard SQL syntax (usually more verbose) instead of depending on syntactic sugar provided by a particular build or version of a language. Like this:
FROM `COURIERS` AS c
LEFT JOIN `DELIVERIES` AS d
ON c.`COURIER_ID` = d.`COURIER_ID`
Performance
It's a bit difficult to assess performance hiccups if not presented with the execution plan, or a sample dataset to reproduce the issue. If I had to guess, I would say you have too many subqueries in your CASE
statements.
CASE
statements are great and very fast when you use constants to compare; but when you use variable values, such as what you are doing... well... it's pretty slow. You're asking the SQL engine to run each subquery for each record essentially. It may find a way to optimize that's halfway efficient, but I believe you would be better using a set-based approach. Example:
WHERE (CASE WHEN
(SELECT MAX(`START_DATE`)
FROM `DELIVERIES`
WHERE `COURIER_ID` = c.`COURIER_ID`) IS NULL THEN d.`START_DATE` IS NULL ELSE -- etc.
Instead:
-- this goes before your CASE statement
INSERT INTO TEMP TABLE DeliveryDates
SELECT IFNULL(
MAX( d.`START_DATE`, NULL )
)
WHERE d.`COURIER_ID` = c.`COURIER_ID`
Then:
WHERE (CASE WHEN
(SELECT `START_DATE`
FROM DeliveryDates) IS NULL THEN d.`START_DATE` IS NULL
ELSE d.`START_DATE`
Before I continue, it appears that your tables are not normalized. Ideally you should normalize them. Anyway, I think this should solve it (though I have not tested it).
SELECT C.courier_id, D.delivery_id, O.remaining_orders, COORD.lat, COORD.lng
FROM couriers C
LEFT JOIN (SELECT courier_id,
MAX(start_date) as start_date
FROM deliveries
GROUP BY courier_id
) as D_BY_S ON (C.courier_id = D_BY_S.courier_id)
LEFT JOIN JOIN deliveries D ON (D_BY_S.start_date = D.start_date
AND
D.courier_id = D_BY_S.courier_id)
LEFT JOIN (SELECT delivery_id,
MAX(date) AS date
FROM coordinates
GROUP BY delivery_id
) AS L_COORD ON D.delivery_id = L_COORD.delivery_id
LEFT JOIN coordinates COORD ON (L_COORD.date = COORD.date
AND
L_COORD.delivery_id = COORD.delivery_id)
LEFT JOIN (SELECT COUNT(*) as remaining_orders,
delivery_id
FROM orders
WHERE finish_date IS NULL
GROUP BY delivery_id
) AS O ON (O.delivery_id = D.delivery_id)
;
-
\$\begingroup\$ Thanks, I will try your answer and I come back to tell if it worked. \$\endgroup\$Lucas NN– Lucas NN2014年09月05日 05:32:55 +00:00Commented Sep 5, 2014 at 5:32
DELIVERY_ID
is an auto_increment column. But I don't want necessarily the lastDELIVERY_ID
, I need theDELIVERY_ID
with MAX(START_DATE
). \$\endgroup\$