SQL delivery query performance

Question 1

I need a "little" help with an SQL query (MySQL).

I have the following tables:

COURIERS table:

+------------+
| COURIER_ID |
+------------+

DELIVERIES table:

+-------------+------------+------------+
| DELIVERY_ID | COURIER_ID | START_DATE |
+-------------+------------+------------+

ORDERS table:

+----------+-------------+-------------+
| ORDER_ID | DELIVERY_ID | FINISH_DATE |
+----------+-------------+-------------+

COORDINATES table:

+-------------+-----+-----+------+
| DELIVERY_ID | LAT | LNG | DATE |
+-------------+-----+-----+------+

In the real database I have more columns in each table, but for this example the above columns are enough.

What do I need?

An SQL query that returns all couriers [COURIER_ID], their last delivery [DELIVERY_ID] (based on last START_DATE), the delivery's last coordinate [LAT and LNG] (based on last DATE) and the remaining orders count (total of orders of the last delivery that have no FINISH_DATE).
A courier can have no deliveries, in this case I want DELIVERY_ID = NULL, LAT = NULL and LNG = NULL in the result.
A delivery can have no coordinates, in this case I want LAT = NULL and LNG = NULL in the result.

What was I able to do?

SELECT c.`COURIER_ID`,
 d.`DELIVERY_ID`,
 r.`LAT`,
 r.`LNG`,
 (SELECT COUNT(DISTINCT `ORDER_ID`)
 FROM `ORDERS`
 WHERE `DELIVERY_ID` = d.`DELIVERY_ID`
 AND `FINISH_DATE` IS NULL) AS REMAINING_ORDERS
FROM `COURIERS` AS c
LEFT JOIN `DELIVERIES` AS d USING (`COURIER_ID`)
LEFT JOIN `COORDINATES` AS r ON r.`DELIVERY_ID` = d.`DELIVERY_ID`
WHERE (CASE WHEN
 (SELECT MAX(`START_DATE`)
 FROM `DELIVERIES`
 WHERE `COURIER_ID` = c.`COURIER_ID`) IS NULL THEN d.`START_DATE` IS NULL ELSE d.`START_DATE` =
 (SELECT MAX(`START_DATE`)
 FROM `DELIVERIES`
 WHERE `COURIER_ID` = c.`COURIER_ID`) END)
 AND (CASE WHEN
 (SELECT MAX(`DATE`)
 FROM `COORDINATES`
 WHERE `DELIVERY_ID` = d.`DELIVERY_ID`) IS NULL THEN r.`DATE` IS NULL ELSE r.`DATE` =
 (SELECT MAX(`DATE`)
 FROM `COORDINATES`
 WHERE `DELIVERY_ID` = d.`DELIVERY_ID`) END)
GROUP BY c.`COURIER_ID`
ORDER BY d.`START_DATE` DESC

The problem is that this query is very slow (from 5 to 20 seconds) when I have over 5k COORDINATES and it does not returns all couriers sometimes.

Question 2

can you assume that delivery_id is monotonically increasing? if that is the case then you can take the max delivery_id per courier and save yourself some of these correlated subqueries.

Question 3

@NickHristov DELIVERY_ID is an auto_increment column. But I don't want necessarily the last DELIVERY_ID, I need the DELIVERY_ID with MAX(START_DATE).

Question 4

According to your description, you are only interested in each courier's delivery with the latest start date (if one exists) and each delivery's latest coordinates (if one exists). So, you should be able to simplify things by limiting those tables before joining.

Ignoring the remaining order count column for now...

SELECT c.`COURIER_ID`,
 last_delivery_by_courier.`DELIVERY_ID`,
 last_coordinates_by_delivery.`LAT`,
 last_coordinates_by_delivery.`LNG`
 FROM `COURIERS` AS c
 LEFT JOIN (
 SELECT d.`COURIER_ID` AS `COURIER_ID`,
 MAX(d.`DELIVERY_ID`) AS `DELIVERY_ID`
 FROM `DELIVERIES` AS d
 GROUP BY d.`COURIER_ID`
 ORDER BY d.`START_DATE` DESC
 LIMIT 1
 ) AS last_delivery_by_courier
 ON last_delivery_by_courier.`COURIER_ID` = c.`COURIER_ID`
 LEFT JOIN (
 SELECT coord.`DELIVERY_ID` AS `DELIVERY_ID`,
 AVG(coord.`LAT`) AS `LAT`,
 AVG(coord.`LNG`) AS `LNG`
 FROM `COORDINATES` AS coord
 GROUP BY coord.`DELIVERY_ID`
 ORDER BY coord.`DATE` DESC
 LIMIT 1
 ) AS last_coordinates_by_delivery
 ON last_coordinates_by_delivery.`DELIVERY_ID` = last_delivery_by_courier.`DELIVERY_ID`
 ORDER BY last_delivery_by_courier.`START_DATE` DESC

Those are uncorrelated subqueries, so you should be able to verify each subquery individually first.

As for the remaining orders... I'm not a fan of subselects in attributes, as they clutter the SELECT clause horribly. Let's try to move the subselect into the FROM clause instead.

SELECT c.`COURIER_ID`,
 last_delivery_by_courier.`DELIVERY_ID`,
 last_coordinates_by_delivery.`LAT`,
 last_coordinates_by_delivery.`LNG`,
 order_count.`REMAINING_ORDERS`
 FROM `COURIERS` AS c
 LEFT JOIN (
 SELECT d.`COURIER_ID` AS `COURIER_ID`,
 MAX(d.`DELIVERY_ID`) AS `DELIVERY_ID`
 FROM `DELIVERIES` AS d
 GROUP BY d.`COURIER_ID`
 ORDER BY d.`START_DATE` DESC
 LIMIT 1
 ) AS last_delivery_by_courier
 ON last_delivery_by_courier.`COURIER_ID` = c.`COURIER_ID`
 LEFT JOIN (
 SELECT coord.`DELIVERY_ID` AS `DELIVERY_ID`,
 AVG(coord.`LAT`) AS `LAT`,
 AVG(coord.`LNG`) AS `LNG`
 FROM `COORDINATES` AS coord
 GROUP BY coord.`DELIVERY_ID`
 ORDER BY coord.`DATE` DESC
 LIMIT 1
 ) AS last_coordinates_by_delivery
 ON last_coordinates_by_delivery.`DELIVERY_ID` = last_delivery_by_courier.`DELIVERY_ID`
 LEFT JOIN (
 SELECT `ORDERS`.`DELIVERY_ID` AS `DELIVERY_ID`,
 COUNT(DISTINCT(`ORDER_ID`)) AS `REMAINING_ORDERS`
 FROM `ORDERS`
 WHERE `ORDERS`.`FINISH_DATE` IS NULL
 GROUP BY `ORDERS`.`DELIVERY_ID`
 ) AS order_count
 ON last_delivery_by_courier.`DELIVERY_ID` = order_count.`DELIVERY_ID`
 ORDER BY last_delivery_by_courier.`START_DATE` DESC

Question 5

Thanks, I will try your answer and I come back to tell if it worked.

Question 6

Distinct

In a properly designed database, this DISTINCT should not be needed:

 (SELECT COUNT(DISTINCT `ORDER_ID`)
 FROM `ORDERS`

If multiple different orders can have the same ORDER_ID, we're gonna have a bad time.

Aliases

c, d, r... these are bad table aliases. Why? Because Mr. Maintainer has to go through your entire query to know what those mean, as they are not self-evident. Your table names are pretty short, but if you really need to shorten them further... perhaps something like crier, deliv, coord would be better.

Join syntax

This USING command:

FROM `COURIERS` AS c
LEFT JOIN `DELIVERIES` AS d USING (`COURIER_ID`)

USING is not standard SQL syntax, when it comes to joins. Sure, it can be a nice shorthand in MySQL, but generally it is preferred to use standard SQL syntax (usually more verbose) instead of depending on syntactic sugar provided by a particular build or version of a language. Like this:

FROM `COURIERS` AS c
LEFT JOIN `DELIVERIES` AS d 
ON c.`COURIER_ID` = d.`COURIER_ID`

Performance

It's a bit difficult to assess performance hiccups if not presented with the execution plan, or a sample dataset to reproduce the issue. If I had to guess, I would say you have too many subqueries in your CASE statements.

CASE statements are great and very fast when you use constants to compare; but when you use variable values, such as what you are doing... well... it's pretty slow. You're asking the SQL engine to run each subquery for each record essentially. It may find a way to optimize that's halfway efficient, but I believe you would be better using a set-based approach. Example:

WHERE (CASE WHEN
 (SELECT MAX(`START_DATE`)
 FROM `DELIVERIES`
 WHERE `COURIER_ID` = c.`COURIER_ID`) IS NULL THEN d.`START_DATE` IS NULL ELSE -- etc.

Instead:

-- this goes before your CASE statement
INSERT INTO TEMP TABLE DeliveryDates
 SELECT IFNULL( 
 MAX( d.`START_DATE`, NULL ) 
 )
WHERE d.`COURIER_ID` = c.`COURIER_ID`

Then:

WHERE (CASE WHEN
 (SELECT `START_DATE`
 FROM DeliveryDates) IS NULL THEN d.`START_DATE` IS NULL 
 ELSE d.`START_DATE`

Question 7

Before I continue, it appears that your tables are not normalized. Ideally you should normalize them. Anyway, I think this should solve it (though I have not tested it).

SELECT C.courier_id, D.delivery_id, O.remaining_orders, COORD.lat, COORD.lng
FROM couriers C
LEFT JOIN (SELECT courier_id, 
 MAX(start_date) as start_date
 FROM deliveries
 GROUP BY courier_id
 ) as D_BY_S ON (C.courier_id = D_BY_S.courier_id)
LEFT JOIN JOIN deliveries D ON (D_BY_S.start_date = D.start_date 
 AND 
 D.courier_id = D_BY_S.courier_id)
LEFT JOIN (SELECT delivery_id, 
 MAX(date) AS date 
 FROM coordinates
 GROUP BY delivery_id
 ) AS L_COORD ON D.delivery_id = L_COORD.delivery_id
LEFT JOIN coordinates COORD ON (L_COORD.date = COORD.date
 AND
 L_COORD.delivery_id = COORD.delivery_id) 
LEFT JOIN (SELECT COUNT(*) as remaining_orders, 
 delivery_id 
 FROM orders 
 WHERE finish_date IS NULL
 GROUP BY delivery_id 
 ) AS O ON (O.delivery_id = D.delivery_id)
;

Question 8

Thanks, I will try your answer and I come back to tell if it worked.

user50399 user50399user50399 1,3376 silver badges6 bronze badges · Accepted Answer · 2014-09-05 05:27:04Z

According to your description, you are only interested in each courier's delivery with the latest start date (if one exists) and each delivery's latest coordinates (if one exists). So, you should be able to simplify things by limiting those tables before joining.

Ignoring the remaining order count column for now...

SELECT c.`COURIER_ID`,
 last_delivery_by_courier.`DELIVERY_ID`,
 last_coordinates_by_delivery.`LAT`,
 last_coordinates_by_delivery.`LNG`
 FROM `COURIERS` AS c
 LEFT JOIN (
 SELECT d.`COURIER_ID` AS `COURIER_ID`,
 MAX(d.`DELIVERY_ID`) AS `DELIVERY_ID`
 FROM `DELIVERIES` AS d
 GROUP BY d.`COURIER_ID`
 ORDER BY d.`START_DATE` DESC
 LIMIT 1
 ) AS last_delivery_by_courier
 ON last_delivery_by_courier.`COURIER_ID` = c.`COURIER_ID`
 LEFT JOIN (
 SELECT coord.`DELIVERY_ID` AS `DELIVERY_ID`,
 AVG(coord.`LAT`) AS `LAT`,
 AVG(coord.`LNG`) AS `LNG`
 FROM `COORDINATES` AS coord
 GROUP BY coord.`DELIVERY_ID`
 ORDER BY coord.`DATE` DESC
 LIMIT 1
 ) AS last_coordinates_by_delivery
 ON last_coordinates_by_delivery.`DELIVERY_ID` = last_delivery_by_courier.`DELIVERY_ID`
 ORDER BY last_delivery_by_courier.`START_DATE` DESC

Those are uncorrelated subqueries, so you should be able to verify each subquery individually first.

As for the remaining orders... I'm not a fan of subselects in attributes, as they clutter the SELECT clause horribly. Let's try to move the subselect into the FROM clause instead.

SELECT c.`COURIER_ID`,
 last_delivery_by_courier.`DELIVERY_ID`,
 last_coordinates_by_delivery.`LAT`,
 last_coordinates_by_delivery.`LNG`,
 order_count.`REMAINING_ORDERS`
 FROM `COURIERS` AS c
 LEFT JOIN (
 SELECT d.`COURIER_ID` AS `COURIER_ID`,
 MAX(d.`DELIVERY_ID`) AS `DELIVERY_ID`
 FROM `DELIVERIES` AS d
 GROUP BY d.`COURIER_ID`
 ORDER BY d.`START_DATE` DESC
 LIMIT 1
 ) AS last_delivery_by_courier
 ON last_delivery_by_courier.`COURIER_ID` = c.`COURIER_ID`
 LEFT JOIN (
 SELECT coord.`DELIVERY_ID` AS `DELIVERY_ID`,
 AVG(coord.`LAT`) AS `LAT`,
 AVG(coord.`LNG`) AS `LNG`
 FROM `COORDINATES` AS coord
 GROUP BY coord.`DELIVERY_ID`
 ORDER BY coord.`DATE` DESC
 LIMIT 1
 ) AS last_coordinates_by_delivery
 ON last_coordinates_by_delivery.`DELIVERY_ID` = last_delivery_by_courier.`DELIVERY_ID`
 LEFT JOIN (
 SELECT `ORDERS`.`DELIVERY_ID` AS `DELIVERY_ID`,
 COUNT(DISTINCT(`ORDER_ID`)) AS `REMAINING_ORDERS`
 FROM `ORDERS`
 WHERE `ORDERS`.`FINISH_DATE` IS NULL
 GROUP BY `ORDERS`.`DELIVERY_ID`
 ) AS order_count
 ON last_delivery_by_courier.`DELIVERY_ID` = order_count.`DELIVERY_ID`
 ORDER BY last_delivery_by_courier.`START_DATE` DESC

Thanks, I will try your answer and I come back to tell if it worked.

Stack Exchange Network

SQL delivery query performance

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

SQL delivery query performance

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions