When performing a MIN() or MAX() over a single range covered by an appropriately sorted index, SQL Server does a TOP() and so returns the value after fetching just one row. When the search criteria include more than one range, SQL Server instead reads all the indexed values from every range and runs a stream aggregate over them, which is far slower than performing a TOP() seek for each range.
For example, assume a large number of orders per customer in a table like:
CREATE TABLE orders
(
 customer_id int,
 quantity int
);
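(The premise of an "appropriately sorted index" implies an index like the following; it isn't shown in the DDL above, so this definition is an assumption, presumably matching the Fiddle's setup.)
-- Assumed supporting index: lets SQL Server answer MIN/MAX for a single
-- customer_id with a one-row TOP seek at either end of the range.
CREATE INDEX ix_orders_customer_quantity ON orders (customer_id, quantity);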
Running this query:
SELECT MAX(quantity)
FROM orders
WHERE customer_id IN (1,2)
will result in a query that takes several times as long as if only one customer ID were specified.
What is the most efficient way to perform a query like the above? Relatedly, if separate results were needed (i.e. GROUP BY customer_id), what would the best method be?
SQL Fiddle: http://sqlfiddle.com/#!3/ef0c6/1
- A GROUP BY over all customer_ids suffers the same fate. Amazed the optimizer hasn't covered this by now. What could the holdup be? Seems clear cut. A windowing function also fails. – crokusek, Dec 24, 2016 at 0:33
2 Answers
Here's a solution using CROSS APPLY, which does the same TOP query for each customer_id:
SELECT MAX(b.MaxQuantity) AS quantity
FROM
(
 -- The list of customer IDs to probe; easily swapped for a table
 -- variable or table-valued parameter.
 SELECT 1 AS customer_id UNION ALL
 SELECT 2
) a
CROSS APPLY
(
 -- One TOP(1) index seek per customer_id, just like the
 -- single-customer MAX() plan.
 SELECT TOP (1) quantity AS MaxQuantity
 FROM orders o
 WHERE o.customer_id = a.customer_id
 ORDER BY quantity DESC
) b;
This does the same work as the UNION ALL-based query you wrote in the Fiddle; the difference is that the customer_id input is abstracted from the meat of the query, so it can easily be converted to use a table variable or table-valued parameter, giving a static query plan, which is important. This approach will work well for a small number of customer_id values, and simply removing the outer MAX will return the maximum for each customer (see the sketch below). I don't believe there's a way to further optimize this query for a small number of customer_ids using these data structures (assuming the customer_ids are random values rather than a contiguous range).
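For the GROUP BY customer_id case raised in the question, here is a sketch of that per-customer variant, driven from a table variable (the @customer_ids name is illustrative, not from the original query):
-- Per-customer maxima: drop the outer MAX and select a.customer_id instead.
DECLARE @customer_ids TABLE (customer_id int PRIMARY KEY);
INSERT INTO @customer_ids (customer_id) VALUES (1), (2);

SELECT a.customer_id, b.MaxQuantity AS quantity
FROM @customer_ids a
CROSS APPLY
(
 SELECT TOP (1) quantity AS MaxQuantity
 FROM orders o
 WHERE o.customer_id = a.customer_id
 ORDER BY quantity DESC
) b;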
For a large number of customer_ids, it probably is cheaper to do the index scan and stream aggregate than many seeks. To get this going faster, you'd have to move to some kind of denormalized data structure. MAX isn't supported in an indexed view, so rolling your own mechanism is the only way to go, either in application logic or triggers. Depending on the read/write ratio on this table, that may or may not be faster than the above approach: you'd have to test it in your exact scenario.
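As an illustration of the trigger route, here is a minimal sketch that maintains a per-customer maximum on insert. The table and trigger names are hypothetical, and it deliberately handles only inserts; deletes and updates would need a full recompute, since a MAX can't be decremented incrementally (the same reason indexed views don't support it).
-- Denormalized rollup table, one row per customer.
CREATE TABLE customer_max_quantity
(
 customer_id int PRIMARY KEY,
 max_quantity int NOT NULL
);

-- Insert-only maintenance; a delete would require recomputing the MAX.
CREATE TRIGGER trg_orders_max_quantity ON orders
AFTER INSERT
AS
BEGIN
 SET NOCOUNT ON;
 MERGE customer_max_quantity AS t
 USING
 (
 SELECT customer_id, MAX(quantity) AS max_quantity
 FROM inserted
 GROUP BY customer_id
 ) AS s
 ON t.customer_id = s.customer_id
 WHEN MATCHED AND s.max_quantity > t.max_quantity THEN
 UPDATE SET max_quantity = s.max_quantity
 WHEN NOT MATCHED THEN
 INSERT (customer_id, max_quantity) VALUES (s.customer_id, s.max_quantity);
END;
Reading the maximum for any set of customers then becomes a trivial seek into customer_max_quantity.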
- I actually don't see the TOP with your query. It is using a range seek (20000 rows) with a stream aggregate. Tested on SQL Server 2008/2012. But if you change the query in the CROSS APPLY to use SELECT TOP(1) quantity .... ORDER BY quantity DESC, then I see a nested loop join with a TOP and a seek operator with Actual Rows = 2. – Mikael Eriksson, Aug 28, 2013 at 6:13
- @Mikael: Fixed, thanks. I'm not used to the Fiddle plan icons and the shape of the plan looked correct. My bad. – Jon Seigel, Aug 28, 2013 at 13:47
A non-RDBMS suggestion: if you are running these sorts of queries often, I would suggest building a cube from the source data and aggressively pre-aggregating it.
This does, of course, require an installation of Analysis Services and a DBA with Analysis Services experience, but if you run OLAP-style queries often, creating the data warehouse may be cost-effective in the long run.
If you have Excel 2010 or 2013 then PowerPivot for Excel is a cheaper alternative to a full Analysis Services installation.