In my experience SELECT TOP
is much faster than SELECT MIN()
/MAX()
if you have thousands/millions of rows in your database as seen in this answer and what I've answered here and the askers response.
My question is WHY is that, in my understanding TOP
looks at the data same as SELECT column FROM table
and when you do a ORDER BY column
it sorts the results and than gives you only the top row, while the MIN()
/MAX()
functions should really just look at each row check the sum of that row, then go to the next row and if that row is >
or <
the previous row than it saves the results in the function depends if it's a MAX()
or a MIN()
, and this is exactly what the ORDER BY
should do.
Please don't answer that in your experience the MAX()
is better than TOP
as the question is really what's the difference between MAX()
/MIN()
and TOP
? both look at the data on each row and both are probably a log(n).
In response to @J.D:
One simple example is if you use TOP without an ORDER BY clause. Then the top result you'll get back is likely non-deterministic and doesn't always have to be the same row that the MAX() aggregate function would reference.
Absolutely I'm only asking when you are using ORDER BY
because the powerfulness TOP
has is when sorting it in a specific way.
Regarding the log(n) I would ask it in a different way, if the ORDER BY
has a Ω(n) because the minimum data it needs to look on is the n and has probably a O() >= n because it's probably using the best sorting algorithm which O(n).
With the MIN()
it's also probably a Ω(n) and O(n) because it will always need to scan all the data to find the answer, and never more than n, so why does it sometimes make a huge difference?
To put it another way:
What's the difference in execution time if I have a table of a million rows that has customers and the amount they've spent to buy product's, if I'm using SELECT TOP 1 amount_column FROM customer_table ORDER BY amount_column DESC
and using SELECT MAX(amount_column) FROM customer_table
the first need's to look at all the data in amount_column
and sort it descending order so it doesn't make a difference if it's already sorted or not because it needs to check all rows if it's > than previous row, and then when it finished executing it returns the maximum amount.
And when I'm doing it using the MAX()
function it also needs to scan (in my understanding) all rows and save the first row to the results and then go to the next row and check if it's > than results if yes it saves it to results, and so on for the whole table, and when it finishes it returns the maximum amount.
So in my understanding both are really doing the same thing, or maybe I'm wrong with my understanding the way it executes the function, and that's what I'm asking if I'm wrong? and why?
3 Answers 3
The thing is they're not always the same, hence them being different operators with different purposes. Some of those purposes overlap, yes, but they're not exactly one-to-one.
One simple example is if you use TOP
without an ORDER BY
clause. Then the top result you'll get back is likely non-deterministic and doesn't always have to be the same row that the MAX()
aggregate function would reference.
As mustaccio pointed out in the comments, the execution plans will tell you specifically what operations differed between two queries, one using TOP
and the other using MAX()
. Without a specific query, there is not a single answer on how the two are similar or different, since it can vary for each specific use case.
Finally, to assume the Big O function of time for both operators is always O(log(n)) is also incorrect and varies depending on how your data is structured and which data points you're applying those operators on. In Table1
with columns (A, B, C)
that's indexed on A
ascending, the query SELECT TOP 1 A FROM Table1 ORDER BY A
will likely be on the order of O(log(n)) (though I'm not sure if it can even possibly be O(1), since the ORDER BY
clause matches the index sorting) but a SELECT MAX(B) FROM Table1
could be on the order of O(n) since there's no covering index on column B
, and will likely need to scan the entire table.
In regards to your update, the aggregate functions like MIN()
and MAX()
don't need to always scan all of the data either. My first example was just setup in a way to demonstrate a difference in Big O search time complexity between MAX()
and TOP
in one specific use case. But using the same table example with the same columns and index covering column A
ascending only, then the query SELECT MAX(A) FROM Table1
should yield a time complexity of O(log(n)) too.
Despite even only focusing on queries that use the ORDER BY
clause, the similarities or differences between TOP
vs MAX()
and MIN()
will still vary from one use case to the next, directly dependent on the query being ran, and it's execution plan that the query engine produced for that query. If you have a specific query and execution plan you want to discuss, feel free to add it to your question and we can analyze it appropriately.
To your latest update, I believe where you're getting tripped up is on the concept of data structures. You keep assuming every row of data needs to be analyzed in both cases, and that's only true in an unindexed table (a Heap data structure) or if there's no covering index for your query. So in your example, for both operators, the search time complexity would be O(n) if there's no index covering the amount_column
. But if there was an index on it, descending, which would respect your statement "...if it's already sorted" then the data is stored in a B-Tree data structure which would allow your query to seek on the index and significantly reduce the runtime down to O(log(n)) because it would no longer need to compare every row. The subset of rows it would need to compare in that is much smaller than the entire table itself.
Yes in your specific example, both queries should see the same search time complexity in theory, but again what that actually is depends on how your data is structured (indexed in this case). Additionally, there's still other factors that could result in two separate query plans, where in actuality the runtimes do differ between the two queries and the reasoning for that can only be discussed when comparing two actual execution plans. It's not possible to discuss theoretically, because it could vary for a multitude of reasons based on specifically what's in the execution plan itself.
It is worth pointing out that there's another and pretty important a difference: null behaviour is not the same for min()/max() vs top(). In such a case, it's comparison of apples and oranges.
Consider:
create table T1(id int, content nvarchar(64));
insert T1 values
(1,'smallest'),
(3,'largest'),
(2,'middle'),
(-1,'minus one'),
(null, 'null');
select top(1) id from T1 order by id; -- returns null
select top(1) id from T1 order by id desc; -- returns 3
select 'max id' = max(id), 'min id' = min(id) from T1; -- returns 3 and -1
In my experience SELECT TOP is much faster than SELECT MIN()/MAX()
Yes, it is, because they do rather different things.
Both start with the same resultset, constructed from your "from", "join" and "where" clauses.
Min and Max are Aggregate Functions.
They trawl through every row in that resultset and work out the smallest/largest value of the stated column.
Top starts reading through the resultset and simply stops after the specified number of rows. As others have said, using "Top n" without an "Order By" can make for some interesting debugging sessions!
-
min
andmax
do not always trawl through every row. If there is a single-column index and the column is not nullable then they can read the lowest or highest value directly. You can see this by looking at the query plan. (This is a case wheremin
andmax
are just as fast astop
.)Ed Avis– Ed Avis2024年10月29日 15:04:00 +00:00Commented Oct 29, 2024 at 15:04
Explore related questions
See similar questions with these tags.