I need to split a comma delimited string, manipulate it, and then concatenate it back into a single string retaining the original order of the data (if possible).
For example, take a column definition list of a CREATE TABLE
statement (as a string) like so 'BrentOzarColumn INTEGER, PaulWhiteColumn DATETIME, ErikDarlingColumn VARCHAR(100)'
. I'd like to split the comma delimited list to a result set, such as using SQL Server's built in function STRING_SPLIT()
like so: SELECT TRIM([Value]) AS CoolDataPeople FROM STRING_SPLIT('BrentOzarColumn INTEGER, PaulWhiteColumn DATETIME, ErikDarlingColumn VARCHAR(100)', ',')
.
Without specifying an ORDER BY
clause, this repeatedly yields (by coincedence?) the following results that appear to be ordered by their same ordering as they are in the string:
Once I have the result set above, I want to apply some additional string manipulations to each row (such as append some constant text), and then concatenate every row back with a function like STRING_AGG()
(goodbye days of STUFF ... FOR XML PATH
:) in the same order as the original string. So an example of my final result could be 'BrentOzarColumn INTEGER SQLROX, PaulWhiteColumn DATETIME SQLROX, ErikDarlingColumn VARCHAR(100) SQLROX'
.
Ultimately my question is: Are the results of the STRING_SPLIT()
function returned in a deterministic order? I know without an ORDER BY
clause, ordering is not guaranteed when selecting from a dataset like a Table
or View
, but was wondering if there's a difference with functions?
As I type this out, I have a hunch the answer is no, the ordering is not deterministic therefor I'm not guaranteed the order of the results. Furthermore, I'm betting there's possible additional nondeterminism added for each function I run on top of the results, especially when I combine them back together with STRING_AGG()
. (Regardless of the answer, I appreciate your help, and you're all cool data people. ;)
1 Answer 1
No, they are not returned in a deterministic order.
While you are unlikely to see them returned in a different order, that doesn't make the current behavior deterministic or reliable. Tricks like applying ROW_NUMBER() OVER (ORDER BY (SELECT 1))
to the output in an intermediate place like a CTE are similarly not guaranteed to work.
It's one of the features notably missing from STRING_SPLIT()
, which I've blogged about:
- A way to improve STRING_SPLIT in SQL Server - and you can help
- Please help with STRING_SPLIT improvements
And so has Andy Mallon:
Vote and comment here:
(This item specifically asks for an additional column to be returned to indicate position within the original string but, due to backward compatibility issues, would likely need to be delivered via a new function, similar to how CONCAT_WS
came to be.)
As an aside, the documentation originally stated:
The sort order of the output rows matches the order of the substrings in the input string.
That was deliberately changed in this commit to remove any notion that the return order is promised. Now the documentation states:
The output rows might be in any order. The order is not guaranteed to match the order of the substrings in the input string.
Why would they need to remove that promise?
I am not familiar with the internal implementation, but I believe the original version of the documentation was written by observers, not the function authors. That statement was likely written that way because that's what behavior they observed when they used the function. Many of us do the same thing when we tell people we don't need ORDER BY
when selecting rows from a clustered table: "they will always come out in this order." Pretty reliable, until the optimizer chooses a different index.
My guess is they are covering their butts for future behavior. Think about all the changes they've been making to how functions work, and also ongoing changes to the optimizer.
Currently, with the simple case:
SELECT * FROM STRING_SPLIT('cow,dog,dinosaur','');
There isn't much opportunity or reason for that to be anything but sequential. But what about when you OUTER APPLY
against an nvarchar(max)
column between two huge, partitioned tables, and the data starts getting processed in batch mode and/or in parallel? Would they have to write additional logic in the code to ensure all that output came back in the right order? Would it be worth it? Would they have to revisit it for every future function processing / optimizer change to maintain that promise?
What changes are they making?
The documentation was recently updated to show a new argument to the function, enable_ordinal
, available in Azure SQL Database and coming soon in SQL Server 2022:
enable_ordinal
An int or bit expression that serves as a flag to enable or disable the ordinal output column. A value of 1 enables the ordinal column. If enable_ordinal is omitted, NULL, or has a value of 0, the ordinal column is disabled.
Note
The enable_ordinal argument and ordinal output column are currently only supported in Azure SQL Database, Azure SQL Managed Instance, and Azure Synapse Analytics (serverless SQL pool only).
To be clear, adding this argument only adds an ordinal
column to the output indicating the sequence of value
within the original string; it still doesn't guarantee the output will be ordered by the ordinal
column. For that, you still need to add ORDER BY
to the output.
-
Dang, I was hoping you wouldn't be the one to answer because I didn't get a chance to add you to my CoolDataPeople example, even though I think you're a pretty cool guy. 😉 Ah that's unfortunate for my use case but I figured so, thanks for confirming.J.D.– J.D.2021年01月08日 04:21:23 +00:00Commented Jan 8, 2021 at 4:21
-
1@JD I'm next tier at bestAaron Bertrand– Aaron Bertrand2021年01月08日 04:47:36 +00:00Commented Jan 8, 2021 at 4:47
-
You da man! Thanks for the updated answer. I think looking at it from the optimizer's perspective is a very good point. Even if the isolated function returned a deterministic ordering of results, unisolated when used in a query in the system of SQL Server itself, the optimizer can certainly introduce nondeterministic behavior.J.D.– J.D.2021年01月08日 15:06:33 +00:00Commented Jan 8, 2021 at 15:06
Explore related questions
See similar questions with these tags.
String.Split()
call is assumed to be determinsitic, why is there any difference for a similar string manipulation function in SQL)?