Are the results of the STRING_SPLIT() function returned in a deterministic order?

Question 1

I need to split a comma delimited string, manipulate it, and then concatenate it back into a single string retaining the original order of the data (if possible).

For example, take a column definition list of a CREATE TABLE statement (as a string) like so 'BrentOzarColumn INTEGER, PaulWhiteColumn DATETIME, ErikDarlingColumn VARCHAR(100)'. I'd like to split the comma delimited list to a result set, such as using SQL Server's built in function STRING_SPLIT() like so: SELECT TRIM([Value]) AS CoolDataPeople FROM STRING_SPLIT('BrentOzarColumn INTEGER, PaulWhiteColumn DATETIME, ErikDarlingColumn VARCHAR(100)', ',').

Without specifying an ORDER BY clause, this repeatedly yields (by coincedence?) the following results that appear to be ordered by their same ordering as they are in the string:

Results

Once I have the result set above, I want to apply some additional string manipulations to each row (such as append some constant text), and then concatenate every row back with a function like STRING_AGG() (goodbye days of STUFF ... FOR XML PATH :) in the same order as the original string. So an example of my final result could be 'BrentOzarColumn INTEGER SQLROX, PaulWhiteColumn DATETIME SQLROX, ErikDarlingColumn VARCHAR(100) SQLROX'.

Ultimately my question is: Are the results of the STRING_SPLIT() function returned in a deterministic order? I know without an ORDER BY clause, ordering is not guaranteed when selecting from a dataset like a Table or View, but was wondering if there's a difference with functions?

As I type this out, I have a hunch the answer is no, the ordering is not deterministic therefor I'm not guaranteed the order of the results. Furthermore, I'm betting there's possible additional nondeterminism added for each function I run on top of the results, especially when I combine them back together with STRING_AGG(). (Regardless of the answer, I appreciate your help, and you're all cool data people. ;)

Question 2

As the documentation states directly - "The output rows might be in any order. The order is not guaranteed to match ..."

Question 3

@AaronBertrand Oo that's interesting AF, thanks! So it sounds like they originally believed the function would return the rows in a deterministic order but reneged their initial beliefs. Not sure if it's a quick answer or something you'd be able to add to your answer, but any idea why it isn't a deterministic ordered result set?...I imagine what the function actually does under the hood is always processing the string in the same order (e.g. in C# a String.Split() call is assumed to be determinsitic, why is there any difference for a similar string manipulation function in SQL)?

Question 4

@J.D. the function returns a table which by definition is unordered, so unless clearly documented I would not trust the order even if it succeeded the first 1000 times:-) learn.microsoft.com/en-us/sql/t-sql/functions/… also mentions it (under remarks, wrong version but I assume it still holds). Adding the ordinal number to the result as mentioned in Aarons answer seems like a nice improvement.

Question 5

No, they are not returned in a deterministic order.

While you are unlikely to see them returned in a different order, that doesn't make the current behavior deterministic or reliable. Tricks like applying ROW_NUMBER() OVER (ORDER BY (SELECT 1)) to the output in an intermediate place like a CTE are similarly not guaranteed to work.

It's one of the features notably missing from STRING_SPLIT(), which I've blogged about:

And so has Andy Mallon:

When STRING_SPLIT() falls short

Vote and comment here:

Add row position column to STRING_SPLIT

(This item specifically asks for an additional column to be returned to indicate position within the original string but, due to backward compatibility issues, would likely need to be delivered via a new function, similar to how CONCAT_WS came to be.)

As an aside, the documentation originally stated:

The sort order of the output rows matches the order of the substrings in the input string.

That was deliberately changed in this commit to remove any notion that the return order is promised. Now the documentation states:

The output rows might be in any order. The order is not guaranteed to match the order of the substrings in the input string.

Why would they need to remove that promise?

I am not familiar with the internal implementation, but I believe the original version of the documentation was written by observers, not the function authors. That statement was likely written that way because that's what behavior they observed when they used the function. Many of us do the same thing when we tell people we don't need ORDER BY when selecting rows from a clustered table: "they will always come out in this order." Pretty reliable, until the optimizer chooses a different index.

My guess is they are covering their butts for future behavior. Think about all the changes they've been making to how functions work, and also ongoing changes to the optimizer.

Currently, with the simple case:

SELECT * FROM STRING_SPLIT('cow,dog,dinosaur','');

There isn't much opportunity or reason for that to be anything but sequential. But what about when you OUTER APPLY against an nvarchar(max) column between two huge, partitioned tables, and the data starts getting processed in batch mode and/or in parallel? Would they have to write additional logic in the code to ensure all that output came back in the right order? Would it be worth it? Would they have to revisit it for every future function processing / optimizer change to maintain that promise?

What changes are they making?

The documentation was recently updated to show a new argument to the function, enable_ordinal, available in Azure SQL Database and coming soon in SQL Server 2022:

enable_ordinal

An int or bit expression that serves as a flag to enable or disable the ordinal output column. A value of 1 enables the ordinal column. If enable_ordinal is omitted, NULL, or has a value of 0, the ordinal column is disabled.

Note

The enable_ordinal argument and ordinal output column are currently only supported in Azure SQL Database, Azure SQL Managed Instance, and Azure Synapse Analytics (serverless SQL pool only).

To be clear, adding this argument only adds an ordinal column to the output indicating the sequence of value within the original string; it still doesn't guarantee the output will be ordered by the ordinal column. For that, you still need to add ORDER BY to the output.

Question 6

Dang, I was hoping you wouldn't be the one to answer because I didn't get a chance to add you to my CoolDataPeople example, even though I think you're a pretty cool guy. 😉 Ah that's unfortunate for my use case but I figured so, thanks for confirming.

Question 7

@JD I'm next tier at best

Question 8

You da man! Thanks for the updated answer. I think looking at it from the optimizer's perspective is a very good point. Even if the isolated function returned a deterministic ordering of results, unisolated when used in a query in the system of SQL Server itself, the optimizer can certainly introduce nondeterministic behavior.

score 12 · Accepted Answer · 2021-01-08 02:50:04Z

No, they are not returned in a deterministic order.

While you are unlikely to see them returned in a different order, that doesn't make the current behavior deterministic or reliable. Tricks like applying ROW_NUMBER() OVER (ORDER BY (SELECT 1)) to the output in an intermediate place like a CTE are similarly not guaranteed to work.

It's one of the features notably missing from STRING_SPLIT(), which I've blogged about:

And so has Andy Mallon:

When STRING_SPLIT() falls short

Vote and comment here:

Add row position column to STRING_SPLIT

(This item specifically asks for an additional column to be returned to indicate position within the original string but, due to backward compatibility issues, would likely need to be delivered via a new function, similar to how CONCAT_WS came to be.)

As an aside, the documentation originally stated:

The sort order of the output rows matches the order of the substrings in the input string.

That was deliberately changed in this commit to remove any notion that the return order is promised. Now the documentation states:

The output rows might be in any order. The order is not guaranteed to match the order of the substrings in the input string.

Why would they need to remove that promise?

I am not familiar with the internal implementation, but I believe the original version of the documentation was written by observers, not the function authors. That statement was likely written that way because that's what behavior they observed when they used the function. Many of us do the same thing when we tell people we don't need ORDER BY when selecting rows from a clustered table: "they will always come out in this order." Pretty reliable, until the optimizer chooses a different index.

My guess is they are covering their butts for future behavior. Think about all the changes they've been making to how functions work, and also ongoing changes to the optimizer.

Currently, with the simple case:

SELECT * FROM STRING_SPLIT('cow,dog,dinosaur','');

There isn't much opportunity or reason for that to be anything but sequential. But what about when you OUTER APPLY against an nvarchar(max) column between two huge, partitioned tables, and the data starts getting processed in batch mode and/or in parallel? Would they have to write additional logic in the code to ensure all that output came back in the right order? Would it be worth it? Would they have to revisit it for every future function processing / optimizer change to maintain that promise?

What changes are they making?

The documentation was recently updated to show a new argument to the function, enable_ordinal, available in Azure SQL Database and coming soon in SQL Server 2022:

enable_ordinal

An int or bit expression that serves as a flag to enable or disable the ordinal output column. A value of 1 enables the ordinal column. If enable_ordinal is omitted, NULL, or has a value of 0, the ordinal column is disabled.

Note

The enable_ordinal argument and ordinal output column are currently only supported in Azure SQL Database, Azure SQL Managed Instance, and Azure Synapse Analytics (serverless SQL pool only).

To be clear, adding this argument only adds an ordinal column to the output indicating the sequence of value within the original string; it still doesn't guarantee the output will be ordered by the ordinal column. For that, you still need to add ORDER BY to the output.

Dang, I was hoping you wouldn't be the one to answer because I didn't get a chance to add you to my CoolDataPeople example, even though I think you're a pretty cool guy. 😉 Ah that's unfortunate for my use case but I figured so, thanks for confirming.
You da man! Thanks for the updated answer. I think looking at it from the optimizer's perspective is a very good point. Even if the isolated function returned a deterministic ordering of results, unisolated when used in a query in the system of SQL Server itself, the optimizer can certainly introduce nondeterministic behavior.

Stack Exchange Network

Are the results of the STRING_SPLIT() function returned in a deterministic order?

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Are the results of the STRING_SPLIT() function returned in a deterministic order?

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions