Fast allocation-free alphanumeric comparer used for sorting

Question 1

I was searching for C# comparers that can be used for natural sorting ("a9" comes before "a11"), but most solutions have a lot of allocations, quite unreadable code or are not covering all scenarios.

So I decided to make my own:

public sealed class JacobusAlphanumericComparer : IComparer<string>
{
 private readonly CultureInfo _culture;
 /// <param name="culture">The culture to use for comparing single characters, defaults to <see cref="CultureInfo.InvariantCulture"/>.</param>
 public JacobusAlphanumericComparer(CultureInfo? culture = null)
 {
 _culture = culture ?? CultureInfo.InvariantCulture;
 }
 public int Compare(string? x, string? y)
 {
 // Nulls are always seen as smaller/lower than actual strings.
 if (x is null) return y is null ? 0 : -1;
 else if (y is null) return 1;
 int index = 0;
 // While at least 1 of the strings still contain a character...
 while (index < x.Length || index < y.Length)
 {
 bool xEnded = index >= x.Length;
 bool yEnded = index >= y.Length;
 if (xEnded || yEnded)
 {
 // One of the strings has ended, the longest string will be considered the biggest
 // (the while condition already excludes equal length scenarios).
 // TODO: Not sure which one would be more correct in this scenario.
 // return x.Length > y.Length ? 1 : -1;
 return x.Length - y.Length;
 }
 // If we get here, both of the strings contain a character at the index.
 bool xIsNumber = char.IsDigit(x[index]);
 bool yIsNumber = char.IsDigit(y[index]);
 if (xIsNumber && yIsNumber)
 {
 int numberStartIndex = index;
 // Keeping looping while there are digits.
 do
 {
 index++;
 }
 while (index < x.Length && index < y.Length && char.IsDigit(x[index]) && char.IsDigit(y[index]));
 bool xNumberEnded = index >= x.Length || !char.IsDigit(x[index]);
 bool yNumberEnded = index >= y.Length || !char.IsDigit(y[index]);
 if (xNumberEnded && yNumberEnded)
 {
 // Both numbers ended at the same index.
 int numberEndIndex = index - 1;
 int numberComparison = CompareNumbers(x, y, numberStartIndex, numberEndIndex);
 if (numberComparison != 0) return numberComparison;
 // If we get here, the numbers were equal and we need to continue looking for differences.
 // We also don't need to increase the index, because the while loop already did that.
 continue;
 }
 else if (xNumberEnded)
 {
 // We immediately know y is bigger, because it contains more digits.
 return -1;
 }
 else if (yNumberEnded)
 {
 // We immediately know x is bigger, because it contains more digits.
 return 1;
 }
 }
 else
 {
 // Compare characters normally.
 int charComparison = CompareCharacters(x, y, index);
 if (charComparison != 0) return charComparison;
 }
 index++;
 }
 // Looped through the entire string without finding differences.
 return 0;
 }
 /// <param name="c">A char containing any digit 0-9.</param>
 private static int ParseDigitCharToInt(char c)
 {
 return c - '0';
 }
 private int CompareCharacters(string x, string y, int index)
 {
 // ToUpperInvariant would be slighly faster (2ns vs 6ns according to benchmarks),
 // but not worth sacrificing the flexibility of supplying your preferred culture.
 char upperCharX = char.ToUpper(x[index], _culture);
 char upperCharY = char.ToUpper(y[index], _culture);
 return upperCharX.CompareTo(upperCharY);
 }
 private static int CompareNumbers(string x, string y, int numberStartIndex, int numberEndIndex)
 {
 int xNumber;
 int yNumber;
 if (numberStartIndex == numberEndIndex)
 {
 // Optimized for single digit comparison.
 xNumber = ParseDigitCharToInt(x[numberStartIndex]);
 yNumber = ParseDigitCharToInt(y[numberStartIndex]);
 }
 else
 {
 #if NET5_0_OR_GREATER
 ReadOnlySpan<char> numbersSpanX = x.AsSpan(numberStartIndex..numberEndIndex);
 ReadOnlySpan<char> numbersSpanY = y.AsSpan(numberStartIndex..numberEndIndex);
 xNumber = int.Parse(numbersSpanX);
 yNumber = int.Parse(numbersSpanY);
 #else
 ReadOnlySpan<char> numbersSpanX = x.AsSpan(numberStartIndex, numberEndIndex - numberStartIndex);
 ReadOnlySpan<char> numbersSpanY = y.AsSpan(numberStartIndex, numberEndIndex - numberStartIndex);
 xNumber = int.Parse(numbersSpanX.ToString());
 yNumber = int.Parse(numbersSpanY.ToString());
 #endif
 }
 return xNumber.CompareTo(yNumber);
 }
}

I've created some benchmarks, and my solution seems faster than anything I've found so far (only https://github.com/drewnoakes/natural-sort is faster, but that one is not covering all scenarios).

Have I missed something? Do you see anymore potential performance or readability improvements?

Question 2

Are you sure you need to parse consecutive digits as integers at all? Seems to me once you’ve established that both strings have an equally long run of digits, those can just be compared non- numerically character by character.

Question 3

@Seb Yes you are right, that will be in the next version I will post soon.

Question 4

What's the problem with "lots of allocations"? The .NET runtime is specially optimized around many small, short-lived objects. Have you found a particular scenario where it matters?

Question 5

@Alejandro More allocations cause more garbage collections, which cause performance loss as the garbage collector requires resources to clean up the mess. I didn't really have a scenario where performance was critical. So this is was just for fun and curiosity.

Question 6

"Allocation free"

"most solutions have a lot of allocations"
While I don't claim any knowledge of C#, the appearance of .ToString() suggests there are hidden (short lived) allocations happening in this code.
_{I'll gladly delete this point if I'm mistaken in my assumption.}

How much to re-invent

Beginners writing in C quickly learn to use (and trust) the Standard Library function qsort(), writing, as the OP has done, built-for-purpose compare() functions.

To serve user's needs, any compare() function (with its predefined parameters) is often accompanied by a wrapper that inverts to returning value. This allows calling code (eg: qsort()) to sort in either ascending or descending sequence.

Suggestion: Add a trivial public function to your class (CompareReverse()) that provides descending ordering to a generic sorting mill...

Flying too high

There's a tacit presumption that contiguous digits will convert to a value that can be represented in an int variable, not even an unsigned int capable of twice the magnitude.

For example, a credit card 'number' requires a 'high capacity' variable to store its 'integer' value.
It's not difficult to imagine even larger strings of digits being used by users.

The presumption of small numbers may need a re-think.

Don't mix apples with oranges

Consider this comment

 // If we get here, the numbers were equal and we need to continue looking for differences.

The OP's candidate data is, in fact, a string-ised data structure.

typedef struct {
 char str[ 10 ]; // string of 1-10 characters
 char str[ 10 ]; // string of 1-10 characters
 char str[ 10 ]; // string of 1-10 characters
 ...
};

Any number of segments each of which could be any length, and alpha segments are left justified while numeric segments are left padded with zeros. (Used 10 as a maximum segment length for this example.)

The code would be much easier to follow (and to maintain! (imo)) if it were broken into two independent 'service functions' to compare: 1) two segments full of alpha, and 2) two segments full of digits. The controlling function would be responsible for dissecting each segment, and ensuring alpha vs alpha and digits vs digits. As presented, these two different requirements seem to be hazardously interwoven.

Swapping priorities

^{In the following, alpha means any non-digit character.}

The alpha vs alpha comparison is no challenge, and the alpha vs digit is also rather easy to evaluate. The digits vs digits presents difficulties because each digit's applicable "power of 10" is unknown.

The OP's code prioritises the length of each substring of digits, then goes off to try to compare their values as integers.

Consider these scenarios:

"abc456444qrx"
"abc456444qry" // equal digits. Keep going until reaching 'x' vs 'y'
 ^--
"abc456444qrx"
"abc457444qry" // greater is TENTATIVE at 6 vs 7, confirmed at end of digits
 ^> ^!!!
"abc4569999qrx"
"abc457444qry" // greater is TENTATIVE at 6 vs 7, but conceded at 9 vs 'q'
 ^> ^!!! // assumption was wrong. Now we know!

The key idea being "which string becomes tentatively greater first while sequentially comparing paired digits?" While traversing remaining digits do not compare values, but be on the lookout for end of string or encountering an alpha character.

Make up a table listing the permutations of the conditions that could arise, and, for each entry, determine whether to return a relationship value (-1, 0, 1), or to continue scanning. This can be achieved with only a single index variable, but needs to be coded as simply as possible to begin with. Postpone combining operations to reduce lines of code. Get it working with a large number of outrageous test cases as the first priority. (Hint: comparing "HHGTTG" with "42" is decided on the very first character.)

Avoid slippery slopes

After reading the answer from @J_H, I've suddenly become aware of another scenario that could cause grief: Decimal Fractions.

From a numeric stand point, the value "foo5.8" is greater than "foo5.75". However, the latter's extra digit(s) to the right of the decimal point ARE already in their applicable power-of-ten sequence. That the user's data uses varying degrees of precision means simple "the longer sequence of digits represents the larger value" will not provide sorting into natural order.

(It's always the simplest of projects that turn out to have unforeseen complexities.)

Question 7

@Fe203 About your paragraph "How much to re-invent": In C# we often order with linq, in this case you would call either myList.Order(new JacobusAlphanumericComparer()) or call myList.OrderDescending(new JacobusAlphanumericComparer()). But if you want to keep the same array reference by using Array.Sort(), then yes, you would have to create a reverse comparer.

Question 8

In C# int and uint are 32-bit numbers with 9 - 10 digits. We also have 64-bit long and ulong with 19 - 20 digits. A decimal has 28 - 29 digits precision.

Question 9

@OlivierJacot-Descombes Thank you. The point is to avoid (if possible) an arbitrary ceiling, and, otherwise, to acknowledge any recognised "restrictions" used in the code... The OP's use-case may be satisfied with "up to 6 digits, max!!" Just noting a caveat should be present as a comment...

Question 10

Code the way you think about problem and solution.
Document your code to at least the point where it doesn't need to be reverse engineered to find out how to use it - does Compare() not need a description for reassurance? (I don't know about C# and "documentation inheritance". But I think digit string interpretation to deserve a description, anyway - starting with leading zeroes, sign before a digit, letter case...
While private, CompareNumbers() could document numberEndIndex to be inclusive. (or ex? guess "the NET5-#else" confuses me))

Come to think of it, handling digit sequences and case is independent:
consider parametrising letter comparison.

I would not think compare index to string length "twice in a row". (same for ?NumberEnded.)
Come to think of it, if I kept whether I swapped arguments, I don't need to check with both strings at all. But overall, you seem prioritise readability over microoptimisation. Good. Maybe drop the special case in CompareNumbers().

I see how introducing xIsNumber to yNumberEnded is self-documenting.

The condition of "the digit loop" repeats that of the enclosing if - a code smell.

The character loop should use for - caveat: digit string handling as presented makes this choice less clear.

I think it would be simpler to start special casing digit strings when a character difference has been detected:
If one is a digit and the other is not, check if both preceding characters are digits - if yes, digit at current index is greater.
Otherwise establish which digit is greater and check (remaining) lengths: you never need to convert into a number representation of worrisome range.

Question 11

Tried to code "non-converting digit string handling": leading zeroes make that interesting - if prepending "0" doesn't change order, further processing needs separate string indices.

Question 12

Have I missed something?

write down the spec

After seeing the OP and the github ReadMe, I didn't notice anything that formally describes what "less than" means in this context.

unary plus / minus

These three characters might conceivably fit into CompareNumbers(), though they are not digits.

0x2b +
0x2d -
0x2e .

(I predict that when you do write down a spec, representing "a thousand" as 1e3 will be very very far out of scope.)

I feel that decimal behavior we get "for free" -- there's little difference between a3Z4 and a3.4, so we can probably exclude that one. But it troubles me that ASCII 0x2b < 0x2d, as with a+3 versus a-3, given that many mathematicians would assert \$-3 ~ < +3\$.

Yikes! @Fe2O3 points out that 0x2C , also matters, in Europe. Though again I appeal to there being little difference between a3Z4 and a3,4.

leading zeros

It's not obvious to me which of these is smallest.

agent-007
agent-07

That's something that would be cleared up by a written spec. (We might also throw agent+007 and agent+07 into the mix.)

unicode codepoints

You might possibly wish to limit valid input to just 7-bit ASCII. There's a vast number of ways to express the numeric concept of "two" in modern Unicode, and char.IsDigit() is familiar with many of them. Maybe throw an exception upon encountering an invalid input character?

Alternatively, you might prefer to write your own IsDigit() helper which recognizes exactly ten digits.

unwarranted assumption

 /// <param name="c">A char containing any digit 0-9.</param>
 private static int ParseDigitCharToInt(char c)
 {
 return c - '0';
 }

The "0-9" comment is great, thank you.

But the caller's guard of char.IsDigit() admits glyphs from several codepoint ranges beyond Latin, including

Arabic
Superscript
Mathematical Alphanumeric Symbols
Arabic-Indic
Mathematical Double-Struck
Roman Numerals
Coptic
Devanagari
Urdu
Persian
Sindhi
Thai
Tamil
Gurmukhi
Hangul

Question 13

If we're going international, can't forget the inverse meanings of '.' and ',' between N.Am and Europe (at least)... :-)

Question 14

Allocation free? You could use 1 single new allocation simply for reversing the string for checking if the reverse is equal to the first string. It appears that if the lengths of the strings aren't equal it should return -1? Here is a code sample you could perhaps build on to get the result your looking for.

public static class Test
{
 private static bool IsValid(string value)
 {
 if (string.IsNullOrEmpty(value) ||
 string.IsNullOrWhiteSpace(value)) return false;
 foreach (char c in value)
 {
 if (!char.IsDigit(c)) return false;
 }
 return true;
 }
 private static bool IsComparableForward(string x, string y)
 {
 if (!IsValid(x) || !IsValid(y)) return false;
 if (x.Length != y.Length) return false;
 int length = x.Length + y.Length >> 1;
 for (int i = 0; i < length; i++)
 {
 char a = x[i];
 char b = y[i];
 if (a != b) return false;
 }
 return true;
 }
 private static bool IsComparableBackward(string x, string y)
 {
 if (!IsValid(x) || !IsValid(y)) return false;
 if (x.Length != y.Length) return false;
 //the only time a new string allocation 
 //will be done is for reversing y
 char[] temp = y.ToCharArray();
 Array.Reverse(temp);
 string y2 = new(temp);
 return IsComparableForward(x, y2);
 }
 /// <summary>
 /// Performs a comparison between <paramref 
 /// name="x"/> and <paramref name="y"/>
 /// </summary>
 /// <param name="x"></param>
 /// <param name="y"></param>
 /// <returns>-1 if not equal, 0 if comparable 
 /// forward, or 1 if comparable backward</returns>
 public static int Compare(string x, string y)
 { 
 if (!IsValid(x) || !IsValid(y)) return -1;
 
 return IsComparableForward(x, y) ? 0 : 
 IsComparableBackward(x, y) ? 1 : -1;
 }
}

Question 15

For one you use String.IsNullOrEmpty and the other String.IsNullOrWhiteSpace, that surely should be the same.

Question 16

No, String.IsNullOrEmpty checks for null & "" and String.IsNullORWhiteSpace checks for null or a string of any size that has nothing but white space for example string s = " " 1 whitespace, " " 6 whitespaces

Question 17

@CharlesHenington Tip: use 'back-ticks' to present data... " " This keeps the formatter from squishing-down adjacent whitespace...

Question 18

Based on all the feedback in the other answers, I've created a new version of the comparer.

Things I have decided not to support:

Scientific number representations like 1e3.
Unary plus/minus, meaning differing between positive and negative numbers is not supported.
Decimal fractions.

Changes:

More comments/documentation to clarify how the comparer and its private methods work.
The implementation no longer contains allocations for .NET versions lower than .NET 5.
Support for numbers greater than the C# int maximum value (2147483647). This change might also boost performance as it never has to convert characters to an actual number (like int or uint).
Leading zeroes are now correctly ignored.
No longer repeating if statements/conditions in while/for loop that are also conditions of the loop itself.
Made character loop a for loop instead of while.
Performance improvement for very similar strings by only using expensive ToUpper methods if the characters are not equal.

/// <summary>
/// Compares strings naturally/alphanumerically.
/// See <see href="https://en.wikipedia.org/wiki/Natural_sort_order"/> for more details.
/// </summary>
/// <remarks>
/// Does not support decimal fractions, only full numbers are compared.<br/>
/// Does not support negative numbers, so '-' and '+' will be compared just like any other non-digit character.<br/>
/// Does not support comparing scientific number representations like 1e3.
/// </remarks>
/// <example>
/// Compare("a", "a") == 0
/// Compare("a", "b") == -1
/// Compare("b", "a") == 1
/// Compare("a", "1") == 1
/// Compare("11", "4") == 1
/// Compare("11z", "2zzz") == 1
/// Compare("11zy", "11z") == 1
/// Compare("aa8aa", "aa7aa") == 1
/// Compare("b", "A") == 1
/// Compare("011", "10") == 1
/// Compare("10", "010") == 1 (this one was an arbitrary choice, because Windows also orders its folders that way).
/// Compare("21.49", "21.5") = -1 (no support for decimal fractions on purpose, because that would cause other issues with for example comparing semantic version numbers https://semver.org/)
/// </example>
/// <returns>
/// <b>Exactly zero</b> if equal.<br/>
/// <b>Greater than zero</b> if x is greater than y.<br/>
/// <b>Less than zero</b> if x is less than y.
/// </returns>
public sealed class JacobusAlphanumericComparer : IComparer<string>
{
 private readonly CultureInfo _culture;
 /// <param name="culture">The culture to use for comparing single characters, defaults to <see cref="CultureInfo.InvariantCulture"/>.</param>
 public JacobusAlphanumericComparer(CultureInfo culture = null)
 {
 _culture = culture ?? CultureInfo.InvariantCulture;
 }
 public int Compare(string x, string y)
 {
 if (x is null) return y is null ? 0 : -1;
 else if (y is null) return 1;
 // Loop while both strings still contain more characters...
 for (int index = 0; index < x.Length && index < y.Length; index++)
 {
 bool xIsNumber = char.IsDigit(x[index]);
 bool yIsNumber = char.IsDigit(y[index]);
 if (xIsNumber && yIsNumber)
 {
 int numberStartIndex = index;
 int xNumberEndIndex = index;
 int yNumberEndIndex = index;
 bool findingXNumberEnd;
 bool findingYNumberEnd;
 // Looping to find the end index of the number in both x and y.
 do
 {
 index++;
 findingXNumberEnd = index < x.Length && char.IsDigit(x[index]);
 if (findingXNumberEnd) xNumberEndIndex = index;
 findingYNumberEnd = index < y.Length && char.IsDigit(y[index]);
 if (findingYNumberEnd) yNumberEndIndex = index;
 }
 while (findingXNumberEnd || findingYNumberEnd);
 // Creating spans that start and end exactly where the numbers are.
 int xNumberLength = xNumberEndIndex - numberStartIndex + 1;
 ReadOnlySpan<char> xNumber = x.AsSpan(numberStartIndex, xNumberLength);
 int yNumberLength = yNumberEndIndex - numberStartIndex + 1;
 ReadOnlySpan<char> yNumber = y.AsSpan(numberStartIndex, yNumberLength);
 int numberComparison = CompareNumbers(xNumber, yNumber);
 bool numbersAreDifferent = numberComparison != 0;
 if (numbersAreDifferent) return numberComparison;
 // If we get here, the numbers were equal and we need to continue looking for differences.
 // Both xNumberEndIndex and yNumberEndIndex are equal here, so we can use any one of them.
 index = xNumberEndIndex;
 continue;
 }
 else
 {
 // Compare characters normally.
 int charComparison = CompareCharacters(x[index], y[index]);
 bool charactersAreDifferent = charComparison != 0;
 if (charactersAreDifferent) return charComparison;
 }
 }
 // If we get here, either:
 // - both strings were equal. For example: Compare("a1a", "a1a") == 0.
 // - or they were equal up until one string ended. For example: Compare("a1a", "a1aa") == -1
 return x.Length - y.Length;
 }
 private int CompareCharacters(char x, char y)
 {
 if (x == y) return 0;
 // ToUpperInvariant would be slightly faster (2ns vs 6ns according to benchmarks),
 // but not worth sacrificing the flexibility of supplying your preferred culture.
 char upperX = char.ToUpper(x, _culture);
 char upperY = char.ToUpper(y, _culture);
 return upperX.CompareTo(upperY);
 }
 /// <summary>
 /// Compares two texts numerically.
 /// </summary>
 /// <remarks>
 /// Ignores leading '0' characters, so CompareNumbers("013", "12") => 1
 /// </remarks>
 /// <example>
 /// CompareNumbers("12", "13") => -1
 /// CompareNumbers("13", "12") => 1
 /// CompareNumbers("12", "12") => 0
 /// CompareNumbers("012", "12") => -1
 /// </example>
 /// <param name="x">A span of characters containing only digits (0-9).</param>
 /// <param name="y">A span of characters containing only digits (0-9).</param>
 /// <returns>
 /// 0 if equal.<br/>
 /// 1 if <paramref name="x"/> is greater than <paramref name="y"/>.<br/>
 /// -1 if <paramref name="y"/> is greater than <paramref name="x"/>.
 /// </returns>
 private static int CompareNumbers(ReadOnlySpan<char> x, ReadOnlySpan<char> y)
 {
 // Trim leading '0' of x and y number, for example: "00123" becomes "123".
 ReadOnlySpan<char> trimmedX = ExcludeLeadingZeroes(x);
 ReadOnlySpan<char> trimmedY = ExcludeLeadingZeroes(y);
 // Return early if a number only contains zeroes: CompareNumbers("1", "0000") == 1.
 bool xContainsOnlyZeroes = trimmedX.Length is 0;
 bool yContainsOnlyZeroes = trimmedY.Length is 0;
 if (xContainsOnlyZeroes) return yContainsOnlyZeroes ? 0 : -1;
 else if (yContainsOnlyZeroes) return 1;
 // After leading zeroes are excluded, a longer number is always greater: CompareNumbers("1000", "999") == 1.
 if (trimmedX.Length > trimmedY.Length) return 1;
 else if (trimmedX.Length < trimmedY.Length) return -1;
 // If we get here, we know there are no leading zeroes and the numbers have the same length.
 // Now it's just a matter of comparing each character normally.
 for (int index = 0; index < trimmedX.Length; index++)
 {
 char xDigit = trimmedX[index];
 char yDigit = trimmedY[index];
 int comparison = xDigit.CompareTo(yDigit);
 if (comparison != 0) return comparison;
 }
 // TODO: Should "010" < "10" like Windows sorts its folders and files?
 // Or is "010" > "10" more logical? I'm clueless here.
 // Comparing original length to make sure "010" < "10".
 return y.Length - x.Length;
 }
 /// <summary>
 /// Shortens the span to exclude leading '0' characters.
 /// </summary>
 /// <example>
 /// ExcludeLeadingZeroes("00100200") => "100200"
 /// </example>
 /// <param name="span">The span to exclude leading zeroes from.</param>
 /// <returns>A span without leading zeroes.</returns>
 private static ReadOnlySpan<char> ExcludeLeadingZeroes(ReadOnlySpan<char> span)
 {
 int startIndex = 0;
 for (int i = 0; i < span.Length; i++)
 {
 if (span[i] is '0') startIndex++;
 else break;
 }
 // We can just return the same span if it had no leading zeroes.
 if (startIndex is 0) return span;
 // The range operator is only supported on .NET version 5 or greater,
 // in other versions .Slice() must be used.
 #if NET5_0_OR_GREATER
 return span[startIndex..];
 #else
 return span.Slice(startIndex);
 #endif
 }
}

Question 19

Looking forward to taking a closer look tomorrow. For the present, though, there's a now unnecessary continue; still lurking in the main loop. Cheers!

Question 20

That continue is only for readability purposes. However, what is more readable to me might be less readable for others ;)

Question 21

I have written such a comparer in the past and I recommend not to handle leading zeros as I had to remove it from my code again. It's mathematically correct but not what the user wants. In any GUI, if there are numbers with and without leading zeros present, then you want to have the ones with leading zeros together with the rest of the group with the same number of digits. I would sort "Foo1", "Foo2", "Foo99", "Foo001", "Foo002", "Foo099", "Foo999", "Foo1000" etc. and not "Foo1", "Foo001", "Foo2", "Foo002", "Foo099", "Foo999", "Foo1000".

Question 22

Congrats. Great learning experience!! It's only by climbing the mountain that you appreciate how 'rugged' the journey can be. The initial idea sounds quite simple. It often turns out there are unrecognised hurdles and decisions to be made along the way. The joy of coding is, in part, this improvisation... :-) Cheers! (PS: When asked by supervisors, "How long will this take?", be sure to respond, "How long is a piece of string? We won't know until we have it." All too often, the spec they wrote is full of holes they can't see...)

Question 23

If you want a review of the new code, you should create a follow up question. Questions are not open ended discussions.

Question 24

With strings of digits, equal strings are equal numbers, strings of equal lengths compare the same as ascii or numbers. For digit strings of different lengths with at least one equal digit at the start the longer string is larger.

Perform a letter compare which ends when one or both strings end or two letters are different. If one or both strings ended then the result is the letter compare result. Same if we found two different non-digits. All preceding numbers are the same.

Otherwise we have two letters which are different where at least one is a digit and all information to get the correct result.

If there is one digit only then the string with the extra digit is larger if there is a preceding digit; if there is no preceding digit then the strings are sorted according to letter sorting. If there are two different digits then we need to count the following digits. If the number of following digits is the same then the strings are sorted in letter order, otherwise the string with more digits is the larger one.

user272752user272752 · Answer 1 · 2025-01-16 01:52:09Z

"Allocation free"

"most solutions have a lot of allocations"
While I don't claim any knowledge of C#, the appearance of .ToString() suggests there are hidden (short lived) allocations happening in this code.
_{I'll gladly delete this point if I'm mistaken in my assumption.}

How much to re-invent

Beginners writing in C quickly learn to use (and trust) the Standard Library function qsort(), writing, as the OP has done, built-for-purpose compare() functions.

To serve user's needs, any compare() function (with its predefined parameters) is often accompanied by a wrapper that inverts to returning value. This allows calling code (eg: qsort()) to sort in either ascending or descending sequence.

Suggestion: Add a trivial public function to your class (CompareReverse()) that provides descending ordering to a generic sorting mill...

Flying too high

There's a tacit presumption that contiguous digits will convert to a value that can be represented in an int variable, not even an unsigned int capable of twice the magnitude.

For example, a credit card 'number' requires a 'high capacity' variable to store its 'integer' value.
It's not difficult to imagine even larger strings of digits being used by users.

The presumption of small numbers may need a re-think.

Don't mix apples with oranges

Consider this comment

 // If we get here, the numbers were equal and we need to continue looking for differences.

The OP's candidate data is, in fact, a string-ised data structure.

typedef struct {
 char str[ 10 ]; // string of 1-10 characters
 char str[ 10 ]; // string of 1-10 characters
 char str[ 10 ]; // string of 1-10 characters
 ...
};

Any number of segments each of which could be any length, and alpha segments are left justified while numeric segments are left padded with zeros. (Used 10 as a maximum segment length for this example.)

The code would be much easier to follow (and to maintain! (imo)) if it were broken into two independent 'service functions' to compare: 1) two segments full of alpha, and 2) two segments full of digits. The controlling function would be responsible for dissecting each segment, and ensuring alpha vs alpha and digits vs digits. As presented, these two different requirements seem to be hazardously interwoven.

Swapping priorities

^{In the following, alpha means any non-digit character.}

The alpha vs alpha comparison is no challenge, and the alpha vs digit is also rather easy to evaluate. The digits vs digits presents difficulties because each digit's applicable "power of 10" is unknown.

The OP's code prioritises the length of each substring of digits, then goes off to try to compare their values as integers.

Consider these scenarios:

"abc456444qrx"
"abc456444qry" // equal digits. Keep going until reaching 'x' vs 'y'
 ^--
"abc456444qrx"
"abc457444qry" // greater is TENTATIVE at 6 vs 7, confirmed at end of digits
 ^> ^!!!
"abc4569999qrx"
"abc457444qry" // greater is TENTATIVE at 6 vs 7, but conceded at 9 vs 'q'
 ^> ^!!! // assumption was wrong. Now we know!

The key idea being "which string becomes tentatively greater first while sequentially comparing paired digits?" While traversing remaining digits do not compare values, but be on the lookout for end of string or encountering an alpha character.

Make up a table listing the permutations of the conditions that could arise, and, for each entry, determine whether to return a relationship value (-1, 0, 1), or to continue scanning. This can be achieved with only a single index variable, but needs to be coded as simply as possible to begin with. Postpone combining operations to reduce lines of code. Get it working with a large number of outrageous test cases as the first priority. (Hint: comparing "HHGTTG" with "42" is decided on the very first character.)

Avoid slippery slopes

After reading the answer from @J_H, I've suddenly become aware of another scenario that could cause grief: Decimal Fractions.

From a numeric stand point, the value "foo5.8" is greater than "foo5.75". However, the latter's extra digit(s) to the right of the decimal point ARE already in their applicable power-of-ten sequence. That the user's data uses varying degrees of precision means simple "the longer sequence of digits represents the larger value" will not provide sorting into natural order.

(It's always the simplest of projects that turn out to have unforeseen complexities.)

@Fe203 About your paragraph "How much to re-invent": In C# we often order with linq, in this case you would call either myList.Order(new JacobusAlphanumericComparer()) or call myList.OrderDescending(new JacobusAlphanumericComparer()). But if you want to keep the same array reference by using Array.Sort(), then yes, you would have to create a reverse comparer.
In C# int and uint are 32-bit numbers with 9 - 10 digits. We also have 64-bit long and ulong with 19 - 20 digits. A decimal has 28 - 29 digits precision.
@OlivierJacot-Descombes Thank you. The point is to avoid (if possible) an arbitrary ceiling, and, otherwise, to acknowledge any recognised "restrictions" used in the code... The OP's use-case may be satisfied with "up to 6 digits, max!!" Just noting a caveat should be present as a comment...

greybeard greybeard 7,3913 gold badges21 silver badges55 bronze badges · Answer 2 · 2025-01-15 18:16:50Z

Code the way you think about problem and solution.
Document your code to at least the point where it doesn't need to be reverse engineered to find out how to use it - does Compare() not need a description for reassurance? (I don't know about C# and "documentation inheritance". But I think digit string interpretation to deserve a description, anyway - starting with leading zeroes, sign before a digit, letter case...
While private, CompareNumbers() could document numberEndIndex to be inclusive. (or ex? guess "the NET5-#else" confuses me))

Come to think of it, handling digit sequences and case is independent:
consider parametrising letter comparison.

I would not think compare index to string length "twice in a row". (same for ?NumberEnded.)
Come to think of it, if I kept whether I swapped arguments, I don't need to check with both strings at all. But overall, you seem prioritise readability over microoptimisation. Good. Maybe drop the special case in CompareNumbers().

I see how introducing xIsNumber to yNumberEnded is self-documenting.

The condition of "the digit loop" repeats that of the enclosing if - a code smell.

The character loop should use for - caveat: digit string handling as presented makes this choice less clear.

I think it would be simpler to start special casing digit strings when a character difference has been detected:
If one is a digit and the other is not, check if both preceding characters are digits - if yes, digit at current index is greater.
Otherwise establish which digit is greater and check (remaining) lengths: you never need to convert into a number representation of worrisome range.

Tried to code "non-converting digit string handling": leading zeroes make that interesting - if prepending "0" doesn't change order, further processing needs separate string indices.

J_H J_H 41.4k3 gold badges38 silver badges157 bronze badges · Answer 3 · 2025-01-16 20:19:44Z

Have I missed something?

write down the spec

After seeing the OP and the github ReadMe, I didn't notice anything that formally describes what "less than" means in this context.

unary plus / minus

These three characters might conceivably fit into CompareNumbers(), though they are not digits.

0x2b +
0x2d -
0x2e .

(I predict that when you do write down a spec, representing "a thousand" as 1e3 will be very very far out of scope.)

I feel that decimal behavior we get "for free" -- there's little difference between a3Z4 and a3.4, so we can probably exclude that one. But it troubles me that ASCII 0x2b < 0x2d, as with a+3 versus a-3, given that many mathematicians would assert \$-3 ~ < +3\$.

Yikes! @Fe2O3 points out that 0x2C , also matters, in Europe. Though again I appeal to there being little difference between a3Z4 and a3,4.

leading zeros

It's not obvious to me which of these is smallest.

agent-007
agent-07

That's something that would be cleared up by a written spec. (We might also throw agent+007 and agent+07 into the mix.)

unicode codepoints

You might possibly wish to limit valid input to just 7-bit ASCII. There's a vast number of ways to express the numeric concept of "two" in modern Unicode, and char.IsDigit() is familiar with many of them. Maybe throw an exception upon encountering an invalid input character?

Alternatively, you might prefer to write your own IsDigit() helper which recognizes exactly ten digits.

unwarranted assumption

 /// <param name="c">A char containing any digit 0-9.</param>
 private static int ParseDigitCharToInt(char c)
 {
 return c - '0';
 }

The "0-9" comment is great, thank you.

But the caller's guard of char.IsDigit() admits glyphs from several codepoint ranges beyond Latin, including

Arabic
Superscript
Mathematical Alphanumeric Symbols
Arabic-Indic
Mathematical Double-Struck
Roman Numerals
Coptic
Devanagari
Urdu
Persian
Sindhi
Thai
Tamil
Gurmukhi
Hangul

If we're going international, can't forget the inverse meanings of '.' and ',' between N.Am and Europe (at least)... :-)

Short Int Short Int 6262 silver badges14 bronze badges · Answer 4 · 2025-01-16 17:36:22Z

Allocation free? You could use 1 single new allocation simply for reversing the string for checking if the reverse is equal to the first string. It appears that if the lengths of the strings aren't equal it should return -1? Here is a code sample you could perhaps build on to get the result your looking for.

public static class Test
{
 private static bool IsValid(string value)
 {
 if (string.IsNullOrEmpty(value) ||
 string.IsNullOrWhiteSpace(value)) return false;
 foreach (char c in value)
 {
 if (!char.IsDigit(c)) return false;
 }
 return true;
 }
 private static bool IsComparableForward(string x, string y)
 {
 if (!IsValid(x) || !IsValid(y)) return false;
 if (x.Length != y.Length) return false;
 int length = x.Length + y.Length >> 1;
 for (int i = 0; i < length; i++)
 {
 char a = x[i];
 char b = y[i];
 if (a != b) return false;
 }
 return true;
 }
 private static bool IsComparableBackward(string x, string y)
 {
 if (!IsValid(x) || !IsValid(y)) return false;
 if (x.Length != y.Length) return false;
 //the only time a new string allocation 
 //will be done is for reversing y
 char[] temp = y.ToCharArray();
 Array.Reverse(temp);
 string y2 = new(temp);
 return IsComparableForward(x, y2);
 }
 /// <summary>
 /// Performs a comparison between <paramref 
 /// name="x"/> and <paramref name="y"/>
 /// </summary>
 /// <param name="x"></param>
 /// <param name="y"></param>
 /// <returns>-1 if not equal, 0 if comparable 
 /// forward, or 1 if comparable backward</returns>
 public static int Compare(string x, string y)
 { 
 if (!IsValid(x) || !IsValid(y)) return -1;
 
 return IsComparableForward(x, y) ? 0 : 
 IsComparableBackward(x, y) ? 1 : -1;
 }
}

For one you use String.IsNullOrEmpty and the other String.IsNullOrWhiteSpace, that surely should be the same.
No, String.IsNullOrEmpty checks for null & "" and String.IsNullORWhiteSpace checks for null or a string of any size that has nothing but white space for example string s = " " 1 whitespace, " " 6 whitespaces
@CharlesHenington Tip: use 'back-ticks' to present data... " " This keeps the formatter from squishing-down adjacent whitespace...

J.P. J.P. 916 bronze badges · Answer 5 · 2025-01-17 10:06:46Z

Based on all the feedback in the other answers, I've created a new version of the comparer.

Things I have decided not to support:

Scientific number representations like 1e3.
Unary plus/minus, meaning differing between positive and negative numbers is not supported.
Decimal fractions.

Changes:

More comments/documentation to clarify how the comparer and its private methods work.
The implementation no longer contains allocations for .NET versions lower than .NET 5.
Support for numbers greater than the C# int maximum value (2147483647). This change might also boost performance as it never has to convert characters to an actual number (like int or uint).
Leading zeroes are now correctly ignored.
No longer repeating if statements/conditions in while/for loop that are also conditions of the loop itself.
Made character loop a for loop instead of while.
Performance improvement for very similar strings by only using expensive ToUpper methods if the characters are not equal.

/// <summary>
/// Compares strings naturally/alphanumerically.
/// See <see href="https://en.wikipedia.org/wiki/Natural_sort_order"/> for more details.
/// </summary>
/// <remarks>
/// Does not support decimal fractions, only full numbers are compared.<br/>
/// Does not support negative numbers, so '-' and '+' will be compared just like any other non-digit character.<br/>
/// Does not support comparing scientific number representations like 1e3.
/// </remarks>
/// <example>
/// Compare("a", "a") == 0
/// Compare("a", "b") == -1
/// Compare("b", "a") == 1
/// Compare("a", "1") == 1
/// Compare("11", "4") == 1
/// Compare("11z", "2zzz") == 1
/// Compare("11zy", "11z") == 1
/// Compare("aa8aa", "aa7aa") == 1
/// Compare("b", "A") == 1
/// Compare("011", "10") == 1
/// Compare("10", "010") == 1 (this one was an arbitrary choice, because Windows also orders its folders that way).
/// Compare("21.49", "21.5") = -1 (no support for decimal fractions on purpose, because that would cause other issues with for example comparing semantic version numbers https://semver.org/)
/// </example>
/// <returns>
/// <b>Exactly zero</b> if equal.<br/>
/// <b>Greater than zero</b> if x is greater than y.<br/>
/// <b>Less than zero</b> if x is less than y.
/// </returns>
public sealed class JacobusAlphanumericComparer : IComparer<string>
{
 private readonly CultureInfo _culture;
 /// <param name="culture">The culture to use for comparing single characters, defaults to <see cref="CultureInfo.InvariantCulture"/>.</param>
 public JacobusAlphanumericComparer(CultureInfo culture = null)
 {
 _culture = culture ?? CultureInfo.InvariantCulture;
 }
 public int Compare(string x, string y)
 {
 if (x is null) return y is null ? 0 : -1;
 else if (y is null) return 1;
 // Loop while both strings still contain more characters...
 for (int index = 0; index < x.Length && index < y.Length; index++)
 {
 bool xIsNumber = char.IsDigit(x[index]);
 bool yIsNumber = char.IsDigit(y[index]);
 if (xIsNumber && yIsNumber)
 {
 int numberStartIndex = index;
 int xNumberEndIndex = index;
 int yNumberEndIndex = index;
 bool findingXNumberEnd;
 bool findingYNumberEnd;
 // Looping to find the end index of the number in both x and y.
 do
 {
 index++;
 findingXNumberEnd = index < x.Length && char.IsDigit(x[index]);
 if (findingXNumberEnd) xNumberEndIndex = index;
 findingYNumberEnd = index < y.Length && char.IsDigit(y[index]);
 if (findingYNumberEnd) yNumberEndIndex = index;
 }
 while (findingXNumberEnd || findingYNumberEnd);
 // Creating spans that start and end exactly where the numbers are.
 int xNumberLength = xNumberEndIndex - numberStartIndex + 1;
 ReadOnlySpan<char> xNumber = x.AsSpan(numberStartIndex, xNumberLength);
 int yNumberLength = yNumberEndIndex - numberStartIndex + 1;
 ReadOnlySpan<char> yNumber = y.AsSpan(numberStartIndex, yNumberLength);
 int numberComparison = CompareNumbers(xNumber, yNumber);
 bool numbersAreDifferent = numberComparison != 0;
 if (numbersAreDifferent) return numberComparison;
 // If we get here, the numbers were equal and we need to continue looking for differences.
 // Both xNumberEndIndex and yNumberEndIndex are equal here, so we can use any one of them.
 index = xNumberEndIndex;
 continue;
 }
 else
 {
 // Compare characters normally.
 int charComparison = CompareCharacters(x[index], y[index]);
 bool charactersAreDifferent = charComparison != 0;
 if (charactersAreDifferent) return charComparison;
 }
 }
 // If we get here, either:
 // - both strings were equal. For example: Compare("a1a", "a1a") == 0.
 // - or they were equal up until one string ended. For example: Compare("a1a", "a1aa") == -1
 return x.Length - y.Length;
 }
 private int CompareCharacters(char x, char y)
 {
 if (x == y) return 0;
 // ToUpperInvariant would be slightly faster (2ns vs 6ns according to benchmarks),
 // but not worth sacrificing the flexibility of supplying your preferred culture.
 char upperX = char.ToUpper(x, _culture);
 char upperY = char.ToUpper(y, _culture);
 return upperX.CompareTo(upperY);
 }
 /// <summary>
 /// Compares two texts numerically.
 /// </summary>
 /// <remarks>
 /// Ignores leading '0' characters, so CompareNumbers("013", "12") => 1
 /// </remarks>
 /// <example>
 /// CompareNumbers("12", "13") => -1
 /// CompareNumbers("13", "12") => 1
 /// CompareNumbers("12", "12") => 0
 /// CompareNumbers("012", "12") => -1
 /// </example>
 /// <param name="x">A span of characters containing only digits (0-9).</param>
 /// <param name="y">A span of characters containing only digits (0-9).</param>
 /// <returns>
 /// 0 if equal.<br/>
 /// 1 if <paramref name="x"/> is greater than <paramref name="y"/>.<br/>
 /// -1 if <paramref name="y"/> is greater than <paramref name="x"/>.
 /// </returns>
 private static int CompareNumbers(ReadOnlySpan<char> x, ReadOnlySpan<char> y)
 {
 // Trim leading '0' of x and y number, for example: "00123" becomes "123".
 ReadOnlySpan<char> trimmedX = ExcludeLeadingZeroes(x);
 ReadOnlySpan<char> trimmedY = ExcludeLeadingZeroes(y);
 // Return early if a number only contains zeroes: CompareNumbers("1", "0000") == 1.
 bool xContainsOnlyZeroes = trimmedX.Length is 0;
 bool yContainsOnlyZeroes = trimmedY.Length is 0;
 if (xContainsOnlyZeroes) return yContainsOnlyZeroes ? 0 : -1;
 else if (yContainsOnlyZeroes) return 1;
 // After leading zeroes are excluded, a longer number is always greater: CompareNumbers("1000", "999") == 1.
 if (trimmedX.Length > trimmedY.Length) return 1;
 else if (trimmedX.Length < trimmedY.Length) return -1;
 // If we get here, we know there are no leading zeroes and the numbers have the same length.
 // Now it's just a matter of comparing each character normally.
 for (int index = 0; index < trimmedX.Length; index++)
 {
 char xDigit = trimmedX[index];
 char yDigit = trimmedY[index];
 int comparison = xDigit.CompareTo(yDigit);
 if (comparison != 0) return comparison;
 }
 // TODO: Should "010" < "10" like Windows sorts its folders and files?
 // Or is "010" > "10" more logical? I'm clueless here.
 // Comparing original length to make sure "010" < "10".
 return y.Length - x.Length;
 }
 /// <summary>
 /// Shortens the span to exclude leading '0' characters.
 /// </summary>
 /// <example>
 /// ExcludeLeadingZeroes("00100200") => "100200"
 /// </example>
 /// <param name="span">The span to exclude leading zeroes from.</param>
 /// <returns>A span without leading zeroes.</returns>
 private static ReadOnlySpan<char> ExcludeLeadingZeroes(ReadOnlySpan<char> span)
 {
 int startIndex = 0;
 for (int i = 0; i < span.Length; i++)
 {
 if (span[i] is '0') startIndex++;
 else break;
 }
 // We can just return the same span if it had no leading zeroes.
 if (startIndex is 0) return span;
 // The range operator is only supported on .NET version 5 or greater,
 // in other versions .Slice() must be used.
 #if NET5_0_OR_GREATER
 return span[startIndex..];
 #else
 return span.Slice(startIndex);
 #endif
 }
}

Looking forward to taking a closer look tomorrow. For the present, though, there's a now unnecessary continue; still lurking in the main loop. Cheers!
That continue is only for readability purposes. However, what is more readable to me might be less readable for others ;)
I have written such a comparer in the past and I recommend not to handle leading zeros as I had to remove it from my code again. It's mathematically correct but not what the user wants. In any GUI, if there are numbers with and without leading zeros present, then you want to have the ones with leading zeros together with the rest of the group with the same number of digits. I would sort "Foo1", "Foo2", "Foo99", "Foo001", "Foo002", "Foo099", "Foo999", "Foo1000" etc. and not "Foo1", "Foo001", "Foo2", "Foo002", "Foo099", "Foo999", "Foo1000".
Congrats. Great learning experience!! It's only by climbing the mountain that you appreciate how 'rugged' the journey can be. The initial idea sounds quite simple. It often turns out there are unrecognised hurdles and decisions to be made along the way. The joy of coding is, in part, this improvisation... :-) Cheers! (PS: When asked by supervisors, "How long will this take?", be sure to respond, "How long is a piece of string? We won't know until we have it." All too often, the spec they wrote is full of holes they can't see...)
If you want a review of the new code, you should create a follow up question. Questions are not open ended discussions.

gnasher729 gnasher729 2,87014 silver badges13 bronze badges · Answer 6 · 2025-01-18 13:58:57Z

With strings of digits, equal strings are equal numbers, strings of equal lengths compare the same as ascii or numbers. For digit strings of different lengths with at least one equal digit at the start the longer string is larger.

Perform a letter compare which ends when one or both strings end or two letters are different. If one or both strings ended then the result is the letter compare result. Same if we found two different non-digits. All preceding numbers are the same.

Otherwise we have two letters which are different where at least one is a digit and all information to get the correct result.

If there is one digit only then the string with the extra digit is larger if there is a preceding digit; if there is no preceding digit then the strings are sorted according to letter sorting. If there are two different digits then we need to count the following digits. If the number of following digits is the same then the strings are sorted in letter order, otherwise the string with more digits is the larger one.

Stack Exchange Network

Fast allocation-free alphanumeric comparer used for sorting

6 Answers 6

"Allocation free"

How much to re-invent

Flying too high

Don't mix apples with oranges

Swapping priorities

Avoid slippery slopes

write down the spec

unary plus / minus

leading zeros

unicode codepoints

unwarranted assumption

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Fast allocation-free alphanumeric comparer used for sorting

6 Answers 6

"Allocation free"

How much to re-invent

Flying too high

Don't mix apples with oranges

Swapping priorities

Avoid slippery slopes

write down the spec

unary plus / minus

leading zeros

unicode codepoints

unwarranted assumption

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions