How do you "sordid sort" alphanumeric strings?

Question 1

The question is about sorting alphanumeric string in a "more natural" order. Source: Marc LeBrun's "Sordid Sort" Computist Quiz

Often a system string comparison sorts alphanumeric data "unnaturally", as with these filenames:
File1
File10 ← ?
File11 ← ?
File12 ← ?
File2
File3
File3X
File4
File5
File6
File7
File8
File999
Describe an algorithm that will sort alphanumeric strings in a "more natural" order:
File1
File2
File3
File3X
File4
File5
File6
File7
File8
File10
File11
File12
File999
The algorithm may only rely on comparing characters; to avoid overflow problems with very long numbers it must not convert substrings of digits into their numerical values.

Spoiler alert: If you would like to independently solve the above, please do so now before you let the below spoil it for you.

This implementation has its inspiration from Computist Quiz Answer Guide. Misunderstandings would be mine.

From the Answer Guide: Extra credit: discuss the behavior with respect to leading zeros, as in "007".

#!/usr/bin/perl -w
# Sordid Sort: http://fxpt.com/wp-content/uploads/2012/01/quiz.pdf
# Solution based on http://fxpt.com/wp-content/uploads/2012/01/QuizAnswers.pdf
use strict;
my $verbose = 0; # set it to non-zero to get regex matches printed to STDERR
my $ALNUM = qr/(\D*)(\d*)/; # match a pair of alphas-optional and/or num-optional
my @in = ();
while (<>) {
 chomp;
 push @in, $_; 
}
sub sordidcmp {
 my @a = $a =~ m/$ALNUM/g; # always even number of elements since $ALNUM matches in pairs; and always has a sentinel pair
 my @b = $b =~ m/$ALNUM/g;
 my $cmp = 0;
 my $min_ab = (@a < @b) ? @a : @b;
 my $i;
 print STDERR "comparing (<".join("> <", @a).">) with (<".join("> <", @b).">)" if $verbose;
 for ($i = 0; $i < $min_ab; ++$i) { # advance by 2; once here
 $cmp = $a[$i] cmp $b[$i]; # compare alphas
 last if $cmp; # end if we found unequals
 ++$i; # advance once more here
 # compare nums
 my $lenai = length($a[$i]);
 my $lenbi = length($b[$i]);
 # deal with Extra credit: prefix "0" to the shorter num
 if ($lenai < $lenbi) {
 $a[$i] = ("0" x ($lenbi-$lenai)).$a[$i];
 } elsif ($lenbi < $lenai) {
 $b[$i] = ("0" x ($lenai-$lenbi)).$b[$i];
 }
 $cmp = $a[$i] cmp $b[$i]; # note: stringwise compare of nums
 last if $cmp; # end if we found unequals
 }
 print STDERR " = $cmp\n" if $verbose;
 return $cmp;
}
print join "\n", sort sordidcmp @in;
print "\n" if @in; # guard against empty input

My questions are:

I observed that the match @a = $a =~ m/$ALNUM/g always produces a sentinel pair of ("", ""). Is the above code relying on undocumented behaviour? I could not find (read: understand) this from the perlop, perlretut, and perlre perldocs.
Can you help simplify the sordidcmp subroutine? Perl golf would be nice, but a more maintainable version is what I would be interested in. Of course, links to other solutions/implementations are welcome.

Question 2

When did SO turn into a combination of Code Review and Code Golf

Question 3

@JimGarrison thanks for the links, I did not know about them.

Question 4

Found a Perl golf challenge dated Jul-2002 @ http://thospel.home.xs4all.nl/golf/challenge.html as "Human sort" with some commentary on the golf solutions as well!

Question 5

More spoilers: Sort::Key, Sort::Naturally.

Question 6

I observed that the match @a = $a =~ m/$ALNUM/g always produces a sentinel pair. Is the above code relying on undocumented behaviour?

I don't know what you mean by "sentinel pair", but that will always return two values per match since you have two captures.

Can you help simplify the sordidcmp subroutine?

You could replace sort sordidcmp with Sort::Key::Natural's natsort.

Question 7

clarified sentinel pair as ("", ""). Running the above script after setting $verbose = 1; would print the matches and always a last pair of <> <>. And, thank you for the Sort::Key::Natural link.

Question 8

Your pattern successfully matches the zero-length strings at the end of the string.

Question 9

I didn't check to see if it supports very long numbers.

Question 10

You can use a Schwartzian Transform:

use strict;
use warnings;
my @data=qw(File1 AFile10 afile10 file10 10 File11 File10 File2 File3 File3A
 File3X File4 File5 File6 File12 File8 File999);
my @sorted = map { $_->[0] }
 sort { $a->[0] cmp $b->[0] 
 ||
 $a->[1] <=> $b->[1]}
 map { [$_, $_=~/(\d+)/] }
 @data;
print join("\n", @sorted)."\n";

Prints:

10
AFile10
File1
File10
File11
File12
File2
File3
File3A
File3X
File4
File5
File6
File8
File999
afile10
file10

If you do not want to use the numeric value for the capture group, you can a) strip the leading zeros; b) substitute the length of the group and c) use its lexicographic sort of the group and do something like this:

my @data=qw(File1 AFile10 afile10 file10 10 File11 File10 File2 File3 File3A
 File3X File4 File5 File6 File12 File8 File999 File000010);
my @sorted = map { $_->[0] }
 sort { $a->[1] cmp $b->[1] 
 ||
 length($a->[2]) <=> length($b->[2])
 ||
 $a->[2] cmp $b->[2]}
 map { [$_, $_=~/(^[^\d]*)0*(\d+)/] }
 @data;
print join("\n", @sorted)."\n";

Prints:

10
AFile10
File1
File2
File3
File3A
File3X
File4
File5
File6
File8
File10
File000010
File11
File12
File999
afile10
file10

Question 11

I wouldn't consider putting AFile after File a "more natural" order, regardless of the numeric suffix.

Question 12

Also, you need to change $a->[2] and $b->[2] to $a->[0] and $b->[0], respectively (turn on warnings and you'll see why). Finally, this solution ignores the requirement that "the algorithm may only rely on comparing characters; to avoid overflow problems with very long numbers it must not convert substrings of digits into their numerical values."

Question 13

@ThisSuitIsBlackNot: I think I have addressed your comments with my edit. Thanks.

Question 14

@drewk: again, it works with one alpha and one num. what if there are more? sorry, the question did not explicitly mention about the cases of repeating alphas and nums.

Question 15

@drewk: Thanks for the link to Schwartzian transform.

Question 16

How about a comparator version:

sub alnum_compar($$)
{
 my ($a0, $b0) = (shift, shift);
 my $a = $a0;
 my $b = $b0;
 my $c;
 while (length($a) && length($b)) {
 my @a = $a =~ /^(\d+|\D+)(.*)$/;
 my @b = $b =~ /^(\d+|\D+)(.*)$/;
 if ($a[0] =~ /^\d/ && $b[0] =~ /^\d/) {
 $c = int($a[0]) - int($b[0]);
 } else {
 $c = $a[0] cmp $b[0];
 }
 return $c if ($c != 0);
 $a = $a[1];
 $b = $b[1];
 }
 return length($a0) - length($b0);
}

Test:

my @data=qw(File1 AFile10 afile10 file10 10 File11 File10 File2 File3 File3A
 File3X File4 File5 File6 File12 File8 File999 File000010);
print join("\n", sort alnum_compar @data) . "\n";

Results:

10
AFile10
File1
File2
File3
File3A
File3X
File4
File5
File6
File8
File10
File000010
File11
File12
File999
afile10
file10

Question 17

This isn't much of a review. Can you at least explain why this is better please?

Question 18

I thought this can be one of solutions since it shows its algorithm clearer than the others. I'm sorry if it didn't match the review, I tried to answer the original article, it said "links to other solutions/implementations are welcome".

ikegami ikegami 2741 silver badge9 bronze badges · Accepted Answer · 2013-10-16 00:57:55Z

3

\$\begingroup\$

I observed that the match @a = $a =~ m/$ALNUM/g always produces a sentinel pair. Is the above code relying on undocumented behaviour?

I don't know what you mean by "sentinel pair", but that will always return two values per match since you have two captures.

Can you help simplify the sordidcmp subroutine?

You could replace sort sordidcmp with Sort::Key::Natural's natsort.

Share

edited May 29, 2020 at 3:16

answered Oct 16, 2013 at 0:57

ikegami's user avatar

ikegami ikegami

2741 silver badge9 bronze badges

\$\endgroup\$

3

\$\begingroup\$ clarified sentinel pair as ("", ""). Running the above script after setting $verbose = 1; would print the matches and always a last pair of <> <>. And, thank you for the Sort::Key::Natural link. \$\endgroup\$

vapace
– vapace

2013年10月16日 13:25:17 +00:00
Commented Oct 16, 2013 at 13:25
\$\begingroup\$ Your pattern successfully matches the zero-length strings at the end of the string. \$\endgroup\$

ikegami
– ikegami

2013年10月16日 13:53:58 +00:00
Commented Oct 16, 2013 at 13:53
\$\begingroup\$ I didn't check to see if it supports very long numbers. \$\endgroup\$

ikegami
– ikegami

2013年10月16日 14:32:43 +00:00
Commented Oct 16, 2013 at 14:32

Add a comment |

Stack Exchange Network

How do you "sordid sort" alphanumeric strings?

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

How do you "sordid sort" alphanumeric strings?

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions