2
\$\begingroup\$

The question is about sorting alphanumeric string in a "more natural" order. Source: Marc LeBrun's "Sordid Sort" Computist Quiz

Often a system string comparison sorts alphanumeric data "unnaturally", as with these filenames:

File1
File10 ← ?
File11 ← ?
File12 ← ?
File2
File3
File3X
File4
File5
File6
File7
File8
File999

Describe an algorithm that will sort alphanumeric strings in a "more natural" order:

File1
File2
File3
File3X
File4
File5
File6
File7
File8
File10
File11
File12
File999

The algorithm may only rely on comparing characters; to avoid overflow problems with very long numbers it must not convert substrings of digits into their numerical values.

Spoiler alert: If you would like to independently solve the above, please do so now before you let the below spoil it for you.

This implementation has its inspiration from Computist Quiz Answer Guide. Misunderstandings would be mine.

From the Answer Guide: Extra credit: discuss the behavior with respect to leading zeros, as in "007".

#!/usr/bin/perl -w
# Sordid Sort: http://fxpt.com/wp-content/uploads/2012/01/quiz.pdf
# Solution based on http://fxpt.com/wp-content/uploads/2012/01/QuizAnswers.pdf
use strict;
my $verbose = 0; # set it to non-zero to get regex matches printed to STDERR
my $ALNUM = qr/(\D*)(\d*)/; # match a pair of alphas-optional and/or num-optional
my @in = ();
while (<>) {
 chomp;
 push @in, $_; 
}
sub sordidcmp {
 my @a = $a =~ m/$ALNUM/g; # always even number of elements since $ALNUM matches in pairs; and always has a sentinel pair
 my @b = $b =~ m/$ALNUM/g;
 my $cmp = 0;
 my $min_ab = (@a < @b) ? @a : @b;
 my $i;
 print STDERR "comparing (<".join("> <", @a).">) with (<".join("> <", @b).">)" if $verbose;
 for ($i = 0; $i < $min_ab; ++$i) { # advance by 2; once here
 $cmp = $a[$i] cmp $b[$i]; # compare alphas
 last if $cmp; # end if we found unequals
 ++$i; # advance once more here
 # compare nums
 my $lenai = length($a[$i]);
 my $lenbi = length($b[$i]);
 # deal with Extra credit: prefix "0" to the shorter num
 if ($lenai < $lenbi) {
 $a[$i] = ("0" x ($lenbi-$lenai)).$a[$i];
 } elsif ($lenbi < $lenai) {
 $b[$i] = ("0" x ($lenai-$lenbi)).$b[$i];
 }
 $cmp = $a[$i] cmp $b[$i]; # note: stringwise compare of nums
 last if $cmp; # end if we found unequals
 }
 print STDERR " = $cmp\n" if $verbose;
 return $cmp;
}
print join "\n", sort sordidcmp @in;
print "\n" if @in; # guard against empty input

My questions are:

  1. I observed that the match @a = $a =~ m/$ALNUM/g always produces a sentinel pair of ("", ""). Is the above code relying on undocumented behaviour? I could not find (read: understand) this from the perlop, perlretut, and perlre perldocs.

  2. Can you help simplify the sordidcmp subroutine? Perl golf would be nice, but a more maintainable version is what I would be interested in. Of course, links to other solutions/implementations are welcome.

asked Oct 15, 2013 at 19:22
\$\endgroup\$
4
  • 5
    \$\begingroup\$ When did SO turn into a combination of Code Review and Code Golf \$\endgroup\$ Commented Oct 15, 2013 at 19:39
  • \$\begingroup\$ @JimGarrison thanks for the links, I did not know about them. \$\endgroup\$ Commented Oct 15, 2013 at 19:49
  • \$\begingroup\$ Found a Perl golf challenge dated Jul-2002 @ http://thospel.home.xs4all.nl/golf/challenge.html as "Human sort" with some commentary on the golf solutions as well! \$\endgroup\$ Commented Oct 15, 2013 at 20:02
  • 4
    \$\begingroup\$ More spoilers: Sort::Key, Sort::Naturally. \$\endgroup\$ Commented Oct 15, 2013 at 22:49

3 Answers 3

3
\$\begingroup\$

I observed that the match @a = $a =~ m/$ALNUM/g always produces a sentinel pair. Is the above code relying on undocumented behaviour?

I don't know what you mean by "sentinel pair", but that will always return two values per match since you have two captures.

Can you help simplify the sordidcmp subroutine?

You could replace sort sordidcmp with Sort::Key::Natural's natsort.

answered Oct 16, 2013 at 0:57
\$\endgroup\$
3
  • \$\begingroup\$ clarified sentinel pair as ("", ""). Running the above script after setting $verbose = 1; would print the matches and always a last pair of <> <>. And, thank you for the Sort::Key::Natural link. \$\endgroup\$ Commented Oct 16, 2013 at 13:25
  • \$\begingroup\$ Your pattern successfully matches the zero-length strings at the end of the string. \$\endgroup\$ Commented Oct 16, 2013 at 13:53
  • \$\begingroup\$ I didn't check to see if it supports very long numbers. \$\endgroup\$ Commented Oct 16, 2013 at 14:32
1
\$\begingroup\$

You can use a Schwartzian Transform:

use strict;
use warnings;
my @data=qw(File1 AFile10 afile10 file10 10 File11 File10 File2 File3 File3A
 File3X File4 File5 File6 File12 File8 File999);
my @sorted = map { $_->[0] }
 sort { $a->[0] cmp $b->[0] 
 ||
 $a->[1] <=> $b->[1]}
 map { [$_, $_=~/(\d+)/] }
 @data;
print join("\n", @sorted)."\n";

Prints:

10
AFile10
File1
File10
File11
File12
File2
File3
File3A
File3X
File4
File5
File6
File8
File999
afile10
file10

If you do not want to use the numeric value for the capture group, you can a) strip the leading zeros; b) substitute the length of the group and c) use its lexicographic sort of the group and do something like this:

my @data=qw(File1 AFile10 afile10 file10 10 File11 File10 File2 File3 File3A
 File3X File4 File5 File6 File12 File8 File999 File000010);
my @sorted = map { $_->[0] }
 sort { $a->[1] cmp $b->[1] 
 ||
 length($a->[2]) <=> length($b->[2])
 ||
 $a->[2] cmp $b->[2]}
 map { [$_, $_=~/(^[^\d]*)0*(\d+)/] }
 @data;
print join("\n", @sorted)."\n";

Prints:

10
AFile10
File1
File2
File3
File3A
File3X
File4
File5
File6
File8
File10
File000010
File11
File12
File999
afile10
file10
answered Oct 15, 2013 at 19:33
\$\endgroup\$
5
  • \$\begingroup\$ I wouldn't consider putting AFile after File a "more natural" order, regardless of the numeric suffix. \$\endgroup\$ Commented Oct 15, 2013 at 23:52
  • 1
    \$\begingroup\$ Also, you need to change $a->[2] and $b->[2] to $a->[0] and $b->[0], respectively (turn on warnings and you'll see why). Finally, this solution ignores the requirement that "the algorithm may only rely on comparing characters; to avoid overflow problems with very long numbers it must not convert substrings of digits into their numerical values." \$\endgroup\$ Commented Oct 16, 2013 at 0:06
  • \$\begingroup\$ @ThisSuitIsBlackNot: I think I have addressed your comments with my edit. Thanks. \$\endgroup\$ Commented Oct 16, 2013 at 3:45
  • 1
    \$\begingroup\$ @drewk: again, it works with one alpha and one num. what if there are more? sorry, the question did not explicitly mention about the cases of repeating alphas and nums. \$\endgroup\$ Commented Oct 16, 2013 at 13:43
  • \$\begingroup\$ @drewk: Thanks for the link to Schwartzian transform. \$\endgroup\$ Commented Oct 16, 2013 at 14:11
1
\$\begingroup\$

How about a comparator version:

sub alnum_compar($$)
{
 my ($a0, $b0) = (shift, shift);
 my $a = $a0;
 my $b = $b0;
 my $c;
 while (length($a) && length($b)) {
 my @a = $a =~ /^(\d+|\D+)(.*)$/;
 my @b = $b =~ /^(\d+|\D+)(.*)$/;
 if ($a[0] =~ /^\d/ && $b[0] =~ /^\d/) {
 $c = int($a[0]) - int($b[0]);
 } else {
 $c = $a[0] cmp $b[0];
 }
 return $c if ($c != 0);
 $a = $a[1];
 $b = $b[1];
 }
 return length($a0) - length($b0);
}

Test:

my @data=qw(File1 AFile10 afile10 file10 10 File11 File10 File2 File3 File3A
 File3X File4 File5 File6 File12 File8 File999 File000010);
print join("\n", sort alnum_compar @data) . "\n";

Results:

10
AFile10
File1
File2
File3
File3A
File3X
File4
File5
File6
File8
File10
File000010
File11
File12
File999
afile10
file10
answered Aug 30, 2014 at 0:32
\$\endgroup\$
2
  • 1
    \$\begingroup\$ This isn't much of a review. Can you at least explain why this is better please? \$\endgroup\$ Commented Aug 30, 2014 at 0:40
  • \$\begingroup\$ I thought this can be one of solutions since it shows its algorithm clearer than the others. I'm sorry if it didn't match the review, I tried to answer the original article, it said "links to other solutions/implementations are welcome". \$\endgroup\$ Commented Nov 6, 2014 at 0:32

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.