The question is about sorting alphanumeric string in a "more natural" order. Source: Marc LeBrun's "Sordid Sort" Computist Quiz
Often a system string comparison sorts alphanumeric data "unnaturally", as with these filenames:
File1 File10 ← ? File11 ← ? File12 ← ? File2 File3 File3X File4 File5 File6 File7 File8 File999
Describe an algorithm that will sort alphanumeric strings in a "more natural" order:
File1 File2 File3 File3X File4 File5 File6 File7 File8 File10 File11 File12 File999
The algorithm may only rely on comparing characters; to avoid overflow problems with very long numbers it must not convert substrings of digits into their numerical values.
Spoiler alert: If you would like to independently solve the above, please do so now before you let the below spoil it for you.
This implementation has its inspiration from Computist Quiz Answer Guide. Misunderstandings would be mine.
From the Answer Guide: Extra credit: discuss the behavior with respect to leading zeros, as in "007".
#!/usr/bin/perl -w
# Sordid Sort: http://fxpt.com/wp-content/uploads/2012/01/quiz.pdf
# Solution based on http://fxpt.com/wp-content/uploads/2012/01/QuizAnswers.pdf
use strict;
my $verbose = 0; # set it to non-zero to get regex matches printed to STDERR
my $ALNUM = qr/(\D*)(\d*)/; # match a pair of alphas-optional and/or num-optional
my @in = ();
while (<>) {
chomp;
push @in, $_;
}
sub sordidcmp {
my @a = $a =~ m/$ALNUM/g; # always even number of elements since $ALNUM matches in pairs; and always has a sentinel pair
my @b = $b =~ m/$ALNUM/g;
my $cmp = 0;
my $min_ab = (@a < @b) ? @a : @b;
my $i;
print STDERR "comparing (<".join("> <", @a).">) with (<".join("> <", @b).">)" if $verbose;
for ($i = 0; $i < $min_ab; ++$i) { # advance by 2; once here
$cmp = $a[$i] cmp $b[$i]; # compare alphas
last if $cmp; # end if we found unequals
++$i; # advance once more here
# compare nums
my $lenai = length($a[$i]);
my $lenbi = length($b[$i]);
# deal with Extra credit: prefix "0" to the shorter num
if ($lenai < $lenbi) {
$a[$i] = ("0" x ($lenbi-$lenai)).$a[$i];
} elsif ($lenbi < $lenai) {
$b[$i] = ("0" x ($lenai-$lenbi)).$b[$i];
}
$cmp = $a[$i] cmp $b[$i]; # note: stringwise compare of nums
last if $cmp; # end if we found unequals
}
print STDERR " = $cmp\n" if $verbose;
return $cmp;
}
print join "\n", sort sordidcmp @in;
print "\n" if @in; # guard against empty input
My questions are:
I observed that the match
@a = $a =~ m/$ALNUM/g
always produces a sentinel pair of ("", ""). Is the above code relying on undocumented behaviour? I could not find (read: understand) this from theperlop
,perlretut
, andperlre
perldocs.Can you help simplify the sordidcmp subroutine? Perl golf would be nice, but a more maintainable version is what I would be interested in. Of course, links to other solutions/implementations are welcome.
3 Answers 3
I observed that the match @a = $a =~ m/$ALNUM/g always produces a sentinel pair. Is the above code relying on undocumented behaviour?
I don't know what you mean by "sentinel pair", but that will always return two values per match since you have two captures.
Can you help simplify the sordidcmp subroutine?
You could replace sort sordidcmp
with Sort::Key::Natural's natsort
.
-
\$\begingroup\$ clarified sentinel pair as ("", ""). Running the above script after setting $verbose = 1; would print the matches and always a last pair of <> <>. And, thank you for the
Sort::Key::Natural
link. \$\endgroup\$vapace– vapace2013年10月16日 13:25:17 +00:00Commented Oct 16, 2013 at 13:25 -
\$\begingroup\$ Your pattern successfully matches the zero-length strings at the end of the string. \$\endgroup\$ikegami– ikegami2013年10月16日 13:53:58 +00:00Commented Oct 16, 2013 at 13:53
-
\$\begingroup\$ I didn't check to see if it supports very long numbers. \$\endgroup\$ikegami– ikegami2013年10月16日 14:32:43 +00:00Commented Oct 16, 2013 at 14:32
You can use a Schwartzian Transform:
use strict;
use warnings;
my @data=qw(File1 AFile10 afile10 file10 10 File11 File10 File2 File3 File3A
File3X File4 File5 File6 File12 File8 File999);
my @sorted = map { $_->[0] }
sort { $a->[0] cmp $b->[0]
||
$a->[1] <=> $b->[1]}
map { [$_, $_=~/(\d+)/] }
@data;
print join("\n", @sorted)."\n";
Prints:
10
AFile10
File1
File10
File11
File12
File2
File3
File3A
File3X
File4
File5
File6
File8
File999
afile10
file10
If you do not want to use the numeric value for the capture group, you can a) strip the leading zeros; b) substitute the length of the group and c) use its lexicographic sort of the group and do something like this:
my @data=qw(File1 AFile10 afile10 file10 10 File11 File10 File2 File3 File3A
File3X File4 File5 File6 File12 File8 File999 File000010);
my @sorted = map { $_->[0] }
sort { $a->[1] cmp $b->[1]
||
length($a->[2]) <=> length($b->[2])
||
$a->[2] cmp $b->[2]}
map { [$_, $_=~/(^[^\d]*)0*(\d+)/] }
@data;
print join("\n", @sorted)."\n";
Prints:
10
AFile10
File1
File2
File3
File3A
File3X
File4
File5
File6
File8
File10
File000010
File11
File12
File999
afile10
file10
-
\$\begingroup\$ I wouldn't consider putting
AFile
afterFile
a "more natural" order, regardless of the numeric suffix. \$\endgroup\$ThisSuitIsBlackNot– ThisSuitIsBlackNot2013年10月15日 23:52:31 +00:00Commented Oct 15, 2013 at 23:52 -
1\$\begingroup\$ Also, you need to change
$a->[2]
and$b->[2]
to$a->[0]
and$b->[0]
, respectively (turn on warnings and you'll see why). Finally, this solution ignores the requirement that "the algorithm may only rely on comparing characters; to avoid overflow problems with very long numbers it must not convert substrings of digits into their numerical values." \$\endgroup\$ThisSuitIsBlackNot– ThisSuitIsBlackNot2013年10月16日 00:06:34 +00:00Commented Oct 16, 2013 at 0:06 -
\$\begingroup\$ @ThisSuitIsBlackNot: I think I have addressed your comments with my edit. Thanks. \$\endgroup\$drewk– drewk2013年10月16日 03:45:28 +00:00Commented Oct 16, 2013 at 3:45
-
1\$\begingroup\$ @drewk: again, it works with one alpha and one num. what if there are more? sorry, the question did not explicitly mention about the cases of repeating alphas and nums. \$\endgroup\$vapace– vapace2013年10月16日 13:43:17 +00:00Commented Oct 16, 2013 at 13:43
-
\$\begingroup\$ @drewk: Thanks for the link to
Schwartzian transform
. \$\endgroup\$vapace– vapace2013年10月16日 14:11:00 +00:00Commented Oct 16, 2013 at 14:11
How about a comparator version:
sub alnum_compar($$)
{
my ($a0, $b0) = (shift, shift);
my $a = $a0;
my $b = $b0;
my $c;
while (length($a) && length($b)) {
my @a = $a =~ /^(\d+|\D+)(.*)$/;
my @b = $b =~ /^(\d+|\D+)(.*)$/;
if ($a[0] =~ /^\d/ && $b[0] =~ /^\d/) {
$c = int($a[0]) - int($b[0]);
} else {
$c = $a[0] cmp $b[0];
}
return $c if ($c != 0);
$a = $a[1];
$b = $b[1];
}
return length($a0) - length($b0);
}
Test:
my @data=qw(File1 AFile10 afile10 file10 10 File11 File10 File2 File3 File3A
File3X File4 File5 File6 File12 File8 File999 File000010);
print join("\n", sort alnum_compar @data) . "\n";
Results:
10
AFile10
File1
File2
File3
File3A
File3X
File4
File5
File6
File8
File10
File000010
File11
File12
File999
afile10
file10
-
1\$\begingroup\$ This isn't much of a review. Can you at least explain why this is better please? \$\endgroup\$RubberDuck– RubberDuck2014年08月30日 00:40:33 +00:00Commented Aug 30, 2014 at 0:40
-
\$\begingroup\$ I thought this can be one of solutions since it shows its algorithm clearer than the others. I'm sorry if it didn't match the review, I tried to answer the original article, it said "links to other solutions/implementations are welcome". \$\endgroup\$Masahiko Yasuoka– Masahiko Yasuoka2014年11月06日 00:32:11 +00:00Commented Nov 6, 2014 at 0:32
Sort::Key
,Sort::Naturally
. \$\endgroup\$