I've been looking for an implementation of Comparator
that provides a "natural order". I found a couple but they were buggy or poorly designed. I wrote my own with 3 goals:
- efficient (in terms of execution time)
- well-designed (in terms of re-usability)
- readable (in terms of maintainability)
Natural Order
Here is a more rigorous definition of "natural order":
- Non-Alphanumeric characters are first
- Ordered by their ASCII values
- Numeric characters come second
- Consecutive numerics are group
- Then ordered by the resulting integer value
- Alphabetic characters come third
- Alphabetically ordered
Problems
With my current implementation, the way the TYPEs
are handle seems really messy. Especially since OTHER
and ALPHA
are chars
and DIGIT
is a String
. I'm also not a fan of using ALPHA_ORDER
to hard-code alphabetical ordering. However, I am at the end of my rope on how to make these cleaner. I'm also not sure how to better meet my 3 goals.
Code
import java.util.Comparator;
/**
* Compares strings in a 'natural' way. The ordering is as follows:
* + Non-Alphanumeric characters are first
* - Ordered by their ASCII values
* + Numeric characters come second
* - Consecutive numerics are group
* - Then ordered by the resulting integer value
* + Alphabetic characters come third
* - Alphabetically ordered
* For example, {A,C,03,1,b,002,$} would be sorted into {,1,002,03,ドルA,b,C}.
* @author NonlinearFruit
*/
public class NaturalOrderComparator implements Comparator
{
/**
* This enum represents the three types of characters:
* + DIGIT - [0-9]+
* + ALPHA - [A-Za-z]
* + OTHER - [^0-9A-Za-z]
*/
private static enum TYPE{
DIGIT(1), ALPHA(2), OTHER(0);
/**
* Takes a char and returns the TYPE it belongs to
* @param x a single char
* @return TYPE of x
*/
public static TYPE getType(char x){
if(Character.isDigit(x))
return DIGIT;
if(Character.isAlphabetic(x))
return ALPHA;
return OTHER;
}
/**
* Compares two TYPEs based on natural order. Returns:
* + Negative if `a` is before `b`
* + Zero if `a` and `b` are the same TYPE
* + Positive if `a` is after `b`
* @param a
* @param b
* @return a negative integer, zero, or a positive integer as the
* first argument is less than, equal to, or greater than the
* second.
*/
public static int compare(TYPE a, TYPE b){
return Integer.compare(a.getValue(), b.getValue());
}
private int value;
TYPE(int v){
value = v;
}
private int getValue(){
return value;
}
}
/**
* This provides the ordering of ALPHA characters
*/
private static final String ALPHA_ORDER = "AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz";
/**
* Takes two integer strings and compares the integer value of them. For
* example, "03" and "1" would result in 1
* @param a
* @param b
* @return a negative integer, zero, or a positive integer as the
* first argument is less than, equal to, or greater than the
* second.
*/
private int compareDigitUnits(String a, String b)
{
int aInt = Integer.parseInt(a);
int bInt = Integer.parseInt(b);
return Integer.compare(aInt, bInt);
}
/**
* Takes two characters and compares them based on Alphabetical ordering.
* @param a
* @param b
* @return a negative integer, zero, or a positive integer as the
* first argument is less than, equal to, or greater than the
* second.
*/
private int compareAlphaUnits(char a, char b)
{
if(TYPE.getType(a)!=TYPE.ALPHA)
throw new IllegalArgumentException(a+" is not of TYPE Alpha");
if(TYPE.getType(b)!=TYPE.ALPHA)
throw new IllegalArgumentException(b+" is not of TYPE Alpha");
int indexA = ALPHA_ORDER.indexOf(a);
int indexB = ALPHA_ORDER.indexOf(b);
if(indexA<0)
throw new IllegalArgumentException(a+" is not in '"+ALPHA_ORDER+"'");
if(indexB<0)
throw new IllegalArgumentException(b+" is not in '"+ALPHA_ORDER+"'");
return Integer.compare(indexA, indexB);
}
/**
* Takes two character of TYPE Other and compares them based on their ASCII
* values.
* @param a
* @param b
* @return a negative integer, zero, or a positive integer as the
* first argument is less than, equal to, or greater than the
* second.
*/
private int compareOtherUnits(char a, char b)
{
if(TYPE.getType(a)!=TYPE.OTHER)
throw new IllegalArgumentException(a+" is not of TYPE Other");
if(TYPE.getType(b)!=TYPE.OTHER)
throw new IllegalArgumentException(b+" is not of TYPE Other");
return Character.compare(a, b);
}
/**
* This takes two strings and compares them based on natural ordering/
* @param o1 String of characters
* @param o2 String of characters
* @return a negative integer, zero, or a positive integer as the
* first argument is less than, equal to, or greater than the
* second.
*/
public int compare(Object o1, Object o2)
{
StringBuilder a = new StringBuilder(o1.toString());
StringBuilder b = new StringBuilder(o2.toString());
while(a.length()>0 && b.length()>0){
String nextAUnit = getNextUnit(a.toString());
String nextBUnit = getNextUnit(b.toString());
a.delete(0, nextAUnit.length());
b.delete(0, nextBUnit.length());
TYPE aUnitType = TYPE.getType(nextAUnit.charAt(0));
TYPE bUnitType = TYPE.getType(nextBUnit.charAt(0));
int result = TYPE.compare(aUnitType, bUnitType);
if(result != 0)
return result;
switch(aUnitType){
case ALPHA:
result = compareAlphaUnits(nextAUnit.charAt(0), nextBUnit.charAt(0));
break;
case DIGIT:
result = compareDigitUnits(nextAUnit, nextBUnit);
break;
case OTHER:
default:
result = compareOtherUnits(nextAUnit.charAt(0), nextBUnit.charAt(0));
break;
}
if(result != 0)
return result;
}
return 0;
}
/**
* Takes a string of characters and returns the 1st character unless it is
* a digit. If the 1st character is a digit, it returns the 1st character
* plus all of the consecutive digits.
* @param s
* @return a character or a string of digits
*/
private static String getNextUnit(String s){
char firstChar = s.charAt(0);
if(!Character.isDigit(firstChar))
return String.valueOf(firstChar);
StringBuilder bob = new StringBuilder();
char[] chars = s.toCharArray();
for (char aChar : chars) {
if(Character.isDigit(aChar))
bob.append(aChar);
else
break;
}
return bob.toString();
}
}
3 Answers 3
Overall I think the code is readable, and has good organization. Also the formatting is good, and the comments are nice.
Design/Readability
Since you are comparing
Strings
you should implementComparator<String>
, this will allow you to simplifycompare
method to takeStrings
as input, and you could get rid of theStringBuilders
inside.I think the error checking is a bit too much. For example, in your
compareAlphaUnits
method, you are checking if the letters are classified asALPHA
, and also if they exist in your map, even though the classification is based on the map. Since these are private methods and nobody could change them, it's safe to make assumptions about how they are used. SocompareAlphaUnits
andcompareOtherUnits
could be simplified to:private int compareAlphaUnits(char a, char b) { int indexA = ALPHA_ORDER.indexOf(a); int indexB = ALPHA_ORDER.indexOf(b); return Integer.compare(indexA, indexB); } private int compareOtherUnits(char a, char b) { return Character.compare(a, b); }
In
compareDigitUnits
, you have the risk of overflow if either string has some number that doesn't fit in an integer. A better way is to iterate over the characters and compare them one by one.You are using extra memory and doing extra effort in
compare
method by continuously getting a unit then deleting it. You could save an index for each string, and keep reusing it ingetNextUnit
.getNextUnit
could be simplified by using awhile
loop to get the last digit character, and thensubstring
method. It might also be worth it to extract that in another private method.private static String getNextUnit(String s, int startIndex){ char firstChar = s.charAt(startIndex); if(!Character.isDigit(firstChar)) return String.valueOf(firstChar); end = startIndex; while (endIndex < s.length() && Character.isDigit(s.charAt(endIndex))) { endIndex++; } return s.substring(startIndex, endIndex+1); }
Enum names shouldn't be all caps, so you should change
TYPE
toType
.Comments are good in general, but I would try to make them less verbose and not stating obvious things, like the comment
@return TYPE of x
above thegetType
method.
Efficiency
In
compareAlphaUnits
, you are doing a linear scan to get the index for each character. One other way is to initialize a map from each letter to its index in the constructor, and use it in the comparison. This will drop the cost from possible52
loop iterations to1
operation.Also related to long substrings of digits, is that you are extracting those into strings, so you could theoretically need as much memory as the original string in intermediate strings. A possible way to optimize is to pass the start and end indexes of the comparison units to the comparison methods instead of cutting the
String
. This is a tradeoff between efficiency and readability, since passing the indexes (or an object containing the indexes) will make the methods more complicated and less readable.
I love pathological edge cases. Consider these values:
03Units
3Units
003Units
What is their proper order? What order, if predictable, will NaturalOrderComparator
place them in?
As I read it, they will all be considered as equal
by NaturalOrderComparator
even though they are not This results in non-deterministic behavoir for a sorting algorythm that relies on NaturalOrderComparator
.
I can't say what their proper order should be, but it should have an order, as they are not equal. compareDigitUnits
is the place to correct for this. Possibly by doing a string comparisson for integer results of equal
.
Edit
Consider these strings. What should the final order be?
U1B
01
U01A
1.1.2-a
U1
001
01B
U01
001C
1.1b.2-a
U1.001.2-b
1.001.2-b
1.1a.2-a
U1.01.02-b
1.01.02-b
1C
U001B
U01C
U1A
1.01.2-b
01A
U001C
U01B
U1.1.2-b
U1.1a.2-a
U1.01.2-b
1.001.2-a
01C
001A
U1.01.2-a
U1.001.2-a
U1.1.2-a
1.01.2-a
1
1.1.2-b
1A
U1.1b.2-a
001B
U001A
U1.01.02-a
1B
U1C
U001
1.01.02-a
Another consideration would be hex strings. I haven't looked at them relative to your code, but I know that often I've been less than satisfied with how they have been sorted by other routines and tools. Sorted purely on ASCII value they sort nicely (if case insensitive), but adding the parsing you do, will they remain "natural" or not. Of course, hex string aren't natural, in an of themselves, but some of us computer types think they are/ :)
-
\$\begingroup\$ Brilliant! All things being equal the shorter should go first. But this can't be determined in
compareDigitUnits
. Otherwise,03C, 3B, 003A
would be sorted as3B, 03C,003A
when it should be sorted003A, 3B, 03C
. Since the number parts are equal, sorting should depend on the following characters. \$\endgroup\$NonlinearFruit– NonlinearFruit2017年03月07日 15:49:07 +00:00Commented Mar 7, 2017 at 15:49 -
\$\begingroup\$ @NonlinearFruit you're correct that
compareDigitUnits
can't handle it. I suppose the first step will be to defining what the order should be, then recoding to meet that. I've created a minor set of strings to consider and added it to my answer. \$\endgroup\$Chindraba– Chindraba2017年03月08日 07:51:56 +00:00Commented Mar 8, 2017 at 7:51 -
\$\begingroup\$ Here is a pastebin of the results. These are valid considerations, but the purpose of this natural sort is to be intuitive to end (non-technical) users. \$\endgroup\$NonlinearFruit– NonlinearFruit2017年03月08日 10:41:55 +00:00Commented Mar 8, 2017 at 10:41
-
\$\begingroup\$ @NonlinearFruit That's why I only tacked it onto the end, and didn't include any hex in the sample strings. And while we're used to seeing the
1.1.2-b
type data as version numbers, it's also used a lot in documentation for business and legal environments, and could need sorting for table of authority type entries. So those types are non-trivial for end users. \$\endgroup\$Chindraba– Chindraba2017年03月09日 07:42:28 +00:00Commented Mar 9, 2017 at 7:42
In my opinion, your approach has two main drawbacks:
All the code is within a single class. Why not split it? In my opinion, there are at least two main responsibilities to be split in different objects: deciding what type a unit is, and comparing it. Answer this question: what needs to be changed if you would come up with a fourth type?
Take a look at your
public int compare(Object o1, Object o2)
andprivate int compareAlphaUnits(char a, char b)
(among others). See all those pairs of duplicate lines? That's an indicator that it should be done somewhere else, in another abstraction layer that you'd invoke once, not twice. The combination ofswitch
and enum is also a hint that you should be thinking of polymorphism.You don't leverage Java's native comparison features. Java already knows how to compare characters, you could get rid of
ALPHA_ORDER
.
Other minor things:
- If you expect, and specify, that the inputs must be
String
, why do you allowObject
, why not restrict toString
? Your class should also reflect that:public class NaturalOrderComparator implements Comparator<String>
. - Add
@Override
tocompare()
. - Similarly for
TYPE
, if you plan to compare it, then specify it:enum TYPE implements Comparable<TYPE>
. static
is redundant for inner enums.- I agree with @Hesham in that you shouldn't be doing the check at the beginning of
compareAlphaUnits(char a, char b)
andcompareOtherUnits(char a, char b)
. This situation shouldn't happen at this point. - Why is
getNextUnit()
static? It's a bit confusing to me, I would expect it to be static because either another static method invokes it or because it's a utility method (but then should be public).
Something else, what should be the result for inputs of different length? I would imagine the shortest should come first, but your code returns 0.
This is the solution I came up with:
import java.util.Comparator;
public class NaturalOrderComparator implements Comparator<String> {
public int compare(String o1, String o2) {
NaturalOrderComparableGroup o1Comparable = new NaturalOrderComparableGroup(o1);
NaturalOrderComparableGroup o2Comparable = new NaturalOrderComparableGroup(o2);
return o1Comparable.compareTo(o2Comparable);
}
}
import java.util.ArrayList;
import java.util.List;
import static java.lang.Character.isAlphabetic;
import static java.lang.Character.isDigit;
public class NaturalOrderComparableGroup implements Comparable<NaturalOrderComparableGroup> {
private final List<NaturalOrderComparableElement> elements;
public NaturalOrderComparableGroup(String string) {
List<NaturalOrderComparableElement> elements = new ArrayList<>();
int i = 0;
while (string.length() > i) {
Character character = string.charAt(i);
if (isAlphabetic(character)) {
elements.add(new AlphabeticComparableElement(character));
i++;
} else if (isDigit(character)) {
NumericComparableElement numericComparableElement = new NumericComparableElement(character);
i++;
while (string.length() >= i + 1 && isDigit(string.charAt(i))) {
numericComparableElement.add(string.charAt(i));
i++;
}
elements.add(numericComparableElement);
} else {
elements.add(new NonAlphanumericComparableElement(character));
i++;
}
}
this.elements = elements;
}
public int compareTo(NaturalOrderComparableGroup o) {
int i = 0;
while (!anyOfTheGroupsDoesntHaveEnoughElements(o, i)) {
if (elements.get(i).compareTo(o.elements.get(i)) != 0) {
return elements.get(i).compareTo(o.elements.get(i));
}
i++;
}
return shortest(o);
}
private boolean anyOfTheGroupsDoesntHaveEnoughElements(NaturalOrderComparableGroup o, int i) {
return elements.size() < i + 1 || o.elements.size() < i + 1;
}
private int shortest(NaturalOrderComparableGroup o) {
return ((Integer) elements.size()).compareTo(o.elements.size());
}
}
public interface NaturalOrderComparableElement<T extends NaturalOrderComparableElement> extends Comparable<NaturalOrderComparableElement> {
Integer getPriority();
@Override
@SuppressWarnings("unchecked")
default int compareTo(NaturalOrderComparableElement other) {
int typeComparisonResult = this.getPriority().compareTo(other.getPriority());
if (typeComparisonResult != 0) {
return typeComparisonResult;
} else {
// Safe check because at this point we know that other is an instance of this same class
return compareToInstanceOfSameType((T) other);
}
}
int compareToInstanceOfSameType(T other);
}
public class AlphabeticComparableElement implements NaturalOrderComparableElement<AlphabeticComparableElement> {
private static final Integer PRIORITY = 3;
private final Character character;
public AlphabeticComparableElement(Character character) {
this.character = character;
}
@Override
public Integer getPriority() {
return PRIORITY;
}
@Override
public int compareToInstanceOfSameType(AlphabeticComparableElement other) {
return this.character.compareTo(other.character);
}
}
public class NumericComparableElement implements NaturalOrderComparableElement<NumericComparableElement> {
private static final Integer PRIORITY = 2;
private Integer integer;
public NumericComparableElement(Character character) {
this.integer = Integer.valueOf(character.toString());
}
@Override
public Integer getPriority() {
return PRIORITY;
}
@Override
public int compareToInstanceOfSameType(NumericComparableElement other) {
return this.integer.compareTo(other.integer);
}
public void add(char c) {
this.integer = this.integer * 10 + new Integer(String.valueOf(c));
}
}
public class NonAlphanumericComparableElement implements NaturalOrderComparableElement<NonAlphanumericComparableElement> {
private static final Integer PRIORITY = 1;
private final Character character;
public NonAlphanumericComparableElement(Character character) {
this.character = character;
}
@Override
public Integer getPriority() {
return PRIORITY;
}
@Override
public int compareToInstanceOfSameType(NonAlphanumericComparableElement other) {
return this.character.compareTo(other.character);
}
}
-
\$\begingroup\$ Great idea, I hadn't thought about breaking this into multiple classes. As a minor note, I tested the coded provided and two test cases failed.
E00088A
should be beforeE088B
anda
should be beforeB
. \$\endgroup\$NonlinearFruit– NonlinearFruit2017年04月24日 20:25:51 +00:00Commented Apr 24, 2017 at 20:25 -
\$\begingroup\$ You're right, those two are bugs in my code. I've fixed the first one. The second one can be easily solved by changing
character
insideAlphabeticComparableElement
to type String and usingcompareToIgnoreCase
if we accept that capital letters have same priority as their lower cases, and we'd still be leveraging Java's native capabilities. Otherwise, we'd have to use ALPHA_ORDER to specify that custom ordering. Anyway, you should notice that by having broken the problem into different classes we can fix each bug by modifying a single class. \$\endgroup\$lmrn19– lmrn192017年04月25日 09:07:25 +00:00Commented Apr 25, 2017 at 9:07
bob
theStringBuilder
. Love it! \$\endgroup\$RuleBasedCollator
class. As I understand it, you pass it a string that has the proper ordering rules encoded in it. Then you this class to make string comparisons. My two hesitations are: I'm not sure how to encode this and I'm not sure that this encoded string would be very readable. \$\endgroup\$