This program takes an arbitrary array of strings and group anagrams into arrays and return them in a array. Any string without any anagrams is still put in an array.
This code works perfectly and I would appreciate any comments on making this more efficient in terms of computation time.
import java.util.Arrays;
import java.util.HashMap;
import java.util.ArrayList;
public class AnagramSort{
public static void main( String[] args ){
HashMap< Integer, ArrayList< String >> hm = new HashMap();
groupAnagrams( args, hm );
System.out.println( hm );
}
public static void groupAnagrams( String[] list, HashMap< Integer, ArrayList< String >> hm ){
for( int x=0; x<list.length; x++ ){
if( list[ x ] == null ) continue;
String curX = list[ x ];
int hashX = primeHash( curX );
hm.put( hashX, new ArrayList( Arrays.asList( curX )));
for( int y=x+1; y<list.length; y++ ){
String curY = list[ y ];
int hashY = primeHash( curY );
if( curY == null || curY.length() != curX.length()) continue;
if( hashX == hashY ){
hm.get( hashX ).add( curY );
list[ y ] = null; // if its an anagram null it out to avoid checking again
}
}
}
}
// Utility Mehthods
public static int primeHash( String word ){
int productOfPrimes = 1;
int prime[] = { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31,
37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101 };
for( char ch : word.toCharArray() ){
productOfPrimes *= prime[ (int) ch - (int) 'a' ];
}
return productOfPrimes;
}
}
Sample Input:
[ mother, mothre, dad, add, gift, gender ]
Output:
[ [mother, mothre], [dad,add], [gift], [gender] ]
2 Answers 2
Technical
groupAnagrams( args );
That should be:
groupAnagrams( args, hm );
Bug in copy/paste somehow?
1-liner statements, even simple ones like this:
if( list[ x ] == null ) continue;
should be braced like:
if( list[ x ] == null ) {
continue;
}
This prevents maintenance problems later, and makes revision history easier to follow.
Algorithm
Your algorithm is taking each member of the array, and comparing to each subsequent member, removing anagram matches.
The primeHash()
method is interesting, but ultimately it is a red herring of sorts... and it only works for lower-case input words. You have obviously invested some thought in to it, but there's a simpler solution to that problem:
private static final String anagramKey(String word) {
word = word.toLowerCase();
char[] chars = word.toCharArray();
Arrays.sort(chars);
return new String(chars);
}
Take all letters, sort them, return a String. All anagrams of the same letters will have the same keys.
With that key system, the basic code can be come simpler with:
HashMap<String,List<String>> matchMap = new HashMap<>();
for (String word : args) {
String key = anagramKey(word);
if (!matchMap.containsKey(key)) {
matchMap.put(key, new ArrayList<String>());
}
matchMap.get(key).add(word);
}
System.out.println(matchMap);
That reduces your problem to an \$O(n)\$ one.
-
\$\begingroup\$ Thanks for the answer. I used the code you suggested and it works really well. However, I realised that the
primeHash
method I used has a better worse case than sorting(merge sort in this case), O(n) So I combinedprimeHash
with your loop construct and had a really good run time. \$\endgroup\$SamAko– SamAko2015年02月18日 15:25:13 +00:00Commented Feb 18, 2015 at 15:25 -
\$\begingroup\$ @rolfl, also, it is a good idea to avoid using Arrays.sort since sorting is at least O(nlogn) time complexity, it is comparison-based sorting. We can take advantage of sorting using counting sort, alphabetic number size is 26 only. So hash function anagramKey(string word) time complexity can be lowered to O(n) algorithm. \$\endgroup\$Jianmin Chen– Jianmin Chen2017年02月05日 00:18:47 +00:00Commented Feb 5, 2017 at 0:18
-
\$\begingroup\$ @JianminChen - for sorting a small array of char, I don't think the complexity/work required to build up your dataset would be worth it. You may have a smaller \$O\$ complexity, but I would imagine that the additional data structures you need to create, and the additional computations required for the hash, would be significantly more than \$log(n)\$ of the sort check. In some cases your advice may be significantly useful, but I don't think it's worth it in this case. Note that I have defaulted to code simplicity in this answer, not raw performance too. ` \$\endgroup\$rolfl– rolfl2017年02月05日 02:27:42 +00:00Commented Feb 5, 2017 at 2:27
-
\$\begingroup\$ @rolfl, I read your statement. I just posted the similar question and also answered the similar question, please help me as well. The links are here. codereview.stackexchange.com/questions/154471/… \$\endgroup\$Jianmin Chen– Jianmin Chen2017年02月05日 02:32:00 +00:00Commented Feb 5, 2017 at 2:32
-
\$\begingroup\$ codereview.stackexchange.com/a/154458/123986 \$\endgroup\$Jianmin Chen– Jianmin Chen2017年02月05日 02:33:57 +00:00Commented Feb 5, 2017 at 2:33
Just a few minor things on top of @rolfl's answer.
Don't use raw types
Using raw types, like in new HashMap()
is a bad practice:
HashMap< Integer, ArrayList< String >> hm = new HashMap();
In Java6 you should write this as:
HashMap< Integer, ArrayList< String >> hm = new HashMap<Integer, ArrayList<String>>();
In Java7 and above you can use the diamond operator <>
to write simpler:
HashMap< Integer, ArrayList< String >> hm = new HashMap<>();
Use interface types when possible
Whenever appropriate, use interface types instead of implementations. For example given this code:
HashMap< Integer, ArrayList< String >> hm = new HashMap<>();
The hm
variable (poorly named, btw),
doesn't really need to be a HashMap
.
Your algorithm will work just fine with a TreeMap
too,
or WhatEverTheHeckMap
too,
as long as it's a Map
.
So use a Map
:
Map<Integer, List<String>> hm = new HashMap<>();
Note that I also changed ArrayList
to List
, for the same reason.
Do this way everywhere, for example in the groupAnagrams
method too.
Code organization
Why pass a Map
to the void
method groupAnagrams
?
Why not pass only the input parameters and make it return a Collection
(or List
) of the results?
One big reason to do this is that callers of the groupAnagrams
shouldn't need a Map
at all.
The fact that groupAnagrams
uses a Map
in its algorithm is just an implementation detail.
The caller really needs only a collection (or list) of the results.
Unit testing
To verify that your algorithm actually works, it's good to have unit tests.
@Test
public void test_mother_mothre_dad_add_gift_gender() {
Map<Integer, List<String>> map = new HashMap<>();
AnagramSort.groupAnagrams(new String[]{"mother", "mothre", "dad", "add", "gift", "gender"}, map);
assertEquals("[[gender], [dad, add], [gift], [mother, mothre]]", map.values().toString());
}
Once you have this (and hopefully more) test cases, you can refactor your algorithm like @rolfl suggested, and when your done you can simply re-run the tests with one simple click, and you'll know immediately if the implementation works or not.
Readability
Your coding style is very different from the way an IDE like Eclipse/IntelliJ would auto-format the code, for example instead of:
public static void main( String[] args ){ HashMap< Integer, ArrayList< String >> hm = new HashMap(); groupAnagrams( args, hm ); System.out.println( hm ); }
Like this:
public static void main(String[] args) {
HashMap<Integer, ArrayList<String>> hm = new HashMap();
groupAnagrams(args, hm);
System.out.println(hm);
}
Once you adopt this style, life gets simpler. If somebody gives you code in a different format, you can just use your IDE to reformat it to a familiar style.
Other minor things
You don't need to cast the char
values to (int)
:
productOfPrimes *= prime[(int) ch - (int) 'a'];
This works too and simpler:
productOfPrimes *= prime[ch - 'a'];
-
\$\begingroup\$ As there is only one output value, returning it instead of modifying an output parameter might seem more intuitive. \$\endgroup\$Adrian Leonhard– Adrian Leonhard2015年02月16日 21:19:51 +00:00Commented Feb 16, 2015 at 21:19
-
\$\begingroup\$ @AdrianLeonhard That was also my point in the "Code organization" section above \$\endgroup\$janos– janos2015年02月16日 21:39:34 +00:00Commented Feb 16, 2015 at 21:39
-
1\$\begingroup\$ That's a very elaborate answer :) I couldn't have asked for better. Thanks. For my formating choices, I do that because I feel it makes the code more readable. Not sure how harmful that it is though! \$\endgroup\$SamAko– SamAko2015年02月18日 15:33:59 +00:00Commented Feb 18, 2015 at 15:33