This is a two-part job:
- to identify a word
- to find the top
k
words
I've tried to use regex to split but it's taking 10x more time when string is really long. The top k
strings are found using a priority queue. I've created a min
priority queue for first k
element, if the frequency of the next word is greater than the min
element. I then remove min
and insert the new word.
public static List<String> getTopStrings(String input, int k)
{
if(k<0)
throw new IllegalArgumentException("Value k cannot be negative for top k elements: k= "+k);
List<String> list = new ArrayList<String>();
if(input==null || input.isEmpty() || k==0)
{
return list;
}
Set<Character> separators = new HashSet<Character>();
separators.add(' ');
separators.add(',');
separators.add('.');
separators.add(';');
separators.add(':');
separators.add('"');
separators.add('(');
separators.add(')');
separators.add('-');
separators.add('/');
separators.add('\\');
separators.add('\'');
separators.add('?');
separators.add('\n');
separators.add('\r');
separators.add('!');
separators.add('|');
separators.add('~');
separators.add('\'');
separators.add('[');
separators.add(']');
separators.add('{');
separators.add('}');
separators.add('&');
separators.add('%');
separators.add('$');
separators.add('#');
separators.add('@');
separators.add('*');
separators.add('=');
separators.add('+');
separators.add('>');
separators.add('<');
final Map<String, Integer> wordMap = new HashMap<String, Integer>();
int count, wordStart=0;
for(int i=0;i<input.length();i++)
{
if(separators.contains(input.charAt(i)))
{
addWord(input, wordMap,wordStart,i);
wordStart=i+1;
}
}
addWord(input,wordMap,wordStart,input.length());
List<String> keySet = new ArrayList<String>();
keySet.addAll(wordMap.keySet());
if(keySet.size()<=k){
return keySet;
}
PriorityQueue<String> minHeap = new PriorityQueue<String>(k, new Comparator<String>() {
@Override
public int compare(String o1, String o2) {
return (wordMap.get(o1)-wordMap.get(o2));
}
});
for(int i=0;i<k;i++)
{
minHeap.add(keySet.get(i));
}
for(int i=k;i<keySet.size();i++)
{
String key = keySet.get(i);
count = wordMap.get(key);
if(count>wordMap.get(minHeap.peek()))
{
minHeap.poll();
minHeap.add(key);
}
}
list.addAll(minHeap);
return list;
}
public static void addWord(String input, Map<String, Integer> wordMap, int startIndex, int endIndex)
{
int count = 0;
String word = input.substring(startIndex, endIndex).toLowerCase();
if(word!=null && !word.isEmpty())
{
if(wordMap.containsKey(word))
{
count = wordMap.get(word);
}
wordMap.put(word, count+1);
}
}
2 Answers 2
Initialize Set with values
Your method of initializing the hashset with values takes quite a bit of space. I would create a static field:
public static final String[] SEPARATORS = new String[] { " ", ",", ".", ";" };
And then use it like this:
public static final Set<String> separatorsSet = new HashSet<String>(Arrays.asList(SEPARATORS));
Extract code to methods
Your getTopStrings
method currently does three things: extract words from string, count amount of duplicate words, sort the result of that by frequency. I would at least create separate methods for the first two and the third functionality.
Simplify Sorting
Your sorting mechanism using a queue seems overly complicated. If you are using Java 8, it could be reduced to three lines (or one really long one):
// sort by value, reversed
Stream<Entry<String, Integer>> sorted = wordMap.entrySet().stream().sorted(Map.Entry.comparingByValue((x,y) -> (y < x) ? -1 : ((x == y) ? 0 : 1))); // alternatively: use reversed and cast
// flatten map to list:
List<String> collected = sorted.map(entry -> entry.getKey()).collect(Collectors.toCollection(ArrayList::new));
// top k:
List<String> topK = collected.subList(0, k);
Misc
- in Java, it is customary to place curly brackets on the same line as the opening statements.
- use curly brackets even for one line statements for improved readability and to avoid future bugs.
- use more spaces, eg around
==
,<
,+
, after;
, etc for increased readability. - declare variables in as small a scope as possible to increase readability.
-
\$\begingroup\$ The reason the OP is using a priority queue is to keep the running time limited to n * log(k). Sorting the entire list and picking the first k element will run in n * log(n). Seeing how they've emphasized that the strings can be very long, it's safe to assume n can be large enough to make a significant difference. \$\endgroup\$Mansour– Mansour2015年02月22日 12:01:48 +00:00Commented Feb 22, 2015 at 12:01
-
\$\begingroup\$ What's this about
comparingByValue((x,y) -> y - x)
? It's ugly, it's buggy and it's wrong. What's wrong withComparator.reversed
? Although beware of this bug. \$\endgroup\$Boris the Spider– Boris the Spider2015年02月22日 15:13:27 +00:00Commented Feb 22, 2015 at 15:13 -
\$\begingroup\$ @BoristheSpider how is it buggy? And reversed did not work for me because of incompatible types. But with a cast as mentioned in your link it works. thanks. \$\endgroup\$tim– tim2015年02月22日 15:27:39 +00:00Commented Feb 22, 2015 at 15:27
-
1\$\begingroup\$ @tim
Integer
overflow. \$\endgroup\$Boris the Spider– Boris the Spider2015年02月22日 15:28:38 +00:00Commented Feb 22, 2015 at 15:28 -
\$\begingroup\$ Can it be done in O(n) time ? \$\endgroup\$Piyush Gupta– Piyush Gupta2015年02月23日 00:30:44 +00:00Commented Feb 23, 2015 at 0:30
Well the disadvantage of regex is, that if written wrong you'll have a bad time. The next disadvantage of regex is, you will have to keep a large String in memory.
But regex has two strong advantages:
- There is character-classes. You can use these to define what you want to recognize as word.
- Java Matchers are inherently "iterative".
This makes it rather simple to create a short method that gets all words from a String:
(sidenote: I'm using a strong oversimplification for the Pattern. For more information on Java Regexes, read the "manual")
private static final Pattern WORD_PATTERN = Pattern.compile("(\\w++)", Pattern.MULTILINE);
//method header
Matcher words = WORD_PATTERN.matcher(input);
while (words.find()) {
final String nextWord = words.group();
// do something with your found word
}
}
Now since that is a simple way to grab all "words" in a String the only thing left is to count how often a certain word appeared, but you seem to have already accomplished that, so I'll leave that in your hands ;)
-
\$\begingroup\$ I agree. But, for really long string, its taking too much time. (10x) \$\endgroup\$Piyush Gupta– Piyush Gupta2015年02月23日 02:37:22 +00:00Commented Feb 23, 2015 at 2:37