Function to get top k words in a given String

Question 1

This is a two-part job:

to identify a word
to find the top k words

I've tried to use regex to split but it's taking 10x more time when string is really long. The top k strings are found using a priority queue. I've created a min priority queue for first k element, if the frequency of the next word is greater than the min element. I then remove min and insert the new word.

 public static List<String> getTopStrings(String input, int k)
 {
 if(k<0)
 throw new IllegalArgumentException("Value k cannot be negative for top k elements: k= "+k);
 List<String> list = new ArrayList<String>();
 if(input==null || input.isEmpty() || k==0)
 {
 return list;
 }
 Set<Character> separators = new HashSet<Character>();
 separators.add(' ');
 separators.add(',');
 separators.add('.');
 separators.add(';');
 separators.add(':');
 separators.add('"');
 separators.add('(');
 separators.add(')');
 separators.add('-');
 separators.add('/');
 separators.add('\\');
 separators.add('\'');
 separators.add('?');
 separators.add('\n');
 separators.add('\r');
 separators.add('!');
 separators.add('|');
 separators.add('~');
 separators.add('\'');
 separators.add('[');
 separators.add(']');
 separators.add('{');
 separators.add('}');
 separators.add('&');
 separators.add('%');
 separators.add('$');
 separators.add('#');
 separators.add('@');
 separators.add('*');
 separators.add('=');
 separators.add('+');
 separators.add('>');
 separators.add('<');
 final Map<String, Integer> wordMap = new HashMap<String, Integer>();
 int count, wordStart=0;
 for(int i=0;i<input.length();i++)
 {
 if(separators.contains(input.charAt(i)))
 {
 addWord(input, wordMap,wordStart,i);
 wordStart=i+1;
 }
 }
 addWord(input,wordMap,wordStart,input.length());
 List<String> keySet = new ArrayList<String>();
 keySet.addAll(wordMap.keySet());
 if(keySet.size()<=k){
 return keySet;
 }
 PriorityQueue<String> minHeap = new PriorityQueue<String>(k, new Comparator<String>() {
 @Override
 public int compare(String o1, String o2) {
 return (wordMap.get(o1)-wordMap.get(o2));
 }
 });
 for(int i=0;i<k;i++)
 {
 minHeap.add(keySet.get(i));
 }
 for(int i=k;i<keySet.size();i++)
 {
 String key = keySet.get(i);
 count = wordMap.get(key);
 if(count>wordMap.get(minHeap.peek()))
 {
 minHeap.poll();
 minHeap.add(key);
 }
 }
 list.addAll(minHeap);
 return list;
 }
 public static void addWord(String input, Map<String, Integer> wordMap, int startIndex, int endIndex)
 {
 int count = 0;
 String word = input.substring(startIndex, endIndex).toLowerCase();
 if(word!=null && !word.isEmpty())
 {
 if(wordMap.containsKey(word))
 {
 count = wordMap.get(word);
 }
 wordMap.put(word, count+1);
 }
 }

Question 2

Initialize Set with values

Your method of initializing the hashset with values takes quite a bit of space. I would create a static field:

public static final String[] SEPARATORS = new String[] { " ", ",", ".", ";" };

And then use it like this:

public static final Set<String> separatorsSet = new HashSet<String>(Arrays.asList(SEPARATORS));

Extract code to methods

Your getTopStrings method currently does three things: extract words from string, count amount of duplicate words, sort the result of that by frequency. I would at least create separate methods for the first two and the third functionality.

Simplify Sorting

Your sorting mechanism using a queue seems overly complicated. If you are using Java 8, it could be reduced to three lines (or one really long one):

 // sort by value, reversed
 Stream<Entry<String, Integer>> sorted = wordMap.entrySet().stream().sorted(Map.Entry.comparingByValue((x,y) -> (y < x) ? -1 : ((x == y) ? 0 : 1))); // alternatively: use reversed and cast
 // flatten map to list:
 List<String> collected = sorted.map(entry -> entry.getKey()).collect(Collectors.toCollection(ArrayList::new));
 // top k:
 List<String> topK = collected.subList(0, k);

Misc

in Java, it is customary to place curly brackets on the same line as the opening statements.
use curly brackets even for one line statements for improved readability and to avoid future bugs.
use more spaces, eg around ==, <, +, after ;, etc for increased readability.
declare variables in as small a scope as possible to increase readability.

Question 3

The reason the OP is using a priority queue is to keep the running time limited to n * log(k). Sorting the entire list and picking the first k element will run in n * log(n). Seeing how they've emphasized that the strings can be very long, it's safe to assume n can be large enough to make a significant difference.

Question 4

What's this about comparingByValue((x,y) -> y - x)? It's ugly, it's buggy and it's wrong. What's wrong with Comparator.reversed? Although beware of this bug.

Question 5

@BoristheSpider how is it buggy? And reversed did not work for me because of incompatible types. But with a cast as mentioned in your link it works. thanks.

Question 6

@tim Integer overflow.

Question 7

Can it be done in O(n) time ?

Question 8

Well the disadvantage of regex is, that if written wrong you'll have a bad time. The next disadvantage of regex is, you will have to keep a large String in memory.

But regex has two strong advantages:

There is character-classes. You can use these to define what you want to recognize as word.
Java Matchers are inherently "iterative".

This makes it rather simple to create a short method that gets all words from a String:
_{(sidenote: I'm using a strong oversimplification for the Pattern. For more information on Java Regexes, read the "manual")}

private static final Pattern WORD_PATTERN = Pattern.compile("(\\w++)", Pattern.MULTILINE);
//method header
 Matcher words = WORD_PATTERN.matcher(input);
 while (words.find()) {
 final String nextWord = words.group();
 // do something with your found word
 }
}

Now since that is a simple way to grab all "words" in a String the only thing left is to count how often a certain word appeared, but you seem to have already accomplished that, so I'll leave that in your hands ;)

Question 9

I agree. But, for really long string, its taking too much time. (10x)

tim tim 25.3k3 gold badges31 silver badges76 bronze badges · Accepted Answer · 2015-02-22 11:45:59Z

Initialize Set with values

Your method of initializing the hashset with values takes quite a bit of space. I would create a static field:

public static final String[] SEPARATORS = new String[] { " ", ",", ".", ";" };

And then use it like this:

public static final Set<String> separatorsSet = new HashSet<String>(Arrays.asList(SEPARATORS));

Extract code to methods

Your getTopStrings method currently does three things: extract words from string, count amount of duplicate words, sort the result of that by frequency. I would at least create separate methods for the first two and the third functionality.

Simplify Sorting

Your sorting mechanism using a queue seems overly complicated. If you are using Java 8, it could be reduced to three lines (or one really long one):

 // sort by value, reversed
 Stream<Entry<String, Integer>> sorted = wordMap.entrySet().stream().sorted(Map.Entry.comparingByValue((x,y) -> (y < x) ? -1 : ((x == y) ? 0 : 1))); // alternatively: use reversed and cast
 // flatten map to list:
 List<String> collected = sorted.map(entry -> entry.getKey()).collect(Collectors.toCollection(ArrayList::new));
 // top k:
 List<String> topK = collected.subList(0, k);

Misc

in Java, it is customary to place curly brackets on the same line as the opening statements.
use curly brackets even for one line statements for improved readability and to avoid future bugs.
use more spaces, eg around ==, <, +, after ;, etc for increased readability.
declare variables in as small a scope as possible to increase readability.

The reason the OP is using a priority queue is to keep the running time limited to n * log(k). Sorting the entire list and picking the first k element will run in n * log(n). Seeing how they've emphasized that the strings can be very long, it's safe to assume n can be large enough to make a significant difference.
What's this about comparingByValue((x,y) -> y - x)? It's ugly, it's buggy and it's wrong. What's wrong with Comparator.reversed? Although beware of this bug.
@BoristheSpider how is it buggy? And reversed did not work for me because of incompatible types. But with a cast as mentioned in your link it works. thanks.

Stack Exchange Network

Function to get top k words in a given String

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Function to get top k words in a given String

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions