6
\$\begingroup\$

This is a two-part job:

  1. to identify a word
  2. to find the top k words

I've tried to use regex to split but it's taking 10x more time when string is really long. The top k strings are found using a priority queue. I've created a min priority queue for first k element, if the frequency of the next word is greater than the min element. I then remove min and insert the new word.

 public static List<String> getTopStrings(String input, int k)
 {
 if(k<0)
 throw new IllegalArgumentException("Value k cannot be negative for top k elements: k= "+k);
 List<String> list = new ArrayList<String>();
 if(input==null || input.isEmpty() || k==0)
 {
 return list;
 }
 Set<Character> separators = new HashSet<Character>();
 separators.add(' ');
 separators.add(',');
 separators.add('.');
 separators.add(';');
 separators.add(':');
 separators.add('"');
 separators.add('(');
 separators.add(')');
 separators.add('-');
 separators.add('/');
 separators.add('\\');
 separators.add('\'');
 separators.add('?');
 separators.add('\n');
 separators.add('\r');
 separators.add('!');
 separators.add('|');
 separators.add('~');
 separators.add('\'');
 separators.add('[');
 separators.add(']');
 separators.add('{');
 separators.add('}');
 separators.add('&');
 separators.add('%');
 separators.add('$');
 separators.add('#');
 separators.add('@');
 separators.add('*');
 separators.add('=');
 separators.add('+');
 separators.add('>');
 separators.add('<');
 final Map<String, Integer> wordMap = new HashMap<String, Integer>();
 int count, wordStart=0;
 for(int i=0;i<input.length();i++)
 {
 if(separators.contains(input.charAt(i)))
 {
 addWord(input, wordMap,wordStart,i);
 wordStart=i+1;
 }
 }
 addWord(input,wordMap,wordStart,input.length());
 List<String> keySet = new ArrayList<String>();
 keySet.addAll(wordMap.keySet());
 if(keySet.size()<=k){
 return keySet;
 }
 PriorityQueue<String> minHeap = new PriorityQueue<String>(k, new Comparator<String>() {
 @Override
 public int compare(String o1, String o2) {
 return (wordMap.get(o1)-wordMap.get(o2));
 }
 });
 for(int i=0;i<k;i++)
 {
 minHeap.add(keySet.get(i));
 }
 for(int i=k;i<keySet.size();i++)
 {
 String key = keySet.get(i);
 count = wordMap.get(key);
 if(count>wordMap.get(minHeap.peek()))
 {
 minHeap.poll();
 minHeap.add(key);
 }
 }
 list.addAll(minHeap);
 return list;
 }
 public static void addWord(String input, Map<String, Integer> wordMap, int startIndex, int endIndex)
 {
 int count = 0;
 String word = input.substring(startIndex, endIndex).toLowerCase();
 if(word!=null && !word.isEmpty())
 {
 if(wordMap.containsKey(word))
 {
 count = wordMap.get(word);
 }
 wordMap.put(word, count+1);
 }
 }
Jamal
35.2k13 gold badges134 silver badges238 bronze badges
asked Feb 22, 2015 at 10:19
\$\endgroup\$

2 Answers 2

7
\$\begingroup\$

Initialize Set with values

Your method of initializing the hashset with values takes quite a bit of space. I would create a static field:

public static final String[] SEPARATORS = new String[] { " ", ",", ".", ";" };

And then use it like this:

public static final Set<String> separatorsSet = new HashSet<String>(Arrays.asList(SEPARATORS));

Extract code to methods

Your getTopStrings method currently does three things: extract words from string, count amount of duplicate words, sort the result of that by frequency. I would at least create separate methods for the first two and the third functionality.

Simplify Sorting

Your sorting mechanism using a queue seems overly complicated. If you are using Java 8, it could be reduced to three lines (or one really long one):

 // sort by value, reversed
 Stream<Entry<String, Integer>> sorted = wordMap.entrySet().stream().sorted(Map.Entry.comparingByValue((x,y) -> (y < x) ? -1 : ((x == y) ? 0 : 1))); // alternatively: use reversed and cast
 // flatten map to list:
 List<String> collected = sorted.map(entry -> entry.getKey()).collect(Collectors.toCollection(ArrayList::new));
 // top k:
 List<String> topK = collected.subList(0, k);

Misc

  • in Java, it is customary to place curly brackets on the same line as the opening statements.
  • use curly brackets even for one line statements for improved readability and to avoid future bugs.
  • use more spaces, eg around ==, <, +, after ;, etc for increased readability.
  • declare variables in as small a scope as possible to increase readability.
answered Feb 22, 2015 at 11:45
\$\endgroup\$
5
  • \$\begingroup\$ The reason the OP is using a priority queue is to keep the running time limited to n * log(k). Sorting the entire list and picking the first k element will run in n * log(n). Seeing how they've emphasized that the strings can be very long, it's safe to assume n can be large enough to make a significant difference. \$\endgroup\$ Commented Feb 22, 2015 at 12:01
  • \$\begingroup\$ What's this about comparingByValue((x,y) -> y - x)? It's ugly, it's buggy and it's wrong. What's wrong with Comparator.reversed? Although beware of this bug. \$\endgroup\$ Commented Feb 22, 2015 at 15:13
  • \$\begingroup\$ @BoristheSpider how is it buggy? And reversed did not work for me because of incompatible types. But with a cast as mentioned in your link it works. thanks. \$\endgroup\$ Commented Feb 22, 2015 at 15:27
  • 1
    \$\begingroup\$ @tim Integer overflow. \$\endgroup\$ Commented Feb 22, 2015 at 15:28
  • \$\begingroup\$ Can it be done in O(n) time ? \$\endgroup\$ Commented Feb 23, 2015 at 0:30
5
\$\begingroup\$

Well the disadvantage of regex is, that if written wrong you'll have a bad time. The next disadvantage of regex is, you will have to keep a large String in memory.

But regex has two strong advantages:

  1. There is character-classes. You can use these to define what you want to recognize as word.
  2. Java Matchers are inherently "iterative".

This makes it rather simple to create a short method that gets all words from a String:
(sidenote: I'm using a strong oversimplification for the Pattern. For more information on Java Regexes, read the "manual")

private static final Pattern WORD_PATTERN = Pattern.compile("(\\w++)", Pattern.MULTILINE);
//method header
 Matcher words = WORD_PATTERN.matcher(input);
 while (words.find()) {
 final String nextWord = words.group();
 // do something with your found word
 }
}

Now since that is a simple way to grab all "words" in a String the only thing left is to count how often a certain word appeared, but you seem to have already accomplished that, so I'll leave that in your hands ;)

answered Feb 22, 2015 at 20:30
\$\endgroup\$
1
  • \$\begingroup\$ I agree. But, for really long string, its taking too much time. (10x) \$\endgroup\$ Commented Feb 23, 2015 at 2:37

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.