Efficient top k words using MultiMap (Google Guava ext lib JAR)

Question 1

Basic algorithm:

Open document/huge text file
Use Map and List to separate the words. \$O(K)\$ time complexity (TC) considering \$K\$ distinct words.
If the word already exists in the list, increment the count of frequency, else add the new word. Constant TC.
Use a Collection to sort the list on the basis of frequency of usage. \$n*log(n)\$ TC.
Print the list, Entry.getKey (max count word), and its Entry.getValue (max count).

Now, I tried to use Multimap before sorting so that I can group same frequency count words. This will drastically decrease the sorting complexity as the number of elements to compare will decrease.

Is it efficient enough or are there other ways to make it more efficient?

//Below code works fine (Please include lib files as per your IDE setting)
 package Topk;
 //---import necessary lib---
 class Words
 {
 int count=1;
 String word;
 Words(int count, String word)
 {
 this.count = count;
 this.word = word;
 }
 }
 public class TopK 
 {
 public static void main(String argv[]) throws IOException
 {
 int Tcount = 0;
 Map<String,Integer> map = new TreeMap<String,Integer>();
 BufferedReader br = new BufferedReader(new FileReader("text.txt"));
 String line;
 while((line = br.readLine())!=null)
 {
 String[] words = line.split("\\W");
 for(String word:words)
 {
 word = word.toLowerCase();
 Tcount++;
 if(word.equals(""))
 continue;
 insert(word,map);
 }
 } 
 Multimap<Integer, String> mm = ArrayListMultimap.create();
 Iterator<String> itr = map.keySet().iterator();
 while(itr.hasNext())
 {
 String key = (String) itr.next();
 int tempi = map.get(key);
 String temps = key;
 mm.put(tempi,temps);
 }
 Map<String,Integer> fmap = new HashMap<String,Integer>();
 Set<Integer> keys = mm.keySet();
 for(int i : keys)
 {
 int value = i;
 String temps = (mm.get(i).toString());
 fmap.put(temps,value);
 }
 List<Entry<String, Integer>> wordList = sorting(fmap);
 display(wordList,Tcount);
 br.close();
 }
 // INSERTING INTO THE MAP////////////////////////////////////////////////////////////////////////////////////////////////
 private static void insert(String word, Map<String, Integer> map) 
 {
 if(map.containsKey(word))
 {
 int temp = map.get(word);
 temp++;
 map.put(word, temp);
 }
 else
 map.put(word, 1);
 }
 // SORTING METHOD/////////////////////////////////////////////////////////////////////////////////////////////////////////
 private static List<Entry<String, Integer>> sorting(Map<String, Integer> fmap) 
 {
 Set<Entry<String, Integer>> wordSet = fmap.entrySet();
 List<Entry<String, Integer>> wordList = new ArrayList<Entry<String, Integer>>(wordSet);
 Collections.sort(
 wordList, 
 new Comparator<Map.Entry<String, Integer>>()
 {
 public int compare( Map.Entry<String, Integer> o1, Map.Entry<String, Integer> o2 )
 {
 return (o2.getValue()).compareTo( o1.getValue() );
 }
 }
 ); 
 return wordList;
 }
 // DISPLAY METHOD/////////////////////////////////////////////////////////////////////////////////////////////////////////
 private static void display(List<Entry<String, Integer>> wordList, int tcount) 
 {
 // Display all the words & count ---------------------------------------------------------------------------------------
 for(Map.Entry<String, Integer> entry:wordList) // for every word search the frequency
 {
 System.out.println(entry.getValue()+": "+entry.getKey());
 }
 // Top frequently used word --------------------------------------------------------------------------------------------
 Entry<String, Integer> max = wordList.get(0);
 System.out.println("-------------------------------------------------------------------------------------------");
 System.out.println("Total words : "+tcount); 
 System.out.println("Maximum frueqncy word - "+ max.getKey()+" : "+max.getValue()+" times.");
 System.out.println("-------------------------------------------------------------------------------------------");
 }
 }

Question 2

Welcome to CR! Are you on Java 8?

Question 3

Thank you! If you mean Java 8 for coding - yes I have multiple version and if you mean Java 8 chat room - Not yet

Question 4

Yeah, I'm referring to Java 8 for coding... :)

Question 5

Yes, I wrote this code in "Java 8 Update 45", Did you asked coz of Guava lib? its compatible with 7 also

Question 6

I asked because some of the statements you wrote can be implemented in a more concise way using Java 8...

Question 7

Java/Java 8 tips

Words is an unused class, and even if it is in used, class naming convention recommends the singular form rather than the plural. This is because you will have a Word instance or many Word instances, but a Words instance doesn't sound as right. The only pluralized class names that I can recall off-hand are for utilities classes, i.e. those that only provide a bunch of static methods.

Since Java 7, the recommended approach for reading from a Reader source is try-with-resources, e.g.

try (BufferedReader reader = new BufferedReader(new FileReader("text.txt"))) {
 String line;
 while ((line = reader.readLine()) != null) {
 // ...
 }
}

In Java 8, there is an even more convenient way of specifying how you want to read and process lines from a text file, which I will explain below...

Your Guava-based approach

Read each line and split into words.

Store a running count as the variable Tcount.

For each non-empty word in words, insert the lowercase form into a map.

Insertion is based on checking if word exists in the map. If it does, increment the counter, else add word => 1 to the map.

Create a Multimap, mm.

Loop through the keys of map to add the count-per-word pairing to mm.

Create another HashMap, fmap.

Loop through mm to add the toString() representation of each value as the key for fmap, with the key being the value for fmap.

Sort fmap into a List<Entry<String, Integer>> data structure.

This is done by wrapping the entrySet() into a List, and applying Collections.sort() on the new List.

Display the result.

Tcount can be better renamed as totalCount or wordCount, following the camelCase naming convention for variables, and the unnecessary need to shorten variable names (unless they are really long).
Your implementation for step 3, i.e. the insert() method, has a simpler implementation in Java 8 using Map.merge():
```
map.merge(word, 1, (a, b) -> a + b);
```
What this means is to either add word => 1 to the map if it does not exist, or use the BiFunction lambda declaration to add the existing and new (i.e. 1) values.
Steps 4 to 7 are... quite lengthy and you can rely on some of Guava's built-in methods to perform the same operations too:
```
Multimap<Integer, String> inverseMap = Multimaps.invertFrom(Multimaps.forMap(map),
 ArrayListMultimap.create());
```
- You can use Multimaps.forMap(Map) to give you a Multimap wrapper over the original Map so that it can be used as an argument for...
- Multimaps.invertFrom, inverting the key => value mapping of a Multimap into a resulting Multimap instance, which you can declare as ArrayListMultimap.create().
Step 8 is quite an odd step, partly because I feel you have deconstructed a Map semantic as a sorted List of Entry instances. This definitely can be implemented in the form of a TreeMap, which allows you to sort on the keys while giving you additional flexibility of retrieving using Map-friendly methods. After sorting the keys, you can simply use the largest key to get() from your Multimap as your result, instead of having to deal with a List<Entry> data structure.

Putting together, steps 4 to 8 can probably be done as such:

private static <K, V extends Comparable<V>> Collection<K> getMaximumByValue(
 Map<K, V> map) {
 Multimap<V, K> inverseMap = Multimaps.invertFrom(Multimaps.forMap(map),
 ArrayListMultimap.create());
 return inverseMap.get(Ordering.natural().immutableSortedCopy(inverseMap.keys())
 .reverse().get(0));
}

A non-Guava-based approach

With Java 8, one can also learn to apply the new Stream-based processing techniques that can simplify most of the explicit looping.

For example, to read a file in as a Stream of lines using Files.lines(Path):

try (Stream<String> lines = Files.lines(Paths.get(path))) {
 // perform further operations on lines, e.g. lines.collect()...
}

Processing each line into your map, i.e. step 3 above, can be done by:

flapMap()-ing every line with a replacement of Stream of words,
- The method reference to use here will be Pattern::splitAsStream.
filter()-ing for non-empty Strings, before performing a

collect(), groupingBy(), and then counting() on the word occurrences:

private static final Pattern LINE_SPLIITER = Pattern.compile("\\W");
private static Map<String, Long> process(Stream<String> lines) {
 return lines.flatMap(LINE_SPLIITER::splitAsStream).filter(s -> !s.isEmpty())
 .collect(Collectors.groupingBy(String::toLowerCase, Collectors.counting()));
}

The equivalent of using a Guava Multimap so that the values can be sorted is to apply another form of groupingBy() on the resulting Map from above:

private static <K, V extends Comparable<V>> Collection<K> getMaximumByValue(
 Map<K, V> map) {
 return map.entrySet().stream()
 .collect(Collectors.groupingBy(Entry::getValue, TreeMap::new,
 Collectors.mapping(Entry::getKey, Collectors.toList())))
 .lastEntry().getValue();
}

This time round, we are grouping by using Entry::getValue (method reference) as the key, and then specifying the use of a TreeMap as our backing Map implementation. Collectors.mapping() is finally used to again specify how we want to collate duplicate keys together as the new Map values.

Now that we have a TreeMap instance, we only need to call lastEntry() to get our result.

Putting it all together

public static void main(String[] args) throws IOException {
 // will be good to validate args[0]
 try (Stream<String> lines = Files.lines(Paths.get(args[0])) {
 Map<String, Long> map = process(lines);
 System.out.println("Word count: "
 + map.values().stream().mapToLong(Long::longValue).sum());
 System.out.println(getMaximumByValue(map));
 }
}

Quick explanation on getting the word count: Convert the Stream<Long> from map.values().stream() into a LongStream by calling mapToLong(). Then, we can use the sum() method to get the count.

h.j.k. h.j.k. 19.3k3 gold badges37 silver badges93 bronze badges · Accepted Answer · 2015-08-23 16:10:49Z

Java/Java 8 tips

Words is an unused class, and even if it is in used, class naming convention recommends the singular form rather than the plural. This is because you will have a Word instance or many Word instances, but a Words instance doesn't sound as right. The only pluralized class names that I can recall off-hand are for utilities classes, i.e. those that only provide a bunch of static methods.

Since Java 7, the recommended approach for reading from a Reader source is try-with-resources, e.g.

try (BufferedReader reader = new BufferedReader(new FileReader("text.txt"))) {
 String line;
 while ((line = reader.readLine()) != null) {
 // ...
 }
}

In Java 8, there is an even more convenient way of specifying how you want to read and process lines from a text file, which I will explain below...

Your Guava-based approach

Read each line and split into words.

Store a running count as the variable Tcount.

For each non-empty word in words, insert the lowercase form into a map.

Insertion is based on checking if word exists in the map. If it does, increment the counter, else add word => 1 to the map.

Create a Multimap, mm.

Loop through the keys of map to add the count-per-word pairing to mm.

Create another HashMap, fmap.

Loop through mm to add the toString() representation of each value as the key for fmap, with the key being the value for fmap.

Sort fmap into a List<Entry<String, Integer>> data structure.

This is done by wrapping the entrySet() into a List, and applying Collections.sort() on the new List.

Display the result.

Tcount can be better renamed as totalCount or wordCount, following the camelCase naming convention for variables, and the unnecessary need to shorten variable names (unless they are really long).
Your implementation for step 3, i.e. the insert() method, has a simpler implementation in Java 8 using Map.merge():
```
map.merge(word, 1, (a, b) -> a + b);
```
What this means is to either add word => 1 to the map if it does not exist, or use the BiFunction lambda declaration to add the existing and new (i.e. 1) values.
Steps 4 to 7 are... quite lengthy and you can rely on some of Guava's built-in methods to perform the same operations too:
```
Multimap<Integer, String> inverseMap = Multimaps.invertFrom(Multimaps.forMap(map),
 ArrayListMultimap.create());
```
- You can use Multimaps.forMap(Map) to give you a Multimap wrapper over the original Map so that it can be used as an argument for...
- Multimaps.invertFrom, inverting the key => value mapping of a Multimap into a resulting Multimap instance, which you can declare as ArrayListMultimap.create().
Step 8 is quite an odd step, partly because I feel you have deconstructed a Map semantic as a sorted List of Entry instances. This definitely can be implemented in the form of a TreeMap, which allows you to sort on the keys while giving you additional flexibility of retrieving using Map-friendly methods. After sorting the keys, you can simply use the largest key to get() from your Multimap as your result, instead of having to deal with a List<Entry> data structure.

Putting together, steps 4 to 8 can probably be done as such:

private static <K, V extends Comparable<V>> Collection<K> getMaximumByValue(
 Map<K, V> map) {
 Multimap<V, K> inverseMap = Multimaps.invertFrom(Multimaps.forMap(map),
 ArrayListMultimap.create());
 return inverseMap.get(Ordering.natural().immutableSortedCopy(inverseMap.keys())
 .reverse().get(0));
}

A non-Guava-based approach

With Java 8, one can also learn to apply the new Stream-based processing techniques that can simplify most of the explicit looping.

For example, to read a file in as a Stream of lines using Files.lines(Path):

try (Stream<String> lines = Files.lines(Paths.get(path))) {
 // perform further operations on lines, e.g. lines.collect()...
}

Processing each line into your map, i.e. step 3 above, can be done by:

flapMap()-ing every line with a replacement of Stream of words,
- The method reference to use here will be Pattern::splitAsStream.
filter()-ing for non-empty Strings, before performing a

collect(), groupingBy(), and then counting() on the word occurrences:

private static final Pattern LINE_SPLIITER = Pattern.compile("\\W");
private static Map<String, Long> process(Stream<String> lines) {
 return lines.flatMap(LINE_SPLIITER::splitAsStream).filter(s -> !s.isEmpty())
 .collect(Collectors.groupingBy(String::toLowerCase, Collectors.counting()));
}

The equivalent of using a Guava Multimap so that the values can be sorted is to apply another form of groupingBy() on the resulting Map from above:

private static <K, V extends Comparable<V>> Collection<K> getMaximumByValue(
 Map<K, V> map) {
 return map.entrySet().stream()
 .collect(Collectors.groupingBy(Entry::getValue, TreeMap::new,
 Collectors.mapping(Entry::getKey, Collectors.toList())))
 .lastEntry().getValue();
}

This time round, we are grouping by using Entry::getValue (method reference) as the key, and then specifying the use of a TreeMap as our backing Map implementation. Collectors.mapping() is finally used to again specify how we want to collate duplicate keys together as the new Map values.

Now that we have a TreeMap instance, we only need to call lastEntry() to get our result.

Putting it all together

public static void main(String[] args) throws IOException {
 // will be good to validate args[0]
 try (Stream<String> lines = Files.lines(Paths.get(args[0])) {
 Map<String, Long> map = process(lines);
 System.out.println("Word count: "
 + map.values().stream().mapToLong(Long::longValue).sum());
 System.out.println(getMaximumByValue(map));
 }
}

Quick explanation on getting the word count: Convert the Stream<Long> from map.values().stream() into a LongStream by calling mapToLong(). Then, we can use the sum() method to get the count.

Stack Exchange Network

Efficient top k words using MultiMap (Google Guava ext lib JAR)

1 Answer 1

Java/Java 8 tips

Your Guava-based approach

A non-Guava-based approach

Putting it all together

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Efficient top k words using MultiMap (Google Guava ext lib JAR)

1 Answer 1

Java/Java 8 tips

Your Guava-based approach

A non-Guava-based approach

Putting it all together

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions