Basic algorithm:
- Open document/huge text file
- Use
Map
andList
to separate the words. \$O(K)\$ time complexity (TC) considering \$K\$ distinct words. - If the word already exists in the list, increment the count of frequency, else add the new word. Constant TC.
- Use a
Collection
to sort the list on the basis of frequency of usage. \$n*log(n)\$ TC. - Print the list,
Entry.getKey
(max count word), and itsEntry.getValue
(max count).
Now, I tried to use Multimap
before sorting so that I can group same frequency count words. This will drastically decrease the sorting complexity as the number of elements to compare will decrease.
Is it efficient enough or are there other ways to make it more efficient?
//Below code works fine (Please include lib files as per your IDE setting)
package Topk;
//---import necessary lib---
class Words
{
int count=1;
String word;
Words(int count, String word)
{
this.count = count;
this.word = word;
}
}
public class TopK
{
public static void main(String argv[]) throws IOException
{
int Tcount = 0;
Map<String,Integer> map = new TreeMap<String,Integer>();
BufferedReader br = new BufferedReader(new FileReader("text.txt"));
String line;
while((line = br.readLine())!=null)
{
String[] words = line.split("\\W");
for(String word:words)
{
word = word.toLowerCase();
Tcount++;
if(word.equals(""))
continue;
insert(word,map);
}
}
Multimap<Integer, String> mm = ArrayListMultimap.create();
Iterator<String> itr = map.keySet().iterator();
while(itr.hasNext())
{
String key = (String) itr.next();
int tempi = map.get(key);
String temps = key;
mm.put(tempi,temps);
}
Map<String,Integer> fmap = new HashMap<String,Integer>();
Set<Integer> keys = mm.keySet();
for(int i : keys)
{
int value = i;
String temps = (mm.get(i).toString());
fmap.put(temps,value);
}
List<Entry<String, Integer>> wordList = sorting(fmap);
display(wordList,Tcount);
br.close();
}
// INSERTING INTO THE MAP////////////////////////////////////////////////////////////////////////////////////////////////
private static void insert(String word, Map<String, Integer> map)
{
if(map.containsKey(word))
{
int temp = map.get(word);
temp++;
map.put(word, temp);
}
else
map.put(word, 1);
}
// SORTING METHOD/////////////////////////////////////////////////////////////////////////////////////////////////////////
private static List<Entry<String, Integer>> sorting(Map<String, Integer> fmap)
{
Set<Entry<String, Integer>> wordSet = fmap.entrySet();
List<Entry<String, Integer>> wordList = new ArrayList<Entry<String, Integer>>(wordSet);
Collections.sort(
wordList,
new Comparator<Map.Entry<String, Integer>>()
{
public int compare( Map.Entry<String, Integer> o1, Map.Entry<String, Integer> o2 )
{
return (o2.getValue()).compareTo( o1.getValue() );
}
}
);
return wordList;
}
// DISPLAY METHOD/////////////////////////////////////////////////////////////////////////////////////////////////////////
private static void display(List<Entry<String, Integer>> wordList, int tcount)
{
// Display all the words & count ---------------------------------------------------------------------------------------
for(Map.Entry<String, Integer> entry:wordList) // for every word search the frequency
{
System.out.println(entry.getValue()+": "+entry.getKey());
}
// Top frequently used word --------------------------------------------------------------------------------------------
Entry<String, Integer> max = wordList.get(0);
System.out.println("-------------------------------------------------------------------------------------------");
System.out.println("Total words : "+tcount);
System.out.println("Maximum frueqncy word - "+ max.getKey()+" : "+max.getValue()+" times.");
System.out.println("-------------------------------------------------------------------------------------------");
}
}
-
\$\begingroup\$ Welcome to CR! Are you on Java 8? \$\endgroup\$h.j.k.– h.j.k.2015年08月23日 08:15:55 +00:00Commented Aug 23, 2015 at 8:15
-
\$\begingroup\$ Thank you! If you mean Java 8 for coding - yes I have multiple version and if you mean Java 8 chat room - Not yet \$\endgroup\$JustVirtually– JustVirtually2015年08月23日 08:21:19 +00:00Commented Aug 23, 2015 at 8:21
-
\$\begingroup\$ Yeah, I'm referring to Java 8 for coding... :) \$\endgroup\$h.j.k.– h.j.k.2015年08月23日 08:36:11 +00:00Commented Aug 23, 2015 at 8:36
-
\$\begingroup\$ Yes, I wrote this code in "Java 8 Update 45", Did you asked coz of Guava lib? its compatible with 7 also \$\endgroup\$JustVirtually– JustVirtually2015年08月23日 08:47:31 +00:00Commented Aug 23, 2015 at 8:47
-
1\$\begingroup\$ I asked because some of the statements you wrote can be implemented in a more concise way using Java 8... \$\endgroup\$h.j.k.– h.j.k.2015年08月23日 08:51:46 +00:00Commented Aug 23, 2015 at 8:51
1 Answer 1
Java/Java 8 tips
Words
is an unused class, and even if it is in used, class naming convention recommends the singular form rather than the plural. This is because you will have a Word
instance or many Word
instances, but a Words
instance doesn't sound as right. The only pluralized class names that I can recall off-hand are for utilities classes, i.e. those that only provide a bunch of static
methods.
Since Java 7, the recommended approach for reading from a Reader
source is try-with-resources
, e.g.
try (BufferedReader reader = new BufferedReader(new FileReader("text.txt"))) {
String line;
while ((line = reader.readLine()) != null) {
// ...
}
}
In Java 8, there is an even more convenient way of specifying how you want to read and process lines from a text file, which I will explain below...
Your Guava-based approach
- Read each
line
and split intowords
.- Store a running count as the variable
Tcount
.- For each non-empty
word
inwords
, insert the lowercase form into amap
.
- Insertion is based on checking if
word
exists in themap
. If it does, increment the counter, else addword => 1
to themap
.- Create a
Multimap
,mm
.- Loop through the keys of
map
to add the count-per-word pairing tomm
.- Create another
HashMap
,fmap
.- Loop through
mm
to add thetoString()
representation of each value as the key forfmap
, with the key being the value forfmap
.- Sort
fmap
into aList<Entry<String, Integer>>
data structure.
- This is done by wrapping the
entrySet()
into aList
, and applyingCollections.sort()
on the newList
.- Display the result.
Tcount
can be better renamed astotalCount
orwordCount
, following thecamelCase
naming convention for variables, and the unnecessary need to shorten variable names (unless they are really long).Your implementation for step 3, i.e. the
insert()
method, has a simpler implementation in Java 8 usingMap.merge()
:map.merge(word, 1, (a, b) -> a + b);
What this means is to either add
word => 1
to themap
if it does not exist, or use theBiFunction
lambda declaration to add the existing and new (i.e.1
) values.Steps 4 to 7 are... quite lengthy and you can rely on some of Guava's built-in methods to perform the same operations too:
Multimap<Integer, String> inverseMap = Multimaps.invertFrom(Multimaps.forMap(map), ArrayListMultimap.create());
- You can use
Multimaps.forMap(Map)
to give you aMultimap
wrapper over the originalMap
so that it can be used as an argument for... Multimaps.invertFrom
, inverting thekey => value
mapping of aMultimap
into a resultingMultimap
instance, which you can declare asArrayListMultimap.create()
.
- You can use
Step 8 is quite an odd step, partly because I feel you have deconstructed a
Map
semantic as a sortedList
ofEntry
instances. This definitely can be implemented in the form of aTreeMap
, which allows you to sort on the keys while giving you additional flexibility of retrieving usingMap
-friendly methods. After sorting the keys, you can simply use the largest key toget()
from yourMultimap
as your result, instead of having to deal with aList<Entry>
data structure.Putting together, steps 4 to 8 can probably be done as such:
private static <K, V extends Comparable<V>> Collection<K> getMaximumByValue( Map<K, V> map) { Multimap<V, K> inverseMap = Multimaps.invertFrom(Multimaps.forMap(map), ArrayListMultimap.create()); return inverseMap.get(Ordering.natural().immutableSortedCopy(inverseMap.keys()) .reverse().get(0)); }
A non-Guava-based approach
With Java 8, one can also learn to apply the new Stream
-based processing techniques that can simplify most of the explicit looping.
For example, to read a file in as a Stream
of lines using Files.lines(Path)
:
try (Stream<String> lines = Files.lines(Paths.get(path))) {
// perform further operations on lines, e.g. lines.collect()...
}
Processing each line into your map
, i.e. step 3 above, can be done by:
flapMap()
-ing every line with a replacement ofStream
of words,- The method reference to use here will be
Pattern::splitAsStream
.
- The method reference to use here will be
filter()
-ing for non-emptyString
s, before performing acollect()
,groupingBy()
, and thencounting()
on the word occurrences:private static final Pattern LINE_SPLIITER = Pattern.compile("\\W"); private static Map<String, Long> process(Stream<String> lines) { return lines.flatMap(LINE_SPLIITER::splitAsStream).filter(s -> !s.isEmpty()) .collect(Collectors.groupingBy(String::toLowerCase, Collectors.counting())); }
The equivalent of using a Guava Multimap
so that the values can be sorted is to apply another form of groupingBy()
on the resulting Map
from above:
private static <K, V extends Comparable<V>> Collection<K> getMaximumByValue(
Map<K, V> map) {
return map.entrySet().stream()
.collect(Collectors.groupingBy(Entry::getValue, TreeMap::new,
Collectors.mapping(Entry::getKey, Collectors.toList())))
.lastEntry().getValue();
}
This time round, we are grouping by using Entry::getValue
(method reference) as the key, and then specifying the use of a TreeMap
as our backing Map
implementation. Collectors.mapping()
is finally used to again specify how we want to collate duplicate keys together as the new Map
values.
Now that we have a TreeMap
instance, we only need to call lastEntry()
to get our result.
Putting it all together
public static void main(String[] args) throws IOException {
// will be good to validate args[0]
try (Stream<String> lines = Files.lines(Paths.get(args[0])) {
Map<String, Long> map = process(lines);
System.out.println("Word count: "
+ map.values().stream().mapToLong(Long::longValue).sum());
System.out.println(getMaximumByValue(map));
}
}
Quick explanation on getting the word count: Convert the Stream<Long>
from map.values().stream()
into a LongStream
by calling mapToLong()
. Then, we can use the sum()
method to get the count.