Indexing a book. Write a program that reads in a text file from standard input and compiles an alphabetical index of which words appear on which lines, as in the following input. Ignore case and punctuation. For each word maintain a list of location on which it appears. Try to use HashTable and/or HashMap class (of java.util).
I have used a HashMap to store the line numbers for each word where it appears. Can this program be made better?
Index.java
package java_assignments.beg_assignment5;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;
public class Index {
public Index(Readable text) {
Scanner sc = new Scanner(text);
occurences = new HashMap<String, ArrayList<Integer>>();
int lineNo = 1;
try {
while (sc.hasNextLine()) {
String line = sc.nextLine();
String[] words = line.split("\\W+");
for (String word : words) {
word = word.toLowerCase();
ArrayList<Integer> list = occurences.get(word);
if (list == null) {
list = new ArrayList<>();
list.add(lineNo);
} else {
list.add(lineNo);
}
occurences.put(word, list);
}
lineNo++;
}
} finally {
sc.close();
}
}
public String toString() {
return occurences.toString();
}
private Map<String, ArrayList<Integer>> occurences;
}
BookIndexer.java
package java_assignments.beg_assignment5;
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.InputStreamReader;
public class BookIndexer {
public static void main(String[] args) {
try {
BufferedReader br;
if (args.length == 0) {
br = new BufferedReader(new InputStreamReader(System.in));
} else {
br = new BufferedReader(new FileReader(args[0]));
}
String index_str = new Index(br).toString();
System.out.println(index_str);
} catch (FileNotFoundException e) {
e.printStackTrace();
}
}
}
-
\$\begingroup\$ Can you add a link to the source of the challenge? \$\endgroup\$Phrancis– Phrancis2016年03月12日 18:14:01 +00:00Commented Mar 12, 2016 at 18:14
-
\$\begingroup\$ @PinCrash: This was given as an assignment in my college \$\endgroup\$In78– In782016年03月12日 18:20:11 +00:00Commented Mar 12, 2016 at 18:20
-
\$\begingroup\$ What version of Java are you using? \$\endgroup\$Boris the Spider– Boris the Spider2016年03月12日 20:27:14 +00:00Commented Mar 12, 2016 at 20:27
-
\$\begingroup\$ @BoristheSpider: Java 7 \$\endgroup\$In78– In782016年03月12日 20:40:48 +00:00Commented Mar 12, 2016 at 20:40
2 Answers 2
Using a try-with-resources
You're correctly closing your Scanner
at the end of the method in a finally
block, so there can be no resource leaks.
However, starting with Java 7, you can simply use the try-with-resources construct to make this easier:
try (Scanner sc = new Scanner(text)) {
// ...
}
Reading words with lines
You're using a Scanner
to read each line and then you are splitting the line on non word characters, i.e. everything that is not [a-zA-Z_0-9]
.
This can be a problem: what if you encounter a word that has a dash or a quote? You will wrongly split it. It would be better to split around a whitespace character, i.e. \s
.
Also, you're currently using a lineNo
variable to hold the current line number. You could use the built-in LineNumberReader
that already maintains a line number. You can access it with getLineNumber()
.
Code structure
Your declaration of
private Map<String, ArrayList<Integer>> occurences;
is located at the bottom of the class. Generally, instance variables are found at the top instead so that you can see directly what the class has as instance variables.
You're currently using two classes: one for the main part and one to find the occurences. It introduces a problem: the constructor does too much work. In fact, the constructor of Index
does all the work. It would be better to refactor this into a method properly named after what it does. We could introduce a method populateOccurences
whose goal would be to create the occurences
map.
Also, I don't think the Index
class is really that necessary: the more a code is simple, the better it is to maintain it. In this case, this class really contains a single method, which to populate the occurences map. It would be easier to not have that class and simply have a method
private static Map<String, List<Integer>> getOccurencesMap(Reader text) throws IOException
inside the main class that would return the map.
Also, don't name your variables index_str
: use camel-case, as indexStr
.
Handling exceptions
When you're reading a text from a file, you're not directly catching the FileNotFoundException
, instead you're letting the main method do it:
try {
BufferedReader br;
if (args.length == 0) {
br = new BufferedReader(new InputStreamReader(System.in));
} else {
br = new BufferedReader(new FileReader(args[0]));
}
// ...
} catch (FileNotFoundException e) {
e.printStackTrace();
}
This create a coupling between the method and what it reads from. Instead, it would be best to delegate that to a method dedicated to returning the Reader
to read:
private static Reader getReader(String[] args) {
if (args.length == 0) {
return new BufferedReader(new InputStreamReader(System.in));
} else {
try {
return new BufferedReader(new FileReader(args[0]));
} catch (FileNotFoundException e) {
throw new IllegalArgumentException("The given file does not exist.", e);
}
}
}
Note two things:
- The
catch (FileNotFoundException e)
is done inside theelse
part: that is the only part of the code responsible for reading a file, so it must be the only part of the code for handling aFileNotFoundException
. - A custom
IllegalArgumentException
is re-thrown to indicate that the file wasn't found. This runtime exception wraps the initialFileNotFoundException
to have a proper stacktrace but it hides that from the surrounding code.
Lowercasing Strings
Be very careful when lowercasing / uppercasing Strings in Java. This depends on the locale. By default, Java will use the locale of the current JVM, which is your system locale (by default). If you were to read a Turkish text on a server in France, you might have inconsistencies and hard to understand bugs! It is preferable to use a locale when doing those operations
word = word.toLowerCase(Locale.ROOT);
Using Java 8 constructs
Your code updating the Map
holding the line numbers for each word reads line
ArrayList<Integer> list = occurences.get(word);
if (list == null) {
list = new ArrayList<>();
list.add(lineNo);
} else {
list.add(lineNo);
}
occurences.put(word, list);
Let alone the fact that you could drop the else
clause and have list.add(lineNo);
after the if
(which would remove this little duplication), you could use the method computeIfAbsent
that will get the value for a specified key or if there is no value, set it with an initial value based on the given mapping function. In this case, you can simply have
occurences.computeIfAbsent(word, k -> new ArrayList<>()).add(lineNo);
If the current word is not in the map, a new ArrayList
will be created and returned, otherwise the current list for that word will be returned. Then, on this instance, we add the current line number.
Beginning with Java 8, a BufferedReader
also has a useful lines()
method that returns a Stream<String>
of the lines. Instead of looping with a for
, we could make that a Stream pipeline. This is what it would look like:
- Make a
Stream
of the lines: this is done by callinglines()
on theBufferedReader
. - Flat map each line into a
Stream
of its words: this can done by using a method reference:Pattern.compile("\\s+")::splitAsStream
. This creates aPattern
around the whitespace characters delimiter and splits each givenString
into aStream<String>
usingsplitAsStream
. The::
operator creates the method-reference. Flat mapping is done by callingflatMap
from the Stream API. - Map each word as lowercase: this can be done by using the lamda expression
w -> w.toLowerCase(Locale.ROOT)
, fed to themap
method of the pipeline - Collect that into a
Map
having the word as key and the line numbers as value: this can be done with the built-inCollectors.groupingBy
collector, where the classifier returns the current word. All values mapped to the same word are collected using a downstream collector, which in this case would map, usingCollectors.mapping
, each line number into a downstream list (withCollectors.toList()
).
Into code, it would look like:
try (LineNumberReader reader = new LineNumberReader(text)) {
return reader.lines()
.flatMap(Pattern.compile("\\s+")::splitAsStream)
.map(w -> w.toLowerCase(Locale.ROOT))
.collect(Collectors.groupingBy(
w -> w,
Collectors.mapping(w -> reader.getLineNumber(), Collectors.toList())
));
}
Of course, you can't run this in parallel.
Putting it all together
With all this, this is what you could have
public class BookIndexer {
public static void main(String[] args) throws IOException {
Reader br = getReader(args);
String indexStr = getOccurencesMap(br).toString();
System.out.println(indexStr);
}
private static Reader getReader(String[] args) {
if (args.length == 0) {
return new BufferedReader(new InputStreamReader(System.in));
} else {
try {
return new BufferedReader(new FileReader(args[0]));
} catch (FileNotFoundException e) {
throw new IllegalArgumentException("The given file does not exist.", e);
}
}
}
private static Map<String, List<Integer>> getOccurencesMap(Reader text) throws IOException {
try (LineNumberReader reader = new LineNumberReader(text)) {
return reader.lines()
.flatMap(Pattern.compile("\\s+")::splitAsStream)
.map(w -> w.toLowerCase(Locale.ROOT))
.collect(Collectors.groupingBy(
w -> w,
Collectors.mapping(w -> reader.getLineNumber(), Collectors.toList())
));
}
}
}
-
\$\begingroup\$ I think
Files.lines()
would be much neater here than usingBufferedReader.lines()
- if you're making the move to Java 8, go all the way. Even if you don't want to useFiles.lines()
, there is aFiles.newBufferedReader
method which should be used in preference to creating one yourself. \$\endgroup\$Boris the Spider– Boris the Spider2016年03月12日 20:30:13 +00:00Commented Mar 12, 2016 at 20:30 -
\$\begingroup\$ @BoristheSpider I thought about it but how would you handle the line number with
Files.lines()
? But usingFiles.newBufferedReader
is indeed better, I'll edit with that. \$\endgroup\$Tunaki– Tunaki2016年03月12日 20:32:59 +00:00Commented Mar 12, 2016 at 20:32 -
\$\begingroup\$ You currently handle the line number is a somewhat hacky and non-threadsafe manner. If this were production code, I would expect a custom tuple class holding the line and the line number. \$\endgroup\$Boris the Spider– Boris the Spider2016年03月12日 20:34:25 +00:00Commented Mar 12, 2016 at 20:34
-
\$\begingroup\$ Retrieving a
Reader
outside of any reading construct (i.e. in themain
method) is a code smell in my book. I'd move thetry/catch
in the main method.getOccurrencesMap
can perfectly accept aLineNumberReader
as method parameter. It makes sense. \$\endgroup\$Olivier Grégoire– Olivier Grégoire2016年03月12日 21:17:59 +00:00Commented Mar 12, 2016 at 21:17 -
1\$\begingroup\$ @In78 Because the exception was not catched inside
getOccurencesMap
so it must declare to throw it (it is a checked exception). And since the main method is callinggetOccurencesMap
(and still not catching it), it must also declare to throw it. You could potentially catch theIOException
insidegetOccurencesMap
and throw a runtime exception instead, thus getting rid ofthrows IOException
. \$\endgroup\$Tunaki– Tunaki2016年03月12日 21:29:36 +00:00Commented Mar 12, 2016 at 21:29
Language
occurences
is misspelled: it should have two letters r
(occurrences
).
Coding conventions
In Java, developers are encouraged to specify the instance fields before methods and constructors.
private Map<String, ArrayList<Integer>> occurences;
:
I would declare this as private Map<String, List<Integer>> occurrences;
Also, you don't need to initialise occurrences
in the constructor of Index
: instead, you could initialise it as soon as you declare it.
new HashMap<String, ArrayList<Integer>>();
:
Use diamond inference: new HashMap<>();
Plus, there is an opportunity for making your code (a little) more tidy:
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Scanner;
public class Index {
private Map<String, List<Integer>> occurences = new HashMap<>();
public Index(Readable text) {
try (Scanner sc = new Scanner(text)) {
int lineNo = 1;
while (sc.hasNextLine()) {
String[] words = sc.nextLine().split("\\W+");
for (String word : words) {
word = word.toLowerCase();
List<Integer> list = occurences.get(word);
if (list == null) {
list = new ArrayList<>();
occurences.put(word, list);
}
list.add(lineNo);
}
lineNo++;
}
}
}
public String toString() {
return occurences.toString();
}
}
Hope that helps.
-
\$\begingroup\$ How does the compiler decide which implementation of
List<Integer>
to use inoccurences
? \$\endgroup\$In78– In782016年03月12日 18:56:35 +00:00Commented Mar 12, 2016 at 18:56 -
2\$\begingroup\$ From the line
list = new ArrayList<>();
which precedes the lineoccurences.put(word, list);
Actually, JVM does not care whether it isArrayList
orLinkedList
or anything else; all it cares about is that the implementation implements thejava.util.List
interface. \$\endgroup\$coderodde– coderodde2016年03月12日 18:58:12 +00:00Commented Mar 12, 2016 at 18:58