Program to index a book

Question 1

Indexing a book. Write a program that reads in a text file from standard input and compiles an alphabetical index of which words appear on which lines, as in the following input. Ignore case and punctuation. For each word, maintain a list of the locations at which it appears. Try to use the Hashtable and/or HashMap class (of java.util).

I have used a HashMap to store the line numbers for each word where it appears. Can this program be made better?

Index.java

package java_assignments.beg_assignment5;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

public class Index {
    public Index(Readable text) {
        Scanner sc = new Scanner(text);
        occurences = new HashMap<String, ArrayList<Integer>>();
        int lineNo = 1;
        try {
            while (sc.hasNextLine()) {
                String line = sc.nextLine();
                String[] words = line.split("\\W+");
                for (String word : words) {
                    word = word.toLowerCase();
                    ArrayList<Integer> list = occurences.get(word);
                    if (list == null) {
                        list = new ArrayList<>();
                        list.add(lineNo);
                    } else {
                        list.add(lineNo);
                    }
                    occurences.put(word, list);
                }
                lineNo++;
            }
        } finally {
            sc.close();
        }
    }

    public String toString() {
        return occurences.toString();
    }

    private Map<String, ArrayList<Integer>> occurences;
}

BookIndexer.java

package java_assignments.beg_assignment5;

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.InputStreamReader;

public class BookIndexer {
    public static void main(String[] args) {
        try {
            BufferedReader br;
            if (args.length == 0) {
                br = new BufferedReader(new InputStreamReader(System.in));
            } else {
                br = new BufferedReader(new FileReader(args[0]));
            }
            String index_str = new Index(br).toString();
            System.out.println(index_str);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
    }
}

Question 2

Can you add a link to the source of the challenge?

Question 3

@PinCrash: This was given as an assignment in my college

Question 4

What version of Java are you using?

Question 5

@BoristheSpider: Java 7

Answer 1 (accepted)

Using a try-with-resources

You're correctly closing your Scanner at the end of the method in a finally block, so there can be no resource leaks.

However, starting with Java 7, you can simply use the try-with-resources construct to make this easier:

try (Scanner sc = new Scanner(text)) {
 // ...
}
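
As a small illustration of the construct, the sketch below counts the lines of a Readable. The names LineCounter and countLines are made up for this example and are not part of your code; the point is only that the Scanner is closed without an explicit finally block.

```java
import java.io.StringReader;
import java.util.Scanner;

class LineCounter {
    // Hypothetical helper: the Scanner is closed automatically when the
    // try block exits, whether normally or because of an exception.
    static int countLines(Readable text) {
        try (Scanner sc = new Scanner(text)) {
            int count = 0;
            while (sc.hasNextLine()) {
                sc.nextLine();
                count++;
            }
            return count;
        }
    }

    public static void main(String[] args) {
        System.out.println(countLines(new StringReader("one\ntwo\nthree")));
    }
}
```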

Reading words with lines

You're using a Scanner to read each line and then you are splitting the line on non word characters, i.e. everything that is not [a-zA-Z_0-9].

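This can be a problem: a word that contains a dash or an apostrophe (well-known, don't) is wrongly split into pieces. It would be better to split around whitespace, i.e. \s+. Keep in mind that the tokens then still carry leading and trailing punctuation, which the assignment asks you to ignore, so it has to be stripped separately.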
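
One possible way to combine whitespace splitting with the "ignore punctuation" requirement is to trim punctuation from both ends of each token. This is only a sketch under that assumption; Tokenizer and tokenize are hypothetical names, and you may want a different rule for what counts as part of a word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class Tokenizer {
    // Hypothetical helper: splits on whitespace, then removes characters that
    // are neither letters nor digits from both ends of each token, so that
    // inner dashes and apostrophes ("well-known", "don't") are preserved.
    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            String word = token
                    .replaceAll("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$", "")
                    .toLowerCase(Locale.ROOT);
            if (!word.isEmpty()) {
                // Tokens made only of punctuation, and blank lines, add nothing.
                words.add(word);
            }
        }
        return words;
    }
}
```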

Also, you're currently using a lineNo variable to hold the current line number. You could use the built-in LineNumberReader that already maintains a line number. You can access it with getLineNumber().
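
To show how getLineNumber() behaves, here is a minimal sketch that prefixes every line with its number. LineNumberDemo and numberLines are names invented for the example, and wrapping the IOException in an UncheckedIOException is just one way to keep the signature short.

```java
import java.io.IOException;
import java.io.LineNumberReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class LineNumberDemo {
    // getLineNumber() reports how many lines have been read so far,
    // so it returns 1 right after the first readLine() call.
    static List<String> numberLines(Reader text) {
        List<String> numbered = new ArrayList<>();
        try (LineNumberReader reader = new LineNumberReader(text)) {
            String line;
            while ((line = reader.readLine()) != null) {
                numbered.add(reader.getLineNumber() + ": " + line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return numbered;
    }
}
```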

Code structure

Your declaration of

private Map<String, ArrayList<Integer>> occurences;

is located at the bottom of the class. Generally, instance variables are found at the top instead so that you can see directly what the class has as instance variables.

You're currently using two classes: one for the main part and one to find the occurences. This introduces a problem: the constructor does too much work. In fact, the constructor of Index does all the work. It would be better to refactor this into a method properly named after what it does. We could introduce a method populateOccurences whose goal would be to create the occurences map.

Also, I don't think the Index class is really necessary: the simpler the code, the easier it is to maintain. In this case, the class really contains a single method, whose job is to populate the occurences map. It would be easier to drop that class and simply have a method

private static Map<String, List<Integer>> getOccurencesMap(Reader text) throws IOException

inside the main class that would return the map.

Also, don't name a variable index_str: the Java convention is camel case, i.e. indexStr.

Handling exceptions

When you're reading a text from a file, you're not directly catching the FileNotFoundException, instead you're letting the main method do it:

try {
    BufferedReader br;
    if (args.length == 0) {
        br = new BufferedReader(new InputStreamReader(System.in));
    } else {
        br = new BufferedReader(new FileReader(args[0]));
    }
    // ...
} catch (FileNotFoundException e) {
    e.printStackTrace();
}

This creates a coupling between the method and what it reads from. Instead, it would be best to delegate that to a method dedicated to returning the Reader to read from:

private static Reader getReader(String[] args) {
    if (args.length == 0) {
        return new BufferedReader(new InputStreamReader(System.in));
    } else {
        try {
            return new BufferedReader(new FileReader(args[0]));
        } catch (FileNotFoundException e) {
            throw new IllegalArgumentException("The given file does not exist.", e);
        }
    }
}

Note two things:

The catch (FileNotFoundException e) is done inside the else part: that is the only part of the code that opens a file, so it should be the only part that handles a FileNotFoundException.
An IllegalArgumentException is thrown to indicate that the file wasn't found. This runtime exception wraps the original FileNotFoundException, so the stack trace still shows the cause, but the surrounding code does not have to deal with a checked exception.

Lowercasing Strings

Be very careful when lowercasing / uppercasing Strings in Java: the result depends on the locale. By default, Java uses the default locale of the current JVM, which is normally your system locale. If the same text is processed on machines with different default locales (say, one Turkish and one French), you can get inconsistent results and hard-to-understand bugs! It is preferable to pass an explicit locale when doing those operations:

word = word.toLowerCase(Locale.ROOT);
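
The classic case is the Turkish dotless i. The following sketch (LocaleDemo and its methods are names chosen for the example) lowercases the same word under Locale.ROOT and under a Turkish locale; the two results differ in the second character.

```java
import java.util.Locale;

class LocaleDemo {
    // Locale-independent lowercasing: 'I' becomes 'i'.
    static String lowerRoot(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    // Turkish rules: 'I' becomes the dotless 'ı' (U+0131).
    static String lowerTurkish(String s) {
        return s.toLowerCase(Locale.forLanguageTag("tr"));
    }

    public static void main(String[] args) {
        System.out.println(lowerRoot("TITLE"));
        System.out.println(lowerTurkish("TITLE"));
    }
}
```

With a HashMap keyed by the lowercased word, those two spellings would end up as different keys.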

Using Java 8 constructs

Your code updating the Map holding the line numbers for each word reads like

ArrayList<Integer> list = occurences.get(word);
if (list == null) {
    list = new ArrayList<>();
    list.add(lineNo);
} else {
    list.add(lineNo);
}
occurences.put(word, list);

Besides the fact that you could drop the else clause and call list.add(lineNo); once after the if (which would remove this little duplication), you could use the method computeIfAbsent, which returns the value for the specified key or, if there is no value yet, stores and returns an initial value produced by the given mapping function. In this case, you can simply have

occurences.computeIfAbsent(word, k -> new ArrayList<>()).add(lineNo);

If the current word is not in the map, a new ArrayList will be created and returned, otherwise the current list for that word will be returned. Then, on this instance, we add the current line number.
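
Put into a small self-contained example (ComputeIfAbsentDemo and index are illustrative names, and the input is an array of lines just to keep the sketch short), this could look as follows. It needs Java 8.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ComputeIfAbsentDemo {
    // Builds word -> line numbers for the given lines (1-based numbering).
    static Map<String, List<Integer>> index(String[] lines) {
        Map<String, List<Integer>> occurrences = new HashMap<>();
        for (int i = 0; i < lines.length; i++) {
            for (String word : lines[i].split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // split() yields "" for blank lines or leading whitespace
                }
                // Creates the list on first sight of the word, then appends to it.
                occurrences.computeIfAbsent(word, k -> new ArrayList<>()).add(i + 1);
            }
        }
        return occurrences;
    }
}
```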

Beginning with Java 8, a BufferedReader also has a useful lines() method that returns a Stream<String> of the lines. Instead of looping with a for, we could make that a Stream pipeline. This is what it would look like:

Make a Stream of the lines: this is done by calling lines() on the BufferedReader.
Flat map each line into a Stream of its words: this can be done by using a method reference: Pattern.compile("\\s+")::splitAsStream. This creates a Pattern matching the whitespace delimiter and splits each given String into a Stream<String> using splitAsStream. The :: operator creates the method reference. Flat mapping is done by calling flatMap from the Stream API.
Map each word to lowercase: this can be done by using the lambda expression w -> w.toLowerCase(Locale.ROOT), fed to the map method of the pipeline.
Collect that into a Map having the word as key and the line numbers as value: this can be done with the built-in Collectors.groupingBy collector, where the classifier returns the current word. All values mapped to the same word are collected using a downstream collector, which in this case would map, using Collectors.mapping, each word to its line number and gather those into a list (with Collectors.toList()).

In code, it would look like:

try (LineNumberReader reader = new LineNumberReader(text)) {
    return reader.lines()
                 .flatMap(Pattern.compile("\\s+")::splitAsStream)
                 .map(w -> w.toLowerCase(Locale.ROOT))
                 .collect(Collectors.groupingBy(
                     w -> w,
                     Collectors.mapping(w -> reader.getLineNumber(), Collectors.toList())
                 ));
}

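Of course, you can't run this in parallel: the collector reads reader.getLineNumber() as a side effect, which only gives the right number while the lines are processed one at a time and in order.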

Putting it all together

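With all this, this is what you could have:

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.LineNumberReader;
import java.io.Reader;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class BookIndexer {
    public static void main(String[] args) throws IOException {
        Reader br = getReader(args);
        String indexStr = getOccurencesMap(br).toString();
        System.out.println(indexStr);
    }

    // Reads from standard input when no file name is given.
    private static Reader getReader(String[] args) {
        if (args.length == 0) {
            return new BufferedReader(new InputStreamReader(System.in));
        } else {
            try {
                return new BufferedReader(new FileReader(args[0]));
            } catch (FileNotFoundException e) {
                throw new IllegalArgumentException("The given file does not exist.", e);
            }
        }
    }

    // Sequential only: the line number is read from the reader as a side effect.
    private static Map<String, List<Integer>> getOccurencesMap(Reader text) throws IOException {
        try (LineNumberReader reader = new LineNumberReader(text)) {
            return reader.lines()
                         .flatMap(Pattern.compile("\\s+")::splitAsStream)
                         .map(w -> w.toLowerCase(Locale.ROOT))
                         .collect(Collectors.groupingBy(
                             w -> w,
                             Collectors.mapping(w -> reader.getLineNumber(), Collectors.toList())
                         ));
        }
    }
}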

Question 7

I think Files.lines() would be much neater here than using BufferedReader.lines() - if you're making the move to Java 8, go all the way. Even if you don't want to use Files.lines(), there is a Files.newBufferedReader method which should be used in preference to creating one yourself.

Question 8

@BoristheSpider I thought about it but how would you handle the line number with Files.lines()? But using Files.newBufferedReader is indeed better, I'll edit with that.
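
For what it's worth, Files.newBufferedReader can be combined with LineNumberReader, so the line number stays available. A rough sketch, assuming a UTF-8 encoded file; FileReaderDemo and countLines are names invented for the example.

```java
import java.io.IOException;
import java.io.LineNumberReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class FileReaderDemo {
    // Opens the file through Files.newBufferedReader and wraps it in a
    // LineNumberReader; returns the number of lines in the file.
    static int countLines(Path file) {
        try (LineNumberReader reader = new LineNumberReader(
                Files.newBufferedReader(file, StandardCharsets.UTF_8))) {
            while (reader.readLine() != null) {
                // Nothing to do: readLine() advances the line counter.
            }
            return reader.getLineNumber();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```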

Question 9

You currently handle the line number in a somewhat hacky and non-threadsafe manner. If this were production code, I would expect a custom tuple class holding the line and the line number.
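
A sketch of what such a tuple-based pipeline might look like, assuming the lines have already been read into a List. WordAtLine and TupleIndexDemo are hypothetical names, and here the tuple pairs each word (rather than the whole line) with its line number.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class TupleIndexDemo {
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Hypothetical value class pairing a word with its 1-based line number.
    static final class WordAtLine {
        final String word;
        final int lineNumber;

        WordAtLine(String word, int lineNumber) {
            this.word = word;
            this.lineNumber = lineNumber;
        }
    }

    // The line number travels with each word, so no shared state is read
    // while collecting.
    static Map<String, List<Integer>> index(List<String> lines) {
        return IntStream.range(0, lines.size())
                .boxed()
                .flatMap(i -> WHITESPACE.splitAsStream(lines.get(i))
                        .filter(w -> !w.isEmpty())
                        .map(w -> new WordAtLine(w.toLowerCase(Locale.ROOT), i + 1)))
                .collect(Collectors.groupingBy(
                        e -> e.word,
                        Collectors.mapping(e -> e.lineNumber, Collectors.toList())));
    }
}
```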

Question 10

Retrieving a Reader outside of any reading construct (i.e. in the main method) is a code smell in my book. I'd move the try/catch into the main method. getOccurrencesMap can perfectly well accept a LineNumberReader as its parameter. It makes sense.

Question 11

@In78 Because the exception is not caught inside getOccurencesMap, the method must declare that it throws it (it is a checked exception). And since the main method calls getOccurencesMap (and still does not catch it), it must also declare that it throws it. You could catch the IOException inside getOccurencesMap and throw a runtime exception instead, thus getting rid of throws IOException.
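
As a sketch of that last option, the version below uses a plain loop and wraps the IOException in java.io.UncheckedIOException (available since Java 8). The class name UncheckedIndexDemo is made up, the method is spelled getOccurrencesMap here, and skipping empty tokens is an addition of mine rather than part of the code above.

```java
import java.io.IOException;
import java.io.LineNumberReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class UncheckedIndexDemo {
    // No "throws IOException": the checked exception is wrapped and rethrown
    // as an unchecked one, so callers are not forced to declare or catch it.
    static Map<String, List<Integer>> getOccurrencesMap(Reader text) {
        Map<String, List<Integer>> occurrences = new HashMap<>();
        try (LineNumberReader reader = new LineNumberReader(text)) {
            String line;
            while ((line = reader.readLine()) != null) {
                for (String word : line.trim().split("\\s+")) {
                    if (!word.isEmpty()) {
                        occurrences
                                .computeIfAbsent(word.toLowerCase(Locale.ROOT), k -> new ArrayList<>())
                                .add(reader.getLineNumber());
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return occurrences;
    }
}
```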

Answer 2

Language

occurences is misspelled: it should have two letters r (occurrences).

Coding conventions

In Java, developers are encouraged to specify the instance fields before methods and constructors.

private Map<String, ArrayList<Integer>> occurences;:

I would declare this as private Map<String, List<Integer>> occurrences;

Also, you don't need to initialise occurrences in the constructor of Index: instead, you could initialise it as soon as you declare it.

new HashMap<String, ArrayList<Integer>>();:

Use diamond inference: new HashMap<>();

Plus, there is an opportunity for making your code (a little) more tidy:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Scanner;

public class Index {
    private Map<String, List<Integer>> occurences = new HashMap<>();

    public Index(Readable text) {
        try (Scanner sc = new Scanner(text)) {
            int lineNo = 1;
            while (sc.hasNextLine()) {
                String[] words = sc.nextLine().split("\\W+");
                for (String word : words) {
                    word = word.toLowerCase();
                    List<Integer> list = occurences.get(word);
                    if (list == null) {
                        list = new ArrayList<>();
                        occurences.put(word, list);
                    }
                    list.add(lineNo);
                }
                lineNo++;
            }
        }
    }

    public String toString() {
        return occurences.toString();
    }
}

Hope that helps.

Question 13

How does the compiler decide which implementation of List<Integer> to use in occurences?

Question 14

From the line list = new ArrayList<>(); which precedes the line occurences.put(word, list);. Actually, the map does not care whether it is an ArrayList, a LinkedList or anything else; all that matters is that the implementation implements the java.util.List interface.
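
A tiny example of that point (ListImplDemo and build are illustrative names): the same Map<String, List<Integer>> can hold different List implementations side by side.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

class ListImplDemo {
    // The declared value type is the List interface, so any implementation
    // can be stored; the concrete class is chosen where the list is created.
    static Map<String, List<Integer>> build() {
        Map<String, List<Integer>> occurrences = new HashMap<>();
        occurrences.put("array", new ArrayList<>(Arrays.asList(1, 2)));
        occurrences.put("linked", new LinkedList<>(Arrays.asList(3)));
        return occurrences;
    }
}
```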

Tunaki · Accepted Answer · 2016-03-12 18:39:18Z

