I have just written a small application to aggregate the lines of a text file, for example to group the lines according to the frequency of IP addresses in a log file.
Would the code be sufficiently commented (and self-explanatory) or is something still missing?
I would appreciate any comments from you.
Example call
javac Main.java
and java Main input.txt ".*? (\d+) .*"
with an input.txt file:
g 23 foo
a 234 bar
b 234 baz
c 123 qux
d 32 quux
e 234 corge
f 32 grault
will print
0003 - 234 - a 234 bar
0002 - 32 - d 32 quux
0001 - 23 - g 23 foo
0001 - 123 - c 123 qux
Code
import java.io.BufferedReader;
import java.io.FileReader;
import java.nio.charset.Charset;
import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
/**
* Main method to aggregate and count the occurrences of a regex pattern in a text file.
*
* @param args input text file, aggregate regex, and optional ignore regex.
* @throws Exception if the input file is not found or cannot be read.
*/
public static void main(String[] args) throws Exception {
if (args.length < 2) {
System.out.println(
"Usage: java -jar <jar file> <input text file> <aggregate regex> (<ignore regex>)");
System.out.println(
"Example: java -jar <jar file> \"input.txt\" \".*? (\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}) .*\"");
System.out.println(" to aggregate IP addresses from the input file.");
throw new IllegalArgumentException("Invalid number of arguments");
}
Pattern aggregatePattern = Pattern.compile(args[1]);
Pattern ignorePattern = args.length > 2 ? Pattern.compile(args[2]) : Pattern.compile("(?!x)x");
LinkedHashMap<String, Map.Entry<ArrayList<String>, Integer>> map = new LinkedHashMap<>();
try (BufferedReader reader =
new BufferedReader(new FileReader(args[0], Charset.defaultCharset()))) {
String line;
while ((line = reader.readLine()) != null) {
Matcher aggregateMatcher = aggregatePattern.matcher(line);
if (aggregateMatcher.find() && !ignorePattern.matcher(line).find()) {
String key = aggregateMatcher.group(1);
map.computeIfAbsent(key, k -> new AbstractMap.SimpleEntry<>(new ArrayList<>(), 0));
Map.Entry<ArrayList<String>, Integer> entry = map.get(key);
entry.getKey().add(line);
entry.setValue(entry.getValue() + 1);
}
}
}
ArrayList<Map.Entry<String, Map.Entry<ArrayList<String>, Integer>>> list =
new ArrayList<>(map.entrySet());
// Sort by count in descending order, if counts are equal, sort by input order (earlier first).
list.sort((o1, o2) -> o2.getValue().getValue().compareTo(o1.getValue().getValue()));
for (Map.Entry<String, Map.Entry<ArrayList<String>, Integer>> entry : list) {
System.out.printf(
"%04d - %s - %s%n",
entry.getValue().getValue(), entry.getKey(), entry.getValue().getKey().get(0));
}
}
}
2 Answers 2
The code is understandable but only because you spoiled us with the example in the question. That example should have been in the actual code documentation. I'm trying to read the code as if that example did not exist. The shortness of the code helps a lot, if the program was longer it would be very hard to follow.
LinkedHashMap<String, Map.Entry<ArrayList<String>, Integer>> map = new LinkedHashMap<>();
This is the literal reason why Java does not have a generic Tuple
or Pair
class. A generic pair does not have a semantic meaning. You need to define your own class that has a name that conveys it's purpose. That you also used the most abstract name for the map
doesn't make this data structure any easier to understand. To make this data structure more understandable you can extract the operations that manipulate it into functions and give the parameters names that tell the reader what is being done (and why).
You haven't documented that the regex must have a capturing group and you are not doing error checking on the input parameter to make sure it exists.
Keeping in mind the purpose of the code (simple tool to get the work done, no desire for reusability), I would make this more readable by moving the patterns and maps to be static class fields and split the main method into several sub-methods (parameter parsing, reading, printing result, printing help, etc.).
-
\$\begingroup\$ "You haven't documented that the regex must have a capturing group" - can I check that programmatically, or have I introduce to split the input arguments? (... Apart from the point that this should be documented.) \$\endgroup\$Tobias Grothe– Tobias Grothe2025年03月03日 06:34:01 +00:00Commented Mar 3 at 6:34
-
\$\begingroup\$ What immediately comes to my mind is
aggregatePattern.match("").groupCount()
but that tickles me from exactly the wrong places. Maybe there's a stackoverflow.com question about it... \$\endgroup\$TorbenPutkonen– TorbenPutkonen2025年03月03日 09:10:00 +00:00Commented Mar 3 at 9:10 -
\$\begingroup\$ @TorbenPutkonen I think one should be able to test if the pattern contains a capturing group with the regex /([^?].*)/ ? \$\endgroup\$Cecilia– Cecilia2025年03月03日 11:47:07 +00:00Commented Mar 3 at 11:47
-
\$\begingroup\$ As mentioned, Matcher has such a method, but Pattern does not. This means that you would have to check this in the loop for each line, "on the fly". \$\endgroup\$Tobias Grothe– Tobias Grothe2025年03月03日 15:43:55 +00:00Commented Mar 3 at 15:43
As @TorbenPutkonen says, use a proper type instead of abusing Map.Entry
I would further make use of the Stream APIs. Don't do a get
on your map after calling computeIfAbsent
.. you already have the entry.
This is slightly cleaner...
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
static class LineCounts {
final ArrayList<String> lines = new ArrayList<>();
int count;
}
/**
* Main method to aggregate and count the occurrences of a regex pattern in a text file.
*
* @param args input text file, aggregate regex, and optional ignore regex.
* @throws Exception if the input file is not found or cannot be read.
*/
public static void main(String[] args) throws Exception {
if (args.length < 2) {
System.out.println("""
Usage: java -jar <jar file> <input text file> <aggregate regex> (<ignore regex>)
Example: java -jar <jar file> "input.txt" ".*? (\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}) .*"
to aggregate IP addresses from the input file.""");
throw new IllegalArgumentException("Invalid number of arguments");
}
Pattern aggregatePattern = Pattern.compile(args[1]);
Pattern ignorePattern = Pattern.compile(args.length > 2 ? args[2] : "(?!x)x");
LinkedHashMap<String, LineCounts> map = new LinkedHashMap<>();
Files.lines(Paths.get(args[0]), Charset.defaultCharset()).forEach(line -> {
Matcher aggregateMatcher = aggregatePattern.matcher(line);
if (aggregateMatcher.find() && !ignorePattern.matcher(line).find()) {
String key = aggregateMatcher.group(1);
LineCounts entry = map.computeIfAbsent(key, k -> new LineCounts());
entry.lines.add(line);
entry.count++;
}
});
map.entrySet().stream()
// Sort by count in descending order, if counts are equal, sort by input order (earlier first).
.sorted((o1, o2) -> o2.getValue().count - o1.getValue().count)
.forEach(entry -> {
System.out.printf("%04d - %s - %s%n", entry.getValue().count, entry.getKey(), entry.getValue().lines.get(0));
});
}
}
main()
. It would be helpful to crack argv and then call some appropriately named helper function. Perhaps one that has a/** javadoc */
comment. What I'm looking for is, "what is the contract?", and there's room for the OP code to do a better job of explaining that. There's no Answer posted, yet, so there's still time to edit and improve the Question. \$\endgroup\$