Parse a text file, and map to an out-of-scope POJO

Question 1

I need to parse a file, with a standard formatting:

200 lines of uninteresting data
=====================================================================
 Column(s)
=====================================================================
Column Description : HP-PONA
Inventory# : autoID-1
Diameter : 200.0 μm
Length : 50.0 m

The data must be outputted to a POJO outside of the project scope (i.e. I cannot change the POJO definition - ColumnInnerDiameter and ChromatographyColumnLength could/should be the same class or inherit for the same base class, but I cannot change this.):

public class ChromatographyColumnDocument {
 private ChromatographyColumnLength chromatographyColumnLength;
 private ColumnInnerDiameter columnInnerDiameter;
 public ChromatographyColumnLength getChromatographyColumnLength() {
 return chromatographyColumnLength;
 }
 public void setChromatographyColumnLength(ChromatographyColumnLength 
chromatographyColumnLength) {
 this.chromatographyColumnLength = chromatographyColumnLength;
 }
 public ColumnInnerDiameter getColumnInnerDiameter() {
 return columnInnerDiameter;
 }
 public void setColumnInnerDiameter(ColumnInnerDiameter columnInnerDiameter) {
 this.columnInnerDiameter = columnInnerDiameter;
 }
}
public class ColumnInnerDiameter {
 private Double value;
 private String unit;
 public Double getValue() {
 return value;
 }
 public void setValue(Double value) {
 this.value = value;
 }
 public String getUnit() {
 return unit;
 }
 public void setUnit(String unit) {
 this.unit = unit;
 }
}
public class ChromatographyColumnLength{
 private Double value;
 private String unit;
 public Double getValue() {
 return value;
 }
 public void setValue(Double value) {
 this.value = value;
 }
 public String getUnit() {
 return unit;
 }
 public void setUnit(String unit) {
 this.unit = unit;
 }
}

Finally, the code under review:

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class ColumnInformationMapper {
 private ColumnInformationMapper() {
 throw new IllegalStateException("Utility class");
 }
 private static final String valueAndUnitPattern = "([\\d|\\.]*) (\\S*)";
 public static ChromatographyColumnDocument readColumnDocumentFromFile(String folderPath) throws Exception {
 String line;
 ChromatographyColumnDocument columnDocument = new ChromatographyColumnDocument();
 File file = new File(folderPath,"acq.txt");
 try (BufferedReader br = new BufferedReader(
 new InputStreamReader(
 new FileInputStream(file), StandardCharsets.UTF_16))){
 /* Looking for pattern
 =====================================================================
 Column(s)
 =====================================================================
 Column Description : HP-PONA
 */
 while ((line = br.readLine()) != null) {
 if (line.contains("======")) {
 line = br.readLine();
 if (line.contains("Column(s)")) {
 br.readLine();// === line
 br.readLine();// empty line
 break;
 } 
 }
 }
 br.readLine(); //Column Description - Not in model
 br.readLine(); //Inventory # - Not in model
 columnDocument.setColumnInnerDiameter(parseColumnInnerDiameter(
 StringUtils.substringAfterLast(br.readLine(), ":").trim())); //Column diameter
 columnDocument.setChromatographyColumnLength(parseColumnLength(
 StringUtils.substringAfterLast(br.readLine(), ":").trim())); //Column length
 
 return columnDocument;
 }
 }
 private static ColumnInnerDiameter parseColumnInnerDiameter(String inputString) {
 Pattern p = Pattern.compile(valueAndUnitPattern);
 Matcher m = p.matcher(inputString);
 if (m.find()) {
 ColumnInnerDiameter columnInnerDiameter = new ColumnInnerDiameter();
 columnInnerDiameter.setValue(Double.parseDouble(m.group(1)));
 columnInnerDiameter.setUnit(m.group(2));
 return columnInnerDiameter;
 }
 throw new NumberFormatException();
 }
 private static ChromatographyColumnLength parseColumnLength(String inputString) {
 Pattern p = Pattern.compile(valueAndUnitPattern);
 Matcher m = p.matcher(inputString);
 if (m.find()) {
 ChromatographyColumnLength columnLength = new ChromatographyColumnLength();
 columnLength.setValue(Double.parseDouble(m.group(1)));
 columnLength.setUnit(m.group(2));
 return columnLength;
 }
 throw new NumberFormatException();
 }
}

My questions:

To skip a line I am not interested in, I use readLine() and just don't use the results. SonarQube is very mad at this, and calls it a Major Bug: "When a method is called that returns data read from some data source, that data should be stored rather than thrown away. Any other course of action is surely a bug.". Is there a more correct way of skipping lines?
The methods to parse ColumnInnerDiameter and ChromatographyColumnLength are identical. I have tried to fiddle with generics to get rid of code repetition, but could not get it to work. Is there any way to improve on this side, without modifying the POJO classes?
Any advice on how to improve the code overall is most welcome!

Question 2

Your questions:

SonarQube is wrong and you are right. That is the correct way to skip a line. You can use @SuppressWarnings to get rid of the warnings, or you can assign the skipped line to some local variable, such as String skip = br.nextLine().
Since the two Column classes do not share an interface which defines those methods, I do not see a good way to remove the duplication.
See below:

Avoid * imports. It makes it harder to determine where dependencies are coming from, and harder to determine what dependencies a class actually has.

Consider importing UTF_16 statically.

There's no need to throw an exception from a private constructor. Having the private constructor is enough.

Constants (and all variables) belong before the constructor, not after. Constants should use UPPER_SNAKE_CASE variable names rather than camelCase.

Throw the most specific exceptions possible. A client cannot handle exception types differently if all it knows it might get is an Exception.

Avoid acronyms and abbreviations when naming code. bufferedReader would be preferable to br. Code should strive to be readable. "bee arr dot readline" vs. "buffered reader dot readline".

The pattern currently used to determine if something is a number is not correct. It will clearly match invalid numbers.

Prefer passing in a File instead of String for the file path. A String can be anything. Consider changing the file path to a File object as soon as possible so the code gains type safety.

Rather than constructing nested readers, consider creating separate named instances. This aligns the code more neatly. Since the use is literally on the next line, it's reasonable to use abbreviated names here. The tradeoff is that the stream and the reader are now in scope for the try block, where they weren't before. Given the size of the method, I think the readability tradeoff is worth it.

Consider breaking out the skip portion of readColumnDocumentFromFile into its own method.

readColumnDocumentFromFile will throw a NullPointerException if the columns block is not in the text file. It would be better to catch the NoSuchElementException and rethrow something that gives more context into what went wrong.

The two parse methods are throwing NumberFormatException, which doesn't make sense when the real problem is that the column data is invalid.

Pattern is intended to be compiled once, and then used as a factory for Matcher instances. The Pattern instance should be the constant, not the String.

The code would probably read more cleanly if a Scanner was used. The methods would probably also then be unnecessary, as Scanner can cleanly locate the next double or String value.

If you made all these changes, your code might look more like:

import static java.nio.charset.StandardCharsets.UTF_16;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Scanner;
import java.util.regex.Pattern;
public final class ColumnInformationMapper {
 private static final Pattern COLUMN_NAME_PATTERN = Pattern.compile(".*:");
 private ColumnInformationMapper() {
 }
 /**
 * @throws FileNotFoundException if there is no `acq.txt` file in the given folderPath.
 * @throws IOException if the file cannot be read.
 */
 public static ChromatographyColumnDocument readColumnDocumentFromFile(File folderPath) 
 throws FileNotFoundException, IOException {
 ChromatographyColumnDocument columnDocument = new ChromatographyColumnDocument();
 File file = new File(folderPath, "acq.txt");
 try (FileInputStream fis = new FileInputStream(file);
 InputStreamReader isr = new InputStreamReader(fis, UTF_16);
 Scanner acqScanner = new Scanner(isr)) {
 skipToColumnInnerDiameterLine(acqScanner);
 ColumnInnerDiameter columnInnerDiameter = new ColumnInnerDiameter();
 acqScanner.next(COLUMN_NAME_PATTERN); // skip column name
 columnInnerDiameter.setValue(acqScanner.nextDouble());
 columnInnerDiameter.setUnit(acqScanner.next());
 columnDocument.setColumnInnerDiameter(columnInnerDiameter);
 ChromatographyColumnLength columnLength = new ChromatographyColumnLength();
 acqScanner.next(COLUMN_NAME_PATTERN); // skip column name
 columnLength.setValue(acqScanner.nextDouble());
 columnLength.setUnit(acqScanner.next());
 columnDocument.setChromatographyColumnLength(columnLength);
 return columnDocument;
 }
 }
 private static void skipToColumnInnerDiameterLine(Scanner acqScanner) {
 /* Looking for pattern
 =====================================================================
 Column(s)
 =====================================================================
 Column Description : HP-PONA
 */
 String line;
 while ((line = acqScanner.nextLine()) != null) {
 if (line.contains("======")) {
 line = acqScanner.nextLine();
 if (line.contains("Column(s)")) {
 acqScanner.nextLine();// === line
 acqScanner.nextLine();// empty line
 break;
 } 
 }
 }
 acqScanner.nextLine(); //Column Description - Not in model
 acqScanner.nextLine(); //Inventory # - Not in model
 }
}

Question 3

I disagree, Sonar is correct. If you read a line with the intention to skip it AND you know what the line is supposed to contain, then you need to verify that the skipped line actually contains it. Othwerise errors in the input will just be ignored. Maybe write a helper method for asserting that a string contains a substring and throwing an exception if it does not.

Question 4

Avoid code duplication

First, to remove duplicate code present in your methods parseColumnInnerDiameter and parseColumnLength, try to find some similarities between this two classes ColumnInnerDiameter and ChromatographyColumnLength. They have both Unit and Value attributes. So you can create a POJO with these two attributes.

ValueInit.class

import lombok.Getter;
import lombok.RequiredArgsConstructor;
import lombok.Setter;
@RequiredArgsConstructor
@Getter
@Setter
public class ValueUnit {
 private final Double value;
 private final String unit;
}

Now you can make a new class that extracts Value and Unit from a document line.

ValueUnitExtractor.class

public class ValueUnitExtractor {
 
 private static final String VALUE_AND_UNIT_REGEX = "([\\d|\\.]*) (\\S*)";
 
 // Pattern declaration have to be static final because compiling regex may cost a lot
 private static final Pattern VALUE_AND_UNIT_PATTERN = Pattern.compile(VALUE_AND_UNIT_REGEX);
 
 /**
 * Allow extracting the Value and Unit of a line
 */
 public ValueUnit extract(String line) {
 Matcher m = VALUE_AND_UNIT_PATTERN.matcher(line);
 
 if (!m.find()) {
 throw new NumberFormatException();
 }
 return new ValueUnit(Double.parseDouble(m.group(1)), m.group(2));
 }
}

To apply this extractor on your two similar objects, you can use a strategy pattern. We will use an enum for that point.

LineType.class

public enum LineType {
 
 COLUMN_INNER_DIAMETER() {
 @Override
 protected ColumnInnerDiameter create(ValueUnit valueUnit) {
 ColumnInnerDiameter columnLength = new ColumnInnerDiameter();
 columnLength.setValue(valueUnit.getValue());
 columnLength.setUnit(valueUnit.getUnit());
 return columnLength;
 }
 },
 CHROMATOGRAPHY_COLUMN_LENGTH() {
 @Override
 protected ChromatographyColumnLength create(ValueUnit valueUnit) {
 ChromatographyColumnLength columnLength = new ChromatographyColumnLength();
 columnLength.setValue(valueUnit.getValue());
 columnLength.setUnit(valueUnit.getUnit());
 return columnLength;
 }
 };
 
 // The common ValueUnit extractor 
 private final ValueUnitExtractor extractor = new ValueUnitExtractor();
 
 // The only method expose by your enum to parse a document line and create ChromatographyColumnLength or ColumnInnerDiameter object
 // I didn't find a way to return generic type here...
 public Object parse(String line) {
 ValueUnit valueUnit = extractor.extract(line);
 return create(valueUnit);
 }
 
 // This method allow you to create specific object from the ValueUnit POJO 
 protected abstract Object create(ValueUnit valueUnit);

After that, you can create a ChromatographyColumnDocument factory to help you build ChromatographyColumnDocument instance and hide the parsing document line : ChromatographyColumnDocumentFactory.class

public class ChromatographyColumnDocumentFactory {
 
 public static ChromatographyColumnDocument create(String firstLine, String secondLing) {
 ChromatographyColumnDocument columnDocument = new ChromatographyColumnDocument();
 columnDocument.setColumnInnerDiameter(parseColumnInnerDiameter(extractRightPart(firstLine)));
 columnDocument.setChromatographyColumnLength(parseColumnLength(extractRightPart(secondLing)));
 
 return columnDocument;
 }
 
 private static String extractRightPart(String fullLine) {
 // For null-safety, use StringUtils.trim on StringUtils.substringAfterLast() result
 return StringUtils.trim(StringUtils.substringAfterLast(fullLine, ":"));
 }
 
 // Here you use a specific enum value to build ColumnInnerDiameter
 private static ColumnInnerDiameter parseColumnInnerDiameter(String line) {
 return (ColumnInnerDiameter) LineType.COLUMN_INNER_DIAMETER.parse(line);
 }
 
 // Here you use a specific enum value to build ChromatographyColumnLength 
 private static ChromatographyColumnLength parseColumnLength(String line) {
 return (ChromatographyColumnLength) LineType.CHROMATOGRAPHY_COLUMN_LENGTH.parse(line);
 }
}

Sonar alert

If you looking for Sonar happiness, you could verify all lines after pattern matching and throwing custom error. DocumentReader.class

public class DocumentReader {
 
 public static void readAndFindPattern(BufferedReader br) throws IOException {
 boolean isFound = false;
 String line;
 while ((line = br.readLine()) != null) {
 if (line.contains("=====")) {
 line = br.readLine();
 if (line.contains("Column(s)")) {
 isFound = true;
 break;
 }
 }
 }
 
 if (!isFound) {
 throw new InvalidFileContentException("Pattern not found");
 }
 
 if (!br.readLine().contains("=====")) {
 throw new InvalidFileContentException("Line must contains '====='");
 }
 
 if (!br.readLine().isBlank()) {
 throw new InvalidFileContentException("Line after '=====' must be blank");
 }
 
 if (!br.readLine().contains("Column Description")) {
 throw new InvalidFileContentException("Line must contains 'Column Description'");
 }
 
 if (!br.readLine().contains("Inventory #")) {
 throw new InvalidFileContentException("Line must contains 'Inventory #'");
 }
 }
}

InvalidFileContentException.class

public class InvalidFileContentException extends RuntimeException { 
 public InvalidFileContentException(String message) {
 super(message);
 }
}

Finally, your ColumnInformationMapper uses DocumentReader to find the pattern and ChromatographyColumnDocumentFactory to create ChromatographyColumnDocument
ColumnInformationMapper.class

@UtilityClass
public class ColumnInformationMapper {
 
 public static ChromatographyColumnDocument readColumnDocumentFromFile(String folderPath) throws Exception {
 
 File file = new File(folderPath, "acq.txt");
 try (BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_16))) {
 DocumentReader.readAndFindPattern(br);
 return ChromatographyColumnDocumentFactory.create(br.readLine(), br.readLine());
 }
 } 
}

Question 5

I generally don't think lombok is worth pulling in the dependency when Java has records.

Question 6

Hi Guillaume, welcome to the site. In my opinion, this design is over-engineered. It adds 6 new classes to address 3 lines of duplicate code.

Eric Stein Eric Stein 6,70614 silver badges19 bronze badges · Accepted Answer · 2022-09-28 22:38:40Z

Your questions:

SonarQube is wrong and you are right. That is the correct way to skip a line. You can use @SuppressWarnings to get rid of the warnings, or you can assign the skipped line to some local variable, such as String skip = br.nextLine().
Since the two Column classes do not share an interface which defines those methods, I do not see a good way to remove the duplication.
See below: