Java code showing page labels from PDF files

Question 1

I'm writing simple application changing page labels in pdf files. First step is to interpret and show page labels. Here's the code doing this.

It uses PDFBox library. I don't think there is a need to know this library. I think that my piece of code should be easy to understand. However, I think that there is some code duplication (even thought IntelliJ doesn't find one) and some if/else could be rewritten to make the code easier to read. I didn't use any design pattern but I feel that some could be of use.

I am interested in any type of comments but especially suggestions on how to use better design and employ some design patterns if applicable. However, all kinds of feedback like style issues, missed corner cases, performance, logical organisation, etc. are welcomed.

Please find me code below:

Class with main logic:

@RequiredArgsConstructor
public class PageLabelIterator {
 private static final char[] ASCII_CHARS =
 {'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's',
 't', 'u', 'v', 'w', 'x', 'y', 'z'};
 private final PDDocument document;
 @SneakyThrows
 public List<Label> print() {
 final PDPageLabels pageLabels = document.getDocumentCatalog().getPageLabels();
 final List<Label> labels = new ArrayList<>();
 for (int i = 0; i < document.getNumberOfPages(); i++) {
 final PDPageLabelRange range = pageLabels.getPageLabelRange(i);
 if (range == null) {
 continue;
 }
 final COSDictionary rangeDict = range.getCOSObject();
 // get page label
 String pageLabel = ((COSString) rangeDict.getDictionaryObject(COSName.P)).getString();
 // after label there could be suffix
 final String suffix = getSuffix(rangeDict);
 // add it or empty string instead
 pageLabel += suffix;
 final Label label = new Label(i + 1, pageLabel, LabelStyle.fromPdfBoxString(range.getStyle()));
 labels.add(label);
 }
 return labels;
 }
 private String getSuffix(final COSDictionary rangeDict) {
 final COSBase numberingStyle = rangeDict.getDictionaryObject(COSName.S);
 // if numbering style is not present then return empty string
 if (!(numberingStyle instanceof COSName)) {
 return "";
 }
 // Letters
 if (numberingStyle.equals(COSName.A) || numberingStyle.equals(COSName.getPDFName("a"))) {
 final COSBase number = rangeDict.getDictionaryObject(COSName.ST);
 if (number instanceof COSInteger cosInteger) {
 if (numberingStyle.equals(COSName.A)) {
 return Character.toString(ASCII_CHARS[cosInteger.intValue() - 1]).toUpperCase();
 } else {
 return Character.toString(ASCII_CHARS[cosInteger.intValue() - 1]).toLowerCase();
 }
 // null - then page = 1
 } else {
 if (numberingStyle.equals(COSName.A)) {
 return Character.toString(ASCII_CHARS[0]).toUpperCase();
 } else {
 return Character.toString(ASCII_CHARS[0]).toLowerCase();
 }
 }
 // Roman numerals
 } else if (numberingStyle.equals(COSName.R) || numberingStyle.equals(COSName.getPDFName("r"))) {
 final COSBase number = rangeDict.getDictionaryObject(COSName.ST);
 if (number instanceof COSInteger cosInteger) {
 if (numberingStyle.equals(COSName.R)) {
 return toRoman(cosInteger.intValue() - 1).toUpperCase();
 } else {
 return toRoman(cosInteger.intValue() - 1).toLowerCase();
 }
 // null - then page = 1
 } else {
 if (numberingStyle.equals(COSName.R)) {
 return toRoman(1).toUpperCase();
 } else {
 return toRoman(1).toLowerCase();
 }
 }
 // Decimal numerals
 } else if (numberingStyle.equals(COSName.D)) {
 final COSBase number = rangeDict.getDictionaryObject(COSName.ST);
 if (number instanceof COSInteger cosInteger) {
 return String.valueOf(cosInteger.intValue() - 1);
 // null - then page = 1
 } else {
 return String.valueOf(1);
 }
 // if unkown page style, then return empty string
 } else {
 return "";
 }
 }
}

Enum with some logic:

@RequiredArgsConstructor
public enum LabelStyle {
 DECIMAL("Decimal"),
 ROMAN_LOWER("Roman lower"),
 ROMAN_UPPER("Roman upper"),
 LETTER_UPPPER("Letters upper"),
 LETTER_LOWER("Letters lower"),
 NONE("None");
 private final String style;
 static LabelStyle fromPdfBoxString(final String style) {
 if (style == null) {
 return NONE;
 }
 return switch (style) {
 case "D" -> DECIMAL;
 case "r" -> ROMAN_LOWER;
 case "R" -> ROMAN_UPPER;
 case "a" -> LETTER_LOWER;
 case "A" -> LETTER_UPPPER;
 default -> NONE;
 };
 }
}

Simple record:

public record Label(
 int startPage,
 String label,
 LabelStyle style) {
}

Two external libraries are needed: PDFBox and Lombok. Please find below Maven dependency for the first library:

<dependency>
 <groupId>org.apache.pdfbox</groupId>
 <artifactId>pdfbox</artifactId>
 <version>2.0.29</version>
</dependency>

Question 2

@RequiredArgsConstructor
 ...
 @SneakyThrows

There is some review context missing here. The import statements are part of the source, they really matter to the Gentle Reader. You should explicitly mention the lombok dependency. (FWIW, I routinely re-throw wrapped RuntimeException, just to clean up my Public API's signature.)

 public List<Label> print() {

Consider renaming this. The "print" verb suggests we evaluate this for side effects on System.out.

 for (int i = 0; i < document.getNumberOfPages(); i++) {
 final PDPageLabelRange range = pageLabels.getPageLabelRange(i);

I'm not familiar with what Apache PDFBox exports, but I'll just throw this out there. Perhaps there is some way to iterate through document pages directly? One concern is that maybe .getPageLabelRange(i) does not run in O(1) constant time. If it is O(N) linear in i, then we have a quadratic loop here.

As part of the review context you should reveal whether you're using v3 of that library. (EDIT Oh, I see, "Please find below Maven dep" 2.0.29, thank you.)

Given the poor quality of the library's architecture documentation, it would be very helpful if a comment explains to the Gentle Reader what the secret COS abbreviation stands for. Maybe Compressed Object Stream? Or Content Object Stream? As it stands, I don't even know how to mentally pronounce those identifiers.

 // get page label
 String pageLabel = ...

Please elide that // comment, as it's not shedding any light on the situation. The identifier is brilliant and tells us all that we need. OTOH I thank you for the subsequent pair of comments, which helpfully explain how the suffix is optional.

I confess I don't understand what's going on when we cast the returned String to a COSString and then convert it back to a String. Now that would be worth a // comment. As it stands, I imagine it works fine and I don't need to know, as clearly we get a String in the end. If we're handling a corner case here, a unit test might illuminate the details.

 final Label label = new Label(i + 1, ...

I don't understand the + 1. There's an opportunity for the method's /** javadoc */ to spell out that one-origin is for some reason an important part of this method's contract. That is, caller should understand that the Public API we're defining chooses to differ from PDFBox's Public API in this regard. (Also, "missing review context", from the OP it's unclear which library defined the meaning of Label. My tacit assumption is that the Apache library supplies it. EDIT Oh, wait, here we go, the code appears below the "Simple record" remark, got it.)


 private String getSuffix(
 ...
 if (numberingStyle.equals(COSName.A) || numberingStyle.equals(COSName.getPDFName("a"))) {

Yikes! That's frightening. I'm sure it's necessary. Sounds like an issue in the underlying library, which deserves a bug report or PR.

If this is an upper / lower thing, consider working around the library's shortcoming by defining your own LOWER_A, in the interest of clarity. Or define a Set which includes both cases.

 if (numberingStyle.equals(COSName.A)) {
 return Character.toString(ASCII_CHARS[cosInteger.intValue() - 1]).toUpperCase();
 } else {
 return Character.toString(ASCII_CHARS[cosInteger.intValue() - 1]).toLowerCase();
 }
 // null - then page = 1
 } else {
 if (numberingStyle.equals(COSName.A)) {
 return Character.toString(ASCII_CHARS[0]).toUpperCase();
 } else {
 return Character.toString(ASCII_CHARS[0]).toLowerCase();
 }
 }

This is not a lot of repetition, but there's room to improve. I'm accustomed to seeing a .toUpperCase or .toLowerCase reference being passed into a helper, which will indirect through that. Also, it seems like the first thing you want to do is settle on an index of 0 vs cosInteger.intValue() - 1. With that in hand you can then worry about upper vs lower.

Oh, goodness, there's just more of it for Roman numerals. Yes, definitely break out the occasional helper method.

Identifiers of {A, R, D} and comments of {alpha, roman, decimal} were helpful -- thank you. I'm still unclear whether ST denotes style. A hint would be welcome.

You several times commented "null - then page = 1", which I found unclear. Upon reading this:

 // null - then page = 1
 } else {
 if (numberingStyle.equals(COSName.R)) {
 return toRoman(1).toUpperCase();

I realized that you habitually put the comment a little farther from the target code than I would have expected. That is, you associate the comment with the if clause but intend it for the else clause.

Adding a short reference document that's read by a few unit tests wouldn't hurt.

This codebase appears to achieve its design goals.

I would be willing to delegate or accept maintenance tasks on it.

Question 3

What's a magnificent review! Thank you very much. With all your comments my code would be better as well as further code review requests. Thanks one more time!

Question 4

Some asides: "COS" = Carousel Object System, see stackoverflow.com/questions/49207768/what-does-cos-stand-for ; "the library's shortcoming" essentially reflect peculiarities of the PDF file format in the library.

J_H J_H 41.4k3 gold badges38 silver badges157 bronze badges · Accepted Answer · 2023-07-16 23:43:56Z

@RequiredArgsConstructor
 ...
 @SneakyThrows

There is some review context missing here. The import statements are part of the source, they really matter to the Gentle Reader. You should explicitly mention the lombok dependency. (FWIW, I routinely re-throw wrapped RuntimeException, just to clean up my Public API's signature.)

 public List<Label> print() {

Consider renaming this. The "print" verb suggests we evaluate this for side effects on System.out.

 for (int i = 0; i < document.getNumberOfPages(); i++) {
 final PDPageLabelRange range = pageLabels.getPageLabelRange(i);

I'm not familiar with what Apache PDFBox exports, but I'll just throw this out there. Perhaps there is some way to iterate through document pages directly? One concern is that maybe .getPageLabelRange(i) does not run in O(1) constant time. If it is O(N) linear in i, then we have a quadratic loop here.

As part of the review context you should reveal whether you're using v3 of that library. (EDIT Oh, I see, "Please find below Maven dep" 2.0.29, thank you.)

Given the poor quality of the library's architecture documentation, it would be very helpful if a comment explains to the Gentle Reader what the secret COS abbreviation stands for. Maybe Compressed Object Stream? Or Content Object Stream? As it stands, I don't even know how to mentally pronounce those identifiers.

 // get page label
 String pageLabel = ...

Please elide that // comment, as it's not shedding any light on the situation. The identifier is brilliant and tells us all that we need. OTOH I thank you for the subsequent pair of comments, which helpfully explain how the suffix is optional.

I confess I don't understand what's going on when we cast the returned String to a COSString and then convert it back to a String. Now that would be worth a // comment. As it stands, I imagine it works fine and I don't need to know, as clearly we get a String in the end. If we're handling a corner case here, a unit test might illuminate the details.

 final Label label = new Label(i + 1, ...

I don't understand the + 1. There's an opportunity for the method's /** javadoc */ to spell out that one-origin is for some reason an important part of this method's contract. That is, caller should understand that the Public API we're defining chooses to differ from PDFBox's Public API in this regard. (Also, "missing review context", from the OP it's unclear which library defined the meaning of Label. My tacit assumption is that the Apache library supplies it. EDIT Oh, wait, here we go, the code appears below the "Simple record" remark, got it.)


 private String getSuffix(
 ...
 if (numberingStyle.equals(COSName.A) || numberingStyle.equals(COSName.getPDFName("a"))) {

Yikes! That's frightening. I'm sure it's necessary. Sounds like an issue in the underlying library, which deserves a bug report or PR.

If this is an upper / lower thing, consider working around the library's shortcoming by defining your own LOWER_A, in the interest of clarity. Or define a Set which includes both cases.

 if (numberingStyle.equals(COSName.A)) {
 return Character.toString(ASCII_CHARS[cosInteger.intValue() - 1]).toUpperCase();
 } else {
 return Character.toString(ASCII_CHARS[cosInteger.intValue() - 1]).toLowerCase();
 }
 // null - then page = 1
 } else {
 if (numberingStyle.equals(COSName.A)) {
 return Character.toString(ASCII_CHARS[0]).toUpperCase();
 } else {
 return Character.toString(ASCII_CHARS[0]).toLowerCase();
 }
 }

This is not a lot of repetition, but there's room to improve. I'm accustomed to seeing a .toUpperCase or .toLowerCase reference being passed into a helper, which will indirect through that. Also, it seems like the first thing you want to do is settle on an index of 0 vs cosInteger.intValue() - 1. With that in hand you can then worry about upper vs lower.

Oh, goodness, there's just more of it for Roman numerals. Yes, definitely break out the occasional helper method.

Identifiers of {A, R, D} and comments of {alpha, roman, decimal} were helpful -- thank you. I'm still unclear whether ST denotes style. A hint would be welcome.

You several times commented "null - then page = 1", which I found unclear. Upon reading this:

 // null - then page = 1
 } else {
 if (numberingStyle.equals(COSName.R)) {
 return toRoman(1).toUpperCase();

I realized that you habitually put the comment a little farther from the target code than I would have expected. That is, you associate the comment with the if clause but intend it for the else clause.

Adding a short reference document that's read by a few unit tests wouldn't hurt.

This codebase appears to achieve its design goals.

I would be willing to delegate or accept maintenance tasks on it.

What's a magnificent review! Thank you very much. With all your comments my code would be better as well as further code review requests. Thanks one more time!
Some asides: "COS" = Carousel Object System, see stackoverflow.com/questions/49207768/what-does-cos-stand-for ; "the library's shortcoming" essentially reflect peculiarities of the PDF file format in the library.

Stack Exchange Network

Java code showing page labels from PDF files

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Java code showing page labels from PDF files

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions