Get histogram of bytes in any set of files in Java - take III

Question 1

(This post is the continuation of Get histogram of bytes in any set of files in Java - take II.)

This time my code looks like as follows:

com.github.coderodde.file.util.ByteHistogramApp.java:

package com.github.coderodde.file.util;
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.util.logging.Level;
import java.util.logging.Logger;
/**
 * This class implements a program for counting byte histograms in files.
 * 
 * @version 1.0.2 (Nov 17, 2024)
 * @since 1.0.0 (Nov 13, 2024)
 */
public final class ByteHistogramApp {
 
 private static final Logger LOGGER = 
 Logger.getLogger(ByteHistogram.class.getSimpleName());
 public static void main(String[] args) {
 final ByteHistogram byteHistogram = new ByteHistogram();
 
 if (args.length == 0) {
 try {
 loadByteHistogramFromStdin(byteHistogram);
 } catch (IOException ex) {
 LOGGER.log(
 Level.SEVERE, 
 "Could not load the histogram from " + 
 "standard input stream.");
 }
 } else {
 for (final String fileName : args) {
 try {
 loadByteHistogramFromFile(fileName, byteHistogram);
 } catch (IOException ex) {
 LOGGER.log(
 Level.SEVERE,
 "Could not load the histogram from file \"{0}\".", 
 fileName);
 }
 }
 }
 
 System.out.println(byteHistogram);
 }
 
 /**
 * Loads the byte histogram data only from standard input.
 * 
 * @param byteHistogram the byte histogram for holding the statistics.
 * 
 * @throws IOException if I/O fails.
 */
 private static void loadByteHistogramFromStdin(
 final ByteHistogram byteHistogram) throws IOException {
 
 loadImpl(new BufferedInputStream(System.in),
 byteHistogram);
 }
 
 /**
 * Loads the byte histogram data only from the file named {@code fileName}.
 * 
 * @param fileName the name of the file to load from.
 * @param byteHistogram the byte histogram for holding the statistics.
 * 
 * @throws FileNotFoundException if file named {@code fileName} is not 
 * found.
 * @throws IOException if I/O fails.
 */
 private static void loadByteHistogramFromFile(
 final String fileName,
 final ByteHistogram byteHistogram)
 throws FileNotFoundException, IOException {
 
 final InputStream is =
 new BufferedInputStream(
 new FileInputStream(fileName));
 
 loadImpl(is, byteHistogram);
 }
 
 /**
 * Performs the actual loading of the data to the input byte histogram.
 * 
 * @param is the input stream from which to load the data.
 * @param byteHistogram the target byte histogram.
 * 
 * @throws IOException if I/O fails.
 */
 private static void loadImpl(final InputStream is, 
 final ByteHistogram byteHistogram) 
 throws IOException {
 int ch;
 
 while ((ch = is.read()) != -1) {
 byteHistogram.insert(ch);
 }
 }
}

Critique request

Am I getting anywhere? Please tell me anything what comes to mind.

Question 2

 if (args.length == 0) {
 try {
 loadByteHistogramFromStdin(byteHistogram);

I fail to see why handling stdin is a special case compared to files. Technically all you need is to read from a stream, what that stream is does not matter. Simplified it could be:

if (args.length == 0) {
 loadIntoByteHistogram(System.in, byteHistogram);
} else {
 for (String argument : args) {
 loadIntoByteHistogram(new FileInputStream(argument), byteHistogram);
 }
}

With loadIntoByteHistogram closing the stream after it is done in this example.

On a sidenote, traditionally many UNIX-like tools allow to pass - as parameter to designate to read from stdin at that point:

yourtool file1 file2 - file3

Which might or might not be a functionality you want to support.

System.out.println(byteHistogram);

I'm not a fan of having UI logic inside toString(), as it is also often used for debugging purposes (for example by IDEs). Also the object itself (ByteHistogram) would contain logic for the UI this way, which might or might not be desirable. Having another method in your main CLI class which does the formatting seems like a good solution instead.

showHistogram(byteHistogram);

 private static void loadImpl(final InputStream is, 
 final ByteHistogram byteHistogram)

Every time you need to use the Impl postfix something is wrong. Either the method structuring is off, or the name could be better. loadFromInputStreamInto seems like a good candidate in this case.

loadFromStdInInto(ByteHistogram)
loadFromFileInto(Path, ByteHistogram)
loadFromInputStreamInto(InputStream, ByteHistogram)

 final InputStream is =
 new BufferedInputStream(
 new FileInputStream(fileName));
 
 loadImpl(is, byteHistogram);

That stream is never closed, which isn't really an issue in this application, but you should always consider closing the stream you're opening to avoid leaking anything. Either have loadImpl close the streams it is passed (which might or might not be desirable) or use a try-with-resources block:

 try (final InputStream is =
 new BufferedInputStream(
 new FileInputStream(fileName))) {
 loadImpl(is, byteHistogram);
 }

That will close the stream automatically when leaving the block.

That said, why do you need a BufferedInputStream here? You're never using the additional functionality down the line.

 int ch;
 
 while ((ch = is.read()) != -1) {
 byteHistogram.insert(ch);
 }

Reading a stream byte by byte can be slow. Consider reading from it in blocks instead, 16k, 32k or 64k depending on the usecase.

byte[] buffer = new byte[16 * 1024];
int read = 0;
while ((read = (stream.read(buffer)) >= 0) {
 byteHistogram.add(buffer, 0, read);
}

Also ch and is aren't really good names, especially as the ch seems to imply char but it's really an int.

 /**
 * Loads the byte histogram data only from standard input.
 * 
 * @param byteHistogram the byte histogram for holding the statistics.
 * 
 * @throws IOException if I/O fails.
 */

While it is a controversial topic, I'm not a fan of documentation that tells me nothing beyond what I could deduce from the function signature. It seems like wasted effort most of the time.

The next block is a little bit better:

 /**
 * Loads the byte histogram data only from the file named {@code fileName}.
 * 
 * @param fileName the name of the file to load from.
 * @param byteHistogram the byte histogram for holding the statistics.
 * 
 * @throws FileNotFoundException if file named {@code fileName} is not 
 * found.
 * @throws IOException if I/O fails.
 */

But raises a few points: First, the wording of "Loads the byte histogram data only from" implies that there are methods which do load from multiple sources or are not side-effect free. Second, "if I/O fails" is not that helpful again. A more explicit version would look like this:

/**
 * Loads the data from the file under the given {@code fileName} into the given {@link ByteHistogram}.
 *
 * @param fileName The name of the file from which to load the data into the given {@link ByteHistogram}.
 * @param byteHistogram The {@link ByteHistogram} into which to load the data from the given file.
 * @throws FileNotFoundException If the given {@code fileName} does not denote an existing file.
 * @throws IOException If reading from the given {@code fileName} failed.
 */

Considering that many IDEs only display parts of the Javadoc based on the current context, repeating yourself is not a bad idea.

 } catch (IOException ex) {
 LOGGER.log(
 Level.SEVERE, 
 "Could not load the histogram from " + 
 "standard input stream.");

You're swallowing the exception and its details here. That means that it will not be possible to tell what error/exception actually happened. Consider handing the exception itself to the logger so that it can be printed along with the error message.

 } catch (IOException ex) {
 LOGGER.log(
 Level.SEVERE, 
 "Could not load the histogram from " + 
 "standard input stream.",
 ex);

Question 3

About - on a command line in *nix: what should happen in case the tool is invoked with no arguments, read from stdin too?

Question 4

Most tools read from stdin in that case, yes.

Question 5

"why do you need a BufferedInputStream here? ... Reading a stream byte by byte can be slow." - I have to point out that you're wrong, reading bytes from BufferedInputStream is not slow, because int read() returns a byte from the internal 8 Kb buffer, which populated from the underlying stream as necessary. The size of the internal buffer can be tweaked while constructing BufferedInputStream. In this case, there's no advantage in providing an external buffer and then reading from it manually, only more boilerplate code.

Question 6

Also, if the command line parameters are tool a.txt - b.txt - should it process a.txt then stdin then b.txt and finally stdin again?

Question 7

@AlexanderIvanchenko The function itself is not guaranteed to get a BufferedInputStream, it only accepts an InputStream. If the functionality of the BufferedInputStream is "required" or assumed to be there, then it either should wrap the incoming stream or limit the parameter type accordingly. But I missed that interaction, yes.

Question 8

Closing resources

Again, you forgot to close InputStream reading from a file (refer to this answer, where I've demonstrated how to utilize try-with-resources for that purpose).

Also note that while a stream interacting with the file system should be closed as soon as you're done with it no matter what, a stream reading from the System.in is not mandatory to close. Because no other system processes can interact with System.in (i.e. there's no threat of resource leak), and when main() exits, JVM will take off System.in, as well as System.out and System.error.

And if you close a stream wrapped around the standard input, you'll not be able to use System.in anymore.

Code Structure

Focus on structuring your code by identifying responsibilities and introducing elements representing them.

This piece of code has many levels of nesting and bears a lot of cognitive load. You can make it cleaner by applying extract method refactoring technique on both branches.

if (args.length == 0) {
 try {
 loadByteHistogramFromStdin(byteHistogram);
 } catch (IOException ex) {
 LOGGER.log(
 Level.SEVERE, 
 "Could not load the histogram from " + 
 "standard input stream.");
 }
} else {
 for (final String fileName : args) {
 try {
 loadByteHistogramFromFile(fileName, byteHistogram);
 } catch (IOException ex) {
 LOGGER.log(
 Level.SEVERE,
 "Could not load the histogram from file \"{0}\".", 
 fileName);
 }
 }
}

And I don't see any logic that will allow your application to terminate while reading from the standard input.

It looks like you were too focused on the code reuse rather than addressing distinct use cases.

Question 9

Damn it! Forgot again to close the streams. :(

Question 10

@coderodde How do you expect the program to terminate while reading from the standard input? Looks like you have a bug

Question 11

you use System.out to print the byte histogram.

you already use logging library, so it should be used to write to standard output. the advantage of logging library is that output can be directed to a file or somewhere else via configuration
you did not share code of ByteHistogram. does it implement toString()? because the default implementation is not very helpful

Bobby Bobby 8,1662 gold badges35 silver badges43 bronze badges · Accepted Answer · 2024-11-18 21:48:34Z

 if (args.length == 0) {
 try {
 loadByteHistogramFromStdin(byteHistogram);

I fail to see why handling stdin is a special case compared to files. Technically all you need is to read from a stream, what that stream is does not matter. Simplified it could be:

if (args.length == 0) {
 loadIntoByteHistogram(System.in, byteHistogram);
} else {
 for (String argument : args) {
 loadIntoByteHistogram(new FileInputStream(argument), byteHistogram);
 }
}

With loadIntoByteHistogram closing the stream after it is done in this example.

On a sidenote, traditionally many UNIX-like tools allow to pass - as parameter to designate to read from stdin at that point:

yourtool file1 file2 - file3

Which might or might not be a functionality you want to support.

System.out.println(byteHistogram);

I'm not a fan of having UI logic inside toString(), as it is also often used for debugging purposes (for example by IDEs). Also the object itself (ByteHistogram) would contain logic for the UI this way, which might or might not be desirable. Having another method in your main CLI class which does the formatting seems like a good solution instead.

showHistogram(byteHistogram);

 private static void loadImpl(final InputStream is, 
 final ByteHistogram byteHistogram)

Every time you need to use the Impl postfix something is wrong. Either the method structuring is off, or the name could be better. loadFromInputStreamInto seems like a good candidate in this case.

loadFromStdInInto(ByteHistogram)
loadFromFileInto(Path, ByteHistogram)
loadFromInputStreamInto(InputStream, ByteHistogram)

 final InputStream is =
 new BufferedInputStream(
 new FileInputStream(fileName));
 
 loadImpl(is, byteHistogram);

That stream is never closed, which isn't really an issue in this application, but you should always consider closing the stream you're opening to avoid leaking anything. Either have loadImpl close the streams it is passed (which might or might not be desirable) or use a try-with-resources block:

 try (final InputStream is =
 new BufferedInputStream(
 new FileInputStream(fileName))) {
 loadImpl(is, byteHistogram);
 }

That will close the stream automatically when leaving the block.

That said, why do you need a BufferedInputStream here? You're never using the additional functionality down the line.

 int ch;
 
 while ((ch = is.read()) != -1) {
 byteHistogram.insert(ch);
 }

Reading a stream byte by byte can be slow. Consider reading from it in blocks instead, 16k, 32k or 64k depending on the usecase.

byte[] buffer = new byte[16 * 1024];
int read = 0;
while ((read = (stream.read(buffer)) >= 0) {
 byteHistogram.add(buffer, 0, read);
}

Also ch and is aren't really good names, especially as the ch seems to imply char but it's really an int.

 /**
 * Loads the byte histogram data only from standard input.
 * 
 * @param byteHistogram the byte histogram for holding the statistics.
 * 
 * @throws IOException if I/O fails.
 */

While it is a controversial topic, I'm not a fan of documentation that tells me nothing beyond what I could deduce from the function signature. It seems like wasted effort most of the time.

The next block is a little bit better:

 /**
 * Loads the byte histogram data only from the file named {@code fileName}.
 * 
 * @param fileName the name of the file to load from.
 * @param byteHistogram the byte histogram for holding the statistics.
 * 
 * @throws FileNotFoundException if file named {@code fileName} is not 
 * found.
 * @throws IOException if I/O fails.
 */

But raises a few points: First, the wording of "Loads the byte histogram data only from" implies that there are methods which do load from multiple sources or are not side-effect free. Second, "if I/O fails" is not that helpful again. A more explicit version would look like this:

/**
 * Loads the data from the file under the given {@code fileName} into the given {@link ByteHistogram}.
 *
 * @param fileName The name of the file from which to load the data into the given {@link ByteHistogram}.
 * @param byteHistogram The {@link ByteHistogram} into which to load the data from the given file.
 * @throws FileNotFoundException If the given {@code fileName} does not denote an existing file.
 * @throws IOException If reading from the given {@code fileName} failed.
 */

Considering that many IDEs only display parts of the Javadoc based on the current context, repeating yourself is not a bad idea.

 } catch (IOException ex) {
 LOGGER.log(
 Level.SEVERE, 
 "Could not load the histogram from " + 
 "standard input stream.");

You're swallowing the exception and its details here. That means that it will not be possible to tell what error/exception actually happened. Consider handing the exception itself to the logger so that it can be printed along with the error message.

 } catch (IOException ex) {
 LOGGER.log(
 Level.SEVERE, 
 "Could not load the histogram from " + 
 "standard input stream.",
 ex);

About - on a command line in *nix: what should happen in case the tool is invoked with no arguments, read from stdin too?
"why do you need a BufferedInputStream here? ... Reading a stream byte by byte can be slow." - I have to point out that you're wrong, reading bytes from BufferedInputStream is not slow, because int read() returns a byte from the internal 8 Kb buffer, which populated from the underlying stream as necessary. The size of the internal buffer can be tweaked while constructing BufferedInputStream. In this case, there's no advantage in providing an external buffer and then reading from it manually, only more boilerplate code.
Also, if the command line parameters are tool a.txt - b.txt - should it process a.txt then stdin then b.txt and finally stdin again?
@AlexanderIvanchenko The function itself is not guaranteed to get a BufferedInputStream, it only accepts an InputStream. If the functionality of the BufferedInputStream is "required" or assumed to be there, then it either should wrap the incoming stream or limit the parameter type accordingly. But I missed that interaction, yes.

Stack Exchange Network

Get histogram of bytes in any set of files in Java - take III

Critique request

3 Answers 3

Closing resources

Code Structure

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Get histogram of bytes in any set of files in Java - take III

Critique request

3 Answers 3

Closing resources

Code Structure

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions